A SELF-ORGANIZING SCHEDULING AND RESOURCE MANAGEMENT
SCHEME IN A NETWORK COMPUTING ENVIRONMENT
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
First of all, I would like to thank my advisor, Dr. Fred J. Taylor, for his mentoring,
patience, and friendship. The support and resources he provided are unsurpassed, and
my appreciation to him is beyond words.
I would also like to thank Dr. Jose C. Principe, Dr. Herman Lam, Dr. Randy C.Y.
Chow, and Dr. Jih-Kwon Peir for serving on my committee.
Next, I would like to express my respect and thanks to the members of the HSDAL.
They were all great to work with and full of energy and good ideas. Special thanks go to
Dr. Ahmad R. Ansari for the pioneering work he did for the Osculant project. I would also
like to thank Dr. Uwe and Anke Meyer-Blise for the support and advice they gave me.
I would like to thank my parents, whose love, sacrifice, encouragement, and ultimate
support made my degree possible. Finally, I have to thank my lovely wife, Yi-miao. She
made my Ph.D. years both exciting and enjoyable.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS

ABSTRACT

1 INTRODUCTION
  1.1 Motivation
  1.2 Dissertation Outline and Summary

2 BACKGROUND
  2.1 Target Environment
  2.2 Network Traffic Studies
  2.3 Load Balancing Algorithms
  2.4 Task Scheduling Algorithms
  2.5 Resource Management Schemes

3 OVERVIEW OF THE OSCULANT SCHEDULER
  3.1 Basis of The Osculant Scheduling Scheme
  3.2 The Osculant Simulator
  3.3 Initial Performance Studies on the Osculant Scheduling Scheme

4 ANALYTICAL NETWORK PERFORMANCE MODELS
  4.1 Introduction
  4.2 Basic Concepts of Switch-Based Networks
  4.3 Performance Model of Multiple Stage Interconnection Networks
  4.4 The MIN Simulator
    4.4.1 Simulator Design Principles
    4.4.2 Limitations of the MIN Simulator
    4.4.3 Simulation Results
  4.5 Results and Analysis
    4.5.1 Input Load vs. Throughput and Delay
    4.5.2 Buffer Size vs. Normalized Throughput and Delay
    4.5.3 Network Size vs. Normalized Throughput and Delay
    4.5.4 Switch Element Types vs. Normalized Throughput and Delay
    4.5.5 Number of Iterations Needed to Compute Analytic Results
    4.5.6 Discrepancy Error Between the Analytical Model and Real Networks
  4.6 Non-Uniform Communication Latency Interconnection Networks
    4.6.1 Comparisons Between Uniform and Non-Uniform Communication Latency Networks
    4.6.2 Key Issues in Non-Uniform Communication Latency Interconnection Networks
  4.7 Conclusions

5 OSCULANT SCHEDULING MECHANISMS
  5.1 Jobpost Distribution Protocol
    5.1.1 Multi-layer Jobpost Protocol
    5.1.2 Other Jobpost Distribution Techniques
    5.1.3 Optimal Jobpost Distribution
  5.2 Bidding Strategies
    5.2.1 Performance-based Bidding Method
    5.2.2 Energy-based Bidding Method
    5.2.3 Dynamic Jobpost Model
    5.2.4 Resource Contractor Bidding Model
  5.3 Comparisons Among The Bidding Strategies
    5.3.1 System Throughput Rate
    5.3.2 Average CPU Time Consumption
    5.3.3 Average Job Resource Transmission Time
    5.3.4 Average Job Energy Consumption
    5.3.5 Average Jobpost Coverage and Jobpost/Bidding Delay
    5.3.6 Locality Efficiency
  5.4 Resource Management Schemes
  5.5 Conclusions

6 DEVELOPMENT OF OSCULANT SHELL
  6.1 Structure and Implementation of Osculant Shell
  6.2 Osculant Job Profile Generator
    6.2.1 Design Principle and Algorithms
    6.2.2 Osculant Job Profile Retrospective
  6.3 File Transfer Unit
  6.4 Steward Process
  6.5 Other Modules
  6.6 Future Developments

7 CONCLUSIONS
  7.1 Research Contributions
  7.2 Limitations
  7.3 Future Directions

APPENDICES
  A SIMULATION CONFIGURATION
  B EXAMPLE OF OSCULANT JOB PROFILE GENERATION
  C EXAMPLE OF OSCULANT SIMULATOR USER INTERFACE

REFERENCES

BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
A SELF-ORGANIZING SCHEDULING AND RESOURCE MANAGEMENT
SCHEME IN A NETWORK COMPUTING ENVIRONMENT
Chairman: Dr. Fred J. Taylor
Major Department: Electrical and Computer Engineering
In this dissertation, a new scheduling scheme called Osculant is studied. The
Osculant scheduler is bottom-up, self-organized, and designed for distributed
heterogeneous computing systems. In the first part of the dissertation, performance
evaluations of uniform communication latency (UCL) and non-uniform communication
latency (NUCL) networks are performed. By studying the analytical model and
simulation results of UCL networks, we found that network performance can be
predictable if the job arrival rate is known in advance. However, the high cost growth
rate of UCL networks prohibits them from being applied in general distributed computing
systems. Conversely, NUCL networks are more scalable and economical, but their
communication performance is more difficult to predict. Studies show that locality
properties are the key to improving the performance of NUCL systems. In Osculant, we
develop new techniques for exploiting the locality embedded in applications and systems.
Several new dynamic bidding strategies were introduced and investigated. Compared
to the top-down scheduling scheme, the performance-bidding and energy-based bidding
methods improve the system throughput rate and average job energy consumption rate.
Multiple-bid methods are developed to further improve the performance of single-bid
strategies (e.g., the performance- and energy-based bidding methods). Dynamic Jobpost
Bidding Model and Resource Contractor Bidding Model are the two examples of this
category. Experimental results show promising performance gains over the single-bid
methods. It is also found that the system ethos can be altered to suit user demands and
environmental changes by choosing different bidding methods. Moreover, with multiple-bid
methods, system status information is progressively gathered through multiple job
announcement and bidding processes. It is shown that scheduling overheads can be
effectively reduced by this scheme.
The Osculant Job Profile Generator (JPG) generates job profiles on-line so that other
nodes can estimate resource requirements and job completion cost. The Multi-layer
Jobpost Protocol (MJP) is developed to announce jobs to the computing system. The MJP
is found to be robust and self-regulated. Studies of other modules in the Osculant Shell
reveal even more potential of the Osculant scheduling scheme.
Heterogeneous network computing is now ubiquitous and is found in virtually every
major computing environment. Heterogeneous computing is known to be cost-effective,
robust, and scalable. With Moore's law [Schaller 1997] driving technological
improvements, the entire field of heterogeneous computing is undergoing a
metamorphosis. Nevertheless, within this dynamically changing landscape, there is a need
to manage these computational resources. Computer engineers have, for some time,
unknowingly been introducing market-driven philosophies into their design strategies.
These include organization theory (e.g., shared resources), recycling (e.g., cache), and
commodity investment (e.g., speculative computing), to name but a few.
Current studies have indicated that market-driven concepts can, in fact, be formally
integrated into resource management and scheduling paradigms for heterogeneous
computing systems. What is important to realize is that network computing suffers from
the same set of restrictions that govern supply-side economics in terms of access to
resources and services and hierarchical control. These system-level attributes and
resources, which are important to this study, relate to
* Bandwidth and latency: It is self-evident that network bandwidth restrictions can
seriously impair the timely execution of tasks. Another performance-limiting observation
is that network latency (i.e., the time required to receive requested information) is not
directly correlated with communication bandwidth when message lengths are small.
* Localized information and service providers: In order to avoid unnecessary network
traffic congestion, information storage should be distributed, or duplicated, in a
prudent manner. In this way, a balance is achieved between system performance and
These observations point out that task and resource scheduling will play an
important role in improving the performance of virtually any network-based computing
environment. A new bottom-up resource scheduling paradigm, called Osculant, was
proposed as an innovative facilitating technology.
1.2 Dissertation Outline and Summary
In this dissertation, we investigate a new scheme that will schedule tasks and
allocate resources in a bottom-up fashion within a distributed computing system. We
demonstrate, through analytical modeling of network traffic and a series of simulations,
how different scheduling policies accommodate changing system status, conditions, and
job input rate. We also built a prototype scheduler, with a job profile generator and job
announcing and bidding processes, to see how such a paradigm works in a real computing
system. The remainder of this dissertation is organized as follows.
Chapter 2 serves as the background and presents material that will act as a
foundation for the dissertation. We first explain the target environment of this research.
Then, related studies in qualifying network performance, load balancing, task scheduling,
and resource management are reviewed.
Chapter 3 presents the overview of the Osculant scheduling scheme. First, we
explain the basic ideas and logical operation of the Osculant scheduler. Next, the
Osculant simulator and job state transition are introduced. The third part reports the initial
performance studies based on the Osculant scheduling scheme. Here, we compare the
Osculant to several top-down scheduling schemes.
In Chapter 4, a network performance study based on the analytical model of multiple
stage interconnection networks is presented. We first study the analytical model and learn
how the network performance index can be extracted. Then, an original simulator is
presented that is used to verify the correctness of the analytical model. By comparing the
two, we learn how job input rate, network configuration, and network scaling affect the
overall system performance. In the next part, we extend our scope to non-uniform
communication latency (NUCL) networks that are used in most distributed systems. In
this study, we exploit the locality issues in NUCL networks.
Chapter 5 presents the mechanisms of the Osculant scheduling scheme. First,
jobpost distribution protocols are discussed. Next, various bidding strategies and
simulation results are analyzed. Later, the resource management scheme in Osculant is presented.
Chapter 6 demonstrates the development of the Osculant shell. The Osculant Job
Profile Generator is presented and its results are discussed. The remainder of this chapter
discusses the other parts of the prototype scheduler.
In Chapter 7, we conclude the dissertation with a summary of the research results
presented. Possible future research directions are also discussed.
2.1 Target Environment
In this dissertation, the Osculant scheduling scheme will be studied. The target of the
Osculant scheme is a general-purpose distributed heterogeneous computing environment.
To achieve optimal system performance for such a computing environment, one could
assume a central node that has knowledge of the status of all nodes.
Obviously, there are too many obstacles to realizing such a system, even without
considering scaling and reliability. In Osculant, we instead exploit a bottom-up
approach that lets the system self-organize, driven by user demands and system
conditions. Our goal is to improve overall system performance and to exploit the
scalability and reliability benefits inherent in the bottom-up approach.
A distributed computing system can be described by the following components:
working nodes, network configuration, jobs and applications, and management policies.
A working node is defined as a device that is capable of job processing or resource
management. They can be mainframe computers, workstations, file servers, personal
computers, and even portable computers. Because working nodes not only vary in their
performance and architecture but also change their states over time, the scheme being
developed must be able to collect or probe the system state and the nature of jobs. In
this dissertation, a working node is characterized by its computation power, memory size,
storage capacity, and type of architecture.
Network configuration is rather flexible for the target computing environment. The
goal of our studies is to develop a scheduling scheme that can be applied to systems
regardless of their underlying connection structure. That is, the communication
scheme under study is topology-independent. In this dissertation, each node in
the target system has a fixed degree of connectivity, and point-to-point connections are
used to link the nodes. Data transmissions are accomplished by packet switching, where
data are divided into many small packets before transmission. It is assumed that each
node has a finite amount of buffer space and will reject new packets when its buffer is
full. Paths between the source and destination nodes are always the shortest in terms of
the number of intermediate nodes passed. However, there may be more than one shortest
path between any two nodes in the system.
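The packet-switching assumptions above (fixed-size packets, finite buffers with rejection when full) can be sketched as follows; the names and sizes here are illustrative placeholders, not part of the dissertation's model.

```python
from collections import deque

PACKET_SIZE = 512  # bytes per packet (illustrative)

def packetize(message_bytes: int) -> int:
    """Number of fixed-size packets needed to carry a message."""
    return -(-message_bytes // PACKET_SIZE)  # ceiling division

class Node:
    """A working node with a finite packet buffer."""
    def __init__(self, buffer_slots: int):
        self.buffer = deque()
        self.buffer_slots = buffer_slots

    def accept(self, packet) -> bool:
        """Queue a packet, or reject it when the buffer is full."""
        if len(self.buffer) >= self.buffer_slots:
            return False  # rejected: the sender must retry or reroute
        self.buffer.append(packet)
        return True

node = Node(buffer_slots=4)
accepted = sum(node.accept(p) for p in range(6))  # 4 fit, 2 are rejected
```

A full simulation would add shortest-path routing between nodes; the rejection behavior above is the part that drives the buffer-size results discussed in Chapter 4.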
Allowed job types are reasonably flexible. In contrast to real-time systems, which
allow only well-known jobs, new jobs are allowed to enter the system because completion
time constraints are not the top priority goal in the target environment. Jobs can be
generated by any participating nodes and need to access remote file servers and databases
for appropriate services and resources. It is also assumed that jobs are allowed to be
executed at remote nodes for performance considerations. Finally, jobs in the target
environment are assumed to be run in batch mode, i.e. information required to execute the
job must be fixed prior to the scheduling phase. Typical jobs for the target environment
include engineering applications, medical image processing, and scientific research
programs. In the following sections, the issues and problems that we encountered will be
discussed, and related studies and research will be reviewed.
2.2 Network Traffic Studies
In many cases, the network performance determines the overall capabilities of a
distributed computing system. Overall system performance can be discussed in two
categories: The first category, network performance, is determined by two important
metrics: user response time and network utilization. The response time of a networked
system can be related to many factors. For example, network configuration and number of
active users will affect the response time. Studies [Blommers 1996] show that response
time is independent of the number of users as long as resource utilization remains
low, and grows linearly with the number of users when utilization is near 100%. Network
utilization and throughput can also be affected by many factors. For instance, data
segment size and network roundtrip time can be directly related to network throughput
performance. Because data segment size, which is governed by the size of packets and
maximum transmit unit (MTU) setting, is fixed under various standards, streaming
capable transport protocols like TCP can improve the throughput rate by continuously
sending data into the network before receiving acknowledgements.
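The throughput argument above can be made concrete with a back-of-the-envelope model: with fixed segment sizes, a stop-and-wait sender moves one segment per round trip, while a streaming (TCP-like) protocol keeps a window of unacknowledged segments in flight. The function names and numbers below are illustrative assumptions, not figures from the dissertation.

```python
def stop_and_wait_throughput(segment_bytes: float, rtt_s: float) -> float:
    """Bytes/s when each segment must be acknowledged before the next."""
    return segment_bytes / rtt_s

def streaming_throughput(segment_bytes: float, rtt_s: float, window: int,
                         link_bps: float) -> float:
    """Bytes/s with `window` unacknowledged segments in flight,
    capped by the raw link rate."""
    return min(window * segment_bytes / rtt_s, link_bps / 8)

# 1460-byte segments on a 50 ms round trip over a 10 Mb/s link:
sw = stop_and_wait_throughput(1460, 0.05)        # 29,200 B/s
st = streaming_throughput(1460, 0.05, 32, 10e6)  # 934,400 B/s, window-limited
```

The gap between `sw` and `st` shows why, with segment sizes fixed by standards, keeping more data in flight is the main lever for improving throughput.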
The second category is the cost to build and maintain the network. It is important to
correctly evaluate the requirements of current users and predict possible future growth.
The degree of network provisioning and the profitability of the network investment must
be kept in proper proportion, or network performance will suffer or maintenance costs
will grow. When a
system is saturated, one has to decide whether to upgrade the servers or to install more
servers. Generally speaking, component upgrades are cheaper but two servers are more
reliable than one. So, the solution depends on the system and user requirements. For
instance, we can bring the resource servers closer to their users by duplicating or moving
the resources. On the other hand, queuing theory tells us that one fast server is better than
two half-speed servers: both configurations give about the same throughput, but the fast
server has better response time, especially under lightly loaded conditions.
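The claim above can be checked with standard queueing formulas, assuming Poisson arrivals and exponential service: M/M/1 for the single fast server, M/M/2 (via the Erlang C formula) for the pair of half-speed servers. This is a textbook sketch, not analysis from the dissertation.

```python
def mm1_response(lam: float, mu: float) -> float:
    """Mean response time of a single M/M/1 server of rate mu."""
    assert lam < mu
    return 1.0 / (mu - lam)

def mm2_response(lam: float, mu: float) -> float:
    """Mean response time of two M/M/2 servers, each of rate mu."""
    rho = lam / (2 * mu)
    assert rho < 1
    erlang_c = 2 * rho**2 / (1 + rho)  # probability an arrival must wait
    return erlang_c / (2 * mu - lam) + 1.0 / mu

lam, mu = 1.0, 1.0               # half-speed servers of rate 1
fast = mm1_response(lam, 2 * mu)  # single server of rate 2: T = 1.00
pair = mm2_response(lam, mu)      # two half-speed servers: T = 4/3
```

The fast server wins at every load here, and the gap widens as the load lightens, since a lightly loaded M/M/2 still serves each job at only half speed.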
A node in a distributed computing system is either a client or a server. Where network
performance is concerned, the location of nodes is the most important factor. It
is clear that clients located on the same local-area network (LAN) as their servers will get
the best performance. So, one can bring the data closer to the client by either moving the
data onto a local server or moving the remote server to the client-side LAN. This solution
may not be appealing to users at other locations because it may increase the
network delays they see. The classic way to solve this problem is to conduct a network-
wide traffic matrix study to determine the location, number, and traffic load of all client
systems. Observations also show that client-server applications tend to be the least
sensitive to network delays because they already incur processing delays on both
sides, and this masks the network delay to some extent.
Network performance can be modeled and evaluated by mathematical modeling and
simulations. Numerous research studies have been conducted in this field. For example,
Patel [Patel 1981] examined the operation and performance of crossbars and delta
networks. Yoon et al. [Yoon 1990] followed Patel's study and extended it to switch-based
networks with multiple buffers. However, as the system and network configurations
become more complex, it is extremely difficult to model them. Analytical models
generally make assumptions on the job arrival rate and system service rate, and their
results are derived for a steady-state condition that is rarely observed in general network
systems. With these arguments in mind, we review the studies of Patel and Yoon et
al. closely in Chapter 4, since they can be used to study the network performance of our
target environment under certain conditions.
The importance of client/server allocation was explained earlier in this section.
In this dissertation, two manifestations of this idea will be explored. First, the distributed
scheduling scheme will try to find the best processing node based on job contents and
system status; and second, we will explore the technique to dynamically relocate servers
and resources in a distributed system to achieve better performance.
2.3 Load Balancing Algorithms
Load balancing and resource allocation are key topics in designing efficient
multiple-node systems. Load balancing algorithms try to improve system performance by
redistributing the workload submitted by users. Early studies focused on static placement
techniques that sought optimal or near-optimal solutions to processor or resource allocation.
Solution methods of this category contain graph theoretic approaches, mathematical
programming, and queuing theory [Goscinski 1991]. Recent research in this field evolved
to adaptive load balancing with process migration [Goscinski 1991]. A simple
manifestation of this method is to have a central allocation processor that periodically
receives load information from all processors and then makes process placement
decisions. The central processor, however, can become a single point of failure and a
bottleneck of the system.
Distributed process allocation algorithms offer a more complex alternative. A good
example of distributed load balancing that is conceptually similar to Osculant is the
microeconomic algorithm by Ferguson et al. [Ferguson 1988]. In this model, all
processors and tasks are independent economic agents who attempt to selfishly optimize
their satisfaction. All agents have to obey the rules set for the system to achieve their
goals. That is, according to the economic model, job agents have to pay for CPU and
network services within their budgets, and processor agents sell CPU time and
communication bandwidth at prices that depend on demand. The processor agents
advertise their prices on bulletin boards in neighboring nodes so that job agents
can shop around for the best-suited services.
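A toy sketch of this bulletin-board economy might look as follows; the pricing rule, class names, and budget figures are illustrative placeholders, not Ferguson et al.'s actual formulation.

```python
class ProcessorAgent:
    """Sells CPU time at a price that rises with local demand."""
    def __init__(self, name: str, base_price: float):
        self.name = name
        self.base_price = base_price
        self.queue_len = 0

    def price(self) -> float:
        # Assumed rule: price grows with the local queue length.
        return self.base_price * (1 + self.queue_len)

def shop(bulletin_board, budget: float):
    """Job agent: pick the cheapest advertised processor within budget."""
    affordable = [p for p in bulletin_board if p.price() <= budget]
    if not affordable:
        return None               # job waits until prices fall
    winner = min(affordable, key=lambda p: p.price())
    winner.queue_len += 1         # demand feedback raises its next price
    return winner

board = [ProcessorAgent("p0", 1.0), ProcessorAgent("p1", 2.0)]
first = shop(board, budget=5.0)   # cheapest processor wins the job
second = shop(board, budget=5.0)  # prices have shifted with demand
```

The feedback between `queue_len` and `price` is what lets such a market spread load without any central coordinator, which is the conceptual link to Osculant.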
The reported load balancing algorithms are generally designed either for closely
coupled systems or for homogeneous systems. With the advent of loosely coupled
heterogeneous computing systems, more considerations will be required to improve the
overall system performance.
2.4 Task Scheduling Algorithms
According to Goscinski [Goscinski 1991], a distributed scheduling policy can be
divided into two components, namely
* a local scheduler that decides how to distribute the computation time of a processor
to its resident processes, and
* a load distributing strategy that distributes the system load among the computing nodes.
The designs of local schedulers are generally simpler; for example, round-robin,
priority queue, and time-slice policies are frequently used in existing operating systems.
In designing a local scheduler, fairness and priority are usually the most important issues.
On the other hand, load distribution strategies concentrate on improving system
performance by sharing and balancing the system load. It is easy to observe that, in
distributed systems, load sharing, load balancing, resource allocation, and task scheduling
are closely related. In Osculant, a bottom-up approach is employed. That is, system
status has to be known, through either a collection or a probing process, before a task can be
assigned or a resource can be allocated. Many studies can be related to the Osculant
scheduling scheme. In the state-collection category, Shin and Chang [Shin 1988, 1995]
proposed a resource allocation policy that uses buddy sets to reduce the state-collection
overhead in a multicomputer configuration similar to our target environment. For the
state-probing category, task announcement and bidding processes were first introduced by
Smith [Smith 1980] in his contract net protocol that facilitates distributed control of
cooperative task execution. Focused addressing techniques were also introduced by Smith
in an effort to reduce network traffic. A good example that follows Smith's approach is
done by Ramamritham and Stankovic [Ramamritham 1989, 1994]. In their studies, a
distributed task scheduler with real-time constraints and resource requirement was
studied. An effort to maximize the guarantee ratio in meeting completion time constraints
is accomplished by focused addressing and bidding techniques. Nodes calculate
the surplus, i.e., their bids, for processing the announced task. In turn, the parent node uses
previously collected surplus information to limit the scope of the task announcing process,
or even awards tasks directly according to the surplus information to accelerate
scheduling. By combining focused addressing and bidding algorithms, Ramamritham and
Stankovic introduced the flexible algorithm that allows a two-level task
announcing/bidding structure to shorten the scheduling delay and improve the system
performance. Blake and Schwan [Blake 1991] and Ni et al. [Ni 1985] also performed
bottom-up scheduling studies using various bidding strategies. These task scheduling
schemes, however, are restricted to hard real-time environments where most of the task
and resource requirements are well known prior to the scheduling time. Overall system
performance such as utilization is generally neglected.
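The surplus-based award step described above might be sketched as follows; the surplus formula, threshold, and node names are illustrative placeholders, not Ramamritham and Stankovic's actual computation.

```python
def surplus(capacity: float, load: float, task_demand: float) -> float:
    """Spare capacity left if this node accepted the task (may be < 0)."""
    return capacity - load - task_demand

def award(bids: dict, threshold: float = 0.0):
    """Parent node: award the task to the highest positive surplus, or
    return a focus set of promising nodes for the next announcement."""
    best = max(bids, key=bids.get)
    if bids[best] > threshold:
        return ("award", best)
    return ("focus", [n for n, s in bids.items() if s > min(bids.values())])

# Three nodes with (capacity, load) bid on a task demanding 2.0 units:
bids = {n: surplus(cap, load, task_demand=2.0)
        for n, (cap, load) in {"a": (10, 3), "b": (8, 7), "c": (6, 1)}.items()}
decision = award(bids)  # node "a" has the largest spare capacity
```

Awarding directly when a bid clears the threshold is what shortens the scheduling delay; falling back to a focus set mirrors the two-level announcing/bidding structure of the flexible algorithm.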
Compared to the reported bottom-up task scheduling studies, the Osculant
scheduling scheme exploits the state-probing approach with self-regulating task
announcement processes and aggressive bidding strategies. Improving system
performance in a general-purpose distributed computing environment, where execution
time constraints are not critical and new tasks are allowed to enter the system, is the main
mission. In addition, the Osculant scheduling scheme is flexible, fault-tolerant, and
scalable because of the nature of the bottom-up approach.
2.5 Resource Management Schemes
According to Goscinski [Goscinski 1991], resources are reusable and relatively
stable hardware or software components of a computer system that are useful to system
users or to their processes. Because of their usefulness, they are requested, used, and
released by processes during their activities. Resources also can be grouped as low-level
resources and high-level resources. Low-level resources can be used directly by the
distributed system and users. However, high-level resources, which are built upon several
low-level resources, are more commonly utilized in the system. Hardware resources are
generally static, permanent, and limited in quantity. Management of physical resources is
simpler than that of logical resources because the latter may vary temporally and
Software resources can be pre-defined by the system or composed by the users. They
can also be active (so that they can change states), or static. Furthermore, advances in
network computing and in software engineering bring both new opportunities and
problems to the management of logical resources. For instance, software components will
be required to be managed properly in a heterogeneous system with differences in
suitability for the underlying platforms and in service quality. In other words, multiple
software versions and locations may coexist that can be used and/or retrieved to complete
the task with differing costs and functionality. Additionally, as purchasing and
maintenance of software becomes more dominant in the overall operation cost of
computing systems, licensing and revision control will become an important topic in the
future resource management.
The software industry has begun to adopt the "thin-client," "component software,"
and "just-in-time application" concepts. "Fat-clients" are difficult to manage, and they add
to network congestion. On the other hand, in the "thin-clients" environment, applications
are stored and managed centrally on servers. Appropriate executables and data files
(logical resources) are sent and stored in local caches when needed.
Castanet [Thomas 1997] and ALTiS [Goulde 1997] are two pioneer systems built on
the thin-client concepts to distribute Java applications. Corel Corp. demonstrated
parts of the WordPerfect suite (Corel Office JV) in a Java implementation so that
applications can be launched and executed from web browsers on different platforms.
ComponentWare [ComponentWare 1997], by I-Kinetics Inc., is a design based on the
Common Object Request Broker Architecture (CORBA) that integrates
customer applications from prefabricated software components. These trends suggest that
software resources are changing. Existing scheduling and resource management methods
may be inefficient to handle them.
Traditionally, resource allocation is done centrally by schedulers with queue or
priority schemes. As the types and number of computing resources grow, it will be more
difficult to efficiently manage the resources with centralized approaches. Therefore, many
distributed resource management schemes were developed. For example, in agent based
resource management schemes [Goscinski 1991], the policy of accessing resources is
enforced by the resource owners. An agent process has to be created at both the client and
server. The local agent searches and borrows resources from the remote agent, and the
remote agent verifies the borrowing request and provides the service. Another example
can be seen from the study performed by Gagliano et al. [Gagliano 1995]. In their
approach, a free-market principle was used to allocate computing resources. Tasks are
given an initial fund to acquire needed resources and they have to bid on resources
offered by the system. An auction process is convened by all the waiting tasks when a
task arrived or is completed. Their studies show that decentralized approaches are more
flexible and have better reaction time.
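The free-market allocation just described can be sketched in C; the Task fields and the single-resource auction below are hypothetical simplifications for illustration, not Gagliano et al.'s actual implementation:

```c
#include <stdio.h>

/* Hypothetical task record: each task holds remaining funds and a bid
 * it is willing to pay for the next resource unit. */
typedef struct { int id; double funds; double bid; } Task;

/* Auction one resource unit among the waiting tasks: the highest bid
 * the bidder can actually afford wins, and the winner pays its bid.
 * Returns the index of the winning task, or -1 if no valid bid. */
int auction_resource(Task *tasks, int n) {
    int winner = -1;
    double best = 0.0;
    for (int i = 0; i < n; i++) {
        if (tasks[i].bid <= tasks[i].funds && tasks[i].bid > best) {
            best = tasks[i].bid;
            winner = i;
        }
    }
    if (winner >= 0) tasks[winner].funds -= best;  /* winner pays */
    return winner;
}
```

A real system would also re-convene the auction whenever a task arrives or completes, as described above.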
In this dissertation, we approach this problem with a combination of the agent and free-
market approaches. That is, we allow intermediate nodes to store and provide logical
resources to other nodes. We establish a scheme that merges the scheduling and
resource management processes so that the overhead is reduced. We also show that, with
some extensions, our scheme can accommodate the thin-client structure.
OVERVIEW OF THE OSCULANT SCHEDULER
3.1 Basis of The Osculant Scheduling Scheme
Osculant differs from existing task and resource scheduling paradigms in that it is
bottom-up and self-organizing. Experimental studies have led to the conclusion that
Osculant is: (1) architecturally robust, (2) capable of internalizing the management of
system assets, and (3) able to dynamically alter the system ethos, ranging from real-time
operation to maximizing bandwidth, minimizing latency, or minimizing energy dissipation.
The Osculant paradigm can be motivated as follows:
* An Osculant system consists of a collection of possibly dissimilar autonomous
information systems (e.g., capabilities, instruction set, local storage, I/O) which may
or may not be connected by a network with an arbitrary topology and time- and
* Executed programs send messages to a higher level entity, called the steward, which
interprets these requests in terms of executable objects and posts them on a job board
along with salient information about data location, resource requirements, job
priority, and so forth.
* When a job is posted, all processors bid on that job in a manner that maximizes their
profit (measured in terms of tangibles such as net number of cycles per unit time).
Job bids are a function of processor resources, locality of data, I/O costs, job priority,
Figure 3.1 Overview of the Osculant scheduling scheme.
existing local job queue, and so forth. Processors have no knowledge of other bids
and operate autonomously. The steward receives bids and awards tasks to the
processor with the best bid.
* Any processor can play one of the three following roles:
1. User: A user node issues jobs.
2. Steward: A steward node manages jobs authorized by other users or assigned by
Figure 3.2 Structure of the Osculant Simulator.
3. Participant: A participant node bids on jobs and executes a job once it is assigned by the steward.
* Role assignments of nodes are based on individual jobs and circumstances. Hence, a
node can play one or more roles at the same time for different jobs.
Figure 3.1 shows an example of the basic operations of the Osculant scheduler and
the related study areas. It should be noted that Figure 3.1 only shows the logical
hierarchical structure of the Osculant system. In reality, the connection scheme can be a
graph, a ring, a tree, or a combination of these.
The basic market-driven philosophy underlying an Osculant system is a competitive
bidding scheduling scheme. The key to successful bidding is to estimate the
cost-complexity of a posted job accurately and efficiently. This is the role of the job
profile generator (JPG). Job profiles are first generated from the information provided
by the tasks and are distributed among the participating nodes. Upon receiving the job
profiles, the distributed heterogeneous nodes calculate the completion cost of a task
based on their interests, ethos, biases, and capabilities. The steward (which is itself
re-locatable) then assigns the job to the node with the best bid.
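The award step can be sketched as follows, assuming each bid has already been reduced to a single completion-cost figure (lower cost means a better bid); the structure and function names are ours, not part of the actual Osculant implementation:

```c
#include <stdio.h>

/* Hypothetical bid summary; the real Osculant bid is a function of
 * processor resources, data locality, I/O costs, job priority, and the
 * existing local job queue. */
typedef struct { int node_id; double cost; } Bid;

/* The steward awards the job to the node with the lowest estimated
 * completion cost.  Returns the winning node id, or -1 with no bids. */
int award_job(const Bid *bids, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || bids[i].cost < bids[best].cost)
            best = i;
    return best < 0 ? -1 : bids[best].node_id;
}
```

Because bidders never see one another's bids, this single pass over the received bids is all the coordination the steward needs.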
3.2 The Osculant Simulator
The Osculant Simulator is designed to study the Osculant scheduling scheme by
simulating various types of computing environments and external conditions. The current
version of the Osculant Simulator is capable of simulating job generation, dynamic
jobpost, bidding, and performance evaluation.
The structure of the Osculant Simulator is illustrated in Figure 3.2. There are five
major modules in the simulator: job generation, bidding strategies, jobpost strategies,
job auction strategies, and the job completion module. Each module contains a selection
of algorithms and subroutines that implement its function. These modules access the Job
Board and the Node Board for job contents, node configurations, and current system status.
In the Osculant Simulator, a working node is characterized by its computation
power, memory size, storage capacity, and type of architecture. Jobs are managed locally
by the local scheduler in a node. In most cases, jobs are processed in order of arrival
time, and a first-in-first-out (FIFO) queue is maintained for the waiting jobs. In a
later part of the dissertation, a non-preemptive local scheduler is implemented to
improve node utilization. In order to reduce complexity, it is further assumed that a
node cannot process jobs whose requirements exceed the node's physical limitations
(e.g., memory size) or jobs built for a different architecture (i.e., no software emulators).
As mentioned in Chapter 2, data transmissions are accomplished using packet
switching techniques. It is assumed that each node has a finite amount of buffers and
shortest paths will be chosen to route the packets. In order to simplify the analysis,
virtual-circuit packet switching [Stallings 1986] is used to route the packets between the
source and destination nodes. This means that a route between two nodes is set up prior to
data transmission. This route is not a dedicated path and may change from one
transmission to the next. Packets follow the same route and are buffered at each node.
Blocking may happen in the intermediate nodes, where packets are queued for output over a line.
The Osculant Simulator is implemented with the time-advance concept that scans all
modules for events at each time step. Initial events, which contain job generation time,
locations, resource requirements, and resource locations, node configurations, and
network bandwidth, are generated using a MATLAB program with user-defined
statistical models. Internal states of jobs and nodes are created and are inserted by the
simulator during run-time. The state diagram of jobs is shown in Figure 3.3.
The current version of the Osculant Simulator is implemented in the C language and
runs in UNIX environments (tested on SUN OS, HP-UX, and LINUX). Appendix
C illustrates the Osculant Simulator user interface. The simulation configuration file
allows users to change a wide range of system parameters such as: job contents, node
configurations, network configurations, bidding strategies, jobpost protocols, resource
management schemes, and other system parameters.
3.3 Initial Performance Studies on the Osculant Scheduling Scheme
Osculant is a bottom-up scheme that is capable of assimilating the most up-to-date
system information and can therefore achieve near-optimal scheduling performance.
This performance improvement comes at the possible expense of the overhead associated
with a relatively long jobpost/bidding process. In this section, the effects of the
jobpost/bidding delay on system performance are studied.
[Figure panels: "Speed Gain: Unbalanced Local Load, All Nodes Identical" and "Speed Gain: Unbalanced Local Load, Various Node Speeds"; x-axis: Number of Local Jobs at Node #3.]
Figure 3.3 Comparisons between top-down schedulers and the bidding scheme. In
these results, the bidding scheme outperforms the top-down schemes when the system
load becomes unbalanced. The bidding scheme has more prominent performance
advantages when the system contains processors with diverse service rate.
In order to show the effects of jobpost/bidding delays, three top-down schemes are
compared to the Osculant bidding scheme:
* Round robin: Jobs are assigned to working nodes in a sequential order among the working nodes.
* Random: Jobs are distributed randomly among the working nodes.
* Well informed top-down scheduler: The scheduler has complete knowledge about the
capabilities of all working nodes but does not have information about the load
generated locally at the nodes. This scheme also serves as an ideal comparison
counterpart to the bidding scheme because it accurately estimates working node
information without any jobpost/bidding delay when there is no local activity at the
working nodes.
The results are retrieved from a sequence of simulations. In the simulator, a collection
of jobs is generated and fed to the schedulers with various local loads (jobs generated
by local users, which are unpredictable to the top-down scheduler) and system
parameters (which are known to the schedulers). Jobs are characterized by job size, by
the similarities between jobs, and by the job generation rate.
The first simulation results shown in Figure 3.3 assume a homogeneous system with
four identical processors. Local jobs are added gradually to one of the working processors
while the local load of the other nodes is kept constant. The results show that the
performance gain of the random scheduler remains random; the round-robin scheduler and
the well-informed top-down scheduler have approximately the same performance; and the
Osculant scheduler gradually performs better as the level of load imbalance increases.
These results show that the Osculant scheme can adapt effectively to the current
processor status under this condition. The effect of jobpost/bidding overhead can also be
observed in this figure: when the homogeneous system is well balanced and lightly
loaded, both the round-robin scheduler and the well-informed top-down scheduler
outperform the bidding scheme.
The same simulation is also performed in a heterogeneous system of four processors
with different service rates (1:2:3:4 in this example). Simulation is achieved by
manipulating the local job load as in the homogeneous case. As shown by the results, the
advantage of the Osculant scheme becomes more prominent. In this case, both the
Figure 3.4 The effect of job granularity on the performance of the bidding scheme. The
top-down scheduler serves as an ideal-case counterpart in which no jobpost/bidding
delay is present.
random and round-robin schedulers perform worse, while the Osculant scheduler
generally outperforms the well-informed top-down scheduler by around 20%. This is
mainly because the bidding scheme can retrieve more accurate local information than the
top-down schemes. When the well-informed top-down scheduler distributes a job to a
heavily loaded processor, the performance penalty to the system is more severe in a
heterogeneous system than in a homogeneous one.
The simulation reported in Figure 3.4 investigates how the jobpost/bidding
overheads affect the overall system performance. In this simulation, the well-informed
top-down scheduler has accurate knowledge of the working node status, and there is no
local job load at the working nodes. Under this condition, the quality of job distribution
by the host processor is comparable to that of the bidding scheme, but without the
bidding overhead. Therefore, the well-informed top-down scheduler serves as an ideal
case for the Osculant scheduler. The other controlled parameter is the job granularity.
Here, a job is partitioned into pieces of different sizes and the overall completion time
is measured. The results show that, as the job granularity increases, the performance of
the bidding scheme gradually approaches that of the ideal top-down scheduler.
Conversely, very large job granularity results in a low performance gain because of the
low degree of parallelism.
These studies suggest that Osculant can perform as well as the best top-down
scheduler while also offering its unique properties and attributes. In the next chapter,
the network performance of distributed computing systems is studied to identify the key
components of the proposed scheduling scheme.
ANALYTICAL NETWORK PERFORMANCE MODELS
The communication scheme in a multiprocessor system plays an important role in
determining the overall system performance. The network performance model presented
here is based on studies of uniform communication latency (UCL) interconnection
networks [Johnson 1992]. The UCL network model can be applied to study various types
of communication schemes among computing nodes. In this chapter, an analytical model
is studied and verified with a simulator so that a better cost-to-performance ratio can
be found for customized requirements.
Based on the studies of UCL networks, we extend our scope to non-uniform
communication latency (NUCL) interconnection networks [Johnson 1992], which are
used in most distributed computing systems. From these studies and observations, an
understanding of communication in a heterogeneous computing environment can be
obtained and applied to the Osculant study.
4.2 Basic Concepts of Switch-Based Networks
The basic element of switch-based networks is the crossbar switch, shown in Figure
4.1, which scales from 2x2 to larger sizes. Assume that m is the probability that a
source node issues a request during a cycle into an M-by-N crossbar switch. Patel [Patel
Figure 4.1 MxN crossbar switch.
1981] shows that the bandwidth, which is the number of requests that arrive at the
destination nodes during this cycle, is

BW = N * (1 - (1 - m/N)^M)  (1)

and the normalized throughput rate of the M-by-N crossbar switch is

P = BW/(m*M)  (2)
It is well known that the cost of a crossbar switch grows as O(n^2) with its size.
It is therefore impractical to use a single crossbar switch to connect all nodes in a
distributed system. Instead, a multistage interconnection network (MIN) is used to
interconnect a large number of nodes using many small crossbar switches. The
performance of MINs varies according to the number of switching elements (SEs) as well
as the topology connecting the crossbar switches.
4.3 Performance Model of Multiple Stage Interconnection Networks
In this section, the performance model of a typical uniform communication latency
(UCL) interconnection network, the delta network, is reviewed and investigated. An N-
by-N delta network consists of (N/a)*n a-by-a crossbar switches, where N = a^n. A packet's
movement through the network can be controlled locally at each switching element (SE)
Figure 4.2 The 8x8 delta network.
by a single base-a digit of the destination address of the packet. Figure 4.2 shows an 8x8
delta network built from 12 2x2 crossbar switches.
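This digit-controlled routing can be sketched as follows, assuming the most-significant base-a digit is consumed at the first stage (the exact digit order depends on the particular delta topology):

```c
#include <stdio.h>

/* Local output port selected at stage k (0-based, first stage = 0) of
 * an a^n x a^n delta network: the k-th most-significant base-a digit
 * of the destination address. */
int route_digit(int dest, int a, int n, int k) {
    for (int i = 0; i < n - 1 - k; i++)
        dest /= a;          /* strip the less-significant digits */
    return dest % a;        /* the digit that controls this stage */
}
```

For the 8x8 network of Figure 4.2 (a = 2, n = 3), destination 5 (binary 101) selects port 1 at stage 0, port 0 at stage 1, and port 1 at stage 2, so each SE routes using only its own digit.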
Following the study by Yoon et al. [Yoon 1990], the analytical model is built based
on the following assumptions:
(1) Packets are generated at each node with equal probability, and the arrivals of
packets are memoryless.
(2) Packets are directed uniformly over all network outputs.
(3) The routing logic at each SE is fair, i.e. conflicts are randomly resolved.
These assumptions imply that the distribution of packets is uniform and statistically
independent for all SEs. Consider the single-buffer case and the state diagram in Figure
4.3, where q(k,t) is the probability that a packet is ready to come to a buffer of an SE at
stage k during the tth stage cycle, and r(k,t) is the probability that a packet in a buffer of an SE at
Figure 4.3 Transition diagram of single buffer MINs.
Figure 4.4 The state transition diagram of multi-buffer MINs.
stage k is able to move forward during the tth stage cycle, given that there is a packet in the
buffer. Then we have

q(k,t) = 1 - (1 - P1(k-1,t)/a)^a  (3)

where P1(k,t) is the probability that a buffer of an SE at stage k is full at the beginning of
the tth stage cycle, and P0(k,t) is the probability that a buffer is empty. Then,

r(k,t) = r'(k,t) * [P0(k+1,t) + P1(k+1,t)*r(k+1,t)]  (4)
q(k+1,t) = P1(k,t) * r'(k,t)  (5)
r'(k,t) = q(k+1,t) / P1(k,t)  (6)

where r'(k,t) is the probability that a packet of an SE at stage k can move to the output of
the SE during the tth stage cycle, given that there is a packet in the buffer. Since q(1,t) is
the probability that there is a packet coming to a buffer at the first stage during a cycle, it is
the arrival rate of a single input port of the network. Finally, according to the state
diagram shown in Figure 4.3, we have

P0(k,t+1) = P0(k,t)*(1-q(k,t)) + P1(k,t)*(1-q(k,t))*r(k,t)  (7)
P1(k,t+1) = P0(k,t)*q(k,t) + P1(k,t)*[q(k,t)*r(k,t) + (1-r(k,t))]  (8)
Similarly, Yoon et al. extended the model to multiple-buffer delta networks with the
following additional definitions:

m : Buffer size.
Pj(k,t) : Probability that there are j packets in a buffer of an SE at stage k
at the beginning of the tth stage cycle.
P0(k,t) : Probability that there is no packet in the buffer.
(1-P0(k,t)) : Probability that the buffer is not empty.
Pm(k,t) : Probability that the buffer is full.
(1-Pm(k,t)) : Probability that the buffer is not full.

Using the modified state diagram in Figure 4.4, we obtain equation pairs analogous to
those of the single-buffered delta network:

q(k,t) = 1 - [1 - (1-P0(k-1,t))/a]^a,  2 <= k <= n  (9)
r(k,t) = [q(k+1,t)/(1-P0(k,t))] * [1 - Pm(k+1,t) + Pm(k+1,t)*r(k+1,t)],
1 <= k <= n-1  (10)
r(n,t) = q(n+1,t)/(1-P0(n,t))  (11)
Pj(k,t+1) = q(k,t)*[P(j-1)(k,t)*(1-r(k,t)) + Pj(k,t)*r(k,t)]
+ (1-q(k,t))*[Pj(k,t)*(1-r(k,t)) + P(j+1)(k,t)*r(k,t)],  2 <= j <= m-1  (12)
P0(k,t+1) = (1-q(k,t))*[P0(k,t) + P1(k,t)*r(k,t)],  1 <= k <= n  (13)
P1(k,t+1) = q(k,t)*[P0(k,t) + P1(k,t)*r(k,t)]
+ (1-q(k,t))*[P1(k,t)*(1-r(k,t)) + P2(k,t)*r(k,t)]  (14)
Pm(k,t+1) = q(k,t)*[P(m-1)(k,t)*(1-r(k,t)) + Pm(k,t)*r(k,t)] + Pm(k,t)*(1-r(k,t)),
1 <= k <= n  (15)
Similar to the single-buffer case, q(1,t) is the only known parameter at the beginning
of the analysis. By using the following iteration procedure, the value of any desired
variable can be computed:
(1) initialize all values to zero, except P0(all stages, 0), which equals 1 (all buffers empty), and q(1,t), which is set to the known arrival rate,
(2) compute Pj(all stages, t+1) according to Pj(stage, t),
(3) compute q(stages 2..n, t) according to Pj(stage-1, t),
(4) compute r(n,t) according to q(n+1,t) and P0(n,t),
(5) compute r(stages 1..n-1, t) according to q(stage+1, t) and Pj(stage+1, t),
(6) repeat (2), (3), (4), and (5).
Because there is no closed-form solution for the above equations, the steady-state
solutions required to find the performance indices are computed iteratively until the
outputs of these equations reach their steady state, i.e., until the differences between
iterations fall below a threshold value. Details of this procedure are discussed
in Section 4.5.5.
Once in the steady state, the probability that a packet arrives at an output port during
a cycle is defined as the normalized throughput S. That is,

S = (1 - P0(n)) * r(n)  (16)
Let R(k) be the steady-state probability that a packet in a buffer of an SE in stage k is
able to move forward, i.e., the steady-state value of r(k,t). Since a packet then waits an
expected 1/R(k) cycles at stage k, the normalized delay d can be given as

d = (1/n) * sum over k = 1..n of 1/R(k)  (17)
4.4 The MIN Simulator
In order to verify the analytical model developed by Yoon et al., a simulator was built
to analyze the errors and to determine the impact of the model's assumptions. The
simulator also provides a platform on which more realistic input patterns can be tested.
4.4.1 Simulator Design Principles
At the beginning of each clock cycle, the input to each input port is computed based
on the average arrival rate m. Two random number generators are used to construct the
packet arrival process. The first determines whether there is a packet transmission
request. If a packet is generated, the second random number generator is used to
determine the destination port address. Thus the arrival process is memoryless and the
traffic is uniformly distributed over the network.
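This two-generator arrival process can be sketched as follows; rand() stands in for whatever generators the simulator actually uses, and the modulo draw is only approximately uniform:

```c
#include <stdlib.h>

/* One arrival draw per input port per cycle: the first random draw
 * decides whether a packet is generated this cycle (probability m);
 * the second draws a destination port uniformly at random.
 * Returns the destination port, or -1 for "no packet". */
int generate_packet(double m, int n_ports) {
    if ((double)rand() / RAND_MAX >= m)
        return -1;                /* no transmission request this cycle */
    return rand() % n_ports;      /* uniform destination (approx.) */
}
```

Because each cycle's draw is independent of all previous cycles, the resulting arrival process is memoryless, matching assumption (1) of the analytical model.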
For each cycle, all the switching elements in the system are examined. An SE at
stage k checks the packet at the head of its buffer and routes the packet according to its
destination address. The routing in a delta network is controlled by the target address,
i.e., the kth digit of the destination address represents the local destination port in an
SE at stage k. An SE first checks the buffer of the destination SE at the (k+1)th stage. If
a packet can be moved forward, the buffer management procedure is activated in both
SEs. It is assumed that the destination port has an infinite service rate; thus a packet
in the buffer of an SE at the (n-2)th stage is always able to move forward. The data
required to compute the normalized throughput are collected at the destination port of
the final (i.e., the (n-1)th stage) SE.
The simulation program is developed in the C language. Networks smaller than
1024x1024 nodes can be simulated in an MS-DOS environment, while the simulation of
larger networks can be performed in the UNIX environment (LINUX and
4.4.2 Limitations of the MIN Simulator
Some parts of the MIN simulator could be improved further so that more realistic
conditions can be simulated. They are as follows:
(1) Blocking packets: The current simulator follows the assumption that blocked
packets do not issue further requests in the next cycle. This is not true in most
real-world cases. Details of this issue are discussed in Section 4.5.
(2) Bias in contention resolution: When two or more packets are routed to the same
destination port of an SE, the current simulator chooses the packet from the SE
with the smallest identification number. Thus, SEs with smaller identification numbers
have an advantage over SEs with larger ones. Contention resolution could be
further refined by using random logic or priority-based algorithms.
(3) Variable packet size: The packet size in the current simulator is fixed. In practical
cases, packets may have different sizes.
(4) Limited buffer size of the first-stage SEs: In order to simulate more realistic
networks, the buffer size of switching elements located in the first stage should be
much larger than that of the later stages. This is because the first-stage SEs model
real nodes, which can normally hold more data and reissue requests for blocked
packets over a long time period.
4.4.3 Simulation Results
The results from the simulations agree with the results from the analytical
model under the requisite input patterns. When the input patterns differ from the
assumptions, the simulator shows results that can be explained by reasonable causes.
The detailed results are presented in Section 4.5.
In order to obtain steady-state results, a sufficient number of packets and clock
cycles is needed. All the simulation results shown in Section 4.5 were obtained with a
simulated running time of 500 clock cycles.
4.5 Results and Analysis
4.5.1 Input Load vs. Throughput and Delay
Two observations can be made from the results shown in Figures 4.5, 4.6, 4.7, and
4.8. First, increasing the buffer size of SEs benefits the overall throughput of the system.
The throughput of the system is seen to increase linearly as the arrival rate increases (see
Figure 4.5) until the system is saturated. The incremental throughput in the high arrival
rate region is very small as the buffer size increases. This indicates that the improvement
made by adding buffers will be very expensive when the average arrival rate is high.
Second, adding buffers will result in longer delay for packets to go through the
network (shown in Figure 4.6). The delay increases as the arrival rate grows regardless of
Figure 4.6 Normalized delay of MINs with various buffer sizes and input rate.
whether the network is saturated or not. We can conclude from the results that the
normalized delay increases exponentially with respect to the input load.
Figure 4.7 Error rate of normalized throughput with various buffer size and input rate.
Figure 4.8 Error rate of normalized delay with various buffer size and input rate.
Comparing the analytical model of Yoon et al. and the simulation results, we can see
that the analytical model has high errors for predicting normalized throughput (shown in
Figure 4.7) when buffer size is small and input rate is high. Another observation from the
results in Figure 4.8 is that the analytical model has larger errors in predicting delays
when the average arrival rate is between 0.4 and 0.8. For the first case, with high input
rate and small buffer size, there are many packets that are dropped by the switching
elements because of network contentions (i.e. packets, that either cannot enter the
network because of blocking in the first stage or are dropped by the intermediate nodes
because their buffers are full, are not considered in the analytical model). For the second
case, the way the analytical model handles network contention resolution and the
distribution of packet destination addresses differs from that of real networks (as well as
from that of the simulator), so the effect of buffering is more apparent in the
middle input-rate range than at the two ends. In other words, when a packet is
blocked in an intermediate node, in the analytical model the packet will try again in the
next cycle with probability

m = 1 - (1 - m_average/a)^a,

where m is the input rate of an a-by-b switching element in an MIN network, and with a
new destination address drawn from a uniform distribution. In a real network, on the
other hand, a blocked packet will try again in the next cycle with probability 1 and with the
original destination address. When the input rate is small, there are few packets in the
network and, consequently, the network contention is low. On the other hand, when the
input rate is high, the network contention in the first few stages of the network is much
Figure 4.9 Normalized throughput of MINs with various buffer sizes under different
input rates. Network size is 2^8x2^8.
more severe than in the later stages. As a result, the network contention in the later
stages is not as intensive as that in the middle input-rate range. Therefore, the error is
lower at the two ends of the input-load range than in the middle range.
4.5.2 Buffer Size vs. Normalized Throughput and Delay
Figures 4.9 and 4.10 show that the normalized throughput increases with the buffer
size, although it levels off once the buffer size meets the demand of the average arrival
rate and the constraints of the network configuration. Figure 4.10 shows
the effect of various buffer sizes on system throughput rate when the MIN network scales
in size. The results indicate that the cost-to-performance ratio climbs rapidly as more
buffers are added in SEs. Compared to throughput rate cases, normalized delay is more
predictable under different buffer and network sizes. Figure 4.11 shows a constant linear
property between the buffer size and normalized delays for different network sizes.
The assessments are as follows: First, if the average arrival rate is known in advance,
the optimal buffer size (in the sense of cost-to-performance ratio) can be found by either
the analytical model or the simulation results. Second, the number of stages in an MIN
network has a downward effect on the overall performance. This reflects the fact that a
crossbar network (an MIN network with only one stage) has the best performance of all
MIN networks.
4.5.3 Network Size vs. Normalized Throughput and Delay
As mentioned in Section 4.5.2, the throughput of an MIN is not linearly related to its
network size. Figure 4.12 shows the impact of changing network size on the normalized
throughput rate. When the buffer size is small, the normalized throughput decays
quickly as the network scales in size. Figure 4.13 presents an interesting result: the
normalized delay decreases as the MIN network increases in size. For example, with a
mean arrival rate of 1 and a buffer size of 16, the normalized delay of an MIN of size
2x2 is 16. When the network size is scaled to 256x256, the normalized delay decreases to
8. The reason is that, in multistage networks, the queuing length of buffers at the later
stages is not as long as in the first few stages because of contentions and blocking.
Considering the above two cases, MINs with sufficient buffers prove to be very
cost-to-performance efficient for moderate network sizes.
Figure 4.10 Normalized throughput of MINs with network size 2^n x 2^n and various
buffer sizes.

Figure 4.11 Normalized delay of MINs with network size 2^n x 2^n and various buffer
sizes.
Figure 4.12 Normalized throughput of different MIN sizes for various buffer sizes.
Figure 4.13 Normalized delay of different MIN sizes with buffer size 1, 2, 4, 8, and 16.
Figure 4.14 Normalized throughput versus arrival rate for different types of switching elements.
4.5.4 Switch Element Types vs. Normalized Throughput and Delay
In this section, different types of switching elements are considered in constructing
MINs. Compared to 2x2 SEs, 3x2 SE networks have a nearly fixed normalized
throughput (Figure 4.14) because the network becomes saturated even when the input
load is very small (compared to 2x2 SE networks). For example, in a 243x32 (a=3, b=2,
n=5) delta network, the network becomes saturated when the average arrival rate is
larger than 0.13. The saturation point becomes smaller as the network size
increases. This type of network may only be suitable for environments with a very small
arrival rate. In contrast to networks with 3x2 SEs, 2x3 SE networks have a
Figure 4.15 Normalized delay versus arrival rate for different types of switching elements.
linearly increasing normalized throughput (i.e., no saturation point), since the network
never saturates even when the input load equals 1.
The normalized delay of 3x2 SE networks is larger than that of 2x2 SE networks
(shown in Figure 4.15), since packets have a higher probability of losing the competition
with other packets. This results in longer queue lengths in the SEs. Conversely, the 2x3
SE network has a fixed normalized delay because packets are distributed sparsely in the
network. In this case, packets rarely compete with other packets in SEs.
The errors between the analytical model and the simulation results are large in
this section. As mentioned in the study by Yoon et al. [Yoon 1990], the equations are
based on uniform-traffic assumptions, whereas the traffic conditions discussed in this
section differ from stage to stage. In addition, recall from Section 4.5.1 that the
analytical model is prone to a high error rate when the average arrival rate is
Figure 4.16 Number of iterations required to compute the analytical model results, i.e.
the iterations needed to reach steady state, with various arrival rates. The network size
high. The configurations that use 3x2 SEs have a very low saturation arrival rate, which
explains the larger error rate observed in this section.
4.5.5 Number of Iterations Needed to Compute Analytic Results
It is interesting to determine how many iterations are needed to compute the analytical
results. As mentioned in Section 4.3, the number of iterations is the time needed for the
network to reach its steady state when the input load is fixed. Figure 4.16 shows the
iteration number versus the average arrival rate. Steady state is reached when the
differences between consecutive iterative computations of normalized throughput and
delay are both less than a small threshold value S (a fixed value in this dissertation).
The results show that the iteration number is a floor function with respect to the average
arrival rate. When the input load is low, a fixed number of iterations is needed to
compute the answer, just as
Figure 4.17 Number of iterations needed to compute analytical model results, i.e. to
reach steady state, with various buffer sizes.
packets need a minimum number of steps to traverse the network. On the other hand, when
the input load is high the network is nearly full, and the iteration number does not
increase beyond the saturation points found in Figure 4.6.
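The convergence test described above amounts to a fixed-point iteration. The following sketch shows only the stopping rule; the update function here is a hypothetical stand-in for the per-stage model equations of Section 4.3, and the threshold value is illustrative.

```python
def iterate_to_steady_state(update, state, threshold=1e-6, max_iter=10_000):
    """Iterate the model until consecutive estimates of normalized throughput
    and delay both differ by less than the threshold S."""
    prev_tp, prev_delay = update(state)
    for n in range(1, max_iter):
        tp, delay = update(state)
        if abs(tp - prev_tp) < threshold and abs(delay - prev_delay) < threshold:
            return tp, delay, n
        prev_tp, prev_delay = tp, delay
    return prev_tp, prev_delay, max_iter

# Toy update: damped relaxation toward fixed values (illustrative only, NOT
# the per-stage equations of Section 4.3).
def toy_update(state):
    state["tp"] += 0.5 * (0.8 - state["tp"])        # throughput relaxes toward 0.8
    state["delay"] += 0.5 * (3.0 - state["delay"])  # delay relaxes toward 3.0
    return state["tp"], state["delay"]

tp, delay, iters = iterate_to_steady_state(toy_update, {"tp": 0.0, "delay": 0.0})
```

With a geometric update like this toy one, the iteration count depends only on the contraction rate and the threshold, which mirrors the fixed iteration count observed at low input loads.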
Figure 4.17 shows a plot of iteration number versus buffer size. The iteration number
grows exponentially with buffer size, a trend that becomes more apparent when the input
load is high. From the results gathered in previous sections, the behavior of MINs is
predictable: an MIN system can be built to meet required specifications as long as the
average arrival rate is known in advance. Our studies also show that the analytical model
computes the performance indices quite accurately compared to the simulation results. The
advantage of the analytical model is its short computation time. In the experiments, a
simulation run of a 2^8x2^8 MIN took minutes (depending on the buffer size and the
arrival rate) to be
Figure 4.18 Trend of differences between the realistic MIN and the analytical model.
completed by an Intel 80486 personal computer while the results from the analytical
model can be retrieved within seconds. The analytical model is therefore very valuable in
systems with dynamic configuration capabilities; that is, it can be used to evaluate the
network performance of our target computing system in real time.
4.5.6 Discrepancy Errors Between the Analytical Model and Real Networks
Due to the memoryless assumption, packets that lose the competition to enter the first SE
stage are discarded without further action. Realistically, when a request is blocked in
the current clock cycle, it re-issues the request in the next cycle with probability 1
(instead of with the average request rate), and the destination address remains the same
as that determined earlier. As a result, the accuracy of the model behaves as shown in
Figure 4.18. When the average arrival rate m is small and conflicts occur, the analytical
model uses the average input rate m instead of 1; that is, the model tends to reduce the
number of packets in the network, so its predicted throughput rate is lower than that of
the real case. On the other hand, when m is high, contentions that happen in one cycle
should continue for the next few cycles. In other words, if two or more packets want to
go through the same port, the contention resolution logic can only admit one packet per
cycle, and the request rate remains 1 in the next cycle. In the analytical model, however,
a blocked source is assumed to generate a new request with a random destination, so there
are too many packets compared to the real-world case, and the throughput rate derived
from the model is higher than that in reality.
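The low-load side of this discrepancy can be illustrated with a toy model far simpler than the MIN itself: a single output port fed by two sources, where a blocked request either persists with probability 1 (the realistic case) or is discarded and regenerated with probability m (the memoryless case). The parameters below are illustrative assumptions.

```python
import random

def port_throughput(m, persistent, cycles=200_000, seed=1):
    """Mean accepted packets per cycle at one output port fed by two sources."""
    rng = random.Random(seed)
    pending = [False, False]   # sources currently holding a blocked request
    accepted = 0
    for _ in range(cycles):
        req = [pending[i] or (rng.random() < m) for i in range(2)]
        pending = [False, False]
        if req[0] and req[1]:              # contention: only one packet passes
            accepted += 1
            if persistent:                 # realistic: loser re-issues next cycle
                pending[rng.randrange(2)] = True
        elif req[0] or req[1]:
            accepted += 1
    return accepted / cycles

real = port_throughput(0.2, persistent=True)    # blocked requests persist
model = port_throughput(0.2, persistent=False)  # memoryless: blocked requests vanish
```

As expected from the discussion above, the memoryless variant loses the blocked packets and reports a lower throughput than the persistent one at low m; the toy cannot reproduce the high-m, random-destination effect, which requires the full multistage network.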
4.6 Non-Uniform Communication Latency Interconnection Networks
A uniform communication latency interconnection network is a special case of non-uniform
communication latency (NUCL) interconnection networks. In a NUCL network, the number of
nodes or switching elements traversed in a communication between two nodes can vary.
Nodes in a NUCL network generally have a fixed and comparatively small degree of
connection, which allows the system to be scaled easily. The analysis of UCL networks in
the previous sections provides many basic and helpful performance indicators for the
study of NUCL networks. The relations and interactions among the key components (e.g.
network size, buffer size, network load, etc.) in UCL networks can be applied to observe
and analyze the characteristics of NUCL networks.
4.6.1 Comparisons between Uniform and Non-Uniform Communication Latency
Uniform Communication Latency (UCL) interconnection networks are the most
popular interconnection architecture for parallel processing systems. Observations from
the above analysis clearly illustrate the advantages of UCL interconnection networks:
(1) Simplicity: UCL interconnection networks afford simplicity in software. Compilers,
schedulers, and operating systems built upon architectures with UCL networks can
be simplified because of the symmetric topology and identical bi-directional
communication latencies.
(2) Steady throughput: Studies in Sections 4.5.1 and 4.5.2 indicate that the steady-state
throughput rate can be raised by adding more buffers to the switching elements.
Studies in Section 4.5.5 also found that the time needed to reach steady state is
predictable under the same conditions. These features make it easier for users to
predict the communication performance between nodes.
(3) Predictable delay: Previous studies show that the steady-state network delay
increases linearly with buffer size (see Figure 4.11) and decreases when the network
size is increased with sufficient buffer size (see Figure 4.13). These features give
network designers a way to estimate network performance, and a shorter delay can be
expected as the network size grows.
Unfortunately, the mechanisms used to implement UCL networks are not scalable.
Full crossbars can provide nearly uniform communication latencies, but their O(n^2)
hardware cost makes them virtually unscalable. MINs, which circumvent the bandwidth
problems of bus-based systems and the high hardware requirements of full crossbars,
provide bandwidth that scales with machine size. However, this increased bandwidth comes
at a price: all communication latencies increase with the number of nodes in the system.
Since scalability is very important for a distributed system, Non-Uniform Communication
Latency (NUCL) interconnection networks may be more suitable for building distributed
systems.
The most important characteristic of NUCL interconnection networks is that the
node degree (i.e. the number of links of a node) is fixed. This feature makes NUCL
networks easier to scale in size. Latencies of NUCL networks may not be as good as those
of UCL networks of the same size. Nonetheless, as a network grows, some nodes in a NUCL
network have the advantage of remaining close to one another regardless of network size.
Therefore, as machine size increases, applications running on UCL networks face
increasing latencies for all communications. With NUCL architectures, on the other hand,
if applications or the system controlling body can be designed so that communication
patterns favor nearby nodes, the system should be able to improve its performance
depending on how far the average communication distance is reduced.
4.6.2 Key Issues in Non-Uniform Communication Latency Interconnection Networks
From the discussion in the previous section, it is clear that the key for NUCL networks
to outperform UCL networks is to exploit the various locality properties embedded in the
system and its applications. Communication locality can be divided into two domains:
application locality, which is present in the organization of an application, and
architecture locality, which represents the capability of an operating system or
architecture to exploit application locality.
As in studies of cache memory design, application locality can be further divided into
two categories: temporal locality, which represents the effect of decreasing the
communication frequency between application threads, and spatial locality, which
represents the effect of affinity in the communication pattern among application threads.
Johnson [Johnson 1992] suggests the following three approaches to reduce the impact of
communication latency:
Avoid long latency operations
This approach exploits temporal locality in applications. Much research has been done in
this area using compilation techniques that enhance data reuse in applications. Both UCL
and NUCL network architectures allow exploitation of temporal locality.
Reduce communication latency
This approach exploits spatial locality in applications and focuses on minimizing
communication distance. Applications can be designed to have good spatial locality to the
extent that their inter-thread and resource-acquiring communication graphs have
relatively low bisection width and high diameter. While NUCL architectures can exploit
spatial locality by placing communication requests on nearby nodes so that the average
communication distance is minimized, UCL architectures can exploit spatial locality only
when there are more resource servers or application threads than nodes (i.e. by
overlapping communication with the execution of other threads). Evidently, the governing
body of the system must be designed with this idea in mind to improve performance.
Tolerate long latency operations
This approach focuses on software paradigms and processor architectures that allow
useful work to overlap long latency operations.
Studies [Agarwal 1991][Johnson 1992] have shown that, for NUCL networks, exploiting
locality can reduce the bandwidth requirements of applications and network contention
more effectively than improving the backbone network components (i.e. faster switching
elements and communication channels). The study by Johnson [Johnson 1992] shows that the
spatial locality embedded in applications has a direct impact on network performance in
NUCL systems. However, the study also shows that the effectiveness of gaining performance
by exploiting spatial locality depends on the degree to which communication distances can
be reduced by the system. Thus, in the Osculant scheme, we mainly exploit the second
approach, by trying to shorten the communication distance of resource requisitions and by
increasing the number of threads and servers. Additionally, we exploit the third approach
in scheduling techniques that overlap resource transmission with task execution. With
these approaches, NUCL network systems have good potential to outperform UCL network
systems, especially when the network size is large.
The studies of UCL and NUCL networks in this chapter suggest that, to improve the
performance of a distributed computing system, which is mostly connected by a NUCL
network, one has to cleverly arrange the network dispositions. The analytical model and
simulations of multiple stage interconnection networks indicate that
(1) Throughput increases as the buffer size increases and is limited by the configuration
of the MINs.
(2) Delay increases linearly with the buffer size but decreases when the network size
increases.
(3) The performance of a UCL network is predictable when the average arrival rate and
the network configuration are known in advance.
(4) The analytical model, which is computationally efficient, provides adequately
accurate performance estimates when the MIN is not saturated and the buffer size is
sufficient.
Thus, a multiple stage interconnection network with an optimal cost-to-performance ratio
can be built. Specifically, for the studied Osculant scheduling scheme, these
observations suggest that network traffic in distributed computing systems can be
predicted and controlled by arranging the buffer sizes and communication lengths whenever
the job arrival rate is known or predictable.
The studies on non-uniform communication latency (NUCL) networks suggest that they offer
features that UCL networks lack, such as scalability and shorter communication latencies.
To obtain these benefits, however, resources and applications in such distributed systems
must be carefully arranged and designed to exploit locality. For instance, applications
have to be implemented so that the communication distance between threads is shortened
and communication latencies can be overlapped with other useful work. More importantly,
the scheduling and resource management scheme of the distributed system has to assign
tasks and allocate resources so that network traffic is confined and minimized. The
Osculant scheduling scheme studied in this dissertation focuses on this field.
OSCULANT SCHEDULING MECHANISMS
In this chapter, the details of the Osculant scheduling scheme will be discussed. In
the first part, jobpost distribution protocols will be presented. Various bidding strategies
will be discussed in Section 5.2. In Section 5.3, experimental results are analyzed. Section
5.4 presents the resource management scheme used in Osculant. In Section 5.5,
conclusions are presented.
5.1 Jobpost Distribution Protocol
Jobpost distribution protocols are designed to distribute to participating nodes small
packets that contain job profiles, resource locations, and execution specifications. The
major concerns in designing the protocol are topology independence and jobpost efficiency
because:
(1) the Osculant scheduler is designed for a distributed, heterogeneous computing
environment; and
(2) jobpost efficiency directly affects the capability of the scheduler to probe, to
search, and to gather information in the system.
Two parameters control and indicate the performance of jobposting: the jobpost constraint
and the jobpost coverage. The former limits how far jobposts can travel and thereby
controls the jobpost/bidding delays, while the latter determines the range that jobposts
reach in a system. The coverage also represents the level of optimization that a job can
achieve in the system.
5.1.1 Multi-layer Jobpost Protocol
Flooding broadcast forms the basis of our jobpost distribution protocol. This technique
guarantees that all nodes that meet the defined distribution rules receive jobposts even
when there are failures in the system.
Flooding Broadcast with Hub Number Constraints
A node is either susceptible (it has never heard the jobpost) or infectious (it knows the
jobpost). When a susceptible node receives a jobpost, it becomes infectious, decreases
the hub number constraint by one, and relays the jobpost to its neighbors. When an
infectious node receives a jobpost it has seen before, or when the hub number constraint
it receives is zero, it does not react.
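This susceptible/infectious rule can be sketched as follows on a small 5x5 mesh. The mesh, the source node, and the message accounting are illustrative assumptions, not the exact setup of the figures in this chapter.

```python
from collections import deque

def flood(adjacency, source, hub_constraint):
    """Flood a jobpost from `source`. Each relay decrements the hub number
    constraint; infectious nodes ignore repeated jobposts. Returns the set of
    infectious nodes and the number of messages sent."""
    infectious = {source}
    messages = 0
    queue = deque([(source, hub_constraint)])
    while queue:
        node, budget = queue.popleft()
        if budget == 0:                    # constraint exhausted: do not relay
            continue
        for nb in adjacency[node]:
            messages += 1                  # every relay costs one message
            if nb not in infectious:       # susceptible node becomes infectious
                infectious.add(nb)
                queue.append((nb, budget - 1))
    return infectious, messages

# 5x5 mesh with Manhattan neighbors; flood from the center node (2, 2).
mesh = {(x, y): [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < 5 and 0 <= y + dy < 5]
        for x in range(5) for y in range(5)}
covered, msgs = flood(mesh, (2, 2), 2)
coverage = len(covered) / 25               # fraction reached in one layer
```

With a constraint of 2, the flood reaches exactly the nodes within 2 hubs of the source, and the message count includes the redundant relays to nodes that are already infectious, which is the overhead the text discusses below.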
Multi-layer Jobpost Protocol (MJP)
Multi-layer Jobpost Protocol (MJP) contains three major components:
(1) a flooding broadcast is used to distribute jobposts in a single jobpost/bidding layer;
(2) a jobpost constraint, in the unit of communication hubs, limits the range of jobpost
distributions in a jobpost/bidding layer; and
(3) in multi-layer jobpost/bidding, the winner of the current layer repeats the MJP until
the same node wins in consecutive jobpost/bidding layers.
MJP satisfies the goals of topology-independent jobposting, balancing and regulating
jobpost/bidding delay, controlling the level of optimization, and self-organizing in the
Osculant scheme. An example is shown in Figure 5.1.
Figure 5.1 An example of multi-layer jobpost/bidding on 25 processors connected in a
mesh. With a jobpost constraint of 1 hub, as shown on the left, there are 5 levels of
jobpost/bidding with a jobpost coverage of 60% (15/25) and 16 messages passed. If the
constraint is increased to 2, only 3 jobpost/bidding levels are required to reach a
jobpost coverage of 84% (21/25), at a cost of 34 messages.
Being an anomalous distribution protocol, MJP is convergent because infectious nodes will
neither re-post nor re-bid previous jobs. Later studies remove this restraint (see
Section 5.2); convergence of MJP is then enforced by the bidding strategies.
Balancing jobpost/bidding overheads against the level of optimization is an important
issue in designing a distributed scheduler. A flat jobpost structure, which has loose
jobpost constraints, generally has higher jobpost coverage and therefore produces more
nearly optimal results. But this structure is also more vulnerable to high
jobpost/bidding overheads caused by failures in nodes or in communication channels, and
it carries a greater message overhead burden (sending jobposts to already infectious
nodes). Conversely, distributions with restricted jobpost constraints normally have a
quicker jobpost/bidding process but a narrower view of the system status; consequently,
such systems are inclined to become trapped at local optima. Our studies indicate that
small jobpost constraints (2 or 3 hubs) provide sufficient jobpost coverage with low
overhead. Additional results can be found in Section 5.3.
5.1.2 Other Jobpost Distribution Techniques
In some cases, such as military applications, the number of messages traveling in the
system must be minimized so that the probability of detection or interception is reduced.
In a computing environment where customers are charged for using communication channels,
reducing the number of messages is also desirable to reduce cost. By relaxing the
requirement that all processors receive jobposts in a jobpost/bidding layer, the number
of messages transmitted in a system can be reduced. Candidates in this category include
epidemic algorithms and anti-entropy algorithms [Chow 1996].
In most applications, it is not strictly required that all active processors receive
jobposts and bid for jobs. Tasks can be completed as long as the bidding participants can
provide the required system services; the number of bidding participants only determines
the level of optimization that can be obtained. For example, in a distributed system with
many resource suppliers, it is not necessary that all active agents join the
jobpost/bidding process. Jobs can still be completed because resources are duplicated at
many locations; the differences are only in cost and, possibly, service quality. In
another distributed computing case, the correctness of job execution can be guaranteed by
setting appropriate read/write quorums on system functions. For example, by requiring
every updated system function or routine to be duplicated on more than half of the
servers, only half of the servers in the system need to participate in the
jobpost/bidding process to guarantee correctness.
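The quorum argument can be made concrete with the classical overlap condition (a generic sketch of read/write quorums, not Osculant-specific code; server counts are illustrative):

```python
def quorums_consistent(n_servers, read_quorum, write_quorum):
    """A read is guaranteed to see the latest update iff every read set of
    `read_quorum` servers overlaps every write set of `write_quorum` servers,
    i.e. read_quorum + write_quorum > n_servers."""
    return read_quorum + write_quorum > n_servers

# Updates duplicated on more than half of 10 servers (w = 6): a bidding pool
# of just over half the servers (r = 5) still overlaps every write set.
ok = quorums_consistent(10, read_quorum=5, write_quorum=6)       # True
bad = quorums_consistent(10, read_quorum=5, write_quorum=5)      # no overlap
```

This is why the text can relax participation to roughly half of the servers: as long as the write quorum exceeds half, any such bidding pool intersects every up-to-date replica set.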
5.1.3 Optimal Jobpost Distribution
The optimization criteria of jobpost distributions depend on the system configuration and
job requirements. In this section, the criteria under investigation are message
efficiency, jobpost coverage, and the number of jobpost/bidding levels. The quality of a
jobpost distribution method is determined by the rate at which it successfully reaches
the best node, or by the ratio of the overall execution time in the actual case to that
in the ideal case. Here, the ideal case means a jobpost distribution with 100% jobpost
coverage and a uniform jobpost/bidding delay between any two processors in the system.
The example shown in Figure 5.1 displays the effect of jobpost constraints on jobpost
distribution quality. In this section, jobpost distribution is studied on a mesh
structure. Jobposts are distributed using the MJP with jobpost constraints in hub number.
The bidding algorithm applied here is the performance-based bidding method (details are
presented in Section 5.2.1). Jobs are generated with the same processing and I/O time,
but with resources uniformly distributed in the system. Experimental results are shown in
Figure 5.2.
Jobpost coverage and message efficiency
As shown in Figure 5.2a, the total number of messages grows faster than the jobpost
coverage as the hub number constraint increases. This means that, as the hub number
increases, the number of redundant messages transmitted in the system increases at a
faster rate. The MJP suppresses the number of nodes between
Figure 5.2 Simulation results and optimal jobpost constraints in hub number: (c) optimal
hub number versus bidding delays; (d) hub number required for 100% coverage and ideal
rate.
the new and old stewards when distributing old jobposts. However, the protocol cannot
detect nodes that already know the jobpost from previous broadcast processes. Therefore,
to improve message efficiency, a less restricted jobpost protocol, e.g. an epidemic
algorithm, is needed.
Number of jobpost/bidding layers
The number of jobpost/bidding layers decreases steadily as the hub number constraint
increases (shown in Figure 5.2b). The job auction process in a steward is held once all
bids are received or once a pre-defined time-out period is reached. In general, a
time-out period is invoked only when there are failures in the system, so in a system in
its normal state the auction delay is determined by the longest bidding delay between the
steward and its child nodes. If the logical structure of an Osculant system is close to
the underlying physical structure, i.e. small hub numbers are equivalent to short bidding
delays, more jobpost/bidding layers may have the advantage of achieving better message
efficiency while providing sufficient jobpost coverage.
For jobs with fixed resource servers, or for systems with large centralized resource
servers, it is optimal to have small hub number constraints because the cost surface
mostly decreases monotonically toward the optimal candidate node from any processor in
the system. Small hub numbers therefore give the highest message efficiency.
Optimal jobpost constraints in hub number
The optimal hub number can be determined in two ways. First, a system with long bidding
delays prefers small hub number constraints. This relation is shown in Figure 5.2c; the
optimal hub number is generally a floor function of the bidding delay. For example, if
the bidding delay is between 1% and 4.5% of the average job processing time, the optimal
hub number constraint is 7 for a mesh size of 10x10; if the bidding delay exceeds 5.5%,
the optimal value drops to 1. The second way is observed from the jobpost coverage and
ideal ratios. Figure 5.2d indicates that the relation between the 100% jobpost coverage
constraint and the 100% ideal ratio is practically linear. Therefore, it is sufficient to
achieve optimal system performance by using a jobpost constraint that is about 75% of the
full jobpost coverage constraint. For instance, as shown in Figure 5.2d, in a 10x10 mesh
a hub number constraint of 5 is sufficient to reach optimal system performance, while the
constraint for full jobpost coverage is 12.
5.2 Bidding Strategies
Following Chapter 3, participating nodes submit bids that represent the status or
intention of nodes, or the profit they can achieve from posted jobs. With additional
consideration of the underlying computing environment, the bidding processes are designed
with the following goals:
* Locally calculated bids: Participating nodes should calculate bids based on the
information provided in the jobposts and local knowledge of the system status. The
intention is to reduce the network traffic and to establish a loose, distributed and
bottom-up scheduling style.
* Simple job auction process: The function of a steward node is to distribute jobposts,
collect bids, and designate winning nodes in an Osculant system. Stewards should be
kept as simple as possible. In keeping with bottom-up, distributed system design,
vital information is kept locally; therefore, the failure of a steward node, or of any
part of the system, does not represent a severe threat. In addition, the computing
system can easily be reconfigured by choosing another node to serve as a steward at
any moment.
5.2.1 Performance-based Bidding Method
The performance-based bidding method is the most basic bidding method. The objective is
to balance the loads of the processing units and the network traffic connecting
participating nodes. A well load-balanced system generally has better performance because
of reduced network congestion. With top-down scheduling schemes, load balancing can be
achieved rather easily in a homogeneous and centralized computing system; otherwise, load
balancing is hard to achieve because of the difficulty of collecting local node status
and of being aware of non-scheduled local events. In this method, bids are calculated
locally by participating nodes to reflect the cost of completing jobs. The three key
components of the performance-based bidding method are as follows:
Resource collection time estimation
An estimate of the time needed to collect resources distributed within the system is
required to define a rational bid. Prior to job execution, the required resources must be
located, verified, and possibly transmitted over the network. The resource transmission
time is governed by the size of the resources, the cache storage status, the local node
I/O load, the remote node I/O load, and the network traffic. Of these five factors, the
first three are known locally and can be calculated exactly; the other two must be
estimated. Network bandwidth (or throughput rate) can be estimated in two ways. The first
method sends a packet to probe the network status; obviously, this method adds to the
network traffic. The other method estimates the traffic condition from indirect
information. For example, the I/O load at remote nodes can be estimated by tracking
previously received packets. Network traffic can be estimated in many ways. For instance,
we can estimate the transmission time by counting the number of hubs between resource
servers and bidders, defining the relationship between bandwidth and distance with linear
equations; this works as long as the network load is light. In this dissertation, we use
the network traffic model studied in Chapter 4 to estimate network bandwidth. Recall
that, in the UCL analytical model, the network throughput rate can be found given the
network configuration and the average job I/O load. The details of applying this model
are presented in Appendix A.
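The simpler hub-count method mentioned above can be sketched as follows. The linear bandwidth fall-off and its coefficients are illustrative assumptions for a lightly loaded network, not the Chapter 4 model.

```python
def estimated_bandwidth(hubs, base_bw=10.0, loss_per_hub=1.5):
    """Hypothetical linear model: bandwidth (MB/s) falls off with hub count;
    the floor keeps the estimate positive for distant servers."""
    return max(base_bw - loss_per_hub * hubs, 0.5)

def resource_transmission_time(resource_sizes_mb, hubs_to_server):
    """Sum the per-resource transfer times between the bidder and each
    resource server, using the linear bandwidth-distance model."""
    return sum(size / estimated_bandwidth(hubs)
               for size, hubs in zip(resource_sizes_mb, hubs_to_server))

# Two resources: 20 MB held 2 hubs away (7 MB/s) and 5 MB held 4 hubs away
# (4 MB/s) under the assumed coefficients.
t = resource_transmission_time([20.0, 5.0], [2, 4])
```

In practice the coefficients would have to be calibrated per system, which is precisely why the dissertation prefers the Chapter 4 traffic model for loaded networks.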
Task execution time estimation
Local schedulers calculate the task execution time. For single-bidding strategies, a
first-in-first-out (FIFO) scheduling scheme is employed: it is assumed that a task
executes after all of its resources have been received and all previously assigned tasks
have finished.
Task Execution Time = Max(I/O Time, Completion Time of Last Task) + Task CPU Time
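Under the FIFO assumption, this estimate follows directly from the formula (variable names are illustrative):

```python
def task_execution_time(io_time, last_task_completion, cpu_time):
    """FIFO local schedule: a task starts once its resources have arrived
    (io_time) and the previously assigned task has finished, then runs for
    cpu_time."""
    return max(io_time, last_task_completion) + cpu_time

# Resources arrive at t=8, the queue drains at t=5, and the task needs 3 time
# units of CPU, so execution completes at t=11.
finish = task_execution_time(io_time=8, last_task_completion=5, cpu_time=3)
```

The max() term is what lets the scheduler overlap resource transmission with the execution of previously assigned tasks, as discussed in Section 4.6.2.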
Bid suppression
Bid suppression methods play an important role in the Osculant scheduler design. Bid
suppression is used to correct for errors made in previously bid jobs: previous bidding
errors are used to adjust the bids applied to current jobs. Bid suppression can also be
used to resist the clustering of task assignments. Clustering occurs when tasks enter the
system in blocks over a short time period because:
(1) nodes do not update their states (CPU, I/O, and network load) until they receive job
assignments; and
(2) there are time lapses among jobposting, bidding, and receiving task assignments.
Job assignments for these tasks will therefore be sent to the same node. The observed
load diagrams for nodes without bid suppression generally have a "saw-tooth" shape, and
short-term scheduling performance is degraded. Thus, the design of the bid suppressor
contains two factors, namely the previous bidding errors and the number of jobs that have
been bid on but not yet assigned.
Suppressed Bid = (Current Bid) x (1 + Bidding Error)^Max(Length of Job Queue, Number of
Jobs Bid But Not Assigned)
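Reading the formula as scaling the current bid by (1 + bidding error) raised to the larger of the queue length and the number of outstanding bids, a sketch is (variable names are ours):

```python
def suppressed_bid(current_bid, bidding_error, queue_length, jobs_bid_unassigned):
    """Inflate the bid when past bids were optimistic or many bids are still
    pending, which resists the clustering of task assignments described
    above. A higher bid makes the node less likely to win the next auction."""
    exponent = max(queue_length, jobs_bid_unassigned)
    return current_bid * (1.0 + bidding_error) ** exponent

# A 10% historical underestimate with 3 outstanding bids raises a bid of 100
# to 100 * 1.1**3.
bid = suppressed_bid(100.0, bidding_error=0.10, queue_length=1,
                     jobs_bid_unassigned=3)
```

With zero bidding error the suppressor is inert, so a node with an accurate history and an empty pipeline bids at face value.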
5.2.2 Energy-based Bidding Method
This method assumes that when information is transmitted between two nodes, both parties
must be active for the same time period. The overall energy consumed by a job is modeled
as:
Bid = (Task Processing Time) + 2 x (Estimated Resource Transmission Time)
This is the simplest bidding algorithm in our studies. It emphasizes the importance of
network transmission time: jobs with the lowest transmission time are favored by a
bidding node. In other words, nodes that hold more resources have an advantage over other
nodes.
5.2.3 Dynamic Jobpost Model
Jobposts are continuously updated and modified during the jobpost/bidding phase. Ideally,
the information stored in a jobpost becomes more localized as it approaches the final
working node. In this model, the bids calculated by participating nodes become more and
more specific because the credibility of the estimated system status increases as the
jobpost moves closer to the future working node. This model also provides a resource
forwarding capability: the resource forwarding scheme grants intermediate nodes that keep
valid copies of resources the right to distribute those resources to others. In this
design, the original resource servers no longer supply all file services to other nodes
but instead perform more selective tasks, such as providing new resources to the system,
managerial tasks, cache validation, and resource distribution management. The
Figure 5.3 Example of resource forwarding in Osculant: with two jobs sharing the same
data, total node usage drops from 19 without resource forwarding to 17 with it.
workload and network traffic at the resource nodes are therefore reduced. Figure 5.3
illustrates resource forwarding in the Osculant scheme.
The resource caching mechanism forms the basis of this bidding model. Participating nodes
must concurrently calculate two bids, a task-processing bid and a resource-supply bid,
and there are two winners in each round of the job auction process. The winner of the
resource-supply bidding updates the jobpost according to its local cache status and sends
it to the winner of the task-processing bidding; the next round of jobpost/bidding, or
job execution, can then proceed. Resource verification is performed before the jobpost is
modified in order to ensure its correctness.
The job auction process is the same as in single-bid models. The winners of the
resource-supply and task-processing bidding modify the job profile, and the
task-processing winner then distributes the new job profile to the next jobpost/bidding
level.
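The two-winner auction can be sketched as independent minima over the two bid lists. The node names, bid values, and the convention that lower bids win are illustrative assumptions:

```python
def dual_auction(bids):
    """bids maps node -> (task_processing_bid, resource_supply_bid).
    Returns the task-processing winner and the resource-supply winner;
    the lowest bid wins each role."""
    task_winner = min(bids, key=lambda n: bids[n][0])
    supply_winner = min(bids, key=lambda n: bids[n][1])
    return task_winner, supply_winner

bids = {"node5": (7.0, 2.5), "node2": (6.0, 4.0), "node9": (8.0, 3.0)}
task_node, supply_node = dual_auction(bids)
# The supply winner updates the jobpost from its cache status and sends it to
# the task winner, which runs the next jobpost/bidding round or executes.
```

Note that one node may win both roles, in which case the jobpost hand-off is local and the round degenerates to the single-bid case.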
5.2.4 Resource Contractor Bidding Model
Evolved from the dynamic jobpost model, there are also two bidding stages in this
model. The winning node in the resource-supply bidding stage becomes the resource
contractor that takes responsibility of collecting resources for jobs. Upon collecting all the
required resources, the contractor node forwards them to the winner of task-processing
bidding stage. The motivations of this model are that: first, the winning resource-supply
nodes usually have, comparatively, more local resources and also have better channels
(i.e. higher bandwidth) to access remote resources (e.g. gateway or bridge nodes). Second,
we want to simplify the task of estimating network traffic among the bidders and resource
holders by reducing the number of resource suppliers in the task-processing bidding
stage. These arrangements will result in better bidding accuracy and local scheduling
capability. This model also better suits the conventionally configured nodes in a local area
network (LAN). In a LAN configuration, there are generally only a few nodes that can
serve as local resource servers with better network and file service performance. In the
resource contractor bidding model, these local servers autonomously become resource
contractors and satisfy most local needs. Additionally, failures in local resource servers
will only degrade system performance (i.e., resources will be hosted remotely or migrate to
less-capable nodes in the LAN), but will not halt services as in conventional systems.
Furthermore, resource distribution management can be conveniently performed in the
resource-supply bidding phase.
The resource contractor strategy is implemented by the two-stage bidding model plus
a resource information exchange session. They are described as follows:
Resource-supply Bidding Stage
Participating nodes calculate bids that represent the cost of collecting all required
resources. The single-bidding job auction model is applied in this stage, and winners at this
stage become the stewards of the second-stage bidding. The final winners provide the
following information in the jobposts, which will be used in the second-stage bidding:
(1) the identification of the resource-supplier node,
(2) the expected completion time to collect all resources, and
(3) the estimated I/O load after the completion of collecting resources.
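The first-stage winner's additions to the jobpost can be pictured as a small record; the field and function names below are hypothetical, not the implementation's.

```c
/* Hypothetical jobpost fields added by the resource-supply winner. */
typedef struct {
    int    supplier_id;        /* (1) identity of the resource-supplier node       */
    double collect_done_time;  /* (2) expected time to finish collecting resources */
    double est_io_load;        /* (3) estimated I/O load after collection          */
} SupplyInfo;

SupplyInfo make_supply_info(int node_id, double now,
                            double collect_cost, double pending_io)
{
    SupplyInfo s;
    s.supplier_id = node_id;
    s.collect_done_time = now + collect_cost;  /* completion-time estimate */
    s.est_io_load = pending_io;
    return s;
}
```

Second-stage bidders read these three fields directly from the jobpost, so no further polling of the contractor is needed during bidding.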
Task-processing Bidding Stage
Participating nodes calculate the task-processing bids by using the information provided
by the first-stage winner. Because the contractor node provides an estimated resource
waiting time, all required resources will be transmitted from only one node. Because the
contractor node will normally be located locally, more accurate and aggressive local
scheduling schemes can be applied. In the current implementation, task-processing
bidding is held immediately after the assignment of the resource contractor node. Bidders
calculate their bids according to the current cache status (i.e., speculative estimation with
no resource verification), the network traffic between the contractor node and the bidder,
and the local CPU schedule.
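A speculative task-processing bid of this kind can be sketched with a simple additive cost model: cache misses are charged at the contractor-to-bidder link rate, plus the local CPU queueing delay. The weights and names are assumptions of this sketch, not the dissertation's actual bid formula.

```c
/* Speculative task-processing bid sketch: local cache hits reduce the
   amount that must come from the contractor; the remainder is charged
   at the estimated link rate, plus the local CPU queue delay and the
   task's CPU cost. All names and the additive model are illustrative. */
double task_processing_bid(double job_bytes, double cached_bytes,
                           double link_rate, double cpu_queue_delay,
                           double cpu_cost)
{
    double missing = job_bytes - cached_bytes;
    if (missing < 0.0) missing = 0.0;          /* cache may over-cover */
    double transfer_time = missing / link_rate; /* contractor -> bidder */
    return transfer_time + cpu_queue_delay + cpu_cost;
}
```

Because the cache status is not verified at bid time, a stale cache makes this bid optimistic; the verification step described later catches that before execution.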
Cache Information Exchange
In order to reduce communication overhead between the resource contractor and
task-processing node, a resource information exchange session is performed right after
the contractor node retrieves all resources from the various servers. A packet with the job
resource list and current resource version numbers is sent to the task-processing node in
order to determine the types of resources to be transmitted.
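The version-number exchange can be sketched as a comparison between the contractor's resource list and the worker's cache. Representing an uncached resource by a negative version number is an assumption of this sketch.

```c
/* Cache information exchange sketch: given the contractor's resource
   versions and the worker's cached versions, flag which resources
   actually need to be transmitted. A cached version of -1 means the
   worker holds no copy (illustrative convention). */
#include <stddef.h>

size_t select_transfers(const int *needed_ver, size_t n,
                        const int *cached_ver, int *send_flags)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* transmit if not cached or if the cached copy is stale */
        send_flags[i] = (cached_ver[i] < 0 || cached_ver[i] < needed_ver[i]);
        if (send_flags[i]) count++;
    }
    return count;
}
```

Only the flagged resources cross the network, which is the point of the exchange session: up-to-date cached copies cost nothing.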
Out-of-order Local Scheduling (O3LS)
O3LS becomes a feasible solution with the realization of the two-stage bidding scheme.
The contractor bidding model provides better-bounded information about the availability
of expected future resources. Therefore, local nodes can make better estimates and utilize
the system's resources more efficiently. O3LS is applied to local CPU time planning with
non-preemptive scheduling techniques.
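Under the assumption that the contractor supplies bounded resource-arrival times, a non-preemptive O3LS dispatch decision might look like the following sketch. The names and the single-arrival-time-per-task simplification are illustrative.

```c
/* Out-of-order local scheduling (O3LS) sketch: with bounded resource
   arrival times from the contractor, pick the queued task whose
   resources have already arrived and arrived earliest, rather than
   the head of the queue. Non-preemptive: the choice is made only
   when the CPU becomes free. */
#include <stddef.h>

/* returns the index of the task to run next, or -1 if none is ready */
int o3ls_pick(const double *res_arrival, size_t n, double now)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (res_arrival[i] <= now &&
            (best < 0 || res_arrival[i] < res_arrival[best]))
            best = (int)i;
    }
    return best;
}
```

A plain FIFO queue would stall on the head task until its resources arrive; skipping ahead to a ready task is what the bounded arrival estimates make safe.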
5.3 Comparisons Among The Bidding Strategies
Experiments were conducted using the Osculant Simulator (see Section 3.2) with the
job state diagram shown in Figure 3.3. The results shown in this section were obtained by
using the configuration described in Appendix A. It should be noted that, in order to study
the scheduling and resource management performance of the Osculant scheme, the spatial
locality of applications is intentionally disrupted in the following simulations. That is,
resources required by tasks were uniformly distributed in the simulated computing
environment. The studied scheme must therefore adapt its scheduling and resource
relocation policies to improve system performance.
[Figure: legend Perf(o), Random(x), Round(v), Dyna(*), Contractor(s), Energy(d); x-axis: Average Job I/O Load]
Figure 5.4 System throughput rate results of various bidding strategies.
5.3.1 System Throughput Rate
System throughput rate (shown in Figure 5.4) illustrates the capability of the
scheduling scheme to utilize the system's resources. The results demonstrate different
characteristics of these bidding models with respect to changing system configurations
and job combinations. In general, multiple-bidding models outpace single-bidding
methods, especially in a CPU-intensive computing environment. The performance-based
method has the best throughput rate if the average I/O load of tasks is below 10%. The
resource contractor model performs well over job I/O ratios ranging from 15% to 50%. The
advantage of the contractor bidding scheme gradually gives way to the dynamic jobpost
method, or even the performance-based bidding method, as more I/O-intensive jobs enter
the system. The weak performance of the contractor model in the low and high I/O load regions
[Figure: legend Perf(o), Random(x), Round(v), Dyna(*), Contractor(s), Energy(d); x-axis: Average Job I/O Load]
Figure 5.5 Average task CPU time consumption of various bidding strategies.
comes from excessive bidding stages and resource transmission sessions. In an
environment where timing and task execution sequencing are enforced, the complexity of
implementing contractor model will be of concern. When the average I/O load exceeds
50%, energy-based bidding method behaves surprisingly well. Tasks are found to be
assigned mostly to nodes near resource servers. Depending on job composition, the
overall performance of these bidding methods varies. Assuming that the distributions of job
generation locations and compositions (in CPU time and I/O load) are uniform, the system
throughput rate improved by 54%, 95%, 102%, and 74% over the random method for the
performance-based, dynamic jobpost, resource contractor, and energy-based bidding
models, respectively.
[Figure: legend Perf(o), Random(x), Round(v), Dyna(*), Contractor(s), Energy(d); x-axis: Average Job I/O Load]
Figure 5.6 Average task resource transmission time of various bidding strategies.
5.3.2 Average CPU Time Consumption
The results in Figure 5.5 show the adaptability of the bidding schemes. The sample jobs
have mean CPU processing times ranging from 100 time units (no I/O) to 5 time units
(95% I/O time), shown as a dotted line in Figure 5.5. The nodes, on the other hand, have a
mean CPU processing power of unity. The bidding schemes are efficient in scheduling jobs to
high power computing nodes. Furthermore, bidding schemes in Osculant are immune to
performance degradation caused by system configuration alterations and by load
fluctuations in top-down scheduling methods.
5.3.3 Average Job Resource Transmission Time
Task communication time is a major factor in determining the overall performance
of a distributed computing system. The performance of various bidding models is
strongly influenced by resource transmission time. As illustrated in Figure 5.6, the
energy-based bidding scheme has the lowest I/O transmission time (as well as the lowest I/O
time overhead), while its job assignment distribution performance is poor in the low I/O rate
region. The resource contractor model and dynamic bidding model have a significantly
lower transmission time than the performance-based bidding method. Interestingly, most
of the bidding schemes achieve a constant resource transmission time during a broad I/O
ratio range. This is a result of low job completion rate and saturated network traffic. The
lower graph in Figure 5.6 shows the transmission time overhead of completed jobs. This
graph can be used to identify when the network saturates. For example, in the resource
contractor model, the network saturation point is around 0.55. After this point, the
network will be fully loaded and, hence, job completion times will be delayed. For jobs
that can be completed by the simulation time deadline, however, the overheads are
smaller than those of jobs whose I/O load nears the saturation point. The reason is that
there is less contention in the network because most of the jobs are queued at the resource
servers. This provides an interesting analogue to the analytical model and simulator of
multiple stage interconnection networks discussed in Chapter 4. In the middle I/O load
range, the system status is difficult to estimate (i.e. the bids are less accurate). On the
other hand, when I/O load is very low or very high, predictions of transmission time are
usually close to the actual time.
[Figure: x-axis: Average Job I/O Load]
Figure 5.7 Average job energy consumption of various bidding strategies.
5.3.4 Average Job Energy Consumption
Energy consumption rate (shown in Figure 5.7) is defined as the time period that
nodes need to be active because of assigned jobs. Overlaps in the communication time of
newly assigned jobs and the processing time of previous jobs contribute to energy
savings. Compared to system throughput rate, these results are very different among
various bidding schemes. The contractor bidding method shows a relatively high average
energy consumption rate because of its two resource transmission sessions. Overlaps in
the resource contractor session are relatively small because the contractor rarely receives
task-processing assignments. In contrast, the energy-based bidding model demonstrates
the lowest energy consumption rate for almost all ranges. In summary, average job energy
[Figure: two panels, legend Perf(o), Random(x), Round(v), Dyna(*), Contractor(s), Energy(d); x-axis: Average Job I/O Load]
Figure 5.8 Level of jobpost/bidding and jobpost coverage of various bidding strategies.
consumption reductions from the bidding methods are significant. With the same job
distribution as in the throughput rate section, completed jobs utilize 42%, 29%, 73%, and
19% of the power consumed by jobs in the random method for the performance-based
scheme, dynamic jobpost scheme, resource contractor scheme, and energy-based bidding
scheme, respectively.
5.3.5 Average Jobpost Coverage and Jobpost/bidding Delay
As discussed in Section 5.1, jobpost constraints and the number of jobpost/bidding
levels control the degree of optimization that jobs can find in the system. This statement
is arguably correct for single-bidding schemes. The experimental results show that
[Figure: x-axis: Average Job I/O Load]
Figure 5.9 Locality efficiencies of various bidding strategies.
the resource contractor model has the highest jobpost coverage. The same model also has
the greatest number of jobpost/bidding levels because of the two-stage bidding. The
dynamic jobpost model, which has a moderate number of jobpost/bidding levels but a low
jobpost coverage, suggests the effectiveness of dynamic jobpost modifications in finding
more detailed local information than other bidding methods. Unfortunately, dynamic
jobpost modifications also increase the possibility of being trapped in a locally optimal
location. This correlation may explain the unsatisfactory performance of this model in the
mid-high job I/O load range. The energy-based bidding model has a surprisingly low number of
jobpost/bidding levels, which results in low bidding delay, and moderate jobpost
coverage. The results show that this model performs best in medium-high job I/O load
range. In the I/O intensive environment, jobs are better assigned to resource servers (or to
their neighboring nodes) to reduce the communication overhead. This is exactly the goal
of the energy-based bidding model. The simulation results are shown in Figure 5.8.
5.3.6 Locality Efficiency
Studies in Chapter 4 suggest that locality is the key to improving the performance of a
system with a non-uniform-latency communication network. Experimental results in
locality efficiencies (shown in Figure 5.9) suggest the advantages of contractor and
energy-based bidding models. In Osculant, the transient resource allocations are
autonomous and are driven by demands. Thus, they are flexible and adaptive. Resources
that are vital to system operations, or in high demand, can be hosted in more nodes.
Otherwise, a minimum number of duplications is maintained to conserve storage space.
Moreover, controlling the number of duplications improves the variety of replicated
resources, which contributes to the reliability of the system.
5.4 Resource Management Schemes
The current implementation of the Osculant scheduler includes an integrated
resource management scheme to improve the system performance. It is evident that
proper resource allocations are essential to improve system performance. Network
performance studies in Chapter 4 also conclude that the scheduling scheme should be
able to exploit locality properties embedded in the network infrastructure and the jobs to
further improve the system performance. In addition, while more copies of the same resource
improve the performance of some tasks, it is also desirable to have more types of resources
duplicated and distributed so that the designed system can be more robust and balanced.
Figure 5.10 Resource validation process in the Osculant scheme.
In Osculant, the resource distribution scheme is implemented by various file caching and
forwarding techniques. The resource management scheme is driven in a bottom-up
manner and is granted responsibility for resource allocation, coherence maintenance, and
distribution-pattern enforcement.
Figure 5.10 shows the resource validation process in the Osculant scheduling
scheme. Nodes will issue resource verification requests to resource servers upon
receiving jobposts. In order to reduce the bidding delays, bids are calculated before
knowing the verification results (i.e., speculative bidding). It is possible that a node
receives job assignments based on expired cached copies. This type of error is treated as
a bidding error and generally does not repeat itself in the next bidding because the
assigned job will miss its predicted completion time. Bidding errors in this case do not
affect the correctness of job execution because resource verification results are required
Average CPU Time        -0.0882  -0.1886  -0.1863  -0.0234  -0.0144  -0.1705
Average I/O Time        -0.2000  -0.4273  -0.3843  -0.4655  -0.3202  -0.6296
Energy Consumption      -0.2033  -0.3971  -0.3933  -0.4681  -0.3496  -0.4309
Cache Hit Rate           2.8018   1.3488   1.5173   0.6177   0.8525   0.4181
Throughput Rate          0.3228   0.6848   0.6818   1.2414   0.8620   0.6377
I/O Bidding Error       -0.0734  -0.1207  -0.1413  -0.0968  -0.0454  -0.1180
Processing Time Error   -0.1502   0.8488   0.8320  -0.1863  -0.1200  -0.5517
Overall Utilization      0.0557   0.0365   0.0365  -0.2077  -0.1616  -0.3921
Server Utilization      -0.0610  -0.1007  -0.0927  -0.1738  -0.1247  -0.2660
Table 5.1 Performance comparisons between the Plain Model and the Request Frequency Model.
prior to the job execution phase. Normally, the resource validation process completes
earlier than the combined length of the bidding and job auction processes. By integrating
the resource verification and bidding processes, no extra overhead is introduced by this
resource management process.
Resource distribution control can be accomplished by the above resource
management process, too. The number of duplicated resources can either be limited by
the storage capacity of remote nodes or actively controlled by the resource servers. In the
former case, the number and topology of duplications will be driven by the demands and
system configurations. This model will improve the system performance but will also
weaken the system reliability because some resources will not be replicated elsewhere. In
the second case, we introduce active control mechanisms on the resource validation
process so that more different types of resources can be replicated in the system. Two
resource distribution models implemented in the Osculant are as follows:
Plain Model
This model is based on the first-in-first-out (FIFO) principle. When a new node holds a
copy of the resource, the oldest node in the list is flushed. Experiments show that this
method is simple and introduces little network traffic for resource validation message
exchanges. However, thrashing affects the overall performance because useful resources
might be replaced or flushed by any newly requested resource.
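The FIFO flush described above can be sketched with a small ring buffer of holder nodes; the fixed capacity and names are illustrative assumptions.

```c
/* Plain (FIFO) distribution model sketch: the server tracks a fixed-size
   list of holder nodes; adding a holder when the list is full flushes
   the oldest entry. Capacity and names are illustrative. */
#define MAX_HOLDERS 3

typedef struct {
    int holders[MAX_HOLDERS];
    int count;
    int head;  /* index of the oldest holder */
} HolderList;

/* returns the node flushed, or -1 if there was free space */
int add_holder(HolderList *h, int node_id)
{
    if (h->count < MAX_HOLDERS) {
        h->holders[(h->head + h->count) % MAX_HOLDERS] = node_id;
        h->count++;
        return -1;
    }
    int flushed = h->holders[h->head];
    h->holders[h->head] = node_id;             /* new copy replaces oldest */
    h->head = (h->head + 1) % MAX_HOLDERS;
    return flushed;
}
```

The thrashing problem is visible in the sketch: a burst of new holders evicts every existing entry regardless of how useful those copies are.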
Request Frequency Model
In this model, updates of resource entries are based on the frequency of validation
requests. Resource servers maintain counters that record the number of validation
requests from other nodes; the counters also decrease periodically according to the local
clock. Simulation results indicate that this method is more efficient in reducing network
traffic among nodes. Experimental results are shown in Table 5.1. However, this model
cannot control the geographic distribution of resources.
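A sketch of the request-frequency bookkeeping, assuming one counter per holder node and a unit decrement per local-clock tick (both assumptions of this illustration):

```c
/* Request Frequency model sketch: the server counts validation requests
   per holder and periodically decays the counters on local-clock ticks;
   the holder with the lowest count is replaced first. Names assumed. */
#include <stddef.h>

typedef struct { int node_id; int count; } FreqEntry;

void record_request(FreqEntry *e, size_t n, int node_id)
{
    for (size_t i = 0; i < n; i++)
        if (e[i].node_id == node_id) { e[i].count++; return; }
}

void decay_tick(FreqEntry *e, size_t n)  /* periodic decrement */
{
    for (size_t i = 0; i < n; i++)
        if (e[i].count > 0) e[i].count--;
}

int pick_victim(const FreqEntry *e, size_t n)  /* entry to replace */
{
    size_t v = 0;
    for (size_t i = 1; i < n; i++)
        if (e[i].count < e[v].count) v = i;
    return e[v].node_id;
}
```

The decay step is what distinguishes this from a plain hit counter: holders that were once popular but have gone quiet eventually become eviction candidates.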
In the Osculant scheme, a node that updates its locally stored resources is required to
notify the original resource servers about the changes. Consistency of duplicated
resources will then be enforced, since all nodes have to verify the correctness of their
cached resources prior to the task execution phase. In case there is more than one
resource server, the notified resource server has to propagate the updates to the other
resource servers using various commit protocols [Chow 1996]. However, update
propagation in Osculant should be kept at the level of the resource servers.
Although it is possible to update all the replicated resources in the system, the cost of the
extra network traffic may be too high. Furthermore, most nodes in Osculant only cache
resources to reduce the resource transmission cost, rather than intending to become
replica servers. This means that the cached resources in nodes can be replaced.
5.5 Summary
In this chapter, we explored scheduling techniques based on the findings from
previous chapters. Many interesting results, as well as many potential problems, were
uncovered by these studies. For the jobpost distribution protocols, the multi-layer
jobpost protocol (MJP) was established and studied to distribute jobposts to the system in
a best-effort manner. Though it is clear that the jobpost coverage rate is proportional to the
optimal scheduling performance in a distributed system, the jobpost and bidding delays
will nevertheless grow to an unbearable level when the system is large. Moreover,
announcing jobs in a flat structure (for example, a system with only one steward) will
make the system more vulnerable to system faults. Experiments show that MJP is capable
of controlling the depth and width of each jobpost/bidding process. The results also show
that small jobpost constraints are sufficient to distribute jobposts throughout the system
with adequate coverage.
Since bidding strategies are at the heart of the Osculant scheduling scheme, we
developed and tested several different approaches in this chapter. Listed in order of
increasing complexity and aggressiveness, they were: first, the performance-based
method, in which bidding nodes calculate their bids based on the job completion time. It
was found that this method performs well in a lightly loaded system but badly in a
heavily loaded one. Clearly, without knowing the global system status, this method
cannot deliver good scheduling performance under certain conditions. Next, in
the energy-based method, we changed the bidding component to the active time of nodes.
Compared to the performance-based method, this method performed very well in heavily
loaded systems because nodes want to conserve their active time and emphasize their
network traffic load in their bids.
Realizing that single-bid methods did not perform well, we considered exploiting
the locality principle discussed in Chapter 4. The first step was to build a mechanism so that
local load information can be learned without broadcasting or polling. The
Dynamic Jobpost Model applied this principle by granting the intermediate-level
stewards the right to modify jobposts according to local information. With this
arrangement, the accuracy of jobposts and bids can be improved. Experimental results
show that the system throughput was increased by 27% over the performance-based
method.
In the Resource Contractor Bidding Model, we take further steps in exploiting the
locality principle: First, we used the dynamic jobpost model to collect local information in
the system. Then, by observing that resources are generally located in some "super" nodes
in a distributed system, we split the bidding process into two phases. In the resource-supply
bidding phase, the scheduling scheme designates a node to act as the resource
contractor that is responsible for collecting and supplying all required resources. Next,
the resource contractor initiates the task-processing bidding phase that finds the real
working nodes. With this model, we can further improve the system throughput
performance by 32% over the performance-based method.
Following the discussions in Chapter 2, another technique to improve network
performance is to move servers and clients so that they are close to each other. Task
scheduling is the typical way of relocating clients. In this chapter, we also exploited the
technique of relocating resources (and even servers) in the system to reduce network
traffic. Resources in the Osculant scheduling scheme can be relocated to other nodes based
on demand, which means resources can be duplicated and reused in nearby nodes.
However, with resource relocation, the coherence of resources must be maintained to
guarantee the correctness of job processing; hence, more network traffic will be
introduced. Furthermore, it becomes very difficult to define an analytical network traffic
model with resource relocation, which in turn means that network communication delay
will be more difficult to predict.
DEVELOPMENT OF OSCULANT SHELL
Osculant scheduling studies were conducted by simulations, and implemented using
a custom UNIX Osculant Shell. The Shell connects, monitors, bids, and distributes
MATLAB jobs and executable objects among a collection of Hewlett Packard and SUN
workstations. The studies also utilized a custom Osculant Job Profile Generator
implemented in MATLAB script and C language. These two software systems define the
Osculant experimental environment.
6.1 Structure and Implementation of Osculant Shell
Figure 6.1 shows the structure of Osculant Shell. The Osculant Shell consists of a
collection of modules interconnected by various communication channels. The modules
run simultaneously in the background and process information independently. The shell
is implemented in the C language using Berkeley sockets. The design of the Osculant
environment is based on the consideration of portability and compatibility among other
UNIX systems. The functions of modules are described in the following sections.
6.2 Osculant Job Profile Generator
The basic market-driven philosophy underlying Osculant is one of competitive
bidding. The key to successful bidding is to estimate accurately and efficiently the cost-
complexity of a posted job. This is the role of the Job Profile Generator (JPG). Job
profiles must be generated first from the information provided by the tasks that are
distributed among participating nodes. Upon receiving the job profiles, distributed
heterogeneous nodes calculate the completion cost of a task based on their interests,
ethos, biases, and capabilities. The steward assigns the job to the node with the best bid.
The quality of the JPG directly affects the bid accuracy and the scheduling performance.
6.2.1 Design Principle and Algorithms
The JPG is responsible for creating job profiles that act as a unidirectional bridge
between tasks and the distributed heterogeneous computing system, announcing the tasks'
presence to it. Job profiles of announced tasks are made available to the system prior to the
job execution phase; in the case of Osculant, profiles are created prior to the bidding
phase. The JPG uses the source code or batch files of jobs to extract job profiles that contain
(1) information that can be used to estimate work loads of jobs,
(2) architecture specifications that are required to complete the jobs, and
(3) constraints needed to complete the assigned jobs.
Though profiling techniques, which are used to extract timing characteristics of computer
programs, have been extensively studied and developed, relatively few studies have been
done with the goal and approach described above. Park [Park 1989, 1993] and Shaw
[Shaw 1989] developed techniques to predict the deterministic timing behavior of
programs written in high-level languages, using analytical models to evaluate the lower
and upper bounds of program execution time. In the JPG, we concentrate our effort on
extracting function statistics, which includes the frequency and granularity of function
calls in a job, and resource requirements (instead of timing characteristics in previous
studies) from the jobs because they can be better utilized in the target computing
environment. That is, since the configuration and status of nodes can vary dramatically in
a heterogeneous computing system, the information stored in job profiles should allow
nodes to estimate the job completion cost according to their setup and status.
Figure 6.2 Structure of the Job Profile Generator. The filter process separates operands
and operators. The SAM Generator process performs branch coding, variable
extraction, and function name extraction. The Variable Information Table (VIT),
which is a cross-reference table between the variables and functions, is generated by
the VIT Generator. In the Job Profile Generator, the Variable Back Tracing engine
estimates the value of and the size of variables.
Several support mechanisms were studied and built in the Osculant scheduler so that
participating nodes can utilize them to construct job profiles and calculate the cost of job
completion during bidding. The JPG support mechanisms are described as follows:
Function Result Type Tables (FRTT)
FRTT stores information regarding system functions. In some cases, FRTT entries
are functions that have been previously profiled. Functions listed in the FRTT will be
[Figure 6.2 diagram: 1. Filter (removes comments and blank spaces); 2. SAM Generator (branch coding, variable/function name extraction, causality tracing); VIT Generator (Variable Information Table Generator, Unknown Variable Finder, Variable Back Trace Deadlock Eliminator); Profile Generator.]
processed faster than from their source code once they are properly characterized. The
FRTT can be modified during run-time to improve the estimation accuracy and execution
efficiency. FRTT contains two libraries: (1) the Function Result Size Table (FRST),
which is used to estimate variable sizes (for matrix) and values (for scalar variables), and
(2) Osculant Function Profile (OFP), which is used to estimate function execution time.
An example is provided in Appendix B.
It is found that the computation loads for many classes of functions are not linear
with respect to the size of input variables. The estimation of the computation load at the
bidding nodes will, in these cases, need to be synthesized with some care. A possible
solution is to implement a multi-mode, or high resolution Osculant Function Profiler
(OFP) based on statistics of input variables for all function calls. It is impractical to send
detailed variable size information in the job profiles since this would consume too much
network bandwidth. For the designed JPG, the simplest measure of granularity was used,
where the granularity of an individual call is represented by a single number (i.e., a
granularity index). The results obtained to date are promising, with small approximation
errors.
Other Supporting Mechanisms
Several supporting mechanisms have been implemented to improve estimation
accuracy and execution efficiency. For example, the Location Information Adjustment
(LIA) procedure uses a normalized weighted sum, according to proximity to the target
variable, to determine the results of variables. The Variable Dimension Adjustment
procedure utilizes the information extracted by the pre-filtering process and modifies
(reduces, in most cases) the estimated variable size. The Inline Scalar Evaluation
procedure evaluates the values of scalar variables. Finally, a JPG caching technique was
developed to accelerate the job profile generation process. Detailed information on these
procedures can be found in the study by Wu and Taylor [Wu 1997].
The heart of the profile generator is the Variable Back Tracing (VBT) engine. The
VBT recursively traces the variables referenced by the target variable. The stop
conditions of recursions are input parameters and explicitly defined variables in the
program. The Function Result Type Tables, Location Information Adjustment, and Inline
Scalar Evaluation procedures work cooperatively to find the final estimate of a target
variable. The next step is to categorize the functions so that they can be traced properly.
The first stage of classification determines the types of functions based on the information
stored in the FRTT; functions not in the FRTT are considered custom functions. The
second stage of classification identifies the types of operators and traces input
parameters. The memory occupancy of tasks is determined at this stage. The structure of
the JPG is shown in Figure 6.2.
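The recursive back trace with its stop conditions can be sketched as follows, under the simplifying assumption that each variable references at most one source variable (real expression graphs branch, and the real VBT also resolves operators and function results via the FRTT).

```c
/* Variable Back Tracing (VBT) sketch: recursion stops at explicitly
   defined variables and input parameters, as in the VBT engine; the
   one-source-per-variable structure is an illustrative simplification. */
#include <stddef.h>

typedef struct {
    int known;   /* explicitly defined or an input parameter */
    int size;    /* estimated size, valid when known         */
    int source;  /* index of the referenced variable, -1 if none */
} Var;

int vbt_size(const Var *vars, int idx)
{
    if (vars[idx].known) return vars[idx].size;  /* stop condition   */
    if (vars[idx].source < 0) return -1;         /* unresolvable     */
    return vbt_size(vars, vars[idx].source);     /* back-trace chain */
}
```

In the full engine, the Deadlock Eliminator prevents such a trace from looping on cyclic references; this sketch assumes acyclic chains.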
Figure 6.3 shows an example of the job profile for a test sample. Most of the job
profiles are sparse and in integer format. Therefore, they can be compressed and
efficiently transmitted. The resulting job profile not only provides essential information
of tasks but also is easily computable. The current version of the Osculant Profile
Generator is implemented in C language and MATLAB 4.0.
Experiments show that the error rate varies with respect to the actual computation
load. The normalized error rate, which is found by normalizing errors with respect to real
computation load of jobs, gives a better indication for the overall quality of the Osculant
JPG. The normalized error rate for estimated computation load, which is the computation
[Figure: function statistics and job profile plots for detmax.m]
Figure 6.3 Example of a job profile from the Osculant Job Profile Generator.
load estimated by the Osculant JPG, is 10.78%; for the bidding estimates, which are the
computation loads estimated by the bidding nodes, it is 45.46%. Note that these estimates
are made without actually executing the programs. The error rates are considered
acceptable for the purpose of preliminary, or first-time, bidding.
6.2.2 Osculant Job Profile Retrospective
In an Osculant system, generating job profiles is the first step in the execution of a
job. Inaccurate job profiles will normally neither seriously degrade the overall
system performance nor produce incorrect results. Inaccuracy can be corrected or
compensated for in later stages of job processing. For example, the bidding
[Figure 6.4 diagram: job profiles flow from the steward's jobpost to the work nodes; error feedback paths return to the profile's origin from each stage — profile generation failure (source error), resource shortage (FRST misaligned), resource unavailability (source error), missed deadlines (OFP misaligned), bid adjustment for error-prone nodes, and execution errors (node domain and function category errors).]
Figure 6.4 The Osculant Job Profile Generator within the Osculant scheme. The accuracy
and quality of job profiles are fed back to the origin of the job profile from different
stages of job execution.
process itself can correct errors in job profiles. Furthermore, the Osculant scheme
contains closed-loop feedback that can correct errors as well. Figure 6.4 shows the life
cycle of jobs in the Osculant scheme from the viewpoint of job profiles. Function
category faults in generating profiles, for example, result in job rejections before job
execution, while function profile inaccuracy results in degraded performance. These
concerns can easily be corrected by adjusting the FRTT. A very sophisticated profile
generator design, however, may not be worth the tradeoff in decreased generality.
6.3 File Transfer Unit
The File Transfer Unit (FTU) transfers data among nodes. In Osculant, data and job
files can be distributed across several nodes, and transmissions can occur at any time.
Performing multiple file transmission sessions improves the performance and
efficiency of communications. The FTU therefore generates an individual process for each file transfer.
File transmissions can be either active or passive, depending on which side initiates
the transfer. In the active file transmission mode, the assigned node sets up the
communication channels to the resource holders (e.g., the holders of data and job files)
and "retrieves" the required data. This model is simple and fast, but is considered
impractical and insecure. In the passive mode, by contrast, a node that receives a job
assignment sends a request to the resource holders, and the resource holders then
initiate the transmission; the data is "sent" to the assigned node. The passive model
requires one extra message-passing stage but is more secure and practical than the
active model, because the resource holders are usually the file servers of a system.
To operate the Osculant Scheduler correctly and effectively, messages and data must
be handled in a proper order. An FTU module has two message-receiving and two
message-transmission queues. One pair of receiving/transmission queues, which uses a
first-in-first-out policy, is dedicated to the transmission of data and job files. The
other pair consists of priority queues. The message priorities are:
(1) Status Request (highest priority)
(2) Status Reply
(3) Job Completion
(4) Job Deletion
(5) Job Assignment Acknowledgement
(6) Job Assignment
(9) User Input (lowest priority)
6.4 Steward Process
In Osculant, the steward performs the following functions: job auction, status
checking, fault handling, and node training. Because processors perform their bidding
autonomously, and the steward has full authority to award job assignments to any
working node based on the bids, some top-down capabilities can be implemented in the
Osculant scheduler to achieve goals such as fast job assignment or improved
overall scheduling performance.
Job auction processes are discussed in Chapter 5. The steward also performs status
checking for the system; normally, the steward is responsible for checking the status of
its child nodes. Faults are handled by redoing the jobpost/bidding process among active
neighboring processors. The Osculant scheduler also provides top-down control over the
system in the training and monitoring of working nodes. The steward can log and evaluate
the bidding performance of its child nodes. Top-down training can be accomplished by
duplicating job assignments and sending them both to the node with the best bid and to
the nodes that need further training. Nodes in training therefore have more opportunities
to calibrate their bidding components.
An additional feature of the steward is its ability to perform reliable job processing. To
perform reliable computing on an unreliable distributed system, a steward node can
duplicate job assignments and send them to working nodes that share no resources.
Failures caused by the breakdown of any single resource can therefore be greatly reduced.
Compared to other fault-tolerance schemes, this method is the simplest to implement.
It also provides a higher level of guarantee when processing critical tasks in a real-time environment.
6.5 Other Modules
The following modules are planned or partially implemented in the current version of
the Osculant Shell.
Configuration Unit
The Configuration Unit (CU) dynamically changes the system configuration based on the
system status or user demands. It also provides a transparent view of the system to the
user. The main functions of the CU include:
* Connect/disconnect nodes: Disconnect failed nodes in order to reduce bidding delays,
and re-connect recovered nodes to the system.
* Add/delete nodes: Accept new nodes into the system, or remove nodes from participation.
* Calibrate system parameters.
Load Monitoring Unit
The Load Monitoring Unit (LMU) provides local node information for the purpose
of bid generation. The LMU adjusts the local node performance parameters to adapt to the
local load requirements set by the local owners, as if these nodes were privately owned.
The main functions of the LMU are:
* A modified "ping" function, which probes the network status.
* An idle-node hunter that estimates the computation resources of a subsystem.
* Algorithmic estimates of communication channel bandwidth and throughput.
Osculant Function Profile Modification Unit
As described in Section 6.2, the Osculant Function Profile (OFP) stores information
about system functions and provides the basis for calculating bids. This module modifies
the contents of the OFP in order to improve the bidding accuracy after jobs complete.
The computation engine of the current version of the Osculant Shell is MATLAB. The
cooperation between the Osculant Shell and MATLAB is established using the MATLAB
External Interfaces [MATLAB 1993]. The External Interface Library is MATLAB's
application programming interface (API) and is called from C code within the Osculant
Shell. The interface routines create two pipes through which data and commands are
transmitted between the shell and the MATLAB engine.