Designing a Distributed Queue
Theodore Johnson
Dept. of Computer and Information Science
University of Florida
Abstract
A common paradigm for distributed computing is the producer-consumer model. One set of processes
produces objects that are consumed by another set of processes. These objects might be data, resources,
or tasks. We present a simple algorithm for implementing a distributed queue. This algorithm has
several parameters that need to be tuned, such as the number of probes to find an object, the amount
of buffering, and the connectivity between the producers and the consumers. We provide an analytical
model that predicts performance, and based on the analytical model we provide recommendations for
setting the parameters. Our analytical model is validated by a comparison to simulation results.
Keywords: Distributed Queue, Distributed Data Structure, Scheduling, Performance Analysis, Scientific
Computing.
1 Introduction
A common paradigm for distributed computing is the producer-consumer model [2, 1, 5]. The Linda parallel
programming language [4] has constructs that allow the easy implementation of shared queues. Other
examples include Marionette [13] and Workcrews [16]. In addition, producer-consumer relations are often
used in parallel scheduling algorithms [6, 7].
The primary motivation for this work is the UFMulti project [3, 14], a distributed processing system
for High Energy Physics. HEP research requires the processing of billions of experimental observations (or
events) to find the events that contain interesting information [12]. For example, the recent discovery of the
sixth quark at Fermilab required the processing of several billion events to find the twelve instances when
the sixth quark was definitely observed.
HEP processing is easily parallelizable because each event can be processed independently. The processing
typically consists of several stages, with each stage written by a different specialist. For example, PASS2
processing consists of an event reconstruction stage, in which particle tracks are reconstructed from sensor
information, followed by a test to determine if the event is of interest to later processing. So, a typical
HEP job consists of a set of processes that read raw events from tape, a set of processes that perform event
reconstruction, a set of processes that test the reconstructed events, and a set of processes that write the
reconstructed events that pass the test to another tape. Such a scenario is shown in Figure 1. Each event
[Figure: three stages of processes connected by NetQueue 1 and NetQueue 2, with input events entering stage 1 and output leaving stage 3.]
Figure 1: An UFMulti application divided into stages glued together by NetQueues
group is connected by a queue abstraction, which we call a NetQueue in the UFMulti system. We are using
the research reported in this paper to implement a fully distributed NetQueue.
Some work has been done to implement distributed queues [4, 13, 16, 6, 7, 10]. Manber [11] proposed
concurrent pools, a shared memory equivalent of a distributed queue. Kotz and Ellis [9] evaluated the
performance of concurrent pools. The idea of a concurrent pool or a distributed queue is
related to techniques for emulating shared memory with a message passing system [15, 8].
In this paper, we present a simple stochastic distributed queue algorithm. This algorithm has features
in common with previously proposed algorithms. The contribution of this work is to develop a validated
performance model of the stochastic distributed queue algorithm, and to investigate the best parameter
settings.
2 The Distributed Queue Algorithm
There are two sets of processes, the producers who generate objects, and the consumers who use and destroy
the objects. A producer repeatedly executes a program that creates a new object, then tries to insert the
object into a shared buffer. If there is no room in the shared buffer, the producer blocks until room becomes
available. A consumer repeatedly requests an object from the shared buffer, and blocks until an object is
delivered. The consumer then processes the object, and repeats its request after finishing the processing.
We assume in this paper that the producers and consumers are disjoint, and execute on disjoint processors.
This assumption allows us to neglect consideration of the obvious localization optimization (i.e., a consumer
always first checks a local producer) when performing the analysis, permitting a cleaner analysis of the
algorithm for accessing nonlocal data. We note that the assumption is likely to be valid in many scenarios
(including UFMulti).
An obvious algorithm for implementing the queue is to have a single process store all produced but
unconsumed items (i.e., the centralized buffer algorithm). This approach has the drawback of requiring a
single process to perform a great deal of work (in receiving and transferring all objects), and to maintain a
great deal of storage. So, the centralized buffer solution is not scalable. A more subtle problem is that object
descriptions can be quite large, up to several megabytes in UFMulti applications. The centralized buffer
solution requires that each object be transferred twice, wasting network bandwidth. A better approach
is to distribute the queue, and have each producer store the objects that it produced but which are still
unconsumed. With this approach the buffer storage and management is spread among all producers, and
objects are transferred only once.
The main complication in developing a distributed queue algorithm is connecting a consumer who requests
a new object with a producer who has an object to give. If a consumer has an accurate count of the number
of data items available at each producer, the consumer can obtain the object from the producer with an
exchange of only two messages (the request and the reply). However, distributing this information requires
the exchange of many messages.
One option for distributing the lengths of producer queues is to have each producer multicast to all
consumers a description of every enqueue and dequeue event. Such an approach has the advantage of being
fully distributed, but requires the exchange of many messages per object. Another approach is to have a
single process that acts as a queue manager (i.e., the queue manager algorithm). A producer informs the
queue manager of every item produced, and a consumer queries the queue manager for a producer with an
unallocated object. The queue manager algorithm is better than the centralized buffer algorithm because
most of the work in maintaining the distributed queue is distributed among the producers, and objects are
transferred only once. However, there is still a large burden placed on a single process. In addition, the
algorithm imposes overhead of three control messages (the message from the producer to the queue manager,
the request from and reply to the consumer) in addition to the two messages required to perform the transfer
(the request to the producer and the reply).
A different approach is to use a fully distributed and stochastically balanced algorithm. Producers
maintain their own queues, and have no direct communication with each other. A consumer finds an object
by repeatedly probing producers until it finds a producer with an unallocated object, or decides to block
at the producer until an object becomes available. We call this algorithm the stochastic distributed queue
algorithm.
The stochastic distributed queue algorithm has a simple description, but many parameters which can
affect performance. The main goal of this paper is to examine the effect of the parameters on the performance
of the algorithm. To clarify the discussion, we present the stochastic distributed queue algorithm in
pseudocode.
We start with a description of the parameters:
N Number of producers.
M Number of consumers.
buffers[j] : The number of buffers at producer j.
max_hops : The number of probes a consumer makes before blocking.
p_access_i[1..N] : p_access_i[j] is the probability that consumer i will choose
producer j on a probe.
We use the following conventions to specify the message passing synchronization. When a process sends a
message, it executes the following line of code to send a message of type action to destination with parameters
parameters:
send(destination,action; parameters)
When a process waits for an event, it can wait for one of a number of event types to occur. These events
can be the reception of a message of a particular type (specified by the action), or an internal event. If more
than one event is possible, the code to handle each event is specified along with the parameters passed for
the event:
wait for event A1, A2, ..., An
A1 (parameters):
code to handle A1
...
An (parameters):
code to handle An
The protocol at the consumer is to pick a random producer, send a probe to that producer, then wait for
the object to be returned (possibly from a different producer). The function random uses the distribution
specified by its parameter.
consumer(self)
while TRUE
probe = random(p_access_self)
hops = 1
send(probe, REQUEST; self, hops)
wait for a REPLY message from a producer
REPLY (object):
consume(object)
We specify the queue management portion of the producer, and we assume that the code that actually
produces objects executes in a separate thread. When the production thread creates a new object, it notifies
the producer thread and blocks until specifically unblocked by the producer thread. The protocol maintains
two data structures: buffer, to store produced but unconsumed objects, and blocked_process, to store the
identities of blocked consumers. The protocol at a producer is:
producer(self)
initialize the buffer and the blocked_process queue
unblock the production thread
while TRUE
wait for a REQUEST message, or for an OBJECT to be produced
REQUEST (consumer, hops):
if the buffer is not empty,
obtain an object from the buffer
send(consumer, REPLY; object)
if the buffer was full,
unblock the production thread
else
if hops < max_hops
probe = random(p_access_consumer)
send(probe, REQUEST; consumer, hops+1)
else
put consumer in the blocked_process queue
OBJECT (object):
block the production thread
if the blocked_process queue is not empty,
get consumer from the blocked_process queue
send(consumer, REPLY; object)
unblock the production thread
else
put object in the buffer
if the buffer is not full
unblock the production thread
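As a concrete illustration, the probe-forwarding logic above can be sketched in Python. This is only a sketch under simplifying assumptions: synchronous method calls stand in for message passing, a `deliveries` dictionary stands in for REPLY messages, and names such as `Producer`, `handle_request`, and `handle_object` are ours, not part of UFMulti.

```python
import random

class Producer:
    """Sketch of the producer-side protocol (hypothetical names throughout)."""

    def __init__(self, pid, capacity, network, max_hops):
        self.pid = pid
        self.buffer = []            # produced but unconsumed objects
        self.blocked = []           # consumers blocked at this producer
        self.capacity = capacity    # buffers[pid]
        self.network = network      # pid -> Producer; stands in for send()
        self.max_hops = max_hops
        self.production_blocked = False

    def handle_request(self, consumer, hops, deliveries):
        """REQUEST arm of the pseudocode."""
        if self.buffer:
            was_full = len(self.buffer) == self.capacity
            deliveries[consumer] = self.buffer.pop(0)   # REPLY to the consumer
            if was_full:
                self.production_blocked = False         # unblock production
        elif hops < self.max_hops:
            # Forward the probe to another (uniformly chosen) producer.
            nxt = random.choice([p for p in self.network if p != self.pid])
            self.network[nxt].handle_request(consumer, hops + 1, deliveries)
        else:
            self.blocked.append(consumer)               # consumer blocks here

    def handle_object(self, obj, deliveries):
        """OBJECT arm of the pseudocode."""
        if self.blocked:
            deliveries[self.blocked.pop(0)] = obj       # hand to a waiter
            self.production_blocked = False
        else:
            self.buffer.append(obj)
            self.production_blocked = len(self.buffer) == self.capacity
```

With two producers and max_hops = 2, a probe to an empty producer is forwarded once, and a probe that has used its last hop blocks at the empty producer until the next object is produced there.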
The parameters of the stochastic distributed queue represent a wide variety of algorithms, with different
resource demands and performance characteristics. The number of producers and consumers, N and M, is
defined by the computation to be performed, and is in general not tunable (i.e, is not a parameter available
to the algorithm tuner). The most important parameter is max_hops, the maximum number of probes of
producers that a consumer will make before blocking. Raising max_hops improves throughput at the cost of
additional message passing overhead. More subtle is the effect of buffers_j: increasing the number of buffers
at a producer improves efficiency, but increases the memory overhead of executing the protocol. Finally,
p_access_j defines the producers that consumer j will probe, and at what rate. The probability of probing a
producer should be proportional to its production rate. In addition, setting p_access_j[i] to zero means that
consumer j never probes producer i. Therefore, i and j do not need to maintain the overhead to support
potential communication. Reliable communication channels (such as BSD sockets) are often scarce resources,
and the ability to limit the communication patterns is essential for a scalable implementation. Also, limiting
the number of consumers that can probe a producer limits the size of the blocked_process queue.
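For example, one consumer's p_access row, probing only a limited set of producers in proportion to their production rates, might be built as follows. This is a sketch; `make_p_access` is our name, not part of the algorithm as stated.

```python
def make_p_access(rates, allowed):
    """Build one consumer's probe distribution: probability proportional to
    the production rate for producers in `allowed`, zero for all others."""
    total = sum(rates[j] for j in allowed)
    return [rates[j] / total if j in allowed else 0.0
            for j in range(len(rates))]

# Four producers; this consumer may probe only producers 1 and 3.
row = make_p_access([1.0, 2.0, 3.0, 4.0], {1, 3})
```

The zero entries mean no communication channel (and no blocked_process bookkeeping) is ever needed between that consumer and those producers.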
It is not immediately obvious how to set the parameters of the stochastic distributed queue algorithm.
To permit a logical design, we provide a simple analytical performance model, and execute a performance
study.
3 The Simulator
We wrote a simulation to validate the analytical models that we develop. Since the simulator is uniform
throughout the study, but is not the focus of the study, we discuss the simulator here.
The simulator accepts as parameters the number of producers, N, the number of consumers, M, the sizes
of the producer buffers, buffers_i, and max_hops. In addition, we specify the expected time to send a message,
r, the expected time to process a message, d, the expected time to produce an object at producer i, 1/λ_i,
and the expected time to consume an object at consumer j, 1/μ_j. Each of these times is sampled from an
exponential distribution.
For each experiment, we execute the simulation until 1,000,000 objects are consumed. The 95% confidence
intervals are within 2%.
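The exponentially distributed delays used by the simulator can be sampled with Python's standard library; the sketch below (our code, with an illustrative rate) checks that the sample mean of the inter-production times is close to 1/λ_i.

```python
import random

random.seed(7)
lam = 0.01                     # production rate λ_i; mean time 1/λ_i = 100 ticks
samples = [random.expovariate(lam) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should be close to 100 ticks
```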
4 The Analytical Model
We use the following parameters in the analysis:
N : Number of producers.
M : Number of consumers.
λ_i : The production rate at producer i. More precisely, the time to produce an item has mean 1/λ_i.
μ_j : The consumption rate at consumer j. More precisely, the time to consume an item has mean 1/μ_j.
h_max : The maximum number of hops a request makes before blocking.
f_i : The maximum capacity of the object queue at producer i (f_i = buffers_i).
r : The average message transit time.
p_access_j[1..N] : The probability that a request from consumer j is sent to a given producer.
In addition, we will use the following variables:
h_avg : Average number of hops that a request makes.
B : Average amount of time that a consumer's request is blocked.
p_c : Actual arrival rate of requests to a producer.
p_mt : Probability that the producer has no unallocated objects.
p_b : Probability that a consumer's request will block if it finds the producer to be empty.
p(s) : Probability that a producer is in state s.
B_c : Expected time a request is blocked, given that it blocks.
We will develop a series of models of increasing complexity. The first two models will assume that
p_access_i = p_access_j for every pair of consumers i and j. The third model will relax this assumption. The
first model will assume that every producer produces at the same rate. The subsequent two models will
relax this assumption.
4.1 Homogeneous Producers
To build the first analytical model, we will assume that all producers have the same production rates. That
is, λ_i = λ for every producer i = 1 ... N. In addition, we assume that p_access_i = p_access_j. Because
the consumers make requests in the same proportion to all of the producers, there is no need to distinguish
between consumers. Instead, we assume that all consumers have the same consumption rate μ.
The actual rate at which a consumer issues requests is somewhat less than μ, because of the overhead
of obtaining objects. In particular, a request must make h_avg hops, each of which requires r seconds to be
processed. In addition, returning the object requires another hop, costing r seconds. If a request makes
h_max hops, it will block at the producer for an average of B seconds. The total rate at which consumers
issue requests is multiplied by the average number of probes, h_avg. If p_c is the actual rate at which requests
arrive at a producer, then we can calculate:

p_c = (M/N) h_avg / (1/μ + (h_avg + 1)r + B)    (1)
To make the analysis feasible, we assume that the time to produce an item is exponentially distributed,
and that consumer requests arrive in a Poisson process. Since consumer requests are randomly generated
from a large population, the Poisson assumption introduces no serious error. The actual distribution of the
production time might have an impact on the actual performance. We investigate the sensitivity of our
model to the interproduction time distribution in the next section, and find that the model remains accurate.
The state of a producer can be modeled as a Markov chain. Since all producers have the same production
rate and receive the same rate of requests, we need only model one of the producers, and it will represent
all producers. The state of a producer is represented by an integer s, where −M ≤ s ≤ f. We define
p(s) to be the probability that the producer is in state s. If s < 0, then there are −s requests blocked at the
producer. If s > 0, then the producer has s produced but as yet unallocated objects in its queue. If s = 0,
there are neither blocked requests nor unconsumed items at the producer.
In every state s < f the arrival rate of newly produced items is λ. If s > 0, then every arriving request
decrements the number of unconsumed objects. Therefore the balance equations for states 0 through f are:

λ p(s) = p_c p(s + 1)    0 ≤ s < f    (2)

If s < 0, then a request will block only if it has made its maximum number of hops. In addition, the fact
that s < 0 tells us that there are −s blocked consumers, so the arrival rate of requests is (M + s)p_c/M. We
define p_b to be the probability that a request blocks if it finds the producer queue empty. Then, the balance
equations for states −M through 0 are:

λ p(s) = ((M + s + 1)/M) p_b p_c p(s + 1)    −M ≤ s < 0    (3)
The state dependent request arrival rates make the model computationally expensive to solve. We can
observe that, in the usual case, there are only a few blocked consumers at any producer. If M is large, then
the state dependent arrival rate buys little in terms of accuracy. We will therefore make the following
approximation:

λ p(s) = p_b p_c p(s + 1)    −M ≤ s < 0    (4)

This approximation makes the model solution very fast. However, the approximation is not stable if
there are many more consumers than producers and p_b is large. We will examine performance in these cases
by using simulation. By combining the systems of equations 2 and 4 together with the requirement that the
state occupancy probabilities sum to 1, we find the solution:

p(s) = (λ/p_c)^s p(0)    s > 0
p(s) = (p_b p_c/λ)^(−s) p(0)    s < 0
1/p(0) = (1 − (λ/p_c)^(f+1))/(1 − λ/p_c) + (p_b p_c/λ)(1 − (p_b p_c/λ)^M)/(1 − p_b p_c/λ)    (5)

Note that if λ = p_c, or λ = p_b p_c, then we need to make an exception to handle the degenerate cases.
Having found the equations of state in terms of p_b, we must next solve for p_b. A request will block if, after
probing h_max − 1 producers which it found to be empty, it arrives at a producer which is empty. Let p_mt be
the probability that a request finds a producer empty. Then:

p_mt = Σ_{s=−M}^{0} p(s) = p(0) (1 + (p_b p_c/λ)(1 − (p_b p_c/λ)^M)/(1 − p_b p_c/λ))    (6)
Each consumer request generates h_avg requests at the producers. Every time a request arrives at an
empty producer, another request is generated, up to h_max requests. Therefore,

h_avg = Σ_{i=1}^{h_max} p_mt^(i−1) = (1 − p_mt^h_max)/(1 − p_mt)    (7)

Of the request stream, only those messages that are on probe h_max will block. Therefore,

p_b = p_mt^(h_max − 1) / h_avg    (8)
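Formulae 7 and 8 can be checked directly: the sum form of h_avg agrees with its closed form, and a Monte Carlo run of the capped probing process (each probe independently finding an empty producer with probability p_mt) reproduces both quantities. The values below are illustrative, not taken from the experiments.

```python
import random

p_mt, h_max = 0.6, 5                             # illustrative values
h_avg = (1 - p_mt**h_max) / (1 - p_mt)           # formula (7), closed form
assert abs(h_avg - sum(p_mt**(i - 1) for i in range(1, h_max + 1))) < 1e-12
p_b = p_mt**(h_max - 1) / h_avg                  # formula (8)

# Monte Carlo: probe until a nonempty producer is found or h_max probes are used.
random.seed(1)
trials, probes, blocked = 200_000, 0, 0
for _ in range(trials):
    for hop in range(1, h_max + 1):
        probes += 1
        if random.random() >= p_mt:              # found a nonempty producer
            break
    else:
        blocked += 1                             # all h_max probes found empty
# probes / trials estimates h_avg; among probes that found an empty producer,
# blocked / (probes - (trials - blocked)) estimates p_b.
```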
We next calculate the average time that a request spends blocked. Suppose that a request arrives and is
blocked. Then, if the producer is in state s when the request arrives, the request will need to wait for −s + 1
items to be produced before it can be unblocked (because there are −s blocked consumers in line ahead of
it). Let B_c be the time that a request spends blocked, given that it blocks. Then:

B_c = (1/λ) Σ_{s=−M}^{0} (−s + 1) p(s) / p_mt

(the sum can be evaluated in closed form using geometric series). A request blocks only if all h_max producers
that it probes are empty, which occurs with probability p_mt^h_max. Therefore:

B = p_mt^h_max B_c    (9)
We have now defined enough equations to solve the system. An explicit solution is infeasible, but iteration
on p_c works well. After the system of equations is solved, we can calculate the following performance
measures:
The average number of probes per request: h_avg.
The average time a consumer waits for an object: W = (h_avg + 1)r + p_mt^h_max B_c.
The utilization of the producers: U_p = 1 − p(f) = 1 − (λ/p_c)^f p(0).
The utilization of the consumers: U_c = (1/μ)/(1/μ + W).
The throughput: T = M U_c μ = N U_p λ.
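The iteration on p_c can be sketched as follows. This is our illustrative implementation of equations 1 through 9 for the homogeneous model, not the authors' code; we normalize the state probabilities numerically rather than through formula 5, and add damping for stability.

```python
def solve_homogeneous(N, M, lam, mu, r, f, h_max, iters=500):
    """Fixed-point iteration for the homogeneous model (a sketch)."""
    p_c, p_b = M * mu / N, 1.0                   # initial guesses
    for _ in range(iters):
        # Unnormalized state occupancies for s = -M .. f (eqs. 2, 4, 5).
        w = {s: (lam / p_c) ** s if s >= 0 else (p_b * p_c / lam) ** (-s)
             for s in range(-M, f + 1)}
        tot = sum(w.values())
        p = {s: v / tot for s, v in w.items()}   # probabilities sum to 1
        p_mt = sum(p[s] for s in range(-M, 1))                   # eq. (6)
        h_avg = (1 - p_mt ** h_max) / (1 - p_mt)                 # eq. (7)
        p_b = p_mt ** (h_max - 1) / h_avg                        # eq. (8)
        B_c = sum((-s + 1) * p[s] for s in range(-M, 1)) / (lam * p_mt)
        B = p_mt ** h_max * B_c                                  # eq. (9)
        new_p_c = (M / N) * h_avg / (1 / mu + (h_avg + 1) * r + B)  # eq. (1)
        p_c = 0.5 * p_c + 0.5 * new_p_c          # damped update
    W = (h_avg + 1) * r + p_mt ** h_max * B_c    # consumer waiting time
    return p, p_mt, h_avg, W
```

At the 50% load point of the experiments (N = 100, M = 50, μ = λ = 0.01, r = 1, f = 5, h_max = 5) the iteration settles quickly, with h_avg under 2 and W close to 2 ticks, in line with the charts that follow.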
4.1.1 Performance Comparison
We ran a set of experiments to validate the accuracy of the analytical model by comparing its results to the
simulation results. In addition, we investigate several aspects of the performance of the stochastic distributed
queue algorithm.
The performance of the stochastic distributed queue algorithm depends on the match between the total
production rate and the total consumption demand. We define the load on the producers to be l = Mμ/(Nλ).
A load of l = 50% means that the producers should be idle half of the time, and a load of l = 200% means
that the consumers should be idle half of the time.
In the experiments, we set μ = λ = 0.01. Sending a message requires 1 tick. There are 100 producers,
each of which has five local buffers (i.e., f_i = 5). We varied the number of consumers to vary the load.
In Figure 2, we plot the average number of probes that the algorithm makes as we increase the maximum
number of probes before blocking, for loads of 50%, 100%, 150%, and 200%. The points in the chart
are simulation points, while the lines are computed from the analytical model. The chart shows that the
analytical and simulation models are in close agreement. If the load is 100% or less, then only a few probes
are used on average (less than 2). If the load is greater than 100%, then h_avg grows in proportion to
h_max.
[Chart: average number of probes (y-axis, 0 to 8) vs. maximum number of probes (x-axis, 1 to 10), with analytical curves and simulation points for loads of 50%, 100%, 150%, and 200%.]
Figure 2: Average number of probes vs. maximum number of probes.
In Figure 3, we plot the average time that a consumer waits between issuing its request and receiving an
object to consume (i.e., W). The points on the chart are simulation points, while the lines are drawn from
the analytical model. Again, both models have good agreement. The line for the 200% load does not extend
to h_max = 1 because the model would not converge for this point. If the load is 100% or less, then the waiting
time quickly approaches 2 ticks. If the load is greater than 100%, the waiting time quickly approaches a
limiting value of approximately M/(Nλ) − 1/μ. In all cases, the waiting time is close to its asymptotic
value when h_max is 3 or greater. If the load is close to 100%, an additional performance improvement (of
approximately 2.8%) can be gained by setting h_max = 5. Doing so can result in a significant increase in
message passing overhead if the load is high. However, in the centralized queue manager algorithm, four
control messages are sent for every object consumed. Figure 2 shows that even in a high load case, setting
h_max to 5 means that h_avg < 4.
[Chart: consumer waiting time (y-axis, 0 to 160) vs. maximum number of probes (x-axis, 1 to 10), with analytical curves and simulation points for loads of 50%, 100%, 150%, and 200%.]
Figure 3: Waiting time vs. maximum number of probes.
The efficiency of the stochastic distributed queue algorithm can be seen by examining p(f). We plot p(f)
against the maximum number of probes in Figure 4. This chart contains data from the analytical model
only; the points in the graph serve only to help identify the curves. As was suggested by the plot of W
against h_max, most of the benefit of increasing h_max is achieved when h_max = 3.
In the previous three charts, we used 5 buffers at each producer. The number of buffers at each producer
has a significant effect on performance, because increasing the number of buffers reduces the chance that
a producer will block. In Figure 5, we plot the producer utilization against the number of buffers at each
producer. For these experiments, h_max = 5. Most of the performance benefit of increasing the number of
buffers is gained by using 5 buffers at each producer.
[Chart: fraction of time blocked (y-axis, 0 to 0.6) vs. maximum number of hops (x-axis, 1 to 10), with curves for loads of 50%, 100%, 150%, and 200%.]
Figure 4: p(f) vs. maximum number of probes.
Finally, we ran a simulation experiment to test how critical the assumption of an exponential distribution
for the production times is. We ran a simulation in which the time required to produce an object is
uniformly distributed in [50, 100]. We plot the analytical and simulation waiting times in Figure 6. The
analytical model produces qualitatively accurate results, but is somewhat pessimistic. The simulation waiting
times are lower because the variance in the production time is significantly lower in the simulation.
4.2 Nonhomogeneous Producers
The homogeneous producers model reveals much about the performance of the stochastic distributed queue
algorithm. However, the assumption that λ_i = λ for every producer i is often not realistic. For example, the
underlying computers might be heterogeneous. In this section, we will allow the producers to have different
production rates, but we will still require that p_access_i = p_access_j for every pair of consumers i and j.
We assume that the producers are partitioned into K types, and every producer in type k has the same
production rate λ_k. In addition, we assume that p_access[j] = p_access[j'] if producers j and j' are of the
same type.
Of the N producers, N_k are of type k. The proportion of the request stream received by all type k
producers is

frac_k = Σ_{i of type k} p_access[i]    (10)
[Chart: fraction of time blocked (y-axis, 0 to 0.6) vs. number of producer buffers (x-axis, 1 to 10), with curves for loads of 50%, 100%, 150%, and 200%.]
Figure 5: p(f) vs. the number of buffers at each producer.
The arrival rate of requests to a type k producer is:

p_c,k = M frac_k ρ / N_k

where ρ = h_avg/(1/μ + (h_avg + 1)r + B), the rate at which each consumer issues requests, is calculated
as in formula 1. The per-type quantities p_k(0) and p_mt,k are calculated using
the appropriate analogues of formulae 5 and 6. The probability of blocking p_b and the average number of
probes h_avg depend on whether a request finds a random producer empty. The formulae for p_b and h_avg are
the same as formulae 8 and 7, as long as we calculate the average probability of finding a producer empty,
p_mt, which is:

p_mt = Σ_{k=1}^{K} frac_k p_mt,k    (11)
The average time spent blocked at a producer of type k, given that a request blocks there, B_c,k, is
computed using the appropriate analogue of the formula in the previous section. The average
time that a request spends blocked, given that it blocks, is a weighted average over the B_c,k:

B_c = (1/p_mt) Σ_{k=1}^{K} frac_k p_mt,k B_c,k    (12)
By using these modifications, we again have enough equations to solve the system by using iteration.
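The per-type aggregation of equations 10 through 12 amounts to weighted sums; the sketch below uses illustrative numbers for two producer types (all names and values are ours). Dividing by p_mt in (12) makes the weights sum to 1, so B_c is a proper average.

```python
# Two producer types with illustrative per-type quantities.
frac   = [0.5, 0.5]    # fraction of the request stream per type, eq. (10)
p_mt_k = [0.2, 0.8]    # per-type probability of being empty
B_c_k  = [10.0, 30.0]  # per-type blocking time, given that a request blocks

p_mt = sum(f * p for f, p in zip(frac, p_mt_k))                      # eq. (11)
B_c = sum(f * p * b for f, p, b in zip(frac, p_mt_k, B_c_k)) / p_mt  # eq. (12)
```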
4.2.1 Performance Comparison
We divided the producers into two groups: fast and slow. In the first set of experiments, the fast producers
constitute 50% of the producers, and have twice the production rate of the slow producers (thus 67% of
[Chart: consumer waiting time (y-axis, 0 to 140) vs. maximum number of probes (x-axis, 1 to 10), with analytical curves and simulation points for loads of 50%, 100%, and 200%.]
Figure 6: Waiting time vs. h_max. Object production times are uniformly randomly generated in the
simulator.
the objects are produced by 50% of the producers). In the second set of experiments, the fast producers
constitute 10% of the producers, and are 9 times as fast as the slow producers (thus 50% of the objects are
produced by 10% of the producers).
In Figure 7, we plot the waiting time W against hmax for both sets of experiments. The points on
the chart are simulation results, so this plot also serves to validate the analytical model. In this chart,
p_access[i] = p_access[j] for every pair of producers i and j (i.e., every producer receives the same request
rate). In the moderately unbalanced (67/50) experiment, most of the performance benefit of increasing the
maximum number of probes is achieved with hmax = 3. For the highly unbalanced (50/10) experiment,
even a maximum of 10 probes does not achieve good performance. Since the fast producers make up only
10% of the producer population, there is a 35% chance that a consumer will not find a fast producer after
10 probes. In both cases, the simulation results are close to the analytical results, although the analytical
model is somewhat pessimistic in the highly unbalanced case. The analytical model did not converge for
hmax = 1, so we present only simulation results for that case.
In Figure 8, we plot the waiting time of a consumer against the fraction of requests directed to the fast
producers. In this chart, h_max = 3. As is expected, the waiting times are lowest when the fraction of
consumer requests that are directed to the fast producers is proportional to the fraction of objects they
produce.
[Chart: consumer waiting time (y-axis, 0 to 160) vs. maximum number of probes (x-axis, 1 to 10), with analytical curves and simulation points for the 67/50 and 50/10 experiments.]
Figure 7: Waiting time vs. h_max. Fast and slow producers are equally likely to be probed.
4.3 Nonhomogeneous Requests
In this section, we remove the restriction that p_access is uniform among all consumers. Unfortunately, this
removes the simplifying assumption that all consumers are identical and can be treated as a group. So, we
will need to define many of the variables on a per-consumer basis.
We define B_i to be the average time that consumer i's request spends blocked, and h_avg,i to be the
average number of hops made by consumer i. Then the effective request rate issued by consumer i is

p_eff,i = h_avg,i / (1/μ_i + (h_avg,i + 1)r + B_i)    (13)
The arrival rate of requests at producer j is

p_c,j = Σ_{i=1}^{M} p_access_i[j] p_eff,i    (14)
Given the production rate at producer j, λ_j, the arrival rate of requests p_c,j, and the yet to be calculated
probability of blocking p_b,j, we can calculate the state occupancy probabilities p_j(s) and the probability that
the producer is empty, p_mt,j. Let M_j be the number of consumers that can send a request to producer j.
That is, M_j = |{i | p_access_i[j] ≠ 0}|. If M_j is small, then blocking a consumer has a large effect on the
request arrival rate, so p_j(s) should be calculated based on equation 3 instead of equation 4.
Given p_mt,j we can compute the probability that consumer i's probe finds a producer empty:

p_mt(i) = Σ_{j=1}^{N} p_access_i[j] p_mt,j    (15)
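Equations 14 and 15 are simply products against the p_access matrix; the small example below (our names and illustrative values, with M = 2 consumers and N = 2 producers) makes the index conventions concrete.

```python
# p_access[i][j]: probability that consumer i probes producer j.
p_access = [[1.0, 0.0],
            [0.5, 0.5]]
p_eff  = [0.008, 0.010]   # effective request rates per consumer, eq. (13)
p_mt_j = [0.3, 0.6]       # per-producer empty probabilities

# eq. (14): request arrival rate at each producer j
p_c = [sum(p_access[i][j] * p_eff[i] for i in range(2)) for j in range(2)]
# eq. (15): probability that consumer i's probe finds a producer empty
p_mt_i = [sum(p_access[i][j] * p_mt_j[j] for j in range(2)) for i in range(2)]
```

Note that consumer 0 never probes producer 1, so producer 1 receives requests only from consumer 1.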
[Chart: waiting time (y-axis, 0 to 70) vs. fraction of requests sent to fast producers (x-axis, 0.1 to 0.9), with curves for the 67/50 and 50/10 experiments.]
Figure 8: Waiting time vs. frac_hi (the fraction of requests sent to fast producers). h_max = 3.
Given p_mt(i), we can calculate the average number of hops made by consumer i's request, h_avg,i, and the
probability that a request from consumer i will block, p_b(i), by substituting p_mt(i) for p_mt in formulae 7
and 8. Given p_b(i), we can calculate the probability that a request blocks at producer j, p_b,j, by

p_b,j = Σ_{i=1}^{M} p_access_i[j] p_b(i) p_eff,i / Σ_{i=1}^{M} p_access_i[j] p_eff,i    (16)

Given the state occupancy probabilities of producer j, we can calculate the length of time that a request
blocks at producer j, given that it blocks, B_c,j, by using the formula from the previous section. We can
calculate the average time that a request from consumer i spends blocked, given that it blocks, by:

B_c,i = (1/p_mt(i)) Σ_{j=1}^{N} p_access_i[j] p_mt,j B_c,j

Finally, we can calculate B_i by

B_i = p_mt(i)^h_max B_c,i
4.3.1 Performance Comparison
We executed an experiment to test the effect of limiting the number of producers a consumer can probe. We used 100 producers and 100 consumers with identical production and consumption rates \mu = \lambda = .01, and varied both the maximum number of probes h_{max} and the number of producers that each consumer is allowed to probe, M_j. When consumer j chooses a producer to probe, it selects one of its M_j producers, each with equal probability. In Figure 9, we plot the waiting time of a consumer against M_j for different settings of h_{max}. We find that restricting M_j to any value larger than h_{max} has little effect on performance.
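The intuition behind this finding can be illustrated with a small Monte Carlo sketch. It is a deliberate simplification of the model in the text: producer emptiness is treated as a static i.i.d. snapshot with probability p_mt, rather than the full queueing behavior, so the numbers are only indicative.

```python
import random

def block_probability(m, h_max, p_mt, trials=50_000, n_producers=100, seed=1):
    """Estimate the probability that a request blocks (h_max consecutive
    probes find an empty producer) when the consumer may only probe a
    fixed fanout of m producers, chosen uniformly per probe.

    Simplifying assumption: each producer is independently empty with
    probability p_mt, frozen for the duration of one request."""
    rng = random.Random(seed)
    blocked = 0
    for _ in range(trials):
        fanout = rng.sample(range(n_producers), m)          # this consumer's M_j producers
        empty = {j: rng.random() < p_mt for j in fanout}    # frozen snapshot of states
        # The request blocks only if every one of its h_max probes is empty.
        if all(empty[rng.choice(fanout)] for _ in range(h_max)):
            blocked += 1
    return blocked / trials
```

With a small fanout, repeated probes tend to revisit the same (empty) producers, so the blocking probability is elevated; once m exceeds h_max, the chance of a repeat probe is already small and further fanout buys little, consistent with Figure 9.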
[Figure 9: Waiting time vs. M_j. x-axis: consumer fanout (1-10); y-axis: consumer waiting time; curves for h_{max}=3 and h_{max}=5.]
5 Conclusions
In this paper, we present a simple stochastic algorithm to implement a distributed queue. The algorithm has several parameters, including the maximum number of probes a consumer makes before blocking, h_{max}, the producers that consumer j will probe, p_access_j, and the number of buffers at each producer. We develop a validated performance model of the stochastic distributed queue and execute a performance study. We find that:

- Setting h_{max} to 3 works well for most loads and for moderately unbalanced producers.
- Setting h_{max} to 5 can increase performance by about 3% over setting h_{max} to 3, and still requires fewer messages than the centralized manager algorithm.
- Using 5 buffers at each producer gives good performance. Additional buffers can give a small additional throughput improvement.
- The setting of p_access should be matched to the producer rates. A system with very unbalanced producers will not give low waiting times, even with a large value of h_{max}, unless p_access and the production rates are matched.
- The number of connections between producers and consumers can be limited to M_j >= h_{max} with only a small degradation in performance.
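One natural way to match p_access to the producer rates is to make each consumer's probe distribution proportional to the production rates. This proportional rule is a sketch of one plausible "matched" setting under the recommendation above; the paper does not prescribe an exact formula.

```python
def matched_access(production_rates):
    """Return a probe distribution p_access_i proportional to the
    producers' production rates, so faster producers receive
    proportionally more probes (a hypothetical 'matched' setting)."""
    total = sum(production_rates)
    return [rate / total for rate in production_rates]
```

For example, with two producers running at rates 50 and 10, a consumer would send 5/6 of its probes to the fast producer and 1/6 to the slow one.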
References
[1] G.R. Andrews. Concurrent Programming: Principles and Practice. Benjamin/Cummings, 1991.
[2] G.R. Andrews. Paradigms for process interaction in distributed programs. ACM Computing Surveys, 23(1):49-90, 1991.
[3] P. Avery, C. Chegireddy, J. Brothers, T. Johnson, J. Kasaraneni, and K. Harathi. The UFMulti project. In Intl. Conf. on Computing in High Energy Physics, pages 156-164, 1994.
[4] N. Carriero, D. Gelernter, and J. Leichter. Distributed data structures in Linda. In Proc. ACM Symp. on Principles of Programming Languages, pages 236-242, 1986.
[5] R. Finkel and U. Manber. DIB - a distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 9(2):235-256, 1987.
[6] L. George. A scheduling strategy for shared memory multiprocessors. In Intl. Conf. on Parallel Programming, pages 1:67-71, 1990.
[7] S.F. Hummel and E. Schonberg. Low-overhead scheduling of nested parallelism. IBM Journal of Research and Development, 35(5):743-765, 1991.
[8] A.R. Karlin and E. Upfal. Parallel hashing: An efficient implementation of shared memory. J. ACM, 35(4):876-892, 1988.
[9] D. Kotz and C.S. Ellis. Evaluation of concurrent pools. In Proc. Intl. Conf. on Distributed Computing Systems, pages 378-385, 1989.
[10] P.N. Lee, Y. Chen, and J.M. Holdman. DRISP: A versatile scheme for distributed fault-tolerant queues. In Proc. Intl. Conf. on Distributed Computing Systems, 1991.
[11] U. Manber. On maintaining dynamic information in a concurrent environment. SIAM Journal on Computing, 15(4):1130-1142, 1986.
[12] F.J. Rinaldo and M.R. Fausey. Event reconstruction in high energy physics. Computer, pages 68-87, 1993.
[13] M. Sullivan and D. Anderson. Marionette: A system for parallel distributed programming using a master/slave model. In Proc. 9th Intl. Conf. on Distributed Computing Systems, pages 181-187, 1989.
[14] UFMulti Distributed Toolkit. http://www.phys.ufl.edu/~ufm.
[15] E. Upfal and A. Wigderson. How to share memory in a distributed system. J. ACM, 34(1):116-127, 1987.
[16] M.T. Vandevoorde and E.S. Roberts. Workcrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, 17(4):347-366, 1988.
