An efficient dynamic load balancing algorithm for
adaptive mesh refinement
Ravi Varadarajan
Dept. of Mathematics,
University of Puerto Rico
Rio Piedras, PR 00931
Email: r_varadarajai fuprl.upr.clu.edu
Fax : (809) 7540757
Injae Hwang
Comp. and Info. Sciences Department,
University of Florida
Gainesville, FL 32611
Email: ih cis.ufl.edu
May 26, 1994
Abstract
In numerical algorithms based on adaptive mesh refinement, the computational workload
changes during the execution of the algorithms. In mapping such algorithms onto distributed
memory architectures, dynamic load balancing is necessary to balance the workload among
the processors in order to obtain high performance. In this paper, we propose a dynamic
processor allocation algorithm for a mesh architecture that reassigns the workload in an
attempt to minimize both the computational and communication costs. Our algorithm is
based on a heuristic for a 2D packing problem that gives provably close to optimal solutions
for special cases of the problem. We also demonstrate through experiments how our algorithm
provides good quality solutions in general.
1 Introduction
Parallel computers with thousands of processors are becoming a commercial reality due to
advanced VLSI technologies. These machines can provide tremendous computational power
needed to solve many computationally intensive problems that arise in different fields of
science and engineering such as aerodynamics, nuclear physics and weather forecasting. In
solving such problems, it is necessary to distribute the computational load among the pro
cessors in order to achieve good performance as measured by speedup, processor utilization
etc. The term "load balancing" refers to this activity of distributing or redistributing the
computational load in order to achieve high performance.
In multiprocessor computer systems with distributed memory, load balancing should also
take into consideration the interprocessor communication cost that arises when for example,
a computation in a processor needs data located in another processor's memory. For simple
applications, it is sufficient for the programmer to perform static load balancing by properly
distributing the data before the parallel program begins its computation. But there are many
engineering applications wherein simple static load balancing techniques are not sufficient
to ensure high performance. A case in point is the solution of partial differential equations
using adaptive meshes.
In numerical methods that employ finite differences to solve such problems, a rectangular
mesh is imposed on the computational domain and the solution is evaluated at the mesh
points; the mesh size (or spacing between mesh points) is related to the required error
tolerance in the solution. It is computationally advantageous to have mesh points spaced
more closely in regions with steep gradients and have coarser mesh points in the regions where
the solution is smooth. In many problems (for example, flame propagation in combustion),
the regions of steep gradient are difficult to predict before the computation begins. For this
reason, we start with a uniform mesh of coarse size and finer and finer meshes are dynamically
imposed on regions with large estimated errors as the solution proceeds. Parallelization of
adaptive mesh algorithms thus introduces frequent changes in workload distribution among
the processors and hence requires dynamic load balancing to achieve high performance. In
this paper, we address the issue of dynamic load balancing that arises particularly in mapping
adaptive mesh refinement algorithms onto parallel architectures.
There are many load balancing algorithms (such as binary decomposition [7] and scatter
decomposition) proposed in the literature for numerical algorithms based on finite difference
or finite element methods. These algorithms are more suitable for static load balancing
as they ignore the interprocessor communication cost that usually results from relocating
the intermediate solution values among the processors during rebalancing. There are other
general dynamic load balancing algorithms (such as diffusion scheme [20]) suitable for slowly
varying workload distributions, as the time to rebalance the workload is quite high in these
methods.
In this paper, we propose dynamic load balancing algorithms especially for adaptive mesh
refinement that explicitly consider the computational and communication costs in rebalancing
the workload. The grids generated dynamically during the computation are assumed to
be rectangular. We use a two-dimensional packing problem to allocate a set of processors for
each grid in a mesh multiprocessor system. The set of processors is chosen in such a way that
the computational workload among the processors is balanced and, at the same time, the
intragrid communication cost is minimized. We propose an efficient heuristic packing algorithm and
show how it can be used for processor allocation.
A word about the organization. In the next section, we briefly discuss some of the
previously proposed load balancing algorithms. In Section 3, we formulate the dynamic
load balancing problem that arises in mapping adaptive mesh computations onto parallel
computers. In Section 4, we propose a simple heuristic and two approaches for implementing
it on the parallel machines. In Sections 5 and 6, we describe the two approaches and analyze
their complexities for hypercube architectures. In Section 7, we give the results of our
experiments performed to evaluate our heuristic methods with respect to some well-known
scheduling heuristics. Finally, we summarize our paper with a brief discussion of scope of
future work.
2 Previous work
In this section, we discuss some of the load balancing techniques proposed in the literature for
mapping numerical computations onto parallel architectures. These techniques partition the
computational domain into pieces (subdomains) which are then allocated to the processors.
For static load balancing, communication between two processors is usually assumed to be
proportional to the number of points on the boundaries between the subdomains allocated
to the processors. In these techniques, balancing the load among the processors is given
more importance than minimizing the interprocessor communication cost.
In [7], Berger and Bokhari propose binary decomposition for distributing the workload
associated with nonuniform mesh computations in a multiprocessor system with a mesh
topology. It is a recursive partitioning of the domain into vertical and horizontal strips
alternately, so that the left and right (or top and bottom) segments that result from each
cut have approximately equal load; load here is measured by the number of mesh points.
The number of processors is assumed to be a power of two in this method. This technique,
while useful for static load balancing, has limited applicability as a dynamic load balancing
method for adaptive meshes since it ignores the cost of migrating intermediate solution values
that occurs during reallocation of mesh points to the processors. For minor imbalances in
workload, this technique can perform efficient rebalancing by adjusting the cut lines starting
with the smallest-sized strips. It is not specified how binary decomposition can be efficiently
implemented in parallel on a multiprocessor system.
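The alternating-cut idea can be sketched in a few lines. This is a minimal illustration of our own (the function name and the list-of-lists load representation are not from the original paper); it cuts wherever the prefix sum of the column or row loads first reaches half the total.

```python
def binary_decompose(load, levels, vertical=True):
    """Recursively split a 2D array of per-cell mesh-point counts into
    2**levels pieces of roughly equal total load, alternating between
    vertical and horizontal cuts.  `load` is a list of rows."""
    if levels == 0:
        return [load]
    if vertical:                      # a vertical cut splits the columns
        totals = [sum(row[c] for row in load) for c in range(len(load[0]))]
    else:                             # a horizontal cut splits the rows
        totals = [sum(row) for row in load]
    # Cut where the prefix sum first reaches half the total load.
    half, acc, cut = sum(totals) / 2.0, 0.0, 1
    for i, t in enumerate(totals):
        acc += t
        if acc >= half:
            cut = max(1, min(i + 1, len(totals) - 1))
            break
    if vertical:
        left = [row[:cut] for row in load]
        right = [row[cut:] for row in load]
    else:
        left, right = load[:cut], load[cut:]
    return (binary_decompose(left, levels - 1, not vertical)
            + binary_decompose(right, levels - 1, not vertical))

# 8 x 8 domain with a heavily refined corner, split for 4 processors.
load = [[10.0 if r < 3 and c < 3 else 1.0 for c in range(8)] for r in range(8)]
parts = binary_decompose(load, 2)
part_loads = [sum(map(sum, p)) for p in parts]
```

Because the cuts are restricted to full rows and columns, the balance is only approximate near a concentrated hot spot, which is exactly the behavior the text describes.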
Scatter decomposition is another static load balancing technique which accounts for minor
fluctuations in workload distribution that tend to concentrate in a few contiguous regions of
the computational domain. In this case, the rectangular domain is divided into (say) m rows
and n columns of equal width. Let rij be the region in the ith row and jth column, where
0 ≤ i ≤ m − 1 and 0 ≤ j ≤ n − 1. Then for a multiprocessor system with a p × q mesh (assuming
p divides m and q divides n), we allocate the computations of rij to the processor indexed (s, t),
where s = i mod p and t = j mod q, with 0 ≤ s ≤ p − 1 and 0 ≤ t ≤ q − 1. The larger the ratios
m/p and n/q, the better balanced the workload will be, but the higher the communication cost among the
processors; this tradeoff very much depends on the application. Determining this tradeoff
is a major difficulty with this method. This technique was used as a load balancing method
by Williams [32] and was analyzed for a probabilistic workload by Nicol and Saltz.
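The modular mapping behind scatter decomposition can be written directly from the formula (the function name and region representation here are our own illustration):

```python
from collections import Counter

def scatter_map(m, n, p, q):
    """Scatter decomposition: region (i, j) of an m x n grid of equal-size
    regions is assigned to processor (i mod p, j mod q) of a p x q mesh.
    Assumes p divides m and q divides n, as in the text."""
    assert m % p == 0 and n % q == 0
    return {(i, j): (i % p, j % q) for i in range(m) for j in range(n)}

# 4 x 4 regions onto a 2 x 2 processor mesh: each processor receives
# (m/p) * (n/q) = 4 regions scattered across the domain.
mapping = scatter_map(4, 4, 2, 2)
counts = Counter(mapping.values())
```

The scattering is what absorbs localized load fluctuations: a hot spot covering a few adjacent regions is spread over distinct processors, at the price of more boundary communication.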
In [29], Simon proposes the application of graph-theoretic techniques for mapping unstructured
grid calculations that arise in computational fluid dynamics onto multiprocessor
architectures. These techniques recursively apply a procedure that partitions a planar graph
into two halves with a roughly equal number of nodes, while attempting to minimize the
number of edges in the cutset (a measure of communication cost). In one technique
called "recursive graph bisection", nodes are partitioned based on their distances from an
extremal vertex, which is the root of a breadth-first spanning tree of maximum height. In
another technique called "recursive spectral bisection", partitioning is done along the eigenvector
associated with the second largest eigenvalue of A − D, where
A is the adjacency matrix and D is the diagonal matrix of vertex degrees (equivalently, the
second smallest eigenvalue of the Laplacian D − A). Recursive spectral
bisection is seen to provide a desirable graph partition.
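The core step of spectral bisection can be sketched without a sparse eigensolver, by power iteration on σI − L with the constant vector projected out; the surviving dominant direction is the Fiedler vector. The graph, iteration count, and function names below are our own illustration, not Simon's implementation.

```python
import math

def fiedler_partition(adj, iters=2000):
    """Partition graph nodes by the sign of the Fiedler vector (eigenvector
    of the second-smallest eigenvalue of the Laplacian L = D - A), found by
    power iteration on sigma*I - L with the all-ones component projected out."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    sigma = 2 * max(deg) + 1            # shift so sigma*I - L is positive definite
    v = [math.sin(i + 1) for i in range(n)]   # deterministic starting vector
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]       # project off the all-ones eigenvector
        # w = (sigma*I - L) v = (sigma*I - D + A) v
        w = [(sigma - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return ([i for i in range(n) if v[i] < 0],
            [i for i in range(n) if v[i] >= 0])

# 6-node path graph 0-1-2-3-4-5: spectral bisection cuts the middle edge.
adj = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
left, right = fiedler_partition(adj)
```

In practice a Lanczos-type sparse eigensolver is used instead of dense power iteration, but the partition rule (split by the sign, or the median, of the Fiedler vector components) is the same.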
There are many distributed dynamic load balancing strategies of which "diffusion method"
is one example. It is a local balancing operation, wherein each processor independently moves
(receives) workload to (from) its neighbors based on only the information about its workload
and that of its neighbors. This method was used for molecular dynamics simulation applications
[11]. A major disadvantage of this method is that it takes a long time to achieve global
balancing, but this delay is acceptable when the workload distribution changes slowly. Cybenko
[20] showed the conditions for convergence of this method and the rate of convergence
for arbitrary processor network topologies.
Hanxleden and Scott [25] propose a decentralized dynamic strategy in which each processor
does local balancing based on the total workload information that it receives from every
other processor. This strategy was tested on an Intel iPSC/2 for WaTor, a particle migration
application, and was found to give good results.
The technique we propose in this paper is somewhat similar to binary decomposition, but
it explicitly takes into account the interprocessor communication cost that occurs in rebalancing
the workload. We also propose efficient parallel implementations of this technique. It is a
technique tailor-made for adaptive mesh refinement applications.
3 Proposed load balancing approaches
In a typical adaptive mesh refinement algorithm, the given partial differential equation is
discretized by imposing initially a uniform mesh on the computational domain; we call this
mesh a coarse grid at level 0. Letting i be equal to 0 initially, the following steps are then
repeatedly applied.
1. Obtain solution values at the mesh points of the grids at level i, using finite difference
schemes that involve 4 or 8 mesh neighbors.
2. If i > 0, project solution values of level i grids onto grids at level i − 1, for updating
solution values at level i mesh points.
3. Estimate errors at mesh points of level i grids and if the estimated errors at all the
mesh points of level i are within the acceptable limit, terminate the algorithm.
4. Otherwise, in regions of high error, define grids at level i + 1 with finer mesh spacing
(assuming a constant ratio between level i and i + 1 mesh spacings). One common
approach is to flag higherror points, cluster them into groups using a suitable clustering
algorithm and define a finer grid for each group.
5. Define boundary values for grids at level i + 1 by interpolating intermediate solution
values at level i.
6. Let i = i + 1 and go to step 1.
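The six steps above amount to the following control flow; the Grid class and the five callbacks are hypothetical stand-ins for the numerical kernels (finite-difference solve, projection, error estimation, clustering/refinement, and boundary interpolation), so only the loop structure is meaningful.

```python
class Grid:
    """Minimal stand-in for a rectangular grid at some refinement level."""
    def __init__(self, level, parent=None):
        self.level, self.parent = level, parent

def adaptive_refine(solve, project, estimate_error, refine, interp_boundary,
                    coarse_grids, tol, max_level=10):
    """Skeleton of the adaptive refinement loop (steps 1-6 above)."""
    grids = {0: list(coarse_grids)}
    level = 0
    while level <= max_level:
        for g in grids[level]:
            solve(g)                                           # step 1
        if level > 0:
            for g in grids[level]:
                project(g, g.parent)                           # step 2
        errors = {g: estimate_error(g) for g in grids[level]}  # step 3
        if all(e <= tol for e in errors.values()):
            return grids                                       # all errors acceptable
        finer = []
        for g, e in errors.items():
            if e > tol:
                for child in refine(g):                        # step 4
                    interp_boundary(child, g)                  # step 5
                    finer.append(child)
        grids[level + 1] = finer
        level += 1                                             # step 6
    return grids

# Toy run: the level-0 error exceeds tol, the level-1 error does not.
result = adaptive_refine(
    solve=lambda g: None, project=lambda g, p: None,
    estimate_error=lambda g: 1.0 if g.level == 0 else 0.1,
    refine=lambda g: [Grid(g.level + 1, parent=g)],
    interp_boundary=lambda c, g: None,
    coarse_grids=[Grid(0)], tol=0.5)
```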
In the formulation of our load balancing problem, we use the following assumptions:
1. A grid at level i (i > 0) is completely contained inside a grid at level i − 1. This is
the most commonly used assumption in adaptive mesh refinement, adopted to simplify the
numerical algorithms. In this case, we call the two grids "child" and "parent" grids
respectively.
2. Grid computations are synchronized such that grid computations at level i start at the
same time. Dynamic load balancing is performed whenever finer grids are created and
before their computations are started.
3. Computational cost for each grid depends only on the number of mesh points it contains.
This cost includes time for steps 1, 3, 4 and 5 above. For simplicity, we will
assume that the cost is a linear function of the number of points, though our formulation
can be used for any known function of the number of points.
4. There are two types of interprocessor communication, called "intragrid communication"
and "intergrid communication" respectively. Intragrid communication is needed
whenever computation of a value at a mesh point (during step 1 above) in a processor
needs a value at a neighboring mesh point located in another processor; both the mesh
points belong to the same grid at some level. Intergrid communication occurs during
steps 2 and 5 whenever a child grid is located in a different processor than that of
its parent. Communication cost is directly proportional to the distance between the
processors in the network and we ignore traffic on the communication links.
We are currently investigating the following three approaches for mapping the adaptive mesh
computations. Each approach has its own merits and demerits.
In the first approach, which we call the "grid indivisibility" approach, we allocate all the mesh
points of a grid to the same processor; thus a grid is an indivisible computational entity
that can be treated as a process. Creation of a finer grid is analogous to the creation of a
child process which can either be allocated to the same processor as the parent process or
migrated to a different processor by the load balancing process. A major advantage of this
approach is that in addition to being simple to implement, intragrid communication (which
occurs more frequently than intergrid communication) is avoided. But the number of grids
at a level must be at least the number of processors to keep all the processors busy and this
will not happen at least for the first few levels.
In the second approach, which we call the "domain decomposition" approach, the computational
domain is divided into regions (possibly using scatter decomposition) and the mesh computations
of the grids at all levels associated with a region are assigned to the same processor.
Whenever finer grids are created, we perform a new decomposition for balancing the workload
associated with the finer grids. This new decomposition may require migration of coarse (previous
level) grid mesh points to different processors. But this approach eliminates intergrid
communication at the expense of this migration cost. It introduces intragrid communication
however since the mesh points of a grid may be allocated to more than one processor. The
load balancing problem here is similar to the problem considered in [31] for load balancing
with resource migration in distributed systems if the coarse grids are treated as resources.
The third approach, which we call the "grid partitioning" approach, is more general
than the previous two methods, but the load balancing problem formulation is more complicated
and its solution more difficult to implement. In this approach, we allocate a set of processors
for each grid; within that set of processors, the grid is uniformly partitioned into pieces
and each piece is assigned to a processor.
In this paper, we propose a solution to the load balancing problem formulated using the
third approach. This approach requires that the number of processors be larger than the
number of grids at a level of refinement.
3.1 Problem Formulation
In this section, we consider the grid assignment for a two-dimensional adaptive mesh algorithm.
When we formulate our load balancing problem, we assume that all two-dimensional
grids are rectangular. We also assume that the target machines are two-dimensional mesh
multiprocessors. Later in this paper, we will consider hypercube multiprocessors.
Let m be the number of grids at level i. If we allocate an Xj × Yj submesh for each grid
j close to the processors where grid j was generated, the submesh of one grid can overlap
with the submeshes of other grids, especially when the grids are located close to each other. Then
the workload of the overlapping processors will be twice as much as that of the
other processors. To avoid this workload imbalance, some of the grids may have to be
assigned to processors which are distant from the processors where they were generated.
These assignments will incur intergrid communication over some distance, which cannot be
expressed in terms of Xj and Yj. To define the problem more formally, we introduce the
following notations :
G = (V, E) : processor graph, where V is the set of processor nodes and
E the set of communication links
S : set of fine grids generated at level i
f : S → 2^V : f(a) is the neighborhood of processors where grid a was generated
f'(a) : processors (a subset of f(a)) having boundary values for grid a
before load balancing
Ucomp : computational cost per mesh point; depends on the finite difference
method used
Ucomm : cost of transmitting a mesh point value between neighboring
processors
d(p1, p2) : communication distance between processors p1 and p2 in G
d'(V1, V2) : max over p1 ∈ V1, p2 ∈ V2 of d(p1, p2)
F : frequency of intragrid communication during level i computation
g : S → 2^V : grid assignment function to be determined by load balancing
g'(a) : processors (a subset of g(a)) having boundary values for grid a
after load balancing
S' : set of subgrids created by assignment g
Mb : number of mesh points in subgrid b
Bb : number of mesh points on the boundary of subgrid b
h : V → S' : h(p) is the subgrid assigned to processor p
After the partitioning and assignment of grids at level i, the computational workload Ep for
processor p is given by Ucomp M_h(p). Note that we require that no more than one subgrid be
assigned to a processor. For a fine grid a, the intergrid communication cost C1(a) is estimated
to be Ucomm (d'(f'(a), g'(a)) B'(a) + d'(f(a), g(a)) M'(a)), where B'(a) = B_a / min(|f'(a)|, |g'(a)|)
and M'(a) = M_a / min(|f(a)|, |g(a)|); the first term is the cost of migrating the boundary values for fine
grid a that occurs before the computation of mesh point values in a, and the second term is
the cost of sending the fine grid values back to the parent grid for updating. Whenever a grid
a is allocated to more than one processor, for each of its subgrids b, intragrid communication
is needed, whose cost C2(b) is estimated to be Ucomm F Bb, where Bb is the number of mesh
points (interior to grid a) on the boundary of b. Our problem is then formally stated as
follows :
Find g, S', and hence h, so as to minimize max over p ∈ V of (Ep + C2(h(p))) + max over a ∈ S of C1(a),
with the restriction that the grids are assigned to disjoint sets of processors, i.e.,
for any two grids a1 and a2, g(a1) ∩ g(a2) = ∅.
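The objective and its restriction can be evaluated for a candidate assignment with a small helper. The flat data layout below is our own illustration: each grid carries a precomputed intergrid cost c1 (folding in the paper's d', f and g functions) and a list of (processor, mesh points, boundary points) subgrid tuples.

```python
def objective(grids, u_comp, u_comm, freq):
    """Evaluate max_p (E_p + C2(h(p))) + max_a C1(a) for a candidate
    assignment, checking that no processor receives two subgrids."""
    per_proc = {}
    for a in grids:
        for proc, m_pts, b_pts in a["subgrids"]:
            assert proc not in per_proc     # at most one subgrid per processor
            e_p = u_comp * m_pts            # computational workload E_p
            c2 = u_comm * freq * b_pts      # intragrid cost C2(b)
            per_proc[proc] = e_p + c2
    return max(per_proc.values()) + max(a["c1"] for a in grids)

# Two grids on three processors: the bottleneck processor determines the
# first term, the costliest grid migration the second.
grids = [
    {"subgrids": [(0, 100, 20), (1, 100, 20)], "c1": 5.0},
    {"subgrids": [(2, 90, 18)], "c1": 7.0},
]
cost = objective(grids, u_comp=1.0, u_comm=0.5, freq=2.0)
```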
As was stated before, we restrict our attention to the case where each fine grid aj (1 ≤
j ≤ m) is rectangular in shape with dimensions wj × hj and the processor graph G is a mesh
network with P rows and Q columns (assuming P ≥ Q without loss of generality). Since
there is at most one subgrid assigned to a processor, m ≤ PQ. The problem is then posed
as follows:
Determine for each grid aj the dimensions Xj × Yj of the processor submesh to
be allocated to the grid and lj, the index of the processor in the southwest
corner of the submesh, so as to minimize max over p ∈ V of (Ep + C2(h(p))) +
max over 1 ≤ j ≤ m of C1(aj). In the rest of this paper, Σj denotes Σ from j = 1 to m when we discuss the formulation
of the problem.
For simplicity, we ignore the intergrid communication cost and seek
solutions which minimize the computation cost and the intragrid communication cost. We will
discuss the extension of the approach to include the intergrid communication cost at the end
of this paper. If we remove the term representing the intergrid communication cost from the
problem formulation, our objective function decreases as Xj and Yj increase. Consequently,
it is advantageous to assign a submesh which is as large as possible to each grid aj. It is also
necessary that wj hj / (Xj Yj) be equal to wk hk / (Xk Yk) for all pairs j and k to balance the workload among
the processors. From the above discussion, we have the following two desirable properties of
processor allocation.
1. Σj Xj Yj = N, where N = PQ is the total number of processors,
2. ∀ j, k : wj hj / (Xj Yj) = wk hk / (Xk Yk).
When the total number of grid points at level i is larger than the number of processors,
the number of processors Xj Yj for each grid j is limited by the above two properties. Let
Nj = Xj Yj. Then the intragrid communication cost for grid aj is as follows:

C2(aj) = 2 Ucomm F (wj / Xj + hj / Yj).

Substituting Yj = Nj / Xj, we get

dC2(aj)/dXj = 2 Ucomm F (− wj / Xj^2 + hj / Nj),

which becomes 0 when Xj = sqrt(Nj wj / hj). Since Yj = Nj / Xj, it follows that
Xj / Yj = wj / hj. To minimize the cost of intragrid communication separately for each grid (for
a fixed number of processors assigned to the grid), we add the third property of processor
allocation as follows.
3. ∀ j : Xj / Yj = wj / hj.
Now, our problem is approximated by finding m submeshes and their locations, one for each
grid, which satisfy the above three properties. This problem is formally stated as follows:
For each grid aj of dimension wj × hj, find the Xj × Yj processor submesh to
be allocated to the grid and lj, the index of the processor in the southwest
corner of the submesh, so as to maximize Σj Xj Yj with the restrictions that
∀ j, k : wj hj / (Xj Yj) = wk hk / (Xk Yk) and ∀ j : Xj / Yj = wj / hj.
We will call this problem the submesh allocation problem.
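A quick numerical check of the third property: for a fixed Nj, the cost 2 Ucomm F (wj/Xj + hj/Yj) with Yj = Nj/Xj is minimized when Xj/Yj = wj/hj. The grid dimensions and processor count below are illustrative values of our own choosing.

```python
import math

def c2(w, h, n, x, u_comm=1.0, freq=1.0):
    """Intragrid cost C2 = 2*Ucomm*F*(w/X + h/Y) with Y = n/X processors."""
    return 2.0 * u_comm * freq * (w / x + h * x / n)

# A 60 x 15 grid on N_j = 16 processors: the analytic minimizer is
# X_j = sqrt(N_j * w/h) = sqrt(16 * 4) = 8, giving X_j/Y_j = 8/2 = 4 = w/h.
w, h, n = 60.0, 15.0, 16.0
xs = [k / 10.0 for k in range(10, 161)]          # candidate X in [1.0, 16.0]
best_x = min(xs, key=lambda x: c2(w, h, n, x))
analytic = math.sqrt(n * w / h)
```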
4 Processor Allocation Using Two-Dimensional Packing
In this section, we propose an efficient approach to find an approximate solution to the
problem formulated in the previous section. This approach views the problem as a two-dimensional
packing problem wherein rectangles need to be packed in a bin of infinite width
and height. We present an approximate heuristic for this packing problem and show how the
approximate solution can be used to allocate processor submeshes to the grids. Experimental
results on the performance of the packing algorithm are included at the end of this section.
4.1 A Packing Algorithm
The two-dimensional packing problem has been studied by many researchers [5, 4, 6, 18]. It arises
in a variety of situations, such as scheduling of tasks and cutting-stock problems. Cutting-stock
problems may involve cutting objects out of a sheet or roll of material so as to minimize
waste. The scheduling of tasks with a shared resource involves two dimensions, the resource
and time, and the problem is to schedule the tasks so as to minimize the total amount of
time used. Memory is a typical example of shared resources. In general, the problem is
stated as follows :
Given a rectangular bin with fixed width and infinite height, pack a finite set
of rectangles of specified dimensions into the bin in such a way that the rectangles
do not overlap and the total bin height used in the packing is minimized. The
rectangles should be packed with their sides parallel to the sides of the bin. In one version
of the problem (see [5]) the rectangles should be packed in a fixed orientation.
In our processor allocation problem, grids correspond to rectangles to be packed. But,
the problem is slightly different from the above twodimensional bin packing problem. First,
the width as well as the height of the bin is unlimited. Assume that we are given a √N × √N
mesh multiprocessor for m grids. The goal of our packing is to minimize the maximum of the
width and height of the bin used for packing the m grids, starting from the left-bottom corner of the
bin. Second, the orientation of the grids is not important, so grids can be rotated when they are
packed. Figure 1 shows an example of such a packing.
After packing the m grids as in Figure 1, we use the two ratios √N/width and √N/height to allocate a
submesh for each grid. The detailed allocation method will be described later in this section.
First, we will show that the decision version of our 2D packing problem is NP-complete in the
following theorem.
Figure 1: An Example of Packing

Theorem 1 : The following problem is NP-complete:
Given m rectangles of dimensions wi × hi (wi ≥ hi), is there a packing so that the maximum
of the height and width of the packing is smaller than or equal to L?
Proof: We will reduce the bin packing problem, which is known to be NP-complete [24], to
our 2D packing problem. The bin packing problem is stated as follows: Given k bins of
capacity B, and a finite set U = {u1, u2, ..., un} of items with the size of each item
s(ui) ∈ Z+ (s(ui) ≤ B), is there a partition of U into disjoint sets U1, U2, ..., Uk
such that the sum of the sizes of the items in each Ui is no more than B? We
can construct an instance of the 2D packing problem for an instance of the bin packing
problem in the following way: Create n + 1 rectangles with the following dimensions:
w0 = k(B + 1), h0 = k(B + 1) − B, w1 = B + 1, h1 = s(u1), w2 = B + 1, h2 = s(u2), ...,
wn = B + 1, hn = s(un), and let L = k(B + 1). Then any feasible packing of these rectangles
must be in the form as in Figure 2. The largest rectangle in the figure is the first rectangle
in the list. Since one dimension of the rectangle is already k(B + 1), other rectangles cannot
be packed above it. The larger of the two dimensions is B + 1 for all but the first rectangle,
hence they can only be packed so that their longer sides are parallel to the Yaxis. For
this instance of the 2D packing problem to have a solution, rectangles should be packed so
that the width of the packing does not exceed k(B + 1). Note that each level of packing of
Figure 2: Feasible Packing of n + 1 Rectangles
rectangles indexed from 1 to n (for example, rectangles "a", "b" form one level of packing
in Figure 2) corresponds to a disjoint set Ui in the bin packing problem. Now, it is obvious
that there is a solution for an instance of the bin packing problem if and only if there is a
solution for the corresponding instance of the 2D packing problem. □
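For concreteness, the construction in the proof can be generated mechanically (the function name is ours):

```python
def reduction_instance(k, B, sizes):
    """Build the 2D packing instance from the proof of Theorem 1: one
    blocking rectangle of k(B+1) x (k(B+1) - B), one (B+1) x s(u_i)
    rectangle per item, and the dimension bound L = k(B+1)."""
    rects = [(k * (B + 1), k * (B + 1) - B)]     # blocking rectangle
    rects += [(B + 1, s) for s in sizes]         # one rectangle per item
    return rects, k * (B + 1)

# 2 bins of capacity 5, items of sizes 3, 2, 4, 1 (packable as {3,2}, {4,1}).
rects, L = reduction_instance(k=2, B=5, sizes=[3, 2, 4, 1])
```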
Since the decision version of the problem is NP-complete, the original optimization problem
is NP-hard. For this reason, we seek an approximation algorithm to solve the problem.
Since our packing problem is somewhat different from the commonly known 2D bin
packing problem, the heuristic algorithms developed in [5, 4, 6, 18] are not readily useful
for our purpose. One approach to solving our problem is to try different bin widths and
for each width, use one of the above packing heuristics to see if the height of the packing
does not exceed the desired height for that width. We can find the minimum width that
allows the desired packing, by using some form of interpolation search on the interval between
minimum and maximum widths possible. A better approach, one that avoids repacking, is to use
a packing heuristic that attempts to satisfy two goals, namely, packing as tightly as possible
and making the width/height ratio of the packing closely match the ratio of the processor
mesh. To this end, we propose a heuristic which we will call the "tight packing" (TP, for short)
heuristic.
The basic idea of TPheuristic is as follows. First the grids are sorted in some selected
order. Then we start packing grids one by one at the southwest corner of the bin. Let's
consider the space of the bin as the first quadrant of XY plane. Then the southwest corner
of the bin becomes the origin of the coordinate system, that is (0,0). Each packed grid has
4 corners NW,NE,SE and SW with respect to its orientation in the packing. A NW or SE
corner of a packed grid is called a free corner (FC) if no other item occupies that corner
such that the item is above and to the right of that corner. In our algorithm, only free
corners are considered for packing the new grid and when a new grid is placed in a free
corner, it is placed so that it is above and to the right of the corner. After packing the first
grid at the origin, the next grid is packed at one of the two corners created by packing the
first grid. Figure 3 (a) shows the two corners, "a" and "b".

Figure 3: The Corners for Packing the Next Grid

In general, after packing a wi × hi
grid at (xj, yj), we add two more free corners, located at (xj, yj + hi) and (xj + wi, yj), to the
list of free corners. We also keep the maximum size of the grid which can be packed in that
free corner along with its location. We call this the "size" of the free corner. Figure 3 (b) shows
two new corners, "c" and "d", after packing the second grid at corner "b". After the second
grid is packed, the width of the grid which can be packed at corner "a" is limited to the
width of the first grid. Hence, we need to update that information whenever we pack a grid.
Let Wi and Hi be the width and height of the space used in the packing after packing the first
i grids. We choose the corner for the (i + 1)st grid so that the maximum of Wi+1 and Hi+1
is minimized. If we are given a processor mesh P × Q with P ≥ Q and P/Q = R, we choose
the corner for the (i + 1)st grid so that the maximum of Wi+1 and R·Hi+1 is minimized. The
detailed description of our packing algorithm is given below. Assume that we are given the
mesh ratio R and m grids, w1 × h1, w2 × h2, ..., wm × hm. After the following algorithm
terminates, the global variables XLoci and YLoci contain the location of the corner where grid
i was packed. Orienti is set to 1 if grid i was rotated and to 0 otherwise.
Algorithm Packing;
1. width = 0
2. height = 0
3. Initialize the three lists CornerList, XScanList and YScanList to be empty.
4. Insert [(0, 0), (∞, ∞)] into CornerList.
// [(x, y), (p, q)] represents the corner located at (x, y); the maximum grid
size that can be packed there is p × q. //
5. for each wi × hi grid do
begin
6. FindBestCorner((wi, hi), k)
// In procedure FindBestCorner, each corner in CornerList is
examined and a corner [(xk, yk), (pk, qk)] is chosen which minimizes the
maximum of width and R · height after the wi × hi grid is packed //
7. UpdateCornerList((wi, hi), k)
// In procedure UpdateCornerList, the size of each corner in
CornerList is updated after the wi × hi grid is packed at
corner [(xk, yk), (pk, qk)] //
8. FindNewCornerSize((wi, hi))
// After the wi × hi grid is packed at corner [(xk, yk), (pk, qk)], two new
corners, [(xk, yk + hi), (pk', qk')] and [(xk + wi, yk), (pk'', qk'')], are created.
In procedure FindNewCornerSize, the size of each new corner is
determined. //
end
end Packing;
Procedure FindBestCorner((wi, hi), k);
1. mindim = ∞
2. for each element [(xj, yj), (pj, qj)] in CornerList do
begin
3. mx1 = ∞, my1 = ∞, mx2 = ∞, my2 = ∞
// Try to pack grid (wi × hi) at the free corner located at (xj, yj),
of size (pj, qj), in both orientations. //
4. if (pj − wi) ≥ 0 and (qj − hi) ≥ 0 then
5. mx1 = max(width, xj + wi), my1 = max(height, yj + hi)
6. else if (pj − hi) ≥ 0 and (qj − wi) ≥ 0 then
7. mx2 = max(width, xj + hi), my2 = max(height, yj + wi)
8. else goto the next iteration
9. m1 = max(mx1, R · my1)
10. m2 = max(mx2, R · my2)
11. m = min(m1, m2)
12. if (m < mindim) then
begin
13. mindim = m, k = j, XLoci = xj, YLoci = yj
14. if (m1 < m2) then
15. width = mx1, height = my1, Orienti = 0
else
16. width = mx2, height = my2, Orienti = 1
end
end
end FindBestCorner;
Procedure UpdateCornerList((wi, hi), k);
1. Remove [(xk, yk), (pk, qk)] from CornerList
2. if Orienti = 1 then
3. wi' = hi, hi' = wi
else
4. wi' = wi, hi' = hi
// If the grid (wi × hi) packed at the kth free corner overlaps with the packing
area of a free corner [(xj, yj), (pj, qj)], update the size of that free corner. //
5. for each element [(xj, yj), (pj, qj)] in CornerList do
begin
6. if (yk ≤ yj < yk + hi') and (xj < xk < xj + pj) then
7. pj = xk − xj
8. if (xk ≤ xj < xk + wi') and (yj < yk < yj + qj) then
9. qj = yk − yj
10. if pj = 0 or qj = 0 then
11. remove [(xj, yj), (pj, qj)] from CornerList
end
end UpdateCornerList;
Procedure FindNewCornerSize((wi, hi));
// Find the sizes of the two new free corners created after packing grid (wi × hi) //
1. p = ∞, q = ∞
2. if Orienti = 1 then
3. wi' = hi, hi' = wi
else
4. wi' = wi, hi' = hi
// Find the closest packed grid to the right of the new SE corner, on the line y = YLoci. //
5. for each element [(x1, y1), (x2, y2)] in XScanList do
// Each element [(x1, y1), (x2, y2)] in XScanList contains the locations
of the SW, NW corners of a packed item. (Note that x1 = x2) //
begin
6. if (y1 ≤ YLoci < y2) and (XLoci + wi' ≤ x1) then
7. if (x1 < p) then
8. p = x1
end
// Find the closest packed grid above the new SE corner, on the line x = XLoci + wi'. //
9. for each element [(x1, y1), (x2, y2)] in YScanList do
// Each element [(x1, y1), (x2, y2)] in YScanList contains the locations
of the SW, SE corners of a packed item. (Note that y1 = y2) //
begin
10. if (x1 ≤ XLoci + wi' < x2) and (YLoci < y1) then
11. if (y1 < q) then
12. q = y1
end
13. p = p − (XLoci + wi')
14. q = q − YLoci
15. Insert [(XLoci + wi', YLoci), (p, q)] into CornerList
16. p = ∞, q = ∞
// Find the closest packed grid to the right of the new NW corner, on the line y = YLoci + hi'. //
17. for each element [(x1, y1), (x2, y2)] in XScanList do
begin
18. if (y1 ≤ YLoci + hi' < y2) and (XLoci ≤ x1) then
19. if (x1 < p) then
20. p = x1
end
// Find the closest packed grid above the new NW corner, on the line x = XLoci. //
21. for each element [(x1, y1), (x2, y2)] in YScanList do
begin
22. if (x1 ≤ XLoci < x2) and (YLoci + hi' < y1) then
23. if (y1 < q) then
24. q = y1
end
25. p = p − XLoci
26. q = q − (YLoci + hi')
27. Insert [(XLoci, YLoci + hi'), (p, q)] into CornerList
28. Insert [(XLoci, YLoci), (XLoci, YLoci + hi')] into XScanList
29. Insert [(XLoci, YLoci), (XLoci + wi', YLoci)] into YScanList
end FindNewCornerSize;
Every time a grid is packed, a corner is used and two new corners are introduced. Since
there is only one corner initially, the number of elements in CornerList is at most m + 1.
An element is added to XScanList as well as to YScanList, each time a grid is packed.
Hence, the number of elements in XScanList and YScanList can be at most m. As a result,
Procedures FindBestCorner, UpdateCornerList and FindNewCornerSize take O(m)
time. Since there are m grids to be packed, the total time complexity of our packing algorithm
is O(m2).
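As a concrete illustration of the procedures above, here is a simplified sketch of the packing loop in Python. It is not the paper's implementation: instead of maintaining XScanList and YScanList, it recomputes each free corner's capacity by brute force against the rectangles already packed (so it does not achieve the O(m²) total running time), and all function and variable names are our own.

```python
# Simplified sketch of the corner-based TP packing loop (our own naming,
# not the paper's code). Corner capacities are recomputed by brute force
# against the already-packed rectangles instead of using the paper's
# XScanList/YScanList bookkeeping, trading running time for brevity.

INF = float("inf")

def _capacity(corner, packed):
    # Maximum (p, q) such that a p x q item at `corner` stays clear of the
    # packed rectangles (x, y, w, h), mirroring the two line scans in
    # FindNewCornerSize.
    cx, cy = corner
    p = q = INF
    for (x, y, w, h) in packed:
        if x >= cx and y <= cy < y + h:   # blocks growth to the right
            p = min(p, x - cx)
        if y >= cy and x <= cx < x + w:   # blocks growth upward
            q = min(q, y - cy)
    return p, q

def pack(items, R=1.0):
    # Pack (w, h) items, trying both orientations at every free corner and
    # choosing the placement that minimizes max(width, R * height) of the
    # packing, as in FindBestCorner.
    packed, corners = [], [(0.0, 0.0)]
    W = H = 0.0
    for (w, h) in items:
        best = None                       # (metric, corner, placed dims)
        for c in corners:
            p, q = _capacity(c, packed)
            for (a, b) in ((w, h), (h, w)):
                if a <= p and b <= q:
                    m = max(max(W, c[0] + a), R * max(H, c[1] + b))
                    if best is None or m < best[0]:
                        best = (m, c, (a, b))
        _, (cx, cy), (a, b) = best        # a feasible corner always exists
        packed.append((cx, cy, a, b))
        W, H = max(W, cx + a), max(H, cy + b)
        corners.remove((cx, cy))                  # used corner disappears
        corners += [(cx + a, cy), (cx, cy + b)]   # two new ones appear
    return packed, W, H
```

For R = 1 and items supplied in decreasing order of their maximum side, this is the setting of Theorem 3 below.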
Remark : A small improvement can be made if we remove from CornerList corners which
are too small to be useful. Let MINHOLESIZE = mini(min(wi, hi)), the smallest dimension
among the grids to be packed. In procedure UpdateCornerList, if either pj < MINHOLESIZE
or qj < MINHOLESIZE, then remove [(xj, yj), (pj, qj)] from CornerList. In procedure
FindNewCornerSize, if either p < MINHOLESIZE or q < MINHOLESIZE, then do not insert
the corresponding new corner into CornerList.
The following theorem shows the correctness of our packing algorithm. In the theorem,
we assume for convenience that we do not make the modifications of the above remark.
Theorem 2 : After the ith grid is packed, the following two claims hold.
(1) For any element [(xj, yj), (pj, qj)] in CornerList, the maximum size of a grid that
can be packed at corner (xj, yj) is pj × qj.
(2) Any point (x, y) in the space of the bin is either occupied by a grid or belongs to
at least one corner in CornerList.
Proof: To prove the first part of the theorem, we need to show that procedure FindNew-
CornerSize correctly computes the sizes of the two new corners. We use induction on
j, the number of grids already packed. The base case is trivial: when j = 1, both XScanList
and YScanList are empty, so the loops of lines 5–12 and 17–24 are not executed. Both p
and q remain ∞, which are the correct width and height of the maximum grid that can be
packed at the new corners. Now, assume that the sizes of all the corners were correctly
computed up to the packing of grid j − 1. We pack grid j at (XLocj, YLocj), and two
new corners, namely [(XLocj + w', YLocj), (p', q')] and [(XLocj, YLocj + h'), (p'', q'')], are
created; here w', h' are the dimensions of grid j along the X and Y directions respectively.
Let us see how p' and q' are determined in procedure FindNewCornerSize. XScanList
contains the coordinates of the SW, NW corners of each grid which was already packed. We
scan each element in XScanList and find [(xk, yk), (xk, y'k)] such that (yk ≤ YLocj < y'k) and
(XLocj + w' ≤ xk), where xk is minimum among all such elements; if no such grid exists,
then p' is correctly set to ∞. We find the corresponding element in YScanList in a similar way.
Figure 4 (a) shows such grids k, k'. In procedure FindNewCornerSize, p' and q' are set to
xk − (XLocj + w') and yk' − YLocj respectively. It is obvious from the figure that p' and q'
cannot be greater than these values. For both p' and q' to be correct, the shaded area in
Figure 4 (a) must not be occupied by any grid which was already packed. Suppose that
there is a grid l packed at (XLocl, YLocl) inside the shaded area. Figure 4 (b) shows such a
grid l. Since every grid, including grid l, was packed at a valid corner, there must be a grid n
either to the left of or below grid l. Assume that grid n is located below grid l. Now, we
apply the same argument to grid n, and so on. Eventually, there must be a grid t packed
at (XLoct, YLoct) such that XLoct < XLock and (YLoct ≤ YLocj < YLoct + ht). Figure 4 (c)
shows such a grid t. This contradicts the fact that xk is minimum among those grids.
Therefore, no such grid l can exist, and procedure FindNewCornerSize correctly determines
p' and q'. Similarly, we can prove that p'' and q'' are correctly determined.
Figure 4: Illustrations (a), (b), (c) for the Proof of Theorem 2
Proving the second part of the theorem is straightforward. We showed that the sizes of the
corners are correctly computed. Since grids do not overlap in our packing, any point (x, y)
is occupied by at most one grid. We can use an induction similar to the one used for the
first part of the theorem to prove that (x, y) belongs to the free space of at least one corner
if it is not occupied. The base case (i.e., when no grids have been packed) is trivial. Assume
that the claim is true after grid j − 1 has been packed. Now, we pack grid j at corner
[(XLocj, YLocj), (p, q)]. This corner is removed from CornerList and two new corners,
[(XLocj + w', YLocj), (p', q')] and [(XLocj, YLocj + h'), (p'', q'')], are added, where w', h'
are the dimensions of grid j along the X and Y directions respectively. Since p' and q' are
correctly computed, p' and q' must be at least p − w' and q respectively. By the same
reasoning, p'' and q'' must be at least p and q − h' respectively. Therefore, any point (x, y)
which belonged to the free space of the corner [(XLocj, YLocj), (p, q)] before is either
occupied by grid j or belongs to one of the two new corners. ∎
To show a bound on the accuracy of the solutions provided by our packing algorithm, we
impose the following restriction on packing the grids. When a wi × hi (wi ≥ hi) grid is
packed at a corner (x, y), it is placed with the side of dimension wi (the long side) parallel
to the X-axis if x < Ry, and with the long side parallel to the Y-axis if x > Ry. Suppose
there is a free corner that cannot accommodate the item in the allowed orientation but can
accommodate it in the other orientation. Then we disregard this corner even though it
might be possible to get a better packing by placing the item there. If x = Ry, then both
orientations of the grid are allowed for that corner. We call this packing algorithm the
modified TP-heuristic. We now state a result on the solution accuracy bound when R = 1
(i.e., for square meshes).
Theorem 3 : Consider the two-dimensional packing problem with R = 1, and let
w* = maxi wi. Let Wopt, Hopt be the width and height of the optimal packing and let
WTP, HTP be the corresponding values for the packing given by the modified TP-heuristic
when items are packed in decreasing order of their maximum side lengths. Assume without
loss of generality that Wopt ≥ Hopt and WTP ≥ HTP. Then WTP ≤ √2 · Wopt + 3w*.
Proof : Let us denote the regions below and above the line OP (which has unit slope) by
R1 and R2. First, we observe that there is always a corner in R1 as well as in R2 that can
accommodate a subsequent item in the allowed orientation (i.e., long side parallel to the
Y-axis in R1 and parallel to the X-axis in R2). This follows from the fact that we can always
pack a subsequent item q abutting the Y-axis (or X-axis). Since the item q' below (or to the
left of) q was packed prior to q, the maximum side of q' is longer than the maximum side
of q. Hence, there is always enough space to pack item q above (or to the right of) item q'.
Now, we show that WTP − HTP ≤ w* as follows. Let W(i)TP, H(i)TP be the width and
height of the packing after the ith item is packed. Suppose that WTP − HTP > w*. Then
there must be an item i of dimensions wi × hi such that W(i−1)TP − H(i−1)TP ≤ w* and
W(i)TP − H(i)TP > w*. It must also be true that the ith item was packed at a corner in R1;
hence W(i−1)TP < W(i)TP and H(i−1)TP ≤ H(i)TP. Let (x, y) be the location of a corner in
R2 which can accommodate the ith item. Then x ≤ y ≤ H(i−1)TP. Since
x + wi ≤ H(i−1)TP + w* < W(i)TP and y + hi ≤ H(i−1)TP + w* < W(i)TP, the maximum of
the width and height of the packing would be smaller if item i were packed at the corner
at (x, y). Hence item i should not have been packed at a corner in R1. Since we always
pack items so that the maximum of the width and height of the packing is minimized, we
conclude that WTP − HTP ≤ w*.
We only have to prove our result when WTP > 3w*, in which case HTP > 2w*. Consider
the two isosceles right triangles A1 and A2 (see Figure 5) in regions R1 and R2, with areas
(WTP − 2w*)²/2 and (HTP − 2w*)²/2 respectively. Note that all items in the region A1 (A2)
have been packed with their long sides parallel to the Y (X) axis. Thus the items are packed
in these regions as in the bottom-up left-justified (BL, for short) strategy of [5]. We can make
a similar argument on the occupancy of A1 and A2 as in [5]. Note that any vertical (or
horizontal) cut through A1 (or A2) can be partitioned into alternating segments
corresponding to cuts through unoccupied and occupied areas. Using the fact that there are
items to the right of A1 and above A2, and considering the order in which the items are
packed, we can show that the sum of the occupied segments is at least the sum of the
unoccupied segments. By integrating over the cut lines spanning WTP − 2w* (or HTP − 2w*),
we can verify that A1 (or A2) is at least half full. This means that
Wopt² ≥ (1/4)((HTP − 2w*)² + (WTP − 2w*)²) ≥ (1/2)(WTP − 3w*)²,
and the result follows from the fact that WTP − HTP ≤ w*. ∎
The bound can be improved when the items to be packed are square shaped.
Theorem 4 : Consider the same 2D packing problem as in Theorem 3, except that the items
are square shaped. If the items are packed in decreasing order of their sizes in the modified
TP-heuristic, then WTP ≤ √2 · Wopt + 2w*.
Proof : The proof of this theorem is similar to the proof of the previous theorem. It can
easily be shown that WTP − HTP ≤ w*. Since all the items are square shaped, they are
packed as in the BL strategy [5] in the two isosceles right triangles S1 and S2 (see Figure 6),
with areas (WTP − w*)²/2 and (HTP − w*)²/2 respectively. Using the fact that there are
items to the right of S1 and above S2, and considering the order in which the items are
packed, we can show that S1 and S2 are at least half occupied. This means that
Wopt² ≥ (1/4)((HTP − w*)² + (WTP − w*)²) ≥ (1/2)(WTP − 2w*)²,
and the result follows from the fact that WTP − HTP ≤ w*. ∎
The order in which we consider the grids for packing will affect the performance of
our packing algorithm. Since small grids have a better chance to fit into holes without
increasing the width or height of the packing, it is intuitively advantageous to pack them
later. Assuming that wi ≥ hi (1 ≤ i ≤ m), we consider the following four orderings of the
grids to be reasonable for good performance.
Figure 5: Regions A1 and A2 in the Packing (Theorem 3)
Figure 6: Regions S1 and S2 in the Packing (Theorem 4)
1. Decreasing order of wi
2. Decreasing order of hi
3. Decreasing order of wi · hi
4. Decreasing order of wi/hi
Since sorting the grids can be done in O(m log m) time, the time complexity of our packing
algorithm will be the same regardless of which grid ordering is used.
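For concreteness, the four orderings can be written as sort keys; the grid list below is illustrative only (with w ≥ h in each pair, as assumed in the text).

```python
# The four candidate packing orders, expressed as Python sort keys over
# (w, h) grids with w >= h. The sample grids are illustrative.
grids = [(4, 2), (5, 1), (3, 3), (6, 2)]

by_max_side = sorted(grids, key=lambda g: -g[0])         # 1. decreasing w
by_min_side = sorted(grids, key=lambda g: -g[1])         # 2. decreasing h
by_area     = sorted(grids, key=lambda g: -g[0] * g[1])  # 3. decreasing w*h
by_aspect   = sorted(grids, key=lambda g: -g[0] / g[1])  # 4. decreasing w/h
```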
4.2 A Parallel Packing Algorithm
The packing algorithm described in the previous section is sequential. One processor has
to collect all the information about the grids from the other processors and execute the
packing algorithm in order to find the processor allocation. The result of allocation should
be communicated to all the processors which in turn communicate the boundary values for
the fine grids to the corresponding processors. In this section, we present a parallel algorithm
in which m processors cooperatively execute the packing algorithm for m grids in order to
speed up the algorithm. The same packing method that was used in the sequential algorithm
will be used in the parallel algorithm described here.
In the adaptive mesh algorithm, each rectangular grid at level i may be generated across
more than one processor. One of those processors (the one with the lowest index) produces
a packet < (wj, hj), Sj > for grid j, where (wj, hj) is the dimension of grid j and Sj is its
own processor index. These packets contain the necessary information that forms the input
data to our packing and processor allocation algorithms. The detailed description of the
algorithm is given below. The topology of the multiprocessor on which the algorithm runs
is assumed to be either a mesh or a hypercube.
Algorithm Parallel Packing
(a) Let PS be a set of m processors forming a submesh or a subcube, that is, PS = {Pbi :
0 ≤ i ≤ m − 1}. The processors which produced packets send them to the processors in
PS (one packet per processor) using procedure TokenPacking. The processors in
PS will do the remaining steps of our parallel packing algorithm.
(b) Sort the packets according to the packing order using a parallel sorting algorithm. After
sorting, assume that processor Pbi is holding packet < (wi, hi), Si >.
(c) Store the initial free corner [(0, 0), (∞, ∞)] at processor Pb0. Each processor Pbi sets both
width and height to 0. Now, pack the grids one by one by performing m iterations,
where in the ith iteration (0 ≤ i ≤ m − 1) we call procedure GridPacking(i).
(d) After step (c), each processor Pbi has (XLoci, YLoci), that is, the free corner where
grid (wi × hi) was packed, and the width and height of the packing. Each processor Pbi
includes the above information in its packet < (wi, hi), Si > and sends it to Si using
procedure SendPacketToSource.
end Parallel Packing;
Procedure TokenPacking;
// Here a subset of processors Pj0, Pj1, ..., Pjl with j0 < j1 < ... < jl has one packet
each, and it is desired to store the packet of Pjk in the kth processor of PS. Using greedy
routing, this operation can be done in O(log N) time on a hypercube and O(√N) time on a
mesh [26]. //
Procedure GridPacking(i);
1. Call procedure Broadcast(bi, < (wi, hi), Si >).
2. Call parallel procedure FindBestCorner((wi, hi), k) to find the corner (xk, yk) where
the grid (wi × hi) is to be packed, and its orientation.
3. Processor Pbi sets XLoci and YLoci to xk and yk respectively, and sets Orienti to
Orientk. Also, Pbi determines the width and height of the packing after (wi × hi) is
packed.
Call procedure Broadcast(bi, ((xk, yk), (width, height), Orienti)).
4. Call parallel procedure UpdateCorners((wi, hi), k) to update the sizes of the corners
after the grid (wi × hi) is packed at (xk, yk).
5. Call parallel procedure FindNewCornerSize((wi, hi), k) to determine the sizes of the
two new corners.
Procedure FindBestCorner((wi, hi), k);
1. Each processor Pbj does the following
begin
Let [(xj, yj), (pj, qj)] be the corner Pbj is holding
mx1 = ∞, my1 = ∞, mx2 = ∞, my2 = ∞
if (pj − wi) ≥ 0 and (qj − hi) ≥ 0 then
mx1 = max(width, xj + wi), my1 = max(height, yj + hi)
if (pj − hi) ≥ 0 and (qj − wi) ≥ 0 then
mx2 = max(width, xj + hi), my2 = max(height, yj + wi)
if neither orientation fits then set mj to ∞ and goto the next step
m1 = max(mx1, R · my1)
m2 = max(mx2, R · my2)
mj = min(m1, m2)
if (m1 < m2) then Orientj = 0
else Orientj = 1
end
If Pbj has two corners, then choose the one which gives the smaller mj.
If Pbj does not have any corners, then set mj to ∞.
2. Call parallel procedure FindMin({mj} (0 ≤ j ≤ m − 1), mk, bi)
end FindBestCorner;
Procedure UpdateCorners((wi, hi), k);
1. Processor Pbk removes [(xk, yk), (pk, qk)].
2. Each processor Pbj does the following
begin
Let [(xj, yj), (pj, qj)] be the corner Pbj is holding
if Orienti = 1 then
w' = hi, h' = wi
else
w' = wi, h' = hi
if (yk ≤ yj < yk + h') and (xj < xk ≤ xj + pj) then
pj = xk − xj
if (xk ≤ xj < xk + w') and (yj < yk ≤ yj + qj) then
qj = yk − yj
if pj = 0 or qj = 0 then
remove [(xj, yj), (pj, qj)]
end
If Pbj has two corners, then update the second one in the same way.
end UpdateCorners;
Procedure FindNewCornerSize((wi, hi), k);
1. Each processor Pbj (j < i) does the following
begin
if Orienti = 1 then
w' = hi, h' = wi
else
w' = wi, h' = hi
if (YLocj ≤ yk < YLocj + hj) and (xk + w' ≤ XLocj)
then pj = XLocj
else pj = ∞
if (XLocj ≤ xk + w' < XLocj + wj) and (yk < YLocj)
then qj = YLocj
else qj = ∞
end
2. Call parallel procedure FindMin({pj} (0 ≤ j < i), p, bi)
3. Call parallel procedure FindMin({qj} (0 ≤ j < i), q, bi)
4. Find p' and q' in the same way for the other corner (XLoci, YLoci + h').
5. Processor Pbi sets p = p − (XLoci + w') and q = q − YLoci, and stores the two new
corners, [(XLoci + w', YLoci), (p, q)] and [(XLoci, YLoci + h'), (p', q')].
end FindNewCornerSize;
Procedure Broadcast(j, V);
// Processor Pj broadcasts value V to all the processors in PS; this takes O(log N) time on a
hypercube and O(√N) time on a mesh [26]. //
Procedure FindMin({aj}, b, j);
// Given the aj values, one value per processor, find the minimum (b) of these values and
store it in processor Pj. This operation can be done in O(log N) time on a hypercube
and O(√N) time on a mesh. //
Procedure SendPacketToSource;
1. Each processor Pbi includes XLoci, YLoci, Orienti, width and height in the packet
< (wi, hi), Si >.
2. For a hypercube, sort the packets according to Si using a parallel sorting algorithm.
3. Send each packet < (wi, hi), Si, (XLoci, YLoci), Orienti, (width, height) > to processor
Si. For this, use the monotone routing (see chapter 3) for a hypercube and one-to-one
routing [26] for a mesh.
end SendPacketToSource;
Finding a corner to pack the next grid, updating the sizes of the corners, and determining
the sizes of the two new corners are done in the same way as in the sequential packing
algorithm. After executing the above algorithm, processor Si can find a submesh for grid
(wi × hi) using the information included in the packet.
Time complexity analysis for a hypercube :
Step (a) takes O(log N) time and step (b) takes O(log² m) time. Both procedures FindBest-
Corner and FindNewCornerSize take O(log m) time; procedure UpdateCorners takes
constant time. Hence step (c) takes O(m log m) time. Step (d) takes O(log N + log² m)
time. The total time complexity for a hypercube is O(log N + m log m).
Time complexity analysis for a mesh :
Step (a) takes O(√N) time and step (b) takes O(√m log m) time. Both procedures Find-
BestCorner and FindNewCornerSize take O(√m) time; procedure UpdateCorners
takes constant time. Hence step (c) takes O(m√m) time. Step (d) takes O(√N) time.
The total time complexity for a mesh is O(√N + m√m).
4.3 Allocation of Processors
In the previous sections, we described algorithms for packing the grids in such a way that
the ratio of width to height is as close to the mesh ratio as possible while the packing area
is minimized. Now, we are ready to allocate a set of processors (a submesh) to each grid
using the results of the packing. Let W and H be the width and height of the packing,
and let (xi, yi) be the location of the southwest corner of grid i with dimensions wi × hi.
Assume that we are given a P × Q (P ≥ Q) mesh of multiprocessors and let W ≥ H. Let
((i, j), (x × y)) be an x × y submesh with its southwest corner processor indexed (i, j). We
propose the following two ways to allocate submeshes to grids.
1. Assign submesh
((⌊xi·P/W⌋, ⌊yi·Q/H⌋), (⌊(xi + wi)·P/W⌋ − ⌊xi·P/W⌋) × (⌊(yi + hi)·Q/H⌋ − ⌊yi·Q/H⌋)) to grid i.
2. If (Q/P ≤ H/W), then assign submesh
((⌊xi·Q/H⌋, ⌊yi·Q/H⌋), (⌊(xi + wi)·Q/H⌋ − ⌊xi·Q/H⌋) × (⌊(yi + hi)·Q/H⌋ − ⌊yi·Q/H⌋)) to grid i;
else assign submesh
((⌊xi·P/W⌋, ⌊yi·P/W⌋), (⌊(xi + wi)·P/W⌋ − ⌊xi·P/W⌋) × (⌊(yi + hi)·P/W⌋ − ⌊yi·P/W⌋)) to grid i.
The first method tries to use as many processors as possible to reduce the computational
workload, but may distort the aspect ratios of the submeshes relative to those of the grids.
In the second method, we allocate processors in such a way that Xi/Yi ≈ wi/hi. In other
words, the second method uses uniform scaling in the X and Y directions, while the first
method uses different scaling factors in the X and Y directions in order to reduce the packing
to the given mesh size. In the second method, we may not utilize all the processors in the
mesh. We will call the first and second methods nonuniform scaling and uniform scaling
respectively. The two methods are identical when Q/P = H/W.
To make sure that both Xi and Yi are nonzero, that is, at least one processor is allocated
to each grid, we impose the following lower bound on the size of each grid. Let D =
Σi max(wi, hi). D is a very loose upper bound on max(W, H), where W and H are the width
and height of any packing. Then, mini(min(wi, hi)) ≥ D/Q should be satisfied. To achieve
a better bound on D, we can quickly pack the grids in the following way. Each time a
grid is packed, we examine only two corners, one on the X-axis and the other on the Y-axis.
Then we use max(width, height) as D. Since we always consider only two corners, this
packing takes only O(m) time. After finding the lower bound on the size of the grids, we
change wi or hi to this value if they are smaller than it.
4.4 Processor Allocation for Hypercubes
If we are given a hypercube, we can still use the above method to allocate processors for
grids assuming that we are given a mesh. Then, we can easily embed the mesh in the
given hypercube using reflected Gray code [10]. The reflected Gray code is generated in the
following way. We start with the 1bit Gray code sequence {0, 1}, and then insert a zero and
a one in front of the two elements obtaining the two sequences {00, 01} and {10, 11}. We
then reverse the second sequence to obtain {11, 10}, and then concatenate the two sequences
to obtain the 2bit reflected Gray code.
{00,01,11,10}
Generally, given a (d − 1)-bit code
{b1, b2, ..., bp},
where p = 2^(d−1) and b1, b2, ..., bp are binary strings, the corresponding d-bit code is
{0b1, 0b2, ..., 0bp, 1bp, 1bp−1, ..., 1b1}.
The preceding recursive construction of the reflected Gray code sequence can be generalized.
Let da and db be positive integers, and let d = da + db. Suppose that {a1, a2, ..., apa} and
{b1, b2, ..., bpb} are the da-bit and db-bit reflected Gray code sequences, where pa = 2^da
and pb = 2^db. Consider the pa × pb matrix of d-bit strings {aibj : i = 1, 2, ..., pa,
j = 1, 2, ..., pb}:
a1b1   a1b2   ...   a1bpb
a2b1   a2b2   ...   a2bpb
...
apab1  apab2  ...   apabpb
The preceding construction also shows that the nodes of a d-cube can be arranged along a
two-dimensional mesh with pa and pb nodes in the first and second dimensions, respectively.
The (i, j)th element of the mesh, where i = 1, 2, ..., pa and j = 1, 2, ..., pb, is the d-cube
node with identity number aibj.
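The recursive construction and the mesh arrangement can be sketched directly (function names are ours):

```python
def gray(d):
    # d-bit reflected Gray code, built exactly as in the text: prefix the
    # (d-1)-bit sequence with 0, then append its reversal prefixed with 1.
    seq = ['']
    for _ in range(d):
        seq = ['0' + s for s in seq] + ['1' + s for s in reversed(seq)]
    return seq

def mesh_embedding(da, db):
    # Arrange the nodes of a (da + db)-cube on a 2^da x 2^db mesh; the
    # (i, j)th entry is the node with identity a_i b_j.
    a, b = gray(da), gray(db)
    return [[ai + bj for bj in b] for ai in a]
```

Neighboring mesh entries differ in exactly one bit, so mesh neighbors are hypercube neighbors.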
In an N-node hypercube, we can embed log √N + 1 different meshes with a power-of-two
number of nodes in each dimension, namely (2^(log √N) × 2^(log √N)),
(2^(log √N + 1) × 2^(log √N − 1)), ..., (2^(log N) × 2^0). We can pick one of these meshes and
pack the grids using its mesh ratio. Or, we can try all of the above meshes and pick the
one for which the W/H ratio of the packing is closest to the corresponding mesh ratio. Since
we use only m processors in the parallel packing algorithm, we can pack the grids using
⌊N/m⌋ different mesh ratios at the same time. If we try all log √N + 1 meshes using the
parallel packing algorithm, we have to repeat the packing algorithm ⌈(log √N + 1)/⌊N/m⌋⌉
times. This takes O(⌈(log √N + 1)/⌊N/m⌋⌉ · m log m) time, since one execution of the
packing algorithm takes O(m log m) time.
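The family of embeddable meshes can be enumerated as follows (a sketch with our own naming; N is assumed to be a power of two):

```python
def candidate_meshes(N):
    # All (P, Q) meshes with power-of-two sides, P >= Q and P * Q = N, that
    # an N-node hypercube can host; there are log(sqrt(N)) + 1 of them.
    d = N.bit_length() - 1          # N = 2^d
    return [(2 ** (d - k), 2 ** k) for k in range(d // 2 + 1)]
```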
4.5 Experimental Results
We performed experiments to measure the performance of our approximation algorithm
based on the TP-heuristic. As in the experiments for the first approach, we start with some
number of (coarse) grids and assume that at each level of adaptive mesh refinement, the
number of fine grids (child grids) created within each coarse grid region is exponentially
distributed with a mean value of 1. The number of mesh points in each fine grid is uniformly
distributed in the range [MINSIZE, MAXSIZE]. The aspect ratios of the fine grids, namely
wi/hi (wi ≥ hi), are also assumed to be uniformly distributed in the range [1, MAXRATIO]. In
our experiments, we went through 200 levels of refinement. For assignment of fine grids at
each level, we first solve the twodimensional packing problem and then allocate a submesh
for each grid using the methods described in the previous section.
We compared the performance of our algorithm with that of an algorithm based on the
modified first-fit level heuristic (LP) discussed in [18] for packing rectangles into a bin of
fixed width. In this heuristic, rectangles (grids) are considered in decreasing order of height
and packed level by level as in the first-fit heuristic. But successive levels are packed
left-to-right and right-to-left in the bin of fixed width, so that the tallest item in a level will
be above the shortest item in the previous level. This allows the items to drop down after
packing, which may in turn reduce the total height of the packing. Since in our problem the
width is not fixed, we have to apply an iterative process where, in each iteration, we use the
LP-heuristic to determine the height of the packing for some fixed width. The initial width
is set equal to √(RS), where S = Σi (wi · hi) and R is the mesh ratio. Then the width is
increased by a small amount (1% of the width) each time until we get a W/H ratio that is
just above the processor mesh ratio P/Q.
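The iterative width search around the LP level-packer can be sketched as follows. `pack_with_width` is a stand-in interface for an LP-style packer with fixed bin width (an assumption of ours, not the paper's code), and the 1% growth step matches the description above.

```python
from math import sqrt

def lp_width_search(grids, R, P, Q, pack_with_width):
    # Start from width sqrt(R * S), where S is the total grid area, and
    # grow the width by 1% per iteration until the packing's W/H ratio
    # just exceeds the processor mesh ratio P/Q.
    S = sum(w * h for (w, h) in grids)
    W = sqrt(R * S)
    H = pack_with_width(grids, W)
    while W / H <= P / Q:
        W *= 1.01
        H = pack_with_width(grids, W)
    return W, H
```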
The objective function values were computed using the following formula:
cost = maxi ( Ucomp · (wi · hi)/(Xi · Yi) + 2 · Ucomm · F · (wi/Xi + hi/Yi) ),
where Ucomp, Ucomm and F are assumed to be 1 for simplicity. First, we compare the solution
values obtained by the nonuniform and uniform scaling processor allocation methods in
Figure 7. Here, we started with 40 grids. MINSIZE and MAXSIZE are set to k · 0.7
and k · 1.3, where the constant k is set based on the average number of mesh points per
processor, which is assumed to be 300. A 32 × 32 processor mesh was used for assignment.
The experiment was repeated for different aspect ratios of grids. Figure 7 shows that the
nonuniform scaling method performs slightly better than the uniform scaling method. The
narrow difference in the performance of the two methods is attributed to the fact that our
algorithm gives a packing whose W/H ratio is close to the P/Q ratio of the mesh. In
our remaining experiments, we use only the nonuniform scaling method for allocation of
submeshes. Figure 8 shows the solution values obtained with the four different packing
orders of grids in the TP-heuristic. From the results, we see that packing grids in decreasing
order of grid size (wi · hi) performs best, while packing them in decreasing order of aspect
ratio (wi/hi) performs worst. We use only decreasing order of grid size as the packing
order in our remaining experiments.
Figure 9 – Figure 24 show comparisons of the TP and LP heuristics for different values of
various parameters. In the figures, we show the cumulative sum (across 200 levels) of (a) the
computational cost, (b) the communication cost, (c) the sum of (a) and (b), and (d) the
average processor utilization, which indicates how tightly the grids are packed by the packing
algorithm. The processor utilization is the ratio of the number of processors used in allocation
to the total number of processors. Figure 9 – Figure 12 show how the performance of each
heuristic varies with the variance in grid size (i.e., the number of grid points). In this
experiment, MINSIZE and MAXSIZE are set to k · (1 − VAR) and k · (1 + VAR) respectively,
and VAR varies from 0.1 to 0.9. We used the same values for the other parameters as in the
previous two experiments. Figure 13 – Figure 16 show how the performance of each heuristic
varies with the aspect ratios of the grids. Figure 17 – Figure 20 show how the performance
of each heuristic varies with the processor mesh ratio. In this experiment, we use a
32·MESHRATIO × 32 mesh for assignment, where MESHRATIO varies from 1 to 3.
Figure 21 – Figure 24 show how the performance of each heuristic varies with the initial
number of grids. Below we give a summary of our observations.
Figure 7: The Comparison of Two Different Allocation Methods
Figure 8: The Comparison of Four Different Packing Orders
Figure 9: Computation Costs for Different Grid Size Variances
Figure 10: Communication Costs for Different Grid Size Variances
Figure 11: Total Costs for Different Grid Size Variances
Figure 12: Processor Utilizations for Different Grid Size Variances
Figure 13: Computation Costs for Different Aspect Ratios of Grids
Figure 14: Communication Costs for Different Aspect Ratios of Grids
Figure 15: Total Costs for Different Aspect Ratios of Grids
Figure 16: Processor Utilizations for Different Aspect Ratios of Grids
Figure 17: Computation Costs for Different Aspect Ratios of Processor Meshes
Figure 18: Communication Costs for Different Aspect Ratios of Processor Meshes
Figure 19: Total Costs for Different Aspect Ratios of Processor Meshes
Figure 20: Processor Utilizations for Different Aspect Ratios of Processor Meshes
Figure 21: Computation Costs for Different Numbers of Grids
Figure 22: Communication Costs for Different Numbers of Grids
Figure 23: Total Costs for Different Numbers of Grids
Figure 24: Processor Utilizations for Different Numbers of Grids
1. As can be seen in Figure 9 – Figure 12, the TP-heuristic performs better than the LP
heuristic when the variation in grid sizes is reasonably large. The processor utilization
also increases with the variation in grid sizes for the TP-heuristic. Since the processor
utilization indicates how tightly the grids are packed by the packing algorithm, this
result demonstrates that our proposed heuristic gives a good packing in practice.
2. Figures 13 and 16 show that LP outperforms TP only when the grids are square
or close to square. As the aspect ratio increases, TP produces better solutions
than LP, as indicated by smaller computation and communication costs as well as
higher processor utilization. For both heuristics, the total cost increases with the
maximum aspect ratio of the grids, which implies that it is difficult to pack the
grids tightly when the variance in aspect ratios is large.
3. The better performance of TP over LP for large variation in grid sizes or larger
aspect ratios can be attributed to the following: TP allows the smaller grids, which
are packed later, to fill holes created during the packing of larger grids, and this
keeps the width and height of the packing as small as possible.
4. Figures 17 and 20 show that LP outperforms TP when the processor mesh ratio is
larger than 2. The higher processor utilization of the LP heuristic indicates that LP
can pack rectangles tightly when the width of the bin is large; in that case the number
of levels in the packing is small.
5. Figures 21 and 24 indicate that TP always performs better than LP regardless
of the number of grids to be packed. The processor utilization of TP appears to
increase slightly as the number of grids increases.
6. In all the figures, the computation and communication costs show the same behavior.
The reason is that both costs are inversely proportional to the number of processors
allocated, and processors are allocated to the grids uniformly according to their
numbers of grid points.
7. In most cases, TP outperforms LP in both total cost and processor utilization.
One reason is that the TP heuristic takes advantage of the two possible orientations
allowed in the packing. TP has the additional advantage over LP that it avoids
repacking and hence has lower time complexity.
5 Extension for Intergrid Communication Cost
The solution method based on two-dimensional packing ignores intergrid communication
cost. In this section, we discuss possible approaches for minimizing intergrid communication
cost as well as computation and intragrid communication costs. We first consider the
one-dimensional problem and propose an iterative approach. We then discuss the possibility
of applying it, and other approaches, to the two-dimensional case.
5.1 One-Dimensional Problem
Processor allocation for the one-dimensional problem is easier than for the two-dimensional
problem, and we can find an optimal solution when there is only one grid at each level. Let X
be the number of contiguous processors (in a one-dimensional mesh) allocated to the newly
generated fine grid. Then the length of the communication path for intergrid communication
is X + a in the worst case, where a depends on the actual locations of these X processors.
When there is only one fine grid at each level, we can allocate X processors that include the
processors where the fine grid was generated; a then becomes a constant. For the fine grid
at level 2, X = 2 and a = 2 in Figure ?? (a), and X = 4 and a = 2 in Figure ?? (b).
If we uniformly divide the fine grid among the X processors, the number of grid points each
processor is responsible for is M/X, where M is the total number of grid points in the fine
grid. The problem is then to find the X that minimizes

    T_i = U_comp (M/X) + U_comm (M/X)(X + a) + C_I

The first term is the computation cost for each processor, to which M/X grid points are
assigned. The second term is the worst-case intergrid communication cost when the data
have to travel through a path of length X + a. The third term C_I is the cost of intragrid
communication; it is a constant because, regardless of X, it is necessary to exchange only
two grid-point values (left end and right end) with neighboring processors. T_i decreases
as X increases. But as we increase X, M/X eventually drops to the packet size S_p, after
which the communication cost is proportional only to X + a. Then the above equation becomes
    T_i = U_comp (M/X) + U_comm S_p (X + a) + C_I,   when M/X < S_p
Differentiating T_i,

    dT_i/dX = -U_comp M / X^2 + U_comm S_p

The derivative becomes 0 when X equals sqrt(U_comp M / (U_comm S_p)). Therefore, T_i is
minimized when X equals

    max( M/S_p , sqrt(U_comp M / (U_comm S_p)) )
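The closed-form choice of X above can be sketched as a small helper (a hypothetical rendering, not the authors' code; the parameters mirror M, U_comp, U_comm, S_p, and the mesh size N from the text, and the result is rounded up to an integer processor count):

```python
import math

def optimal_processor_count(M, U_comp, U_comm, S_p, N):
    """Closed-form minimizer of T_i = U_comp*(M/X) + U_comm*S_p*(X + a) + C_I.

    Neither a nor C_I affects the minimizing X, so both are omitted here.
    X must also be at least M/S_p for the packet-size regime to apply,
    hence the outer max.
    """
    x = max(M / S_p, math.sqrt(U_comp * M / (U_comm * S_p)))
    # round up to an integer processor count, capped by the mesh size N
    return max(1, min(N, math.ceil(x)))
```

For instance, with M = 1000, U_comp = 4, U_comm = 1, S_p = 16 and N = 64, the M/S_p term dominates and the helper returns 63 processors.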
This analysis is valid when there is only one fine grid at level i. The problem becomes
more difficult when there is more than one fine grid at level i. If each grid is assigned to
a set of X processors in the neighborhood of the set of processors where it was generated,
a grid can overlap with other nearby grids in some processors. The workload in those
processors would then be twice as much as that in the other processors. To avoid this
workload imbalance, some grids have to be assigned to processors far from those where they
were generated, which makes the maximum length of the data path greater than X. Let us
assume that there are m fine grids (grid 1, grid 2, ..., grid m from left to right in the
one-dimensional domain) at level i and that the number of grid points in grid j is M_j. We
also assume that grid j is assigned to processors P_{l_j}, P_{l_j+1}, ..., P_{r_j}, so that
X_j = r_j - l_j + 1. If we allow grids to overlap in some processors, workload imbalance
among processors is inevitable; hence, we avoid assigning more than one grid to any
processor. Now, when there are m grids at level i, the computation time T_i becomes:
    T_i = max_j [ U_comp (M_j/X_j) + U_comm f(X_j)(X_j + a_j) ] + C_I

where f(X_j) = max(M_j/X_j, S_p). Here a_j is the extra path length caused by assigning
grid j away from the processors where it was generated in order to avoid overlapping. Each
a_j depends on the assignments of the other grids at the same level, unlike the single-grid
case, in which it is a constant. Therefore, we cannot compute the individual values of X_j
in the same way as before. In general, it is difficult to find each X_j without knowing
a_j. Here,
we propose a simple heuristic using an iterative procedure.
1. For each j, set X_j = max( M_j/S_p , sqrt(U_comp M_j / (U_comm S_p)) ).
This is the optimal number of processors allocated to grid j if we ignore the other
grids at the same level. If the sum of the X_j exceeds N, each X_j is reduced
proportionally according to M_j.
2. Let s_j be the index of the processor where grid j was generated. If grid j was generated
in a set of processors, let s_j be the index of the leftmost one. Allocate
P_{l_j}, P_{l_j+1}, ..., P_{r_j} to grid j as follows:

    ptr = 0
    for j = 1 to m
        l_j = max(ptr, s_j); r_j = l_j + X_j - 1; ptr = r_j + 1
    if (ptr > N) then
        ptr = N - 1
        for j = m downto 1
            r_j = min(ptr, r_j); l_j = r_j - X_j + 1; ptr = l_j - 1
3. Using the above processor allocation, compute T_i and a_j = l_j - s_j. Also find the
index j_max for which T_i = U_comp (M_{j_max}/X_{j_max}) +
U_comm f(X_{j_max})(X_{j_max} + a_{j_max}) + C_I. Since each X_j was obtained assuming
a_j to be 0, we may be able to reduce T_i by reducing a_{j_max}. Assuming a_{j_max} > 0,
let k be the largest index j < j_max with a_j = 0, and reallocate processors to each
grid j (k < j <= j_max).
4. Recompute T_i using the above reallocation of processors. If it improved, go to step 1.
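The allocation pass of step 2 can be sketched in Python (a hypothetical rendering with 0-indexed processors; s[j] is where grid j was generated and X[j] is its processor count from step 1):

```python
def allocate_intervals(s, X, N):
    """Step 2 of the heuristic: place grid j on processors l[j]..r[j],
    left to right, pushing each interval to the right of the previous one;
    if the last interval runs off the mesh, sweep back right-to-left
    pulling the intervals leftward."""
    m = len(X)
    l = [0] * m
    r = [0] * m
    ptr = 0
    for j in range(m):
        l[j] = max(ptr, s[j])
        r[j] = l[j] + X[j] - 1
        ptr = r[j] + 1
    if ptr > N:                      # ran past processor N-1: backward sweep
        ptr = N - 1
        for j in range(m - 1, -1, -1):
            r[j] = min(ptr, r[j])
            l[j] = r[j] - X[j] + 1
            ptr = l[j] - 1
    return l, r
```

For example, with s = [0, 6], X = [4, 4] and N = 8, the forward pass overflows and the backward sweep pulls grid 2 back so the two grids occupy processors 0..3 and 4..7.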
5.2 Two-Dimensional Problem
Considering intergrid communication cost for the two-dimensional problem is much more
complicated. When the grids are clustered in the computational domain, the processor
allocation for one grid cannot be determined independently, since the grids are located
close to each other. The iterative heuristic used above for the one-dimensional problem
can be considered for the two-dimensional problem. But after we find an X_i x Y_i submesh
for each grid i, assuming there is only one grid at a level, we still need to find the
locations of these submeshes. For the one-dimensional problem, we assigned the m linear
subarrays in the same order as the grids; here, we need to solve a packing problem to
arrange the submeshes without overlapping. This packing problem differs from the one
considered in the previous section, since we have to assign submeshes that are somewhat
close to the original locations of the grids. Moreover, this problem needs to be solved
repeatedly after we reduce the sizes of the submeshes by some quantities. For these
reasons, we cannot apply the same heuristic to the two-dimensional problem without major
changes.
One possible solution for assigning submeshes is to start with a submesh for a grid located
at the center, and then to assign the other submeshes around it using a simple packing
algorithm. But after reducing the sizes of the submeshes, we have to rearrange them so
that the maximum intergrid communication distance is reduced, which may not be a trivial
problem if the submeshes cannot overlap.
Another possible way of allocating submeshes while considering intergrid communication
cost is to impose a constraint on it, as we did in the first approach. When the grids are
clustered, we can compute the size of a submesh to which all m grids will be assigned.
Let D(PS_i) be the diameter of a submesh PS_i; then we require
max_i U_comm D(PS_i)(B_i + M_i) <= C_max, where all the other notations are defined in
chapter 3. The size of the submesh should be small enough that the maximum intergrid
communication cost cannot exceed the limit no matter which processors within the submesh
are allocated to a grid. Then we can use the two-dimensional packing approach described
in the previous section to allocate processors to each grid within the submesh. Since we
do not use all the processors, the computation time will increase. When there is more
than one cluster of grids, we have to find a submesh for each cluster. This problem is the
same as finding a submesh for each grid, but the number of submeshes to find is much
smaller because we need only one submesh per cluster of grids. If the number of clusters
is reasonably small, it will be easier than solving the original problem.
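The feasibility side of this constraint can be sketched as a small check (hypothetical: the diameter of a p x q submesh is taken as (p - 1) + (q - 1), and comm[i] stands in for the per-grid term B_i + M_i defined in chapter 3):

```python
def submesh_feasible(p, q, comm, U_comm, C_max):
    """True if a p x q submesh keeps the worst-case intergrid communication
    cost U_comm * D * (B_i + M_i) of every grid within the budget C_max,
    where D = (p - 1) + (q - 1) is the mesh diameter."""
    diameter = (p - 1) + (q - 1)
    return all(U_comm * diameter * c <= C_max for c in comm)
```

A search for the largest feasible submesh would then shrink p and q until this check passes, trading computation time for bounded intergrid communication, as described above.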
There is another alternative when the grids are scattered instead of clustered. Consider
the processor mesh as the domain and the fine grids generated on it as the workload. Then
we partition the processors just as we decompose the domain using binary decomposition
(see Figure 25). The processors are partitioned in such a way that the computational
requirements of the grids are balanced among the different groups of processors. We repeat
partitioning the processors until the maximum intergrid communication cost cannot exceed
the limit when processors are allocated within a group. Then we again use the packing
approach for each group of processors. This approach will work when the fine grids are
reasonably scattered.

Figure 25: Partitioning of Processors
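The binary decomposition of the processor mesh can be sketched recursively (a hypothetical illustration, not the authors' code: work[i][j] is the fine-grid workload attributed to processor (i, j), and each split cuts the longer side near the median of the cumulative workload):

```python
def bisect(work, r0, r1, c0, c1, depth, groups):
    """Recursively split the processor rectangle [r0, r1) x [c0, c1)
    so that the two halves carry roughly equal workload; stop after
    `depth` levels or when a single processor remains."""
    if depth == 0 or (r1 - r0) * (c1 - c0) <= 1:
        groups.append((r0, r1, c0, c1))
        return
    total = sum(work[i][j] for i in range(r0, r1) for j in range(c0, c1))
    if r1 - r0 >= c1 - c0:
        # split along rows at the workload median
        acc, cut = 0, r0 + 1
        for i in range(r0, r1 - 1):
            acc += sum(work[i][j] for j in range(c0, c1))
            if 2 * acc >= total:
                cut = i + 1
                break
        bisect(work, r0, cut, c0, c1, depth - 1, groups)
        bisect(work, cut, r1, c0, c1, depth - 1, groups)
    else:
        # split along columns at the workload median
        acc, cut = 0, c0 + 1
        for j in range(c0, c1 - 1):
            acc += sum(work[i][j] for i in range(r0, r1))
            if 2 * acc >= total:
                cut = j + 1
                break
        bisect(work, r0, r1, c0, cut, depth - 1, groups)
        bisect(work, r0, r1, cut, c1, depth - 1, groups)
```

In the overall scheme, the recursion would stop as soon as each group is small enough that the intergrid communication budget is met for allocations within the group.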
We have considered a few possible approaches for including intergrid communication cost
in our objective function. Each approach seems suitable for a particular type of fine grid
distribution among the processors, but deciding which approach to apply after generating
a set of fine grids is a challenging problem. Thus, the inclusion of intergrid communication
cost makes the load balancing problem considerably more complicated and requires further
study.
6 Conclusions
We have formulated the dynamic processor allocation algorithm that arises in mapping adap
tive mesh computations onto multiprocessors. For mesh (and hypercube) architectures, the
problem can be transformed into a 2D packing problem. We proposed an efficient approxima
tion heuristic and derived accuracy bounds for square meshes. We also demonstrated through
experiments how our approximation algorithm can provide solutions with smaller compu
tational and communication costs than one of the wellknown packing heuristics adapted
to this problem. We derived the bounds for our heuristic when the grids are packed in de
creasing order of their maximum side lengths. We plan to analyze its performance for other
orders such as decreasing order of areas, aspect ratios etc.. One of our challenging future
tasks is to address the load balancing problem when intergrid communication cost is also
considered in the objective function.