Group Title: Department of Computer and Information Science and Engineering Technical Reports
Title: An Efficient dynamic load balancing algorithm for adaptive mesh refinement
CITATION PDF VIEWER THUMBNAILS PAGE IMAGE
Full Citation
STANDARD VIEW MARC VIEW
Permanent Link: http://ufdc.ufl.edu/UF00095211/00001
 Material Information
Title: An Efficient dynamic load balancing algorithm for adaptive mesh refinement
Series Title: Department of Computer and Information Science and Engineering Technical Reports
Physical Description: Book
Language: English
Creator: Hwang, Injae
Varadarajan, Ravi
Affiliation: University of Florida
University of Puerto Rico -- Rio Pedras
Publisher: Department of Computer and Information Sciences, University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: May 26, 1994
Copyright Date: 1994
 Record Information
Bibliographic ID: UF00095211
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.

Downloads

This item has the following downloads:

1994125 ( PDF )


Full Text













An efficient dynamic load balancing algorithm for
adaptive mesh refinement


Ravi Varadarajan
Dept. of Mathematics,
University of Puerto Rico
Rio Piedras, PR 00931
Email: r_varadarajai fuprl.upr.clu.edu
Fax : (809) 754-0757

Injae Hwang
Comp. and Info. Sciences Department,
University of Florida
Gainesville, FL 32611
Email: ih cis.ufl.edu


May 26, 1994























Abstract

In numerical algorithms based on adaptive mesh refinement, the computational workload
changes during the execution of the algorithms. In mapping such algorithms on to distributed
memory architectures, dynamic load balancing is necessary to balance the workload among
the processors in order to obtain high performance. In this paper, we propose a dynamic
processor allocation algorithm for a mesh architecture that reassigns the workload in an
attempt to minimize both the computational and communication costs. Our algorithm is
based on a heuristic for a 2D packing problem that gives provably close to optimal solutions
for special cases of the problem. We also demonstrate through experiments how our algorithm
provides good quality solutions in general.









1 Introduction

Parallel computers with thousands of processors are becoming a commercial reality due to
advanced VLSI technologies. These machines can provide tremendous computational power
needed to solve many computationally intensive problems that arise in different fields of
science and engineering such as aerodynamics, nuclear physics and weather forecasting. In
solving such problems, it is necessary to distribute the computational load among the pro-
cessors in order to achieve good performance as measured by speed-up, processor utilization
etc.. The term "load balancing" refers to this activity of distributing or redistributing the
computational load in order to achieve high performance.
In multiprocessor computer systems with distributed memory, load balancing should also
take into consideration the interprocessor communication cost that arises when for example,
a computation in a processor needs data located in another processor's memory. For simple
applications, it is sufficient for the programmer to perform static load balancing by properly
distributing the data before the parallel program begins its computation. But there are many
engineering applications wherein simple static load balancing techniques are not sufficient
to ensure high performance. A case in point is the solution of partial differential equations
using adaptive meshes.
In numerical methods that employ finite differences to solve such problems, a rectangular
mesh is imposed on the computational domain and the solution is evaluated at the mesh
points; the mesh size (or spacing between mesh points) is related to the required error
tolerance in the solution. It is computationally advantageous to have mesh points spaced
more closely in regions with steep gradients and have coarser mesh points in the regions where
the solution is smooth. In many problems (for example, flame propagation in combustion),
the regions of steep gradient are difficult to predict before the computation begins. For this
reason, we start with a uniform mesh of coarse size and finer and finer meshes are dynamically
imposed on regions with large estimated errors as the solution proceeds. Parallelization of
adaptive mesh algorithms thus introduces frequent changes in workload distribution among
the processors and hence requires d.w, ,ii;'.- load balancing to achieve high performance. In
this paper, we address the issue of dynamic load balancing that arises particularly in mapping
adaptive mesh refinement algorithms onto parallel architectures.
There are many load balancing algorithms (such as binary decomposition [7], scatter de-
composition -rs]) proposed in the literature for numerical algorithms based on finite differ-
ence or finite element methods. These algorithms are more suitable for static load balancing
as they ignore the interprocessor communication cost that usually results from relocating
the intermediate solution values among the processors during rebalancing. There are other









general dynamic load balancing algorithms (such as diffusion scheme [20]) suitable for slowly
varying workload distributions, as the time to rebalance the workload is quite high in these
methods.
In this paper, we propose dynamic load balancing algorithms especially for adaptive mesh
refinement, that ,i,. .11'll consider the computational and communication costs in rebalanc-
ing the workload. The grids generated dynamically during the computation are assumed to
be rectangular. We use two-dimensional packing problem to allocate a set of processors for
each grid in mesh multiprocessor system. The set of processors is chosen in such a way that
computational workload among the processors is balanced and intra-grid communication
cost is minimized at the same time. We propose an efficient heuristic packing algorithm and
show how it can be used for processor allocation.
A word about the organization. In the next section, we briefly discuss some of the
previously proposed load balancing algorithms. In Section 3, we formulate the dynamic
load balancing problem that arises in mapping adaptive mesh computations onto parallel
computers. In Section 4, we propose a simple heuristic and two approaches for implementing
it on the parallel machines. In Sections 5 and 6, we describe the two approaches and analyze
their complexities for hypercube architectures. In Section 7, we give the results of our
experiments performed to evaluate our heuristic methods with respect to some well-known
scheduling heuristics. Finally, we summarize our paper with a brief discussion of scope of
future work.


2 Previous work

In this section, we discuss some of the load balancing techniques proposed in the literature for
mapping numerical computations onto parallel architectures. These techniques partition the
computational domain into pieces (subdomains) which are then allocated to the processors.
For static load balancing, communication between two processors is usually assumed to be
proportional to the number of points on the boundaries between the subdomains allocated
to the processors. In these techniques, balancing the load among the processors is given
more importance than minimizing the inter-processor communication cost.
In [7], Berger and Bokhari propose binary decomposition for distributing the workload
associated with non-uniform mesh computations in a multiprocessor system with a mesh
topology. It is a recursive partitioning of the domain into vertical and horizontal strips
alternatively so that the left and right or top and bottom segments that result from the
cut have approximately equal load; load here is measured by the number of mesh points.
Number of processors is assumed to be a power of two in this method. This technique,









while useful for static load balancing, has limited applicability as a dynamic load balancing
method for adaptive meshes since it ignores the cost of migrating intermediate solution values
that occurs during reallocation of mesh points to the processors. For minor imbalances in
workload, this technique can perform efficient rebalancing by adjusting the cut lines starting
with the smallest-sized strips. It is not specified how binary decomposition can be efficiently
implemented in parallel on a multiprocessor system.
Scatter decomposition is another static load balancing technique which accounts for minor
fluctuations in workload distribution that tend to concentrate in a few contiguous regions of
the computational domain. In this case, the rectangular domain is divided into (say) m rows
and n columns of equal width. Let rij be the region in the i-th row and j-th column, where
0 < i < m- 1 and 0 < j < n- 1. Then for a multiprocessor system with p x q mesh (assuming
p divides m and q divides n), we allocate the computations of rij to processor indexed (s, t)
where s = i mod m and t jmod n, where 0 < s < p-1 and 0 < t < q-1. Larger the ratio
of m-, better balanced the workload will be but higher the communication cost among the
processors; this trade-off very much depends on the application. Determining this trade-off
is a major difficulty with this method. This technique was used as a load balancing method
by Williams [32] and was analyzed for a probabilistic workload by Nicol and Saltz '"].
In [29], Simon proposes the application of graph-theoretic techniques for mapping un-
structured grid calculations that arise in computational fluid dynamics onto multiprocessor
architectures. These techniques recursively apply a procedure that partitions a planar graph
into two halves with roughly equal number of nodes such that the number of edges in the
cutset (measure of communication cost) is attempted to be minimized. In one technique
called recursivee graph bisection", nodes are partitioned based on their distances from an
extremal vertex which is the root of a breadth-first spanning tree with maximum height. In
another technique called recursivee spectral bisection", partitioning is done along the eigen
vector associated with the second largest eigen value of the Laplacian matrix A D, where
A is the adjacency matrix and D is the diagonal matrix of vertex degrees. Recursive spectral
bisection is seen to provide a desirable graph partition.
There are many distributed dynamic load balancing strategies of which "diffusion method"
is one example. It is a local balancing operation, wherein each processor independently moves
(receives) workload to (from) its neighbors based on only the information about its workload
and that of its neighbors. This method was used for molecular dynamics simulation applica-
tions [11]. A major disadvantage of this method is that it takes a long time to achieve global
balancing but this delay is acceptable when the workload distribution changes slowly. Cy-
benko [20] showed the conditions for convergence of this method and the rate of convergence
for arbitrary processor network topologies.









Hanxledon and Scott [25] propose a decentralized dynamic strategy in which each proces-
sor does local balancing based on the total workload information that it receives from every
other processor. This strategy was tested on Intel iPSC/2 for WaTor, a particle migration
application and was found to give good results.
The technique we propose in this paper is somewhat similar to binary decomposition but
it ,11/.', .'llI takes into account inter-processor communication cost that occurs in rebalancing
the workload. We also propose efficient parallel implementations of this technique. It is a
technique tailor-made for adaptive mesh refinement applications.


3 Proposed load balancing approaches

In a typical adpative mesh refinement algorithm, the given partial differential equation is
discretized by imposing initially a uniform mesh on the computational domain; we call this
mesh a coarse grid at level 0. Letting i to be equal to 0 initially, the following steps are then
repeatedly applied.

1. Obtain solution values at the mesh points of the grids at level i, using finite difference
schemes that involve 4 or 8 mesh neighbors.

2. If i 0, project solution values of level i grids onto grids at level i 1, for updating
solution values at level i mesh points.

3. Estimate errors at mesh points of level i grids and if the estimated errors at all the
mesh points of level i are within the acceptable limit, terminate the algorithm.

4. Otherwise, in regions of high error, define grids at level i + 1 with finer mesh spacing
(assuming a constant ratio between level i and i + 1 mesh spacings). One common
approach is to flag high-error points, cluster them into groups using a suitable clustering
algorithm and define a finer grid for each group.

5. Define boundary values for grids at level i + 1 by interpolating intermediate solution
values at level i.

6. Let i = i + 1 and go to step 1.

In the formulation of our load balancing problem, we use the following assumptions:

1. A grid at level i (i > 0) is completely contained inside a grid at level i 1. This is
a most commonly used assumption in adaptive mesh refinement for simplification of
numerical algorithms. In this case, we call the two grids "child" and "p.!. ,l" grids
respectively.









2. Grid computations are synchronized such that grid computations at level i start at the
same time. Dynamic load balancing is performed whenever finer grids are created and
before their computations are started.

3. Computational cost for each grid depends only on the number of mesh points it con-
tains. This cost includes time for steps 1,3, 4 and 5 above. For simplicity, we will
assume that the cost is a linear function of the number of points, though our formula-
tion can be used for any known function of number of points.

4. There are two types of inter-processor communication called "intra-grid communica-
tion" and "inter-grid communication" respectively. Intra-grid communication is needed
whenever computation of a value at a mesh point (during step 1 above) in a processor
needs a value at a neighboring mesh point located in another processor; both the mesh
points belong to the same grid at some level. Inter-grid communication occurs during
steps 2 and 5 whenever a child grid is located in a different processor than that of
its parent. Communication cost is directly proportional to the distance between the
processors in the network and we ignore traffic on the communication links.

We are currently investigating the following three approaches for mapping the adaptive mesh
computations. Each approach has its own merits and demerits.
In the first approach we call the "grid indivisibility" approach, we allocate all the mesh
points of a grid to the same processor; thus a grid is an indivisible computational entity
that can be treated as a process. Creation of a finer grid is analogous to the creation of a
child process which can either be allocated to the same processor as the parent process or
migrated to a different processor by the load balancing process. A major advantage of this
approach is that in addition to being simple to implement, intra-grid communication (which
occurs more frequently than inter-grid communication) is avoided. But the number of grids
at a level must be at least the number of processors to keep all the processors busy and this
will not happen at least for the first few levels.
In the second approach we call the "domain decomposition" approach, computational
domain is divided into regions (possibly using scatter decomposition) and the mesh compu-
tations of the grids at all levels associated with a region are assigned to the same processor.
Whenever finer grids are created, we perform new decomposition for balancing the workload
associated with finer grids. This new decomposition may require migration of coarse (previ-
ous level) grid mesh points to different processors. But this approach eliminates inter-grid
communication at the expense of this migration cost. It introduces intra-grid communication
however since the mesh points of a grid may be allocated to more than one processor. The









load balancing problem here is similar to the problem considered in [31] for load balancing
with resource migration in distributed systems if the coarse grids are treated as resources.
The third approach we call the "grid partitioning" approach is a more general approach
than the previous two methods but the load balancing problem formulation is more compli-
cated and its solution difficult to implement. In this approach, we allocate a set of processors
for each grid and within the set of processors, the grid is uniformly partitioned into pieces
and each piece is assign to a processor.
In this paper, we propose a solution to the load balancing problem formulated using the
third approach. This approach requires that the number of processor be larger than the
number of grid at a level of refinement.


3.1 Problem Formulation

In this section, we consider the grid assignment for a two-dimensional adaptive mesh algo-
rithm. When we formulate our load balancing problem, we assume that all two-dimensional
grids are rectangular. We also assume that target machines are two-dimensional mesh mul-
tiprocessors. Later in this chapter, we will consider hypercube multiprocessors.
Let m be the number of grids at level i. If we allocate Xj Yj submesh for each grid
j close to the processors where grid j was generated, the submesh of one grid can overlap
with submeshes of other grids especially when grids are closely located to each other. Then,
the amount of workload for the overlapping processors will be twice as much as those of
other processors. To avoid this workload imbalance, some of the grids may have to be
assigned to processors which are distant from the processors where they were generated.
These assignments will incur inter-grid communication at some distance which can not be
expressed in terms of Xj and Y. To define the problem more formally, we introduce the
following notations :

G = (V, E) -processor graph where V is the set of processor nodes and
E the set of communication links
S set of fine grids generated at level i
f : S 2- f(a) is the neighborhood of processors where grid a was generated.
f'(a) processors (subset of f(a)) having boundary values for grid a
before load balancing
Ucomp computational cost per mesh point; depends on the finite difference
method used
Ucomm cost of transmitting a mesh point value between neighboring
processors









d(pi,p2) -communication distance between processors pi and P2 in G
d'(VI, V2) maXp E1V,p2EV2 d(p, p2)
F -frequency of intra-grid communication during level i computation
g : S -- 2- grid assignment function to be determined by load balancing
g'(a) -processors (subset of g(a)) having boundary values for grid a
after load balancing
S set of subgrids created by assignment g
3. number of mesh points in the subgrid b
Bb -number of mesh points on the boundary of subgrid b
h : V -- S h(p) is the subgrid assigned to processor p

After the partitioning and assignment of grids at level i, the computational workload Ep for
processor p is given by UcompMh(p). Note that we require that no more than one subgrid be
assigned to a processor. For a fine grid a, inter-grid communication cost Ci(a) is estimated
to be Ucomm (d'(f'(a), g'(a))B'(a) + d'(f(a), g(a))M'(a)) where B'(a) = Bi~ and
M'(a) = mln h,(a) the first term is the cost of migrating the boundary values for fine
min(m(f(a)|,|(a)|)'6 6
grid a that occurs before the computation of mesh point values in a and the second term is
the cost of sending the fine grid values back to the parent grid for updating. Whenever a grid
a is allocated to more than one processor, for each of its subgrids b, intra-grid communication
is needed whose cost C2(b) is estimated to be UcommFBb where Bb is the number of mesh
points (interior to grid a) on the boundary of b. Our problem is then formally stated as
follows :
Find g, S, and hence h so as to minimize maxpEv (Ep + C2(h(p))) + maxEs Ci(a)
with the restriction that the grids are assigned to disjoint sets of processors, i.e.
for any two grids a, and a2, g(ai) g(a2) = 0.
As was stated before, we restrict our attention to the case where each fine grid aj (1 <
j < m) is rectangular in shape with dimensions wj x hj and the processor graph G is a mesh
network with P rows and Q columns (assuming P > Q without loss of generality). Since
there is at most one subgrid assigned to a processor, m < PQ. The problem is then posed
as follows :
Determine for each grid aj, dimensions Xj x Y of the processor submesh to
be allocated to the grid and lj, the index of the processor in the south-west
corner of the submesh so as to minimize maxi maxnas Ci(aj). In the rest of this chapter, denotes ] when we discuss the formulation
of the problem.
For the simplicity of the problem, we ignore inter-grid communication cost and seek
solutions which minimize computation cost and intra-grid communication cost. We will









discuss the extension of the approach to include inter-grid communication cost at the end
of this chapter. If we remove the term representing inter-grid communication cost from the
problem formulation, our objective function decreases as Xj and Y increase. Consequently,
it is advantageous to assign a submesh which is as large as possible for each grid aj. It is also
necessary that -h be equal to wk for all pairs of j and k to balance the workload among
the processors. From the above discussion, we have the following two desirable properties of
processor allocation.

1. >m 1 Xjj= N,

2. Vij, k 3h Wkhk
S X, Y, XkY

When the total number of grid points at level i is larger than the number of processors,
the number of processors XjYj for each grid j is limited by the above two properties. Let
Nj = XjY. Then the intra-grid communication cost for grid aj is as follows.

wj Xj hj
C2(aj) 2UcommF + + )

Then,

dC2(aj) (-wj h2
S2 Ucmm 2 +
dX Since Then it follows that

3 becomes 0 when Xj = 3w. Since Y, = Then it follows that
dX3 V X3 3 V w
w,_= T. To minimize the cost of intra-grid communication separately for each grid (for
a fixed number of processors assigned to the grid), we add the third property of processor
allocation as follows.

3. Vj w h3

Now, our problem is approximated to finding m submeshes and their locations, one for each
grid, which satisfy the above three properties. This problem is formally stated as follows
For each grid aj of dimension wj x hj, find X x Yj processor submesh to
be allocated to the grid and 1j, the index of the processor in the south-west
corner of the submesh so as to maximize Zm 1 XjY with the restriction that
Vj, k W3h Wkhk and Vj w_ = h. We will call this problem submesh allocation problem.
X, Y, XkYk









4 Processor Allocation Using Two-Dimensional Pack-
ing

In this section, we propose an efficient approach to find an approximate solution to the
problem formulated in the previous section. This approach views the problem as a two-
dimensional packing problem wherein rectangles need to be packed in a bin of infinite width
and height. We present an approximate heuristic for this packing problem and show how the
approximate solution can be used to allocate processor submeshes to the grids. Experimental
results on the performance of the packing algorithm are included at the end of this section.

4.1 A Packing Algorithm

Two dimensional packing problem has been studied by many researchers [5, 4, 6, 18]. It arises
in a variety of situations such as scheduling of tasks and cutting-stock problems. Cutting-
stock problems may involve cutting objects out of a sheet or roll of material so as to minimize
waste. The scheduling of tasks with a shared resource involves two dimensions, the resource
and time, and the problem is to schedule the tasks so as to minimize the total amount of
time used. Memory is a typical example of shared resources. In general, the problem is
stated as follows :
Given a rectangular bin with fixed width and infinite height, pack a finite set
of rectangles of specified dimensions into the bin in such a way that the rectangles
do not overlap and the total bin height used in the packing is minimized. The
rectangles should be packed with their sides parallel to the sides of the bin. In one version
of the problem (see [5]) the rectangles should be packed in a fixed orientation.
In our processor allocation problem, grids correspond to rectangles to be packed. But,
the problem is slightly different from the above two-dimensional bin packing problem. First,
the width as well as the height of the bin is unlimited. Assume that we are given a x/N /N
mesh multiprocessors for m grids. The goal of our packing is minimizing the maximum of
width and height of bin used for packing m grids starting from the left-bottom corner of the
bin. Second, the orientation of grids is not important, so grids can be rotated when '1 ,c are
packed. Figure 1 shows an example of such packing.
After packing m grids as in Figure 1, we use the two ratios, - and h ,9h to allocate a
submesh for each grid. The detailed allocation method will be described later in this section.
First, we will show that decision version of our 2D packing problem is NP-complete in the
following theorem.
Theorem 1 : The following problem is NP-complete:
Given m rectangles of dimension ;, x hi (;i > hi), is there a packing so that the maxi-














I





height






I<--------------------------------
width

Figure 1: An Example of Packing

mum of height and width of packing is smaller than or equal to L?
Proof: We will reduce the bin packing problem which is known to be NP-complete [24] to
our 2D packing problem. The bin packing problem is stated as follows: Given k bins of
capacity B, and a finite set U = {u1,u2,..., u,} of items with size of each item
s(ui) e Z+ (s(ui) < B), is there a partition of U into disjoint sets U1,U2,... ,Uk
such that the sum of sizes of the items in each Ui is no more than B? We
can construct an instance of the 2D packing problem for an instance of the bin packing
problem in the following way : Create m + 1 rectangles with the following dimensions;
wo = k(B + 1), ho = k(B + 1) B, wl = B + 1, hi = s(ui), w2 = B + 1, h2 = s(u2), ,
;I = B + 1, h, = s(u,) and L = k(B + 1). Then any feasible packing of these rectangles
must be in the form as in Figure 2. The largest rectangle in the figure is the first rectangle
in the list. Since one dimension of the rectangle is already k(B + 1), other rectangles cannot
be packed above it. The larger of the two dimensions is B + 1 for all but the first rectangle,
hence they can only be packed so that their longer sides are parallel to the Y-axis. For
this instance of the 2D packing problem to have a solution, rectangles should be packed so
that the width of the packing does not exceed k(B + 1). Note that each level of packing of





















k(B+1)

cd e



B+l a b

'-------------------------------- -- ---------
k(B+1)-B B

Figure 2: Feasible Packing of m+1 Rectangles


rectangles indexed from 1 to n (for example, rectangles "a", "b" form one level of packing
in Figure 2) corresponds to a disjoint set Ui in the bin packing problem. Now, it is obvious
that there is a solution for an instance of the bin packing problem if and only if there is a
solution for the corresponding instance of the 2D packing problem. m


Since the decision version of the problem is NP-complete, the original optimization prob-
lem is NP-hard. For this reason, we seek an approximation algorithm to solve the problem.
Since our packing problem is somewhat different from the commonly known 2D bin
packing problem, the heuristic algorithms developed in [5, 4, 6, 18] are not readily useful
for our purpose. One approach to solving our problem is to try different bin widths and
for each width, use one of the above packing heuristics to see if the height of the packing
does not exceed the desired height for that width. We can find the minimum width that
allows the desired packing, by using some form of interpolation search on the interval between
minimum and maximum widths possible. A better approach, that avoids repacking, is to use
a packing heuristic that attempts to satisfy two goals, namely, packing as tightly as possible
and making the width/height ratio for the packing closely match the ratio of the processor
mesh. To this end, we propose a heuristic which we will call "tight-p... 1: (TP, for short)
heuristic.
The basic idea of TP-heuristic is as follows. First the grids are sorted in some selected









order. Then we start packing grids one by one at the south-west corner of the bin. Let's
consider the space of the bin as the first quadrant of X-Y plane. Then the south-west corner
of the bin becomes the origin of the coordinate system, that is (0,0). Each packed grid has
4 corners NW,NE,SE and SW with respect to its orientation in the packing. A NW or SE
corner of a packed grid is called a free corner (FC) if no other item occupies that corner
such that the item is above and to the right of that corner. In our algorithm, only free
corners are considered for packing the new grid and when a new grid is placed in a free
corner, it is placed so that it is above and to the right of the corner. After packing the first
grid at the origin, the next grid is packed at one of the two corners created by packing the
first grid. Figure 3 (a) shows the two corners, "a" and "b". In general, after packing ;i x hi

/C

a

/ /


b d



(a) (b)

Figure 3: The Corners for Packing the Next Grid

grid at (xj, yj), we add two more free corners located at (xj, yj + hi) and (xj + yj) to the
list of free corners. We also keep the maximum size of the grid which can be packed in that
free corner along with its location. We call it the -, .-" of free corner. Figure 3 (b) shows
two new corners, "c" and "d", after packing the second grid at corner "b". After the second
grid is packed, the width of the grid which can be packed at corner "a" is limited to the
width of the first grid. Hence, we need to update that information whenever we pack a grid.
Let Wi and Hi be the width and height of the space used in packing after packing the first
i grids. We choose the corner for the i + 1-st grid, so that the maximum of Wi+ and Hi+1
is minimized. If we are given a processor mesh P X Q with P > Q and P = R, we choose
the corner for the i + 1-st grid, so that the maximum of Wi+1 and RHi+i is minimized. The
detailed description of our packing algorithm is given bellow. Assume that we are given the
mesh ratio R and m grids, wl x hi, w2 x h2,... x hm. After the following algorithm
terminates, global variables, XLoci and YLoci, contain the location of the corner where grid
i was packed. Orienti is set to 1 if grid i was rotated and it is set to 0 otherwise.








Algorithm Packing;
1. width = 0
2. height = 0
3. Initialize the three lists CornerList, XScanList and YScanList to be empty.
4. Insert [(0, 0), (oo, oo)] into CornerList.
// [(x, y), (p, q)] represents the corner located at (x, y) and the maximum grid
size that can be packed is p x q. //
5. for each h, x hi grid do
begin
6. FindBestCorner((, hi), k)
// In procedure FindBestCorner, each corner in CornerList is
examined and a corner [(zk, ), )(/' qk)] is chosen which minimizes the
maximum of width and height R after h, x hi grid is packed //
7. UpdateCornerList((; hi), k)
// In procedure FindBestCorner, the size of each corner in
CornerList is updated after h, x hi grid is packed at
corner [(Xk, / ), (/i' )] //
8. FindNewCornerSize((,I hi))
// After h, x hi grid is packed at corner [(zk, ,; ), (i' qk), two new
corners, [(zk, ;i + hi), (/' qk')] and [(zk + ", i, ), (/i ', qk,)] are created.
In procedure FindNewCornerSize, the size of each corner is
determined. //
end
end Packing;

Procedure FindBestCorner((; hi), k);
1. mindim = oo
2. for each element [(xj, yj), (pj, qj)] in CornerList do
begin
3. mxl = oo, myl = oo, mx2 = oo, my2 = oo
// Try to pack a grid (;i x hi) at a free corner (pj,qj) in both
orientations. //
4. if (pj )> 0 and (q hi) > 0 then
5. mxl = max(width, xj + ;i ), myl = max(height, yj + hi)
6. else if (pj hi) > 0 and (qj ;, ) > 0 then
7. mx2 = max(width, xj + hi), my2 = max(height, yj + ;i )
8. else goto the next iteration









9. ml = max(mxl, R myl)
10. m2 = max(mx2, R my2)
11. m = min(ml,m2)
12. if (m < mindim) then
begin
13. mindim = m, k =j, XLoci = xj, YLoci = yj
14. if (ml < m2) then
15. width = mxl, height = myl, Orient = 0
else
16. width = mx2, height = my2, Orienti =
end
end
end FindBestCorner;

Procedure UpdateCornerList((;, hi),k);
1. Remove [(xk, ;i ), (i, qk)] from CornerList
2. if Orienti = 1 then
3. wi= h, h =
else
4. ,' = h
// If the grid (;i x hi) packed at k-th free corner overlaps with the packing
area of a free corner [(xj, yj), (pj, qj)], update the size of that free corner. //
5. for each element [(xj, yj), (pj, qj)] in CornerList do
begin
6. if (;/ < yj < ; + h') and (xj < k < j + pj) then
7. pj = Xk j
8. if (xk 9. qj = ;/ yj
10. if pj = 0 or qj = 0 then
11. remove [(xj, yj), (pj, qj)] from CornerList
end
end UpdateCornerList;

Procedure FindNewCornerSize((;, h));
// Find the sizes of two new free corners created after packing a grid (; ,hi) //
1. p = oo, q = oo
2. if Orienti = 1 then








3. .' = h, h' =
else
4. wI = h = hi
// Find the closest grid to the new corner packed on line y = YLoci. //
5. for each element [(x,, yj), (x, y2)] in XScanList do
// Each element [(x, yf), (x, y2)] in XScanList contains the locations
of the SW, NW corners of a packed item. (Note that x1 = x2) //
begin
6. if (y) < YLoci < y)) and (XLoci + < < x1) then
7. if (x1 < p) then
8. p = xj
end
// Find the closest grid to the new corner packed on line x = XLoci + .'. //
9. for each element [(x, y'), (x, y)] in YScanList do
// Each element [(x), y), (x, y2)] in YScanList contains the locations
of the SW, SE corners of a packed item. (Note that y = y2) //
begin
10. if (xl < XLoci + < < x2) and (YLoci < yf) then
11. if (y) < q) then
12. q = y1
end
13. p = p- (XLoci + ,,')
14. q = q- YLoci
15. Insert [(XLoci + w', YLoci), (p, q)] into CornerList
16. p = oo, q = oo
// Find the closest grid to the new corner packed on line y = YLoci + h'. //
17. for each element [(x,, yj), (x, y)] in XScanList do
begin
18. if (y' < YLoci + h <_ y)2 and (XLoc, <_ x) then
19. if (xj < p) then
20. p = x1
end
// Find the closest grid to the new corner packed on line x = XLoci. //
21. for each element [(x, y ), (x2, y)] in YScanList do
begin
22. if (x _< XLoci < x2) and (YLoci + h' < yj) then









23. if (y' < q) then
24. q = y1
end
25. p = p XLoci
26. q = q- (YLoc, + h')
27. Insert [(XLoci, YLoci + h ), (p, q)] into CornerList
28. Insert [(XLoci, YLoci), (XLoci, YLoci + h')] into XScanList
29. Insert [(XLoci, YLoci), (XLoci + ',YLoci)] into YScanList
end FindNewCornerSize;

Every time a grid is packed, a corner is used and two new corners are introduced. Since
there is only one corner initially, the number of elements in CornerList is at most m + 1.
An element is added to XScanList as well as to YScanList, each time a grid is packed.
Hence, the number of elements in XScanList and YScanList can be at most m. As a result,
Procedures FindBestCorner, UpdateCornerList and FindNewCornerSize take O(m)
time. Since there are m grids to be packed, the total time complexity of our packing algorithm
is O(m2).
Remark : A small improvement can be made if we remove corners which are too small
to be useful from CornerList. Let MINHOLESIZE = min(minl procedure UpdateCornerList, if either pj < MINHOLESIZE or qj < MINHOLESIZE, then
remove [(xj, yj), (pj, qj)] from CornerList. In procedure FindNewCornerSize, if either p <
MINHOLESIZE or q < MINHOLESIZE, then do not insert [(XLoc, + w', YLoc,), (p, q)] or
[(XLoci, YLoci + h ), (p, q)] into CornerList.
The following theorem shows the correctness of our packing algorithm. In the following
theorem, we assume for convenience that we do not make modifications as in above remark.
Theorem 2 : After i-th grid is packed, The following two claims hold.
(1)For any element [(xj, yj), (pj, qj)] in the CornerList, the maximum size of the grid that
can be packed at corner (xj, yj) is pj x qj.
(2)For any point (x,y) in the space of the bin, either it is occupied by a grid or it belongs to
at least one corner in the CornerList.
Proof: To prove the first part of the theorem, we need to show that procedure FindNew-
CornerSize correctly computes the sizes of the two new corners. We will use induction on
j, the number of grids already packed. Base case is trivial. When j is 1, both XScanList
and YScanList are empty. Hence, the loops of lines 5 12 and 17 24 will not be executed.
Both p and q remain oo, which is the correct width and height of the maximum grid that
can be packed at the new corners. Now, assume that the sizes of all the corners had been
correctly computed until grid j 1 was packed. We pack grid j at (XLocj, YLocj), and two








new corners, namely, [(XLocj + w', YLocj), (p', q')] and [(XLocj, YLocj + h'), (p",q")] are
created; here w', h' are dimensions of grid j along X and Y directions respectively. Let us
see how p' and q' are determined in procedure FindNewCornerSize. XscanList contains
the coordinates of the SW, NW corners of each grid which was already packed. We scan
each element in XScanList and find [(xI, yI), (X', y/)] such that (y < YLocj < y/) and
(XLocj + w' <_ x1), and x1 is minimum among all such elements; if no such grid exists,
then p' is correctly set to oo. We find [(xIY, y), (x ,,y ,)] in YScanList in the similar way.
Figure 4 (a) shows such grids k, k'. In procedure FindNewCornerSize p' and q' are set to
4x- (XLocj + w') and y, YLocj respectively. It is obvious that p' and q' cannot be greater
than the above values from the figure. For both p' and q' to be correct, the shaded area in
Figure 4 (a) should not be occupied by any grids which were already packed. Suppose that
there is a grid 1 packed at (XLoci, YLoci), and YLoci < YLock, and XLoci < XLock,, that
is, grid 1 is in the shaded area. Figure 4 (b) shows such a grid 1. Since every grid including
grid 1 was packed at a valid corner, there must be a grid n either to the left of or below
grid 1. Assume that grid n is located below grid 1. Now, we apply the same argument to
grid n, and so on. Eventually, there must be a grid t packed at (XLoct, YLoct) such that
XLoct < XLock and (YLocI < YLocj < YLoct + h'). Figure 4 (c) shows such a grid t. This
contradicts the fact that 4x is the minimum among those grids. Therefore, there cannot exist
such a grid 1 and procedure FindNewCornerSize correctly determines p' and q'. Similarly,
we can prove that p" and q" are correctly determined.





kk k tk
wn






(a) (b) (c)

Figure 4: Figures for The Proof of Theorem 5.3.2

Proving the second part of the theorem is straightforward. We showed that sizes of cor-
ners are correctly computed. Since grids do not overlap in our packing, any point (x, y) is
occupied at most one grid. We can use an induction similar to the one used for the first
part of the theorem to prove that (x, y) belongs to the free space of at least one corner if it
is not occupied. Base case (i.e. that is, when no grids have been packed) is trivial. Let's









assume that the claim is true after grid j 1 had been packed. Now, we pack grid j at cor-
ner [(XLocj, YLocj), (p, q)]. This corner is removed from CornerList and two new corners,
[(XLocj + w', YLoc), (p', q')] and [(XLocj, YLocj + h'), (p", q")] are added, where w h'
are dimensions of grid j along X and Y directions respectively. If p' and q' are correctly
computed, p' and q' must be at least p w' and q respectively. From the same reasoning,
p" and q" must be at least p and q h' respectively. Therefore, for any point (x, y) which
belonged to the free space of the corner [(XLocj, YLocj), (p, q)] before, either it is occupied
by grid j or it belongs to one of the two new corners. *


To show a bound for the accuracy of the solutions provided by our packing algorithm,
we impose the following restrictions on packing the grids. When ;i x hi (;, > h) grid is
packed at a corner (x, y), it should be placed so that the side with dimension (long side) is
parallel to the X-axis if x < Ry and it should be placed with long side parallel to the Y-axis
if x > Ry. Suppose there is a free corner that cannot accommodate the item in the allowed
orientation but can accommodate the item in the other orientation. Then we disregard this
corner though it is possible to get a better packing by placing the item in that corner. If
x = Ry, then both orientations of the grid are allowed for that corner. We will call this
packing algorithm modified TP-heuristic. Now we state a result on the solution accuracy
bound when R = 1 (i.e. for square meshes)
Theorem 3 : Consider the two-dimensional packing problem with R = 1 and let w* =
max, 1 Let Wopt, Hopt be the width and height of the optimal packing and let WTP, HTP
be the corresponding values for the packing given by the modified TP-heuristic when items
are packed in decreasing order of their maximum side lengths. Assume without loss of
generality that Wopt > Hopt and WTp > HTp. Then WTp < v 2Wopt + 3w*.
Proof : Let us denote the regions below and above the line OP (that has unit slope) by
R1 and R2. First we observe that there is always a corner in R1 as well as in R2 that can
accommodate a subsequent item in the allowed orientation (i.e. long side parallel to Y-axis
in R1 and parallel to X-axis in R2). This follows from the fact that we can always pack a
subsequent item q abutting to Y-axis (or X-axis). Since the item q' below (or to the left
of) q was packed prior to q, the maximum side of q' is longer than the maximum side of q.
Hence, there is always enough space to pack item q above (or to the right of) item q'.
Now, we show that WTp HTP < w* as follows. Let W}p, HTp be the width and height
of the packing after the i-th item is packed. Suppose that WTp- HTp > w*. Then there must
be an item i of dimensions (;. hi) such that that W'p1 HIz- < w* and Wjp Hp > w*.
It also must be true that the i-th item was packed at a corner in R1, hence W9-1 < Wjp and
H I- < Hip. Let (x, y) be the location of a corner in R2 which can accommodate the i-th








:~ l-- n i--1ITi
item. Then x < y < H Since x + < H + w* < Wp and y + h; < + w* < W1p,
the maximum of the width and height of the packing would be smaller if the item i were
packed at the corner at (x, y). Hence the item i should not have been packed at a corner in
R1. Since we always pack items so that the maximum of the width and height of the packing
is minimized, we can conclude that WTp HTP < w*.
We only have to prove our result when WTp > 3w* in which case HTp > 2w*. Consider
the two isosceles right triangles A1 and A2 (see Figure 5) in regions R1 and R2 with areas
(WTp2*) (HTp--2w*) respectively. Note that all items in the region A1 (A2) have been
packed with their long sides parallel to Y (X) axis. Thus the items are packed in these
regions as in bottom-up left-justified (BL for short) strategy of [5]. We can make the similar
argument on the occupancy of A1 and A2 as in [5]. Note that any vertical (or horizontal)
cut through A1 (or A2) can be partitioned into alternating segments corresponding to cuts
through unoccupied and occupied areas. Using the fact that there are items to the right
of A1 and above A2 and considering the order in which the items are packed, we can show
that the sum of the occupied segments is at least the sum of the unoccupied segments. By
integrating the lines over WTp 2w* (or HTp 2w*), we can verify that A1 (or A2) is at
optW2)2 )2) > (H 2w*)2
least half full. This means that W2 > 1/4 ((HTp 2w*)2 (WT 2w*)2) > (HTP-2*
and the result follows from the fact that WTp HTP < w*. m


The bound can be improved when the items to be packed are square shaped.
Theorem 4 : Consider the same 2D packing problem as in Theorem 5.3.3 except that the
items are square shaped. If the items are packed in decreasing order of their sizes in the
modified TP-heuristic, then WTp < 2V/Wopt + 2w*.
Proof : The proof of this theorem is similar to the proof of the previous theorem. It can be
easily shown that WTp HTp < w*. Since all the items are square shaped, they are packed
as in BL strategy [5] in the two isosceles right triangles Si and S2 (see Figure 6) with areas
(WP -*)2 (HTP w*)2 respectively. Using the fact that there are items to the right of Si and
above S2 and considering the order in which the items are packed, we can show that Si and
S2 are at least half-occupied. This means that W p, > 1/4 ((HTP w*)2 + (WT w)2)
S(HTp-w*)~ and the result follows from the fact that WTp HTp < w*. .


The order in which we consider the grids for packing will affect the performance of
our packing algorithm. Since small grids have a better chance to fit into holes without
increasing the width or height of the packing, it is intuitively advantageous to pack them
later. Assuming that ;i > hi (1 < i < m), we consider the following four orderings of grids
to be reasonable for good performance.

































A2

HT -2w*
TP



A1


w*

LC- "- --------------------------- -
O I I
O
w* WTP 2 w* w*

Figure 5: Regions A1 and A2 in the Packing (Theorem 5.3.3)

























w* /





S2

HTP-W*



S,




1---------------------------------- -

WTP W
Figure 6: Regions S and 2 in the Packing (Theorem 5.3.4)
Figure 6: Regions S1 and S^ in the Packing (Theorem 5.3.4)









1. Decreasing order of ;,


2. Decreasing order of hi

3. Decreasing order of ;i hi

4. Decreasing order of

Since sorting the grids can be done in O(m log m) time, the time complexity of our packing
algorithm will be same regardless of which grid ordering is used.


4.2 A Parallel Packing Algorithm

The packing algorithm described in the previous section is sequential. One processor has
to collect all the information about the grids from the other processors and execute the
packing algorithm in order to find the processor allocation. The result of allocation should
be communicated to all the processors which in turn communicate the boundary values for
the fine grids to the corresponding processors. In this section, we present a parallel algorithm
in which m processors cooperatively execute the packing algorithm for m grids in order to
speed up the algorithm. The same packing method that was used in the sequential algorithm
will be used in the parallel algorithm described here.
In the adaptive mesh algorithm, each rectangular grid at level i may be generated across
more than one processor. One of those processors (with lowest index) produces a packet
< (wj, hj), Sj > for grid j, where (wj, hj) is the dimension of grid j and Sj is its own
processor index. These packets contain the necessary information that forms the input
data to our packing and processor allocation algorithms, processors for them. The detailed
description of the algorithm is given below. The topology of multiprocessors on which the
algorithm runs, is assumed to be either mesh or hypercube.

Algorithm Parallel Packing

(a) Let PS be a set of m processors forming a submesh or a subcube, that is PS = {f' |0 <
i < m 1}. The processors which produced packets send them to the processors in
PS (one packet per processor) using procedure TokenPacking. The processors in
PS will do the remaining steps of our parallel packing algorithm.

(b) Sort packets according to the packing order using a parallel sorting algorithm. After
sorting, assume that processor Pb, is holding packet < (; hi), Si >.









(c) Store the initial free corner [(0, 0), (oo, oo)] at processor Pb0. Each processor Pb set both
width and height to 0. Now, pack the grids one by one by performing m iterations
where in the i-th iteration (0 < i < m 1) call procedure GridPacking(i).

(d) After step (c), each processor Pb has (XLoci, YLoci), that is, the free corner where
grid (;, hi) was packed and the width and height of the packing. Each processor Pb
include the above information in its packet < (; hi), Si > and send it to Si using
procedure SendPacketToSource.

end Parallel Packing;

Procedure TokenPacking;
// Here a subset of processors PJ0, Pj,..., Pj with jo < j < ... < ji has one packet
each and it is desired to store the packet of Pjk in Pk for t < k < (t + 1) mod m for some
0 < t < N 1. Using greedy routing, this operation can be done in O(log N) time on a
hypercube and O( N) time on a mesh [26]. //

Procedure GridPacking(i);

1. Call procedure Broadcast(bi, < (; hi), Si >).

2. Call parallel procedure FindBestCorner((;, hi), k) to find the corner (xk, ) where
the grid (; hi) is to be packed, and its orientation.

3. Processor Pb set XLoci and YLoci to Xk and ;/ respectively, and set Orienti to
Orientk. Also Pb determines the width and height of the packing after (;i hi) is
packed.
Call procedure Broadcast(bi, ((xk,;l ), (width, height), Orienti)).

4. Call parallel procedure UpdateCorners(( hi), k) to update the sizes of corners
after the grid (; hi) is packed at (xk, / ).

5. Call Parallel procedure FindNewCornerSize((;i hi), k) to determine the sizes of
two new corners.

Procedure FindBestCorner((; hi), k);
1. Each Processor Pb, does the following
begin
Let [(xj, yj), (pj,qj)] the corner Pb is holding
if (pj ., )> O0 and (qj hi) > 0 then
mxl = max(width, xj + .i ), myl = max(height, yj + hi)









else if (pj hi) > 0 and (qj >1 ) > 0 then
mx2 = max(width, xj + hi), my2 = max(height, y + )
else set m to oo and goto the next step
ml = max(mxl, R myl)
m2 = max(mx2, R my2)
mj = min(ml, m2)
if (ml < m2) then Orientj = 0
else Orientj = 1
end
If Pb, has two corners then choose the one which gives smaller mj.
If Pb, does not have any corners then set mj to oc.
2. Call parallel procedure FindMin({mi}jo1, mk, b&)
end FindBestCorner;

Procedure UpdateCorners((;, hi),k);
1. Processor Pbk remove [(xk, ), (i ,qk)].
2. Each Processor Pb, does the following
begin
Let [(xj, yj), (pj, qj)] the corner Pb, is holding
if Orienti = 1 then
= hi, h =
else
S' = hi = h
if (; < yj < ; + h') and (j < k < x j X pj) then
pj = Xk Xj
if (xk < + xk+ ') and (yj < <; < yj + qj) then
qj = yj
if pj = 0 or qj = 0 then
remove [(xj, yj), (pj, qj)]
end
If Pb3 has two corners then update the second one in the same way.
end UpdateCorners;

Procedure FindNewCornerSize((; hi), k);
1. Each Processor Pb,(j < i) does the following
begin
if Orienti = 1 then









",' = hi, h' =,
else
wI = hl = hi
if (YLocj < ;, < YLocj + hj) and (xk + < < XLocj)
then pj = XLocj
else pj = oo
if (XLocj < Xk + wi < XLocj + wj) and (;, < YLocj)
then qj = YLocj
else qj = oo
2. Call parallel procedure FindMin({pi}):-, p, bi)
3. Call parallel procedure FindMin({qi)j} q, bi)
4. Find p' and q' in the same way for the other corner.
5. Store the two new corners, [(XLoci + ;, ', YLoci), (p, q)] and
[(XLoci, YLoci + h'), (p', q')], at Processor Pb.
end FindNewCornerSize;

Procedure Broadcast(j, V);
// Processor Pj broadcasts value V to all the processors in PS; takes O(log N) time in a
hypercube and O( N) time in a mesh [26]. //

Procedure FindMin({ai}701, b,j);
// Given the al values with one value per processor, find the minimum (b) of these values and
store it in the processor Pj. This operation can be done in O(log N) time on a hypercube
and O( /N) time on a mesh. //

Procedure SendPacketToSource

1. Each processor Pb, include XLoci, YLoci, Orienti, width and height in the packet
< (, h,), S >.

2. For a hypercube, sort the packets according to Si using a parallel sorting algorithm.

3. Send each packet < (I hi), Si, (XLoci, YLoci), Orienti, (width, height) > to processor
Si. For this, use the monotone routing (see chapter 3) for a hypercube and one-to-one
routing [26] for a mesh.

end SendPacketToSource
Finding a corner to pack the next grid, updating the sizes of corners and determining the
sizes of two new corners are done in the same way as in the sequential packing algorithm.
After executing the above algorithm, processor Si can find a submesh for grid (;i hi) using








the information included in the packet.
Time complexity analysis for a hypercube :
Step (a) takes O(log N) time and step (b) takes O(log2 m) time. Both procedure FindBest-
Corner and FindNewCornerSize take O(log m) time. procedure UpdateCorners takes
a constant time. Hence step (c) takes O(m log m) time. Step (d) takes O(log N + log2 m)
time. The total time complexity for a hypercube is O(log N + m log m).
Time complexity analysis for a hypercube :
Step (a) takes O({N) time and step (b) takes O(/m log m) time. Both procedure Find-
BestCorner and FindNewCornerSize take O( /m) time. procedure UpdateCorners
takes a constant time. Hence step (c) takes O(m im) time. Step (d) takes O(-/N) time.
The total time complexity for a mesh is O( N + mz /-).

4.3 Allocation of Processors

In the previous sections, we described algorithms for packing grids in such a way that the
ratio of width to height is as close to the mesh ratio as possible and the packing area is
minimized at the same time. Now, we are ready to allocate a set of processors (submesh) for
each grid using the results of packing. Let W and H be the width and height of the packing,
and let (xi, yi) be the location of the south-west corner of grid i with dimensions x hi.
Assume that we are given a P x Q(P > Q) mesh of multiprocessors and let W > H. Let
((i,j), (x X y)) be a x x y submesh with its south-west corner processor indexed (i,j). We
propose the following two ways to allocate submeshes to grids.

1. Assign submesh
(( Lxi], LyiQi), ( L(x + x) Lxi'J) x (L(yi + hi)J Lyi]J)) to grid i.
2. If (Q < H), then assign submesh
(( Lxi, LyiJ), (L(xI + )J LxiQ) x (L(yi + hi)J Lyi)) to grid i.
else assign submesh
((LxiWL, LyJ]), (L(xi + )P L[xi' ) x (L(y + hi) L[yipJ)) to grid i.
The first method tries to use as many processors as possible to reduce the computational
workload, but may violate the property that -= for all the grids. In the second method,
we allocate processors in such a way that ~ Q. In other words, the second method
uses uniform scaling in X and Y directions while the first method uses different scaling
factors in X and Y directions in order to reduce the packing to the given mesh size. In the
second method, we may not utilize all the processors in the mesh. We will call the first and
second method non-uniform scaling and uniform scaling respectively. Both of the methods
are identical to each other when Q H
P W









To make sure that both Xi and Y are not zero, that is, at least one processor is allocated
to each grid, we impose the following lower bound on the size of each grid. Let D =
Ci max(; hi). D is a very loose upper bound on max(W, H) where W and H are width
and height of any packing. Then, mini (min(; ,hi)) > miQ should be satisfied. To
achieve better bound on D, we can quickly pack the grids in the following way. Each time a
grid is packed, we examine only two corners, one on X-axis and the other on Y-axis. Then
use max(width, height) as D. Since we always consider only two corners, this packing takes
only O(m) time. After finding the lower bound on the size of the grids, change ;, or hi to
this value if they are smaller than that.

4.4 Processor Allocation for Hypercubes

If we are given a hypercube, we can still use the above method to allocate processors for
grids assuming that we are given a mesh. Then, we can easily embed the mesh in the
given hypercube using reflected Gray code [10]. The reflected Gray code is generated in the
following way. We start with the 1-bit Gray code sequence {0, 1}, and then insert a zero and
a one in front of the two elements obtaining the two sequences {00, 01} and {10, 11}. We
then reverse the second sequence to obtain {11, 10}, and then concatenate the two sequences
to obtain the 2-bit reflected Gray code.

{00,01,11,10}

Generally, given a (d 1)-bit code

{b b2,. ., b ,

where p = 2d-1 and bl, b2,..., bp are binary strings, the corresponding d-bit code is

{0bi,0b2,...,0bp, lbp, lbp_,,..., lbi}.

The preceding recursive construction of the reflected Gray code sequence can be general-
ized. Let d, and db be positive integers, and let d = d, db. Suppose that {ai, a2,..., pa} and
{bl, b2,..., bpb} are the d,-bit and db-bit reflected Gray code sequences, where p, = 2da and
pb = 2db. Consider the p, x pb matrix of d-bit strings {aibj|i = 1,2,... ,pa,j = 1,2,...,pb}.


albI alb2 ... albPb
a2bl a2b2 '. a2bPb


aP ab apab2 ... aP bPb









The preceding construction also shows that the nodes of a d-cube can be arranged along a
two-dimensional mesh with p, and pb nodes in the first and second dimensions, respectively.
The (i, j)-th element of the mesh, where i = 1, 2,..., p and j = 1, 2,...,pb, is the d-cube
node with identity number aibj.
In N-node hypercube, we can embed log +/N + 1 different meshes with power of two
nodes in each dimension, such as (210g' x 210og ), (21og N+l x 21og -1), . (21log x 20).
We can pick one of the above meshes and pack the grids using its mesh ratio. Or, we can
try for all the above meshes and pick the one which gives he closest to the corresponding
mesh ratio. Since we use only m processors in the parallel packing algorithm, we can pack
the grids using [L" different mesh ratios at the same time. If we try all log /N + 1 meshes
using the parallel packing algorithm, we have to repeat the packing algorithm [ log vN+1)
times. This takes O(1-ogsmog/ ) time since one execution of the packing algorithm takes
O(m log m) time.

4.5 Experimental Results

We performed experiments to measure the performance of our approximation algorithm
based on the TP-heuristic. As in the experiments for the first approach, we start with some
number of (coarse) grids and assume that at each level of adaptive mesh refinement, the
number of fine grids (child grids) created within each coarse grid region is exponentially
distributed with a mean value of 1. The number of mesh points in each fine grid is uniformly
distributed in the range [MINSIZE,MAXSIZE]. The aspect ratios of fine grids, namely w
(, > hi), are also assumed to be uniformly distributed in the range [1,MAXRATIO]. In
our experiments, we went through 200 levels of refinement. For assignment of fine grids at
each level, we first solve the two-dimensional packing problem and then allocate a submesh
for each grid using the methods described in the previous section.
We compared the performance of our algorithm with that of an algorithm based on the
modified first-fit level heuristic (LP) discussed in [18] for packing rectangles into a bin of
fixed width. In this heuristic, rectangles (grids) are considered in decreasing order of height
and packed level by level as in the first-fit heuristic. But successive levels are packed left-
right and right-left in the bin of fixed width so that tallest item in a level will be above the
shortest in the previous level. This will allow the items to drop down after packing which
may in turn reduce the total height of the packing. Since in our problem, the width is not
fixed, we have to apply an iterative process where in each iteration, we use the LP-heuristic
to determine the height of the packing for some fixed width. The initial width is set equal
to \RS where S = ZC(., hi) and R is the mesh ratio. Then, the width is increased by a
small amount (1 of the width) each time until we get a ratio for W that is just above the
Hht sjstaov-h









processor mesh ratio Q.
The objective function values were computed using the following formula.
I =max U wo 2U FtF( + )

where Ucomp, Ucomm and F are assumed to be 1 for simplicity. First, we compare the solution
values obtained by the non-uniform and uniform scaling processor allocation methods in
Figure 7. Here, we started with 40 grids. MINSIZE and MAXSIZE are set to k 0.7
and k 1.3 where the constant k is set based on the average number of mesh points per
processor which is assumed to be 300. 32 x 32 processor mesh was used for assignment.
The experiment was repeated for different aspect ratios of grids. Figure 7 shows that the
non-uniform scaling method performs slightly better than the uniform scaling method. The
reason for the narrow difference in the performances of the two methods is attributed to the
fact that our algorithm gives a packing whose W ratio is close toP ratio of the mesh. In
our remaining experiments, we use only the non-uniform scaling method for allocation of
submeshes. Figure 8 shows the solution values obtained after applying four different packing
orders of grids in TP-heuristic. From the results, we see that packing grids in decreasing
order of grid size (; hi) performs the best, while packing them in decreasing order of their
aspect ratio () performs the worst. We use only decreasing order of grid size as a packing
order in our remaining experiments.
Figure 9 Figure 24 show comparison of TP and LP heuristics with different values of
various parameters. In the figures, we show the cumulative sum (across 200 levels) for (a)
computational cost, (b) communication cost, (c) sum of (a) and (b) and (d) average processor
utilization which indicates how tightly the grids are packed in the packing algorithm. The
processor utilization is the ratio of the number of processors used in allocation to the total
number of processors. Figure 9 and Figure 12 show how the performance of each heuristic
varies with variance in grid size (i.e. number of grid points). In this experiment, MINSIZE
and MAXSIZE are set to k (1-VAR) and k (1+VAR) respectively, and VAR varies from
0.1 to 0.9. We used the same values for other parameters as in the previous two experiments.
Figure 13 and Figure 16 show how the performance of each heuristic varies with the aspect
ratios of the grids. Figure 17 and Figure 20 show how the performance of each heuristic
varies with processor mesh ratio. In this experiment, we use 32*MESHRATIO x 32 mesh for
assignment, where MESHRATIO varies from 1 to 3. Figure 21 and Figure 24 show how the
performance of each heuristic varies with initial number of grids. Below we give a summary
of our observations.

1. As can be seen in Figure 9 and Figure 12, the TP-heuristic performs better than the LP-
heuristic when the variation in grid sizes is reasonably large. The processor utilization













Non-Uniform Scr
Uniform Scr


145000 F


150000 k


140000 h


135000 I


130000


125000


3
MAXRATIO


Figure 7: The Comparison of Two Different Allocation Methods


ding
aling -- .----
aling .... -..







/.













;.:'::-
..-



.. ';"


;:
..:
..:
.:


.x














155000 F


Decreasing Max.
Decreasing Min.
Decreasing Grid size
Decreasing Ratio


--.-----


150000 F


145000 h


140000 I


135000






130000


125000 J-


3
MAXRATIO


Figure 8: The Comparison of Four Different Packing Orders


/
/
/


,-




I--














126000 TP-Heuristic .....
LP-Heuristic ..
124000

122000
0 x
a 120000

5 118000 /
0 -- - - - -- - - - -
( 116000

114000

112000

110000
0.1 0.3 0.5 0.7 0.9
VAR


Figure 9: Computation Costs for Different Grid Size Variances




21000

TP-Heuristic --
LP-Heuristic...
20500 ................



20000



E 19500



19000



18500
0.1 0.3 0.5 0.7 0.9
VAR


Figure 10: Communication Costs for Different Grid Size Variances












TP-Heuristic --
146000 LP-Heuristic .. x

144000

142000

S140000

3 138000 -
/
/ --- -----
136000

134000

132000 -

130000

0.1 0.3 0.5 0.7 0.9
VAR

Figure 11: Total Costs for Different Grid Size Variances




84
84~~-------- -------------------------
83 TP-Heuristic -
LP-Heuristic
82




N 80 -

79

8 7...........

77

76

75
0.1 0.3 0.5 0.7 0.9
VAR


Figure 12: Processor Utilizations for Different Grid Size Variances













140000


135000


130000


125000


120000


115000


110000


105000


3
MAXRATIO


Figure 13: Computation Costs for Different Aspect Ratios of Grids


22000

21500

21000

20500

20000

19500

19000

18500


2 3 4
MAXRATIO


Figure 14: Communication Costs for Different Aspect Ratios of Grids


TP-Heuristic --
LP-Heuristic















... ..... ...-x '


TP-Heuristic --
SLP-Heuristic x














































2 3


2 3 4
MAXRATIO


Total Costs for Different Aspect Ratios of Grids


2 3 4
MAXRATIO


Figure 16: Processor Utilizations for Different Aspect Ratios of Grids


160000



155000



150000



145000



140000



135000



130000



125000


- TP-Heuristic -- ..
LP-Heuristic --x----- .
/
./
//


.-
//

/"
..'

///


-i

,x
//"




-/



-/
.-.
...
/


/



..o.o-'


..---"" .....x "


Figure 15:


TP-Heuristic --
LP-Heuristic x










"x- -

"-,..x
''- .; ---- - - - . -











125000



120000



115000



110000



105000



100000



95000


TP-Heuristic
LP-Heuristic x

















", ***... -...
"X -..... .


TP-Heuristic
LP-Heuristic x








..



--.. ... ..


""^..'O- ---....


Figure 17: Computation Costs for Different Aspect Ratios of Processor Meshes


20500


20000


19500



19000



18500



18000



17500


2
MESHRATIO


Figure 18: Communication Costs for Different Aspect Ratios of Processor Meshes


1.5 2 2.5
MESHRATIO













140000



135000



130000



125000



120000



115000


1.5 2 2.5
MESHRATIO


Figure 19: Total Costs for Different Aspect Ratios of Processor Meshes


78


2
MESHRATIO


Figure 20: Processor Utilizations for Different Aspect Ratios of Processor Meshes


TP-Heuristic -
LP-Heuristic x

















... -0-...


TP-Heuristic
LP-Heuristic x
x






.. ..... ... -- ----. - -














145000


140000


135000


130000


125000


120000


115000


50
No. of Grids


Figure 21: Computation Costs for Different Numbers of Grids


22500


22000


21500


21000


20500


20000


19500


19000


TP-Heuristic --
-LP-Heuristic x


.-"
X',"


) 45 50 55
No. of Grids


Figure 22: Communication Costs for Different Numbers of Grids


TP-Heuristic --
LP-Heuristic xx










170000

TP-Heuristic --- x ..
165000 LP-Heuristic x


160000


155000 -


150000 -
.-'
/"

145000


140000


135000 --


130000
40 45 50 55
No. of Grids

Figure 23: Total Costs for Different Numbers of Grids


80


79.5


79


78.5


78


77.5


77


76.5


76


) 45 50 55
No. of Grids


Figure 24: Processor Utilizations for Different Numbers of Grids


-TP-Heuristic
LP-Heuristic
---'









also increases with the variation in grid sizes for TP-heuristic. Since the processor
utilization indicates how tightly the grids are packed in the packing algorithm, this
result demonstrates that our proposed heuristic gives good packing in practice.

2. Figure 13 and Figure 16 show that LP outperforms TP only when the grids are square
or close to being square shaped. But as the aspect ratio increases, TP produces better
solutions than LP as indicated by smaller computational and communication costs
as well as higher processor utilization. In both TP and LP heuristics, the total cost
increases with the maximum aspect ratio of the grids. This implies that it is difficult
to pack the grids tightly when the variance in aspect ratios is large.

3. The following reason can be attributed to the better performance of TP over LP for
large variation in grid sizes or larger aspect ratios : TP allows smaller grids that are
packed later on, to be packed into holes created during the packing of larger grids and
this has the effect of keeping the width and height of packing as small as possible.

4. Figure 17 and Figure 20 show that LP outperforms TP when processor mesh ratio is
larger than 2. Higher processor utilization for LP heuristic indicates that LP can pack
rectangles tightly when the width of the bin is large. In LP-heuristic, the number of
levels in packing is small when the width of the bin is large.

5. Figure 21 and Figure 24 indicate that TP performs always better than LP regardless
of the number of grids to be packed. Processor utilization of TP seems to increase
slightly as the number of grids increases.

6. In all the figures, both computation and communication costs show the same behavior.
The reason for this is that both computation and communication costs are inversely
proportional to the number of processors allocated and processors are uniformly allo-
cated to the grids according to the number of grid points.

7. In most of the cases, TP outperforms LP in both total cost and processor utilization.
One of the reasons for this is that TP-heuristic takes advantage of the fact that two
possible orientations are allowed in the packing. TP has the additional advantage over
LP in that it avoids repacking and hence has less time complexity.


5 Extension for Inter-grid Communication Cost

The solution method based on two-dimensional packing ignores inter-grid communication
cost. In this section, we discuss possible approaches for minimizing inter-grid communication









cost as well as computation and intra-grid communication costs. First we consider one-
dimensional problem and propose an iterative approach. Then we discuss the possibility of
applying it and other approaches to solving the two-dimensional case.


5.1 One-Dimensional Problem

Processor allocation for the one-dimensional problem is relatively easier than two-dimensional
problem. We can find an optimal solution when there is only one grid at each level. Let X be
the number of contiguous processors (in one-dimensional mesh) allocated for the newly gen-
erated fine grid. Then, the length of the communication path for inter-grid communication
is X + a in the worst case, where a depends on the actual locations of these X processors.
When there is only one fine grid at each level, we can allocate X processors including the
processors where the fine grid was generated. Then a becomes a constant. For the fine grid
at level 2, in Figure ?? (a) X = 2 and a = -2, and in Figure ?? (b) X = 4 and a = -2.
If we uniformly divide the fine grid among X processors, the number of grid points each
processor is responsible for is -, where M is the total number of grid points in the fine grid.
Then, the problem is finding X which minimizes

Ti = Ucomp comm (X + a) + CI
X X
The first term is the computation cost for each processor to which M grid points are assigned.
The second term is the worst case inter-grid communication cost when the data have to travel
through a path of length X + a. The third term CI is the cost of intra-grid communication.
It is a constant because regardless of X, it is necessary to exchange only two grid points (left
end and right end) values with neighboring processors. In the above equation Ti decreases
indefinitely up to Ci as X increases. But, as we increase X, M reaches packet size Sp and
communication cost is only proportional to X + a. Then, the above equation becomes

M M
T = Ucom +UcommSp(X + a) + CI, when < Sp
X X
If we differentiate Ti,

dTj -U v U
dX X2 commS
d' becomes 0 when X is equal to Ucomp Therefore, Ti is minimized when X is equal to
dX UcommSp
max( UcompM )
Sp V UcommSp
This analysis is valid when there is only one fine grid at level i. The problem becomes
more difficult when there are more than one fine grid at level i. If each grid is assigned to
a set of X processors that are in the neighborhood of the set of processors where it was









generated, a grid can overlap with other nearby grids in some processors. The workload
in these processors will be twice as much as that in the other processors. To avoid this
workload imbalance, some grids have to be assigned to processors far from those where they
were generated. This will make the maximum length of data path greater than X. Let us
assume that there are m fine grids (grid 1, grid 2, -. grid m from left to right in the one-
dimensional domain) at level i and the number of grid points in grid j is Mj. We also assume
that grid j is assigned to P1 Pi+,P ,... and Xj = rj Ij + 1. If we allow overlapping
of grids in some processors, workload imbalance among processors is inevitable. Hence, we
avoid assigning more than one grid to any processor. Now, when there are m grids at level
i, the computation time Ti becomes as follows:

Ti = max Ucomp c + Ucommf (Xj)(Xj + aj) + C,
I
where f(Xj) = max(t Sp). ac is the extra path length caused by assigning the grid far
from the processors where it was generated to avoid overlapping. Here, each aj depends on
the assignments of the other grids at the same level, unlike the case when there is only one
grid in which case it is a constant. Therefore, we cannot compute individual values of Xj in
the same way as before. In general, it is difficult to find each Xj without knowing uj. Here,
we propose a simple heuristic using an iterative procedure.

1. For each j, Xj = max( A ).
-S-p V U -Sp
This is the optimal number of processors allocated to grid j if we ignore the other
grids at the same level. If Zj Xj is greater than N, each Xj is reduced proportionally
according to Mj.

2. Let sj be the index of the processor where grid j was generated. If grid j was generated
in a set of processors, then let sj be the index of the leftmost processor. Allocate
Pt,3 PI +1,..., PT, to grid j in the following way.

ptr = 0
forj = 1 to m
Ij = max(ptr, sj), rj = Ij + Xj 1, ptr = rj + 1
if (ptr > N) then
ptr = N 1
forj = mto 1
rj = min(ptr, rj), lI = r Xj + 1, ptr = Ij 1

3. Using the above processor allocations, compute Ti and aj = Ij sj. Also find jmax
such that Ti = Ucomp ma + Ucomm ( xma + 2)(XJma.. + c,,) + C. Since each Xj was









obtained assuming aj to be 0, we may be able to reduce Ti by reducing jma.. Assume
that ac., > 0 and let k = max(,,=o) & (l and reallocate processors to each grid j (k < j < jmx).

4. Recompute Ti using the above reallocation of processors. If it was improved then go
to step 1.


5.2 Two-Dimensional Problem

Considering inter-grid communication cost for two-dimensional problem is very complicated.
When the grids clustered on the computational domain, allocation of processors for one
grid cannot be determined independently since grids are closely located to each other. The
iterative heuristic used for one-dimensional problem previously can be considered for solving
the two-dimensional problem. But, after we find Xi x Y submesh for each grid i assuming
there is only one grid at a level, we still need to find the locations of these submeshes. For
the one-dimensional problem, we assigned m linear subarrays in the same order as the grids.
But, we need to solve a packing problem to arrange these submeshes without overlapping.
This packing problem is different from the one we considered in the previous section since
we have to assign submeshes that are somewhat close to the original locations of the grids.
This problem need to be solved again and again after we reduce the sizes of submeshes by
some quantities. For this reason, we cannot apply the same heuristic to the two-dimensional
problem without major changes.
One possible solution for assigning submeshes is starting with a submesh for a grid located
at the center, then assigning other submeshes around it using a simple packing algorithm.
But, after reducing the sizes of the submeshes, we have to rearrange the submeshes so that
maximum inter-grid communication distance is reduced. This may not be a trivial problem
without overlapping submeshes.
Another possible way of allocating submeshes considering inter-grid communication cost
is imposing a constraint on it as we did in the first approach. When the grids are clustered,
we can compute the size of a submesh which all m grids will be assigned to. Let D(PSi) be
the diameter of a submesh PSi, then maxes UcommD(PSi)(B, + M,) < Cma where all the
other notations are defined in chapter 3. The size of the submesh should be small enough
so that maximum inter-grid communication cost cannot exceed the limit no matter which
processors are allocated for a grid within the submesh. Then, we can use the two-dimensional
packing approach described in the previous section to allocate processors for each grid within
the submesh. Since we do not use all the processors, computation time will increase. When
there are more than one cluster of grids, we have to find a submesh for each cluster. This







problem is the same as finding a submesh for each grid, but the number of submeshes we
need to find is much smaller because we need one submesh per a cluster of grids. If the
number of clusters is reasonably small, it will be easier than solving the original problem.
There is another alternative solution when the grids are scattered instead of clustered.
Consider mesh processors as domain and fine grids generated by them as workload. Then
we partition the processors as we decompose the domain using binary decomposition (see
Figure 25). The processors are partitioned in such a way that the computational requirement


Figure 25: Partitioning of Processors

of grids are balanced among the different groups of processors. We need to repeat partitioning
the processors until the maximum inter-grid communication cost cannot exceed the limit
when processors are allocated within the group. Then we again use the packing approach
for each group of processors. This approach will work when the fine grids are reasonably
scattered.
We considered a few possible approaches to include inter-grid communication cost in our
objective function. Each approach seems to be suitable for a particular type of coarse grid
distribution among the processors. But deciding which approach to apply after generating
a set of fine grids is a challenging problem. Thus, the inclusion of inter-grid communication
cost makes the load balancing problem complicated and requires further study.


tL -



_0__0__CYO


(i~il:~









6 Conclusions


We have formulated the dynamic processor allocation algorithm that arises in mapping adap-
tive mesh computations onto multiprocessors. For mesh (and hypercube) architectures, the
problem can be transformed into a 2D packing problem. We proposed an efficient approxima-
tion heuristic and derived accuracy bounds for square meshes. We also demonstrated through
experiments how our approximation algorithm can provide solutions with smaller compu-
tational and communication costs than one of the well-known packing heuristics adapted
to this problem. We derived the bounds for our heuristic when the grids are packed in de-
creasing order of their maximum side lengths. We plan to analyze its performance for other
orders such as decreasing order of areas, aspect ratios etc.. One of our challenging future
tasks is to address the load balancing problem when inter-grid communication cost is also
considered in the objective function.


References

[1] S. Acharya and F Moukalled, "An Adaptive Grid Solution Procedure for Convection-
Diffusion Problems," Journal of Computational P;,.*:', Vol.91, 1990, pp.32-54.

[2] A. V. Aho, J. E. Hopcroft and J. D. Ullman, The Design and A,,1..,<-. of Computer
Algorithms, Addison-Wesley Publishing Company, Reading, MA, 1974.

[3] I. Altas and J. Stephenson, "A Two-Dimensional Adaptive Mesh Generation Method,"
Journal of Computational Pb.;', Vol.94, 1991, pp.201-224.

[4] B. S. Baker, D. J. Brown and H. P. Katseff, "A 5/4 Algorithm for Two-Dimensional
Packing," Journal of Algorithms, Vol.2, 1981, pp.348-:; "

[5] B. S. Baker, E. G. Coffman and R. L. Rivest, "Orthogonal Packings in Two Dimensions,"
SIAM Journal on Computing, Vol.9, No.4, August, 1'i-II pp.846-S,,

[6] B. S. Baker and J. S. Schwarz, "Shelf Algorithms for Two-Dimensional Packing Prob-
lems," SIAM Journal on Computing, Vol.12, No.3, August, l'i ;, pp.508-525.

[7] M. Berger and S. Bokhari, "A Partitioning Strategy for Nonuniform Problems on Mul-
tiprocessors," IEEE Transactions on Computers, Vol.C-36, 1987, pp.570-580.

[8] M. Berger and J. Oliger, "Adaptive Mesh Refinement for Hyperbolic Partial Differential
Equations," Journal of Computational PbI.-., Vol.53, 1984, pp.484-512.









[9] M. Berger and I. Rigoutsos, "An Algorithm for Point Clustering and Grid Generation,"
IEEE Transactions on Si ,,. .. Man and C9i1 1 ,, i.'. Vol.21, no.5, 1991, pp.1178-1l"1.

[10] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation, Prentice-
Hall, Englewood Cliffs, NJ, 1989.

[11] J. Boillat, F. Bruce and P. Kropf, "A Dynamic Load-Balancing Algorithm for Molecular
Dynamics Simulation on Multi-processor Systems," Journal of Computational Pl; .-
Vol.96, 1991, pp.1-14.

[12] S. H. Bokhari, "A Shortest Tree Algorithm for Optimal Assignments Across Space
and Time in a Distributed System," IEEE Trans. Software Engineering, Vol.7, No.6,
November 1981, pp.583-589.

[13] S. W. Bollinger and S. F. Midkiff, "Processor and Link Assignment in Multicomput-
ers Using Simulated Annealing," Proceedings of the 1988 International Conference on
Parallel Processing, Vol.1, 1l'i", pp.1-7.

[14] R. G. Casey, "Allocation of Copies of a File in an Information Network," Proc. AFIPS
National Computer Conference, Vol.43, 1972, pp.371-374.

[15] W. W. Chu, "Optimal File Allocation in a Computer Network," in Computer Commu-
nication Sil l ,,, N. Abramson and F. F. Kuo eds., Prentice-Hall, Englewood Cliffs,
NJ, 1973, pp.82-94.

[16] E. G. Coffman, Computer and Job-slpT Scheduling Tl, 'ry, John Wiley & Sons, Inc.
New York, 1976.

[17] E. G. Coffman, M. R. Garey and D. S. Johnson, "An Application of Bin-packing to
Multiprocessor Scheduling," SIAM Journal on Computing, Vol.7, No.l, February, 1978,
pp.1-17.

[18] E. G. Coffman, M. R. Garey, D. S. Johnson and R. E. Tarjan, "Performance Bounds for
Level-Oriented Two-Dimensional Packing Algorithms," SIAM Journal on Computing,
Vol.9, No.4, August, 1'i-II pp.ilis- _'i,

[19] W. Y. Crutchfield, "Load Balancing Irregular Algorithms," Lawrence Livermore Na-
tional Laboratory Report, 1991.

[20] G. Cybenko, "Dynamic Load Balancing for Distributed Memory Multiprocessors," Jour-
nal of Parallel and Distributed Computing, Vol.7, 1989, pp.279-301.









[21] K. D. Devine, J. E. Flaherty, S. R. Wheat and A. B. Maccabe, "A Massively Paral-
lel Adaptive Finite Element Method with Dynamic Load Balancing," Proceedings of
Supercomputing '93 Conference, Portland, Oregon, 1993, pp.2-11.

[22] A. K. Dewdney, "Computer Recreations," Scientific American, Dec. 1984, pp.14-18.

[23] L. W. Dowdy and D. V. Foster, "Comparative Models of the File Assignment Problem,"
ACM Computing Surveys, Vol.14, No.2, June 1982, pp._' -313.

[24] M. R. Garey and D. S. Johnson, Computers and Intrac/.lJ',1../i W. H. Freeman and
Company, San Francisco, CA, l'l- ;

[25] R. Hanxleden and L. R. Scott, "Load Balancing on Message Passing Architectures,"
Journal of Parallel and Distributed Computing, Vol.13, 1991, pp.312-324.

[26] F. T. Leighton, Introduction to Parallel Algorithms and Architectures, Morgan Kauf-
mann Publishers, Inc. San Mateo, CA, 1991.

[27] D. Nicol and N. Saltz, "Dynamic Remapping of Parallel Computations with Varying
Resource Demands," IEEE Transactions on Computers, Vol.37, No.9, 1l'I" pp.1073-
1087.

L2] D. Nicol and N. Saltz, "An Analysis of Scatter Decomposition," IEEE Trnasactions on
Computers, Vol.39, No.ll, 1991, pp.1337-1345.

[29] H. D. Simon, "Partitioning of Unstructured Problems for Parallel Processing," Com-
puting Sil/ ,,1.w in Engineering, Vol.2, No.2/3, 1991, pp.135-148.

[30] P. Sadayappan and F. Ercal, \N .,! est-Neighbor Mapping of Finite Element Graphs onto
Processor Meshes," IEEE Transactions on Computers' Vol.C-36, No.12, 1987, pp.1408-
1424.

[31] Ravi Varadarajan and Eva Ma, "An Approximation Load Balancing Model with Re-
source Migration in Distributed Systems," Proceedings of the 1988 International Con-
ference on Parallel Processing, Vol.1, 1'-l", pp.13-17.

[32] W. Williams, "Load Balancing and Hypercubes: A Preliminary Look," Hypercube Mul-
tiprocessors 1987, SIAM, Philadelphia, PA, 1897, pp.108-113.

[33] J. Woo and S. Sahni, "Load Balancing on a Hypercube," Proceedings of the Fifth In-
ternational Parallel Processing S,,,' ..."',,-, Anaheim, CA, 1991, pp.525-530.




University of Florida Home Page
© 2004 - 2010 University of Florida George A. Smathers Libraries.
All rights reserved.

Acceptable Use, Copyright, and Disclaimer Statement
Last updated October 10, 2010 - - mvs