Analysis of potential parallel implementations of
the unsymmetric-pattern multifrontal method for
sparse LU factorization
Steven M. Hadfield and Timothy A. Davis
301 CSE, Computer and Information Sciences Dept.
University of Florida, Gainesville, FL 32611-2024, USA
(904) 392-1481, email: smh@cis.ufl.edu and davis@cis.ufl.edu
Technical Report TR-92-017
July 13, 1992
1. Available via anonymous ftp to cis.ufl.edu as /cis/techreports/tr92/tr92017.ps.Z. Supported
in part by the National Science Foundation (ASC-9111263).
Abstract
The unsymmetric-pattern multifrontal method for the LU factorization of sparse matrices
creates a series of smaller, dense frontal matrices that are partially factored and then combined
into a fully factored result. These smaller frontal matrices can overlap in rows and/or
columns, creating dependencies between themselves which are represented as an elimination
directed acyclic graph (DAG). This elimination DAG can serve as a task graph to control the
parallel factorization of frontal matrices. The potential for parallelism provided in the
elimination DAGs produced by the unsymmetric-pattern multifrontal method is investigated via
analysis and trace-driven simulations. Different levels of parallelism are explored using both
unbounded and bounded processor models. Various scheduling and task allocation schemes
are also investigated. Performance results are measured in terms of speedups and processor
utilizations.
Contents

1 Overview

2 Background
  2.1 LU Factorization
  2.2 Frontal Matrices (Elements)
  2.3 Elimination Directed Acyclic Graph
  2.4 Major Algorithmic Steps
  2.5 Potentials for Parallelism

3 Unbounded Parallelism Models
  3.1 Elimination DAG Analysis
  3.2 Model Definitions
  3.3 Matrices Analyzed
  3.4 Results
  3.5 Conclusions and Observations

4 Bounded Parallelism Models
  4.1 Initial Models
      4.1.1 Trace-Driven Simulation
      4.1.2 Processor Allocations
      4.1.3 Scheduling Methods
      4.1.4 Results
      4.1.5 Conclusions and Observations
  4.2 Refined Models
      4.2.1 Model 3 Revisions
      4.2.2 Models 2 and 4: Vertical Partitioning
      4.2.3 Model 4: Assembly Pipelining
  4.3 Scalability Analysis

5 Conclusions
Chapter 1
Overview
One of the cornerstones of numerical linear algebra is the solving of systems of linear
equations [7]. LU factorization is the most common and appropriate method for solving such
systems when the only assumptions that can be made of the coefficient matrix are that
it is square and nonsingular. When the additional assumption of sparsity can be made,
algorithmic modifications are possible that can significantly enhance execution times.
Dr. Tim Davis, working with Dr. Iain Duff, has developed a very promising algorithm
for solving such systems of linear equations [4]. Their method focuses on large, sparse,
nonsingular systems that have an unsymmetric nonzero pattern. A most significant aspect
of the method is the potential it has for parallel implementation.
Current implementations of the method operate on sequential and vectorized processors.
Parallel implementations are upcoming. The focus of this project is to provide an assessment
of parallel implementation alternatives through both analysis and simulation.
The research presented in this document is organized into three parts. The first part, Chapter
2, provides some fundamental background on LU factorization and the unsymmetric-pattern
multifrontal method. This is followed by the results of the unbounded parallelism models
run against traces produced by the sequential version of the method. These results provide
an initial assessment of the potential parallelism of the method assuming a sufficiently large
number of available processors. The third and final part refines the unbounded models into
bounded models and uses trace-driven simulation techniques to predict the actual parallelism
achievable with various processor set sizes. Major concerns with these models are the
allocation of processors to subproblems and the scheduling of tasks.
The results of the analysis just outlined illustrate a strong potential for parallelism with
practical and realizable processor set sizes.
Chapter 2
Background
This section briefly overviews some of the basic concepts of the unsymmetric-pattern
multifrontal method [4].
2.1 LU Factorization
LU factorization is a numerical algorithm based on Gaussian elimination for solving square,
nonsingular systems of linear equations of the form: Ax = b, where A is an n by n real matrix
and x, b are n dimensional real column vectors [7]. Specifically, the coefficient matrix, A, is
factored into the product of two triangular matrices: A = LU. L is unit lower triangular
and consists of the row multipliers used to reduce A to echelon form. U is the echelon
form of A which is upper triangular. With the A = LU factorization complete, the solution
to the system of linear equations, Ax = b, can be found by solving the two triangular
systems: Ly = b and Ux = y. This is a much simpler task of O(n^2) compared to O(n^3)
for the factorization. Typically, LU factorization is accompanied by row permutations to
ensure numerical stability and, particularly when dealing with sparse matrices, by column
permutations to recover from zero pivots. Such permutations are incorporated into the
unsymmetric-pattern multifrontal method.
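As a concrete illustration of the two O(n^2) triangular solves, the following is a minimal
sketch in Python/NumPy (not part of the software described in this report; the dense L, U,
and b are assumed inputs):

import numpy as np

def lu_solve(L, U, b):
    # Solve Ax = b given A = LU, with L unit lower triangular and U upper triangular.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                       # forward substitution: Ly = b
        y[i] = b[i] - L[i, :i] @ y[:i]       # L[i, i] = 1, so no division is needed
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):           # back substitution: Ux = y
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

Each loop touches O(n^2) entries in total, which is the point of the comparison with the
O(n^3) factorization.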
2.2 Frontal Matrices (Elements)
The unsymmetric-pattern multifrontal method uses LU factorization to solve very large
systems of linear equations (A of size 1000 x 1000 and larger). In order to perform this task
efficiently, the method takes advantage of the sparse, yet unsymmetric, pattern typical of
many such large coefficient matrices. The method does this by identifying elemental
submatrices along the diagonal of the coefficient matrix. The nonzero elements of the first row and
column of the right lower submatrix define the rows and columns of the elemental matrix.
These elemental matrices are also referred to as frontal matrices. They are typically much
smaller and very dense. Thus, they can be factored very efficiently by existing vectorized
and parallel subroutines.
2.3 Elimination Directed Acyclic Graph
A major complexity of this method arises from the fact that these elemental matrices can
(and do) overlap in rows, columns, or both. As a result, most elemental matrices can be
only partially factored. The unfactored portion of the submatrix has been modified by
the partial factorization and will be passed and assembled into other elemental matrices
for further factorizations. This "assembly" process creates data dependencies between the
elemental submatrices. These dependencies are represented with a directed acyclic graph
(DAG) where the elemental submatrices are represented by vertices and the dependencies
with edges. This DAG is referred to as the elimination DAG.
The unsymmetric pattern exploited by the method can result in parts of the unfactored
submatrix of a frontal matrix being assembled into multiple later frontal matrices. In symmetric
multifrontal methods [3] the assembly will be into only one later frontal matrix. This results
in an elimination DAG for the unsymmetric-pattern method as opposed to an elimination
tree for symmetric-pattern methods.
2.4 Major Algorithmic Steps
The execution of the unsymmetric-pattern multifrontal method is thus controlled in large
part by this elimination DAG. Two of the major steps in the method are the assembly of
data passed between elemental matrices and the actual numerical factoring. These are the
algorithmic steps that will be modeled in this effort. These two steps constitute the majority
of the processing of the factor-only version of the unsymmetric-pattern multifrontal method,
which is the version to be analyzed in this effort. The entire method is implemented in
the analyze-factor version. Within this larger version these steps typically account for over
fifty percent of the method's execution time [4]. The limited scope of this effort allows
the assumption of an existing and static elimination DAG. This is a simplification of what
actually happens since possible modifications of the elimination DAG and memory allocation
for frontal matrices are ignored.
2.5 Potentials for Parallelism
There are at least three levels at which parallelism can be exploited in the
unsymmetric-pattern multifrontal method.
The highest level is present in the independent blocks that can naturally appear in the
coefficient matrix once it has been permuted into an upper block triangular form. All
entries to the left of and below these blocks are zero, and the operations on such a block
are independent of the operations on the other blocks. These blocks manifest themselves as
distinct connected components of the elimination DAG.
Within these blocks, parallelism is also possible when factoring independent elemental
matrices. These typically manifest themselves as multiple vertices at the same topological level
within a connected component of the elimination DAG (however, additional parallelism is
possible at this level depending on the structure of the dependencies represented by the edges
of the elimination DAG). Assembly and numerical factoring of such elemental matrices can
be accomplished in parallel.
The lowest level of parallelism occurs within the actual numerical factoring of elemental
matrices using factoring subroutines tailored for particular parallel architectures. The assembly
process of data transferred along edges in the elimination DAG can also be done in parallel at
this level. Exploitation of this level of parallelism is critical for achieving the most significant
speedups.
Chapter 3
Unbounded Parallelism Models
The performance and complexity of a sparse matrix algorithm are closely tied to the sparsity,
structure, and numerical characteristics of a particular matrix. As a result, performance of
such algorithms is frequently measured by comparison to other algorithms on identical
matrices. This same approach will be taken with this model. Both sequential and parallel versions
of the unsymmetric-pattern multifrontal method will be modeled for specific matrices and
then the results compared.
An underlying multiple instruction, multiple data (MIMD) shared memory (SM) model
of computation is assumed, although several of the models are also realizable on single
instruction, multiple data (SIMD) models. The shared memory is initially assumed to be
concurrent read, concurrent write with concurrently written results summed (CRCW-SUM).
Later models will require only a concurrent read, exclusive write (CREW) shared memory.
Instruction counts are abstracted to only the floating point operations (flops) required for
the assembly and numerical factoring processes.
The analysis of the unbounded parallelism models is based on the representations of
elimination DAGs built from traces produced by running the sequential version of the
unsymmetric-pattern multifrontal method on real matrices. These elimination DAGs are analyzed and
then used as input for the five unbounded parallelism models. The resulting speedup,
processor set, and utilization estimates are then analyzed.
3.1 Elimination DAG Analysis
A sound characterization of the elimination DAG is the cornerstone of the unbounded models
and will be fundamental to fully understanding the results of the bounded models. In order to
obtain this characterization, the elimination DAGs are topologically sorted and partitioned
into a data structure that identifies all the nodes in a given topological level. Node weights
are assigned in accordance with the model definitions defined below. Using these node
weights, the heaviest weighted path through the elimination DAG is determined for each
model. The weight of this path provides the lower bound on the parallel execution time.
Other characteristics of the elimination DAGs are also determined. In particular, the number
of nodes and edges, the depth, and the number of leaf nodes are determined. Since the
elimination DAGs tend to be significantly wider at their leaf level (Level 0), the next widest
level is also determined. The total weight of all nodes is divided by the number of nodes
to obtain an average node weight. The depth is divided by the number of nodes to give a
depth ratio. Low values correspond to DAGs that offer the most significant coarse-grain
parallelism. Various methods of exploiting the parallelism are investigated with the different
models defined next.
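To make the heaviest-path computation concrete, the following is a small illustrative sketch
in Python (the analysis program itself was written in Pascal; the dictionaries weight and
succs are hypothetical representations of the node weights and DAG edges):

def heaviest_path_weight(weight, succs):
    # weight: node -> model-dependent node weight
    # succs:  node -> list of nodes that depend on it (edges of the elimination DAG)
    indeg = {v: 0 for v in weight}
    for v in succs:
        for w in succs[v]:
            indeg[w] += 1
    order = [v for v in weight if indeg[v] == 0]    # leaf level (Level 0) first
    for v in order:                                 # topological order, built incrementally
        for w in succs.get(v, []):
            indeg[w] -= 1
            if indeg[w] == 0:
                order.append(w)
    best = dict(weight)                             # heaviest accumulated weight ending at each node
    for v in order:
        for w in succs.get(v, []):
            best[w] = max(best[w], best[v] + weight[w])
    return max(best.values())

The weight of this heaviest path is the lower bound on parallel execution time referred to
above.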
3.2 Model Definitions
A model of the sequential version of the algorithm is provided as reference and will be used to
develop the speed up calculations. In this and all the subsequent models there are two levels.
The first (higher) level consists of the outer sum that counts processing steps for an entire
frontal matrix. This summation will either be across all nodes of the elimination DAG, which
represents a sequential processing of the frontals, or along the heaviest path, which represents
the critical path for the processing. The heaviest path summation is used to characterize
parallelism exploited between the nodes in the elimination DAG. With the assumption of an
unbounded number of processors, processing of nodes with dependencies not on the heaviest
path (critical path) can be accomplished during the critical path processing and not affect
the completion time.
The second level of the models describes the processing and parallelism of factoring a
particular frontal matrix (i.e., a node in the elimination DAG). Each of these descriptions is
presented in four terms. The first term describes the processing needed to complete the
assembly of contribution blocks from sons in the elimination DAG. The second term
characterizes the processing needed for the selection of a pivot for numerical stability. The third
term describes the calculation of multipliers, which results in a column of the lower
triangular matrix L. Finally, the fourth term defines the updating of the active submatrix, which is
essentially the carrying out of the necessary row operations needed to reduce the matrix.
These models are very similar to an earlier effort by Duff and Johnsson [6] that focused on
the elimination trees produced by symmetricpattern, multifrontal methods.
Each of the models uses the following terms:
* A_j: the number of matrix entries assembled into the frontal represented by node j.
* S_j: the number of sons assembling entries into node j.
* P_j: the number of pivots by which the frontal matrix at node j is reduced.
* c_j: the number of columns in the node j frontal.
* r_j: the number of rows in the node j frontal.
With these definitions, the model for the sequential version of the algorithm is defined as
follows:
$$\sum_{j \in \mathrm{all\ nodes}} \left[ A_j + \sum_{i=1}^{P_j}(r_j - i + 1) + \sum_{i=1}^{P_j}(r_j - i) + 2\sum_{i=1}^{P_j}(r_j - i)(c_j - i) \right]$$
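Written out as code, the per-node flop count of this sequential model is straightforward to
evaluate; the sketch below is only illustrative (Python, with A, P, r, c standing for A_j, P_j,
r_j, c_j as defined above):

def sequential_node_weight(A, P, r, c):
    assemble     = A                                                   # assembly of contribution blocks
    pivot_search = sum(r - i + 1 for i in range(1, P + 1))             # pivot selection
    scaling      = sum(r - i for i in range(1, P + 1))                 # multipliers (column of L)
    update       = 2 * sum((r - i) * (c - i) for i in range(1, P + 1)) # active submatrix update
    return assemble + pivot_search + scaling + update

The sequential execution time is then the sum of this weight over all nodes of the
elimination DAG.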
Model 0: Concurrency Between Elements Only Model 0 depicts a version of the
algorithm that only exploits parallelism across nodes in the elimination DAG. Each frontal
matrix is factored sequentially. This model requires an underlying MIMD architecture. The
formula describing this model is:
$$\sum_{j \in \mathrm{heaviest\ path}} \left[ A_j + \sum_{i=1}^{P_j}(r_j - i + 1) + \sum_{i=1}^{P_j}(r_j - i) + 2\sum_{i=1}^{P_j}(r_j - i)(c_j - i) \right]$$
Model 1: Block Concurrency Within Elements Only Model 1 assumes that nodes
of the elimination DAG are processed sequentially in an order that preserves the dependencies
expressed by the elimination DAG. Within each node, the frontal matrix is factored with a
block level of parallelism. Assembly, scaling, and updates to the active submatrix are done
for all rows in parallel moving sequentially through the columns. For the assembly of a row
that includes contributions from more than one son in the elimination DAG, the assumption
of a CRCW-SUM shared memory ensures correctness. In this model the selection of numerical
pivots is assumed to be sequential.
The method described by this model and the next lend themselves easily to a SIMD
architecture.
$$\sum_{j \in \mathrm{all\ nodes}} \left[ c_j + \sum_{i=1}^{P_j}(r_j - i + 1) + P_j + 2\sum_{i=1}^{P_j}(c_j - i) \right]$$
Model 2: Full Concurrency Within Elements Only Concurrency within a frontal
matrix is extended by Model 2. In this model, more processors are used and a maximal
amount of parallelism exploited. Assembly is done for each element in parallel. However,
in order to maintain consistency with the earlier work of Johnsson and Duff [6], a CREW
memory is used by this model and the assembly of contributions to an entry from multiple
sons is done via a parallel prefix computation using the associative operator of addition.
Numerical pivot selection is handled in this model with another parallel prefix computation
this time using the maximum function as the associative operator. Scaling and updating of
the active matrix are done with parallelism at the entry level which is easily realizable with
the CREW memory model. This model is also oriented to a SIMD architecture.
$$\sum_{j \in \mathrm{all\ nodes}} \left[ \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2(r_j - i + 1) \rceil + P_j + 2P_j \right]$$
Model 3: Block Concurrency Within and Between Elements  The block-oriented,
node level parallelism of Model 1 is augmented by parallelism across the nodes in Model 3.
This is accomplished by changing the outer summation to be over the heaviest path instead
of across all nodes. With this model, a MIMD architecture is required.
$$\sum_{j \in \mathrm{heaviest\ path}} \left[ c_j + \sum_{i=1}^{P_j}(r_j - i + 1) + P_j + 2\sum_{i=1}^{P_j}(c_j - i) \right]$$
Model 4: Full Concurrency Within and Between Elements The suite of models
is completed by extending the full node level concurrency of Model 2 to include parallelism
across nodes in the elimination DAG. This is done in the same fashion as with Model 3 and
a MIMD architecture is likewise assumed.
$$\sum_{j \in \mathrm{heaviest\ path}} \left[ \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2(r_j - i + 1) \rceil + P_j + 2P_j \right]$$
3.3 Matrices Analyzed
A set of five matrices was selected for this study. They represent a cross section of the type
of elimination DAGs that are generated. The matrices are from the Harwell-Boeing set and
each matrix comes from a real problem [5]. The table below briefly describes these matrices:

MATRIX       ORDER   DESCRIPTION
mahindasb    1258    economic modelling
gemat11      4929    optimal power flow problem
gre_1107     1107    computer system simulation
lns_3937     3937    compressible fluid flow
sherman5     3312    oil reservoir, implicit black oil model
3.4 Results
Elimination DAG Analysis  The first objective of this part of the effort was to
characterize the elimination DAGs of the test matrices. Table 3.1 below provides these
characterizations, which were produced with the analysis program written for the effort. The DEPTH
entries are the number of levels in the elimination DAG as determined by a topological sort.
The width of the DAG is presented in two entries: LEVEL 0 refers to the leaf nodes of the
elimination DAG and WIDTH gives the widest level other than the leaf level (Level 0). It
is important to remember that the elimination DAGs are typically NOT a single connected
component and that the many small, distinct connected components show up at the lowest
levels of the DAG's topological order.
                  mahindasb   gemat11   gre_1107   lns_3937   sherman5
# NODES               -         1015       511        1636        664
# EDGES               -         1219      1366        3724       1496
DEPTH                68           -         40         296         46
LEVEL 0               -          530        233         565        359
WIDTH                55          189        106         248         99
AVERAGE WEIGHT     1,887        1,195      9,998      67,801     42,093
DEPTH RATIO         0.29         0.02       0.08        0.18       0.07

Table 3.1: Elimination DAG Characterizations
The statistics presented above reveal that the elimination DAGs tend to be short and wide
(which is the desired shape for parallelism). However, they do not provide the full picture.
The true nature of the elimination DAGs was better revealed with an abstraction of the DAG
provided by the analysis program. This abstraction illustrates the topologically ordered
elimination DAG, with an example provided for the gemat11 matrix in Figure 3.1. For each
level the number of elements in that level is provided as well as the minimum, average, and
maximum node weights. (The node weights are expressed as common logs to save space
and since their magnitude is all that is of interest.) These abstractions revealed that the
elimination DAGs were very bottom heavy in terms of the number of nodes but that the
average node size tended to increase dramatically in the upper levels. The elimination DAG
for the gemat11 matrix had the nicest shape for parallelism (as suggested by its depth ratio).
The other elimination DAGs were taller and relatively narrower at their base.
L (Level): Count  (Min, Avg, Max)
25:   1  (5.3, 5.3, 5.3)
24:   1  (4.8, 4.8, 4.8)
23:   1  (3.7, 3.7, 3.7)
22:   1  (3.7, 3.7, 3.7)
21:   1  (3.7, 3.7, 3.7)
20:   1  (4.7, 4.7, 4.7)
19:   1  (3.6, 3.6, 3.6)
18:   2  (3.6, 4.1, 4.3)
17:   2  (3.4, 3.8, 4.0)
16:   2  (3.3, 4.0, 4.2)
15:   2  (3.0, 4.0, 4.2)
14:   2  (3.0, 3.4, 3.7)
13:   3  (3.0, 3.4, 3.6)
12:   4  (2.6, 3.7, 4.2)
11:   6  (3.0, 3.5, 3.9)
10:   4  (2.1, 3.9, 4.5)
 9:   4  (3.0, 3.3, 3.7)
 8:   7  (2.6, 3.5, 3.9)
 7:  10  (2.6, 3.4, 3.9)
 6:  16  (2.5, 3.6, 4.5)
 5:  24  (2.2, 3.5, 4.2)
 4:  35  (2.2, 3.2, 4.2)
 3:  58  (1.6, 3.0, 3.8)
 2: 108  (1.6, 2.9, 3.6)
 1: 189  (1.6, 2.8, 3.5)
 0: 530  (0.7, 2.5, 3.3)

Figure 3.1: Elimination DAG for gemat11
Speed Ups  The most interesting result of the unbounded models is the speed ups
obtained via parallelism (as compared to the sequential version of the method). These results
are presented in Table 3.2.
MATRIX mahindasb gematll gre_1107 lns_3937 sherman5
MODEL 0 1.17 2.88 1.26 1.30 1.74
MODEL 1 10.50 9.31 41.21 103.77 78.36
MODEL 2 114.25 39.35 575.21 3490.76 1877.05
MODEL 3 15.96 68.79 70.15 167.94 161..,l
MODEL 4 237.74 743.80 1,847.68 11, 1. 40 i.,I.S.42
Table 3.2: Speed Up Results
The speed ups for Model 0 are disappointing but not unexpected after what was seen in
the previous section on the shape of the elimination DAGs. However, significant speed ups
were seen as parallelism was introduced within the nodes (Models 1 and 2) and within and
between nodes (Models 3 and 4). It is interesting to note that a significant synergism is
taking place when parallelism is utilized BOTH within and between nodes. One's intuition
might suggest that both levels of parallelism would produce a resulting speed up that is close
to the product of the speed ups achieved by using the levels of parallelism separately. However,
the speed ups obtained are actually much better than this expectation. This is consistent
with the results of the earlier study by Duff and Johnsson [6].
Another interesting, but expected, observation is that the speed ups for models 0, 3, and
4 correlate strongly with the depth ratio presented earlier. Likewise, models 1, 2, 3, and 4
correlate strongly with average node weight. The underlying principle is that concurrency
within the node is most exploitable by larger nodes and that concurrency across nodes is
most exploitable with short and wide elimination DAG structures.
Processors Sets and Utilization  Also analyzed in the unbounded models were an
upper bound on the number of processors that could be employed on a given matrix and
the associated processor utilization. In particular, Model 0 assumed that any frontal matrix
would have only one processor allocated to it. For the block level concurrency of Models
1 and 3, a frontal matrix would have a processor set equal to the number of rows in the
frontal. The full concurrency of Models 2 and 4 would use processor sets equal to the row
size times column size of the frontal. The maximum processor usage would then be the
largest processor set for any frontal in Models 1 and 2 and the largest sum of processors for
any one level in the elimination DAG for Models 0, 3, and 4. This strategy results from the
use of concurrency across nodes in Models 0, 3, and 4.
Utilizations were then calculated by dividing total sequential time by the product of total
parallel time times the number of processors used. The results by matrix and model are
provided in Table 3.3. The top value in each entry is the size of the processor set and the
lower value is the utilization.

MATRIX     mahindasb   gemat11   gre_1107   lns_3937   sherman5
MODEL 0        -          530        233        565        359
               -         0.54%      0.54%      0.23%      0.48%
MODEL 1        52          68        126        258        189
            20.19%       13.71%     32.71%     40.22%     41.46%
MODEL 2        -         4624      19782     107874      47060
               -          0.85%      2.91%      3.24%      3.99%
MODEL 3       304         3788       1161       2961       3827
             5.25%        1.82%      6.04%      5.67%       4.2%
MODEL 4      4161        30127      19782         -          -
             5.71%        2.47%      9.34%         -          -

Table 3.3: Processor Set and Utilization Results
In all cases the size of the processor sets seem quite high and the utilization quite low, but
these estimates are very crude upper bounds and it is likely that significantly better results
are realizable.
3.5 Conclusions and Observations
A significant amount of parallelism is achievable with the unsymmetric-pattern multifrontal
method. However, very little of the parallelism comes directly from the elimination DAG as
is evidenced from the Model 0 results. Yet a synergism occurs when the parallelism from
the elimination DAG is combined with the concurrency available within the factoring of
particular frontal matrices. These resulting speed ups, as seen in Models 3 and 4, are very
promising.
Looking beyond the speed ups, the large processor sets and low utilizations are of both
concern and interest. In particular, the very "bottom heavy" elimination DAGs suggest that
better distributions of the processing may be possible. Such processing distributions could
be achieved by appropriate task scheduling on bounded sets of processors. The next chapter
explores these possibilities using trace-driven simulation techniques.
Chapter 4
Bounded Parallelism Models
The results of the unbounded parallelism models illustrate that significant speedups are
possible with parallel implementations of the unsymmetric-pattern multifrontal method.
However, the efficient use of processor sets and the speed ups achievable on smaller
processor sets are still open questions. This chapter addresses these issues using the results of
trace-driven simulations run on the five models defined in the previous chapter. With each
model sixteen different processor sets and three different scheduling schemes were used. The
processor sets were powers of 2 from 21 to 216. The scheduling schemes corresponded to
orderings of the work queue and are described in a subsequent section.
The results of the bounded parallelism models will indicate that the speed ups seen in the
unbounded models are achievable with reasonable-size processor sets and with significant
improvements in efficiency. However, the initial versions of these bounded models will also
demonstrate the critical importance of processor allocation strategies and task definitions.
As a result, several revisions will be made to the initial models.
4.1 Initial Models
The initial bounded parallelism models follow directly from the unbounded parallelism
models. The critical difference is that a limited processor set is assumed, so tasks that are ready
for execution will have to wait for available processors. This can be appropriately
modelled via a trace-driven simulation run against the representation of the elimination DAGs
obtained from the traces produced by the sequential version of the unsymmetric-pattern
multifrontal method. Critical issues with the simulation will be how to allocate processors
when the number available is less than that required and how to order the work that is ready
to be done.
4.1.1 Trace-Driven Simulation
The trace-driven simulations used to implement the bounded parallelism models were
accomplished with a custom Pascal program that was built as an extension to the analysis
program used for the unbounded parallelism models. The simulation uses the topologically
sorted representation of the elimination DAG as input. All the nodes (frontal matrices)
on the leaf level (Level 0) are initially put into a work queue. Work is scheduled from this
queue based on processor availability. The amount of work and required processors for each
node is determined based on the specific model being simulated. These models are initially
those defined for the unbounded parallelism models. When the model calls for a sequential
ordering of the nodes, only one node may be executing at any particular time, otherwise
multiple nodes may execute concurrently based on processor availability and the allocation
strategy in use. Upon completion of a node's execution, the edges to dependent nodes are
removed. When all such edges to a particular node have been removed, that node is put
into the work queue as it is available for execution. The simulation continues until all nodes
have completed execution.
Speed up and processor utilization are then calculated per the following formulas:

$$\mathrm{SpeedUp} = \frac{\mathrm{Time}_{\mathrm{sequential}}}{\mathrm{Time}_{\mathrm{parallel}}}$$

$$\mathrm{Utilization} = \frac{\sum_{\mathrm{all\ nodes\ } j} \mathrm{Time}_j \times \mathrm{ProcessorsUsed}_j}{\mathrm{Time}_{\mathrm{finished}} \times \mathrm{ProcessorSetSize}}$$
The average number of searches through elements in the work queue is also determined for
each simulation run.
Time history reports can be produced that track the number of processors in use and the
size of the work queue across time.
Output from the simulation program can be in either report format or a Matlab m-file
format. The latter format is used to port the results to Matlab for plotting.
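The scheduling loop described above can be summarized by the following simplified sketch
(Python; the actual simulator was a Pascal program, and work, procs, succs, and indeg are
hypothetical per-node inputs giving each node's parallel time, processor requirement,
dependents, and unfinished-son count):

import heapq

def simulate(nodes, succs, indeg, work, procs, P):
    queue = [j for j in nodes if indeg[j] == 0]          # leaf level is ready first
    running = []                                         # heap of (finish_time, node, processors)
    free, clock = P, 0
    while queue or running:
        for j in list(queue):                            # start every ready node that still fits
            if procs[j] <= free:
                queue.remove(j)
                free -= procs[j]
                heapq.heappush(running, (clock + work[j], j, procs[j]))
        clock, j, used = heapq.heappop(running)          # advance to the next completion
        free += used
        for k in succs.get(j, []):                       # remove edges to dependent nodes
            indeg[k] -= 1
            if indeg[k] == 0:
                queue.append(k)
    return clock                                         # simulated parallel time

This sketch assumes integer node identifiers, that every node fits on the full processor set,
and that nodes may run concurrently; the sequential-ordering models and the allocation
strategies of the next section refine how queue entries are chosen and sized.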
4.1.2 Processor Allocations
Allocation of processors to the assembly and factoring of a frontal matrix is a critical issue
for the realization of the algorithm on a bounded set of processors. As will be shown in the
results of the initial models, the allocation scheme can greatly affect the performance of the
algorithm. Since the five unbounded models use three different approaches to factoring a
particular frontal, there are three distinct allocation schemes.
The allocation scheme required by Model 0 is trivial since each frontal matrix in this model
is allocated only a single processor.
Models 1 and 3 use a block concurrency within a frontal matrix. As this block concurrency
is row-oriented, a processor set equal to the row size is required. The difficulty arises when
there are processor sets available but they are smaller than that required (in fact the total
available number of processors is frequently less than the row size). In this eventuality the
total amount of work is evenly distributed across the available processors. This is a fairly
crude allocation scheme but is not too unrealistic given the nature of the processing within
a frontal.
To formally define this allocation scheme, we define the work to be done within a frontal j as
before and shown below (recall the following definitions: P_j pivots of node j, c_j columns
of node j, and r_j rows of node j):

$$\mathrm{work}_j = c_j + \sum_{i=1}^{P_j}(r_j - i + 1) + P_j + 2\sum_{i=1}^{P_j}(c_j - i)$$

With this definition of work we define the allocation scheme as:
if processors_available >= row_size then
    Schedule work on (row_size) processors
else
    Schedule (work * row_size) / processors_available
        on the processors_available
endif
Models 2 and 4 use full concurrency within each frontal matrix and require a number of
processors that is equal to the number of entries in the frontal (row_size times col_size). A
more sophisticated allocation scheme is thus required. Prior to formally defining this scheme,
we recall the definition of the time complexity for a frontal using this model:

$$\mathrm{work}_j = \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2(r_j - i + 1) \rceil + P_j + 2P_j$$
Using this definition and a similar assumption on the ability to distribute work on smaller
processor sets, the allocation scheme is formally defined as follows:
if processors_available >= (col_size * row_size) then
    Schedule work on (col_size * row_size) processors
else
    if processors_available >= row_size then
        Schedule (work + 3 * pivots * col_size) on
            (row_size) processors
    else
        if processors_total >= row_size then
            Wait for more available processors
        else
            Schedule (work * col_size * row_size) /
                processors_available
                on processors_available
        endif
    endif
endif
Notice that the deepest nested "else" block provides a fairly crude overapproximation of
the amount of work to do. However, this case is only used for the small total processor set
sizes which are not the target processor sets for the parallelism of Models 2 and 4. Thus,
this overestimate is reasonable.
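The same decision logic can be condensed into a small function; this is only an illustrative
Python sketch (the work value is the full-concurrency time from the formula above, and the
inflation factors simply mirror the pseudocode rather than an exact operation count):

def allocate_full_concurrency(work, pivots, row_size, col_size, avail, total):
    # Returns (estimated_time, processors_to_use), or None to wait for more processors.
    if avail >= row_size * col_size:
        return work, row_size * col_size                  # full entry-level parallelism
    if avail >= row_size:
        return work + 3 * pivots * col_size, row_size     # one processor per row only
    if total >= row_size:
        return None                                       # worth waiting for releases
    return (work * col_size * row_size) / avail, avail    # tiny machine: spread the work

The final case is the crude overestimate discussed above and only matters for very small
total processor set sizes.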
4.1.3 Scheduling Methods
There are three scheduling methods employed by the work queue. These methods
correspond to how the work ready to be executed is ordered within the work queue. The three
corresponding orderings are first-in, first-out (FIFO); largest calculation first; and heaviest
path first.
The FIFO method is self-explanatory and is included because of its implementation simplicity
and efficiency.
The largest calculation first ordering uses the node weight estimate of the work to be done and
schedules the largest such nodes first. This method is designed to approximate the next
method, heaviest path first. Largest calculation first requires a priority queue
implementation based on a set of priorities that is quite variable. As such there would be
some significant penalties to address in implementing this method.
The last method, heaviest path first, uses the heaviest path determined by the analysis
portion of the software. Any node on this path would be the next scheduled. Other nodes
are put in a FIFO ordering. Notice that since the heaviest path is essentially a path through
the dependency chart, only one node at most from the heaviest path will ever be in the work
queue at any one time.
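The three orderings amount to different sort keys on the ready queue. A hypothetical
Python illustration (weight and on_heaviest_path are assumed per-node attributes produced
by the analysis phase):

def order_queue(ready, method, weight=None, on_heaviest_path=None):
    if method == "fifo":
        return list(ready)                                        # arrival order
    if method == "largest_first":
        return sorted(ready, key=lambda j: -weight[j])            # biggest estimated work first
    if method == "heaviest_path_first":
        # At most one ready node lies on the heaviest path; it goes first, the rest stay FIFO.
        return sorted(ready, key=lambda j: not on_heaviest_path[j])
    raise ValueError(method)

Since Python's sort is stable, the heaviest-path ordering leaves all off-path nodes in their
original FIFO order, as described above.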
4.1.4 Results
The results of the simulations using the initial models were analyzed using both the report
format listings and graphs produced from the output files using Matlab. The results of this
analysis are summarized by model. These initial models will be referred to as the baseline
models from which further revisions will be made in the next section.
Whenever the results for all five matrices are presented, the following line types will be used
to represent the different matrices:
 mahindasb
*** gematll
+++ gre_1107
000 lns_3937
  sherman5
All the results presented in this section are based on the heaviest path first scheduling unless
otherwise mentioned.
Model 0: Concurrency Between Elements Only  Model 0 takes advantage of
concurrency only between frontal matrices. The speed up and utilization results from Model 0
are shown in Figure 4.1. These indicate that the speed ups seen in the unbounded models are
achievable with significantly fewer processors and corresponding higher utilizations. Table
4.1 compares the number of processors used in the bounded versus the unbounded models
to achieve maximum speed up.
There were no significant differences produced by the scheduling methods for this model.
Model 1: Block Concurrency Within Elements Only  Model 1 only takes
advantage of concurrency within frontal matrices and does so in a block-oriented fashion. The
speed up and utilization results are shown in Figure 4.2. The method employed by this
model seems well suited for processor sets up to about 256 processors. After that point there
is no further advantage to additional processors for any of the test matrices.
The various scheduling methods showed no significant differences for this model.
Model 2: Full Concurrency Within Elements Only The block concurrency of
Model 1 is extended to full concurrency in Model 2 but is still restricted to concurrency only
within the frontal. The speed up and utilization results for Model 2 are shown in Figure
4.3. A very definite step behavior is evident in the speed up curves. This behavior directly
correlates to the allocation scheme in use, which bases allocation on the row size and column
size of the frontal matrices. While the speed ups are significant, the step behavior severely
limits scalability. Later revisions to this model will resolve this problem.
The different scheduling methods showed no significant differences for this model.
Model 3: Block Concurrency Within and Between Elements Model 3 combines
the block concurrency within a frontal of Model 1 with the concurrency between frontals
used in Model 0. The speed up and utilization results for this model are shown in Figure
4.4.
There is a definite irregular behavior evident in the speed up and utilization curves. Analysis
of these irregularities determined that the allocation scheme employed was causing large
frontal matrices to be scheduled on a relatively small number of processors. This could
occur if two frontals were in the work queue and the first one required 90% of the processors.
Once the first frontal got its processors, the second would be scheduled on the remaining
processors. This would significantly stretch out the time it would need to complete. The first frontal
could complete and release its processors, but all subsequent frontals could not be scheduled
since they are dependent on the completion of the second frontal. The result is that 90% of
the processors go unused for a significant period of time. This scenario is illustrated with a
time history of utilization for the sherman5 matrix using 128 processors provided in Figure
4.5. The problems with processor allocation will be addressed in two subsequent revisions
to this model in the next section.
There were some significant differences in the scheduling methods used but these were side
effects of the allocation problems. Scheduling methods will be addressed again after the
allocation problems are resolved with the subsequent revisions.
Model 4: Full Concurrency Within and Between Elements The fifth and final
model uses concurrency between nodes and full concurrency within nodes. The speed up
and utilization results for this model are presented in Figure 4.6.
A steplike speed up and corresponding utilization irregularities are evident and very similar
to the results for Model 2. These results are also traced to the allocation method in use
and will be addressed in a subsequent revision to the allocation scheme presented in the next
section.
Comparison of these results against the unbounded Model 4 reveals that the maximum speed
up is not obtainable with the processor sets tested and the current definition of Model 4.
Figure 4.1: Model 0 Baseline Results: (a) Speed Up Results; (b) Utilization Results.
Table 4.1: Processors Used Comparison (processors used in the bounded versus the unbounded
model for each of the five matrices)
Figure 4.2: Model 1 Baseline Results: (a) Speed Up Results; (b) Utilization Results.
Figure 4.3: Model 2 Baseline Results: (a) Speed Up Results; (b) Utilization Results.
Figure 4.4: Model 3 Baseline Results: (a) Speed Up Results; (b) Utilization Results.
Figure 4.5: Utilization Time History For Sherman5 With P=128.
Figure 4.6: Model 4 Baseline Results: (a) Speed Up Results; (b) Utilization Results.
4.1.5 Conclusions and Observations
Several conclusions and observations are apparent from these initial, baseline bounded
parallelism models.

* There is an excellent potential for parallelism under both SIMD and MIMD
  implementations.
* The speed ups indicated in the unbounded models are achievable with practical size
  processor sets.
* The benefits of the different models are realized within distinct ranges of the number
  of processors.
* Model 3 exhibits irregular behavior due to inefficient allocation of large critical path
  elements to small processor sets. Similar, but less severe, results were seen with Model 4.
* Scheduling method has no significant effect on performance for Models 0 and 1. Models
  2 and 4 also exhibit consistent behavior for the different scheduling methods; however,
  a more detailed analysis will be done after these models have been refined. The effect
  of scheduling on performance for Model 3 is obscured by the irregular behavior of that
  model. Analysis of scheduling for this model will also be postponed until the model has
  been refined.
* The performance under Models 2 and 4 (those using full concurrency within the
  elements) displays a definite steplike behavior that correlates strongly with the allocation
  strategy in use.
* The speed ups obtained by Model 4 were well below the maximum speed ups predicted
  by the unbounded version of the model for the larger matrices.
4.2 Refined Models
While the baseline models provided some significant insights into the potential of the
unsymmetric-pattern multifrontal method, they also illustrated several undesired effects that can result
if care is not taken in the allocation of processors and the definitions of tasks. As a result,
the allocation method used by Model 3 is revised in two distinct manners to produce a more
regular behavior. The allocation and task definitions for Models 2 and 4 are revised to
address the steplike performance curves obtained from these models. Finally, the concept of
pipelining is applied to the assembly process to further improve the performance of Model
4.
4.2.1 Model 3 Revisions
The first version of Model 3 (block concurrency within and between elements) experienced
irregular behavior that was traced to inefficient allocation of processors. In particular, when
most of the processors were already in use, the allocation algorithm would assign the next
frontal to the smaller set of remaining processors. Since this next element could be a large
frontal, it would take a long time on the small set of processors. Furthermore, all subsequent
elements could (in fact, are very likely to) be dependent on this large frontal. Thus, when
other processors are freed up, no other frontals can be factored since they are dependent
on this large frontal. This behavior was verified with the time history utilization graphs
presented earlier.
Model 3 Revision 1
The idea behind the first approach to this problem is to only assign processors to a task
if such an assignment will result in the task being completed at least as soon as if it were
postponed until more processors are available. This method requires an accurate prediction
of the number of processors to next become available. Such predictions are possible for this
algorithm given a centralized scheduler and the predictability of the workload associated
with each frontal. The second revision to this model will not require such predictions and is
thus well suited for more general application.
New Allocation Scheme The new allocation scheme is formally defined in the following
algorithm. Notice that the case of the full complement of processors available is implicitly
handled.
Set pointer to beginning of the Work Queue
while (processors are available) and
      (more Work Queue entries to check) do
begin
    Calculate when the next entry on the Work Queue would
        finish if allocated processors from those
        currently available                  -> Scheduled_Now
    Calculate when the next entry on the Work Queue would
        finish if allocation is delayed until more
        processors are freed up              -> Scheduled_Later
    if (Scheduled_Now <= Scheduled_Later) then
        Schedule the entry now using
            min(required, available) processors
    Advance Work Queue pointer
end; {while}
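The schedule-now-versus-later comparison might be coded as follows; this is a sketch only
(Python, where finish_time is an assumed helper that evaluates the Model 3 work formula
for a given processor count, and next_release is the predicted time at which the next
running frontal frees its processors):

def processors_to_grant(required, avail, now, next_release, finish_time):
    # Grant processors now only if that finishes the entry no later than waiting would.
    scheduled_now   = now + finish_time(min(required, avail))
    scheduled_later = next_release + finish_time(required)
    if scheduled_now <= scheduled_later:
        return min(required, avail)       # schedule the entry immediately
    return 0                              # postpone until more processors are freed

The prediction of next_release is what makes this revision dependent on a centralized
scheduler with a predictable workload, as noted above.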
Speed Up and Utilization Results The results of this new allocation scheme are
presented in Figure 4.7. The new allocation has resolved the irregularities and provides a
very nicely scalable method up through processor sets of 512.
Figure 4.7: Model 3 Revision 1 Results: (a) Speed Up Results; (b) Utilization Results.
Time History Utilization Results  Additional verification that the problems with the
Model 3 baseline have been resolved is seen in the utilization time history of the sherman5
matrix using 128 processors. A comparison of the baseline and Revision 1 utilization time history
results is provided in Figure 4.8. The large gaps of low utilization have been eliminated.
Required Work Queue Searches A concern with the new allocation scheme is that the
scheduling/allocation process may now take much longer since multiple work queue entries
(up to the entire work queue) may have to be checked on each scheduling opportunity. For
this reason the average number of work queue entries checked was analyzed for each of the
three scheduling methods. Figure 4.9 illustrates this comparison for the lns_3937 matrix.
The solid line depicts the FIFO method. The starred (*) line shows the largest calculation
first method. The dashed line shows heaviest path first. While the largest calculation first
method had the highest number of searches for all five matrices, its performance relative
to the other methods varied with processor set size and by the matrix. Thus, while the
other two methods appeared to be better in general, their dominance was not consistent.
An additional scheduling alternative that was not tested would be smallest calculation first.
Since this method would put the entries most likely to be scheduleable on limited processors
up in the front of the work queue, the number of work queue searches would likely drop.
Scheduling Differences With the ill effects of the initial processor allocation scheme
resolved, a serious look at the scheduling method influence can be taken. As a result of this
look, I found the effect of scheduling on performance was minimal. The greatest difference
was seen for the gemat11 matrix. These results are seen in Figure 4.10. The solid line
represents the FIFO method, the starred (*) line is largest calculation first, and the dashed line
is the heaviest path first method. The heaviest path first method did offer some minimally
better performance for processor sets in the range 256 to 2048.
Figure 4.8: Utilization Time Histories (Before and After Revision 1): (a) Original; (b) Revision 1.
Figure 4.9: Average Number of Work Queue Searches: lns_3937
Figure 4.10: Model 3 Rev 1 Schedulings: gemat11
Model 3 Revision 2
The second revision to the Model 3 allocation scheme was developed as an alternative to the
overhead and limitations imposed by the need to make predictions that was inherent in the
first revision. In particular, this second scheme always allocates processors to the next task
in the work queue if nothing is currently executing. If, however, only a subset of the total
number of processors is available, the next task in the work queue will be scheduled only if
its entire number of required processors is available. While this scheme should not produce
results that are quite as good as the first revision, it also does not require the additional
overhead. Furthermore, the method could use a work queue that is organized in some type
of tree structure ordered by processor requirement. With such a work queue, searching for
next tasks could be reduced from linear to logarithmic time complexity.
Revision 2 Allocation Scheme A more formal definition of this allocation scheme is
presented in the following algorithm:
Set Work Queue Pointer to start of Work Queue
if processors_available = processors_total then
    Schedule next Work Queue entry
    Advance Work Queue Pointer
endif
while (more Work Queue entries to check) and
      (processors_available > 0) do
    if processors_available >= processors_required then
        Schedule next Work Queue entry
    endif
    Advance Work Queue Pointer
endwhile
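The tree-ordered work queue suggested earlier can be sketched with a sorted list in place of
a balanced tree (illustrative Python only; this variant picks the largest task that fits, which
is one of several reasonable choices):

import bisect

class Rev2Queue:
    def __init__(self):
        self.reqs, self.tasks = [], []            # kept sorted by processor requirement
    def add(self, task, req):
        i = bisect.bisect(self.reqs, req)
        self.reqs.insert(i, req)
        self.tasks.insert(i, task)
    def pop_runnable(self, avail, total):
        if not self.reqs:
            return None
        if avail == total:
            i = 0                                 # whole machine idle: start something regardless
        else:
            i = bisect.bisect(self.reqs, avail) - 1   # largest entry whose requirement fits
            if i < 0:
                return None                       # nothing fits; wait for processors to be freed
        self.reqs.pop(i)
        return self.tasks.pop(i)

Binary search keeps the lookup logarithmic, though the list insertions here are still linear;
a true tree (or a heap keyed on processor requirement) would be needed for the full benefit.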
Speed Up and Utilization Results The speed up and utilization results produced
by this second revision are provided in Figure 4.11. While not quite as good as the results
of the first revision, they are very promising. Speed ups increase nicely with processor set size
and the utilization curves are very consistent, with only one minor irregularity. The greatest
difference in speed ups was seen for the sherman5 matrix. This difference is shown in Figure
4.12 with the dashed line representing the second revision and the solid line the first revision.
Figure 4.11: Model 3 Revision 2 Results: (a) Speed Up Results; (b) Utilization Results.
Time History Utilization Results A comparison of the utilization time histories
of Model 3 Revisions 1 and 2 is provided for the sherman5 matrix in Figure 4.13. The
time histories use a processor set of 128 (as before) and illustrate that the second revision
maintains a very nice processor utilization.
Scheduling Differences Scheduling methods differences for this second revision were
very comparable to that of the first revision.
Figure 4.12: Model 3 Speed Ups (Rev 1 vs Rev 2): sherman5
Figure 4.13: Utilization Time Histories (Rev 1 vs Rev 2): Sherman5 (P=128)
4.2.2 Models 2 and 4: Vertical Partitioning
In order to smooth out the steplike performance curves of Models 2 and 4, an allocation
scheme is needed to make more efficient use of the parallelism available. The approach taken
is to combine the allocation concepts of the first revision to Model 3 with a finer grain of task
definition. I call the approach vertical partitioning since it will partition the work done for a
frontal into sets of tasks done sequentially. Each task in turn will be composed of subtasks
that can be accomplished in parallel. This approach should provide the flexibility needed to
make more efficient use of the parallelism available.
Specifically, the factoring of a frontal matrix will be altered to process assemblies one son
at a time and to divide up the numerical pivoting, scaling, and active matrix updates.
The new assembly processing changes the model definition. The new definition requires
more sequential time, since sons are not assembled concurrently but does not require the
additional memory of the earlier approach. The new definition of processing within a frontal
for Models 2 and 4 is given by the following formula:
$$S_j + \sum_{i=1}^{P_j} \lceil \log_2(r_j - i + 1) \rceil + P_j + 2P_j$$
With vertical partitioning each son is assembled separately using a processor set equal to
the number of entries contributed by the son. Each such assembly becomes a separate task
for scheduling. Once all the sons are assembled, the factorization commences for one pivot
at a time. First the numerical pivoting is scheduled as a separate task requiring a number
of processors equal to the row size of the active submatrix of the frontal. Numerical
pivoting requires logarithmic time to complete the parallel prefix operation using the maximum
associative operator. Upon determination of the numerical pivoting, scaling and updating
are scheduled sequentially each as a set of parallel tasks. The scaling requires a processor
set proportional to the row size of the active submatrix. The update will use a processor
set equal to the entire size of the active submatrix. This sequence of numerical pivoting,
scaling, and updating is repeated for each pivot with which the frontal is reduced.
If ever there are insufficient processors available for a particular task group, a strategy similar
to that used by Revision 1 of Model 3 is used. That is, the completion times of scheduling
the work now on the available processors versus that of waiting for more processors to be
freed up are compared. If immediate scheduling will not delay completion, the tasks are
scheduled, otherwise the scheduling is postponed.
The basic algorithm for vertical partitioning is shown below without the logic for insufficient
processor availability.
for each son to be assembled do
    Assemble the contributions from this son in
        parallel using one processor per entry
endfor
for each pivot (i = 1 to p) for this frontal do
    Determine the maximum-valued entry in the first
        column of the active submatrix using
        (row_size - i + 1) processors
    Do the scaling (multiplier calculation) using
        (row_size - i) processors
    Update the active submatrix using
        (row_size - i) * (col_size - i) processors
endfor
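The task breakdown can also be expressed as a generator of (parallel_time, processors) pairs
for a single frontal; the following Python sketch just restates the formula and pseudocode
above (entries_from_son is an assumed list giving the number of entries contributed by each
son, and pivots is assumed smaller than row_size and col_size):

from math import ceil, log2

def vertical_partition_tasks(entries_from_son, pivots, row_size, col_size):
    for n in entries_from_son:                    # one assembly task per son
        yield 1, n                                # one processor per contributed entry
    for i in range(1, pivots + 1):                # then factor one pivot at a time
        active_rows = row_size - i + 1
        yield ceil(log2(active_rows)), active_rows           # parallel-prefix pivot search
        yield 1, row_size - i                                 # scaling (multiplier column)
        yield 2, (row_size - i) * (col_size - i)              # update of the active submatrix

Each yielded pair is a separately schedulable task, which is what gives the allocator the finer
grain it needs to avoid the step behavior.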
Model 2 Vertical Partitioning  The improvements due to vertical partitioning are quite
dramatic. Figure 4.14 below illustrates the speed up improvements of the original Model 2
versus the vertical partitioning version. Recall that Model 2 uses concurrency within frontals
only. Not only have the curves smoothed out, but they also offer much higher speed ups across
the entire range of processor set sizes (notice the vertical scales).
Likewise the corresponding utilizations are significantly improved as is illustrated by the
comparison provided in Figure 4.15.
Model 4 Vertical Partitioning The same vertical partitioning changes were applied
to Model 4 that exploits parallelism both within and across nodes in the elimination DAG.
Even more dramatic improvements were seen with this revision to Model 4. The speed up
comparison is shown in Figure 4.16. These curves illustrate that the maximum speed ups
are now achievable with the processor sets tested for all but one of the matrices (lns_3937
has a speed up of about 11,000 using 65,536 processors, where its unbounded Model 4 speed
up was over 11,000 using roughly 179,000 processors).
The corresponding utilization improvements of using vertical partitioning with Model 4 are
shown in Figure 4.17. Notice that the use of concurrency across nodes has significantly
improved utilization also.
Average Work Queue Searches The scheduling method (ordering of the work queue)
with Model 4 vertical partitioning has no real effect on speed ups or utilizations but did
affect the average number of work queue searches for the smaller processor sets. Figure 4.18
shows the average work queue searches for gre_1107. The largest calculation first method
(starred line) required significantly more searches than FIFO (solid line) or heaviest path
first (dashed line).
Utilization Time History A utilization time history comparisons reveals much of the
dynamics of Model 4. Figure 4.19 compares utilization time histories for gre_1107 using
16,384 processors on both the original and vertical partitioning versions of Model 4.
Subfigures (a) and (b) offer the comparison with a common horizontal scale. Subfigure (c) provides
a more detailed look at processor usage with vertical partitioning. This figure reveals much
about the nature of the elimination DAG with the presence of the larger frontals quite evident
in the large quadratically decreasing regions.
Figure 4.14: Model 2 Vertical Partitioning Speed Up Comparisons: (a) Original; (b) Vertical Partitioning.
Figure 4.15: Model 2 Vertical Partitioning Utilization Comparisons: (a) Original; (b) Vertical Partitioning.
Figure 4.16: Model 4 Vertical Partitioning Speed Up Comparisons: (a) Original; (b) Vertical Partitioning.
Figure 4.17: Model 4 Vertical Partitioning Utilization Comparisons: (a) Original; (b) Vertical Partitioning.
Figure 4.18: Model 4 Average Work Queue Searches: gre_1107
Figure 4.19: Vertical Partitioning Utilization Time History: gre_1107: (a) Original;
(b) Vertical Partitioning; (c) Vertical Partitioning (Spread Out).
4.2.3 Model 4: Assembly Pipelining
An additional modification was made to the vertical partitioning version of Model 4. This
modification is based on a concept proposed by Professor Ravi Varadarajan to pipeline the
assembly process. In particular, the assembly process is initiated by the sons (the
contributors) instead of by the fathers (the receivers). As the factorization of a frontal completes, it
immediately schedules the assembly of its contributions to other frontals. Once a subsequent
frontal has received all the contributions from its sons, it begins its own factorization process.
This approach to assembly pipelining is fairly conservative since the assembly pipeline does
not begin until factorization is complete. More aggressive assembly pipeline strategies are
possible with closer interactions with the factorization process.
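In code, the son-initiated scheme reduces to two event handlers; this is a hypothetical Python
sketch (schedule_assembly and schedule_factor stand in for the simulator's task-creation
calls, and pending[f] is assumed to be initialized to the number of sons of frontal f):

def on_factor_complete(son, fathers, schedule_assembly):
    # As soon as a frontal finishes factoring, it pushes its contribution block to each father.
    for father in fathers[son]:
        schedule_assembly(src=son, dst=father)

def on_assembly_complete(father, pending, schedule_factor):
    # A father starts its own factorization once every son's contribution has been assembled.
    pending[father] -= 1
    if pending[father] == 0:
        schedule_factor(father)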
There is an oversight in this assembly pipeline model that limits the value of its comparison
to vertical partitioning. Specifically, no synchronization provisions are made to ensure that
the assembly of multiple contributions into a single frontal element does not occur concurrently.
Thus a CRCW-SUM model must be implicitly assumed for the assembly pipeline, whereas a
CREW model was used for the vertical partitioning version of Model 4. The effect of
this oversight is not suspected to be too significant, but further revised runs are necessary to
address the issue.
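One way the oversight could be removed in such revised runs is to serialize concurrent assemblies into the same frontal matrix, so that the CREW assumption again suffices. The sketch below guards each frontal matrix with its own lock; the use of POSIX-threads-style locking and the field names are purely illustrative assumptions, not part of the actual method.

    /* Sketch: guarding concurrent assembly into one frontal matrix so a
     * CREW model suffices.  A per-frontal mutex is used for illustration. */
    #include <pthread.h>
    #include <stdio.h>

    typedef struct {
        int             rows, cols;
        double         *entries;     /* dense storage, row-major              */
        pthread_mutex_t lock;        /* serializes assemblies into this frontal */
    } Frontal;

    /* Add a son's contribution block into the father.  The map arrays give,
     * for each contribution entry, its row/column position in the father. */
    void assemble(Frontal *father, const double *contrib,
                  const int *row_map, const int *col_map, int m, int n)
    {
        pthread_mutex_lock(&father->lock);
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                father->entries[row_map[i] * father->cols + col_map[j]]
                    += contrib[i * n + j];
        pthread_mutex_unlock(&father->lock);
    }

    int main(void) {
        double  storage[4] = {0};
        Frontal f = {2, 2, storage};
        pthread_mutex_init(&f.lock, NULL);

        double contrib[1] = {3.5};            /* one-entry contribution block */
        int    rmap[1] = {1}, cmap[1] = {0};
        assemble(&f, contrib, rmap, cmap, 1, 1);
        printf("father entry (1,0) = %g\n", f.entries[1 * f.cols + 0]);

        pthread_mutex_destroy(&f.lock);
        return 0;
    }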
The assembly pipelining runs produced the same favorable speed up and utilization curves as
were seen with vertical partitioning. Figure 4.20 illustrates the speed up results.
Vertical Partitioning vs Assembly Pipelining Furthermore, the assembly pipelining
version of Model 4 also demonstrated a significant improvement in speed up over the vertical
partitioning version. Figure 4.21 shows a comparison of the speed up curves for the gre_1107
matrix under the two versions. Such comparisons must, however, be tempered by the earlier
comments on the differing memory models.
Figure 4.22 compares the average work queue searches for vertical partitioning and assembly
pipelining, both using heaviest path first scheduling. The number of searches is larger for
assembly pipelining since multiple assemblies are scheduled concurrently as each frontal
matrix's factorization completes.
Figure 4.20: Model 4 Assembly Pipeline Speed Up Results. (Speed up vs. log2 of number of processors.)
Figure 4.21: Speed Up Comparisons (VP vs AP): gre_1107. (Speed up vs. log2 of number of processors.)
Figure 4.22: Average Work Queue Searches (VP vs AP): gre_1107. (Average work queue searches vs. log2 of number of processors.)
4.3 Scalability Analysis
The results presented so far have indicated that the unsymmetric-pattern multifrontal
method has some nice scalability features, especially when the fuller ranges of concurrency
are employed (i.e., Model 4). Ideally, one would like to quantify this scalability with the
use of isoefficiency curves. However, there is a major obstacle to accomplishing this goal.
In particular, it is very difficult to quantify problem size, since the complexity of a sparse
matrix problem is a function of its order, sparsity, and degree of symmetry, as well as other
factors. More specifically, for the unsymmetric-pattern multifrontal method, the structure
and node sizes of the elimination DAG combine to dictate the complexity of the factorization
process.
One approach to the scalability issue is to isolate a single factor that dominates problem
size for a given limited model. In particular, I chose to look at Model 2, whose results
were very strongly correlated with average node size. Thus, I defined problem size by average
node size and developed a three-dimensional representation of the efficiency function (that
is, efficiency equals speed up divided by processors used). Figure 4.23 (a) shows this three-
dimensional result. The upper left corner corresponds to the 2-processor set, and larger proces-
sor sets (4, 8, 16, 32, ..., 65536) are represented coming down the left diagonal. The right
diagonal represents average node size, where the leftmost entry is the lns_3937 matrix with
an average node size of 67,S., then sherman5 with 42,093, gre_1107 with 9,998, mahindasb
with 1,887, and, finally in the upper right, gemat11 with 1,195. Thus the scales on this figure
are logarithmic on one axis and irregular and decreasing on the other. However, in spite of
its difficulties, the figure does illustrate efficiencies due to larger node sizes for Model 2. The
second subfigure is a contour map of the first subfigure and is presented as an approxima-
tion of the isoefficiency curves. (The vertical axis represents the processor sets of the left
diagonal and the horizontal axis indicates the various matrices.)
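For reference, each point of the surface is simply the measured speed up divided by the number of processors in the set. The sketch below tabulates such an efficiency surface over processor sets and matrices ordered by decreasing average node size; the speed up values in it are placeholders for illustration and are not the measured simulation results.

    /* Sketch: tabulating efficiency = speed up / processors over processor
     * sets (rows) and matrices ordered by decreasing average node size
     * (columns).  Speed up values are placeholders, not measured data.   */
    #include <stdio.h>

    #define NPROC_SETS 4
    #define NMATRICES  3

    int main(void) {
        int         procs[NPROC_SETS]     = {2, 4, 8, 16};
        const char *matrix[NMATRICES]     = {"lns_3937", "sherman5", "gre_1107"};
        double speedup[NPROC_SETS][NMATRICES] = {   /* placeholder data */
            { 1.9,  1.8, 1.7},
            { 3.6,  3.4, 3.0},
            { 6.8,  6.1, 5.2},
            {12.5, 10.9, 8.4},
        };

        printf("%6s", "P");
        for (int j = 0; j < NMATRICES; j++) printf("  %10s", matrix[j]);
        printf("\n");
        for (int i = 0; i < NPROC_SETS; i++) {
            printf("%6d", procs[i]);
            for (int j = 0; j < NMATRICES; j++)
                printf("  %10.2f", speedup[i][j] / procs[i]);   /* efficiency */
            printf("\n");
        }
        return 0;
    }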
The concept for this type of isoefficiency analysis was also proposed to me by Professor Ravi
Varadarajan. While the difficult nature of this problem and the limited resources available
produced less than what could be hoped for, the results do illustrate some positive efficiency
characteristics.
Figure 4.23: Model 2 Efficiency Curves. (a) Processor Sets vs. Matrix; (b) Isoefficiency Curves. (Horizontal axis: average node weight, large to small.)
Chapter 5
Conclusions
The bottom line is that a factor-only version of the unsymmetric-pattern multifrontal
method for the LU factorization of sparse matrices shows significant potential for paral-
lel implementation. Furthermore, parallelism can be exploited in a variety of degrees and
levels. The most significant results occur when parallelism is used both within and across
nodes in the elimination DAG. The resulting synergism is very promising, as the refined
Model 4 results indicate.
However, exploiting parallelism to its fullest extent also incurs costs, both in scheduling and
in mapping the parallelism. Where possible, several of these costs have been addressed in the
analyses, such as the average number of work queue searches and the various allocation and
scheduling alternatives.
Another critical observation is that the different degrees of concurrency produce their best
results for distinct processor set ranges. Concurrency only across nodes in the elimination
DAG (Model 0) saw its best performance for small processor sets (typically no more than
eight processors). Block-level concurrency within frontal matrices (Models 1 and 3) was best
realized on processor sets of 32 to 512 processors. The models with full concurrency within a
frontal matrix (Models 2 and 4) realized the best results and were dominant over most of the
range of processor sets tested. Figure 5.1 compares Model 3 Revision 1 against Model 4 with
vertical partitioning on the sherman5 matrix for a processor set range of 32 to 256. The figure
illustrates that the higher degree of concurrency of Model 4 does outperform Model 3 in
the range of processors where Model 3 has its best performance. Also encouraging is the
observation that almost all of the speed ups predicted by the unbounded models are realizable
with processor set sizes that are currently implemented.
Another conclusion from this effort is that the elimination DAGs tend to have dominant
near-linear components that directly limit further parallelism. While these structures work
well for the sequential version of the method, different construction techniques may be possible
to enhance parallelism.
Furthermore, the allocation and task definition schemes have very pronounced effects on
performance. Workable solutions to these issues have been developed and verified with the
models, but care needs to be taken in implementation.
Finally, throughout this effort I have approached the actual LU factorization in a very
straightforward fashion. This was done for ease of description and ease of modelling. There
are many other algorithmic alternatives for accomplishing the factorization that may lend
themselves to easier and more efficient parallel implementation. Such alternatives would be
a significant implementation issue.
Figure 5.1: Model 3 (Rev 1) vs Model 4 (VP) for P=32..256. (Speed up vs. log2 of number of processors.)
BIBLIOGRAPHY
1. Aho, A. V., J. E. Hopcroft, and J. D. Ullman, Data Structures and Algorithms, Addison-Wesley, Reading, MA, 1974.
2. Akl, Selim G., The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1989.
3. Amestoy, Patrick R., Factorization of Large Unsymmetric Sparse Matrices Based on a Multifrontal Approach in a Multiprocessor Environment, Doctorate Thesis, CERFACS Report Ref: TH/PA/91/2, Toulouse, France, 1991.
4. Davis, T. A. and I. S. Duff, Unsymmetric-Pattern Multifrontal Methods for Parallel Sparse LU Factorization, Technical Report TR-91-23, Computer and Information Sciences Department, Univ. of Florida, Gainesville, FL, 1991.
5. Duff, I. S., R. G. Grimes, and J. G. Lewis, Sparse Matrix Test Problems, ACM Trans. Math. Softw., 1989, 15, pp. 1-14.
6. Duff, I. S. and S. L. Johnsson, Node Orderings and Concurrency in Structurally Symmetric Sparse Problems, in Parallel Supercomputing: Methods, Algorithms, and Applications, John Wiley & Sons Ltd, 1989.
7. Golub, G. H. and C. F. Van Loan, Matrix Computations, 2d Ed., Johns Hopkins University Press, Baltimore, MD, 1989.
8. Manber, Udi, Introduction to Algorithms: A Creative Approach, Addison-Wesley, Reading, MA, 1989.