Potential and Achievable Parallelism in the
Unsymmetric-Pattern Multifrontal LU Factorization Method
for Sparse Matrices *
Tech. Report TR94006; Proc. Fifth SIAM Conf. on Applied Linear Algebra, June 1994
Steven M. Hadfield and Timothy A. Davis†
Abstract
The unsymmetric-pattern multifrontal method of Davis and Duff [1] generalizes
earlier multifrontal approaches to LU factorization by removing the assumption of
a symmetric pattern of nonzeros in the sparse matrix. As a result, the underlying
computational structure becomes a directed acyclic graph (DAG) instead of a tree.
This research explores the potential parallelism available in the unsymmetric-pattern
multifrontal method using both unbounded and bounded parallelism models based on
this DAG. The bounded parallelism model is extended to reflect the performance
characteristics of the nCUBE 2 distributed memory multiprocessor to investigate
the achievable parallelism. Finally, a factorization-only version of the method is
implemented on the nCUBE 2, and its achieved parallelism and scalability are evaluated.
1 Unsymmetric-Pattern Multifrontal Method
Multifrontal methods for the LU factorization of sparse matrices [1, 4, 5, 8] decompose the
sparse matrix into a set of overlapping dense submatrices called frontal matrices. Each
frontal matrix contains one or more pivots and is partially factorized according to these
pivots. The updated entries in the unfactorized portion of a frontal matrix (called the
contribution block) are passed and assembled (added) into subsequent frontal matrices.
The assembly directed acyclic graph (DAG) represents frontal matrices as nodes and the
passing of contribution blocks as directed edges. In previous multifrontal methods, the
assumption of a symmetric-pattern matrix caused the assembly DAG to be a tree (or a
forest). The tree structure arises because the contribution block of a frontal matrix can
be completely absorbed into a single subsequent frontal matrix. When the assumption of
symmetry in the pattern is removed, contribution blocks must be fragmented, with the
various pieces passed to different frontal matrices.
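The partial factorization of a frontal matrix and the formation of its contribution block can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions (dense storage as lists of lists, no numerical pivoting), not the authors' implementation.

```python
def partial_factorize(F, npiv):
    """Eliminate the first npiv pivots of a dense frontal matrix F
    (a list of lists), in place and without pivoting for simplicity.
    Rows/columns 0..npiv-1 end up holding the computed parts of L and U;
    the updated unfactorized portion (the Schur complement) is returned
    as the contribution block to be assembled into later frontal matrices."""
    n, m = len(F), len(F[0])
    for k in range(npiv):
        for i in range(k + 1, n):
            F[i][k] /= F[k][k]                    # multiplier (entry of L)
            for j in range(k + 1, m):
                F[i][j] -= F[i][k] * F[k][j]      # rank-1 update
    # contribution block = trailing block after the eliminated pivots
    return [row[npiv:] for row in F[npiv:]]
```

For a 2-by-2 frontal matrix with one pivot, the single returned entry is the Schur complement that would be passed along an assembly DAG edge.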
The assembly DAG also serves as a task graph with the nodes representing the
factorization of individual frontal matrices and the edges representing inter-task data
dependencies. Large-grain parallelism is available among independent nodes in the assembly DAG.
Furthermore, the factorization of the dense frontal matrices provides a second level of
parallelism as there is a great deal of independence and regularity in the factorization for
each pivot. Earlier efforts have shown that exploitation of both levels of parallelism is
necessary for the best performance [2, 3].
*This project is supported by the National Science Foundation (ASC-9111263, DMS-9223088)
†Computer and Information Sciences Department, University of Florida, Gainesville, Florida, USA.
phone: (904) 392-1481, email: smh@cis.ufl.edu and davis@cis.ufl.edu.
In all of the models and the implementation, only the actual numerical factorization is
considered. Use of parallelism in the analysis of the matrix and construction of the assembly
DAG is not addressed.
2 Unbounded Parallelism Models
The unbounded parallelism models (UPMs) explore the amount of theoretical parallelism
available using analytical techniques and assume an unbounded number of available
processors. The underlying model of computation is a multiple instruction, multiple data
(MIMD) parallel random access machine (PRAM). The models are based on the assembly
DAG and weight the nodes (representing frontal matrices) according to their predicted
parallel factorization time. Three models are presented. Model 1 assumes no parallelism
within frontal matrices, with only a single processor assigned to each frontal matrix. Model 2
uses medium-grain parallelism within each frontal matrix, with a processor assigned to each
row of the frontal matrix. Model 3 allows fine-grain parallelism, with a processor assigned
to each entry of a frontal matrix. The assembly of contribution blocks is represented by
edge weights that estimate the time required for assembly. Potential parallelism in the
three models is determined by taking the ratio of all node and edge weights over the node
and edge weights for only the nodes on the heaviest weighted path through the assembly
DAG.
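The speedup ratio just described (total weight over the heaviest weighted path) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the node and edge weight dictionaries are hypothetical inputs.

```python
from functools import lru_cache

def upm_speedup(nodes, edges):
    """Unbounded-parallelism speedup estimate for a weighted assembly DAG.
    nodes: dict name -> node weight (predicted parallel factorization time).
    edges: dict (u, v) -> edge weight (predicted assembly time).
    Speedup = (sum of all node and edge weights) / (heaviest weighted path),
    where a path's weight counts both its nodes and its edges.
    Assumes the edges describe an acyclic graph."""
    succ = {}
    for (u, v), w in edges.items():
        succ.setdefault(u, []).append((v, w))

    @lru_cache(maxsize=None)
    def heaviest(u):
        # heaviest path starting at u, via memoized depth-first search
        return nodes[u] + max((w + heaviest(v) for v, w in succ.get(u, [])),
                              default=0)

    total = sum(nodes.values()) + sum(edges.values())
    return total / max(heaviest(u) for u in nodes)
```

For example, two unit-weight frontal matrices feeding a third with zero-weight edges have total weight 3 and critical path 2, giving a potential speedup of 1.5.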
Four test matrices were analyzed by the unbounded parallelism models. Their
characteristics are shown in Table 1. The asymmetry statistic is the ratio of unmatched
off-diagonal entries (a_ij ≠ 0 and a_ji = 0) over the total number of off-diagonal entries.
Therefore, a symmetric matrix has an asymmetry of 0 and a perfectly unsymmetric matrix
has an asymmetry of 1.
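The asymmetry statistic can be computed directly from the nonzero pattern. The following is an illustrative sketch (the pattern is represented as a set of (row, column) index pairs; the function name is our own, not from the paper).

```python
def asymmetry(pattern):
    """Asymmetry statistic of a sparse matrix pattern.
    pattern: set of (i, j) positions of the nonzero entries.
    Returns the ratio of unmatched off-diagonal entries
    (a_ij nonzero but a_ji zero) to all off-diagonal entries,
    so a symmetric pattern scores 0.0 and a perfectly
    unsymmetric pattern scores 1.0."""
    offdiag = [(i, j) for (i, j) in pattern if i != j]
    if not offdiag:
        return 0.0
    unmatched = sum(1 for (i, j) in offdiag if (j, i) not in pattern)
    return unmatched / len(offdiag)
```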
TABLE 1
Test Matrix Characteristics

MATRIX      GRE_1107   GEMAT11   SHERMAN5   RDIST1
Order           1107      4929       3312     4134
Nonzeros        5664     33108      20793    (illegible)
Asymmetry      1.000     0.999      0.261    0.941
The speedup results of the unbounded parallelism models are shown in Table 2. These
results reveal significant available parallelism in Models 2 and 3, where parallelism is
exploited both between and within frontal matrices. Model 1 shows that there is only
minimal parallelism available just between frontal matrices. The question of how many
processors are needed to achieve this parallelism is addressed by the bounded parallelism
models.
TABLE 2
Unbounded Parallelism Models' Speedup Results

MODEL       GRE_1107      GEMAT11      SHERMAN5      RDIST1
Model 1         1.29         4.31          1.19        2.12
Model 2       102.52        97.10        187.58      193.58
Model 3  (illegible)  (illegible)   (illegible)     5946.84
3 Bounded Parallelism Models
The bounded parallelism models use simulation to determine how much speedup can be
achieved on a fixed number of available processors. Only Models 2 and 3 have corresponding
bounded parallelism models. Node and edge weights are assigned as they were in the earlier
models. Processor sets are defined as powers of 2 from 2^0 to 2^16 (1 to 65,536). Frontal
matrix tasks (corresponding to the nodes) are placed in a ready queue when all of their
predecessors (if any) have completed execution. Tasks are scheduled from this ready queue
based on a heaviest node first priority scheme. Allocation of processors to tasks is nominally
one per frontal matrix row in the first model and one per frontal matrix entry in the second
model. If a sufficient number of processors are not available, the smaller available set is
allocated, unless waiting for the next executing task to complete and free its processors will
result in a sooner completion time. The task's parallel execution time is scaled according
to the number of processors allocated.
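The simulation just described can be sketched as an event-driven list scheduler. This is a minimal illustrative reconstruction under stated assumptions: the paper's refinement of waiting for a finishing task when that yields an earlier completion is omitted, and all names are our own.

```python
import heapq

def simulate(nodes, preds, want, P):
    """Bounded-parallelism simulation sketch.
    nodes: dict task -> serial work (node weight).
    preds: dict task -> set of predecessor tasks.
    want:  dict task -> requested processor count (rows or entries).
    P:     total processors available.
    Ready tasks are scheduled heaviest-node-first; a task gets
    alloc = min(requested, free) processors and runs for work/alloc.
    Returns the simulated parallel completion time (makespan)."""
    indeg = {t: len(preds.get(t, set())) for t in nodes}
    succ = {}
    for t, ps in preds.items():
        for p in ps:
            succ.setdefault(p, []).append(t)
    ready = [(-nodes[t], t) for t in nodes if indeg[t] == 0]  # heaviest first
    heapq.heapify(ready)
    running, free, now = [], P, 0.0
    while ready or running:
        while ready and free > 0:
            _, t = heapq.heappop(ready)
            alloc = min(want[t], free)
            free -= alloc
            heapq.heappush(running, (now + nodes[t] / alloc, alloc, t))
        finish, alloc, t = heapq.heappop(running)   # next task to complete
        now, free = finish, free + alloc
        for s in succ.get(t, []):                   # release successors
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(ready, (-nodes[s], s))
    return now
```

With two independent tasks each wanting 2 processors, the makespan halves when the processor set grows from 2 to 4, mirroring how the model's speedup curves rise with processor set size.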
The speedup results of the bounded parallelism model corresponding to UPM Model 2
and UPM Model 3 are shown in Figures 1a and 1b, respectively. These results indicate that
the theoretically available parallelism of the unbounded parallelism models can be achieved
on reasonably sized processor sets. With the medium-grain parallelism of the first model,
almost all of the potential parallelism is achieved with only 512 processors (2^9). The
fine-grain parallelism was achieved with 65,536 processors (2^16).
[Figure: speedup versus log2 of processor set size for the four test matrices.]
FIG. 1. Bounded Parallelism Model Results: (a) Medium Grain, (b) Fine Grain
4 Distributed Memory Performance Model
The distributed memory performance model extends the medium grain bounded parallelism
model to reflect the performance characteristics of the nCUBE 2 multiprocessor. Within
this model, individual frontal matrices are broken up by columns and distributed to the
processors of an assigned subcube in a scattered fashion. The factorization process has
the processor that owns the current pivot column compute the multipliers (column of L)
and broadcast them to the other processors. All of the processors in the frontal matrix's
subcube then update their active columns using these multipliers. This is commonly referred
to as a fanout algorithm [6]. Node weights correspond to a predicted parallel execution
time. This predicted time is based on an analytical model of the fanout algorithm with
specific parameters set according to results from an implementation and evaluation of this
algorithm on the nCUBE 2.
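The fanout scheme can be illustrated with a sequential simulation. This is a sketch under stated assumptions (scattered, i.e. cyclic, column mapping with column j owned by processor j % p; no pivoting; the "broadcast" is just a shared list), not the authors' nCUBE 2 code.

```python
def fanout_lu(A, p):
    """Sequential simulation of a fanout dense LU factorization.
    A: dense matrix as a list of lists, overwritten with L (below the
    diagonal, unit diagonal implicit) and U (on and above the diagonal).
    p: number of simulated processors; column j is owned by processor j % p."""
    n = len(A)
    for k in range(n):
        # owner of pivot column k computes the multipliers (column k of L)
        mult = [A[i][k] / A[k][k] for i in range(k + 1, n)]
        for i in range(k + 1, n):
            A[i][k] = mult[i - k - 1]
        # "broadcast": every processor updates the active columns it owns
        for proc in range(p):
            for j in range(k + 1, n):
                if j % p == proc:
                    for i in range(k + 1, n):
                        A[i][j] -= mult[i - k - 1] * A[k][j]
    return A
```

On real hardware each `proc` iteration runs concurrently after receiving the broadcast multipliers; here the loop over processors only marks the ownership boundaries.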
Assembly of incoming contributions is assumed to be evenly distributed across the
subcube processors. A distinct message is assumed for each edge. The edge weights are
based on the size of the contribution block passed on the edge and the point-to-point
message passing characteristics of the nCUBE 2.
The scheduling of tasks is done based on a critical path priority scheme. The critical
path priority for a particular node (task) is the heaviest weighted path from that node to
an exit node (a node with no outgoing edges). Processor allocations are done as subcubes
with subcube size based on the size of the frontal matrix. Specific subcube assignments are
not tracked and fragmentation of the hypercube is neglected.
The results of this distributed memory performance model are shown in Figure 2. These
results indicate that about 20 percent of the theoretically available parallelism should be
achieved on the nCUBE 2.
[Figure: predicted speedup versus log2 of processor set size for the four test matrices.]
FIG. 2. Distributed Memory Model Results
5 Implementation Results
Within the implementation, factorization of the frontal matrices is done using a pipelined,
fanout algorithm similar to that used in the distributed memory model [7]. With pipelining,
however, the processor owning the next pivot column will update only that column and
compute and send the next pivot's multipliers before doing the rest of the updates for
the current pivot. This allows the communications to be overlapped with computations.
Experiments revealed that about 35 percent of communications are effectively overlapped
by using pipelining. The Level 1 Basic Linear Algebra Subprograms (BLAS1) are used
for the bulk of the computations required for factorization. These routines are about four
times faster than the C-language code upon which the distributed memory model is based.
While the inclusion of pipelining improved parallelism, use of the BLAS1 routines reduced
the ratio of computations to communications and the combined effect was to reduce the
parallelism achieved by frontal matrix factorization. (Of course, overall execution times
were significantly improved.)
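The pipelined reordering can be made concrete with a sequential sketch. Only the operation order differs from the plain fanout algorithm above, so the numerical result is identical; this is an illustration of the technique, with the communication itself elided, not the authors' implementation.

```python
def pipelined_fanout_lu(A):
    """Sequential sketch of pipelined fanout LU (no pivoting).
    For each pivot k, the owner of column k+1 updates that column and
    forms the *next* pivot's multipliers first (so the send can overlap
    computation), and only then finishes the remaining updates for
    pivot k. A is overwritten with L (below diagonal) and U."""
    n = len(A)
    if n == 0:
        return A
    mult = [A[i][0] / A[0][0] for i in range(1, n)]  # first pivot's multipliers
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i][k] = mult[i - k - 1]                # store column k of L
        # step 1: update only the next pivot column and form its multipliers
        for i in range(k + 1, n):
            A[i][k + 1] -= mult[i - k - 1] * A[k][k + 1]
        next_mult = ([A[i][k + 1] / A[k + 1][k + 1] for i in range(k + 2, n)]
                     if k + 1 < n - 1 else [])
        # step 2: finish the remaining updates for pivot k
        # (on real hardware this overlaps the broadcast of next_mult)
        for j in range(k + 2, n):
            for i in range(k + 1, n):
                A[i][j] -= mult[i - k - 1] * A[k][j]
        mult = next_mult
    return A
```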
Task scheduling for the implementation was done using critical path priorities.
Processor allocation was done using a proportional scheme that favored inter-frontal-matrix
parallelism (between nodes) over intra-frontal-matrix parallelism (within nodes). Specific
subcubes were assigned to maximize overlapping of communicating frontal matrix tasks.
Subcube management was done with a variant of the binary buddy system. Frontal matrix
columns were assigned to processors in either a scattered or blocked format depending upon
which produced the best overall execution time.
The speedups achieved by this implementation are shown in Figure 3a. Typically, they
are less than those predicted by the distributed memory performance model, but this is
mainly attributed to the effects of using the faster BLAS1 routines. In order to assess the
scalability of the implementation, the maximum memory usage for any single processing
node was tracked and is reported for two of the test matrices in Figure 3b. The dashed
lines show the corresponding perfect scalability curves.
[Figure: (a) achieved speedup and (b) maximum per-node memory usage, each versus log2 of processor set size; dashed lines mark perfect speedup and the perfect scalability curves.]
FIG. 3. Implementation Results: (a) Speedups Achieved, (b) Scalability
6 Conclusions
The unsymmetric-pattern multifrontal method for the LU factorization of sparse matrices
has significant potential parallelism that can be achieved on realistic processor sets.
References
[1] T. A. Davis and I. S. Duff, An unsymmetric-pattern multifrontal method for parallel sparse
LU factorization, Tech. Rep. TR93018, Computer and Information Sciences Department,
University of Florida, Gainesville, FL, 1993.
[2] I. S. Duff, Parallel implementation of multifrontal schemes, Parallel Computing, 3 (1986),
pp. 193-204.
[3] I. S. Duff and L. S. Johnsson, Node orderings and concurrency in structurally symmetric sparse
problems, in Parallel Supercomputing: Methods, Algorithms, and Applications, John Wiley
and Sons Ltd., 1989.
[4] I. S. Duff and J. K. Reid, The multifrontal solution of indefinite sparse symmetric linear
systems, ACM Transactions on Mathematical Software, 9 (1983), pp. 302-325.
[5] ———, The multifrontal solution of unsymmetric sets of linear equations, SIAM Journal on
Scientific and Statistical Computing, 5 (1984), pp. 633-641.
[6] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh, Parallel algorithms for dense linear algebra
computations, in Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.
[7] A. Geist and M. Heath, Matrix factorization on a hypercube, in Hypercube Multiprocessors
1986, Society for Industrial and Applied Mathematics, 1986.
[8] J. W. H. Liu, The multifrontal method for sparse matrix solution: theory and practice, SIAM
Review, 34 (1992), pp. 82-109.
