A PARALLEL UNSYMMETRICPATTERN MULTIFRONTAL
METHOD*
STEVEN M. HADFIELDt AND TIMOTHY A. DAVISt
Abstract. The sparse LU factorization algorithm by Davis and Duff [4] is the first multifrontal
method that relaxes the assumption of a symmetric pattern matrix. While the algorithm offers
significant performance advantages for unsymmetric pattern matrices, the underlying computational
structure changes from a tree (or forest) to a directed acyclic graph. This paper discusses some
key issues in the parallel implementation of the unsymmetricpattern multifrontal method when the
pivot sequence is known prior to factorization. The algorithm was implemented on the nCUBE 2
distributed memory multiprocessor and achieved performance is reported.
Key words. LU factorization, unsymmetric sparse matrices, multifrontal methods, parallel
algorithms
AMS subject classifications. 65F50, 65F05.
1. Introduction. The multifrontal approach to sparse matrix factorization de
composes a sparse matrix into a collection of smaller dense submatrices (frontal ma
trices) that can partially overlap one another. This overlapping causes data depen
dencies between the frontal matrices. A computational graph structure is built for
the factorization by representing the frontal matrices as nodes and data dependencies
between frontal matrices as edges. Parallelism is available within this computational
structure both between independent frontal matrices and within the partial factor
ization of individual frontal matrices. One or more steps of Gaussian (or Cholesky)
factorization of the sparse matrix is done within each frontal matrix. An important
advantage of the multifrontal method is the avoidance of indirect addressing within
the factorization of the dense frontal matrices.
If the sparse matrix has (or assumes) a symmetric pattern then the resulting
computational structure is a tree (or forest). The independence of subtrees allows
for efficient scheduling. Prior to the work of Davis and Duff [4], all multifrontal
algorithms have assumed a symmetric pattern and several parallel distributed memory
implementations have resulted [10, 14, 15].
When the sparse matrix has a significantly unsymmetric pattern, a more general
multifrontal method can be employed which takes advantage of the unsymmetric
pattern to reduce required computations and expose greater internode parallelism.
This unsymmetricpattern multifrontal approach was developed by Davis and Duff [4]
and has proven to be very competitive with significant potential parallelism [12]. The
unsymmetricpattern multifrontal approach does however result in a computational
structure that is a directed acyclic graph (DAG) instead of a tree (or forest). Figure 1
illustrates the LU factorization of a 7by7 matrix. The example matrix is factorized
with four frontal matrices. The notation aij refers to a fully summed entry in a
pivot row or column of the active submatrix; cij refers to an update from previous
pivots that is not yet added into aij. An update term cij is computed via a Schur
This project is supported the National Science Foundation (ASC9111263, DMS9223088).
t Department of Mathematical Sciences, US Air Force Academy, Colorado, USA. phone: (719)
4724470, email: hadfieldsm%dfms%usafa@dfmail.usafa.af.mil
I Computer and Information Sciences Department, University of Florida, Gainesville, Florida,
USA. phone: (904) 3921481, email: davis@cis.ufl.edu.. Technical Report TR94028, CIS Dept.,
Univ. of Florida, August 5, 1994
S. M. HADFIELD AND T. A. DAVIS
complement within a frontal matrix. In Figure 1, the updates c44 and c45 are passed
to the third frontal matrix. All other updates from the first frontal matrix are passed
to the second frontal matrix.
all 0 0 a14 a15 0 0
a21 a22 a23 0 a25 0 0
a31 a32 a33 0 0 0 a37
A = a41 0 0 a44 a45 a46 0
0 a52 a53 0 a55 a56 0
0 0 0 0 0 a66 a67
a71 a72 0 0 a75 0 a77
all a14 al5 a22 a23 a24 a25 0
a 21 C 25 a32 33 34 35 a 37 44 a45 0
a31 C35 52 55 a54 a55 a56 57 66 67
071 I C7 a72 a C75 a74 a76 a77
a41 I
FIG. 1. Multifrontal factorization of a 7by7 sparse matrix
Scheduling and tasktoprocessor mapping of this generalized DAG structure is
a more challenging problem than in the symmetricpattern multifrontal method, es
pecially for a distributed memory computer. This paper addresses these issues for
the parallel implementation of the unsymmetricpattern multifrontal method on the
nCUBE 2 distributed memory multiprocessor, and reports the achieved performance.
In this paper, allocation refers to the number of processors that factorize a partic
ular frontal matrix. Scheduling is an ordering imposed on the execution. Subcube
assignment is the designation of a particular set of processors for each frontal matrix.
Data assignment is the distribution of the submatrix's entries across the participating
processors.
2. Mechanisms. Parallelism is limited to the numerical factorization. Defini
tion of the computational DAG structure and the scheduling, allocation, and assign
ment are done sequentially in a preprocessing step. The computational DAG structure
is determined from UMFPACK, Davis and Duff's unsymmetric pattern multifrontal
method [3]. This structure is called the assembly DAG, which is then used in schedul
ing, allocation, and assignment.
Once the assembly DAG is defined, we determine the number of processors to
allocate to each frontal matrix. Experiments have shown that use of both inter and
intrafrontal matrix parallelism is necessary for peak performance [12, 7, 5]. As inter
frontal matrix parallelism (between independent frontal matrices) is most efficient,
it is preferred when available. Intrafrontal matrix parallelism (multiple processors
cooperating on a specific frontal matrix) is most useful for larger frontal matrices. The
performance characteristics of intrafrontal parallelism can be accurately modeled via
an analytical formula [11] which significantly aids both the allocation and scheduling
processes.
Allocation of processor sets to specific frontal matrix tasks is done via blocks of
tasks. Specifically, Qeligible is defined as the next block of tasks and contains the
A PARALLEL UNSYMMETRICPATTERN MULTIFRONTAL METHOD
next group of ready (available to execute as predecessors have all completed) tasks.
The size of this set is limited to the number of processors in the system. Each task
in Eligible has its sequential processing time predicted by analytical formulas. This
predicted time is divided by the sum of the predicted times for all tasks in Qeligible and
then multiplied by the number of available processors. This gives the frontal matrix
task's portion of the available processors which is rounded down to the next power of
two (that is, the next full subcube). Once this is done for all tasks in Qeligible, their
outgoing dependencies are considered satisfied and new tasks added to the queue of
ready tasks as necessary. The process continues until all frontal matrices have received
allocations. The initial set of ready tasks is then reconstructed for scheduling.
We schedule the frontal matrix tasks using a critical path priority scheme. For
each frontal matrix, its critical path priority is defined as the weight of the heavi
est weighted path from the frontal matrix to any exit node in the assembly DAG.
Weights are assigned to frontal matrix tasks based on their sequential execution time
estimates. As each frontal matrix tasks is scheduled, it is assigned a specific subcube
of processors. Subcube assignment is done so as to allow the largest possible number
of interfrontal matrix messages to be eliminated by placing that data on the same
processor from which it is sent. We refer to this as overlapping.
Management of the subcubes is done using a binary buddy management system
[13] where subcubes are split by a fixed ordering of the hypercube's dimensions. Alter
native binary buddy management systems are available that can split across several
alternative dimensions but they were not used here [2, 1]. However, the binary buddy
subcube manager is augmented to improve the chances of overlapping data assign
ments. Specifically, when no subcubes of the requested size are available, all available
subcubes of the next larger available size are fragmented. The subcube with the best
overlap potential is selected and the unselected subcubes are coalesced.
Scheduling of available tasks continues until there is no subcube for the next task
or all ready tasks have been scheduled. A simulated run time is maintained using the
task execution times predicted by the analytical models. These models predict task
execution time based on any processor set allocation. When the simulated run time
indicates a task has completed, its subcube is returned to the binary buddy manager
and any outgoing dependencies considered satisfied with new tasks becoming ready
as appropriate.
As each task is scheduled and assigned a subcube, its frontal matrix data is
assigned to the subcube processors. All data assignments are columnoriented and
a number of options are implemented. Overlapping tries to assign columns of data
passed between frontal matrices to the same processor as they were assigned in the
source frontal matrix. This eliminates message passing. When overlapping is not
possible, clustering per children is attempted. The columns of data to be sent to
the current frontal matrix from a predecessor (child) that are assigned to a single
processor in the child are also assigned to a single processor in the current frontal
matrix. While the source and destination processor are different, all the data may
be passed in a single message and the additional costly message setups foregone.
Clustering per parents looks forward from the current frontal matrix to immediate
successors. If a number of columns are to be sent to a particular successor, they can
be assigned to a single processor in the current frontal matrix to improve the chances
for subsequent overlapping and cluster per children.
We examined all possible combinations of overlapping, clustering per children,
and cluster per parents. Each assignment strategy attempts to minimize message
S. M. HADFIELD AND T. A. DAVIS
passing overhead. However, the strategies are subject to the constraint that no single
processor can be assigned more than its proportional share of the frontal matrix's
data. We also examined block and scattered (columnwrap) assignments.
Once scheduling, allocation, and assignment are completed, the resulting sched
ules and assignments are passed to the processors and the parallel numerical factor
ization starts. For each frontal matrix, the assigned processors allocate their portion
of the frontal matrix's storage, assemble (add) in the original values and contributions
from previous frontal matrices, and then partially factorize the frontal matrix. The
partial factorization process is done with a columnoriented, pipeline fanout routine
similar to that found in [9] but generalized to allow any type of column assignment
scheme. Entries in the Schur complement of each frontal matrix are forwarded to
subsequent frontal matrices. Message typing conventions allow these forwarded con
tributions to be selectively read by the receiving frontal matrices. The portions of the
frontal matrix that fall within the L and U factors are retained in separate storage
for use by subsequent triangular solves.
3. Results. The parallel numerical factorization was evaluated using the ma
trices described in Table 1. The table reports the matrix name, order, number of
nonzeros, speedup on 8 and 64 processors, and memory scalability on 8 and 64 pro
cessors All these matrices are highly unsymmetric in pattern. The GEMAT11 is
an electrical power problem [6]; the others are chemical engineering matrices [16].
Speedup was determined by dividing the method's single processor execution time by
its parallel execution time. The memory scalability is defined as Mi/pMp, where Mp
is the largest amount of memory required on any single processor when p processors
are used to solve the problem. These results compare favorably with similar results
using symmetric pattern multifrontal methods on matrices of like order [10, 14, 15].
The memory requirements scale extremely well with additional processors indicating
that increasingly larger problems can be solved by using additional processors.
TABLE 1
Speedup and memory scalability
The competitiveness of the new parallel algorithm was evaluated by comparing
its single processor execution time to that of the MA28B algorithm [8] running on a
single nCUBE 2 processor. Due to memory limitations, these results could only be
obtained for two of the matrices. These results are shown in Table 2 with the parallel
unsymmetric pattern multifrontal code called PRF (run times are in seconds). The
speedup of PRF in this table is the MA28B run time divided by the PRF (p = 64)
run time.
The communication reducing data assignment features of overlapping, cluster
ing per children, and clustering per parents were effective in reducing the amount
of required communications with reductions by as much as 22% for a 64processor
configuration. However, the resulting irregular distribution of columns had adverse
effects on the performance of the partial dense factorization routine. This routine
Speedup Scalability
Matrix n nonzeros 8 64 8 64
RDIST1 4134 94408 5.2 20.2 0.69 0.31
EXTR1 2837 11407 4.5 12.2 0.54 0.15
GEMAT11 4929 33108 5.1 16.8 0.50 0.11
RDIST2 3198 56834 4.4 17.3 0.56 0.19
RDIST3A 2398 61896 5.0 17.8 0.67 0.22
A PARALLEL UNSYMMETRICPATTERN MULTIFRONTAL METHOD
TABLE 2
Competitiveness Results
Run times
Matrix MA28B PRF (p = 1) PRF (p = 64) Speedup
EXTR1 1.61 2.61 0.24 6.8
GEMAT11 3.10 2.52 0.18 17.0
typically accounted for .'i to 95% of execution time while communication between
frontal matrices accounted for only 2% to 10% of execution time. Thus, the best
performance on the nCUBE 2 was obtained using a strictly scattered assignment for
small hypercubes (of dimension 2 or less) and a blocked assignment for the larger
hypercubes. The mechanisms we use to reduce communication would be more im
portant on a parallel computer with slower communications (relative to computation
speed).
4. Conclusion. We have found that the unsymmetric pattern multifrontal method
of Davis and Duff has significant parallel potential that can be effectively exploited
even within a distributed memory environment. The results obtained here are compa
rable to similar results for distributed memory implementations of symmetric pattern
multifrontal methods.
REFERENCES
[1] S. AlBassam and H. ElRewini, Processor allocation for hypercubes, Journal of Parallel and
Distributed Computing, 16 (1992), pp. 394401.
[2] M.S. Chen and K. G. Shin, Processor allocation in an ncube multiprocessor using gray codes,
IEEE Transactions on Computers, C36 (1987), pp. 13961407.
[3] T. A. Davis, Users' guide for the unsymmetricpattern multifrontal package (UMFPACK),
Tech. Rep. TR93020, Computer and Information Sciences Department, University of
Florida, Gainesville, FL, June 1993.
[4] T. A. Davis and I. S. Duff, An unsymmetricpattern multifrontal method for sparse lu factor
ization, SIAM J. Matrix Anal. Appl., (submitted March 1993, under revision.).
[5] I. S. Duff, Parallel implementation of multifrontal schemes, Parallel Computing, 3 (1986),
pp. 193204.
[6] I. S. Duff, R. G. Grimes, and J. G. Lewis, User's guide for the I I sparse matrix
collection (Release I), Tech. Rep. TR/PA/92/86, Computer Science and Systems Division,
Harwell Laboratory, Oxon, U.K., October 1992.
[7] I. S. Duff and L. S. Johnsson, Node orderings and concurrency in ric
sparse problems, in Parallel Supercomputing: Methods, Algorithms, and Applications,
G. F. Carey, ed., John Wiley and Sons Ltd., New York, NY, 1989, pp. 177189.
[8] I. S. Duff and J. K. Reid, Some design features of a sparse matrix code, ACM Trans. Math.
Softw., 5 (1979), pp. 1835.
[9] G. A. Geist and M. Heath, Matrix factorization on a hypercube, in Hypercube Multiprocessors
1986, M. Heath, ed., Society for Industrial and Applied Mathematics, Philadelphia, PA,
1986, pp. 161180.
[10] A. George, M. Heath, J. W.H. Liu, and E. G.Y. Ng, Solution of sparse positive definite
systems on a hypercube, J. Comput. Appl. Math., 27 (1989), pp. 129156.
[11] S. Hadfield, On the LU Factorization of Sequences of I Structured Sparse Matrices
within a Distributed Memory Environment, PhD thesis, University of Florida, Gainesville,
FL, April 1994.
[12] S. Hadfield and T. Davis, Potential and achievable parallelism in the unsymmetricpattern
multifrontal LU factorization method for sparse matrices, in Fifth SIAM Conference on
Applied Linear Algebra, 1994.
[13] K. C. Knowlton, A fast storage allocator, Communications of the ACM, 8 (1965), pp. 623625.
[14] R. Lucas, T. Blank, and J. Tiemann, A parallel solution method for large sparse systems of
equations, IEEE Transactions on ComputerAided Design, CAD6 (1987), pp. 981991.
6 S. M. HADFIELD AND T. A. DAVIS
[15] A. Pothen and C. Sun, A mapping algorithm for parallel sparse cholesky factorization, SIAM
J. Sci. Comput., 14 (1993), pp. 12531257.
[16] S. E. Zitney and M. A. Stadtherr, Supercomputing strategies for the design and analysis of
complex separation systems, Ind. Eng. Chem. Res., 32 (1993), pp. 604612.
