A Combined Unifrontal/Multifrontal Method for
Unsymmetric Sparse Matrices *
Tech. Report TI;'l Ilii. Proc. Fifth SIAM Conf. on Applied Linear Algebra, June 1994
Timothy A. Davist
Abstract
In the multifrontal method, each frontal matrix must hold all of its pivot rows
and columns at one time. Moving data between frontal matrices is costly, but
the method can handle arbitrary fillreducing orderings. In the unifrontal method,
pivot rows and columns are shifted out of the frontal matrix as the factorization
proceeds. Data movement is simpler, but higher fillin can result. We consider a
hybrid unifrontal/multifrontal algorithm. Fillreducing orderings can still be applied,
but data movement is reduced by allowing pivot rows and columns to be shifted into
and out of each frontal matrix (just as in the unifrontal method). Performance results
on a Cray YMP supercomputer are presented.
1 Unifrontal and Multifrontal methods
In the unifrontal method [5, 11], the entire matrix is factorized within a single frontal
matrix. Rows and columns of the active submatrix are shifted into the frontal matrix,
and pivot rows and columns are shifted out and stored in the LU factors. In the context
of the elimination graph model [7], subsequent pivot nodes are selected only within the
single clique. Pivot nodes are dropped from the clique as they are factorized. The method
works well for envelopelimited matrices. The overhead is low: no assembly operations are
needed between frontal matrices, and a degree update phase is not required. However, for
some matrices the single frontal matrix is large, and an unacceptable level of fillin occurs.
It is inherently sequential (except for the finegrain parallelism available within the single
frontal matrix). The unifrontal method efficiently factorizes a band matrix of bandwidth b
with a single (b + 1)by(b + 1) frontal matrix.
The multifrontal method [1, 2, 3, 6, 10] avoids the adverse fillin properties of the
unifrontal method by treating different subgraphs of the matrix with different frontal
matrices. It works well for graphs with an overall irregular structure, but with structured
subgraphs. Subsequent pivot nodes do not have to be selected from within the clique of
the first pivot. However, a frontal matrix must hold the entire set of pivot nodes and
their neighbors in order to perform the factorization steps for those pivot nodes. Pivots
are not shifted into or out of the frontal matrix. More than one pivot within a frontal
matrix is desirable, as this allows the use of a dense matrixmultiply in the innermost loop
(the Level3 BLAS [4]). Once the factorization of the active frontal matrix is complete, the
remaining contribution block is assembled (added) into a subsequent frontal matrix. Frontal
*This project is supported the National Science Foundation (ASC9111263, DMS9223088), and by Cray
Research, Inc. through the allocation of supercomputer resources.
tComputer and Information Sciences Department, University of Florida, Gainesville, Florida, USA. (904)
3921481, davis@cis.ufl.edu. CIS tech reports available via anonymous ftp to ftp.cis.ufl.edu:cis/techreports.
2 DAVIS
matrices are related by an assembly tree, which is typically identical to the elimination
tree [9] if each frontal matrix contains only one pivot each. In the unsymmetricpattern
multifrontal method, the tree is replaced with a dag, and a contribution block may be
assembled into more than one subsequent frontal matrix. The multifrontal method uses
n/k frontal matrices, each of size (b + k)by(b + k) to factorize an nbyn band matrix of
bandwidth b, where the innermost loop uses a rankk update (a dense matrixmultiply if
k > 1). There are b2(n/k 1) assembly operations between frontal matrices.
2 Combining the two methods
Suppose we attempt to factorize a matrix with the unifrontal method, but we give it a
smaller working array than it requires. The factorization proceeds from the first pivot, up
to the point at which the front size grows larger than the working array. At this point
the unifrontal method halts. However, the remaining terms in the frontal matrix form a
contribution block. Store the contribution block and deallocate the working array. Select
a new pivot based on a global fillreducing heuristic and start a new frontal matrix with
this starting node, in a newly allocated working array. The contribution block from the
first frontal matrix is eventually assembled into a subsequent frontal matrix. This is the
combined unifrontal/multifrontal method. In comparison with the unifrontal method, the
combined method requires a selection of new starting pivots. This implies either an a
priori symbolic ordering or a pivot search during numerical factorization. The pivot search
heuristic requires the degrees of rows and columns, and thus a degree update phase. The
undesirable ordering methods required for unifrontal methods are avoided. The combined
method factorizes a band matrix of bandwidth b with a single (b + k)by(b + k) frontal
matrix (and thus no assembly operations between frontal matrices), using rankk updates.
We use UMFPACK [2, 3] as a starting point for the combined method, although the
approach can be applied to any multifrontal algorithm. First, the algorithm selects a single
column from a few columns (4, say) of minimal upper bound degree (UMFPACK keeps track
of upper and lower bounds on the degree of each row and column, sometimes collapsing
these bounds to the true degree for a few rows and columns). The patterns and true degrees
of these four columns are found. A pivot row is selected from within the nonzero entries in
the selected column. Suppose the pivot row and column degree are r and c, respectively.
An mbyn working array is allocated, where m = Gc and n = Gr (G is typically 2 to 3,
and is fixed for the entire factorization). The frontal matrix is cbyr.
The following is repeated until the factorization can no longer continue within the
current working array (at which point a new frontal matrix is started).
The candidate pivot row and column are fully assembled into the active frontal matrix.
The approximate degree update phase computes the bounds on all rows and columns
affected by the active frontal matrix, and assembles any previous contribution blocks into
the active frontal matrix (based on the assembly dag). This degree update and assembly
phase is skipped if the candidate pivot does not change the size of the active frontal matrix
(unless this is the first pivot). The corresponding row and column of U and L are computed,
but most of the updates from this pivot are not yet made unless the number of pivots in
the active frontal matrix, k, reaches a certain fixed value (typically 16, to allow for use of
the Level3 BLAS).
The active contribution block is searched for a subsequent pivot. A single candidate
column is selected and any pending updates are applied to that column only. The column is
assembled into a separate work vector, and a pivot row is selected. Suppose the candidate
A COMBINED UNIFRONTAL/MULTIFRONTAL METHOD
(a) UMFPACK working array (b) Combined method working array
FIG. 1. Data structures for the active frontal matrix
pivot row and column have degree r and c (not necessarily the same degrees as the first
pivot). Four conditions apply (UMFPACK treats case 4 as it does case 2):
1. No pivot is found. The matrix is singular. This case is treated as in case (2) below,
except that the factorization will later fail.
2. If r > n or c > m, then factorization can no longer continue within the active frontal
matrix. Any pending updates are applied. The LU factors are stored in a permanent
space. The active contribution block is saved for later assembly.
3. If r < n k and c < m k, then the candidate pivot can fit into the active
frontal matrix without removing the k pivots already stored there. Set k k + 1.
Factorization continues within the active frontal matrix.
4. Otherwise, if n k < r < n or m k < c < m, then the candidate pivot can fit if the
previous k pivots are removed. Any pending updates are applied. The LU factors
are stored in a permanent space. The active contribution block is left in place. Set
k 1 Factorization continues within the active frontal matrix.
Figure 1 illustrates how the working array is used in UMFPACK and the combined
method. L and U are the LU factors, and D is the contribution block. The matrix U2
is U2 with rows in reverse order, and L2 is L2 with columns in reverse order. The arrows
denote how these matrices grow as new pivots are added. When pivots are removed from
the working array in Figure l(b), the contribution block does not need to be repositioned.
3 Performance
The test collection consists of the 86 matrices used in [3]. Each method has a number
of userselectable parameters; these are varied for each matrix, and the fastest time thus
obtained is used as the run time for that method and that matrix. The assumption is
that the user determines a good set of parameters based on a qualitative knowledge of
the problem domain. Optimal parameter settings are similar for matrices within a single
domain. A single processor of a CRAY YMP8E/81_'s6 computer was used (UNICOS
Version 7.0.5, CF77 Version 6.0). Sufficient memory was given so that memory compaction
or garbage collection was minimized. D2 switches to a dense matrix factorization code
when the active submatrix becomes 21' nonzero. UMFPACKv1.0 is available via Netlib
(send email to netlib@ornl.gov with the message send umfpack.shar from misc).
Table 1 divides the matrices into 14 domains, with domain name and number of matrices
listed first. The run times (symbolic and numerical factorization) of the other methods
(UMFPACK, MUPS, D2, and MA28) are pairwise compared with the combined method,
I I'
I T7
,,i/'h
rftk
4 DAVIS
TABLE 1
Pairwise comparison of combined method with respect to the other four, by domain
combined method slower asymmetry distribution avg.
domain # UMF MUPS D2 MA28 spd 0 < .25 < .6 > .6 asym
structural eng. 23 1 9 23 23 0.00
petroleum eng. 2 2 0.02
circuit simul. 4 1 3 3 2 2 0.02
oil reservoir 12 1 7 5 0.14
partial diff eqn 3 1 1 1 2 1 0.17
CFD 4 1 4 2 4 0.18
nuclear 1 1 1 0.18
astrophysics I 1 0.30
chem. kinetics 8 2 2 8 0.36
electrical power 5 3 3 2 3 3 2 0.40
chemical eng. 8 4 4 8 0.96
economics 1 1 1 0.98
linear prog. 12 1 6 8 12 0.99
computer sim. 2 2 2 1.00
TABLE 2
Total run times (seconds), by domain
domain combined UMF MUPS D2 MA28
structural engineering 37.92 54.12 23.46 471.59 927.43
petroleum engineering 0.94 1.09 1.08 4.72 6.48
circuit simulation 3.43 ; 7.14 2.13 4.52
oil reservoir simulation 5.02 6.07 6.15 23.94 ',"1
partial differential equations 23.24 56.74 34.36 73.56 1,. 73
computational fluid dynamics 4.94 6.00 2.33 11.36 39.91
nuclear power 0.13 0.15 0.10 0.15 0.33
astrophysics 0.22 0.33 0.44 0.69 1.58
chemical kinetics 1.13 1.20 1.45 1.04 3.18
electrical power 1.44 1.46 1.40 2.37 1.86
chemical engineering 2.77 4.03 6.46 46.74 155.55
economics 0.21 0.21 0.84 0.09 0.16
linear programming 0.70 0.74 2.69 0.68 0.61
computer simulation 0.51 0.52 0.66 0.34 1.41
and the number of matrices for which each method is faster than the combined method is
listed. Also listed are the number of matrices in each domain that are symmetricpositive
definite, and the distribution of asymmetry (0.0 is a symmetric pattern, 1.0 is completely
asymmetric). The last column lists the average asymmetry for the matrices in each domain.
In Table 2, the run times (in seconds) for each matrix within each domain are added for
each method. The assumption is that each method is used to factorize all the matrices
within a single domain, and the five methods are then compared on their total run time.
The total run times in Table 2 are shown in bold if they are the fastest or within 1:'. of
the fastest total run times for that domain.
A COMBINED UNIFRONTAL/MULTIFRONTAL METHOD
The combined method is faster than all of the other four methods for 43 of the 86 test
matrices. Based on this test collection (no claim is made that it is truly representative),
the combined method is better than or comparable to the other methods in all but six of
the 14 domains. The combined method is nearly always faster than UMFPACK (for 81
out of 86 matrices). For the other 5 matrices, UMFPACK is at most .;. faster than the
combined method, a negligible difference.
MUPS is faster than the combined method for 21 of the 86 matrices. These include
the larger symmetricpositivedefinite matrices (in particular, the structural engineering
matrices) and all four CFD matrices. MUPS is currently being updated for inclusion into
the Harwell Subroutine Library. The version tested here is a prerelease version, and thus
may have lower performance than the final version. In addition, MUPS is a parallel code.
It is being compared with strictly sequential codes in this paper, running on only a single
processor. The theoreticallyachievable parallelism in MUPS and UMFPACK is similar
[8], and good results have been obtained in a preliminary parallel, distributedmemory
factoronly phase of UMFPACK. A parallel analysis+factorize version of UMFPACK or
the combined method is not yet developed. D2 is faster than the combined method for
23 matrices, mostly of which are small or extremely sparse. The D2 algorithm is faster
for the four small chemical engineering matrices (from the Harwell/Boeing collection), but
much slower for the larger matrices in the same domain (from [11]). The MA28 algorithm
has similar behavior as the D2 algorithm: it is faster than the combined method for 18
matrices, in similar domains as the D2 algorithm.
4 Conclusions
We have demonstrated how the advantages of the unifrontal and multifrontal methods can
be combined. The resulting algorithm performs well for a wide range of matrices.
References
[1] P. R. Amestoy and I. S. Duff, Vectorization of a multiprocessor multifrontal code, Int. J.
Supercomputer Appl., 3 (1989), pp. 4159.
[2] T. A. Davis, Users' guide to the unsymmetricpattern multifrontal package (UMFPACK), Tech.
Rep. TR93020, CIS Dept., Univ. of Florida, Gainesville, FL, 1993.
[3] T. A. Davis and I. S. Duff, An unsymmetricpattern multifrontal method for sparse LU
factorization, Tech. Rep. TR93018, CIS Dept., Univ. of Florida, Gainesville, FL, 1993.
[4] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff, A set of level3 basic linear algebra
subprograms, AC\I Trans. Math. Softw., 16 (1990), pp. 117.
[5] I. S. Duff, Design features of a frontal code for solving sparse unsymmetric linear systems
outofcore, SIAM J. Sci. Statis. Comput., 5 (1984), pp. 270280.
[6] I. S. Duff and J. K. Reid, The multifrontal solution of unsymmetric sets of linear equations,
SIAM J. Sci. Statist. Comput., 5 (1984), pp. 633641.
[7] A. George and J. W. H. Liu, Computer Solution of Large Sparse Positive Definite Systems,
Englewood Cliffs, New Jersey: PrenticeHall, 1981.
[8] S. Hadfield and T. A. Davis, Analysis of potential parallel implementations of the unsymmetric
pattern multifrontal method for sparse LU factorization, Tech. Rep. TR92017, CIS Dept.,
Univ. of Florida, Gainesville, FL, June 1992.
[9] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl.,
11 (1990), pp. 134172.
[10] The multifrontal method for sparse matrix solution: theory and practice, SIAM Review,
34 (1992), pp. 82109.
[11] S. E. Zitney and M. A. Stadtherr, Supercomputing strategies for the design and analysis of
complex separation systems, Ind. Eng. Chem. Res., 32 (1993), pp. 604612.
