Performance of an unsymmetric-pattern multifrontal method for sparse LU factorization
Timothy A. Davis
301 CSE, Computer and Information Sciences Dept.
University of Florida, Gainesville, FL 32611-2024 USA
(904) 392-1481, email: davis@cis.ufl.edu
Supported in part by NSF ASC-9111263.
To appear: Proceedings of the
7th IMACS Int. Conf. on Computer Methods for Partial Differential Equations,
June 22-24, 1992, Rutgers Univ., New Brunswick, New Jersey, USA
Technical Report TR-92-014*
May 26, 1992
1 Introduction
Solving large, unsymmetric, sparse linear systems on a high-performance parallel supercomputer
is critical to many diverse areas of scientific computing, such as climate modeling, finite-element
methods, structural analysis, computational fluid dynamics, economic modeling, oil reservoir simulation,
circuit simulation, chemical process modeling, and electrical power network modeling [4, 16].
However, conventional sparse matrix factorization algorithms rely heavily on indirect addressing.
This gives them an irregular memory access pattern that limits their performance. For symmetric
problems, methods such as the multifrontal method and the supernodal approach have replaced
irregular operations with dense matrix kernels (see [13] for a survey). The kernel of the multifrontal
method is one or more steps of LU factorization within each square, dense frontal matrix
defined by the nonzero pattern of a pivot row and column. These steps of LU factorization compute
a submatrix of update terms that are held within the frontal matrix until they are assembled
(added) into the frontal matrix of its father in the elimination tree. The elimination tree controls
parallelism across multiple frontal matrices, while dense matrix operations [3] provide parallelism
and vectorization within each frontal matrix. However, this method is based on an assumption
of a symmetric nonzero pattern, and so performs poorly on matrices whose patterns are
very unsymmetric. None of the previous parallel methods for unsymmetric-patterned matrices use
dense matrix kernels [2, 10, 11], with the exception of a multifrontal QR factorization algorithm
[15] (which will be compared later on with the algorithms we develop).
*available via anonymous ftp to cis.ufl.edu as /cis/techreports/tr92/tr92014.ps.Z
This paper presents a new unsymmetric-pattern multifrontal method that takes full advantage
of dense matrix kernels, maintains a purely unsymmetric pattern, and is suitable for both shared-memory
and distributed-memory parallel supercomputers. The frontal matrices are rectangular
instead of square, a directed acyclic task graph replaces the elimination tree, and frontal matrices
are no longer assembled by a single father. The method builds the elimination task graph either
during factorization or in a preprocessing phase. As in the classical multifrontal method, we take
advantage of repetitive structure in the matrix by amalgamating nodes in the elimination task
graph. Thus the algorithm can use dense matrix kernels in its innermost loops, potentially giving
it high performance on parallel supercomputers.
Parallel algorithms for sparse positive definite matrices [13] or for matrices with nearly symmetric
pattern [7] are typically divided into a symbolic factorization phase that computes the pivot
ordering and patterns of the LU factors, and a numerical factorization phase that computes L and
U. Disruptions due to numerical pivoting are either nonexistent (in the positive definite case) or
can be limited to a father and son in the elimination tree (in the symmetric-pattern case). The
disruptions are more severe in the unsymmetric-pattern multifrontal method if the pivot ordering
is not known to be numerically acceptable (unsymmetric symbolic factorization when the pivot
ordering is known is described in [8, 9, 12]). If a single phase is used, the numerical values are available
to the pivot search heuristic. This limits the disruption, but requires a dynamic parallelism
in which the interdependence between parallel tasks is unknown until the pivot search is complete.
2 Formulation of the unsymmetric-pattern multifrontal method
The LU factorization of an n-by-n matrix A into the product of a lower triangular matrix L (with
unit diagonal) times an upper triangular matrix U consists of n major steps. A is transformed into
the product L[k] A[k] U[k] after step k (1 <= k <= n, with A[0] = A, L[n] = L, and U[n] = U). Step k selects a
single pivot entry aij from the active submatrix (the lower-right (n-k+1)-by-(n-k+1)
submatrix of A[k-1]), interchanges the kth pivot row i and kth pivot column j with the leading
row and column of the active submatrix, respectively, and then computes L[k], A[k], and U[k]. An element or general
frontal matrix Ek is a dense, rectangular submatrix that corresponds to the pivot row and column
selected at step k, as well as the update terms from this pivot. If gk consecutive pivots
have pivot row and column patterns identical to those of the first pivot in the frontal matrix Ek, then
their pivot rows and columns can also be represented in Ek. The numerical factorization within
this frontal matrix factorizes the pivot block (the leading gk-by-gk submatrix of Ek), computes gk
columns of L and gk rows of U, and updates the contribution block (the entries not in the first gk rows
or columns of Ek) with a rank-gk Schur complement. These steps can use dense matrix kernels
(Level-3 BLAS if gk > 1) [3]. These update terms are added to A[k-1] to obtain the reduced matrix
A[k+gk-1]. The elements Ek+1 through Ek+gk-1 are not explicitly represented, having been amalgamated
into the single element Ek. Element Ek is referred to as a supernode whose pivot block rank is gk if
gk > 1, or as a simple node if gk = 1. Numerical considerations require pivoting within the pivot
block and might limit the number of pivots found to be less than the maximum, gk. The active
submatrix is not explicitly represented. Instead, it is held as the sum of the original matrix A
and the elements created during factorization.
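The dense work within a single front can be sketched with NumPy. The fragment below is an illustrative sketch only (the function name and the assumption that the pivot block needs no further pivoting are ours, not the paper's): it factorizes the leading gk-by-gk pivot block of a rectangular frontal matrix, computes the corresponding rows of U and columns of L, and applies the rank-gk Schur complement update to the contribution block.

```python
import numpy as np

def factorize_front(F, g):
    """Factorize the leading g-by-g pivot block of a rectangular frontal
    matrix F. Returns a copy holding g rows of U (upper part of the first
    g rows), g columns of L (unit diagonal implicit), and the updated
    contribution block. Sketch only: no pivoting within the block."""
    F = F.copy().astype(float)
    # Eliminate the g pivot rows: after this loop, the first g rows hold
    # the g rows of U, and F[:g, :g] also holds the pivot-block part of L
    # strictly below the diagonal.
    for k in range(g):
        F[k+1:g, k] /= F[k, k]                       # multipliers (column k of L)
        F[k+1:g, k+1:] -= np.outer(F[k+1:g, k], F[k, k+1:])
    # Columns of L below the pivot block: L21 = A21 @ inv(U11), obtained
    # by a triangular solve on the transpose.
    U11 = np.triu(F[:g, :g])
    F[g:, :g] = np.linalg.solve(U11.T, F[g:, :g].T).T
    # Rank-g Schur complement update of the contribution block.
    F[g:, g:] -= F[g:, :g] @ F[:g, g:]
    return F
```

After the call, F[g:, g:] holds the update terms that remain in the element until they are assembled into its fathers.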
The elimination tree forms the basis of many sparse matrix algorithms, describing either the
data flow graph or the control flow graph, or both [14]. It is insufficient for our purposes. Instead,
the LU-graph, GLU = (V, E), is a directed acyclic graph sufficient for describing both the data flow
and control flow for LU factorization with general unamalgamated frontal matrices, where V is the
set of all explicitly represented frontal matrices (|V| = n), and E = EL ∪ EU has directed edges (s, t)
of two types: L-edges in the set EL = {(s, t) | s < t, lts ≠ 0} = {(s, t) | t ∈ Ls} and U-edges in the set
EU = {(s, t) | s < t, ust ≠ 0} = {(s, t) | t ∈ Us}, where Ls is the pattern of column s of L, Us is the pattern of row
s of U, and Cs is the pattern of the contribution block of Es. Node s is an L-son of its L-father node t
if (s, t) ∈ EL and (s, t) ∉ EU. Node s is a U-son of its U-father node t if (s, t) ∉ EL and (s, t) ∈ EU.
Finally, node s is an LU-son of its LU-father node t if (s, t) ∈ EL and (s, t) ∈ EU. Similar directed
acyclic graphs have been used for symbolic factorization of unsymmetric sparse matrices [8, 9, 12]. If the
pattern of the LU factors is symmetric, the contribution block of s is always completely assembled
into its single LU-father node f in the elimination tree. In general, the elimination tree cannot
be used in the unsymmetric-pattern multifrontal method because of the incomplete assembly that
takes place.
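In code, the LU-graph edge sets follow directly from the nonzero patterns of L and U. The sketch below is ours (the function names and the set-of-index-pairs pattern representation are illustrative assumptions): it builds EL and EU and classifies a node's relation to one of its fathers.

```python
def lu_graph_edges(L_pattern, U_pattern):
    """Build the LU-graph edge sets from the nonzero patterns of L and U,
    each given as a set of 0-based (row, col) index pairs.
    An L-edge (s, t) exists when l_ts != 0 (t > s lies in column s of L);
    a U-edge (s, t) exists when u_st != 0 (t > s lies in row s of U)."""
    EL = {(c, r) for (r, c) in L_pattern if c < r}   # l_ts != 0 gives edge (s=c, t=r)
    EU = {(r, c) for (r, c) in U_pattern if r < c}   # u_st != 0 gives edge (s=r, t=c)
    return EL, EU

def son_type(s, t, EL, EU):
    """Classify node s with respect to node t: L-son, U-son, or LU-son."""
    in_L, in_U = (s, t) in EL, (s, t) in EU
    if in_L and in_U:
        return "LU-son"
    if in_L:
        return "L-son"
    if in_U:
        return "U-son"
    return None
```

When the pattern of the factors is symmetric, EL and EU coincide, every node has a single LU-father, and the graph collapses to the elimination tree.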
Each edge (s, t) in the LU-graph represents the assembly of only a single row or column from
Es to Et. A reduced data flow graph, Gdata = (V, Edata), and a reduced control flow graph, Gcontrol =
(V, Econtrol), can also be used to describe the data flow and control flow. Node amalgamation will
also remove some of these edges. Edge reductions decrease memory requirements, and simplify
amalgamation and degree update. These reductions are similar to those used by Eisenstat, Liu,
and Gilbert [8, 9, 12], except that additional edges are removed during degree update. These two
graphs are not necessarily minimal. Transitive reduction could be applied to obtain a minimal
control flow graph, for example, but this would introduce a significant computational overhead.
However, these edge removals greatly reduce the graphs from the LU-graph. For example, one
typical matrix had 529,892 edges in GLU, but only 32,842 edges in Gdata and a similarly reduced
count in Gcontrol (the lower bound was |V| = 3,380). Both graphs are identical to the elimination tree
if the pattern of the LU factors is symmetric. In the sequential prototype of the analysis-factor
algorithm, the Gdata graph is represented as contiguous lists of edges, one for each row (called an
L-son list) and column (called a U-son list) in the active submatrix. The lists are used for the pivot
search, assembly, and amalgamation procedures. Edges are pruned during each of these phases.
3 Algorithms
The analysis-only algorithm (symbolic factorization) finds the pivot ordering, computes the pattern
of the LU factors, performs amalgamation, and determines the reduced graphs. The factor-only
algorithm factorizes A into LU using the patterns and graphs computed in either the analysis-only
or analysis-factor algorithms. This paper focuses on the analysis-factor algorithm, which performs
the symbolic and numerical factorizations in a single phase. A single phase is more typical of
conventional factorization algorithms for unsymmetric sparse matrices. The advantage is that the
numerical values are available during the pivot search. No pivot preordering is assumed. Pivots
are chosen as the algorithm progresses via some sparsity-preserving and numerical criteria. The
disadvantage of this approach is the lack of precomputed data flow and control flow graphs to guide
the construction of new elements, the assembly, and the exploitation of parallelism. Instead, the
algorithm must compute the graphs and the patterns of L and U during factorization. With a
parallel pivot search, these symbolic computations must also be done in parallel.
Two sequential prototypes of the analysis-factor algorithm have been implemented. They differ
in the way they amalgamate nodes in the elimination graph. The upward-looking approach (AF-up)
can be outlined as follows. Initially, k = 1.

1. The pivot search finds the pivot aij in the active submatrix and interchanges rows i and k, and columns
j and k. The pivot row and column define the patterns Uk and Lk, respectively. The L-sons, U-sons, and
LU-sons of node k are now located in L-son list k and U-son list k.

2. Perform the degree update, and try to extend the frontal matrix to include near superrows and
supercolumns (rows and columns whose patterns can be considered identical to those of the first
pivot row and column). The search is limited to pivot entries lying in the current element. It
"looks upward" because it considers the amalgamation of a single son k with one or more of
its (potential) fathers.

3. Create Ek, performing garbage collection if necessary.

4. Assemble the sons of node k into Ek. Deallocate any elements whose contributions have been
completely assembled. Let nrow and ncol be the number of rows and columns in the superrow
i and supercolumn j, and let gk = min(nrow, ncol).

5. Perform up to gk steps of numerical factorization within the front. Set gk to the number of
pivots actually found.

6. Update the L-son and U-son lists. Increment k by gk and repeat until the matrix is factorized.
The pivot search is based on Markowitz' strategy [4], which selects the pivot with the minimum upper
bound on fill-in. The rows and columns of the active matrix are not held explicitly; rather, they
are held as a set of contribution blocks and entries from the original matrix A. To avoid expensive
scans, only upper and lower bounds on the degree of each row and column are computed. When,
and if, the true degree is calculated, the two bounds are set equal to the true degree. The pivot
must also satisfy the numerical threshold partial-pivoting criterion [4]. The approximate degree
update finds the upper and lower bounds of the degrees of each row i ∈ Lk and column j ∈ Uk
by scanning their L-son and U-son lists, respectively. The new lower bound on the row degree, for
example, cannot be smaller than the upper bound on fill-in in row i or the maximum number of
unassembled columns of each contribution block affecting row i (these can be found in L-son list
i). The new upper bound on the row degree can be computed in a similar way. Only a short scan of
each L-son list and U-son list in the affected rows and columns suffices. The lists are kept short via
edge removal. However, at the cost of an additional scan of either the affected L-son or U-son lists,
more accurate degree bounds can be computed. The external row degree of a contribution block
Em (m < k) is the number of unassembled columns of Em that do not overlap with the columns
in Ek. The internal degree can be found by counting how many times Em appears in the U-son lists
of the columns affected by Ek. The external degree is the number of unassembled columns in Em
minus the internal degree. The lower bound on the degree of row i, then, is the number of columns in the
contribution block of Ek plus the maximum external degree of each contribution block affecting
row i. Our experiments with the approximate degree update show a moderate, acceptable increase
in fill-in in exchange for a drastic reduction in run time.
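The combined sparsity and numerical test can be sketched on a dense stand-in for the active submatrix. The function name and the default threshold u = 0.1 below are illustrative assumptions; the actual algorithm works with degree bounds, L-son and U-son lists, and contribution blocks rather than an explicit matrix.

```python
import numpy as np

def markowitz_pivot(A, u=0.1):
    """Select a pivot (i, j) minimizing the Markowitz cost (r_i - 1)*(c_j - 1),
    an upper bound on the fill-in the pivot can cause, subject to the
    threshold partial-pivoting test |a_ij| >= u * max_k |a_kj|.
    Dense sketch of the sparse search."""
    n = A.shape[0]
    nz = A != 0
    r = nz.sum(axis=1)                   # nonzero count of each row
    c = nz.sum(axis=0)                   # nonzero count of each column
    colmax = np.abs(A).max(axis=0)       # largest magnitude in each column
    best, best_cost = None, None
    for i in range(n):
        for j in range(n):
            if nz[i, j] and abs(A[i, j]) >= u * colmax[j]:
                cost = (r[i] - 1) * (c[j] - 1)
                if best_cost is None or cost < best_cost:
                    best, best_cost = (i, j), cost
    return best
```

Note how the threshold test can reject an entry with the smallest Markowitz cost, forcing the search to accept a sparsity-wise worse but numerically safer pivot.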
The upward-looking algorithm can sometimes lead to excessive fill-in due to its limited pivot
search. The downward-looking approach (AF-down) can decrease fill-in by replacing the local search
with a global search that looks at the entire active submatrix. The method constructs a set S
of unfactorized frontal matrices, each of which is a candidate for amalgamation. These frontal
matrices have been symbolically factorized, but not numerically factorized. The method "looks
downward" because it considers the amalgamation of a single node f with one or more of its
sons. The pivot search finds a single pivot entry, corresponding to a new proposed node f in the
partially-constructed graph. This node is amalgamated with its unfactorized sons and placed in
S, unless doing so would cause excessive fill-in, in which case the sons that caused amalgamation
to fail are factorized and removed from S. If a proposed node f has one or more unfactorized
sons, the numerical test is only an estimate, since the numerical values of the unfactorized sons
are not available. The numerical criterion is also checked during the numerical factorization, and
unacceptable pivots are rejected. Our algorithm will search for acceptable pivot entries, construct
proposed nodes, amalgamate them with previous nodes in S, and include them in S, all in a highly
parallel manner.
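The accept/reject protocol for a proposed node can be sketched as follows. Everything here is an illustrative assumption (the function name, the callable fill-in estimator, and the merge semantics for accepted sons are ours); it only shows the control flow of placing a proposed node in S and factorizing the sons that caused amalgamation to fail.

```python
def propose_node(S, node, sons, fill_in_of, limit, factorize):
    """Sketch of the AF-down amalgamation step. S is the set of
    symbolically (but not numerically) factorized frontal matrices;
    sons are the unfactorized sons of the proposed node;
    fill_in_of(node, merged_sons) estimates the fill-in caused by
    amalgamating node with merged_sons. Returns the sons that were
    merged into the proposed node."""
    accepted = []
    for son in sons:
        if fill_in_of(node, accepted + [son]) <= limit:
            accepted.append(son)        # son is merged into the proposed node
            S.discard(son)
        else:
            factorize(son)              # this son caused amalgamation to fail
            S.discard(son)
    S.add(node)                         # the proposed node joins S unfactorized
    return accepted
```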
4 Performance
Two prototype sequential versions of the analysis-factor algorithm have been developed. Both
methods exhibit excessive delayed pivoting for some matrices. This will be corrected by modifying
the upward-looking algorithm to update the proposed pivot columns before they are selected. No
delayed pivoting will then occur. However, even with this limitation, the current versions outperform
the MA28 algorithm [6], sequential versions of the classical multifrontal method (Mups)
[1], and the D2 algorithm [2] for some matrices. Eighty-six matrices from the Harwell/Boeing (HB)
collection [5] and our collaborators have been assembled and factorized with all of these methods.
Twenty-seven of these are symmetric positive-definite. Only matrices of order 500 or larger were
considered. The X/* and Z/* matrices are from chemical engineering problems (from S. Zitney
and others [16]); X/m2 is from a PDE. The Hm/* matrices are circuit simulation matrices from
S. Hamm (Motorola). Table 1 lists the results for these matrices obtained on a Cray Y-MP,
sorted by asymmetry (and by order if tied). Each line lists the matrix number, name, whether it
is symmetric positive-definite, the order, the number of nonzeros, and the asymmetry. The asymmetry
is the number of unmatched off-diagonal pairs over the total number of off-diagonal entries (0 is
a symmetric pattern, 1 is completely asymmetric). The run time is the total of the analysis and
factor time, in seconds, and is listed for the AF-down, AF-up, Mups, D2, and MA28 algorithms.
The fastest run time is shown in bold. Figures 1 through 3 show the normalized run time of each
method, for all 86 matrices. The best time of AF-down and AF-up was selected as the time for the
AF algorithm in these figures. The normalized run time is the run time divided by the fastest run
time found for that particular matrix. Thus the fastest method has a normalized run time
of 1.0.
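The normalization is a simple division by the per-matrix minimum. The sketch below (the function name is ours) applies it to the run times reported for the first matrix in Table 1, where D2 is the fastest method:

```python
def normalized_times(times):
    """Normalize a dict of method -> run time (seconds) by the fastest
    entry, so the fastest method has a normalized run time of 1.0."""
    fastest = min(times.values())
    return {method: t / fastest for method, t in times.items()}

# Run times (seconds) for the first matrix in Table 1.
row1 = {"AF-down": 0.139, "AF-up": 0.103, "Mups": 0.066, "D2": 0.044, "MA28": 0.066}
norm = normalized_times(row1)
```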
The new method is faster than Mups, D2 and MA28 for 10 out of 86 matrices. Among these are
nearly all of the large chemical engineering matrices. Also, the method demonstrates a consistent
performance for the entire range of matrices. It usually takes no more than twice the time as
Mups for symmetric matrices, and is even occasionally faster. This is because it takes advantage of
dense matrix kernels, as Mups does. It is faster than Mups for most unsymmetric matrices because
it does not make the symmetricpattern assumption. The new method also avoids the worstcase
behavior of D2 and MA28 for symmetric matrices and for very large matrices with substantial dense
substructure (such as the large chemical engineering problems). Addressing the delayed pivoting
problem will improve the performance and reliability of the method. These results show that the
sequential prototypes of the unsymmetricpattern multifrontal method are competitive algorithms
and merit further study.
Table 1. Matrix statistics and run time.
                      Matrix Statistics                  Run Time (in seconds)
 no. name           spd?      n      nz   asym   AF-down    AF-up     Mups       D2     MA28
  1  HB/662_bus     yes     662    2474  0.000     0.139    0.103    0.066    0.044    0.066
  2  HB/nos6        yes     675    3255  0.000     0.172    0.153    0.090    0.145    0.171
  3  HB/685_bus     yes     685    3249  0.000     0.158    0.111    0.076    0.082    0.094
  4  HB/nos7        yes     729    4617  0.000     0.281    0.386    0.169    0.341        ?
  5  HB/bcsstk19    yes     817       ?  0.000     0.216    0.175    0.100    0.364    0.210
  6  HB/orsirr_2    no      886    5970  0.000     0.327    0.424    0.199    0.379    0.741
  7  HB/gr_30_30    yes     900    7744  0.000         ?        ?    0.139    0.502    0.552
  8  HB/nos2        yes     957    4137  0.000     0.140    0.086    0.063    0.134    0.080
  9  HB/nos3        yes     960   15844  0.000     0.411    0.397    0.169    1.201    1.187
 10  HB/?           no      961    4681  0.000     0.361    0.246    0.136    0.233    0.319
 11  HB/sherman1    no     1000    3750  0.000     0.270    0.204    0.127    0.174    0.303
 12  HB/orsirr_1    no     1030    6858  0.000     0.403    0.537    0.242    0.492    1.019
 13  HB/bcsstk08    yes    1074   12960  0.000     0.845    0.636    0.869    0.775    1.604
 14  HB/bcsstk09    yes    1083   18437  0.000     0.542    0.736    0.387    1.826    3.335
 15  HB/bcsstk10    yes    1086   22070  0.000     0.374    0.329    0.173    1.254    0.833
 16  HB/bcsstm10    yes    1086   22092  0.000     0.349    0.302    0.174    1.363        ?
 17  HB/sherman4    no     1104    3786  0.000     0.244    0.187    0.141    0.165    0.214
 18  HB/1138_bus    yes    1138    4054  0.000     0.208    0.140    0.101    0.062        ?
 19  HB/bcsstk27    yes    1224   56126  0.000     0.514    0.555    0.338    4.010    3.135
 20  HB/bcsstm27    yes    1224   56126  0.000     0.528        ?    0.339    5.793    3.136
 21  HB/bcsstk11    yes    1473   34241  0.000         ?        ?    0.276        ?    2.039
 22  HB/bcsstk12    yes    1473   34241  0.000         ?        ?    0.277    3.459    2.039
 23  HB/bcsstm12    yes    1473       ?  0.000     0.489    0.470    0.219    1.541    0.961
 24  HB/bcsstk14    yes    1806   63454  0.000     1.944    1.114    0.541    9.968    6.516
 25  HB/bcsstk26    yes    1922   30336  0.000     0.696    0.736    0.307    1.934    2.291
 26  HB/orsreg_1    no     2205   14133  0.000     0.911    2.099    0.584    1.889    4.041
 27  Hm/add20       no     2395   13151  0.000     0.602    0.339    0.337    0.179    0.264
 28  HB/bcsstk23    yes    3134   45178  0.000     8.195        ?    2.199   32.519  205.430
 29  HB/bcsstk24    yes    3562  159910  0.000     2.208        ?    1.183   36.899   66.472
 30  HB/saylr4      no     3564   22316  0.000     1.618    3.270    0.943   13.138    9.835
 31  HB/bcsstk21    yes    3600   26600  0.000     1.740    2.114    0.858   13.682        ?
 32  HB/bcsstk15    yes    3948  117816  0.000    13.907   10.874    3.262   96.093  108.980
 33  HB/bcsstk28    yes    4410  219024  0.000     2.670    4.014    1.499   48.904        ?
 34  HB/bcsstk16    yes    4884  290378  0.000     6.886   13.163    2.979        ?   76.674
 35  Hm/add32       no     4960   19848  0.000     0.917    0.499    0.500    0.266    0.378
 36  HB/sherman3    no     5005   20033  0.000     2.782    2.361    0.904    5.508    6.157
 37  HB/bcsstk18    yes   11948  149090  0.000         ?        ?    4.400        ?  393.710
 38  HB/hwatt_1     no     1856   11360  0.013         ?    1.308    0.445        ?    3.210
 39  HB/hwatt_2     no     1856   11550  0.020     1.370    1.302    0.461    2.623    3.306
 40  Hm/memplus     no    17758   99147  0.021     6.787    3.510    4.659    1.305    2.563
 41  HB/jpwh_991    no      991    6027  0.064     0.500    0.438    0.224    0.425    1.372
 42  HB/lns_3937    no     3937       ?  0.150     4.788    3.358    0.935    6.075   19.277
 43  HB/lnsp3937    no     3937       ?  0.150     5.224    3.311    0.923    7.527   17.483
Table 1, continued. Matrix statistics and run time.
                      Matrix Statistics                  Run Time (in seconds)
 no. name           spd?      n      nz   asym   AF-down    AF-up     Mups       D2     MA28
 44  HB/nnc666      no      666    4032  0.179         ?    0.182    0.092    0.152    0.377
 45  HB/lns_511     no      511    2796  0.201     0.173    0.156    0.084    0.081    0.189
 46  HB/lns_511c    no      511    2796  0.201     0.177    0.163    0.084    0.072    0.206
 47  HB/pores_3     no      532    3474      ?     0.163    0.117    0.060    0.124    0.116
 48  HB/sherman5    no     3312   20793  0.261     1.314    1.085    0.735    2.199    5.434
 49  HB/mc_fe       no      765   24382  0.301     0.560    0.387    0.403    0.723        ?
 50  HB/fs_541_4    no      541    4273  0.317     0.180    0.170    0.211    0.123    0.228
 51  HB/fs_541_1    no      541    4282  0.317     0.188    0.158    0.211    0.130    0.196
 52  HB/fs_541_2    no      541    4282  0.317     0.181    0.172    0.211    0.148    0.228
 53  HB/fs_541_3    no      541    4282  0.317     0.188    0.173    0.211    0.158    0.227
 54  HB/sherman2    no     1080   23094  0.329     0.918        ?    0.411    1.397    8.262
 55  HB/fs_760_1    no      760       ?  0.354     0.336    0.370    0.163    0.218    1.036
 56  HB/fs_760_3    no      760       ?  0.354     0.336    0.370    0.163    0.218    1.036
 57  HB/pores_2     no     1224    9613  0.388         ?    0.536    0.365    0.375    1.107
 58  HB/fs_680_3    no      680    2471  0.439     0.081    0.065    0.076    0.049    0.059
 59  HB/fs_680_2    no      680    2424  0.448     0.079    0.065    0.076    0.053    0.063
 60  HB/steam2      no      600    5660  0.451     0.279    0.266    0.116    0.243        ?
 61  X/m2           no     8641  102449  0.508    31.595   14.975   33.401   93.621  362.766
 62  Z/rdist3a      no     2398   61896  0.860     1.970    1.404    1.512    9.864   66.584
 63  Z/rdist1       no     4134   94408  0.941     3.089    1.811    2.365   33.013        ?
 64  Z/radfr1       no     1048   13299  0.946     0.373    0.240    0.231    0.787    2.668
 65  Z/rdist2       no     3198   56834      ?     1.388    0.861    1.165    7.504  117.395
 66  HB/west0989    no      989    3518  0.982     0.174    0.130    0.126    0.064    0.075
 67  HB/mahindas    no     1258    7682  0.983     0.381    0.218    0.803    0.096    0.156
 68  HB/bp_1600     no      822    4841  0.989     0.123    0.104        ?    0.088    0.082
 69  HB/bp_1400     no      822       ?  0.990     0.136    0.119    0.264    0.089    0.087
 70  HB/bp_1000     no      822    4661  0.991     0.117    0.095    0.272    0.079    0.073
 71  HB/bp_1200     no      822       ?  0.991     0.119    0.100        ?    0.088    0.080
 72  HB/bp_0        no      822    3276  0.991     0.020    0.020    0.148    0.031    0.025
 73  HB/bp_200      no      822    3802  0.994     0.048    0.042    0.194    0.047    0.039
 74  HB/bp_600      no      822    4172  0.994         ?    0.074    0.234    0.069    0.059
 75  HB/west0655    no      655    2808  0.994     0.151    0.114    0.122    0.058    0.082
 76  HB/bp_400      no      822    4028      ?     0.074    0.063    0.213    0.058    0.050
 77  HB/bp_800      no      822    4534      ?     0.106    0.091    0.250    0.074    0.065
 78  HB/west2021    no     2021    7310      ?     0.379        ?    0.264    0.181    0.162
 79  HB/gemat12     no     4929   33044  0.999     1.016    0.771    0.471    1.245    0.831
 80  HB/gemat11     no     4929   33108  0.999     0.919    0.716    0.449    0.961    0.756
 81  HB/west1505    no     1505    5414  0.999     0.277    0.204    0.194    0.117    0.120
 82  HB/gre_512     no      512    1976  1.000     0.209    0.182    0.121    0.064    0.263
 83  HB/shl_0       no      663       ?  1.000     0.013    0.013    0.113    0.019    0.016
 84  HB/shl_200     no      663    1726  1.000     0.014    0.014    0.120    0.019    0.016
 85  HB/shl_400     no      663    1712  1.000     0.014    0.014    0.121    0.020    0.016
 86  HB/gre_1107    no     1107    5664  1.000         ?    0.573    0.501    0.294    1.146
Figure 1: AF (solid) and Mups (dotted); normalized run time versus matrix number.
Figure 2: AF (solid) and MA28 (dotted); normalized run time versus matrix number.
Figure 3: AF (solid) and D2 (dotted); normalized run time versus matrix number.
5 Summary
The objective of this research is to produce a suite of high-performance algorithms for the LU
factorization of large sparse unsymmetric matrices on parallel shared-memory and distributed-memory
supercomputers. These matrices arise in many problems in science and engineering, including
structural analysis, computational fluid dynamics, economic modeling, chemical plant modeling, oil
reservoir simulation, circuit and device simulation, and electric power network modeling. To accomplish
this objective, we will develop highly concurrent algorithms to take advantage of both MIMD
and SIMD parallelism. Sequential prototypes have already demonstrated competitive performance
when compared with previous methods. Following the initial research phase, all algorithms based
on the unsymmetric-pattern multifrontal method will be put into a usable form and provided to
the academic and industrial research communities.
References
[1] P. R. Amestoy and I. S. Duff. Vectorization of a multiprocessor multifrontal code. Int. J.
Supercomputer Appl., 3(3):41-59, 1989.
[2] T. A. Davis and P. C. Yew. A nondeterministic parallel algorithm for general unsymmetric
sparse LU factorization. SIAM J. Matrix Anal. Appl., 11(3):383-402, 1990.
[3] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw., 16(1):1-17, 1990.
[4] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. London:
Oxford Univ. Press, 1986.
[5] I. S. Duff, R. G. Grimes, and J. G. Lewis. Sparse matrix test problems. ACM Trans. Math.
Softw., 15:1-14, 1989.
[6] I. S. Duff and J. K. Reid. Some design features of a sparse matrix code. ACM Trans. Math.
Softw., 5(1):18-35, 1979.
[7] I. S. Duff and J. K. Reid. The multifrontal solution of unsymmetric sets of linear equations.
SIAM J. Sci. Statist. Comput., 5(3):633-641, 1984.
[8] S. C. Eisenstat and J. W. H. Liu. Exploiting structural symmetry in unsymmetric sparse
symbolic factorization. Technical Report CS-90-12, Dept. of Computer Sci., York Univ., North
York, Ontario, Nov. 1990.
[9] S. C. Eisenstat and J. W. H. Liu. Exploiting structural symmetry in a sparse partial pivoting
code. Technical Report CS-92-01, Dept. of Computer Sci., York Univ., North York, Ontario,
Feb. 1992.
[10] A. George and E. Ng. Parallel sparse Gaussian elimination with partial pivoting. Technical
report, Oak Ridge National Laboratory.
[11] J. R. Gilbert. An efficient parallel sparse partial pivoting algorithm. Technical report,
Center for Computer Science, Chr. Michelsen Institute, Bergen, Norway.
[12] J. R. Gilbert and J. W. H. Liu. Elimination structures for unsymmetric sparse LU factors.
Technical Report CS-90-11, Dept. of Computer Sci., York Univ., North York, Ontario, Feb.
1990.
[13] M. T. Heath, E. Ng, and B. W. Peyton. Parallel algorithms for sparse linear systems. SIAM
Review, 33(3):420-460, 1991.
[14] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal.
Appl., 11(1):134-172, 1990.
[15] P. Matstoms. Sparse QR factorization in MATLAB. Technical Report LiTH-MAT-R-1992-05,
Dept. of Mathematics, Linköping Univ., Linköping, Sweden, March 1992.
[16] S. E. Zitney and M. A. Stadtherr. A frontal algorithm for equation-based chemical process
flowsheeting on vector and parallel computers. In Proc. AIChE Annual Meeting, Washington,
DC.
