Unsymmetric-pattern multifrontal
methods for parallel sparse LU
factorization
T. A. Davis*
Computer and Information Sciences Department
University of Florida,
Gainesville, Florida, USA
I. S. Duff†
European Center for Research and Advanced Training
in Scientific Computation (CERFACS)
Toulouse, France
September 1991
Technical Report TR-91-023
Computer and Information Sciences Department
University of Florida
Gainesville, FL, 32611 USA
*Phone: (904) 392-1481. Supported by a postdoctoral grant from CERFACS, September 1989 to December 1990. Continuing research supported by the National Science Foundation (ASC-91…, 8/91). This report is available in PostScript form via anonymous ftp (in binary mode) to cis.ufl.edu as the compressed file /cis/tech-reports/tr91/tr91023.ps.Z.
†Also Rutherford Appleton Laboratory, Chilton, Didcot, Oxon. OX11 0QX England.
Abstract
Sparse matrix factorization algorithms are typically characterized
by irregular memory access patterns that limit their performance on
parallel-vector supercomputers. For symmetric problems, methods
such as the multifrontal method replace irregular operations with
dense matrix kernels. However, no efficient method based primarily
on dense matrix kernels exists for matrices whose pattern is very
unsymmetric. A new unsymmetric-pattern multifrontal method based
on dense matrix kernels is presented. Frontal matrices are rectangular
instead of square, and the elimination tree is replaced with a directed
acyclic graph. As in the classical multifrontal method, advantage is
taken of repetitive structure in the matrix by amalgamating nodes in
the directed acyclic graph, potentially giving it high performance on
parallel-vector supercomputers. Performance of a sequential version
is compared with the classical multifrontal method and a standard
unsymmetric solver on an Alliant FX/80 computer.
Contents
1 Introduction
1.1 Previous work
1.2 Outline of report
2 LU factorization with general frontal matrices
2.1 Example matrix
3 Elimination tree and elimination directed acyclic graphs
4 Reduced data flow and control flow graphs
4.1 Amalgamation
4.2 Representation of the data flow and control flow graphs
5 Analysis-only algorithm
6 Factor-only algorithm
7 Analysis-factor algorithm
7.1 Data structures
7.2 Algorithm
7.2.1 Pivot search
7.2.2 Assembly and dynamic amalgamation
7.2.3 Numerical factorization within the frontal matrix
7.2.4 Partial degree update
7.3 Summary of the analysis-factor algorithm
8 Performance results
9 Summary
10 Acknowledgments
List of Figures
1 Data flow graph fragment for Equation 1
2 Data flow graph fragment for Equation 2
3 Data flow and control flow graph for Equation 3
4 Case 1 edge removal in data flow and control flow graphs
5 Case 3 edge removal in control flow graph only
6 Reduced data flow graph with redundant edge (1,4)
7 Fill-in due to amalgamation between LU-son and LU-father
8 Fill-in due to amalgamation between L-son and L-father
List of Tables
1 Performance comparisons and profile of analysis-factor algorithm
2 Reduced data flow and control flow graphs
1 Introduction
Conventional sparse matrix factorization algorithms rely heavily on indirect
addressing. This gives them an irregular memory access pattern that limits
their performance on typical parallel-vector supercomputers. In contrast, the
multifrontal method is designed with regular memory access behavior in the
innermost loops [14]. Its kernel is one or more steps of LU factorization within
each square, dense frontal matrix defined by the nonzero pattern of a pivot
row and column. These steps of LU factorization compute a submatrix of
update terms that are held within the frontal matrix until they are assembled
(added) into the frontal matrix of its father in the elimination tree. The
elimination tree controls parallelism across multiple frontal matrices, while
dense matrix operations [6] provide parallelism and vectorization within each
frontal matrix. These concepts are discussed by Duff and Reid in [14].
However, this method is based on an assumption of a symmetric nonzero
pattern, and so has a poor performance on matrices whose patterns are very
unsymmetric. If this assumption is not made, the frontal matrices are
rectangular instead of square, a directed acyclic task graph replaces the elimination
tree, and frontal matrices are no longer assembled by a single father.
A new unsymmetric-pattern multifrontal approach respecting these constraints
is presented. It builds the elimination task graph either during factorization
or in a preprocessing phase. As in the classical multifrontal method,
advantage is taken of repetitive structure in the matrix by amalgamating
nodes in the elimination task graph. Thus the algorithm uses dense matrix
kernels in its innermost loops, potentially giving it high performance on
parallel-vector supercomputers.
1.1 Previous work
The multifrontal method of Duff and Reid (MA37) [3, 9, 14] will be referred
to as the classical multifrontal method. It will be described in Section 2
to contrast it with the unsymmetric-pattern multifrontal method proposed
here. Other parallel algorithms for unsymmetric sparse matrices include
the D2 algorithm [5], partial pivoting methods [20, 21, 22, 24], the PSolve
algorithm [4], and others [2, 27]. The D2 algorithm is based on a
non-deterministic parallel pivot search that constructs a set of independent pivots
(m, say) followed by a parallel rank-m update of the active submatrix. Near
the end of the factorization, the active submatrix is considered as a dense
matrix (and factorized with dense matrix kernels). Of all these algorithms for
unsymmetric sparse matrices, the classical multifrontal method takes most
advantage of dense matrix kernels, but is unsuitable when the pattern of the
matrix is very unsymmetric.
Most recently, Gilbert and Liu [23] and Eisenstat and Liu [16] have presented
symbolic factorization algorithms for unsymmetric matrices, assuming
that the pivot ordering is known a priori. The algorithms are based on the
elimination directed acyclic graph (dag) and its reductions, which are similar
to the reduced data flow and control flow graphs presented in this report.
Moreover, we also indicate how our graphs can be applied to the case where
the pivot ordering is not known a priori.
1.2 Outline of report
The following sections discuss issues in developing an unsymmetric-pattern
multifrontal method and in designing algorithms and software to implement
it. Section 2 describes LU factorization in terms of general frontal matrices.
Section 3 shows that the elimination tree is insufficient for the
unsymmetric-pattern multifrontal method. Section 4 presents a reduced data flow graph
and a reduced control flow graph, and discusses their amalgamation and
representation. These two directed acyclic graphs are to the unsymmetric-pattern
multifrontal method as the elimination tree is to the classical multifrontal
method. Section 5 discusses an analysis-only algorithm, which computes
a symbolic factorization when the pivot sequence is not known in advance.
Section 6 presents a factor-only algorithm based on the reduced data
flow and control flow graphs, and highlights several open problems such as
the effects of numerical pivoting. It requires a previous analysis phase, either
from the analysis-only algorithm or the combined analysis-factor algorithm
described in Section 7. The performance of the analysis-factor algorithm
is illustrated in Section 8. Finally, Section 9 summarizes the
unsymmetric-pattern multifrontal method, including open problems and future research.
2 LU factorization with general frontal matrices
The LU factorization of an n-by-n matrix A into the product of a lower
triangular matrix L (with unit diagonal) times an upper triangular matrix
U consists of n major steps. In the outer-product formulation of Gaussian
elimination, A is transformed into the product L^[k] A^[k] U^[k] after step k
(1 ≤ k ≤ n, A^[0] = A, L^[n] = L, U^[n] = U). The active submatrix, A_k, is the lower
right (n−k)-by-(n−k) submatrix of A^[k], and is the portion of the matrix
A^[k] that has still to be reduced after step k has finished. We can write

    L^[k] A^[k] U^[k] = [ L_1  0       ] [ I_k  0   ] [ U_1  U_2     ]
                        [ L_2  I_{n−k} ] [ 0    A_k ] [ 0    I_{n−k} ]

where matrices L_1 and U_1 are k-by-k lower and upper triangular matrices,
respectively, I_k is the k-by-k identity matrix, and I_{n−k} is the (n−k)-by-(n−k)
identity matrix.
When we include pivoting to preserve sparsity and maintain numerical
accuracy, the matrix PAQ (instead of A) is factorized into LU. The permu
tation matrices P and Q define the row and column permutations performed
during factorization or during a preprocessing phase. However, to simpli
fy the notation used to describe the method, permutations will be ignored
without loss of generality.
An entry in row i and column j of a sparse matrix A is a single value aij
that is symbolically represented in the sparse data structure for A. An entry
is typically numerically nonzero, but explicit zero entries might be included
if the pattern represents a class of matrices to which the given A belongs.
Numerical cancellation is ignored during factorization, so entries in A^[k] might
become numerically zero.
Step k selects a single pivot entry a_ij^[k−1] from the submatrix A_{k−1}, and
interchanges row i and column j with the leading row and column of A_{k−1},
respectively (equivalently, the kth row and column of A^[k−1]). Row i and
column j are the kth pivot row and column, respectively. Step k then
computes L^[k], A^[k], and U^[k] from L^[k−1], A^[k−1], and U^[k−1].
For a sparse matrix A, define the row and column structure as the index
pattern of the entries in rows or columns of A, viz.

    Struct(A_i*) = {j | a_ij ≠ 0}
    Struct(A_*j) = {i | a_ij ≠ 0}.

The row degree r_i^[k] (i > k) is the number of entries in row i of A^[k], and the
column degree c_j^[k] (j > k) is the number of entries in column j of A^[k], so
that

    r_i^[k] = |Struct(A_i*^[k])|
    c_j^[k] = |Struct(A_*j^[k])|.
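These definitions can be made concrete with a small sketch. The fragment below (Python, with a dense NumPy array standing in for a sparse data structure; all names are illustrative, not the authors' code) computes Struct and the row and column degrees:

```python
import numpy as np

def struct_row(A, i):
    """Struct(A_i*): the column indices of entries in row i."""
    return {j for j in range(A.shape[1]) if A[i, j] != 0}

def struct_col(A, j):
    """Struct(A_*j): the row indices of entries in column j."""
    return {i for i in range(A.shape[0]) if A[i, j] != 0}

# Row degree r_i = |Struct(A_i*)| and column degree c_j = |Struct(A_*j)|
# for a small example matrix:
A = np.array([[1, 0, 2],
              [0, 3, 0],
              [4, 0, 5]])
r = [len(struct_row(A, i)) for i in range(A.shape[0])]  # [2, 1, 2]
c = [len(struct_col(A, j)) for j in range(A.shape[1])]  # [2, 1, 2]
```

In a real sparse code the structures would of course be stored directly, not scanned from a dense array.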
The LU factorization can be described in terms of general frontal matrices.
An element or general frontal matrix E_k is a dense, rectangular
(c_j^[k−1]-by-r_i^[k−1]) submatrix that corresponds to the pivot a_ij^[k−1] selected at step k.
The columns in E_k are defined by the set U_k, which is the set of column
indices of entries in the pivot row i selected at step k. Similarly, the rows in
E_k are defined by the set L_k, which is the set of row indices of entries in the
pivot column j selected at step k. That is,

    U_k = Struct(A_i*^[k−1])
    L_k = Struct(A_*j^[k−1]).

The sets L_k and U_k are referred to as the row and column pattern, respectively,
of the frontal matrix E_k.
One or more consecutive pivots (g_k of them, for some g_k ≥ 1) may
have the same pivot row and column structure (excluding earlier pivots) as
the first pivot a_ij^[k−1] in the frontal matrix. In this case, their pivot rows and
columns are also in the element E_k. The qth row and column in E_k is the
(k + q − 1)th pivot row and column (1 ≤ q ≤ g_k). Entries not in the first g_k
rows or columns of E_k form the contribution block, D_k, of update terms that
are added to A^[k−1] to obtain the reduced matrix A^[k+g_k−1]. The elements
E_{k+1} to E_{k+g_k−1} are not explicitly represented, having been amalgamated
into the single element E_k. The g_k major steps of LU factorization within a
single element E_k can be performed with dense matrix kernels. Element E_k
is referred to as a supernode with a pivot block of rank g_k if g_k > 1, or as a
simple node if g_k = 1. Two or more rows with identical structure are referred
to as a super-row, and columns with identical column structure are referred
to as a super-column.
The row and column pattern of E_k (L_k and U_k) divide into two subsets:

    L_k = L_1k ∪ L_2k

(where L_1k is the set of g_k pivot rows in E_k) and

    U_k = U_1k ∪ U_2k

(where U_1k is the set of g_k pivot columns in E_k). The sets L_1k, L_2k, U_1k, and
U_2k divide the element E_k into four submatrices F_k, B_k, C_k, and D_k,

                 U_1k   U_2k
    E_k = L_1k [ F_k    B_k ]
          L_2k [ C_k    D_k ]

where the matrix is shown before factorization. The matrices F_k, B_k, and
C_k are fully assembled before the factorization of this frontal matrix begins.
That is, all contributions from previous elements are added into F_k, B_k, and
C_k. However, the submatrix D_k may only hold a partial summation of the
original matrix entries and update terms from previous elements.
The numerical factorization within this frontal matrix computes the LU
factorization of F_k (F_k = L_1k U_1k), computes the block column L_2k of L and
the block row U_2k of U, and updates the contribution block with the Schur
complement

    D_k ← D_k' = D_k − L_2k U_2k,

overwriting D_k with D_k'. Any entries in F_k can be selected as pivots, as
long as they are numerically acceptable. Numerical considerations require
pivoting within the pivot block F_k and might limit the number of pivots to
less than the maximum (in which case g_k is taken to be the actual number
of pivots found in E_k). The matrices L_1k and L_2k define columns k through
k + g_k − 1 of L, and U_1k and U_2k define rows k through k + g_k − 1 of U.
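In a dense setting, eliminating the g_k pivots and forming the Schur complement can be sketched in a few lines. The following Python/NumPy fragment is an illustrative sketch, not the authors' implementation; it performs no numerical pivoting:

```python
import numpy as np

def factor_front(E, g):
    """Eliminate the leading g pivots of a dense rectangular frontal
    matrix E (no numerical pivoting in this sketch).  On return, E holds
    the factors of the pivot block, the block column L2 and block row U2,
    and the Schur complement D' = D - L2 @ U2 in place of the
    contribution block D."""
    for k in range(g):
        E[k+1:, k] /= E[k, k]                               # column k of L
        E[k+1:, k+1:] -= np.outer(E[k+1:, k], E[k, k+1:])   # rank-1 update
    return E

# A 3-by-3 front with one pivot (g = 1):
E = np.array([[2.0, 4.0, 6.0],
              [1.0, 5.0, 7.0],
              [3.0, 8.0, 10.0]])
factor_front(E, 1)
# L2 = [0.5, 1.5], U2 = [4, 6], and the contribution block becomes
# D' = [[5, 7], [8, 10]] - outer(L2, U2) = [[3, 4], [2, 1]]
```

For g > 1 the loop performs g rank-1 updates; a production code would instead use one Level-3 BLAS call on the blocks, as the text describes.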
The lower-right (n−k)-by-(n−k) submatrix of A is referred to as the
active part of A, and is denoted by A_{k+1..n,k+1..n}. The active submatrix A_k is
represented as a sum of elements created during factorization, plus the active
part of A,

    A_k = A_{k+1..n,k+1..n} + Σ_{t=1}^{k} E_t^[k]

(where elements amalgamated into other elements are not included in the
summation). The active part of an element E_t just after step k (denoted as
E_t^[k]) is a submatrix of E_t formed by rows and columns that are nonpivotal
after step k (where 1 ≤ t ≤ k). The row and column pattern of E_t^[k] are
denoted as L_t^[k] and U_t^[k], respectively.
2.1 Example matrix
Consider a 7-by-7 matrix A and its partially factorized form L^[3]A^[3]U^[3]
after three steps of LU factorization. An entry is denoted as x, a zero as a
small dot, and p_k denotes the kth pivot entry. The pivots are assumed to lie
on the diagonal in order. The two elements created are the dense rectangular
matrices E_1 and E_2: element E_1 holds pivot 1 and has row pattern
L_1 = {1, 2, 3, 4, 7}; element E_2 is a supernode holding pivots 2 and 3, with row
pattern L_2 = {2, 3, 5, 7}. The pivot rows and columns are delineated from
the contribution blocks. The active submatrix after step 3 is

    A_3 = A_{4..7,4..7} + E_1^[3] + E_2^[3],

where the active part E_1^[3] has row pattern {4, 7} and column pattern {4, 5},
and the active part E_2^[3] has row pattern {5, 7} and column pattern {4, 5, 7}.

Note that E_3 is not generated, since it lies entirely within E_2. If the entry
a_44 (labeled p_4) is selected at step 4 as the fourth pivot, then the resulting
element E_4 has row pattern {4, 5, 7} and column pattern {4, 5}.

Also note that element E_1 makes a contribution to the pivot row of E_4 which
cannot be assembled into or represented by the intermediate element E_2,
even though E_1 affects portions of E_2. This would not occur if the pattern
of A were symmetric.
3 Elimination tree and elimination directed
acyclic graphs
The elimination tree forms the basis of many sparse matrix algorithms, de
scribing either the data flow graph or the control flow graph, or both [25].
A more general pair of graphs is needed for the unsymmetricpattern multi
frontal method.
A data flow graph is a directed acyclic graph G_data = (V, E_data) that
consists of a set of nodes V ⊆ {1, ..., n} and a set of directed edges E_data.
Without amalgamation, V = {1, ..., n}. With amalgamation, V is the set of
all explicitly represented frontal matrices (V = {1, 2, 4, ...} for the example
in Section 2). The corresponding control flow graph is G_control = (V, E_control).
For multifrontal methods, a node s ∈ V represents the assembly and
factorization of element E_s. The directed edge (s,t) ∈ E_data, s < t, if node t receives
data directly from node s. Similarly, (s,t) ∈ E_control if the execution of node
t must pause at some point and wait for node s to reach a certain point in
its execution (for multifrontal methods, t usually waits until s completes).
The data flow graph is sufficient to describe the control flow dependence,
but some of the edges might be unnecessary. Thus, E_control ⊆ E_data.
A complete directed acyclic graph G_complete = (V, E_complete) is sufficient for
describing both the data flow and control flow for the unsymmetric-pattern
multifrontal method. The complete data flow graph, G_complete-data, and the
complete control flow graph, G_complete-control, are simply equal to G_complete.
E_complete has directed edges (s,t) of two types, E^L_complete and E^U_complete, where

    E^L_complete = {(s,t) | s < t, l_ts ≠ 0} = {(s,t) | t ∈ L_2s}
    E^U_complete = {(s,t) | s < t, u_st ≠ 0} = {(s,t) | t ∈ U_2s}
    E_complete = E^L_complete ∪ E^U_complete
    E^L_complete-control = E^L_complete-data = E^L_complete
    E^U_complete-control = E^U_complete-data = E^U_complete.
Similar directed acyclic graphs have been used for symbolic LU factorization
of unsymmetric sparse matrices [16, 23].
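These two edge sets can be built directly from the row and column patterns of the frontal matrices. A small Python sketch (illustrative names; simple nodes assumed, one pivot per front), applied to patterns matching the 5-by-5 example of matrix (3) later in this section:

```python
def complete_dag(Lpat, Upat):
    """Build the two edge sets of the complete dag from the row pattern
    L_s and column pattern U_s of each frontal matrix.  The nonpivotal
    indices L_2s / U_2s are those greater than s (simple nodes assumed,
    one pivot per front)."""
    EL = {(s, t) for s, rows in Lpat.items() for t in rows if t > s}
    EU = {(s, t) for s, cols in Upat.items() for t in cols if t > s}
    return EL, EU

# Patterns matching the 5-by-5 example of matrix (3) in this section:
Lpat = {1: {1, 2, 3, 5}, 2: {2}, 3: {3}, 4: {4, 5}, 5: {5}}
Upat = {1: {1, 4}, 2: {2, 4}, 3: {3, 4}, 4: {4}, 5: {5}}
EL, EU = complete_dag(Lpat, Upat)
# EL == {(1,2), (1,3), (1,5), (4,5)}, EU == {(1,4), (2,4), (3,4)}
```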
Most parallel sparse matrix factorization algorithms are based on the
elimination tree, T,

    T = (V, E_tree)
    E_tree = {(s,t) | t = father(s)}
    father(s) = min{t | s < t, l_ts ≠ 0}.

If a node a appears in the unique path from node d to the root, then node
a is an ancestor of its descendant node d. The elimination tree is typically
defined only for symmetric-patterned LU factors, with the exception of
partial-pivoting methods [21, 22].
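The father function can be computed in one pass over the column patterns of L. A minimal Python sketch for the symmetric-pattern case, with hypothetical example patterns:

```python
def elimination_tree(Lpat):
    """father(s) = min{t | s < t, l_ts != 0}: the row index of the first
    off-diagonal entry in column s of L.  Roots have no father."""
    father = {}
    for s, rows in Lpat.items():
        below = [t for t in rows if t > s]
        if below:
            father[s] = min(below)
    return father

# Hypothetical column patterns of L for a symmetric-pattern example:
Lpat = {1: {1, 3, 4}, 2: {2, 3}, 3: {3, 4}, 4: {4}}
# father(1) = 3, father(2) = 3, father(3) = 4; node 4 is the root
```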
The classical multifrontal method [13] is one method based on the
elimination tree. It has a similar formulation to the general frontal matrix
formulation described in the previous section, except that the analysis is performed
on the pattern of A + A^T. Frontal matrices are square. The method is usually
divided into two phases: symbolic analysis and numerical factorization. The
symbolic phase finds a suitable pivot ordering using a sparsity-preserving
heuristic such as minimum degree [19], determines the elimination tree, and
finds the patterns of L and U. Each node s in the elimination tree represents
the work associated with element E_s. The elimination tree describes
the large-grain parallelism between the nodes. For both the classical and
unsymmetric-pattern multifrontal methods, the work at a node s will be
referred to as task s, although additional medium-grain parallelism can occur
within each task, using the Level-3 BLAS [6] for supernodes or Level-2 BLAS
[7] for simple nodes.
The column pattern U_t of the frontal matrix E_t is given by the union of
the column pattern U_s^[t−1] of each son s of node t and the structure of row
t of the upper triangular part of A. A node can be amalgamated with one
of its sons if the row pattern of the father is identical to that of the son
(excluding the index of the son's pivot column). That is, if U_t = U_2s for a single son s
of node t. Additional amalgamation may be allowed if the patterns are not
quite identical, in which case extra fill-in occurs.
The numerical factorization phase uses the elimination tree to compute
the LU factorization. The tree acts both as a data flow graph, by guiding
the assembly process and the construction of new elements, and as a control
flow graph, by describing the precedence between nodes. Task t assembles the
frontal matrices E_s^[t−1] of each son node s into E_t. Because of the symmetric-pattern
assumption, the pattern U_t is a superset of the pattern U_s^[t−1] of the
contribution block of each son s. The entire contribution block D_s of a
son can always be assembled into its father, since all the rows and columns
that are affected by D_s are present in E_t. The method takes advantage of the
dense matrix kernels [6, 7] to factorize E_t: the Level-2 BLAS (outer-product)
if g_t = 1 or the Level-3 BLAS if g_t > 1.
The classical multifrontal method is not the only method based on the
elimination tree. In the sparse column-Cholesky factorization of George et al.
[17], the work at node k in the elimination tree is the computation of column
k of L. The work at node k modifies column k with columns corresponding to
a subset of the descendants of node k in the elimination tree. The data flow
graph is not a tree, but it is at least spanned by the control flow graph, which
is simply the elimination tree. This is in contrast to the classical multifrontal
method, in which data is assembled only from the sons of a node.
However, the elimination tree is not appropriate in a multifrontal method
if the frontal matrices have unsymmetric patterns. This is due to the
incomplete assembly that takes place. Consider the interrelationships between
three elements, E_r, E_s, and E_t (r < s < t), described by the following 3-by-3
submatrix of L\U formed from the pivot rows and columns r, s, and t:

    [ p_r   u_rs  u_rt ]
    [ l_sr  p_s   u_st ]
    [ l_tr  l_ts  p_t  ]

Assume that there are other nonzeros in these rows and columns of U and L,
respectively. If, for example, u_rs = l_ts = 0, the interrelationships become

    [ p_r   .     u_rt ]
    [ l_sr  p_s   u_st ]                                    (1)
    [ l_tr  .     p_t  ]

The example given earlier in Section 2.1 fits this case, where r = 1, s = 2,
and t = 4 (except that element E_2 is a supernode with g_2 = 2).
Task r computes update terms for both rows s and t. The row pattern
L_s of E_s, however, does not contain the row index t, so the update terms in
row t of E_r cannot be assembled into E_s. Figure 1 shows the corresponding
fragment of the data flow graph (only nodes r, s, and t). The data flow
edges are labeled with the rows and columns of the updates computed by the
source and sent to the sink. The data flow edge (r,t) is present because row
t of E_r must be assembled directly into E_t. It cannot be assembled into E_s
for later assembly into E_t. The data flow graph shown in Figure 1 is not a
tree, although it would be possible to use an elimination tree as a control
flow graph in this case. The edges of a sufficient control flow graph fragment
would be (r,s) and (s,t). This 3-node elimination tree spans the data flow
graph in Figure 1, as is the case in sparse column-Cholesky.
In general, however, it is not possible to use an elimination tree to describe
the control flow graph. Consider the case in Equation 1 if u_rt = u_st = 0. The
3-by-3 submatrix of L\U becomes

    [ p_r   .     .   ]
    [ l_sr  p_s   .   ]                                     (2)
    [ l_tr  .     p_t ]
In this case, tasks s and t both assemble a contribution from E_r (assuming
that there are other nonzeros in row r), but they may proceed in parallel
after task r completes. The corresponding fragment of the data flow graph
is shown in Figure 2. The control flow graph is not a tree since node r has
more than one father; the parallelism between s and t cannot be described
by a tree rooted at node n.

Figure 1: Data flow graph fragment for Equation 1

Figure 2: Data flow graph fragment for Equation 2
Thus the elimination tree can be used neither as a data flow graph nor
as a control flow graph for a general frontal matrix method. This point
is further illustrated by the following example. The matrix (3) shows the
leading 5-by-5 submatrix of the L\U factors of a larger matrix,

    [ p_1  .    .    x    .   ]
    [ x    p_2  .    x    .   ]
    [ x    .    p_3  x    .   ]                             (3)
    [ .    .    .    p_4  .   ]
    [ x    .    .    x    p_5 ]

Figure 3 shows the corresponding complete graph. The data dependences in
the figure are labeled with the entries in L and U that cause them,

    E^L_complete = {(1,2), (1,3), (1,5), (4,5)}
    E^U_complete = {(1,4), (2,4), (3,4)}.
Figure 3: Data flow and control flow graph for Equation 3
With the seven data dependences shown, the control flow graph must include
at least the edges drawn with solid lines (edges (1,2), (1,3), (2,4), (3,4), and
(4,5)). It is not a tree but a directed acyclic graph. If a control flow graph
were chosen that satisfied the data dependences and yet was still a tree, it
would limit the parallelism. The parallelism between nodes 2 and 3 cannot be
described by a tree. If matrix (3) and the given pivot ordering were presented
to the classical multifrontal method, the entire 5-by-5 submatrix would be
dense and no parallelism would exist between nodes 2 and 3.
4 Reduced data flow and control flow graphs
A frontal matrix will persist until its contribution block is completely
assembled into other frontal matrices. Using the complete graph as the data flow
graph, the contribution blocks are held for the longest time possible before
being completely assembled and discarded, because each edge represents the
assembly of a small amount of data (a single row or column). A simpler data
flow graph will improve the memory requirements of the method. Similarly,
although the complete graph can be used as the control flow graph, removing
redundant control flow edges will simplify amalgamation. Common
simplifications to both the data flow graph and the control flow graph are discussed
first, followed by simplifications applicable to the control flow graph only.
Relationships between adjacent nodes in the control flow graph and data
flow graph are defined as follows. Node s is an L-son of its L-father node
t if (s,t) ∈ E^L and (s,t) ∉ E^U. Node s is a U-son of its U-father node t if
(s,t) ∉ E^L and (s,t) ∈ E^U. Finally, node s is an LU-son of its LU-father node
t if (s,t) ∈ E^L and (s,t) ∈ E^U. The term father refers to any type of father,
and the term son refers to any type of son.
Data flow edges are redundant if the data flow they represent can be
accounted for by other edges in the graph. Redundant edge removal in the
data flow graph is divided into two cases: removal of edges in E^L (case 1),
and removal of edges in E^U (case 2). Case 2 is simply the transpose of case
1, so we only discuss case 1. For case 1 edge removal, consider the lower
triangular part of the submatrix formed from rows and columns r, s, and t
of L\U (r < s < t),

    [ p_r             ]
    [ l_sr  p_s       ]
    [ l_tr  l_ts  p_t ]

For any given entry l_sr in L, element E_r makes a contribution to row s,
which later is assembled into the pivot row s of the element E_s (assuming
r is not a singleton). Fill-in from pivot row r causes the column pattern of
E_r^[s−1] (just before step s) to be a subset of the column pattern of E_s,

    U_r^[s−1] ⊆ U_s  if l_sr ≠ 0.                           (4)

If there are two other entries l_tr and l_ts, then both elements E_r and E_s compute
a contribution to the pivot row t. Because of Equation 4, the contribution
that E_r makes to row t can be assembled into the frontal matrix of E_s without
changing the pattern of E_s. If this assembly is made, the data flow edge
(r,t) ∈ E^L_complete-data is removed. This is shown in the graph in Figure 4. The
new graph, after redundant edge removal via cases 1 and 2, will be referred
to as the reduced data flow graph, and denoted as G_rdata = (V, E_rdata).
The reduced data flow graph G_rdata can be partially characterized as
follows. E_rdata is a subset of all edges

    {(r,s) | (r,s) ∈ E_complete-data, s ≤ firstpair(r)}

where firstpair(r) is the first off-diagonal pair in row r of U and column r
of L,

    firstpair(r) = min{t | l_tr u_rt ≠ 0}.

Some edges can satisfy cases 1 or 2 and precede the first off-diagonal pair
(edge (r,t) in Figure 4, for example). Any given node in the reduced data
flow graph has at most one LU-father (corresponding to the first off-diagonal
pair), in addition to L-fathers and U-fathers corresponding to edges preceding
the first off-diagonal pair.

Figure 4: Case 1 edge removal in data flow and control flow graphs
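Case 1 removal amounts to discarding an L-edge (r,t) whenever a length-two L-path r → s → t exists. A Python sketch of this single case (illustrative, and deliberately weaker than full transitive reduction, which is why an edge like (1,4) in Figure 6 survives):

```python
def case1_reduce(EL):
    """Case 1 edge removal: drop the L-edge (r, t) whenever some
    intermediate node s carries edges (r, s) and (s, t) in E^L, so the
    contribution of E_r to row t can travel through E_s instead.
    A single-case sketch, weaker than full transitive reduction."""
    reduced = set(EL)
    for (r, t) in EL:
        if any((r, s) in EL and (s, t) in EL for s in range(r + 1, t)):
            reduced.discard((r, t))
    return reduced

# The fragment of Figure 4 with r, s, t = 1, 2, 3: edge (1, 3) is removed.
reduced = case1_reduce({(1, 2), (2, 3), (1, 3)})  # == {(1, 2), (2, 3)}
```

Note that on the edge set {(1,2), (2,3), (3,4), (1,4)} this sketch removes nothing, since no single intermediate node covers (1,4), consistent with the redundant-edge example of Figure 6.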
The control flow graph can be simplified using the same cases 1 and 2.
Many other edges, however, are redundant for describing the task precedence.
Additional edge removal in the control flow graph falls into two cases: fill-in
in L (case 3), and fill-in in U (case 4). Case 4 is the transpose of case 3, and
so we only discuss case 3. Consider the following submatrix,

    [ p_r   u_rs       ]
    [ .     p_s        ]
    [ l_tr  l_ts  p_t  ]

Two entries u_rs and l_tr force l_ts to be an entry in L. If l_ts was zero, it
becomes a fill-in entry. If it was already an entry, it still is an entry since
we ignore numerical cancellation. The entry l_ts creates a dependence from
node s to node t in the control flow graph fragment shown in Figure 5, where
r < s < t. Edge (r,t) is redundant for describing the control flow dependence
between tasks r, s, and t and can be removed. The edge (r,t) would also be
removed if transitive reduction [1] were applied to this fragment. The new graph,
after edge removal via all four cases, will be referred to as the reduced control
flow graph, and denoted as G_rcontrol = (V, E_rcontrol). In general, the reduced
control flow graph is not sufficient to describe the data flow dependence.
The reduced control flow graph G_rcontrol can be described as follows. Let
jmin be the column of the first off-diagonal entry in row r of U. Let imin
be the row of the first off-diagonal entry in column r of L. If jmin < imin
then node r has no L-fathers, as shown in the following example submatrix,

    [ p_r   u_rs  .     u_rv ]
    [ .     p_s   .     .    ]                              (5)
    [ l_tr  l_ts  p_t   u_tv ]

where imin = t and jmin = s (the entries l_ts and u_tv are fill-in).

Figure 5: Case 3 edge removal in control flow graph only

All edges in E^L_complete-control starting at r are removed via case 3 edge
removal because of fill-in from the single entry u_{r,jmin}. Some, but not all, of
the edges in E^U_complete-control are also removed via case 4 edge removal. The
edge (r,v) ∈ E^U_complete-control is redundant because of the fill-in entry u_tv and
can be removed via case 4 edge removal for nodes r, t, and v. Thus, any
edge {(r,v) | (r,v) ∈ E^U_complete-control, v > imin} is redundant. The node r
will have U-fathers described by the set {s | u_rs ≠ 0, s < imin}. Node r
will have no LU-fathers, since all edges in E^L_complete-control starting at r are
removed. Similarly, if jmin > imin then node r has L-fathers in the set
{s | l_sr ≠ 0, s < jmin}. If jmin = imin, all but the first off-diagonal pair
are removed via cases 1 and 2, and node r only has a single LU-father, jmin.
Any node has either a single LU-father, or one or more U-fathers, or one or
more L-fathers, but never a combination of these three types.
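The jmin/imin classification above can be stated as a short predicate. A Python sketch with illustrative names (Lcol and Urow are the off-diagonal patterns of column r of L and row r of U; both are assumed nonempty):

```python
def fathers(r, Lcol, Urow):
    """Classify node r's fathers in the reduced control flow graph via
    the jmin/imin characterization.  Lcol: off-diagonal row indices in
    column r of L (t with l_tr != 0); Urow: off-diagonal column indices
    in row r of U (j with u_rj != 0).  Both are assumed nonempty."""
    imin, jmin = min(Lcol), min(Urow)
    if jmin < imin:
        return {'U-fathers': {s for s in Urow if s < imin}}
    if jmin > imin:
        return {'L-fathers': {s for s in Lcol if s < jmin}}
    return {'LU-father': jmin}

# Submatrix (5): u_rs, u_rv in row r and l_tr in column r, with s < t < v.
r, s, t, v = 1, 2, 3, 4
fathers(r, Lcol={t}, Urow={s, v})   # {'U-fathers': {2}}: only u_rs counts
```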
These two graphs are not necessarily minimal. For example, the redundant
edge (1,4) in the graph shown in Figure 6 will be present in the reduced
graphs (where all edges are in E^L). Transitive reduction [1] could be applied
to G_complete-control to obtain a minimal G_rcontrol. Transitive reduction cannot
be applied to G_complete-data. For example, edge (r,t) is required in G_rdata in
Figure 5, but is removed from G_complete-data by transitive reduction.
Transitive reduction could be applied separately to the graphs (V, E^L_complete-data)
and (V, E^U_complete-data), and the results combined to give G_rdata. However,
performing transitive reduction would introduce a significant computational
overhead (specifically, O(n^{log_2 7}), which is much more than the time to
factorize a sparse matrix). Although these graphs are not minimal, the four
cases of edge removal yield a reduced graph with many fewer edges than
the complete graph, without the cost of transitive reduction. Experimental
results showing the level of reduction achieved are given in Table 2 and the
accompanying discussion in Section 8. Eisenstat and Liu [16] describe similar
reductions, which they refer to as partial transitive reductions. These
reductions improve the assembly process, decrease the amount of memory needed
to represent the graphs, simplify amalgamation, and improve the memory
requirements for unassembled contributions. Both the reduced control flow
graph and the reduced data flow graph are identical to the elimination tree
if the pattern of L\U is symmetric.

Figure 6: Reduced data flow graph with redundant edge (1,4)
4.1 Amalgamation
Both the factor-only algorithm (Section 6) and the analysis-factor algorithm
(Section 7) require a representation of the reduced data flow and reduced
control flow graphs. They also require amalgamation to be performed on these
graphs. These operations can either be done by the analysis-only algorithm
(Section 5) or by the analysis-factor algorithm.
A father node f and son node s can be amalgamated into a single supernode
s'. The amalgamated element E_s' has row pattern L_s ∪ L_f and column
pattern U_s ∪ U_f. Extra fill-in due to amalgamation will occur if the pattern
of E_s' is not identical to the pattern of E_s.

If f is an LU-father of s, then extra fill-in can occur in row and column s,
but not in row or column f, nor in the contribution block, because

    U_s^[f−1] ⊆ U_f                                         (6)
    L_s^[f−1] ⊆ L_f                                         (7)

as shown in Figure 7. Fill-in does occur in columns U_f \ U_s^[f−1] in row s, and
in rows L_f \ L_s^[f−1] in column s.

Figure 7: Fill-in due to amalgamation between LU-son and LU-father
If f is only an L-father of s, then the two pivot columns f and s are
unrelated (Equation 6 holds, but not Equation 7), as shown in Figure 8.
Fill-in can occur in columns f and s, and in row s, but not in row f. Fill-in
can also occur in the contribution block of E_s', specifically, in the submatrix
defined by columns U_f \ U_s and rows L_s \ (L_s ∩ L_f) (outlined with a
dashed box in Figure 8). Amalgamation between s and a U-father f is the
transpose of the L-father case.
Extra fillin due to amalgamation has different effects on the symbolic
factorization computed before amalgamation is performed. Fillin in the
pivot rows and columns f and s has only a local effect on the graph. Edges
between f and s are removed, edges that start at f or s now start at s', and
edges that terminate at f or s now terminate at s'.

Figure 8: Fill-in due to amalgamation between an L-son and L-father
(fill-in in the contribution block is outlined with a dashed box)

If there is no fill-in in the contribution block, the patterns of the remaining
factors L and U (in rows and columns f + 1 through n) are unaffected.
However, fill-in in the
contribution block can cause a ripple effect in the remaining factors and their
graphs, requiring a reapplication of the symbolic factorization. Fillin in the
contribution blocks can be avoided by limiting amalgamation so that such
fill-in never occurs. An alternative is to include amalgamation as an integral part of the
symbolic factorization so that the extra fillin due to amalgamation can be
computed during symbolic factorization.
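The amalgamation decision above can be sketched with set-valued patterns. The function below is only an illustration (the names `Ls`, `Us`, `Lf`, `Uf`, and `amalgamation_fillin` are assumptions, not the report's code): it counts the extra fill-in produced by amalgamating a son s into a father f in the two cases discussed, treating each pattern as a Python set that excludes the pivots themselves.

```python
# Hypothetical sketch: extra fill-in caused by amalgamating son s into
# father f, given row patterns (Ls, Lf) and column patterns (Us, Uf).

def amalgamation_fillin(Ls, Us, Lf, Uf, lu_father):
    """Return (fill-in in row/column s, fill-in in the contribution block)."""
    # The supernode s' has row pattern Ls | Lf and column pattern Us | Uf,
    # so row s gains the columns Uf - Us and column s gains the rows Lf - Ls.
    pivot_fill = len(Uf - Us) + len(Lf - Ls)
    if lu_father:
        # LU-father: the contribution block gains no entries (Equations 6-7).
        block_fill = 0
    else:
        # L-father only: fill-in appears in columns Uf - Us of the rows
        # Ls - (Ls & Lf) of the contribution block (the dashed box).
        block_fill = len(Uf - Us) * len(Ls - (Ls & Lf))
    return pivot_fill, block_fill
```

For example, with Ls = {4,5}, Us = {4}, Lf = {5,6}, Uf = {4,7}, the L-father case yields one contribution-block fill-in, while the LU-father case yields none.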
4.2 Representation of the data flow and control flow
graphs
The pattern of L is stored in a column-oriented form as the sets {L_t | t ∈ V},
while the pattern of U is stored in a row-oriented form as the sets {U_t | t ∈ V}.
For the factor-only algorithm, each set is sorted in pivot order.
the patterns of L and U represent the complete graph. The fathers of a
node s in the reduced data flow and reduced control flow graphs are readily
found from the patterns of L and U using the rules given above. Redundant
edges in the complete data flow graph can be flagged by the symbolic phase,
although only entries up to and including the first offdiagonal pair need to
be scanned. However, the numerical phase also needs the sons of a node
in the data flow graph, requiring access to the pattern of L by rows and
to the pattern of U by columns. Scanning the columnoriented form of the
pattern of L and the roworiented form of the pattern of U in this order
is inefficient. One alternative is to list in a separate, static, data structure
the sons of each node in the reduced data flow graph, and their type (Lson,
Uson, or LUson). The additional storage could be a significant overhead, as
much as O(entries in L + entries in U). This alternative is not suitable for
the analysisfactor algorithm, since the pivot order and the patterns of L and
U are not known in advance.
An alternative, dynamic, representation requires less storage than the
static representation. Associated with each pivot entry is a pair of linked
lists, one for L-sons, and the other for U-sons. The lists are empty at the
start of the numerical phase. When element E_r is created, the L-son and
U-son lists of its pivot entries contain the L-sons and U-sons of node r in
G_rdata. When E_r is factorized, a tuple is placed in the L-son list of each
row s in the row pattern L_r of the contribution block. The tuple placed in
L-son list s contains (r, q_s), where s is the q_s-th entry in L_r. The
contribution that E_r makes to the s-th pivot row is located in the q_s-th
row of D_r. Only tuples up to and including the first off-diagonal pair need
to be placed. Operations for the U-son lists are similar.
Consider the following example LU factorization of a 5-by-5 matrix
(x denotes a nonzero entry, . a zero):

    p1  x   .   x   x
    .   p2  .   .   x
    x   x   p3  x   .
    x   x   x   p4  x
    x   .   .   .   p5

After the first element, E1, is factorized, tuples are placed in the L-son and
U-son lists as follows:

    Pivot   L-son list   U-son list
      2                  (1,1)
      3     (1,1)
      4     (1,2)        (1,2)
      5

List 5 is empty since the first off-diagonal pair for the first pivot is u14 and
l41. When element E2 is factorized, the tuple (1,1) is found in U-son list 2.
Thus, E2 has one U-son, E1, whose contributions to pivot column 2 are found
in column 1 of D1. Tuples for E2 are placed in the lists,
    Pivot   L-son list    U-son list
      3     (1,1)(2,1)
      4     (1,2)(2,2)    (1,2)
      5     (2,3)         (2,1)
Case 1 edge removal can be done dynamically using the L-son lists. An edge
is removed by marking its associated tuple. With the following submatrix,

    p_r
    l_sr   p_s
    l_tr   l_ts   p_t

task r places the tuple (r, q_s) in L-son list s, and the tuple (r, q_t) in L-son
list t. Task s assembles contributions from the unmarked tuples in its L-son
list, and scans the L-son list of each node t ∈ L_s. Task s finds a reference
to its L-son node r in list t, marks the tuple, and assembles the corresponding
row t of E_r into E_s. If a reference to r is found in both the L-son and
U-son lists of node s, then s is an LU-father of r. Task s assembles the entire
LU-son E_r into E_s, and removes all tuples placed by task r in the L-son
and U-son lists.
In either case, the edge (r, t) has been deleted from the graph (case 1 edge
removal). The element E_r is deallocated if it no longer contains unassembled
contributions. This is an example of how edge removal can decrease the
amount of working storage required for the frontal matrices. Similarly, case
2 edge removal can be done using the U-son lists. Once assembly is complete,
the lists for node s are discarded.
In the previous example LU factorization of a 5by5 matrix, case 1 edge
removal occurs when element E3 is factorized, with r = 1, s = 3, and t = 4.
Task 3 finds a tuple from one of its Lsons (namely, tuple (1, 2) from node 1)
in Lson list 4. This tuple is marked and the second row of D1 (a contribution
to row 4 of the matrix being factorized) is assembled into E3. Task 3 can
also mark the tuple (2, 2) in Lson list 4, and assemble the second row from
D2 into E3 (case 1 edge removal, with r = 2, s = 3, and t = 4).
The unmarked tuples in all the lists at step k form a dynamic represen-
tation of the unassembled edges in the reduced data flow graph,

    {(s,t) | (s,t) ∈ E_rdata, s < k ≤ t}.

This set of edges is smaller than the set of all edges in E_rdata. With this
second alternative, the symbolic phase described in the next section does not
need to precompute the reduced data flow graph, and less memory is required
to represent it.
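The tuple mechanism of this section can be sketched as follows. This is a simplified, hypothetical rendering (plain Python lists stand in for the linked lists; U-son lists, deallocation, and the first-off-diagonal-pair cutoff are omitted, and every tuple found is marked and "assembled"):

```python
# Sketch of the dynamic L-son lists: when element r is factorized, a tuple
# (r, q) is placed in the L-son list of each row s of its contribution
# block, where s is the q-th row of that block. Marking a tuple during
# assembly corresponds to case 1 edge removal.

from collections import defaultdict

lson = defaultdict(list)          # row index -> list of [r, q, marked]

def place_tuples(r, contrib_rows):
    """Record the rows of D_r that are not yet assembled."""
    for q, s in enumerate(contrib_rows, start=1):
        lson[s].append([r, q, False])

def assemble_into(s, Ls):
    """Task s marks every unmarked tuple found in the L-son lists of its
    pivot row s and of the rows t in its pattern Ls, and returns the
    (element, row) pairs it assembled."""
    assembled = []
    for t in [s] + sorted(Ls):
        for tup in lson[t]:
            if not tup[2]:
                tup[2] = True                  # mark: case 1 edge removal
                assembled.append((tup[0], t))  # row t of D_{tup[0]}
    return assembled

# The 5-by-5 example: E1 contributes to rows 3 and 4, E2 to rows 3, 4, 5.
place_tuples(1, [3, 4])
place_tuples(2, [3, 4, 5])
# Task 3 owns pivot row 3 and has row 4 in its pattern.
print(assemble_into(3, {4}))  # -> [(1, 3), (2, 3), (1, 4), (2, 4)]
```

After this call the tuples in lists 3 and 4 are marked, and only the tuple (2, 3) in list 5 remains unassembled, mirroring the set of unassembled edges above.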
5 Analysisonly algorithm
In the symbolic analysis phase, we wish to use a sparsity preserving heuristic
to generate an ordering and symbolic factorization information so that a sub
sequent numerical factorization can use this information to effect an efficient
decomposition. Thus, in this symbolic analysis algorithm, the values of
the nonzeros are not taken into account and only the sparsity structure of
the matrix is considered. We did start to design such an analysis phase but
were not convinced of its utility because of the problems with perturbing the
data structures that we mention elsewhere in this report. However, we can
apply recent work of Duff and Reid [15] based on algorithms developed by
Duff, Gould, Reid, Scott, and Turner [10] to obtain a suitable analysis. We
discuss the use of their algorithms in this section. Methods based on the
elimination dag are presented in [16, 23].
Duff et al. [10] design algorithms for factorizing symmetric indefinite ma
trices which use block pivots of order 1 or 2, chosen from the diagonal to
preserve symmetry, and are suitable even when there are zero entries on the
diagonal. In particular, they have tested their codes on augmented matrices
of the form

    [ I    A ]
    [ A^T  0 ]

which arise from the solution of least squares problems. They also consider
the more general case where the upper left identity matrix is replaced by a
general symmetric matrix, H, say, corresponding to the augmented matrix
that occurs when solving constrained nonlinear programming problems. The
strategy used to select pivots in their symmetric analysis is that of minimum
degree, generalized to handle 2-by-2 pivots of the form

    [ x  x ]
    [ x  x ]   (full pivots),

    [ x  x ]
    [ x  0 ]   (tile pivots), and

    [ 0  x ]
    [ x  0 ]   (oxo pivots).
Now, if we consider the augmented system

    [ 0  B^T ] [ x ]   [ 0 ]
    [ B  0   ] [ y ] = [ b ]                        (8)

where B is a nonsingular unsymmetric matrix of order n, the first n compo-
nents of the solution are just the solution of the set of unsymmetric linear
equations

    Bx = b.
Furthermore, if the algorithm of Duff et al. [10] is used to choose pivots
from the coefficient matrix of Equation 8, then n oxo pivots will be chosen.
In addition, their generalization of the minimum degree criterion will ensure
that the offdiagonal entries of the oxo pivot will be the same as the entry in B
(and BT) with the lowest Markowitz count. Thus we can use the symmetric
minimum degree ordering of Duff et al. [10] to obtain a symbolic analysis of
an unsymmetric matrix. The code of Duff and Reid [15] will also produce
the equivalent of our data dependency directed acyclic graph, which could,
after suitable modification, be used as input to the factor-only algorithm of
this paper. Because numerical values are not taken into account at all in
the analysis phase, we do not, however, recommend this route, because of the
large amount of modification to the resulting directed acyclic graph by the
subsequent numerical factorization phase.
6 Factoronly algorithm
This section describes the factor-only algorithm, an unsymmetric-pattern
multifrontal method that factorizes A into LU using the patterns and graphs
computed in the analysis-only or analysis-factor algorithm. Numerical piv-
oting is performed in this phase, so the algorithm must be able to handle
changes in the predicted patterns of L and U.
Task r does the following (for each task r ∈ V):

1. Wait until all sons of node r in the control flow graph have finished.

2. Create E_r and assemble into E_r the contributions represented by the
edges in the data flow graph that terminate at node r. These are the
L-sons, U-sons, and LU-sons of node r in the dynamic representation
of the reduced data flow graph. Deallocate elements of sons that no
longer have unassembled contributions. The row and column patterns
of E_r are L_r and U_r, which were precomputed in the symbolic phase
(although they might change due to numerical pivoting considerations
as described below).
3. Factorize the pivot block F_r into L1_r\U1_r, and compute the block row
U2_r of U and the block column L2_r of L, overwriting them in E_r. Then
store them in a separate, statically-allocated data structure for L and U.
This step tries to find g_r pivots within F_r, but might only be able to
find g'_r pivots (where 0 ≤ g'_r ≤ g_r). Replace g_r by the number of
pivots actually found (g_r ← g'_r).

4. Compute the update terms with a rank-g_r update to the contribution
block D_r, using the Level-2 BLAS if g_r = 1, or the Level-3 BLAS if
g_r > 1. Parallelism can occur within these kernels, as well as between
independent tasks.
5. If node r is the last son to complete for a father t (in the control flow
graph), then enable task t.
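Steps 3 and 4 can be sketched for a dense frontal matrix stored as a list of rows. This is a minimal right-looking sketch, assuming no numerical pivoting and pure-Python loops where a real code would call the Level-2/Level-3 BLAS; `factorize_front` and its arguments are illustrative names, not the report's code:

```python
# Sketch: factorize the g-by-g pivot block F of a frontal matrix E,
# then apply the rank-g update to the trailing contribution block D.

def factorize_front(E, g):
    m, n = len(E), len(E[0])
    for k in range(g):                      # eliminate pivot k of F
        for i in range(k + 1, m):
            E[i][k] /= E[k][k]              # column k of L
            for j in range(k + 1, n):       # rank-1 update of row i
                E[i][j] -= E[i][k] * E[k][j]
    # Rows/columns 0..g-1 now hold L1\U1, U2, and L2; the trailing
    # (m-g)-by-(n-g) block holds the updated contribution block D.
    return E
```

For a 2-by-2 front with g = 1, `factorize_front([[2.0, 4.0], [6.0, 8.0]], 1)` overwrites the front with the multiplier 3.0 and the updated contribution entry -4.0.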
Numerical pivoting considerations might not allow the expected number
of pivots to be chosen from a pivot block F_r. The work associated with
the failed pivots must be performed later. This can be regarded as a forced
amalgamation of the failed pivots with one or more fathers of r in the
data flow and control flow graphs, with any fill-in constraints removed.
In the classical multifrontal method, the failed pivots are amalgamated
with the single father node s of r. Task s then attempts to perform both
its own gs steps of LU factorization and that of any failed pivots of its sons.
Numerical pivoting causes only local changes in the elimination tree and in
the patterns of L and U, although these changes can ripple up if the piv
ots also fail in the father node. The changes are limited to the failed pivot
rows and columns. This case also occurs in the unsymmetricpattern multi
frontal method if node r has a single LUfather and no Lfathers or Ufathers.
Otherwise, larger disruptions can occur.
If a node r has either Lfathers or Ufathers in the data flow graph, and is
found to have numerically unacceptable pivots, the effects are not limited to
a single pair of nodes. In the following example, node s is the Lfather of a
simple node r with a single failed pivot,

    p_r          u_rt
    l_sr   p_s   u_st
    l_tr         p_t

One option is to amalgamate nodes r and s, as was done for numerical
pivoting failures in the classical multifrontal method. However, this causes
fillin in the contribution block of Es, since the pivot rows are not likely to be
identical. This fillin causes farreaching effects in the data flow and control
flow graphs between node r and the root node. Catastrophic fillin and loss
of parallelism in the control flow graph can result.
The second option for recovering from numerical pivoting failures is to
amalgamate node r with its single LUfather t, assuming it exists. This has
the effect of reordering the matrix so that the pivot p_r follows p_t,

    p_s   u_st   u_sr
          p_t    u_tr
          l_rt   p_r

Limited fill-in occurs in the intermediate nodes between node r and node
t that are L-fathers and U-fathers of node r. In this example, the column
pattern U_s of element E_s is augmented by including the failed pivot column
r (because of entry u_sr). The fill-in in column r from E_s (assuming p_s is
not a column singleton) is

    L_s.

If node s was instead a U-father of node r, then the row pattern L_s of element
E_s would be augmented by the single failed pivot row r. The fill-in in row r
from E_s would be

    U_s.
Nodes r and s are not amalgamated, so catastrophic fillin does not occur
in the contribution block of Es, as in the first option. Instead, fillin in any
contribution block is limited to row or column r. Fillin also occurs in both
row and column r when it is amalgamated with t.
In general, node r is shifted past each intervening L-father and U-father
node, augmenting each L-father by column r and each U-father by row r.
Each L-father s causes fill-in with pattern L_s in column r, and each U-father
s causes fill-in with pattern U_s in row r. Finally, node r is amalgamated
with its single LU-father, node t. Amalgamation with the LU-father t does
not cause fill-in in the contribution block of E_t because L_t already contains
the pattern L_s of any L-father node s of r, even without amalgamation. This
can be seen in the example above (prior to reordering). If s is an L-father
of r, and t is an LU-father of r, then u_st is nonzero (because of fill-in) and
L_s ⊆ L_t. Similarly, U_t is a superset of the pattern U_s of any U-father node s
of r.
The data flow and control flow graphs are only locally modified, but this
option breaks down if node r does not have an LUfather t. In this case,
node r is delayed until the end of the factorization. Fillin is caused in
row and column r by every ancestor of node r. The failed pivot row and
column might easily become dense, leading to a high level of fillin if many
numerically unacceptable pivots are encountered.
To summarize, the factoronly algorithm presented in this section can
take advantage of the Level3 BLAS to factorize a sparse matrix with an
unsymmetric pattern. This method, however, is "fragile" with respect to
numerical pivoting perturbations in the numerical phase. The graphs and the
patterns of L and U can change drastically from those found by the symbolic
phase. The changes are less drastic in the elimination tree and the patterns
of L and U for the classical multifrontal method. Limiting the perturbations
caused by numerical pivoting is the most important open problem facing the
development of a practical factor-only algorithm, and we have indicated a
possible first step in this direction. The next section presents an alternative
that bypasses this problem by combining the symbolic and numerical phases
into a single phase.
7 Analysisfactor algorithm
Combining the symbolic and numerical factorization into a single phase is
more typical of conventional factorization algorithms for unsymmetric sparse
matrices. The advantage is that the numerical values are available during
the pivot search. No pivot preordering is assumed. Pivots are chosen as
the algorithm progresses via some sparsitypreserving and numerical criteria.
Unsymmetric permutations are allowed, and probably required, since no as
sumption is made about a zerofree diagonal or the positivedefiniteness of
the original matrix. The disadvantage to this approach is the lack of pre
computed data flow and control flow graphs to guide the construction of new
elements, the assembly process, and the exploitation of parallelism. Instead,
the algorithm must compute the graphs and the patterns of L and U dur
ing factorization. With a parallel pivot search, these symbolic computations
must also be done in parallel.
7.1 Data structures
The key to this algorithm is an extension to the dynamic representation
of the reduced data flow graph presented in Section 4.2. At step k in the
factorization, only pivots 1 through k are defined. The pair of linked lists
associated with each pivot t cannot be defined for pivots t > k. Instead, a
linked list is associated with each nonpivotal row i (an Lson list) and with
each nonpivotal column j (a Uson list).
When task r creates and factorizes E_r, it places a tuple (r, q) in the L-son
list of each row i ∈ L_r, where i is the q-th entry in L_r. It also places a tuple
(r, q) in the U-son list of each column j ∈ U_r. When task s starts (r < s), its
L-sons are defined by the tuples in the L-son lists of the g_s pivot rows of E_s
(including both marked and unmarked tuples). Dynamic edge removal via
cases 1 and 2 can still be performed. Task s scans the L-son list of each row
i ∈ L_s. If it finds a reference to an L-son r of node s, then the tuple (r, q) is
marked, and the contribution to row i from E_r (found in the q-th row of D_r)
is assembled into E_s. This is an example of case 1 edge removal. If a tuple
referring to node r is found in both the L-son and U-son lists of s, then s is
an LU-father of r. Task s assembles the entire E_r into E_s.
Consider the following 5-by-5 matrix after two steps of LU factorization,
where the third pivot entry is shown as p3, but is not yet permuted to the
third row and column (x denotes a nonzero entry, . a zero):

    p1  x   x   .   x
    .   p2  x   .   x
    .   x   x   .   .
    x   x   x   x   x
    x   x   x   p3  x

The L-son and U-son lists before the factorization of E3 are:

    Row/Column   L-son list of row   U-son list of column
        3        (2,1)               (1,2)(2,1)
        4        (1,1)(2,2)
        5        (1,2)(2,3)          (1,3)(2,2)
Task 3 finds the pivot, p3, in row 5 and column 4, thus its L-son list is
((1,2),(2,3)), while its U-son list is empty. During assembly, it finds two
tuples in the L-son list of row 4 corresponding to its L-son nodes 1 and 2.
These tuples are marked, and the contributions of E1 and E2 to row 4 are
assembled into E3.
All data structures are allocated out of a single pair of onedimensional
real and integer arrays. The original matrix A is stored in both row and col
umn form at the beginning of the two arrays, followed by various workspaces
and data structures of size n. The end of the two arrays holds a stack con
taining the pattern and numerical values of the LU factors. These only grow
in size during factorization. The space between these fixed-size arrays at the
beginning and the LU factors at the end holds the frontal matrices and the L-son and U-son
lists. The frontal matrices and the Lson and Uson lists are allocated when
elements are created and deallocated when their corresponding contribution
block has been completely assembled. These form the main dynamic data
structures in the algorithm.
7.2 Algorithm
The analysisfactor algorithm first initializes the data structures for the al
locatable memory, the original matrix A, and the pivot search procedure. It
then factorizes the matrix A into the product PAQ = LU, where P and Q
are the row and column permutations due to the pivot ordering. The lower
and upper triangular systems are then solved to arrive at the solution x to
the original problem Ax = b. The factorization forms the major part of the
algorithm, and can be outlined as follows. Initially, k = 1.
1. The pivot search finds the pivot a_ij^[k-1] in A^[k-1] and interchanges
rows i and k, and columns j and k. The pivot row and column define U_k
and L_k, respectively. The L-sons, U-sons, and LU-sons of node k are
located in L-son list i and U-son list j.
2. If the newly created element is about to exhaust the remaining un-
allocated memory, perform garbage collection. Allocate the |L_k|-by-|U_k|
frontal matrix E_k and store the pivot row and column in it.

3. Completely assemble the LU-sons of node k into E_k. Assemble columns
from U-sons and rows from L-sons. Look for supercolumns and super-
rows. Permute the nrow superrows and ncol supercolumns to the
upper left of the frontal matrix, and change L_k and U_k to reflect this.
Let g_k = min(nrow, ncol). Deallocate any elements whose contributions
have been completely assembled.
4. Perform up to g_k steps of numerical factorization within the front and
save the computed rows and columns of U and L. Set g_k to the number
of pivots actually found.

5. Perform the degree update and update the L-son and U-son lists. Increment
k by g_k and repeat until the matrix is factorized.
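The five steps above can be expressed as a loop skeleton. Everything here is schematic: the five helpers are placeholders for steps 1 through 5 and are passed in as parameters, since the report's actual data structures are not reproduced:

```python
# Hypothetical skeleton of the analysis-factor outer loop. Each helper
# (pivot_search, allocate_front, assemble, factorize_front, degree_update)
# stands in for the corresponding numbered step.

def analysis_factor(A, n, pivot_search, allocate_front, assemble,
                    factorize_front, degree_update):
    k = 1
    while k <= n:
        i, j = pivot_search(A, k)          # step 1: Markowitz-style search
        E = allocate_front(A, i, j)        # step 2: |L_k|-by-|U_k| front
        g = assemble(E)                    # step 3: sons + supernodes
        g = factorize_front(E, g)          # step 4: up to g pivots found
        degree_update(E, g)                # step 5: bound-based update
        k += g                             # g pivots were eliminated
    return A
```

Because step 4 may find fewer pivots than step 3 predicted, the loop advances by the number actually found, which is why g is returned rather than assumed.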
7.2.1 Pivot search
The pivot search is based on Markowitz' strategy [26], which selects the pivot
a_ij^[k-1] with minimum upper bound on fill-in (or cost),

    (r_i^[k-1] - 1)(c_j^[k-1] - 1),

where r_i and c_j are the degrees of row i and column j. Scanning a row i
of A^[k-1] involves scanning row i of the active part of A and the unmarked
tuples in L-son list i. Adding these terms gives the structure and numerical
values of row i in A^[k-1]. Columns are scanned similarly. Many
candidate pivots will need to be searched, so this is an expensive operation
for only calculating the degree. To avoid this, only upper and lower bounds
of the degree of each row and column are computed. When, and if, the
true degree is calculated, the two bounds are set equal to the true degree.
Only the first few columns with minimum upper bound degree are searched
[12, 28], and the true degrees of these columns are computed. The pivot
search is assisted by a set of n linked lists. The dth linked list holds those
columns with upper bound degree d.
The pivot a_kk at step k must satisfy the threshold partial-pivoting cri-
terion [9]:

    |a_kk^[k-1]| ≥ u · max_i |a_ik^[k-1]|,   0 < u ≤ 1.        (9)

The candidate pivot is the numerically acceptable entry a_ij^[k-1] with lowest
approximate Markowitz cost, using the true column degree and the upper
bound on the row degree.
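The two tests above can be sketched as follows. The function and parameter names are illustrative (column entries are passed as (row, value) pairs), and the search over candidates is reduced to a plain loop rather than the degree-bucketed lists of the actual algorithm:

```python
# Sketch of the pivot tests of this section: the threshold partial-
# pivoting criterion (Equation 9) and the approximate Markowitz cost
# built from a true column degree and an upper-bound row degree.

def acceptable(value, col_entries, u=0.1):
    """Equation 9: |pivot| >= u * max |entry in its column|, 0 < u <= 1."""
    return abs(value) >= u * max(abs(v) for _, v in col_entries)

def approx_markowitz(row_degree_ubound, col_degree):
    """Upper bound on the fill-in caused by this pivot."""
    return (row_degree_ubound - 1) * (col_degree - 1)

def best_pivot(candidates, u=0.1):
    """candidates: list of (value, col_entries, row_ubound, col_degree).
    Returns the index of the cheapest numerically acceptable candidate."""
    best, best_cost = None, None
    for idx, (v, col, rub, cd) in enumerate(candidates):
        if not acceptable(v, col, u):
            continue                      # fails threshold pivoting
        cost = approx_markowitz(rub, cd)
        if best_cost is None or cost < best_cost:
            best, best_cost = idx, cost
    return best
```

A tiny entry in a column with large off-diagonal entries is rejected outright, so a slightly more expensive but numerically acceptable pivot wins.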
7.2.2 Assembly and dynamic amalgamation
The frontal matrix E_s of any LU-son node s is assembled into E_k. To mini-
mize the amount of indirect addressing necessary for the numerical additions,
offsets are computed for the pivot row and column k. A single scatter oper-
ation assembles a row of E_s into E_k. Without this offset, both a scatter
and a gather would be needed. This approach is taken in the classical mul-
tifrontal method [3]. The row and column offset vectors R and C for E_k are
initialized as

    R(L_k(q)) = q,   q = 1, ..., |L_k|
    C(U_k(q)) = q,   q = 1, ..., |U_k|

then adding the d-th row of E_s into E_k is done as follows:

    q = R(L_s(d))
    for e = g_s + 1 to |U_s|
        E_k(q, C(U_s(e))) = E_k(q, C(U_s(e))) + E_s(d, e)
    endfor
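A Python rendering of this scatter can be sketched as follows, assuming 0-based indexing (unlike the 1-based pseudocode) and dense row storage; the helper names are illustrative:

```python
# Precomputing the row and column offsets R and C lets each row of a
# son's contribution block be added to E_k with a single scatter
# (no gather of the destination indices is needed).

def build_offsets(pattern):
    """Map a global row/column index to its position in the front."""
    return {idx: q for q, idx in enumerate(pattern)}

def add_son_row(Ek, R, C, Es, Ls, Us, d, gs):
    """Add row d of son front Es into Ek, skipping the gs pivot columns."""
    q = R[Ls[d]]                          # destination row in Ek
    for e in range(gs, len(Us)):
        Ek[q][C[Us[e]]] += Es[d][e]       # single scatter per entry
```

For example, with a father pattern L_k = [7, 9], U_k = [2, 5] and a son whose only contribution row is global row 9, the son's entry for global column 5 lands in position (1, 1) of E_k.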
One or more columns j ∈ U_k are assembled from each U-son, following
the dynamic representation of the reduced data flow graph discussed in Sec-
tion 7.1. The assembly takes advantage of the pivot column offsets computed
for the LU-son assembly. If all tuples in the U-son list of column j are marked,
column j of the active part of A is scanned, and a count m is made of entries
lying outside the row pattern L_k. If m = 0, then the structure of column
j is identical to the structure of the pivot column k, and column j of the
active part of A is assembled into E_k. Columns k and j form a supercolumn.
Otherwise, column j is not a supercolumn of column k, but its true degree
can be computed. After factorization,

    c_j^[k] = |L_k| - g_k + m.                        (10)

The assembly of L-sons and the search for superrows is similar.
7.2.3 Numerical factorization within the frontal matrix
Pivots are constrained to the upper-left nrow-by-ncol submatrix, since only
these rows and columns are fully assembled. They are selected with threshold
partial-pivoting (Equation 9). The first step of the numerical factorization
of E_k finds up to g_k pivots and updates the left |L_k|-by-ncol submatrix of
E_k (thereby updating D1_k). The rest of the frontal matrix is not modified
by this step, except by the pivot permutations. The current version uses only
the Level-2 BLAS for this step (the Level-3 BLAS should be used if g_k is
large enough). g_k is set to the actual number of pivots found. At this point,
E_k has been partially factorized into

    [ L1_k\U1_k   U21_k   B2_k ]
    [ L2_k        D1_k    D2_k ]

The matrix L1_k\U1_k is the g_k-by-g_k LU factorization of F_k. The first
ncol - g_k columns of U2_k and D_k are denoted as U21_k and D1_k, respec-
tively. The second step of numerical factorization updates B2_k and D2_k,
using the Level-3 BLAS if g_k > 1. The lower-triangular system

    L1_k U22_k = B2_k

is solved for U22_k, and B2_k is overwritten by U22_k. Finally, D2_k is
overwritten with

    D2_k' = D2_k - L2_k U22_k.

The g_k rows and columns of L_k and U_k are copied from E_k into the data
structure for L and U. This data remains after E_k is deallocated.
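The two operations of the second step can be sketched in pure Python; a production code would call the Level-3 BLAS (`_trsm` and `_gemm`) instead. `L1` is assumed to have a unit diagonal, as in the L\U storage scheme, and all names are illustrative:

```python
# Sketch of the second stage: forward-solve the unit lower-triangular
# system L1 * U22 = B2, then form the Schur complement D2 - L2 * U22.

def solve_unit_lower(L1, B2):
    """Solve L1 * U22 = B2 for U22 (L1 is unit lower triangular)."""
    g, ncols = len(L1), len(B2[0])
    U22 = [row[:] for row in B2]          # start from a copy of B2
    for j in range(ncols):
        for i in range(g):
            for k in range(i):
                U22[i][j] -= L1[i][k] * U22[k][j]
    return U22

def schur_update(D2, L2, U22):
    """Overwrite D2 with D2 - L2 * U22 (the rank-g update)."""
    for i in range(len(D2)):
        for j in range(len(D2[0])):
            D2[i][j] -= sum(L2[i][k] * U22[k][j] for k in range(len(U22)))
    return D2
```

With L1 = [[1, 0], [0.5, 1]] and B2 = [[2], [3]], the forward solve yields U22 = [[2], [2]], and a 1-by-1 contribution block D2 = [[10]] with L2 = [[1, 1]] is updated to [[6]].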
7.2.4 Partial degree update
Unless the true degree can be computed using Equation 10 (or its counterpart
for the row degree), the upper and lower bounds on the degrees of each row
i ∈ L2_k and column j ∈ U2_k are found by scanning their L-son and U-son
lists, respectively. Let t = k + g_k - 1. The new lower bound on the degree
of row i is the largest of the following:

1. the previous lower bound minus g_k, since the degree cannot drop by
more than the number of reductions performed on the row,

2. one, since the matrix is assumed to be nonsingular,

3. |U2_k|, which is the number of columns in the contribution block D_k,
and also the upper bound on fill-in in row i,

4. the number of entries in row i of the active part of A, and

5. max_s |U_s^[t]|, over each element s appearing in the L-son list of row i.

If the true degree is ever computed, it is likely that the first term will be
the largest. The new upper bound on the row degree can be computed in a
similar way. It is the smallest of the following:

1. the previous upper bound plus |U2_k|, since the row degree cannot in-
crease by more than the upper bound on the fill-in in row i,

2. n - t, which is the size of the active submatrix, and

3. |U2_k| + the sum of |U_s^[t]| over each element s appearing in the
L-son list of row i.

As with the lower bound computation, it is likely that the first term will be
the smallest if the true degree is ever computed. These computations are
much less expensive than those required to find the true degrees.
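The bound computations above can be sketched as follows; the parameter names are illustrative, and the element patterns from the L-son list are passed in as precomputed sizes rather than scanned:

```python
# Sketch of the partial degree update for one row i. `prev_lo`/`prev_hi`
# are the old bounds, `g` the number of pivots just eliminated, `u2k` the
# value |U2_k|, `a_row` the entry count of row i in the active part of A,
# `son_sizes` the |U_s| values from row i's L-son list, and `active` the
# active submatrix size n - t.

def row_degree_bounds(prev_lo, prev_hi, g, u2k, a_row, son_sizes, active):
    lo = max(prev_lo - g,                # degree drops by at most g
             1,                          # matrix assumed nonsingular
             u2k,                        # dense contribution block columns
             a_row,                      # entries of A still in the row
             max(son_sizes, default=0))  # largest unassembled element
    hi = min(prev_hi + u2k,              # fill-in bounded by |U2_k|
             active,                     # size of the active submatrix
             u2k + sum(son_sizes))       # all unassembled contributions
    return lo, hi
```

For example, old bounds (5, 9) with g = 2, |U2_k| = 4, two entries of A in the row, one unassembled element of size 3, and an active submatrix of size 8 tighten to (4, 7).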
7.3 Summary of the analysisfactor algorithm
The analysisfactor algorithm is based on a dynamic representation of the
reduced data flow graph that guides the pivot search, the construction of
new elements, the assembly process, the detection of superrows and super
columns, and the degree update. Dynamic amalgamation is made possible
by the detection of superrows and supercolumns. Thus, the algorithm can
take advantage of the Level3 BLAS. It does not suffer from disruptions in
the graphs or in the patterns of L and U caused by numerical pivoting, as
does the factoronly algorithm. Parallelism is not yet addressed; this and
other issues are dealt with in Section 9, which presents open problems and
future work. The following section compares the performance of the analysis
factor algorithm with that of the D2 algorithm and the classical multifrontal
method.
8 Performance results
The analysisfactor algorithm has been implemented and tested on a parallel
minisupercomputer, the Alliant FX/80. The FX/80 has eight processors,
each with a peak performance of 23.6 megaflops. Since a fully parallel ver
sion is not yet developed, only one processor is used for the performance
comparisons, even though both D2 and the classical multifrontal method are
parallel algorithms. Amestoy and Duff's version of the classical multifrontal
method is used for the comparisons [3]. Each algorithm was used to factorize
a set of 39 test matrices from the Harwell/Boeing sparse matrix collection
[11], and the results for five typical matrices are shown in Table 1.
The table describes the five matrices by name, discipline from which they
are derived, size, number of nonzero entries, and asymmetry. The asymmetry
of a matrix A is the number of unmatched offdiagonal entries over the total
number of offdiagonal entries. An unmatched entry is an entry aij for which
aji is zero. The next section of the table compares the performance of MA28
[12], the D2 algorithm (with and without the switch to dense code) [5],
the classical multifrontal method (AmestoyDuff) [3], and the analysisfactor
algorithm based on the unsymmetricpattern multifrontal method. The D2
algorithm switches to dense code when a pivot set of fewer than four pivots is
found and when the density of A^[k] is greater than a preset threshold. The Amestoy-Duff
algorithm uses a form of amalgamation that allows a controlled amount of
extra fillin. The number of nonzeros in L and U, the number of floating
point operations required to factorize the matrix (in millions), and the run
time in seconds are compared. The run time includes both the symbolic
analysis phase and the numerical factorization phase for all three algorithms.
These two phases are combined in the case of D2 and the new analysisfactor
algorithm, and are separate in the AmestoyDuff algorithm. The last section
of the table shows a timing profile of the new algorithm. The total run time
is divided into the run time for initializing the original matrix and pivot
search data structures, the pivot search (step 1 of the algorithm outlined in
Section 7.2), garbage collection and allocation of frontal matrices (step 2),
assembly and dynamic amalgamation (step 3), numerical factorization within
the frontal matrices (step 4), and the partial degree update (step 5).
Several conclusions can be made from these results. First, the analysis
factor algorithm finds an LU factorization with a reasonable number of
nonzeros compared with MA28. The MA28 algorithm uses the true Marko
witz criterion, while the analysisfactor algorithm uses a modified cost based
on the upper bound of the row and column degrees. The modified method
simplifies the degree update while only slightly degrading the performance
in terms of fillin. In fact, the method has less fillin than the classical mul
tifrontal method for all but the nearlysymmetric matrix lns3937 (another
reason for this difference is the relaxed amalgamation in the classical multi
frontal algorithm). It always has less fill-in than the D2 algorithm (with the
switch). For this reason, it also requires the fewest floating-point operations
to factorize the matrix, except for the lns3937 matrix.
Although the new analysisfactor method is never the fastest for these ma
trices, it is often faster than the slowest of the other two (D2 with switch and
AmestoyDuff). The implementation requires refinement, as discussed in the
next section. The floatingpoint operations in the analysisfactor algorithm
are divided between the assembly (step 3) and the numerical factorization
(step 4). Most of the floatingpoint work is done in step 4, which uses dense
matrix kernels. Step 4 is faster than step 3, which relies on gather/scatter
operations. These steps together take up about half the total run time. The
degree update and pivot search take up the bulk of the rest of the time.
Table 2 shows how effective amalgamation and edge removal are in simpli
fying the data flow and control flow graphs for the same five matrices. With
no amalgamation, the number of edges in the complete graph (Gcomplete) is
simply the number of offdiagonal entries in L and U. Amalgamation and
cases 1 and 2 of edge removal are applied to obtain the reduced data flow
graph, G_rdata. More edges can be removed from G_rdata using cases 3 and 4
to obtain the reduced control flow graph, G_rcontrol. As is apparent from the
table, G_rdata and G_rcontrol have many fewer edges than G_complete. Transitive
reduction could reduce the number of edges even further, but at an unacceptable
Table 1: Performance comparisons and profile of analysisfactor algorithm
Matrix
name l,.;'i.; sherman5 mahindas gemat11 grell07
discipline fluid flow geology economics power computer
order i.;7 3312 1258 4929 1107
nonzeros ".1117 20793 7682 33108 5664
asymmetry 0.15 0.26 0.98 1.00 1.00
Nonzeros in L\U
MA28 417603 1,.;:s 10783 51729 ' I'ii
D2 (no switch) .;.; 216644 12109 ., . 18
D2 (switch) 705240 .;11 II1. 13200 61,'8 68305
AmestoyDuff "',_'.;1 177818 41632 62387 181143
analysisfactor ...*" 152011 11230 58457 53324
operations (106)
D2 (switch) 174.7 39.7 0.1 1.0 3.6
AmestoyDuff 21.6 16.0 2.2 0.5 24.0
analysisfactor *' . 13.9 0.1 0.6 3.2
time (seconds)
D2 (switch) 101.7 I 'm1 1.2 11.3 4.5
AmestoyDuff 12.0 8.9 7.6 5.7 7.2
analysisfactor 129.9 20.3 4.5 10.7 10.2
analysisfactor
timing profile ( .):
initializations 0.1 0.5 0.9 1.3 0.3
1. pivot search 9.8 19.6 23.5 30.7 20.5
2. allocation 12.9 5.5 2.0 3.5 4.8
3. assembly 44.9 36.9 39.2 27.2 39.5
4. numerical 16.8 17.7 14.2 20.5 12.7
5. degree update 15.5 19.8 20.2 16.8 22.2
Table 2: Reduced data flow and control flow graphs
Matrix
name l,.;'i.; sherman5 mahindas gemat11 gre1107
order t'.; 3312 1258 4929 1107
b, irreducible blocks 351 1675 670 352 1
edges in Gcomplete ".'"'2 148699 9972 :,.;',K2 52217
edges in Grdata 32842 11i'l I 1827 6444 5707
edges in Grcontrol I''," 3355 1279 2777 .;
lower bound, IV  b :.i 1390 551 2476 975
computational cost. The lower bound on the number of edges that transitive
reduction could obtain for a data flow or control flow graph with IV nodes
is IV  b (where b is the number of irreducible blocks in A [8], although the
edge counts were computed without permuting the matrix A to block upper
triangular form).
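The cost argument can be made concrete with a naive transitive reduction on a tiny DAG. This is an illustrative sketch, not the report's graphs: an edge (u, v) is redundant if v remains reachable from u through a longer path, as in [1].

```python
# Sketch (illustrative only): transitive reduction of a small DAG.
# An edge (u, v) is redundant if v is reachable from u along some longer
# path; deleting every such edge yields the transitive reduction.

def reachable(adj, src, dst, skip_edge):
    """Depth-first search from src to dst, ignoring the single edge skip_edge."""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u in seen:
            continue
        seen.add(u)
        for v in adj.get(u, []):
            if (u, v) != skip_edge:
                stack.append(v)
    return False

def transitive_reduction(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    # keep (u, v) only if dst is unreachable once the edge itself is removed
    return {(u, v) for u, v in edges if not reachable(adj, u, v, (u, v))}

# diamond 1->2->4 and 1->3->4 plus the redundant shortcut 1->4
edges = {(1, 2), (2, 4), (1, 3), (3, 4), (1, 4)}
print(sorted(transitive_reduction(edges)))  # -> [(1, 2), (1, 3), (2, 4), (3, 4)]
```

One reachability search per edge costs O(|E|(|V| + |E|)) overall, which is exactly the kind of expense that makes transitive reduction unattractive for the graphs in Table 2.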
9 Summary
The factor-only algorithm is "fragile" with respect to disruptions in the
graphs and patterns of L and U caused by numerical pivoting. This problem is
addressed by the analysis-factor algorithm, which avoids the disruptions by
combining the numerical and symbolic phases so that pivots can be selected
on both sparsity-preserving and numerical criteria. However, the factor-only
algorithm still forms an important part of a complete unsymmetric-pattern
multifrontal method. If multiple problems with similar pattern and numerical
characteristics are to be solved (in solving nonlinear systems, for example),
the pivot ordering of the first matrix is often suitable for successive
matrices. The analysis-factor algorithm would factorize the first matrix and
provide the pivot ordering to the factor-only algorithm, which factorizes the
subsequent matrices. Few numerical problems would be expected in the
factor-only algorithm. Nevertheless, better handling of the disruptions
caused by numerical pivoting is the most important open problem facing the
development of a practical factor-only algorithm.
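The intended division of labor between the two algorithms can be sketched as a driver loop. The function names and signatures below are hypothetical, not the report's interface:

```python
# Hypothetical driver sketch (analysis_factor and factor_only are assumed
# callables, not the report's actual routines): the first matrix in a
# sequence is factorized with the combined analysis-factor algorithm, and
# its pivot ordering is reused by the cheaper factor-only algorithm.

def solve_sequence(matrices, analysis_factor, factor_only):
    """Factorize a sequence of matrices with similar pattern and numerical
    characteristics, reusing the pivot ordering of the first matrix."""
    first = matrices[0]
    lu, pivot_order = analysis_factor(first)  # symbolic + numeric combined
    factors = [lu]
    for a in matrices[1:]:
        # the stored pivot ordering is expected to remain numerically
        # acceptable because successive matrices are close to the first
        factors.append(factor_only(a, pivot_order))
    return factors
```

In a Newton iteration, for instance, each Jacobian after the first would be handled entirely by the factor-only path.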
Parallelism is not yet addressed in the sequential version of the
analysis-factor algorithm. Some parallel work can take place within the dense
matrix kernels, and while this will be important in the final version, it
will not provide enough parallelism in general. A truly parallel version must
take advantage of parallelism across multiple frontal matrices. Parallelism
can be incorporated in one of several ways. First, the parallel pivot search
of the D2 algorithm can be adapted to this algorithm. The pivot search first
creates an independent set of pivots (m of them, say). Each factorization
task takes a single pivot and extends it into a block of pivots via dynamic
amalgamation. Since multiple tasks can affect a single row or column in the
active submatrix, these tasks either cooperate to update the degrees, or a
separate parallel degree update phase is employed. When all factorization
tasks finish, a new set of independent pivots is found. A second approach
would pipeline the pivot search with the numerical factorization. The pivot
search task (with one processor or a set of processors) searches for pivots
that are independent of the pivots of currently executing factorization
tasks. A task k is created for each pivot. Task k creates the frontal matrix
Ek, performs the assembly, factorizes it, and either cooperates with other
tasks to perform the degree update or requests the pivot search task to
perform the degree update. When it completes, it signals the pivot search
task that the rows and columns it affected (C2k and U2k, respectively) are
now candidates for the pivot search. In both approaches, multiple frontal
matrices are factorized in parallel.
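Both schemes rest on finding pivots whose rows and columns are pairwise disjoint. A minimal greedy sketch of that selection step follows; it assumes candidates arrive already ranked by the pivot-search cost, and it is not the D2 implementation:

```python
# Sketch (assumption: candidates are pre-sorted, best pivot first; this
# greedy pass is illustrative, not the D2 code): select a set of pivots
# with pairwise disjoint rows and columns, so that the corresponding
# frontal matrices can be assembled and factorized in parallel.

def independent_pivots(candidates):
    """candidates: iterable of (row, col) pivot positions, best first.
    Returns a greedy set of pivots sharing no row and no column."""
    used_rows, used_cols, chosen = set(), set(), []
    for r, c in candidates:
        if r not in used_rows and c not in used_cols:
            chosen.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return chosen

# pivots sharing row 0 or column 2 cannot be eliminated concurrently
print(independent_pivots([(0, 0), (1, 2), (0, 3), (4, 2), (3, 1)]))
# -> [(0, 0), (1, 2), (3, 1)]
```

Each chosen pivot then seeds one factorization task; the conflict sets (used rows and columns) are what the degree-update phase must reconcile afterwards.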
10 Acknowledgments
We would like to thank Patrick Amestoy, Michel Daydé, and Mario Arioli at
CERFACS for many helpful discussions.
References
[1] A. V. AHO, M. R. GAREY, AND J. D. ULLMAN, The transitive reduction of a
directed graph, SIAM J. Comput., 1 (1972), pp. 131-137.
[2] G. ALAGHBAND AND H. F. JORDAN, Sparse Gaussian elimination with
controlled fill-in on a shared memory multiprocessor, IEEE Trans. Comput.,
38 (1989), pp. 1539-1557.
[3] P. R. AMESTOY AND I. S. DUFF, Vectorization of a multiprocessor
multifrontal code, Internat. J. Supercomputer Appl., 3 (1989), pp. 41-59.
[4] T. A. DAVIS AND E. S. DAVIDSON, Pairwise reduction for the direct,
parallel solution of sparse unsymmetric sets of linear equations, IEEE
Trans. Comput., 37 (1988), pp. 1648-1654.
[5] T. A. DAVIS AND P.-C. YEW, A nondeterministic parallel algorithm for
general unsymmetric sparse LU factorization, SIAM J. Matrix Anal. Appl.,
11 (1990), pp. 383-402.
[6] J. J. DONGARRA, J. DU CROZ, I. S. DUFF, AND S. HAMMARLING, A set of
level 3 basic linear algebra subprograms, ACM Trans. Math. Softw., 16
(1990), pp. 1-17.
[7] J. J. DONGARRA, J. DU CROZ, S. HAMMARLING, AND R. J. HANSON, An
extended set of Fortran basic linear algebra subprograms, ACM Trans. Math.
Softw., 14 (1988), pp. 1-17.
[8] I. S. DUFF, On permutations to block triangular form, J. Inst. Math.
Applic., 19 (1977), pp. 339-342.
[9] I. S. DUFF, A. M. ERISMAN, AND J. K. REID, Direct Methods for Sparse
Matrices, Oxford University Press, Oxford, 1986.
[10] I. S. DUFF, N. I. M. GOULD, J. K. REID, J. A. SCOTT, AND K. TURNER,
Factorization of sparse symmetric indefinite matrices, IMA J. Numer. Anal.,
11 (1991), pp. 181-204.
[11] I. S. DUFF, R. G. GRIMES, AND J. K. REID, Sparse matrix test problems,
ACM Trans. Math. Softw., 15 (1989), pp. 1-14.
[12] I. S. DUFF AND J. K. REID, Some design features of a sparse matrix
code, ACM Trans. Math. Softw., 5 (1979), pp. 18-35.
[13] I. S. DUFF AND J. K. REID, The multifrontal solution of indefinite
sparse symmetric linear equations, ACM Trans. Math. Softw., 9 (1983),
pp. 302-325.
[14] I. S. DUFF AND J. K. REID, The multifrontal solution of unsymmetric
sets of linear equations, SIAM J. Sci. Statist. Comput., 5 (1984),
pp. 633-641.
[15] I. S. DUFF AND J. K. REID, MA47, a Fortran code for direct solution of
indefinite sparse symmetric linear systems, to appear, 1991.
[16] S. C. EISENSTAT AND J. W. H. LIU, Exploiting structural symmetry in
unsymmetric sparse symbolic factorization, Report CS-90-12, Dept. of
Computer Science, York University, North York, Ontario, Canada, Nov. 1990.
[17] A. GEORGE, M. T. HEATH, J. W. H. LIU, AND E. NG, Solution of sparse
positive definite systems on a shared-memory multiprocessor, Internat. J.
Parallel Programming, 15 (1986), pp. 309-325.
[18] A. GEORGE AND J. W. H. LIU, Computer Solution of Large Sparse Positive
Definite Systems, Prentice-Hall, Englewood Cliffs, NJ, 1981.
[19] A. GEORGE AND J. W. H. LIU, The evolution of the minimum degree
ordering algorithm, SIAM Review, 31 (1989), pp. 1-19.
[20] A. GEORGE AND E. NG, An implementation of Gaussian elimination with
partial pivoting for sparse systems, SIAM J. Sci. Statist. Comput., 6
(1985), pp. 390-409.
[21] A. GEORGE AND E. NG, Parallel sparse Gaussian elimination with partial
pivoting, Oak Ridge National Laboratory, Oak Ridge, TN, 1988.
[22] J. R. GILBERT, An efficient parallel sparse partial pivoting algorithm,
Report 88/45052-1, Center for Computer Science, Chr. Michelsen Institute,
Bergen, Norway, 1988.
[23] J. R. GILBERT AND J. W. H. LIU, Elimination structures for unsymmetric
sparse LU factors, Report CS-90-11, Dept. of Computer Science, York
University, North York, Ontario, Canada, Feb. 1991.
[24] J. R. GILBERT AND T. PEIERLS, Sparse partial pivoting in time
proportional to arithmetic operations, SIAM J. Sci. Statist. Comput., 9
(1988), pp. 862-874.
[25] J. W. H. LIU, The role of elimination trees in sparse factorization,
SIAM J. Matrix Anal. Appl., 11 (1990), pp. 134-172.
[26] H. M. MARKOWITZ, The elimination form of the inverse and its
application to linear programming, Management Science, 3 (1957),
pp. 255-269.
[27] O. WING AND J. W. HUANG, A computational model of parallel solution of
linear equations, IEEE Trans. Comput., 29 (1980), pp. 632-638.
[28] Z. ZLATEV, J. WASNIEWSKI, AND K. SCHAUMBURG, Y12M: Solution of Large
and Sparse Systems of Linear Algebraic Equations, Lecture Notes in Computer
Science 121, Springer-Verlag, Berlin, 1981.
