An unsymmetric-pattern multifrontal method for sparse LU factorization

Timothy A. Davis*
Computer and Information Sciences Department
University of Florida, Gainesville, Florida, USA

Iain S. Duff†
European Center for Research and Advanced Training in Scientific Computation (CERFACS)
Toulouse, France


March 1993




Technical Report TR-93-018
Computer and Information Sciences Department
University of Florida
Gainesville, FL, 32611 USA


Submitted to the SIAM Journal on Matrix Analysis and Applications.

Key words. LU factorization, unsymmetric sparse matrices, parallel algorithms, multifrontal
methods
AMS (MOS) subject classifications. 65F50, 65W05, 65F05, 65-04, 68-04.
Abbreviated title. Unsymmetric-pattern multifrontal method








*phone: (904) 392-1481
†also Rutherford Appleton Laboratory, Chilton, Didcot, Oxon. OX11 0QX England.










Abstract


Sparse matrix factorization algorithms are typically characterized by irregular mem-
ory access patterns that limit their performance on parallel-vector supercomputers. For
symmetric problems, methods such as the multifrontal method replace irregular opera-
tions with dense matrix kernels. However, no efficient LU factorization algorithm based
primarily on dense matrix kernels exists for matrices whose pattern is very unsymmet-
ric. A new unsymmetric-pattern multifrontal method based on dense matrix kernels
is presented. Frontal matrices are rectangular instead of square, and the assembly
tree is replaced with a directed acyclic graph. As in the classical multifrontal method,
advantage is taken of repetitive structure in the matrix by amalgamating nodes in the
directed acyclic graph, giving it high performance on parallel-vector supercomputers.
The performance of three sequential versions is compared with the classical multifrontal
method and other unsymmetric solvers on a Cray YMP-8/128.









Contents

1 Introduction
  1.1 Previous work
  1.2 Outline

2 LU factorization with general frontal matrices
  2.1 Example matrix

3 Elimination and assembly trees

4 The assembly graph and assembly dag
  4.1 The basic graph
  4.2 Edge reductions
  4.3 Additional amalgamation

5 Analysis-only algorithm

6 Factor-only algorithm

7 Analysis-factor algorithm
  7.1 Data structures
  7.2 Pivot search, degree update, and edge reductions
  7.3 The AFup algorithm
  7.4 The AFdown algorithm
  7.5 The AFstack algorithm
  7.6 Summary of the analysis-factor algorithm

8 Performance results

9 Final remarks

10 Acknowledgments









List of Figures

1 A possible subgraph that characterizes Equation 7
2 Assembly graph D[2] = (A[2], G[2], F[2]) for matrix A in Equation 6
3 Assembly graph D[4] = (A[4], G[4], F[4]) for matrix A in Equation 6
4 Assembly graph D[5] = (A[5], G[5], F[5]) for matrix A in Equation 6
5 Fill-in due to amalgamation between LU-child and LU-parent
6 Fill-in due to amalgamation between L-child and L-parent
7 Relative performance: AFstack (solid) and Mups (dotted)
8 Relative performance: AFstack (solid) and D2 (dotted)
9 Relative performance: AFstack (solid) and MA28 (dotted)


List of Tables

1 Matrix statistics and run time
2 Detailed results for nine representative matrices
3 True versus approximate degree update (AFdown on Alliant FX/80)









1 Introduction


Conventional sparse matrix factorization algorithms rely heavily on indirect addressing. This
gives them an irregular memory access pattern that limits their performance on typical
parallel-vector supercomputers. In contrast, the multifrontal method of Duff and Reid [16]
is designed with regular memory access in the innermost loops. Its kernel is one or more
steps of LU factorization within each square, dense frontal matrix defined by the nonzero
pattern of a pivot row and column. These steps of LU factorization compute a submatrix
of update terms that are held within the frontal matrix until they are assembled (added)
into the frontal matrix of its parent in the assembly tree. The assembly tree (a variant of
the elimination tree [26]) controls parallelism across multiple frontal matrices, while dense
matrix operations [6] provide parallelism and vectorization within each frontal matrix.
However, this method is based on an assumption of a symmetric nonzero pattern, and so has poor performance on matrices whose patterns are very unsymmetric. If this assumption is not made, the frontal matrices are rectangular instead of square, a directed acyclic graph (called the assembly dag) replaces the assembly tree, and frontal matrices are no longer assembled by a single parent.
A new unsymmetric-pattern multifrontal approach respecting these constraints is presented [4]. It builds the assembly dag either during factorization or in a preprocessing phase.
As in the symmetric multifrontal case, advantage is taken of repetitive structure in the ma-
trix by amalgamating nodes in the assembly dag. Thus the algorithm uses dense matrix
kernels in its innermost loops, giving it high performance on parallel-vector supercomputers.

1.1 Previous work

The D2 algorithm [5] is based on a non-deterministic parallel pivot search that constructs
a set of independent pivots (m, say) followed by a parallel rank-m update of the active
submatrix. Near the end of the factorization, the active submatrix is considered as a dense
matrix (and factorized with dense matrix kernels).
The multifrontal method of Duff and Reid (MA37) [2, 9, 10, 16] will be referred to as the
classical multifrontal method. The method takes more advantage of dense matrix kernels
than D2, but is unsuitable when the pattern of the matrix is very unsymmetric. Many
methods for symmetric matrices use dense kernels; a survey may be found in [25].
Most recently, Gilbert and Liu [23] and Eisenstat and Liu [18] have presented symbolic factorization algorithms for unsymmetric matrices, assuming that the pivot ordering is known a priori. The algorithms are based on the elimination directed acyclic graph (dag) and its reductions, which are similar to the assembly dag presented in this paper. Moreover, we also indicate how our graphs can be applied to the case where the pivot ordering is not
known a priori. Also, Matstoms [28] has recently developed a multifrontal QR factorization algorithm.









1.2 Outline


The following sections present the unsymmetric-pattern multifrontal method and three se-
quential algorithms based on the method. Section 2 describes LU factorization in terms of
general frontal matrices. Section 3 demonstrates that both the elimination tree and assembly
tree are unsuitable for the unsymmetric-pattern multifrontal method. Section 4 presents the
assembly graph D and assembly dag G, and discusses amalgamation and edge reduction on these graphs. The assembly dag G is to the unsymmetric-pattern multifrontal method as the assembly tree is to the classical multifrontal method. Section 5 discusses an analysis-only
algorithm, which computes a symbolic factorization when the pivot sequence is not known in
advance. Section 6 presents a factor-only algorithm based on D and highlights several open
problems such as the effects of numerical pivoting. It requires a previous analysis phase,
either from the analysis-only algorithm or the combined analysis-factor algorithm described
in Section 7. The performance of three analysis-factor algorithms is illustrated in Section 8.
Finally, Section 9 summarizes the unsymmetric-pattern multifrontal method, including open
problems and future research.


2 LU factorization with general frontal matrices

The LU factorization of an n-by-n matrix A into the product of a lower triangular matrix L (with unit diagonal) times an upper triangular matrix U consists of n major steps. In the outer-product formulation of Gaussian elimination, A is transformed into the product L[k] A[k] U[k] after step k-1 and just before step k (1 <= k <= n, A[1] = A, L[n+1] = L, U[n+1] = U). Throughout this paper, the notation X[k] will refer to the state of X just before step k, where X is a matrix, set, graph, scalar, etc. The state of X after LU factorization is complete is denoted by X[n+1].
The active submatrix, A[k]_{k..n,k..n}, is the portion of the matrix A[k] that has still to be reduced after step k-1 has finished. We can write

    A = [ L[k]_{1..k-1,1..k-1}      0        ] [ I_{k-1}       0          ] [ U[k]_{1..k-1,1..k-1}   U[k]_{1..k-1,k..n} ]
        [ L[k]_{k..n,1..k-1}    I_{n-k+1}    ] [    0    A[k]_{k..n,k..n} ] [          0                 I_{n-k+1}      ]    (1)

where the matrices L[k]_{1..k-1,1..k-1} and U[k]_{1..k-1,1..k-1} are (k-1)-by-(k-1) lower and upper triangular matrices, respectively, I_{k-1} is the (k-1)-by-(k-1) identity matrix, and I_{n-k+1} is the
(n-k+1)-by-(n-k+1) identity matrix.
When we include pivoting to preserve sparsity and maintain numerical accuracy, the
matrix PAQ (instead of A) is factorized into LU. The permutation matrices P and Q define
the row and column permutations performed during factorization or during a preprocessing
phase. However, to simplify the notation used to describe the method, we will assume that
permutations rename rows and columns of L[k], A[k], and U[k]. Thus, the partial factorization
shown in Equation 1 will be considered as the factorization of P[k]AQ [], where p[h] and Q[]
are the permutations applied through step k 1.









An entry in row i and column j of a sparse matrix A is a single value a_ij that is symbolically represented in the sparse data structure for A. An entry is typically numerically nonzero, but explicit zero entries might be included if the pattern represents a class of matrices to which the given A belongs. Numerical cancellation is ignored during factorization, so entries in A[k] might become numerically zero. Entries are interchangeably referred to as nonzeros in this paper, with the understanding that such a "nonzero" might actually be numerically zero.
Step k selects a single pivot entry a_ij[k] from the active submatrix A[k]_{k..n,k..n}, and interchanges row i and column j with the k-th row and column of A[k]_{k..n,k..n}. Step k then computes L[k+1], A[k+1], and U[k+1] from L[k], A[k], and U[k].
For a sparse matrix A, define the row and column structure (or pattern) as the index pattern of the entries in rows or columns of A, viz.

    Struct(A_{i*}) = { j | a_ij ≠ 0 }

    Struct(A_{*j}) = { i | a_ij ≠ 0 }.

The row degree r_i[k] (i >= k) is the number of entries in row i of A[k], and the column degree c_j[k] (j >= k) is the number of entries in column j of A[k], so that

    r_i[k] = |Struct(A[k]_{i*})|

    c_j[k] = |Struct(A[k]_{*j})|.
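As a small illustration of this notation (an example of ours, not from the report), consider a 3-by-3 matrix whose entries are a11, a13, a22, a31, a32, and a33. Then Struct(A_{1*}) = {1, 3}, Struct(A_{*2}) = {2, 3}, r_1[1] = 2, and c_2[1] = 2.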
The LU factorization can be described in terms of general frontal matrices. An element or general frontal matrix Ek is a dense, rectangular (c_j[k]-by-r_i[k]) submatrix that corresponds to the pivot (a_ij[k]) selected at step k. The columns in Ek are defined by the set Uk, which is the set of column indices of entries in the pivot row i selected at step k. Similarly, the rows in Ek are defined by the set Lk, which is the set of row indices of entries in the pivot column j selected at step k. That is,

    Uk = Struct(A[k]_{i*})

    Lk = Struct(A[k]_{*j}).

The sets Lk and Uk are referred to as the row and column pattern, respectively, of the frontal matrix Ek.
Let us assume that another gk - 1 pivots (gk >= 1) have the same pivot row and column pattern (excluding earlier pivots) as the first pivot in the frontal matrix. In this case, their pivot rows and columns are also in the element Ek. Entries not in the first gk rows or columns of Ek form the contribution block, Dk, of update terms that are added to A[k] to obtain the reduced matrix A[k+gk]. The elements E_{k+1} to E_{k+gk-1} are not explicitly represented, having been amalgamated into the single element Ek. The gk major steps of LU factorization within a single element Ek can be performed with dense matrix kernels. Element Ek is referred to as a supernode with a pivot block of rank gk if gk > 1, or as a simple node if gk = 1. Two or









more rows with identical pattern are referred to as a super-row, and columns with identical
column pattern are referred to as a super-column.
The row and column pattern of Ek (Lk and Uk) divide into two disjoint subsets:

    Lk = L'k ∪ L''k

(where L'k is the set of gk pivot rows in Ek) and

    Uk = U'k ∪ U''k

(where U'k is the set of pivot columns in Ek). The sets L'k, L''k, U'k, and U''k divide the element Ek into four submatrices: Fk (the gk-by-gk pivot block), Bk, Ck, and Dk (the contribution block),

               U'k   U''k
    Ek = L'k  [ Fk    Bk ]                                        (2)
         L''k [ Ck    Dk ]

where the matrix is shown before factorization. The matrices Fk, Bk, and Ck are fully assembled before the factorization of this frontal matrix begins. That is, after assembly no other nonzeros remain in the gk pivot rows and columns except those in Ek. However, the submatrix Dk may only hold a partial summation of the original matrix entries and update terms from previous elements.
The numerical factorization within this frontal matrix computes the LU factorization of Fk (Fk = L'k U'k), computes the block column L''k of L and the block row U''k of U, and updates the contribution block with the Schur complement

    D'k = Dk - L''k U''k.

Any entries in Fk can be selected as pivots, as long as they are numerically acceptable. Numerical considerations require pivoting within the pivot block Fk and might limit the number of pivots to less than the maximum (in which case gk is taken to be the actual number of pivots found in Ek). The matrices L'k and L''k define columns k through k + gk - 1 of L, and U'k and U''k define rows k through k + gk - 1 of U. The factorized frontal matrix is

               U'k        U''k
    Ek = L'k  [ L'k\U'k    U''k ]                                 (3)
         L''k [ L''k       D'k  ]
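The following sketch (ours, using numpy in place of the Level-2 and Level-3 BLAS; the function and variable names are hypothetical) shows these three dense steps on a front stored as one rectangular array, as in Equations 2 and 3. Pivoting within Fk is omitted for brevity.

    import numpy as np

    def factorize_front(E, g):
        # E is a dense frontal matrix [F B; C D] with a g-by-g pivot block F;
        # overwrite it with [L'\U'  U''; L''  D'] as in Equation 3.
        for p in range(g):                     # LU of the pivot block: F = L'U'
            E[p+1:g, p] /= E[p, p]
            E[p+1:g, p+1:g] -= np.outer(E[p+1:g, p], E[p, p+1:g])
        F = E[:g, :g]
        U1 = np.triu(F)                        # U' (upper triangular, with diagonal)
        L1 = np.tril(F, -1) + np.eye(g)        # L' (unit lower triangular)
        E[g:, :g] = np.linalg.solve(U1.T, E[g:, :g].T).T   # L'' = C U'^{-1}
        E[:g, g:] = np.linalg.solve(L1, E[:g, g:])         # U'' = L'^{-1} B
        E[g:, g:] -= E[g:, :g] @ E[:g, g:]     # Schur complement: D' = D - L''U''
        return E

A production code would pivot within F for numerical acceptability and would call the BLAS (for example DTRSM for the triangular solves and DGEMM for the rank-g update) rather than general solves.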

Entries in the lower-right (n-k+1)-by-(n-k+1) submatrix of P[k] A Q[k] that are not yet assembled in a frontal matrix form the active part of A, denoted as M[k], where

    (M[k])_ij = 0                  if i < k or j < k,
              = 0                  if (P[k] A Q[k])_ij has been assembled into some Et (1 <= t < k),
              = (P[k] A Q[k])_ij   otherwise.                     (4)










The active part of A is simply the set of original entries that have not yet been assembled into some frontal matrix. The assembly of original entries of A is done when a row or column becomes pivotal, or earlier as a by-product of the degree update. The active submatrix is represented as a sum of elements created during factorization, plus the active part of A,

    A[k]_{k..n,k..n} = M[k] + SUM_{t ∈ V[k]} Et[k]                (5)

where V[k] is the set of all frontal matrices created through step k-1 of the factorization. The active part of an element Et just before step k (denoted as Et[k]) is a submatrix of Et formed by rows and columns that are non-pivotal just before step k (where 1 <= t < k), and that also have not been assembled into subsequent frontal matrices. The row and column pattern of Et[k] are denoted as Lt[k] and Ut[k], respectively. Note that Lk[k+gk] = L''k, Uk[k+gk] = U''k, and Ek[k+gk] = D'k.

2.1 Example matrix

Consider the matrix
pi x x
x P2 x x x
X X p3 X
A= x p4 x (6)

X X
X X X X

with the partially factorized matrix L[4]A[4]U[4]

pi x x
x P2 x x x x
X X P3 X X X
X P4 X

X X
X X X X X

just before the fourth step of LU factorization. An entry is denoted as x, a zero is a small
dot, and /I denotes the k-th pivot entry. The pivots are assumed to lie on the diagonal in
order. The two elements already created are the dense rectangular matrices, El and E2-

145
1 4 5
1 pi x x
2 x x x
3 x x x
4 x x x
7 x x x









              2   3   4   5   7

         2 [ p2  x   x   x   x  ]
    E2 = 3 [ x   p3  x   x   x  ]
         5 [ x   x   x   x   x  ]
         7 [ x   x   x   x   x  ]

The pivot rows and columns have been delineated from the contribution blocks. The active submatrix just before step 4 is

    A[4]_{4..7,4..7} = M[4] + E1[4] + E2[4],

where

               4   5   6   7
         4 [ x   x   .   .  ]
  M[4] = 5 [ .   .   .   .  ]
         6 [ .   .   x   x  ]
         7 [ .   .   .   x  ]

                4   5                        4   5   7
  E1[4] = 4 [ x   x  ]         E2[4] = 5 [ x   x   x  ]
          7 [ x   x  ]                 7 [ x   x   x  ]

Note that E3 is not generated, since it lies entirely within E2. If the entry a_44 (labeled as p4) is selected at step 4 as the fourth pivot, then the resulting element E4 would be

               4   5
         4 [ p4  x  ]
    E4 = 5 [ x   x  ]
         7 [ x   x  ]

Also note that element E1 makes a contribution to the pivot row of E4 which cannot be assembled into or represented by the intermediate element E2, even though E1 affects portions of E2. This would not occur if the pattern of A were symmetric.
Row 7 of E1 can be assembled into E2, since U1[2] ⊆ U2, even though the contribution that E1 makes to row 7 is not needed to factorize E2. If this assembly is performed, row 7 is removed from L1[4], since the notation E1[4] refers only to portions of E1 that are not yet assembled into subsequent frontal matrices just before step 4. This assembly is performed by the edge reductions described in Section 4.2.


3 Elimination and assembly trees

The elimination tree, T, [26] and its variants (such as the assembly tree [15]) are used either explicitly or implicitly in most parallel sparse matrix algorithms [25]:

    T = (V_T, E_T)

    V_T = {1, ..., n}

    E_T = {(i, j) | j = parent(i)}

    parent(i) = min { j | i < j, lji ≠ 0 }.
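A direct (if inefficient) transcription of this definition, as a sketch of ours with the pattern of L given as a set of (row, column) pairs:

    def elimination_tree(n, L_pattern):
        # parent(i) = min { j | i < j, lji != 0 }; roots have no parent
        parent = {}
        for i in range(1, n + 1):
            below = [j for (j, c) in L_pattern if c == i and j > i]
            if below:
                parent[i] = min(below)
        return parent

    # example: l21, l31, l32 nonzero gives parent(1) = 2 and parent(2) = 3
    assert elimination_tree(3, {(2, 1), (3, 1), (3, 2)}) == {1: 2, 2: 3}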
Starting with the elimination tree, T, an assembly tree is constructed by amalgamating
a connected (node-induced) subgraph into a single supernode. The minimum label of the
nodes in the subgraph becomes the label of the new supernode. Edges contained in a
subgraph are removed, and any edges incident on only one node in the subgraph become
incident on the resulting supernode. The process repeats until the desired assembly tree is
obtained. Elimination and assembly trees are typically defined only for symmetric-patterned
LU factors, with the exception of partial-pivoting methods [21, 22]. A more general graph
is needed for the unsymmetric-pattern multifrontal method.
The classical multifrontal method [2, 9, 10, 15, 16] is based on the assembly tree. It has a formulation similar to the general frontal matrix formulation described in the previous section, except that the analysis is performed on the pattern of A + A^T. Frontal matrices
are square. The method is usually divided into two phases: symbolic analysis and numerical
factorization. The symbolic phase finds a suitable pivot ordering using a sparsity preserving
heuristic such as minimum degree [20], determines the assembly tree, and finds the patterns
of L and U. Each node k in the assembly tree represents the work associated with element
Ek. The assembly tree describes the large-grain parallelism between the nodes.
The column pattern Uk of the frontal matrix Ek is given by the union of the column pattern Uj[k] of each child j of node k and the pattern of row k of the upper triangular part of A. A node can be amalgamated with one of its children if the column pattern of the parent is identical to that of the child (excluding the index of the child's pivot column). That is, if Uk = Uj[k] for a single child j of node k. Additional amalgamation may be allowed if the patterns are not quite identical, in which case extra fill-in occurs.
The numerical factorization phase uses the assembly tree to compute the LU factoriza-
tion. The tree guides the assembly process and construction of new elements and describes
the precedence between nodes. At node k, the frontal matrices Ej[k] of each child node j are assembled into Ek. Because of the symmetric-pattern assumption, the pattern Uk is a superset of the pattern Uj[k] of the contribution block of each child j. The entire contribution block Dj of a child can always be assembled into its parent, since all the rows and columns that are affected by Dj are present in Ek. The method takes advantage of the dense matrix kernels [6, 7] to factorize Ek: the Level-2 BLAS for simple nodes (gk = 1) or the Level-3 BLAS for supernodes (gk > 1).
The classical multifrontal method is not the only method based on the elimination tree
or its variants. In the sparse column-Cholesky factorization of George et al. [19], the work
at node k in the elimination tree is the computation of column k of L. The work at node k
modifies column k with columns corresponding to a subset of the descendants of node k in
the elimination tree. This is in contrast to the classical multifrontal method, in which data
is assembled only from the children of a node.
However, the assembly tree is not appropriate in a multifrontal method if the frontal matrices have unsymmetric patterns. This is due to the incomplete assembly that takes place. Consider the inter-relationships between three elements, Ei, Ej, and Ek (i < j < k and gi = gj = gk = 1) that result from the following 3-by-3 submatrix of L\U formed from the pivot rows and columns i, j, and k:

    [ pi    uij   uik ]
    [ lji   pj    ujk ]
    [ lki   lkj   pk  ]

(the notation L\U when applied to matrices refers to two matrices packed in the same array). Assume that there are other nonzeros in these rows and columns of U and L, respectively. If, for example, uij = uik = ujk = lkj = 0, the inter-relationships become

    [ pi              ]
    [ lji   pj        ]                                           (7)
    [ lki         pk  ]

In this case, a contribution from Ei is assembled into both Ej and Ek (assuming that there
are other nonzeros in row i), but the work at nodes j and k may proceed in parallel after
the work at node i completes. The corresponding fragment of a possible subgraph that
characterizes these relationships is shown in Figure 1. An assembly tree is not suitable since
node i has more than one parent; the parallelism between j and k cannot be described by
a tree rooted at node n, and the contributions of a child (node i) are not assembled by a
single parent. The contribution that Ei makes to row k cannot be assembled into Ej, but
the contribution that Ei makes to row j must be assembled into Ej.


[Figure 1: A possible subgraph that characterizes Equation 7: node i has two parents, j and k; the edges are labeled row j and row k.]

If the assembly tree is unsuitable, what kind of graph can guide the unsymmetric-pattern multifrontal method? The elimination dag [18, 23] is one possibility. Define G(L) as the directed graph associated with L. That is, (i,j) is an edge of G(L) if and only if lji is nonzero. Similarly, (i,j) is an edge of G(U^T) if and only if uij is nonzero. Several reductions to these graphs are described in [18, 23] (such as transitive reduction, which leads to the elimination dag). If the graphs are constructed for a symmetric-patterned matrix and a simple reduction is applied (namely, symmetric reduction), then the graphs become equivalent to the elimination tree. The next section extends the results in [18, 23] by allowing for arbitrary sparsity preserving and numerically acceptable pivoting during the factorization. Further extensions are discussed in Section 7.









4 The assembly graph and assembly dag


The assembly graph

    D = (A, G, F)

is constructed during the factorization as the pivot sequence is determined. Edge reductions are applied as it is constructed. The notation D[k], A[k], G[k], and F[k] refers to the state of each of these graphs or edge-sets just before step k of the factorization.
The composite graph consists of three types of nodes in two types of graphs. Conceptually, the bipartite graph

    A[k] = (Ar[k], Ac[k], Ae[k])

represents the unassembled form of the active submatrix, A[k]_{k..n,k..n}, where

    Ar[k] = {row k, ..., row n},

    Ac[k] = {column k, ..., column n},

    Ae[k] = {(i, j) | row i ∈ Ar[k] ∧ column j ∈ Ac[k] ∧ a_ij[k] ≠ 0}.

The edges Ae[k] in the bipartite graph A[k] are not explicitly stored. Rather, they are represented as the active part of A plus any unassembled contribution blocks (as in Equation 5).
The factorized frontal matrices are described by the assembly dag (directed acyclic graph)

    G[k] = (V[k], E[k])

    V[k] = {t | Et is a frontal matrix created before step k} ⊆ {1, ..., k-1}

    E[k] = EL[k] ∪ EU[k]

    EL[k] = {(s, t) | s < t < k ∧ s ∈ V[k] ∧ t ∈ V[k] ∧
             one or more entire rows of Es[t] assembled into Et}

    EU[k] = {(s, t) | s < t < k ∧ s ∈ V[k] ∧ t ∈ V[k] ∧
             one or more entire columns of Es[t] assembled into Et}

Edges in E[k] are referred to as inactive L-, U-, or LU-edges. Inactive edges describe the {L/U/LU}-{parent/child} relationship of factorized frontal matrices. Node s is an LU-child of its LU-parent node t if Es[t] is assembled into Et in its entirety ((s,t) ∈ EL[k] ∩ EU[k]). Otherwise, node s is an L-child of its L-parent node t if (s,t) ∈ EL[k], or a U-child of its
U-parent node t if (s,t) ∈ EU[k]. In the assembly dag, the term parent refers to any type of parent, and the term child refers to any type of child.
An active edge in F connects a frontal matrix in G with a row or column in A for which it holds an unassembled contribution. That is,

    F[k] = FL[k] ∪ FU[k]

    FL[k] = {(t, row i) | i ∈ Lt[k]}

    FU[k] = {(t, column j) | j ∈ Ut[k]}.

These edges are referred to as active L-edges and active U-edges, respectively.

4.1 The basic graph
Before factorization starts, the dag G[1] is empty and A[1] is the bipartite graph of the original matrix, A.
The pivot search selects a pivot and permutes it to the first row and column (renaming them as row and column one, in our notation). Consecutive pivots with identical pattern are included in the first frontal matrix E1, which after being factorized is given as

    E1 = [ L'1\U'1   U''1 ]
         [ L''1      D'1  ]

where g1 = |L'1| = |U'1| is the number of pivots in the pivot block of E1.
The first node of G is constructed. This node refers to the newly factorized frontal matrix E1. The column pattern U1 of E1 is given by the set of edges in A[1] incident to row nodes 1 through g1. Similarly, the row pattern of E1 (L1) is given by the set of edges in A[1] incident to column nodes 1 through g1. Pivot row and column nodes 1 through g1 (having been renamed) and edges in A[1] incident to these nodes are then removed from A. The pattern L''1 defines new active L-edges in F[g1+1] that are added from node 1 of G to row nodes in A. Similarly, U''1 defines new active U-edges that are added from node 1 in G to column nodes in A. Note that edges from pivots 2 through g1 are not added, since nodes 2 through g1 will not appear in G. This is the first case of edge reduction, and is a result of no-fill node amalgamation (additional amalgamation is described in Section 4.3).
New fill-in edges are (implicitly) added to A for each entry in the contribution block that is not already present in the active submatrix. The resulting bipartite graph is A[g1+1], with row and column nodes numbered from g1+1 to n. The resulting G[g1+1] contains a single node. The factorization of the first frontal matrix satisfies steps 1 through g1 of the LU factorization. The next step of LU factorization will be step g1+1.
An example of a graph D[2] is shown in Figure 2 for the matrix A in Equation 6 in Section 2.1. The first frontal matrix E1 consists of only one pivot and is the first node in the graph G[2]. To avoid too many lines, edges in the implicit bipartite graph A[2] are shown in array form. An original edge between a row i and a column j is shown as an x, a fill-in edge is shown as a 0, and a zero is shown as a small o (for which no edge in A exists). The two graphs, G[2] and A[2], are separated by a dashed line. Edges in F[k] cross this dashed line, while edges in E[k] will be placed to the upper-left of this line (E[2] is empty).

[Figure 2: Assembly graph D[2] = (A[2], G[2], F[2]) for matrix A in Equation 6. The dag G[2] (node 1) is joined by active U-edges to column nodes and by active L-edges to row nodes of the bipartite graph A[2], shown in array form.]
At a subsequent step k, a new pivot row i and column j are selected, and frontal matrix Ek is created. Additional pivots with identical pattern are also included, so that Ek contains pivots k through k + gk - 1. After permutations are made, active edges in F[k] incident to row and column nodes k through k + gk - 1 in A[k] represent unassembled numerical contributions to the pivot rows and columns of Ek. These contributions must be assembled before factorizing Ek. Active edges are assembled, removed from F[k], and combined into inactive edges in E[k+gk]. Thus,

    (s, row i) ∈ FL[k] ∧ i ∈ L'k  ⟹  (s, k) ∈ EL[k+gk]           (8)

and

    (s, column j) ∈ FU[k] ∧ j ∈ U'k  ⟹  (s, k) ∈ EU[k+gk].       (9)

If both Equations 8 and 9 hold, then the entire Es[k] is assembled into Ek (a symmetric edge reduction, see Section 4.2).
Pivot row and column nodes k through k + gk - 1 and edges incident to these nodes are removed from A. New active edges are added from node k in G to row and column nodes in A defined by L''k and U''k, respectively. New fill-in edges are added (implicitly) to A in the same manner as fill-in from the first frontal matrix. The factorization of the frontal matrix Ek satisfies steps k through k + gk - 1 of the LU factorization. The next step of LU factorization will be k + gk.
The graphs A[k] and G[k] correspond to specific submatrices of the partial LU factors in Equation 1. G[k] is an assembly dag for the L[k]_{1..k-1,1..k-1} and U[k]_{1..k-1,1..k-1} matrices. The active L-edges are a subset of the nonzero pattern of L[k]_{k..n,1..k-1}, and the active U-edges are a subset of the nonzero pattern of U[k]_{1..k-1,k..n}.
Figure 3 continues the example shown in Figure 2, where the second frontal matrix E2 has been factorized. The single inactive L-edge (1, 2) in G[4] is shown in bold (node 2 is an L-parent of its L-child node 1). This L-edge is due to the nonzero entries l2,1 and l3,1, and (at this point in the discussion) represents the assembly of two rows from the contribution block of E1 into the pivot rows of E2. The resulting bipartite graph A[4] is also shown.
Figure 4 shows the graph D[5] = (A[5], G[5], F[5]) after the factorization of E4. The inactive U-edge (2, 4) in G[5] represents the assembly into E4 of the contribution that E2 makes to column 4. It is present because of the nonzeros u2,4 and u3,4. The inactive LU-edge (1, 4) represents the assembly into E4 of the contribution of E1 to both row and column 4. It is present because of the nonzeros l4,1 and u1,4. In the next section these inactive edges will be used to represent the assembly of additional contributions, and some of the active edges in Figures 3 and 4 will be removed.












[Figure 3: Assembly graph D[4] = (A[4], G[4], F[4]) for matrix A in Equation 6. The dag G[4] (nodes 1 and 2, joined by the inactive L-edge (1, 2)) is connected by active U-edges to column nodes and by active L-edges to row nodes of the bipartite graph A[4].]
















[Figure 4: Assembly graph D[5] = (A[5], G[5], F[5]) for matrix A in Equation 6.]









4.2 Edge reductions

Potentially, every off-diagonal nonzero in the LU factors can define a single, unique edge in the assembly dag G. Some of these potential edges are never created, due to the no-fill amalgamation described in the previous section. Additional edge reductions are described in this section. Some of these are in the style of Eisenstat, Liu, and Gilbert [18, 23], except that we apply them to our partially constructed graph when the pivot order is not fully known. A class of edge reductions that goes beyond transitive reduction is applied during the approximate degree update phase described in Section 7.2.
Up to this point in the discussion, an active L-edge (s, row i) ∈ FL (or an active U-edge (s, column j) ∈ FU) represents an unassembled contribution from Es to row i (or column j) of the active submatrix. This contribution remains in Es until row i becomes pivotal in Et, or until Es is assembled into its LU-parent Ek (where s < k < t). With no additional edge reductions, contributions of a frontal matrix remain unassembled until the latest possible step of the factorization. A frontal matrix persists until its contribution block is completely assembled into subsequent frontal matrices. Edge reductions improve the memory requirements of the method by assembling these contributions at the earliest possible step. They also simplify the pivot search and approximate degree update described in Section 7.
Two types of simple edge reductions, L-child and U-child reductions, can be applied at step k. For an L-child edge reduction, consider the lower triangular part of the submatrix formed from rows and columns s to s + gs - 1, k to k + gk - 1, and i of L[k+gk]\A[k+gk], where s and k are nodes in G[k+gk] and i >= k + gk:

    [ L's              ]
    [ Lks   L'k        ]                                          (10)
    [ Lis   Lik   aii  ]

This reflects the status of the factorization just before step k + gk. It is not yet known when row i will become pivotal. In Equation 10, matrix L_{k..k+gk-1, s..s+gs-1} is denoted as Lks, L_{i, s..s+gs-1} is denoted as Lis, and L_{i, k..k+gk-1} is denoted as Lik.
If Lks ≠ 0, then s is an L-child of k. Fill-in from Es causes

    Us[k] ⊆ Uk.                                                   (11)

If Es and Ek both contribute to a common row i, the contribution that element Es makes to row i (with pattern Us[k]) is assembled into Ek at step k before row i becomes pivotal. The active edge (s, row i) is removed, reducing the size of the active edge-set FL. The entry for row i is removed from Ls[k+gk]. Row i can be found by scanning all the active edges incident on all rows in Lk. If an L-child of k is found in this scan, the active edge can be removed.
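In outline, this scan might be implemented as follows (our sketch; the tuple lists are those defined later in Section 7.1, and is_L_child and assemble_row are hypothetical helpers):

    def l_child_edge_reduction(k, Lk_rows, L_child_list, is_L_child, assemble_row):
        # For each row i in the pattern of E_k, scan its active L-edges; when
        # the source node s is an L-child of k (Lks != 0, so Us[k] is a subset
        # of Uk), assemble E_s's row-i contribution into E_k and delete the
        # active edge (s, row i).
        for i in Lk_rows:
            for (s, f) in list(L_child_list[i]):
                if is_L_child(s, k):
                    assemble_row(s, f, i)
                    L_child_list[i].remove((s, f))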









Removal of active edges also results in fewer inactive edges in G. If all active L-edges from node s to pivot rows of Ek are removed before step k, then no inactive L-edge (s, k) appears in G.
As an example, refer again to matrix A in Equation 6, and to Figure 3. When E2 is created, its column pattern

    U2 = {2, 3, 4, 5, 7}

is a superset of the pattern

    U1[2] = {4, 5}

because of fill-in. Row i = 7 is in both row patterns L1[2] and L''2. Thus, the contribution that E1 makes to row 7 is assembled into E2 (before row 7 becomes pivotal). The active L-edge from node 1 to row 7 is assembled and deleted.
Similarly, in Figure 4, node 4 ∈ G is a U-parent of node 2. Node 2 contributes to column 5; this contribution is assembled and the active U-edge (2, column 5) is removed. Node 4 is also an LU-parent of node 1. All contributions from E1 are assembled into E4, and all active edges originating at node 1 are removed.
The edge reductions presented here do not result in a graph, G, with the minimal number
of edges. However, we can perform these edge reductions as a byproduct of the approximate
degree update phase of the analysis-factor algorithm, because the active edges terminating at
rows and columns affected by a newly factorized Ek are scanned to compute bounds on the
number of nonzeros in those rows and columns of the active submatrix. Transitive reduction
[1] could remove more edges, but would be too costly.


4.3 Additional amalgamation

No-fill amalgamation has already been described. Additional amalgamation can improve the unsymmetric-pattern multifrontal method by increasing the ratio of Level-3 BLAS to Level-2 BLAS operations and by simplifying the symbolic computations. If a parent node k and child node s in G are amalgamated into a single supernode r, the amalgamated element Er has row pattern Ls ∪ Lk and column pattern Us ∪ Uk.
If k is an LU-parent of s, then extra fill-in can occur only in the pivot rows and columns of Er, because

    Us[k] ⊆ Uk

and

    Ls[k] ⊆ Lk.

Rows and columns that are pivotal through step k + gk - 1 are removed from these patterns, maintaining the subset relationships:

    Us[k+gk] ⊆ Uk[k+gk]                                           (12)

    Ls[k+gk] ⊆ Lk[k+gk]                                           (13)

as shown in Figure 5 (where k' = k + gk).









If k is only an L-parent of s, then the two row patterns Ls and Lk are unrelated (Equation 12 holds, but not Equation 13), as shown in Figure 6 (where k' = k + gk). Fill-in can occur in pivot columns U's and U'k, and in pivot rows L's, but not in pivot rows L'k. Fill-in can also occur in the contribution block of Er, specifically, in the submatrix defined by columns Uk[k+gk] \ Us[k+gk] and rows Ls[k+gk] \ Lk[k+gk] (outlined with a dashed box in Figure 6). Amalgamation between s and a U-parent node k is the transpose of the L-parent case.
When two nodes k and s are amalgamated into a new node r, the edge (s, k) ∈ E is removed, any edges that terminated at either k or s now terminate at r, and any edges that start at either k or s now start at r. Duplicate edges are combined.
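At the level of patterns and edge sets, the merge is simple to state (our sketch; the dag is represented as dictionaries of pattern sets plus a set of edge pairs, all hypothetical):

    def amalgamate(k, s, row_pat, col_pat, edges):
        # Merge child s into parent k, forming supernode r (label k is reused):
        # union the row and column patterns, drop the edge (s, k), and reroute
        # every edge incident on s to k; duplicates merge automatically in the set.
        row_pat[k] |= row_pat.pop(s)        # L_r = L_s U L_k
        col_pat[k] |= col_pat.pop(s)        # U_r = U_s U U_k
        edges.discard((s, k))
        return {(k if a == s else a, k if b == s else b) for (a, b) in edges}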


5 Analysis-only algorithm

In the symbolic analysis phase, we wish to use a sparsity preserving heuristic to generate an ordering and symbolic factorization information so that a subsequent numerical factorization can use this information to effect an efficient decomposition. Thus, in this symbolic analysis-only algorithm, the values of the nonzeros are not taken into account and only the sparsity structure of the matrix is considered. We did start to design such an analysis phase but were not convinced of its utility because of the problems with perturbing the data structures that we mention in Section 6. However, we can apply recent work of Duff and Reid [17] based on algorithms developed by Duff, Gould, Reid, Scott, and Turner [11] to obtain a suitable analysis. We discuss the use of their algorithms in this section. Methods based on the elimination dag are presented in [18, 23].
Duff et al. [11] design algorithms for factorizing symmetric indefinite matrices which use block pivots of order 1 or 2, chosen from the diagonal to preserve symmetry, and are suitable even when there are zero entries on the diagonal. In particular, they have tested their codes on augmented matrices of the form

    [ I     A ]
    [ A^T   0 ]

which arise from the solution of least-squares problems. They also consider the more general case where the upper left identity matrix is replaced by a general symmetric matrix, H, say, corresponding to the augmented matrix that occurs when solving constrained nonlinear programming problems. The strategy used to select pivots in their symmetric analysis is that of minimum degree, generalized to handle 2-by-2 pivots of the form

    [ x  x ]
    [ x  x ]   (full pivots),

    [ x  x ]
    [ x  0 ]   (tile pivots), and

    [ 0  x ]
    [ x  0 ]   (oxo pivots).













.. [kk']


Xs X

X Xk



z[k']

sI -[k']
k
0 1


__ u^ 0

[k']
k


Figure 5: Fill-in due to amalgamation between LU-child and LU-parent


[Figure 6: Fill-in due to amalgamation between L-child and L-parent (contribution block fill-in outlined with a dashed box).]









Now, if we consider the augmented system

    [ 0    B^T ] [ x ]   [ 0 ]
    [ B    0   ] [ y ] = [ b ]                                    (14)

where B is a nonsingular unsymmetric matrix of order n, the first n components of the solution are just the solution of the set of unsymmetric linear equations

    Bx = b.

Furthermore, if the algorithm of Duff et al. [11] is used to choose pivots from the coefficient matrix of Equation 14, then n oxo pivots will be chosen. In addition, their generalization of the minimum degree criterion will ensure that the off-diagonal entries of the oxo pivot will be the same as the entry in B (and B^T) with the lowest Markowitz count. Thus we can use the symmetric minimum degree ordering of Duff et al. [11] to obtain a symbolic analysis of an unsymmetric matrix. The code of Duff and Reid [17] will also produce the equivalent of our directed acyclic graph (G), which could, after suitable modification, be used as input to the factor-only algorithm of this paper. Because numerical values were not taken into account at all in the analysis phase, we do not, however, recommend this route, because of the large amount of modification to the resulting directed acyclic graph by the subsequent numerical factorization phase.
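The construction is easy to state concretely (our numpy sketch; B and b are arbitrary test data, not from the report):

    import numpy as np

    n = 3
    B = np.array([[4., 1., 0.], [0., 3., 2.], [1., 0., 5.]])   # nonsingular
    b = np.array([1., 2., 3.])

    # the augmented symmetric system of Equation 14
    K = np.block([[np.zeros((n, n)), B.T],
                  [B, np.zeros((n, n))]])
    xy = np.linalg.solve(K, np.concatenate([np.zeros(n), b]))

    x = xy[:n]            # the first n components solve Bx = b; the rest are zero
    assert np.allclose(B @ x, b)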


6 Factor-only algorithm

This section describes the factor-only algorithm, an unsymmetric-pattern multifrontal method that factorizes A into LU using the patterns of the LU factors and the final assembly dag, G[n+1], computed in the analysis-only or analysis-factor algorithm. Numerical pivoting is performed in this phase, so it must be able to handle changes in the predicted patterns of L and U.
The work at node i consists of the following (for i ∈ V):

1. Wait until all children of node i in G have finished.

2. Create Ei and assemble into it the contributions represented by the edges in G that terminate at node i. These are the L-children, U-children, and LU-children of node i. Deallocate elements of children that no longer have unassembled contributions. The row and column pattern of Ei is Li and Ui, which were precomputed in the symbolic phase (although they might change due to numerical pivoting considerations as described below).

3. Factorize the pivot block Fi into L'i\U'i, and compute the block row U''i of U and block column L''i, overwriting them in Ei. Then store them in a separate, statically-allocated data structure for L and U. The symbolic factorization predicted gi pivots, but this number might be less due to pivot failures within Fi, or more due to pivot failures in children of node i. Set gi to the number of pivots actually found.

4. Compute the update terms with a rank-gi update to the contribution block Di, using the Level-2 BLAS if gi = 1, or the Level-3 BLAS if gi > 1. Parallelism can occur within these kernels, as well as between independent nodes [24].

5. If node i is the last child to complete for a parent j, then enable node j (a scheduling sketch follows this list).
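The precedence handling in steps 1 and 5 is ordinary topological scheduling over the dag; a sequential sketch of ours, with the numerical work of steps 2 through 4 abstracted into a callback:

    def schedule_dag(nodes, children, process):
        # Process each node only after all of its children have been processed
        # (Kahn-style scheduling); process(i) assembles the children of node i,
        # factorizes F_i, and updates the contribution block D_i.
        pending = {i: len(children[i]) for i in nodes}
        parents = {i: [] for i in nodes}
        for i in nodes:
            for s in children[i]:
                parents[s].append(i)
        ready = [i for i in nodes if pending[i] == 0]
        while ready:
            i = ready.pop()
            process(i)
            for j in parents[i]:      # step 5: enable j when its last child finishes
                pending[j] -= 1
                if pending[j] == 0:
                    ready.append(j)

In a parallel code, all nodes on the ready list can be processed concurrently.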

Numerical pivoting considerations might not allow the expected number of pivots to be
chosen from a pivot block Fi. The work associated with the failed pivots must be performed
later. This can be regarded as a forced amalgamation of the failed pivots with one or more parents of i in G, with any fill-in constraints removed.
In the classical multifrontal method, the failed pivots are amalgamated with the single
parent node j of i. The gj steps of LU factorization in node j and that of any failed pivots
of its children are attempted. Numerical pivoting causes only local changes in the assembly
tree and in the patterns of L and U, although these changes can ripple up if the pivots
also fail in the parent node. The changes are limited to the failed pivot rows and columns.
This case also occurs in the unsymmetric-pattern multifrontal method if node i has a single
LU-parent and no L-parents or U-parents. Otherwise, larger disruptions can occur.
If a node i has either L-parents or U-parents in G and is found to have numerically unacceptable pivots, the effects are not limited to a single pair of nodes. In the following example, node j is the L-parent of a simple node i with a single failed pivot:

    [ pi          uik ]
    [ lji   pj    ujk ]
    [ lki         pk  ]

One option is to amalgamate nodes i and j, as was done for numerical pivoting failures in
the classical multifrontal method. However, this causes fill-in in the contribution block of Ej,
since the pivot rows are not likely to be identical (see Figure 6 and the related discussion).
This fill-in causes far-reaching effects in G in any ancestors of i. Catastrophic fill-in and loss of parallelism can result.
The second option for recovering from numerical pivoting failures is to amalgamate node i with its single LU-parent k, assuming it exists. This has the effect of reordering the matrix so that the pivot pi follows pk:

    [ pj    ujk   uji ]
    [       pk    uki ]
    [       lik   pi  ]

Limited fill-in occurs in the intermediate nodes between node i and node k that are L-parents and U-parents of node i. In this example, the column pattern Uj of element Ej is augmented by including the failed pivot column i (because of entry uji). The fill-in in column i from Ej (assuming pj is not a column singleton) is

    Struct(A_{*i}) ← Struct(A_{*i}) ∪ L''j.








If node j was instead a U-parent of node i, then the row pattern Lj of element Ej would be augmented by the single failed pivot row i. The fill-in in row i from Ej would be

    Struct(A_{i*}) ← Struct(A_{i*}) ∪ U''j.

Nodes i and j are not amalgamated, so catastrophic fill-in does not occur in the contribution
block of Ej, as in the first option. Instead, fill-in in any contribution block is limited to row
or column i. Fill-in also occurs in both row and column i when it is amalgamated with k.
In general, node i is shifted past each intervening L-parent and U-parent node, augmenting each L-parent by column i and each U-parent by row i. Let m be the final position of pivot i in the modified pivot order. Each L-parent j causes fill-in with pattern Lj[m] in column i, and each U-parent j causes fill-in with pattern Uj[m] in row i. Finally, node i is amalgamated with its single LU-parent, node k, augmenting both L'k and U'k by i (where k < m < k + gk). Amalgamation with the LU-parent k does not cause fill-in in the contribution block of Ek, because Lk already contains the pattern Lj[m] of any L-parent node j of i, even without amalgamation. This can be seen in the example above (prior to reordering). If j is an L-parent of i, and k is an LU-parent of i, then ujk is nonzero (because of fill-in) and Lj[m] ⊆ Lk[m] ⊆ Lk. Similarly, Uj[m] ⊆ Uk[m] ⊆ Uk for any U-parent node j of i.
The graph G is only locally modified, but this option breaks down if node i does not have an LU-parent k. In this case, node i is delayed until its last L-parent or U-parent. Fill-in is caused in row and column i by every ancestor of node i until that point. The failed pivot row and column might easily become dense, leading to a high level of fill-in if many numerically unacceptable pivots are encountered.
To summarize, the factor-only algorithm presented in this section can take advantage of the Level-3 BLAS to factorize a sparse matrix with an unsymmetric pattern. This method, however, is "fragile" with respect to numerical pivoting perturbations in the numerical phase. The graphs and the patterns of L and U can change drastically from those found by the symbolic phase. The changes are less drastic in the assembly tree and the patterns of L and U for the classical multifrontal method. Limiting the perturbations caused by numerical pivoting is the most important open problem facing the development of a practical factor-only algorithm, and we have suggested a possible first step in this direction. The next section presents an alternative that bypasses this problem by combining the symbolic and numerical phases into a single phase.


7 Analysis-factor algorithm

Combining the symbolic and numerical factorization into a single phase is more typical of
conventional factorization algorithms for unsymmetric sparse matrices. The advantage is
that the numerical values are available during the pivot search. No pivot preordering is
assumed. Pivots are chosen as the algorithm progresses via some sparsity preserving and
numerical criteria. Unsymmetric permutations are allowed, and probably required, since no
assumption is made about a zero-free diagonal or the positive-definiteness of the original









matrix. The disadvantage of this approach is the lack of a precomputed assembly dag to guide the construction of new elements, the assembly process, and the exploitation of parallelism. The algorithm must compute the graph D and the patterns of L and U during factorization. Some form of the active submatrix must be maintained to allow for arbitrary pivoting, and the degree of each row and column of the active submatrix must be maintained.
Three approaches for performing additional amalgamation are described below (the AFup,
AFdown, and AFstack algorithms). Before using any of these algorithms, we perform an (op-
tional) preprocessing stage that scales the matrix and permutes it to upper-block-triangular
form (the diagonal blocks are factorized independently) [10]. The performance of these
algorithms is presented in Section 8.

7.1 Data structures

All data structures are allocated out of a pair of one-dimensional real and integer arrays. The
original matrix A is stored in both row and column form, followed by various workspaces and
data structures of size n. The two arrays hold a stack containing the pattern and numerical
values of the LU factors.
Portions of D = (A, G, F) are held implicitly. The bipartite graph A[k] is an implicit representation of the active submatrix. The active submatrix A[k] is represented as the unassembled entries of the active part of A and all unassembled portions of contribution blocks (see Equation 5). The dag, G = (V, E), is explicitly stored with the LU factors. The edge set E is not needed in the analysis-factor algorithm, but it is required as input to the factor-only algorithm.
The active L-edges in FL[k] are held as a set of L-child lists. Similarly, the active U-edges in FU[k] are held as a set of U-child lists. The L-child list for a row i in A[k] holds a set of tuples,

    L-child list i = {(t, f) | (t, row i) ∈ FL[k] ∧ f = offset of row i in Et},

one for each active L-edge that terminates at row i (similarly, U-child list j holds tuples for active U-edges that terminate at column j of A[k]). A tuple (t, f) in an L-child list i contains a reference to a node t, and an offset f to the contribution to row i in Et.
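In Python terms, the lists might be plain dictionaries of tuple sets (our sketch; a real code packs these into the two work arrays described above):

    def make_child_lists(n):
        # L_child_list[i] holds tuples (t, f): node t holds an unassembled
        # contribution to row i at row-offset f within E_t; U_child_list[j]
        # is the analogous structure for column j.
        L_child_list = {i: set() for i in range(1, n + 1)}
        U_child_list = {j: set() for j in range(1, n + 1)}
        return L_child_list, U_child_list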
The space between the original matrix and fixed arrays and the LU factors holds the
frontal matrices and L-child and U-child lists. The frontal matrices and the L-child and U-
child lists are allocated when elements are created and deallocated when their corresponding
contribution blocks have been completely assembled. These form the main dynamic data
structures in the algorithm.

7.2 Pivot search, degree update, and edge reductions
The three algorithms use a similar global pivot search strategy and degree update phase
(although AFstack combines the numerical assembly with this phase), and perform similar
edge reductions.









The pivot search is based on Markowitz' strategy [27], which selects the pivot a_ij[k] with minimum upper bound on fill-in (or cost),

    (r_i[k] - 1)(c_j[k] - 1).                                     (15)

The rows and columns of the active matrix are not held explicitly; rather, they are held as a set of contribution blocks and entries from the original matrix A. Scanning a row i of A[k]_{k..n,k..n} involves scanning row i of the active part of A and tuples in the L-child list i. Adding these terms gives the pattern and numerical values of row i in A[k]_{k..n,k..n}. Columns are scanned similarly. Many candidate pivots will need to be searched, so this is an expensive operation for only calculating the degree. To avoid this, only upper and lower bounds of the degree of each row and column are computed. Upper and lower bounds of a degree r are denoted as upper(r) and lower(r), respectively. The initial upper and lower bounds are simply the true degrees of the rows and columns of A. If the true degree is calculated during factorization, the two bounds are set equal to the true degree. Only the first few columns with minimum upper bound degree are searched (not unlike the truncated pivot search options in [14] and [31]), and the true degrees of these columns are computed. The pivot search is assisted by a set of n linked lists. The d-th linked list holds those columns with upper bound degree d, that is, {j | upper(c_j[k]) = d}.
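The n linked lists form the familiar bucket structure of minimum-degree codes; a dictionary-of-sets sketch of ours:

    def build_degree_lists(upper_col_degree):
        # the d-th bucket holds { j | upper(c_j) = d }
        buckets = {}
        for j, d in upper_col_degree.items():
            buckets.setdefault(d, set()).add(j)
        return buckets

    def search_candidates(buckets, limit):
        # scan only the first few columns of minimum upper-bound degree
        found = []
        for d in sorted(buckets):
            for j in buckets[d]:
                found.append(j)
                if len(found) == limit:
                    return found
        return found

The true degrees of the few candidate columns returned are then computed exactly, as described above.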
The pivot a_ij[k] at step k must also satisfy the threshold partial-pivoting criterion [10]:

    |a_ij[k]| >= u · max_{l >= k} |a_lj[k]|,   0 < u <= 1.        (16)

The candidate pivot is the numerically acceptable entry a_ij[k] with lowest approximate Markowitz cost using the true column degree and the upper bound row degree,

    (upper(r_i[k]) - 1)(c_j[k] - 1).                              (17)
The approximate degree update finds the upper and lower bounds of the degrees of each row i ∈ L''k and column j ∈ U''k by scanning their L-child and U-child lists, respectively. The new lower bound on the row degree, for example, cannot be smaller than the upper bound on fill-in in row i or the maximum number of unassembled columns of each contribution block affecting row i (these can be found in the L-child list i). The new upper bound on the row degree can be computed in a similar way. Only a short scan of each L-child list and U-child list in the affected rows ({i | i ∈ L''k}) and columns ({j | j ∈ U''k}) suffices. The lists are kept short via edge removal. The time taken for this scan is linear in the sum of the sizes of the L-child/U-child lists of the affected rows and columns.
However, at the cost of an additional scan of either the affected L-child or U-child lists, more accurate degree bounds can be computed, based on the external row and column degrees of previous frontal matrices with respect to the current frontal matrix, Ek. The time taken is asymptotically unchanged.
The column pattern Um[k] of an unassembled contribution block Em[k] divides into two disjoint subsets with respect to Uk: internal entries

    Uint_mk = Um[k] ∩ Uk

and external entries

    Uext_mk = Um[k] \ Uk,

and thus, by definition,

    Um[k] = Uint_mk ∪ Uext_mk.

An external entry j ∈ Uext_mk is an unassembled column of Em that is not present in the column pattern of the current frontal matrix Ek. Let the internal and external row degrees of Em with respect to Ek be denoted as

    rint_mk = |Uint_mk|

and

    rext_mk = |Uext_mk|,

respectively. Thus,

    rext_mk = |Um[k]| - rint_mk.                                  (18)

The external and internal column degrees are defined similarly, based on Lm[k] and Lk. Note that the external row and column degrees are not affected by the assembly of contributions from Em into the current frontal matrix Ek. Assembly can decrease the internal degrees.
Each U-child list for a column j ∈ Uint_mk contains a tuple (m, f) (where f is the offset of where column j appears in Em). The internal degree rint_mk can be computed by simply counting how many times node m appears in the U-child lists for columns j ∈ Uk. This count can be done for all frontal matrices affecting Ek in a single scan of all U-child lists for columns j ∈ Uk. The external row degree of Em can then be computed with Equation 18.
The new lower bound on the degree of row i, lower(r_i[k+gk]), is then the largest of the following:

1. r_i[k+gk] >= lower(r_i[k]) - gk, since the degree cannot drop by more than the number of pivots applied to row i,

2. r_i[k+gk] >= |Struct(M[k+gk]_{i*})|, which is the number of unassembled entries in row i of the active part of A (see Equation 4), and

3. the number of columns in the contribution block of Ek plus the maximum external row degree of the frontal matrices in the L-child list for row i. That is,

       r_i[k+gk] >= |U''k| + max_{(m, row i) ∈ FL[k+gk]} rext_mk.

   This lower bound is tight if the column patterns of all the external entries of frontal matrices affecting row i of Ek,

       { Uext_mk | (m, row i) ∈ FL[k+gk] },

   are all subsets of the largest such column pattern.
The new upper bound on the degree of row i, upper(r_i[k+gk]), can be computed in a similar way. It is the smallest of the following:

1. r_i[k+gk] <= upper(r_i[k]) + |U''k|, since the row degree cannot increase by more than the upper bound on the fill-in in row i,

2. r_i[k+gk] <= n - k - gk + 1, which is the size of the active submatrix after Ek is factorized, and

3. the number of columns in the contribution block of Ek plus the sum of the external row degrees of the frontal matrices in the L-child list for row i. That is,

       r_i[k+gk] <= |U''k| + SUM_{(m, row i) ∈ FL[k+gk]} rext_mk.

   This upper bound is tight if the column patterns of all the external entries affecting row i are disjoint.
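For instance (numbers ours), suppose |U''k| = 4 and row i has two frontal matrices in its L-child list, with external row degrees 3 and 5. The third lower bound gives r_i[k+gk] >= 4 + max(3, 5) = 9, while the third upper bound gives r_i[k+gk] <= 4 + (3 + 5) = 12; the final bounds are the largest of the lower-bound rules and the smallest of the upper-bound rules above.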

An approximate degree update (and edge reduction) based on external row and column
degrees requires one extra scan of either the U-child lists of the affected columns j e Uk or
the L-child lists of the affected rows i G k, as follows:

1. Scan the U-child lists for all columns j G Uk, and compute the external row degree for
each frontal matrix Em appearing in the lists. Perform edge reduction if Em is a true
U-child or LU-child.

2. Scan the L-child lists for all rows i G Ck. Compute both the external column degree
for each frontal matrix Em appearing in the lists, and the approximate degree bounds
for each row i (using the external row degrees computed in the first scan). Perform
edge reduction if Em is a true or effective L-child or LU-child (see below).

3. Re-scan the U-child lists for all j G Uk, and compute the approximate degree bounds
for each column i (using the external column degrees computed in the second scan).
Perform edge reduction if Em is an effective U-child or LU-child.

If \overline{u}_{mk} = 0, then node m is an effective L-child of node k. Similarly, node m is an effective
U-child of node k if \overline{l}_{mk} = 0. When the affected L-child or U-child lists are scanned, any
edge from a true or effective child of node k is removed, and the corresponding numerical
contribution is assembled. Node m is an effective LU-child if \overline{u}_{mk} = \overline{l}_{mk} = 0, in which case
the entire unassembled contribution E_m^{[k]} is assembled into Ek.
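The classification is just a pair of comparisons; as a sketch (an illustrative function, not the
report's code):

    def classify_child(ext_row_degree, ext_col_degree):
        # ext_row_degree is \overline{u}_{mk}; ext_col_degree is \overline{l}_{mk}
        if ext_row_degree == 0 and ext_col_degree == 0:
            return "effective LU-child"   # all of E_m^[k] can be assembled into Ek
        if ext_row_degree == 0:
            return "effective L-child"    # rows of Em fit within the columns of Ek
        if ext_col_degree == 0:
            return "effective U-child"    # columns of Em fit within the rows of Ek
        return "neither"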
Edge reductions from effective children remove any edge that would have been removed
by transitive reduction. Furthermore, these edge reductions take advantage of coincidental
overlap between previous frontal matrices and the current one, and thus remove even more
edges than is possible by transitive reduction. Consider Equation 10 in Section 4.2. If
L_{ks} = ∅, then node s is not a true L-child of node k, the frontal matrix currently being
factorized. Equation 11 may still hold, by coincidence, even though Es does not cause fill-in
in the g_k pivot rows of Ek. In this case, node k is an effective L-father of node s. If both Es
and Ek contribute to a common row i, then the contribution from Es is assembled into Ek,









and the active L-edge (s, row i) is removed. If transitive reduction were applied, this edge
might not be removed.
Edge reductions from true children are still useful (as described in Section 4.2), since the
edges from these nodes can be removed in the first scan of the affected L-child and U-child
lists, rather than in the second scan of either set of lists.
Our experiments with the approximate degree update show a moderate, acceptable in-
crease in fill-in in exchange for a significant reduction in run time when compared with
computing the true degrees (see the performance results in Section 8).


7.3 The AFup algorithm

The first approach (the AFup algorithm) selects a single pivot using the global Markowitz-
like strategy described above, allowing a pivot to be selected from anywhere in the active
submatrix. The new node in G is augmented by additional pivot entries that lie within the
current element. The new node in G "grows upward" by considering nodes in G connected
via active edges to the new node.
The AFup algorithm can be outlined as follows. Initially, k = 1.

1. The pivot search finds the pivot a_{ij}^{[k]} in A^{[k]}, and interchanges rows i and k, and
columns j and k. The pivot row and column define Uk and Lk, respectively. The
L-children, U-children, and LU-children of node k are now located in L-child list k
and U-child list k.

2. Perform degree update, and try to extend frontal matrix to include near super-rows
and super-columns (rows and columns whose patterns can be considered identical with
the first pivot row and column). The search is limited to pivot entries lying in the
current element. It "looks upward" because it considers the amalgamation of a single
child k with one or more of its (potential) parents. Perform edge reduction, and find
the effective children of node k.

3. Allocate space for the numerical values of Ek, performing garbage collection if neces-
sary.

4. Assemble the children of node k into Ek. Deallocate any elements whose contributions
have been completely assembled. Let n_row and n_col be the number of rows and columns
in the super-row i and super-column j, and let g_k = min(n_row, n_col).

5. Perform up to g_k steps of numerical factorization within the front. Set g_k to the number
of pivots actually found.

6. Update the L-child and U-child lists. Increment k by g_k and repeat until the matrix is
factorized.









7.4 The AFdown algorithm

The upward-looking algorithm can sometimes lead to excessive fill-in due to its limited pivot
search. The downward-looking approach (AFdown) can decrease fill-in by replacing the local
search with a global search that looks at the entire active submatrix. The method constructs
a set S of unfactorized frontal matrices, each of which is a candidate for amalgamation.
These frontal matrices have been symbolically factorized, but not numerically factorized.
The method "looks downward" because it considers the amalgamation of a single node f
(corresponding to the latest pivot entry) with one or more of its unfactorized children (in
S). The pivot search finds a single pivot entry, corresponding to a new proposed node f in
the partially constructed graph. This node is amalgamated with its unfactorized children
and placed in S, unless doing so would cause excessive fill-in, in which case the children that
caused amalgamation to fail are factorized and removed from S.
It is interesting to note that the pivot blocks of the nodes in S form a block diagonal
matrix. Thus, the frontal matrices in S can be factorized in parallel.
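As a toy illustration of this independence (not the report's code; an unblocked dense LU
stands in for the factorization of each front's pivot block):

    import numpy as np

    def lu_in_place(B):
        # Unblocked dense LU without pivoting on one small pivot block.
        for p in range(B.shape[0] - 1):
            B[p+1:, p] /= B[p, p]
            B[p+1:, p+1:] -= np.outer(B[p+1:, p], B[p, p+1:])
        return B

    def factor_candidate_set(S):
        # The pivot blocks of the fronts in S are mutually disjoint (block
        # diagonal), so the iterations are independent and could run in parallel.
        for block in S:
            lu_in_place(block)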
If a proposed node f has one or more unfactorized children, the numerical test is only
an estimate, since the numerical values of the unfactorized children are not available. The
numerical criterion is also checked during the numerical factorization, and unacceptable
pivots are rejected.

7.5 The AFstack algorithm

Both the AFup and AFdown algorithms make only an estimate of the numerical acceptability
of additional pivots (2 through g_k) in a frontal matrix Ek. These pivots may then be delayed,
since the numerical factorization of the frontal matrix performs the accurate numerical test
after all g_k pivots in Ek have been selected. Delayed pivots cause additional fill-in and waste
the work performed in combining those pivots into Ek. The third approach (the AFstack
algorithm) is more suitable for matrices that experience excessive delayed pivoting in the
AFup and AFdown algorithms. It is essentially the same as the AFup algorithm, except
that the frontal matrix Ek for the new node k is allocated on a stack, so that it can "grow"
while pivots are being included into node k. These additional pivot rows and columns are
factorized as they are included, so that an accurate numerical check can be combined with
the symbolic (fill-reducing) pivoting test. No pivots are delayed.
The AFstack algorithm typically out-performs the other two (see Section 8), and is thus
presented in more detail in the following pseudo-code. Comments are enclosed in curly-
brackets. Let m = |Lk| and n = |Uk| for the current m-by-n frontal matrix Ek. The number
of factorized pivots contained in Ek is g_k (enumerated 1 through g_k). Garbage collection is
performed as necessary. For simplicity, assume all variables are global.










procedure AFstack
    k <- 1
    nb <- block size for tradeoff between Level-2 and Level-3 BLAS
    while (k <= size of sparse matrix A) do
        { create and factorize frontal matrix Ek }
        call find_first_pivot_in_frontal_matrix
        call extend_frontal_matrix
        call factor_frontal_matrix
        k <- k + g_k
    endwhile { factorizing }
end procedure { AFstack }



procedure find_first_pivot_in_frontal_matrix
    global pivot search { finds a_{ij}, the k-th pivot in the factorization of A }
    Lk <- Struct(A_{*j}^{[k]}) { row pattern of Ek }
    Uk <- Struct(A_{i*}^{[k]}) { column pattern of Ek }
    m <- |Lk| { Ek is m-by-n }
    n <- |Uk|
    Lsame <- false { Lsame is true if the latest pivot does not change Lk }
    Usame <- false { Usame is true if the latest pivot does not change Uk }
    allocate frontal matrix Ek on a stack at the top of allocatable memory
    Ek <- 0
    { An initial frontal matrix of size m-by-n is allowed to grow as large as }
    { (Gm)-by-(Gn), where G is a user-settable parameter controlling the tradeoff }
    { between fill-in and amalgamation (typically 2 to 3). }
    k0 <- 0 { Dk has been updated with the first k0 pivots in Ek }
    g_k <- 1 { number of pivots in Ek }
end procedure { find_first_pivot_in_frontal_matrix }










procedure extend_frontal_matrix
    { Local pivot search to extend the frontal matrix, which currently has just one pivot }
    while (extending the current frontal matrix, Ek) do
        do approximate degree update, and assemble from previous frontal matrices
        delete edges in G that correspond to assembled contributions
        { some portions of update/assembly are skipped, depending on Lsame and Usame }
        if (not Lsame) then
            attempt to find a candidate pivot column, j ∈ Uk',
                for the (g_k + 1)-st pivot of Ek
            { based on column degree only. No scan of the column is done. }
        endif
        Lsame <- true { assembly and degree update completed for current size of Ek }
        Usame <- true
        while (Lsame and Usame) do
            scale current pivot column using Level-1 BLAS { column g_k of Ek }
            { At this point, g_k pivots in Ek have been successfully factorized }
            if (g_k mod nb = 0) then
                update Dk with pivots k0 + 1 to g_k in Ek
                { Level-3 BLAS matrix multiply, rank-nb update }
                k0 <- g_k
            endif
            g <- g_k + 1 { g is the leftmost column of Ek to be updated in factor_frontal_matrix }
            if (no acceptable candidate pivot column, j, for (g_k + 1)-st pivot of Ek) then
                return { amalgamation of node k has proceeded as far as possible }
            endif
            swap candidate column jc with column g_k + 1 of Ek
            { where column jc of Ek holds the contribution of Ek to column j of A }
            update pivot column g_k + 1 with previous pivots k0 + 1 to g_k { Level-2 BLAS }
            g <- g_k + 2 { g is the leftmost column of Ek to be updated in factor_frontal_matrix }
            attempt to find a pivot row, i ∈ Lk', such that a_{ij}^{[k]} is an acceptable pivot
            { pivot row search is both symbolic (Equation 17) and numerical (Equation 16) }
            if (no acceptable candidate pivot row, i, for (g_k + 1)-st pivot of Ek) then
                return { amalgamation of node k has proceeded as far as possible }
            endif
            swap pivot row ic with row g_k + 1 of Ek
            { where row ic of Ek holds the contribution of Ek to row i of A }
            { entry a_{ij} is the (g_k + 1)-st pivot in Ek, }
            { and the (k + g_k)-th pivot in the factorization of A }
            update pivot row g_k + 1 with previous pivots k0 + 1 to g_k { Level-2 BLAS }
            { the pivot row update excludes the pivot itself }
            Lsame <- (Lk = Lk ∪ Struct(A_{*j}^{[k]}))
            if (Lsame) then
                { new pivot column adds no new rows to Lk, do cheap degree update }
                decrement upper and lower bounds of degrees of each column j ∈ Uk'
                find a good candidate pivot column j ∈ Uk' for next iteration, g_k + 2
                { based on column degree only. No scan of the column is done. }
            endif
            Usame <- (Uk = Uk ∪ Struct(A_{i*}^{[k]}))
            if (Usame) then
                { new pivot row adds no new columns to Uk, do cheap degree update }
                decrement upper and lower bounds of degrees of each row i ∈ Lk'
            endif
            Lk <- Lk ∪ Struct(A_{*j}^{[k]})
            Uk <- Uk ∪ Struct(A_{i*}^{[k]})
            g_k <- g_k + 1
        endwhile { finding pivots without extending frontal matrix }
        zero the newly extended portions of Ek, on the stack
    endwhile { finding pivots in local pivot search }
end procedure { extend_frontal_matrix }


procedure factor_frontal_matrix
    { The current m-by-n frontal matrix Ek has been extended to include g_k pivots:

          Ek = [ L\U   U' ]
               [ L'    Dk ]

      where L\U is g_k-by-g_k, U' is g_k-by-(n - g_k), L' is (m - g_k)-by-g_k, and Dk
      is (m - g_k)-by-(n - g_k). The contribution block Dk has been updated for pivots
      1 through k0. It is not updated for pivots k0 + 1 to g_k, except perhaps for
      the first column of Dk (depending on g). }
    if (g_k ≠ k0) then
        update Dk with pivots k0 + 1 to g_k { starting at column g of Ek }
        { Level-3 BLAS matrix multiply, rank-b update, where 1 <= b <= nb }
    endif
    save the pivot rows and columns in the LU arrowhead { compress on the stack }
    copy Dk to a new location allocated from the tail end of memory
    { Dk will be deallocated when fully assembled into subsequent frontal matrices }
    make an explicit list of the children of node k, and place in LU data structure
    { These are inactive edges in G, which may be used in a factor-only algorithm. }
    add active L-edges {(k, row i) : i ∈ Lk'} to E_L
    add active U-edges {(k, column j) : j ∈ Uk'} to E_U
end procedure { factor_frontal_matrix }
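The deferred rank-b update at the heart of this procedure can be sketched in a few lines (a
numpy sketch under assumed storage, not the report's code: the front F holds the pivot block
as L\U in its leading g_k rows and columns):

    import numpy as np

    def blocked_schur_update(F, g_k, k0):
        # Apply pivots k0+1..g_k of the front to its contribution block Dk.
        L21 = F[g_k:, k0:g_k]        # rows of L below the new pivots
        U12 = F[k0:g_k, g_k:]        # rows of U to the right of the new pivots
        F[g_k:, g_k:] -= L21 @ U12   # one rank-(g_k - k0) Level-3 BLAS update
        return F

With nb = 16, at most sixteen pivots accumulate between such updates, so the bulk of the
arithmetic is done in matrix-matrix multiplies rather than in the Level-2 updates of single
candidate rows and columns.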









The AFstack algorithm has available to it the numerical values of the current frontal
matrix, Ek, when it performs degree update and edge reduction. The degree update,
edge reductions, and numerical assembly can all be done in a single phase (procedure
extend_frontal_matrix). The AFup and AFdown algorithms split this work into two separate
symbolic and numerical phases (for each frontal matrix), resulting in a duplication of work.
Note that in AFstack, the assembly phase is skipped and the degree update phase is simplified
if the new pivot does not extend the pattern of the frontal matrix.
The local pivot search attempts to find a candidate pivot column j based purely on
symbolic (sparsity-preserving) considerations. Once it is located, the AFstack algorithm
updates the candidate column with previous pivots (a Level-2 BLAS matrix-vector multiply),
and then searches the column j for an acceptable pivot, finding the row i with lowest upper
bound degree (upper(r_i)) such that a_{ij} is numerically acceptable. No delayed pivots occur
because the algorithm checks the numerical acceptability of candidate pivots as it selects
them. Only one candidate column is updated and searched, since this could be an expensive
operation. Examining all entries in Ek, for instance, would require an immediate Level-2
BLAS update of the entire contribution block. This would defeat the use of the Level-3
BLAS in the extend_frontal_matrix procedure (although the Level-3 BLAS would still be
used in the factor_frontal_matrix procedure).
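A sketch of this candidate-column search (Python with illustrative names; the threshold test
here stands in for the report's Equation 16):

    import numpy as np

    def search_candidate_column(col, upper_bounds, threshold=0.1):
        # Return the row with smallest upper-bound degree whose entry passes
        # a threshold test: |col[i]| >= threshold * max over the column.
        limit = threshold * np.max(np.abs(col))
        for i in np.argsort(upper_bounds):   # rows in order of degree bound
            if abs(col[i]) >= limit:
                return i
        return None                          # no acceptable pivot in this column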
If a pivot is found, the candidate row is updated with a matrix-vector multiply. As soon
as sufficient pivots have accumulated (nb), the entire contribution block is updated with a
matrix-multiply operation (a rank-nb update). Thus, the Level-2 BLAS updates of individual
candidate rows and columns require no more than a rank-nb matrix-vector multiply. The
block-size parameter nb can be adjusted depending on the relative speed of the Level-3 BLAS
versus the Level-2 BLAS on a given computer. On the Cray YMP, with nb = 16, most of
the floating-point operations are performed using Level-3 BLAS subroutines (the exact
fraction depends on the sparsity of the matrix).


7.6 Summary of the analysis-factor algorithm

Each version of the analysis-factor algorithm is based on the assembly graph, G, that guides
the pivot search, the construction of new elements, the assembly process, the detection
of super-rows and super-columns, and the degree update. Edge reductions and dynamic
amalgamation keep this graph pruned. Dynamic amalgamation allows the algorithms to
take advantage of the Level-3 BLAS. The analysis-factor algorithms do not suffer from the
disruptions in the graphs or in the patterns of L and U caused by numerical pivoting in the
factor-only algorithm. Parallelism is not yet addressed; this and other issues are considered
in Section 9, which presents open problems and future work. The following section compares
the performance of the three sequential versions of the analysis-factor algorithm with that
of the classical multifrontal method (Mups), the D2 algorithm, and the MA28 algorithm.









8 Performance results


We have developed three prototype sequential versions of the unsymmetric-pattern multi-
frontal method, and have tested them extensively on a Cray YMP. In this section, we com-
pare their performance with the MA28 algorithm [14], sequential versions of the classical
multifrontal method (Mups) [2], and the D2 algorithm [5].
Table 1 summarizes our results [3] for eighty-six matrices from the Harwell/Boeing collec-
tion [12, 13] and other sources. Twenty-seven of these are symmetric positive-definite. Only
matrices of order 500 or larger were considered. The Z matrices are chemical engineering
problems from S. Zitney and others [29] (Z/m2 is from a PDE). The Hm matrices are circuit
simulation matrices from S. Hamm (Motorola). Table 1 lists the results for these matrices
obtained on a Cray YMP-8/128, sorted by asymmetry (and by order if tied). Each line lists
the matrix number, name, order, number of nonzeros, and asymmetry. The asymmetry, s,
is the number of unmatched off-diagonal pairs over the total number of off-diagonal entries
(0 is a symmetric pattern, 1 is completely asymmetric). The run time includes both the
analysis and factorize time, in seconds, and is listed for the AFstack, AFdown, AFup, Mups,
D2, and MA28 algorithms. The fastest run time is shown in bold.
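The asymmetry statistic can be computed directly from the nonzero pattern; a minimal
sketch (assuming the pattern is given as a set of (i, j) index pairs):

    def asymmetry(pattern):
        # s = unmatched off-diagonal entries / all off-diagonal entries
        offdiag = {(i, j) for (i, j) in pattern if i != j}
        if not offdiag:
            return 0.0
        unmatched = sum(1 for (i, j) in offdiag if (j, i) not in offdiag)
        return unmatched / len(offdiag)

    # asymmetry({(0, 1), (1, 0)}) == 0.0 (symmetric pattern)
    # asymmetry({(0, 1), (1, 2)}) == 1.0 (completely asymmetric)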
Of the three versions of the unsymmetric-pattern multifrontal method, the AFstack al-
gorithm typically out-performs the other two. It is faster than both AFdown and AFup for
63 out of 86 matrices. When AFstack is slower than the other two algorithms, its run time is
no more than 1.3 times the run time of the faster of AFdown and AFup. The sum of the run
times of AFstack, AFdown, and AFup for all 86 matrices is 96.8, 154.0, and 161.3 seconds,
respectively, whereas the sum of the best run time for each matrix (for these three algorithms)
is 94.0 seconds. The AFstack algorithm is clearly superior to AFdown and AFup.
Figures 7 through 9 show the normalized run time of AFstack, Mups, D2, and MA28,
for all 86 matrices. The normalized run time is the run time divided by the fastest run time
found for that particular matrix. Thus the fastest method would have a normalized run time
of 1.0.
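As a one-line sketch of this normalization (illustrative Python):

    def normalized_run_times(times):
        # times maps method name -> run time in seconds for one matrix
        fastest = min(times.values())
        return {method: t / fastest for (method, t) in times.items()}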
The new method (AFstack) is faster than Mups, D2 and MA28 for only 13 out of 86
matrices. However, these matrices include nearly all of the large unsymmetric matrices which
are not extremely sparse. The present release of the Harwell/Boeing collection is very weak
in this class [12, 13]. Also, the method demonstrates a consistent performance for the entire
range of matrices, as can be observed in Figures 7 through 9. It usually takes no more than
twice the time of Mups for symmetric-patterned matrices, and is even occasionally faster (for
bcsstk08 and the two Hm/add matrices). This is because it takes advantage of dense matrix
kernels, as Mups does. It is faster than Mups for most matrices with asymmetry greater
than 0.5 because it does not make the symmetric pattern assumption (the gemat matrices
are notable exceptions; they become nearly symmetric-patterned when permuted to obtain a
zero-free diagonal, as is done in Mups). The new method also avoids the worst-case behavior
of D2 and MA28 for symmetric matrices and for very large matrices with substantial dense
substructure (such as the large chemical engineering problems).
Table 2 presents detailed results for each method on nine representative matrices taken









from Table 1. The first section duplicates the matrix statistics from Table 1 (matrix names
are abbreviated), and also states whether or not the matrix is symmetric positive-definite.
The table then lists (1) the run time relative to the fastest method for each matrix, (2) the
number of nonzeros in the LU factors, (3) the number of floating point operations performed
during factorization, (4) the number of frontal matrices, (5) the number of pivots that were
delayed due to an unacceptable numerical value (there are no delayed pivots in AFstack),
(6) the number of edges in the assembly tree (T) for Mups and in the assembly dag (G) for
AFdown, AFup, and AFstack, and (7) the number of times an active edge in G is scanned
during the pivot search, assembly, and degree update phases in AFdown, AFup, and AFstack.
The sherman5 and mahindas matrices contain 1674 and 669 singletons, respectively,
which are 1-by-1 diagonal blocks arising from a permutation to upper-block-triangular form
[10]. These singletons are included in the count of frontal matrices for the AF algorithms.
The dags are unconnected for these matrices.
The number of delayed pivots in AFdown and AFup, although acceptable for many
matrices, can be quite high. A delayed pivot is selected on symbolic grounds and then
rejected on numerical grounds. The symbolic factorization then "retreats" by one step,
possibly selecting another pivot that will later be rejected. Thus, the number of delayed
pivots can actually exceed the size of the matrix, n, as is seen in the Z/m2 matrix for the
AFdown algorithm. The problem of delayed pivoting seems to be more acute for the larger
matrices in our test set.
The dag, G, found by the unsymmetric-pattern multifrontal method does not have
many more edges than the assembly tree for Mups (the number of edges in the tree is simply
the number of frontal matrices minus one). The number of times an active edge in G is
traversed is more critical to the performance of the new method than the number of edges
in G. Although this number is not guaranteed to be less than the number of floating-point
operations in the LU factorization, in practice it is much less than that (by three orders
of magnitude for the larger matrices). It is usually less than the number of nonzeros in
the LU factors (the exceptions are for very sparse matrices). This result demonstrates the
effectiveness of our edge reductions.
The Z/rdist1 matrix is a distillation column with 19 components and 100 stages [30].
Reactions occur in stages 35 to 70. Both Mups and AFstack can take advantage of its
unsymmetric block structure, which occurs within each stage. For example, of the 44.5
million floating-point operations in AFstack, only about 0.1 million are done in frontal
matrices with 1 or 2 pivots. The remaining 44.4 million operations are performed in only 116
frontal matrices (a typical one of which is 69-by-65 with 31 pivots). The largest frontal
matrix constructed by AFstack is 216-by-281, with 171 pivots. For this matrix, Mups is
directed to first find a maximum transversal [8], otherwise excessive fill-in is obtained. MA28
and D2 both find a poor pivot ordering for rdist1, when compared with Mups and the AF
algorithms. MA28 and D2 use only a global pivot search, and do not consider the overlap
of the fill-in of a previous pivot with the current pivot. Using only a global pivot search
destroys the block structure of the matrix and leads to excessive fill-in.
Table 3 compares the performance of a true degree update and an approximate degree









update (described in Section 7.2) in the AFdown algorithm, for five matrices selected from
Table 2. These results are on an Alliant FX/80, and are based on more appropriate values
of user-settable parameters (for controlling amalgamation, pivoting, etc.) than those used
on the Cray YMP. Thus, the amount of fill-in for AFdown reported in Table 2 differs from
that reported in Table 3. Both versions allow some controlled fill-in due to amalgamation.
The approximate degree update typically results in slightly higher fill-in, in exchange for a
drastic reduction in run time, when compared with a true degree update. In two of these
matrices (sherman5 and gre_ll07), the use of approximate degrees instead of true degrees
actually results in less fill-in and floating-point work.
Overall, these results show that the sequential prototypes of the unsymmetric-pattern
multifrontal method are competitive algorithms when compared with both the classical mul-
tifrontal approach (Mups) and algorithms based on more conventional sparse matrix data
structures (D2 and MA28).


9 Final remarks

The factor-only algorithm is "fragile" with respect to disruptions in the graphs and patterns
of L and U caused by numerical pivoting. This problem is addressed by the analysis-factor
algorithm. The disruptions are avoided by combining the numerical and symbolic phases
so that pivots can be selected on both sparsity preserving and numerical criteria. However,
the factor-only algorithm still forms an important part of a complete unsymmetric-pattern
multifrontal method. If multiple problems are to be solved that have similar pattern and
numerical characteristics (in solving nonlinear systems, for example), the pivot ordering of
the first matrix is often suitable for successive matrices. The analysis-factor algorithm would
factorize the first matrix and provide the pivot ordering to the factor-only algorithm, which
factorizes subsequent matrices. Few numerical problems would be expected in the factor-only
algorithm. However, a better handling of the disruptions caused by numerical pivoting is the
most important open problem facing the development of a practical factor-only algorithm.
Parallelism is not yet addressed in the sequential versions of the analysis-factor algorithm.
Some parallel work can take place within the dense matrix kernels, and while this is impor-
tant, it will not provide enough parallelism in general. A truly parallel version must take
advantage of parallelism across multiple frontal matrices. Parallelism can be incorporated
in one of several ways. The parallel pivot search of the D2 algorithm can be adapted to
this algorithm. The pivot search first creates an independent set of pivots (m, say). Each
factorization task takes a single pivot and extends it into a block of pivots via dynamic amal-
gamation. Since multiple tasks can affect a single row or column in the active submatrix,
these tasks either cooperate to update the degrees, or a separate parallel degree update phase
is employed. When all factorization tasks finish, a new set of independent pivots is found.
A second approach would pipeline the pivot search with the numerical factorization. The
pivot search task (with one processor or a set of processors) searches for pivots that are in-
dependent with the pivots of currently-executing factorization tasks. A task k is created for
each pivot (multiple processors can also be used within each task). Task k creates the frontal













Table 1: Matrix statistics and run time
Matrix Statistics Run Time (in seconds)
name n nz s AFstack AFdown AFup Mups D2 MA28
1 662_bus 662 2474 0 0.087 0.139 0.103 0.072 0.044 0.066
2 nos6 675 3255 0 0.114 0.172 0.153 0.102 0.145 0.171
3 685_bus 685 3249 0 0.100 0.158 0.111 ? 0.082 0.094
4 nos7 729 4617 0 0.218 0.281 ? 0.186 0.341 0.997
5 bcsstk19 817 6853 0 0.155 0.216 0.175 0.113 ? 0.210
6 orsirr_2 886 5970 0 0.244 0.327 0.424 0.224 0.379 0.741
7 gr_30_30 900 7744 0 0.231 0.287 0.275 0.154 0.502 0.552
8 nos2 957 4137 0 0.078 0.140 0.086 0.071 0.134 0.080
9 nos3 960 15844 0 0.311 0.411 0.397 0.194 1.201 1.187
10 pde_9511 961 ? 0 0.183 0.361 0.246 0.152 0.233 0.319
11 sherman1 1000 3750 0 0.162 0.270 0.204 0.148 0.174 0.303
12 orsirr_1 1030 6858 0 ? ? 0.537 0.266 0.492 1.019
13 bcsstk08 1074 12960 0 0.474 0.845 0.636 0.894 0.775 1.604
14 bcsstk09 1083 18437 0 0.501 0.542 0.736 0.417 ? 3.335
15 bcsstk10 1086 22070 0 ? 0.374 0.329 0.207 1.254 0.833
16 bcsstm10 1086 22092 0 0.259 0.349 0.302 0.208 1.363 0.956
17 sherman4 1104 3786 0 0.143 0.244 0.187 0.172 0.165 0.214
18 1138_bus 1138 4054 0 0.129 ? 0.140 0.113 0.062 0.087
19 bcsstk27 1224 56126 0 0.518 0.514 0.555 0.422 4.010 3.135
20 bcsstm27 1224 56126 0 0.537 0.528 0.572 0.422 5.793 3.136
21 bcsstk11 1473 34241 0 0.563 0.655 0.659 0.328 3.467 2.039
22 bcsstk12 1473 34241 0 0.563 0.657 0.659 0.328 3.459 2.039
23 bcsstm12 1473 19659 0 0.330 0.489 0.470 0.251 1.541 0.961
24 bcsstk14 1806 63454 0 1.441 1.944 1.114 0.637 ? 6.516
25 bcsstk26 1922 30336 0 0.597 0.696 0.736 0.357 1.934 2.291
26 orsreg_1 2205 14133 0 0.754 0.911 2.099 0.705 ? 4.041
27 Hm/add20 2395 13151 0 0.293 0.602 0.339 0.390 0.179 ?
28 bcsstk23 3134 45178 0 5.319 8.195 7.562 2.279 32.519 ?
29 bcsstk24 3562 159910 0 2.510 ? 2.955 1.420 ? 66.472
30 saylr4 3564 22316 0 1.322 1.618 3.270 1.085 13.138 ?
31 bcsstk21 3600 26600 0 1.324 1.740 2.114 1.131 13.682 5.375
32 bcsstk15 3948 117816 0 12.114 13.907 10.874 3.447 96.093 108.980
33 bcsstk28 4410 219024 0 2.770 2.670 4.014 1.829 ? ?
34 bcsstk16 4884 290378 0 ? ? 13.163 3.404 153.726 76.674
35 Hm/add32 4960 19848 0 0.500 0.917 0.499 0.651 0.266 0.378
36 sherman3 5005 20033 0 1.237 2.782 2.361 1.372 5.508 6.157
37 bcsstk18 11948 149090 0 ? 30.794 ? 4.870 168.935 393.710





















Table 1 continued. Matrix statistics and run time.


Matrix Statistics Run Time (in seconds)
name n nz s AFstack AFdown AFup Mups D2 MA28
38 hwatt_1 1856 11360 0.013 0.595 1.578 1.308 0.487 2.579 3.210
39 hwatt_2 1856 11550 0.020 0.581 1.370 1.302 0.534 2.623 3.306
40 Hm/memplus 17758 99147 0.021 ? 6.787 3.510 5.612 1.305 2.563
41 jpwh_991 991 6027 0.064 0.294 0.500 ? 0.242 ? 1.372
42 lns_3937 3937 25407 0.150 3.073 4.7 3.358 1.062 6.075 19.277
43 lnsp_3937 3937 25407 0.150 ? 5.224 3.311 1.053 7.527 17.483
44 nnc666 666 4032 0.179 0.170 ? 0.182 0.103 0.152 0.377
45 lns_511 511 2796 0.201 0.124 0.173 0.156 0.091 0.081 0.189
46 lnsp_511 511 2796 0.201 0.132 0.177 0.163 0.090 0.072 0.206
47 pores_3 532 3474 0.258 0.106 0.163 0.117 0.067 0.124 0.116
48 sherman5 3312 20793 0.261 0.957 1.314 1.085 0.947 2.199 5.434
49 mc_fe 765 24382 0.301 0.338 0.560 0.387 0.440 0.723 ?
50 fs_541_4 541 4273 0.317 0.134 0.180 0.170 0.220 0.123 0.228
51 fs_541_1 541 4285 0.317 0.132 ? 0.158 0.220 0.130 0.196
52 fs_541_2 541 4285 0.317 0.128 0.181 0.172 0.220 0.148 0.228
53 fs_541_3 541 4285 0.317 0.134 ? 0.173 0.220 0.158 0.227
54 sherman2 1080 23094 0.329 0.805 0.918 0.697 0.449 1.397 8.262
55 fs_760_1 760 5739 ? 0.329 0.336 0.370 0.176 0.218 1.036
56 fs_760_3 760 5739 ? 0.327 0.336 0.370 0.176 0.218 1.036
57 pores_2 1224 9613 ? 0.354 0.525 0.536 ? 0.375 1.107
58 fs_680_1 680 2471 ? 0.058 0.081 0.065 0.085 0.049 0.059
59 fs_680_3 680 2424 0.448 0.058 0.079 0.065 0.085 0.053 0.063
60 steam2 600 5660 0.451 0.175 0.279 ? 0.139 0.243 0.565




















Table 1 continued. Matrix statistics and run time.


Matrix Statistics Run Time (in seconds)
name n nz s AFstack AFdown AFup Mups D2 MA28


61 Z/m2 8641 102449 0.508 13.596 31.595 14.975 33.691 93.621 ?
62 Z/rdist3a 2398 61896 0.860 0.648 1.970 1.404 1.606 9.864 66.584
63 Z/rdist1 4134 94408 0.941 1.625 3.089 1.811 2.506 33.013 267.308
64 Z/radfr1 1048 13299 0.946 0.221 0.373 0.240 0.252 0.787 2.668
65 Z/rdist2 3198 56934 0.954 ? ? 0.861 1.252 7.504 117.395
66 west0989 989 3518 ? 0.152 0.174 0.130 0.142 0.064 0.075
67 mahindas 1258 7682 ? 0.223 0.381 0.218 ? 0.096 0.156
68 bp_1600 822 4841 ? 0.105 0.123 0.104 ? ? 0.082
69 bp_1400 822 4790 0.990 0.116 0.136 0.119 0.275 0.089 0.087
70 bp_1000 822 4661 0.991 0.097 0.117 0.095 ? 0.079 0.073
71 bp_1200 822 4726 0.991 0.099 0.119 0.100 0.278 ? 0.080
72 bp_0 822 3276 0.991 0.024 0.020 0.020 0.157 0.031 0.025
73 bp_200 822 3802 0.994 0.046 0.048 0.042 0.204 0.047 0.039
74 bp_600 822 4172 0.994 0.076 0.087 0.074 0.244 0.069 0.059
75 west0655 655 2854 0.994 0.108 0.151 0.114 0.131 0.058 0.082
76 bp_400 822 4028 0.995 0.069 0.074 0.063 0.222 0.058 0.050
77 bp_800 822 4534 0.995 0.091 0.106 0.091 0.261 0.074 0.065
78 west2021 2021 7310 0.997 0.324 0.379 0.275 0.320 0.181 0.162
79 gemat12 4929 33044 0.999 0.606 1.016 0.771 0.549 1.245 0.831
80 gemat11 4929 33108 0.999 0.597 0.919 0.716 0.520 0.961 0.756
81 west1505 1505 5414 0.999 0.234 0.277 0.204 0.220 0.117 0.120
82 gre_512 512 1976 1.000 0.140 0.209 0.182 ? 0.064 ?
83 shl_0 663 1687 1.000 0.016 0.013 0.013 0.118 0.019 0.016
84 shl_200 663 1726 1.000 0.017 0.014 0.014 0.126 0.019 0.016
85 shl_400 663 1712 1.000 0.017 0.014 0.014 ? 0.020 0.016
86 gre_1107 1107 5664 1.000 0.417 0.628 0.573 0.518 0.294 1.146














































Figure 7: Relative performance: AFstack (solid) and Mups (dotted), versus matrix number
















































Figure 8: Relative performance: AFstack (solid) and D2 (dotted), versus matrix number
















































Figure 9: Relative performance: AFstack (solid) and MA28 (dotted), versus matrix number











Table 2: Detailed results for nine representative matrices
                               Matrix
matrix number        29      37      42      48      61      63      67      80      86
name                b24     b18     lns   sher5      m2  rdist1     mah    ge11     gre
sym. pos. def.?     yes     yes      no      no      no      no      no      no      no
order              3562   11948    3937    3312    8641    4134    1258    4929    1107
nonzeros         159910  149090   25407   20793  102449   94408    7682   33108    5664
asymmetry             0       0   0.150   0.261   0.508   0.941       ?   0.999   1.000
Rel. run time
MA28               46.8    80.8    18.1     5.7    26.7   164.5     1.6     1.5     3.9
D2                 26.0    34.7     5.7     2.3     6.9    20.3       1     1.8       1
Mups                  1       1       1       1     2.5     1.5     8.6       1     1.8
AFdown              1.6     6.3     4.5     1.4     2.3     1.9     4.0     1.8     2.1
AFup                2.1    12.4     3.1     1.1     1.1     1.1     2.3     1.3     1.9
AFstack             1.8     3.5     2.9     1.0       1       1     2.3     1.1     1.4
Nz in LU (10^3)
MA28              990.0  2910.7   427.1   158.0       ?  2280.2    10.1    50.1    47.0
D2               1699.4       ?   847.3   316.3  4966.7  2431.0    11.2    72.3    71.0
Mups              592.4       ?   332.7   187.0       ?   279.4    50.3    79.4   187.6
AFdown                ?  5123.3  1060.8   269.1       ?   624.0    16.1    92.9   132.8
AFup             1253.8       ?   937.7   260.6       ?   564.5    19.3   103.6   133.0
AFstack          1053.0  4733.0   543.1       ?       ?   532.5    13.2    99.4    94.4
Flops (10^6)
Mups               72.2   347.5    27.2    16.8   938.7    10.2    2.80    0.83    25.0
AFdown            141.4  4725.4       ?    37.5  5301.9    49.8    0.39    1.18    21.7
AFup              325.6  9412.1   248.3    34.4       ?    40.7    0.38    1.65    15.1
AFstack           181.5  2335.4   100.9    43.6  1636.1    24.5    0.16    1.08     5.7
Frontal mat.
Mups                  ?    3209    1521    2226    1860     579     508       ?     342
AFdown              219    2130    1207    2057    1782     302    1027     641       ?
AFup                224       ?    1248    2243       ?     449     822    1041     243
AFstack             189       ?    2213    2151       ?     544    1016    1144     334
Delayed pivots
Mups                  0       0    1211       0       0      14       4       0     270
AFdown               51    3678    1339     177   11078       ?       0       6       8
AFup                755    1225     453     190    1027     309       3      42      50
Edges in T, G
Mups                  ?    3208    1520    2225    1859     578     507       ?     341
AFdown              244       ?    3341     841       ?     541     616     692    1093
AFup                356    4479    3120       ?       ?     311     799     760       ?
AFstack               ?       ?    6300       ?       ?     705     818     892     929
Edge scans (10^3)
AFdown            168.2       ?   965.4   232.5  2146.8   280.7   186.5    86.5   159.7
AFup              183.6  1681.9   559.2   194.3  1440.8   132.2    39.7    50.8   111.0
AFstack           131.7  2219.4       ?   140.8       ?   212.1    78.1    39.1   102.2










Table 3: True versus approximate degree update (AFdown on Alliant FX/80)
                               Matrix
matrix number          42         48         67         80         86
name             lns_3937   sherman5   mahindas    gemat11   gre_1107
order                3937       3312       1258       4929       1107
nonzeros            25407      20793       7682      33108       5664
asymmetry           0.150      0.261          ?      0.999      1.000
Run time (seconds)
True                    ?       66.9        8.5       15.5       29.9
Approximate         172.5       30.3        5.7       14.6       12.1
Nz in LU (10^3)
True                597.6      210.3       22.6       72.9       78.9
Approximate         611.7      201.8       23.9       76.6       72.4
Flops (10^6)
True                102.1       25.8       0.31       0.71        7.4
Approximate         117.9       22.9       0.37       0.85        4.9

matrix Ek, performs the assembly, factorizes it, and either cooperates with other tasks to
perform the degree update or requests the pivot search task to perform the degree update.
When it completes, it signals the pivot search task that the rows and columns it affected (Lk'
and Uk', respectively) are now candidates for the pivot search. In both approaches, multiple
factorizations of the associated frontal matrices are done in parallel.


10 Acknowledgments

We would like to thank Patrick Amestoy, Michel Daydé, and Mario Arioli at CERFACS,
and Steve Zitney at Cray Research, Inc. for many helpful discussions. Steve Zitney at Cray
and Steve Hamm at Motorola provided us with several large unsymmetric matrices, a class
of matrices that is weak in the present version of the Harwell/Boeing collection [12, 13].
Portions of this work were supported by a post-doctoral grant from CERFACS, September
1989 to December 1990. Support for this project was also provided by the National Science
Foundation (ASC-9111263), and by Cray Research, Inc. and Florida State University through
the allocation of supercomputer resources.









References


[1] A. V. Aho, M. R. Garey, and J. D. Ullman. The transitive reduction of a directed graph.
SIAM J. Comput., 1:131-137, 1972.

[2] P. R. Amestoy and I. S. Duff. Vectorization of a multiprocessor multifrontal code. Int.
J. Supercomputer Appl., 3(3):41-59, 1989.

[3] T. A. Davis. Performance of an unsymmetric-pattern multifrontal method for sparse
LU factorization. Technical Report TR-92-014, Comp. and Info. Sci. Dept., Univ. of
Florida (anonymous ftp to cis.ufl.edu:cis/tech-reports/tr92/tr92-014.ps.Z), Gainesville,
FL, May 1992.

[4] T. A. Davis and I. S. Duff. Unsymmetric-pattern multifrontal methods for parallel sparse
LU factorization. Technical Report TR-91-023, CIS Dept., Univ. of Florida (anonymous
ftp to cis.ufl.edu:cis/tech-reports/tr91/tr91-023.ps.Z), Gainesville, FL, 1991.

[5] T. A. Davis and P. C. Yew. A nondeterministic parallel algorithm for general unsym-
metric sparse LU factorization. SIAM J. Matrix Anal. Appl., 11(3):383-402, 1990.

[6] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level-3 basic linear
algebra subprograms. ACM Trans. Math. Softw., 16(1):1-17, 1990.

[7] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of
Fortran basic linear algebra subprograms. ACM Trans. Math. Softw., 14:1-32, 1988.

[8] I. S. Duff. On algorithms for obtaining a maximum transversal. ACM Transactions on
Mathematical Software, 7:315-330, 1981.

[9] I. S. Duff. Parallel implementation of multifrontal schemes. Parallel Computing, 3:193-
204, 1986.

[10] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. London:
Oxford Univ. Press, 1986.

[11] I. S. Duff, N. I. M. Gould, J. K. Reid, J. A. Scott, and K. Turner. The factorization of
sparse symmetric indefinite matrices. IMA Journal of Numerical Analysis, 11:181-204,
1991.

[12] I. S. Duff, R. G. Grimes, and J. G. Lewis. Sparse matrix test problems. ACM Trans.
Math. Softw., 15:1-14, 1989.

[13] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' guide for the Harwell-Boeing sparse
matrix collection. Technical Report TR/PA/92/86, CERFACS, Toulouse, France, Oct.
1992.









[14] I. S. Duff and J. K. Reid. Some design features of a sparse matrix code. ACM Trans.
Math. Softw., 5(1):18-35, 1979.

[15] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear
equations. ACM Trans. Math. Softw., 9(3):302-325, 1983.

[16] I. S. Duff and J. K. Reid. The multifrontal solution of unsymmetric sets of linear
equations. SIAM J. Sci. Statist. Comput., 5(3):633-641, 1984.

[17] I. S. Duff and J. K. Reid. MA47, a Fortran code for direct solution of indefinite sparse
symmetric linear systems. to appear, 1993.

[18] S. C. Eisenstat and J. W. H. Liu. Exploiting structural symmetry in unsymmetric sparse
symbolic factorization. SIAM J. Matrix Anal. Appl., 13(1):202-211, 1992.

[19] A. George, M. T. Heath, J. W. H. Liu, and E. Ng. Solution of sparse positive def-
inite systems on a shared-memory multiprocessor. International Journal of Parallel
Programming, 15(4):309-325, 1986.

[20] A. George and J. W. H. Liu. The evolution of the minimum degree ordering algorithm.
SIAM Review, 31(1):1-19, 1989.

[21] A. George and E. Ng. Parallel sparse gaussian elimination with partial pivoting. Annals
of Operation Research, 22:219-240, 1990.

[22] J. R. Gilbert. An efficient parallel sparse partial pivoting algorithm. Technical Report
88/45052-1, Center for Computer Science, Chr. Michelsen Institute, Bergen, Norway,
1988.

[23] J. R. Gilbert and J. W. H. Liu. Elimination structures for unsymmetric sparse LU
factors. Technical Report CS-90-11, Dept. of Computer Sci., York Univ., North York,
Ontario, Feb. 1990.

[24] S. Hadfield and T. A. Davis. Analysis of potential parallel implementations of the
unsymmetric-pattern multifrontal method for sparse LU factorization. Technical Re-
port TR-92-017, CIS Dept., Univ. of Florida (anonymous ftp to cis.ufl.edu:cis/tech-
reports/tr92/tr92-017.ps.Z), Gainesville, FL, June 1992.

[25] M. T. Heath, E. Ng, and B. W. Peyton. Parallel algorithms for sparse linear systems.
SIAM Review, 33(3):420-460, 1991.

[26] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix
Anal. Appl., 11(1):134-172, 1990.

[27] H. M. Markowitz. The elimination form of the inverse and its application to linear
programming. Management Science, 3:255-269, Apr. 1957.











[28] P. Matstoms. Sparse QR factorization in MATLAB. Technical Report LiTH-MAT-R-
1992-05, Dept. of Mathematics, Linköping Univ., Linköping, Sweden, March 1992.

[29] S. E. Zitney and M. A. Stadtherr. A frontal algorithm for equation-based chemical
process flowsheeting on vector and parallel computers. In Proc. AIChE Annual Meeting,
Washington, DC, 1988.

[30] S. E. Zitney. A frontal code for ASPEN PLUS on advanced architecture computers. In
American Inst. of Chemical Eng. Annual Meeting, Symposium on Parallel Computing,
Nov. 1990.

[31] Z. Zlatev, J. Wasniewski, and K. Schaumburg. Y12M: Solution of Large and Sparse
Systems of Linear Algebraic Equations, Lecture Notes in Computer Science 121. Berlin:
Springer-Verlag, 1981.



