AN UNSYMMETRIC-PATTERN MULTIFRONTAL METHOD FOR
SPARSE LU FACTORIZATION

TIMOTHY A. DAVIS* AND IAIN S. DUFF†

Computer and Information Sciences Dept., University of Florida,
Technical Report TR-94-038, November, 1994.
Abstract. Sparse matrix factorization algorithms for general problems are typically
characterized by irregular memory access patterns that limit their performance on parallel-vector
supercomputers. For symmetric problems, methods such as the multifrontal method avoid indirect
addressing in the innermost loops by using dense matrix kernels. However, no efficient LU
factorization algorithm based primarily on dense matrix kernels exists for matrices whose pattern is
very unsymmetric. We address this deficiency and present a new unsymmetric-pattern multifrontal
method based on dense matrix kernels. As in the classical multifrontal method, advantage is taken
of repetitive structure in the matrix by factorizing more than one pivot in each frontal matrix, thus
enabling the use of Level 2 and Level 3 BLAS. The performance is compared with the classical
multifrontal method and other unsymmetric solvers on a CRAY YMP.

Key words. LU factorization, unsymmetric sparse matrices, multifrontal methods

AMS subject classifications. 65F50 (Sparse matrices), 65F05 (Direct methods for linear
systems and matrix inversion), 65Y15 (Packaged methods), 65-04 (Explicit machine computation
and programs).

1. Introduction. Conventional sparse matrix factorization algorithms for
general problems rely heavily on indirect addressing. This gives them an irregular
memory access pattern that limits their performance on typical parallel-vector
supercomputers and on cache-based RISC architectures. In contrast, the multifrontal
method of Duff and Reid [9, 10, 14, 15] is designed with regular memory access in
the innermost loops and has been modified by Amestoy and Duff to use standard
kernels [1]. This multifrontal method assumes structural symmetry and bases the
factorization on an assembly tree generated from the original matrix and an ordering
such as minimum degree. The computational kernel, executed at each node of the tree,
is one or more steps of LU factorization within a square, dense frontal matrix defined
by the nonzero pattern of a pivot row and column. These steps of LU factorization
compute a contribution block (a Schur complement) that is later assembled (added)
into the frontal matrix of its parent in the assembly tree. Henceforth we will call this
approach the classical multifrontal method.
Although structural asymmetry can be accommodated in the classical multifrontal
method by holding the pattern of $A+A^T$ and storing explicit zeros, this can have poor
performance on matrices whose patterns are very unsymmetric. If we assume from the
outset that the matrix may be structurally asymmetric, the situation becomes more
complicated. For example, the frontal matrices are rectangular instead of square, and
some contribution blocks must be assembled into more than one subsequent frontal
matrix. As a consequence, it is no longer possible to represent the factorization by

* Computer and Information Sciences Department, University of Florida, Gainesville, Florida,
USA. phone: (904) 392-1481, email: davis@cis.ufl.edu. Support for this project was provided by
the National Science Foundation (ASC-9111263 and DMS-9223088), and by Cray Research, Inc. and
Florida State University through the allocation of supercomputer resources. Portions of this work
were supported by a post-doctoral grant from CERFACS.
† Rutherford Appleton Laboratory, Chilton, Didcot, Oxon. OX11 0QX England, and European
Center for Research and Advanced Training in Scientific Computation (CERFACS), Toulouse,
France.

an assembly tree and the more general structure of an assembly dag (directed acyclic
graph) [5] similar to that of Gilbert and Liu [22] and Eisenstat and Liu [17, 18] is
required. In the current work we do not explicitly use this structure.
We have developed a new unsymmetric-pattern multifrontal approach [4, 5]. As
in the symmetric multifrontal case, advantage is taken of repetitive structure in the
matrix by factorizing more than one pivot in each frontal matrix. Thus the algorithm
can use higher level dense matrix kernels in its innermost loops (Level 3 BLAS [6]).
We refer to the unsymmetric-pattern multifrontal method described in this paper
as UMFPACK version 1.0 [4]. It is available in Netlib [7]. A parallel factorize-
only version of UMFPACK, based on the assembly dag, is discussed in Hadfield's
dissertation [24, 26, 27, 25]. The multifrontal method for symmetric positive definite
matrices is reviewed in [29].
Section 2 presents an overview of the basic approach, and a brief outline of the
algorithm. We introduce our data structures in the context of a small sparse matrix in
Section 3 where we describe the factorization of the first frontal matrix. In Section 4
we develop the algorithm further by discussing how subsequent frontal matrices are
factorized. We have split the discussion of the algorithm into these two sections so
that we can define important terms in the earlier section while considering a less
complicated situation. Section 5 presents a full outline of the algorithm, using the
notation introduced in previous sections. In Section 6, we compare the performance of
our algorithm with two algorithms based on the classical multifrontal method: MUPS
[1, 2] and SSGETRF [3], and two algorithms based on conventional (compressed sparse
vector) data structures: Gilbert and Peierls' partial-pivoting code (GPLU [23]) and
MA48 [16] (a successor to MA28 [13]). GPLU does not use dense matrix kernels.
MA48 uses dense matrix kernels only after switching to a dense factorization code
towards the end of factorization when the active submatrix is fairly dense.

2. The basic approach. Our goal with the UMFPACK algorithm is to achieve
high performance in a general unsymmetric sparse factorization code by using the
Level 3 BLAS. We accomplish this by developing a multifrontal technique that uses
rectangular frontal matrices and chooses several pivots within each frontal matrix.
High performance is also achieved through an approximate degree update algorithm
that is much faster (both asymptotically and in practice) than computing the true degrees.
A general sparse code must select pivots based on both numerical and symbolic
(fill-reducing) criteria. We therefore combine the analysis phase (pivot selection
and symbolic factorization) with the numerical factorization. We construct our
rectangular frontal matrices dynamically, since we do not know their structure prior
to factorization. Although based on an assembly dag that can be constructed during
this analyze-factorize phase, we do not use it here although Hadfield and Davis
[24, 26, 27, 25] develop it further and use it in a factorize-only algorithm.
At a particular stage, the frontal matrix is initialized through choosing a pivot
from the entire active submatrix (called a global pivot search) using a Zlatev-style pivot
search [32], except that we keep track of upper bounds on the degrees of rows and
columns in the active submatrix, rather than the true degrees (the degree of a row or
column is simply the number of entries in the row or column). We call this first pivot
the seed pivot. Storage for the frontal matrix is allocated to contain the entries in the
pivot row and column plus some room for further expansion determined by an input
parameter. We define the current frontal matrix by F and the submatrix comprising
the rows and columns not already pivotal by C, calling C the contribution block.
Subsequent pivots within this frontal matrix are found within the contribution

FIG. 2.1. A rectangular frontal matrix within a larger working array.

block, C, as shown in Figure 2.1. The frontal matrix grows as more pivots are chosen,
as denoted by the arrows in the figure. We assemble contribution blocks from earlier
frontal matrices into this frontal matrix as needed. The selection of pivots within this
frontal matrix stops when our next choice for pivot would cause the frontal matrix to
become larger than the allocated working array. We then complete the factorization of
the frontal matrix using Level 3 BLAS, store the LU factors, and place the contribution
block, C, onto a heap. The contribution block is deallocated when it is assembled into
a subsequent frontal matrix. We then continue the factorization by choosing another
seed pivot and generating and factorizing a new frontal matrix.
It is too expensive to compute the actual degrees of the rows and columns of the
active submatrix. To do so would require at least as much work as the numerical
factorization itself. This would defeat the performance gained from using the dense
matrix kernels. Instead, we compute upper bounds for these degrees at a much lower
complexity than the true degrees, since they are obtained from the frontal matrix
data structures instead of conventional sparse vectors. We avoid forming the union of
sparse rows or columns which would have been needed were we to compute the filled
patterns of rows and columns in the active submatrix.
The performance we achieve in the UMFPACK algorithm thus depends equally
on two crucial factors: this approximate degree update algorithm and the numerical
factorization within dense, rectangular frontal matrices. An outline of the UMFPACK
algorithm is shown in Algorithm 1. If A is permuted to block upper triangular form
[12], the algorithm is applied to each block on the diagonal. We will use the matrix

(2.1)   $A = \begin{bmatrix}
a_{11} & 0      & 0      & a_{14} & a_{15} & 0      & 0      \\
a_{21} & a_{22} & a_{23} & 0      & a_{25} & 0      & 0      \\
a_{31} & a_{32} & a_{33} & 0      & 0      & 0      & a_{37} \\
a_{41} & 0      & 0      & a_{44} & a_{45} & a_{46} & 0      \\
0      & a_{52} & a_{53} & 0      & a_{55} & a_{56} & 0      \\
0      & 0      & 0      & 0      & 0      & a_{66} & a_{67} \\
a_{71} & a_{72} & 0      & 0      & a_{75} & 0      & a_{77}
\end{bmatrix}$
to illustrate our algorithm in Sections 3 and 4.
Algorithm 1 consists of initializations followed by three steps, as follows:

Algorithm 1 (Outline of the unsymmetric-pattern multifrontal algorithm)
0: initializations
while (factorizing A) do
1: global pivot search for seed pivot


form frontal matrix F
while (pivots found within frontal matrix) do
2: assemble prior contribution blocks and original rows into F
compute the degrees of rows and columns in C (the contribution block of F)
numerically update part of C (Level 2 and Level 3 BLAS)
local pivot search within C
endwhile
3: complete the factorization of F using Level 3 BLAS
endwhile

The initialization phase of the algorithm (step 0) converts the original matrix
into two compressed sparse vector forms (row-oriented and column-oriented [10]) with
numerical values, $A$, and symbolic pattern, $\mathcal{A}$. Rows and columns are used and deleted
from $A$ and $\mathcal{A}$ during factorization when they are assembled into frontal matrices.
At any given step, $k$ say, we use $A^k$ and $\mathcal{A}^k$ to refer to entries in the original matrix
that are not yet deleted. An entry is defined by a value in the matrix that is actually
stored. Thus all nonzeros are entries but some entries may have the value zero. We
use $|\cdot|$ both to denote the absolute value of a scalar and to signify the number of
entries in a set, sequence, or matrix. The meaning should always be quite clear from
the context.

The true degrees, $d_r(i)$ and $d_c(j)$, are the number of entries in row $i$ and column $j$
of the active submatrix, $A'$, respectively, but we do not store these. Because the cost of
updating these would be prohibitive, we instead use upper bounds $\bar{d}_r(i)$ (where $d_r(i) \le \bar{d}_r(i)$)
and $\bar{d}_c(j)$ (where $d_c(j) \le \bar{d}_c(j)$). However, when a true degree is computed, as in the
initialization phase or during the search for a seed pivot, its corresponding upper
bound is set equal to the true degree.
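To make the step 0 data concrete, the following C sketch (our illustration, not UMFPACK's actual internal layout) holds the matrix simultaneously in compressed-column and compressed-row form, with the degree upper bounds initialized to the true degrees; all type and field names are ours.

/* Sketch of the step 0 data: the original matrix in both compressed
   column and compressed row form, plus the degree upper bounds.
   (Illustrative names; UMFPACK's real data structure differs.) */
typedef struct {
    int n;            /* order of the matrix                     */
    int *colptr;      /* size n+1: column pointers (column form) */
    int *rowind;      /* row index of each entry (column form)   */
    double *cval;     /* numerical values (column form)          */
    int *rowptr;      /* size n+1: row pointers (row form)       */
    int *colind;      /* column index of each entry (row form)   */
    double *rval;     /* numerical values (row form)             */
    int *dbar_r;      /* size n: upper bounds on row degrees     */
    int *dbar_c;      /* size n: upper bounds on column degrees  */
} SparseAk;

/* Initially every upper bound equals the true degree. */
static void init_degree_bounds(SparseAk *A)
{
    for (int j = 0; j < A->n; j++)
        A->dbar_c[j] = A->colptr[j + 1] - A->colptr[j];
    for (int i = 0; i < A->n; i++)
        A->dbar_r[i] = A->rowptr[i + 1] - A->rowptr[i];
}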

3. The first frontal matrix. We will label the frontal matrix generated at
stage e by the index e. We now describe the factorization of the first frontal matrix
(e = 1). This discussion is, however, also applicable for subsequent frontal matrices
(e > 1) which are discussed in full in Section 4 where differences from the case e = 1
are detailed.

3.1. Step 1: Perform global pivot search and form frontal matrix. The
algorithm performs pivoting both to maintain numerical stability and to reduce fill-in.
The first pivot in each frontal matrix is chosen using a global Zlatev-style search [32].
A few candidate columns with the lowest upper bound degrees are searched. The
number searched is controlled by an input parameter (which we denote by nsrch and
whose default value is four). Among those nsrch columns, we select as pivot the entry
$a'_{rc}$ with the smallest approximate Markowitz cost [30], $(\bar{d}_r(r)-1)(\bar{d}_c(c)-1)$, such
that $a'_{rc}$ also satisfies a threshold partial pivoting condition [10]

(3.1)   $|a'_{rc}| \ge u \cdot \max_i |a'_{ic}|, \qquad 0 < u \le 1.$

Note that we have the true column degree since the column entries were generated
explicitly to enable the threshold test in Equation (3.1). When the pivot is chosen,
its row and column structure define the frontal matrix. If $Struct(\cdot)$ denotes the row
indices of entries in a column, or column indices of entries in a row, we define $\mathcal{L}$ and $\mathcal{U}$
by $\mathcal{L} = Struct(A'_{*c})$ and $\mathcal{U} = Struct(A'_{r*})$, the row and column indices, respectively,
of the current $|\mathcal{L}|$-by-$|\mathcal{U}|$ frontal matrix $F$. We partition the sets $\mathcal{L}$ and $\mathcal{U}$ into pivotal
row and column indices ($\mathcal{L}'$ and $\mathcal{U}'$) and non-pivotal row and column indices ($\mathcal{L}''$ and
$\mathcal{U}''$).
We then assemble the pivot row ($A_{r*}$) and column ($A_{*c}$) from the original matrix
into $F$ and delete them from $A^k$ (which also deletes them from $\mathcal{A}^k$, since $\mathcal{A}^k$ is defined
as $Struct(A^k)$).
We then try to find further pivot rows and columns with identical pattern in the
same frontal matrix. This process is called amalgamation. Relaxed amalgamation
does the same with pivots of similar but non-identical pattern. To permit relaxed
amalgamation, $F$ is placed in the upper left corner of a larger, newly allocated, $s$-
by-$t$ work array. Relaxed amalgamation is controlled by choosing values for $s$ and $t$
through the input parameter, $g$, where $s = \lfloor g|\mathcal{L}| \rfloor$, $t = \lfloor g|\mathcal{U}| \rfloor$, and $g \ge 1$. The default
value of this parameter in UMFPACK is $g = 2$.
We now use Example (2.1) to illustrate our discussion. Permutations would
needlessly obscure the example, so we assume the pivots in the example matrix are on
the diagonal, in order. (Note that this assumption would not be true if we performed
a global pivot search as in Step 1, since in our example the pivots do not have the
lowest possible Markowitz cost.) The first pivot is $a_{11}$. We have $\mathcal{L} = \mathcal{L}' \cup \mathcal{L}'' =
\{1,2,3,4,7\} = \{1\} \cup \{2,3,4,7\}$ and $\mathcal{U} = \mathcal{U}' \cup \mathcal{U}'' = \{1,4,5\} = \{1\} \cup \{4,5\}$. Let $g$ be
1.25; then the 5-by-3 frontal matrix would be stored in a 6-by-3 array.

3.2. Step 2: Choose further pivots, perform assemblies, and partial
factorization. We continue our pivot search within the contribution block, C, of the
current frontal matrix, F, and repeat this for as long as there is sufficient space in the
working array.
We use the term assembly for the addition of contribution terms or original
entries via the extend-add ($\oplus$) operator [29]. This operator aligns the row and
column index sets of its two matrix or vector operands and then adds together values
referenced by the same indices. An implicit assembly is one that is mathematically
represented by the data structures, but computationally postponed. An explicit
assembly is one that is actually computed. An entry in the active submatrix, A',
is explicitly assembled if all its contribution terms have been added to it, but this is
usually not done and such entries are normally only held implicitly. Pivotal rows and
columns are always explicitly assembled.
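Since the extend-add operator is central to everything that follows, here is a minimal C sketch of one explicit assembly: scatter-adding one column of a contribution block into the frontal work array through a global-row-to-front-row map. The function and its arguments are our illustration, not the package's interface.

/* Sketch of one explicit extend-add: add column m of contribution block Ce
   into column jlocal of the frontal array F. row_position[] maps a global
   row index to its row in F, or -1 if that row is not in the front.
   Column-major storage with the given leading dimensions is assumed. */
void extend_add_column(const double *Ce, int Ce_ld, int m,
                       const int *Le, int Le_len,
                       double *F, int F_ld, int jlocal,
                       const int *row_position)
{
    for (int p = 0; p < Le_len; p++) {
        int i = row_position[Le[p]];      /* align the row index sets */
        if (i >= 0)
            F[i + (long)jlocal * F_ld] += Ce[p + (long)m * Ce_ld];
    }
}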
We scan $\mathcal{A}^k_{*j}$ for each column $j$ in $\mathcal{U}''$. The scan of $\mathcal{A}^k_{*j}$ is stopped as soon as a
row $i \notin \mathcal{L}$ is found. If the scan completes without such a row being found, then all
row indices in $\mathcal{A}^k_{*j}$ are also in $\mathcal{L}$, and we delete $A^k_{*j}$ and assemble it into $F$. If this
assembly is done, the true degree of column $j$ is $d_c(j) = \bar{d}_c(j) = |\mathcal{L}''|$. If the scan
stops early, we compute the upper bound degree of column $j$ as

$\bar{d}_c(j) = \min \begin{cases}
n - k & \text{(the size of } A' \text{)} \\
|\mathcal{L}''| + (|\mathcal{A}^k_{*j}| - \alpha_j) & \text{(the worst-case fill-in)}
\end{cases}$

where $k$ is the current step of Gaussian elimination, and $\alpha_j$ is the number of entries
scanned in $\mathcal{A}^k_{*j}$ before stopping. For each row $i$ in $\mathcal{L}''$, we scan $\mathcal{A}^k_{i*}$ and compute $\bar{d}_r(i)$
in an analogous manner.
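A C sketch of this column scan, under our own naming (the membership flags and pattern arrays are hypothetical): it either reports that the column can be assembled, in which case the bound is the exact degree, or stops early and returns the worst-case bound.

/* Sketch of the early-terminating scan of A^k_{*j}. pattern[0..len-1]
   holds the row indices of A^k_{*j}; in_L[i] is nonzero iff row i is in
   the front's row set L. Returns the degree bound and reports alpha_j
   (entries scanned) and whether the whole column can be assembled. */
int column_degree_bound(const int *pattern, int len, const char *in_L,
                        int Lpp /* |L''| */, int n, int k,
                        int *alpha_j, int *assemble)
{
    int a = 0;
    while (a < len && in_L[pattern[a]])
        a++;                              /* stop at the first row not in L */
    *alpha_j = a;
    *assemble = (a == len);
    if (*assemble)
        return Lpp;                       /* exact: column lies inside the front */
    int bound = Lpp + (len - a);          /* worst-case fill-in */
    return (bound < n - k) ? bound : (n - k);
}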
In the example, $A^k_{*4}$ is assembled into $C$ and entry $a_{44}$ is deleted. The uncomputed
true degrees and the degree bounds are shown in Table 3.1. The values of $\alpha_j$ used in
constructing the upper bounds were obtained on the assumption that $A^k$ is stored in
ascending order for each row and column. We have

TABLE 3.1
True degrees and degree bounds in example matrix.

i   d_r(i)   \bar{d}_r(i)      j   d_c(j)   \bar{d}_c(j)
2     4          5             2     4          4
3     5          5             3     3          3
4     3          3             4     4          4
5     4          4             5     5          6
6     2          2             6     3          3
7     4          5             7     3          3


$F = \begin{array}{c|cc}
 & \mathcal{U}' & \mathcal{U}'' \\ \hline
\mathcal{L}' & a'_{rc} & A'_{r*} \\
\mathcal{L}'' & A'_{*c} & C
\end{array}
= \begin{array}{c|ccc}
 & 1 & 4 & 5 \\ \hline
1 & a_{11} & a_{14} & a_{15} \\
2 & a_{21} & 0 & 0 \\
3 & a_{31} & 0 & 0 \\
4 & a_{41} & a_{44} & 0 \\
7 & a_{71} & 0 & 0
\end{array}$
We divide the pivot column $A'_{*c}$ by the pivot $a_{11}$ to obtain the $k$th column of $L$,
the $n$-by-$n$ lower triangular factor. The pivot row is the $k$th row of $U$, the $n$-by-$n$
upper triangular factor. Step $k$ of Gaussian elimination is complete, except for the
updates from the $k$th pivot. The counter $k$ is now incremented for the next step
of Gaussian elimination. The frontal matrix $F$ is partitioned into four submatrices,
according to the partition of $\mathcal{L}$ and $\mathcal{U}$. We have

$F = \begin{array}{c|cc}
 & \mathcal{U}' & \mathcal{U}'' \\ \hline
\mathcal{L}' & L'U' & U'' \\
\mathcal{L}'' & L'' & C
\end{array}
= \begin{array}{c|ccc}
 & 1 & 4 & 5 \\ \hline
1 & u_{11} & u_{14} & u_{15} \\
2 & l_{21} & 0 & 0 \\
3 & l_{31} & 0 & 0 \\
4 & l_{41} & a_{44} & 0 \\
7 & l_{71} & 0 & 0
\end{array}$

The updates to $C$ from the $|\mathcal{U}'|$ pivots in $F$ are not applied one at a time. Instead,
they are delayed until there are updates pending from $b$ pivots to allow the efficient
use of Level 3 BLAS [6]. On a CRAY YMP, a good value for the parameter $b$ is 16.
Let $\bar{L}$ and $\bar{U}$ denote the portions of $L''$ and $U''$, respectively, whose updates have
yet to be fully applied to $C$. If $|\mathcal{U}'| \bmod b = 0$ then the pending updates are applied
($C = C - \bar{L}\bar{U}$). If $b$ were 16, no updates would be applied in our example since
$|\mathcal{U}'| = 1$.
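In C with the CBLAS interface (our choice for illustration; the original code is Fortran and would call the BLAS routine directly, in the Cray's native precision), the delayed rank-b update is a single GEMM call:

#include <cblas.h>

/* Sketch: apply the pending updates from nb pivots to the contribution
   block in one Level 3 BLAS call, C = C - Lbar * Ubar. Lbar is m-by-nb,
   Ubar is nb-by-n, C is m-by-n; all live in the s-by-t frontal work
   array, so every leading dimension is the array height s. */
void apply_pending_updates(int m, int n, int nb,
                           const double *Lbar, const double *Ubar,
                           double *C, int s)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, nb,
                -1.0, Lbar, s,
                      Ubar, s,
                 1.0, C,    s);
}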
We now search for the next pivot within the current frontal matrix. We search
the columns in $\mathcal{U}''$ to find a candidate pivot column, $c$, that has minimum $\bar{d}_c(c)$ among
the columns of $\mathcal{U}''$. We then apply any pending updates to this candidate column
($C_{*c} = C_{*c} - \bar{L}\bar{U}_{*c}$), and compute the candidate column $A'_{*c}$, its pattern $Struct(A'_{*c})$,
and its true degree $d_c(c)$. We select the candidate pivot row $r$ in $\mathcal{L}''$ with the lowest
$\bar{d}_r(r)$ such that $a'_{rc}$ also satisfies the threshold pivoting criterion (Equation (3.1)). We
compute the pattern $Struct(A'_{r*})$ of the candidate pivot row and its true degree $d_r(r)$.

If $d_c(c) > s - |\mathcal{L}'|$ or $d_r(r) > t - |\mathcal{U}'|$, the current work array is too small to
accommodate the candidate pivot and we stop the pivot search. Also, if the candidate
column has entries outside the current frontal matrix, the threshold pivoting criterion
might prevent us from finding an acceptable candidate pivot in $\mathcal{L}''$. In this case also
we stop the factorization of the current frontal matrix $F$. If the candidate pivot
$a'_{rc}$ is acceptable, then we let $\mathcal{L} = \mathcal{L} \cup Struct(A'_{*c})$ and $\mathcal{U} = \mathcal{U} \cup Struct(A'_{r*})$. We
repartition $\mathcal{L}$ and $\mathcal{U}$ into pivotal row and column indices ($\mathcal{L}'$ and $\mathcal{U}'$) and non-pivotal
row and column indices ($\mathcal{L}''$ and $\mathcal{U}''$) and apply any pending updates to the pivot row
($C_{r*} = C_{r*} - \bar{L}_{r*}\bar{U}$).

In the example, the candidate column (column 4) can fit in the 6-by-3 work array
(that is, $d_c(4) = 4 \le s - |\mathcal{L}'| = 6 - 1 = 5$). Suppose $a_{44}$ does not meet the threshold
criterion, and row 7 is selected as the candidate row. The candidate row is, however,
rejected when its true degree is computed (the work array is too small to accommodate
row 7, since $d_r(7) = 4 > t - |\mathcal{U}'| = 3 - 1 = 2$).
3.3. Step 3: Complete the factorization of F. After the last pivot has been
selected within the current frontal matrix $F$, we apply any pending updates to the
contribution block ($C = C - \bar{L}\bar{U}$). The pivot rows and columns in $F$ are then placed
in storage allocated for the LU factors.

The contribution block $C$ and its pattern $\mathcal{L}''$ and $\mathcal{U}''$ form what we call an element.
An element is the contribution block of a factorized frontal matrix that has yet to be
assembled into subsequent frontal matrices (both its numerical values and symbolic
pattern). In particular, let $C_e$ denote the contribution block of element $e$, and let the
pattern of $C_e$ be $\mathcal{L}_e$ and $\mathcal{U}_e$ (note that $\mathcal{L}_e = \mathcal{L}''$ and $\mathcal{U}_e = \mathcal{U}''$).

Initially, all row and column indices in $\mathcal{L}_e$ and $\mathcal{U}_e$ are unmarked. When a row
(or column) of $C_e$ is assembled into a subsequent frontal matrix, the corresponding
index is marked in $\mathcal{L}_e$ (or $\mathcal{U}_e$). Element $e$ (which consists of the terms $C_e$, $\mathcal{L}_e$, and
$\mathcal{U}_e$) will refer to unmarked portions only. Element $e$ is deleted when all of its entries
are assembled into subsequent frontal matrices. For our example, we have

$C_1 = \begin{array}{c|cc}
 & 4 & 5 \\ \hline
2 & c_{24} & c_{25} \\
3 & c_{34} & c_{35} \\
4 & c_{44} & c_{45} \\
7 & c_{74} & c_{75}
\end{array}$

We associate with each row and column in the active submatrix an element list,
which is a list of the elements that hold pending updates to the row or column,
respectively. We denote the list of elements containing row $i$ as $\mathcal{R}_i$, and the list of
elements containing column $j$ as $\mathcal{C}_j$. The element lists contain a local index which
identifies which row or column in the element matrix is equivalent to the row or
column of the active matrix. This facilitates the numerical assembly of individual
rows and columns. For each row $i$ in $\mathcal{L}_e$, we place an element/local-index pair, $(e, m)$,
in the element list $\mathcal{R}_i$, where row $i$ is the $m$th entry of $\mathcal{L}_e$. Similarly, for each column
$j$ in $\mathcal{U}_e$, we place $(e, m)$ in the element list $\mathcal{C}_j$, where column $j$ is the $m$th entry of $\mathcal{U}_e$.
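One plausible C layout for these structures (ours, for illustration only): each element keeps its pattern and contribution block, and each row and column of the active submatrix keeps a list of (e, m) pairs.

/* Sketch of the element and element-list structures (illustrative only). */
typedef struct {
    int e;              /* element identifier                              */
    int m;              /* local index of this row (or column) within C_e  */
} EPair;

typedef struct {
    int nrows, ncols;   /* |L_e| and |U_e|                                 */
    int *Le;            /* global row indices; an assembled index can be
                           flagged, e.g. replaced by -1-i (our choice)     */
    int *Ue;            /* global column indices                           */
    double *Ce;         /* nrows-by-ncols contribution block, column-major */
} Element;

typedef struct {
    EPair *pairs;       /* the elements holding updates for this row/col   */
    int len;
} ElementList;          /* one R_i per row i, one C_j per column j         */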
Let $\hat{\Sigma}$ denote a summation using the $\oplus$ operator. The $(n-k+1)$-by-$(n-k+1)$
active submatrix $A'$ is represented by an implicit assembly of $A^k$ and the elements
in the set $\mathcal{E}$,

(3.2)   $A' = \left[ \hat{\Sigma}_{e \in \mathcal{E}}\, C_e \right] \oplus A^k$

where $\mathcal{E} \subseteq \{1 \ldots k-1\}$ is the set of elements that remain after step $k-1$ of Gaussian
elimination. All $\oplus$ operations in Equation (3.2) are not explicitly performed and
are postponed, unless stated otherwise. As defined earlier, the notation $A^k$ refers to
original entries in non-pivotal rows and columns of the original matrix that have not
yet been assembled into any frontal matrices.
The element lists allow Equation (3.2) to be evaluated one row or column at a
time, as needed. Column $j$ of $A'$ is

(3.3)   $A'_{*j} = \left[ \hat{\Sigma}_{(e,m) \in \mathcal{C}_j}\, [C_e]_{*m} \right] \oplus A^k_{*j}$

with pattern

(3.4)   $Struct(A'_{*j}) = \left[ \bigcup_{(e,m) \in \mathcal{C}_j} \mathcal{L}_e \right] \cup \mathcal{A}^k_{*j}.$

Similarly, row $i$ of $A'$ is

(3.5)   $A'_{i*} = \left[ \hat{\Sigma}_{(e,m) \in \mathcal{R}_i}\, [C_e]_{m*} \right] \oplus A^k_{i*}$

with pattern

(3.6)   $Struct(A'_{i*}) = \left[ \bigcup_{(e,m) \in \mathcal{R}_i} \mathcal{U}_e \right] \cup \mathcal{A}^k_{i*}.$
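As a sketch of how Equation (3.3) is evaluated (using the types from the earlier sketch; get_element is a hypothetical lookup that returns NULL for an absorbed element), column j of A' can be accumulated in a dense work vector:

/* Sketch: form column j of A' per Equation (3.3) by scatter-adding the
   original entries A^k_{*j} and then each [C_e]_{*m} with (e,m) in C_j. */
extern Element *get_element(int e);     /* hypothetical; NULL if deleted */

void form_column(const ElementList *Cj,
                 const int *Ak_rows, const double *Ak_vals, int Ak_len,
                 double *work /* size n, zeroed on entry */)
{
    for (int p = 0; p < Ak_len; p++)            /* original entries */
        work[Ak_rows[p]] += Ak_vals[p];
    for (int q = 0; q < Cj->len; q++) {         /* contribution terms */
        const Element *el = get_element(Cj->pairs[q].e);
        if (el == NULL)
            continue;                           /* stale reference: element absorbed */
        int m = Cj->pairs[q].m;
        for (int p = 0; p < el->nrows; p++) {
            int i = el->Le[p];
            if (i >= 0)                         /* skip rows already assembled */
                work[i] += el->Ce[p + (long)m * el->nrows];
        }
    }
}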

There is an interesting correspondence between our data structures and George
and Liu's quotient graph representation of the factorization of a symmetric positive
definite matrix [19]. Suppose we factorize a symmetric positive definite matrix using
our algorithm and restrict the pivots to the diagonal. Then $\mathcal{A}^k_{*i} = \mathcal{A}^k_{i*}$, $\mathcal{R}_i = \mathcal{C}_i$,
$\mathcal{L}_e = \mathcal{U}_e$, and $Adj_{\mathcal{G}^k}(x_i) = \mathcal{R}_i \cup \mathcal{A}^k_{*i}$, where $x_i$ is an uneliminated node in the quotient
graph $\mathcal{G}^k$. The uneliminated node $x_i$ corresponds to a row $i$ and column $i$ in $A'$. That is, the
sets $\mathcal{R}_i$ and $\mathcal{A}^k_{*i}$ are the eliminated supernodes and uneliminated nodes, respectively,
that are adjacent to the uneliminated node $x_i$. In our terminology, the eliminated
supernode $\bar{x}_e$ corresponds to element $e$. The set $\mathcal{L}_e$ contains the uneliminated nodes
that are adjacent to the eliminated supernode $\bar{x}_e$. That is, $Adj_{\mathcal{G}^k}(\bar{x}_e) = \mathcal{L}_e$.
After the first frontal matrix on Example (2.1), $\mathcal{E} = \{1\}$, and

$A' = C_1 \oplus A^k = \begin{array}{c|cc}
 & 4 & 5 \\ \hline
2 & c_{24} & c_{25} \\
3 & c_{34} & c_{35} \\
4 & c_{44} & c_{45} \\
7 & c_{74} & c_{75}
\end{array} \oplus \begin{array}{c|ccccc}
 & 2 & 3 & 5 & 6 & 7 \\ \hline
2 & a_{22} & a_{23} & a_{25} & 0 & 0 \\
3 & a_{32} & a_{33} & 0 & 0 & a_{37} \\
4 & 0 & 0 & a_{45} & a_{46} & 0 \\
5 & a_{52} & a_{53} & a_{55} & a_{56} & 0 \\
6 & 0 & 0 & 0 & a_{66} & a_{67} \\
7 & a_{72} & 0 & a_{75} & 0 & a_{77}
\end{array}$
Note that column four was deleted from $A^k$ (refer to Section 3.2). It also no
longer appears in $\mathcal{A}^k$. The element lists are given in Table 3.2. Applying
Equations (3.5) and (3.6) to obtain row 2, for example, we obtain

$A'_{2*} = [C_1]_{1*} \oplus A^k_{2*} = \begin{array}{c|cc}
 & 4 & 5 \\ \hline
2 & c_{24} & c_{25}
\end{array} \oplus \begin{array}{c|ccc}
 & 2 & 3 & 5 \\ \hline
2 & a_{22} & a_{23} & a_{25}
\end{array}
= \begin{array}{c|cccc}
 & 4 & 5 & 2 & 3 \\ \hline
2 & c_{24} & (c_{25} + a_{25}) & a_{22} & a_{23}
\end{array}$










TABLE 3.2
Element lists for example matrix, after first frontal matrix.

i    R_i        j    C_j
2    (1,1)      2
3    (1,2)      3
4    (1,3)      4    (1,1)
5               5    (1,2)
6               6
7    (1,4)      7


$Struct(A'_{2*}) = \mathcal{U}_1 \cup \mathcal{A}^k_{2*} = \{4, 5\} \cup \{2, 3, 5\} = \{4, 5, 2, 3\}.$

4. Subsequent frontal matrices. We now describe how later steps differ when
the element lists are not empty, by continuing the example with the second frontal
matrix.

4.1. Step 1: Perform global pivot search and form frontal matrix. We
compute the nsrch candidate pivot columns using Equations (3.3) and (3.4). In the
example, the next pivot is $a'_{22}$, with $\mathcal{L} = \mathcal{L}' \cup \mathcal{L}'' = \{2, 3, 5, 7\} = \{2\} \cup \{3, 5, 7\}$ and
$\mathcal{U} = \mathcal{U}' \cup \mathcal{U}'' = \{2, 3, 4, 5\} = \{2\} \cup \{3, 4, 5\}$. The 4-by-4 frontal matrix is stored in a
5-by-5 array ($g = 1.25$).

4.2. Step 2: Choose further pivots, perform assemblies, and partial
factorization. In the example, a second pivot ($a_{33}$) is found in the second frontal
matrix and so we will repeat this step twice.
As we discussed earlier, computing the true degree, $d_c(j) = |Struct(A'_{*j})|$, with
Equation (3.4) would be very time consuming. A loose upper bound on $d_c(j)$ can be
derived if we assume no overlap between $\mathcal{L}$ and each $\mathcal{L}_e$, viz.

$d_c(j) \le \min \begin{cases}
n - k, \\
|\mathcal{L}''| + \bar{d}_c(j), \\
|\mathcal{L}''| + (|\mathcal{A}^k_{*j}| - \alpha_j) + \sum_{e \in \mathcal{C}_j} |\mathcal{L}_e|.
\end{cases}$

To compute this bound for all rows and columns in $C$ would take time

$O\Big( \sum_{i \in \mathcal{L}''} \alpha_i + \sum_{j \in \mathcal{U}''} \alpha_j \Big)$

to scan $\mathcal{A}^k$, and time

$O\Big( \sum_{i \in \mathcal{L}''} |\mathcal{R}_i| + \sum_{j \in \mathcal{U}''} |\mathcal{C}_j| \Big)$

to scan $\mathcal{R}$ and $\mathcal{C}$. For a single column $j$, the total time is $O(\alpha_j + |\mathcal{C}_j|)$, or $O(|\mathcal{A}^k_{*j}| + |\mathcal{C}_j|)$,
since $\alpha_j \le |\mathcal{A}^k_{*j}|$. Similarly, the time to compute this loose degree bound for a row $i$
is $O(\alpha_i + |\mathcal{R}_i|)$, or $O(|\mathcal{A}^k_{i*}| + |\mathcal{R}_i|)$.

However, a much tighter bound can be obtained in the same asymptotic time.
The set $\mathcal{L}_e$ can be split into two disjoint subsets: the external subset $\mathcal{L}_e \setminus \mathcal{L}$ and the
internal subset $\mathcal{L}_e \cap \mathcal{L}$, where $\mathcal{L}_e = (\mathcal{L}_e \setminus \mathcal{L}) \cup (\mathcal{L}_e \cap \mathcal{L})$, and "$\setminus$" is the standard set
difference operator. Define $|\mathcal{L}_e \setminus \mathcal{L}|$ as the external column degree of element $e$ with
respect to $F$. Similarly, define $|\mathcal{U}_e \setminus \mathcal{U}|$ as the external row degree of element $e$ with
respect to $F$. We use the bound

(4.1)   $d_c(j) \le \bar{d}_c(j) = \min \begin{cases}
n - k, \\
|\mathcal{L}''| + \bar{d}_c(j), \\
|\mathcal{L}''| + (|\mathcal{A}^k_{*j}| - \alpha_j) + \sum_{e \in \mathcal{C}_j} |\mathcal{L}_e \setminus \mathcal{L}|
\end{cases}$

which is tighter than before since $|\mathcal{L}_e \setminus \mathcal{L}| = |\mathcal{L}_e| - |\mathcal{L}_e \cap \mathcal{L}| \le |\mathcal{L}_e|$. The equation for
$\bar{d}_r(i)$ is analogous.
An efficient way of computing the external row and column degrees is given in
Algorithm 2. The cost of doing this can be amortized over all subsequent degree
updates on the current front. We use the term "amortized time" to define how much
of this total work is ascribed to the computation of a single degree bound, $\bar{d}_c(j)$ or
$\bar{d}_r(i)$. Note that in computing these amortized time estimates we actually include
the cost of computing the external row degrees within the estimate for the column
degree bounds, although it is actually the external column degrees that are used in
computing this bound. We can amortize the time in this way because we compute
the external row and column degrees, and the row and column degree bounds, for all
rows and columns in the current frontal matrix.

Relating our approximate degree algorithm to George and Liu's quotient graph,
our algorithm takes an amortized time of $O(|\mathcal{A}^k_{*j}| + |\mathcal{C}_j|) = O(|Adj_{\mathcal{G}^k}(x_j)|)$ to compute
$\bar{d}_c(j)$. This correspondence holds only if $A$ is symmetric and pivots are selected from
the diagonal. This is much less than the $O(|Adj_{G^k}(x_j)|)$ time taken to compute the
true degree. The true degree $d_c(j) = |Struct(A'_{*j})| = |Adj_{G^k}(x_j)|$ is the degree of
node $x_j$ in the implicitly represented elimination graph, $G^k$ [19]. If indistinguishable
uneliminated nodes are present in the quotient graph (as used in [28], for example),
both of these time complexity bounds are reduced, but computing the true degree
still takes much more time than computing our approximate degree.

We now describe how we compute our degree bound, $\bar{d}_c(j)$, in an amortized time
of $O(|\mathcal{A}^k_{*j}| + |\mathcal{C}_j|)$. We compute the external column degrees by scanning each $e$ in $\mathcal{R}_i$
for each new row $i$ in $\mathcal{L}$, as shown in Algorithm 2. A row or column is new if it did
not appear in $\mathcal{L}$ or $\mathcal{U}$ prior to the current pivot. Since $e \in \mathcal{R}_i$ implies $i \in \mathcal{L}_e$, row $i$
must be internal (that is, $i \in \mathcal{L}_e \cap \mathcal{L}$).

Algorithm 2 (Computation of external column degrees)
assume $w(1 \ldots n) = -1$
for each new row $i \in \mathcal{L}$ do
for each element $e$ in the element list $\mathcal{R}_i$ of row $i$ do
if ($w(e) < 0$) then $w(e) = |\mathcal{L}_e|$
$w(e) = w(e) - 1$
end for
end for
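A direct C transcription of Algorithm 2 (a sketch; the ElementList type is from the earlier sketch, and Le_size[e] = |L_e| is an assumed input array):

/* Sketch: Algorithm 2 in C. w[] has one slot per element and starts at -1. */
void external_column_degrees(const int *new_rows, int n_new,
                             const ElementList *R,   /* R[i]: list of row i */
                             const int *Le_size,     /* Le_size[e] = |L_e|  */
                             int *w)
{
    for (int p = 0; p < n_new; p++) {
        int i = new_rows[p];
        for (int q = 0; q < R[i].len; q++) {
            int e = R[i].pairs[q].e;
            if (w[e] < 0)
                w[e] = Le_size[e];   /* first visit: initialize to |L_e| */
            w[e] = w[e] - 1;         /* one internal row of element e    */
        }
    }
}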

If Algorithm 2 scans element $e$, the term $w(e)$ is initialized to $|\mathcal{L}_e|$ and then
decremented once for each internal row $i \in \mathcal{L}_e \cap \mathcal{L}$. In this case, at the end of
Algorithm 2 three equivalent conditions hold:
1. $e$ appears in the lists $\mathcal{R}_i$ for the rows $i$ in $\mathcal{L}$,
2. the internal subset $\mathcal{L}_e \cap \mathcal{L}$ is not empty,
3. $w(e) = |\mathcal{L}_e| - |\mathcal{L}_e \cap \mathcal{L}| = |\mathcal{L}_e \setminus \mathcal{L}|$.
If Algorithm 2 did not scan element $e$ in any $\mathcal{R}_i$, then the three following equivalent
conditions hold:
1. $e$ does not appear in $\mathcal{R}_i$ for any row $i$ in $\mathcal{L}$,
2. the internal subset $\mathcal{L}_e \cap \mathcal{L}$ is empty,
3. $w(e) < 0$.
Combining these two cases, we obtain

(4.2)   $|\mathcal{L}_e \setminus \mathcal{L}| = \begin{cases} w(e) & \text{if } w(e) \ge 0 \\ |\mathcal{L}_e| & \text{otherwise} \end{cases} \qquad \text{for all } e \in \mathcal{E}.$


To compute the external row degrees of all elements, we scan the element list $\mathcal{C}_j$
for each new column $j$ in $\mathcal{U}$ in an analogous manner (with a separate work array).
The total time to compute both the external column degrees (Algorithm 2) and the
external row degrees is $O(\sum_{i \in \mathcal{L}''} |\mathcal{R}_i| + \sum_{j \in \mathcal{U}''} |\mathcal{C}_j|)$.

We compute $\bar{d}_c(j)$ and assemble elements by scanning the element list $\mathcal{C}_j$ for each
column $j \in \mathcal{U}''$, evaluating $\bar{d}_c(j)$ using Equations (4.1) and (4.2). If the external row
and column degrees of element $e$ are both zero, then we delete $(e, m)$ from $\mathcal{C}_j$ and
assemble $C_e$ into $F$. Element $e$ no longer exists. This is identical to the assembly
from a child (element $e$) into a parent (the current frontal matrix $F$) in the assembly
tree of the classical multifrontal method. It is also referred to as element absorption
[14]. It is too costly at this point to delete all references to the deleted element. If a
reference to a deleted element is found later on, it is then discarded. If the external
column degree of element $e$ is zero but its external row degree is not zero, then $(e, m)$
is deleted from $\mathcal{C}_j$, column $j$ is assembled from $C_e$ into $F$, and column $j$ is deleted
from element $e$. Finally, we scan the original entries ($\mathcal{A}^k_{*j}$) in column $j$ as discussed in
Section 3.2. If all remaining entries can be assembled into the current frontal matrix,
then we perform the assembly and delete column $j$ of $A^k$. Thus, the amortized time
to compute $\bar{d}_c(j)$ is $O(|\mathcal{A}^k_{*j}| + |\mathcal{C}_j|)$. This time complexity does not include the time
to perform the numerical assembly.

The scan of rows $i \in \mathcal{L}''$ is analogous. The amortized time to compute $\bar{d}_r(i)$ is
$O(|\mathcal{A}^k_{i*}| + |\mathcal{R}_i|)$.
For pivot $a_{22}$ in the example, we only have one previous element, element 1. The
element lists are shown in Table 3.2. The external column degree of element 1 is one,
since $|\mathcal{L}_1| = 4$, and $e = 1$ appears in the element lists of three rows in $\mathcal{L}$. The external
row degree of element 1 is zero, since $|\mathcal{U}_1| = 2$, and $e = 1$ appears in the element
lists of two columns in $\mathcal{U}$. We have $\mathcal{L}_1 = (\mathcal{L}_1 \setminus \mathcal{L}) \cup (\mathcal{L}_1 \cap \mathcal{L}) = \{4\} \cup \{2, 3, 7\}$ and
$\mathcal{U}_1 = (\mathcal{U}_1 \setminus \mathcal{U}) \cup (\mathcal{U}_1 \cap \mathcal{U}) = \emptyset \cup \{4, 5\}$. Rows 2, 3, and 7 (but not 4) are assembled
from $C_1$ into $F$ and deleted. Row 2 and columns 2 and 3 of $A^k$ are also assembled
into $F$. No columns are assembled from $C_1$ into $F$ during the column scan, since the
external column degree of element 1 is not zero.
We have

$C_1 = \begin{array}{c|cc}
 & 4 & 5 \\ \hline
2 & - & - \\
3 & - & - \\
4 & c_{44} & c_{45} \\
7 & - & -
\end{array} \qquad A^k = \begin{array}{c|ccc}
 & 5 & 6 & 7 \\ \hline
3 & 0 & 0 & a_{37} \\
4 & a_{45} & a_{46} & 0 \\
5 & a_{55} & a_{56} & 0 \\
6 & 0 & a_{66} & a_{67} \\
7 & a_{75} & 0 & a_{77}
\end{array}$

and

$F = \begin{array}{c|cc}
 & \mathcal{U}' & \mathcal{U}'' \\ \hline
\mathcal{L}' & a'_{rc} & A'_{r*} \\
\mathcal{L}'' & A'_{*c} & C
\end{array}
= \begin{array}{c|cccc}
 & 2 & 3 & 4 & 5 \\ \hline
2 & a'_{22} & a'_{23} & a'_{24} & a'_{25} \\
3 & a'_{32} & a'_{33} & c_{34} & c_{35} \\
5 & a'_{52} & a'_{53} & 0 & 0 \\
7 & a'_{72} & 0 & c_{74} & c_{75}
\end{array}$

where we have marked already assembled parts of element $C_1$ by "$-$". It would be
possible to recover this space during the computation but we have chosen not to do so
in the interest of avoiding the expense of updating the associated element lists. Note
then that these lists refer to positions within the original element.
The assembly and deletion of a row in an element does not affect the external
column degree of the element, which is why only new rows are scanned in Algorithm 2.
Similarly, the assembly and deletion of a column in an element does not affect the
external row degree of the element.
The local pivot search within $F$ evaluates the candidate column $c$ and row $r$ using
Equations (3.3), (3.4), and (3.6). In the example, the second pivot $a_{33}$ is found in the
local pivot search. The set $\mathcal{L}$ remains unchanged, but the set $\mathcal{U}$ is augmented with
the new column 7. Rows 3 and 7 are assembled from $A^k$ into $F$ in the subsequent
execution of step 2 for this pivot. No further assembly from $C_1$ is made.

Step 2 is substantially reduced if there are no new rows or columns in $F$. No
assemblies from $A^k$ or $C_e$ can be done since all possible assemblies would have been
done for a previous pivot. It is only necessary to decrement $\bar{d}_c(j)$ for all $j \in \mathcal{U}''$ and
$\bar{d}_r(i)$ for all $i \in \mathcal{L}''$.

4.3. Step 3: Complete the factorization of F. The work array, $w$, must be
reset for the next frontal matrix. Rather than rescanning the elements and resetting
the counters, we use the following modification to Algorithm 2. The counter $w_0$
and the counters $w(1 \ldots n)$ are equal to zero and $-1$, respectively, at the start of
factorization.

Algorithm 2, modified (Computation of external column degrees)
assume $w(1 \ldots n) < w_0$
for each new row $i \in \mathcal{L}$ do
for each element $e$ in the element list $\mathcal{R}_i$ of row $i$ do
if ($w(e) < w_0$) then $w(e) = |\mathcal{L}_e| + w_0$
$w(e) = w(e) - 1$
end for
end for

Then the external column degrees are

(4.3)   $|\mathcal{L}_e \setminus \mathcal{L}| = \begin{cases} w(e) - w_0 & \text{if } w(e) \ge w_0 \\ |\mathcal{L}_e| & \text{otherwise} \end{cases} \qquad \text{for all } e \in \mathcal{E}.$

To enable reuse of $w$ for the degree computation for the next pivot step, we increment
$w_0$ by $n$. If $w_0$ would overflow, we reinitialize $w_0$ and $w$ to zero and $-1$, respectively.
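A C sketch of this constant-time reset (our illustration): raising the baseline w0 by n invalidates all stale counters at once, and a full re-initialization happens only on rare overflow.

#include <limits.h>

/* Sketch: O(1) logical reset of the work array between pivot steps. */
static int w0 = 0;

void reset_work_array(int *w, int n)
{
    if (w0 > INT_MAX - n) {              /* w0 would overflow: hard reset */
        for (int e = 0; e < n; e++)
            w[e] = -1;
        w0 = 0;
    } else {
        w0 += n;                         /* stale w(e) now fall below w0  */
    }
}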











In the example, the final factorized frontal matrix is

$F = \begin{array}{c|cc}
 & \mathcal{U}' & \mathcal{U}'' \\ \hline
\mathcal{L}' & L'U' & U'' \\
\mathcal{L}'' & L'' & C
\end{array}
= \begin{array}{c|ccccc}
 & 2 & 3 & 4 & 5 & 7 \\ \hline
2 & u_{22} & u_{23} & u_{24} & u_{25} & 0 \\
3 & l_{32} & u_{33} & u_{34} & u_{35} & u_{37} \\
5 & l_{52} & l_{53} & c_{54} & c_{55} & c_{57} \\
7 & l_{72} & l_{73} & c_{74} & c_{75} & c_{77}
\end{array}$

Note that $u_{27} = 0$, due to the relaxed amalgamation of two pivot rows with non-
identical patterns. Relaxed amalgamation can result in higher performance since
more of the Level 3 BLAS can be used. In the small example, the active submatrix
is represented by the implicit assembly

$A' = C_1 \oplus C_2 \oplus A^k$

$= \begin{array}{c|cc}
 & 4 & 5 \\ \hline
2 & - & - \\
3 & - & - \\
4 & c_{44} & c_{45} \\
7 & - & -
\end{array} \oplus \begin{array}{c|ccc}
 & 4 & 5 & 7 \\ \hline
5 & c_{54} & c_{55} & c_{57} \\
7 & c_{74} & c_{75} & c_{77}
\end{array} \oplus \begin{array}{c|ccc}
 & 5 & 6 & 7 \\ \hline
4 & a_{45} & a_{46} & 0 \\
5 & a_{55} & a_{56} & 0 \\
6 & 0 & a_{66} & a_{67}
\end{array}$

$= \begin{array}{c|cccc}
 & 4 & 5 & 6 & 7 \\ \hline
4 & a'_{44} & a'_{45} & a'_{46} & 0 \\
5 & a'_{54} & a'_{55} & a'_{56} & a'_{57} \\
6 & 0 & 0 & a'_{66} & a'_{67} \\
7 & a'_{74} & a'_{75} & 0 & a'_{77}
\end{array}$

The element lists are shown in Table 4.1.

TABLE 4.1
Element lists for example matrix, after second frontal matrix.

i    R_i        j    C_j
4    (1,3)      4    (1,1) (2,1)
5    (2,1)      5    (1,2) (2,2)
6               6
7    (2,2)      7    (2,3)

5. Algorithm. Algorithm 3 is a full outline of the UMFPACK (version 1.0)
algorithm.
6. Performance results. In this section, we compare the performance of
UMFPACK version 1.0 with MUPS [1], MA48 [16], GPLU [23], and SSGETRF [3]
on a single processor of a CRAY YMP (although MUPS and SSGETRF are parallel
codes). Each method has a set of input parameters that control its behavior. We used
the recommended defaults for most of these, with a few exceptions that we indicate
below. All methods can factorize general unsymmetric matrices.
MA48 [16] supersedes the MA28 code [13]. It first performs an ordering phase
that also computes most of the factors but discards them. It then performs the











Algorithm 3 (Unsymmetric-pattern multifrontal algorithm)
0: initializations
$k = 1$
$\mathcal{E}$ = empty
while ($k \le n$) do
1: $e = k$
global search for $k$th pivot: $a'_{rc}$
$\mathcal{L} = Struct(A'_{*c})$
$\mathcal{U} = Struct(A'_{r*})$
form rectangular frontal matrix $F$ in an $s$-by-$t$ work array ($s = \lfloor g|\mathcal{L}| \rfloor$, $t = \lfloor g|\mathcal{U}| \rfloor$)
while (more pivots can be found) do
2: assemble $k$th pivot row and column into $F$
scan element lists and compute external degrees
assemble from $A^k$ and contribution blocks into $F$
compute degree bounds
compute entries of $L$ ($F_{*c} = F_{*c} / a'_{rc}$)
$k = k + 1$
if ($|\mathcal{U}'| \bmod b = 0$) $C = C - \bar{L}\bar{U}$
if ($|\mathcal{U}''| = 0$) goto step 3
find candidate column $c \in \mathcal{U}''$
$C_{*c} = C_{*c} - \bar{L}\bar{U}_{*c}$
if ($\bar{d}_c(c) \ne |\mathcal{L}''|$) assemble column $c$ and compute $d_c(c)$
if ($d_c(c) > s - |\mathcal{L}'|$) goto step 3
find candidate row $r \in \mathcal{L}''$
if ($r$ not found) goto step 3
if ($\bar{d}_r(r) \ne |\mathcal{U}''|$) assemble row $r$ and compute $d_r(r)$
if ($d_r(r) > t - |\mathcal{U}'|$) goto step 3
$\mathcal{L} = \mathcal{L} \cup Struct(A'_{*c})$
$\mathcal{U} = \mathcal{U} \cup Struct(A'_{r*})$
$C_{r*} = C_{r*} - \bar{L}_{r*}\bar{U}$
endwhile
3: save $L'$, $L''$, $\mathcal{L}$, $U'$, $U''$, and $\mathcal{U}$
$C = C - \bar{L}\bar{U}$
$C_e = C$
$\mathcal{L}_e = \mathcal{L}''$
$\mathcal{U}_e = \mathcal{U}''$
delete $F$
$\mathcal{E} = \mathcal{E} \cup \{e\}$
add $e$ to element lists
endwhile




numerical factorization to compute the entire LU factors. When the matrix becomes
dense enough near the end of factorization (default of 50% dense), MA48 switches to
a dense factorization code. MA48 can preorder a matrix to block upper triangular
form (always preceded by finding a maximum transversal [8]), and then factorize each
block on the diagonal [12]. Off-diagonal blocks do not suffer fill-in. MA48 can restrict











TABLE 6.1
Input parameters for each method.

option                              UMFPACK   MA48     MUPS     SSGETRF   GPLU
scaling of A                        yes/no    yes/no   yes/no   yes/no    yes/no
block upper triangular form         yes/no    yes/no   no       no        yes/no
maximum transversal                                    yes/no   yes
preserve symmetry                   yes/no    yes/no   yes      yes       no
total number of tests per matrix    8         8        4        2         4

its pivot search to the diagonal, thus preserving symmetry if it exists.
MUPS performs a minimum degree ordering and symbolic factorization on the
nonzero pattern of $A + A^T$ and constructs an assembly tree for the numerical
factorization phase [1]. During numerical factorization, candidate pivot entries must
pass a threshold partial pivoting test similar to Equation (3.1), except that the test
is by rows instead of by columns. Since all the other methods we are comparing
perform this test by columns, we factorize $A^T$ with MUPS and then use the factors
of $A^T$ to solve the original system ($Ax = b$). MUPS optionally preorders a matrix
to maximize the modulus of the smallest entry on the diagonal (using a maximum
transversal algorithm [8]). MUPS always attempts to preserve symmetry. It does not
permute the matrix to block upper triangular form.
SSGETRF is a classical multifrontal method in the Cray Research, Inc., library
(version 1.1) installed on the CRAY YMP. It uses Liu's multiple minimum degree
(MMD) algorithm [28] on the pattern of $A + A^T$. It includes a threshold partial
pivoting test. It is not specified in the documentation, but from our results we
conclude that SSGETRF always uses a maximum transversal algorithm. We base
this conclusion on the observation that MUPS and SSGETRF obtain similar fill-in
for highly unsymmetric matrices (matrices for which MUPS performs very poorly
if a maximum transversal algorithm is not used). Like MUPS, it always preserves
symmetry, and does not permute the matrix to block upper triangular form.
The GPLU code of Gilbert and Peierls [23] does not include a pre-ordering phase.
It factorizes $A$ using threshold partial pivoting with row interchanges only. We first
explicitly form $A^TA$, find a fill-reducing ordering via Liu's multiple minimum degree
algorithm [20, 28], and use that permutation as the column order for $A$ (as suggested
in [21]). The time we report includes this analysis phase. We also tested GPLU on
the block upper triangular form of A (as found by MA48), by applying GPLU and
our preordering to each block on the diagonal. GPLU does not have an option for
preserving symmetry.
UMFPACK has input parameters similar to those of MA48. It does not explicitly include a
switch to dense factorization code (each frontal matrix is dense, however). UMFPACK
has a symmetry-preserving option similar to that of MA48, except that the input
parameter only sets a preference for diagonal pivots and the preference is not strict.
We tested all methods with both scaled and unscaled matrices. The scale factors
were computed by the Harwell Subroutine Library routine MC19A. Each row of the
matrix scaled by MC19A was then subsequently divided by the maximum absolute
value in the row (or column). We selected the threshold partial pivoting factor (u) to
be 0.1 for all five methods. Table 6.1 summarizes the different options used for each
method and indicates the number of runs performed in the experiments. The number
in each case is determined by the number of options available in the particular code.
The methods were tested on a single processor of a CRAY YMP C-98-512Mw-8,
TABLE 6.2
Test matrices.

name       n      |A|      sym.   discipline          comments
bcsstk08   1074   12960    1.000  structural eng.     TV studio, irregular
bcsstk28   4410   219024   1.000                      solid element model
bcsstk16   4884   290378   1.000                      Corps. of Eng. dam
plat1919   1919   32399    1.000  oceanography        Atlantic and Indian oceans
orsirr_1   1030   6858     1.000  petroleum eng.      21x21x5 irregular grid
sherman4   1104   3786     1.000                      16x23x3 grid, fully implicit
pores2     1224   9613     0.612
orsreg_1   2205   14133    1.000                      21x21x5 full grid
saylr4     3564   22316    1.000                      33x6x18 grid, shale
sherman3   5005   20033    1.000                      16x23x3 grid
lns_3937   3937   25407    0.850  fluid flow          linearized Navier-Stokes
shyy41     4720   20042    0.723                      viscous fully-coupled Navier-Stokes
ex11mat    16614  1096948  1.000                      3D cylinder and plate heat exch.
shyy161    76480  329762   0.726                      viscous fully-coupled Navier-Stokes
mcfe       765    24382    0.699  astrophysics        radiative transfer
fs_541_2   541    4285     0.683  chem. kinetics      atmospheric pollution
eris1176   1176   18552    1.000  electric power
gemat11    4929   33185    0.001                      linear programming basis
bcspwr10   5300   21842    1.000                      Eastern US
psmigr_1   3140   543162   0.479  demography          US county-to-county migration
mahindas   1258   7682     0.017  economics           Victoria, Australia
orani678   2529   90158    0.071                      Australia
finan512   74752  596992   1.000                      portfolio optimization
radfr1     1048   13299    0.054  chemical eng.       non-reactive separation
lhr01      1477   18592    0.007                      light hydrocarbon recovery
west2021   2021   7353     0.003                      15-stage column section
rdist3a    2398   61896    0.140                      reactive distillation
extr1      2837   11407    0.004                      dynamic simulation
rdist2     3198   56934    0.046                      reactive distillation
lhr04      4101   82682    0.015                      light hydrocarbon recovery
rdist1     4134   94408    0.059                      reactive distillation
hydr1      5308   23752    0.004                      dynamic simulation
lhr71      70304  1528092  0.002                      light hydrocarbon recovery
gre_1107   1107   5664     0.000  discrete simul.     computer system



with 512 Megawords of memory (8-byte words). Version 6.0.3.22 of the Fortran
compiler (CFT77) was used. Each method was given the same amount of memory to factorize the 34
test matrices, listed in Table 6.2. The table lists the name, order, number of entries
($|A|$), symmetry, the discipline from which the matrix came, and additional comments.
The symmetry is the number of matched off-diagonal entries over the total number of
off-diagonal entries. An entry, $a_{ij}$ ($j \ne i$), is matched if $a_{ji}$ is also an entry. The table is
subdivided by discipline, and disciplines are in order of decreasing average symmetry
of the matrices in that discipline. Matrices within a discipline are ordered by size
(n). All matrices are available via anonymous ftp. They include matrices from the
Harwell/Boeing collection [11] (at orion.cerfacs.fr or numerical.cc.rl.ac.uk)
and Saad's SPARSKIT2 collection (at ftp.cs.umn.edu). All other test matrices
in the table are available from ftp.cis.ufl.edu:pub/umfpack/matrices. All
petroleum engineering problems listed are from oil reservoir simulations.
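For reference, the symmetry figure can be computed directly from the pattern; a minimal C sketch follows (has_entry is a hypothetical pattern lookup).

/* Sketch: fraction of matched off-diagonal entries (the "sym." column).
   colptr/rowind hold the pattern in compressed column form. */
extern int has_entry(int i, int j);    /* hypothetical: is a_ij an entry? */

double pattern_symmetry(int n, const int *colptr, const int *rowind)
{
    long matched = 0, offdiag = 0;
    for (int j = 0; j < n; j++) {
        for (int p = colptr[j]; p < colptr[j + 1]; p++) {
            int i = rowind[p];
            if (i == j)
                continue;              /* diagonal entries are excluded */
            offdiag++;
            if (has_entry(j, i))       /* is the transposed position an entry? */
                matched++;
        }
    }
    return offdiag ? (double)matched / (double)offdiag : 1.0;
}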

The best time from runs listed in Table 6.1 for each method is shown in Table 6.3.
The best set of options tends to be dependent on the discipline, rather than the
particular matrix. The optimal parameters (preservation of symmetry, scaling, or
permutation to block triangular form) can usually be determined a priori. The time is











TABLE 6.3
Run time in seconds on a single processor of a CRAY YMP C-98-512Mw-8.

matrix     discipline        UMFPACK   MA48      MUPS      SSGETRF   GPLU
bcsstk08   structural eng.   0.342     0.416     0.708     0.588     21.044
bcsstk28                     1.564     6.771     1.204     2.896     312.059
bcsstk16                     3.825     19.363    2.057     4.682     510.914
plat1919   oceanography      0.595     1.816     0.415     failed    21.716
orsirr_1   petroleum eng.    0.196     0.317     0.209     0.193     5.297
sherman4                     0.099     0.125     0.134     0.133     0.903
pores2                       0.252     0.284     0.316     0.297     4.652
orsreg_1                     0.472     1.169     0.545     0.415     25.394
saylr4                       0.849     2.578     0.814     0.744     53.371
sherman3                     0.741     1.732     1.043     1.084     36.087
lns_3937   fluid flow        1.869     5.437     1.899     1.746     38.213
shyy41                       0.472     1.145     0.681     0.509     7.407
ex11mat                      91.571    413.496   17.781    21.072    failed
shyy161                      14.541    140.211   failed    8.843     failed
mcfe       astrophysics      0.343     0.426     0.324     0.399     7.065
fs_541_2   chem. kinetics    0.095     0.118     0.174     0.129     0.839
eris1176   electric power    0.121     0.134     0.140     0.264     6.356
gemat11                      0.457     0.519     0.569     0.795     45.036
bcspwr10                     0.559     0.519     0.637     0.553     40.123
psmigr_1   demography        31.364    29.077    322.335   232.047   failed
mahindas   economics         0.164     0.085     0.663     0.892     1.793
orani678                     1.630     0.775     52.091    169.187   68.729
finan512                     30.787    184.727   46.691    146.275   failed
radfr1     chemical eng.     0.206     0.254     0.214     0.316     3.480
lhr01                        0.430     0.426     1.026     1.170     6.381
west2021                     0.250     0.134     0.274     0.257     3.043
rdist3a                      0.445     1.354     1.359     1.050     38.655
extr1                        0.396     0.268     0.412     0.437     8.859
rdist2                       0.925     1.765     1.027     1.017     45.775
lhr04                        2.595     4.118     9.812     10.541    86.020
rdist1                       1.350     3.451     2.095     2.178     98.884
hydr1                        0.879     0.575     1.221     1.130     16.352
lhr71                        52.480    164.034   failed    174.761   failed
gre_1107   discrete simul.   0.307     0.329     0.378     0.380     7.161

in seconds, and includes both the analysis and factorization times. It does not include
the time to compute the scale factors, since we used the same scaling algorithm for
all methods. The fastest time for each matrix is shown in bold.
MUPS failed on the lhr71 and shyy161 matrices because of insufficient memory.
These are very ill-conditioned problems that cause MUPS to be unable, on numerical
grounds, to choose pivots selected by the analysis. This leads to an increase in fill-
in and subsequent failure. SSGETRF failed for plat1919 because it was unable to
determine a numerically acceptable pivot order (SSGETRF erroneously declared the
plat1919 matrix as "singular"). GPLU exceeded the time limit (which was one hour)
for five matrices.
UMFPACK is faster than the other four methods for 16 out of 34 matrices. It is
usually no more than about twice as slow as the fastest method, with the exception
of the ex11mat matrix (a large matrix with symmetric nonzero pattern).
However, UMFPACK normally requires more storage than the four other
methods, as shown in Table 6.4. This table lists the memory used for the runs
whose times are listed in Table 6.3. The smallest memory usage is shown in bold.
UMFPACK uses from 1.0 to 3.6 times the memory required by the method needing the











TABLE 6.4
Memory usage in megawords on a single processor of a CRAY YMP C-98-512Mw-8.

matrix     discipline        UMFPACK   MA48     MUPS     SSGETRF   GPLU
bcsstk08   structural eng.   0.229     0.295    0.187    0.170     0.524
bcsstk28                     2.162     3.043    1.674    1.236     2.023
bcsstk16                     5.014     6.407    3.160    2.262     3.625
plat1919   oceanography      0.669     0.890    0.363    failed    0.371
orsirr_1   petroleum eng.    0.180     0.230    0.152    0.108     0.203
sherman4                     0.078     0.085    0.068    0.060     0.063
pores2                       0.163     0.179    0.236    0.164     0.115
orsreg_1                     0.504     0.678    0.410    0.281     0.659
saylr4                       0.896     1.229    0.691    0.495     1.146
sherman3                     0.764     0.938    0.567    0.430     0.867
lns_3937   fluid flow        1.044     1.908    1.121    1.193     0.738
shyy41                       0.374     0.618    0.486    0.349     0.294
ex11mat                      56.743    53.552   22.941   15.981    failed
shyy161                      12.162    25.692   failed   7.570     failed
mcfe       astrophysics      0.267     0.430    0.310    0.202     0.219
fs_541_2   chem. kinetics    0.060     0.067    0.053    0.043     0.040
eris1176   electric power    0.135     0.161    0.102    0.085     0.136
gemat11                      0.369     0.345    0.369    0.321     0.315
bcspwr10                     0.354     0.321    0.284    0.272     0.227
psmigr_1   demography        20.527    20.229   25.638   16.943    failed
mahindas   economics         0.085     0.074    0.208    0.143     0.073
orani678                     0.592     0.573    4.069    8.014     0.725
finan512                     22.938    54.545   31.524   21.951    failed
radfr1     chemical eng.     0.136     0.142    0.108    0.096     0.107
lhr01                        0.240     0.238    0.435    0.287     0.141
west2021                     0.116     0.089    0.110    0.109     0.093
rdist3a                      0.725     0.697    0.466    0.407     0.502
extr1                        0.197     0.160    0.193    0.172     0.138
rdist2                       0.689     0.683    0.455    0.392     0.458
lhr04                        1.166     1.259    2.160    1.655     0.614
rdist1                       0.957     1.346    0.707    0.634     0.772
hydr1                        0.419     0.334    0.450    0.393     0.272
lhr71                        23.145    34.403   failed   35.268    failed
gre_1107   discrete simul.   0.228     0.205    0.352    0.280     0.252


least memory (usually GPLU for unsymmetric matrices, or SSGETRF for symmetric
matrices). The median ratio is 1.51. This is significant but
not high enough to limit the usefulness of UMFPACK on the larger matrices (the
shyy161, psmigr_1, finan512, and lhr71 matrices, for example). Part of the reason
why UMFPACK uses so much memory is that it stores the original matrix (both
pattern and numerical values) in both row and column form. The double storage of
the matrix slightly facilitates the scanning and assembly of entries from $A$ and $\mathcal{A}$ (see
Section 3.2), but we plan to remove the double storage of reals from the next release
of UMFPACK.
The Level 3 BLAS matrix-matrix multiply (SGEMM) routine attains a
performance of 901.5 Mflops for a 1024-by-1024 matrix multiply (stored in a 1025-by-
1025 array) on one processor of a CRAY YMP C-98-512Mw-8. It reaches this peak
performance quite quickly (810.5 Mflops for 64-by-64 matrices, and 887.7 Mflops for
128-by-128 matrices). UMFPACK achieved its peak performance for the ex11mat
matrix. MUPS and SSGETRF obtained 441.5 and 327.0 Mflops, respectively, for
the same matrix, but both performed much less work than UMFPACK. The highest
performance obtained by UMFPACK on a matrix for which it was the fastest was the
finan512 matrix (359.0 Mflops). The peak performance of MA48 was 375.1 Mflops, for
the psmigr_1 matrix. The peak performance of GPLU was 1.9 Mflops (for the bcsstk08
matrix), primarily because its innermost loops do not readily vectorize (even with
the appropriate compiler directives). This is not a fundamental limitation of GPLU,
however. Ng [31] reports that GPLU attains a much higher relative performance on
an IBM RS/6000. For example, GPLU is faster than UMFPACK on the RS/6000 for
one of our test matrices, although the fastest BLAS were not used in his comparisons.
Scaling the matrix had little effect on the factorization time and solution quality
(as measured by the relative residual) for half the matrices in our test set. Scaling
improved the results for the bcsstk08 and psmigr_1 matrices, and for the hydr1, radfr1,
and rdist matrices, for all codes except MUPS and SSGETRF.
Permuting the matrix to block upper triangular form (BTF) usually reduces
execution time and fill-in when the BTF is nontrivial. GPLU always improves,
however trivial the BTF is.
Symmetry is usually worth preserving if the pattern is symmetric, or moderately
so. One class of matrices for which this is not so are those with zeros on the diagonal.
We note that none of these methods use 2-by-2 pivots [10] and so are unable to preserve
symmetry if there are zeros on the diagonal that are not filled by earlier pivot steps.
If MA48 is unable to find stable diagonal pivots when the diagonal pivoting option is
requested, it immediately switches to full code. This early switch may cause a large
increase in the storage required. It prevented us from performing diagonal pivoting with
MA48 on one of the test matrices. For the same matrix, UMFPACK selects off-diagonal pivot
entries when the symmetry-preserving option is enabled.
These performance statistics include both analysis and numerical factorization
times. All of the five codes listed have factorize-only options that are usually much
faster than the combined analysis+factorization phase(s), and indeed the design
criterion for some codes (for example MA48) was to minimize factorize time even
if this caused an increase in the initial analyse time. The factorize-only option in
UMFPACK version 1.0 is not as fast as it could be since most of our library-code
development effort has gone towards the combined analysis-factorize phase. The
parallel factorize-only code [24, 26, 27] is not included in UMFPACK version 1.0.
Overall, these results show that the unsymmetric-pattern multifrontal method is
a competitive algorithm when compared with both the classical multifrontal approach
(MUPS and SSGETRF) and algorithms based on more conventional sparse matrix
data structures (MA48 and GPLU).

Acknowledgments. We thank Patrick Amestoy, Mario Arioli, Michel Daydé,
Theodore Johnson, and Steve Zitney for many helpful discussions, Joseph Liu for
providing us a copy of the MMD ordering code, and John Gilbert for providing
us a copy of the GPLU factorization code. Many researchers provided us with
large unsymmetric matrices, a class of matrices that is weak in Release 1 of the
Harwell/Boeing collection. These matrices are available via anonymous ftp to
ftp.cis.ufl.edu in the pub/umfpack/matrices directory.


REFERENCES

[1] P. R. AMESTOY AND I. S. DUFF, Vectorization of a multiprocessor multifrontal code, Int. J.
Supercomputer Appl., 3 (1989), pp. 41-59.
[2] P. R. AMESTOY AND I. S. DUFF, MUPS: a parallel package for solving sparse unsymmetric
sets of linear equations, Tech. Report (to appear), CERFACS, Toulouse, France, 1994.












[3] CRAY RESEARCH, INC., SSGETRF factors a real sparse general matrix, (online manual),
Cray Research, Inc., Eagan, Minn., May 1993.
[4] T. A. DAVIS, Users' guide to the unsymmetric-pattern multifrontal package (UMFPACK),
Tech. Report TR-93-020, CIS Dept., Univ. of Florida, Gainesville, FL, 1993. For a copy of
UMFPACK, send e-mail to netlib@ornl.gov with the one-line message send umfpack.shar
from linalg.
[5] T. A. DAVIS AND I. S. DUFF, Unsymmetric-pattern multifrontal methods for parallel sparse
LU factorization, Tech. Report TR-91-023, CIS Dept., Univ. of Florida, Gainesville, FL,
1991.
[6] J. J. DONGARRA, J. J. Du CROZ, I. S. DUFF, AND S. HAMMARLING, A set of Level 3 Basic
Linear Algebra Subprograms, ACM Transactions on Mathematical Software, 16 (1990),
pp. 1-17.
[7] J. J. DONGARRA AND E. GROSSE, Distribution of mathematical software via electronic mail,
Comm. ACM, 30 (1987), pp. 403-407.
[8] I. S. DUFF, On algorithms for obtaining a maximum transversal, ACM Transactions on
Mathematical Software, 7 (1981), pp. 315-330.
[9] I. S. DUFF, Parallel implementation of multifrontal schemes, Parallel Computing, 3 (1986),
pp. 193-204.
[10] I. S. DUFF, A. M. ERISMAN, AND J. K. REID, Direct Methods for Sparse Matrices, London:
Oxford Univ. Press, 1986.
[11] I. S. DUFF, R. G. GRIMES, AND J. G. LEWIS, Users' guide for the Harwell/Boeing sparse
matrix collection (Release 1), Tech. Report RAL-92-086, Rutherford Appleton Laboratory,
Didcot, Oxon, England, Dec. 1992.
[12] I. S. DUFF AND J. K. REID, An implementation of Tarjan's algorithm for the block
triangularization of a matrix, ACM Trans. Math. Softw., 4 (1978), pp. 137-147.
[13] I. S. DUFF AND J. K. REID, Some design features of a sparse matrix code, ACM Trans. Math.
Softw., 5 (1979), pp. 18-35.
[14] I. S. DUFF AND J. K. REID, The multifrontal solution of indefinite sparse symmetric linear
systems, ACM Transactions on Mathematical Software, 9 (1983), pp. 302-325.
[15] I. S. DUFF AND J. K. REID, The multifrontal solution of unsymmetric sets of linear equations,
SIAM J. Sci. Statist. Comput., 5 (1984), pp. 633-641.
[16] I. S. DUFF AND J. K. REID, MA48, a Fortran code for direct solution of sparse unsymmetric
linear systems of equations, Tech. Report RAL-93-072, Rutherford Appleton Laboratory,
Didcot, Oxon, England, Oct. 1993.
[17] S. C. EISENSTAT AND J. W. H. LIU, Exploiting structural symmetry in unsymmetric sparse
symbolic factorization, SIAM J. Matrix Anal. Appl., 13 (1992), pp. 202-211.
[18] S. C. EISENSTAT AND J. W. H. LIU, Exploiting structural symmetry in a sparse partial
pivoting code, SIAM J. Sci. Statist. Comput., 14 (1993), pp. 253-257.
[19] A. GEORGE AND J. W. H. LIU, Computer Solution of Large Sparse Positive Definite Systems,
Englewood Cliffs, New Jersey: Prentice-Hall, 1981.
[20] A. GEORGE AND J. W. H. LIU, The evolution of the minimum degree ordering algorithm,
SIAM Review, 31 (1989), pp. 1-19.
[21] A. GEORGE AND E. NG, An implementation of Gaussian elimination with partial pivoting for
sparse systems, SIAM J. Sci. Statist. Comput., 6 (1985), pp. 390-409.
[22] J. R. GILBERT AND J. W. H. LIU, Elimination structures for unsymmetric sparse LU factors,
SIAM Journal on Matrix Analysis and Applications, 14 (1993), pp. 334-354.
[23] J. R. GILBERT AND T. PEIERLS, Sparse partial pivoting in time proportional to arithmetic
operations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 862-874.
[24] S. M. HADFIELD, On the LU Factorization of Sequences of Identically Structured Sparse
Matrices within a Distributed Memory Environment, PhD thesis, Computer and
Information Sciences Department, University of Florida, Gainesville, Florida, 1994. (Also
Univ. of Fl. tech report TR-94-019).
[25] S. M. HADFIELD AND T. A. DAVIS, Lost pivot recovery for an unsymmetric-pattern multifrontal
method, SIAM J. Matrix Analysis and Applications, (1994). (submitted. Also Univ. of Fl.
tech report TR-94-029).
[26] S. M. HADFIELD AND T. A. DAVIS, A parallel unsymmetric-pattern multifrontal method,
SIAM J. Sci. Computing, (1994). (submitted. Also Univ. of Fl. tech report TR-94-028).
[27] S. M. HADFIELD AND T. A. DAVIS, Potential and achievable parallelism in the unsymmetric-
pattern multifrontal LU factorization method for sparse matrices, in Proceedings of the
Fifth SIAM Conf. on Applied Linear Algebra, Snowbird, Utah, 1994, SIAM. (Also Univ.
of Fl. tech report TR-94-006).
[28] J. W. H. LIU, Modification of the minimum-degree algorithm by multiple elimination, ACM
Trans. Math. Softw., 11 (1985), pp. 141-153.
[29] J. W. H. LIU, The multifrontal method for sparse matrix solution: Theory and practice,
SIAM Review, 34 (1992), pp. 82-109.
[30] H. M. MARKOWITZ, The elimination form of the inverse and its application to linear
programming, Management Science, 3 (1957), pp. 255-269.
[31] E. NG, A comparison of some direct methods for solving sparse nonsymmetric linear systems,
in Proceedings of the Fifth SIAM Conf. on Applied Linear Algebra, Snowbird, Utah, 1994,
SIAM.
[32] Z. ZLATEV, On some pivotal strategies in Gaussian elimination by sparse technique, SIAM
Journal on Numerical Analysis, 17 (1980), pp. 18-30.
Note: all University of Florida technical reports in this list of references are
available in postscript form via anonymous ftp to ftp.cis.ufl.edu in the directory
cis/tech-reports.



