COMPUTING THE SPARSE INVERSE SUBSET:
AN INVERSE MULTIFRONTAL APPROACH
YOGIN E. CAMPBELL* AND TIMOTHY A. DAVIS†
Technical Report, Computer and Information Sciences Department,
University of Florida, Gainesville, FL, 32611 USA. October, 1995.
Key words. Sparse matrices, symmetric multifrontal method, supernode,
Takahashi equations, inverse multifrontal method, inverse frontal matrix, inverse
contribution matrix, inverse assembly tree, Zsparse.
AMS (MOS) subject classifications. 05C50, 65F50.
Abbreviated title: Multifrontal Sparse Inverse Subset
Abstract.
We present the symmetric inverse multifrontal method for computing the sparse inverse subset
of symmetric matrices. The symmetric inverse multifrontal approach uses an equation presented by
Takahashi, Fagan, and Chin to compute the numerical values of the entries of the inverse, and an
inverted form of the symmetric multifrontal method of Duff and Reid to guide the computation. We
take advantage of related structures that allow the use of dense matrix kernels (level 2 and level 3 BLAS)
in the computation of this subset. We discuss the theoretical basis for this new algorithm, give
numerical results for a serial implementation, and demonstrate its performance on a CRAY C98.
1. Introduction. We address the problem of computing the sparse inverse
subset (Zsparse) of a symmetric matrix A, using the LDU factorization of A and an
equation relating the LDU factors to Z = A^{-1} presented by Takahashi, Fagan, and
Chin in [19]. The sparse inverse subset is defined as the set of entries in the upper
triangular part of Z in locations given by the nonzero entries in the factorized matrix;
that is, Zsparse = {zij | uij ≠ 0} ⊆ Z.
The entries in Zsparse are useful in many practical applications such as
approximating the condition number of symmetric positive-definite matrices, and
estimating variances of the fitted parameters in the least-squares data-fitting problem.
Prior results on computing Zsparse and other related inverse subsets, based on the
two equations of Takahashi, Fagan, and Chin, are found in the articles by Erisman
and Tinney [13], and Betancourt and Alvarado [2]. Erisman and Tinney in [13] proved
that both Zsparse and the subset of entries on the diagonal of Z can be evaluated
without computing any inverse entry from outside of Zsparse. In [2] Betancourt and
Alvarado give a parallel algorithm to compute Zsparse and the full inverse. Neither of
these articles considered the use of dense matrix operations such as matrix-vector
or matrix-matrix multiplications in the computations. Our numerical results show
that there is a significant improvement in the performance of the Zsparse algorithm
when the level 2 (matrix-vector) and level 3 (matrix-matrix) BLAS [8] operations are
used in the implementation.
In this paper we develop a new method, the symmetric inverse multifrontal method
[3], to compute Zsparse. We use one of the equations from [19] (henceforth referred to
as the Takahashi equation(s)) to compute the numerical values of the inverse elements,
and an inverted form of the symmetric multifrontal method of Duff and Reid [12] to
* email: yec@cis.ufl.edu.
† Computer and Information Science and Engineering Department, University of Florida,
Gainesville, Florida, USA. (904) 392-1481, email: davis@cis.ufl.edu. Technical reports and matrices
are available via the World Wide Web at http://www.cis.ufl.edu/davis, or by anonymous ftp at
ftp.cis.ufl.edu:cis/techreports.
guide the computation and take advantage of related structures that allow the use of
dense matrix kernels in the innermost loops. We show that the results in [13] and [2]
can be easily derived using this formulation.
In the multifrontal method the frontal matrix and assembly tree are the two key
constructs. We introduce two similar constructs, the inverse frontal matrix and the
inverse assembly tree, and discuss their relationship to the frontal matrix and assembly
tree, respectively. We show how the computation of Zsparse can be formulated in
terms of inverting inverse frontal matrices using the inverse assembly tree to specify
the data dependencies.
An outline of the paper follows. In Section 2, we discuss relevant aspects of one of
the Takahashi equations and give a small example showing its use. We briefly review
the symmetric multifrontal factorization method in Section 3. For a more detailed
treatment of multifrontal factorization we refer the reader to [12, 17]. The inverse
multifrontal algorithm is developed in Sections 4 and 5. In Section 4, the fundamentals
of the inverse multifrontal approach are presented, culminating in an algorithm to
compute Zsparse based on matrix-vector operations. This algorithm is extended
in Section 5 to include matrix-matrix operations based on inverting supernodes.
Performance results from an implementation of a block-structured form of the Zsparse
algorithm on the CRAY C98 are discussed in Section 6. Conclusions and avenues for
future work are given in Section 7.
2. The Takahashi equation. Takahashi, Fagan, and Chin in [19] presented two
equations for computing the inverse of a general matrix A, using its LDU factorization.
The equations are
(2.1)  Z = D^{-1}L^{-1} + (I - U)Z
and
(2.2)  Z = U^{-1}D^{-1} + Z(I - L),
where Z = A^{-1}, and A = LDU (L, U, and D are unit lower triangular, unit upper
triangular, and diagonal matrices, respectively). When A is (numerically) symmetric,
U = L^T and Z is symmetric. In this paper we consider symmetric matrices and,
therefore, only use Equation (2.1). (For notational convenience we continue to use
U instead of L^T.) We refer to Equations (2.1) and (2.2) as the Takahashi equations.
The matrices involved in Equation (2.1) are illustrated in Fig. 2.1. Shaded areas in
this figure represent nonzero elements.
The following are some useful observations concerning the Takahashi equation
and the matrices involved in it. The product D^{-1}L^{-1} is a lower triangular matrix
with (D^{-1}L^{-1})ii = (D^{-1})ii = di^{-1}. This is used to avoid computing L^{-1} when evaluating
elements on the diagonal and upper triangular part of Z. The matrix (I - U) is
strictly upper triangular since U is unit upper triangular. Computationally the most
useful feature of Equation (2.1) is that we can use it to compute Z (more precisely the
upper triangular part of Z) without having to first find L1. Using the two previous
observations, and restricting the inverse elements to the diagonal and upper triangular
part of Z, Equation (2.1) can be restated as follows:
(2.3)  zij = δij di^{-1} - Σ_{k>i} uik zkj,   i ≤ j
(δij is the Kronecker delta.)
FIG. 2.1. Illustration of the Takahashi equation, Equation (2.1)
where the notation yij = (Y)ij is used. The elements of Z can be computed in reverse
Crout order [9]. That is, evaluate in order the elements in rows n, n-1, ..., 1. (In
each row we only need to evaluate entries in the upper triangular part of Z.) When
computing the entries of a given row the order in which the diagonal entry is computed
(relative to the other entries in the row) is important because it depends on a subset
of the entries in the row. The other entries in the row are independent of each other
and can therefore be computed in any order. Erisman and Tinney show in [13] that
the sparse subset can be computed in terms of U, and other inverse elements from
within the sparse subset only. This important result allows one to safely ignore all
other inverse elements when computing the sparse subset.
2.1. An Example. We use a small example to illustrate the use of
Equation (2.3) in computing inverse elements. Consider the symmetric matrix A and
its corresponding filled matrix (L+U) shown in Equations (2.4) and (2.5), respectively.
The x's represent the original entries of A, while the O's are entries due to fill-in.
(2.4)  A =
       | x  .  x  x |
       | .  x  x  . |
       | x  x  x  . |
       | x  .  .  x |

(2.5)  L + U =
       | x  .  x  x |
       | .  x  x  . |
       | x  x  x  O |
       | x  .  O  x |
Using Equation (2.3) and a reverse Crout computational order, the set of
Equations (2.6) gives the sequence used to compute the elements of the sparse subset.
(2.6)
    z44 = d44^{-1}
    z34 = -u34 z44
    z33 = d33^{-1} - u34 z43
    z23 = -u23 z33
    z22 = d22^{-1} - u23 z32
    z14 = -u13 z34 - u14 z44
    z13 = -u13 z33 - u14 z43
    z11 = d11^{-1} - u13 z31 - u14 z41
(Note that by symmetry zij = zji.) Observe that this is only a partial ordering. For
example, the entries z11, z13, and z14 can be computed in any order with respect to
z22 and z23. We can also use Equation (2.3) to evaluate elements outside the sparse
subset, for example: z24 = -u23 z34 - u24 z44 and z12 = -u13 z32 - u14 z42.
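The recurrence in Equation (2.3) is small enough to check end-to-end. The sketch below is a minimal illustration; the numerical values of A are our own assumptions, chosen to match the nonzero pattern of Equation (2.4) and to keep A positive definite. It factorizes A as LDL^T and then applies the reverse Crout recurrence, reproducing the full inverse:

```python
import numpy as np

# Symmetric matrix with the nonzero pattern of Equation (2.4); the values
# are illustrative assumptions (diagonally dominant, hence factorizable).
A = np.array([[4.0, 0.0, 1.0, 1.0],
              [0.0, 4.0, 1.0, 0.0],
              [1.0, 1.0, 4.0, 0.0],
              [1.0, 0.0, 0.0, 4.0]])
n = A.shape[0]

# Dense LDL^T factorization (U = L^T); a sparse code would exploit the pattern.
L = np.eye(n)
d = np.zeros(n)
W = A.copy()
for k in range(n):
    d[k] = W[k, k]
    L[k+1:, k] = W[k+1:, k] / d[k]
    W[k+1:, k+1:] -= np.outer(L[k+1:, k], W[k, k+1:])
U = L.T

# Takahashi recurrence, Equation (2.3), in reverse Crout order: rows n..1,
# within each row the diagonal entry is computed last.
Z = np.zeros((n, n))
for i in range(n - 1, -1, -1):
    for j in range(n - 1, i - 1, -1):
        s = (1.0 / d[i]) if i == j else 0.0
        s -= U[i, i+1:] @ Z[i+1:, j]
        Z[i, j] = Z[j, i] = s

print(np.allclose(Z, np.linalg.inv(A)))
```

Here the recurrence is applied to every entry of the upper triangle (as in the z24 and z12 examples above), not just to Zsparse.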
3. The symmetric multifrontal approach. The direct method for solving the
linear system of equations Ax = b (where A is symmetric) using the LU or Cholesky
factorization of A is usually a two-step algorithm. In the first step, the coefficient
matrix A is factorized into LU (or LL^T if Cholesky factorization is used), where L
and U are lower and upper triangular matrices, respectively. The second step involves
solving the two triangular systems Ly = b for y (forward substitution), and Ux = y
for x (backward substitution). The symmetric multifrontal method of Duff and Reid
[11, 12] (see also Liu [17]) is an efficient method to compute the LU factorization of
A, especially on machines with memory hierarchy. This factorization method is based
on the use of dense matrix kernels in the innermost loops. Dense submatrices called
frontal matrices are formed as the multifrontal algorithm progresses. One or more
steps of Gaussian elimination are performed on each of these frontal matrices.
In general, the symmetric multifrontal algorithm consists of a symbolic analysis
phase and a numerical factorization phase. In the analysis phase a fill-reducing
pivot ordering algorithm (such as the approximate minimum degree algorithm [1] or
the minimum degree algorithm [14]) is used to establish the pivot order and data
structures. In addition, the precedence relationships among the frontal matrices
that are used in the numerical phase are established and given by the assembly or
elimination tree [9, 16]. In this phase only the pattern of A is used. The numerical
work to actually compute the LU factors is done in the numerical factorization phase.
The assembly tree is used to guide the computation in this phase.
We use the example matrix A shown in Equation (3.1) to highlight the basic
multifrontal constructs. The p's and x's represent original entries of A, while the O's
are entries due to fill-in.
(3.1)  A =
       | p1  x   .   .   .   .   .   x   x  |
       | x   p2  .   .   .   .   .   x   x  |
       | .   .   p3  x   .   .   x   .   .  |
       | .   .   x   p4  .   .   x   x   .  |
       | .   .   .   .   p5  x   .   x   .  |
       | .   .   .   .   x   p6  .   x   x  |
       | .   .   x   x   .   .   p7  .   x  |
       | x   x   .   x   x   x   .   p8  .  |
       | x   x   .   .   .   x   x   .   p9 |
(3.2)  L + U =
       | p1  x   .   .   .   .   .   x   x  |
       | x   p2  .   .   .   .   .   x   x  |
       | .   .   p3  x   .   .   x   .   .  |
       | .   .   x   p4  .   .   x   x   .  |
       | .   .   .   .   p5  x   .   x   .  |
       | .   .   .   .   x   p6  .   x   x  |
       | .   .   x   x   .   .   p7  O   x  |
       | x   x   .   x   x   x   O   p8  O  |
       | x   x   .   .   .   x   x   O   p9 |
FIG. 3.1. General structure of a frontal matrix: the pivot, the off-diagonal pivot row, the off-diagonal pivot column, and the contribution matrix
We assume that the matrix has already been permuted so that the pivots lie on the
diagonal: p1, ..., p9. Its fill structure is also determined and shown in Equation (3.2).
In the simplest case (no supernodes) nine frontal matrices, F1, ..., F9, corresponding
to the nine pivots, p1, ..., p9, are used in the LU factorization of A. The row/column
index pattern of frontal matrix Fi is defined by the row/column index pattern of pivot
row/column i of the filled matrix. Let Ui be the column pattern of Fi. Then,
Ui = {j | uij ≠ 0, j ≥ i}.
We find it convenient to partition the column index set Ui into a pivotal subset Ui'
and a nonpivotal subset Ui'', where
(3.3)  Ui' = {i},
       Ui'' = {j | uij ≠ 0, j > i}.
The row pattern Ci is defined and partitioned similarly. For symmetric matrices
Ci = Ui. For our example matrix, U2 = {2, 8, 9} and U7 = {7, 8, 9}.
The partitioning of the index set Ui induces a natural partitioning of the frontal
matrix Fi into four parts: the pivot element, the off-diagonal pivot row, the off-diagonal
pivot column, and the contribution matrix
Ci = {(Fi)jk | j, k ∈ Ui''}.
Figure 3.1 illustrates this general partitioning scheme.
The assembly tree can be constructed using the parent-child relationship given
by [16]
parent(i) = min{ j | lji ≠ 0, j > i } = min{ j | j ∈ Ui'' }.
Figure 3.2 shows the assembly tree for the filled matrix in Equation (3.2). The
node numbers correspond to the labels of frontal matrices and the arrows specify
the dependency relationships among frontal matrices. For example, frontal matrix
F8 must be factored after F2 and F7 because F2 and F7 contain update terms to the
pivot row and pivot column of F8.
FIG. 3.2. Assembly tree for the matrix in Equation (3.2)
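As a quick illustration of the parent formula above, the sketch below computes the parent of every node from a filled upper triangular pattern. The pattern used is our own 0-based reading of the filled matrix of Equation (3.2), so it is an assumption to the extent that the printed example is garbled in the source:

```python
# parent(i) = min{ j : l_ji != 0, j > i } = min(U_i''), computed from the
# filled pattern. filled_upper[i] lists the nonzero columns j > i of row i
# of L+U (0-based; our reading of the Equation (3.2) example).
filled_upper = {
    0: [1, 7, 8], 1: [7, 8], 2: [3, 6], 3: [6, 7],
    4: [5, 7], 5: [7, 8], 6: [7, 8], 7: [8], 8: [],
}

def assembly_tree_parents(upper):
    """parent[i] is the smallest j in U_i'' (None for the root)."""
    return {i: (min(cols) if cols else None) for i, cols in upper.items()}

parents = assembly_tree_parents(filled_upper)
print(parents[1], parents[6])  # both 7: F8 is factored after F2 and F7
```

With this pattern, node 8 (pivot p9) is the root of the assembly tree and node 7 (pivot p8) has the children 1, 5, and 6.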
4. The symmetric inverse multifrontal method. We introduce the basic
concepts and constructs of the symmetric inverse multifrontal method by considering
the computation of the sparse inverse subset, Zsparse. We show that the computation
of Zsparse can be formulated in terms of constructs similar to those used in the
multifrontal method. These include the inverse frontal matrix and inverse assembly
tree.
4.1. The inverse frontal matrix. Let F be the set of frontal matrices used
in the multifrontal LDU factorization of a (numerically) symmetric matrix A. (In
this case U = L^T, but we continue to use U instead of L^T for notational convenience.)
For every frontal matrix Fi ∈ F, we define a corresponding inverse frontal matrix
F̄i, with row and column patterns identical to those of Fi. That is, F̄i has row index
pattern Ci and column index pattern Ui. We use the same notation, Ci and Ui, to
denote the index patterns of both Fi and F̄i. Although Fi and F̄i have the same
structure, we define an element of F̄i to be an element of the inverse matrix Z, i.e.,
(F̄i)kj = zkj ∈ Z. The set of elements defined by F̄i is thus a subset of Z. Let F̄
represent the set of corresponding inverse frontal matrices. We partition an inverse
frontal matrix F̄i into an inverse pivot row, an inverse pivot column, and an inverse
contribution matrix in much the same way that its corresponding frontal matrix Fi
is partitioned. The general partitioned structure of the frontal and inverse frontal
matrices is illustrated in Figure 4.1.
Let Z̄i be the set of elements in the inverse pivot row of F̄i. Then,
Z̄i = {zij | j ∈ Ui, j ≥ i}.
It is easy to show, by the equivalence in structure of Fi and F̄i, that the sparse inverse
subset (Zsparse) is given by
(4.1)  Zsparse = ∪_{i ∈ F̄} Z̄i = {zij | uij ≠ 0, j ≥ i}.
(Note that since Zsparse is symmetric we only consider entries in its upper triangular
part.)
Equation (4.1) states that the set of entries in Zsparse is the union of the sets of
entries in the inverse pivot rows of all inverse frontal matrices in F̄. Consequently, to
FIG. 4.1. Equivalence of the structures of Fi and F̄i
FIG. 4.2. Corresponding inverse assembly tree
compute Zsparse we in effect need to compute the entries in the inverse pivot rows
of every inverse frontal matrix. We refer to the process of computing the entries in
the inverse pivot row of an inverse frontal matrix as the "inversion" of the
inverse frontal matrix.
4.2. The inverse assembly tree. As mentioned in our review of the symmetric
multifrontal method (Section 3), an assembly tree is used to guide the factorization
process. For any given matrix A, we define the corresponding inverse assembly
tree to have a structure identical to that of the assembly tree used in the factorization of A,
except that the direction of the parent-child dependency arrows is reversed and node
labels refer to inverse frontal matrices. As we shall show, the inverse assembly tree
is used to guide the inversion process in a way analogous to how the assembly tree is
used to guide the factorization process. Figures 3.2 and 4.2 show the assembly and
inverse assembly trees, respectively, for the filled matrix in Equation (3.2). Note that
the flow of data, represented by the dependency arrows, is from parent to children
in the inverse assembly tree (Figure 4.2) as opposed to the usual children-to-parent
flow in the assembly tree (Figure 3.2). As discussed later, this "dependency reversal"
captures the top-down traversal of the inverse assembly tree used in the computation
of Zsparse, compared to the bottom-up assembly tree traversal used in the factorization
algorithm. There is also an inverse assembly process from parent to child, instead of
the child-to-parent assembly process in the multifrontal factorization method. The
FIG. 4.3. The frontal and inverse frontal constructs
inverse assembly tree can also be used in a parallel algorithm to compute Zsparse.
We report results on such an algorithm in [5].
In Figure 4.3 we give a sidebyside illustration of the multifrontal and inverse
multifrontal constructs for the matrix in Equation (3.2). Note that, whereas assembly
in the multifrontal method takes place from the contribution matrix of a child node
(the # positions in F1) to the parent node (the # positions in F2), inverse assembly
is from the parent node (the + positions in F̄2) to the inverse contribution matrix of
the child node (the + positions in F̄1).
4.3. The theoretical development. Consider using Equation (2.3) to
compute entries of Zsparse. Let
(4.2)  Tij = {uik | uik ≠ 0, k > i}.
Since Z is usually full, even though A may be sparse, the set Tij specifies the set of
nonzero uik zkj terms in the summation. These are the values of the U factor that
actually "contribute" to the value of zij. The following is the key result that relates
Tij to the index pattern of frontal matrix Fi. Claim:
(4.3)  Tij = {uik | k ∈ Ui''}.
Proof: By definition (see Equation (3.3)), {k | uik ≠ 0, k > i} = Ui''. Substituting
this result into Equation (4.2) gives the result. □ In effect, Equation (4.3) implies that
in computing an inverse element, the k indices in the summation in Equation (2.3)
are constrained to the nonpivotal index set of frontal matrix Fi. We refer to Tij as
the direct u-dependency set of zij. The direct z-dependency set of zij, Zij, is defined
similarly: Zij = {zkj | uik ≠ 0, k > i}. As for the direct u-dependency set, this can be
rewritten in terms of the index set of the inverse frontal matrix F̄i:
(4.4)  Zij = {zkj | k ∈ Ui''}.
Equation (4.4) expresses the set of inverse values directly used to compute zij in terms
of the nonpivotal index set of the inverse frontal matrix F̄i. (Here we made use
of the equivalence in index patterns of Fi and F̄i.) Equations (4.3) and (4.4) state
two of the central results of the inverse multifrontal method.
Using the Tij and Zij sets as row and column vectors, respectively, we can rewrite
Equation (2.3) in the inner-product form
(4.5)  zij = δij di^{-1} - Tij Zij,   i ≤ j.
We now discuss five important results that follow immediately from the
application of Equations (4.3)-(4.5) and the definition of the inverse frontal matrix.
First, note that the direct u-dependency set Tij is independent of the index j, since
the set defined by Equation (4.3) is independent of j. (This is true even for zij ∉
Zsparse [4].)
Second, the direct z-dependency set of an inverse entry zij ∈ Zsparse, i ≠ j, is
identical to column j of the inverse contribution block of F̄i. This can be easily
proved. Proof: Recall that by definition, the set of entries in column j of the inverse
contribution block of F̄i is given by
(C̄i)*j = {zkj | k ∈ Ui''},
where j ∈ Ui''. The fact that (C̄i)*j = Zij follows trivially from the definition of Zij
given by Equation (4.4). □ (Note that for i = j, Zij is not in the inverse contribution
block.) Figure 4.4 illustrates the relationship among the frontal, inverse frontal and
direct dependency vectors.
The third result concerns the relationship between the direct z-dependency set
of the diagonal entry and the other entries in the inverse pivot row. We claim that
the direct z-dependency set of zii is equal to the set of off-diagonal inverse pivot row
entries of F̄i. Proof: Recall that the set of off-diagonal inverse pivot row entries of F̄i
is given by Z̄i' = {zik | k ∈ Ui''}. The direct z-dependency set of zii is obtained from
Equation (4.4) and is given by Zii = {zki | k ∈ Ui''}. Substituting zki = zik (by
symmetry of Z) we get the result Zii = {zik | k ∈ Ui''} = Z̄i'. □ This implies that
zii depends directly on all the other entries in the inverse pivot row of F̄i. From an
FIG. 4.4. Relating Fi, F̄i, Tij, and Zij
implementation perspective this of course means that zii should be the last element
computed when F̄i is inverted.
The fourth result makes the connection between the direct z-dependency set of
the entire off-diagonal inverse pivot row and the inverse contribution matrix. Let
direct(Z̄i') be the direct z-dependency set of all entries in the off-diagonal inverse pivot
row of F̄i, i.e., direct(Z̄i') = ∪_{j ∈ Ui''} Zij. Then the following is true: direct(Z̄i') = C̄i.
That is, the direct z-dependency set of the set of off-diagonal entries of an inverse
pivot row is equal to the set of entries in the inverse contribution block. Proof: From
Equation (4.4),
direct(Z̄i') = ∪_{j ∈ Ui''} Zij = ∪_{j ∈ Ui''} {zkj | k ∈ Ui''} = {zkj | k, j ∈ Ui''}.
The last set expression is by definition equal to C̄i. □ From here on we use Z̄i'' for
the inverse contribution matrix of F̄i instead of the C̄i notation used earlier.
The fifth and final result concerns the direct z-dependency set of all entries in
Zsparse. Erisman and Tinney in [13] showed that the only inverse entries that an
entry in Zsparse depends on also belong to Zsparse. In our terminology, this is
equivalent to saying that the direct z-dependency set of Zsparse is equal to Zsparse.
We prove this result next. Let direct(Zsparse) be the direct z-dependency set of
Zsparse; we claim that direct(Zsparse) = Zsparse. Proof: By Equation (4.1),
direct(Zsparse) = direct(∪_{i ∈ F̄} Z̄i) = ∪_{i ∈ F̄} direct(Z̄i),
where Z̄i is the set of elements in the inverse pivot row of F̄i and direct(Z̄i) is the direct
z-dependency set for the entries in Z̄i. Using the third and fourth results just discussed,
(4.6)  direct(Z̄i) = {zij | j ∈ Ui''} ∪ {zjk | j, k ∈ Ui''}.
Note that the set {zij | j ∈ Ui''} = Z̄i'; clearly, these entries all belong
to Zsparse. The set {zjk | j, k ∈ Ui''} = Z̄i'', the inverse contribution block of F̄i.
Every element in the inverse contribution block of F̄i is also an element of Zsparse for
the following simple reason. Since the structure of F̄i is equivalent to the structure
of its corresponding frontal matrix Fi, for every zjk in the inverse contribution block
of F̄i there is a nonzero update to the entry ujk in the factor matrix U (or ljk in
the factor matrix L if j > k); this obviously means that ujk (or ljk) is nonzero. By
definition zjk ∈ Zsparse if ujk is nonzero; therefore every entry of Z̄i'' must also be an
entry of Zsparse. This is true for every inverse frontal matrix, therefore
direct(Zsparse) = ∪_{i ∈ F̄} direct(Z̄i) = Zsparse. □
It is also easy to show that the direct z-dependency set of the sparse subset consisting
of entries from the diagonal of Z is equal to Zsparse. Proof: Recall that
direct(zii) = {zii} ∪ Zii = Z̄i. (We have included zii in its own z-dependency set to
make the proof that follows simpler.) Therefore,
∪_{i ∈ F̄} direct(zii) = ∪_{i ∈ F̄} Z̄i = Zsparse. □
4.4. The matrix-vector form of the Takahashi equation. The fourth
result discussed in the previous subsection shows that the computation of the off-diagonal
entries of inverse pivot row i directly involves only the entries in the inverse
contribution matrix of F̄i. We also know by the second result that these dependencies
are "uncoupled," in the sense that zij depends only on column j of the inverse
contribution matrix of F̄i. This allows us to restate Equation (4.5) in matrix-vector
form for the special case where we are only concerned with computing the entries in
the inverse pivot row of F̄i ∈ F̄:
(4.7)  Z̄i' = -Ti Z̄i''
       zii = di^{-1} - Ti (Z̄i')^T,
where we have dropped the unnecessary j subscript on the direct u-dependency vector Ti.
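A single node inversion per Equation (4.7) is one matrix-vector product plus one dot product. The helper below is a minimal sketch under our own naming (d_i is the pivot, t the u-dependency row vector over U_i'', Zc the assembled inverse contribution matrix):

```python
import numpy as np

def invert_node(d_i, t, Zc):
    """Equation (4.7): off-diagonal inverse pivot row, then the diagonal.

    t  : 1D array of u_ik for k in U_i'' (the direct u-dependency vector T_i)
    Zc : |U_i''| x |U_i''| inverse contribution matrix (already assembled)
    """
    zrow = -t @ Zc                # level 2 BLAS matrix-vector product
    z_ii = 1.0 / d_i - t @ zrow   # diagonal entry, computed last
    return z_ii, zrow
```

For example, for the order-3 tridiagonal matrix A = tridiag(1, 2, 1), node 2 has d_2 = 1.5, t = [2/3], and Zc = [[0.75]], giving z_22 = 1 and z_23 = -0.5, which match the corresponding entries of A^{-1}.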
4.5. Computational cost. Observe that |Tij| = |Zij| = |Ui''|, where |x| is the
number of elements in the set/vector/matrix x. By Equation (4.5) it is easy to see that
the direct cost, in terms of multiply-add pairs of floating point operations, to compute
zij is |Tij| = |Zij| = |Ui''|. This direct cost only represents the cost to compute zij once
all the inverse entries on which it depends have previously been computed. Clearly,
the overall floating point operation cost in computing zij must also include the floating
point work done in computing the entries on which zij depends, the indirect cost.
Since inverse pivot row i has |Ui''| + 1 entries, the direct cost to compute it is
|Ui''|(|Ui''| + 1). To compute Zsparse we need to compute the entries of all inverse
pivot rows (see Equation (4.1)). Using the fact that Zsparse can be computed without
computing any element from outside this set, we get the following result for the total
number of multiply-add pairs of operations required to compute Zsparse:
Σ_{i=1}^{n} |Ui''|(|Ui''| + 1),
which is the same as the number of operations required for the factorization (recall that
U = L^T). If, for example, LDU has a tridiagonal pattern, then |Ui''| = 1 (i = 1, ..., n-1),
|Un''| = 0, and the cost of computing Zsparse for this system is 2(n-1) multiply-add
pairs. In general, if LDU is a band matrix of bandwidth 2m + 1, then
|Ui''| = m,   1 ≤ i ≤ n - m,
and
|Ui''| = n - i,   n - m + 1 ≤ i ≤ n.
The number of multiply-add pair operations to compute Zsparse is then
Σ_{i=1}^{n-m} m(m+1) + Σ_{i=n-m+1}^{n} (n-i)(n-i+1) = m(m+1)(3n - 2m - 1)/3.
When m is much less than n (the typical case) this computation is of order Θ(m²n).
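The closed-form band count can be sanity-checked against the direct summation; a small script of our own:

```python
# Operation count for a band LDU of bandwidth 2m+1: |U_i''| = m for
# i <= n-m and |U_i''| = n-i otherwise. The direct sum of |U_i''|(|U_i''|+1)
# should equal the closed form m(m+1)(3n-2m-1)/3.
def direct_count(n, m):
    total = 0
    for i in range(1, n + 1):
        u = m if i <= n - m else n - i
        total += u * (u + 1)
    return total

def closed_form(n, m):
    return m * (m + 1) * (3 * n - 2 * m - 1) // 3

print(all(direct_count(n, m) == closed_form(n, m)
          for n in range(2, 40) for m in range(1, n)))
```

The tridiagonal case corresponds to m = 1, where both expressions give 2(n-1).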
4.6. The inverse multifrontal Zsparse algorithm. There are still two
important issues to be addressed:
(a) how to efficiently locate (assemble) the elements of the direct z-dependency
sets (except, of course, Zii = Z̄i', which is already available within F̄i), and
(b) the order in which the inverse elements should be computed, taking into
account the z-dependencies inherent in the Takahashi equation.
Note that the locations of the direct u-dependency sets are completely specified by
Equation (4.3).
4.6.1. Assembling the direct zdependency set. Let node p be the parent
of node i in the inverse assembly tree. The following assertions are true:
(a) j ∈ Ui'' implies j ∈ Up,
(b) for every zij in the off-diagonal pivot row of F̄i, Zij is a subset of column j
of F̄p, and
(c) the inverse contribution block of F̄i is a subset of the entries in the parent
node F̄p.
Assertion (a) follows immediately from the general properties of the index sets,
one of which is that Ui'' ⊆ Up ([16] and [12]). □
The proof of assertion (b) is as follows. By Equation (4.4) the direct z-dependency
set of zij is
Zij = {zkj | k ∈ Ui''}.
Using the fact that Ui'' ⊆ Up, we can write Up = Ui'' ∪ Q (for some, possibly empty,
index set Q). Then,
{zkj | k ∈ Up} = {zkj | k ∈ Ui''} ∪ {zkj | k ∈ Q},
from which we get Zij = {zkj | k ∈ Ui''} ⊆ {zkj | k ∈ Up}. Since from (a) we have
j ∈ Up, we get the result that Zij is a subset of column j of F̄p. □
Assertion (c) is an immediate consequence of assertions (a) and (b):
Z̄i'' = ∪_{j ∈ Ui''} Zij is a subset of the entries in F̄p, since all j ∈ Up by assertion (a),
and in these cases Zij is a subset of column j of F̄p by (b). □
Assertion (c) in essence assures us that the inverse contribution block of Fi
can always be assembled from the parent node of i in the inverse assembly tree.
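Assertion (c) is what makes the inverse assembly a pure gather operation. A minimal sketch under our own data layout (dense inverse frontal matrices, each indexed by its column list):

```python
import numpy as np

def assemble_from_parent(child_cols, parent_cols, Fbar_p):
    """Gather the child's inverse contribution block out of the parent's
    inverse frontal matrix Fbar_p. child_cols = U_i'' is a subset of the
    parent's index list parent_cols (assertion (a)); this is a copy,
    requiring no floating-point operations."""
    pos = {c: k for k, c in enumerate(parent_cols)}
    idx = [pos[c] for c in child_cols]
    return Fbar_p[np.ix_(idx, idx)]
```

For instance, if the parent's index list is [2, 5, 7, 8] and the child's nonpivotal set is [5, 8], the gather picks out rows/columns 1 and 3 of the parent's dense array.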
while (computing Zsparse) do
    inverse ordering:
        select the next inverse frontal matrix, F̄i, based on a
        top-down traversal of the inverse assembly tree
    inverse assembly:
        assemble the inverse contribution matrix Z̄i'' from its parent node
    inversion:
        compute inverse pivot row i using Equation (4.7)
end while
FIG. 4.5. Algorithm to compute Zsparse
This is analogous to the situation in the symmetric multifrontal method where the
contribution matrix of a child node can always be assembled into its parent node.
Being able to assemble the inverse contribution matrix from the parent node allows
for an implementation that can take advantage of dense kernels. With the appropriate
data structure, the inverse contribution block can first be assembled from the parent
and the matrix-vector product in Equation (4.7) computed using the level 2 BLAS.
Furthermore, applying the inverse assembly process recursively we get the important
result that the inverse assembly tree (with dependency arrows from parent to children)
captures the dependency relationships among the inverse frontal matrices. The
computation of Zsparse then becomes a simple matter of following the dependency
arrows of the inverse assembly tree, assembling the inverse contribution matrix and
doing the actual inversion using Equation (4.7) at every node. Note that in contrast
to the multifrontal factorization algorithm, the assembly in the inverse multifrontal
algorithm is a copy operation requiring no floatingpoint operations. This algorithm
is given in Figure 4.5.
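Putting the pieces of Figure 4.5 together for simple (single-pivot) nodes gives a short driver. This is a sketch under our own data layout (dense U for clarity, Zsparse kept in a dictionary, patterns[i] = U_i''), not the paper's implementation:

```python
import numpy as np

def inverse_multifrontal(U, d, patterns):
    """Compute Zsparse from the LDL^T factors via the Figure 4.5 loop.

    U        : filled unit upper triangular factor (dense here for clarity)
    d        : array of pivots
    patterns : patterns[i] = sorted list U_i'' (0-based)
    Returns Zsparse as a dict keyed by (i, j) with j >= i.
    """
    n = len(d)
    Z = {}
    # Top-down traversal of the inverse assembly tree: since parent(i) > i,
    # processing pivots in reverse order visits every parent before its children.
    for i in range(n - 1, -1, -1):
        cols = patterns[i]
        t = np.array([U[i, k] for k in cols])
        # inverse assembly: every needed z_kj already lies in Zsparse
        Zc = np.array([[Z[min(k, j), max(k, j)] for j in cols] for k in cols])
        zrow = -t @ Zc if cols else np.zeros(0)
        for j, z in zip(cols, zrow):
            Z[i, j] = z
        Z[i, i] = 1.0 / d[i] - t @ zrow   # diagonal last (third result)
    return Z
```

For A = tridiag(1, 2, 1) of order 3 (d = [2, 1.5, 4/3], u12 = 1/2, u23 = 2/3), the returned diagonal entries are 0.75, 1.0, 0.75, matching A^{-1}.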
5. The supernodal inverse multifrontal method. The discussion of the
inverse multifrontal method in the previous section focused on simple inverse frontal
matrices containing one inverse pivot row and column. However, multifrontal codes
take advantage of related structures in the index pattern to form supernodes consisting
of several pivot rows and columns per frontal matrix. The use of these supernodes
permits the more efficient use of the memory hierarchy (cache and vector registers, for
example) found in high performance computer architectures. We show in this section
how to extend the simple algorithm for computing Zsparse to take advantage of the
supernodes formed during the LDU factorization.
5.1. The supernodal inverse constructs. Our definition of a supernode
comes from a theorem given by Liu, Ng, and Peyton in [18]:
THEOREM 5.1 ([18]). The column set S = {i, i+1, ..., i+m} is a supernode of
the matrix L if and only if S is a maximal set of contiguous columns such that i+j-1
is a child of i+j in the elimination tree (assembly tree in our case) for j = 1, 2, ..., m, and
η(L*,i) = η(L*,i+m) + m,
where η(v) gives the number of nonzeros in the vector v.
We label a supernode with the minimum column index in S. In the symmetric
multifrontal method the column set S, defined by Theorem 5.1, specifies the
pivot columns, and an equivalent set specifies the pivot rows. The simple inverse
multifrontal approach is extended in a natural way to include supernodes. The
FIG. 5.1. Block structure of the supernodal Fi and F̄i
supernodes and supernodal assembly tree of the LU factorization have their
corresponding inverse supernodes and inverse supernodal assembly tree, defined
analogously to those for the simple nodes discussed in Section 4. We continue to use
the same notation for the supernodal frontal and inverse frontal matrices as was
introduced in Section 4, with the understanding that we mean the supernodal
structures.
The supernodes are partitioned into the same four parts as the simple nodes,
except that the single pivot is now a block matrix, and both the off-diagonal pivot row
and pivot column are now block matrices. Figure 5.1 illustrates the general structures
and partitioning of the supernodal frontal/inverse frontal matrices. It should be noted
that now the pivotal subset is
Ui' = {i, i+1, ..., i+m} = S.
5.2. The supernodal algorithm for Zsparse. The matrix-vector form of the
Takahashi equation, Equation (4.7), can be easily extended to take advantage of
the supernodal structures. The following Equations (5.1) give the extended form to
compute the entries in the inverse pivot rows of F̄i:
(5.1)
    Z̄i' = -(Ti)[i:i+m, j1:jc] Z̄i''
    (Z̄i')[r, j1:jc] = (Z̄i')[r, j1:jc] - (Ti)[r, r+1:i+m] (Z̄i')[r+1:i+m, j1:jc],   i ≤ r < i+m
    Z̄i^p = Di^{-1} - (Ti)[i:i+m, j1:jc] (Z̄i')^T
    (Z̄i^p)[r, r+1:i+m] = (Z̄i^p)[r, r+1:i+m] - (Ti)[r, r+1:i+m] (Z̄i^p)[r+1:i+m, r+1:i+m],   i ≤ r < i+m
    (Z̄i^p)[r, r] = (Z̄i^p)[r, r] - (Ti)[r, r+1:i+m] (Z̄i^p)[r+1:i+m, r],   i ≤ r < i+m
We use the colon notation found in [15], where X[p:q, r:s] represents the submatrix of X with
row indices ranging from p to q and column indices from r to s. If p = q then X[p, r:s]
is a row vector with column indices r through s. Similarly, X[p:q, r] is a column vector
with row indices p through q. For Ui' = {i, i+1, ..., i+m} and Ui'' = {j1, j2, ..., jc},
the matrices Z̄i^p, Z̄i', Z̄i'', Di^{-1}, and Ti are defined as follows (see Figure 5.1 also):
Z̄i^p = (F̄i)[i:i+m, i:i+m],
Z̄i' = (F̄i)[i:i+m, j1:jc],
FIG. 5.2. The block-partitioned F̄i
Z = (F)u[i:jji:j],
D = (Di )[i:i+m,i:i+m],
and
Ti = (Fi)[i:i+mr,i:ji] = T + ii:j'c]
respectively.
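In outline, Equations (5.1) can be realized with dense kernels as in the following NumPy sketch. The function and variable names are ours (Tbar for the strictly upper triangular pivot-block couplings T̄i, That for the off-diagonal rows T̂i, Ztilde for the inverse contribution block Z̃i); the sketch explicitly mirrors the lower triangle of the inverse pivot block to keep the bottom-up sweeps simple, where an implementation would exploit symmetry instead:

```python
import numpy as np

def invert_supernode(Tbar, That, Dinv, Ztilde):
    """Sketch of the supernodal Takahashi recurrences, Equations (5.1).

    Tbar   : (p, p) strictly upper triangular couplings within the pivot block
    That   : (p, c) off-diagonal pivot-row entries of the factor U
    Dinv   : (p, p) inverse of the diagonal pivot block (a diagonal matrix)
    Ztilde : (c, c) symmetric inverse contribution block from the parent
    Returns (Zbar, Zhat): inverse pivot block and inverse off-diagonal row block.
    """
    p, c = That.shape
    # Off-diagonal row block: Zhat = -That @ Ztilde (a level 3 gemm), then a
    # bottom-up sweep of the internal pivot-block couplings (level 2 updates).
    Zhat = -That @ Ztilde
    for r in range(p - 2, -1, -1):
        Zhat[r, :] -= Tbar[r, r + 1:] @ Zhat[r + 1:, :]
    # Pivot block: Zbar = Dinv - That @ Zhat^T (a level 3 gemm), then the
    # internal sweep; the full gemm computes lower-triangle entries that are
    # then overwritten by symmetry (the "unnecessary work" noted in the text).
    Zbar = Dinv - That @ Zhat.T
    for r in range(p - 2, -1, -1):
        Zbar[r, r + 1:] -= Tbar[r, r + 1:] @ Zbar[r + 1:, r + 1:]
        Zbar[r + 1:, r] = Zbar[r, r + 1:]   # maintain symmetry for later rows
        Zbar[r, r] -= Tbar[r, r + 1:] @ Zbar[r, r + 1:]
    return Zbar, Zhat
```

The two matrix products are the level 3 BLAS calls; the row sweeps correspond to the level 2 alternatives discussed below.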
There are three potential problems with using Equation (5.1) in an
implementation. First, a modest number of unnecessary floating-point operations
may be performed, due to the third subequation in (5.1), which partially computes
entries in the lower triangular part of Z̄i. The point of using this subequation
is to be able to take advantage of a level 3 BLAS operation. We could rewrite the
subequation in terms of the less efficient level 2 BLAS operations. We need to balance
the desire for high megaflop performance (using the level 3 BLAS) against the increase
in CPU time as the inverse pivot blocks get bigger and more unnecessary operations
are done.
The second potential problem is that for larger problems, where the
supernodes can become relatively large, the block sizes can become larger than the
"optimum" level 3 BLAS operating size. This optimum block size will, of course,
depend on hardware characteristics such as the cache size and vector register size.
Clearly, it would be advantageous to restrict the block sizes so that the computation
fits in cache and/or makes effective use of vector registers.
The third concern has to do with memory usage. In order to take advantage of
the level 3 BLAS we need to use more memory than is strictly necessary. This can
be serious for the larger problems if the block size is set too large.
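The first trade-off above can be quantified with a back-of-the-envelope count (our own sketch, not from the paper): a full b-by-b update of an inverse diagonal block touches b² entries, while only the b(b+1)/2 entries of its upper triangle are needed.

```python
def unnecessary_fraction(b):
    """Fraction of entries (and hence, roughly, of multiply-add work) wasted
    when a b-by-b inverse diagonal block is updated with a full level 3 gemm
    instead of a triangle-only level 2 sweep. A rough count; names are ours."""
    full = b * b               # entries written by the full-block update
    needed = b * (b + 1) // 2  # upper-triangle entries actually required
    return (full - needed) / full

# The wasted fraction grows toward 1/2 as the block size b increases, which is
# one reason very large blocks eventually hurt CPU time.
```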
while (computing Zsparse) do
  1: inverse ordering:
     get the next supernode, i, from a top-down traversal
     of the inverse supernodal assembly tree
  2: inverse assembly:
     assemble the staircase upper triangular part of Z̃i,
     the sparse inverse contribution matrix of F̃i, from its parent node
  3: inversion:
     compute the inverse pivot row using Equation (5.2)
end while
FIG. 5.3. The block-partitioned algorithm to compute Zsparse
To control the negative impact of these three problems we use the two-dimensional
partitioning scheme shown in Figure 5.2 for each of the inverse frontal matrices.
The upper staircase diagonal form of the matrix still involves some unnecessary
computation and unnecessary storage, but these can both be controlled by regulating
the size of the blocking parameter. Assuming that the blocking size, b, is the same in
both dimensions, we get the blockpartitioned form of Equations (5.1):
(5.2)
    Zij = Dij⁻¹ − Σ(k=i+1..j) Tik Zkj − Σ(k=j+1..m) Tik (Zjk)ᵀ,   i ≤ j
    (Zij)[r, 1:b] = (Zij)[r, 1:b] − (Tii)[r, r+1:b] (Zij)[r+1:b, 1:b],   1 ≤ r < b
    (Zii)[r, r] = (Zii)[r, r] − (Tii)[r, r+1:b] ((Zii)[r, r+1:b])ᵀ,   1 ≤ r < b
where the bold-type indices are local block indices, Dij⁻¹ is nonzero only for the
diagonal blocks of the pivot part, and m = m1 + m2 with m1 and m2 defined by
Equations (5.3), 1 ≤ i ≤ m1, and i ≤ j ≤ m:
(5.3)
    m1 = ⌈(m+1)/b⌉
    m2 = ⌈c/b⌉
(here ⌈·⌉ denotes the ceiling, m+1 is the number of pivot rows/columns of the
supernode, and c is the number of off-diagonal indices).
In the example shown in Figure 5.2, m1 = 5 and m2 = 3. The block-structured form
of the Zsparse algorithm based on Equation (5.2) is shown in Figure 5.3.
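The block bookkeeping of Equations (5.2) and (5.3) can be sketched as follows. The function name and the example dimensions are ours: pivot_width = m+1 and offdiag_width = c are chosen so that, with b = 8, the counts match the m1 = 5, m2 = 3 example of Figure 5.2.

```python
import math

def active_area_blocks(pivot_width, offdiag_width, b):
    """Block counts for the staircase upper triangular storage of an inverse
    frontal matrix (a sketch; names ours). pivot_width = m + 1 pivot
    rows/columns, offdiag_width = c off-diagonal indices, b = block size."""
    m1 = math.ceil(pivot_width / b)    # Equation (5.3): pivot-part blocks
    m2 = math.ceil(offdiag_width / b)  # Equation (5.3): off-diagonal blocks
    m = m1 + m2
    # Stored blocks (i, j) with 1 <= i <= j <= m form the full upper block
    # triangle: m(m+1)/2 blocks of b*b numbers each, which equals the
    # active-area size b^2 [m1(m1+1)/2 + m2(m2+1)/2 + m1*m2] of Section 6.1.
    blocks = [(i, j) for i in range(1, m + 1) for j in range(i, m + 1)]
    return m1, m2, b * b * len(blocks)

m1, m2, words = active_area_blocks(pivot_width=33, offdiag_width=20, b=8)
# m1 = 5 and m2 = 3, matching the example of Figure 5.2
```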
6. Numerical experiments. We implemented the block-partitioned Zsparse
algorithm discussed in the last section, and a "scalar" Zsparse algorithm (discussed
below), on the CRAY Research, Inc. CRAY C98 platform. The CRAY C98 is an 8-
processor, 512 Mw shared-memory machine. Our runs were restricted to a single
processor. Each processor has vector capability and a peak theoretical performance
of one billion floating-point operations per second (1 Gflop). Version cf77 6.0.4.1 of
the CRAY Fortran compiler with automatic vectorization was used. All arithmetic
operations were done in single precision (64 bits). The CRAY Research Inc. optimized
BLAS were used for the matrix-vector and matrix-matrix computations.
6.1. Data structures. The available memory (not including the memory to
hold the input factors, tree information and integer work arrays) is partitioned as
shown in Figure 6.1. This is essentially a two-stack layout. The z-value memory area
is used as the permanent storage area for the computed inverse entries. This forms
the left stack and grows towards the right, as indicated by the right arrow. The right
stack, labeled the parent area in Figure 6.1, is used to store the upper triangular parts
of parent nodes with children yet to be inverted. Data are moved from this area to the
inverse contribution block in the active area during inverse assembly.
FIG. 6.1. Partitioning of memory
The right stack grows and shrinks as inverted parent nodes are allocated or
deallocated, respectively.
A point to note here is that a stack is the "natural" data structure to use for
the parent memory area when the inverse assembly tree is traversed in a depth-first
manner: when F̃i is to be inverted, its parent lies at the top of the stack and,
therefore, no overhead is incurred in locating the parent node. If a breadth-first
traversal is used, however, then a queue would be the better data structure.
We used a depth-first traversal scheme.
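The stack discipline can be sketched as a toy model (hypothetical names; the implementation manages a contiguous memory arena rather than Python lists):

```python
def traverse(tree_children, root):
    """Depth-first traversal of an inverse assembly tree with a parent stack:
    when node i is visited, its parent sits at the top of the stack, so no
    search is needed (a sketch of the right-stack discipline of Figure 6.1)."""
    stack = []   # right stack: inverted parents with children still pending
    order = []
    def visit(i, parent):
        order.append((i, parent))        # "invert" node i using parent's data
        children = tree_children.get(i, [])
        if children:
            stack.append(i)              # push i while its children are inverted
            for c in children:
                visit(c, stack[-1])      # the parent is found at the top
            stack.pop()                  # all children done: deallocate
    visit(root, None)
    return order

order = traverse({1: [2, 3], 2: [4]}, 1)   # parent of node 4 found at top of stack
```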
The memory area in Figure 6.1 labeled the active area is used as the actual center
of computation. When F̃i is to be inverted, memory of size
    b²[m1(m1+1)/2 + m2(m2+1)/2 + m1m2]
(see Section 5 for the definitions of m1 and m2) is allocated to accommodate its staircase
upper diagonal form. The location of the active area is fixed with respect to the top of
the leftstack. A preprocessing phase is used to determine the maximum size to which
the stacks and active area can grow. After an inverse frontal matrix, such as F̃i, is
inverted, the input factors corresponding to the pivot rows of Fi can be overwritten
with the entries in the inverse pivot row of F̃i. Significant savings in the amount of
stack memory used can result if this overwriting is allowed. We therefore provide the
user with a parameter to control this feature.
6.2. The scalar Zsparse algorithm. To serve as a basis for comparison we
implemented a "scalar" version of the Zsparse algorithm, shown in Figure 6.2. The
implementation of this algorithm allowed the use of level 1 BLAS operations (saxpy
and dot product, with vectorized gather/scatter operations), but no level 2 or 3 BLAS
operations.
The algorithm proceeds by computing elements in rows n through 1 (step 1).
Within each row, say i, the only off-diagonal elements in Zsparse are those with
column indices in the index pattern of row i of the factor matrix U, i.e., Ui. This is
done in step 2 for the off-diagonal entries and in step 6a for the diagonal entry (zii).
Note that zii is evaluated after all the off-diagonal entries in row i have been evaluated.
Step 3a involves a saxpy operation, while steps 3b and 6a involve dot products.
6.3. Numerical results. Statistics for the test matrices used in our numerical
experiments are shown in Table 6.1. Most of these matrices are part of the
Harwell-Boeing sparse matrix collection [10]. The sizes of the matrices range from
small (the 100-by-100 Nos4 matrix) to fairly large (74752-by-74752 for the FINAN512
matrix). The matrices were first factored using an early modified form, with strict
diagonal pivoting, of UMFPACK Version 2.0 [7, 6]. The supernodal tree information was
also generated by UMFPACK. In general, any supernodal or multifrontal Cholesky
factorization algorithm can be used.
Table 6.2 gives the execution times for the scalar and block-partitioned Zsparse
implementations. For the block-partitioned algorithm we give CPU times for block
sizes 8, 32, 64, 128, and 256. In the column labeled "peak mflops of block" the
1: for i = n to 1
2:   for k ∈ Ui
3:     for j ∈ (Ui and j > k)
3a:      zij = zij − uik zkj
3b:      zik = zik − uij zkj
       end for
4:     zik = zik − uik zkk
     end for
5:   zii = dii⁻¹
6:   for k ∈ Ui
6a:    zii = zii − uik zik
     end for
   end for
FIG. 6.2. The scalar Zsparse algorithm
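For reference, the scalar algorithm of Figure 6.2 can be transcribed into a dense NumPy sketch (ours; the actual code operates on sparse structures with gather/scatter). With a dense factor the computed upper triangle is the full upper triangle of the inverse, which makes the sketch easy to check:

```python
import numpy as np

def scalar_zsparse(U, d):
    """Sketch of the scalar Zsparse algorithm of Figure 6.2. U is the unit
    upper triangular factor (dense here for clarity), d the diagonal of D,
    with A = U^T D U symmetric. Returns Z holding the sparse inverse subset
    in its upper triangle."""
    n = U.shape[0]
    Z = np.zeros((n, n))
    pattern = [np.nonzero(np.triu(U, 1)[i])[0] for i in range(n)]  # U_i
    for i in range(n - 1, -1, -1):                 # step 1: rows n down to 1
        for k in pattern[i]:
            for j in pattern[i]:
                if j > k:
                    Z[i, j] -= U[i, k] * Z[k, j]   # step 3a (saxpy-like)
                    Z[i, k] -= U[i, j] * Z[k, j]   # step 3b (dot-product-like)
            Z[i, k] -= U[i, k] * Z[k, k]           # step 4
        Z[i, i] = 1.0 / d[i]                       # step 5
        for k in pattern[i]:
            Z[i, i] -= U[i, k] * Z[i, k]           # step 6a
    return Z
```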
TABLE 6.1
Information on test matrices
best megaflop performance is given (the block size for which this occurs is shown in
parentheses). The worst CPU performance for all the matrices occurs for a block size
of 8; for block sizes smaller than 8 we got even larger CPU times for all the matrices.
This is no doubt due to the poor performance of the BLAS on the smaller-sized blocks
and the increasing overhead involved in creating the larger number of blocks. But
even for a block size of 8 the inverse multifrontal block-partitioned algorithm has a
clear performance advantage over the scalar algorithm for all but the smallest
matrix (Nos4) and the very sparse BCSPWR10 matrix. As the block size increases, the
CPU times for the block algorithm typically decrease until some optimum block size
is reached, after which the CPU times begin to rise. One might have expected that a
block size of length equal to the length of the vector registers (128) would give the best
CPU performance. However we see from Table 6.2 that a block size of 64 gives the
best CPU performance (times in bold). We attribute this to the increasing number of
unnecessary operations that are performed on the diagonal blocks. As the block size
increases the ratio of unnecessary to necessary operations also increases, resulting in a
relative increase in the CPU times. This explanation is further supported by the fact
that the highest megaflop performance occurs for a block size of 128 (for the larger
matrices at least) implying that the BLAS are more efficient but more operations are
being done.
matrix      n      discipline        |L| (×10³)
Nos4        100    structural eng.        0.8
PLAT1919    1919   oceanography          83.4
BCSPWR10    5300   electric power        40.0
BCSSTK28    4410   structural eng.      468.8
BCSSTK25    15439  structural eng.     2425.6
FINAN512    74752  economics           9564.8
TABLE 6.2
Results for the scalar and multifrontal Zsparse runs

matrix      CPU time (sec)                                        peak mflops
            scalar   block sizes:                                 of block
                     8        32       64       128      256
Nos4        0.003    0.007    0.007    0.007    0.007    0.007      1 (8)
PLAT1919    1.022    0.268    0.161    0.155    0.155    0.155     33 (64)
BCSPWR10    0.203    0.405    0.387    0.387    0.387    0.387      2 (32)
BCSSTK28    13.01    1.70     0.64     0.59     0.62     0.66     127 (128)
BCSSTK25    170.9    19.52    5.64     5.01     5.14     5.29     187 (128)
FINAN512    2469.1   325.5    52.5     42.25    44.05    47.29    320 (128)
7. Conclusions. We introduced the symmetric inverse multifrontal approach to
computing the sparse inverse subset. This method uses the concept of inverting an
inverse frontal matrix using dense matrix kernels. Similarities between the inversion of
an inverse frontal matrix and the familiar concept of factorizing a frontal matrix used
in a multifrontal formulation of Gaussian elimination were highlighted. It should
be pointed out that it is not essential that the LDU factorization be done by the
multifrontal method for the inverse multifrontal method to be used. All we really need
is the pattern of nonzeros in the factorized matrix; even the supernodal structures
can be reconstructed from the filled-matrix pattern information [18]. The essential
theoretical results, stated in Equations (4.3) and (4.4), and the ensuing observations,
give a precise description of the locations of the direct dependency sets when inverting
an inverse frontal matrix. This allows the use of level 2 and level 3 BLAS in the
implementation.
The high performance that we achieve in computing Zsparse based on the
block-partitioned inverse multifrontal algorithm is evident from the CPU times and mflops
rates found in Table 6.2. The use of the higher level BLAS primitives is certainly the
dominant contributing factor in the speedup over the scalar algorithm (which only
takes advantage of gather/scatter vector operations).
The presentation in this paper focused on the special case of computing Zsparse
where the matrix A and, therefore, its inverse Z, are numerically symmetric. The
theoretical development and implementation are very similar for computing Zsparse
when A and Z are unsymmetric in value but have symmetric nonzero patterns. We
discuss a parallel implementation of our algorithm in [5], and an extension to arbitrary
subsets of Z in [4].
An open problem remains in adapting the inverse multifrontal method to the
case where the coefficient matrix has been reduced to blocktriangular form before
the factorization is done.
8. Acknowledgements. Support for this project was provided by the National
Science Foundation (DMS9223088 and DMS' '.1 1''74), and by CRAY Research, Inc.,
through the allocation of supercomputing resources.
REFERENCES
[1] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm.
SIAM J. Matrix Anal. Appl., (to appear). Also CISE Technical Report TR94039.
[2] R. Betancourt and F. L. Alvarado. Parallel inversion of sparse matrices. IEEE Transactions
on Power Systems, PWRS-1(1):74–81, 1986.
[3] Y. E. Campbell. Multifrontal algorithms for sparse inverse subsets and incomplete LU
factorization. PhD thesis, Computer and Information Science and Engineering Department,
Univ. of Florida, Gainesville, FL, November 1995. Also CISE Technical Report TR95025.
[4] Y. E. Campbell and T. A. Davis. On computing an arbitrary subset of entries of the inverse of a
matrix. Technical Report TR95022, Computer and Information Science and Engineering
Department, Univ. of Florida, 1995.
[5] Y. E. Campbell and T. A. Davis. A parallel implementation of the block-partitioned inverse
multifrontal Zsparse algorithm. Technical Report TR95023, Computer and Information
Science and Engineering Department, Univ. of Florida, 1995.
[6] T. A. Davis and I. S. Duff. A combined unifrontal/multifrontal method for unsymmetric sparse
matrices. Technical Report TR95020, Computer and Information Science and Engineering
Department, Univ. of Florida, 1995.
[7] T. A. Davis and I. S. Duff. An unsymmetric-pattern multifrontal method for sparse LU
factorization. SIAM J. Matrix Anal. Appl., (to appear). Also CISE Technical
Report TR94038.
[8] J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of Level 3 Basic Linear Algebra
Subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, 1990.
[9] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford Science
Publications, New York, NY, 1991.
[10] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' guide for the HarwellBoeing sparse matrix
collection (release 1). Technical Report RAL92086, Rutherford Appleton Laboratory,
Didcot, Oxon, England, Dec. 1992.
[11] I. S. Duff and J. K. Reid. The multifrontal solution of unsymmetric sets of linear equations.
SIAM J. Sci. Statist. Comput., 5(3):633–641, 1984.
[12] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear
equations. ACM Trans. Math. Software, 9:302–325, 1983.
[13] A. M. Erisman and W. F. Tinney. On computing certain elements of the inverse of a sparse
matrix. Communications of the ACM, 18:177–179, March 1975.
[14] A. George and J. W. H. Liu. Computer Solution of Large Sparse PositiveDefinite Systems.
PrenticeHall, Englewood Cliffs, NJ, 1981.
[15] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press,
Baltimore, MD and London, UK, second edition, 1990.
[16] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal.
Appl., 11(1):134–172, 1990.
[17] J. W. H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM
Review, 34(1):82–109, March 1992.
[18] J. W. H. Liu, E. G. Ng, and B. W. Peyton. On finding supernodes for sparse matrix
computations. SIAM J. Matrix Anal. Appl., 14(1):242–252, January 1993.
[19] K. Takahashi, J. Fagan, and M. Chin. Formation of a sparse bus impedance matrix and its
application to short circuit study. 8th PICA Conference Proc., Minneapolis, Minn., pages
177–179, June 4–6, 1973.
Note: all University of Florida technical reports in this list of references are
available in postscript form via anonymous ftp to ftp.cis.ufl.edu in the directory
cis/techreports, or via the World Wide Web at http://www.cis.ufl.edu/~davis.