COMPUTING THE SPARSE INVERSE SUBSET:
AN INVERSE MULTIFRONTAL APPROACH

YOGIN E. CAMPBELL* AND TIMOTHY A. DAVIS†

Technical Report TR-95-021, Computer and Information Sciences Department,
University of Florida, Gainesville, FL, 32611 USA. October, 1995.

* email: yec@cis.ufl.edu.
† Computer and Information Science and Engineering Department, University of Florida,
Gainesville, Florida, USA. (904) 392-1481, email: davis@cis.ufl.edu. Technical reports and matrices
are available via the World Wide Web at http://www.cis.ufl.edu/~davis, or by anonymous ftp at
ftp.cis.ufl.edu:cis/tech-reports.

Key words. Sparse matrices, symmetric multifrontal method, supernode,
Takahashi equations, inverse multifrontal method, inverse frontal matrix, inverse
contribution matrix, inverse assembly tree, Zsparse.
AMS (MOS) subject classifications. 05C50, 65F50.
Abbreviated title: Multifrontal Sparse Inverse Subset
Abstract.
We present the symmetric inverse multifrontal method for computing the sparse inverse subset
of symmetric matrices. The symmetric inverse multifrontal approach uses an equation presented by
Takahashi, Fagan, and Chin to compute the numerical values of the entries of the inverse, and an
inverted form of the symmetric multifrontal method of Duff and Reid to guide the computation. We
take advantage of related structures that allow the use of dense matrix kernels (levels 2 & 3 BLAS)
in the computation of this subset. We discuss the theoretical basis for this new algorithm,
give numerical results for a serial implementation, and demonstrate its performance on a CRAY-C98.

1. Introduction. We address the problem of computing the sparse inverse
subset (Zsparse) of a symmetric matrix A, using the LDU factorization of A and an
equation relating the LDU factors to Z = A^{-1} presented by Takahashi, Fagan, and
Chin in [19]. The sparse inverse subset is defined as the set of entries in the upper
triangular part of Z in locations given by the nonzero entries in the factorized matrix:
that is, Zsparse = {z_ij | u_ij ≠ 0} ⊆ Z.
The entries in Zsparse are useful in many practical applications, such as
approximating the condition number of symmetric positive-definite matrices and
estimating the variances of the fitted parameters in the least-squares data-fitting problem.
Prior results on computing Zsparse and other related inverse subsets, based on the
two equations of Takahashi, Fagan, and Chin, are found in the articles by Erisman
and Tinney [13] and Betancourt and Alvarado [2]. Erisman and Tinney in [13] proved
that both Zsparse and the subset of entries on the diagonal of Z can be evaluated
without computing any inverse entry from outside of Zsparse. In [2] Betancourt and
Alvarado give a parallel algorithm to compute Zsparse and the full inverse. Neither of
these articles considered the use of dense matrix operations such as matrix-vector
or matrix-matrix multiplications in the computations. Our numerical results show
that there is a significant improvement in the performance of the Zsparse algorithm
when the level 2 (matrix-vector) and level 3 (matrix-matrix) BLAS [8] operations are
used in the implementation.
In this paper we develop a new method, the symmetric inverse multifrontal method
[3], to compute Zsparse. We use one of the equations from [19] (henceforth referred to
as the Takahashi equation(s)) to compute the numerical values of the inverse elements,
and an inverted form of the symmetric multifrontal method of Duff and Reid [12] to

guide the computation and take advantage of related structures that allow the use of
dense matrix kernels in the innermost loops. We show that the results in [13] and [2]
can be easily derived using this formulation.
In the multifrontal method the frontal matrix and assembly tree are the two key
constructs. We introduce two similar constructs, the inverse frontal matrix and the
inverse assembly tree, and discuss their relationship to the frontal matrix and assembly
tree, respectively. We show how the computation of Zsparse can be formulated in
terms of inverting inverse frontal matrices using the inverse assembly tree to specify
the data dependencies.
An outline of the paper follows. In Section 2, we discuss relevant aspects of one of
the Takahashi equations and give a small example showing its use. We briefly review
the symmetric multifrontal factorization method in Section 3. For a more detailed
treatment of multifrontal factorization we refer the reader to [12, 17]. The inverse
multifrontal algorithm is developed in Sections 4 and 5. In Section 4, the fundamentals
of the inverse multifrontal approach are presented, culminating in an algorithm to
compute Zsparse based on using matrix-vector operations. This algorithm is extended
in Section 5 to include matrix-matrix operations based on inverting supernodes.
Performance results from an implementation of a block-structured form of the Zsparse
algorithm on the CRAY-C98 are discussed in Section 6. Conclusions and avenues for
future work are given in Section 7.

2. The Takahashi equation. Takahashi, Fagan, and Chin in [19] presented two
equations for computing the inverse of a general matrix A, using its LDU factorization.
The equations are

(2.1)        Z = D^{-1}L^{-1} + (I − U)Z

and

(2.2)        Z = U^{-1}D^{-1} + Z(I − L),

where Z = A^{-1}, and A = LDU (L, U, and D are unit lower triangular, unit upper
triangular, and diagonal matrices, respectively). When A is (numerically) symmetric,
U = L^T and Z is symmetric. In this paper we consider symmetric matrices and,
therefore, only use Equation (2.1). (For notational convenience we continue to use
U instead of L^T.) We refer to Equations (2.1) and (2.2) as the Takahashi equations.
The matrices involved in Equation (2.1) are illustrated in Fig. 2.1. Shaded areas in
this figure represent nonzero elements.

FIG. 2.1. Illustration of the Takahashi equation, Equation (2.1)
The following are some useful observations concerning the Takahashi equation
and the matrices involved in it. The product D^{-1}L^{-1} is a lower triangular matrix
with (D^{-1}L^{-1})_ii = (D^{-1})_ii. This is used to avoid computing L^{-1} when evaluating
elements on the diagonal and upper triangular part of Z. The matrix (I − U) is
strictly upper triangular since U is unit upper triangular. Computationally, the most
useful feature of Equation (2.1) is that we can use it to compute Z (more precisely, the
upper triangular part of Z) without having to first find L^{-1}. Using the two previous
observations, and restricting the inverse elements to the diagonal and upper triangular
part of Z, Equation (2.1) can be restated as follows:


(2.3)        z_ij = δ_ij d_ii^{-1} − Σ_{k>i} u_ik z_kj,        for i ≤ j,


where the notation y_ij = (Y)_ij is used. The elements of Z can be computed in reverse
Crout order [9]: that is, evaluate the elements in rows n, n−1, ..., 1, in that order. (In
each row we only need to evaluate entries in the upper triangular part of Z.) Within
a given row, the position of the diagonal entry in the computation order matters,
because the diagonal entry depends on a subset of the other entries in the row. The
other entries in the row are independent of each other and can therefore be computed
in any order. Erisman and Tinney show in [13] that the sparse subset can be computed
in terms of U and other inverse elements from within the sparse subset only. This
important result allows one to safely ignore all other inverse elements when computing
the sparse subset.
2.1. An Example. We use a small example to illustrate the use of
Equation (2.3) in computing inverse elements. Consider the symmetric matrix A and
its corresponding filled matrix (L+U) shown in Equations (2.4) and (2.5), respectively.
The x's represent the original entries of A, while the ⊘'s are entries due to fill-in.

(2.4)    A =
         x       x   x
             x   x
         x   x   x
         x           x

(2.5)    L + U =
         x       x   x
             x   x
         x   x   x   ⊘
         x       ⊘   x
Using Equation (2.3) and a reverse Crout computational order, the set of
Equations (2.6) gives the sequence used to compute the elements of the sparse subset.

(2.6)    z44 = d44^{-1}
         z34 = −u34 z44
         z33 = d33^{-1} − u34 z43
         z23 = −u23 z33
         z22 = d22^{-1} − u23 z32
         z14 = −u13 z34 − u14 z44
         z13 = −u13 z33 − u14 z43
         z11 = d11^{-1} − u13 z31 − u14 z41

(Note that by symmetry z_ij = z_ji.) Observe that this is only a partial ordering. For
example, the entries z11, z13, and z14 can be computed in any order with respect to
z22 and z23. We can also use Equation (2.3) to evaluate elements outside the sparse
subset, for example: z24 = −u23 z34 − u24 z44 and z12 = −u13 z32 − u14 z42.
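
To make the recurrence concrete, the following sketch (our own illustration, not code
from the report) applies Equation (2.3) to this 4-by-4 example in reverse Crout order.
The numerical values of A are an assumption chosen only to reproduce the fill pattern
above; dense storage is used for clarity, where a real implementation would exploit the
sparse pattern of U.

import numpy as np

# Example values for the 4-by-4 matrix of Equation (2.4): the nonzero
# pattern has a13, a14, a23 (and symmetric entries); u34 appears as fill-in.
A = np.array([[4.0, 0.0, 1.0, 1.0],
              [0.0, 4.0, 1.0, 0.0],
              [1.0, 1.0, 4.0, 0.0],
              [1.0, 0.0, 0.0, 4.0]])
n = A.shape[0]

# LDL^T factorization via Cholesky (A is symmetric positive definite here),
# so U = L^T with unit diagonal and D = diag(l_jj^2).
C = np.linalg.cholesky(A)
d = np.diag(C) ** 2
U = (C / np.diag(C)).T                  # unit upper triangular factor

# The sparse inverse subset: diagonal plus positions where u_ij != 0.
pattern = [(i, j) for i in range(n) for j in range(i, n)
           if i == j or U[i, j] != 0.0]

Z = np.zeros((n, n))                    # upper triangle of Zsparse
for i in reversed(range(n)):            # reverse Crout: rows n, n-1, ..., 1
    # Within a row, compute the diagonal entry last (it depends on the
    # other entries of the row); here we simply go through j descending.
    for (_, j) in sorted((p for p in pattern if p[0] == i),
                         key=lambda p: -p[1]):
        s = sum(U[i, k] * Z[min(k, j), max(k, j)]   # z_kj = z_jk by symmetry
                for k in range(i + 1, n) if U[i, k] != 0.0)
        Z[i, j] = (1.0 / d[i] if i == j else 0.0) - s

Zt = np.linalg.inv(A)                   # check against the true inverse
assert all(abs(Z[i, j] - Zt[i, j]) < 1e-12 for (i, j) in pattern)

By the Erisman-Tinney result quoted above, every z_kj the loop reads is itself a
member of the sparse subset, so the zero-initialized entries of Z are never consumed.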

3. The symmetric multifrontal approach. The direct method of solving the
linear system of equations Ax = b (where A is symmetric) using the LU or Cholesky
factorization of A is usually a two-step algorithm. In the first step, the coefficient
matrix A is factorized into LU (or LL^T if Cholesky factorization is used), where L
and U are lower and upper triangular matrices, respectively. The second step involves
solving the two triangular systems Ly = b for y (forward substitution) and Ux = y
for x (backward substitution). The symmetric multifrontal method of Duff and Reid
[11, 12] (see also Liu [17]) is an efficient method to compute the LU factorization of
A, especially on machines with a memory hierarchy. This factorization method is based
on the use of dense matrix kernels in the innermost loops. Dense submatrices called
frontal matrices are formed as the multifrontal algorithm progresses. One or more
steps of Gaussian elimination are done on each of these frontal matrices.
In general, the symmetric multifrontal algorithm consists of a symbolic analysis
phase and a numerical factorization phase. In the analysis phase a fill-reducing
pivot ordering algorithm (such as the approximate minimum degree algorithm [1] or
the minimum degree algorithm [14]) is used to establish the pivot order and data
structures. In addition, the precedence relationships among the frontal matrices
that are used in the numerical phase are established and given by the assembly or
elimination tree [9, 16]. In this phase only the pattern of A is used. The numerical
work to actually compute the LU factors is done in the numerical factorization phase.
The assembly tree is used to guide the computation in this phase.
We use the example matrix A shown in Equation (3.1) to highlight the basic
multifrontal constructs. The p's and x's represent original entries of A, while the ⊘'s
are entries due to fill-in.

(3.1)    A =
         p1  x                       x   x
         x   p2                  x   x
                 p3  x       x
                 x   p4  x   x       x
                     x   p5  x   x
                 x   x   x   p6      x
             x           x       p7      x
         x   x       x       x       p8  x
         x                       x   x   p9

(3.2)    L + U =
         p1  x                       x   x
         x   p2                  x   x   ⊘
                 p3  x       x
                 x   p4  x   x       x
                     x   p5  x   x   ⊘
                 x   x   x   p6  ⊘   x
             x           x   ⊘   p7  ⊘   x
         x   x       x   ⊘   x   ⊘   p8  x
         x   ⊘                   x   x   p9


FIG. 3.1. General structure of a frontal matrix: the pivot, the off-diagonal pivot row,
the off-diagonal pivot column, and the contribution matrix


We assume that the matrix has already been permuted so that the pivots lie on the
diagonal: p1, ..., p9. Its fill structure is also determined and shown in Equation (3.2).
In the simplest case (no supernodes) nine frontal matrices, F1, ..., F9, corresponding
to the nine pivots, p1, ..., p9, are used in the LU factorization of A. The row/column
index pattern of frontal matrix F_i is defined by the row/column index pattern of pivot
row/column i of the filled matrix. Let U_i be the column pattern of F_i. Then,

         U_i = {j | u_ij ≠ 0, j ≥ i}.

We find it convenient to partition the column index set U_i into a pivotal subset U_i′
and a non-pivotal subset U_i′′, where

(3.3)    U_i′ = {i}
         U_i′′ = {j | u_ij ≠ 0, j > i}.

The row pattern C_i is defined and partitioned similarly. For symmetric matrices
C_i = U_i. For our example matrix U_1′′ = {2, 8, 9} and U_2′′ = {7, 8, 9}.
The partitioning of the index set U_i induces a natural partitioning of the frontal
matrix F_i into four parts: the pivot element, the off-diagonal pivot row, the off-diagonal
pivot column, and the contribution matrix

         C_i = {(F_i)_jk | j, k ∈ U_i′′}.

Figure 3.1 illustrates this general partitioning scheme.
The assembly tree can be constructed using the parent-child relationship given
by [16]

         parent(i) = min{ j | l_ji ≠ 0, j > i } = min{ j | j ∈ U_i′′ } = min U_i′′.

Figure 3.2 shows the assembly tree for the filled matrix in Equation (3.2). The
node numbers correspond to the labels of frontal matrices and the arrows specify
the dependency relationships among frontal matrices. For example, frontal matrix
F8 must be factored after F2 and F7 because F2 and F7 contain update terms to the
pivot row and pivot column of F8.

FIG. 3.2. Assembly tree for the matrix in Equation (3.2)
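
A minimal sketch (our own illustration, with hypothetical helper names) of this
construction, computing parent(i) = min U_i′′ directly from the filled pattern:

def assembly_tree_parents(upper_pattern, n):
    """upper_pattern: iterable of pairs (i, j), i < j, with u_ij != 0 (0-based)."""
    parent = [None] * n                   # None marks a root of the tree
    for i in range(n):
        nonpivotal = [j for (r, j) in upper_pattern if r == i]
        if nonpivotal:
            parent[i] = min(nonpivotal)   # first off-diagonal entry in pivot row i
    return parent

Reversing the direction of the resulting parent links yields the inverse assembly tree
introduced in Section 4.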


4. The symmetric inverse multifrontal method. We introduce the basic
concepts and constructs of the symmetric inverse multifrontal method by considering
the computation of the sparse inverse subset, Zsparse. We show that the computation
of Zsparse can be formulated in terms of constructs similar to those used in the
multifrontal method. These include the inverse frontal matrix and inverse assembly
tree.

4.1. The inverse frontal matrix. Let ℱ be the set of frontal matrices used
in the multifrontal LDU factorization of a (numerically) symmetric matrix A. (In
this case U = L^T, but we continue to use U instead of L^T for notational convenience.)
For every frontal matrix F_i ∈ ℱ, we define a corresponding inverse frontal matrix
F̄_i, with row and column patterns identical to those of F_i. That is, F̄_i has row index
pattern C_i and column index pattern U_i. We use the same notation, C_i and U_i, to
denote the index patterns of both F_i and F̄_i. Although F_i and F̄_i have the same
structure, we define an element of F̄_i to be an element of the inverse matrix Z, i.e.,
(F̄_i)_kj = z_kj ∈ Z. The set of elements defined by F̄_i is thus a subset of Z. Let ℱ̄
represent the set of corresponding inverse frontal matrices. We partition an inverse
frontal matrix F̄_i into an inverse pivot row, an inverse pivot column, and an inverse
contribution matrix in much the same way that its corresponding frontal matrix F_i
is partitioned. The general partitioned structure of the frontal and inverse frontal
matrices is illustrated in Figure 4.1.
Let Z_i be the set of elements in the inverse pivot row of F̄_i. Then,

         Z_i = {z_ij | j ∈ U_i}.

It is easy to show, by the equivalence in structure of F_i and F̄_i, that the sparse inverse
subset (Zsparse) is given by

(4.1)    Zsparse = ∪_{i ∈ ℱ̄} Z_i = {z_ij | u_ij ≠ 0, j ≥ i}.

(Note that since Zsparse is symmetric we only consider entries in its upper triangular
part.)
Equation (4.1) states that the set of entries in Zsparse is the union of the sets of
entries in the inverse pivot rows of all inverse frontal matrices in ℱ̄. Consequently, to
compute Zsparse we in effect need to compute the entries in the inverse pivot rows
of every inverse frontal matrix. We refer to the process of computing the entries in
the inverse pivot row of an inverse frontal matrix as "inverting" or "inversion" of the
inverse frontal matrix.

FIG. 4.1. Equivalence of structures F and F̄: both share the same index pattern,
partitioned into a pivot, pivot row, pivot column, and contribution matrix (inverse pivot
row, inverse pivot column, and inverse contribution matrix for F̄)

FIG. 4.2. Corresponding inverse assembly tree

4.2. The inverse assembly tree. As mentioned in our review of the symmetric
multifrontal method (Section 3), an assembly tree is used to guide the factorization
process. For any given matrix A, we define the corresponding inverse assembly
tree to have a structure identical to that of the assembly tree used in the factorization
of A, except that the direction of the parent-child dependency arrows is reversed and
the node labels refer to inverse frontal matrices. As we shall show, the inverse assembly
tree is used to guide the inversion process in a way analogous to how the assembly tree is
used to guide the factorization process. Figures 3.2 and 4.2 show the assembly and
inverse assembly trees, respectively, for the filled matrix in Equation (3.2). Note that
the flow of data, represented by the dependency arrows, is from parent to children
in the inverse assembly tree (Figure 4.2), as opposed to the usual children-to-parent
flow in the assembly tree (Figure 3.2). As discussed later, this "dependency reversal"
captures the top-down traversal of the inverse assembly tree used in the computation
of Zsparse, compared to the bottom-up assembly tree traversal used in the factorization
algorithm. There is also an inverse assembly process from parent to child, instead of
the child-to-parent assembly process in the multifrontal factorization method. The
inverse assembly tree can also be used in a parallel algorithm to compute Zsparse.
We report results on such an algorithm in [5].
In Figure 4.3 we give a side-by-side illustration of the multifrontal and inverse
multifrontal constructs for the matrix in Equation (3.2). Note that, whereas assembly
in the multifrontal method takes place from the contribution matrix of a child node
(the # positions in F_1) to the parent node (the # positions in F_2), inverse assembly
is from the parent node (the + positions in F̄_2) to the inverse contribution matrix of
the child node (the + positions in F̄_1).

FIG. 4.3. The frontal and inverse frontal constructs
4.3. The theoretical development. Consider using Equation (2.3) to
compute entries of Zsparse. Let

(4.2)    T_ij = {−u_ik | u_ik ≠ 0, k > i}.


Since Z is usually full, even though A may be sparse, the set T_ij specifies the set of
nonzero u_ik z_kj terms in the summation. These are the values of the U factor that
actually "contribute" to the value of z_ij. The following is the key result that relates
T_ij to the index pattern of frontal matrix F_i. Claim:
(4.3)    T_ij = {−u_ik | k ∈ U_i′′}.

Proof: By definition (see Equation (3.3)), {k | u_ik ≠ 0, k > i} = U_i′′. Substituting
this into Equation (4.2) gives the result. □ In effect, Equation (4.3) implies that
in computing an inverse element, the k indices in the summation in Equation (2.3)
are constrained to the non-pivotal index set of frontal matrix F_i. We refer to T_ij as
the direct u-dependency set of z_ij. The direct z-dependency set of z_ij, Z_ij, is defined
similarly: Z_ij = {z_kj | u_ik ≠ 0, k > i}. As for the direct u-dependency set, this can be
rewritten in terms of the index set of the inverse frontal matrix F̄_i,

(4.4)    Z_ij = {z_kj | k ∈ U_i′′}.

Equation (4.4) expresses the set of inverse values directly used to compute z_ij in terms
of the non-pivotal index set of inverse frontal matrix F̄_i. (Here we made use
of the equivalence in index patterns of F_i and F̄_i.) Equations (4.3) and (4.4) state
two of the central results of the inverse multifrontal method.
Using the T_ij and Z_ij sets as row and column vectors, respectively, we can rewrite
Equation (2.3) in the inner-product form

(4.5)    z_ij = δ_ij d_ii^{-1} + T_ij Z_ij,    i ≤ j.

We now discuss five important results that follow immediately from the
application of Equations (4.3)-(4.5) and the definition of the inverse frontal matrix.
First, note that the direct u-dependency set T_ij is independent of the index j, since
the set defined by Equation (4.3) does not involve j. (This is true even for z_ij ∉
Zsparse [4].)
Second, the direct z-dependency set of an inverse entry z_ij ∈ Zsparse, i ≠ j, is
identical to column j of the inverse contribution block of F̄_i. This can be easily
proved. Proof: Recall that by definition, the set of entries in column j of the inverse
contribution block of F̄_i is given by

         (C̄_i)_{*j} = {z_kj | k ∈ U_i′′},

where j ∈ U_i′′. The fact that (C̄_i)_{*j} = Z_ij follows trivially from the definition of Z_ij
given by Equation (4.4). □ (Note that for i = j, Z_ij is not in the inverse contribution
block.) Figure 4.4 illustrates the relationship among the frontal, inverse frontal, and
direct dependency vectors.
The third result concerns the relationship between the direct z-dependency set
of the diagonal entry and the other entries in the inverse pivot row. We claim that
the direct z-dependency set of z_ii is equal to the set of off-diagonal inverse pivot row
entries of F̄_i. Proof: Recall that the set of off-diagonal inverse pivot row entries of F̄_i
is given by Z_i′′ = {z_ik | k ∈ U_i′′}. The direct z-dependency set of z_ii is obtained from
Equation (4.4) and is given by Z_ii = {z_ki | k ∈ U_i′′}. Substituting z_ki = z_ik (by
symmetry of Z) we get the result Z_ii = {z_ik | k ∈ U_i′′} = Z_i′′. □ This implies that
z_ii depends directly on all the other entries in the inverse pivot row of F̄_i. From an
implementation perspective this of course means that z_ii should be the last element
computed when F̄_i is inverted.

FIG. 4.4. Relating F_i, F̄_i, T_ij, and Z_ij: the dot product of T_ij (from the off-diagonal
pivot row of F_i) with Z_ij (a column of the inverse contribution block of F̄_i) yields z_ij
The fourth result makes the connection between the direct z-dependency set of
the entire off-diagonal inverse pivot row and the inverse contribution matrix. Let Z̄_i
be the direct z-dependency set of all entries in the off-diagonal inverse pivot row of F̄_i,
i.e., Z̄_i = ∪_{j ∈ U_i′′} Z_ij. Then the following is true: Z̄_i = C̄_i. That is, the direct
z-dependency set of the set of off-diagonal entries of an inverse pivot row is equal to the
set of entries in the inverse contribution block. Proof: From Equation (4.4),

         Z̄_i = ∪_{j ∈ U_i′′} Z_ij = ∪_{j ∈ U_i′′} {z_kj | k ∈ U_i′′} = {z_kj | k, j ∈ U_i′′}.

The last set expression is by definition equal to C̄_i. □ From here on we use Z̄_i for
the inverse contribution matrix of F̄_i instead of the C̄_i notation used earlier.
The fifth and final result concerns the direct z-dependency set of all entries in
Zsparse. Erisman and Tinney in [13] showed that the only inverse entries that an
entry in Zsparse depends on also belong to Zsparse. In our terminology, this is
equivalent to saying that the direct z-dependency set of Zsparse is equal to Zsparse.
We prove this result next. Let direct(Zsparse) be the direct z-dependency set of
Zsparse; we claim that direct(Zsparse) = Zsparse. Proof: By Equation (4.1),

         direct(Zsparse) = direct( ∪_{i ∈ ℱ̄} Z_i ) = ∪_{i ∈ ℱ̄} direct(Z_i),

where Z_i is the set of elements in the inverse pivot row of F̄_i and direct(Z_i) is the direct
z-dependency set for the entries in Z_i. Using the third and fourth results just discussed,

(4.6)    direct(Z_i) = {z_ji | j ∈ U_i′′} ∪ {z_jk | j, k ∈ U_i′′}.

Note that the set {z_ji | j ∈ U_i′′} = {z_ij | j ∈ U_i′′} = Z_i′′; clearly, these entries all belong
to Zsparse. The set {z_jk | j, k ∈ U_i′′} = Z̄_i, the inverse contribution block of F̄_i.
Every element in the inverse contribution block of F̄_i is also an element of Zsparse for
the following simple reason. Since the structure of F̄_i is equivalent to the structure
of its corresponding frontal matrix F_i, for every z_ij in the inverse contribution block
of F̄_i there is a nonzero update to the entry u_ij in the factor matrix U (or l_ij in
the factor matrix L if i > j): this obviously means that u_ij (or l_ij) is nonzero. By
definition z_ij ∈ Zsparse if u_ij is nonzero; therefore every entry of Z̄_i must also be an
entry of Zsparse. This is true for every inverse frontal matrix, therefore:

         direct(Zsparse) = direct( ∪_{i ∈ ℱ̄} Z_i ) = ∪_{i ∈ ℱ̄} direct(Z_i) = Zsparse. □

It is also easy to show that the direct z-dependency set of the sparse subset consisting
of entries from the diagonal of Z is equal to Zsparse. Proof: Recall that
direct(z_ii) = {z_ii} ∪ Z_ii = Z_i. (We have included z_ii in its own z-dependency set to
make the proof that follows simpler.) Therefore,

         ∪_{i ∈ ℱ̄} direct(z_ii) = ∪_{i ∈ ℱ̄} Z_i = Zsparse. □

4.4. The matrix-vector form of the Takahashi equation. The fourth
result discussed in the previous subsection shows that the computation of the off-
diagonal entries of inverse pivot row i directly involves only the entries in the inverse
contribution matrix of F̄_i. We also know by the second result that these dependencies
are "localized," in the sense that z_ij depends only on column j of the inverse
contribution matrix of F̄_i. This allows us to restate Equation (4.5) in matrix-vector
form for the special case where we are only concerned with computing the entries in
the inverse pivot row of F̄_i ∈ ℱ̄:

(4.7)    Z_i′′ = T_i Z̄_i,
         z_ii = d_ii^{-1} + T_i (Z_i′′)^T,

where Z_i′′ and T_i are treated as row vectors, and we have dropped the unnecessary
j subscript on the direct u-dependency vector.
4.5. Computational cost. Observe that |T_i| = |Z_ij| = |U_i′′|, where |x| is the
number of elements in the set/vector/matrix x. By Equation (4.5) it is easy to see that
the direct cost, in terms of multiply-add pairs of floating point operations, to compute
z_ij is |T_i| = |Z_ij| = |U_i′′|. This direct cost only represents the cost to compute z_ij once
all the inverse entries on which it depends have previously been computed. Clearly,
the overall floating point operation cost in computing z_ij must also include the floating
point work done in computing the entries on which z_ij depends, the indirect cost.
Since inverse pivot row i has |U_i′′| + 1 entries, the direct cost to compute it is
|U_i′′|(|U_i′′| + 1). To compute Zsparse we need to compute the entries of all inverse
pivot rows (see Equation (4.1)). Using the fact that Zsparse can be computed without
computing any element from outside this set, we get the following result for the total
number of multiply-add pairs of operations required to compute Zsparse,

         Σ_{i=1}^{n} |U_i′′| (|U_i′′| + 1),

which is the same as the number of operations required for the factorization (recall that
U = L^T). If, for example, LDU has a tridiagonal pattern, then |U_i′′| = 1 (i = 1, ..., n−1)
and |U_n′′| = 0, and the cost of computing Zsparse for this system is 2(n − 1) multiply-add
pairs. In general, if LDU is a band matrix of bandwidth 2m + 1, then

         |U_i′′| = m,    1 ≤ i ≤ n − m,

and

         |U_i′′| = n − i,    n − m + 1 ≤ i ≤ n.

The number of multiply-add pair operations to compute Zsparse is then

         Σ_{i=1}^{n−m} m(m + 1) + Σ_{i=n−m+1}^{n} (n − i)(n − i + 1) = m(m + 1)(3n − 2m − 1)/3.

When m is much less than n (the typical case) this computation is of order Θ(m²n).
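
As a quick sanity check of the closed band form (a sketch we added, not from the
report), the general sum over the non-pivotal set sizes can be compared against
m(m + 1)(3n − 2m − 1)/3:

def zsparse_madds(nonpivotal_sizes):
    # total multiply-add pairs: sum of |U_i''| (|U_i''| + 1) over all i
    return sum(c * (c + 1) for c in nonpivotal_sizes)

n, m = 1000, 5                            # band matrix of bandwidth 2m + 1
sizes = [m] * (n - m) + [n - i for i in range(n - m + 1, n + 1)]
assert zsparse_madds(sizes) == m * (m + 1) * (3 * n - 2 * m - 1) // 3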
4.6. The inverse multifrontal Zsparse algorithm. There are still two
important issues to be addressed:
(a) how to efficiently locate (assemble) the elements of the direct z-dependency
sets (except, of course, Z_ii = Z_i′′, which already lies within F̄_i), and
(b) the order in which the inverse elements should be computed, taking into
account the z-dependencies inherent in the Takahashi equation.
Note that the locations of the direct u-dependency sets are completely specified by
Equation (4.3).
4.6.1. Assembling the direct z-dependency set. Let node p be the parent
of node i in the inverse assembly tree. The following assertions are true:
(a) j ∈ U_i′′ implies j ∈ U_p,
(b) for every z_ij in the off-diagonal inverse pivot row of F̄_i, Z_ij is a subset of column j
of F̄_p, and
(c) the inverse contribution block of F̄_i is a subset of the entries in the parent
node F̄_p.
Assertion (a) follows immediately from the general properties of the index sets,
one of which is that U_i′′ ⊆ U_p ([16] and [12]). □
The proof of assertion (b) is as follows. By Equation (4.4) the direct z-dependency
set of z_ij is

         Z_ij = {z_kj | k ∈ U_i′′}.

Using the fact that U_i′′ ⊆ U_p we can write U_p = U_i′′ ∪ Q (for some, possibly empty,
index set Q). Then,

         {z_kj | k ∈ U_p} = {z_kj | k ∈ U_i′′} ∪ {z_kj | k ∈ Q},

from which we get Z_ij = {z_kj | k ∈ U_i′′} ⊆ {z_kj | k ∈ U_p}. Since from (a) we have
j ∈ U_p, we get the result that Z_ij is a subset of column j of F̄_p. □
Assertion (c) is an immediate consequence of assertions (a) and (b):
Z̄_i = ∪_{j ∈ U_i′′} Z_ij is a subset of the entries in F̄_p, since every such j lies in U_p by
assertion (a), and in each case Z_ij is a subset of column j of F̄_p by (b). □
Assertion (c) in essence assures us that the inverse contribution block of F̄_i
can always be assembled from the parent node of i in the inverse assembly tree.












while (computing Zsparse) do
    inverse ordering:
        select the next inverse frontal matrix, F̄_i, based on a
        top-down traversal of the inverse assembly tree
    inverse assembly:
        assemble the inverse contribution matrix Z̄_i from its parent node
    inversion:
        compute inverse pivot row i using Equation (4.7)
end while

FIG. 4.5. Algorithm to compute Zsparse


This is analogous to the situation in the symmetric multifrontal method, where the
contribution matrix of a child node can always be assembled into its parent node.
Being able to assemble the inverse contribution matrix from the parent node allows
for an implementation that can take advantage of dense kernels. With the appropriate
data structure, the inverse contribution block can first be assembled from its parent
and the matrix-vector product in Equation (4.7) computed using the level 2 BLAS.
Furthermore, applying the inverse assembly process recursively we get the important
result that the inverse assembly tree (with dependency arrows from parent to children)
captures the dependency relationships among the inverse frontal matrices. The
computation of Zsparse then becomes a simple matter of following the dependency
arrows of the inverse assembly tree, assembling the inverse contribution matrix and
doing the actual inversion using Equation (4.7) at every node. Note that in contrast
to the multifrontal factorization algorithm, the assembly in the inverse multifrontal
algorithm is a copy operation requiring no floating-point operations. This algorithm
is given in Figure 4.5.
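
The following compact sketch (our own illustration, not the authors' CRAY
implementation) realizes the algorithm of Figure 4.5 for simple nodes. It uses dense
NumPy products in place of the level 2 BLAS calls and, as a simplification, gathers
each inverse contribution matrix directly from the symmetric upper triangle of Z
instead of from an explicit parent stack (the stack-based layout is described in
Section 6):

import numpy as np

def inverse_multifrontal_zsparse(d, U, nonpiv):
    """d: diagonal of D; U: dense unit upper triangular factor;
    nonpiv[i]: sorted list U_i'' = {j > i : u_ij != 0} (all 0-based)."""
    n = len(d)
    Z = np.zeros((n, n))
    # Top-down traversal of the inverse assembly tree: since parent(i) > i,
    # visiting pivots in reverse order processes every parent first.
    for i in reversed(range(n)):
        J = nonpiv[i]
        T = np.array([-U[i, j] for j in J])      # direct u-dependency vector
        # inverse assembly: gather Zbar_i = {z_kj : k, j in U_i''}
        Zbar = np.array([[Z[min(k, j), max(k, j)] for j in J] for k in J])
        row = T @ Zbar if J else np.zeros(0)     # Equation (4.7), first part
        for t, j in enumerate(J):
            Z[i, j] = row[t]
        Z[i, i] = 1.0 / d[i] + (T @ row if J else 0.0)   # diagonal entry last
    return Z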

5. The supernodal inverse multifrontal method. The discussion of the
inverse multifrontal method in the previous section focused on simple inverse frontal
matrices containing one inverse pivot row and column. However, multifrontal codes
take advantage of related structures in the index pattern to form supernodes consisting
of several pivot rows and columns per frontal matrix. The use of these supernodes
permits the more efficient use of the memory hierarchy (cache and vector registers, for
example) found in high performance computer architectures. We show in this section
how to extend the simple algorithm for computing Zsparse to take advantage of the
supernodes formed during the LDU factorization.

5.1. The supernodal inverse constructs. Our definition of a supernode
comes from a theorem given by Liu, Ng, and Peyton in [18]:
THEOREM 5.1 ([18]). The column set S = {i, i+1, ..., i+m} is a supernode of
the matrix L if and only if S is a maximal set of contiguous columns such that i+j−1
is a child of i+j in the elimination tree (assembly tree in our case) for j = 1, 2, ..., m,
and

         η(L_{*,i}) = η(L_{*,i+m}) + m,

where η(v) gives the number of nonzeros in the vector v.
We label a supernode with the minimum column index in S. In the symmetric
multifrontal method the column set S, defined by Theorem (5.1), specifies the
pivot columns, and an equivalent set specifies the pivot rows. The simple inverse
multifrontal approach is extended in a natural way to include supernodes. The
supernodes and supernodal assembly tree of the LU factorization have their
corresponding inverse supernodes and inverse supernodal assembly tree, defined
analogously to those for the simple nodes discussed in Section 4. We continue to use
the same notation for the supernodal frontal and inverse frontal matrices as was
introduced in Section 4, with the understanding that we mean the supernodal
structures.
The supernodes are partitioned into the same four parts as the simple nodes,
except that the single pivot is now a block matrix, and both the off-diagonal pivot row
and pivot column are now block matrices. Figure 5.1 illustrates the general structures
and partitioning of the supernodal frontal/inverse frontal matrices. It should be noted
that now

         U_i′ = {i, i+1, ..., i+m} = S.

FIG. 5.1. Block structure of the supernodal F and F̄: the inverse frontal matrix is
partitioned into an inverse pivot block, an inverse off-diagonal pivot row block, an inverse
pivot column block, and an inverse contribution matrix
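
A greedy column scan (our own sketch; one simple reading of Theorem 5.1, not code
from [18]) can recover the supernode partition from the elimination tree parents and
the per-column nonzero counts of L:

def find_supernodes(parent, col_nnz):
    """parent[s]: elimination tree parent of column s (0-based);
    col_nnz[s]: number of nonzeros in column s of L."""
    n = len(parent)
    supernodes, start = [], 0
    for s in range(n - 1):
        chain = parent[s] == s + 1            # s must be a child of s + 1
        counts = col_nnz[start] == col_nnz[s + 1] + (s + 1 - start)
        if not (chain and counts):            # supernode cannot be extended
            supernodes.append(list(range(start, s + 1)))
            start = s + 1
    supernodes.append(list(range(start, n)))
    return supernodes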
5.2. The supernodal algorithm for Zsparse. The matrix-vector form of the
Takahashi equation, Equation (4.7), can be easily extended to take advantage of
the supernodal structures. The following Equations (5.1) give the extended form used
to compute the entries in the inverse pivot rows of F̄_i.

(5.1)    Z_i′′ = T_i′′ Z̄_i
         (Z_i′′)_{[r, j_1:j_c]} = (Z_i′′)_{[r, j_1:j_c]} + (T_i)_{[r, r+1:i+m]} (Z_i′′)_{[r+1:i+m, j_1:j_c]},    i ≤ r < i+m
         Z_i′ = D_i^{-1} + T_i′′ (Z_i′′)^T
         (Z_i′)_{[r, r+1:i+m]} = (Z_i′)_{[r, r+1:i+m]} + (T_i)_{[r, r+1:i+m]} (Z_i′)_{[r+1:i+m, r+1:i+m]},    i ≤ r < i+m
         (Z_i′)_{r,r} = (Z_i′)_{r,r} + (T_i)_{[r, r+1:i+m]} ((Z_i′)_{[r, r+1:i+m]})^T,    i ≤ r ≤ i+m

We use the colon notation found in [15], where X_{[p:q, r:s]} represents the submatrix with
row indices ranging from p to q and column indices from r to s. If p = q then X_{[p, r:s]}
is a row vector with column indices r through s. Similarly, X_{[p:q, r]} is a column vector
with row indices p through q. For U_i′ = {i, i+1, ..., i+m} and U_i′′ = {j_1, j_2, ..., j_c},
the matrices Z_i′, Z_i′′, Z̄_i, D_i^{-1}, and T_i are defined as follows (see Figure 5.1 also):

         Z_i′ = (F̄_i)_{[i:i+m, i:i+m]},

         Z_i′′ = (F̄_i)_{[i:i+m, j_1:j_c]},

         Z̄_i = (F̄_i)_{[j_1:j_c, j_1:j_c]},

         D_i^{-1} = (D^{-1})_{[i:i+m, i:i+m]},

and

         T_i = [ T_i′  T_i′′ ] = −U_{[i:i+m, i:j_c]},

respectively, where T_i′ = −U_{[i:i+m, i:i+m]} is the strictly upper triangular part within
the pivot block and T_i′′ = −U_{[i:i+m, j_1:j_c]}.

FIG. 5.2. The block-partitioned F̄: the staircase upper triangular part of an inverse
frontal matrix partitioned into b-by-b blocks
There are three potential problems with using Equations (5.1) in an
implementation. First, a modest number of unnecessary floating point operations
may be performed, due to the third subequation in (5.1). Here, partial computation of
entries in the lower triangular part of Z is done. The idea in using this subequation
is to be able to take advantage of the level 3 BLAS operations. We can rewrite the
subequation in terms of the less efficient level 2 BLAS operations; we need to balance
the desire for high megaflop performance (using the level 3 BLAS) against the increase
in CPU time as the inverse pivot blocks get bigger and more unnecessary operations
are done.
The second potential problem is that for the larger sized problems where the
supernodes can become relatively large, the block sizes can become larger than the
"optimum" level 3 BLAS operating size. This optimum block size will, of course,
depend on hardware characteristics such as the cache size and vector register size.
Clearly, it would be advantageous to restrict the block sizes so that the computation
fits in cache and/or makes effective use of vector registers.
The third concern has to do with memory usage. In order to take advantage of
the level 3 BLAS we need to use more memory than is strictly necessary. This can be
serious for the larger problems if the block size is set too large.
while (computing Zsparse) do
1: inverse ordering:
       get the next supernode, i, from a top-down traversal
       of the inverse supernodal assembly tree
2: inverse assembly:
       assemble the "staircase" upper triangular part of Z̄_i,
       the sparse inverse contribution matrix of F̄_i, from its parent node
3: inversion:
       compute the inverse pivot row using Equation (5.2)
end while

FIG. 5.3. The block-partitioned algorithm to compute Zsparse


To control the negative impact of these three problems we use the two-dimensional
partitioning scheme shown in Figure 5.2 for each of the inverse frontal matrices.
The upper "staircase" diagonal form of the matrix still involves some unnecessary
computation and unnecessary storage, but these can both be controlled by regulating
the size of the blocking parameter. Assuming that the blocking size, b, is the same in
both dimensions, we get the block-partitioned form of Equations (5.1):

(5.2)    Z_ij = δ_ij D_ij^{-1} + Σ_{k=i+1}^{j} T_ik Z_kj + Σ_{k=j+1}^{m} T_ik (Z_jk)^T
         (Z_ij)_{[r, 1:b]} = (Z_ij)_{[r, 1:b]} + (T_ii)_{[r, r+1:b]} (Z_ij)_{[r+1:b, 1:b]},    1 ≤ r < b
         (Z_ii)_{r,r} = (Z_ii)_{r,r} + (T_ii)_{[r, r+1:b]} ((Z_ii)_{[r, r+1:b]})^T,    1 ≤ r < b

where the bold-type indices are local block indices, m = m1 + m2, with m1 and m2
defined by Equation (5.3), 1 ≤ i ≤ m1, and i ≤ j ≤ m:

(5.3)    m1 = ⌈|U_i′|/b⌉
         m2 = ⌈|U_i′′|/b⌉.

In the example shown in Figure 5.2, m1 = 5 and m2 = 3. The block-structured form
of the Zsparse algorithm based on Equation (5.2) is shown in Figure 5.3.
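
For concreteness, here is a dense sketch (our own, with assumed inputs) of the
unblocked Equations (5.1) for a single supernode; the b-by-b blocking of
Equation (5.2) would further tile these same products:

import numpy as np

def invert_supernode(Tp, Tn, Zbar, dinv):
    """Tp: -U within the pivot block (strictly upper, (m+1)-square);
    Tn: -U over the non-pivotal columns ((m+1)-by-c); Zbar: assembled
    c-by-c inverse contribution matrix; dinv: 1/d for the pivot rows."""
    npiv = Tp.shape[0]
    Zn = Tn @ Zbar                            # level 3: off-diagonal row block
    for r in reversed(range(npiv - 1)):       # within-block corrections
        Zn[r, :] += Tp[r, r + 1:] @ Zn[r + 1:, :]
    Zp = np.diag(dinv) + Tn @ Zn.T            # level 3: inverse pivot block
    for r in reversed(range(npiv - 1)):
        Zp[r, r + 1:] += Tp[r, r + 1:] @ Zp[r + 1:, r + 1:]
        Zp[r, r] += Tp[r, r + 1:] @ Zp[r, r + 1:]
    return Zp, Zn

As in the text, the GEMM that forms the pivot block touches its (symmetric) lower
triangle as well; that is the unnecessary work the blocking parameter b is meant to
bound.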
6. Numerical experiments. We implemented the block-partitioned Zsparse
algorithm discussed in the last section, and a "scalar" Zsparse algorithm (discussed
below), on the CRAY Research, Inc. CRAY-C98 platform. The CRAY-C98 is an 8-
processor, 512 Mw shared-memory machine. Our runs were restricted to a single
processor. Each processor has vector capability and a peak theoretical performance
of one billion floating point operations per second (1 Gflop). Version cf77 6.0.4.1 of
the CRAY Fortran compiler with automatic vectorization was used. All arithmetic
operations were done in single precision (64 bit). The CRAY Research, Inc. optimized
BLAS were used to do the matrix-vector and matrix-matrix computations.
6.1. Data structures. The available memory (not including the memory to
hold the input factors, tree information, and integer work arrays) is partitioned as
shown in Figure 6.1. This is essentially a two-stack layout. The z-value memory area
is used as the permanent storage area for the computed inverse entries. This forms
the left-stack and grows towards the right, as indicated by the right-arrow. The right-
stack, labeled the parent area in Figure 6.1, is used to store the upper triangular parts
of parent nodes with children yet to be inverted. Data are moved from this area to the
inverse contribution block in the active area during inverse assembly. The right-stack
grows and shrinks as inverted parent nodes are allocated or deallocated, respectively.
A point to note here is that a stack structure is the "natural" data structure to use for
the parent memory area when the inverse assembly tree is traversed in a depth-first
manner: when F̄_i is to be inverted, its parent lies at the top of the stack and, therefore,
no overhead is incurred in locating the parent node. If a breadth-first traversal is used,
however, then a queue would be a better data structure. We used a depth-first
traversal scheme.

FIG. 6.1. Partitioning of memory
The memory area in Figure 6.1 labeled the active area is used as the actual center
of computation. When F̄_i is to be inverted, memory of size

         b²[ m1(m1 + 1)/2 + m2(m2 + 1)/2 + m1 m2 ]

(see Section 5 for the definitions of m1 and m2) is allocated to accommodate its staircase
upper diagonal form. The location of the active area is fixed with respect to the top of
the left-stack. A preprocessing phase is used to determine the maximum size to which
the stacks and active area can grow. After an inverse frontal matrix, such as F̄_i, is
inverted, the input factors corresponding to the pivot row of F_i can be overwritten
with the inverse entries in the inverse pivot row of F̄_i. Significant savings in the amount of
stack memory used can result if this overwriting is allowed. We therefore provide the
user with a parameter to control this feature.
6.2. The scalar Zsparse algorithm. To serve as a basis for comparison we
implemented a "scalar" version of the Zsparse algorithm, shown in Figure 6.2. The
implementation of this algorithm allowed the use of level 1 BLAS operations (saxpy
and dot product, with vectorized gather/scatter operations), but no level 2 or 3 BLAS
operations.
The algorithm proceeds by computing elements in rows n through 1, step 1.
Within each row, say i, the only off-diagonal elements in Zsparse are those with
indices from the index pattern of row i of the factor matrix U, i.e., U_i′′. This is done
in step 2 for the off-diagonal entries and in step 6a for the diagonal entry (z_ii). Note
that z_ii is evaluated after all the off-diagonal entries in row i have been evaluated.
Step 3a involves a saxpy operation, while steps 3b and 6a involve dot products.
6.3. Numerical results. Statistics for the test matrices used in our numerical
experiments are shown in Table 6.1. Most of these matrices are part of the Harwell-
Boeing sparse matrix collection [10]. The sizes of the matrices range from small (the
100-by-100 Nos4 matrix) to fairly large (74752-by-74752 for the FINAN512 matrix).
The matrices were first factored using an early modified form, with strict diagonal
pivoting, of UMFPACK Version 2.0 [7, 6]. The supernodal tree information was
also generated by UMFPACK. In general, any supernodal or multifrontal Cholesky
factorization algorithm can be used.
1:  for i = n to 1
2:      for k ∈ U_i′′
3:          for j ∈ (U_i′′ and j > k)
3a:             z_ij = z_ij − u_ik z_kj
3b:             z_ik = z_ik − u_ij z_kj
            end for
4:          z_ik = z_ik − u_ik z_kk
        end for
5:      z_ii = d_ii^{-1}
6:      for k ∈ U_i′′
6a:         z_ii = z_ii − u_ik z_ik
        end for
    end for

FIG. 6.2. The scalar Zsparse algorithm
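
Read as code, the loop structure of Figure 6.2 might look as follows (our own sketch
on dense arrays; Z must start zeroed): step 3a is a saxpy across row i, while steps 3b
and 6a are dot-product accumulations.

def scalar_zsparse(d, U, nonpiv, n, Z):
    for i in reversed(range(n)):               # step 1: rows n down to 1
        for t, k in enumerate(nonpiv[i]):      # step 2: k in U_i''
            for j in nonpiv[i][t + 1:]:        # step 3: j in U_i'', j > k
                Z[i, j] -= U[i, k] * Z[k, j]   # 3a
                Z[i, k] -= U[i, j] * Z[k, j]   # 3b
            Z[i, k] -= U[i, k] * Z[k, k]       # step 4
        Z[i, i] = 1.0 / d[i]                   # step 5
        for k in nonpiv[i]:                    # step 6
            Z[i, i] -= U[i, k] * Z[i, k]       # 6a
    return Z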




Table 6.2 gives the execution times for the scalar and block-partitioned Zsparse
implementations. For the block-partitioned algorithm we give CPU times for block
sizes 8, 32, 64, 128, and 256. In the column labeled "peak mflops of block," the
best megaflop performance is given (the block size for which this occurs is shown in
parentheses). The worst CPU performance for all the matrices occurs for a block size
of 8; for block sizes smaller than 8 we got even larger CPU times for all the matrices.
This is no doubt due to the poor performance of the BLAS on the smaller-sized blocks
and the increasing overhead involved in creating the larger number of blocks. But
even for a block size of 8 the inverse multifrontal block-partitioned algorithm has a
clear performance advantage over the scalar algorithm for all but the smallest matrix
(Nos4) and the very sparse BCSPWR10 matrix. As the block size increases, the
CPU times for the block algorithm typically decrease until some optimum block size
is reached, after which the CPU times begin to rise. One might have expected that a
block size equal to the length of the vector registers (128) would give the best
CPU performance. However, we see from Table 6.2 that a block size of 64 gives the
best CPU performance. We attribute this to the increasing number of
unnecessary operations that are performed on the diagonal blocks. As the block size
increases, the ratio of unnecessary to necessary operations also increases, resulting in a
relative increase in the CPU times. This explanation is further supported by the fact
that the highest megaflop performance occurs for a block size of 128 (for the larger
matrices at least), implying that the BLAS are more efficient but more operations are
being done.


TABLE 6.1
Information on test matrices

matrix      n      discipline       |L| (×10³)
Nos4        100    structural eng.  0.8
PLAT1919    1919   oceanography     83.4
BCSPWR10    5300   electric power   40.0
BCSSTK28    4410   structural eng.  468.8
BCSSTK25    15439  structural eng.  2425.6
FINAN512    74752  economics        9564.8












TABLE 6.2
Results for the scalar and multifrontal Zsparse runs

                     CPU time (sec)                              peak mflops
matrix      scalar   block size:                                 of block
                     8       32      64      128     256
Nos4        0.003    0.007   0.007   0.007   0.007   0.007       1 (8)
PLAT1919    1.022    0.268   0.161   0.155   0.155   0.155       33 (64)
BCSPWR10    0.203    0.405   0.387   0.387   0.387   0.387       2 (32)
BCSSTK28    13.01    1.70    0.64    0.59    0.62    0.66        127 (128)
BCSSTK25    170.9    19.52   5.64    5.01    5.14    5.29        187 (128)
FINAN512    2469.1   325.5   52.5    42.25   44.05   47.29       320 (128)


7. Conclusions. We introduced the symmetric inverse multifrontal approach to
computing the sparse inverse subset. This method uses the concept of inverting an
inverse frontal matrix using dense matrix kernels. Similarities between the inversion of
an inverse frontal matrix and the familiar concept of factorizing a frontal matrix used
in a multifrontal formulation of Gaussian elimination were highlighted. It should
be pointed out that it is not essential that the LDU factorization be done by the
multifrontal method for the inverse multifrontal method to be used. All we really need
is the pattern of nonzeros in the factorized matrix; even the supernodal structures
can be reconstructed from the filled-matrix pattern information [18]. The essential
theoretical results, stated in Equations (4.3) and (4.4), and the ensuing observations,
give a precise description of the locations of the direct dependency sets when inverting
an inverse frontal matrix. This allows the use of level 2 and level 3 BLAS in the
implementation.
The high performance that we achieve in computing Zsparse based on the block-
partitioned inverse multifrontal algorithm is evident from the CPU times and mflops
rates found in Table 6.2. The use of the higher level BLAS primitives is certainly the
dominant contributing factor in the speedup over the scalar algorithm (which only
takes advantage of gather/scatter vector operations).
The presentation in this paper focused on the special case of computing Zsparse
where the matrix A and, therefore, its inverse Z, are numerically symmetric. The
theoretical development and implementation are very similar for computing Zsparse
when A and Z are unsymmetric in value but have symmetric nonzero patterns. We
discuss a parallel implementation of our algorithm in [5], and an extension to arbitrary
subsets of Z in [4].
An open problem remains in adapting the inverse multifrontal method to the
case where the coefficient matrix has been reduced to block-triangular form before
the factorization is done.

8. Acknowledgements. Support for this project was provided by the National
Science Foundation (DMS-9223088 and DMS-'- '.1 1''74), and by CRAY Research, Inc.,
through the allocation of supercomputing resources.


REFERENCES















[1] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering algorithm.
SIAM J. Matrix Analysis and Applications, (to appear). Also CISE Technical Report TR-
94-039.
[2] R. Betancourt and F. L. Alvarado. Parallel inversion of sparse matrices. IEEE Transactions
on Power Systems, PWRS-1(1):74-81, 1986.
[3] Y. E. Campbell. Multifrontal algorithms for sparse inverse subsets and incomplete LU
factorization. PhD thesis, Computer and Information Science and Engineering Department,
Univ. of Florida, Gainesville, FL, November 1995. Also CISE Technical Report TR-95-025.
[4] Y. E. Campbell and T. A. Davis. On computing an arbitrary subset of entries of the inverse of a
matrix. Technical Report TR-95-022, Computer and Information Science and Engineering
Department, Univ. of Florida, 1995.
[5] Y. E. Campbell and T. A. Davis. A parallel implementation of the block-partitioned inverse
multifrontal Zsparse algorithm. Technical Report TR-95-023, Computer and Information
Science and Engineering Department, Univ. of Florida, 1995.
[6] T. A. Davis and I. S. Duff. A combined unifrontal/multifrontal method for unsymmetric sparse
matrices. Technical Report TR-95-020, Computer and Information Science and Engineering
Department, Univ. of Florida, 1995.
[7] T. A. Davis and I. S. Duff. An unsymmetric-pattern multifrontal method for sparse LU
factorization. SIAM J. Matrix Analysis and Applications, (to appear). Also CISE Technical
Report TR-94-038.
[8] J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of Level 3 Basic Linear Algebra
Subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, 1990.
[9] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford Science
Publications, New York, NY, 1991.
[10] I. S. Duff, R. G. Grimes, and J. G. Lewis. Users' guide for the Harwell-Boeing sparse matrix
collection (release 1). Technical Report RAL-92-086, Rutherford Appleton Laboratory,
Didcot, Oxon, England, Dec. 1992.
[11] I. S. Duff and J. K. Reid. The multifrontal solution of unsymmetric sets of linear equations.
SIAM J. of Sci. and Statis. Computing, 5(3):633-641, 1984.
[12] I. S. Duff and J. K. Reid. The multifrontal solution of indefinite sparse symmetric linear
equations. ACM Trans. Math. Software, 9:302-325, 1983.
[13] A. M. Erisman and W. F. Tinney. On computing certain elements of the inverse of a sparse
matrix. Communications of the ACM, 18:177-179, March 1975.
[14] A. George and J. W. H. Liu. Computer Solution of Large Sparse Positive-Definite Systems.
Prentice-Hall, Englewood Cliffs, NJ, 1981.
[15] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press,
Baltimore, MD and London, UK, second edition, 1990.
[16] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix Anal. Appl.,
11(1):134-172, 1990.
[17] J. W. H. Liu. The multifrontal method for sparse matrix solution: Theory and practice. SIAM
Review, 34(1):82-109, March 1992.
[18] J. W. H. Liu, E. G. Ng, and B. W. Peyton. On finding supernodes for sparse matrix
computations. SIAM J. Matrix Anal. Appl., 14(1):242-252, January 1993.
[19] K. Takahashi, J. Fagan, and M. Chin. Formation of a sparse bus impedance matrix and its
application to short circuit study. 8th PICA Conference Proc., Minneapolis, Minn., pages
177-179, June 4-6, 1973.
Note: all University of Florida technical reports in this list of references are
available in postscript form via anonymous ftp to ftp.cis.ufl.edu in the directory
cis/tech-reports, or via the World Wide Web at http://www.cis.ufl.edu/~davis.



