Dynamic supernodes in sparse Cholesky
update/downdate and triangular solves *
TIMOTHY A. DAVIStand WILLIAM W. HAGERt
September 8, 2006
Technical report TR2006004, CISE Dept, Univ. of Florida, Gainesville, FL
Abstract
The supernodal method for sparse C'!i n!. !: factorization represents the factor L as
a set of supernodes, each consisting of a contiguous set of columns of L with identical
nonzero pattern. A conventional supernode is stored as a dense submatrix. While this
is suitable for sparse C'!i .1 lI: factorization where the nonzero pattern of L does not
change, it is not suitable for methods that modify a sparse C'!i .1 I:y factorization after
a lowrank change to A (an update/downdate, A A+WWT). Supernodes merge and
split apart during an update/downdate. Dynamic supernodes are introduced, which
allow a sparse C'!i, !. I:y update/downdate to obtain performance competitive with con
ventional supernodal methods. A dynamic supernodal solver is shown to exceed the
performance of the conventional (BLASbased) supernodal method for solving trian
gular systems. These methods are incorporated into CHOLMOD, a sparse C'l!,.' I:y
factorization and update/downdate package, which forms the basis of x=A\b in MAT
LAB when A is sparse and , rii, Hii positive definite.
1 Introduction
Given a sparse, symmetric positive definite matrix A with a Cholesky factorization A LLT
or A LDLT and a low rank modification A A + WWT (update) or A A WWT
(downdate), we consider the problem of computing the C'!i1,, l:y factorization of A while
exploiting the supernodal structure of L. Since an update operation changes the supern
odal structure, it is not easy to take advantage of the original supernodes. This leads us to
develop a new i,.i'i,. supernodal update il1..'':thm. In the dynamic algorithm, the supern
odes are detected and exploited as the update progresses. In a similar fashion, we obtain a
*This work was supported by the National Science Foundation under grants 0203270 and 0620286.
tDept. of Computer and Information Science and Engineering, Univ. of Florida, Gainesville, FL, USA.
email: davis@cise.ufl.edu. http://www.cise.ufl.edu/~davis.
tDept. of Mathematics, Univ. of Florida, Gainesville, FL, USA. email: hager@math.ufl.edu.
http://www.math.ufl.edu/~hager.
i ,,in. supernodal solve in which the supernodes are detected as the solve progresses. Up
date/downdate problems such as these arise in optimization algorithms, sensitivity analysis,
and many other areas [20]. The sparse rank1 update when L is not a supernodal Cholesky
factorization is discussed in [7], while the multiple rank case is in [8]. It is assumed that A
has already been permuted by a fillreducing oi1. l ir,:. this is a large and critical topic in
itself which is beyond the scope of this paper [6].
The supernodal C'!. .,! l:y factorization method [2, 11, 21, 26, 28, 29] exploits dense matrix
kernels during the factorization and solution of the resulting triangular systems. It is based
on supernodes, which are .,.i i,'ent columns of L with identical nonzero pattern stored as a
single dense submatrix of L. In this paper, the supernodal structure of L is exploited in the
initial factorization, the lowrank update/downdate, and in the solution of the triangular
systems required to solve Ax b after the C'! .1 l:y factorization A LLT is computed. As
the matrix is updated or downdated, supernodes can merge and split apart. Conventional
supernodes cannot be adapted to this problem. In this paper we show how these tI;,',,,,
supernodes can be effectively exploited to obtain high performance.
Section 2 provides a brief overview of supernodes in sparse C'!i. .! ly factorization and in
the solution of triangular systems. Sections 3 and 4 present our update/downdate method
and triangular solvers, both of which exploit dynamic supernodes. The performance of our
new methods is illustrated in Section 5.
2 The supernodal method
The primary purpose of the supernodal method is to obtain high performance on mod
ern computer architectures with memory hierarchy by exploiting dense submatrices in the
sparse factorization. Improved locality also enables its use on parallel computers, but only
sequential algorithms are considered here.
The use of dense matrix kernels (the BLAS [22, 14, 13]) is a common technique for im
proving the performance of sparse matrix factorization and the solution of the subsequent
triangular systems that are required to solve Ax = b for a general matrix A. Supernodal
and multifrontal methods both exploit the nearlyidentical sparsity structure often shared by
.,ili i.ent columns of L (and rows of U in the unsymmetric case), for either LU or Cholesky
factorization (A LU or A LLT). These methods are able to use the BLAS to ob
tain a level of performance that is a significant fraction of the computer's theoretical peak
performance.
2.1 Finding supernodes
Informally, a supernode is a set of .,Ii i:ent columns of L that have an identical nonzero
pattern. They are typically stored in a way that exploits this structure, such as a dense
matrix of dimension sbyz where s is the number of nonzeros in the leftmost column in
the supernode and z is the number of columns in the supernode. Dense matrix kernels are
used to compute each supernode, and to apply each supernode to the righthand side when
solving a triangular system.
More precisely, a supernode is defined by a chain of nodes in the elimination tree, and
the sparsity pattern of the corresponding columns of L. The elimination tree of an nb\,,
matrix A is a tree of n nodes [23, 24, 30]. The parent of node j in the tree is given by the
first offdiagonal nonzero entry lij in column j,
parent(j) = min{i I i > jandij / 0}. (1)
If this set is empty, then node j is a root of the elimination tree. The tree may actually
be a forest with more than one root, but it is still conventionally called an elimination
tree. Numerical cancellation is ignored, so the term "lj / 0" in (1) should be understood
to be true for any entry lij that must be computed during C'!i..1 l:y factorization; it may
occasionally be zero numerically. Otherwise, the definition of the elimination tree breaks
down, as do many theorems regarding sparse C'!i .1 l:y factorization.
Let j denote the nonzero pattern of column j of L. That is, Lj = {i I lij $ 0}. A
fundamental supernode is a maximal sequence of z columns f, f + 1,..., f + z 1 such that
for any successive pair of columns j 1 and j in the list, j 1 is the only child of j, and
Lj = j_1 \ {j}. That is, the set of columns in a supernode forms a chain in the elimination
tree, and have identical nonzero pattern (excluding entries in the upper triangular part).
Column f is the first, or 1. ilii column in the supernode. It may have any number of
children in the elimination tree. For a relaxed supernode, some of the constraints of this
definition are loosened; two columns may be placed in the same supernode if their nonzero
patterns are similar but not identical, and j 1 need not be the only child of j, for example.
Supernodes can be found without constructing the nonzero pattern of L, in time that
is essentially linear in the number of nonzeros of A [25]. First, the elimination tree of A is
computed in nearly O(A) time [23, 24].1 More precisely the time is O(IAa(A, n)) where a
is the inverse Ackerman function, a function that grows extremely slowly. Thus in practical
terms the time is O(IAI).
Next, the elimination tree is typically reordered via a depthfirst postordering, taking
O(n) time. In a depthfirst postordering, the d descendants of a node j in the elimination
tree are all numbered j d through j 1. The postordering ensures that a node with only
one child c is ak,v numbered c+1. This maximizes the sizes of fundamental supernodes in
the matrix L. Postordering also improves memory locality during numerical factorization.
The postordering is a permutation of A, but has no effect on the the number of nonzeros in
L. It thus has no effect on the fillreducing ordering. In MATLAB, [parent, q] =etree (A)
computes both the elimination tree (parent) and its postordering (q) using CHOLMOD.
Once the tree is found and postordered, the number of entries in each column of L is
found, using an algorithm that takes nearly O(A) time [16]. In MATLAB, this is computed
by the routine symbfact, also using CHOLMOD. If count=symbfact(A), then count(j)
= j l. The column counts and the elimination tree are then used to find the fundamental
supernodes. Consider the jth column of L. Its nonzero pattern is related to the nonzero
patterns of the children of node j in the elimination tree [15],
Lj= A U{j} U paj \ {c}L ) (2)
parent(c)
'The number of nonzeros in matrix or vector x, or the size of a set x, is denoted as x.
where Aj is the nonzero pattern of the jth column of the strictly lower triangular part of
A. A lower bound on the column count of j is thus I l > I, 1. The following condition
defines a fundamental supernode.
Condition 2.1 Columns j 1 and j are members of the same fundamental supernode if
and only if ljl = 1 j_1 1 and j 1 is the only child of j in the elimination tree.
Thus, supernodes can be found in nearly O(A) time without accessing or computing
the nonzero pattern of L. Only the elimination tree and the column counts are required.
This observation is essential to the dynamic supernodal routines described in Sections 3 and
4. The restriction on j 1 being the only child of j can be relaxed, resulting in larger
supernodes.
Condition 2.2 Columns j 1 and j can be members of the same supernode if IjL =
1lj_1 1 and j 1 is a child of j in the elimination tree.
2.2 Supernodal factorization
In the nonsupernodal leftlooking C'!. .. l:y factorization algorithm, the kth step of fac
torization computes the kth column of L, accessing columns 1 through k 1 of L and
column k of A. Each step consists of a sparsematrixvector multiply (in MATLAB nota
tion, A(k:n,k)L(k:n, 1:k1)*L(k, 1:kl) ') followed by a square root and scaling of the
kth column of L.
Supernodal Cholesky factorization is a blocked version of the leftlooking method, where
each block is a supernode. The method can be derived from the expression
L1 L T LT LT An AT A
L11 ] L 21L 31 ] [A 21 A 31
L21 L22 L2 L32 A21 A22 A2 (3)
L31 L32 L33 L33 A31 A32 A33
where the middle block row and column of each matrix are rows and columns corresponding
to the kth supernode. If the columns of L corresponding to the first k 1 supernodes are
known (L11, L21, and L31), then the kth supernode can be computed, using the following
algorithm.
First, a sparse matrix product is performed to initialize the kth supernode,
I A22 L21 L T (4)
r2 A32 L31 21*
The L21 and L31 matrices split into a set of supernodes. The sparse matrix multiplication is
performed one supernode at a time, using a dense matrix multiplication for each supernode.
The subtraction in (4) does not use dense matrix operations, since the nonzero patterns of
each supernode are different. Instead, a scatter operation is used. Fortunately, most of the
floatingpoint operations are performed in the dense matrix multiply.
Next, the dense Cholesky factorization S1 L22L is computed. This is a dense subma
trix factorization, since L22 is the diagonal block of a single supernode. The nonzero patterns
of these columns are all the same, and the subdiagonal of L22 is all nonzero in these columns
(the columns form a chain in the elimination tree). Thus, L22 is a dense matrix.
Finally, the triangular system L32LT S2 is solved for L32. The L32 matrix is sparse,
but each column has the same nonzero pattern, and thus a dense triangular solver is used
for this step, with multiple dense righthand sides.
Since S1 and S2 have the same nonzero pattern that is a subset of the kth supernode,
'! i:, can be stored in the same place as L22 and L32, respectively. The supernode
L22
L32
is stored in a sbyz dense matrix, where s > z is the number of nonzeros in the first column
of the supernode.
In MATLAB 7.2, x=A\b and chol use the above supernodal C'!..1! l:y factorization
method, as implemented in CHOLMOD. The performance of CHOLMOD and ii' 'i: other
sparse Cholesky factorization packages is discussed in [19].
2.3 Supernodal solve
Consider the triangular system Lx b,
L[ l ] i b
L21 L22 2 2 (5)
L31 L32 L33 X3 b3
where L is partitioned the same as in (3), and x is a dense vector. In the forward solve,
the kth step requires the solution of a dense lower triangular system L222 b2 where
b2 = b2 L21X is computed first. Next, L32X2 is subtracted from the righthand side,
requiring a dense matrixvector multiplication and a sparse gather/scatter operation (since
L32 is the subdiagonal part of the k supernode). Most of the work is thus performed with
dense matrix kernels (the level2 BLAS). Matrixmatrix operations are used if x is a dense
matrix rather than a vector.
This method is used in x=A\b in MATLAB 7.2 when A is sparse and symmetric positive
definite, as implemented in CHOLMOD. It is not used in x=L\b when L is lower triangular,
since L is not stored in supernodal form (even if L comes from a supernodal sparse Cholesky
factorization, L=chol (A)). Performance comparisons with other triangular solvers are given
in [19].
3 Updating a sparse Cholesky factorization
3.1 Nonsupernodal update
Consider the rank1 update/downdate, A = A wwT where w is a sparse column vector.
If the C!i. .1, l:y factorization LLT of A is known (or A = LDLT where D is diagonal), the
factorization of the modified matrix A can be found in time proportional to the number of
entries in L that change [7]. This includes the time required to modify the nonzero pattern
of L, if the pattern needs to change. For additional background on the update/downdate
problem, see (for example) [3, 4, 17, 31, 32, 27]. A simple rank1 update/downdate of a
sparse LLT factorization that does not change the nonzero pattern of L is discussed in detail
in [6]; that algorithm is a mere 35 lines of C.
The rank1 update/downdate is discussed in [7]. During an update, A = A + ww, no
entries are removed from L; entries are only added. During a downdate, A A wwT,
entries could be dropped (but not added) if A CCT and w is one of the columns of C.
The driving motivation is an activeset linear programming method [10, 9], where C is the
matrix comprised of the columns corresponding to the active variables. Dropping entries
requires the nonzero pattern of L to be held as a multiset, with multiplicities for each entry
in the set. This extra information is costly for a general application, and is thus it is often
more efficient to retain all the nonzeros during a downdate rather try to determine which
nonzero should be zero after the downdate.
Instead, assume that either an update or downdate can add entries to A and thus also
to its updated/downdated C'!. .1 l:y factor L. Neither the update nor the downdate are
assumed to remove any entries. For either an update or a downdate, the entries that change
in L correspond to a single path in the elimination tree of the modified matrix A. The path
starts from the node i corresponding to the smallest row index of nonzero entries in w, and
proceeds upwards, ending at the root of the tree. For a multiple rank update/downdate
(A A WWT where W is nbyk), the columns that change correspond to a set of paths
in the elimination tree of A, each starting at the node corresponding to the smallest row
index of nonzero entries in each column of W [8]. Paths that merge as h1! i:, proceed upwards
toward the root result in a rankk update of the columns along that path, where in this case
k is the number of paths from columns of W that have merged.
3.2 Supernodal update
If supernodes are exploited, better performance could be obtained, in much the same way as
supernodes can improve the performance of sparse Cholesky factorization and solves. How
ever, adding entries to a supernodal factorization is problematic. A single update/downdate
can cause some supernodes to merge and others to split apart. If two .,li i.:ent columns j 1
and j of L are not in the same supernode, an update/downdate could add entries to j 1 so
that now these two columns have the same nonzero pattern, causing them to merge into one
fundamental supernode. Similarly, if the two columns are identical, an update/downdate
could add entries to column j but not to j 1, causing the supernode to split apart.
Supernodes are conventionally stored in a dense sbyz matrix. Modifying this structure
would lead to an algorithm whose time complexity could be far from optimal. Suppose an
update/downdate modifies only the last column of the supernode, causing the supernode to
split. The time required to modify the numerical values in the supernode would be O(s z),
but O(sz) time would be required for the data movement that splits the supernode in two.
This is not a viable solution. Thus, supernodes were not considered in [7, 8].
It would be simpler to assume, as in [6], that the nonzero pattern of L does not change.
The matrix L could be kept in its supernodal form, and the update/downdate could then
exploit this structure to obtain higher performance than a nonsupernodal update/downdate.
The structure of L, and its supernodes, would be static. However, this requirement is too
6
(5)
4\
2 4w
1 r
Figure 1: Example elimination tree
limiting for ini iv applications. In particular, it would not be suitable in our motivating
application, the LP Dual Active Set Algorithm, LPDASA.
Thus, our goal is a rankk update/downdate method that simultaneously exploits supern
odes and allows the nonzero pattern of L to change. The solution is to not store supernodes
in their conventional form. Instead, the matrix L is stored in a conventional nonsupernodal
compressed sparse column form, where each column of L is stored as a list of numerical values
of the nonzero entries, and an integer list of the corresponding row indices. MATLAB uses
a similar data structure, except that to allow for columns to grow and shrink, gaps between
the columns of unused memory space are permitted in our data structure, and the columns
need not appear in monotonic order in memory. Supernodes are detected dynamically, as
the update/downdate progresses through the matrix.
The update/downdate is split into two phases: symbolic and numeric. The symbolic
update is identical to that in [8], except that multiplicities are not kept. The run time of the
symbolic phase is alvl ..; ,mptotically dominated by the numeric update; it is much less if
the pattern does not change or changes only very little. The numeric update proceeds along
a series of disjoint subpaths, each of which computes an update of rank anywhere from 1
to k. As it proceeds, it detects supernodes dynamically. If columns jl and j2 are .,Ii i:ent
columns on the same subpath, then jl is a child of j2. If in addition, the number of nonzeros
in column jl is one more than the number of nonzeros in column j2, then jl and j2 lie in the
same dynamic supernode. This test takes 0(1) time, and can be done without examining
the nonzero patterns of the two columns. This is a relaxation of the restriction that j 1
and j be .ili i:ent in the matrix. Consider the small elimination tree in Figure 1. Suppose
the update path starts at node 1, and that columns 3 and 4 are not updated. Columns 1, 2,
5, and 6 could all be part of the same dynamic supernode.
Condition 3.1 Columns c and j can be members of the same it ;,ri' supernode if Cj
I,c 1, c is a child of j in the elimination tree, and c and j are adjacent nodes in a disjoint
subpath of the subtree of columns ,,i.. .:I, l by the update/downdate.
Columns can be added to a dynamic supernode by looking ahead along the path and
stopping when Condition 3.1 is broken. For the triangular solve of LTx b and Lx b,
respectively, Condition 2.2 is used when b is dense, because all columns of L take part in the
solve.
To understand how the update/downdate algorithm operates on a supernode, we first
examine how the update/downdate algorithm operates on a dense matrix.
3.3 Dense Cholesky update
Consider the method in Algorithm 1 for updating/downdating a dense LDLT factorization
(a modified version of Method C1 in [17]). It computes the new factorization LDLT
LDLT WWT, where W is nbyk. The a term is equal to +1 for an update or 1 for
a downdate. A ,: other methods are possible (see [17], for example). They all include an
innermost loop in which W and a column of L are used to modify each other.
ALGORITHM 1 (Dense rankk update/downdate).
for r =1to k do
ar = 1
for j = 1 to n do
for r =1to k do
a = a, + aw,/dj
dj = dja
'r = awjr/dj
dj = djl/r
ar = a
for i j +1 to n do
for r =1to k do
Wir ir Wjrlij
lij lij r ir
An update of the LLT factorization is nearly identical. For example, the cs_updown
function in CSparse [6] is based on Bischof, Pan, and Tang's [3] combination of Carlson's
update [4] and Pan's downdate [27]. All of these methods modify the LLT factorization
instead, as does Stewart's method in LINPACK [12, 31, 32] (the method used by cholupdate
in MATLAB). To update LLT factorization, the innermost loop requires 5 floatingpoint
operations instead of 4 for the LDLT case. The memory traffic is identical, however, since
the extra term is a scalar which would be held in registers or cache.
3.4 Dynamic supernodal update
In the sparse update, the for j loop in ALGORITHM 1 is replaced by a loop that iterates
over a single disjoint subpath in the elimination tree. Each column L,j along that path is
modified by W, and likewise modifies the matrix W.
The algorithm exploits three cases: supernodes consisting of one, two, or four columns
of L. Suppose the algorithm is at column j. If column j and j + 1 are not in the same
dynamic supernode, the single column j is updated. If columns j through j + 3 all lie within
the same supernode, then a dynamic supernode of four columns is updated. Otherwise, if j
and j + 1 reside in the same dynamic supernode, then a dynamic supernode of two columns
is updated.
For simplicity, this discussion assumes j, j + 1, j + 2, and j + 3 are the successive
columns along an update/downdate subpath. In general, :, i need not be .li i:ent in L to
be considered part of same dynamic supernode (see Condition 3.1).
Consider the update of a single column of L. This method corresponds to the rankk
update given in [8], and a single iteration of the for j loop in ALGORITHM 1. The loop
across the rows i of the column iterate instead over the rows i corresponding to nonzero
entries in column j. This loop is blocked, so that every iteration updates four nonzeros in
column j at a time. The r loop is unchanged.
Let s = Ijl be the number of entries in the jth column of L. The update performs 4sk
floating point operations, and 2sk + 3s memory references. For the matrix W, sk entries
are read, modified, and written back, accounting for 2sk memory references. The numerical
values and nonzero pattern of L,j are read (2s references). The numerical values of L,j are
then written back (another s references). The ratio of floating point operations per memory
reference is given below.
Nonsupernodal update
rankk update:
flops / memory access:
k 1 2 3 4 5 6 7 8
0.80 1.14 1.33 1.45 1.53 1.60 1.65 1.68
Consider an update of columns j and j + 1. In this case, the nonzero pattern of column
j + 1 need not be accessed. First, the r loop computes the a and 7 terms for the jth column,
and modifies the diagonal dj. Next, the single subdiagonal entry Lj+i,j is updated. The r
loop can now proceed for column j + 1, computing the c and 7 terms for that column, and
updating the j + 1st diagonal entry.
The 2column update is given in the algorithm below. It assumes that there are an even
number of nonzero entries below the diagonal in column j + 1. If there are an odd number,
the outermost for loop is proceed by an iteration that handles the first offdiagonal entry i
in column j + 1 of L.
ALGORITHM 2 (Supernodal rankk update/downdate of columns j and j + 1 of L).
for each adjacent pair il, i2 in j \ {j,j + 1} do
copy entries of L into a 2by2 dense matrix:
X1 = li,j
X21 = li,j
X12 = li,j+1
X22 i2,j+1
rankk update of x:
for r =1to k do
gather two entries of W into a 2by1 vector t:
ti = Wi,r
t2 Wi,r
update two entries in column j:
ti = ti WjrX11
t2 = t2 WjrX21
X21 = X21 7riti
X21 X21 ',rlt2
update two entries in column j + 1:
tl = t Wj+1,rX12
t2 t2 Wj+l,rX22
X12 = X12 r12tl
22 = a22 'r2t2
scatter t back into W:
Wir = t
Wi,r t2
copy x back into L:
li,j = xl
li2,j = 21
1i,j+1 = X22
i2,J+1 = 22
Two critical factors impact the performance of any numerical method on a highperformance
computer: (1) the number of floatingpoint operations per memory reference, and (2) how
regular or irregular the memory references are.
Assume W is stored in rov.in ii"r order, in scattered form. That is, it is stored as
a dense rov.in i,,P nbyk matrix. This limits the algorithm to handling updates with a
modest number of columns k. Since W is stored in rov.in i,,i" order, the access of i,'. and
Wj+l,r as r varies is very efficient. These entries will be cached, since there are only 2r of
them. The arrays t and x can be held in registers. The access of the 2by2 block of L is
efficient, for two reasons: First, the terms are accessed only once regardless of k. Second,
since the columns of L are kept sorted (with row indices in strictly ascending order), entries
in rows il and i2 of Lj are .,.li i:ent in memory.
The two columns j and j + 1 of L are not stored in a single dense submatrix, as 161:,
would be in a true supernodal factor. However, in both the nonsupernodal data structure
for L and in the supernodal L, the two entries lij and lij+1 would not be .i.1i ient anyway.
In the former case, it is likely that '1, i, will be nearby. In the latter case, '!. :, will reside
a distance of exactly s = If7 entries away, in the same dense submatrix, where f is the
first column in the supernode. The impact on performance of using a nonsupernodal data
structure instead of a conventional supernodal data structure is thus slight. The nonzero
pattern of j needs to be read only once, not twice, and its access is also independent of k.
Performing a rankk update of two columns of L requires about 8sk floatingpoint oper
ations (s/2 outer iterations, with 16k operations each). Excluding cached variables (t, x, 7,
wjr and Wj+l,r), four entries of L are accessed for each s/2 outer iteration, and two entries
of W are accessed for each k x s/2 inner iteration. The integer pattern of L is read in just
once. The total number of memory reads is s(k + 3), and there are s(k + 2) writes.
Most memory traffic is regular, since W is stored in rov.in i, r order and since L is
stored by column. The only irregular access is the gather of the first ", ,, and it' 1 entries;
for subsequent r, the memory traffic is regular. Likewise, only the access to the first entry in
each column of L is irregular; the subsequent ones are all stride1 accesses. This is essentially
the same as accessing two columns of a dense matrix, which would be the case if L is stored
in supernodal form.
The number of floatingpoint operations per memory reference is thus 8sk/(2sk + 5s).
Our code handles the case for k = 1 to k = 8; larger rank updates are split into updates
where W has 8 columns or less. Since k is limited to a small constant, the r loop is completely
unrolled, and eight different versions of the function are created by the compiler. The flops
per memory access for each possible value of k is given in the table below.
Dynamic supernodal update with 2column supernodes
rankk update: k 1 2 3 4 5 6 7 8
flops / memory access: 1.14 1.78 2.18 2.46 2.67 2.82 2.95 3.05
By comparison, a level2 BLAS operation (nby,, dense matrix times dense vector) has
a flops per memory access ratio of about 2, or about the same as a rank3 2column update.
The memory traffic to update column j and j + 1 separately is almost double, since
sk entries of W must be accessed twice. These entries are accessed only once, above, in a
2column update.
This idea can extended to more than two columns of L since supernodes are usually much
larger. With four columns, an analogous algorithm that updates a 1by4 block of L in the
innermost loop performs 16sk floatingpoint operations, s(k + 5) memory reads, and s(k + 4)
memory writes. The flops per memory access ratio becomes 16sk/(2sk + 9s); these ratios
are shown below for each value of k:
Dynamic supernodal update with 4column supernodes
rankk update: k 1 2 3 4 5 6 7 8
flops / memory access: 1.45 2.46 3.20 3.76 4.21 4.57 4.87 5.12
Its peak ratio is 128/25, or 5.12. In a matrix with ii in': large supernodes, most of the
work will be done in updates of rankk to dynamic supernodes of size 4. Our method thus
uses one of three updates: singlecolumn with 4by1 updates in the innermost loop, 2column
with 2by2 updates of L, and 4column with 1by4 updates of L. We can thus expect a
rank8 update of a matrix with ii in': large supernodes to rival or exceed the performance
of a BLAS matrixvector multiply.
4 Dynamic supernodal solve
The dynamic supernodal update/downdate has a similar structure as the algorithm for solv
ing a sparse lower triangular system Lx = b. The latter is simpler since we only consider the
case where b is a dense vector or matrix. It accesses columns 1 through n, and can exploit
the same dynamic supernode strategy. Our method looks for dynamic supernodes of size
one to three columns, and can handle one to four righthand sides (b can be an nby4 dense
matrix). The solution x is stored in ro ini ri" form.
With four righthand sides, and a dynamic supernode of three columns of L, the triangular
solve for these three columns is given in Algorithm 3. Each inner iteration multiplies a dense
1by3 vector with a 3by4 matrix.
ALGORITHM 3 Supernodal solve (3 columns of L and 4 righthand sides)
solve Lj:j+2,j:j+2Y = Xj:j+2,* for y
Xj:j+2, y
for each i in j \ {j,j +l,j +2}
Xi,, = Xi,, Li,j:j+2Y
Just as in the dynamic supernodal update/downdate, the nonzero patterns of columns
j + 1 and j + 2 are not accessed. The access of the first entry xij is irregular, but access to
entries in subsequent columns is regular, since X we store in royi, i, ,i" form. The matrix
y is only of size 3by4 and can be stored in cache or registers. Thus, each inner iteration
performs 24 floating point operations, reads 3 entries of L, one entry of L and reads/writes
4 entries of X. The flops per memory access ratio is thus 24/12, or 2. A conventional
supernodal solve is similar.
With a single righthand side and 3 columns in a dynamic supernode, each inner iteration
performs 6 floating point operations, reads 3 entries of L, one entry of L and reads/writes
one entry of x. In this case the ratio is 1.
By comparison, a simple lower triangular solver (one righthand side, and no supern
odes) performs 2 floating point operations in its innermost loop, reads one entry of L, and
reads/writes one entry of x. The flops per memory ratio is 2/3.
For a dense L and a single righthand side, most of the work can be done in a matrix
vector multiplication, which has a flops per memory access ratio of about 2.
5 Results
To test our methods, we compared the nonsupernodal multiplerank update in [8] with our
new dynamic supernodal update in CHOLMOD. We also compared our dynamic supern
odal triangular solvers with a simple sparse triangular solver and a conventional supernodal
triangular solver.
These results were obtained on a Intel Pentium 4 (3.2GHz clock frequency, 4 GB RAM
(DDR 333 Mhz), 512 KB cache, an 800 MHz memory bus, the Goto BLAS v1.05 [18], and
running Linux). The theoretical peak performance of the computer is 6.4 GFlops. The gcc
compiler was used (version 3.3.5, with 03 optimization).
5.1 Dynamic supernodal update/downdate
Three matrices were used for the tests presented below.
Matrix name: ND/ND3K
source: 3D discretization of a PDE
n: 9000
A, lower triangular part: 1.6 x 106
ordering method: CHOLMOD nested dissection
ordering time: 2.0 seconds
CHOLMOD symbolic Ci'. ..! I:y factorization: 0.14 seconds
CHOLMOD numeric C'!i .1 I:y factorization: 6.75 seconds
time to convert to nonsupernodal LDLT: 0.25 seconds
IL: 12.6 x 106
Cl'I .. I:y factorization flop count: 22.15 x 109
CHOLMOD C'l. .1. !:y factorization Mflops: 3281
update/downdate rank: 128
# of entries modified in L: 12.1 x 106
CSparse update time: 3.29 seconds
CSparse update flop count: 2.2 x 109
CSparse update Mflops: 667
CHOLMOD update flop count: 1.81 x 109
The update LDLT + CCT was selected so that the nonzero pattern of L did not change,
to compare the results with CSparse. It was computed 16 different rv, in steps of 1 to
8 columns at a time, and both with and without dynamic supernodes. The results in the
table below show that using dynamic supernodes increases the performance of the sparse
Cholesky update/downdate by about 50'. when the rank of C is 5 or greater.
CSparse includes a sparse rank1 ('!. .1 l:y update/downdate method that updates the
LLT factorization. It performs about 25'. more floatingpoint work, but requires the same
memory traffic as an LDLT update. It only applies to problems where the nonzero pattern
of L does not change. Thus, for a rank1 update, its performance in terms of run time
and MFlops can exceed CHOLMOD. Note that the 3.29 seconds in CSparse excludes the
MATLAB interface, which must < .'1 L (a MATLAB function should not modify its inputs).
The second problem was selected to test a changing nonzero pattern. The matrix A
is 6330by22275; 6330 columns were chosen at random, to obtain AF. The matrix S
AFAT + 31 was factorized, and then a rank128 update was selected at random from the
nonsupernodal supernodal
rank time Mflops time Mflops speedup
1 3.95 458 3.32 554 1' ,
2 3.09 585 2.53 715 2
3 2.88 628 2.23 811 2' _'
4 2.64 685 2.02 895 31%
5 2.65 682 1.82 993 '
6 2.46 735 1.65 1096 '
7 2.34 769 1.54 1174 52'.
8 2.24 807 1.52 1189 17'
columns in A but not in AF. This procedure mimics the use of CHOLMOD in a linear
programming solver, LPDASA [10, 9]. The method cannot be compared with CSparse, since
the pattern of L is changing.
Matrix name: QAPLIB/LP_NUG15
source: linear programming problem
n: 6330
ISI, lower triangular part: 129 x 103
ordering method: CHOLMOD nested dissection
ordering time: 0.58 seconds
CHOLMOD symbolic C'!I. 1 l:y factorization: 0.02 seconds
CHOLMOD numeric C'! .1 l:y factorization: 5.89 seconds
time to convert to nonsupernodal LDLT: 0.14 seconds
initial IL: 7.57 x 106
C'! .. :y factorization flop count: 16.4 x 109
C'! .. :y factorization Mflops: 2684
update/downdate rank: 128
# of entries modified in L: 7.56 x 106
final IL: 7.61 x 106
CHOLMOD update flop count: 2.5 x 109
If the update is followed by another one with the same C, then the nonzero pattern of
L remains unchanged. Almost the same amount of floatingpoint work is required. For this
experiment, CSparse takes 4.65 seconds plus an additional 0.15 seconds in its MATLAB
interface. The results for CHOLMOD are listed below.
nonsupernodal supernodal
rank time Mflops time Mflops speedup
1 6.81 372 6.08 416 12'
2 5.14 492 4.40 575 17',
3 4.78 529 3.97 637 21 ',
4 4.47 566 3.69 686 21%
5 4.44 570 3.47 729
6 4.25 596 3.30 767 2' '
7 4.11 616 3.24 781 27',
8 4.05 625 3.24 781
nonsupernodal supernodal
rank time Mflops time Mflops speedup
1 5.32 476 4.57 555 1 '
2 3.69 687 2.94 863
3 3.33 762 2.54 999 31%
4 3.03 837 2.23 1137 :;'
5 3.03 837 2.05 1237 !"'
6 2.85 890 1.92 1321 !"'
7 2.75 922 1.87 1357 17'
8 2.70 939 1.88 1349 II'.
The final matrix we tested was a randomly generated dense symmetric positive definite
matrix of dimension n = 3000. The chol function in MATLAB takes 2.3 seconds to factorize
it, a rate of 3.9 GFlops using LAPACK [1]. If that same matrix is converted into a sparse
matrix, CHOLMOD requires 3.1 seconds to factorize it (2.9 GFlops), excluding the ordering
time but including both symbolic and numeric factorization. A rank1 dense update using
cholupdate in MATLAB (based on LINPACK [12, 31]), takes 0.2 seconds. The same update
using CSparse takes 0.12 seconds (0.03 seconds for the update, and 0.10 seconds to <.. "p the
result back to MATLAB). A rank8 LDLT update in CHOLMOD takes 0.2 seconds, half of
which is the actual update, and the other half is the time to <",'i; the matrix to and from
the MATLAB workspace.
The dense matrixvector multiply (DGEMV) in the Goto BLAS has a peak performance
of 2.7 GFlops for computing y = Ax+ y when A is nbi,, and assumed to already be stored
in cache if it is small enough to fit. DGEMV reaches this peak when n = 160, but drops
when n is about 256 or larger because of cache effects (637 MFlops for n = 500, and 597
MFlops when n = 1000, for example). This range of performance is comparable to the sparse
Cholesky update/downdate, because L itself is normally too large to fit into cache. The n
byk workspace W for the update/downdate will typically be too large to fit into cache.
The peak performance of the rank8 update is 1349, which is 2.3 times the performance of
the dense matrixvector multiply for large n. This is very close to the expected ratio of
2.56. The entries in the matrix L are read and written just once by our method. Additional
performance can only be obtained by an algorithm that keeps more of W in cache than our
method does.
5.2 Dynamic supernodal solve
In this experiment, we compare four different methods for solving LX = B, where B is an
nbyk matrix and L is lower triangular with a nonunit diagonal.
1. a simple method in CSparse (cs_lsolve) that solves Lx = b with a single righthand
side (k = 1),
2. a nonsupernodal method, but with loopunrolling and the ability to handle multiple
righthand sides,
3. the dynamic supernodal method, which is the same as (2) except that it detects and
exploits dynamic supernodes, and
4. a conventional supernodal method, using DTRSV and DGEMV for one righthand side
and DTRSM and DGEMM for multiple righthand sides.
The last two methods are in CHOLMOD. Method (2) is the same as (3) except that the
dynamic supernode detection in (3) was disabled. The ND/ND3K matrix was used, with
the same ordering as discussed in the previous section. For this matrix, CSparse obtains a
performance of 421 Mflops when solving Lx = b. The results for the other three methods
are shown below, for various values of k. The performance for k > 32 is essentially the same
for k = 32 for all three methods.
The dynamic supernodal method is nearly twice as fast
and conventional BLASbased supernodal method for k =
about 50' faster than the nonsupernodal method. This
the nonsupernodal method
For other values of k, it is
the same speedup obtained
by the dynamic update/downdate method discussed in the previous section. The dynamic
solver is slower than the BLASbased supernodal solver only for k > 7. For k > 32 the
BLASbased solver is about SI' faster than the dynamic supernodal solver.
These results are significant in applications such as the LP Dual Active Set Algorithm
that require 1n, I iv solutions to triangular systems. For many linear programming problems,
the update/downdate and triangular solve time dominate the time required by the initial
Cholesky factorization.
6 Summary
We have shown how dynamic supernodes can be exploited to obtain efficient methods for up
dating or downdating a sparse Cholesky factorization and for solving the resulting triangular
systems. The methods are faster than both the nonsupernodal methods and conventional
supernodal methods, except when solving a lower triangular system with 8 or more simulta
neous righthand sides. The CHOLMOD package that includes these methods is described
in a companion paper, [5].
References
[1] E. Anderson, Z. Bai, C. H. Bischof, S. Blackford, J. W. Demmel, J. J. Dongarra, J. Du
Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. C. Sorensen. LAPACK
non dynamic conventional
supernodal supernodal supernodal
k time Mflops time Mflops time Mflops
1 0.058 437 0.041 619 0.048 525
2 0.092 550 0.050 1010 0.098 514
3 0.117 655 0.072 1050 0.107 708
4 0.108 935 0.078 1300 0.113 891
5 0.167 757 0.118 1072 0.123 1023
6 0.193 783 0.128 1188 0.128 1188
7 0.220 803 0.147 1201 0.136 1297
8 0.208 971 0.151 1334 0.143 1414
9 0.263 ,.". 0.193 1175 0.151 1500
10 0.298 849 0.202 1250 0.157 1606
11 0.325 1.! 0.226 1229 0.167 1666
12 0.313 969 0.230 1317 0.173 1748
16 0.410 985 0.303 1335 0.206 1961
20 0.515 980 0.380 1329 0.240 2104
24 0.615 985 0.460 1317 0.270 2244
28 0.720 982 0.535 1321 0.303 2336
32 0.825 979 0.605 1335 0.340 2376
Users' Guide. SIAM, Philadelphia, 3rd edition, 1999.
[2] C. C. Ashcraft and R. G. Grimes. SPOOLES: an objectoriented sparse matrix library.
In Proc. 1999 SIAM Conf. Parallel Processing for Scientific Cor,,i',l.:, Mar. 1999.
[3] C. H. Bischof, C.T. Pan, and P. T. P. Tang. A Cholesky up and downdating algorithm
for systolic and SIMD architectures. SIAM J. Sci. Comrput., 14(3):670676, 1993.
[4] N. A. Carlson. Fast triangular factorization of the square root filter. AIIA Journal,
11:12591265, 1973.
[5] Y. Chen, T. A. Davis, W. W. Hager, and S. R ,i ,nipi;l:.i1 Algorithm 8xx:
CHOLMOD, supernodal sparse ('! .1, l:y factorization and update/downdate. ACI
Trans. Math. Software, 2006. (submitted).
[6] T. A. Davis. Direct Methods for Sparse Linear Si1. m SIAM, Philadelphia, PA, 2006.
[7] T. A. Davis and W. W. Hager. Modifying a sparse ('!..!. l:y factorization. SIAM J.
Matrix Anal. Appl., 20(3):606627, 1999.
[8] T. A. Davis and W. W. Hager. Multiplerank modifications of a sparse C'!. .1 ly fac
torization. SIAM J. Matrix Anal. Appl., 22:9971013, 2001.
[9] T. A. Davis and W. W. Hager. Dual multilevel optimization. Math. P,. ',r,,, page to
appear, 2006.
[10] T. A. Davis and W. W. Hager. A sparse proximal implementation of the LP Dual Active
Set Algorithm. Math. Pi. ',i pages http://dx.doi.org/10.1007/s1010700600170,
2006.
[11] F. Dobrian, G. K. Kumfert, and A. Pothen. The design of sparse direct solvers using
object oriented techniques. In Adv. in Software Tools in Sci. Cor,,lni.:.' pages 89131.
SpringerV(i1 . 2000.
[12] J. J. Dongarra, J. R. Bunch, C. Moler, and G. W. Stewart. LINPACK Users Guide.
SIAM, Philadelphia, 1978.
[13] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of level3 basic linear
algebra subprograms. ACM_ Trans. Math. Software, 16(1):117, 1990.
[14] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of
Fortran basic linear algebra subprograms. ACI[ Trans. Math. Software, 14:1832, 1988.
[15] A. George and J. W. H. Liu. Computer Solution of L,,,., Sparse Positive Definite
S;.ii. m PrenticeHall, Englewood Cliffs, New Jersey, 1981.
[16] J. R. Gilbert, E. G. Ng, and B. W. Peyton. An efficient algorithm to compute row
and column counts for sparse C'. ..! l:y factorization. SIAM J. Matrix Anal. Appl.,
15(4):10751091, 1994.
[17] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. Methods for modifying matrix
factorizations. Math. Comp., 28(126):505535, 1974.
[18] K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. TR
200255, Univ. Texas at Austin, Dept. of Computer Sciences, 2002.
[19] N. I. M. Gould, Y. Hu, and J. A. Scott. A numerical evaluation of sparse direct solvers
for the solution of large sparse, symmetric linear systems of equations. AC'If Trans.
Math. Software, 200x. (to appear).
[20] W. W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221239, 1989.
[21] P. H6non, P. Ramet, and J. Roman. PaStiX: A highperformance parallel direct solver
for sparse symmetric definite systems. Parallel Cornj',,',.: 28(2):301321, 2002.
[22] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra
subprograms for Fortran usage. ACIf Trans. Math. Software, 5:308323, 1979.
[23] J. W. H. Liu. A compact row storage scheme for Cholesky factors using elimination
trees. ACI[f Trans. Math. Software, 12(2):127148, 1986.
[24] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix
Anal. Appl., 11(1):134172, 1990.
[25] J. W. H. Liu, E. G. Ng, and B. W. Peyton. On finding supernodes for sparse matrix
computations. SIAM J. Matrix Anal. Appl., 14(1):242252, 1993.
[26] E. G. Ng and B. W. Peyton. Block sparse Cholesky algorithms on advanced uniprocessor
computers. SIAM J. Sci. Comput., 14(5):10341056, 1993.
[27] C.T. Pan. A modification to the LINPACK downdating algorithm. BIT, 30:707722,
1990.
[28] E. Rothberg and A. Gupta. Efficient sparse matrix factorization on highperformance
workstations Exploiting the memory hierarchy. AC'If Trans. Math. Software,
17(3):313334, 1991.
[29] V. Rotkin and S. Toledo. The design and implementation of a new outofcore sparse
('C!.. l:y factorization method. AC'If Trans. Math. Software, 30(1):1946, 2004.
[30] R. Schreiber. A new implementation of sparse Gaussian elimination. AC' I Trans. Math.
Software, 8(3):256276, 1982.
[31] G. W. Stewart. The effects of rounding error on an algorithm for downdating a Cholesky
factorization. J. Inst. Math. Appl., 23:203213, 1979.
[32] G. W. Stewart. Matrix algorithms, Volume 1: Basic decompositions. SIAM, Philadel
phia, 1998.
