Algorithm 8xx: a concise sparse Cholesky factorization
package
Timothy A. Davis*
January 5, 2004
Abstract
The LDL software package is a set of short, concise routines for factorizing symmet
ric positivedefinite sparse matrices, with some applicability to symmetric indefinite
matrices. Its primary purpose is to illustrate much of the basic theory of sparse ma
trix algorithms in as concise a code as possible, including an elegant new method of
sparse symmetric factorization that computes the factorization rowbyrow but stores
it columnbycolumn. The entire symbolic and numeric factorization consists of a to
tal of only 53 lines of code. The package is written in C, and includes a MATLAB
interface.
Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Al
gebra linear s,1. iii (direct methods), sparse and very l ,r ,g s m, I ,, G.4 \11 i ii iii cs of
Computing]: Mathematical Software ly.., :thmr analysis, eff.' '. ". ;
General terms: Algorithms, Experimentation, Performance.
Keywords: sparse matrices, linear equations, C'!! .1 ly factorization
1 Overview
LDL is a set of short, concise routines that compute the LDLT factorization of a sparse
symmetric matrix A. Their primary purpose is to illustrate much of the basic theory of sparse
matrix algorithms in as compact a code as possible, including an elegant new method of
sparse symmetric factorization (related to [8, 10]). The lower triangular factor L is computed
rowbyrow, in contrast to the conventional columnbycolumn method. This roworiented
algorithm allows for simpler access to the data structure for L during the factorization.
Although it does not achieve the same level of performance as methods based on dense
matrix kernels (such as [11, 12]), its performance is competitive with columnbycolumn
*Dept. of Computer and Information Science and Engineering, Univ. of Florida, Gainesville, FL, USA.
email: davis@cise.ufl.edu. http://www.cise.ufl.edu/~davis. This work was supported by the National Science
Foundation, under grant DMS0203270. Portions of the work were done while on sabbatical at Stanford
University and Lawrence Berkeley National Laboratory (with funding from Stanford University and the
SciDAC program).
methods that do not use dense kernels [3, 4, 5]. The symbolic factorization is typically an
order of magnitude faster than the corresponding method used in MATLAB Version 6.5.
The numeric factorization achieves a peak performance that is about 50'. higher than the
peak performance of chol in MATLAB Version 6.5, which is based on the columnbycolumn
algorithm, and does not use dense matrix kernels [5].
Section 2 gives a brief description of the algorithm used in the symbolic and numeric
factorization. A more detailed tutoriallevel discussion may be found in [13]. You may find
it helpful to have the code in front of you while you read this paper. Some of the details of
the concise implementation of this method are given in Section 3. Sections 4 and 5 give an
overview of how to use the package in MATLAB and in a standalone C program.
2 Algorithm
The underlying numerical algorithm is described below. The kth step solves a lower triangu
lar system of dimension k 1 to compute the kth row of L and the dkk entry of the diagonal
matrix D.
Algorithm 1 (LDLT factorization of a nbyn symmetric matrix A)
for k 1 to n
(step 1) Solve L1:k1,1i:kI Al:kl,k for y
(step 2) Lk,1:k1 (D1, :k1Y
(step 3) Ikk 1
(step 4) dkk = akk Lk,l:kIY
end for
When A and L are sparse, step 1 of Algorithm 1 requires a triangular solve of the form
Lx = b, where all three terms in the equation are sparse. This is the most costly step of the
Algorithm. Steps 2 through 4 are fairly straightforward.
Let X and B refer to the set of indices of nonzero entries in x and b, respectively, in the
lower triangular system Lx = b. To compute x efficiently the nonzero pattern X must be
found first. In the general case when L is arbitrary [7], the nonzero pattern X is the set
of nodes reachable via paths in the graph GL from all nodes in the set B, and where the
graph GL has n nodes and a directed edge (j, i) if and only if lij is nonzero. To compute
the numerical solution to Lx = b by accessing the columns of L one at a time, X can be
traversed in any topological order of the subgraph of GL consisting of nodes in X. That
is, xj must be computed before xi if there is a path from j to i in GL. The natural order
(1, 2,..., n) is one such ord., I, but that requires a costly sort of X. With a graph traversal
and topological sort, the solution of Lx = b can be computed using Algorithm 2 below. The
computation of X and x both take time proportional to the floatingpoint operation count.
Algorithm 2 (Solve Lx = b, where L is lower triangular with unit diagonal)
X ReachGL (B)
x b
for i E X in any topological order
Xi+l:n = i+l:n Li+l:n,ixi
end for
The general result also governs the pattern of y in Algorithm 1. However, in this case L
arises from a sparse Cholesky factorization, and is governed by the elimination tree [9]. A
general graph traversal is not required. In the elimination tree, the parent of node i is the
smallest j > i such that Iji is nonzero. Node i has no parent if column i of L is completely
zero below the diagonal; i is a root of the elimination tree in this case. The nonzero pattern of
x is the union of all paths from any node i (where bi is nonzero) to the root of the elimination
tree. It is referred to here as a tree, but in general it can be a forest.
Rather than a general topological sort of the subgraph of GL consisting nodes reachable
from nodes in B, a simpler tree traversal can be used. First, select any nonzero entry bi and
follow the path from i to the root of tree, marking the nodes along the way. Place this path
on a stack, in order, with i at the top of the stack and the root at the bottom. Repeat for
every other nonzero entry in bi, in arbitrary order, but stop just before reaching a marked
node (the result can be empty if i is already in the stack). The stack now contains X, a
topological ordering of the nonzero pattern of x, which can be used in Algorithm 2 to solve
Lx = b. The time to compute X using an elimination tree traversal is much faster than the
general graph traversal, taking time proportional to the size of X rather than the number of
floatingpoint operations required to compute x.
In the kth step of the factorization, the set X becomes the nonzero pattern of row k of
L. This step requires the elimination tree of L1:k1,1:k1, and must construct the elimination
tree of L1:k,l:k for step k + 1. Recall that the parent of i in the tree is the smallest j such
that i < j and lji / 0. Thus, if any node i already has a parent j, then j will remain the
parent of i in the elimination trees of all other larger leading submatrices of L, and in the
elimination tree of L itself. If Iki / 0 and i does not have a parent in the elimination tree of
L1:kl,l:k1, then the parent of i is k in the elimination tree of L1:k,l:k. Node k becomes the
parent of any node i E X that does not yet have a parent.
Since Algorithm 2 traverses L in column order, L is stored in a conventional sparse column
representation. Each column j is stored as a list of nonzero values and their corresponding
row indices. When row k is computed, the new entries can be placed at the end of each list.
As a byproduct of computing L one row at a time, the columns of L are computed in a sorted
manner. This is a convenient form of the output. MATLAB requires the columns of its sparse
matrices to be sorted, for example. Sorted columns improve the speed of Algorithm 2, since
the memory access pattern is more regular. The conventional columnbycolumn algorithm
[3, 4] does not produce columns of L with sorted row indices.
If the size of each column of L could be incrementally increased as they are computed,
no symbolic preanalysis would be required. The elimination tree, nonzero pattern of L,
and numerical values of L could all be computed in a single step. However, to allow for a
simple static data structure, the above algorithm can be repeated, but without the numerical
computation. All that is required to compute the nonzero pattern of the kth row of L is the
partially constructed elimination tree and the nonzero pattern of the kth column of A. This
is computed in time proportional to the size of this set, using the elimination tree traversal.
Once constructed, the number of nonzeros in each column of L is incremented, for each entry
in X, and then X is discarded. The set X need not be constructed in topological order, so
no stack is required. The run time of the symbolic analysis algorithm is thus proportional
to the number of nonzeros in L, and the memory requirements are just the matrix A and a
few sizen integer arrays. The result of the algorithm is the elimination tree, a count of the
number of nonzeros in each column of L, and the cumulative sum of the column counts.
3 Implementation
Because of its simplicity, the implementation of this algorithm leads to a very short, concise
code. The symbolic analysis routine ldl_symbolic consists of only 20 lines of executable C
code. This includes 5 lines of code to allow for a sparsitypreserving ordering P so that either
A or PAPT can be analyzed, 3 lines of code to compute the cumulative sum of the column
counts, and one line of code to speed up a for loop. An additional line of code allows for a
more general form of the input sparse matrix A. A shorter 9line synopsis of ldl_symbolic
is shown below. The ldl_symbolic routine ignores entries in the lower triangular part of A;
the following synopsis requires the upper triangular part of A only.
The nbyn sparse matrix A is provided in compressed column form as an int array Ap
of length n+1, an int array Ai of length nz, and a double array Ax also of length nz, where
nz is the number of entries in the matrix. The numerical values of entries in column j are
stored in Ax [Ap[j] ... Ap[j+1]11 and the corresponding row indices are in Ai [Ap[j]
.. Ap[j+1] 1]. With Ap[0] = 0, the number of entries in the matrix is nz = Ap[n].
The outputs of the following code are two sizen arrays: Parent holds the elimination
tree, and Lnz holds the counts of the number of entries in each column of L. The sizen array
Flag is used as workspace. None of the output or workspace arrays need to be initialized.
for (k = 0 ; k < n ; k++)
{
Parent [k] = 1 ; /* parent of k is not yet known */
Flag [k] = k ; /* mark node k as visited */
Lnz [k] = 0 ; /* count of nonzeros in column k of L */
for (p = Ap [k] ; p < Ap [k+1] ; p++)
{
/* follow path from i to root of etree, stop at flagged node */
for (i = Ai [p] ; Flag [i] != k ; i = Parent [i])
{
/* find parent of i if not yet determined */
if (Parent [i] == 1) Parent [i] = k ;
Lnz [i]++ ; /* L (k,i) is nonzero */
Flag [i] = k ; /* mark i as visited */
}
}
}
The above code is roughly an order of magnitude faster than the MATLAB equivalent,
below, on a wide range of sparse symmetric matrices, although faster methods are available
[6].
Lnz = symbfact (A) ;
Parent = etree (A) ;
The numeric factorization in Idlnumeric includes this same kernel, except that each
path is placed on a stack that represents X, the nonzero pattern of the kth row of L. Next,
a sparse forward solve is performed using this pattern X, and all of the nonzero entries in
the resulting kth row of L are appended to their respective columns in the data structure of
L. Only a real (double) version is provided. A complex version could easily be generated.
In addition to appearing as a Collected Algorithm of the AC'\ I LDL Version 1.0 is available
at http://www.cise.ufl.edu/research/sparse.
The following table illustrates a few performance results on a Pentium 4 laptop with
1GB of memory. Each matrix is permuted with AMD [1, 2]. MATLAB's chol computes
L columnbycolumn using the conventional method, but as a result the columns are not
sorted. It then sorts the columns by transposing the matrix and returning LT instead. It
thus uses twice the memory of the LDL package, which by design constructs L in place with
sorted columns. Memory limitations severely reduced chol's performance on the largest
matrix in the table.
Matrix nonzeros flops symbolic time (sec) total time (sec) Mflops
in L
(106) (109) LDL MATLAB LDL MATLAB LDL MATLAB
Boeing/bcsstk34 0.04 0.004 0.0004 0.005 0.012 0.036 314 108
Boeing/ct20stif 10.68 7.1 0.22 0.86 35.7 41.2 200 173
Nasa/nasasrb 11.90 4.8 0.14 0.91 26.1 31.0 183 154
Boeing/pwtk 59.84 46.8 0.70 4.13 246.2 1728.4 190 27
4 Using LDL in MATLAB
The simplest way to use LDL is within MATLAB. Once the LDL mexFunction is compiled
and installed, the MATLAB statement [L, D, Parent, fl] = Idl (A) returns the sparse
factorization A = (L+I)*D* (L+I) ', where L is lower triangular, D is a diagonal matrix, and
I is the nbyn identity matrix (LDL does not return the unit diagonal of L). The elimination
tree is returned in Parent. If no zero on the diagonal of D is encountered, fl is the floating
point operation count. Otherwise, D(fl,fl) is the first zero entry encountered. Let
d=fl. The function returns the factorization of A (1:d, 1: d), where rows d+1 to n of L and
D are all zero. If a sparsity preserving permutation P is passed, [L, D, Parent, fl] = Idl
(A,P) operates on A(P,P) without forming it explicitly.
The statement x = Idl (A, [ ], b) is roughly equivalent to x = A\b, when A is sparse,
real, and symmetric. The LDLT factorization of A is performed. If P is provided, x = Idl
(A, P, b) still performs x = A\b, except that A(P,P) is factorized instead.
5 Using LDL in a C program
The Ccallable LDL library consists of nine usercallable routines and one include file.
ldl_symbolic: given the nonzero pattern of a sparse symmetric matrix A and an
optional permutation P, analyzes either A or PAPT, and returns the elimination tree,
the number of nonzeros in each column of L, and the Lp array for the sparse matrix
data structure for L. Duplicate entries are allowed in the columns of A, and the row
indices in each column need not be sorted. Providing a sparsitypreserving ordering
is important for obtaining good performance. A minimum degree ordering (such as
AMD [1, 2]) or a graphpartitioning based ordering are appropriate.
ldl_numeric: given Lp and the elimination tree computed by ldl_symbolic, and an
optional permutation P, returns the numerical factorization of A or PAPT. Duplicate
entries are allowed in the columns of A (any duplicate entries are summed), and the
row indices in each column need not be sorted. The data structure for L is the same
as A, except that no duplicates appear, and each column has sorted row indices.
ldl_lsolve: given the factor L computed by ldl_numeric, solves the linear system
Lx = b, where x and b are full nby1 vectors.
ldl_dsolve: given the factor D computed by ldl_numeric, solves the linear system
Dx b.
ldl_ltsolve: given the factor L computed by ldl_numeric, solves the linear system
LTx b.
ldl_perm: given a vector b and a permutation P, returns x = Pb.
ldl_permt: given a vector b and a permutation P, returns x = PTb.
ldl_valid_perm: Except for checking if the diagonal of D is zero, none of the above
routines check their inputs for errors. This routine checks the validity of a permutation
P.
ldl_validmatrix: checks if a matrix A is valid as input to ldl_symbolic and
ldl_numeric.
Note that the primary input to the ldl_symbolic and ldlnumeric is the sparse matrix
A. It is provided in columnoriented form, and only the upper triangular part is accessed.
This is slightly different than the primary output: the matrix L, which is lower triangular in
columnoriented form. If you wish to factorize a symmetric matrix A for which only the lower
triangular part is supplied, you would need to transpose A before passing it ldl_symbolic
and ldl_numeric.
The follow program illustrates the basic usage of the LDL routines. It analyzes and
factorizes the sparse symmetric positivedefinite matrix
1.7 0 0 0 0 0 0 0 .13 0
0 1. 0 0 .02 0 0 0 0 .01
0 0 1.5 0 0 0 0 0 0 0
0 0 0 1.1 0 0 0 0 0 0
0 .02 0 0 2.6 0 .16 .09 .52 .53
A=
0 0 0 0 0 1.2 0 0 0 0
0 0 0 0 .16 0 1.3 0 0 .56
0 0 0 0 .09 0 0 1.6 .11 0
.13 0 0 0 .52 0 0 .11 1.4 0
0 .01 0 0 .53 0 .56 0 0 3.1
and then solves a system Ax = b whose true solution is xi = i/10. Note that Li and Lx are
statically allocated. Normally they would be allocated after their size, Lp [n], is determined
by Idl_symbolic. This simple example does not check the return value of Idl_numeric,
which is n if the factorization is successful, or less than n otherwise.
#include
#include "ldl.h"
#define N 10 /* A is 10by10 */
#define ANZ 19 /* # of nonzeros on diagonal and upper triangular part of A */
#define LNZ 13 /* # of nonzeros below the diagonal of L */
int main (int argc, int **argv)
{
int Ap [N+1] = {0, 1, 2, 3, 4, 6, 7, 9, 11, 15, ANZ},
Ai [ANZ] = {0, 1, 2, 3, 1,4, 5, 4,6, 4,7, 0,4,7,8, 1,4,6,9 } ;
double Ax [ANZ] = {1.7, 1., 1.5, 1.1, .02,2.6, 1.2, .16,1.3, .09,1.6,
.13, .52, .11,1.4, .01, .53, .56,3.1},
b [N] = {.287, .22, .45, .44, 2.486, .72, 1.55, 1.424, 1.621, 3.759};
double Lx [LNZ], D [N], Y [N] ;
int Li [LNZ], Lp [N+1], Parent [N], Lnz [N], Flag [N], Pattern [N], i
ldl_symbolic (N, Ap, Ai, Lp, Parent, Lnz, Flag, (int *) NULL, (int *) NULL);
(void) ldl_numeric (N, Ap, Ai, Ax, Lp, Parent, Lnz, Li, Lx, D, Y, Pattern,
Flag, (int *) NULL, (int *) NULL) ;
ldl_lsolve (N, b, Lp, Li, Lx) ;
ldl_dsolve (N, b, D) ;
ldl_ltsolve (N, b, Lp, Li, Lx)
for (i = 0 ; i < N ; i++) printf ("x [%d] = %g\n", i, b [i]) ;
}
More example programs are included with the LDL package.
References
[1] P. R. Amestoy, T. A. Davis, and I. S. Duff. An approximate minimum degree ordering
algorithm. SIAM J. Matrix Anal. Applic., 17(4):886905, 1996.
[2] P. R. Amestoy, T. A. Davis, and I. S. Duff. Algorithm 8xx: AMD, an approximate min
imum degree ordering algorithm. ACI[ Trans. Math. Softw., 2003 (under submission).
Also TR03010 at www.cise.ufl.edu.
[3] A. George and J. W. H. Liu. The design of a user interface for a sparse matrix package.
ACM_ Trans. Math. Softw., 5(2):139 162, Jun. 1979.
[4] A. George and J. W. H. Liu. Computer Solution of L,,., Sparse Positive Definite
S;,1iii, Englewood Cliffs, New Jersey: PrenticeHall, 1981.
[5] J. R. Gilbert, C. Moler, and R. Schreiber. Sparse matrices in MATLAB: design and
implementation. SIAM J. Matrix Anal. Applic., 13(1):333356, 1992.
[6] J. R. Gilbert, E. G. Ng, and B. W. Peyton. An efficient algorithm to compute row
and column counts for sparse ('!i.1. l:y factorization. SIAM J. Matrix Anal. Applic.,
15(4):10751091, 1994.
[7] J. R. Gilbert and T. Peierls. Sparse partial pivoting in time proportional to arithmetic
operations. SIAM J. Sci. Statist. Comput., 9:862874, 1988.
[8] J. W. H. Liu. A compact row storage scheme for C'!i!.1 l:y factors using elimination
trees. ACM Trans. Math. Softw., 12(2):127148, Jun. 1986.
[9] J. W. H. Liu. The role of elimination trees in sparse factorization. SIAM J. Matrix
Anal. Applic., 11(1):134172, 1990.
[10] J. W. H. Liu. A generalized envelope method for sparse factorization by rows. ACI
Trans. Math. Softw., 17(1):112129, 1991.
[11] E. G. Ng and B. W. Peyton. A supernodal Cholesky factorization algorithm for shared
memory multiprocessors. SIAM J. Sci. Comput., 14:761769, 1993.
[12] E. Rothberg and A. Gupta. Efficient sparse matrix factorization on highperformance
workstations Exploiting the memory hierarchy. ACM Trans. Math. Softw., 17(3):313
334, 1991.
[13] G. W. Stewart. Building an oldfashioned sparse solver. Technical report, Univ. Mary
land (www.umd.cs.edu/~stewart), 2003.
