
KLU-A HIGH PERFORMANCE SPARSE LINEAR SOLVER
FOR CIRCUIT SIMULATION PROBLEMS















By

EKANATHAN PALAMADAI NATARAJAN


A THESIS PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE

UNIVERSITY OF FLORIDA


2005

































Copyright 2005

by

Ekanathan Palamadai Natarajan
















I dedicate this work to my mother Savitri who has been a source of inspiration

and support to me.















ACKNOWLEDGMENTS

I would like to thank Dr. Timothy Davis, my advisor, for introducing me to the area of sparse matrix algorithms and linear solvers. I started with only a background in numerical analysis and algorithms a year and a half ago. The insights and knowledge I have gained since then in the area, and in implementing a sparse linear solver like KLU, would not have been possible but for Dr. Davis' guidance and help. I thank him for giving me an opportunity to work on KLU. I would like to thank Dr. Jose Fortes and Dr. Arunava Banerjee for their support and help and for serving on my committee.

I would like to thank the CISE administrative staff for helping me at different times during my master's research work.















TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 THEORY: SPARSE LU

    2.1 Dense LU
    2.2 Sparse LU
    2.3 Left Looking Gaussian Elimination
    2.4 Gilbert-Peierls' Algorithm
        2.4.1 Symbolic Analysis
        2.4.2 Numerical Factorization
    2.5 Maximum Transversal
    2.6 Block Triangular Form
    2.7 Symmetric Pruning
    2.8 Ordering
    2.9 Pivoting
    2.10 Scaling
    2.11 Growth Factor
    2.12 Condition Number
    2.13 Depth First Search
    2.14 Memory Fragmentation
    2.15 Complex Number Support
    2.16 Parallelism in KLU

3 CIRCUIT SIMULATION: APPLICATION OF KLU

    3.1 Characteristics of Circuit Matrices
    3.2 Linear Systems in Circuit Simulation
    3.3 Performance Benchmarks
    3.4 Analyses and Findings
    3.5 Alternate Ordering Experiments
    3.6 Experiments with UF Sparse Matrix Collection
        3.6.1 Different Ordering Schemes in KLU
        3.6.2 Timing Different Phases in KLU
        3.6.3 Ordering Quality among KLU, UMFPACK and Gilbert-Peierls
        3.6.4 Performance Comparison between KLU and UMFPACK

4 USER GUIDE FOR KLU

    4.1 The Primary KLU Structures
        4.1.1 klu_common
        4.1.2 klu_symbolic
        4.1.3 klu_numeric
    4.2 KLU Routines
        4.2.1 klu_analyze
        4.2.2 klu_analyze_given
        4.2.3 klu_*factor
        4.2.4 klu_*solve
        4.2.5 klu_*tsolve
        4.2.6 klu_*refactor
        4.2.7 klu_defaults
        4.2.8 klu_*rec_pivot_growth
        4.2.9 klu_*estimate_cond_number
        4.2.10 klu_free_symbolic
        4.2.11 klu_free_numeric

REFERENCES

BIOGRAPHICAL SKETCH















LIST OF TABLES

3-1 Comparison between KLU and SuperLU on overall time and fill-in

3-2 Comparison between KLU and SuperLU on factor time and solve time

3-3 Ordering results using BTF+AMD in KLU on circuit matrices

3-4 Comparison of ordering results produced by BTF+AMD, AMD, MMD

3-5 Fill-in with four different schemes in KLU

3-6 Time in seconds, spent in different phases in KLU

3-7 Fill-in among KLU, UMFPACK and Gilbert-Peierls

3-8 Performance comparison between KLU and UMFPACK















LIST OF FIGURES

2-1 Nonzero pattern of x when solving Lx = b

2-2 A matrix permuted to BTF form

2-3 A symmetric pruning scenario

2-4 A symmetric matrix and its graph representation

2-5 The matrix and its graph representation after one step of Gaussian elimination

2-6 A doubly bordered block diagonal matrix and its corresponding vertex separator tree















Abstract of Thesis Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Master of Science

KLU-A HIGH PERFORMANCE SPARSE LINEAR SOLVER
FOR CIRCUIT SIMULATION PROBLEMS

By

Ekanathan Palamadai Natarajan

August 2005

Chair: Dr. Timothy A. Davis
Major Department: Computer and Information Science and Engineering

The thesis work focuses on KLU, a sparse high performance linear solver for

circuit simulation matrices. During the circuit simulation process, one of the key

steps is to solve sparse systems of linear equations of very high order. KLU targets

solving these systems with efficient ordering mechanisms and high performance

factorization and solve algorithms. KLU uses a hybrid ordering strategy that

comprises an unsymmetric permutation to ensure a zero free diagonal, a symmetric

permutation to block upper triangular form and a fill reducing ordering such as

approximate minimum degree.

The factorization is based on Gilbert-Peierls' left-looking algorithm with

partial pivoting. KLU also includes features like symmetric pruning to cut down symbolic analysis costs. It can solve up to four right hand sides in a single solve step. In addition, it offers transpose and conjugate transpose solve capabilities and important diagnostic features to estimate the reciprocal pivot growth of the factorization and the condition number of the input matrix.

The algorithm is implemented in the C language with MATLAB interfaces as

well. The MATLAB interfaces enable a user to invoke KLU routines from within









the MATLAB environment. The implementation was tested on circuit matrices and the results evaluated. KLU achieves superior fill-in quality with its hybrid ordering strategy and a good performance speed-up when compared with existing sparse linear solvers for circuit problems. The thesis also highlights the ongoing work on exploiting parallelism in KLU.















CHAPTER 1
INTRODUCTION

Sparse is beautiful. Solving systems of linear equations of the form Ax = b

is a fundamental and important area of high performance computing. The matrix

A is called the coefficient matrix and b the right hand side vector. The vector x is

the solution to the equation. There are a number of methods available for solving

such systems. Some of the popular ones are Gaussian elimination, QR factorization

using Householder transformations or Givens rotations and Cholesky factorization.

Gaussian elimination with partial pivoting is the most widely used algorithm for

solving linear systems because of its stability and better time complexity. Cholesky

can be used only when A is symmetric positive definite.

Some systems that are solved comprise a dense coefficient matrix A. By dense, we mean most of the elements in A are nonzero. There are high performance subroutines such as the BLAS [1, 2, 3, 4, 5] that can maximize the flop rate for such dense matrices. The interesting systems are those where the coefficient matrix A happens to be sparse. By sparse, we mean the matrix has few nonzero entries (hereafter referred to simply as 'nonzeros'). The adjective 'few' is not well-defined, as we will see in chapter two. When matrices are sparse, we need effective ways to store them in memory, since we want to avoid storing the zeros of the matrix. Storing only the nonzeros of the matrix has consequences for the factorization algorithm as well: one typical example is that we do not know beforehand where nonzeros will appear in the L and U factors when we factorize the matrix. While we avoid storing the zeros, we also want to achieve good time complexity when solving sparse systems. If the time spent to solve sparse systems remains the same as for dense systems, we have not done any better.

KLU stands for Clark Kent LU, since it is based on Gilbert-Peierls' algorithm,

a non-supernodal algorithm, which is the predecessor to SuperLU, a supernodal

algorithm. KLU is a sparse high performance linear solver that employs hybrid

ordering mechanisms and elegant factorization and solve algorithms. It achieves a high quality (low) fill-in rate and beats many existing solvers in run time when used for matrices arising in circuit simulation.

There are several flavours of Gaussian elimination. A left-looking Gaussian

elimination algorithm factorizes the matrix left to right computing columns of L

and U. A right-looking version factorizes the matrix from top-left to bottom-right

computing column of L and row of U. Both have their advantages and disad-

vantages. KLU uses a left looking algorithm called Gilbert-Peierls' algorithm.

Gilbert-Peierls' comprises a graph theoretical symbolic analysis phase that iden-

tifies the nonzero pattern of each column of L and U factors and a left-looking

numerical factorization phase with partial pivoting that calculates the numerical

values of the factors. KLU uses Symmetric Pruning to cut down symbolic analysis

cost. We shall look in detail on these features in chapter two.

A critical issue in linear solvers is Ordering. Ordering means permuting the

rows and columns of a matrix, so that the fill-in in the L and U factors is reduced

to a minimum. A fill-in is defined as a nonzero appearing in either of the matrices

L or U, while the element in the corresponding position in A is a zero. Lij or Uij

is a fill-in if Aij is a zero. Fill-in has obvious consequences in memory in that

the factorization algorithm could create dense L and U factors that can exhaust

available memory. A good ordering algorithm yields a low fill-in in the factors.

Finding the ordering that gives minimal fill-in is an NP-complete problem, so ordering algorithms use heuristics. KLU accommodates multiple ordering schemes like AMD, COLAMD and any user generated permutation.

There are other orderings for different purposes. For example, one could order

a matrix to ensure that it has no zeros on the diagonal. Otherwise, the Gaussian

elimination would fail. Another ordering scheme could reduce the factorization

work. KLU employs two such orderings, namely an unsymmetric ordering that ensures a zero free diagonal and a symmetric ordering that permutes the matrix into block upper triangular form (BTF), which restricts factorization to only the diagonal blocks.

One of the key steps in the circuit simulation process is solving sparse linear

systems. These systems originate from solving large systems of nonlinear equations using Newton's method and integrating large stiff systems of ordinary differential equations. These systems are of very high dimension, and a considerable fraction of simulation time is spent on solving them. Often the solve phase tends to be a bottleneck in the simulation process. Hence high performance sparse solvers that optimize memory usage and solution time are critical components of circuit simulation software. Some of the popular solvers in use in circuit simulation tools are Sparse1.3 and SuperLU. Sparse1.3 is used in the SPICE circuit simulation package, and SuperLU uses a supernodal factorization algorithm. Experimental results indicate that KLU is 1000 times faster than Sparse1.3 and 1.5 to 3 times faster than SuperLU.

Circuit matrices show some unique properties. They have a nearly zero free

diagonal. They have a roughly symmetric pattern but have unsymmetric values.

They are highly sparse and often have a few dense rows and columns. These dense

rows/columns arise from voltage sources and current sources in the circuit. Circuit

matrices show good amenability to BTF ordering. Though the nonzero pattern of the original matrix is unsymmetric, the nonzero patterns of the blocks produced by BTF ordering tend to be symmetric. Since circuit matrices are extremely sparse, sparse matrix algorithms such as SuperLU [6] and UMFPACK [7, 8] that employ dense BLAS kernels are often inappropriate. Another unique characteristic of circuit matrices is that employing a good ordering strategy keeps the L and U factors sparse. However, as we will see in the experimental results, typical ordering strategies can lead to high fill-in.

In circuit simulation problems, typically the circuit matrix template is gen-

erated once and the numerical values of the matrix alone change. In other words,

the nonzero pattern of the matrix does not change. This implies that we need

to order and factor the matrix once to generate the ordering permutations and

the nonzero patterns of L and U factors. For all subsequent matrices, we can use

the same information and need only to recompute the numerical values of the

L and U factors. This process of skipping analysis and factor phases is called

refactorization. Refactorization leads to a significant reduction in run time.

Because of the unique characteristics of circuit matrices and their amenability

to BTF ordering, KLU is a method well-suited to circuit simulation problems. KLU

has been implemented in the C language. It offers a set of APIs for the analysis phase, factor phase, solve phase and refactor phase. It also offers the ability to solve up to four right hand sides in a single solve step. In addition, it offers transpose solve and conjugate transpose solve features, and diagnostic tools like a pivot growth estimator and a condition number estimator. It also offers a MATLAB interface for the API so that KLU can be used from within the MATLAB environment.















CHAPTER 2
THEORY: SPARSE LU

2.1 Dense LU

Consider the problem of solving the linear system of n equations in n un-

knowns:


\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= b_n
\end{aligned}
\tag{2-1}
\]

or, in matrix notation,

\[
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}
\tag{2-2}
\]

that is,

\[
Ax = b
\]

where A = (a_{ij}), x = (x_1, x_2, ..., x_n)^T and b = (b_1, ..., b_n)^T. A well-known approach

to solving this equation is Gaussian elimination. Gaussian elimination consists of a

series of eliminations of unknowns xi from the original system. Let us briefly review

the elimination process. In the first step, the first equation of 2-1 is multiplied

by the multipliers -a_{21}/a_{11}, -a_{31}/a_{11}, ..., -a_{n1}/a_{11} and added to the second through nth equations of 2-1, respectively. This eliminates x_1 from the second through the nth equations. After










the first step, 2-2 becomes

\[
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
0 & a_{22}^{(1)} & \cdots & a_{2n}^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
0 & a_{n2}^{(1)} & \cdots & a_{nn}^{(1)}
\end{bmatrix}
\tag{2-3}
\]

where a_{22}^{(1)} = a_{22} - (a_{21}/a_{11}) a_{12}, a_{32}^{(1)} = a_{32} - (a_{31}/a_{11}) a_{12} and so on. In the second

step, x2 will be eliminated by a similar process of computing multipliers and adding

the multiplied second equation with the third through nth equations. After n-1

eliminations, the matrix A is transformed to an upper triangular matrix U. The

upper triangular system is then solved by back-substitution.

An equivalent interpretation of this elimination process is that we have factorized A into a lower triangular matrix L and an upper triangular matrix U, where

\[
L =
\begin{bmatrix}
1 & & & \\
a_{21}/a_{11} & 1 & & \\
a_{31}/a_{11} & a_{32}^{(1)}/a_{22}^{(1)} & 1 & \\
\vdots & \vdots & & \ddots \\
a_{n1}/a_{11} & a_{n2}^{(1)}/a_{22}^{(1)} & \cdots & 1
\end{bmatrix}
\tag{2-4}
\]

and

\[
U =
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
0 & a_{22}^{(1)} & a_{23}^{(1)} & \cdots & a_{2n}^{(1)} \\
0 & 0 & a_{33}^{(2)} & \cdots & a_{3n}^{(2)} \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & a_{nn}^{(n-1)}
\end{bmatrix}
\tag{2-5}
\]

Column k of the lower triangular matrix L consists of the multipliers obtained during step k of the elimination process, with their signs negated.









Mathematically, Ax = b can be rewritten as


\[
(LU)x = b
\]
\[
L(Ux) = b
\tag{2-6}
\]

Substituting Ux = y in 2-6, we have

\[
Ly = b
\tag{2-7}
\]
\[
Ux = y
\tag{2-8}
\]

By solving these two triangular systems, we find the solution to the original system.

The reason for triangularizing the system is to avoid finding the inverse of the original coefficient matrix A. Finding the inverse is at least three times as expensive as Gaussian elimination in the dense case and often leads to more inaccuracies.
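To make the two triangular solves concrete, here is a short C sketch of forward substitution for 2-7 and back substitution for 2-8 in the dense case (an illustration only; the function names and the row-major dense layout are assumptions, not KLU code):

#include <stddef.h>

/* Forward substitution: solve L*y = b where L is dense, unit lower
   triangular (diagonal entries are 1), stored row-major in l[n*n]. */
static void forward_subst(size_t n, const double *l, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = b[i];
        for (size_t j = 0; j < i; j++)
            sum -= l[i*n + j] * y[j];   /* subtract already-computed terms */
        y[i] = sum;                     /* unit diagonal: no division */
    }
}

/* Back substitution: solve U*x = y where U is dense upper triangular. */
static void back_subst(size_t n, const double *u, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        double sum = y[i];
        for (size_t j = i + 1; j < n; j++)
            sum -= u[i*n + j] * x[j];
        x[i] = sum / u[i*n + i];        /* divide by the diagonal pivot */
    }
}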

2.2 Sparse LU

A sparse matrix is defined as one that has few nonzeros in it. The quan-

tification of the adjective 'few' is not specified. The decision as to what kind of

algorithm to use (sparse or dense) depends on the fill-in properties of the matrices.

However, sparse matrices typically have O(n) nonzero entries. Dense matrices are typically represented by a two dimensional array. The zeros of a sparse matrix should not be stored if we want to save memory. This fact makes a two dimensional array unsuitable for representing sparse matrices. Sparse matrices are represented with a different kind of data structure. They can be represented in two different data structures, viz. column compressed form or row compressed form.

A column compressed form consists of three vectors Ap, Ai and Ax. Ap consists of column pointers. It is of length n+1. The start of column k of the input matrix is given by Ap[k].

Ai consists of row indices of the elements. This is a zero based data structure with row indices in the interval [0,n). Ax consists of the actual numerical values of the elements.

Thus the elements of column k of the matrix are held in Ax[Ap[k]...Ap[k+1]). The corresponding row indices are held in Ai[Ap[k]...Ap[k+1]).

Equivalently, a row compressed format stores a row pointer vector Ap, a

column indices vector Ai and a value vector Ax. For example, the matrix

\[
\begin{bmatrix} 5 & 0 & 0 \\ 4 & 2 & 0 \\ 3 & 1 & 8 \end{bmatrix}
\]

when represented in column compressed format will be

Ap: 0 3 5 6
Ai: 0 1 2 1 2 2
Ax: 5 4 3 2 1 8

and when represented in row compressed format will be

Ap: 0 1 3 6
Ai: 0 0 1 0 1 2
Ax: 5 4 2 3 1 8

Let nnz represent the number of nonzeros in a matrix of dimension n × n. A dense matrix representation needs O(n^2) memory to represent the matrix. A sparse matrix representation reduces this to O(n + nnz), and typically nnz ≪ n^2.
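As an illustration of the column compressed form just described (a minimal sketch, not KLU's actual declarations), the following C program builds the 3-by-3 example above and walks its columns:

#include <stdio.h>

int main(void)
{
    /* The 3-by-3 example matrix in column compressed (CSC) form. */
    int    Ap[] = {0, 3, 5, 6};          /* column pointers, length n+1 */
    int    Ai[] = {0, 1, 2, 1, 2, 2};    /* row indices of nonzeros     */
    double Ax[] = {5, 4, 3, 2, 1, 8};    /* numerical values            */
    int n = 3;

    /* Print the nonzeros one column at a time. */
    for (int k = 0; k < n; k++) {
        for (int p = Ap[k]; p < Ap[k+1]; p++)
            printf("A(%d,%d) = %g\n", Ai[p], k, Ax[p]);
    }
    return 0;
}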

2.3 Left Looking Gaussian Elimination

Let us derive a left looking version of Gaussian elimination. Let an input

matrix A of order n × n be represented as a product of two triangular matrices L and U:

\[
\begin{bmatrix}
A_{11} & a_{12} & A_{13} \\
a_{21} & a_{22} & a_{23} \\
A_{31} & a_{32} & A_{33}
\end{bmatrix}
=
\begin{bmatrix}
L_{11} & 0 & 0 \\
l_{21} & 1 & 0 \\
L_{31} & l_{32} & L_{33}
\end{bmatrix}
\begin{bmatrix}
U_{11} & u_{12} & U_{13} \\
0 & u_{22} & u_{23} \\
0 & 0 & U_{33}
\end{bmatrix}
\tag{2-9}
\]

where the uppercase quantities are blocks, the lowercase vectors are rows or columns, and a22, u22 are scalars. The dimensions of the different elements in the matrices are as follows:

A11, L11, U11 are k × k blocks
a12, u12 are k × 1 vectors
A13, U13 are k × (n-(k+1)) blocks
a21, l21 are 1 × k row vectors
a22, u22 are scalars
a23, u23 are 1 × (n-(k+1)) row vectors
A31, L31 are (n-(k+1)) × k blocks
a32, l32 are (n-(k+1)) × 1 vectors
A33, L33, U33 are (n-(k+1)) × (n-(k+1)) blocks.

From (2-9), we can arrive at the following set of equations:

L11 U11 = A11                            (2-10)
L11 u12 = a12                            (2-11)
L11 U13 = A13                            (2-12)
l21 U11 = a21                            (2-13)
l21 u12 + u22 = a22                      (2-14)
l21 U13 + u23 = a23                      (2-15)
L31 U11 = A31                            (2-16)
L31 u12 + l32 u22 = a32                  (2-17)
L31 U13 + l32 u23 + L33 U33 = A33        (2-18)









From (2-11), (2-14) and (2-17), we can compute the 2nd column of L and U, assuming we have already computed L11, l21 and L31. We first solve the lower triangular system (2-11) for u12. Then, we solve for u22 using (2-14) by computing the sparse dot product

u22 = a22 - l21 u12                      (2-19)

Finally we solve (2-17) for l32 as

l32 = (a32 - L31 u12) / u22              (2-20)

This step of computing the 2nd column of L and U can be considered equivalent to solving a lower triangular system as follows:

\[
\begin{bmatrix}
L_{11} & 0 & 0 \\
l_{21} & 1 & 0 \\
L_{31} & 0 & I
\end{bmatrix}
\begin{bmatrix}
u_{12} \\ u_{22} \\ l_{32}\,u_{22}
\end{bmatrix}
=
\begin{bmatrix}
a_{12} \\ a_{22} \\ a_{32}
\end{bmatrix}
\tag{2-21}
\]

This mechanism of computing column k of L and U by solving a lower

triangular system L x = b is the key step in a left-looking factorization algorithm.

As we will see later, Gilbert-Peierls' algorithm revolves around solving this lower

triangular system. The algorithm is called a left-looking algorithm since column

k of L and U are computed by using the already computed columns 1...k-1 of L.

In other words, to compute column k of L and U, one looks only at the already

computed columns 1...k-1 in L, that are to the left of the currently computed

column k.

2.4 Gilbert-Peierls' Algorithm

Gilbert and Peierls [9] proposed an algorithm for Gaussian elimination with partial pivoting that factors an arbitrary nonsingular sparse matrix A as PA = LU in time proportional to the flop count of the elimination. If flops(LU) is the number









of nonzero multiplications performed when multiplying two matrices L and U, then

Gaussian elimination uses exactly flops(LU) multiplications and divisions to factor

a matrix A into L and U. Given an input matrix and assuming no partial pivoting,

it is possible to predict the nonzero pattern of its factors. However with partial

pivoting, it is not possible to predict the exact nonzero pattern of the factors

before hand. Finding an upper bound is possible, but the bound can be very loose

[10]. Note that computing the nonzero pattern of L and U is a necessary part of

Gaussian elimination involving sparse matrices since we do not use two dimensional

arrays for representing them but sparse data structures. Gilbert-Peierls' algorithm

aims at computing the nonzero pattern of the factors and the numerical values in a

total time proportional to O(flops(LU)).

It consists of two stages for determining every column of L and U. The first

stage is a symbolic analysis stage that computes the nonzero pattern of the column

k of the factors. The second stage is the numerical factorization stage that involves

solving the lower triangular system Lx = b, that we discussed in the section above.

2.4.1 Symbolic Analysis

A sparse Gaussian elimination algorithm with partial pivoting cannot know

the exact nonzero structure of the factors ahead of all numerical computation,

simply because partial pivoting at column k can introduce new nonzeros in columns

k+1 ... n. Solving Lx = b must be done in time proportional to the number of flops

performed. Consider a simple column-oriented algorithm in MATLAB notation for

solving Lx = b as follows:

x = b
for j = 1:n
    if x(j) ~= 0
        x(j+1:n) = x(j+1:n) - L(j+1:n,j) * x(j)
    end
end

The above algorithm takes time O(n + number of flops performed). The O(n) term looks harmless, but Lx = b is solved n times in the factorization of A = LU, leading to an unacceptable O(n^2) term in the work to factorize A into L times U. To remove the O(n) term, we must replace the algorithm with

x = b
for each j for which x(j) ~= 0
    x(j+1:n) = x(j+1:n) - L(j+1:n,j) * x(j)
end

This would reduce the O(n) term to O(η(b)), where η(b) is the number of nonzeros in b. Note that b is a column of the input matrix A. Thus to solve Lx = b, we need to know the nonzero pattern of x before we compute x itself. Symbolic analysis helps us determine the nonzero pattern of x.

Let us say we are computing column k of L and U. Let G = G(Lk) be the directed graph of L with k-1 vertices representing the already computed k-1 columns. G(Lk) has an edge j → i iff lij ≠ 0. Let β = {i | bi ≠ 0} and X = {i | xi ≠ 0}. Then X is given by

X = Reach_{G(Lk)}(β)                     (2-22)

The nonzero pattern X is computed by determining the vertices that are reachable from the vertices of the set β. The reachability problem can be solved using a classical depth first search in G(Lk) from the vertices of the set β. If bj ≠ 0, then xj ≠ 0. In addition, if lij ≠ 0 and xj ≠ 0, then xi ≠ 0 even if bi = 0. This is because the term lij xj contributes a nonzero when we solve for xi. During the depth first search, Gilbert-Peierls' algorithm computes a topological order of X. This topological ordering is useful for eliminating unknowns in the numerical factorization step.











Figure 2-1: Nonzero pattern of x when solving Lx = b

The row indices vector Li of columns 1 ... k-1 of L represents the adjacency list of the graph G(Lk). The depth first search takes time proportional to the number of vertices examined plus the number of edges traversed.

2.4.2 Numerical Factorization

Numerical factorization consists of solving the system (2-21) for each col-

umn k of L and U. Normally we would solve for the unknowns in (2-21) in the

increasing order of the row index. The row indices/nonzero pattern computed by

depth first search are not necessarily in increasing order. Sorting the indices would

increase the time complexity above our O(flops(LU)) goal. However, the require-

ment of eliminating unknowns in increasing order can be relaxed to a topological

order of the row indices. An unknown xi can be computed, once all the unknowns

xj of which it is dependent on are computed. This is obvious when we write the

equations comprising a lower triangular solve. Theoretically, the unknowns can be

solved in any topological order. The depth first search algorithm gives one such

topological order which is sufficient for our case. In our example, the depth first

search would have finished exploring vertex i before it finishes exploring vertices j.









Hence a topological order given by depth first search would have j appearing before

i. This is exactly what we need.

Gilbert-Peierls' algorithm starts with an identity L matrix. The entire left

looking algorithm can be summarized in MATLAB notation as follows:

L = I
for k = 1:n
    x = L \ A(:,k)
    % (partial pivoting on x can be done here)
    U(1:k,k) = x(1:k)
    L(k:n,k) = x(k:n) / U(k,k)
end

where x = L \ b denotes the solution of a sparse lower triangular system. In this case, b is the kth column of A. The total time complexity of Gilbert-Peierls' algorithm is O(η(A) + flops(LU)), where η(A) is the number of nonzeros in the matrix A and flops(LU) is the flop count of the product of the matrices L and U. Typically flops(LU) dominates the complexity, hence the claim of factorizing in time proportional to the flop count.
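To make the two phases concrete, here is a hedged C sketch of the numerical phase (illustrative only; KLU's actual internals differ in naming and details): solving Lx = b for one column, visiting the pattern computed by the symbolic phase in topological order.

/* Sketch: numeric phase for one column, given the nonzero pattern of x
 * (xi[0..nx-1]) in topological order from the symbolic DFS.
 * L is unit lower triangular in column compressed form (Lp, Li, Lx);
 * x is a dense work vector already scattered with the entries of b.
 * Names and conventions here are assumptions for illustration. */
static void lsolve_column(const int *Lp, const int *Li, const double *Lx,
                          const int *xi, int nx, double *x)
{
    for (int px = 0; px < nx; px++) {
        int j = xi[px];                 /* next unknown in topological order */
        double xj = x[j];
        /* skip the unit diagonal, then scatter the update into x */
        for (int p = Lp[j] + 1; p < Lp[j+1]; p++)
            x[Li[p]] -= Lx[p] * xj;
    }
}

Because only the entries of the precomputed pattern are visited, the work is proportional to the flops performed, matching the complexity goal stated above.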

2.5 Maximum Transversal

Duff [11, 12] proposed an algorithm for determining the maximum transversal of a directed graph. The purpose of the algorithm is to find a row permutation that minimizes the zeros on the diagonal of the matrix. For nonsingular matrices, the algorithm ensures a zero free diagonal. KLU employs Duff's [11, 12] algorithm to find an unsymmetric permutation of the input matrix that yields a zero-free diagonal. A matrix cannot be permuted to have a zero free diagonal if and only if it is structurally singular. A matrix is structurally singular if there is no permutation of its nonzero pattern that makes it numerically nonsingular.









A transversal is defined as a set of nonzeros, no two of which lie in the same

row or column, on the diagonal of the permuted matrix. A transversal of maximum

length is the maximum transversal.

Duff's maximum transversal algorithm consists of representing the matrix as a graph with each vertex corresponding to a row in the matrix. An edge i_k → i_{k+1} exists in the graph if A(i_k, j_{k+1}) is a nonzero and A(i_{k+1}, j_{k+1}) is an element in the transversal set. A path between vertices i_0 and i_k consists of a sequence of nonzeros (i_0, j_1), (i_1, j_2), ..., (i_{k-1}, j_k), where the current transversal includes (i_1, j_1), (i_2, j_2), ..., (i_k, j_k). If there is a nonzero in position (i_k, j_{k+1}) and no nonzero in row i_0 or column j_{k+1} is currently on the transversal, the algorithm increases the transversal by one by adding the nonzeros (i_r, j_{r+1}), r = 0, 1, ..., k to the transversal and removing the nonzeros (i_r, j_r), r = 1, 2, ..., k from the transversal. This adding and removing of nonzeros to and from the transversal is called a reassignment chain or augmenting path.

A vertex or row is said to be assigned if a nonzero in the row is chosen for the transversal. Augmenting paths are constructed by doing a depth first search from an unassigned row i_0 of the matrix and continuing until a vertex i_k is reached where the path terminates, because A(i_k, j_{k+1}) is a nonzero and column j_{k+1} is unassigned. Then the search backtracks to i_0, adding and removing transversal elements, thus constructing an augmenting path.

Duff's maximum transversal algorithm has a worst case time complexity of O(nτ), where τ is the number of nonzeros in the matrix and n is the order of the matrix. However, in practice the time complexity is close to O(n + τ).

The maximum transversal problem can be cast as a maximal matching

problem on bipartite graphs. This is only to make a comparison. The maximal

matching problem is stated as follows.










\[
\begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
 & A_{22} & A_{23} & A_{24} \\
 & & A_{33} & A_{34} \\
 & & & A_{44}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}
\]

Figure 2-2: A matrix permuted to BTF form

Given an undirected graph G = (V, E), a matching is a subset of the edges M ⊆ E such that for all vertices v ∈ V, at most one edge of M is incident on v. A vertex v ∈ V is matched if some edge in M is incident on v; otherwise, v is unmatched. A maximal matching is a matching of maximum cardinality, that is, a matching M such that for any matching M', we have |M| ≥ |M'|.

A maximal matching can be built incrementally, by picking an arbitrary edge e in the graph, deleting any edge that shares a vertex with e, and repeating until the graph is out of edges.
2.6 Block Triangular Form

A block (upper) triangular matrix is similar to an upper triangular matrix except that the diagonal entries in the former are square blocks instead of scalars. Figure 2-2 shows a matrix permuted to BTF form.

Converting the input matrix to block triangular form is important in that:

1. The part of the matrix below the block diagonal requires no factorization effort.

2. The diagonal blocks are independent of each other. Only the blocks need to be factorized. For example, in figure 2-2, the subsystem A44 x4 = b4 can be solved independently for x4, and x4 can then be eliminated from the overall system. The system A33 x3 = b3 - A34 x4 is then solved for x3, and so on.

3. The off diagonal nonzeros do not contribute to any fill-in.

Finding a symmetric permutation of a matrix to its BTF form is equivalent to finding the strongly connected components of a graph. A strongly connected component of a directed graph G = (V, E) is a maximal set of vertices C ⊆ V such that for every pair of vertices u and v in C, we have both u → v and v → u; that is, the vertices u and v are reachable from each other.

The algorithm employed in KLU for the symmetric permutation of a matrix to BTF form is based on Duff and Reid's [13, 14] algorithm. Duff and Reid provide an implementation of Tarjan's [15] algorithm to determine the strongly connected components of a directed graph. The algorithm has a time complexity of O(n + τ), where n is the order of the matrix and τ is the number of off diagonal nonzeros in the matrix.

The algorithm essentially consists of doing a depth first search from unvisited nodes in the graph. It uses a stack to keep track of nodes being visited and maintains a path of the nodes. When all edges in the path are explored, it generates strongly connected components from the top of the stack.

Duff's algorithm is different from the method proposed by Cormen, Leiserson, Rivest and Stein [16]. They suggest doing a depth first search on G, computing G^T, and then running a depth first search on G^T on vertices in the decreasing order of their finish times from the first depth first search (the topological order from the first depth first search).










Figure 2-3: A symmetric pruning scenario

2.7 Symmetric Pruning

Eisenstat and Liu [17] proposed a method called Symmetric Pruning to exploit

structural symmetry for cutting down the symbolic analysis time. The cost of

depth first search can be cut down by pruning unnecessary edges in the graph of

L, G(L). The idea is to replace G(L) by a reduced graph of L. Any graph H can be

used in place of G(L), provided that a path from i to j exists in H iff a path from i to j exists in G(L). If A is symmetric, then the symmetric reduction is just the elimination tree.

The symmetric reduction is a subgraph of G(L). It has fewer edges than G(L)

and is easier to compute by taking advantage of symmetry in the structure of

the factors L and U. Even though the symmetric reduction removes edges, it still

preserves the paths between vertices of the original graph.

Figure 2-3 shows a symmetric pruning example.

If lij ≠ 0 and uji ≠ 0, then we can prune edges j → s, where s > i. The reason behind this is that for any ajk ≠ 0, ask will fill in from column j of L for s > k.









The just computed column i of L is used to prune earlier columns. Any future

depth first search from vertex j will not visit vertex s, since s would have been

visited via i already.

Note that every column is pruned only once. KLU employs symmetric pruning to speed up the depth first search in the symbolic analysis stage.

2.8 Ordering

It is a widely used practice to precede the factorization step of a sparse

linear system by an ordering phase. The purpose of the ordering is to generate a

permutation P that reduces the fill-in in the factorization phase of PAP^T. A fill-in is defined as a nonzero in a position (i, j) of the factor that was zero in the original matrix. In other words, we have a fill-in if Lij ≠ 0 (or Uij ≠ 0) where Aij = 0.

The permuted matrix created by the ordering PAPT creates much less fill-in

in the factorization phase than the unpermuted matrix A. The ordering mechanism

typically takes into account only the structure of the input matrix, without

considering the numerical values stored in the matrix. Partial pivoting during

factorization changes the row permutation P and hence could potentially increase

fill-in as opposed to what was estimated by the ordering scheme. We shall see more

about pivoting in the following sections.

If the input matrix A is unsymmetric, then the permutation of the matrix

A + A^T can be used. Various minimum degree algorithms can be used for ordering. Some of the popular ordering schemes include approximate minimum degree (AMD) [18, 19] and column approximate minimum degree (COLAMD) [20, 21], among others. COLAMD orders the matrix AA^T without forming it explicitly.

After permuting an input matrix A into BTF form using the maximum

transversal and BTF orderings, KLU attempts to factorize each of the diagonal

blocks. It applies the fill reducing ordering algorithm on the block before factoriz-

ing it. KLU supports both approximate minimum degree and column approximate









minimum degree. Besides, any given ordering algorithm can be plugged into KLU without much effort. Work is being done on integrating a nested dissection ordering strategy into KLU as well.

Of the various ordering schemes, AMD gives the best results on circuit matrices.

AMD finds a permutation P to reduce fill-in for the Cholesky factorization of PAP^T (or of P(A + A^T)P^T, if A is unsymmetric). AMD assumes no numerical pivoting. AMD attempts to reduce an optimistic estimate of fill-in.

COLAMD is an unsymmetric ordering scheme that computes a column permutation Q to reduce fill-in for the Cholesky factorization of (AQ)^T(AQ). COLAMD attempts to reduce a "pessimistic" estimate (upper bound) of fill-in.

Nested dissection is another ordering scheme; it creates a permutation such that the input matrix is transformed into block diagonal form with vertex separators. This is a popular ordering scheme. However, it is unsuitable for circuit matrices when applied to the matrix as such. It can be used on the blocks generated by the BTF pre-ordering.

The idea behind a minimum degree algorithm is as follows. A structurally symmetric matrix A can be represented by an equivalent undirected graph G(V, E) with vertices corresponding to row/column indices. An edge (i, j) exists in G if Aij ≠ 0.

Consider figure 2-4. If the matrix is factorized with vertex 1 as the pivot,

then after the first Gaussian elimination step, the matrix would be transformed as

in figure 2-5.

This first step of elimination can be considered equivalent to removing node 1 and all its edges from the graph and adding edges to connect all nodes adjacent to 1. In other words, the elimination has created a clique of the nodes adjacent to the eliminated node. Note that there are as many fill-ins in the reduced matrix as there are edges added in the clique formation.















Figure 2-4: A symmetric matrix and its graph representation

Figure 2-5: The matrix and its graph representation after one step of Gaussian elimination









In the above example, we have chosen the wrong node as pivot, since node 1 has the maximum degree. Instead, if we had chosen a node with minimum degree, say 3 or 5, as pivot, then there would have been zero fill-in after the elimination, since both 3 and 5 have degree 1.

This is the key idea in a minimum degree algorithm. It generates a permutation such that a node with minimum degree is eliminated in each step of Gaussian elimination, thus ensuring minimal fill-in. The algorithm does not examine the numerical values in the node selection process. It could happen that during partial pivoting, a node other than the one suggested by the minimum degree algorithm must be chosen as pivot because of its numerical magnitude. That is exactly the reason why the fill-in estimate produced by the ordering algorithm could be less than that experienced in the factorization phase.

2.9 Pivoting

Gaussian elimination fails when the diagonal element in the input matrix

happens to be zero. Consider a simple 2 × 2 system:

\[
\begin{bmatrix} 0 & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
\tag{2-23}
\]

When solving the above system, Gaussian elimination computes the multiplier -a21/a11, multiplies row 1 by this multiplier and adds it to row 2, thus eliminating the coefficient a21 from the matrix. This step obviously fails, since a11 is zero. Now let us see a classical case where the diagonal element is nonzero but close to zero:

\[
A = \begin{bmatrix} 0.0001 & 1 \\ 1 & 1 \end{bmatrix}
\tag{2-24}
\]

The multiplier is -1/0.0001 = -10^4. The factors L and U are

\[
L = \begin{bmatrix} 1 & 0 \\ 10^4 & 1 \end{bmatrix}, \quad
U = \begin{bmatrix} 0.0001 & 1 \\ 0 & -10^4 \end{bmatrix}
\tag{2-25}
\]

The element u22 has the actual value 1 - 10^4. However, assuming four digit arithmetic, it would be rounded off to -10^4. Note that the product of L and U is

\[
L \cdot U = \begin{bmatrix} 0.0001 & 1 \\ 1 & 0 \end{bmatrix}
\tag{2-26}
\]

which is different from the original matrix. The reason for this problem is that the multiplier computed is so large that when added to the small element a22 with value 1, it obscured the tiny value present in a22.

We can solve these problems with pivoting. In the above two examples, we could interchange rows 1 and 2 to solve the problem. This mechanism of interchanging rows (and columns) and picking a large element as the diagonal, to avoid numerical failures or inaccuracies, is called pivoting. To pick a numerically large element as pivot, we could look at the elements in the current column, or we could look at the entire submatrix (across both rows and columns). The former is called partial pivoting and the latter is called complete pivoting.

For dense matrices, partial pivoting adds a time complexity of O(n^2) comparisons to Gaussian elimination and complete pivoting adds O(n^3) comparisons. Complete pivoting is expensive and hence is generally avoided, except for special cases. KLU employs partial pivoting with diagonal preference: as long as the diagonal element is at least a constant threshold times the largest element in the column, we choose the diagonal as the pivot. This constant threshold is called the pivot tolerance.
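A hedged sketch of this pivot selection rule in C (illustrative only; not the actual KLU source) might look as follows, scanning the candidate entries of column k held in a dense work vector:

#include <math.h>

/* Sketch: choose a pivot row for column k with diagonal preference.
 * x[] holds the candidate values for rows k..n-1; tol is the pivot
 * tolerance (a small constant such as 0.001). Illustrative only. */
static int choose_pivot(const double *x, int k, int n, double tol)
{
    int    pivrow = k;
    double maxabs = 0.0;
    for (int i = k; i < n; i++) {
        if (fabs(x[i]) > maxabs) {      /* track the largest magnitude */
            maxabs = fabs(x[i]);
            pivrow = i;
        }
    }
    /* Prefer the diagonal if it is large enough relative to the max. */
    if (fabs(x[k]) >= tol * maxabs)
        pivrow = k;
    return pivrow;
}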

2.10 Scaling

The problem of small elements in the matrix being obscured during the elimination process, skewing the accuracy of the results because of numerical addition, is not completely overcome by pivoting. Let us see an example of this case.

Consider the 2 × 2 system

\[
\begin{bmatrix} 1 & 10^4 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 10^4 \\ 2 \end{bmatrix}
\tag{2-27}
\]

When we apply Gaussian elimination with partial pivoting to the above system, the entry a11 is largest in the first column and hence would continue to be the pivot. After the first step of elimination, assuming four digit arithmetic, we would have

\[
\begin{bmatrix} 1 & 10^4 \\ 0 & -10^4 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 10^4 \\ -10^4 \end{bmatrix}
\tag{2-28}
\]

The solution from the above elimination is x1 = 0, x2 = 1. However, the correct solution is close to x1 = 1, x2 = 1.

If we divide each row of the matrix by the largest element in that row (and the corresponding element in the right hand side as well) prior to Gaussian elimination, we would have

\[
\begin{bmatrix} 10^{-4} & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
\tag{2-29}
\]

Now if we apply partial pivoting we would have

\[
\begin{bmatrix} 1 & 1 \\ 10^{-4} & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 2 \\ 1 \end{bmatrix}
\tag{2-30}
\]

And after an elimination step, the result would be

\[
\begin{bmatrix} 1 & 1 \\ 0 & 1 - 10^{-4} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 2 \\ 1 - 2 \times 10^{-4} \end{bmatrix}
\tag{2-31}
\]

which yields the correct solution x1 = 1, x2 = 1. The process of balancing out the numerical enormity or obscurity in each row or column is termed scaling. In the above example, we have scaled with respect to the maximum value in a row, which is row scaling. Another variant would be to scale with respect to the sum of absolute values of all elements across a row.

In column scaling, we would scale with respect to the maximum value in a column or the sum of absolute values of all elements in a column.

Row scaling can be considered equivalent to finding an invertible diagonal matrix D1 such that all the rows in the matrix D1^{-1}A have equally large numerical values. Once we have such a D1, the solution of the original system Ax = b is equivalent to solving the system A'x = b' where A' = D1^{-1}A and b' = D1^{-1}b. Equilibration is another popular term used for scaling.

In KLU, the diagonal elements of the diagonal matrix D1 are either the largest elements in the rows of the original matrix or the sums of the absolute values of the elements in the rows. Scaling can be turned off as well, if the simulation environment does not need it. Scaling, though it offers better numerical results when solving systems, is not mandatory. Its usage depends on the data values that constitute the system; if the values are already balanced, scaling might not be necessary.
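As an illustration of row scaling by the largest absolute value in each row (a minimal sketch assuming column compressed inputs, not the KLU routine itself):

#include <math.h>

/* Sketch: compute row scale factors Rs[i] = max_j |A(i,j)| for a matrix
 * in column compressed form. Each entry in row i, and the right hand
 * side entry b[i], would then be divided by Rs[i]. Illustrative only. */
static void row_scale_factors(int n, const int *Ap, const int *Ai,
                              const double *Ax, double *Rs)
{
    for (int i = 0; i < n; i++)
        Rs[i] = 0.0;
    for (int k = 0; k < n; k++) {               /* scan all columns */
        for (int p = Ap[k]; p < Ap[k+1]; p++) {
            double a = fabs(Ax[p]);
            if (a > Rs[Ai[p]])                  /* track row maximum */
                Rs[Ai[p]] = a;
        }
    }
}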

2.11 Growth Factor

Pivot growth factor is a key diagnostic estimate in determining the stability

of Gaussian elimination. Stability of Numerical Algorithms is an important factor

in determining the accuracy of the solution. Study of stability is done by a process









called roundoff error analysis. Roundoff error analysis comprises two subtypes called forward error analysis and backward error analysis. If the computed solution is close to the exact solution, the algorithm is said to be forward stable. If the algorithm computes an exact solution to a nearby problem, the algorithm is said to be backward stable. Backward stability is the most widely used technique in studying the stability of systems. Often the data generated for solving systems have impurities in them or are distorted by a small amount. Under such circumstances we are interested that the algorithm produce an exact solution to this nearby problem, and hence backward stability assumes significance.

Pivot growth factor is formally defined as

\[
\rho = \frac{\max_k \max_{i,j} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}
\tag{2-32}
\]

where a_{ij}^{(k)} is an entry in the reduced matrix A^{(k)} after the kth elimination step.

From (2-32), we find that if the entries of the reduced matrix grow arbitrarily, we would have a high growth factor. This arbitrary growth would again lead to inaccuracies in the results. Consider the following 2 × 2 system:

\[
\begin{bmatrix} 10^{-4} & 1 \\ 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
\tag{2-33}
\]

After one step of Gaussian elimination, assuming four digit arithmetic, we would have the reduced system

\[
\begin{bmatrix} 10^{-4} & 1 \\ 0 & 1 - 10^4 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 - 10^4 \end{bmatrix}
\tag{2-34}
\]

Solving the system yields x1 = 0, x2 = 1, which is different from the actual solution x1 ≈ 1, x2 ≈ 1. The pivot growth factor of the above system is

\[
\rho = \frac{\max(1, 10^4)}{1} = 10^4
\]
Thus a large pivot growth clearly indicates inaccuracy in the result. Partial pivoting generally avoids large growth factors. In the above example, if we had applied partial pivoting, we would have got the correct results. But this is not assured, and there are cases where partial pivoting might not result in an acceptable growth factor. This necessitates the estimation of the growth factor as a diagnostic tool to detect cases where Gaussian elimination could be unstable.

Pivot growth factor is usually calculated in terms of its reciprocal, to avoid numerical overflow problems when the value is very large. Equation (2-32) is hard to compute since it involves calculating the maximum of the reduced matrix after every step of elimination. Other definitions of the reciprocal growth factor that are easier to compute are as follows:

\[
\rho_1 = \min_j \frac{\max_i |a_{ij}|}{\max_i |u_{ij}|}
\tag{2-35}
\]

\[
\rho_2 = \frac{\max_{i,j} |a_{ij}|}{\max_{i,j} |u_{ij}|}
\tag{2-36}
\]

Equation (2-35) is the definition implemented in KLU; it is column scaling invariant. It helps unmask a large pivot growth that could be totally masked by column scaling.
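A sketch of (2-35) in C (assuming both A and U are available in column compressed form; an illustration, not the signature of KLU's actual pivot growth routine):

#include <math.h>

/* Sketch: reciprocal pivot growth per (2-35): the minimum over columns j
 * of (max_i |A(i,j)|) / (max_i |U(i,j)|). A and U are both in column
 * compressed form. Illustrative only. */
static double recip_pivot_growth(int n,
                                 const int *Ap, const double *Ax,
                                 const int *Up, const double *Ux)
{
    double rgrowth = HUGE_VAL;           /* start the minimum at +inf */
    for (int j = 0; j < n; j++) {
        double amax = 0.0, umax = 0.0;
        for (int p = Ap[j]; p < Ap[j+1]; p++)
            if (fabs(Ax[p]) > amax) amax = fabs(Ax[p]);
        for (int p = Up[j]; p < Up[j+1]; p++)
            if (fabs(Ux[p]) > umax) umax = fabs(Ux[p]);
        if (umax > 0.0 && amax / umax < rgrowth)
            rgrowth = amax / umax;
    }
    return rgrowth;
}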

2.12 Condition Number

Growth factor is a key estimate in determining the stability of the algorithm.

Condition number is a key estimate in determining the amenability or conditioning

of a given problem. It is not guaranteed that a highly stable algorithm can yield

accurate results for all problems it can solve. The conditioning of the problem has

a dominant effect on the accuracy of the solution.

Note that while stability deals with the algorithm, conditioning deals with

the problem itself. In practical applications like circuit simulation, the data of









a problem come from experimental observations. Typically such data have a

factor of error or impurities or noise associated with them. Roundoff errors and

discretization errors also contribute to impurities in the data. Conditioning of a

problem deals with determining how the solution of the problem changes in the

presence of impurities.

The preceding discussion shows that one often deals with solving problems

not with the original data but with perturbed data. The analysis of the effect

of perturbation of the problem on the solution is called Perturbation analysis. It

helps in determining whether a given problem produces a little or huge variation

in solution when perturbed. Let us see what we mean by well or ill conditioned

problems.

A problem is said to be ill conditioned if a small relative error in data leads

to a large relative error in solution irrespective of the algorithm employed. A

problem is said to be well conditioned if a small relative error in data does not

lead to a large relative error in solution.

Accuracy of the computed solution is of primary importance in numerical analysis. The stability of the algorithm and the conditioning of the given problem are the two factors that directly determine accuracy. A highly stable algorithm, well armored with scaling, partial pivoting and other techniques, cannot be guaranteed to yield an accurate solution to an ill-conditioned problem.

A backward stable algorithm applied to a well-conditioned problem should

yield a solution close to the exact solution. This follows from the definitions of backward stability and well-conditioning, where backward stability assures an exact solution to a nearby problem and a well-conditioned problem assures that the computed solution to perturbed data is relatively close to the exact solution of the actual problem.









Mathematically, let X be some problem and let X(d) be the solution to the problem for some input d. Let δd denote a small perturbation in the input d. Now if the relative error in the solution

\[
\frac{|X(d + \delta d) - X(d)|}{|X(d)|}
\]

exceeds the relative error in the input

\[
\frac{|\delta d|}{|d|}
\]

then the problem is ill conditioned, and well conditioned otherwise.

Condition number is a measure of the conditioning of the problem. It shows whether a problem is well or ill conditioned. For linear system problems of the form Ax = b, the condition number is defined as

\[
\mathrm{Cond}(A) = \|A\|\,\|A^{-1}\|
\tag{2-37}
\]

Equation (2-37) is arrived at by theory that deals with perturbations either in the input matrix A, or the right hand side b, or both. Equation (2-37) can be defined with respect to any norm, viz. the 1, 2 or ∞ norm. The system Ax = b is said to be ill-conditioned if the condition number from (2-37) is quite large; otherwise it is said to be well-conditioned.

A naive way to compute the condition number would be to compute the inverse of the matrix, compute the norms of the matrix and its inverse, and take their product. However, computing the inverse is at least three times as expensive as solving the linear system Ax = b and hence should be avoided.

Hager [22] developed a method for estimating the 1-norm ||A^{-1}||_1 and the corresponding 1-norm condition number. Hager proposed an optimization approach for estimating ||A^{-1}||_1. The 1-norm of a matrix is formally defined as

\[
\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}|
\tag{2-38}
\]

Hager's algorithm can be briefly described as follows. For A ∈ R^{n×n}, a convex function is defined as

\[
F(x) = \|Ax\|_1
\tag{2-39}
\]

over the convex set

\[
S = \{x \in \mathbb{R}^n : \|x\|_1 \le 1\}
\]

Then ||A||_1 is the global maximum of (2-39).

The algorithm involves computing Ax and A^T x, that is, matrix-vector products. When we want to estimate ||A^{-1}||_1, it involves computing A^{-1}x and (A^{-1})^T x, which is equivalent to solving Ax = b and A^T x = b. We can use KLU to efficiently solve these systems.

Higham [23] presents refinements to Hager's algorithm and restricts the number of iterations to five. Higham further presents a simple device, using the higher of the estimates from this device and Hager's algorithm, to ensure the estimate is large enough. This device involves solving the linear system Ax = b where

\[
b_i = (-1)^{i+1}\left(1 + \frac{i-1}{n-1}\right), \quad i = 1, 2, \ldots, n
\]

The final estimate is chosen as the maximum of the estimate from Hager's algorithm and 2||x||_1 / (3n).

KLU's condition number estimator is based on Higham's refinement of Hager's algorithm.
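A condensed C sketch of Hager's estimation loop follows (illustrative only: solve() stands for a user-supplied factorization-backed solve such as KLU's, the five-iteration cap follows Higham, and the simple monotone stopping test here is a simplification of Hager's exact convergence criterion):

#include <math.h>
#include <stdlib.h>

/* Sketch of Hager's estimator for ||inv(A)||_1, capped at five iterations.
 * solve(b, n, transpose, ctx) must overwrite b with the solution of
 * A x = b (or A^T x = b when transpose != 0). Illustrative only. */
typedef void (*solve_fn)(double *b, int n, int transpose, void *ctx);

static double norm1_inv_estimate(int n, solve_fn solve, void *ctx)
{
    double *x = malloc(n * sizeof(double));
    double est = 0.0;
    for (int i = 0; i < n; i++)
        x[i] = 1.0 / n;                  /* starting vector in the set S */
    for (int iter = 0; iter < 5; iter++) {
        solve(x, n, 0, ctx);             /* x := inv(A) x */
        double newest = 0.0;
        for (int i = 0; i < n; i++)
            newest += fabs(x[i]);        /* ||inv(A) x||_1 */
        if (iter > 0 && newest <= est)
            break;                       /* no further improvement */
        est = newest;
        for (int i = 0; i < n; i++)      /* x := sign(x) */
            x[i] = (x[i] >= 0.0) ? 1.0 : -1.0;
        solve(x, n, 1, ctx);             /* x := inv(A)^T sign(...) */
        int jmax = 0;
        for (int i = 1; i < n; i++)
            if (fabs(x[i]) > fabs(x[jmax]))
                jmax = i;
        for (int i = 0; i < n; i++)      /* restart from the unit vector */
            x[i] = 0.0;
        x[jmax] = 1.0;
    }
    free(x);
    return est;
}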

2.13 Depth First Search

As we discussed earlier, the nonzero pattern of the kth column of L is determined by the reachability of the row indices of the kth column of A in the graph of L.









The reachability is determined by a depth-first search traversal of the graph of L.

The topological order for elimination of variables when solving the lower triangular

system Lx = b is also determined by the depth-first search traversal. A classical

depth first search algorithm is a recursive one. One of the major problems in a

recursive implementation of depth-first search is Stack overflow. Each process is

allocated a stack space upon execution. When there is a high number of recursive

calls, the stack space is exhausted and the process terminates abruptly. This is a

definite possibility in the context of our depth-first search algorithm when we have

a dense column of a matrix of a very high dimension.

The solution to stack overflow caused by recursion is to replace recursion

by iteration. With an iterative or non-recursive function, the entire depth first

search happens in a single function stack. The iterative solution uses an array of row indices called pstack. When descending to an adjacent node during the search, the row index of the next adjacent node is stored in the pstack at the position (row/column index) corresponding to the current node. When the search returns to the current node, we know that we next need to descend into the node stored in the pstack at the position corresponding to the current node. Using this extra O(n) memory, the iterative version completes the depth first search in a single function stack.

This is an important improvement from the recursive version since it avoids

the stack overflow problem that would have been a bottleneck when solving high

dimension systems.
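The following C sketch conveys the pstack idea (a simplified illustration only; KLU's actual search also handles symmetric pruning and emits the pattern in topological order):

/* Sketch: iterative DFS from vertex jroot in the graph of L (CSC arrays
 * Lp, Li). stack[] holds the current path of vertices; pstack[v] remembers
 * which adjacency position of v to resume at, so no recursion is needed.
 * mark[] flags visited vertices. Simplified illustration only. */
static void dfs_iterative(int jroot, const int *Lp, const int *Li,
                          int *mark, int *stack, int *pstack)
{
    int head = 0;
    stack[0] = jroot;
    pstack[jroot] = Lp[jroot];
    mark[jroot] = 1;
    while (head >= 0) {
        int j = stack[head];            /* vertex on top of the path */
        int p = pstack[j];
        if (p < Lp[j+1]) {
            pstack[j] = p + 1;          /* resume here after returning */
            int i = Li[p];
            if (!mark[i]) {             /* descend into an unvisited node */
                mark[i] = 1;
                stack[++head] = i;
                pstack[i] = Lp[i];
            }
        } else {
            head--;                     /* all neighbors done: backtrack */
        }
    }
}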

2.14 Memory Fragmentation

The data structures for L and U are the ones used to represent sparse matri-

ces. These comprise 3 vectors.

1. Vector of column pointers

2. Vector of row indices









3. Vector of numerical values

Overall, six vectors are needed for the two matrices L and U. Of these, the two vectors of column pointers are of pre-known size, namely the size of a block.

The remaining four vectors of row indices and numerical values depend on the fill-in estimated by AMD. However, AMD gives an optimistic estimate of fill-in. Hence we need to dynamically grow these vectors during the factorization phase if we determine that the fill-in is higher than estimated. The partial pivoting strategy can alter the row ordering determined by AMD and hence is another source of higher fill-in than the AMD estimate.

Dynamically growing these four vectors suffers from the problem of external

memory fragmentation. In external fragmentation, free memory is scattered in

the memory space. A call for more memory fails because of non-availability of

contiguous free memory space. If the scattered free memory areas were contiguous,

the memory request would have succeeded. In the context of our problem, the

memory request to grow the four vectors could either fail if we run into external

fragmentation or succeed when there is enough free space available.

When we reallocate or grow memory, there are two types of success cases. In

the first case called cheap reallocation, there is enough free memory space abutting

the four vectors. Here the memory occupied by a vector is just extended or its end

boundary is increased. The start boundary remains the same. In the second case

called costly reallocation, there is not enough free memory space abutting a vector.

Hence a fresh memory is allocated in another region for the new size of vector and

the contents are copied from old location. Finally the old location is freed.

With four vectors to grow, there is a failure case because of external fragmen-

tation and a costly success case because of costly reallocation. To reduce the failure

case and avoid the costly success case, we have coalesced the four vectors into a

single vector. This new data structure is byte aligned on double boundary. For









every column of L and U, the vector contains the row indices and numerical values of L followed by the row indices and numerical values of U. Multiple integer row indices are stored in a single double location. The actual number of integers that can be stored in a double location varies with the platform and is determined dynamically. The common technique of using an integer pointer to a location aligned on a double boundary is employed to retrieve or save the row indices.

In addition to this coalesced data structure containing the row indices and

numerical values, two more length vectors of size n are needed to contain the length

of each column of L and U. These length vectors are preallocated once and need

not be grown dynamically.

Some memory management schemes never do cheap reallocation. In such schemes, the new data structure serves only to reduce external fragmentation.
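A sketch of the coalescing technique in C (an illustration of the idea, not KLU's exact layout): the integer row indices are packed into double-sized slots ahead of the numerical values, so one allocation serves both and the values stay double-aligned.

#include <stdlib.h>

/* Sketch: one column's indices and values packed into a single buffer of
 * doubles. The first ceil(len / ints_per_double) slots hold the integer
 * row indices; the values follow on a double boundary. Illustrative only. */
static const int ints_per_double = (int)(sizeof(double) / sizeof(int));

double *alloc_column(int len, int **indices, double **values)
{
    int idx_units = (len + ints_per_double - 1) / ints_per_double;
    double *buf = malloc((idx_units + len) * sizeof(double));
    *indices = (int *)buf;          /* integer indices share the buffer  */
    *values  = buf + idx_units;     /* values start on a double boundary */
    return buf;
}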

2.15 Complex Number Support

KLU supports complex matrices and complex right hand sides. KLU also

supports solving the transpose system ATx = b for real matrices and solving the

conjugate transpose system AHx = b for complex matrices. Initially it relied on the

C99 language support for complex numbers. However the C99 specification is not

supported across operating systems. For example, earlier versions of Sun Solaris do

not support C99. To avoid these compatibility issues, KLU no longer relies on C99

and has its own complex arithmetic implementation.

2.16 Parallelism in KLU

When solving a system Ax = b using KLU, we use BTF pre-ordering to

convert A into a block upper triangular form. We apply AMD on each block and

factorize each block one after the other serially. Alternatively, nested dissection can

be applied to each block. Nested dissection ordering converts a block to a doubly

bordered block diagonal form. A doubly bordered block diagonal form is similar to

a block upper triangular form but has nonzeros in the sub-diagonal region. These










Figure 2-6: A doubly bordered block diagonal matrix and its corresponding vertex
separator tree

nonzeros form a horizontal strip resembling a border. Similarly the nonzeros in the

region above the diagonal form a corresponding vertical strip.

The doubly bordered block diagonal form can be thought of as a separator
tree. Factorization of the block then involves a post-order traversal of the separator
tree. The nodes in the separator tree can be factorized in parallel. The factor-
ization of a node would additionally involve computing the Schur complement of
its parent and of its ancestors in the tree. Once all the children of a node have
updated its Schur complement, the node is ready to be factorized and it in turn
computes the Schur complement of its parent and its ancestors. The factorization
and computation of Schur complements is done in a post-order traversal fashion and
the process stops at the root.
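The C sketch below outlines this post-order strategy; the Node type and the

factor and update_ancestors routines are hypothetical placeholders, and the

synchronization that a real parallel implementation needs is omitted.

/* Post-order factorization over a separator tree (sequential sketch). */
typedef struct node
{
    struct node **child ;   /* children in the separator tree */
    int nchild ;            /* number of children */
} Node ;

static void factor (Node *t) { /* factorize the diagonal part of t */ }
static void update_ancestors (Node *t) { /* add t's Schur complement updates */ }

static void postorder_factor (Node *t)
{
    for (int k = 0 ; k < t->nchild ; k++)
    {
        postorder_factor (t->child [k]) ;   /* independent subtrees could
                                               be factorized in parallel */
    }
    factor (t) ;             /* all children have updated t by now */
    update_ancestors (t) ;   /* push t's contribution up the tree */
}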
Parallelism can help in reducing the factorization time. It gains importance in

the context of multiprocessor systems. Work is being done to enable parallelism in
KLU.















CHAPTER 3
CIRCUIT SIMULATION: APPLICATION OF KLU

The KLU algorithm comprises the following steps:

1. Unsymmetric permutation to block upper triangular form. This consists of

two steps.

(a) unsymmetric permutation to ensure a zero-free diagonal using maximum

transversal.

(b) symmetric permutation to block upper triangular form by finding the

strongly connected components of the graph.

2. Symmetric permutation of each block (say A) using AMD on A + AT or an

unsymmetric permutation of each block using COLAMD on AAT. These

permutations are fill-in reducing orderings on each block.

3. Factorization of each scaled block using Gilbert-Peierls' left looking algorithm

with partial pivoting.

4. Solve the system using block-back substitution and account for the off-

diagonal entries. The solution is re-permuted to bring it back to the original

order.

Let us first derive the final system that we need to solve, taking into account

the different permutations, scaling and pivoting. The original system to solve is


Ax = b (3-1)


Let R be the diagonal matrix with the scale factors for each row. Applying scaling,

we have

RAx = Rb (3-2)









Let P' and Q' be the row and column permutation matrices that combine the

permutations for maximum transversal and the block upper triangular form

together. Applying these permutations together, we have


P'RAQ'Q'Tx = P'Rb [Q'Q'T = I, the identity matrix] (3-3)


Let P and Q be row and column permutation matrices that combine the P' and Q'

mentioned above with the symmetric permutation produced by AMD and the

partial pivoting row permutation produced by factorization. Now,


PRAQQTx = PRb

or

(PRAQ)QTx = PRb (3-4)

The matrix (PRAQ) consists of two parts, viz. the diagonal blocks that are factor-

ized and the off-diagonal elements that are not factorized.

(PRAQ) = LU + F, where LU represents the factors of all the blocks

collectively and F represents the entire off-diagonal region. Equation (3-4) now

becomes

(LU + F)QTx = PRb (3-5)

x = Q(LU + F)^-1 (PRb) (3-6)

Equation (3-6) consists of two steps: a block back-substitution, i.e., computing

(LU + F)^-1 (PRb), followed by applying the column permutation Q.

The block back-substitution in (LU + F)^-1 (PRb) looks cryptic and can be

better explained as follows. Consider a simple 3-by-3 block system











[ L11U11  F12     F13    ] [ X1 ]   [ B1 ]
[ 0       L22U22  F23    ] [ X2 ] = [ B2 ]      (3-7)
[ 0       0       L33U33 ] [ X3 ]   [ B3 ]

The equations corresponding to the above system are:


L11U11 * X1 + F12 * X2 + F13 * X3 = B1 (3-8)

L22U22 * X2 + F23 * X3 = B2 (3-9)

L33U33 * X3 = B3 (3-10)


In block back-substitution, we first solve (3-10) for X3, and then eliminate X3

from (3-9) and (3-8) using the off-diagonal entries.

Next, we solve (3-9) for X2 and eliminate X2 from (3-8).

Finally, we solve (3-8) for X1.
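In code, this amounts to a backward sweep over the blocks, as in the following C

sketch; blocksolve and offdiag are hypothetical stand-ins for the per-block

triangular solves and the off-diagonal eliminations.

#define NBLOCKS 3

/* hypothetical stand-ins: blocksolve solves Lkk*Ukk * Xk = Bk in place,
   and offdiag performs Xj = Xj - Fjk * Xk for an off-diagonal block Fjk */
static void blocksolve (int k, double *X) { }
static void offdiag (int k, int j, double *X) { }

static void block_backsolve (double *X)
{
    for (int k = NBLOCKS - 1 ; k >= 0 ; k--)
    {
        blocksolve (k, X) ;       /* solve (3-10), then (3-9), then (3-8) */
        for (int j = 0 ; j < k ; j++)
        {
            offdiag (k, j, X) ;   /* eliminate Xk from the equations above */
        }
    }
}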

3.1 Characteristics of Circuit Matrices

Circuit matrices exhibit certain unique characteristics that make KLU more

relevant to them. They are very sparse. Because of their high sparsity, BLAS

kernels are not applicable. Circuit matrices often have a few dense rows/columns

that originate from voltage or current sources. These dense rows/columns are

effectively removed by BTF permutation.

Circuit matrices are unsymmetric, but the nonzero pattern is roughly symmet-

ric. They are easily permutable to block upper triangular form. Besides, they have

a zero-free or nearly zero-free diagonal. Another peculiar feature of circuit matrices is

that the nonzero pattern of each block after permutation to block upper triangular

form, is more symmetric than the original matrix. Typical ordering strategies

applied to the original matrix cause high fill-in, whereas when applied to the blocks,

they lead to less fill-in.









The efficiency of the permutation to block upper triangular form shows in the

fact that the entire sub-diagonal region of the matrix requires zero work and the off-diagonal

elements do not cause any fill-in, since they are not factorized.

3.2 Linear Systems in Circuit Simulation

The linear systems in the circuit simulation process originate from solving large

systems of non linear equations using Newton's method and integrating large stiff

systems of ordinary differential equations. These linear systems consist of the

coefficient matrix A, the unknown vector x and the right hand side b.

During the course of simulation, the matrix A retains the same nonzero

pattern (structurally unchanged) and only undergoes changes in numerical val-

ues. Thus the initial analysis phase (permutation to ensure zero-free diagonal,

block triangular form and minimum degree ordering on blocks) and factorization

phase (that involves symbolic analysis, partial pivoting and symmetric pruning) can

be restricted to the initial system alone.

Subsequent systems A'x = b where only the numerical values of A' are different

from A, can be solved using a mechanism called refactorization. Refactorization

simply means to use the same row and column permutations (comprising entire

analysis phase and partial pivoting) computed for the initial system, for solving the

subsequent systems that have changes only in numerical values. Refactorization

substantially reduces run time since the analysis time and factorization time spent

on symbolic analysis, partial pivoting and pruning are avoided. The nonzero patterns

of the factors L and U are the same as for the initial system. Only numerical

factorization using the pre-computed nonzero pattern and partial pivoting order, is

required.

The solve step follows the factorization/refactorization step. KLU accommodates

solving multiple right hand sides in a single solve step. Up to four right hand sides

can be solved in a single step.
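The following C sketch shows how a simulator might combine factorization,

refactorization and solves using the KLU routines documented in chapter 4; the

header name klu.h is assumed and all error checking is omitted.

#include "klu.h"   /* header name assumed */

/* Solve A x = b, then re-solve after A's values (not its pattern) change. */
void simulation_solves (int n, int Ap [ ], int Ai [ ], double Ax [ ], double b [ ])
{
    klu_common Common ;
    klu_defaults (&Common) ;   /* set default control parameters */

    klu_symbolic *Symbolic = klu_analyze (n, Ap, Ai, &Common) ;
    klu_numeric *Numeric = klu_factor (Ap, Ai, Ax, Symbolic, &Common) ;
    klu_solve (Symbolic, Numeric, n, 1, b, &Common) ;   /* b now holds x */

    /* ... the simulator updates the numerical values in Ax ... */

    klu_refactor (Ap, Ai, Ax, Symbolic, Numeric, &Common) ;
    klu_solve (Symbolic, Numeric, n, 1, b, &Common) ;

    klu_free_numeric (&Numeric, &Common) ;
    klu_free_symbolic (&Symbolic, &Common) ;
}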









3.3 Performance Benchmarks


During my internship at a circuit simulation company, I did performance

benchmarking of KLU vs. SuperLU in the simulation environment. The perfor-

mance benchmarks were run on a representative set of examples. The results

of these benchmarks are tabulated as follows (the size of the matrix created in

simulation is shown in parentheses).

Table 3-1: Comparison between KLU and SuperLU on overall time and fill-in

Overall time Nonzeros (L+U)
Netlist KLU SuperLU Speedup KLU SuperLU
Problem1 (301) 1.67 1.24 0.74 1808 1968
Problem2 (1879) 734.96 1.- ". 0.94 13. i1 13770
Problem3 (2076) 56.46 53.38 0.95 16403 16551
Problem4 (7598) 89.63 81.85 0.91 52056 54997
Problem5 (745) 18.13 16.84 0.93 4156 4231
Problem6 (1041) 1336.50 1317.30 0.99 13198 1.'' r".
Problem7 (33) 0.40 0.32 0.80 157 176
Problem8 (10) 4.46 1.570 0.35 40 41
Problem9 (180) 222.26 202.29 0.91 1845 1922
Problem10 (6833) 6222.20 6410.40 1.03 56322 ..1
Problem11 (1960) 181.78 179.50 0.99 13527 13963
Problem12 (200004) 6.25 8.47 1.35 500011 600011
Problem13 (20004) 0.47 0.57 1.22 50011 60011
Problem14 (40004) 0.97 1.31 1.35 100011 120011
Problem15 (100000) 1.76 2.08 1.18 299998 499932
Problem16 (7602) 217.80 255.88 1.17 156311 184362
Problem17 (10922) 671.10 770.58 1.15 267237 299937
Problem18 (14842) 1017.00 1238.10 1.22 326811 425661
Problem19 (19362) 1099.00 1284.40 1.17 550409 581277
Problem20 (24482) 3029.00 3116.90 1.03 684139 788047
Problem21 (30202) 2904.00 3507.40 1.21 933131 1049463


The circuits Problem16-Problem21 are TFT LCD arrays similar to memory

circuits. These examples were run at least twice with each algorithm employed, viz.

KLU or SuperLU, to get consistent results. The results are tabulated in tables 3-1,

3-2 and 3-3. The "overall time" in table 3-1 comprises analysis, factorization


and solve time.



















Table 3-2: Comparison between KLU and SuperLU on factor time and solve time


Factor time Factor Solve time Solve
per iteration speedup per iteration speedup
Netlist KLU SuperLU KLU SuperLU
Problem1 (301) 0.000067 0.000084 1.26 0.000020 0.000019 0.92
Problem2 (1879) 0.000409 0.000377 0.92 0.000162 0.000100 0.61
Problem3 (2076) 0.000352 0.000317 0.90 0.000122 0.000083 0.68
Problem4 (7598) 0.001336 0.001318 0.99 0.000677 0.000326 0.48
Problem5 (745) 0.000083 0.000063 0.76 0.000035 0.000022 0.62
Problem6 (1041) 0.000321 0.000406 1.26 0.000079 0.000075 0.95
Problem7 (33) 0.000004 0.000004 0.96 0.000003 0.000002 0.73
Problem8 (10) 0.000001 0.000001 0.89 0.000001 0.000001 0.80
Problem9 (180) 0.000036 0.000042 1.16 0.000014 0.000011 0.76
Problem10 (6833) 0.001556 0.001530 0.98 0.000674 0.000365 0.54
Problem11 (1960) 0.000663 0.000753 1.14 0.000136 0.000122 0.90
Problem12 (200004) 0.103900 0.345500 3.33 0.11 I',I. 0 0.041220 1.35
Problem13 (20004) 0.005672 0.020110 3.55 0.001633 0.002735 1.67
Problem14 (40004) 0.014430 0.056080 3.89 0.004806 0.iiii.11 .1 1.43
Problem15 (100000) 0.168700 0.283700 1.68 0.018600 0.033610 1.81
Problem16 (7602) 0.009996 0.017620 1.76 0.001654 0.001439 0.87
Problem17 (10922) 0.018380 0.030010 1.63 0.002542 0.001783 0.70
Problem18 (14842) 0.024020 0.046130 1.92 0.003187 0.002492 0.78
Problem19 (19362) 0.054730 0.080280 1.47 0.005321 0.003620 0.68
Problem20 (24482) 0.121400 0.122600 1.01 0.006009 0.004705 0.78
Problem21 (30202) 0.124000 0.188700 1.52 0.009268 0.006778 0.73









3.4 Analyses and Findings


The following are my inferences based on the results: Most of the matrices in

these experiments are small matrices, of the order of a few thousand.

Fill-in is much better with KLU. The BTF ordering combined with the

AMD ordering on each of the blocks does a good job of reducing the fill-in

count. The improvement in fill-in averages around 6% for many

examples.

Table 3-3: Ordering results using BTF+AMD in KLU on circuit matrices

Nonzeros Nonzeros No of Max Block Nonzeros
Netlist Size in A in L+U Blocks size off diagonal
Problem1 301 1484 1808 7 295 89
Problem2 1879 12926 13 .i 1 19 1861 4307
Problem3 2076 15821 16403 13 2064 6689
Problem4 7598 48922 52056 13 7586 19018
Problem5 745 3966 4156 128 426 1719
Problem6 1041 9654 13198 67 975 2608
Problem7 33 153 157 7 27 50
Problem8 10 39 40 5 6 16
Problem9 180 1503 1845 19 162 661
Problem10 6833 43250 56322 507 6282 12594
Problem11 1960 11187 13527 58 1715 2959
Problem12 200004 500011 500011 200003 2 300005
Problem13 20004 50011 50011 20003 2 30005
Problem14 40004 100011 100011 40003 2 60005
Problem15 100000 299998 299998 1 100000 0
Problem16 7602 32653 156311 103 7500 251
Problem17 10922 46983 267237 123 10800 301
Problem18 14842 63913 326811 143 14700 351
Problem19 19362 83443 550409 163 19200 401
Problem20 24482 105573 684139 183 24300 451
Problem21 30202 130303 933131 203 30000 501


There is no fill-in in the Problem12, Problem13, Problem14 and Problem15

netlists with KLU. This is quite significant in that there is no memory overhead

in these examples. In the case of circuits Problem16-Problem21, the gain in fill









in with KLU ranges from 6% in the Problem19 example to 24% in the Problem18

example.

The gain in fill-in translates into faster factorization because fewer nonzeros

imply less work. The factorization time is thus expected to be low. This turns out

to be true in most of the cases (factorization speedup of 1.6x in the Problem16-

Problem21 examples and 3x for the Problem12-Problem14 examples). For some cases

like Problem2 and Problem10, the factorization time remains the same for both KLU

and SuperLU.

The solve phase turns out to be slow in KLU. The off-diagonal nonzero

handling probably accounts for the extra time spent in the solve phase.

One way of reducing the solve overhead in KLU would be to solve multiple

right hand sides at the same time. In a single solve iteration, four equations are solved.

On the whole, the overall time speedup is 1.2 for the Problem16-Problem21

examples and the Problem12-Problem14 examples. For others, the overall time is

almost the same between the two algorithms.

BTF is not able to find many blocks for most of the matrices; there

happens to be a single large block and the remaining blocks are singletons. But the AMD

ordering does a good job of reducing the fill-in count. The off-diagonal

nonzero count is not high.

3.5 Alternate Ordering Experiments

Different ordering strategies were employed to analyze the fill-in behaviour.

The statistics using different ordering schemes are listed in table 3-4. COLAMD

is not listed in the table; it typically gives poor ordering and causes more fill-in

than AMD, MMD and AMD+BTF. AMD alone gives relatively higher fill-in

compared to AMD+BTF on most of the matrices. However, AMD alone gives mixed

results in comparison with MMD. It matches or outperforms MMD in fill-in on the

Problem12-Problem14 and Problem16-Problem21 matrices. However, it gives




















Table 3-4: Comparison of ordering results produced by BTF+AMD, AMD, MMD

Nonzeros Fill-in Fill-in Fill-in
Netlist in A BTF+AMD AMD MMD
Problem1 (301) 1484 1808 1928 1968
Problem2 (1879) 12926 13-.11 13857 13770
Problem3 (2076) 15821 16403 18041 16551
Problem4 (7598) 48922 52056 57975 54997
Problem5 (745) 3966 4156 5562 4231
Problem6 (1041) 9654 13198 14020 1.''.
Problem7 (33) 153 157 178 176
Problem8 (10) 39 40 41 41
Problem9 (180) 1503 1845 1968 1922
Problem10 (6833) 43250 56322 133739 ."..-.1
Problem11 (1960) 11187 13527 14800 13963
Problem12 (200004) 500011 500011 600011 600011
Problem13 (20004) 50011 50011 60011 60011
Problem14 (40004) 100011 100011 120011 120011
Problem15 (100000) 299998 299998 299998 499932
Problem16 (7602) 32653 156311 165264 184362
Problem17 (10922) 46983 267237 255228 299937
Problem18 (14842) 63913 326811 387668 425661
Problem19 (19362) 83443 550409 451397 581277
Problem20 (24482) 105573 684139 718891 788047
Problem21 (30202) 130303 933131 839226 1049463









poor fill-in for the rest of the circuits when compared with MMD. AMD alone beats

AMD+BTF in fill-in for some of the examples, viz. Problem17, Problem19 and

Problem21. Overall, BTF+AMD is the best ordering strategy to use.

3.6 Experiments with UF Sparse Matrix Collection

There are a number of circuit matrices in the UF sparse matrix collection.

Different experiments were done with these matrices as well on different parameters

like

1. different ordering schemes in KLU
2. timing of different phases of KLU

3. ordering quality among KLU, UMFPACK and Gilbert-Peierls
4. performance comparison between KLU and UMFPACK.

UMFPACK is an unsymmetric multifrontal sparse solver and uses an unsymmetric

COLAMD ordering or an AMD ordering, automatically selecting which method

to use based on the matrix characteristics. Gilbert-Peierls' algorithm is available

in MATLAB as an LU factorization scheme. These experiments were done on a

Mandrake 10.0 Linux OS, on an Intel Pentium M processor with a clock frequency of 1400

MHz and 768 MB of RAM.

3.6.1 Different Ordering Schemes in KLU

There are six different ordering schemes possible in KLU. The three fill

reducing schemes are AMD, COLAMD and User Specified Ordering. These

three fill reducing orderings can be combined with a BTF preordering or no

preordering. Hence six different schemes. However, in this experiment user specified

ordering is not considered. That leaves us with four different schemes. The table

3-5 lists the fill (number of nonzeros in L+U plus the number of off-diagonal

elements) for each of these ordering schemes.

From table 3-5, we find that BTF+AMD gives consistently better fill-in

across different circuit matrices. However, there are observable aberrations. For









example, with the circuit Bomhof/circuit_2, AMD and COLAMD give better fill-in

than BTF+AMD. These results confirm again that BTF+AMD is the best

ordering scheme to use for circuit matrices.

3.6.2 Timing Different Phases in KLU

These experiments show the time spent in different phases of the algorithm.

BTF pre-ordering followed by an AMD fill-reducing ordering is employed. As

mentioned earlier, there are four different phases.

1. Analysis phase: This comprises the pre-ordering and fill reducing ordering.

2. Factor phase: This comprises the factorization part of the algorithm. It

includes the symbolic analysis phase, partial pivoting, symmetric pruning

steps.

3. Refactor phase: This comprises the part where we do a factorization using the

already pre-computed partial pivoting permutation and the nonzero pattern

of the L and U matrices. There is no symbolic analysis, partial pivoting and

symmetric pruning in refactor phase.

4. Solve phase: This comprises the solve phase of the algorithm. Solving a single

right hand side was experimented.

When given a set of matrices with the same nonzero pattern, the analysis

and factor phases are done only once. The refactor phase is then repeated for the

remaining matrices. Solve phase is repeated as many times as there are systems

to solve. Table 3-6 contains the timing results. The analysis phase consumes most

of the time spent in the algorithm. Refactor time is typically 3 or 4 times smaller

than factor time and 8 times smaller than analysis time plus factor time put

together. The solve phase consumes the smallest fraction of the time spent.

3.6.3 Ordering Quality among KLU, UMFPACK and Gilbert-Peierls

The table 3-7 compares the ordering quality among KLU using BTF+AMD,

UMFPACK using COLAMD or AMD and Gilbert-Peierls' using AMD. We can












Table 3-5: Fill-in with four different schemes in KLU


Matrix Nonzeros in A BTF+AMD BTF+COLAMD AMD COLAMD
Sandia/adder_dcop_01 (1813) 11156 13525 13895 18848 21799
Sandia/adder_trans_01 (1814) 14579 20769 36711 24365 119519
Sandia/fpga_dcop_01 (1220) 5892 7891 8118 S .' 12016
Sandia/fpga_trans_01 (1220) 7382 10152 12776 10669 21051
Sandia/init_adder1 (1813) 11156 13525 13895 18848 21799
Sandia/mult_dcop_01 (25187) 193276 226673 228301 2176328 1460322
Sandia/oscil_dcop_01 (430) 1544 2934 3086 3078 3295
Sandia/oscil_trans_01 (430) 1614 2842 3247 2897 3259
Bomhof/circuit_1 (2624) 35823 44879 775363 44815 775720
Bomhof/circuit_2 (4510) 21199 40434 89315 36197 36196
Bomhof/circuit_3 (12127) 48137 86718 98911 245336 744245
Grund/b2_ss (1089) 3895 9994 9212 26971 9334
Grund/b_dyn (1089) 4144 11806 10597 33057 10544
Grund/bayer02 (13935) 63307 889914 245259 1365142 307979
Grund/d_dyn (87) 230 481 461 619 494
Grund/d_ss (53) 144 302 292 382 298
Grund/meg1 (2904) 58142 232042 184471 1526780 378904
Grund/meg4 (5860) 25258 42 ;', 310126 43250 328144
Grund/poli (4008) 8188 12200 12208 12238 12453
Grund/poli_large (15575) 33033 48718 48817 49806 51970
Hamm/add20 (2395) 13151 19554 3 1. ;i 19554 3 1. ;i
Hamm/add32 (4960) 19848 28754 36030 28754 36030
Hamm/bcircuit (68902) 375558 1033240 1692668 1033240 1692668
Hamm/hcircuit (105676) 513072 731634 2.2 ;-. 2 736080 4425310
Hamm/memplus (17758) 99147 137030 3282586 137030 3282586
Hamm/scircuit (170998) 958936 2481122 64102-. 2481832 6427526
Rajat/rajat03 (7602) 32653 163913 235111 172666 236938
Rajat/rajat04 (1041) 8725 12863 80518 18618 158000
Rajat/rajat05 (301) 1250 1926 2053 2101 3131
Rajat/rajat11 (135) 665 890 978 944 1129
Rajat/rajat12 (1879) 12818 15308 273667 15571 128317
Rajat/rajat13 (7598) 48762 5 '"i 90368 64791 5234287
Rajat/rajat14 (180) 1475 1994 2249 2105 2345













Table 3-6: Time in seconds, spent in different phases in KLU
Analysis Factor Refactor Solve
Matrix time time time time
Sandia/adder_dcop_01 (1813) 0.0028 0.0028 0.0007 0.0003
Sandia/adder_trans_01 (1814) 0.0045 0.0038 0.0013 0.0003
Sandia/fpga_dcop_01 (1220) 0.0018 0.0015 0.0004 0.0002
Sandia/fpga_trans_01 (1220) 0.0022 0.0017 0.0005 0.0002
Sandia/init_adder1 (1813) 0.0028 0.0028 0.0007 0.0003
Sandia/mult_dcop_01 (25187) 0.2922 0.0522 0.0196 0.0069
Sandia/oscil_dcop_01 (430) 0.0008 0.0006 0.0002 0.0001
Sandia/oscil_trans_01 (430) 0.0008 0.0006 0.0002 0.0001
Bomhof/circuit_1 (2624) 0.0098 0.0085 0.0053 0.0006
Bomhof/circuit_2 (4510) 0.0082 0.0064 0.0034 0.0006
Bomhof/circuit_3 (12127) 0.0231 0.0174 0.0056 0.0020
Grund/b2_ss (1089) 0.0031 0.0018 0.0005 0.0001
Grund/b_dyn (1089) 0.0033 0.0021 0.0007 0.0002
Grund/bayer02 (13935) 0.0584 0.2541 0.2070 0.0090
Grund/d_dyn (87) 0.0002 0.0001 0.0000 0.0000
Grund/d_ss (53) 0.0001 0.0001 0.0000 0.0000
Grund/meg1 (2904) 0.0178 0.0853 0.0590 0.0033
Grund/meg4 (5860) 0.0157 0.0094 0.0028 0.0009
Grund/poli (4008) 0.0017 0.0010 0.0004 0.0003
Grund/poli_large (15575) 0.0064 0.0045 0.0018 0.0014
Hamm/add20 (2395) 0.0056 0.0044 0.0014 0.0003
Hamm/add32 (4960) 0.0084 0.0074 0.0019 0.0006
Hamm/bcircuit (68902) 0.3120 0.2318 0.1011 0.0257
Hamm/hcircuit (105676) 0.2553 0.1920 0.0658 0.0235
Hamm/memplus (17758) 0.0576 0.0358 0.0157 0.0036
Hamm/scircuit (170998) 0.8491 0.6364 0.3311 0.0622
Rajat/rajat03 (7602) 0.0152 0.0440 0.0306 0.0034
Rajat/rajat04 (1041) 0.0029 0.0023 0.0008 0.0002
Rajat/rajat05 (301) 0.0005 0.0005 0.0001 0.0000
Rajat/rajat11 (135) 0.0002 0.0002 0.0001 0.0000
Rajat/rajat12 (1879) 0.0038 0.0027 0.0008 0.0002
Rajat/rajat13 (7598) 0.0122 0.0105 0.0033 0.0011
Rajat/rajat14 (180) 0.0004 0.0003 0.0001 0.0000









infer from the results that KLU produces better ordering than UMFPACK and

Gilbert-Peierls' algorithm. For KLU, the following MATLAB code determines the

fill.

opts = [0.1 1.2 1.2 10 1 0 0 0 ] ;

[x info] = klus(A,b,opts, []) ;

fill = info(31) + info(32) + info(8) ;

For UMFPACK, the snippet is

[L U P Q] = lu(A) ;

fill = nnz(L) + nnz(U) ;

For Gilbert-Peierls', the code is

Q = amd(A) ;

[L U P] = lu(A(Q,Q), 0.1) ;

fill = nnz(L) + nnz(U) ;

3.6.4 Performance Comparison between KLU and UMFPACK

This experiment compares the total time spent in the analysis, factor and solve

phases by the algorithms. The results are listed in table 3-8. KLU outperforms

UMFPACK in time. For KLU, the following snippet in MATLAB is used:

tic

[x info] = klus(A,b,opts)

tl = toc ;

For UMFPACK, the following code in MATLAB is used to find the total time:

tic

x = A\b ;

t2 = toc ;













Table 3-7: Fill-in among KLU, UMFPACK and Gilbert-Peierls


Matrix nnz KLU UMFPACK Gilbert-Peierls
Sandia/adder_dcop_01 (1813) 11156 13525 14658 18825
Sandia/adder_trans_01 (1814) 14579 20769 20769 24365
Sandia/fpga_dcop_01 (1220) 5892 7891 8106 I '
Sandia/fpga_trans_01 (1220) 7382 10152 10152 10669
Sandia/init_adder1 (1813) 11156 13525 14658 18825
Sandia/mult_dcop_01 (25187) 193276 226673 556746 1300902
Sandia/oscil_dcop_01 (430) 1544 2934 2852 3198
Sandia/oscil_trans_01 (430) 1614 2842 3069 2897
Bomhof/circuit_1 (2624) 35823 44879 44879 44815
Bomhof/circuit_2 (4510) 21199 40434 35107 38618
Bomhof/circuit_3 (12127) 48137 86718 84117 245323
Grund/b2_ss (1089) 3895 9994 8309 22444
Grund/b_dyn (1089) 4144 11806 9642 41092
Grund/bayer02 (13935) 63307 889914 259329 973093
Grund/d_dyn (87) 230 481 442 523
Grund/d_ss (53) 144 302 268 395
Grund/meg1 (2904) 58142 232042 151740 1212904
Grund/meg4 (5860) 25258 42 ;' 42 ;'- 43250
Grund/poli (4008) 8188 12200 12200 12239
Grund/poli_large (15575) 33033 48718 48745 49803
Hamm/add20 (2395) 13151 19554 19554 19554
Hamm/add32 (4960) 19848 28754 28754 28754
Hamm/bcircuit (68902) 375558 1033240 1033240 1033240
Hamm/hcircuit (105676) 513072 731634 730906 736080
Hamm/memplus (17758) 99147 137030 137030 137030
Hamm/scircuit (170998) 958936 2481122 2481122 2481832
Rajat/rajat03 (7602) 32653 163913 163913 172666
Rajat/rajat04 (1041) 8725 12863 12860 18613
Rajat/rajat05 (301) 1250 1926 1944 2101
Rajat/rajat11 (135) 665 890 890 944
Rajat/rajat12 (1879) 12818 15308 15308 15571
Rajat/rajat13 (7598) 48762 5'.r, 5 '". 64791
Rajat/rajat14 (180) 1475 1994 1994 2105













Table 3-8: Performance comparison between KLU and UMFPACK


Matrix KLU UMFPACK
Sandia/adder_dcop_01 (1813) 0.0116 0.0344
Sandia/adder_trans_01 (1814) 0.0112 0.0401
Sandia/fpga_dcop_01 (1220) 0.0050 0.0257
Sandia/fpga_trans_01 (1220) 0.0054 0.0203
Sandia/init_adder1 (1813) 0.0109 0.0323
Sandia/mult_dcop_01 (25187) 1.2383 1.1615
Sandia/oscil_dcop_01 (430) 0.0022 0.0070
Sandia/oscil_trans_01 (430) 0.0019 0.0074
Bomhof/circuit_1 (2624) 0.0232 0.1223
Bomhof/circuit_2 (4510) 0.0201 0.0522
Bomhof/circuit_3 (12127) 0.0579 0.1713
Grund/b2_ss (1089) 0.0066 0.0168
Grund/b_dyn (1089) 0.0072 0.0175
Grund/bayer02 (13935) 0.6089 0.3565
Grund/d_dyn (87) 0.0005 0.0014
Grund/d_ss (53) 0.0003 0.0010
Grund/meg1 (2904) 0.1326 0.1018
Grund/meg4 (5860) 0.0571 0.1111
Grund/poli (4008) 0.0050 0.0121
Grund/poli_large (15575) 0.0208 0.0497
Hamm/add20 (2395) 0.0123 0.0506
Hamm/add32 (4960) 0.0201 0.0738
Hamm/bcircuit (68902) 0.7213 1.8823
Hamm/hcircuit (105676) 0.7313 2.7764
Hamm/memplus (17758) 0.1232 0.8140
Hamm/scircuit (170998) 1.9812 7.3448
Rajat/rajat03 (7602) 0.0793 0.1883
Rajat/rajat04 (1041) 0.0068 0., 1-
Rajat/rajat05 (301) 0.0014 0.0046
Rajat/rajat11 (135) 0.0007 0.0023
Rajat/rajat12 (1879) 0.0087 0.0355
Rajat/rajat13 (7598) 0.0330 0.1229
Rajat/rajat14 (180) 0.0010 0.0032















CHAPTER 4
USER GUIDE FOR KLU

4.1 The Primary KLU Structures

4.1.1 klu_common

This is a control structure that contains both input control parameters for

KLU as well as output statistics computed by the algorithm. It is a mandatory

parameter for all the KLU routines. Its contents are listed as follows:

tol

Type: double

Input parameter for pivot tolerance for diagonal preference. Default value is

0.001

growth

Type: double

Input parameter for reallocation growth size of LU factors. Default value is

1.2

initmem_amd

Type: double

Input parameter for initial memory size with AMD.

Initial memory size = initmem_amd * nnz(L) + n. Default value is 1.2

initmem

Type: double

Input parameter for initial memory size with COLAMD.

Initial memory size = initmem * nnz(A) + n. Default value is 10

btf

Type: int









Input parameter to use BTF pre-ordering or not. Default value is 1 (to use

BTF)

* ordering

Type: int

Input parameter to specify which fill reducing ordering to use.

0 = AMD, 1 = COLAMD, 2 = user P and Q. Default is 0.

* scale

Type: int

Input parameter to specify which scaling strategy to use. 0 = none, 1 = sum,

2 = max. Default is 0

* singular_proc

Type: int

Input parameter to specify whether to stop on singularity or continue.

0 = stop, 1 = continue. Default is 0.

* status

Type: int

Output parameter that indicates the result of the KLU function call. Values

are KLU_OK(0) if OK and < 0 if error. Error values are KLU_SINGULAR

(-1), KLU_OUT_OF_MEMORY (-2), KLU_INVALID (-3)

* nrealloc

Type: int

Output parameter. Contains the number of reallocations of L and U

* singular_col

Type: int

Output parameter. Contains the column number of the singular column, if any

* noffdiag

Type: int










Output parameter. Contains the number of off-diagonal pivots chosen

4.1.2 klu_symbolic

This structure encapsulates the information related to the analysis phase. The

members of the structure are listed as follows:

symmetry

Type: double

Contains the symmetry of largest block

est_flops

Type: double

Contains the estimated factorization flop count

lnz

Type: double

Contains the estimated nonzeros in L including diagonals

unz

Type: double

Contains the estimated nonzeros in U including diagonals

Lnz

Type: double *

Array of size n, but only Lnz[0..nblocks-1] is used. Contains the estimated

number of nonzeros in each block

n

Type: int

Contains the size of input matrix A where A is n-by-n

nz

Type: int

Contains the number of entries in input matrix

*P









Type: int *

Array of size n. Contains the row permutation from ordering

*Q

Type: int *

Array of size n. Contains the column permutation from ordering

*R

Type: int *

Array of size n+1, but only R [0..nblocks] is used. Contains the start and end

column/row index for each block. Block k goes from R[k] to R[k+1] - 1

nzoff

Type: int

Contains the number of nonzeros in off-diagonal blocks

nblocks

Type: int

Contains the number of blocks

maxblock

Type: int

Contains the size of largest block

ordering

Type: int

Contains the ordering used (0 = AMD, 1 = COLAMD, 2 = GIVEN)

do_btf

Type: int

Indicates whether or not BTF preordering was requested

The members symmetry, est_flops, lnz, unz and Lnz are computed only when AMD is

used. The remaining members are computed for all orderings.










4.1.3 klu_numeric

This structure encapsulates information related to the factor phase. It contains

the LU factors of each block, pivot row permutation, and the entries in the off-

diagonal blocks among others. Its contents are listed as follows:

umin

Type: double

Contains the minimum absolute diagonal entry in U

umax

Type: double

Contains the maximum absolute diagonal entry in U

nblocks

Type: int

Contains the number of blocks

lnz

Type: int

Contains actual number of nonzeros in L excluding diagonals

unz

Type: int

Contains actual number of nonzeros in U excluding diagonals

Pnum

Type: int *

Array of size n. Contains the final pivot permutation

Pinv

Type: int *

Array of size n. Contains the inverse of the final pivot permutation

Lbip

Type: int **









Array of size nblocks. Each element is an array of size block size + 1. Each

element contains the column pointers for L factor of the corresponding block

* Ubip

Type: int **

Array of size nblocks. Each element is an array of size block size + 1. Each

element contains the column pointers for U factor of the corresponding block

* Lblen

Type: int **

Array of size nblocks. Each element is an array of size block size. Each

element contains the column lengths for L factor of the corresponding block

* Ublen

Type: int **

Array of size nblocks. Each element is an array of size block size. Each

element contains the column lengths for U factor of the corresponding block

* LUbx

Type: void **

Array of size nblocks. Each element is an array containing the row indices

and numerical values of LU factors of the corresponding block. The diagonals

of LU factors are not stored here

* Udiag

Type: void **

Array of size nblocks. Each element is an array of size block size. Each

element contains the diagonal values of U factor of the corresponding block

* Singleton

Type: void *

Array of size nblocks. Contains the singleton values

* Rs










Type: double *

Array of size n. Contains the row scaling factors

* scale

Type: int

Indicates the scaling strategy used.

(0 = none, 1 = sum, 2 = max)

* Work

Type: void *

Permanent workspace used for factorization and solve. It is of size

MAX (4n numerical values, n numerical values + 6*maxblock ints)

* worksize

Type: size_t

Contains the size (in bytes) of Work allocated above

* Xwork

Type: void *

This is an alias into Numeric->Work

* Iwork

Type: int *

An integer alias into Xwork + n

* Offp

Type: int *

Array of size n+1. Contains the column pointers for off-diagonal elements.

* Offi

Type: int *

Array of size number of off-diagonal entries. Contains the row indices of

off-diagonal elements

* Offx










Type: void *

Array of size number of off-diagonal entries. Contains the numerical values of

off-diagonal elements

4.2 KLU Routines

The user callable KLU routines in the C language are explained in this section.

The following guidelines are applicable to all routines except when explicitly stated

otherwise in the description of a routine.

1. All the arguments are required except when explicitly stated as optional. If

optional, a user can pass NULL for the corresponding argument.

2. The control input/output argument "Common" of type "klu_common *" is

a required argument for all routines.

3. All arguments other than the above mentioned control argument "Common"

are input arguments and are not modified.

4.2.1 klu_analyze
klu_symbolic *klu_analyze
(
int n,
int Ap [ ],
int Ai [ ],
klu_common *Common
) ;

This routine orders the matrix using BTF if specified and the fill reducing

ordering specified.

Returns a pointer to a klu_symbolic structure that contains the ordering

information.

All arguments are required

n: Size of the input matrix A where A is n*n.

Ap: Array of column pointers for the input matrix. Size n+1.










Ai: Array of row indices. Size number of nonzeros in A.

Common: The control input/output structure.

4.2.2 klu_analyze_given

klu_symbolic *klu_analyze_given

(
int n,
int Ap [ ],
int Ai [ ],

int Puser [ ],
int Quser [ ],
klu_common *Common

) ;

This routine orders the matrix using BTF if specified and the given Puser and

Quser as the fill reducing ordering. If Puser and Quser are NULL, then the natural

ordering is used.

Returns a pointer to a klu_symbolic structure that contains the ordering

information.

Arguments

n: Size of the input matrix A where A is n*n. Required.

Ap: Array of column pointers for the input matrix. Size n+1. Required.

Ai: Array of row indices. Size number of nonzeros in A. Required.

Puser: Optional row permutation to use.

Quser: Optional column permutation to use.

Common: The control input/output structure.

4.2.3 klu_*factor

klu_numeric *klu_factor
(

int Ap [ ],
int Ai [ ],










double Ax [ ],

klu_symbolic *Symbolic,

klu_common *Common

) ;

This routine factors a real matrix. There is a complex version of this routine

klu_z_factor that factors a complex matrix and has the same function declaration as

klu_factor. Both use the results of a call to klu_analyze.

Returns a pointer to klu_numeric if successful, NULL otherwise.

All the arguments are required.

Ap: Array of column pointers for the input matrix. Size n+1.

Ai: Array of row indices. Size number of nonzeros in A.

Ax: Array of numerical values. Size number of nonzeros in A. In the

complex case, the array should consist of real and imaginary parts of

each numerical value as adjacent pairs.

Symbolic: The structure that contains the results from a call to

klu_analyze.

Common: The control input/output structure. The status field in

Common is set to indicate if the routine was successful or not.

4.2.4 klu_*solve

void klu_solve
(
klu_symbolic *Symbolic,
klu_numeric *Numeric,
int ldim,
int nrhs,
double B [ ],
klu_common *Common
) ;










This routine solves a real system. There is a complex version of this routine

klu_z_solve, that solves a complex system and has the same function declaration as

klu_solve. Both use the results of a call to klu_analyze and klu_*factor.

Return type is void. The rhs vector B is overwritten with the solution.

All Arguments are required.

Symbolic: The structure that contains the results from a call to

klu_analyze.

Numeric: The structure that contains the results from a call to

klu_*factor.

ldim: The leading dimension of the right hand side B.

nrhs: The number of right hand sides being solved.

B: The right hand side. It is a vector of length ldim * nrhs. It can

be real or complex depending on whether a real or complex system

is being solved. If complex, the real and imaginary parts of the rhs

numerical value must be stored as adjacent pairs. It is overwritten with

the solution.

Common: The control input/output structure.
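For instance, a minimal call sequence for a real system with a single right hand

side might look as follows (a sketch; error checks omitted):

klu_common Common ;
klu_defaults (&Common) ;   /* see section 4.2.7 */
klu_symbolic *Symbolic = klu_analyze (n, Ap, Ai, &Common) ;
klu_numeric *Numeric = klu_factor (Ap, Ai, Ax, Symbolic, &Common) ;
klu_solve (Symbolic, Numeric, n, 1, B, &Common) ;   /* B is overwritten with x */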

4.2.5 klu_*tsolve

void klu_tsolve
(
klu_symbolic *Symbolic,
klu_numeric *Numeric,
int ldim,
int nrhs,
int conj_solve,
double B [ ],
klu_common *Common
) ;










This routine is similar to klu_solve except that it solves a transpose system

ATx = b. This routine solves a real system. Again, there is a complex version of

this routine klu_z_tsolve for solving complex systems that has the same function

declaration as klu_tsolve. It also offers a conjugate transpose solve for the

complex system AHx = b.

Return type is void. The rhs vector B is overwritten with the solution.

All arguments are required. The descriptions for all arguments except

conj_solve are the same as those for klu_*solve. The argument conj_solve is

relevant only for the complex case. It takes two values: 1 = CONJUGATE

TRANSPOSE SOLVE, 0 = TRANSPOSE SOLVE.

4.2.6 klu_*refactor

void klu_refactor
(
int Ap [ ],
int Ai [ ],
double Ax [ ],
klu_symbolic *Symbolic,
klu_numeric *Numeric,
klu_common *Common

) ;

This routine does a refactor of the matrix using the previously computed

ordering information in Symbolic and the nonzero pattern of the LU factors in

Numeric objects. It assumes the same partial pivoting order computed in Numeric. It

changes only the numerical values of the LU factors stored in Numeric object. It

has a complex version klu_z_refactor with the same function prototype to handle

complex cases.

Return type is void. The numerical values of LU factors in Numeric parame-

ter are updated.

All arguments are required.










Ap: Array of column pointers for the input matrix. Size n+1.

Ai: Array of row indices. Size number of nonzeros in A.

Ax: Array of numerical values. Size number of nonzeros in A. In the

complex case, the .- i.,- should consist of real and imaginary parts of

each numerical value as .-li. ,:ent pairs.

Symbolic: The structure that contains the results from a call to

kluanalyze.

Numeric: Input/output argument. The structure contains the results

from a call to klu_*factor. The numerical values of LU factors are

overwritten with the ones for the current matrix being factorized.

Common: The control input/output structure. The status field in

Common is set to indicate if the routine was successful or not.

4.2.7 klu_defaults
void klu_defaults
(
klu_common *Common
) ;

This routine sets the default values for the control input parameters of

the klu_common object. The default values are listed in the description of the

klu_common structure. A call to this routine is required unless the user sets the

control input parameters explicitly.

Return type is void.

The argument Common is required. The control input parameters in Com-

mon are set to default values.

4.2.8 klu_*rec_pivot_growth
double klu_rec_pivot_growth
(
int Ap [ ],










int Ai [ ],
double Ax [ ],

klu_symbolic *Symbolic,
klu_numeric *Numeric,
klu_common *Common

) ;

This routine computes the reciprocal pivot growth of the factorization algo-

rithm. The complex version of this routine klu_z_rec_pivot_growth handles complex

matrices and has the same function declaration.

The pivot growth estimate is returned.

All arguments are required.

Ap: Array of column pointers for the input matrix. Size n+1.

Ai: Array of row indices. Size number of nonzeros in A.

Ax: Array of numerical values. Size number of nonzeros in A. In the

complex case, the array should consist of real and imaginary parts of

each numerical value as adjacent pairs.

Symbolic: The structure that contains the results from a call to

klu_analyze.

Numeric: The structure that contains the results from a call to

klu_*factor.

Common: The control input/output structure. The status field in

Common is set to indicate if the routine was successful or not.

4.2.9 klu_*estimate_cond_number

double klu_estimate_cond_number

(
int Ap [ ],
double Ax [ ],

klu_symbolic *Symbolic,
klu_numeric *Numeric,










klu_common *Common
) ;


This routine computes the condition number estimate of the input matrix. As

before, the complex version of this routine klu_z_estimate_cond_number has the

same function declaration and handles complex matrices.

The condition number estimate is returned.

All arguments are required.

Ap: Array of column pointers for the input matrix. Size n+1.

Ax: Array of numerical values. Size number of nonzeros in A. In the

complex case, the array should consist of real and imaginary parts of

each numerical value as adjacent pairs.

Symbolic: The structure that contains the results from a call to

klu_analyze.

Numeric: The structure that contains the results from a call to

klu_*factor.

Common: The control input/output structure. The status field in

Common is set to indicate if the routine was successful or not.
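As a usage sketch, both diagnostic routines can be called after a successful

factorization; a reciprocal pivot growth near zero or a very large condition

number estimate signals a factorization of doubtful accuracy.

double growth = klu_rec_pivot_growth (Ap, Ai, Ax, Symbolic, Numeric, &Common) ;
double condest = klu_estimate_cond_number (Ap, Ax, Symbolic, Numeric, &Common) ;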

4.2.10 klu_free_symbolic
void klu_free_symbolic
(
klu_symbolic **Symbolic,
klu_common *Common
) ;

This routine deallocates or frees the contents of the klu_symbolic object. The

Symbolic parameter must be a valid object computed by a call to klu_analyze or

klu_analyze_given.

Return type is void.

All arguments are required.










Symbolic: Input/Output argument. Must be a valid object computed

by a call to klu_analyze or klu_analyze_given. If NULL, the routine just

returns.

Common: The control input/output structure.

4.2.11 klu_free_numeric

void klu_free_numeric
(
klu_numeric **Numeric,
klu_common *Common

) ;

This routine frees the klu_numeric object computed by a call to the klu_factor

or klu_z_factor routines. It resets the pointer to klu_numeric to NULL. There is a

complex version of this routine called klu_z_free_numeric with the same function

declaration to handle the complex case.

Return type is void.

All arguments are required.

Numeric: Input/Output argument. The contents of the klu_numeric

object are freed. The pointer to the klu_numeric object is set to NULL.

Common: The control input/output structure.















REFERENCES


[1] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic linear
algebra subprograms for FORTRAN usage, ACM Trans. Math. Soft., 5:
308-323, 1979.

[2] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended
set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Soft.,
14: 1-17, 1988.

[3] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm
656: An extended set of FORTRAN basic linear algebra subprograms, ACM
Trans. Math. Soft., 14: 18-32, 1988.

[4] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of level 3
basic linear algebra subprograms, ACM Trans. Math. Soft., 16: 1-17, 1990.

[5] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, Algorithm 679: A
set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16:
18-28, 1990.

[6] James W. Demmel, Stanley C. Eisenstat, John R. Gilbert, Xiaoye S. Li and
Joseph W. H. Liu, A supernodal approach to sparse partial pivoting, SIAM J.
Matrix Analysis and Applications, 20(3): 720-755, 1999.

[7] Timothy A. Davis and I. S. Duff, An unsymmetric-pattern multifrontal method
for sparse LU factorization, SIAM J. Matrix Analysis and Applic., 18(1):
140-158, 1997.

[8] Timothy A. Davis, Algorithm 832: UMFPACK, an unsymmetric-pattern
multifrontal method with a column pre-ordering strategy, ACM Trans. Math.
Software, 30(2): 196-199, 2004.

[9] John R. Gilbert and Tim Peierls, Sparse partial pivoting in time proportional
to arithmetic operations, SIAM J. Sci. Stat. Comput., 9(5): 862-873, 1988.

[10] A. George and E. Ng, An implementation of Gaussian elimination with
partial pivoting for sparse systems. SIAM J. Sci. Statist. Comput., 6(2):
390-409, 1985.

[11] Iain S. Duff, On algorithms for obtaining a maximum transversal, ACM
Transactions on Mathematical Software, 7(3): 315-330, 1981.









[12] Iain S. Duff, Algorithm 575: permutations for a zero-free diagonal, ACM
Transactions on Mathematical Software, 7(3): 387-390, 1981.

[13] Iain S. Duff and John K. Reid, Algorithm 529: permutations to block
triangular form, ACM Trans. on Mathematical Software, 4(2): 189-192, 1978.

[14] Iain S. Duff and John K. Reid, An implementation of Tarjan's algorithm
for the block triangular form of a matrix, ACM Trans. on Mathematical
Software, 4(2): 137-147, 1978.

[15] R.E. Tarjan, Depth first search and linear graph algorithms, SIAM J.
Computing, 1: 146-160, 1972.

[16] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein,
Introduction to Algorithms, Second Edition 2001, MIT Press, Cambridge.

[17] S.C. Eisenstat and J.W.H. Liu, Exploiting structural symmetry in a sparse
partial pivoting code, SIAM J.Sci. Comput., 14(1): 253-257, 1993.

[18] P. R. Amestoy, T. A. Davis, and I. S. Duff, An approximate minimum degree
ordering algorithm, SIAM J. Matrix Anal. Applic., 17(4): 886-905, 1996.

[19] P.R. Amestoy, T.A. Davis, and I.S. Duff, Algorithm 837: AMD, an ap-
proximate minimum degree ordering algorithm, ACM Transactions on
Mathematical Software, 30(3): 381-388, 2004.

[20] Timothy A. Davis, John R. Gilbert, Stefan I. Larimore, and Esmond G.
Ng, A column approximate minimum degree ordering algorithm, ACM
Transactions on Mathematical Software, 30(3): 353-376, 2004.

[21] Timothy A. Davis, John R. Gilbert, Stefan I. Larimore, and Esmond G. Ng,
Algorithm 836: COLAMD, a column approximate minimum degree ordering
algorithm, ACM Transactions on Mathematical Software, 30(3): 377-380,
2004.

[22] W. W. Hager, Condition estimates, SIAM J. Sci. Stat. Comput., 5(2): 311-316,
1984.

[23] Nicholas J. Higham, Fortran codes for estimating the one-norm of a real or
complex matrix, with applications to condition estimation, ACM Trans. on
Mathematical Software., 14(4): 381-396, 1988.















BIOGRAPHICAL SKETCH

Ekanathan was born in Tirunelveli, India, on October 2, 1977. He received

his Bachelor of Technology degree in chemical engineering from Anna University,

Chennai, India, in May 1998. He worked with Infosys Technologies at Brussels, Bel-

gium, as programmer/analyst from 1998 till 2001 and with SAP AG at Walldorf,

Germany, as software developer from 2001 till 2003. Currently he is pursuing his

Master of Science degree in computer science at University of Florida. His interests

and hobbies include travel, reading novels and magazines, music and movies.




Full Text

PAGE 1

KLU{A HIGH PERF ORMANCE SP ARSE LINEAR SOL VER F OR CIR CUIT SIMULA TION PR OBLEMS By EKANA THAN P ALAMAD AI NA T ARAJAN A THESIS PRESENTED TO THE GRADUA TE SCHOOL OF THE UNIVERSITY OF FLORID A IN P AR TIAL FULFILLMENT OF THE REQUIREMENTS F OR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORID A 2005

PAGE 2

Cop yrigh t 2005 b y Ek anathan P alamadai Natara jan

PAGE 3

I dedicate this w ork to m y mother Sa vitri who has b een a source of inspiration and supp ort to me.

PAGE 4

A CKNO WLEDGMENTS I w ould lik e to thank Dr. Timoth y Da vis, m y advisor for in tro ducing me to the area of sparse matrix algorithms and linear solv ers. I started only with m y bac kground in n umerical analysis and algorithms, a y ear and half bac k. The insigh ts and kno wledge I ha v e gained since then in the area and in implemen ting a sparse linear solv er lik e KLU w ould not ha v e b een p ossible but for Dr. Da vis' guidance and help. I thank him for giving me an opp ortunit y to w ork on KLU. I w ould lik e to thank Dr. Jose F ortes and Dr. Aruna v a Banerjee for their supp ort and help and for serving on m y committee. I w ould lik e to thank CISE administrativ e sta for helping me at dieren t times during m y master's researc h w ork. iv

PAGE 5

T ABLE OF CONTENTS page A CKNO WLEDGMENTS . . . . . . . . . . . . . . iv LIST OF T ABLES . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . viii ABSTRA CT . . . . . . . . . . . . . . . . . . ix CHAPTER1 INTR ODUCTION . . . . . . . . . . . . . . . 1 2 THEOR Y: SP ARSE LU . . . . . . . . . . . . . 5 2.1 Dense LU . . . . . . . . . . . . . . . 5 2.2 Sparse LU . . . . . . . . . . . . . . . 7 2.3 Left Lo oking Gaussian Elimination . . . . . . . . 8 2.4 Gilb ert-P eierls' Algorithm . . . . . . . . . . . 10 2.4.1 Sym b olic Analysis . . . . . . . . . . . 11 2.4.2 Numerical F actorization . . . . . . . . . . 13 2.5 Maxim um T ransv ersal . . . . . . . . . . . . 14 2.6 Blo c k T riangular F orm . . . . . . . . . . . . 16 2.7 Symmetric Pruning . . . . . . . . . . . . . 18 2.8 Ordering . . . . . . . . . . . . . . . . 19 2.9 Piv oting . . . . . . . . . . . . . . . . 22 2.10 Scaling . . . . . . . . . . . . . . . . 23 2.11 Gro wth F actor . . . . . . . . . . . . . . 25 2.12 Condition Num b er . . . . . . . . . . . . . 27 2.13 Depth First Searc h . . . . . . . . . . . . . 30 2.14 Memory F ragmen tation . . . . . . . . . . . . 31 2.15 Complex Num b er Supp ort . . . . . . . . . . . 33 2.16 P arallelism in KLU . . . . . . . . . . . . . 33 3 CIR CUIT SIMULA TION: APPLICA TION OF KLU . . . . . 35 3.1 Characteristics of Circuit Matrices . . . . . . . . . 37 3.2 Linear Systems in Circuit Sim ulation . . . . . . . . 38 3.3 P erformance Benc hmarks . . . . . . . . . . . 39 3.4 Analyses and Findings . . . . . . . . . . . . 41 3.5 Alternate Ordering Exp erimen ts . . . . . . . . . 42 v

PAGE 6

3.6 Exp erimen ts with UF Sparse Matrix Collection . . . . . 44 3.6.1 Dieren t Ordering Sc hemes in KLU . . . . . . 44 3.6.2 Timing Dieren t Phases in KLU . . . . . . . 45 3.6.3 Ordering Qualit y among KLU, UMFP A CK and Gilb ertP eierls . . . . . . . . . . . . . . 45 3.6.4 P erformance Comparison b et w een KLU and UMFP A CK . 48 4 USER GUIDE F OR KLU . . . . . . . . . . . . . 51 4.1 The Primary KLU Structures . . . . . . . . . . 51 4.1.1 klu common . . . . . . . . . . . . . 51 4.1.2 klu sym b olic . . . . . . . . . . . . . 53 4.1.3 klu n umeric . . . . . . . . . . . . . 55 4.2 KLU Routines . . . . . . . . . . . . . . 58 4.2.1 klu analyze . . . . . . . . . . . . . 58 4.2.2 klu analyze giv en . . . . . . . . . . . 59 4.2.3 klu *factor . . . . . . . . . . . . . 59 4.2.4 klu *solv e . . . . . . . . . . . . . 60 4.2.5 klu *tsolv e . . . . . . . . . . . . . 61 4.2.6 klu *refactor . . . . . . . . . . . . . 62 4.2.7 klu defaults . . . . . . . . . . . . . 63 4.2.8 klu *rec piv ot gro wth . . . . . . . . . . 63 4.2.9 klu *estimate cond n um b er . . . . . . . . . 64 4.2.10 klu free sym b olic . . . . . . . . . . . 65 4.2.11 klu free n umeric . . . . . . . . . . . . 66 REFERENCES . . . . . . . . . . . . . . . . . 67 BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . 69 vi

PAGE 7

LIST OF T ABLES T able page 3{1 Comparison b et w een KLU and Sup erLU on o v erall time and ll-in . 39 3{2 Comparison b et w een KLU and Sup erLU on factor time and solv e time 40 3{3 Ordering results using BTF+AMD in KLU on circuit matrices . . 41 3{4 Comparison of ordering results pro duced b y BTF+AMD, AMD, MMD 43 3{5 Fill-in with four dieren t sc hemes in KLU . . . . . . . . 46 3{6 Time in seconds, sp en t in dieren t phases in KLU . . . . . 47 3{7 Fill-in among KLU, UMFP A CK and Gilb ert-P eierls . . . . . 49 3{8 P erformance comparison b et w een KLU and UMFP A CK . . . . 50 vii

PAGE 8

LIST OF FIGURES Figure page 2{1 Nonzero pattern of x when solving Lx = b . . . . . . . 13 2{2 A matrix p erm uted to BTF form . . . . . . . . . . 16 2{3 A symmetric pruning scenario . . . . . . . . . . . 18 2{4 A symmetric matrix and its graph represen tation . . . . . 21 2{5 The matrix and its graph represen tation after one step of Gaussian elimination . . . . . . . . . . . . . . . 21 2{6 A doubly b ordered blo c k diagonal matrix and its corresp onding v ertex separator tree . . . . . . . . . . . . . . 34 viii

PAGE 9

Abstract of Thesis Presen ted to the Graduate Sc ho ol of the Univ ersit y of Florida in P artial F ulllmen t of the Requiremen ts for the Degree of Master of Science KLU{A HIGH PERF ORMANCE SP ARSE LINEAR SOL VER F OR CIR CUIT SIMULA TION PR OBLEMS By Ek anathan P alamadai Natara jan August 2005 Chair: Dr. Timoth y A. Da vis Ma jor Departmen t: Computer and Information Science and Engineering The thesis w ork fo cuses on KLU, a sparse high p erformance linear solv er for circuit sim ulation matrices. During the circuit sim ulation pro cess, one of the k ey steps is to solv e sparse systems of linear equations of v ery high order. KLU targets solving these systems with ecien t ordering mec hanisms and high p erformance factorization and solv e algorithms. KLU uses a h ybrid ordering strategy that comprises an unsymmetric p erm utation to ensure zero free diagonal, a symmetric p erm utation to blo c k upp er triangular form and a ll reducing ordering suc h as appro ximate minim um degree. The factorization is based on Gilb ert-P eierls' left-lo oking algorithm with partial piv oting. KLU also includes features lik e symmetric pruning to cut do wn sym b olic analysis costs. It oers to solv e up to four righ t hand sides in a single solv e step. In addition, it oers transp ose and conjugate transp ose solv e capabilities and imp ortan t diagnostic features to estimate the recipro cal piv ot gro wth of the factorization and condition n um b er of the input matrix. The algorithm is implemen ted in the C language with MA TLAB in terfaces as w ell. The MA TLAB in terfaces enable a user to in v ok e KLU routines from within ix

PAGE 10

the MA TLAB en vironmen t. The implemen tation w as tested on circuit matrices and the results determined. KLU ac hiev es sup erior ll-in qualit y with its h ybrid ordering strategy and ac hiev es a go o d p erformance sp eed-up when compared with existing sparse linear solv ers for circuit problems. The thesis highligh ts the w ork b eing done on exploiting parallelism in KLU as w ell. x

PAGE 11

CHAPTER 1 INTR ODUCTION Sparse is b eautiful. Solving systems of linear equations of the form Ax = b is a fundamen tal and imp ortan t area of high p erformance computing. The matrix A is called the co ecien t matrix and b the righ t hand side v ector. The v ector x is the solution to the equation. There are a n um b er of metho ds a v ailable for solving suc h systems. Some of the p opular ones are Gaussian elimination, QR factorization using Householder transformations or Giv ens rotations and Cholesky factorization. Gaussian elimination with partial piv oting is the most widely used algorithm for solving linear systems b ecause of its stabilit y and b etter time complexit y Cholesky can b e used only when A is symmetric p ositiv e denite. Some systems that are solv ed comprise a dense co ecien t matrix A By dense, w e mean most of the elemen ts in A are nonzero. There are high p erformance subroutines suc h as the BLAS [ 1 2 3 4 5 ] that can maximize rop coun t for suc h dense matrices. The in teresting systems are those where the co ecien t matrix A happ ens to b e sparse. By sparse, w e mean the matrix has few nonzero en tries(hereafter referred to simply as 'nonzeros'). The adjectiv e 'few' is not w elldened as w e will see in c hapter t w o. When matrices tend to b e sparse, w e need to nd out eectiv e w a ys to store the matrix in memory since w e w an t to a v oid storing zeros of the matrix. When w e store only the nonzeros in the matrix, it has consequences in the factorization algorithm as w ell. One t ypical example w ould b e w e do not kno w b efore hand ho w nonzeros w ould app ear in the L and U factors when w e factorize the matrix. While w e a v oid storing the zeros, w e also w an t to ac hiev e go o d time complexit y when solving sparse systems. If the time sp en t to 1

PAGE 12

2 solv e sparse systems remains same as for dense systems, w e ha v e not done an y b etter. KLU stands for Clark Ken t LU, since it is based on Gilb ert-P eierls' algorithm, a non-sup erno dal algorithm, whic h is the predecessor to Sup erLU, a sup erno dal algorithm. KLU is a sparse high p erformance linear solv er that emplo ys h ybrid ordering mec hanisms and elegan t factorization and solv e algorithms. It ac hiev es high qualit y ll-in rate and b eats man y existing solv ers in run time, when used for matrices arising in circuit sim ulation. There are sev eral ra v ours of Gaussian elimination. A left-lo oking Gaussian elimination algorithm factorizes the matrix left to righ t computing columns of L and U. A righ t-lo oking v ersion factorizes the matrix from top-left to b ottom-righ t computing column of L and ro w of U. Both ha v e their adv an tages and disadv an tages. KLU uses a left lo oking algorithm called Gilb ert-P eierls' algorithm. Gilb ert-P eierls' comprises a graph theoretical sym b olic analysis phase that identies the nonzero pattern of eac h column of L and U factors and a left-lo oking n umerical factorization phase with partial piv oting that calculates the n umerical v alues of the factors. KLU uses Symmetric Pruning to cut do wn sym b olic analysis cost. W e shall lo ok in detail on these features in c hapter t w o. A critical issue in linear solv ers is Ordering. Ordering means p erm uting the ro ws and columns of a matrix, so that the ll-in in the L and U factors is reduced to a minim um. A ll-in is dened as a nonzero app earing in either of the matrices L or U, while the elemen t in the corresp onding p osition in A is a zero. L ij or U ij is a ll-in if A ij is a zero. Fill-in has ob vious consequences in memory in that the factorization algorithm could create dense L and U factors that can exhaust a v ailable memory A go o d ordering algorithm yields a lo w ll-in in the factors. Finding the ordering that giv es minimal ll-in is an NP complete problem. So


KLU accommodates multiple ordering schemes, like AMD, COLAMD and any user generated permutation. There are other orderings for different purposes. For example, one could order a matrix to ensure that it has no zeros on the diagonal, without which Gaussian elimination would fail. Another ordering scheme could reduce the factorization work. KLU employs two such orderings, namely an unsymmetric ordering that ensures a zero free diagonal and a symmetric ordering that permutes the matrix into a block upper triangular form (BTF) that restricts factorization to the diagonal blocks alone.

One of the key steps in the circuit simulation process is solving sparse linear systems. These systems originate from solving large systems of nonlinear equations using Newton's method and from integrating large stiff systems of ordinary differential equations. The systems are of very high dimensions and a considerable fraction of simulation time is spent on solving them; often the solve phase tends to be a bottleneck in the simulation process. Hence high performance sparse solvers that optimize memory usage and solution time are critical components of circuit simulation software. Some of the popular solvers in use in circuit simulation tools are Sparse1.3 and SuperLU. Sparse1.3 is used in the SPICE circuit simulation package, and SuperLU uses a supernodal factorization algorithm. Experimental results indicate that KLU is 1000 times faster than Sparse1.3 and 1.5 to 3 times faster than SuperLU.

Circuit matrices show some unique properties. They have a nearly zero free diagonal. They have a roughly symmetric pattern but unsymmetric values. They are highly sparse and often have a few dense rows and columns, which arise from voltage sources and current sources in the circuit. Circuit matrices show good amenability to BTF ordering: though the nonzero pattern of the original matrix is unsymmetric, the nonzero patterns of the blocks produced by BTF ordering tend to be symmetric.


Since circuit matrices are extremely sparse, sparse matrix algorithms such as SuperLU [6] and UMFPACK [7, 8] that employ dense BLAS kernels are often inappropriate. Another unique characteristic of circuit matrices is that a good ordering strategy keeps the L and U factors sparse, while typical ordering strategies, as we will see in the experimental results, can lead to high fill-in.

In circuit simulation problems, typically the circuit matrix template is generated once and only the numerical values of the matrix change thereafter. In other words, the nonzero pattern of the matrix does not change. This implies that we need to order and factor the matrix only once, to generate the ordering permutations and the nonzero patterns of the L and U factors. For all subsequent matrices, we can reuse this information and need only recompute the numerical values of the L and U factors. This process of skipping the analysis and factor phases is called refactorization. Refactorization leads to a significant reduction in run time.

Because of the unique characteristics of circuit matrices and their amenability to BTF ordering, KLU is a method well-suited to circuit simulation problems. KLU has been implemented in the C language. It offers a set of APIs for the analysis phase, factor phase, solve phase and refactor phase. It also offers the ability to solve up to four right hand sides in a single solve step. In addition, it offers transpose solve and conjugate transpose solve features, and diagnostic tools like a pivot growth estimator and a condition number estimator. It also offers a MATLAB interface for the API, so that KLU can be used from within the MATLAB environment.


CHAPTER 2
THEORY: SPARSE LU

2.1 Dense LU

Consider the problem of solving the linear system of n equations in n unknowns:

    a_{11} x_1 + a_{12} x_2 + ... + a_{1n} x_n = b_1
    a_{21} x_1 + a_{22} x_2 + ... + a_{2n} x_n = b_2
    ...
    a_{n1} x_1 + a_{n2} x_2 + ... + a_{nn} x_n = b_n        (2-1)

or, in matrix notation,

    \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}        (2-2)

    Ax = b

where A = (a_{ij}), x = (x_1, x_2, ..., x_n)^T and b = (b_1, ..., b_n)^T.

A well-known approach to solving this equation is Gaussian elimination, which consists of a series of eliminations of unknowns x_i from the original system. Let us briefly review the elimination process. In the first step, the first equation of (2-1) is multiplied by -a_{21}/a_{11}, -a_{31}/a_{11}, ..., -a_{n1}/a_{11} and added to the second through n-th equations of (2-1), respectively. This eliminates x_1 from the second through the n-th equations.


After the first step, (2-2) becomes

    \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a^{(1)}_{22} & \cdots & a^{(1)}_{2n} \\ \vdots & & \ddots & \\ 0 & a^{(1)}_{n2} & \cdots & a^{(1)}_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b^{(1)}_2 \\ \vdots \\ b^{(1)}_n \end{bmatrix}        (2-3)

where a^{(1)}_{22} = a_{22} - (a_{21}/a_{11}) a_{12}, a^{(1)}_{32} = a_{32} - (a_{31}/a_{11}) a_{12}, and so on. In the second step, x_2 is eliminated by a similar process of computing multipliers and adding the multiplied second equation to the third through n-th equations. After n-1 elimination steps, the matrix A is transformed into an upper triangular matrix U. The upper triangular system is then solved by back-substitution.

An equivalent interpretation of this elimination process is that we have factorized A into a lower triangular matrix L and an upper triangular matrix U, where

    L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ a_{21}/a_{11} & 1 & 0 & \cdots & 0 \\ a_{31}/a_{11} & a^{(1)}_{32}/a^{(1)}_{22} & 1 & \cdots & 0 \\ \vdots & & & \ddots & \\ a_{n1}/a_{11} & a^{(1)}_{n2}/a^{(1)}_{22} & a^{(2)}_{n3}/a^{(2)}_{33} & \cdots & 1 \end{bmatrix}        (2-4)

and

    U = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ 0 & a^{(1)}_{22} & a^{(1)}_{23} & \cdots & a^{(1)}_{2n} \\ 0 & 0 & a^{(2)}_{33} & \cdots & a^{(2)}_{3n} \\ \vdots & & & \ddots & \\ 0 & 0 & 0 & \cdots & a^{(n-1)}_{nn} \end{bmatrix}        (2-5)

Column k of the lower triangular matrix L consists of the multipliers obtained during step k of the elimination process, with their sign negated.


Mathematically, Ax = b can be rewritten as

    (LU)x = b
    L(Ux) = b        (2-6)

Substituting Ux = y in (2-6), we have

    Ly = b        (2-7)
    Ux = y        (2-8)

By solving these two triangular systems, we find the solution to the original system. The reason for triangularizing the system is to avoid finding the inverse of the original coefficient matrix A. Computing the inverse is at least three times as expensive as Gaussian elimination in the dense case and often leads to more inaccuracies.

2.2 Sparse LU

A sparse matrix is defined as one that has few nonzeros in it. The quantification of the adjective 'few' is not specified; the decision as to what kind of algorithm to use (sparse or dense) depends on the fill-in properties of the matrices. However, sparse matrices typically have O(n) nonzero entries. Dense matrices are typically represented by a two dimensional array. The zeros of a sparse matrix should not be stored if we want to save memory, which makes a two dimensional array unsuitable for representing sparse matrices. Sparse matrices are instead represented with a different kind of data structure: either column compressed form or row compressed form.

A column compressed form consists of three vectors Ap, Ai and Ax. Ap consists of column pointers. It is of length n+1, and the start of column k of the input matrix is given by Ap[k].


Ai consists of the row indices of the elements. It is a zero based data structure, with row indices in the interval [0,n). Ax consists of the actual numerical values of the elements. Thus the elements of a column k of the matrix are held in Ax[Ap[k]...Ap[k+1]), and the corresponding row indices are held in Ai[Ap[k]...Ap[k+1]). Equivalently, a row compressed format stores a row pointer vector Ap, a column indices vector Ai and a value vector Ax.

For example, the matrix

    \begin{bmatrix} 5 & 0 & 0 \\ 4 & 2 & 0 \\ 3 & 1 & 8 \end{bmatrix}

when represented in column compressed format is

    Ap: 0 3 5 6
    Ai: 0 1 2 1 2 2
    Ax: 5 4 3 2 1 8

and when represented in row compressed format is

    Ap: 0 1 3 6
    Ai: 0 0 1 0 1 2
    Ax: 5 4 2 3 1 8

Let nnz represent the number of nonzeros in a matrix of dimension n × n. A dense matrix representation needs n^2 memory to represent the matrix, while the sparse representation reduces this to O(n + nnz); typically nnz << n^2.
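
As an illustration, the following C fragment (a minimal sketch; the function and variable names are illustrative and not part of KLU) walks over every nonzero of a matrix stored in column compressed form:

    #include <stdio.h>

    /* Print all nonzeros of an n-by-n matrix held in column compressed
     * form. Ap has length n+1; Ai and Ax have length Ap[n]. */
    static void print_csc(int n, const int *Ap, const int *Ai, const double *Ax)
    {
        for (int k = 0; k < n; k++)                 /* for each column k */
        {
            for (int p = Ap[k]; p < Ap[k+1]; p++)   /* entries of column k */
            {
                printf("A(%d,%d) = %g\n", Ai[p], k, Ax[p]);
            }
        }
    }

    int main(void)
    {
        /* the 3-by-3 example matrix from the text */
        int    Ap[] = {0, 3, 5, 6};
        int    Ai[] = {0, 1, 2, 1, 2, 2};
        double Ax[] = {5, 4, 3, 2, 1, 8};
        print_csc(3, Ap, Ai, Ax);
        return 0;
    }

Note that a column can be scanned in time proportional to one plus the number of its nonzeros, which is the property the factorization algorithms below depend on.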


2.3 Left Looking Gaussian Elimination

Let us derive a left looking version of Gaussian elimination. Let an input matrix A of order n × n be represented as a product of two triangular matrices L and U:

    \begin{bmatrix} A_{11} & a_{12} & A_{13} \\ a_{21} & a_{22} & a_{23} \\ A_{31} & a_{32} & A_{33} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 & 0 \\ l_{21} & 1 & 0 \\ L_{31} & l_{32} & L_{33} \end{bmatrix} \begin{bmatrix} U_{11} & u_{12} & U_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & U_{33} \end{bmatrix}        (2-9)

where A_{ij} denotes a block, a_{12}, a_{21}, a_{23}, a_{32} denote vectors and a_{22} is a scalar. The dimensions of the different elements in the matrices are as follows:

    A_{11}, L_{11}, U_{11} are k × k blocks
    a_{12}, u_{12} are k × 1 vectors
    A_{13}, U_{13} are k × (n-(k+1)) blocks
    a_{21}, l_{21} are 1 × k row vectors
    a_{22}, u_{22} are scalars
    a_{23}, u_{23} are 1 × (n-(k+1)) row vectors
    A_{31}, L_{31} are (n-(k+1)) × k blocks
    a_{32}, l_{32} are (n-(k+1)) × 1 vectors
    A_{33}, L_{33}, U_{33} are (n-(k+1)) × (n-(k+1)) blocks.

From (2-9), we can arrive at the following set of equations:

    L_{11} U_{11} = A_{11}        (2-10)
    L_{11} u_{12} = a_{12}        (2-11)
    L_{11} U_{13} = A_{13}        (2-12)
    l_{21} U_{11} = a_{21}        (2-13)
    l_{21} u_{12} + u_{22} = a_{22}        (2-14)
    l_{21} U_{13} + u_{23} = a_{23}        (2-15)
    L_{31} U_{11} = A_{31}        (2-16)
    L_{31} u_{12} + l_{32} u_{22} = a_{32}        (2-17)
    L_{31} U_{13} + l_{32} u_{23} + L_{33} U_{33} = A_{33}        (2-18)


From (2-11), (2-14) and (2-17), we can compute the middle column of L and U in the partitioning above (column k+1 of the matrix), assuming we have already computed L_{11}, l_{21} and L_{31}. We first solve the lower triangular system (2-11) for u_{12}. Then we solve for u_{22} using (2-14) by computing the sparse dot product

    u_{22} = a_{22} - l_{21} u_{12}        (2-19)

Finally, we solve (2-17) for l_{32} as

    l_{32} = (1/u_{22}) (a_{32} - L_{31} u_{12})        (2-20)

This step of computing a column of L and U can be considered equivalent to solving the lower triangular system

    \begin{bmatrix} L_{11} & 0 & 0 \\ l_{21} & 1 & 0 \\ L_{31} & 0 & I \end{bmatrix} \begin{bmatrix} u_{12} \\ u_{22} \\ l_{32} u_{22} \end{bmatrix} = \begin{bmatrix} a_{12} \\ a_{22} \\ a_{32} \end{bmatrix}        (2-21)

This mechanism of computing column k of L and U by solving a lower triangular system Lx = b is the key step in a left-looking factorization algorithm; as we will see later, Gilbert-Peierls' algorithm revolves around solving this lower triangular system. The algorithm is called left-looking since column k of L and U is computed using the already computed columns 1...k-1 of L. In other words, to compute column k of L and U, one looks only at the already computed columns 1...k-1, which lie to the left of the currently computed column k.

2.4 Gilbert-Peierls' Algorithm

Gilbert and Peierls [9] proposed an algorithm for Gaussian elimination with partial pivoting, in time proportional to the flop count of the elimination, to factor an arbitrary nonsingular sparse matrix A as PA = LU. If flops(LU) is the number of nonzero multiplications performed when multiplying the two matrices L and U, then Gaussian elimination uses exactly flops(LU) multiplications and divisions to factor A into L and U.


Given an input matrix and assuming no partial pivoting, it is possible to predict the nonzero pattern of the factors. With partial pivoting, however, the exact nonzero pattern of the factors cannot be predicted beforehand; finding an upper bound is possible, but the bound can be very loose [10]. Note that computing the nonzero pattern of L and U is a necessary part of Gaussian elimination involving sparse matrices, since we represent them not with two dimensional arrays but with sparse data structures. Gilbert-Peierls' algorithm aims at computing the nonzero pattern of the factors and the numerical values in a total time proportional to O(flops(LU)). It consists of two stages for determining every column of L and U. The first stage is a symbolic analysis stage that computes the nonzero pattern of column k of the factors. The second stage is the numerical factorization stage, which involves solving the lower triangular system Lx = b that we discussed in the section above.

2.4.1 Symbolic Analysis

A sparse Gaussian elimination algorithm with partial pivoting cannot know the exact nonzero structure of the factors ahead of all numerical computation, simply because partial pivoting at column k can introduce new nonzeros in columns k+1 ... n. Solving Lx = b must be done in time proportional to the number of flops performed. Consider a simple column-oriented algorithm in MATLAB notation for solving Lx = b:

    x = b
    for j = 1:n
        if x(j) ~= 0
            x(j+1:n) = x(j+1:n) - L(j+1:n,j) * x(j)
        end
    end


The above algorithm takes time O(n + number of flops performed). The O(n) term looks harmless, but Lx = b is solved n times in the factorization of A = LU, leading to an unacceptable O(n^2) term in the work to factorize A into L times U. To remove the O(n) term, we must replace the algorithm with

    x = b
    for each j for which x(j) ~= 0
        x(j+1:n) = x(j+1:n) - L(j+1:n,j) * x(j)
    end

This reduces the O(n) term to O(η(b)), where η(b) is the number of nonzeros in b. Note that b is a column of the input matrix A. Thus to solve Lx = b, we need to know the nonzero pattern of x before we compute x itself; symbolic analysis determines this pattern.

Let us say we are computing column k of L and U. Let G = G(L_k) be the directed graph of L with k-1 vertices representing the already computed k-1 columns. G(L_k) has an edge j → i iff l_{ij} ≠ 0. Let β = { i | b_i ≠ 0 } and X = { i | x_i ≠ 0 }. Then the elements of X are given by

    X = Reach_{G(L)}(β)        (2-22)

The nonzero pattern X is computed by determining the vertices that are reachable from the vertices of the set β. The reachability problem can be solved by a classical depth first search in G(L_k) from the vertices of the set β. If b_j ≠ 0, then x_j ≠ 0. In addition, if l_{ij} ≠ 0, then x_i ≠ 0 even if b_i = 0, because the term l_{ij} x_j contributes a nonzero to the equation when we solve for x_i. During the depth first search, Gilbert-Peierls' algorithm computes the topological order of X. This topological ordering is useful for eliminating the unknowns in the numerical factorization step.
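
The following C fragment sketches how such a reachability computation might look (a simplified illustration rather than KLU's actual implementation: it uses plain recursion, and the caller invokes it once for each unvisited row index of b):

    /* Depth first search from vertex j in the graph of L (CSC: Lp, Li).
     * Flag[j] marks visited vertices. Each vertex is stored at
     * Xi[--(*top)] when it finishes, after all of its descendants, so
     * Xi[*top .. n-1] ends up holding Reach(beta) in topological order. */
    static void dfs_reach(int j, const int *Lp, const int *Li,
                          int *Flag, int *Xi, int *top)
    {
        Flag[j] = 1;
        for (int p = Lp[j]; p < Lp[j+1]; p++)
        {
            int i = Li[p];              /* edge j -> i, since l(i,j) != 0 */
            if (!Flag[i])
            {
                dfs_reach(i, Lp, Li, Flag, Xi, top);
            }
        }
        Xi[--(*top)] = j;               /* j finishes after its descendants */
    }

The numerical solve then visits Xi[top], Xi[top+1], ..., Xi[n-1] in order, so every x_j is ready before any x_i that depends on it.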


[Figure 2-1: Nonzero pattern of x when solving Lx = b]

The row indices vector Li of columns 1 ... k-1 of L represents the adjacency list of the graph G(L_k). The depth first search takes time proportional to the number of vertices examined plus the number of edges traversed.

2.4.2 Numerical Factorization

Numerical factorization consists of solving the system (2-21) for each column k of L and U. Normally we would solve for the unknowns in (2-21) in increasing order of the row index. However, the row indices/nonzero pattern computed by the depth first search are not necessarily in increasing order, and sorting the indices would raise the time complexity above our O(flops(LU)) goal. Fortunately, the requirement of eliminating the unknowns in increasing order can be relaxed to a topological order of the row indices: an unknown x_i can be computed once all the unknowns x_j on which it depends have been computed. This is obvious when we write out the equations comprising a lower triangular solve. Theoretically, the unknowns can be solved in any topological order, and the depth first search gives one such order, which is sufficient for our case. In our example, the depth first search would have finished exploring vertex i before it finishes exploring vertex j.


Hence a topological order given by the depth first search would have j appearing before i, which is exactly what we need.

Gilbert-Peierls' algorithm starts with an identity L matrix. The entire left looking algorithm can be summarized in MATLAB notation as follows:

    L = I
    for k = 1:n
        x = L \ A(:,k)    % (partial pivoting on x can be done here)
        U(1:k,k) = x(1:k)
        L(k:n,k) = x(k:n) / U(k,k)
    end

where x = L \ b denotes the solution of a sparse lower triangular system; in this case, b is the k-th column of A. The total time complexity of Gilbert-Peierls' algorithm is O(η(A) + flops(LU)), where η(A) is the number of nonzeros in the matrix A and flops(LU) is the flop count of the product of the matrices L and U. Typically flops(LU) dominates the complexity, hence the claim of factorizing in time proportional to the flop count.

2.5 Maximum Transversal

Duff [11, 12] proposed an algorithm for determining the maximum transversal of a directed graph. The purpose of the algorithm is to find a row permutation that minimizes the zeros on the diagonal of the matrix; for nonsingular matrices, the algorithm ensures a zero free diagonal. KLU employs Duff's algorithm [11, 12] to find an unsymmetric permutation of the input matrix that yields a zero free diagonal. A matrix cannot be permuted to have a zero free diagonal if and only if it is structurally singular. A matrix is structurally singular if there is no permutation of its nonzero pattern that makes it numerically nonsingular.


A transversal is defined as a set of nonzeros, no two of which lie in the same row or column, that form the diagonal of the permuted matrix. A transversal of maximum length is the maximum transversal.

Duff's maximum transversal algorithm consists of representing the matrix as a graph with each vertex corresponding to a row of the matrix. An edge i_k → i_{k+1} exists in the graph if A(i_k, j_{k+1}) is a nonzero and A(i_{k+1}, j_{k+1}) is an element of the transversal set. A path between vertices i_0 and i_k then corresponds to a sequence of nonzeros (i_0, j_1), (i_1, j_2), ..., (i_{k-1}, j_k), where the current transversal includes (i_1, j_1), (i_2, j_2), ..., (i_k, j_k). If there is a nonzero in position (i_k, j_{k+1}) and no nonzero of row i_0 or of column j_{k+1} is currently on the transversal, the algorithm increases the transversal by one: it adds the nonzeros (i_r, j_{r+1}), r = 0, 1, ..., k to the transversal and removes the nonzeros (i_r, j_r), r = 1, 2, ..., k from it. This adding and removing of nonzeros is called a reassignment chain or augmenting path. A vertex or row is said to be assigned if a nonzero in that row is chosen for the transversal. Augmenting paths are constructed by a depth first search from an unassigned row i_0 of the matrix, continuing until a vertex i_k is reached where the path terminates because A(i_k, j_{k+1}) is a nonzero and column j_{k+1} is unassigned. The search then backtracks to i_0, adding and removing transversal elements, thus constructing an augmenting path.

Duff's maximum transversal algorithm has a worst case time complexity of O(nτ), where τ is the number of nonzeros in the matrix and n is the order of the matrix. In practice, however, the time complexity is close to O(n + τ).

The maximum transversal problem can be cast as a matching problem on bipartite graphs; this is only to make a comparison. Given an undirected graph G = (V, E), a matching is a subset of the edges M ⊆ E such that for all vertices v ∈ V, at most one edge of M is incident on v. A vertex v ∈ V is matched if some edge in M is incident on v; otherwise v is unmatched. A maximum matching is a matching of maximum cardinality, that is, a matching M such that for any matching M' we have |M| ≥ |M'|. (A maximal matching, by contrast, is merely one that cannot be extended; it can be built incrementally by picking an arbitrary edge e in the graph, deleting every edge sharing a vertex with e, and repeating until the graph is out of edges.)
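
To illustrate the augmenting path idea, the following C fragment computes a row-for-column assignment of a sparse matrix by repeated depth first searches. This is a minimal textbook sketch, not Duff's algorithm, which adds a cheap assignment phase and other refinements:

    /* Try to match column j to some row via an augmenting path.
     * match_row[i] is the column currently assigned to row i, or -1.
     * mark[i] == stamp means row i was already visited in this search. */
    static int augment(int j, const int *Ap, const int *Ai,
                       int *match_row, int *mark, int stamp)
    {
        for (int p = Ap[j]; p < Ap[j+1]; p++)
        {
            int i = Ai[p];                      /* A(i,j) is a nonzero */
            if (mark[i] == stamp) continue;
            mark[i] = stamp;
            /* row i is free, or its current column can be re-matched */
            if (match_row[i] == -1 ||
                augment(match_row[i], Ap, Ai, match_row, mark, stamp))
            {
                match_row[i] = j;               /* (i,j) joins the transversal */
                return 1;
            }
        }
        return 0;
    }

    /* Returns the transversal size; it equals n iff A is structurally
     * nonsingular. match_row and mark are n-element work arrays. */
    static int maxtrans(int n, const int *Ap, const int *Ai,
                        int *match_row, int *mark)
    {
        int count = 0;
        for (int i = 0; i < n; i++) { match_row[i] = -1; mark[i] = -1; }
        for (int j = 0; j < n; j++)
        {
            count += augment(j, Ap, Ai, match_row, mark, j);
        }
        return count;
    }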


2.6 Block Triangular Form

A block (upper) triangular matrix is similar to an upper triangular matrix, except that the diagonal consists of square blocks rather than scalars. Figure 2-2 shows a matrix permuted to BTF form.

[Figure 2-2: A matrix permuted to BTF form]

Converting the input matrix to block triangular form is important for the following reasons:

1. The part of the matrix below the block diagonal requires no factorization effort.


2. The diagonal blocks are independent of each other; only the blocks need to be factorized. For example, in figure 2-2 the subsystem A_{44} x_4 = b_4 can be solved independently for x_4, and x_4 can then be eliminated from the overall system. The system A_{33} x_3 = b_3 - A_{34} x_4 is solved next for x_3, and so on.

3. The off-diagonal nonzeros do not contribute to any fill-in.

Finding a symmetric permutation of a matrix to its BTF form is equivalent to finding the strongly connected components of a graph. A strongly connected component of a directed graph G = (V, E) is a maximal set of vertices C ⊆ V such that for every pair of vertices u and v in C, we have both u → v and v → u; that is, the vertices u and v are reachable from each other.

The algorithm employed in KLU for the symmetric permutation of a matrix to BTF form is based on Duff and Reid's [13, 14] algorithm, which provides an implementation of Tarjan's [15] algorithm for determining the strongly connected components of a directed graph. The algorithm has a time complexity of O(n + τ), where n is the order of the matrix and τ is the number of off-diagonal nonzeros in the matrix. It essentially consists of doing a depth first search from unvisited nodes in the graph, using a stack to keep track of the nodes being visited and maintaining a path of the nodes; when all edges in the path are explored, it generates strongly connected components from the top of the stack.

Duff's algorithm is different from the method proposed by Cormen, Leiserson, Rivest and Stein [16]. They suggest doing a depth first search on G, computing G^T, and then running a depth first search on G^T, visiting vertices in decreasing order of their finish times from the first depth first search (the topological order of the first depth first search).
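
The two-pass method of Cormen et al. is easy to sketch in C (a compact illustration, not the Duff and Reid code that KLU actually uses; recursion is acceptable here only for brevity, for the reasons given in section 2.13):

    /* First pass: DFS on G (CSC: Ap, Ai; edge j -> Ai[p]); record
     * vertices in order of increasing finish time. */
    static void dfs1(int j, const int *Ap, const int *Ai,
                     char *seen, int *order, int *nfin)
    {
        seen[j] = 1;
        for (int p = Ap[j]; p < Ap[j+1]; p++)
        {
            if (!seen[Ai[p]]) dfs1(Ai[p], Ap, Ai, seen, order, nfin);
        }
        order[(*nfin)++] = j;
    }

    /* Second pass: DFS on the transpose (given as the CSR vectors Rp,
     * Ri of A); every vertex reached belongs to component c. */
    static void dfs2(int j, const int *Rp, const int *Ri, int *comp, int c)
    {
        comp[j] = c;
        for (int p = Rp[j]; p < Rp[j+1]; p++)
        {
            if (comp[Ri[p]] < 0) dfs2(Ri[p], Rp, Ri, comp, c);
        }
    }

    /* comp[j] receives the strongly connected component of vertex j; the
     * return value is the number of components, i.e. the number of
     * diagonal blocks of the BTF form. */
    static int scc(int n, const int *Ap, const int *Ai,
                   const int *Rp, const int *Ri,
                   char *seen, int *order, int *comp)
    {
        int nfin = 0, c = 0;
        for (int j = 0; j < n; j++) { seen[j] = 0; comp[j] = -1; }
        for (int j = 0; j < n; j++)
        {
            if (!seen[j]) dfs1(j, Ap, Ai, seen, order, &nfin);
        }
        for (int k = n - 1; k >= 0; k--)        /* decreasing finish time */
        {
            if (comp[order[k]] < 0) dfs2(order[k], Rp, Ri, comp, c++);
        }
        return c;
    }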


[Figure 2-3: A symmetric pruning scenario]

2.7 Symmetric Pruning

Eisenstat and Liu [17] proposed a method called symmetric pruning that exploits structural symmetry to cut down the symbolic analysis time. The cost of the depth first search can be reduced by pruning unnecessary edges in the graph of L, G(L). The idea is to replace G(L) by a reduced graph of L: any graph H can be used in place of G(L), provided that a path from i to j exists in H if and only if a path from i to j exists in G(L), so that reachability is preserved. If A is symmetric, the symmetric reduction is just the elimination tree. The symmetric reduction is a subgraph of G(L); it has fewer edges than G(L) and is easy to compute by taking advantage of symmetry in the structure of the factors L and U. Even though the symmetric reduction removes edges, it still preserves the paths between the vertices of the original graph.

Figure 2-3 shows a symmetric pruning example. If l_{ij} ≠ 0 and u_{ji} ≠ 0, then we can prune the edges j → s where s > i. The reason is that for any a_{jk} ≠ 0, the entry a_{sk} will fill in from column j of L for s > k.


The just computed column i of L is used to prune earlier columns. Any future depth first search from vertex j will not visit vertex s, since s would already have been visited via i. Note that every column is pruned only once. KLU employs symmetric pruning to speed up the depth first search in the symbolic analysis stage.

2.8 Ordering

It is widely used practice to precede the factorization step of a sparse linear system with an ordering phase. The purpose of the ordering is to generate a permutation P that reduces the fill-in in the factorization of PAP^T. A fill-in is defined as a nonzero in a position (i, j) of a factor that was zero in the original matrix; in other words, we have a fill-in if L_{ij} ≠ 0 where A_{ij} = 0. The permuted matrix PAP^T created by the ordering incurs much less fill-in in the factorization phase than the unpermuted matrix A. The ordering mechanism typically takes into account only the structure of the input matrix, not the numerical values stored in it. Partial pivoting during factorization changes the row permutation P and hence can increase the fill-in over what was estimated by the ordering scheme; we shall see more about pivoting in the following sections. If the input matrix A is unsymmetric, then a permutation of the matrix A + A^T can be used.

Various minimum degree algorithms can be used for ordering. Some of the popular ordering schemes include approximate minimum degree (AMD) [18, 19] and column approximate minimum degree (COLAMD) [20, 21], among others. COLAMD orders the matrix AA^T without forming it explicitly.

After permuting an input matrix A into BTF form using the maximum transversal and BTF orderings, KLU attempts to factorize each of the diagonal blocks, applying a fill reducing ordering to each block before factorizing it. KLU supports both approximate minimum degree and column approximate minimum degree.


Besides, any given ordering algorithm can be plugged into KLU without much effort, and work is being done on integrating a nested dissection ordering strategy into KLU as well. Of the various ordering schemes, AMD gives the best results on circuit matrices.

AMD finds a permutation P to reduce the fill-in for the Cholesky factorization of PAP^T (or of P(A + A^T)P^T if A is unsymmetric). AMD assumes no numerical pivoting; it attempts to reduce an optimistic estimate of the fill-in. COLAMD is an unsymmetric ordering scheme that computes a column permutation Q to reduce the fill-in for the Cholesky factorization of (AQ)^T AQ; it attempts to reduce a pessimistic estimate (an upper bound) of the fill-in. Nested dissection is another ordering scheme; it creates a permutation such that the input matrix is transformed into a block diagonal form with vertex separators. It is a popular ordering scheme, but it is unsuitable for circuit matrices when applied to the matrix as such. It can, however, be used on the blocks generated by the BTF pre-ordering.

The idea behind a minimum degree algorithm is as follows. A structurally symmetric matrix A can be represented by an equivalent undirected graph G(V, E) with vertices corresponding to row/column indices; an edge (i, j) exists in G if A_{ij} ≠ 0. Consider figure 2-4. If the matrix is factorized with vertex 1 as the pivot, then after the first Gaussian elimination step the matrix is transformed as in figure 2-5. This first step of elimination can be considered equivalent to removing node 1 and all its edges from the graph, and adding edges to connect all the nodes adjacent to 1. In other words, the elimination has created a clique of the nodes adjacent to the eliminated node. Note that there are as many fill-ins in the reduced matrix as there are edges added in the clique formation.


[Figure 2-4: A symmetric matrix and its graph representation]

[Figure 2-5: The matrix and its graph representation after one step of Gaussian elimination]


In the example above, we have chosen the wrong node as pivot, since node 1 has the maximum degree. Had we instead chosen a node with minimum degree, say 3 or 5, as pivot, there would have been zero fill-in after the elimination, since both 3 and 5 have degree 1. This is the key idea in a minimum degree algorithm: it generates a permutation such that a node of minimum degree is eliminated in each step of Gaussian elimination, thus ensuring minimal fill-in. The algorithm does not examine the numerical values in the node selection process. It can happen during partial pivoting that a node other than the one suggested by the minimum degree algorithm must be chosen as pivot because of its numerical magnitude. That is exactly why the fill-in estimate produced by the ordering algorithm can be less than the fill-in experienced in the factorization phase.

2.9 Pivoting

Gaussian elimination fails when the diagonal element in the input matrix happens to be zero. Consider a simple 2 × 2 system

    \begin{bmatrix} 0 & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}        (2-23)

When solving this system, Gaussian elimination computes the multiplier -a_{21}/a_{11}, multiplies row 1 by it and adds the result to row 2, thus eliminating the coefficient a_{21} from the matrix. This step obviously fails, since a_{11} is zero.

Now consider a classical case where the diagonal element is nonzero but close to zero:

    A = \begin{bmatrix} 0.0001 & 1 \\ 1 & 1 \end{bmatrix}        (2-24)

The multiplier is 1/0.0001 = 10^4. The factors L and U are


    L = \begin{bmatrix} 1 & 0 \\ 10^4 & 1 \end{bmatrix}        U = \begin{bmatrix} 0.0001 & 1 \\ 0 & -10^4 \end{bmatrix}        (2-25)

The element u_{22} has the actual value 1 - 10^4; however, assuming four digit arithmetic, it is rounded off to -10^4. Note that the product of L and U,

    LU = \begin{bmatrix} 0.0001 & 1 \\ 1 & 0 \end{bmatrix}        (2-26)

is different from the original matrix. The reason for this problem is that the computed multiplier is so large that when the small element a_{22}, of value 1, is combined with it, the tiny value present in a_{22} is obscured.

We can solve these problems with pivoting. In the above two examples, we could interchange rows 1 and 2 to remove the difficulty. This mechanism of interchanging rows (and columns) and picking a large element as the diagonal, to avoid numerical failures or inaccuracies, is called pivoting. To pick a numerically large element as pivot, we can look at the elements in the current column, or at the entire submatrix (across both rows and columns). The former is called partial pivoting and the latter complete pivoting. For dense matrices, partial pivoting adds O(n^2) comparisons to Gaussian elimination, while complete pivoting adds O(n^3) comparisons. Complete pivoting is expensive and hence generally avoided, except in special cases. KLU employs partial pivoting with diagonal preference: as long as the diagonal element is at least a constant threshold times the largest element in its column, the diagonal is chosen as the pivot. This constant threshold is called the pivot tolerance.
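
A minimal sketch of this pivot choice for one column, whose candidate rows Xi[top..n-1] and numerical values x were produced by the symbolic analysis and the numerical update (the names are illustrative, not KLU's internals):

    #include <math.h>

    /* Choose a pivot among candidate rows Xi[top..n-1] for the current
     * column; x is the dense work vector holding the column's values.
     * The diagonal row is kept whenever it is "large enough". */
    static int choose_pivot(const int *Xi, int top, int n, const double *x,
                            int diag_row, double tol)
    {
        int    prow = -1;
        double pmax = 0.0;
        for (int p = top; p < n; p++)
        {
            int i = Xi[p];
            double a = fabs(x[i]);
            if (a > pmax) { pmax = a; prow = i; }
        }
        /* diagonal preference: accept the diagonal if it is at least
         * tol times the largest magnitude in the column */
        if (pmax > 0.0 && diag_row >= 0 && fabs(x[diag_row]) >= tol * pmax)
        {
            prow = diag_row;
        }
        return prow;        /* -1 signals a numerically zero column */
    }

With tol = 1 this degenerates to ordinary partial pivoting; a small tolerance keeps the fill-reducing diagonal choice whenever it is numerically acceptable.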


2.10 Scaling

Pivoting does not completely overcome the problem of small elements being obscured during the elimination process and skewing the accuracy of the results. Consider the 2 × 2 system

    \begin{bmatrix} 10 & 10^5 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 10^5 \\ 2 \end{bmatrix}        (2-27)

When we apply Gaussian elimination with partial pivoting to this system, the entry a_{11} is the largest in the first column and hence remains the pivot. After the first elimination step, assuming four digit arithmetic, we have

    \begin{bmatrix} 10 & 10^5 \\ 0 & -10^4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 10^5 \\ -10^4 \end{bmatrix}        (2-28)

This elimination yields the solution x_1 = 0, x_2 = 1, whereas the correct solution is close to x_1 = 1, x_2 = 1. If, prior to Gaussian elimination, we divide each row of the matrix by the largest element in that row (and the corresponding element of the right hand side as well), we have

    \begin{bmatrix} 10^{-4} & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}        (2-29)

Now if we apply partial pivoting, we have

    \begin{bmatrix} 1 & 1 \\ 10^{-4} & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}        (2-30)

and after an elimination step, the result is


    \begin{bmatrix} 1 & 1 \\ 0 & 1 - 10^{-4} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 - 2 \times 10^{-4} \end{bmatrix}        (2-31)

which yields the correct solution x_1 = 1, x_2 = 1 (to four digits).

The process of balancing out the numerical enormity or obscurity of each row or column is termed scaling. In the example above we scaled with respect to the maximum value in each row, which is row scaling. Another variant is to scale with respect to the sum of the absolute values of all the elements in a row. In column scaling, we scale with respect to the maximum value in a column, or the sum of the absolute values of all the elements in a column. Row scaling can be considered equivalent to finding an invertible diagonal matrix D such that all the rows of the matrix D^{-1}A have equally large numerical values. Once we have such a D, the solution of the original system Ax = b is equivalent to solving the system Ãx = b̃, where Ã = D^{-1}A and b̃ = D^{-1}b. Equilibration is another popular term for scaling.

In KLU, the diagonal elements of D are either the largest elements in the rows of the original matrix, or the sums of the absolute values of the elements in the rows. Scaling can also be turned off, if the simulation environment does not need it. Scaling, though it offers better numerical results when solving systems, is not mandatory; its usage depends on the data values that constitute the system, and if the values are already balanced, scaling might not be necessary.
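
A minimal sketch of computing such row scale factors from a matrix in column compressed form (illustrative only; choosing between the maximum and the sum variant would be a solver option):

    #include <math.h>

    /* Compute Rs[i] = max_j |A(i,j)| for an n-by-n CSC matrix. Solving
     * Ax = b then proceeds on the scaled rows A(i,:)/Rs[i] and b(i)/Rs[i].
     * Returns 0 if some row is entirely zero (the matrix is singular). */
    static int row_scale(int n, const int *Ap, const int *Ai,
                         const double *Ax, double *Rs)
    {
        for (int i = 0; i < n; i++) Rs[i] = 0.0;
        for (int j = 0; j < n; j++)
        {
            for (int p = Ap[j]; p < Ap[j+1]; p++)
            {
                double a = fabs(Ax[p]);
                if (a > Rs[Ai[p]]) Rs[Ai[p]] = a;   /* row maximum */
            }
        }
        for (int i = 0; i < n; i++)
        {
            if (Rs[i] == 0.0) return 0;             /* zero row */
        }
        return 1;
    }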


2.11 Growth Factor

The pivot growth factor is a key diagnostic estimate in determining the stability of Gaussian elimination. Stability of numerical algorithms is an important factor in determining the accuracy of the solution. The study of stability is done by a process called roundoff error analysis, which comprises two subtypes, forward error analysis and backward error analysis. If the computed solution x̃ is close to the exact solution x, then the algorithm is said to be forward stable. If the algorithm computes an exact solution to a nearby problem, then the algorithm is said to be backward stable. Backward stability is the most widely used technique in studying the stability of systems. Often the data of the systems being solved contain impurities, or are distorted by a small amount. Under such circumstances, we are interested in the algorithm producing an exact solution to this nearby problem, hence the relevance of backward stability.

The pivot growth factor is formally defined as

    \rho = \frac{\max_k \max_{i,j} |a^{(k)}_{ij}|}{\max_{i,j} |a_{ij}|}        (2-32)

where a^{(k)}_{ij} is an entry in the reduced matrix A^{(k)} after the k-th elimination step. From (2-32), we see that if the entries of the reduced matrices grow arbitrarily, we have a high growth factor, and this arbitrary growth leads to inaccuracies in the results. Consider the following 2 × 2 system:

    \begin{bmatrix} 10^{-4} & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}        (2-33)

After one step of Gaussian elimination, assuming four digit arithmetic, we have the reduced system

    \begin{bmatrix} 10^{-4} & 1 \\ 0 & 1 - 10^4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 - 10^4 \end{bmatrix}        (2-34)

Solving this system yields x_1 = 0, x_2 = 1, which is different from the actual solution x_1 = 1, x_2 = 1. The pivot growth factor of the above system is


    \rho = \frac{\max(1, 10^4)}{1} = 10^4

Thus a large pivot growth clearly indicates the inaccuracy in the result. Partial pivoting generally avoids large growth factors; in the above example, had we applied partial pivoting, we would have obtained the correct results. But this is not assured, and there are cases where partial pivoting does not result in an acceptable growth factor. This necessitates estimating the growth factor as a diagnostic tool to detect cases where Gaussian elimination could be unstable.

The pivot growth factor is usually calculated in terms of its reciprocal, to avoid numerical overflow when the value is very large. Equation (2-32) is hard to compute, since it involves calculating the maximum of the reduced matrix after every elimination step. Two definitions of the reciprocal growth factor that are easy to compute are

    \frac{1}{\rho} = \min_j \frac{\max_i |a_{ij}|}{\max_i |u_{ij}|}        (2-35)

    \frac{1}{\rho} = \frac{\max_{i,j} |a_{ij}|}{\max_{i,j} |u_{ij}|}        (2-36)

Equation (2-35) is the definition implemented in KLU, and it is column scaling invariant: it helps unmask a large pivot growth that could be totally masked by column scaling.
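
A sketch of (2-35) in C, with A and U both in column compressed form (illustrative; an actual estimator would also account for the row scaling and the permutations applied before factorization):

    #include <math.h>

    /* Reciprocal pivot growth per (2-35): the minimum over columns j of
     * (max_i |A(i,j)|) / (max_i |U(i,j)|). Returns 0 on a zero U column. */
    static double reciprocal_growth(int n,
            const int *Ap, const double *Ax,
            const int *Up, const double *Ux)
    {
        double rg = HUGE_VAL;
        for (int j = 0; j < n; j++)
        {
            double amax = 0.0, umax = 0.0;
            for (int p = Ap[j]; p < Ap[j+1]; p++)
                if (fabs(Ax[p]) > amax) amax = fabs(Ax[p]);
            for (int p = Up[j]; p < Up[j+1]; p++)
                if (fabs(Ux[p]) > umax) umax = fabs(Ux[p]);
            if (umax == 0.0) return 0.0;    /* numerically singular */
            if (amax / umax < rg) rg = amax / umax;
        }
        return rg;
    }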


2.12 Condition Number

The growth factor is a key estimate in determining the stability of the algorithm; the condition number is a key estimate in determining the amenability, or conditioning, of a given problem. It is not guaranteed that a highly stable algorithm yields accurate results for every problem it can solve: the conditioning of the problem has a dominant effect on the accuracy of the solution. Note that while stability deals with the algorithm, conditioning deals with the problem itself.

In practical applications like circuit simulation, the data of a problem come from experimental observations. Typically such data have a factor of error, impurities or noise associated with them; roundoff errors and discretization errors also contribute impurities to the data. Conditioning of a problem deals with determining how the solution of the problem changes in the presence of such impurities. The preceding discussion shows that one often solves problems not with the original data but with perturbed data. The analysis of the effect of perturbations of the problem on the solution is called perturbation analysis. It helps determine whether a given problem produces a small or a huge variation in the solution when perturbed.

Let us see what we mean by well or ill conditioned problems. A problem is said to be ill conditioned if a small relative error in the data leads to a large relative error in the solution, irrespective of the algorithm employed. A problem is said to be well conditioned if a small relative error in the data does not lead to a large relative error in the solution.

Accuracy of the computed solution is of primary importance in numerical analysis. The stability of the algorithm and the conditioning of the given problem are the two factors that directly determine accuracy. A highly stable algorithm, well armored with scaling, partial pivoting and other safeguards, cannot be guaranteed to yield an accurate solution to an ill-conditioned problem. A backward stable algorithm applied to a well-conditioned problem, on the other hand, should yield a solution close to the exact solution. This follows from the definitions of backward stability and well-conditioning: backward stability assures an exact solution to a nearby problem, and well-conditioning assures that the computed solution to the perturbed data is relatively close to the exact solution of the actual problem.


Mathematically, let X be some problem and let X(d) be the solution of the problem for some input d. Let δd denote a small perturbation of the input d. If the relative error in the solution,

    |X(d + δd) - X(d)| / |X(d)|,

exceeds the relative error |δd| / |d| in the input, then the problem is ill conditioned, and it is well conditioned otherwise.

The condition number is a measure of the conditioning of a problem: it shows whether the problem is well or ill conditioned. For linear system problems of the form Ax = b, the condition number is defined as

    Cond(A) = ||A|| ||A^{-1}||        (2-37)

Equation (2-37) is arrived at by theory that deals with perturbations of the input matrix A, of the right hand side b, or of both. It can be defined with respect to any norm, viz. the 1, 2 or ∞ norm. The system Ax = b is said to be ill-conditioned if the condition number from (2-37) is quite large, and well-conditioned otherwise.

A naive way to compute the condition number would be to compute the inverse of the matrix, compute the norms of the matrix and its inverse, and form the product. However, computing the inverse is at least three times as expensive as solving the linear system Ax = b and hence should be avoided. Hager [22] developed a method for estimating the 1-norm of A^{-1} and the corresponding 1-norm condition number, based on an optimization approach for estimating ||A^{-1}||_1. The 1-norm of a matrix is formally defined as


    ||A||_1 = \max_{x \neq 0} \frac{||Ax||_1}{||x||_1}        (2-38)

Hager's algorithm can be briefly described as follows. For A ∈ R^{n×n}, a convex function

    F(x) = ||Ax||_1        (2-39)

is defined over the convex set S = { x ∈ R^n : ||x||_1 ≤ 1 }. Then ||A||_1 is the global maximum of (2-39). The algorithm involves computing Ax and A^T x, that is, matrix-vector products. When we want to compute ||A^{-1}||_1, it involves computing A^{-1}x and (A^{-1})^T x, which is equivalent to solving Ax = b and A^T x = b; we can use KLU to solve these systems efficiently.

Higham [23] presents refinements to Hager's algorithm and restricts the number of iterations to five. Higham further presents a simple device and uses the higher of the estimates from this device and from Hager's algorithm, to ensure that the estimate is large enough. The device involves solving the linear system Ax = b where

    b_i = (-1)^{i+1} (1 + (i-1)/(n-1)),   i = 1, 2, ..., n

The final estimate is chosen as the maximum of the estimate from Hager's algorithm and 2||x||_1 / (3n). KLU's condition number estimator is based on Higham's refinement of Hager's algorithm.
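
In outline, the estimator might look as follows (a simplified sketch without Higham's extra device and with a cruder stopping test; solve() and solve_t() stand for solves with the already computed factors, i.e. they apply A^{-1} and A^{-T} to a vector in place):

    #include <math.h>
    #include <stdlib.h>

    /* Estimate ||inv(A)||_1 by Hager's method. solve(x) overwrites x
     * with inv(A)x; solve_t(x) overwrites x with inv(A^T)x. */
    static double hager_estimate(int n, void (*solve)(double *),
                                 void (*solve_t)(double *))
    {
        double *x = malloc(n * sizeof(double));
        double est = 0.0;
        if (!x) return 0.0;
        for (int i = 0; i < n; i++) x[i] = 1.0 / n;  /* start inside S */
        for (int iter = 0; iter < 5; iter++)         /* Higham caps at 5 */
        {
            solve(x);                                /* x = inv(A) x */
            double newest = 0.0;
            for (int i = 0; i < n; i++) newest += fabs(x[i]);
            if (iter > 0 && newest <= est) break;    /* no improvement */
            est = newest;
            for (int i = 0; i < n; i++) x[i] = (x[i] >= 0) ? 1.0 : -1.0;
            solve_t(x);                              /* x = inv(A^T) sign */
            int jmax = 0;
            for (int i = 1; i < n; i++)
                if (fabs(x[i]) > fabs(x[jmax])) jmax = i;
            for (int i = 0; i < n; i++) x[i] = 0.0;  /* next start: e_jmax */
            x[jmax] = 1.0;
        }
        free(x);
        return est;    /* Cond_1(A) is then est * ||A||_1 */
    }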


2.13 Depth First Search

As we discussed earlier, the nonzero pattern of the k-th column of L is determined by the reachability of the row indices of the k-th column of A in the graph of L. The reachability is determined by a depth-first search traversal of the graph of L, which also yields the topological order for eliminating the unknowns when solving the lower triangular system Lx = b.

A classical depth first search algorithm is recursive. One of the major problems with a recursive implementation of depth-first search is stack overflow. Each process is allocated a limited stack space upon execution; when the number of recursive calls is high, the stack space is exhausted and the process terminates abruptly. This is a definite possibility in the context of our depth-first search algorithm when we have a dense column in a matrix of very high dimension. The solution to stack overflow caused by recursion is to replace the recursion by iteration. With an iterative, non-recursive function, the entire depth first search happens in a single function stack frame.

The iterative version uses an array of row indices called pstack. When descending to an adjacent node during the search, the position of the next adjacent node to be considered is stored in pstack at the position corresponding to the current node. When the search returns to the current node, we know that we next need to descend into the node recorded in pstack at the position corresponding to the current node. Using this extra O(n) memory, the iterative version completes the depth first search in a single function stack frame. This is an important improvement over the recursive version, since it avoids the stack overflow problem that would have been a bottleneck when solving systems of high dimension.
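
A sketch of the recursion-free search (illustrative; here pstack is indexed by depth on the stack, while an implementation could equally index it by node): the stack holds the current path from the starting vertex, and pstack records how far into each node's adjacency list the search has progressed:

    /* Iterative DFS in the graph of L (CSC: Lp, Li) from vertex j.
     * stack and pstack are n-element work arrays; Flag marks visited
     * vertices; finished vertices are stored at Xi[--(*top)] exactly as
     * in the recursive version, yielding a topological order. */
    static void dfs_iterative(int j, const int *Lp, const int *Li,
                              int *Flag, int *Xi, int *top,
                              int *stack, int *pstack)
    {
        int head = 0;
        stack[0] = j;
        while (head >= 0)
        {
            j = stack[head];
            if (!Flag[j])
            {
                Flag[j] = 1;
                pstack[head] = Lp[j];     /* start of j's adjacency list */
            }
            int done = 1;
            for (int p = pstack[head]; p < Lp[j+1]; p++)
            {
                int i = Li[p];
                if (!Flag[i])
                {
                    pstack[head] = p + 1; /* resume here on return */
                    stack[++head] = i;    /* descend into i */
                    done = 0;
                    break;
                }
            }
            if (done)
            {
                head--;                   /* all descendants finished */
                Xi[--(*top)] = j;
            }
        }
    }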


2.14 Memory Fragmentation

The data structures for L and U are the ones used to represent sparse matrices, comprising three vectors:

1. a vector of column pointers,
2. a vector of row indices,
3. a vector of numerical values.

Overall, six vectors are needed for the two matrices L and U. Of these, the two vectors of column pointers are of known size, namely the size of a block. The remaining four vectors of row indices and numerical values depend on the fill-in estimated by AMD. However, AMD gives an optimistic estimate of the fill-in, hence we may need to dynamically grow the memory for these vectors during the factorization phase, if we determine that the fill-in is higher than estimated. The partial pivoting strategy can alter the row ordering determined by AMD and hence is another source of fill-in beyond the AMD estimate.

Dynamically growing these four vectors suffers from the problem of external memory fragmentation, in which free memory is scattered across the memory space: a request for more memory fails because no contiguous free region is available, even though the scattered free areas, were they contiguous, would satisfy it. In the context of our problem, a memory request to grow the four vectors can either fail, if we run into external fragmentation, or succeed when there is enough free space available.

When we reallocate or grow memory, there are two types of success cases. In the first case, called cheap reallocation, there is enough free memory abutting the vector; the memory occupied by the vector is simply extended, its end boundary moved while the start boundary remains the same. In the second case, called costly reallocation, there is not enough free memory abutting the vector; fresh memory is allocated in another region for the new size, the contents are copied from the old location, and the old location is freed.

With four vectors to grow, there is a failure case because of external fragmentation and an expensive success case because of costly reallocation. To reduce the failure case and avoid the costly success case, we have coalesced the four vectors into a single vector.
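
A sketch of the idea (an illustrative layout, not KLU's exact format): each column's unit packs the row indices and the numerical values contiguously inside the one coalesced vector, with the integer indices packed into double-sized slots:

    #include <stdlib.h>

    /* Number of ints that fit in one double-sized slot on this platform. */
    #define INTS_PER_DOUBLE ((int)(sizeof(double) / sizeof(int)))

    /* Given the coalesced vector LU and the offset of one column's unit,
     * recover pointers to that column's len row indices and values. */
    static void column_views(double *LU, int offset, int len,
                             int **index, double **value)
    {
        *index = (int *) (LU + offset);   /* indices live in double slots */
        /* the values start after ceil(len / INTS_PER_DOUBLE) slots */
        int islots = (len + INTS_PER_DOUBLE - 1) / INTS_PER_DOUBLE;
        *value = LU + offset + islots;
    }

Growing the factor storage then means reallocating the single vector LU: a cheap reallocation extends it in place, and a costly one copies it once rather than four times.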


This new data structure is byte aligned on a double boundary. For every column of L and U, the vector contains the row indices and numerical values of L, followed by the row indices and numerical values of U. Multiple integer row indices are stored in a single double location; the actual number of integers that can be stored in a double location varies with the platform and is determined dynamically. The common technique of using an integer pointer to a location aligned on a double boundary is employed to retrieve or save the row indices. In addition to this coalesced data structure containing the row indices and numerical values, two more length vectors of size n are needed to hold the length of each column of L and U. These length vectors are preallocated once and need not be grown dynamically. Some memory management schemes never do cheap reallocation; in such schemes, the new data structure serves only to reduce external fragmentation.

2.15 Complex Number Support

KLU supports complex matrices and complex right hand sides. KLU also supports solving the transpose system A^T x = b for real matrices and the conjugate transpose system A^H x = b for complex matrices. Initially it relied on the C99 language support for complex numbers. However, the C99 specification is not supported across operating systems; for example, earlier versions of Sun Solaris do not support C99. To avoid these compatibility issues, KLU no longer relies on C99 and has its own complex arithmetic implementation.


2.16 Parallelism in KLU

When solving a system Ax = b with KLU, we use the BTF pre-ordering to convert A into block upper triangular form, apply AMD to each block, and factorize the blocks one after another, serially. Alternatively, nested dissection can be applied to each block. Nested dissection ordering converts a block to a doubly bordered block diagonal form, which is similar to a block upper triangular form but has nonzeros in the subdiagonal region. These nonzeros form a horizontal strip resembling a border; similarly, the nonzeros in the region above the diagonal form a corresponding vertical strip.

[Figure 2-6: A doubly bordered block diagonal matrix and its corresponding vertex separator tree]

The doubly bordered block diagonal form can be thought of as a separator tree. Factorization of the block then involves a post-order traversal of the separator tree, and the nodes of the separator tree can be factorized in parallel. The factorization of a node additionally involves computing the Schur complement contributions to its parent and to its ancestors in the tree. Once all the children of a node have updated its Schur complement, the node is ready to be factorized, and it in turn computes the Schur complement contributions to its parent and its ancestors. The factorization and the computation of Schur complements are done in post-order traversal fashion, and the process stops at the root.

Parallelism can help reduce the factorization time, and it gains importance in the context of multiprocessor systems. Work is being done to enable parallelism in KLU.


CHAPTER 3
CIRCUIT SIMULATION: APPLICATION OF KLU

The KLU algorithm comprises the following steps:

1. Unsymmetric permutation to block upper triangular form. This consists of two steps:
   (a) an unsymmetric permutation to ensure a zero free diagonal, using maximum transversal;
   (b) a symmetric permutation to block upper triangular form, by finding the strongly connected components of the graph.

2. Symmetric permutation of each block (say A) using AMD on A + A^T, or an unsymmetric permutation of each block using COLAMD on AA^T. These permutations are fill-in reducing orderings of each block.

3. Factorization of each scaled block using Gilbert-Peierls' left looking algorithm with partial pivoting.

4. Solution of the system using block back-substitution, accounting for the off-diagonal entries. The solution is re-permuted to bring it back to the original order.

Let us first derive the final system to be solved, taking into account the different permutations, the scaling and the pivoting. The original system to solve is

    Ax = b        (3-1)

Let R be the diagonal matrix of scale factors for each row. Applying scaling, we have

    RAx = Rb        (3-2)


Let P' and Q' be the row and column permutation matrices that combine the permutations for maximum transversal and for block upper triangular form. Applying these permutations together, we have

    P' R A Q' Q'^T x = P' R b        [Q' Q'^T = I, the identity matrix]        (3-3)

Let P and Q be the row and column permutation matrices that combine P' and Q' above with the symmetric permutation produced by AMD and with the partial pivoting row permutation produced by the factorization. Now

    P R A Q Q^T x = P R b,   or   (P R A Q) Q^T x = P R b        (3-4)

The matrix PRAQ consists of two parts, viz. the diagonal blocks, which are factorized, and the off-diagonal elements, which are not. Writing PRAQ = LU + F, where LU collectively represents the factors of all the blocks and F represents the entire off-diagonal region, equation (3-4) becomes

    (LU + F) Q^T x = P R b        (3-5)

    x = Q (LU + F)^{-1} (P R b)        (3-6)

Equation (3-6) consists of two steps: a block back-substitution, i.e. computing (LU + F)^{-1} (P R b), followed by application of the column permutation Q. The block back-substitution in (LU + F)^{-1} (P R b) looks cryptic and is better explained as follows. Consider a simple 3 × 3 block system


    \begin{bmatrix} L_{11}U_{11} & F_{12} & F_{13} \\ 0 & L_{22}U_{22} & F_{23} \\ 0 & 0 & L_{33}U_{33} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} B_1 \\ B_2 \\ B_3 \end{bmatrix}        (3-7)

The equations corresponding to this system are

    L_{11}U_{11} X_1 + F_{12} X_2 + F_{13} X_3 = B_1        (3-8)
    L_{22}U_{22} X_2 + F_{23} X_3 = B_2        (3-9)
    L_{33}U_{33} X_3 = B_3        (3-10)

In block back-substitution, we first solve (3-10) for X_3 and then eliminate X_3 from (3-9) and (3-8) using the off-diagonal entries. Next, we solve (3-9) for X_2 and eliminate X_2 from (3-8). Finally, we solve (3-8) for X_1.
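
A sketch of this solve phase in C (schematic: the block boundaries R[], the per-block triangular solves lsolve/usolve and the off-diagonal part F in column compressed form are all assumptions of the illustration):

    /* Per-block triangular solves with the computed factors of block k,
     * applied in place to the subvector passed in; assumed provided. */
    extern void lsolve(int k, double *x);
    extern void usolve(int k, double *x);

    /* Block back-substitution for (LU + F) y = b; y holds b on entry
     * and the solution on exit. Block k spans rows/columns
     * R[k]..R[k+1]-1; Fp, Fi, Fx hold the off-diagonal entries F in
     * column compressed form. */
    static void block_backsolve(int nblocks, const int *R,
                                const int *Fp, const int *Fi,
                                const double *Fx, double *y)
    {
        for (int k = nblocks - 1; k >= 0; k--)
        {
            lsolve(k, y + R[k]);              /* L_kk z   = y_k */
            usolve(k, y + R[k]);              /* U_kk x_k = z   */
            /* eliminate x_k from all earlier block equations */
            for (int j = R[k]; j < R[k+1]; j++)
            {
                for (int p = Fp[j]; p < Fp[j+1]; p++)
                {
                    y[Fi[p]] -= Fx[p] * y[j]; /* Fi[p] < R[k] in BTF */
                }
            }
        }
    }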


3.1 Characteristics of Circuit Matrices

Circuit matrices exhibit certain unique characteristics that make KLU particularly relevant to them. They are very sparse, and because of their high sparsity, BLAS kernels are not applicable. Circuit matrices often have a few dense rows/columns that originate from voltage or current sources; these dense rows/columns are effectively removed by the BTF permutation. Circuit matrices are unsymmetric, but their nonzero pattern is roughly symmetric. They are easily permutable to block upper triangular form, and they have zero-free or nearly zero-free diagonals. Another peculiar feature of circuit matrices is that the nonzero pattern of each block, after permutation to block upper triangular form, is more symmetric than that of the original matrix. Typical ordering strategies applied to the original matrix cause high fill-in, whereas applied to the blocks they lead to low fill-in.

The efficiency of the permutation to block upper triangular form shows in the fact that the entire sub-diagonal region of the matrix requires zero work, and the off-diagonal elements cause no fill-in, since they are not factorized.

3.2 Linear Systems in Circuit Simulation

The linear systems in the circuit simulation process originate from solving large systems of nonlinear equations using Newton's method and from integrating large stiff systems of ordinary differential equations. These linear systems consist of the coefficient matrix A, the unknowns vector x and the right hand side b. During the course of the simulation, the matrix A retains the same nonzero pattern (it is structurally unchanged) and only undergoes changes in its numerical values. Thus the initial analysis phase (permutation to ensure a zero-free diagonal, block triangular form, and minimum degree ordering of the blocks) and the factorization phase (which involves symbolic analysis, partial pivoting and symmetric pruning) can be restricted to the initial system alone. Subsequent systems A'x = b, where only the numerical values of A' differ from A, can be solved using a mechanism called refactorization.

Refactorization simply means using the same row and column permutations (comprising the entire analysis phase and the partial pivoting) computed for the initial system to solve the subsequent systems that have changes only in their numerical values. Refactorization substantially reduces run time, since the analysis time and the factorization time spent on symbolic analysis, partial pivoting and pruning are avoided. The nonzero patterns of the factors L and U are the same as for the initial system; only numerical factorization using the precomputed nonzero pattern and partial pivoting order is required. The solve step follows the factorization/refactorization step. KLU accommodates solving multiple right hand sides in a single solve step; up to four right hand sides can be solved in a single step.
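
In terms of KLU's C interface, a simulator's inner loop might look roughly as follows (a sketch following the released KLU API; the exact signatures in the version described here may differ):

    #include "klu.h"

    /* One analysis and one full factorization, then cheap refactorizations
     * as only the numerical values Ax change between solves. */
    void simulate(int n, int *Ap, int *Ai, double *Ax, double *b, int nsteps)
    {
        klu_common   Common;
        klu_symbolic *Symbolic;
        klu_numeric  *Numeric;

        klu_defaults(&Common);                      /* default parameters */
        Symbolic = klu_analyze(n, Ap, Ai, &Common); /* BTF + AMD ordering */
        Numeric  = klu_factor(Ap, Ai, Ax, Symbolic, &Common);

        for (int step = 0; step < nsteps; step++)
        {
            /* ... the simulator updates Ax and b in place here ... */
            klu_refactor(Ap, Ai, Ax, Symbolic, Numeric, &Common);
            klu_solve(Symbolic, Numeric, n, 1, b, &Common);  /* b := x */
        }
        klu_free_numeric(&Numeric, &Common);
        klu_free_symbolic(&Symbolic, &Common);
    }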


3.3 Performance Benchmarks

During my internship at a circuit simulation company, I carried out performance benchmarking of KLU versus SuperLU in the simulation environment. The benchmarks were run on a representative set of examples; the results are tabulated below (the size of the matrix created in simulation is shown in parentheses).

Table 3-1: Comparison between KLU and SuperLU on overall time and fill-in

                          Overall time                  Nonzeros (L+U)
    Netlist               KLU       SuperLU   Speedup   KLU       SuperLU
    Problem1  (301)          1.67      1.24     0.74       1808      1968
    Problem2  (1879)       734.96    688.52     0.94      13594     13770
    Problem3  (2076)        56.46     53.38     0.95      16403     16551
    Problem4  (7598)        89.63     81.85     0.91      52056     54997
    Problem5  (745)         18.13     16.84     0.93       4156      4231
    Problem6  (1041)      1336.50   1317.30     0.99      13198     13995
    Problem7  (33)           0.40      0.32     0.80        157       176
    Problem8  (10)           4.46      1.57     0.35         40        41
    Problem9  (180)        222.26    202.29     0.91       1845      1922
    Problem10 (6833)      6222.20   6410.40     1.03      56322     58651
    Problem11 (1960)       181.78    179.50     0.99      13527     13963
    Problem12 (200004)       6.25      8.47     1.35     500011    600011
    Problem13 (20004)        0.47      0.57     1.22      50011     60011
    Problem14 (40004)        0.97      1.31     1.35     100011    120011
    Problem15 (100000)       1.76      2.08     1.18     299998    499932
    Problem16 (7602)       217.80    255.88     1.17     156311    184362
    Problem17 (10922)      671.10    770.58     1.15     267237    299937
    Problem18 (14842)     1017.00   1238.10     1.22     326811    425661
    Problem19 (19362)     1099.00   1284.40     1.17     550409    581277
    Problem20 (24482)     3029.00   3116.90     1.03     684139    788047
    Problem21 (30202)     2904.00   3507.40     1.21     933131   1049463

The circuits Problem16-Problem21 are TFT LCD arrays similar to memory circuits. These examples were run at least twice with each algorithm (KLU or SuperLU) to obtain consistent results. The results are tabulated in tables 3-1, 3-2 and 3-3. The "overall time" in table 3-1 comprises the analysis, factorization and solve time.
Table 3-2: Comparison between KLU and SuperLU on factor time and solve time

                        Factor time per iteration         Solve time per iteration
Netlist                 KLU        SuperLU    Speedup     KLU        SuperLU    Speedup
Problem1 (301)          0.000067   0.000084   1.26        0.000020   0.000019   0.92
Problem2 (1879)         0.000409   0.000377   0.92        0.000162   0.000100   0.61
Problem3 (2076)         0.000352   0.000317   0.90        0.000122   0.000083   0.68
Problem4 (7598)         0.001336   0.001318   0.99        0.000677   0.000326   0.48
Problem5 (745)          0.000083   0.000063   0.76        0.000035   0.000022   0.62
Problem6 (1041)         0.000321   0.000406   1.26        0.000079   0.000075   0.95
Problem7 (33)           0.000004   0.000004   0.96        0.000003   0.000002   0.73
Problem8 (10)           0.000001   0.000001   0.89        0.000001   0.000001   0.80
Problem9 (180)          0.000036   0.000042   1.16        0.000014   0.000011   0.76
Problem10 (6833)        0.001556   0.001530   0.98        0.000674   0.000365   0.54
Problem11 (1960)        0.000663   0.000753   1.14        0.000136   0.000122   0.90
Problem12 (200004)      0.103900   0.345500   3.33        0.030640   0.041220   1.35
Problem13 (20004)       0.005672   0.020110   3.55        0.001633   0.002735   1.67
Problem14 (40004)       0.014430   0.056080   3.89        0.004806   0.006864   1.43
Problem15 (100000)      0.168700   0.283700   1.68        0.018600   0.033610   1.81
Problem16 (7602)        0.009996   0.017620   1.76        0.001654   0.001439   0.87
Problem17 (10922)       0.018380   0.030010   1.63        0.002542   0.001783   0.70
Problem18 (14842)       0.024020   0.046130   1.92        0.003187   0.002492   0.78
Problem19 (19362)       0.054730   0.080280   1.47        0.005321   0.003620   0.68
Problem20 (24482)       0.121400   0.122600   1.01        0.006009   0.004705   0.78
Problem21 (30202)       0.124000   0.188700   1.52        0.009268   0.006778   0.73
3.4 Analyses and Findings

The following are my inferences based on the results. Most of the matrices in these experiments are small, of the order of a few thousand unknowns. Fill-in is much better with KLU: the BTF pre-ordering combined with the AMD ordering on each of the blocks does a good job of reducing the fill-in count. The improvement in fill-in averages around 6% for many examples.

Table 3-3: Ordering results using BTF+AMD in KLU on circuit matrices

Netlist       Size      Nonzeros   Nonzeros   No. of    Max        Off-diagonal
                        in A       in L+U     blocks    block      nonzeros
Problem1      301       1484       1808       7         295        89
Problem2      1879      12926      13594      19        1861       4307
Problem3      2076      15821      16403      13        2064       6689
Problem4      7598      48922      52056      13        7586       19018
Problem5      745       3966       4156       128       426        1719
Problem6      1041      9654       13198      67        975        2608
Problem7      33        153        157        7         27         50
Problem8      10        39         40         5         6          16
Problem9      180       1503       1845       19        162        661
Problem10     6833      43250      56322      507       6282       12594
Problem11     1960      11187      13527      58        1715       2959
Problem12     200004    500011     500011     200003    2          300005
Problem13     20004     50011      50011      20003     2          30005
Problem14     40004     100011     100011     40003     2          60005
Problem15     100000    299998     299998     1         100000     0
Problem16     7602      32653      156311     103       7500       251
Problem17     10922     46983      267237     123       10800      301
Problem18     14842     63913      326811     143       14700      351
Problem19     19362     83443      550409     163       19200      401
Problem20     24482     105573     684139     183       24300      451
Problem21     30202     130303     933131     203       30000      501

There is no fill-in in the Problem12, Problem13, Problem14 and Problem15 netlists with KLU. This is quite significant, in that there is no memory overhead for these examples. In the case of the circuits Problem16-Problem21, the gain in fill-in with KLU ranges from 6% in the Problem19 example to 24% in the Problem18 example.

The gain in fill-in translates into faster factorization, because fewer nonzeros imply less work. The factorization time is thus expected to be low, and this turns out to be true in most of the cases (a factorization speedup of about 1.6x in the Problem16-Problem21 examples and 3x in the Problem12-Problem14 examples). For some cases, like Problem2 and Problem10, the factorization time remains the same for both KLU and SuperLU.

The solve phase turns out to be slower in KLU; the handling of the off-diagonal nonzeros probably accounts for the extra time spent there. One way of reducing the solve overhead in KLU would be to solve multiple right hand sides at the same time: in a single solve iteration, four equations are solved.

On the whole, the overall time speedup is about 1.2 for the Problem16-Problem21 and Problem12-Problem14 examples. For the others, the overall time is almost the same between the two algorithms. BTF is not able to find many blocks in most of the matrices; there is typically a single large block and the remaining blocks are singletons. But the AMD ordering does a good job of reducing the fill-in count, and the off-diagonal nonzero count is not high.

3.5 Alternate Ordering Experiments

Different ordering strategies were employed to analyze the fill-in behaviour. The statistics using the different ordering schemes are listed in Table 3-4. COLAMD is not listed in the table: it typically gives poor orderings and causes more fill-in than AMD, MMD and BTF+AMD. AMD alone gives relatively higher fill-in than BTF+AMD for most of the matrices. AMD alone gives mixed results in comparison with MMD: it matches or outperforms MMD in fill-in on the Problem12-Problem14 and Problem16-Problem21 matrices, but gives poorer fill-in than MMD for the rest of the circuits.
Table 3-4: Comparison of ordering results produced by BTF+AMD, AMD, MMD

                        Nonzeros   Fill-in    Fill-in    Fill-in
Netlist                 in A       BTF+AMD    AMD        MMD
Problem1 (301)          1484       1808       1928       1968
Problem2 (1879)         12926      13594      13857      13770
Problem3 (2076)         15821      16403      18041      16551
Problem4 (7598)         48922      52056      57975      54997
Problem5 (745)          3966       4156       5562       4231
Problem6 (1041)         9654       13198      14020      13995
Problem7 (33)           153        157        178        176
Problem8 (10)           39         40         41         41
Problem9 (180)          1503       1845       1968       1922
Problem10 (6833)        43250      56322      133739     58651
Problem11 (1960)        11187      13527      14800      13963
Problem12 (200004)      500011     500011     600011     600011
Problem13 (20004)       50011      50011      60011      60011
Problem14 (40004)       100011     100011     120011     120011
Problem15 (100000)      299998     299998     299998     499932
Problem16 (7602)        32653      156311     165264     184362
Problem17 (10922)       46983      267237     255228     299937
Problem18 (14842)       63913      326811     387668     425661
Problem19 (19362)       83443      550409     451397     581277
Problem20 (24482)       105573     684139     718891     788047
Problem21 (30202)       130303     933131     839226     1049463
AMD alone beats BTF+AMD in fill-in for some of the examples, viz. Problem17, Problem19 and Problem21. Overall, though, BTF+AMD is the best ordering strategy to use.

3.6 Experiments with the UF Sparse Matrix Collection

There are a number of circuit matrices in the UF sparse matrix collection. Different experiments were done with these matrices as well, on different parameters:
1. ordering quality with different ordering schemes in KLU,
2. timing of the different phases of KLU,
3. ordering quality among KLU, UMFPACK and Gilbert-Peierls' algorithm,
4. performance comparison between KLU and UMFPACK.

UMFPACK is an unsymmetric multifrontal sparse solver that uses either an unsymmetric COLAMD ordering or an AMD ordering, automatically selecting which method to use based on the matrix characteristics. Gilbert-Peierls' algorithm is available in MATLAB as an LU factorization scheme. These experiments were done on Mandrake 10.0 Linux, on an Intel Pentium M processor with a clock frequency of 1400 MHz and 768 MB of RAM.

3.6.1 Different Ordering Schemes in KLU

There are six different ordering schemes possible in KLU. The three fill reducing schemes are AMD, COLAMD and a user specified ordering, and each can be combined with either a BTF pre-ordering or no pre-ordering, hence six schemes. The user specified ordering is not considered in this experiment, which leaves four schemes. Table 3-5 lists the fill (the number of nonzeros in L+U plus the number of off-diagonal elements) for each of these ordering schemes.

From Table 3-5 we find that BTF+AMD gives consistently better fill-in across different circuit matrices. However, there are observable aberrations: for example, with the circuit Bomhof/circuit_2, AMD and COLAMD give better fill-in than BTF+AMD. These results confirm again that BTF+AMD is the best ordering scheme to use for circuit matrices.

3.6.2 Timing the Different Phases in KLU

These experiments show the time spent in the different phases of the algorithm. A BTF pre-ordering followed by an AMD fill-reducing ordering is employed. As mentioned earlier, there are four different phases:
1. Analysis phase: comprises the pre-ordering and the fill reducing ordering.
2. Factor phase: comprises the factorization part of the algorithm, including the symbolic analysis, partial pivoting and symmetric pruning steps.
3. Refactor phase: comprises a factorization that reuses the pre-computed partial pivoting permutation and the nonzero pattern of the L and U matrices. There is no symbolic analysis, partial pivoting or symmetric pruning in the refactor phase.
4. Solve phase: comprises the solve step of the algorithm. Solving a single right hand side was tested.

When given a set of matrices with the same nonzero pattern, the analysis and factor phases are done only once; the refactor phase is then repeated for the remaining matrices, and the solve phase is repeated as many times as there are systems to solve. Table 3-6 contains the timing results. The analysis phase consumes most of the time spent in the algorithm. The refactor time is typically 3 or 4 times smaller than the factor time, and 8 times smaller than the analysis and factor times put together. The solve phase consumes the least fraction of the time.

3.6.3 Ordering Quality among KLU, UMFPACK and Gilbert-Peierls

Table 3-7 compares the ordering quality among KLU using BTF+AMD, UMFPACK using COLAMD or AMD, and Gilbert-Peierls' algorithm using AMD.
Table 3-5: Fill-in with four different schemes in KLU

                                  Nonzeros   BTF+       BTF+
Matrix                            in A       AMD        COLAMD     AMD        COLAMD
Sandia/adder_dcop_01 (1813)       11156      13525      13895      18848      21799
Sandia/adder_trans_01 (1814)      14579      20769      36711      24365      119519
Sandia/fpga_dcop_01 (1220)        5892       7891       8118       8869       12016
Sandia/fpga_trans_01 (1220)       7382       10152      12776      10669      21051
Sandia/init_adder1 (1813)         11156      13525      13895      18848      21799
Sandia/mult_dcop_01 (25187)       193276     226673     228301     2176328    1460322
Sandia/oscil_dcop_01 (430)        1544       2934       3086       3078       3295
Sandia/oscil_trans_01 (430)       1614       2842       3247       2897       3259
Bomhof/circuit_1 (2624)           35823      44879      775363     44815      775720
Bomhof/circuit_2 (4510)           21199      40434      89315      36197      36196
Bomhof/circuit_3 (12127)          48137      86718      98911      245336     744245
Grund/b2_ss (1089)                3895       9994       9212       26971      9334
Grund/b_dyn (1089)                4144       11806      10597      33057      10544
Grund/bayer02 (13935)             63307      889914     245259     1365142    307979
Grund/d_dyn (87)                  230        481        461        619        494
Grund/d_ss (53)                   144        302        292        382        298
Grund/meg1 (2904)                 58142      232042     184471     1526780    378904
Grund/meg4 (5860)                 25258      42398      310126     43250      328144
Grund/poli (4008)                 8188       12200      12208      12238      12453
Grund/poli_large (15575)          33033      48718      48817      49806      51970
Hamm/add20 (2395)                 13151      19554      34636      19554      34636
Hamm/add32 (4960)                 19848      28754      36030      28754      36030
Hamm/bcircuit (68902)             375558     1033240    1692668    1033240    1692668
Hamm/hcircuit (105676)            513072     731634     2623852    736080     4425310
Hamm/memplus (17758)              99147      137030     3282586    137030     3282586
Hamm/scircuit (170998)            958936     2481122    6410286    2481832    6427526
Rajat/rajat03 (7602)              32653      163913     235111     172666     236938
Rajat/rajat04 (1041)              8725       12863      80518      18618      158000
Rajat/rajat05 (301)               1250       1926       2053       2101       3131
Rajat/rajat11 (135)               665        890        978        944        1129
Rajat/rajat12 (1879)              12818      15308      273667     15571      128317
Rajat/rajat13 (7598)              48762      58856      90368      64791      5234287
Rajat/rajat14 (180)               1475       1994       2249       2105       2345
Table 3-6: Time in seconds, spent in the different phases in KLU

                                  Analysis   Factor     Refactor   Solve
Matrix                            time       time       time       time
Sandia/adder_dcop_01 (1813)       0.0028     0.0028     0.0007     0.0003
Sandia/adder_trans_01 (1814)      0.0045     0.0038     0.0013     0.0003
Sandia/fpga_dcop_01 (1220)        0.0018     0.0015     0.0004     0.0002
Sandia/fpga_trans_01 (1220)       0.0022     0.0017     0.0005     0.0002
Sandia/init_adder1 (1813)         0.0028     0.0028     0.0007     0.0003
Sandia/mult_dcop_01 (25187)       0.2922     0.0522     0.0196     0.0069
Sandia/oscil_dcop_01 (430)        0.0008     0.0006     0.0002     0.0001
Sandia/oscil_trans_01 (430)       0.0008     0.0006     0.0002     0.0001
Bomhof/circuit_1 (2624)           0.0098     0.0085     0.0053     0.0006
Bomhof/circuit_2 (4510)           0.0082     0.0064     0.0034     0.0006
Bomhof/circuit_3 (12127)          0.0231     0.0174     0.0056     0.0020
Grund/b2_ss (1089)                0.0031     0.0018     0.0005     0.0001
Grund/b_dyn (1089)                0.0033     0.0021     0.0007     0.0002
Grund/bayer02 (13935)             0.0584     0.2541     0.2070     0.0090
Grund/d_dyn (87)                  0.0002     0.0001     0.0000     0.0000
Grund/d_ss (53)                   0.0001     0.0001     0.0000     0.0000
Grund/meg1 (2904)                 0.0178     0.0853     0.0590     0.0033
Grund/meg4 (5860)                 0.0157     0.0094     0.0028     0.0009
Grund/poli (4008)                 0.0017     0.0010     0.0004     0.0003
Grund/poli_large (15575)          0.0064     0.0045     0.0018     0.0014
Hamm/add20 (2395)                 0.0056     0.0044     0.0014     0.0003
Hamm/add32 (4960)                 0.0084     0.0074     0.0019     0.0006
Hamm/bcircuit (68902)             0.3120     0.2318     0.1011     0.0257
Hamm/hcircuit (105676)            0.2553     0.1920     0.0658     0.0235
Hamm/memplus (17758)              0.0576     0.0358     0.0157     0.0036
Hamm/scircuit (170998)            0.8491     0.6364     0.3311     0.0622
Rajat/rajat03 (7602)              0.0152     0.0440     0.0306     0.0034
Rajat/rajat04 (1041)              0.0029     0.0023     0.0008     0.0002
Rajat/rajat05 (301)               0.0005     0.0005     0.0001     0.0000
Rajat/rajat11 (135)               0.0002     0.0002     0.0001     0.0000
Rajat/rajat12 (1879)              0.0038     0.0027     0.0008     0.0002
Rajat/rajat13 (7598)              0.0122     0.0105     0.0033     0.0011
Rajat/rajat14 (180)               0.0004     0.0003     0.0001     0.0000
We can infer from the results that KLU produces a better ordering than UMFPACK and Gilbert-Peierls' algorithm. For KLU, the following MATLAB code determines the fill:

    opts = [0.1 1.2 1.2 10 1 0 0 0] ;
    [x info] = klus (A, b, opts, []) ;
    fill = info (31) + info (32) + info (8) ;

For UMFPACK, the snippet is:

    [L U P Q] = lu (A) ;
    fill = nnz (L) + nnz (U) ;

For Gilbert-Peierls', the code is:

    Q = amd (A) ;
    [L U P] = lu (A (Q,Q), 0.1) ;
    fill = nnz (L) + nnz (U) ;

3.6.4 Performance Comparison between KLU and UMFPACK

This experiment compares the total time spent in the analysis, factor and solve phases by the two algorithms. The results are listed in Table 3-8: KLU outperforms UMFPACK in time. For KLU, the following MATLAB snippet is used:

    tic ; [x info] = klus (A, b, opts) ; t1 = toc ;

For UMFPACK, the following MATLAB code is used to find the total time:

    tic ; x = A \ b ; t2 = toc ;
Table 3-7: Fill-in among KLU, UMFPACK and Gilbert-Peierls

Matrix                            nnz        KLU        UMFPACK    Gilbert-Peierls
Sandia/adder_dcop_01 (1813)       11156      13525      14658      18825
Sandia/adder_trans_01 (1814)      14579      20769      20769      24365
Sandia/fpga_dcop_01 (1220)        5892       7891       8106       8869
Sandia/fpga_trans_01 (1220)       7382       10152      10152      10669
Sandia/init_adder1 (1813)         11156      13525      14658      18825
Sandia/mult_dcop_01 (25187)       193276     226673     556746     1300902
Sandia/oscil_dcop_01 (430)        1544       2934       2852       3198
Sandia/oscil_trans_01 (430)       1614       2842       3069       2897
Bomhof/circuit_1 (2624)           35823      44879      44879      44815
Bomhof/circuit_2 (4510)           21199      40434      35107      38618
Bomhof/circuit_3 (12127)          48137      86718      84117      245323
Grund/b2_ss (1089)                3895       9994       8309       22444
Grund/b_dyn (1089)                4144       11806      9642       41092
Grund/bayer02 (13935)             63307      889914     259329     973093
Grund/d_dyn (87)                  230        481        442        523
Grund/d_ss (53)                   144        302        268        395
Grund/meg1 (2904)                 58142      232042     151740     1212904
Grund/meg4 (5860)                 25258      42398      42398      43250
Grund/poli (4008)                 8188       12200      12200      12239
Grund/poli_large (15575)          33033      48718      48745      49803
Hamm/add20 (2395)                 13151      19554      19554      19554
Hamm/add32 (4960)                 19848      28754      28754      28754
Hamm/bcircuit (68902)             375558     1033240    1033240    1033240
Hamm/hcircuit (105676)            513072     731634     730906     736080
Hamm/memplus (17758)              99147      137030     137030     137030
Hamm/scircuit (170998)            958936     2481122    2481122    2481832
Rajat/rajat03 (7602)              32653      163913     163913     172666
Rajat/rajat04 (1041)              8725       12863      12860      18613
Rajat/rajat05 (301)               1250       1926       1944       2101
Rajat/rajat11 (135)               665        890        890        944
Rajat/rajat12 (1879)              12818      15308      15308      15571
Rajat/rajat13 (7598)              48762      58856      58856      64791
Rajat/rajat14 (180)               1475       1994       1994       2105
Table 3-8: Performance comparison between KLU and UMFPACK

Matrix                            KLU time   UMFPACK time
Sandia/adder_dcop_01 (1813)       0.0116     0.0344
Sandia/adder_trans_01 (1814)      0.0112     0.0401
Sandia/fpga_dcop_01 (1220)        0.0050     0.0257
Sandia/fpga_trans_01 (1220)       0.0054     0.0203
Sandia/init_adder1 (1813)         0.0109     0.0323
Sandia/mult_dcop_01 (25187)       1.2383     1.1615
Sandia/oscil_dcop_01 (430)        0.0022     0.0070
Sandia/oscil_trans_01 (430)       0.0019     0.0074
Bomhof/circuit_1 (2624)           0.0232     0.1223
Bomhof/circuit_2 (4510)           0.0201     0.0522
Bomhof/circuit_3 (12127)          0.0579     0.1713
Grund/b2_ss (1089)                0.0066     0.0168
Grund/b_dyn (1089)                0.0072     0.0175
Grund/bayer02 (13935)             0.6089     0.3565
Grund/d_dyn (87)                  0.0005     0.0014
Grund/d_ss (53)                   0.0003     0.0010
Grund/meg1 (2904)                 0.1326     0.1018
Grund/meg4 (5860)                 0.0571     0.1111
Grund/poli (4008)                 0.0050     0.0121
Grund/poli_large (15575)          0.0208     0.0497
Hamm/add20 (2395)                 0.0123     0.0506
Hamm/add32 (4960)                 0.0201     0.0738
Hamm/bcircuit (68902)             0.7213     1.8823
Hamm/hcircuit (105676)            0.7313     2.7764
Hamm/memplus (17758)              0.1232     0.8140
Hamm/scircuit (170998)            1.9812     7.3448
Rajat/rajat03 (7602)              0.0793     0.1883
Rajat/rajat04 (1041)              0.0068     0.0284
Rajat/rajat05 (301)               0.0014     0.0046
Rajat/rajat11 (135)               0.0007     0.0023
Rajat/rajat12 (1879)              0.0087     0.0355
Rajat/rajat13 (7598)              0.0330     0.1229
Rajat/rajat14 (180)               0.0010     0.0032
CHAPTER 4
USER GUIDE FOR KLU

4.1 The Primary KLU Structures

4.1.1 klu_common

This is a control structure that contains both the input control parameters for KLU and the output statistics computed by the algorithm. It is a mandatory parameter for all the KLU routines. Its contents are listed as follows:

tol
    Type: double. Input parameter: the pivot tolerance used for diagonal preference. Default value is 0.001.
growth
    Type: double. Input parameter: the reallocation growth factor for the LU factors. Default value is 1.2.
initmem_amd
    Type: double. Input parameter: the initial memory size when AMD is used: initial memory size = initmem_amd * nnz(L) + n. Default value is 1.2.
initmem
    Type: double. Input parameter: the initial memory size when COLAMD is used: initial memory size = initmem * nnz(A) + n. Default value is 10.
btf
    Type: int. Input parameter: whether or not to use the BTF pre-ordering. Default value is 1 (use BTF).
ordering
    Type: int. Input parameter: which fill reducing ordering to use: 0 = AMD, 1 = COLAMD, 2 = user-specified P and Q. Default is 0.
scale
    Type: int. Input parameter: which scaling strategy to use: 0 = none, 1 = sum, 2 = max. Default is 0.
singular_proc
    Type: int. Input parameter: whether to stop on singularity or continue: 0 = stop, 1 = continue. Default is 0.
status
    Type: int. Output parameter that indicates the result of a KLU function call. The value is KLU_OK (0) on success and negative on error: KLU_SINGULAR (-1), KLU_OUT_OF_MEMORY (-2), KLU_INVALID (-3).
nrealloc
    Type: int. Output parameter. Contains the number of reallocations of L and U.
singular_col
    Type: int. Output parameter. Contains the column number of the singular column, if any.
noffdiag
    Type: int. Output parameter. Contains the number of off-diagonal pivots chosen.
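As an illustration, the following C fragment initializes a klu_common object with klu_defaults (Section 4.2.7) and then overrides a few of the input parameters listed above. This is only a sketch; the particular values are arbitrary examples, not recommendations.

    klu_common Common ;
    klu_defaults (&Common) ;   /* set all control parameters to their defaults */
    Common.tol = 0.1 ;         /* stricter pivot tolerance for diagonal preference */
    Common.btf = 0 ;           /* disable the BTF pre-ordering */
    Common.ordering = 1 ;      /* use COLAMD instead of the default AMD */
    Common.scale = 2 ;         /* max scaling */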
4.1.2 klu_symbolic

This structure encapsulates the information related to the analysis phase. The members of the structure are listed as follows:

symmetry
    Type: double. Contains the symmetry of the largest block.
est_flops
    Type: double. Contains the estimated factorization flop count.
lnz
    Type: double. Contains the estimated number of nonzeros in L, including the diagonal.
unz
    Type: double. Contains the estimated number of nonzeros in U, including the diagonal.
Lnz
    Type: double. Array of size n, but only Lnz [0..nblocks-1] is used. Contains the estimated number of nonzeros in each block.
n
    Type: int. Contains the size of the input matrix A, where A is n-by-n.
nz
    Type: int. Contains the number of entries in the input matrix.
P
    Type: int. Array of size n. Contains the row permutation from the ordering.
Q
    Type: int. Array of size n. Contains the column permutation from the ordering.
R
    Type: int. Array of size n+1, but only R [0..nblocks] is used. Contains the start and end column/row indices of each block: block k goes from R [k] to R [k+1] - 1.
nzoff
    Type: int. Contains the number of nonzeros in the off-diagonal blocks.
nblocks
    Type: int. Contains the number of blocks.
maxblock
    Type: int. Contains the size of the largest block.
ordering
    Type: int. Contains the ordering used (0 = AMD, 1 = COLAMD, 2 = GIVEN).
do_btf
    Type: int. Indicates whether or not the BTF pre-ordering was requested.

The members symmetry, est_flops, lnz, unz and Lnz are computed only when AMD is used. The remaining members are computed for all orderings.
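As a small sketch (assuming <stdio.h> is included, and using the member names as reconstructed above), the statistics gathered during the analysis phase can be read directly from the returned object:

    klu_symbolic *Symbolic = klu_analyze (n, Ap, Ai, &Common) ;
    printf ("blocks: %d, largest block: %d, off-diagonal nonzeros: %d\n",
        Symbolic->nblocks, Symbolic->maxblock, Symbolic->nzoff) ;
    printf ("estimated nnz in L: %g\n", Symbolic->lnz) ;  /* meaningful only with AMD */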
4.1.3 klu_numeric

This structure encapsulates the information related to the factor phase. It contains the LU factors of each block, the pivot row permutation, and the entries in the off-diagonal blocks, among others. Its contents are listed as follows:

umin
    Type: double. Contains the minimum absolute diagonal entry in U.
umax
    Type: double. Contains the maximum absolute diagonal entry in U.
nblocks
    Type: int. Contains the number of blocks.
lnz
    Type: int. Contains the actual number of nonzeros in L, excluding the diagonal.
unz
    Type: int. Contains the actual number of nonzeros in U, excluding the diagonal.
Pnum
    Type: int. Array of size n. Contains the final pivot permutation.
Pinv
    Type: int. Array of size n. Contains the inverse of the final pivot permutation.
Lbip
    Type: int **. Array of size nblocks. Each element is an array of size block size + 1, containing the column pointers of the L factor of the corresponding block.
Ubip
    Type: int **. Array of size nblocks. Each element is an array of size block size + 1, containing the column pointers of the U factor of the corresponding block.
Lblen
    Type: int **. Array of size nblocks. Each element is an array of size block size, containing the column lengths of the L factor of the corresponding block.
Ublen
    Type: int **. Array of size nblocks. Each element is an array of size block size, containing the column lengths of the U factor of the corresponding block.
LUbx
    Type: void **. Array of size nblocks. Each element is an array containing the row indices and numerical values of the LU factors of the corresponding block. The diagonals of the LU factors are not stored here.
Udiag
    Type: void **. Array of size nblocks. Each element is an array of size block size, containing the diagonal values of the U factor of the corresponding block.
Singleton
    Type: void *. Array of size nblocks. Contains the singleton values.
Rs
    Type: double. Array of size n. Contains the row scaling factors.
scale
    Type: int. Indicates the scaling strategy used (0 = none, 1 = sum, 2 = max).
Work
    Type: void *. Permanent workspace used for factorization and solve. It is of size MAX (4n numerical values, n numerical values + 6*maxblock ints).
worksize
    Type: size_t. Contains the size (in bytes) of the Work space allocated above.
Xwork
    Type: void *. An alias into Numeric->Work.
Iwork
    Type: int *. An integer alias into Xwork + n.
Offp
    Type: int. Array of size n+1. Contains the column pointers for the off-diagonal elements.
Offi
    Type: int. Array of size equal to the number of off-diagonal entries. Contains the row indices of the off-diagonal elements.
Offx
    Type: void *. Array of size equal to the number of off-diagonal entries. Contains the numerical values of the off-diagonal elements.

4.2 KLU Routines

The user callable KLU routines in the C language are explained in this section. The following guidelines apply to all routines, except when explicitly stated otherwise in the description of a routine:

1. All arguments are required, except when explicitly stated as optional. If an argument is optional, the user can pass NULL for it.
2. The control input/output argument "Common", of type "klu_common *", is a required argument for all routines.
3. All arguments other than the control argument "Common" are input arguments and are not modified.

4.2.1 klu_analyze

klu_symbolic *klu_analyze
(
    int n,
    int Ap [ ],
    int Ai [ ],
    klu_common *Common
) ;

This routine orders the matrix using BTF (if specified) and the specified fill reducing ordering. It returns a pointer to a klu_symbolic structure that contains the ordering information. All arguments are required.

- n: Size of the input matrix A, where A is n-by-n.
- Ap: Array of column pointers for the input matrix. Size n+1.
- Ai: Array of row indices. Size equal to the number of nonzeros in A.
- Common: The control input/output structure.

4.2.2 klu_analyze_given

klu_symbolic *klu_analyze_given
(
    int n,
    int Ap [ ],
    int Ai [ ],
    int Puser [ ],
    int Quser [ ],
    klu_common *Common
) ;

This routine orders the matrix using BTF (if specified) and the given Puser and Quser as the fill reducing ordering. If Puser and Quser are NULL, the natural ordering is used. It returns a pointer to a klu_symbolic structure that contains the ordering information.

Arguments:
- n: Size of the input matrix A, where A is n-by-n. Required.
- Ap: Array of column pointers for the input matrix. Size n+1. Required.
- Ai: Array of row indices. Size equal to the number of nonzeros in A. Required.
- Puser: Optional row permutation to use.
- Quser: Optional column permutation to use.
- Common: The control input/output structure.

4.2.3 klu_factor

klu_numeric *klu_factor
(
    int Ap [ ],
    int Ai [ ],
    double Ax [ ],
    klu_symbolic *Symbolic,
    klu_common *Common
) ;

This routine factors a real matrix. There is a complex version of this routine, klu_z_factor, that factors a complex matrix and has the same function declaration as klu_factor. Both use the results of a call to klu_analyze. It returns a pointer to a klu_numeric structure if successful, and NULL otherwise. All arguments are required.

- Ap: Array of column pointers for the input matrix. Size n+1.
- Ai: Array of row indices. Size equal to the number of nonzeros in A.
- Ax: Array of numerical values. Size equal to the number of nonzeros in A. In the complex case, the array must consist of the real and imaginary parts of each numerical value stored as adjacent pairs.
- Symbolic: The structure that contains the results of a call to klu_analyze.
- Common: The control input/output structure. The status field in Common is set to indicate whether or not the routine was successful.

4.2.4 klu_solve

void klu_solve
(
    klu_symbolic *Symbolic,
    klu_numeric *Numeric,
    int ldim,
    int nrhs,
    double B [ ],
    klu_common *Common
) ;
This routine solves a real system. There is a complex version of this routine, klu_z_solve, that solves a complex system and has the same function declaration as klu_solve. Both use the results of calls to klu_analyze and klu_factor (or klu_z_factor). The return type is void. The right hand side vector B is overwritten with the solution. All arguments are required.

- Symbolic: The structure that contains the results of a call to klu_analyze.
- Numeric: The structure that contains the results of a call to klu_factor (or klu_z_factor).
- ldim: The leading dimension of the right hand side B.
- nrhs: The number of right hand sides being solved.
- B: The right hand side, a vector of length ldim * nrhs. It can be real or complex, depending on whether a real or complex system is being solved. If complex, the real and imaginary parts of each right hand side value must be stored as adjacent pairs. B is overwritten with the solution.
- Common: The control input/output structure.

4.2.5 klu_tsolve

void klu_tsolve
(
    klu_symbolic *Symbolic,
    klu_numeric *Numeric,
    int ldim,
    int nrhs,
    int conj_solve,
    double B [ ],
    klu_common *Common
) ;
This routine is similar to klu_solve, except that it solves the transpose system A'x = b. There is a complex version of this routine, klu_z_tsolve, for solving complex systems, with the same function declaration as klu_tsolve; it also offers a conjugate transpose solve, A^H x = b, for complex systems. The return type is void. The right hand side vector B is overwritten with the solution. All arguments are required. The descriptions of all arguments except conj_solve are the same as those for klu_solve. The argument conj_solve is relevant only in the complex case; it takes two values: 1 = conjugate transpose solve, 0 = transpose solve.

4.2.6 klu_refactor

void klu_refactor
(
    int Ap [ ],
    int Ai [ ],
    double Ax [ ],
    klu_symbolic *Symbolic,
    klu_numeric *Numeric,
    klu_common *Common
) ;

This routine refactorizes the matrix, using the previously computed ordering information in the Symbolic object and the nonzero pattern of the LU factors in the Numeric object. It assumes the same partial pivoting order computed in Numeric, and changes only the numerical values of the LU factors stored in the Numeric object. It has a complex version, klu_z_refactor, with the same function prototype, to handle complex matrices. The return type is void. The numerical values of the LU factors in the Numeric parameter are updated. All arguments are required.
- Ap: Array of column pointers for the input matrix. Size n+1.
- Ai: Array of row indices. Size equal to the number of nonzeros in A.
- Ax: Array of numerical values. Size equal to the number of nonzeros in A. In the complex case, the array must consist of the real and imaginary parts of each numerical value stored as adjacent pairs.
- Symbolic: The structure that contains the results of a call to klu_analyze.
- Numeric: Input/output argument. The structure contains the results of a call to klu_factor. The numerical values of the LU factors are overwritten with those of the current matrix being factorized.
- Common: The control input/output structure. The status field in Common is set to indicate whether or not the routine was successful.

4.2.7 klu_defaults

void klu_defaults
(
    klu_common *Common
) ;

This routine sets the default values of the control input parameters in the klu_common object. The default values are listed in the description of the klu_common structure. A call to this routine is required unless the user sets the control input parameters explicitly. The return type is void. The argument Common is required; its control input parameters are set to their default values.

4.2.8 klu_rec_pivot_growth

double klu_rec_pivot_growth
(
    int Ap [ ],
    int Ai [ ],
    double Ax [ ],
    klu_symbolic *Symbolic,
    klu_numeric *Numeric,
    klu_common *Common
) ;

This routine computes the reciprocal pivot growth of the factorization. The complex version of this routine, klu_z_rec_pivot_growth, handles complex matrices and has the same function declaration. The pivot growth estimate is returned. All arguments are required.

- Ap: Array of column pointers for the input matrix. Size n+1.
- Ai: Array of row indices. Size equal to the number of nonzeros in A.
- Ax: Array of numerical values. Size equal to the number of nonzeros in A. In the complex case, the array must consist of the real and imaginary parts of each numerical value stored as adjacent pairs.
- Symbolic: The structure that contains the results of a call to klu_analyze.
- Numeric: The structure that contains the results of a call to klu_factor.
- Common: The control input/output structure. The status field in Common is set to indicate whether or not the routine was successful.

4.2.9 klu_estimate_cond_number

double klu_estimate_cond_number
(
    int Ap [ ],
    double Ax [ ],
    klu_symbolic *Symbolic,
    klu_numeric *Numeric,
    klu_common *Common
) ;

This routine computes a condition number estimate of the input matrix. As before, the complex version of this routine, klu_z_estimate_cond_number, has the same function declaration and handles complex matrices. The condition number estimate is returned. All arguments are required.

- Ap: Array of column pointers for the input matrix. Size n+1.
- Ax: Array of numerical values. Size equal to the number of nonzeros in A. In the complex case, the array must consist of the real and imaginary parts of each numerical value stored as adjacent pairs.
- Symbolic: The structure that contains the results of a call to klu_analyze.
- Numeric: The structure that contains the results of a call to klu_factor.
- Common: The control input/output structure. The status field in Common is set to indicate whether or not the routine was successful.

4.2.10 klu_free_symbolic

void klu_free_symbolic
(
    klu_symbolic **Symbolic,
    klu_common *Common
) ;

This routine deallocates the contents of the klu_symbolic object. The Symbolic parameter must be a valid object computed by a call to klu_analyze or klu_analyze_given. The return type is void. All arguments are required.
- Symbolic: Input/output argument. Must be a valid object computed by a call to klu_analyze or klu_analyze_given. If NULL, the routine simply returns.
- Common: The control input/output structure.

4.2.11 klu_free_numeric

void klu_free_numeric
(
    klu_numeric **Numeric,
    klu_common *Common
) ;

This routine frees the klu_numeric object computed by a call to klu_factor, and resets the pointer to the klu_numeric object to NULL. There is a complex version of this routine, klu_z_free_numeric, with the same function declaration, to handle the complex case. The return type is void. All arguments are required.

- Numeric: Input/output argument. The contents of the klu_numeric object are freed, and the pointer to the klu_numeric object is set to NULL.
- Common: The control input/output structure.
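To summarize the interface, the following is a hedged end-to-end sketch that factors a real matrix, solves a system with two right hand sides, then solves a transpose system, checking the status field after factorization. The arrays Ap, Ai, Ax and the right hand side array B (of length 2n, with the right hand sides stored one after another) are assumed to have been set up by the caller.

    klu_common Common ;
    klu_symbolic *Symbolic ;
    klu_numeric *Numeric ;

    klu_defaults (&Common) ;
    Symbolic = klu_analyze (n, Ap, Ai, &Common) ;
    Numeric = klu_factor (Ap, Ai, Ax, Symbolic, &Common) ;
    if (Common.status != 0)
    {
        /* KLU_OK is 0 ; handle KLU_SINGULAR, KLU_OUT_OF_MEMORY or KLU_INVALID here */
    }
    klu_solve (Symbolic, Numeric, n, 2, B, &Common) ;      /* B is overwritten with the solutions */
    klu_tsolve (Symbolic, Numeric, n, 1, 0, B, &Common) ;  /* transpose solve A'x = b, no conjugation */
    klu_free_numeric (&Numeric, &Common) ;
    klu_free_symbolic (&Symbolic, &Common) ;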
REFERENCES

[1] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for FORTRAN usage, ACM Trans. Math. Software 5: 308-323, 1979.

[2] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Software 14: 1-17, 1988.

[3] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm 656: An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Software 14: 18-32, 1988.

[4] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software 16: 1-17, 1990.

[5] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling, Algorithm 679: A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Software 16: 18-28, 1990.

[6] James W. Demmel, Stanley C. Eisenstat, John R. Gilbert, Xiaoye S. Li, and Joseph W. H. Liu, A supernodal approach to sparse partial pivoting, SIAM J. Matrix Analysis and Applications 20(3): 720-755, 1999.

[7] Timothy A. Davis and I. S. Duff, An unsymmetric-pattern multifrontal method for sparse LU factorization, SIAM J. Matrix Analysis and Applications 18(1): 140-158, 1997.

[8] Timothy A. Davis, Algorithm 832: UMFPACK, an unsymmetric-pattern multifrontal method with a column pre-ordering strategy, ACM Trans. Math. Software 30(2): 196-199, 2004.

[9] John R. Gilbert and Tim Peierls, Sparse partial pivoting in time proportional to arithmetic operations, SIAM J. Sci. Stat. Comput. 9(5): 862-873, 1988.

[10] A. George and E. Ng, An implementation of Gaussian elimination with partial pivoting for sparse systems, SIAM J. Sci. Statist. Comput. 6(2): 390-409, 1985.

[11] Iain S. Duff, On algorithms for obtaining a maximum transversal, ACM Trans. Math. Software 7(3): 315-330, 1981.
[12] Iain S. Duff, Algorithm 575: permutations for a zero-free diagonal, ACM Trans. Math. Software 7(3): 387-390, 1981.

[13] Iain S. Duff and John K. Reid, Algorithm 529: permutations to block triangular form, ACM Trans. Math. Software 4(2): 189-192, 1978.

[14] Iain S. Duff and John K. Reid, An implementation of Tarjan's algorithm for the block triangular form of a matrix, ACM Trans. Math. Software 4(2): 137-147, 1978.

[15] R. E. Tarjan, Depth first search and linear graph algorithms, SIAM J. Computing 1: 146-160, 1972.

[16] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, Introduction to Algorithms, Second Edition, MIT Press, Cambridge, 2001.

[17] S. C. Eisenstat and J. W. H. Liu, Exploiting structural symmetry in a sparse partial pivoting code, SIAM J. Sci. Comput. 14(1): 253-257, 1993.

[18] P. R. Amestoy, T. A. Davis, and I. S. Duff, An approximate minimum degree ordering algorithm, SIAM J. Matrix Anal. Applic. 17(4): 886-905, 1996.

[19] P. R. Amestoy, T. A. Davis, and I. S. Duff, Algorithm 837: AMD, an approximate minimum degree ordering algorithm, ACM Trans. Math. Software 30(3): 381-388, 2004.

[20] Timothy A. Davis, John R. Gilbert, Stefan I. Larimore, and Esmond G. Ng, A column approximate minimum degree ordering algorithm, ACM Trans. Math. Software 30(3): 353-376, 2004.

[21] Timothy A. Davis, John R. Gilbert, Stefan I. Larimore, and Esmond G. Ng, Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm, ACM Trans. Math. Software 30(3): 377-380, 2004.

[22] W. W. Hager, Condition estimates, SIAM J. Sci. Stat. Comput. 5(2): 311-316, 1984.

[23] Nicholas J. Higham, FORTRAN codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation, ACM Trans. Math. Software 14(4): 381-396, 1988.
BIOGRAPHICAL SKETCH

Ekanathan was born in Tirunelveli, India, on October 2, 1977. He received his Bachelor of Technology degree in chemical engineering from Anna University, Chennai, India, in May 1998. He worked with Infosys Technologies in Brussels, Belgium, as a programmer/analyst from 1998 till 2001, and with SAP AG in Walldorf, Germany, as a software developer from 2001 till 2003. Currently he is pursuing his Master of Science degree in computer science at the University of Florida. His interests and hobbies include travel, reading novels and magazines, music and movies.