Citation |

- Permanent Link:
- https://ufdc.ufl.edu/UFE0011721/00001
## Material Information- Title:
- KLU--A High Performance Sparse Linear Solver for Circuit Simulation Problems
- Creator:
- NATARAJAN, EKANATHAN PALAMADAI (
*Author, Primary*) - Copyright Date:
- 2008
## Subjects- Subjects / Keywords:
- Algorithms ( jstor )
Factorization ( jstor ) Gaussian elimination ( jstor ) Input output ( jstor ) Linear systems ( jstor ) Mathematics ( jstor ) Matrices ( jstor ) Permutations ( jstor ) Pruning ( jstor ) Simulations ( jstor )
## Record Information- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Ekanathan Palamadai Natarajan. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Embargo Date:
- 7/30/2007
- Resource Identifier:
- 74493210 ( OCLC )
## UFDC Membership |

Downloads |

## This item has the following downloads:
palamadai_e ( .pdf )
palamadai_e_Page_02.txt palamadai_e_Page_23.txt palamadai_e_Page_58.txt palamadai_e_Page_62.txt palamadai_e_Page_09.txt palamadai_e_Page_36.txt palamadai_e_Page_19.txt palamadai_e_pdf.txt palamadai_e_Page_75.txt palamadai_e_Page_07.txt palamadai_e_Page_37.txt palamadai_e_Page_42.txt palamadai_e_Page_01.txt palamadai_e_Page_31.txt palamadai_e_Page_67.txt palamadai_e_Page_54.txt palamadai_e_Page_21.txt palamadai_e_Page_14.txt palamadai_e_Page_10.txt palamadai_e_Page_35.txt palamadai_e_Page_57.txt palamadai_e_Page_34.txt palamadai_e_Page_64.txt palamadai_e_Page_27.txt palamadai_e_Page_06.txt palamadai_e_Page_55.txt palamadai_e_Page_61.txt palamadai_e_Page_77.txt palamadai_e_Page_17.txt palamadai_e_Page_68.txt palamadai_e_Page_49.txt palamadai_e_Page_24.txt palamadai_e_Page_45.txt palamadai_e_Page_63.txt palamadai_e_Page_20.txt palamadai_e_Page_53.txt palamadai_e_Page_59.txt palamadai_e_Page_38.txt palamadai_e_Page_15.txt palamadai_e_Page_08.txt palamadai_e_Page_72.txt palamadai_e_Page_13.txt palamadai_e_Page_05.txt palamadai_e_Page_46.txt palamadai_e_Page_66.txt palamadai_e_Page_30.txt palamadai_e_Page_44.txt palamadai_e_Page_11.txt palamadai_e_Page_39.txt palamadai_e_Page_69.txt palamadai_e_Page_41.txt palamadai_e_Page_73.txt palamadai_e_Page_70.txt palamadai_e_Page_76.txt palamadai_e_Page_78.txt palamadai_e_Page_25.txt palamadai_e_Page_33.txt palamadai_e_Page_32.txt palamadai_e_Page_43.txt palamadai_e_Page_12.txt palamadai_e_Page_29.txt palamadai_e_Page_51.txt palamadai_e_Page_03.txt palamadai_e_Page_18.txt palamadai_e_Page_74.txt palamadai_e_Page_48.txt palamadai_e_Page_79.txt palamadai_e_Page_04.txt palamadai_e_Page_22.txt palamadai_e_Page_50.txt palamadai_e_Page_16.txt palamadai_e_Page_60.txt palamadai_e_Page_65.txt palamadai_e_Page_26.txt palamadai_e_Page_52.txt palamadai_e_Page_56.txt palamadai_e_Page_47.txt palamadai_e_Page_28.txt palamadai_e_Page_40.txt palamadai_e_Page_71.txt |

Full Text |

KLU-A HIGH PERFORMANCE SPARSE LINEAR SOLVER FOR CIRCUIT SIMULATION PROBLEMS By EKANATHAN PALAMADAI NATARAJAN A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORIDA 2005 Copyright 2005 by Ekanathan Palamadai Natarajan I dedicate this work to my mother Savitri who has been a source of inspiration and support to me. ACKNOWLEDGMENTS I would like to thank Dr. Timothy Davis, my advisor for introducing me to the area of sparse matrix algorithms and linear solvers. I started only with my background in numerical analysis and algorithms, a year and half back. The insights and knowledge I have gained since then in the area and in implementing a sparse linear solver like KLU would not have been possible but for Dr. Davis' guidance and help. I thank him for giving me an opportunity to work on KLU. I would like to thank Dr. Jose Fortes and Dr. Arunava Banerjee for their support and help and for serving on my committee. I would like to thank CISE administrative staff for helping me at different times during my master's research work. TABLE OF CONTENTS page ACKNOWLEDGMENTS ................... ...... iv LIST OF TABLES ...................... ......... vii LIST OF FIGURES ................... ......... viii ABSTRACT ....................... ........... ix CHAPTER 1 INTRODUCTION .................... ....... 1 2 THEORY: SPARSE LU ........................... 5 2.1 D ense LU . . . . . . . 5 2.2 Sparse LU . . . . . . . 7 2.3 Left Looking Gaussian Elimination ........ ........ 8 2.4 Gilbert-Peierls' Algorithm ......... ........ ..... 10 2.4.1 Symbolic Analysis .......... ........ ..... 11 2.4.2 Numerical Factorization ......... ........ .. 13 2.5 Maximum Transversal ................... ... 14 2.6 Block Triangular Form .......... ............. 16 2.7 Symmetric Pruning ................... .... 18 2.8 Ordering .................. .......... 19 2.9 Pivoting .................. .......... 22 2.10 Scaling ........ ...... ................. 23 2.11 Growth Factor ......... .......... ...... 25 2.12 Condition Number ............... .... .. 27 2.13 Depth First Search ............... ..... .. 30 2.14 Memory Fragmentation ............... .. .. 31 2.15 Complex Number Support ................ . .. 33 2.16 Parallelism in KLU ............... ..... .. 33 3 CIRCUIT SIMULATION: APPLICATION OF KLU . .... 35 3.1 Characteristics of Circuit Matrices ..... . . ..... 37 3.2 Linear Systems in Circuit Simulation . . ..... 38 3.3 Performance Benchmarks ................ .... .. 39 3.4 Analyses and Findings ............. .. .. .. 41 3.5 Alternate Ordering Experiments ... . . 42 3.6 Experiments with UF Sparse Matrix Collection .. ........ 3.6.1 Different Ordering Schemes in KLU .. ........... 3.6.2 Timing Different Phases in KLU .. ............ 3.6.3 Ordering Quality among KLU, UMFPACK and Gilbert- P eierls . . . . . . . 3.6.4 Performance Comparison between KLU and UMFPACK . 4 USER GUIDE FOR KLU 4 1 The Primar KLU Structures 4.1.1 4.1.2 4.1.3 4.2 KLU I 4.2.1 4.2.2 4.2.3 4.2.4 4.2.5 4.2.6 4.2.7 4.2.8 4.2.9 4.2.10 4.2.11 L lllCAl V .JU JUl U uAVU .1 klu_common .. ...... klusymbolic ........ klunumeric .. ...... (outines . . . kluanalyze ......... klu_analyze_given ..... klu_*factor .. ....... klu_*solve .. ....... klu_*tsolve .. ....... klu_ 1 . . .. klu_defaults .. ...... klu_*rec_pivot_growth . klu_*estimatecond_number klu_freesymbolic ..... klu_free_numeric ...... REFEREN CES . . . . . . . . . BIOGRAPHICAL SKETCH ...... LIST OF TABLES Table page 3-1 Comparison between KLU and SuperLU on overall time and fill-in .39 3-2 Comparison between KLU and SuperLU on factor time and solve time 40 3-3 Ordering results using BTF+AMD in KLU on circuit matrices .. 41 3-4 Comparison of ordering results produced by BTF+AMD, AMD, MMD 43 3-5 Fill-in with four different schemes in KLU . . 46 3-6 Time in seconds, spent in different phases in KLU . .... 47 3-7 Fill-in among KLU, UMFPACK and Gilbert-Peierls . ... 49 3-8 Performance comparison between KLU and UMFPACK . ... 50 LIST OF FIGURES Figure page 2-1 Nonzero pattern of x when solving Lx = b . . ...... 13 2-2 A matrix permuted to BTF form ....... .. 16 2-3 A symmetric pruning scenario ................... .. .. 18 2-4 A symmetric matrix and its graph representation . .... 21 2-5 The matrix and its graph representation after one step of Gaussian elim nation .................. ............ .. 21 2-6 A doubly bordered block diagonal matrix and its corresponding ver- tex separator tree .................. ......... .. 34 Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science KLU-A HIGH PERFORMANCE SPARSE LINEAR SOLVER FOR CIRCUIT SIMULATION PROBLEMS By Ekanathan Palamadai Natarajan August 2005 Chair: Dr. Timothy A. Davis I. Pi r Department: Computer and Information Science and Engineering The thesis work focuses on KLU, a sparse high performance linear solver for circuit simulation matrices. During the circuit simulation process, one of the key steps is to solve sparse systems of linear equations of very high order. KLU targets solving these systems with efficient ordering mechanisms and high performance factorization and solve algorithms. KLU uses a hybrid ordering strategy that comprises an unsymmetric permutation to ensure zero free diagonal, a symmetric permutation to block upper triangular form and a fill reducing ordering such as approximate minimum degree. The factorization is based on Gilbert-Peierls' left-looking algorithm with partial pivoting. KLU also includes features like symmetric pruning to cut down symbolic analysis costs. It offers to solve up to four right hand sides in a single solve step. In addition, it offers transpose and conjugate transpose solve capabili- ties and important diagnostic features to estimate the reciprocal pivot growth of the factorization and condition number of the input matrix. The algorithm is implemented in the C language with MATLAB interfaces as well. The MATLAB interfaces enable a user to invoke KLU routines from within the MATLAB environment. The implementation was tested on circuit matrices and the results determined. KLU achieves superior fill-in quality with its hybrid ordering strategy and achieves a good performance speed-up when compared with existing sparse linear solvers for circuit problems. The thesis highlights the work being done on exploiting parallelism in KLU as well. CHAPTER 1 INTRODUCTION Sparse is beautiful. Solving systems of linear equations of the form Ax = b is a fundamental and important area of high performance computing. The matrix A is called the coefficient matrix and b the right hand side vector. The vector x is the solution to the equation. There are a number of methods available for solving such systems. Some of the popular ones are Gaussian elimination, QR factorization using Householder transformations or Givens rotations and Cholesky factorization. Gaussian elimination with partial pivoting is the most widely used algorithm for solving linear systems because of its stability and better time complexity. Cholesky can be used only when A is symmetric positive definite. Some systems that are solved comprise a dense coefficient matrix A. By dense, we mean most of the elements in A are nonzero. There are high performance subroutines such as the BLAS [1, 2, 3, 4, 5] that can maximize flop count for such dense matrices. The interesting systems are those where the coefficient matrix A happens to be sparse. By sparse, we mean the matrix has few nonzero entries(hereafter referred to simply as 'nonzeros'). The adjective 'few' is not well- defined as we will see in chapter two. When matrices tend to be sparse, we need to find out effective ways to store the matrix in memory since we want to avoid storing zeros of the matrix. When we store only the nonzeros in the matrix, it has consequences in the factorization algorithm as well. One typical example would be we do not know before hand how nonzeros would appear in the L and U factors when we factorize the matrix. While we avoid storing the zeros, we also want to achieve good time complexity when solving sparse systems. If the time spent to solve sparse systems remains same as for dense systems, we have not done any better. KLU stands for Clark Kent LU, since it is based on Gilbert-Peierls' algorithm, a non-supernodal algorithm, which is the predecessor to SuperLU, a supernodal algorithm. KLU is a sparse high performance linear solver that employs hybrid ordering mechanisms and elegant factorization and solve algorithms. It achieves high cqii,.lilI fill-in rate and beats many existing solvers in run time, when used for matrices arising in circuit simulation. There are several flavours of Gaussian elimination. A left-looking Gaussian elimination algorithm factorizes the matrix left to right computing columns of L and U. A right-looking version factorizes the matrix from top-left to bottom-right computing column of L and row of U. Both have their advantages and disad- vantages. KLU uses a left looking algorithm called Gilbert-Peierls' algorithm. Gilbert-Peierls' comprises a graph theoretical symbolic analysis phase that iden- tifies the nonzero pattern of each column of L and U factors and a left-looking numerical factorization phase with partial pivoting that calculates the numerical values of the factors. KLU uses Symmetric Pruning to cut down symbolic analysis cost. We shall look in detail on these features in chapter two. A critical issue in linear solvers is Ordering. Ordering means permuting the rows and columns of a matrix, so that the fill-in in the L and U factors is reduced to a minimum. A fill-in is defined as a nonzero appearing in either of the matrices L or U, while the element in the corresponding position in A is a zero. Lij or Uij is a fill-in if Aij is a zero. Fill-in has obvious consequences in memory in that the factorization algorithm could create dense L and U factors that can exhaust available memory. A good ordering algorithm yields a low fill-in in the factors. Finding the ordering that gives minimal fill-in is an NP complete problem. So ordering algorithms use heuristics. KLU accomodates multiple ordering schemes like AMD, COLAMD and any user generated permutation. There are other orderings for different purposes. For example, one could order a matrix to ensure that it has no zeros on the diagonal. Otherwise, the Gaussian elimination would fail. Another ordering scheme could reduce the factorization work. KLU einr'l,-'- two such orderings namely an unsymmetric ordering that ensures a zero free diagonal and a symmetric ordering that permutes the matrix into a block upper triangular form (BTF) that restricts factorization to only the diagonal blocks. One of the key steps in the circuit simulation process is solving sparse linear systems. These systems originate from solving large systems of non linear equations using Newton's method and integrating large stiff systems of ordinary differential equations. These systems are of very high dimensions and a considerable fraction of simulation time is spent on solving these systems. Often the solve phase tends to be a bottleneck in the simulation process. Hence high performance sparse solvers that optimize memory usage and solution time are critical components of circuit simulation software. Some of the popular solvers in use in circuit simulation tools are Sparsel.3 and SuperLU. Sparsel.3 is used in SPICE circuit simulation package and SuperLU uses a supernodal factorization algorithm. Experimental results of KLU indicate that it is 1000 times faster than Sparsel.3 and 1.5 to 3 times faster than SuperLU. Circuit matrices show some unique properties. They have a nearly zero free diagonal. They have a roughly symmetric pattern but have unsymmetric values. They are highly sparse and often have a few dense rows and columns. These dense rows/columns arise from voltage sources and current sources in the circuit. Circuit matrices show good amenability to BTF ordering. Though the nonzero pattern of original matrix is unsymmetric, the nonzero pattern of blocks produced by BTF ordering tend to be symmetric. Since circuit matrices are extremely sparse, sparse matrix algorithms such as SuperLU [6] and UMFPACK [7, 8] that nipil] dense BLAS kernels are often inappropriate. Another unique characteristic of circuit matrices is that employing a good ordering strategy keeps the L and U factors sparse. However as we will see in experimental results, typical ordering strategies can lead to high fill-in. In circuit simulation problems, typically the circuit matrix template is gen- erated once and the numerical values of the matrix alone change. In other words, the nonzero pattern of the matrix does not change. This implies that we need to order and factor the matrix once to generate the ordering permutations and the nonzero patterns of L and U factors. For all subsequent matrices, we can use the same information and need only to recompute the numerical values of the L and U factors. This process of skipping analysis and factor phases is called refactorization.Refactorization leads to a significant reduction in run time. Because of the unique characteristics of circuit matrices and their amenability to BTF ordering, KLU is a method well-suited to circuit simulation problems. KLU has been implemented in the C language. It offers a set of API for the analysis phase, factor phase, solve phase and refactor phase. It also offers the ability to solve upto four right hand sides in a single solve step. In addition, it offers trans- pose solve, conjugate transpose solve features and diagnostic tools like pivot growth estimator and condition number estimator. It also offers a MATLAB interface for the API so that KLU can be used from within the MATLAB environment. CHAPTER 2 THEORY: SPARSE LU 2.1 Dense LU Consider the problem of solving the linear system of n equations in n un- knowns: a11x1 + a122 + ... + ainxn = b a21x + a22x2 + ... 2+ a = b2 (2-1) anlXl + an2x2 +... ax, = bn or, in matrix notation, all a12 a n X1 bl a21 a22 ... a*2n x2 b2 (2-2) anl au2 .. a nn Xn b Ax = b where A = (aij), x = (1, X2, ..., n)T and b = (bl,..., b,)T. A well-known approach to solving this equation is Gaussian elimination. Gaussian elimination consists of a series of eliminations of unknowns xi from the original system. Let us briefly review the elimination process. In the first step, the first equation of 2-1 is multiplied by -21 2 a .. and added with the second through nth equation of 2-1 respectively. This would eliminate from second through the equations. After respectively. This would eliminate x1 from second through the nth equations. After the first step, the 2-2 would become all a12 an (1) (1) 0 a22 ... 2 0 a(,) a(n (2-3) where al22 a22 -a21 a12, a32(1) 32 11a a12 and so on. In the second step, x2 will be eliminated by a similar process of computing multipliers and adding the multiplied second equation with the third through nth equations. After n-1 eliminations, the matrix A is transformed to an upper triangular matrix U. The upper triangular system is then solved by back-substitution. An equivalent interpretation of this elimination process is that we have factorized A into a lower triangular matrix L and an upper triangular matrix U where al 1 0 "21 1 0 all a31 32 (1) Un ttnl Un2 all () a (2) a12 a13 a2) a23(1) 0 a33(2) (2-4) (2-5) ~L l 1 1 ((nr-1) 0 0 0 a.t)nn The column k of the lower triangular matrix L consists of the multipliers obtained during step k of the elimination process, with their sign negated. and Mathematically, Ax = b can be rewritten as (LU)x b L(Ux) = b (2-6) Substituting Ux = y in 2-6, we have Ly = b (2-7) Ux y (2-8) By solving these two lower triangular systems, we find the solution to the actual system. The reason for triangularizing the system is to avoid finding the inverse of the original coefficient matrix A. Inverse finding is atleast thrice as expensive as Gaussian elimination in the dense case and often leads to more inaccuracies. 2.2 Sparse LU A sparse matrix is defined as one that has few nonzeros in it. The quan- tification of the adjective 'few' is not specified. The decision as to what kind of algorithm to use (sparse or dense) depends on the fill-in properties of the matrices. However, sparse matrices typically have O(n) nonzero entries. Dense matrices are typically represented by a two dimensional .- i. .'.The zeros of a sparse matrix should not be stored if we want to save memory. This fact makes a two dimensional array unsuitable for representing sparse matrices. Sparse matrices are represented with a different kind of data structure. They can be represented in two different data structures viz.column compressed form or row compressed form. A column compressed form consists of three vectors Ap, Ai and Ax. Ap consists of column pointers. It is of length n+ 1. The start of column k of the input matrix is given by Ap [k]. Ai consists of row indices of the elements. This is a zero based data structure with row indices in the interval [0,n). Ax consists of the actual numerical values of the elements. Thus the elements of a column k of the matrix are held in Ax [Ap [k]...Ap [k+1]). The corresponding row indices are held in Ai [Ap [k]...Ap[k+l]). Equivalently, a row compressed format stores a row pointer vector Ap, a column indices vector Ai and a value vector Ax. For example, the matrix 500 420 318 when represented in column compressed format will be Ap: 0 3 5 6 Ai: 0 1 2 1 2 2 Ax: 5 4 3 2 1 8 and when represented in row compressed format will be Ap: 0 1 3 6 Ai: 0 0 1 0 1 2 Ax: 5 4 2 3 1 8 Let nnz represent the number of nonzeros in a matrix of dimension n n. Then in a dense matrix representation, we will need n2 memory to represent the matrix. In sparse matrix representation, we reduce it to O(n + nnz) and typically nnz < n2. 2.3 Left Looking Gaussian Elimination Let us derive a left looking version of Gaussian elimination. Let an input matrix A of order n n be represented as a product of two triangular matrices L and U. All a12 A13 L11 0 0 U11 U12 U13 a21 a22 a23 121 1 0 0 22 U23 A31 a32 A33 L31 132 L33 0 0 U33 where Aij is a block, aij is a vector and aij is a scalar. The dimensions of different elements in the matrices are as follows: All, L11, U1 are k k blocks a12, u12 are k 1 vectors A13, U13 are k n (k + 1) blocks a21, 21 are 1 k row vectors a22, u22 are scalars a23, U23 are 1 n (k + 1) row vectors A31, L31 are n (k + 1) k blocks a32, 132 are n (k + 1) 1 vectors A33, L33, U33 are n (k + 1) n ( + 1) blocks. From (2-9), we can arrive at the following set of equations. L U11 L U12 LI* U13 121 U11 121 U12 + U22 121 U13 + U23 L31 UN La1 U12 + 132 U22 La1 U13 + 132 U23 + L33 U33 All a12 A13 a21 a22 a23 A31 a32 A33 (2-9) From (2-11), (2-14) and (2-17), we can compute the 2nd column of L and U, assuming we have already computed L11, 121 and L31. We first solve the lower triangular system (2-11) for u12. Then, we solve for u22 using (2-14) by computing the sparse dot product U22 a= 22 121 12 (2 19) Finally we solve (2-17) for 132 as 132 (a32 L31 U12) (2-20) U22 This step of computing the 2nd column of L and U can be considered equiva- lent to solving a lower triangular system as follows: L11 0 0 U12 a12 121 1 0 U22 a22 (221) L31 0 1 132 U22 a32 This mechanism of computing column k of L and U by solving a lower triangular system L x = b is the key step in a left-looking factorization algorithm. As we will see later, Gilbert-Peierls' algorithm revolves around solving this lower triangular system. The algorithm is called a left-looking algorithm since column k of L and U are computed by using the already computed columns 1...k-1 of L. In other words, to compute column k of L and U, one looks only at the already computed columns 1...k-1 in L, that are to the left of the currently computed column k. 2.4 Gilbert-Peierls' Algorithm Gilbert-Peierls' [9] proposed an algorithm for Gaussian elimination with partial pivoting in time proportional to the flop count of the elimination to factor an arbitrary non singular sparse matrix A as PA = LU. If flops(LU) is the number of nonzero multiplications performed when multiplying two matrices L and U, then Gaussian elimination uses exactly flops(LU) multiplications and divisions to factor a matrix A into L and U. Given an input matrix and assuming no partial pivoting, it is possible to predict the nonzero pattern of its factors. However with partial pivoting, it is not possible to predict the exact nonzero pattern of the factors before hand. Finding an upper bound is possible, but the bound can be very loose [10]. Note that computing the nonzero pattern of L and U is a necessary part of Gaussian elimination involving sparse matrices since we do not use two dimensional arrays for representing them but sparse data structures. Gilbert-Peierls' algorithm aims at computing the nonzero pattern of the factors and the numerical values in a total time proportional to O(flops(LU)). It consists of two stages for determining every column of L and U. The first stage is a symbolic analysis stage that computes the nonzero pattern of the column k of the factors. The second stage is the numerical factorization stage that involves solving the lower triangular system Lx = b, that we discussed in the section above. 2.4.1 Symbolic Analysis A sparse Gaussian elimination algorithm with partial pivoting cannot know the exact nonzero structure of the factors ahead of all numerical computation, simply because partial pivoting at column k can introduce new nonzeros in columns k+1 ... n. Solving Lx = b must be done in time proportional to the number of flops performed. Consider a simple column-oriented algorithm in MATLAB notation for solving Lx = b as follows: x =b for j = l:n if x(j) ~= 0 x(j+l:n) = x(j+l:n) L(j+l:n,j) x(j) end The above algorithm takes time 0(n + number of flops performed). The 0(n) term looks harmless, but Lx = b is solved n times in the factorization of A = LU, leading to an unacceptable O(n2) term in the work to factorize A into L times U. To remove the O(n) term, we must replace the algorithm with x =b for each j for which x(j) ~= 0 x(j+l:n) = x(j+l:n) L(j+l:n, j) x(j) end This would reduce the 0(n) term to O(q(b)), where q(b) is the number of nonzeros in b. Note that b is a column of the input matrix A. Thus to solve Lx = b, we need to know the nonzero pattern of x before we compute x itself. Symbolic analysis helps us determine the nonzero pattern of x. Let us say we are computing column k of L and U. Let G = G(Lk) be the directed graph of L with k 1 vertices representing the already computed k 1 columns. G(Lk) has an edge j -- i iff lij / 0. Let f = {ilbi 0} and X = {ixli / 0} Now the elements of X is given by X = ReachG(L)J) (2-22) The nonzero pattern of X is computed by the determining the vertices that are reachable from the vertices of the set 3. The reachability problem can be solved using a classical depth first search in G(Lk) from the vertices of the set P. If bj / 0, then xj / 0. In addition if L-j / 0, then xi / 0 even if bi = 0. This is because a Lij xj contributes to a nonzero in the equation when we solve for xi. During the depth first search, Gilbert-Peierls' algorithm computes the topological order of X. This topological ordering is useful for eliminating unknowns in the Numerical factorization step. C- xj ^------------- X L Figure 2-1: Nonzero pattern of x when solving Lx = b The row indices vector Li of columns 1 ... k 1 of L represents the .,li.'.,ency list of the graph G(Lk). The depth first search takes time proportional to the number of vertices examined plus the number of edges traversed. 2.4.2 Numerical Factorization Numerical factorization consists of solving the system (2-21) for each col- umn k of L and U. Normally we would solve for the unknowns in (2-21) in the increasing order of the row index. The row indices/nonzero pattern computed by depth first search are not necessarily in increasing order. Sorting the indices would increase the time complexity above our O(flops(LU)) goal. However, the require- ment of eliminating unknowns in increasing order can be relaxed to a topological order of the row indices. An unknown xi can be computed, once all the unknowns xj of which it is dependent on are computed. This is obvious when we write the equations comprising a lower triangular solve. Theoretically, the unknowns can be solved in any topological order. The depth first search algorithm gives one such topological order which is sufficient for our case. In our example, the depth first search would have finished exploring vertex i before it finishes exploring vertices j. Hence a topological order given by depth first search would have j appearing before i. This is exactly what we need. Gilbert-Peierls' algorithm starts with an identity L matrix. The entire left looking algorithm can be summarized in MATLAB notation as follows: L= I for k = 1:n x = L \ A(:,k) %(partial pivoting on x can be done here) U(1:k,k) = x(l:k) L(k:n,k) = x(k:n) / U(k,k) end where x = L\b denotes the solution of a sparse lower triangular system. In this case, b is the kth column of A. The total time complexity of Gilbert-Peierls' algorithm is O(rq(A) + flops(LU)). r(A) is the number of nonzeros in the matrix A and flops(LU) is the flop count of the product of the matrices L and U. Typically flops(LU) dominates the complexity and hence the claim of factorizing in time proportional to the flop count. 2.5 Maximum Transversal Duff [11, 12] proposed an algorithm for determining the maximum transversal of a directed graph.The purpose of the algorithm is to find a row permutation that minimizes the zeros on the diagonal of the matrix. For non singular matrices, the algorithm ensures a zero free diagonal. KLU e i'l-1-,t Duff's [11, 12] algorithm to find an unsymmetric permutation of the input matrix to determine a zero- free diagonal. A matrix cannot be permuted to have a zero free diagonal if and only if it is structurally singular. A matrix is structurally singular if there is no permutation of its nonzero pattern that makes it numerically nonsingular. A transversal is defined as a set of nonzeros, no two of which lie in the same row or column, on the diagonal of the permuted matrix. A transversal of maximum length is the maximum transversal. Duff's maximum transversal algorithm consists of representing the matrix as a graph with each vertex corresponding to a row in the matrix. An edge ik -- ik+ exists in the graph if A(ik,jk+i) is a nonzero and A(ik+, jk+i) is an element in the transversal set. A path between vertices io and ik would consist of a sequence of nonzeros (io, jl), (i1,j2), (i -1,j) where the current transversal would include (il,'jl), (i2,j2), (i ,jk'). If there is a nonzero in position (i, jk+i) and no nonzero in row io or column jk+1 is currently on the travseral, it increases the transerval by one by adding the nonzeros (ir,jr+i), r =0, 1, k k to the transversal and removing the nonzeros (ir, j,), r = 1, 2, k k from the transversal. This adding and removing of nonzeros to and from the transversal is called reassignment chain or augmenting path. A vertex or row is said to be assigned if a nonzero in the row is chosen for the transversal. The process of constructing augmenting paths is done by doing a depth first search from an unassigned row io of the matrix and continue till a vertex ik is reached where the path terminates because A(ik, jk+l) is a nonzero and column jK+1 is unassigned. Then the search backtracks to io adding and removing transversal elements thus constructing an augmenting path. Duff's maximum transversal transversal algorithm has a worst case time complexity of O(nT) where T is the number of nonzeros in the matrix and n is the order of the matrix. However in practice, the time complexity is close to O(n + r). The maximum transversal problem can be cast as a maximal matching problem on bipartite graphs. This is only to make a comparison. The maximal matching problem is stated as follows. All A12 A13 A14 Xi bl A22 A23 A24 X2 b2 A33 A34 X3 b3 0 A44 ;4 b4 Figure 2-2: A matrix permuted to BTF form Given an undirected graph G = (V, E), a matching is a subset of the edges M C E such that for all vertices v E V, at most one edge of M is incident on v. A vertex v E V is matched if some edge in M is incident on v, otherwise, v is unmatched. A maximal matching is a matching of maximum card Iilili r, that is a matching M such that for any matching M', we have IMI > |M'|. A maximal matching can be built incrementally, by picking an arbitraty edge e in the graph, deleting any edge that is sharing a vertex with e and repeating until the graph is out of edges. 2.6 Block Triangular Form A block (upper) triangular matrix is similar to a upper triangular matrix except that the diagonals in the former are square blocks instead of scalars. Figure 2-2 shows a matrix permuted to the BTF form. Converting the input matrix to block triangular form is important in that, 1. The part of the matrix below the block diagonal requires no factorization effort. 2. The diagonal blocks are independent of each other.Only the blocks need to be factorized. For example, in figure 2-2, the -i1.--I, ,-I. A444 = b4 can be solved independently for x4 and x4 can be eliminated from the overall system. The system A33X3 = b A34x4 is then solved for x3 and so on. 3. The off diagonal nonzeros do not contribute to any fill-in. Finding a symmetric permutation of a matrix to its BTF form is equivalent to finding the strongly connected components of a graph. A strongly connected component of a directed graph G = (V, E) is a maximal set of vertices C C V such that for every pair of vertices u and v in C, we have both u -- v and v -- u. The vertices u and v are reachable from each other. The Algorithm ei'l-,v d in KLU for symmetric permutation of a matrix to a BTF form, is based on Duff and Reid's [13, 14] algorithm. Duff and Reid provide an implementation for Tarjan's [15] algorithm to determine the strongly connected components of a directed graph. The algorithm has a time complexity of O(n + r) where n is the order of the matrix and T is the number of off diagonal nonzeros in the matrix. The algorithm essentially consists of doing a depth first search from unvisited nodes in the graph. It uses a stack to keep track of nodes being visited and uses a path of the nodes. When all edges in the path are explored, it generates Strongly connected components from the top of stack. Duff's algorithm is different from the method proposed by Cormen, Leiserson, Rivest and Stein [16]. They -I,.1-. -1 doing a depth first search on G computing GT and then running a depth first search on GT on vertices in the decreasing order of their finish times from the first depth first search (the topological order from the first depth first search). J i k \ Pruned, = Nonzero, ( = Fill in Figure 2-3: A symmetric pruning scenario 2.7 Symmetric Pruning Eisenstat and Liu [17] proposed a method called Symmetric Pruning to exploit structural symmetry for cutting down the symbolic analysis time. The cost of depth first search can be cut down by pruning unnecessary edges in the graph of L, G(L). The idea is to replace G(L) by a reduced graph of L. Any graph H can be used in place of G(L), provided that i j exists in H iff i j exists in G(L). If A is symmetric, then the symmetric reduction is just the elimination tree. The symmetric reduction is a subgraph of G(L). It has fewer edges than G(L) and is easier to compute by taking advantage of symmetry in the structure of the factors L and U. Even though the symmetric reduction removes edges, it still preserves the paths between vertices of the original graph. Figure 2-3 shows a symmetric pruning example. If lij $ 0, uji $ 0, then we can prune edges j s, where s > i. The reason behind this is that for any ajk $ 0, ask Will fill in from column j of L for s > k. behind this is that for any ajk / 0, aisk will fill in from column j of L for s > k . The just computed column i of L is used to prune earlier columns. Any future depth first search from vertex j will not visit vertex s, since s would have been visited via i already. Note that every column is pruned only once. KLU inpl-,i- symmetric pruning to speed up the depth first search in the symbolic analysis stage. 2.8 Ordering It is a widely used practice to precede the factorization step of a sparse linear system by an ordering phase. The purpose of the ordering is to generate a permutation P, that reduces the fill-in in the factorization phase of PAPT. A fill-in is defined as a nonzero in a position (i, j) of the factor that was zero in the original matrix. In other words, we have a fill-in if Lij / 0, where Aj = 0. The permuted matrix created by the ordering PAPT creates much less fill-in in factorization phase than the unpermuted matrix A. The ordering mechanism typically takes into account only the structure of the input matrix, without considering the numerical values stored in the matrix. Partial pivoting during factorization changes the row permutation P and hence could potentially increase fill-in as opposed to what was estimated by the ordering scheme. We shall see more about pivoting in the following sections. If the input matrix A is unsymmetric, then the permutation of the matrix A + AT can be used. Various minimum degree algorithms can be used for ordering. Some of the popular ordering schemes include approximate minimum degree(AMD) [18, 19], column approximate minimum degree(COLAMD) [20, 21] among others. COLAMD orders the matrix AAT without forming it explicitly. After permuting an input matrix A into BTF form using the maximum transversal and BTF orderings, KLU attempts to factorize each of the diagonal blocks. It applies the fill reducing ordering algorithm on the block before factoriz- ing it. KLU supports both approximate minimum degree and column approximate minimum degree. Besides, any given ordering algorithm can be plu-.i:'- ..1 into KLU without much effort. Work is being done on integrating a Nested Dissection ordering strategy into KLU as well. Of the various ordering schemes, AMD gives best results on circuit matrices. AMD finds a permutation P to reduce fill-in for the Cholesky factorization of PAPT (of P(A + AT)PT, if A is unsymmetric). AMD assumes no numerical pivoting. AMD attempts to reduce an optimistic estimate of fill-in. COLAMD is an unsymmetric ordering scheme, that computes a column permutation Q to reduce fill-in for Cholesky factorization of (AQ)TAQ. COLAMD attempts to reduce a "pessimitic" estimate (upper bound) of fill-in. Nested Dissection is another ordering scheme that creates permutation such that the input matrix is transformed into block diagonal form with vertex separators. This is a popular ordering scheme. However, it is unsuitable for circuit matrices when applied to the matrix as such. It can be used on the blocks generated by BTF pre-ordering. The idea behind a minimum degree algorithm is as follows: A structurally symmetric matrix A can be represented by an equivalent undirected graph G(V, E) with vertices corresponding to row/column indices. An edge i -> j exists in G if Aj / 0. Consider the figure 2-4. If the matrix is factorized with vertex 1 as the pivot, then after the first Gaussian elimination step, the matrix would be transformed as in figure 2-5. This first step of elimination can be considered equivalent to removing node 1 and all its edges from the graph and adding edges to connect all nodes .,li.I :ent to 1. In other words, the elimination has created a clique of the nodes .,li.I :ent to the eliminated node. Note that there are as many fill-ins in the reduced matrix as there are edges added in the clique formation. In the above example, we have * * * * * * * * Figure 2-4: A symmetric matrix and its graph representation 2 3 4 5 Figure 2-5: The matrix and its graph representation after one step of Gaussian elimination chosen the wrong node as pivot, since node 1 has the maximum degree. Instead if we had chosen a node with minimum degree say 3 or 5 as pivot, then there would have been zero fill-in after the elimination since both 3 and 5 have degree 1. This is the key idea in a minimum degree algorithm. It generates a permuta- tion such that a node with minimum degree is eliminated in each step of Gaussian elimination, thus ensuring a minimal fill-in. The algorithm does not examine the numerical values in the node selection process. It could happen that during partial pivoting, a node other than the one .-..i. -1. .1 by the minimum degree algorithm must be chosen as pivot because of its numerical magnitude. That's exactly the reason why the fill-in estimate produced by the ordering algorithm could be less than that experienced in the factorization phase. 2.9 Pivoting Gaussian elimination fails when the diagonal element in the input matrix happens to be zero. Consider a simple 2 2 system, 0 a12 x1 b1 A 2 1 (2-23) a21 a22 x2 b2 When solving the above system, Gaussian elimination computes the multiplier -a21/all [and multiplies row 1 with this multiplier and adds it to row 2] thus eliminating the coefficient element a21 from the matrix. This step obviously would fail, since all is zero. Now let's see a classical case when the diagonal element is nonzero but close to zero. 0.0001 1 A (2-24) 1 1 The multiplier is -1/0.0001 -104. The factors L and U are 1 0 0.0001 1 L = U 1 (2-25) 104 1 0 -104 The element u22 has the actual value 1 104. However assuming a four digit arithmetic, it would be rounded off to -104. Note that the product of L and U is 0.0001 1 L U (2-26) 1 0 which is different from the original matrix. The reason for this problem is that the multiplier computed is so large that when added with the small element a22 with value 1, it obscured the tiny value present in a22- We can solve these problems with pivoting. In the above two examples, we could interchange rows 1 and 2, to solve the problem. This mechanism of interchanging rows(and columns) and picking a large element as the diagonal, to avoid numerical failures or inaccuracies is called pivoting. To pick a numerically large element as pivot, we could look at the elements in the current column or we could look at the entire submatrix (across both rows and columns). The former is called partial pivoting and the latter is called complete pivoting. For dense matrices, partial pivoting adds a time complexity of O(n2) com- parisons to Gaussian elimination and complete pivoting adds O(n3) comparisons. Complete pivoting is expensive and hence is generally avoided, except for special cases. KLU employs partial pivoting with diagonal preference. As long as the diagonal element is atleast a constant threshold times the largest element in the column, we choose the diagonal as the pivot. This constant threshold is called pivot tolerance. 2.10 Scaling The case where small elements in the matrix get obscured during the elimina- tion process and accuracy of the results gets skewed because of numerical addition is not completely overcome by the pivoting process. Let us see an example of this case. Consider the 2 2 system 105 xi 105 A= (2-27) 1 1 x2 2 When we apply Gaussian elimination with partial pivoting to the above system, the entry all is largest in the first column and hence would continue to be the pivot.After the first step of elimination assuming a four digit arithmetic, we would have 10 105 X1 105 A (2-28) 0 -104 X2 -104 The solution from the above elimination is x = 1, 2 = 0. However the correct solution is close to x1 = 1, x2 1. If we divide each row of the matrix by the largest element in that row(and the corresponding element in the right hand side as well), prior to Gaussian elimination we would have A 10-4 1 X1 129) A= (2-29) 1 1 x2 2 Now if we apply partial pivoting we would have, 1 1 xi 2 A = ] [] [ (2-30) 10-4 1 a2 s And after an elimination step, the result would be 1 1 xi 2 A 1 [ (2-31) 0 1 10-4 X2 1 10-4 which yields the correct solution xi 1, 2 = 1. The process of balancing out the numerical enormity or obscurity on each row or column is termed as scaling. In the above example, we have scaled with respect to the maximum value in a row which is row scaling. Another variant would be to scale with respect to the sum of absolute values of all elements across a row. In column -.. lii,. we would scale with respect to the maximum value in a column or the sum of absolute values of all elements in a cloumn. Row scaling can be considered equivalent to finding an invertible diagonal matrix D1 such that all the rows in the matrix D-1A have equally large numerical values. Once we have such a D1, the solution of the original system Ax = b is equivalent to solving the system Ax = b where A = D-1A and b = D-b. Equilibration is another popular term used for scaling. In KLU, the diagonal elements of the diagonal matrix D1 are either the largest elements in the rows of the original matrix or the sum of the absolute values of the elements in the rows. Besides scaling can be turned off as well, if the simulation environment does not need scaling. Scaling though it offers better numerical results when solving systems, is not mandatory. Its usage depends on the data values that constitute the system and if the values are already balanced, scaling might not be necessary. 2.11 Growth Factor Pivot growth factor is a key diagnostic estimate in determining the stability of Gaussian elimination. Stability of Numerical Algorithms is an important factor in determining the accuracy of the solution. Study of stability is done by a process called Roundoff Error analysis. Roundoff error analysis comprises two sub types called Forward error analysis and Backward error analysis. If the computed solution x is close to the exact solution x, then the algorithm is said to be Forward stable. If the algorithm computes an exact solution to a nearby problem, then the algorithm is said to be Backward stable.Backward stability is the most widely used technique in studying stability of systems. Often the data generated for solving systems have impurity in them or they are distorted by a small amount. Under such circumstances we are interested that the algorithm produce an exact solution to this nearby problem and hence the relevance of backward stability assumes significance. Pivot growth factor is formally defined as max max ( ij a i , where a) is an entry in the reduced matrix A(k) after the kt elimination step. From (2-32), we find that if the entries of the reduced matrix grow arbitrarily, we would have a high growth factor. This arbitrary growth would again lead to inaccuracies in the results. Consider the following 2 2 system. A 10-4 1 Xi 133) A= (2-33) 1 1 x2 2 After one step of Gaussian elimination assuming four digit arithmetic, we would have the reduced system 10-4 1 X1 A = (2-34) 0 1 104 2 2 104 Solving the system yields xl = 0, x2 = 1 which is different from the actual solution x1 1, x2 = 1.The pivot growth factor of the above system is S ma(l,104) 104 S 1 Thus a large pivot growth clearly indicates the inaccuracy in the result. Partial pivoting generally avoids large growth factors. In the above example, if we had applying partial pivoting, we would have got the correct results. But this is not assured and there are cases where partial pivoting might not result in an acceptable growth factor. This necessitates the estimation of the growth factor as a diagnostic tool to detect cases where Gaussian elimination could be unstable. Pivot growth factor is calculated usually in terms of its reciprocal, to avoid numerical overflow problems when the value is very large. (2-32) is a harder to compute equation since it involves calculating the maximum of reduced matrix after every step of elimination. The other definitions of reciprocal growth factor that are easy to compute are as follows: max Sm a3 .) (2-35) P ( iN ua j) max max (236) P ( ij uij ) Equation (2-35) is the definition implemented in KLU and it is a column scaling invariant. It helps unmask a large pivot growth that could be totally masked because of column scaling. 2.12 Condition Number Growth factor is a key estimate in determining the stability of the algorithm. Condition number is a key estimate in determining the amenability or conditioning of a given problem. It is not guaranteed that a highly stable algorithm can yield accurate results for all problems it can solve. The conditioning of the problem has a dominant effect on the accuracy of the solution. Note that while stability deals with the algorithm, conditioning deals with the problem itself. In practical applications like circuit simulation, the data of a problem come from experimental observations. Typically such data have a factor of error or impurities or noise associated with them. Roundoff errors and discretization errors also contribute to impurities in the data. Conditioning of a problem deals with determining how the solution of the problem changes in the presence of impurities. The preceding discussion shows that one often deals with solving problems not with the original data but that with perturbed data. The analysis of effect of perturbation of the problem on the solution is called Perturbation analysis. It helps in determining whether a given problem produces a little or huge variation in solution when perturbed. Let us see what we mean by well or ill conditioned problems. A problem is said to be ill conditioned if a small relative error in data leads to a large relative error in solution irrespective of the algorithm employed. A problem is said to be well conditioned if a small relative error in data does not lead to a large relative error in solution. Accuracy of the computed solution is of primary importance in numerical analysis. Stability of the algorithm and the Conditioning of the given problem are the two factors that directly determine accuracy.A highly stable algorithm well armored with .lii,. partial pivoting and other concepts cannot be guaranteed to yield an accurate solution to an ill-conditioned problem. A backward stable algorithm applied to a well-conditioned problem should yield a solution close to the exact solution. This follows from the definitions of backward stability and well-c. .i'ili iil -. where backward stability assures exact solution to a nearby problem and well-conditioned problem assures that the computed solution to perturbed data is relatively close to the exact solution of the actual problem. Mathematically, let X be some problem. Let X(d) be the solution to the problem for some input d. Let 6d denote a small perturbation in the input d. Now if the relative error in the solution IX(d + 6d) X(d) I Ix(d)l exceeds the relative error in the input 1Jdl d\ then the problem is ill conditioned and well conditioned otherwise. Condition number is a measure of the conditioning of the problem. It shows whether a problem is well or ill conditioned.For the linear system problems of the form Ax = b, the condition number is defined as Cond(A) IIl I IA-l 1 (2-37) Equation (2-37) is arrived at by theory that deals with perturbations either in the input matrix A or the right hand side b or both the matrix and right hand side. Equation (2-37) can be defined with respect to any norm viz. 1, 2 or oo. The system Ax = b is said to be ill-conditioned if the condition number from (2-37) is quite large. Otherwise it is said to be well-conditioned. A naive way to compute the condition number would be to compute the inverse of the matrix, compute the norm of the matrix and its inverse and compute the product. However, computing the inverse is atleast thrice as expensive as solving the linear system Ax = b and hence should be avoided. Hager [22] developed a method for estimating the 1-norm of IIA-11 and the corresponding 1-norm condition number. Hager proposed an optimization approach for estimating IIA-111. The 1-norm of a matrix is formally defined as ||A,||, I|A|1 max (2-38) Hager's algorithm can be briefly described as follows: For A E R*", a convex function is defined as F(x) = |11, 1 (2-39) over the convex set S {x R : I,I < 1} Then I|A|I1 is the global maximum of (2-39). The algorithm involves computing Ax and ATx or computing matrix-vector products. When we want to compute IA-1 I1, it involves computing A-'x and (A-1)Tx which is equivalent to solving Ax = b and ATx = b. We can use KLU to efficiently solve these systems. Higham [23] presents refinements to Hager's algorithm and restricts the number of iterations to five. Higham further presents a simple device and using the higher of the estimates from this device and Hager's algorithm to ensure the estimate is large enough. This device involves solving the linear system Ax = b where i-1 bi =(-1)+1(1 + )i 1, 2, ...n The final estimate is chosen as the maximum from Hager's algorithm and 2 ||-||1 3n KLU's condition number estimator is based on Higham's refinement of Hager's algorithm. 2.13 Depth First Search As we discussed earlier, the nonzero pattern of the kh column of L is deter- mined by the Reachability of the row-indices of kh column of A in the graph of L. The reachability is determined by a depth-first search traversal of the graph of L. The topological order for elimination of variables when solving the lower triangular system Lx = b is also determined by the depth-first search traversal. A classical depth first search algorithm is a recursive one. One of the major problems in a recursive implementation of depth-first search is Stack overflow. Each process is allocated a stack space upon execution. When there is a high number of recursive calls, the stack space is exhausted and the process terminates abruptly. This is a definite possibility in the context of our depth-first search algorithm when we have a dense column of a matrix of a very high dimension. The solution to stack overflow caused by recursion is to replace recursion by iteration. With an iterative or non-recursive function, the entire depth first search happens in a single function stack. The iterative solution uses an array of row indices called pstack. When descending to an .,li.I1:ent node during the search, the row index of the next .,li.1:ent node is stored in the pstack at the position(row/column index) corresponding to the current node. When the search returns to the current node, we know that we next need to descend into the node stored in the pstack at the position corresponding to the current node. Using this extra O(n) memory, the iterative version completes the depth first search in a single function stack. This is an important improvement from the recursive version since it avoids the stack overflow problem that would have been a bottleneck when solving high dimension systems. 2.14 Memory Fragmentation The data structures for L and U are the ones used to represent sparse matri- ces. These comprise 3 vectors. 1. Vector of column pointers 2. Vector of row indices 3. Vector of numerical values There are overall, six vectors needed for the two matrices L and U. Of these, the two vectors of column pointers are of pre-known size namely the size of a block. The remaining four vectors of row indices and numerical values depend on the fill- in estimated by AMD. However AMD gives an optimistic estimate of fill-in.Hence we need to dynamically grow memory for these vectors during the factorization phase if we determine that the fill-in is higher than estimated. The partial pivoting strategy can alter the row ordering determined by AMD and hence is another source of higher fill-in than the estimate from AMD. Dynamically growing these four vectors suffers from the problem of external memory fragmentation. In external fragmentation, free memory is scattered in the memory space. A call for more memory fails because of non-availability of contiguous free memory space. If the scattered free memory areas were contiguous, the memory request would have succeeded. In the context of our problem, the memory request to grow the four vectors could either fail if we run into external fragmentation or succeed when there is enough free space available. When we reallocate or grow memory, there are two types of success cases. In the first case called cheap reallocation, there is enough free memory space abutting the four vectors. Here the memory occupied by a vector is just extended or its end boundary is increased. The start boundary remains the same. In the second case called costly reallocation, there is not enough free memory space abutting a vector. Hence a fresh memory is allocated in another region for the new size of vector and the contents are copied from old location. Finally the old location is freed. With four vectors to grow, there is a failure case because of external fragmen- tation and a costly success case because of costly reallocation. To reduce the failure case and avoid the costly success case, we have coalesced the four vectors into a single vector. This new data structure is byte aligned on double boundary. For every column of L and U, the vector contains the row indices and numerical values of L followed by the row indices and numerical values of U. Multiple integer row indices are stored in a single double location The actual number of integers that can be stored in a double location varies with platform and is determined dynam- ically. The common technique of using integer pointer to point to location aligned on double boundary, is iIpl'-v-d to retrieve or save the row indices. In addition to this coalesced data structure containing the row indices and numerical values, two more length vectors of size n are needed to contain the length of each column of L and U. These length vectors are preallocated once and need not be grown dynamically. Some Memory management schemes never do cheap reallocation. In such schemes, the new data structure serves to reduce external fragmentation only. 2.15 Complex Number Support KLU supports complex matrices and complex right hand sides. KLU also supports solving the transpose system ATx = b for real matrices and solving the conjugate transpose system AHx = b for complex matrices. Initially it relied on the C99 language support for complex numbers. However the C99 specification is not supported across operating systems. For example, earlier versions of Sun Solaris do not support C99. To avoid these compatibility issues, KLU no longer relies on C99 and has its own complex arithmetic implementation. 2.16 Parallelism in KLU When solving a system Ax = b using KLU, we use BTF pre-ordering to convert A into a block upper triangular form. We apply AMD on each block and factorize each block one after the other serially. Alternatively, nested dissection can be applied to each block. Nested dissection ordering converts a block to a doubly bordered block diagonal form. A doubly bordered block diagonal form is similar to a block upper triangular form but has nonzeros on the sub diagonal region. These 0 *5 2 ** 2 S000 3 4 3 0** 4 O0 000*** ** 5 2 1 0000 00 00 A Doubly Bordered Block diagonal Matrix Separator Tree Figure 2-6: A doubly bordered block diagonal matrix and its corresponding vertex separator tree nonzeros form a horizontal strip resembling a border. Similarly the nonzeros in the region above the diagonal form a corresponding vertical strip. The doubly bordered block diagonal form can be thought of as a separator tree. Factorization of the block then involves a post-order traversal of the separator tree. The nodes in the separator tree can be factorized in parallel. The factor- ization of a node would additionally involve computing the schur complement of its parent and of its ancestors in the tree. Once all the children of a node have updated its schur complement, the node is ready to be factorized and it inturn computes the schur complement of its parent and its ancestors. The factorization and computation of schur complement is done in a post-order traversal fashion and the process stops at the root. Parallelism can help in reducing the factorization time. It gains importance in the context of multi processor systems. Work is being done to enable parallelism in KLU. CHAPTER 3 CIRCUIT SIMULATION: APPLICATION OF KLU The KLU algorithm comprises the following steps: 1. Unsymmetric Permutation to block upper triangular form. This consists of two steps. (a) 'lil-iii, tii permutation to ensure a zero free diagonal using maximum transversal. (b) symmetric permutation to block upper triangular form by finding the strongly connected components of the graph. 2. Symmetric permutation of each block(say A) using AMD on A + AT or an 'ln-vimn lli-ti permutation of each block using COLAMD on AAT. These permutations are fill-in reducing orderings on each block. 3. Factorization of each scaled block using Gilbert-Peierls' left looking algorithm with partial pivoting. 4. Solve the system using block-back substitution and account for the off- diagonal entries. The solution is re-permuted to bring it back to original order. Let us first derive the final system that we need to solve taking into account, the different permutations, scaling and pivoting. The original system to solve is Ax: =b (3-1) Let R be the diagonal matrix with the scale factors for each row. Applying scaling, we have RAx = Rb (3-2) Let P' and Q' be the row and column permutation matrices that combine the permutations for maximum transversal and the block upper triangular form together. Applying these permutations together, we have P'RAQ'Q'T = P'Rb. [Q'QT = I,the ,,,-ntt,, matrix] (3-3) Let P and Q be row and column permutation matrices that club the P' and Q' mentioned above with the symmetric permutation produced by AMD and the partial pivoting row permutation produced by factorization. Now, PRAQQTx = PRb or (PRAQ)QTX = PRb (3-4) The matrix (PRAQ) consists of two parts viz. the diagonal blocks that are factor- ized and the off-diagonal elements that are not factorized. (PRAQ) = LU + F where LU represents the factors of all the blocks collectively and F represents the entire off diagonal region. Equation (3-4) now becomes (LU F)QTx = PRb (3-5) x =Q(LU + F)-'(PRb) (3-6) Equation (3-6) consists of two steps. A block back-substitution i.e. computing (LU + F)-l(PRb) followed by applying the column permutation Q. The block-back substitution in (LU + F)-1(PRb) looks cryptic and can be better explained as follows: Consider a simple 3 3 block system L11U11 F12 F13 X1 B1 0 L22U22 F23 X2 2 (37) 0 0 L33U33 X3 B1 The equations corresponding to the above system are: LUnl1 X1 + F12 X2 F13 X3 = B (3-8) L22 22* X2 F23* 3 B2 (39) L33U33* X3 B3 (3-10) In block back substitution, we first solve (3-10) for X3, and then eliminate Xa from (3-9) and (3-8) using the off-diagonal entries. Next, we solve (3-9) for X2 and eliminate X2 from (3-8). Finally we solve (3-8) for X1 3.1 Characteristics of Circuit Matrices Circuit matrices exhibit certain unique characteristics that makes KLU more relevant to them. They are very sparse. Because of their high sparsity, BLAS kernels are not applicable. Circuit matrices often have a few dense rows/columns that originate from voltage or current sources. These dense rows/columns are effectively removed by BTF permutation. Circuit matrices are ..i-'.mmetric, but the nonzero pattern is roughly symmet- ric. They are easily permutable to block upper triangular form. Besides, they have zero-free or nearly zero-free diagonal. Another peculiar feature of circuit matrices is that the nonzero pattern of each block after permutation to block upper triangular form, is more symmetric than the original matrix. Typical ordering strategies applied to the original matrix cause high fill-in whereas when applied to the blocks, leads to less fill-in. The efficiency of the permutation to block upper triangular form shows in the fact the entire sub-diagonal region in the matrix has zero work and the off-diagonal elements do not cause any fill-in since they are not factorized. 3.2 Linear Systems in Circuit Simulation The linear systems in circuit simulation process originate from solving large systems of non linear equations using Newton's method and integrating large stiff systems of ordinary differential equations. These linear systems consist of the coefficients matrix A, the unknowns vector x and the right hand side b. During the course of simulation, the matrix A retains the same nonzero pattern(structurally unchanged) and only undergoes changes in numerical val- ues. Thus the initial analysis phase (permutation to ensure zero-free diagonal, block triangular form and minimum degree ordering on blocks) and factorization phase(that involves symbolic analysis, partial pivoting and symmetric pruning) can be restricted to the initial system alone. Subsequent systems A'x = b where only the numerical values of A' are different from A, can be solved using a mechanism called refactorization. Refactorization simply means to use the same row and column permutations (comprising entire analysis phase and partial pivoting) computed for the initial system, for solving the subsequent systems that have changes only in numerical values. Refactorization substantially reduces run time since the analysis time and factorization time spent on symbolic analysis, partial pivoting, pruning are avoided. The nonzero pattern of the factors L and U are the same as for the initial system. Only Numerical factorization using the pre-computed nonzero pattern and partial pivoting order, is required. The solve step follows the factorization/refactorization step. KLU accomodates solving multiple right hand sides in a single solve step. Upto four right hand sides can be solved in a single step. 3.3 Performance Benchmarks During my internship at a circuit simulation company, I did performance benchmarking of KLU vs SuperLU in the simulation environment. The perfor- mance benchmarks were run on a representative set of examples. The results of these benchmarks are tabulated as follows. (the size of the matrix created in simulation is shown in parenthesis). Table 3-1: Comparison between KLU and SuperLU on overall time and fill-in Overall time Nonzeros(L+U) Netlist KLU SuperLU Speedup KLU SuperLU Problems (301) 1.67 1.24 0.74 1808 1968 Problem (1879) 734.96 1.- ". 0.94 13. i1 13770 Problem (2076) 56.46 53.38 0.95 16403 16551 Problem (7598) 89.63 81.85 0.91 52056 54997 Problem (745) 18.13 16.84 0.93 4156 4231 Problem (1041) 1336.50 1317.30 0.99 13198 1.'' r". Problem (33) 0.40 0.32 0.80 157 176 Problem (10) 4.46 1.570 0.35 40 41 Problem (180) 222.26 202.29 0.91 1845 1922 Probleml0(6833) 6222.20 6410.40 1.03 56322 ..1 Problemll (1960) 181.78 179.50 0.99 13527 13963 Probleml2 (200004) 6.25 8.47 1.35 500011 600011 Probleml3 (20004) 0.47 0.57 1.22 50011 60011 Probleml4 (40004) 0.97 1.31 1.35 100011 120011 Probleml5 (100000) 1.76 2.08 1.18 299998 499932 Probleml6 (7602) 217.80 255.88 1.17 156311 184362 Probleml7(10922) 671.10 770.58 1.15 267237 299937 Probleml8 (14842) 1017.00 1238.10 1.22 326811 425661 Probleml9 (19362) 1099.00 1284.40 1.17 550409 581277 Problem20 (24482) 3029.00 3116.90 1.03 684139 788047 Problem21 (30202) 2904.00 3507.40 1.21 933131 1049463 The circuits Probleml6-Problem21 are TFT LCD arrays similar to memory circuits. These examples were run atleast twice with each algorithm empl-v -d viz. KLU or SuperLU to get consistent results. The results are tabulated in tables 3-1, 3-2 and 3-3. The "overall time" in table 3-1 comprises of analysis, factorization and solve time. Table 3-2: Comparison between KLU and SuperLU on factor time and solve time Factor time Factor Solve time Solve per iteration speedup per iteration speedup Netlist KLU SuperLU KLU SuperLU Problems (301) 0.000067 0.000084 1.26 0.000020 0.000019 0.92 Problem (1879) 0.000409 0.000377 0.92 0.000162 0.000100 0.61 Problem (2076) 0.000352 0.000317 0.90 0.000122 0.000083 0.68 Problem (7598) 0.001336 0.001318 0.99 0.000677 0.000326 0.48 Problem (745) 0.000083 0.000063 0.76 0.000035 0.000022 0.62 Problem (1041) 0.000321 0.000406 1.26 0.000079 0.000075 0.95 Problem (33) 0.000004 0.000004 0.96 0.000003 0.000002 0.73 Problem (10) 0.000001 0.000001 0.89 0.000001 0.000001 0.80 Problem (180) 0.000036 0.000042 1.16 0.000014 0.000011 0.76 Probleml0(6833) 0.001556 0.001530 0.98 0.000674 0.000365 0.54 Problemll (1960) 0.000663 0.000753 1.14 0.000136 0.000122 0.90 Probleml2 (200004) 0.103900 0.345500 3.33 0.11 I',I. 0 0.041220 1.35 Probleml3 (20004) 0.005672 0.020110 3.55 0.001633 0.002735 1.67 Probleml4 (40004) 0.014430 0.056080 3.89 0.004806 0.iiii.11 .1 1.43 Probleml5 (100000) 0.168700 0.283700 1.68 0.018600 0.033610 1.81 Probleml6 (7602) 0.009996 0.017620 1.76 0.001654 0.001439 0.87 Probleml7(10922) 0.018380 0.030010 1.63 0.002542 0.001783 0.70 Probleml8 (14842) 0.024020 0.046130 1.92 0.003187 0.002492 0.78 Probleml9 (19362) 0.054730 0.080280 1.47 0.005321 0.003620 0.68 Problem20 (24482) 0.121400 0.122600 1.01 0.006009 0.004705 0.78 Problem21 (30202) 0.124000 0.188700 1.52 0.009268 0.006778 0.73 3.4 Analyses and Findings The following are my inferences based on the results: Most of the matrices in these experiments are small matrices of the order of a few thousands. Fill in is much better with KLU. The 'BTF' ordering combined with the 'AMD' ordering on each of the blocks does a good job in reducing the fill in count to a good extent. The improvement in fill in averages around 6% for many examples. Table 3-3: Ordering results using BTF+AMD in KLU on circuit matrices Nonzeros Nonzeros No of Max Block Nonzeros Netlist Size in A in L+U Blocks size off diagonal Problems 301 1484 1808 7 295 89 Problem 1879 12926 13 .i 1 19 1861 4307 Problem 2076 15821 16403 13 2064 6689 Problem 7598 48922 52056 13 7586 19018 Problem 745 3966 4156 128 426 1719 Problem 1041 9654 13198 67 975 2608 Problem 33 153 157 7 27 50 Problem 10 39 40 5 6 16 Problem 180 1503 1845 19 162 661 ProblemlO 6833 43250 56322 507 6282 12594 Problemll 1960 11187 13527 58 1715 2959 Probleml2 200004 500011 500011 200003 2 300005 Probleml3 20004 50011 50011 20003 2 30005 Probleml4 40004 100011 100011 40003 2 60005 Probleml5 100000 299998 299998 1 100000 0 Probleml6 7602 32653 156311 103 7500 251 Probleml7 10922 46983 267237 123 10800 301 Probleml8 14842 63913 326811 143 14700 351 Probleml9 19362 83443 550409 163 19200 401 Problem20 24482 105573 684139 183 24300 451 Problem21 30202 130303 933131 203 30000 501 There is no 'fill in' in the Probleml2, Probleml3, Probleml4 and Probleml5 netlists with KLU. This is quite significant in that there is no memory overhead in these examples. In the case of circuits Probleml6-Problem21, the gain in fill in with KLU ranges from 6% in the Probleml9 example to 24% in Probleml8 example. The gain in fill in translates into faster factorization because few nonzeros imply less work. The factorization time thus is expected to be low. It turns out to be true in most of the cases factorizationn speedup of 1.6x in Probleml6 Problem21 examples and 3x for Probleml2-Probleml4 examples). For some cases like Problem2 and ProblemlO, the factorization time remains same for both KLU and SuperLU. Solve phase turns out to be slow in KLU. Probably, the off diagonal nonzero handling part tends to account for the extra time spent in the solve phase. One way of reducing the solve overhead in KLU would be solving multiple RHS at the same time. In a single solve iteration, 4 equations are solved. On the whole, the overall time speedup is 1.2 for Probleml6-Problem21 examples and Probleml2-Probleml4 examples. For others, the overall time is almost the same between the two algorithms. BTF is not able to find out many blocks for most of the matrices and there happens to be a single large block and the remaining are singletons. But the AMD ordering does a good job in getting the fill in count reduced. The off-diagonal nonzero count is not high. 3.5 Alternate Ordering Experiments Different ordering strategies were employed to analyze the fill in behaviour. The statistics using different ordering schemes are listed in table 3-4. 'COLAMD' is not listed in the table. It typically gives poor ordering and causes more fill in than AMD, MMD and AMD+BTF. AMD alone gives relatively higher fill in compared to AMD+BTF in most of the matrices. However AMD alone gives mixed results in comparison with MMD. It matches or outperforms MMD in fill in on Probleml2-Probleml4 and Probleml6-Problem21 matrices. However it gives Table 3-4: Comparison of ordering results produced by BTF+AMD, AMD, MMD Nonzeros Fill-in Fill-in Fill-in Netlist in A BTF+AMD AMD MMD Problems (301) 1484 1808 1928 1968 Problem (1879) 12926 13-.11 13857 13770 Problem (2076) 15821 16403 18041 16551 Problem (7598) 48922 52056 57975 54997 Problem (745) 3966 4156 5562 4231 Problem (1041) 9654 13198 14020 1.''. Problem (33) 153 157 178 176 Problem (10) 39 40 41 41 Problem (180) 1503 1845 1968 1922 Probleml0(6833) 43250 56322 133739 ."..-.1 Problemll (1960) 11187 13527 14800 13963 Probleml2 (200004) 500011 500011 600011 600011 Probleml3 (20004) 50011 50011 60011 60011 Probleml4 (40004) 100011 100011 120011 120011 Probleml5 (100000) 299998 299998 299998 499932 Probleml6 (7602) 32653 156311 165264 184362 Probleml7(10922) 46983 267237 255228 299937 Probleml8 (14842) 63913 326811 387668 425661 Probleml9 (19362) 83443 550409 451397 581277 Problem20 (24482) 105573 684139 718891 788047 Problem21 (30202) 130303 933131 839226 1049463 poor fill in for rest of the circuits when compared with MMD. AMD alone beats AMD+BTF in fill-in for some of the examples viz. Probleml7, Probleml9 and Problem21. Overall, to sum it up, BTF+AMD is the best ordering strategy to use. 3.6 Experiments with UF Sparse Matrix Collection There are a number of circuit matrices in the UF sparse matrix collection. Different experiments were done with these matrices as well on different parameters like 1. ordering 2. timing of different phases of KLU 3. ordering 4. performance comparison between KLU and UMFPACK. UMFPACK is a unsymmetric multifrontal sparse solver and uses an unsymmetric COLAMD ordering or an AMD ordering, automatically selecting which method to use based on the matrix characteristics. Gilbert-Peierls' algorithm is available in MATLAB as an LU factorization scheme. These experiments were done on a Mandrake 10.0 linux OS, Intel Pentium M Processor with clock frequency of 1400 MHz and RAM 768 kB. 3.6.1 Different Ordering Schemes in KLU There are six different ordering schemes possible in KLU. The three fill reducing schemes are AMD, COLAMD and User Specified Ordering. These three fill reducing orderings can be combined with a BTF preordering or no preordering.Hence six different schemes. However in this experiment user specified ordering is not considered. That leaves us with four different schemes.The table 3-5 lists the fill (number of nonzeros in L+U plus the number of off-diagonal elements) for each of these ordering schemes. From table 3-5, we find that BTF+AMD gives consistently better fill-in across different circuit matrices. However there are observable aberrations. For example, with the circuit Bomhof/circuit-2, AMD and COLAMD give better fill-in than BTF+AMD. These results determine again that BTF+AMD is the best ordering scheme to use for circuit matrices. 3.6.2 Timing Different Phases in KLU These experiments show the time spent in different phases of the algorithm. BTF pre-ordering followed by a AMD fill-reducing ordering is eiipl-'w] d. As mentioned earlier, there are four different phases. 1. Analysis phase: This comprises the pre-ordering and fill reducing ordering. 2. Factor phase: This comprises the factorization part of the algorithm. It includes the symbolic analysis phase, partial pivoting, symmetric pruning steps. 3. Refactor phase: This comprises the part where we do a factorization using the already pre-computed partial pivoting permutation and the nonzero pattern of the L and U matrices. There is no symbolic analysis, partial pivoting and symmetric pruning in refactor phase. 4. Solve phase: This comprises the solve phase of the algorithm. Solving a single right hand side was experimented. When given a set of matrices with the same nonzero pattern, the analysis and factor phases are done only once. The refactor phase is then repeated for the remaining matrices. Solve phase is repeated as many times as there are systems to solve. Table 3-6 consists of the timing results. Analysis phase consumes most of the time spent in the algorithm. Refactor time is typically 3 or 4 times smaller than factor time and 8 times smaller than analysis time plus factor time put together. Solve phase consumes the least fraction of time spent. 3.6.3 Ordering Quality among KLU, UMFPACK and Gilbert-Peierls The table 3-7 compares the ordering quality among KLU using BTF+AMD, UMFPACK using COLAMD or AMD and Gilbert-Peierls' using AMD. We can Table 3-5: Fill-in with four different schemes in KLU Nonzeros BTF + BTF + Matrix in A AMD COLAMD AMD COLAMD Sandia/adder_dcop_01 (1813) 11156 13525 13895 18848 21799 Sandia/adder_trans_01 (1814) 14579 20769 36711 24365 119519 Sandia/fpga_dcop_01 (1220) 5892 7891 8118 S .' 12016 Sandia/fpga_trans_01 (1220) 7382 10152 12776 10669 21051 Sandia/init_adder1 (1813) 11156 13525 13895 18848 21799 Sandia/mult_dcop_01 (25187) 193276 226673 228301 2176328 1460322 Sandia/oscil_dcop_01 (430) 1544 2934 3086 3078 3295 Sandia/oscil_trans_01 (430) 1614 2842 3247 2897 3259 Bomhof/circuitl (2624) 35823 44879 775363 44815 775720 Bomhof/circuit_2 (4510) 21199 40434 89315 36197 36196 Bomhof/circuit_3 (12127) 48137 86718 98911 245336 744245 Grund/b2_ss (1089) 3895 9994 9212 26971 9334 Grund/b_dyn (1089) 4144 11806 10597 33057 10544 Grund/bayer02 (1;'i ;') 63307 889914 245259 1365142 307979 Grund/d_dyn (87) 230 481 461 619 494 Grund/d_ss (53) 144 302 292 382 298 Grund/megl (2904) 58142 232042 184471 1526780 378904 Grund/meg4 (5860) 25258 42 ;', 310126 43250 328144 Grund/poli (4008) 8188 12200 12208 12238 12453 Grund/poli_large (15575) 33033 48718 48817 49806 51970 Hamm/add20 (2395) 13151 19554 3 1. ;i 19554 3 1. ;i Hamm/add32 (4960) 19848 28754 36030 28754 36030 Hamm/bcircuit (68902) 375558 1033240 1692668 1033240 1692668 Hamm/hcircuit (105676) 513072 731634 2.2 ;-. 2 736080 4425310 Hamm/memplus (17758) 99147 137030 3282586 137030 3282586 Hamm/scircuit (170998) '."-'i ;i. 2481122 64102-. 2481832 6427526 R. i.l /rajat03 (7602) 32653 163913 235111 172666 236938 R. i.I /rajat04 (1041) 8725 12863 80518 18618 158000 R.,i.l /rajat05 (301) 1250 1926 2053 2101 3131 R.ii.I /rajatll (135) 665 890 978 944 1129 R. i.l /rajatl2 (1879) 12818 15308 273667 15571 128317 R. i.l /rajatl3 (7598) 48762 5 '"i 90368 64791 5234287 R. i.I /rajatl4 (180) 1475 1994 2249 2105 2345 Table 3-6: Time in seconds, spent in different phases in KLU Analysis Factor Refactor Solve Matrix time time time time Sandia/adder_dcop_01 (1813) 0.0028 0.0028 0.0007 0.0003 Sandia/adder_trans_01 (1814) 0.0045 0.0038 0.0013 0.0003 Sandia/fpga_dcop_01 (1220) 0.0018 0.0015 0.0004 0.0002 Sandia/fpgatrans_01 (1220) 0.0022 0.0017 0.0005 0.0002 Sandia/init_adderl (1813) 0.0028 0.0028 0.0007 0.0003 Sandia/mult_dcop_01 (25187) 0.2922 0.0522 0.0196 0.0069 Sandia/oscil_dcop_01 (430) 0.0008 0.0006 0.0002 0.0001 Sandia/oscil_trans_01 (430) 0.0008 0.0006 0.0002 0.0001 Bomhof/circuit_l (2624) 0.0098 0.0085 0.0053 0.0006 Bomhof/circuit_2 (4510) 0.0082 0.0064 0.0034 0.0006 Bomhof/circuit_3 (12127) 0.0231 0.0174 0.0056 0.0020 Grund/b2_ss (1089) 0.0031 0.0018 0.0005 0.0001 Grund/b_dyn (1089) 0.0033 0.0021 0.0007 0.0002 Grund/bayer02 (1:;' ;.) 0.0584 0.2541 0.2070 0.0090 Grund/d_dyn (87) 0.0002 0.0001 0.0000 0.0000 Grund/d_ss (53) 0.0001 0.0001 0.0000 0.0000 Grund/megl (2904) 0.0178 0.0853 0.0590 0.0033 Grund/meg4 (5860) 0.0157 0.0094 0.0028 0.0009 Grund/poli (4008) 0.0017 0.0010 0.0004 0.0003 Grund/polilarge (15575) 0.0064 0.0045 0.0018 0.0014 Hamm/add20 (2395) 0.0056 0.0044 0.0014 0.0003 Hamm/add32 (4960) 0.0084 0.0074 0.0019 0.0006 Hamm/bcircuit (68902) 0.3120 0.2318 0.1011 0.0257 Hamm/hcircuit (105676) 0.2553 0.1920 0.0658 0.0235 Hamm/memplus (17758) 0.0576 0.0358 0.0157 0.0036 Hamm/scircuit (170998) 0.8491 0.6364 0.3311 0.0622 R.ii. ,/rajat03 (7602) 0.0152 0.0440 0.0306 0.0034 R.ii. ,/rajat04 (1041) 0.0029 0.0023 0.0008 0.0002 R.,i. I/rajat05 (301) 0.0005 0.0005 0.0001 0.0000 R.,i. I/rajatll (135) 0.0002 0.0002 0.0001 0.0000 R.,i. ,/rajatl2 (1879) 0.0038 0.0027 0.0008 0.0002 R.,i. ,/rajatl3 (7598) 0.0122 0.0105 0.0033 0.0011 R.,i. ,/rajatl4 (180) 0.0004 0.0003 0.0001 0.0000 infer from the results that KLU produces better ordering than UMFPACK and Gilbert-Peierls' algorithm. For KLU, the following MATLAB code determines the fill. opts = [0.1 1.2 1.2 10 1 0 0 0 ] ; [x info] = klus(A,b,opts, []) ; fill = info (31) + info (32) + info(8) ; For UMFPACK, the snippet is [L U P Q] = lu(A) ; fill = nnz(L) + nnz(U) ; For Gilbert-Peierls' the code is Q = amd(A) ; [L U P] = lu(A(Q,Q), 0.1) ; fill = nnz(L) + nnz(U) ; 3.6.4 Performance Comparison between KLU and UMFPACK This experiment compares the total time spent in the analysis, factor and solve phases by the algorithms. The results are listed in table 3-8. KLU outperforms UMFPACK in time. For KLU, the following snippet in MATLAB is used: tic [x info] = klus(A,b,opts) tl = toc ; For UMFPACK, the following code in MATLAB is used to find the total time: tic x= A \b; t2 = toc ; Table 3-7: Fill-in among KLU, UMFPACK and Gilbert-Peierls Matrix nnz KLU UMFPACK Gilbert-Peierls Sandia/adder_dcop_01 (1813) 11156 13525 14658 18825 Sandia/adder_trans_01 (1814) 14579 20769 20769 24365 Sandia/fpga_dcop_01 (1220) 5892 7891 8106 I ' Sandia/fpgatrans_01 (1220) 7382 10152 10152 10669 Sandia/iit_adder1 (1813) 11156 13525 14658 18825 Sandia/mult_dcop_01 (25187) 193276 226673 556746 1300902 Sandia/oscil_dcop_01 (430) 1544 2934 2852 3198 Sandia/oscil_trans_01 (430) 1614 2842 3069 2897 Bomhof/circuit-_ (2624) 35823 44879 44879 44815 Bomhof/circuit_2 (4510) 21199 40434 35107 38618 Bomhof/circuit_3 (12127) 48137 86718 84117 245323 Grund/b2_ss (1089) 3895 9994 8309 22444 Grund/b_dyn (1089) 4144 11806 9642 41092 Grund/bayer02 (l:;' ;,) 63307 889914 259329 973093 Grund/d_dyn (87) 230 481 442 523 Grund/dss (53) 144 302 268 395 Grund/megl (2904) 58142 232042 151740 1212904 Grund/meg4 (5860) 25258 42 ;' 42 ;'- 43250 Grund/poli (4008) 8188 12200 12200 12239 Grund/polilarge (15575) 33033 48718 48745 49803 Hamm/add20 (2395) 13151 19554 19554 19554 Hamm/add32 (4960) 19848 28754 28754 28754 Hamm/bcircuit (68902) 375558 1033240 1033240 1033240 Hamm/hcircuit (105676) 513072 731634 730906 736080 Hamm/memplus (17758) 99147 137030 137030 137030 Hamm/scircuit (170998) 1 .' ;i, 2481122 2481122 2481832 R.i i. /rajat03 (7602) 32653 163913 163913 172666 R.i i. ,/rajat04 (1041) 8725 12863 12860 18613 R.,i. ,/rajat05 (301) 1250 1926 1944 2101 R.ii. ,/rajatll (135) 665 890 890 944 R. i., /rajatl2 (1879) 12818 15308 15308 15571 R. i., /rajatl3 (7598) 48762 5'.r, 5 '". 64791 R. i.1 /rajatl4 (180) 1475 1994 1994 2105 Table 3-8: Performance comparison between KLU and UMFPACK Matrix KLU UMFPACK Sandia/adder_dcop_01 (1813) 0.0116 0.0344 Sandia/adder_trans_01 (1814) 0.0112 0.0401 Sandia/fpga_dcop_01 (1220) 0.0050 0.0257 Sandia/fpgatrans_01 (1220) 0.0054 0.0203 Sandia/iit_adder1 (1813) 0.0109 0.0323 Sandia/mult_dcop_01 (25187) 1.2383 1.1615 Sandia/oscil_dcop_01 (430) 0.0022 0.0070 Sandia/oscil_trans_01 (430) 0.0019 0.0074 Bomhof/circuitl (2624) 0.0232 0.1223 Bomhof/circuit_2 (4510) 0.0201 0.0522 Bomhof/circuit_3 (12127) 0.0579 0.1713 Grund/b2_ss (1089) 0.0066 0.0168 Grund/b_dyn (1089) 0.0072 0.0175 Grund/bayer02 (1;;' ;,) 0.6089 0.3565 Grund/d_dyn (87) 0.0005 0.0014 Grund/dss (53) 0.0003 0.0010 Grund/megl (2904) 0.1326 0.1018 Grund/meg4 (5860) 0.0571 0.1111 Grund/poli (4008) 0.0050 0.0121 Grund/polilarge (15575) 0.0208 0.0497 Hamm/add20 (2395) 0.0123 0.0506 Hamm/add32 (4960) 0.0201 0.0738 Hamm/bcircuit (68902) 0.7213 1.8823 Hamm/hcircuit (105676) 0.7313 2.7764 Hamm/memplus (17758) 0.1232 0.8140 Hamm/scircuit (170998) 1.9812 7.3448 R.ii.-./rajat03 (7602) 0.0793 0.1883 R.ii. ,/rajat04 (1041) 0.0068 0., 1- R.,i. ,/rajat05 (301) 0.0014 0.0046 R.,i.i /rajatll (135) 0.0007 0.0023 R.ii.,/rajatl2 (1879) 0.0087 0.0355 R.,i.,I/rajatl3 (7598) 0.0330 0.1229 R.ii. ,/rajatl4 (180) 0.0010 0.0032 CHAPTER 4 USER GUIDE FOR KLU 4.1 The Primary KLU Structures 4.1.1 klu_common This is a control structure that contains both input control parameters for KLU as well as output statistics computed by the algorithm. It is a mandatory parameter for all the KLU routines. Its contents are listed as follows: tol Type: double Input parameter for pivot tolerance for diagonal preference.Default value is 0.001 growth Type: double Input parameter for reallocation growth size of LU factors. Default value is 1.2 initmem_amd Type: double Input parameter for initial memory size with AMD. Initial memory size = initmemamd nnz(L) + n. Default value is 1.2 initmem Type: double Input parameter for initial memory size with COLAMD. Initial memory size = initmem nnz(A) + n. Default value is 10 btf Type: int Input parameter to use BTF pre-ord' ii,. or not. Default value is 1 (to use BTF) * ordering Type: int Input parameter to specify which fill reducing ordering to use. 0- AMD,1 COLAMD, 2= user P and Q. Default is 0. * scale Type: int Input parameter to specify which scaling strategy to use. 0= none, 1= sum, 2= max. Default is 0 * singular_proc Type: int Input parameter to specify whether to stop on singularity or continue. 0= stop, 1= continue. Default is 0. * status Type: int Output parameter that indicates the result of the KLU function call. Values are KLU_OK(0) if OK and < 0 if error. Error values are KLU_SINGULAR (-1), KLU_OUT_OF_MEMORY (-2), KLU_INVALID (-3) * nrealloc Type: int Output parameter.Contains number of reallocations of L and U * singular_col Type: int Output parameter.Contains the column no of singular column if any * noffdiag Type: int Output parameter.Contains the number of off-diagonal pivots chosen 4.1.2 klu_symbolic This structure encapsulates the information related to the analysis phase. The members of the structure are listed as follows: symmetry Type: double Contains the symmetry of largest block estflops Type: double Contains the estimated factorization flop count Inz Type: double Contains the estimated nonzeros in L including diagonals unz Type: double Contains the estimated nonzeros in U including diagonals Lnz Type: double * Array of size n, but only Lnz [O..nblocks-1] is used. Contains the estimated number of nonzeros in each block Sn Type: int Contains the size of input matrix A where A is n-by-n nz Type: int Contains the number of entries in input matrix *P Type: int * Array of size n. Contains the row permutation from ordering *Q Type: int * Array of size n. Contains the column permutation from ordering *R Type: int * Array of size n+1, but only R [0..nblocks] is used. Contains the start and end column/row index for each block. Block k goes from R[k] to R[k+l] 1 nzoff Type: int Contains the number of nonzeros in off-diagonal blocks nblocks Type: int Contains the number of blocks maxblock Type: int Contains the size of largest block ordering Type: int Contains the ordering used (0 = AMD, 1 = COLAMD, 2 = GIVEN) do_btf Type: int Indicates whether or not BTF preordering was requested The members symmetry, est flops, Inz, unz, Lnz are computed only when AMD is used. The remaining members are computed for all orderings. 4.1.3 klu_numeric This structure encapsulates information related to the factor phase. It contains the LU factors of each block, pivot row permutation, and the entries in the off- diagonal blocks among others. Its contents are listed as follows: umin Type: double Contains the minimum absolute diagonal entry in U umax Type: double Contains the maximum absolute diagonal entry in U nblocks Type: int Contains the number of blocks Inz Type: int Contains actual number of nonzeros in L excluding diagonals unz Type: int Contains actual number of nonzeros in U excluding diagonals Pnum Type: int * Array of size n.Contains the final pivot permutation Pinv Type: int * Array of size n.Contains the inverse of final pivot permutation Lbip Type: int ** Array of size nblocks. Each element is an .11,i.,i of size block size + 1. Each element contains the column pointers for L factor of the corresponding block * Ubip Type: int ** Array of size nblocks. Each element is an .11.i i. of size block size + 1. Each element contains the column pointers for U factor of the corresponding block * Lblen Type: int ** Array of size nblocks. Each element is an .11,i.,i of size block size. Each element contains the column lengths for L factor of the corresponding block * Ublen Type: int ** Array of size nblocks. Each element is an .11,i.,i of size block size. Each element contains the column lengths for U factor of the corresponding block * LUbx Type: void ** Array of size nblocks. Each element is an .11,i.,i containing the row indices and numerical values of LU factors of the corresponding block. The diagonals of LU factors are not stored here * Udiag Type: void ** Array of size nblocks.Each element is an .11,i.,i of size block size. Each element contains the diagonal values of U factor of the corresponding block * Singleton Type: void * Array of size nblocks.Contains the singleton values * Rs Type: double * Array of size n. Contains the row scaling factors * scale Type: int Indicates the scaling strategy used. (0 = none, 1 = sum, 2 = max) * Work Type: void * Permanent workspace used for factorization and solve. It is of size MAX (4n numerical values, n numerical values + 6*maxblock into's) * worksize Type: size_t Contains the size (in bytes) of Work allocated above * Xwork Type: void * This is an alias into Numeric-gWork * Iwork Type: int * An integer alias into Xwork + n * Offp Type: int * Array of size n+1. Contains the column pointers for off-diagonal elements. * Offi Type: int * Array of size number of off-diagonal entries. Contains the row indices of off-diagonal elements * Offx Type: void * Array of size number of off-diagonal entries. Contains the numerical values of off-diagonal elements 4.2 KLU Routines The user callable KLU routines in the C language are explained in this section. The following guidelines are applicable to all routines except when explicitly stated otherwise in the description of a routine. 1. : All the arguments are required except when explicitly stated as optional. If optional, a user can pass NULL for the corresponding argument. 2. : The control input/output argument "C. III." .'" of type "klucommon *" is a required argument for all routines. 3. : All arguments other than the above mentioned control argument "Com- ii. ,ii", are input arguments and are not modified. 4.2.1 klu_analyze klu_symbolic *klu_analyze ( int n, int Ap [ ], int Ai [ ], klucommon *Common ) This routine orders the matrix using BTF if specified and the fill reducing ordering specified. Returns a pointer to klusymbolic structure that contains the ordering information. All arguments are required n: Size of the input matrix A where A is n*n. Ap: Array of column pointers for the input matrix. Size n+1. Ai: Array of row indices. Size number of nonzeros in A. Common: The control input/output structure. 4.2.2 klu_analyze_given klu_symbolic *klu_analyze_given ( int n, int Ap [ ], int Ai [ ], int Puser [ ], int Quser [ ], klucommon *Common ) ; This routine orders the matrix using BTF if specified and the given Puser and QUser as fill reducing ordering. If Puser and Quser are NULL, then the natural ordering is used. Returns a pointer to klusymbolic structure that contains the ordering information. Arguments n: Size of the input matrix A where A is n*n. Required. Ap: Array of column pointers for the input matrix. Size n+1. Required. Ai: Array of row indices. Size number of nonzeros in A. Required. Puser: Optional row permutation to use. Quser: Optional column permutation to use. Common: The control input/output structure. 4.2.3 klu_*factor klunumeric *Numeric klufactor ( int Ap [ ], int Ai [ , double Ax [ ], klu_symbolic *Symbolic, klucommon *Common ) ; This routine factors a real matrix. There is a complex version of this routine kluz_factor that factors a complex matrix and has same function declaration as klu_factor. Both use the results of a call to kluanalyze. Returns a pointer to klunumeric if successful. NULL otherwise. All the arguments are required. Ap: Array of column pointers for the input matrix. Size n+1. Ai: Array of row indices. Size number of nonzeros in A. Ax: Array of numerical values. Size number of nonzeros in A. In the complex case, the i., should consist of real and imaginary parts of each numerical value as .-li. ,:ent pairs. Symbolic: The structure that contains the results from a call to kluanalyze. Common: The control input/output structure. The status field in Common is set to indicate if the routine was successful or not. 4.2.4 klu_*solve void klusolve klu_symbolic *Symbolic, klu_numeric *Numeric, int Idim, int nrhs, double B [ ], klucommon *Common This routine solves a real system. There is a complex version of this routine kluz_solve, that solves a complex system and has the same function declaration as klu_solve. Both use the results of a call to kluanalyze and klu_*factor. Return type is void. The rhs vector B is overwritten with the solution. All Arguments are required. Symbolic: The structure that contains the results from a call to kluanalyze. Numeric: The structure that contains the results from a call to klu_*factor. Idim: The leading dimension of the right hand side B. nrhs: The number of right hand sides being solved. B: The right hand side. It is a vector of length Idim nrhs. It can be real or complex depending on whether a real or complex system is being solved. If complex, the real and imaginary parts of the rhs numerical value must be stored as .,li.I :ent pairs. It is overwritten with the solution. Common: The control input/output structure. 4.2.5 klu_*tsolve void klutsolve klusymbolic *Symbolic, klu_numeric *Numeric, int Idim, int nrhs, int conjsolve, double B [ ], klu common *Common This routine is similar to klusolve except that it solves a transpose system A'x = b. This routine solves a real system. Again there is a complex version of this routine kluz_tsolve for solving complex systems and has the same function declaration as klutsolve. It also offers to do a conjugate transpose solve for the complex system AHx = b. Return type is void. The rhs vector B is overwritten with the solution. All arguments are required. The descriptions for all arguments except conj_solve are same as those for klu_*solve. The argument conli--. .1-,I is relevant only for complex case. It takes two values 1 = CONJUGATE TRANSPOSE SOLVE, 0 = TRANSPOSE SOLVE. 4.2.6 klu r..,1 I ..r void klurefactor int Ap [ ], int Ai [ ], double Ax [ ], klu_symbolic *Symbolic, klu_numeric *Numeric, klucommon *Common ) ; This routine does a refactor of the matrix using the previously computed ordering information in Symbolic and the nonzero pattern of the LU factors in Numeric objects. It assumes same partial pivoting order computed in Numeric. It changes only the numerical values of the LU factors stored in Numeric object. It has a complex version klu_z_refactor with the same function prototype to handle complex cases. Return type is void. The numerical values of LU factors in Numeric parame- ter are updated. All arguments are required. Ap: Array of column pointers for the input matrix. Size n+1. Ai: Array of row indices. Size number of nonzeros in A. Ax: Array of numerical values. Size number of nonzeros in A. In the complex case, the .- i.,- should consist of real and imaginary parts of each numerical value as .-li. ,:ent pairs. Symbolic: The structure that contains the results from a call to kluanalyze. Numeric: Input/output argument. The structure contains the results from a call to klu_*factor. The numerical values of LU factors are overwritten with the ones for the current matrix being factorized. Common: The control input/output structure. The status field in Common is set to indicate if the routine was successful or not. 4.2.7 klu_defaults void kludefaults klucommon *Common ) ; This routine sets the default values for the control input parameters of the klu_common object. The default values are listed in the description of the klu_common structure. A call to this routine is required unless the user sets the control input parameters explicitly. Return type is void. The argument Common is required. The control input parameters in Com- mon are set to default values. 4.2.8 klu_*rec_pivotgrowth double klu_rec_pivot_growth < int Ap [ , int Ai [ ], double Ax [ ], klu_symbolic *Symbolic, klu_numeric *Numeric, klucommon *Common ) ; This routine computes the reciprocal pivot growth of the factorization algo- rithm. The complex version of this routine kluzrecpivotgrowth handles complex matrices and has the same function declaration. The pivot growth estimate is returned. All arguments are required. Ap: Array of column pointers for the input matrix. Size n+1. Ai: Array of row indices. Size number of nonzeros in A. Ax: Array of numerical values. Size number of nonzeros in A. In the complex case, the .- i.,- should consist of real and imaginary parts of each numerical value as .-li. ,:ent pairs. Symbolic: The structure that contains the results from a call to kluanalyze. Numeric: The structure that contains the results from a call to klu_*factor. Common: The control input/output structure. The status field in Common is set to indicate if the routine was successful or not. 4.2.9 klu_*estimatecondnumber double klu_estimate_cond_number ( int Ap [ ], double Ax [ ], klu_symbolic *Symbolic, klu_numeric *Numeric, klucommon *Common This routine computes the condition number estimate of the input matrix. As before, the complex version of this routine kluzestimatecondnumber has the same function declaration and handles complex matrices. The condition number estimate is returned. All arguments are required. Ap: Array of column pointers for the input matrix. Size n+1. Ax: Array of numerical values. Size number of nonzeros in A. In the complex case, the .- i.,i- should consist of real and imaginary parts of each numerical value as .,i.1:ent pairs. Symbolic: The structure that contains the results from a call to kluanalyze. Numeric: The structure that contains the results from a call to klu_*factor. Common: The control input/output structure. The status field in Common is set to indicate if the routine was successful or not. 4.2.10 klu_free_symbolic void klu_free_symbolic klu_symbolic **Symbolic, klucommon *Common ) ; This routine deallocates or frees the contents of the klusymbolic object. The Symbolic parameter must be a valid object computed by a call to kluanalyze or klu_analyze_given. Return type is void. All arguments are required. Symbolic: Input/Output argument. Must be a valid object computed by a call to kluanalyze or klu_analyze_given. If NULL, the routine just returns. Common: The control input/output structure. 4.2.11 klu_free_numeric void klufreenumeric ( klu_numeric **Numeric, klucommon *Common ) ; This routine frees the klunumeric object computed by a call to klufactor or kluz_factor routines. It resets the pointer to klu-numeric to NULL. There is a complex version of this routine called klu_zfreenumeric with the same function declaration to handle the complex case. Return type is void. All arguments are required. Numeric. Input/Output argument. The contents of the klunumeric object are freed. The pointer to klunumeric object is set to NULL. Common: The control input/output structure. REFERENCES [1] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for FORTRAN usage, ACM Trans. Math. Soft., 5: 308-323, 1979. [2] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Soft., 14: 1-17, 1988. [3] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm 656: An extended set of FORTRAN basic linear algebra subprograms, ACM Trans. Math. Soft., 14: 18-32, 1988. [4] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. TH.iiiiii.,.lii-. A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16: 1-17, 1990. [5] J. J. Dongarra, J. Du Croz, I. S. Duff, and S. H.,,iiii.i.lii,.. Algorithm 679: A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16: 18-28, 1990. [6] James W. Demmel, Stanley C. Eisenstat, John R. Gilbert, Xi....,,- S. Li and Joseph W. H. Liu, A supernodal approach to sparse partial pivoting, SIAM J. Matrix Al,,i. !-: and Applications, 20(3): 720-755, 1999. [7] Timothy A. Davis, I.S.Duff, An unsymmetric-pattern multifrontal method for sparse LU factorization, SIAM J. Matrix Aili,,,1.: and Applic., 18(1): 140-158, 1997. [8] Timothy A. Davis, Algorithm 832: UMFPACK, an rnii-inin,-tric-pattern multifrontal method with a column pre-ordering strategy, ACM Trans. Math. Software, 30(2): 196-199, 2004. [9] John R.Gilbert and Tim Peierls, Sparse partial pivoting in time proportional to arithmetic operations, SIAM J. Sci. Stat. Comput., 9(5): ,-,2 873, 1988. [10] A. George and E. Ng, An implementation of Gaussian elimination with partial pivoting for sparse systems. SIAM J. Sci. Statist. Comput., 6(2): 390-409, 1985. [11] Iain S. Duff, On algorithms for obtaining a maximum transversal, ACMI Transactions on Mathematical Software, 7(3): 315-330, 1981. [12] Iain S. Duff, Algorithm 575 permutations for a zero-free diagonal, ACM Transactions on Mathematical Software, 7(3): 387-390, 1981. [13] Iain S. Duff and John K. Reid, Algorithm 529: permutations to block triangular form, ACM Trans. on Mathematical Software, 4(2): 189-192, 1978. [14] Iain S. Duff and John K. Reid, An implementation of Tarjan's algorithm for the block triangular form of a matrix, ACM Trans. on Mathematical Software, 4(2): 137-147, 1978. [15] R.E. Tarjan, Depth first search and linear graph algorithms, SIAM J. Coi(,iitillm., 1: 146-160, 1972. [16] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms, Second Edition 2001, MIT Press, Cambridge. [17] S.C. Eisenstat and J.W.H. Liu, Exploiting structural symmetry in a sparse partial pivoting code, SIAM J.Sci. Comput., 14(1): 253-257, 1993. [18] P. R. Amestoy, T. A. Davis, and I. S. Duff,An approximate minimum degree ordering algorithm, SIAM J. Matrix Anal. Applic., 17(4): 886-905, 1996. [19] P.R. Amestoy, T.A. Davis, and I.S. Duff, Algorithm 837: AMD, an ap- proximate minimum degree ordering algorithm, ACM Transactions on Mathematical Software, 30(3): 381-388, 2004. [20] Timothy A. Davis, John R. Gilbert, Stefan I. Larimore, and Esmond G. Ng. A column approximate minimum degree ordering algorithm, ACM1I Transactions on Mathematical Software, 30(3): 353-376, 2004. [21] Timothy A. Davis John R. Gilbert Stefan I. Larimore Esmond G. Ng, Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm, ACM Transactions on Mathematical Software, 30(3): 377-380, 2004. [22] W.W. Hager, Condition estimates, SIAM J. Sci. Stat. Comput., 5,2: 311-316, 1984. [23] Nicholas J. Higham, Fortran codes for estimating the one-norm of a real or complex matrix, with applications to condition estimation, ACM Trans. on Mathematical Software., 14(4): 381-396, 1988. BIOGRAPHICAL SKETCH Ekanathan was born in Tirunelveli, India, on October 2, 1977. He received his Bachelor of Technology degree in chemical engineering from Anna University, Chennai, India, in M.v 1998. He worked with Infosys Technologies at Brussels, Bel- gium, as programmer/analyst from 1998 till 2001 and with SAP AG at Walldorf, Germany, as software developer from 2001 till 2003. Currently he is pursuing his Master of Science degree in computer science at University of Florida. His interests and hobbies include travel, reading novels and magazines, music and movies. |

Full Text |

PAGE 1 KLU{A HIGH PERF ORMANCE SP ARSE LINEAR SOL VER F OR CIR CUIT SIMULA TION PR OBLEMS By EKANA THAN P ALAMAD AI NA T ARAJAN A THESIS PRESENTED TO THE GRADUA TE SCHOOL OF THE UNIVERSITY OF FLORID A IN P AR TIAL FULFILLMENT OF THE REQUIREMENTS F OR THE DEGREE OF MASTER OF SCIENCE UNIVERSITY OF FLORID A 2005 PAGE 2 Cop yrigh t 2005 b y Ek anathan P alamadai Natara jan PAGE 3 I dedicate this w ork to m y mother Sa vitri who has b een a source of inspiration and supp ort to me. PAGE 4 A CKNO WLEDGMENTS I w ould lik e to thank Dr. Timoth y Da vis, m y advisor for in tro ducing me to the area of sparse matrix algorithms and linear solv ers. I started only with m y bac kground in n umerical analysis and algorithms, a y ear and half bac k. The insigh ts and kno wledge I ha v e gained since then in the area and in implemen ting a sparse linear solv er lik e KLU w ould not ha v e b een p ossible but for Dr. Da vis' guidance and help. I thank him for giving me an opp ortunit y to w ork on KLU. I w ould lik e to thank Dr. Jose F ortes and Dr. Aruna v a Banerjee for their supp ort and help and for serving on m y committee. I w ould lik e to thank CISE administrativ e sta for helping me at dieren t times during m y master's researc h w ork. iv PAGE 5 T ABLE OF CONTENTS page A CKNO WLEDGMENTS . . . . . . . . . . . . . . iv LIST OF T ABLES . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . viii ABSTRA CT . . . . . . . . . . . . . . . . . . ix CHAPTER1 INTR ODUCTION . . . . . . . . . . . . . . . 1 2 THEOR Y: SP ARSE LU . . . . . . . . . . . . . 5 2.1 Dense LU . . . . . . . . . . . . . . . 5 2.2 Sparse LU . . . . . . . . . . . . . . . 7 2.3 Left Lo oking Gaussian Elimination . . . . . . . . 8 2.4 Gilb ert-P eierls' Algorithm . . . . . . . . . . . 10 2.4.1 Sym b olic Analysis . . . . . . . . . . . 11 2.4.2 Numerical F actorization . . . . . . . . . . 13 2.5 Maxim um T ransv ersal . . . . . . . . . . . . 14 2.6 Blo c k T riangular F orm . . . . . . . . . . . . 16 2.7 Symmetric Pruning . . . . . . . . . . . . . 18 2.8 Ordering . . . . . . . . . . . . . . . . 19 2.9 Piv oting . . . . . . . . . . . . . . . . 22 2.10 Scaling . . . . . . . . . . . . . . . . 23 2.11 Gro wth F actor . . . . . . . . . . . . . . 25 2.12 Condition Num b er . . . . . . . . . . . . . 27 2.13 Depth First Searc h . . . . . . . . . . . . . 30 2.14 Memory F ragmen tation . . . . . . . . . . . . 31 2.15 Complex Num b er Supp ort . . . . . . . . . . . 33 2.16 P arallelism in KLU . . . . . . . . . . . . . 33 3 CIR CUIT SIMULA TION: APPLICA TION OF KLU . . . . . 35 3.1 Characteristics of Circuit Matrices . . . . . . . . . 37 3.2 Linear Systems in Circuit Sim ulation . . . . . . . . 38 3.3 P erformance Benc hmarks . . . . . . . . . . . 39 3.4 Analyses and Findings . . . . . . . . . . . . 41 3.5 Alternate Ordering Exp erimen ts . . . . . . . . . 42 v PAGE 6 3.6 Exp erimen ts with UF Sparse Matrix Collection . . . . . 44 3.6.1 Dieren t Ordering Sc hemes in KLU . . . . . . 44 3.6.2 Timing Dieren t Phases in KLU . . . . . . . 45 3.6.3 Ordering Qualit y among KLU, UMFP A CK and Gilb ertP eierls . . . . . . . . . . . . . . 45 3.6.4 P erformance Comparison b et w een KLU and UMFP A CK . 48 4 USER GUIDE F OR KLU . . . . . . . . . . . . . 51 4.1 The Primary KLU Structures . . . . . . . . . . 51 4.1.1 klu common . . . . . . . . . . . . . 51 4.1.2 klu sym b olic . . . . . . . . . . . . . 53 4.1.3 klu n umeric . . . . . . . . . . . . . 55 4.2 KLU Routines . . . . . . . . . . . . . . 58 4.2.1 klu analyze . . . . . . . . . . . . . 58 4.2.2 klu analyze giv en . . . . . . . . . . . 59 4.2.3 klu *factor . . . . . . . . . . . . . 59 4.2.4 klu *solv e . . . . . . . . . . . . . 60 4.2.5 klu *tsolv e . . . . . . . . . . . . . 61 4.2.6 klu *refactor . . . . . . . . . . . . . 62 4.2.7 klu defaults . . . . . . . . . . . . . 63 4.2.8 klu *rec piv ot gro wth . . . . . . . . . . 63 4.2.9 klu *estimate cond n um b er . . . . . . . . . 64 4.2.10 klu free sym b olic . . . . . . . . . . . 65 4.2.11 klu free n umeric . . . . . . . . . . . . 66 REFERENCES . . . . . . . . . . . . . . . . . 67 BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . 69 vi PAGE 7 LIST OF T ABLES T able page 3{1 Comparison b et w een KLU and Sup erLU on o v erall time and ll-in . 39 3{2 Comparison b et w een KLU and Sup erLU on factor time and solv e time 40 3{3 Ordering results using BTF+AMD in KLU on circuit matrices . . 41 3{4 Comparison of ordering results pro duced b y BTF+AMD, AMD, MMD 43 3{5 Fill-in with four dieren t sc hemes in KLU . . . . . . . . 46 3{6 Time in seconds, sp en t in dieren t phases in KLU . . . . . 47 3{7 Fill-in among KLU, UMFP A CK and Gilb ert-P eierls . . . . . 49 3{8 P erformance comparison b et w een KLU and UMFP A CK . . . . 50 vii PAGE 8 LIST OF FIGURES Figure page 2{1 Nonzero pattern of x when solving Lx = b . . . . . . . 13 2{2 A matrix p erm uted to BTF form . . . . . . . . . . 16 2{3 A symmetric pruning scenario . . . . . . . . . . . 18 2{4 A symmetric matrix and its graph represen tation . . . . . 21 2{5 The matrix and its graph represen tation after one step of Gaussian elimination . . . . . . . . . . . . . . . 21 2{6 A doubly b ordered blo c k diagonal matrix and its corresp onding v ertex separator tree . . . . . . . . . . . . . . 34 viii PAGE 9 Abstract of Thesis Presen ted to the Graduate Sc ho ol of the Univ ersit y of Florida in P artial F ulllmen t of the Requiremen ts for the Degree of Master of Science KLU{A HIGH PERF ORMANCE SP ARSE LINEAR SOL VER F OR CIR CUIT SIMULA TION PR OBLEMS By Ek anathan P alamadai Natara jan August 2005 Chair: Dr. Timoth y A. Da vis Ma jor Departmen t: Computer and Information Science and Engineering The thesis w ork fo cuses on KLU, a sparse high p erformance linear solv er for circuit sim ulation matrices. During the circuit sim ulation pro cess, one of the k ey steps is to solv e sparse systems of linear equations of v ery high order. KLU targets solving these systems with ecien t ordering mec hanisms and high p erformance factorization and solv e algorithms. KLU uses a h ybrid ordering strategy that comprises an unsymmetric p erm utation to ensure zero free diagonal, a symmetric p erm utation to blo c k upp er triangular form and a ll reducing ordering suc h as appro ximate minim um degree. The factorization is based on Gilb ert-P eierls' left-lo oking algorithm with partial piv oting. KLU also includes features lik e symmetric pruning to cut do wn sym b olic analysis costs. It oers to solv e up to four righ t hand sides in a single solv e step. In addition, it oers transp ose and conjugate transp ose solv e capabilities and imp ortan t diagnostic features to estimate the recipro cal piv ot gro wth of the factorization and condition n um b er of the input matrix. The algorithm is implemen ted in the C language with MA TLAB in terfaces as w ell. The MA TLAB in terfaces enable a user to in v ok e KLU routines from within ix PAGE 10 the MA TLAB en vironmen t. The implemen tation w as tested on circuit matrices and the results determined. KLU ac hiev es sup erior ll-in qualit y with its h ybrid ordering strategy and ac hiev es a go o d p erformance sp eed-up when compared with existing sparse linear solv ers for circuit problems. The thesis highligh ts the w ork b eing done on exploiting parallelism in KLU as w ell. x PAGE 11 CHAPTER 1 INTR ODUCTION Sparse is b eautiful. Solving systems of linear equations of the form Ax = b is a fundamen tal and imp ortan t area of high p erformance computing. The matrix A is called the co ecien t matrix and b the righ t hand side v ector. The v ector x is the solution to the equation. There are a n um b er of metho ds a v ailable for solving suc h systems. Some of the p opular ones are Gaussian elimination, QR factorization using Householder transformations or Giv ens rotations and Cholesky factorization. Gaussian elimination with partial piv oting is the most widely used algorithm for solving linear systems b ecause of its stabilit y and b etter time complexit y Cholesky can b e used only when A is symmetric p ositiv e denite. Some systems that are solv ed comprise a dense co ecien t matrix A By dense, w e mean most of the elemen ts in A are nonzero. There are high p erformance subroutines suc h as the BLAS [ 1 2 3 4 5 ] that can maximize rop coun t for suc h dense matrices. The in teresting systems are those where the co ecien t matrix A happ ens to b e sparse. By sparse, w e mean the matrix has few nonzero en tries(hereafter referred to simply as 'nonzeros'). The adjectiv e 'few' is not w elldened as w e will see in c hapter t w o. When matrices tend to b e sparse, w e need to nd out eectiv e w a ys to store the matrix in memory since w e w an t to a v oid storing zeros of the matrix. When w e store only the nonzeros in the matrix, it has consequences in the factorization algorithm as w ell. One t ypical example w ould b e w e do not kno w b efore hand ho w nonzeros w ould app ear in the L and U factors when w e factorize the matrix. While w e a v oid storing the zeros, w e also w an t to ac hiev e go o d time complexit y when solving sparse systems. If the time sp en t to 1 PAGE 12 2 solv e sparse systems remains same as for dense systems, w e ha v e not done an y b etter. KLU stands for Clark Ken t LU, since it is based on Gilb ert-P eierls' algorithm, a non-sup erno dal algorithm, whic h is the predecessor to Sup erLU, a sup erno dal algorithm. KLU is a sparse high p erformance linear solv er that emplo ys h ybrid ordering mec hanisms and elegan t factorization and solv e algorithms. It ac hiev es high qualit y ll-in rate and b eats man y existing solv ers in run time, when used for matrices arising in circuit sim ulation. There are sev eral ra v ours of Gaussian elimination. A left-lo oking Gaussian elimination algorithm factorizes the matrix left to righ t computing columns of L and U. A righ t-lo oking v ersion factorizes the matrix from top-left to b ottom-righ t computing column of L and ro w of U. Both ha v e their adv an tages and disadv an tages. KLU uses a left lo oking algorithm called Gilb ert-P eierls' algorithm. Gilb ert-P eierls' comprises a graph theoretical sym b olic analysis phase that identies the nonzero pattern of eac h column of L and U factors and a left-lo oking n umerical factorization phase with partial piv oting that calculates the n umerical v alues of the factors. KLU uses Symmetric Pruning to cut do wn sym b olic analysis cost. W e shall lo ok in detail on these features in c hapter t w o. A critical issue in linear solv ers is Ordering. Ordering means p erm uting the ro ws and columns of a matrix, so that the ll-in in the L and U factors is reduced to a minim um. A ll-in is dened as a nonzero app earing in either of the matrices L or U, while the elemen t in the corresp onding p osition in A is a zero. L ij or U ij is a ll-in if A ij is a zero. Fill-in has ob vious consequences in memory in that the factorization algorithm could create dense L and U factors that can exhaust a v ailable memory A go o d ordering algorithm yields a lo w ll-in in the factors. Finding the ordering that giv es minimal ll-in is an NP complete problem. So PAGE 13 3 ordering algorithms use heuristics. KLU accomo dates m ultiple ordering sc hemes lik e AMD, COLAMD and an y user generated p erm utation. There are other orderings for dieren t purp oses. F or example, one could order a matrix to ensure that it has no zeros on the diagonal. Otherwise, the Gaussian elimination w ould fail. Another ordering sc heme could reduce the factorization w ork. KLU emplo ys t w o suc h orderings namely an unsymmetric ordering that ensures a zero free diagonal and a symmetric ordering that p erm utes the matrix in to a blo c k upp er triangular form (BTF) that restricts factorization to only the diagonal blo c ks. One of the k ey steps in the circuit sim ulation pro cess is solving sparse linear systems. These systems originate from solving large systems of non linear equations using Newton's metho d and in tegrating large sti systems of ordinary dieren tial equations. These systems are of v ery high dimensions and a considerable fraction of sim ulation time is sp en t on solving these systems. Often the solv e phase tends to b e a b ottlenec k in the sim ulation pro cess. Hence high p erformance sparse solv ers that optimize memory usage and solution time are critical comp onen ts of circuit sim ulation soft w are. Some of the p opular solv ers in use in circuit sim ulation to ols are Sparse1.3 and Sup erLU. Sparse1.3 is used in SPICE circuit sim ulation pac k age and Sup erLU uses a sup erno dal factorization algorithm. Exp erimen tal results of KLU indicate that it is 1000 times faster than Sparse1.3 and 1.5 to 3 times faster than Sup erLU. Circuit matrices sho w some unique prop erties. They ha v e a nearly zero free diagonal. They ha v e a roughly symmetric pattern but ha v e unsymmetric v alues. They are highly sparse and often ha v e a few dense ro ws and columns. These dense ro ws/columns arise from v oltage sources and curren t sources in the circuit. Circuit matrices sho w go o d amenabilit y to BTF ordering. Though the nonzero pattern of original matrix is unsymmetric, the nonzero pattern of blo c ks pro duced b y BTF PAGE 14 4 ordering tend to b e symmetric. Since circuit matrices are extremely sparse, sparse matrix algorithms suc h as Sup erLU [ 6 ] and UMFP A CK [ 7 8 ] that emplo y dense BLAS k ernels are often inappropriate. Another unique c haracteristic of circuit matrices is that emplo ying a go o d ordering strategy k eeps the L and U factors sparse. Ho w ev er as w e will see in exp erimen tal results, t ypical ordering strategies can lead to high ll-in. In circuit sim ulation problems, t ypically the circuit matrix template is generated once and the n umerical v alues of the matrix alone c hange. In other w ords, the nonzero pattern of the matrix do es not c hange. This implies that w e need to order and factor the matrix once to generate the ordering p erm utations and the nonzero patterns of L and U factors. F or all subsequen t matrices, w e can use the same information and need only to recompute the n umerical v alues of the L and U factors. This pro cess of skipping analysis and factor phases is called refactorization.Refactorization leads to a signican t reduction in run time. Because of the unique c haracteristics of circuit matrices and their amenabilit y to BTF ordering, KLU is a metho d w ell-suited to circuit sim ulation problems. KLU has b een implemen ted in the C language. It oers a set of API for the analysis phase, factor phase, solv e phase and refactor phase. It also oers the abilit y to solv e upto four righ t hand sides in a single solv e step. In addition, it oers transp ose solv e, conjugate transp ose solv e features and diagnostic to ols lik e piv ot gro wth estimator and condition n um b er estimator. It also oers a MA TLAB in terface for the API so that KLU can b e used from within the MA TLAB en vironmen t. PAGE 15 CHAPTER 2 THEOR Y: SP ARSE LU 2.1 Dense LU Consider the problem of solving the linear system of n equations in n unkno wns: a 11 x 1 + a 12 x 2 + ::: + a 1 n x n = b 1 a 21 x 1 + a 22 x 2 + ::: + a 2 n x n = b 2 ... (2{1) a n 1 x 1 + a n 2 x 2 + ::: + a nn x n = b n or, in matrix notation, 266666664 a 11 a 12 a 1 n a 21 a 22 a 2 n ... a n 1 a n 2 a nn 377777775 266666664 x 1 x 2 ... x n 377777775 = 266666664 b 1 b 2 ... b n 377777775 (2{2) Ax = b where A = ( a ij ), x = ( x 1 ; x 2 ; :::; x n ) T and b = ( b 1 ; :::; b n ) T : A w ell-kno wn approac h to solving this equation is Gaussian elimination. Gaussian elimination consists of a series of eliminations of unkno wns x i from the original system. Let us briery review the elimination pro cess. In the rst step, the rst equation of 2{1 is m ultiplied b y a 21 a 11 ; a 31 a 11 ; ::: a n 1 a 11 and added with the second through n th equation of 2{1 resp ectiv ely This w ould eliminate x 1 from second through the n th equations. After 5 PAGE 16 6 the rst step, the 2{2 w ould b ecome 266666664 a 11 a 12 a 1 n 0 a (1)22 a (1)2 n ... 0 a (1)n 2 a (1)nn 377777775 266666664 x 1 x 2 ... x n 377777775 = 266666664 b 1 b (1)2 ... b (1)n 377777775 (2{3) where a (1)22 = a 22 a 21 a 11 a 12 a 3 2 (1) = a 32 a 31 a 11 a 12 and so on. In the second step, x 2 will b e eliminated b y a similar pro cess of computing m ultipliers and adding the m ultiplied second equation with the third through n th equations. After n-1 eliminations, the matrix A is transformed to an upp er triangular matrix U The upp er triangular system is then solv ed b y bac k-substitution. An equiv alen t in terpretation of this elimination pro cess is that w e ha v e factorized A in to a lo w er triangular matrix L and an upp er triangular matrix U where L = 2666666666664 1 0 0 0 a 21 a 11 1 0 0 a 31 a 11 a (1)32 a (1)22 1 0 ... a n 1 a 11 a (1)n 2 a (1)22 a (2)n 3 a (2)33 1 3777777777775 (2{4) and U = 266666666664 a 11 a 12 a 13 a 1 n 0 a (1)22 a 2 3 (1) a (1)2 n 0 0 a 3 3 (2) a (2)3 n ... 0 0 0 a ( n 1) nn 377777777775 (2{5) The column k of the lo w er triangular matrix L consists of the m ultipliers obtained during step k of the elimination pro cess, with their sign negated. PAGE 17 7 Mathematically Ax = b can b e rewritten as ( LU ) x = b L ( U x ) = b (2{6) Substituting U x = y in 2{6 w e ha v e Ly = b (2{7) U x = y (2{8) By solving these t w o lo w er triangular systems, w e nd the solution to the actual system. The reason for triangularizing the system is to a v oid nding the in v erse of the original co ecien t matrix A. In v erse nding is atleast thrice as exp ensiv e as Gaussian elimination in the dense case and often leads to more inaccuracies. 2.2 Sparse LU A sparse matrix is dened as one that has few nonzeros in it. The quantication of the adjectiv e 'few' is not sp ecied. The decision as to what kind of algorithm to use (sparse or dense) dep ends on the ll-in prop erties of the matrices. Ho w ev er, sparse matrices t ypically ha v e O ( n ) nonzero en tries. Dense matrices are t ypically represen ted b y a t w o dimensional arra y .The zeros of a sparse matrix should not b e stored if w e w an t to sa v e memory This fact mak es a t w o dimensional arra y unsuitable for represen ting sparse matrices. Sparse matrices are represen ted with a dieren t kind of data structure. They can b e represen ted in t w o dieren t data structures viz.column compressed form or ro w compressed form. A column compressed form consists of three v ectors Ap, Ai and Ax. Ap consists of column p oin ters. It is of length n + 1. The start of column k of the input matrix is giv en b y Ap [k]. PAGE 18 8 Ai consists of ro w indices of the elemen ts. This is a zero based data structure with ro w indices in the in terv al [0,n). Ax consists of the actual n umerical v alues of the elemen ts. Th us the elemen ts of a column k of the matrix are held in Ax [Ap [k]...Ap [k+1]). The corresp onding ro w indices are held in Ai [Ap [k]...Ap[k+1]). Equiv alen tly a ro w compressed format stores a ro w p oin ter v ector Ap, a column indices v ector Ai and a v alue v ector Ax. F or example, the matrix 266664 5 0 0 4 2 0 3 1 8 377775 when represen ted in column compressed format will b e Ap: 0 3 5 6 Ai: 0 1 2 1 2 2 Ax: 5 4 3 2 1 8 and when represen ted in ro w compressed format will b e Ap: 0 1 3 6 Ai: 0 0 1 0 1 2 Ax: 5 4 2 3 1 8 Let nnz represen t the n um b er of nonzeros in a matrix of dimension n n Then in a dense matrix represen tation, w e will need n 2 memory to represen t the matrix. In sparse matrix represen tation, w e reduce it to O ( n + nnz ) and t ypically nnz n 2 2.3 Left Lo oking Gaussian Elimination Let us deriv e a left lo oking v ersion of Gaussian elimination. Let an input matrix A of order n n b e represen ted as a pro duct of t w o triangular matrices L and U. PAGE 19 9 Let 266664 A 11 a 12 A 13 a 21 a 22 a 23 A 31 a 32 A 33 377775 = 266664 L 11 0 0 l 21 1 0 L 31 l 32 L 33 377775 266664 U 11 u 12 U 13 0 u 22 u 23 0 0 U 33 377775 (2{9) where A ij is a blo c k, a ij is a v ector and a ij is a scalar. The dimensions of dieren t elemen ts in the matrices are as follo ws: A 11 ; L 11 ; U 11 are k k blo c ks a 12 ; u 12 are k 1 v ectors A 13 ; U 13 are k n ( k + 1) blo c ks a 21 ; l 21 are 1 k ro w v ectors a 22 ; u 22 are scalars a 23 ; u 23 are 1 n ( k + 1) ro w v ectors A 31 ; L 31 are n ( k + 1) k blo c ks a 32 ; l 32 are n ( k + 1) 1 v ectors A 33 ; L 33 ; U 33 are n ( k + 1) n ( k + 1) blo c ks. F rom ( 2{9 ), w e can arriv e at the follo wing set of equations. L 11 U 11 = A 11 (2{10) L 11 u 12 = a 12 (2{11) L 11 U 13 = A 13 (2{12) l 21 U 11 = a 21 (2{13) l 21 u 12 + u 22 = a 22 (2{14) l 21 U 13 + u 23 = a 23 (2{15) L 31 U 11 = A 31 (2{16) L 31 u 12 + l 32 u 22 = a 32 (2{17) L 31 U 13 + l 32 u 23 + L 33 U 33 = A 33 (2{18) PAGE 20 10 F rom ( 2{11 ), ( 2{14 ) and ( 2{17 ), w e can compute the 2 nd column of L and U, assuming w e ha v e already computed L 11 l 21 and L 31 W e rst solv e the lo w er triangular system ( 2{11 ) for u 12 Then, w e solv e for u 22 using ( 2{14 ) b y computing the sparse dot pro duct u 22 = a 22 l 21 u 12 (2{19) Finally w e solv e ( 2{17 ) for l 32 as l 32 = 1 u 22 ( a 32 L 31 u 12 ) (2{20) This step of computing the 2 nd column of L and U can b e considered equiv alen t to solving a lo w er triangular system as follo ws: 266664 L 11 0 0 l 21 1 0 L 31 0 1 377775 266664 u 12 u 22 l 32 u 22 377775 = 266664 a 12 a 22 a 32 377775 (2{21) This mec hanism of computing column k of L and U b y solving a lo w er triangular system L x = b is the k ey step in a left-lo oking factorization algorithm. As w e will see later, Gilb ert-P eierls' algorithm rev olv es around solving this lo w er triangular system. The algorithm is called a left-lo oking algorithm since column k of L and U are computed b y using the already computed columns 1...k-1 of L. In other w ords, to compute column k of L and U, one lo oks only at the already computed columns 1...k-1 in L, that are to the left of the curren tly computed column k. 2.4 Gilb ert-P eierls' Algorithm Gilb ert-P eierls' [ 9 ] prop osed an algorithm for Gaussian elimination with partial piv oting in time prop ortional to the rop coun t of the elimination to factor an arbitrary non singular sparse matrix A as P A = LU If f l ops ( LU ) is the n um b er PAGE 21 11 of nonzero m ultiplications p erformed when m ultiplying t w o matrices L and U then Gaussian elimination uses exactly f l ops ( LU ) m ultiplications and divisions to factor a matrix A in to L and U Giv en an input matrix and assuming no partial piv oting, it is p ossible to predict the nonzero pattern of its factors. Ho w ev er with partial piv oting, it is not p ossible to predict the exact nonzero pattern of the factors b efore hand. Finding an upp er b ound is p ossible, but the b ound can b e v ery lo ose [ 10 ]. Note that computing the nonzero pattern of L and U is a necessary part of Gaussian elimination in v olving sparse matrices since w e do not use t w o dimensional arra ys for represen ting them but sparse data structures. Gilb ert-P eierls' algorithm aims at computing the nonzero pattern of the factors and the n umerical v alues in a total time prop ortional to O ( f l ops ( LU )). It consists of t w o stages for determining ev ery column of L and U The rst stage is a sym b olic analysis stage that computes the nonzero pattern of the column k of the factors. The second stage is the n umerical factorization stage that in v olv es solving the lo w er triangular system Lx = b that w e discussed in the section ab o v e. 2.4.1 Sym b olic Analysis A sparse Gaussian elimination algorithm with partial piv oting cannot kno w the exact nonzero structure of the factors ahead of all n umerical computation, simply b ecause partial piv oting at column k can in tro duce new nonzeros in columns k + 1 n Solving Lx = b m ust b e done in time prop ortional to the n um b er of rops p erformed. Consider a simple column-orien ted algorithm in MA TLAB notation for solving Lx = b as follo ws: x = b for j = 1:n if x(j) ~= 0 x(j+1:n) = x(j+1:n) L(j+1:n,j) x(j) end PAGE 22 12 endThe ab o v e algorithm tak es time O ( n + number of f l ops per f or med ). The O ( n ) term lo oks harmless, but Lx = b is solv ed n times in the factorization of A = LU leading to an unacceptable O ( n 2 ) term in the w ork to factorize A in to L times U T o remo v e the O ( n ) term, w e m ust replace the algorithm with x = b for each j for which x(j) ~= 0 x(j+1:n) = x(j+1:n) L(j+1:n, j) x(j) endThis w ould reduce the O ( n ) term to O ( ( b )), where ( b ) is the n um b er of nonzeros in b Note that b is a column of the input matrix A Th us to solv e Lx = b w e need to kno w the nonzero pattern of x b efore w e compute x itself. Sym b olic analysis helps us determine the nonzero pattern of x Let us sa y w e are computing column k of L and U Let G = G ( L k ) b e the directed graph of L with k 1 v ertices represen ting the already computed k 1 columns. G ( L k ) has an edge j i i l ij 6 = 0. Let = f i j b i 6 = 0 g and X = f i j x i 6 = 0 g No w the elemen ts of X is giv en b y X = R each G ( L ) ( ) (2{22) The nonzero pattern of X is computed b y the determining the v ertices that are reac hable from the v ertices of the set The reac habilit y problem can b e solv ed using a classical depth rst searc h in G ( L k ) from the v ertices of the set If b j 6 = 0, then x j 6 = 0. In addition if L ij 6 = 0, then x i 6 = 0 ev en if b i = 0. This is b ecause a L ij x j con tributes to a nonzero in the equation when w e solv e for x i During the depth rst searc h, Gilb ert-P eierls' algorithm computes the top ological order of X This top ological ordering is useful for eliminating unkno wns in the Numerical factorization step. PAGE 23 13 L x j x i l ij Figure 2{1: Nonzero pattern of x when solving Lx = b The ro w indices v ector L i of columns 1 k 1 of L represen ts the adjacency list of the graph G ( L k ). The depth rst searc h tak es time prop ortional to the n um b er of v ertices examined plus the n um b er of edges tra v ersed. 2.4.2 Numerical F actorization Numerical factorization consists of solving the system ( 2{21 ) for eac h column k of L and U Normally w e w ould solv e for the unkno wns in ( 2{21 ) in the increasing order of the ro w index. The ro w indices/nonzero pattern computed b y depth rst searc h are not necessarily in increasing order. Sorting the indices w ould increase the time complexit y ab o v e our O ( f l ops ( LU )) goal. Ho w ev er, the requiremen t of eliminating unkno wns in increasing order can b e relaxed to a top ological order of the ro w indices. An unkno wn x i can b e computed, once all the unkno wns x j of whic h it is dep enden t on are computed. This is ob vious when w e write the equations comprising a lo w er triangular solv e. Theoretically the unkno wns can b e solv ed in an y top ological order. The depth rst searc h algorithm giv es one suc h top ological order whic h is sucien t for our case. In our example, the depth rst searc h w ould ha v e nished exploring v ertex i b efore it nishes exploring v ertices j PAGE 24 14 Hence a top ological order giv en b y depth rst searc h w ould ha v e j app earing b efore i This is exactly what w e need. Gilb ert-P eierls' algorithm starts with an iden tit y L matrix. The en tire left lo oking algorithm can b e summarized in MA TLAB notation as follo ws: L = I for k = 1:n x = L \ A(:,k) %(partial pivoting on x can be done here) U(1:k,k) = x(1:k) L(k:n,k) = x(k:n) / U(k,k) end where x = L n b denotes the solution of a sparse lo w er triangular system. In this case, b is the k th column of A The total time complexit y of Gilb ert-P eierls' algorithm is O ( ( A ) + f l ops ( LU )). ( A ) is the n um b er of nonzeros in the matrix A and f l ops ( LU ) is the rop coun t of the pro duct of the matrices L and U. T ypically f l ops ( LU ) dominates the complexit y and hence the claim of factorizing in time prop ortional to the rop coun t. 2.5 Maxim um T ransv ersal Du [ 11 12 ] prop osed an algorithm for determining the maxim um transv ersal of a directed graph.The purp ose of the algorithm is to nd a ro w p erm utation that minimizes the zeros on the diagonal of the matrix. F or non singular matrices, the algorithm ensures a zero free diagonal. KLU emplo ys Du 's [ 11 12 ] algorithm to nd an unsymmetric p erm utation of the input matrix to determine a zerofree diagonal. A matrix cannot b e p erm uted to ha v e a zero free diagonal if and only if it is structurally singular. A matrix is structurally singular if there is no p erm utation of its nonzero pattern that mak es it n umerically nonsingular. PAGE 25 15 A transv ersal is dened as a set of nonzeros, no t w o of whic h lie in the same ro w or column, on the diagonal of the p erm uted matrix. A transv ersal of maxim um length is the maxim um transv ersal. Du 's maxim um transv ersal algorithm consists of represen ting the matrix as a graph with eac h v ertex corresp onding to a ro w in the matrix. An edge i k i k +1 exists in the graph if A ( i k ; j k +1 ) is a nonzero and A ( i k +1 ; j k +1 ) is an elemen t in the transv ersal set. A path b et w een v ertices i 0 and i k w ould consist of a sequence of nonzeros ( i 0 ; j 1 ) ; ( i 1 ; j 2 ) ; ( i k 1 ; j k ) where the curren t transv ersal w ould include ( i 1 ; j 1 ) ; ( i 2 ; j 2 ) ; ( i k ; j k ). If there is a nonzero in p osition ( i k ; j k +1 ) and no nonzero in ro w i 0 or column j k +1 is curren tly on the tra vseral, it increases the transerv al b y one b y adding the nonzeros ( i r ; j r +1 ) ; r = 0 ; 1 ; ; k to the transv ersal and remo ving the nonzeros ( i r ; j r ) ; r = 1 ; 2 ; ; k from the transv ersal. This adding and remo ving of nonzeros to and from the transv ersal is called reassignmen t c hain or augmen ting path. A v ertex or ro w is said to b e assigned if a nonzero in the ro w is c hosen for the transv ersal. The pro cess of constructing augmen ting paths is done b y doing a depth rst searc h from an unassigned ro w i 0 of the matrix and con tin ue till a v ertex i k is reac hed where the path terminates b ecause A ( i k ; j k +1 ) is a nonzero and column j K +1 is unassigned. Then the searc h bac ktrac ks to i 0 adding and remo ving transv ersal elemen ts th us constructing an augmen ting path. Du 's maxim um transv ersal transv ersal algorithm has a w orst case time complexit y of O ( n ) where is the n um b er of nonzeros in the matrix and n is the order of the matrix. Ho w ev er in practice, the time complexit y is close to O ( n + ). The maxim um transv ersal problem can b e cast as a maximal matc hing problem on bipartite graphs. This is only to mak e a comparison. The maximal matc hing problem is stated as follo ws. PAGE 26 16 0 = A 11 A 13 A 14 A 34 A 22 A 23 A 24 A 33 A 44 x 1 x 2 x 3 x 4 b 1 b 2 b 3 b 4 A 12 Figure 2{2: A matrix p erm uted to BTF form Giv en an undirected graph G = ( V ; E ), a matc hing is a subset of the edges M E suc h that for all v ertices v 2 V at most one edge of M is inciden t on v A v ertex v 2 V is matc hed if some edge in M is inciden t on v otherwise, v is unmatc hed. A maximal matc hing is a matc hing of maxim um cardinalit y that is a matc hing M suc h that for an y matc hing M 0 w e ha v e j M j j M 0 j A maximal matc hing can b e built incremen tally b y pic king an arbitrat y edge e in the graph, deleting an y edge that is sharing a v ertex with e and rep eating un til the graph is out of edges. 2.6 Blo c k T riangular F orm A blo c k (upp er) triangular matrix is similar to a upp er triangular matrix except that the diagonals in the former are square blo c ks instead of scalars. Figure 2{2 sho ws a matrix p erm uted to the BTF form. Con v erting the input matrix to blo c k triangular form is imp ortan t in that, 1. The part of the matrix b elo w the blo c k diagonal requires no factorization eort. PAGE 27 17 2. The diagonal blo c ks are indep enden t of eac h other.Only the blo c ks need to b e factorized. F or example, in gure 2{2 the subsystem A 44 x 4 = b 4 can b e solv ed indep enden tly for x 4 and x 4 can b e eliminated from the o v erall system. The system A 33 x 3 = b 3 A 34 x 4 is then solv ed for x 3 and so on. 3. The o diagonal nonzeros do not con tribute to an y ll-in. Finding a symmetric p erm utation of a matrix to its BTF form is equiv alen t to nding the strongly connected comp onen ts of a graph. A strongly connected comp onen t of a directed graph G = ( V ; E ) is a maximal set of v ertices C V suc h that for ev ery pair of v ertices u and v in C w e ha v e b oth u v and v u The v ertices u and v are reac hable from eac h other. The Algorithm emplo y ed in KLU for symmetric p erm utation of a matrix to a BTF form, is based on Du and Reid's [ 13 14 ] algorithm. Du and Reid pro vide an implemen tation for T arjan's [ 15 ] algorithm to determine the strongly connected comp onen ts of a directed graph. The algorithm has a time complexit y of O ( n + ) where n is the order of the matrix and is the n um b er of o diagonal nonzeros in the matrix. The algorithm essen tially consists of doing a depth rst searc h from un visited no des in the graph. It uses a stac k to k eep trac k of no des b eing visited and uses a path of the no des. When all edges in the path are explored, it generates Strongly connected comp onen ts from the top of stac k. Du 's algorithm is dieren t from the metho d prop osed b y Cormen, Leiserson, Riv est and Stein [ 16 ]. They suggest doing a depth rst searc h on G computing G T and then running a depth rst searc h on G T on v ertices in the decreasing order of their nish times from the rst depth rst searc h (the top ological order from the rst depth rst searc h). PAGE 28 18 j i k jis = Pruned, = Fill in = Nonzero, Figure 2{3: A symmetric pruning scenario 2.7 Symmetric Pruning Eisenstat and Liu [ 17 ] prop osed a metho d called Symmetric Pruning to exploit structural symmetry for cutting do wn the sym b olic analysis time. The cost of depth rst searc h can b e cut do wn b y pruning unnecessary edges in the graph of L; G ( L ). The idea is to replace G ( L ) b y a reduced graph of L An y graph H can b e used in place of G ( L ), pro vided that i j exists in H i i j exists in G ( L ). If A is symmetric, then the symmetric reduction is just the elimination tree. The symmetric reduction is a subgraph of G ( L ). It has few er edges than G ( L ) and is easier to compute b y taking adv an tage of symmetry in the structure of the factors L and U. Ev en though the symmetric reduction remo v es edges, it still preserv es the paths b et w een v ertices of the original graph. Figure 2{3 sho ws a symmetric pruning example. If l ij 6 = 0 ; u j i 6 = 0, then w e can prune edges j s where s > i The reason b ehind this is that for an y a j k 6 = 0, a sk will ll in from column j of L for s > k PAGE 29 19 The just computed column i of L is used to prune earlier columns. An y future depth rst searc h from v ertex j will not visit v ertex s since s w ould ha v e b een visited via i already Note that ev ery column is pruned only once. KLU emplo ys symmetric pruning to sp eed up the depth rst searc h in the sym b olic analysis stage. 2.8 Ordering It is a widely used practice to precede the factorization step of a sparse linear system b y an ordering phase. The purp ose of the ordering is to generate a p erm utation P that reduces the ll-in in the factorization phase of P AP T A ll-in is dened as a nonzero in a p osition ( i; j ) of the factor that w as zero in the original matrix. In other w ords, w e ha v e a ll-in if L ij 6 = 0, where A ij = 0. The p erm uted matrix created b y the ordering P AP T creates m uc h less ll-in in factorization phase than the unp erm uted matrix A The ordering mec hanism t ypically tak es in to accoun t only the structure of the input matrix, without considering the n umerical v alues stored in the matrix. P artial piv oting during factorization c hanges the ro w p erm utation P and hence could p oten tially increase ll-in as opp osed to what w as estimated b y the ordering sc heme. W e shall see more ab out piv oting in the follo wing sections. If the input matrix A is unsymmetric, then the p erm utation of the matrix A + A T can b e used. V arious minim um degree algorithms can b e used for ordering. Some of the p opular ordering sc hemes include appro ximate minim um degree(AMD) [ 18 19 ], column appro ximate minim um degree(COLAMD) [ 20 21 ] among others. COLAMD orders the matrix AA T without forming it explicitly After p erm uting an input matrix A in to BTF form using the maxim um transv ersal and BTF orderings, KLU attempts to factorize eac h of the diagonal blo c ks. It applies the ll reducing ordering algorithm on the blo c k b efore factorizing it. KLU supp orts b oth appro ximate minim um degree and column appro ximate PAGE 30 20 minim um degree. Besides, an y giv en ordering algorithm can b e plugged in to KLU without m uc h eort. W ork is b eing done on in tegrating a Nested Dissection ordering strategy in to KLU as w ell. Of the v arious ordering sc hemes, AMD giv es b est results on circuit matrices. AMD nds a p erm utation P to reduce ll-in for the Cholesky factorization of P AP T (of P ( A + A T ) P T if A is unsymmetric). AMD assumes no n umerical piv oting. AMD attempts to reduce an optimistic estimate of ll-in. COLAMD is an unsymmetric ordering sc heme, that computes a column p erm utation Q to reduce ll-in for Cholesky factorization of ( AQ ) T AQ COLAMD attempts to reduce a "p essimitic" estimate (upp er b ound) of ll-in. Nested Dissection is another ordering sc heme that creates p erm utation suc h that the input matrix is transformed in to blo c k diagonal form with v ertex separators. This is a p opular ordering sc heme. Ho w ev er, it is unsuitable for circuit matrices when applied to the matrix as suc h. It can b e used on the blo c ks generated b y BTF pre-ordering. The idea b ehind a minim um degree algorithm is as follo ws: A structurally symmetric matrix A can b e represen ted b y an equiv alen t undirected graph G ( V ; E ) with v ertices corresp onding to ro w/column indices. An edge i j exists in G if A ij 6 = 0. Consider the gure 2{4 If the matrix is factorized with v ertex 1 as the piv ot, then after the rst Gaussian elimination step, the matrix w ould b e transformed as in gure 2{5 This rst step of elimination can b e considered equiv alen t to remo ving no de 1 and all its edges from the graph and adding edges to connect all no des adjacen t to 1. In other w ords, the elimination has created a clique of the no des adjacen t to the eliminated no de. Note that there are as man y ll-ins in the reduced matrix as there are edges added in the clique formation. In the ab o v e example, w e ha v e PAGE 31 21 ***** * * ** * 1 2 3 5 4 Figure 2{4: A symmetric matrix and its graph represen tation * * * ** * 2 3 5 4 * * * *** Figure 2{5: The matrix and its graph represen tation after one step of Gaussian elimination PAGE 32 22 c hosen the wrong no de as piv ot, since no de 1 has the maxim um degree. Instead if w e had c hosen a no de with minim um degree sa y 3 or 5 as piv ot, then there w ould ha v e b een zero ll-in after the elimination since b oth 3 and 5 ha v e degree 1. This is the k ey idea in a minim um degree algorithm. It generates a p erm utation suc h that a no de with minim um degree is eliminated in eac h step of Gaussian elimination, th us ensuring a minimal ll-in. The algorithm do es not examine the n umerical v alues in the no de selection pro cess. It could happ en that during partial piv oting, a no de other than the one suggested b y the minim um degree algorithm m ust b e c hosen as piv ot b ecause of its n umerical magnitude. That's exactly the reason wh y the ll-in estimate pro duced b y the ordering algorithm could b e less than that exp erienced in the factorization phase. 2.9 Piv oting Gaussian elimination fails when the diagonal elemen t in the input matrix happ ens to b e zero. Consider a simple 2 2 system, A = 264 0 a 12 a 21 a 22 375 264 x 1 x 2 375 = 264 b 1 b 2 375 (2{23) When solving the ab o v e system, Gaussian elimination computes the m ultiplier a 21 =a 11 [and m ultiplies ro w 1 with this m ultiplier and adds it to ro w 2] th us eliminating the co ecien t elemen t a 21 from the matrix. This step ob viously w ould fail, since a 11 is zero. No w let's see a classical case when the diagonal elemen t is nonzero but close to zero. A = 264 0 : 0001 1 1 1 375 (2{24) The m ultiplier is 1 = 0 : 0001 = 10 4 The factors L and U are PAGE 33 23 L = 264 1 0 10 4 1 375 U = 264 0 : 0001 1 0 10 4 375 (2{25) The elemen t u 22 has the actual v alue 1 10 4 Ho w ev er assuming a four digit arithmetic, it w ould b e rounded o to 10 4 Note that the pro duct of L and U is L U = 264 0 : 0001 1 1 0 375 (2{26) whic h is dieren t from the original matrix. The reason for this problem is that the m ultiplier computed is so large that when added with the small elemen t a 22 with v alue 1, it obscured the tin y v alue presen t in a 22 W e can solv e these problems with piv oting. In the ab o v e t w o examples, w e could in terc hange ro ws 1 and 2, to solv e the problem. This mec hanism of in terc hanging ro ws(and columns) and pic king a large elemen t as the diagonal, to a v oid n umerical failures or inaccuracies is called piv oting. T o pic k a n umerically large elemen t as piv ot, w e could lo ok at the elemen ts in the curren t column or w e could lo ok at the en tire submatrix (across b oth ro ws and columns). The former is called partial piv oting and the latter is called complete piv oting. F or dense matrices, partial piv oting adds a time complexit y of O ( n 2 ) comparisons to Gaussian elimination and complete piv oting adds O ( n 3 ) comparisons. Complete piv oting is exp ensiv e and hence is generally a v oided, except for sp ecial cases. KLU emplo ys partial piv oting with diagonal preference. As long as the diagonal elemen t is atleast a constan t threshold times the largest elemen t in the column, w e c ho ose the diagonal as the piv ot. This constan t threshold is called piv ot tolerance. 2.10 Scaling The case where small elemen ts in the matrix get obscured during the elimination pro cess and accuracy of the results gets sk ew ed b ecause of n umerical addition PAGE 34 24 is not completely o v ercome b y the piv oting pro cess. Let us see an example of this case. Consider the 2 2 system A = 264 10 10 5 1 1 375 264 x 1 x 2 375 = 264 10 5 2 375 (2{27) When w e apply Gaussian elimination with partial piv oting to the ab o v e system, the en try a 11 is largest in the rst column and hence w ould con tin ue to b e the piv ot.After the rst step of elimination assuming a four digit arithmetic, w e w ould ha v e A = 264 10 10 5 0 10 4 375 264 x 1 x 2 375 = 264 10 5 10 4 375 (2{28) The solution from the ab o v e elimination is x 1 = 1 ; x 2 = 0. Ho w ev er the correct solution is close to x 1 = 1 ; x 2 = 1. If w e divide eac h ro w of the matrix b y the largest elemen t in that ro w(and the corresp onding elemen t in the righ t hand side as w ell), prior to Gaussian elimination w e w ould ha v e A = 264 10 4 1 1 1 375 264 x 1 x 2 375 = 264 12 375 (2{29) No w if w e apply partial piv oting w e w ould ha v e, A = 264 1 1 10 4 1 375 264 x 1 x 2 375 = 264 21 375 (2{30) And after an elimination step, the result w ould b e PAGE 35 25 A = 264 1 1 0 1 10 4 375 264 x 1 x 2 375 = 264 2 1 10 4 375 (2{31) whic h yields the correct solution x 1 = 1 ; x 2 = 1. The pro cess of balancing out the n umerical enormit y or obscurit y on eac h ro w or column is termed as scaling. In the ab o v e example, w e ha v e scaled with resp ect to the maxim um v alue in a ro w whic h is ro w scaling. Another v arian t w ould b e to scale with resp ect to the sum of absolute v alues of all elemen ts across a ro w. In column scaling, w e w ould scale with resp ect to the maxim um v alue in a column or the sum of absolute v alues of all elemen ts in a cloumn. Ro w scaling can b e considered equiv alen t to nding an in v ertible diagonal matrix D 1 suc h that all the ro ws in the matrix D 1 A ha v e equally large n umerical v alues. Once w e ha v e suc h a D 1 the solution of the original system Ax = b is equiv alen t to solving the system ~ A x = ~ b where ~ A = D 1 A and ~ b = D 1 b Equilibration is another p opular term used for scaling. In KLU, the diagonal elemen ts of the diagonal matrix D 1 are either the largest elemen ts in the ro ws of the original matrix or the sum of the absolute v alues of the elemen ts in the ro ws. Besides scaling can b e turned o as w ell, if the sim ulation en vironmen t do es not need scaling. Scaling though it oers b etter n umerical results when solving systems, is not mandatory Its usage dep ends on the data v alues that constitute the system and if the v alues are already balanced, scaling migh t not b e necessary 2.11 Gro wth F actor Piv ot gro wth factor is a k ey diagnostic estimate in determining the stabilit y of Gaussian elimination. Stabilit y of Numerical Algorithms is an imp ortan t factor in determining the accuracy of the solution. Study of stabilit y is done b y a pro cess PAGE 36 26 called Roundo Error analysis. Roundo error analysis comprises t w o sub t yp es called F orw ard error analysis and Bac kw ard error analysis. If the computed solution ~ x is close to the exact solution x, then the algorithm is said to b e F orw ard stable. If the algorithm computes an exact solution to a nearb y problem, then the algorithm is said to b e Bac kw ard stable.Bac kw ard stabilit y is the most widely used tec hnique in studying stabilit y of systems. Often the data generated for solving systems ha v e impurit y in them or they are distorted b y a small amoun t. Under suc h circumstances w e are in terested that the algorithm pro duce an exact solution to this nearb y problem and hence the relev ance of bac kw ard stabilit y assumes signicance. Piv ot gro wth factor is formally dened as = max k max ij j a ( k ) ij j max ij j a ij j (2{32) where a ( k ) ij is an en try in the reduced matrix A ( k ) after the k th elimination step. F rom ( 2{32 ), w e nd that if the en tries of the reduced matrix gro w arbitrarily w e w ould ha v e a high gro wth factor. This arbitrary gro wth w ould again lead to inaccuracies in the results. Consider the follo wing 2 2 system. A = 264 10 4 1 1 1 375 264 x 1 x 2 375 = 264 12 375 (2{33) After one step of Gaussian elimination assuming four digit arithmetic, w e w ould ha v e the reduced system A = 264 10 4 1 0 1 10 4 375 264 x 1 x 2 375 = 264 1 2 10 4 375 (2{34) Solving the system yields x 1 = 0 ; x 2 = 1 whic h is dieren t from the actual solution x 1 = 1 ; x 2 = 1.The piv ot gro wth factor of the ab o v e system is PAGE 37 27 = max (1 ; 10 4 ) 1 = 10 4 Th us a large piv ot gro wth clearly indicates the inaccuracy in the result. P artial piv oting generally a v oids large gro wth factors. In the ab o v e example, if w e had applying partial piv oting, w e w ould ha v e got the correct results. But this is not assured and there are cases where partial piv oting migh t not result in an acceptable gro wth factor. This necessitates the estimation of the gro wth factor as a diagnostic to ol to detect cases where Gaussian elimination could b e unstable. Piv ot gro wth factor is calculated usually in terms of its recipro cal, to a v oid n umerical o v erro w problems when the v alue is v ery large. ( 2{32 ) is a harder to compute equation since it in v olv es calculating the maxim um of reduced matrix after ev ery step of elimination. The other denitions of recipro cal gro wth factor that are easy to compute are as follo ws: 1 = min j ( max i j a ij j ) ( max i j u ij j ) (2{35) 1 = ( max ij j a ij j ) ( max ij j u ij j ) (2{36) Equation ( 2{35 ) is the denition implemen ted in KLU and it is a column scaling in v arian t. It helps unmask a large piv ot gro wth that could b e totally mask ed b ecause of column scaling. 2.12 Condition Num b er Gro wth factor is a k ey estimate in determining the stabilit y of the algorithm. Condition n um b er is a k ey estimate in determining the amenabilit y or conditioning of a giv en problem. It is not guaran teed that a highly stable algorithm can yield accurate results for all problems it can solv e. The conditioning of the problem has a dominan t eect on the accuracy of the solution. Note that while stabilit y deals with the algorithm, conditioning deals with the problem itself. In practical applications lik e circuit sim ulation, the data of PAGE 38 28 a problem come from exp erimen tal observ ations. T ypically suc h data ha v e a factor of error or impurities or noise asso ciated with them. Roundo errors and discretization errors also con tribute to impurities in the data. Conditioning of a problem deals with determining ho w the solution of the problem c hanges in the presence of impurities. The preceding discussion sho ws that one often deals with solving problems not with the original data but that with p erturb ed data. The analysis of eect of p erturbation of the problem on the solution is called P erturbation analysis. It helps in determining whether a giv en problem pro duces a little or h uge v ariation in solution when p erturb ed. Let us see what w e mean b y w ell or ill conditioned problems. A problem is said to b e ill conditioned if a small relativ e error in data leads to a large relativ e error in solution irresp ectiv e of the algorithm emplo y ed. A problem is said to b e w ell conditioned if a small relativ e error in data do es not lead to a large relativ e error in solution. Accuracy of the computed solution is of primary imp ortance in n umerical analysis. Stabilit y of the algorithm and the Conditioning of the giv en problem are the t w o factors that directly determine accuracy .A highly stable algorithm w ell armored with scaling, partial piv oting and other concepts cannot b e guaran teed to yield an accurate solution to an ill-conditioned problem. A bac kw ard stable algorithm applied to a w ell-conditioned problem should yield a solution close to the exact solution. This follo ws from the denitions of bac kw ard stabilit y and w ell-conditioning, where bac kw ard stabilit y assures exact solution to a nearb y problem and w ell-conditioned problem assures that the computed solution to p erturb ed data is relativ ely close to the exact solution of the actual problem. PAGE 39 29 Mathematically let X b e some problem. Let X(d) b e the solution to the problem for some input d. Let d denote a small p erturbation in the input d. No w if the relativ e error in the solution j X ( d + d ) X ( d ) j j X ( d ) j exceeds the relativ e error in the input j d j j d j then the problem is ill conditioned and w ell conditioned otherwise. Condition n um b er is a measure of the conditioning of the problem. It sho ws whether a problem is w ell or ill conditioned.F or the linear system problems of the form Ax = b the condition n um b er is dened as C ond ( A ) = k A kk A 1 k (2{37) Equation ( 2{37 ) is arriv ed at b y theory that deals with p erturbations either in the input matrix A or the righ t hand side b or b oth the matrix and righ t hand side. Equation ( 2{37 ) can b e dened with resp ect to an y norm viz. 1, 2 or 1 The system Ax = b is said to b e ill-conditioned if the condition n um b er from ( 2{37 ) is quite large. Otherwise it is said to b e w ell-conditioned. A naiv e w a y to compute the condition n um b er w ould b e to compute the in v erse of the matrix, compute the norm of the matrix and its in v erse and compute the pro duct. Ho w ev er, computing the in v erse is atleast thrice as exp ensiv e as solving the linear system Ax = b and hence should b e a v oided. Hager [ 22 ] dev elop ed a metho d for estimating the 1-norm of k A 1 k and the corresp onding 1-norm condition n um b er. Hager prop osed an optimization approac h for estimating k A 1 k 1 The 1-norm of a matrix is formally dened as PAGE 40 30 k A k 1 = max k Ax k 1 k x k 1 (2{38) Hager's algorithm can b e briery describ ed as follo ws: F or A 2 R n n a con v ex function is dened as F ( x ) = k Ax k 1 (2{39) o v er the con v ex set S = f x 2 R n : k x k 1 1 g Then k A k 1 is the global maxim um of ( 2{39 ). The algorithm in v olv es computing Ax and A T x or computing matrix-v ector pro ducts. When w e w an t to compute k A 1 k 1 it in v olv es computing A 1 x and ( A 1 ) T x whic h is equiv alen t to solving Ax = b and A T x = b W e can use KLU to ecien tly solv e these systems. Higham [ 23 ] presen ts renemen ts to Hager's algorithm and restricts the n um b er of iterations to v e. Higham further presen ts a simple device and using the higher of the estimates from this device and Hager's algorithm to ensure the esimate is large enough. This device in v olv es solving the linear system Ax = b where b i = ( 1) i +1 (1 + i 1 n 1 ) ; i = 1 ; 2 ; :::n The nal estimate is c hosen as the maxim um from Hager's algorithm and 2 k x k 1 3 n KLU's condition n um b er estimator is based on Higham's renemen t of Hager's algorithm. 2.13 Depth First Searc h As w e discussed earlier, the nonzero pattern of the k th column of L is determined b y the Reac habilit y of the ro w-indices of k th column of A in the graph of L. PAGE 41 31 The reac habilit y is determined b y a depth-rst searc h tra v ersal of the graph of L. The top ological order for elimination of v ariables when solving the lo w er triangular system Lx = b is also determined b y the depth-rst searc h tra v ersal. A classical depth rst searc h algorithm is a recursiv e one. One of the ma jor problems in a recursiv e implemen tation of depth-rst searc h is Stac k o v erro w. Eac h pro cess is allo cated a stac k space up on execution. When there is a high n um b er of recursiv e calls, the stac k space is exhausted and the pro cess terminates abruptly This is a denite p ossibilit y in the con text of our depth-rst searc h algorithm when w e ha v e a dense column of a matrix of a v ery high dimension. The solution to stac k o v erro w caused b y recursion is to replace recursion b y iteration. With an iterativ e or non-recursiv e function, the en tire depth rst searc h happ ens in a single function stac k. The iterativ e solution uses an arra y of ro w indices called pstac k. When descending to an adjacen t no de during the searc h, the ro w index of the next adjacen t no de is stored in the pstac k at the p osition(ro w/column index) corresp onding to the curren t no de. When the searc h returns to the curren t no de, w e kno w that w e next need to descend in to the no de stored in the pstac k at the p osition corresp onding to the curren t no de. Using this extra O(n) memory the iterativ e v ersion completes the depth rst searc h in a single function stac k. This is an imp ortan t impro v emen t from the recursiv e v ersion since it a v oids the stac k o v erro w problem that w ould ha v e b een a b ottlenec k when solving high dimension systems. 2.14 Memory F ragmen tation The data structures for L and U are the ones used to represen t sparse matrices. These comprise 3 v ectors. 1. V ector of column p oin ters 2. V ector of ro w indices PAGE 42 32 3. V ector of n umerical v alues There are o v erall, six v ectors needed for the t w o matrices L and U. Of these, the t w o v ectors of column p oin ters are of pre-kno wn size namely the size of a blo c k. The remaining four v ectors of ro w indices and n umerical v alues dep end on the llin estimated b y AMD. Ho w ev er AMD giv es an optimistic estimate of ll-in.Hence w e need to dynamically gro w memory for these v ectors during the factorization phase if w e determine that the ll-in is higher than estimated. The partial piv oting strategy can alter the ro w ordering determined b y AMD and hence is another source of higher ll-in than the estimate from AMD. Dynamically gro wing these four v ectors suers from the problem of external memory fragmen tation. In external fragmen tation, free memory is scattered in the memory space. A call for more memory fails b ecause of non-a v ailabilit y of con tiguous free memory space. If the scattered free memory areas w ere con tiguous, the memory request w ould ha v e succeeded. In the con text of our problem, the memory request to gro w the four v ectors could either fail if w e run in to external fragmen tation or succeed when there is enough free space a v ailable. When w e reallo cate or gro w memory there are t w o t yp es of success cases. In the rst case called c heap reallo cation, there is enough free memory space abutting the four v ectors. Here the memory o ccupied b y a v ector is just extended or its end b oundary is increased. The start b oundary remains the same. In the second case called costly reallo cation, there is not enough free memory space abutting a v ector. Hence a fresh memory is allo cated in another region for the new size of v ector and the con ten ts are copied from old lo cation. Finally the old lo cation is freed. With four v ectors to gro w, there is a failure case b ecause of external fragmentation and a costly success case b ecause of costly reallo cation. T o reduce the failure case and a v oid the costly success case, w e ha v e coalesced the four v ectors in to a single v ector. This new data structure is b yte aligned on double b oundary F or PAGE 43 33 ev ery column of L and U, the v ector con tains the ro w indices and n umerical v alues of L follo w ed b y the ro w indices and n umerical v alues of U. Multiple in teger ro w indices are stored in a single double lo cation The actual n um b er of in tegers that can b e stored in a double lo cation v aries with platform and is determined dynamically The common tec hnique of using in teger p oin ter to p oin t to lo cation aligned on double b oundary is emplo y ed to retriev e or sa v e the ro w indices. In addition to this coalesced data structure con taining the ro w indices and n umerical v alues, t w o more length v ectors of size n are needed to con tain the length of eac h column of L and U These length v ectors are preallo cated once and need not b e gro wn dynamically Some Memory managemen t sc hemes nev er do c heap reallo cation. In suc h sc hemes, the new data structure serv es to reduce external fragmen tation only 2.15 Complex Num b er Supp ort KLU supp orts complex matrices and complex righ t hand sides. KLU also supp orts solving the transp ose system A T x = b for real matrices and solving the conjugate transp ose system A H x = b for complex matrices. Initially it relied on the C99 language supp ort for complex n um b ers. Ho w ev er the C99 sp ecication is not supp orted across op erating systems. F or example, earlier v ersions of Sun Solaris do not supp ort C99. T o a v oid these compatibilit y issues, KLU no longer relies on C99 and has its o wn complex arithmetic implemen tation. 2.16 P arallelism in KLU When solving a system Ax = b using KLU, w e use BTF pre-ordering to con v ert A in to a blo c k upp er triangular form. W e apply AMD on eac h blo c k and factorize eac h blo c k one after the other serially Alternativ ely nested dissection can b e applied to eac h blo c k. Nested dissection ordering con v erts a blo c k to a doubly b ordered blo c k diagonal form. A doubly b ordered blo c k diagonal form is similar to a blo c k upp er triangular form but has nonzeros on the sub diagonal region. These PAGE 44 34 1 2 4 5 3 A Doubly Bordered Block diagonal Matrix Separator Tree 2 1 4 5 3 Figure 2{6: A doubly b ordered blo c k diagonal matrix and its corresp onding v ertex separator tree nonzeros form a horizon tal strip resem bling a b order. Similarly the nonzeros in the region ab o v e the diagonal form a corresp onding v ertical strip. The doubly b ordered blo c k diagonal form can b e though t of as a separator tree. F actorization of the blo c k then in v olv es a p ost-order tra v ersal of the separator tree. The no des in the separator tree can b e factorized in parallel. The factorization of a no de w ould additionally in v olv e computing the sc h ur complemen t of its paren t and of its ancestors in the tree. Once all the c hildren of a no de ha v e up dated its sc h ur complemen t, the no de is ready to b e factorized and it in turn computes the sc h ur complemen t of its paren t and its ancestors. The factorization and computation of sc h ur complemen t is done in a p ost-order tra v ersal fashion and the pro cess stops at the ro ot. P arallelism can help in reducing the factorization time. It gains imp ortance in the con text of m ulti pro cessor systems. W ork is b eing done to enable parallelism in KLU. PAGE 45 CHAPTER 3 CIR CUIT SIMULA TION: APPLICA TION OF KLU The KLU algorithm comprises the follo wing steps: 1. Unsymmetric P erm utation to blo c k upp er triangular form. This consists of t w o steps. (a) unsymmetric p erm utation to ensure a zero free diagonal using maxim um transv ersal. (b) symmetric p erm utation to blo c k upp er triangular form b y nding the strongly connected comp onen ts of the graph. 2. Symmetric p erm utation of eac h blo c k(sa y A) using AMD on A + A T or an unsymmetric p erm utation of eac h blo c k using COLAMD on AA T These p erm utations are ll-in reducing orderings on eac h blo c k. 3. F actorization of eac h scaled blo c k using Gilb ert-P eierls' left lo oking algorithm with partial piv oting. 4. Solv e the system using blo c k-bac k substitution and accoun t for the odiagonal en tries. The solution is re-p erm uted to bring it bac k to original order. Let us rst deriv e the nal system that w e need to solv e taking in to accoun t, the dieren t p erm utations, scaling and piv oting. The original system to solv e is Ax = b (3{1) Let R b e the diagonal matrix with the scale factors for eac h ro w. Applying scaling, w e ha v e R Ax = R b (3{2) 35 PAGE 46 36 Let P' and Q' b e the ro w and column p erm utation matrices that com bine the p erm utations for maxim um transv ersal and the blo c k upp er triangular form together. Applying these p erm utations together, w e ha v e P 0 R AQ 0 Q 0 T x = P 0 R b: [ Q 0 Q 0 T = I ; the identity matr ix ] (3{3) Let P and Q b e ro w and column p erm utation matrices that club the P' and Q' men tioned ab o v e with the symmetric p erm utation pro duced b y AMD and the partial piv oting ro w p erm utation pro duced b y factorization. No w, P R AQQ T x = P R b or ( P R AQ ) Q T x = P R b (3{4) The matrix (PRA Q) consists of t w o parts viz. the diagonal blo c ks that are factorized and the o-diagonal elemen ts that are not factorized. ( P R AQ ) = LU + F where LU represen ts the factors of all the blo c ks collectiv ely and F represen ts the en tire o diagonal region. Equation ( 3{4 ) no w b ecomes ( LU + F ) Q T x = P R b (3{5) x = Q ( LU + F ) 1 ( P R b ) (3{6) Equation ( 3{6 ) consists of t w o steps. A blo c k bac k-substitution i.e. computing ( LU + F ) 1 ( P R b ) follo w ed b y applying the column p erm utation Q The blo c k-bac k substitution in ( LU + F ) 1 ( P R b ) lo oks cryptic and can b e b etter explained as follo ws: Consider a simple 3 3 blo c k system PAGE 47 37 266664 L 11 U 11 F 12 F 13 0 L 22 U 22 F 23 0 0 L 33 U 33 377775 266664 X 1 X 2 X 3 377775 = 266664 B 1 B 2 B 3 377775 (3{7) The equations corresp onding to the ab o v e system are: L 11 U 11 X 1 + F 12 X 2 + F 13 X 3 = B 1 (3{8) L 22 U 22 X 2 + F 23 X 3 = B 2 (3{9) L 33 U 33 X 3 = B 3 (3{10) In blo c k bac k substitution, w e rst solv e ( 3{10 ) for X 3 and then eliminate X 3 from ( 3{9 ) and ( 3{8 ) using the o-diagonal en tries. Next, w e solv e ( 3{9 ) for X 2 and eliminate X 2 from ( 3{8 ). Finally w e solv e ( 3{8 ) for X 1 3.1 Characteristics of Circuit Matrices Circuit matrices exhibit certain unique c haracteristics that mak es KLU more relev an t to them. They are v ery sparse. Because of their high sparsit y BLAS k ernels are not applicable. Circuit matrices often ha v e a few dense ro ws/columns that originate from v oltage or curren t sources. These dense ro ws/columns are eectiv ely remo v ed b y BTF p erm utation. Circuit matrices are asymmetric, but the nonzero pattern is roughly symmetric. They are easily p erm utable to blo c k upp er triangular form. Besides, they ha v e zero-free or nearly zero-free diagonal. Another p eculiar feature of circuit matrices is that the nonzero pattern of eac h blo c k after p erm utation to blo c k upp er triangular form, is more symmetric than the original matrix. T ypical ordering strategies applied to the original matrix cause high ll-in whereas when applied to the blo c ks, leads to less ll-in. PAGE 48 38 The eciency of the p erm utation to blo c k upp er triangular form sho ws in the fact the en tire sub-diagonal region in the matrix has zero w ork and the o-diagonal elemen ts do not cause an y ll-in since they are not factorized. 3.2 Linear Systems in Circuit Sim ulation The linear systems in circuit sim ulation pro cess originate from solving large systems of non linear equations using Newton's metho d and in tegrating large sti systems of ordinary dieren tial equations. These linear systems consist of the co ecien ts matrix A the unkno wns v ector x and the righ t hand side b During the course of sim ulation, the matrix A retains the same nonzero pattern(structurally unc hanged) and only undergo es c hanges in n umerical v alues. Th us the initial analysis phase (p erm utation to ensure zero-free diagonal, blo c k triangular form and minim um degree ordering on blo c ks) and factorization phase(that in v olv es sym b olic analysis, partial piv oting and symmetric pruning) can b e restricted to the initial system alone. Subsequen t systems A 0 x = b where only the n umerical v alues of A 0 are dieren t from A can b e solv ed using a mec hanism called refactorization. Refactorization simply means to use the same ro w and column p erm utations (comprising en tire analysis phase and partial piv oting) computed for the initial system, for solving the subsequen t systems that ha v e c hanges only in n umerical v alues. Refactorization substan tially reduces run time since the analysis time and factorization time sp en t on sym b olic analysis, partial piv oting, pruning are a v oided. The nonzero pattern of the factors L and U are the same as for the initial system. Only Numerical factorization using the pre-computed nonzero pattern and partial piv oting order, is required. The solv e step follo ws the factorization/refactorization step. KLU accomo dates solving m ultiple righ t hand sides in a single solv e step. Upto four righ t hand sides can b e solv ed in a single step. PAGE 49 39 3.3 P erformance Benc hmarks During m y in ternship at a circuit sim ulation compan y I did p erformance b enc hmarking of KLU vs Sup erLU in the sim ulation en vironmen t. The p erformance b enc hmarks w ere run on a represen tativ e set of examples. The results of these b enc hmarks are tabulated as follo ws. (the size of the matrix created in sim ulation is sho wn in paren thesis). T able 3{1: Comparison b et w een KLU and Sup erLU on o v erall time and ll-in Ov erall time Nonzeros(L+U) Netlist KLU Sup erLU Sp eedup KLU Sup erLU Problem1 (301) 1.67 1.24 0.74 1808 1968 Problem2 (1879) 734.96 688.52 0.94 13594 13770 Problem3 (2076) 56.46 53.38 0.95 16403 16551 Problem4 (7598) 89.63 81.85 0.91 52056 54997 Problem5 (745) 18.13 16.84 0.93 4156 4231 Problem6 (1041) 1336.50 1317.30 0.99 13198 13995 Problem7 (33) 0.40 0.32 0.80 157 176 Problem8 (10) 4.46 1.570 0.35 40 41 Problem9 (180) 222.26 202.29 0.91 1845 1922 Problem10(6833) 6222.20 6410.40 1.03 56322 58651 Problem11 (1960) 181.78 179.50 0.99 13527 13963 Problem12 (200004) 6.25 8.47 1.35 500011 600011 Problem13 (20004) 0.47 0.57 1.22 50011 60011 Problem14 (40004) 0.97 1.31 1.35 100011 120011 Problem15 (100000) 1.76 2.08 1.18 299998 499932 Problem16 (7602) 217.80 255.88 1.17 156311 184362 Problem17(10922) 671.10 770.58 1.15 267237 299937 Problem18 (14842) 1017.00 1238.10 1.22 326811 425661 Problem19 (19362) 1099.00 1284.40 1.17 550409 581277 Problem20 (24482) 3029.00 3116.90 1.03 684139 788047 Problem21 (30202) 2904.00 3507.40 1.21 933131 1049463 The circuits Problem16{Problem21 are TFT LCD arra ys similar to memory circuits. These examples w ere run atleast t wice with eac h algorithm emplo y ed viz. KLU or Sup erLU to get consisten t results. The results are tabulated in tables 3{1 3{2 and 3{3 The "o v erall time" in table 3{1 comprises of analysis, factorization and solv e time. PAGE 50 40 T able 3{2: Comparison b et w een KLU and Sup erLU on factor time and solv e time F actor time F actor Solv e time Solv e p er iteration sp eedup p er iteration sp eedup Netlist KLU Sup erLU KLU Sup erLU Problem1 (301) 0.000067 0.000084 1.26 0.000020 0.000019 0.92 Problem2 (1879) 0.000409 0.000377 0.92 0.000162 0.000100 0.61 Problem3 (2076) 0.000352 0.000317 0.90 0.000122 0.000083 0.68 Problem4 (7598) 0.001336 0.001318 0.99 0.000677 0.000326 0.48 Problem5 (745) 0.000083 0.000063 0.76 0.000035 0.000022 0.62 Problem6 (1041) 0.000321 0.000406 1.26 0.000079 0.000075 0.95 Problem7 (33) 0.000004 0.000004 0.96 0.000003 0.000002 0.73 Problem8 (10) 0.000001 0.000001 0.89 0.000001 0.000001 0.80 Problem9 (180) 0.000036 0.000042 1.16 0.000014 0.000011 0.76 Problem10(6833) 0.001556 0.001530 0.98 0.000674 0.000365 0.54 Problem11 (1960) 0.000663 0.000753 1.14 0.000136 0.000122 0.90 Problem12 (200004) 0.103900 0.345500 3.33 0.030640 0.041220 1.35 Problem13 (20004) 0.005672 0.020110 3.55 0.001633 0.002735 1.67 Problem14 (40004) 0.014430 0.056080 3.89 0.004806 0.006864 1.43 Problem15 (100000) 0.168700 0.283700 1.68 0.018600 0.033610 1.81 Problem16 (7602) 0.009996 0.017620 1.76 0.001654 0.001439 0.87 Problem17(10922) 0.018380 0.030010 1.63 0.002542 0.001783 0.70 Problem18 (14842) 0.024020 0.046130 1.92 0.003187 0.002492 0.78 Problem19 (19362) 0.054730 0.080280 1.47 0.005321 0.003620 0.68 Problem20 (24482) 0.121400 0.122600 1.01 0.006009 0.004705 0.78 Problem21 (30202) 0.124000 0.188700 1.52 0.009268 0.006778 0.73 PAGE 51 41 3.4 Analyses and Findings The follo wing are m y inferences based on the results: Most of the matrices in these exp erimen ts are small matrices of the order of a few thousands. Fill in is m uc h b etter with KLU. The 'BTF' ordering com bined with the 'AMD' ordering on eac h of the blo c ks do es a go o d job in reducing the ll in coun t to a go o d exten t. The impro v emen t in ll in a v erages around 6% for man y examples. T able 3{3: Ordering results using BTF+AMD in KLU on circuit matrices Nonzeros Nonzeros No of Max Blo c k Nonzeros Netlist Size in A in L+U Blo c ks size o diagonal Problem1 301 1484 1808 7 295 89 Problem2 1879 12926 13594 19 1861 4307 Problem3 2076 15821 16403 13 2064 6689 Problem4 7598 48922 52056 13 7586 19018 Problem5 745 3966 4156 128 426 1719 Problem6 1041 9654 13198 67 975 2608 Problem7 33 153 157 7 27 50 Problem8 10 39 40 5 6 16 Problem9 180 1503 1845 19 162 661 Problem10 6833 43250 56322 507 6282 12594 Problem11 1960 11187 13527 58 1715 2959 Problem12 200004 500011 500011 200003 2 300005 Problem13 20004 50011 50011 20003 2 30005 Problem14 40004 100011 100011 40003 2 60005 Problem15 100000 299998 299998 1 100000 0 Problem16 7602 32653 156311 103 7500 251 Problem17 10922 46983 267237 123 10800 301 Problem18 14842 63913 326811 143 14700 351 Problem19 19362 83443 550409 163 19200 401 Problem20 24482 105573 684139 183 24300 451 Problem21 30202 130303 933131 203 30000 501 There is no 'll in' in the Problem12, Problem13, Problem14 and Problem15 netlists with KLU. This is quite signican t in that there is no memory o v erhead in these examples. In the case of circuits Problem16{Problem21, the gain in ll PAGE 52 42 in with KLU ranges from 6% in the Problem19 example to 24% in Problem18 example. The gain in ll in translates in to faster factorization b ecause few nonzeros imply less w ork. The factorization time th us is exp ected to b e lo w. It turns out to b e true in most of the cases (factorization sp eedup of 1.6x in Problem16{ Problem21 examples and 3x for Problem12{Problem14 examples). F or some cases lik e Problem2 and Problem10, the factorization time remains same for b oth KLU and Sup erLU. Solv e phase turns out to b e slo w in KLU. Probably the o diagonal nonzero handling part tends to accoun t for the extra time sp en t in the solv e phase. One w a y of reducing the solv e o v erhead in KLU w ould b e solving m ultiple RHS at the same time. In a single solv e iteration, 4 equations are solv ed. On the whole, the o v erall time sp eedup is 1.2 for Problem16{Problem21 examples and Problem12{Problem14 examples. F or others, the o v erall time is almost the same b et w een the t w o algorithms. BTF is not able to nd out man y blo c ks for most of the matrices and there happ ens to b e a single large blo c k and the remaining are singletons. But the AMD ordering do es a go o d job in getting the ll in coun t reduced. The o-diagonal nonzero coun t is not high. 3.5 Alternate Ordering Exp erimen ts Dieren t ordering strategies w ere emplo y ed to analyze the ll in b eha viour. The statistics using dieren t ordering sc hemes are listed in table 3{4 'COLAMD' is not listed in the table. It t ypically giv es p o or ordering and causes more ll in than AMD, MMD and AMD+BTF. AMD alone giv es relativ ely higher ll in compared to AMD+BTF in most of the matrices. Ho w ev er AMD alone giv es mixed results in comparison with MMD. It matc hes or outp erforms MMD in ll in on Problem12{Problem14 and Problem16{Problem21 matrices. Ho w ev er it giv es PAGE 53 43 T able 3{4: Comparison of ordering results pro duced b y BTF+AMD, AMD, MMD Nonzeros Fill-in Fill-in Fill-in Netlist in A BTF+AMD AMD MMD Problem1 (301) 1484 1808 1928 1968 Problem2 (1879) 12926 13594 13857 13770 Problem3 (2076) 15821 16403 18041 16551 Problem4 (7598) 48922 52056 57975 54997 Problem5 (745) 3966 4156 5562 4231 Problem6 (1041) 9654 13198 14020 13995 Problem7 (33) 153 157 178 176 Problem8 (10) 39 40 41 41 Problem9 (180) 1503 1845 1968 1922 Problem10(6833) 43250 56322 133739 58651 Problem11 (1960) 11187 13527 14800 13963 Problem12 (200004) 500011 500011 600011 600011 Problem13 (20004) 50011 50011 60011 60011 Problem14 (40004) 100011 100011 120011 120011 Problem15 (100000) 299998 299998 299998 499932 Problem16 (7602) 32653 156311 165264 184362 Problem17(10922) 46983 267237 255228 299937 Problem18 (14842) 63913 326811 387668 425661 Problem19 (19362) 83443 550409 451397 581277 Problem20 (24482) 105573 684139 718891 788047 Problem21 (30202) 130303 933131 839226 1049463 PAGE 54 44 p o or ll in for rest of the circuits when compared with MMD. AMD alone b eats AMD+BTF in ll-in for some of the examples viz. Problem17, Problem19 and Problem21. Ov erall, to sum it up, BTF+AMD is the b est ordering strategy to use. 3.6 Exp erimen ts with UF Sparse Matrix Collection There are a n um b er of circuit matrices in the UF sparse matrix collection. Dieren t exp erimen ts w ere done with these matrices as w ell on dieren t parameters lik e 1. ordering qualit y with dieren t ordering sc hemes in KLU 2. timing of dieren t phases of KLU 3. ordering qualit y among KLU, UMFP A CK and Gilb ert-P eierls' Algorithm 4. p erformance comparison b et w een KLU and UMFP A CK. UMFP A CK is a unsymmetric m ultifron tal sparse solv er and uses an unsymmetric COLAMD ordering or an AMD ordering, automatically selecting whic h metho d to use based on the matrix c haracteristics. Gilb ert-P eierls' algorithm is a v ailable in MA TLAB as an LU factorization sc heme. These exp erimen ts w ere done on a Mandrak e 10.0 lin ux OS, In tel P en tium M Pro cessor with clo c k frequency of 1400 MHz and RAM 768 kB. 3.6.1 Dieren t Ordering Sc hemes in KLU There are six dieren t ordering sc hemes p ossible in KLU. The three ll reducing sc hemes are AMD, COLAMD and User Sp ecied Ordering. These three ll reducing orderings can b e com bined with a BTF preordering or no preordering.Hence six dieren t sc hemes. Ho w ev er in this exp erimen t user sp ecied ordering is not considered. That lea v es us with four dieren t sc hemes.The table 3{5 lists the ll (n um b er of nonzeros in L+U plus the n um b er of o-diagonal elemen ts) for eac h of these ordering sc hemes. F rom table 3{5 w e nd that BTF+AMD giv es consisten tly b etter ll-in across dieren t circuit matrices. Ho w ev er there are observ able ab errations. F or PAGE 55 45 example, with the circuit Bomhof/circuit 2, AMD and COLAMD giv e b etter ll-in than BTF+AMD. These results determine again that BTF+AMD is the b est ordering sc heme to use for circuit matrices. 3.6.2 Timing Dieren t Phases in KLU These exp erimen ts sho w the time sp en t in dieren t phases of the algorithm. BTF pre-ordering follo w ed b y a AMD ll-reducing ordering is emplo y ed. As men tioned earlier, there are four dieren t phases. 1. Analysis phase: This comprises the pre-ordering and ll reducing ordering. 2. F actor phase: This comprises the factorization part of the algorithm. It includes the sym b olic analysis phase, partial piv oting, symmetric pruning steps. 3. Refactor phase: This comprises the part where w e do a factorization using the already pre-computed partial piv oting p erm utation and the nonzero pattern of the L and U matrices. There is no sym b olic analysis, partial piv oting and symmetric pruning in refactor phase. 4. Solv e phase: This comprises the solv e phase of the algorithm. Solving a single righ t hand side w as exp erimen ted. When giv en a set of matrices with the same nonzero pattern, the analysis and factor phases are done only once. The refactor phase is then rep eated for the remaining matrices. Solv e phase is rep eated as man y times as there are systems to solv e. T able 3{6 consists of the timing results. Analysis phase consumes most of the time sp en t in the algorithm. Refactor time is t ypically 3 or 4 times smaller than factor time and 8 times smaller than analysis time plus factor time put together. Solv e phase consumes the least fraction of time sp en t. 3.6.3 Ordering Qualit y among KLU, UMFP A CK and Gilb ert-P eierls The table 3{7 compares the ordering qualit y among KLU using BTF+AMD, UMFP A CK using COLAMD or AMD and Gilb ert-P eierls' using AMD. W e can PAGE 56 46 T able 3{5: Fill-in with four dieren t sc hemes in KLU Nonzeros BTF + BTF + Matrix in A AMD COLAMD AMD COLAMD Sandia/adder dcop 01 (1813) 11156 13525 13895 18848 21799 Sandia/adder trans 01 (1814) 14579 20769 36711 24365 119519 Sandia/fpga dcop 01 (1220) 5892 7891 8118 8869 12016 Sandia/fpga trans 01 (1220) 7382 10152 12776 10669 21051 Sandia/init adder1 (1813) 11156 13525 13895 18848 21799 Sandia/m ult dcop 01 (25187) 193276 226673 228301 2176328 1460322 Sandia/oscil dcop 01 (430) 1544 2934 3086 3078 3295 Sandia/oscil trans 01 (430) 1614 2842 3247 2897 3259 Bomhof/circuit 1 (2624) 35823 44879 775363 44815 775720 Bomhof/circuit 2 (4510) 21199 40434 89315 36197 36196 Bomhof/circuit 3 (12127) 48137 86718 98911 245336 744245 Grund/b2 ss (1089) 3895 9994 9212 26971 9334 Grund/b dyn (1089) 4144 11806 10597 33057 10544 Grund/ba y er02 (13935) 63307 889914 245259 1365142 307979 Grund/d dyn (87) 230 481 461 619 494 Grund/d ss (53) 144 302 292 382 298 Grund/meg1 (2904) 58142 232042 184471 1526780 378904 Grund/meg4 (5860) 25258 42398 310126 43250 328144 Grund/p oli (4008) 8188 12200 12208 12238 12453 Grund/p oli large (15575) 33033 48718 48817 49806 51970 Hamm/add20 (2395) 13151 19554 34636 19554 34636 Hamm/add32 (4960) 19848 28754 36030 28754 36030 Hamm/b circuit (68902) 375558 1033240 1692668 1033240 1692668 Hamm/hcircuit (105676) 513072 731634 2623852 736080 4425310 Hamm/memplus (17758) 99147 137030 3282586 137030 3282586 Hamm/scircuit (170998) 958936 2481122 6410286 2481832 6427526 Ra jat/ra jat03 (7602) 32653 163913 235111 172666 236938 Ra jat/ra jat04 (1041) 8725 12863 80518 18618 158000 Ra jat/ra jat05 (301) 1250 1926 2053 2101 3131 Ra jat/ra jat11 (135) 665 890 978 944 1129 Ra jat/ra jat12 (1879) 12818 15308 273667 15571 128317 Ra jat/ra jat13 (7598) 48762 58856 90368 64791 5234287 Ra jat/ra jat14 (180) 1475 1994 2249 2105 2345 PAGE 57 47 T able 3{6: Time in seconds, sp en t in dieren t phases in KLU Analysis F actor Refactor Solv e Matrix time time time time Sandia/adder dcop 01 (1813) 0.0028 0.0028 0.0007 0.0003 Sandia/adder trans 01 (1814) 0.0045 0.0038 0.0013 0.0003 Sandia/fpga dcop 01 (1220) 0.0018 0.0015 0.0004 0.0002 Sandia/fpga trans 01 (1220) 0.0022 0.0017 0.0005 0.0002 Sandia/init adder1 (1813) 0.0028 0.0028 0.0007 0.0003 Sandia/m ult dcop 01 (25187) 0.2922 0.0522 0.0196 0.0069 Sandia/oscil dcop 01 (430) 0.0008 0.0006 0.0002 0.0001 Sandia/oscil trans 01 (430) 0.0008 0.0006 0.0002 0.0001 Bomhof/circuit 1 (2624) 0.0098 0.0085 0.0053 0.0006 Bomhof/circuit 2 (4510) 0.0082 0.0064 0.0034 0.0006 Bomhof/circuit 3 (12127) 0.0231 0.0174 0.0056 0.0020 Grund/b2 ss (1089) 0.0031 0.0018 0.0005 0.0001 Grund/b dyn (1089) 0.0033 0.0021 0.0007 0.0002 Grund/ba y er02 (13935) 0.0584 0.2541 0.2070 0.0090 Grund/d dyn (87) 0.0002 0.0001 0.0000 0.0000 Grund/d ss (53) 0.0001 0.0001 0.0000 0.0000 Grund/meg1 (2904) 0.0178 0.0853 0.0590 0.0033 Grund/meg4 (5860) 0.0157 0.0094 0.0028 0.0009 Grund/p oli (4008) 0.0017 0.0010 0.0004 0.0003 Grund/p oli large (15575) 0.0064 0.0045 0.0018 0.0014 Hamm/add20 (2395) 0.0056 0.0044 0.0014 0.0003 Hamm/add32 (4960) 0.0084 0.0074 0.0019 0.0006 Hamm/b circuit (68902) 0.3120 0.2318 0.1011 0.0257 Hamm/hcircuit (105676) 0.2553 0.1920 0.0658 0.0235 Hamm/memplus (17758) 0.0576 0.0358 0.0157 0.0036 Hamm/scircuit (170998) 0.8491 0.6364 0.3311 0.0622 Ra jat/ra jat03 (7602) 0.0152 0.0440 0.0306 0.0034 Ra jat/ra jat04 (1041) 0.0029 0.0023 0.0008 0.0002 Ra jat/ra jat05 (301) 0.0005 0.0005 0.0001 0.0000 Ra jat/ra jat11 (135) 0.0002 0.0002 0.0001 0.0000 Ra jat/ra jat12 (1879) 0.0038 0.0027 0.0008 0.0002 Ra jat/ra jat13 (7598) 0.0122 0.0105 0.0033 0.0011 Ra jat/ra jat14 (180) 0.0004 0.0003 0.0001 0.0000 PAGE 58 48 infer from the results that KLU pro duces b etter ordering than UMFP A CK and Gilb ert-P eierls' algorithm. F or KLU, the follo wing MA TLAB co de determines the ll. opts = [0.1 1.2 1.2 10 1 0 0 0 ] ; [x info] = klus(A,b,opts, []) ; fill = info (31) + info (32) + info(8) ; F or UMFP A CK, the snipp et is [L U P Q] = lu(A) ; fill = nnz(L) + nnz(U) ; F or Gilb ert-P eierls' the co de is Q = amd(A) ; [L U P] = lu(A(Q,Q), 0.1) ; fill = nnz(L) + nnz(U) ; 3.6.4 P erformance Comparison b et w een KLU and UMFP A CK This exp erimen t compares the total time sp en t in the analysis, factor and solv e phases b y the algorithms. The results are listed in table 3{8 KLU outp erforms UMFP A CK in time. F or KLU, the follo wing snipp et in MA TLAB is used: tic[x info] = klus(A,b,opts) ; t1 = toc ; F or UMFP A CK, the follo wing co de in MA TLAB is used to nd the total time: ticx = A \ b ; t2 = toc ; PAGE 59 49 T able 3{7: Fill-in among KLU, UMFP A CK and Gilb ert-P eierls Matrix nnz KLU UMFP A CK Gilb ert-P eierls Sandia/adder dcop 01 (1813) 11156 13525 14658 18825 Sandia/adder trans 01 (1814) 14579 20769 20769 24365 Sandia/fpga dcop 01 (1220) 5892 7891 8106 8869 Sandia/fpga trans 01 (1220) 7382 10152 10152 10669 Sandia/init adder1 (1813) 11156 13525 14658 18825 Sandia/m ult dcop 01 (25187) 193276 226673 556746 1300902 Sandia/oscil dcop 01 (430) 1544 2934 2852 3198 Sandia/oscil trans 01 (430) 1614 2842 3069 2897 Bomhof/circuit 1 (2624) 35823 44879 44879 44815 Bomhof/circuit 2 (4510) 21199 40434 35107 38618 Bomhof/circuit 3 (12127) 48137 86718 84117 245323 Grund/b2 ss (1089) 3895 9994 8309 22444 Grund/b dyn (1089) 4144 11806 9642 41092 Grund/ba y er02 (13935) 63307 889914 259329 973093 Grund/d dyn (87) 230 481 442 523 Grund/d ss (53) 144 302 268 395 Grund/meg1 (2904) 58142 232042 151740 1212904 Grund/meg4 (5860) 25258 42398 42398 43250 Grund/p oli (4008) 8188 12200 12200 12239 Grund/p oli large (15575) 33033 48718 48745 49803 Hamm/add20 (2395) 13151 19554 19554 19554 Hamm/add32 (4960) 19848 28754 28754 28754 Hamm/b circuit (68902) 375558 1033240 1033240 1033240 Hamm/hcircuit (105676) 513072 731634 730906 736080 Hamm/memplus (17758) 99147 137030 137030 137030 Hamm/scircuit (170998) 958936 2481122 2481122 2481832 Ra jat/ra jat03 (7602) 32653 163913 163913 172666 Ra jat/ra jat04 (1041) 8725 12863 12860 18613 Ra jat/ra jat05 (301) 1250 1926 1944 2101 Ra jat/ra jat11 (135) 665 890 890 944 Ra jat/ra jat12 (1879) 12818 15308 15308 15571 Ra jat/ra jat13 (7598) 48762 58856 58856 64791 Ra jat/ra jat14 (180) 1475 1994 1994 2105 PAGE 60 50 T able 3{8: P erformance comparison b et w een KLU and UMFP A CK Matrix KLU UMFP A CK Sandia/adder dcop 01 (1813) 0.0116 0.0344 Sandia/adder trans 01 (1814) 0.0112 0.0401 Sandia/fpga dcop 01 (1220) 0.0050 0.0257 Sandia/fpga trans 01 (1220) 0.0054 0.0203 Sandia/init adder1 (1813) 0.0109 0.0323 Sandia/m ult dcop 01 (25187) 1.2383 1.1615 Sandia/oscil dcop 01 (430) 0.0022 0.0070 Sandia/oscil trans 01 (430) 0.0019 0.0074 Bomhof/circuit 1 (2624) 0.0232 0.1223 Bomhof/circuit 2 (4510) 0.0201 0.0522 Bomhof/circuit 3 (12127) 0.0579 0.1713 Grund/b2 ss (1089) 0.0066 0.0168 Grund/b dyn (1089) 0.0072 0.0175 Grund/ba y er02 (13935) 0.6089 0.3565 Grund/d dyn (87) 0.0005 0.0014 Grund/d ss (53) 0.0003 0.0010 Grund/meg1 (2904) 0.1326 0.1018 Grund/meg4 (5860) 0.0571 0.1111 Grund/p oli (4008) 0.0050 0.0121 Grund/p oli large (15575) 0.0208 0.0497 Hamm/add20 (2395) 0.0123 0.0506 Hamm/add32 (4960) 0.0201 0.0738 Hamm/b circuit (68902) 0.7213 1.8823 Hamm/hcircuit (105676) 0.7313 2.7764 Hamm/memplus (17758) 0.1232 0.8140 Hamm/scircuit (170998) 1.9812 7.3448 Ra jat/ra jat03 (7602) 0.0793 0.1883 Ra jat/ra jat04 (1041) 0.0068 0.0284 Ra jat/ra jat05 (301) 0.0014 0.0046 Ra jat/ra jat11 (135) 0.0007 0.0023 Ra jat/ra jat12 (1879) 0.0087 0.0355 Ra jat/ra jat13 (7598) 0.0330 0.1229 Ra jat/ra jat14 (180) 0.0010 0.0032 PAGE 61 CHAPTER 4 USER GUIDE F OR KLU 4.1 The Primary KLU Structures 4.1.1 klu common This is a con trol structure that con tains b oth input con trol parameters for KLU as w ell as output statistics computed b y the algorithm. It is a mandatory parameter for all the KLU routines. Its con ten ts are listed as follo ws: tol T yp e: double Input parameter for piv ot tolerance for diagonal preference.Default v alue is 0.001 gro wth T yp e: double Input parameter for reallo cation gro wth size of LU factors. Default v alue is 1.2 initmem amd T yp e: double Input parameter for initial memory size with AMD. Initial memory size = initmem amd nnz(L) + n. Default v alue is 1.2 initmem T yp e: double Input parameter for initial memory size with COLAMD. Initial memory size = initmem nnz(A) + n. Default v alue is 10 btf T yp e: in t 51 PAGE 62 52 Input parameter to use BTF pre-ordering, or not. Default v alue is 1 (to use BTF) ordering T yp e: in t Input parameter to sp ecify whic h ll reducing ordering to use. 0= AMD,1= COLAMD, 2= user P and Q. Default is 0. scale T yp e: in t Input parameter to sp ecify whic h scaling strategy to use. 0= none, 1= sum, 2= max. Default is 0 singular pro c T yp e: in t Input parameter to sp ecify whether to stop on singularit y or con tin ue. 0= stop, 1= con tin ue. Default is 0. status T yp e: in t Output parameter that indicates the result of the KLU function call. V alues are KLU OK(0) if OK and < 0 if error. Error v alues are KLU SINGULAR (-1), KLU OUT OF MEMOR Y (-2), KLU INV ALID (-3) nreallo c T yp e: in t Output parameter.Con tains n um b er of reallo cations of L and U singular col T yp e: in t Output parameter.Con tains the column no of singular column if an y nodiag T yp e: in t PAGE 63 53 Output parameter.Con tains the n um b er of o-diagonal piv ots c hosen 4.1.2 klu sym b olic This structure encapsulates the information related to the analysis phase. The mem b ers of the structure are listed as follo ws: symmetry T yp e: double Con tains the symmetry of largest blo c k est rops T yp e: double Con tains the estimated factorization rop coun t lnz T yp e: double Con tains the estimated nonzeros in L including diagonals unz T yp e: double Con tains the estimated nonzeros in U including diagonals Lnz T yp e: double Arra y of size n, but only Lnz [0..n blo c ks-1] is used. Con tains the estimated n um b er of nonzeros in eac h blo c k n T yp e: in t Con tains the size of input matrix A where A is n-b y-n nz T yp e: in t Con tains the n um b er of en tries in input matrix P PAGE 64 54 T yp e: in t Arra y of size n. Con tains the ro w p erm utation from ordering Q T yp e: in t Arra y of size n. Con tains the column p erm utation from ordering R T yp e: in t Arra y of size n+1, but only R [0..n blo c ks] is used. Con tains the start and end column/ro w index for eac h blo c k. Blo c k k go es from R[k] to R[k+1] 1 nzo T yp e: in t Con tains the n um b er of nonzeros in o-diagonal blo c ks n blo c ks T yp e: in t Con tains the n um b er of blo c ks maxblo c k T yp e: in t Con tains the size of largest blo c k ordering T yp e: in t Con tains the ordering used (0 = AMD, 1 = COLAMD, 2 = GIVEN) do btf T yp e: in t Indicates whether or not BTF preordering w as requested The mem b ers symmetry est rops, lnz, unz, Lnz are computed only when AMD is used. The remaining mem b ers are computed for all orderings. PAGE 65 55 4.1.3 klu n umeric This structure encapsulates information related to the factor phase. It con tains the LU factors of eac h blo c k, piv ot ro w p erm utation, and the en tries in the odiagonal blo c ks among others. Its con ten ts are listed as follo ws: umin T yp e: double Con tains the minim um absolute diagonal en try in U umax T yp e: double Con tains the maxim um absolute diagonal en try in U n blo c ks T yp e: in t Con tains the n um b er of blo c ks lnz T yp e: in t Con tains actual n um b er of nonzeros in L excluding diagonals unz T yp e: in t Con tains actual n um b er of nonzeros in U excluding diagonals Pn um T yp e: in t Arra y of size n.Con tains the nal piv ot p erm utation Pin v T yp e: in t Arra y of size n.Con tains the in v erse of nal piv ot p erm utation Lbip T yp e: in t ** PAGE 66 56 Arra y of size n blo c ks. Eac h elemen t is an arra y of size blo c k size + 1. Eac h elemen t con tains the column p oin ters for L factor of the corresp onding blo c k Ubip T yp e: in t ** Arra y of size n blo c ks. Eac h elemen t is an arra y of size blo c k size + 1. Eac h elemen t con tains the column p oin ters for U factor of the corresp onding blo c k Lblen T yp e: in t ** Arra y of size n blo c ks. Eac h elemen t is an arra y of size blo c k size. Eac h elemen t con tains the column lengths for L factor of the corresp onding blo c k Ublen T yp e: in t ** Arra y of size n blo c ks. Eac h elemen t is an arra y of size blo c k size. Eac h elemen t con tains the column lengths for U factor of the corresp onding blo c k LUb x T yp e: v oid ** Arra y of size n blo c ks. Eac h elemen t is an arra y con taining the ro w indices and n umerical v alues of LU factors of the corresp onding blo c k. The diagonals of LU factors are not stored here Udiag T yp e: v oid ** Arra y of size n blo c ks.Eac h elemen t is an arra y of size blo c k size. Eac h elemen t con tains the diagonal v alues of U factor of the corresp onding blo c k Singleton T yp e: v oid Arra y of size n blo c ks.Con tains the singleton v alues Rs PAGE 67 57 T yp e: double Arra y of size n. Con tains the ro w scaling factors scale T yp e: in t Indicates the scaling strategy used. (0 = none, 1 = sum, 2 = max) W ork T yp e: v oid P ermanen t w orkspace used for factorization and solv e. It is of size MAX (4n n umerical v alues, n n umerical v alues + 6*maxblo c k in t's) w orksize T yp e: size t Con tains the size (in b ytes) of W ork allo cated ab o v e Xw ork T yp e: v oid This is an alias in to Numeric->W ork Iw ork T yp e: in t An in teger alias in to Xw ork + n Op T yp e: in t Arra y of size n+1. Con tains the column p oin ters for o-diagonal elemen ts. O T yp e: in t Arra y of size n um b er of o-diagonal en tries. Con tains the ro w indices of o-diagonal elemen ts Ox PAGE 68 58 T yp e: v oid Arra y of size n um b er of o-diagonal en tries. Con tains the n umerical v alues of o-diagonal elemen ts 4.2 KLU Routines The user callable KLU routines in the C language are explained in this section. The follo wing guidelines are applicable to all routines except when explicitly stated otherwise in the description of a routine. 1. : All the argumen ts are required except when explicitly stated as optional. If optional, a user can pass NULL for the corresp onding argumen t. 2. : The con trol input/output argumen t "Common" of t yp e "klu common *" is a required argumen t for all routines. 3. : All argumen ts other than the ab o v e men tioned con trol argumen t "Common", are input argumen ts and are not mo died. 4.2.1 klu analyze klu_symbolic *klu_analyze ( int n, int Ap [ ], int Ai [ ], klu_common *Common ) ; This routine orders the matrix using BTF if sp ecied and the ll reducing ordering sp ecied. Returns a p oin ter to klu sym b olic structure that con tains the ordering information. All argumen ts are required { n: Size of the input matrix A where A is n*n. { Ap: Arra y of column p oin ters for the input matrix. Size n+1. PAGE 69 59 { Ai: Arra y of ro w indices. Size n um b er of nonzeros in A. { Common: The con trol input/output structure. 4.2.2 klu analyze giv en klu_symbolic *klu_analyze_given ( int n, int Ap [ ], int Ai [ ], int Puser [ ], int Quser [ ], klu_common *Common ) ; This routine orders the matrix using BTF if sp ecied and the giv en Puser and QUser as ll reducing ordering. If Puser and Quser are NULL, then the natural ordering is used. Returns a p oin ter to klu sym b olic structure that con tains the ordering information. Argumen ts { n: Size of the input matrix A where A is n*n. Required. { Ap: Arra y of column p oin ters for the input matrix. Size n+1. Required. { Ai: Arra y of ro w indices. Size n um b er of nonzeros in A. Required. { Puser: Optional ro w p erm utation to use. { Quser: Optional column p erm utation to use. { Common: The con trol input/output structure. 4.2.3 klu *factor klu_numeric *Numeric klu_factor ( int Ap [ ], int Ai [ ], PAGE 70 60 double Ax [ ], klu_symbolic *Symbolic, klu_common *Common ) ; This routine factors a real matrix. There is a complex v ersion of this routine klu z factor that factors a complex matrix and has same function declaration as klu factor. Both use the results of a call to klu analyze. Returns a p oin ter to klu n umeric if successful. NULL otherwise. All the argumen ts are required. { Ap: Arra y of column p oin ters for the input matrix. Size n+1. { Ai: Arra y of ro w indices. Size n um b er of nonzeros in A. { Ax: Arra y of n umerical v alues. Size n um b er of nonzeros in A. In the complex case, the arra y should consist of real and imaginary parts of eac h n umerical v alue as adjacen t pairs. { Sym b olic: The structure that con tains the results from a call to klu analyze. { Common: The con trol input/output structure. The status eld in Common is set to indicate if the routine w as successful or not. 4.2.4 klu *solv e void klu_solve ( klu_symbolic *Symbolic, klu_numeric *Numeric, int ldim, int nrhs, double B [ ], klu_common *Common ) ; PAGE 71 61 This routine solv es a real system. There is a complex v ersion of this routine klu z solv e, that solv es a complex system and has the same function declaration as klu solv e. Both use the results of a call to klu analyze and klu *factor. Return t yp e is v oid. The rhs v ector B is o v erwritten with the solution. All Argumen ts are required. { Sym b olic: The structure that con tains the results from a call to klu analyze. { Numeric: The structure that con tains the results from a call to klu *factor. { ldim: The leading dimension of the righ t hand side B. { nrhs: The n um b er of righ t hand sides b eing solv ed. { B: The righ t hand side. It is a v ector of length ldim nrhs. It can b e real or complex dep ending on whether a real or complex system is b eing solv ed. If complex, the real and imaginary parts of the rhs n umerical v alue m ust b e stored as adjacen t pairs. It is o v erwritten with the solution. { Common: The con trol input/output structure. 4.2.5 klu *tsolv e void klu_tsolve ( klu_symbolic *Symbolic, klu_numeric *Numeric, int ldim, int nrhs, int conj_solve, double B [ ], klu_common *Common ) ; PAGE 72 62 This routine is similar to klu solv e except that it solv es a transp ose system A 0 x = b This routine solv es a real system. Again there is a complex v ersion of this routine klu z tsolv e for solving complex systems and has the same function declaration as klu tsolv e. It also oers to do a conjugate transp ose solv e for the complex system A H x = b Return t yp e is v oid. The rhs v ector B is o v erwritten with the solution. All argumen ts are required. The descriptions for all argumen ts except conj solv e are same as those for klu *solv e. The argumen t conj solv e is relev an t only for complex case. It tak es t w o v alues 1 = CONJUGA TE TRANSPOSE SOL VE, 0 = TRANSPOSE SOL VE. 4.2.6 klu *refactor void klu_refactor ( int Ap [ ], int Ai [ ], double Ax [ ], klu_symbolic *Symbolic, klu_numeric *Numeric, klu_common *Common ) ; This routine do es a refactor of the matrix using the previously computed ordering information in Sym b olic and the nonzero pattern of the LU factors in Numeric ob jects. It assumes same partial piv oting order computed in Numeric. It c hanges only the n umerical v alues of the LU factors stored in Numeric ob ject. It has a complex v ersion klu z refactor with the same function protot yp e to handle complex cases. Return t yp e is v oid. The n umerical v alues of LU factors in Numeric parameter are up dated. All argumen ts are required. PAGE 73 63 { Ap: Arra y of column p oin ters for the input matrix. Size n+1. { Ai: Arra y of ro w indices. Size n um b er of nonzeros in A. { Ax: Arra y of n umerical v alues. Size n um b er of nonzeros in A. In the complex case, the arra y should consist of real and imaginary parts of eac h n umerical v alue as adjacen t pairs. { Sym b olic: The structure that con tains the results from a call to klu analyze. { Numeric: Input/output argumen t. The structure con tains the results from a call to klu *factor. The n umerical v alues of LU factors are o v erwritten with the ones for the curren t matrix b eing factorized. { Common: The con trol input/output structure. The status eld in Common is set to indicate if the routine w as successful or not. 4.2.7 klu defaults void klu_defaults ( klu_common *Common ) ; This routine sets the default v alues for the con trol input parameters of the klu common ob ject. The default v alues are listed in the description of the klu common structure. A call to this routine is required unless the user sets the con trol input parameters explicitly Return t yp e is v oid. The argumen t Common is required. The con trol input parameters in Common are set to default v alues. 4.2.8 klu *rec piv ot gro wth double klu_rec_pivot_growth ( int Ap [ ], PAGE 74 64 int Ai [ ], double Ax [ ], klu_symbolic *Symbolic, klu_numeric *Numeric, klu_common *Common ) ; This routine computes the recipro cal piv ot gro wth of the factorization algorithm. The complex v ersion of this routine klu z rec piv ot gro wth handles complex matrices and has the same function declaration. The piv ot gro wth estimate is returned. All argumen ts are required. { Ap: Arra y of column p oin ters for the input matrix. Size n+1. { Ai: Arra y of ro w indices. Size n um b er of nonzeros in A. { Ax: Arra y of n umerical v alues. Size n um b er of nonzeros in A. In the complex case, the arra y should consist of real and imaginary parts of eac h n umerical v alue as adjacen t pairs. { Sym b olic: The structure that con tains the results from a call to klu analyze. { Numeric: The structure that con tains the results from a call to klu *factor. { Common: The con trol input/output structure. The status eld in Common is set to indicate if the routine w as successful or not. 4.2.9 klu *estimate cond n um b er double klu_estimate_cond_number ( int Ap [ ], double Ax [ ], klu_symbolic *Symbolic, klu_numeric *Numeric, PAGE 75 65 klu_common *Common ); This routine computes the condition n um b er estimate of the input matrix. As b efore, the complex v ersion of this routine klu z estimate cond n um b er has the same function declaration and handles complex matrices. The condition n um b er estimate is returned. All argumen ts are required. { Ap: Arra y of column p oin ters for the input matrix. Size n+1. { Ax: Arra y of n umerical v alues. Size n um b er of nonzeros in A. In the complex case, the arra y should consist of real and imaginary parts of eac h n umerical v alue as adjacen t pairs. { Sym b olic: The structure that con tains the results from a call to klu analyze. { Numeric: The structure that con tains the results from a call to klu *factor. { Common: The con trol input/output structure. The status eld in Common is set to indicate if the routine w as successful or not. 4.2.10 klu free sym b olic void klu_free_symbolic ( klu_symbolic **Symbolic, klu_common *Common ) ; This routine deallo cates or frees the con ten ts of the klu sym b olic ob ject. The Sym b olic parameter m ust b e a v alid ob ject computed b y a call to klu analyze or klu analyze giv en. Return t yp e is v oid. All argumen ts are required. PAGE 76 66 { Sym b olic: Input/Output argumen t. Must b e a v alid ob ject computed b y a call to klu analyze or klu analyze giv en. If NULL, the routine just returns. { Common: The con trol input/output structure. 4.2.11 klu free n umeric void klu_free_numeric ( klu_numeric **Numeric, klu_common *Common ) ; This routine frees the klu n umeric ob ject computed b y a call to klu factor or klu z factor routines. It resets the p oin ter to klu n umeric to NULL. There is a complex v ersion of this routine called klu z free n umeric with the same function declaration to handle the complex case. Return t yp e is v oid. All argumen ts are required. { Numeric. Input/Output argumen t. The con ten ts of the klu n umeric ob ject are freed. The p oin ter to klu n umeric ob ject is set to NULL. { Common: The con trol input/output structure. PAGE 77 REFERENCES [1] C. L. La wson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic linear algebra subprograms for F OR TRAN usage, A CM T r ans. Math. Soft. 5: 308{323, 1979. [2] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of F OR TRAN basic linear algebra subprograms, A CM T r ans. Math. Soft. 14: 1{17, 1988. [3] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm 656: An extended set of F OR TRAN basic linear algebra subprograms, A CM T r ans. Math. Soft. 14: 18{32, 1988. [4] J. J. Dongarra, J. Du Croz, I. S. Du, and S. Hammarling, A set of lev el 3 basic linear algebra subprograms, A CM T r ans. Math. Soft. 16: 1{17, 1990. [5] J. J. Dongarra, J. Du Croz, I. S. Du, and S. Hammarling, Algorithm 679: A set of lev el 3 basic linear algebra subprograms, A CM T r ans. Math. Soft. 16: 18{28, 1990. [6] James W. Demmel, Stanley C. Eisenstat, John R. Gilb ert, Xiao y e S. Li and Joseph W. H. Liu, A sup erno dal approac h to sparse partial piv oting, SIAM J. Matrix A nalysis and Applic ations 20(3): 720{755, 1999. [7] Timoth y A. Da vis, I.S.Du, An unsymmetric-pattern m ultifron tal metho d for sparse LU factorization, SIAM J. Matrix A nalysis and Applic. 18(1): 140{158, 1997. [8] Timoth y A. Da vis, Algorithm 832: UMFP A CK, an unsymmetric-pattern m ultifron tal metho d with a column pre-ordering strategy A CM T r ans. Math. Softwar e 30(2): 196{199, 2004. [9] John R.Gilb ert and Tim P eierls, Sparse partial piv oting in time prop ortional to arithmetic op erations, SIAM J. Sci. Stat. Comput. 9(5): 862{873, 1988. [10] A. George and E. Ng, An implemen tation of Gaussian elimination with partial piv oting for sparse systems. SIAM J. Sci. Statist. Comput. 6(2): 390{409, 1985. [11] Iain S. Du, On algorithms for obtaining a maxim um transv ersal, A CM T r ansactions on Mathematic al Softwar e 7(3): 315{330, 1981. 67 PAGE 78 68 [12] Iain S. Du, Algorithm 575 p erm utations for a zero-free diagonal, A CM T r ansactions on Mathematic al Softwar e 7(3): 387{390, 1981. [13] Iain S. Du and John K. Reid, Algorithm 529: p erm utations to blo c k triangular form, A CM T r ans. on Mathematic al Softwar e 4(2): 189{192, 1978. [14] Iain S. Du and John K. Reid, An implemen tation of T arjan's algorithm for the blo c k triangular form of a matrix, A CM T r ans. on Mathematic al Softwar e 4(2): 137{147, 1978. [15] R.E. T arjan, Depth rst searc h and linear graph algorithms, SIAM J. Computing. 1: 146{160, 1972. [16] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Riv est, Cliord Stein, Intr o duction to A lgorithms Second Edition 2001, MIT Press, Cam bridge. [17] S.C. Eisenstat and J.W.H. Liu, Exploiting structural symmetry in a sparse partial piv oting co de, SIAM J.Sci.Comput. 14(1): 253{257, 1993. [18] P R. Amesto y T. A. Da vis, and I. S. Du,An appro ximate minim um degree ordering algorithm, SIAM J. Matrix A nal. Applic. 17(4): 886{905, 1996. [19] P .R. Amesto y T.A. Da vis, and I.S. Du, Algorithm 837: AMD, an appro ximate minim um degree ordering algorithm, A CM T r ansactions on Mathematic al Softwar e 30(3): 381{388, 2004. [20] Timoth y A. Da vis, John R. Gilb ert, Stefan I. Larimore, and Esmond G. Ng. A column appro ximate minim um degree ordering algorithm, A CM T r ansactions on Mathematic al Softwar e 30(3): 353{376, 2004. [21] Timoth y A. Da vis John R. Gilb ert Stefan I. Larimore Esmond G. Ng, Algorithm 836: COLAMD, a column appro ximate minim um degree ordering algorithm, A CM T r ansactions on Mathematic al Softwar e 30(3): 377{380, 2004. [22] W.W. Hager, Condition estimates, SIAM J. Sci. Stat. Comput. 5,2: 311{316, 1984. [23] Nic holas J. Higham, F ortran co des for estimating the one-norm of a real or complex matrix, with applications to condition estimation, A CM T r ans. on Mathematic al Softwar e. 14(4): 381{396, 1988. PAGE 79 BIOGRAPHICAL SKETCH Ek anathan w as b orn in Tirunelv eli, India, on Octob er 2, 1977. He receiv ed his Bac helor of T ec hnology degree in c hemical engineering from Anna Univ ersit y Chennai, India, in Ma y 1998. He w ork ed with Infosys T ec hnologies at Brussels, Belgium, as programmer/analyst from 1998 till 2001 and with SAP A G at W alldorf, German y as soft w are dev elop er from 2001 till 2003. Curren tly he is pursuing his Master of Science degree in computer science at Univ ersit y of Florida. His in terests and hobbies include tra v el, reading no v els and magazines, m usic and mo vies. 69 |