An Approach for Parallelizing any General Unsymmetric
Sparse Matrix Algorithm *
Tariq Rashid†   Timothy A. Davis‡
Technical Report TR-94-036, Computer and Information Sciences Department, University of Florida, 1994. Presented at the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995, San Francisco, CA.
Abstract
In many large scale scientific and engineering computations, the solution to a sparse
linear system is required. We present a partial unsymmetric nested dissection method
that can be used to parallelize any general unsymmetric sparse matrix algorithm whose
pivot search can be restricted to a subset of rows and columns in the active submatrix.
The submatrix is determined by the partial unsymmetric dissection. We present results of this approach, applied to the unsymmetric-pattern multifrontal method.
1 Introduction
The importance of efficient large sparse matrix manipulation is well known for scientific and
engineering applications. It is the need for accuracy and speed in scientific and engineering
computation that drives the need for higher performance. Thus parallel computers have
become in great demand for such computation. The solution one chooses depends critically
on the degree of parallelism sought and the model of parallel computation used. For a
given problem to be solved in parallel, the message passing programming model involves
distributing the data and the computation among the processors. While this can be done easily for well-structured problems, the irregular structure of large problems, such as unsymmetric-pattern sparse matrices, makes the distribution difficult in general. A good ordering of the rows and columns of a sparse matrix can therefore significantly reduce the storage and execution time required by an algorithm for factoring the matrix on parallel computers.
In this paper, we consider the solution of the n-by-n linear system
(1) Ax = b
where A is large, sparse, and unsymmetric. We consider the direct solution of (1) by
means of LU factorization. In such a procedure, we apply a graph partition based ordering
approach (partial unsymmetric nested dissection) to decompose the computations for the
solution so that it can be efficiently done in parallel.
*This project is supported by the National Science Foundation (ASC-9111263 and DMS-9223088).
†phone: (904) 392-1482, email: tariq@cis.ufl.edu
‡phone: (904) 392-1481, email: davis@cis.ufl.edu
2 Design Considerations
Graph partitioning is a method that divides the vertices of a graph into several sets of roughly equal size such that the number of edges connecting vertices in different sets is kept small. Partitioning is one of the key ideas for generating efficient orderings for sparse matrix calculations. Since finding an optimal partition of a graph is known to be NP-complete [7], the nature of the problem has inspired a variety of heuristics.
Consider the undirected bipartite graph of the matrix A, with 2n nodes (n row nodes and n column nodes). Let R and C be the sets of n rows and n columns, and let H = (V, E) be the bipartite graph whose node set V is divided into the sets R and C (V = R ∪ C). The sets R and C are disjoint, and every edge in H has one end point in R and the other in C. An example of a matrix and its associated bipartite graph is shown in Fig. 1, where

(2)    (r_i, c_j) ∈ E  if and only if  a_ij ≠ 0,  with r_i ∈ R, c_j ∈ C.
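As a small illustration (our own sketch, not code from the paper), the bipartite graph H can be built directly from the nonzero pattern of A; the node labels ('r', i) and ('c', j) are a convention we introduce here:

```python
# Sketch: build the bipartite graph H = (R ∪ C, E) of a sparse matrix,
# where edge (r_i, c_j) is present exactly when a_ij != 0.
# A is a dense list-of-lists standing in for a sparse matrix; the node
# labels ('r', i) / ('c', j) are our own convention.

def bipartite_graph(A):
    adj = {}
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if a_ij != 0:
                adj.setdefault(('r', i), set()).add(('c', j))
                adj.setdefault(('c', j), set()).add(('r', i))
    return adj

A = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 5]]
H = bipartite_graph(A)
```

Every edge joins a row node to a column node, so the graph is bipartite by construction.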
Suppose a single vertex separator S is found for H such that there are two disconnected subgraphs H_a and H_b in H\S. If the pseudoperipheral node is a row (column), then the level structures will contain columns (rows) and rows (columns) alternately. The vertex separator contains either a set of r_s row nodes or a set of c_s column nodes. This occurs because there is no edge between any two row nodes or between any two column nodes in the bipartite graph H defined above; that is, (r_i, r_j) ∉ E and (c_i, c_j) ∉ E for any i and j. However, in our partial nested dissection algorithm, we perform node improvement (as done in [10]) so that a single vertex separator with r_s row nodes and c_s column nodes is found. In contrast with [10], r_s ≠ c_s in general. The resulting unconnected subgraphs H_a and H_b have r_a and r_b row nodes, and c_a and c_b column nodes, respectively. The result is a matrix of the form
    [ A11    0   A13 ]
    [  0   A22   A23 ]
    [ A31  A32   A33 ]

where A11 is r_a-by-c_a, A22 is r_b-by-c_b, and A33 is r_s-by-c_s (rectangular, in general).
Each of the submatrices A11 and A22 is factorized in parallel. Up to min(r_a, c_a) pivots can be found in A11. Unfactorized rows and columns of A11 are then combined with A33. Threshold-based partial pivoting can be used, where the separator block A31 is updated by the factorization of A11. The method can be applied recursively to generate more parallelism. However, the submatrices A11 and A22 should not be too small. This is why we call our approach "partial" nested dissection.
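To make the block form concrete, the following sketch (our own toy example, not from the paper) checks that, once the rows and columns are grouped by the partition, the blocks coupling A11 and A22 are zero:

```python
# Sketch: after ordering the rows as (R_a | R_b | R_s) and the columns as
# (C_a | C_b | C_s), the blocks coupling A11 (R_a x C_a) and A22
# (R_b x C_b) must vanish, because H_a and H_b are disconnected once the
# separator S is removed. The 4x4 example is our own illustration.

def coupling_blocks_zero(A, Ra, Rb, Ca, Cb):
    a12_zero = all(A[i][j] == 0 for i in Ra for j in Cb)
    a21_zero = all(A[i][j] == 0 for i in Rb for j in Ca)
    return a12_zero and a21_zero

A = [[1, 0, 0, 2],   # row 0: nonzeros only in C_a and C_s
     [0, 3, 4, 0],   # row 1: nonzeros only in C_b and C_s
     [5, 6, 7, 0],   # rows 2-3: separator rows, unrestricted
     [8, 0, 0, 9]]
Ra, Rb, Rs = [0], [1], [2, 3]
Ca, Cb, Cs = [0], [1], [2, 3]
```

Here A11 and A22 can be handed to different processors, with only A33 and the border blocks coupling them.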
FIG. 1. Matrix A and its associated bipartite graph.
3 Constructing a Balanced Separator
In the automatic nested dissection algorithm [9], a level structure rooted at some pseudoperipheral node is generated. A level structure rooted at a pseudoperipheral node is likely to have many levels. A separator is then selected from a middle level such that the connected graph can be divided into two unconnected subgraphs. In our algorithm
the connected graph can be divided into two unconnected subgraphs. In our algorithm
we perform similar operations to obtain a rooted level structure at some pseudoperipheral
node. Let v be the pseudoperipheral node, and let the level structure rooted at v be

(3)    L_0, L_1, L_2, ..., L_h

where L_0 = {v} and L_{i+1} consists of all nodes w with (u, w) ∈ E for some u ∈ L_i that do not already appear in L_0, ..., L_i. The separator is selected from the middle level, m = ⌊(h+1)/2⌋. In our case, L_i ⊆ R or L_i ⊆ C, but not both, for each i. Since the objective is to find a balanced separator with r_s row nodes and c_s column nodes, we either select levels m and m + 1 and then prune the resulting separator (Method 1), or we select two nodes for L_0 (Method 2).
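The rooted level structure (3) is ordinary breadth-first layering; the sketch below (our own illustration, reusing the hypothetical ('r', i)/('c', j) node labels) shows the row/column alternation on a small bipartite graph:

```python
# Sketch: build the rooted level structure L_0, ..., L_h by breadth-first
# search. In a bipartite graph the levels alternate between row nodes and
# column nodes, as noted in the text.

def level_structure(adj, root):
    levels, seen = [[root]], {root}
    while True:
        nxt = []
        for u in levels[-1]:
            for w in adj.get(u, ()):
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        if not nxt:
            return levels
        levels.append(nxt)

# a path r0 - c0 - r1 - c1 (edges where a_00, a_10, a_11 are nonzero)
adj = {('r', 0): [('c', 0)],
       ('c', 0): [('r', 0), ('r', 1)],
       ('r', 1): [('c', 0), ('c', 1)],
       ('c', 1): [('r', 1)]}
L = level_structure(adj, ('r', 0))
```

Rooted at the row node r0, every level contains only rows or only columns, never both.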
3.1 Method 1
In this approach we calculate the degrees of all the nodes in level m and m + 1. Nodes
with degree 0 (which can only occur in level m) are discarded. A node with the highest
degree is selected for the separator and the degrees of its neighbors are reduced by one.
The selection continues until the degrees of all the nodes become zero. Nodes with zero
degree are removed from the separator. Ties between two nodes are broken according to
their positions in the list. We present the algorithm as follows:
Method 1:
{
    calculate the degree of every v ∈ L_m ∪ L_{m+1} and insert the nodes in the list B
    for i = 1, ..., |B|
        if (deg(b(i)) > 0) then
            S = S ∪ {b(i)}
            for all u ∈ L_m ∪ L_{m+1}
                deg(u) = deg(u) - 1 if (b(i), u) ∈ E
        endif
    endfor
}
This algorithm does not always produce a balanced separator. This happens, for
example, when the initial degrees of all the nodes are equal. The balance can be improved
by breaking ties depending on how many rows and columns are currently in the separator.
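A runnable sketch of Method 1 follows (our own reconstruction from the description above; the function and variable names are assumptions, and the greedy loop repeatedly picks the highest-degree remaining candidate):

```python
# Sketch of Method 1: repeatedly pick the remaining node of highest degree
# from levels m and m+1 into the separator S, decrementing the degrees of
# its neighbors, until every degree is zero; ties go to the node that
# appears first in the candidate list. Zero-degree nodes never enter S.

def method1_separator(adj, level_m, level_m1):
    cand = list(level_m) + list(level_m1)
    band = set(cand)
    deg = {v: sum(1 for w in adj.get(v, ()) if w in band) for v in cand}
    S = []
    while True:
        v = max(cand, key=lambda x: deg[x])   # first-in-list wins ties
        if deg[v] == 0:
            return S
        S.append(v)
        for w in adj.get(v, ()):
            if w in band and deg[w] > 0:
                deg[w] -= 1
        deg[v] = 0

# two row nodes in level m, one column node in level m+1, star-shaped
adj = {('r', 0): [('c', 0)], ('r', 1): [('c', 0)],
       ('c', 0): [('r', 0), ('r', 1)]}
S = method1_separator(adj, [('r', 0), ('r', 1)], [('c', 0)])
```

In the star example the single column node covers both edges, so it alone forms the separator.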
3.2 Method 2
In this approach we maintain symmetry, similar to [8]. If the pseudoperipheral node is row r_i (column c_i), we also consider column c_i (row r_i) as a pseudoperipheral node and generate level structures starting from both nodes (L_0 contains both nodes). The nodes for the separator are selected from level m.
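Method 2 can be sketched by seeding the breadth-first layering with both r_i and c_i (again our own reconstruction, with the hypothetical ('r', i)/('c', j) labeling):

```python
# Sketch of Method 2: put both the pseudoperipheral row node r_i and its
# matching column node c_i into L_0, so a level of the structure can
# contain row and column nodes together, and the separator drawn from
# level m can mix rows and columns.

def dual_root_levels(adj, i):
    levels = [[('r', i), ('c', i)]]
    seen = set(levels[0])
    while True:
        nxt = []
        for u in levels[-1]:
            for w in adj.get(u, ()):
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        if not nxt:
            return levels
        levels.append(nxt)

# edges for nonzeros a_00, a_01, a_11: r0-c0, r0-c1, r1-c1
adj = {('r', 0): [('c', 0), ('c', 1)],
       ('c', 0): [('r', 0)],
       ('c', 1): [('r', 0), ('r', 1)],
       ('r', 1): [('c', 1)]}
L = dual_root_levels(adj, 0)
```

Level L_0 contains one row node and one column node, in contrast with the single-root structure.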
4 Experimental Results
The partial unsymmetric nested dissection ordering was incorporated in the UMFPACK [1] package and evaluated using the matrices in Table 1. The matrix RDIST1 is a chemical engineering matrix [13]; the other matrices are from [5], covering electrical power flow (GEMAT11), fluid flow (LNS3937), and oil reservoir simulation (ORSREG1) problems.
In Fig. 2 we show the patterns of the original, permuted (after dissection), and factorized RDIST1 matrix. In Table 1, "Nz in LU(1)" and "Nz in LU(2)" represent the number of nonzeros in the LU factors after numeric factorization with the Method 1 and Method 2 orderings, respectively. We estimate the speedup obtained by partial unsymmetric nested dissection by dividing the total work by the work in the longest (weighted) path in the separator tree. This assumes one processor per node.
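The speedup estimate can be sketched as follows (a hypothetical tree representation of our own; `work` assigns a floating-point weight to each separator-tree node):

```python
# Sketch: estimated speedup = (total work in the separator tree) /
# (work on the heaviest root-to-leaf path), assuming one processor per
# tree node. The three-node tree below is our own toy example.

def speedup_estimate(work, children, root):
    def total(v):
        return work[v] + sum(total(c) for c in children.get(v, ()))
    def critical(v):
        kids = children.get(v, ())
        return work[v] + (max(critical(c) for c in kids) if kids else 0.0)
    return total(root) / critical(root)

# root separator S over two independent subtree factorizations
work = {'S': 2.0, 'A11': 5.0, 'A22': 5.0}
children = {'S': ['A11', 'A22']}
est = speedup_estimate(work, children, 'S')   # 12.0 / 7.0
```

The two children run concurrently, so only the heavier of the two contributes to the critical path.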
Three matrices produce less fill-in under Method 2. However, the speedup estimate does not show reasonable parallelism under this method. For the RDIST1 matrix, Method 1 achieves an estimated speedup of about 7, at the cost of higher fill-in.
FIG. 2. Structure of the original matrix, Method 1 ordering, and matrix after factorization.
5 Conclusions
We have found that an unsymmetric partial nested dissection approach has reasonable
parallel potential. The independent blocks can be factorized effectively within a distributed
memory environment. Further parallelism can be obtained by applying a parallel
factorization algorithm to each node in the separator tree. The method presented here
provides a "first-cut" coarse-grain parallelism that should improve the efficiency of the parallelism within the nodes of the separator tree.

TABLE 1
Test Unsymmetric Matrices

Matrix          GEMAT11   LNS3937   ORSREG1   RDIST1
Order              4929      3937      2205     4134
Nonzeros in A     33108     25407     14133    94408
Nz in LU(1)       39650    224977    148000        -
Nz in LU(2)       41961    156573    118518   108216
Speedup(1)          2.3       3.5       2.4      6.8
References
[1] T. Davis, Users' guide to the unsymmetric-pattern multifrontal package (UMFPACK), Tech. Report TR-93-020, CIS Dept., Univ. of Florida, Gainesville, FL, 1993.
[2] T. Davis, A combined unifrontal/multifrontal method for unsymmetric sparse matrices, in Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, Snowbird, UT, June 15-18, 1994.
[3] I. Duff, Design features of a frontal code for solving sparse unsymmetric linear systems out-of-core, SIAM J. Sci. Statist. Comput., 5 (1984), pp. 270-280.
[4] I. Duff, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
[5] I. Duff, R. G. Grimes, and J. G. Lewis, Users' guide for the Harwell-Boeing sparse matrix collection (Release I), Tech. Rep. TR/PA/92-86, Computer Science and Systems Division, Harwell Laboratory, Oxon, U.K., October 1992.
[6] S. Even, Graph Algorithms, Computer Science Press, Potomac, MD, 1979.
[7] M. Garey and D. Johnson, Computers and Intractability, W. H. Freeman and Company, New York, 1979.
[8] A. George, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10 (1973), pp. 345-363.
[9] A. George and J. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, Englewood Cliffs, NJ, 1981.
[10] J. W. H. Liu, A graph partitioning algorithm by node separators, ACM Trans. Math. Software, 15 (1989), pp. 198-219.
[11] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM J. Matrix Anal. Appl., 11 (1990), pp. 134-172.
[12] E. Ng, A comparison of some direct methods for solving sparse nonsymmetric linear systems, in Proceedings of the Fifth SIAM Conference on Applied Linear Algebra, J. G. Lewis, ed., Snowbird, UT, June 15-18, 1994.
[13] S. E. Zitney and M. A. Stadtherr, Supercomputing strategies for the design and analysis of complex separation systems, Ind. Eng. Chem. Res., 32 (1993), pp. 604-612.
