On the LU factorization of sequences of identically structured sparse matrices within a distributed memory environment

Material Information

Hadfield, Steven Michael
Place of Publication:
Gainesville, Fla.
Department of Computer and Information Science and Engineering, University of Florida


Subjects / Keywords:
Algorithms ( jstor )
Broadcasting industry ( jstor )
Directed acyclic graphs ( jstor )
Distributed memory ( jstor )
Factorization ( jstor )
Mathematical vectors ( jstor )
Matrices ( jstor )
Modeling ( jstor )
Permutations ( jstor )
Scheduling ( jstor )
Computer and Information Sciences thesis Ph.D
Dissertations, Academic -- UF -- Computer and Information Sciences
Parallel processing (Electronic computers) ( lcsh )
Sparse matrices -- Data processing ( lcsh )
bibliography ( marcgt )
theses ( marcgt )


This research effort studies the parallel LU factorization of sequences of identically structured sparse matrices using the unsymmetric-pattern, multifrontal method of Davis and Duff. Computational burden is reduced by using the computational structure (called an assembly directed acyclic graph: DAG) that results from analysis of the first matrix for the numerical factorization of the subsequent matrices. Execution time is reduced via exploitation of parallelism in the assembly DAG. The theoretical parallelism in the assembly DAG is investigated using both analysis and simulation. Achievable parallelism is evaluated using a simulation model based on the nCUBE 2 distributed memory multiprocessor. A fixed pivot order parallel LU factorization routine is implemented on the nCUBE 2 and evaluated. Key subissues within this implementation are task scheduling, subcube allocation, subcube assignment, and data assignments to reduce communications and overall execution time. The resulting implementation is shown to be competitive with conventional techniques and demonstrates significant parallel performance with excellent scalability. Of greatest significance is the theoretical development and implementation of a lost pivot recovery capability for the unsymmetric-pattern, multifrontal method. The capability incorporates a lost pivot avoidance strategy with both inter- and intra-frontal matrix pivot recovery mechanisms. The inter-frontal matrix pivot recovery mechanisms migrate lost pivots to subsequent frontal matrices and accommodate corresponding side effects using relationships represented in the assembly DAG. Intra-frontal matrix pivot recoveries are done via a pipelined partial dense factorization kernel that handles unsymmetric permutations across processors.
Evaluation of the lost pivot recovery capability using three matrix sequences from chemical engineering problems shows that significant execution time savings are afforded when executing on either a single processor or multiple processors.
General Note:
Thesis (Ph. D.)--University of Florida, 1994.
Includes bibliographical references (leaves 360-371).

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Steven Michael Hadfield. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
31732007 ( OCLC )
001972742 ( ALEPH )









First, I must express my sincere appreciation to my advisor, Dr. Timothy A. Davis,
for guiding, challenging, and encouraging me throughout this effort. His intelligence,
patience, and concern have helped me establish high academic and personal goals.
Also, I would like to thank my supervisory committee members, and especially Dr.
Theodore Johnson, for their support and many helpful suggestions. In addition, I
would like to express my appreciation to Professor Iain Duff of Harwell Laboratory,
U.K. for reviewing and encouraging key aspects of this effort and Dr. Steve Zitney
of Cray Research for providing the much needed matrix sequences for the evaluation
of lost pivot recovery.
In addition, I recognize the National Science Foundation for its funding of the
development of the unsymmetric-pattern multifrontal method (ASC-9111263,
DMS-9223088). Furthermore, I appreciate the financial and professional support of
the Air Force Institute of Technology.
Personally, I thank my parents for their continual support and encouragement
(and especially Mom for her many hours of prayer). I would like to thank my children:
Melissa, Andrew, and Christopher for the sacrifices they have had to endure during
this program. Yet most of all, I thank my wonderful wife, Marissa, for her constant
love and support. She makes it all worthwhile and helps me keep it all in perspective.


ACKNOWLEDGMENTS ............................. ii

LIST OF TABLES ................................. vi

LIST OF FIGURES ........................ ........ viii

ABSTRACT ..................................... xii


1 INTRODUCTION ............................. 1
1.1 Topic Statement ............................ 2
1.2 Overview. ............................... 2


2.1 LU Factorization .. .... ........... ......... 5
2.2 Algorithm Stability and Error Analysis ............... 6
2.3 Sparse Matrix Concepts ........... ....... .... 11
2.4 Multifrontal Methods ................... . . .. 19
2.5 Factorization Sequences of Matrices ....... . . . . 24
2.6 Parallel Matrix Computations ................ . . 28
2.7 Multiprocessor Scheduling .................. .... 32

3 IMPLEMENTATION PLATFORM ................ . . 37
3.1 Hardware Architecture .................. ... . 37
3.2 Software Environment ...... . . . . . ... . 38
3.3 Performance Characterization of the nCUBE 2 . . . .... 41

4.1 Unbounded Parallelism Models .................. 45
4.1.1 Assembly DAG Analysis .................. 46
4.1.2 Model Definitions ...................... 46
4.1.3 Matrices Analyzed .................. ... . 48
4.1.4 Assembly DAG Analysis Results . . . . . . 48
4.1.5 Conclusions and Observations . . . . . .... 51

4.2 Bounded Parallelism Models .................. ... 52
4.2.1 Initial Models ......................... 52
4.2.2 Trace-Driven Simulation .................. 53
4.2.3 Processor Allocations .................. .. 53
4.2.4 Scheduling Methods .................. ... 55
4.2.5 Simulation Results .................. ... . 55
4.2.6 Conclusions and Observations . . . . . . 58
4.3 Distributed Memory Multifrontal Performance Potential .... .59
4.3.1 Dense Matrix Factorization Routine . . . . .... 59
4.3.2 Distributed Memory Multifrontal Simulation Model . . 65
4.4 Conclusions .................. ........... 73


5.1 Basic Algorithms ......................... 78
5.1.1 P1 Basic Fan-Out Algorithm . . . . . ..... 79
5.1.2 P2 -Pipelined Fan-Out Algorithm . . . . . 79
5.2 Use of BLAS 1 Routines .................. .. .. 80
5.3 Inclusion of Row and Column Pivoting . . . . . ..... 81
5.3.1 P3 -Basic PDF with Pivot Recovery . . . . .... 82
5.3.2 P4 -Pipelined Partial Factoring with Pivot Recovery .. 84
5.4 Single Pivot Version ......................
5.5 Performance Achieved .................. ... . 87
5.5.1 Floating Point Operations ................. 87
5.5.2 Use of BLAS 1 Routines .................. 88
5.5.3 Basic versus Pipelined Performance . . . . . 91
5.5.4 Inclusion of Numerical Considerations . . . . .... 92
5.5.5 Single Pivot Performance ... . . . . . 92
5.5.6 Double Precision Results ... . . . . . ..... 93
5.6 Analytical Performance Predictions .... . . . . ..... 93
5.6.1 Component Times .................. .... 94
5.6.2 Aggregate Times ....................... 96

6.1 Host Preprocessing ...... . . . . . . 99
6.1.1 Assembly DAG Acquisition . . . . . ..... 101
6.1.2 Frontal Matrix Scheduling .... . . . . . .... 104
6.1.3 Data Assignments ................ .. .... 111
6.1.4 Launching the Parallel Factorization . . . . .... 113
6.2 Parallel Factorizations .................. . . .. 114
6.2.1 Schedules and Frontal Matrix Descriptions . . . ... 115
6.2.2 Refactorization Process ..... . . . . . .120
6.2.3 Distributed Triangular Solves . . . . . ..... 129
6.3 Performance Issues ........................ 135
6.3.1 Performance Measures ................ . . 137

6.3.2 Test Case Selection .. .......
6.3.3 Scheduling/Allocation Issues .
6.3.4 Assignment Issues .. ........
6.3.5 Parallelism within the Method . .
6.3.6 Competitiveness of the Method .
6.3.7 Execution Time Profiling ......
6.3.8 Memory Utilization and Scalability
6.4 Summary and Conclusions .. .......

7 LOST PIVOT RECOVERY .. . .............
7.1 Theoretical Development .. . ............
7.1.1 Definitions and Notation .. .........
7.1.2 Foundational Theorems .. ..........
7.1.3 Fill-In due to Recovery .. ..........
7.1.4 Impacts on Assembly DAG .. .........
7.1.5 Effects of Multiple Recoveries .. .......
7.1.6 Repeated Failures .. . ............
7.1.7 Ordering Recovery Resolutions ...
7.1.8 Communication Paths .. ..........
7.1.9 Consequences of Block Triangular Form ..
7.1.10 Summary of the Lost Pivot Recovery Theory .
7.2 Implementation Specifics .. . ............
7.2.1 Lost Pivot Recovery Synchronization .....
7.2.2 Host Preprocessing .. . ...........
7.2.3 Enhanced Task Processing .. ........
7.2.4 Recovery Structures .. ............
7.2.5 Recovery Handling .. . ...........
7.2.6 Recovery Creation and Forwarding .. ....
7.2.7 Sequential and Shared Memory Considerations
7.3 Performance Evaluation .. . ............
7.3.1 Performance Issues .. . ...........
7.3.2 Performance Measures .. ..........
7.3.3 Test Cases . . . . . . . . . .
7.3.4 Performance Results .. . ..........

8 CONCLUSION ............................... 203
8.1 Summary ................................ 203
8.2 Future Efforts ............................. 206

REFERENCES ............... .................

BIOGRAPHICAL SKETCH ............................


4-1 Test Matrix Descriptions .. ..........

4-2 Assembly DAG Characterizations .. ......

4-3 Unbounded Parallelism Speed-Up Results ..

4-4 Processor Set and Utilization Results .. ....

4-5 Required Processors Comparison ...

4-6 Empirical Factoring Speed-Ups ....

4-7 Rectangular Matrix Speed-Ups ....

4-8 Empirical Versus Analytical Times .. .....

4-9 Empirical Versus Analytical Speed-Ups .....

4-10 Sample P_Desired Calculations .. ........

4-11 Test Matrix Characteristics .....

4-12 Sequential Assembly Version Results Part 1 .

4-13 Sequential Assembly Version Results Part 2 .

4-14 Parallel Assembly Version Results Part 1 .

4-15 Parallel Assembly Version Results Part 2 .

5-1 PDF Mflops Achieved (Single Precision) ..............

5-2 PDF Mflops Achieved (Double Precision) ............. 90

5-3 P1 Predicted Versus Observed Performance ............ 96

5-4 P2 Predicted Versus Observed Performance ............ 97

5-5 P3 Predicted Versus Observed Performance ............ 97

5-6 P4 Predicted Versus Observed Performance ............ 98

5-7 P5 Predicted Versus Observed Performance ............ 98

6-1 Test Matrix Characteristics ...................... 139

6-2 Test Configurations ........................... 140

6-3 Percentage of Communication Reduction ............. 141

6-4 Competitiveness of Parallel Refactorization Code ....... 145

6-5 Sequential and Parallel Execution Profiles ............ 146

7-1 Edge Count Comparison ........................ 175

7-2 RDIST1 Parallel Execution Time Results (32 processors) .. 197

7-3 RDIST2 Parallel Execution Time Results (32 processors) .. 198

7-4 RDIST3A Parallel Execution Time Results (32 processors)



2-1 LU Factorization Algorithm ........... . . .... 6

2-2 Sample Matrix Structure .................. .... .. 12

2-3 Sample Digraph for Matrix A ............ . .. . .. 12

2-4 Sample Symmetric Matrix Structure ..... . . . . ..... 13

2-5 Sample Undirected Graph for Symmetric Matrix B . . . ... 13

2-6 Matrix Structure Before and After Elimination . . . . .... 13

2-7 Graph Reduction Due to Gauss Elimination . . . . . 14

2-8 Graph After Symmetric Permutation ..... . . . . ..... 15

2-9 Clique Storage Matrix Example ............... .. ... .. 17

2-10 Block Triangular Form Matrix ............ . .. . .. 18

2-11 Sample Sparse Matrix .................. ..... .. 21

2-12 Frontal Matrix for First Pivot ............ . .. . .. 21

2-13 Frontal Matrix for Second Pivot .................. ... .. 22

2-14 Possible Relationships between Frontal Matrices . . . . .... 22

2-15 Lost Pivot Example. .................. ........ .. 23

2-16 Basic Column Fan-Out LU Factorization . . . . . 29

2-17 Task Graph with Communication Costs . . . . . ..... 34

2-18 Gantt Chart of Schedule with Communication Costs . . . ... 34

3-1 3D Broadcast Example .................. ..... .. 39

4-1 Assembly DAG Abstraction for GEMAT11 . . . . ..... 50

4-2 Processor Allocation For Fine Grain Parallelism . . . . .... 54

4-3 Model 0 Speed-Up Results .................. ... .. 56

4-4 Model 0 Utilization Results ................ .... .. 57

4-5 Model 1 Speed-Up Results .................. ... .. 58

4-6 Model 1 Utilization Results ................ .... .. 59

4-7 Model 2 Original Speed-Ups ................ .... .. 60

4-8 Model 2 Vertical Partitioning Speed-Ups . . . . . . 61

4-9 Model 2 Original Utilizations .............. . . .. 62

4-10 Model 2 Vertical Partitioning Utilizations . . . . . . 63

4-11 Model 3 Speed-Up Results .............. . . .. 64

4-12 Model 3 Utilization Results ...... .......... . . .. 65

4-13 Model 4 Original Speed-Ups ................ .... .. 66

4-14 Model 4 Vertical Partitioning Speed-Ups . . . . . ..... 67

4-15 Model 4 Original Utilizations ................... . .. .. 68

4-16 Model 4 Vertical Partitioning Utilizations . . . . . . 69

4-17 Distributed Memory Fan-Out Factorization . . . . . . 70

4-18 Time Line for Partial Dense Factor Routine . . . . . 71

4-19 Scheduling Decision Diagram ................... . .. .. 74

5-1 P1 Basic Fan-Out Algorithm ................ . .. 79

5-2 P2 -Pipelined Fan-Out Algorithm .................. .. 80

5-3 P3 -Basic PDF with Pivot Recovery ..... . . . . . 83

5-4 P3 Subroutine: ATTEMPT_LOCAL_PIVOT . . . . . ... 84

5-5 P3 Subroutine: LOOK_FOR_ALTPIVOTS . . . . . . ..

5-6 P3 Subroutine: RESOLVE_NEW_PIVOT_OWNER . . . ... .

5-7 P4 Pivot Recovery Timeline ................ .... .. 87

5-8 P4 -Pipelined PDF with Pivot Recovery . . . . . ..... 88


5-10 P4 Subroutine: NONPIVOT_OWNER_RECOVERY . . . ... 90

5-11 P4 Subroutine: PIVOT_OWNER_RECOVERY . . . . ... 91

5-12 nCUBE 2 Peak and BLAS Performance ... . . . . . 91

6-1 Edge Refinement Example .................. .... .. 103

6-2 Overlap Determination Algorithm ...... . . . . .. 107

6-3 Parallel Refactorization Algorithm .... . . . . .. 121

6-4 Assemble Contributions Algorithm ..... . . . . . 125

6-5 Contribution Forwarding... ............ . . . ..126

6-6 Block Upper Triangular Matrix Form Example . . . . .... 131

6-7 Dependency Chart For A Lower Triangular Solve . . . .... 134

6-8 Dependency Chart For A Sparse Lower Triangular Solve ...... 135

6-9 Task Graph For A Sparse Lower Triangular Solve . . . .... 136

6-10 Parallel Refactorization Speed-Ups .... . . . . . 144

6-11 Memory Requirements and Scalability ................ ..147

7-1 Sparse Matrix With Lost Pivot .................. .. ..153

7-2 Sample Assembly DAG .................. ....... 153

7-3 Frontal Matrices Augmented by Lost Pivot Recovery . . . ... 154

7-4 Multi-Level List Structure Prior to Combining . . . . . ... 181

7-5 Multi-Level List Structure After Combining . . . . ..... 182

7-6 Lost Pivot Recovery Overhead .................. . ..194

7-7 Sequential Execution Time for RDIST1 Sequence . . . . ... 195

7-8 Sequential Execution Time for RDIST2 Sequence . . . . ... 196

7-9 Sequential Execution Time for RDIST3A Sequence . . . ... 197

7-10 RDIST3A Speed-Ups With and Without Lost Pivot Recovery (LPR) 200

7-11 Scalability Without Lost Pivot Recovery RDIST3A . . . ... 201

7-12 Scalability With Lost Pivot Recovery RDIST3A . . . . ... 202

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Steven Michael Hadfield

April 1994

Chairman: Timothy A. Davis
Major Department: Computer and Information Sciences

This research effort studies the parallel LU factorization of sequences of identically
structured sparse matrices using the unsymmetric-pattern, multifrontal method of
Davis and Duff. Computational burden is reduced by using the computational
structure (called an assembly directed acyclic graph: DAG) that results from
analysis of the first matrix for the numerical factorization of the subsequent
matrices. Execution time is reduced via exploitation of parallelism in the assembly
DAG. The theoretical parallelism in the assembly DAG is investigated using both
analysis and simulation. Achievable parallelism is evaluated using a simulation
model based on the nCUBE 2 distributed memory multiprocessor.
A fixed pivot order parallel LU factorization routine is implemented on the nCUBE 2
and evaluated. Key subissues within this implementation are task scheduling,
subcube allocation, subcube assignment, and data assignments to reduce
communications and overall execution time. The resulting implementation is shown
to be competitive with conventional techniques and demonstrates significant
parallel performance with excellent scalability.
Of greatest significance is the theoretical development and implementation of a
lost pivot recovery capability for the unsymmetric-pattern, multifrontal method.
The capability incorporates a lost pivot avoidance strategy with both inter- and
intra-frontal matrix pivot recovery mechanisms. The inter-frontal matrix pivot
recovery mechanisms migrate lost pivots to subsequent frontal matrices and
accommodate corresponding side effects using relationships represented in the
assembly DAG. Intra-frontal matrix pivot recoveries are done via a pipelined partial
dense factorization kernel that handles unsymmetric permutations across
processors. Evaluation of the lost pivot recovery capability using three matrix
sequences from chemical engineering problems shows that significant execution
time savings are afforded when executing on either a single processor or multiple
processors.


With the advent and refinement of practical parallel processing platforms, there
has been a flurry of research in concurrent algorithms that can efficiently utilize
these resources. A primary emphasis of this research has been numerical algorithms,
in general, and, in particular, algorithms for the direct solution (as opposed to it-
erative approximation) of systems of linear equations. Extensive research has been
done on solving such systems when the coefficient matrix is dense. Such matrices
are frequently encountered and have very regular structure that can be exploited in
algorithm development.
When the coefficient matrix is sparse, considerably more parallelism may be
available, although it is more difficult to exploit. For sparse, symmetric, positive
definite matrices, elimination tree structures can be derived to expose parallelism in
the computations. Much of the parallel sparse matrix research focus to date has been on
such matrices due to their frequent occurrence, the exposure of parallelism in their
elimination tree, and the guarantee of numerically stable pivots on the diagonal.
Parallel algorithms for solving more general linear systems are more difficult as
they are characterized by sparse matrices where the nonzero structure can be highly
unsymmetric and the diagonal entries may not be numerically acceptable as pivots.
Hence, parallelism is irregular and numerical pivoting considerations become critical.
The most common direct technique for solving such systems is to factorize the
coefficient matrix into the product of a lower triangular matrix, L, and an upper tri-
angular matrix, U. Triangular solves (forward and backward substitutions) can then
be used to solve the system. One approach to LU factorization that holds signifi-
cant potential for parallel implementation is the unsymmetric-pattern multifrontal
method of Davis and Duff [34, 35].
Like most sparse matrix factorization algorithms, the unsymmetric-pattern mul-
tifrontal method has two principal operations, analyze and factorize. The analyze
operation selects matrix entries to act as pivots for the numerical factorization with
objectives of reducing computations and maintaining numerical stability. The fac-
torize operation then uses the selected pivots to perform the actual numerical fac-
torization. Furthermore, there are many sparse matrix problems that require the
factorization of sequences of identically structured sparse matrices. Such sequences
are common to solution techniques for the systems of differential algebraic equations
used in many application areas including circuit simulations, chemical engineering,
magnetic resonance spectroscopy, and air pollution modeling [105, 144, 143, 146, 37,
82, 49, 18, 97, 108, 131, 43, 50]. When factorizing these sequences of identically
structured sparse matrices, the computational burden can be eased by only doing the
analyze operation once for the entire sequence and using the selected pivots for the
subsequent factorization of each matrix in the sequence.
1.1 Topic Statement
The focus of this research effort is to study and implement parallel LU factoriza-
tion algorithms based on an unsymmetric-pattern multifrontal approach and targeted
for a distributed memory environment. Specifically, the parallel algorithms developed
will perform the numerical factorization of sequences of identically structured sparse
matrices using a directed acyclic graph (DAG) structure produced by a sequential im-
plementation of Davis and Duff's Unsymmetric Multifrontal Package (UMFPACK)
[34, 35, 32]. This DAG structure, known as the assembly DAG, defines the necessary
computations in terms of the partial factorizations of small, dense submatrices, which
are known as frontal matrices and represented by nodes in the assembly DAG. The
edges of the assembly DAG define the necessary data communication between frontal
matrices, as well as the corresponding precedence relation. Thus, the assembly DAG
can also serve as a task graph for the parallel computations.
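The assembly DAG's role as a task graph can be sketched with a small, invented example: nodes stand for frontal matrix tasks, edges for contribution-block dependencies, and a level-by-level topological pass groups fronts that are free to factor concurrently. The node numbering and edge set here are purely illustrative, not taken from the thesis:

```python
# Hypothetical assembly DAG: front i -> fronts that consume its contributions.
from collections import defaultdict

edges = {1: [3], 2: [3], 3: [4], 5: [4]}
nodes = {1, 2, 3, 4, 5}

indeg = defaultdict(int)
for u, vs in edges.items():
    for v in vs:
        indeg[v] += 1

levels, ready = [], sorted(n for n in nodes if indeg[n] == 0)
while ready:
    levels.append(ready)                 # fronts in one level are independent
    nxt = []
    for u in ready:
        for v in edges.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:            # all contributions now available
                nxt.append(v)
    ready = sorted(nxt)
```

For this toy DAG the grouping is [[1, 2, 5], [3], [4]]: fronts 1, 2, and 5 could be partially factored in parallel before front 3, which in turn precedes front 4.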
The first objective is to determine how much parallelism is available and ex-
ploitable within the computational structure defined by the assembly DAG. Of inter-
est within this objective is how much theoretical parallelism is inherent in the method
itself and how much of this theoretical parallelism can be achieved on contemporary
parallel architectures. The theoretical and practical parallelism will be investigated
by analysis and simulation. Then the practical parallelism will be scrutinized further
by implementing a parallel, distributed memory multifrontal refactorization algo-
rithm and empirically measuring its achieved parallelism.
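The gap between theoretical and achievable parallelism can be illustrated with a standard work/span calculation on a toy task graph (task costs and dependencies are invented for the sketch): total work bounds single-processor time, while the critical path bounds parallel time under a PRAM-style model with unbounded processors.

```python
# Toy task graph: execution cost per task and predecessor lists.
cost = {1: 4.0, 2: 2.0, 3: 3.0, 4: 1.0}
preds = {3: [1, 2], 4: [3]}

finish = {}
for t in sorted(cost):                   # ids happen to be in topological order
    start = max((finish[p] for p in preds.get(t, [])), default=0.0)
    finish[t] = start + cost[t]

work = sum(cost.values())                # time on one processor
span = max(finish.values())              # critical-path (best parallel) time
speedup = work / span                    # theoretical speed-up bound
```

Here work is 10.0 and the critical path 1 -> 3 -> 4 has length 8.0, so no schedule on any number of processors can exceed a speed-up of 1.25 for this graph.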
When factorizing a sequence of identically structured sparse matrices where ad-
ditional properties such as symmetry, positive-definiteness, or diagonal dominance
are not assumed, the changing numerical values of the various entries can cause the
anticipated pivots to no longer be acceptable numerically. The capability to recover
from these lost pivots and complete the factorization without having to recompute a
new assembly DAG could significantly reduce the computation time required for fac-
torization of the sequence of matrices. Therefore, the second objective is to develop
a lost pivot recovery capability within the context of an unsymmetric-pattern multi-
frontal method. The actual implementation of the lost pivot recovery capability will
be based on the parallel, distributed memory multifrontal algorithm from the first
objective. Thus, both the sequential and parallel performance of lost pivot recovery
may be assessed.
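A common form of the numerical acceptability test behind "lost" pivots is a threshold partial pivoting check: a preselected pivot is rejected when its magnitude falls below a fraction tau of the largest entry in its column. The function name and threshold value below are illustrative, not the thesis's actual setting:

```python
def pivot_is_lost(col, k, tau=0.1):
    """col: candidate pivot column (entries k..n-1); pivot sits at position k.

    Returns True when the anticipated pivot fails the threshold test and an
    alternative pivot (or inter-frontal recovery) would be needed.
    """
    largest = max(abs(v) for v in col[k:])
    return abs(col[k]) < tau * largest

# The anticipated pivot 0.001 is dwarfed by the 5.0 further down the column.
lost = pivot_is_lost([0.001, 2.0, 5.0], 0)
```

With tau = 0.1 the test flags the 0.001 pivot as lost, while a column led by its largest entry passes unchanged.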
1.2 Overview
Chapters 2 through 7 of this document report the results of this research effort.
Chapter 8 reviews the critical results and discusses possible future efforts.
In Chapter 2, pertinent background and the results of related efforts are presented.
The LU factorization process and its stability and error bounds are described. A dis-
cussion follows of key sparse matrix concepts such as fill-in, the analyze and factorize
phases, the relationship of sparse matrices to graph theory, scatter and gather opera-
tions, and block triangular form. Next, the key concepts of the multifrontal approach
to LU factorization are illustrated. A mathematical framework is then shown for
how sequences of identically structured sparse matrices arise in various solution tech-
niques with references to specific applications provided. Some methods for parallel
matrix computations are summarized for both dense matrices (such as the frontal
matrices of a multifrontal method) as well as sparse matrices. Included in this dis-
cussion will be key results regarding parallelism from earlier efforts that were based
on a symmetric-pattern, multifrontal method. Finally, some scheduling techniques
for distributed memory parallel processing are summarized.
Chapter 3 describes the nCUBE 2 distributed memory parallel processor on which
the algorithms are implemented. First, the key hardware features are presented fol-
lowed by an overview of the software environment. Then the results of an empirical
characterization of the nCUBE 2's performance are presented. This characterization
will aid in assessing achievable parallelism and in making later design and implemen-
tation decisions.
The key results of this research effort begin in Chapter 4, which describes how
the theoretical and achievable parallelism of the unsymmetric-pattern multifrontal
method are assessed using both analysis and simulation. First, analytical models
are used to determine how much parallelism is available within the computational
structure provided by the assembly DAG with the various models corresponding to
different sources of parallelism. These analytical models all assume an unbounded
number of processors and an underlying parallel random access memory (PRAM)
model and thus provide an assessment of theoretical parallelism. Next an evaluation
of how much parallelism is achievable on bounded processor sets is done using simu-
lation techniques. Finally, performance characteristics of the nCUBE 2 are used to
extend the simulation models to represent a realistic distributed memory implementation.

With the understanding of the parallel characteristics of the unsymmetric-pattern
multifrontal method gained from the analytical and simulation models, Chapter 5 be-
gins the actual implementation. As the required computations are dominated by the
partial factorization of frontal matrices, this chapter is devoted to the development
of high performance parallel partial dense factorization kernels. Five distinct rou-
tines are developed and tested. The first two confine themselves to the given pivot
ordering and do not check the numerical acceptability of the pivots. These routines
are applicable to matrices that are known to be diagonally dominant, and thus the
diagonal entries are assured to be acceptable as pivots. The second of these routines
uses pipelining to overlap required communication with computation. The next two
partial dense factorization kernels address numerical pivoting with alternative pivot
selection done within the frontal matrix's pivot block. This can be difficult as the
columns of the pivot block may be distributed across several processors. The second
of these routines is pipelined to improve performance. The fifth kernel is designed
to handle frontal matrices that have only a single potential pivot. Since single pivot
frontal matrices occur frequently and do not require all the additional logic of the
more general earlier routines, there is a performance advantage to be gained by a
single pivot partial dense factorization routine. The performance of these kernels is
assessed, and analytical models of their execution times based on various processor
sets are defined. These models will be useful in scheduling frontal matrix tasks.
With the partial dense factorization kernels available, Chapter 6 discusses the im-
plementation of a full parallel refactorization capability based on the unsymmetric-
pattern multifrontal method. Of the most critical issues to be addressed in this im-
plementation are the scheduling and allocation of processors to frontal matrix tasks.
Two distinct scheduling methods are developed. One focuses on minimizing required
communication by overlapping the assignment of frontal matrix tasks to processor
sets. The other method tries to reduce overall execution time by more sophisticated
management of the hypercube topology with communication reduction a secondary
concern. Allocation of processor sets to frontal matrix tasks is done using a propor-
tional allocation scheme based on the predicted execution times of tasks and the tasks
available to execute. A number of mechanisms are also developed for the distribution
of frontal matrix columns to processors in hopes of further reducing required commu-
nication costs. All of the scheduling, allocation, and assignment activities are done as
part of a sequential software interface that initiates and controls the parallel refactor-
ization. The details of the parallel refactorization code are also discussed and include
specifics on how frontal matrices are represented, how original matrix entries and
contributions between frontal matrices are handled, and how frontal matrix tasks are
synchronized. A distributed forward and back solve capability is also implemented
and takes advantage of the distributed storage of the LU factors that results from the
parallel refactorization. With the implementation complete, a detailed performance
evaluation is accomplished. Both the achieved parallelism and competitiveness of the
implementation are assessed. The effectiveness of the various scheduling, allocation,
and assignment mechanisms is also addressed.
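The proportional allocation idea can be sketched as follows, with invented task names and predicted times; a real hypercube implementation would further round each share down to a subcube size (a power of two), which this sketch omits:

```python
def allocate(predicted, nprocs):
    """Split nprocs among ready tasks in proportion to predicted run time."""
    total = sum(predicted.values())
    # Every ready task gets at least one processor.
    alloc = {t: max(1, int(nprocs * w / total)) for t, w in predicted.items()}
    # Trim the largest shares if rounding oversubscribed the machine.
    while sum(alloc.values()) > nprocs:
        big = max(alloc, key=alloc.get)
        alloc[big] -= 1
    return alloc

# Three hypothetical frontal matrix tasks competing for an 8-processor cube.
shares = allocate({"F1": 6.0, "F2": 3.0, "F3": 1.0}, 8)
```

For the toy inputs above, the longest task F1 receives 4 processors, F2 receives 2, and F3 receives the minimum of 1, leaving one processor idle to absorb rounding.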
Perhaps the most significant of the results from this research effort is development
of a robust lost pivot recovery capability for the unsymmetric-pattern multifrontal
approach to LU factorization. Chapter 7 describes this capability. First an extensive
theoretical development of the lost pivot recovery capability is done. This is followed
by a description of the implementation that was done by extending the distributed
memory parallel refactorization software of Chapter 6. Finally, a performance eval-
uation addresses the sequential effectiveness of the lost pivot recovery capability, its
parallel execution time and speed-up characteristics, and its memory requirements
and scalability.


Prerequisite to pursuing the study and development of a parallel unsymmetric-
pattern multifrontal method is the understanding of a number of key concepts related
to linear algebra and sparse matrices. Specifically, the process of LU factorization
and concepts of algorithm stability and error analysis are necessary. Furthermore,
concepts and techniques for sparse matrices must be defined and described as they
will be frequently referenced.
In addition to this general background, the specifics of the unsymmetric-pattern
multifrontal method of Davis and Duff [34, 35] are necessary as this entire effort
is based on that method. Mathematical techniques that give rise to sequences of
identically structured sparse matrices together with specific applications justify the
need to be able to factorize such matrix sequences.
Finally, a summary of previous results in both parallel matrix computations and
multiprocessor scheduling helps to define realistic goals for this effort and establish its
context.
2.1 LU Factorization
The most common method for solving general systems of linear equations in the
form Ax = b (where A is an n x n matrix of full rank and x and b are vectors of length
n) is LU factorization which is based on Gauss Elimination [48]. In this method, the
matrix A is factorized into A = LU where L is unit lower triangular and U is upper
triangular. The problem Ax = b then becomes

LUx =b

and can be solved with two subsequent triangular solves. The first intermediate solve
is the forward substitution
Ly = b.
The result, y, is then used in the second triangular solve that is a back substitution

Ux = y.

In the actual factorization process, A can be overwritten with the factors LU
with the unit diagonal of L stored implicitly. The actual process proceeds down
the diagonal of the matrix using the diagonal entries as pivots. Row and sometimes
column interchanges may be required to replace zero entries on the diagonal, and
similar permutations are common in order to preserve the numerical stability of
the algorithm. An illustrative version of the basic LU factorization algorithm is

presented in Figure 2-1 that assumes good pivots on the diagonal. The version of
Gauss Elimination presented in Figure 2-1 is based on the most common ordering
of the nested loops (i.e., the kij ordering). However, alternative orderings are also
possible that correspond to all six of the possible permutations of the indices k, i,
and j [78].

for k := 1 to n do
    for j := k to n do
        U(k,j) := A(k,j)
    for i := k+1 to n do
        L(i,k) := A(i,k)/A(k,k)
        for j := k+1 to n do
            A(i,j) := A(i,j) - L(i,k)*U(k,j)

Figure 2-1. LU Factorization Algorithm
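The kij algorithm of Figure 2-1 can be transcribed directly into a short routine. This is a sketch for exposition only (dense storage, no pivoting); it is not the sparse code developed in this work, and the function name is illustrative.

```python
def lu_factorize(A):
    """In-place kij LU factorization without pivoting: on return the strict
    lower triangle of A holds L (its unit diagonal stored implicitly) and
    the upper triangle holds U, mirroring the algorithm of Figure 2-1."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                # L(i,k) := A(i,k)/A(k,k)
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]  # rank-1 update of the active submatrix
    return A
```

For A = [[4, 3], [6, 3]] this yields L(2,1) = 1.5 and U = [[4, 3], [0, -1.5]], and multiplying L by U recovers A.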

When row and column permutations are introduced into the algorithm, the fac-
torization becomes

PAQ = LU

where P provides the row permutations and Q provides the column permutations
[48]. The equation Ax = b then becomes

LU(Q^T x) = Pb

with the following sequence of a matrix multiply, two triangular solves, and another
matrix multiply required to complete the solution based on the factorization: c = Pb,
Ly = c, Uw = y, x = Qw. As P and Q are permutation matrices, they can be stored
in O(n) storage and the required multiplications can also be done in O(n) time.
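The full solution sequence c = Pb, Ly = c, Uw = y, x = Qw can be sketched with P and Q held as index vectors so that each permutation costs O(n). The vector convention used here (p[i] names the row of A placed in position i of PA, and q[j] the column of A placed in position j of AQ) is an assumption for illustration, as is the function name.

```python
def solve_permuted(L, U, p, q, b):
    """Solve Ax = b given PAQ = LU, with P and Q stored as index vectors.
    L is unit lower triangular (its unit diagonal is never read) and U is
    upper triangular.  Each permutation is applied in O(n) time."""
    n = len(b)
    c = [b[p[i]] for i in range(n)]           # c = Pb
    y = [0.0] * n
    for i in range(n):                        # forward substitution Ly = c
        y[i] = c[i] - sum(L[i][j] * y[j] for j in range(i))
    w = [0.0] * n
    for i in reversed(range(n)):              # back substitution Uw = y
        w[i] = (y[i] - sum(U[i][j] * w[j] for j in range(i + 1, n))) / U[i][i]
    x = [0.0] * n
    for j in range(n):                        # x = Qw
        x[q[j]] = w[j]
    return x
```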
2.2 Algorithm Stability and Error Analysis
There are two types of error analysis typically used. The first, forward error anal-
ysis, provides bounds directly on the relative error between the computed and actual
solutions. Such a technique is typically overly pessimistic and does not distinguish
between error sources. Thus, this type of analysis is not as useful as the second type,
which is backwards error analysis.
The concept of backwards error analysis considers the computed solution as the
exact solution to a "nearby" problem. Analysis of the round off computing error
associated with the algorithm in use provides bounds on the distance between the
original problem and the nearby version for which the true solution has been com-
puted. The bounds on this distance provide a measure of the algorithm's stability.

Perturbation theory is then used to relate the distance between the two problems to
the distance between their respective true solutions. While the computing round off
error is dependent upon the algorithm in use and is, to a certain extent, controllable
by algorithmic alterations, the perturbations are strictly a function of the particular
problem instance and cannot be controlled. Hence, backwards error analysis provides
a means to distinguish between controllable and uncontrollable error sources.
Furthermore, backwards error analysis can be used in an a posteriori fashion. Here
the actual deviations between the solved and original problems can be measured (not
just bounded). These more accurate results are then combined with the perturbation
theory to provide a refined error bound on the solution. Thus, backwards error
analysis provides both the distinction of error sources and a refined error bound.
This section addresses both aspects of backwards error analysis. Computing round
off errors and algorithm stability for LU factorization are first discussed. Means of
insuring stability are emphasized. Then, perturbation theory is addressed. The
effects of problem conditioning are discussed with the condition number defined as a
means to quantify these effects. The integration of stability and conditioning is used
to develop an overall error bound for LU factorization.
Computing Round Off Errors: The two primary sources of computing errors are
the representation of real numbers by finite digit approximations and the correspond-
ing loss of significant digits during finite digit arithmetic. When real numbers are
stored and used in a computer, they are typically approximated by a floating point
value represented in a finite bit storage cell and a bounded amount of relative error
is introduced. Consider the real value represented by a. The corresponding floating
point value would be
fl(a) = a(1 + ε), |ε| ≤ μ

where ε is the relative error introduced by the finite bit representation, which is
bounded by the machine precision, μ, of the particular processor [78].
In a similar fashion, finite arithmetic computations introduce additional errors
as results of the computations are subject to the same finite storage restrictions.
Particularly vulnerable to this effect is addition (subtraction), as order of magnitude
differences (similarities) in the values of the two operands cause increasing losses in
significant digits.
Combining the effects of inaccuracies due to floating point representations and
computations, we find a cumulative effect as illustrated by

fl(a + b) = (fl(a) + fl(b))(1 + ε1) = (a(1 + ε2) + b(1 + ε3))(1 + ε1)

where ε1, ε2, and ε3 are each bounded in magnitude by the machine precision μ [78].
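The cumulative effect is easy to observe. The following sketch uses IEEE-754 double precision (unit round-off 2^-53, an assumption about the host arithmetic): the sum of the stored approximations of 0.1 and 0.2 is not the stored approximation of 0.3, yet the relative error remains a small multiple of the unit round-off.

```python
# fl(0.1) and fl(0.2) already carry representation error, and the addition
# introduces one more rounding; each relative error is bounded by mu.
mu = 2.0 ** -53               # unit round-off for IEEE-754 double precision

s = 0.1 + 0.2                 # computes fl(fl(0.1) + fl(0.2))
exact_mismatch = (s != 0.3)   # the accumulated roundings do not cancel
rel_err = abs(s - 0.3) / 0.3  # yet the relative error is only a few mu
```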
Algorithm Stability: Algorithm stability (or instability) is a measure of the ef-
fects of computing round off errors on the computation sequence dictated by the
algorithm. The magnitude of this effect is measured by the difference between the
factorized representation of the problem produced by the Gaussian Elimination algo-
rithm (LU) and the true problem (A). This difference is accounted for in the matrix

term H where
LU = A + H.
Wilkinson [139] showed that the entrywise deviations accounted for by H in the
absence of pivoting are bounded by

|h_ij| ≤ 5.01 n ρ μ, where ρ = max_{i,j,k} |a_ij^(k)|,

for each h_ij ∈ H, where a_ij^(k) refers to the value of a_ij after the kth step of factorization.
Extensive growth in the a_ij^(k) values can thus have catastrophic effects.
Pivoting for Stability: Partial Pivoting is the typical mechanism used to control
growth of the pivots. At each step, k, a new pivot is chosen from the kth column of
the active submatrix such that

|a_kk^(k)| ≥ |a_ik^(k)|, i > k.

With this approach the growth represented by ρ is limited to

ρ ≤ 2^(n-1) max_{i,j} |a_ij|.

While this is not a particularly good bound, in practice the results are consistently
much better [139].
A stronger alternative to partial pivoting is complete pivoting. Complete pivoting
chooses the kth pivot as the largest entry in the entire active submatrix A(k). In
doing so the growth is limited to

ρ ≤ f(n) max_{i,j} |a_ij|

where f(n) is a nearly linear function [138].
A problem with both partial and complete pivoting is that the rigid pivot selection
rules can result in extensive fill-in (fill-in is the changing of zero entries into nonzero
entries) when dealing with sparse matrices. Threshold pivoting provides an alternative
that allows greater flexibility in pivot selection to aid in the preservation of sparsity
[48]. Pivot selection in threshold pivoting requires the selected pivot to meet

|a_kk^(k)| ≥ u |a_ik^(k)|, i > k, 0 < u ≤ 1.

Here u is a parameter that is typically set between 0.001 and 0.1. When threshold
pivoting is employed, the growth bounding function becomes

ρ ≤ (1 + u^(-1))^(n-1) max_{i,j} |a_ij|.
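The threshold test for one column of the active submatrix can be sketched as follows (the function name and list-based column are illustrative only). Every returned row index is numerically acceptable as a pivot, so a sparsity criterion is free to choose among them; setting u = 1.0 degenerates to partial pivoting, where only the largest entries qualify.

```python
def threshold_candidates(col, u):
    """Return the row indices i of a column of the active submatrix whose
    entries pass the threshold test |a(i,k)| >= u * max_i |a(i,k)|."""
    biggest = max(abs(v) for v in col)
    return [i for i, v in enumerate(col) if abs(v) >= u * biggest]
```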

When LU factorization is applied to a matrix A and no zero pivots are encountered
[78], the bound on H in LU = A + H is

|H| ≤ 3(n − 1) μ [|A| + |L||U|] + O(μ^2).

Here |M| denotes the matrix of the absolute values of the entries of M, and the
inequality holds entrywise.
Similar analysis for a triangular solve reveals that the computed solution, y, to
the triangular solve problem Ly = b is the exact solution of

(L + F)y = b

where |F| ≤ n μ |L| + O(μ^2).
Combining the LU factorization results with the results of the two required tri-
angular solves reveals that the computed solution, x, to the problem Ax = b is the
exact solution to some problem

(A + E)x = b

for some E satisfying

|E| ≤ n μ (3|A| + 5|L||U|) + O(μ^2).

When partial pivoting is included, the entries of L can be bounded by 1 (so that
‖L‖∞ ≤ n) and the bound on E in the infinity norm becomes

‖E‖∞ ≤ n μ (3‖A‖∞ + 5n‖U‖∞) + O(μ^2).

This bound for ‖E‖∞ provides the measure of the distance of the problem for
which x is the exact solution from the original problem. This measure will be com-
bined with the perturbation theory to provide a bound on the relative error of the
computed solution.
Ill Conditioning: At the heart of perturbation theory is the concept of how the
distance between two nearby problems relates to the distance between their solutions.
When the distance between solutions is great relative to the distance between the
problems, the problems are said to be ill conditioned. Distance here is measured in an
appropriate matrix or vector norm. The conditioning of a problem can be quantified
by a value known as the condition number. Analysis can then utilize this condition
number to bound the relative error in the solution space based on the relative error
in the problem space (distance between original problem and the problem exactly
solved by the computed solution).
The condition number can be defined in terms of matrix norms as

κ(A) = ‖A‖_p ‖A^(-1)‖_p

where p designates the specific norm in use [48]. An alternative, but equivalent,
formulation based on singular values is

κ2(A) = σ1/σn

where σ1 is the largest singular value and σn the smallest [78]. This latter formulation
illustrates that ill conditioning can be viewed as a measure of nearness to singularity,
as σn will be zero for a singular n x n matrix, and a relatively small value of σn (as
compared to σ1) will produce a large condition number.
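As a toy illustration of the norm-based definition, the infinity-norm condition number of a 2x2 matrix can be computed directly from κ(A) = ‖A‖ ‖A^(-1)‖. The function name is hypothetical, and production codes estimate ‖A^(-1)‖ without ever forming the inverse.

```python
def cond_inf_2x2(A):
    """Infinity-norm condition number of a 2x2 matrix via the explicit
    inverse (illustration only; not how condition numbers are estimated
    in practice)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    norm = lambda M: max(abs(M[0][0]) + abs(M[0][1]),
                         abs(M[1][0]) + abs(M[1][1]))  # max row sum
    return norm(A) * norm(inv)
```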
Analysis of how conditioning affects the relative error in the computed solution
produces the following relationship between the original problem Ax = b and the
one exactly solved by the computed solution, (A + E)x̂ = b. Specifically, the relative
distance between the true (x) and computed (x̂) solutions is bounded by

‖x̂ − x‖∞ / ‖x‖∞ ≤ (κ∞(A) / (1 − r)) (‖E‖∞ / ‖A‖∞)

where r = ‖E‖∞ ‖A^(-1)‖∞ < 1 [78].
Ill conditioning arising from the modeling process may be addressed by analyzing
whether the physical process has the same sensitivities as the model; if it does not,
the problem sources in the model should be identified and the model altered.
Overall Error Bound: The overall error bound on the relative error of the solu-
tion is obtained by combining the round off error analysis with the perturbation
theory. Recall that the computed solution, x̂, solves a nearby problem (A + E)x̂ = b
where

‖E‖∞ ≤ n μ (3‖A‖∞ + 5n‖U‖∞) + O(μ^2).

Also, the condition of the problem affects the relative error in the solution by

‖x̂ − x‖∞ / ‖x‖∞ ≤ (κ∞(A) / (1 − r)) (‖E‖∞ / ‖A‖∞)

where r = ‖E‖∞ ‖A^(-1)‖∞ < 1. Combining these two results, we obtain the overall
error bound of

‖x̂ − x‖∞ / ‖x‖∞ ≤ (κ∞(A) / (1 − r)) n μ (3 + 5n ‖U‖∞ / ‖A‖∞) + O(μ^2).
Measuring Stability: A large relative error in the solution of the equation Ax = b
can be the result of either an unstable algorithm or a poorly conditioned problem or
both. While the condition of the problem is typically not controllable, the stability
of the algorithm is, and a large relative error would raise the question of whether or
not it can be reduced by a more stable algorithm. Thus, there is a need to measure
the stability of an algorithm. The relative residual, as defined below, provides such
a measure [84].
relative residual = ‖Ax̂ − b‖ / ‖b‖

In this equation, x̂ represents the computed solution. To understand how the relative
residual measures stability, a simple derivation is useful. Recall that

A = LU + H and LUx̂ = b

where L and U are the computed LU factors. H then represents the differences
between the original and actually solved versions of the problem. Hence we can
multiply the first equation by x̂ and get the following derivation:

Ax̂ = (LU + H)x̂
    = LUx̂ + Hx̂
    = b + Hx̂.

The vector b is then subtracted from each side, norms are taken, and the results
divided by |Ilbl to get
‖Ax̂ − b‖ / ‖b‖ = ‖Hx̂‖ / ‖b‖ ≤ ‖H‖ ‖x̂‖ / ‖b‖.
Thus the relative residual provides a relative measure of the size of H, which is
controlled by the algorithm's stability. If the relative error is large but the relative
residual small, then the trouble is with the conditioning of the problem and not the
stability of the algorithm.
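The relative residual is a one-line computation. The sketch below (hypothetical helper name) uses the infinity norm; a small value indicates a stable algorithm even when the relative error in x̂ is large because of ill conditioning.

```python
def relative_residual(A, x_hat, b):
    """||A x_hat - b|| / ||b|| in the infinity norm: a direct measure of the
    size of H, and hence of the stability of the factorization used."""
    n = len(b)
    r = [sum(A[i][j] * x_hat[j] for j in range(n)) - b[i] for i in range(n)]
    return max(abs(v) for v in r) / max(abs(v) for v in b)
```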
2.3 Sparse Matrix Concepts
The matrices used in many applications tend to have a very significant number of
entries that are zero. This is particularly true and significant for very large matrices
as the computations on such large matrices, as dictated by direct solution methods,
would be too numerous to be done in a reasonable amount of time. The sparsity
induced by the many zero entries allows special techniques to be employed that sig-
nificantly reduce the required number of computations. Thus, the number of nonzero
entries in a matrix becomes a very critical measure of the amount of computations
required for a particular sparse matrix. Furthermore, as the number of nonzeros is
so critical, special attention is paid to minimize/reduce the number of matrix en-
tries that were originally zero and become nonzero due to the computations. This
situation is commonly called fill-in.
The discussion of this section first relates sparse matrices to graph theory, which is
an important tool used in structuring sparse matrix computations. Special techniques
for storing sparse matrices are then addressed. These techniques focus on the efficient
use of memory while preserving the ability to access the matrix entries efficiently. A
typical preprocessing step of reduction to block triangular form is then described
followed by a brief discussion of how the Markowitz Criterion can be used during the
analyze phase to determine the pivot ordering.
Relationship to Graph Theory: Critical to the exploitation of sparsity in sparse
matrices is the understanding and representation of the inherent matrix structure.
Graph theory provides an excellent vehicle for this purpose [48, 70]. By representing
the structure of a matrix with a graph construct, exploitable patterns can be more
readily recognized. Furthermore, as operations take place on the matrix that alter its
structure, corresponding changes can be made to the graph representation to track
the structural dynamics of the process.

In this section, the graph representations of both unsymmetric and symmetric
matrices are discussed, as well as the corresponding graph dynamics that occur due
to Gauss Elimination and matrix permutations.
If each row and each column in an n x n matrix A is labeled using 1, .., n and
each such label is associated with a node in a graph, the structure of the matrix can
be represented with edges from node i to node j for each nonzero aij entry in the
matrix A. These edges are directed and thus uniquely identify each nonzero entry in
the structure of any matrix. The corresponding graph is called a digraph or directed
graph. As an example, the matrix structure shown in Figure 2-2 is represented by
the digraph structure shown in Figure 2-3.
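Constructing the digraph is mechanical: one directed edge per nonzero. A minimal sketch, with the pattern given as a set of (i, j) pairs using the 1-based labels of the text (the function name is illustrative):

```python
def pattern_digraph(pattern):
    """Adjacency-set digraph of a sparse pattern: for every nonzero a(i,j)
    add the directed edge i -> j."""
    adj = {}
    for i, j in pattern:
        adj.setdefault(i, set()).add(j)
    return adj
```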

Figure 2-2. Sample Matrix Structure

Figure 2-3. Sample Digraph for Matrix A

If the matrix has a symmetric pattern (that is, if for each nonzero aij entry, the aji
entry is also nonzero), then each pair of nodes connected by a directed edge will have
a second edge connecting them that is oriented in the opposite direction of the first
edge. These two directed edges can be replaced by a single undirected edge, and thus
matrices with a symmetric pattern can be represented with an undirected graph. A
sample symmetric pattern matrix structure and the corresponding undirected graph
are shown in Figures 2-4 and 2-5.
The undirected graph for matrix B (as shown in Figure 2-5) happens to form a tree
structure with node 1 as the root. While this does not occur with all symmetric
matrices, it is a very important special case that will be discussed in more detail
later.
When Gauss Elimination is performed on a matrix to zero a column below the
diagonal, the structure of the matrix changes and corresponding alterations occur
within the graph representation [118]. In particular, assume the column below the


Figure 2-4. Sample Symmetric Matrix Structure


Figure 2-5. Sample Undirected Graph for Symmetric Matrix B

diagonal that corresponds to the node y in the graph is eliminated. Then, the node
y may be removed from the graph, and for any pair of nodes x and z such that both
(x, y) and (y, z) are existing edges, the edge (x, z) is added to the graph (assuming
it does not already exist). The resulting graph structure corresponds to the active
submatrix after this step of Gauss Elimination. This relationship is shown in the
matrices of Figure 2-6 and the digraphs of Figure 2-7.



Figure 2-6. Matrix Structure Before and After Elimination

Notice that in the graphs of Figure 2-7 edge (3, 4) was added due to the existence
of (3, 1) and (1,4). Likewise (3, 2) was added due to (3, 1) and (1,2). However, (4, 2)
was not added since it was already in the graph. Interestingly, the changes that take

Figure 2-7. Graph Reduction Due to Gauss Elimination

place in the graph structure due to corresponding Gauss Elimination in the matrix
are very similar to the graph theory process of computing transitive closures.
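The elimination rule above can be sketched on an adjacency-set representation of the digraph; the test below reproduces the example of Figures 2-6 and 2-7, eliminating node 1 from the graph with edges (3,1), (1,4), (1,2), and (4,2). The function name is illustrative.

```python
def eliminate_node(adj, y):
    """One graph step of Gauss Elimination: remove node y and, for every
    pair of edges (x, y) and (y, z) with x != z, add the fill edge (x, z)
    unless it already exists.  adj maps each node to its successor set."""
    preds = [x for x in adj if y in adj[x]]
    succs = adj.pop(y, set())
    for x in preds:
        adj[x].discard(y)
        for z in succs:
            if z != x:
                adj[x].add(z)   # fill-in edge (a no-op if already present)
    return adj
```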
Permutations are another very important operation performed on matrices. If
the permutation is symmetric (that is, both rows and columns are interchanged),
then the corresponding graph alterations amount to simply relabeling the nodes. For
example, consider the symmetric permutation B_new = PBP^T where B is as shown in
Figure 2-4 and P is defined as

P= 0 1 0
1 0 0

The resulting B_new has the same pattern with the rows and columns interchanged,
and the corresponding graph will be as shown in Figure 2-8, which is simply a rela-
beling of the graph in Figure 2-5.
Storage Techniques: Typical storage mechanisms for dense matrices require O(n2)
memory. The number of nonzeroes in a sparse matrix (referred to as r) will frequently
be several orders of magnitude less than n2. Thus, more efficient storage mechanisms
are required for simple efficiency as well as to make the algorithm being implemented
tenable. Furthermore, due to phenomena such as fill-in during Gauss Elimination, it
will frequently be necessary to alter the storage structure during processing. Thus, a
static structure will not be suitable unless it preallocates sufficient room for growth.
The alternative is a dynamic structure that can be easily altered during processing.
While both static and dynamic storage techniques for sparse matrices will be ad-
dressed, the storage of sparse vectors is discussed first as a necessary and illustrative

Figure 2-8. Graph After Symmetric Permutation

prerequisite. Much of this summary will follow that provided in Duff, Reid, and
Erisman's text on sparse matrices [48].
A full storage scheme for a sparse vector requires O(n) memory but has the
advantage that individual vector components can be directly indexed. A more space
efficient storage technique would be to hold each vector as a set of (value, index)
pairs with one pair for each nonzero component in the vector. These pairs may
be ordered by the index values or left unordered. The set of (value, index) pairs
is typically called the packed storage form. A transformation from the full to the
packed storage forms is called a gather operation. The reciprocating transformation
from packed to full form is called a scatter operation. Some contemporary computers
with extensive vector facilities, such as the Cray X-MP and the Fujitsu FACOM VP,
provide hardware support for these operations.
Typical mathematical operations on vectors include dot products and the saxpy
operation. The saxpy operation is defined as

z = αx + y

where x, y, and z are n-vectors and α is a scalar. The name saxpy comes from the
software mnemonic for "scalar alpha x plus y" [78]. As the packed forms of sparse
vectors are typically unordered (done for efficiency), operations on these vectors will
frequently follow the outline provided below:

Scatter the nonzero entries of one of the vectors into a full length vector known
to be initially zeroed.

For each operation:

1. Revise the full length vector by applying the operation to it (this can be done
by a traversal of the other vector operand in packed form for vector-vector
operations).

2. If an operation changes a zero to a nonzero, then update the corresponding
packed structure index list.

Gather the nonzeros out of the full length vector and back into packed form by
using the (potentially updated) index list of the packed form and in the process
zero out the full length vector entries used (facilitates later reuse of this vector).
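The outline above, specialized to the saxpy operation z = αx + y on packed (value, index) pairs, can be sketched as follows; the packed forms are assumed unordered with no duplicate indices, and the function name is illustrative.

```python
def sparse_saxpy(alpha, x_packed, y_packed, n):
    """z = alpha*x + y on packed sparse vectors, via scatter/operate/gather:
    scatter y into a zeroed full vector, traverse x's packed form to apply
    the operation (recording any new nonzero positions), then gather the
    result while re-zeroing the full vector for later reuse."""
    full = [0.0] * n
    index_list = []
    for val, i in y_packed:                 # scatter y
        full[i] = val
        index_list.append(i)
    for val, i in x_packed:                 # operate by traversing x
        if full[i] == 0.0:
            index_list.append(i)            # fill-in: a new nonzero position
        full[i] += alpha * val
    z = []
    for i in index_list:                    # gather; skip exact cancellations
        if full[i] != 0.0:
            z.append((full[i], i))
        full[i] = 0.0                       # leave the full vector zeroed
    return z
```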

When a vector facility is available, there is a greater tradeoff in performance
between the full and packed formats. Especially if the vector facility does not ac-
commodate indirect indexing, the storage savings from a packed format will result in
a significant performance disadvantage by precluding the effective use of the vector
hardware.
Moving on to sparse matrix storage techniques, there are four primary mechanisms:
coordinate schemes, collections of sparse vectors, linked lists, and
clique schemes. Each mechanism has unique advantages and disadvantages, and
a particular algorithm may use a number of different mechanisms throughout the
different phases of its operation.
The coordinate schemes maintain each nonzero in the matrix as a distinct triple
(aij, i, j). This technique is typically used as the general input and output format
for a general purpose algorithm. While this format is difficult to work with internally
since it cannot be efficiently accessed by either row or column, it provides an efficient
and simple interface format. Contributing to its efficiency as an input format is
the fact that it can be sorted into either the collection of sparse vectors or linked
list formats in time that is O(n) + O(r) where r is the number of nonzeros in the
matrix. The sorting technique is based on the bucket sort using either the row or
column indices. Furthermore, a column ordered row oriented result from the sort is
possible by first sorting by columns and using that result to sort by rows.
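The O(n) + O(r) conversion can be sketched with a bucket sort on the row indices (counting pass, prefix sums, placement pass); the names and the (starts, rows) output layout are illustrative assumptions.

```python
def coordinate_to_rows(triples, n):
    """Convert coordinate form (value, i, j) to a row-oriented collection of
    sparse vectors in O(n) + O(r) time: rows[starts[i]:starts[i+1]] holds
    the (value, column) pairs of row i."""
    counts = [0] * n
    for _, i, _ in triples:                 # count nonzeros per row
        counts[i] += 1
    starts = [0] * (n + 1)
    for i in range(n):                      # prefix sums give row offsets
        starts[i + 1] = starts[i] + counts[i]
    rows = [None] * len(triples)
    pos = starts[:]
    for val, i, j in triples:               # drop each entry into its bucket
        rows[pos[i]] = (val, j)
        pos[i] += 1
    return starts, rows
```

Sorting by columns first and feeding that result through the same routine produces a column-ordered, row-oriented structure, as noted above.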
The collection of sparse vectors method uses a size n array of (pointer, count)
pairs that index into the rows or columns that are held as sparse vectors. These
sparse vectors are held in unordered, yet contiguous format. Such a static format
is frequently allocated with extra space to accommodate growth. When the extra
space has been exhausted, but prior entries have been freed, a compaction operation
may be done to reclaim free entries. The advantages of this method are that row or
column accesses are efficient depending on the orientation of the structure and that
no memory is needed for links between vector entries. Disadvantages are primarily
with the growth restrictions imposed by the static allocation.
The linked list representation offers a dynamic structure similar to the prior col-
lection of sparse vectors. However, as each vector (row or column) is held as a linked
list, it can easily be expanded or contracted. Furthermore, it is also now possible to
easily maintain row vectors in column order (or column vectors in row order). While
access to these structures is usually simple and short, there are some disadvantages.
For one, it may be necessary to maintain doubly linked lists to accommodate inser-
tions and deletions (unless access methods are limited to means that preserve pointers
to prior list entries). Also, the linked list structures will exhibit less spatial locality
than the collection of sparse vectors method. This could adversely affect performance
if a hierarchical memory is in use.
A common drawback of both the linked list and collection of sparse vectors tech-
niques is that they are either row or column oriented but not both. For operations

such as Gauss Elimination, both row and column accesses may be necessary, and
thus one of these two access types will be inefficient. This problem can be remedied
by developing a second structure that is oriented opposite to the first (i.e., if the pri-
mary structure is row oriented, the secondary structure would be column oriented).
Furthermore, this secondary structure may only need to hold the structural compo-
nents of the matrix (i.e., the row/column indices) so full duplication would not be
necessary. Another possibility using linked lists would be to intermingle row- and
column-oriented linked lists between the nonzero elements.
The final storage method is the clique scheme, which is derived from finite element
models. In this scheme, the matrix, A, can be represented as the sum of a number
of matrices, A[k], that each have only a few rows/columns with nonzero entries:

A = Σ_k A[k].

These A[k] matrices correspond to fully connected subgraphs within the graph for
the matrix, which is from where the term clique is derived. By way of example, the
matrix shown in Figure 2-9 can be represented as the sum of the three cliques that
are obtained by using the following row/column patterns: {1, 2, 5}, {2, 3}, and {3, 4}.


Figure 2-9. Clique Storage Matrix Example

While this method was originally developed for finite element problems that are
directly derived as the sum of component matrices, this method is more widely ap-
plicable as any symmetric pattern matrix can be disassembled into a sum of cliques.
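Assembly of A from its clique contributions is a scatter-add of small dense blocks into the global pattern, the same operation that underlies frontal matrix assembly later in this work. A sketch, with each clique given as (global indices, dense block); the 0-based indexing and names are illustrative.

```python
def assemble_from_cliques(cliques, n):
    """Assemble A = sum of clique matrices A[k].  Each clique is a pair
    (idx, block): idx lists the global row/column indices of the clique and
    block is its small dense contribution matrix."""
    A = [[0.0] * n for _ in range(n)]
    for idx, block in cliques:
        for a, gi in enumerate(idx):
            for b, gj in enumerate(idx):
                A[gi][gj] += block[a][b]   # scatter-add the element matrix
    return A
```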
In order to improve the structure or maintain numerical stability, it is often nec-
essary to perform permutations to the original matrix. Such permutations are al-
gebraically represented as full matrices that are developed by interchanging rows or
columns of the identity matrix. However, storage of the full matrix for a permutation
is not required. A single n length vector can be used to represent any permutation
and variations exist that allow the permutation to be performed in place for only an
O(n) additional memory requirement for the permutation.
Reduction to Block Triangular Form: Frequently the matrix A can be permuted
into a block triangular form as shown in Figure 2-10 (a lower triangular version is
also possible). This form of the matrix is significant as it can reduce the amount of
computations required for factorizing and divides the overall problem into a number
of independent subproblems. In particular, only the B_ii diagonal blocks need to be factorized.
A back substitution process can then be used with a set of simple matrix-vector
multiplications to evaluate the contributions of the off diagonal blocks [48]. The system

Ax = b

thus becomes

(PAQ)y = Pb, x = Qy.

The back substitution then proceeds using

B_ii y_i = (Pb)_i − Σ_{j>i} B_ij y_j,  i = N, N−1, ..., 1

(where N is the number of diagonal blocks) followed by

x = Qy.

Notice with this technique fill-in is limited to the Bii submatrices and row or col-
umn interchanges within these submatrices do not affect the rest of the structure.
Furthermore, reduction to block triangular form can be done in O(n) + O(r) time
which is often more than offset by the savings in storage and computations of block
triangular form.
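The block back substitution can be sketched directly from the recurrence above; only the diagonal blocks are ever handed to a factorization routine, while the off-diagonal blocks contribute through matrix-vector products. The function names and the None convention for zero blocks are illustrative assumptions.

```python
def block_back_substitute(B, rhs, solve_diag):
    """Solve an upper block triangular system: for i = N, N-1, ..., 1 solve
    B_ii y_i = rhs_i - sum_{j>i} B_ij y_j.  B[i][j] is a small dense block
    (list of lists) or None where the block is zero; solve_diag(D, r)
    solves D y = r for a diagonal block."""
    N = len(B)
    y = [None] * N
    for i in reversed(range(N)):
        r = list(rhs[i])
        for j in range(i + 1, N):
            if B[i][j] is not None:         # r := r - B_ij * y_j
                for row in range(len(r)):
                    r[row] -= sum(B[i][j][row][c] * y[j][c]
                                  for c in range(len(y[j])))
        y[i] = solve_diag(B[i][i], r)
    return y
```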

PAQ = [ B11 B12 B13 B14 ]
      [     B22 B23 B24 ]
      [         B33 B34 ]
      [             B44 ]

Figure 2-10. Block Triangular Form Matrix

As a final set of notes on block triangular form, it has been proved that the
block triangular form for a nonsingular matrix is essentially unique (to the extent
that other equivalent forms are merely a reordering of the diagonal blocks). Also,
empirical studies have illustrated a high correlation between asymmetric structure
and block triangular reducibility.
Once the block triangular form has been found, it can be exploited for the subse-
quent factorization that begins with an analysis of the sparsity structure in the case
of a sparse matrix.
Analyze Phase: Not present in dense matrix factorization algorithms, the an-
alyze phase of a sparse matrix algorithm determines how the sparsity structure of
the matrix can be exploited to reduce the computations necessary for factorization.
This phase manifests itself as the selection of a pivot ordering and a symbolic fac-
torization to determine the pattern of the LU factors. The primary consideration in
this phase is to minimize the amount of fill-in that occurs as factorization progresses.
This in itself is a very difficult problem. In fact, a minimum fill-in pivot selection
is NP-complete in the restricted case where the diagonal consists of all nonzero en-
tries and all pivots are chosen from the diagonal [141]. However, there are also other

considerations. They include the selection of pivots to preserve numerical stability,
as well as, parallel and vector processing considerations. Furthermore, the selection
of a minimum fill-in pivot sequence may be so costly as to overshadow any potential
savings in the factorization. As a result, a number of heuristic techniques have been
developed for the analyze phase. When the matrix has an unsymmetric pattern, the
techniques most frequently are based on the Markowitz Criterion.
Markowitz Criterion: When selecting the kth pivot for the factorizing of the A(k)
active submatrix, the Markowitz Criterion chooses the numerically acceptable entry
a_ij^(k) that minimizes the product (r_i − 1)(c_j − 1) of the corresponding row and column
degrees, each less one [106]. The row degree for an entry is simply the number of nonzeroes currently in the
entry's row of the active submatrix. Likewise, the column degree is the number of
nonzeroes in the entry's column of the active submatrix. These degrees shrink with
the active submatrix as the effects of earlier pivot selections are taken into account,
but they may also grow due to fill-in. The idea is that this strategy will modify the
least number of coefficients in the remaining submatrix. Drawbacks of this approach
are that it requires maintaining updated row and column degrees for each step of the
factorization and that both row and column access to the matrix pattern is required.
When implementing the Markowitz Criterion, a typical approach is to order the
current row and column degrees in an increasing order with all equal values in a single
doubly linked list which is indexed by a pointer array. The search then commences
in increasing order until an entry is found that meets the pivot criteria. Additional
searches are done until it can be assured that no better entry can be found. It is
also possible to break ties based on the entry with the best numerical characteristics.
Importantly, the row and column degrees should be calculated only once at the
beginning and thereafter maintained through expedient incremental updates.
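The selection step can be sketched as a search over the active submatrix pattern; the names, the (i, j) entry list, the degree dictionaries, and the acceptability callback (standing in for the numerical threshold test) are all illustrative. A real implementation searches degree-ordered lists and stops early, as described above, rather than scanning every entry.

```python
def markowitz_pivot(entries, row_deg, col_deg, acceptable):
    """Markowitz pivot selection sketch: among numerically acceptable
    entries (i, j) of the active submatrix, pick the one minimizing
    (row_deg[i] - 1) * (col_deg[j] - 1), a bound on the fill-in the pivot
    can cause.  Degrees count nonzeros in the active submatrix."""
    best, best_cost = None, None
    for i, j in entries:
        if not acceptable(i, j):
            continue
        cost = (row_deg[i] - 1) * (col_deg[j] - 1)
        if best_cost is None or cost < best_cost:
            best, best_cost = (i, j), cost
    return best
```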
2.4 Multifrontal Methods
The multifrontal approach to sparse matrix factorization is based on a decompo-
sition of the matrix into a sum of smaller dense submatrices. Factorization of these
smaller matrices (called elements or frontal matrices) can be done using dense ma-
trix operations that do not require the indirect addressing required of conventional
sparse matrix operations. The multifrontal approach was first developed by Duff
and Reid for symmetric, indefinite matrices [54] and then extended to unsymmetric
matrices [55]. However, the multifrontal approach has been used most extensively
for the Cholesky factorization of symmetric, positive-definite matrices [70, 65, 67].
Most recently, Davis and Duff have generalized the method to take advantage of
unsymmetric-pattern matrices [34, 35].
One of the major advantages of a multifrontal method is that the regularity found
in the dense matrix operations can be used to take advantage of advanced architec-
tural features such as vector processors, hierarchical memories, multiprocessors with
shared memories or distributed memories connected by regular pattern communi-
cation networks [5, 7]. Furthermore, when frontal matrices do not overlap in their
pivot rows and columns, the factorization steps associated with those pivots can be
done concurrently [46]. This provides an additional degree of parallelism that can be

exploited by multiprocessors. The combination of these two sources of parallelism,
within dense frontal matrix operations and between frontal matrices, has been shown
to provide a significant amount of potential speed-up in several studies [83, 51, 121, 122].
Multifrontal methods for sparse matrices were originally developed for matrices
resulting from finite element problems where the coefficient matrix is the sum of a
number of dense symmetric matrices that correspond to the elements of the original
model (hence the use of element to describe the frontal matrices). As rows and
columns of the original matrix are assembled from all the contributing elements, the
factorization associated with that row/column pivot can be performed even though
other rows and columns have not yet been fully assembled. Later, these techniques
were extended to other matrices by artificially decomposing the matrix into a sum
of frontal matrices. With finite element and other symmetric-pattern matrices, the
frontal matrices are square. Furthermore, an assembly graph can be built with the
frontal matrices as nodes using edges to describe overlaps between frontal matrices.
The assembly graphs associated with symmetric matrices always form tree structures.
Such a structure simplifies both the task mapping and scheduling problems when
employing multiprocessors on the method.
Recent efforts by Davis and Duff have focused on extension of the method to
unsymmetric-pattern matrices [34, 35]. With such matrices, the frontal matrices are
no longer guaranteed to be square and are typically rectangular. Furthermore, the as-
sembly graph is no longer a tree. Instead, it becomes a directed acyclic graph (DAG).
The rest of this section describes the multifrontal method applied to unsymmetric-
pattern matrices.
Frontal Formation: A central issue of the multifrontal method is the decomposi-
tion of the matrix into frontal matrices. As the amount of computation is proportional
to the size of the frontal matrix and as distinct memory is typically allocated for each
frontal matrix, the primary objective is to minimize the size of the frontal matrices.
Moreover, as the amount of fill-in is typically proportional to the size of the frontal
matrix, minimizing frontal matrix size will also tend to minimize fill-in. This is
especially important in the earlier stages of the algorithm, as early fill-in will likely
propagate into larger memory and computational requirements for the later frontal
matrices.
The approach taken by Davis and Duff to minimize frontal size is to base frontal
matrix determination on selection of a pivot with minimal Markowitz cost. In order
to preserve numerical stability, threshold pivoting is often added to the pivot selection
criteria.
Criteria stronger than the Markowitz cost, such as minimum actual fill-in, are typ-
ically too costly computationally and provide little additional benefit. As Markowitz
cost requires that updated row and column degrees be maintained, approximate de-
grees are frequently used.
Frontal Factorization: Once a frontal matrix has been determined, the factor-
ization associated with its pivot will be done. This involves the updating of the
corresponding row in the U factor via a copy operation, the computation (via a di-
vide by the pivot value) and copying of the pivot column into the corresponding

column of L, and the updating of the remaining frontal matrix entries. This updat-
ing process is referred to as the Schur Complement where each entry is computed as
a'_ij = a_ij - (a_ik / a_kk) a_kj
which is equivalent to:
a_ij = a_ij - l_ik u_kj.
The -l_ik u_kj contribution to the a_ij entry need not be applied immediately and, in
fact, distinct updates to the same entry can be applied in any order. Hence only the
pivot row and column are needed for the frontal matrix. The update terms can be
calculated and saved until their row/column becomes pivotal or they can be applied
to the entry earlier if convenient. This can significantly reduce the required amount
of data sharing.
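A single-pivot elimination step on a dense frontal matrix can be sketched as below. This is a minimal illustration (hypothetical name `eliminate_pivot`, frontal matrix stored as a list of lists with the pivot at position (0, 0)), not the production kernel:

```python
def eliminate_pivot(F):
    """One elimination step on a dense frontal matrix F whose (0, 0)
    entry is the pivot.  Returns the U row, the L column, and the
    Schur complement holding the -l_ik * u_kj update terms."""
    piv = F[0][0]
    u_row = F[0][:]                                      # pivot row copied into U
    l_col = [F[i][0] / piv for i in range(1, len(F))]    # pivot column / pivot
    schur = [[F[i][j] - l_col[i - 1] * u_row[j]
              for j in range(1, len(F[0]))]
             for i in range(1, len(F))]
    return u_row, l_col, schur
```

The rows and columns of `schur` are exactly the update terms that may be held in the frontal matrix until they are assembled into later frontal matrices.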
Assembly: The process of applying updates to an entry is called assembly. Con-
sider the following example of Figures 2-11, 2-12, and 2-13, where each Pi represents
a pivot, letters are nonzeros, and dots are zeros.

        ( P1 .  a  b  . )
        ( c  P2 d  .  e )
    A = ( f  g  .  .  . )
        ( .  h  i  .  j )
        ( k  .  .  .  l )

Figure 2-11. Sample Sparse Matrix

The first frontal matrix of A in Figure 2-11 is E1 and consists of rows 1, 2, 3, and
5 together with columns 1, 3, and 4, as shown in Figure 2-12.

         ( P1 a  b )
    E1 = ( c  d  . )
         ( f  .  . )
         ( k  .  . )

Figure 2-12. Frontal Matrix for First Pivot

With E1, fill-in occurs in entries (2,4), (3,3), (3,4), (5,3), and (5,4). Furthermore,
entry a_2,3 is updated to d and must be assembled into E2 (shown in Figure 2-13), as
it is in the pivot row of E2. In addition, the fill-in occurring in entries (2,4) and
(3,4) of E1 will cause E2 to be expanded by this column.
While the contributions to entries a_2,3 and a_2,4 are assembled into E2, it would
also be convenient to assemble E1's contributions to entries a_3,3 and a_3,4. However,
none of the other entries in the Schur complement of E1 can be assembled into E2,
and thus E1 must be held until the rest of its contributions can be assembled into
later frontal matrices. The inability of E2 to accommodate all of E1's contributions
is referred to as partial assembly and occurs only in the unsymmetric-pattern version

         ( P2 d  e )
    E2 = ( g  .  . )
         ( h  i  j )

Figure 2-13. Frontal Matrix for Second Pivot

of the method. In the symmetric-pattern version all contributions can be passed
together to a single subsequent frontal matrix.
Assembly DAG Edges: Whenever a data dependency exists between frontal ma-
trices, an edge in the assembly DAG is used to represent the dependency. In the
previous example of E1 and E2, there would be a directed edge (E1, E2). Further-
more, as the row pattern of E1 contains the pivot row of E2 and the column pattern
of E1 does not contain the pivot column of E2, the relationship indicated by the edge
(E1, E2) is referred to as an L relationship, where E1 is an L child of E2 and E2 is an
L parent of E1. The four matrix patterns of Figure 2-14 illustrate the four possible
relationships between the frontal matrices that are induced by pivots P1 and P2.

[Four pivot-pattern sketches, one each for the no-relationship, L-relationship,
U-relationship, and LU-relationship cases.]

Figure 2-14. Possible Relationships between Frontal Matrices

The edges in the assembly DAG show not only data dependencies but also con-
trol dependencies. Transitive reductions in the DAG are possible to more efficiently
describe the precedence relationships and to reduce the data passing.
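Assuming the row and column patterns are kept as index sets, classifying an assembly DAG edge reduces to two set intersections. A minimal sketch with hypothetical names:

```python
def classify_edge(child_rows, child_cols, parent_pivot_rows, parent_pivot_cols):
    """Classify the assembly-DAG edge from a child frontal matrix to a
    parent, using the overlap of the child's row/column pattern (sets of
    indices) with the parent's pivot rows/columns."""
    shares_rows = bool(child_rows & parent_pivot_rows)
    shares_cols = bool(child_cols & parent_pivot_cols)
    if shares_rows and shares_cols:
        return "LU"
    if shares_rows:
        return "L"   # child's rows overlap the parent's pivot rows
    if shares_cols:
        return "U"   # child's columns overlap the parent's pivot columns
    return None      # no data dependency: no edge
```

Applied to the earlier example, E1 (rows {1, 2, 3, 5}, columns {1, 3, 4}) against E2's pivot row 2 and pivot column 2 yields an L relationship.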
Pivot Amalgamation: Up to this point in the discussion each frontal matrix has
been built using a single pivot and updated via a single factorization step. When
additional pivots are identified that have identical (or nearly identical) row and col-
umn patterns, the pivots can be amalgamated together into a single frontal matrix.
This allows several pivots' contributions to be calculated and combined. In particu-
lar, consider the generalized example below of an r x c frontal matrix where F is the
k x k pivot block that includes the k pivots on its diagonal. The matrix F will be fac-
tored into the product L1 U1 in the frontal matrix's factored form. C^T is a k x (c - k)
submatrix that will be updated to become part of the U factor (specifically U2 in the
frontal's factored form). Likewise, L2 will be formed from B, which is a (r - k) x k
submatrix. Finally, the remaining (r - k) x (c - k) submatrix D will be updated by

the Schur complement of the factorization into D'.

    ( F   C^T )        ( L1\U1   U2 )
    ( B   D   )   ->   ( L2      D' )

The update to D is formed as:

D' = D - (L2 U2)

which is a rank k update to D. In the single pivot case this is just a rank 1 or outer
product update commonly found in BLAS 2 (Basic Linear Algebra Subroutines Level
2) routines. However, in this more general case, the update is a rank k update that
can be implemented via matrix multiplication in BLAS 3 routines. These higher level
BLAS routines have a richer computational structure that can be used to exploit the
advanced architectural features of a particular system. Hence, amalgamation can
significantly improve the speed of frontal matrix numerical factorizations.
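The rank-k update D' = D - L2 U2 described above can be sketched as a plain triple loop; in a real implementation this is precisely the operation handed to a BLAS 3 matrix-multiply routine such as dgemm:

```python
def rank_k_update(D, L2, U2):
    """Compute D' = D - L2 * U2, the rank-k Schur-complement update of
    the (r-k) x (c-k) contribution block.  Matrices are lists of lists;
    a BLAS 3 call would replace this loop nest in practice."""
    r, c, k = len(D), len(D[0]), len(U2)
    return [[D[i][j] - sum(L2[i][p] * U2[p][j] for p in range(k))
             for j in range(c)]
            for i in range(r)]
```

With k = 1 this degenerates to the outer-product (BLAS 2) update of the single-pivot case.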
As alluded to earlier, the row and column patterns of two pivots do not need to be
identical for amalgamation to be applied. If they are nearly identical, amalgamation
can still take place at the cost of a slightly larger frontal matrix with additional fill-in.
Separate Symbolic and Numerical Factorizations: When multiple matrices of iden-
tical nonzero structure are to be factored, it is possible to separate the symbolic and
numerical factorization phases. When this is done, a single symbolic factorization of
the matrices' pattern can be performed, followed by a numerical factorization for
each distinct matrix. The symbolic factorization will determine the
pivot ordering, develop the assembly DAG, and determine the nonzero structures of
the LU factors. (Symbolic factorization is essentially the same operation as was pre-
viously referred to as the analyze phase). The subsequent numerical factorizations
will then make use of this information for the actual numerical factorizations.
Loss of Anticipated Pivots: An important and difficult problem exists however
with the separation of symbolic and numerical factorizations. Specifically, the sym-
bolic factorization will not be able to make use of numerical values for the selection
of pivots. Hence, specified pivots in the ordering may become numerically unaccept-
able. When such an occurrence is detected during a numerical factorization, special
recourses are necessary to recover. An example of this situation can be seen in Figure 2-15.

         ( 2 1 0 4 )        ( 2 8 0 4 )
    A1 = ( 1 4 1 0 )   A2 = ( 1 4 1 0 )
         ( 0 3 2 0 )        ( 0 3 2 0 )
         ( 1 2 0 4 )        ( 1 2 0 4 )

Figure 2-15. Lost Pivot Example

In matrix A1, the a_2,2 entry is a perfectly acceptable second pivot. However, as
the values change in matrix A2, the update to the a_2,2 entry, accomplished via:

a'_2,2 = a_2,2 - (a_2,1 / a_1,1) a_1,2

causes a_2,2 to become zero and no longer acceptable as a pivot. Furthermore, a pivot's
numerical value need only be altered to drop it below the threshold pivoting limits
in place for it to become no longer acceptable. Also interesting to note is that the
pivot entry itself need not change for the pivot to fail to be acceptable: other entries
may affect the anticipated pivot's value during previous steps of the elimination, as
was the case in the example provided.
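The failure is easy to verify numerically. The sketch below (hypothetical helper name) applies the update formula to the two matrices of Figure 2-15:

```python
def updated_a22(A):
    """a'_22 = a_22 - (a_21 / a_11) * a_12: the first elimination step's
    effect on the prospective second pivot."""
    return A[1][1] - (A[1][0] / A[0][0]) * A[0][1]

A1 = [[2.0, 1.0, 0.0, 4.0],
      [1.0, 4.0, 1.0, 0.0],
      [0.0, 3.0, 2.0, 0.0],
      [1.0, 2.0, 0.0, 4.0]]
A2 = [row[:] for row in A1]
A2[0][1] = 8.0  # the only value that differs between A1 and A2
```

For A1 the second pivot survives (a'_2,2 = 4 - (1/2)(1) = 3.5), while for A2 it is annihilated (a'_2,2 = 4 - (1/2)(8) = 0), even though the a_2,2 entry itself never changed.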
Davis and Duff suggest two possible approaches to dealing with the loss of antic-
ipated pivots [34, 35]. They are:

Force the amalgamation of the lost pivot with subsequent frontal matrices dur-
ing the numerical factorization. This creates a larger pivot block from which
permutations can be used to select alternative pivots. The drawback is that the
row and column patterns of the amalgamated frontals could be quite different
and result in catastrophic fill-in and loss of inter-frontal parallelism.

Amalgamate the frontal matrix containing the lost pivot with its first LU parent
(assuming one exists). This will limit additional fill-in; however, such an LU
parent is not guaranteed to exist, in which case forced amalgamation will be
necessary.
While the possibility of lost pivots complicates the factorization of sequences of
identically structured sparse matrices, the unsymmetric-pattern method with sep-
arated analyze and factorize phases does provide significant potential benefits for
dealing with such sequences. Yet, the question of where such sequences arise is
still open. The next section provides general mathematical settings in which such
sequences occur and provides a listing of sample application areas.
2.5 Factorization Sequences of Matrices
The factorization of a sequence of sparse matrices that maintain an identical
nonzero structure occurs frequently in a variety of applications. The principal cate-
gories of such applications are those that use a Newton method-based approach to
solving a system of nonlinear algebraic equations and those that require the solution
to a system of ordinary differential equations by an implicit method. Both of these
categories will be discussed in this section.
Systems of Nonlinear Algebraic Equations: Newton's method is a very common
technique in numerical analysis that is used to find the zeros of a function by an
approximation of the function that is based on a first order Taylor's series expan-
sion [13]. For one dimensional real valued functions, the first order Taylor's series
expansion is
f(x) = f(x_0) + f'(x_0)(x - x_0).

When solving for α such that f(α) = 0, we replace x with an approximation x_1 of α
and assume f(x_1) = 0. The result is

0 = f(x_0) + f'(x_0)(x_1 - x_0)

which can be solved for x_1 and generalized into the following iteration formula by
replacing x_1 with x_(k+1) and x_0 with x_k:

x_(k+1) = x_k - f(x_k) / f'(x_k).
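A minimal sketch of this scalar iteration (hypothetical function name; the derivative is supplied explicitly):

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Scalar Newton iteration: x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

The quadratic convergence visible here is the same property that makes Newton's method attractive in the settings discussed below.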

The method can be generalized from strictly real (or complex) functions to any
mapping between Banach spaces by using the Frechet derivative [93]. A Banach space
is a normed vector space that is complete (that is, all Cauchy sequences will converge
to an element in the space). The Frechet derivative is a bounded linear operator L
that satisfies:
lim_(y -> 0) ||f(x_0 + y) - f(x_0) - L(y)|| / ||y|| = 0
where f is a function mapping one Banach space to another and the || ||'s are norms
appropriate to the given Banach spaces.
Systems of nonlinear algebraic equations can be considered to be a mapping from
an n-dimensional Euclidean space (R^n), which is a Banach space, back to itself,
defined by

           ( f_1(x_1, x_2, ..., x_n) )
    f(x) = ( f_2(x_1, x_2, ..., x_n) )
           (           ...           )
           ( f_n(x_1, x_2, ..., x_n) )

for f_i mapping R^n to R^1. The Frechet derivative of such a mapping is simply the
Jacobian of the function (denoted ∇f), which is an n x n matrix of partial derivatives
whose ith row, jth column entry (evaluated at a given x_0) is

(∂f_i / ∂x_j)(x_0).
The Newton iteration formula thus becomes

x_(k+1) = x_k - ∇f(x_k)^(-1) f(x_k)

within this context. The problem with this formula is the requirement to compute
the inverse of the Jacobian. In order to avoid this costly (and sparsity destroying)
process we define
Δx_k = x_(k+1) - x_k
and then manipulate the formula into

∇f(x_k) Δx_k = -f(x_k)

and determine x_(k+1) = x_k + Δx_k. As the Jacobian ∇f(x_k) is a linear operator and
-f(x_k) is a vector, the computation now consists of simply solving a linear system of
equations. The focus of this work is on instances where the Jacobian is sparse, which
is typical of most large systems. The nonzero entries of the Jacobian can change
values with the various x_k's, but the nonzero structure will remain constant.
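Each Newton iteration therefore reduces to one linear solve. The sketch below performs a single step using dense Gaussian elimination with partial pivoting for brevity; for the large sparse Jacobians of interest here, a sparse LU factorization takes the place of this dense solve. Names are hypothetical:

```python
def newton_step(jacobian, f_val):
    """Solve J * dx = -f for one Newton step, via dense Gaussian
    elimination with partial pivoting on the augmented matrix."""
    n = len(f_val)
    A = [row[:] + [-f_val[i]] for i, row in enumerate(jacobian)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))  # pivot row
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n + 1):
                A[i][j] -= m * A[k][j]
    dx = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        dx[i] = (A[i][n] - sum(A[i][j] * dx[j]
                               for j in range(i + 1, n))) / A[i][i]
    return dx
```

The new iterate is then x_(k+1) = x_k + dx.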
One application area of such nonlinear algebraic equations is chemical engineering.
Specifically, Zitney discusses the solution of such systems for distillation flowsheet-
ing problems where the Jacobians are of an unsymmetric-pattern and fail to have
favorable numerical properties such as diagonal dominance [144, 143].
Systems of Ordinary Differential Equations: Another major type of problems where
the factorization of sequences of identically structured matrices is required is the solv-
ing of systems of ordinary differential equations using an implicit method. The pre-
dominant type of numerical methods for such tasks when initial values are provided
are linear single- and multi-step techniques due mainly to their ease of implemen-
tation and analysis [107]. These techniques divide the independent axis into steps
or mesh points across which the differential equations are integrated via a discrete
summation method. A generic form of the problem would be:

x' = f(x)

with x a vector function of time t, x' the first time derivative of x, and the initial
condition x(0) = x_0. A typical linear single-step technique would use

x_(n+1) + α_1 x_n = h[β_0 f(x_(n+1)) + β_1 f(x_n)]

to solve for x_(n+1), where x_i denotes x(t_i), t_i the time at time step i, h the time step
size, and α_1, β_0, and β_1 constants. If β_0 = 0 then the method is explicit; that is,
x_(n+1) can be solved for directly. If β_0 ≠ 0 then the method is implicit and an iterative
method is required to solve for x_(n+1).
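As a concrete instance, choosing the coefficients α_1 = -1, β_0 = 1, and β_1 = 0 in the single-step formula yields the implicit backward Euler method; the sketch below (hypothetical names) solves each step's implicit equation with a scalar Newton iteration:

```python
def backward_euler_step(f, fprime, x_n, h, tol=1e-12):
    """One backward-Euler step: solve g(x) = x - x_n - h*f(x) = 0
    for x_{n+1} by Newton iteration, starting from x_n."""
    x = x_n
    for _ in range(50):
        g = x - x_n - h * f(x)
        dg = 1.0 - h * fprime(x)   # g'(x)
        step = g / dg
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: x' = -x with x(0) = 1; one exact step is x_n / (1 + h).
x1 = backward_euler_step(lambda x: -x, lambda x: -1.0, 1.0, 0.1)
```

For a system of ODEs, the scalar division by g'(x) becomes the linear solve with the Jacobian discussed above.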
Linear multi-step techniques are similar but use more of the previously predicted
values xi and function evaluations at those xi's for the next prediction. In particular,
they take the general form:

x_(n+1) + Σ_(i=1..p) α_i x_(n-i+1) = h β_0 f(x_(n+1)) + h Σ_(i=1..p) β_i f(x_(n-i+1))

where β_0 = 0 again means an explicit method and β_0 ≠ 0 means an implicit method.
Specific methods vary in their value for p and the weights given by the α_i's and
β_i's. Such methods include Adams-Bashforth (explicit), Adams-Moulton (implicit),
Nyström, Newton-Cotes, and the Backward Differentiation Formula [107, 124, 90].
For reasons of stability, implicit methods are preferred for "stiff" (that is, ill-
conditioned) problems. In particular, explicit linear multi-step methods cannot meet
A-stable criteria and can only meet A(α)-stable criteria if p > 3 [107]. These stability
conditions relate the effectiveness of a method to the conditioning of the problem as
evidenced by the eigenvalues of the corresponding matrix. Hence, unless only "non-
stiff" (well-conditioned) problems are to be solved by a linear multi-step method, the

method should be implicit. (Other classes of alternative nonlinear methods will be
discussed briefly later).
When using implicit methods, the linear multi-step method given generically
above can be reorganized to accumulate only terms dependent upon x_(n+1) onto the
left side as follows:

(I - h β_0 f)(x_(n+1)) = - Σ_(i=1..p) [α_i x_(n-i+1) - h β_i f(x_(n-i+1))].

As the right hand side is constant within a time step, we can define it equal to b.
Hence, our task within a time step becomes finding the solution x in

F(x) = b, where x replaces x_(n+1) and F(x) = (I - h β_0 f)(x).

For f defining a linear system of ODEs, this simply requires the solution of a linear
system of equations. However, when the system is stiff, the coefficient matrix corre-
sponding to F is ill-conditioned and direct solution methods can be too inaccurate.
Thus Newton's method can be used to solve for the zero of the modified function:

G(x) = F(x)- b = 0.

This has proven useful in practice due to the quadratic convergence of Newton's
method [107, 90]. The same approach is used when f is a nonlinear function (repre-
senting a system of nonlinear ODEs).
When Newton's method is applied within each time step as above, each Newton
iteration requires solving a linearized system (as discussed in the previous subsection
on systems of nonlinear algebraic equations). Specifically, the equation:

∇G(x^(i)) Δx^(i) = -G(x^(i))

must be solved for Δx^(i), where x^(i) is the ith approximation to x_(n+1), ∇G is the
Jacobian of G, and x^(i+1) = x^(i) + Δx^(i). The structure of ∇G remains constant both
within and between time steps thus producing a sequence of matrices with identical
structure that must be factorized and solved.
Other methods also exist and are particularly useful for solving stiff systems.
These include nonlinear methods such as Certaine's method, Jain's method and the
class of Runge-Kutta techniques [107]. Certaine's method is also an example of
another class called predictor/corrector methods where an explicit technique is used
to predict a first approximation to x_(n+1) and then an implicit method is used to correct
(refine) the approximation. Furthermore, there are both implicit and explicit versions
of the numerous Runge-Kutta techniques. Whenever implicit techniques are used,
the factorization of a sequence of matrices is a likely subtask. However, with some of
the techniques these matrices take on an already factored triangular or even diagonal
structure. Yet, the bottom line is that the factorization of a sequence of general
sparse identically structured matrices is a common subproblem to solving systems of
ODEs via an implicit method. Furthermore, these systems are frequently encountered
in applications that include nuclear magnetic resonance spectroscopy, computational

chemistry, and computational biology [147]. More specifically, Miranker provides
examples of systems of nonlinear ODEs with the circuit simulation of tunnel diodes
commonly used in high speed circuits, thermal decomposition of ozone, and the
behavior of a catalytic fluidized bed [107]. Electronic circuit simulations are another
common source of large systems of nonlinear, first order ODEs as evidenced in the
modified nodal analysis approach used by Saleh and Kundert [131, 97] and the sparse
tableau formulation of Hachtel [82].
2.6 Parallel Matrix Computations
A major emphasis of this research effort is to perform the numerical factorization
of sequences of sparse matrices in parallel. Furthermore, the unsymmetric-pattern
multifrontal approach provides dense submatrices which may also be factorized in
parallel. This section summarizes some previous efforts that explored distributed
memory implementations of both dense and sparse matrix factorizations.
Distributed Memory Factorization: In distributed memory environments, algo-
rithms are heavily dependent upon the storage allocation scheme for the matrix. The
two most common schemes are row oriented [22] and column oriented [64, 40, 27, 76].
Blocking schemes of rows or columns are frequently used and Dongarra and Os-
trouchov [40] discuss such methods together with the need for adaptive blocking
mechanisms. Choi, Dongarra, Pozo, and Walker explore the various data allocation
schemes and associated algorithms and develop a generalized block scattered approach
with data blocked and blocks scattered (wrapped) to processors [21]. By varying the
parameters of this method both pure blocked and pure scattered (together with inbe-
tween formats) can be achieved. A block-cyclic strategy has recently been proposed
by Lichtenstein and Johnsson for dense linear algebra on distributed memory multi-
processors [100]. The block-cyclic strategy is applied to both rows and columns and
is effectively applied to LU factorization and the subsequent triangular solves.
Furthermore, pivoting, if required, can be done in a row or column fashion. When
partial pivoting is used, Geist and Heath [62] recommend inclusion of pipelining to
offset the cost of pivoting. For efficiency of pivot determination, row pivoting is
preferred with column storage and column pivoting with row storage. A simple illus-
trative example of a basic (non-pipelined) algorithm (taken from Gallivan, Plemmons,
and Sameh [59]) is shown in Figure 2-16. It uses column oriented storage with a row
pivoting scheme.
Sparse Matrix Computations: While considerable attention has been paid to im-
plementation of dense matrix operations on parallel architectures, less has been given
to sparse matrix operations and there exists significant open research to be accom-
plished in this area. Major reasons for delaying such attention are that sparse matrix
algorithms are more complex, use more sophisticated data structures, require irreg-
ular memory referencing, and are thus much more difficult to efficiently implement.
Ironically, operations for sparse matrices will typically have a greater degree of
potential parallelism than their dense counterparts. This additional parallelism comes
in the form of large grain parallelism that may be found in the structure of nonzeros
in sparse matrices. The available parallelism in sparse matrix factorization has been

Each processor executes the following:

    do k = 0, n-1
        if ( column k is local ) then
            determine pivot row
            interchange row k with pivot row
            do i = k+1, n-1
                compute multipliers l_ik = a_ik / a_kk
            broadcast the column just completed with pivot index
        else
            receive the column just computed with pivot index
            interchange row k with pivot row
        for ( all columns j > k that are local ) do
            do i = k+1, n-1
                a_ij = a_ij - l_ik a_kj

Figure 2-16. Basic Column Fan-Out LU Factorization
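A sequential Python rendering of the computation in Figure 2-16 may help clarify it; with all columns "local" to a single process, the broadcast/receive pair collapses to direct access. The name `fanout_lu` is illustrative:

```python
def fanout_lu(A):
    """Sequential rendition of the column fan-out algorithm: at step k
    the owner of column k picks a pivot row, scales the subdiagonal of
    column k, and broadcasts it; the remaining local columns are then
    updated.  A (a list of lists) is factored in place; the list of
    pivot row indices is returned."""
    n = len(A)
    pivots = []
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))  # determine pivot row
        A[k], A[p] = A[p], A[k]                           # interchange rows
        pivots.append(p)
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                            # multipliers l_ik
        for j in range(k + 1, n):                         # columns j > k
            for i in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return pivots
```

On exit, the strictly lower triangle of A holds the multipliers (L) and the upper triangle holds U, as in the in-place convention of the figure.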

studied using a variety of models. Wing and Huang developed a very fine grain
model with each divide/update operation on a single entry considered a task [140].
With this model (which restricts the factorization to a given pivot ordering), the
maximal parallelism available can be realized. A large grain model was developed by
Jess and Kees [91] that defines a task as the factorizing and complete corresponding
update of a single pivot. Liu [102] establishes a third, medium grain model with
the various column operations associated with a single pivot defined as tasks with
one task per column. Ostrouchov, Heath, and Romine [117] investigate whether
the rather disappointing results seen so far with Cholesky factorization are due to
an inherent lack of parallelism in the problems or to limitations on contemporary
hardware. They conclude that parallelism is available, but the relatively high cost
of communication (relative to computation) has limited performance. Specifically,
on currently available distributed memory multiprocessors, only 20 to 25 percent of
potential speed-ups are realized when using a medium grain of parallelization.
A most important issue in exploiting available parallelism in a parallel sparse
matrix factorization is that of task allocation.
Task Allocation: Task allocation requires the careful balance of two opposing
objectives. The first objective is to always have as much useful activity as possible
going on in parallel. This typically requires use of finer grains of parallelism to achieve

more potential concurrency. The second objective is to have as little interaction as
possible between concurrently executing processors. This frequently contradicts fine
grain parallelism that tends to require frequent interaction. For very regular com-
putations, interactions can be orchestrated to minimize idle time due to processors
waiting for an interaction. However, with sparse matrix routines, this regularity is
typically lost due to variations in the pattern of nonzeros. Thus, task allocation
becomes more challenging.
In a distributed memory environment task allocation is especially challenging.
Due to the higher cost of communication, dynamically balancing loads can be ex-
pensive. Hence, a static allocation is frequently used and tightly coupled to the data
allocation scheme. For Cholesky factorization, early column oriented algorithms al-
located data and column tasks to processors in a wrapped fashion proceeding in a
bottom up manner from the elimination tree [63, 71]. This did well to address the
load balancing issue but caused excessive communication. Later schemes allocated
subcubes to subtrees in the elimination tree and effectively reduced communication
[63, 71]. This worked well for the well balanced trees produced from nested dissection
orderings but not for the potentially unbalanced trees of minimum degree orderings.
A further refined allocation scheme was developed by Geist and Ng to deal with
unbalanced trees in a way that addresses both communication elimination and load
balancing [63].
Additional advantages are typically found when using contemporary architectures
by organizing tasks to use fewer, but larger, messages. Hulbert and Zmijewski propose
such a method that combines update contributions into a single message [88]. While
some granularity is lost, there are fewer message setup delays which tend to dominate
much of the communication cost in contemporary architectures.
Numerical Factorization: As much of the work in sparse matrix factorization
comes in the numerical factorization phase and as this work is typically the most
structured (due both to the nature of the work and the advantages obtained from
the earlier phases), this phase has the most exploitable parallelism and has received
the most research attention. However, sparse numerical factorization is still much
more difficult than its dense counterpart. Most implementations exploit only large
and medium grains of parallelism.
Within a distributed memory environment, it is important to exploit locality of the
data. Using data local to the processor as much as possible avoids the communication
and synchronization overheads imposed by message passing between processors.
Chu and George [22] discuss a row oriented dense LU factorization routine for dis-
tributed memory environments that could be extended in concept to sparse matrices.
This technique uses a fan-in operation to determine the next pivot row followed by
a fan-out operation on the selected pivot that facilitates the updating of the active
submatrix. Another key feature is the dynamic load balancing used to accommodate
the use of partial pivoting.
Using a column oriented allocation scheme, pivot selection is handled by a single
processor (the one owning the current pivot column). The factored pivot column can
then be broadcasted (or in the sparse matrix context perhaps multicasting is more

appropriate) to the processors holding the other columns of the active submatrix to
allow updating. This is the fan-out approach that has been used in initial Cholesky
factorization routines. The excessive communication overheads of these methods
limited their performance.
Another class of methods (used for sparse Cholesky factorization) are the fan-in
methods [9, 117, 88]. These methods accumulate contributions to the updating of
the active submatrix and send fewer but larger messages [9].
An increasingly popular approach to sparse matrix factorization are the symmetric-
pattern, multifrontal methods [105, 76, 51, 121, 122, 77]. Large grain parallelism is
available in these methods via independent subtrees in the assembly trees. How-
ever, as subtrees combine and parallelism at that level decreases, a switch is made
to exploit parallelism at a finer grain within the factorization of a particular frontal
matrix. This is especially appropriate as the later frontal matrices are typically
much larger. Within a distributed memory environment, a column oriented storage
scheme can be used with frontal matrices allocated to subcubes and frontal matrix
factorization done in a fan-out implementation of dense factorization. Assembly in-
volves sending of contributions to appropriate follow-on processors where they are
added to the appropriate entries. While these distributed multifrontal approaches
typically require more communication than a fan-in approach, their performance has
been quite promising for the Cholesky factorization of symmetric, positive definite
matrices [76, 119, 120]. Specifically, speed-ups of 5.9 to 21.4 were achieved on a
32 processor Intel iPSC/2 using Cholesky factorization on matrices varying in order
from 1,824 to 16,129 [119, 120]. Lucas has implemented a distributed memory mul-
tifrontal LU factorization routine that assumes numerically acceptable pivots and
a symmetric-pattern. Speed-ups of up to 10.2 were obtained on 16 processors of
an iPSC/2 using electronic device simulation matrices of order 7225. Recent sim-
ulation studies of similar methods for the LU factorization of (assumed-)symmetric
matrices [121, 122] have shown a reasonable parallel potential in distributed memory
environments. Here speed-ups of up to 30-40 were predicted for 128 node iPSC/2
and iPSC/860 hypercube topology multiprocessors. Comparison of these results with
other simulated architectures reveals that lower communication-to-computation ratios
produced better results (as one would expect). The exploitation of parallelism both
in the elimination structure and within the distinct pivot steps has been found to be
critical to the success of all of these methods [76, 51, 121].
Recently, fine grain parallelism has been investigated by Gilbert and Schreiber for
a variety of sparse Cholesky methods on a SIMD distributed memory environment
[77]. The multifrontal approach has demonstrated the greatest potential for exploiting
this level of parallelism.
A critical subissue in any parallel sparse matrix factorization routine is how to
efficiently schedule the component tasks. This issue will be discussed in the next
section.
2.7 Multiprocessor Scheduling
The precedence constrained multiprocessor scheduling problem has been shown to
be NP-Complete in both shared and distributed memory environments [137]. Thus,

heuristic techniques have been the emphasis of much of the research in this area.
Within shared memory environments, a list scheduling scheme is typically used. Un-
der such schemes, tasks are assigned priorities and when made ready (predecessors in
the partial order have all completed) are put into a single priority queue for scheduling
by the next available processor [133]. The most common and effective priority schemes
are based on a critical path analysis of the precedence DAG with the priority of a task
defined as the heaviest weighted path from that task's node to an exit node (node
with no successors) in the DAG [23]. Variations among these schemes primarily deal
with tie breaking, with a Most Immediate Successors First criterion being both popular
and effective [94].
Within the context of a distributed memory environment, communication delays
(represented by edge weights in the task precedence graph) can become quite signif-
icant. Three methods for dealing with such Task with Communications Scheduling
(TCS) problems are presented in this section.
ETF - Earliest Task First: The Earliest Task First (ETF) heuristic [89] is based
on an extension of the Rayward-Smith computing model [125]. This extended model
uses message counts for edge weights in the task graph together with a transmis-
sion cost function that defines the communication costs between any two processors.
Different network topologies can be represented by varying this transmission cost
function. Task computation times are represented as node weights.
Rayward-Smith proposed a list scheduling (LS) approach based on a greedy strat-
egy for his original unit execution time/unit communication time (UET/UCT) model
[125]. The idea is that no processor should be left idle if there is an available task to
execute. Under the UET/UCT assumption, a task T can be scheduled on processor
Pi at time t if T has no immediate predecessors, or each immediate predecessor has
been scheduled to start on P_i at time t − 1 or on P_j (i ≠ j) at time t − 2. The length
of the schedule produced by this greedy strategy, w_LS, satisfies

w_LS ≤ (3 − (2/n)) w_opt − (1 − (1/n))

where w_opt is the length of the optimal schedule and n is the number of processors.
Hwang, Chow, Anger, and Lee first propose the Extended List Scheduling (ELS)
algorithm that applies the LS approach assuming no communication delays and then
adds the necessary communication delays to the initially produced schedule. They
show that the ELS method produces schedules that are bounded by

w_ELS ≤ (2 − (1/n)) w_opt + T_max · Σ_i edgeweight_i

where T_max is the maximum transmission cost between two processors. While this
seems like a loose bound, the authors show that specific problem instances will asymp-
totically achieve it. Thus, we see their motivation for the ETF method.
Due to its complexity, a number of formalisms are necessary to describe the ETF
algorithm. For tasks T and T', let v(T, T') be the number of messages sent on the edge
(T, T') in the task graph. Let τ(P_i, P_j) be the transmission cost of communication
between processors P_i and P_j. Then, define s(T), f(T), and p(T) as the start time,

finish time, and processor assigned for task T by the scheduling algorithm. With
these definitions, the time the last message for task T arrives at processor P is
r(T, P), which will be zero if T has no predecessors; otherwise

r(T, P) = max_{T'} {f(T') + v(T', T) · τ(p(T'), P)}

where the maximum is taken over the immediate predecessors T' of T.

The earliest starting time for an available task T is then

e_a(T) = max{CM, min_P {r(T, P)}}

where CM is the "current moment" of scheduling time and the minimum is taken over
the available processors P. ETF uses a greedy strategy and finds the minimum earliest
starting time (e_min) across all available tasks and available processors.
Due to arbitrary communication delays, if e_min > CM it is possible that another
processor may be freed between CM and e_min that would reduce the next earliest
starting time. To handle this possibility, ETF uses a second time variable called NM
for "next moment". NM is maintained as the next time at which a processor will
become free. If NM > e_min then no earlier starting time is possible and the schedule
is set. Otherwise the scheduling decision is postponed.
If two tasks come up with the same scheduled start time on the same processor,
the task chosen can be arbitrary or according to some priority list scheme. One
such priority list scheme that is particularly attractive is that given by critical path
analysis. However, the authors base their subsequent analysis of ETF on an arbitrary
resolution mechanism.
Careful analysis of the ETF algorithm reveals its time complexity to be O(nm²)
where n is the number of processors and m is the number of tasks. Furthermore,
they develop an upper bound on the schedule lengths produced by ETF of

w_ETF ≤ (2 − (1/n)) w_opt + C_max

where C_max is the maximum communication requirement of any path through the
task graph (assuming the T_max transmission cost). This is a significant improvement
over the bound produced by ELS. Furthermore, an even tighter bound is possible that
reduces the C_max component.
The Earliest Task First heuristic thus seems to be a very natural and effective
extension of the shared memory list scheduling methods to a distributed memory
environment.
SSCT - Scheduling with Communication Successors' Priority Tracking: Another
algorithm that addresses the scheduling of task graphs with nonzero communication
costs is due to Yin [142]. This method defines the combinatorial level of a
particular task node as the heaviest node and edge weighted path from that node to
any accessible leaf node (exit node). The node weights represent task execution times
and the edge weights reflect communication costs. The run priority of a particular
task is then defined as the combinatorial level of that task plus the heaviest edge from
any immediate predecessor (if one exists) to that node. Figure 2-17 gives a sample

task graph with communication costs. The combinatorial levels for T1, T2, T3, T4,
and T5 are 15, 14, 7, 8, and 3 respectively. The run priorities given in the same order
of tasks are 15, 14, 11, 12, and 6.

[Figure: five-task graph; nodes labeled task/execution time (T1/3, T2/3, T3/3, T4/2)
with edge labels giving communication costs.]

Figure 2-17. Task Graph with Communication Costs

The first objective is to execute the heaviest paths first. The second objective is
to assign tasks to processors in a manner that minimizes the amount of necessary
communication. The SSCT algorithm addresses these objectives by scheduling tasks
from the ready queue with the highest run priority first (ties are broken using the
dictionary ordering of the combinatorial level values of successor nodes). However,
assignment of tasks to processors is accomplished to minimize the amount of com-
munication necessary. That is, if task T4 is next to be scheduled, it should run on the
same processor as task T1 so that the communication cost of 4 on the edge from T1 to
T4 can be avoided (under the assumption that communication between tasks on the
same processor has zero overhead). Using this strategy, the SSCT algorithm produces
the schedule shown in Figure 2-18.

[Figure: Gantt chart over times 0-11; P1 runs T1 and T4, then T5 after an idle gap,
while P2 runs T2 and T3.]

Figure 2-18. Gantt Chart of Schedule with Communication Costs

Worst case complexity analysis of SSCT gives O(n²), which reduces to O(n) when
the precedence graph is a tree. Simulations on 20 randomly generated task graphs
with random edge weights were used to compare SSCT with an algorithm based
on topological levels that disregards communication. SSCT produced a better
schedule in the large majority of the cases and a worse one in only a small percentage.

MH - Mapping Heuristic: The Mapping Heuristic (MH) introduced by El-Rewini
and Lewis takes both communication and contention into account when addressing
the TCS problem [57]. The MH algorithm accepts a task graph description of the


parallel program and a description of the target machine topology and produces a
Gantt chart that shows the allocation of tasks to processor elements and their exe-
cution time frames. El-Rewini and Lewis illustrate the method by first presenting it
without considering contention and then separately presenting the extensions needed
to account for contention in the communication links.
The MH algorithm proceeds in three steps:

1. A priority is assigned to each task in a fashion resembling the critical path
method mentioned earlier. In particular, the heaviest weighted path from each
task node to a leaf (exit node) in the task graph is determined using both
execution times for node weights and communication time estimates for edge
weights. (At this point communication time estimates assume a fully connected
network with no contention). In the case of a priority tie, the task with the
largest number of immediate successors is given the higher priority. If this does
not break the tie, a random selection is used. An event list is initialized with
tasks that have no predecessors and are thus initially "ready" for execution.

2. As long as the event list is not empty, entries are taken from the event list.
These events will be either "task is ready" or "task is done".

For "task is ready" events, a processing element is selected such that the
task cannot complete any sooner on any other processing element. This
selection takes into account the machine topology and number of commu-
nication links that messages must be sent across prior to the initiation
of the task's computation. The task is then scheduled on the selected
processor and will signal a "task is done" event upon completion.
For "task is done" events, all immediate successors are notified and any
successor that becomes "ready" as a result of the notification adds a "task
is ready" event to the event list.

3. The second step is repeated until all of the tasks have been scheduled.

Notice that this algorithm does not treat communication cost uniformly. That is,
the priority is assigned assuming a fully connected network but the actual task alloca-
tion is accomplished based on a specific (and typically less connected) topology. Thus
multiple hops may be required that were not accounted for in the priority assignment.
Furthermore, scheduling communicating tasks on the same processor assumes that
communication will take zero time. This assumption is also not accounted for in the
priority assignment.
The extension to the MH algorithm that considers contention is accomplished by
maintaining a table of ongoing communication such that the best currently available
communication path can be determined. Specifically, the tables record for each pair
of processors the number of hops on the currently preferred path between these
processors, the preferred first communication link, and the current delay associated
with this path. Table maintenance is triggered by two additional event types for the
event list. The "sending message" event causes the contention tables to be updated

to reflect the additional links made busy by the message. The "message received"
event causes the tables to be updated to reflect the links that are now free. Notice
that this mechanism correlates to a virtual circuit type of communication. If the
underlying communication is packet switching, the view presented by this method is
too pessimistic. However, the authors opt against a lower level tracking mechanism
for reasons of efficiency.
El-Rewini and Lewis ran a number of simulations of their MH algorithm and vari-
ants of it on both actual and randomly generated task graphs. Among their more
interesting results is that priority assignments that neglect communication produced
better schedules for "computation-intensive" task structures and priority assign-
ments that consider communication produced better schedules for "communication-
intensive" task structures. The delineation between "computation-intensive" and
"communication-intensive" was a communication to computation ratio of one.
These three methods provide insight into how the task with communication
scheduling problem might be addressed for scheduling an unsymmetric-pattern mul-
tifrontal method elimination DAG on a distributed memory multiprocessor.
Chapter Conclusion: This chapter established the necessary background for this
research effort and summarized pertinent related efforts. To this end, the use of
LU factorization to solve a system of linear equations was described and the issues
of algorithm stability and error analysis addressed. Key sparse matrix concepts
were then defined. Details of Davis and Duff's unsymmetric-pattern multifrontal
method were provided as this entire research effort is based on this method. The
focus on factorizing sequences of sparse matrices was then justified by presenting
the mathematical formulation for two key types of problems that give rise to such
sequences. Specific applications were also identified within each type of problem.
Related efforts in parallel, distributed memory algorithms for both dense and sparse
matrix factorizations were summarized with specific results providing performance
goals for this effort. Finally, previous results in the key sub-topic of multiprocessor
scheduling for distributed memory environments were discussed.
With this background established, the next chapter will highlight key hardware
and software features of this effort's implementation platform.


A major thrust of this research effort will be to evaluate the performance of de-
rived algorithms through experimental measurements. This will be accomplished on
the nCUBE 2 distributed memory multiprocessor. The next two sections summa-
rize key aspects of the nCUBE 2's hardware architecture and software environment.
Information for these sections was obtained from the nCUBE 2 Systems Technical
Overview [112], the nCUBE 2 Programmer's Guide [110], the nCUBE 2 Processor
Manual [109], and the on-line ndoc documentation manager [113]. The third and
final section describes an effort to experimentally characterize the performance of the
nCUBE 2 multiprocessor.
3.1 Hardware Architecture
The nCUBE 2 supercomputer is a distributed memory, multiple instruction multi-
ple data (MIMD) multiprocessor based on a binary hypercube interconnection topol-
ogy. Configurations range from 8 to 8192 processors (3 to 13 dimension hypercubes).
Each processor is a custom, 64 bit, VLSI (very large scale integration) processor with
an integrated 64 bit floating point unit. Also integrated into the processor's design
is an error correcting memory management unit with a 39 bit data path and full
message routing circuitry. There are 14 bidirectional DMA (direct memory access)
channels with 13 used for node interconnections and one for input/output. Proces-
sor node configurations contain from 4 to 64 Mbytes of memory. Processor options
include both 20 MHz and 25 MHz versions of the custom processor. Performance
of the 25 MHz processor is rated at 15 VAX MIPS, 4.1 Mflops (single precision),
and 3.0 Mflops (double precision). Instruction execution times will vary however due
to hits/misses in the 128 byte instruction buffer, data alignments in memory, and
memory contention due to IO, refresh cycles, etc.
In addition to their use as node processors, additional custom 64 bit processor
chips are also used in a separate, but connected, input/output array. These IO pro-
cessors provide connections to disk arrays, tapes, graphics facilities, signal processors,
and networks. The nChannel board connects up to 128 channels from the processor
array to 16 serial channels. Each serial channel is controlled by a single IO proces-
sor. Eight of the serial channels are devoted to SCSI peripheral controllers that can
handle up to seven SCSI peripherals. Other serial channels provide ethernet, ana-
log/digital and digital/analog conversion, digital input and output, video, and tape
interfaces. There is also an nCUBE HIPPI (High Performance Parallel Interface)
board for interconnection to other supercomputers, UltraNet hubs, and mass storage
devices. An nVision Real-Time Graphics Subsystem provides an integrated graphics
and IO subsystem.

The processing capabilities are balanced by the very flexible, high performance
hypercube interconnection network. Wormhole routing is employed in the network so
that messages are not stored in intermediate nodes as they pass through the network.
In the absence of path conflicts, messages pass directly from source to destination
without delay so communication latencies are largely independent of path length in a
practical sense. The primary manner in which longer paths will affect performance is
the greater likelihood of path conflicts when the paths are longer. The Programmer's
Guide suggests a simple performance model for messages less than 100 bytes [110].
The model anticipates a constant overhead for a message send or receive of 100 μsecs
with a link delay time of 0 μsecs. (This simple model is refined experimentally later in
the chapter). This model indicates a relatively low communication to computation
ratio compared to other machines in the class [121] and is characteristic of a well
balanced design.
In more detail, the nCUBE 2 uses e-cube routing by default, progressing from
least to most significant bit in the destination address. At each node on the path,
the message is forwarded on the channel whose ID corresponds to the bit position of
the least significant bit that differs between the destination node ID and the current
node's ID. This scheme precludes the possibility of deadlocks if it is used exclusively.
While the default routing provides safe and reliable communication, it does not
efficiently handle multiple destinations for a single message. Therefore, a broad-
cast/multicast facility is also available that will send a message to all nodes in the
cube (or in a selected subcube) with a single transmission initiation and a single
message fetch from memory. A mask is used to define the destination cube/subcube.
If a bit in the mask is zero, the corresponding bit in the destination field must match
for all receivers. If a mask bit is one, then the corresponding destination bit need
not match and a fan-out duplication of the message occurs for each such bit setting.
During such fan-out activities all necessary forwarding channels must be open or else
the message will block. Hence, frequent point to point traffic on these channels can
livelock the broadcast/multicast operation.
In particular, the broadcast/multicast routing will send the message out on all
channels with ID's higher than the channel ID upon which the message was received
such that the channel ID for forwarding corresponds to a "don't care" bit setting in
the mask. For example, consider a broadcast from node 000 as shown in Figure 3-1.
More flexible routing is also possible using the hardware supported message for-
warding mechanisms. These use multiple message addressing headers with forwarding
bits to provide complete routing freedom.
Also, multiple messages from the same source and to the same destination may
be aggregated into a single transmission to reduce setup overhead.
The nCUBE 2 hardware architecture is complemented by a powerful and robust
software environment.
3.2 Software Environment
As the nCUBE 2 is a powerful second generation multiprocessor, the accompa-
nying nCUBE Parallel Software Environment is an extensive, robust, and refined
software environment. The environment supports a variety of programming models.

[Figure: broadcast from node 000 of a three-dimensional hypercube, showing the
fan-out over channels, beginning with channel 0.]

Figure 3-1. 3D Broadcast Example

The two primary models are (1) where node programs are launched directly from the
host shell, and (2) where a host program launches the node programs. Within node
programs, both the single program, multiple data (SPMD) and multiple program,
multiple data (MPMD) models are provided. SPMD uses a single program that is
loaded onto all the processors, but each instance of the program can have distinct
data with data dependent branching. By contrast, the MPMD model allows different
programs to be loaded onto distinct subsets of processors. The different programs
can interact via message passing with data dependent branching still provided.
The environment includes the nCX microkernel executive that runs on the node
and IO processors, the Host Interface Driver, a variety of standard programming lan-
guages, a powerful symbolic and assembly level debugger, extensive software libraries,
and a sophisticated profiler.
The nCX microkernel is the node/IO processor resident executive. nCX supports
process management, local memory management, message passing, multi-tasking,
UNIX System V Release 4 system calls, and POSIX signals.
The nCUBE Host Interface Driver provides for the connection of the node proces-
sors to the host/front-end computer, which is usually a high performance workstation.
The nCUBE Parallel Software Environment supports FORTRAN 77, C, and C++
together with the Linda suite of hardware independent extensions for parallel pro-
cessing and message processing.
A powerful symbolic and assembly level debugger, ndb, provides a full range of
functions including:

* examination of variables, the stack, and processor registers

* multiple format data displays

* address map and source code displays

* the use of breakpoints

* single stepping through the code

* command and response logging, and

* other standard debugging operations

Extensive libraries provide for efficient usage of the features of the nCUBE 2 su-
percomputer. An extended Run-Time Library provides basic subcube management
to include routines to identify a program's position in the subcube, its process ID, the
host's process ID, and the dimension of the subcube. Also in this library are inter-
processor communication routines such as an asynchronous message send (nwrite),
a blocking message receive (nread), and routines to check whether any messages, or
messages in a selected type range, have been received (ntest and range, respectively).
These latter routines provide for the development of a nonblocking message receive
capability.
Processor allocation routines are also included in this library and can be used to
select processors in a subcube and launch programs on them.
The nCUBE Parallelization Library includes routines for 2 and 3 dimension fast
Fourier transforms and matrix transposition; the min, max, and sum determination
of either scalar or vector data items of various types; processor sequencing; grid-
hypercube mappings; general purpose data allocation; and general purpose subcube
operations.
The nCUBE Math Libraries contain 375 routines callable from C or FORTRAN
and are based on the Seismic Subroutine Standard (SSS) and Basic Linear Algebra
Standard (BLAS). These are uniprocessor routines written in assembly language and
highly tuned to the nCUBE 2's node processor. The IMSL libraries of FORTRAN
subroutines are also available.
The nCUBE 2's profiler (PM) is composed of three tools: xtool, ctool, and etool.
The xtool program profiles execution by functions on the various node processors by
using a time interval sampling technique. The ctool emphasizes analysis of commu-
nication and for each node provides:

* time spent in calculation, communication, and IO;

* number of calls to each communication routine;

* total number of all communication routine calls;

* number of errors returned by communication routines;

* distribution of sent and received message lengths; and

* number of objects broadcast.

The etool facility provides for the profiling of user-defined events to include the time
of each event, the type of event, and the value of a specified variable at the event's
occurrence. Profiling results are displayed graphically for easy analysis. Profiling can
be done for the entire program by setting an environment variable or can be limited
to a specific part of the program by insertion of profiling routines into the code.
All together these facilities provide a powerful and robust environment for software
development and performance-oriented execution.
3.3 Performance Characterization of the nCUBE 2
In order to achieve maximal performance on a specific multiprocessor, both the
strengths and weakness of the target hardware must be identified, understood, and
taken into account during algorithm development. This section summarizes a variety
of experiments designed to quantify key performance characteristics of the nCUBE
2 architecture. These results will be used later in Chapter 4 to define a simulation
model that will evaluate the achievable parallel performance of the unsymmetric-
pattern multifrontal method. The results will also be used to justify subsequent
algorithm design decisions in Chapters 5 and 6.
Specific areas of investigation include the rate of floating point operations, point-
to-point message latencies, and message broadcast/multicast latencies.
Floating Point Operations: The design of the nCUBE 2 processor places a sig-
nificant emphasis on fast floating point operations and even includes a combined
multiply-add floating point instruction. According to the nCUBE 2 Processor Man-
ual [109], a typical floating point operation (e.g., add, subtract, multiply, compare)
takes 300-350 nanoseconds with more complex functions (like divide, square root,
and remainder) taking 300-3450 nanoseconds. Additional timing variability occurs
due to the variety of addressing modes (26), hits/misses in the 128 byte instruction
cache, memory contention (with DMA activities, DRAM refresh cycles, and instruc-
tion fetches), and data misalignments.
In order to determine a typical floating point operations per second (flops) rate,
two experiments were constructed. The first used a single loop that performed floating
point divides (as would be typical of the calculation of multipliers in LU factoriza-
tion). The second experiment used two nested loops that performed a multiply-add
operation as would be used in a typical pivot step of LU factorization. This second
experiment was done with both possible loop orderings in an attempt to determine
the effect of unit and nonunit memory strides. The experiments were coded in C and
measurements taken with the etool user-defined event profiler.
The first experiment, using a single loop of divisions, produced results that fit
well to a linear relationship. Specifically, for single precision divisions the equation is

time = 4.6 + 2.745n

where n is the number of divisions and time is in microseconds. Likewise, for double
precisions divides the relationship is

time = 4.93 + 3.444n.

As arrays are stored by rows in the C language, unit stride is achieved with the
row index in the outer loop. When this was done in the second experiment, with its
nested loops and multiply-add operation, the performance was approximately 5%
worse than was seen with the column index in the outer loop. This was a bit
surprising given the auto-incrementing possible in the nCUBE 2 processor. The
specific results are given below with all times in microseconds and i as the row index
and j as the column index.

Unit Stride (Single precision): time = 17.24 + 1.56i + 4.88ij
Unit Stride (Double precision): time = 16.60 + 2.71i + 4.72ij

Non-Unit Stride (Single precision): time = 17.41 + 2.05j + 4.635ij
Non-Unit Stride (Double precision): time = […] + 2.78j + 4.629ij
More efficient routines to perform these operations are available in the BLAS
(Basic Linear Algebra Subroutines) library resident on the system, but these will be
explored later. As these are highly tuned assembly language routines, their perfor-
mance is expected to be significantly better than these initial experiments.
Point to Point Message Latency: Passing of contributions between frontal ma-
trices will likely be done using point to point message passing. Hence, an under-
standing of the latencies associated with such activities will be important. As the
nCUBE 2 uses wormhole routing (where messages passing through the network are
not stored at intermediate nodes), very little dependency on path length is expected.
However, message length is expected to be significant. Furthermore, we need to de-
termine two distinct latencies. The first is the message transmission latency. This is
the time it takes for a message of given length to be transmitted and received by a
particular set of processors of given distance apart. The second time is the message
initiation latency, which is how much time a sending processor takes to initiate the
transmission before it can continue.
For the first experiment, dealing with the message transmission latency, a test
program was written in C and profiled using etool. Within the program, processor
node 0 sends a message of a specified byte length to the most distant node in the
executing cube/subcube. It then does a blocking receive waiting for an equal length
acknowledgment message from the other node. The distant node performs a blocking
receive and then sends the acknowledgment message. The test was repeated 1000
times for each configuration. For each configuration, message lengths were set at 10,
100, 500, 1000, 5000, and 10000 bytes. Cube/subcube dimensions varied from 1 to 5.
A linear regression run on the resulting timings revealed a very strong linear
relationship with both message length and cube/subcube dimension (with message
length providing a much larger contribution however). The resulting linear equation
(given in microseconds with message length l in bytes and d being the dimension of
the subcube) is
time = 170.32 + 4.54d + 0.573l.

The second experiment was designed to determine the message initiation latency.
Again using a C program and etool, a set of experiments using the same configurations
of message length and cube/subcube size as were used in the first experiment was
conducted. This time, however, node 0 just sent one message after another to the
distant node without waiting for an acknowledgment. A total of 1000 messages were
sent in each configuration.
For this second experiment, cube/subcube dimension had no influence on the
initiation latency. There is a strong linear correlation with message length which is
expressed by the following equation (again given in microseconds with message length
l in bytes):
time = 79.94 + 0.4498l.
The regression for this second experiment, however, did not use the results for the 10
byte message configurations, as messages of 10 to 100 bytes had very similar
initiation times. That is, the performance curve flattens in this region.
For both of the point to point message experiments, the standard nwrite and
nread functions were used. These functions copy the message from a user buffer to a
separate communication buffer (and vice-versa for nread), with the actual message
transmission occurring from the communication buffer. Alternative functions are
available that use only the communication buffers, but these were not tested. The
message broadcast tests described next also use the separate user and communication
buffers.
Message Broadcast/Multicast Latencies: From the description of the message
broadcast/multicast capabilities of the nCUBE 2 provided in the nCUBE 2 Processor
Manual [109], definite performance benefits are anticipated when using the broad-
cast/multicast mechanisms as opposed to repeated point to point message passing.
Such data distribution will be required in the dense matrix kernels that will factorize
distinct frontal matrices in parallel.
To this end, another set of experiments was developed to quantify both the full
message broadcast latency and the broadcast initiation latency. Again, C programs
were written and run with the etool profiler. Configurations used message lengths of
100, 500, 1000, 5000, and 10000 bytes and cubes of dimensions 1 to 5.
In the full message broadcast latency test, node 0 initiates message broadcasts
to all nodes in the active cube and then waits for an equal-size point-to-point
acknowledgment message from the most distant node in the cube. (The description of
the broadcast function in the nCUBE 2 Processor Manual, discussed in the preceding
chapter, indicates that this most distant node is the last to receive the
broadcast.) All nodes receive the broadcast, and the most distant node generates the
acknowledgment. With the point-to-point message latency well defined by the earlier
experiments, its contribution to the overall latency is subtracted to obtain
the latency due only to the broadcast.
The results show strong correlations with both message length and cube dimension
together with a very significant product term of message length times cube dimension.
The resulting formula (given in microseconds with message length (l) in bytes and d

the subcube's dimension) is

time = 195.22 + 0.122l + 50.02d + 0.45ld.

Message broadcast initiation latencies were also investigated using a C program
that had node 0 initiate 1000 broadcasts back to back and then wait for a single ac-
knowledgment from the most distant node. (Again the well defined acknowledgment
message time was subtracted off the overall latency). The same configurations as in
the first experiment were used.
The initiation latency results also correlated with message length, cube dimension,
and the product of the two. The resulting equation (given in microseconds) is

time = 1i,. II + 0.0624l + 54.09d + 0.45ld.

Comparison of the point-to-point and broadcast/multicast results indicates that,
for cubes of dimension greater than one, the broadcast/multicast mechanism is to be
preferred over successive point-to-point transmissions.
Chapter Summary: In this chapter, the implementation platform of this re-
search effort has been discussed. Specifically, this platform is the nCUBE 2 dis-
tributed memory multiprocessor. First, the features of the hardware architecture
were described. Then, an overview of the key software features was provided. In the
last section, a series of experiments were designed and conducted to characterize the
performance of the nCUBE 2. The resulting characterization will be used in the next
chapter to define a realistic simulation model to evaluate the achievable parallelism
in the unsymmetric-pattern multifrontal method for the LU factorization of sparse
matrices. This characterization will also be used in Chapters 5 and 6 when a parallel
version of the unsymmetric-pattern multifrontal method is designed and implemented
on the nCUBE 2.


In this chapter, the parallelism available in the unsymmetric-pattern multifrontal
method [34, 35] is investigated. The assembly DAG is the principal data structure
of this method and describes the data dependencies that exist between the various
dense submatrices, called frontal matrices. This assembly DAG is also significant as
it exposes parallelism between the frontal matrices. Parallelism can also be
exploited within frontal matrices, as the factorization of these dense submatrices
consists of many regular and independent computations. The models developed in
this chapter will study both sources of parallelism as used in different combinations
and at different levels of granularity.
The first objective is to determine the amount of theoretical parallelism available.
This is first done by a series of analytical models that assume an unbounded number
of available processors. Then, the analytical models are expanded into simulation
models where a bounded number of processors are assumed. These simulation models
are used to determine the minimum number of processors needed to actually achieve
the theoretical speed-ups.
With the theoretical parallelism determined by the models of the first objective,
the second objective is to determine how much of the theoretical parallelism can
actually be achieved on existing multiprocessors. This objective is addressed by
revising the simulation models to correspond to the performance characteristics of
the nCUBE 2 multiprocessor (as determined in Chapter 3).
4.1 Unbounded Parallelism Models
The first objective in the investigation of parallelism within the unsymmetric-
pattern multifrontal method is to assess the availability of theoretical parallelism
inherent in the assembly DAGs produced by the method. This is done using analytical
models that assume an unbounded number of available processors.
An underlying multiple instruction, multiple data (MIMD) shared memory (SM)
model of computation is assumed, although several of the models are also realizable
on a single instruction, multiple data (SIMD) model. The shared memory is based on
a parallel random access machine (PRAM) model, which is initially assumed to allow
concurrent reads and concurrent writes with concurrently written results summed
(CRCW-SUM). Later models will require only a concurrent read, exclusive write
(CREW) shared memory. Instruction counts are abstracted to only the floating point
operations (flops) required for the assembly and numerical factorization processes.
The analysis of the unbounded parallelism models is based on the representa-
tions of assembly DAGs built from traces produced by a sequential version of the
unsymmetric-pattern multifrontal method on matrices from real applications [34, 35].

These assembly DAGs are analyzed and then used as input for the five unbounded
parallelism models. The resulting speed-up, processor set, and utilization estimates
are then analyzed.
4.1.1 Assembly DAG Analysis
A sound characterization of the assembly DAG is the cornerstone of the un-
bounded models and will be fundamental to fully understanding the results of the
bounded models. In order to obtain this characterization, the assembly DAGs are
topologically sorted and partitioned into a data structure that identifies all the nodes
in a given topological level. Node weights are assigned in accordance with the
model definitions given later. Using these node weights, the heaviest weighted
path through the assembly DAG is determined for each model. The weight of this
path provides the lower bound on the parallel execution time.
Other characteristics of the assembly DAGs are also determined. In particular, the
number of nodes and edges, the depth, and the number of leaf nodes are determined.
Since all the assembly DAGs tend to be significantly wider at their leaf level (Level
0), the next widest level is also determined. The total weight of all nodes is divided
by the number of nodes to obtain an average node weight. The depth is divided by
number of nodes to give a depth ratio. Low values will correspond to DAGs that
have the most significant coarse-grained parallelism. Various methods of exploiting the
available parallelism are investigated with the different models defined next.
4.1.2 Model Definitions
A model of the sequential version of the algorithm is provided as reference and will
be used to establish the speed-up calculations. In this and all the subsequent models
there are two levels. The first (higher) level consists of the outer sum that counts
processing steps for the frontal matrices (nodes) that make up the overall sparse
matrix. This summation will either be across all nodes of the assembly DAG, which
represents a sequential processing of the frontal matrices, or along the heaviest path,
which represents the critical path for the processing. The heaviest path summation is
used to characterize parallelism exploited between the nodes in the assembly DAG.
With the assumption of an unbounded number of processors, processing of nodes
with dependencies not on the heaviest path (critical path) can be done during the
critical path processing and not affect the overall completion time.
The second level of the models describes the processing and parallelism of factor-
izing a particular frontal matrix (i.e., a node in the assembly DAG). Each of these
descriptions are presented in four terms. The first term describes the processing
needed to complete the assembly of contribution blocks from predecessors in the as-
sembly DAG. The second term characterizes the processing needed for the selection
of a pivot for numerical stability. The third term describes the calculation of multi-
pliers, which results in a column of the lower triangular matrix L. Finally, the fourth
term defines the updating of the active submatrix, which is essentially the carrying
out of the necessary row operations needed to reduce the matrix.
These models are very similar to an earlier effort by Duff and Johnsson [51] that
focused on the assembly trees produced by symmetric-pattern, multifrontal methods.

Each of the models use the following terms:

Aj   the number of matrix entries assembled into the frontal matrix represented by node j.

Sj   the number of children assembling entries into node j.

Pj   the number of pivots by which the frontal matrix at node j is reduced.

cj   the number of columns in the node j frontal matrix.

rj   the number of rows in the node j frontal matrix.

With these definitions, the model for the sequential version of the algorithm is defined
as follows:

Σ_{all nodes j} [Aj + Σ_{i=1}^{Pj} (rj - i + 1) + Σ_{i=1}^{Pj} (rj - i) + 2Σ_{i=1}^{Pj} (rj - i)(cj - i)]

Model 0 (Concurrency Between Frontal Matrices Only): Model 0 depicts a ver-
sion of the algorithm that only exploits parallelism across nodes in the assembly
DAG. Each frontal matrix is factorized sequentially. This model requires an under-
lying MIMD architecture. The formula describing this model is:
Σ_{heaviest path} [Aj + Σ_{i=1}^{Pj} (rj - i + 1) + Σ_{i=1}^{Pj} (rj - i) + 2Σ_{i=1}^{Pj} (rj - i)(cj - i)]

In the case of a dense matrix, this model degenerates to a single, one-processor task
that requires O(n^3) time.
Model 1 (Block Concurrency Within Frontal Matrices Only): Model 1 assumes
that nodes of the assembly DAG are processed sequentially in an order that preserves
the dependencies expressed by the assembly DAG. Within each node, the frontal ma-
trix is factorized with a block level of parallelism. Assembly, multiplier computation,
and updates to the active submatrix are done for all rows in parallel moving sequen-
tially through the columns. For the assembly of a row that includes contributions
from more than one child in the assembly DAG, the assumption of a CRCW-SUM
shared memory insures correctness. In this model selection of numerical pivots is
assumed to be sequential.
The methods described by this model and the next lend themselves easily to either
a SIMD or a MIMD architecture.

Σ_{all nodes} [cj + Σ_{i=1}^{Pj} (rj - i + 1) + Pj + 2Σ_{i=1}^{Pj} (cj - i)]

In the case of a dense matrix, this model requires n processors and takes O(n^2) time.
Model 2 (Full Concurrency Within Frontal Matrices Only): Concurrency with-
in a frontal matrix is extended by Model 2. In this model, very fine grain parallelism is
exploited. Assembly is done for each element in parallel. However, in order to main-
tain consistency with the earlier work of Johnsson and Duff [51], a CREW memory

is used by this model and the assembly of contributions to an entry from multiple
children is done via a parallel prefix computation using the associative operator of
addition. Numerical pivot selection is handled in this model with another parallel
prefix computation, this time using the maximum function as the associative operator.
Multiplier computation and updating of the active matrix are done with parallelism
at the matrix entry level, which is easily realizable with the CREW memory model.
This model is also oriented to either a SIMD or MIMD architecture.

Σ_{all nodes} [⌈log2 Sj⌉ + Σ_{i=1}^{Pj} ⌈log2(rj - i + 1)⌉ + Pj + 2Pj]

In the case of a dense matrix, this model requires n^2 processors and takes O(n log n) time.
Model 3 (Block Concurrency Within and Between Frontal Matrices): The
block-oriented, node level parallelism of Model 1 is augmented by parallelism across
nodes of the assembly DAG in Model 3. This is done by changing the outer summation
to be over the heaviest path instead of across all nodes. With this model, a MIMD
architecture is required.
Σ_{heaviest path} [cj + Σ_{i=1}^{Pj} (rj - i + 1) + Pj + 2Σ_{i=1}^{Pj} (cj - i)]

With a dense matrix, this model is equivalent to Model 1.
Model 4 (Full Concurrency Within and Between Frontal Matrices): The suite of
models is completed by extending the full node level concurrency of Model 2 to in-
clude parallelism across nodes in the assembly DAG. This is done in the same fashion
as with Model 3 and a MIMD architecture is likewise assumed.
Σ_{heaviest path} [⌈log2 Sj⌉ + Σ_{i=1}^{Pj} ⌈log2(rj - i + 1)⌉ + Pj + 2Pj]

With a dense matrix, this model is equivalent to Model 2.
4.1.3 Matrices Analyzed
A set of five matrices was selected for this study. They represent a cross section of
the type of assembly DAGs that are generated. The matrices are from the Harwell-
Boeing set and each matrix comes from a real application [49]. Table 4-1 briefly
describes these matrices:
4.1.4 Assembly DAG Analysis Results
The first objective of this part of the effort was to characterize the assembly
DAGs of the test matrices. Table 4-2 below provides these characterizations that
were produced with the analysis program written for the effort. The DEPTH entries
are the number of levels in the assembly DAG as determined by a topological sort. The
width of the DAG is presented in two entries. LEVEL 0 refers to the leaf nodes
of the assembly DAG and WIDTH gives the widest level other than the leaf level
(Level 0). It is important to remember that the assembly DAGs are typically NOT a

Table 4-1. Test Matrix Descriptions

MATRIX     ORDER  APPLICATION
MAHINDAS    1258  economic modeling
GEMAT11     4929  optimal power flow problem
GRE_1107    1107  computer system simulation
LNS_3937    3937  compressible fluid flow
SHERMAN5    3312  oil reservoir, implicit black oil model

single connected component and that many small, distinct connected components
frequently show up on these lowest levels of the DAG's topological order. The DEPTH
RATIO is the DEPTH divided by the number of nodes in the assembly DAG. This
statistic will be smaller for the short and wide DAGs that have the greatest inherent
parallelism.
Table 4-2. Assembly DAG Characterizations

             MAHINDAS  GEMAT11  GRE_1107  LNS_3937  SHERMAN5
# NODES           238     1015       511      1636       664
# EDGES           507    1,219     1,366     3,724     1,496
DEPTH              68       25        40       296        46
LEVEL 0            86      530       233       565       359
WIDTH              55      189       106       248        99
AVG WEIGHT      1,887    1,195     9,998    67,???    42,093
DEPTH RATIO      0.29     0.02      0.08      0.18      0.07

The statistics presented in Table 4-2 reveal that the assembly DAGs tend to be
short and wide (which is the desired shape for parallelism). However, they do not
provide the full picture. The true nature of the assembly DAGs was better revealed
with an abstraction of the DAG provided by the analysis program. This abstraction
illustrates the topologically ordered assembly DAG with an example provided for the
GEMAT11 matrix in Figure 4-1. For each level the number of elements in that level
is provided as well as the minimum, average, and maximum node weights. (The node
weights are expressed as common logs to save space and since their magnitude is all
that is of interest). These abstractions revealed that the assembly DAGs were very
bottom heavy in terms of the number of nodes but that the average node size tended
to increase dramatically in the upper levels. The assembly DAG for the GEMAT11
matrix had the nicest shape for parallelism (as -.-' 1. ., 1 by its depth ratio). The
other assembly DAGs were taller and relatively more narrow at their base.
Speed-Ups: The most interesting results of the unbounded models are the speed-
ups available via parallelism. These results are presented in Table 4-3.

L(Level): Count (Min,Avg,Max)

L( 25): 1 (5.3, 5.3, 5.3) *
L( 24): 1 (4.8, 4.8, 4.8) *
L( 23): 1 (3.7, 3.7, 3.7) *
L( 22): 1 (3.7, 3.7, 3.7) *
L( 21): 1 (3.7, 3.7, 3.7) *
L( 20): 1 (4.7, 4.7, 4.7)
L( 19): 1 (3.6, 3.6, 3.6)
L( 18): 2 (3.6, 4.1, 4.3)
L( 17): 2 (3.4, 3.8, 4.0)
L( 16): 2 (3.3, 4.0, 4.2)
L( 15): 2 (3.0, 4.0, 4.2) *
L( 14): 2 (3.0, 3.4, 3.7) *
L( 13): 3 (3.0, 3.4, 3.6) *
L( 12): 4 (2.6, 3.7, 4.2) *
L( 11): 6 (3.0, 3.5, 3.9) *
L( 10): 4 (2.1, 3.9, 4.5) *
L( 9): 4 (3.0, 3.3, 3.7)
L( 8): 7 (2.6, 3.5, 3.9)
L( 7): 10 (2.6, 3.4, 3.9)
L( 6): 16 (2.5, 3.6, 4.5) **
L( 5): 24 (2.2, 3.5, 4.2) **
L( 4): 35 (2.2, 3.2, 4.2) ***
L( 3): 58 (1.6, 3.0, 3.8) *****
L( 2): 108 (1.6, 2.9, 3.6) *******
L( 1): 189 (1.6, 2.8, 3.5) *************
L( 0): 530 (0.7, 2.5, 3.3) *************************************

Figure 4-1. Assembly DAG Abstraction for GEMAT11

The speed-ups for Model 0 are disappointing but not unexpected after what was
seen in the previous section on the shape of the assembly DAGs. However, significant
speed-ups were seen as parallelism was introduced within the nodes (Models 1 and
2) and within and between nodes (Models 3 and 4). It is interesting to note that
a significant synergism is taking place when parallelism is utilized both within and
between nodes. One's intuition might suggest that both levels of parallelism would
produce a resulting speed-up that is close to the product of the speed-ups achieved by
using levels of parallelism separately. However, the speed-ups obtained are actually
much better than this expectation. This is consistent with the results of the earlier
study by Duff and Johnsson [51].

Table 4-3. Unbounded Parallelism Speed-Up Results

           MAHINDAS   GEMAT11  GRE_1107   LNS_3937  SHERMAN5
MODEL 0        1.17      2.88      1.26       1.30      1.74
MODEL 1       10.50      9.31     41.21     103.77     78.36
MODEL 2      114.25     39.35    575.21   3,490.76  1,877.05
MODEL 3       15.96     68.79     70.15     167.94    16?.??
MODEL 4      237.74    743.80  1,847.68  11,1??.40  1,?8?.42

Another interesting, but expected, observation is that the speed-ups for models
0, 3, and 4 (that use parallelism between frontal matrices) correlate strongly with the
depth ratio presented earlier. Likewise, models 1, 2, 3, and 4 correlate strongly with
average node weight. The underlying principle is that concurrency within the node
is most exploitable by larger frontal matrices and that concurrency across nodes is
most exploitable with short and wide assembly DAG structures.
Processors Sets and Utilization: Also determined by the unbounded models was
an upper bound on the number of processors that could be employed on a given
matrix and the associated processor utilization. In particular, Model 0 assumed that
any frontal matrix would have only one processor allocated to it. For the block level
concurrency of Models 1 and 3, a frontal matrix would have a processor set equal
to the number of rows in the frontal matrix. The full concurrency of Models 2 and
4 would use processor sets equal to the row size times column size of the frontal
matrix. The maximum processor usage would then be the largest processor set for
any particular frontal matrix in Models 1 and 2 and the largest sum of processors
for any one level in the assembly DAG for Models 0, 3, and 4. This strategy results
from the use of concurrency across nodes in Models 0, 3, and 4.
Utilizations were then calculated by dividing total sequential time by the product
of total parallel time times the number of processors used. The results by matrix
and model are provided in Table 4-4. The top value in each entry is the size of the
processor set and the lower value is the utilization.
In all cases the size of the processor sets seem quite high and the utilization quite
low, but these estimates are very crude upper bounds and it is likely that significantly
better results are realizable.
4.1.5 Conclusions and Observations
A significant amount of theoretical parallelism is achievable with the unsymmetric-
pattern multifrontal method. However, very little of the parallelism comes directly
from the assembly DAG as is evidenced from the Model 0 results. Yet a synergism
occurs when the parallelism from the assembly DAG is combined with the concur-
rency available within the factorizing of particular frontal matrices. The resulting
speed-ups, as seen in Models 3 and 4, are very promising.

Table 4-4. Processor Set and Utilization Results

           MAHINDAS   GEMAT11  GRE_1107  LNS_3937  SHERMAN5
MODEL 0          86       530       233       565       359
              1.36%     0.54%     0.54%     0.23%     0.48%
MODEL 1          52        68       126       258       189
             20.19%    13.69%    32.71%    40.22%    41.46%
MODEL 2           ?      4624     19782    107874     47060
                  ?     0.85%     2.91%     3.24%     3.99%
MODEL 3         304      3788      1161      2961      3827
              5.25%     1.82%     6.04%     5.67%     4.1?%
MODEL 4        4161     30127     19782    17,???         ?
              5.71%     2.47%     9.34%     6.??%         ?

Looking beyond the speed-ups, the large processor sets and low utilizations are
of both concern and interest. In particular, the very bottom-heavy assembly DAGs
suggest that better distributions of the processing are possible. Such processing
distributions could be achieved by appropriate task scheduling on bounded sets of
processors. The next section explores these possibilities using trace-driven simulations.
4.2 Bounded Parallelism Models
The results of the unbounded parallelism models illustrated that significant speed-
ups are possible with parallel implementations of the unsymmetric-pattern multi-
frontal method. However, the efficient use of processor sets and the speed-ups achiev-
able on smaller processor sets are still open questions. This section addresses these
issues using the results of trace-driven simulations run on the five models defined
in the previous section. With each model sixteen different processor sets and three
different scheduling schemes were used. The processor sets were powers of 2, from 2^1
to 2^16 (65,536). The scheduling schemes correspond to orderings of the work queue
and are described in a subsequent paragraph.
The results of the bounded parallelism models will indicate that the speed-ups
seen in the unbounded models are achievable with reasonable-size processor sets and
with significant improvements in efficiency.
This section is a summary of the bounded parallelism models. For a full descrip-
tion of the models and their revisions, see the technical report by Hadfield and Davis.
4.2.1 Initial Models
The initial bounded parallelism models follow directly from the unbounded par-
allelism models. The critical difference is that a limited processor set is assumed so
tasks that are ready for execution may have to wait for available processors. This

can be appropriately modeled via a trace-driven simulation run against the represen-
tation of the assembly DAGs obtained from the traces produced by the sequential
version of the unsymmetric-pattern multifrontal method [34, 35]. Critical issues with
the simulation will be how to allocate processors when the number available is less
than that required and how to order the work that is ready to be done.
4.2.2 Trace-Driven Simulation
The trace-driven simulations used to implement the bounded parallelism models
were accomplished with a Pascal program that was built as an extension to the
analysis program used for the unbounded parallelism models. The simulation uses
the topologically sorted representation of the assembly DAG as input. All the nodes
(frontal matrices) on the leaf level (Level 0) are initially put into a work queue.
Work is scheduled from this queue based on processor availability. The amount
of work and required processors for each node is determined based on the specific
model being simulated. These models are initially those defined for the unbounded
parallelism models. When the model calls for a sequential ordering of the nodes,
only one node may be executing at any particular time, otherwise multiple nodes
may execute concurrently based on processor availability and the allocation strategy
in use. Upon completion of a node's execution, the edges to dependent nodes are
removed. When all such edges to a particular node have been removed, that node
is put into the work queue as it has become available for execution. The simulation
continues until all nodes have completed execution.
Speed-up and processor utilization are then calculated per the following formulas:

Speedup = SequentialTime / Time_finished

Utilization = Σ_j (Time_j × ProcessorsUsed_j) / (Time_finished × ProcessorSetSize)
4.2.3 Processor Allocations
Allocation of processors to the assembly and factorizing of a frontal matrix is
a critical issue for the realization of the algorithm on a bounded set of processors.
As was seen in the results of the initial models, the allocation scheme can greatly
affect the performance of the algorithm. Since the five unbounded models use three
different approaches to the factorization of a particular frontal matrix, there are
three distinct allocation schemes.
The allocation scheme required by Model 0 is trivial since each frontal matrix in
this model is allocated only a single processor.
Models 1 and 3 use a block concurrency within a frontal matrix. As this block
concurrency is row-oriented, a processor set equal to the row size is required. The
difficulty arises when there are processor sets available but they are smaller than that
required (in fact the total available number of processors can be less than the row
size). In this eventuality the total amount of work is evenly distributed across the
available processors. This is a fairly crude allocation scheme but is not too unrealistic
given the nature of the processing within a frontal matrix.

This strategy was initially followed blindly and led to some terrible scheduling,
as large frontal matrices were allocated relatively small processor sets. This greatly
extended their execution times and delayed the subsequent frontal matrices that
were dependent upon them. Revisions to this allocation strategy were to (a) check
whether waiting for more available processors would improve a frontal matrix's
execution time and schedule accordingly, and (b) wait for more processors if the
required amount is not available. Both strategies corrected the problem and produced
surprisingly similar results. Strategy (b) will be reported in the results of this discussion.
Models 2 and 4 use full concurrency within each frontal matrix and require
a number of processors that is equal to the number of entries in the frontal matrix
(row_size times col_size: rj * cj). A more sophisticated allocation scheme is thus
required. Prior to formally defining this scheme, we recall the definition of the time
complexity for a frontal matrix using this model:

work = ⌈log2 Sj⌉ + Σ_{i=1}^{Pj} ⌈log2(rj - i + 1)⌉ + Pj + 2Pj

Using this definition and a similar assumption on the ability to distribute work on
smaller processor sets, the allocation scheme is formally defined by the algorithm in
Figure 4-2.

if processors_available > (col_size * row_size) then
    Schedule work on (col_size * row_size) processors
else if processors_available > row_size then
    Schedule (work + 3 * pivots * col_size) on (row_size) processors
else if processors_total > row_size then
    Wait for more available processors
else
    Schedule (work * col_size * row_size) / processors_available
        on processors_available processors

Figure 4-2. Processor Allocation For Fine Grain Parallelism

Notice that the deepest nested else block provides a fairly crude over-approxi-
mation of the amount of work to do. However, this case is only used for the small
total processor set sizes, which are not the target processor sets for the parallelism
of Models 2 and 4. Thus, this over-estimate is reasonable.

Furthermore, this allocation strategy tended to produce a very step-like perfor-
mance function as processor set sizes grew to meet the various cut-off points. A
smoother performance curve was achieved by using a technique called vertical parti-
tioning, which subdivides a frontal matrix task into a series of subtasks each requiring
substantially fewer processors. This produced much more efficient scheduling and
significant performance improvements. Results for both the initial allocation scheme
and vertical partitioning will be reported.
4.2.4 Scheduling Methods
There are three scheduling methods employed on the work queue. These methods
correspond to how the work ready to be executed is ordered within the work queue.
The three corresponding orderings are first-in, first-out (FIFO); largest calculation
first; and heaviest path first.
The FIFO method is self-explanatory and is included because of its simplicity.
The largest calculation first order uses the node weight estimate of the work to
be done and schedules the largest such nodes first. This method is designed as an
approximation of the heaviest path first method. The largest calculation first requires
a priority queue implementation based on a set of priorities that is quite variable.
As such, there would be some significant penalties to address in implementing this method.
The last method, heaviest path first, uses the heaviest path determined by the
analysis portion of the software. Any node on this path would be the next scheduled.
Other nodes are put in a FIFO ordering. Notice that since the heaviest path is
essentially a path through the dependency chart, only one node at most from the
heaviest path will ever be in the work queue at any one time.
Very little difference was found using these scheduling methods, and they will not
be discussed further in this document (for a detailed description of these results, see
the aforementioned technical report).
4.2.5 Simulation Results
The results of the simulations are summarized for each model. Whenever the
results for all five matrices are presented, the following line types will be used to
represent the different matrices:

-*-*-*- GEMAT11
-+-+-+- GRE_1107
-0-0-0- LNS_3937

All the results presented in this section are based on the heaviest path first scheduling method.

Model 0 (Concurrency Between Frontal Matrices Only): Model 0 takes advan-
tage of concurrency only between frontal matrices. The speed-up and utilization re-
sults from Model 0 are shown in Figures 4-3 and 4-4, respectively. These indicate that
the speed-ups seen in the unbounded models are achievable with significantly fewer
processors and corresponding higher utilizations. Table 4-5 compares the number of
processors used in the bounded versus the unbounded models to achieve maximum speed-up.






[Plot: speed-up vs. Log2 of Number of Processors]

Figure 4-3. Model 0 Speed-Up Results

Table 4-5. Required Processors Comparison

MATRIX     BOUNDED  UNBOUNDED
GEMAT11         64        530
GRE_1107         8        233
LNS_3937         8        565
SHERMAN5         8        359

Model 1 (Block Concurrency Within Frontal Matrices Only): Model 1 only ta-
kes advantage of concurrency within frontal matrices and does so in a block-oriented
fashion. The speed-up and utilization results are shown in Figures 4-5 and 4-6,



[Plot: utilization (%) vs. Log2 of Number of Processors]

Figure 4-4. Model 0 Utilization Results

respectively. The method employed by this model seems well suited for processor
sets up to about 256 processors. After that point there is no further advantage to
additional processors for any of the test matrices.

Model 2 (Full Concurrency Within Frontal Matrices Only): The block concur-
rency of Model 1 is extended to full concurrency in Model 2 but is still restricted
to concurrency only within the frontal matrix. The speed-up and utilization results
for Model 2 using both the initial allocation scheme and vertical partitioning are
shown in Figures 4-7, 4-8, 4-9, and 4-10, respectively. A very definite step behavior
is evident in the speed-up curves for the initial allocation scheme. This behavior
directly correlates to the allocation scheme in use that bases allocation on the row
size and column size of the frontal matrices. While the speed-ups are significant, the
step behavior severely limits scalability. Vertical partitioning resolves this problem.

Model 3 (Block Concurrency Within and Between Frontal Matrices): Model
3 combines the block concurrency within a frontal of Model 1 with the concurrency
between frontal matrices used in Model 0. The speed-up and utilization results for this
model are shown in Figures 4-11 and 4-12, respectively. These results were produced
using the allocation strategy that waits if sufficient processors are not available.

Model 4 (Full Concurrency Within and Between Frontal Matrices): The fifth
and final model uses concurrency between nodes and full concurrency within nodes.
The speed-up and utilization results for this model are included for both the initial
allocation scheme and vertical partitioning. These are presented in Figures 4-13,
4-14, 4-15, and 4-16.






[Plot: speed-up vs. Log2 of Number of Processors]

Figure 4-5. Model 1 Speed-Up Results

A step-like speed-up and corresponding utilization irregularities are evident in the
original allocation scheme, very similar to the results for Model 2. Again, vertical
partitioning corrects this problem.

4.2.6 Conclusions and Observations

Several conclusions and observations are apparent from the results of the bounded
parallelism models.

1. There is an excellent potential for parallelism under both SIMD and MIMD
computational models.

2. The speed-ups indicated in the unbounded models are achievable with practical
size processor sets.

3. The benefits of the different models are realized within distinct ranges of the
number of processors.

4. The scheduling method has little effect on performance for all five of the
models. However, the critical path method was not tested in these initial runs
and may have a significant effect.

5. Processor allocation schemes can have very significant effects on performance.



Figure 4-6. Model 1 Utilization Results

With this establishment of the theoretical parallelism in the unsymmetric-pattern
multifrontal method, the next step is to use the characterization of an existing mul-
tiprocessor (the nCUBE 2) from Chapter 3 to revise the simulation model and study
the parallelism that is actually achievable.
4.3 Distributed Memory Multifrontal Performance Potential
With the characterization of the nCUBE 2's performance achieved in Chapter
3, the bounded parallelism PRAM simulation models are refined to estimate
the achievable parallelism of the unsymmetric-pattern multifrontal method. The
performance characteristics of the nCUBE 2 are used in the revised simulation models
so that predicted performance can later be compared to the performance actually
achieved once the method is implemented on the nCUBE 2.
However, one important prerequisite remains before the simulation model is up-
dated. This prerequisite is to specifically characterize the performance of the dense
matrix factorization routine(s) to be used within the unsymmetric-pattern multi-
frontal method. Recall that a primary motivation of multifrontal methods is to
replace sparse matrix operations with more efficient dense submatrix operations.
Hence, such a characterization of full and partial dense matrix factorization is criti-
cal to development of a realistic simulation.
4.3.1 Dense Matrix Factorization Routine
The characterization of the dense matrix factorization routine is developed by
actually implementing such a routine and measuring its performance. A model of its






Figure 4-7. Model 2 Original Speed-Ups

performance is also developed and the actual measured results are compared to the
results predicted by the model.

Choice of Algorithm

The algorithm chosen for dense matrix factorization is a column-oriented, fan-out
method. This method was chosen because it is easy to analyze and implement and also
because this was the approach taken in successful distributed memory implementa-
tions of multifrontal Cholesky factorization [105, 76].
Specifically, columns of the matrix are stored in a scattered fashion across the
processors. For added simplicity, diagonal dominance is assumed, so no numerical
pivoting will take place. An outer indexed loop provides one iteration for each pivot
with which the matrix will be partially factorized. If the pivot column is owned by the
processor, the multipliers (values for the corresponding column of L) are calculated
by dividing by the pivot entry. Then, this column of multipliers is broadcast to
the other processors. Following the broadcast, the originating processor will use the
multipliers to update any remaining columns that it owns. If a processor is not the
owner of the current pivot column, then it will simply wait for the broadcasting of the
multipliers and then update its columns with the received multipliers. This algorithm
is outlined in Figure 4-17.
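The loop just described can be illustrated with a small sequential simulation (a sketch for illustration, not the actual nCUBE 2 code; the assignment of column j to processor j mod P and the function name are assumptions, and the broadcast is simulated simply by shared access to the multiplier column):

```python
def fanout_partial_lu(A, k, P):
    """Sequential simulation of the column-scattered fan-out partial
    factorization.  A is a dense r x c matrix (list of rows), assumed
    diagonally dominant so the k leading diagonal entries are acceptable
    pivots without numerical checking."""
    A = [row[:] for row in A]                  # work on a copy
    r, c = len(A), len(A[0])
    for piv in range(k):
        # the owner of the pivot column computes the multipliers
        # (the column of L) by dividing by the pivot entry ...
        for i in range(piv + 1, r):
            A[i][piv] /= A[piv][piv]
        # ... broadcasts them, and then every processor updates the
        # remaining active columns that it owns
        for p in range(P):
            for j in range(piv + 1, c):
                if j % P == p:                 # column j is local to processor p
                    for i in range(piv + 1, r):
                        A[i][j] -= A[i][piv] * A[piv][j]
    return A
```

After k pivot steps, the trailing (r-k) x (c-k) block holds the Schur complement that a frontal matrix would pass on as a contribution block.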

Performance Model

An analytical model of the performance of this algorithm can be easily derived
by looking at a time line of processor activity. For this time line, the assumption is





Figure 4-8. Model 2 Vertical Partitioning Speed-Ups

made that a broadcast ties up all involved processors for a fixed period of time and
that all processors are available to receive the broadcast when it is initiated. (As
was seen in the previous chapter, near neighbors will actually finish the broadcast
reception earlier than others, but the difference is not deemed to be significant). The
produced time line is shown in Figure 4-18. In this figure, each processor's activities
are represented within a distinct column (four processors are shown in this figure).
Solid lines indicate ongoing computation and dashed lines indicate communication
and their associated latencies.
Using this model and assuming that the column update times are approximately
equal for each involved processor, we see that calculation of the multipliers is not
overlapped with any other useful activity and can be modeled as a sequential activity
in time. (This observation is the motivation for more efficient pipelined versions of the
method). Hence, as only updating of the columns is done in parallel, the execution
time model is:

texec = tm + tb + tu / P

where tm is the time to compute the multipliers, tb is the time required for the
broadcasts, tu is the time for all updates, and P is the number of processors being
used. Each of the component times can be defined in terms of the matrix parameters
using formulas similar to those derived from the experiments on the nCUBE 2. Thus,
using r, c, and k for the number of rows, columns, and pivots for the matrix, the







Figure 4-9. Model 2 Original Utilizations

following formulas can be derived:

tm = 4 Σ(i=1..k) (r - i) + 5k

tb = Σ(i=1..k) [195.22 + 50.02 log2(P) + 4(0.122)(r - i) + 4(0.45) log2(P)(r - i)]

tu = 5 Σ(i=1..k) (c - i)(r - i) + 17k + 3kc.

Similarly, if numerical pivoting were to be introduced, it would be handled in a way
very similar to that in which the multipliers are done and could be defined as:

tp = 4 Σ(i=1..k) (r - i + 1) + 5k

with execution time becoming:

texec = tm + tp + tb + tu / P.

Furthermore, a sequential version of the algorithm can be simply modeled with:

texec = tm + tu

or

texec = tm + tp + tu

if numerical pivoting is incorporated.
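The models above can be collected into a short script for evaluation (a sketch; the constants are the ones quoted from the nCUBE 2 experiments, times in microseconds, and the function names are illustrative):

```python
from math import log2

def t_m(r, k):          # multiplier computation, essentially sequential
    return 4 * sum(r - i for i in range(1, k + 1)) + 5 * k

def t_p(r, k):          # column partial-pivot search (if pivoting is used)
    return 4 * sum(r - i + 1 for i in range(1, k + 1)) + 5 * k

def t_b(r, k, P):       # broadcasts of the multiplier columns
    return sum(195.22 + 50.02 * log2(P)
               + 4 * 0.122 * (r - i) + 4 * 0.45 * log2(P) * (r - i)
               for i in range(1, k + 1))

def t_u(r, c, k):       # all column updates
    return (5 * sum((c - i) * (r - i) for i in range(1, k + 1))
            + 17 * k + 3 * k * c)

def t_par(r, c, k, P, pivoting=False):   # parallel model: only updates scale
    t = t_m(r, k) + t_b(r, k, P) + t_u(r, c, k) / P
    return t + t_p(r, k) if pivoting else t

def t_seq(r, c, k, pivoting=False):      # sequential model: no broadcasts
    t = t_m(r, k) + t_u(r, c, k)
    return t + t_p(r, k) if pivoting else t
```

For example, t_seq(64, 64, 1) evaluates to about 20.3 ms and t_par(64, 64, 1, 2) to about 10.7 ms, matching the model rows of Table 4-8.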


Figure 4-10. Model 2 Vertical Partitioning Utilizations

Dense Matrix Factoring Results
Timings of the dense matrix factorization algorithm were obtained using etool for
a variety of matrix sizes, shapes, and number of pivots. The results were analyzed
and several trends identified. Then, the same parameters were fed into the analytical
models for validation purposes.
Representative empirical results from the factorization of several matrix configu-
rations are provided below (Table 4-6) for hypercubes of dimensions 0 to 5. These
results are provided in terms of speed-ups by dividing the multiprocessor results
(dimensions 1 to 5) by the uniprocessor (dimension 0) results.
Several conclusions are readily apparent from this data:

1. Speed-up (and efficiency) decrease with more pivots. This occurs as there are
fewer computations with later pivots with which to amortize the communication costs.

2. There is a point at which additional processors add to overall latency (reduce
speed-ups). This happens as broadcast times grow significantly with larger
dimension hypercubes and can be seen for the 32 X 32 X 32 configuration with
dimension 4 and 5 hypercubes in Table 4-6.

3. High efficiencies are achieved only with very limited numbers of processors and
relatively large problems.

Other interesting observations are possible by looking at the results for rectangu-
lar matrices (which are important to the unsymmetric-pattern multifrontal method).



Figure 4-11. Model 3 Speed-Up Results

For example, by observing the results for the relatively tall and narrow (512 X 32)
and short and wide (32 X 512) matrices of Table 4-7, one can see a striking differ-
ence in performance. This is easily explained as the column-oriented method in use
broadcasts columns of data that are much larger in the 512 X 32 case and since the
essentially sequential multiplier determination time (tm) is proportional to the row
size of the matrix. This suggests the possibility that row-oriented versions of the
algorithm would be desirable for the tall and narrow frontal matrices and motivates
consideration of alternative dense factorization methods within an overall distributed
memory multifrontal implementation.
Finally, the actual timing obtained through experimentation is compared with
the results of the corresponding analytical models. This will be done in two ways.
First, the actual observed and predicted times are compared; second, the speed-ups
based strictly on observed times are compared to those based strictly on predicted
times.
The times observed from the experiments and those produced by the analytical
models are provided in Table 4-8 for several matrix configurations. (Results from
the experiments are annotated with "(e)" and those from the analytical models with
"(m)" and all times are in milliseconds).
The corresponding speed-ups for the same configurations are provided in Table
4-9. From these results it is readily apparent that the analytical model under-predicts
execution time especially in the smaller configurations and over-predicts speed-up.
Due to this combination, there is some overhead in the actual computations that is


Figure 4-12. Model 3 Utilization Results

not being properly accounted for in the analytical model. However, the speed-ups are
not too far off and the analytical model should provide fairly accurate results when
integrated into the simulation model.
A common observation throughout this section is that the dense matrix factoriza-
tion algorithm is not very efficient for hypercubes of dimension greater than 3 on the
matrix configurations shown. Alternative algorithms will be explored as the efficiency
and performance of a distributed memory, unsymmetric-pattern multifrontal method
should be very closely tied to the performance of the dense matrix operations.

4.3.2 Distributed Memory Multifrontal Simulation Model
With a reasonable characterization of the dense matrix factorization process
achieved in the last section, the bounded parallelism PRAM simulation model de-
scribed earlier can be revised into a distributed memory implementation model with
performance characteristics similar to those of the nCUBE 2. In order to revise the
simulation model, the following specific changes will be required:

Replace the node weight (frontal task's execution time) with the execution time
obtained from the analytical model of the previous discussion.

Account for passing of contribution blocks between frontal matrices with edge
weights in the graph. Specific edge weights will be defined using the perfor-
mance equations derived from experimentation on the nCUBE 2's point to point
message passing capabilities.








Figure 4-13. Model 4 Original Speed-Ups

Modify the processor allocation schemes to allocate processors as complete sub-
cubes.

Develop initial algorithms for scheduling and allocating processors to frontal
matrices.
From the earlier results of the bounded parallelism PRAM simulations and the
characterization of the nCUBE 2 multiprocessor, an appropriate strategy for
parallel implementation would be blocked (column-oriented) parallelism within frontal
matrices and concurrency between independent frontal matrices. Such an approach
is taken. Furthermore, it is necessary to make several other assumptions for revision
of the model. These added assumptions are:

Assembly will consist of both the communication of contributions between
frontal matrices (done as point to point message passing) and the numerical
assembly (addition) of contributions to the corresponding frontal matrix entry.
Both components will initially be done sequentially within each frontal matrix.

Processor allocations will be done in full subcubes (i.e., in processor sets that
are a power of 2) but specifically which processors are assigned to a particular
task will not be tracked. Hence, there is an implicit assumption that available
processors form a subcube, that is, no fragmentation of the hypercube occurs
that would hinder allocations.







Figure 4-14. Model 4 Vertical Partitioning Speed-Ups

As the latency of point to point message passing is dependent upon the path
distance of the transmission and as specific processor allocations will not be
tracked, a set distance of 10 links will be assumed for all communication of
contributions. Such a setting will ensure an upper bound on this latency for all
hypercubes of 1024 processors or less (since log2(1024) = 10).

Checking of numerical pivots will be accounted for by a column-based, partial
pivoting scheme within the node weight function. However, the assumption is
made that no anticipated pivots will be lost during the factorization process.

Node and Edge Weight Models
Time will be tracked in the distributed memory simulation model using the fol-
lowing node and edge weights for task and communication times respectively.
Node weights will be assigned initially based upon a sequential node execution
model and then revised to a parallel time upon allocation of a specific set of processors
to the node (task). The sequential node model is:

texec = ta + tm + tp + tu

where tm, tp, and tu are the multiplier, pivoting, and updating times respectively
as defined earlier in the dense matrix factorization model and ta is the time for
assemblies defined as:

ta = 8A + Σe [204 + 0.573(8 Ae)]






Figure 4-15. Model 4 Original Utilizations

where A is the total number of contributions and Ae is the number of contributions
passed on a particular edge e. This formula is based on the point to point message
passing latency experiments described earlier and an assumption of 8 bytes per en-
try to be assembled (4 bytes for index, 4 bytes for value). This second term (the
summation) includes the edge weight definition.
The parallel node execution time model then becomes:

texec = ta + tm + tp + tb + tu / P

where P is the number of processors assigned to the node. This parallel node execu-
tion time model will also be referred to in the next section as T_Para(ID, P); that
is, the parallel time associated with task ID when using P processors.
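These node and edge weight definitions can be sketched as follows (an illustrative sketch; the per-edge contribution counts are hypothetical inputs, and the inclusion of tb in the parallel node model is an assumption consistent with the parallel dense factorization model above):

```python
def t_assembly(edge_contribs):
    """Assembly time ta for a frontal task.  edge_contribs[e] is the number
    of contributions Ae received over incoming edge e; their total is A.
    Uses 8 bytes per assembled entry (4-byte index, 4-byte value)."""
    A = sum(edge_contribs)
    numeric = 8 * A                          # numerical additions
    messages = sum(204 + 0.573 * (8 * Ae)    # per-edge message latency
                   for Ae in edge_contribs)  # (this term is the edge weight)
    return numeric + messages

def node_weight_seq(ta, tm, tp, tu):
    # sequential node model: texec = ta + tm + tp + tu
    return ta + tm + tp + tu

def node_weight_par(ta, tm, tp, tb, tu, P):
    # parallel node model T_Para(ID, P): only the updates scale with P
    return ta + tm + tp + tb + tu / P
```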
Scheduling and Processor Allocation

For scheduling, a priority list scheme is used with priority assigned to tasks using
node and edge weights with a critical path method. The node weights used are based
on sequential execution time.
Processor allocation is quite challenging as allocating more processors to a task
will cause it to execute faster (up to some limit) but will introduce more inefficien-
cies, as seen in the earlier dense matrix factorization experiments. Furthermore, such
inefficient use of processors may preclude their more efficient use on other distinct
nodes (tasks). Thus, for processor scheduling we define two quantities to be associ-
ated with each node: P_desired and P_most. P_desired is the preferred number of



Figure 4-16. Model 4 Vertical Partitioning Utilizations

processors to assign to the node. P_most is the maximum number of processors to
ever assign to a node. P_most is necessary due to the execution time increase seen
when too many processors are assigned to a dense factorization task and due to the
column-oriented nature of the dense factorization algorithm (that is, the number of
columns provides an upper bound on the number of processors).
Determination of P_desired is essentially a question of efficiency and should be
based both on the task characteristics and the number of available processors. That
is, if there is a very large number of processors in the configuration, we can afford
to be more inefficient in our allocation since the extra processors would likely go idle
otherwise. Hence, a target efficiency is defined and P_desired is set via our analytical
models to achieve an efficiency near the target. Furthermore, this target efficiency
is set to different values based on the total number of processors in the
system. Some initial experiments revealed that reasonable target efficiencies would
be 0.8 for hypercubes of dimension 3 or less (8 or fewer processors), 0.5 for hypercubes
of dimensions 4 to 7 (16 to 128 processors), and 0.2 for hypercubes of dimension 8 or
larger (256 or more processors).
Given these values for target efficiency, the value for P_desired is found by solving
the following for P:

Tsequential / (P * T_Para(ID, P)) = Target_Eff.

Each processor executes the following:

do k = 0, num_of_pivots - 1
    if ( column k is local ) then
        do i = k+1, n-1
            compute multipliers Aik = aik / akk
        broadcast the column just computed
    else
        receive the column just broadcast
    for ( all columns j > k that are local ) do
        do i = k+1, n-1
            aij = aij - Aik * akj

Figure 4-17. Distributed Memory Fan-Out Factorization

This turns out to be a nonlinear equation with P log2(P) and P terms. Thus, to
simplify its solution, P log2(P) is replaced by P^2 and the conservative result is then
rounded up to the nearest encompassing subcube. Table 4-10 below provides sample
results from this calculation for the various target efficiencies:
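A sketch of this calculation follows, under the assumption that the pivoting term tp is included in both times (the function name is illustrative, and the constants are those quoted from the nCUBE 2 experiments):

```python
import math

def p_desired(r, c, k, target_eff):
    """Solve Tseq / (P * T_Para(P)) = target_eff for P, replacing the
    P*log2(P) term by P**2 as described in the text, then round up to
    the nearest encompassing subcube."""
    tm = 4 * sum(r - i for i in range(1, k + 1)) + 5 * k
    tp = 4 * sum(r - i + 1 for i in range(1, k + 1)) + 5 * k
    tu = (5 * sum((c - i) * (r - i) for i in range(1, k + 1))
          + 17 * k + 3 * k * c)
    # broadcast time tb(P) = a + b*log2(P) with:
    a = sum(195.22 + 4 * 0.122 * (r - i) for i in range(1, k + 1))
    b = sum(50.02 + 4 * 0.45 * (r - i) for i in range(1, k + 1))
    tseq = tm + tp + tu
    # P*T_Para(P) ~ b*P**2 + (tm + tp + a)*P + tu, so the
    # target-efficiency condition becomes a quadratic in P:
    A2, A1, A0 = b, tm + tp + a, tu - tseq / target_eff
    P = (-A1 + math.sqrt(A1 * A1 - 4 * A2 * A0)) / (2 * A2)
    return 2 ** max(0, math.ceil(math.log2(P)))
```

For instance, this reproduces the Table 4-10 entries for the 32 X 32 X 8 configuration: P = 2, 8, and 16 at target efficiencies 0.8, 0.5, and 0.2 respectively.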
The determination of P_most on the other hand simply involves the calculation
of the value of P at which parallel execution time begins to increase. As parallel
execution time is a function of row size, column size, number of pivots and number
of processors, we can simply take the partial derivative of this equation with respect
to the number of processors and solve for the value of P that zeros the derivative.
The function for parallel time is:

Tparallel(r, c, k, P) = tm(r, k) + tp(r, k) + tb(r, k, P) + tu(r, c, k) / P

where tm, tp, tb, and tu are component functions for multipliers, pivoting, broadcasts,
and updates (as defined earlier) and r, c, and k are row size, column size, and number
of pivots. Thus, solving

d Tparallel / dP = 0

one gets

P = tu(r, c, k) ln 2 / (50.02 k + 1.8 Σ(i=1..k) (r - i)).

Once P has been determined, it is rounded down to the nearest power of 2. Then,
T_Para(ID, P) and T_Para(ID, 2 P) are evaluated and P_most assigned either P

[Figure: time line for four processors (proc 1 through proc 4), showing in sequence:
compute multipliers for first pivot; broadcast multipliers; update for first pivot;
compute multipliers for second pivot; broadcast multipliers; update for second pivot.]
Figure 4-18. Time Line for Partial Dense Factor Routine

or 2 P based on which produces the lesser execution time. As a column-oriented
dense factorization algorithm is used, P_most is also restricted to be no larger than
the smallest power of 2 that is greater than or equal to the number of columns.
As an example of what this calculation produces, all of the configurations in the
P_desired example of Table 4-10 are limited to column size except for the 16 X 16
X 2 case, which is limited to 8 processors.
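The P_most procedure just described can be sketched as follows (an illustrative sketch using the component-time formulas from earlier in the chapter; the function name is an assumption):

```python
import math

def p_most(r, c, k):
    """Largest processor count worth assigning to a partial dense
    factorization task: zero the derivative of the parallel time with
    respect to P, round down to a power of 2, check whether doubling
    still helps, and apply the column-count restriction."""
    tm = 4 * sum(r - i for i in range(1, k + 1)) + 5 * k
    tp = 4 * sum(r - i + 1 for i in range(1, k + 1)) + 5 * k
    tu = (5 * sum((c - i) * (r - i) for i in range(1, k + 1))
          + 17 * k + 3 * k * c)

    def t_para(P):                     # parallel time model T_parallel
        tb = sum(195.22 + 50.02 * math.log2(P)
                 + 4 * 0.122 * (r - i) + 4 * 0.45 * math.log2(P) * (r - i)
                 for i in range(1, k + 1))
        return tm + tp + tb + tu / P

    # closed form from d(Tparallel)/dP = 0
    P = tu * math.log(2) / (50.02 * k + 1.8 * sum(r - i for i in range(1, k + 1)))
    P = 2 ** int(math.log2(P))         # round down to a power of 2
    if t_para(2 * P) < t_para(P):      # assign 2P if it is actually faster
        P *= 2
    return min(P, 2 ** math.ceil(math.log2(c)))   # column-count cap
```

This reproduces the example above: the 16 X 16 X 2 configuration is limited to 8 processors, while 32 X 32 X 8 and 64 X 64 X 8 are limited only by their column sizes.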
With P_desired and P_most now defined for each task node, the processor allo-
cation and scheduling can be carried out with the decision sequence defined in Figure
4-19. This decision sequence is repeated until all of the system's processors have been
allocated or until a task does not meet the allocation criteria. (In this latter case, it
may be possible to allocate other tasks with lesser priority and achieve a better over-
all performance, but this extension produced significantly more run time overhead
with only moderate improvements).
Test Matrices
With the updates just described made to the simulation program, traces of assem-
bly DAGs from the latest version of the unsymmetric-pattern multifrontal method
were used as input [35]. This latest version of the method, called afstack, produced
assembly DAGs with significantly better parallelism than seen in the earlier DAGs
used in the unbounded and bounded parallelism PRAM models. This better paral-
lelism was reflected in DAGs with lower depth ratios (topological levels in the DAG
divided by the number of nodes in the DAG) and fewer assemblies as a percentage
of overall node weight.

Table 4-6. Empirical Factoring Speed-Ups

Rows X Cols X Pivots 1-Cube 2-Cube 3-Cube 4-Cube 5-Cube
(P=2) (P=4) (P=8) (P=16) (P=32)
32 X 32 X 1 1.55 2.39 3.18 3.63 3.74
32 X 32 X 8 1.65 2.40 2.96 3.32 3.32
32 X 32 X 32 1.44 1.81 1.93 1.i,, 1.73
64 X 64 X 1 1.89 3.36 5.39 7.48 8.95
64 X 64 X 8 1.88 3.28 5.17 7.28 8.82
64 X 64 X 32 1.84 3.11 4.63 5.93 6.66
64 X 64 X 64 1.78 2 5 3.92 4.63 4.94
128 X 128 X 1 1.96 3.69 6.64 10.84 15.37
128 X 128 X 8 1.95 3.69 6.61 11.14 16.41
128 X 128 X 64 1.94 3.60 6.25 9.74 13.25
128 X 128 X 128 1.19 3.49 5.81 8.55 10.88

Table 4-7. Rectangular Matrix Speed-Ups

Rows X Cols X Pivots 1-Cube 2-Cube 3-Cube 4-Cube 5-Cube
(P=2) (P=4) (P=8) (P=16) (P=32)
512 X 32 X 8 1.82 3.03 4.55 6.35 7.69
32 X 512 X 8 1.98 3.87 7.33 13.21 21.53

Specific characteristics of the matrices tested are provided in Table 4-11. The
results for DAGs produced by the older version of the method are included in paren-
theses (where they were available).
Simulation Results
The simulation results are provided in Tables 4-12 and 4-13 below in terms of
speed-ups achieved for hypercubes of dimensions 1 through 10 (processor sets of 2 to
1024). All maximum speed-ups were achieved with processor sets in this range.
Noticeable in these results is that matrices where a larger percentage of their node
weight is attributed to assembly had a worse performance. This is consistent with
the model assumption of purely sequential assemblies within the node tasks. As it is
quite possible to perform at least some fraction of the assemblies concurrently within
a node task, a revision to the model is justified.
The revision was completed by redefining assembly time so that numerical assem-
blies were done fully in parallel by the number of processors assigned to the frontal
matrix task. Thus, t, became:

,+ E [204 + 0.573(8 A,)].

Table 4-8. Empirical Versus Analytical Times

Table 4-9. Empirical Versus Analytical Speed-Ups

Rows X Cols X Pivots 1-Cube 2-Cube 3-Cube 4-Cube 5-Cube
(P=2) (P=4) (P=8) (P=16) (P=32)
32 X 32 X 8 (e) 1.65 2.40 2.96 3.32 3.32
32 X 32 X 8 (m) 1.69 2.69 3.63 4.17 4.28
64 X 64 X 1 (e) 1.89 3.36 5.39 7.48 8.95
64 X 64 X 1 (m) 1.90 3.49 5.84 8.50 10.54
128 X 128 X 64 (e) 1.94 3.60 6.25 9.74 13.25
128 X 128 X 64 (m) 1.95 3.70 6.63 10.69 14.89

With parallel execution time thus redefined by the new t,, corresponding revisions
were needed and made to the formulas for P_desired and P_most. The results of this
revision to the model are shown in the Tables 4-14 and 4-15.
4.4 Conclusions
While the results may not be outwardly spectacular, they are actually very promis-
ing when compared to a similar simulation study based on a distributed memory ver-
sion of the classical multifrontal method [122]. In this study by Pozo, speed-ups for
the BCSSTK24 matrix were 29.27 based on an iPSC/2 hypercube and 23.35 based

Table 4-10. Sample P_Desired Calculations

Rows X Cols X Pivots Eff=0.8 Eff=0.5 Eff=0.2
16 X 16 X 2 P=1 P=4 P=8
32 X 32 X 8 P=2 P=8 P=16
64 X 64 X 8 P=4 P=16 P=32
128 X 128 X 64 P=8 P=16 P=32

Rows X Cols X Pivots 0-Cube 1-Cube 2-Cube 3-Cube 4-Cube 5-Cube
(P=1) (P=2) (P=4) (P=8) (P=16) (P=32)
32 X 32 X 8 (e) 33.5 20.3 14.0 11.3 10.1 10.1
32 X 32 X 8 (m) 32.3 19.1 12.0 8.9 7.7 7.6
64 X 64 X 1 (e) 21.4 11.3 6.4 4.0 2.9 2.4
64 X 64 X 1 (m) 20.3 10.7 5.8 3.5 2.4 1.9
128 X 128 X 64 (e) 3165.8 1632.9 879.4 506.9 325.3 238.9
128 X 128 X 64 (m) 3078.1 1581.1 832.0 464.5 2" .9 206.7

Figure 4-19. Scheduling Decision Diagram

on an iPSC/860 hypercube. These figures compare to 32.67 and 43.99 (sequential
assembly and parallel assembly respectively) for the nCUBE 2 based simulations
studies documented here. Insufficient specifics were provided in the referenced article
to fully assess the relative differences in assumptions between my simulation model
and theirs, but from the details that were provided, my assumptions appear to be
more conservative. Thus, I have some confidence in the significance of these results.
It is important to keep in mind that these speed-up figures are relative to a
sequential version of the same algorithm. Yet, as both the classical and unsymmetric-
pattern multifrontal methods are competitive when run sequentially, the speed-ups
as reported here are still significant.
Chapter Summary: The purpose of this chapter has been to investigate the the-
oretical and achievable parallelism of the unsymmetric-pattern multifrontal method
for sparse LU factorization. This investigation was based on the assembly DAG pro-
duced by the method during an analyze-factorize operation. The initial analytical
models showed that significant parallelism was available if the parallelism both be-
tween and within nodes of the assembly DAG was exploited. When medium grain
parallelism was used within frontal matrices, speed-ups as high as 167 are possible on
the relatively modest sized test problems. Furthermore, when fine grain parallelism
was used within frontal matrices, the theoretical speed-ups increased to as much as
I. "1.' and 11, 1-_ for two of the test matrices.

Table 4-11. Test Matrix Characteristics

Matrix Order Nonzeros Depth Ratio Assembly %
MAHINDASB 1258 7682 0.031 8.64
(0.29) (20. '.)
GEMAT11 4929 331S-, 0.014 6.1'~,
(0.02) (10.4 .)
GRE_1107 1107 5664 0.051 4.'',
(0.08) (17.1' .)
LNS_3937 3937 25407 0.057 10.41
(0.18) (32.!' .)
SHERMAN5 3312 20793 0.005 2.21
(0.07) (21 '.)
RDIST1 4134 94408 0.075 1'.'
BCSSTK24 3562 159910 I 11' 2 1'

Table 4-12. Sequential Assembly Version Results

Part 1

MATRIX P=2 P=4 P=8 P=16 P=32
MAHINDASB 1.30 1.75 2.89 3.32 3.39
GEMAT11 1.65 3.18 5.68 7.95 11.49
GRE_1107 1.77 3.23 5.07 7.68 10.21
LNS_3937 1.62 2.73 4.36 6.04 7.29
SHERMAN5 1.87 3.20 5.96 9.49 12.64
RDIST1 1.82 3.23 5.83 8.65 16.25
BCSSTK24 1.89 3.47 6.25 9.S5 15.22

Whereas the initial analytical models were based on an assumption of an un-
bounded number of available processors, the simulation models that followed assumed
a bounded number of processors. The results of these models showed that the theo-
retical parallelism seen in the analytical models was achievable on realistically sized
processor sets. Specifically, the speed-ups of up to 167 seen when using medium grain
parallelism within nodes and parallelism across nodes were achievable with a maximum
of 512 processors. Moreover, when the parallelism within nodes was changed to fine
grain, the corresponding speed-ups of up to "i.S and 11, !_' were achievable on no
more than 65,536 processors.
The final objective of this chapter was then to revise the simulation models to
represent an existing architecture and estimate how much of the theoretical par-
allelism was actually achievable. The earlier performance characterization of the
nCUBE 2 multiprocessor was used to set the simulation parameters and the model

Table 4-13. Sequential Assembly Version Results Part 2

MATRIX P=64 P=128 P=256 P=512 P=1024
MAHINDASB 3.39 3.39 3.43 3.43 3.43
GEMAT11 12.48 12.60 14.01 14.02 14.02
GRE_1107 11.78 12.09 12.81 12.81 12.81
LNS_3937 7.93 8.11 8.87 8.88 8.88
SHERMAN5 18.62 20.22 22 ^ 22.98 22.98
RDIST1 22.29 22.50 26.08 26.08 26.16
BCSSTK24 26.57 32.67 38.87 39.29 39.29

Table 4-14. Parallel Assembly Version Results

Part 1

MATRIX P=2 P=4 P=8 P=16 P=32
MAHINDASB 1.31 1.81 2.14 4.14 4.29
GEMAT11 1.68 3.21 6.05 8.49 12.37
GRE_1107 1.82 3.29 5.54 8.71 12.84
LNS_3937 1.75 3.10 5.13 7.83 10.16
SHERMAN5 1.91 3.34 6.07 11.00 17.95
RDIST1 1.87 3.40 5.96 9.38 16.16
BCSSTK24 1.93 3.67 6.72 11.28 18.01

revised to correspond to a distributed memory environment. The results of the dis-
tributed memory model showed that 20 to 25 percent of the theoretical parallelism
was typically achievable. Furthermore, performing the assemblies in parallel pro-
duced significant benefits.
With the prospect of achieving significant parallel performance established by
the models of this chapter, the next step is to actually implement a distributed mem-
ory, unsymmetric-pattern multifrontal method. Chapter 5 begins the implementation
by developing and analyzing a number of highly tuned partial dense factorization ker-
nels, which will account for most of the computational workload.

Table 4-15. Parallel Assembly Version Results Part 2

MATRIX P=64 P=128 P=256 P=512 P=1024
MAHINDASB 4.29 4.29 4.41 4.41 4.41
GEMAT11 15.17 16.47 17.89 17.96 17.96
GRE_1107 15.32 15.91 17.02 17.02 17.02
LNS_3937 15.35 16.83 18.37 18.45 18.45
SHERMAN5 24.95 29.55 32.39 32.64 32.65
RDIST1 23.24 27.20 32.02 32.02 32.09
BCSSTK24 31.88 43.99 53.49 58.63 58.66


The primary computational function performed within an unsymmetric-pattern
multifrontal method is the partial factorization of dense submatrices (frontal ma-
trices). The term partial factorization refers to the application of only a limited
number of pivot steps for each frontal matrix. The remainder of the matrix entries
are updated by the pivot steps but not factored themselves and will be passed on
and assembled into other frontal matrices where they may be further updated and
eventually factored.
This chapter develops a number of routines to perform this partial dense factoriza-
tion (PDF) function. A basic column-scattered, fan-out algorithm is first developed.
This algorithm is improved upon by use of optimized vector functions from the Basic
Linear Algebra Subprograms (BLAS 1) and by the introduction of pipelining. Numeri-
cal considerations are then incorporated into both the basic and pipelined algorithms
with a major emphasis placed on finding alternative pivots in other columns when an-
ticipated pivots are numerically unacceptable. A parallel prefix operation among the
participating processors provides an efficient, consistent, and deterministic recovery
mechanism. A special streamlined algorithm is developed for the single pivot case.
The performance of all the developed algorithms is measured and analyzed. Specific
performance effects of the various enhancements are reported. Finally, performance
prediction formulas are derived from analytical models of the routines and compared
against observed performance. These predictions will be useful in the next chapter
when task scheduling and allocation are addressed.
The strategy of a column-scattered allocation of frontal matrices is adopted for
all of the partial dense factorization kernels to be discussed in this chapter. There are
a number of reasons for this. First, the column-scattered allocation provides good
load balancing. Also, it provides easy incorporation of partial pivoting strategies and
allows the location of entries to be easily determined.
5.1 Basic Algorithms
The first two partial dense factorization kernels assume that the diagonal elements
of the pivot block are numerically acceptable and the kernels use them as pivots
without any numerical checking. The assumption of numerically acceptable pivots on
the diagonal is consistent with matrices with properties such as diagonal dominance.
The first such kernel is called P1 and uses a simple fan-out approach. The second is
called P2 and improves performance by the use of pipelining.

5.1.1 P1 Basic Fan-Out Algorithm
The first algorithm implemented is the basic fan-out column-scattered method
introduced earlier in Section 2.6. This method has each participating processor loop
through all of the frontal matrix's pivots. If the processor owns the current pivot,
it will compute and broadcast the multipliers (column of L corresponding to that
pivot). If the processor does not own the current pivot, it will wait to receive the
multipliers from the owner processor. Following completion of the broadcast, all
processors will update the locally held columns of the active submatrix. An outline
of this algorithm, referred to as P1, is provided in Figure 5-1. A major drawback
of this algorithm is that all of the processors that do not own the current pivot
are essentially idle during the computation of multipliers and subsequent broadcast
initiation. While in a larger context this idle time could be used for other processing
(such as completing assemblies), a more desirable situation would be to develop a
more efficient factorizing algorithm.


(Basic fan-out based on a column scattered allocation)

for ( pivots 1 thru number of pivots ) do
    if ( next pivot column is local to this processor )
        compute multipliers
        broadcast multipliers to subcube for frontal
    else
        receive broadcast
    update remaining active local columns using
        the current multipliers

Figure 5-1. P1 Basic Fan-Out Algorithm
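The fan-out loop above can be sketched as a sequential simulation. The function name `fan_out_lu` and the ownership rule (column j belongs to processor j mod P) are illustrative assumptions for a column-scattered allocation; the actual kernel runs in parallel on the nCUBE 2 and is not reproduced here.

```python
def fan_out_lu(A, npiv, P):
    """Sequential sketch of the P1 fan-out kernel.

    A    : square frontal matrix as a list of columns (column-major)
    npiv : number of pivot steps to apply (partial factorization)
    P    : number of simulated processors; column j is owned by j % P
    Pivots are taken from the diagonal with no numerical checks,
    as P1 assumes (e.g., for diagonally dominant pivot blocks).
    """
    n = len(A)
    for k in range(npiv):
        owner = k % P  # in the real kernel, only this processor computes
        # owner computes the multipliers (column k of L) and broadcasts them
        piv = A[k][k]
        for i in range(k + 1, n):
            A[k][i] /= piv
        # every processor then updates the active columns it owns
        for j in range(k + 1, n):
            alpha = -A[j][k]
            for i in range(k + 1, n):
                A[j][i] += alpha * A[k][i]
    return A
```

In the parallel version, the two inner loops are separated by the broadcast: non-owner processors sit idle until the multipliers arrive, which is the drawback P2 addresses.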

5.1.2 P2 -Pipelined Fan-Out Algorithm
The second algorithm uses the concept of pipelining to minimize idle time and
overlap computation with communication. This algorithm was originally suggested
by Geist and Heath [62]. In this pipelined version, once the processor that owns the
next pivot receives the multipliers for the current pivot, it updates only the next
pivot column and then computes and broadcasts the multipliers for that pivot. Only
after the broadcast of these subsequent multipliers has been initiated, will the rest of
the updates for the current pivot be done. In this way, the multipliers for the next
pivot will be already computed and distributed by the time the other processors are
ready for them. The effect is to overlap the updating due to the current pivot with

the computation of multipliers for the next pivot and thus preclude much of the idle
time experienced with the basic fan-out algorithm. An outline of the basic pipelined
algorithm (P2) is provided in Figure 5-2.


(Pipelined fan-out based on a column scattered allocation)

if ( first pivot column is local to this processor )
compute multipliers
broadcast multipliers
save a ptr to these next multipliers
for ( all potential pivot columns ) do
    if ( current pivot column is not local to this processor )
        receive broadcast from pivot owner
    else
        adjust ptr to use saved next multipliers
if ( next pivot column is local to this processor )
update next pivot column with received multipliers
compute multipliers
broadcast multipliers
save a ptr to these next multipliers
update remaining active columns of the frontal matrix

Figure 5-2. P2 -Pipelined Fan-Out Algorithm
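The essential reordering in P2 can be shown in a sequential sketch: within step k the next pivot column is updated first, its multipliers are computed (and, in the parallel version, broadcast) early, and only then are the remaining updates for pivot k performed. The names below are illustrative; a sequential simulation cannot show the overlap itself, only that the reordered updates produce the same factors.

```python
def compute_multipliers(cols, k):
    piv = cols[k][k]
    for i in range(k + 1, len(cols[k])):
        cols[k][i] /= piv

def update_column(cols, k, j):
    alpha = -cols[j][k]
    for i in range(k + 1, len(cols[j])):
        cols[j][i] += alpha * cols[k][i]

def pipelined_lu(cols, npiv, log):
    """Sketch of the P2 update ordering; log records the event order."""
    n = len(cols)
    compute_multipliers(cols, 0)           # prime the pipeline
    log.append(('mult', 0))
    for k in range(npiv):
        first = k + 1
        if k + 1 < npiv:
            update_column(cols, k, k + 1)  # next pivot column updated first
            compute_multipliers(cols, k + 1)  # its multipliers go out early
            log.append(('mult', k + 1))
            first = k + 2
        for j in range(first, n):          # then the remaining updates
            update_column(cols, k, j)
        log.append(('upd', k))
    return cols
```

The log shows that the multipliers for pivot k+1 are ready before the bulk of the pivot-k updates, which is exactly the overlap that hides the non-owners' idle time.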

5.2 Use of BLAS 1 Routines
As most of the computations required by a partial dense factorization algorithm
entail repeated use of simple vector operations, performance improvements in these
operations will significantly enhance the performance of the overall factorization routine. The Level 1 Basic Linear Algebra Subprograms (BLAS 1) provide these optimized
vector operations [110, 111]. Furthermore, these routines are available for most high
performance computers and have been specifically tailored for their host architec-
tures. Thus, the BLAS provide both high performance and portability.
The original motivation of the BLAS 1 routines was to exploit the high perfor-
mance available from vector processing units. While the nCUBE 2 processors are
not equipped with vector processing units, it is possible to significantly improve per-
formance of BLAS 1 routines by making efficient use of the nCUBE 2's instruction
cache, many addressing modes, and CISC-style instructions. The assembly coded

BLAS 1 routines available on the nCUBE 2 do this and achieve a performance that
is 3.3 to 7.8 times faster than comparable C code.
A number of BLAS 1 routines were integrated into the first two algorithms and
will be used in the subsequent algorithms. Specifically, the Saxpy (Daxpy for double
precision) routine adds a vector-scalar product to another vector (y ← y + αx)
and is used to perform the outer product update (by columns) of the active submatrix
for each pivot. The Sscal (Dscal for double precision) routine scales a vector by
multiplying it by a scalar term (x ← αx); it is used to calculate
the multipliers (column of L) for a particular pivot. These routines improved the
algorithms' execution times by factors of 2 to 3.67, with the larger speed-ups
occurring on the larger matrices.
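The mapping of one pivot step onto these two BLAS 1 operations can be sketched with plain-Python stand-ins (the real kernels call the assembly-coded nCUBE 2 BLAS; the helper names and the `start` argument here are illustrative):

```python
def scal(alpha, x, start):
    """x[start:] <- alpha * x[start:]   (stand-in for Sscal/Dscal)"""
    for i in range(start, len(x)):
        x[i] *= alpha

def axpy(alpha, x, y, start):
    """y[start:] <- y[start:] + alpha * x[start:]   (stand-in for Saxpy/Daxpy)"""
    for i in range(start, len(y)):
        y[i] += alpha * x[i]

def pivot_step(cols, k):
    """Apply pivot k to a column-major active submatrix:
    one Sscal for the multipliers, one Saxpy per active column."""
    scal(1.0 / cols[k][k], cols[k], k + 1)          # multipliers (column of L)
    for j in range(k + 1, len(cols)):
        axpy(-cols[j][k], cols[k], cols[j], k + 1)  # rank-1 (outer product) update
```

Each pivot thus costs one scaling of the subdiagonal pivot column plus one Saxpy per remaining active column, which is why faster BLAS 1 routines translate so directly into faster factorization.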
In the next algorithms where threshold partial pivoting is incorporated, three
other BLAS 1 routines will be useful. The Isamax (Idamax for double precision)
routine finds and returns the index of the largest magnitude entry in a specified
vector. The Sswap (Dswap for double precision) routine swaps the location of two
vectors. Finally, the Scopy routine (Dcopy for double precision) makes a copy of a vector.
All of these BLAS 1 routines allow non-unit stride (adjacent vector entries need not
be adjacent in memory, but do need to be a constant distance apart). This enables
easy access to matrix rows in a column-major storage scheme (as well as columns
in a row-major scheme). Furthermore, with most of the nCUBE 2 BLAS 1 routines
there is no performance difference between unit and non-unit stride operations.
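The non-unit-stride idea is easy to illustrate: in a column-major n-by-n matrix stored as one flat array, row i consists of the elements at positions i, i+n, i+2n, and so on. BLAS 1 routines take an increment argument for exactly this access pattern. The sketch below is an Isamax-style search written in plain Python for illustration; it is not the vendor routine.

```python
def isamax(a, start, stride, count):
    """Return the index (in vector terms, 0-based) of the
    largest-magnitude entry of the strided vector
    a[start], a[start+stride], ..., (count entries)."""
    best, best_i = -1.0, -1
    for k in range(count):
        v = abs(a[start + k * stride])
        if v > best:
            best, best_i = v, k
    return best_i

# A 3x3 column-major matrix as a flat array; row 1 is elements 1, 4, 7.
a = [4.0, 9.0, 2.0,   3.0, 5.0, 7.0,   8.0, 1.0, 6.0]
row1_max = isamax(a, start=1, stride=3, count=3)   # searches 9, 5, 1
```

With stride 1 the same routine walks a column; with stride n it walks a row, which is what makes row searches cheap in a column-major frontal matrix.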
5.3 Inclusion of Row and Column Pivoting
While the basic concept of a partial factorization algorithm is clear and easily
understood, numerical considerations become more challenging. In particular, pivot
choices are limited to the leading principal submatrix with order equal to the number
of pivot steps to be applied. This submatrix is referred to as the pivot block. As the
pivot block will usually not include all of the pivot columns' entries, there can be
large entries in pivot columns that are not part of the pivot block. As a result no
pivot block entries for that column may pass the threshold pivoting criteria and that
pivot becomes lost. However, if another acceptable pivot block column is used then
its resulting update may cause the earlier column to become acceptable.
As a result, numerical checks must determine the maximum magnitude value in
a pivot column and attempt to find a pivot (from the pivot block) that can pass
the corresponding pivot threshold. Upon a failure, other pivot columns should be
checked accordingly. Only when no pivot column has an acceptable pivot should the
search be terminated. Furthermore, once a successful pivot has been applied, pivot
columns that previously failed may be rechecked as their values have been updated
by the pivot application.
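The threshold test described above can be sketched as follows. The function name, and the default threshold u = 0.1, are illustrative assumptions (u is a tunable parameter, not a value prescribed here); the test itself is standard threshold partial pivoting: a pivot-block entry is acceptable only if its magnitude is at least u times the largest magnitude in its entire column.

```python
def acceptable_pivots(col, npiv, u=0.1):
    """Indices within the pivot block that pass threshold pivoting.

    col  : one full pivot column of the frontal matrix
    npiv : order of the pivot block; only rows 0..npiv-1 are candidates
    u    : threshold parameter, 0 < u <= 1 (0.1 assumed for illustration)
    """
    colmax = max(abs(v) for v in col)   # Isamax-style column maximum
    return [i for i in range(npiv) if abs(col[i]) >= u * colmax]
```

A column whose large entries all lie below the pivot block returns an empty list: that pivot is "lost" until later updates raise one of its pivot-block entries past the threshold.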
Since pivot columns are distributed across processors, checking of all pivot columns
will require communication. As the pivot search operation has a relatively low com-
putational cost (compared to communication costs), the strategy of checking all the
local pivot columns prior to initiating communication is adopted. This strategy was
integrated into both the basic and pipelined versions of the fan-out, column-scattered
partial factorization algorithms.

5.3.1 P3 -Basic PDF with Pivot Recovery
Since the basic fan-out algorithm essentially leaves all but the current pivot owner
processor idle during pivot selection and the computation of the corresponding col-
umn of L, a natural modification would be to have all processors search through their
active pivot columns while the current pivot owner is doing its pivot selection. Thus,
if the current pivot owner fails to find an acceptable pivot column, the other proces-
sors will at least have started searching for an alternative pivot. Upon notification
of a pivot failure, the other processors will either continue their search or reply with
their results if their search is already completed. Furthermore, if the processors that
do not own the current pivot complete their search before any notification from the
current owner, they may even proceed to the computing of the corresponding column
of L (in a separate work area) if their search was successful.
Care must be taken, however, not to introduce excessive additional costs in the
situations where the current pivot owner typically finds a good pivot. To protect
against this, each non-current pivot owner will search a single column at a time and
then check if the current owner has broadcasted its results. (The test for the received
broadcast can be done via the test function on the nCUBE 2 system [111]). If the
notification is of a successful pivot, the corresponding multipliers (column of L) are
included in the notification and the alternative searches are terminated immediately.
The other challenge is to determine an appropriate new pivot owner in an efficient
and deterministic fashion when the current owner fails to find a good pivot. The
parallel prefix operation, imin, is an excellent tool for this purpose [111]. The imin
function operates on a specified cube/subcube with each node providing a single
integer value. The minimum value is determined using a parallel prefix computation
and the result provided to either one or all of the participating nodes. For purposes
of determining a new pivot owner, each processor node in the subcube for the frontal
matrix will provide either the number of nodes in the current subcube (if this node
fails to have a good pivot) or the forward difference from the ID of the current
pivot owner processor node to the reporting node's ID. If there is an acceptable
pivot available, the imin operation will determine the forward delta (difference in
IDs) from the current pivot owner to the nearest processor with a good pivot (in an
increasing node ID order). Alternatively, if there is no acceptable pivot in any active
pivot column, this delta will be equal to the number of nodes in the subcube and
all nodes will know to discontinue factorization of this frontal matrix. If a new owner
is found, it will compute the corresponding multipliers (if not already precomputed
during earlier idle time) and then broadcast them to the rest of the cube/subcube.
The other processors will know who the new owner is from the imin operation and
simply wait for that node to provide its good multipliers.
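The imin contribution rule can be sketched directly. Each node contributes either its forward distance (modulo the subcube size) from the current owner, if it holds a good pivot column, or the subcube size if it does not; the minimum over all nodes is either the delta to the new owner or the subcube size (no pivot anywhere). The function below simulates the reduction sequentially; the name and interface are illustrative, not the nCUBE imin call itself.

```python
def imin_new_owner(owner, has_good_pivot, P):
    """Simulate the imin parallel-prefix step for new-owner selection.

    owner          : node id of the current (failed) pivot owner
    has_good_pivot : per-node flags from the local searches
    P              : number of nodes in the frontal matrix's subcube
    Returns the new owner's node id, or None if no node has a pivot.
    """
    contributions = []
    for node in range(P):
        if has_good_pivot[node]:
            contributions.append((node - owner) % P)   # forward delta
        else:
            contributions.append(P)                    # "no pivot here"
    delta = min(contributions)                         # the imin result
    return None if delta == P else (owner + delta) % P
```

Because every node computes the same minimum, all nodes agree on the new owner without any further communication, which is what makes the recovery both efficient and deterministic.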
The whole process is done with a single small broadcast notifying the subcube of
the owner's failure to find a good pivot, an imin parallel prefix operation to determine
the new owner, and the broadcast of the resulting good multipliers. Much of the
search for alternative pivots is done during what would otherwise be idle time.
To avoid nondeterminism, implementation of this algorithm always begins the
alternative pivot search (when not the current pivot owner) using a static ordering

of its active pivot columns and a fixed starting position. If this is not done, the
column on which the search is terminated, due to receipt of a good pivot from the
current owner, will be timing dependent and an initiation of the next search based on
this previous termination point can produce different results (different selected pivot
column orderings).
An outline of the P3 algorithm and the component subroutines are provided in
Figures 5-3, 5-4, 5-5, and 5-6.


(Basic fan-out with lost pivot recovery across columns and)
(overlapped column pivot searching.)

for ( pivots 1 thru number of pivots ) do
    if ( this processor's turn for current pivot )
        ATTEMPT_LOCAL_PIVOT
    else /* not this processor's turn for current pivot */
        LOOK_FOR_ALT_PIVOTS
        if ( pivot owner does NOT have a good pivot column )
            RESOLVE_NEW_PIVOT_OWNER
    update the row and column permutations for the
        resulting pivot column
    determine which processor will own next pivot
        (look forward from current pivot owner to next)
        (processor that has active pivot columns )
    update active local frontal matrix columns using the
        current multipliers

Figure 5-3. P3 -Basic PDF with Pivot Recovery

In order to accurately track both row and column permutations to the frontal ma-
trix being partially factorized, each participating processor maintains row and column
permutation vectors. Broadcasted messages that include multipliers also include a
header which specifies status and the row/column permutations for the correspond-
ing pivot. In this way, all processors can maintain consistent records of the applied
permutations. In order to avoid unnecessary memory to memory transfers for mes-
sage preparation, multipliers (which occur consecutively in the column-major storage
format) are prefixed with the header information and provided to the broadcast func-
tion in place. Allocation of the frontal matrix data area includes the necessary extra
prefix memory. Data objects displaced by the header are saved before the header
is created and restored after the broadcast. To avoid data alignment problems, the


while ( this processor has active pivot columns to check )
and ( have not found a good pivot column ) do
check next local pivot column
if ( this processor has a good pivot column )
    compute multipliers and row/col permutations
    broadcast results to the others
else
    broadcast "no_pivot" status
    determine nearest good pivot column holder via
        imin parallel prefix computation
    if ( no others have a good pivot col )
        return ( number of good pivots )
    receive multipliers from new pivot owner

Figure 5-4. P3 Subroutine: ATTEMPT_LOCAL_PIVOT

integer header information is converted into the floating point format for its storage
as a prefix to the floating point message. This limits the values allowed for these in-
tegers to 24 bits but this is sufficient for the problem sizes currently targeted. These
broadcast message mechanisms are also used by the P4 algorithm to be discussed next.
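The integer-in-float header trick relies on the fact that a 32-bit IEEE float represents every integer up to 2^24 exactly, so the round trip is lossless for the targeted problem sizes. A minimal sketch (the function names are illustrative):

```python
import struct

def pack_header(value):
    """Encode an integer header field as a 4-byte single-precision
    float, so it can sit in-place ahead of the multiplier data in the
    broadcast buffer.  Exact only for 0 <= value < 2**24."""
    assert 0 <= value < 2 ** 24
    return struct.pack('<f', float(value))

def unpack_header(data):
    """Recover the integer header field from its float encoding."""
    return int(struct.unpack('<f', data)[0])
```

Keeping the header in the same floating point format as the multipliers avoids both a memory-to-memory copy and any data alignment problems in the in-place broadcast buffer.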
5.3.2 P4 -Pipelined Partial Factoring with Pivot Recovery
The problem with the P3 algorithm just described is that when pivots are eas-
ily found, its performance is only near that of the basic fan-out algorithm. A more
desirable situation would be to integrate the pivot recovery mechanisms of the third
algorithm (P3) with the pipelining concept of the basic pipelined algorithm (P2).
An effective integration of these two concepts would hopefully preserve the high
performance of pipelining when pivots are easily found without jeopardizing the efficient
overlapped searching for alternative pivots. The P4 algorithm is such an integration.
With the P4 algorithm, the owner of the next pivot searches its own pivot columns
prior to completing the updates for the current pivot. The results of the search will be
broadcast to the other nodes and buffered until they complete their current updates.
If a pivot is found, the multipliers are already computed and available in the broadcast
message. If the search for the next pivot is unsuccessful, the initial next pivot owner
completes the rest of its current pivot updates before participating in the resolution of
the next pivot's recovery. This allows the other nodes time to complete their current


while ( this processor has active pivot columns to check )
and ( have not found a good pivot column )
and ( have not received a good pivot column ) do
check my next active pivot column
if ( pivot owner has sent broadcast (ntest))
receive broadcast and determine status
if ( pivot owner has not yet sent broadcast )
and ( this processor has a good pivot column )
compute multipliers using my pivot column and
hold them in a temporary buffer, also the
corresponding row/col permutations
if ( pivot owner has not yet sent broadcast )
wait for and receive broadcast

Figure 5-5. P3 Subroutine: LOOK_FOR_ALT_PIVOTS

pivot updates and perform their local searches for a next pivot. As the amount of
computations done by each processor is roughly equivalent, all nodes should be ready
for the imin determination of the next pivot owner at nearly the same time. (This
new pivot owner determination is done in the same way as it was in P3). A graphical
representation of this process is seen in the time line of Figure 5-7. Here processor
Procl is the initial next pivot owner and does its search of active pivot columns
(updating each pivot column with the previous pivots) prior to updating the rest
of the active columns with the updates for the current pivot. Meanwhile the other
processors do the current update and then find the status of the next pivot and do
their local pivot searches while processor Procl completes its current pivot updates.
While much of the recovery effort is overlapped as it was in the P3 algorithm, the
recovery will result in a drain of the pipeline. The pipeline will be "reprimed" with
the multipliers produced by the new owner of the next pivot (Proc2 in Figure 5-7).
During both the initial priming of the pipeline and any subsequent "repriming," it is
possible to take advantage of the idle time on the nonpivot owner nodes by having
them search for alternative k + 2 pivots. This would improve the worst case behavior
of P4 to be closer to the worst case behavior of P3. However, the worst case behavior
is anticipated to be infrequent and the additional code complexity deemed to not be
worth the potential advantage.
A more specific statement of the P4 algorithm can be found in Figures 5-8, 5-9,
5-10, and 5-11. As alternative pivot searching in this algorithm is only done when


if ( this processor has a good pivot column )
    send delta from pivot owner's node id to this
        node id in imin prefix computation
else
    send number of processors for this frontal matrix
        in the imin prefix computation
receive delta to new pivot owner from imin computation
if ( delta == number of frontal processors )
    /* that is, no processor has a good pivot column */
    return ( number of good pivots )
if ( this processor is the new pivot owner )
    if ( multipliers not yet computed )
        compute the multipliers and row/col permutes
        broadcast the results
    else
        broadcast multipliers from temporary buffer
            together with row/col permutations
        copy multipliers to frontal matrix
else
    receive broadcast from the new pivot owner

Figure 5-6. P3 Subroutine: RESOLVE_NEW_PIVOT_OWNER

required, and then only in a deterministic manner, the algorithm exhibits no
timing-dependent behavior: it is fully deterministic.
5.4 Single Pivot Version
Frequently in the current sequential implementation of the unsymmetric-pattern
multifrontal method [34, 35], frontal matrices are encountered that require only a
single pivot step. In such frontal matrices the pivot block is a single entry and no
row or column permutations can be made. Furthermore, the looping overhead for all
the pivots can be eliminated as can other related overhead. As a result, a streamlined
partial dense factorization routine, called P5, was developed for exclusive use on single
pivot frontal matrices. The logic is very similar to that of the P1 algorithm with the

(Time flows downward; columns are processors Proc0 through Proc3.)

    Proc0, Proc2, Proc3:  updates for Kth pivot
    Proc1:                updates and search for K+1st pivot

    Proc1 broadcasts "no pivot found"

    Proc1:                completes updates for Kth pivot
    Proc0, Proc2, Proc3:  search for alternative K+1st pivot

    Determination of new pivot owner via the imin operation
    (Proc2 is nearest in forward distance so it is new owner)

    Proc2:                computes multipliers for the K+1st pivot

    Proc2 broadcasts the multipliers
Figure 5-7. P4 Pivot Recovery Timeline

loop for all pivots removed and a simple check for numerical acceptability of the
single pivot added.
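A single-pivot step of this kind can be sketched as follows. The function name and the threshold default u = 0.1 are illustrative assumptions; the logic (one threshold check, multipliers, one rank-1 update, no loop and no permutations) mirrors the streamlined structure described above.

```python
def single_pivot_step(cols, u=0.1):
    """Sketch of a P5-style single-pivot partial factorization step.

    cols : frontal matrix as a list of columns; the pivot block is
           the single entry cols[0][0]
    Returns True if the pivot was applied, False if it was lost.
    """
    col0 = cols[0]
    if abs(col0[0]) < u * max(abs(v) for v in col0):
        return False                      # pivot lost; no recovery possible
    inv = 1.0 / col0[0]
    for i in range(1, len(col0)):
        col0[i] *= inv                    # multipliers (column of L)
    for j in range(1, len(cols)):         # single rank-1 update
        alpha = -cols[j][0]
        for i in range(1, len(cols[j])):
            cols[j][i] += alpha * col0[i]
    return True
```

Since the pivot block is a single entry, a failed threshold test cannot be recovered within this frontal matrix; the lost pivot must be handled by the inter-frontal recovery mechanisms.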

5.5 Performance Achieved

In order to assess the performance achieved by each of the five algorithms intro-
duced, user-defined events were inserted into the factorization algorithms and the
etool facility used to determine the times at which the events occurred. From the
results, detailed timings of the partial dense factorization algorithms were achieved.
A driver program created square frontal matrices of orders 32, 64, 128, and 256 each
run with 1, 8, 16, and 32 pivot steps. Subcubes of dimensions 0 through 5 were used
and speed-ups determined based on comparison with the dimension 0 run times.
Pivot rows and columns were generated using random numbers and were densely
populated. In most runs, the pivot block was forced to be diagonally dominant to
preclude the need for permutations. (However, some runs were made to force row
and column permutations so that the cost of pivot recovery could be measured.)
As the number of floating point operations for factorization of each of these com-
binations is well defined, achieved Mflops could also be determined. Comparison is
also made of the basic and pipelined versions together with the impact of using the
BLAS 1 routines. The consequences to performance of taking numerical consider-
ations into account are also analyzed. Finally, brief discussions of the single pivot
algorithm and overall double precision results are provided.

5.5.1 Floating Point Operations

The achieved Mflops (for the fastest versions of each kernel) are shown for single
precision (32 bits) in Table 5-1 and for double precision (64 bits) in Table
5-2. These results are based on a frontal matrix of 256 by 256 entries with 32 pivots