## Citation

- Permanent Link:
- http://ufdc.ufl.edu/UF00101365/00001
## Material Information
- Title:
- On the LU factorization of sequences of identically structured sparse matrices within a distributed memory environment
- Creator:
- Hadfield, Steven Michael
- Place of Publication:
- Gainesville, Fla.
- Publisher:
- Department of Computer and Information Science and Engineering, University of Florida
- Copyright Date:
- 1991
- Language:
- English
## Subjects
- Subjects / Keywords:
- Algorithms ( jstor )
- Broadcasting industry ( jstor )
- Directed acyclic graphs ( jstor )
- Distributed memory ( jstor )
- Factorization ( jstor )
- Mathematical vectors ( jstor )
- Matrices ( jstor )
- Modeling ( jstor )
- Permutations ( jstor )
- Scheduling ( jstor )
- Computer and Information Sciences thesis Ph.D
- Dissertations, Academic -- UF -- Computer and Information Sciences
- Parallel processing (Electronic computers) ( lcsh )
- Sparse matrices -- Data processing ( lcsh )
- Genre:
- bibliography ( marcgt )
- theses ( marcgt )
## Notes
- Abstract:
- This research effort studies the parallel LU factorization of sequences of identically structured sparse matrices using the unsymmetric-pattern, multifrontal method of Davis and Duff. Computational burden is reduced by using the computational structure (called an assembly directed acyclic graph: DAG) that results from analysis of the first matrix for the numerical factorization of the subsequent matrices. Execution time is reduced via exploitation of parallelism in the assembly DAG. The theoretical parallelism in the assembly DAG is investigated using both analysis and simulation. Achievable parallelism is evaluated using a simulation model based on the nCUBE distributed memory multiprocessor. A fixed pivot order parallel LU factorization routine is implemented on the nCUBE and evaluated. Key subissues within this implementation are task scheduling, subcube allocation, subcube assignment, and data assignments to reduce communications and overall execution time. The resulting implementation is shown to be competitive with conventional techniques and demonstrates significant parallel performance with excellent scalability. Of greatest significance is the theoretical development and implementation of a lost pivot recovery capability for the unsymmetric-pattern, multifrontal method. The capability incorporates a lost pivot avoidance strategy with both inter- and intra-frontal matrix pivot recovery mechanisms. The inter-frontal matrix pivot recovery mechanisms migrate lost pivots to subsequent frontal matrices and accommodate corresponding side effects using relationships represented in the assembly DAG. Intra-frontal matrix pivot recoveries are done via a pipelined partial dense factorization kernel that handles unsymmetric permutations across processors.
Evaluation of the lost pivot recovery capability using three matrix sequences from chemical engineering problems shows that significant execution time savings are afforded when executing on either a single processor or multiple processors.
- General Note:
- Typescript.
- General Note:
- Vita.
- Thesis:
- Thesis (Ph. D.)--University of Florida, 1994.
- Bibliography:
- Includes bibliographical references (leaves 360-371).
## Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright [name of dissertation author]. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Resource Identifier:
- 31732007 ( OCLC )
001972742 ( ALEPH )
## Full Text

ON THE LU FACTORIZATION OF SEQUENCES OF IDENTICALLY STRUCTURED SPARSE MATRICES WITHIN A DISTRIBUTED MEMORY ENVIRONMENT

By

STEVEN MICHAEL HADFIELD

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1994

ACKNOWLEDGMENTS

First, I must express my sincere appreciation to my advisor, Dr. Timothy A. Davis, for guiding, challenging, and encouraging me throughout this effort. His intelligence, patience, and concern have helped me establish high academic and personal goals. Also, I would like to thank my supervisory committee members, and especially Dr. Theodore Johnson, for their support and many helpful suggestions. In addition, I would like to express my appreciation to Professor Iain Duff of Harwell Laboratory, U.K., for reviewing and encouraging key aspects of this effort, and to Dr. Steve Zitney of Cray Research for providing the much needed matrix sequences for the evaluation of lost pivot recovery.

In addition, I recognize the National Science Foundation for their funding of the development of the unsymmetric-pattern, multifrontal method (ASC-9111263, DMS-[illegible]). Furthermore, I appreciate the financial and professional support of the Air Force Institute of Technology.

Personally, I thank my parents for their continual support and encouragement (and especially Mom for her many hours of prayer). I would like to thank my children, Melissa, Andrew, and Christopher, for the sacrifices they have had to endure during this program. Yet most of all, I thank my wonderful wife, Marissa, for her constant love and support. She makes it all worthwhile and helps me keep it all in perspective.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTERS

1 INTRODUCTION
  1.1 Topic Statement
  1.2 Overview

2 BACKGROUND AND RELATED EFFORTS
  2.1 LU Factorization
  2.2 Algorithm Stability and Error Analysis
  2.3 Sparse Matrix Concepts
  2.4 Multifrontal Methods
  2.5 Factorization Sequences of Matrices
  2.6 Parallel Matrix Computations
  2.7 Multiprocessor Scheduling

3 IMPLEMENTATION PLATFORM
  3.1 Hardware Architecture
  3.2 Software Environment
  3.3 Performance Characterization of the nCUBE 2

4 EVALUATION OF POTENTIAL PARALLELISM
  4.1 Unbounded Parallelism Models
    4.1.1 Assembly DAG Analysis
    4.1.2 Model Definitions
    4.1.3 Matrices Analyzed
    4.1.4 Assembly DAG Analysis Results
    4.1.5 Conclusions and Observations
  4.2 Bounded Parallelism Models
    4.2.1 Initial Models
    4.2.2 Trace-Driven Simulation
    4.2.3 Processor Allocations
    4.2.4 Scheduling Methods
    4.2.5 Simulation Results
    4.2.6 Conclusions and Observations
  4.3 Distributed Memory Multifrontal Performance Potential
    4.3.1 Dense Matrix Factorization Routine
    4.3.2 Distributed Memory Multifrontal Simulation Model
  4.4 Conclusions

5 PARTIAL DENSE FACTORIZATION KERNELS
  5.1 Basic Algorithms
    5.1.1 P1 - Basic Fan-Out Algorithm
    5.1.2 P2 - Pipelined Fan-Out Algorithm
  5.2 Use of BLAS 1 Routines
  5.3 Inclusion of Row and Column Pivoting
    5.3.1 P3 - Basic PDF with Pivot Recovery
    5.3.2 P4 - Pipelined Partial Factoring with Pivot Recovery
  5.4 Single Pivot Version
  5.5 Performance Achieved
    5.5.1 Floating Point Operations
    5.5.2 Use of BLAS 1 Routines
    5.5.3 Basic versus Pipelined Performance
    5.5.4 Inclusion of Numerical Considerations
    5.5.5 Single Pivot Performance
    5.5.6 Double Precision Results
  5.6 Analytical Performance Predictions
    5.6.1 Component Times
    5.6.2 Aggregate Times

6 FIXED PIVOT ORDERING IMPLEMENTATION
  6.1 Host Preprocessing
    6.1.1 Assembly DAG Acquisition
    6.1.2 Frontal Matrix Scheduling
    6.1.3 Data Assignments
    6.1.4 Launching the Parallel Factorization
  6.2 Parallel Factorizations
    6.2.1 Schedules and Frontal Matrix Descriptions
    6.2.2 Refactorization Process
    6.2.3 Distributed Triangular Solves
  6.3 Performance Issues
    6.3.1 Performance Measures
    6.3.2 Test Case Selection
    6.3.3 Scheduling/Allocation Issues
    6.3.4 Assignment Issues
    6.3.5 Parallelism within the Method
    6.3.6 Competitiveness of the Method
    6.3.7 Execution Time Profiling
    6.3.8 Memory Utilization and Scalability
  6.4 Summary and Conclusions

7 LOST PIVOT RECOVERY
  7.1 Theoretical Development
    7.1.1 Definitions and Notation
    7.1.2 Foundational Theorems
    7.1.3 Fill-In due to Recovery
    7.1.4 Impacts on Assembly DAG
    7.1.5 Effects of Multiple Recoveries
    7.1.6 Repeated Failures
    7.1.7 Ordering Recovery Resolutions
    7.1.8 Communication Paths
    7.1.9 Consequences of Block Triangular Form
    7.1.10 Summary of the Lost Pivot Recovery Theory
  7.2 Implementation Specifics
    7.2.1 Lost Pivot Recovery Synchronization
    7.2.2 Host Preprocessing
    7.2.3 Enhanced Task Processing
    7.2.4 Recovery Structures
    7.2.5 Recovery Handling
    7.2.6 Recovery Creation and Forwarding
    7.2.7 Sequential and Shared Memory Considerations
  7.3 Performance Evaluation
    7.3.1 Performance Issues
    7.3.2 Performance Measures
    7.3.3 Test Cases
    7.3.4 Performance Results

8 CONCLUSION
  8.1 Summary
  8.2 Future Efforts

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Test Matrix Descriptions
4-2 Assembly DAG Characterizations
4-3 Unbounded Parallelism Speed-Up Results
4-4 Processor Set and Utilization Results
4-5 Required Processors Comparison
4-6 Empirical Factoring Speed-Ups
4-7 Rectangular Matrix Speed-Ups
4-8 Empirical Versus Analytical Times
4-9 Empirical Versus Analytical Speed-Ups
4-10 Sample P_Desired Calculations
4-11 Test Matrix Characteristics
4-12 Sequential Assembly Version Results Part 1
4-13 Sequential Assembly Version Results Part 2
4-14 Parallel Assembly Version Results Part 1
4-15 Parallel Assembly Version Results Part 2
5-1 PDF Mflops Achieved (Single Precision)
5-2 PDF Mflops Achieved (Double Precision)
5-3 P1 Predicted Versus Observed Performance
5-4 P2 Predicted Versus Observed Performance
5-5 P3 Predicted Versus Observed Performance
5-6 P4 Predicted Versus Observed Performance
5-7 P5 Predicted Versus Observed Performance
6-1 Test Matrix Characteristics
6-2 Test Configurations
6-3 Percentage of Communication Reduction
6-4 Competitiveness of Parallel Refactorization Code
6-5 Sequential and Parallel Execution Profiles
7-1 Edge Count Comparison
7-2 RDIST1 Parallel Execution Time Results (32 processors)
7-3 RDIST2 Parallel Execution Time Results (32 processors)
7-4 RDIST3A Parallel Execution Time Results (32 processors)

LIST OF FIGURES

2-1 LU Factorization Algorithm
2-2 Sample Matrix Structure
2-3 Sample Digraph for Matrix A
2-4 Sample Symmetric Matrix Structure
2-5 Sample Undirected Graph for Symmetric Matrix B
2-6 Matrix Structure Before and After Elimination
2-7 Graph Reduction Due to Gauss Elimination
2-8 Graph After Symmetric Permutation
2-9 Clique Storage Matrix Example
2-10 Block Triangular Form Matrix
2-11 Sample Sparse Matrix
2-12 Frontal Matrix for First Pivot
2-13 Frontal Matrix for Second Pivot
2-14 Possible Relationships between Frontal Matrices
2-15 Lost Pivot Example
2-16 Basic Column Fan-Out LU Factorization
2-17 Task Graph with Communication Costs
2-18 Gantt Chart of Schedule with Communication Costs
3-1 3D Broadcast Example
4-1 Assembly DAG Abstraction for GEMAT11
4-2 Processor Allocation For Fine Grain Parallelism
4-3 Model 0 Speed-Up Results
4-4 Model 0 Utilization Results
4-5 Model 1 Speed-Up Results
4-6 Model 1 Utilization Results
4-7 Model 2 Original Speed-Ups
4-8 Model 2 Vertical Partitioning Speed-Ups
4-9 Model 2 Original Utilizations
4-10 Model 2 Vertical Partitioning Utilizations
4-11 Model 3 Speed-Up Results
4-12 Model 3 Utilization Results
4-13 Model 4 Original Speed-Ups
4-14 Model 4 Vertical Partitioning Speed-Ups
4-15 Model 4 Original Utilizations
4-16 Model 4 Vertical Partitioning Utilizations
4-17 Distributed Memory Fan-Out Factorization
4-18 Time Line for Partial Dense Factor Routine
4-19 Scheduling Decision Diagram
5-1 P1 - Basic Fan-Out Algorithm
5-2 P2 - Pipelined Fan-Out Algorithm
5-3 P3 - Basic PDF with Pivot Recovery
5-4 P3 Subroutine: ATTEMPT_LOCAL_PIVOT
5-5 P3 Subroutine: LOOK_FOR_ALT_PIVOTS
5-6 P3 Subroutine: RESOLVE_NEW_PIVOT_OWNER
5-7 P4 Pivot Recovery Timeline
5-8 P4 - Pipelined PDF with Pivot Recovery
5-9 P4 Subroutine: FIND_PIVOTS_AND_COMPUTE_MULTIPLIERS
5-10 P4 Subroutine: NONPIVOT_OWNER_RECOVERY
5-11 P4 Subroutine: PIVOT_OWNER_RECOVERY
5-12 nCUBE 2 Peak and BLAS Performance
6-1 Edge Refinement Example
6-2 Overlap Determination Algorithm
6-3 Parallel Refactorization Algorithm
6-4 Assemble Contributions Algorithm
6-5 Contribution Forwarding
6-6 Block Upper Triangular Matrix Form Example
6-7 Dependency Chart For A Lower Triangular Solve
6-8 Dependency Chart For A Sparse Lower Triangular Solve
6-9 Task Graph For A Sparse Lower Triangular Solve
6-10 Parallel Refactorization Speed-Ups
6-11 Memory Requirements and Scalability
7-1 Sparse Matrix With Lost Pivot
7-2 Sample Assembly DAG
7-3 Frontal Matrices Augmented by Lost Pivot Recovery
7-4 Multi-Level List Structure Prior to Combining
7-5 Multi-Level List Structure After Combining
7-6 Lost Pivot Recovery Overhead
7-7 Sequential Execution Time for RDIST1 Sequence
7-8 Sequential Execution Time for RDIST2 Sequence
7-9 Sequential Execution Time for RDIST3A Sequence
7-10 RDIST3A Speed-Ups With and Without Lost Pivot Recovery (LPR)
7-11 Scalability Without Lost Pivot Recovery - RDIST3A
7-12 Scalability With Lost Pivot Recovery - RDIST3A

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ON THE LU FACTORIZATION OF SEQUENCES OF IDENTICALLY STRUCTURED SPARSE MATRICES WITHIN A DISTRIBUTED MEMORY ENVIRONMENT

By Steven Michael Hadfield

April 1994

Chairman: Timothy A.
Davis
Major Department: Computer and Information Sciences

This research effort studies the parallel LU factorization of sequences of identically structured sparse matrices using the unsymmetric-pattern, multifrontal method of Davis and Duff. Computational burden is reduced by using the computational structure (called an assembly directed acyclic graph: DAG) that results from analysis of the first matrix for the numerical factorization of the subsequent matrices. Execution time is reduced via exploitation of parallelism in the assembly DAG. The theoretical parallelism in the assembly DAG is investigated using both analysis and simulation. Achievable parallelism is evaluated using a simulation model based on the nCUBE 2 distributed memory multiprocessor.

A fixed pivot order parallel LU factorization routine is implemented on the nCUBE 2 and evaluated. Key subissues within this implementation are task scheduling, subcube allocation, subcube assignment, and data assignments to reduce communications and overall execution time. The resulting implementation is shown to be competitive with conventional techniques and demonstrates significant parallel performance with excellent scalability.

Of greatest significance is the theoretical development and implementation of a lost pivot recovery capability for the unsymmetric-pattern, multifrontal method. The capability incorporates a lost pivot avoidance strategy with both inter- and intra-frontal matrix pivot recovery mechanisms. The inter-frontal matrix pivot recovery mechanisms migrate lost pivots to subsequent frontal matrices and accommodate corresponding side effects using relationships represented in the assembly DAG. Intra-frontal matrix pivot recoveries are done via a pipelined partial dense factorization kernel that handles unsymmetric permutations across processors.
Evaluation of the lost pivot recovery capability using three matrix sequences from chemical engineering problems shows that significant execution time savings are afforded when executing on either a single processor or multiple processors.

CHAPTER 1
INTRODUCTION

With the advent and refinement of practical parallel processing platforms, there has been a flurry of research in concurrent algorithms that can efficiently utilize these resources. A primary emphasis of this research has been numerical algorithms in general and, in particular, algorithms for the direct solution (as opposed to iterative approximation) of systems of linear equations. Extensive research has been done on solving such systems when the coefficient matrix is dense. Such matrices are frequently encountered and have very regular structure that can be exploited in algorithm development.

When the coefficient matrix is sparse, considerably more parallelism may be available, although it is more difficult to exploit. For sparse, symmetric, positive definite matrices, elimination tree structures can be derived to expose parallelism in the computations. Much of the parallel sparse matrix research focus to date has been on such matrices due to their frequent occurrence, the exposure of parallelism in their elimination tree, and the guarantee of numerically stable pivots on the diagonal.

Parallel algorithms for solving more general linear systems are more difficult as they are characterized by sparse matrices where the nonzero structure can be highly unsymmetric and the diagonal entries may not be numerically acceptable as pivots. Hence, parallelism is irregular and numerical pivoting considerations become important.

The most common direct technique for solving such systems is to factorize the coefficient matrix into the product of a lower triangular matrix, L, and an upper triangular matrix, U. Triangular solves (forward and backward substitutions) can then be used to solve the system.
One approach to LU factorization that holds significant potential for parallel implementation is the unsymmetric-pattern multifrontal method of Davis and Duff [34, 35].

Like most sparse matrix factorization algorithms, the unsymmetric-pattern multifrontal method has two principal operations, analyze and factorize. The analyze operation selects matrix entries to act as pivots for the numerical factorization with objectives of reducing computations and maintaining numerical stability. The factorize operation then uses the selected pivots to perform the actual numerical factorization. Furthermore, there are many sparse matrix problems that require the factorization of sequences of identically structured sparse matrices. Such sequences are common to solution techniques for the systems of differential algebraic equations used in many application areas including circuit simulations, chemical engineering, magnetic resonance spectroscopy, and air pollution modeling [105, 144, 143, 146, 37, 82, 49, 18, 97, 108, 131, 43, 50]. When factorizing these sequences of identically structured sparse matrices, the computational burden can be eased by only doing the analyze operation once for the entire sequence and using the selected pivots for the subsequent factorization of each matrix in the sequence.

1.1 Topic Statement

The focus of this research effort is to study and implement parallel LU factorization algorithms based on an unsymmetric-pattern multifrontal approach and targeted for a distributed memory environment. Specifically, the parallel algorithms developed will perform the numerical factorization of sequences of identically structured sparse matrices using a directed acyclic graph (DAG) structure produced by a sequential implementation of Davis and Duff's Unsymmetric Multifrontal Package (UMFPACK) [34, 35, 32]. This DAG structure, known as the assembly DAG, defines the necessary computations in terms of the partial factorizations of small, dense submatrices, which are known as frontal matrices and represented by nodes in the assembly DAG. The edges of the assembly DAG define the necessary data communication between frontal matrices, as well as the corresponding precedence relation. Thus, the assembly DAG can also serve as a task graph for the parallel computations.

The first objective is to determine how much parallelism is available and exploitable within the computational structure defined by the assembly DAG. Of interest within this objective is how much theoretical parallelism is inherent in the method itself and how much of this theoretical parallelism can be achieved on contemporary parallel architectures. The theoretical and practical parallelism will be investigated by analysis and simulation. Then the practical parallelism will be scrutinized further by implementing a parallel, distributed memory multifrontal refactorization algorithm and empirically measuring its achieved parallelism.

When factorizing a sequence of identically structured sparse matrices where additional properties such as symmetry, positive-definiteness, or diagonal dominance are not assumed, the changing numerical values of the various entries can cause the anticipated pivots to no longer be acceptable numerically. The capability to recover from these lost pivots and complete the factorization without having to recompute a new assembly DAG could significantly reduce the computation time required for factorization of the sequence of matrices. Therefore, the second objective is to develop a lost pivot recovery capability within the context of an unsymmetric-pattern multifrontal method. The actual implementation of the lost pivot recovery capability will be based on the parallel, distributed memory multifrontal algorithm from the first objective.
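The idea of reading the assembly DAG as a task graph can be illustrated with a small sketch (the node numbering and edges here are hypothetical, not taken from the dissertation): grouping frontal-matrix tasks into dependency levels shows which partial factorizations may proceed concurrently, since every task in a level depends only on tasks in earlier levels.

```python
def topological_levels(num_tasks, edges):
    """Group DAG tasks into dependency levels. Tasks within one level
    have all their predecessors in earlier levels, so they may run in
    parallel; an edge (u, v) means task v consumes data produced by u."""
    indeg = [0] * num_tasks
    succ = [[] for _ in range(num_tasks)]
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = [t for t in range(num_tasks) if indeg[t] == 0]
    levels = []
    while level:
        levels.append(level)
        nxt = []
        for u in level:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        level = nxt
    return levels

# Toy assembly DAG: frontal matrices 0 and 1 are independent leaves,
# 2 assembles contributions from both, and 3 depends on 2.
print(topological_levels(4, [(0, 2), (1, 2), (2, 3)]))
# -> [[0, 1], [2], [3]]
```

The number of tasks per level gives a crude upper bound on usable processors at each stage, which is the kind of question the unbounded-parallelism models of Chapter 4 make precise.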
Thus, both the sequential and parallel performance of lost pivot recovery may be assessed.

1.2 Overview

Chapters 2 through 7 of this document report the results of this research effort. Chapter 8 reviews the critical results and discusses possible future efforts.

In Chapter 2, pertinent background and the results of related efforts are presented. The LU factorization process and its stability and error bounds are described. A discussion follows of key sparse matrix concepts such as fill-in, the analyze and factorize phases, the relationship of sparse matrices to graph theory, scatter and gather operations, and block triangular form. Next, the key concepts of the multifrontal approach to LU factorization are illustrated. A mathematical framework is then shown for how sequences of identically structured sparse matrices arise in various solution techniques, with references to specific applications provided. Some methods for parallel matrix computations are summarized for both dense matrices (such as the frontal matrices of a multifrontal method) as well as sparse matrices. Included in this discussion will be key results regarding parallelism from earlier efforts that were based on a symmetric-pattern, multifrontal method. Finally, some scheduling techniques for distributed memory parallel processing are summarized.

Chapter 3 describes the nCUBE 2 distributed memory parallel processor on which the algorithms are implemented. First, the key hardware features are presented, followed by an overview of the software environment. Then the results of an empirical characterization of the nCUBE 2's performance are presented. This characterization will aid in assessing achievable parallelism and in making later design and implementation decisions.

The key results of this research effort begin in Chapter 4, which describes how the theoretical and achievable parallelism of the unsymmetric-pattern multifrontal method are assessed using both analysis and simulation.
First, analytical models are used to determine how much parallelism is available within the computational structure provided by the assembly DAG, with the various models corresponding to different sources of parallelism. These analytical models all assume an unbounded number of processors and an underlying parallel random access memory (PRAM) model and thus provide an assessment of theoretical parallelism. Next, an evaluation of how much parallelism is achievable on bounded processor sets is done using simulation techniques. Finally, performance characteristics of the nCUBE 2 are used to extend the simulation models to represent a realistic distributed memory implementation.

With the understanding of the parallel characteristics of the unsymmetric-pattern multifrontal method gained from the analytical and simulation models, Chapter 5 begins the actual implementation. As the required computations are dominated by the partial factorization of frontal matrices, this chapter is devoted to the development of high performance parallel partial dense factorization kernels. Five distinct routines are developed and tested. The first two confine themselves to the given pivot ordering and do not check the numerical acceptability of the pivots. These routines are applicable to matrices that are known to be diagonally dominant, and thus the diagonal entries are assured to be acceptable as pivots. The second of these routines uses pipelining to overlap required communication with computation. The next two partial dense factorization kernels address numerical pivoting with alternative pivot selection done within the frontal matrix's pivot block. This can be difficult as the columns of the pivot block may be distributed across several processors. The second of these routines is pipelined to improve performance. The fifth kernel is designed to handle frontal matrices that have only a single potential pivot.
Since single pivot frontal matrices occur frequently and do not require all the additional logic of the more general earlier routines, there is a performance advantage to be gained by a single pivot partial dense factorization routine. The performance of these kernels is assessed, and analytical models of their execution times based on various processor sets are defined. These models will be useful in scheduling frontal matrix tasks.

With the partial dense factorization kernels available, Chapter 6 discusses the implementation of a full parallel refactorization capability based on the unsymmetric-pattern multifrontal method. Among the most critical issues to be addressed in this implementation are the scheduling and allocation of processors to frontal matrix tasks. Two distinct scheduling methods are developed. One focuses on minimizing required communication by overlapping the assignment of frontal matrix tasks to processor sets. The other method tries to reduce overall execution time by more sophisticated management of the hypercube topology, with communication reduction a secondary concern. Allocation of processor sets to frontal matrix tasks is done using a proportional allocation scheme based on the predicted execution times of tasks and the tasks available to execute. A number of mechanisms are also developed for the distribution of frontal matrix columns to processors in hopes of further reducing required communication costs. All of the scheduling, allocation, and assignment activities are done as part of a sequential software interface that initiates and controls the parallel refactorization. The details of the parallel refactorization code are also discussed and include specifics on how frontal matrices are represented, how original matrix entries and contributions between frontal matrices are handled, and how frontal matrix tasks are synchronized.
A distributed forward and back solve capability is also implemented and takes advantage of the distributed storage of the LU factors that results from the parallel refactorization. With the implementation complete, a detailed performance evaluation is accomplished. Both the achieved parallelism and competitiveness of the implementation are assessed. The effectiveness of the various scheduling, allocation, and assignment mechanisms is also addressed.

Perhaps the most significant of the results from this research effort is the development of a robust lost pivot recovery capability for the unsymmetric-pattern multifrontal approach to LU factorization. Chapter 7 describes this capability. First, an extensive theoretical development of the lost pivot recovery capability is done. This is followed by a description of the implementation, which was done by extending the distributed memory parallel refactorization software of Chapter 6. Finally, a performance evaluation addresses the sequential effectiveness of the lost pivot recovery capability, its parallel execution time and speed-up characteristics, and its memory requirements and scalability.

CHAPTER 2
BACKGROUND AND RELATED EFFORTS

Prerequisite to pursuing the study and development of a parallel unsymmetric-pattern multifrontal method is the understanding of a number of key concepts related to linear algebra and sparse matrices. Specifically, the process of LU factorization and concepts of algorithm stability and error analysis are necessary. Furthermore, concepts and techniques for sparse matrices must be defined and described as they will be frequently referenced. In addition to this general background, the specifics of the unsymmetric-pattern multifrontal method of Davis and Duff [34, 35] are necessary as this entire effort is based on that method.
Mathematical techniques that give rise to sequences of identically structured sparse matrices together with specific applications justify the need to be able to factorize such matrix sequences. Finally, a summary of previous results in both parallel matrix computations and multiprocessor scheduling helps to define realistic goals for this effort and establish its significance.

2.1 LU Factorization

The most common method for solving general systems of linear equations in the form Ax = b (where A is an n x n matrix of full rank and x and b are vectors of length n) is LU factorization, which is based on Gauss Elimination [48]. In this method, the matrix A is factorized into A = LU where L is unit lower triangular and U is upper triangular. The problem Ax = b then becomes LUx = b and can be solved with two subsequent triangular solves. The first intermediate solve is the forward substitution Ly = b. The result, y, is then used in the second triangular solve that is a back substitution Ux = y. In the actual factorization process, A can be overwritten with the factors LU with the unit diagonal of L stored implicitly. The actual process proceeds down the diagonal of the matrix using the diagonal entries as pivots. Row and sometimes column interchanges may be required to replace zero entries on the diagonal, and similar permutations are common in order to preserve the numerical stability of the algorithm. An illustrative version of the basic LU factorization algorithm is presented in Figure 2-1 that assumes good pivots on the diagonal. The version of Gauss Elimination presented in Figure 2-1 is based on the most common ordering of the nested loops (i.e., the kij ordering). However, alternative orderings are also possible that correspond to all six of the possible permutations of the indices k, i, and j [78].
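The two triangular solves described above can be sketched directly; a minimal dense-matrix sketch in Python (function names are illustrative, not from the software described later in this work):

```python
def forward_substitution(L, b):
    # Solve L y = b, where L is lower triangular (the unit diagonal,
    # if implicit, is assumed to be stored explicitly here).
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def back_substitution(U, y):
    # Solve U x = y, where U is upper triangular, working upward
    # from the last unknown.
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x
```

Solving Ax = b then amounts to calling forward_substitution on L and b, followed by back_substitution on U and the intermediate result.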
for k := 1 to n do
    for j := k to n do
        U(k,j) := A(k,j)
    endfor
    for i := k+1 to n do
        L(i,k) := A(i,k) / A(k,k)
        for j := k+1 to n do
            A(i,j) := A(i,j) - L(i,k) * U(k,j)
        endfor
    endfor
endfor

Figure 2-1. LU Factorization Algorithm

When row and column permutations are introduced into the algorithm, the factorization becomes PAQ = LU where P provides the row permutations and Q provides the column permutations [48]. The equation Ax = b then becomes PAQ(Q^T x) = Pb, with the following sequence of a matrix multiply, two triangular solves, and another matrix multiply required to complete the solution based on the factorization: c = Pb, Ly = c, Uw = y, x = Qw. As P and Q are permutation matrices, they can be stored in O(n) storage and the required multiplications can also be done in O(n) time.

2.2 Algorithm Stability and Error Analysis

There are two types of error analysis typically used. The first, forward error analysis, provides bounds directly on the relative error between the computed and actual solutions. Such a technique is typically overly pessimistic and does not distinguish between error sources. Thus, this type of analysis is not as useful as the second type, which is backwards error analysis. The concept of backwards error analysis considers the computed solution as the exact solution to a "nearby" problem. Analysis of the round off computing error associated with the algorithm in use provides bounds on the distance between the original problem and the nearby version for which the true solution has been computed. The bounds on this distance provide a measure of the algorithm's stability. Perturbation theory is then used to relate the distance between the two problems to the distance between their respective true solutions. While the computing round off error is dependent upon the algorithm in use and is, to a certain extent, controllable by algorithmic alterations, the perturbations are strictly a function of the particular problem instance and cannot be controlled.
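The Figure 2-1 pseudocode translates nearly line for line into Python; a minimal sketch, assuming (as the figure does) good nonzero pivots on the diagonal:

```python
def lu_factorize(A):
    # kij-ordered LU factorization of a dense matrix, following the
    # Figure 2-1 pseudocode; assumes nonzero pivots on the diagonal.
    n = len(A)
    A = [row[:] for row in A]          # work on a copy of A
    L = [[float(i == j) for j in range(n)] for i in range(n)]  # unit diagonal
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(k, n):
            U[k][j] = A[k][j]
        for i in range(k + 1, n):
            L[i][k] = A[i][k] / A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= L[i][k] * U[k][j]
    return L, U
```

In practice A would simply be overwritten with both factors, as noted above, rather than copied into separate L and U arrays.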
Hence, backwards error analysis provides a means to distinguish between controllable and uncontrollable error sources. Furthermore, backwards error analysis can be used in an a posteriori fashion. Here the actual deviations between the solved and original problems can be measured (not just bounded). These more accurate results are then combined with the perturbation theory to provide a refined error bound on the solution. Thus, backwards error analysis provides both the distinction of error sources and a refined error bound. This section addresses both aspects of backwards error analysis. Computing round off errors and algorithm stability for LU factorization are first discussed. Means of ensuring stability are emphasized. Then, perturbation theory is addressed. The effects of problem conditioning are discussed with the condition number defined as a means to quantify these effects. The integration of stability and conditioning is used to develop an overall error bound for LU factorization.

Computing Round Off Errors: The two primary sources of computing errors are the representation of real numbers by finite digit approximations and the corresponding loss of significant digits during finite digit arithmetic. When real numbers are stored and used in a computer, they are typically approximated by a floating point value represented in a finite bit storage cell, and a bounded amount of relative error is introduced. Consider the real value represented by a. The corresponding floating point value would be fl(a) = a(1 + ε), |ε| ≤ μ, where ε is the relative error introduced by the finite bit representation, which is bounded by the machine precision, μ, of the particular processor [78]. In a similar fashion, finite arithmetic computations introduce additional errors as results of the computations are subject to the same finite storage restrictions.
Particularly vulnerable to this effect is addition (subtraction), as order of magnitude differences (similarities) in the values of the two operands cause increasing losses in accuracy. Combining the effects of inaccuracies due to floating point representations and computations, we find a cumulative effect as illustrated by fl(a + b) = (fl(a) + fl(b))(1 + ε₁) = ((a(1 + ε₂)) + (b(1 + ε₃)))(1 + ε₁) where ε₁, ε₂, and ε₃ are each bounded by the machine precision μ [78].

Algorithm Stability: Algorithm stability (or instability) is a measure of the effects of computing round off errors on the computation sequence dictated by the algorithm. The magnitude of this effect is measured by the difference between the factorized representation of the problem produced by the Gaussian Elimination algorithm (LU) and the true problem (A). This difference is accounted for in the matrix term H where LU = A + H. Wilkinson [139] showed that the entrywise deviations accounted for by H in the absence of pivoting are bounded by |h_ij| ≤ 5.01 n ρ μ for each h_ij ∈ H, where ρ = max_{i,j,k} |a_ij^(k)| and a_ij^(k) refers to the value of a_ij after the kth step of factorization. Extensive growth in the a_ij^(k)'s can thus have catastrophic effects.

Pivoting for Stability: Partial pivoting is the typical mechanism used to control growth of the pivots. At each step, k, a new pivot is chosen from the kth column of the active submatrix such that |a_kk^(k)| ≥ |a_ik^(k)|, i > k. With this approach the growth represented by ρ is limited to ρ ≤ 2^(n-1) max_{i,j} |a_ij|. While this is not a particularly good bound, in practice the results are consistently much better [139]. A stronger alternative to partial pivoting is complete pivoting. Complete pivoting chooses the kth pivot as the largest entry in the entire active submatrix A^(k). In doing so the growth is limited to ρ ≤ f(n) max_{i,j} |a_ij| where f(n) is a nearly linear function [138].
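The two sources of round off loss above, absorption of a small operand and cancellation between nearly equal operands, are easy to observe in ordinary double precision arithmetic; a small illustrative sketch:

```python
# Absorption: 1.0e16 is exactly representable in double precision,
# but adding 1.0 to it is lost entirely to rounding.
lost = (1.0e16 + 1.0) - 1.0e16      # evaluates to 0.0, not 1.0

# Cancellation: subtracting nearly equal values leaves mostly round off,
# so the small difference carries a large relative error.
tiny = 1.0e-15
diff = (1.0 + tiny) - 1.0           # not exactly 1.0e-15
rel_err = abs(diff - tiny) / tiny   # relative error of the difference
```

Both effects are instances of the fl(·) model above: each intermediate result is rounded to the nearest representable value before the next operation sees it.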
A problem with both partial and complete pivoting is that the rigid pivot selection rules can result in extensive fill-in (fill-in is the changing of zero entries into nonzero entries) when dealing with sparse matrices. Threshold pivoting provides an alternative that allows greater flexibility in pivot selection to aid in the preservation of sparsity [48]. Pivot selection in threshold pivoting requires the selected pivot to meet |a_kk^(k)| ≥ u |a_ik^(k)|, i > k, 0 < u < 1. Here u is a parameter that is typically set between 0.001 and 0.1. When threshold pivoting is employed, the growth bounding function becomes ρ ≤ (1 + u⁻¹)^(n-1) max_{i,j} |a_ij|. When LU factorization is applied to a matrix A and no zero pivots are encountered [78], the bound on H in LU = A + H is |H| ≤ 3(n-1)μ[|A| + |L||U|] + O(μ²). Here |M| refers to the matrix of the absolute values of the entries of M. Similar analysis for a triangular solve reveals that the computed solution, y, to the triangular solve problem Ly = b is the exact solution of (L + F)y = b with |F| ≤ nμ|L| + O(μ²). Combining the LU factorization results with the results of the two required triangular solves reveals that the computed solution, x̂, to the problem Ax = b is the exact solution to some problem (A + E)x̂ = b for some E satisfying |E| ≤ nμ(3|A| + 5|L||U|) + O(μ²). When partial pivoting is included, the entries of L can be bounded by 1 (and ‖L‖∞ ≤ n) and the bound on E in the infinity norm becomes ‖E‖∞ ≤ nμ(3‖A‖∞ + 5n‖U‖∞) + O(μ²). This bound for ‖E‖∞ provides the measure of the distance of the problem for which x̂ is the exact solution from the original problem. This measure will be combined with the perturbation theory to provide a bound on the relative error of the computed solution.

Ill Conditioning: At the heart of perturbation theory is the concept of how the distance between two nearby problems relates to the distance between their solutions.
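A minimal sketch of LU with partial pivoting that also tracks the growth ρ = max_{i,j,k} |a_ij^(k)| discussed above (illustrative code, not one of the kernels developed in this work):

```python
def lu_partial_pivot(A):
    # Dense LU with partial pivoting. Multipliers (the entries of L)
    # overwrite the lower triangle, U the upper; 'perm' records the row
    # interchanges; 'rho' is the element growth max |a_ij^(k)|.
    n = len(A)
    A = [row[:] for row in A]
    perm = list(range(n))
    rho = max(abs(v) for row in A for v in row)
    for k in range(n):
        # pivot: largest magnitude entry in column k of the active submatrix
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if p != k:
            A[k], A[p] = A[p], A[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]              # multiplier; |multiplier| <= 1
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
                rho = max(rho, abs(A[i][j]))
    return A, perm, rho
```

Because each pivot dominates its column, every stored multiplier is bounded by 1 in magnitude, which is exactly the property used in the ‖E‖∞ bound above.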
When the distance between solutions is great relative to the distance between the problems, the problems are said to be ill conditioned. Distance here is measured in an appropriate matrix or vector norm. The conditioning of a problem can be quantified by a value known as the condition number. Analysis can then utilize this condition number to bound the relative error in the solution space based on the relative error in the problem space (distance between the original problem and the problem exactly solved by the computed solution). The condition number can be defined in terms of matrix norms as κ_p(A) = ‖A‖_p ‖A⁻¹‖_p where p designates the specific norm in use [48]. An alternative, but equivalent, formulation based on singular values is κ₂(A) = σ₁/σ_n where σ₁ is the largest singular value and σ_n the smallest [78]. This latter formulation illustrates that ill conditioning can be viewed as a measure of nearness to singularity, as σ_n will be zero for a singular n x n matrix, and a relatively small value of σ_n (as compared to σ₁) will produce a large condition number. Analysis of how conditioning affects the relative error in the computed solution produces the following relationship between the original problem Ax = b and the one exactly solved by the computed solution, (A + E)x̂ = b. Specifically, the relative distance between the true (x) and computed (x̂) solutions is bounded by ‖x̂ − x‖/‖x‖ ≤ (κ(A)/(1 − r)) (‖E‖/‖A‖), where r = ‖E‖ ‖A⁻¹‖ < 1 [78]. Prevention of ill conditioning may be addressed by analyzing whether the process being modeled has the same sensitivities as the model; if not, the problem sources in the model should be identified and the model altered.

Overall Error Bound: The overall error bound on the relative error of the solution is obtained by combining the round off error analysis with the conditioning analysis. Recall that the computed solution, x̂, solves a nearby problem (A + E)x̂ = b where ‖E‖∞ ≤ nμ(3‖A‖∞ + 5n‖U‖∞) + O(μ²).
Also, the condition of the problem affects the relative error in the solution by ‖x̂ − x‖/‖x‖ ≤ (κ(A)/(1 − r)) (‖E‖/‖A‖) where r = ‖E‖ ‖A⁻¹‖ < 1. Combining these two results, we obtain the overall error bound of ‖x̂ − x‖/‖x‖ ≤ (κ(A)/(1 − r)) nμ (3 + 5n‖U‖∞/‖A‖∞) + O(μ²).

Measuring Stability: A large relative error in the solution of the equation Ax = b can be the result of either an unstable algorithm or a poorly conditioned problem or both. While the condition of the problem is typically not controllable, the stability of the algorithm is, and a large relative error would raise the question of whether or not it can be reduced by a more stable algorithm. Thus, there is a need to measure the stability of an algorithm. The relative residual, as defined below, provides such a measure [84].

relative residual = ‖Ax̂ − b‖ / ‖b‖

In this equation, x̂ represents the computed solution. To understand how the relative residual measures stability, a simple derivation is useful. Recall that LU = A + H and LUx̂ = b where L and U are the computed LU factors. H then represents the difference between the original and actually solved versions of the problem. Hence we can multiply A = LU − H by x̂ and get the following derivation: Ax̂ = (LU − H)x̂ = LUx̂ − Hx̂ = b − Hx̂. The vector b is then subtracted from each side, norms are taken, and the results divided by ‖b‖ to get ‖Ax̂ − b‖/‖b‖ ≤ ‖H‖ ‖x̂‖/‖b‖. Thus the relative residual provides a relative measure of the size of H, which is controlled by the algorithm's stability. If the relative error is large but the relative residual small, then the trouble is with the conditioning of the problem and not the stability of the algorithm.

2.3 Sparse Matrix Concepts

The matrices used in many applications tend to have a very significant number of entries that are zero.
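The relative residual is directly computable from A, the computed solution, and b; a minimal sketch using the infinity norm:

```python
def relative_residual(A, x, b):
    # ||A x - b||_inf / ||b||_inf : a computable measure of the size of H,
    # and hence of the stability of the algorithm that produced x.
    n = len(b)
    r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    return max(abs(v) for v in r) / max(abs(v) for v in b)
```

An exact solution gives a residual of zero; a solution produced by an unstable factorization gives a residual on the order of ‖H‖ ‖x̂‖/‖b‖, regardless of how well or poorly conditioned A is.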
This is particularly true and significant for very large matrices, as the computations on such large matrices, as dictated by direct solution methods, would be too numerous to be done in a reasonable amount of time. The sparsity induced by the many zero entries allows special techniques to be employed that significantly reduce the required number of computations. Thus, the number of nonzero entries in a matrix becomes a very critical measure of the amount of computation required for a particular sparse matrix. Furthermore, as the number of nonzeros is so critical, special attention is paid to minimizing the number of matrix entries that were originally zero and become nonzero due to the computations. This situation is commonly called fill-in. The discussion of this section first relates sparse matrices to graph theory, which is an important tool used in structuring sparse matrix computations. Special techniques for storing sparse matrices are then addressed. These techniques focus on the efficient use of memory while preserving the ability to access the matrix entries efficiently. A typical preprocessing step of reduction to block triangular form is then described, followed by a brief discussion of how the Markowitz Criterion can be used during the analyze phase to determine the pivot ordering.

Relationship to Graph Theory: Critical to the exploitation of sparsity in sparse matrices is the understanding and representation of the inherent matrix structure. Graph theory provides an excellent vehicle for this purpose [48, 70]. By representing the structure of a matrix with a graph construct, exploitable patterns can be more readily recognized. Furthermore, as operations take place on the matrix that alter its structure, corresponding changes can be made to the graph representation to track the structural dynamics of the process.
In this section, the graph representations of both unsymmetric and symmetric matrices are discussed, as well as the corresponding graph dynamics that occur due to Gauss Elimination and matrix permutations. If each row and each column in an n x n matrix A is labeled using 1, ..., n and each such label is associated with a node in a graph, the structure of the matrix can be represented with edges from node i to node j for each nonzero a_ij entry in the matrix A. These edges are directed and thus uniquely identify each nonzero entry in the structure of any matrix. The corresponding graph is called a digraph or directed graph. As an example, the matrix structure shown in Figure 2-2 is represented by the digraph structure shown in Figure 2-3.

Figure 2-2. Sample Matrix Structure

Figure 2-3. Sample Digraph for Matrix A

If the matrix has a symmetric pattern (that is, if for each nonzero a_ij entry, the a_ji entry is also nonzero), then each pair of nodes connected by a directed edge will have a second edge connecting them that is oriented in the opposite direction of the first edge. These two directed edges can be replaced by a single undirected edge, and thus matrices with a symmetric pattern can be represented with an undirected graph. A sample symmetric pattern matrix structure and the corresponding undirected graph are shown in Figures 2-4 and 2-5. The undirected graph for matrix B (as shown in Figure 2-5) happens to form a tree structure with node 1 as the root. While this does not occur with all symmetric matrices, it is a very important special case that will be discussed in more detail later.

Figure 2-4. Sample Symmetric Matrix Structure

Figure 2-5. Sample Undirected Graph for Symmetric Matrix B

When Gauss Elimination is performed on a matrix to zero a column below the diagonal, the structure of the matrix changes and corresponding alterations occur within the graph representation [118]. In particular, assume the column below the diagonal that corresponds to the node y in the graph is eliminated. Then, the node y may be removed from the graph, and for any pair of nodes x and z such that both (x, y) and (y, z) are existing edges, the edge (x, z) is added to the graph (assuming it does not already exist). The resulting graph structure corresponds to the active submatrix after this step of Gauss Elimination. This relationship is shown in the matrices of Figure 2-6 and the digraphs of Figure 2-7.

Figure 2-6. Matrix Structure Before and After Elimination

Notice that in the graphs of Figure 2-7 edge (3, 4) was added due to the existence of (3, 1) and (1, 4). Likewise (3, 2) was added due to (3, 1) and (1, 2). However, (4, 2) was not added since it was already in the graph.

Figure 2-7. Graph Reduction Due to Gauss Elimination

Interestingly, the changes that take place in the graph structure due to corresponding Gauss Elimination in the matrix are very similar to the graph theory process of computing transitive closures. Permutations are another very important operation performed on matrices. If the permutation is symmetric (that is, both rows and columns are interchanged), then the corresponding graph alterations amount to simply relabeling the nodes. For example, consider the symmetric permutation B_new = PBP^T where B is as shown in Figure 2-4 and P is defined as

P = [ 0 0 1 ]
    [ 0 1 0 ]
    [ 1 0 0 ]

The resulting B_new and its corresponding graph are as shown in Figure 2-8, which is simply a relabeling of the graph in Figure 2-5.

Storage Techniques: Typical storage mechanisms for dense matrices require O(n²) memory. The number of nonzeros in a sparse matrix (referred to as r) will frequently be several orders of magnitude less than n². Thus, more efficient storage mechanisms are required for simple efficiency as well as to make the algorithm being implemented tenable.
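The node-elimination rule just described can be sketched on a set-of-edges representation of the digraph; a minimal sketch, assuming the digraph is held as a set of (source, target) pairs:

```python
def eliminate(edges, y):
    # One step of Gauss Elimination viewed on the digraph: remove node y,
    # and for every pair of edges (x, y) and (y, z) add the fill edge
    # (x, z) unless it already exists (set union handles the duplicates).
    preds = {x for (x, t) in edges if t == y}      # nodes with an edge into y
    succs = {t for (s, t) in edges if s == y}      # nodes y has an edge to
    remaining = {(s, t) for (s, t) in edges if s != y and t != y}
    fill = {(x, z) for x in preds for z in succs if x != z}
    return remaining | fill
```

On the Figure 2-6/2-7 example, eliminating node 1 from the edge set {(3, 1), (1, 4), (1, 2), (4, 2)} adds the fill edges (3, 4) and (3, 2) while (4, 2), already present, is unchanged.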
Furthermore, due to phenomena such as fill-in during Gauss Elimination, it will frequently be necessary to alter the storage structure during processing. Thus, a static structure will not be suitable unless it preallocates sufficient room for growth. The alternative is a dynamic structure that can be easily altered during processing. While both static and dynamic storage techniques for sparse matrices will be addressed, the storage of sparse vectors is discussed first as a necessary and illustrative prerequisite. Much of this summary will follow that provided in Duff, Erisman, and Reid's text on sparse matrices [48].

Figure 2-8. Graph After Symmetric Permutation

A full storage scheme for a sparse vector requires O(n) memory but has the advantage that individual vector components can be directly indexed. A more space efficient storage technique is to hold each vector as a set of (value, index) pairs with one pair for each nonzero component in the vector. These pairs may be ordered by the index values or left unordered. The set of (value, index) pairs is typically called the packed storage form. A transformation from the full to the packed storage form is called a gather operation. The reverse transformation from packed to full form is called a scatter operation. Some contemporary computers with extensive vector facilities, such as the Cray X-MP and the Fujitsu FACOM VP, provide hardware support for these operations. Typical mathematical operations on vectors include dot products and the saxpy operation. The saxpy operation is defined as z = αx + y where x, y, and z are n-vectors and α is a scalar. The name saxpy comes from the software mnemonic for "scalar alpha x plus y" [78]. As the packed forms of sparse vectors are typically unordered (done for efficiency), operations on these vectors will frequently follow the outline provided below: Scatter the nonzero entries of one of the vectors into a full length vector known to be initially zeroed.
For each operation:

1. Revise the full length vector by applying the operation to it (for vector-vector operations, this can be done by a traversal of the other vector operand in packed form).

2. If an operation changes a zero to a nonzero, then update the corresponding packed structure index list.

Gather the nonzeros out of the full length vector back into packed form by using the (potentially updated) index list of the packed form, and in the process zero out the full length vector entries used (this facilitates later reuse of the vector).

When a vector facility is available, there is a greater performance tradeoff between the full and packed formats. Especially if the vector facility does not accommodate indirect indexing, the storage savings from a packed format will result in a significant performance disadvantage by precluding effective use of the vector hardware. Moving on to sparse matrix storage techniques, there are four primary mechanisms: coordinate schemes, collections of sparse vectors, linked lists, and clique schemes. Each mechanism has unique advantages and disadvantages, and a particular algorithm may use a number of different mechanisms throughout the different phases of its operation. The coordinate schemes maintain each nonzero in the matrix as a distinct triple (a_ij, i, j). This technique is typically used as the general input and output format for a general purpose algorithm. While this format is difficult to work with internally, since it cannot be efficiently accessed by either row or column, it provides an efficient and simple interface format. Contributing to its efficiency as an input format is the fact that it can be sorted into either the collection of sparse vectors or linked list format in time that is O(n) + O(r), where r is the number of nonzeros in the matrix. The sorting technique is based on the bucket sort using either the row or column indices.
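The scatter/operate/gather outline above, applied to the saxpy operation z = αx + y on packed vectors, can be sketched as follows (illustrative code; it assumes no duplicate indices within a packed vector):

```python
def sparse_saxpy(alpha, x_packed, y_packed, n):
    # z = alpha*x + y on packed (value, index) vectors. w is the full
    # length work vector, assumed zero on entry and re-zeroed on exit.
    w = [0.0] * n
    nz = []
    for v, i in y_packed:              # scatter y into w
        w[i] = v
        nz.append(i)
    for v, i in x_packed:              # operate against packed x
        if w[i] == 0.0:
            nz.append(i)               # a zero became nonzero: extend index list
        w[i] += alpha * v
    z = [(w[i], i) for i in nz if w[i] != 0.0]   # gather the result ...
    for i in nz:
        w[i] = 0.0                     # ... and re-zero w for later reuse
    return z
```

Note that only the indices touched are re-zeroed, so the O(n) work vector can be reused across many operations without an O(n) clearing pass each time.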
Furthermore, a column ordered, row oriented result from the sort is possible by first sorting by columns and then using that result to sort by rows. The collection of sparse vectors method uses a size n array of (pointer, count) pairs that index into the rows or columns that are held as sparse vectors. These sparse vectors are held in unordered, yet contiguous, format. Such a static format is frequently allocated with extra space to accommodate growth. When the extra space has been exhausted, but prior entries have been freed, a compaction operation may be done to reclaim free entries. The advantages of this method are that row or column accesses are efficient (depending on the orientation of the structure) and that no memory is needed for links between vector entries. Disadvantages are primarily the growth restrictions imposed by the static allocation. The linked list representation offers a dynamic structure similar to the prior collection of sparse vectors. However, as each vector (row or column) is held as a linked list, it can easily be expanded or contracted. Furthermore, it is also now possible to easily maintain row vectors in column order (or column vectors in row order). While access to these structures is usually simple and short, there are some disadvantages. For one, it may be necessary to maintain doubly linked lists to accommodate insertions and deletions (unless access methods are limited to means that preserve pointers to prior list entries). Also, the linked list structures will exhibit less spatial locality than the collection of sparse vectors method. This could adversely affect performance if a hierarchical memory is in use. A common drawback of both the linked list and collection of sparse vectors techniques is that they are either row or column oriented but not both. For operations such as Gauss Elimination, both row and column accesses may be necessary, and thus one of these two access types will be inefficient.
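The O(n) + O(r) bucket sort from coordinate form into a row oriented collection of sparse vectors can be sketched as follows (the (pointer, count) pairs appear here as a prefix-sum array; names are illustrative):

```python
def coo_to_rows(triples, n):
    # Bucket sort of (aij, i, j) triples into a row oriented collection of
    # sparse vectors in O(n) + O(r) time; each row's entries come out
    # contiguous but unordered by column, as described above.
    count = [0] * n
    for _, i, _ in triples:            # count nonzeros per row
        count[i] += 1
    start = [0] * (n + 1)              # row pointers, as prefix sums
    for i in range(n):
        start[i + 1] = start[i] + count[i]
    vals = [0.0] * len(triples)
    cols = [0] * len(triples)
    pos = start[:n]                    # next free slot within each row
    for a, i, j in triples:
        vals[pos[i]], cols[pos[i]] = a, j
        pos[i] += 1
    return vals, cols, start
```

Row i then occupies positions start[i] through start[i+1]-1 of vals and cols; running the same pass a second time, keyed on columns first, yields the column ordered, row oriented result mentioned above.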
This problem can be remedied by developing a second structure that is oriented opposite to the first (i.e., if the primary structure is row oriented, the secondary structure would be column oriented). Furthermore, this secondary structure may only need to hold the structural components of the matrix (i.e., the row/column indices), so full duplication would not be necessary. Another possibility using linked lists would be to intermingle row- and column-oriented linked lists between the nonzero elements. The final storage method is the clique scheme, which is derived from finite element models. In this scheme, the matrix, A, can be represented as the sum of a number of matrices, A[k], that each have only a few rows/columns with nonzero entries: A = Σ_k A[k]. These A[k] matrices correspond to fully connected subgraphs within the graph for the matrix, which is from where the term clique is derived. By way of example, the matrix shown in Figure 2-9 can be represented as the sum of the three cliques that are obtained by using the following row/column patterns: {1, 2, 5}, {2, 3}, and {3, 4}.

Figure 2-9. Clique Storage Matrix Example

While this method was originally developed for finite element problems that are directly derived as the sum of component matrices, it is more widely applicable, as any symmetric pattern matrix can be disassembled into a sum of cliques. In order to improve the structure or maintain numerical stability, it is often necessary to perform permutations on the original matrix. Such permutations are algebraically represented as matrices that are developed by interchanging rows or columns of the identity matrix. However, storage of the full matrix for a permutation is not required. A single length-n vector can be used to represent any permutation, and variations exist that allow the permutation to be performed in place for only an O(n) additional memory requirement.
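The length-n vector representation of a permutation can be sketched directly (illustrative helper names; the in-place variants mentioned above are elided):

```python
def permute_rows(A, p):
    # Row permutation stored as a length-n vector: row i of the result is
    # row p[i] of A, the action of the permutation matrix P whose
    # row i has its single 1 in column p[i].
    return [A[p[i]] for i in range(len(p))]

def invert_perm(p):
    # The inverse permutation (the transpose of P), also built in O(n).
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return inv
```

Applying a permutation and then its inverse returns the original ordering, mirroring the identity P^T P = I.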
Reduction to Block Triangular Form: Frequently the matrix A can be permuted into a block triangular form as shown in Figure 2-10 (a lower triangular version is also possible). This form of the matrix is significant, as it can reduce the amount of computation required for factorization and divides the overall problem into a number of independent subproblems. In particular, only the B_ii blocks need to be factorized. A back substitution process can then be used with a set of simple matrix-vector multiplications to evaluate the contributions of the off diagonal blocks [48]. The equation Ax = b thus becomes PAQy = Pb where x = Qy. The back substitution then proceeds using B_ii y_i = (Pb)_i − Σ_{j>i} B_ij y_j, for i = N, N−1, ..., 1 (where N is the number of diagonal blocks), followed by x = Qy. Notice that with this technique fill-in is limited to the B_ii submatrices, and row or column interchanges within these submatrices do not affect the rest of the structure. Furthermore, reduction to block triangular form can be done in O(n) + O(r) time, which is often more than offset by the savings in storage and computations of block triangular form.

PAQ = [ B11 B12 B13 B14 ]
      [     B22 B23 B24 ]
      [         B33 B34 ]
      [             B44 ]

Figure 2-10. Block Triangular Form Matrix

As a final set of notes on block triangular form, it has been proved that the block triangular form for a nonsingular matrix is essentially unique (to the extent that other equivalent forms are merely a reordering of the diagonal blocks). Also, empirical studies have illustrated a high correlation between asymmetric structure and block triangular reducibility. Once the block triangular form has been found, it can be exploited for the subsequent factorization, which begins with an analysis of the sparsity structure in the case of a sparse matrix.
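The block back substitution B_ii y_i = (Pb)_i − Σ_{j>i} B_ij y_j can be sketched as follows; 'solve' is an assumed helper (e.g. a dense LU solve) for one diagonal block system, and the blocks are held as plain nested lists for illustration:

```python
def block_back_solve(B, c, solve):
    # Back substitution over an upper block triangular system.
    # B[i][j] holds block B_ij as a dense nested list (None where zero,
    # j >= i); c[i] is block i of Pb; solve(M, rhs) solves M y_i = rhs
    # for one diagonal block. Returns the blocks of y.
    N = len(c)
    y = [None] * N
    for i in range(N - 1, -1, -1):
        rhs = list(c[i])
        for j in range(i + 1, N):                # rhs -= B_ij * y_j
            if B[i][j] is not None:
                for r in range(len(rhs)):
                    rhs[r] -= sum(B[i][j][r][k] * y[j][k]
                                  for k in range(len(y[j])))
        y[i] = solve(B[i][i], rhs)               # B_ii y_i = rhs
    return y
```

Only the diagonal blocks are ever factorized; the off diagonal blocks enter purely through matrix-vector multiplications, which is exactly the computational saving the form provides.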
Analyze Phase: Not present in dense matrix factorization algorithms, the analyze phase of a sparse matrix algorithm determines how the sparsity structure of the matrix can be exploited to reduce the computations necessary for factorization. This phase manifests itself as the selection of a pivot ordering and a symbolic factorization to determine the pattern of the LU factors. The primary consideration in this phase is to minimize the amount of fill-in that occurs as factorization progresses. This in itself is a very difficult problem. In fact, minimum fill-in pivot selection is NP-complete even in the restricted case where the diagonal consists of all nonzero entries and all pivots are chosen from the diagonal [141]. However, there are also other considerations. They include the selection of pivots to preserve numerical stability, as well as parallel and vector processing considerations. Furthermore, the selection of a minimum fill-in pivot sequence may be so costly as to overshadow any potential savings in the factorization. As a result, a number of heuristic techniques have been developed for the analyze phase. When the matrix has an unsymmetric pattern, the techniques most frequently used are based on the Markowitz Criterion.

Markowitz Criterion: When selecting the kth pivot for the factorization of the active submatrix A^(k), the Markowitz Criterion chooses the a_ij^(k) entry that is numerically acceptable and minimizes the product of the corresponding row and column degrees [106]. The row degree for an entry is simply the number of nonzeros currently in the entry's row of the active submatrix. Likewise, the column degree is the number of nonzeros in the entry's column of the active submatrix. These degrees shrink with the active submatrix as the effects of earlier pivot selections are taken into account, but they may also grow due to fill-in. The idea is that this strategy will modify the least number of coefficients in the remaining submatrix.
Drawbacks of this approach are that it requires maintaining updated row and column degrees at each step of the factorization and that both row and column access to the matrix pattern is required. When implementing the Markowitz Criterion, a typical approach is to order the current row and column degrees in increasing order, with all equal values in a single doubly linked list indexed by a pointer array. The search then commences in increasing order until an entry is found that meets the pivot criteria. Additional searches are done until it can be assured that no better entry can be found. It is also possible to break ties based on the entry with the best numerical characteristics. Importantly, the row and column degrees should only be calculated from scratch once at the beginning and thereafter maintained with expedient incremental updates.

2.4 Multifrontal Methods

The multifrontal approach to sparse matrix factorization is based on a decomposition of the matrix into a sum of smaller dense submatrices. Factorization of these smaller matrices (called elements or frontal matrices) can be done using dense matrix operations that do not require the indirect addressing required of conventional sparse matrix operations. The multifrontal approach was first developed by Duff and Reid for symmetric, indefinite matrices [54] and then extended to unsymmetric matrices [55]. However, the multifrontal approach has been used most extensively for the Cholesky factorization of symmetric, positive-definite matrices [70, 65, 67]. Most recently, Davis and Duff have generalized the method to take advantage of unsymmetric-pattern matrices [34, 35].
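The Markowitz cost is commonly written as (r_i − 1)(c_j − 1), the maximum number of entries a candidate pivot could modify; a brute-force sketch of the selection (the degree-ordered linked-list search and incremental degree updates described above are elided, and the numerical acceptability test is modeled as a predicate):

```python
def markowitz_pivot(pattern, acceptable):
    # pattern: set of (i, j) nonzero positions in the active submatrix.
    # acceptable(i, j) -> bool models the numerical (threshold) test.
    # Returns the acceptable entry of minimum Markowitz cost, or None.
    row_deg, col_deg = {}, {}
    for i, j in pattern:               # degrees of the current pattern
        row_deg[i] = row_deg.get(i, 0) + 1
        col_deg[j] = col_deg.get(j, 0) + 1
    best, best_cost = None, None
    for i, j in pattern:
        if not acceptable(i, j):
            continue
        cost = (row_deg[i] - 1) * (col_deg[j] - 1)
        if best_cost is None or cost < best_cost:
            best, best_cost = (i, j), cost
    return best
```

A real implementation searches in increasing degree order and stops once no better entry can exist; the full scan here is only to make the cost being minimized explicit.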
One of the major advantages of a multifrontal method is that the regularity found in the dense matrix operations can be used to take advantage of advanced architectural features such as vector processors, hierarchical memories, and multiprocessors with shared memories or with distributed memories connected by regular-pattern communication networks [5, 7]. Furthermore, when frontal matrices do not overlap in their pivot rows and columns, the factorization steps associated with those pivots can be done concurrently [46]. This provides an additional degree of parallelism that can be exploited by multiprocessors. The combination of these two sources of parallelism, within dense frontal matrix operations and between frontal matrices, has been shown to provide a significant amount of potential speed-up in several studies [83, 51, 121, 122].

Multifrontal methods for sparse matrices were originally developed for matrices resulting from finite element problems, where the coefficient matrix is the sum of a number of dense symmetric matrices that correspond to the elements of the original model (hence the use of element to describe the frontal matrices). As rows and columns of the coefficient matrix are assembled from all the contributing elements, the factorization associated with that row/column pivot can be performed even though other rows and columns have not yet been fully assembled. Later, these techniques were extended to other matrices by artificially decomposing the matrix into a sum of frontal matrices. With finite element and other symmetric-pattern matrices, the frontal matrices are square. Furthermore, an assembly graph can be built with the frontal matrices as nodes and with edges describing the overlaps between frontal matrices. The assembly graphs associated with symmetric matrices always form tree structures. Such a structure simplifies both the task mapping and scheduling problems when applying multiprocessors to the method.
Recent efforts by Davis and Duff have focused on extension of the method to unsymmetric-pattern matrices [34, 35]. With such matrices, the frontal matrices are no longer guaranteed to be square and are typically rectangular. Furthermore, the assembly graph is no longer a tree. Instead, it becomes a directed acyclic graph (DAG). The rest of this section describes the multifrontal method applied to unsymmetric-pattern matrices.

Frontal Formation: A central issue of the multifrontal method is the decomposition of the matrix into frontal matrices. As the amount of computation is proportional to the size of the frontal matrix, and as distinct memory is typically allocated for each frontal matrix, the primary objective is to minimize the size of the frontal matrices. Moreover, as the amount of fill-in is typically proportional to the size of the frontal matrix, minimizing frontal matrix size will also tend to minimize fill-in. This is especially important in the earlier stages of the algorithm, as early fill-in will likely propagate into larger memory and computational requirements for the later frontal matrices. The approach taken by Davis and Duff to minimize frontal size is to base frontal matrix determination on selection of a pivot with minimal Markowitz cost. In order to preserve numerical stability, threshold pivoting is often added to the pivot selection criteria. Criteria stronger than the Markowitz cost, such as minimum actual fill-in, are typically too costly computationally and provide little additional benefit. As the Markowitz cost requires that updated row and column degrees be maintained, approximate degrees are frequently used.

Frontal Factorization: Once a frontal matrix has been determined, the factorization associated with its pivot can be done.
This involves updating the corresponding row of the U factor via a copy operation, computing the pivot column (via a divide by the pivot value) and copying it into the corresponding column of L, and updating the remaining frontal matrix entries. This updating process is referred to as the Schur complement, where each entry is computed as follows:

a'_ij = a_ij − a_ik (a_kk)^(-1) a_kj

which is equivalent to:

a'_ij = a_ij − l_ik u_kj.

The −l_ik u_kj contribution to the a_ij entry need not be applied immediately and, in fact, distinct updates to the same entry can be applied in any order. Hence only the pivot row and column are needed for the frontal matrix. The update terms can be calculated and saved until their row/column becomes pivotal, or they can be applied to the entry earlier if convenient. This can significantly reduce the required amount of data sharing.

Assembly: The process of applying updates to an entry is called assembly. Consider the following example of Figures 2-11, 2-12, and 2-13, where each Pi represents a pivot, letters are nonzeros, and dots are zeros.

Figure 2-11. Sample Sparse Matrix

The first frontal matrix of A in Figure 2-11 is E1 and consists of rows 1, 2, 3, and 5 together with columns 1, 3, and 4, as shown in Figure 2-12.

E1 = [ P1  a  b ]
     [ c   d  . ]
     [ f   .  . ]
     [ k   .  . ]

Figure 2-12. Frontal Matrix for First Pivot

With E1, fill-in occurs in entries (2,4), (3,3), (3,4), (5,3), and (5,4). Furthermore, entry a_2,3 is updated to d' and must be assembled into E2 (shown in Figure 2-13), as it is in the pivot row of E2. Furthermore, the fill-in occurring in entries (2,4) and (3,4) of E1 will cause E2 to be expanded by this column. While the contributions to entries a_2,3 and a_2,4 are assembled into E2, it would also be convenient to assemble E1's contributions to entries a_3,3 and a_3,4.
However, none of the other entries in the Schur complement of E1 can be assembled into E2, and thus E1 must be held until the rest of its contributions can be assembled into later frontal matrices. The inability of E2 to accommodate all of E1's contributions is referred to as partial assembly and occurs only in the unsymmetric-pattern version of the method. In the symmetric-pattern version, all contributions can be passed together to a single subsequent frontal matrix.

Figure 2-13. Frontal Matrix for Second Pivot

Assembly DAG Edges: Whenever a data dependency exists between frontal matrices, an edge in the assembly DAG is used to represent the dependency. In the previous example of E1 and E2, there would be a directed edge (E1, E2). Furthermore, as the row pattern of E1 contains the pivot row of E2 and the column pattern of E1 does not contain the pivot column of E2, the relationship indicated by the edge (E1, E2) is referred to as an L relationship, where E1 is an L child of E2 and E2 is an L parent of E1. The four matrix patterns of Figure 2-14 illustrate the four possible relationships (no relationship, L relationship, U relationship, and LU relationship) between the frontal matrices induced by pivots P1 and P2.

Figure 2-14. Possible Relationships between Frontal Matrices

The edges in the assembly DAG show not only data dependencies but also control dependencies. Transitive reductions in the DAG are possible to describe the precedence relationships more efficiently and to reduce the data passing.

Pivot Amalgamation: Up to this point in the discussion, each frontal matrix has been built using a single pivot and updated via a single factorization step. When additional pivots are identified that have identical (or nearly identical) row and column patterns, the pivots can be amalgamated into a single frontal matrix. This allows several pivots' contributions to be calculated and combined.
In particular, consider the generalized example below of an r x c frontal matrix, where F is the k x k pivot block that includes the k pivots on its diagonal. The matrix F will be factored into the product L1U1 in the frontal matrix's factored form. C^T is a k x (c − k) submatrix that will be updated to become part of the U factor (specifically U2 in the frontal's factored form). Likewise, L2 will be formed from B, which is an (r − k) x k submatrix. Finally, the remaining (r − k) x (c − k) submatrix D will be updated by the Schur complement of the factorization into D'.

[ F  C^T ]      [ L1\U1  U2 ]
[ B  D   ]  ->  [ L2     D' ]

The update to D is formed as D' = D − L2U2, which is a rank-k update to D. In the single pivot case this is just a rank-1, or outer product, update commonly found in BLAS 2 (Basic Linear Algebra Subroutines, Level 2) routines. However, in this more general case, the update is a rank-k update that can be implemented via matrix multiplication in BLAS 3 routines. These higher level BLAS routines have a richer computational structure that can be used to exploit the advanced architectural features of a particular system. Hence, amalgamation can significantly improve the speed of frontal matrix numerical factorizations. As alluded to earlier, the row and column patterns of two pivots do not need to be identical for amalgamation to be applied. If they are nearly identical, amalgamation can still take place at the cost of a slightly larger frontal matrix with additional fill-in.

Separate Symbolic and Numerical Factorizations: When multiple matrices of identical nonzero structure are to be factored, it is possible to separate the symbolic and numerical factorization phases. When this is done, a single symbolic factorization of the matrices' common pattern can be performed, followed by a numerical factorization for each distinct matrix. The symbolic factorization will determine the pivot ordering, develop the assembly DAG, and determine the nonzero structures of the LU factors.
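The partitioned frontal factorization described above can be sketched in plain Python (names hypothetical; the k amalgamated pivots are assumed numerically acceptable, so no permutations are performed inside the pivot block):

```python
def factor_frontal(F, k):
    """Eliminate the leading k x k pivot block of the r x c frontal F.

    Returns a copy in which columns 0..k-1 hold the multipliers of
    [L1; L2] below the diagonal, rows 0..k-1 hold [U1 U2], and the
    trailing (r-k) x (c-k) block holds the Schur complement
    D' = D - L2*U2.
    """
    F = [row[:] for row in F]           # work on a copy
    r, c = len(F), len(F[0])
    for p in range(k):                  # one rank-1 update per pivot
        for i in range(p + 1, r):
            F[i][p] /= F[p][p]          # multiplier l_ip
            for j in range(p + 1, c):
                F[i][j] -= F[i][p] * F[p][j]
    return F

# Example: a 4 x 5 frontal matrix with a 2 x 2 pivot block.
F = [[2.0, 1.0, 0.0, 1.0, 0.0],
     [4.0, 3.0, 1.0, 0.0, 1.0],
     [2.0, 2.0, 1.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0, 1.0]]
G = factor_frontal(F, 2)
schur = [row[2:] for row in G[2:]]      # D' = D - L2*U2
```

In a tuned implementation the k rank-1 updates in the inner loops would be fused into a single rank-k matrix multiplication (a BLAS 3 GEMM call), which is the source of the speed advantage discussed above.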
(Symbolic factorization is essentially the same operation as what was previously referred to as the analyze phase.) The subsequent numerical factorizations will then make use of this information for the actual numerical factorizations.

Loss of Anticipated Pivots: An important and difficult problem exists, however, with the separation of symbolic and numerical factorizations. Specifically, the symbolic factorization is not able to make use of numerical values in the selection of pivots. Hence, specified pivots in the ordering may become numerically unacceptable. When such an occurrence is detected during a numerical factorization, special recovery measures are necessary. An example of this situation can be seen in Figure 2-15.

A1 = [ 2 1 0 4 ]      A2 = [ 2 8 0 4 ]
     [ 1 4 1 0 ]           [ 1 4 1 0 ]
     [ 0 3 2 0 ]           [ 0 3 2 0 ]
     [ 1 2 0 4 ]           [ 1 2 0 4 ]

Figure 2-15. Lost Pivot Example

In matrix A1, the a_2,2 entry is a perfectly acceptable second pivot. However, as the values change in matrix A2, the update to the a_2,2 entry, accomplished via

a'_2,2 = a_2,2 − (a_2,1 / a_1,1) a_1,2,

causes a_2,2 to become zero (4 − (1/2)(8) = 0) and no longer acceptable as a pivot. Furthermore, a pivot's numerical value need only be altered enough to drop it below the threshold pivoting limits in place for it to become unacceptable. Also interesting to note is that the actual pivot entry did not need to change for it to fail to be acceptable. Other entries may affect the anticipated pivot's value during previous steps of the elimination, as was the case in the example provided.

Davis and Duff suggest two possible approaches to dealing with the loss of anticipated pivots [34, 35]. They are: Force the amalgamation of the lost pivot with subsequent frontal matrices during the numerical factorization. This creates a larger pivot block from which permutations can be used to select alternative pivots. The drawback is that the row and column patterns of the amalgamated frontal matrices could be quite different, resulting in catastrophic fill-in and loss of inter-frontal parallelism.
Amalgamate the frontal matrix containing the lost pivot with its first LU parent (assuming one exists). This will limit additional fill-in; however, such an LU parent is not guaranteed to exist, in which case forced amalgamation will be necessary.

While the possibility of lost pivots complicates the factorization of sequences of identically structured sparse matrices, the unsymmetric-pattern method with separated analyze and factorize phases does provide significant potential benefits for dealing with such sequences. Yet the question of where such sequences arise is still open. The next section provides general mathematical settings in which such sequences occur and lists sample application areas.

2.5 Factorization Sequences of Matrices

The factorization of a sequence of sparse matrices that maintain an identical nonzero structure occurs frequently in a variety of applications. The principal categories of such applications are those that use a Newton method-based approach to solving a system of nonlinear algebraic equations and those that require the solution of a system of ordinary differential equations by an implicit method. Both of these categories are discussed in this section.

Systems of Nonlinear Algebraic Equations: Newton's method is a very common technique in numerical analysis that is used to find the zeros of a function via an approximation of the function based on a first order Taylor series expansion [13]. For one-dimensional real-valued functions, the first order Taylor series expansion is f(x) = f(x0) + f'(x0)(x − x0). When solving for α such that f(α) = 0, we replace x with an approximation x1 of α and assume f(x1) = 0. The result is 0 = f(x0) + f'(x0)(x1 − x0), which can be solved for x1 and generalized into the following iteration formula by replacing x1 with x_k+1 and x0 with x_k:

x_k+1 = x_k − f(x_k) / f'(x_k).
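The scalar iteration above can be sketched directly (a minimal sketch; names and tolerances are hypothetical):

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k - f(x_k)/f'(x_k) until |f(x_k)| < tol."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / fprime(x)
    return x

# Example: the positive root of f(x) = x^2 - 2 is sqrt(2).
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```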
The method can be generalized from strictly real (or complex) functions to any mapping between Banach spaces by using the Frechet derivative [93]. A Banach space is a normed vector space that is complete (that is, all Cauchy sequences converge to an element of the space). The Frechet derivative of f at x0 is the bounded linear operator L(x0) that satisfies

lim_{y -> 0} ||f(x0 + y) − f(x0) − L(x0)y|| / ||y|| = 0,

where f is a function mapping one Banach space to another and the || ||'s are norms appropriate to the given Banach spaces. Systems of nonlinear algebraic equations can be considered a mapping from n-dimensional Euclidean space (R^n), which is a Banach space, back to itself, defined by

f(x) = [ f_1(x_1, x_2, ..., x_n) ]
       [ f_2(x_1, x_2, ..., x_n) ]
       [ ...                     ]
       [ f_n(x_1, x_2, ..., x_n) ]

for f_i mapping R^n to R^1. The Frechet derivative of such a mapping is simply the Jacobian of the function (denoted ∇f), which is an n x n matrix of partial derivatives whose ith row, jth column entry (evaluated at a given x0) is ∂f_i/∂x_j. The Newton iteration formula thus becomes

x_k+1 = x_k − ∇f(x_k)^(-1) f(x_k)

within this context. The problem with this formula is the requirement to compute the inverse of the Jacobian. In order to avoid this costly (and sparsity-destroying) process, we define Δx_k = x_k+1 − x_k, manipulate the formula into ∇f(x_k) Δx_k = −f(x_k), and determine x_k+1 = x_k + Δx_k. As the Jacobian ∇f(x_k) is a linear operator and −f(x_k) is a vector, the computation now consists of simply solving a linear system of equations. The focus of this work is on instances where the Jacobian is sparse, which is typical of most large systems. The nonzero entries of the Jacobian can change values with the various x_k's, but the nonzero structure will remain constant. One application area of such nonlinear algebraic equations is chemical engineering.
Specifically, Zitney discusses the solution of such systems for distillation flowsheeting problems where the Jacobians have an unsymmetric pattern and fail to have favorable numerical properties such as diagonal dominance [144, 143].

Systems of Ordinary Differential Equations: Another major type of problem where the factorization of sequences of identically structured matrices is required is the solution of systems of ordinary differential equations using an implicit method. The predominant numerical methods for such tasks, when initial values are provided, are linear single- and multi-step techniques, due mainly to their ease of implementation and analysis [107]. These techniques divide the independent axis into steps or mesh points across which the differential equations are integrated via a discrete summation method. A generic form of the problem would be x' = f(x), with x a vector function of time (t), x' the first time derivative of x, and the initial condition x(0) = x0. A typical linear single-step technique would use

x_n+1 + α_1 x_n = h [β_0 f(x_n+1) + β_1 f(x_n)]

to solve for x_n+1, where x_i denotes x(t_i), t_i the time at time step i, h the time step size, and α_1, β_0, and β_1 constants. If β_0 = 0 then the method is explicit; that is, x_n+1 can be solved for directly. If β_0 ≠ 0 then the method is implicit, and an iterative method is required to solve for x_n+1. Linear multi-step techniques are similar but use more of the previously predicted values x_i, and function evaluations at those x_i's, for the next prediction. In particular, they take the general form

x_n+1 + Σ_{i=1..p} α_i x_{n−i+1} = h β_0 f(x_n+1) + h Σ_{i=1..p} β_i f(x_{n−i+1}),

where β_0 = 0 again means an explicit method and β_0 ≠ 0 means an implicit method. Specific methods vary in their value for p and the weights given by the α_i's and β_i's. Such methods include Adams-Bashforth (explicit), Adams-Moulton (implicit), Nyström, Newton-Cotes, and the Backward Differentiation Formula [107, 124, 90].
For reasons of stability, implicit methods are preferred for "stiff" (that is, ill-conditioned) problems. In particular, explicit linear multi-step methods cannot meet A-stable criteria and can only meet A(α)-stable criteria if p > 3 [107]. These stability conditions relate the effectiveness of a method to the conditioning of the problem as evidenced by the eigenvalues of the corresponding matrix. Hence, unless only "non-stiff" (well-conditioned) problems are to be solved by a linear multi-step method, the method should be implicit. (Other classes of alternative nonlinear methods will be discussed briefly later.)

When using implicit methods, the linear multi-step method given generically above can be reorganized to accumulate only the terms dependent upon x_n+1 on the left side as follows:

(I − h β_0 f)(x_n+1) = −Σ_{i=1..p} [α_i x_{n−i+1} − h β_i f(x_{n−i+1})].

As the right-hand side is constant within a time step, we can define it equal to b. Hence, our task within a time step becomes finding the solution x in F(x) = b, where x replaces x_n+1 and F(x) = (I − h β_0 f)(x). For f defining a linear system of ODEs, this simply requires the solution of a linear system of equations. However, when the system is stiff, the coefficient matrix corresponding to F is ill-conditioned, and direct solution methods can be too inaccurate. Thus Newton's method can be used to solve for the zero of the modified function G(x) = F(x) − b = 0. This has proven useful in practice due to the quadratic convergence of Newton's method [107, 90]. The same approach is used when f is a nonlinear function (representing a system of nonlinear ODEs). When Newton's method is applied within each time step as above, each Newton iteration requires solving a linearized system (as discussed in the previous subsection on systems of nonlinear algebraic equations). Specifically, the equation

∇G(x^(i)) Δx^(i) = −G(x^(i))

must be solved for Δx^(i), where x^(i) is the ith approximation to x_n+1, ∇G is the Jacobian of G, and x^(i+1) = x^(i) + Δx^(i).
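The implicit time-stepping loop just described can be sketched for the simplest implicit method, backward Euler, on a tiny 2 x 2 system (all names hypothetical). The point to notice is that the matrix I − h·Jf solved at every Newton iteration of every time step has the same nonzero structure throughout, which is exactly the sequence of identically structured matrices this work targets:

```python
def solve2(J, b):
    """Solve a 2 x 2 linear system J x = b by Cramer's rule."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    return [(b[0] * J[1][1] - b[1] * J[0][1]) / det,
            (J[0][0] * b[1] - J[1][0] * b[0]) / det]

def backward_euler(f, Jf, x0, h, steps, newton_iters=10):
    """Integrate x' = f(x) with backward Euler, solving each implicit
    step G(y) = y - h f(y) - x_n = 0 by Newton's method."""
    x = list(x0)
    for _ in range(steps):
        xn, y = list(x), list(x)        # y: Newton iterate for x_{n+1}
        for _ in range(newton_iters):
            fy = f(y)
            G = [y[i] - h * fy[i] - xn[i] for i in range(2)]
            # Jacobian of G: I - h * Jf(y); same structure every step.
            J = [[(1.0 if i == j else 0.0) - h * Jf(y)[i][j]
                  for j in range(2)] for i in range(2)]
            d = solve2(J, [-g for g in G])
            y = [y[i] + d[i] for i in range(2)]
        x = y
    return x

# Decaying linear test system x' = A x with A = [[-2, 1], [1, -2]].
A = [[-2.0, 1.0], [1.0, -2.0]]
f = lambda x: [A[0][0] * x[0] + A[0][1] * x[1],
               A[1][0] * x[0] + A[1][1] * x[1]]
x_end = backward_euler(f, lambda x: A, [1.0, 0.0], h=0.1, steps=50)
```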
The structure of ∇G remains constant both within and between time steps, thus producing a sequence of matrices with identical structure that must be factorized and solved.

Other methods also exist and are particularly useful for solving stiff systems. These include nonlinear methods such as Certaine's method, Jain's method, and the class of Runge-Kutta techniques [107]. Certaine's method is also an example of another class, called predictor/corrector methods, where an explicit technique is used to predict a first approximation to x_n+1 and then an implicit method is used to correct (refine) the approximation. Furthermore, there are both implicit and explicit versions of the numerous Runge-Kutta techniques. Whenever implicit techniques are used, the factorization of a sequence of matrices is a likely subtask. However, with some of the techniques these matrices take on an already factored triangular or even diagonal structure. Yet the bottom line is that the factorization of a sequence of general sparse, identically structured matrices is a common subproblem in solving systems of ODEs via an implicit method. Furthermore, these systems are frequently encountered in applications that include nuclear magnetic resonance spectroscopy, computational chemistry, and computational biology [147]. More specifically, Miranker provides examples of systems of nonlinear ODEs in the circuit simulation of tunnel diodes commonly used in high speed circuits, the thermal decomposition of ozone, and the behavior of a catalytic fluidized bed [107]. Electronic circuit simulations are another common source of large systems of nonlinear, first order ODEs, as evidenced in the modified nodal analysis approach used by Saleh and Kundert [131, 97] and the sparse tableau formulation of Hachtel [82].

2.6 Parallel Matrix Computations

A major emphasis of this research effort is to perform the numerical factorization of sequences of sparse matrices in parallel.
Furthermore, the unsymmetric-pattern multifrontal approach provides dense submatrices which may also be factorized in parallel. This section summarizes some previous efforts that explored distributed memory implementations of both dense and sparse matrix factorizations.

Distributed Memory Factorization: In distributed memory environments, algorithms are heavily dependent upon the storage allocation scheme for the matrix. The two most common schemes are row oriented [22] and column oriented [64, 40, 27, 76]. Blocking schemes of rows or columns are frequently used, and Dongarra and Ostrouchov [40] discuss such methods together with the need for adaptive blocking mechanisms. Choi, Dongarra, Pozo, and Walker explore the various data allocation schemes and associated algorithms and develop a generalized block scattered approach, with data blocked and blocks scattered (wrapped) to processors [21]. By varying the parameters of this method, both pure blocked and pure scattered (together with in-between) formats can be achieved. A block-cyclic strategy has recently been proposed by Lichtenstein and Johnsson for dense linear algebra on distributed memory multiprocessors [100]. The block-cyclic strategy is applied to both rows and columns and is effectively applied to LU factorization and the subsequent triangular solves.

Furthermore, pivoting, if required, can be done in a row or column fashion. When partial pivoting is used, Geist and Heath [62] recommend the inclusion of pipelining to offset the cost of pivoting. For efficiency of pivot determination, row pivoting is preferred with column storage and column pivoting with row storage. A simple illustrative example of a basic (non-pipelined) algorithm (taken from Gallivan, Plemmons, and Sameh [59]) is shown in Figure 2-16. It uses column oriented storage with a row pivoting scheme.
Sparse Matrix Computations: While considerable attention has been paid to the implementation of dense matrix operations on parallel architectures, less has been given to sparse matrix operations, and significant open research remains in this area. Major reasons for delaying such attention are that sparse matrix algorithms are more complex, use more sophisticated data structures, require irregular memory referencing, and are thus much more difficult to implement efficiently.

Each processor executes the following:
do k = 0, n-1
    if ( column k is local ) then
        determine pivot row
        interchange row k with pivot row
        do i = k+1, n-1
            compute multipliers λ_ik = a_ik / a_kk
        enddo
        broadcast the column just completed with pivot index
    else
        receive the column just computed with pivot index
        interchange row k with pivot row
    endif
    for ( all columns j > k that are local ) do
        do i = k+1, n-1
            a_ij = a_ij - λ_ik * a_kj
        enddo
    endfor
enddo

Figure 2-16. Basic Column Fan-Out LU Factorization

Ironically, operations for sparse matrices will typically have a greater degree of potential parallelism than their dense counterparts. This additional parallelism comes in the form of large grain parallelism that may be found in the structure of nonzeros in sparse matrices. The available parallelism in sparse matrix factorization has been studied using a variety of models. Wing and Huang developed a very fine grain model with each divide/update operation on a single entry considered a task [140]. With this model (which restricts the factorization to a given pivot ordering), the maximal parallelism available can be realized. A large grain model was developed by Jess and Kees [91] that defines a task as the factorization and complete corresponding update of a single pivot. Liu [102] establishes a third, medium grain model with the various column operations associated with a single pivot defined as tasks, one task per column.
Ostrouchov, Heath, and Romine [117] investigate whether the rather disappointing results seen so far with Cholesky factorization are due to an inherent lack of parallelism in the problems or to limitations of contemporary hardware. They conclude that parallelism is available, but the relatively high cost of communication (relative to computation) has limited performance. Specifically, on currently available distributed memory multiprocessors, only 20 to 25 percent of potential speed-ups are realized when using a medium grain of parallelization. A most important issue in exploiting available parallelism in a parallel sparse matrix factorization is that of task allocation.

Task Allocation: Task allocation requires the careful balance of two opposing objectives. The first objective is to always have as much useful activity as possible going on in parallel. This typically requires the use of finer grains of parallelism to achieve more potential concurrency. The second objective is to have as little interaction as possible between concurrently executing processors. This frequently conflicts with fine grain parallelism, which tends to require frequent interaction. For very regular computations, interactions can be orchestrated to minimize idle time due to processors waiting for an interaction. However, with sparse matrix routines, this regularity is typically lost due to variations in the pattern of nonzeros. Thus, task allocation becomes more challenging.

In a distributed memory environment, task allocation is especially challenging. Due to the higher cost of communication, dynamically balancing loads can be expensive. Hence, a static allocation is frequently used and tightly coupled to the data allocation scheme. For Cholesky factorization, early column oriented algorithms allocated data and column tasks to processors in a wrapped fashion, proceeding in a bottom-up manner through the elimination tree [63, 71].
This did well to address the load balancing issue but caused excessive communication. Later schemes allocated subcubes to subtrees of the elimination tree and effectively reduced communication [63, 71]. This worked well for the well balanced trees produced by nested dissection orderings, but not for the potentially unbalanced trees of minimum degree orderings. A further refined allocation scheme was developed by Geist and Ng to deal with unbalanced trees in a way that addresses both communication elimination and load balancing [63]. Additional advantages are typically found on contemporary architectures by organizing tasks to use fewer, but larger, messages. Hulbert and Zmijewski propose such a method that combines update contributions into a single message [88]. While some granularity is lost, there are fewer message setup delays, which tend to dominate much of the communication cost in contemporary architectures.

Numerical Factorization: As much of the work in sparse matrix factorization comes in the numerical factorization phase, and as this work is typically the most structured (due both to the nature of the work and the advantages obtained from the earlier phases), this phase has the most exploitable parallelism and has received the most research attention. However, sparse numerical factorization is still much more difficult than its dense counterpart. Most implementations exploit only large and medium grains of parallelism. Within a distributed memory environment, it is important to exploit locality of the data. Using data local to the processor as much as possible avoids the communication and synchronization overheads imposed by message passing between processors.

Chu and George [22] discuss a row oriented dense LU factorization routine for distributed memory environments that could be extended in concept to sparse matrices.
This technique uses a fan-in operation to determine the next pivot row, followed by a fan-out operation on the selected pivot that facilitates the updating of the active submatrix. Another key feature is the dynamic load balancing used to accommodate the use of partial pivoting. Using a column oriented allocation scheme, pivot selection is handled by a single processor (the one owning the current pivot column). The factored pivot column can then be broadcast (or, in the sparse matrix context, perhaps multicast is more appropriate) to the processors holding the other columns of the active submatrix to allow updating. This is the fan-out approach that was used in initial Cholesky factorization routines. The excessive communication overheads of these methods limited their performance. Another class of methods (used for sparse Cholesky factorization) are the fan-in methods [9, 117, 88]. These methods accumulate contributions to the updating of the active submatrix and send fewer but larger messages [9].

An increasingly popular approach to sparse matrix factorization is the class of symmetric-pattern multifrontal methods [105, 76, 51, 121, 122, 77]. Large grain parallelism is available in these methods via independent subtrees of the assembly trees. However, as subtrees combine and parallelism at that level decreases, a switch is made to exploit parallelism at a finer grain within the factorization of a particular frontal matrix. This is especially appropriate as the later frontal matrices are typically much larger. Within a distributed memory environment, a column oriented storage scheme can be used, with frontal matrices allocated to subcubes and frontal matrix factorization done in a fan-out implementation of dense factorization. Assembly involves sending contributions to the appropriate follow-on processors, where they are added to the appropriate entries.
While these distributed multifrontal approaches typically require more communication than a fan-in approach, their performance has been quite promising for the Cholesky factorization of symmetric, positive definite matrices [76, 119, 120]. Specifically, speed-ups of 5.9 to 21.4 were achieved on a 32 processor Intel iPSC/2 using Cholesky factorization on matrices varying in order from 1,824 to 16,129 [119, 120]. Lucas has implemented a distributed memory multifrontal LU factorization routine that assumes numerically acceptable pivots and a symmetric pattern. Speed-ups of up to 10.2 were obtained on 16 processors of an iPSC/2 using electronic device simulation matrices of order 7225. Recent simulation studies of similar methods for the LU factorization of (assumed) symmetric matrices [121, 122] have shown a reasonable parallel potential in distributed memory environments. Here speed-ups of up to 30-40 were predicted for 128 node iPSC/2 and iPSC/860 hypercube topology multiprocessors. Comparison of these results with other simulated architectures reveals that lower communication to computation ratios produced better results (as one would expect). The exploitation of parallelism both in the elimination structure and within the distinct pivot steps has been found to be critical to the success of all of these methods [76, 51, 121].

Recently, fine grain parallelism has been investigated by Gilbert and Schreiber for a variety of sparse Cholesky methods in a SIMD distributed memory environment [77]. The multifrontal approach has demonstrated the greatest potential for exploiting this level of parallelism. A critical subissue in any parallel sparse matrix factorization routine is how to schedule the component tasks efficiently. This issue is discussed in the next section.

2.7 Multiprocessor Scheduling

The precedence constrained multiprocessor scheduling problem has been shown to be NP-complete in both shared and distributed memory environments [137].
Thus, heuristic techniques have been the emphasis of much of the research in this area. Within shared memory environments, a list scheduling scheme is typically used. Under such schemes, tasks are assigned priorities and, when made ready (all predecessors in the partial order have completed), are put into a single priority queue for scheduling by the next available processor [133]. The most common and effective priority schemes are based on a critical path analysis of the precedence DAG, with the priority of a task defined as the heaviest weighted path from that task's node to an exit node (a node with no successors) in the DAG [23]. Variations within these schemes primarily deal with tie breaking, with a Most Immediate Successors First criterion being both popular and effective [94]. Within the context of a distributed memory environment, communication delays (represented by edge weights in the task precedence graph) can become quite significant. Three methods for dealing with such Task with Communications Scheduling (TCS) problems are presented in this section.

ETF - Earliest Task First: The Earliest Task First (ETF) heuristic [89] is based on an extension of the Rayward-Smith computing model [125]. This extended model uses message counts for edge weights in the task graph together with a transmission cost function that defines the communication costs between any two processors. Different network topologies can be represented by varying this transmission cost function. Task computation times are represented as node weights. Rayward-Smith proposed a list scheduling (LS) approach based on a greedy strategy for his original unit execution time/unit communication time (UET/UCT) model [125]. The idea is that no processor should be left idle if there is an available task to execute.
Under the UET/UCT assumption, a task T can be scheduled on processor Pi at time t if T has no immediate predecessors, or each immediate predecessor has been scheduled to start on Pi at time t - 1 or on Pj (i ≠ j) at time t - 2. The length w_g of the schedule produced by this greedy strategy satisfies

w_g ≤ (3 - 2/n) w_opt - (1 - 1/n)

where w_opt is the length of the optimal schedule and n is the number of processors. Hwang, Chow, Anger, and Lee first propose the Extended List Scheduling (ELS) algorithm, which applies the LS approach assuming no communication delays and then adds the necessary communication delays to the initially produced schedule. They show that the ELS method produces schedules bounded by

w_ELS ≤ (2 - 1/n) w_opt + T_max * Σ_i edgeweight_i

where T_max is the maximum transmission cost between two processors. While this seems like a loose bound, the authors show that specific problem instances will asymptotically achieve it. Thus, we see their motivation for the ETF method. Due to its complexity, a number of formalisms are necessary to describe the ETF algorithm. For tasks T and T', let v(T, T') be the number of messages sent on the edge (T, T') in the task graph. Let τ(Pi, Pj) be the transmission cost of communication between processors Pi and Pj. Then, define s(T), f(T), and p(T) as the start time, finish time, and processor assigned for task T by the scheduling algorithm. With these definitions, the time the last message for task T arrives at processor P is r(T, P), which is zero if T has no predecessors and otherwise

r(T, P) = max over T' in Pred(T) of { f(T') + v(T', T) * τ(p(T'), P) }.

The earliest starting time for an available task T is then

e(T) = max{ CM, min over P in Available of r(T, P) }

where CM is the "current moment" of scheduling time. ETF uses a greedy strategy and finds the minimum earliest starting time (e_min) across all available tasks and available processors.
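The greedy selection just defined can be sketched as follows. This is a simplified, hypothetical rendering of ETF: instead of maintaining the CM and NM clocks explicitly, it re-evaluates every ready task on every processor each iteration, which approximates the same earliest-start selection; all function and variable names are assumptions.

```python
# Simplified ETF-style scheduler. preds maps task -> predecessor list,
# v maps edge (T', T) -> message count, tau(Pi, Pj) is the transmission
# cost between processors, exec_time maps task -> computation time.

def etf(tasks, preds, v, exec_time, tau, nprocs):
    start, finish, proc = {}, {}, {}
    free_at = [0.0] * nprocs            # time each processor becomes free
    unscheduled = set(tasks)
    while unscheduled:
        ready = [t for t in unscheduled
                 if all(q in finish for q in preds.get(t, []))]
        best = None
        for t in ready:
            for P in range(nprocs):
                # r(T, P): arrival of the last message, zero if no preds
                r = max((finish[q] + v.get((q, t), 0) * tau(proc[q], P)
                         for q in preds.get(t, [])), default=0.0)
                s = max(r, free_at[P])  # earliest start of t on P
                if best is None or s < best[0]:
                    best = (s, t, P)
        s, t, P = best                  # greedy: globally minimal start
        start[t], proc[t] = s, P
        finish[t] = s + exec_time[t]
        free_at[P] = finish[t]
        unscheduled.remove(t)
    return start, finish, proc
```

For a two-task chain the scheduler keeps both tasks on one processor, since migrating the successor would pay the transmission cost. The full algorithm additionally tracks the "next moment" NM, as described next.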
Due to arbitrary communication delays, if e_min > CM it is possible that another processor may be freed between CM and e_min that would reduce the next earliest starting time. To handle this possibility, ETF uses a second time variable called NM, for "next moment". NM is maintained as the next time at which a processor will become free. If NM > e_min then no earlier starting time is possible and the schedule is set. Otherwise the scheduling decision is postponed. If two tasks come up with the same scheduled start time on the same processor, the task chosen can be arbitrary or according to some priority list scheme. One such priority list scheme that is particularly attractive is that given by critical path analysis. However, the authors base their subsequent analysis of ETF on an arbitrary resolution mechanism. Careful analysis of the ETF algorithm reveals its time complexity to be O(nm^2), where n is the number of processors and m is the number of tasks. Furthermore, they develop an upper bound on the schedule lengths produced by ETF of

w_ETF ≤ (2 - 1/n) w_opt + C_max

where C_max is the maximum communication requirement of any path through the task graph (assuming the T_max transmission cost). This is a significant improvement over the bound produced by ELS. Furthermore, an even tighter bound is possible that reduces the C_max component. The Earliest Task First heuristic thus seems to be a very natural and effective extension of the shared memory environment list scheduling methods.

SSCT - Scheduling with Communication Successors' Priority Tracking: Another algorithm that addresses the scheduling of task graphs with nonzero communication costs is due to Yin [142]. This method defines the combinatorial level of a particular task node as the heaviest node- and edge-weighted path from that node to any accessible leaf node (exit node). The node weights represent task execution times and the edge weights reflect communication costs.
The run priority of a particular task is then defined as the combinatorial level of that task plus the heaviest edge from any immediate predecessor (if one exists) to that node. Figure 2-17 gives a sample task graph with communication costs. The combinatorial levels for T1, T2, T3, T4, and T5 are 15, 14, 7, 8, and 3 respectively. The run priorities, given in the same order of tasks, are 15, 14, 11, 12, and 6.

Figure 2-17. Task Graph with Communication Costs

The first objective is to execute the heaviest paths first. The second objective is to assign tasks to processors in a manner that minimizes the amount of necessary communication. The SSCT algorithm addresses these objectives by scheduling tasks from the ready queue with the highest run priority first (ties are broken using the dictionary ordering of the combinatorial level values of successor nodes). However, assignment of tasks to processors is accomplished so as to minimize the amount of communication necessary. That is, if task T4 is next to be scheduled, it should run on the same processor as task T1 so that the communication cost of 4 from task T1 to task T4 can be avoided (under the assumption that running on the same processor has zero communication overhead). Using this strategy, the SSCT algorithm produces the schedule shown in Figure 2-18.

Figure 2-18. Gantt Chart of Schedule with Communication Costs

Worst case complexity analysis of SSCT yields O(n^2), which reduces to O(n) when the precedence graph is a tree. Simulations on 20 randomly generated task graphs with random edge weights were used to compare SSCT with an algorithm based on topological levels that disregards communication. SSCT produced the better schedule in the large majority of cases and the worse schedule in only a small fraction.
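The combinatorial levels and run priorities quoted for Figure 2-17 can be reproduced with a short computation. The edge costs below are one hypothetical assignment consistent with the values stated in the text; the function names are assumptions.

```python
# Combinatorial level = heaviest node+edge weighted path to an exit node.
# Run priority = combinatorial level + heaviest incoming edge (if any).
# succ maps task -> list of (successor, edge cost); w maps task -> time.

def levels_and_priorities(succ, w):
    level = {}
    def lv(t):
        if t not in level:
            level[t] = w[t] + max((c + lv(s) for s, c in succ[t]), default=0)
        return level[t]
    for t in w:
        lv(t)
    run = {}
    for t in w:
        # heaviest edge entering t from any immediate predecessor
        heaviest_in = max((c for u in w for s, c in succ[u] if s == t),
                          default=0)
        run[t] = level[t] + heaviest_in
    return level, run
```

With node weights 3, 3, 3, 2, 3 for T1 through T5 and edges T1->T3 (1), T1->T4 (4), T2->T3 (4), T2->T4 (3), T3->T5 (1), T4->T5 (3), this yields exactly the levels 15, 14, 7, 8, 3 and run priorities 15, 14, 11, 12, 6 given above.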
MH - Mapping Heuristic: The Mapping Heuristic (MH) introduced by El-Rewini and Lewis takes both communication and contention into account when addressing the TCS problem [57]. The MH algorithm accepts a task graph description of the parallel program and a description of the target machine topology and produces a Gantt chart that shows the allocation of tasks to processing elements and their execution time frames. El-Rewini and Lewis illustrate the method by first presenting it without considering contention and then separately presenting the extensions needed to account for contention in the communication links. The MH algorithm proceeds in three steps:

1. A priority is assigned to each task in a fashion resembling the critical path method mentioned earlier. In particular, the heaviest weighted path from each task node to a leaf (exit node) in the task graph is determined using execution times for node weights and communication time estimates for edge weights. (At this point communication time estimates assume a fully connected network with no contention.) In the case of a priority tie, the task with the largest number of immediate successors is given the higher priority. If this does not break the tie, a random selection is used. An event list is initialized with tasks that have no predecessors and are thus initially "ready" for execution.

2. As long as the event list is not empty, entries are taken from it. These events will be either "task is ready" or "task is done". For "task is ready" events, a processing element is selected such that the task cannot complete any sooner on any other processing element. This selection takes into account the machine topology and the number of communication links that messages must be sent across prior to the initiation of the task's computation. The task is then scheduled on the selected processor and will signal a "task is done" event upon completion.
For "task is done" events, all immediate successors are notified, and any successor that becomes "ready" as a result of the notification adds a "task is ready" event to the event list.

3. The second step is repeated until all of the tasks have been scheduled.

Notice that this algorithm does not treat communication cost uniformly. That is, the priority is assigned assuming a fully connected network, but the actual task allocation is accomplished based on a specific (and typically less connected) topology. Thus multiple hops may be required that were not accounted for in the priority assignment. Furthermore, scheduling communicating tasks on the same processor assumes that communication will take zero time. This assumption is also not accounted for in the priority assignment. The extension to the MH algorithm that considers contention is accomplished by maintaining a table of ongoing communication such that the best currently available communication path can be determined. Specifically, the tables record, for each pair of processors, the number of hops on the currently preferred path between these processors, the preferred first communication link, and the current delay associated with this path. Table maintenance is triggered by two additional event types for the event list. The "sending message" event causes the contention tables to be updated to reflect the additional links made busy by the message. The "message received" event causes the tables to be updated to reflect the links that are now free. Notice that this mechanism correlates to a virtual circuit type of communication. If the underlying communication is packet switched, the view presented by this method is too pessimistic. However, the authors opt against a lower level tracking mechanism for reasons of efficiency. El-Rewini and Lewis ran a number of simulations of their MH algorithm and variants of it on both actual and randomly generated task graphs.
Among their more interesting results is that priority assignments that neglect communication produced better schedules for computation-intensive task structures, while priority assignments that consider communication produced better schedules for communication-intensive task structures. The delineation between "computation-intensive" and "communication-intensive" was a communication to computation ratio of one. These three methods provide insight into how the task with communication scheduling problem might be addressed for scheduling an unsymmetric-pattern multifrontal method elimination DAG on a distributed memory multiprocessor.

Chapter Conclusion: This chapter established the necessary background for this research effort and summarized pertinent related efforts. To this end, the use of LU factorization to solve a system of linear equations was described, and the issues of algorithm stability and error analysis were addressed. Key sparse matrix concepts were then defined. Details of Davis and Duff's unsymmetric-pattern multifrontal method were provided, as this entire research effort is based on this method. The focus on factorizing sequences of sparse matrices was then justified by presenting the mathematical formulation for two key types of problems that give rise to such sequences. Specific applications were also identified within each type of problem. Related efforts in parallel, distributed memory algorithms for both dense and sparse matrix factorizations were summarized, with specific results providing performance goals for this effort. Finally, previous results in the key sub-topic of multiprocessor scheduling for distributed memory environments were discussed. With this background established, the next chapter will highlight key hardware and software features of this effort's implementation platform.
CHAPTER 3
IMPLEMENTATION PLATFORM

A major thrust of this research effort will be to evaluate the performance of derived algorithms through experimental measurements. This will be accomplished on the nCUBE 2 distributed memory multiprocessor. The next two sections summarize key aspects of the nCUBE 2's hardware architecture and software environment. Information for these sections was obtained from the nCUBE 2 Systems Technical Overview [112], the nCUBE 2 Programmer's Guide [110], the nCUBE 2 Processor Manual [109], and the on-line ndoc documentation manager [113]. The third and final section describes an effort to experimentally characterize the performance of the nCUBE 2 multiprocessor.

3.1 Hardware Architecture

The nCUBE 2 supercomputer is a distributed memory, multiple instruction multiple data (MIMD) multiprocessor based on a binary hypercube interconnection topology. Configurations range from 8 to 8192 processors (3 to 13 dimension hypercubes). Each processor is a custom, 64-bit, VLSI (very large scale integration) processor with an integrated 64-bit floating point unit. Also integrated into the processor's design is an error correcting memory management unit with a 39-bit data path and full message routing circuitry. There are 14 bidirectional DMA (direct memory access) channels, with 13 used for node interconnections and one for input/output. Processor node configurations contain from 4 to 64 Mbytes of memory. Processor options include both 20 MHz and 25 MHz versions of the custom processor. Performance of the 25 MHz processor is rated at 15 VAX MIPS, 4.1 Mflops (single precision), and 3.0 Mflops (double precision). Instruction execution times will vary, however, due to hits/misses in the 128 byte instruction buffer, data alignments in memory, and memory contention due to IO, refresh cycles, etc. In addition to their use as node processors, additional custom 64-bit processor chips are also used in a separate, but connected, input/output array.
These IO processors provide connections to disk arrays, tapes, graphics facilities, signal processors, and networks. The nChannel board connects up to 128 channels from the processor array to 16 serial channels. Each serial channel is controlled by a single IO processor. Eight of the serial channels are devoted to SCSI peripheral controllers that can each handle up to seven SCSI peripherals. Other serial channels provide ethernet, analog/digital and digital/analog conversion, digital input and output, video, and tape interfaces. There is also an nCUBE HIPPI (High Performance Parallel Interface) board for interconnection to other supercomputers, UltraNet hubs, and mass storage devices. An nVision Real-Time Graphics Subsystem provides an integrated graphics and IO subsystem. The processing capabilities are balanced by the very flexible, high performance hypercube interconnection network. Wormhole routing is employed in the network so that messages are not stored in intermediate nodes as they pass through the network. In the absence of path conflicts, messages pass directly from source to destination without delay, so communication latencies are largely independent of path length in a practical sense. The primary manner in which longer paths affect performance is the greater likelihood of path conflicts. The Programmer's Guide suggests a simple performance model for messages less than 100 bytes [110]. The model anticipates a constant overhead for a message send or receive of 100 μsec with a negligible per-link delay. (This simple model is refined experimentally later in the chapter.) This model indicates a relatively low communication to computation ratio compared to other machines in this class [121] and is characteristic of a well balanced design. In more detail, the nCUBE 2 uses an e-cube default routing that progresses from least to most significant bit in the destination address.
At each node on the path, the message is forwarded on the channel whose ID corresponds to the bit position of the least significant bit that differs between the destination node ID and the current node's ID. This scheme precludes the possibility of deadlock if it is used exclusively. While the default routing provides safe and reliable communication, it does not efficiently handle multiple destinations for a single message. Therefore, a broadcast/multicast facility is also available that will send a message to all nodes in the cube (or in a selected subcube) with a single transmission initiation and a single message fetch from memory. A mask is used to define the destination cube/subcube. If a bit in the mask is zero, the corresponding bit in the destination field must match for all receivers. If a mask bit is one, then the corresponding destination bit need not match, and a fan-out duplication of the message occurs for each such bit setting. During such fan-out activities all necessary forwarding channels must be open or else the message will block. Hence, frequent point to point traffic on these channels can livelock the broadcast/multicast operation. In particular, the broadcast/multicast routing will send the message out on all channels with IDs higher than the channel ID upon which the message was received, such that the channel ID for forwarding corresponds to a "don't care" bit setting in the mask. For example, consider a broadcast from node 000 as shown in Figure 3-1. More flexible routing is also possible using the hardware supported message forwarding mechanisms. These use multiple message addressing headers with forwarding bits to provide complete routing freedom. Also, multiple messages from the same source and to the same destination may be aggregated into a single transmission to reduce setup overhead. The nCUBE 2 hardware architecture is complemented by a powerful and robust software environment.
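The e-cube routing rule and the broadcast mask semantics described above can be sketched in a few lines. This is a hypothetical illustration of the addressing arithmetic only; the function names are assumptions.

```python
# Default e-cube routing: at each hop, forward on the channel whose ID is
# the position of the least significant bit that differs between the
# current node ID and the destination node ID.

def ecube_route(src, dst):
    path, node = [src], src
    while node != dst:
        diff = node ^ dst
        channel = diff & -diff   # isolate the lowest differing bit
        node ^= channel          # cross that channel
        path.append(node)
    return path

# Broadcast/multicast mask: receivers are the nodes whose ID matches the
# destination field on every zero bit of the mask; one bits are "don't care".

def broadcast_targets(dest, mask, dim):
    full = 2 ** dim - 1
    return [n for n in range(2 ** dim) if ((n ^ dest) & ~mask & full) == 0]
```

For example, routing from node 000 to node 101 visits 000, 001, 101 (channels 0 then 2), and a mask of 011 with destination 000 in a 3-cube reaches nodes 000 through 011.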
3.2 Software Environment

As the nCUBE 2 is a powerful second generation multiprocessor, the accompanying nCUBE Parallel Software Environment is an extensive, robust, and refined software environment. The environment supports a variety of programming models.

Figure 3-1. 3D Broadcast Example

The two primary models are (1) where node programs are launched directly from the host shell, and (2) where a host program launches the node programs. Within node programs, both the single program, multiple data (SPMD) and multiple program, multiple data (MPMD) models are provided. SPMD uses a single program that is loaded onto all the processors, but each instance of the program can have distinct data with data dependent branching. By contrast, the MPMD model allows different programs to be loaded onto distinct subsets of processors. The different programs can interact via message passing, with data dependent branching still provided. The environment includes the nCX microkernel executive that runs on the node and IO processors, the Host Interface Driver, a variety of standard programming languages, a powerful symbolic and assembly level debugger, extensive software libraries, and a sophisticated profiler. The nCX microkernel is the node/IO processor resident executive. nCX supports process management, local memory management, message passing, multi-tasking, UNIX System V Release 4 system calls, and POSIX signals. The nCUBE Host Interface Driver provides for the connection of the node processors to the host/front-end computer, which is usually a high performance workstation. The nCUBE Parallel Software Environment supports FORTRAN 77, C, and C++, together with the Linda suite of hardware independent extensions for parallel processing and message processing.
A powerful symbolic and assembly level debugger, ndb, provides a full range of functions including:

* examination of variables, the stack, and processor registers
* multiple format data displays
* address map and source code displays
* the use of breakpoints
* single stepping through the code
* command and response logging, and other standard debugging operations

Extensive libraries provide for efficient usage of the features of the nCUBE 2 supercomputer. An extended Run-Time Library provides basic subcube management, including routines to identify a program's position in the subcube, its process ID, the host's process ID, and the dimension of the subcube. Also in this library are interprocessor communication routines such as an asynchronous message send (nwrite), a blocking message receive (nread), and routines to check whether any messages, or a selected range of message types, have been received (ntest and range respectively). These latter routines provide for the development of a nonblocking message receive capability. Processor allocation routines are also included in this library and can be used to select processors in a subcube and launch programs on them. The nCUBE Parallelization Library includes routines for 2 and 3 dimension fast Fourier transforms and matrix transposition; the min, max, and sum determination of either scalar or vector data items of various types; processor sequencing; grid-hypercube mappings; general purpose data allocation; and general purpose subcube broadcasting. The nCUBE Math Libraries contain 375 routines callable from C or FORTRAN and are based on the Seismic Subroutine Standard (SSS) and Basic Linear Algebra Standard (BLAS). These are uniprocessor routines written in assembly language and highly tuned to the nCUBE 2's node processor. The IMSL libraries of FORTRAN subroutines are also available. The nCUBE 2's profiler (PM) is composed of three tools: xtool, ctool, and etool.
The xtool program profiles execution by functions on the various node processors using a time interval sampling technique. The ctool emphasizes analysis of communication and for each node provides:

* time spent in calculation, communication, and IO
* number of calls to each communication routine
* total number of all communication routine calls
* number of errors returned by communication routines
* distribution of sent and received message lengths
* number of objects broadcast

The etool facility provides for the profiling of user-defined events, including the time of each event, the type of event, and the value of a specified variable at the event's occurrence. Profiling results are displayed graphically for easy analysis. Profiling can be done for the entire program by setting an environment variable or can be limited to a specific part of the program by insertion of profiling routines into the code. Altogether these facilities provide a powerful and robust environment for software development and performance-oriented execution.

3.3 Performance Characterization of the nCUBE 2

In order to achieve maximal performance on a specific multiprocessor, both the strengths and weaknesses of the target hardware must be identified, understood, and taken into account during algorithm development. This section summarizes a variety of experiments designed to quantify key performance characteristics of the nCUBE 2 architecture. These results will be used later in Chapter 4 to define a simulation model that will evaluate the achievable parallel performance of the unsymmetric-pattern multifrontal method. The results will also be used to justify subsequent algorithm design decisions in Chapters 5 and 6. Specific areas of investigation include the rate of floating point operations, point-to-point message latencies, and message broadcast/multicast latencies.
Floating Point Operations: The design of the nCUBE 2 processor places a significant emphasis on fast floating point operations and even includes a combined multiply-add floating point instruction. According to the nCUBE 2 Processor Manual [109], a typical floating point operation (i.e., add, subtract, multiply, compare) takes 300-350 nanoseconds, with more complex functions (like divide, square root, and remainder) taking 300-3450 nanoseconds. Additional timing variability occurs due to the variety of addressing modes (26), hits/misses in the 128 byte instruction cache, memory contention (with DMA activities, DRAM refresh cycles, and instruction fetches), and data misalignments. In order to determine a typical floating point operations per second (flops) rate, two experiments were constructed. The first used a single loop that performed floating point divides (as would be typical of the calculation of multipliers in LU factorization). The second experiment used two nested loops that performed a multiply-add operation, as would be used in a typical pivot step of LU factorization. This second experiment was done with both possible loop orderings in an attempt to determine the effect of unit and nonunit memory strides. The experiments were coded in C and measurements taken with the etool user-defined event profiler. The first experiment, using a single loop of divisions, produced results that fit well to a linear relationship. Specifically, for single precision divisions the equation is time = 4.6 + 2.745n, where n is the number of divisions and time is in microseconds. Likewise, for double precision divides the relationship is time = 4.93 + 3.444n. As arrays are stored by rows in the C language, unit stride is achieved with the row index in the outer loop. When this was done in the second experiment, which used nested loops and a multiply-add operation, the performance was approximately 5% worse than was seen with the column index in the outer loop.
This was a bit surprising given the auto-incrementing possible in the nCUBE 2 processor. The specific results are given below, with all times in microseconds, i as the row index, and j as the column index:

Unit Stride (Single precision): time = 17.24 + 1.56i + 4.88ij
Unit Stride (Double precision): time = 16.60 + 2.71i + 4.72ij
Non-Unit Stride (Single precision): time = 17.41 + 2.05j + 4.635ij
Non-Unit Stride (Double precision): time = 2.78j + 4.629ij, plus a constant term comparable to the other cases

More efficient routines to perform these operations are available in the BLAS (Basic Linear Algebra Subroutines) library resident on the system, but these will be explored later. As these are highly tuned assembly language routines, their performance is expected to be significantly better than these initial experiments.

Point to Point Message Latency: Passing of contributions between frontal matrices will likely be done using point to point message passing. Hence, an understanding of the latencies associated with such activities will be important. As the nCUBE 2 uses wormhole routing (where messages passing through the network are not stored at intermediate nodes), very little dependency on path length is expected. However, message length is expected to be significant. Furthermore, we need to determine two distinct latencies. The first is the message transmission latency. This is the time it takes for a message of a given length to be transmitted and received by a particular pair of processors a given distance apart. The second is the message initiation latency, which is how much time a sending processor takes to initiate the transmission before it can continue. For the first experiment, dealing with the message transmission latency, a test program was written in C and profiled using etool. Within the program, processor node 0 sends a message of a specified byte length to the most distant node in the executing cube/subcube.
It then does a blocking receive, waiting for an equal length acknowledgment message from the other node. The distant node performs a blocking receive and then sends the acknowledgment message. The test was repeated 1000 times for each configuration. For each configuration, message lengths were set at 10, 100, 500, 1000, 5000, and 10000 bytes. Cube/subcube dimensions varied from 1 to 5. A linear regression run on the resulting timings revealed a very strong linear relationship with both message length and cube/subcube dimension (with message length providing a much larger contribution, however). The resulting linear equation (given in microseconds, with message length l in bytes and d the dimension of the subcube) is

time = 170.32 + 4.54d + 0.573l.

The second experiment was designed to determine the message initiation latency. Again using a C program and etool, a set of experiments using the same configurations of message length and cube/subcube size as were used in the first experiment was conducted. This time, however, node 0 just sent one message after another to the distant node without waiting for an acknowledgment. A total of 1000 messages were sent in each configuration. For this second experiment, cube/subcube dimension had no influence on the initiation latency. There is a strong linear correlation with message length, which is expressed by the following equation (again given in microseconds with message length l in bytes):

time = 79.94 + 0.4498l.

The regression for this second experiment did not, however, use the results for the 10 byte message configurations, as messages of 10 to 100 bytes had very similar initiation times. That is, the performance curve flattens in this region. For both of the point to point message experiments, the standard nwrite and nread functions were used.
These functions copy the message from a user buffer to a separate communication buffer (and vice-versa for nread), and the actual message transmission occurs from the communication buffer. Alternative functions are available that use only the communication buffers, but these were not tested. The message broadcast tests described next also use the separate user and communication buffers.

Message Broadcast/Multicast Latencies: From the description of the message broadcast/multicast capabilities of the nCUBE 2 provided in the nCUBE 2 Processor Manual [109], definite performance benefits are anticipated when using the broadcast/multicast mechanisms as opposed to repeated point to point message passing. Such data distribution will be required in the dense matrix kernels that will factorize distinct frontal matrices in parallel. To this end, another set of experiments was developed to quantify both the full message broadcast latency and the broadcast initiation latency. Again, C programs were written and run with the etool profiler. Configurations used message lengths of 100, 500, 1000, 5000, and 10000 bytes and cubes of dimensions 1 to 5. In the full message broadcast latency test, node 0 initiates message broadcasts to all nodes in the active cube and then waits for an equal size point-to-point acknowledgment message from the most distant node in the cube. (The broadcast routing described earlier in this chapter assures that this most distant node is the last to receive the broadcast.) All nodes receive the broadcast, and the most distant node generates the acknowledgments. With the point-to-point message latency well defined by the earlier experiments, its contribution to the overall latency is subtracted to obtain the latency due only to the broadcast.
The results show strong correlations with both message length and cube dimension, together with a very significant product term of message length times cube dimension. The resulting formula (given in microseconds, with message length $l$ in bytes and $d$ the subcube's dimension) is time = 195.22 + 0.122l + 50.02d + 0.45ld. Message broadcast initiation latencies were also investigated using a C program that had node 0 initiate 1000 broadcasts back to back and then wait for a single acknowledgment from the most distant node. (Again, the well-defined acknowledgment message time was subtracted from the overall latency.) The same configurations as in the first experiment were used. The initiation latency results also correlated with message length, cube dimension, and the product of the two. The resulting equation (given in microseconds) is time = 1i,.II + 0.0624l + 54.09d + 0.45ld. Comparison of the point-to-point and broadcast/multicast results indicates that, for cubes of dimension greater than one, the broadcast/multicast mechanism is to be preferred over successive point-to-point transmissions. Chapter Summary: In this chapter, the implementation platform of this research effort has been discussed. Specifically, this platform is the nCUBE 2 distributed memory multiprocessor. First, the features of the hardware architecture were described. Then, an overview of the key software features was provided. In the last section, a series of experiments was designed and conducted to characterize the performance of the nCUBE 2. The resulting characterization will be used in the next chapter to define a realistic simulation model to evaluate the achievable parallelism in the unsymmetric-pattern multifrontal method for the LU factorization of sparse matrices. This characterization will also be used in Chapters 5 and 6, when a parallel version of the unsymmetric-pattern multifrontal method is designed and implemented on the nCUBE 2.
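To summarize the measured characterization numerically, the fitted latency models from this chapter can be collected into a short sketch. This is illustrative only: Python is used here for exposition (the experiments themselves were C programs), the function and parameter names are our own, the coefficients are the regression results quoted above, and only the fully quoted models are included.

```python
def p2p_full_us(l, d):
    """Fitted full point-to-point latency in microseconds
    (l = message length in bytes, d = subcube dimension)."""
    return 170.32 + 4.54 * d + 0.573 * l

def p2p_init_us(l):
    """Fitted point-to-point initiation latency; the fit excluded
    the 10-byte configurations, so it applies for l >= 100 bytes."""
    return 79.94 + 0.4498 * l

def bcast_full_us(l, d):
    """Fitted full broadcast latency over a dimension-d subcube,
    including the significant length-times-dimension product term."""
    return 195.22 + 0.122 * l + 50.02 * d + 0.45 * l * d

def prefer_broadcast(l, d):
    """True when one broadcast beats (2**d - 1) successive
    point-to-point sends to the subcube's other nodes."""
    return bcast_full_us(l, d) < (2 ** d - 1) * p2p_full_us(l, d)
```

Evaluating `prefer_broadcast` reproduces the conclusion above: a single point-to-point send wins for a dimension-1 cube, while the broadcast mechanism wins for all dimensions greater than one.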
CHAPTER 4 EVALUATION OF POTENTIAL PARALLELISM In this chapter, the parallelism available in the unsymmetric-pattern multifrontal method [34, 35] is investigated. The assembly DAG is the principal data structure of this method and describes the data dependencies that exist between the various dense submatrices, called frontal matrices. This assembly DAG is also significant in that it exposes parallelism between the frontal matrices. Moreover, parallelism can also be exploited within frontal matrices, as the factorization of these dense submatrices consists of many regular and independent computations. The models developed in this chapter study both sources of parallelism, used in different combinations and at different levels of granularity. The first objective is to determine the amount of theoretical parallelism available. This is first done by a series of analytical models that assume an unbounded number of available processors. Then, the analytical models are expanded into simulation models in which a bounded number of processors is assumed. These simulation models are used to determine the minimum number of processors needed to actually achieve the theoretical speed-ups. With the theoretical parallelism determined by the models of the first objective, the second objective is to determine how much of the theoretical parallelism can actually be achieved on existing multiprocessors. This objective is addressed by revising the simulation models to correspond to the performance characteristics of the nCUBE 2 multiprocessor (as determined in Chapter 3). 4.1 Unbounded Parallelism Models The first objective in the investigation of parallelism within the unsymmetric-pattern multifrontal method is to assess the theoretical parallelism inherent in the assembly DAGs produced by the method. This is done using analytical models that assume an unbounded number of available processors.
An underlying multiple instruction, multiple data (MIMD) shared memory (SM) model of computation is assumed, although several of the models are also realizable on a single instruction, multiple data (SIMD) model. The shared memory is based on a parallel random access machine (PRAM) model, which is initially assumed to allow concurrent reads and concurrent writes with concurrently written results summed (CRCW-SUM). Later models will require only a concurrent read, exclusive write (CREW) shared memory. Instruction counts are abstracted to only the floating point operations (flops) required for the assembly and numerical factorization processes. The analysis of the unbounded parallelism models is based on representations of assembly DAGs built from traces produced by a sequential version of the unsymmetric-pattern multifrontal method on matrices from real applications [34, 35]. These assembly DAGs are analyzed and then used as input for the five unbounded parallelism models. The resulting speed-up, processor set, and utilization estimates are then analyzed. 4.1.1 Assembly DAG Analysis A sound characterization of the assembly DAG is the cornerstone of the unbounded models and will be fundamental to fully understanding the results of the bounded models. To obtain this characterization, the assembly DAGs are topologically sorted and partitioned into a data structure that identifies all the nodes in a given topological level. Node weights are assigned in accordance with the model definitions given later. Using these node weights, the heaviest weighted path through the assembly DAG is determined for each model. The weight of this path provides the lower bound on the parallel execution time. Other characteristics of the assembly DAGs are also determined. In particular, the number of nodes and edges, the depth, and the number of leaf nodes are determined.
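The heaviest-weighted-path computation described above reduces to a single sweep over the nodes in topological order. The sketch below is a hypothetical simplification; the node, weight, and successor representations are ours, not the analysis program's actual data structures.

```python
def heaviest_path_weight(nodes, weight, succs):
    """Weight of the heaviest path through a DAG.

    nodes  : node ids in a valid topological order
    weight : dict mapping node id -> node weight
    succs  : dict mapping node id -> list of successor node ids
    """
    best = dict.fromkeys(nodes, 0.0)   # heaviest path weight ending at each node
    for u in nodes:                    # predecessors are finished before u
        best[u] += weight[u]
        for v in succs.get(u, []):
            best[v] = max(best[v], best[u])
    return max(best.values())
```

The returned weight is the lower bound on parallel execution time used throughout this section.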
Since all the assembly DAGs tend to be significantly wider at their leaf level (Level 0), the next widest level is also determined. The total weight of all nodes is divided by the number of nodes to obtain an average node weight. The depth is divided by the number of nodes to give a depth ratio; low values correspond to DAGs with the most significant coarse-grain parallelism. Various methods of exploiting the available parallelism are investigated with the different models defined next. 4.1.2 Model Definitions A model of the sequential version of the algorithm is provided as a reference and will be used to establish the speed-up calculations. In this and all the subsequent models there are two levels. The first (higher) level consists of the outer sum that counts processing steps for the frontal matrices (nodes) that make up the overall sparse matrix. This summation is either across all nodes of the assembly DAG, which represents a sequential processing of the frontal matrices, or along the heaviest path, which represents the critical path for the processing. The heaviest path summation is used to characterize parallelism exploited between the nodes in the assembly DAG. With the assumption of an unbounded number of processors, processing of nodes with dependencies not on the heaviest path (critical path) can be done during the critical path processing without affecting the overall completion time. The second level of the models describes the processing and parallelism of factorizing a particular frontal matrix (i.e., a node in the assembly DAG). Each of these descriptions is presented in four terms. The first term describes the processing needed to complete the assembly of contribution blocks from predecessors in the assembly DAG. The second term characterizes the processing needed for the selection of a pivot for numerical stability. The third term describes the calculation of multipliers, which results in a column of the lower triangular matrix L.
Finally, the fourth term defines the updating of the active submatrix, which is essentially the carrying out of the row operations needed to reduce the matrix. These models are very similar to an earlier effort by Duff and Johnsson [51] that focused on the assembly trees produced by symmetric-pattern multifrontal methods. Each of the models uses the following terms:
$A_j$: the number of matrix entries assembled into the frontal matrix represented by node $j$.
$S_j$: the number of children assembling entries into node $j$.
$P_j$: the number of pivots by which the frontal matrix at node $j$ is reduced.
$c_j$: the number of columns in the node $j$ frontal matrix.
$r_j$: the number of rows in the node $j$ frontal matrix.
With these definitions, the model for the sequential version of the algorithm is defined as follows: $\sum_{j \in \text{all nodes}} [ A_j + \sum_{i=1}^{P_j} (r_j - i + 1) + \sum_{i=1}^{P_j} (r_j - i) + 2 \sum_{i=1}^{P_j} (r_j - i)(c_j - i) ]$. Model 0 (Concurrency Between Frontal Matrices Only): Model 0 depicts a version of the algorithm that only exploits parallelism across nodes in the assembly DAG. Each frontal matrix is factorized sequentially. This model requires an underlying MIMD architecture. The formula describing this model is: $\sum_{j \in \text{heaviest path}} [ A_j + \sum_{i=1}^{P_j} (r_j - i + 1) + \sum_{i=1}^{P_j} (r_j - i) + 2 \sum_{i=1}^{P_j} (r_j - i)(c_j - i) ]$. In the case of a dense matrix, this model degenerates to a single one-processor task that requires $O(n^3)$ time. Model 1 (Block Concurrency Within Frontal Matrices Only): Model 1 assumes that nodes of the assembly DAG are processed sequentially in an order that preserves the dependencies expressed by the assembly DAG. Within each node, the frontal matrix is factorized with a block level of parallelism. Assembly, multiplier computation, and updates to the active submatrix are done for all rows in parallel, moving sequentially through the columns. For the assembly of a row that includes contributions from more than one child in the assembly DAG, the assumption of a CRCW-SUM shared memory ensures correctness.
In this model, selection of numerical pivots is assumed to be sequential. The methods described by this model and the next lend themselves easily to either a SIMD or a MIMD architecture. $\sum_{j \in \text{all nodes}} [ c_j + \sum_{i=1}^{P_j} (r_j - i + 1) + P_j + 2 \sum_{i=1}^{P_j} (c_j - i) ]$. In the case of a dense matrix, this model requires $n$ processors and takes $O(n^2)$ time. Model 2 (Full Concurrency Within Frontal Matrices Only): Concurrency within a frontal matrix is extended by Model 2. In this model, very fine grain parallelism is exploited. Assembly is done for each element in parallel. However, in order to maintain consistency with the earlier work of Johnsson and Duff [51], a CREW memory is used by this model, and the assembly of contributions to an entry from multiple children is done via a parallel prefix computation using the associative operator of addition. Numerical pivot selection is handled in this model with another parallel prefix computation, this time using the maximum function as the associative operator. Multiplier computation and updating of the active matrix are done with parallelism at the matrix entry level, which is easily realizable with the CREW memory model. This model is also oriented to either a SIMD or MIMD architecture. $\sum_{j \in \text{all nodes}} [ \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2P_j ]$. In the case of a dense matrix, this model requires $n^2$ processors and takes $O(n \log n)$ time. Model 3 (Block Concurrency Within and Between Frontal Matrices): The block-oriented, node level parallelism of Model 1 is augmented by parallelism across nodes of the assembly DAG in Model 3. This is done by changing the outer summation to be over the heaviest path instead of across all nodes. With this model, a MIMD architecture is required. $\sum_{j \in \text{heaviest path}} [ c_j + \sum_{i=1}^{P_j} (r_j - i + 1) + P_j + 2 \sum_{i=1}^{P_j} (c_j - i) ]$. With a dense matrix, this model is equivalent to Model 1.
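To make the per-node costs concrete, the sequential model and the within-node costs of Models 1 and 2 can be evaluated directly from the $A_j$, $S_j$, $P_j$, $r_j$, $c_j$ terms defined in this section. This is an illustrative sketch only; the function names are ours.

```python
from math import ceil, log2

def seq_node_cost(A, S, P, r, c):
    """Flops to assemble and partially factorize one frontal matrix,
    per the sequential model: assembly, pivot search, multipliers,
    and rank-1 update (multiply plus add) for each of the P pivots."""
    cost = A
    for i in range(1, P + 1):
        cost += (r - i + 1)             # pivot search down the column
        cost += (r - i)                 # multiplier (column of L) computation
        cost += 2 * (r - i) * (c - i)   # active submatrix update
    return cost

def model1_node_cost(A, S, P, r, c):
    """Block (row-parallel) cost: assembly and updates sweep the columns,
    pivot search remains sequential, multipliers take one step per pivot."""
    cost = c + P
    for i in range(1, P + 1):
        cost += (r - i + 1)             # sequential pivot search
        cost += 2 * (c - i)             # row-parallel update
    return cost

def model2_node_cost(A, S, P, r, c):
    """Fully parallel cost: parallel-prefix assembly and pivot search,
    entry-level parallel multipliers (P steps) and updates (2P steps)."""
    cost = ceil(log2(S)) if S > 1 else 0
    for i in range(1, P + 1):
        cost += ceil(log2(r - i + 1))   # parallel-prefix maximum
    return cost + P + 2 * P
```

For a dense $n \times n$ matrix these reduce to the $O(n^3)$, $O(n^2)$, and $O(n \log n)$ behaviors noted above.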
Model 4 (Full Concurrency Within and Between Frontal Matrices): The suite of models is completed by extending the full node level concurrency of Model 2 to include parallelism across nodes in the assembly DAG. This is done in the same fashion as with Model 3, and a MIMD architecture is likewise assumed. $\sum_{j \in \text{heaviest path}} [ \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2P_j ]$. With a dense matrix, this model is equivalent to Model 2. 4.1.3 Matrices Analyzed A set of five matrices was selected for this study. They represent a cross section of the types of assembly DAGs that are generated. The matrices are from the Harwell-Boeing set, and each matrix comes from a real application [49]. Table 4-1 briefly describes these matrices.

Table 4-1. Test Matrix Descriptions
MATRIX      ORDER  DESCRIPTION
MAHINDASB    1258  economic modeling
GEMAT11      4929  optimal power flow problem
GRE_1107     1107  computer system simulation
LNS_3937     3937  compressible fluid flow
SHERMAN5     3312  oil reservoir, implicit black oil model

4.1.4 Assembly DAG Analysis Results The first objective of this part of the effort was to characterize the assembly DAGs of the test matrices. Table 4-2 provides these characterizations, which were produced with the analysis program written for the effort. The DEPTH entries are the number of levels in the assembly DAG as determined by a topological sort. The width of the DAG is presented in two entries: LEVEL 0 refers to the leaf nodes of the assembly DAG, and WIDTH gives the widest level other than the leaf level (Level 0). It is important to remember that the assembly DAGs are typically NOT a single connected component and that many small, distinct connected components frequently show up on these lowest levels of the DAG's topological order. The DEPTH RATIO is the DEPTH divided by the number of nodes in the assembly DAG. This statistic will be smaller for the short and wide DAGs that have the greatest inherent parallelism.

Table 4-2. Assembly DAG Characterizations
MATRIX       MAHINDASB  GEMAT11  GRE_1107  LNS_3937  SHERMAN5
# NODES            238     1015       511      1636       664
# EDGES            507    1,219     1,366     3,724     1,496
DEPTH               68       25        40       296        46
LEVEL 0             86      530       233       565       359
WIDTH               55      189       106       248        99
AVG WEIGHT       1,887    1,195     9,998   67, ,'.     42,093
DEPTH RATIO       0.29     0.02      0.08      0.18      0.07

The statistics presented in Table 4-2 reveal that the assembly DAGs tend to be short and wide (which is the desired shape for parallelism). However, they do not provide the full picture. The true nature of the assembly DAGs is better revealed with an abstraction of the DAG provided by the analysis program. This abstraction illustrates the topologically ordered assembly DAG, with an example provided for the GEMAT11 matrix in Figure 4-1. For each level, the number of nodes in that level is given, as well as the minimum, average, and maximum node weights. (The node weights are expressed as common logs to save space and because their magnitude is all that is of interest.) These abstractions revealed that the assembly DAGs are very bottom heavy in terms of the number of nodes but that the average node size tends to increase dramatically in the upper levels. The assembly DAG for the GEMAT11 matrix had the nicest shape for parallelism (as suggested by its depth ratio). The other assembly DAGs were taller and relatively narrower at their base. Speed-Ups: The most interesting results of the unbounded models are the speed-ups available via parallelism. These results are presented in Table 4-3.
L(Level): Count (Min, Avg, Max)
L(25):   1 (5.3, 5.3, 5.3)
L(24):   1 (4.8, 4.8, 4.8)
L(23):   1 (3.7, 3.7, 3.7)
L(22):   1 (3.7, 3.7, 3.7)
L(21):   1 (3.7, 3.7, 3.7)
L(20):   1 (4.7, 4.7, 4.7)
L(19):   1 (3.6, 3.6, 3.6)
L(18):   2 (3.6, 4.1, 4.3)
L(17):   2 (3.4, 3.8, 4.0)
L(16):   2 (3.3, 4.0, 4.2)
L(15):   2 (3.0, 4.0, 4.2)
L(14):   2 (3.0, 3.4, 3.7)
L(13):   3 (3.0, 3.4, 3.6)
L(12):   4 (2.6, 3.7, 4.2)
L(11):   6 (3.0, 3.5, 3.9)
L(10):   4 (2.1, 3.9, 4.5)
L( 9):   4 (3.0, 3.3, 3.7)
L( 8):   7 (2.6, 3.5, 3.9)
L( 7):  10 (2.6, 3.4, 3.9)
L( 6):  16 (2.5, 3.6, 4.5)
L( 5):  24 (2.2, 3.5, 4.2)
L( 4):  35 (2.2, 3.2, 4.2)
L( 3):  58 (1.6, 3.0, 3.8)
L( 2): 108 (1.6, 2.9, 3.6)
L( 1): 189 (1.6, 2.8, 3.5)
L( 0): 530 (0.7, 2.5, 3.3)
Figure 4-1. Assembly DAG Abstraction for GEMAT11

The speed-ups for Model 0 are disappointing but not unexpected, given what was seen in the previous section on the shape of the assembly DAGs. However, significant speed-ups were seen as parallelism was introduced within the nodes (Models 1 and 2) and both within and between nodes (Models 3 and 4). It is interesting to note that a significant synergism takes place when parallelism is utilized both within and between nodes. One's intuition might suggest that using both levels of parallelism would produce a resulting speed-up close to the product of the speed-ups achieved by using the levels of parallelism separately. However, the speed-ups obtained are actually much better than this expectation. This is consistent with the results of the earlier study by Duff and Johnsson [51].

Table 4-3. Unbounded Parallelism Speed-Up Results
MATRIX    MAHINDASB   GEMAT11  GRE_1107   LNS_3937   SHERMAN5
MODEL 0        1.17      2.88      1.26       1.30       1.74
MODEL 1       10.50      9.31     41.21     103.77      78.36
MODEL 2      114.25     39.35    575.21   3,490.76   1,877.05
MODEL 3       15.96     68.79     70.15     167.94      16..,
MODEL 4      237.74    743.80  1,847.68    11, 1-'.  1.,8..42

Another interesting, but expected, observation is that the speed-ups for Models 0, 3, and 4 (which use parallelism between frontal matrices) correlate strongly with the depth ratio presented earlier. Likewise, Models 1, 2, 3, and 4 correlate strongly with average node weight. The underlying principle is that concurrency within a node is most exploitable for larger frontal matrices, and that concurrency across nodes is most exploitable with short and wide assembly DAG structures. Processor Sets and Utilization: The unbounded models also determined an upper bound on the number of processors that could be employed on a given matrix, together with the associated processor utilization. In particular, Model 0 assumed that any frontal matrix would have only one processor allocated to it. For the block level concurrency of Models 1 and 3, a frontal matrix would have a processor set equal to the number of rows in the frontal matrix. The full concurrency of Models 2 and 4 would use processor sets equal to the row size times the column size of the frontal matrix. The maximum processor usage would then be the largest processor set for any particular frontal matrix in Models 1 and 2, and the largest sum of processors over any one level in the assembly DAG for Models 0, 3, and 4. This strategy results from the use of concurrency across nodes in Models 0, 3, and 4. Utilizations were then calculated by dividing total sequential time by the product of total parallel time and the number of processors used. The results by matrix and model are provided in Table 4-4. The top value in each entry is the size of the processor set and the lower value is the utilization. In all cases the sizes of the processor sets seem quite high and the utilizations quite low, but these estimates are very crude upper bounds, and it is likely that significantly better results are realizable.
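The processor-set upper bounds described above can be computed directly from the leveled DAG. The sketch below assumes a hypothetical levels/rows/cols representation of the topologically sorted assembly DAG; it is an illustration of the bounding rules, not the analysis program itself.

```python
def processor_bounds(levels, rows, cols):
    """Upper bounds on usable processors, by model.

    levels : list of lists of node ids, one list per topological level
    rows, cols : dicts mapping node id -> frontal matrix row/column count
    """
    block = {j: rows[j] for lvl in levels for j in lvl}            # Models 1, 3
    full = {j: rows[j] * cols[j] for lvl in levels for j in lvl}   # Models 2, 4
    return {
        0: max(len(lvl) for lvl in levels),                  # one processor per node
        1: max(block.values()),                              # largest single frontal
        2: max(full.values()),
        3: max(sum(block[j] for j in lvl) for lvl in levels),  # widest level, block sets
        4: max(sum(full[j] for j in lvl) for lvl in levels),   # widest level, full sets
    }
```

Applied to the Table 4-2 level data, the Model 0 bound is simply the widest level, which is why its entries in Table 4-4 match the LEVEL 0 counts.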
4.1.5 Conclusions and Observations A significant amount of theoretical parallelism is achievable with the unsymmetric-pattern multifrontal method. However, very little of the parallelism comes directly from the assembly DAG, as is evidenced by the Model 0 results. Yet a synergism occurs when the parallelism from the assembly DAG is combined with the concurrency available within the factorization of particular frontal matrices. The resulting speed-ups, as seen in Models 3 and 4, are very promising.

Table 4-4. Processor Set and Utilization Results (top value: processor set; bottom value: utilization)
MATRIX    MAHINDASB  GEMAT11  GRE_1107  LNS_3937  SHERMAN5
MODEL 0          86      530       233       565       359
              1. ;.     0.54      0.54    0.2;'.    0.4-'.
MODEL 1          52       68       126       258       189
           20.1'!'.  13.7"'.     32.71    40.22'    41. '.
MODEL 2         :it     4624     19782    107874     47060
            :1 1 '.  0.8-.'.   2.91 -.      3.24      ,,'.
MODEL 3         304     3788      1161      2961      3827
              5.2.V   1.i--"      6.04      5.7'      4.1.'
MODEL 4        4161    30127     19782   17',i.L     'IIII
               5.71    2.47'      9.34   6.:;>'.         '

Looking beyond the speed-ups, the large processor sets and low utilizations are of both concern and interest. In particular, the very bottom-heavy assembly DAGs suggest that better distributions of the processing are possible. Such distributions could be achieved by appropriate task scheduling on bounded sets of processors. The next section explores these possibilities using trace-driven simulation techniques. 4.2 Bounded Parallelism Models The results of the unbounded parallelism models illustrated that significant speed-ups are possible with parallel implementations of the unsymmetric-pattern multifrontal method. However, the efficient use of processor sets and the speed-ups achievable on smaller processor sets are still open questions. This section addresses these issues using the results of trace-driven simulations run on the five models defined in the previous section. With each model, sixteen different processor sets and three different scheduling schemes were used. The processor sets were powers of 2 from 2^1 to 2^16 (65,536).
The scheduling schemes correspond to orderings of the work queue and are described in a subsequent paragraph. The results of the bounded parallelism models will indicate that the speed-ups seen in the unbounded models are achievable with reasonable-size processor sets and with significant improvements in efficiency. This section is a summary of the bounded parallelism models. For a full description of the models and their revisions, see the Hadfield and Davis technical report [83]. 4.2.1 Initial Models The initial bounded parallelism models follow directly from the unbounded parallelism models. The critical difference is that a limited processor set is assumed, so tasks that are ready for execution may have to wait for available processors. This can be appropriately modeled via a trace-driven simulation run against the representation of the assembly DAGs obtained from the traces produced by the sequential version of the unsymmetric-pattern multifrontal method [34, 35]. Critical issues for the simulation are how to allocate processors when the number available is less than the number required and how to order the work that is ready to be done. 4.2.2 Trace-Driven Simulation The trace-driven simulations used to implement the bounded parallelism models were accomplished with a Pascal program built as an extension to the analysis program used for the unbounded parallelism models. The simulation uses the topologically sorted representation of the assembly DAG as input. All the nodes (frontal matrices) on the leaf level (Level 0) are initially put into a work queue. Work is scheduled from this queue based on processor availability. The amount of work and the number of processors required for each node are determined by the specific model being simulated. These models are initially those defined for the unbounded parallelism models.
When the model calls for a sequential ordering of the nodes, only one node may be executing at any particular time; otherwise, multiple nodes may execute concurrently based on processor availability and the allocation strategy in use. Upon completion of a node's execution, the edges to dependent nodes are removed. When all such edges to a particular node have been removed, that node is put into the work queue, as it has become available for execution. The simulation continues until all nodes have completed execution. Speed-up and processor utilization are then calculated per the following formulas: $\text{Speedup} = \text{Time}_{\text{sequential}} / \text{Time}_{\text{parallel}}$ and $\text{Utilization} = \sum_j (\text{Time}_j \times \text{ProcessorsUsed}_j) / (\text{Time}_{\text{finished}} \times \text{ProcessorSetSize})$. 4.2.3 Processor Allocations Allocation of processors to the assembly and factorization of a frontal matrix is a critical issue for the realization of the algorithm on a bounded set of processors. As was seen in the results of the initial models, the allocation scheme can greatly affect the performance of the algorithm. Since the five unbounded models use three different approaches to the factorization of a particular frontal matrix, there are three distinct allocation schemes. The allocation scheme required by Model 0 is trivial, since each frontal matrix in this model is allocated only a single processor. Models 1 and 3 use block concurrency within a frontal matrix. As this block concurrency is row-oriented, a processor set equal to the row size is required. The difficulty arises when the available processor sets are smaller than that required (in fact, the total number of processors can be less than the row size). In this eventuality, the total amount of work is evenly distributed across the available processors. This is a fairly crude allocation scheme but is not too unrealistic given the nature of the processing within a frontal matrix.
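The scheduling loop of Section 4.2.2 amounts to a small discrete-event simulation. The following sketch is a deliberate simplification, not the Pascal program's actual logic: it always starts a ready task on whatever processors are free (up to the task's requirement), dividing the task's work evenly among them.

```python
import heapq
from collections import deque

def simulate(tasks, succs, indegree, nprocs):
    """Trace-driven list-scheduling sketch.

    tasks    : dict node -> (work, procs_needed)
    succs    : dict node -> list of dependent nodes
    indegree : dict node -> number of unfinished predecessors
    Returns the simulated parallel completion time.
    """
    indeg = dict(indegree)                 # leave caller's copy untouched
    ready = deque(j for j, d in indeg.items() if d == 0)
    free = nprocs
    running = []                           # heap of (finish_time, node, procs)
    now = 0.0
    while ready or running:
        while ready and free > 0:          # start everything we can
            j = ready.popleft()
            work, need = tasks[j]
            p = min(need, free)
            free -= p
            heapq.heappush(running, (now + work / p, j, p))
        t, j, p = heapq.heappop(running)   # advance to next completion
        now = t
        free += p
        for k in succs.get(j, []):         # release dependency edges
            indeg[k] -= 1
            if indeg[k] == 0:
                ready.append(k)
    return now
```

Dividing `Time_sequential` by the returned time gives the speed-up estimate defined above.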
This strategy was initially followed blindly and led to some terrible scheduling, as large frontal matrices were allocated relatively small processor sets. This greatly extended their execution times and delayed the subsequent frontal matrices that depended upon them. Revisions to this allocation strategy were to (a) check whether waiting for more available processors would improve a frontal matrix's execution time and schedule accordingly, or (b) wait whenever the required number of processors is not available. Both strategies corrected the problem and produced surprisingly similar results. Strategy (b) will be reported in the results of this discussion. Models 2 and 4 use full concurrency within each frontal matrix and require a number of processors equal to the number of entries in the frontal matrix (row size times column size: $r_j c_j$). A more sophisticated allocation scheme is thus required. Prior to formally defining this scheme, we recall the definition of the time complexity for a frontal matrix under this model: $\text{work} = \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2P_j$. Using this definition and a similar assumption on the ability to distribute work over smaller processor sets, the allocation scheme is formally defined by the algorithm in Figure 4-2.

if processors_available > (col_size * row_size) then
    Schedule work on (col_size * row_size) processors
else if processors_available > row_size then
    Schedule (work + 3 * pivots * col_size) on (row_size) processors
else if processors_total > row_size then
    Wait for more available processors
else
    Schedule (work * col_size * row_size) / processors_available
        on processors_available processors
endif
endif
endif

Figure 4-2. Processor Allocation for Fine Grain Parallelism

Notice that the deepest nested else block provides a fairly crude over-approximation of the amount of work to do. However, this case is only used for small total processor set sizes, which are not the target processor sets for the parallelism of Models 2 and 4.
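The rule of Figure 4-2 can be restated as an executable sketch. The comparisons are taken as inclusive (an assumption on our part; the figure as printed shows strict inequalities), and the function returns None to indicate the "wait for more processors" case.

```python
def allocate(work, pivots, row_size, col_size, avail, total):
    """Processor allocation per Figure 4-2 (illustrative sketch).

    work   : fully parallel execution time for the frontal matrix
    avail  : processors currently available; total : total processor set size
    Returns (processors, scheduled_time), or None to wait.
    """
    full = row_size * col_size
    if avail >= full:
        return full, work                        # full fine-grain set
    if avail >= row_size:
        # one full row-set; extra per-pivot column work absorbs the loss
        return row_size, work + 3 * pivots * col_size
    if total >= row_size:
        return None                              # wait for more processors
    # crude fallback: spread the total serial work over what is available
    return avail, work * full / avail
```

The final branch is the crude over-approximation discussed above; it applies only when the total processor set is smaller than the row size.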
Thus, this over-estimate is reasonable. Furthermore, this allocation strategy tended to produce a very step-like performance function as processor set sizes grew to meet the various cut-off points. A smoother performance curve was achieved by using a technique called vertical partitioning, which subdivides a frontal matrix task into a series of subtasks, each requiring substantially fewer processors. This produced much more efficient scheduling and significant performance improvements. Results for both the initial allocation scheme and vertical partitioning will be reported. 4.2.4 Scheduling Methods Three scheduling methods are employed on the work queue. These methods correspond to how the work ready to be executed is ordered within the work queue. The three corresponding orderings are first-in, first-out (FIFO); largest calculation first; and heaviest path first. The FIFO method is self-explanatory and is included because of its simplicity and efficiency. The largest calculation first ordering uses the node weight estimate of the work to be done and schedules the largest such nodes first. This method is designed as an approximation of the heaviest path first method. Largest calculation first requires a priority queue implementation based on a set of priorities that is quite variable, so there would be some significant penalties to address in implementing this method. The last method, heaviest path first, uses the heaviest path determined by the analysis portion of the software. Any node on this path is scheduled next; other nodes are kept in a FIFO ordering. Notice that since the heaviest path is essentially a path through the dependency graph, at most one node from the heaviest path will ever be in the work queue at any one time. Very little difference was found among these scheduling methods, and they will not be discussed further in this document (for a detailed description of these results, see [83]).
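The three work-queue disciplines can be sketched as interchangeable push/pop pairs. This is illustrative only; the heaviest-path variant exploits the observation above that at most one heaviest-path node is ever ready at a time, so moving it to the front of an otherwise FIFO queue realizes the priority.

```python
import heapq
from collections import deque

def make_queue(method, weight=None, heaviest_path=frozenset()):
    """Return (push, pop) closures implementing one queue discipline.

    method        : "fifo", "largest_first", or "heaviest_path_first"
    weight        : dict node -> work estimate (largest_first only)
    heaviest_path : set of nodes on the heaviest path
    """
    if method == "fifo":
        q = deque()
        return q.append, q.popleft
    if method == "largest_first":
        q = []                          # priority queue keyed on -weight
        def push(j):
            heapq.heappush(q, (-weight[j], j))
        def pop():
            return heapq.heappop(q)[1]
        return push, pop
    if method == "heaviest_path_first":
        q = deque()
        def push(j):
            if j in heaviest_path:
                q.appendleft(j)         # heaviest-path node jumps the queue
            else:
                q.append(j)
        return push, q.popleft
    raise ValueError(method)
```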
4.2.5 Simulation Results The results of the simulations are summarized for each model. Whenever the results for all five matrices are presented, the following line types are used to represent the different matrices: ----- MAHINDASB; -*-*-*- GEMAT11; -+-+-+- GRE_1107; -0-0-0- LNS_3937; - - SHERMAN5. All the results presented in this section are based on heaviest path first scheduling. Model 0 (Concurrency Between Frontal Matrices Only): Model 0 takes advantage of concurrency only between frontal matrices. The speed-up and utilization results from Model 0 are shown in Figures 4-3 and 4-4, respectively. These indicate that the speed-ups seen in the unbounded models are achievable with significantly fewer processors and correspondingly higher utilizations. Table 4-5 compares the number of processors used in the bounded versus the unbounded models to achieve maximum speed-up.

Figure 4-3. Model 0 Speed-Up Results (speed-up versus log2 of the number of processors)

Table 4-5. Required Processors Comparison
MATRIX     BOUNDED MODEL  UNBOUNDED MODEL
MAHINDASB             32               86
GEMAT11               64              530
GRE_1107               8              233
LNS_3937               8              565
SHERMAN5               8              359

Model 1 (Block Concurrency Within Frontal Matrices Only): Model 1 takes advantage only of concurrency within frontal matrices and does so in a block-oriented fashion. The speed-up and utilization results are shown in Figures 4-5 and 4-6, respectively.

Figure 4-4. Model 0 Utilization Results (utilization versus log2 of the number of processors)

The method employed by this model seems well suited for processor sets up to about 256 processors. After that point there is no further advantage to additional processors for any of the test matrices. Model 2 (Full Concurrency Within Frontal Matrices Only): The block concurrency of Model 1 is extended to full concurrency in Model 2 but is still restricted to concurrency within the frontal matrix.
The speed-up and utilization results for Model 2, using both the initial allocation scheme and vertical partitioning, are shown in Figures 4-7, 4-8, 4-9, and 4-10, respectively. A very definite step behavior is evident in the speed-up curves for the initial allocation scheme. This behavior directly correlates with the allocation scheme in use, which bases allocation on the row size and column size of the frontal matrices. While the speed-ups are significant, the step behavior severely limits scalability. Vertical partitioning resolves this problem. Model 3 (Block Concurrency Within and Between Frontal Matrices): Model 3 combines the block concurrency within a frontal matrix of Model 1 with the concurrency between frontal matrices used in Model 0. The speed-up and utilization results for this model are shown in Figures 4-11 and 4-12, respectively. These results were produced using the allocation strategy that waits if sufficient processors are not available. Model 4 (Full Concurrency Within and Between Frontal Matrices): The fifth and final model uses concurrency between nodes and full concurrency within nodes. The speed-up and utilization results for this model are included for both the initial allocation scheme and vertical partitioning. These are presented in Figures 4-13, 4-14, 4-15, and 4-16.

Figure 4-5. Model 1 Speed-Up Results (speed-up versus log2 of the number of processors)

A step-like speed-up and corresponding utilization irregularities are evident with the original allocation scheme, very similar to the results for Model 2. Again, vertical partitioning corrects this problem. 4.2.6 Conclusions and Observations Several conclusions and observations are apparent from the results of the bounded parallelism models. 1. There is an excellent potential for parallelism under both SIMD and MIMD implementations. 2. The speed-ups indicated in the unbounded models are achievable with practical-size processor sets. 3.
The benefits of the different models are realized within distinct ranges of the number of processors.

4. The scheduling method has little significant effect on performance for all five models. However, the critical path method was not tested in these initial runs and may have a significant effect.

5. Processor allocation schemes can have very significant effects on performance.

[Figure 4-6. Model 1 Utilization Results]

With the theoretical parallelism of the unsymmetric-pattern multifrontal method thus established, the next step is to use the characterization of an existing multiprocessor (the nCUBE 2) from Chapter 3 to revise the simulation model and study the parallelism that is actually achievable.

4.3 Distributed Memory Multifrontal Performance Potential

With the characterization of the nCUBE 2's performance achieved in Chapter 3, the bounded parallelism PRAM simulation models are refined to estimate the achievable parallelism of the unsymmetric-pattern multifrontal method. The performance characteristics of the nCUBE 2 are used in the revised simulation models so that predicted performance can later be compared to that actually achieved when the method is implemented on the nCUBE 2.

However, one important prerequisite remains before the simulation model is updated: to specifically characterize the performance of the dense matrix factorization routine(s) to be used within the unsymmetric-pattern multifrontal method. Recall that a primary motivation of multifrontal methods is to replace sparse matrix operations with more efficient dense submatrix operations. Hence, such a characterization of full and partial dense matrix factorization is critical to the development of a realistic simulation.
4.3.1 Dense Matrix Factorization Routine

The characterization of the dense matrix factorization routine is developed by actually implementing such a routine and measuring its performance. A model of its performance is also developed, and the actual measured results are compared to the results predicted by the model.

[Figure 4-7. Model 2 Original Speed-Ups]

Choice of Algorithm

The algorithm chosen for dense matrix factorization is a column-oriented, fan-out method. This method was chosen because it is easy to analyze and implement and because this was the approach taken in successful distributed memory implementations of multifrontal Cholesky factorization [105, 76]. Specifically, columns of the matrix are stored in a scattered fashion across the processors. For added simplicity, diagonal dominance is assumed, so no numerical pivoting will take place. An outer indexed loop provides one iteration for each pivot with which the matrix will be partially factorized. If the pivot column is owned by the processor, the multipliers (values for the corresponding column of L) are calculated by dividing by the pivot entry. Then, this column of multipliers is broadcast to the other processors. Following the broadcast, the originating processor will use the multipliers to update any remaining columns that it owns. If a processor does not own the current pivot column, it will simply wait for the broadcast of the multipliers and then update its columns with the received multipliers. This algorithm is outlined in Figure 4-17.

Performance Model

An analytical model of the performance of this algorithm can be derived by looking at a time line of processor activity. For this time line, the assumption is made that a broadcast ties up all involved processors for a fixed period of time and that all processors are available to receive the broadcast when it is initiated. (As was seen in the previous chapter, near neighbors will actually finish the broadcast reception earlier than others, but the difference is not deemed significant.) The resulting time line is shown in Figure 4-18. In this figure, each processor's activities are represented within a distinct column (four processors are shown). Solid lines indicate ongoing computation, and dashed lines indicate communication and its associated latencies.

[Figure 4-8. Model 2 Vertical Partitioning Speed-Ups]

Using this model and assuming that the column update times are approximately equal for each involved processor, we see that calculation of the multipliers is not overlapped with any other useful activity and can be modeled as a sequential activity in time. (This observation is the motivation for more efficient pipelined versions of the method.) Hence, as only the updating of the columns is done in parallel, the execution time model is:

    t_exec = t_m + t_b + t_u / P

where t_m is the time to compute the multipliers, t_b is the time required for the broadcasts, t_u is the time for all updates, and P is the number of processors being used. Each of the component times can be defined in terms of the matrix parameters using formulas similar to those derived from the experiments on the nCUBE 2. Thus, using r, c, and k for the number of rows, columns, and pivots of the matrix, the following formulas can be derived:

    t_m = 4 * sum_{i=1..k} (r - i) + 5 * k
    t_b = sum_{i=1..k} [195.22 + 50.02 * log2(P) + 4 * 0.122 * (r - i) + 4 * 0.45 * log2(P) * (r - i)]
    t_u = 5 * sum_{i=1..k} (c - i) * (r - i) + 17 * k + 3 * k * c

[Figure 4-9. Model 2 Original Utilizations]
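Assembled into code, the model above can be evaluated directly. The sketch below is mine, not the dissertation's code: the function names are assumptions, the constants and formulas are taken from the text (all times in microseconds), and t_m is summed over the row dimension in keeping with the later remark that multiplier determination time is proportional to row size.

```python
import math

# Sketch of the fan-out execution-time model (times in microseconds).
# Function names are mine; the timing constants come from the nCUBE 2
# experiments quoted in the text.

def t_m(r, k):
    # multiplier computation: one column of L (length r - i) per pivot i
    return 4 * sum(r - i for i in range(1, k + 1)) + 5 * k

def t_b(r, k, P):
    # broadcast of each multiplier column to a P-processor subcube
    return sum(195.22 + 50.02 * math.log2(P)
               + (4 * 0.122 + 4 * 0.45 * math.log2(P)) * (r - i)
               for i in range(1, k + 1))

def t_u(r, c, k):
    # rank-1 updates of the remaining active columns
    return 5 * sum((c - i) * (r - i) for i in range(1, k + 1)) + 17 * k + 3 * k * c

def t_exec(r, c, k, P):
    # parallel model: only the updates overlap across P processors;
    # the sequential version (P = 1) reduces to t_m + t_u
    if P == 1:
        return t_m(r, k) + t_u(r, c, k)
    return t_m(r, k) + t_b(r, k, P) + t_u(r, c, k) / P
```

For a 32 X 32 frontal matrix with 8 pivots this evaluates to about 32.3 ms sequentially and 19.1 ms on two processors, matching the "(m)" rows of Table 4-8.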
Similarly, if numerical pivoting were to be introduced, it would be handled in a way very similar to that in which the multipliers are computed, and its cost could be defined as:

    t_p = 4 * sum_{i=1..k} (r - i + 1) + 5 * k

with the execution time becoming:

    t_exec = t_m + t_p + t_b + t_u / P

Furthermore, a sequential version of the algorithm can simply be modeled with:

    t_exec = t_m + t_u

or t_exec = t_m + t_p + t_u if numerical pivoting is incorporated.

[Figure 4-10. Model 2 Vertical Partitioning Utilizations]

Dense Matrix Factoring Results

Timings of the dense matrix factorization algorithm were obtained using etool for a variety of matrix sizes, shapes, and numbers of pivots. The results were analyzed and several trends identified. Then, the same parameters were fed into the analytical models for validation purposes.

Representative empirical results from the factorization of several matrix configurations are provided below (Table 4-6) for hypercubes of dimensions 0 to 5. These results are given in terms of speed-ups, obtained by dividing the uniprocessor (dimension 0) times by the multiprocessor (dimensions 1 to 5) times. Several conclusions are readily apparent from this data:

1. Speed-up (and efficiency) decreases with more pivots. This occurs because there are fewer computations with the later pivots with which to amortize the communication latencies.

2. There is a point at which additional processors add to overall latency (reducing speed-up). This happens because broadcast times grow significantly with larger dimension hypercubes; it can be seen for the 32 X 32 X 32 configuration with the dimension 4 and 5 hypercubes in Table 4-6.

3. High efficiencies are achieved only with very limited numbers of processors and relatively large problems.

Other interesting observations are possible by looking at the results for rectangular matrices (which are important to the unsymmetric-pattern multifrontal method).
[Figure 4-11. Model 3 Speed-Up Results]

For example, by observing the results for the relatively tall and narrow (512 X 32) and short and wide (32 X 512) matrices of Table 4-7, one can see a striking difference in performance. This is easily explained: the column-oriented method in use broadcasts columns of data that are much larger in the 512 X 32 case, and the essentially sequential multiplier determination time (t_m) is proportional to the row size of the matrix. This suggests the possibility that row-oriented versions of the algorithm would be desirable for tall and narrow frontal matrices, and it motivates consideration of alternative dense factorization methods within an overall distributed memory multifrontal implementation.

Finally, the actual timings obtained through experimentation are compared with the results of the corresponding analytical models. This is done in two ways: first, the actual observed and predicted times are compared; second, the speed-ups based strictly on observed times are compared to those based strictly on predicted times. The times observed from the experiments and those produced by the analytical models are provided in Table 4-8 for several matrix configurations. (Results from the experiments are annotated with "(e)" and those from the analytical models with "(m)"; all times are in milliseconds.) The corresponding speed-ups for the same configurations are provided in Table 4-9. From these results it is readily apparent that the analytical model under-predicts execution time, especially in the smaller configurations, and over-predicts speed-up. Given this combination, there is some overhead in the actual computations that is not being properly accounted for in the analytical model.

[Figure 4-12. Model 3 Utilization Results]
However, the speed-ups are not too far off, and the analytical model should provide fairly accurate results when integrated into the simulation model.

A common observation throughout this section is that the dense matrix factorization algorithm is not very efficient for hypercubes of dimension greater than 3 on the matrix configurations shown. Alternative algorithms will be explored, as the efficiency and performance of a distributed memory, unsymmetric-pattern multifrontal method should be very closely tied to the performance of the dense matrix operations.

4.3.2 Distributed Memory Multifrontal Simulation Model

With a reasonable characterization of the dense matrix factorization process achieved in the last section, the bounded parallelism PRAM simulation model described earlier can be revised into a distributed memory implementation model with performance characteristics similar to those of the nCUBE 2. In order to revise the simulation model, the following specific changes are required:

- Replace the node weight (frontal task's execution time) with the execution time obtained from the analytical model of the previous discussion.

- Account for the passing of contribution blocks between frontal matrices with edge weights in the graph. Specific edge weights are defined using the performance equations derived from experimentation on the nCUBE 2's point-to-point message passing capabilities.

[Figure 4-13. Model 4 Original Speed-Ups]

- Modify the processor allocation schemes to allocate processors as complete subcubes.

- Develop initial algorithms for scheduling and allocating processors to frontal tasks.
From the earlier results of the bounded parallelism PRAM simulations and the characterization of the nCUBE 2 multiprocessor, an appropriate strategy for parallel implementation would be blocked (column-oriented) parallelism within frontal matrices and concurrency between independent frontal matrices. Such an approach is taken. Furthermore, it is necessary to make several other assumptions for the revision of the model. These added assumptions are:

- Assembly will consist of both the communication of contributions between frontal matrices (done as point-to-point message passing) and the numerical assembly (addition) of contributions to the corresponding frontal matrix entries. Both components will initially be done sequentially within each frontal matrix.

- Processor allocations will be done in full subcubes (i.e., in processor sets that are a power of 2), but specifically which processors are assigned to a particular task will not be tracked. Hence, there is an implicit assumption that available processors form a subcube; that is, no fragmentation of the hypercube occurs that would hinder allocations.

[Figure 4-14. Model 4 Vertical Partitioning Speed-Ups]

- As the latency of point-to-point message passing depends on the path distance of the transmission, and as specific processor allocations will not be tracked, a set distance of 10 links will be assumed for all communication of contributions. Such a setting ensures an upper bound on this latency for all hypercubes of 1024 processors or less.

- Checking of numerical pivots will be accounted for by a column-based, partial pivoting scheme within the node weight function. However, the assumption is made that no anticipated pivots will be lost during the factorization process.
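Under these assumptions, the cost of moving a contribution block reduces to the point-to-point latency model of Chapter 3. A sketch follows (the function names are mine; the startup and per-byte constants are those used in the node and edge weight models below, with times in microseconds and 8 bytes per assembled entry):

```python
# Sketch of the communication and assembly cost for contribution blocks
# passed between frontal matrices (times in microseconds; 8 bytes per
# entry; constants from the point-to-point latency experiments).

def edge_weight(A_e):
    # message cost for A_e contribution entries on one assembly DAG edge:
    # fixed startup latency plus a per-byte transfer cost
    return 204 + 0.573 * (8 * A_e)

def t_a(edge_contribs):
    # sequential assembly time for a frontal matrix: numerical addition
    # of all A contributions plus the message cost on each incoming edge
    A = sum(edge_contribs)
    return 8 * A + sum(edge_weight(A_e) for A_e in edge_contribs)
```

For instance, a frontal matrix receiving 100 and 50 entries on its two incoming edges would be charged about 2.3 ms of assembly time under this model.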
Node and Edge Weight Models

Time is tracked in the distributed memory simulation model using the following node and edge weights for task and communication times, respectively. Node weights are assigned initially based upon a sequential node execution model and then revised to a parallel time upon allocation of a specific set of processors to the node (task). The sequential node model is:

    t_exec = t_a + t_m + t_p + t_u

where t_m, t_p, and t_u are the multiplier, pivoting, and updating times, respectively, as defined earlier in the dense matrix factorization model, and t_a is the time for assemblies, defined as:

    t_a = 8 * A + sum_{e} [204 + 0.573 * (8 * A_e)]

[Figure 4-15. Model 4 Original Utilizations]

where A is the total number of contributions and A_e is the number of contributions passed on a particular edge e. This formula is based on the point-to-point message passing latency experiments described earlier and an assumption of 8 bytes per entry to be assembled (4 bytes for the index, 4 bytes for the value). The second term (the summation) includes the edge weight definition. The parallel node execution time model then becomes:

    t_exec = t_a + t_m + t_p + t_u / P

where P is the number of processors assigned to the node. This parallel node execution time model will also be referred to in the next section as T_Para(ID, P); that is, the parallel time associated with task ID when using P processors.

Scheduling and Processor Allocation

For scheduling, a priority list scheme is used, with priority assigned to tasks using node and edge weights with a critical path method. The node weights used are based on sequential execution time. Processor allocation is quite challenging, as allocating more processors to a task will cause it to execute faster (up to some limit) but will introduce more inefficiencies, as seen in the earlier dense matrix factorization experiments.
Furthermore, such inefficient use of processors may preclude their more efficient use on other distinct nodes (tasks). Thus, for processor scheduling, we define two quantities associated with each node: P_desired and P_most. P_desired is the preferred number of processors to assign to the node. P_most is the maximum number of processors ever to assign to a node. P_most is necessary due to the execution time increase seen when too many processors are assigned to a dense factorization task and due to the column-oriented nature of the dense factorization algorithm (that is, the number of columns provides an upper bound on the number of processors).

[Figure 4-16. Model 4 Vertical Partitioning Utilizations]

Determination of P_desired is essentially a question of efficiency and should be based both on the task characteristics and on the number of available processors. That is, if there is a very large number of processors in the configuration, we can afford to be more inefficient in our allocation, since the extra processors would likely go idle otherwise. Hence, a target efficiency is defined, and P_desired is set via our analytical models to achieve an efficiency near the target. Furthermore, this target efficiency is set to different values based on the total number of processors in the system. Some initial experiments revealed that reasonable target efficiencies would be 0.8 for hypercubes of dimension 3 or less (8 or fewer processors), 0.5 for hypercubes of dimensions 4 to 7 (16 to 128 processors), and 0.2 for hypercubes of dimension 8 or larger (256 or more processors).
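The two quantities just defined can be estimated directly from the dense factorization timing model. The sketch below is an assumption-laden illustration, not the dissertation's code: the pivot-check time is omitted for brevity, and P_desired is found by a numeric scan over subcube sizes rather than by solving the simplified closed-form equation described next, so borderline cases may differ from Table 4-10.

```python
import math

# Sketch of the P_desired / P_most calculations built on the fan-out
# timing model (times in microseconds; pivoting time omitted).

def t_m(r, k):
    return 4 * sum(r - i for i in range(1, k + 1)) + 5 * k

def t_b(r, k, P):
    return sum(195.22 + 50.02 * math.log2(P)
               + (0.488 + 1.8 * math.log2(P)) * (r - i)
               for i in range(1, k + 1))

def t_u(r, c, k):
    return 5 * sum((c - i) * (r - i) for i in range(1, k + 1)) + 17 * k + 3 * k * c

def t_para(r, c, k, P):
    if P == 1:
        return t_m(r, k) + t_u(r, c, k)
    return t_m(r, k) + t_b(r, k, P) + t_u(r, c, k) / P

def p_desired(r, c, k, target_eff, max_dim=10):
    # largest subcube whose modeled efficiency stays at or above the
    # target (a numeric scan in place of the closed-form solution)
    best, seq = 1, t_para(r, c, k, 1)
    for d in range(1, max_dim + 1):
        P = 2 ** d
        if seq / (P * t_para(r, c, k, P)) < target_eff:
            break
        best = P
    return best

def p_most(r, c, k):
    # zero of the derivative of parallel time with respect to P, rounded
    # down to a power of 2, checked against doubling, and capped by the
    # column count for a column-oriented kernel
    raw = t_u(r, c, k) * math.log(2) / (
        50.02 * k + 1.8 * sum(r - i for i in range(1, k + 1)))
    P = 2 ** max(0, int(math.floor(math.log2(raw))))
    if t_para(r, c, k, 2 * P) < t_para(r, c, k, P):
        P *= 2
    return min(P, 2 ** math.ceil(math.log2(c)))
```

For the 16 X 16 X 2 configuration this yields P_most = 8, matching the limit quoted for that case in the text.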
Given these values for target efficiency, the value of P_desired is found by solving the following for P:

    Target_Eff = T_sequential / (P * T_Para(ID, P))

This turns out to be a nonlinear equation with P*log2(P) and P terms. Thus, to simplify its solution, P*log2(P) is replaced by P^2, and the conservative result is then rounded up to the nearest encompassing subcube. Table 4-10 provides sample results from this calculation for the various target efficiencies.

Each processor executes the following:

    do k = 0, num_of_pivots - 1
        if ( column k is local ) then
            do i = k+1, n-1
                compute multipliers: l(i,k) = a(i,k) / a(k,k)
            enddo
            broadcast the column just computed
        else
            receive the column just computed
        endif
        for ( all columns j > k that are local ) do
            do i = k+1, n-1
                a(i,j) = a(i,j) - l(i,k) * a(k,j)
            enddo
        endfor
    enddo

Figure 4-17. Distributed Memory Fan-Out Factorization

The determination of P_most, on the other hand, simply involves calculating the value of P at which parallel execution time begins to increase. As parallel execution time is a function of row size, column size, number of pivots, and number of processors, we can take the partial derivative of this function with respect to the number of processors and solve for the value of P that zeros the derivative. The function for parallel time is:

    T_parallel(r, c, k, P) = t_m(r, k) + t_p(r, k) + t_b(r, k, P) + t_u(r, c, k) / P

where t_m, t_p, t_b, and t_u are the component functions for multipliers, pivoting, broadcasts, and updates (as defined earlier), and r, c, and k are the row size, column size, and number of pivots. Thus, solving

    dT_parallel / dP = 0

one gets

    P = t_u(r, c, k) * ln(2) / [50.02 * k + 1.8 * sum_{i=1..k} (r - i)]

Once P has been determined, it is rounded down to the nearest power of 2. Then, T_Para(ID, P) and T_Para(ID, 2*P) are evaluated and P_most assigned either P or 2*P, based on which produces the lesser execution time.

[Figure 4-18. Time Line for Partial Dense Factor Routine — per pivot: compute multipliers, broadcast multipliers, update local columns]

As a column-oriented dense factorization algorithm is used, P_most is also restricted to be no larger than the smallest power of 2 that is greater than or equal to the number of columns. As an example of what this calculation produces, all of the configurations in the P_desired example of Table 4-10 are limited by column size except for the 16 X 16 X 2 case, which is limited to 8 processors.

With P_desired and P_most now defined for each task node, processor allocation and scheduling can be carried out with the decision sequence defined in Figure 4-19. This decision sequence is repeated until all of the system's processors have been allocated or until a task does not meet the allocation criteria. (In this latter case, it may be possible to allocate other tasks with lesser priority and achieve better overall performance, but this extension produced significantly more run-time overhead with only moderate improvements.)

Test Matrices

With the updates just described made to the simulation program, traces of assembly DAGs from the latest version of the unsymmetric-pattern multifrontal method were used as input [35]. This latest version of the method, called afstack, produced assembly DAGs with significantly better parallelism than the earlier DAGs used in the unbounded and bounded parallelism PRAM models. This better parallelism was reflected in DAGs with lower depth ratios (topological levels in the DAG divided by the number of nodes in the DAG) and fewer assemblies as a percentage of overall node weight.

Table 4-6.
Empirical Factoring Speed-Ups

    Rows X Cols X Pivots   1-Cube  2-Cube  3-Cube  4-Cube  5-Cube
                           (P=2)   (P=4)   (P=8)   (P=16)  (P=32)
    32 X 32 X 1            1.55    2.39    3.18    3.63    3.74
    32 X 32 X 8            1.65    2.40    2.96    3.32    3.32
    32 X 32 X 32           1.44    1.81    1.93    1.i,,   1.73
    64 X 64 X 1            1.89    3.36    5.39    7.48    8.95
    64 X 64 X 8            1.88    3.28    5.17    7.28    8.82
    64 X 64 X 32           1.84    3.11    4.63    5.93    6.66
    64 X 64 X 64           1.78    2 5     3.92    4.63    4.94
    128 X 128 X 1          1.96    3.69    6.64    10.84   15.37
    128 X 128 X 8          1.95    3.69    6.61    11.14   16.41
    128 X 128 X 64         1.94    3.60    6.25    9.74    13.25
    128 X 128 X 128        1.19    3.49    5.81    8.55    10.88

Table 4-7. Rectangular Matrix Speed-Ups

    Rows X Cols X Pivots   1-Cube  2-Cube  3-Cube  4-Cube  5-Cube
                           (P=2)   (P=4)   (P=8)   (P=16)  (P=32)
    512 X 32 X 8           1.82    3.03    4.55    6.35    7.69
    32 X 512 X 8           1.98    3.87    7.33    13.21   21.53

Specific characteristics of the matrices tested are provided in Table 4-11. The results for DAGs produced by the older version of the method are included in parentheses (where they were available).

Simulation Results

The simulation results are provided in Tables 4-12 and 4-13 below in terms of the speed-ups achieved for hypercubes of dimensions 1 through 10 (processor sets of 2 to 1024). All maximum speed-ups were achieved with processor sets in this range. Noticeable in these results is that matrices with a larger percentage of their node weight attributed to assembly had worse performance. This is consistent with the model assumption of purely sequential assemblies within the node tasks. As it is quite possible to perform at least some fraction of the assemblies concurrently within a node task, a revision to the model is justified.

The revision was completed by redefining assembly time so that numerical assemblies were done fully in parallel by the number of processors assigned to the frontal matrix task. Thus, t_a became:

    t_a = (8 * A) / P + sum_{e} [204 + 0.573 * (8 * A_e)]

With parallel execution time thus redefined by the new t_a, corresponding revisions were needed and made to the formulas for P_desired and P_most. The results of this revision to the model are shown in Tables 4-14 and 4-15.

Table 4-8. Empirical Versus Analytical Times

    Rows X Cols X Pivots   0-Cube  1-Cube  2-Cube  3-Cube  4-Cube  5-Cube
                           (P=1)   (P=2)   (P=4)   (P=8)   (P=16)  (P=32)
    32 X 32 X 8 (e)        33.5    20.3    14.0    11.3    10.1    10.1
    32 X 32 X 8 (m)        32.3    19.1    12.0    8.9     7.7     7.6
    64 X 64 X 1 (e)        21.4    11.3    6.4     4.0     2.9     2.4
    64 X 64 X 1 (m)        20.3    10.7    5.8     3.5     2.4     1.9
    128 X 128 X 64 (e)     3165.8  1632.9  879.4   506.9   325.3   238.9
    128 X 128 X 64 (m)     3078.1  1581.1  832.0   464.5   2".9    206.7

Table 4-9. Empirical Versus Analytical Speed-Ups

    Rows X Cols X Pivots   1-Cube  2-Cube  3-Cube  4-Cube  5-Cube
                           (P=2)   (P=4)   (P=8)   (P=16)  (P=32)
    32 X 32 X 8 (e)        1.65    2.40    2.96    3.32    3.32
    32 X 32 X 8 (m)        1.69    2.69    3.63    4.17    4.28
    64 X 64 X 1 (e)        1.89    3.36    5.39    7.48    8.95
    64 X 64 X 1 (m)        1.90    3.49    5.84    8.50    10.54
    128 X 128 X 64 (e)     1.94    3.60    6.25    9.74    13.25
    128 X 128 X 64 (m)     1.95    3.70    6.63    10.69   14.89

Table 4-10. Sample P_Desired Calculations

    Rows X Cols X Pivots   Eff=0.8  Eff=0.5  Eff=0.2
    16 X 16 X 2            P=1      P=4      P=8
    32 X 32 X 8            P=2      P=8      P=16
    64 X 64 X 8            P=4      P=16     P=32
    128 X 128 X 64         P=8      P=16     P=32

[Figure 4-19. Scheduling Decision Diagram]

4.4 Conclusions

While the results may not be outwardly spectacular, they are actually very promising when compared to a similar simulation study based on a distributed memory version of the classical multifrontal method [122]. In that study by Pozo, speed-ups for the BCSSTK24 matrix were 29.27 based on an iPSC/2 hypercube and 23.35 based on an iPSC/860 hypercube. These figures compare to 32.67 and 43.99 (sequential assembly and parallel assembly, respectively) for the nCUBE 2 based simulation studies documented here.
Insufficient specifics were provided in the referenced article to fully assess the relative differences in assumptions between my simulation model and theirs, but from the details that were provided, my assumptions appear to be more conservative. Thus, I have some confidence in the significance of these results.

It is important to keep in mind that these speed-up figures are relative to a sequential version of the same algorithm. Yet, as both the classical and unsymmetric-pattern multifrontal methods are competitive when run sequentially, the speed-ups as reported here are still significant.

Chapter Summary: The purpose of this chapter has been to investigate the theoretical and achievable parallelism of the unsymmetric-pattern multifrontal method for sparse LU factorization. This investigation was based on the assembly DAG produced by the method during an analyze-factorize operation. The initial analytical models showed that significant parallelism was available if the parallelism both between and within nodes of the assembly DAG was exploited. When medium grain parallelism was used within frontal matrices, speed-ups as high as 167 were possible on the relatively modest sized test problems. Furthermore, when fine grain parallelism was used within frontal matrices, the theoretical speed-ups increased to as much as I. "1.' and 11, 1-_ for two of the test matrices.

Table 4-11. Test Matrix Characteristics

    MATRIX     ORDER  NONZEROS  DEPTH RATIO    ASSEMBLIES (%)
    MAHINDASB  1258   7682      0.031 (0.29)   8.64 (20.'.)
    GEMAT11    4929   331S-,    0.014 (0.02)   6.1'~, (10.4.)
    GRE_1107   1107   5664      0.051 (0.08)   4.'', (17.1'.)
    LNS_3937   3937   25407     0.057 (0.18)   10.41 (32.!'.)
    SHERMAN5   3312   20793     0.005 (0.07)   2.21 (21'.)
    RDIST1     4134   94408     0.075          1'.'
    BCSSTK24   3562   159910    I 11'          2 1'

Table 4-12.
Sequential Assembly Version Results Part 1

    MATRIX     P=2   P=4   P=8   P=16  P=32
    MAHINDASB  1.30  1.75  2.89  3.32  3.39
    GEMAT11    1.65  3.18  5.68  7.95  11.49
    GRE_1107   1.77  3.23  5.07  7.68  10.21
    LNS_3937   1.62  2.73  4.36  6.04  7.29
    SHERMAN5   1.87  3.20  5.96  9.49  12.64
    RDIST1     1.82  3.23  5.83  8.65  16.25
    BCSSTK24   1.89  3.47  6.25  9.S5  15.22

Table 4-13. Sequential Assembly Version Results Part 2

    MATRIX     P=64   P=128  P=256  P=512  P=1024
    MAHINDASB  3.39   3.39   3.43   3.43   3.43
    GEMAT11    12.48  12.60  14.01  14.02  14.02
    GRE_1107   11.78  12.09  12.81  12.81  12.81
    LNS_3937   7.93   8.11   8.87   8.88   8.88
    SHERMAN5   18.62  20.22  22 ^   22.98  22.98
    RDIST1     22.29  22.50  26.08  26.08  26.16
    BCSSTK24   26.57  32.67  38.87  39.29  39.29

Whereas the initial analytical models were based on an assumption of an unbounded number of available processors, the simulation models that followed assumed a bounded number of processors. The results of these models showed that the theoretical parallelism seen in the analytical models was achievable on realistically sized processor sets. Specifically, the speed-ups of up to 167 seen when using medium grain parallelism within nodes and parallelism across nodes were achievable with a maximum of 512 processors. Moreover, when the parallelism within nodes was changed to fine grain, the corresponding speed-ups of up to "i.S and 11, !_' were achievable on no more than 65,536 processors.

The final objective of this chapter was then to revise the simulation models to represent an existing architecture and estimate how much of the theoretical parallelism was actually achievable. The earlier performance characterization of the nCUBE 2 multiprocessor was used to set the simulation parameters, and the model was revised to correspond to a distributed memory environment. The results of the distributed memory model showed that 20 to 25 percent of the theoretical parallelism was typically achievable. Furthermore, performing the assemblies in parallel produced significant benefits.

Table 4-14. Parallel Assembly Version Results Part 1

    MATRIX     P=2   P=4   P=8   P=16   P=32
    MAHINDASB  1.31  1.81  2.14  4.14   4.29
    GEMAT11    1.68  3.21  6.05  8.49   12.37
    GRE_1107   1.82  3.29  5.54  8.71   12.84
    LNS_3937   1.75  3.10  5.13  7.83   10.16
    SHERMAN5   1.91  3.34  6.07  11.00  17.95
    RDIST1     1.87  3.40  5.96  9.38   16.16
    BCSSTK24   1.93  3.67  6.72  11.28  18.01

Table 4-15. Parallel Assembly Version Results Part 2

    MATRIX     P=64   P=128  P=256  P=512  P=1024
    MAHINDASB  4.29   4.29   4.41   4.41   4.41
    GEMAT11    15.17  16.47  17.89  17.96  17.96
    GRE_1107   15.32  15.91  17.02  17.02  17.02
    LNS_3937   15.35  16.83  18.37  18.45  18.45
    SHERMAN5   24.95  29.55  32.39  32.64  32.65
    RDIST1     23.24  27.20  32.02  32.02  32.09
    BCSSTK24   31.88  43.99  53.49  58.63  58.66

With the prospect of achieving significant parallel performance established by the models of this chapter, the next step is to actually implement a distributed memory, unsymmetric-pattern multifrontal method. Chapter 5 begins the implementation by developing and analyzing a number of highly tuned partial dense factorization kernels, which will account for most of the computational workload.

CHAPTER 5
PARTIAL DENSE FACTORIZATION KERNELS

The primary computational function performed within an unsymmetric-pattern multifrontal method is the partial factorization of dense submatrices (frontal matrices). The term partial factorization refers to the application of only a limited number of pivot steps for each frontal matrix. The remainder of the matrix entries are updated by the pivot steps but not factored themselves; they will be passed on and assembled into other frontal matrices, where they may be further updated and eventually factored.
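The arithmetic of a partial factorization can be sketched in a few lines. The following is a dense, single-process illustration only, not one of the distributed kernels developed below; the trailing block it leaves behind is the contribution block described above.

```python
import numpy as np

# Illustration of partial dense factorization: apply only k pivot steps
# to a frontal matrix F. The first k columns/rows end up holding the
# computed parts of L and U; the trailing block is only updated and
# becomes the contribution block passed on for later assembly.

def partial_factor(F, k):
    F = np.array(F, dtype=float)
    for p in range(k):
        F[p+1:, p] /= F[p, p]                               # multipliers (column of L)
        F[p+1:, p+1:] -= np.outer(F[p+1:, p], F[p, p+1:])   # rank-1 update
    return F
```

After k steps, F[k:, k:] equals the Schur complement of the leading k X k pivot block, which is exactly what gets assembled into subsequent frontal matrices.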
This chapter develops a number of routines to perform this partial dense factorization (PDF) function. A basic column-scattered, fan-out algorithm is developed first. This algorithm is improved upon by the use of optimized vector functions from the Basic Linear Algebra Subroutines (BLAS 1) and by the introduction of pipelining. Numerical considerations are then incorporated into both the basic and pipelined algorithms, with a major emphasis placed on finding alternative pivots in other columns when anticipated pivots are numerically unacceptable. A parallel prefix operation among the participating processors provides an efficient, consistent, and deterministic recovery mechanism. A special streamlined algorithm is developed for the single pivot case. The performance of all the developed algorithms is measured and analyzed. Specific performance effects of the various enhancements are reported. Finally, performance prediction formulas are derived from analytical models of the routines and compared against observed performance. These predictions will be useful in the next chapter, when task scheduling and allocation are addressed.

The strategy of a column-scattered allocation of frontal matrices is adopted for all of the partial dense factorization kernels to be discussed in this chapter. There are a number of reasons for this. First, the column-scattered allocation provides good load balancing. Also, it provides easy incorporation of partial pivoting strategies and allows the location of entries to be easily determined.

5.1 Basic Algorithms

The first two partial dense factorization kernels assume that the diagonal elements of the pivot block are numerically acceptable, and the kernels use them as pivots without any numerical checking. The assumption of numerically acceptable pivots on the diagonal is consistent with matrices having properties such as diagonal dominance. The first such kernel is called P1 and uses a simple fan-out approach.
The second is called P2 and improves performance by the use of pipelining.

5.1.1 P1 - Basic Fan-Out Algorithm

The first algorithm implemented is the basic fan-out column-scattered method introduced earlier in Section 2.6. This method has each participating processor loop through all of the frontal matrix's pivots. If the processor owns the current pivot, it computes and broadcasts the multipliers (the column of L corresponding to that pivot). If the processor does not own the current pivot, it waits to receive the multipliers from the owner processor. Following completion of the broadcast, all processors update the locally held columns of the active submatrix. An outline of this algorithm, referred to as P1, is provided in Figure 5-1. A major drawback of this algorithm is that all of the processors that do not own the current pivot are essentially idle during the computation of the multipliers and the subsequent broadcast initiation. While in a larger context this idle time could be used for other processing (such as completing assemblies), a more desirable solution would be a more efficient factorization algorithm.

P1 ALGORITHM: (Basic fan-out based on a column-scattered allocation)

    for ( pivots 1 thru number of pivots ) do
        if ( next pivot column is local to this processor )
            compute multipliers
            broadcast multipliers to subcube for frontal matrix
        else
            receive broadcast
        endif
        update remaining active local columns using the current multipliers
    endfor

Figure 5-1. P1 Basic Fan-Out Algorithm

5.1.2 P2 - Pipelined Fan-Out Algorithm

The second algorithm uses the concept of pipelining to minimize idle time and overlap computation with communication. This algorithm was originally suggested by Geist and Heath [62]. In this pipelined version, once the processor that owns the next pivot receives the multipliers for the current pivot, it updates only the next pivot column and then computes and broadcasts the multipliers for that pivot.
Only after the broadcast of these subsequent multipliers has been initiated will the rest of the updates for the current pivot be done. In this way, the multipliers for the next pivot will already be computed and distributed by the time the other processors are ready for them. The effect is to overlap the updating due to the current pivot with the computation of multipliers for the next pivot and thus preclude much of the idle time experienced with the basic fan-out algorithm. An outline of the basic pipelined algorithm (P2) is provided in Figure 5-2.

P2 ALGORITHM: (Pipelined fan-out based on a column-scattered allocation)

    if ( first pivot column is local to this processor )
        compute multipliers
        broadcast multipliers
        save a ptr to these next multipliers
    endif
    for ( all potential pivot columns ) do
        if ( current pivot column is not local to this processor )
            receive broadcast from pivot owner
        else
            adjust ptr to use saved next multipliers
        endif
        if ( next pivot column is local to this processor )
            update next pivot column with received multipliers
            compute multipliers
            broadcast multipliers
            save a ptr to these next multipliers
        endif
        update remaining active columns of the frontal matrix
    endfor

Figure 5-2. P2 - Pipelined Fan-Out Algorithm

5.2 Use of BLAS 1 Routines

As most of the computations required by a partial dense factorization algorithm entail repeated use of simple vector operations, performance improvements in these operations will significantly enhance the performance of the overall factorization routine. The Basic Linear Algebra Subroutines 1 (BLAS 1) provide these optimized vector operations [110, 111]. Furthermore, these routines are available for most high-performance computers and have been specifically tailored for their host architectures. Thus, the BLAS provide both high performance and portability.

The original motivation of the BLAS 1 routines was to exploit the high performance available from vector processing units.
While the nCUBE 2 processors are not equipped with vector processing units, it is possible to significantly improve the performance of BLAS 1 routines by making efficient use of the nCUBE 2's instruction cache, many addressing modes, and CISC-style instructions. The assembly-coded BLAS 1 routines available on the nCUBE 2 do this and achieve a performance that is 3.3 to 7.8 times faster than comparable C code.

A number of BLAS 1 routines were integrated into the first two algorithms and will be used in the subsequent algorithms. Specifically, the Saxpy (Daxpy for double precision) routine adds a vector-scalar product to another vector (y <-- y + ax) and is used to perform the outer product update (by columns) of the active submatrix for each pivot. The Sscal (Dscal for double precision) routine scales a vector by multiplying it by a scalar term (x <-- ax). The Sscal routine is used to calculate the multipliers (column of L) for a particular pivot. These routines improved the algorithm's execution time by factors of 2 to 3.67, with the larger speed-ups occurring with the larger matrices.

In the next algorithms, where threshold partial pivoting is incorporated, three other BLAS 1 routines will be useful. The Isamax (Idamax for double precision) routine finds and returns the index of the largest magnitude entry in a specified vector. The Sswap (Dswap for double precision) routine swaps the contents of two vectors. Finally, the Scopy (Dcopy for double precision) routine makes a copy of a vector. All of these BLAS 1 routines allow non-unit stride (adjacent vector entries need not be adjacent in memory, but do need to be a constant distance apart). This enables easy access to matrix rows in a column-major storage scheme (as well as columns in a row-major scheme). Furthermore, with most of the nCUBE 2 BLAS 1 routines there is no performance difference between unit and non-unit stride operations.
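As an illustration of the stride semantics described above, the following pure-Python stand-ins mimic the Sscal and Saxpy operations (sketches only; the real routines are assembly coded on the nCUBE 2):

```python
# Pure-Python sketches of the two BLAS 1 kernels the text uses most:
# an Sscal analogue (forming the multipliers) and a Saxpy analogue
# (applying the rank-1 update column by column).  The incx/incy arguments
# mimic non-unit stride, e.g. walking a row of a column-major matrix.

def scal(n, alpha, x, incx=1):
    """x <- alpha * x over n strided entries (Sscal-style sketch)."""
    for k in range(n):
        x[k * incx] *= alpha

def axpy(n, alpha, x, y, incx=1, incy=1):
    """y <- y + alpha * x over n strided entries (Saxpy-style sketch)."""
    for k in range(n):
        y[k * incy] += alpha * x[k * incx]
```

In a column-major n-by-n matrix stored as a flat array, row i begins at offset i with stride n, which is exactly the access pattern the non-unit-stride support enables.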
5.3 Inclusion of Row and Column Pivoting

While the basic concept of a partial factorization algorithm is clear and easily understood, numerical considerations are more challenging. In particular, pivot choices are limited to the leading principal submatrix with order equal to the number of pivot steps to be applied. This submatrix is referred to as the pivot block. As the pivot block will usually not include all of the pivot columns' entries, there can be large entries in pivot columns that are not part of the pivot block. As a result, no pivot block entry for such a column may pass the threshold pivoting criteria, and that pivot becomes lost. However, if another acceptable pivot block column is used, its resulting update may cause the earlier column to become acceptable. Consequently, numerical checks must determine the maximum magnitude value in a pivot column and attempt to find a pivot (from the pivot block) that can pass the corresponding pivot threshold. Upon a failure, other pivot columns should be checked accordingly. Only when no pivot column has an acceptable pivot should the search be terminated. Furthermore, once a successful pivot has been applied, pivot columns that previously failed may be rechecked, as their values have been updated by the pivot application.

Since pivot columns are distributed across processors, checking all pivot columns will require communication. As the pivot search operation has a relatively low computational cost (compared to communication costs), the strategy of checking all the local pivot columns prior to initiating communication is adopted. This strategy was integrated into both the basic and pipelined versions of the fan-out, column-scattered partial factorization algorithms.
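The per-column check just described can be sketched as follows. This is a simplified illustration, not the dissertation's code; the relative threshold parameter `u` and its default value of 0.1 are assumptions on my part.

```python
def find_threshold_pivot(col, nblock, u=0.1):
    """Search one pivot column for an acceptable pivot.

    col    : full pivot column (pivot block entries occupy rows 0..nblock-1)
    nblock : order of the pivot block
    u      : threshold parameter (hypothetical default)

    Returns the row index of the largest pivot-block entry if it passes the
    threshold test |entry| >= u * max|col|, else None -- the "lost pivot"
    case, where the column's large entries lie outside the pivot block.
    """
    colmax = max(abs(v) for v in col)   # maximum over the WHOLE column
    if colmax == 0.0:
        return None                     # zero column: nothing usable
    best = max(range(nblock), key=lambda i: abs(col[i]))
    return best if abs(col[best]) >= u * colmax else None
```

A column that fails this test may still become acceptable after another pivot's update modifies its entries, which is why failed columns are rechecked as the factorization proceeds.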
5.3.1 P3 - Basic PDF with Pivot Recovery

Since the basic fan-out algorithm essentially leaves all but the current pivot owner processor idle during pivot selection and the computation of the corresponding column of L, a natural modification is to have all processors search through their active pivot columns while the current pivot owner is doing its pivot selection. Thus, if the current pivot owner fails to find an acceptable pivot column, the other processors will at least have started searching for an alternative pivot. Upon notification of a pivot failure, the other processors will either continue their search or reply with their results if their search is already completed. Furthermore, if the processors that do not own the current pivot complete their search before any notification from the current owner, they may even proceed to compute the corresponding column of L (in a separate work area) if their search was successful.

Care must be taken, however, not to introduce excessive additional costs in the typical situations where the current pivot owner finds a good pivot. To protect against this, each processor other than the current pivot owner searches a single column at a time and then checks whether the current owner has broadcast its results. (The test for a received broadcast can be done via the test function on the nCUBE 2 system [111].) If the notification is of a successful pivot, the corresponding multipliers (column of L) are included in the notification and the alternative searches are terminated immediately.

The other challenge is to determine an appropriate new pivot owner in an efficient and deterministic fashion when the current owner fails to find a good pivot. The parallel prefix operation imin is an excellent tool for this purpose [111]. The imin function operates on a specified cube/subcube with each node providing a single integer value.
The minimum value is determined using a parallel prefix computation, and the result is provided to either one or all of the participating nodes. For purposes of determining a new pivot owner, each processor node in the subcube for the frontal matrix provides either the number of nodes in the current subcube (if the node fails to have a good pivot) or the forward difference from the ID of the current pivot owner processor node to the reporting node's ID. If there is an acceptable pivot available, the imin operation will determine the forward delta (difference in IDs) from the current pivot owner to the nearest processor with a good pivot (in increasing node ID order). Alternatively, if there is no acceptable pivot in any active pivot column, this delta will be equal to the number of nodes in the subcube, and all nodes will know to discontinue factorization of this frontal matrix.

If a new owner is found, it will compute the corresponding multipliers (if they were not already precomputed during earlier idle time) and then broadcast them to the rest of the cube/subcube. The other processors will know who the new owner is from the imin operation and simply wait for that node to provide its good multipliers. The whole process is done with a single small broadcast notifying the subcube of the owner's failure to find a good pivot, an imin parallel prefix operation to determine the new owner, and the broadcast of the resulting good multipliers. Much of the search for alternative pivots is done during what would otherwise be idle time.

To avoid nondeterminism, the implementation of this algorithm always begins the alternative pivot search (when not the current pivot owner) using a static ordering of its active pivot columns and a fixed starting position.
If this is not done, the column on which the search is terminated, due to receipt of a good pivot from the current owner, will be timing dependent, and an initiation of the next search based on this previous termination point can produce different results (different selected pivot column orderings). An outline of the P3 algorithm and its component subroutines is provided in Figures 5-3, 5-4, 5-5, and 5-6.

P3 ALGORITHM: (Basic fan-out with lost pivot recovery across columns and overlapped column pivot searching.)

    for ( pivots 1 thru number of pivots ) do
        if ( this processor's turn for current pivot )
            call ATTEMPT_LOCAL_PIVOT
        else /* not this processor's turn for current pivot */
            call LOOK_FOR_ALT_PIVOTS
            if ( pivot owner does NOT have a good pivot column )
                call RESOLVE_NEW_PIVOT_OWNER
            endif
        endif
        update the row and column permutations for the resulting pivot column
        determine which processor will own next pivot
            (look forward from current pivot owner to next
             processor that has active pivot columns)
        update active local frontal matrix columns using the current multipliers
    endfor

Figure 5-3. P3 - Basic PDF with Pivot Recovery

In order to accurately track both row and column permutations to the frontal matrix being partially factorized, each participating processor maintains row and column permutation vectors. Broadcast messages that include multipliers also include a header which specifies status and the row/column permutations for the corresponding pivot. In this way, all processors can maintain consistent records of the applied permutations. In order to avoid unnecessary memory-to-memory transfers for message preparation, multipliers (which occur consecutively in the column-major storage format) are prefixed with the header information and provided to the broadcast function in place. Allocation of the frontal matrix data area includes the necessary extra prefix memory.
Data objects displaced by the header are saved before the header is created and restored after the broadcast. To avoid data alignment problems, the integer header information is converted into floating point format for its storage as a prefix to the floating point message. This limits the values allowed for these integers to 24 bits, but this is sufficient for the problem sizes currently targeted. These broadcast message mechanisms are also used by the P4 algorithm, discussed next.

Subroutine ATTEMPT_LOCAL_PIVOT:

    while ( this processor has active pivot columns to check )
      and ( have not found a good pivot column ) do
        check next local pivot column
    endwhile
    if ( this processor has a good pivot column )
        compute multipliers and row/col permutations
        broadcast results to the others
    else
        broadcast "no_pivot" status
        determine nearest good pivot column holder via
            imin parallel prefix computation
        if ( no others have a good pivot col )
            return ( number of good pivots )
        else
            receive multipliers from new pivot owner
        endif
    endif

Figure 5-4. P3 Subroutine: ATTEMPT_LOCAL_PIVOT

5.3.2 P4 - Pipelined Partial Factoring with Pivot Recovery

The problem with the P3 algorithm just described is that, even when pivots are easily found, its performance is only near that of the basic fan-out algorithm. A more desirable situation would be to integrate the pivot recovery mechanisms of the third algorithm (P3) with the pipelining concept of the basic pipelined algorithm (P2). An effective integration of these two concepts would preserve the high performance of pipelining when pivots are easily found without jeopardizing efficient overlapped searching for alternative pivots. The P4 algorithm is such an integration.

With the P4 algorithm, the owner of the next pivot searches its own pivot columns prior to completing the updates for the current pivot. The results of the search are broadcast to the other nodes and buffered until they complete their current updates.
If a pivot is found, the multipliers are already computed and available in the broadcast message. If the search for the next pivot is unsuccessful, the initial next pivot owner completes the rest of its current pivot updates before participating in the resolution of the next pivot's recovery. This allows the other nodes time to complete their current pivot updates and perform their local searches for a next pivot. As the amount of computation done by each processor is roughly equivalent, all nodes should be ready for the imin determination of the next pivot owner at nearly the same time. (This new pivot owner determination is done in the same way as it was in P3.)

Subroutine LOOK_FOR_ALT_PIVOTS:

    while ( this processor has active pivot columns to check )
      and ( have not found a good pivot column )
      and ( have not received a good pivot column ) do
        check my next active pivot column
        if ( pivot owner has sent broadcast (ntest) )
            receive broadcast and determine status
        endif
    endwhile
    if ( pivot owner has not yet sent broadcast )
      and ( this processor has a good pivot column )
        compute multipliers using my pivot column and hold them in a
        temporary buffer, also the corresponding row/col permutations
    endif
    if ( pivot owner has not yet sent broadcast )
        wait for and receive broadcast
    endif

Figure 5-5. P3 Subroutine: LOOK_FOR_ALT_PIVOTS

A graphical representation of this process is seen in the time line of Figure 5-7. Here processor Proc1 is the initial next pivot owner and does its search of active pivot columns (updating each pivot column with the previous pivots) prior to updating the rest of the active columns with the updates for the current pivot. Meanwhile, the other processors do the current update, then find the status of the next pivot and do their local pivot searches while processor Proc1 completes its current pivot updates. While much of the recovery effort is overlapped, as it was in the P3 algorithm, the recovery will result in a drain of the pipeline.
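The forward-delta convention used with the imin operation (in both P3 and P4) can be sketched as a single-process simulation of the reduction's outcome; the function and parameter names here are hypothetical:

```python
def resolve_new_owner(current_owner, has_good_pivot, nprocs):
    """Simulate the imin-based new pivot owner resolution.

    has_good_pivot : list indexed by node ID; the current owner is assumed
                     to have already failed (its entry is False).
    Each node contributes either the subcube size (no good pivot locally)
    or the forward distance from the current owner's ID to its own ID.
    The minimum contribution -- what the imin parallel prefix would
    return -- names the new owner; a minimum equal to the subcube size
    means no node has an acceptable pivot, so factorization of this
    frontal matrix is discontinued (returned here as None).
    """
    contributions = [
        (node - current_owner) % nprocs if has_good_pivot[node] else nprocs
        for node in range(nprocs)
    ]
    delta = min(contributions)
    return None if delta == nprocs else (current_owner + delta) % nprocs
```

Because every node computes the same minimum from the same contributions, all nodes agree on the new owner without any further communication, which is what makes the recovery deterministic.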
The pipeline will be "reprimed" with the multipliers produced by the new owner of the next pivot (Proc2 in Figure 5-7). During both the initial priming of the pipeline and any subsequent "reprimings," it is possible to take advantage of the idle time on the nonpivot owner nodes by having them search for alternative K+2nd pivots. This would improve the worst case behavior of P4 to be closer to the worst case behavior of P3. However, the worst case behavior is anticipated to be infrequent, and the additional code complexity was deemed not worth the potential advantage.

A more specific statement of the P4 algorithm can be found in Figures 5-8, 5-9, 5-10, and 5-11. As alternative pivot searching in this algorithm is only done when required, and in a deterministic manner, there is no timing dependent behavior. The algorithm is deterministic.

Subroutine RESOLVE_NEW_PIVOT_OWNER:

    if ( this processor has a good pivot column )
        send delta from pivot owner's node id to this node id
            in imin prefix computation
    else
        send number of processors for this frontal matrix
            in the imin prefix computation
    endif
    receive delta to new pivot owner from imin computation
    if ( delta == number of frontal processors )
        /* that is, no processor has a good pivot column */
        return ( number of good pivots )
    else
        if ( this processor is the new pivot owner )
            if ( multipliers not yet computed )
                compute the multipliers and row/col permutes
                broadcast the results
            else
                broadcast multipliers from temporary buffer
                    together with row/col permutations
                copy multipliers to frontal matrix
            endif
        else
            receive broadcast from the new pivot owner
        endif
    endif

Figure 5-6. P3 Subroutine: RESOLVE_NEW_PIVOT_OWNER

5.4 Single Pivot Version

Frequently in the current sequential implementation of the unsymmetric-pattern multifrontal method [34, 35], frontal matrices are encountered that require only a single pivot step. In such frontal matrices the pivot block is a single entry, and no row or column permutations can be made.
Furthermore, the looping overhead for all the pivots can be eliminated, as can other related overhead. As a result, a streamlined partial dense factorization routine, called P5, was developed for exclusive use on single pivot frontal matrices. The logic is very similar to that of the P1 algorithm with the loop over all pivots removed and a simple check for numerical acceptability of the single pivot added.

[Figure 5-7. P4 Pivot Recovery Timeline: a four-processor time line (Proc0 through Proc3) in which Proc1, the initial next pivot owner, searches for the K+1st pivot while the others perform the updates for the Kth pivot; after Proc1 broadcasts "no pivot found," the other processors search for alternative K+1st pivots, the new owner is determined via the imin operation (Proc2, nearest in forward distance), and Proc2 computes and broadcasts the multipliers for the K+1st pivot.]

5.5 Performance Achieved

In order to assess the performance achieved by each of the five algorithms introduced, user-defined events were inserted into the factorization algorithms, and the etool facility was used to determine the times at which the events occurred. From the results, detailed timings of the partial dense factorization algorithms were obtained. A driver program created square frontal matrices of orders 32, 64, 128, and 256, each run with 1, 8, 16, and 32 pivot steps. Subcubes of dimensions 0 through 5 were used, and speed-ups were determined based on comparison with the dimension 0 run times. Pivot rows and columns were generated using random numbers and were densely populated. In most runs, the pivot block was forced to be diagonally dominant to preclude the need for permutations. (However, some runs were made to force row permutations.) As the number of floating point operations for factorization of each of these combinations is well defined, achieved Mflops could also be determined.
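The flop count underlying such Mflops figures can be sketched with the conventional accounting (an assumption on my part; the dissertation's exact bookkeeping may differ): pivot step i of an n-by-n frontal matrix costs n - i divisions to form the multipliers plus 2(n - i)^2 multiply-add operations for the rank-1 update.

```python
# Hypothetical sketch of the conventional operation count for a partial
# dense factorization: k pivot steps applied to an n-by-n frontal matrix.
# Step i: (n - i) divisions for the multipliers + 2*(n - i)**2 flops for
# the outer product update of the active submatrix.

def pdf_flops(n, k):
    """Flops to apply k pivot steps to an n-by-n dense frontal matrix."""
    return sum((n - i) + 2 * (n - i) ** 2 for i in range(1, k + 1))

def mflops(n, k, seconds):
    """Achieved Mflops for a timed run of the same factorization."""
    return pdf_flops(n, k) / (seconds * 1e6)
```

For example, a 256-by-256 frontal matrix with 32 pivots has a well-defined count under this model, so a measured run time converts directly to an achieved Mflops rate.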
Comparison is also made of the basic and pipelined versions, together with the impact of using the BLAS 1 routines. The performance consequences of taking numerical considerations into account are also analyzed. Finally, brief discussions of the single pivot algorithm and overall double precision results are provided.

5.5.1 Floating Point Operations

The achieved Mflops (for the fastest versions of each kernel) are shown for single precision (32 bits) in Table 5-1 and for double precision (64 bits) in Table 5-2. These results are based on a frontal matrix of 256 by 256 entries with 32 pivots