
Analysis of potential parallel implementations of
the unsymmetric-pattern multifrontal method for
sparse LU factorization



Steven M. Hadfield and Timothy A. Davis
301 CSE, Computer and Information Sciences Dept.
University of Florida, Gainesville, FL 32611-2024 USA
(904) 392-1481, email: smh@cis.ufl.edu and davis@cis.ufl.edu


Technical Report TR-92-017 [1]

July 13, 1992










[1] Available via anonymous ftp to cis.ufl.edu as /cis/tech-reports/tr92/tr92-017.ps.Z. Supported in part by NSF grant ASC-911…






















Abstract


The unsymmetric-pattern multifrontal method for the LU factorization of sparse matrices
creates a series of smaller, dense frontal matrices that are partially factored and then com-
bined into a fully factored result. These smaller frontal matrices can overlap in rows and/or
columns, creating dependencies among them that are represented as an elimination
directed acyclic graph (DAG). This elimination DAG can serve as a task graph to control the
parallel factorization of frontal matrices. The potential for parallelism provided in the elim-
ination DAGs produced by the unsymmetric-pattern multifrontal method is investigated via
analysis and trace-driven simulations. Different levels of parallelism are explored using both
unbounded and bounded processor models. Various scheduling and task allocation schemes
are also investigated. Performance results are measured in terms of speed-ups and processor
utilizations.















Contents

1 Overview

2 Background
  2.1 LU Factorization
  2.2 Frontal Matrices (Elements)
  2.3 Elimination Directed Acyclic Graph
  2.4 Major Algorithmic Steps
  2.5 Potentials for Parallelism

3 Unbounded Parallelism Models
  3.1 Elimination DAG Analysis
  3.2 Model Definitions
  3.3 Matrices Analyzed
  3.4 Results
  3.5 Conclusions and Observations

4 Bounded Parallelism Models
  4.1 Initial Models
    4.1.1 Trace-Driven Simulation
    4.1.2 Processor Allocations
    4.1.3 Scheduling Methods
    4.1.4 Results
    4.1.5 Conclusions and Observations
  4.2 Refined Models
    4.2.1 Model 3 Revisions
    4.2.2 Models 2 and 4: Vertical Partitioning
    4.2.3 Model 4: Assembly Pipelining
  4.3 Scalability Analysis

5 Conclusions















Chapter 1



Overview




One of the cornerstones of numerical linear algebra is the solving of systems of linear equa-
tions [7]. LU factorization is the most common and appropriate method for solving such
systems when the only assumptions that can be made of the coefficient matrix are that
it is square and nonsingular. When the additional assumption of sparsity can be made,
algorithmic modifications are possible that can significantly reduce execution time.

Dr. Tim Davis, working with Dr. Iain Duff, has developed a very promising algorithm
for solving such systems of linear equations [4]. Their method focuses on large, sparse,
nonsingular systems that have an unsymmetric nonzero pattern. A most significant aspect
of the method is the potential it has for parallel implementation.

Current implementations of the method operate on sequential and vectorized processors.
Parallel implementations are upcoming. The focus of this project is to provide an assessment
of parallel implementation alternatives through both analysis and simulation.

The research presented in this document is organized into three parts. The first part, Chapter
2, provides some fundamental background on LU factorization and the unsymmetric-pattern
multifrontal method. This is followed by the results of the unbounded parallelism models
run against traces produced by the sequential version of the method. These results provide
an initial assessment of the potential parallelism of the method assuming a sufficiently large
number of available processors. The third and final part refines the unbounded models into
bounded models and uses trace-driven simulation techniques to predict the actual paral-
lelism achievable with various processor set sizes. Major concerns with these models are the
allocation of processors to subproblems and the scheduling of tasks.

The results of the analysis just outlined illustrate a strong potential for parallelism with
practical and realizable processor set sizes.














Chapter 2



Background



This section briefly overviews some of the basic concepts of the unsymmetric-pattern multi-
frontal method [4].


2.1 LU Factorization

LU factorization is a numerical algorithm based on Gaussian elimination for solving square,
nonsingular systems of linear equations of the form: Ax = b, where A is an n by n real matrix
and x, b are n dimensional real column vectors [7]. Specifically, the coefficient matrix, A, is
factored into the product of two triangular matrices: A = LU. L is unit lower triangular
and consists of the row multipliers used to reduce A to echelon form. U is the echelon
form of A which is upper triangular. With the A = LU factorization complete, the solution
to the system of linear equations, Ax = b, can be found by solving the two triangular
systems: Ly = b and Ux = y. This is a much simpler task, O(n^2) compared to O(n^3)
for the factorization. Typically, LU factorization is accompanied by row permutations to
ensure numerical stability and, particularly when dealing with sparse matrices, by column
permutations to recover from zero pivots. Such permutations are incorporated into the
unsymmetric-pattern multifrontal method.
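As a concrete illustration of these steps, the following minimal sketch (in Python, for illustration only; it is not part of the method's software) factors a small dense matrix with partial pivoting and then solves the two triangular systems:

    def lu_factor(A):
        """Factor A in place: L (unit lower) below the diagonal, U on and
        above it. Returns the row permutation from partial pivoting."""
        n = len(A)
        perm = list(range(n))
        for k in range(n):
            # Partial pivoting: largest entry in column k for stability.
            p = max(range(k, n), key=lambda i: abs(A[i][k]))
            A[k], A[p] = A[p], A[k]
            perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                A[i][k] /= A[k][k]                 # multiplier: entry of L
                for j in range(k + 1, n):
                    A[i][j] -= A[i][k] * A[k][j]   # update active submatrix

    def lu_solve(A, perm, b):
        """Solve Ax = b from the packed factors: Ly = Pb, then Ux = y."""
        n = len(A)
        y = [b[perm[i]] for i in range(n)]
        for i in range(n):                 # forward substitution (unit L)
            for j in range(i):
                y[i] -= A[i][j] * y[j]
        x = y[:]
        for i in reversed(range(n)):       # back substitution
            for j in range(i + 1, n):
                x[i] -= A[i][j] * x[j]
            x[i] /= A[i][i]
        return x

Note that the two substitution loops in lu_solve cost O(n^2) while the factorization's triply nested update costs O(n^3), matching the counts above.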


2.2 Frontal Matrices (Elements)

The unsymmetric-pattern multifrontal method uses LU factorization to solve very large sys-
tems of linear equations (A of size 1000 x 1000 and larger). In order to perform this task
efficiently, the method takes advantage of the sparse, yet unsymmetric pattern typical of
many such large coefficient matrices. The method does this by identifying elemental subma-









trices along the diagonal of the coefficient matrix. The nonzero elements of the first row and
column of the lower-right submatrix define the rows and columns of the elemental matrix.
These elemental matrices are also referred to as frontal matrices. They are typically much
smaller and very dense. Thus, they can be factored very efficiently by existing vectorized
and parallel subroutines.



2.3 Elimination Directed Acyclic Graph


A major complexity of this method arises from the fact that these elemental matrices can
(and do) overlap in rows, columns, or both. As a result, most elemental matrices can be
only partially factored. The unfactored portion of the submatrix has been modified by
the partial factorization and will be passed and assembled into other elemental matrices
for further factorizations. This "assembly" process creates data dependencies between the
elemental submatrices. These dependencies are represented with a directed acyclic graph
(DAG) where the elemental submatrices are represented by vertices and the dependencies
with edges. This DAG is referred to as the elimination DAG.

The unsymmetric pattern exploited by the method can result in parts of the unfactored
submatrix of a frontal matrix being assembled into multiple later frontal matrices. In symmetric
multifrontal methods [3] the assembly will be into only one later frontal matrix. This results
in an elimination DAG for the unsymmetric-pattern method as opposed to an elimination
tree for the symmetric-pattern methods.
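The distinction can be made concrete with a toy dependency structure (hypothetical node numbers, illustrative only):

    edges = {1: [3, 4],   # son -> parents that receive its contribution blocks
             2: [3],
             3: [5],
             4: [5],
             5: []}

    def is_tree(parents_of):
        # In a symmetric-pattern elimination tree each node contributes to
        # at most one later frontal; any node with two or more parents
        # makes the structure a DAG rather than a tree.
        return all(len(parents) <= 1 for parents in parents_of.values())

    print(is_tree(edges))   # False: frontal 1 feeds both 3 and 4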



2.4 Major Algorithmic Steps


The execution of the unsymmetric-pattern multifrontal method is thus controlled in large
part by this elimination DAG. Two of the major steps in the method include the assembly of
data passed between elemental matrices and the actual numerical factoring. These are the
algorithmic steps that will be modeled in this effort. These two steps constitute the majority
of the processing of the factor-only version of the unsymmetric-pattern multifrontal method,
which is the version analyzed in this effort. The entire method is implemented in
the analyze-factor version. Within this larger version these steps typically account for over
fifty percent of the method's execution time [4]. The limited scope of this effort allows
the assumption of an existing and static elimination DAG. This is a simplification of what
actually happens since possible modifications of the elimination DAG and memory allocation
for frontal matrices are ignored.









2.5 Potentials for Parallelism


There are at least three levels at which parallelism can be exploited in the unsymmetric-
pattern multifrontal method.

The highest level is present in the independent blocks that can naturally appear in the
coefficient matrix once it has been permuted into an upper block triangular form. All
entries to the left of and below these blocks are zero, so the operations on such a block
are independent of the operations on the other blocks. These blocks manifest themselves as
distinct connected components of the elimination DAG.

Within these blocks, parallelism is also possible when factoring independent elemental ma-
trices. These typically manifest themselves as multiple vertices at the same topological level
within a connected component of the elimination DAG (however, additional parallelism is
possible at this level depending on the structure of the dependencies represented by the edges
of the elimination DAG). Assembly and numerical factoring of such elemental matrices can
be accomplished in parallel.

The lowest level of parallelism occurs within the actual numerical factoring of elemental ma-
trices using factoring subroutines tailored for particular parallel architectures. The assembly
process of data transferred along edges in the elimination DAG can also be done in parallel at
this level. Exploitation of this level of parallelism is critical for achieving the most significant
speedups.














Chapter 3



Unbounded Parallelism Models



The performance and complexity of a sparse matrix algorithm are closely tied to the sparsity,
structure, and numerical characteristics of a particular matrix. As a result, performance of
such algorithms is frequently measured by comparison to other algorithms on identical matri-
ces. This same approach will be taken with this model. Both sequential and parallel versions
of the unsymmetric-pattern multifrontal method will be modeled for specific matrices and
then the results compared.

An underlying multiple instruction, multiple data (MIMD) shared memory (SM) model
of computation is assumed, although several of the models are also realizable on single
instruction, multiple data (SIMD) models. The shared memory is initially assumed to be
concurrent read, concurrent write with concurrently written results summed (CRCW-SUM).
Later models will require only a concurrent read, exclusive write (CREW) shared memory.
Instruction counts are abstracted to only the floating point operations (flops) required for
the assembly and numerical factoring processes.

The analysis of the unbounded parallelism models is based on the representations of elimina-
tion DAGs built from traces produced by running the sequential version of the unsymmetric-
pattern, multifrontal method on real matrices. These elimination DAGs are analyzed and
then used as input for the five unbounded parallelism models. The resulting speed-up,
processor set, and utilization estimates are then analyzed.


3.1 Elimination DAG Analysis

A sound characterization of the elimination DAG is the cornerstone of the unbounded models
and will be fundamental to fully understanding the results of the bounded models. In order to
obtain this characterization, the elimination DAGs are topologically sorted and partitioned
into a data structure that identifies all the nodes in a given topological level. Node weights









are assigned in accordance with the model definitions defined below. Using these node
weights, the heaviest weighted path through the elimination DAG is determined for each
model. The weight of this path provides the lower bound on the parallel execution time.
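A minimal sketch of this heaviest-path computation, assuming the DAG has already been partitioned into topological levels (the names here are illustrative, not those of the analysis program):

    def heaviest_path_weight(levels, weight, sons):
        # levels: node lists by topological level (leaves first);
        # weight[v]: the model-dependent node weight;
        # sons[v]: the nodes whose contribution blocks are assembled into v.
        best = {}                  # heaviest weight of any path ending at v
        heaviest = 0
        for level in levels:
            for v in level:
                best[v] = weight[v] + max((best[s] for s in sons[v]), default=0)
                heaviest = max(heaviest, best[v])
        return heaviest            # a lower bound on parallel execution time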

Other characteristics of the elimination DAGs are also determined. In particular, the number
of nodes and edges, the depth, and the number of leaf nodes are determined. Since the
elimination DAGs tend to be significantly wider at their leaf level (Level 0), the next widest
level is also determined. The total weight of all nodes is divided by the number of nodes
to obtain an average node weight. The depth is divided by number of nodes to give a
depth ratio. Low values will correspond to DAGs that are the most significant course grain
parallelism. Various methods of exploiting the parallelisms are investigated with the different
models defined next.



3.2 Model Definitions


A model of the sequential version of the algorithm is provided as reference and will be used to
develop the speed up calculations. In this and all the subsequent models there are two levels.
The first (higher) level consists of the outer sum that counts processing steps for an entire
frontal matrix. This summation will either be across all nodes of the elimination DAG, which
represents a sequential processing of the frontals, or along the heaviest path, which represents
the critical path for the processing. The heaviest path summation is used to characterize
parallelism exploited between the nodes in the elimination DAG. With the assumption of an
unbounded number of processors, processing of nodes with dependencies not on the heaviest
path (critical path) can be accomplished during the critical path processing and not affect
the completion time.

The second level of the models describes the processing and parallelism of factoring a par-
ticular frontal matrix (i.e., a node in the elimination DAG). Each of these descriptions is
presented in four terms. The first term describes the processing needed to complete the
assembly of contribution blocks from sons in the elimination DAG. The second term char-
acterizes the processing needed for the selection of a pivot for numerical stability. The third
term describes the calculation of multipliers, which results in a column of the lower triangu-
lar matrix L. Finally, the fourth term defines the updating of the active submatrix which is
essentially the carrying out of the necessary row operations needed to reduce the matrix.

These models are very similar to an earlier effort by Duff and Johnsson [6] that focused on
the elimination trees produced by symmetric-pattern, multifrontal methods.

Each of the models uses the following terms:

* A_j: the number of matrix entries assembled into the frontal represented by node j.

* S_j: the number of sons assembling entries into node j.

* P_j: the number of pivots by which the frontal matrix at node j is reduced.

* c_j: the number of columns in the node j frontal.

* r_j: the number of rows in the node j frontal.


With these definitions, the model for the sequential version of the algorithm is defined as
follows:


\sum_{j=1}^{\text{all nodes}} \left[ A_j + \sum_{i=1}^{P_j} (r_j - i + 1) + \sum_{i=1}^{P_j} (r_j - i) + 2 \sum_{i=1}^{P_j} (r_j - i)(c_j - i) \right]
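For concreteness, the bracketed per-node term can be evaluated directly; a small sketch (illustrative only), with A_j, P_j, r_j, and c_j as defined above:

    def sequential_work(A_j, P_j, r_j, c_j):
        # Flop count for one frontal matrix, following the bracketed term:
        # assembly, then for each pivot the pivot search, the multiplier
        # (column of L) calculation, and the active submatrix update.
        work = A_j
        for i in range(1, P_j + 1):
            work += (r_j - i + 1)                 # pivot selection down the column
            work += (r_j - i)                     # multipliers (scaling)
            work += 2 * (r_j - i) * (c_j - i)     # rank-1 update
        return work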


Model 0: Concurrency Between Elements Only Model 0 depicts a version of the
algorithm that only exploits parallelism across nodes in the elimination DAG. Each frontal
matrix is factored sequentially. This model requires an underlying MIMD architecture. The
formula describing this model is:


\sum_{j=1}^{\text{heaviest path}} \left[ A_j + \sum_{i=1}^{P_j} (r_j - i + 1) + \sum_{i=1}^{P_j} (r_j - i) + 2 \sum_{i=1}^{P_j} (r_j - i)(c_j - i) \right]


Model 1: Block Concurrency Within Elements Only Model 1 assumes that nodes
of the elimination DAG are processed sequentially in an order that preserves the dependencies
expressed by the elimination DAG. Within each node, the frontal matrix is factored with a
block level of parallelism. Assembly, scaling, and updates to the active submatrix are done
for all rows in parallel moving sequentially through the columns. For the assembly of a row
that includes contributions from more than one son in the elimination DAG, the assumption
of a CRCW-SUM shared memory ensures correctness. In this model, selection of numerical
pivots is assumed to be sequential.

The method described by this model and the next lend themselves easily to a SIMD archi-
tecture.


\sum_{j=1}^{\text{all nodes}} \left[ c_j + \sum_{i=1}^{P_j} (r_j - i + 1) + P_j + 2 \sum_{i=1}^{P_j} (c_j - i) \right]


Model 2: Full Concurrency Within Elements Only Concurrency within a frontal
matrix is extended by Model 2. In this model, more processors are used and a maximal
amount of parallelism exploited. Assembly is done for each element in parallel. However,
in order to maintain consistency with the earlier work of Johnsson and Duff [6], a CREW









memory is used by this model and the assembly of contributions to an entry from multiple
sons is done via a parallel prefix computation using the associative operator of addition.

Numerical pivot selection is handled in this model with another parallel prefix computation
this time using the maximum function as the associative operator. Scaling and updating of
the active matrix are done with parallelism at the entry level which is easily realizable with
the CREW memory model. This model is also oriented to a SIMD architecture.


\sum_{j=1}^{\text{all nodes}} \left[ \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2 P_j \right]
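The logarithmic terms come from combining values pairwise in a fan-in tree, which takes ceil(log2 n) parallel steps for n values. A small sketch of such a reduction (illustrative only; on a real machine the pairs within each pass would be combined concurrently):

    def tree_reduce(values, op):
        # Each pass halves the list, so there are ceil(log2 n) passes; the
        # combinations within one pass are independent and can run in parallel.
        values = list(values)
        while len(values) > 1:
            values = [op(values[k], values[k + 1]) if k + 1 < len(values)
                      else values[k]
                      for k in range(0, len(values), 2)]
        return values[0]

Using max as the operator models the pivot search; using addition models the assembly of an entry that receives contributions from several sons.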


Model 3: Block Concurrency Within and Between Elements The block-oriented,
node level parallelism of Model 1 is augmented by parallelism across the nodes in Model 3.
This is accomplished by changing the outer summation to be over the heaviest path instead
of across all nodes. With this model, a MIMD architecture is required.


\sum_{j=1}^{\text{heaviest path}} \left[ c_j + \sum_{i=1}^{P_j} (r_j - i + 1) + P_j + 2 \sum_{i=1}^{P_j} (c_j - i) \right]


Model 4: Full Concurrency Within and Between Elements The suite of models
is completed by extending the full node level concurrency of Model 2 to include parallelism
across nodes in the elimination DAG. This is done in the same fashion as with Model 3 and
a MIMD architecture is likewise assumed.


\sum_{j=1}^{\text{heaviest path}} \left[ \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2 P_j \right]


3.3 Matrices Analyzed


A set of five matrices was selected for this study. They represent a cross section of the type
of elimination DAGs that are generated. The matrices are from the Harwell-Boeing set and
each matrix comes from a real problem [5]. The table below briefly describes these matrices:


MATRIX      ORDER   DESCRIPTION

mahindasb   1258    economic modelling
gemat11     4929    optimal power flow problem
gre_1107    1107    computer system simulation
lns_3937    3937    compressible fluid flow
sherman5    3312    oil reservoir, implicit black oil model













3.4 Results


Elimination DAG Analysis The first objective of this part of the effort was to charac-
terize the elimination DAGs of the test matrices. Table 3.1 below provides these character-
izations that were produced with the analysis program written for the effort. The DEPTH
entries are the number levels in the elimination DAG as determined by a topological sort.
The width of the DAG is presented in two entries. LEVEL 0 refers to the leaf nodes of the
elimination DAG and WIDTH gives the widest level other than the leaf level (Level 0). It
is important to remember that the elimination DAGs are typically NOT a single connected
component and that the many small, distinct connected components show up at the lowest
levels of the DAG's topological order.


MATRIX            mahindasb   gemat11   gre_1107   lns_3937   sherman5

# NODES                   ?      1015        511       1636        664
# EDGES                   ?      1219       1366       3724       1496
DEPTH                    68         ?         40        296         46
LEVEL 0                   ?       530        233        565        359
WIDTH                    55       189        106        248         99
AVERAGE WEIGHT        1,887     1,195      9,998     67,?01     42,093
DEPTH RATIO            0.29      0.02       0.08       0.18       0.07

Table 3.1: Elimination DAG Characterizations


The statistics presented above reveal that the elimination DAGs tend to be short and wide
(which is the desired shape for parallelism). However, they do not provide the full picture.
The true nature of the elimination DAGs was better revealed with an abstraction of the DAG










provided by the analysis program. This abstraction illustrates the topologically ordered
elimination DAG with an example provided for the gemat11 matrix in Figure 3.1. For each
level the number of elements in that level is provided as well as the minimum, average, and
maximum node weights. (The node weights are expressed as common logs to save space
and since their magnitude is all that is of interest). These abstractions revealed that the
elimination DAGs were very bottom heavy in terms of the number of nodes but that the
average node size tended to increase dramatically in the upper levels. The elimination DAG
for the gemat11 matrix had the nicest shape for parallelism (as suggested by its depth ratio).
The other elimination DAGs were taller and relatively narrower at their base.



L(Level): Count (Min, Avg, Max)

L25:    1   (5.3, 5.3, 5.3)
L24:    1   (4.8, 4.8, 4.8)
L23:    1   (3.7, 3.7, 3.7)
L22:    1   (3.7, 3.7, 3.7)
L21:    1   (3.7, 3.7, 3.7)
L20:    1   (4.7, 4.7, 4.7)
L19:    1   (3.6, 3.6, 3.6)
L18:    2   (3.6, 4.1, 4.3)
L17:    2   (3.4, 3.8, 4.0)
L16:    2   (3.3, 4.0, 4.2)
L15:    2   (3.0, 4.0, 4.2)
L14:    2   (3.0, 3.4, 3.7)
L13:    3   (3.0, 3.4, 3.6)
L12:    4   (2.6, 3.7, 4.2)
L11:    6   (3.0, 3.5, 3.9)
L10:    4   (2.1, 3.9, 4.5)
L9:     4   (3.0, 3.3, 3.7)
L8:     7   (2.6, 3.5, 3.9)
L7:    10   (2.6, 3.4, 3.9)
L6:    16   (2.5, 3.6, 4.5)
L5:    24   (2.2, 3.5, 4.2)
L4:    35   (2.2, 3.2, 4.2)
L3:    58   (1.6, 3.0, 3.8)
L2:   108   (1.6, 2.9, 3.6)
L1:   189   (1.6, 2.8, 3.5)
L0:   530   (0.7, 2.5, 3.3)

Figure 3.1: Elimination DAG for gemat11









Speed Ups The most interesting result of the unbounded models is the speed ups ob-
tained via parallelism (as compared to the sequential version of the method). These results
are presented in Table 3.2.


MATRIX     mahindasb   gemat11   gre_1107   lns_3937   sherman5

MODEL 0         1.17      2.88       1.26       1.30       1.74
MODEL 1        10.50      9.31      41.21     103.77      78.36
MODEL 2       114.25     39.35     575.21    3490.76    1877.05
MODEL 3        15.96     68.79      70.15     167.94     161.??
MODEL 4       237.74    743.80   1,847.68     11,?40   ?,???.42

Table 3.2: Speed Up Results

The speed ups for Model 0 are disappointing but not unexpected after what was seen in
the previous section on the shape of the elimination DAGs. However, significant speed ups
were seen as parallelism was introduced within the nodes (Models 1 and 2) and within and
between nodes (Models 3 and 4). It is interesting to note that a significant synergism is
taking place when parallelism is utilized BOTH within and between nodes. One's intuition
might suggest that both levels of parallelism would produce a resulting speed up that is close
to the product of the speed ups achieved by using levels of parallelism separately. However,
the speed ups obtained are actually much better than this expectation. This is consistent
with the results of the earlier study by Duff and Johnsson [6].

Another interesting, but expected, observation is that the speed ups for models 0, 3, and
4 correlate strongly with the depth ratio presented earlier. Likewise, models 1, 2, 3, and 4
correlate strongly with average node weight. The underlying principle is that concurrency
within the node is most exploitable by larger nodes and that concurrency across nodes is
most exploitable with short and wide elimination DAG structures.


Processor Sets and Utilization Also analyzed in the unbounded models were an
upper bound on the number of processors that could be employed on a given matrix and
the associated processor utilization. In particular, Model 0 assumed that any frontal matrix
would have only one processor allocated to it. For the block level concurrency of Models
1 and 3, a frontal matrix would have a processor set equal to the number of rows in the
frontal. The full concurrency of Models 2 and 4 would use processor sets equal to the row










size times column size of the frontal. The maximum processor usage would then be the
largest processor set for any frontal in Models 1 and 2 and the largest sum of processors for
any one level in the elimination DAG for Models 0, 3, and 4. This strategy results from the
use of concurrency across nodes in Models 0, 3, and 4.


Utilizations were then calculated by dividing total sequential time by the product of total
parallel time times the number of processors used. The results by matrix and model are
provided in Table 3.3. The top value in each entry is the size of the processor set and the
lower value is the utilization.

MATRIX     mahindasb   gemat11   gre_1107   lns_3937   sherman5

MODEL 0            ?       530        233        565        359
               1.??%     0.54%      0.54%      0.2?%      0.4?%

MODEL 1           52        68        126        258        189
              20.1?%    13.71%     32.71%     40.22%     41.1?%

MODEL 2            ?      4624      19782     107874      47060
                   ?     0.8?%      2.91%      3.24%          ?

MODEL 3          304      3788       1161       2961       3827
               5.2?%     1.?2%      6.04%      5.1?%      4.??%

MODEL 4         4161     30127      19782    179,???          ?
               5.71%     2.47%      9.34%      6.??%          ?

Table 3.3: Processor Set and Utilization Results

In all cases the sizes of the processor sets seem quite high and the utilizations quite low, but
these estimates are very crude upper bounds and it is likely that significantly better results
are realizable.



3.5 Conclusions and Observations


A significant amount of parallelism is achievable with the unsymmetric-pattern multifrontal
method. However, very little of the parallelism comes directly from the elimination DAG as
is evidenced from the Model 0 results. Yet a synergism occurs when the parallelism from
the elimination DAG is combined with the concurrency available within the factoring of









particular frontal matrices. These resulting speed ups, as seen in Models 3 and 4, are very
promising.

Looking beyond the speed ups, the large processor sets and low utilizations are of both
concern and interest. In particular, the very "bottom heavy" elimination DAGs suggest that
better distributions of the processing may be possible. Such processing distributions could
be achieved by appropriate task scheduling on bounded sets of processors. The next chapter
explores these possibilities using trace-driven simulation techniques.














Chapter 4



Bounded Parallelism Models



The results of the unbounded parallelism models illustrate that significant speed-ups are
possible with parallel implementations of the unsymmetric-pattern, multifrontal method.
However, the efficient use of processor sets and the speed ups achievable on smaller pro-
cessor sets are still open questions. This chapter addresses these issues using the results of
trace-driven simulations run on the five models defined in the previous chapter. With each
model sixteen different processor sets and three different scheduling schemes were used. The
processor sets were powers of 2 from 21 to 216. The scheduling schemes corresponded to
orderings of the work queue and are described in a subsequent section.

The results of the bounded parallelism models will indicate that the speed ups seen in the
unbounded models are achievable with reasonable-size processor sets and with significant
improvements in efficiency. However, the initial versions of these bounded models will also
demonstrate the critical importance of processor allocation strategies and task definitions.
As a result, several revisions will be made to the initial models.


4.1 Initial Models

The initial bounded parallelism models follow directly from the unbounded parallelism mod-
els. The critical difference is that a limited processor set is assumed so tasks that are ready
for execution will have to wait for available processors. This can be appropriately mod-
elled via a trace-driven simulation run against the representation of the elimination DAGs
obtained from the traces produced by the sequential version of the unsymmetric-pattern,
multifrontal method. Critical issues with the simulation will be how to allocate processors
when the number available is less than that required and how to order the work that is ready
to be done.









4.1.1 Trace-Driven Simulation


The trace-driven simulations used to implement the bounded parallelism models were ac-
complished with a custom Pascal program that was built as an extension to the analysis
program used for the unbounded parallelism models. The simulation uses the topologically
sorted representation of the elimination DAG as input. All the nodes (frontal matrices)
on the leaf level (Level 0) are initially put into a work queue. Work is scheduled from this
queue based on processor availability. The amount of work and required processors for each
node is determined based on the specific model being simulated. These models are initially
those defined for the unbounded parallelism models. When the model calls for a sequential
ordering of the nodes, only one node may be executing at any particular time, otherwise
multiple nodes may execute concurrently based on processor availability and the allocation
strategy in use. Upon completion of a node's execution, the edges to dependent nodes are
removed. When all such edges to a particular node have been removed, that node is put
into the work queue as it is available for execution. The simulation continues until all nodes
have completed execution.
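The following sketch condenses this event loop (the actual simulator is a Pascal program; the structure and names below are illustrative assumptions, and the stretching of a task's time when it is granted fewer processors than requested is omitted for brevity):

    import heapq

    def simulate(levels, parents, pending_sons, task_time, procs_needed, P):
        # Node ids are assumed to be integers so heap tuples compare cleanly.
        queue = list(levels[0])           # leaf frontals are ready at time 0
        running = []                      # min-heap of (finish_time, procs, node)
        now, free, busy_area = 0, P, 0
        while queue or running:
            i = 0                         # schedule everything that fits (FIFO scan)
            while i < len(queue):
                need = min(procs_needed[queue[i]], P)
                if need <= free:
                    v = queue.pop(i)
                    free -= need
                    busy_area += task_time[v] * need
                    heapq.heappush(running, (now + task_time[v], need, v))
                else:
                    i += 1
            now, need, v = heapq.heappop(running)   # advance to next completion
            free += need
            for p in parents[v]:          # remove satisfied elimination DAG edges
                pending_sons[p] -= 1
                if pending_sons[p] == 0:
                    queue.append(p)       # all sons assembled: p is ready
        return now, busy_area             # makespan and processor-time consumed

With these outputs, speed up is the total sequential time divided by the returned makespan, and utilization is busy_area divided by the makespan times P, matching the formulas below.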

Speed up and processor utilization are then calculated per the following formulas:

\text{SpeedUp} = \frac{\text{Time}_{\text{sequential}}}{\text{Time}_{\text{parallel}}}

\text{Utilization} = \frac{\sum_{\text{all nodes } j} \text{Time}_j \times \text{ProcessorsUsed}_j}{\text{Time}_{\text{finished}} \times \text{ProcessorSetSize}}

The average number of searches through elements in the work queue is also determined for
each simulation run.

Time history reports can be produced that track the number of processors in use and the
size of the work queue across time.

Output from the simulation program can be in either report format or a Matlab m-file
format. The latter format is used to port the results to Matlab for plotting.


4.1.2 Processor Allocations

Allocation of processors to the assembly and factoring of a frontal matrix is a critical issue
for the realization of the algorithm on a bounded set of processors. As will be shown in the
results of the initial models, the allocation scheme can greatly affect the performance of the
algorithm. Since the five unbounded models use three different approaches to factoring a
particular frontal, there are three distinct allocation schemes.









The allocation scheme required by Model 0 is trivial since each frontal matrix in this model
is allocated only a single processor.

Models 1 and 3 use a block concurrency within a frontal matrix. As this block concurrency
is row-oriented, a processor set equal to the row size is required. The difficulty arises when
there are processor sets available but they are smaller than that required (in fact the total
available number of processors is frequently less than the row size). In this eventuality the
total amount of work is evenly distributed across the available processors. This is a fairly
crude allocation scheme but is not too unrealistic given the nature of the processing within
a frontal.

To formally define this allocation scheme, we define the work to be done within a frontal j as
before and shown below (recall the definitions: P_j, the number of pivots of node j; c_j, the
number of columns of node j; and r_j, the number of rows of node j):

\text{work} = c_j + \sum_{i=1}^{P_j} (r_j - i + 1) + P_j + 2 \sum_{i=1}^{P_j} (c_j - i)

With this definition of work we define the allocation scheme as:



if processors_available >= row_size then

    Schedule work on (row_size) processors

else

    Schedule (work * row_size) / processors_available
        on the processors_available

endif



Models 2 and 4 use full concurrency within each frontal matrix and require a number of
processors that is equal to the number of entries in the frontal (row_size times col_size). A
more sophisticated allocation scheme is thus required. Prior to formally defining this scheme,
we recall the definition of the time complexity for a frontal using this model:

\text{work} = \lceil \log_2 S_j \rceil + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2 P_j

Using this definition and a similar assumption on the ability to distribute work on smaller
processor sets, the allocation scheme is formally defined as follows:










if processors_available >= (col_size * row_size) then

    Schedule work on (col_size * row_size) processors

else

    if processors_available >= row_size then

        Schedule (work + 3 * pivots * col_size) on
            (row_size) processors

    else

        if processors_total >= row_size then

            Wait for more available processors

        else

            Schedule (work * col_size * row_size) /
                processors_available
                on processors_available

        endif
    endif
endif



Notice that the deepest nested "else" block provides a fairly crude over-approximation of
the amount of work to do. However, this case is only used for the small total processor set
sizes which are not the target processor sets for the parallelism of Models 2 and 4. Thus,
this over-estimate is reasonable.



4.1.3 Scheduling Methods


There are three scheduling methods employed by the work queue. These methods corre-
spond to how the work ready to be executed is ordered within the work queue. The three
corresponding orderings are first-in, first-out (FIFO); largest calculation first; and heaviest
path first.

The FIFO method is self-explanatory and included because of its implementation simplicity









and efficiency.


The largest calculation first order uses the node weight estimate of the work to be done and
schedules the largest such nodes first. This method is designed to approximate the next
method that is heaviest path first. The largest calculation first requires a priority queue
implementation based on a set of priorities that is quite variable. As such there would be
some significant penalties to address in implementing this method.

The last method, heaviest path first, uses the heaviest path determined by the analysis
portion of the software. Any node on this path would be the next scheduled. Other nodes
are put in a FIFO ordering. Notice that since the heaviest path is essentially a path through
the dependency chart, only one node at most from the heaviest path will ever be in the work
queue at any one time.
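A compact sketch of the three orderings (illustrative only; each task is assumed to carry a precomputed weight and an on_heaviest_path flag supplied by the DAG analysis):

    from collections import deque
    import heapq

    class WorkQueue:
        def __init__(self, policy):
            self.policy = policy
            self.q = [] if policy == "largest_first" else deque()

        def push(self, task):
            if self.policy == "largest_first":
                # max-priority queue keyed on node weight; id() breaks ties
                heapq.heappush(self.q, (-task.weight, id(task), task))
            else:
                self.q.append(task)               # FIFO arrival order

        def pop(self):
            if self.policy == "largest_first":
                return heapq.heappop(self.q)[-1]
            if self.policy == "heaviest_path_first":
                for i, task in enumerate(self.q): # at most one such task queued
                    if task.on_heaviest_path:
                        del self.q[i]
                        return task
            return self.q.popleft()               # FIFO otherwise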


4.1.4 Results

The results of the simulations using the initial models were analyzed using both the report
format listings and graphs produced from the output files using Matlab. The results of this
analysis are summarized by model. These initial models will be referred to as the baseline
models from which further revisions will be made in the next section.

Whenever the results for all five matrices are presented, the following line types will be used
to represent the different matrices:

------- mahindasb
-*-*-*- gemat11
-+-+-+- gre_1107
-0-0-0- lns_3937
- - sherman5


All the results presented in this section are based on the heaviest path first scheduling unless
otherwise mentioned.


Model 0: Concurrency Between Elements Only Model 0 takes advantage of con-
currency only between frontal matrices. The speed up and utilization results from Model 0
are shown in Figure 4.1. These indicate that the speed ups seen in the unbounded models are
achievable with significantly fewer processors and corresponding higher utilizations. Table
4.1 compares the number of processors used in the bounded versus the unbounded models
to achieve maximum speed up.

There were no significant differences produced by the scheduling methods for this model.









Model 1: Block Concurrency Within Elements Only Model 1 only takes advan-
tage of concurrency within frontal matrices and does so in a block-oriented fashion. The
speed up and utilization results are shown in Figure 4.2. The method employed by this
model seems well suited for processor sets up to about 256 processors. After that point there
is no further advantage to additional processors for any of the test matrices.

The various scheduling methods showed no significant differences for this model.


Model 2: Full Concurrency Within Elements Only The block concurrency of
Model 1 is extended to full concurrency in Model 2 but is still restricted to concurrency only
within the frontal. The speed up and utilization results for Model 2 are shown in Figure
4.3. A very definite step behavior is evident in the speed up curves. This behavior directly
correlates to the allocation scheme in use that bases allocation on the row size and column
size of the frontal matrices. While the speed ups are significant, the step behavior is severely
limits scaleability. Later revisions to this model will resolve this problem.

The different scheduling methods showed no significant differences for this model.


Model 3: Block Concurrency Within and Between Elements Model 3 combines
the block concurrency within a frontal of Model 1 with the concurrency between frontals
used in Model 0. The speed up and utilization results for this model are shown in Figure
4.4.

There is a definite irregular behavior evident in the speed up and utilization curves. Analysis
of these irregularities determined that the allocation scheme employed was causing large
frontal matrices to be scheduled on a relatively small number of processors. This could
occur if two frontals were in the work queue and the first one required 90% of the processors.
Once the first frontal got its processors, the second would be scheduled on the remaining.
This would significantly stretch out the time it would need to complete. The first frontal
could complete and release its processors but all subsequent frontals could not be scheduled
since they are dependent on the completion of the second frontal. The result is that 90% of
the processors go unused for a significant period of time. This scenario is illustrated with a
time history of utilization for the sherman5 matrix using 128 processors provided in Figure
4.5. The problems with processor allocation will be addressed in two subsequent revisions
to this model in the next section.

There were some significant differences in the scheduling methods used but these were side
effects of the allocation problems. Scheduling methods will be addressed again after the
allocation problems are resolved with the subsequent revisions.









Model 4: Full Concurrency Within and Between Elements The fifth and final
model uses concurrency between nodes and full concurrency within nodes. The speed up
and utilization results for this model are presented in Figure 4.6.

A step-like speed up and corresponding utilization irregularities are evident and very similar
to the results for Model 2. These results are also traced to the allocation method in use
and will be addressed in a subsequent revision to the allocation scheme presented in the next
section.

Comparison of these results against the unbounded Model 4 reveals that the maximum speed
up is not obtainable with the processor sets tested and the current definition of Model 4.




































Figure 4.1: Model 0 Baseline Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)


MATRIX      BOUNDED MODEL   UNBOUNDED MODEL

mahindasb         ?               ?
gemat11           ?               ?
gre_1107          ?               ?
lns_3937          ?               ?
sherman5          ?               ?

Table 4.1: Processors Used Comparison







































Figure 4.2: Model 1 Baseline Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)


Figure 4.3: Model 2 Baseline Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)



























Figure 4.4: Model 3 Baseline Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)

Figure 4.5: Utilization Time History for sherman5 with P=128. (x-axis: Time x10^5.)


Figure 4.6: Model 4 Baseline Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)












4.1.5 Conclusions and Observations


Several conclusions and observations are apparent from these initial, baseline bounded parallelism models.

* There is an excellent potential for parallelism under both SIMD and MIMD implementations.

* The speed ups indicated in the unbounded models are achievable with practical size processor sets.

* The benefits of the different models are realized within distinct ranges of the number of processors.

* Model 3 exhibits irregular behavior due to inefficient allocation of large critical path elements to small processor sets. Similar, but less severe, results were seen with Model 4.

* Scheduling method has no significant effect on performance for Models 0 and 1. Models 2 and 4 also exhibit consistent behavior for the different scheduling methods; however, a more detailed analysis will be done after these models have been refined. The effect of scheduling on performance for Model 3 is obscured by the irregular behavior of that model. Analysis of scheduling for this model will also be postponed until the model has been refined.

* The performance under Models 2 and 4 (those using full concurrency within the elements) displays a definite step-like behavior that correlates strongly with the allocation strategy in use.

* The speed ups obtained by Model 4 were well below the maximum speed ups predicted by the unbounded version of the model for the larger matrices.



4.2 Refined Models


While the baseline models provided some significant insights into the potential of the unsymmetric-
pattern, multifrontal method, they also illustrated several undesired effects that can result
if care is not taken in the allocation of processors and the definitions of tasks. As a result
the allocation method used by Model 3 is revised in two distinct manners to produce a more
regular behavior. The allocation and task definitions for Models 2 and 4 are revised to ad-
dress the step-like performance curves obtained from these models. Finally, the concept of









pipelining is applied to the assembly process to further improve the performance of Model
4.


4.2.1 Model 3 Revisions

The first version of Model 3 (block concurrency within and between elements) experienced
irregular behavior that was traced to inefficient allocation of processors. In particular, when
most of the processors were already in use, the allocation algorithm would assign the next
frontal to the smaller set of remaining processors. Since this next element could be a large
frontal, it would take a long time on the small set of processors. Furthermore, all subsequent
elements could (in fact, are very likely to) be dependent on this large frontal. Thus, when
other processors are freed up, no other frontals can be factored since they are dependent
on this large frontal. This behavior was verified with the time history utilization graphs
presented earlier.


Model 3 Revision 1

The idea behind the first approach to this problem is to only assign processors to a task
if such an assignment will result in the task being completed at least as soon as if it were
postponed until more processors are available. This method requires an accurate prediction
of the number of processors to next become available. Such predictions are possible for this
algorithm given a centralized scheduler and the predictability of the workload associated
with each frontal. The second revision to this model will not require such predictions and is
thus well suited for more general application.


New Allocation Scheme The new allocation scheme is formally defined in the following
algorithm. Notice that the case of the full compliment of processors available is implicitly
handled.



Set pointer to beginning of the Work Queue

while (processors are available) and
      (more Work Queue entries to check) do
begin

    Calculate when the next entry on the Work Queue would
        finish if allocated processors from those
        currently available -> Scheduled_Now

    Calculate when the next entry on the Work Queue would
        finish if allocation is delayed until more
        processors are freed up -> Scheduled_Later

    if (Scheduled_Now <= Scheduled_Later) then

        Schedule the entry now using
            min(required, available) processors

    Advance Work Queue pointer

end; {while}
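The two completion-time estimates can be computed directly from the predictable workloads; a sketch of the decision rule (illustrative names), assuming work spreads evenly over the processors actually granted, as in the block allocation scheme above:

    def schedule_now(now, work, procs_free, procs_required,
                     next_release_time, procs_free_after_release):
        # Revision 1 rule: run the task on the currently free processors
        # only if doing so finishes no later than waiting for the next
        # release of processors.
        grant_now = min(procs_free, procs_required)
        grant_later = min(procs_free_after_release, procs_required)
        finish_now = now + work / max(1, grant_now)
        finish_later = next_release_time + work / max(1, grant_later)
        return finish_now <= finish_later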




Speed Up and Utilization Results The results of this new allocation scheme are
presented in Figure 4.7. The new allocation has resolved the irregularities and provides a
very nicely scalable method up through processor sets of 512.


Figure 4.7: Model 3 Revision 1 Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)


Time History Utilization Results Additional verification that the problem with the
Model 3 baseline has been resolved is seen in the utilization time history of the sherman5 ma-
trix using 128 processors. A comparison of the baseline and revision 1 utilization time history
results is provided in Figure 4.8. The large gaps of low utilization have been eliminated.











Required Work Queue Searches A concern with the new allocation scheme is that the
scheduling/allocation process may now take much longer since multiple work queue entries
(up to the entire work queue) may have to be checked on each scheduling opportunity. For
this reason the average number of work queue entries checked was analyzed for each of the
three scheduling methods. Figure 4.9 illustrates this comparison for the lns_3937 matrix.
The solid line depicts the FIFO method. The starred (*) line shows the largest calculation
first method. The dashed line shows heaviest path first. While the largest calculation first
method had the highest number of searches for all five matrices, its performance relative
to the other methods varied with processor set size and by the matrix. Thus, while the
other two methods appeared to be better in general, their dominance was not consistent.
An additional scheduling alternative that was not tested would be smallest calculation first.
Since this method would put the entries most likely to be scheduleable on limited processors
up in the front of the work queue, the number of work queue searches would likely drop.


Scheduling Differences With the ill effects of the initial processor allocation scheme
resolved, a serious look at the scheduling method influence can be taken. As a result of this
look, I found the effect of scheduling on performance was minimal. The greatest difference
was seen for the gemat11 matrix. These results are seen in Figure 4.10. The solid line rep-
resents the FIFO method, the starred (*) line is largest calculation first, and the dashed line
is the heaviest path first method. The heaviest path first method did offer some minimally
better performance for processor sets in the range 256 to 2048.































Figure 4.8: Utilization Time Histories (Before and After Revision 1). (a) Original; (b) Revision 1. (x-axes: Time.)






Figure 4.9: Average Number of Work Queue Searches: lns_3937. (x-axis: Log2 of Number of Processors.)

















































Figure 4.10: Model 3 Rev 1 Schedulings: gemat11. (x-axis: Log2 of Number of Processors.)












Model 3 Revision 2


The second revision to the Model 3 allocation scheme was developed as an alternative to the
overhead and limitations imposed by the need to make predictions that was inherent in the
first revision. In particular, this second scheme always allocates processors to the next task
in the work queue if nothing is currently executing. If, however, only a subset of the total
number of processors is available, the next task in the work queue will be scheduled only if
its entire number of required processors is available. While this scheme should not produce
results that are quite as good as the first revision, it also does not require the additional
overhead. Furthermore, the method could use a work queue that is organized in some type
of tree structure ordered by processor requirement. With such a work queue, searching for
next tasks could be reduced from linear to logarithmic time complexity.
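A sketch of such a queue using binary search over a list kept sorted by processor requirement (illustrative only; a balanced tree would make the insertions logarithmic as well):

    import bisect

    class RequirementQueue:
        def __init__(self):
            self.reqs, self.tasks = [], []

        def push(self, task, required):
            i = bisect.bisect(self.reqs, required)
            self.reqs.insert(i, required)
            self.tasks.insert(i, task)

        def pop_fitting(self, procs_available):
            # rightmost entry whose requirement fits the available processors
            i = bisect.bisect_right(self.reqs, procs_available) - 1
            if i < 0:
                return None
            self.reqs.pop(i)
            return self.tasks.pop(i)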


Revision 2 Allocation Scheme A more formal definition of this allocation scheme is
presented in the following algorithm:


Set Work Queue pointer to start of Work Queue

if processors_available = processors_total then

    Schedule next Work Queue entry

    Advance Work Queue pointer

endif

while (more Work Queue entries to check) and
      (processors_available > 0) do

    if processors_available >= processors_required then

        Schedule next Work Queue entry

    endif

    Advance Work Queue pointer

endwhile











Speed Up and Utilization Results The speed up and utilization results produced
by this second revision are provided in Figure 4.11. While not quite as good as the results
of the first revision, they are very promising. Speed ups increase nicely with processor size
and the utilization curves are very consistent with only one minor irregularity. The greatest
difference in speed ups was seen for the sherman5 matrix. This difference is shown in Figure
4.12 with the dashed line representing the second revision and the solid line the first revision.




Figure 4.11: Model 3 Revision 2 Results. (a) Speed Up Results; (b) Utilization Results. (x-axes: Log2 of Number of Processors.)



Time History Utilization Results A comparison of the utilization time histories
of Model 3 Revisions 1 and 2 is provided for the sherman5 matrix in Figure 4.13. The
time histories use a processor set of 128 (as before) and illustrate that the second revision
maintains a very nice processor utilization.



Scheduling Differences Scheduling methods differences for this second revision were
very comparable to that of the first revision.































Figure 4.12: Model 3 Speed Ups (Rev 1 vs Rev 2): sherman5. (x-axis: Log2 of Number of Processors.)

Figure 4.13: Utilization Time Histories (Rev 1 vs Rev 2): sherman5 (P=128). (a) Revision 1; (b) Revision 2. (x-axes: Time.)












4.2.2 Models 2 and 4: Vertical Partitioning


In order to smooth out the step-like performance curves of Models 2 and 4, an allocation
scheme is needed to make more efficient use of the parallelism available. The approach taken
is to combine the allocation concepts of the first revision to Model 3 with a finer grain of task
definition. I call the approach vertical partitioning since it will partition the work done for a
frontal into sets of tasks done sequentially. Each task in turn will be composed of subtasks
that can be accomplished in parallel. This approach should provide the flexibility needed to
make more efficient use of the parallelism available.

Specifically, the factoring of a frontal matrix will be altered to process assemblies one son
at a time and to divide up the numerical pivoting, scaling, and active matrix updates.
The new assembly processing changes the model definition. The new definition requires
more sequential time, since sons are not assembled concurrently but does not require the
additional memory of the earlier approach. The new definition of processing within a frontal
for Models 2 and 4 is given by the following formula:


S_j + \sum_{i=1}^{P_j} \lceil \log_2 (r_j - i + 1) \rceil + P_j + 2 P_j

With vertical partitioning each son is assembled separately using a processor set equal to
the number of entries contributed by the son. Each such assembly becomes a separate task
for scheduling. Once all the sons are assembled, the factorization commences for one pivot
at a time. First the numerical pivoting is scheduled as a separate task requiring a number
of processors equal to the row size of the active submatrix of the frontal. Numerical pivot-
ing requires logarithmic time to complete the parallel prefix operation using the maximum
associative operator. Upon determination of the numerical pivoting, scaling and updating
are scheduled sequentially each as a set of parallel tasks. The scaling requires a processor
set proportional to the row size of the active submatrix. The update will use a processor
set equal to the entire size of the active submatrix. This sequence of numerical pivoting,
scaling, and updating is repeated for each pivot with which the frontal is reduced.

If ever there are insufficient processors available for a particular task group, a strategy similar
to that used by Revision 1 of Model 3 is used. That is, the completion times of scheduling
the work now on the available processors versus that of waiting for more processors to be
freed up are compared. If immediate scheduling will not delay completion, the tasks are
scheduled, otherwise the scheduling is postponed.

The basic algorithm for vertical partitioning is shown below without the logic for insufficient
processor availability.










for each son to be assembled do

    Assemble the contributions from this son in
        parallel using one processor per entry

endfor

for each pivot (i = 1 to p) for this frontal do

    Determine the maximum valued entry in the first
        column of the active submatrix using
        (row_size - i + 1) processors

    Do the scaling (multiplier calculation) using
        (row_size - i) processors

    Update the active submatrix using
        (row_size - i) * (col_size - i) processors

endfor
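A sketch that expands one frontal into this task sequence, recording the parallel time steps and processor requirement of each task group (names are illustrative):

    def ceil_log2(m):
        # ceil(log2 m) for m >= 1
        return (m - 1).bit_length()

    def vertical_tasks(sons, contrib, pivots, row_size, col_size):
        # Each tuple records (description, parallel time steps,
        # processors required) for one separately schedulable task group.
        tasks = [("assemble son %s" % s, 1, contrib[s]) for s in sons]
        for i in range(1, pivots + 1):
            tasks.append(("pivot search", ceil_log2(row_size - i + 1),
                          row_size - i + 1))          # parallel-prefix maximum
            tasks.append(("scale", 1, row_size - i))  # multiplier calculation
            tasks.append(("update", 2,
                          (row_size - i) * (col_size - i)))
        return tasks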




Model 2 Vertical Partitioning The improvements due to vertical partitioning are quite
dramatic. Figure 4.14 below illustrates the speed up improvements of the original Model 2
versus the vertical partitioning version. Recall that Model 2 uses concurrency within frontals
only. Not only have the curves smoothed out, but they also offer much higher speed ups across the
entire range of the processor set sizes (notice the vertical scales).

Likewise the corresponding utilizations are significantly improved as is illustrated by the
comparison provided in Figure 4.15.


Model 4 Vertical Partitioning The same vertical partitioning changes were applied
to Model 4, which exploits parallelism both within and across nodes in the elimination DAG.
Even more dramatic improvements were seen with this revision to Model 4. The speed up
comparison is shown in Figure 4.16. These curves illustrate that the maximum speed ups
are now achievable with the processor sets tested for all but one of the matrices (lns_3937
has a speed up of about 11,000 using 65,536 processors, where its unbounded Model 4 speed
up was slightly higher, using a processor set of over 179,000).

The corresponding utilization improvements of using vertical partitioning with Model 4 are
shown in Figure 4.17. Notice that the use of concurrency across nodes has significantly
improved utilization also.


Average Work Queue Searches The scheduling method (the ordering of the work queue)
with Model 4 vertical partitioning had no real effect on speed ups or utilizations, but it
did affect the average number of work queue searches for the smaller processor sets.
Figure 4.18 shows the average work queue searches for gre_1107. The largest calculation
first method (starred line) required significantly more searches than FIFO (solid line)
or heaviest path first (dashed line).
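The three orderings amount to different priority keys on a common work queue; one way to
realize them (Python, with hypothetical task fields, using the insertion sequence number
as a FIFO tiebreaker) is:

    import heapq

    def push(queue, task, seq, policy):
        # task: dict with assumed fields "calc_size" and "path_weight"
        if policy == "fifo":
            key = seq                       # first in, first out
        elif policy == "largest_calc":
            key = -task["calc_size"]        # largest calculation first
        else:
            key = -task["path_weight"]      # heaviest path first
        heapq.heappush(queue, (key, seq, task))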


Utilization Time History A utilization time history comparison reveals much of the
dynamics of Model 4. Figure 4.19 compares utilization time histories for gre_1107 using
16,384 processors on both the original and vertical partitioning versions of Model 4.
Subfigures (a) and (b) offer the comparison with a common horizontal scale. Subfigure (c)
provides a more detailed look at processor usage with vertical partitioning. This figure
reveals much about the nature of the elimination DAG, with the presence of the larger
frontals quite evident in the large quadratically decreasing regions.


























Figure 4.14: Model 2 Vertical Partitioning Speed Up Comparisons [two panels, speed up versus Log2 of Number of Processors: (a) Original, (b) Vertical Partitioning]


Figure 4.15: Model 2 Vertical Partitioning Utilization Comparisons [two panels, utilization versus Log2 of Number of Processors: (a) Original, (b) Vertical Partitioning]






















Figure 4.16: Model 4 Vertical Partitioning Speed Up Comparisons [two panels, speed up versus Log2 of Number of Processors: (a) Original, (b) Vertical Partitioning]


Figure 4.17: Model 4 Vertical Partitioning Utilization Comparisons [two panels, utilization versus Log2 of Number of Processors: (a) Original, (b) Vertical Partitioning]












































Figure 4.18: Model 4 Average Work Queue Searches: gre_1107 [average searches versus Log2 of Number of Processors]



































Figure 4.19: Vertical Partitioning Utilization Time History: gre_1107 [three panels, utilization versus time: (a) Original, (b) Vertical Partitioning, (c) Vertical Partitioning (Spread Out)]












4.2.3 Model 4: Assembly Pipelining


An additional modification was made to the vertical partitioning version of Model 4. This
modification is based on a concept proposed by Professor Ravi Varadarajan to pipeline the
assembly process. In particular, the assembly process is initiated by the sons (the
contributors) instead of by the fathers (the receivers). As the factorization of a frontal completes, it
immediately schedules the assembly of its contributions to other frontals. Once a subsequent
frontal has received all the contributions from its sons, it begins its own factorization process.

This approach to assembly pipelining is fairly conservative since the assembly pipeline does
not begin until factorization is complete. More aggressive assembly pipeline strategies are
possible with closer interactions with the factorization process.
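A minimal sketch of the son-initiated scheme (Python; the DAG adjacency and the schedule_*
hooks are hypothetical stand-ins for the simulator's structures) is a dependence counter
per frontal:

    def make_pipeline(sons, fathers, schedule_assembly, schedule_factorization):
        # sons/fathers: elimination DAG adjacency (dicts of lists);
        # schedule_*: scheduling hooks supplied by the caller
        remaining = {f: len(s) for f, s in sons.items()}  # contributions awaited

        def on_factorization_complete(frontal):
            # son-initiated: the finished frontal pushes its contributions up
            for father in fathers[frontal]:
                schedule_assembly(frontal, father)

        def on_assembly_complete(father):
            remaining[father] -= 1
            if remaining[father] == 0:      # last son contribution has arrived
                schedule_factorization(father)

        return on_factorization_complete, on_assembly_complete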

There is an oversight in this assembly pipeline model that limits the value of its comparison
to vertical partitioning. Specifically, there are no synchronization provisions made to ensure
that the assemblies of multiple entries into a single frontal element do not occur concurrently.
Thus a CRCW-SUM memory model must be implicitly assumed for the assembly pipeline, where a
CREW model was used for the vertical partitioning version of Model 4. The effect of
this oversight is not suspected to be too significant, but further revised runs are necessary to
address the issue.
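In a shared-memory implementation, the oversight amounts to unsynchronized additive updates
of the same frontal entry. A CREW-safe alternative would serialize such writers; for example
(a sketch only, not part of the simulated models):

    import threading

    class Frontal:
        def __init__(self, rows, cols):
            self.data = [[0.0] * cols for _ in range(rows)]
            self.lock = threading.Lock()    # guards concurrent contributions

        def contribute(self, i, j, value):
            # serialize writers so two sons never update entry (i, j) at once
            with self.lock:
                self.data[i][j] += value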

The result of the assembly pipelining runs is that the same well-behaved speed up and
utilization curves appeared as were seen with vertical partitioning. Figure 4.20 illustrates
the speed up results.


Vertical Partitioning vs Assembly Pipelining Furthermore, the assembly pipelining
version of Model 4 also demonstrated a significant improvement in speed up over the vertical
partitioning version. Figure 4.21 shows a comparison of the speed up curves for the gre_1107
matrix under the two versions. Such comparisons must, however, be tempered by the earlier
comments on the underlying memory models.

Figure 4.22 compares the average work queue searches for vertical partitioning and assembly
pipelining, both using heaviest path first scheduling. The number of searches is larger for
assembly pipelining since multiple assemblies are concurrently scheduled as each frontal's
factorization completes.

























Figure 4.20: Model 4 Assembly Pipeline Speed Up Results [speed up versus Log2 of Number of Processors]



Figure 4.21: Speed Up Comparisons (VP vs AP): gre_1107 [speed up versus Log2 of Number of Processors]






































Figure 4.22: Average Work Queue Searches (VP vs AP): gre_1107 [average searches versus Log2 of Number of Processors]












4.3 Scalability Analysis


The results presented so far have indicated that the unsymmetric-pattern multifrontal
method has some attractive scalability features, especially when the fuller ranges of
concurrency are employed (i.e., Model 4). Ideally, one would like to quantify this
scalability with the use of iso-efficiency curves. However, there is a major obstacle to
accomplishing this goal. In particular, it is very difficult to quantify problem size,
since the complexity of a sparse matrix problem is a function of its order, sparsity, and
degree of symmetry, as well as other factors. More specifically, for the unsymmetric-pattern
multifrontal method, the structure and node sizes of the elimination DAG combine to dictate
the complexity of the factorization process.

One approach to the scalability issue is to isolate a single factor that dominates problem
size for a given limited model. In particular, I chose to look at Model 2, whose results
were very strongly correlated with average node size. Thus, I defined problem size by average
node size and developed a three-dimensional representation of the efficiency function. (That
is, efficiency equals speed up divided by processors used.) Figure 4.23 (a) shows this
three-dimensional result. The upper left corner corresponds to the 2-processor set, and larger
processor sets (4, 8, 16, 32, ..., 65536) are represented coming down the left diagonal. The
right diagonal represents average node size, where the leftmost entry is the lns_3937 matrix
with an average node size of over 67,000, then sherman5 with 42,093, gre_1107 with 9,998,
mahindasb with 1,887, and, finally in the upper right, gemat11 with 1,195. Thus the scales on
this figure are logarithmic on one axis and irregular and decreasing on the other. However,
in spite of these difficulties, the figure does illustrate the efficiencies due to larger node
sizes for Model 2. The second subfigure is a contour map of the first and is presented as an
approximation of the iso-efficiency curves. (The vertical axis represents the processor sets
of the left diagonal and the horizontal axis indicates the various matrices.)
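As a small illustration of how such a surface is tabulated (Python; the speed up values
below are placeholders, not the simulation results):

    # rows: processor sets 2..65536; columns: matrices ordered by
    # decreasing average node size; real speed ups come from the traces
    procs = [2 ** k for k in range(1, 17)]
    speed_up = {p: {"lns_3937": min(p, 4000.0)} for p in procs}  # placeholder

    def efficiency(p, matrix):
        return speed_up[p][matrix] / p      # efficiency = speed up / processors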

The concept for this type of efficiency analysis was also proposed to me by Professor Ravi
Varadarajan. While the difficult nature of this problem and the limited resources available
produced less than what could be hoped for, the results do illustrate some positive efficiency
trends.























































Figure 4.23: Model 2 Efficiency Curves [two panels over Avg Node Weight (large to small): (a) Processor Sets vs Matrix, (b) Iso-Efficiency Curves]















Chapter 5



Conclusions




The bottom line is that a factor-only version of the unsymmetric-pattern, multifrontal
method for the LU factorization of sparse matrices shows significant potential for paral-
lel implementation. Furthermore, parallelism can be exploited in a variety of degrees and
levels. The most significant results occur when parallelism is used both within and across
nodes in the elimination DAG. The resulting synergism is very promising, as the refined
Model 4 results indicate.

However, exploiting parallelism to the fullest also incurs costs, both in scheduling and in
mapping the parallelism. Where possible, several of these costs have been addressed in the
analyses, such as the average number of work queue searches and the various allocation and
scheduling alternatives.

Another critical observation is that the different degrees of concurrency produce their best
results for distinct processor set ranges. Concurrency only across nodes in the elimination
DAG (Model 0) saw its best performance for small processor sets (typically no more than
eight processors). Block level concurrency within frontals (Models 1 and 3) was best realized
on processor sets of 32 to 512 processors. The full concurrency within a frontal models
(Models 2 and 4) realized the best results and were dominant over most of the range of
processor sets tested. Figure 5.1 compares Model 3 Revision 1 against Model 4 with vertical
partitioning on the sherman5 matrix for a processor set range of 32 to 256. The figure
illustrates that the higher degree of concurrency of Model 4 does outperform Model 3 in
the range of processors where Model 3 has its best performance. Also encouraging is the
observation that almost all of the speed ups predicted by the unbounded models are realizable
with processor set sizes that are currently implemented.

Other conclusions from this effort are that the structures of the elimination DAGs tend to
have dominant near-linear components that directly limit further parallelism. While these
structures work well for the sequential version of the method, different construction
techniques may be possible to enhance parallelism.









Furthermore, the allocation and task definition schemes and methods have very pronounced
effects on performance. Workable solutions to these issues have been developed and verified
with the models, but care needs to be taken in implementation.

Finally, throughout this effort I have approached the actual LU factorization in a very
straightforward fashion. This was done for ease of description and ease of modelling. There
are many other algorithmic alternatives for accomplishing the factorization that may lend
themselves to easier and more efficient parallel implementation. Evaluating such alternatives
would be a significant implementation issue.










































Figure 5.1: Model 3 (Rev 1) vs Model 4 (VP) for P=32..256 [speed up versus Log2 of Number of Processors]













