Pressure-based methods on single-instruction stream/multiple-data stream computers


Material Information

Title:
Pressure-based methods on single-instruction stream/multiple-data stream computers
Physical Description:
vii, 189 leaves : ill. ; 29 cm.
Language:
English
Creator:
Blosch, Edwin L
Publication Date:

Subjects

Genre:
bibliography   ( marcgt )
theses   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1994.
Bibliography:
Includes bibliographical references (leaves 182-188).
Statement of Responsibility:
Edwin L. Blosch.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 002044809
notis - AKN2725
oclc - 33373637
System ID:
AA00003617:00001

Full Text







PRESSURE-BASED METHODS ON SINGLE-INSTRUCTION
STREAM/MULTIPLE-DATA STREAM COMPUTERS














By

EDWIN L. BLOSCH


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA


1994











ACKNOWLEDGEMENTS


I would like to express my thanks to my advisor Dr. Wei Shyy for reflecting

carefully on my results and for directing my research toward interesting issues. I

would also like to thank him for the exceptional personal support and flexibility he

offered me during my last year of study, which was done off-campus. I would also like

to acknowledge the contributions of the other members of my Ph.D. committee, Dr.

Chen-Chi Hsu, Dr. Bruce Carroll, Dr. David Mikolaitis, and Dr. Sartaj Sahni. Dr.

Hsu and Dr. Carroll supervised my B.S. and M.S. degree research studies, respectively,

and Dr. Mikolaitis, in the role of graduate coordinator, enabled me to obtain financial

support from the Department of Energy.

Also I would like to thank Madhukar Rao, Rick Smith and H.S. Udaykumar, for

paying fees on my behalf and for registering me for classes while I was in California.

Jeff Wright, S. Thakur, Shin-Jye Liang, Guobao Guo and Pedro Lopez-Fernandez

have also made direct and indirect contributions for which I am grateful.

Special thanks go to Dr. Jamie Sethian, Dr. Alexandre Chorin and Dr. Paul Con-

cus of Lawrence Berkeley Laboratory for allowing me to visit LBL and use their

resources, for giving personal words of support and constructive advice, and for the

privilege of interacting with them and their graduate students in the applied mathe-

matics branch.

Last but not least I would like to thank my wife, Laura, for her patience, her

example, and her frank thoughts on "cups with sliding lids," "flow through straws,"

and numerical simulations in general.











My research was supported in part by the Computational Science Graduate Fel-

lowship Program of the Office of Scientific Computing in the Department of Energy.

The CM-5s used in this study were partially funded by National Science Foundation

Infrastructure Grant CDA-8722788 (in the computer science department of the Uni-

versity of California-Berkeley), and a grant of HPC time from the DoD HPC Shared

Resource Center, Army High-Performance Computing Research Center, Minneapolis,

Minnesota.

















TABLE OF CONTENTS




ACKNOWLEDGEMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION

1.1 Motivations
1.2 Governing Equations
1.3 Numerical Methods for Viscous Incompressible Flow
1.4 Parallel Computing
1.4.1 Data-Parallelism and SIMD Computers
1.4.2 Algorithms and Performance
1.5 Pressure-Based Multigrid Methods
1.6 Description of the Research

2 PRESSURE-CORRECTION METHODS

2.1 Finite-Volume Discretization on Staggered Grids
2.2 The SIMPLE Method
2.3 Discrete Formulation of the Pressure-Correction Equation
2.4 Well-Posedness of the Pressure-Correction Equation
2.4.1 Analysis
2.4.2 Verification by Numerical Experiments
2.5 Numerical Treatment of Outflow Boundaries
2.6 Concluding Remarks

3 EFFICIENCY AND SCALABILITY ON SIMD COMPUTERS

3.1 Background
3.1.1 Speedup and Efficiency
3.1.2 Comparison Between CM-2, CM-5, and MP-1
3.1.3 Hierarchical and Cut-and-Stack Data Mappings
3.2 Implementational Considerations
3.3 Numerical Experiments
3.3.1 Efficiency of Point and Line Solvers for the Inner Iterations
3.3.2 Effect of Uniform Boundary Condition Implementation
3.3.3 Overall Performance
3.3.4 Isoefficiency Plot
3.4 Concluding Remarks

4 A NONLINEAR PRESSURE-CORRECTION MULTIGRID METHOD

4.1 Background
4.1.1 Terminology and Scheme for Linear Equations
4.1.2 Full-Approximation Storage Scheme for Nonlinear Equations
4.1.3 Extension to the Navier-Stokes Equations
4.2 Comparison of Pressure-Based Smoothers
4.3 Stability of Multigrid Iterations
4.3.1 Defect-Correction Method
4.3.2 Cost of Different Convection Schemes
4.4 Restriction and Prolongation Procedures
4.5 Concluding Remarks

5 IMPLEMENTATION AND PERFORMANCE ON THE CM-5

5.1 Storage Problem
5.2 Multigrid Convergence Rate and Stability
5.2.1 Truncation Error Convergence Criterion for Coarse Grids
5.2.2 Numerical Characteristics of the FMG Procedure
5.2.3 Influence of Initial Guess on Convergence Rate
5.2.4 Remarks
5.3 Performance on the CM-5
5.4 Concluding Remarks

REFERENCES

BIOGRAPHICAL SKETCH











Abstract of Dissertation
Presented to the Graduate School of the University of Florida
in Partial Fulfillment of the Requirements for the
Degree of Doctor of Philosophy



PRESSURE-BASED METHODS ON SINGLE-INSTRUCTION
STREAM/MULTIPLE-DATA STREAM COMPUTERS

By

Edwin L. Blosch


Chairman: Dr. Wei Shyy
Major Department: Aerospace Engineering, Mechanics and Engineering Science


Computationally and numerically scalable algorithms are needed to exploit emerg-

ing parallel-computing capabilities. In this work pressure-based algorithms which

solve the two-dimensional incompressible Navier-Stokes equations are developed for

single-instruction stream/multiple-data stream (SIMD) computers.

The implications of the continuity constraint for the proper numerical treatment

of open boundary problems are investigated. Mass must be conserved globally so that

the system of linear algebraic pressure-correction equations is numerically consistent.

The convergence rate is poor unless global mass conservation is enforced explicitly.

Using an additive-correction technique to restore global mass conservation, flows

which have recirculating zones across the open boundary can be simulated.

The performance of the single-grid algorithm is assessed on three massively-

parallel computers, MasPar's MP-1 and Thinking Machines' CM-2 and CM-5. Paral-

lel efficiencies approaching 0.8 are possible with speeds exceeding that of traditional

vector supercomputers. The following issues relevant to the variation of parallel ef-

ficiency with problem size are studied: the suitability of the algorithm for SIMD

computation; the implementation of boundary conditions to avoid idle processors;












the choice of point versus line-iterative relaxation schemes; the relative costs of the

coefficient computations and solving operations, and the variation of these costs with

problem size; the effect of the data-array-to-processor mapping; and the relative

speeds of computation and communication of the computer.

A nonlinear pressure-correction multigrid algorithm which has better convergence

rate characteristics than the single-grid method is formulated and implemented on

the CM-5. On the CM-5, the components of the multigrid algorithm are tested over a

range of problem sizes. The smoothing step is the dominant cost. Pressure-correction

methods and the locally-coupled explicit method are equally efficient on the CM-5.

V cycling is found to be much cheaper than W cycling, and a truncation-error based

"full-multigrid" procedure is found to be a computationally efficient and convenient

method for obtaining the initial fine-grid guess. The findings presented enable further

development of efficient, scalable pressure-based parallel computing algorithms.
















CHAPTER 1
INTRODUCTION

1.1 Motivations

Computational fluid dynamics (CFD) is a growing field which brings together

high-performance computing, physical science, and engineering technology. The dis-

tinctions between CFD and other fields such as computational physics and computa-

tional chemistry are largely semantic now, because increasingly more interdisciplinary

applications are coming within range of the computational capabilities. CFD algo-

rithms and techniques are mature enough that the focus of research is expected to

shift in the next decade toward the development of robust flow codes, and toward the

application of these codes to numerical simulations which do not idealize either the

physics or the geometry and which take full account of the coupling between fluid

dynamics and other areas of physics [65]. These applications will require formidable

resources, particularly in the areas of computing speed, memory, storage, and in-

put/output bandwidth [78].

At the present time, the computational demands of the applications are still

at least two orders-of-magnitude beyond the computing technology. For example,

NASA's grand challenges for the 1990s are to achieve the capability to simulate vis-

cous, compressible flows with two-equation turbulence modelling over entire aircraft

configurations, and to couple the fluid dynamics simulation with the propulsion and

aircraft control systems modelling. To meet this challenge it is estimated that 1 ter-

aflops computing speed and 50 gigawords of memory will be required [24]. Current











massively-parallel supercomputers, for example, the CM-5 manufactured by Thinking

Machines, have peak speeds of O(10 gigaflops) and memories of O(1 gigaword).

Optimism is sometimes circulated that teraflop computers may be expected by

1995 [68]. In view of the two orders-of-magnitude disparity between the speed of

present-generation parallel computers and teraflops, such optimism should be dimmed

somewhat. Expectations are not being met in part because the applications, which

are the driving force behind the progress in hardware, have been slow to develop. The

numerical algorithms which have seen two decades of development on traditional vec-

tor supercomputers are not always easy targets for efficient parallel implementation.

Better understanding of the basic concepts and more experience with the present

generation of parallel computers is a prerequisite for improved algorithms and imple-

mentations.

The motivation of the present work has been the opportunity to investigate issues

related to the use of parallel computers in CFD, with the hope that the knowledge

gained can assist the transition to the new computing technology. The context of the

research is the numerical solution of the 2-d incompressible Navier-Stokes equations,

by a popular and proven numerical method known as the pressure-correction tech-

nique. A specific objective emerged as the research progressed, namely to develop

and analyze the performance of pressure-correction methods on the single-instruction

stream/multiple-data stream (SIMD) type of parallel computer. Single-grid compu-

tations were studied first, then a multigrid method was developed and tested.

SIMD computers were chosen because they are easier to program than multiple-

instruction stream/multiple-data stream (MIMD) computers (explicit message-passing

is not required), because synchronization of the processors is not an issue, and be-

cause the factors affecting the parallel run time and computational efficiency are

easier to identify and quantify. Also, these are arguably the most powerful machines











available right now-Los Alamos National Laboratory has a 1024-node CM-5 with 32

Gbytes of processor memory and is capable of 32 Gflops peak speed. Thus, the code,

the numerical techniques, and the understanding which are the contribution of this

research can be immediately useful for applications on massively parallel computers.

1.2 Governing Equations

The governing equations for 2-d, constant property, time-dependent viscous in-

compressible flow are the Navier-Stokes equations. They express the principles of

conservation of mass and momentum. In primitive variables and cartesian coordi-

nates, they may be written

\[
\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0 \qquad (1.1)
\]

\[
\frac{\partial(\rho u)}{\partial t} + \frac{\partial(\rho u^2)}{\partial x} + \frac{\partial(\rho u v)}{\partial y} = -\frac{\partial p}{\partial x} + \mu\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) \qquad (1.2)
\]

\[
\frac{\partial(\rho v)}{\partial t} + \frac{\partial(\rho u v)}{\partial x} + \frac{\partial(\rho v^2)}{\partial y} = -\frac{\partial p}{\partial y} + \mu\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right) \qquad (1.3)
\]

where u and v are cartesian velocity components, ρ is the density, μ is the fluid's

molecular viscosity, and p is the pressure. Eq. 1.1 is the mass continuity equation, also

known as the divergence-free constraint since its coordinate-free form is div ū = 0.

The Navier-Stokes equations 1.1-1.3 are a coupled set of nonlinear partial differ-

ential equations of mixed elliptic/parabolic type. Mathematically, they differ from

the compressible Navier-Stokes equations in two important respects that lead to dif-

ficulties for devising numerical solution techniques.

First, the role of the continuity equation is different in incompressible flow. In-

stead of a time-dependent equation for the density, in incompressible fluids the conti-

nuity equation is a constraint on the admissible velocity solutions. Numerical meth-

ods must be able to integrate the momentum equations forward in time while simul-

taneously maintaining satisfaction of the continuity constraint. On the other hand,











numerical methods for compressible flows can take advantage of the fact that in the

unsteady form each equation has a time-dependent term. The equations are cast

in vector form-any suitable method for time-integration can be employed on the

system of equations as a whole.

The second problem, assuming that a primitive-variable formulation is desired, is

that there is no equation for pressure. For compressible flows, the pressure can be de-

termined from the equation of state of the fluid. For incompressible flow, an auxiliary

"pressure-Poisson" equation can be derived by taking the divergence of the vector

form of the momentum equations; the continuity equation is invoked to eliminate

the unsteady term in the result. The formulation of the pressure-Poisson equation

requires manipulating the discrete forms of the momentum and continuity equations.

A particular discretization of the Laplacian operator is therefore implied in pressure-

Poisson equation, depending on the discrete gradient and divergence operators. This

operator may not be implementable at boundaries, and solvability constraints can

be violated [30]. Also, the differentiation of the governing equations introduces the

need for additional unphysical boundary conditions on the pressure. Physically, the

pressure in incompressible flow is only defined relative to an (arbitrary) constant.

Thus, the correct boundary conditions are Neumann. However, if the problem has

an open boundary, the governing equations should be supplemented with a boundary

condition on the normal traction [29, 32],

\[
F_n = -p + \frac{1}{Re}\,\frac{\partial u_n}{\partial n} \qquad (1.4)
\]

where F is the force, Re is the Reynolds number, and the subscript n indicates the

normal direction. However, F_n may be difficult to prescribe.











In practice, a zero-gradient or linear extrapolation for the normal velocity com-

ponent is a more popular outflow boundary condition. Many outflow boundary con-

ditions have been analyzed theoretically for incompressible flow (see [30, 31, 38, 56]).

There are even more boundary condition procedures in use. The method used and its

impact on the "solvability" of the resulting numerical systems of equations depends

on the discretization and the numerical method. This issue is treated in Chapter 2.

1.3 Numerical Methods for Viscous Incompressible Flow

Numerical algorithms for solving the incompressible Navier-Stokes system of equa-

tions were first developed by Harlow and Welch [39] and Chorin [15, 16]. Descendants

of these approaches are popular today. Harlow and Welch introduced the important

contribution of the staggered-grid location of the dependent variables. On a stag-

gered grid, the discrete Laplacian appearing in the derivation of the pressure-Poisson

equation has the standard five-point stencil. On colocated grids it still has a five-

point form but, if the central point is located at (i,j), the other points which are

involved are located at (i+2,j), (i-2,j), (i,j+2), and (i,j-2). Without nearest-neighbor

linkages, two uncoupled ("checkerboard") pressure fields can develop independently.

This pressure-decoupling can cause stability problems, since nonphysical discontinu-

ities in the pressure may develop [50]. In the present work, the velocity components

are staggered one-half of a control volume to the west and south of the pressure which

is defined at the center of the control volume as shown in Figure 1.1. Figure 1.1 also

shows the locations of all boundary velocity components involved in the discretization

and numerical solution, and representative boundary control volumes for u, v, and p.

In Chorin's artificial compressibility approach [15] a time-derivative of pressure is

added to the continuity equation. In this manner the continuity equation becomes

an equation for the pressure, and all the equations can be integrated forward in time,











either as a system or one at a time. The artificial compressibility method is closely

related to the penalty formulation used in finite-element methods [41]. The equations

are solved simultaneously in finite-element formulations. Penalty methods and the

artificial compressibility approach suffer from ill-conditioning when the equations

have strong nonlinearities or source terms. Because the pressure term is artificial,

they are not time-accurate either.

Projection methods [16, 62] are two-step procedures which first obtain a velocity

field by integrating the momentum equations, and then project this vector field into

a divergence-free space by subtracting the gradient of the pressure. The pressure-

Poisson equation is solved to obtain the pressure. The solution must be obtained

to a high degree of accuracy in unsteady calculations in order to obtain the correct

long-term behavior [76]-every step may therefore be fairly expensive. Furthermore,

the time-step size is limited by stability considerations, depending on the implicitness

of the treatment used for the convection terms.
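To make the two-step structure concrete, one common explicit-convection variant of the projection step can be written schematically as follows (this is a generic illustration only, not the particular forms used in [16, 62, 76]):

\[
\frac{\vec u^{\,*} - \vec u^{\,n}}{\Delta t} = -\nabla\cdot(\vec u^{\,n}\vec u^{\,n}) + \frac{\mu}{\rho}\,\nabla^2 \vec u^{\,n},
\qquad
\nabla^2 p^{\,n+1} = \frac{\rho}{\Delta t}\,\nabla\cdot\vec u^{\,*},
\qquad
\vec u^{\,n+1} = \vec u^{\,*} - \frac{\Delta t}{\rho}\,\nabla p^{\,n+1},
\]

so that the corrected velocity field is divergence-free by construction.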

"Pressure-based" methods for the incompressible Navier-Stokes equations include

SIMPLE [61] and its variants, SIMPLEC [19], SIMPLER [60], and PISO [43]. These

methods are similar to projection methods in the sense that a non-mass-conserving

velocity field is computed first, and then corrected to satisfy continuity. However, they

are not implicit in two steps because the nonlinear convection terms are linearized

explicitly. Instead of a pressure-Poisson equation, an approximate equation for the

pressure or pressure-correction is derived by manipulating the discrete forms of the

momentum and continuity equations. A few iterations of a suitable relaxation method

are used to obtain a partial solution to the system of correction equations, and

then new guesses for pressure and velocity are obtained by adding the corrections

to the old values. This process is iterated until all three equations are satisfied.

The iterations require underrelaxation because of the sequential coupling between











variables. Compared to projection methods, pressure-based methods are less implicit

when used for time-dependent problems. However, they can be used to seek the

steady-state directly if desired.

Compared to a fully coupled strategy, the sequential pressure-based approach

typically has slower convergence and less robustness with respect to Reynolds num-

ber. However, the sequential approach has the important advantage that additional

complexities, for example, chemical reaction, can be easily accommodated by simply

adding species-balance equations to the stack. The overall run time increases since

each governing equation is solved independently, and the total storage requirements

scale linearly with the number of equations solved. On the other hand, the computer

time and storage requirements escalate faster in a fully coupled solution strategy. The

typical way around this problem is to solve simultaneously the continuity and momen-

tum equations, then solve any additional equations in a sequential fashion. Without

knowing beforehand that the pressure-velocity coupling is the strongest among all the

various flow variables, however, the extra computational effort spent in simultaneous

solution of these equations is unwarranted.

There are other approaches for solving the incompressible Navier-Stokes equa-

tions, notably methods based on vorticity-streamfunction (w-p) or velocity-vorticity

(u-w) formulations, but pressure-based methods are easier, especially with regard to

boundary conditions and possible extension to 3-d domains. Furthermore, they have

demonstrated considerable robustness in computing incompressible flows. A broad

range of applications of pressure-based methods is demonstrated in [73].

1.4 Parallel Computing

General background of parallel computers and their application to the numeri-

cal solution of partial differential equations is given in Hockney and Jesshope [40]











and Ortega and Voigt [58]. Fischer and Patera [23] gave a recent review of parallel

computing from the perspective of the fluid dynamics community. Their "indirect

cost," the parallel run time, is of primary interest here. The "direct cost" of parallel

computers and their components is another matter entirely. For the iteration-based

numerical methods developed here, the parallel run time is the cost per iteration

multiplied by the number of iterations. The latter is affected by the characteristics of

the particular parallel computer used and the algorithms and implementations em-

ployed. Parallel computers come in all shapes and sizes, and it is becoming virtually

impossible to give a thorough taxonomy. The background given here is limited to a

description of the type of computer used in this work.

1.4.1 Data-Parallelism and SIMD Computers

Single-instruction stream/multiple-data stream (SIMD) computers include the

connection machines manufactured by the Thinking Machines Corporation, the CM-1

and CM-2, and the MP-1, MP-2, and MP-3 computers produced by the MasPar Cor-

poration. These are massively-parallel machines consisting of a front-end computer

and many processor/memory pairs, figuratively, the "back-end." The back-end pro-

cessors are connected to each other by a "data network." The topology of the data

network is a major feature of distributed-memory parallel computers.

The schematic in Figure 1.2 gives the general idea of the SIMD layout. The

program executes on the serial front-end computer. The front-end triggers the syn-

chronous execution of the "back-end" processors by sending "code blocks" simul-

taneously to all processors. Actually, the code blocks are sent to an intermediate

"control processorss)" The control processor broadcasts the instructions contained











in the code block, one at a time, to the computing processors. These "front-end-

to-processor" communications take time. This time is an overhead cost not present

when the program runs on a serial computer.

The operands of the instructions, the data, are distributed among the processors'

memories. Each processor operates on its own locally-stored data. The "data" in

grid-based numerical methods are the arrays, 2-d in this case, of dependent variables,

geometric quantities, and equation coefficients. Because there are usually plenty

of grid points and the same governing equations apply at each point, most CFD

algorithms contain many operations to be performed at every grid point. Thus this

"data-parallel" approach is very natural to most CFD algorithms.

Many operations may be done independently on each grid point, but there is cou-

pling between grid points in physically-derived problems. The data network enters

the picture when an instruction involves another processor's data. Such "interpro-

cessor" communication is another overhead cost of solving the problem on a parallel

computer. For a given algorithm, the amount of interprocessor communication de-

pends on the "data mapping," which refers to the partitioning of the arrays and the

assignment of these "subgrids" to processors. For a given machine, the speed of the

interprocessor communication depends on the pattern of communication (random or

regular) and the distance between the processors (far away or nearest-neighbor).
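As a rough illustration of how the data mapping sets the communication load, the sketch below counts, for a hypothetical N x N array distributed in contiguous blocks over a P x Q processor grid, the boundary ("halo") values each processor exchanges per five-point-stencil sweep relative to the work it performs locally. The function name and the simple block decomposition are assumptions made only for this example; the hierarchical and cut-and-stack mappings compared in Chapter 3 differ in how the blocks are formed.

```python
def halo_traffic(N, P, Q):
    """Per-processor communication and computation counts for one sweep of a
    five-point stencil over an N x N array mapped in (N/P) x (N/Q) blocks."""
    nx, ny = N // P, N // Q           # local subgrid owned by each processor
    halo_words = 2 * (nx + ny)        # one ghost row/column on each of the 4 sides
    local_points = nx * ny            # interior points updated locally
    return halo_words, local_points

# Example: a 1024 x 1024 array on a 32 x 32 processor grid exchanges 128 words
# per processor while updating 1024 points, a communication/computation ratio of 1/8.
print(halo_traffic(1024, 32, 32))
```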

The run time of a parallel program depends first on the amount of front-end and

parallel computation in the algorithm, and the speeds of the front-end and back-

end for doing these computations. In the programs developed here, the front-end

computations are mainly the program control statements (IF blocks, DO loops, etc.).

The front-end work is not sped up by parallel processing. The parallel computations

are the useful work, and by design one hopes to have enough parallel computation











to amortize both the front-end computation and the interprocessor and front-end-to-

processor communication, which are the other factors that contribute to the parallel

run time.

From this brief description it should be clear that SIMD computers have four char-

acteristic speeds: the computation speed of the processors, the communication speed

between processors, and the speed of the front-end-to-processor communication, i.e.

the speed that code blocks are transferred, and the speed of the front-end. These

machine characteristics are not under the control of the programmer. However, the

amount of computation and communication a program contains is determined by the

programmer because it depends on the algorithm selected and the algorithm's imple-

mentation (the choice of the data mapping, for example). Thus, the key to obtaining

good performance from SIMD computers is to pick a suitable algorithm, "matched"

in a sense to the architecture, and to develop an implementation which minimizes

and localizes the interprocessor communication. Then, if there is enough parallel

computation to amortize the serial content of the program and the communication

overheads, the speedup obtained will be nearly the number of processors. The actual

performance, because it depends on the computer, the algorithm, and the imple-

mentation, must be determined by numerical experiment on a program-by-program

basis.

SIMD computers are restricted to exploiting data-parallelism, as opposed to the

parallelism of the tasks in an algorithm. The task-parallel approach is more com-

monly used, for example, on the Cray C90 supercomputer. Multiple-instruction

stream/multiple-data stream (MIMD) computers, on the other hand, are composed of

more-or-less autonomous processor/memory pairs. Examples include the Intel series

of machines (iPSC/2, iPSC/860, and Paragon), workstation clusters, and the connec-

tion machine CM-5. However, in CFD, the data-parallel approach is the prevalent











one even on MIMD computers. The front-end/back-end programming paradigm is

implemented by selecting one processor to initiate programs on the other processors,

accumulate global results, and enforce synchronization when necessary, a strategy

called single-program-multiple-data (SPMD) [23]. The CM-5 has a special "control

network" to provide automatic synchronization of the processor's execution, so a

SIMD programming model can be supported as well as MIMD. SIMD is the manner

in which the CM-5 has been used in the present work. The advantage to using the

CM-5 in the SIMD mode is that the programmer does not have to explicitly specify

message-passing. This simplification saves effort and increases the effective speed of

communication because certain time-consuming protocols for the data transfer can

be eliminated.

1.4.2 Algorithms and Performance

The previous subsection discussed data-parallelism and SIMD computers, i.e.

what parallel computing means in the present context and how it is carried out

by SIMD-type computers. To develop programs for SIMD computers requires one

to recognize that unlike serial computers, parallel computers are not black boxes. In

addition to the selection of an algorithm with ample data-parallelism, consideration

must be given to the implementation of the algorithm in specific ways in order to

achieve the desired benefits (speedups over serial computations).

The success of the choice of algorithm and the implementation on a particular

computer is judged by the "speedup" (S) and "efficiency" (E) of the program. The

communications mentioned above, front-end-to-processor and interprocessor, are es-

sentially overhead costs associated with the SIMD computational model. They would

not be present if the algorithm were implemented on a serial computer, or if such

communications were infinitely fast. If the overhead cost was zero, a parallel program











executing on n_p processors would run n_p times faster than on a single processor, a

speedup of n_p. This idealized case would also have a parallel efficiency of 1. The

parallel efficiency E measures the actual speedup in comparison with the ideal.
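Written out, with T_1 the run time of the same program on a single processor and T_p the parallel run time on n_p processors, the conventional definitions (consistent with the usage here) are

\[
S = \frac{T_1}{T_p}, \qquad E = \frac{S}{n_p} = \frac{T_1}{n_p\,T_p},
\]

so the idealized zero-overhead case corresponds to S = n_p and E = 1.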

One is also interested in how speedup, efficiency, and the parallel run time (Tp)

scale with problem size, and with the number of processors used. The objective in

using parallel computers is more than just obtaining a good speedup on a particular

problem size and a particular number of processors. For parallel CFD, the goals are

to either (1) reduce the time (the indirect cost [23]) to solve problems of a given

complexity, to satisfy the need for rapid turnaround times in design work, or (2)

increase the complexity of problems which can be solved in a fixed amount of time.

For the iteration-based numerical methods studied here, there are two considerations:

the cost per iteration, and the number of iterations, respectively, computational and

numerical factors. The total run time is the product of the two.

Gustafson [35] has presented fixed-size and scaled-size experiments whose results

describe how the cost per iteration scales on a particular machine. In the fixed-

size experiment, the efficiency is measured for a fixed problem size as processors are

added. The hope is that the run time is halved when the number of processors is

doubled. However, the run time obviously cannot be reduced indefinitely by adding

more processors because at some point the parallelism runs out-the limit to the

attainable speedup is the number of grid points. In the scaled-size experiment, the

problem size is increased along with the number of processors, to maintain a constant

local problem size for each of the parallel processors. Care must be taken to make

timings on a per iteration basis if the number of iterations to reach the end of the

computation increases with the problem size. The hope in such an experiment is that

the program will maintain a certain high level of parallel efficiency E. The ability











to maintain E in the scaled-size experiment indicates that the additional processors

increased the speedup in a one-for-one trade.

1.5 Pressure-Based Multigrid Methods

Multigrid methods are a potential route to both computationally and numerically

scalable programs. Their cost per iteration on parallel computers and convergence

rate is the subject of Chapters 4-5. For sufficiently smooth elliptic problems, the

convergence rate of multigrid methods is independent of the problem size-their op-

eration count is O(N). In practice, good convergence rates are maintained as the

problem size increases for Navier-Stokes problems, also, provided suitable multigrid

components-the smoother, restriction and prolongation procedures-and multigrid

techniques are employed. The standard V-cycle full-multigrid (FMG) algorithm has

an almost optimal operation count, O(log₂N) for Poisson equations, on parallel com-

puters. Provided the multigrid algorithm is implemented efficiently and that the cost

per iteration scales well with the problem size and the number of processors, the

multigrid approach seems to be a promising way to exploit the increased computa-

tional capabilities that parallel computers offer.
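For readers unfamiliar with the cycling structure referred to here, the following is a minimal sketch of a linear V-cycle for a 1-d Poisson model problem with homogeneous Dirichlet boundaries and grids of size 2^k + 1; it only illustrates the roles of the smoother, restriction, and prolongation. The full-approximation storage form actually used for the Navier-Stokes equations, and the FMG starting procedure, are developed in Chapter 4.

```python
import numpy as np

def smooth(u, f, h, sweeps=3, omega=2.0/3.0):
    # Weighted-Jacobi smoothing of -u'' = f on a uniform grid (zero Dirichlet ends).
    for _ in range(sweeps):
        u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (u[:-2] + u[2:] + h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    # Full weighting of the residual onto the grid with half as many intervals.
    rc = np.zeros((r.size - 1) // 2 + 1)
    rc[1:-1] = 0.25 * (r[1:-2:2] + 2 * r[2:-1:2] + r[3::2])
    return rc

def prolong(ec, n_fine):
    # Linear interpolation of the coarse-grid correction back to the fine grid.
    e = np.zeros(n_fine)
    e[::2] = ec
    e[1::2] = 0.5 * (ec[:-1] + ec[1:])
    return e

def v_cycle(u, f, h):
    if u.size <= 3:                           # coarsest grid (3 points): solve the
        u[1] = 0.5 * h * h * f[1]             # single interior unknown exactly
        return u
    u = smooth(u, f, h)                       # pre-smoothing
    rc = restrict(residual(u, f, h))          # restrict the residual
    ec = v_cycle(np.zeros_like(rc), rc, 2 * h)  # recursive coarse-grid correction
    u += prolong(ec, u.size)                  # prolongate and correct
    return smooth(u, f, h)                    # post-smoothing
```

A full-multigrid (FMG) driver would simply call v_cycle on successively finer levels, interpolating each converged coarse-grid solution up as the initial guess for the next level.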

The pressure-based methods mentioned previously involve the solution of three

systems of linear algebraic equations, one each for the two velocity components

and one for the pressure, by standard iterative methods such as successive line-

underrelaxation (SLUR). Hence they inherit the convergence rate properties of these

solvers, i.e. as the problem size grows the convergence rate deteriorates. With the

single-grid techniques, therefore, it will be difficult to obtain reasonable turnaround

times when the problem size is increased into the target range for parallel com-

puters. Multigrid techniques for accelerating the convergence of pressure-correction











methods should be pursued, and in fact they have been within the last five or so

years [70, 74, 80].

However, there are still many unsettled issues. The complexities affecting the

convergence rate of single-grid calculations carry over to the multigrid framework

and are compounded there by the coupling between the evolving solutions on multiple

grid levels, and by the particular "grid-scheduling" used.

Linear multigrid methods have been applied to accelerate the convergence rate for

the solution of the system of pressure or pressure-correction equations [4, 22, 42, 64,

94]. However, the overall convergence rate does not significantly improve because the

velocity-pressure coupling is not addressed [4, 22]. Therefore the multigrid strategy

should be applied on the "outer loop," with the role of the iterative relaxation method

played by the numerical methods described above, e.g. the projection method or the

pressure-correction method. Thus, the generic term "smoother" is prescribed because

it reflects the purpose of the solution of the coupled system of equations going on

inside the multigrid cycle-to smooth the residual so that an accurate coarse-grid

approximation of the fine-grid problem is possible. It is not true that a good solver,

one with a fast convergence rate on single-grid computations, is necessarily a good

smoother of the residual. It is therefore of interest to assess pressure-correction meth-

ods as potential multigrid smoothers. See Shyy and Sun [74] for more information

on the staggered-grid implementation of multigrid methods, and some encouraging

results.

Staggered grids require special techniques [21, 74] for the transfer of solutions and

residuals between grid levels, since the positions of the variables on different levels

do not correspond. However, they alleviate the "checkerboard" pressure stability

problem [50], and since techniques have already been established [74], there is no











reason not to go this route, especially when cartesian grids are used as in the present

work.

Vanka [89] has proposed a new numerical method as a smoother for multigrid

computations, one which has inferior convergence properties as a single-grid method

but apparently yields an effective multigrid method. A staggered-grid finite-volume

discretization is employed. In Vanka's smoother, the velocity components and pres-

sure of each control volume are updated simultaneously, so it is a coupled approach,

but the coupling between control volumes is not taken into account, so the calcu-

lation of new velocities and pressures is explicit. This method is sometimes called

the "locally-coupled explicit" or "block-explicit" pressure-based method. The control

volumes are visited in lexicographic order in the original method which is therefore

aptly called BGS (block Gauss-Seidel). Line-variants have been developed to couple

the flow variables in neighboring control volumes along lines (see [80, 87]).

Linden et al.[50] gave a brief survey of multigrid methods for the steady-state in-

compressible Navier-Stokes equations. They argue without analysis that BGS should

be preferred over the pressure-correction type methods since the strong local cou-

pling is likely to have better success smoothing the residual locally. On the other

hand, Sivaloganathan and Shaw [71, 70] have found good smoothing properties for

the pressure-correction approach, although the analysis was simplified considerably.

Sockol [80] has compared the point and line-variants of BGS with the pressure-

correction methods on serial computers, using model problems with different physical

characteristics. SIMPLE and BGS emerge as favorites in terms of robustness with

BGS preferred due to a lower cost per iteration. This preference may or may not

carry over to SIMD parallel computers (see Chapter 4 for comparison). Interesting

applications of multigrid methods to incompressible Navier-Stokes flow problems can

be found in [12, 28, 48, 54].











In terms of parallel implementations there are far fewer results although this

field is rapidly growing. Simon [77] gives a recent cross-section of parallel CFD

results. Parallel multigrid methods, not only in CFD but as a general technique

for partial differential equations, have received much attention due to their desirable

O(N) operation count on Poisson equations. However, it is apparently difficult to find

or design parallel computers with ideal communication networks for multigrid [13].

Consequently implementations have been pursued on a variety of machines to see

what performance can be obtained with the present generation of parallel machines,

and to identify and understand the basic issues. Dendy et al.[18] have recently

described a multigrid method on the CM-2. However, to accommodate the data-

parallel programming model they had to dimension their array data on every grid level

to the dimension extents of the finest grid array data. This approach is very wasteful

of storage. Consequently the size of problems which can be solved is greatly reduced.

Recently an improved release of the compiler has enabled the storage problem to be

circumvented with some programming diligence (see Chapter 5). The implementation

developed in this work is one of the first to take advantage of the new compiler feature.

In addition to parallel implementations of serial multigrid algorithms, several

novel multigrid methods have been proposed for SIMD computers [25, 26, 33]. Some

of the algorithms are intrinsically parallel [25, 26] or have increased parallelism

because they use multiple coarse grids, for example [33]. These efforts and others

have been recently reviewed [14, 53, 92]. Most of the new ideas have not been

developed yet for solving the incompressible Navier-Stokes equations.

One of the most prominent concerns addressed in the literature regarding parallel

implementations of serial multigrid methods is the coarse grids. When the number

of grid points is smaller than the number of processors the parallelism is reduced

to the number of grid points. This loss of parallelism may significantly affect the











parallel efficiency. One of the routes around the problem is to use multiple coarse

grids [59, 33, 79]. Another is to alter the grid-scheduling to avoid coarse grids. This

approach can lead to computationally scalable implementations [34, 49] but may

sacrifice the convergence rate. "Agglomeration" is an efficiency-increasing technique

used in MIMD multigrid programs which refers to the technique of duplicating the

coarse grid problem in each processor so that computation proceeds independently

(and redundantly). Such an approach can also be scalable [51]. However, most atten-

tion so far has focused on parallel implementations of serial multigrid algorithms, in

particular on assessing the importance of the coarse-grid smoothing problem for dif-

ferent machines and on developing techniques to minimize the impact on the parallel

efficiency.

1.6 Description of the Research

The dissertation is organized as follows. Chapter 2 discusses the role of the mass

conservation in the numerical consistency of the single-grid SIMPLE method for open

boundary problems, and explains the relevance of this issue to the convergence rate.

In Chapter 3 the single-grid pressure-correction method is implemented on the MP-1,

CM-2, and CM-5 computers and its performance is analyzed. High parallel efficien-

cies are obtained at speeds and problem sizes well beyond the current performance of

such algorithms on traditional vector supercomputers. Chapter 4 develops a multigrid

numerical method for the purpose of accelerating the single-grid pressure-correction

method and maintaining the accelerated convergence property independent of the

problem size. The multigrid smoother, the intergrid transfer operators, and the sta-

bilization strategy for Navier-Stokes computations are discussed. Chapter 5 describes

the actual implementation of the multigrid algorithm on the CM-5, its convergence

rate, and its parallel run time and scalability. The convergence rate depends on the










flow problem and the coarse-grid discretization, among other factors. These factors

are considered in the context of the "full-multigrid" (FMG) starting procedure by

which the initial guess on the fine grid is obtained. The cost of the FMG proce-

dure is a concern for parallel computation [88], and this issue is also addressed. The

results indicate that the FMG procedure may influence the asymptotic convergence

rate and the stability of the multigrid iterations. Concluding remarks in each chapter

summarize the progress made and suggest avenues for further study.















































Figure 1.1. Staggered-grid layout of dependent variables, for a small but complete
domain. Boundary values involved in the computation are shown. Representative u,
v, and pressure boundary control volumes are shaded.




















[Figure 1.2 (schematic). Labeled components: the front end (CM-2 and MP-1) or partition manager (CM-5), which runs the serial code, control code, and scalar data; a sequencer (CM-2), array control unit (MP-1), or multiple SPARC nodes (CM-5), which receive short blocks of parallel code and broadcast individual instructions; the processing elements (P.E.s), among whose memories the array data are partitioned; and the interprocessor communication network: hypercube plus "NEWS" (CM-2), 3-stage crossbar plus "X-Net" (MP-1), fat tree (CM-5).]

Figure 1.2. Layout of the MP-1, CM-2, and CM-5 SIMD computers.














CHAPTER 2
PRESSURE-CORRECTION METHODS

2.1 Finite-Volume Discretization on Staggered Grids

The formulation of the numerical method used in this work begins with the inte-
gration of the governing equations Eq 1.1-1.3 over each of the control volumes in the
computational domain. Figure 1.1 shows a model computational domain with u, v,
and p (cell-centered) control volumes shaded. The continuity equation is integrated
over the p control volumes.
Consider the discretization of the u-momentum equation for the control volume
shown in Figure 2.1, whose dimensions are Δx and Δy. The v control volumes are
done exactly the same except rotated 90°. Integration of Eq. 1.2 over the shaded
region is interpreted as follows for each of the terms:
\[
\int\!\!\int \frac{\partial(\rho u)}{\partial t}\,dx\,dy = \frac{\partial(\rho u_P)}{\partial t}\,\Delta x\,\Delta y, \qquad (2.1)
\]
\[
\int\!\!\int \frac{\partial(\rho u^2)}{\partial x}\,dx\,dy = (\rho u_e u_e - \rho u_w u_w)\,\Delta y, \qquad (2.2)
\]
\[
\int\!\!\int \frac{\partial(\rho u v)}{\partial y}\,dx\,dy = (\rho u_n v_n - \rho u_s v_s)\,\Delta x, \qquad (2.3)
\]
\[
\int\!\!\int -\frac{\partial p}{\partial x}\,dx\,dy = -(p_e - p_w)\,\Delta y, \qquad (2.4)
\]
\[
\int\!\!\int \mu\,\frac{\partial^2 u}{\partial x^2}\,dx\,dy = \mu\left(\left.\frac{\partial u}{\partial x}\right|_e - \left.\frac{\partial u}{\partial x}\right|_w\right)\Delta y, \qquad (2.5)
\]
\[
\int\!\!\int \mu\,\frac{\partial^2 u}{\partial y^2}\,dx\,dy = \mu\left(\left.\frac{\partial u}{\partial y}\right|_n - \left.\frac{\partial u}{\partial y}\right|_s\right)\Delta x. \qquad (2.6)
\]
The lowercase subscripts e, w, n, s indicate evaluation on the control volume faces.
By convention and the mean-value theorem, these are at the midpoint of the faces.
The subscript P in Eq. 2.1 indicates evaluation at the center of the control volume.











Because of the staggered grid, the required pressure values in Eq. 2.4 are already

located on the u control volume faces. The pressure-gradient term is effectively a

second-order central-difference approximation. With colocated grids, however, the

control-volume face pressures are obtained by averaging the nearby pressures. This

averaging results in the pressure at the cell center dropping out of the expression

for the pressure gradient. The central-difference in Eq. 2.4 is effectively taken over

a distance 2Ax on colocated grids. Thus staggered cartesian grids provide a more

accurate approximation of the pressure-gradient term since the difference stencil is

smaller.
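In one dimension the difference can be stated explicitly: with the staggered arrangement the face pressures straddle the u location, whereas on a colocated grid the averaging of cell-center pressures is equivalent to

\[
\left.\frac{\partial p}{\partial x}\right|_{\text{staggered}} \approx \frac{p_e - p_w}{\Delta x},
\qquad
\left.\frac{\partial p}{\partial x}\right|_{\text{colocated}} \approx \frac{p_E - p_W}{2\,\Delta x},
\]

so a component of the pressure field that alternates between adjacent cells makes no contribution to the colocated-grid gradient, which is the origin of the checkerboard decoupling mentioned in Chapter 1.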

The next step is to approximate the terms which involve values at the control

volume faces. In Eq. 2.2, one of the u_e and one of the u_w are replaced by an average

of neighboring values,

\[
(\rho u_e u_e - \rho u_w u_w)\,\Delta y = \rho\left(\frac{u_E + u_P}{2}\,u_e - \frac{u_P + u_W}{2}\,u_w\right)\Delta y, \qquad (2.7)
\]

and in Eq. 2.3, v_n and v_s are obtained by averaging nearby values,

\[
(\rho u_n v_n - \rho u_s v_s)\,\Delta x = \rho\left(\frac{v_{ne} + v_{nw}}{2}\,u_n - \frac{v_{se} + v_{sw}}{2}\,u_s\right)\Delta x. \qquad (2.8)
\]

The remaining face velocities in the convection terms, u_n, u_s, u_e, and u_w, are ex-

pressed as a certain combination of the nearby u values-which u values are involved

and what weighting they receive is prescribed by the convection scheme. Some pop-

ular recirculating flow convection schemes are described in [73, 75].

The control-volume face derivatives in the diffusion terms are evaluated by central

differences,

\[
\mu\left(\left.\frac{\partial u}{\partial x}\right|_e - \left.\frac{\partial u}{\partial x}\right|_w\right)\Delta y = \mu\left(\frac{u_E - u_P}{\Delta x} - \frac{u_P - u_W}{\Delta x}\right)\Delta y, \qquad (2.9)
\]

\[
\mu\left(\left.\frac{\partial u}{\partial y}\right|_n - \left.\frac{\partial u}{\partial y}\right|_s\right)\Delta x = \mu\left(\frac{u_N - u_P}{\Delta y} - \frac{u_P - u_S}{\Delta y}\right)\Delta x. \qquad (2.10)
\]











The unsteady term in Eq. 2.1 is approximated by a backward Euler scheme. All the

terms are evaluated at the "new" time level, i.e. implicitly.
Thus, the discretized momentum equations for each control volume can be put

into the following general form,

\[
a_P u_P = a_E u_E + a_W u_W + a_N u_N + a_S u_S + b, \qquad (2.11)
\]

where b = (p_w - p_e) Δy + ρ u_P^n Δx Δy / Δt, the superscript n indicating the previous time step.

The coefficients a_N, a_S, etc. are comprised of the terms which multiply u_N, u_S, etc. in
the discretized convection and diffusion terms.
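As an illustration of how Eq. 2.11 is assembled, the sketch below builds the five coefficients and the source term for a single interior u control volume. First-order upwinding of the convection terms is assumed here purely to keep the example short; the convection schemes actually used are those referenced above, and the function name and argument list are hypothetical.

```python
def u_coefficients(rho, mu, dx, dy, dt, u_e, u_w, v_n, v_s, u_P_old, p_w, p_e):
    """Coefficients of Eq. 2.11 for one interior u control volume.

    u_e, u_w, v_n, v_s are the (lagged) convecting face velocities; u_P_old is
    u_P at the previous time step; p_w, p_e are the staggered face pressures.
    """
    # Face mass fluxes and diffusion conductances
    F_e, F_w = rho * u_e * dy, rho * u_w * dy
    F_n, F_s = rho * v_n * dx, rho * v_s * dx
    D_e = D_w = mu * dy / dx
    D_n = D_s = mu * dx / dy
    # Central diffusion plus first-order upwind convection (illustrative choice)
    a_E = D_e + max(-F_e, 0.0)
    a_W = D_w + max(F_w, 0.0)
    a_N = D_n + max(-F_n, 0.0)
    a_S = D_s + max(F_s, 0.0)
    # Backward-Euler unsteady term contributes to a_P and to the source b
    a_P = a_E + a_W + a_N + a_S + (F_e - F_w) + (F_n - F_s) + rho * dx * dy / dt
    b = (p_w - p_e) * dy + rho * u_P_old * dx * dy / dt
    return a_P, a_E, a_W, a_N, a_S, b
```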

The continuity equation is integrated over a pressure control volume,

\[
\int\!\!\int \left[\frac{\partial(\rho u)}{\partial x} + \frac{\partial(\rho v)}{\partial y}\right] dx\,dy = \rho\,(u_e - u_w)\,\Delta y + \rho\,(v_n - v_s)\,\Delta x = 0. \qquad (2.12)
\]

Again the staggered grid is an advantage because the normal velocity components on

each control volume face are already in position-there is no need for interpolation.

2.2 The SIMPLE Method

One SIMPLE iteration takes initial velocity and pressure fields (u*, v*, p*) and
computes new guesses (u, v, p). The intermediate values are denoted with a tilde,
(ũ, ṽ). In the algorithm below, a_N^u(u*, v*), for example, means that the a_N coeffi-
cient in the u-momentum equation depends on u* and v*. The parameters ν_u, ν_v, and
ν_c are the numbers of "inner" iterations to be taken for the u, v, and continuity equa-
tions, respectively. This notation will be clarified by the following discussion. The
inner iteration count is indicated by the superscript enclosed in parentheses. Finally,
ω_u and ω_c are the relaxation factors for the momentum and continuity equations.

SIMPLE(u*, v*, p*; ν_u, ν_v, ν_c, ω_u, ω_c)

Compute u coefficients a_k^u(u*, v*) (k = P, E, W, N, S) and source term b^u(u*, p*)
for each discrete u-momentum equation:

\[
\frac{a_P^u}{\omega_u}\,\tilde u_P = a_N^u \tilde u_N + a_S^u \tilde u_S + a_E^u \tilde u_E + a_W^u \tilde u_W + b^u + (1 - \omega_u)\,\frac{a_P^u}{\omega_u}\,u_P^*
\]

Do ν_u iterations to obtain an approximate solution for ũ,
starting with u* as the initial guess:

u^(n) = G u^(n-1) + f^u,   ũ = u^(n=ν_u)

Compute v coefficients a_k^v(ũ, v*) (k = P, E, W, N, S) and source term b^v(v*, p*)
for each discrete v-momentum equation:

\[
\frac{a_P^v}{\omega_u}\,\tilde v_P = a_N^v \tilde v_N + a_S^v \tilde v_S + a_E^v \tilde v_E + a_W^v \tilde v_W + b^v + (1 - \omega_u)\,\frac{a_P^v}{\omega_u}\,v_P^*
\]

Do ν_v iterations to obtain an approximate solution for ṽ,
starting with v* as the initial guess:

v^(n) = G v^(n-1) + f^v,   ṽ = v^(n=ν_v)

Compute p' coefficients a_k^c (k = P, E, W, N, S) and source term b^c(ũ, ṽ)
for each discrete p' equation:

\[
a_P^c\, p'_P = a_N^c\, p'_N + a_S^c\, p'_S + a_E^c\, p'_E + a_W^c\, p'_W + b^c
\]

Do ν_c iterations to obtain an approximate solution for p',
starting with zero as the initial guess:

p'^(n) = G p'^(n-1) + f^c

Correct ũ, ṽ, and p* at every interior grid point:

u_P = ũ_P + (Δy / a_P^u)(p'_w - p'_e)
v_P = ṽ_P + (Δx / a_P^v)(p'_s - p'_n)
p_P = p*_P + ω_c p'_P











The algorithm is not as complicated as it looks. The important point to note is

that the major tasks to be done are the computing of coefficients and the solving of

the systems of equations. The symbol G indicates the iteration matrix of whatever

type relaxation is used on these inner iterations (SLUR in this case), and f is the

corresponding source term.
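The inner iterations φ^(n) = Gφ^(n-1) + f can be pictured with the following minimal sketch, which applies a few point-Jacobi sweeps to a generic five-point system. Point Jacobi is used here only for brevity, whereas the computations in this work use successive line underrelaxation (SLUR); the function name and array layout are assumptions of the example.

```python
import numpy as np

def inner_iterations(aP, aE, aW, aN, aS, b, phi, sweeps):
    """A few partial-solution sweeps of a five-point system a_P*phi_P = sum(a_k*phi_k) + b.

    The coefficient arrays cover the interior points; phi carries a ring of
    boundary values around them and is returned after `sweeps` Jacobi updates.
    """
    for _ in range(sweeps):
        interior = (aE * phi[1:-1, 2:] + aW * phi[1:-1, :-2] +
                    aN * phi[2:, 1:-1] + aS * phi[:-2, 1:-1] + b) / aP
        phi = phi.copy()
        phi[1:-1, 1:-1] = interior
    return phi
```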

In the SIMPLE pressure-correction method [61], the averages in Eq. 2.7 and 2.8

are lagged in order to linearize the resulting algebraic equations. The governing

equations are solved sequentially. First, the u momentum equation coefficients are

computed and an updated u field is computed by solving the system of linear alge-

braic equations. The pressures in Eq. 2.4 are lagged. The v momentum equation is

solved next to update v. The continuity equation, recast in terms of pressure correc-

tions, is then set up and solved. These pressure corrections are coupled to velocity

corrections. Together they are designed to correct the velocity field so that it satisfies

the continuity constraint, while simultaneously correcting the pressure field so that

momentum conservation is maintained.

The relationship between the velocity and pressure corrections is derived from

the momentum equation, as described in the next section. The resulting system

of equations is fully coupled, as one might expect knowing the elliptic nature of

pressure in incompressible fluids, and is therefore expensive to solve. However, if the

resulting system of pressure-correction equations were solved exactly, the divergence-

free constraint and the momentum equations (with old values of u and v present in

the nonlinear convection terms) would be satisfied. This approach would constitute

an implicit method of time integration for the linearized equations. The time-step

size would have to be limited to avoid stability problems caused by the linearization.

To reduce the computational cost, the SIMPLE prescription is to use an approx-

imate relationship between the velocity and pressure corrections (hence the label











"semi-implicit"). Variations on the original SIMPLE approximation have shown bet-

ter convergence rates for simple flow problems, but in discretizations on curvilinear

grids and other problems with significant contributions from source terms, the per-

formance is no better than the original SIMPLE method (see the results in [4]).

The goal of satisfying the divergence-free constraint can still be attained, if the

system of pressure-correction equations is converged to strict tolerances, because the

discrete continuity equations are still being solved. But satisfaction of the momentum

equations cannot be maintained with the approximate relationship. Consequently it

is no longer desirable to solve the p'-system of equations to strict tolerances. It-

erations are necessary to find the right velocities and pressures which satisfy all

three equations. Furthermore, since the equation coefficients are changing from one

iteration to the next, it is pointless to solve the momentum equations to strict tol-

erances. In practice, only a few iterations of a standard scheme such as successive

line-underrelaxation (SLUR) are performed.

The single "outer" iteration outlined above is repeated many times, with under-

relaxation to prevent the iterations from diverging. In this sense a two-level iterative

procedure is being employed. In the outer iterations, the momentum and pressure-

correction equations are iteratively updated based on the linearized coefficients and

sources, and inner iterations are applied to partially solve the systems of linear alge-

braic equations.

The fact that only a few inner iterations are taken on each system of equations sug-

gests that the asymptotic convergence rate of the iterative solver, which is the usual

means of comparison between solvers, does not necessarily dictate the convergence

rate of the outer iterative process. Braaten and Shyy [4] have found that the con-

vergence rate of the outer iterations actually decreases when the pressure-correction

equation is solved to a much stricter tolerance than the momentum equations. They











concluded that the balance between the equations is important. Because u, v, and

p' are segregated, the overall convergence rate is strongly dependent on the partic-

ular flow problem, the grid distribution and quality, and the choice of relaxation

parameters.

In contrast to projection methods, which are two-step but treat the convection

terms explicitly (or more recently by solving a Riemann problem [2]) and are therefore

restricted from taking too large a time-step, the pressure-correction approach is fully

implicit with no time-step limitation, but many iterations may be necessary. The

projection methods are formalized as time-integration techniques for semi-discrete

equations. SIMPLE is an iterative method for solving the discretized Navier-Stokes

system of coupled nonlinear algebraic equations. But the details given above should

make it clear that these techniques bear strong similarities-specifically, a single

SIMPLE iteration would be a projection method if the system of pressure-correction

equations was solved to strict tolerances at each iteration. It would be interesting to

do some numerical comparisons between projection methods and pressure-correction

methods to further clarify the similarity.

2.3 Discrete Formulation of the Pressure-Correction Equation

The discrete pressure-correction equation is obtained from the discrete momentum

and continuity equations as follows. The velocity field which has been newly obtained

by solving the momentum equations was denoted by (ũ, ṽ) earlier. The pressure field

after the momentum equations are solved still has the initial value p*. So ũ, ṽ, and

p* satisfy the u-momentum equation

\[
a_P \tilde u_P = a_E \tilde u_E + a_W \tilde u_W + a_N \tilde u_N + a_S \tilde u_S + (p^*_w - p^*_e)\,\Delta y, \qquad (2.13)
\]

and the corresponding v-momentum equation. The corrected (continuity-satisfying)

velocity field (u, v) satisfies the u-momentum equation with the corrected pressure











field p,

\[
a_P u_P = a_E u_E + a_W u_W + a_N u_N + a_S u_S + (p_w - p_e)\,\Delta y, \qquad (2.14)
\]

and likewise for the v-momentum equation. Additive corrections are assumed, i.e.


\[
u = \tilde u + u' \qquad (2.15)
\]

\[
v = \tilde v + v' \qquad (2.16)
\]

\[
p = p^* + p'. \qquad (2.17)
\]

Subtracting Eq. 2.13 from Eq. 2.14 gives the desired relationship between pressure

and the u corrections,


\[
a_P u'_P = \sum_{k=E,W,N,S} a_k u'_k + (p'_w - p'_e)\,\Delta y, \qquad (2.18)
\]

with a similar expression for the v corrections.

If Eq. 2.18 is used as is, then the nearby velocity corrections in the summation need

to be replaced by similar expressions involving pressure-corrections. This requirement

brings in more velocity corrections and more pressure corrections, and so on, leading

to an equation which involves the pressure corrections at every grid point. The

resulting system of equations would be expensive to solve. Thus, the summation

term is dropped in order to obtain a compact expression for the velocity correction in

terms of pressure corrections. At convergence, the pressure corrections (and therefore

the velocity corrections) go to zero, so the precise form of the approximate pressure-

velocity correction relationship does not figure in the final converged solution.

The discrete form of the pressure-correction equation follows by first substituting

the simplified version of Eq. 2.18 into Eq. 2.15,


\[
u_P = \tilde u_P + u'_P = \tilde u_P + \frac{\Delta y}{a_P}\,(p'_w - p'_e), \qquad (2.19)
\]










and then substituting this into the continuity equation Eq. 2.12, (with an analogous

formula for vp). The result is

\frac{\rho\,\Delta y^2}{a_P(u_e)}\,(p'_P - p'_E) - \frac{\rho\,\Delta y^2}{a_P(u_w)}\,(p'_W - p'_P) + \frac{\rho\,\Delta x^2}{a_P(v_n)}\,(p'_P - p'_N) - \frac{\rho\,\Delta x^2}{a_P(v_s)}\,(p'_S - p'_P) = b \qquad (2.20)

where the source term b is

b = \rho \tilde{u}_w \Delta y - \rho \tilde{u}_e \Delta y + \rho \tilde{v}_s \Delta x - \rho \tilde{v}_n \Delta x \qquad (2.21)


Recall that Eq. 2.20 and Eq. 2.21 are written for the pressure control volumes, so that

there is some interpretation required. The term a_P(u_e) in Eq. 2.20 is the appropriate
a_P from the discretized u-momentum equation, Eq. 2.13. In other words, u_P in Eq. 2.13
is actually u_e, u_w, v_n, or v_s in Eq. 2.20 and 2.21, relative to the pressure control
volumes on the staggered grid. Eq. 2.20 can be rearranged into the same general

form as Eq. 2.11. From Eq. 2.21, it is apparent that the right-hand side term is the

net mass flux entering the control volume, which should be zero in incompressible

flow.
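
For later reference, carrying out that rearrangement gives the five-point form (this is simply a restatement of Eq. 2.20 in the notation of Eq. 2.11):

a_P p'_P = a_E p'_E + a_W p'_W + a_N p'_N + a_S p'_S + b,

\text{with}\quad a_E = \frac{\rho\,\Delta y^2}{a_P(u_e)}, \quad a_W = \frac{\rho\,\Delta y^2}{a_P(u_w)}, \quad a_N = \frac{\rho\,\Delta x^2}{a_P(v_n)}, \quad a_S = \frac{\rho\,\Delta x^2}{a_P(v_s)}, \quad a_P = a_E + a_W + a_N + a_S.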

In the formulation of the pressure-correction equation for boundary control vol-

umes, one makes use of the fact that the normal velocity components on the bound-

aries are known from either Dirichlet or Neumann boundary conditions, so no velocity

correction is required there. Consequently, the formulation of Eq. 2.20 for boundary

control volumes does not require any prescription of boundary p' values [60] when

velocity boundary conditions are prescribed. Without the summation from Eq. 2.18,

it is apparent that a zero velocity correction for the outflow boundary u-velocity

component is obtained when p'_w = p'_e; in effect, a Neumann boundary condition on

pressure is implied. This boundary condition is appropriate for an incompressible

fluid because it is physically consistent with the governing equations in which only

the pressure gradient appears. There is a unique pressure gradient but the level is











adjustable by any constant amount. If it happens that there is a pressure specified

on the boundary, for example by Eq. 1.4, then the correction there will be zero, pro-

viding a boundary condition for Eq. 2.20. Thus, it seems that there are no concerns

over the specification of boundary conditions for the p' equations.

2.4 Well-Posedness of the Pressure-Correction Equation

2.4.1 Analysis

To better understand the characteristics of the pressure-correction step in the

SIMPLE procedure, consider a model 3 x 3 computational domain, so that 9 algebraic

equations for the pressure corrections are obtained. Number the control volumes as

shown in Figure 2.3. Then the system of p' equations can be written

\begin{bmatrix}
a_P^1 & -a_E^1 & 0 & -a_N^1 & 0 & 0 & 0 & 0 & 0 \\
-a_W^2 & a_P^2 & -a_E^2 & 0 & -a_N^2 & 0 & 0 & 0 & 0 \\
0 & -a_W^3 & a_P^3 & 0 & 0 & -a_N^3 & 0 & 0 & 0 \\
-a_S^4 & 0 & 0 & a_P^4 & -a_E^4 & 0 & -a_N^4 & 0 & 0 \\
0 & -a_S^5 & 0 & -a_W^5 & a_P^5 & -a_E^5 & 0 & -a_N^5 & 0 \\
0 & 0 & -a_S^6 & 0 & -a_W^6 & a_P^6 & 0 & 0 & -a_N^6 \\
0 & 0 & 0 & -a_S^7 & 0 & 0 & a_P^7 & -a_E^7 & 0 \\
0 & 0 & 0 & 0 & -a_S^8 & 0 & -a_W^8 & a_P^8 & -a_E^8 \\
0 & 0 & 0 & 0 & 0 & -a_S^9 & 0 & -a_W^9 & a_P^9
\end{bmatrix}
\begin{bmatrix} p'_1 \\ p'_2 \\ p'_3 \\ p'_4 \\ p'_5 \\ p'_6 \\ p'_7 \\ p'_8 \\ p'_9 \end{bmatrix}
=
\rho
\begin{bmatrix}
u^1_w - u^1_e + v^1_s - v^1_n \\
u^2_w - u^2_e + v^2_s - v^2_n \\
u^3_w - u^3_e + v^3_s - v^3_n \\
u^4_w - u^4_e + v^4_s - v^4_n \\
u^5_w - u^5_e + v^5_s - v^5_n \\
u^6_w - u^6_e + v^6_s - v^6_n \\
u^7_w - u^7_e + v^7_s - v^7_n \\
u^8_w - u^8_e + v^8_s - v^8_n \\
u^9_w - u^9_e + v^9_s - v^9_n
\end{bmatrix} \qquad (2.22)


where the superscript designates the cell location and the subscript designates the

coefficient linking the point in question, P, and the neighboring node. The right-hand

side velocities are understood to be tilde quantities as in Eq. 2.21.

In finite-volume discretizations, fluxes are estimated at the control volume faces

which are common to adjacent control volumes, so if the governing equations are

cast in conservation law form, as they are here, the discrete efflux of any quantity

out of one control volume is guaranteed to be identical to the influx into its neighbor.

There is no possibility of internal sources or sinks. In fact this is what makes finite-

volume discretizations preferable to finite-difference discretizations. The following











relationships, using control volume 5 in Figure 2.3 as an example, follow from Eq. 2.20

and the internal consistency of finite-volume discretizations:


a_P^5 = a_E^5 + a_W^5 + a_N^5 + a_S^5 \qquad (2.23)

a_E^5 = a_W^6, \quad a_W^5 = a_E^4, \quad a_N^5 = a_S^8, \quad a_S^5 = a_N^2 \qquad (2.24)

u_w^5 = u_e^4, \quad u_e^5 = u_w^6, \quad v_s^5 = v_n^2, \quad v_n^5 = v_s^8 \qquad (2.25)

Eq. 2.23 states that the coefficient matrix is pentadiagonal and diagonally dominant

for the interior control volumes. Furthermore, when the natural boundary condition

(zero velocity correction) is applied, the appropriate term in Eq. 2.20 for the boundary

under consideration does not appear, and therefore the pressure-correction equations

for the boundary control volumes also satisfy Eq. 2.23. If a pressure boundary condi-

tion is applied so that the corresponding pressure correction is zero, then one would

set p'_E = 0 in Eq. 2.20, for example, which would give a_W + a_N + a_S < a_P. Thus,

either way, the entire coefficient matrix in Eq. 2.22 is diagonally dominant. However,

with the natural prescription for boundary treatment, no diagonal term exceeds the

sum of its off-diagonal terms.

Thus, the system of equations Eq. 2.22 is linearly dependent with the natural

(velocity) boundary conditions, which can be verified by adding the 9 equations

above. Because of Eq. 2.23 and Eq. 2.24 all terms on the left-hand side of Eq. 2.22

identically cancel one another. At all interior control volume interfaces, the right-

hand side terms identically cancel due to Eq. 2.25, and the remaining source terms

are simply the boundary mass fluxes. This cancellation is equivalent to a discrete

statement of the divergence theorem


\int_\Omega \nabla \cdot \vec{u} \; d\Omega = \oint_{\partial\Omega} \vec{u} \cdot \hat{n} \; d(\partial\Omega) \qquad (2.26)











where Ω is the domain under consideration and n̂ is the unit vector in the direction
normal to its boundary ∂Ω.

Due to the linear dependence of the left-hand side of Eq. 2.22, the boundary mass

fluxes must also sum to zero in order for the system of equations to be consistent.

No solution exists if the linearly dependent system of equations is inconsistent. The

situation can be likened to a steady-state heat conduction problem with source terms

and adiabatic boundaries. Clearly, a steady-state solution only exists if the sum of

the source terms is zero. If there is a net heat source, then the temperature inside

the domain will simply rise without bound if an iterative solution strategy (quasi

time-marching) is used. Likewise, the net mass source in flow problems with open

boundaries must sum to zero for the pressure-correction equation to have a solution.

In other words, global mass conservation is required in discrete form in order for a

solution to exist. The interesting point to note is that during the course of SIMPLE

iterations, when the pressure-correction equation is executed, the velocity field does

not usually conserve mass globally in flow problems with open boundaries, unless

explicit measure is taken to enforce global mass conservation. The purpose of solving

the pressure-correction equations is to drive the local mass sources to zero by suitable

velocity corrections. But the pressure-correction equations which are supposed to

accomplish this purpose do not have a solution unless the net mass source is already

zero. For domains with closed boundaries, global mass conservation is obviously not

an issue.
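
In code, the solvability check amounts to accumulating the net boundary mass flux. The following minimal serial Fortran sketch illustrates the quantity that must vanish (to within round-off) before the p'-system can have a solution; the array names, the uniform spacing, and the inflow-positive sign convention are assumptions of the sketch, not the actual implementation used in this work.

! Solvability check for the p' system: the net mass flux through the
! boundaries must vanish before the p' equations can have a solution.
! Array names, the uniform spacing, and the inflow-positive sign
! convention are assumptions of this sketch.
subroutine mass_deficit(ni, nj, rho, dx, dy, u_west, u_east, &
                        v_south, v_north, deficit)
  implicit none
  integer, intent(in)  :: ni, nj                    ! pressure CVs in x and y
  real(8), intent(in)  :: rho, dx, dy
  real(8), intent(in)  :: u_west(nj), u_east(nj)    ! boundary-normal u on west/east faces
  real(8), intent(in)  :: v_south(ni), v_north(ni)  ! boundary-normal v on south/north faces
  real(8), intent(out) :: deficit                   ! net mass source; must be ~0
  deficit = rho*dy*(sum(u_west) - sum(u_east)) &
          + rho*dx*(sum(v_south) - sum(v_north))
end subroutine mass_deficit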

Furthermore, this problem does not only show up when the initial guess is bad.

In the backward-facing step flow discussed below, the initial guess is zero everywhere

except for inflow, which obviously is the worst case as far as a net mass source is

concerned (all inflow and no outflow). But even if one starts with a mass-conserving

initial guess, during the course of iterations the outflow velocity boundary condition











which is necessary to solve the momentum equations will reset the outflow so that

the global mass-conservation constraint is violated.

2.4.2 Verification by Numerical Experiments

Support for the preceding discussion is provided by numerical simulation of two

model problems, a lid-driven cavity flow and a backward-facing step flow. The con-

figurations are shown along with other relevant data in Figure 2.2.

Figure 2.4 shows the outer-loop convergence paths for the lid-driven cavity flow

and the backward-facing step flow, both at Re = 100. The quantities plotted in

Figure 2.4 are the log₁₀ of the global residuals for each governing equation obtained

by summing up the local residuals, each of which is obtained by subtracting the

left-hand side of the discretized equations from the right-hand side. For the cavity

flow there are no mass fluxes across the boundary so, as mentioned earlier, the global

mass conservation condition is always satisfied when the algorithm reaches the point

of solving the system of p'-equations. The residuals have dropped to 10⁻⁷ after 150

iterations, which is very rapid convergence, indicating that good pressure and velocity

corrections are being obtained.

In the backward-facing step flow, however, the flowfield is very slow to develop

because no global mass conservation measure is enforced. During the course of iter-

ations, the mass flux into the domain from the left is not matched by an equal flux

through the outflow boundary, and consequently the system of pressure-correction

equations which is supposed to produce a continuity-satisfying velocity field does not

have a solution. Correspondingly one observes that the outer-loop convergence rate

is about 10 times worse than for cavity flow.

Also, note that the momentum convergence path of the backward-facing step flow

in Figure 2.4 tends to follow the continuity equation, indicating that the pressure and











velocity fields are strongly coupled. The present flow problem bears some similarity to

a fully-developed channel flow, in which the streamwise pressure-gradient and cross-

stream viscous diffusion are balanced, so the observation that pressure and velocity

are strongly coupled is intuitively correct. Thus, the convergence path is controlled

by the development of the pressure field. The slow convergence rate problem is due

to the inconsistency of the system of pressure-correction equations.

The inner-loop convergence path (the SLUR iterations) for the p'-system of equa-

tions must be examined to determine the manner in which the inner-loop inconsis-

tency leads to poor outer-loop convergence rates. Table 2.1 shows leading eigenvalues

for successive line-underrelaxation iteration matrices of the p'-system of equations at

an intermediate iteration for which the outer-loop residuals had dropped to approx-

imately 10⁻².

Largest 3 eigenvalues     Cavity Flow     Back-Step Flow
λ₁                        1.0             1.0
λ₂                        0.956           0.996
λ₃                        0.951           0.984

Table 2.1. Largest eigenvalues of iteration matrices during an intermediate itera-
tion, applying the successive line-underrelaxation iteration scheme to the p'-system of
equations.


In both model problems the spectral radius is 1.0 because the p'-system of equa-

tions is linearly dependent. The next largest eigenvalue is smaller in the cavity flow

computation than in the step flow computation, which means a faster asymptotic con-

vergence rate. However, the difference between 0.996 and 0.956 is not large enough

to produce the significant difference observed in the outer convergence path.

Figure 2.5 shows the inner-loop residuals of the SLUR procedure during an inter-

mediate iteration. The two momentum equations are well-conditioned and converge

to a solution within 4 iterations. In Figure 2.5 for the cavity flow case, the p'-equation











converges to zero, although this happens at a slower rate than the two momentum

equations because of the diffusive nature of the equation. In Figure 2.5 for the back-

step flow, the inner-loop residual stalls at a nonzero value, which is in fact the

initial level of inconsistency in the system of equations, i.e. the global mass deficit.

Given that the system of p'- equations which is being solved does not satisfy the

global continuity constraint, however, the significance or utility of the p'-field that

has been obtained is unknown.

In practice, the overall procedure may still be able to lead to a converged solu-

tion, as in the present case. It appears that the outflow extrapolating procedure,

a zero-gradient treatment utilized here, can help induce the overall computation to

converge to the right solution [72]. Obviously, such a lack of satisfaction of global

mass conservation is not desirable in view of the slow convergence rate.

Further study suggests that the iterative solution to the inconsistent system of

p'-equations converges on a unique pressure gradient, i.e. the difference between p'

values at any two points tends to a constant value, even though the p'-field does not

in general satisfy any of the equations in the system. This relationship is shown in

Figure 2.6, in which the convergence of the difference in p' between the lower-left and

upper-right locations in the domain of the cavity and backward-facing step flows is

plotted. Also shown is the value of p' at the lower-left corner of the domain. For the

cavity flow, there is a solution to the system of p'-equations, and it is obtained by

the SLUR technique in about 10 iterations. Thus all the pressure corrections and the

differences between them tend towards constant values. In the backward-facing step

flow, however, the individual pressure corrections increase linearly with the number

of iterations, symptomatic of the inconsistency in the system of equations. The

differences between p' values approach a constant, however. The rate at which this











unique pressure-gradient field is obtained depends on the eigenvalues of the iteration

matrix.

To resolve the inconsistency problem in the p'-system of equations and thereby

improve the outer-loop convergence rate in the backward-facing step flow, global mass

conservation has been explicitly enforced during the sequential solution procedure.

The procedure used is to compute the global mass deficit and then add a constant

value to the outflow boundary u-velocities to restore global mass conservation. Al-

ternatively, corrections can be applied at every streamwise location by considering

control volumes whose boundaries are the inflow plane, the top and bottom walls

of the channel, and the i=constant line at the specified streamwise location. The

artificially-imposed convection has the effect of speeding up the development of the

pressure field, whose normal development is diffusion-dominated. It is interesting to

note that this physically-motivated approach is in essence an acceleration of conver-

gence of the line-iterative method via the technique called additive correction [45, 69].

The strategy is to adjust the residual on the current line to zero by adding a con-

stant to all the unknowns in the line. This procedure is done for every line, for every

iteration, and generally produces improvement in the SLUR solution of a system of

equations. Kelkar and Patankar [45] have gone one step further by applying additive

corrections like an injection step of a multigrid scheme, a so-called block correction

technique. This technique is exploited to its fullest by Hutchinson and Raithby [42].

Given a fine-grid solution and a coarse grid, discretized equations for the correction

quantities on the coarse grid are obtained by summing the equations for each of the

fine-grid cells within a given coarse grid cell. A solution is then obtained (by direct

methods in [45]) which satisfies conservation of mass and momentum. The corrections

are then distributed uniformly to the fine grid cells which make up the coarse grid











cell, and the iterative solution on the fine grid is resumed. However, experiences have

shown that the net effect of such a treatment for complex flow problems is limited.
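
For concreteness, a minimal sketch of the global version of the present fix is given below. It assumes the outflow is the east boundary, a uniform spacing Δy, and hypothetical array names; a single constant is added to every outflow u-velocity so that the outflow mass flux matches the prescribed inflow before the p'-system is formed.

! Restore global mass conservation by adding one constant to all the
! extrapolated outflow u-velocities (outflow assumed to be the east
! boundary; names and the uniform spacing dy are assumptions of this sketch).
subroutine enforce_global_mass(nj, rho, dy, mdot_in, u_out)
  implicit none
  integer, intent(in)    :: nj
  real(8), intent(in)    :: rho, dy
  real(8), intent(in)    :: mdot_in          ! prescribed inflow mass flux
  real(8), intent(inout) :: u_out(nj)        ! outflow u-velocities after extrapolation
  real(8) :: mdot_out, du
  mdot_out = rho*dy*sum(u_out)
  du = (mdot_in - mdot_out) / (rho*dy*dble(nj))   ! uniform additive correction
  u_out = u_out + du
end subroutine enforce_global_mass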

Figure 2.7 illustrates the improved convergence rate of the continuity equation for

the inner and outer loops, in the backward-facing step flow, when conservation of mass

is explicitly enforced. The inner-loop data is from the 10th outer-loop iteration. In

Figure 2.7, the cavity flow convergence path is also shown to facilitate the comparison.

For the back-step, the overall convergence rate is improved by an order of magnitude,

becoming slightly faster than the cavity flow case. This result reflects the improved

inner-loop performance, also shown in Figure 2.7. The improved performance for the

pressure-correction equation comes at the expense of a slightly slower convergence

rate for the momentum equations, because of the nonlinear convection term.

In short, it has been shown that a consistency condition, which is physically the re-

quirement of global mass conservation, is critical for meaningful pressure-corrections

to be guaranteed. Given natural (velocity) boundary conditions, which lead to a

linearly dependent system of pressure-correction equations, satisfaction of the global

continuity constraint is the only way that a solution can exist, and therefore the only

way that the inner-loop residuals can be driven to zero. For the model backward-

facing step flow in a channel with length L = 4 and a 21 x 9 mesh, the mass-

conservation constraint is enforced globally or at every streamwise location by an

additive-correction technique. This technique produces a 10-fold increase in the con-

vergence rate. Physically, modifying the u velocities has the same effect as adding

a convection term to the Poisson equation for the p'-field, which otherwise develops

very slowly. A coarse grid size was used to demonstrate the need of enforcing global

mass conservation. On a finer grid, this issue becomes more critical. In the next

section, the solution accuracy aspects related to mass conservation will be addressed,

and the computations will be conducted with more adequate grid resolution.











2.5 Numerical Treatment of Outflow Boundaries

Continuing with the theme of well-posedness, the next numerical issue to be dis-

cussed is the choice of outflow boundary location. If fluid flows into the domain at

a boundary where extrapolation is applied, then, traditionally, the problem is not

considered to be well-posed, because the information which is being transported into

the domain does not participate in the solution to the problem [60]. Numerically,

however, accurate solutions can be obtained using first-order extrapolation for the ve-

locity components on a boundary where inflow is occurring [72]. Here open boundary

treatment for both steady and time-dependent flow problems is investigated further.

Figure 2.9 and 2.8 present streamfunction contours for a time-dependent flow

problem, impulsively started backward-facing step flow, using central-differencing

for the convection terms and first-order backward-differencing in time. A parabolic

inflow velocity profile is specified, while outflow boundary velocities are obtained by

first-order extrapolation. The Reynolds number based on the average inflow velocity

Uavg and the channel height H is 800. The expansion ratio H/h is 2 as in the model
problem described in Figure 2.2. Time-accurate simulations were performed for two

channel configurations, one with length L = 8 (81 x 41 mesh) and the other with

length L = 16 (161 x 41 mesh). This flow problem has been the subject of some

recent investigations focusing on open boundary conditions [30, 31].

For each time step, the SIMPLE algorithm is used to iteratively converge on a

solution to the unsteady form of the governing equations, explicitly enforcing global

conservation of mass during the course of iterations. In the present study, convergence

was declared for a given time step when the global residuals had been reduced below

10-4. The time-step size was twice the viscous time scale in the y-direction, i.e.











Δt = 2Δy²/ν. Thus a fluid particle entering the domain at the average velocity u =

1 travels 2 units downstream during a time-step.

Figure 2.8 shows the formation of alternate bottom/top wall recirculation regions

during startup which gradually become thinner and elongated as they drift down-

stream. For the L = 16 simulation (Figure 2.8), the transient flowfield has as many

as four separation bubbles at T = 32, the latter two of which are eventually washed

out of the domain. In the L = 8 simulation (Figure 2.9) the streamfunction plots are

at times corresponding to those shown in Figure 2.8. Note that between T = 11 and

T = 32, a secondary bottom wall recirculation zone forms and drifts downstream,

exiting without reflection through the downstream boundary. The time evolution of

the flowfield for the L = 8 and L = 16 simulations is virtually identical.

As can be observed, the facts that a shorter channel length was used in Figure 2.9

and that a recirculating cell may go through the open boundary do not affect the

solutions. Figure 2.10 compares the computed time histories of the bottom wall

reattachment and top wall separation points between the two computations. The

L = 8 and L = 16 curves are perfectly overlapped. The steady-state solutions for

both the L = 8 and L = 16 channel configurations are also shown in Figure 2.9

and 2.8, respectively. Although the outflow boundary cuts the top wall separation

bubble approximately in half, there is no apparent difference between the computed

streamfunction contours for 0 < x < 8. Furthermore, the convergence rate is not

affected by the choice of outflow boundary location.

Figure 2.11 compares the steady-state u and v velocity profiles at x = 7 be-

tween the two computations. The accuracy of the computed results is assessed by

comparison with an FEM numerical solution reported by Gartling [27]. Figure 2.11

establishes quantitatively that the two simulations differ negligibly over 0 < x < 8

(the v profile differs on the order of 10⁻³). The velocity scale for the problem is 1.











Neither v profile agrees perfectly with the solution obtained by Gartling, which may

be attributed to the need for conducting further grid refinement studies in the present

work and/or Gartling's work.

Evidently the location of the open boundary is not critical to obtaining a con-

verged solution. This observation indicates that the downstream information is com-

pletely accounted for by the continuity equation. The correct pressure field can de-

velop because the system of p'-equations requires only the boundary mass flux specifi-

cation. If the global continuity constraint is satisfied, the pressure-correction equation

is consistent regardless of whether there is inflow or outflow at the boundary where

extrapolation is applied. The numerical well-posedness of the open boundary com-

putation results in virtually identical flowfield development for the time-dependent

L = 8 and L = 16 simulations as well as steady-state solutions which agree with each

other and follow closely Gartling's benchmark data [27].

2.6 Concluding Remarks

In order for the SIMPLE pressure-correction method to be a well-posed numer-

ical procedure for open boundary problems, explicit steps must be taken to ensure

the numerical consistency of the pressure-correction system of equations during the

course of iterations. For the discrete problem with the natural boundary treatment

for pressure, i.e. normal velocity specified at all boundaries, global mass conserva-

tion is the solvability constraint which must be satisfied in order that the system of

p'-equations is consistent. Without a globally mass-conserving procedure enforced

during each iterative step, the utility of the pressure-corrections obtained at each it-

eration cannot be guaranteed. Overall convergence may still occur, albeit very slowly.

In this regard, the poor outer-loop convergence behavior simply reflects the (poor)

convergence rate of the inner-loop iterations of the SLUR technique. In general, the











inner-loop residual is fixed on the value of the initial level of inconsistency of the

system of p'-equations which physically is the global mass deficit. The convergence

rate can be improved dramatically by explicitly enforcing mass conservation using

an additive-correction technique. The results of numerical simulations of backward-

facing step flow illustrate and support these conclusions.

The mass-conservation constraint also has implications for the issue of proper

numerical treatment of open boundaries where inflow is occurring. Specifically, the

conventional viewpoint that inflow cannot occur at open boundaries without Dirich-

let prescription of the inflow variables can be rebutted, based on the grounds that

the numerical problem is well-posed if the normal velocity components satisfy the

continuity constraint.


















































Figure 2.1. Staggered grid u control volume and the nearby variables which are
involved in the discretization of the u-momentum equation.













Figure 2.2. Description of two model problems. Both are at Re = 100. The cavity
is a square with a top wall sliding to the left, while the backward-facing step is a
4 x 1 rectangular domain with an expansion ratio H/h = 2, and a parabolic inflow
(average inflow velocity = 1). The cavity flow grid is 9 x 9 and the step flow grid is
21 x 9. The meshes and the velocity vectors are shown.









































Figure 2.3. Model 3 x 3 computational domain with numbered control volumes, for
discussion of Eq. 2.22. The staggered velocity components which refer to control
volume 5 are also indicated.


(Control volumes are numbered left to right and bottom to top: p'_1, p'_2, p'_3 on the bottom row; p'_4, p'_5, p'_6 in the middle; p'_7, p'_8, p'_9 on top. The face velocities u_w, u_e, v_s, v_n of control volume 5 are indicated.)


























Figure 2.4. Outer-loop convergence paths for the Re = 100 lid-driven cavity and
backward-facing step flows, using central-differencing for the convection terms.
Legend: solid line, p' equation; dashed line, u-momentum equation; dash-dot line, v-momentum equation.































Figure 2.5. Inner-loop convergence paths for the Re = 100 lid-driven cavity and
backward-facing step flows. The vertical axis is the log₁₀ of the ratio of the current
residual to the initial residual. Legend: solid line, p' equation; dashed line,
u-momentum equation; dash-dot line, v-momentum equation.
























(Two panels: inner-loop iterations for the cavity flow and for the back-step flow.)


Figure 2.6. Variation of p' with inner-loop iterations. The dashed line is the value
of p' at the lower-left control volume, while the solid line is the difference between
p' at the lower-left and p' at the upper-right control volumes.































Figure 2.7. Outer-loop and inner-loop convergence paths of the p' equation for the
backward-facing step model problem, with and without enforcing the continuity con-
straint. (1) conservation of mass not enforced; (2) continuity enforced globally; (3)
cavity flow.



















Figure 2.8. Time-dependent flowfield for impulsively started backward-facing step
flow, Re = 800, in the domain of length L = 16. Streamfunction contours are plotted
at several instants during the evolution to the steady state.














T = 11        T = 15        T = 20        T = 32        T = ∞








Figure 2.9. Time-dependent flowfield for impulsively started backward-facing step
flow, Re = 800. The domain has length L = 8. Streamfunction contours are plotted
at several instants during the evolution to the steady-state, which is the last figure.


















Figure 2.10. Time-dependent location of bottom wall reattachment point and top wall
separation point for Re = 800 impulsively started backward-facing step flow. The
curves for both L = 8 and L = 16 computations are shown; they overlap identically.














Figure 2.11. Comparison of the u- and v-component velocity profiles at x = 7.0 for
the L = 16 and L = 8 backward-facing step simulations at Re = 800, with central-
differencing. (o) indicates the grid-independent FEM solution obtained by Gartling.
The v profile is scaled up by 10³.
















CHAPTER 3
EFFICIENCY AND SCALABILITY ON SIMD COMPUTERS

The previous chapter considered an issue which was important because of its im-

plications for the convergence rate in open boundary problems. The present chapter

shifts gears to focus on the cost and efficiency of pressure-correction methods on

SIMD computers.

As discussed in Chapter 1, the eventual goal is to understand the indirect cost [23],

i.e. the parallel run time, of such methods on SIMD computers, and how this cost

scales with the problem size and the number of processors. The run time is just the

number of iterations multiplied by the cost per iteration. This chapter considers the

cost per iteration.

3.1 Background

The discussion of SIMD computers in Chapter 1 indicated similarities in the

general layout of such machines and in the factors which affect program performance.

More detail is given in this section to better support the discussion of results.

3.1.1 Speedup and Efficiency

Speedup S is defined as

S = T_1 / T_p \qquad (3.1)

where Tp is the measured run time using np processors. In the present work T1 is

the run time of the parallel algorithm on one processor, including both serial and

parallel computational work, but excluding the front-end-to-processor and interpro-

cessor communication. On a MIMD machine it is sometimes possible to actually time












the program on one processor, but each SIMD processor is not usually a capable

serial computer by itself, so Ti must be estimated. The timing tools on the CM-2

and CM-5 are very sophisticated, and can separately measure the time elapsed by

the processors doing computation, doing various kinds of communication, and doing

nothing (waiting for an instruction from the front-end, which might be finishing up

some serial work before it can send another code block). Thus, it is possible to make

a reasonable estimate for T1.

Parallel efficiency is the ratio of the actual speedup to the ideal (np), which reflects

the overhead costs of doing the computation in parallel:

E = \frac{S_{actual}}{S_{ideal}} = \frac{T_1/T_p}{n_p} \qquad (3.2)

If Tcomp is the time in seconds spent by each of the np processors doing useful work

(computation), Tinter-proc is the time spent by the processors doing interprocessor

communication, and Tfe-to-proc is the time elapsed through front-end-to-processor

communication, then each of the processors is busy a total of Tcomp + Tinter-proc

seconds and the total run time on multiple processors is Tcomp + Tinter-proc + Tfe-to-proc

seconds. Assuming that the parallelism is high, i.e. a high percentage of the virtual

processors are not idle, a single processor would need npTcomp time to do the same

work. Thus, T1 = npTcomp, and from Eq. 3.2 E can be expressed as

E = \frac{1}{1 + (T_{inter-proc} + T_{fe-to-proc})/T_{comp}} = \frac{1}{1 + T_{comm}/T_{comp}} \qquad (3.3)

Since time is work divided by speed, E depends on both machine-related factors and

the implementational factors through Eq. 3.3. High parallel efficiency is not neces-

sarily a product of fast processors or fast communications considered alone, instead it

is the relative speeds that are important, and the relative amount of communication

and computation in the program. Consider the machine-related factors first.
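
Before doing so, a small numerical illustration of Eq. 3.3 may be useful; the timing values in the sketch below are invented for the example and are not measurements from this study.

! Parallel efficiency from measured time components, Eq. 3.3.
! The timing values below are illustrative only.
program efficiency_example
  implicit none
  real(8) :: t_comp, t_inter, t_fe, e
  t_comp  = 3.0d0      ! s per processor of useful computation (assumed)
  t_inter = 0.5d0      ! s of interprocessor communication (assumed)
  t_fe    = 0.5d0      ! s of front-end-to-processor communication (assumed)
  e = 1.0d0 / (1.0d0 + (t_inter + t_fe)/t_comp)
  print '(a,f6.3)', 'E = ', e    ! prints E =  0.750
end program efficiency_example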











3.1.2 Comparison Between CM-2, CM-5, and MP-1

A 32-node CM-5 with vector units, a 16k processor CM-2, and a 1k processor

MP-1 were used in the present study. The CM-5 has 4 GBytes total memory, while

the CM-2 has 512 Mbytes, and the MP-1 has 64 MBytes. The peak speeds of these

computers are 4, 3.5, and 0.034 Gflops, respectively, in double precision. Per proces-

sor, the peak speeds are 32, 7, and 0.033 Mflops, with memory bandwidths of 128,

25, and 0.67 Mbytes/s [67, 83]. Clearly these are computers with very different capa-

bilities, even taking into account the fact that peak speeds, which are based only on

the processor speed under ideal conditions, are not an accurate basis for comparison.

In the CM-2 and CM-5 the front-end computers are Sun-4 workstations, while

in the MP-1 the front-end is a Decstation 5000. From Eq. 3.3, it is clear that the

relative speeds of the front-end computer and the processors are important. Their

ratio determines the importance of the front-end-to-processor type of communication.

On the CM-2 and MP-1, there is just one of these intermediate processors, called

either a sequencer or an array control unit, respectively, while on the 32-node CM-5

the 32 SPARC microprocessors have the role of sequencers.

Each SPARC node broadcasts to four vector units (VUs) which actually do the

work. Thus a 32-node CM-5 has 128 independent processors. In the CM-2 the "pro-

cessors" are more often called processing elements (PEs), because each one consists of

a floating-point unit coupled with 32 bit-serial processors. Each bit-serial processor

is the memory manager for a single bit of a 32-bit word. Thus, the 16k-processor

CM-2 actually has only 512 independent processing elements. This strange CM-2

processor design came about basically as a workaround which was introduced to im-

prove the memory bandwidth for floating-point calculations [66]. Compared to the

CM-5 VUs, the CM-2 processors are about one-fourth as fast, with larger overhead











costs associated with memory access and computation. The MP-1 has 1024 4-bit

processors-compared to either the CM-5 or CM-2 processors, the MP-1 processors

are very slow. The generic term "processing element" (PE), which is used occasion-

ally in the discussion below, refers to either one of the VUs, one of the 512 CM-2

processors, or one of the MP-1 processors, whichever is appropriate.

For the present study, the processors are either physically or logically imagined

to be arranged as a 2-d mesh, which is a layout that is well-supported by the data

networks of each of the computers. The data network of the 32-node CM-5 is a

fat tree of height 3, which is similar to a binary tree except the bandwidth stays

constant upwards from height 2 at 160 MBytes/s (details in [83]). One can expect

approximately 480 MBytes/s for regular grid communication patterns (i.e. between

nearest-neighbor SPARC nodes) and 128 MBytes/s for random (global) communica-

tions. The randomly-directed messages have to go farther up the tree, so they are

slower. The CM-2 network (a hypercube) is completely different from the fat-tree net-

work and its performance for regular grid communication between nearest-neighbor

processors is roughly 350 MBytes/s [67]. The grid network on the CM-2 is called

NEWS (North-East-West-South). It is a subset of the hypercube connections se-

lected at run time. The MP-1 has two networks: regular communications use X-Net

(1.25 GBytes/s, peak) which connects each processor to its eight nearest neighbors,

and random communications use a 3-stage crossbar (80 MBytes/s, peak).

To summarize the relative speeds of these three SIMD computers it is sufficient

for the present study to observe that the MP-1 has very fast nearest-neighbor com-

munication compared to its computational speed, while the exact opposite is true for

the CM-2. The ratio of nearest-neighbor communication speed to computation speed

is smaller still for the CM-5 than the CM-2. Again, from Eq. 3.3, one expects that

these differences will be an important factor influencing the parallel efficiency.











3.1.3 Hierarchical and Cut-and-Stack Data Mappings

When there are more array elements (grid points) than processors, each processor

handles multiple grid points. Which grid points are assigned to which processors is

determined by the "data-mapping," also called the data layout. The processors repeat

any instructions the appropriate number of times to handle all the array elements

which have been assigned to it. A useful idealization for SIMD machines, however,

is to pretend there are always as many processors as grid points. Then one speaks of

the "virtual processor" ratio (VP) which is the number of array elements assigned to

each physical processor. The way the data arrays are partitioned and mapped to the

processors is a main concern for developing a parallel implementation. The layout of

the data determines the amount of communication in a given program.

When the virtual processor ratio is 1, there are an equal number of processors

and array elements and the mapping is just one-to-one. When VP > 1 the mapping

of data to processors is either "hierarchical," in CM-Fortran, or "cut-and-stack" in

MP-Fortran. These mappings are also termed "block" and "cyclic" [85], respectively,

in the emerging High-Performance Fortran standard. The relative merits of these

different approaches have not been completely explored yet.

In cut-and-stack mapping, nearest-neighbor array elements are mapped to nearest-

neighbor physical processors. When the number of array elements exceeds the num-

ber of processors, additional memory layers are created. VP is just the number of

memory layers. In the general case, nearest-neighbor virtual processors (i.e. array

elements) will not be mapped to the same physical processor. Thus, the cost of a

nearest-neighbor communication of distance one will be proportional to VP, since the

nearest-neighbors of each virtual processor will be on a different physical processor.

In the hierarchical mapping, contiguous pieces of an array ("virtual subgrids") are











mapped to each processor. The "subgrid size" for the hierarchical mapping is syn-

onymous with VP. The distinction between hierarchical and cut-and-stack mapping

is clarified by Figure 3.1.
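
The distinction can also be stated in index arithmetic. The short sketch below, written for a one-dimensional array of n elements on p processors with n divisible by p, prints which physical processor owns each element under a block (hierarchical) and a cyclic (cut-and-stack) mapping; it is an illustration of the two rules only, not the actual CM or MP run-time mapping.

! Owner processor of global element i (0-based) under the two mappings,
! for n elements distributed over p processors (n divisible by p).
! "Hierarchical"/block:   contiguous chunks of n/p elements per processor.
! "Cut-and-stack"/cyclic: elements dealt out round-robin.
program mapping_example
  implicit none
  integer :: n, p, i, vp
  n = 16
  p = 4
  vp = n / p                     ! virtual processor ratio (memory layers / subgrid size)
  do i = 0, n - 1
     print '(a,i3,a,i2,a,i2)', 'element', i, ':  block owner =', i/vp, &
           ',  cyclic owner =', mod(i, p)
  end do
end program mapping_example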

In hierarchical mapping, for VP > 1, each virtual processor has nearest-neighbors

in the same virtual subgrid, that is, on the same physical processor. Thus, for hier-

archical mapping on the CM-2, interprocessor communication breaks down into two

types (with different speeds)-on-processor and off-processor. Off-processor commu-

nication on the CM-2 has the NEWS speed given above, while on-processor communi-

cation is somewhat faster, because it is essentially just a memory operation. A more

detailed presentation and modelling of nearest-neighbor communication costs for the

hierarchical mapping on the CM-2 is given in [3]. The key idea is that with hierar-

chical mapping on the CM-2 the relative amount of on-processor and off-processor

communication is the area to perimeter ratio of the virtual subgrid.

For the CM-5, there are three types of interprocessor communication: (1) between

virtual processors on the same processor (that is, the same VU), (2) between virtual

processors on different VUs but on the same SPARC node, and (3) between virtual

processors on different SPARC nodes. Between different SPARC nodes (number 2),

the speed is 480 MBytes/s as mentioned above. On the same VU the speed is 16

GBytes/s. (The latter number is just the aggregate memory bandwidth of the 32-

node CM-5.) Thus, although off-processor NEWS communication is slow compared

to computation on the CM-2 and CM-5, good efficiencies can still be achieved as a

consequence of the data mapping which allows the majority of communication to be

of the on-processor type.











3.2 Implementional Considerations

The cost per SIMPLE iteration depends on the choice of relaxation method

(solver) for the systems of equations, the number of inner iterations (,,, v,, and vc),
the computation of coefficients for each system of equations, the correction step, and

the convergence checking and serial work done in program control. The pressure-

correction equation, since it is not underrelaxed, typically needs to be given more

iterations than the momentum equations, and consequently most of the effort is ex-

pended during this step of the SIMPLE method. This is another reason why the

convergence rate of the p'-equations discussed in Chapter 2 is important. Typically

ν_u and ν_v are the same and are ≤ 3, and ν_c ≤ 5ν_u.

In developing a parallel implementation of the SIMPLE algorithm, the first con-

sideration is the method of solving the u, v, and p' systems of equations. For serial

computations, successive line-underrelaxation using the tridiagonal matrix algorithm

(TDMA, whose operation count is O(N)) is a good choice because the cost per it-

eration is optimal and there is long-distance coupling between flow variables (along

lines), which is effective in promoting convergence in the outer iterations. The TDMA

is intrinsically serial. For parallel computations, a parallel tridiagonal solver must be

used (parallel cyclic reduction in the present work). In this case the cost per it-

eration depends not only on the computational workload (O(N log₂ N)) but also on

the amount of communication generated by the implementation on a particular ma-

chine. For these reasons, timing comparisons are made for several implementations

of both point- and line-Jacobi solvers used during the inner iterations of the SIMPLE

algorithm.
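
For reference, a serial Fortran sketch of the TDMA applied along one line is given below; a(1) and c(n) are taken to be zero, and the system is assumed to be diagonally dominant so that no pivoting is needed.

! Thomas algorithm (TDMA) for one tridiagonal system
!   a(i)*x(i-1) + b(i)*x(i) + c(i)*x(i+1) = d(i),  with a(1) = c(n) = 0.
! Serial O(n) forward elimination followed by back substitution;
! assumes a diagonally dominant system, so no pivoting is needed.
subroutine tdma(n, a, b, c, d, x)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: a(n), b(n), c(n), d(n)
  real(8), intent(out) :: x(n)
  real(8) :: cp(n), dp(n), m
  integer :: i
  cp(1) = c(1) / b(1)
  dp(1) = d(1) / b(1)
  do i = 2, n
     m     = b(i) - a(i)*cp(i-1)
     cp(i) = c(i) / m
     dp(i) = (d(i) - a(i)*dp(i-1)) / m
  end do
  x(n) = dp(n)
  do i = n-1, 1, -1
     x(i) = dp(i) - cp(i)*x(i+1)
  end do
end subroutine tdma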











Generally, point-Jacobi iteration is not sufficiently effective for complex flow prob-

lems. However, as part of a multigrid strategy, good convergence rates can be ob-

tained (see Chapters 4 and 5). Furthermore, because it only involves the fastest type

of interprocessor communication, that which occurs between nearest-neighbor pro-

cessors, point-Jacobi iteration provides an upper bound for parallel efficiency, against

which other solvers can be compared.
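
A sketch of one point-Jacobi sweep in whole-array form is shown below, with the standard Fortran EOSHIFT intrinsic standing in for the data-parallel nearest-neighbor shifts; coefficients that refer to neighbors outside the domain are assumed to have been set to zero, so boundary and interior points are updated by the same statement.

! One point-Jacobi sweep for  aP*phi = aE*phiE + aW*phiW + aN*phiN + aS*phiS + b,
! written with whole-array shifts; coefficients that refer to neighbors outside
! the domain are assumed to be zero, so boundary and interior points are
! updated by the same statement (the uniform treatment discussed below).
subroutine jacobi_sweep(ni, nj, ap, ae, aw, an, as, b, phi)
  implicit none
  integer, intent(in)    :: ni, nj
  real(8), intent(in), dimension(ni,nj) :: ap, ae, aw, an, as, b
  real(8), intent(inout) :: phi(ni,nj)
  phi = ( ae*eoshift(phi,  1, dim=1) + aw*eoshift(phi, -1, dim=1) &
        + an*eoshift(phi,  1, dim=2) + as*eoshift(phi, -1, dim=2) &
        + b ) / ap
end subroutine jacobi_sweep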

The second consideration is the treatment of boundary computations. In the

present implementation, the coefficients and source terms for the boundary control

volumes are computed using the interior control volume formula and mask arrays.

Oran et al. [57] have called this trick the uniform boundary condition approach.

All coefficients can be computed simultaneously. The problem with computing the

boundary coefficients separately is that some of the processors are idle, which de-

creases E. For the CM-5, which is "synchronized MIMD" instead of strictly SIMD,

there exists limited capability to handle both boundary and interior coefficients si-

multaneously without formulating a single all-inclusive expression. However, this

capability cannot be utilized if either the boundary or interior formulas involve in-

terprocessor communication, which is the case here. As an example of the uniform

approach, consider the source terms for the north boundary u control volumes, which

are computed by the formula

b = a_N u_N + (p_w - p_e)\,\Delta y \qquad (3.4)


Recall that a_N represents the discretized convective and diffusive flux terms, u_N
is the boundary value, and, in the pressure gradient term, Δy is the vertical dimension
of the u control volume and p_w/p_e are the west/east u-control-volume face pressures

on the staggered grid. Similar modifications show up in the south, east, and west

boundary u control volume source terms. To compute the boundary and interior











source terms simultaneously, the following implementation is used:


b = a_{boundary}\, u_{boundary} + (p_w - p_e)\,\Delta y \qquad (3.5)


where

u_{boundary} = u_N I_N + u_S I_S + u_E I_E + u_W I_W \qquad (3.6)

and

a_{boundary} = a_N I_N + a_S I_S + a_E I_E + a_W I_W \qquad (3.7)

I_N, I_S, I_E, and I_W are the mask arrays, which have the value 1 for the respective
boundary control volumes and 0 everywhere else. They are initialized once, at the

beginning of the program. Then, every iteration, there are four extra nearest-neighbor

communications. A comparison of the uniform approach with an implementation that

treats each boundary separately is discussed in the results.
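
A whole-array sketch of Eqs. 3.5-3.7 is given below, with plain Fortran array syntax standing in for the data-parallel statements; the mask and boundary-value arrays are assumed to have been initialized elsewhere as described above.

! Uniform boundary treatment of the u-momentum source term, Eqs. 3.5-3.7,
! in whole-array form.  i_n, i_s, i_e, i_w are 0/1 mask arrays; u_n, u_s,
! u_e, u_w hold boundary values (meaningful only where the mask is 1).
subroutine u_source(ni, nj, an, as, ae, aw, u_n, u_s, u_e, u_w, &
                    i_n, i_s, i_e, i_w, pw, pe, dy, b)
  implicit none
  integer, intent(in) :: ni, nj
  real(8), intent(in), dimension(ni,nj) :: an, as, ae, aw
  real(8), intent(in), dimension(ni,nj) :: u_n, u_s, u_e, u_w
  real(8), intent(in), dimension(ni,nj) :: i_n, i_s, i_e, i_w
  real(8), intent(in), dimension(ni,nj) :: pw, pe
  real(8), intent(in)  :: dy
  real(8), intent(out) :: b(ni,nj)
  real(8), dimension(ni,nj) :: u_bnd, a_bnd
  u_bnd = u_n*i_n + u_s*i_s + u_e*i_e + u_w*i_w      ! Eq. 3.6
  a_bnd = an*i_n + as*i_s + ae*i_e + aw*i_w          ! Eq. 3.7
  b     = a_bnd*u_bnd + (pw - pe)*dy                 ! Eq. 3.5
end subroutine u_source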

3.3 Numerical Experiments

The SIMPLE algorithm for two-dimensional laminar flow has been timed on a

range of problem sizes from 8 x 8 to 1024 x 1024 which, on the CM-5, covers up

to VP = 8192. The convection terms are central-differenced. A fixed number (100)

of outer iterations are timed using as a model flow problem the lid-driven cavity

flow at Re = 1000. The timings were made with the "Prism" timing utility on

the CM-2 and CM-5, and the "dpuTimer" routines on the MP-1 [52, 86]. These

utilities can be inaccurate if the front-end machine is heavily loaded, which was the

case with the CM-2. Thus, on the CM-2 all cases were timed three times and the

fastest times were used, as recommended by Thinking Machines [82]. Prism times

every code block and accumulates totals in several categories, including computation

time for the nodes (Tcomp), "NEWS" communication (Tnews), and irregular-pattern

"SEND" communication. Also it is possible to infer Tfe-to-proc from the difference











between the processor busy time and the elapsed time. In the results Tcomm is the

sum of the "NEWS" and "SEND" interprocessor times. The front-end-to-processor

communication is separate. Additionally, the component tasks of the algorithm have

been timed, namely the coefficient computations (Tcoeff), the solver (Tsolver), and the

velocity-correction and convergence-checking parts.

3.3.1 Efficiency of Point and Line Solvers for the Inner Iterations

Figure 3.2, based on timings made on the CM-5, illustrates the difference in

parallel efficiency for SIMPLE using point-Jacobi and line-Jacobi iterative solvers. E

is computed from Eq. 3.3 by timing Tcomm and Tcomp introduced above. Problem size

is given in terms of the virtual processor ratio VP previously defined.

There are two implementations each with different data layouts, for point-Jacobi

iteration. One ignores the distinction between virtual processors which are on the

same physical processor and those which are on different physical processors. Each

array element is treated as if it is a processor. Thus, interprocessor communication

is generated whenever data is to be moved, even if the two virtual processors do-

ing the communication happen to be on the same physical processor. To be more

precise, a call to the run-time communication library is generated for every array el-

ement. Then, those array elements (virtual processors) which actually reside on the

same physical processor are identified and the communication is done as a memory

operation-but the unnecessary overhead of calling the library is incurred. Obviously

there is an inefficiency associated with pretending that there are as many processors

as array elements, but the tradeoff is that this is the most straightforward, and indeed

the intended, way to do the programming. In Figure 3.2, this approach is labelled

"NEWS," with the symbol "o." The other implementation is labelled "on-VU," with











the symbol "+," to indicate that interprocessor communication between virtual pro-

cessors on the same physical processor is being eliminated-the programming is in a

sense being done "on-VU."

To indicate to the compiler the different layouts of the data which are needed,

the programmer inserts compiler directives. For the "NEWS" version, the arrays are

laid out as shown in this example for a 1k x 1k grid and an 8 x 16 processor layout

on the CM-5:

REAL*8 A(1024,1024)
CMF$ LAYOUT A(:BLOCK=128 :PROCS=8, :BLOCK=64 :PROCS=16)


Thus, the subgrid shape is 128 x 64, with a subgrid size (VP) of 8192 (this hap-

pens to be the biggest problem size for my program on a 32-node CM-5 with 4GBytes

of memory). When shifting all the data to their east nearest-neighbor, for example,

by far the large majority of transfers are on-VU and could be done without real inter-

processor communication. But there are only 2 dimensions in A, so that data-parallel

program statements cannot specifically access certain array elements, i.e. the ones on

the perimeter of the subgrid. Thus it is not possible with the "NEWS" layout to

treat interior virtual processors differently from those on the perimeter, and conse-

quently data shifts between the interior virtual processors generate interprocessor

communication even though it is unnecessary.

In the "on-VU" version, a different data layout is used which makes explicit to the

compiler the boundary between physical processors. The arrays are laid out without

virtual processors:

CMF$ LAYOUT A(:SERIAL,:SERIAL,:BLOCK=1 :PROCS=8,:BLOCK=1 :PROCS=16)


The declaration must be changed accordingly, to A(128,64,8,16). Normally it is

inconvenient to work with the arrays in this manner. Thus the approach taken here











is to use an "array alias" of A [84]. In other words, this is an EQUIVALENCE func-

tion for the data-parallel arrays (similar to the Fortran77 EQUIVALENCE concept),

which equates A(1024,1024) with A(128,64,8,16), with the different LAYOUTs given

above. It is the alias instead of the original A which is used in the on-VU point-

Jacobi implementation. In the solver, the "on-VU" layout is used; everywhere else,

the more convenient "NEWS" layout is used. The actual mechanism by which the

equivalencing of distributed arrays can be accomplished is not too difficult to under-

stand. The front-end computer stores "array descriptors," which contain the array

layout, the starting address in processor memory, and other information. The actual

layout in each processors' memory is linear and doesn't change, but multiple array

descriptors can be generated for the same data. This descriptor multiplicity is what

array aliasing accomplishes. With the "on-VU" programming style, the compiler

does not generate communication when the shift of data is along a SERIAL axis.

Thus, interprocessor communication is generated only when the virtual processors

involved are on different physical processors, i.e. only when it is truly necessary. The

difference in the amount of communication is substantial for large subgrid sizes.

For both the "NEWS" and the "on-VU" curves in Figure 3.2, E is initially very

low, but as VP increases, E rises until it reaches a peak value of about 0.8 for the

"NEWS" version and 0.85 for the "on-VU" version. The trend is due to the amor-

tization of the front-end-to-processor and off-VU (between VUs which are physically

under control of different SPARC nodes) communication. The former contributes a

constant overhead cost per Jacobi iteration to Tcomm, while the latter has a VP^(1/2)

dependency [3]. However, it does not appear from Figure 3.2 that these two terms'

effects can be distinguished from one another.

For VP > 2k, the CM-5 is computing roughly 3/4 of the time for the implementa-

tion which uses the "NEWS" version of point-Jacobi, with the remainder split evenly











between front-end-to-processor communication and on-VU interprocessor communi-

cation. It appears that the "on-VU" version has more front-end-to-processor com-

munication per iteration, so there is, in effect, a price of more front-end-to-processor

communication to pay in exchange for less interprocessor communication. Conse-

quently it takes VP > 4k to reach peak efficiency instead of 2k with the "NEWS"

version. For VP > 4k, however, E is about 5% to 10% higher than for the "NEWS"

version because the on-VU communication has been replaced by straight memory

operations.

The observed difference would be even greater if a larger part of the total parallel

run time was spent in the solver. For the large VP cases in Figure 3.2, approximately

equal time was spent computing coefficients and solving the systems of equations.

"Typical" numbers of inner iterations were used, 3 each for the u and v momentum

equations, and 9 for the p' equation. From Figure 3.2, then, it appears that the ad-

vantage of the "on-VU" version over the "NEWS" version of point-Jacobi relaxation

within the SIMPLE algorithm is around 0.1 in E, for large problem sizes.

Red/black analogues to the "NEWS" and "on-VU" versions of point-Jacobi iter-

ation have also been tested. Red/black point iteration done in the "on-VU" manner

does not generate any additional front-end-to-processor communication, and there-

fore takes almost an identical amount of time as point-Jacobi. Thus red/black point

iterations are recommended when the "on-VU" layout is used due to their improved

convergence rate. However, with the "NEWS" layout, red/black point iteration gen-

erates two code blocks instead of one, and reduces by a factor of 2 the amount of computation
per code block. This results in a substantial (≈ 35% for the VP = 8k case) in-

crease in run time. Thus, if using "NEWS" layouts, red/black point iteration is not

cost-effective.











There are also two implementations of line-Jacobi iteration. In both, one inner

iteration consists of forming a tridiagonal system of equations for the unknowns in

each vertical line by moving the east/west terms to the right-hand side, solving the

multiple systems of equations simultaneously, and repeating the procedure for the

horizontal lines.

In the first version, parallel cyclic reduction is used to solve the multiple tridiag-

onal systems of equations (see [44] for a clear presentation). This involves combining

equations to decouple the system into even and odd equations. The result is two

tridiagonal systems of equations each half the size of the original. The reduction step

is repeated log2 N times, where N is the number of unknowns in each line. Thus, the

computational operation count is O(Nlog2N). Interprocessor communication occurs

for every unknown for every step, thus the communication operation count is also

O(Nlog2N). However, the distance for communication increases every step of the re-

duction by a factor of 2. For the first step, nearest-neighbor communication occurs,

while for the second step, the distance is 2, then 4, etc. Thus, the net communi-

cation speed is slower than the nearest-neighbor type of communication. Figure 3.2

confirms this argument-E peaks at about 0.5 compared to 0.8 for point-Jacobi it-

eration. In other words, for VP > 4k, interprocessor communication takes as much

time as computation with the line-Jacobi solver using cyclic reduction.
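
For reference, a serial Fortran sketch of the cyclic-reduction recurrence for a single tridiagonal system is given below; a(1) and c(n) are taken to be zero and the system is assumed diagonally dominant. On the CM the same recurrence is applied to all lines at once in array syntax, and each doubling of the reduction distance d generates one round of interprocessor communication over that distance.

! Cyclic reduction for one tridiagonal system
!   a(i)*x(i-d) + b(i)*x(i) + c(i)*x(i+d) = r(i),  starting with d = 1,
!   a(1) = 0 and c(n) = 0.  Each pass doubles the coupling distance d;
!   after the last pass every equation is decoupled and x(i) = r(i)/b(i).
!   (Serial sketch; assumes a diagonally dominant system.)
subroutine cyclic_reduction(n, a, b, c, r, x)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: a(n), b(n), c(n), r(n)
  real(8), intent(out) :: x(n)
  real(8) :: al(n), bl(n), cl(n), rl(n)
  real(8) :: at(n), bt(n), ct(n), rt(n)
  real(8) :: alpha, gam
  integer :: i, d, im, ip
  al = a; bl = b; cl = c; rl = r
  d = 1
  do while (d < n)
     do i = 1, n
        im = i - d
        ip = i + d
        at(i) = 0.0d0
        ct(i) = 0.0d0
        bt(i) = bl(i)
        rt(i) = rl(i)
        if (im >= 1) then
           alpha = -al(i) / bl(im)
           at(i) = alpha * al(im)
           bt(i) = bt(i) + alpha * cl(im)
           rt(i) = rt(i) + alpha * rl(im)
        end if
        if (ip <= n) then
           gam = -cl(i) / bl(ip)
           ct(i) = gam * cl(ip)
           bt(i) = bt(i) + gam * al(ip)
           rt(i) = rt(i) + gam * rl(ip)
        end if
     end do
     al = at; bl = bt; cl = ct; rl = rt
     d = 2 * d
  end do
  x = rl / bl      ! each equation is now decoupled
end subroutine cyclic_reduction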

In the second version, the multiple systems of tridiagonal equations are solved

using the standard TDMA algorithm along the lines. To implement this version,

one must remap the arrays from (:NEWS,:NEWS) to (:NEWS,:SERIAL), for the

vertical lines, and to (:SERIAL,:NEWS) for the horizontal lines. This change from

rectangular subgrids to 1-d slices is the most time-consuming step, involving a global

communication of data ("SEND" instead of "NEWS"). Applied along the serial di-

mension, the TDMA does not generate any interprocessor communication. Some











front-end-to-processor communication is generated by the incrementing of the DO-

loop index, but unrolling the DO-loop helps to amortize this overhead cost to some

extent. Thus, in Figure 3.2 E is approximately constant at 0.14, except for very small

VP. The global communication is much slower than computation and consequently

there is not enough computation to amortize the communication. Furthermore, the

constant E implies from Eq. 3.3 that Tcomm and Tcomp both scale in the same way
with problem size. It is evident that Tcomp ∝ VP because the TDMA is O(N). Thus
constant E implies Tcomm ∝ VP. This means doubling VP doubles Tcomm, indicating

the communication speed has reached its peak, which further indicates that the full

bandwidth of the fat-tree is being utilized.

The disappointing performance of the standard line-iterative approach using the

TDMA points out the important fact that, for the CM-5, global communication

within inner iterations is intolerable. There is not enough computation to amortize

slow communication in the solver for any problem size. With parallel cyclic reduction,

where the regularity of the data movement allows faster communication, the efficiency

is much higher, although still significantly lower than for point-iterations. Additional

improvement can be sought by using the "on-VU" data layout to implement the

line-iterative solver within each processor's subgrid. This implementation essentially

trades interprocessor communication for the front-end-to-PE type of communication,

and in practice a front-end bottleneck develops. For the remainder of the discussion,

all line-Jacobi results refer to the parallel cyclic reduction implementation.

On the MP-1, the front-end-to-processor communication is not a major concern,

as can be inferred from Figure 3.3. The efficiency of the SIMPLE algorithm using

the point-Jacobi solver is plotted for each machine for the range of problem sizes

corresponding to the cases solved on the MP-1. The CM-2 and CM-5 can solve

much larger problems, so for comparison purposes only part of their data is shown.











Also, because the computers have different numbers of processors, the number of grid

points is used instead of VP to define the problem size.

As in Figure 3.2, each curve exhibits an initial rise corresponding to the amortiza-

tion of the front-end-to-processor communication and, for the CM-2 and CM-5, the

off-processor "NEWS" communication. On the MP-1, peak E is reached for small

problems (VP > 32). Due to the MP-1's relatively slow processors, the computa-

tion time quickly amortizes the front-end-to-processor communication time as VP

increases. Furthermore, because the relative speed of X-Net communication is fast,

the peak E is high, 0.85. On the CM-2, the peak E is 0.4, and this efficiency is

reached for approximately VP > 128. On the CM-5, the peak E is 0.8, but this

efficiency is not reached until VP > 2k. If computation is fast, then the rate of in-

crease of E with VP depends on the relative cost of on-processor, off-processor, and

front-end-to-processor communication. If the on-processor communication is fast,

larger VP is required to reach peak E. Thus, on the CM-5, the relatively fast on-VU

communication is simultaneously responsible for the good (0.8) peak E, and the fact

that very large problem sizes, (VP > 2k, 64 times larger than on the MP-1), are

needed to reach this peak E.

The aspect ratio of the virtual subgrid constitutes a secondary effect of the data

layout on the efficiency for hierarchical mapping. The major influence on E depends

on VP, i.e. the subgrid size, but the subgrid shape matters, too. This dependence

comes into play due to the different speeds of the on-processor and off-processor types

of communication. For a fixed subgrid area, higher aspect ratio subgrids have higher
perimeter-to-area ratios, and thus relatively more off-processor communication than
square subgrids.

Figure 3.4 gives some idea of the relative importance of the subgrid aspect ratio

effect. Along each curve the number of grid points is fixed, but the grid dimensions

vary, which, for a given processor layout, causes the subgrid shape (aspect ratio), to











vary. For example, on the CM-5 with an 8 x 16 processor layout, the following grids

were used corresponding to the VP = 1024 CM-5 curve: 256 x 512, 512 x 256, 680 x

192, and 1024 x 128. These cases give subgrid aspect ratios of 1, 4, 7, and 16. Tnews

is the time spent in "NEWS" type of interprocessor communication and Tcomp is the

time spent doing computation during 100 SIMPLE iterations. The solver for these

results is point-Jacobi relaxation.

For the VP = 1024 CM-5 case, increasing the aspect ratio from 1 to 16 causes

Tnews/Tcomp to increase from 0.3 to 0.5. This increase in Tnews/Tcomp increases the
run time for 100 iterations from 15s to 20s, and decreases the efficiency from 0.61 to

0.54. For the VP = 8192 CM-5 case, increasing the aspect ratio from 1 to 16 causes

Tnews/Tcomp to increase from 0.19 to 0.27. This increase in Tnews/Tcomp increases the
run time for 100 iterations from 118s to 126s, and decreases the efficiency from 0.74

to 0.72. Thus, the aspect ratio effect diminishes as VP increases due to the increasing

area of the subgrid. In other words the variation in the perimeter length matters less,

percentage-wise, as the area increases. The CM-2 results are similar. However, on

the CM-2 the on-PE type of communication is slower than on the CM-5, relative to

the computational speed. Thus, Tnews/Tcomp ratios are higher on the CM-2.
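The geometric origin of the effect can be seen with a few lines of arithmetic: for subgrids of equal area, the perimeter (and hence the halo traffic) grows with aspect ratio. The sketch below is an illustrative model only, not measured data, using the VP = 1024 subgrid shapes listed above.

    # Perimeter-to-area ratio of a VP = 1024 subgrid at the aspect ratios used
    # above.  Off-processor "NEWS" traffic scales with the perimeter, while the
    # computation scales with the area.
    vp = 1024
    for target_ar in (1, 4, 7, 16):
        ny = round((vp / target_ar) ** 0.5)     # short side of the subgrid
        nx = round(vp / ny)                     # long side
        ratio = 2 * (nx + ny) / (nx * ny)
        print(f"AR ~{nx / ny:4.1f}: {nx:3d} x {ny:2d} subgrid, perimeter/area = {ratio:.3f}")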

3.3.2 Effect of Uniform Boundary Condition Implementation

In addition to the choice of solver, the treatment of boundary coefficient computa-

tions was discussed earlier as an important consideration affecting parallel efficiency.

Figure 3.5 compares the implementation described in the introductory section of this

chapter, to an implementation which treats the boundary control volumes separate

from the interior control volumes. The latter approach involves some 1-d operations

which leave some processors idle.











The results indicated in Figure 3.5 were obtained on the CM-2, using point-

Jacobi relaxation as the solver. With the uniform approach, the ratio of the time

spent computing coefficients, Tcoeff, to the time spent solving the equations, Tsolve,

remains constant at 0.6 for VP > 256. Both Tcoeff and Tsolve scale as VP in this case, so
doubling VP doubles both Tcoeff and Tsolve, leaving their ratio unchanged. The value

0.6 reflects the relative cost of coefficient computations compared to point-Jacobi

iteration. There are three equations for which coefficients are computed and 15 total

inner iterations, 3 each for the u and v equations, and 9 for the p' equation. Thus if

more inner iterations are taken, the ratio of Tcoeff to Tsolve will decrease, and vice
versa. With the 1-d implementation, Tcoeff/Tsolve increases until VP > 1024. Both
Tcoeff and Tsolve scale with VP asymptotically, but Figure 3.5 shows that Tcoeff has a
significant square-root component due to the boundary operations. If
N is the number of grid points and np is the number of processors, then VP = N/np.
For boundary operations, N^{1/2} control volumes are computed in parallel with only
np^{1/2} processors, hence the VP^{1/2} contribution to Tcoeff. From Figure 3.5, it appears

that very large problems are required to reach the point where the interior coefficient

computations amortize the boundary coefficient computations. Even for large VP

when Tcoeff/Tsolve is approaching a constant, this constant is larger, approximately

0.8 compared to 0.6 for the uniform approach, due to the additional front-end-to-

processor communication which is intrinsic to the 1-d formulation.
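The square-root contribution can be seen directly from the counting argument just given; the short sketch below simply tabulates the resulting boundary-to-interior cost ratio for a hypothetical 1024-processor machine, and is meant only to illustrate why very large VP is needed before the interior work swamps the boundary work.

    import math

    # N^(1/2) boundary control volumes are computed on only np^(1/2) processors,
    # so their parallel cost grows like sqrt(VP), while the interior cost grows
    # like VP.  The ratio therefore decays only like VP^(-1/2).
    np_procs = 1024
    for N in (256**2, 512**2, 1024**2, 2048**2):
        vp = N / np_procs
        boundary_over_interior = math.sqrt(vp) / vp
        print(f"VP = {vp:7.0f}: boundary cost / interior cost ~ {boundary_over_interior:.4f}")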

3.3.3 Overall Performance

Table 3.1 summarizes the relative performance of SIMPLE on the CM-2, CM-5,

and MP-1 computers, using point and line-iterative solvers and the uniform boundary











condition treatment. In the first three cases the "NEWS" implementation of point-

Jacobi relaxation is the solver, while the last two cases are for the line-Jacobi solver

using cyclic reduction.

Machine        Solver         Problem Size   VP     Tp      Time/Iter./Pt.   Speed (MFlops)   % Peak Speed
512 PE CM-2    Point-Jacobi   512 x 1024     1024   188 s   2.6 x 10^-6 s    147               4
128 VU CM-5    Point-Jacobi   736 x 1472     8192   137 s   1.3 x 10^-6 s    417              10
1024 PE MP-1   Point-Jacobi   512 x 512       256   316 s   1.2 x 10^-5 s     44*             59
512 PE CM-2    Line-Jacobi    512 x 1024     1024   409 s   7.8 x 10^-6 s    133               3
128 VU CM-5    Line-Jacobi    736 x 1472     8192   453 s   4.2 x 10^-6 s    247               6

Table 3.1. Performance results for the SIMPLE algorithm for 100 iterations of the
model problem. The solvers are the point-Jacobi ("NEWS") and line-Jacobi (cyclic
reduction) implementations. 3, 3, and 9 inner iterations are used for the u, v, and p'
equations, respectively. *The speeds are for double-precision calculations, except on
the MP-1.


In Table 3.1, the speeds reported are obtained by comparing the timings with

the identical code timed on a Cray C90, using the Cray hardware performance mon-

itor to determine Mflops. In terms of Mflops, the CM-2 version of the SIMPLE

algorithm's performance appears to be consistent with other CFD algorithms on the

CM-2. Jesperson and Levit [44] report 117 Mflops for a scalar implicit version of an

approximate factorization Navier-Stokes algorithm using parallel cyclic reduction to

solve the tridiagonal systems of equations. This result was obtained for a 512 x 512

simulation of 2-d flow over a cylinder using a 16k CM-2, as in the present study (a
different execution model was used; see [3, 47] for details). The measured time per

time-step per grid point was 1.6 x 10^-5 seconds. By comparison, the performance of

the SIMPLE algorithm for the 512 x 1024 problem size using the line-Jacobi solver is











133 Mflops and 7.8 x 10^-6 seconds per iteration per grid point. Egolf [20] reports that the

TEACH Navier-Stokes combustor code based on a sequential pressure-based method

with a solver that is comparable to point-Jacobi relaxation, obtains a performance

which is 3.67 times better than a vectorized Cray X-MP version of the code, for a

model problem with 3.2 x 104 nodes. The present program runs 1.6 times faster than

a single Cray C90 processor for a 128 x 256 problem (32k grid points). One Cray

C-90 processor is about 2-4 times faster than a Cray X-MP. Thus, the present code

runs comparably fast.

3.3.4 Isoefficiency Plot

Figures 3.2-3.4 addressed the effects of the inner-iterative solver, the boundary

treatment, the data layout, and the variation of parallel efficiency with problem size

for a fixed number of processors. Varying the number of processors is also of interest

and, as discussed in Chapter 1, an even more practical numerical experiment is to

vary np in proportion with the problem size, i.e. the scaled-size model.

Figure 3.6, which is based on the point-Jacobi MP-1 timings, incorporates the

above information into one plot, which has been called an isoefficiency plot by Kumar

and Singh [46]. The lines are paths along which the parallel efficiency E remains

constant as the problem size and the number of processors np vary. Using the point-

Jacobi solver and the uniform boundary coefficient implementation, each SIMPLE

iteration has no substantial contribution from operations which are less than fully

parallel or from operations whose time depends on the number of processors. The

efficiency is only a function of the virtual processor ratio, thus the lines are straight.

Much of the parameter space is covered by efficiencies between 0.6 and 0.8.

The reason that the present implementation is linearly scalable is that the oper-

ations are all scalable-each SIMPLE iteration has predominantly nearest-neighbor











communication and computation and full parallelism. Thus, Tp depends on VP.

Local communication speed does not depend on np.

T1 depends on the problem size N. Thus, as N and np are increased in proportion,

starting from some initial ratio, the efficiency from Eq. 3.3 stays constant. If the initial

problem size is large and the corresponding parallel run time is acceptable, then one

can quickly get to very large problem sizes while still maintaining Tp constant by

increasing n, a relatively small amount (along the E = 0.85 curve). If the desired

run time is smaller, then initially (i.e. starting from small np) the efficiency will be

lower. Then the scaled-size experiment requires relatively more processors to get

to a large problem size along the constant efficiency (constant Tp for point-Jacobi

iterations) curve. Thus, the most desirable situation occurs when the efficiency is

high for an initially small problem size.

For this case the fixed-time and scaled-size methods are equivalent, because the
single-processor time per iteration, T1, is directly proportional to the problem size N.
However, this is not the case when

the SIMPLE inner iterations are done with the line-Jacobi solver using parallel cyclic

reduction. Cyclic reduction requires (13 log2 N+1)N operations to solve a tridiagonal

system of N equations [44]. Thus, T1 ~ (13 log2 N + 1)N and, on np = N processors,
Tp ~ 13 log2 N + 1, because every processor is active during every step of the reduction
and there are log2 N steps. Since VP = 1, every processor's time is proportional

to the number of steps, assuming each step costs about the same.

In the scaled-size approach, one doubles np and N together, which therefore gives

T1 ~ (26 log2 2N + 2)N and Tp ~ 13 log2 2N + 1. The efficiency is 1, but Tp is increased
and T1 is more than doubled. In the fixed-time approach, then, one concludes that
N must be increased by a factor which is less than two, and np must be doubled, in
order to maintain constant Tp. If a plot like Figure 3.6 is constructed, it should be
done with T1 instead of N as the measure of problem size. In that case, the lines











of constant efficiency would be described by T1 ~ np^a, with a > 1. The ideal case is
a = 1. In addition to the operation count, there is another factor which reduces the

scalability of cyclic reduction, namely the time per step is not actually the same as

was assumed above-later steps require communication over longer distances which

is slower. In practice, however, no more than a few steps are necessary because the

coupling between widely-separated equations becomes very weak. As the system is

reduced the diagonal becomes much larger than the off-diagonal terms which can

then be neglected and the reduction process abbreviated.
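The operation-count argument above is easy to tabulate; the sketch below uses the (13 log2 N + 1)N count quoted from [44], takes np = N, and ignores the long-distance communication penalty, so it is an idealized model rather than a measured result.

    import math

    def T1(N):           # single-processor operation count for cyclic reduction [44]
        return (13.0 * math.log2(N) + 1.0) * N

    def Tp(N):           # parallel time on np = N processors (VP = 1)
        return 13.0 * math.log2(N) + 1.0

    N = 1024
    for _ in range(4):   # scaled-size experiment: double np and N together
        E = T1(N) / (N * Tp(N))
        print(f"N = np = {N:6d}: Tp = {Tp(N):6.1f}, T1 = {T1(N):12.0f}, E = {E:.2f}")
        N *= 2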

In short, the basic prerequisite for scaled-size constant efficiency is that the

amount of work per SIMPLE iteration varies with VP and that the overheads and

inefficiencies, specifically the time spent in communication and the fraction of idle

processors, do not grow relative to the useful computational work as np and N are

increased proportionally. The SIMPLE implementation developed here using the

point-iterative solvers, Jacobi and red/black, has this linear computational scalability
property.

On the other hand, the number of iterations required by point-iterative methods grows
faster than the problem size, so although Tp per iteration can be maintained constant
while the problem size and np are scaled up, the convergence rate deteriorates. Hence the

total run time (cost per iteration multiplied by the number of iterations) increases.

This lack of numerical scalability of standard iterative methods like point-Jacobi

relaxation is the motivation for the development of multigrid strategies.

3.4 Concluding Remarks

The SIMPLE algorithm, especially using point-iterative methods, is efficient on

SIMD machines and can maintain a relatively high efficiency as the problem size and

the number of processors is scaled up. However, boundary coefficient computations











need to be folded in with interior coefficient computations to achieve good efficiencies

at smaller problem sizes. For the CM-5, the inefficiency caused by idle processors

in a 1-d boundary treatment was significant over the entire range of problem sizes

tested. The line-Jacobi solver based on parallel cyclic reduction leads to a lower

peak E (0.5 on the CM-5) than the point-Jacobi solver (0.8), because there is more

communication and on average this communication is less localized. On the other

hand, the asymptotic convergence rates of the two methods are also different and need

to be considered on a problem-by-problem basis. The speeds which are obtained with

the line-iterative method are consistent and comparable with other CFD algorithms

on SIMD computers.

The key factor in obtaining high parallel efficiency for the SIMPLE algorithm

on the computers used, is fast nearest-neighbor communication relative to the speed

of computation. On the CM-2 and CM-5, hierarchical mapping allows on-processor

communication to dominate the slower off-processor form(s) of communication for

large VP. The efficiency is low for small problems because of the relatively large

contribution to the run time from the front-end-to-processor type of communication,

but this type of communication is constant and becomes less important as the problem

size increases.

Once the peak E is reached, the efficiency is determined by the balance of compu-

tation and on-processor communication speeds-for the CM-5, using a point-Jacobi

solver, E approaches approximately 0.8, while on the CM-2 the peak efficiency is 0.4,

which reflects the fact that the CM-5 vector units have a better balance, at least for

the operations in this algorithm, than the CM-2 processors.

The rate at which E approaches the peak value depends on the relative contribu-

tions of on- and off-processor communication and front-end-to-processor communica-

tion to the total run time. On the CM-5, VP > 2k is required to reach peak E. This











problem size is about one-fourth the maximum size which can be accommodated,

and yet still larger than many computations on traditional vector supercomputers.

Clearly a gap is developing between the size of problems which can be solved effi-

ciently in parallel and the size of problems which are small enough to be solved on

serial computers.

For parallel computations of all but the largest problems, then, the data layout

issue is very important: in going from a square subgrid to one with aspect ratio of

16, for a VP = 1k case on the CM-5, the run time increased by 25%. On the MP-1,

hierarchical mapping is not needed, because the processors are slow compared to the

X-Net communication speed. The peak E is 0.85 with the point-Jacobi solver, and

this performance is obtained for VP > 32, which is about one-eighth the size of

the largest case possible for this machine. Thus, with regards to achieving efficient

performance in the teraflops range, the comparison given here suggests a preference

for numerous slow processors instead of fewer fast ones, but such a computer may be

difficult and expensive to build.



















[Figure 3.1 appears here: an 8-element array A mapped onto a 4 x 1 layout of processors (PE 0 through PE 3), shown for the cut-and-stack mapping (MP-Fortran memory layers) and the hierarchical mapping (CM-Fortran, 2 x 1 virtual subgrids).]

Figure 3.1. Mapping an 8 element array A onto 4 processors. For the cut-and-stack
mapping, nearest-neighbor array elements are mapped to nearest-neighbor physical
processors. For the hierarchical mapping, nearest-neighbor array elements are mapped
to nearest-neighbor virtual processors, which may be on the same physical processor.












[Figure 3.2 appears here: parallel efficiency E (0 to 1) versus VP (0 to 10000) on the CM-5, with curves for point-Jacobi (on-VU), point-Jacobi (NEWS), line-Jacobi (cyclic reduction), and line-Jacobi (TDMA).]

Figure 3.2. Parallel efficiency, E, as a function of problem size and solver, for the
CM-5 cases. The number of grid points is the virtual processor ratio, VP, multiplied
by the number of processors, 128. E is computed from Eq. 3.3. It reflects the relative
amount of communication, compared to computation, in the algorithm.











[Figure 3.3 appears here: parallel efficiency E (0 to 1) versus number of grid points (up to about 2 x 10^5), with curves for the MP-1, CM-5, and CM-2.]

Figure 3.3. Comparison between the CM-2, CM-5 and MP-1. The variation of
parallel efficiency with problem size is shown for the model problem, using point-
Jacobi relaxation as the solver. E is calculated from Eq. 3.3, and T1 = np Tcomp for
the CM-2 and CM-5, where Tcomp is measured. For the MP-1 cases, T1 is the front-end
time, scaled down to the estimated speed of the MP-1 processors (0.05 Mflops).










[Figure 3.4 appears here: Tnews/Tcomp (0 to 2) versus subgrid aspect ratio (0 to 20), with curves for VP = 256 (CM-2), VP = 1024 (CM-2), VP = 1024 (CM-5), and VP = 8192 (CM-5).]

Figure 3.4. Effect of subgrid aspect ratio on interprocessor communication time,
Tnews, for the hierarchical data-mapping (CM-2 and CM-5). Tnews is normalized by
Tcomp in order to show how the aspect ratio effect varies with problem size, without
the complication of the fact that Tcomp varies also.











[Figure 3.5 appears here: Tcoeff/Tsolve (0.55 to 0.8) versus VP (0 to 1500) on the CM-2, comparing the 1-d and 2-d boundary coefficient implementations.]

Figure 3.5. Normalized coefficient computation time as a function of problem size,
for two implementations (on the CM-2). In the 1-d case the boundary coefficients
are handled by 1-d array operations. In the 2-d case the uniform implementation
computes both boundary and interior coefficients simultaneously. Tcoeff is the time
spent computing coefficients in a SIMPLE iteration; Tsolve is the time spent in point-
Jacobi iterations. There are 15 point-Jacobi iterations (3 each for the u and v
equations and 9 for the p' equation).











[Figure 3.6 appears here: isoefficiency curves, problem size (number of grid points, up to about 2.5 x 10^5) versus number of MP-1 processors (2000 to 8000).]

Figure 3.6. Isoefficiency curves based on the MP-1 cases and SIMPLE method with
the point-Jacobi solver. Efficiency E is computed from Eq. 3.3. Along lines of
constant E the cost per SIMPLE iteration is constant with the point-Jacobi solver
and the uniform boundary condition implementation.
















CHAPTER 4
A NONLINEAR PRESSURE-CORRECTION MULTIGRID METHOD

The single-grid timing results focused on the cost per iteration in order to elucidate

the computational issues which influence the parallel run time and the scalability. But

the parallel run time is the cost per iteration multiplied by the number of iterations.

For scaling to large problem sizes and numbers of processors, the numerical method

must scale well with respect to convergence rate, also.

The convergence rate of the single-grid pressure-correction method deteriorates

with increasing problem size. This trait is inherited from the smoothing property of

the stationary linear iterative method, point or line-Jacobi relaxation, used to solve

the systems of u, v, and p' equations during the course of SIMPLE iterations. Point-

Jacobi relaxation requires O(N^2) iterations, where N is the number of grid points,

to decrease the solution error by a specified amount [1]. In other words, the number

of iterations increases faster than the problem size.

At best the cost per iteration stays constant as the number of processors np

increases proportional to the problem size. Thus, the total run time increases in

the scaled-size experiment using single-grid pressure-correction methods, due to the

increased number of iterations required. This lack of numerical scalability is a serious

disadvantage for parallel implementations, since the target problem size for parallel

computation is very large.

Multigrid methods can maintain good convergence rates as the problem size in-

creases. For Poisson equations, problem-size independent convergence rates can be

obtained [36, 55]. The recent book by Briggs [10] introduces the major concepts in












the context of Poisson equations. See also [11, 37, 90] for surveys and analyses of

multigrid convergence properties for more general linear equations. For a description

of practical techniques and special considerations for fluid dynamics, see the impor-

tant early papers by Brandt [5, 6]. However, there are many unresolved issues for

application to the incompressible Navier-Stokes equations, especially with regards to

their implementation and performance on parallel computers. The purpose of this

chapter is to describe the relevant convergence rate and stability issues for multigrid

methods in the context of application to the incompressible Navier-Stokes equations,

with numerical experiments used to illustrate the points made, in particular, regard-

ing the role of the restriction and prolongation procedures.

4.1 Background

The basic concept is the use of coarse grids to accelerate the asymptotic con-

vergence rate of an inner iterative scheme. The inner iterative method is called the

"smoother" for reasons to be made clear shortly. In the context of the present applica-

tion to the incompressible Navier-Stokes equations, the single-grid pressure-correction

method is the inner iterative scheme. Because the pressure-correction algorithm also

uses inner iterations-to solve the systems of u, v, and p' equations-the multigrid

method developed here actually has three nested levels of iterations.

A multigrid V cycle begins with a certain number of smoothing iterations on the

fine grid, where the solution is desired. Figure 4.1 shows a schematic of a V(3,2) cycle.

In this case three pressure-correction iterations are done first. Then residuals and

variables are restricted (averaged) to obtain coarse-grid values for these quantities.

The solution to the coarse-grid discretized equation provides a correction to the fine-

grid solution. Once the solution on the coarse grid is obtained, the correction is

interpolated (prolongated) to the fine grid and added back into the solution there.











Some post-smoothing iterations, two in this case, are needed to eliminate errors

introduced by the interpolation. Since it is usually too costly to attempt a direct

solution on the coarse grid, this smoothing-correction cycle is applied recursively,

leading to the V cycle shown.
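The grid schedule of the V cycle can be written as a short recursion. The sketch below is schematic Python that only prints the order in which the levels are visited for a V(3,2) cycle; the smoothing, restriction, and prolongation steps themselves are represented by print statements rather than actual numerics.

    def v_cycle(level, coarsest, nu1=3, nu2=2):
        # One V(nu1, nu2) cycle: nu1 pre-smoothing sweeps on the way down,
        # a coarsest-grid solve, and nu2 post-smoothing sweeps on the way up.
        indent = "  " * level
        if level == coarsest:
            print(indent + f"level {level}: solve (or smooth a few times) on the coarsest grid")
            return
        print(indent + f"level {level}: {nu1} pre-smoothing iterations")
        print(indent + f"level {level}: restrict residual and solution to level {level + 1}")
        v_cycle(level + 1, coarsest, nu1, nu2)
        print(indent + f"level {level}: prolongate the correction from level {level + 1} and add it")
        print(indent + f"level {level}: {nu2} post-smoothing iterations")

    v_cycle(level=0, coarsest=3)        # a four-level V(3,2) cycle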

The next section describes how such a procedure can accelerate the convergence

rate of an iterative method, in the context of linear equations. The multigrid scheme

for nonlinear scalar equations and the Navier-Stokes system of equations is then

described. Brandt [5] was the first to formalize the manner in which coarse grids

could be used as a convergence-acceleration technique for a given smoother. The

idea of using coarse grids to generate initial guesses for fine-grid solutions was around

much earlier.

The cost of the multigrid algorithm, per cycle, is dominated by the smoothing cost,

as will be shown in Chapter 5. Thus, with regard to the parallel run time per multigrid

iteration, the smoother is the primary concern. Also, with regard to the convergence

rate, the smoother is important. The single-grid convergence rate characteristics

of pressure-correction methods, the dependence on Reynolds number, flow problem,

and the convection scheme, carry over to the multigrid context. However, in the

multigrid method the smoother's role is, as the name implies, to smooth the fine-grid

residual, which is a different objective than to solve the equations quickly. A smooth

fine-grid residual equation can be approximated accurately on a coarser grid. The

next section describes an alternate pressure-based smoother, and compares its cost

against the pressure-correction method on the CM-5.

Stability of multigrid iterations is also an important unresolved issue. There are

two ways in which multigrid iterations can be caused to diverge. First, the single-grid

smoothing iterations can diverge; for example, if central differencing is used, stability
problems are possible at high Reynolds numbers. Second, poor coarse-grid











corrections can cause divergence if the smoothing is insufficient. In a sense this latter

issue, the scheme and intergrid transfer operators which prescribe the coordination

between coarse and fine grids in the multigrid procedure, is the key issue. In the next

section two "stabilization strategies" are described. Then, the impact of different

restriction and prolongation procedures on the convergence rate is studied in the

context of two model problems, lid-driven cavity flow and flow past a symmetric

backward-facing step. These two particular flow problems have different physical

characteristics, and therefore the numerical experiments should give insight into the

problem-dependence of the results.

4.1.1 Terminology and Scheme for Linear Equations

The discrete problem to be solved can be written A^h u^h = S^h, corresponding to
some differential equation L[u] = S. The set of values u^h is defined by

    {u^h}_{i,j} = u(ih, jh),   (i, j) ∈ ([0 : N], [0 : N]).   (4.1)

Similarly, u^2h is defined on the coarser grid Ω^2h with grid spacing 2h. The variable u

can be a scalar or a vector, and the operator A can be linear or nonlinear.

For linear equations, the "correction scheme" (CS) is frequently used. A two-

level multigrid cycle using CS accelerates the convergence of an iterative method

(with iteration matrix P) by the following procedure:

    Do ν fine-grid iterations       v^h <- P^ν v^h
    Compute residual on Ω^h         r^h = A^h v^h - S^h
    Restrict r^h to Ω^2h            r^2h = I_h^2h r^h
    Solve exactly for e^2h          e^2h = -(A^2h)^{-1} r^2h
    Correct v^h on Ω^h              (v^h)_new = (v^h)_old + I_2h^h e^2h











I_h^2h and I_2h^h symbolize the restriction and prolongation procedures. The quantity v^h
is the current approximation to the discrete solution u^h. The algebraic error is the
difference between them, e^h = u^h - v^h. The discretization error is the difference
between the exact solutions of the continuous and discrete problems, e_discr = u - u^h.
The truncation error is obtained by substituting the exact solution into the discrete
equation,

    τ^h = A^h u - S^h = A^h u - A^h u^h.   (4.2)

The notation above follows Briggs [10].
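As a concrete illustration of the procedure above, the following sketch applies one two-level correction-scheme cycle to the 1-d Poisson equation with a weighted-Jacobi smoother. It is a minimal model written for this discussion (the flow solver of interest here uses the pressure-correction method as the smoother), with full weighting assumed for I_h^2h and linear interpolation for I_2h^h.

    import numpy as np

    def poisson(n, h):                  # 1-d Poisson operator A^h (Dirichlet boundaries)
        return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
                - np.diag(np.ones(n - 1), -1)) / h**2

    def jacobi(A, v, S, sweeps, omega=2.0/3.0):
        D = np.diag(A)
        for _ in range(sweeps):
            v = v + omega * (S - A @ v) / D
        return v

    nf = 31                             # fine-grid interior points, h = 1/(nf + 1)
    h = 1.0 / (nf + 1)
    Af = poisson(nf, h)
    S = np.ones(nf)                     # source term S^h
    v = jacobi(Af, np.zeros(nf), S, sweeps=3)         # pre-smoothing iterations

    r = Af @ v - S                                    # r^h = A^h v^h - S^h
    nc = (nf - 1) // 2
    R = np.zeros((nc, nf))                            # full-weighting restriction I_h^2h
    for j in range(nc):
        R[j, 2*j:2*j + 3] = [0.25, 0.5, 0.25]
    P = 2.0 * R.T                                     # linear interpolation I_2h^h
    Ac = poisson(nc, 2.0 * h)                         # coarse-grid operator A^2h
    e = np.linalg.solve(Ac, -R @ r)                   # A^2h e^2h = -r^2h, solved exactly
    v = v + P @ e                                     # correct v^h
    v = jacobi(Af, v, S, sweeps=2)                    # post-smoothing

    print("fine-grid residual norm after one cycle:", np.linalg.norm(Af @ v - S))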

The two-level multigrid cycle begins on the fine grid with v iterations of the

smoother. Standard iterative methods all have the "smoothing property," which is

that the various eigenvector-decomposed components of the solution error are damped

at a rate proportional to their corresponding eigenvalues, i.e. the high frequency

errors are damped faster than the low frequency (smooth) errors. Thus, the conver-

gence rate of the smoothing iterations is initially rapid, but deteriorates as smooth

error components, those with large eigenvalues, dominate the remaining error. The

purpose of transferring the problem to a coarser grid is to make these smooth error

components appear more oscillatory with respect to the grid spacing, so that the

initial rapid convergence rate is obtained for the elimination of these smooth errors

by coarse-grid iterations. Since the coarse grid Ω^2h has only 1/4 as many grid points
as Ω^h (in 2-d), the smoothing iterations on the coarse grid are cheaper as well as

more effective in reducing the smooth error components than on the fine grid.
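The smoothing property is easy to quantify for the 1-d model problem: weighted Jacobi with relaxation factor omega damps the error mode sin(k*pi*x) by the factor |1 - 2*omega*sin^2(k*pi*h/2)| per sweep. The sketch below evaluates this standard textbook result (it is not taken from the dissertation's own solver) for one smooth and one oscillatory mode.

    import numpy as np

    # Per-sweep damping factors of weighted Jacobi (omega = 2/3) applied to the
    # 1-d Poisson problem on N = 64 intervals: smooth modes are barely damped,
    # oscillatory modes are damped strongly.
    N, omega = 64, 2.0 / 3.0
    h = 1.0 / N
    for k in (1, 3 * N // 4):
        factor = abs(1.0 - 2.0 * omega * np.sin(k * np.pi * h / 2.0) ** 2)
        print(f"mode k = {k:2d}: damping factor per sweep = {factor:.3f}")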

In the correction scheme, the coarse-grid problem is an equation for the algebraic

error,


    A^2h e^2h = -r^2h,   (4.3)











approximating the fine-grid residual equation for the algebraic error. To obtain the

coarse-grid source term, r^2h, the restriction procedure I_h^2h is applied to the fine-grid
residual r^h,

    r^2h = I_h^2h r^h.   (4.4)

Eq. 4.4 is an averaging type of operation. Two common restriction procedures are

straight injection of fine-grid values to their corresponding coarse-grid grid points,

and averaging rh over a few fine-grid grid points which are near the corresponding

coarse-grid grid point. The initial error on the coarse grid is taken as zero.

After the solution for e^2h is obtained, this coarse-grid quantity is interpolated to
the fine grid and used to correct the fine-grid solution,

    v^h <- v^h + I_2h^h e^2h.   (4.5)

For I_2h^h, common choices are bilinear or biquadratic interpolation.
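For reference, the standard vertex-centred stencils for these two operators on a 2-d grid are written out below; the control-volume arrangement used in this work may use different weights, so these arrays are only the textbook forms.

    import numpy as np

    # Full-weighting restriction gathers 9 fine-grid values around a coarse-grid
    # point with these weights; bilinear interpolation distributes a coarse-grid
    # value to the same 9 fine-grid points with 4 times the same weights.
    full_weighting = np.array([[1, 2, 1],
                               [2, 4, 2],
                               [1, 2, 1]]) / 16.0
    bilinear = 4.0 * full_weighting
    print(full_weighting.sum())        # 1.0, so a constant field is restricted exactly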

In practice the solution for e^2h is obtained by recursion on the two-level cycle;
(A^2h)^{-1} is not explicitly computed. On the coarsest grid, direct solution may be

feasible if the equation is simple enough. Otherwise a few smoothing iterations can

be applied.

Recursion on the two-level algorithm leads to a "V cycle," as shown in Figure 4.1.

A simple V(3,2) cycle is shown. Three smoothing iterations are taken before re-

stricting to the next coarser grid, and two iterations are taken after the solution has

been corrected. The purpose of the latter smoothing iterations is to smooth out

any high-frequency noise introduced by the prolongation. Other cycles can be envi-

sioned. In particular the W cycle is popular [6]. The cycling strategy is called the

"grid-schedule," since it is the order in which the various grid levels are visited.

The most important consideration for the correction scheme has been saved for

last, namely the definition of the coarse-grid discrete equation A2h. One possibility is











to discretize the original differential equation directly on the coarse grid. However this

choice is not always the best one. The convergence-rate benefit from the multigrid

strategy is derived from the particular coarse-grid approximation to the fine-grid

discrete problem, not the continuous problem. Because the coarse-grid solutions

and residuals are obtained by particular averaging procedures, there is an implied

averaging procedure for the fine-grid discrete operator Ah which should be honored

to ensure a useful homogenization of the fine-grid residual equation. This issue is

critical when the coefficients and/or dependent variables of the governing equations

are not smooth [17].

For the Poisson equation, the Galerkin approximation A^2h = I_h^2h A^h I_2h^h is the

right choice. The discretized equation coefficients on the coarse grid are obtained

by applying suitable averaging and interpolation operations to the fine-grid coeffi-

cients, instead of by discretizing the governing equation on a grid with a coarser

mesh spacing. Briggs has shown, by exploiting the algebraic relationship between

bilinear interpolation and full-weighting restriction operators, that initially smooth

errors begin in the range of interpolation and finish, after the smoothing-correction

cycle is applied, in the null space of the restriction operator [10]. Thus, if the fine-grid

smoothing eliminates all the high-frequency error components in the solution, one V

cycle using the correction-scheme is a direct solver for the Poisson equation. The con-

vergence rate of multigrid methods using the Galerkin approximation is more difficult

to analyze if the governing equations are more complicated than Poisson equations,

but significant theoretical advantages for application to general linear problems have

been indicated [90].
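A small check of this statement in one dimension is given below: for the 1-d Poisson matrix, full-weighting restriction, and linear interpolation, the Galerkin triple product I_h^2h A^h I_2h^h reproduces the 2h discretization exactly. This is a sketch constructed for illustration, not a computation from the dissertation.

    import numpy as np

    def poisson(n, h):
        return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
                - np.diag(np.ones(n - 1), -1)) / h**2

    nf, h = 7, 1.0 / 8.0
    Af = poisson(nf, h)
    nc = (nf - 1) // 2
    R = np.zeros((nc, nf))                   # full weighting I_h^2h
    for j in range(nc):
        R[j, 2*j:2*j + 3] = [0.25, 0.5, 0.25]
    P = 2.0 * R.T                            # linear interpolation I_2h^h
    A_galerkin = R @ Af @ P                  # A^2h by the Galerkin approximation
    print(np.allclose(A_galerkin, poisson(nc, 2.0 * h)))   # True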











4.1.2 Full-Approximation Storage Scheme for Nonlinear Equations

The brief description given above does not bring out the complexities inherent in

the application to nonlinear problems. There is only experience, derived mostly from

numerical experiments, to guide the choice of the restriction/prolongation procedures

and the smoother. Furthermore, the linkage between the grid levels requires special

considerations because of the nonlinearity.

The correction scheme using the Galerkin approximation can be applied to the

nonlinear Navier-Stokes system of equations [94]. However, in order to use CS for

nonlinear equations, linearization is required. The best coarse-grid correction only

improves the fine-grid solution to the linearized equation. Also, for complex equa-

tions, considerable expense is incurred in computing A2h by the Galerkin approxi-

mation. The commonly adopted alternative is the intuitive one, to let A2h be the

differential operator L discretized on the grid with spacing 2h instead of h. In ex-

change for a straightforward problem definition on the coarse grid though, special

restriction and prolongation procedures may be necessary to ensure the usefulness of

the resulting corrections. Numerical experiments on a problem-by-problem basis are

necessary to determine good choices for the restriction and prolongation procedures

for Navier-Stokes multigrid methods.

The full-approximation storage (FAS) scheme [5] is preferred over the correction

scheme for nonlinear problems. The coarse-grid corrections generated by FAS improve

the solution to the full nonlinear problem instead of just the linearized one. The

discretized equation on the fine grid is, again,

    A^h u^h = S^h.   (4.6)











The approximate solution v^h after a few fine-grid iterations defines the residual on
the fine grid,

    A^h v^h = S^h + r^h.   (4.7)

A correction, the algebraic error e^h_alg = u^h - v^h, is sought which satisfies

    A^h(v^h + e^h_alg) = S^h.   (4.8)

The residual equation is formed by subtracting Eq. 4.7 from Eq. 4.8, and cancelling
S^h,

    A^h(v^h + e^h) - A^h(v^h) = -r^h,   (4.9)

where the subscript "alg" is dropped for convenience. For linear equations the A^h v^h
terms cancel, leaving Eq. 4.3. Eq. 4.9 does not simplify for nonlinear equations.
Assuming that the smoother has done its job, r^h is smooth and Eq. 4.9 is the same
as the coarse-grid residual equation

    A^2h(v^2h + e^2h) - A^2h(v^2h) = -r^2h,   (4.10)

at coarse-grid grid points.

The error e^2h is to be found, interpolated back to Ω^h according to e^h = I_2h^h e^2h,
and added to v^h so that Eq. 4.8 is satisfied. The known quantities are v^2h, which is a
"suitable" restriction of v^h, and r^2h, likewise a restriction of r^h. Different restrictions
can be used for residuals and solutions. Thus, Eq. 4.10 can be written

    A^2h(I_h^2h v^h + e^2h) = A^2h(I_h^2h v^h) - I_h^2h r^h.   (4.11)

Since Eq. 4.11 is not an equation for e^2h alone, one solves instead for the sum
u^2h = I_h^2h v^h + e^2h. Expanding r^h and regrouping terms, Eq. 4.11 can be written

    A^2h(u^2h) = A^2h(I_h^2h v^h) - I_h^2h r^h   (4.12)

             = [A^2h(I_h^2h v^h) - I_h^2h(A^h v^h) + I_h^2h S^h - S^2h] + S^2h   (4.13)

             = S^2h_numerical + S^2h.   (4.14)


Eq. 4.14 is similar to Eq. 4.6 except for the extra, numerically derived source term.
Once u^2h = I_h^2h v^h + e^2h is obtained, the coarse-grid approximation to the fine-grid
error, e^2h, is computed by first subtracting the initial coarse-grid solution I_h^2h v^h,

    e^2h = u^2h - I_h^2h v^h,   (4.15)

then interpolating back to the fine grid and combining with the current solution,

    v^h <- v^h + I_2h^h(e^2h).   (4.16)
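The bookkeeping of Eqs. 4.12-4.16 is summarized by the two-level FAS sketch below for a scalar 1-d model equation A(u) = -u'' + u^2 = S. It is a schematic illustration only: the smoother here is a pointwise Newton-Jacobi sweep rather than the pressure-correction method, and the coarse-grid problem is merely relaxed instead of being treated recursively.

    import numpy as np

    def A(u, h):                         # nonlinear operator A(u) = -u'' + u**2
        up = np.concatenate(([0.0], u, [0.0]))          # zero Dirichlet boundaries
        return (2.0 * u - up[2:] - up[:-2]) / h**2 + u**2

    def smooth(u, S, h, sweeps):         # pointwise Newton-Jacobi relaxation
        for _ in range(sweeps):
            u = u - (A(u, h) - S) / (2.0 / h**2 + 2.0 * u)
        return u

    def restrict(v):                     # full weighting
        return 0.25 * v[0:-2:2] + 0.5 * v[1:-1:2] + 0.25 * v[2::2]

    def prolongate(vc):                  # linear interpolation (zero boundary values)
        vf = np.zeros(2 * len(vc) + 1)
        vf[1::2] = vc
        ext = np.concatenate(([0.0], vc, [0.0]))
        vf[0::2] = 0.5 * (ext[:-1] + ext[1:])
        return vf

    nf = 63
    h = 1.0 / (nf + 1)
    x = (np.arange(nf) + 1) * h
    S = A(np.sin(np.pi * x), h)          # source chosen so u(x) = sin(pi*x) is exact

    v = smooth(np.zeros(nf), S, h, sweeps=3)            # pre-smoothing
    r = A(v, h) - S                                     # fine-grid residual
    v2h = restrict(v)                                   # restricted solution
    S2h = A(v2h, 2.0 * h) - restrict(r)                 # coarse source, Eq. 4.12
    u2h = smooth(v2h.copy(), S2h, 2.0 * h, sweeps=50)   # relax the coarse problem
    v = v + prolongate(u2h - v2h)                       # Eqs. 4.15 and 4.16
    v = smooth(v, S, h, sweeps=2)                       # post-smoothing
    print("max error after one FAS cycle:", np.max(np.abs(v - np.sin(np.pi * x))))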


4.1.3 Extension to the Navier-Stokes Equations

The incompressible Navier-Stokes equations are a system of coupled, nonlinear

equations. Consequently the FAS scheme given above for single nonlinear equations

needs to be modified.

The variables u_1^h, u_2^h, and u_3^h represent the Cartesian velocity components and the
pressure, respectively. Corresponding subscripts are used to identify each equation's
source term, residual, and discrete operator in the formulation below. The three
equations for momentum and mass conservation are treated as if part of the following
matrix equation,

    [ A_1^h    0       G_x^h ] [ u_1^h ]   [ S_1^h ]
    [ 0        A_2^h   G_y^h ] [ u_2^h ] = [ S_2^h ]   (4.17)
    [ G_x^h    G_y^h   0     ] [ u_3^h ]   [ S_3^h ]

The continuity equation source term is zero on the finest grid, Ω^h, but for coarser grid
levels it may not be zero. Thus, for the sake of generality it is included in Eq. 4.17.

Thus, for the u_1-momentum equation, Eq. 4.8 is modified to account for the
pressure-gradient term, G_x^h u_3^h, which is also an unknown. The approximate solutions are











v_1^h, v_2^h, and v_3^h, corresponding to u_1^h, u_2^h, and u_3^h. For the u_1-momentum equation, the
approximate solution satisfies

    A_1^h v_1^h + G_x^h v_3^h = S_1^h + r_1^h.   (4.18)


The fine-grid residual equation corresponding to Eq. 4.9 is modified to

    A_1^h(v_1^h + e_1^h) - A_1^h(v_1^h) + G_x^h(v_3^h + e_3^h) - G_x^h(v_3^h) = -r_1^h,   (4.19)

which is approximated on the coarse grid by the corresponding coarse-grid residual
equation,

    A_1^2h(v_1^2h + e_1^2h) - A_1^2h(v_1^2h) + G_x^2h(v_3^2h + e_3^2h) - G_x^2h(v_3^2h) = -r_1^2h.   (4.20)

The known terms are v_1^2h = I_h^2h v_1^h, v_3^2h = I_h^2h v_3^h, and r_1^2h = I_h^2h r_1^h.
Expanding r_1^h and regrouping terms, Eq. 4.19 can be written

    A_1^2h(u_1^2h) + G_x^2h(u_3^2h)
        = A_1^2h(I_h^2h v_1^h) + G_x^2h(I_h^2h v_3^h) - I_h^2h(A_1^h v_1^h + G_x^h v_3^h) + I_h^2h S_1^h   (4.21)
        = [A_1^2h(I_h^2h v_1^h) + G_x^2h(I_h^2h v_3^h) - I_h^2h(A_1^h v_1^h + G_x^h v_3^h) + I_h^2h S_1^h - S_1^2h] + S_1^2h
        = S_1,numerical^2h + S_1^2h.   (4.22)


Since Eq. 4.22 includes numerically derived source terms in addition to the physical

ones, the coarse-grid variables are not in general the same as would be obtained from

a discretization of the original continuous governing equations on the coarse grid.

The u2-momentum equation is treated similarly, and the coarse-grid continuity

equation is


    G_x^2h u_1^2h + G_y^2h u_2^2h = G_x^2h(I_h^2h v_1^h) + G_y^2h(I_h^2h v_2^h) - I_h^2h r_3^h.
(4.21)