PRESSURE-BASED METHODS ON SINGLE-INSTRUCTION STREAM/MULTIPLE-DATA STREAM COMPUTERS

By

EDWIN L. BLOSCH

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 1994

ACKNOWLEDGEMENTS

I would like to express my thanks to my advisor Dr. Wei Shyy for reflecting carefully on my results and for directing my research toward interesting issues. I would also like to thank him for the exceptional personal support and flexibility he offered me during my last year of study, which was done off-campus. I would also like to acknowledge the contributions of the other members of my Ph.D. committee, Dr. Chen-Chi Hsu, Dr. Bruce Carroll, Dr. David Mikolaitis, and Dr. Sartaj Sahni. Dr. Hsu and Dr. Carroll supervised my B.S. and M.S. degree research studies, respectively, and Dr. Mikolaitis, in the role of graduate coordinator, enabled me to obtain financial support from the Department of Energy. Also I would like to thank Madhukar Rao, Rick Smith and H.S. Udaykumar, for paying fees on my behalf and for registering me for classes while I was in California. Jeff Wright, S. Thakur, Shin-Jye Liang, Guobao Guo and Pedro Lopez-Fernandez have also made direct and indirect contributions for which I am grateful.

Special thanks go to Dr. Jamie Sethian, Dr. Alexandre Chorin and Dr. Paul Concus of Lawrence Berkeley Laboratory for allowing me to visit LBL and use their resources, for giving personal words of support and constructive advice, and for the privilege of interacting with them and their graduate students in the applied mathematics branch. Last but not least I would like to thank my wife, Laura, for her patience, her example, and her frank thoughts on "cups with sliding lids," "flow through straws," and numerical simulations in general.

My research was supported in part by the Computational Science Graduate Fellowship Program of the Office of Scientific Computing in the Department of Energy. The CM5s used in this study were partially funded by National Science Foundation Infrastructure Grant CDA-8722788 (in the computer science department of the University of California-Berkeley), and a grant of HPC time from the DoD HPC Shared Resource Center, Army High-Performance Computing Research Center, Minneapolis, Minnesota.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 Motivations
1.2 Governing Equations
1.3 Numerical Methods for Viscous Incompressible Flow
1.4 Parallel Computing
1.4.1 Data-Parallelism and SIMD Computers
1.4.2 Algorithms and Performance
1.5 Pressure-Based Multigrid Methods
1.6 Description of the Research

2 PRESSURE-CORRECTION METHODS
2.1 Finite-Volume Discretization on Staggered Grids
2.2 The SIMPLE Method
2.3 Discrete Formulation of the Pressure-Correction Equation
2.4 Well-Posedness of the Pressure-Correction Equation
2.4.1 Analysis
2.4.2 Verification by Numerical Experiments
2.5 Numerical Treatment of Outflow Boundaries
2.6 Concluding Remarks

3 EFFICIENCY AND SCALABILITY ON SIMD COMPUTERS
3.1 Background
3.1.1 Speedup and Efficiency
3.1.2 Comparison Between CM2, CM5, and MP1
3.1.3 Hierarchical and Cut-and-Stack Data Mappings
3.2 Implementational Considerations
3.3 Numerical Experiments
3.3.1 Efficiency of Point and Line Solvers for the Inner Iterations
3.3.2 Effect of Uniform Boundary Condition Implementation
3.3.3 Overall Performance
3.3.4 Isoefficiency Plot
3.4 Concluding Remarks

4 A NONLINEAR PRESSURE-CORRECTION MULTIGRID METHOD
4.1 Background
4.1.1 Terminology and Scheme for Linear Equations
4.1.2 Full-Approximation Storage Scheme for Nonlinear Equations
4.1.3 Extension to the Navier-Stokes Equations
4.2 Comparison of Pressure-Based Smoothers
4.3 Stability of Multigrid Iterations
4.3.1 Defect-Correction Method
4.3.2 Cost of Different Convection Schemes
4.4 Restriction and Prolongation Procedures
4.5 Concluding Remarks

5 IMPLEMENTATION AND PERFORMANCE ON THE CM5
5.1 Storage Problem
5.2 Multigrid Convergence Rate and Stability
5.2.1 Truncation Error Convergence Criterion for Coarse Grids
5.2.2 Numerical Characteristics of the FMG Procedure
5.2.3 Influence of Initial Guess on Convergence Rate
5.2.4 Remarks
5.3 Performance on the CM5
5.4 Concluding Remarks

REFERENCES
BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PRESSURE-BASED METHODS ON SINGLE-INSTRUCTION STREAM/MULTIPLE-DATA STREAM COMPUTERS

By Edwin L. Blosch

Chairman: Dr. Wei Shyy
Major Department: Aerospace Engineering, Mechanics and Engineering Science

Computationally and numerically scalable algorithms are needed to exploit emerging parallel-computing capabilities. In this work pressure-based algorithms which solve the two-dimensional incompressible Navier-Stokes equations are developed for single-instruction stream/multiple-data stream (SIMD) computers.

The implications of the continuity constraint for the proper numerical treatment of open boundary problems are investigated. Mass must be conserved globally so that the system of linear algebraic pressure-correction equations is numerically consistent. The convergence rate is poor unless global mass conservation is enforced explicitly. Using an additive-correction technique to restore global mass conservation, flows which have recirculating zones across the open boundary can be simulated.

The performance of the single-grid algorithm is assessed on three massively parallel computers, MasPar's MP1 and Thinking Machines' CM2 and CM5. Parallel efficiencies approaching 0.8 are possible, with speeds exceeding that of traditional vector supercomputers.
The following issues relevant to the variation of parallel ef ficiency with problem size are studied: the suitability of the algorithm for SIMD computation; the implementation of boundary conditions to avoid idle processors; vi the choice of point versus lineiterative relaxation schemes; the relative costs of the coefficient computations and solving operations, and the variation of these costs with problem size; the effect of the dataarraytoprocessor mapping; and the relative speeds of computation and communication of the computer. A nonlinear pressurecorrection multigrid algorithm which has better convergence rate characteristics than the singlegrid method is formulated and implemented on the CM5. On the CM5, the components of the multigrid algorithm are tested over a range of problem sizes. The smoothing step is the dominant cost. Pressurecorrection methods and the locallycoupled explicit method are equally efficient on the CM5. V cycling is found to be much cheaper than W cycling, and a truncationerror based "fullmultigrid" procedure is found to be a computationally efficient and convenient method for obtaining the initial finegrid guess. The findings presented enable further development of efficient, scalable pressurebased parallel computing algorithms. CHAPTER 1 INTRODUCTION 1.1 Motivations Computational fluid dynamics (CFD) is a growing field which brings together highperformance computing, physical science, and engineering technology. The dis tinctions between CFD and other fields such as computational physics and computa tional chemistry are largely semantic now, because increasingly more interdisplinary applications are coming within range of the computational capabilities. CFD algo rithms and techniques are mature enough that the focus of research is expected to shift in the next decade toward the development of robust flow codes, and toward the application of these codes to numerical simulations which do not idealize either the physics or the geometry and which take full account of the coupling between fluid dynamics and other areas of physics [65]. These applications will require formidable resources, particularly in the areas of computing speed, memory, storage, and in put/output bandwidth [78]. At the present time, the computational demands of the applications are still at least two ordersofmagnitude beyond the computing technology. For example, NASA's grand challenges for the 1990s are to achieve the capability to simulate vis cous, compressible flows with twoequation turbulence modelling over entire aircraft configurations, and to couple the fluid dynamics simulation with the propulsion and aircraft control systems modelling. To meet this challenge it is estimated that 1 ter aflops computing speed and 50 gigawords of memory will be required [24]. Current massivelyparallel supercomputers, for example, the CM5 manufactured by Thinking Machines, have peak speeds of 0(10 gigaflops) and memories of 0(1 gigaword). Optimism is sometimes circulated that teraflop computers may be expected by 1995 [68]. In view of the two ordersofmagnitude disparity between the speed of presentgeneration parallel computers and teraflops, such optimism should be dimmed somewhat. Expectations are not being met in part because the applications, which are the driving force behind the progress in hardware, have been slow to develop. The numerical algorithms which have seen two decades of development on traditional vec tor supercomputers are not always easy targets for efficient parallel implementation. 
Better understanding of the basic concepts and more experience with the present generation of parallel computers is a prerequisite for improved algorithms and implementations. The motivation of the present work has been the opportunity to investigate issues related to the use of parallel computers in CFD, with the hope that the knowledge gained can assist the transition to the new computing technology. The context of the research is the numerical solution of the 2d incompressible Navier-Stokes equations, by a popular and proven numerical method known as the pressure-correction technique. A specific objective emerged as the research progressed, namely to develop and analyze the performance of pressure-correction methods on the single-instruction stream/multiple-data stream (SIMD) type of parallel computer. Single-grid computations were studied first, then a multigrid method was developed and tested.

SIMD computers were chosen because they are easier to program than multiple-instruction stream/multiple-data stream (MIMD) computers (explicit message-passing is not required), because synchronization of the processors is not an issue, and because the factors affecting the parallel run time and computational efficiency are easier to identify and quantify. Also, these are arguably the most powerful machines available right now: Los Alamos National Laboratory has a 1024-node CM5 with 32 Gbytes of processor memory which is capable of 32 Gflops peak speed. Thus, the code, the numerical techniques, and the understanding which are the contribution of this research can be immediately useful for applications on massively parallel computers.

1.2 Governing Equations

The governing equations for 2d, constant-property, time-dependent viscous incompressible flow are the Navier-Stokes equations. They express the principles of conservation of mass and momentum. In primitive variables and cartesian coordinates, they may be written

\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0 \qquad (1.1)

\frac{\partial(\rho u)}{\partial t} + \frac{\partial(\rho u^2)}{\partial x} + \frac{\partial(\rho u v)}{\partial y} = -\frac{\partial p}{\partial x} + \mu\frac{\partial^2 u}{\partial x^2} + \mu\frac{\partial^2 u}{\partial y^2} \qquad (1.2)

\frac{\partial(\rho v)}{\partial t} + \frac{\partial(\rho u v)}{\partial x} + \frac{\partial(\rho v^2)}{\partial y} = -\frac{\partial p}{\partial y} + \mu\frac{\partial^2 v}{\partial x^2} + \mu\frac{\partial^2 v}{\partial y^2} \qquad (1.3)

where u and v are cartesian velocity components, \rho is the density, \mu is the fluid's molecular viscosity, and p is the pressure. Eq. 1.1 is the mass continuity equation, also known as the divergence-free constraint since its coordinate-free form is \nabla \cdot \vec{u} = 0.

The Navier-Stokes equations 1.1-1.3 are a coupled set of nonlinear partial differential equations of mixed elliptic/parabolic type. Mathematically, they differ from the compressible Navier-Stokes equations in two important respects that lead to difficulties for devising numerical solution techniques.

First, the role of the continuity equation is different in incompressible flow. Instead of a time-dependent equation for the density, in incompressible fluids the continuity equation is a constraint on the admissible velocity solutions. Numerical methods must be able to integrate the momentum equations forward in time while simultaneously maintaining satisfaction of the continuity constraint. On the other hand, numerical methods for compressible flows can take advantage of the fact that in the unsteady form each equation has a time-dependent term. The equations are cast in vector form, and any suitable method for time integration can be employed on the system of equations as a whole.

The second problem, assuming that a primitive-variable formulation is desired, is that there is no equation for pressure. For compressible flows, the pressure can be determined from the equation of state of the fluid.
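Before discussing how the pressure is determined in incompressible flow, it is useful to see the continuity constraint in its discrete, staggered-grid form. The short sketch below evaluates the net mass flux out of each pressure control volume; it is an illustration only, and the grid size, spacing, and array layout (u stored on vertical faces, v on horizontal faces) are assumptions rather than code from the dissertation.

```python
import numpy as np

def mass_residual(u, v, dx, dy, rho=1.0):
    """Net mass flux out of each pressure control volume on a staggered grid.

    u : (ni+1, nj) face-normal velocities on vertical (west/east) faces
    v : (ni, nj+1) face-normal velocities on horizontal (south/north) faces
    Returns an (ni, nj) array; a divergence-free field gives zeros.
    """
    return (rho * (u[1:, :] - u[:-1, :]) * dy
            + rho * (v[:, 1:] - v[:, :-1]) * dx)

# Example: a uniform stream u = 1, v = 0 is exactly divergence-free.
ni, nj, dx, dy = 8, 8, 0.1, 0.1
u = np.ones((ni + 1, nj))
v = np.zeros((ni, nj + 1))
print(np.abs(mass_residual(u, v, dx, dy)).max())  # prints 0.0
```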
For incompressible flow, an auxiliary "pressurePoisson" equation can be derived by taking the divergence of the vector form of the momentum equations; the continuity equation is invoked to eliminate the unsteady term in the result. The formulation of the pressurePoisson equation requires manipulating the discrete forms of the momentum and continuity equations. A particular discretization of the Laplacian operator is therefore implied in pressure Poisson equation, depending on the discrete gradient and divergence operators. This operator may not be implementable at boundaries, and solvability constraints can be violated [30]. Also, the differentiation of the governing equations introduces the need for additional unphysical boundary conditions on the pressure. Physically, the pressure in incompressible flow is only defined relative to an (arbitrary) constant. Thus, the correct boundary conditions are Neumann. However, if the problem has an open boundary, the governing equations should be supplemented with a boundary condition on the normal traction [29, 32], 1 &un F, p + (1.4) Re dn where F is the force, Re is the Reynolds number, and the subscript n indicates the normal direction. However, F, may be difficult to prescribe. In practice, a zerogradient or linear extrapolation for the normal velocity com ponent is a more popular outflow boundary condition. Many outflow boundary con ditions have been analyzed theoretically for incompressible flow (see [30, 31, 38, 56]). There are even more boundary condition procedures in use. The method used and its impact on the "solvability" of the resulting numerical systems of equations depends on the discretization and the numerical method. This issue is treated in Chapter 2. 1.3 Numerical Methods for Viscous Incompressible Flow Numerical algorithms for solving the incompressible NavierStokes system of equa tions were first developed by Harlow and Welch [39] and Chorin [15, 16]. Descendants of these approaches are popular today. Harlow and Welch introduced the important contribution of the staggeredgrid location of the dependent variables. On a stag gered grid, the discrete Laplacian appearing in the derivation of the pressurePoisson equation has the standard fivepoint stencil. On colocated grids it still has a five point form but, if the central point is located at (i,j), the other points which are involved are located at (i+2,j), (i2j), (i,j+2), and (ij2). Without nearestneighbor linkages, two uncoupled ("checkerboard") pressure fields can develop independently. This pressuredecoupling can cause stability problems, since nonphysical discontinu ities in the pressure may develop [50]. In the present work, the velocity components are staggered onehalf of a control volume to the west and south of the pressure which is defined at the center of the control volume as shown in Figure 1.1. Figure 1.1 also shows the locations of all boundary velocity components involved in the discretization and numerical solution, and representative boundary control volumes for u, v, and p. In Chorin's artificial compressibility approach [15] a timederivative of pressure is added to the continuity equation. In this manner the continuity equation becomes an equation for the pressure, and all the equations can be integrated forward in time, either as a system or one at a time. The artificial compressibility method is closely related to the penalty formulation used in finiteelement methods [41]. The equations are solved simultaneously in finiteelement formulations. 
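Returning briefly to the staggered-versus-colocated point made above, the checkerboard decoupling can be seen in a one-dimensional caricature. In the sketch below (illustrative values; the grid and names are assumed), a checkerboard pressure field produces an identically zero central-difference gradient when the stencil spans 2*dx, as it effectively does on a colocated grid, while the staggered-grid difference over dx detects and can damp the oscillation.

```python
import numpy as np

dx = 0.1
p = np.array([1.0, -1.0] * 6)        # checkerboard pressure, p_i = (-1)^i

# Colocated grid: dp/dx at node i uses p[i+1] and p[i-1], a 2*dx stencil.
grad_colocated = (p[2:] - p[:-2]) / (2 * dx)

# Staggered grid: the u face between nodes i and i+1 sees p[i+1] - p[i].
grad_staggered = (p[1:] - p[:-1]) / dx

print(grad_colocated)   # all zeros: the checkerboard mode is invisible
print(grad_staggered)   # alternating -20 / +20: the mode is detected
```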
Penalty methods and the artificial compressibility approach suffer from illconditioning when the equations have strong nonlinearities or source terms. Because the pressure term is artificial, they are not timeaccurate either. Projection methods [16, 62] are twostep procedures which first obtain a velocity field by integrating the momentum equations, and then project this vector field into a divergencefree space by subtracting the gradient of the pressure. The pressure Poisson equation is solved to obtain the pressure. The solution must be obtained to a high degree of accuracy in unsteady calculations in order to obtain the correct longterm behavior [76]every step may therefore be fairly expensive. Furthermore, the timestep size is limited by stability considerations, depending on the implicitness of the treatment used for the convection terms. "Pressurebased" methods for the incompressible NavierStokes equations include SIMPLE [61] and its variants, SIMPLEC [19], SIMPLER [60], and PISO [43]. These methods are similar to projection methods in the sense that a nonmassconserving velocity field is computed first, and then corrected to satisfy continuity. However, they are not implicit in two steps because the nonlinear convection terms are linearized explicitly. Instead of a pressurePoisson equation, an approximate equation for the pressure or pressurecorrection is derived by manipulating the discrete forms of the momentum and continuity equations. A few iterations of a suitable relaxation method are used to obtain a partial solution to the system of correction equations, and then new guesses for pressure and velocity are obtained by adding the corrections to the old values. This process is iterated until all three equations are satisfied. The iterations require underrelaxation because of the sequential coupling between variables. Compared to projection methods, pressurebased methods are less implicit when used for timedependent problems. However, they can be used to seek the steadystate directly if desired. Compared to a fully coupled strategy, the sequential pressurebased approach typically has slower convergence and less robustness with respect to Reynolds num ber. However, the sequential approach has the important advantage that additional complexities, for example, chemical reaction, can be easily accommodated by simply adding speciesbalance equations to the stack. The overall run time increases since each governing equation is solved independently, and the total storage requirements scale linearly with the number of equations solved. On the other hand, the computer time and storage requirements escalate faster in a fully coupled solution strategy. The typical way around this problem is to solve simultaneously the continuity and momen tum equations, then solve any additional equations in a sequential fashion. Without knowing beforehand that the pressurevelocity coupling is the strongest among all the various flow variables, however, the extra computational effort spent in simultaneous solution of these equations is unwarranted. There are other approaches for solving the incompressible NavierStokes equa tions, notably methods based on vorticitystreamfunction (wp) or velocityvorticity (uw) formulations, but pressurebased methods are easier, especially with regard to boundary conditions and possible extension to 3d domains. Furthermore, they have demonstrated considerable robustness in computing incompressible flows. 
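For contrast with the pressure-correction approach adopted in this work, the following sketch shows the projection half of a projection-method step described above: the curl-free part of an intermediate velocity field is removed so that the result is divergence-free. It is purely illustrative and assumes a periodic square domain with a spectral Poisson solve, which is neither the setting nor the discretization used in the dissertation.

```python
import numpy as np

def project(u, v, L=1.0):
    """Helmholtz projection of a periodic 2-D velocity field (spectral).

    Removes the curl-free part so the returned field is divergence-free,
    the second half of one projection-method time step.
    """
    n = u.shape[0]                       # assumes a square n x n grid
    k = 2j * np.pi * np.fft.fftfreq(n, d=L / n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    uh, vh = np.fft.fft2(u), np.fft.fft2(v)
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                       # avoid 0/0; mean flow is untouched
    phi = (kx * uh + ky * vh) / k2       # phi_hat = div_hat / laplacian_hat
    uh -= kx * phi
    vh -= ky * phi
    return np.real(np.fft.ifft2(uh)), np.real(np.fft.ifft2(vh))

# After this step the divergence of (u, v) vanishes to round-off for any
# smooth periodic input field.
```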
A broad range of applications of pressurebased methods is demonstrated in [73]. 1.4 Parallel Computing General background of parallel computers and their application to the numeri cal solution of partial differential equations is given in Hockney and Jesshope [40] and Ortega and Voigt [58]. Fischer and Patera [23] gave a recent review of parallel computing from the perspective of the fluid dynamics community. Their "indirect cost," the parallel run time, is of primary interest here. The "direct cost" of parallel computers and their components is another matter entirely. For the iterationbased numerical methods developed here, the parallel run time is the cost per iteration multiplied by the number of iterations. The latter is affected by the characteristics of the particular parallel computer used and the algorithms and implementations em ployed. Parallel computers come in all shapes and sizes, and it is becoming virtually impossible to give a thorough taxonomy. The background given here is limited to a description of the type of computer used in this work. 1.4.1 DataParallelism and SIMD Computers Singleinstruction stream/multipledata stream (SIMD) computers include the connection machines manufactured by the Thinking Machines Corporation, the CM and CM2, and the MP1, MP2, and MP3 computers produced by the MasPar Cor poration. These are massivelyparallel machines consisting of a frontend computer and many processor/memory pairs, figuratively, the "backend." The backend pro cessors are connected to each other by a "data network." The topology of the data network is a major feature of distributedmemory parallel computers. The schematic in Figure 1.2 gives the general idea of the SIMD layout. The program executes on the serial frontend computer. The frontend triggers the syn chronous execution of the "backend" processors by sending "code blocks" simul taneously to all processors. Actually, the code blocks are sent to an intermediate "control processorss)" The control processor broadcasts the instructions contained in the code block, one at a time, to the computing processors. These "frontend toprocessor" communications take time. This time is an overhead cost not present when the program runs on a serial computer. The operands of the instructions, the data, are distributed among the processors' memories. Each processor operates on its own locallystored data. The "data" in gridbased numerical methods are the arrays, 2d in this case, of dependent variables, geometric quantities, and equation coefficients. Because there are usually plenty of grid points and the same governing equations apply at each point, most CFD algorithms contain many operations to be performed at every grid point. Thus this "dataparallel" approach is very natural to most CFD algorithms. Many operations may be done independently on each grid point, but there is cou pling between grid points in physicallyderived problems. The data network enters the picture when an instruction involves another processor's data. Such "interpro cessor" communication is another overhead cost of solving the problem on a parallel computer. For a given algorithm, the amount of interprocessor communication de pends on the "data mapping," which refers to the partitioning of the arrays and the assignment of these "subgrids" to processors. For a given machine, the speed of the interprocessor communication depends on the pattern of communication (random or regular) and the distance between the processors (far away or nearestneighbor). 
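The influence of the data mapping can be estimated with a back-of-the-envelope model. The sketch below (hypothetical grid size, processor counts, and a nominal five words of work per point) computes the subgrid extents implied by a two-dimensional block partitioning and the number of halo words exchanged with the four nearest neighbors per iteration; the surface-to-volume ratio of the subgrid is precisely what the choice of mapping controls.

```python
def block_mapping_costs(N, P_rows, P_cols, words_per_point=5):
    """Per-processor work and halo-exchange volume for a 2-D block mapping.

    N x N grid on a P_rows x P_cols processor array.  Returns (words of
    local computation, words sent to the four nearest neighbours) per sweep.
    """
    sub_i = -(-N // P_rows)              # ceiling division: subgrid extents
    sub_j = -(-N // P_cols)
    compute = sub_i * sub_j * words_per_point
    halo = 2 * (sub_i + sub_j)           # one-cell-deep halo on four sides
    return compute, halo

for layout in [(32, 32), (1024, 1), (64, 16)]:
    c, h = block_mapping_costs(1024, *layout)
    print(layout, "compute:", c, "halo words:", h)
# The square layout minimises the communication-to-computation ratio.
```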
The run time of a parallel program depends first on the amount of frontend and parallel computation in the algorithm, and the speeds of the frontend and back end for doing these computations. In the programs developed here, the frontend computations are mainly the program control statements (IF blocks, DO loops, etc.). The frontend work is not sped up by parallel processing. The parallel computations are the useful work, and by design one hopes to have enough parallel computation to amortize both the frontend computation and the interprocessor and frontendto processor communication, which are the other factors that contribute to the parallel run time. From this brief description it should be clear that SIMD computers have four char acteristic speeds: the computation speed of the processors, the communication speed between processors, and the speed of the frontendtoprocessor communication, i.e. the speed that code blocks are transferred, and the speed of the frontend. These machine characteristics are not under the control of the programmer. However, the amount of computation and communication a program contains is determined by the programmer because it depends on the algorithm selected and the algorithm's imple mentation (the choice of the data mapping, for example). Thus, the key to obtaining good performance from SIMD computers is to pick a suitable algorithm, "matched" in a sense to the architecture, and to develop an implementation which minimizes and localizes the interprocessor communication. Then, if there is enough parallel computation to amortize the serial content of the program and the communication overheads, the speedup obtained will be nearly the number of processors. The actual performance, because it depends on the computer, the algorithm, and the imple mentation, must be determined by numerical experiment on a programbyprogram basis. SIMD computers are restricted to exploiting dataparallelism, as opposed to the parallelism of the tasks in an algorithm. The taskparallel approach is more com monly used, for example, on the Cray C90 supercomputer. Multipleinstruction stream/multipledata stream (MIMD) computers, on the other hand, are composed of moreorless autonomous processor/memory pairs. Examples include the Intel series of machines (iPSC/2, iPSC/860, and Paragon), workstation clusters, and the connec tion machine CM5. However, in CFD, the dataparallel approach is the prevalent one even on MIMD computers. The frontend/backend programming paradigm is implemented by selecting one processor to initiate programs on the other processors, accumulate global results, and enforce synchronization when necessary, a strategy called singleprogrammultipledata (SPMD) [23]. The CM5 has a special "control network" to provide automatic synchronization of the processor's execution, so a SIMD programming model can be supported as well as MIMD. SIMD is the manner in which the CM5 has been used in the present work. The advantage to using the CM5 in the SIMD mode is that the programmer does not have to explicitly specify messagepassing. This simplification saves effort and increases the effective speed of communication because certain timeconsuming protocols for the data transfer can be eliminated. 1.4.2 Algorithms and Performance The previous subsection discussed dataparallelism and SIMD computers, i.e. what parallel computing means in the present context and how it is carried out by SIMDtype computers. 
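A toy run-time model summarizes the preceding description. In the sketch below, the per-iteration time is split into a serial front-end part, a parallel part that divides among the processors, and a nearest-neighbor communication part; all timing constants are invented for illustration and are not CM5 measurements. The parallel efficiency discussed in the next subsection follows directly from such a model as E = T_1 / (n_p T_np).

```python
import math

def per_iteration_time(n_points, n_proc,
                       t_frontend=5.0e-4,   # serial front-end work per sweep
                       t_flop=1.0e-6,       # per-grid-point parallel work
                       t_halo=2.0e-6):      # per-word neighbour communication
    """Toy decomposition of the parallel run time into its three parts."""
    subgrid = math.ceil(n_points / n_proc)          # points per processor
    comm = 4.0 * math.sqrt(subgrid)                 # halo words exchanged
    return t_frontend + subgrid * t_flop + comm * t_halo

# Efficiency relative to one processor, E = T_1 / (n_p * T_np):
for n in (64**2, 256**2, 1024**2):
    t1 = per_iteration_time(n, 1, t_halo=0.0)       # serial run: no halo cost
    tp = per_iteration_time(n, 512)
    print(f"{n:8d} points on 512 processors: E = {t1 / (512 * tp):.2f}")
# Small grids are dominated by the front-end and communication overheads;
# large grids amortize them, which is why efficiency grows with problem size.
```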
To develop programs for SIMD computers requires one to recognize that unlike serial computers, parallel computers are not black boxes. In addition to the selection of an algorithm with ample dataparallelism, consideration must be given to the implementation of the algorithm in specific ways in order to achieve the desired benefits speedupss over serial computations). The success of the choice of algorithm and the implementation on a particular computer is judged by the "speedup" (S) and "efficiency" (E) of the program. The communications mentioned above, frontendtoprocessor and interprocessor, are es sentially overhead costs associated with the SIMD computational model. They would not be present if the algorithm were implemented on a serial computer, or if such communications were infinitely fast. If the overhead cost was zero, a parallel program executing on n, processors would run np times faster than on a single processor, a speedup of np. This idealized case would also have a parallel efficiency of 1. The parallel efficiency E measures the actual speedup in comparison with the ideal. One is also interested in how speedup, efficiency, and the parallel run time (Tp) scale with problem size, and with the number of processors used. The objective in using parallel computers is more than just obtaining a good speedup on a particular problem size and a particular number of processors. For parallel CFD, the goals are to either (1) reduce the time (the indirect cost [23]) to solve problems of a given complexity, to satisfy the need for rapid turnaround times in design work, or (2) increase the complexity of problems which can be solved in a fixed amount of time. For the iterationbased numerical methods studied here, there are two considerations: the cost per iteration, and the number of iterations, respectively, computational and numerical factors. The total run time is the product of the two. Gustafson [35] has presented fixedsize and scaledsize experiments whose results describe how the cost per iteration scales on a particular machine. In the fixed size experiment, the efficiency is measured for a fixed problem size as processors are added. The hope is that the run time is halved when the number of processors is doubled. However, the run time obviously cannot be reduced indefinitely by adding more processors because at some point the parallelism runs outthe limit to the attainable speedup is the number of grid points. In the scaledsize experiment, the problem size is increased along with the number of processors, to maintain a constant local problem size for each of the parallel processors. Care must be taken to make timings on a per iteration basis if the number of iterations to reach the end of the computation increases with the problem size. The hope in such an experiment is that the program will maintain a certain high level of parallel efficiency E. The ability to maintain E in the scaledsize experiment indicates that the additional processors increased the speedup in a oneforone trade. 1.5 PressureBased Multigrid Methods Multigrid methods are a potential route to both computationally and numerically scalable programs. Their cost per iteration on parallel computers and convergence rate is the subject of Chapters 45. For sufficiently smooth elliptic problems, the convergence rate of multigrid methods is independent of the problem sizetheir op eration count is O(N). 
In practice, good convergence rates are maintained as the problem size increases for NavierStokes problems, also, provided suitable multigrid componentsthe smoother, restriction and prolongation proceduresand multigrid techniques are employed. The standard Vcycle fullmultigrid (FMG) algorithm has an almost optimal operation count, O(log2N) for Poisson equations, on parallel com puters. Provided the multigrid algorithm is implemented efficiently and that the cost per iteration scales well with the problem size and the number of processors, the multigrid approach seems to be a promising way to exploit the increased computa tional capabilities that parallel computers offer. The pressurebased methods mentioned previously involve the solution of three systems of linear algebraic equations, one each for the two velocity components and one for the pressure, by standard iterative methods such as successive line underrelaxation (SLUR). Hence they inherit the convergence rate properties of these solvers, i.e. as the problem size grows the convergence rate deteriorates. With the singlegrid techniques, therefore, it will be difficult to obtain reasonable turnaround times when the problem size is increased into the target range for parallel com puters. Multigrid techniques for accelerating the convergence of pressurecorrection methods should be pursued, and in fact they have been within the last five or so years [70, 74, 80]. However, there are still many unsettled issues. The complexities affecting the convergence rate of singlegrid calculations carry over to the multigrid framework and are compounded there by the coupling between the evolving solutions on multiple grid levels, and by the particular "gridscheduling" used. Linear multigrid methods have been applied to accelerate the convergence rate for the solution of the system of pressure or pressurecorrection equations [4, 22, 42, 64, 94]. However, the overall convergence rate does not significantly improve because the velocitypressure coupling is not addressed [4, 22]. Therefore the multigrid strategy should be applied on the "outer loop," with the role of the iterative relaxation method played by the numerical methods described above, e.g. the projection method or the pressurecorrection method. Thus, the generic term "smoother" is prescribed because it reflects the purpose of the solution of the coupled system of equations going on inside the multigrid cycleto smooth the residual so that an accurate coarsegrid approximation of the finegrid problem is possible. It is not true that a good solver, one with a fast convergence rate on singlegrid computations, is necessarily a good smoother of the residual. It is therefore of interest to assess pressurecorrection meth ods as potential multigrid smoothers. See Shyy and Sun [74] for more information on the staggeredgrid implementation of multigrid methods, and some encouraging results. Staggered grids require special techniques [21, 74] for the transfer of solutions and residuals between grid levels, since the positions of the variables on different levels do not correspond. However, they alleviate the "checkerboard" pressure stability problem [50], and since techniques have already been established [74], there is no reason not to go this route, especially when cartesian grids are used as in the present work. 
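The multigrid vocabulary used here (smoothing, restriction, prolongation, V cycle) can be fixed with a small recursive skeleton. The sketch below is a generic linear-correction V cycle for a model Poisson problem with Dirichlet boundaries; it is not the nonlinear full-approximation storage scheme developed in Chapter 4, and the damped-Jacobi smoother, injection restriction, and bilinear prolongation are placeholder choices for illustration.

```python
import numpy as np

def smooth(u, f, h, sweeps=2):
    """Damped-Jacobi smoothing of the 2-D Poisson problem -lap(u) = f."""
    for _ in range(sweeps):
        u[1:-1, 1:-1] += 0.8 * 0.25 * (
            u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
            - 4.0 * u[1:-1, 1:-1] + h * h * f[1:-1, 1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1, 1:-1] = f[1:-1, 1:-1] + (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4.0 * u[1:-1, 1:-1]) / (h * h)
    return r

def v_cycle(u, f, h):
    """One V cycle: smooth, restrict the residual, recurse, prolong, smooth."""
    if u.shape[0] <= 3:                       # coarsest grid: smooth and stop
        return smooth(u, f, h, sweeps=20)
    u = smooth(u, f, h)
    r = residual(u, f, h)
    r_coarse = r[::2, ::2].copy()             # injection restriction
    e_coarse = v_cycle(np.zeros_like(r_coarse), r_coarse, 2 * h)
    e = np.zeros_like(u)                      # bilinear-style prolongation
    e[::2, ::2] = e_coarse
    e[1::2, :] = 0.5 * (e[:-1:2, :] + e[2::2, :])
    e[:, 1::2] = 0.5 * (e[:, :-1:2] + e[:, 2::2])
    return smooth(u + e, f, h)

# Example: one V cycle on a 65 x 65 grid (grid sizes of the form 2^k + 1).
n, h = 65, 1.0 / 64
u = v_cycle(np.zeros((n, n)), np.ones((n, n)), h)
```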
Vanka [89] has proposed a new numerical method as a smoother for multigrid computations, one which has inferior convergence properties as a singlegrid method but apparently yields an effective multigrid method. A staggeredgrid finitevolume discretization is employed. In Vanka's smoother, the velocity components and pres sure of each control volume are updated simultaneously, so it is a coupled approach, but the coupling between control volumes is not taken into account, so the calcu lation of new velocities and pressures is explicit. This method is sometimes called the "locallycoupled explicit" or "blockexplicit" pressurebased method. The control volumes are visited in lexicographic order in the original method which is therefore aptly called BGS (block GaussSeidel). Linevariants have been developed to couple the flow variables in neighboring control volumes along lines (see [80, 87]). Linden et al.[50] gave a brief survey of multigrid methods for the steadystate in compressible NavierStokes equations. They argue without analysis that BGS should be preferred over the pressurecorrection type methods since the strong local cou pling is likely to have better success smoothing the residual locally. On the other hand, Sivaloganathan and Shaw [71, 70] have found good smoothing properties for the pressurecorrection approach, although the analysis was simplified considerably. Sockol [80] has compared the point and linevariants of BGS with the pressure correction methods on serial computers, using model problems with different physical characteristics. SIMPLE and BGS emerge as favorites in terms of robustness with BGS preferred due to a lower cost per iteration. This preference may or may not carry over to SIMD parallel computers (see Chapter 4 for comparison). Interesting applications of multigrid methods to incompressible NavierStokes flow problems can be found in [12, 28, 48, 54]. In terms of parallel implementations there are far fewer results although this field is rapidly growing. Simon [77] gives a recent crosssection of parallel CFD results. Parallel multigrid methods, not only in CFD but as a general technique for partial differential equations, have received much attention due to their desirable O(N) operation count on Poisson equations. However, it is apparently difficult to find or design parallel computers with ideal communication networks for multigrid [13]. Consequently implementations have been pursued on a variety of machines to see what performance can be obtained with the present generation of parallel machines, and to identify and understand the basic issues. Dendy et al.[18] have recently described a multigrid method on the CM2. However, to accommodate the data parallel programming model they had to dimension their array data on every grid level to the dimension extents of the finest grid array data. This approach is very wasteful of storage. Consequently the size of problems which can be solved is greatly reduced. Recently an improved release of the compiler has enabled the storage problem to be circumvented with some programming diligence (see Chapter 5). The implementation developed in this work is one of the first to take advantage of the new compiler feature. In addition to parallel implementations of serial multigrid algorithms, several novel multigrid methods have been proposed for SIMD computers [25, 26, 33]. Some of the algorithms are instrinsically parallel [25, 26] or have increased parallelism because they use multiple coarse grids, for example [33]. 
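Both the coarse-grid idle-processor problem and the fine-grid-padding storage problem mentioned above are easy to quantify. The sketch below (assumed fine-grid and machine sizes) counts how many processors of a two-dimensional processor array can be kept busy on each level, and compares the storage needed when every level is dimensioned to the finest-grid extents with the storage needed when each level is dimensioned to its own size.

```python
def multigrid_level_stats(n_fine=1024, n_proc=32 * 32, levels=8):
    """Active processors and words of storage per level of a 2-D multigrid."""
    padded = exact = 0
    n = n_fine
    for lev in range(levels):
        points = n * n
        active = min(points, n_proc)             # parallelism <= grid points
        exact += points                           # arrays sized to this level
        padded += n_fine * n_fine                 # arrays padded to fine grid
        print(f"level {lev}: {n:5d}^2 grid, {active:5d} active processors")
        n //= 2
    print("storage, exact :", exact, "words per field")
    print("storage, padded:", padded, "words per field")

multigrid_level_stats()
```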
These efforts and others have been recently reviewed [14, 53, 92]. Most of the new ideas have not been developed yet for solving the incompressible NavierStokes equations. One of the most prominent concerns addressed in the literature regarding parallel implementations of serial multigrid methods is the coarse grids. When the number of grid points is smaller than the number of processors the parallelism is reduced to the number of grid points. This loss of parallelism may significantly affect the parallel efficiency. One of the routes around the problem is to use multiple coarse grids [59, 33, 79]. Another is to alter the gridscheduling to avoid coarse grids. This approach can lead to computationally scalable implementations [34, 49] but may sacrifice the convergence rate. "Agglomeration" is an efficiencyincreasing technique used in MIMD multigrid programs which refers to the technique of duplicating the coarse grid problem in each processor so that computation proceeds independently (and redundantly). Such an approach can also be scalable [51]. However, most atten tion so far has focused on parallel implementations of serial multigrid algorithms, in particular on assessing the importance of the coarsegrid smoothing problem for dif ferent machines and on developing techniques to minimize the impact on the parallel efficiency. 1.6 Description of the Research The dissertation is organized as follows. Chapter 2 discusses the role of the mass conservation in the numerical consistency of the singlegrid SIMPLE method for open boundary problems, and explains the relevance of this issue to the convergence rate. In Chapter 3 the singlegrid pressurecorrection method is implemented on the MP1, CM2, and CM5 computers and its performance is analyzed. High parallel efficien cies are obtained at speeds and problem sizes well beyond the current performance of such algorithms on traditional vector supercomputers. Chapter 4 develops a multigrid numerical method for the purpose of accelerating the singlegrid pressurecorrection method and maintaining the accelerated convergence property independent of the problem size. The multigrid smoother, the intergrid transfer operators, and the sta bilization strategy for NavierStokes computations are discussed. Chapter 5 describes the actual implementation of the multigrid algorithm on the CM5, its convergence rate, and its parallel run time and scalability. The convergence rate depends on the 18 flow problem and the coarsegrid discretization, among other factors. These factors are considered in the context of the "fullmultigrid" (FMG) starting procedure by which the initial guess on the fine grid is obtained. The cost of the FMG proce dure is a concern for parallel computation [88], and this issue is also addressed. The results indicate that the FMG procedure may influence the asymptotic convergence rate and the stability of the multigrid iterations. Concluding remarks in each chapter summarize the progress made and suggest avenues for further study. Figure 1.1. Staggeredgrid layout of dependent variables, for a small but complete domain. Boundary values involved in the computation are shown. Representative u, v, and pressure boundary control volumes are shaded. short blocks of parallel code Sequencer (CM2) Array control unit (MP1) Multiple SPARC nodes (CM5: individual instruction P.E. P. E. P. E. P. E. 
more P.E.s 0 0 0 array data partitioned among processor memories Interprocessor communication network hypercube (CM2) + "NEWS" 3stage crossbar (MP1) + "XNet" fat tree (CM5) Figure 1.2. Layout of the MP1, CM2, and CM5 SIMD computers. Front End (CM2 and MP1) Partition Manager (CM5) > serial code, control code, scalar data a 0 a * 0 0 CHAPTER 2 PRESSURECORRECTION METHODS 2.1 FiniteVolume Discretization on Staggered Grids The formulation of the numerical method used in this work begins with the inte gration of the governing equations Eq 1.11.3 over each of the control volumes in the computational domain. Figure 1.1 shows a model computational domain with u, v, and p (cellcentered) control volumes shaded. The continuity equation is integrated over the p control volumes. Consider the discretization of the umomentum equation for the control volume shown in Figure 2.1 whose dimensions are Ax and Ay. The v control volumes are done exactly the same except rotated 900. Integration of Eq. 1.2 over the shaded region is interpreted as follows for each of the terms: I Opu Opup P z dxdy AAy, (2.1) at at 2 dx dy = (pu pu ) Ay, (2.2) audx dy = (puv, pu,sv) A, (2.3) S dx dy = (p p,)Ay, (2.4) I (" ( au It2U dxdy = a Ax (2.6) jXdx = d2ei jA (2.5) / P a dady=Pau Itau A, X (2.6) if 49y 2 9ay ys) The lowercase subscripts e, w. n, s indicate evaluation on the control volume faces. By convention and the meanvalue theorem, these are at the midpoint of the faces. The subscript P in Eq. 2.1 indicates evaluation at the center of the control volume. 21 Because of the staggered grid, the required pressure values in Eq. 2.4 are already located on the u control volume faces. The pressuregradient term is effectively a secondorder centraldifference approximation. With colocated grids, however, the controlvolume face pressures are obtained by averaging the nearby pressures. This averaging results in the pressure at the cell center dropping out of the expression for the pressure gradient. The centraldifference in Eq. 2.4 is effectively taken over a distance 2Ax on colocated grids. Thus staggered cartesian grids provide a more accurate approximation of the pressuregradient term since the difference stencil is smaller. The next step is to approximate the terms which involve values at the control volume faces. In Eq. 2.2, one of the ue and one of the u, are replaced by an average of neighboring values, 2 2) A ( UE+ Up UP+ UW ) (ue pu y = p 2 e P U2 A (2.7) and in Eq. 2.3, v, and v, are obtained by averaging nearby values, ( Vne + nw Vse + Vs (pUnVn pusv) Ax= p V u pV us), Ax (2.8) The remaining face velocities in the convection terms, u,, u,, ue, and uw, are ex pressed as a certain combination of the nearby u valueswhich u values are involved and what weighting they receive is prescribed by the convection scheme. Some pop ular recirculating flow convection schemes are described in [73, 75]. The controlvolume face derivatives in the diffusion terms are evaluated by central differences, P Ou 9u EUP PUW A (2.9) P I a AX = M UN P p Ax (2.10) dy ay Ay Ay The unsteady term in Eq. 2.1 is approximated by a backward Euler scheme. All the terms are evaluated at the "new" time level, i.e. implicitly. Thus, the discretized momentum equations for each control volume can be put into the following general form, apup = aEUE + awuw + aNUN + asus + b, (2.11) where b = (pw pe)Ay+pun/At, the superscript n indicating the previous timestep. The coefficients aN, as, etc. are comprised of the terms which modify UN, us, etc. 
in the discretized convection and diffusion terms. The continuity equation is integrated over a pressure control volume, I/ [ pu+ dx dy = p(ue u)Ay + p(vn v)A = 0. (2.12) Again the staggered grid is an advantage because the normal velocity components on each control volume face are already in positionthere is no need for interpolation. 2.2 The SIMPLE Method One SIMPLE iteration takes initial velocity and pressure fields (u*, v*,p*) and computes new guesses (u, v,p). The intermediate values are denoted with a tilde, (it, j,). In the algorithm below, au(u*, v*), for example, means that the aN coeffi cient in the umomentum equation depends on u* and v*. The parameters v,, v,, and vc are the numbers of "inner" iterations to be taken for the u, v, and continuity equa tions, respectively. This notation will be clarified by the following discussion. The inner iteration count is indicated by the superscript enclosed in parentheses. Finally, wU and w, are the relaxation factors for the momentum and continuity equations. SIMPLE (u*, v*, p*; Vu, vV, Vp, WU, wc) Compute u coefficients au(u*, v*) (k = P,E,W,N,S) and source term b"(u*,p*) for each discrete umomentum equation: a~Up = aNiUN + auis + auiE + a'viw + bU + (1 wu 2u UJUV s Ev W JUV P Do v, iterations to obtain an approximate solution for ii starting with u* as the initial guess u(n) = Gu(n1) + fU f = un=u) Compute v coefficients ak((i, v*) (k = E,W,N,S) and source term bv(v*,p*) for each discrete vmomentum equation: v2p = a'VNl + a'ls + as + IE + a'w w + b + (1 wuv,) v Do v iterations to obtain an approximate solution for v starting with v* as the initial guess v(n) = Gv(nl1) + fV V = v(n=v) Compute p' coefficients a' (k = P,E,W,N,S) and source term bc(ii, 3) for each discrete p' equation: app p = aNp N + asp s + aEP'E + awP'w + bc Do vc iterations to obtain an approximate solution for p' starting with zero as the initial guess p'(") = Gp'("1) + f Correct f, i, and p* at every interior grid point up = ip + , (ap')p Vp = pp + (a')p pp = p* + WcP'p The algorithm is not as complicated as it looks. The important point to note is that the major tasks to be done are the computing of coefficients and the solving of the systems of equations. The symbol G indicates the iteration matrix of whatever type relaxation is used on these inner iterations (SLUR in this case), and f is the corresponding source term. In the SIMPLE pressurecorrection method [61], the averages in Eq. 2.7 and 2.8 are lagged in order to linearize the resulting algebraic equations. The governing equations are solved sequentially. First, the u momentum equation coefficients are computed and an updated u field is computed by solving the system of linear alge braic equations. The pressures in Eq. 2.4 are lagged. The v momentum equation is solved next to update v. The continuity equation, recast in terms of pressure correc tions, is then set up and solved. These pressure corrections are coupled to velocity corrections. Together they are designed to correct the velocity field so that it satisfies the continuity constraint, while simultaneously correcting the pressure field so that momentum conservation is maintained. The relationship between the velocity and pressure corrections is derived from the momentum equation, as described in the next section. The resulting system of equations is fully coupled, as one might expect knowing the elliptic nature of pressure in incompressible fluids, and is therefore expensive to solve. 
However, if the resulting system of pressurecorrection equations were solved exactly, the divergence free constraint and the momentum equations (with old values of u and v present in the nonlinear convection terms) would be satisfied. This approach would constitute an implicit method of time integration for the linearized equations. The timestep size would have to be limited to avoid stability problems caused by the linearization. To reduce the computational cost, the SIMPLE prescription is to use an approx imate relationship between the velocity and pressure corrections (hence the label "semiimplicit"). Variations on the original SIMPLE approximation have shown bet ter convergence rates for simple flow problems, but in discretizations on curvilinear grids and other problems with significant contributions from source terms, the per formance is no better than the original SIMPLE method (see the results in [4]). The goal of satisfying the divergencefree constraint can still be attained, if the system of pressurecorrection equations is converged to strict tolerances, because the discrete continuity equations are still being solved. But satisfaction of the momentum equations cannot be maintained with the approximate relationship. Consequently it is no longer desirable to solve the p'system of equations to strict tolerances. It erations are necessary to find the right velocities and pressures which satisfy all three equations. Furthermore, since the equation coefficients are changing from one iteration to the next, it is pointless to solve the momentum equations to strict tol erances. In practice, only a few iterations of a standard scheme such as successive lineunderrelaxation (SLUR) are performed. The single "outer" iteration outlined above is repeated many times, with under relaxation to prevent the iterations from diverging. In this sense a twolevel iterative procedure is being employed. In the outer iterations, the momentum and pressure correction equations are iteratively updated based on the linearized coefficients and sources, and inner iterations are applied to partially solve the systems of linear alge braic equations. The fact that only a few inner iterations are taken on each system of equations sug gests that the asymptotic convergence rate of the iterative solver, which is the usual means of comparison between solvers, does not necessarily dictate the convergence rate of the outer iterative process. Braaten and Shyy [4] have found that the con vergence rate of the outer iterations actually decreases when the pressurecorrection equation is solved to a much stricter tolerance than the momentum equations. They concluded that the balance between the equations is important. Because u, v, and p' are segregated, the overall convergence rate is strongly dependent on the partic ular flow problem, the grid distribution and quality, and the choice of relaxation parameters. In contrast to projection methods, which are twostep but treat the convection terms explicitly (or more recently by solving a Riemann problem [2]) and are therefore restricted from taking too large a timestep, the pressurecorrection approach is fully implicit with no timestep limitation, but many iterations may be necessary. The projection methods are formalized as timeintegration techniques for semidiscrete equations. SIMPLE is an iterative method for solving the discretized NavierStokes system of coupled nonlinear algebraic equations. 
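Because only a few inner iterations are taken, the line-relaxation kernel mentioned above is worth seeing in isolation. The sketch below (array shapes and the handling of boundary links are assumptions) performs one sweep of successive line relaxation for an equation of the form a_P phi_P = a_E phi_E + a_W phi_W + a_N phi_N + a_S phi_S + b, solving each vertical line implicitly with the Thomas algorithm while treating the east and west neighbors explicitly; under-relaxation is assumed to be built into a_P and b as in the algorithm above.

```python
import numpy as np

def thomas(lower, diag, upper, rhs):
    """Tridiagonal solve by forward elimination and back substitution."""
    n = len(rhs)
    d, r = diag.astype(float), rhs.astype(float)
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]
    x = np.empty(n)
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def slur_vertical_sweep(phi, aP, aE, aW, aN, aS, b):
    """One line sweep: each vertical line solved implicitly (N/S coupling),
    with east/west neighbours taken at their latest available values."""
    ni, nj = phi.shape
    for i in range(ni):
        east = aE[i] * phi[i + 1] if i + 1 < ni else 0.0
        west = aW[i] * phi[i - 1] if i - 1 >= 0 else 0.0
        phi[i] = thomas(-aS[i], aP[i], -aN[i], b[i] + east + west)
    return phi
```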
But the details given above should make it clear that these techniques bear strong similaritiesspecifically, a single SIMPLE iteration would be a projection method if the system of pressurecorrection equations was solved to strict tolerances at each iteration. It would be interesting to do some numerical comparisons between projection methods and pressurecorrection methods to further clarify the similarity. 2.3 Discrete Formulation of the PressureCorrection Equation The discrete pressurecorrection equation is obtained from the discrete momentum and continuity equations as follows. The velocity field which has been newly obtained by solving the momentum equations was denoted by (ii, v) earlier. The pressure field after the momentum equations are solved still has the initial value p*. So fi, u, and p* satisfy the umomentum equation apUip = aEUE + awuw + aNUN + asfis + (p* p*)Ay, (2.13) and the corresponding vmomentum equation. The corrected (continuitysatisfying) velocity field (u, v) satisfies the umomentum equation with the corrected pressure field p, apup = aEUE + awuw + aNUN + asus + (pw pe)Ay, (2.14) and likewise for the vmomentum equation. Additive corrections are assumed, i.e. u = ii + u' (2.15) v = v' (2.16) p= p* + p'. (2.17) Subtracting Eq. 2.13 from Eq. 2.14 gives the desired relationship between pressure and the u corrections, apup = akk + (p, p')Ay, (2.18) k=E,W,N,S with a similar expression for the v corrections. If Eq. 2.18 is used as is, then the nearby velocity corrections in the summation need to be replaced by similar expressions involving pressurecorrections. This requirement brings in more velocity corrections and more pressure corrections, and so on, leading to an equation which involves the pressure corrections at every grid point. The resulting system of equations would be expensive to solve. Thus, the summation term is dropped in order to obtain a compact expression for the velocity correction in terms of pressure corrections. At convergence, the pressure corrections (and therefore the velocity corrections) go to zero, so the precise form of the approximate pressure velocity correction relationship does not figure in the final converged solution. The discrete form of the pressurecorrection equation follows by first substituting the simplified version of Eq. 2.18 into Eq. 2.15, Up = Up + Up = Up + (p, p')Ay, (2.19) and then substituting this into the continuity equation Eq. 2.12, (with an analogous formula for vp). The result is pAy2 pAy2 PAx2 I PAX2 a (pu),P) (P'' (PPP)+ (PP pN)  (P') = b, (2.20) ap(ue) ap(uw) ap(vn) ap(v,) where the source term b is b = p,Ay piiAy + pv;:A pvAx (2.21) Recall that Eq. 2.20 and Eq. 2.21 are written for the pressure control volumes, so that there is some interpretation required. The term ap(ue) in Eq. 2.20 is the appropriate ap for the discretized umomentum equation, Eq. 2.13. In other words, up in Eq. 2.13 is actually u,, u,, u,, or us in Eq. 2.20 and 2.21, relative to the pressure control volumes on the staggered grid. Eq. 2.20 can be rearranged into the same general form as Eq. 2.11. From Eq. 2.21, it is apparent that the righthand side term is the net mass flux entering the control volume, which should be zero in incompressible flow. In the formulation of the pressurecorrection equation for boundary control vol umes, one makes use of the fact that the normal velocity components on the bound aries are known from either Dirichlet or Neumann boundary conditions, so no velocity correction is required there. 
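In code, the interior-cell assembly of the pressure-correction equation derived in the next section (Eq. 2.20) reduces to a few array expressions. The sketch below is schematic: the array layout and names are assumptions, and boundary control volumes, where the corresponding link is simply dropped, are omitted. It forms the neighbor coefficients from the momentum-equation central coefficients and the source term from the cell mass imbalance of the provisional velocities.

```python
import numpy as np

def assemble_p_prime(u_tld, v_tld, aPu, aPv, dx, dy, rho=1.0):
    """Coefficients and source of the pressure-correction equation (Eq. 2.20).

    u_tld : (ni+1, nj) provisional u on vertical faces; aPu its a_P values
    v_tld : (ni, nj+1) provisional v on horizontal faces; aPv its a_P values
    Returns aE, aW, aN, aS, aP and the mass-imbalance source b, all (ni, nj).
    """
    aE = rho * dy**2 / aPu[1:, :]        # east  face: a_P of the u_e equation
    aW = rho * dy**2 / aPu[:-1, :]       # west  face: a_P of the u_w equation
    aN = rho * dx**2 / aPv[:, 1:]        # north face: a_P of the v_n equation
    aS = rho * dx**2 / aPv[:, :-1]       # south face: a_P of the v_s equation
    aP = aE + aW + aN + aS               # interior cells: a_P is the sum
    b = (rho * (u_tld[:-1, :] - u_tld[1:, :]) * dy
         + rho * (v_tld[:, :-1] - v_tld[:, 1:]) * dx)
    return aE, aW, aN, aS, aP, b
```

The treatment of the boundary control volumes, where the normal velocities are known and no correction is needed, is taken up next.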
Consequently, the formulation of Eq. 2.20 for boundary control volumes does not require any prescription of boundary p' values [60] when velocity boundary conditions are prescribed. Without the summation from Eq. 2.18, it is apparent that a zero velocity correction for the outflow boundary u-velocity component is obtained when p'_w = p'_e; in effect, a Neumann boundary condition on pressure is implied. This boundary condition is appropriate for an incompressible fluid because it is physically consistent with the governing equations, in which only the pressure gradient appears. There is a unique pressure gradient, but the level is adjustable by any constant amount. If it happens that there is a pressure specified on the boundary, for example by Eq. 1.4, then the correction there will be zero, providing a boundary condition for Eq. 2.20. Thus, it seems that there are no concerns over the specification of boundary conditions for the p' equations.

2.4 Well-Posedness of the Pressure-Correction Equation

2.4.1 Analysis

To better understand the characteristics of the pressure-correction step in the SIMPLE procedure, consider a model 3 x 3 computational domain, so that 9 algebraic equations for the pressure corrections are obtained. Number the control volumes as shown in Figure 2.3. Then the system of p' equations can be written

\begin{bmatrix}
a_P^1 & -a_E^1 & 0 & -a_N^1 & 0 & 0 & 0 & 0 & 0 \\
-a_W^2 & a_P^2 & -a_E^2 & 0 & -a_N^2 & 0 & 0 & 0 & 0 \\
0 & -a_W^3 & a_P^3 & 0 & 0 & -a_N^3 & 0 & 0 & 0 \\
-a_S^4 & 0 & 0 & a_P^4 & -a_E^4 & 0 & -a_N^4 & 0 & 0 \\
0 & -a_S^5 & 0 & -a_W^5 & a_P^5 & -a_E^5 & 0 & -a_N^5 & 0 \\
0 & 0 & -a_S^6 & 0 & -a_W^6 & a_P^6 & 0 & 0 & -a_N^6 \\
0 & 0 & 0 & -a_S^7 & 0 & 0 & a_P^7 & -a_E^7 & 0 \\
0 & 0 & 0 & 0 & -a_S^8 & 0 & -a_W^8 & a_P^8 & -a_E^8 \\
0 & 0 & 0 & 0 & 0 & -a_S^9 & 0 & -a_W^9 & a_P^9
\end{bmatrix}
\begin{bmatrix} p'_1 \\ p'_2 \\ p'_3 \\ p'_4 \\ p'_5 \\ p'_6 \\ p'_7 \\ p'_8 \\ p'_9 \end{bmatrix}
=
\rho
\begin{bmatrix}
u_w^1 - u_e^1 + v_s^1 - v_n^1 \\
u_w^2 - u_e^2 + v_s^2 - v_n^2 \\
u_w^3 - u_e^3 + v_s^3 - v_n^3 \\
u_w^4 - u_e^4 + v_s^4 - v_n^4 \\
u_w^5 - u_e^5 + v_s^5 - v_n^5 \\
u_w^6 - u_e^6 + v_s^6 - v_n^6 \\
u_w^7 - u_e^7 + v_s^7 - v_n^7 \\
u_w^8 - u_e^8 + v_s^8 - v_n^8 \\
u_w^9 - u_e^9 + v_s^9 - v_n^9
\end{bmatrix} \qquad (2.22)

where the superscript designates the cell location and the subscript designates the coefficient linking the point in question, P, and the neighboring node. The right-hand side velocities are understood to be tilde quantities as in Eq. 2.21.

In finite-volume discretizations, fluxes are estimated at the control volume faces which are common to adjacent control volumes, so if the governing equations are cast in conservation-law form, as they are here, the discrete efflux of any quantity out of one control volume is guaranteed to be identical to the influx into its neighbor. There is no possibility of internal sources or sinks. In fact this is what makes finite-volume discretizations preferable to finite-difference discretizations. The following relationships, using control volume 5 in Figure 2.3 as an example, follow from Eq. 2.20 and the internal consistency of finite-volume discretizations:

a_P^5 = a_E^5 + a_W^5 + a_N^5 + a_S^5 \qquad (2.23)

a_E^5 = a_W^6, \quad a_W^5 = a_E^4, \quad a_N^5 = a_S^8, \quad a_S^5 = a_N^2 \qquad (2.24)

u_e^5 = u_w^6, \quad u_w^5 = u_e^4, \quad v_n^5 = v_s^8, \quad v_s^5 = v_n^2 \qquad (2.25)

Eq. 2.23 states that the coefficient matrix is pentadiagonal and diagonally dominant for the interior control volumes. Furthermore, when the natural boundary condition (zero velocity correction) is applied, the appropriate term in Eq. 2.20 for the boundary under consideration does not appear, and therefore the pressure-correction equations for the boundary control volumes also satisfy Eq. 2.23. If a pressure boundary condition is applied so that the corresponding pressure correction is zero, then one would set p'_E = 0 in Eq. 2.20, for example, which would give a_W + a_N + a_S < a_P. Thus, either way, the entire coefficient matrix in Eq. 2.22 is diagonally dominant. However, with the natural prescription for boundary treatment, no diagonal term exceeds the sum of its off-diagonal terms. Thus, the system of equations Eq.
Thus, the system of equations Eq. 2.22 is linearly dependent with the natural (velocity) boundary conditions, which can be verified by adding the 9 equations above. Because of Eq. 2.23 and Eq. 2.24, all terms on the left-hand side of Eq. 2.22 identically cancel one another. At all interior control-volume interfaces, the right-hand side terms identically cancel due to Eq. 2.25, and the remaining source terms are simply the boundary mass fluxes. This cancellation is equivalent to a discrete statement of the divergence theorem,

$\int_{\Omega} \nabla \cdot \vec{V} \, d\Omega = \oint_{\partial\Omega} \vec{V} \cdot \hat{n} \, d(\partial\Omega)$   (2.26)

where $\Omega$ is the domain under consideration and $\hat{n}$ is the unit vector in the direction normal to its boundary $\partial\Omega$. Due to the linear dependence of the left-hand side of Eq. 2.22, the boundary mass fluxes must also sum to zero in order for the system of equations to be consistent. No solution exists if the linearly dependent system of equations is inconsistent. The situation can be likened to a steady-state heat conduction problem with source terms and adiabatic boundaries: clearly, a steady-state solution only exists if the sum of the source terms is zero. If there is a net heat source, then the temperature inside the domain simply rises without bound if an iterative solution strategy (quasi time-marching) is used. Likewise, the net mass source in flow problems with open boundaries must sum to zero for the pressure-correction equation to have a solution. In other words, global mass conservation is required in discrete form in order for a solution to exist. The interesting point to note is that during the course of SIMPLE iterations, when the pressure-correction equation is solved, the velocity field does not usually conserve mass globally in flow problems with open boundaries, unless an explicit measure is taken to enforce global mass conservation. The purpose of solving the pressure-correction equations is to drive the local mass sources to zero by suitable velocity corrections, but the pressure-correction equations which are supposed to accomplish this purpose do not have a solution unless the net mass source is already zero. For domains with closed boundaries, global mass conservation is obviously not an issue. Furthermore, this problem does not show up only when the initial guess is bad. In the backward-facing step flow discussed below, the initial guess is zero everywhere except for inflow, which obviously is the worst case as far as a net mass source is concerned (all inflow and no outflow). But even if one starts with a mass-conserving initial guess, during the course of iterations the outflow velocity boundary condition, which is necessary to solve the momentum equations, will reset the outflow so that the global mass-conservation constraint is violated.
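The solvability condition can be stated compactly: the right-hand side of Eq. 2.22 must be orthogonal to the null space of the matrix, i.e. the boundary mass fluxes must sum to zero. The sketch below, with hypothetical helper and variable names and an illustrative inlet profile, computes the global mass deficit of a provisional velocity field on a rectangular open-boundary domain; it is this quantity that remains as the fixed inner-loop residual when the constraint is violated.

    import numpy as np

    def global_mass_deficit(u, v, dx, dy, rho=1.0):
        """Net mass flux out of a rectangular domain for staggered face velocities.

        u : (nx+1, ny) u on vertical faces, v : (nx, ny+1) v on horizontal faces.
        A nonzero return value means the p' system of Eq. 2.22 has no solution.
        """
        outflow = rho * dy * u[-1, :].sum() + rho * dx * v[:, -1].sum()
        inflow = rho * dy * u[0, :].sum() + rho * dx * v[:, 0].sum()
        return outflow - inflow

    # example: parabolic inflow over the upper half of the inlet, zero field elsewhere
    # (the worst case cited above: all inflow and no outflow)
    nx, ny, dx, dy = 21, 9, 4.0 / 21, 1.0 / 9
    u = np.zeros((nx + 1, ny))
    v = np.zeros((nx, ny + 1))
    y = (np.arange(ny) + 0.5) * dy
    u[0, :] = np.where(y > 0.5, 24.0 * (y - 0.5) * (1.0 - y), 0.0)   # average inlet velocity 1
    print(global_mass_deficit(u, v, dx, dy))    # negative: the system is inconsistent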
2.4.2 Verification by Numerical Experiments

Support for the preceding discussion is provided by numerical simulation of two model problems, a lid-driven cavity flow and a backward-facing step flow. The configurations are shown along with other relevant data in Figure 2.2. Figure 2.4 shows the outer-loop convergence paths for the lid-driven cavity flow and the backward-facing step flow, both at Re = 100. The quantities plotted in Figure 2.4 are the log10 of the global residuals for each governing equation, obtained by summing up the local residuals, each of which is obtained by subtracting the left-hand side of the discretized equations from the right-hand side.

For the cavity flow there are no mass fluxes across the boundary so, as mentioned earlier, the global mass-conservation condition is always satisfied when the algorithm reaches the point of solving the system of p' equations. The residuals have dropped to 10⁻⁷ after 150 iterations, which is very rapid convergence, indicating that good pressure and velocity corrections are being obtained.

In the backward-facing step flow, however, the flowfield is very slow to develop because no global mass-conservation measure is enforced. During the course of iterations, the mass flux into the domain from the left is not matched by an equal flux through the outflow boundary, and consequently the system of pressure-correction equations which is supposed to produce a continuity-satisfying velocity field does not have a solution. Correspondingly, one observes that the outer-loop convergence rate is about 10 times worse than for the cavity flow. Also, note that the momentum convergence path of the backward-facing step flow in Figure 2.4 tends to follow the continuity equation, indicating that the pressure and velocity fields are strongly coupled. The present flow problem bears some similarity to a fully developed channel flow, in which the streamwise pressure gradient and cross-stream viscous diffusion are balanced, so the observation that pressure and velocity are strongly coupled is intuitively correct. Thus, the convergence path is controlled by the development of the pressure field, and the slow convergence rate is due to the inconsistency of the system of pressure-correction equations.

The inner-loop convergence path (the SLUR iterations) for the p' system of equations must be examined to determine the manner in which the inner-loop inconsistency leads to poor outer-loop convergence rates. Table 2.1 shows the leading eigenvalues of the successive line-underrelaxation iteration matrices for the p' system of equations at an intermediate iteration, for which the outer-loop residuals had dropped to approximately 10⁻².

Largest 3 eigenvalues    Cavity Flow    Back-Step Flow
λ1                       1.0            1.0
λ2                       0.956          0.996
λ3                       0.951          0.984

Table 2.1. Largest eigenvalues of the iteration matrices during an intermediate iteration, applying the successive line-underrelaxation iteration scheme to the p' system of equations.

In both model problems the spectral radius is 1.0 because the p' system of equations is linearly dependent. The next largest eigenvalue is smaller in the cavity-flow computation than in the step-flow computation, which implies a faster asymptotic convergence rate. However, the difference between 0.996 and 0.956 is not large enough to produce the significant difference observed in the outer convergence path.

Figure 2.5 shows the inner-loop residuals of the SLUR procedure during an intermediate iteration. The two momentum equations are well-conditioned and converge to a solution within 4 iterations. In Figure 2.5 for the cavity-flow case, the p' equation converges to zero, although at a slower rate than the two momentum equations because of the diffusive nature of the equation. In Figure 2.5 for the back-step flow, the inner-loop residual stalls at a nonzero value, which is in fact the initial level of inconsistency in the system of equations, i.e. the global mass deficit. Given that the system of p' equations being solved does not satisfy the global continuity constraint, however, the significance or utility of the p' field that has been obtained is unknown; the behavior of the individual iterates, considered below, clarifies what the procedure does produce.
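The stalled residual and the unit spectral radius can be reproduced with a toy calculation. The sketch below is illustrative only: it uses a one-dimensional analogue of the p' system and damped point-Jacobi sweeps rather than the SLUR sweeps of Table 2.1. The iteration matrix has spectral radius exactly 1, the residual freezes at the level of the inconsistency, and the iterates drift while their differences settle, which is the behavior reported for Figure 2.6.

    import numpy as np

    # one-dimensional analogue of the p' system: aP equals the sum of the
    # neighbour coefficients (natural boundary treatment), so the matrix is singular
    N = 9
    A = np.zeros((N, N))
    for i in range(N):
        for j in (i - 1, i + 1):
            if 0 <= j < N:
                A[i, j] = -1.0
                A[i, i] += 1.0

    b = np.zeros(N)
    b[0] = 1.0                      # inconsistent: the "mass sources" do not sum to zero

    omega = 0.8                     # damped point-Jacobi, standing in for SLUR
    Dinv = 1.0 / np.diag(A)
    G = np.eye(N) - omega * Dinv[:, None] * A
    print(max(abs(np.linalg.eigvals(G))))      # spectral radius is exactly 1.0

    p = np.zeros(N)
    snap = {}
    for it in range(1, 401):
        p = G @ p + omega * Dinv * b
        if it in (200, 400):
            snap[it] = (p[0], p[0] - p[-1], np.linalg.norm(A @ p - b))
    print(snap[200])
    print(snap[400])
    # the residual stalls at the level of the inconsistency, the difference p[0]-p[-1]
    # settles to a fixed value, while the individual p values keep growing linearly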
In practice, the overall procedure may still be able to lead to a converged solu tion, as in the present case. It appears that the outflow extrapolating procedure, a zerogradient treatment utilized here, can help induce the overall computation to converge to the right solution [72]. Obviously, such a lack of satisfaction of global mass conservation is not desirable in view of the slow convergence rate. Further study suggests that the iterative solution to the inconsistent system of p'equations converges on a unique pressure gradient, i.e. the difference between p' values at any two points tends to a constant value, even though the p'field does not in general satisfy any of the equations in the system. This relationship is shown in Figure 2.6, in which the convergence of the difference in p' between the lowerleft and upperright locations in the domain of the cavity and backwardfacing step flows is plotted. Also shown is the value of p' at the lowerleft corner of the domain. For the cavity flow, there is a solution to the system of p'equations, and it is obtained by the SLUR technique in about 10 iterations. Thus all the pressure corrections and the differences between them tend towards constant values. In the backwardfacing step flow, however, the individual pressure corrections increase linearly with the number of iterations, symptomatic of the inconsistency in the system of equations. The differences between p' values approach a constant, however. The rate at which this unique pressuregradient field is obtained depends on the eigenvalues of the iteration matrix. To resolve the inconsistency problem in the p'system of equations and thereby improve the outerloop convergence rate in the backwardfacing step flow, global mass conservation has been explicitly enforced during the sequential solution procedure. The procedure used is to compute the global mass deficit and then add a constant value to the outflow boundary uvelocities to restore global mass conservation. Al ternatively, corrections can be applied at every streamwise location by considering control volumes whose boundaries are the inflow plane, the top and bottom walls of the channel, and the i=constant line at the specified streamwise location. The artificiallyimposed convection has the effect of speeding up the development of the pressure field, whose normal development is diffusiondominated. It is interesting to note that this physicallymotivated approach is in essence an acceleration of conver gence of the lineiterative method via the technique called additive correction [45, 69]. The strategy is to adjust the residual on the current line to zero by adding a con stant to all the unknowns in the line. This procedure is done for every line, for every iteration, and generally produces improvement in the SLUR solution of a system of equations. Kelkar and Patankar [45] have gone one step further by applying additive corrections like an injection step of a multigrid scheme, a socalled block correction technique. This technique is exploited to its fullest by Hutchinson and Raithby [42]. Given a finegrid solution and a coarse grid, discretized equations for the correction quantities on the coarse grid are obtained by summing the equations for each of the finegrid cells within a given coarse grid cell. A solution is then obtained (by direct methods in [45]) which satisfies conservation of mass and momentum. 
The corrections are then distributed uniformly to the fine grid cells which make up the coarse grid cell, and the iterative solution on the fine grid is resumed. However, experiences have shown that the net effect of such a treatment for complex flow problems is limited. Figure 2.7 illustrates the improved convergence rate of the continuity equation for the inner and outer loops, in the backwardfacing step flow, when conservation of mass is explicitly enforced. The innerloop data is from the 10th outerloop iteration. In Figure 2.7, the cavity flow convergence path is also shown to facilitate the comparison. For the backstep, the overall convergence rate is improved by an order of magnitude, becoming slightly faster than the cavity flow case. This result reflects the improved innerloop performance, also shown in Figure 2.7. The improved performance for the pressurecorrection equation comes at the expense of a slightly slower convergence rate for the momentum equations, because of the nonlinear convection term. In short, it has been shown that a consistency condition, which is physically the re quirement of global mass conservation, is critical for meaningful pressurecorrections to be guaranteed. Given natural (velocity) boundary conditions, which lead to a linearly dependent system of pressurecorrection equations, satisfaction of the global continuity constraint is the only way that a solution can exist, and therefore the only way that the innerloop residuals can be driven to zero. For the model backward facing step flow in a channel with length L = 4 and a 21 x 9 mesh, the mass conservation constraint is enforced globally or at every streamwise location by an additivecorrection technique. This technique produces a 10fold increase in the con vergence rate. Physically, modifying the u velocities has the same effect as adding a convection term to the Poisson equation for the p'field, which otherwise develops very slowly. A coarse grid size was used to demonstrate the need of enforcing global mass conservation. On a finer grid, this issue becomes more critical. In the next section, the solution accuracy aspects related to mass conservation will be addressed, and the computations will be conducted with more adequate grid resolution. 2.5 Numerical Treatment of Outflow Boundaries Continuing with the theme of wellposedness, the next numerical issue to be dis cussed is the choice of outflow boundary location. If fluid flows into the domain at a boundary where extrapolation is applied, then, traditionally, the problem is not considered to be wellposed, because the information which is being transported into the domain does not participate in the solution to the problem [60]. Numerically, however, accurate solutions can be obtained using firstorder extrapolation for the ve locity components on a boundary where inflow is occurring [72]. Here open boundary treatment for both steady and timedependent flow problems is investigated further. Figure 2.9 and 2.8 present streamfunction contours for a timedependent flow problem, impulsively started backwardfacing step flow, using centraldifferencing for the convection terms and firstorder backwarddifferencing in time. A parabolic inflow velocity profile is specified, while outflow boundary velocities are obtained by firstorder extrapolation. The Reynolds number based on the average inflow velocity Uavg and the channel height H, is 800. The expansion ratio H/h is 2 as in the model problem described in Figure 2.3. 
Timeaccurate simulations were performed for two channel configurations, one with length L = 8 (81 x 41 mesh) and the other with length L = 16 (161 x 41 mesh). This flow problem has been the subject of some recent investigations focusing on open boundary conditions [30, 31]. For each time step, the SIMPLE algorithm is used to iteratively converge on a solution to the unsteady form of the governing equations, explicitly enforcing global conservation of mass during the course of iterations. In the present study, convergence was declared for a given time step when the global residuals had been reduced below 104. The timestep size was twice the viscous time scale in the ydirection, i.e. At = 2Ay2/v. Thus a fluid particle entering the domain at the average velocity u = 1 travels 2 units downstream during a timestep. Figure 2.8 shows the formation of alternate bottom/top wall recirculation regions during startup which gradually become thinner and elongated as they drift down stream. For the L = 16 simulation (Figure 2.8), the transient flowfield has as many as four separation bubbles at T = 32, the latter two of which are eventually washed out of the domain. In the L = 8 simulation (Figure 2.9) the streamfunction plots are at times corresponding to those shown in Figure 2.8. Note that between T = 11 and T = 32, a secondary bottom wall recirculation zone forms and drifts downstream, exiting without reflection through the downstream boundary. The time evolution of the flowfield for the L = 8 and L = 16 simulations is virtually identical. As can be observed, the facts that a shorter channel length was used in Figure 2.9 and that a recirculating cell may go through the open boundary do not affect the solutions. Figure 2.10 compares the computed time histories of the bottom wall reattachment and top wall separation points between the two computations. The L = 8 and L = 16 curves are perfectly overlapped. The steadystate solutions for both the L = 8 and L = 16 channel configurations are also shown in Figure 2.9 and 2.8, respectively. Although the outflow boundary cuts the top wall separation bubble approximately in half, there is no apparent difference between the computed streamfunction contours for 0 < x < 8. Furthermore, the convergence rate is not affected by the choice of outflow boundary location. Figure 2.11 compares the steadystate u and v velocity profiles at x = 7 be tween the two computations. The accuracy of the computed results is assessed by comparison with an FEM numerical solution reported by Gartling [27]. Figure 2.11 establishes quantitatively that the two simulations differ negligibly over 0 < x < 8 (the v profile differs on the order of 103) The velocity scale for the problem is 1. Neither v profile agrees perfectly with the solution obtained by Gartling, which may be attributed to the need for conducting further grid refinement studies in the present work and/or Gartling's work. Evidently the location of the open boundary is not critical to obtaining a con verged solution. This observation indicates that the downstream information is com pletely accounted for by the continuity equation. The correct pressure field can de velop because the system of p'equations requires only the boundary mass flux specifi cation. If the global continuity constraint is satisfied, the pressurecorrection equation is consistent regardless of whether there is inflow or outflow at the boundary where extrapolation is applied. 
The numerical well-posedness of the open-boundary computation results in virtually identical flowfield development for the time-dependent L = 8 and L = 16 simulations, as well as steady-state solutions which agree with each other and follow closely Gartling's benchmark data [27].

2.6 Concluding Remarks

In order for the SIMPLE pressure-correction method to be a well-posed numerical procedure for open-boundary problems, explicit steps must be taken to ensure the numerical consistency of the pressure-correction system of equations during the course of iterations. For the discrete problem with the natural boundary treatment for pressure, i.e. normal velocity specified at all boundaries, global mass conservation is the solvability constraint which must be satisfied in order that the system of p' equations is consistent. Without a globally mass-conserving procedure enforced during each iterative step, the utility of the pressure corrections obtained at each iteration cannot be guaranteed. Overall convergence may still occur, albeit very slowly. In this regard, the poor outer-loop convergence behavior simply reflects the (poor) convergence rate of the inner-loop iterations of the SLUR technique. In general, the inner-loop residual stalls at the value of the initial level of inconsistency of the system of p' equations, which physically is the global mass deficit. The convergence rate can be improved dramatically by explicitly enforcing mass conservation using an additive-correction technique. The results of numerical simulations of backward-facing step flow illustrate and support these conclusions.

The mass-conservation constraint also has implications for the issue of proper numerical treatment of open boundaries where inflow is occurring. Specifically, the conventional viewpoint that inflow cannot occur at open boundaries without Dirichlet prescription of the inflow variables can be rebutted, on the grounds that the numerical problem is well-posed if the normal velocity components satisfy the continuity constraint.

Figure 2.1. Staggered-grid u control volume and the nearby variables which are involved in the discretization of the u-momentum equation.

Figure 2.2. Description of two model problems. Both are at Re = 100. The cavity is a square with a top wall sliding to the left, while the backward-facing step is a 4 x 1 rectangular domain with an expansion ratio H/h = 2 and a parabolic inflow (average inflow velocity = 1). The cavity flow grid is 9 x 9 and the step flow grid is 21 x 9. The meshes and the velocity vectors are shown.

Figure 2.3. Model 3 x 3 computational domain with numbered control volumes, for discussion of Eq. 2.22. The staggered velocity components which refer to control volume 5 are also indicated.

Figure 2.4. Outer-loop convergence paths for the Re = 100 lid-driven cavity and backward-facing step flows, using central-differencing for the convection terms. Legend: p' equation; u-momentum equation; v-momentum equation.

Figure 2.5. Inner-loop convergence paths for the Re = 100 lid-driven cavity and backward-facing step flows. The vertical axis is the log10 of the ratio of the current residual to the initial residual. Legend: p' equation; u-momentum equation; v-momentum equation.
Figure 2.6. Variation of p' with inner-loop iterations. The dashed line is the value of p' at the lower-left control volume, while the solid line is the difference between p' at the lower-left and upper-right control volumes.

Figure 2.7. Outer-loop and inner-loop convergence paths of the p' equation for the backward-facing step model problem, with and without enforcing the continuity constraint. (1) conservation of mass not enforced; (2) continuity enforced globally; (3) cavity flow.

Figure 2.8. Time-dependent flowfield for impulsively started backward-facing step flow, Re = 800. The domain has length L = 16. Streamfunction contours are plotted at several instants (T = 11, 15, 20, 32) during the evolution to the steady state, which is the last figure.

Figure 2.9. Time-dependent flowfield for impulsively started backward-facing step flow, Re = 800. The domain has length L = 8. Streamfunction contours are plotted at several instants during the evolution to the steady state, which is the last figure.

Figure 2.10. Time-dependent location of the bottom wall reattachment point and top wall separation point for Re = 800 impulsively started backward-facing step flow. The curves for both the L = 8 and L = 16 computations are shown; they overlap identically.

Figure 2.11. Comparison of the u- and v-components of the velocity profiles at x = 7.0 for the L = 16 and L = 8 backward-facing step simulations at Re = 800, with central differencing. (o) indicates the grid-independent FEM solution obtained by Gartling. The v profile is scaled up by 10³.

CHAPTER 3
EFFICIENCY AND SCALABILITY ON SIMD COMPUTERS

The previous chapter considered an issue which was important because of its implications for the convergence rate in open-boundary problems. The present chapter shifts gears to focus on the cost and efficiency of pressure-correction methods on SIMD computers. As discussed in Chapter 1, the eventual goal is to understand the indirect cost [23], i.e. the parallel run time, of such methods on SIMD computers, and how this cost scales with the problem size and the number of processors. The run time is just the number of iterations multiplied by the cost per iteration. This chapter considers the cost per iteration.

3.1 Background

The discussion of SIMD computers in Chapter 1 indicated similarities in the general layout of such machines and in the factors which affect program performance. More detail is given in this section to better support the discussion of results.

3.1.1 Speedup and Efficiency

Speedup S is defined as

$S = \frac{T_1}{T_p}$   (3.1)

where $T_p$ is the measured run time using $n_p$ processors. In the present work $T_1$ is the run time of the parallel algorithm on one processor, including both serial and parallel computational work, but excluding the front-end-to-processor and interprocessor communication. On a MIMD machine it is sometimes possible to actually time the program on one processor, but each SIMD processor is not usually a capable serial computer by itself, so $T_1$ must be estimated.
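In practice $T_1$, the speedup, and the parallel efficiency are all derived from a handful of measured component times. The following is a small illustrative helper, not part of the dissertation's timing harness; the variable names mirror the quantities introduced in this section, and the formula anticipates the efficiency expressions (Eqs. 3.2-3.3) developed below.

    def parallel_metrics(t_comp, t_interproc, t_fe_to_proc, n_procs):
        """Estimate T1, speedup, and efficiency from per-run component timings.

        t_comp       : seconds each processor spends computing
        t_interproc  : seconds spent in interprocessor communication
        t_fe_to_proc : seconds of front-end-to-processor communication
        Assumes high parallelism, so one processor would need n_procs * t_comp.
        """
        t_p = t_comp + t_interproc + t_fe_to_proc   # measured parallel run time
        t_1 = n_procs * t_comp                      # estimated one-processor time
        speedup = t_1 / t_p                         # Eq. 3.1
        efficiency = speedup / n_procs              # Eq. 3.2, equivalently Eq. 3.3
        return t_1, speedup, efficiency

    # example with made-up numbers: communication costs 25% of the compute time
    print(parallel_metrics(t_comp=100.0, t_interproc=20.0, t_fe_to_proc=5.0, n_procs=128))
    # -> (12800.0, 102.4, 0.8)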
The timing tools on the CM2 and CM5 are very sophisticated, and can separately measure the time elapsed by the processors doing computation, doing various kinds of communication, and doing nothing (waiting for an instruction from the front end, which might be finishing up some serial work before it can send another code block). Thus, it is possible to make a reasonable estimate for $T_1$. Parallel efficiency is the ratio of the actual speedup to the ideal speedup ($n_p$), which reflects the overhead costs of doing the computation in parallel:

$E = \frac{S_{actual}}{S_{ideal}} = \frac{T_1/T_p}{n_p}$   (3.2)

If $T_{comp}$ is the time in seconds spent by each of the $n_p$ processors doing useful work (computation), $T_{interproc}$ is the time spent by the processors doing interprocessor communication, and $T_{fe\text{-}to\text{-}proc}$ is the time elapsed through front-end-to-processor communication, then each of the processors is busy a total of $T_{comp} + T_{interproc}$ seconds and the total run time on multiple processors is $T_{comp} + T_{interproc} + T_{fe\text{-}to\text{-}proc}$ seconds. Assuming that the parallelism is high, i.e. a high percentage of the virtual processors are not idle, a single processor would need $n_p T_{comp}$ time to do the same work. Thus $T_1 = n_p T_{comp}$, and from Eq. 3.2 E can be expressed as

$E = \frac{1}{1 + (T_{interproc} + T_{fe\text{-}to\text{-}proc})/T_{comp}} = \frac{1}{1 + T_{comm}/T_{comp}}$   (3.3)

Since time is work divided by speed, E depends on both machine-related factors and implementational factors through Eq. 3.3. High parallel efficiency is not necessarily a product of fast processors or fast communications considered alone; instead, it is the relative speeds that are important, together with the relative amounts of communication and computation in the program. Consider the machine-related factors first.

3.1.2 Comparison Between the CM2, CM5, and MP1

A 32-node CM5 with vector units, a 16k-processor CM2, and a 1k-processor MP1 were used in the present study. The CM5 has 4 GBytes total memory, while the CM2 has 512 MBytes, and the MP1 has 64 MBytes. The peak speeds of these computers are 4, 3.5, and 0.034 Gflops, respectively, in double precision. Per processor, the peak speeds are 32, 7, and 0.033 Mflops, with memory bandwidths of 128, 25, and 0.67 MBytes/s [67, 83]. Clearly these are computers with very different capabilities, even taking into account the fact that peak speeds, which are based only on the processor speed under ideal conditions, are not an accurate basis for comparison.

In the CM2 and CM5 the front-end computers are Sun-4 workstations, while in the MP1 the front end is a Decstation 5000. From Eq. 3.3, it is clear that the relative speeds of the front-end computer and the processors are important: their ratio determines the importance of the front-end-to-processor type of communication. On the CM2 and MP1 there is just one of these intermediate processors, called either a sequencer or an array control unit, respectively, while on the 32-node CM5 the 32 SPARC microprocessors have the role of sequencers. Each SPARC node broadcasts to four vector units (VUs) which actually do the work. Thus a 32-node CM5 has 128 independent processors. In the CM2 the "processors" are more often called processing elements (PEs), because each one consists of a floating-point unit coupled with 32 bit-serial processors; each bit-serial processor is the memory manager for a single bit of a 32-bit word. Thus, the 16k-processor CM2 actually has only 512 independent processing elements. This unusual CM2 processor design came about basically as a workaround introduced to improve the memory bandwidth for floating-point calculations [66].
Compared to the CM5 VUs, the CM2 processors are about onefourth as fast, with larger overhead costs associated with memory access and computation. The MP1 has 1024 4bit processorscompared to either the CM5 or CM2 processors, the MP1 processors are very slow. The generic term "processing element" (PE), which is used occassion ally in the discussion below, refers to either one of the VUs, one of the 512 CM2 processors, or one of the MP1 processors, whichever is appropriate. For the present study, the processors are either physically or logically imagined to be arranged as a 2d mesh, which is a layout that is wellsupported by the data networks of each of the computers. The data network of the 32node CM5 is a fat tree of height 3, which is similar to a binary tree except the bandwidth stays constant upwards from height 2 at 160 MBytes/s (details in [83]). One can expect approximately 480 MBytes/s for regular grid communication patterns (i.e. between nearestneighbor SPARC nodes) and 128 MBytes/s for random (global) communica tions. The randomlydirected messages have to go farther up the tree, so they are slower. The CM2 network (a hypercube) is completely different from the fattree net work and its performance for regular grid communication between nearestneighbor processors is roughly 350 MBytes/s [67]. The grid network on the CM2 is called NEWS (NorthEastWestSouth). It is a subset of the hypercube connections se lected at run time. The MP1 has two networks: regular communications use XNet (1.25 GBytes/s, peak) which connects each processor to its eight nearest neighbors, and random communications use a 3stage crossbar (80 MBytes/s, peak). To summarize the relative speeds of these three SIMD computers it is sufficient for the present study to observe that the MP1 has very fast nearestneighbor com munication compared to its computational speed, while the exact opposite is true for the CM2. The ratio of nearestneighbor communication speed to computation speed is smaller still for the CM5 than the CM2. Again, from Eq. 3.3, one expects that these differences will be an important factor influencing the parallel efficiency. 3.1.3 Hierarchical and CutandStack Data Mappings When there are more array elements (grid points) than processors, each processor handles multiple grid points. Which grid points are assigned to which processors is determined by the "datamapping," also called the data layout. The processors repeat any instructions the appropriate number of times to handle all the array elements which have been assigned to it. A useful idealization for SIMD machines, however, is to pretend there are always as many processors as grid points. Then one speaks of the "virtual processor" ratio (VP) which is the number of array elements assigned to each physical processor. The way the data arrays are partitioned and mapped to the processors is a main concern for developing a parallel implementation. The layout of the data determines the amount of communication in a given program. When the virtual processor ratio is 1, there are an equal number of processors and array elements and the mapping is just onetoone. When VP > 1 the mapping of data to processors is either "hierarchical," in CMFortran, or "cutandstack" in MPFortran. These mappings are also termed "block" and "cyclic" [85], respectively, in the emerging HighPerformance Fortran standard. The relative merits of these different approaches have not been completely explored yet. 
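The difference between the two mappings can be stated as an index formula. The sketch below is illustrative only: for a one-dimensional array of N elements distributed over P processors, it returns the processor that owns element i under a block ("hierarchical") layout and under a cyclic ("cut-and-stack") layout.

    def block_owner(i, n_elements, n_procs):
        """Hierarchical / block mapping: contiguous chunks of size VP per processor."""
        vp = (n_elements + n_procs - 1) // n_procs   # virtual processor ratio (subgrid size)
        return i // vp

    def cyclic_owner(i, n_procs):
        """Cut-and-stack / cyclic mapping: element i goes to processor i mod P."""
        return i % n_procs

    N, P = 16, 4
    print([block_owner(i, N, P) for i in range(N)])   # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
    print([cyclic_owner(i, P) for i in range(N)])     # [0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3]

    # Under the block mapping, neighbours i and i+1 usually live on the same processor;
    # under the cyclic mapping they never do, so every distance-one shift is off-processor.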
In cut-and-stack mapping, nearest-neighbor array elements are mapped to nearest-neighbor physical processors. When the number of array elements exceeds the number of processors, additional memory layers are created; VP is just the number of memory layers. In the general case, nearest-neighbor virtual processors (i.e. array elements) will not be mapped to the same physical processor. Thus, the cost of a nearest-neighbor communication of distance one will be proportional to VP, since the nearest neighbors of each virtual processor will be on a different physical processor.

In the hierarchical mapping, contiguous pieces of an array ("virtual subgrids") are mapped to each processor. The "subgrid size" for the hierarchical mapping is synonymous with VP. The distinction between hierarchical and cut-and-stack mapping is clarified by Figure 3.1.

In hierarchical mapping, for VP > 1, each virtual processor has nearest neighbors in the same virtual subgrid, that is, on the same physical processor. Thus, for hierarchical mapping on the CM2, interprocessor communication breaks down into two types with different speeds: on-processor and off-processor. Off-processor communication on the CM2 has the NEWS speed given above, while on-processor communication is somewhat faster, because it is essentially just a memory operation. A more detailed presentation and modelling of nearest-neighbor communication costs for the hierarchical mapping on the CM2 is given in [3]. The key idea is that with hierarchical mapping on the CM2 the relative amount of on-processor and off-processor communication is the area-to-perimeter ratio of the virtual subgrid.

For the CM5, there are three types of interprocessor communication: (1) between virtual processors on the same processor (that is, the same VU), (2) between virtual processors on different VUs but on the same SPARC node, and (3) between virtual processors on different SPARC nodes. Between different SPARC nodes (type 3 above), the speed is 480 MBytes/s as mentioned above. On the same VU the speed is 16 GBytes/s. (The latter number is just the aggregate memory bandwidth of the 32-node CM5.) Thus, although off-processor NEWS communication is slow compared to computation on the CM2 and CM5, good efficiencies can still be achieved as a consequence of the data mapping, which allows the majority of communication to be of the on-processor type.

3.2 Implementational Considerations

The cost per SIMPLE iteration depends on the choice of relaxation method (solver) for the systems of equations, the number of inner iterations ($\nu_u$, $\nu_v$, and $\nu_{p'}$), the computation of coefficients for each system of equations, the correction step, and the convergence checking and serial work done in program control. The pressure-correction equation, since it is not underrelaxed, typically needs to be given more iterations than the momentum equations, and consequently most of the effort is expended during this step of the SIMPLE method. This is another reason why the convergence rate of the p' equations discussed in Chapter 2 is important. Typically $\nu_u$ and $\nu_v$ are the same and are $\leq 3$, and $\nu_{p'} \leq 5\nu_u$.

In developing a parallel implementation of the SIMPLE algorithm, the first consideration is the method of solving the u, v, and p' systems of equations.
For serial computations, successive line-underrelaxation using the tridiagonal matrix algorithm (TDMA, whose operation count is O(N)) is a good choice because the cost per iteration is optimal and there is long-distance coupling between flow variables (along lines), which is effective in promoting convergence in the outer iterations. The TDMA is intrinsically serial. For parallel computations, a parallel tridiagonal solver must be used (parallel cyclic reduction in the present work). In this case the cost per iteration depends not only on the computational workload (O(N log₂ N)) but also on the amount of communication generated by the implementation on a particular machine. For these reasons, timing comparisons are made for several implementations of both point- and line-Jacobi solvers used during the inner iterations of the SIMPLE algorithm.

Generally, point-Jacobi iteration is not sufficiently effective for complex flow problems. However, as part of a multigrid strategy, good convergence rates can be obtained (see Chapters 4 and 5). Furthermore, because it involves only the fastest type of interprocessor communication, that which occurs between nearest-neighbor processors, point-Jacobi iteration provides an upper bound for parallel efficiency against which other solvers can be compared.

The second consideration is the treatment of boundary computations. In the present implementation, the coefficients and source terms for the boundary control volumes are computed using the interior control-volume formula and mask arrays. Oran et al. [57] have called this trick the uniform boundary condition approach. All coefficients can be computed simultaneously. The problem with computing the boundary coefficients separately is that some of the processors are idle, which decreases E. For the CM5, which is "synchronized MIMD" instead of strictly SIMD, there exists limited capability to handle both boundary and interior coefficients simultaneously without formulating a single all-inclusive expression. However, this capability cannot be utilized if either the boundary or interior formulas involve interprocessor communication, which is the case here. As an example of the uniform approach, consider the source terms for the north-boundary u control volumes, which are computed by the formula

$b = a_N u_N + (p_w - p_e)\,\Delta y$   (3.4)

Recall that $a_N$ represents the discretized convective and diffusive flux terms, $u_N$ is the boundary value, and, in the pressure-gradient term, $\Delta y$ is the vertical dimension of the u control volume and $p_w$/$p_e$ are the west/east u-control-volume face pressures on the staggered grid. Similar modifications show up in the south, east, and west boundary u-control-volume source terms. To compute the boundary and interior source terms simultaneously, the following implementation is used:

$b = a_{boundary}\, u_{boundary} + (p_w - p_e)\,\Delta y$   (3.5)

where

$u_{boundary} = u_N I_N + u_S I_S + u_E I_E + u_W I_W$   (3.6)

and

$a_{boundary} = a_N I_N + a_S I_S + a_E I_E + a_W I_W$   (3.7)

$I_N$, $I_S$, $I_E$, and $I_W$ are the mask arrays, which have the value 1 for the respective boundary control volumes and 0 everywhere else. They are initialized once, at the beginning of the program. Then, every iteration, there are four extra nearest-neighbor communications. A comparison of the uniform approach with an implementation that treats each boundary separately is discussed in the results; a sketch of the masked source-term computation is given below.
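The masked form of Eqs. 3.5-3.7 translates directly into whole-array statements with no branching, which is what keeps all processing elements busy. The following is a small NumPy sketch of the idea for the u-momentum source term, not the CM Fortran code actually used; the boundary-value arrays, masks, and shapes are hypothetical names chosen for illustration.

    import numpy as np

    def u_source_uniform(a_nb, u_bc, masks, dp_dy_term):
        """Boundary and interior u-momentum source terms in one masked expression.

        a_nb       : dict of neighbour coefficients {"N": aN, "S": aS, "E": aE, "W": aW}
        u_bc       : dict of boundary velocity values, broadcast to full arrays
        masks      : dict of 0/1 arrays I_N, I_S, I_E, I_W (1 on that boundary only)
        dp_dy_term : the (p_w - p_e)*dy array, identical for boundary and interior cells
        """
        a_boundary = sum(a_nb[k] * masks[k] for k in "NSEW")    # Eq. 3.7
        u_boundary = sum(u_bc[k] * masks[k] for k in "NSEW")    # Eq. 3.6
        return a_boundary * u_boundary + dp_dy_term             # Eq. 3.5

    # tiny example: 4 x 3 field of u control volumes with a moving north boundary
    shape = (4, 3)
    masks = {k: np.zeros(shape) for k in "NSEW"}
    masks["N"][:, -1] = 1.0
    a_nb = {k: np.full(shape, 2.0) for k in "NSEW"}
    u_bc = {k: np.zeros(shape) for k in "NSEW"}
    u_bc["N"][:] = 1.0                       # sliding-lid value, for instance
    b = u_source_uniform(a_nb, u_bc, masks, dp_dy_term=np.zeros(shape))
    print(b)                                 # nonzero only along the north boundary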
3.3 Numerical Experiments

The SIMPLE algorithm for two-dimensional laminar flow has been timed on a range of problem sizes from 8 x 8 to 1024 x 1024 which, on the CM5, covers up to VP = 8192. The convection terms are central-differenced. A fixed number (100) of outer iterations is timed, using the lid-driven cavity flow at Re = 1000 as the model flow problem. The timings were made with the "Prism" timing utility on the CM2 and CM5, and the "dpuTimer" routines on the MP1 [52, 86]. These utilities can be inaccurate if the front-end machine is heavily loaded, which was the case with the CM2. Thus, on the CM2 all cases were timed three times and the fastest times were used, as recommended by Thinking Machines [82]. Prism times every code block and accumulates totals in several categories, including computation time for the nodes (Tcomp), "NEWS" communication (Tnews), and irregular-pattern "SEND" communication. It is also possible to infer Tfe-to-proc from the difference between the processor busy time and the elapsed time. In the results, Tcomm is the sum of the "NEWS" and "SEND" interprocessor times; the front-end-to-processor communication is counted separately. Additionally, the component tasks of the algorithm have been timed, namely the coefficient computations (Tcoeff), the solver (Tsolve), and the velocity-correction and convergence-checking parts.

3.3.1 Efficiency of Point and Line Solvers for the Inner Iterations

Figure 3.2, based on timings made on the CM5, illustrates the difference in parallel efficiency for SIMPLE using point-Jacobi and line-Jacobi iterative solvers. E is computed from Eq. 3.3 by timing the Tcomm and Tcomp introduced above. Problem size is given in terms of the virtual processor ratio VP previously defined.

There are two implementations, each with different data layouts, for point-Jacobi iteration. One ignores the distinction between virtual processors which are on the same physical processor and those which are on different physical processors: each array element is treated as if it were a processor. Thus, interprocessor communication is generated whenever data is to be moved, even if the two virtual processors doing the communication happen to be on the same physical processor. To be more precise, a call to the run-time communication library is generated for every array element. Then, those array elements (virtual processors) which actually reside on the same physical processor are identified and the communication is done as a memory operation, but the unnecessary overhead of calling the library is incurred. Obviously there is an inefficiency associated with pretending that there are as many processors as array elements, but the trade-off is that this is the most straightforward, and indeed the intended, way to do the programming. In Figure 3.2, this approach is labelled "NEWS," with the symbol "o." The other implementation is labelled "on-VU," with the symbol "+," to indicate that interprocessor communication between virtual processors on the same physical processor is being eliminated: the programming is in a sense being done "on-VU."

To indicate to the compiler the different layouts of the data which are needed, the programmer inserts compiler directives. For the "NEWS" version, the arrays are laid out as shown in this example for a 1k x 1k grid and an 8 x 16 processor layout on the CM5:

    REAL*8 A(1024,1024)
    $CMF LAYOUT A(:BLOCK=128 :PROCS=8, :BLOCK=64 :PROCS=16)

Thus, the subgrid shape is 128 x 64, with a subgrid size (VP) of 8192 (this happens to be the biggest problem size for my program on a 32-node CM5 with 4 GBytes of memory).
When shifting all the data to their east nearestneighbor, for example, by far the large majority of transfers are onVU and could be done without real inter processor communication. But there are only 2 dimensions in A, so that dataparallel program statements cannot specifically access certain array elements, i.e. the ones on the perimeter of the subgrid. Thus it is not possible with the "NEWS" layout to treat interior virtual processors differently from those on the perimeter, and conse quently data shifts between the interior virtual processors generate interprocessor communication even though it is unnecessary. In the "onVU" version, a different data layout is used which makes explicit to the compiler the boundary between physical processors. The arrays are laid out without virtual processors: $CMF LAYOUT A(:SERIAL,:SERIAL,:BLOCK=1 :PROCS=8,:BLOCK=1 :PROCS=16) The declaration must be changed accordingly, to A(128,64,8,16). Normally it is inconvenient to work with the arrays in this manner. Thus the approach taken here is to use an "array alias" of A [84]. In other words, this is an EQUIVALENCE func tion for the dataparallel arrays (similar to the Fortran77 EQUIVALENCE concept), which equates A(1024,1024) with A(128,64,8,16), with the different LAYOUTs given above. It is the alias instead of the original A which is used in the onVU point Jacobi implementation. In the solver, the "onVU" layout is used; everywhere else, the more convenient "NEWS" layout is used. The actual mechanism by which the equivalencing of distributed arrays can be accomplished is not too difficult to under stand. The frontend computer stores "array descriptors," which contain the array layout, the starting address in processor memory, and other information. The actual layout in each processors' memory is linear and doesn't change, but multiple array descriptors can be generated for the same data. This descriptor multiplicity is what array aliasing accomplishes. With the "onVU" programming style, the compiler does not generate communication when the shift of data is along a SERIAL axis. Thus, interprocessor communication is generated only when the virtual processors involved are on different physical processors, i.e. only when it is truly necessary. The difference in the amount of communication is substantial for large subgrid sizes. For both the "NEWS" and the "onVU" curves in Figure 3.2, E is initially very low, but as VP increases, E rises until it reaches a peak value of about 0.8 for the "NEWS" version and 0.85 for the "onVU" version. The trend is due to the amor tization of the frontendtoprocessor and offVU (between VUs which are physically under control of different SPARC nodes) communication. The former contributes a constant overhead cost per Jacobi iteration to Tcomm, while the latter has a VP1/2 dependency [3]. However, it does not appear from Figure 3.2 that these two terms' effects can be distinguished from one another. For VP > 2k, the CM5 is computing roughly 3/4 of the time for the implementa tion which uses the "NEWS" version of pointJacobi, with the remainder split evenly between frontendtoprocessor communication and onVU interprocessor communi cation. It appears that the "onVU" version has more frontendtoprocessor com munication per iteration, so there is, in effect, a price of more frontendtoprocessor communication to pay in exchange for less interprocessor communication. Conse quently it takes VP > 4k to reach peak efficiency instead of 2k with the "NEWS" version. 
For VP > 4k, however, E is about 5% 10% higher than for the "NEWS" version because the onVU communication has been replaced by straight memory operations. The observed difference would be even greater if a larger part of the total parallel run time was spent in the solver. For the large VP cases in Figure 3.2, approximately equal time was spent computing coefficients and solving the systems of equations. "Typical" numbers of inner iterations were used, 3 each for the u and v momentum equations, and 9 for the p' equation. From Figure 3.2, then, it appears that the ad vantage of the "onVU" version over the "NEWS" version of pointJacobi relaxation within the SIMPLE algorithm is around 0.1 in E, for large problem sizes. Red/black analogues to the "NEWS" and "onVU" versions of pointJacobi iter ation have also been tested. Red/black point iteration done in the "onVU" manner does not generate any additional frontendtoprocessor communication, and there fore takes almost an identical amount of time as pointJacobi. Thus red/black point iterations are recommended when the "onVU" layout is used due to their improved convergence rate. However, with the "NEWS" layout, red/black point iteration gen erates two code blocks instead of one, and reduces by 2 the amount of computation per code block. This results in a substantial ( 35% for the VP = 8k case) in crease in run time. Thus, if using "NEWS" layouts, red/black point iteration is not costeffective. There are also two implementations of lineJacobi iteration. In both, one inner iteration consists of forming a tridiagonal system of equations for the unknowns in each vertical line by moving the east/west terms to the righthand side, solving the multiple systems of equations simultaneously, and repeating the procedure for the horizontal lines. In the first version, parallel cyclic reduction is used to solve the multiple tridiag onal systems of equations (see [44] for a clear presentation). This involves combining equations to decouple the system into even and odd equations. The result is two tridiagonal systems of equations each half the size of the original. The reduction step is repeated log2 N times, where N is the number of unknowns in each line. Thus, the computational operation count is O(Nlog2N). Interprocessor communication occurs for every unknown for every step, thus the communication operation count is also O(Nlog2N). However, the distance for communication increases every step of the re duction by a factor of 2. For the first step, nearestneighbor communication occurs, while for the second step, the distance is 2, then 4, etc. Thus, the net communi cation speed is slower than the nearestneighbor type of communication. Figure 3.2 confirms this argumentE peaks at about 0.5 compared to 0.8 for pointJacobi it eration. In other words, for VP > 4k, interprocessor communication takes as much time as computation with the lineJacobi solver using cyclic reduction. In the second version, the multiple systems of tridiagonal equations are solved using the standard TDMA algorithm along the lines. To implement this version, one must remap the arrays from (:NEWS,:NEWS) to (:NEWS,:SERIAL), for the vertical lines, and to (:SERIAL,:NEWS) for the horizontal lines. This change from rectangular subgrids to 1d slices is the most timeconsuming step, involving a global communication of data ("SEND" instead of "NEWS"). Applied along the serial di mension, the TDMA does not generate any interprocessor communication. 
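For contrast with cyclic reduction, the TDMA referred to above is a strictly sequential recurrence along each line. The sketch below is a generic illustration of the Thomas algorithm, not the dissertation's CM Fortran routine: the forward elimination and back substitution each carry a dependence from one unknown to the next, which is why the TDMA generates no interprocessor communication along a serial axis but cannot be parallelized across it.

    import numpy as np

    def tdma(a, b, c, d):
        """Thomas algorithm for a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = d[i], O(N) work.

        a[0] and c[-1] are ignored. The loops are inherently sequential: each
        step uses the result of the previous one.
        """
        n = len(b)
        cp = np.zeros(n)
        dp = np.zeros(n)
        cp[0] = c[0] / b[0]
        dp[0] = d[0] / b[0]
        for i in range(1, n):                  # forward elimination
            denom = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / denom
            dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
        x = np.zeros(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):         # back substitution
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # check on a small diagonally dominant system
    n = 8
    a = -np.ones(n); c = -np.ones(n); b = 4.0 * np.ones(n); d = np.arange(1.0, n + 1)
    A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    print(np.allclose(tdma(a, b, c, d), np.linalg.solve(A, d)))   # True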
Some frontendtoprocessor communication is generated by the incrementing of the DO loop index, but unrolling the DOloop helps to amortize this overhead cost to some extent. Thus, in Figure 3.2 E is approximately constant at 0.14, except for very small VP. The global communication is much slower than computation and consequently there is not enough computation to amortize the communication. Furthermore, the constant E implies from Eq. 3.3 that Tcomm and Teomp both scale in the same way with problem size. It is evident that Tcomp ~ VP because the TDMA is O(N). Thus constant E implies Tcomm VP. This means doubling VP doubles Tcomm, indicating the communication speed has reached its peak, which further indicates that the full bandwidth of the fattree is being utilized. The disappointing performance of the standard lineiterative approach using the TDMA points out the important fact that, for the CM5, global communication within inner iterations is intolerable. There is not enough computation to amortize slow communication in the solver for any problem size. With parallel cyclic reduction, where the regularity of the data movement allows faster communication, the efficiency is much higher, although still significantly lower than for pointiterations. Additional improvement can be sought by using the "onVU" data layout to implement the lineiterative solver within each processor's subgrid. This implementation essentially trades interprocessor communication for the frontendtoPE type of communication, and in practice a frontend bottleneck develops. For the remainder of the discussion, all lineJacobi results refer to the parallel cyclic reduction implementation. On the MP1, the frontendtoprocessor communication is not a major concern, as can be inferred from Figure 3.3. The efficiency of the SIMPLE algorithm using the pointJacobi solver is plotted for each machine for the range of problem sizes corresponding to the cases solved on the MP1. The CM2 and CM5 can solve much larger problems, so for comparison purposes only part of their data is shown. Also, because the computers have different numbers of processors, the number of grid points is used instead of VP to define the problem size. As in Figure 3.2, each curve exhibits an initial rise corresponding to the amortiza tion of the frontendtoprocessor communication and, for the CM2 and CM5, the offprocessor "NEWS" communication. On the MP1, peak E is reached for small problems (VP > 32). Due to the MPl's relatively slow processors, the computa tion time quickly amortizes the frontendtoprocessor communication time as VP increases. Furthermore, because the relative speed of XNet communication is fast, the peak E is high, 0.85. On the CM2, the peak E is 0.4, and this efficiency is reached for approximately VP > 128. On the CM5, the peak E is 0.8, but this efficiency is not reached until VP > 2k. If computation is fast, then the rate of in crease of E with VP depends on the relative cost of onprocessor, offprocessor, and frontendtoprocessor communication. If the onprocessor communication is fast, larger VP is required to reach peak E. Thus, on the CM5, the relatively fast onVU communication is simultaneously responsible for the good (0.8) peak E, and the fact that very large problem sizes, (VP > 2k, 64 times larger than on the MP1), are needed to reach this peak E. The aspect ratio of the virtual subgrid constitutes a secondary effect of the data layout on the efficiency for hierarchical mapping. The major influence on E depends on VP, i.e. 
the subgrid size, but the subgrid shape matters, too. This dependence comes into play due to the different speeds of the on-processor and off-processor types of communication. Higher-aspect-ratio subgrids have longer perimeters for the same area, i.e. lower area-to-perimeter ratios, and thus relatively more off-processor communication than square subgrids. Figure 3.4 gives some idea of the relative importance of the subgrid aspect-ratio effect. Along each curve the number of grid points is fixed, but the grid dimensions vary, which, for a given processor layout, causes the subgrid shape (aspect ratio) to vary. For example, on the CM5 with an 8 x 16 processor layout, the following grids were used corresponding to the VP = 1024 CM5 curve: 256 x 512, 512 x 256, 680 x 192, and 1024 x 128. These cases give subgrid aspect ratios of 1, 4, 7, and 16. Tnews is the time spent in the "NEWS" type of interprocessor communication and Tcomp is the time spent doing computation during 100 SIMPLE iterations. The solver for these results is point-Jacobi relaxation.

For the VP = 1024 CM5 case, increasing the aspect ratio from 1 to 16 causes Tnews/Tcomp to increase from 0.3 to 0.5. This increase in Tnews/Tcomp increases the run time for 100 iterations from 15 s to 20 s, and decreases the efficiency from 0.61 to 0.54. For the VP = 8192 CM5 case, increasing the aspect ratio from 1 to 16 causes Tnews/Tcomp to increase from 0.19 to 0.27. This increase in Tnews/Tcomp increases the run time for 100 iterations from 118 s to 126 s, and decreases the efficiency from 0.74 to 0.72. Thus, the aspect-ratio effect diminishes as VP increases due to the increasing area of the subgrid; in other words, the variation in the perimeter length matters less, percentage-wise, as the area increases. The CM2 results are similar. However, on the CM2 the on-PE type of communication is slower than on the CM5, relative to the computational speed. Thus, Tnews/Tcomp ratios are higher on the CM2.

3.3.2 Effect of Uniform Boundary Condition Implementation

In addition to the choice of solver, the treatment of boundary coefficient computations was discussed earlier as an important consideration affecting parallel efficiency. Figure 3.5 compares the implementation described in the introductory section of this chapter to an implementation which treats the boundary control volumes separately from the interior control volumes. The latter approach involves some 1-d operations which leave some processors idle. The results indicated in Figure 3.5 were obtained on the CM2, using point-Jacobi relaxation as the solver. With the uniform approach, the ratio of the time spent computing coefficients, Tcoeff, to the time spent solving the equations, Tsolve, remains constant at 0.6 for VP > 256. Both Tcoeff and Tsolve scale with VP in this case, so doubling VP doubles both Tcoeff and Tsolve, leaving their ratio unchanged. The value 0.6 reflects the relative cost of coefficient computations compared to point-Jacobi iteration. There are three equations for which coefficients are computed and 15 total inner iterations, 3 each for the u and v equations and 9 for the p' equation. Thus if more inner iterations are taken, the ratio of Tcoeff to Tsolve will decrease, and vice versa. With the 1-d implementation, Tcoeff/Tsolve increases until VP > 1024. Both Tcoeff and Tsolve scale with VP asymptotically, but Figure 3.5 shows that Tcoeff has an apparently very significant square-root component due to the boundary operations. If N is the number of grid points and np is the number of processors, then VP = N/np.
For boundary operations, N^(1/2) control volumes are computed in parallel with only np^(1/2) processors, hence the VP^(1/2) contribution to Tcoeff. From Figure 3.5, it appears that very large problems are required to reach the point where the interior coefficient computations amortize the boundary coefficient computations. Even for large VP, when Tcoeff/Tsolve is approaching a constant, this constant is larger, approximately 0.8 compared to 0.6 for the uniform approach, due to the additional front-end-to-processor communication which is intrinsic to the 1-d formulation.

3.3.3 Overall Performance

Table 3.1 summarizes the relative performance of SIMPLE on the CM2, CM5, and MP1 computers, using point- and line-iterative solvers and the uniform boundary condition treatment. In the first three cases the "NEWS" implementation of point-Jacobi relaxation is the solver, while the last two cases are for the line-Jacobi solver using cyclic reduction.

Machine       Solver        Problem Size   VP     Tp      Time/Iter./Pt.   Speed (Mflops)   % of Peak Speed
512-PE CM2    Point-Jacobi  512 x 1024     1024   188 s   2.6 x 10⁻⁶ s     147              4
128-VU CM5    Point-Jacobi  736 x 1472     8192   137 s   1.3 x 10⁻⁶ s     417              10
1024-PE MP1   Point-Jacobi  512 x 512      256    316 s   1.2 x 10⁻⁵ s     44*              59
512-PE CM2    Line-Jacobi   512 x 1024     1024   409 s   7.8 x 10⁻⁶ s     133              3
128-VU CM5    Line-Jacobi   736 x 1472     8192   453 s   4.2 x 10⁻⁶ s     247              6

Table 3.1. Performance results for the SIMPLE algorithm for 100 iterations of the model problem. The solvers are the point-Jacobi ("NEWS") and line-Jacobi (cyclic reduction) implementations. 3, 3, and 9 inner iterations are used for the u, v, and p' equations, respectively. *The speeds are for double-precision calculations, except on the MP1.

In Table 3.1, the speeds reported are obtained by comparing the timings with the identical code timed on a Cray C90, using the Cray hardware performance monitor to determine Mflops. In terms of Mflops, the CM2 version of the SIMPLE algorithm's performance appears to be consistent with other CFD algorithms on the CM2. Jesperson and Levit [44] report 117 Mflops for a scalar implicit version of an approximate-factorization Navier-Stokes algorithm using parallel cyclic reduction to solve the tridiagonal systems of equations. This result was obtained for a 512 x 512 simulation of 2-d flow over a cylinder using a 16k CM2, as in the present study (a different execution model was used; see [3, 47] for details). The measured time per timestep per grid point was 1.6 x 10⁻⁵ seconds. By comparison, the performance of the SIMPLE algorithm for the 512 x 1024 problem size using the line-Jacobi solver is 133 Mflops and 7.8 x 10⁻⁶ seconds per iteration per grid point. Egolf [20] reports that the TEACH Navier-Stokes combustor code, based on a sequential pressure-based method with a solver that is comparable to point-Jacobi relaxation, obtains a performance which is 3.67 times better than a vectorized Cray X-MP version of the code, for a model problem with 3.2 x 10⁴ nodes. The present program runs 1.6 times faster than a single Cray C90 processor for a 128 x 256 problem (32k grid points). One Cray C90 processor is roughly 2 to 4 times faster than a Cray X-MP. Thus, the present code runs comparably fast.

3.3.4 Isoefficiency Plot

Figures 3.2-3.4 addressed the effects of the inner-iterative solver, the boundary treatment, the data layout, and the variation of parallel efficiency with problem size for a fixed number of processors.
Varying the number of processors is also of interest and, as discussed in Chapter 1, an even more practical numerical experiment is to vary np in proportion to the problem size, i.e. the scaled-size model. Figure 3.6, which is based on the point-Jacobi MP1 timings, incorporates the above information into one plot, which has been called an isoefficiency plot by Kumar and Singh [46]. The lines are paths along which the parallel efficiency E remains constant as the problem size and the number of processors np vary. Using the point-Jacobi solver and the uniform boundary coefficient implementation, each SIMPLE iteration has no substantial contribution from operations which are less than fully parallel or from operations whose time depends on the number of processors. The efficiency is only a function of the virtual processor ratio, and thus the lines are straight. Much of the parameter space is covered by efficiencies between 0.6 and 0.8. The reason that the present implementation is linearly scalable is that the operations are all scalable: each SIMPLE iteration has predominantly nearest-neighbor communication and computation and full parallelism. Thus, Tp depends on VP. Local communication speed does not depend on np. T1 depends on the problem size N. Thus, as N and np are increased in proportion, starting from some initial ratio, the efficiency from Eq. 3.3 stays constant. If the initial problem size is large and the corresponding parallel run time is acceptable, then one can quickly get to very large problem sizes while still maintaining Tp constant by increasing np a relatively small amount (along the E = 0.85 curve). If the desired run time is smaller, then initially (i.e. starting from small np) the efficiency will be lower. The scaled-size experiment then requires relatively more processors to reach a large problem size along the constant-efficiency (constant-Tp, for point-Jacobi iterations) curve. Thus, the most desirable situation occurs when the efficiency is high for an initially small problem size. For this case the fixed-time and scaled-size methods are equivalent, because T1 per iteration depends linearly on N. However, this is not the case when the SIMPLE inner iterations are done with the line-Jacobi solver using parallel cyclic reduction. Cyclic reduction requires (13 log2 N + 1)N operations to solve a tridiagonal system of N equations [44]. Thus T1 ~ (13 log2 N + 1)N and, on np = N processors, Tp ~ 13 log2 N + 1, because every processor is active during every step of the reduction and the number of steps grows as log2 N. Since VP = 1, every processor's time is proportional to the number of steps, assuming each step costs about the same. In the scaled-size approach, one doubles np and N together, which therefore gives T1 ~ (26 log2 2N + 2)N and Tp ~ 13 log2 2N + 1. The efficiency is 1, but Tp is increased and T1 is more than doubled. In the fixed-time approach, then, one concludes that N must be increased by a factor which is less than two, and np must be doubled, in order to maintain constant Tp. If a plot like Figure 3.6 is constructed, it should be done with T1 instead of N as the measure of problem size. In that case, the lines of constant efficiency would be described by T1 ~ np^a, with a > 1. The ideal case is a = 1. In addition to the operation count, there is another factor which reduces the scalability of cyclic reduction, namely that the time per step is not actually the same, as was assumed above; later steps require communication over longer distances, which is slower.
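The contrast between the two solvers in the scaled-size experiment can be made concrete with a few lines of Python. This is a minimal sketch of the two cost models quoted above: perfect parallelism is assumed, the units are arbitrary, and nothing here is measured data from the study.

    import math

    # Scaled-size cost model: double N and np together and watch Tp.
    # Operation counts as quoted in the text: N per point-Jacobi sweep,
    # (13*log2(N) + 1)*N for parallel cyclic reduction on an N-equation system.

    def T1_point_jacobi(N):
        return float(N)

    def T1_cyclic_reduction(N):
        return (13.0 * math.log2(N) + 1.0) * N

    print(f"{'N':>10} {'np':>7} {'Tp Jacobi':>10} {'Tp cyclic':>10}")
    for k in range(5):
        N = 2**18 << k        # problem size, doubled each step
        np_ = 2**10 << k      # processors, doubled in proportion (VP = 256)
        tp_j = T1_point_jacobi(N) / np_      # stays constant under scaling
        tp_c = T1_cyclic_reduction(N) / np_  # grows like log2(N)
        print(f"{N:>10} {np_:>7} {tp_j:>10.1f} {tp_c:>10.1f}")

With point-Jacobi the parallel time per iteration is fixed by VP, while with cyclic reduction it grows logarithmically even at perfect efficiency, which is the distinction drawn above between the scaled-size and fixed-time experiments.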
In practice, however, no more than a few steps are necessary, because the coupling between widely separated equations becomes very weak. As the system is reduced, the diagonal becomes much larger than the off-diagonal terms, which can then be neglected and the reduction process abbreviated. In short, the basic prerequisite for scaled-size constant efficiency is that the amount of work per SIMPLE iteration varies with VP, and that the overheads and inefficiencies, specifically the time spent in communication and the fraction of idle processors, do not grow relative to the useful computational work as np and N are increased proportionally. The SIMPLE implementations developed here using the point-iterative solvers, Jacobi and red/black, have this linear computational scalability property. On the other hand, the number of iterations required by point-iterative methods increases at a rate greater than the problem size, so although Tp can be maintained constant while the problem size and np are scaled up, the convergence rate deteriorates. Hence the total run time (cost per iteration multiplied by the number of iterations) increases. This lack of numerical scalability of standard iterative methods like point-Jacobi relaxation is the motivation for the development of multigrid strategies.

3.4 Concluding Remarks

The SIMPLE algorithm, especially using point-iterative methods, is efficient on SIMD machines and can maintain a relatively high efficiency as the problem size and the number of processors are scaled up. However, boundary coefficient computations need to be folded in with interior coefficient computations to achieve good efficiencies at smaller problem sizes. For the CM5, the inefficiency caused by idle processors in a 1-d boundary treatment was significant over the entire range of problem sizes tested. The line-Jacobi solver based on parallel cyclic reduction leads to a lower peak E (0.5 on the CM5) than the point-Jacobi solver (0.8), because there is more communication and on average this communication is less localized. On the other hand, the asymptotic convergence rates of the two methods are also different and need to be considered on a problem-by-problem basis. The speeds which are obtained with the line-iterative method are consistent and comparable with other CFD algorithms on SIMD computers. The key factor in obtaining high parallel efficiency for the SIMPLE algorithm on the computers used is fast nearest-neighbor communication relative to the speed of computation. On the CM2 and CM5, hierarchical mapping allows on-processor communication to dominate the slower off-processor form(s) of communication for large VP. The efficiency is low for small problems because of the relatively large contribution to the run time from the front-end-to-processor type of communication, but this type of communication is constant and becomes less important as the problem size increases. Once the peak E is reached, the efficiency is determined by the balance of computation and on-processor communication speeds: for the CM5, using a point-Jacobi solver, E approaches approximately 0.8, while on the CM2 the peak efficiency is 0.4, which reflects the fact that the CM5 vector units have a better balance, at least for the operations in this algorithm, than the CM2 processors. The rate at which E approaches the peak value depends on the relative contributions of on- and off-processor communication and front-end-to-processor communication to the total run time. On the CM5, VP > 2k is required to reach peak E.
This problem size is about one-fourth the maximum size which can be accommodated, and yet still larger than many computations on traditional vector supercomputers. Clearly a gap is developing between the size of problems which can be solved efficiently in parallel and the size of problems which are small enough to be solved on serial computers. For parallel computations of all but the largest problems, then, the data layout issue is very important: in going from a square subgrid to one with an aspect ratio of 16, for a VP = 1k case on the CM5, the run time increased by 25%. On the MP1, hierarchical mapping is not needed, because the processors are slow compared to the XNet communication speed. The peak E is 0.85 with the point-Jacobi solver, and this performance is obtained for VP > 32, which is about one-eighth the size of the largest case possible for this machine. Thus, with regard to achieving efficient performance in the teraflops range, the comparison given here suggests a preference for numerous slow processors instead of fewer fast ones, but such a computer may be difficult and expensive to build.

[Figure 3.1 diagram: a 4 x 1 layout of processors PE 0 to PE 3, an 8-element array A, the cut-and-stack mapping (MP Fortran) with memory layers, and the hierarchical mapping (CM Fortran) with 2 x 1 virtual subgrids.]
Figure 3.1. Mapping an 8-element array A onto 4 processors. For the cut-and-stack mapping, nearest-neighbor array elements are mapped to nearest-neighbor physical processors. For the hierarchical mapping, nearest-neighbor array elements are mapped to nearest-neighbor virtual processors, which may be on the same physical processor.

[Figure 3.2 plot: parallel efficiency E versus VP, with curves for point-Jacobi (on-VU), point-Jacobi (NEWS), line-Jacobi (cyclic reduction), and line-Jacobi (TDMA).]
Figure 3.2. Parallel efficiency, E, as a function of problem size and solver, for the CM5 cases. The number of grid points is the virtual processor ratio, VP, multiplied by the number of processors, 128. E is computed from Eq. 3.3. It reflects the relative amount of communication, compared to computation, in the algorithm.

[Figure 3.3 plot: E versus number of grid points (x 10^5), with symbols for the MP1, CM5, and CM2.]
Figure 3.3. Comparison between the CM2, CM5 and MP1. The variation of parallel efficiency with problem size is shown for the model problem, using point-Jacobi relaxation as the solver. E is calculated from Eq. 3.3, and T1 = np Tcomp for the CM2 and CM5, where Tcomp is measured. For the MP1 cases, T1 is the front-end time, scaled down to the estimated speed of the MP1 processors (0.05 Mflops).

[Figure 3.4 plot: Tnews/Tcomp versus subgrid aspect ratio, for VP = 256 (CM2), VP = 1024 (CM2), VP = 1024 (CM5), and VP = 8192 (CM5).]
Figure 3.4. Effect of subgrid aspect ratio on interprocessor communication time, Tnews, for the hierarchical data mapping (CM2 and CM5). Tnews is normalized by Tcomp in order to show how the aspect ratio effect varies with problem size, without the complication of the fact that Tcomp varies also.

[Figure 3.5 plot: Tcoeff/Tsolve versus VP for the 1-d and 2-d boundary coefficient implementations.]
Figure 3.5. Normalized coefficient computation time as a function of problem size, for two implementations (on the CM2). In the 1-d case the boundary coefficients are handled by 1-d array operations. In the 2-d case the uniform implementation computes both boundary and interior coefficients simultaneously.
Tcoeff is the time spent computing coefficients in a SIMPLE iteration; Tsolve is the time spent in point-Jacobi iterations. There are 15 point-Jacobi iterations (nu_u = nu_v = 3 and nu_p' = 9).

[Figure 3.6 plot: isoefficiency curves, problem size (number of grid points) versus number of processors (MP1).]
Figure 3.6. Isoefficiency curves based on the MP1 cases and the SIMPLE method with the point-Jacobi solver. Efficiency E is computed from Eq. 3.3. Along lines of constant E the cost per SIMPLE iteration is constant with the point-Jacobi solver and the uniform boundary condition implementation.

CHAPTER 4
A NONLINEAR PRESSURE-CORRECTION MULTIGRID METHOD

The single-grid timing results focused on the cost per iteration in order to elucidate the computational issues which influence the parallel run time and the scalability. But the parallel run time is the cost per iteration multiplied by the number of iterations. For scaling to large problem sizes and numbers of processors, the numerical method must scale well with respect to convergence rate, also. The convergence rate of the single-grid pressure-correction method deteriorates with increasing problem size. This trait is inherited from the smoothing property of the stationary linear iterative method, point- or line-Jacobi relaxation, used to solve the systems of u, v, and p' equations during the course of SIMPLE iterations. Point-Jacobi relaxation requires O(N^2) iterations, where N is the number of grid points, to decrease the solution error by a specified amount [1]. In other words, the number of iterations increases faster than the problem size. At best the cost per iteration stays constant as the number of processors np increases in proportion to the problem size. Thus, the total run time increases in the scaled-size experiment using single-grid pressure-correction methods, due to the increased number of iterations required. This lack of numerical scalability is a serious disadvantage for parallel implementations, since the target problem size for parallel computation is very large. Multigrid methods can maintain good convergence rates as the problem size increases. For Poisson equations, problem-size-independent convergence rates can be obtained [36, 55]. The recent book by Briggs [10] introduces the major concepts in the context of Poisson equations. See also [11, 37, 90] for surveys and analyses of multigrid convergence properties for more general linear equations. For a description of practical techniques and special considerations for fluid dynamics, see the important early papers by Brandt [5, 6]. However, there are many unresolved issues for application to the incompressible Navier-Stokes equations, especially with regard to their implementation and performance on parallel computers. The purpose of this chapter is to describe the relevant convergence rate and stability issues for multigrid methods in the context of application to the incompressible Navier-Stokes equations, with numerical experiments used to illustrate the points made, in particular regarding the role of the restriction and prolongation procedures.

4.1 Background

The basic concept is the use of coarse grids to accelerate the asymptotic convergence rate of an inner iterative scheme. The inner iterative method is called the "smoother" for reasons to be made clear shortly. In the context of the present application to the incompressible Navier-Stokes equations, the single-grid pressure-correction method is the inner iterative scheme.
Because the pressure-correction algorithm also uses inner iterations, to solve the systems of u, v, and p' equations, the multigrid method developed here actually has three nested levels of iterations. A multigrid V cycle begins with a certain number of smoothing iterations on the fine grid, where the solution is desired. Figure 4.1 shows a schematic of a V(3,2) cycle. In this case three pressure-correction iterations are done first. Then residuals and variables are restricted (averaged) to obtain coarse-grid values for these quantities. The solution to the coarse-grid discretized equation provides a correction to the fine-grid solution. Once the solution on the coarse grid is obtained, the correction is interpolated (prolongated) to the fine grid and added back into the solution there. Some post-smoothing iterations, two in this case, are needed to eliminate errors introduced by the interpolation. Since it is usually too costly to attempt a direct solution on the coarse grid, this smoothing-correction cycle is applied recursively, leading to the V cycle shown. The next section describes how such a procedure can accelerate the convergence rate of an iterative method, in the context of linear equations. The multigrid scheme for nonlinear scalar equations and for the Navier-Stokes system of equations is then described. Brandt [5] was the first to formalize the manner in which coarse grids could be used as a convergence-acceleration technique for a given smoother. The idea of using coarse grids to generate initial guesses for fine-grid solutions was around much earlier. The cost of the multigrid algorithm, per cycle, is dominated by the smoothing cost, as will be shown in Chapter 5. Thus, with regard to the parallel run time per multigrid iteration, the smoother is the primary concern. The smoother is also important with regard to the convergence rate. The single-grid convergence rate characteristics of pressure-correction methods, namely the dependence on Reynolds number, flow problem, and convection scheme, carry over to the multigrid context. However, in the multigrid method the smoother's role is, as the name implies, to smooth the fine-grid residual, which is a different objective than to solve the equations quickly. A smooth fine-grid residual equation can be approximated accurately on a coarser grid. The next section describes an alternate pressure-based smoother and compares its cost against the pressure-correction method on the CM5. Stability of multigrid iterations is also an important unresolved issue. There are two ways in which multigrid iterations can be caused to diverge. First, the single-grid smoothing iterations can themselves diverge; for example, if central differencing is used there are possible stability problems at high Reynolds number. Second, poor coarse-grid corrections can cause divergence if the smoothing is insufficient. In a sense this latter issue, the scheme and intergrid transfer operators which prescribe the coordination between coarse and fine grids in the multigrid procedure, is the key issue. In the next section two "stabilization strategies" are described. Then, the impact of different restriction and prolongation procedures on the convergence rate is studied in the context of two model problems, lid-driven cavity flow and flow past a symmetric backward-facing step. These two flow problems have different physical characteristics, and therefore the numerical experiments should give insight into the problem-dependence of the results.
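The grid schedule of the V(3,2) cycle just described can be written as a small recursive routine. The Python sketch below only prints the order in which the grid levels are visited; the number of levels and the smoothing counts are illustrative choices, not the settings used in this study.

    # Print the schedule of a V(nu1, nu2) cycle; four levels for illustration.
    def v_cycle_schedule(level, coarsest, nu1=3, nu2=2):
        h = f"{2**level}h"
        if level == coarsest:
            print(f"  solve (or smooth) on the coarsest grid, spacing {h}")
            return
        print(f"  {nu1} pre-smoothing iterations on grid {h}")
        print(f"  restrict residual and variables to grid {2**(level + 1)}h")
        v_cycle_schedule(level + 1, coarsest, nu1, nu2)
        print(f"  prolongate the correction to grid {h} and add it to the solution")
        print(f"  {nu2} post-smoothing iterations on grid {h}")

    print("V(3,2) cycle on four grid levels:")
    v_cycle_schedule(0, coarsest=3)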
4.1.1 Terminology and Scheme for Linear Equations

The discrete problem to be solved can be written A^h u^h = S^h, corresponding to some differential equation L[u] = S. The set of values u^h is defined by

{u^h}_{i,j} = u(ih, jh),  (i, j) in ([0:N], [0:N]).  (4.1)

Similarly, u^{2h} is defined on the coarser grid Omega^{2h} with grid spacing 2h. The variable u can be a scalar or a vector, and the operator A can be linear or nonlinear. For linear equations, the "correction scheme" (CS) is frequently used. A two-level multigrid cycle using CS accelerates the convergence of an iterative method (with iteration matrix P) by the following procedure:

Do nu fine-grid iterations:        v^h <- P^{nu} v^h
Compute the residual on Omega^h:   r^h = A^h v^h - S^h
Restrict r^h to Omega^{2h}:        r^{2h} = I_h^{2h} r^h
Solve exactly for e^{2h}:          e^{2h} = (A^{2h})^{-1} r^{2h}
Correct v^h on Omega^h:            (v^h)^{new} = (v^h)^{old} + I_{2h}^h e^{2h}

I_h^{2h} and I_{2h}^h symbolize the restriction and prolongation procedures. The quantity v^h is the current approximation to the discrete solution u^h. The algebraic error is the difference between them, e^h = u^h - v^h. The discretization error is the difference between the exact solutions of the continuous and discrete problems, e_{discr} = u - u^h. The truncation error is obtained by substituting the exact solution into the discrete equation,

tau^h = A^h u - S^h = A^h u - A^h u^h.  (4.2)

The notation above follows Briggs [10]. The two-level multigrid cycle begins on the fine grid with nu iterations of the smoother. Standard iterative methods all have the "smoothing property," which is that the various eigenvector-decomposed components of the solution error are damped at a rate proportional to their corresponding eigenvalues, i.e. the high-frequency errors are damped faster than the low-frequency (smooth) errors. Thus, the convergence rate of the smoothing iterations is initially rapid, but deteriorates as smooth error components, those with large eigenvalues, dominate the remaining error. The purpose of transferring the problem to a coarser grid is to make these smooth error components appear more oscillatory with respect to the grid spacing, so that the initial rapid convergence rate is obtained for the elimination of these smooth errors by coarse-grid iterations. Since the coarse grid Omega^{2h} has only 1/4 as many grid points as Omega^h (in 2-d), the smoothing iterations on the coarse grid are cheaper as well as more effective in reducing the smooth error components than on the fine grid. In the correction scheme, the coarse-grid problem is an equation for the algebraic error,

A^{2h} e^{2h} = r^{2h},  (4.3)

approximating the fine-grid residual equation for the algebraic error. To obtain the coarse-grid source term, r^{2h}, the restriction procedure I_h^{2h} is applied to the fine-grid residual r^h,

r^{2h} = I_h^{2h} r^h.  (4.4)

Eq. 4.4 is an averaging type of operation. Two common restriction procedures are straight injection of fine-grid values to their corresponding coarse-grid points, and averaging r^h over a few fine-grid points which are near the corresponding coarse-grid point. The initial error on the coarse grid is taken as zero. After the solution for e^{2h} is obtained, this coarse-grid quantity is interpolated to the fine grid and used to correct the fine-grid solution,

v^h <- v^h + I_{2h}^h e^{2h}.  (4.5)

For I_{2h}^h, common choices are bilinear or biquadratic interpolation. In practice the solution for e^{2h} is obtained by recursion on the two-level cycle; (A^{2h})^{-1} is not explicitly computed. On the coarsest grid, direct solution may be feasible if the equation is simple enough. Otherwise a few smoothing iterations can be applied.
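To make the correction-scheme steps concrete, the following Python sketch applies them recursively as a V cycle to a 1-D Poisson problem. It is only a sketch: the damped-Jacobi smoother, full-weighting restriction, linear interpolation, grid size, and cycle counts are illustrative assumptions, not the pressure-correction solver used in this study.

    import numpy as np

    # Recursive correction-scheme V cycle for -u'' = S on (0,1), u(0)=u(1)=0.

    def smooth(v, S, h, sweeps):
        """Damped Jacobi sweeps for the 3-point Poisson stencil."""
        omega = 2.0 / 3.0
        for _ in range(sweeps):
            v_new = v.copy()
            v_new[1:-1] = (1 - omega) * v[1:-1] + omega * 0.5 * (
                v[:-2] + v[2:] + h * h * S[1:-1])
            v = v_new
        return v

    def residual(v, S, h):
        r = np.zeros_like(v)
        r[1:-1] = S[1:-1] - (2 * v[1:-1] - v[:-2] - v[2:]) / (h * h)
        return r

    def restrict(r):
        """Full-weighting restriction onto the grid with spacing 2h."""
        rc = np.zeros((r.size - 1) // 2 + 1)
        rc[1:-1] = 0.25 * (r[1:-2:2] + 2 * r[2:-1:2] + r[3::2])
        return rc

    def prolong(ec):
        """Linear interpolation back to the fine grid."""
        ef = np.zeros(2 * (ec.size - 1) + 1)
        ef[::2] = ec
        ef[1::2] = 0.5 * (ec[:-1] + ec[1:])
        return ef

    def v_cycle(v, S, h, nu1=3, nu2=2):
        if v.size <= 3:                              # coarsest grid
            return smooth(v, S, h, 20)
        v = smooth(v, S, h, nu1)                     # pre-smoothing
        r2h = restrict(residual(v, S, h))            # restrict the residual
        e2h = v_cycle(np.zeros_like(r2h), r2h, 2 * h)  # coarse-grid error equation
        v = v + prolong(e2h)                         # correct the fine-grid solution
        return smooth(v, S, h, nu2)                  # post-smoothing

    # Manufactured solution u = sin(pi x), so S = pi^2 sin(pi x).
    n = 129
    x = np.linspace(0.0, 1.0, n)
    S = np.pi ** 2 * np.sin(np.pi * x)
    v = np.zeros(n)
    for cycle in range(8):
        v = v_cycle(v, S, 1.0 / (n - 1))
        res = np.abs(residual(v, S, 1.0 / (n - 1))).max()
        print(f"cycle {cycle + 1}: max residual = {res:.2e}")

Each cycle reduces the residual by roughly a fixed factor regardless of the grid size, which is the problem-size-independent convergence behavior referred to above.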
Recursion on the two-level algorithm leads to a "V cycle," as shown in Figure 4.1. A simple V(3,2) cycle is shown. Three smoothing iterations are taken before restricting to the next coarser grid, and two iterations are taken after the solution has been corrected. The purpose of the latter smoothing iterations is to smooth out any high-frequency noise introduced by the prolongation. Other cycles can be envisioned. In particular the W cycle is popular [6]. The cycling strategy is called the "grid schedule," since it is the order in which the various grid levels are visited. The most important consideration for the correction scheme has been saved for last, namely the definition of the coarse-grid discrete operator A^{2h}. One possibility is to discretize the original differential equation directly on the coarse grid. However, this choice is not always the best one. The convergence-rate benefit from the multigrid strategy is derived from the particular coarse-grid approximation to the fine-grid discrete problem, not the continuous problem. Because the coarse-grid solutions and residuals are obtained by particular averaging procedures, there is an implied averaging procedure for the fine-grid discrete operator A^h which should be honored to ensure a useful homogenization of the fine-grid residual equation. This issue is critical when the coefficients and/or dependent variables of the governing equations are not smooth [17]. For the Poisson equation, the Galerkin approximation

A^{2h} = I_h^{2h} A^h I_{2h}^h

is the right choice. The discretized equation coefficients on the coarse grid are obtained by applying suitable averaging and interpolation operations to the fine-grid coefficients, instead of by discretizing the governing equation on a grid with a coarser mesh spacing. Briggs has shown, by exploiting the algebraic relationship between bilinear interpolation and full-weighting restriction operators, that initially smooth errors begin in the range of interpolation and finish, after the smoothing-correction cycle is applied, in the null space of the restriction operator [10]. Thus, if the fine-grid smoothing eliminates all the high-frequency error components in the solution, one V cycle using the correction scheme is a direct solver for the Poisson equation. The convergence rate of multigrid methods using the Galerkin approximation is more difficult to analyze if the governing equations are more complicated than Poisson equations, but significant theoretical advantages for application to general linear problems have been indicated [90].

4.1.2 Full-Approximation Storage Scheme for Nonlinear Equations

The brief description given above does not bring out the complexities inherent in the application to nonlinear problems. There is only experience, derived mostly from numerical experiments, to guide the choice of the restriction/prolongation procedures and the smoother. Furthermore, the linkage between the grid levels requires special considerations because of the nonlinearity. The correction scheme using the Galerkin approximation can be applied to the nonlinear Navier-Stokes system of equations [94]. However, in order to use CS for nonlinear equations, linearization is required. The best coarse-grid correction then only improves the fine-grid solution to the linearized equation. Also, for complex equations, considerable expense is incurred in computing A^{2h} by the Galerkin approximation.
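For a simple operator the Galerkin product can be formed explicitly, which makes the construction, and its cost for more complicated operators, easy to see. The Python sketch below builds A^{2h} = I_h^{2h} A^h I_{2h}^h for the 1-D Poisson operator with linear interpolation and full-weighting restriction; it is a generic illustration, not code from this study.

    import numpy as np

    def poisson_matrix(n, h):
        """Fine-grid operator A^h for -u'' on n interior points, spacing h."""
        return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
                - np.diag(np.ones(n - 1), -1)) / (h * h)

    def interpolation(n_coarse):
        """Linear interpolation I_2h^h from n_coarse to 2*n_coarse+1 interior points."""
        P = np.zeros((2 * n_coarse + 1, n_coarse))
        for j in range(n_coarse):
            P[2 * j, j] = 0.5       # fine point left of coarse point j
            P[2 * j + 1, j] = 1.0   # fine point coincident with coarse point j
            P[2 * j + 2, j] = 0.5   # fine point right of coarse point j
        return P

    n_coarse = 7
    n_fine = 2 * n_coarse + 1
    h = 1.0 / (n_fine + 1)
    A_h = poisson_matrix(n_fine, h)
    P = interpolation(n_coarse)
    R = 0.5 * P.T                   # full weighting is half the transpose of P

    A_2h = R @ A_h @ P              # Galerkin coarse-grid operator
    print(np.round(A_2h[:3, :3] * (2 * h) ** 2, 3))

For this operator the triple product reproduces the standard 3-point stencil on the 2h grid, which is why direct rediscretization is also acceptable for the Poisson equation; for variable-coefficient or coupled operators the product is much more expensive to form, which is the cost referred to above.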
The commonly adopted alternative is the intuitive one, to let A^{2h} be the differential operator L discretized on the grid with spacing 2h instead of h. In exchange for a straightforward problem definition on the coarse grid, though, special restriction and prolongation procedures may be necessary to ensure the usefulness of the resulting corrections. Numerical experiments on a problem-by-problem basis are necessary to determine good choices for the restriction and prolongation procedures for Navier-Stokes multigrid methods. The full-approximation storage (FAS) scheme [5] is preferred over the correction scheme for nonlinear problems. The coarse-grid corrections generated by FAS improve the solution to the full nonlinear problem instead of just the linearized one. The discretized equation on the fine grid is, again,

A^h(u^h) = S^h.  (4.6)

The approximate solution v^h after a few fine-grid iterations defines the residual on the fine grid,

A^h(v^h) = S^h + r^h.  (4.7)

A correction, the algebraic error e^h_{alg} = u^h - v^h, is sought which satisfies

A^h(v^h + e^h_{alg}) = S^h.  (4.8)

The residual equation is formed by subtracting Eq. 4.7 from Eq. 4.8, and cancelling S^h,

A^h(v^h + e^h) - A^h(v^h) = -r^h,  (4.9)

where the subscript "alg" is dropped for convenience. For linear equations the A^h v^h terms cancel, leaving Eq. 4.3. Eq. 4.9 does not simplify for nonlinear equations. Assuming that the smoother has done its job, r^h is smooth and Eq. 4.9 is the same as the coarse-grid residual equation

A^{2h}(v^{2h} + e^{2h}) - A^{2h}(v^{2h}) = -r^{2h}  (4.10)

at coarse-grid points. The error e^{2h} is to be found, interpolated back to Omega^h according to e^h = I_{2h}^h e^{2h}, and added to v^h so that Eq. 4.8 is satisfied. The known quantities are v^{2h}, which is a "suitable" restriction of v^h, and r^{2h}, likewise a restriction of r^h. Different restrictions can be used for residuals and solutions. Thus, Eq. 4.10 can be written

A^{2h}(I_h^{2h} v^h + e^{2h}) = A^{2h}(I_h^{2h} v^h) - I_h^{2h} r^h.  (4.11)

Since Eq. 4.11 is not an equation for e^{2h}, one solves instead for the sum I_h^{2h} v^h + e^{2h}. Expanding r^h and regrouping terms, Eq. 4.11 can be written

A^{2h}(u^{2h}) = A^{2h}(I_h^{2h} v^h) - I_h^{2h} r^h  (4.12)
             = [A^{2h}(I_h^{2h} v^h) - I_h^{2h}(A^h v^h) + I_h^{2h} S^h - S^{2h}] + S^{2h}  (4.13)
             = S^{2h}_{numerical} + S^{2h}.  (4.14)

Eq. 4.14 is similar to Eq. 4.6 except for the extra numerically derived source term. Once I_h^{2h} v^h + e^{2h} is obtained, the coarse-grid approximation to the fine-grid error, e^{2h}, is computed by first subtracting the initial coarse-grid solution I_h^{2h} v^h,

e^{2h} = u^{2h} - I_h^{2h} v^h,  (4.15)

then interpolating back to the fine grid and combining with the current solution,

v^h <- v^h + I_{2h}^h e^{2h}.  (4.16)

4.1.3 Extension to the Navier-Stokes Equations

The incompressible Navier-Stokes equations are a system of coupled, nonlinear equations. Consequently the FAS scheme given above for single nonlinear equations needs to be modified. The variables u_1^h, u_2^h, and u_3^h represent the Cartesian velocity components and the pressure, respectively. Corresponding subscripts are used to identify each equation's source term, residual, and discrete operator in the formulation below. The three equations for momentum and mass conservation are treated as if part of the following matrix equation,

\begin{bmatrix} A_1^h & 0 & G_x^h \\ 0 & A_2^h & G_y^h \\ G_x^h & G_y^h & 0 \end{bmatrix} \begin{bmatrix} u_1^h \\ u_2^h \\ u_3^h \end{bmatrix} = \begin{bmatrix} S_1^h \\ S_2^h \\ S_3^h \end{bmatrix}.  (4.17)

The continuity equation source term is zero on the finest grid, Omega^h, but for coarser grid levels it may not be zero. Thus, for the sake of generality it is included in Eq. 4.17. For the u_1-momentum equation, then, Eq. 4.8 is modified to account for the pressure gradient, G_x^h u_3^h, which is also an unknown. The approximate solutions are v_1^h, v_2^h, and v_3^h, corresponding to u_1^h, u_2^h, and u_3^h.
For the u_1-momentum equation, the approximate solution satisfies

A_1^h v_1^h + G_x^h v_3^h = S_1^h + r_1^h.  (4.18)

The fine-grid residual equation corresponding to Eq. 4.9 is modified to

A_1^h(v_1^h + e_1^h) - A_1^h(v_1^h) + G_x^h(v_3^h + e_3^h) - G_x^h(v_3^h) = -r_1^h,  (4.19)

which is approximated on the coarse grid by the corresponding coarse-grid residual equation,

A_1^{2h}(v_1^{2h} + e_1^{2h}) - A_1^{2h}(v_1^{2h}) + G_x^{2h}(v_3^{2h} + e_3^{2h}) - G_x^{2h}(v_3^{2h}) = -r_1^{2h}.  (4.20)

The known terms are v_1^{2h} = I_h^{2h} v_1^h, v_3^{2h} = I_h^{2h} v_3^h, and r_1^{2h} = I_h^{2h} r_1^h. Expanding r_1^h and regrouping terms in terms of the full coarse-grid approximations u_1^{2h} = v_1^{2h} + e_1^{2h} and u_3^{2h} = v_3^{2h} + e_3^{2h}, Eq. 4.20 can be written

A_1^{2h}(u_1^{2h}) + G_x^{2h}(u_3^{2h}) = A_1^{2h}(I_h^{2h} v_1^h) + G_x^{2h}(I_h^{2h} v_3^h) - I_h^{2h} r_1^h  (4.21)
  = [A_1^{2h}(I_h^{2h} v_1^h) + G_x^{2h}(I_h^{2h} v_3^h) - I_h^{2h}(A_1^h v_1^h + G_x^h v_3^h) + I_h^{2h} S_1^h - S_1^{2h}] + S_1^{2h}
  = S^{2h}_{1,numerical} + S_1^{2h}.  (4.22)

Since Eq. 4.22 includes numerically derived source terms in addition to the physical ones, the coarse-grid variables are not in general the same as would be obtained from a discretization of the original continuous governing equations on the coarse grid. The u_2-momentum equation is treated similarly, and the coarse-grid continuity equation is

G_x^{2h} u_1^{2h} + G_y^{2h} u_2^{2h} = G_x^{2h}(I_h^{2h} v_1^h) + G_y^{2h}(I_h^{2h} v_2^h) - I_h^{2h} r_3^h.
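The FAS bookkeeping is easiest to see on a single scalar equation. The Python sketch below applies a two-level FAS cycle to a model nonlinear 1-D problem; everything in it (the model problem, the nonlinear Gauss-Seidel smoother, the transfer operators) is an illustrative assumption and not the pressure-based smoother of this study. It uses the common convention r = S - A(v), opposite in sign to Eq. 4.7, but the coarse-grid source it forms is the same combination of physical and numerically derived terms as Eqs. 4.13-4.14.

    import numpy as np

    # Two-level FAS for -u'' + u^2 = S on (0,1), u(0) = u(1) = 0.

    def A(v, h):
        """Nonlinear operator A(v) = -v'' + v^2 (interior points only)."""
        out = np.zeros_like(v)
        out[1:-1] = (2 * v[1:-1] - v[:-2] - v[2:]) / (h * h) + v[1:-1] ** 2
        return out

    def smooth(v, S, h, sweeps):
        """Nonlinear Gauss-Seidel: one Newton step per point, in order."""
        for _ in range(sweeps):
            for i in range(1, v.size - 1):
                F = (2 * v[i] - v[i - 1] - v[i + 1]) / (h * h) + v[i] ** 2 - S[i]
                dF = 2.0 / (h * h) + 2.0 * v[i]
                v[i] -= F / dF
        return v

    def restrict(f):
        fc = np.zeros((f.size - 1) // 2 + 1)
        fc[1:-1] = 0.25 * (f[1:-2:2] + 2 * f[2:-1:2] + f[3::2])  # full weighting
        fc[0], fc[-1] = f[0], f[-1]                               # keep boundaries
        return fc

    def prolong(c):
        f = np.zeros(2 * (c.size - 1) + 1)
        f[::2] = c
        f[1::2] = 0.5 * (c[:-1] + c[1:])
        return f

    def fas_two_level(v, S, h, nu1=3, nu2=2):
        v = smooth(v, S, h, nu1)                  # pre-smoothing
        v2 = restrict(v)                          # restricted solution I_h^2h v^h
        # Coarse source = restricted physical source + numerically derived term
        # A^2h(I_h^2h v^h) - I_h^2h A^h(v^h)  (cf. Eqs. 4.13-4.14).
        S2 = restrict(S) + (A(v2, 2 * h) - restrict(A(v, h)))
        u2 = smooth(v2.copy(), S2, 2 * h, 200)    # "solve" the coarse problem
        e2 = u2 - v2                              # coarse approximation to the error
        v = v + prolong(e2)                       # correct the fine-grid solution
        return smooth(v, S, h, nu2)               # post-smoothing

    n = 65
    x = np.linspace(0.0, 1.0, n)
    h = x[1] - x[0]
    S = np.pi ** 2 * np.sin(np.pi * x) + np.sin(np.pi * x) ** 2  # manufactured source
    v = np.zeros(n)
    for cycle in range(6):
        v = fas_two_level(v, S, h)
        print(f"cycle {cycle + 1}: max residual = {np.abs(S - A(v, h))[1:-1].max():.2e}")

Because the coarse grid carries the full approximation rather than just the error, the coarse-grid source picks up exactly the kind of numerically derived contribution highlighted in Eq. 4.22, which is why the coarse-grid variables differ from a direct discretization of the continuous equations.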