Parallel computing of overset grids for aerodynamic problems with moving objects


Material Information

Title: Parallel computing of overset grids for aerodynamic problems with moving objects
Author: Prewitt, Nathan C., 1964-
Physical Description: xv, 146 leaves : ill. ; 29 cm.
Subjects / Keywords:
    Aerodynamics ( lcsh )
    Fluid dynamics ( lcsh )
    Aerospace Engineering, Mechanics, and Engineering Science thesis, Ph. D ( lcsh )
    Dissertations, Academic -- Aerospace Engineering, Mechanics, and Engineering Science -- UF ( lcsh )
    bibliography ( marcgt )
    non-fiction ( marcgt )
Notes:
    Thesis (Ph. D.)--University of Florida, 1999.
    Includes bibliographical references (leaves 139-144).
Statement of Responsibility: by Nathan C. Prewitt.

Record Information

Source Institution: University of Florida
Rights Management: All applicable rights reserved by the source institution and holding location.
Resource Identifiers: aleph - 021573509; oclc - 43707804







ACKNOWLEDGEMENTS

Happy is the man that findeth wisdom, and the man that getteth understanding.

Proverbs 3:13


Where no counsel is, the people fall: but in the multitude of counselors
there is safety.

Proverbs 11:14

That said, I would like to thank my counselors. Dr. Wei Shyy has been an excellent advisor, always enthusiastic and willing to help in any way possible. Likewise,
Dr. Davy Belk has always been encouraging and supportive of my work. I am glad to
be able to count them both as friends, as well as mentors. I would also like to thank

the rest of my committee, Dr. Bruce Carroll, Dr. Chen Chi Hsu, and Dr. B. C. Vemuri.

They have all been my allies.
I also want to thank Dr. Milton, who was the head of the GERC when I started
this endeavor, and Dr. Sforza, who is now head of the GERC, and whom I have gotten
to know better through AIAA. And since I am mentioning the GERC, thanks go to

Cathy and Judi. I must also thank Mr. Whitehead, who has been my manager and an
excellent supporter (even if he is an Alabama fan), since I came to work for QuesTech
and now CACI. And, let me not forget Dr. Darryl Thornton (it gets worse, he's an
Ole Miss grad), who has been my department director for many years now.
I have been very fortunate to receive funding from AFOSR. Without this funding

and the time it allowed me away from my other task duties, I am sure that I would not
have been able to complete my degree. I would like to thank everyone that helped to
obtain this funding including Mr. Jim Brock and Maj. Mark Lutton of the Air Force
Seek Eagle Office, and Mr. Dave Uhrig.
To be nice and legal, support was provided in part by a grant of HPC time from
the DoD HPC Distributed Center, U.S. Army Space and Strategic Defense Command,

SGI Origin 2000. Additional support was provided by the Air Force Office of Scientific
Research, Air Force Materiel Command, USAF, under grant number F49620-98-1-0092. The U.S. Government is authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright notation thereon. The views
and conclusions contained herein are those of the author and should not be interpreted

as necessarily representing the official policies or endorsements, either expressed or

implied, of the Air Force Office of Scientific Research or the U.S. Government.

I hope that my presentation clearly shows which part of this work is my own.
The original development of the Beggar code, including the unique algorithms for
grid assembly, was done by Dr. Davy Belk and Maj. Ray Maple. The original work

done to parallelize the Beggar flow solver was done by Dr. Davy Belk and Mr. Doug

Strasburg. I owe a large debt to these gentlemen. There are now two others that
continue the development of Beggar. Dr. Ralph Noack has been a tremendous asset

to our group and has been most helpful in my work. It is from Ralph that I learned
of the latency issues associated with the Origin 2000 and Ralph is the author of the
load balancing algorithm used to distribute the grids in order to load balance the
flow solver. Dr. Magdi Rizk came to Eglin at the same time that I did and we have

worked together since. Magdi has been the sole developer of the Beggar flow solver
for several years and has made significant contributions in that area. Magdi has also
been my teacher at the GERC and a great friend. I would like to thank my other
friends and colleagues and hope they will accept this thank you en masse.
Most importantly, I would like to thank my family. My in-laws have been very
helpful (during births and hurricanes, pouring the patio and eating at Hideaway's)

and have always been willing to take the wife and kids when I really needed to study

(sometimes for weeks). My parents have always been supportive in innumerable ways,
from driving for hours and then sitting in the car for hours at honor band tryouts,
to attending MSU football games just to see the halftime show, to buying a house
for us when Tracye and I got married in the middle of college, to doing my laundry

and fixing the brakes on the car over the weekend so I could get back to school. My
brother Stephen has been a lot of help to my parents the last few years. Taking
care of them is something that I don't think I could do. Jay and Josh are too much
like me at times; but they are welcomed diversions and are stress relievers (our dog,
Dudy Noble Polk, also fits into this category). And Tracye is a wonderful person. My
roommate in college once said, "You are very lucky. You've found someone that you
love, who is also a friend." She puts up with a lot. I am no picnic. We spent a lot of
time apart during this, and I always get depressed whenever they are gone for a long
time. Some of her bulldog friends, that she talks to over the internet, asked "How do
you live with someone that is so positive all of the time?" "Who Tracye?", I replied.
"No..., it's great!"



TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

1 INTRODUCTION
    Overview
    Related Work
        Grid Assembly
        Store Separation
        Parallel Computing
    Dissertation Outline

2 GRID ASSEMBLY
    Polygonal Mapping Tree
    Interpolation Stencils
    Hole Cutting
    Donors and Receptors
    Boundary Condition Identification

3 FLOW SOLUTION
    Governing Equations
    Vector Form
    Non-Dimensionalization
    Coordinate Transformation
    Flux Vector Splitting
    Flux Difference Splitting
    Newton Relaxation
    Fixed-Point Iteration
    Parallel Considerations

4 6DOF INTEGRATION
    Equations of Motion
    Coordinate Transformations
    Quaternions
    Numerical Integration

5 PARALLEL PROGRAMMING
    Hardware Overview
    Software Overview
    Performance
    Load Balancing

6 PROPOSED APPROACH
    Phase I: Hybrid Parallel-Sequential
    Phase II: Function Overlapping
    Phase III: Coarse Grain Decomposition
    Phase IV: Fine Grain Decomposition
    Summary

7 TEST PROBLEM

8 RESULTS
    Phase I: Hybrid Parallel-Sequential
    Phase II: Function Overlapping
    Phase III: Coarse Grain Decomposition
    Phase IV: Fine Grain Decomposition
    Summary

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH


LIST OF TABLES

1.1 Grid assembly codes
1.2 Store separation modeling methods
1.3 Grid generation methods
1.4 Significant accomplishments in parallel computing in relation to overset grid methods
6.1 Summary of the implementations of parallel grid assembly
7.1 Store physical properties
7.2 Ejector properties
7.3 Original grid dimensions
7.4 Dimensions of split grids
7.5 Load Imbalance Factors
7.6 Summary of the final position of the stores calculated from the two different grid sets
8.1 Summary of results from the phase I runs including the final position of the bottom store
8.2 Summary of results from the phase II runs including the final position of the outboard store
8.3 Summary of results from the phase III runs including the final position of the inboard store
8.4 Summary of results from the runs that used fine grain hole cutting including the final position of the bottom store
8.5 Summary of best execution times (in minutes) from runs of the different implementations (number of FE processes shown in parentheses)


LIST OF FIGURES

1.1 History of three store ripple release
1.2 Solution process
1.3 Example of overlapping grids with holes cut
1.4 Grids for single generic store trajectory calculation
1.5 Mach 0.95 single store trajectory calculated (left) CG position and (right) angular position versus wind tunnel CTS data
1.6 Mach 1.20 single store trajectory calculated (left) CG position and (right) angular position versus wind tunnel CTS data
2.1 Example quad tree mesh
2.2 Example PM tree structure
4.1 Transformation from global to local coordinates
5.1 Unbalanced work load
5.2 Limitations in load balance caused by a poor decomposition
5.3 Imbalance caused by synchronization
6.1 Phase I implementation
6.2 Comparison of estimated speedup of phase I to Amdahl's law
6.3 Basic flow solution algorithm
6.4 Phase II implementation
6.5 Insufficient time to hide grid assembly
6.6 Comparison of estimated speedup of phases I and II
6.7 Duplication of PM tree on each FE process
6.8 Distribution of PM tree across the FE processes
6.9 Phase III implementation
6.10 Comparison of the estimated speedup of phases I, II, and III
6.11 Phase IV implementation
6.12 Comparison of estimated speedup of phases I, II, III and IV
7.1 Bottom store (left) CG and (right) angular positions
7.2 Outboard store (left) CG and (right) angular positions
7.3 Inboard store (left) CG and (right) angular positions
8.1 Actual speedup of phase I
8.2 Bottom store (left) force coefficient and (right) moment coefficient variation between dt iterations history
8.3 Outboard store (left) force coefficient and (right) moment coefficient variation between dt iterations history
8.4 Inboard store (left) force coefficient and (right) moment coefficient variation between dt iterations history
8.5 Actual speedup of phase II
8.6 Effect of using average execution time
8.7 Actual speedup of phase III
8.8 History of grid assembly load imbalance based on execution times of hole cutting, stencil search, and health check
8.9 Grid assembly process
8.10 History of grid assembly load imbalance based on execution time of the stencil search
8.11 Grid assembly execution timings for four FE processes
8.12 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with load balance based on measured execution time of stencil searching. Each curve represents a separate process.
8.13 History of grid assembly load imbalance based on number of IGBPs
8.14 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with load balance based on number of IGBPs. Each curve represents a separate process.
8.15 Speedup due to fine grain hole cutting and load balancing of hole cutting separate from the stencil search
8.16 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search load balanced based on execution time. Each curve represents a separate process.
8.17 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 5 FE processes. Each curve represents a separate process.
8.18 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 6 FE processes. Each curve represents a separate process.
8.19 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 7 FE processes. Each curve represents a separate process.
8.20 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 8 FE processes. Each curve represents a separate process.
8.21 Use of additional processors continues to reduce time for hole cutting
8.22 Execution times for load balanced fine grain hole cutting distributed across 4 FE processes
8.23 Execution times for load balanced fine grain hole cutting distributed across 5 FE processes
8.24 Execution times for load balanced fine grain hole cutting distributed across 6 FE processes
8.25 Execution times for load balanced fine grain hole cutting distributed across 7 FE processes
8.26 Execution times for load balanced fine grain hole cutting distributed across 8 FE processes
8.27 Summary of the increasing speedup achieved through the different implementations

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



PARALLEL COMPUTING OF OVERSET GRIDS FOR AERODYNAMIC PROBLEMS WITH MOVING OBJECTS

By

Nathan C. Prewitt

December 1999

Chairman: Dr. Wei Shyy
Major Department: Aerospace Engineering, Mechanics and Engineering Science

When a store is dropped from a military aircraft at high subsonic, transonic,

or supersonic speeds, the aerodynamic forces and moments acting on the store can
be sufficient to send the store back into contact with the aircraft. This can cause
damage to the aircraft and endanger the life of the crew. Therefore, store separation
analysis is used to certify the safety of any proposed drop. This analysis is often
based on wind tunnel aerodynamic data or analogy with flight test data from similar

configurations. Time accurate computational fluid dynamics (CFD) offers the option
of calculating store separation trajectories from first principles.
In the Chimera grid scheme, a set of independent, overlapping, structured grids is used to decompose the domain of interest. This allows the use of efficient structured grid flow solvers and associated boundary conditions, and allows for grid motion without stretching or regridding. However, these advantages are gained in exchange
for the requirement to establish communication links between the overlapping grids
via a process referred to as "grid assembly."

The calculation of a moving body problem, such as a store separation trajectory calculation, using the Chimera grid scheme, requires that the grid assembly be
performed each time that a grid is moved. Considering the facts that time accurate

CFD calculations are computationally expensive and that the grids may be moved
hundreds of times throughout a complete trajectory calculation, a single store trajectory calculation requires significant computational resources.
Parallel computing is used regularly to reduce the time required to get a CFD

solution to steady state problems. However, relatively little work has been done to use
parallel computing for time accurate, moving body problems. Thus, new techniques
are presented for the parallel implementation of the assembly of overset, Chimera grids.

This work is based on the grid assembly function defined in the Beggar code,
currently under development at Eglin Air Force Base, FL. This code is targeted at

the store separation problem and automates the grid assembly problem to a large
extent, using a polygonal mapping (PM) tree data structure to identify point/volume relationships.

A logical succession of incremental steps is presented in the parallel implementation of the grid assembly function. The parallel performance of each implementation
is analyzed and equations are presented for estimating the parallel speedup. Each

successive implementation attacks the weaknesses of the previous implementation in
an effort to improve the parallel performance.
The first implementation achieves the solution of moving body problems on multiple processors with minimum code changes. The second implementation improves
the parallel performance by hiding the execution time of the grid assembly function
behind the execution time of the flow solver. The third implementation uses coarse
grain data decomposition to reduce the execution time of the grid assembly function. The final implementation demonstrates the fine grain decomposition of the grid
assembly through the fine grain decomposition of the hole cutting process. Shared

memory techniques are used in the final implementation and appropriate dynamic
load balancing algorithms are presented.


1 INTRODUCTION

Overview

The knowledge of forces and moments induced by the addition of stores to an
aircraft is vital for safe carriage. Once a store is released, knowledge of the interference
aerodynamics and the effects on the trajectory of the store is vital for the safety of the
pilot and aircraft. Such aerodynamic data have traditionally been provided by wind
tunnel testing or flight testing; however, these techniques can be very expensive and
have limitations when simulating time accurate, moving body problems such as the

ripple release depicted in figure 1.1. Computational fluid dynamics (CFD) provides
a way to supplement wind tunnel and flight test data.

Figure 1.1: History of three store ripple release


The primary problem to be considered is store separation from fighter aircraft
configurations. The goal is to compute store separation trajectories in a timely fashion


using CFD and parallel computing. Due to the geometric complexity of aircraft/store

configurations and the requirement to handle moving body problems, the Chimera
grid scheme [1] is being used. This approach uses a set of overlapping structured
grids to decompose the domain of interest. The Chimera grid scheme offers several

advantages: 1) the use of structured grids allows the use of efficient block structured

grid flow solvers and the associated boundary conditions; 2) the generation of over-
lapping grids which best fit a particular component geometry eases the burden of

structured grid generation; and 3) the use of interpolation for communication be-
tween overlapping grids allows grids to be moved relative to each other. However,
the communication between overlapping grids must be reestablished whenever a grid

is moved. This process of establishing communication between overlapping grids will

be referred to as grid assembly.
Whenever the grid around a physical object overlaps another grid, there is the

probability that some grid points will lie inside the physical object and thus will be
outside of the flow field. Even if no actual grid points lie inside the physical object,
if a grid line crosses the physical object, there will be neighboring grid points that
lie on opposite sides of the physical object. Any numerical stencil that uses two

such neighboring grid points will introduce errors into the solution. This situation
is avoided by cutting holes into any grids overlapping the physical surfaces of the geometry.

During hole cutting, regions of the overlapping grids are marked as invalid. This

creates additional boundaries within the grid system. The flow solver requires that
some boundary condition be supplied at these boundaries. Likewise, some boundary
condition is also needed at the outer boundaries of embedded grids. Collectively, the
grid points on the fringe of the holes and the grid points on the outer boundaries of
the embedded grids are referred to as inter-grid boundary points (IGBPs) [2]. The
boundary conditions required at the IGBPs are supplied by interpolating the flow
solution from any overlapping grids.
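The classification described above can be sketched in a few lines. This is an illustrative toy (a 1-D grid and an interval-shaped body, with hypothetical names), not Beggar's actual data structures: points inside the body become hole points, and valid points on the fringe of a hole become IGBPs whose values must be interpolated from an overlapping grid.

```python
# Toy hole cutting on a 1-D grid: mark points inside the body as holes,
# then flag the fringe points next to a hole as inter-grid boundary points.

def classify_points(xs, body_lo, body_hi):
    """Return 'hole', 'igbp', or 'field' for each grid point coordinate."""
    hole = [body_lo <= x <= body_hi for x in xs]
    status = []
    for i, inside in enumerate(hole):
        if inside:
            status.append("hole")    # cut out of the flow field
        elif (i > 0 and hole[i - 1]) or (i + 1 < len(hole) and hole[i + 1]):
            status.append("igbp")    # fringe point: needs interpolated values
        else:
            status.append("field")   # ordinary flow-field point
    return status

print(classify_points([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 1.5, 3.5))
```

In a real assembly the outer-boundary points of embedded grids are also IGBPs, as noted above; the toy only shows the hole-fringe case.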

The Beggar code [3], developed at Eglin Air Force Base, is capable of solving three-dimensional inviscid and viscous flow problems involving multiple moving
objects, and is suitable for simulating store separation. This code allows blocked,
patched, and overlapping structured grids in a framework that includes grid assembly, flow solution, force and moment calculation, and the integration of the rigid body,
six degrees of freedom (6DOF) equations of motion. All block-to-block connections,
patched connections, freestream boundary conditions, singularity boundary conditions, and overlapped boundaries are detected automatically. All holes are defined
using the solid boundaries as cutting surfaces and all required interpolation stencils
are calculated automatically. The integration of all necessary functions simplifies the simulation of moving body problems [4], while the automation and efficient implementation of the grid assembly process [5] significantly reduces the amount of user
input and is of great benefit in a production work environment.
The basic solution process consists of an iterative loop through the four functions shown in figure 1.2. The blocked and overset grid system is first assembled. Once this is done, the flow solution is calculated in a time-accurate manner. Aerodynamic forces and moments are then integrated over the grid surfaces representing the

physical surfaces of the moving bodies. The rigid body equations of motion are then
integrated with respect to time to determine the new position of the grids considering
all aerodynamic forces and moments, forces due to gravity, and all externally applied

forces and moments (such as ejectors).
Multiple iterations of this loop are required to perform a complete store separation trajectory calculation. The accuracy of the trajectory predicted from the integration of the equations of motion is affected by the time step chosen; however, stability constraints on the flow solver are normally more restrictive. In typical store separation calculations, the time step has been limited to 0.1-1.0 milliseconds; thus, hundreds or even thousands of iterations are often required.
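The four-function loop described above (figure 1.2) can be written schematically as follows. The function arguments and state layout here are placeholders of my own for illustration, not Beggar's actual interfaces:

```python
# Schematic of the iterative loop: grid assembly, flow solution, force/moment
# integration, and 6DOF integration, repeated once per time step.

def trajectory_loop(state, dt, n_steps,
                    assemble, solve_flow, integrate_loads, integrate_6dof):
    """Advance a moving-body calculation by n_steps time steps."""
    history = [state]
    for _ in range(n_steps):
        connectivity = assemble(state)                      # 1. grid assembly
        flow = solve_flow(state, connectivity, dt)          # 2. time-accurate flow solution
        forces, moments = integrate_loads(flow)             # 3. force/moment integration
        state = integrate_6dof(state, forces, moments, dt)  # 4. rigid body equations of motion
        history.append(state)
    return history
```

Any concrete models can be plugged in for the four callables; for example, a trivial 6DOF stand-in that only applies gravity already reproduces the loop structure.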
Figure 1.2: Solution process

As the complexity of flow simulations continues to increase, it becomes more critical to utilize parallel computing to reduce solution turnaround times. The parallel
implementation of the Beggar flow solver was first presented by Belk and Strasburg
[6]. This flow solver uses a finite volume discretization and flux difference splitting
based on Roe's approximate Riemann solver [7]. The solution method is a Newton-
Relaxation scheme [8]; that is, the discretized, linearized, governing equations are
written in the form of Newton's method and each step of the Newton's method is
solved using symmetric Gauss-Seidel (SGS) iteration. The SGS iterations, or inner iterations, are performed on a grid-by-grid basis, while the Newton iterations, or dt iterations, are used to achieve time accuracy and are performed on all grids in sequence. In this reference, the separate grids are used as the basis for data decomposition. The grids, which represent separate flow solution tasks, are distributed across
multiple processors and the flow solver is executed concurrently. The only communication between processes is the exchange of flow field information at block-to-block,
patched, and overlapped boundaries between dt iterations. The grid assembly is

performed only once; thus, only static grid problems are addressed.
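The inner/outer iteration structure can be illustrated on a toy nonlinear system (my own example, not the flow equations): each outer Newton step linearizes the residual, and the resulting linear system is solved only approximately by symmetric Gauss-Seidel sweeps before the next Newton step is taken.

```python
# Toy Newton-relaxation: outer Newton iterations on a mildly nonlinear
# tridiagonal system, inner symmetric Gauss-Seidel (SGS) sweeps on J*du = -r.

def residual(u, b):
    """Residual of a mildly nonlinear tridiagonal system."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(2.0 * u[i] - left - right + 0.1 * u[i] ** 3 - b[i])
    return r

def newton_sgs(b, n_newton=20, n_sgs=4):
    n = len(b)
    u = [0.0] * n
    for _ in range(n_newton):                     # outer Newton (dt) iterations
        r = residual(u, b)
        du = [0.0] * n
        for _ in range(n_sgs):                    # inner SGS iterations
            # one forward sweep followed by one backward sweep
            for i in list(range(n)) + list(range(n - 1, -1, -1)):
                left = du[i - 1] if i > 0 else 0.0
                right = du[i + 1] if i < n - 1 else 0.0
                du[i] = (-r[i] + left + right) / (2.0 + 0.3 * u[i] ** 2)
        u = [ui + dui for ui, dui in zip(u, du)]  # apply the Newton update
    return u
```

The key point mirrored from the text is that the inner solves are approximate: a few SGS sweeps per Newton step are enough for the outer iteration to converge.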
It is also desirable to utilize parallel computing to reduce the turnaround time

of moving body problems such as the ripple release configuration. In order to do so,

the grid assembly function must be executed each time grids are moved. An efficient,
scalable parallel implementation of any process requires that both the computation
and the required data be evenly distributed across the available processors while
minimizing the communication between processors. The movement of the grids and
the time variation in the holes being cut, as illustrated in figure 1.3, indicate the

dynamic and unstructured nature of the grid assembly work load and data structures.
This makes an efficient implementation a challenging task.

Thus, the primary focus of this work is the parallel implementation of the grid

assembly function so that store separation trajectories can be calculated using time-
accurate CFD and parallel computers. A logical succession of incremental steps is
used to facilitate the parallel implementation of the grid assembly function. The initial
implementation (phase I) uses a single process to perform the entire grid assembly in
a serial fashion with respect to the parallel execution of the flow solver. This requires
that proper communication be established between the flow solution function and
the grid assembly function; however, it does not require any consideration of load
balancing or partitioning of the grid assembly function. The grid assembly function

is not accelerated, but the flow solution is.

In the second implementation (phase II), parallel efficiency is gained by overlapping the grid assembly function and the flow solution function. This overlapping
of work is possible because of the use of the Newton-Relaxation method within the
flow solver. Each step of the approximate Newton's method produces an approximation to the flow solution at the next time step. Approximate aerodynamic forces

and moments are calculated from the flow solution after the first Newton step and

are used to drive the grid assembly function, while additional Newton steps are being
calculated to achieve time accuracy.
As long as there is sufficient time to hide the work of the grid assembly function,

the speedup is affected only by the performance of the flow solver. However, as the
processor count increases, the time of the flow solution available to hide the grid
assembly decreases and the rate of change of speedup with respect to processor count

decreases. Therefore, it is important to distribute the work of the grid assembly
function to make the entire process scalable to higher processor counts.
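A rough model of this limit (my own simplification, not the dissertation's estimator): if grid assembly takes time t_ga and runs concurrently with a flow solve that takes t_flow/n on n processors, one coupled iteration costs about max(t_flow/n, t_ga), so the speedup plateaus once t_flow/n drops below t_ga.

```python
# Back-of-the-envelope phase II speedup: grid assembly overlapped with the
# parallel flow solve; the slower of the two sets the iteration time.

def phase2_speedup(t_flow, t_ga, n_procs):
    """Estimated speedup of the overlapped scheme over one processor."""
    serial = t_flow + t_ga                  # both functions in sequence on 1 CPU
    parallel = max(t_flow / n_procs, t_ga)  # slower of the overlapped tasks
    return serial / parallel

# Illustrative (made-up) timings: speedup stalls once t_flow/n < t_ga.
for n in (2, 8, 32, 128):
    print(n, phase2_speedup(100.0, 5.0, n))
```

With these made-up numbers the estimate grows until 20 processors and then stays at 21 regardless of processor count, which is the scalability ceiling the next implementation attacks.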
The third implementation (phase III) uses data decomposition of the grid assembly function to reduce its execution time and thus allows the grid assembly time to be

more easily hidden by the flow solution time. The basis for the data decomposition
is the superblock, which is a group of grids that contain block-to-block connections
and are overlapped with other superblocks. In this implementation, the work and

data structures associated with a superblock are distributed over multiple processors.
Dynamic load balancing is used to improve the performance by moving superblocks
between processes.
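One simple way to picture superblock load balancing (an assumed greedy heuristic for illustration, not necessarily the algorithm used in Beggar) is to hand out superblocks in order of decreasing measured cost, each to the currently least loaded process, and then measure the resulting imbalance:

```python
import heapq

# Greedy (LPT-style) distribution of superblocks across processes.

def distribute(costs, n_procs):
    """costs: {superblock: measured_cost}. Returns per-process loads and
    the superblock assignment."""
    heap = [(0.0, p) for p in range(n_procs)]   # (current load, process id)
    heapq.heapify(heap)
    loads = [0.0] * n_procs
    assignment = {p: [] for p in range(n_procs)}
    for block, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)           # least loaded process
        loads[p] = load + cost
        assignment[p].append(block)
        heapq.heappush(heap, (load + cost, p))
    return loads, assignment

def imbalance_factor(loads):
    """Maximum load over average load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))
```

With only a handful of superblocks the achievable imbalance factor stays well above 1.0, which is exactly the coarse-grain limitation that motivates phase IV.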
The relatively small number of superblocks used in most problems places a limit
on the number of processors that can be effectively utilized. Thus, in order to improve
scalability, the fourth implementation (phase IV) uses a fine grain decomposition of
the work associated with grid assembly. The work of the grid assembly function can
be associated with the facets that cut holes into overlapping grids and the cell centers

that require interpolation. Therefore, the hole cutting facets and the IGBPs form the

basis for the fine grain distribution of the work associated with grid assembly.

This dissertation gives complete details of the implementation options for including the grid assembly function into the parallel execution of moving body CFD computations. Each implementation builds upon the previous implementation, attacking the limitations in order to improve performance. Details of the performance analysis are included. Equations for estimating the performance are also presented. With practical experience and some further development, these implementations and performance estimators could offer optimum execution guidelines for particular problems.

Related Work

Grid Assembly

Table 1.1 lists some of the codes that are currently available for assembling overset grid systems. Some of the advantages and disadvantages of each code are listed.

Since the author is not completely familiar with the operation of all of these codes,

some of the disadvantages (or advantages) may only be perceived. In general, finding

the root cause of a failure in the grid assembly process is a difficult task. This difficulty is a disadvantage of overset grids in general and is not listed as a disadvantage for any of the codes, although some of the codes provide better aids for debugging

than do others. Likewise, the use of orphan points (points that fail to be properly
interpolated and are given averaged values from neighbors) can help to ensure that

grid assembly does not fail. However, orphan points are not listed as an advantage
for any code since they can adversely affect the flow solution.
PEGSUS [9] is the first and one of the more widely used codes for handling

the grid assembly problem. It relies on a set of overlapping grids (block-to-block
connections are not allowed). PEGSUS is completely separate from any flow solver
but will produce interpolation information for either finite difference or finite volume

Table 1.1: Grid assembly codes

    PEGSUS
        Advantages: first code; large user base
        Disadvantages: slow; requires a lot of user input
    DCF3D
        Advantages: fast; large user base; well supported
        Disadvantages: requires significant user input
    CMPGRD
        Advantages: modern programming techniques; well defined algorithms
        Disadvantages: not widely distributed
    BEGGAR
        Advantages: automated grid assembly; allows block-to-block connections; small user input; geared toward production work environment; complete flow solution environment
        Disadvantages: slower than DCF3D; monolithic code; limited user base; has difficulties with overset viscous grids

flow solvers. The amount of user input required is often rather large: each hole cutting
surface has to be identified, all overlapping boundaries must be identified, and a set
of links must be specified to tell the code which grids to cut holes into and where to
check for interpolation coefficients.
DCF3D (domain connectivity function) [2] is another code used to accomplish
the grid assembly task. DCF3D is not coupled directly with any flow solver but
it has been used extensively with the OVERFLOW flow solver [10]. DCF3D uses
several alternative approaches in order to improve the efficiency of the grid assembly
process. DCF3D uses analytic shapes for hole cutting which allows grid points to
be compared directly to the hole cutting surface. It also uses Cartesian grids, called

inverse maps, to improve the efficiency of selecting starting points for the search for
interpolation stencils. These techniques improve the efficiency of the grid assembly
process; however, an additional burden is placed on the user to define the analytical
shapes and the extent and density of the inverse maps.
More recently, improvements to DCF3D have been proposed in order to reduce
the burden placed on the user. These improvements include the use of hole-map tech-
nology and the iterative adjustment of the connectivity information [11]. Hole-map

technology uses Cartesian grids to map the hole cutting surfaces in an approximate,

stair stepped fashion. This would allow the automatic creation of the hole cutting

surfaces and an efficient means of identifying hole points. The iterative process of ad-

justing the connectivity information by expanding and contracting the holes in order

to minimize the overlap between grids also offers benefits.

Yet another code that addresses the grid assembly problem is CMPGRD [12].

This code is an early version of the grid assembly process that has been included in

OVERTURE [13]. This tool does not appear to be highly optimized; rather, its
strengths seem to be in its well defined algorithms for the grid assembly process. The
algorithms can produce minimal overlap between grids, and other quality measures
are considered in the donor selection process.

In comparison to the above mentioned codes, Beggar is unique in that its devel-

opment has been geared towards the store separation problem and a production work

environment. As such, Beggar attempts to automate the entire solution process while

reducing the burden of input that is placed on the user. Beggar also uses unique data

structures and algorithms in order to maintain the efficiency of the grid assembly process.


Store Separation

Table 1.2 lists some of the techniques that have been used to calculate store sep-

aration trajectories. Some of the advantages and disadvantages from each technique

are listed. The techniques range from simple curve fits of data from similar configu-

rations, to wind tunnel experimental methods, to the calculation of the complete flow

field from first principles.

Engineering level methods (see Dillenius et al. [14] for example) derive aerody-

namic data from data bases of experimental data, simple aerodynamic correlations,

and panel methods with corrections for nonlinear effects such as vortical flow. Such

methods are computationally inexpensive but have very limited applicability. These

Table 1.2: Store separation modeling methods

Method                Advantage                       Disadvantage
Engineering Level     Computationally inexpensive;    Limited applicability
Methods               provide quick data for
                      preliminary design
Captive Trajectory    Wind tunnel accuracy of flow    Limited range of motion;
Support               phenomena                       quasi-steady; high cost;
                                                      tunnel interference
Influence Function    Fast solution allows            Mutual interference effects
Method                statistical investigation of    can be lost
                      trajectories
Computational Fluid   Completely time accurate;       Grid generation can be labor
Dynamics              flexible; unlimited in          intensive; requires
                      configuration; provides data    significant computing
                      for visualization of the        resources; weaknesses in
                      complete flow field             modeling some flow phenomena
                                                      such as turbulence

methods are most useful in preliminary design, but have been applied to the calcula-
tion of store separation trajectories.
Store separation events have been simulated in wind tunnels using the captive

trajectory support (CTS) system [15]. This technique places a sting mounted store
in the flow field of an aircraft wing and pylon. The store is repositioned according to

the integration of measured aerodynamic loads and modeled ejector loads. Since the
store cannot be moved in real time, an angle-of-attack correction is made based on
the velocity of the moving store. This technique is quasi-steady and often limited in
the range of motion due to the sting mechanism.

Store separation trajectories have also been calculated using wind tunnel data
and an influence function method (IFM) [16]. This method uses wind tunnel data
to define flow angularity near an aircraft wing and pylon. These data are used to

apply a delta to the freestream forces and moments of the store assuming that the
store does not affect the aircraft flow field. Another correction is made for mutual
interference using wind tunnel data of the store in carriage position. Jordan [17] gave
a detailed comparison of loads calculated from IFM and CFD versus loads measured

in the wind tunnel. IFM was shown to be inaccurate due to mutual interference
that is not directly related to flow angle. The distance at which mutual interference
becomes insignificant must also be well known.

Such semi-empirical techniques can also be used with CFD data replacing part
or all of the wind tunnel data. In a recent special session at the AIAA Aerospace
Sciences Meeting, most of the papers presented [18, 19, 20] used this technique. One
paper [21] used quasi-steady CFD. Of the two time-accurate CFD simulations slated

to be presented, one was withdrawn and the other was prevented from being presented
due to the failure to get clearance for public release.

When considering time-accurate CFD calculations for moving body problems,
the decomposition of the domain (grid type) has a significant impact on the solution
process. Table 1.3 lists several of the different grid methods in use. Some of the
advantages and disadvantages of each grid method are listed.

Cartesian grids (see [22] for example) have been used for moving body problems,
but the treatment of boundary conditions can be complicated. Boundary conforming
block structured grids have been used to calculate store separation trajectories [23];
however, the motion of a store within a block structured grid requires grid stretching
and deformation. This places a limit on the motion before regridding is required due to
errors introduced by grid skewness. Unstructured grids have also been applied to the
store separation problem (see [24] for example). The flexibility of unstructured grid

generation eases the grid generation burden but complicates the flow solver. Octree
grids have also been used to ease the grid generation burden and allow adaptive grid
refinement. SPLITFLOW [25] represents a compromise between these unstructured
grid techniques. A prismatic grid is used near solid surfaces to simplify boundary
conditions and an octree grid is used away from the solid surfaces and offers adaption
and some organizational structure. Chimera grid methods are also a compromise and
have been applied extensively to the store separation problem [4, 26, 27, 28, 29, 30].
They can be viewed as locally structured, but globally unstructured.

Table 1.3: Grid generation methods

Grid Type          Advantage                        Disadvantage
Cartesian          Small memory requirements;       Difficult treatment of
                   fast flow solver                 boundary conditions; poor
                                                    viscous solution capabilities
Structured         General treatment of flow        Restricted to simple
                   solver and boundary conditions   geometries
Block Structured   Extension to complex             Grid generation is time
                   geometries                       consuming; grid motion or
                                                    adaption is difficult
Quad Tree          Easily adapted                   Difficult treatment of
                                                    boundary conditions;
                                                    connectivity information
                                                    required
Unstructured       Automatic grid generation;       Larger memory requirements;
                   easily adapted                   slower flow solvers;
                                                    connectivity information
                                                    required; weak viscous
                                                    solution capabilities
Chimera            Structured grid flow solvers     Connectivity (only at IGBPs)
                   and boundary conditions; eases   must be constructed separate
                   grid generation burden; allows   from the grid generation
                   grid movement                    process

Time accurate CFD has been validated for use in calculating store separation

trajectories. Lijewski [28] presented the first complete system for calculating store
separation trajectories. Lijewski [28] also presented the first use of a particular set

of wind tunnel CTS data for store separation code validation. The configuration is
a single, sting mounted, ogive-cylinder-ogive store under a generic pylon and wing.
Grids for the generic store are shown in figure 1.4.
Data, first presented by Prewitt et al. [4], for the subsonic and supersonic tra-
jectories of the single generic store are shown in figures 1.5 and 1.6. The CTS data
are shown by the symbols and the time accurate CFD calculations are shown by the
curves. These comparisons show excellent agreement between the wind tunnel data

and time accurate CFD calculations for this test problem.

Figure 1.4: Grids for single generic store trajectory calculation


Figure 1.5: Mach 0.95 single store trajectory calculated (left) CG position and (right)
angular position versus wind tunnel CTS data

More complex configurations have also been used for validation cases. Cline

et al. [31] presented store separation trajectories from an F-16 aircraft configuration

including a fuel tank, pylon, and an aerodynamic fairing at the junction of the pylon

and the wing. Coleman et al. [32] presented separation trajectories for the MK-

84 from the F-15E aircraft. This configuration included a centerline fuel tank, a

LANTIRN targeting pod, an inboard conformal fuel tank (CFT) weapons pylon with

attached MK-84, forward and middle stub pylons on the outside of the CFT, LAU-

128 rail launchers with AIM-9 missiles on both sides of the wing weapons pylon, and

the MK-84 to be released from the wing weapons pylon. Both references compared


Figure 1.6: Mach 1.20 single store trajectory calculated (left) CG position and (right)
angular position versus wind tunnel CTS data

trajectory calculations to wind tunnel data and were instrumental in the approval

of the use of CFD by engineers in evaluating the store separation characteristics of aircraft/store configurations.


Parallel Computing

Although much work has been done on the parallel execution of CFD flow solvers,

including Chimera method flow solvers, little work has been done on the efficient par-

allel implementation of Chimera methods for moving body problems. In particular,

there are very few references on the parallel treatment of the grid assembly problem.

Table 1.4 gives a list of references of some of the more important developments in

parallel computing as related to Chimera grid methods and grid assembly.

Smith [33] presents the parallel implementation of an overset grid flow solver for

a network based heterogeneous computing environment. This flow solver was derived

from OVERFLOW and uses coarse grain parallelism with the component grids being

distributed among the available processors. A master/slave model is used. The master

process performs all I/O functions, maintains all of the interpolated flow solution data,

and communicates with each of the slave processes. The slave processes calculate the

flow solution and perform the interpolation of flow solution data. A load balancing


technique is used and the interpolation of flow solution data is overlapped with the
calculation of the flow solution to reduce load imbalances. The grid communication
information was supplied as an input and only static problems were addressed.

Wissink and Meakin [34] presented the application of a Chimera grid flow solver
based on OVERFLOW and DCF3D. This code uses overlapping structured grids

near the solid boundaries in order to resolve viscous effects and uses spatially refined
Cartesian blocks in the rest of the domain. Parallel performance was presented but
only static problems were addressed. The same code was again presented by Meakin
and Wissink [35]. Two dynamic problems were presented in this reference; however,
the focus was on the ability to adapt the Cartesian blocks due to flow solution and

body motion. Some parallel performance data are presented based on an iteration of

the flow solver. No performance data were presented for an entire simulation which
would include the performance of the grid assembly.

The first presentation of the parallel implementation of grid assembly for dy-

namic, overset grids was by Barszcz et al. [36]. DCF3D was parallelized and used in
connection with a parallel version of OVERFLOW on a distributed memory parallel

machine. A coarse grain parallelism was implemented with the data decomposition

based on component grids. A static load balance was used based on balancing the
load of the flow solver. Since the flow solver represented a large portion of the total
work, load balancing the flow solver is important to achieving a good overall load bal-

ance; however, significant imbalances were seen in the grid assembly processes. Donor
cell identification was found to be the most time consuming part of grid assembly and
algorithm changes were implemented to reduce this part of the work load.
Weeratunga and Chawla [37] again used DCF3D and OVERFLOW on a dis-
tributed memory parallel machine. Again, the component grids are used for data
decomposition and load balancing is based on the work load of the flow solver. No
consideration is given to the distribution of donor elements or IGBPs. The primary
focus in this reference is on demonstrating the scalability of the processes used. In


this study, the donor search method scaled well; however, the hole cutting and the
identification of points requiring interpolation did not scale well.
Wissink and Meakin [38] present the first attempt to load balance the grid

assembly process. However, the data decomposition is still based on the component

grids and affects the load balance of both the flow solver and the grid assembly
function. A static load balance is initially performed to equally distribute the numbers
of grid points which helps to load balance the flow solver. A dynamic load balancing
routine is then used during a calculation to redistribute the grids to improve the
load balance of grid assembly. This, in turn, creates an imbalance in the flow solver.
This algorithm offers a method of improving performance if an imbalance in the grid

assembly work load is a major deterrent. However, in the problems presented, the
flow solver represented the major part of the work load and any redistribution of grids
in order to improve the grid assembly load balance actually decreased the overall code performance.

Dissertation Outline

Chapter 2 presents details of the algorithms and data structures of the grid
assembly process. For completeness, chapter 3 presents the flow solution algorithm
and chapter 4 presents the integration of the 6DOF rigid body equations of motion.
Chapter 5 presents an overview of programming parallel computers and outlines the
approaches used in the current work. Chapter 6 gives some details of the proposed
implementations including equations for estimating speedup. Chapter 7 presents the
ripple release test problem used for all timings of the implementations. The results of
the timings are presented in chapter 8. The final conclusions and some possibilities
for future work are presented in chapter 9.

Table 1.4: Significant accomplishments in parallel computing in relation to overset
grid methods

Reference               Accomplishment                  Limitation
Smith and Pallis,       Parallel implementation of      Restricted to static problems
1993 [33]               OVERFLOW for heterogeneous
                        computing
Barszcz, Weeratunga,    First parallel implementation   Data decomposition and static
and Meakin, 1993 [36]   of grid assembly                load balance tied to flow
                                                        solver
Weeratunga and          Detailed study of scalability   Data decomposition and static
Chawla, 1995 [37]       of parallel implementation of   load balance tied to flow
                        DCF3D                           solver
Belk and Strasburg,     First parallel implementation   Restricted to static problems
1996 [6]                of Beggar
Wissink and Meakin,     First attempt to load balance   Decomposition of grid assembly
1997 [38]               grid assembly                   tied to flow solver means any
                                                        improvement in the load
                                                        balance of grid assembly
                                                        adversely affects the load
                                                        balance of the flow solver
Wissink and Meakin,     Small, near body, curvilinear   Only static problems were
1998 [34]               grids used in combination       presented
                        with adaptive Cartesian grids
Prewitt, Belk, and      First parallel implementation   Limited scalability
Shyy, 1998 [39]         of Beggar for dynamic
                        problems; overlapping of grid
                        assembly and flow solution
                        time
Meakin and Wissink,     Included dynamic problems       No performance of dynamic
1999 [35]               with combined overset grids     grid assembly was presented
                        and adaptive Cartesian grids
Prewitt, Belk, and      Coarse grain decomposition      Major functions within the
Shyy, 1999 [40]         and dynamic load balancing of   grid assembly are not
                        grid assembly based on          individually well balanced
                        superblocks independent of
                        flow solver


Although Beggar is useful for general external compressible fluid flow problems,

its primary focus during development has been on the simulation of store carriage

and separation events. A typical grid system includes grids for an aircraft, pylons,

launchers, and stores. The grids are often placed inside a large rectangular grid
which serves as a background grid that reaches to freestream. Due to disparity in

grid spacing between overlapping grids, it is often necessary to introduce other grids

that serve as an interface to aid communication. The stores are often bodies of

revolution with attached wings, canards, and/or fins. Blocked grid systems are used

for these relatively simple geometries; however, in order to allow grid movement, such

blocked grids are treated as overlapping grids with respect to other grids.

The superblock construct is introduced to aid in grid assembly. The superblock
is a collection of non-overlapping grids which are treated as a single entity. Block-to-

block connections are allowed only within a superblock; thus, a superblock is often

used to implement a blocked system of grids for part of the solution domain. Over-

lapping connections are allowed only between different superblocks.
A dynamic group is a collection of one or more superblocks that is treated as

a single entity by the 6DOF. The dynamic group is used primarily to group grids
which are part of the same moving body. There is always at least one dynamic group:
the static dynamic group. This holds the static grids like the background grid, the

aircraft grid, pylon grids, or store grids that do not move relative to the aircraft.

Other dynamic groups are created for each store that will move relative to the static
dynamic group. Since a dynamic group may contain one or more superblocks, each


moving body can be constructed from a system of blocked grids in a single superblock,

a system of overlapping grids in multiple superblocks, or a combination of the two.
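The containment hierarchy described above (grids within superblocks, superblocks within dynamic groups) can be sketched as follows. This is a minimal illustration, not Beggar's actual data layout; the class fields and grid names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Grid:
    name: str                      # a single structured grid block

@dataclass
class Superblock:
    # Non-overlapping grids treated as a single entity; block-to-block
    # connections are allowed only among the grids of one superblock.
    grids: List[Grid] = field(default_factory=list)

@dataclass
class DynamicGroup:
    # One or more superblocks moved as a single rigid body by the 6DOF.
    name: str
    superblocks: List[Superblock] = field(default_factory=list)

# The static dynamic group holds grids that do not move relative to the
# aircraft; each released store gets its own dynamic group.
static_group = DynamicGroup("static", [
    Superblock([Grid("background")]),
    Superblock([Grid("aircraft"), Grid("pylon")]),
])
store_group = DynamicGroup("store-1", [
    Superblock([Grid("store-body"), Grid("fins")]),
])
```

A moving body may thus be a single blocked superblock, several overlapping superblocks, or a mix of the two, exactly as in the text above.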

Polygonal Mapping Tree

In order to establish overlapped grid communications the following questions
must be answered: does this point lie inside a grid, and if so, what is an appropriate

interpolation stencil? These questions represent point-volume geometric relation-
ships. In order to determine such relationships, Beggar uses a polygonal mapping
(PM) tree, which is a combination of the octree and binary space partitioning (BSP)
tree data structures [41, 42].
An octree is a data structure in which a region of space is recursively subdivided

into octants. Each parent octant is divided into eight children which can be further
subdivided. This forms a hierarchy of ancestor and descendant octants. Each octant

in the tree is termed a node with the beginning node (lowest level) being the root
node and the most descendent nodes (highest levels) being the leaf nodes. Such a
data structure allows a domain to be divided into 8^n subdomains using just n levels.

Associated with each node are the Cartesian coordinates of the center of the octant.
Which child octant a point lies in can be identified by comparing the coordinates of

the point against the coordinates of the center of the parent octant. With such a
data structure, a point can be identified as lying within a particular octant out of 8^n
octants by using at most n comparisons (if the tree is perfectly balanced).
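The n-comparison descent can be illustrated with a minimal uniform octree: at each level, three coordinate comparisons against the octant center select one of eight children. The function names here are illustrative, not Beggar's.

```python
def child_index(point, center):
    """Return the 0..7 child octant index for point relative to center:
    one comparison per coordinate direction."""
    ix = int(point[0] >= center[0])
    iy = int(point[1] >= center[1])
    iz = int(point[2] >= center[2])
    return ix | (iy << 1) | (iz << 2)

def locate(point, lo, hi, levels):
    """Descend a uniform octree over the box [lo, hi]; return the path of
    child indices identifying the leaf octant containing the point.
    n levels resolve 8**n octants with only n center comparisons."""
    path = []
    lo, hi = list(lo), list(hi)
    for _ in range(levels):
        center = [(l + h) / 2 for l, h in zip(lo, hi)]
        idx = child_index(point, center)
        path.append(idx)
        # shrink the box to the chosen child octant
        for d in range(3):
            if (idx >> d) & 1:
                lo[d] = center[d]
            else:
                hi[d] = center[d]
    return path
```

For example, a point near the upper corner of the unit cube descends into child 7 at every level.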
The BSP tree is a binary tree data structure in which each node of the tree
is represented by a plane definition. Each node has two children representing the

in and out sides of the plane. For a faceted representation of a surface, each facet
defines a plane that is inserted into the BSP tree. While being inserted the facet may
be clipped against existing planes in the BSP tree placing pieces of the same plane
definition into different branches of the tree. Using a given point, the BSP tree is
traversed by comparing the point against a plane definition at each level to determine

which branch to descend into. Once a leaf node is reached, the point is identified as

being inside or outside of the faceted surface.
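The BSP traversal can be sketched as follows for a convex faceted surface. The node layout (a plane plus "in" and "out" children, with string leaves) is an assumption for illustration, not Beggar's actual structure, and a real tree would contain clipped facets rather than whole planes.

```python
# Each internal node stores a plane (outward normal n, offset d, with
# n.x = d on the plane); the sign of n.x - d picks the child to descend.
def side(plane, p):
    n, d = plane
    return sum(ni * pi for ni, pi in zip(n, p)) - d

def classify(node, p):
    """node is either the string 'IN'/'OUT' (a leaf) or a tuple
    (plane, in_child, out_child)."""
    while not isinstance(node, str):
        plane, in_child, out_child = node
        node = out_child if side(plane, p) > 0.0 else in_child
    return node

# The unit cube [0,1]^3 as six outward-facing planes, nested so a point
# must be behind every face to be classified IN.
faces = [((1, 0, 0), 1), ((-1, 0, 0), 0), ((0, 1, 0), 1),
         ((0, -1, 0), 0), ((0, 0, 1), 1), ((0, 0, -1), 0)]
tree = 'IN'
for plane in faces:
    tree = (plane, tree, 'OUT')

print(classify(tree, (0.5, 0.5, 0.5)))  # IN
print(classify(tree, (1.5, 0.5, 0.5)))  # OUT
```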
In theory, a BSP tree of the cell faces on the boundaries of a grid block could be

used to determine whether a point is IN or OUT of that particular grid. However, due
to the clipping process, the BSP tree can be prone to roundoff error. Likewise, the
structure of the tree is dependent on the order in which facets are inserted and it is
not guaranteed to be well balanced. If the tree were to become one-sided, a point may
have to be compared against all or most of the facets on a surface to determine its
relationship to that surface. Therefore, Beggar uses a combination of the octree and
BSP tree data structures. The octree, which stays well balanced, is used to quickly
narrow down the region of space in which a point lies. If a point lies in a leaf node

that contains an overlapping boundary grid surface, it must be compared to a BSP
tree that is stored in that leaf node to determine its relationship to that boundary
surface and therefore its relationship to the grid itself.
The PM tree data structure is built by refining the octree in a local manner until
no octant contains more than one grid boundary point from the same superblock. This
produces a regular division of space that adapts to grid boundaries and grid point

density. The boundary cell faces of the grids are then used to define facets which
are inserted into BSP trees stored at the leaf nodes of the octree. Since each grid
boundary point is normally shared by four cell faces and each octant contains only
one grid boundary point, the BSP trees stored at the octree leaf nodes should be very small.

Once the basic data structure is complete, all of the octants of the leaf nodes are
classified relative to the grid boundaries. Each octant is classified as inside or outside
of each superblock or as a boundary octant. Then points can be classified efficiently
relative to the superblocks. To do so, the octant in which the point lies is found. If
the octant has been classified as IN or OUT, the point can be immediately classified
as IN or OUT. However, if the point lies in a boundary octant, the point must be

compared against the BSP tree that is stored in that octant.
Figure 2.1 represents a quadtree (the 2-D equivalent of an octree) for a 4 block O-grid
around an airfoil. Only the grid points on the boundaries are used to define the level

of refinement, so only the boundaries of the grid are shown. The grid boundaries are
inserted into the leaf octants as BSP trees to form the PM tree. A portion of the
PM tree that might result is shown in figure 2.2. The leaf octants are represented by
squares, while the other nodes are represented by circles. The four branches at each

node represent the four quadrants of an octant. The line segments shown in some
of the leaf octants represent portions of the grid boundaries that would be placed in

BSP trees. If a point being classified against the PM tree falls into one of these leaf

octants, it must be compared against the facets to determine its relationship to the
grid. The empty leaf octants that are drawn with solid lines are completely inside the
grid, while the leaf octants that are drawn with dashed lines are completely outside
the grid. Points that fall into either of these types of octants can immediately be
classified relative to the grid.

Figure 2.1: Example quad tree mesh

The PM tree is expensive to construct and would be very inefficient to use if

Figure 2.2: Example PM tree structure

it had to be reconstructed each time a grid moved. Therefore, for each dynamic

group, a set of transformations are maintained between the current position and
the original position in which the PM tree was built. Whenever the PM tree is

used to find an interpolation stencil in one grid for a grid point in another grid,
the transformations are used to transform the grid point to the original coordinate

system. The transformed grid point can then be used with the PM tree constructed

in the original coordinate system of the grids. Thus the PM tree must be constructed

only once.

Interpolation Stencils

The primary function of the PM tree is to help find interpolation stencils for grid

points which require interpolated flow field information. When an interpolation stencil

is required, the PM tree is used to classify the corresponding grid point relative to

each superblock. This quickly identifies which superblocks and grids the grid point lies
in and therefore which superblocks might offer a source of interpolation information.

This answers the in/out question directly. However, once a point is identified as
being inside a given superblock, the exact curvilinear coordinates corresponding to
the Cartesian coordinates of the grid point must be found.

For a curvilinear grid defined by the coordinates of the intersections of three
families of boundary conforming grid lines, denoted by (ξ, η, ζ), the coordinates at
any point within a cell can be calculated from tri-linear interpolation

    R(ξ, η, ζ) = (1-u)(1-v)(1-w) r(I, J, K) +
                 (1-u)(1-v)w r(I, J, K+1) +
                 (1-u)v(1-w) r(I, J+1, K) +
                 (1-u)vw r(I, J+1, K+1) +
                 u(1-v)(1-w) r(I+1, J, K) +
                 u(1-v)w r(I+1, J, K+1) +
                 uv(1-w) r(I+1, J+1, K) +
                 uvw r(I+1, J+1, K+1)                              (2.1)

where r(I, J, K), r(I+1, J, K), ... are the known coordinates of the eight corners of
a cell. The index (I, J, K) denotes the three dimensional equivalent of the lower left
corner of a two dimensional cell, while (u, v, w) vary between 0 and 1 throughout the
cell so that

    ξ = I + u,   I = 1, 2, ..., NI-1,   0 ≤ u < 1
    η = J + v,   J = 1, 2, ..., NJ-1,   0 ≤ v < 1
    ζ = K + w,   K = 1, 2, ..., NK-1,   0 ≤ w < 1                  (2.2)

and R(ξ, η, ζ) is a piecewise continuous function over the entire grid.

For every known point r that lies within a grid, there exists some (ξ, η, ζ) such
that r = R(ξ, η, ζ). However, in order to find the (ξ, η, ζ) that corresponds to a
known r, the nonlinear function F = R(ξ, η, ζ) - r must be minimized. Newton's
method can be used to minimize this function iteratively using

    ξ^(m+1) = ξ^m - [∂F^m/∂ξ]^(-1) F^m                             (2.3)

where ξ is the curvilinear coordinate vector (ξ, η, ζ), m is the Newton iteration
counter, and the Jacobian matrix is calculated from the equations

    ∂R/∂u = C1 + vC3 + w[C5 + vC7]
    ∂R/∂v = C2 + uC3 + w[C6 + uC7]
    ∂R/∂w = C4 + uC5 + v[C6 + uC7]                                 (2.4)

where

    C1 = r(I+1, J, K) - r(I, J, K)
    C2 = r(I, J+1, K) - r(I, J, K)
    C3 = r(I+1, J+1, K) - r(I, J+1, K) - C1
    C4 = r(I, J, K+1) - r(I, J, K)
    C5 = r(I+1, J, K+1) - r(I, J, K+1) - C1
    C6 = r(I, J+1, K+1) - r(I, J, K+1) - C2
    C7 = r(I+1, J+1, K+1) - r(I, J+1, K+1) -
         r(I+1, J, K+1) + r(I, J, K+1) - C3                        (2.5)

Newton's method needs a good starting point; therefore, curvilinear coordinates
at which to start the search are stored in the leaf nodes of the octree and the BSP
trees. Although the PM tree classifies a point relative to a superblock, a starting
point identifies a particular cell within a particular grid of the superblock. If the
octree is sufficiently refined, the starting point should be close enough to ensure that
stencil jumping will converge. As the curvilinear coordinates ξ are updated with
equation 2.3, if any of (u, v, w) exceeds the range 0 to 1 then the search proceeds
to a neighboring cell and the Jacobian matrix, as well as the corners of the containing
cell, must be updated. This algorithm is commonly called stencil jumping.

Hole Cutting

Beggar uses an outline and fill algorithm for cutting holes. In this algorithm, the
facets of the hole cutting surface are used to create an outline of the hole. The cells
of a grid through which a hole cutting facet passes are located by using the PM tree
to locate the cells containing the vertices of the facet. These cells are compared to

the facet and are marked as being either on the hole side or the world side of the hole

cutting surface. If the cells containing the facet vertices are not neighbors, the facet is
subdivided recursively and new points on the hole cutting facet are introduced. These
new points are processed just like the original facet vertices to ensure a continuous

outline of the hole cutting surface. Once the complete hole cutting surface is outlined,
the hole is flood filled by sweeping through the grid and marking as hole points any
points that lie between hole points or between a grid boundary and a hole point. The

marking of holes is capped off by the world side points created from the outline. This
process is able to mark holes without comparing every grid point against each hole
cutting surface and it places no special restrictions on how the hole cutting surfaces

are defined as long as they are completely closed. It also allows holes to be cut using
infinitely thin surfaces.
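The fill step can be sketched in two dimensions as a flood fill seeded at the hole-side outline points and capped by the world-side points. The tag values and function are illustrative assumptions, and this sketch uses a queue-based fill where Beggar sweeps along grid lines; note how an incomplete world-side ring would let the fill "leak" into the rest of the grid, as discussed below.

```python
from collections import deque

# After the outline step each grid point is UNKNOWN, HOLE (hole side of
# the cutting surface), or WORLD (world side).
UNKNOWN, HOLE, WORLD = 0, 1, 2

def flood_fill(tags):
    """2-D sketch: tags is a list of rows; returns a filled copy in which
    every UNKNOWN point reachable from a hole-side point without crossing
    a world-side point has been marked as a hole point."""
    ni, nj = len(tags), len(tags[0])
    filled = [row[:] for row in tags]
    q = deque((i, j) for i in range(ni) for j in range(nj)
              if filled[i][j] == HOLE)
    while q:
        i, j = q.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = i + di, j + dj
            if 0 <= a < ni and 0 <= b < nj and filled[a][b] == UNKNOWN:
                filled[a][b] = HOLE        # grow the hole interior
                q.append((a, b))
    return filled
```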
During the search for interpolation stencils, it is possible that a stencil may
be found that is in some way undesirable. If no other interpolation stencil can be

found for this point, then the point is marked out and an attempt is made to find an
interpolation stencil for a neighboring point. This process essentially grows the hole
in an attempt to find a valid grid assembly.
There are several weaknesses in this hole cutting algorithm. During the flood

fill, if the hole outline is not completely surrounded by world side points, a leaky hole
can result and the complete grid can be marked out. Conversely, the use of recursive
subdivision of facets to ensure that a complete hole is outlined can dramatically
increase execution time when hole cutting surfaces cut across a singularity or a region
of viscous spacing. In such cases, it is possible to coarsely outline the hole and to

use the natural process of marking out points which fail interpolation rather than
doing the flood fill. This option is often referred to as the "nofill" option based on
the command line argument that is used to invoke this option and the fact that the

holes are outlined but are not filled.

Donors and Receptors

One of the more important concepts is how to handle block-to-block and over-
lapped communications. Beggar introduces the concept of donors and receptors to
deal with the communication introduced by these two boundary conditions. Since
the flow solver uses a finite volume discretization, flow field information is associ-

ated with the grid cells or cell centers. A receptor will grab flow field information
from one cell and store it in another cell. The receptor only needs to know the
grid and cell from which to get the information. A donor will interpolate flow field

information from a cell and then put the interpolated data into another storage lo-

cation. The donor needs to know the grid from which to interpolate data, as well
as an interpolation stencil for use in interpolating data from eight surrounding cell

centers. Thus, block-to-block connections can be implemented using only receptors.
Overlapped connections are implemented with donors.
If all of the grids' data are stored in core, a donor can be used to interpolate
the flow data from one grid and to store the interpolated values into another grid.

However, if all of the grids' data are not available, a small, donor value array (DVA)
is needed to store the intermediate values. A donor, associated with the source grid,
is used to perform the interpolation and to store the result into the DVA. Then a
receptor, associated with the destination grid, is used to fetch the values from the
DVA and store them in the final location.
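The donor/receptor bookkeeping might be sketched as follows. This is an illustrative Python rendition rather than Beggar's actual data structures; the class names, the dictionary-based grids, and the explicit trilinear weights are assumptions of the sketch.

```python
def trilinear_weights(u, v, w):
    """Interpolation weights for the 8 cell centers of a stencil."""
    return [(1-u)*(1-v)*(1-w), u*(1-v)*(1-w), (1-u)*v*(1-w), u*v*(1-w),
            (1-u)*(1-v)*w,     u*(1-v)*w,     (1-u)*v*w,     u*v*w]

class Donor:
    """Interpolates from 8 surrounding cell centers into a DVA slot."""
    def __init__(self, cells, uvw, dva_index):
        self.cells = cells
        self.weights = trilinear_weights(*uvw)
        self.dva_index = dva_index
    def run(self, source_grid, dva):
        dva[self.dva_index] = sum(w * source_grid[c]
                                  for w, c in zip(self.weights, self.cells))

class Receptor:
    """Fetches a value from the DVA and stores it in a destination cell."""
    def __init__(self, dva_index, cell):
        self.dva_index = dva_index
        self.cell = cell
    def run(self, dva, dest_grid):
        dest_grid[self.cell] = dva[self.dva_index]

# A constant field interpolates exactly, since the weights sum to one.
cells = [(0,0,0), (1,0,0), (0,1,0), (1,1,0),
         (0,0,1), (1,0,1), (0,1,1), (1,1,1)]
source = {c: 5.0 for c in cells}
dva = [0.0]
Donor(cells, (0.3, 0.6, 0.2), 0).run(source, dva)
dest = {}
Receptor(0, (9, 9, 9)).run(dva, dest)
```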

Boundary Condition Identification

The automatic identification of most of the boundary conditions centers around

several interdependent linked lists. The first of these is a list of the points on the
boundaries of the grids in each superblock. A tolerance is used to decide if two points
are coincident, so that the list contains only one entry for each unique boundary point.
Another tolerance is used to decide if a point lies on a user specified reflection plane.
Another list is constructed using the unique cell faces on each grid's boundaries.
While building this list, degenerate cell faces and cell faces that lie on a reflection
plane are identified. The order of the forming points for a cell face is not important for
identification; therefore, the forming points are sorted using pointers into the points
list. Each cell face can then be associated with the first point in its sorted list of
forming points. For a finite volume code, each cell face on a block-to-block boundary

connects exactly two cells from either the same grid or from two different grids. Thus,
for a given boundary point, if its list of associated cell faces contains two faces that
are built from the same forming points, a block-to-block connection is defined.
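The face-matching rule above reduces to sorting each face's forming points and grouping faces on the sorted tuple. A minimal sketch, with hypothetical face identifiers:

```python
def find_block_to_block(faces):
    """Pair up boundary cell faces that share the same forming points.

    `faces` maps a face id to the ids of its forming points; sorting the
    points makes the match independent of point ordering.  Any two faces
    with identical sorted forming points define a block-to-block
    connection.
    """
    by_points = {}
    for fid, pts in faces.items():
        by_points.setdefault(tuple(sorted(pts)), []).append(fid)
    return [tuple(fids) for fids in by_points.values() if len(fids) == 2]

faces = {
    "gridA/f17": (4, 9, 12, 3),   # same four points, different order
    "gridB/f02": (12, 3, 4, 9),
    "gridA/f18": (4, 9, 14, 5),   # unmatched outer-boundary face
}
pairs = find_block_to_block(faces)
```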


Although the flow solver is not the focus of this work, this section is included
for completeness. The flow solution algorithm supplies some opportunities for paral-
lelization that affect the total performance of the code. The governing equations are
presented first, followed by the solution algorithms and the general numerical solution
techniques.

Governing Equations

The equations governing the motion of a fluid are the statements of the con-
servation of mass, momentum, and energy. As an example, Newton's second law of

motion describes the conservation of momentum. However, Newton's second law, as
presented in most dynamics textbooks, is written to describe the motion of a particle,
a rigid body, or a system of particles, that is, a well defined quantity of mass whose
motion is described in a Lagrangian reference frame. For the motion of a fluid, it is

often more useful to consider flow through a region of space or a control volume (an
Eulerian reference frame). Considering a control volume V(t) that is bounded by a
surface S(t), Reynolds' Transport Theorem (see Potter and Foss [43, pages 72-87]

for an example of the derivation) is used to convert the time rate of change of an
extensive property of a system into integrals over the control volume, i.e.

\frac{d}{dt}\int_{V(t)} \phi(\mathbf{x},t)\,dV = \int_{V(t)} \frac{\partial\phi}{\partial t}\,dV + \int_{S(t)} \phi\,\mathbf{u}\cdot\hat{n}\,dS    (3.1)

These two terms represent a variation in the conserved property within a volume
V(t) due to some internal sources (the volume integral) and a variation due to flux

across the boundary surface S(t) (the surface integral). The variable \phi represents any
conserved quantity (such as \rho, \rho u, \rho E for mass, linear momentum, and total energy,
all per unit volume), \mathbf{u} is the local velocity vector, and \hat{n} is the unit vector normal to
dS. The surface integral is converted to a volume integral using the vector form of
Gauss's Theorem

\int_{S} \phi\,\mathbf{u}\cdot\hat{n}\,dS = \int_{V} \nabla\cdot(\phi\mathbf{u})\,dV    (3.2)
which assumes that \nabla\cdot(\phi\mathbf{u}) exists everywhere in V. Thus, the time rate of change of
the conserved property can be written as

\frac{d}{dt}\int_{V(t)} \phi(\mathbf{x},t)\,dV = \int_{V(t)} \left[\frac{\partial\phi}{\partial t} + \nabla\cdot(\phi\mathbf{u})\right] dV    (3.3)
The time rate of change of the conserved quantity is dependent upon source
terms that can act on the volume or on the surface of the volume. If the source
terms are represented by a volume integral of a scalar quantity \phi_v and a surface
integral of a vector quantity \boldsymbol{\psi}, the general conservation law can be written as

\int_{V(t)} \left[\frac{\partial\phi}{\partial t} + \nabla\cdot(\phi\mathbf{u})\right] dV = \int_{V(t)} \left[\phi_v + \nabla\cdot\boldsymbol{\psi}\right] dV    (3.4)
Since an arbitrary volume is assumed, the integrand must apply for an infinites-
imal volume. The integral can be removed to yield the differential form

\frac{\partial\phi}{\partial t} + \nabla\cdot(\phi\mathbf{u}) = \phi_v + \nabla\cdot\boldsymbol{\psi}    (3.5)

For the conservation of mass, there are no source terms. Replacing \phi by the
density \rho in equation 3.5, the differential continuity equation is

\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u}) = 0    (3.6)

or, written in Cartesian coordinates

\frac{\partial\rho}{\partial t} + \frac{\partial(\rho u)}{\partial x} + \frac{\partial(\rho v)}{\partial y} + \frac{\partial(\rho w)}{\partial z} = 0    (3.7)

where u, v, and w are the three Cartesian components of the velocity.
For the conservation of momentum, the source terms are the forces acting on the
control volume. Ignoring the gravitational and inertial forces, the sum of the forces
acting on the system can be written as

\mathbf{F} = -\int_{S(t)} p\,\hat{n}\,dS + \int_{S(t)} \boldsymbol{\tau}\,dS    (3.8)

where \boldsymbol{\tau} is the viscous stress vector and p is the pressure. Note that the pressure
has been separated from the viscous stress, whereas it is often included as a spherical
stress term. The differential form is

F_x = -\frac{\partial p}{\partial x} + \frac{\partial\tau_{xx}}{\partial x} + \frac{\partial\tau_{yx}}{\partial y} + \frac{\partial\tau_{zx}}{\partial z}

F_y = -\frac{\partial p}{\partial y} + \frac{\partial\tau_{xy}}{\partial x} + \frac{\partial\tau_{yy}}{\partial y} + \frac{\partial\tau_{zy}}{\partial z}

F_z = -\frac{\partial p}{\partial z} + \frac{\partial\tau_{xz}}{\partial x} + \frac{\partial\tau_{yz}}{\partial y} + \frac{\partial\tau_{zz}}{\partial z}    (3.9)
where \tau_{xx}, \tau_{xy}, etc. are elements of the viscous stress tensor (see Potter and Foss
[43, pages 171-174] for the derivation). This tensor is symmetric, so that \tau_{xy} = \tau_{yx},
\tau_{xz} = \tau_{zx}, and \tau_{yz} = \tau_{zy}. Using these equations as the source terms and substituting
pu into equation 3.5 as the conserved quantity, the three Cartesian components of
the conservation of momentum are

\frac{\partial(\rho u)}{\partial t} + \frac{\partial(\rho u^2 + p)}{\partial x} + \frac{\partial(\rho u v)}{\partial y} + \frac{\partial(\rho u w)}{\partial z} = \frac{\partial\tau_{xx}}{\partial x} + \frac{\partial\tau_{xy}}{\partial y} + \frac{\partial\tau_{xz}}{\partial z}

\frac{\partial(\rho v)}{\partial t} + \frac{\partial(\rho u v)}{\partial x} + \frac{\partial(\rho v^2 + p)}{\partial y} + \frac{\partial(\rho v w)}{\partial z} = \frac{\partial\tau_{yx}}{\partial x} + \frac{\partial\tau_{yy}}{\partial y} + \frac{\partial\tau_{yz}}{\partial z}

\frac{\partial(\rho w)}{\partial t} + \frac{\partial(\rho u w)}{\partial x} + \frac{\partial(\rho v w)}{\partial y} + \frac{\partial(\rho w^2 + p)}{\partial z} = \frac{\partial\tau_{zx}}{\partial x} + \frac{\partial\tau_{zy}}{\partial y} + \frac{\partial\tau_{zz}}{\partial z}    (3.10)
where the pressure terms have been moved to the left hand side. Formally, these
equations are the Navier-Stokes equations. However, in general, the term Navier-
Stokes equations is used to refer to the complete set of conservation laws when the
viscous terms are included as shown here.
For the conservation of energy, the source terms are the rate of heat transfer to
the system minus the rate of work done by the system. Substituting pE into equation

3.4, the conservation of energy is written as

\int_{V(t)} \left[\frac{\partial(\rho E)}{\partial t} + \nabla\cdot(\rho E\mathbf{u})\right] dV = \dot{Q} - \dot{W}    (3.11)
or, in differential form

\frac{\partial(\rho E)}{\partial t} + \nabla\cdot(\rho E\mathbf{u}) = \dot{Q} - \dot{W}    (3.12)

Ignoring any internal heat sources, the heat transfer rate can be written as

\dot{Q} = -\int_{S(t)} \dot{\mathbf{q}}\cdot\hat{n}\,dS    (3.13)

where \dot{\mathbf{q}} is the heat flux vector. This integral can be converted to a volume integral
and then written in the differential form

\dot{Q} = -\frac{\partial\dot{q}_x}{\partial x} - \frac{\partial\dot{q}_y}{\partial y} - \frac{\partial\dot{q}_z}{\partial z}    (3.14)

The work rate is due to the forces acting on the surface of the control volume.
Ignoring any work by gravitational or inertia forces, the work rate is written in the
form

\dot{W} = \int_{S(t)} p\,\mathbf{u}\cdot\hat{n}\,dS - \int_{S(t)} \mathbf{u}\cdot\boldsymbol{\tau}\,dS    (3.15)

The differential form is

\dot{W} = \frac{\partial(pu)}{\partial x} + \frac{\partial(pv)}{\partial y} + \frac{\partial(pw)}{\partial z}
- \frac{\partial}{\partial x}\left(u\tau_{xx} + v\tau_{xy} + w\tau_{xz}\right)
- \frac{\partial}{\partial y}\left(u\tau_{yx} + v\tau_{yy} + w\tau_{yz}\right)
- \frac{\partial}{\partial z}\left(u\tau_{zx} + v\tau_{zy} + w\tau_{zz}\right)    (3.16)

Substituting equations 3.14 and 3.16 into equation 3.12 yields the final differential form
of the conservation of energy equation

\frac{\partial(\rho E)}{\partial t} + \frac{\partial[u(\rho E + p)]}{\partial x} + \frac{\partial[v(\rho E + p)]}{\partial y} + \frac{\partial[w(\rho E + p)]}{\partial z}
= -\frac{\partial\dot{q}_x}{\partial x} - \frac{\partial\dot{q}_y}{\partial y} - \frac{\partial\dot{q}_z}{\partial z}
+ \frac{\partial}{\partial x}\left(\tau_{xx}u + \tau_{xy}v + \tau_{xz}w\right)
+ \frac{\partial}{\partial y}\left(\tau_{yx}u + \tau_{yy}v + \tau_{yz}w\right)
+ \frac{\partial}{\partial z}\left(\tau_{zx}u + \tau_{zy}v + \tau_{zz}w\right)    (3.17)

Counting the primitive variables \rho, u, v, w, E, p, and T and the 6 unique
elements of the viscous stress tensor, there are 13 unknowns and only 5 equations
(the conservation laws). In order to make a solution tractable, 8 more equations are
needed.

Fortunately, a relationship between the components of the stress tensor and
the velocity gradients is possible. The velocity gradients are directly related to the
rate-of-strain tensor and the vorticity tensor. Constitutive equations then define

the relationship between the stress components and the rate-of-strain components.
For a Newtonian fluid, a linear relationship between the stress and the strain rate is
assumed. Since the strain rate tensor is symmetric, there are only 6 unique strain rate
components. Assuming a linear relationship between the 6 unique stress components
and the 6 unique strain rate components, there are 36 material constants that must
be defined. The assumption of an isotropic material reduces this to 2 constants. For
fluids, these two constants are the dynamic viscosity \mu and the second coefficient of
viscosity \lambda. From Stokes' hypothesis, the relationship

\lambda = -\frac{2}{3}\mu    (3.18)
can be used for the compressible flow of air. For a Newtonian, isotropic fluid, the final
relationships between the components of the stress tensor and the velocity gradients
are

\tau_{xx} = \frac{2}{3}\mu\left(2\frac{\partial u}{\partial x} - \frac{\partial v}{\partial y} - \frac{\partial w}{\partial z}\right)

\tau_{yy} = \frac{2}{3}\mu\left(2\frac{\partial v}{\partial y} - \frac{\partial u}{\partial x} - \frac{\partial w}{\partial z}\right)

\tau_{zz} = \frac{2}{3}\mu\left(2\frac{\partial w}{\partial z} - \frac{\partial u}{\partial x} - \frac{\partial v}{\partial y}\right)

\tau_{xy} = \mu\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right)

\tau_{xz} = \mu\left(\frac{\partial u}{\partial z} + \frac{\partial w}{\partial x}\right)

\tau_{yz} = \mu\left(\frac{\partial v}{\partial z} + \frac{\partial w}{\partial y}\right)    (3.19)

With the original 5 conservation equations and the 6 relationships between the
viscous stresses and the velocity gradients, only 2 more equations are needed. If a
perfect gas is assumed, the thermodynamic state can be specified by only two ther-
modynamic variables. If p and \rho are chosen as the two independent thermodynamic

variables, the perfect gas law

p = pRT (3.20)

(where R is the gas constant) can be used to calculate the temperature T. The
relationship for the internal energy e per unit mass is

e = \frac{p}{\rho(\gamma - 1)}    (3.21)

where \gamma is the ratio of specific heats (\gamma = 1.4 for air) and the total energy per unit
mass is related to the internal energy by

E = e + \frac{U^2}{2}    (3.22)

where U is the magnitude of the velocity vector u.
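Equations 3.20 through 3.22 close the system with simple algebra. A small sketch (the function name and argument conventions are illustrative):

```python
R = 287.05   # gas constant for air, J/(kg K)
GAMMA = 1.4  # ratio of specific heats for air

def perfect_gas_state(p, rho, velocity):
    """T, e, and E from the independent variables p and rho
    (equations 3.20-3.22); `velocity` is the (u, v, w) tuple."""
    T = p / (rho * R)                  # perfect gas law, eq. 3.20
    e = p / (rho * (GAMMA - 1.0))      # internal energy, eq. 3.21
    U2 = sum(c * c for c in velocity)  # squared velocity magnitude
    E = e + 0.5 * U2                   # total energy, eq. 3.22
    return T, e, E

T, e, E = perfect_gas_state(101325.0, 1.225, (100.0, 0.0, 0.0))
```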

Vector Form

The three conservation laws, written in differential form, can be combined in the

vector differential equation

\frac{\partial q}{\partial t} + \frac{\partial f_i}{\partial x} + \frac{\partial g_i}{\partial y} + \frac{\partial h_i}{\partial z} = \frac{\partial f_v}{\partial x} + \frac{\partial g_v}{\partial y} + \frac{\partial h_v}{\partial z}    (3.23)


where

q = \begin{pmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ \rho E \end{pmatrix}, \quad
f_i = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u w \\ u(\rho E + p) \end{pmatrix}, \quad
f_v = \begin{pmatrix} 0 \\ \tau_{xx} \\ \tau_{xy} \\ \tau_{xz} \\ u\tau_{xx} + v\tau_{xy} + w\tau_{xz} - \dot{q}_x \end{pmatrix}

g_i = \begin{pmatrix} \rho v \\ \rho v u \\ \rho v^2 + p \\ \rho v w \\ v(\rho E + p) \end{pmatrix}, \quad
g_v = \begin{pmatrix} 0 \\ \tau_{yx} \\ \tau_{yy} \\ \tau_{yz} \\ u\tau_{yx} + v\tau_{yy} + w\tau_{yz} - \dot{q}_y \end{pmatrix}

h_i = \begin{pmatrix} \rho w \\ \rho w u \\ \rho w v \\ \rho w^2 + p \\ w(\rho E + p) \end{pmatrix}, \quad
h_v = \begin{pmatrix} 0 \\ \tau_{zx} \\ \tau_{zy} \\ \tau_{zz} \\ u\tau_{zx} + v\tau_{zy} + w\tau_{zz} - \dot{q}_z \end{pmatrix}    (3.24)

The first component of the vector equation represents the conservation of mass. The

next three components represent the conservation of momentum. The fifth component

represents the conservation of energy.

The terms fi, gi, and hi represent the inviscid flux vectors and f,, g,, and h,

represent the viscous flux vectors. Setting f, = g, = h, = 0 recovers the Euler

equations, which govern inviscid fluid flow. The elements of the vector q are the

conserved variables, as opposed to the primitive variables p, u, v, w, and p.

The subscripts on the terms \dot{q}_x, \dot{q}_y, and \dot{q}_z represent the components
of the heat flux vector, as opposed to partial derivative notation. Considering

only heat conduction, Fourier's law can be used to relate the heat flux vector to the

temperature gradient

\dot{\mathbf{q}} = -k\nabla T    (3.25)

where k is the heat conductivity and T is the temperature. The Prandtl number,

defined as

Pr = \frac{c_p \mu}{k}    (3.26)

is used to compute the heat conductivity k from the viscosity \mu (for air at standard

conditions, Pr = 0.72). Using the relationship

c_p = \frac{\gamma R}{\gamma - 1}    (3.27)

for a perfect gas, the components of the heat flux vector can be written as

\dot{q}_x = -\frac{\gamma R}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial T}{\partial x}, \quad
\dot{q}_y = -\frac{\gamma R}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial T}{\partial y}, \quad
\dot{q}_z = -\frac{\gamma R}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial T}{\partial z}    (3.28)


The governing equations are non-dimensionalized by freestream conditions so
that

\tilde{\rho} = \frac{\rho}{\rho_\infty}, \quad \tilde{u} = \frac{u}{a_\infty}, \quad \tilde{v} = \frac{v}{a_\infty}, \quad \tilde{w} = \frac{w}{a_\infty}, \quad \tilde{p} = \frac{p}{\rho_\infty a_\infty^2}, \quad \tilde{E} = \frac{E}{a_\infty^2}

\tilde{x} = \frac{x}{L}, \quad \tilde{y} = \frac{y}{L}, \quad \tilde{z} = \frac{z}{L}, \quad \tilde{t} = \frac{t\,a_\infty}{L}    (3.29)

where the ~ denotes non-dimensional quantities, the subscript \infty denotes freestream
conditions, L is some dimensional reference length, and a is the speed of sound, which
is defined by the relations

a = \sqrt{\gamma p/\rho} = \sqrt{\gamma R T}    (3.30)

The Mach number is the non-dimensional velocity. The freestream Mach number
is

M_\infty = \frac{U_\infty}{a_\infty}    (3.31)

where U_\infty is the magnitude of the freestream velocity. Therefore, the non-dimensional
velocity components become

\tilde{u} = \frac{u}{U_\infty}M_\infty, \quad \tilde{v} = \frac{v}{U_\infty}M_\infty, \quad \tilde{w} = \frac{w}{U_\infty}M_\infty    (3.32)
The terms u/U_\infty, etc. represent scaled direction cosines; therefore, the non-dimensional

velocities are scaled values of the Mach number.

The Reynolds number is the non-dimensional parameter

Re = \frac{\rho_\infty U_\infty L}{\mu_\infty}    (3.33)
which arises from the non-dimensionalization of the conservation of momentum equa-

tion. This parameter represents the ratio of inertia forces to viscous forces.

The non-dimensional governing equations can be written in the same form as

equations 3.23 and 3.24 by replacing the dimensional quantities by the correspond-

ing non-dimensional quantities. However, in the process of non-dimensionalizing the

equations, the non-dimensional term Mo/Re arises from the viscous flux vectors.

Therefore, the definition of the viscous stresses and the heat flux components must
be modified as
\tilde{\tau}_{xx} = \frac{2}{3}\frac{M_\infty}{Re}\tilde{\mu}\left(2\frac{\partial\tilde{u}}{\partial\tilde{x}} - \frac{\partial\tilde{v}}{\partial\tilde{y}} - \frac{\partial\tilde{w}}{\partial\tilde{z}}\right)

\tilde{\tau}_{yy} = \frac{2}{3}\frac{M_\infty}{Re}\tilde{\mu}\left(2\frac{\partial\tilde{v}}{\partial\tilde{y}} - \frac{\partial\tilde{u}}{\partial\tilde{x}} - \frac{\partial\tilde{w}}{\partial\tilde{z}}\right)

\tilde{\tau}_{zz} = \frac{2}{3}\frac{M_\infty}{Re}\tilde{\mu}\left(2\frac{\partial\tilde{w}}{\partial\tilde{z}} - \frac{\partial\tilde{u}}{\partial\tilde{x}} - \frac{\partial\tilde{v}}{\partial\tilde{y}}\right)

\tilde{\tau}_{xy} = \frac{M_\infty}{Re}\tilde{\mu}\left(\frac{\partial\tilde{u}}{\partial\tilde{y}} + \frac{\partial\tilde{v}}{\partial\tilde{x}}\right)

\tilde{\tau}_{xz} = \frac{M_\infty}{Re}\tilde{\mu}\left(\frac{\partial\tilde{u}}{\partial\tilde{z}} + \frac{\partial\tilde{w}}{\partial\tilde{x}}\right)

\tilde{\tau}_{yz} = \frac{M_\infty}{Re}\tilde{\mu}\left(\frac{\partial\tilde{v}}{\partial\tilde{z}} + \frac{\partial\tilde{w}}{\partial\tilde{y}}\right)    (3.34)

and

\tilde{q}_x = -\frac{1}{\gamma - 1}\frac{\tilde{\mu}}{Pr}\frac{M_\infty}{Re}\frac{\partial\tilde{T}}{\partial\tilde{x}}, \quad
\tilde{q}_y = -\frac{1}{\gamma - 1}\frac{\tilde{\mu}}{Pr}\frac{M_\infty}{Re}\frac{\partial\tilde{T}}{\partial\tilde{y}}, \quad
\tilde{q}_z = -\frac{1}{\gamma - 1}\frac{\tilde{\mu}}{Pr}\frac{M_\infty}{Re}\frac{\partial\tilde{T}}{\partial\tilde{z}}    (3.35)

The non-dimensional equation of state becomes

\tilde{p} = \frac{\tilde{\rho}\tilde{T}}{\gamma}    (3.36)

and the non-dimensional energy is related to the non-dimensional density and pressure
by the equation

\tilde{E} = \frac{\tilde{p}}{\tilde{\rho}(\gamma - 1)} + \frac{1}{2}\left(\tilde{u}^2 + \tilde{v}^2 + \tilde{w}^2\right)    (3.37)

The non-dimensional viscosity coefficient is related to the non-dimensional tempera-
ture by the power law

\tilde{\mu} = \tilde{T}^{2/3}    (3.38)

Coordinate Transformation

The use of a general, boundary conforming, structured grid introduces an ordered
set of grid points represented by Cartesian coordinates given at the integer curvilinear
coordinates \xi, \eta, \zeta. In order to solve the governing equations in this curvilinear
coordinate system, the partial derivatives with respect to the Cartesian coordinates
must be converted into partial derivatives with respect to the curvilinear coordinates.
This requires a transformation of the form

\xi = \xi(x,y,z,t), \quad \eta = \eta(x,y,z,t), \quad \zeta = \zeta(x,y,z,t), \quad \tau = \tau(t)    (3.39)



Applying the chain rule, the partial derivatives with respect to the Cartesian coordi-
nates can be written as

\frac{\partial}{\partial x} = \xi_x\frac{\partial}{\partial\xi} + \eta_x\frac{\partial}{\partial\eta} + \zeta_x\frac{\partial}{\partial\zeta}

\frac{\partial}{\partial y} = \xi_y\frac{\partial}{\partial\xi} + \eta_y\frac{\partial}{\partial\eta} + \zeta_y\frac{\partial}{\partial\zeta}

\frac{\partial}{\partial z} = \xi_z\frac{\partial}{\partial\xi} + \eta_z\frac{\partial}{\partial\eta} + \zeta_z\frac{\partial}{\partial\zeta}

\frac{\partial}{\partial t} = \tau_t\frac{\partial}{\partial\tau} + \xi_t\frac{\partial}{\partial\xi} + \eta_t\frac{\partial}{\partial\eta} + \zeta_t\frac{\partial}{\partial\zeta}    (3.40)

where the metric terms \xi_x, \eta_x, etc. appear. The metric term \xi_x represents the variation
in \xi with a unit variation in x while y, z, and t are held constant. These terms are not
easily evaluated. However, the partial derivatives of the Cartesian coordinates with
respect to the curvilinear coordinates that arise from the inverse transformation
represented by

x = x(\xi,\eta,\zeta,\tau), \quad y = y(\xi,\eta,\zeta,\tau), \quad z = z(\xi,\eta,\zeta,\tau), \quad t = t(\tau)    (3.41)

are easily evaluated. Applying the chain rule again, the partial derivatives with
respect to the curvilinear coordinates can be written as

\frac{\partial}{\partial\xi} = x_\xi\frac{\partial}{\partial x} + y_\xi\frac{\partial}{\partial y} + z_\xi\frac{\partial}{\partial z}

\frac{\partial}{\partial\eta} = x_\eta\frac{\partial}{\partial x} + y_\eta\frac{\partial}{\partial y} + z_\eta\frac{\partial}{\partial z}

\frac{\partial}{\partial\zeta} = x_\zeta\frac{\partial}{\partial x} + y_\zeta\frac{\partial}{\partial y} + z_\zeta\frac{\partial}{\partial z}

\frac{\partial}{\partial\tau} = t_\tau\frac{\partial}{\partial t} + x_\tau\frac{\partial}{\partial x} + y_\tau\frac{\partial}{\partial y} + z_\tau\frac{\partial}{\partial z}    (3.42)

Comparing equations 3.40 and 3.42 the Jacobian matrix of the transformation 3.39
is seen to be the inverse of the Jacobian of the inverse transformation 3.41 (see Belk

[44, appendix A] for a complete derivation). This yields the relationships

\xi_x = (y_\eta z_\zeta - z_\eta y_\zeta)/J

\xi_y = (z_\eta x_\zeta - x_\eta z_\zeta)/J

\xi_z = (x_\eta y_\zeta - y_\eta x_\zeta)/J

\xi_t = -\tau_t x_\tau \xi_x - \tau_t y_\tau \xi_y - \tau_t z_\tau \xi_z

\eta_x = (y_\zeta z_\xi - z_\zeta y_\xi)/J

\eta_y = (z_\zeta x_\xi - x_\zeta z_\xi)/J

\eta_z = (x_\zeta y_\xi - y_\zeta x_\xi)/J

\eta_t = -\tau_t x_\tau \eta_x - \tau_t y_\tau \eta_y - \tau_t z_\tau \eta_z

\zeta_x = (y_\xi z_\eta - z_\xi y_\eta)/J

\zeta_y = (z_\xi x_\eta - x_\xi z_\eta)/J

\zeta_z = (x_\xi y_\eta - y_\xi x_\eta)/J

\zeta_t = -\tau_t x_\tau \zeta_x - \tau_t y_\tau \zeta_y - \tau_t z_\tau \zeta_z    (3.43)

where J is the determinant of the Jacobian matrix of the inverse transformation

J = x_\xi(y_\eta z_\zeta - z_\eta y_\zeta) - y_\xi(x_\eta z_\zeta - z_\eta x_\zeta) + z_\xi(x_\eta y_\zeta - y_\eta x_\zeta)    (3.44)

The governing equations are then written in the form

\frac{\partial Q}{\partial\tau} + \frac{\partial(F_i - F_v)}{\partial\xi} + \frac{\partial(G_i - G_v)}{\partial\eta} + \frac{\partial(H_i - H_v)}{\partial\zeta} = 0    (3.45)


where

Q = Jq

F_i = J(q\xi_t + f_i\xi_x + g_i\xi_y + h_i\xi_z)

G_i = J(q\eta_t + f_i\eta_x + g_i\eta_y + h_i\eta_z)

H_i = J(q\zeta_t + f_i\zeta_x + g_i\zeta_y + h_i\zeta_z)

F_v = J(f_v\xi_x + g_v\xi_y + h_v\xi_z)

G_v = J(f_v\eta_x + g_v\eta_y + h_v\eta_z)

H_v = J(f_v\zeta_x + g_v\zeta_y + h_v\zeta_z)    (3.46)
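The metric identities of equations 3.43 and 3.44 translate directly into code once the derivatives of the inverse transformation are available (from finite differences of the grid coordinates, in practice). The sketch below takes those derivatives as inputs and returns the nine spatial metrics and J; the argument packing is an assumption of the example.

```python
def metrics(x_d, y_d, z_d):
    """Spatial metric terms (eq. 3.43) and the Jacobian determinant
    (eq. 3.44); x_d = (x_xi, x_eta, x_zeta), and similarly y_d, z_d."""
    x_xi, x_eta, x_zeta = x_d
    y_xi, y_eta, y_zeta = y_d
    z_xi, z_eta, z_zeta = z_d
    J = (x_xi * (y_eta * z_zeta - z_eta * y_zeta)
         - y_xi * (x_eta * z_zeta - z_eta * x_zeta)
         + z_xi * (x_eta * y_zeta - y_eta * x_zeta))
    xi_x   = (y_eta * z_zeta - z_eta * y_zeta) / J
    xi_y   = (z_eta * x_zeta - x_eta * z_zeta) / J
    xi_z   = (x_eta * y_zeta - y_eta * x_zeta) / J
    eta_x  = (y_zeta * z_xi - z_zeta * y_xi) / J
    eta_y  = (z_zeta * x_xi - x_zeta * z_xi) / J
    eta_z  = (x_zeta * y_xi - y_zeta * x_xi) / J
    zeta_x = (y_xi * z_eta - z_xi * y_eta) / J
    zeta_y = (z_xi * x_eta - x_xi * z_eta) / J
    zeta_z = (x_xi * y_eta - y_xi * x_eta) / J
    return (xi_x, xi_y, xi_z, eta_x, eta_y, eta_z,
            zeta_x, zeta_y, zeta_z), J

# Sanity check on a stretched Cartesian grid x = 2*xi, y = 3*eta, z = 4*zeta.
m, J = metrics((2.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 4.0))
```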

Flux Vector Splitting

The model hyperbolic equation is the one-dimensional linear convection equation

\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = 0    (3.47)

If a > 0, this equation describes the propagation of a wave in the +x direction at the
velocity a. The use of a backward time difference and a forward space difference or a
central space difference to produce the explicit discretized finite difference equations

\frac{u_i^{n+1} - u_i^n}{\Delta t} + a\frac{u_{i+1}^n - u_i^n}{\Delta x} = 0    (3.48)

or

\frac{u_i^{n+1} - u_i^n}{\Delta t} + a\frac{u_{i+1}^n - u_{i-1}^n}{2\Delta x} = 0    (3.49)

yields unconditionally unstable solution schemes. Instead, with a > 0, a backward
space difference of the form

\frac{u_i^{n+1} - u_i^n}{\Delta t} + a\frac{u_i^n - u_{i-1}^n}{\Delta x} = 0    (3.50)

is required to produce a stable scheme. Since the wave is propagating in the +x
direction, the backward space differencing represents "upwind" differencing. If the
wave speed a were negative, a forward space difference, again representing upwind
differencing, would be required to produce a stable scheme.
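The stability contrast between equations 3.48 and 3.50 is easy to demonstrate numerically. The following sketch marches both schemes on a periodic domain; the grid size, CFL number, and initial square wave are arbitrary choices for the demonstration.

```python
def advect(u0, a, dt, dx, steps, scheme):
    """March the 1-D linear convection equation with a > 0 using either
    the unstable forward space difference (eq. 3.48) or the stable
    upwind, backward space difference (eq. 3.50).  Periodic boundaries."""
    u = list(u0)
    n = len(u)
    c = a * dt / dx  # CFL number
    for _ in range(steps):
        if scheme == "forward":
            u = [u[i] - c * (u[(i + 1) % n] - u[i]) for i in range(n)]
        else:
            u = [u[i] - c * (u[i] - u[i - 1]) for i in range(n)]
    return u

u0 = [1.0 if 4 <= i < 8 else 0.0 for i in range(32)]
stable = advect(u0, a=1.0, dt=0.5, dx=1.0, steps=64, scheme="upwind")
unstable = advect(u0, a=1.0, dt=0.5, dx=1.0, steps=64, scheme="forward")
```

At a CFL number of 0.5 the upwind update is a convex combination of neighboring values, so the solution stays bounded, while the forward difference amplifies high-frequency content without bound.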

The goal of "flux vector splitting" [45] is to split the flux vector into components
which are associated with the positive and negative direction propagation of informa-
tion so that upwind differencing can be used. This produces a stable scheme without
the addition of the artificial dissipation that is often required with central difference
schemes.

Consider the one-dimensional Euler equations in Cartesian coordinates

\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} = 0    (3.51)

Since the flux vector is a homogeneous function of degree one of Q, the governing
equations can be written in quasi-linear form

\frac{\partial Q}{\partial t} + A\frac{\partial Q}{\partial x} = 0    (3.52)

(this has the same form as the model equation). The matrix A = \partial F/\partial Q is the flux
Jacobian matrix. This matrix can be decomposed in the form

A = R\Lambda R^{-1}    (3.53)

where the columns of R are the right eigenvectors of A, the rows of R^{-1} are the left
eigenvectors of A, and \Lambda is a diagonal matrix with the eigenvalues of A along the
diagonal. The eigenvalues are of the form

\lambda_1 = \lambda_2 = \lambda_3 = u, \quad \lambda_4 = u + a, \quad \lambda_5 = u - a    (3.54)

where a is the speed of sound. For locally subsonic flow, some of the eigenvalues will
be positive and some will be negative. Thus the matrix \Lambda can be written as

\Lambda = \Lambda^+ + \Lambda^-    (3.55)

where \Lambda^+ contains only the positive eigenvalues and \Lambda^- contains only the negative
eigenvalues. Substituting this into equation 3.53, the Jacobian matrix is split into

A = R\Lambda^+ R^{-1} + R\Lambda^- R^{-1} = A^+ + A^-    (3.56)

and the split flux vectors are defined from the homogeneity property as

F^+ = A^+ Q, \quad F^- = A^- Q    (3.57)

so that

F = F+ + F- (3.58)

Upwind differencing is then used appropriately with the split flux vectors in the
discretized governing equations. The complete form of the split flux vectors can be
found in Hirsch [46].
An implicit discretization of the governing equations can be written as

\frac{\Delta Q^{n+1}}{\Delta\tau} + \delta_\xi F^{n+1} + \delta_\eta G^{n+1} + \delta_\zeta H^{n+1} = 0    (3.59)

where the superscript n denotes the time step,

\Delta Q^{n+1} = Q^{n+1} - Q^n    (3.60)


\delta_\xi F = F_{i+1/2,j,k} - F_{i-1/2,j,k}

\delta_\eta G = G_{i,j+1/2,k} - G_{i,j-1/2,k}

\delta_\zeta H = H_{i,j,k+1/2} - H_{i,j,k-1/2}    (3.61)

For a finite volume discretization, the indices i, j, k represent a cell in which the
dependent variables Q are assumed to be constant, and indices i+1/2, j, k and
i-1/2, j, k, for example, are opposing cell faces at which the flux is evaluated. The
changes in the curvilinear coordinates \Delta\xi, \Delta\eta, and \Delta\zeta are unity.
A first order time linearization of the flux terms [47, 48] leads to the equation

\frac{\Delta Q^{n+1}}{\Delta\tau} + \delta_\xi\left[F^n + \left(\frac{\partial F}{\partial Q}\right)^n \Delta Q^{n+1}\right] + \delta_\eta\left[G^n + \left(\frac{\partial G}{\partial Q}\right)^n \Delta Q^{n+1}\right]
+ \delta_\zeta\left[H^n + \left(\frac{\partial H}{\partial Q}\right)^n \Delta Q^{n+1}\right] = 0    (3.62)

Introducing the split flux vectors produces the form

\frac{\Delta Q^{n+1}}{\Delta\tau} + \delta_\xi\left[\left(\frac{\partial F^+}{\partial Q}\right)^n \Delta Q^{n+1}\right] + \delta_\xi\left[\left(\frac{\partial F^-}{\partial Q}\right)^n \Delta Q^{n+1}\right]
+ \delta_\eta\left[\left(\frac{\partial G^+}{\partial Q}\right)^n \Delta Q^{n+1}\right] + \delta_\eta\left[\left(\frac{\partial G^-}{\partial Q}\right)^n \Delta Q^{n+1}\right]
+ \delta_\zeta\left[\left(\frac{\partial H^+}{\partial Q}\right)^n \Delta Q^{n+1}\right] + \delta_\zeta\left[\left(\frac{\partial H^-}{\partial Q}\right)^n \Delta Q^{n+1}\right]
= -\delta_\xi\left(F^+ + F^-\right)^n - \delta_\eta\left(G^+ + G^-\right)^n - \delta_\zeta\left(H^+ + H^-\right)^n    (3.63)

or

\frac{\Delta Q^{n+1}}{\Delta\tau} + \delta_\xi\left[(A^+)^n \Delta Q^{n+1} + (A^-)^n \Delta Q^{n+1}\right]
+ \delta_\eta\left[(B^+)^n \Delta Q^{n+1} + (B^-)^n \Delta Q^{n+1}\right]
+ \delta_\zeta\left[(C^+)^n \Delta Q^{n+1} + (C^-)^n \Delta Q^{n+1}\right] = -R^n    (3.64)

where

R^n = \delta_\xi\left[F^+ + F^-\right]^n + \delta_\eta\left[G^+ + G^-\right]^n + \delta_\zeta\left[H^+ + H^-\right]^n    (3.65)

It should be noted that the Jacobian matrices A^+, A^-, etc. are not the same as
the split flux Jacobian matrices that were presented in equation 3.56. Instead, the
notation A^+ is used here to represent \partial F^+/\partial Q, for example. This is required
to preserve the conservation form of the equations. The final form of these Jacobian
matrices, and the derivation thereof, can be found in Belk [44, appendix B].
In evaluating the split flux vectors at the cell faces according to the difference
operators defined in equation 3.61, dependent variables from cells upwind of the cell
face are used. For a first order spatial discretization, only the neighboring cell is used.

For second order accuracy, the dependent variables from the two neighboring cells are
extrapolated to the cell face. As an example, the (+) flux is evaluated using cells to
the left of the cell face

F^+_{i+1/2,j,k} = F^+\left(Q^+_{i+1/2,j,k}\right)    (3.66)

with

Q^+_{i+1/2,j,k} = Q_{i,j,k}    (3.67)

for a first order accurate scheme, and
Q^+_{i+1/2,j,k} = \frac{3}{2}Q_{i,j,k} - \frac{1}{2}Q_{i-1,j,k}    (3.68)

for a second order accurate scheme. Likewise, the (-) flux is evaluated using cells to
the right of the cell face

F^-_{i+1/2,j,k} = F^-\left(Q^-_{i+1/2,j,k}\right)    (3.69)

with

Q^-_{i+1/2,j,k} = Q_{i+1,j,k}    (3.70)

for a first order accurate scheme, and
Q^-_{i+1/2,j,k} = \frac{3}{2}Q_{i+1,j,k} - \frac{1}{2}Q_{i+2,j,k}    (3.71)

for a second order accurate scheme. The extrapolation of the conserved variables to
the cell face and their use to calculate the flux is often referred to as MUSCL extrap-
olation [49]. Alternatively, the primitive variables can be extrapolated and used to
calculate the flux, or the flux can be evaluated at the cell centers and extrapolated to
the cell face.
In the higher order schemes, flux limiters, applied to the flux, conserved variables,
or the primitive variables, are used to selectively reduce the scheme to first order to
avoid oscillations in the solution near discontinuities. The flux limiters available
include the minmod, van Leer, and van Albada limiters.
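A common way to combine the second order extrapolation of equation 3.68 with a minmod limiter is sketched below. This is one representative limited form, not necessarily the exact formula used in Beggar.

```python
def minmod(a, b):
    """Minmod limiter: zero at an extremum, otherwise the smaller slope."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b

def muscl_left_state(q_im1, q_i, q_ip1):
    """Limited left state at face i+1/2.  On smooth data this gives a
    second order extrapolation; near a discontinuity the limiter returns
    a zero slope and the scheme falls back to first order (eq. 3.67)."""
    slope = minmod(q_i - q_im1, q_ip1 - q_i)
    return q_i + 0.5 * slope

smooth = muscl_left_state(1.0, 2.0, 3.0)  # linear data: second order
shock = muscl_left_state(0.0, 0.0, 5.0)   # discontinuity: first order
```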

Flux Difference Splitting

Hirsch [46] describes upwind methods as methods in which physical properties of
the flow equations are introduced into the discretized equations. Flux vector splitting
introduces the direction of propagation of information through consideration of the
sign of the eigenvalues in the discretization. Another method that handles discontinu-
ities well is due to Godunov [50]. In Godunov's method, the conserved variables are
considered constant throughout each cell and a one-dimensional exact solution of the
Euler equations is computed at each cell boundary. The two constant states on either
side of a cell boundary define a Riemann (or shock tube) problem that can be solved
exactly. An integral average of the exact solutions to the Riemann problems at each
cell is taken to determine the solution at the next time step. Other methods have
replaced the computationally expensive exact solution of the Riemann problem with
an approximate Riemann solution. These methods, including the popular method
due to Roe [7], are often referred to as "flux difference splitting" methods.
Considering the quasi-linear form of the one-dimensional Euler equations shown
in equation 3.52, the elements of the Jacobian matrix A are not constant. Roe pro-
posed replacing this non-linear equation with the linear equation

\frac{\partial Q}{\partial t} + \tilde{A}\frac{\partial Q}{\partial x} = 0    (3.72)
where \tilde{A} is a constant matrix. This equation is solved at the interface between two cells
to determine the flux at the interface. The matrix \tilde{A} is chosen so that the solution
of this linear equation gives the correct flux difference for the non-linear Riemann
problem. The properties required of \tilde{A} are

i. It constitutes a linear mapping from Q to F

ii. \lim_{Q^L, Q^R \to Q} \tilde{A}(Q^L, Q^R) = A(Q) = \partial F/\partial Q

iii. F(Q^R) - F(Q^L) = \tilde{A}(Q^L, Q^R)\left(Q^R - Q^L\right)

iv. The eigenvectors of \tilde{A} are linearly independent

The superscripts (\cdot)^L and (\cdot)^R represent quantities on the left and right sides of the
cell interface. The matrix \tilde{A} for the approximate Riemann problem is constructed
from the flux Jacobian matrices where the primitive variables are replaced by the
Roe averaged variables

\tilde{\rho} = \sqrt{\rho^L \rho^R}

\tilde{u} = \frac{\sqrt{\rho^L}\,u^L + \sqrt{\rho^R}\,u^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}

\tilde{v} = \frac{\sqrt{\rho^L}\,v^L + \sqrt{\rho^R}\,v^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}

\tilde{w} = \frac{\sqrt{\rho^L}\,w^L + \sqrt{\rho^R}\,w^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}

\tilde{H} = \frac{\sqrt{\rho^L}\,H^L + \sqrt{\rho^R}\,H^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}    (3.73)
where H is the total enthalpy per unit mass, which is related to the total energy per
unit mass by the relationship

H = E + \frac{p}{\rho}    (3.74)
The solution of the approximate Riemann problem yields the following equation
for the flux at a cell interface

F_{i+1/2,j,k} = \frac{1}{2}\left(F_{i,j,k} + F_{i+1,j,k}\right) - \frac{1}{2}\left|\tilde{A}\right|_{i+1/2,j,k}\left(Q_{i+1,j,k} - Q_{i,j,k}\right)    (3.75)


\left|\tilde{A}\right| = \tilde{R}\left|\tilde{\Lambda}\right|\tilde{R}^{-1}    (3.76)

where the (~) notation is used to denote that the Roe averaged variables are used
in the evaluation. The assumption is made that the waves from the solution of
the one-dimensional Riemann problem move normal to the interface. For the three-
dimensional problem, the one-dimensional solution is repeated for the three directions.
For first order spatial accuracy, the primitive variables used in the Roe averaged

variables come from the cells neighboring the interface. For second order accuracy,
the values are extrapolated as shown in equation 3.71.
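The Roe averages of equation 3.73 are a square-root-of-density weighting of the left and right states. A one-dimensional sketch (the argument list is trimmed to density, velocity, and total enthalpy):

```python
import math

def roe_average(rho_l, u_l, h_l, rho_r, u_r, h_r):
    """Roe-averaged density, velocity, and total enthalpy (eq. 3.73)."""
    sl, sr = math.sqrt(rho_l), math.sqrt(rho_r)
    rho = sl * sr
    u = (sl * u_l + sr * u_r) / (sl + sr)
    h = (sl * h_l + sr * h_r) / (sl + sr)
    return rho, u, h

# The averaged velocity and enthalpy lie between the two states, and
# every average reduces to the common value when the states are equal.
rho, u, h = roe_average(1.0, 50.0, 3.0e5, 4.0, 50.0, 3.0e5)
```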

Newton Relaxation

Newton's method for a non-linear system of vector functions

\mathcal{F}(x) = 0    (3.77)

can be written as

\mathcal{F}'(x^m)\left(x^{m+1} - x^m\right) = -\mathcal{F}(x^m)    (3.78)

This defines an iterative procedure for which m is the iteration counter and \mathcal{F}'(x) is
the Jacobian matrix defined by

\mathcal{F}'(x) = \frac{\partial\mathcal{F}}{\partial x}    (3.79)

Following the presentation of Whitfield [51], the discretized governing equation
3.59 leads to the function

\mathcal{F}(Q^{n+1}) = \frac{Q^{n+1} - Q^n}{\Delta\tau} + \delta_\xi F^{n+1} + \delta_\eta G^{n+1} + \delta_\zeta H^{n+1} = \frac{Q^{n+1} - Q^n}{\Delta\tau} + R(Q^{n+1})    (3.80)

for which a solution is sought by Newton's method. Here, the vector Q contains the
dependent variables for every cell throughout the entire grid system. The Jacobian
matrix is defined by

\mathcal{F}'(Q^{n+1}) = \frac{I}{\Delta\tau} + \frac{\partial R(Q^{n+1})}{\partial Q}    (3.81)

which yields the iterative scheme

\left[\frac{I}{\Delta\tau_l} + \left(\frac{\partial R}{\partial Q}\right)^{n+1,m}\right]\Delta Q^{n+1,m+1} = -\frac{Q^{n+1,m} - Q^n}{\Delta\tau_{min}} - R(Q^{n+1,m})    (3.82)

where

\Delta Q^{n+1,m+1} = Q^{n+1,m+1} - Q^{n+1,m}, \quad Q^{n+1,0} = Q^n

n denotes the time level, m is the Newton iteration counter, \Delta\tau_l is the local time
step, and \Delta\tau_{min} is the minimum time step. Flux vector splitting is used on the

left-hand-side and flux difference splitting with Roe averaged variables is used on

the right-hand-side. Steger-Warming Jacobians, Roe analytical Jacobians, or Roe
numerical Jacobians can be used.

Each iteration of the Newton's method is solved using symmetric Gauss-Seidel

(SGS) iteration. The SGS iterations, or inner iterations, are performed on a grid by

grid basis; while the Newton iterations, or dt iterations, are used to achieve time accu-

racy and are performed on all grids in sequence. This procedure eliminates synchro-

nization errors at blocked and overset boundaries by iteratively bringing all dependent

variables up to the n+1 time level. The fixed time step, \Delta\tau_{min}, is used to maintain
time accuracy, and a local time step, \Delta\tau_l, is used for stability and convergence of the

Newton iterations. Steady state calculations do not use Newton iterations. The first

term on the right-hand-side of equation 3.82 becomes zero and local time stepping is

used during the inner iterations.

Explicit boundary conditions (BC) can be used or implicit BC's can be achieved

by updating the BC's during the SGS relaxation solution of Equation 3.82 [52]. An

under-relaxation factor is applied to the implicit BC update to improve stability.
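The nesting of the iterations can be sketched with a scalar stand-in problem, one unknown per "grid". This illustrates only the loop structure implied by equation 3.82 (Newton iterations sweeping all grids, with inner relaxation per grid); the callables and the trivial inner update are assumptions of the sketch.

```python
def newton_relaxation(grids, residual, jacobian, dt_iters=3, inner_iters=5):
    """Outer Newton ("dt") iterations sweep all grids in sequence; the
    correction for each grid is relaxed by inner iterations with the
    Jacobian frozen, mimicking the grid-by-grid SGS relaxation."""
    q = dict(grids)                        # q^{n+1,0} = q^n
    for _ in range(dt_iters):              # Newton / dt iterations
        for g in q:                        # all grids in sequence
            dq = 0.0
            for _ in range(inner_iters):   # inner relaxation iterations
                r = residual(g, q) + jacobian(g, q) * dq
                dq -= r / jacobian(g, q)
            q[g] += dq
    return q

# Toy problem: drive q**2 - 4 to zero on two "grids".
q = newton_relaxation({"grid1": 1.0, "grid2": 10.0},
                      residual=lambda g, q: q[g] ** 2 - 4.0,
                      jacobian=lambda g, q: 2.0 * q[g],
                      dt_iters=6)
```

Since the residual callable sees the whole dictionary, inter-grid coupling would enter exactly as the explicitly updated boundary values do in the real solver.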

Fixed-Point Iteration

A linear system of equations of the form

Ax = b    (3.83)

can be solved using the general fixed-point iteration scheme

x^{m+1} = x^m + C\left(b - Ax^m\right), \quad m = 1, 2, 3, \ldots    (3.84)

(see, for example, Conte and de Boor [53, pages 223-233]). This iteration function is
in quasi-Newton form. The function for which a zero is being sought is f(x) = Ax - b.
However, the matrix C is an approximate inverse of A rather than the inverse of the
derivative of f. This approximate inverse is defined by the requirement that

\left\|I - CA\right\| < 1    (3.85)

for some matrix norm.

The coefficient matrix A can be written as

A = L + D + U (3.86)

where L is the lower triangular elements of A, D is the diagonal elements of A, and
U is the upper triangular elements of A. If A is diagonally dominant, D^{-1} is an

approximate inverse of A and the iteration function

x^{m+1} = x^m + D^{-1}\left(b - Ax^m\right)    (3.87)

will converge. This is Jacobi iteration. It can be rewritten as

Dx^{m+1} = b - (L + U)x^m    (3.88)

or as

x_i^{m+1} = \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^m - \sum_{j=i+1}^{n} a_{ij}x_j^m\right), \quad i = 1, \ldots, n    (3.89)

to explicitly show how each element of x is updated. The distinguishing characteristic
of Jacobi iteration is that the iteration function only uses values of x from the previous
iteration.

Gauss-Seidel iteration comes from the choice of C = (L + D)^{-1}. This gives the
iteration function

x^{m+1} = x^m + (L + D)^{-1}\left(b - Ax^m\right)    (3.90)


This can be rewritten as

(L + D)x^{m+1} = b - Ux^m    (3.91)

or

Dx^{m+1} = b - Lx^{m+1} - Ux^m    (3.92)

To explicitly show how each element is computed, this is written as

x_i^{m+1} = \frac{1}{a_{ii}}\left(b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{m+1} - \sum_{j=i+1}^{n} a_{ij}x_j^m\right), \quad i = 1, \ldots, n    (3.93)

As each element of x is updated, the previous elements of x, which multiply the lower
triangular elements of A, have already been updated. Thus, for the summation that
represents -Lx, the elements of x are evaluated at iteration m+1. In other words,
when updating an element of x, the most up to date values of x are used.

If Gauss-Seidel iteration is guaranteed to converge, it will converge faster than
Jacobi iteration. It also has the side benefit that only one array is needed to store x
during the iterations.
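Equations 3.89 and 3.93 can be compared directly in a few lines. The sketch below applies both sweeps to a small diagonally dominant system:

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep (eq. 3.89): only old values of x are used."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep (eq. 3.93): updated values are used
    immediately, so a single array suffices in practice."""
    x = list(x)
    n = len(b)
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]  # diagonally dominant
b = [5.0, 6.0, 5.0]                                       # solution is (1, 1, 1)
x_j = [0.0, 0.0, 0.0]
x_gs = [0.0, 0.0, 0.0]
for _ in range(25):
    x_j = jacobi_step(A, b, x_j)
    x_gs = gauss_seidel_step(A, b, x_gs)
```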

Parallel Considerations

Following the analysis presented by Weeratunga and Chawla [37], the solution
algorithm could be written as a global system of linear equations

\begin{bmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,N} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,N} \\ \vdots & \vdots & & \vdots \\ A_{N,1} & A_{N,2} & \cdots & A_{N,N} \end{bmatrix}
\begin{Bmatrix} \Delta Q_1 \\ \Delta Q_2 \\ \vdots \\ \Delta Q_N \end{Bmatrix} =
\begin{Bmatrix} -F_1(Q_1, Q_2, \ldots, Q_N) \\ -F_2(Q_1, Q_2, \ldots, Q_N) \\ \vdots \\ -F_N(Q_1, Q_2, \ldots, Q_N) \end{Bmatrix}    (3.94)

The diagonal, block matrix elements, A_{i,i}, represent the coupling within grid i due to
the implicit time discretization. These elements are banded, sparse matrices defined
by the spatial discretization. The off-diagonal, block matrix elements, A_{i,j} (i \neq j),

represent the coupling between grids i and j due to block-to-block and/or overlap-

ping boundary conditions. The coupling between grids is dependent on the relative
positions of the grids. Thus, some of the off diagonal elements will be zero.
Together, these elements form a large, sparse matrix. This large system of linear
equations could be solved directly; however, this would not be efficient and does not

lend itself well to parallel computing. Instead, the off-diagonal terms are moved to

the right-hand-side. Thus, block-to-block and overlapped boundary conditions are

treated explicitly. This gives a decoupled set of equations of the form

\begin{bmatrix} A_{1,1} & 0 & \cdots & 0 \\ 0 & A_{2,2} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & A_{N,N} \end{bmatrix}
\begin{Bmatrix} \Delta Q_1 \\ \Delta Q_2 \\ \vdots \\ \Delta Q_N \end{Bmatrix} =
\begin{Bmatrix} -R_1 \\ -R_2 \\ \vdots \\ -R_N \end{Bmatrix}    (3.95)

where -R_i = -F_i(Q_1, Q_2, \ldots, Q_N) - \sum_{j \neq i} A_{i,j}\Delta Q_j. Each decoupled equation can be
solved using Gauss-Seidel iteration.
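The decoupling of equation 3.95 can be illustrated with scalar "blocks". The off-diagonal coupling is evaluated with the previous iterate and moved to the right-hand side, so each block solve is independent and could be assigned to a separate process; the scalar diagonal inversion below stands in for the per-grid Gauss-Seidel solve.

```python
def solve_decoupled(A_diag, A_off, F, n_outer=20):
    """Iterate equation 3.95 with scalar blocks: diagonal terms are
    inverted exactly, off-diagonal (inter-grid) terms are explicit."""
    N = len(F)
    dq = [0.0] * N
    for _ in range(n_outer):
        rhs = [-F[i] - sum(A_off[i][j] * dq[j] for j in range(N) if j != i)
               for i in range(N)]
        dq = [rhs[i] / A_diag[i] for i in range(N)]  # independent solves
    return dq

# Two coupled "grids": the iterate converges to the coupled solution of
# [[4, 1], [1, 4]] dq = [-5, -6], i.e. dq = (-14/15, -19/15).
dq = solve_decoupled([4.0, 4.0], [[0.0, 1.0], [1.0, 0.0]], [5.0, 6.0])
```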


In order to solve store separation problems, we must be able to simulate the
general motion of bodies under the influence of aerodynamic, gravitational, and ex-
ternally applied loads. We will ignore structural bending; therefore, we can limit

ourselves to rigid body motion. This chapter presents the basis for the six degrees of

freedom (6DOF) motion simulation routines in Beggar that were written by Belk [4].
This is similar to the method presented by Meakin [54]. The equations of motion,

the coordinate systems used, and the techniques used to integrate the equations of
motion are presented.

Equations of Motion

The unconstrained motion of a rigid body is modeled by Newton's second law

of motion

F=ma (4.1)

where F is the total force acting on the body, m is the mass of the body, and a is the

resulting acceleration of the body. This can be written as the conservation of linear

and angular momentum

\mathbf{F} = \dot{\mathbf{L}}    (4.2)

\mathbf{M} = \dot{\mathbf{H}}    (4.3)

where L = mV is the linear momentum, H = Iw is the angular momentum, and M
is the total applied moments about the body center of gravity (CG). The dot notation


represents the derivative with respect to time, V is the translational velocity vector,
ω is the rotational velocity vector, and I is the rotational inertia tensor

I = \begin{bmatrix} I_{xx} & -I_{xy} & -I_{xz} \\ -I_{xy} & I_{yy} & -I_{yz} \\ -I_{xz} & -I_{yz} & I_{zz} \end{bmatrix} \qquad (4.4)

constructed from the moments (I_{xx}, I_{yy}, I_{zz}) and products (I_{xy}, I_{xz}, I_{yz}) of inertia of
the body. The six degrees of freedom of the motion are represented by the transla-
tional position of the CG and the rotation of the body about its CG.
Equations 4.2 and 4.3 can only be applied in an inertial reference frame (IRF);
therefore, the derivatives of the linear and angular momentum must be taken with
respect to an IRF. However, the body moments of inertia and products of inertia will
vary with time (due to body motion) if they are defined relative to a fixed, global
coordinate system. Thus, it is easier to use a non-inertial, local coordinate system
that is fixed relative to the body, so that the moments and products of inertia will
remain constant.
In order to apply equations 4.2 and 4.3 in a moving, local coordinate system,
we need to relate the derivatives with respect to this non-inertial reference frame to
derivatives with respect to an IRF. This relationship is defined by the equation

\dot{a}\big|_{XYZ} = \dot{a}\big|_{xyz} + \omega \times a \qquad (4.5)

for any vector a defined in a coordinate system xyz that is rotating with angular velocity ω relative
to an IRF XYZ. Applying this relationship to L and assuming that the mass m is
constant, equation 4.2 becomes

\frac{F}{m} = \dot{V}\big|_{xyz} + \omega \times V \qquad (4.6)

\dot{V}\big|_{xyz} = \frac{F}{m} - \omega \times V \qquad (4.7)

Applying 4.5 to H, equation 4.3 becomes

M = \dot{H}\big|_{xyz} + \omega \times H \qquad (4.8)


\dot{\omega}\big|_{xyz} = I^{-1}M - I^{-1}\left(\omega \times I\omega\right) \qquad (4.9)

Equations 4.7 and 4.9 are the equations of motion written with respect to the

local coordinate system (see Etkin [55] for a more complete derivation of the equations
of motion). These equations can be integrated twice with respect to time to get a
time history of the translational and rotational position of the rigid body. However,
since the equations of motion are written with respect to the local coordinate system,

the change in position coming from the integration of the equations of motion is of
little use for tracking the body motion, since the local coordinate system is always
changing. Instead, the changes in body position must be transformed to the global

coordinate system so that the position and orientation of the body relative to the
global coordinate system can be maintained.
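A minimal sketch of evaluating the right-hand sides of equations 4.7 and 4.9 (Python/NumPy; the function name and argument layout are illustrative, not Beggar's actual routines):

```python
import numpy as np

def eom_rates(F_body, M_body, mass, inertia, V, omega):
    """Right-hand sides of the rigid-body equations of motion in the
    local (body) frame.

    F_body, M_body : applied force and moment in local axes
    inertia        : 3x3 rotational inertia tensor (equation 4.4)
    V, omega       : translational and rotational velocity in local axes
    Returns (V_dot, omega_dot)."""
    V_dot = F_body / mass - np.cross(omega, V)                        # eq. 4.7
    H = inertia @ omega                                               # angular momentum
    omega_dot = np.linalg.solve(inertia, M_body - np.cross(omega, H)) # eq. 4.9
    return V_dot, omega_dot
```

For a torque-free body spinning about a principal axis, omega_dot vanishes, while the ω × V term still couples the translational components.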

Coordinate Transformations

The local coordinate system is represented by the lower case letters xyz, while the
global coordinate system is represented by the upper case letters XYZ, as shown in

figure 4.1. The origin of the local coordinate system is placed at the CG of the body,
the +x axis extends forward along the longitudinal body axis, the +y axis extends
laterally along what would be an aircraft's right wing (from the pilot's perspective),

and the +z axis extends downward in the direction defined by the right-hand rule.
The rotation of the local coordinate system relative to the global coordinate
system can be represented by the three Euler angles of yaw (ψ), pitch (θ), and roll

(φ). As shown in figure 4.1, the local coordinate axes, which are initially aligned with
the global coordinate axes, are first rotated by ψ about the Z axis to produce the x'y'Z


Figure 4.1: Transformation from global to local coordinates

axes. These axes are then rotated by θ about the y' axis to produce the xy'z'' axes.

These axes are then rotated by φ about the x axis to produce the local coordinate

axes xyz (see Blakelock [56] for another description of the coordinate systems). These

transformations are written in matrix form as

\begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{Bmatrix} x' \\ y' \\ Z \end{Bmatrix} = \begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix} \qquad (4.10)

\begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix} \begin{Bmatrix} x \\ y' \\ z'' \end{Bmatrix} = \begin{Bmatrix} x' \\ y' \\ Z \end{Bmatrix} \qquad (4.11)

\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi & \cos\phi \end{bmatrix} \begin{Bmatrix} x \\ y \\ z \end{Bmatrix} = \begin{Bmatrix} x \\ y' \\ z'' \end{Bmatrix} \qquad (4.12)

With the notation [R_x(\phi)] representing a rotational transformation matrix constructed

for rotation about the x axis by an angle φ, the complete transformation from local

coordinates to global coordinates can be written as

[R_z(\psi)]\,[R_y(\theta)]\,[R_x(\phi)] \begin{Bmatrix} x \\ y \\ z \end{Bmatrix} = \begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix} \qquad (4.13)

\begin{bmatrix} \cos\psi\cos\theta & \cos\psi\sin\theta\sin\phi - \sin\psi\cos\phi & \cos\psi\sin\theta\cos\phi + \sin\psi\sin\phi \\ \sin\psi\cos\theta & \sin\psi\sin\theta\sin\phi + \cos\psi\cos\phi & \sin\psi\sin\theta\cos\phi - \cos\psi\sin\phi \\ -\sin\theta & \cos\theta\sin\phi & \cos\theta\cos\phi \end{bmatrix} \begin{Bmatrix} x \\ y \\ z \end{Bmatrix} = \begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix} \qquad (4.14)

Since the rotational transformations are orthonormal, the inverse transform is equal
to the transpose. Thus, the complete transformation from global coordinates to local
coordinates can be written as

[R_x(\phi)]^T\,[R_y(\theta)]^T\,[R_z(\psi)]^T \begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix} = \begin{Bmatrix} x \\ y \\ z \end{Bmatrix} \qquad (4.15)

which is equivalent to the transpose of the matrix shown in equation 4.14.
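The composite rotation of equations 4.13-4.14 can be checked numerically; the sketch below (an illustrative helper, not Beggar's code) builds the local-to-global matrix and verifies that the inverse is the transpose:

```python
import numpy as np

def local_to_global(psi, theta, phi):
    """Build [R_z(psi)][R_y(theta)][R_x(phi)], the local-to-global
    transformation of equations 4.13-4.14 (yaw, then pitch, then roll)."""
    c, s = np.cos, np.sin
    Rz = np.array([[c(psi), -s(psi), 0], [s(psi), c(psi), 0], [0, 0, 1]])
    Ry = np.array([[c(theta), 0, s(theta)], [0, 1, 0], [-s(theta), 0, c(theta)]])
    Rx = np.array([[1, 0, 0], [0, c(phi), -s(phi)], [0, s(phi), c(phi)]])
    return Rz @ Ry @ Rx

# Orthonormality: the global-to-local transform is simply the transpose.
R = local_to_global(0.3, -0.2, 0.5)
assert np.allclose(R.T @ R, np.eye(3))
```

The bottom row of the matrix reproduces the (-sin θ, cos θ sin φ, cos θ cos φ) terms of equation 4.14.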
If the Euler angles ψ, θ, φ are used to track the angular position of the body
relative to the global coordinate system, a relationship is required to convert the
rotational velocity vector ω in local coordinates (calculated from the integration of
equation 4.9) to the rate of change of the Euler angles. However, the Euler angles are
not applied about the global coordinate system axes; therefore, the transformation
from local to global coordinates can not be used for this purpose. Referring back
to figure 4.1, ψ is applied about the Z axis, θ is applied about the y' axis, and φ is
applied about the x axis. Therefore, the rotational velocity vector can be decomposed
as
\omega = p\,\hat{e}_x + q\,\hat{e}_y + r\,\hat{e}_z \qquad (4.16)

or

\omega = \dot{\psi}\,\hat{e}_Z + \dot{\theta}\,\hat{e}_{y'} + \dot{\phi}\,\hat{e}_x \qquad (4.17)

Decomposing the unit vectors \hat{e}_{y'} and \hat{e}_Z into the xyz coordinate system yields

\hat{e}_{y'} = \cos\phi\,\hat{e}_y - \sin\phi\,\hat{e}_z \qquad (4.18)

\hat{e}_Z = -\sin\theta\,\hat{e}_x + \cos\theta\sin\phi\,\hat{e}_y + \cos\theta\cos\phi\,\hat{e}_z \qquad (4.19)

as can be seen from the transformation matrices in equations 4.12 and 4.14. Com-

bining equations 4.16-4.19 yields the relationships

p = \dot{\phi} - \dot{\psi}\sin\theta

q = \dot{\psi}\cos\theta\sin\phi + \dot{\theta}\cos\phi

r = \dot{\psi}\cos\theta\cos\phi - \dot{\theta}\sin\phi \qquad (4.20)

which can be inverted to give

\dot{\phi} = p + q\tan\theta\sin\phi + r\tan\theta\cos\phi

\dot{\theta} = q\cos\phi - r\sin\phi

\dot{\psi} = (q\sin\phi + r\cos\phi)/\cos\theta \qquad (4.21)

As θ → π/2, cos θ → 0 and tan θ → ∞; therefore, \dot{\phi} → ∞ and \dot{\psi} → ∞. This singularity
is called "gimbal lock" [57].
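The singular behavior can be seen directly from equation 4.21; the following sketch (illustrative function name) evaluates the Euler angle rates for a pure body-axis yaw rate r as θ approaches π/2:

```python
import numpy as np

def euler_rates(p, q, r, theta, phi):
    """Euler angle rates from body rates per equation 4.21."""
    phi_dot = p + q * np.tan(theta) * np.sin(phi) + r * np.tan(theta) * np.cos(phi)
    theta_dot = q * np.cos(phi) - r * np.sin(phi)
    psi_dot = (q * np.sin(phi) + r * np.cos(phi)) / np.cos(theta)
    return psi_dot, theta_dot, phi_dot

# As theta approaches pi/2 the psi and phi rates blow up (gimbal lock).
for theta in (1.0, 1.5, 1.57):
    print(theta, euler_rates(0.0, 0.0, 1.0, theta, 0.0))
```

With p = q = 0 and r = 1, the yaw rate is simply 1/cos θ, which grows without bound near θ = π/2.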


Quaternions were developed by Hamilton [58] as an extension to complex num-
bers and have been used in 6DOF simulations [59] because their use avoids the gimbal
lock problem. They have properties similar to both complex numbers and vectors.


Like a complex number, which has a real part and an imaginary part, a quaternion
has a scalar part and a vector part and is often written as

Q = e_0 + e_1 i + e_2 j + e_3 k \qquad (4.22)

where i, j, and k are unit vectors in the three Cartesian coordinate directions.
The multiplication of two quaternions requires the additional rules of quaternion
algebra

i^2 = j^2 = k^2 = -1

ij = -ji = k

jk = -kj = i

ki = -ik = j (4.23)

which are similar to the rules of complex math and vector cross products. The
multiplication of two quaternions is simplified if we rewrite equation 4.22 as

Q = Q_0 + \mathbf{Q} \qquad (4.24)

which emphasizes the scalar and vector components. Following the distributive prop-
erty, the multiplication of two quaternions is

PQ = (P_0 + \mathbf{P})(Q_0 + \mathbf{Q})

\phantom{PQ} = P_0 Q_0 + P_0\mathbf{Q} + Q_0\mathbf{P} + \mathbf{P} \otimes \mathbf{Q} \qquad (4.25)

The ⊗ operator can be shown to be equivalent to

\mathbf{P} \otimes \mathbf{Q} = \mathbf{P} \times \mathbf{Q} - \mathbf{P} \cdot \mathbf{Q} \qquad (4.26)

Similar to complex arithmetic, the conjugate of a quaternion is defined as

Q^* = Q_0 - \mathbf{Q} \qquad (4.27)

The product of a quaternion and its conjugate is thus

QQ^* = Q^*Q = e_0^2 + e_1^2 + e_2^2 + e_3^2 = |Q|^2 \qquad (4.28)


or the square of the magnitude of the quaternion. A unit quaternion is a quaternion
of unit magnitude.
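The quaternion product of equations 4.25-4.26 and the conjugate of equation 4.27 can be sketched as follows (an illustrative Python implementation storing a quaternion as a scalar-plus-vector 4-array; not Beggar's code):

```python
import numpy as np

def qmul(P, Q):
    """Quaternion product per equations 4.25-4.26:
    PQ = P0*Q0 - P.Q + P0*Q + Q0*P + P x Q."""
    P0, Pv = P[0], np.asarray(P[1:])
    Q0, Qv = Q[0], np.asarray(Q[1:])
    s = P0 * Q0 - Pv @ Qv                       # scalar part
    v = P0 * Qv + Q0 * Pv + np.cross(Pv, Qv)    # vector part
    return np.concatenate(([s], v))

def qconj(Q):
    """Conjugate per equation 4.27: negate the vector part."""
    return np.concatenate(([Q[0]], -np.asarray(Q[1:])))
```

The unit rules of equation 4.23 (ij = k, ji = -k, etc.) and the norm identity of equation 4.28 fall out directly from these two functions.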
For the unit quaternion of the form

Q = \cos(\alpha/2) + \mathbf{A}\sin(\alpha/2) \qquad (4.29)

the transformation

QVQ^* = V' \qquad (4.30)

rotates the vector V about the axis defined by the unit vector A by an angle α to
produce the vector V'. Since this is a unit quaternion, Q* is the inverse of Q. Thus
the inverse transformation

Q^*V'Q = V \qquad (4.31)

rotates the vector V' about the axis defined by A by an angle of -α to recover V.
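The rotation of equations 4.29-4.30 can be sketched as follows (an illustrative Python implementation; the helper names are hypothetical):

```python
import numpy as np

def qmul(P, Q):
    """Quaternion product (scalar part first, per equations 4.25-4.26)."""
    P0, Pv = P[0], np.asarray(P[1:])
    Q0, Qv = Q[0], np.asarray(Q[1:])
    return np.concatenate(([P0 * Q0 - Pv @ Qv],
                           P0 * Qv + Q0 * Pv + np.cross(Pv, Qv)))

def rotate(V, axis, alpha):
    """Rotate vector V about unit axis A by angle alpha via V' = Q V Q*
    (equations 4.29-4.30)."""
    A = np.asarray(axis, dtype=float)
    A = A / np.linalg.norm(A)
    Q = np.concatenate(([np.cos(alpha / 2)], np.sin(alpha / 2) * A))
    Qc = np.concatenate(([Q[0]], -Q[1:]))          # conjugate = inverse
    V4 = np.concatenate(([0.0], V))                # embed V as a pure quaternion
    return qmul(qmul(Q, V4), Qc)[1:]

# Rotating the x unit vector by 90 degrees about z gives (approximately) y.
print(rotate(np.array([1.0, 0.0, 0.0]), [0, 0, 1], np.pi / 2))
```

Vectors that lie along the rotation axis are, as expected, left unchanged by the transformation.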
If the unit vector A is defined to be equivalent to \hat{e}_x of our local coordinate
system, α is equivalent to the roll angle and the rotational position of the body can
be represented by the quaternion

Q = \cos(\phi/2) + \hat{e}_x\sin(\phi/2)

\phantom{Q} = \cos(\phi/2) + \left[\cos\psi\cos\theta\,i + \sin\psi\cos\theta\,j - \sin\theta\,k\right]\sin(\phi/2) \qquad (4.32)

where i, j, k represent the three Cartesian coordinate directions \hat{e}_X, \hat{e}_Y, \hat{e}_Z of the IRF.
Then equation 4.30 represents the transformation from local coordinates to global
coordinates and equation 4.31 represents the transformation from global coordinates
to local coordinates. Equation 4.32 gives the relationship between the Euler angles
and the components of the quaternion. Alternatively, the transformation in equation
4.31 can be compared to a general transformation matrix to find the relationship
between the components of the quaternion and the elements of the transformation
matrix.


The only other relationship needed in order to use quaternions to track rigid
body motion is the knowledge of how to update the quaternion. Without going
through a derivation, the following derivatives of the scalar and vector components
of a quaternion were presented in Katz [60]

\dot{Q}_0 = -\frac{1}{2}\,\omega \cdot \mathbf{Q} \qquad (4.33)

\dot{\mathbf{Q}} = \frac{1}{2}\,\omega Q_0 + \frac{1}{2}\,\omega \times \mathbf{Q} \qquad (4.34)

These equations are integrated with respect to time along with the equations of
motion. The quaternion must remain of unit magnitude to ensure that the transformation
represents pure rotation with no scaling or shearing. Therefore, the quaternion needs
to be normalized during the integration.
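A sketch of one orientation update using equations 4.33-4.34 with renormalization (a simple explicit Euler step is used here purely for illustration; the actual integration uses the Runge-Kutta scheme of the next section):

```python
import numpy as np

def quat_step(Q, omega, dt):
    """Advance the orientation quaternion by the rates of equations
    4.33-4.34 and renormalize so it remains of unit magnitude."""
    Q0, Qv = Q[0], Q[1:]
    Q0_dot = -0.5 * (omega @ Qv)                            # eq. 4.33
    Qv_dot = 0.5 * omega * Q0 + 0.5 * np.cross(omega, Qv)   # eq. 4.34
    Qn = Q + dt * np.concatenate(([Q0_dot], Qv_dot))
    return Qn / np.linalg.norm(Qn)                          # renormalize
```

Integrating a constant spin ω about the z axis recovers Q = (cos(ωt/2), 0, 0, sin(ωt/2)), the unit quaternion of equation 4.29.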

Numerical Integration

A fourth order Runge-Kutta scheme is used to integrate the equations of mo-
tion. Runge-Kutta schemes are an attractive option for solving initial value problems
governed by first order differential equations of the form

y' = f(x, y), \qquad y(x_0) = y_0 \qquad (4.35)

because they can achieve higher order accuracy without the evaluation of higher order
derivatives. Conte and de Boor [53, pages 362-365] define the fourth order Runge-
Kutta scheme as

y_{n+1} = y_n + \frac{1}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right) \qquad (4.36)


k_1 = h\,f(x_n, y_n)

k_2 = h\,f(x_n + \tfrac{h}{2},\; y_n + \tfrac{k_1}{2})

k_3 = h\,f(x_n + \tfrac{h}{2},\; y_n + \tfrac{k_2}{2})

k_4 = h\,f(x_n + h,\; y_n + k_3)

The integration time step is represented by h, x represents the time, y represents the

position, velocity, and quaternion, f(x, y) represents the derivative of y (right hand

side of the equations of motion), and the subscripts n and n + 1 are used to denote

quantities at the current and next iteration (or time step), respectively.
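The scheme of equation 4.36 can be sketched as a generic single-step integrator (illustrative Python; tested here on y' = y, whose exact solution is e^x):

```python
def rk4_step(f, x, y, h):
    """One step of the fourth-order Runge-Kutta scheme of equation 4.36."""
    k1 = h * f(x, y)
    k2 = h * f(x + h / 2, y + k1 / 2)
    k3 = h * f(x + h / 2, y + k2 / 2)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

# y' = y with y(0) = 1 has the exact solution e^x.
y, x, h = 1.0, 0.0, 0.1
for _ in range(10):
    y = rk4_step(lambda t, u: u, x, y, h)
    x += h
print(y)   # close to e = 2.71828...
```

Since y stands for the position, velocity, and quaternion stacked together, the same step applies unchanged when y is a vector.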

The aerodynamics solution comes into the equations of motion through the in-

tegrated forces and moments. Since the aerodynamics are a function of position, the

use of four different positions in the evaluation of f(x, y) in equation 4.36 requires

the calculation of the flow solution four times for each integration of the 6DOF. How-

ever, this would be very expensive. Therefore, the aerodynamics are assumed to be

constant over the complete time step and are evaluated only once.

Since the translational equation of motion is written relative to the local coordi-

nate system, the integrated aerodynamic forces (and moments) will be independent

of position. However, the gravitational force, which is constant in global coordinates,

is not constant in local coordinates. Thus, care should be taken when decomposing
the gravitational force into local coordinates with each step of the Runge-Kutta
integration.



Computing power has increased many orders of magnitude over the last decade.
This trend is expected to continue in the near future. However, the shift appears

to be away from sequential processing and towards parallel processing. This chapter

presents an overview of parallel computing hardware, the options and some consider-

ations for programming parallel computers, some methods for judging and improving

parallel performance, and the proposed approach taken in this work.

Hardware Overview

The performance gains being achieved with single processors are diminishing
as these processors approach physical limitations such as the speed of light. With this in

mind, VLSI design principles have been used to conclude that it is possible to in-

crease potential computing power more cost effectively by utilizing multiple, slower,
less expensive components rather than a single, faster, more costly component [61].

Therefore, the trend in computing hardware is towards multiple processors. Machines

that utilize high performance vector processors are shipping with modest numbers of

vector processors, and massively parallel processors (MPP) are utilizing existing, low

cost RISC processors (for example, the Cray T3D which uses DEC Alpha processors
or the IBM SP2 which uses the IBM RS6000 processors) in ever increasing numbers
to achieve greater potential processing power.

Another trend that is affecting the way computing is being done is the increase
in network transfer rates. This allows physically separated resources to be utilized
for solving a single problem. Since many MPP's utilize the same processors found in

high end workstations, a group of workstations connected by a high speed network can

be viewed as a distributed parallel computer with the main differences being in the

speed of the inter-processor connections and the possible differences in processors and

operating systems. This type of computing is often referred to as "cycle harvesting."

Networked computers that are used for routine computing during business hours

often sit idle at night, and these unused computing cycles can be

harvested for scientific computing.

A relatively new form of parallel computing takes the use of commercially avail-

able off-the-shelf components to the extreme. Personal computers, based on Intel or

compatible microprocessors, running a freely available UNIX clone operating system

such as LINUX, are linked together using low cost ethernet networking. Such parallel

computers are often referred to as Beowulf clusters [62]. Such a distributed computing

environment can represent a sizeable computational resource with very low associated
cost.


Parallel computer architectures are often classified according to the number of

instructions that can be executed in parallel, as well as the amount of data that can be

operated on in parallel. The most common of these classifications range from multiple

instruction, multiple data or MIMD computers to single instruction, multiple data or

SIMD computers. SIMD systems offer reduced program complexity, but can greatly

reduce the set of algorithms that can be implemented on such an architecture.

Parallel computers are often further classified according to their memory layout as

distributed memory, in which case each processor has its own local memory, or as

shared memory, for which each processor has direct access to a single, global memory

address space. Most of the machines being produced today are of the MIMD type.

The Cray T3D and IBM SP2 are distributed memory MIMD machines, while the SGI

Onyx is a shared memory MIMD machine.

The SGI Origin 2000 represents a unique memory architecture referred to as

CC-NUMA (cache-coherent, nonuniform memory access). It is made of multiple

node cards that contain two processors and local, shared memory. However, any

processor on any node card can access any of the memory in the machine. This
hybrid organization of memory is called distributed, shared memory (DSM). There is a

latency associated with accessing memory located off of a node card; therefore, access
times to memory are nonuniform. However, hardware is used to maintain coherency
of data held in cache between different processors. This architecture has been shown
to perform well for many irregular applications and scales well to moderate numbers

of processors [63].

Software Overview

Logically, parallel computers can be viewed as a set of sequential processors,
each with its own memory, inter-connected by some communication links [61]. Each

processor executes a sequential set of instructions and communicates with other pro-

cessors and accesses remote memory through the communication links. Distributed
memory and shared memory systems, as well as distributed computing environments,
fit this model. The processors of a shared memory system simply have a more effi-
cient way of accessing remote memory than do the processors of a distributed memory
system or a distributed computing environment. This model of a parallel computer

and the use of messages for all communication between processors forms the basis of
the message passing paradigm of parallel programming.

Due to the model used for the parallel computer, it is conceivable that the user
could write, compile, and execute a different program on each processor, with each
program communicating with the others via messages. It is more often the case
that the same source is compiled and executed on each processor, with control flow
statements in the code used to determine the path executed or the data manipulated
at run time. This programming model is referred to as single program, multiple data
or SPMD. The SPMD model of programming aids in code maintenance and provides
a simplified path for converting an existing sequential code for parallel execution.

Many libraries exist for implementing message passing. Two of the more pre-

dominant libraries are PVM [64] and MPI [65]. PVM, which stands for parallel virtual

machine, is a de facto standard message passing interface due to its popularity and

widespread use. It is the product of Oak Ridge National Lab and several university

contributions. PVM consists of two parts: a library consisting of the functions that

implement the application programming interface (API) and a daemon which runs

in the background and actually handles the communication between processes. MPI,

which stands for Message Passing Interface, is a proposed standard message passing

interface. It was developed out of a series of meetings of a committee of experts from

the parallel computing community. MPI draws features from other message passing

libraries and provides a common API that the vendors can optimize for their ma-

chines. PVM evolved out of a research project on distributed computing and places a

higher priority on portability than on performance. MPI is expected to provide bet-

ter performance on large MPP's but does not provide for heterogeneous distributed

computing and lacks many task management functions [66].

Other models are available for parallel programming. One of the more popular

is the shared memory programming model. Pthreads [67] is a POSIX standard imple-

mentation for shared memory programming using threads. A thread is a light weight

process that shares memory with other threads, but has its own program counter,

registers, and stack so that each thread can execute a different part of a code. The

sharing of memory between threads is automatic and communication between threads
is accomplished through cooperative use of shared variables. Mutual exclusion or mu-

tex variables are used to ensure that only one thread changes the value of a variable

at a time. Signals are sent between threads using condition variables. OpenMP [68]

is an alternative library that attempts to avoid the low level programming constructs
required by Pthreads. OpenMP is used to identify loops that can be executed in par-
allel similar to vectorization of loops on vector processors. OpenMP automatically
handles all communication.


These techniques all require shared memory and thus can not be used on dis-

tributed memory, parallel computers. However, Pthreads or OpenMP can be mixed

with PVM or MPI to take advantage of both programming models when clusters of

shared memory multi-processor (SMP) machines are linked together. Likewise, other

techniques for using shared memory can be mixed with the message passing model.

POSIX also defines a standard for specifying the use of shared memory explicitly [69]

as opposed to the automatic use of shared memory as with Pthreads.

When approaching a parallel programming task, the key issues to be addressed
are concurrency, scalability, locality, and modularity [61]. Concurrency relates to the
need for algorithms which subdivide larger problems into a set of smaller tasks that

can be executed concurrently. An intimate knowledge of the data structures and data

dependencies in an algorithm is required to identify such concurrencies. Scalability

relates to the behavior of an algorithm in terms of parallel efficiency or speedup as a

function of processor count. Since the number of processors being utilized in MPP's

appears to be continually increasing, the efficiency of a good parallel program design
should scale with increased processor counts to remain effective throughout its life

cycle. Locality relates to the desire to enhance local memory utilization since access

to local memory is less expensive than access to remote memory. Raw communication
speeds are typically orders of magnitude slower than floating-point operations; thus,

communication performance strongly influences the parallel run time. Modularity is

important in all software development. It allows objects to be manipulated without

regard for their internal structure. It reduces code complexity and promotes code

reuse, extensibility, and portability.
The algorithm design process can be broken down into four phases: partition-
ing, communication, agglomeration, and mapping [61]. Machine independent issues,
such as concurrency, are considered early in the design process, while machine specific

issues are delayed until late in the design. Partitioning and communication address
the issues of concurrency and scalability, while agglomeration and mapping address


locality and performance. Partitioning falls into two major categories: functional de-

composition and data decomposition. Functional decomposition focuses on the com-

putation, while data decomposition focuses on the data. A good partition will divide

both the data and the computation. The communication phase of a design deals with

identifying the inter-process communication requirements. This is complicated when

the communication patterns are global, unstructured, dynamic, and/or asynchronous.

Agglomeration seeks to reduce communication costs by increasing computation and

communication granularity. Tasks can be combined and data and/or computation

can be duplicated across processors in order to reduce communication. The mapping

phase is a machine specific problem of specifying where each task will execute. A

mapping solution is highly dependent on the communication structure and the work

load distribution. A load balancing algorithm is often needed. If the communication

structure is dynamic, tradeoffs must be made between a load imbalance and repeated

application of a possibly expensive load balancing algorithm.

A good algorithm design must optimize a problem-specific function of execution

time, memory requirements, implementation costs, and maintenance costs, etc. [61].

Furthermore, when solving coupled systems of partial differential equations, issues

unique to the problem must be considered. For example, on a distributed memory

machine, a minimum number of processors may be required in order to hold a specific
problem; however, the use of additional processors must be balanced against its effect

on the solution convergence [70]. Likewise, since communication cost is proportional

to surface area and computational cost is proportional to volume, the desire for a

high ratio of volume to surface area places a lower limit on the subdivision of the

computational domain. Communication through messages has an associated cost of

the latency time for message startup and a cost per word of data transferred in the
message; therefore, it is generally desirable to use a small number of larger messages

rather than a large number of small messages. However, the use of small messages
may allow an algorithm change that would allow communications to be overlapped

by computation. An efficient parallel implementation will require the consideration
of all such factors.


Performance of a parallel algorithm is normally measured via speedup. This
is the ratio of the execution time on a single processor and the execution time on

multiple processors. Thus, the speedup s can be computed by

s = \frac{T_1}{T_n} \qquad (5.1)

where T1 denotes the execution time on a single processor and Tn denotes the exe-

cution time on n processors. Ideally, T1 should represent the execution time of the

best sequential algorithm available to do the job. When parallelizing a sequential

algorithm, the best sequential algorithm may not parallelize well and, vice versa, the
best parallel algorithm may not perform well sequentially. Likewise, when paralleliz-

ing a given sequential algorithm, some overhead will be introduced. If the parallel

algorithm is executed on a single processor to measure T1, this value may be arti-
ficially high due to the use of a poor sequential algorithm or due to the existence
of parallelization overhead. However, the definition of the best sequential algorithm

may be unattainable. Thus, there exists some ambiguity in how T1 should be mea-

sured in order to judge the performance of a parallel algorithm. At the least, when
converting an existing sequential algorithm for execution in parallel, T1 should be

measured using the original sequential algorithm. Likewise, if any algorithm changes

are made during parallelization that would also decrease the sequential execution
time, T1 should be remeasured so as to isolate improvements due to the algorithm
change from improvements due to the use of multiple processors.

One source of overhead, that exists in all parallel programs, is time spent in
communication between multiple processors. Following the analysis presented by
Roose and Van Driessche [71], the total execution time of a parallel algorithm executed

on n processors can be approximated as

Tn = Tcalc + Tcomm (5.2)

where Tcalc denotes the actual computation time and Tcomm denotes the time spent

in communication due to parallelization. If the work is perfectly balanced and there

is no time spent in communication during a sequential run, then the execution time

of the sequential run will be

T_1 = n\,T_{calc} \qquad (5.3)

Hence, the speedup would be

s = \frac{n\,T_{calc}}{T_{calc} + T_{comm}} = \frac{n}{1 + T_{comm}/T_{calc}} \qquad (5.4)

Thus, the ratio of the communication time and the computation time can have a large

effect on the speedup.
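Equation 5.4 can be evaluated directly; the sketch below (hypothetical function name) shows how the communication-to-computation ratio erodes the ideal speedup of n:

```python
def speedup(n, comm_to_calc):
    """Speedup of equation 5.4: perfectly balanced work, with communication
    overhead expressed as the ratio T_comm / T_calc."""
    return n / (1.0 + comm_to_calc)

# The same processor count yields very different speedups as the
# communication-to-computation ratio grows.
for ratio in (0.0, 0.1, 0.5, 1.0):
    print(ratio, speedup(16, ratio))
```

With a ratio of 1.0 (communication as expensive as computation), half the ideal speedup is lost.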

In general, for CFD flow solvers, the communication time is proportional to the

area of (number of grid points on) the boundaries of the domain, and the computation

time is proportional to the volume of (total number of grid points in) the domain.

Thus, as the problem size increases, the ratio of communication to computation de-

creases. The characteristics of a particular computer, the form of the communication,

the algorithm used, and the partitioning of the domain can also affect this ratio.

In general, a parallel computer with n processors can execute n instructions at

the same time. Thus, if the instructions in a sequential algorithm could be evenly

divided among the n processors, so that each processor executed 1/nth of the total

instructions, the execution time would be decreased by a factor of n. Therefore, linear

speedup is the ideal case, and speedup is limited to s < n. However, there are other

factors that place additional limits on the speedup that can be achieved.

If we consider the entire work load of a complete simulation to be broken down

into a part that can be executed in parallel and a part that must be executed serially,

the speedup that can be achieved is limited by Amdahl's law [72]:

s \le \frac{1}{f_s + f_p/n} \qquad (5.5)

where f_s is the serial fraction of the work, f_p is the parallel fraction of the work, and
n is the number of processors on which the parallel portion of the code is running.
The factors f_s and f_p are fractions so that

0 \le f_s \le 1

0 \le f_p \le 1

f_s + f_p = 1 \qquad (5.6)

Since the parallel work will be distributed across multiple processors, the execution
time of the parallel work will be decreased, but the execution time of the serial work
will not.
Amdahl's law shows the significant penalty that the serial fraction of the work
load can place on the parallel performance. For example, consider a case where 5%
of an executable code must be performed serially (f_s = 0.05 and f_p = 0.95). If only
4 processors are used, the speedup will be limited to 3.48, nearly 90% of the ideal
speedup. However, if 32 processors are used, then the speedup will be limited to 12.55,
less than 40% of the ideal speedup. Although the processor count was increased by a
factor of 8, the speedup increased by less than a factor of 4. In fact, as the number
of processors n → ∞, the term f_p/n → 0. Thus, the speedup is limited to 1/f_s, or
20 in this case, no matter how many processors are used.
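The worked example can be reproduced directly from equation 5.5 (illustrative Python):

```python
def amdahl(fs, n):
    """Amdahl's law bound of equation 5.5, with fp = 1 - fs."""
    return 1.0 / (fs + (1.0 - fs) / n)

# The worked example: 5% serial work.
for n in (4, 32):
    print(n, round(amdahl(0.05, n), 2))   # 3.48 and 12.55
print(amdahl(0.05, 10**9))                # approaches 1/fs = 20
```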
This could be used to argue that parallel processing does not hold the answer to
the need for increased computing power. However, the potential from multiple proces-
sors and the increased memory often available with MPP machines allows larger and
larger problems to be addressed. With CFD solutions, as the problem size increases,


the computation to communication ratio usually increases and the serial fraction of

the work load decreases.

Even the limit specified by Amdahl's law is not always reached. The major

contributor to this behavior is an imbalance in the distribution of the work to be

executed in parallel. Consider figure 5.1. The left side of the figure shows the serial

execution of a function that operates on four grids, while the right side shows the

parallel execution of the function on four processors with one grid per processor. The

serial execution time, and thus the total work, is represented by the time T5-T1. On

four processors, the average work per processor is represented by the time (T5-T1)/4.

However, the total execution time in parallel is dictated by the maximum execution

time of any process. This time, T2-T1, is larger than the average execution time by

a factor related to the imbalance in the work load or execution times.

Figure 5.1: Unbalanced work load

Since the term f_p/n in equation 5.5 represents the average parallel work per

processor, this work must be increased by a factor proportional to the load imbalance.

Generalizing equation 5.5 to include the effect of a load imbalance, the speedup
becomes

s \le \frac{1}{f_s + f_I\,f_p/n} \qquad (5.7)

where f_I is the load imbalance factor. The load imbalance is often judged by the ratio

of the maximum execution time of any process and the minimum execution time of


any process. However, as used here, the load imbalance factor is used to increase the

average execution time per process to the maximum execution time of any processor.

Thus, the imbalance is equal to the ratio of the maximum execution time of any

process to the average execution time per process.
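The imbalance factor as defined here (maximum per-process time over average per-process time) is straightforward to compute (illustrative sketch; the timing values are hypothetical):

```python
def imbalance_factor(task_times):
    """Load imbalance factor as used in equation 5.7: the ratio of the
    maximum per-process execution time to the average per-process time."""
    return max(task_times) / (sum(task_times) / len(task_times))

# Four processes; one does twice the work of the others.
print(imbalance_factor([2.0, 1.0, 1.0, 1.0]))   # 1.6
```

A perfectly balanced distribution gives a factor of exactly 1, reducing equation 5.7 back to Amdahl's law.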

The load balance is further complicated by the basis for the decomposition of the

work. If each division in the decomposition does not represent a nearly equal piece of

the work, the load balance can vary significantly with the process count. Obviously,

if there are not enough pieces of work, some of the processes would sit idle. Likewise,

if there is one piece of work that is significantly larger than the other pieces, it can

dominate the execution time. Consider figure 5.2. The left side of the figure shows

the serial execution of a function that operates on four grids. When the function is

duplicated and the grids are distributed across two processes (shown in the middle of

the figure), the work is well balanced and the execution time is cut in half. However,

when four processes are used (shown on the right side of the figure), no improvement

in the execution time is seen. The work associated with grid gl is one half of the

total work; thus, the execution time is dominated by the execution time of gl.

Figure 5.2: Limitations in load balance caused by a poor decomposition

Another common cause for the degradation in the speedup achieved is synchro-
nization between processes. Synchronization is enforced by the placement of barriers
nization between processes. Synchronization is enforced by the placement of barriers

in the execution path. No process may pass the barrier until all of the processes have

reached the barrier. This can ensure that every process has completed a particular

portion of the work before any process starts on the next portion of work. This may

be required if one function is dependent on the results from a previous function. How

this can cause an increase in execution time is illustrated in figure 5.3. This diagram

shows two functions (A and B) operating on separate grids on separate processors.

Without synchronization, the total work per process may be well balanced; but if

synchronization is required between the functions, wait time can be introduced if

each function is not well balanced.

Figure 5.3: Imbalance caused by synchronization

This illustrates the fact that each piece of work between any two synchronization

points must be well balanced in order to achieve a good overall load balance for the

entire code. To take this into account, equation 5.7 should be written as

S = [ Σj ( fs,j + (fi,j fp,j) / N ) ]^-1    (5.8)

where the terms within the summation represent each section of code between syn-

chronization points that is executed in parallel.
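Equation 5.8 can be sketched the same way; each tuple below stands for one section of code between synchronization points (the tuple representation is an assumption made for illustration):

```python
def speedup_sections(n, sections):
    """Equation 5.8: sections is a list of (fs, fp, fi) tuples, one per
    stretch of code between synchronization points.  An imbalance in one
    section cannot be absorbed by slack in another."""
    return 1.0 / sum(fs + fi * fp / n for fs, fp, fi in sections)

# Two purely parallel sections; an imbalance in the first hurts the total:
balanced = speedup_sections(4, [(0.0, 0.5, 1.0), (0.0, 0.5, 1.0)])
skewed = speedup_sections(4, [(0.0, 0.5, 1.5), (0.0, 0.5, 1.0)])
```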

Load Balancing

The most important part to achieving good parallel performance is load bal-

ancing. The problem of load balancing is similar to the computer science problems

referred to as the "knapsack problem" [73] and the "partition problem" [74]. These

problems are NP-complete, the class of problems for which no known deterministic
algorithm is guaranteed to produce an exact solution in polynomial time. The input
to the knapsack problem is defined by a set of items

with specified sizes and values and a knapsack of a specified capacity. The problem

is to maximize the value of the subset of items that will fit into the knapsack at one

time. The input to the partition problem is a set of blocks of varying heights. The

problem is to stack the blocks into two towers of equal heights.
The input to the load balancing problem consists of a set of pieces of work, a

measure of the cost of each piece of work, and the number of processors across which

the pieces of work are to be distributed. The problem is to associate each piece

of work with a processor, while minimizing the ratio of the maximum total work

associated with any processor and the average work per processor. The amount of

work associated with each piece of work is similar to the value of the items to be
placed in the knapsack or the height of the blocks. The processors are similar to the

knapsack or the towers. The average work per processor corresponds to the capacity

of the knapsack or the average height of the towers. However, each piece of work

must be associated with a processor, each processor must have at least one piece of

work, and there is no limit on the amount of work that can be associated with each processor.


The algorithm used to balance the work of the flow solver is a max-min algorithm.

This algorithm, shown below, takes the piece of work with the maximum cost from

the pieces of work that have not yet been assigned to a processor and assigns it to the

processor that has the minimum total work assigned to it. This algorithm distributes
the work across the available processors with only a single pass through the list of the

pieces of work, thus the execution time is bounded. With sufficient partitioning of
the work, this algorithm produces a good load balance, although it may not produce
the best possible distribution of the work.
The array Work[] is an estimate of the cost associated with each piece of work

(each grid, in this case). Since the work of the flow solver is closely associated with the
number of grid points, the cost associated with a grid can be defined as the number

of points in the grid. However, there are other factors that can affect the execution

time that are not directly related to the number of grid points. Thus, a user defined
weight can be associated with each grid and it is the weighted number of grid points
that is fed into the load balancing routine as the cost of each piece of work. The
output from the load balancing routine is an array GridToPe[] that specifies which
processor will execute the flow solver for each grid.

1  for i ← 1 to npes
2      do PeWork[i] ← 0
3  for i ← 1 to ngrids
4      do PeNum ← FIND_MIN_VAL_INDEX(PeWork[])
5         GridNum ← FIND_MAX_VAL_INDEX(Work[])
6         PeWork[PeNum] ← PeWork[PeNum] + Work[GridNum]
7         GridToPe[GridNum] ← PeNum
8         Work[GridNum] ← 0
9  return GridToPe[]
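In Python, the max-min pass might look like the following sketch (names are illustrative; ties go to the lowest-numbered process):

```python
def max_min_balance(work, npes):
    """Max-min scheduling: assign the largest remaining piece of work to
    the least-loaded process, in a single pass over the pieces."""
    order = sorted(range(len(work)), key=lambda g: -work[g])
    pe_work = [0.0] * npes           # total cost assigned to each process
    grid_to_pe = [None] * len(work)  # output mapping: grid -> process
    for g in order:
        pe = pe_work.index(min(pe_work))  # least-loaded process
        pe_work[pe] += work[g]
        grid_to_pe[g] = pe
    return grid_to_pe, pe_work
```

Sorting the pieces once up front plays the role of repeatedly calling FIND_MAX_VAL_INDEX on the remaining work.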

This load balancing algorithm is applied to the flow solver, for which there is an
initial estimate of the value of each piece of work. In grid assembly, there is no a priori

knowledge of the amount of work associated with a grid or superblock. In fact, the
amount of work associated with a grid or superblock depends on the location of the

grids and thus changes as the grids move. In such a case, a dynamic load balancing

algorithm is needed that can redistribute the work based on some perception of the
current work distribution.
The algorithm implemented for dynamic load balancing, shown below, consists

of two iterative steps. The algorithm requires as input some numerical measure
the cost associated with each piece of work in the partition and the current mapping
of those pieces of work to the available processes. The first step in the algorithm is
to move work from the process with the maximum work load to the process with the
minimum work load in order to improve the load balance. The second step is to swap
single pieces of work between two processes in order to improve the load balance.


The output from this algorithm is a new mapping of the pieces of work to the set of
processes. This mapping must be compared to the previous mapping to determine
which data has to be transferred from one process to another.

OPTIMIZE_MAPPING(Work[], WorkToPe[])
1   TotalWork ← CALC_SUM(Work[])
2   AvgWork ← TotalWork/npes
3   PeWork[], PeWorkList[] ← BUILD_PE_WORK_LISTS(WorkToPe[])
5   MOVE_WORK(Work[], WorkToPe[])
6   SWAP_WORK(Work[], WorkToPe[])

MOVE_WORK(Work[], WorkToPe[])
1   repeat
2       pemin ← FIND_MIN_VAL_INDEX(PeWork[])
3       pemax ← FIND_MAX_VAL_INDEX(PeWork[])
5       WorkLimit ← (PeWork[pemax] - PeWork[pemin]) * 0.99
6       WorkToPut ← CHOOSE_MAX_LIMITED_VAL(
7           PeWorkList[pemax], WorkLimit)
9       if WorkToPut ≠ NIL
10          then WorkToPe[WorkToPut] ← pemin
11               PeWork[], PeWorkList[] ← BUILD_PE_WORK_LISTS(
12                   WorkToPe[])
13  until WorkToPut = NIL

In line 3 of OPTIMIZE_MAPPING, BUILD_PE_WORK_LISTS calculates the total
work per process (PeWork[]) and also builds an array of the lists of pieces of work
that are mapped to each process (PeWorkList[]) from the mapping of work to pro-
cesses (WorkToPe[]). In MOVE_WORK, if any piece of work can be moved from one
process to another and decrease the maximum amount of work on any process, then
it will improve the load balance. Therefore, in line 5, WorkLimit is set based on a
percentage of the difference in the work assigned to the processes with the least and
most work. CHOOSE_MAX_LIMITED_VAL chooses the piece of work from the list of
work associated with pemax (PeWorkList[pemax]) that has the largest cost and also
is less than WorkLimit. This piece of work is assigned to pemin in line 10 and the
work lists are updated in line 11.

SWAP_WORK(Work[], WorkToPe[])
1   repeat
2       for i ← 1 to npes
3           do PeWorkImb[i] ← PeWork[i] - AvgWork
4       pemax ← FIND_MAX_VAL_INDEX(PeWorkImb[])
6       for each i in PeWorkList[pemax]
7           do WorkToPut ← i
8              for pemin ← 1 to npes
9                  do if pemin ≠ pemax
10                        then WorkLimit ← (Work[WorkToPut]
11                                 - PeWorkImb[pemax]
12                                 + PeWorkImb[pemin]) * 1.001
13                             WorkToGet ← CHOOSE_ANY_LIMITED_VAL(
14                                 Work[WorkToPut], WorkLimit,
15                                 PeWorkList[pemin])
17                             if WorkToGet ≠ NIL
18                                then WorkToPe[WorkToGet] ← pemax
19                                     WorkToPe[WorkToPut] ← pemin
20                                     PeWork[], PeWorkList[] ←
21                                         BUILD_PE_WORK_LISTS(
22                                             WorkToPe[])
23                                     break
25  until WorkToGet = NIL

In SWAP_WORK, one piece of work on one process is swapped for a piece of
work on another process. Therefore, there is an upper and a lower bound on the cost
of the WorkToGet. If WorkToGet is larger than WorkToPut, then the total work
on pemax will increase. However, if WorkToGet is less than WorkToPut by more
than the difference between the imbalance on the two processes, then the imbal-
ance on pemin will increase beyond the original imbalance of pemax. The routine
CHOOSE_ANY_LIMITED_VAL chooses a piece of work from the list of work associ-
ated with pemin (PeWorkList[pemin]) that costs less than the work represented
by WorkToPut and is greater than WorkLimit. It is not important that the opti-
mum piece of work be chosen; any piece of work that improves the load imbalance
will do. The work represented by WorkToGet is assigned to pemax and the work
represented by WorkToPut is assigned to pemin. The work lists are updated by
BUILD_PE_WORK_LISTS. The break at line 23 causes control to jump out of the
loop that started at line 6 so that the load imbalances and pemax can be recalculated.
Each step of this algorithm attempts to decrease the load imbalance caused by
the process with the maximum work load; however, the search for pieces of work to
swap is exhaustive. Since this algorithm is used to redistribute the superblocks based
on the work of grid assembly, and there are not many superblocks, it is not too
expensive. The algorithm could be extended to swap multiple pieces of work on one
process for one or more pieces of work on another process. However, this would
require more efficient sorting of the pieces of work, so that the pieces to be swapped
could be selected more efficiently.
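A simplified Python rendering of the MOVE_WORK step is shown below (the swap step is omitted, and CHOOSE_MAX_LIMITED_VAL becomes a max over the candidate pieces under the limit; all names are illustrative):

```python
def move_work(work, work_to_pe, npes):
    """Repeatedly move the largest piece that still narrows the gap
    between the busiest and the idlest process (MOVE_WORK sketch)."""
    while True:
        pe_load = [0.0] * npes
        for w, pe in zip(work, work_to_pe):
            pe_load[pe] += w
        pemax = pe_load.index(max(pe_load))
        pemin = pe_load.index(min(pe_load))
        # a piece is only worth moving if it costs less than ~the gap
        limit = 0.99 * (pe_load[pemax] - pe_load[pemin])
        candidates = [i for i, pe in enumerate(work_to_pe)
                      if pe == pemax and work[i] < limit]
        if not candidates:
            return work_to_pe
        work_to_pe[max(candidates, key=lambda i: work[i])] = pemin
```

Once no single move helps, only swaps (as in SWAP_WORK) can improve the balance further.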
Another situation that often arises is a large set of pieces of work that can be

treated as an array. This array can be evenly divided among the processes with each
getting an equal number of elements of the array. However, if each element of the
array does not represent the same amount of work, there will be a load imbalance.

It could be expensive to treat each element as a separate piece of work, measure its

cost, and use the previous algorithms to distribute this large number of pieces of
work. Instead, the total cost of the work can be associated with the processor and
used as a weight. Load balancing can then be achieved by dividing the array so that
the weighted number of elements is equally divided among the processes.
This algorithm requires as input the number of elements of the array mapped
to each process (N[]) and the execution time of each process (T[]). A weight for
each process (W[]) is calculated as the execution time per array element. The excess
number of elements assigned to the process with the maximum load is calculated as
the delta between the process execution time and the average process execution time
divided by the process weight. Since this number of elements will be assigned to
another process, the weight of the receiving process must be updated. The execution

times of the two processes are updated and the loop is repeated if there are other
processes with excess array elements.

1   for i ← 1 to npes
2       do W[i] ← T[i]/N[i]
3   Tavg ← CALC_SUM(T[])/npes
4   imax ← FIND_MAX_VAL_INDEX(T[])
5   imin ← FIND_MIN_VAL_INDEX(T[])
6   repeat
7       Nexcess ← (T[imax] - Tavg)/W[imax]
8       if Nexcess = 0
9          then break
10      TotWeight ← N[imin] * W[imin] + Nexcess * W[imax]
11      W[imin] ← TotWeight/(N[imin] + Nexcess)
12      N[imin] ← N[imin] + Nexcess
13      N[imax] ← N[imax] - Nexcess
14      T[imin] ← N[imin] * W[imin]
15      T[imax] ← N[imax] * W[imax]
16      imin ← FIND_MIN_VAL_INDEX(T[])
17      imax ← FIND_MAX_VAL_INDEX(T[])
18  until T[imax]/Tavg < 1.005
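The same algorithm can be sketched in Python; here N and T are the per-process element counts and measured execution times, and all names are illustrative:

```python
def rebalance_array(N, T, tol=1.005):
    """Shift excess array elements from the slowest process toward the
    fastest, using measured time per element as the weight."""
    npes = len(N)
    W = [T[i] / N[i] for i in range(npes)]  # cost per array element
    Tavg = sum(T) / npes
    while True:
        imax = T.index(max(T))
        imin = T.index(min(T))
        n_excess = int((T[imax] - Tavg) / W[imax])
        if n_excess == 0 or T[imax] / Tavg < tol:
            return N
        # the receiving process inherits a blended per-element weight
        tot = N[imin] * W[imin] + n_excess * W[imax]
        W[imin] = tot / (N[imin] + n_excess)
        N[imin] += n_excess
        N[imax] -= n_excess
        T[imin] = N[imin] * W[imin]
        T[imax] = N[imax] * W[imax]
```

Because the moved elements carry their measured cost with them, the total predicted time is conserved and Tavg remains valid across iterations.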

Proposed Approach

Since this work builds on the initial parallel implementation of the Beggar flow
solver [6], the same methods used there will be continued here. The message passing
paradigm is used within an SPMD programming model. PVM is used for the mes-
sage passing environment. The code is geared toward MIMD machines with either
distributed or shared memory. The ultimate goal is to allow heterogeneous computing
although homogeneous computing environments are the primary focus. A functional
decomposition of the entire simulation process is used with the flow solution, force
and moment calculation, 6DOF integration, and grid assembly being the primary
functions. Coarse grain domain decomposition of the flow solver based on grids is
used. The degree to which domain decomposition of the grid assembly function can
be used is determined. Load balancing is treated as the primary contributor to good


parallel performance. The most efficient implementation requires consideration of
both the flow solver and the grid assembly process during data partitioning and load balancing.


For parallel algorithm design, Foster [61] recommends a process denoted by the

acronym PCAM, referring to partitioning, communication, agglomeration, and map-

ping, as mentioned previously. In this approach, Foster recommends that the finest
grained partitioning of the work be defined along with all of the required communica-

tions. Then, the partitions are agglomerated to reduce the communications required

and thus increase the computation to communication ratio. The final step is to map
the work to processors based on the particular computer architecture. In this work, a

nearly opposite approach is taken. The work is partitioned using coarse grain decom-
position first. This allows a parallel implementation to be achieved with minimal code

changes and with less expertise in the existing sequential code, as well as in parallel
programming itself. This also allows the users to receive and to start using the code

earlier. As the code performance is analyzed, the granularity of the decomposition is

refined as required. Mapping of work to processes is done dynamically to achieve a

good load balance; however, no machine specific issues of mapping work to specific

processors are addressed.


Phase I: Hybrid Parallel-Sequential

The simplest approach to achieving a parallel version of Beggar for moving body

problems is to use a separate front-end (FE) process that performs the grid assembly
function for the complete domain in a serial fashion with respect to the parallel
execution of the flow solver across multiple back-end (BE) processes. This requires

that proper communication be established between the flow solution function and
the grid assembly function; however, this does not require any consideration of load

balancing or partitioning of the grid assembly function.
This implementation is referred to as phase I and is depicted in figure 6.1. This
diagram and the others like it that follow are referred to as timing diagrams. The
major functions are represented and the diagram serves as a template of one iteration
of the solution process. The vertical direction represents time and this template can be

stamped out in a vertical manner to construct a complete time history of the solution
process. The boxes on the left represent the functions running in the FE process,
while the boxes on the right represent the functions running in the BE processes.
The arrows represent communication between specific functions on the FE and BE.
Communication between functions on the same process is not shown explicitly. The
vertical lines through a function indicates that it is spread across multiple processors.
Although these diagrams are not drawn to scale, the work of a particular function
is represented by the area of the box drawn for that function. Thus, as a function
is spread across multiple processors, the width increases and the height decreases
representing the decrease in the time for executing the function.


Referring to figure 6.1, the solution process is started at time T1. Once the

grid assembly function is completed at time T2, the interpolation stencils, iblank

arrays, etc. are communicated from the FE to the BE so that the flow solution

function can proceed. Once an iteration of the flow solver is completed, the forces

and moments are integrated and are passed from the BE to the FE. The 6DOF

function is then executed to reposition the grids and to calculate motion rates. Since

the 6DOF function executes quickly, it is duplicated on the FE and the BE rather

than communicating the resulting information.

Ignoring the cost of the force and moment calculation and the 6DOF integration,

the flow solver represents the parallel work and the grid assembly represents the serial

work. Based on the fractions of the total execution time represented by the flow solver

and the grid assembly, equation 5.7 can be used to estimate the performance that can

be achieved with this implementation. However, instead of using the notation f,, f/

and f,, we will use the uppercase letters F and G to represent the flow solver and

grid assembly functions and the subscripts p, s and i to represent the fractions of the

work executed in parallel or serial and the load imbalance factors, respectively. Thus,

for the phase I implementation, the speedup can be approximated as

S = [ (Fp / nbes) Fi + Gs ]^-1    (6.1)

where nbes is the number of BE processes. Since the work of the flow solver is closely

Figure 6.1: Phase I implementation

Figure 6.2: Comparison of estimated speedup of phase I to Amdahl's law

related to the number of grid points, the equation

Fi = max(no. of points on any processor) / (avg. no. of points per processor)    (6.2)

can be used to obtain an approximation for the load imbalance factor for the flow

solver. There are other factors, such as boundary conditions and the distribution of

communication between processors, that affect the load balance. They will be ignored

at this point.
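Under this approximation, equation 6.2 is just a ratio of point counts; for example (names are illustrative):

```python
def flow_imbalance(points_per_pe):
    """Equation 6.2: maximum per-processor point count over the average."""
    return max(points_per_pe) / (sum(points_per_pe) / len(points_per_pe))

# One processor carries 140k points while the others carry 100k each:
fi = flow_imbalance([100, 100, 100, 140])
```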

Figure 6.2 shows a comparison of the estimated speedup from the phase I im-

plementation to Amdahl's law. The estimated speedup of phase I is plotted using

equation 6.1 with Fp = 0.95, Gs = 0.05, and Fi = 1.05 representing a nominal load

imbalance of 5% in the distribution of the work of the flow solver. This plot shows the

significant drop off in speedup with increased processor counts due to the serial frac-

tion of the work. A small decrease in the performance of the phase I implementation

(as compared to Amdahl's Law), due to the load imbalance, can also be seen.
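The curve in figure 6.2 can be reproduced by evaluating equation 6.1 over a range of processor counts; the following is an illustrative sketch (the function name is not from the original):

```python
def phase1_speedup(nbes, Fp=0.95, Gs=0.05, Fi=1.05):
    """Equation 6.1: flow solver parallelized across nbes back-end
    processes, with grid assembly executed serially on the front end."""
    return 1.0 / (Fp / nbes * Fi + Gs)

# Speedup flattens as the serial grid assembly starts to dominate:
curve = [(n, phase1_speedup(n)) for n in (4, 8, 16, 32)]
```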

Phase II: Function Overlapping

Some parallel efficiency can be gained by overlapping the grid assembly function

with the flow solution function. This overlapping could be achieved by updating the


6DOF and the interpolation stencils at the same time that the flow solver is updated,
by using the forces and moments calculated from a previous iteration of the flow
solution as an approximation to the current forces and moments. Thus, the updating

of the grid assembly would be based on a flow solution that is lagged behind the

current flow solution. This is similar to the technique used for sequential processing

in Lijewski and Suhs [28, 29], where store separation events were simulated with the

grid assembly recomputed once after every 20 iterations of the flow solver and 6DOF

integration. The grids were moved and the grid motion time metrics were fed into
the flow solver every iteration, although the interpolation stencils were not. Time

accurate forces and moments were used, although the flow solution could be affected

since the interpolation stencil locations were not time accurate. The variation in
stencil locations due to this time lag (.004 seconds in their case) is justified by the
assumption that the grids will not move by an appreciable amount during the delay.

Good results were achieved for the problems addressed.

Some parallel efficiency may be gained without lagging the grid assembly behind

the flow solution. This is possible due to the Newton-Relaxation scheme used in the

flow solution function. The discretized, linearized, governing equations are written

in the form of Newton's method. Each step of the Newton's method is solved using

symmetric Gauss-Seidel (SGS) iteration. The SGS iterations, or inner iterations,
are performed on a grid by grid basis, while the Newton iterations, or dt iterations,

are used to achieve time accuracy and are performed on all grids in sequence. This

procedure eliminates synchronization errors at blocked and overset boundaries by
iteratively bringing all dependent variables up to the next time level.
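The structure of the scheme (outer Newton, or dt, iterations with inner SGS sweeps) can be illustrated on a small model nonlinear system. This is a toy sketch, not Beggar's flow solver, and every name and the model problem itself are invented for illustration:

```python
import numpy as np

def sgs_sweeps(A, b, x, n_inner):
    """Approximately solve A x = b by symmetric Gauss-Seidel:
    one forward and one backward sweep per inner iteration."""
    n = len(b)
    for _ in range(n_inner):
        for i in list(range(n)) + list(range(n - 1, -1, -1)):
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

def newton_relaxation(G, J, u, n_dt, n_inner):
    """Outer (dt) iterations of Newton's method; each linearized system
    J du = -G is solved only approximately by the inner SGS iterations."""
    for _ in range(n_dt):
        du = sgs_sweeps(J(u), -G(u), np.zeros_like(u), n_inner)
        u = u + du
    return u

# Model problem: tridiagonal coupling plus a mild quadratic nonlinearity.
n = 5
A0 = np.eye(n) - 0.1 * np.eye(n, k=1) - 0.1 * np.eye(n, k=-1)
b = 0.5 * np.arange(1.0, n + 1.0)
G = lambda u: A0 @ u + 0.1 * u**2 - b
J = lambda u: A0 + 0.2 * np.diag(u)
u = newton_relaxation(G, J, np.zeros(n), n_dt=20, n_inner=3)
```

Even though each inner solve is approximate, repeating the outer iterations drives all unknowns toward the converged state together, which is the property the text exploits to remove synchronization errors at blocked and overset boundaries.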

Figure 6.3 is a diagram of the flow solution process. The inner loop represents
the inner iterations or iterations of the SGS solution of the linear equations from one
step of the Newton's method. The outer loop represents the dt iterations or steps of
the Newton's method.

For time accurate flow calculations with Beggar, it is normal to run more than

Figure 6.3: Basic flow solution algorithm

one dt iteration to eliminate synchronization errors between grids and to achieve time

accuracy. Each dt iteration produces an updated approximation to the flow solution

at the next time step. Forces and moments can be calculated after each dt iteration

using this approximate solution. If the forces and moments calculated after the first

dt iteration are a good approximation to the final forces and moments, these values

can be forwarded to the grid assembly process. This allows the computation of the

grid assembly to proceed during the computation of additional dt iterations. If the
computation time for the flow solver is sufficiently large, the computational cost of

the grid assembly process can be hidden.

This implementation is referred to as phase II and is depicted in figure 6.4.

Rather than calculating forces and moments only after a complete time step of the
flow solver, they are calculated after each dt iteration. The node labeled 1 in figure

6.3 represents the point where the forces and moments are calculated.
Referring to figure 6.4, the solution process is started at time T1. After the first
dt iteration, initial approximations to the forces and moments are calculated and are

passed from the BE to the FE at time T2. The 6DOF and grid assembly functions
proceed while the remaining dt iterations of the flow solver are completed. Once the
