PARALLEL COMPUTING OF OVERSET GRIDS FOR AERODYNAMIC PROBLEMS WITH MOVING OBJECTS

By

NATHAN C. PREWITT

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1999

Happy is the man that findeth wisdom, and the man that getteth understanding. Proverbs 3:13

ACKNOWLEDGEMENTS

Where no counsel is, the people fall: but in the multitude of counselors there is safety. Proverbs 11:14

That said, I would like to thank my counselors. Dr. Wei Shyy has been an excellent advisor, always enthusiastic and willing to help in any way possible. Likewise, Dr. Davy Belk has always been encouraging and supportive of my work. I am glad to be able to count them both as friends, as well as mentors. I would also like to thank the rest of my committee, Dr. Bruce Carroll, Dr. Chen Chi Hsu, and Dr. B. C. Vemuri. They have all been my allies. I also want to thank Dr. Milton, who was the head of the GERC when I started this endeavor, and Dr. Sforza, who is now head of the GERC, and whom I have gotten to know better through AIAA. And since I am mentioning the GERC, thanks go to Cathy and Judi. I must also thank Mr. Whitehead, who has been my manager and an excellent supporter (even if he is an Alabama fan), since I came to work for QuesTech and now CACI. And, let me not forget Dr. Darryl Thornton (it gets worse, he's an Ole Miss grad), who has been my department director for many years now.

I have been very fortunate to receive funding from AFOSR. Without this funding and the time it allowed me away from my other task duties, I am sure that I would not have been able to complete my degree. I would like to thank everyone that helped to obtain this funding, including Mr. Jim Brock and Maj. Mark Lutton of the Air Force Seek Eagle Office, and Mr. Dave Uhrig. To be nice and legal: support was provided in part by a grant of HPC time from the DoD HPC Distributed Center, U.S. Army Space and Strategic Defense Command, SGI Origin 2000. Additional support was provided by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant number F49620-98-1-0092. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research or the U.S. Government.

I hope that my presentation clearly shows which part of this work is my own. The original development of the Beggar code, including the unique algorithms for grid assembly, was done by Dr. Davy Belk and Maj. Ray Maple. The original work done to parallelize the Beggar flow solver was done by Dr. Davy Belk and Mr. Doug Strasburg. I owe a large debt to these gentlemen. There are now two others that continue the development of Beggar. Dr. Ralph Noack has been a tremendous asset to our group and has been most helpful in my work. It is from Ralph that I learned of the latency issues associated with the Origin 2000, and Ralph is the author of the load balancing algorithm used to distribute the grids in order to load balance the flow solver. Dr. Magdi Rizk came to Eglin at the same time that I did and we have worked together since.
Magdi has been the sole developer of the Beggar flow solver for several years and has made significant contributions in that area. Magdi has also been my teacher at the GERC and a great friend. I would like to thank my other friends and colleagues and hope they will accept this thank you en masse.

Most importantly, I would like to thank my family. My in-laws have been very helpful (during births and hurricanes, pouring the patio and eating at Hideaway's) and have always been willing to take the wife and kids when I really needed to study (sometimes for weeks). My parents have always been supportive in innumerable ways: from driving for hours and then sitting in the car for hours at honor band tryouts, to attending MSU football games just to see the halftime show, to buying a house for us when Tracye and I got married in the middle of college, to doing my laundry and fixing the brakes on the car over the weekend so I could get back to school. My brother Stephen has been a lot of help to my parents the last few years; taking care of them is something that I don't think I could do. Jay and Josh are too much like me at times; but they are welcomed diversions and stress relievers (our dog, Dudy Noble Polk, also fits into this category). And Tracye is a wonderful person. My roommate in college once said, "You are very lucky. You've found someone that you love, who is also a friend." She puts up with a lot. I am no picnic. We spent a lot of time apart during this, and I always get depressed whenever they are gone for a long time. Some of her bulldog friends, that she talks to over the internet, asked "How do you live with someone that is so positive all of the time?" "Who, Tracye?" I replied. "No..., it's great!"

TABLE OF CONTENTS

ACKNOWLEDGEMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION
   Overview
   Related Work
      Grid Assembly
      Store Separation
      Parallel Computing
   Dissertation Outline

2 GRID ASSEMBLY
   Polygonal Mapping Tree
   Interpolation Stencils
   Hole Cutting
   Donors and Receptors
   Boundary Condition Identification

3 FLOW SOLUTION
   Governing Equations
   Vector Form
   Non-Dimensionalization
   Coordinate Transformation
   Flux Vector Splitting
   Flux Difference Splitting
   Newton Relaxation
   Fixed-Point Iteration
   Parallel Considerations

4 6DOF INTEGRATION
   Equations of Motion
   Coordinate Transformations
   Quaternions
   Numerical Integration

5 PARALLEL PROGRAMMING
   Hardware Overview
   Software Overview
   Performance
   Load Balancing
   Proposed Approach

6 PARALLEL IMPLEMENTATIONS
   Phase I: Hybrid Parallel-Sequential
   Phase II: Function Overlapping
   Phase III: Coarse Grain Decomposition
   Phase IV: Fine Grain Decomposition
   Summary

7 TEST PROBLEM

8 RESULTS
   Phase I: Hybrid Parallel-Sequential
   Phase II: Function Overlapping
   Phase III: Coarse Grain Decomposition
   Phase IV: Fine Grain Decomposition
   Summary

9 CONCLUSIONS AND FUTURE WORK

BIBLIOGRAPHY

BIOGRAPHICAL SKETCH

LIST OF TABLES

1.1 Grid assembly codes
1.2 Store separation modeling methods
1.3 Grid generation methods
1.4 Significant accomplishments in parallel computing in relation to overset grid methods
6.1 Summary of the implementations of parallel grid assembly
7.1 Store physical properties
7.2 Ejector properties
7.3 Original grid dimensions
7.4 Dimensions of split grids
7.5 Load Imbalance Factors
7.6 Summary of the final position of the stores calculated from the two different grid sets
8.1 Summary of results from the phase I runs including the final position of the bottom store
8.2 Summary of results from the phase II runs including the final position of the outboard store
8.3 Summary of results from the phase III runs including the final position of the inboard store
8.4 Summary of results from the runs that used fine grain hole cutting including the final position of the bottom store
8.5 Summary of best execution times (in minutes) from runs of the different implementations (number of FE processes shown in parentheses)

LIST OF FIGURES

1.1 History of three store ripple release
1.2 Solution process
1.3 Example of overlapping grids with holes cut
1.4 Grids for single generic store trajectory calculation
1.5 Mach 0.95 single store trajectory calculated (left) CG position and (right) angular position versus wind tunnel CTS data
1.6 Mach 1.20 single store trajectory calculated (left) CG position and (right) angular position versus wind tunnel CTS data
2.1 Example quad tree mesh
2.2 Example PM tree structure
4.1 Transformation from global to local coordinates
5.1 Unbalanced work load
5.2 Limitations in load balance caused by a poor decomposition
5.3 Imbalance caused by synchronization
6.1 Phase I implementation
6.2 Comparison of estimated speedup of phase I to Amdahl's law
6.3 Basic flow solution algorithm
6.4 Phase II implementation
6.5 Insufficient time to hide grid assembly
6.6 Comparison of estimated speedup of phases I and II
6.7 Duplication of PM tree on each FE process
6.8 Distribution of PM tree across the FE processes
6.9 Phase III implementation
6.10 Comparison of the estimated speedup of phases I, II, and III
6.11 Phase IV implementation
6.12 Comparison of estimated speedup of phases I, II, III and IV
7.1 Bottom store (left) CG and (right) angular positions
7.2 Outboard store (left) CG and (right) angular positions
7.3 Inboard store (left) CG and (right) angular positions
8.1 Actual speedup of phase I
8.2 Bottom store (left) force coefficient and (right) moment coefficient variation between dt iterations history
8.3 Outboard store (left) force coefficient and (right) moment coefficient variation between dt iterations history
8.4 Inboard store (left) force coefficient and (right) moment coefficient variation between dt iterations history
8.5 Actual speedup of phase II
8.6 Effect of using average execution time
8.7 Actual speedup of phase III
8.8 History of grid assembly load imbalance based on execution times of hole cutting, stencil search, and health check
8.9 Grid assembly process
8.10 History of grid assembly load imbalance based on execution time of the stencil search
8.11 Grid assembly execution timings for four FE processes
8.12 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with load balance based on measured execution time of stencil searching. Each curve represents a separate process.
8.13 History of grid assembly load imbalance based on number of IGBPs
8.14 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with load balance based on number of IGBPs. Each curve represents a separate process.
8.15 Speedup due to fine grain hole cutting and load balancing of hole cutting separate from the stencil search
8.16 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search load balanced based on execution time. Each curve represents a separate process.
8.17 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 5 FE processes. Each curve represents a separate process.
8.18 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 6 FE processes. Each curve represents a separate process.
8.19 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 7 FE processes. Each curve represents a separate process.
8.20 Grid assembly execution timings of (left) hole cutting and (right) stencil searching with fine grain hole cutting and the stencil search distributed across 8 FE processes. Each curve represents a separate process.
8.21 Use of additional processors continues to reduce time for hole cutting
8.22 Execution times for load balanced fine grain hole cutting distributed across 4 FE processes
8.23 Execution times for load balanced fine grain hole cutting distributed across 5 FE processes
8.24 Execution times for load balanced fine grain hole cutting distributed across 6 FE processes
8.25 Execution times for load balanced fine grain hole cutting distributed across 7 FE processes
8.26 Execution times for load balanced fine grain hole cutting distributed across 8 FE processes
8.27 Summary of the increasing speedup achieved through the different implementations

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PARALLEL COMPUTING OF OVERSET GRIDS FOR AERODYNAMIC PROBLEMS WITH MOVING OBJECTS

By Nathan C. Prewitt

December 1999

Chairman: Dr. Wei Shyy
Major Department: Aerospace Engineering, Mechanics and Engineering Science

When a store is dropped from a military aircraft at high subsonic, transonic, or supersonic speeds, the aerodynamic forces and moments acting on the store can be sufficient to send the store back into contact with the aircraft. This can cause damage to the aircraft and endanger the life of the crew. Therefore, store separation analysis is used to certify the safety of any proposed drop. This analysis is often based on wind tunnel aerodynamic data or analogy with flight test data from similar configurations. Time accurate computational fluid dynamics (CFD) offers the option of calculating store separation trajectories from first principles.

In the Chimera grid scheme, a set of independent, overlapping, structured grids is used to decompose the domain of interest. This allows the use of efficient structured grid flow solvers and associated boundary conditions, and allows for grid motion without stretching or regridding. However, these advantages are gained in exchange for the requirement to establish communication links between the overlapping grids via a process referred to as "grid assembly."

The calculation of a moving body problem, such as a store separation trajectory calculation, using the Chimera grid scheme requires that the grid assembly be performed each time that a grid is moved. Considering the facts that time accurate CFD calculations are computationally expensive and that the grids may be moved hundreds of times throughout a complete trajectory calculation, a single store trajectory calculation requires significant computational resources. Parallel computing is used regularly to reduce the time required to get a CFD solution to steady state problems. However, relatively little work has been done to use parallel computing for time accurate, moving body problems. Thus, new techniques are presented for the parallel implementation of the assembly of overset, Chimera grids.

This work is based on the grid assembly function defined in the Beggar code, currently under development at Eglin Air Force Base, FL. This code is targeted at the store separation problem and automates the grid assembly problem to a large extent, using a polygonal mapping (PM) tree data structure to identify point/volume relationships. A logical succession of incremental steps is presented in the parallel implementation of the grid assembly function. The parallel performance of each implementation is analyzed, and equations are presented for estimating the parallel speedup.
Each successive implementation attacks the weaknesses of the previous implementation in an effort to improve the parallel performance. The first implementation achieves the solution of moving body problems on multiple processors with minimum code changes. The second implementation improves the parallel performance by hiding the execution time of the grid assembly function behind the execution time of the flow solver. The third implementation uses coarse grain data decomposition to reduce the execution time of the grid assembly function. The final implementation demonstrates the fine grain decomposition of the grid assembly through the fine grain decomposition of the hole cutting process. Shared memory techniques are used in the final implementation and appropriate dynamic load balancing algorithms are presented.

CHAPTER 1
INTRODUCTION

The knowledge of forces and moments induced by the addition of stores to an aircraft is vital for safe carriage. Once a store is released, knowledge of the interference aerodynamics and the effects on the trajectory of the store is vital for the safety of the pilot and aircraft. Such aerodynamic data have traditionally been provided by wind tunnel testing or flight testing; however, these techniques can be very expensive and have limitations when simulating time accurate, moving body problems such as the ripple release depicted in figure 1.1. Computational fluid dynamics (CFD) provides a way to supplement wind tunnel and flight test data.

Figure 1.1: History of three store ripple release

Overview

The primary problem to be considered is store separation from fighter aircraft configurations. The goal is to compute store separation trajectories in a timely fashion using CFD and parallel computing. Due to the geometric complexity of aircraft/store configurations and the requirement to handle moving body problems, the Chimera grid scheme [1] is being used. This approach uses a set of overlapping structured grids to decompose the domain of interest. The Chimera grid scheme offers several advantages: 1) the use of structured grids allows the use of efficient block structured grid flow solvers and the associated boundary conditions; 2) the generation of overlapping grids which best fit a particular component geometry eases the burden of structured grid generation; and 3) the use of interpolation for communication between overlapping grids allows grids to be moved relative to each other. However, the communication between overlapping grids must be reestablished whenever a grid is moved. This process of establishing communication between overlapping grids will be referred to as grid assembly.

Whenever the grid around a physical object overlaps another grid, there is the probability that some grid points will lie inside the physical object and thus will be outside of the flow field. Even if no actual grid points lie inside the physical object, if a grid line crosses the physical object, there will be neighboring grid points that lie on opposite sides of the physical object. Any numerical stencil that uses two such neighboring grid points will introduce errors into the solution. This situation is avoided by cutting holes into any grids overlapping the physical surfaces of the geometry. During hole cutting, regions of the overlapping grids are marked as invalid. This creates additional boundaries within the grid system. The flow solver requires that some boundary condition be supplied at these boundaries.
Likewise, some boundary condition is also needed at the outer boundaries of embedded grids. Collectively, the grid points on the fringe of the holes and the grid points on the outer boundaries of the embedded grids are referred to as intergrid boundary points (IGBPs) [2]. The boundary conditions required at the IGBPs are supplied by interpolating the flow solution from any overlapping grids.

The Beggar code [3], developed at Eglin Air Force Base, is capable of solving three-dimensional inviscid and viscous flow problems involving multiple moving objects, and is suitable for simulating store separation. This code allows blocked, patched, and overlapping structured grids in a framework that includes grid assembly, flow solution, force and moment calculation, and the integration of the rigid body, six degrees of freedom (6DOF) equations of motion. All block-to-block connections, patched connections, freestream boundary conditions, singularity boundary conditions, and overlapped boundaries are detected automatically. All holes are defined using the solid boundaries as cutting surfaces, and all required interpolation stencils are calculated automatically. The integration of all necessary functions simplifies the simulation of moving body problems [4], while the automation and efficient implementation of the grid assembly process [5] significantly reduces the amount of user input and is of great benefit in a production work environment.

The basic solution process consists of an iterative loop through the four functions shown in figure 1.2. The blocked and overset grid system is first assembled. Once this is done, the flow solution is calculated in a time-accurate manner. Aerodynamic forces and moments are then integrated over the grid surfaces representing the physical surfaces of the moving bodies. The rigid body equations of motion are then integrated with respect to time to determine the new position of the grids considering all aerodynamic forces and moments, forces due to gravity, and all externally applied forces and moments (such as ejectors).

Figure 1.2: Solution process

Multiple iterations of this loop are required to perform a complete store separation trajectory calculation. The accuracy of the trajectory predicted from the integration of the equations of motion is affected by the time step chosen; however, stability constraints on the flow solver are normally more restrictive. In typical store separation calculations, the time step has been limited to 0.1-1.0 millisecond; thus, hundreds or even thousands of iterations are often required.
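In outline, a trajectory calculation nests this four-function loop inside a loop over many small time steps. The following minimal Python sketch is illustrative only: the function names and data structures are hypothetical stand-ins for the concepts above, not Beggar's actual API.

```python
def assemble_grids(grids):
    """Grid assembly: cut holes and locate interpolation stencils."""
    return {}  # placeholder intergrid connectivity

def solve_flow(grids, connectivity, dt):
    """Advance the flow solution one time step in a time-accurate manner."""
    return {}  # placeholder flow field

def integrate_loads(flow, body):
    """Integrate the flow solution over the body's solid surfaces."""
    return {}  # placeholder aerodynamic forces and moments

def integrate_6dof(body, loads, dt):
    """Integrate the rigid body equations of motion (aerodynamic loads,
    gravity, and applied loads such as ejectors) and reposition the grids."""

def run_trajectory(grids, moving_bodies, dt, n_steps):
    # hundreds or thousands of steps at dt ~ 0.1-1.0 ms
    for _ in range(n_steps):
        connectivity = assemble_grids(grids)   # repeated because grids moved
        flow = solve_flow(grids, connectivity, dt)
        for body in moving_bodies:
            loads = integrate_loads(flow, body)
            integrate_6dof(body, loads, dt)
```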
As the complexity of flow simulations continues to increase, it becomes more critical to utilize parallel computing to reduce solution turnaround times. The parallel implementation of the Beggar flow solver was first presented by Belk and Strasburg [6]. This flow solver uses a finite volume discretization and flux difference splitting based on Roe's approximate Riemann solver [7]. The solution method is a Newton-Relaxation scheme [8]; that is, the discretized, linearized, governing equations are written in the form of Newton's method, and each step of the Newton's method is solved using symmetric Gauss-Seidel (SGS) iteration. The SGS iterations, or inner iterations, are performed on a grid by grid basis, while the Newton iterations, or dt iterations, are used to achieve time accuracy and are performed on all grids in sequence. In this reference, the separate grids are used as the basis for data decomposition. The grids, which represent separate flow solution tasks, are distributed across multiple processors and the flow solver is executed concurrently. The only communication between processes is the exchange of flow field information at block-to-block, patched, and overlapped boundaries between dt iterations. The grid assembly is performed only once; thus, only static grid problems are addressed.

It is also desirable to utilize parallel computing to reduce the turnaround time of moving body problems such as the ripple release configuration. In order to do so, the grid assembly function must be executed each time grids are moved. An efficient, scalable parallel implementation of any process requires that both the computation and the required data be evenly distributed across the available processors while minimizing the communication between processors. The movement of the grids and the time variation in the holes being cut, as illustrated in figure 1.3, indicate the dynamic and unstructured nature of the grid assembly work load and data structures. This makes an efficient implementation a challenging task. Thus, the primary focus of this work is the parallel implementation of the grid assembly function so that store separation trajectories can be calculated using time accurate CFD and parallel computers.

A logical succession of incremental steps is used to facilitate the parallel implementation of the grid assembly function. The initial implementation (phase I) uses a single process to perform the entire grid assembly in a serial fashion with respect to the parallel execution of the flow solver. This requires that proper communication be established between the flow solution function and the grid assembly function; however, it does not require any consideration of load balancing or partitioning of the grid assembly function. The grid assembly function is not accelerated, but the flow solution is.

In the second implementation (phase II), parallel efficiency is gained by overlapping the grid assembly function and the flow solution function. This overlapping of work is possible because of the use of the Newton-Relaxation method within the flow solver. Each step of the approximate Newton's method produces an approximation to the flow solution at the next time step. Approximate aerodynamic forces and moments are calculated from the flow solution after the first Newton step and are used to drive the grid assembly function, while additional Newton steps are being calculated to achieve time accuracy. As long as there is sufficient time to hide the work of the grid assembly function, the speedup is affected only by the performance of the flow solver. However, as the processor count increases, the time of the flow solution available to hide the grid assembly decreases and the rate of change of speedup with respect to processor count decreases. Therefore, it is important to distribute the work of the grid assembly function to make the entire process scalable to higher processor counts.
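The speedup estimates for each phase are developed in chapter 6 (figure 6.2 compares the phase I estimate to Amdahl's law). As a point of reference only, and stated here in its textbook form rather than as the dissertation's phase-specific estimate: if a fraction $f$ of the work (here, the flow solution) is spread over $P$ processors while the remainder (here, grid assembly) stays serial, Amdahl's law bounds the speedup as

$$ S(P) = \frac{1}{(1-f) + f/P} < \frac{1}{1-f}, $$

so a serial grid assembly caps the achievable speedup no matter how many processors the flow solver uses.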
The third implementation (phase III) uses data decomposition of the grid assembly function to reduce its execution time and thus allows the grid assembly time to be more easily hidden by the flow solution time. The basis for the data decomposition is the superblock, which is a group of grids that contain block-to-block connections and are overlapped with other superblocks. In this implementation, the work and data structures associated with a superblock are distributed over multiple processors. Dynamic load balancing is used to improve the performance by moving superblocks between processes. The relatively small number of superblocks used in most problems places a limit on the number of processors that can be effectively utilized.

Thus, in order to improve scalability, the fourth implementation (phase IV) uses a fine grain decomposition of the work associated with grid assembly. The work of the grid assembly function can be associated with the facets that cut holes into overlapping grids and the cell centers that require interpolation. Therefore, the hole cutting facets and the IGBPs form the basis for the fine grain distribution of the work associated with grid assembly.

This dissertation gives complete details of the implementation options for including the grid assembly function into the parallel execution of moving body CFD computations. Each implementation builds upon the previous implementation, attacking the limitations in order to improve performance. Details of the performance analysis are included. Equations for estimating the performance are also presented. With practical experience and some further development, these implementations and performance estimators could offer optimum execution guidelines for particular problems.

Related Work

Grid Assembly

Table 1.1 lists some of the codes that are currently available for assembling overset grid systems, along with some of the advantages and disadvantages of each code. Since the author is not completely familiar with the operation of all of these codes, some of the disadvantages (or advantages) may only be perceived. In general, finding the root cause of a failure in the grid assembly process is a difficult task. Therefore, it is a disadvantage of overset grids in general and is not listed as a disadvantage for any of the codes, although some of the codes provide better aids for debugging than do others. Likewise, the use of orphan points (points that fail to be properly interpolated and are given averaged values from neighbors) can help to ensure that grid assembly does not fail. However, orphan points are not listed as an advantage for any code since they can adversely affect the flow solution.

Table 1.1: Grid assembly codes

  PEGSUS
    Advantages: first code; large user base.
    Disadvantages: slow; requires a lot of user input.
  DCF3D
    Advantages: fast; large user base; well supported.
    Disadvantages: requires significant user input.
  CMPGRD
    Advantages: modern programming techniques; well defined algorithms.
    Disadvantages: not widely distributed.
  BEGGAR
    Advantages: automated grid assembly; allows block-to-block connections; small user input geared toward production work environment; complete flow solution environment.
    Disadvantages: slower than DCF3D; monolithic code; limited user base; has difficulties with overset viscous grids.

PEGSUS [9] is the first and one of the more widely used codes for handling the grid assembly problem. It relies on a set of overlapping grids (block-to-block connections are not allowed). PEGSUS is completely separate from any flow solver but will produce interpolation information for either finite difference or finite volume flow solvers.
DCF3D is not coupled directly with any flow solver but it has been used extensively with the OVERFLOW flow solver [10]. DCF3D uses several alternative approaches in order to improve the efficiency of the grid assembly process. DCF3D uses analytic shapes for hole cutting which allows grid points to be compared directly to the hole cutting surface. It also uses Cartesian grids, called inverse maps, to improve the efficiency of selecting starting points for the search for interpolation stencils. These techniquesimprove the efficiency of the grid assembly process; however, an additional burden is placed on the user to define the analytical shapes and the extent and density of the inverse maps. More recently, improvements to DCF3D have been proposed in order to reduce the burden placed on the user. These improvements include the use of holemap tech nology and the iterative adjustment of the connectivity information [11]. Holemap technology uses Cartesian grids to map the hole cutting surfaces in an approximate, stair stepped fashion. This would allow the automatic creation of the hole cutting surfaces and an efficient means of identifying hole points. The iterative process of ad justing the connectivity information by expanding and contracting the holes in order to minimize the overlap between grids also offers benefits. Yet another code that addresses the grid assembly problem is CMPGRD [12]. This code is an early version of the grid assembly process that has been included in OVERTURE [13]. This tool does not appear to be highly optimized; moreover, its strengths seem to be in its well defined algorithms for the grid assembly process. The algorithms can produce minimum overlap between grids and other quality measures are considered in the donor selection process. In comparison to the above mentioned codes, Beggar is unique in that its devel opment has been geared towards the store separation problem and a production work environment. As such, Beggar attempts to automate the entire solution process while reducing the burden of input that is placed on the user. Beggar also uses unique data structures and algorithms in order to maintain the efficiency of the grid assembly process. Store Separation Table 1.2 lists some of the techniques that have been used to calculate store sep aration trajectories. Some of the advantages and disadvantages from each technique are listed. The techniques range from simple curve fits of data from similar configu rations, to wind tunnel experimental methods, to the calculation of the complete flow field from first principles. Engineering level methods (see Dillenius et al. [14] for example) derive aerody namic data from data bases of experimental data, simple aerodynamic correlations, and panel methods with corrections for nonlinear effects such as vortical flow. Such methods are computationally inexpensive but have very limited applicability. 
Table 1.2: Store separation modeling methods

  Engineering level methods
    Advantages: computationally inexpensive; provide quick data for preliminary design.
    Disadvantages: limited applicability.
  Captive trajectory support
    Advantages: wind tunnel accuracy of flow phenomenon.
    Disadvantages: limited range of motion; quasi-steady; high cost; tunnel interference.
  Influence function method
    Advantages: fast solution allows statistical investigation of trajectories.
    Disadvantages: mutual interference effects can be lost.
  Computational fluid dynamics
    Advantages: completely time accurate; flexible; unlimited in configuration; provides data for visualization of the complete flow field.
    Disadvantages: grid generation can be labor intensive; requires significant computing resources; weaknesses in modeling some flow phenomena such as turbulence.

These methods are most useful in preliminary design, but have been applied to the calculation of store separation trajectories.

Store separation events have been simulated in wind tunnels using the captive trajectory support (CTS) system [15]. This technique places a sting mounted store in the flow field of an aircraft wing and pylon. The store is repositioned according to the integration of measured aerodynamic loads and modeled ejector loads. Since the store can not be moved in real time, an angle-of-attack correction is made based on the velocity of the moving store. This technique is quasi-steady and often limited in the range of motion due to the sting mechanism.

Store separation trajectories have also been calculated using wind tunnel data and an influence function method (IFM) [16]. This method uses wind tunnel data to define flow angularity near an aircraft wing and pylon. These data are used to apply a delta to the freestream forces and moments of the store, assuming that the store does not affect the aircraft flow field. Another correction is made for mutual interference using wind tunnel data of the store in carriage position. Jordan [17] gave a detailed comparison of loads calculated from IFM and CFD versus loads measured in the wind tunnel. IFM was shown to be inaccurate due to mutual interference that is not directly related to flow angle. The distance at which mutual interference becomes insignificant must also be well known.

Such semi-empirical techniques can also be used with CFD data replacing part or all of the wind tunnel data. In a recent special session at the AIAA Aerospace Sciences Meeting, most of the papers [18, 19, 20] presented used this technique. One paper [21] used quasi-steady CFD. Of the two time-accurate CFD simulations slated to be presented, one was withdrawn and the other was prevented from being presented due to the failure to get clearance for public release.

When considering time-accurate CFD calculations for moving body problems, the decomposition of the domain (grid type) has a significant impact on the solution process. Table 1.3 lists several of the different grid methods in use, along with some of the advantages and disadvantages of each. Cartesian grids (see [22] for example) have been used for moving body problems, but the treatment of boundary conditions can be complicated. Boundary conforming block structured grids have been used to calculate store separation trajectories [23]; however, the motion of a store within a block structured grid requires grid stretching and deformation. This places a limit on the motion before regridding is required due to errors introduced by grid skewness. Unstructured grids have also been applied to the store separation problem (see [24] for example).
The flexibility of unstructured grid generation eases the grid generation burden but complicates the flow solver. Octree grids have also been used to ease the grid generation burden and allow adaptive grid refinement. SPLITFLOW [25] represents a compromise between these unstructured grid techniques. A prismatic grid is used near solid surfaces to simplify boundary conditions, and an octree grid is used away from the solid surfaces and offers adaption and some organizational structure. Chimera grid methods are also a compromise and have been applied extensively to the store separation problem [4, 26, 27, 28, 29, 30]. They can be viewed as locally structured, but globally unstructured.

Table 1.3: Grid generation methods

  Cartesian
    Advantages: small memory requirements; fast flow solver.
    Disadvantages: difficult treatment of boundary conditions; poor viscous solution capabilities.
  Structured
    Advantages: general treatment of flow solver and boundary conditions.
    Disadvantages: restricted to simple geometries.
  Block structured
    Advantages: extension to complex geometries.
    Disadvantages: grid generation is time consuming; grid motion or adaption is difficult.
  Quad tree
    Advantages: easily adapted.
    Disadvantages: difficult treatment of boundary conditions; connectivity information required.
  Unstructured
    Advantages: automatic grid generation; easily adapted.
    Disadvantages: larger memory requirements; slower flow solvers; connectivity information required; weak viscous solution capabilities.
  Chimera
    Advantages: structured grid flow solvers and boundary conditions; eases grid generation burden; allows grid movement.
    Disadvantages: connectivity (only at IGBPs) must be constructed separate from the grid generation process.

Time accurate CFD has been validated for use in calculating store separation trajectories. Lijewski [28] presented the first complete system for calculating store separation trajectories. Lijewski [28] also presented the first use of a particular set of wind tunnel CTS data for store separation code validation. The configuration is a single, sting mounted, ogive-cylinder-ogive store under a generic pylon and wing. Grids for the generic store are shown in figure 1.4.

Figure 1.4: Grids for single generic store trajectory calculation

Data, first presented by Prewitt et al. [4], for the subsonic and supersonic trajectories of the single generic store are shown in figures 1.5 and 1.6. The CTS data are shown by the symbols and the time accurate CFD calculations are shown by the curves. These comparisons show excellent agreement between the wind tunnel data and time accurate CFD calculations for this test problem.

Figure 1.5: Mach 0.95 single store trajectory calculated (left) CG position and (right) angular position versus wind tunnel CTS data

More complex configurations have also been used for validation cases. Cline et al. [31] presented store separation trajectories from an F-16 aircraft configuration including a fuel tank, pylon, and an aerodynamic fairing at the junction of the pylon and the wing. Coleman et al. [32] presented separation trajectories for the MK-84 from the F-15E aircraft. This configuration included a centerline fuel tank, a LANTIRN targeting pod, an inboard conformal fuel tank (CFT) weapons pylon with attached MK-84, forward and middle stub pylons on the outside of the CFT, LAU-128 rail launchers with AIM-9 missiles on both sides of the wing weapons pylon, and the MK-84 to be released from the wing weapons pylon. Both references compared
Both references compared E 2.0 z 6.0 pitch o0 1 0 .0 7i r 2 .0 o  roll 1.0 0.0 0 < 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.00 0.05 0.10 0.15 0.20 0.25 0.30 time (sec) time (sec) Figure 1.6: Mach 1.20 single store trajectory calculated (left) CG position and (right) angular position versus wind tunnel CTS data trajectory calculations to wind tunnel data and were instrumental in the approval of the use of CFD by engineers in evaluating the store separation characteristics of weapons. Parallel Computing Although much work has been done on the parallel execution of CFD flow solvers, including Chimera method flow solvers, little work has been done on the efficient par allel implementation of Chimera methods for moving body problems. In particular, there are very few references on the parallel treatment of the grid assembly problem. Table 1.4 gives a list of references of some of the more important developments in parallel computing as related to Chimera grid methods and grid assembly. Smith [33] presents the parallel implementation of an overset grid flow solver for a network based heterogeneous computing environment. This flow solver was derived from OVERFLOW and uses coarse grain parallelism with the component grids being distributed among the available processors. A master/slave model is used. The master process performs all i/o functions, maintains all of the interpolated flow solution data, and communicates with each of the slave processes. The slave processes calculate the flow solution and perform the interpolation of flow solution data. A load balancing 15 technique is used and the interpolation of flow solution data is overlapped with the calculation of the flow solution to reduce load imbalances. The grid communication information was supplied as an input and only static problems were addressed. Wissink and Meakin [34] presented the application of a Chimera grid flow solver based on OVERFLOW and DCF3D. This code uses overlapping structured grids near the solid boundaries in order to resolve viscous effects and uses spatially refined Cartesian blocks in the rest of the domain. Parallel performance was presented but only static problems were addressed. The same code was again presented by Meakin and Wissink [35]. Two dynamic problems were presented in this reference; however, the focus was on the ability to adapt the Cartesian blocks due to flow solution and body motion. Some parallel performance data are presented based on an iteration of the flow solver. No performance data were presented for an entire simulation which would include the performance of the grid assembly. The first presentation of the parallel implementation of grid assembly for dy namic, overset grids was by Barszcz et al. [36]. DCF3D was parallelized and used in connection with a parallel version of OVERFLOW on a distributed memory parallel machine. A coarse grain parallelism was implemented with the data decomposition based on component grids. A static load balance was used based on balancing the load of the flow solver. Since the flow solver represented a large portion of the total work, load balancing the flow solver is important to achieving a good overall load bal ance; however, significant imbalances were seen in the grid assembly processes. Donor cell identification was found to be the most time consuming part of grid assembly and algorithm changes were implemented to reduce this part of the work load. Weeratunga and Chawla [37] again used DCF3D and OVERFLOW on a dis tributed memory parallel machine. 
Again, the component grids are used for data decomposition and load balancing is based on the work load of the flow solver. No consideration is given to the distribution of donor elements or IGBPs. The primary focus in this reference is on demonstrating the scalability of the processes used. In this study, the donor search method scaled well; however, the hole cutting and the identification of points requiring interpolation did not scale well.

Wissink and Meakin [38] present the first attempt to load balance the grid assembly process. However, the data decomposition is still based on the component grids and affects the load balance of both the flow solver and the grid assembly function. A static load balance is initially performed to equally distribute the numbers of grid points, which helps to load balance the flow solver. A dynamic load balancing routine is then used during a calculation to redistribute the grids to improve the load balance of grid assembly. This, in turn, creates an imbalance in the flow solver. This algorithm offers a method of improving performance if an imbalance in the grid assembly work load is a major deterrent. However, in the problems presented, the flow solver represented the major part of the work load, and any redistribution of grids in order to improve the grid assembly load balance actually decreased the overall code performance.

Dissertation Outline

Chapter 2 presents details of the algorithms and data structures of the grid assembly process. For completeness, chapter 3 presents the flow solution algorithm and chapter 4 presents the integration of the 6DOF rigid body equations of motion. Chapter 5 presents an overview of programming parallel computers and outlines the approaches used in the current work. Chapter 6 gives some details of the proposed implementations including equations for estimating speedup. Chapter 7 presents the ripple release test problem used for all timings of the implementations. The results of the timings are presented in chapter 8. The final conclusions and some possibilities for future work are presented in chapter 9.
Table 1.4: Significant accomplishments in parallel computing in relation to overset grid methods

  Smith and Pallis, 1993 [33]
    Accomplishment: parallel implementation of OVERFLOW for heterogeneous computing environments.
    Limitation: restricted to static problems.
  Barszcz, Weeratunga, and Meakin, 1993 [36]
    Accomplishment: first parallel implementation of grid assembly.
    Limitation: data decomposition and static load balance tied to flow solver.
  Weeratunga and Chawla, 1995 [37]
    Accomplishment: detailed study of scalability of parallel implementation of DCF3D.
    Limitation: data decomposition and static load balance tied to flow solver.
  Belk and Strasburg, 1996 [6]
    Accomplishment: first parallel implementation of Beggar.
    Limitation: restricted to static problems.
  Wissink and Meakin, 1997 [38]
    Accomplishment: first attempt to load balance grid assembly.
    Limitation: decomposition of grid assembly tied to flow solver means any improvement in the load balance of grid assembly adversely affects the load balance of the flow solver.
  Wissink and Meakin, 1998 [34]
    Accomplishment: small, near body, curvilinear grids used in combination with adaptive Cartesian grids.
    Limitation: only static problems were presented.
  Prewitt, Belk, and Shyy, 1998 [39]
    Accomplishment: first parallel implementation of Beggar for dynamic problems; overlapping of grid assembly and flow solution time.
    Limitation: limited scalability.
  Meakin and Wissink, 1999 [35]
    Accomplishment: included dynamic problems with combined overset grids and adaptive Cartesian grids.
    Limitation: no performance of dynamic grid assembly was presented.
  Prewitt, Belk, and Shyy, 1999 [40]
    Accomplishment: coarse grain decomposition and dynamic load balancing of grid assembly based on superblocks, independent of flow solver.
    Limitation: major functions within the grid assembly are not individually well balanced.

CHAPTER 2
GRID ASSEMBLY

Although Beggar is useful for general external compressible fluid flow problems, its primary focus during development has been on the simulation of store carriage and separation events. A typical grid system includes grids for an aircraft, pylons, launchers, and stores. The grids are often placed inside a large rectangular grid which serves as a background grid that reaches to freestream. Due to disparity in grid spacing between overlapping grids, it is often necessary to introduce other grids that serve as an interface to aid communication. The stores are often bodies of revolution with attached wings, canards, and/or fins. Blocked grid systems are used for these relatively simple geometries; however, in order to allow grid movement, such blocked grids are treated as overlapping grids with respect to other grids.

The superblock construct is introduced to aid in grid assembly. The superblock is a collection of non-overlapping grids which are treated as a single entity. Block-to-block connections are allowed only within a superblock; thus, a superblock is often used to implement a blocked system of grids for part of the solution domain. Overlapping connections are allowed only between different superblocks. A dynamic group is a collection of one or more superblocks that is treated as a single entity by the 6DOF. The dynamic group is used primarily to group grids which are part of the same moving body. There is always at least one dynamic group: the static dynamic group. This holds the static grids like the background grid, the aircraft grid, pylon grids, or store grids that do not move relative to the aircraft. Other dynamic groups are created for each store that will move relative to the static dynamic group. Since a dynamic group may contain one or more superblocks, each moving body can be constructed from a system of blocked grids in a single superblock, a system of overlapping grids in multiple superblocks, or a combination of the two.
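The grid/superblock/dynamic group hierarchy can be pictured with a small sketch. The Python classes, grid names, and dimensions below are hypothetical illustrations of the concept, not Beggar's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Grid:
    """One structured, curvilinear grid block."""
    name: str
    dims: Tuple[int, int, int]   # (NI, NJ, NK) grid points

@dataclass
class Superblock:
    """Non-overlapping grids treated as a single entity; block-to-block
    connections may exist only inside a superblock."""
    grids: List[Grid] = field(default_factory=list)

@dataclass
class DynamicGroup:
    """One or more superblocks moved as a single entity by the 6DOF."""
    superblocks: List[Superblock] = field(default_factory=list)

# A notional carriage/release layout: everything fixed to the aircraft lives
# in the static dynamic group; each separating store gets its own group.
static = DynamicGroup([Superblock([Grid("background", (81, 81, 81)),
                                   Grid("aircraft", (121, 49, 65)),
                                   Grid("pylon", (41, 25, 33))])])
stores = [DynamicGroup([Superblock([Grid(f"store_{i}", (97, 33, 49))])])
          for i in range(3)]
```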
Polygonal Mapping Tree

In order to establish overlapped grid communications, the following questions must be answered: does this point lie inside a grid, and if so, what is an appropriate interpolation stencil? These questions represent point-volume geometric relationships. In order to determine such relationships, Beggar uses a polygonal mapping (PM) tree, which is a combination of the octree and binary space partitioning (BSP) tree data structures [41, 42].

An octree is a data structure in which a region of space is recursively subdivided into octants. Each parent octant is divided into eight children which can be further subdivided. This forms a hierarchy of ancestor and descendant octants. Each octant in the tree is termed a node, with the beginning node (lowest level) being the root node and the most descendent nodes (highest levels) being the leaf nodes. Such a data structure allows a domain to be divided into $8^n$ subdomains using just $n$ levels. Associated with each node are the Cartesian coordinates of the center of the octant. Which child octant a point lies in can be identified by comparing the coordinates of the point against the coordinates of the center of the parent octant. With such a data structure, a point can be identified as lying within a particular octant out of $8^n$ octants by using at most $n$ comparisons (if the tree is perfectly balanced).

The BSP tree is a binary tree data structure in which each node of the tree is represented by a plane definition. Each node has two children representing the in and out sides of the plane. For a faceted representation of a surface, each facet defines a plane that is inserted into the BSP tree. While being inserted, the facet may be clipped against existing planes in the BSP tree, placing pieces of the same plane definition into different branches of the tree. Using a given point, the BSP tree is traversed by comparing the point against a plane definition at each level to determine which branch to descend into. Once a leaf node is reached, the point is identified as being inside or outside of the faceted surface.

In theory, a BSP tree of the cell faces on the boundaries of a grid block could be used to determine whether a point is IN or OUT of that particular grid. However, due to the clipping process, the BSP tree can be prone to round-off error. Likewise, the structure of the tree is dependent on the order in which facets are inserted, and it is not guaranteed to be well balanced. If the tree were to become one-sided, a point may have to be compared against all or most of the facets on a surface to determine its relationship to that surface. Therefore, Beggar uses a combination of the octree and BSP tree data structures. The octree, which stays well balanced, is used to quickly narrow down the region of space in which a point lies. If a point lies in a leaf node that contains an overlapping boundary grid surface, it must be compared to a BSP tree that is stored in that leaf node to determine its relationship to that boundary surface and therefore its relationship to the grid itself. The PM tree data structure is built by refining the octree in a local manner until no octant contains more than one grid boundary point from the same superblock.
This produces a regular division of space that adapts to grid boundaries and grid point density. The boundary cell faces of the grids are then used to define facets which are inserted into BSP trees stored at the leaf nodes of the octree. Since each grid boundary point is normally shared by four cell faces and each octant contains only one grid boundary point, the BSP trees stored at the octree leaf nodes should be very shallow.

Once the basic data structure is complete, all of the octants of the leaf nodes are classified relative to the grid boundaries. Each octant is classified as inside or outside of each superblock or as a boundary octant. Then points can be classified efficiently relative to the superblocks. To do so, the octant in which the point lies is found. If the octant has been classified as IN or OUT, the point can be immediately classified as IN or OUT. However, if the point lies in a boundary octant, the point must be compared against the BSP tree that is stored in that octant.

Figure 2.1 represents a quadtree (the two-dimensional equivalent of an octree) for a four block O-grid around an airfoil. Only the grid points on the boundaries are used to define the level of refinement, so only the boundaries of the grid are shown. The grid boundaries are inserted into the leaf octants as BSP trees to form the PM tree. A portion of the PM tree that might result is shown in figure 2.2. The leaf octants are represented by squares, while the other nodes are represented by circles. The four branches at each node represent the four quadrants of an octant. The line segments shown in some of the leaf octants represent portions of the grid boundaries that would be placed in BSP trees. If a point being classified against the PM tree falls into one of these leaf octants, it must be compared against the facets to determine its relationship to the grid. The empty leaf octants that are drawn with solid lines are completely inside the grid, while the leaf octants that are drawn with dashed lines are completely outside the grid. Points that fall into either of these types of octants can immediately be classified relative to the grid.

Figure 2.1: Example quad tree mesh

Figure 2.2: Example PM tree structure

The PM tree is expensive to construct and would be very inefficient to use if it had to be reconstructed each time a grid moved. Therefore, for each dynamic group, a set of transformations is maintained between the current position and the original position in which the PM tree was built. Whenever the PM tree is used to find an interpolation stencil in one grid for a grid point in another grid, the transformations are used to transform the grid point to the original coordinate system. The transformed grid point can then be used with the PM tree constructed in the original coordinate system of the grids. Thus the PM tree must be constructed only once.

Interpolation Stencils

The primary function of the PM tree is to help find interpolation stencils for grid points which require interpolated flow field information. When an interpolation stencil is required, the PM tree is used to classify the corresponding grid point relative to each superblock. This quickly identifies which superblocks and grids the grid point lies in and therefore which superblocks might offer a source of interpolation information. This answers the in/out question directly.
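The point classification query combines the two traversals just described: an octree descent by center comparisons, followed, only in boundary octants, by a BSP walk. The following minimal Python sketch uses simplified, assumed data structures to show that logic; it is not Beggar's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

IN, OUT, BOUNDARY = "in", "out", "boundary"

@dataclass
class BspNode:
    normal: Optional[np.ndarray] = None   # interior node: facet plane normal...
    offset: float = 0.0                   # ...and offset (plane: n . x = offset)
    in_side: Optional["BspNode"] = None
    out_side: Optional["BspNode"] = None
    label: str = OUT                      # leaf node: IN or OUT of the surface

@dataclass
class Octant:
    center: np.ndarray                    # Cartesian center of the octant
    children: List["Octant"] = field(default_factory=list)  # 8 children or none
    state: str = OUT                      # IN, OUT, or BOUNDARY for a superblock
    bsp: Optional[BspNode] = None         # BSP tree held by boundary leaf octants

def classify(root: Octant, p: np.ndarray) -> str:
    """Classify point p against one superblock via the PM tree."""
    node = root
    while node.children:                  # at most n comparisons for 8^n octants
        i = ((p[0] > node.center[0]) << 2
           | (p[1] > node.center[1]) << 1
           | (p[2] > node.center[2]))
        node = node.children[i]
    if node.state != BOUNDARY:            # leaf fully inside or outside the grid
        return node.state
    b = node.bsp                          # boundary octant: walk its shallow BSP tree
    while b.normal is not None:
        b = b.in_side if float(np.dot(b.normal, p)) < b.offset else b.out_side
    return b.label
```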
However, once a point is identified as being inside a given superblock, the exact curvilinear coordinates corresponding to the Cartesian coordinates of the grid point must be found. For a curvilinear grid defined by the coordinates of the intersections of three families of boundary conforming grid lines denoted by (ξ, η, ζ), the coordinates at any point within a cell can be calculated from trilinear interpolation

\[
\begin{aligned}
\mathbf{R}(\xi,\eta,\zeta) ={}& (1-u)(1-v)(1-w)\,\mathbf{r}(I,J,K) + (1-u)(1-v)w\,\mathbf{r}(I,J,K+1) \\
&+ (1-u)v(1-w)\,\mathbf{r}(I,J+1,K) + (1-u)vw\,\mathbf{r}(I,J+1,K+1) \\
&+ u(1-v)(1-w)\,\mathbf{r}(I+1,J,K) + u(1-v)w\,\mathbf{r}(I+1,J,K+1) \\
&+ uv(1-w)\,\mathbf{r}(I+1,J+1,K) + uvw\,\mathbf{r}(I+1,J+1,K+1)
\end{aligned} \qquad (2.1)
\]

where r(I,J,K), r(I+1,J,K), ... are the known coordinates of the eight corners of a cell. The index (I,J,K) denotes the three-dimensional equivalent of the lower left corner of a two-dimensional cell, while (u,v,w) vary between 0 and 1 throughout the cell so that

\[
\xi = I + u, \quad I = 1, 2, \ldots, NI-1, \quad 0 \le u \le 1
\]
\[
\eta = J + v, \quad J = 1, 2, \ldots, NJ-1, \quad 0 \le v \le 1
\]
\[
\zeta = K + w, \quad K = 1, 2, \ldots, NK-1, \quad 0 \le w \le 1 \qquad (2.2)
\]

and R(ξ,η,ζ) is a piecewise continuous function over the entire grid.

For every known point r that lies within a grid, there exists some (ξ,η,ζ) such that r = R(ξ,η,ζ). However, in order to find the (ξ,η,ζ) that corresponds to a known r, the nonlinear function F = R(ξ,η,ζ) − r must be driven to zero. Newton's method can be used to find the root of this function iteratively using

\[
\boldsymbol{\xi}^{m+1} = \boldsymbol{\xi}^{m} - \left[ \frac{\partial \mathbf{F}}{\partial \boldsymbol{\xi}} \right]^{-1,m} \mathbf{F}^{m} \qquad (2.3)
\]

where ξ is the curvilinear coordinate vector (ξ, η, ζ), m is the Newton iteration counter, and the Jacobian matrix is calculated from the equations

\[
\frac{\partial \mathbf{F}}{\partial \xi} = \mathbf{C}_1 + v\mathbf{C}_3 + w\left[\mathbf{C}_5 + v\mathbf{C}_7\right]
\]
\[
\frac{\partial \mathbf{F}}{\partial \eta} = \mathbf{C}_2 + u\mathbf{C}_3 + w\left[\mathbf{C}_6 + u\mathbf{C}_7\right]
\]
\[
\frac{\partial \mathbf{F}}{\partial \zeta} = \mathbf{C}_4 + u\mathbf{C}_5 + v\left[\mathbf{C}_6 + u\mathbf{C}_7\right] \qquad (2.4)
\]

where

\[
\begin{aligned}
\mathbf{C}_1 &= \mathbf{r}(I+1,J,K) - \mathbf{r}(I,J,K) \\
\mathbf{C}_2 &= \mathbf{r}(I,J+1,K) - \mathbf{r}(I,J,K) \\
\mathbf{C}_3 &= \mathbf{r}(I+1,J+1,K) - \mathbf{r}(I,J+1,K) - \mathbf{C}_1 \\
\mathbf{C}_4 &= \mathbf{r}(I,J,K+1) - \mathbf{r}(I,J,K) \\
\mathbf{C}_5 &= \mathbf{r}(I+1,J,K+1) - \mathbf{r}(I,J,K+1) - \mathbf{C}_1 \\
\mathbf{C}_6 &= \mathbf{r}(I,J+1,K+1) - \mathbf{r}(I,J,K+1) - \mathbf{C}_2 \\
\mathbf{C}_7 &= \mathbf{r}(I+1,J+1,K+1) - \mathbf{r}(I,J+1,K+1) - \mathbf{r}(I+1,J,K+1) + \mathbf{r}(I,J,K+1) - \mathbf{C}_3
\end{aligned} \qquad (2.5)
\]

Newton's method needs a good starting point; therefore, stored in the leaf nodes of the octree and the BSP trees are curvilinear coordinates at which to start the search. Although the PM tree classifies a point relative to a superblock, a starting point identifies a particular cell within a particular grid of the superblock. If the octree is sufficiently refined, the starting point should be close enough to ensure that stencil jumping will converge. As the curvilinear coordinates ξ are updated with equation 2.3, if u, v, or w exceeds the range of 0 to 1, the search proceeds to a neighboring cell and the Jacobian matrix, as well as the corners of the containing cell, must be updated. This algorithm is commonly called stencil jumping.
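The stencil jumping loop can be sketched as below; the grid.cell_corners accessor and the convergence settings are assumptions, and a real implementation must also guard against leaving the grid entirely.

```python
import numpy as np

def stencil_jump(grid, r, I, J, K, tol=1e-10, max_iter=50):
    """Newton search (equations 2.1-2.5) for the cell and (u, v, w) where
    R(xi, eta, zeta) = r, starting from cell (I, J, K)."""
    u = v = w = 0.5                        # start at the cell center
    for _ in range(max_iter):
        c = grid.cell_corners(I, J, K)     # assumed 2x2x2 array of corner coords
        C1 = c[1,0,0] - c[0,0,0]
        C2 = c[0,1,0] - c[0,0,0]
        C3 = c[1,1,0] - c[0,1,0] - C1
        C4 = c[0,0,1] - c[0,0,0]
        C5 = c[1,0,1] - c[0,0,1] - C1
        C6 = c[0,1,1] - c[0,0,1] - C2
        C7 = c[1,1,1] - c[0,1,1] - c[1,0,1] + c[0,0,1] - C3
        F = (c[0,0,0] + u*C1 + v*C2 + w*C4
             + u*v*C3 + u*w*C5 + v*w*C6 + u*v*w*C7) - r
        if np.linalg.norm(F) < tol:
            return I, J, K, (u, v, w)
        jac = np.column_stack([C1 + v*C3 + w*(C5 + v*C7),    # dF/dxi
                               C2 + u*C3 + w*(C6 + u*C7),    # dF/deta
                               C4 + u*C5 + v*(C6 + u*C7)])   # dF/dzeta
        du, dv, dw = np.linalg.solve(jac, -F)
        u, v, w = u + du, v + dv, w + dw
        # Jump to a neighboring cell whenever (u, v, w) leaves the unit cube.
        if not 0.0 <= u <= 1.0: I += int(np.floor(u)); u -= np.floor(u)
        if not 0.0 <= v <= 1.0: J += int(np.floor(v)); v -= np.floor(v)
        if not 0.0 <= w <= 1.0: K += int(np.floor(w)); K if False else None; w -= np.floor(w)
    return None                            # search failed to converge
```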
Hole Cutting

Beggar uses an outline-and-fill algorithm for cutting holes. In this algorithm, the facets of the hole cutting surface are used to create an outline of the hole. The cells of a grid through which a hole cutting facet passes are located by using the PM tree to locate the cells containing the vertices of the facet. These cells are compared to the facet and are marked as being either on the hole side or the world side of the hole cutting surface. If the cells containing the facet vertices are not neighbors, the facet is subdivided recursively and new points on the hole cutting facet are introduced. These new points are processed just like the original facet vertices to ensure a continuous outline of the hole cutting surface.

Once the complete hole cutting surface is outlined, the hole is flood filled by sweeping through the grid and marking as hole points any points that lie between hole points or between a grid boundary and a hole point. The marking of holes is capped off by the world side points created from the outline. This process is able to mark holes without comparing every grid point against each hole cutting surface, and it places no special restrictions on how the hole cutting surfaces are defined as long as they are completely closed. It also allows holes to be cut using infinitely thin surfaces.

During the search for interpolation stencils, it is possible that a stencil may be found that is in some way undesirable. If no other interpolation stencil can be found for this point, then the point is marked out and an attempt is made to find an interpolation stencil for a neighboring point. This process essentially grows the hole in an attempt to find a valid grid assembly.

There are several weaknesses in this hole cutting algorithm. During the flood fill, if the hole outline is not completely surrounded by world side points, a leaky hole can result and the complete grid can be marked out. Conversely, the use of recursive subdivision of facets to ensure that a complete hole is outlined can dramatically increase execution time when hole cutting surfaces cut across a singularity or a region of viscous spacing. In such cases, it is possible to coarsely outline the hole and to use the natural process of marking out points which fail interpolation rather than doing the flood fill. This option is often referred to as the "no-fill" option, based on the command line argument that is used to invoke it and the fact that the holes are outlined but are not filled.

Donors and Receptors

One of the more important concepts is how to handle block-to-block and overlapped communications. Beggar introduces the concept of donors and receptors to deal with the communication introduced by these two boundary conditions. Since the flow solver uses a finite volume discretization, flow field information is associated with the grid cells or cell centers. A receptor will grab flow field information from one cell and store it in another cell. The receptor only needs to know the grid and cell from which to get the information. A donor will interpolate flow field information from a cell and then put the interpolated data into another storage location. The donor needs to know the grid from which to interpolate data, as well as an interpolation stencil for use in interpolating data from eight surrounding cell centers. Thus, block-to-block connections can be implemented using only receptors. Overlapped connections are implemented with donors.

If all of the grids' data are stored in core, a donor can be used to interpolate the flow data from one grid and to store the interpolated values into another grid. However, if all of the grids' data are not available, a small donor value array (DVA) is needed to store the intermediate values. A donor, associated with the source grid, is used to perform the interpolation and to store the result into the DVA. Then a receptor, associated with the destination grid, is used to fetch the values from the DVA and store them into the final location.
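A minimal sketch of this bookkeeping is shown below, assuming simple record types and an interpolate helper; the actual Beggar data structures are more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Donor:
    src_grid: int        # grid to interpolate from
    stencil: tuple       # (I, J, K, u, v, w) interpolation stencil
    dva_slot: int        # where the interpolated values are put

@dataclass
class Receptor:
    dst_grid: int        # grid receiving the data
    dst_cell: tuple      # (i, j, k) cell-center storage to overwrite
    dva_slot: int        # where the values are fetched from

def exchange(donors, receptors, grids, dva):
    # Pass 1: each donor interpolates from its source grid into the DVA.
    # interpolate() is an assumed helper performing the eight-cell-center
    # trilinear interpolation described above.
    for d in donors:
        dva[d.dva_slot] = interpolate(grids[d.src_grid], d.stencil)
    # Pass 2: each receptor fetches from the DVA into its destination cell.
    for r in receptors:
        grids[r.dst_grid].q[r.dst_cell] = dva[r.dva_slot]
```

A block-to-block receptor can bypass the DVA and copy directly from the donor cell, since no interpolation is required.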
Boundary Condition Identification

The automatic identification of most of the boundary conditions centers around several interdependent linked lists. The first of these is a list of the points on the boundaries of the grids in each superblock. A tolerance is used to decide if two points are coincident, so that the list contains only one entry for each unique boundary point. Another tolerance is used to decide if a point lies on a user specified reflection plane. Another list is constructed using the unique cell faces on each grid's boundaries. While building this list, degenerate cell faces and cell faces that lie on a reflection plane are identified. The order of the forming points of a cell face is not important for identification; therefore, the forming points are sorted using pointers into the points list. Each cell face can then be associated with the first point in its sorted list of forming points. For a finite volume code, each cell face on a block-to-block boundary connects exactly two cells, from either the same grid or two different grids. Thus, for a given boundary point, if its list of associated cell faces contains two faces that are built from the same forming points, a block-to-block connection is defined.
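The face-matching test lends itself to a simple sketch: once forming points have been merged by tolerance into unique ids, sorting each face's point ids gives an order-independent key, and any key shared by exactly two faces marks a block-to-block connection. The data layout here is an illustrative assumption.

```python
from collections import defaultdict

def find_block_to_block(faces):
    """faces: iterable of (grid_id, cell_index, forming_point_ids) records
    for the unique cell faces on grid boundaries."""
    by_key = defaultdict(list)
    for grid_id, cell, pts in faces:
        # Point order does not matter for identification, so the sorted
        # tuple of unique point ids serves as the face's key.
        by_key[tuple(sorted(pts))].append((grid_id, cell))
    # A key shared by exactly two faces connects exactly two cells.
    return [pair for pair in by_key.values() if len(pair) == 2]
```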
CHAPTER 3
FLOW SOLUTION

Although the flow solver is not the focus of this work, this section is included for completeness. The flow solution algorithm supplies some opportunities for parallelization that affect the total performance of the code. The governing equations are presented, the unique solution algorithms are presented, and the general numerical solution techniques are presented.

Governing Equations

The equations governing the motion of a fluid are the statements of the conservation of mass, momentum, and energy. As an example, Newton's second law of motion describes the conservation of momentum. However, Newton's second law, as presented in most dynamics textbooks, is written to describe the motion of a particle, a rigid body, or a system of particles, that is, a well defined quantity of mass whose motion is described in a Lagrangian reference frame. For the motion of a fluid, it is often more useful to consider flow through a region of space or a control volume (an Eulerian reference frame). Considering a control volume V(t) that is bounded by a surface S(t), Reynolds' Transport Theorem (see Potter and Foss [43, pages 72-87] for an example of the derivation) is used to convert the time rate of change of an extensive property of a system into integrals over the control volume, i.e.

\[
\frac{d}{dt} \int_{V(t)} \phi(\mathbf{x},t)\,dV = \int_{V(t)} \frac{\partial \phi}{\partial t}\,dV + \oint_{S(t)} \phi\,\mathbf{u}\cdot\hat{\mathbf{n}}\,dS \qquad (3.1)
\]

These two terms represent a variation in the conserved property within a volume V(t) due to some internal sources (the volume integral) and a variation due to flux across the boundary surface S(t) (the surface integral). The variable φ represents any conserved quantity (such as ρ, ρu, ρE for mass, linear momentum, and total energy, all per unit volume), u is the local velocity vector, and n̂ is the unit vector normal to dS. The surface integral is converted to a volume integral using the vector form of Gauss's Theorem

\[
\oint_{S} \phi\,\mathbf{u}\cdot\hat{\mathbf{n}}\,dS = \int_{V} \nabla\cdot(\phi\mathbf{u})\,dV \qquad (3.2)
\]

which assumes that ∇·u exists everywhere in V. Thus, the time rate of change of the conserved property can be written as

\[
\frac{d}{dt} \int_{V(t)} \phi(\mathbf{x},t)\,dV = \int_{V(t)} \left[ \frac{\partial \phi}{\partial t} + \nabla\cdot(\phi\mathbf{u}) \right] dV \qquad (3.3)
\]

The time rate of change of the conserved quantity is dependent upon source terms that can act on the volume or on the surface of the volume. If we represent the source terms by a volume integral of a scalar quantity θ and a surface integral of a vector quantity Ψ, the general conservation law can be written as

\[
\int_{V(t)} \left[ \frac{\partial \phi}{\partial t} + \nabla\cdot(\phi\mathbf{u}) \right] dV = \int_{V(t)} \left[ \theta + \nabla\cdot\boldsymbol{\Psi} \right] dV \qquad (3.4)
\]

Since an arbitrary volume is assumed, the integrand must apply for an infinitesimal volume. The integral can be removed to yield the differential form

\[
\frac{\partial \phi}{\partial t} + \nabla\cdot(\phi\mathbf{u}) = \theta + \nabla\cdot\boldsymbol{\Psi} \qquad (3.5)
\]

For the conservation of mass, mass is conserved and there are no source terms. Replacing φ by the density ρ in equation 3.5, the differential continuity equation is

\[
\frac{\partial \rho}{\partial t} + \nabla\cdot(\rho\mathbf{u}) = 0 \qquad (3.6)
\]

or, written in Cartesian coordinates,

\[
\frac{\partial \rho}{\partial t} + \frac{\partial (\rho u)}{\partial x} + \frac{\partial (\rho v)}{\partial y} + \frac{\partial (\rho w)}{\partial z} = 0 \qquad (3.7)
\]

where u, v, and w are the three Cartesian components of the velocity.

For the conservation of momentum, the source terms are the forces acting on the control volume. Ignoring the gravitational and inertial forces, the sum of the forces acting on the system can be written as

\[
\mathbf{F} = -\oint_{S(t)} p\,\hat{\mathbf{n}}\,dS + \oint_{S(t)} \boldsymbol{\tau}\,dS \qquad (3.8)
\]

where τ is the viscous stress vector and p is the pressure. Note that the pressure has been separated from the viscous stress, whereas it is often included as a spherical stress term. The differential form is

\[
F_x = -\frac{\partial p}{\partial x} + \frac{\partial \tau_{xx}}{\partial x} + \frac{\partial \tau_{yx}}{\partial y} + \frac{\partial \tau_{zx}}{\partial z}
\]
\[
F_y = -\frac{\partial p}{\partial y} + \frac{\partial \tau_{xy}}{\partial x} + \frac{\partial \tau_{yy}}{\partial y} + \frac{\partial \tau_{zy}}{\partial z}
\]
\[
F_z = -\frac{\partial p}{\partial z} + \frac{\partial \tau_{xz}}{\partial x} + \frac{\partial \tau_{yz}}{\partial y} + \frac{\partial \tau_{zz}}{\partial z} \qquad (3.9)
\]

where τ_xx, τ_xy, etc. are elements of the viscous stress tensor (see Potter and Foss [43, pages 171-174] for the derivation). This tensor is symmetric so that τ_xy = τ_yx, τ_zx = τ_xz, and τ_zy = τ_yz. Using these equations as the source terms and substituting ρu into equation 3.5 as the conserved quantity, the three Cartesian components of the conservation of momentum are

\[
\frac{\partial (\rho u)}{\partial t} + \frac{\partial (\rho u^2 + p)}{\partial x} + \frac{\partial (\rho u v)}{\partial y} + \frac{\partial (\rho u w)}{\partial z} = \frac{\partial \tau_{xx}}{\partial x} + \frac{\partial \tau_{yx}}{\partial y} + \frac{\partial \tau_{zx}}{\partial z}
\]
\[
\frac{\partial (\rho v)}{\partial t} + \frac{\partial (\rho u v)}{\partial x} + \frac{\partial (\rho v^2 + p)}{\partial y} + \frac{\partial (\rho v w)}{\partial z} = \frac{\partial \tau_{xy}}{\partial x} + \frac{\partial \tau_{yy}}{\partial y} + \frac{\partial \tau_{zy}}{\partial z}
\]
\[
\frac{\partial (\rho w)}{\partial t} + \frac{\partial (\rho u w)}{\partial x} + \frac{\partial (\rho v w)}{\partial y} + \frac{\partial (\rho w^2 + p)}{\partial z} = \frac{\partial \tau_{xz}}{\partial x} + \frac{\partial \tau_{yz}}{\partial y} + \frac{\partial \tau_{zz}}{\partial z} \qquad (3.10)
\]

where the pressure terms have been moved to the left hand side. Formally, these equations are the Navier-Stokes equations. However, in general, the term Navier-Stokes equations is used to refer to the complete set of conservation laws when the viscous terms are included as shown here.

For the conservation of energy, the source terms are the rate of heat transfer to the system minus the rate of work done by the system. Substituting ρE into equation 3.4, the conservation of energy is written as

\[
\int_{V(t)} \left[ \frac{\partial (\rho E)}{\partial t} + \nabla\cdot(\rho E \mathbf{u}) \right] dV = \dot{Q} - \dot{W} \qquad (3.11)
\]

or, in differential form,

\[
\frac{\partial (\rho E)}{\partial t} + \nabla\cdot(\rho E \mathbf{u}) = \dot{Q} - \dot{W} \qquad (3.12)
\]

Ignoring any internal heat sources, the heat transfer rate can be written as

\[
\dot{Q} = -\oint_{S(t)} \dot{\mathbf{q}}\cdot\hat{\mathbf{n}}\,dS \qquad (3.13)
\]

where q̇ is the heat flux vector. This integral can be converted to a volume integral and then written in the differential form

\[
\dot{Q} = -\left( \frac{\partial \dot{q}_x}{\partial x} + \frac{\partial \dot{q}_y}{\partial y} + \frac{\partial \dot{q}_z}{\partial z} \right) \qquad (3.14)
\]

The work rate is due to the forces acting on the surface of the control volume. Ignoring any work by gravitational or inertial forces, the work rate is written in the form

\[
\dot{W} = \oint_{S(t)} p\,\mathbf{u}\cdot\hat{\mathbf{n}}\,dS - \oint_{S(t)} \mathbf{u}\cdot\boldsymbol{\tau}\,dS \qquad (3.15)
\]

The differential form is

\[
\begin{aligned}
\dot{W} ={}& \frac{\partial (p u)}{\partial x} + \frac{\partial (p v)}{\partial y} + \frac{\partial (p w)}{\partial z} - \frac{\partial}{\partial x}\left( \tau_{xx} u + \tau_{xy} v + \tau_{xz} w \right) \\
&- \frac{\partial}{\partial y}\left( \tau_{yx} u + \tau_{yy} v + \tau_{yz} w \right) - \frac{\partial}{\partial z}\left( \tau_{zx} u + \tau_{zy} v + \tau_{zz} w \right)
\end{aligned} \qquad (3.16)
\]

Plugging equations 3.14 and 3.16 into equation 3.12 yields the final differential form of the conservation of energy equation

\[
\begin{aligned}
&\frac{\partial (\rho E)}{\partial t} + \frac{\partial\,u(\rho E + p)}{\partial x} + \frac{\partial\,v(\rho E + p)}{\partial y} + \frac{\partial\,w(\rho E + p)}{\partial z} = -\frac{\partial \dot{q}_x}{\partial x} - \frac{\partial \dot{q}_y}{\partial y} - \frac{\partial \dot{q}_z}{\partial z} \\
&\quad + \frac{\partial}{\partial x}\left( \tau_{xx} u + \tau_{xy} v + \tau_{xz} w \right) + \frac{\partial}{\partial y}\left( \tau_{yx} u + \tau_{yy} v + \tau_{yz} w \right) + \frac{\partial}{\partial z}\left( \tau_{zx} u + \tau_{zy} v + \tau_{zz} w \right)
\end{aligned} \qquad (3.17)
\]

Counting the primitive variables ρ, u, v, w, E, p, and T and the 6 unique elements of the viscous stress tensor, there are 13 unknowns and only 5 equations (the conservation laws). In order to make a solution tractable, 8 more equations are needed. Fortunately, a relationship between the components of the stress tensor and the velocity gradients is possible. The velocity gradients are directly related to the rate-of-strain tensor and the vorticity tensor.
Constitutive equations then define the relationship between the stress components and the rate-of-strain components. For a Newtonian fluid, a linear relationship between the stress and the strain rate is assumed. Since the strain rate tensor is symmetric, there are only 6 unique strain rate components. Assuming a linear relationship between the 6 unique stress components and the 6 unique strain rate components, there are 36 material constants that must be defined. The assumption of an isotropic material reduces this to 2 constants. For fluids, these two constants are the dynamic viscosity μ and the second coefficient of viscosity λ. From Stokes' hypothesis, the relationship

\[
\lambda = -\frac{2}{3}\mu \qquad (3.18)
\]

can be used for the compressible flow of air. For a Newtonian, isotropic fluid, the final relationships between the components of the stress tensor and the velocity gradients are

\[
\begin{aligned}
\tau_{xx} &= \mu\left[ 2\frac{\partial u}{\partial x} - \frac{2}{3}\left( \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} \right) \right] \\
\tau_{yy} &= \mu\left[ 2\frac{\partial v}{\partial y} - \frac{2}{3}\left( \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} \right) \right] \\
\tau_{zz} &= \mu\left[ 2\frac{\partial w}{\partial z} - \frac{2}{3}\left( \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} \right) \right] \\
\tau_{xy} &= \mu\left( \frac{\partial u}{\partial y} + \frac{\partial v}{\partial x} \right), \quad
\tau_{xz} = \mu\left( \frac{\partial u}{\partial z} + \frac{\partial w}{\partial x} \right), \quad
\tau_{yz} = \mu\left( \frac{\partial v}{\partial z} + \frac{\partial w}{\partial y} \right)
\end{aligned} \qquad (3.19)
\]

With the original 5 conservation equations and the 6 relationships between the viscous stresses and the velocity gradients, only 2 more equations are needed. If a perfect gas is assumed, the thermodynamic state can be specified by only two thermodynamic variables. If p and ρ are chosen as the two independent thermodynamic variables, the perfect gas law

\[
p = \rho R T \qquad (3.20)
\]

(where R is the gas constant) can be used to calculate the temperature T. The relationship for the internal energy e per unit mass is

\[
e = \frac{p}{\rho(\gamma - 1)} \qquad (3.21)
\]

where γ is the ratio of specific heats (γ = 1.4 for air), and the total energy per unit mass is related to the internal energy by

\[
E = e + \frac{U^2}{2} \qquad (3.22)
\]

where U is the magnitude of the velocity vector u.

Vector Form

The three conservation laws, written in differential form, can be combined in the vector differential equation

\[
\frac{\partial \mathbf{q}}{\partial t} + \frac{\partial \mathbf{f}_i}{\partial x} + \frac{\partial \mathbf{g}_i}{\partial y} + \frac{\partial \mathbf{h}_i}{\partial z} = \frac{\partial \mathbf{f}_v}{\partial x} + \frac{\partial \mathbf{g}_v}{\partial y} + \frac{\partial \mathbf{h}_v}{\partial z} \qquad (3.23)
\]

where

\[
\mathbf{q} = \begin{Bmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ \rho E \end{Bmatrix}, \quad
\mathbf{f}_i = \begin{Bmatrix} \rho u \\ \rho u^2 + p \\ \rho u v \\ \rho u w \\ u(\rho E + p) \end{Bmatrix}, \quad
\mathbf{f}_v = \begin{Bmatrix} 0 \\ \tau_{xx} \\ \tau_{xy} \\ \tau_{xz} \\ u\tau_{xx} + v\tau_{xy} + w\tau_{xz} - \dot{q}_x \end{Bmatrix}
\]
\[
\mathbf{g}_i = \begin{Bmatrix} \rho v \\ \rho v u \\ \rho v^2 + p \\ \rho v w \\ v(\rho E + p) \end{Bmatrix}, \quad
\mathbf{g}_v = \begin{Bmatrix} 0 \\ \tau_{yx} \\ \tau_{yy} \\ \tau_{yz} \\ u\tau_{yx} + v\tau_{yy} + w\tau_{yz} - \dot{q}_y \end{Bmatrix}
\]
\[
\mathbf{h}_i = \begin{Bmatrix} \rho w \\ \rho w u \\ \rho w v \\ \rho w^2 + p \\ w(\rho E + p) \end{Bmatrix}, \quad
\mathbf{h}_v = \begin{Bmatrix} 0 \\ \tau_{zx} \\ \tau_{zy} \\ \tau_{zz} \\ u\tau_{zx} + v\tau_{zy} + w\tau_{zz} - \dot{q}_z \end{Bmatrix} \qquad (3.24)
\]

The first component of the vector equation represents the conservation of mass. The next three components represent the conservation of momentum. The fifth component represents the conservation of energy. The terms f_i, g_i, and h_i represent the inviscid flux vectors and f_v, g_v, and h_v represent the viscous flux vectors. Setting f_v = g_v = h_v = 0 recovers the Euler equations, which govern inviscid fluid flow. The elements of the vector q are the conserved variables, as opposed to the primitive variables ρ, u, v, w, and p. The use of subscripts on the terms q̇_x, q̇_y, and q̇_z represents the components of the heat flux vector as opposed to partial derivative notation.

Considering only heat conduction, Fourier's law can be used to relate the heat flux vector to the temperature gradient

\[
\dot{\mathbf{q}} = -k\nabla T \qquad (3.25)
\]

where k is the heat conductivity and T is the temperature. The Prandtl number, defined as

\[
Pr = \frac{c_p \mu}{k} \qquad (3.26)
\]

is used to compute the heat conductivity k from the viscosity μ (for air at standard conditions, Pr = 0.72).
Using the relationship

\[
c_p = \frac{\gamma R}{\gamma - 1} \qquad (3.27)
\]

for a perfect gas, the components of the heat flux vector can be written as

\[
\dot{q}_x = -\frac{\gamma R}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial T}{\partial x}, \quad
\dot{q}_y = -\frac{\gamma R}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial T}{\partial y}, \quad
\dot{q}_z = -\frac{\gamma R}{\gamma - 1}\frac{\mu}{Pr}\frac{\partial T}{\partial z} \qquad (3.28)
\]

Non-Dimensionalization

The governing equations are non-dimensionalized by freestream conditions so that

\[
\tilde{\rho} = \frac{\rho}{\rho_\infty}, \quad \tilde{u} = \frac{u}{a_\infty}, \quad \tilde{v} = \frac{v}{a_\infty}, \quad \tilde{w} = \frac{w}{a_\infty}, \quad \tilde{p} = \frac{p}{\rho_\infty a_\infty^2}, \quad \tilde{E} = \frac{E}{a_\infty^2}
\]
\[
\tilde{x} = \frac{x}{L}, \quad \tilde{y} = \frac{y}{L}, \quad \tilde{z} = \frac{z}{L}, \quad \tilde{t} = \frac{t\,a_\infty}{L} \qquad (3.29)
\]

where the tilde denotes non-dimensional quantities, the subscript ∞ denotes freestream conditions, L is some dimensional reference length, and a is the speed of sound, which is defined by the relations

\[
a = \sqrt{\frac{\gamma p}{\rho}} = \sqrt{\gamma R T} \qquad (3.30)
\]

The Mach number is the non-dimensional velocity. The freestream Mach number is

\[
M_\infty = \frac{U_\infty}{a_\infty} \qquad (3.31)
\]

where U∞ is the magnitude of the freestream velocity. Therefore, the non-dimensional velocity components become

\[
\tilde{u} = \frac{u}{U_\infty}M_\infty, \quad \tilde{v} = \frac{v}{U_\infty}M_\infty, \quad \tilde{w} = \frac{w}{U_\infty}M_\infty \qquad (3.32)
\]

The terms u/U∞, etc. represent scaled direction cosines; therefore, the non-dimensional velocities are scaled values of the Mach number. The Reynolds number is the non-dimensional parameter

\[
Re = \frac{\rho_\infty U_\infty L}{\mu_\infty} \qquad (3.33)
\]

which arises from the non-dimensionalization of the conservation of momentum equation. This parameter represents the ratio of inertia forces to viscous forces.

The non-dimensional governing equations can be written in the same form as equations 3.23 and 3.24 by replacing the dimensional quantities by the corresponding non-dimensional quantities. However, in the process of non-dimensionalizing the equations, the non-dimensional term M∞/Re arises from the viscous flux vectors. Therefore, the definitions of the viscous stresses and the heat flux components must be modified as

\[
\begin{aligned}
\tilde{\tau}_{xx} &= \frac{M_\infty}{Re}\tilde{\mu}\left[ 2\frac{\partial \tilde{u}}{\partial \tilde{x}} - \frac{2}{3}\left( \frac{\partial \tilde{u}}{\partial \tilde{x}} + \frac{\partial \tilde{v}}{\partial \tilde{y}} + \frac{\partial \tilde{w}}{\partial \tilde{z}} \right) \right] \\
\tilde{\tau}_{yy} &= \frac{M_\infty}{Re}\tilde{\mu}\left[ 2\frac{\partial \tilde{v}}{\partial \tilde{y}} - \frac{2}{3}\left( \frac{\partial \tilde{u}}{\partial \tilde{x}} + \frac{\partial \tilde{v}}{\partial \tilde{y}} + \frac{\partial \tilde{w}}{\partial \tilde{z}} \right) \right] \\
\tilde{\tau}_{zz} &= \frac{M_\infty}{Re}\tilde{\mu}\left[ 2\frac{\partial \tilde{w}}{\partial \tilde{z}} - \frac{2}{3}\left( \frac{\partial \tilde{u}}{\partial \tilde{x}} + \frac{\partial \tilde{v}}{\partial \tilde{y}} + \frac{\partial \tilde{w}}{\partial \tilde{z}} \right) \right] \\
\tilde{\tau}_{xy} &= \frac{M_\infty}{Re}\tilde{\mu}\left( \frac{\partial \tilde{u}}{\partial \tilde{y}} + \frac{\partial \tilde{v}}{\partial \tilde{x}} \right), \quad
\tilde{\tau}_{xz} = \frac{M_\infty}{Re}\tilde{\mu}\left( \frac{\partial \tilde{u}}{\partial \tilde{z}} + \frac{\partial \tilde{w}}{\partial \tilde{x}} \right), \quad
\tilde{\tau}_{yz} = \frac{M_\infty}{Re}\tilde{\mu}\left( \frac{\partial \tilde{v}}{\partial \tilde{z}} + \frac{\partial \tilde{w}}{\partial \tilde{y}} \right)
\end{aligned} \qquad (3.34)
\]

and

\[
\tilde{\dot{q}}_x = -\frac{1}{\gamma - 1}\frac{\tilde{\mu}M_\infty}{Pr\,Re}\frac{\partial \tilde{T}}{\partial \tilde{x}}, \quad
\tilde{\dot{q}}_y = -\frac{1}{\gamma - 1}\frac{\tilde{\mu}M_\infty}{Pr\,Re}\frac{\partial \tilde{T}}{\partial \tilde{y}}, \quad
\tilde{\dot{q}}_z = -\frac{1}{\gamma - 1}\frac{\tilde{\mu}M_\infty}{Pr\,Re}\frac{\partial \tilde{T}}{\partial \tilde{z}} \qquad (3.35)
\]

The non-dimensional equation of state becomes

\[
\tilde{p} = \frac{\tilde{\rho}\tilde{T}}{\gamma} \qquad (3.36)
\]

and the non-dimensional energy is related to the non-dimensional density and pressure by the equation

\[
\tilde{E} = \frac{\tilde{p}}{\tilde{\rho}(\gamma - 1)} + \frac{1}{2}\left( \tilde{u}^2 + \tilde{v}^2 + \tilde{w}^2 \right) \qquad (3.37)
\]

The non-dimensional viscosity coefficient is related to the non-dimensional temperature by the power law

\[
\tilde{\mu} = \tilde{T}^{2/3} \qquad (3.38)
\]

Coordinate Transformation

The use of a general, boundary conforming, structured grid introduces an ordered set of grid points represented by Cartesian coordinates given at the integer curvilinear coordinates ξ, η, ζ. In order to solve the governing equations in this curvilinear coordinate system, the partial derivatives with respect to the Cartesian coordinates must be converted into partial derivatives with respect to the curvilinear coordinates. This requires a transformation of the form

\[
\xi = \xi(x,y,z,t), \quad \eta = \eta(x,y,z,t), \quad \zeta = \zeta(x,y,z,t), \quad \tau = \tau(t) \qquad (3.39)
\]

Applying the chain rule, the partial derivatives with respect to the Cartesian coordinates can be written as

\[
\frac{\partial}{\partial x} = \xi_x \frac{\partial}{\partial \xi} + \eta_x \frac{\partial}{\partial \eta} + \zeta_x \frac{\partial}{\partial \zeta}
\]
\[
\frac{\partial}{\partial y} = \xi_y \frac{\partial}{\partial \xi} + \eta_y \frac{\partial}{\partial \eta} + \zeta_y \frac{\partial}{\partial \zeta}
\]
\[
\frac{\partial}{\partial z} = \xi_z \frac{\partial}{\partial \xi} + \eta_z \frac{\partial}{\partial \eta} + \zeta_z \frac{\partial}{\partial \zeta}
\]
\[
\frac{\partial}{\partial t} = \tau_t \frac{\partial}{\partial \tau} + \xi_t \frac{\partial}{\partial \xi} + \eta_t \frac{\partial}{\partial \eta} + \zeta_t \frac{\partial}{\partial \zeta} \qquad (3.40)
\]

where the metric terms ξ_x, ξ_y, etc. appear. The metric term ξ_x represents the variation in ξ with a unit variation in x while y, z, and t are held constant. These terms are not easily evaluated. However, the partial derivatives of the Cartesian coordinates with respect to the curvilinear coordinates, which arise from the inverse transformation represented by

\[
x = x(\xi,\eta,\zeta,\tau), \quad y = y(\xi,\eta,\zeta,\tau), \quad z = z(\xi,\eta,\zeta,\tau), \quad t = t(\tau) \qquad (3.41)
\]

are easily evaluated.
Applying the chain rule again, the partial derivatives with respect to the curvilinear coordinates can be written as

\[
\frac{\partial}{\partial \xi} = x_\xi \frac{\partial}{\partial x} + y_\xi \frac{\partial}{\partial y} + z_\xi \frac{\partial}{\partial z}
\]
\[
\frac{\partial}{\partial \eta} = x_\eta \frac{\partial}{\partial x} + y_\eta \frac{\partial}{\partial y} + z_\eta \frac{\partial}{\partial z}
\]
\[
\frac{\partial}{\partial \zeta} = x_\zeta \frac{\partial}{\partial x} + y_\zeta \frac{\partial}{\partial y} + z_\zeta \frac{\partial}{\partial z}
\]
\[
\frac{\partial}{\partial \tau} = t_\tau \frac{\partial}{\partial t} + x_\tau \frac{\partial}{\partial x} + y_\tau \frac{\partial}{\partial y} + z_\tau \frac{\partial}{\partial z} \qquad (3.42)
\]

Comparing equations 3.40 and 3.42, the Jacobian matrix of the transformation 3.39 is seen to be the inverse of the Jacobian of the inverse transformation 3.41 (see Belk [44, appendix A] for a complete derivation). This yields the relationships

\[
\begin{aligned}
\xi_x &= (y_\eta z_\zeta - z_\eta y_\zeta)/J & \eta_x &= (z_\xi y_\zeta - y_\xi z_\zeta)/J & \zeta_x &= (y_\xi z_\eta - z_\xi y_\eta)/J \\
\xi_y &= (z_\eta x_\zeta - x_\eta z_\zeta)/J & \eta_y &= (x_\xi z_\zeta - z_\xi x_\zeta)/J & \zeta_y &= (z_\xi x_\eta - x_\xi z_\eta)/J \\
\xi_z &= (x_\eta y_\zeta - y_\eta x_\zeta)/J & \eta_z &= (y_\xi x_\zeta - x_\xi y_\zeta)/J & \zeta_z &= (x_\xi y_\eta - y_\xi x_\eta)/J \\
\xi_t &= -\tau_t(x_\tau \xi_x + y_\tau \xi_y + z_\tau \xi_z) & \eta_t &= -\tau_t(x_\tau \eta_x + y_\tau \eta_y + z_\tau \eta_z) & \zeta_t &= -\tau_t(x_\tau \zeta_x + y_\tau \zeta_y + z_\tau \zeta_z)
\end{aligned} \qquad (3.43)
\]

where J is the determinant of the Jacobian matrix of the inverse transformation

\[
J = x_\xi(y_\eta z_\zeta - z_\eta y_\zeta) - y_\xi(x_\eta z_\zeta - z_\eta x_\zeta) + z_\xi(x_\eta y_\zeta - y_\eta x_\zeta) \qquad (3.44)
\]

The governing equations are then written in the form

\[
\frac{\partial Q}{\partial \tau} + \frac{\partial (F_i - F_v)}{\partial \xi} + \frac{\partial (G_i - G_v)}{\partial \eta} + \frac{\partial (H_i - H_v)}{\partial \zeta} = 0 \qquad (3.45)
\]

where

\[
\begin{aligned}
Q &= J\mathbf{q} \\
F_i &= J(\mathbf{q}\xi_t + \mathbf{f}_i\xi_x + \mathbf{g}_i\xi_y + \mathbf{h}_i\xi_z) & F_v &= J(\mathbf{f}_v\xi_x + \mathbf{g}_v\xi_y + \mathbf{h}_v\xi_z) \\
G_i &= J(\mathbf{q}\eta_t + \mathbf{f}_i\eta_x + \mathbf{g}_i\eta_y + \mathbf{h}_i\eta_z) & G_v &= J(\mathbf{f}_v\eta_x + \mathbf{g}_v\eta_y + \mathbf{h}_v\eta_z) \\
H_i &= J(\mathbf{q}\zeta_t + \mathbf{f}_i\zeta_x + \mathbf{g}_i\zeta_y + \mathbf{h}_i\zeta_z) & H_v &= J(\mathbf{f}_v\zeta_x + \mathbf{g}_v\zeta_y + \mathbf{h}_v\zeta_z)
\end{aligned} \qquad (3.46)
\]

Flux Vector Splitting

The model hyperbolic equation is the one-dimensional linear convection equation

\[
\frac{\partial u}{\partial t} + a\frac{\partial u}{\partial x} = 0 \qquad (3.47)
\]

If a > 0, this equation describes the propagation of a wave in the +x direction at the velocity a. The use of a backward time difference with a forward space difference or a central space difference produces the explicit discretized finite difference equations

\[
\frac{u_i^{n+1} - u_i^n}{\Delta t} + a\frac{u_{i+1}^n - u_i^n}{\Delta x} = 0 \qquad (3.48)
\]

and

\[
\frac{u_i^{n+1} - u_i^n}{\Delta t} + a\frac{u_{i+1}^n - u_{i-1}^n}{2\Delta x} = 0 \qquad (3.49)
\]

both of which are unconditionally unstable. Instead, with a > 0, a backward space difference of the form

\[
\frac{u_i^{n+1} - u_i^n}{\Delta t} + a\frac{u_i^n - u_{i-1}^n}{\Delta x} = 0 \qquad (3.50)
\]

is required to produce a stable scheme. Since the wave is propagating in the +x direction, the backward space differencing represents "upwind" differencing. If the wave speed a were negative, a forward space difference, again representing upwind differencing, would be required to produce a stable scheme.
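A brief numerical sketch of equation 3.50 makes the stability point concrete; the grid, time step, and initial pulse are arbitrary illustrative choices.

```python
import numpy as np

def upwind_step(u, a, dt, dx):
    """One explicit step of first-order upwind differencing for u_t + a u_x = 0."""
    un = u.copy()
    if a > 0:    # wave moves in +x: backward (upwind) space difference
        u[1:] = un[1:] - a * dt / dx * (un[1:] - un[:-1])
    else:        # wave moves in -x: the forward difference is upwind
        u[:-1] = un[:-1] - a * dt / dx * (un[1:] - un[:-1])
    return u

x = np.linspace(0.0, 1.0, 101)
u = np.exp(-200.0 * (x - 0.25) ** 2)       # initial pulse
for _ in range(50):
    u = upwind_step(u, a=1.0, dt=0.005, dx=x[1] - x[0])   # CFL = 0.5
```

Replacing the upwind difference with the forward or central differences of equations 3.48 and 3.49 causes the solution to blow up, as the stability analysis predicts.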
Substituting this into equation 3.53, the Jacobian matrix is split into A = RA+R1 + RAR1 = A+ + A (3.56) and the split flux vectors are defined from the homogeneity property as F+ = A+Q F = AQ (3.57) so that F = F+ + F (3.58) Upwind differencing is then used appropriately with the split flux vectors in the discretized governing equations. The complete form of the split flux vectors can be found in Hirsch [46]. An implicit discretization of the governing equations can be written as AQn+l + AT (EFn"+1 + ,,G"n+ + ScHn+l) = 0 (3.59) where the superscript n denotes the time step, =" i, = Q, Q1,k (3.60) and =Fi+l/2,j,k Fi1/2,j,k G Gi,j+1/2,k Gi,j1/2,k SA17 6H = Hijk+2 Hij,kl/2 (3.61) For a finite volume discretization, the indices i,j, k represent a cell in which the dependent variables Q are assumed to be constant, and indices i + 1/2,j, k and i 1/2, j, k, for example, are opposing cell faces at which the flux is evaluated. The changes in the curvilinear coordinates AL, Aq, and AC are unity. A first order time linearization of the flux terms [47, 48], leads to the equation [ /(OH8F "G + F" + ("+) 6' + G +( ) Aq(+1 +6< H"+ ( ) AQ +' = 0 (3.62) Introducing the split flux vectors, produces the form AQ+1 OF+)n , )] n,[(%]) n+,] A+ F 8 F AG Q+"] A7 + St K aQ AQn+1 I +St, aQ W+AQn l] +S A q+ + [(aH+ n AQn+l] l + [6 H )nAqn+] +r aq aq aq = t (F + F)n + S, (G+ + G)" + S6 (H+ + H)" (3.63) or Ar + 8C [(A+)" (AQ+)"+' + (A)" (AQ)"+' +S, [(B+)" (Aq+)"+' + (B)" (Aq)"+'] +5,, [(c+)" (A +)"+ + (c)" (AQ)"+] = R" (3.64) where R" = 6~ [F+ + F]" + 6, [G+ + G]" + Sc [H+ + H] (3.65) It should be noted that the Jacobian matrices A+, A, etc. are not the same as the split flux Jacobian matrices A+, A, etc. that were presented in equation 3.56. Instead, the notation A+ is used to represent aF+/aQ, for example. This is required to preserve the conservation form of the equations. The final form of these Jacobian matrices, and the derivation thereof, can be found in Belk [44, appendix B]. In evaluating the split flux vectors at the cell faces according to the difference operators defined in equation 3.61, dependent variables from cells upwind of the cell face are used. For a first order spatial discretization, only the neighboring cell is used. For second order accuracy, the dependent variables from the two neighboring cells are extrapolated to the cell face. As an example, the (+) flux is evaluated using cells to the left of the cell face Fi+/2,i,k = F (Q+1/2,j,k) (3.66) where f+1/2,k = Qi,j,k (3.67) for a first order accurate scheme, and 3 1 +i/2,j,k = ,j,k Qi,j,k(3.68) for a second order accurate scheme. Likewise, the () flux is evaluated using cells to the right of the cell face F = F (Q /2,,k) (3.69) Fi+1/2,j,k = F i+1/2,j,k) where Q+1/2,j,k = Qi+lj,k (3.70) for a first order accurate scheme, and 3 1 Qi+1/2,j,k = Qi+l,j,k Qi+2,,k (3.71) for a second order accurate scheme. The extrapolation of the conserved variables to the cell face and their use to calculate the flux is often referred to as MUSCL extrap olation [491. Alternatively, the primative variables can be extrapolated and used to calculate the flux or the flux can be evaluated at the cell centers and extrapolated to the cell center. In the higher order schemes, flux limiters, applied to the flux, conserved variables, or the primitive variables, are used to selectively reduce the scheme to first order to avoid oscillations in the solution near discontinuities. The flux limiters available include the minmod, van Leer, and van Albada limiters. 
Flux Difference Splitting

Hirsch [46] describes upwind methods as methods in which physical properties of the flow equations are introduced into the discretized equations. Flux vector splitting introduces the direction of propagation of information through consideration of the sign of the eigenvalues in the discretization. Another method that handles discontinuities well is due to Godunov [50]. In Godunov's method, the conserved variables are considered constant throughout each cell, and a one-dimensional exact solution of the Euler equations is computed at each cell boundary. The two constant states on either side of a cell boundary define a Riemann (or shock tube) problem that can be solved exactly. An integral average of the exact solutions to the Riemann problems at each cell is taken to determine the solution at the next time step. Other methods have replaced the computationally expensive exact solution of the Riemann problem with an approximate Riemann solution. These methods, including the popular method due to Roe [7], are often referred to as "flux difference splitting" methods.

Considering the quasi-linear form of the one-dimensional Euler equations shown in equation 3.52, the elements of the Jacobian matrix A are not constant. Roe proposed replacing this nonlinear equation with the linear equation

\[
\frac{\partial Q}{\partial t} + \tilde{A}\frac{\partial Q}{\partial x} = 0 \qquad (3.72)
\]

where Ã is a constant matrix. This equation is solved at the interface between two cells to determine the flux at the interface. The matrix Ã is chosen so that the solution of this linear equation gives the correct flux difference for the nonlinear Riemann problem. The properties required of Ã are

i. It constitutes a linear mapping from Q to F.
ii. lim_{Q^L→Q^R→Q} Ã(Q^L, Q^R) = A(Q) = ∂F/∂Q.
iii. F(Q^R) − F(Q^L) = Ã(Q^L, Q^R)(Q^R − Q^L).
iv. The eigenvectors of Ã are linearly independent.

The superscripts ()^L and ()^R represent quantities on the left and right sides of the interface. The matrix Ã for the approximate Riemann problem is constructed from the flux Jacobian matrices where the primitive variables are replaced by the Roe averaged variables

\[
\tilde{\rho} = \sqrt{\rho^L \rho^R}
\]
\[
\tilde{u} = \frac{\sqrt{\rho^L}\,u^L + \sqrt{\rho^R}\,u^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}, \quad
\tilde{v} = \frac{\sqrt{\rho^L}\,v^L + \sqrt{\rho^R}\,v^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}, \quad
\tilde{w} = \frac{\sqrt{\rho^L}\,w^L + \sqrt{\rho^R}\,w^R}{\sqrt{\rho^L} + \sqrt{\rho^R}}
\]
\[
\tilde{H} = \frac{\sqrt{\rho^L}\,H^L + \sqrt{\rho^R}\,H^R}{\sqrt{\rho^L} + \sqrt{\rho^R}} \qquad (3.73)
\]

where H is the total enthalpy per unit mass, which is related to the total energy per unit mass by the relationship

\[
H = E + \frac{p}{\rho} \qquad (3.74)
\]

The solution of the approximate Riemann problem yields the following equation for the flux at a cell interface

\[
F_{i+1/2,j,k} = \frac{1}{2}\left( F^L_{i+1/2,j,k} + F^R_{i+1/2,j,k} \right) - \frac{1}{2}\left| \tilde{A} \right|_{i+1/2,j,k}\left( Q^R_{i+1/2,j,k} - Q^L_{i+1/2,j,k} \right) \qquad (3.75)
\]

where

\[
\left| \tilde{A} \right| = \tilde{R}\left| \tilde{\Lambda} \right|\tilde{R}^{-1} \qquad (3.76)
\]

and the tilde notation is used to denote that the Roe averaged variables are used in the evaluation. The assumption is made that the waves from the solution of the one-dimensional Riemann problem move normal to the interface. For the three-dimensional problem, the one-dimensional solution is repeated for the three directions. For first order spatial accuracy, the primitive variables used in the Roe averaged variables come from the cells neighboring the interface. For second order accuracy, the values are extrapolated as shown in equation 3.71.
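The Roe-averaged state of equation 3.73 is straightforward to compute; the sketch below also recovers the averaged speed of sound from the total enthalpy, with gamma = 1.4 as an illustrative default.

```python
import numpy as np

def roe_average(rhoL, uL, HL, rhoR, uR, HR, gamma=1.4):
    """Density-weighted Roe averages for a 1D interface (equation 3.73)."""
    sL, sR = np.sqrt(rhoL), np.sqrt(rhoR)
    rho = sL * sR                            # Roe-averaged density
    u = (sL * uL + sR * uR) / (sL + sR)      # Roe-averaged velocity
    H = (sL * HL + sR * HR) / (sL + sR)      # Roe-averaged total enthalpy
    a = np.sqrt((gamma - 1.0) * (H - 0.5 * u * u))   # averaged speed of sound
    return rho, u, H, a
```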
Newton Relaxation

Newton's method for a nonlinear system of vector functions

\[
\mathcal{F}(\mathbf{x}) = 0 \qquad (3.77)
\]

can be written as

\[
\mathcal{F}'(\mathbf{x}^m)\left( \mathbf{x}^{m+1} - \mathbf{x}^m \right) = -\mathcal{F}(\mathbf{x}^m) \qquad (3.78)
\]

This defines an iterative procedure for which m is the iteration counter and F′(x) is the Jacobian matrix defined by

\[
\mathcal{F}'(\mathbf{x}) = \frac{\partial \mathcal{F}(\mathbf{x})}{\partial \mathbf{x}} \qquad (3.79)
\]

Following the presentation of Whitfield [51], the discretized governing equation 3.59 leads to the function

\[
\mathcal{F}(Q^{n+1}) = \frac{Q^{n+1} - Q^n}{\Delta\tau} + \delta_\xi F^{n+1} + \delta_\eta G^{n+1} + \delta_\zeta H^{n+1} = \frac{Q^{n+1} - Q^n}{\Delta\tau} + R(Q^{n+1}) \qquad (3.80)
\]

for which a solution is sought by Newton's method. Here, the vector Q contains the dependent variables for every cell throughout the entire grid system. The Jacobian matrix is defined by

\[
\mathcal{F}'(Q^{n+1}) = \frac{I}{\Delta\tau} + \frac{\partial R(Q^{n+1})}{\partial Q} \qquad (3.81)
\]

which yields the iterative scheme

\[
\left[ \frac{I}{\Delta\tau_l} + \left( \frac{\partial R}{\partial Q} \right)^{n+1,m} \right]\Delta Q^{n+1,m+1} = -\frac{Q^{n+1,m} - Q^n}{\Delta\tau_{min}} - R(Q^{n+1,m}) \qquad (3.82)
\]

where Q^{n+1,m+1} = Q^{n+1,m} + ΔQ^{n+1,m+1}, Q^{n+1,0} = Q^n, n denotes the time level, m is the Newton iteration counter, Δτ_l is the local time step, and Δτ_min is the minimum time step. Flux vector splitting is used on the left-hand side, and flux difference splitting with Roe averaged variables is used on the right-hand side. Steger-Warming Jacobians, Roe analytical Jacobians, or Roe numerical Jacobians can be used.

Each iteration of Newton's method is solved using symmetric Gauss-Seidel (SGS) iteration. The SGS iterations, or inner iterations, are performed on a grid by grid basis, while the Newton iterations, or dt iterations, are used to achieve time accuracy and are performed on all grids in sequence. This procedure eliminates synchronization errors at blocked and overset boundaries by iteratively bringing all dependent variables up to the n+1 time level. The fixed time step, Δτ_min, is used to maintain time accuracy, and a local time step, Δτ_l, is used for stability and convergence of the Newton iterations. Steady state calculations do not use Newton iterations. The first term on the right-hand side of equation 3.82 becomes zero and local time stepping is used during the inner iterations.

Explicit boundary conditions (BC) can be used, or implicit BC's can be achieved by updating the BC's during the SGS relaxation solution of equation 3.82 [52]. An under-relaxation factor is applied to the implicit BC update to improve stability.

Fixed-Point Iteration

A linear system of equations of the form

\[
A\mathbf{x} = \mathbf{b} \qquad (3.83)
\]

can be solved using the general fixed-point iteration scheme

\[
\mathbf{x}^{m+1} = \mathbf{x}^m + C\left( \mathbf{b} - A\mathbf{x}^m \right), \quad m = 1, 2, 3, \ldots \qquad (3.84)
\]

(see, for example, Conte and de Boor [53, pages 223-233]). This iteration function is in quasi-Newton form. The function for which a zero is being sought is f(x) = Ax − b. However, the matrix C is an approximate inverse of A rather than the inverse of the derivative of f. This approximate inverse is defined by the requirement that

\[
\left\| I - CA \right\| < 1 \qquad (3.85)
\]

for some matrix norm. The coefficient matrix A can be written as

\[
A = L + D + U \qquad (3.86)
\]

where L is the lower triangular part of A, D is the diagonal part of A, and U is the upper triangular part of A. If A is diagonally dominant, D^{-1} is an approximate inverse of A and the iteration function

\[
\mathbf{x}^{m+1} = \mathbf{x}^m + D^{-1}\left( \mathbf{b} - A\mathbf{x}^m \right) \qquad (3.87)
\]

will converge. This is Jacobi iteration. It can be rewritten as

\[
D\mathbf{x}^{m+1} = \mathbf{b} - (L + U)\mathbf{x}^m \qquad (3.88)
\]

or as

\[
x_i^{m+1} = \frac{1}{a_{ii}}\left( b_i - \sum_{j=1}^{i-1} a_{ij}x_j^m - \sum_{j=i+1}^{n} a_{ij}x_j^m \right), \quad i = 1, \ldots, n \qquad (3.89)
\]

to explicitly show how each element of x is updated. The distinguishing characteristic of Jacobi iteration is that the iteration function only uses values of x from the previous iteration.

Gauss-Seidel iteration comes from the choice of C = (L + D)^{-1}. This gives the iteration function

\[
\mathbf{x}^{m+1} = \mathbf{x}^m + (L + D)^{-1}\left( \mathbf{b} - A\mathbf{x}^m \right) \qquad (3.90)
\]

This can be rewritten as

\[
(L + D)\mathbf{x}^{m+1} = \mathbf{b} - U\mathbf{x}^m \qquad (3.91)
\]

or

\[
D\mathbf{x}^{m+1} = \mathbf{b} - L\mathbf{x}^{m+1} - U\mathbf{x}^m \qquad (3.92)
\]

To explicitly show how each element is computed, this is written as

\[
x_i^{m+1} = \frac{1}{a_{ii}}\left( b_i - \sum_{j=1}^{i-1} a_{ij}x_j^{m+1} - \sum_{j=i+1}^{n} a_{ij}x_j^m \right), \quad i = 1, \ldots, n \qquad (3.93)
\]

As each element of x is updated, the previous elements of x, which multiply the lower triangular elements of A, have already been updated. Thus, for the summation that represents Lx, the elements of x are evaluated at iteration m+1. In other words, when updating an element of x, the most up to date values of x are used. If Gauss-Seidel iteration is guaranteed to converge, it will converge faster than Jacobi iteration. It also has the side benefit that only one array is needed to store x during the iterations.
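The Gauss-Seidel update of equation 3.93 is easy to state in code; the sketch below updates x in place, so the lower-triangular terms automatically use iteration m+1 values.

```python
import numpy as np

def gauss_seidel(A, b, x0, max_sweeps=100, tol=1e-10):
    """Solve A x = b by Gauss-Seidel iteration (A assumed diagonally dominant)."""
    x = np.array(x0, dtype=float)
    n = len(b)
    for _ in range(max_sweeps):
        x_prev = x.copy()                 # kept only for the convergence test
        for i in range(n):
            # x[:i] already holds m+1 values; x[i+1:] still holds m values.
            sigma = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
            x[i] = (b[i] - sigma) / A[i, i]
        if np.linalg.norm(x - x_prev) < tol:
            break
    return x
```

Jacobi iteration, by contrast, must keep a second array for the previous iterate and, when both converge, typically does so more slowly.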
Parallel Considerations

Following the analysis presented by Weeratunga and Chawla [37], the solution algorithm can be written as a global system of linear equations

\[
\begin{bmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,N} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
A_{N,1} & A_{N,2} & \cdots & A_{N,N}
\end{bmatrix}
\begin{Bmatrix}
\Delta Q_1 \\ \Delta Q_2 \\ \vdots \\ \Delta Q_N
\end{Bmatrix}
=
\begin{Bmatrix}
F_1(Q_1, Q_2, \ldots, Q_N) \\ F_2(Q_1, Q_2, \ldots, Q_N) \\ \vdots \\ F_N(Q_1, Q_2, \ldots, Q_N)
\end{Bmatrix} \qquad (3.94)
\]

The diagonal block matrix elements, A_{i,i}, represent the coupling within grid i due to the implicit time discretization. These elements are banded, sparse matrices defined by the spatial discretization. The off-diagonal block matrix elements, A_{i,j} (i ≠ j), represent the coupling between grids i and j due to block-to-block and/or overlapping boundary conditions. The coupling between grids is dependent on the relative positions of the grids; thus, some of the off-diagonal elements will be zero. Together, these elements form a large, sparse matrix. This large system of linear equations could be solved directly; however, this would not be efficient and does not lend itself well to parallel computing. Instead, the off-diagonal terms are moved to the right-hand side. Thus, block-to-block and overlapped boundary conditions are treated explicitly. This gives a decoupled set of equations of the form

\[
\begin{bmatrix}
A_{1,1} & 0 & \cdots & 0 \\
0 & A_{2,2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & A_{N,N}
\end{bmatrix}
\begin{Bmatrix}
\Delta Q_1 \\ \Delta Q_2 \\ \vdots \\ \Delta Q_N
\end{Bmatrix}
=
\begin{Bmatrix}
R_1 \\ R_2 \\ \vdots \\ R_N
\end{Bmatrix} \qquad (3.95)
\]

where R_i = F_i(Q_1, Q_2, …) − Σ_{j≠i} A_{i,j}ΔQ_j. Each decoupled equation can be solved using Gauss-Seidel iteration.

CHAPTER 4
6DOF INTEGRATION

In order to solve store separation problems, we must be able to simulate the general motion of bodies under the influence of aerodynamic, gravitational, and externally applied loads. We will ignore structural bending; therefore, we can limit ourselves to rigid body motion. This chapter presents the basis for the six degrees of freedom (6DOF) motion simulation routines in Beggar that were written by Belk [4]. This is similar to the method presented by Meakin [54]. The equations of motion, the coordinate systems used, and the techniques used to integrate the equations of motion are presented.

Equations of Motion

The unconstrained motion of a rigid body is modeled by Newton's second law of motion

\[
\mathbf{F} = m\mathbf{a} \qquad (4.1)
\]

where F is the total force acting on the body, m is the mass of the body, and a is the resulting acceleration of the body. This can be written as the conservation of linear and angular momentum

\[
\mathbf{F} = \dot{\mathbf{L}} \qquad (4.2)
\]
\[
\mathbf{M} = \dot{\mathbf{H}} \qquad (4.3)
\]

where L = mV is the linear momentum, H = Iω is the angular momentum, and M is the total applied moment about the body center of gravity (CG). The dot notation represents the derivative with respect to time, V is the translational velocity vector, ω is the rotational velocity vector, and I is the rotational inertia tensor

\[
\mathbf{I} = \begin{bmatrix}
I_{xx} & -I_{xy} & -I_{xz} \\
-I_{xy} & I_{yy} & -I_{yz} \\
-I_{xz} & -I_{yz} & I_{zz}
\end{bmatrix} \qquad (4.4)
\]

constructed from the moments (I_xx, I_yy, I_zz) and products (I_xy, I_xz, I_yz) of inertia of the body. The six degrees of freedom of the motion are represented by the translational position of the CG and the rotation of the body about its CG.

Equations 4.2 and 4.3 can only be applied in an inertial reference frame (IRF); therefore, the derivatives of the linear and angular momentum must be taken with respect to an IRF. However, the body moments of inertia and products of inertia will vary with time (due to body motion) if they are defined relative to a fixed, global coordinate system.
Thus, it is easier to use a non-inertial, local coordinate system that is fixed relative to the body, so that the moments and products of inertia will remain constant. In order to apply equations 4.2 and 4.3 in a moving, local coordinate system, we need to relate the derivatives with respect to this non-inertial reference frame to derivatives with respect to an IRF. This relationship is defined by the equation

\[
\dot{\mathbf{a}}\big|_{XYZ} = \dot{\mathbf{a}}\big|_{xyz} + \boldsymbol{\omega} \times \mathbf{a} \qquad (4.5)
\]

for any vector a defined in a coordinate system xyz that is rotating by ω relative to an IRF XYZ. Applying this relationship to L and assuming that the mass m is constant, equation 4.2 becomes

\[
\frac{\mathbf{F}}{m} = \dot{\mathbf{V}}\big|_{xyz} + \boldsymbol{\omega} \times \mathbf{V} \qquad (4.6)
\]

or

\[
\dot{\mathbf{V}}\big|_{xyz} = \frac{\mathbf{F}}{m} - \boldsymbol{\omega} \times \mathbf{V} \qquad (4.7)
\]

Applying 4.5 to H, equation 4.3 becomes

\[
\mathbf{M} = \dot{\mathbf{H}}\big|_{xyz} + \boldsymbol{\omega} \times \mathbf{H} \qquad (4.8)
\]

or

\[
\dot{\boldsymbol{\omega}}\big|_{xyz} = \mathbf{I}^{-1}\mathbf{M} - \mathbf{I}^{-1}\left( \boldsymbol{\omega} \times \mathbf{I}\boldsymbol{\omega} \right) \qquad (4.9)
\]

Equations 4.7 and 4.9 are the equations of motion written with respect to the local coordinate system (see Etkin [55] for a more complete derivation of the equations of motion). These equations can be integrated twice with respect to time to get a time history of the translational and rotational position of the rigid body. However, since the equations of motion are written with respect to the local coordinate system, the change in position coming from the integration of the equations of motion is of little use for tracking the body motion, since the local coordinate system is always changing. Instead, the changes in body position must be transformed to the global coordinate system so that the position and orientation of the body relative to the global coordinate system can be maintained.

Coordinate Transformations

The local coordinate system is represented by the lower case letters xyz, while the global coordinate system is represented by the upper case letters XYZ, as shown in figure 4.1. The origin of the local coordinate system is placed at the CG of the body, the +x axis extends forward along the longitudinal body axis, the +y axis extends laterally along what would be an aircraft's right wing (from the pilot's perspective), and the +z axis extends downward in the direction defined by the right-hand rule. The rotation of the local coordinate system relative to the global coordinate system can be represented by the three Euler angles of yaw (ψ), pitch (θ), and roll (φ).

Figure 4.1: Transformation from global to local coordinates

As shown in figure 4.1, the local coordinate axes, which are initially aligned with the global coordinate axes, are first rotated by ψ about the Z axis to produce the x′y′Z axes. These axes are then rotated by θ about the y′ axis to produce the x″y′z″ axes. These axes are then rotated by φ about the x″ axis to produce the local coordinate axes xyz (see Blakelock [56] for another description of the coordinate systems). These transformations are written in matrix form as

\[
\begin{bmatrix} \cos\psi & \sin\psi & 0 \\ -\sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix}
=
\begin{Bmatrix} x' \\ y' \\ Z \end{Bmatrix} \qquad (4.10)
\]

\[
\begin{bmatrix} \cos\theta & 0 & -\sin\theta \\ 0 & 1 & 0 \\ \sin\theta & 0 & \cos\theta \end{bmatrix}
\begin{Bmatrix} x' \\ y' \\ Z \end{Bmatrix}
=
\begin{Bmatrix} x'' \\ y' \\ z'' \end{Bmatrix} \qquad (4.11)
\]

\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & \sin\phi \\ 0 & -\sin\phi & \cos\phi \end{bmatrix}
\begin{Bmatrix} x'' \\ y' \\ z'' \end{Bmatrix}
=
\begin{Bmatrix} x \\ y \\ z \end{Bmatrix} \qquad (4.12)
\]

With the notation [R_x(φ)] representing a rotational transformation matrix constructed for rotation about the x axis by an angle φ, the complete transformation from local coordinates to global coordinates can be written as

\[
[R_z(\psi)]^{-1}[R_y(\theta)]^{-1}[R_x(\phi)]^{-1}
\begin{Bmatrix} x \\ y \\ z \end{Bmatrix}
=
\begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix} \qquad (4.13)
\]

or

\[
\begin{bmatrix}
\cos\psi\cos\theta & \cos\psi\sin\theta\sin\phi - \sin\psi\cos\phi & \cos\psi\sin\theta\cos\phi + \sin\psi\sin\phi \\
\sin\psi\cos\theta & \sin\psi\sin\theta\sin\phi + \cos\psi\cos\phi & \sin\psi\sin\theta\cos\phi - \cos\psi\sin\phi \\
-\sin\theta & \cos\theta\sin\phi & \cos\theta\cos\phi
\end{bmatrix}
\begin{Bmatrix} x \\ y \\ z \end{Bmatrix}
=
\begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix} \qquad (4.14)
\]

Since the rotational transformations are orthonormal, the inverse transform is equal to the transpose.
Thus, the complete transformation from global coordinates to local coordinates can be written as

\[
[R_x(\phi)][R_y(\theta)][R_z(\psi)]
\begin{Bmatrix} X \\ Y \\ Z \end{Bmatrix}
=
\begin{Bmatrix} x \\ y \\ z \end{Bmatrix} \qquad (4.15)
\]

which is equivalent to the transpose of the matrix shown in equation 4.14.

If the Euler angles ψ, θ, φ are used to track the angular position of the body relative to the global coordinate system, a relationship is required to convert the rotational velocity vector ω in local coordinates (calculated from the integration of equation 4.9) to the rates of change of the Euler angles. However, the Euler angles are not applied about the global coordinate system axes; therefore, the transformation from local to global coordinates cannot be used for this purpose. Referring back to figure 4.1, ψ is applied about the Z axis, θ is applied about the y′ axis, and φ is applied about the x axis. Therefore, the rotational velocity vector can be decomposed as

\[
\boldsymbol{\omega} = p\,\hat{\mathbf{e}}_x + q\,\hat{\mathbf{e}}_y + r\,\hat{\mathbf{e}}_z \qquad (4.16)
\]

or

\[
\boldsymbol{\omega} = \dot{\psi}\,\hat{\mathbf{e}}_Z + \dot{\theta}\,\hat{\mathbf{e}}_{y'} + \dot{\phi}\,\hat{\mathbf{e}}_x \qquad (4.17)
\]

Decomposing the unit vectors ê_y′ and ê_Z into the xyz coordinate system yields

\[
\hat{\mathbf{e}}_{y'} = \cos\phi\,\hat{\mathbf{e}}_y - \sin\phi\,\hat{\mathbf{e}}_z \qquad (4.18)
\]

and

\[
\hat{\mathbf{e}}_Z = -\sin\theta\,\hat{\mathbf{e}}_x + \cos\theta\sin\phi\,\hat{\mathbf{e}}_y + \cos\theta\cos\phi\,\hat{\mathbf{e}}_z \qquad (4.19)
\]

as can be seen from the transformation matrices in equations 4.12 and 4.14. Combining equations 4.16-4.19 yields the relationships

\[
p = \dot{\phi} - \dot{\psi}\sin\theta
\]
\[
q = \dot{\psi}\cos\theta\sin\phi + \dot{\theta}\cos\phi
\]
\[
r = \dot{\psi}\cos\theta\cos\phi - \dot{\theta}\sin\phi \qquad (4.20)
\]

which can be inverted to give

\[
\dot{\phi} = p + q\tan\theta\sin\phi + r\tan\theta\cos\phi
\]
\[
\dot{\theta} = q\cos\phi - r\sin\phi
\]
\[
\dot{\psi} = (q\sin\phi + r\cos\phi)/\cos\theta \qquad (4.21)
\]

As θ → π/2, cos θ → 0 and tan θ → ∞; therefore, ψ̇ → ∞ and φ̇ → ∞. This singularity is called "gimbal lock" [57].

Quaternions

Quaternions were developed by Hamilton [58] as an extension to complex numbers and have been used in 6DOF simulations [59] because their use avoids the gimbal lock problem. They have properties similar to both complex numbers and vectors. Like a complex number, which has a real part and an imaginary part, a quaternion has a scalar part and a vector part and is often written as

\[
Q = e_0 + e_1 i + e_2 j + e_3 k \qquad (4.22)
\]

where i, j, and k are unit vectors in the three Cartesian coordinate directions. The multiplication of two quaternions requires the additional rules of quaternion algebra

\[
i^2 = j^2 = k^2 = -1, \quad ij = -ji = k, \quad jk = -kj = i, \quad ki = -ik = j \qquad (4.23)
\]

which are similar to the rules of complex math and vector cross products. The multiplication of two quaternions is simplified if we rewrite equation 4.22 as

\[
Q = Q_0 + \mathbf{Q} \qquad (4.24)
\]

which emphasizes the scalar and vector components. Following the distributive property, the multiplication of two quaternions is

\[
PQ = (P_0 + \mathbf{P})(Q_0 + \mathbf{Q}) = P_0 Q_0 + P_0\mathbf{Q} + Q_0\mathbf{P} + \mathbf{P} \otimes \mathbf{Q} \qquad (4.25)
\]

The ⊗ operator can be shown to be equivalent to

\[
\mathbf{P} \otimes \mathbf{Q} = \mathbf{P} \times \mathbf{Q} - \mathbf{P} \cdot \mathbf{Q} \qquad (4.26)
\]

Similar to complex arithmetic, the conjugate of a quaternion is defined as

\[
Q^* = Q_0 - \mathbf{Q} \qquad (4.27)
\]

The product of a quaternion and its conjugate is thus

\[
QQ^* = Q^*Q = e_0^2 + e_1^2 + e_2^2 + e_3^2 = |Q|^2 \qquad (4.28)
\]

or the square of the magnitude of the quaternion. A unit quaternion is a quaternion of unit magnitude. For the unit quaternion of the form

\[
Q = \cos(\alpha/2) + \hat{\mathbf{A}}\sin(\alpha/2) \qquad (4.29)
\]

the transformation

\[
QVQ^* = V' \qquad (4.30)
\]

rotates the vector V about the axis defined by the unit vector Â by an angle α to produce the vector V′. Since this is a unit quaternion, Q* is the inverse of Q. Thus the inverse transformation

\[
Q^*V'Q = V \qquad (4.31)
\]

rotates the vector V′ about the axis defined by Â by an angle of −α to recover V.
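The quaternion algebra of equations 4.25-4.31 is compact enough to sketch directly; the (scalar, vector) storage layout is an illustrative choice.

```python
import numpy as np

def qmul(P, Q):
    """Quaternion product (equation 4.25 with the product of equation 4.26);
    quaternions are stored as length-4 arrays [scalar, x, y, z]."""
    P0, Pv = P[0], P[1:]
    Q0, Qv = Q[0], Q[1:]
    return np.concatenate(([P0*Q0 - Pv @ Qv],
                           P0*Qv + Q0*Pv + np.cross(Pv, Qv)))

def rotate(v, axis, alpha):
    """Rotate vector v about unit axis by angle alpha via Q V Q* (eq. 4.30)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    Q  = np.concatenate(([np.cos(alpha/2.0)], np.sin(alpha/2.0) * axis))
    Qc = np.concatenate(([Q[0]], -Q[1:]))    # conjugate (equation 4.27)
    V  = np.concatenate(([0.0], v))          # embed the vector as a quaternion
    return qmul(qmul(Q, V), Qc)[1:]
```

Because Q is a unit quaternion, its conjugate is also its inverse, and qmul(qmul(Qc, Vp), Q) recovers the original vector as in equation 4.31.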
If the unit vector Â is defined to be equivalent to ê_x of our local coordinate system, α is equivalent to the roll angle and the rotational position of the body can be represented by the quaternion

\[
Q = \cos(\phi/2) + \hat{\mathbf{e}}_x\sin(\phi/2) = \cos(\phi/2) + \left[ \cos\psi\cos\theta\,i + \sin\psi\cos\theta\,j - \sin\theta\,k \right]\sin(\phi/2) \qquad (4.32)
\]

where i, j, k represent the three Cartesian coordinate directions ê_X, ê_Y, ê_Z of the IRF. Then equation 4.30 represents the transformation from local coordinates to global coordinates and equation 4.31 represents the transformation from global coordinates to local coordinates. Equation 4.32 gives the relationship between the Euler angles and the components of the quaternion. Alternatively, the transformation in equation 4.31 can be compared to a general transformation matrix to find the relationship between the components of the quaternion and the elements of the transformation matrix.

The only other relationship needed in order to use quaternions to track rigid body motion is the knowledge of how to update the quaternion. Without going through a derivation, the following derivatives of the scalar and vector components of a quaternion were presented in Katz [60]

\[
\dot{Q}_0 = -\frac{1}{2}\,\boldsymbol{\omega} \cdot \mathbf{Q} \qquad (4.33)
\]
\[
\dot{\mathbf{Q}} = \frac{1}{2}\,\boldsymbol{\omega}\,Q_0 + \frac{1}{2}\,\boldsymbol{\omega} \times \mathbf{Q} \qquad (4.34)
\]

These equations are integrated with respect to time along with the equations of motion. The quaternion must remain a unit quaternion to ensure that the transformation represents pure rotation with no scaling or shearing. Therefore, the quaternion needs to be normalized during the integration.

Numerical Integration

A fourth order Runge-Kutta scheme is used to integrate the equations of motion. Runge-Kutta schemes are an attractive option for solving initial value problems governed by first order differential equations of the form

\[
y' = f(x, y), \quad y(x_0) = y_0 \qquad (4.35)
\]

because they can achieve higher order accuracy without the evaluation of higher order derivatives. Conte and de Boor [53, pages 362-365] define the fourth order Runge-Kutta scheme as

\[
y_{n+1} = y_n + \frac{1}{6}\left( k_1 + 2k_2 + 2k_3 + k_4 \right) \qquad (4.36)
\]

where

\[
k_1 = h\,f(x_n, y_n)
\]
\[
k_2 = h\,f\!\left( x_n + \tfrac{h}{2},\; y_n + \tfrac{k_1}{2} \right)
\]
\[
k_3 = h\,f\!\left( x_n + \tfrac{h}{2},\; y_n + \tfrac{k_2}{2} \right)
\]
\[
k_4 = h\,f\!\left( x_n + h,\; y_n + k_3 \right)
\]

The integration time step is represented by h, x represents the time, y represents the position, velocity, and quaternion, f(x, y) represents the derivative of y (the right-hand side of the equations of motion), and the subscripts n and n+1 are used to denote quantities at the current and next iteration (or time step), respectively.

The aerodynamics solution comes into the equations of motion through the integrated forces and moments. Since the aerodynamics are a function of position, the use of four different positions in the evaluation of f(x, y) in equation 4.36 would require the calculation of the flow solution four times for each integration of the 6DOF. However, this would be very expensive. Therefore, the aerodynamics are assumed to be constant over the complete time step and are evaluated only once.

Since the translational equation of motion is written relative to the local coordinate system, the integrated aerodynamic forces (and moments) will be independent of position. However, the gravitational force, which is constant in global coordinates, is not constant in local coordinates. Thus, care should be taken when decomposing the gravitational force into local coordinates with each step of the Runge-Kutta integration.
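A sketch of one integration step follows; the packing of the state vector and the quaternion slot are assumptions made for illustration, and f is the user-supplied right-hand side of the equations of motion.

```python
import numpy as np

def rk4_step(f, x, y, h):
    """Classical fourth-order Runge-Kutta step (equation 4.36)."""
    k1 = h * f(x, y)
    k2 = h * f(x + h/2.0, y + k1/2.0)
    k3 = h * f(x + h/2.0, y + k2/2.0)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2.0*k2 + 2.0*k3 + k4) / 6.0

def step_6dof(f, t, state, h):
    state = rk4_step(f, t, state, h)
    q = state[6:10]                          # assumed quaternion storage slot
    state[6:10] = q / np.linalg.norm(q)      # renormalize to a unit quaternion
    return state
```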
CHAPTER 5
PARALLEL PROGRAMMING

Computing power has increased many orders of magnitude over the last decade. This trend is expected to continue in the near future. However, the shift appears to be away from sequential processing and towards parallel processing. This chapter presents an overview of parallel computing hardware, the options and some considerations for programming parallel computers, some methods for judging and improving parallel performance, and the proposed approach taken in this work.

Hardware Overview

The performance gains that are being achieved with single processors are diminishing as processors approach physical limitations such as the speed of light. With this in mind, VLSI design principles have been used to conclude that it is possible to increase potential computing power more cost effectively by utilizing multiple, slower, less expensive components rather than a single, faster, more costly component [61]. Therefore, the trend in computing hardware is towards multiple processors. Machines that utilize high performance vector processors are shipping with modest numbers of vector processors, and massively parallel processors (MPP) are utilizing existing, low cost RISC processors (for example, the Cray T3D, which uses DEC Alpha processors, or the IBM SP2, which uses IBM RS6000 processors) in ever increasing numbers to achieve greater potential processing power.

Another trend that is affecting the way computing is being done is the increase in network transfer rates. This allows physically separated resources to be utilized for solving a single problem. Since many MPP's utilize the same processors found in high end workstations, a group of workstations connected by a high speed network can be viewed as a distributed parallel computer, with the main differences being in the speed of the interprocessor connections and the possible differences in processors and operating systems. This type of computing is often referred to as "cycle harvesting." This is due to the fact that networked computers, which are used for routine computing during business hours, often sit idle at night. These unused computing cycles can be harvested for scientific computing.

A relatively new form of parallel computing takes the use of commercially available off-the-shelf components to the extreme. Personal computers, based on Intel or compatible microprocessors, running a freely available UNIX clone operating system such as LINUX, are linked together using low cost ethernet networking. Such parallel computers are often referred to as Beowulf clusters [62]. Such a distributed computing environment can represent a sizeable computational resource with very low associated cost.

Parallel computer architectures are often classified according to the number of instructions that can be executed in parallel, as well as the amount of data that can be operated on in parallel. The most common of these classifications range from multiple instruction, multiple data (MIMD) computers to single instruction, multiple data (SIMD) computers. SIMD systems offer reduced program complexity, but can greatly reduce the range of algorithms that can be implemented on such an architecture. Parallel computers are often further classified according to their memory layout as distributed memory, in which case each processor has its own local memory, or as shared memory, for which each processor has direct access to a single, global memory address space. Most of the machines being produced today are of the MIMD type. The Cray T3D and IBM SP2 are distributed memory MIMD machines, while the SGI Onyx is a shared memory MIMD machine. The SGI Origin 2000 represents a unique memory architecture referred to as CC-NUMA (cache-coherent, non-uniform memory access).
It is made of multiple node cards that contain two processors and local, shared memory. However, any processor on any node card can access any of the memory in the machine. This hybrid organization of memory is called distributed, shared memory (DSM). There is a latency associated with accessing memory located off of a node card; therefore, access times to memory are non-uniform. However, hardware is used to maintain coherency of data held in cache between different processors. This architecture has been shown to perform well for many irregular applications and scales well to moderate numbers of processors [63].

Software Overview

Logically, parallel computers can be viewed as a set of sequential processors, each with its own memory, interconnected by some communication links [61]. Each processor executes a sequential set of instructions and communicates with other processors and accesses remote memory through the communication links. Distributed memory and shared memory systems, as well as distributed computing environments, fit this model. The processors of a shared memory system simply have a more efficient way of accessing remote memory than do the processors of a distributed memory system or a distributed computing environment. This model of a parallel computer and the use of messages for all communication between processors form the basis of the message passing paradigm of parallel programming.

Due to the model used for the parallel computer, it is conceivable that the user could write, compile, and execute a different program on each processor, with each program communicating with the others via messages. It is more often the case that the same source is compiled and executed on each processor, with control flow statements in the code used to determine the path executed or the data manipulated at run time. This programming model is referred to as single process, multiple data (SPMD). The SPMD model of programming aids in code maintenance and provides a simplified path for converting an existing sequential code for parallel execution.

Many libraries exist for implementing message passing. Two of the more predominant libraries are PVM [64] and MPI [65]. PVM, which stands for Parallel Virtual Machine, is a de facto standard message passing interface due to its popularity and widespread use. It is the product of Oak Ridge National Lab and several university contributions. PVM consists of two parts: a library consisting of the functions that implement the application programming interface (API) and a daemon which runs in the background and actually handles the communication between processes. MPI, which stands for Message Passing Interface, is a proposed standard message passing interface. It was developed out of a series of meetings of a committee of experts from the parallel computing community. MPI draws features from other message passing libraries and provides a common API that the vendors can optimize for their machines. PVM evolved out of a research project on distributed computing and places a higher priority on portability than on performance. MPI is expected to provide better performance on large MPP's, but does not provide for heterogeneous distributed computing and lacks many task management functions [66].
Other models are available for parallel programming. One of the more popular is the shared memory programming model. Pthreads [67] is a POSIX standard implementation for shared memory programming using threads. A thread is a lightweight process that shares memory with other threads, but has its own program counter, registers, and stack, so that each thread can execute a different part of a code. The sharing of memory between threads is automatic, and communication between threads is accomplished through cooperative use of shared variables. Mutual exclusion (mutex) variables are used to ensure that only one thread changes the value of a variable at a time. Signals are sent between threads using condition variables. OpenMP [68] is an alternative library that attempts to avoid the low level programming constructs required by Pthreads. OpenMP is used to identify loops that can be executed in parallel, similar to the vectorization of loops on vector processors. OpenMP automatically handles all communication.

These techniques all require shared memory and thus cannot be used on distributed memory parallel computers. However, Pthreads or OpenMP can be mixed with PVM or MPI to take advantage of both programming models when clusters of shared memory multiprocessor (SMP) machines are linked together. Likewise, other techniques for using shared memory can be mixed with the message passing model. POSIX also defines a standard for specifying the use of shared memory explicitly [69], as opposed to the automatic use of shared memory as with Pthreads.

When approaching a parallel programming task, the key issues to be addressed are concurrency, scalability, locality, and modularity [61]. Concurrency relates to the need for algorithms which subdivide larger problems into a set of smaller tasks that can be executed concurrently. An intimate knowledge of the data structures and data dependencies in an algorithm is required to identify such concurrencies. Scalability relates to the behavior of an algorithm in terms of parallel efficiency or speedup as a function of processor count. Since the number of processors being utilized in MPP's appears to be continually increasing, the efficiency of a good parallel program design should scale with increased processor counts to remain effective throughout its life cycle. Locality relates to the desire to enhance local memory utilization, since access to local memory is less expensive than access to remote memory. Raw communication speeds are typically orders of magnitude slower than floating-point operations; thus, communication performance strongly influences the parallel run time. Modularity is important in all software development. It allows objects to be manipulated without regard for their internal structure. It reduces code complexity and promotes code reuse, extensibility, and portability.

The algorithm design process can be broken down into four phases: partitioning, communication, agglomeration, and mapping [61]. Machine independent issues, such as concurrency, are considered early in the design process, while machine specific issues are delayed until late in the design. Partitioning and communication address the issues of concurrency and scalability, while agglomeration and mapping address locality and performance. Partitioning falls into two major categories: functional decomposition and data decomposition. Functional decomposition focuses on the computation, while data decomposition focuses on the data. A good partition will divide both the data and the computation. The communication phase of a design deals with identifying the interprocess communication requirements.
These techniques all require shared memory and thus cannot be used on distributed memory parallel computers. However, Pthreads or OpenMP can be mixed with PVM or MPI to take advantage of both programming models when clusters of shared memory multiprocessor (SMP) machines are linked together. Likewise, other techniques for using shared memory can be mixed with the message passing model. POSIX also defines a standard for specifying the use of shared memory explicitly [69], as opposed to the automatic use of shared memory as with Pthreads.

When approaching a parallel programming task, the key issues to be addressed are concurrency, scalability, locality, and modularity [61]. Concurrency relates to the need for algorithms that subdivide larger problems into a set of smaller tasks that can be executed concurrently. An intimate knowledge of the data structures and data dependencies in an algorithm is required to identify such concurrencies. Scalability relates to the behavior of an algorithm in terms of parallel efficiency or speedup as a function of processor count. Since the number of processors being utilized in MPPs appears to be continually increasing, a good parallel program design should scale with increased processor counts to remain effective throughout its life cycle. Locality relates to the desire to enhance local memory utilization, since access to local memory is less expensive than access to remote memory. Raw communication speeds are typically orders of magnitude slower than floating point operations; thus, communication performance strongly influences the parallel run time. Modularity is important in all software development. It allows objects to be manipulated without regard for their internal structure. It reduces code complexity and promotes code reuse, extensibility, and portability.

The algorithm design process can be broken down into four phases: partitioning, communication, agglomeration, and mapping [61]. Machine independent issues, such as concurrency, are considered early in the design process, while machine specific issues are delayed until late in the design. Partitioning and communication address the issues of concurrency and scalability, while agglomeration and mapping address locality and performance. Partitioning falls into two major categories: functional decomposition and data decomposition. Functional decomposition focuses on the computation, while data decomposition focuses on the data. A good partition will divide both the data and the computation.

The communication phase of a design deals with identifying the interprocess communication requirements. This is complicated when the communication patterns are global, unstructured, dynamic, and/or asynchronous. Agglomeration seeks to reduce communication costs by increasing computation and communication granularity. Tasks can be combined, and data and/or computation can be duplicated across processors, in order to reduce communication. The mapping phase is a machine specific problem of specifying where each task will execute. A mapping solution is highly dependent on the communication structure and the work load distribution. A load balancing algorithm is often needed. If the communication structure is dynamic, tradeoffs must be made between tolerating a load imbalance and repeatedly applying a possibly expensive load balancing algorithm.

A good algorithm design must optimize a problem-specific function of execution time, memory requirements, implementation costs, maintenance costs, etc. [61]. Furthermore, when solving coupled systems of partial differential equations, issues unique to the problem must be considered. For example, on a distributed memory machine, a minimum number of processors may be required in order to hold a specific problem; however, the use of additional processors must be balanced against its effect on the solution convergence [70]. Likewise, since communication cost is proportional to surface area and computational cost is proportional to volume, the desire for a high ratio of volume to surface area places a lower limit on the subdivision of the computational domain. Communication through messages has an associated cost of the latency time for message startup plus a cost per word of data transferred; therefore, it is generally desirable to use a small number of large messages rather than a large number of small messages. However, the use of small messages may allow an algorithm change that would allow communication to be overlapped by computation. An efficient parallel implementation requires the consideration of all such factors.
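The preference for fewer, larger messages follows directly from the two-parameter cost model: if sending n words costs ts + n tw, the startup time ts is paid once per message. The following sketch, using illustrative (assumed) values for ts and tw, compares sending a block of data in one message against sending it in 100 pieces:

    #include <stdio.h>

    int main(void)
    {
        double ts = 50.0e-6;  /* message startup (latency) time, assumed */
        double tw = 0.01e-6;  /* transfer time per word, assumed         */
        int    n  = 10000;    /* total words to be sent                  */
        int    k  = 100;      /* message count if sent piecemeal         */
        double one_msg = ts + n * tw;              /* 1.5e-4 s */
        double k_msgs  = k * (ts + (n / k) * tw);  /* 5.1e-3 s */

        printf("one message : %g s\n", one_msg);
        printf("%3d messages: %g s\n", k, k_msgs);
        return 0;
    }

With these values the aggregated message is over thirty times cheaper, even though the same number of words is moved.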
Performance

The performance of a parallel algorithm is normally measured via speedup. This is the ratio of the execution time on a single processor to the execution time on multiple processors. Thus, the speedup s can be computed by

    s = T1 / Tn                                                (5.1)

where T1 denotes the execution time on a single processor and Tn denotes the execution time on n processors. Ideally, T1 should represent the execution time of the best sequential algorithm available to do the job. When parallelizing a sequential algorithm, the best sequential algorithm may not parallelize well and, vice versa, the best parallel algorithm may not perform well sequentially. Likewise, when parallelizing a given sequential algorithm, some overhead will be introduced. If the parallel algorithm is executed on a single processor to measure T1, this value may be artificially high due to the use of a poor sequential algorithm or due to the existence of parallelization overhead. However, the best sequential algorithm may be unattainable. Thus, there exists some ambiguity in how T1 should be measured in order to judge the performance of a parallel algorithm. At the least, when converting an existing sequential algorithm for execution in parallel, T1 should be measured using the original sequential algorithm. Likewise, if any algorithm changes are made during parallelization that would also decrease the sequential execution time, T1 should be remeasured so as to isolate improvements due to the algorithm change from improvements due to the use of multiple processors.

One source of overhead that exists in all parallel programs is the time spent in communication between multiple processors. Following the analysis presented by Roose and Van Driessche [71], the total execution time of a parallel algorithm executed on n processors can be approximated as

    Tn = Tcalc + Tcomm                                         (5.2)

where Tcalc denotes the actual computation time and Tcomm denotes the time spent in communication due to parallelization. If the work is perfectly balanced and there is no time spent in communication during a sequential run, then the execution time of the sequential run will be

    T1 = n Tcalc                                               (5.3)

Hence, the speedup would be

    s = n Tcalc / (Tcalc + Tcomm)                              (5.4)

Thus, the ratio of the communication time to the computation time can have a large effect on the speedup.

In general, for CFD flow solvers, the communication time is proportional to the area of (number of grid points on) the boundaries of the domain, and the computation time is proportional to the volume of (total number of grid points in) the domain. Thus, as the problem size increases, the ratio of communication to computation decreases. The characteristics of a particular computer, the form of the communication, the algorithm used, and the partitioning of the domain can also affect this ratio.

In general, a parallel computer with n processors can execute n instructions at the same time. Thus, if the instructions in a sequential algorithm could be evenly divided among the n processors, so that each processor executed 1/n of the total instructions, the execution time would be decreased by a factor of n. Therefore, linear speedup is the ideal case, and speedup is limited to s <= n. However, there are other factors that place additional limits on the speedup that can be achieved. If the entire work load of a complete simulation is broken down into a part that can be executed in parallel and a part that must be executed serially, the achievable speedup is limited by Amdahl's law [72]:

    s <= 1 / (fs + fp/n)                                       (5.5)

where fs is the serial fraction of the work, fp is the parallel fraction of the work, and n is the number of processors on which the parallel portion of the code is running. The factors fs and fp are fractions, so that

    0 <= fs <= 1,   0 <= fp <= 1,   and   fs + fp = 1          (5.6)

Since the parallel work will be distributed across multiple processors, the execution time of the parallel work will be decreased, but the execution time of the serial work will not. Amdahl's law shows the significant penalty that the serial fraction of the work load can place on the parallel performance. For example, consider a case where 5% of an executable code must be performed serially (fs = 0.05 and fp = 0.95). If only 4 processors are used, the speedup will be limited to 3.48, nearly 90% of the ideal speedup. However, if 32 processors are used, then the speedup will be limited to 12.55, less than 40% of the ideal speedup. Although the processor count was increased by a factor of 8, the speedup increased by less than a factor of 4. In fact, as the number of processors n -> infinity, the term fp/n -> 0. Thus, the speedup is limited to 1/fs, or 20 in this case, no matter how many processors are used.
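The numbers quoted above follow directly from equation 5.5; a small C sketch (illustrative only) reproduces them:

    #include <stdio.h>

    /* Amdahl's law, equation 5.5: s <= 1/(fs + fp/n) */
    static double amdahl(double fs, double fp, int n)
    {
        return 1.0 / (fs + fp / n);
    }

    int main(void)
    {
        printf("n =  4: s <= %.2f\n", amdahl(0.05, 0.95, 4));  /* 3.48  */
        printf("n = 32: s <= %.2f\n", amdahl(0.05, 0.95, 32)); /* 12.55 */
        printf("limit : s <= %.2f\n", 1.0 / 0.05);             /* 20.00 */
        return 0;
    }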
This could be used to argue that parallel processing does not hold the answer to the need for increased computing power. However, the potential from multiple processors and the increased memory often available with MPP machines allows larger and larger problems to be addressed. With CFD solutions, as the problem size increases, the computation to communication ratio usually increases and the serial fraction of the work load decreases.

Even the limit specified by Amdahl's law is not always reached. The major contributor to this behavior is an imbalance in the distribution of the work to be executed in parallel. Consider figure 5.1. The left side of the figure shows the serial execution of a function that operates on four grids, while the right side shows the parallel execution of the function on four processors with one grid per processor. The serial execution time, and thus the total work, is represented by the time T5-T1. On four processors, the average work per processor is represented by the time (T5-T1)/4. However, the total execution time in parallel is dictated by the maximum execution time of any process. This time, T2-T1, is larger than the average execution time by a factor related to the imbalance in the work load or execution times.

Figure 5.1: Unbalanced work load

Since the term fp/n in equation 5.5 represents the average parallel work per processor, this work must be increased by a factor proportional to the load imbalance. Generalizing equation 5.5 to include the effect of a load imbalance, the speedup becomes

    s <= 1 / (fs + (fp/n) fi)                                  (5.7)

where fi is the load imbalance factor. The load imbalance is often judged by the ratio of the maximum execution time of any process to the minimum execution time of any process. However, as used here, the load imbalance factor is used to increase the average execution time per process to the maximum execution time of any process. Thus, the imbalance is equal to the ratio of the maximum execution time of any process to the average execution time per process.
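Under this definition, the imbalance factor fi is computed directly from per-process execution times, and equation 5.7 shows how quickly it erodes speedup. A short sketch with illustrative (assumed) times:

    #include <stdio.h>

    int main(void)
    {
        double t[4] = {1.0, 0.8, 0.7, 0.5};  /* per-process times, assumed */
        double max = t[0], avg = 0.0;
        double fs = 0.05, fp = 0.95;         /* serial/parallel fractions  */
        int    i, n = 4;

        for (i = 0; i < n; i++) {
            if (t[i] > max) max = t[i];
            avg += t[i] / n;
        }

        printf("fi = %.3f\n", max / avg);    /* imbalance = max/avg: 1.333 */
        printf("s <= %.2f\n", 1.0 / (fs + (fp / n) * (max / avg)));
        return 0;                            /* bound drops to 2.73        */
    }

With a perfectly balanced load (fi = 1), the same four processors would be limited to a speedup of 3.48; a 33% imbalance drops the bound to 2.73.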
The load balance is further complicated by the basis for the decomposition of the work. If each division in the decomposition does not represent a nearly equal piece of the work, the load balance can vary significantly with the process count. Obviously, if there are not enough pieces of work, some of the processes will sit idle. Likewise, if there is one piece of work that is significantly larger than the other pieces, it can dominate the execution time. Consider figure 5.2. The left side of the figure shows the serial execution of a function that operates on four grids. When the function is duplicated and the grids are distributed across two processes (shown in the middle of the figure), the work is well balanced and the execution time is cut in half. However, when four processes are used (shown on the right side of the figure), no improvement in the execution time is seen. The work associated with grid g1 is one half of the total work; thus, the execution time is dominated by the execution time of g1.

Figure 5.2: Limitations in load balance caused by a poor decomposition

Another common cause for the degradation in the speedup achieved is synchronization between processes. Synchronization is enforced by the placement of barriers in the execution path. No process may pass the barrier until all of the processes have reached the barrier. This can ensure that every process has completed a particular portion of the work before any process starts on the next portion of work. This may be required if one function is dependent on the results from a previous function. How this can cause an increase in execution time is illustrated in figure 5.3. This diagram shows two functions (A and B) operating on separate grids on separate processors. Without synchronization, the total work per process may be well balanced; but if synchronization is required between the functions, wait time can be introduced if each function is not well balanced.

Figure 5.3: Imbalance caused by synchronization

This illustrates the fact that each piece of work between any two synchronization points must be well balanced in order to achieve a good overall load balance for the entire code. To take this into account, equation 5.7 should be written as

    s <= 1 / (fs + sum over j of (fpj/n) fij)                  (5.8)

where the terms within the summation represent each section of code, between synchronization points, that is executed in parallel.

Load Balancing

The most important factor in achieving good parallel performance is load balancing. The problem of load balancing is similar to the computer science problems referred to as the "knapsack problem" [73] and the "partition problem" [74]. These problems are NP-complete: no algorithm is known that is guaranteed to produce the exact solution in polynomial time. The input to the knapsack problem is defined by a set of items with specified sizes and values and a knapsack of a specified capacity. The problem is to maximize the value of the subset of items that will fit into the knapsack at one time. The input to the partition problem is a set of blocks of varying heights. The problem is to stack the blocks into two towers of equal heights.

The input to the load balancing problem consists of a set of pieces of work, a measure of the cost of each piece of work, and the number of processors across which the pieces of work are to be distributed. The problem is to associate each piece of work with a processor while minimizing the ratio of the maximum total work associated with any processor to the average work per processor. The amount of work associated with each piece of work is similar to the value of the items to be placed in the knapsack or the height of the blocks. The processors are similar to the knapsack or the towers. The average work per processor corresponds to the capacity of the knapsack or the average height of the towers. However, each piece of work must be associated with a processor, each processor must have at least one piece of work, and there is no limit on the amount of work that can be associated with each processor.

The algorithm used to balance the work of the flow solver is a max-min algorithm. This algorithm, shown below, takes the piece of work with the maximum cost from the pieces of work that have not yet been assigned to a processor and assigns it to the processor that has the minimum total work assigned to it. This algorithm distributes the work across the available processors with only a single pass through the list of the pieces of work; thus the execution time is bounded.
With sufficient partitioning of the work, this algorithm produces a good load balance, although it may not produce the best possible distribution of the work. The array Work[] is an estimate of the cost associated with each piece of work (each grid, in this case). Since the work of the flow solver is closely associated with the number of grid points, the cost associated with a grid can be defined as the number of points in the grid. However, there are other factors that can affect the execution time that are not directly related to the number of grid points. Thus, a user defined weight can be associated with each grid, and it is the weighted number of grid points that is fed into the load balancing routine as the cost of each piece of work. The output from the load balancing routine is an array GridToPe[] that specifies which processor will execute the flow solver for each grid.

MAP-GRIDS-TO-PES(Work[])
 1  for i <- 1 to npes
 2      do PeWork[i] <- 0
 3
 4  for i <- 1 to ngrids
 5      do PeNum <- FIND-MIN-VAL-INDEX(PeWork[])
 6         GridNum <- FIND-MAX-VAL-INDEX(Work[])
 7         PeWork[PeNum] <- PeWork[PeNum] + Work[GridNum]
 8         GridToPe[GridNum] <- PeNum
 9         Work[GridNum] <- 0
10
11  return GridToPe[]
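A minimal C rendering of the same max-min mapping (a sketch only; the fixed array sizes are illustrative, and a real implementation would size the arrays at run time):

    #define NGRIDS 6
    #define NPES   3

    void map_grids_to_pes(double work[NGRIDS], int grid_to_pe[NGRIDS])
    {
        double pe_work[NPES] = {0.0};
        int g, i, p;

        for (g = 0; g < NGRIDS; g++) {
            int gmax = 0, pmin = 0;
            for (i = 1; i < NGRIDS; i++)   /* costliest unassigned grid */
                if (work[i] > work[gmax]) gmax = i;
            for (p = 1; p < NPES; p++)     /* least loaded processor    */
                if (pe_work[p] < pe_work[pmin]) pmin = p;

            pe_work[pmin] += work[gmax];
            grid_to_pe[gmax] = pmin;
            work[gmax] = 0.0;              /* mark the grid as assigned */
        }
    }

For example, weighted costs of {10, 8, 5, 3} distributed over two processors are assigned as {10, 3} and {8, 5}, a perfect 13/13 balance.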
This load balancing algorithm is applied to the flow solver, for which there is an initial estimate of the cost of each piece of work. In grid assembly, there is no a priori knowledge of the amount of work associated with a grid or superblock. In fact, the amount of work associated with a grid or superblock depends on the location of the grids, and thus changes as the grids move. In such a case, a dynamic load balancing algorithm is needed that can redistribute the work based on some perception of the current work distribution. The algorithm implemented for dynamic load balancing, shown below, consists of two iterative steps. The algorithm requires as input some numerical measure of the cost associated with each piece of work in the partition and the current mapping of those pieces of work to the available processes. The first step in the algorithm is to move work from the process with the maximum work load to the process with the minimum work load in order to improve the load balance. The second step is to swap single pieces of work between two processes in order to improve the load balance. The output from this algorithm is a new mapping of the pieces of work to the set of processes. This mapping must be compared to the previous mapping to determine which data has to be transferred from one process to another.

OPTIMIZE-MAPPING(Work[], WorkToPe[])
 1  TotalWork <- CALC-SUM(Work[])
 2  AvgWork <- TotalWork/npes
 3  PeWork[], PeWorkList[] <- BUILD-PE-WORK-LISTS(WorkToPe[])
 4
 5  MOVE-WORK(Work[], WorkToPe[])
 6  SWAP-WORK(Work[], WorkToPe[])

MOVE-WORK(Work[], WorkToPe[])
 1  repeat
 2      pemin <- FIND-MIN-VAL-INDEX(PeWork[])
 3      pemax <- FIND-MAX-VAL-INDEX(PeWork[])
 4
 5      WorkLimit <- (PeWork[pemax] - PeWork[pemin]) * 0.99
 6      WorkToPut <- CHOOSE-MAX-LIMITED-VAL(
 7                       PeWorkList[pemax], WorkLimit)
 8
 9      if WorkToPut
10         then WorkToPe[WorkToPut] <- pemin
11              PeWork[], PeWorkList[] <- BUILD-PE-WORK-LISTS(
12                                            WorkToPe[])
13  until WorkToPut = NIL

In line 3 of OPTIMIZE-MAPPING, BUILD-PE-WORK-LISTS calculates the total work per process (PeWork[]) and also builds an array of the lists of pieces of work that are mapped to each process (PeWorkList[]) from the mapping of work to processes (WorkToPe[]). In MOVE-WORK, if any piece of work can be moved from one process to another and decrease the maximum amount of work on any process, then it will improve the load balance. Therefore, in line 5, WorkLimit is set based on a percentage of the difference in the work assigned to the processes with the least and most work. CHOOSE-MAX-LIMITED-VAL chooses the piece of work from the list of work associated with pemax (PeWorkList[pemax]) that has the largest cost that is also less than WorkLimit. This piece of work is assigned to pemin in line 10, and the work lists are updated in line 11.

SWAP-WORK(Work[], WorkToPe[])
 1  repeat
 2      for i <- 1 to npes
 3          do PeWorkImb[i] <- PeWork[i] - AvgWork
 4      pemax <- FIND-MAX-VAL-INDEX(PeWorkImb[])
 5
 6      for each i in PeWorkList[pemax]
 7          do WorkToPut <- i
 8             for pemin <- 1 to npes
 9                 do if pemin != pemax
10                       then WorkLimit <- (Work[WorkToPut]
11                                          - PeWorkImb[pemax]
12                                          + PeWorkImb[pemin]) * 1.001
13
14                            WorkToGet <- CHOOSE-ANY-LIMITED-VAL(
15                                             PeWorkList[pemin])
16
17                            if WorkToGet
18                               then WorkToPe[WorkToGet] <- pemax
19                                    WorkToPe[WorkToPut] <- pemin
20                                    PeWork[], PeWorkList[] <-
21                                        BUILD-PE-WORK-LISTS(
22                                            WorkToPe[])
23                                    break
24
25  until WorkToGet = NIL

In SWAP-WORK, one piece of work on one process is swapped for a piece of work on another process. Therefore, there is an upper and a lower bound on the cost of WorkToGet. If WorkToGet is larger than WorkToPut, then the total work on pemax will increase. However, if WorkToGet is less than WorkToPut by more than the difference between the imbalances on the two processes, then the imbalance on pemin will increase beyond the original imbalance of pemax. The routine CHOOSE-ANY-LIMITED-VAL chooses a piece of work from the list of work associated with pemin (PeWorkList[pemin]) that costs less than the work represented by WorkToPut and more than WorkLimit. It is not important that the optimum piece of work be chosen; any piece of work that improves the load imbalance will do. The work represented by WorkToGet is assigned to pemax, and the work represented by WorkToPut is assigned to pemin. The work lists are updated by BUILD-PE-WORK-LISTS. The break at line 23 causes control to jump out of the loop that started at line 6, so that the load imbalances and pemax can be recalculated.

Each step of this algorithm attempts to decrease the load imbalance caused by the process with the maximum work load. However, the search for pieces of work to swap is exhaustive. This algorithm is used to redistribute the superblocks based on the work of grid assembly; therefore, this algorithm is not too expensive, since there are not many superblocks. This algorithm could be extended to swap multiple pieces of work on one process for one or more pieces of work on another process. However, this would require more efficient ways of sorting the pieces of work, so that the pieces to be swapped could be selected more efficiently.

Another situation that often arises is a large set of pieces of work that can be treated as an array. This array can be evenly divided among the processes, with each getting an equal number of elements of the array. However, if each element of the array does not represent the same amount of work, there will be a load imbalance. It could be expensive to treat each element as a separate piece of work, measure its cost, and use the previous algorithms to distribute this large number of pieces of work. Instead, the total cost of the work can be associated with the processor and used as a weight. Load balancing can then be achieved by dividing the array so that the weighted number of elements is equally divided among the processes.
This algorithm requires as input the number of elements of the array mapped to each process (N[]) and the execution time of each process (T[]). A weight for each process (W[]) is calculated as the execution time per array element. The excess number of elements assigned to the process with the maximum load is calculated as the difference between that process's execution time and the average process execution time, divided by the process's weight. Since this number of elements will be assigned to another process, the weight of the receiving process must be updated. The execution times of the two processes are updated, and the loop is repeated if there are other processes with excess array elements.

OPTIMIZE-ARRAY-MAPPING(N[], T[])
 1  for i <- 1 to npes
 2      do W[i] <- T[i]/N[i]
 3  imin <- FIND-MIN-VAL-INDEX(T[])
 4  imax <- FIND-MAX-VAL-INDEX(T[])
 5  Tavg <- CALCULATE-AVG-VAL(T[])
 6
 7  repeat
 8      Nexcess <- (T[imax] - Tavg)/W[imax]
 9      if Nexcess = 0
10         then break
11      TotWeight <- N[imin] * W[imin] + Nexcess * W[imax]
12      W[imin] <- TotWeight/(N[imin] + Nexcess)
13      N[imin] <- N[imin] + Nexcess
14      N[imax] <- N[imax] - Nexcess
15      T[imin] <- N[imin] * W[imin]
16      T[imax] <- N[imax] * W[imax]
17      imin <- FIND-MIN-VAL-INDEX(T[])
18      imax <- FIND-MAX-VAL-INDEX(T[])
19  until T[imax]/Tavg < 1.005
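For concreteness, a small C sketch of this rebalancing loop follows; the process count, measured times, and even initial distribution are all made-up values used only to show the mechanics:

    #include <stdio.h>

    #define NPES 3

    int main(void)
    {
        double t[NPES] = {6.0, 3.0, 3.0};   /* measured times, assumed */
        int    n[NPES] = {100, 100, 100};   /* elements per process    */
        double w[NPES];
        int i;

        for (i = 0; i < NPES; i++)
            w[i] = t[i] / n[i];             /* time per element        */

        for (;;) {
            int imin = 0, imax = 0, excess;
            double tavg = 0.0;
            for (i = 0; i < NPES; i++) {
                tavg += t[i] / NPES;
                if (t[i] < t[imin]) imin = i;
                if (t[i] > t[imax]) imax = i;
            }
            excess = (int)((t[imax] - tavg) / w[imax]);
            if (excess <= 0 || t[imax] / tavg < 1.005)
                break;

            /* the receiving process inherits a blended weight */
            w[imin] = (n[imin] * w[imin] + excess * w[imax])
                      / (n[imin] + excess);
            n[imin] += excess;  n[imax] -= excess;
            t[imin] = n[imin] * w[imin];
            t[imax] = n[imax] * w[imax];
        }

        for (i = 0; i < NPES; i++)
            printf("pe %d: %d elements\n", i, n[i]);
        return 0;
    }

Starting from an even 100/100/100 split with one process twice as slow per element, the loop shifts elements away from the overloaded process until the predicted times agree to within the 0.5% tolerance.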
Proposed Approach

Since this work builds on the initial parallel implementation of the Beggar flow solver [6], the same methods used there will be continued here. The message passing paradigm is used within an SPMD programming model. PVM is used for the message passing environment. The code is geared toward MIMD machines with either distributed or shared memory. The ultimate goal is to allow heterogeneous computing, although homogeneous computing environments are the primary focus. A functional decomposition of the entire simulation process is used, with the flow solution, force and moment calculation, 6DOF integration, and grid assembly being the primary functions. Coarse grain domain decomposition of the flow solver based on grids is used. The degree to which domain decomposition of the grid assembly function can be used is determined. Load balancing is treated as the primary contributor to good parallel performance. The most efficient implementation requires consideration of both the flow solver and the grid assembly process during data partitioning and load balancing.

For parallel algorithm design, Foster [61] recommends a process denoted by the acronym PCAM, referring to partitioning, communication, agglomeration, and mapping, as mentioned previously. In this approach, Foster recommends that the finest grained partitioning of the work be defined along with all of the required communications. Then, the partitions are agglomerated to reduce the communications required and thus increase the computation to communication ratio. The final step is to map the work to processors based on the particular computer architecture. In this work, a nearly opposite approach is taken. The work is partitioned using coarse grain decomposition first. This allows a parallel implementation to be achieved with minimal code changes and with less expertise in the existing sequential code, as well as in parallel programming itself. This also allows the users to receive and to start using the code earlier. As the code performance is analyzed, the granularity of the decomposition is refined as required. Mapping of work to processes is done dynamically to achieve a good load balance; however, no machine specific issues of mapping work to specific processors are addressed.

CHAPTER 6
PARALLEL IMPLEMENTATIONS

Phase I: Hybrid Parallel-Sequential

The simplest approach to achieving a parallel version of Beggar for moving body problems is to use a separate front-end (FE) process that performs the grid assembly function for the complete domain in a serial fashion with respect to the parallel execution of the flow solver across multiple back-end (BE) processes. This requires that proper communication be established between the flow solution function and the grid assembly function; however, it does not require any consideration of load balancing or partitioning of the grid assembly function. This implementation is referred to as phase I and is depicted in figure 6.1.

This diagram, and the others like it that follow, are referred to as timing diagrams. The major functions are represented, and the diagram serves as a template of one iteration of the solution process. The vertical direction represents time, and this template can be stamped out in a vertical manner to construct a complete time history of the solution process. The boxes on the left represent the functions running in the FE process, while the boxes on the right represent the functions running in the BE processes. The arrows represent communication between specific functions on the FE and BE. Communication between functions on the same process is not shown explicitly. Vertical lines through a function indicate that it is spread across multiple processors. Although these diagrams are not drawn to scale, the work of a particular function is represented by the area of the box drawn for that function. Thus, as a function is spread across multiple processors, the width increases and the height decreases, representing the decrease in the time for executing the function.

Referring to figure 6.1, the solution process is started at time T1. Once the grid assembly function is completed at time T2, the interpolation stencils, iblank arrays, etc. are communicated from the FE to the BE so that the flow solution function can proceed. Once an iteration of the flow solver is completed, the forces and moments are integrated and are passed from the BE to the FE. The 6DOF function is then executed to reposition the grids and to calculate motion rates. Since the 6DOF function executes quickly, it is duplicated on the FE and the BE rather than communicating the resulting information.

Ignoring the cost of the force and moment calculation and the 6DOF integration, the flow solver represents the parallel work and the grid assembly represents the serial work. Based on the fractions of the total execution time represented by the flow solver and the grid assembly, equation 5.7 can be used to estimate the performance that can be achieved with this implementation. However, instead of the notation fs, fp, and fi, we will use the uppercase letters F and G to represent the flow solver and grid assembly functions, with the subscripts p, s, and i representing the fractions of the work executed in parallel or serial and the load imbalance factor, respectively. Thus, for the phase I implementation, the speedup can be approximated as

    s <= 1 / ((Fp/nbes) Fi + Gs)                               (6.1)

where nbes is the number of BE processes.
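A short sketch (illustrative only) evaluates equation 6.1 over a range of BE process counts, using the same parameter values as figure 6.2 below:

    #include <stdio.h>

    int main(void)
    {
        double Fp = 0.95, Gs = 0.05, Fi = 1.05;  /* figure 6.2 values */
        int nbes;

        /* speedup bound of the phase I implementation, equation 6.1 */
        for (nbes = 4; nbes <= 32; nbes *= 2)
            printf("nbes = %2d: s <= %.2f\n",
                   nbes, 1.0 / ((Fp / nbes) * Fi + Gs));
        return 0;
    }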
Since the work of the flow solver is closely related to the number of grid points, the equation

    Fi = max(no. of points on any processor) / (avg no. of points per processor)    (6.2)

can be used to obtain an approximation for the load imbalance factor for the flow solver. There are other factors, such as boundary conditions and the distribution of communication between processors, that affect the load balance; they will be ignored at this point.

Figure 6.1: Phase I implementation

Figure 6.2 shows a comparison of the estimated speedup of the phase I implementation to Amdahl's law. The estimated speedup of phase I is plotted using equation 6.1 with Fp = 0.95, Gs = 0.05, and Fi = 1.05, representing a nominal load imbalance of 5% in the distribution of the work of the flow solver. This plot shows the significant drop off in speedup with increased processor counts due to the serial fraction of the work. A small decrease in the performance of the phase I implementation (as compared to Amdahl's law), due to the load imbalance, can also be seen.

Figure 6.2: Comparison of estimated speedup of phase I to Amdahl's law

Phase II: Function Overlapping

Some parallel efficiency can be gained by overlapping the grid assembly function with the flow solution function. This overlapping could be achieved by updating the
The inner loop represents the inner iterations or iterations of the SGS solution of the linear equations from one step of the Newton's method. The outer loop represents the dt iterations or steps of the Newton's method. For time accurate flow calculations with Beggar, it is normal to run more than Figure 6.3: Basic flow solution algorithm one dt iteration to eliminate synchronization errors between grids and to achieve time accuracy. Each dt iteration produces an updated approximation to the flow solution at the next time step. Forces and moments can be calculated after each dt iteration using this approximate solution. If the forces and moments calculated after the first dt iteration are a good approximation to the final forces and moments, these values can be forwarded to the grid assembly process. This allows the computation of the grid assembly to proceed during the computation of additional dt iterations. If the computation time for the flow solver is sufficiently large, the computational cost of the grid assembly process can be hidden. This implementation is referred to as phase II and is depicted in figure 6.4. Rather than calculating forces and moments only after a complete time step of the flow solver, they are calculated after each dt iteration. The node labeled 1 in figure 6.3 represents the point where the forces and moments are calculated. Referring to figure 6.4, the solution process is started at time T1. After the first dt iteration, initial approximations to the forces and moments are calculated and are passed from the BE to the FE at time T2. The 6DOF and grid assembly functions proceed while the remaining dt iterations of the flow solver are completed. Once the 