APPLICATIONS OF PARALLEL GLOBAL OPTIMIZATION TO MECHANICS PROBLEMS

By

JACO FRANCOIS SCHUTTE

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2005

Copyright 2005 by JACO FRANCOIS SCHUTTE

This work is dedicated to my parents and my wife Lisa.

ACKNOWLEDGMENTS

First and foremost, I would like to thank Dr. Raphael T. Haftka, chairman of my advisory committee, for the opportunity he provided me to complete my doctoral studies under his exceptional guidance. Without his unending patience, constant encouragement, guidance and expertise, this work would not have been possible. Dr. Haftka's mentoring has made a lasting impression on both my academic and personal life.

I would also like to thank the members of my advisory committee, Dr. Benjamin Fregly, Dr. Alan D. George, Dr. Panos M. Pardalos, and Dr. Nam Ho Kim. I am grateful for their willingness to serve on my committee, for the help they provided, for their involvement with my oral examination, and for reviewing this dissertation. Special thanks go to Dr. Benjamin Fregly, who provided a major part of the financial support for my studies. Special thanks also go to Dr. Alan George, whose parallel processing graduate course provided much of the inspiration for the research presented in this manuscript, and who reviewed some of my publications. Thanks go also to Dr. Nielen Stander, who provided me with the wonderful opportunity to do an internship at the Livermore Software Technology Corporation.

My colleagues in the Structural and Multidisciplinary Optimization Research Group at the University of Florida also deserve many thanks for their support and the many fruitful discussions. Special thanks go to Tushar Goel, Erdem Acar, and also Dr. Satchi Venkataraman and his wife Beth, who took me in on my arrival in the USA and provided me a foothold for which I will forever be grateful.
The financial support provided by AFOSR grant F496200910070 to R.T.H., and by NASA Cooperative Agreement NCC3994, the "Institute for Future Space Transport" University Research, Engineering and Technology Institute, is gratefully acknowledged.

I would also like to express my deepest appreciation to my parents. Their limitless love, support and understanding are the mainstay of my achievements in life. Lastly, I would like to thank my wife, Lisa. Without her love, patience and sacrifice I would never have been able to finish this dissertation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
   Statement of Problem
   Purpose of Research
   Significance of Research
      Parallelism by Exploitation of Optimization Algorithm Structure
      Parallelism through Multiple Independent Concurrent Optimizations
      Parallelism through Concurrent Optimization of Decomposed Problems
   Roadmap

2 BACKGROUND
   Population-based Global Optimization
   Parallel Processing in Optimization
   Decomposition in Large Scale Optimization
   Literature Review: Problem Decomposition Strategies
      Collaborative Optimization (CO)
      Concurrent SubSpace Optimization (CSSO)
      Analytical Target Cascading (ATC)
      Quasiseparable Decomposition and Optimization

3 GLOBAL OPTIMIZATION THROUGH THE PARTICLE SWARM ALGORITHM
   Overview
   Introduction
   Theory
      Particle Swarm Algorithm
      Analysis of Scale Sensitivity
   Methodology
      Optimization Algorithms
      Analytical Test Problems
      Biomechanical Test Problem
   Results
   Discussion
   Conclusions

4 PARALLELISM BY EXPLOITING POPULATION-BASED ALGORITHM STRUCTURES
   Overview
   Introduction
   Serial Particle Swarm Algorithm
   Parallel Particle Swarm Algorithm
      Concurrent Operation and Scalability
      Asynchronous vs. Synchronous Implementation
      Coherence
      Network Communication
      Synchronization and Implementation
   Sample Optimization Problems
      Analytical Test Problems
      Biomechanical System Identification Problems
   Speedup and Parallel Efficiency
   Numerical Results
   Discussion
   Conclusions

5 IMPROVED GLOBAL CONVERGENCE USING MULTIPLE INDEPENDENT OPTIMIZATIONS
   Overview
   Introduction
   Methodology
      Analytical Test Set
      Multiple-run Methodology
      Exploratory Run and Budgeting Scheme
      Bayesian Convergence Probability Estimation
   Numerical Results
      Multirun Approach for Predetermined Number of Optimizations
      Multirun Efficiency
      Bayesian Convergence Probability Estimation
      Monte Carlo Convergence Probability Estimation
   Conclusions

6 PARALLELISM BY DECOMPOSITION METHODOLOGIES
   Overview
   Introduction
   Quasiseparable Decomposition Theory
   Stepped Hollow Cantilever Beam Example
      Stepped Hollow Beam Optimization
      Quasiseparable Optimization Approach
   Results
      All-at-once Approach
      Hybrid All-at-once Approach
      Quasiseparable Approach
      Approximation of Constraint Margins
   Discussion
   Conclusions

7 CONCLUSIONS
   Parallelism by Exploitation of Optimization Algorithm Structure
   Parallelism through Multiple Independent Optimizations
   Parallelism through Concurrent Optimization of Decomposed Problems
   Future Directions
   Summary

APPENDIX

A ANALYTICAL TEST PROBLEM SET
   Griewank
   Hartman 6
   Shekel 10

B MONTE CARLO VERIFICATION OF GLOBAL CONVERGENCE PROBABILITY

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1 Standard PSO algorithm parameters used in the study
2 Fraction of successful optimizer runs for the analytical test problems
3 Final cost function values and associated marker distance and joint parameter root-mean-square (RMS) errors after 10,000 function evaluations performed by multiple unscaled and scaled PSO, GA, SQP, and BFGS runs
4 Parallel PSO results for the biomechanical system identification problem using synthetic marker trajectories without and with numerical noise
5 Parallel PSO results for the biomechanical system identification problem using synthetic marker trajectories without and with numerical noise
6 Particle swarm algorithm parameters
7 Problem convergence tolerances
8 Theoretical convergence probability results for Hartman problem
9 Minimum, maximum and median fitness evaluations when applying ratio of change stopping criteria on pool of 1,000 optimizations for Griewank, Hartman and Shekel problems
10 Beam material properties and end load configuration
11 Stepped hollow beam global optimum
12 All-at-once approach median solution
13 Hybrid, all-at-once median solution
14 Quasiseparable optimization result
15 Surrogate lower level approximation optimization results
16 Hartman problem constants
17 Shekel problem constants

LIST OF FIGURES

1 Collaborative optimization flow diagram
2 Collaborative optimization subspace constraint satisfaction procedure (taken from [6])
3 Concurrent subspace optimization methodology flow diagram
4 Example hierarchical problem structure
5 Subproblem information flow
6 Joint locations and orientations in the parametric ankle kinematic model
7 Comparison of convergence history results for the analytical test problems
8 Final cost function values for ten unscaled (dark bars) and scaled (gray bars) parallel PSO, GA, SQP, and BFGS runs for the biomechanical test problem
9 Convergence history for unscaled (dark lines) and scaled (gray lines) parallel PSO, GA, SQP, and BFGS runs for the biomechanical test problem
10 Sensitivity of gradient calculations to selected finite difference step size for one design variable
11 Serial implementation of PSO algorithm
12 Parallel implementation of the PSO algorithm
13 Surface plots of the (a) Griewank and (b) Corana analytical test problems showing the presence of multiple local minima
14 Average fitness convergence histories for the (a) Griewank and (b) Corana analytical test problems for swarm sizes of 16, 32, 64, and 128 particles and 10,000 swarm iterations
15 Fitness convergence and parameter error plots for the biomechanical system identification problem using synthetic data with noise
16 (a) Speedup and (b) parallel efficiency for the analytical and biomechanical optimization problems
17 Multiple local minima for Griewank analytical problem surface plot in two dimensions
18 Cumulative convergence probability Pc as a function of the number of optimization runs with assumed equal per-run convergence probabilities
19 Fitness history and convergence probability Pc plots for Griewank, Hartman and Shekel problems
20 Typical Shekel fitness history plots of 20 optimizations (sampled out of 1000)
21 Shekel convergence probability for an individual optimization as a function of fitness evaluations and population size
22 Theoretical cumulative convergence probability Pc as a function of the number of optimization runs with constant per-run convergence probability for the Hartman problem
23 Theoretical convergence probability Pc with sets of multiple runs for the Griewank problem
24 Theoretical convergence probability Pc using information from exploratory optimizations which are stopped using a rate of change stopping condition for the Griewank, Hartman and Shekel problems
25 Bayesian Pc estimation comparison to using extrapolated and randomly sampled optimizations out of pool of 1000 runs for Griewank problem
26 Bayesian Pc estimation comparison to using extrapolated and randomly sampled optimizations out of pool of 1000 runs for Hartman problem
27 Bayesian Pc estimation comparison to using extrapolated and randomly sampled optimizations out of pool of 1000 runs for Shekel problem
28 Stepped hollow cantilever beam
29 Dimensional parameters of each cross section
30 Projected displacement in direction θ
31 Tip deflection contour plot as a function of beam section 5 height h and width w, with yield stress and aspect ratio constraints indicated by dashed and dash-dotted lines respectively
32 Quasiseparable optimization flow chart
33 Results for 1000 all-at-once optimizations
34 Hybrid PSO-fmincon strategy for 100 optimizations
35 Repeated optimizations of section 1 subproblem using fmincon function
36 Summed budget value and constraint margins for individual sections
37 Global and local optimum in section 1 suboptimization. Scale is 0.1:1
38 Decomposed cross section solution. Scale is 0.1:1
39 Target tip deflection value histories as a function of upper-level fitness evaluations
40 Constraint margin value histories as a function of upper-level function evaluations
41 Predicted and Monte Carlo sampled convergence probability Pc for 5 independent optimization runs for the Griewank problem
42 Predicted and Monte Carlo sampled convergence probability Pc for 12 independent optimization runs for the Griewank problem
43 Monte Carlo sampled convergence probability Pc with sets of multiple runs for the Griewank problem
44 Monte Carlo sampled convergence probability Pc using information from exploratory optimizations stopped using a rate of change stopping condition for the Griewank, Hartman and Shekel problems
45 Bayesian Pc comparison for Griewank, Hartman and Shekel problems

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

APPLICATIONS OF PARALLEL GLOBAL OPTIMIZATION TO MECHANICS PROBLEMS

By

Jaco Francois Schutte

December 2005

Chair: Raphael T. Haftka
Cochair: Benjamin J. Fregly
Major Department: Mechanical and Aerospace Engineering

Global optimization of complex engineering problems, with a high number of variables and local minima, requires sophisticated algorithms with global search capabilities and high computational efficiency. With the growing availability of parallel processing, it makes sense to address these requirements by increasing the parallelism in optimization strategies.
This study proposes three methods of concurrent processing. The first method entails exploiting the structure of population-based global algorithms such as the stochastic Particle Swarm Optimization (PSO) algorithm and the Genetic Algorithm (GA). As a demonstration of how such an algorithm may be adapted for concurrent processing, we modify and apply the PSO to several mechanical optimization problems on a parallel processing machine. Desirable PSO algorithm features such as insensitivity to design variable scaling and modest sensitivity to algorithm parameters are demonstrated.

A second approach to parallelism and improved algorithm efficiency is the use of multiple optimizations. With this method a budget of fitness evaluations is distributed among several independent suboptimizations in place of a single extended optimization. Under certain conditions this strategy obtains a higher combined probability of converging to the global optimum than a single optimization which utilizes the full budget of fitness evaluations.

The third and final method of parallelism addressed in this study is the use of quasiseparable decomposition, which is applied to decompose loosely coupled problems. This yields several subproblems of lesser dimensionality which may be concurrently optimized with reduced effort.

CHAPTER 1
INTRODUCTION

Statement of Problem

Modern large-scale problems often require high-fidelity analyses for every fitness evaluation. In addition, these optimization problems are global in nature in the sense that many local minima exist. These two factors combine to form exceptionally demanding optimization problems which require many hours of computation on high-end single-processor computers. In order to solve such challenging problems efficiently, parallelism may be employed for improved optimizer throughput on computational clusters or multicore processors.
Purpose of Research

The research presented in this manuscript investigates methods of implementing parallelism in global optimization. These methods are (i) parallel processing through the optimization algorithm, (ii) multiple independent concurrent optimizations, and (iii) parallel processing by decomposition. Related methods in the literature are reported and compared to the approaches formulated in this study.

Significance of Research

Parallel processing is a rapidly growing resource in the engineering community. Large "processor farms" or Beowulf clusters are becoming increasingly common at research and commercial engineering facilities. In addition, processor manufacturers are encountering physical limitations such as heat dissipation and constraints on processor dimensions imposed by the upper limit on signal speeds. These factors cap the clock frequencies that can be attained and have forced manufacturers to look at other ways to improve processing capability. Both Intel and AMD are currently developing methods of putting multiple processors on a single die and will be releasing multiprocessor cores in the consumer market in the near future. This multicore technology will enable even users of desktop computers to utilize concurrent processing and make it an increasingly cheap commodity.

The engineering community is facing more complex and computationally demanding problems as the fidelity of simulation software improves every day. New methods that can take advantage of the increasing availability of parallel processing will give the engineer powerful tools to solve previously intractable problems. In this manuscript the specific problem of the optimization of large-scale global engineering problems is addressed by utilizing three different avenues of parallelism.
Any one of these methods, and even combinations of them, may utilize concurrent processing to its advantage.

Parallelism by Exploitation of Optimization Algorithm Structure

Population-based global optimizers such as the Particle Swarm Optimizer (PSO) or Genetic Algorithms (GAs) coordinate their search effort in the design space by evaluating a population of individuals in an iterative fashion. These iterations take the form of discrete time steps for the PSO and generations in the case of the GA. Both the PSO and the GA algorithm structures allow the fitness of individuals in the population to be evaluated independently and concurrently. This opens up the possibility of assigning a computational node or processor in a networked group of machines to each individual in the population, and calculating the fitness of each individual concurrently for every iteration of the optimization algorithm.

Parallelism through Multiple Independent Concurrent Optimizations

A single optimization of a large-scale problem has a significant probability of becoming entrapped in a local minimum. This risk is alleviated by utilizing population-based algorithms such as the PSO and GAs, which have the means of escaping from such a local minimum if enough iterations are allowed. Alternatively, a larger population may be used, allowing for higher sampling densities of the design space, which also reduces the risk of entrapment. Both these options require significant additional computational effort, with no guarantee of improvement in global convergence probability. A more effective strategy can be followed while utilizing the same amount of resources: by running several independent but limited optimizations, it will be shown that in most cases the combined probability of finding the global optimum is greatly improved.
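The improvement has a simple idealized model: if each limited run converges to the global optimum independently with probability p, then at least one of n runs succeeds with probability Pc = 1 - (1 - p)^n. The sketch below illustrates this compounding; the function name and example probabilities are illustrative, not taken from the dissertation, and the equal-p assumption is relaxed in Chapter 5.

```python
def combined_convergence_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent optimization runs,
    each converging to the global optimum with probability p, succeeds."""
    return 1.0 - (1.0 - p) ** n

# Ten short runs, each with only a 30% chance of finding the global
# optimum, together succeed with probability above 97%:
print(round(combined_convergence_probability(0.3, 10), 3))  # → 0.972
```

Even a modest per-run probability compounds quickly, which is why splitting a fixed evaluation budget across several independent runs can outperform a single long run.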
The limited optimization runs are rendered independent by applying a population-based optimizer with different sets of initial population distributions in the design space.

Parallelism through Concurrent Optimization of Decomposed Problems

Some classes of large-scale problems may be subdivided into several more tractable subproblems by applying decomposition strategies. This process of decomposition generally involves identifying groups of variables and constraints with minimal influence on one another. The choice of which decomposition strategy to apply depends largely on the original problem structure and the interaction among variables. The objective is to find an efficient decomposition strategy that separates such large-scale global optimization problems into smaller subproblems without introducing spurious local minima, and to apply an efficient optimizer to solve the resulting subproblems.

Roadmap

A background on the global optimization of large-scale optimization problems, appropriate optimization algorithms and techniques, and parallelism is presented in Chapter 2. Chapter 3 presents an evaluation of the Particle Swarm Optimizer, a global, stochastic, population-based algorithm, on several analytical and biomechanical system identification problems. In Chapter 4 the parallelization of this population-based algorithm is demonstrated and applied. Chapter 5 details the use of multiple independent concurrent optimizations for significant improvements in combined convergence probability. Chapter 6 shows how complex structural problems with a large number of variables may be decomposed into multiple independent subproblems which can be optimized concurrently using a two-level optimization scheme. In Chapter 7 conclusions are drawn and avenues for future research are proposed.

CHAPTER 2
BACKGROUND

Population-based Global Optimization

Global optimization often requires specialized robust approaches.
These include stochastic and/or population-based optimizers such as GAs and the PSO. The focus of this research is on exploring avenues of parallelism in population-based optimization algorithms. We demonstrate these methods using the Particle Swarm Optimizer, a stochastic search algorithm suited to continuous problems. Other merits of the PSO include low sensitivity to algorithm parameters and insensitivity to the scaling of design variables; these qualities are investigated in Chapter 3. The algorithm does not require gradients, which is an important consideration when solving problems of high dimensionality, as is often the case in large-scale optimization. The PSO has performance comparable to GAs, which are also candidates for any of the methods of parallelism proposed in this manuscript and may be more suitable for discrete or mixed-variable problems.

In the research presented in this manuscript the PSO is applied to a biomechanical problem with a large number of continuous variables. This problem has several local minima and, when attempts were made to solve it with gradient-based optimizers, demonstrated high sensitivity to the scaling of design variables. This made it an ideal candidate for demonstrating the desirable qualities of the algorithm. Other application problems include structural sizing problems and composite laminate angle optimization.

Parallel Processing in Optimization

There are five approaches which may be utilized to decompose a single computational task into smaller problems which may then be solved concurrently: geometric decomposition, iterative decomposition, recursive decomposition, speculative decomposition and functional decomposition, or a combination of these [1,2]. Among these, functional decomposition is most commonly applied and is also the method of implementation presented in Chapter 4. The steps followed in parallelizing a sequential program consist of the following, from [1]:

1. Decompose the sequential program or data into smaller tasks.
2. Assign the tasks to processes. A process is an abstract entity that performs tasks.
3. Orchestrate the necessary data access, communication, and synchronization.
4. Map or assign the processes to computational nodes.

The functional decomposition method is based on the premise that applications such as an optimization algorithm may be broken into many distinct phases, each of which interacts with some or all of the others. These phases can be implemented as coroutines, each of which executes for as long as it is able, then invokes another and remains suspended until it is again needed. Functional decomposition is the simplest way of parallelizing a program if it can be implemented by turning a high-level description of the application into a set of cooperating processes [2]. When using this method, balancing the throughput of the different computational stages becomes highly problematic when there are dependencies between stages, for example when data requires sequential processing by several stages. This limits the parallelism that may be achieved using functional decomposition; any further parallelism must be achieved through geometric, iterative or speculative decomposition within a functional unit [2].

When a task is decomposed into concurrent processes, additional communication among these routines is required for coordination and the interchange of data. Among the methods of communication for parallel programming, the parallel virtual machine (PVM) and the message-passing interface (MPI) are the most widely used. For the research undertaken in Chapter 4 a portable implementation of the MPI library [3,4] containing a set of parallel communication functions [5] is used.
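The coordinator/worker pattern behind these steps — one process hands out design points, worker processes evaluate their fitness concurrently — can be sketched without an MPI runtime using Python's standard multiprocessing module. The swarm, worker count, and sphere objective below are illustrative stand-ins, not the dissertation's actual implementation:

```python
import multiprocessing as mp

def fitness(x):
    # Placeholder objective (sphere function) standing in for an
    # expensive simulation-based fitness evaluation.
    return sum(xi * xi for xi in x)

def evaluate_swarm(swarm, workers=4):
    # Each particle's fitness is independent of the others, so one
    # iteration's evaluations can be farmed out to a pool of worker
    # processes, mirroring the one-node-per-individual scheme above.
    with mp.Pool(workers) as pool:
        return pool.map(fitness, swarm)

if __name__ == "__main__":
    swarm = [[1.0, 2.0], [0.0, 0.0], [-3.0, 4.0]]
    print(evaluate_swarm(swarm))  # → [5.0, 0.0, 25.0]
```

In an MPI setting the pool is replaced by explicit sends of particle positions to ranked worker processes and receives of fitness values, but the synchronization point at the end of each iteration is the same.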
Decomposition in Large Scale Optimization

The optimization community has developed several formalized decomposition methods, such as Collaborative Optimization (CO) [6], Concurrent SubSpace Optimization (CSSO) [7], and quasiseparable decomposition, to deal with the challenges presented by large scale engineering problems. The problems addressed using these schemes include multidisciplinary optimization in aerospace design and large scale structural and biomechanical problems. Decomposition methodologies in large scale optimization are currently intensely studied because increasingly advanced, higher fidelity simulation methods would otherwise render large scale problems intractable. Problem decomposition allows for:

1. Simplified decomposed subsystems. In most cases the decomposed subproblems are of reduced dimensionality, and therefore less demanding on optimization algorithms. An example of this is the number of gradient calculations required per optimization iteration, which, in the case of gradient-based algorithms, scales directly with problem dimensionality.

2. A broader work front to be attacked simultaneously, which results in a problem being solved in less time if the processing resources are available. Computational throughput is usually limited in a sequential fashion, i.e., by the FLOPS limit of a single computer. However, if multiple processing units are available this limit can be circumvented by using an array of networked computers, for example, a Beowulf cluster.

Several such decomposition strategies have been proposed (see the next section for a short review), all differing in the manner in which they address some or all of the following:

1. Decomposition boundaries, which may be disciplinary boundaries or component interfaces in a large structure.
2. Constraint handling.
3. Coordination among the decomposed subproblems.

Literature Review: Problem Decomposition Strategies

The following is a summary of methodologies used for the decomposition and optimization of large scale global problems.
This review forms the background for the study proposed in Chapter 6 of this manuscript.

Collaborative Optimization (CO)

Overview. The Collaborative Optimization (CO) strategy was first introduced by Kroo et al. [8]. Shortly after its introduction this bilevel optimization scheme was extended by Tappeta and Renaud [9,10] into three distinct formulations addressing multiobjective optimization of large scale systems. The CO paradigm is based on the observation that in an MDO problem the interaction among the different disciplinary experts optimizing a design is minimal for local changes within each discipline. This allows a large scale system to be decomposed into subsystems along domain-specific boundaries. These subsystems are optimized through local design variables specific to each subsystem, subject to the domain-specific constraints. The objective of each subsystem optimization is to maintain agreement on the interdisciplinary design variables. A system level optimizer enforces this interdisciplinary compatibility while minimizing the overall objective function. This is achieved by combining the system level fitness with the cumulative sum of all discrepancies between interdisciplinary design variables. The strategy is well suited to parallel computation because of the minimal interaction between the different design disciplines, which results in reduced communication overhead during the course of the optimization.

Methodology. This decomposition strategy is described by the flow diagram in Figure 1. As mentioned previously, the CO strategy is a bilevel method in which the system level optimization sets and adjusts interdisciplinary design variable targets during the optimization. Each subspace optimizer attempts both to satisfy local constraints by adjusting local parameters and to meet the interdisciplinary design variable targets set by the system level optimizer.
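This bilevel pattern can be caricatured with a one-dimensional toy in Python: each "discipline" below is reduced to a box constraint, the subspace optimizations have closed-form solutions, and a crude grid search stands in for the system level optimizer. All numbers and constraints are invented for illustration.

```python
def subproblem(z, lo, hi):
    """Subspace optimization: the locally feasible point closest to the
    system target z (the local constraint set here is simply [lo, hi])."""
    x = min(max(z, lo), hi)
    discrepancy = (x - z) ** 2
    return x, discrepancy

def system_level(w=1000.0):
    """System optimizer: minimize f(z) = z**2 plus the penalized sum of
    subspace discrepancies (a grid search stands in for a real optimizer)."""
    best_z, best_val = None, float("inf")
    for i in range(6001):
        z = i / 1000.0                      # grid over [0, 6]
        _, d1 = subproblem(z, 2.0, 10.0)    # "discipline" 1 requires x >= 2
        _, d2 = subproblem(z, -10.0, 5.0)   # "discipline" 2 requires x <= 5
        val = z ** 2 + w * (d1 + d2)
        if val < best_val:
            best_z, best_val = z, val
    return best_z

z_star = system_level()
# Compatibility with discipline 1 pushes the system optimum from z = 0 up to z close to 2.
```

The penalty term plays the role of the cumulative interdisciplinary discrepancy sum described above: with a large weight, the system level design settles where all subspaces can match the target while the overall objective is as small as possible.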
Departures from the target interdisciplinary design parameters are allowed, for example because of insufficient local degrees of freedom, but are to be minimized. The system level optimizer attempts to adjust the interdisciplinary parameters such that the objective function is minimized while maximizing the agreement between subsystems. This process of adjusting the system level target design, with the subsystems attempting to match it whilst satisfying local constraints, is repeated until convergence. The procedure is illustrated graphically in Figure 2 (taken from Kroo and Manning [6]). The system level optimizer sets a design target P, and each subspace optimization attempts to satisfy local constraints while matching the target design P as closely as possible by moving in directions 1 and 2. During the next system level optimization cycle the target design P is moved in direction 3 in order to maximize the agreement between the target design and the subspace designs that satisfy local constraints.

Figure 1 Collaborative optimization flow diagram

Figure 2 Collaborative optimization subspace constraint satisfaction procedure (taken from [6])

Refinements on method. Several enhancements to this method have been proposed, among which are the integration of this architecture into a decision based design framework as proposed by Hazelrigg [11,12], the use of response surfaces [13] to model disciplinary analyses, and genetic algorithms with scheduling for increased efficiency [14].

Solution quality and computational efficiency.
Sobieski and Kroo [15] report very robust performance of their CO scheme, with identical solutions found by both collaborative and single level optimizations of problems with 45 design variables and 26 constraints. Braun and Kroo [16] showed CO to be unsuited to small problems with strong coupling, but for large scale problems with weak coupling the CO methodology becomes more computationally efficient. They also found that the number of system level iterations depends on the level of coupling between subsystems, and that the required number of suboptimizations scales in proportion to the overall problem size. Similar findings are reported by Alexandrov [17]. Braun et al. [18] evaluated the performance of CO on the set of quadratic programming problems presented by Shankar et al. [19] for evaluating the CSSO method. Unlike CSSO, the CO method did not require an increased number of iterations for QP problems with strong coupling, and converged successfully in all cases.

Applications. This decomposition architecture has been demonstrated extensively on analytical test problems [18,20] and on aerospace optimization problems such as trajectory optimization [16,18], vehicle design [13,20-22], and satellite constellation configurations [23].

Concurrent SubSpace Optimization (CSSO)

Overview. This method was proposed by Sobieszczanski-Sobieski [24] and, like CO, divides the MDO problem along disciplinary boundaries. The main difference, however, is the manner in which the CSSO framework coordinates the subsystem optimizations. A bilevel optimization scheme is used in which the upper optimization problem consists of a linear [7] or second order [25] system approximation created with the use of the Global Sensitivity Equations (GSEs). This system approximation reflects changes in the constraints and the objective function as a function of the design variables.
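The role of such a sensitivity-based system approximation can be illustrated with a first-order Taylor expansion in Python; the nonlinear "disciplinary analysis" below is an invented stand-in, not a model from the cited work.

```python
import math

def analysis(x1, x2):
    """Invented nonlinear stand-in for a disciplinary analysis output."""
    return x1 ** 2 + math.exp(x2)

def linear_approximation(x1_0, x2_0):
    """First-order system approximation built from sensitivities evaluated
    at the current design state (analogous in spirit to using the GSEs)."""
    f0 = analysis(x1_0, x2_0)
    df_dx1 = 2.0 * x1_0           # analytic sensitivities of the toy analysis
    df_dx2 = math.exp(x2_0)
    return lambda x1, x2: f0 + df_dx1 * (x1 - x1_0) + df_dx2 * (x2 - x2_0)

approx = linear_approximation(1.0, 0.0)
near = abs(approx(1.1, 0.1) - analysis(1.1, 0.1))  # small error near the point
far = abs(approx(3.0, 2.0) - analysis(3.0, 2.0))   # large error far from it
```

The growing error away from the expansion point is exactly why such approximations must be rebuilt as the design moves.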
Because of nonlinearities, this approximation is accurate only in the immediate neighborhood of the current design state, and it must be updated after every upper level optimization iteration. After establishing a system level approximation, the subsystems are independently optimized using only the design variables local to each subspace. The system level approximation is then updated through a sensitivity analysis to reflect the changes in the subspace designs. The two steps of subspace optimization and system approximation are repeated by the upper level optimizer until convergence is achieved.

Methodology. The basic steps taken to optimize an MDO problem with the CSSO framework are as follows:

1. Select an initial set of designs for each subsystem.
2. Construct the system approximation using the GSEs.
3. a) Perform the subspace optimizations using local variables and objectives. b) Update the system approximation by performing a sensitivity analysis.
4. Optimize the design variables according to the system approximation.
5. Stop if converged; otherwise go to 3.

Step 3 contains the lower level optimization in this bilevel framework. The process is illustrated in Figure 3.

Figure 3 Concurrent subspace optimization methodology flow diagram

Refinements on method. As previously mentioned, early CSSO strategies used linear system level approximations obtained with the Global Sensitivity Equations (GSEs). Coordination of the subspace or disciplinary optimizations is achieved through system level sensitivity information. This imposes limits on the allowable deviation from the current design and requires a new approximation to be constructed at every system iteration to maintain reasonable accuracy. Recent research has focused on alternative methods of acquiring system level approximations for the coordination effort. Several authors [25-28] modified the system approximation to use a second order response surface approximation.
This is combined with a database of previous fitness evaluation points, which is used to create and update the response surface. The response surface then serves to couple the subsystem optimizations and coordinate the system level design.

Solution quality and computational efficiency. Shankar et al. [19] investigated the robustness of CSSO on a set of analytical problems. Several quadratic programming problems with weak and strong coupling between subsystems were evaluated with a modification of Sobieszczanski-Sobieski's nonhierarchical subspace optimization scheme [7]. The results indicated reasonable performance for problems with weak coupling between subsystems. For large problems with strong interactions between subsystems, this decomposition scheme proved unreliable at finding global sensitivities, leading to poor solutions. Tappeta et al. [29-31], using the iSIGHT software, analyzed two analytical and two structural problems, a welding design and a stepped beam weight minimization. They report that the Karush-Kuhn-Tucker conditions were met in some of the cases, and that most problems converged closely to the original problem solution. Lin and Renaud compared the commercial software package LANCELOT [32], which employs the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, to the CSSO strategy, the latter incorporating response surfaces. In this study the authors show similar computational efficiencies for small uncoupled analytical problems. For large scale MDO problems, however, the CSSO method consistently outperformed the LANCELOT optimizer in computational efficiency. Sellar et al. [26] compared a CSSO with neural network based response surface enhancements (CSSO-NN) with a full (all-at-once) system optimization. The CSSO-NN algorithm showed a distinct advantage in computational efficiency over the all-at-once approach, while maintaining a high level of robustness.

Applications.
This decomposition methodology has been applied to large scale aerospace problems such as high temperature and pressure aircraft engine components [29,33], aircraft brake component optimization [34], and aerospace vehicle design [26,35].

Analytical Target Cascading (ATC)

Overview. Analytical Target Cascading was introduced by Michelena et al. [36] in 1999 and developed further by Kim [37] as a product development tool. This method is typically used to solve object based decomposed system optimization problems. Tosserams et al. [38] introduced a Lagrangian relaxation strategy which in some cases improves the computational efficiency of the method by several orders of magnitude.

Methodology. ATC is a strategy which coordinates hierarchically decomposed systems or elements of a problem (see Figure 4) through the introduction of target and response coupling variables. Targets are set by parent elements and are met by responses from the child elements in the hierarchy (see Figure 5, obtained from [38]). At each element an optimization problem is formulated to find the local variables, parent responses, and child targets which minimize a penalized discrepancy function while meeting local constraints. The responses are rebalanced up through the higher levels by iteratively changing targets in a nested loop in order to obtain consistency. Several coordination strategies are available to determine the sequence in which the subproblems are solved and the order in which targets and responses are exchanged [39]. Proofs of convergence are also presented for some of these classes of approaches in [39].

Refinements on method. Tosserams et al. [38] introduced the use of augmented Lagrangian relaxation to reduce the computational cost associated with obtaining very accurate agreement between subproblems and with the coordination effort in the inner loop of the method. Allison et al. exploited the complementary nature of ATC and CO to obtain an optimization formulation called nested ATC-MDO [40]. Kokkolaras et al.
extended the formulation of ATC to include the design of product families [41].

Figure 4 Example hierarchical problem structure

Figure 5 Subproblem information flow (obtained from [38])

Solution quality and computational efficiency. Michelena et al. [39] prove, under certain convexity assumptions, that the ATC process yields the optimal solution of the original design target problem. The original ATC formulation had the twofold problem of requiring large penalty weights to obtain accurate solutions, and of excessive repetition of the inner loop, which must solve the subproblems before the outer loop can proceed. Both problems are addressed by the augmented Lagrangian relaxation of Tosserams et al. [38], who reported decreases in computational effort of between one and four orders of magnitude.

Applications. The ATC strategy has been applied to the design of structural members and an electric water pump in [40], and to automotive design [42,43]. The performance of the augmented Lagrangian relaxation enhancement of ATC was tested on several geometric programming problems [38].

Quasiseparable Decomposition and Optimization

The quasiseparable decomposition and optimization strategy is the focus of the research proposed in Chapter 6. This methodology addresses a class of problems common in the field of engineering and can be applied to a wide class of structural, biomechanical, and other disciplinary problems. The strategy, which is explained in detail in Chapter 6, is based on a two-level optimization approach which allows the global search effort to be concentrated in the lower level subproblem optimizations. The system to be optimized is decomposed into several lower level subsystem optimizations which are coordinated by an upper level optimization.
A Sequential Quadratic Programming (SQP) based optimizer is applied within this optimization infrastructure to solve an example structural sizing problem. The example problem entails the maximization of the tip displacement of a hollow stepped cantilever beam with 5 sections. The quasiseparable decomposition methodology is applied to decompose the structure into several subproblems of reduced dimensionality. Parallelism with this strategy is achieved by optimizing the independent subproblems (sections of the stepped beam) concurrently, allowing the utilization of parallel processing resources.

CHAPTER 3
GLOBAL OPTIMIZATION THROUGH THE PARTICLE SWARM ALGORITHM

Overview

This chapter introduces the population based algorithm which is the target for investigating parallelism throughout the manuscript. This stochastic algorithm mimics the swarming or flocking behavior found in animal groups such as bees, fish, and birds. The swarm of particles essentially performs several parallel individual searches that are influenced by individual and swarm memory of regions of the design space with high fitness. The positions of these regions are constantly updated and reported to the individuals in the swarm through a simple communication model. This model allows the algorithm to be easily decomposed into concurrent processes, each representing an individual particle in the swarm. This approach to parallelism is detailed in Chapter 4. To illustrate the performance of the PSO, it is compared with several other algorithms commonly used for solving problems in biomechanics. This comparison is made through the optimization of several analytical problems and a biomechanical system identification problem. The focus of the research presented in this chapter is to demonstrate that the PSO has good properties such as insensitivity to the scaling of design variables and very few algorithm parameters to fine tune.
These make the PSO a valuable addition to the arsenal of optimization methods in biomechanical optimization, in which it has never been applied before. The work presented in this chapter was done in collaboration with Jeff Reinbolt, who supplied the biomechanical test problem [58], and Byung Il Koh, who developed the parallel SQP and BFGS algorithms [77] used to establish a computational efficiency comparison. The research in this chapter was also published in [58,76,129]. Thanks go to Soest and Casius for their willingness to share the numerical results published in [48].

Introduction

Optimization methods are used extensively in biomechanics research to predict movement-related quantities that cannot be measured experimentally. Forward dynamic, inverse dynamic, and inverse static optimizations have been used to predict muscle, ligament, and joint contact forces during experimental or predicted movements (e.g., see references [44-55]). System identification optimizations have been employed to tune a variety of musculoskeletal model parameters to experimental movement data (e.g., see references [56-60]). Image matching optimizations have been performed to align implant and bone models to in vivo fluoroscopic images collected during loaded functional activities (e.g., see references [61-63]). Since biomechanical optimization problems are typically nonlinear in the design variables, gradient-based nonlinear programming has been the most widely used optimization method. The increasing size and complexity of biomechanical models has also led to parallelization of gradient-based algorithms, since gradient calculations can be easily distributed to multiple processors [44-46]. However, gradient-based optimizers can suffer from several important limitations. They are local rather than global by nature and so can be sensitive to the initial guess. Experimental or numerical noise can exacerbate this problem by introducing multiple local minima into the problem.
For some problems, multiple local minima may exist due to the nature of the problem itself. In most situations the necessary gradient values cannot be obtained analytically, and finite difference gradient calculations can be sensitive to the selected finite difference step size. Furthermore, the use of design variables with different length scales or units can produce poorly scaled problems that converge slowly or not at all [64,65], necessitating design variable scaling to improve performance. Motivated by these limitations and by improvements in computer speed, recent studies have begun investigating the use of nongradient global optimizers for biomechanical applications. Neptune [47] compared the performance of a simulated annealing (SA) algorithm with that of downhill simplex (DS) and sequential quadratic programming (SQP) algorithms on a forward dynamic optimization of bicycle pedaling utilizing 27 design variables. Simulated annealing found a better optimum than the other two methods, and in a reasonable amount of CPU time. More recently, Soest and Casius [48] evaluated a parallel implementation of a genetic algorithm (GA) using a suite of analytical test problems with up to 32 design variables and forward dynamic optimizations of jumping and isokinetic cycling with up to 34 design variables. The genetic algorithm generally outperformed all other algorithms tested, including SA, on both the analytical test suite and the movement optimizations. This study evaluates a recent addition to the arsenal of global optimization methods, particle swarm optimization (PSO), for use on biomechanical problems. A recently developed variant of the PSO algorithm is used for the investigation. The algorithm's global search capabilities are evaluated using a previously published suite of difficult analytical test problems with multiple local minima [48], while its insensitivity to design variable scaling is proven mathematically and verified using a biomechanical test problem.
For both categories of problems, PSO robustness, performance, and scale independence are compared with those of three off-the-shelf optimization algorithms: a genetic algorithm (GA), a sequential quadratic programming (SQP) algorithm, and the BFGS quasi-Newton algorithm. In addition, previously published results [48] for the analytical test problems permit comparison with a more complex GA algorithm (GA*), a simulated annealing algorithm (SA), a different SQP algorithm (SQP*), and a downhill simplex (DS) algorithm.

Theory

Particle Swarm Algorithm. Particle swarm optimization is a stochastic global optimization approach introduced by Kennedy and Eberhart [66]. The method's strength lies in its simplicity, being easy to code and requiring few algorithm parameters to define convergence behavior. The following is a brief introduction to the operation of the particle swarm algorithm, based on a recent implementation by Groenwold and Fourie [67] incorporating dynamic inertia and velocity reduction. Consider a swarm of p particles, where each particle's position x^i represents a possible solution point in the problem design space D. For each particle i, Kennedy and Eberhart [66] proposed that the position x^i_{k+1} be updated in the following manner:

x^i_{k+1} = x^i_k + v^i_{k+1}   (3.1)

with a pseudovelocity v^i_{k+1} calculated as follows:

v^i_{k+1} = w_k v^i_k + c_1 r_1 (p^i_k - x^i_k) + c_2 r_2 (g_k - x^i_k)   (3.2)

Here, subscript k indicates a (unit) pseudo-time increment. The point p^i_k is the best-found cost location by particle i up to time step k, and represents the cognitive contribution to the search vector v^i_{k+1}. Each component of v^i_{k+1} is constrained to be less than or equal to a maximum value defined in v_max. The point g_k is the global best-found position among all particles in the swarm up to time k and forms the social contribution to the velocity vector. The cost function values associated with p^i_k and g_k are denoted by f^i_best and f^g_best, respectively. The random numbers r_1 and r_2 are uniformly distributed in the interval [0,1].
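A minimal Python sketch of the update rules in Eqs. (3.1) and (3.2), including the dynamic inertia and maximum-velocity reduction discussed below, is given here for illustration. It is not the author's C implementation; the default parameters mirror Table 1, while the sphere test function and the shortened reduction delay are choices invented for this small demo.

```python
import random

def pso(f, bounds, particles=20, iters=400, c1=2.0, c2=2.0,
        w0=1.0, kappa=0.5, w_d=0.01, v_d=0.01, d=200, seed=1):
    """Minimal PSO: Eqs. (3.1)-(3.2) with dynamic inertia/velocity reduction."""
    rng = random.Random(seed)          # fixed seed -> reproducible trajectory
    n = len(bounds)
    w = w0
    v_max = [kappa * (ub - lb) for lb, ub in bounds]  # v_max = kappa(x_UB - x_LB)
    x = [[rng.uniform(lb, ub) for lb, ub in bounds] for _ in range(particles)]
    v = [[rng.uniform(0.0, vm) for vm in v_max] for _ in range(particles)]
    p = [xi[:] for xi in x]            # personal best positions p^i
    fp = [f(xi) for xi in x]           # personal best cost values
    fg = min(fp)                       # global best cost value
    g = p[fp.index(fg)][:]             # global best position g
    t = 0                              # iterations since last improvement in g
    for _ in range(iters):
        improved = False
        for i in range(particles):
            for j in range(n):
                r1, r2 = rng.random(), rng.random()
                # Eq. (3.2): inertia + cognitive + social contributions
                v[i][j] = (w * v[i][j] + c1 * r1 * (p[i][j] - x[i][j])
                           + c2 * r2 * (g[j] - x[i][j]))
                v[i][j] = max(-v_max[j], min(v_max[j], v[i][j]))
                x[i][j] += v[i][j]     # Eq. (3.1)
            fx = f(x[i])
            if fx < fp[i]:
                fp[i], p[i] = fx, x[i][:]
                if fx < fg:
                    fg, g = fx, x[i][:]
                    improved = True
        t = 0 if improved else t + 1
        if t >= d:                     # shrink search scope after d stalled iterations
            w *= 1.0 - w_d
            v_max = [vm * (1.0 - v_d) for vm in v_max]
            t = 0
    return g, fg

# Illustrative run on a 2-D sphere function (reduction delay shortened for the demo).
best, f_best = pso(lambda z: sum(c * c for c in z), [(-5.0, 5.0)] * 2, d=25)
```

With the fixed random seed the run is fully reproducible, which is also the mechanism used later in the scale-sensitivity experiments.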
Shi and Eberhart [68] proposed that the cognitive and social scaling parameters c_1 and c_2 be selected such that c_1 = c_2 = 2, to allow the products c_1 r_1 and c_2 r_2 each to have a mean of 1. The result of using these values is that the particles overshoot the attraction points p^i_k and g_k half the time, thereby maintaining separation in the group and allowing a greater area to be searched than if the particles did not overshoot. The inertia variable w_k, set to 1 at initialization, is a modification to the original PSO algorithm [66]. By reducing its value dynamically based on the cost function improvement rate, the search area is gradually reduced [69]. This dynamic reduction behavior is defined by w_d, the fraction by which the inertia w_k is reduced, v_d, the fraction by which the maximum velocity v_max is reduced, and d, the number of iterations with no improvement in g_k before these reductions take place [67] (see the algorithm flow description below). Initialization of the algorithm involves several important steps. Particles are randomly distributed throughout the design space, and the particle velocities v^i are initialized to random values within the limits 0 <= v^i_0 <= v_max. The particle velocity upper limit v_max is calculated as a fraction of the distance between the upper and lower bounds on the variables in the design space, v_max = kappa(x_UB - x_LB), with kappa = 0.5 as suggested in [69]. Iteration counters k and t are set to 0. Counter k monitors the total number of swarm iterations, while counter t monitors the number of swarm iterations since the last improvement in g. Thus, t is periodically reset to zero during the optimization while k is not. The algorithm flow can be represented as follows:

1. Initialize
a. Set constants kappa, c_1, c_2, k_max, v_max, w_0, w_d, v_d, and d
b. Set counters k = 0, t = 0. Set the random number seed.
c. Randomly initialize particle positions x^i_0 in D for i = 1, ..., p
d. Randomly initialize particle velocities 0 <= v^i_0 <= v_max for i = 1, ..., p
e.
Evaluate cost function values f^i_0 using design space coordinates x^i_0 for i = 1, ..., p
f. Set f^i_best = f^i_0 and p^i_0 = x^i_0 for i = 1, ..., p
g. Set f^g_best to the best f^i_best and g_0 to the corresponding x^i_0

2. Optimize
h. Update particle velocity vectors v^i_{k+1} using Eq. (3.2)
i. If any component of v^i_{k+1} exceeds its maximum allowable value, set that component to its maximum allowable value
j. Update particle position vectors x^i_{k+1} using Eq. (3.1)
k. Evaluate cost function values f^i_{k+1} using design space coordinates x^i_{k+1} for i = 1, ..., p
l. If f^i_{k+1} <= f^i_best, then set f^i_best = f^i_{k+1} and p^i_{k+1} = x^i_{k+1} for i = 1, ..., p
m. If f^i_{k+1} <= f^g_best, then set f^g_best = f^i_{k+1} and g_{k+1} = x^i_{k+1} for i = 1, ..., p
n. If f^g_best was improved in (m), reset t = 0; otherwise increment t
o. If the maximum number of function evaluations is exceeded, go to 3
p. If t = d, multiply w_{k+1} by (1 - w_d) and v_max by (1 - v_d)
q. Increment k
r. Go to (h)

3. Report results

4. Terminate

This algorithm was coded in the C programming language by the author [70] and used for all PSO analyses performed in the study. A standard population size of 20 particles was used for all runs, and the other algorithm parameters were also selected based on standard recommendations (Table 1) [70-72]. The C source code for our PSO algorithm is freely available at http://www.mae.ufl.edu/fregly/downloads/pso.zip (last accessed 12/2005).

Table 1 Standard PSO algorithm parameters used in the study

Parameter | Description | Value
p | Population size (number of particles) | 20
c_1 | Cognitive trust parameter | 2.0
c_2 | Social trust parameter | 2.0
w_0 | Initial inertia | 1
w_d | Inertia reduction parameter | 0.01
kappa | Bound on velocity fraction | 0.5
v_d | Velocity reduction parameter | 0.01
d | Dynamic inertia/velocity reduction delay (function evaluations) | 200

Analysis of Scale Sensitivity. One of the benefits of the PSO algorithm is its insensitivity to design variable scaling. To prove this characteristic, we use a proof by induction to show that all particles follow an identical path through the design space regardless of how the design variables are scaled.
In actual PSO runs intended to investigate this property, use of the same random seed in the scaled and unscaled cases ensures that an identical sequence of random r_1 and r_2 values is produced by the computer throughout the course of the optimization. Consider an optimization problem with n design variables. An n-dimensional constant scaling vector

ζ = (ζ_1, ζ_2, ..., ζ_n)   (3.3)

can be used to scale any or all dimensions of the problem design space. We wish to show that for any time step k > 0,

ṽ_k = ζ v_k   (3.4)
x̃_k = ζ x_k   (3.5)

where x_k and v_k (dropping the superscript i) are the unscaled position and velocity, respectively, of an individual particle, x̃_k = ζ x_k and ṽ_k = ζ v_k are the corresponding scaled versions, and the product of two vectors is taken component by component. First, we must show that the proposition is true for the base case, which involves initialization (k = 0) and the first time step (k = 1). Applying the scaling vector ζ to an individual particle position x_0 during initialization produces a scaled particle position

x̃_0 = ζ x_0   (3.6)

This implies that

p̃_0 = ζ p_0,   g̃_0 = ζ g_0   (3.7)

In the unscaled case, the initial pseudovelocity is calculated as

v_0 = κ (x_UB - x_LB)   (3.8)

In the scaled case, this becomes

ṽ_0 = κ (x̃_UB - x̃_LB) = κ (ζ x_UB - ζ x_LB) = ζ [κ (x_UB - x_LB)] = ζ v_0   (3.9)

From Eqs. (3.1) and (3.2) and these initial conditions, the particle pseudovelocity and position for the first time step can be written as

v_1 = w_0 v_0 + c_1 r_1 (p_0 - x_0) + c_2 r_2 (g_0 - x_0)   (3.10)
x_1 = x_0 + v_1   (3.11)

in the unscaled case, and

ṽ_1 = w_0 ṽ_0 + c_1 r_1 (p̃_0 - x̃_0) + c_2 r_2 (g̃_0 - x̃_0) = w_0 ζ v_0 + c_1 r_1 ζ (p_0 - x_0) + c_2 r_2 ζ (g_0 - x_0) = ζ v_1   (3.12)
x̃_1 = x̃_0 + ṽ_1 = ζ x_0 + ζ v_1 = ζ (x_0 + v_1) = ζ x_1   (3.13)

in the scaled case. Thus, the proposition is true for the base case. Next, we must show that the proposition is true for the inductive step. If we assume the proposition holds for some time step k = j, we must prove that it also holds for time step k = j + 1. We begin by replacing subscript k with subscript j in Eqs. (3.4) and (3.5).
If we then replace subscript 0 with subscript j and subscript 1 with subscript j + 1 in Eqs. (3.12) and (3.13), we arrive at Eqs. (3.4) and (3.5) with subscript k replaced by subscript j + 1. Thus, the proposition is true for time step j + 1. Consequently, since the base case is true and the inductive step is true, Eqs. (3.4) and (3.5) are true for all k > 0. From Eqs. (3.4) and (3.5) we can conclude that any linear scaling of the design variables (or a subset thereof) has no effect on the final or any intermediate result of the optimization, since all velocities and positions are scaled accordingly. Identical steps (up to the scaling) are therefore taken in the design space for the scaled and unscaled versions of the same problem, assuming infinite precision in all calculations. In contrast, gradient-based optimization methods are often sensitive to design variable scaling due to algorithmic issues and numerical approximations. First derivative methods are sensitive because of algorithmic issues, as illustrated by a simple example. Consider the following minimization problem with two design variables (x, y), where the cost function is

f = x^2 + y^2/100   (3.14)

with initial guess (1, 10). A scaled version of the same problem can be created by letting x̃ = x, ỹ = y/10, so that the cost function becomes

f = x̃^2 + ỹ^2   (3.15)

with initial guess (1, 1). Taking first derivatives of each cost function with respect to the corresponding design variables and evaluating them at the initial guesses, the search direction for the unscaled problem is along a line rotated 5.7 degrees from the positive x axis, and for the scaled problem along a line rotated 45 degrees. To reach the optimum in a single step, the unscaled problem requires a search direction rotated 84.3 degrees and the scaled problem 45 degrees. Thus, the scaled problem can theoretically reach the optimum in a single step while the unscaled problem cannot, due to the effect of scaling on the calculated search direction.
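The quoted search-direction angles can be checked numerically. The short Python computation below uses the gradients of the two cost functions at initial guesses of (1, 10) for the unscaled problem and (1, 1) for the scaled problem, which reproduce the 5.7, 84.3, and 45 degree values.

```python
import math

# Unscaled problem: f(x, y) = x**2 + y**2/100, gradient (2x, y/50), initial guess (1, 10).
angle_unscaled = math.degrees(math.atan2(10.0 / 50.0, 2.0 * 1.0))  # about 5.7 deg

# Direction actually needed to reach the optimum (0, 0) from (1, 10).
angle_needed = math.degrees(math.atan2(10.0, 1.0))                 # about 84.3 deg

# Scaled problem: f = x**2 + y~**2 with y~ = y/10, initial guess (1, 1);
# the gradient direction and the direction to the optimum now coincide.
angle_scaled = math.degrees(math.atan2(2.0 * 1.0, 2.0 * 1.0))      # 45 deg
```

The mismatch between 5.7 and 84.3 degrees is exactly the scaling-induced distortion of the steepest descent direction that the scaled formulation removes.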
Second derivative methods are sensitive to design variable scaling because of numerical issues related to the approximation of the Hessian (second derivative) matrix. According to Gill et al. [64], Newton methods utilizing an exact Hessian matrix are insensitive to design variable scaling as long as the Hessian matrix remains positive definite. In practice, however, exact Hessian calculations are almost never available, necessitating numerical approximation via finite differencing. Errors in these approximations result in different search directions for the scaled and unscaled versions of the same problem. Even a small amount of design variable scaling can significantly affect the Hessian matrix, so that design variable changes of similar magnitude will not produce cost function changes of comparable magnitude [64]. Common gradient-based algorithms that employ an approximate Hessian include Newton and quasi-Newton nonlinear programming methods such as BFGS, SQP methods, and nonlinear least-squares methods such as Levenberg-Marquardt [64]. A detailed discussion of the influence of design variable scaling on optimization algorithm performance can be found in Gill et al. [64].

Methodology

Optimization Algorithms

In addition to our PSO algorithm, three off-the-shelf optimization algorithms were applied to all test problems (analytical and biomechanical; see below) for comparison purposes. One was a global GA developed by Deb [73-75]. This basic GA implementation utilizes one mutation operator and one crossover operator along with real encoding to handle continuous variables. The other two algorithms were commercial implementations of gradient-based SQP and BFGS algorithms (VisualDOC, Vanderplaats R & D, Colorado Springs, CO). All four algorithms (PSO, GA, SQP, and BFGS) were parallelized to accommodate the computational demands of the biomechanical test problem.
For the PSO algorithm, parallelization was performed by distributing individual particle function evaluations to different processors as detailed by the author in [76]. For the GA algorithm, individual chromosome function evaluations were parallelized as described in [48]. Finally, for the SQP and BFGS algorithms, finite difference gradient calculations were performed on different processors as outlined by Koh et al. in [77]. A master-slave paradigm using the Message Passing Interface (MPI) [3,4] was employed for all parallel implementations. Parallel optimizations for the biomechanical test problem were run on a cluster of Linux-based PCs in the University of Florida High-performance Computing and Simulation Research Laboratory (1.33 GHz Athlon CPUs with 256 MB memory on a 100 Mbps switched Fast Ethernet network). While the PSO algorithm used standard algorithm parameters for all optimization runs, minor algorithm tuning was performed on the GA, SQP, and BFGS algorithms for the biomechanical test problem. The goal was to give these algorithms the best possible chance for success against the PSO algorithm. For the GA algorithm, preliminary optimizations were performed using population sizes ranging from 40 to 100. It was found that for the specified maximum number of function evaluations, a population size of 60 produced the best results. Consequently, this population size was used for all subsequent optimization runs (analytical and biomechanical). For the SQP and BFGS algorithms, automatic tuning of the finite difference step size (FDSS) was performed separately for each design variable. At the start of each gradient-based run, forward and central difference gradients were calculated for each design variable beginning with a relative FDSS of 10^-1. The step size was then incrementally decreased by factors of ten until the absolute difference between forward and central gradient results was a minimum.
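The automatic step-size tuning described above can be sketched as follows. This is a simplified illustration only (the helper name and the number of trial steps are our assumptions, not the VisualDOC implementation): the step is reduced by factors of ten from 10^-1 and the value minimizing the disagreement between forward and central difference estimates is kept.

```python
def tune_fdss(f, x, i, h0=1e-1, n_steps=8):
    """Pick a forward-difference step for design variable i by minimizing
    the absolute difference between forward and central difference
    gradient estimates (hypothetical helper, illustrative only)."""
    def forward(h):
        xp = list(x)
        xp[i] += h
        return (f(xp) - f(x)) / h

    def central(h):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        return (f(xp) - f(xm)) / (2.0 * h)

    best_h, best_err = h0, abs(forward(h0) - central(h0))
    h = h0
    for _ in range(n_steps - 1):
        h /= 10.0  # decrease incrementally by factors of ten
        err = abs(forward(h) - central(h))
        if err < best_err:
            best_h, best_err = h, err
    return best_h
```

For a noisy cost function, the selected step tends to sit at the "sweet spot" where truncation error and noise-amplified round-off error balance (see Discussion).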
This approach was taken since the amount of noise in the biomechanical test problem prevented a single stable gradient value from being calculated over a wide range of FDSS values (see Discussion). The forward difference step size automatically selected for each design variable was used for the remainder of the run.

Analytical Test Problems

The global search capabilities of our PSO implementation were evaluated using a suite of difficult analytical test problems previously published by Soest and Casius [48]. In that study, each problem in the suite was evaluated using four different optimizers: SA, GA*, SQP*, and DS, where an asterisk indicates a different version of an algorithm also used in our study. One thousand optimization runs were performed with each optimizer, starting from random initial guesses and using standard optimization algorithm parameters. Each run was terminated based on a predefined number of function evaluations for the particular problem being solved. We followed an identical procedure with our four algorithms to permit comparison between our results and those published by Soest and Casius in [48]. Since two of the algorithms used in our study (GA and SQP) were of the same general category as algorithms used by Soest and Casius in [48] (GA* and SQP*), comparisons could be made between different implementations of the same general algorithm. Failed PSO and GA runs were allowed to use up the full number of function evaluations, whereas failed SQP and BFGS runs were restarted from new random initial guesses until the full number of function evaluations was completed. Only 100 rather than 1000 runs were performed with the SQP and BFGS algorithms due to a database size problem in the VisualDOC software. A detailed description of the six analytical test problems can be found in Soest and Casius [48].
Since the design variables for each problem possessed the same absolute upper and lower bounds and appeared in the cost function in a similar form, design variable scaling was not an issue in these problems. The six analytical test problems are described briefly below.

H1: This simple 2-dimensional function [48] has several local maxima and a global maximum of 2 at the coordinates (8.6998, 6.7665):

H1(x1, x2) = [sin^2(x1 - x2/8) + sin^2(x2 + x1/8)] / (d + 1),  x1, x2 in [-100, 100]    (3.16)

where d = sqrt((x1 - 8.6998)^2 + (x2 - 6.7665)^2). Ten thousand function evaluations were used for this problem.

H2: This inverted version of the F6 function used by Schaffer et al. [78] has 2 dimensions with several local maxima around the global maximum of 1.0 at (0, 0):

H2(x1, x2) = 0.5 - [sin^2(sqrt(x1^2 + x2^2)) - 0.5] / [1 + 0.001(x1^2 + x2^2)]^2,  x1, x2 in [-100, 100]    (3.17)

This problem was solved using 20,000 function evaluations per optimization run.

H3: This test function from Corana et al. [79] was used with dimensionality n = 4, 8, 16, and 32. The function contains a large number of local minima (on the order of 10^(4n)) with a global minimum of 0 at |x_i| < 0.05:

H3(x) = sum_{i=1}^{n} g_i(x_i),  x_i in [-1000, 1000]    (3.18)

with

g_i(x_i) = c (z_i - t sgn(z_i))^2 d_i  if |x_i - z_i| < t,  and  g_i(x_i) = d_i x_i^2  otherwise,

where z_i = floor(|x_i / s| + 0.49999) sgn(x_i) s, c = 0.15, s = 0.2, t = 0.05, and d_i = 1 (i = 1, 5, 9, ...), 1000 (i = 2, 6, 10, ...), 10 (i = 3, 7, 11, ...), 100 (i = 4, 8, 12, ...).

The use of the floor function in Eq. (3.18) makes the search space for this problem the most discrete of all problems tested. The number of function evaluations used for this problem was 50,000 (n = 4), 100,000 (n = 8), 200,000 (n = 16), and 400,000 (n = 32). For all of the analytical test problems, an algorithm was considered to have succeeded if it converged to within 10^-3 of the known optimum cost function value within the specified number of function evaluations [48].

Biomechanical Test Problem

In addition to these analytical test problems, a biomechanical test problem was used to evaluate the scale-independent nature of the PSO algorithm.
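For reference, the three analytical test functions above (Eqs. (3.16)-(3.18)) can be implemented directly. The following sketch is our own transcription of the published formulas, not the code used in the study; it reproduces the stated optima:

```python
import math

def sgn(v):
    # True signum: sgn(0) = 0, as required by the Corana function
    return (v > 0) - (v < 0)

def h1(x1, x2):
    # Eq. (3.16): global maximum of 2 at (8.6998, 6.7665)
    d = math.sqrt((x1 - 8.6998) ** 2 + (x2 - 6.7665) ** 2)
    return (math.sin(x1 - x2 / 8.0) ** 2
            + math.sin(x2 + x1 / 8.0) ** 2) / (d + 1.0)

def h2(x1, x2):
    # Eq. (3.17): inverted Schaffer F6, global maximum of 1.0 at (0, 0)
    r2 = x1 ** 2 + x2 ** 2
    return 0.5 - (math.sin(math.sqrt(r2)) ** 2 - 0.5) / (1.0 + 0.001 * r2) ** 2

def h3(x):
    # Eq. (3.18): Corana function, global minimum of 0 at |x_i| < 0.05
    c, s, t = 0.15, 0.2, 0.05
    d_cycle = [1.0, 1000.0, 10.0, 100.0]
    total = 0.0
    for i, xi in enumerate(x):
        di = d_cycle[i % 4]
        zi = math.floor(abs(xi / s) + 0.49999) * sgn(xi) * s
        if abs(xi - zi) < t:
            total += c * (zi - t * sgn(zi)) ** 2 * di
        else:
            total += di * xi ** 2
    return total
```

Evaluating h3 on a fine grid shows the plateau structure created by the floor function, which is what makes its search space effectively discrete.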
Though our PSO algorithm is theoretically insensitive to design variable scaling, numerical round-off errors and implementation details could potentially produce a scaling effect. Running the other three algorithms on scaled and unscaled versions of this test problem also permitted investigation of the extent to which other algorithms are influenced by design variable scaling. The biomechanical test problem involved determination of an ankle joint kinematic model that best matched noisy synthetic (i.e., computer generated) movement data. Similar to that used by van den Bogert et al. [56], the ankle was modeled as a three-dimensional linkage with two nonintersecting pin joints defined by 12 subject-specific parameters (Figure 6).

Figure 6: Joint locations and orientations in the parametric ankle kinematic model. Each p_i (i = 1, ..., 12) represents a different position or orientation parameter in the model.

These parameters represent the positions and orientations of the talocrural and subtalar joint axes in the tibia, talus, and calcaneus. Position parameters were in units of centimeters and orientation parameters in units of radians, resulting in parameter values of varying magnitude. This model was part of a larger 27 degree-of-freedom (DOF) full body kinematic model used to optimize other joints as well [58]. Given this model structure, noisy synthetic movement data were generated from a nominal model for which the "true" model parameters were known. Joint parameters for the nominal model along with a nominal motion were derived from in vivo experimental movement data using the optimization methodology described below. Next, three markers were attached to the tibia and calcaneus segments in the model at locations consistent with the experiment, and the 27 model DOFs were moved through their nominal motions.
This process created synthetic marker trajectories consistent with the nominal model parameters and motion and also representative of the original experimental data. Finally, numerical noise was added to the synthetic marker trajectories to emulate skin and soft tissue movement artifacts. For each marker coordinate, a sinusoidal noise function was used with uniformly distributed random period, phase, and amplitude (limited to a maximum of ±1 cm). The values of the sinusoidal parameters were based on previous studies reported in the literature [80,53]. An unconstrained optimization problem with bounds on the design variables was formulated to attempt to recover the known joint parameters from the noisy synthetic marker trajectories. The cost function was

min_p f(p)    (3.19)

with

f(p) = sum_{k=1}^{50} min_q sum_{j=1}^{6} sum_{i=1}^{3} (c_ijk - y_ijk(p, q))^2    (3.20)

where p is a vector of 12 design variables containing the joint parameters, q is a vector of 27 generalized coordinates for the kinematic model, c_ijk is the ith coordinate of synthetic marker j at time frame k, and y_ijk(p, q) is the corresponding marker coordinate from the kinematic model. At each time frame, y_ijk(p, q) was computed from the current model parameters p and an optimized model configuration q. A separate Levenberg-Marquardt nonlinear least-squares optimization was performed for each time frame in Eq. (3.20) to determine this optimal configuration. A relative convergence tolerance of 10^-3 was chosen to achieve good accuracy with minimum computational cost. A nested optimization formulation (i.e., minimization occurs in both Eqs. (3.19) and (3.20)) was used to decrease the dimensionality of the design space in Eq. (3.19). Equation (3.20) was coded in Matlab and exported as standalone C code using the Matlab Compiler (The MathWorks, Natick, MA). The executable read in a file containing the 12 design variables and output a file containing the resulting cost function value. This approach facilitated the use of different optimizers to solve Eq.
(3.19). To investigate the influence of design variable scaling on optimization algorithm performance, two versions of Eq. (3.20) were generated. The first used the original units of centimeters and radians for the position and orientation design variables, respectively. Bounds on the design variables were chosen to enclose a physically realistic region around the solution point in design space. Each position design variable was constrained to remain within a cube centered at the midpoint of the medial and lateral malleoli, where the length of each side was equal to the distance between the malleoli (i.e., 11.32 cm). Each orientation design variable was constrained to remain within a circular cone defined by varying its "true" value by ±30°. The second version normalized all 12 design variables to be within [-1, 1] using

x_norm = (2x - x_UB - x_LB) / (x_UB - x_LB)    (3.21)

where x_UB and x_LB denote the upper and lower bounds, respectively, on the design variable vector [81]. Two approaches were used to compare PSO scale sensitivity to that of the other three algorithms. For the first approach, a fixed number (10) of scaled and unscaled runs were performed with each optimization algorithm starting from different random initial seeds, and the sensitivity of the final cost function value to algorithm choice and design variable scaling was evaluated. The stopping condition for PSO and GA runs was 10,000 function evaluations, while SQP and BFGS runs were terminated when a relative convergence tolerance of 10^-5 or an absolute convergence tolerance of 10^-6 was met. For the second approach, a fixed number of function evaluations (10,000) was performed with each algorithm to investigate unscaled versus scaled convergence history. A single random initial guess was used for the PSO and GA algorithms, and each algorithm was terminated once it reached 10,000 function evaluations.
Since individual SQP and BFGS runs require far fewer than 10,000 function evaluations, repeated runs were performed with different random initial guesses until the total number of function evaluations exceeded 10,000 at the termination of a run. This approach essentially uses SQP and BFGS as global optimizers, where the separate runs are like individual particles that cannot communicate with one another but have access to local gradient information. Finite difference step size tuning at the start of each run was included in the count of function evaluations. Once the total number of runs required to reach 10,000 function evaluations was known, the lowest cost function value from all runs at each iteration was used to represent the cost over a range of function evaluations equal to the number of runs.

Results

For the analytical test problems, our PSO algorithm was more robust than our GA, SQP, and BFGS algorithms (Table 2, top half). PSO converged to the correct global solution nearly 100% of the time on four of the six test problems (H1 and H3 with n = 4, 8, and 16). It converged 67% of the time for problem H2 and only 1.5% of the time for problem H3 with n = 32. In contrast, none of the other algorithms converged more than 32% of the time on any of the analytical test problems. Though our GA algorithm typically exhibited faster initial convergence than did our PSO algorithm (Figure 7, left column), it leveled off and rarely reached the correct final point in design space within the specified number of function evaluations.

Table 2: Fraction of successful optimizer runs for the analytical test problems. Top half: Results from the PSO, GA, SQP, and BFGS algorithms used in the present study. Bottom half: Results from the SA, GA, SQP, and DS algorithms used in Soest and Casius [48]. The GA and SQP algorithms used in that study were different from the ones used in our study.
Successful runs were identified by a final cost function value within 10^-3 of the known optimum value, consistent with [48].

Study          Algorithm   H1      H2      H3 (n=4)  H3 (n=8)  H3 (n=16)  H3 (n=32)
Present        PSO         0.972   0.688   1.000     1.000     1.000      0.015
               GA          0.000   0.034   0.000     0.000     0.000      0.002
               SQP         0.09    0.11    0.00      0.00      0.00       0.00
               BFGS        0.00    0.32    0.00      0.00      0.00       0.00
Soest and      SA          1.000   0.027   0.000     0.001     0.000      0.000
Casius (2003)  GA          0.990   0.999   1.000     1.000     1.000      1.000
               SQP         0.279   0.810   0.385     0.000     0.000      0.000
               DS          1.000   0.636   0.000     0.000     0.000      0.000

Figure 7: Comparison of convergence history results for the analytical test problems. Left column: Results from the PSO, GA, SQP, and BFGS algorithms used in the present study. Right column: Results from the SA, GA, SQP, and DS algorithms used in Soest and Casius [48]. The GA and SQP algorithms used in that study were different from the ones used in our study. (a) Problem H1. The SA results have been updated using corrected data provided by Soest and Casius, since the results in [48] accidentally used a temperature reduction rate of 0.5 rather than the standard value of 0.85 as reported. (b) Problem H2. (c) Problem H3 with n = 4. (d) Problem H3 with n = 32. Error was computed using the known cost at the global optimum and represents the average of 1000 runs (100 multistart SQP and BFGS runs in our study) with each algorithm.

Figure 8: Final cost function values for ten unscaled (dark bars) and scaled (gray bars) parallel PSO, GA, SQP, and BFGS runs for the biomechanical test problem.
Each pair of unscaled and scaled runs was started from the same initial point(s) in design space, and each run was terminated when the specified stopping criterion was met (see text). In contrast, the SQP and BFGS algorithms were highly sensitive to design variable scaling in the biomechanical test problem. For the ten trials, unscaled and scaled SQP or BFGS runs rarely converged to similar points in design space (note y axis scale in Figure 8) and produced large differences in final cost function value from one trial to the next (Figure 8c and d). Scaling improved the final result in seven out of ten SQP trials and in five out of ten BFGS trials. The best unscaled and scaled SQP final cost function values were 255 and 121, respectively, while those of BFGS were 355 and 102 (Table 3). Thus, scaling yielded the best result found with both algorithms. The best SQP and BFGS trials generally produced larger RMS marker distance errors (up to two times worse), orientation parameter errors (up to five times worse), and position parameter errors (up to six times worse) than those found by PSO or GA.

Table 3: Final cost function values and associated marker distance and joint parameter root-mean-square (RMS) errors after 10,000 function evaluations performed by multiple unscaled and scaled PSO, GA, SQP, and BFGS runs. See Figure 9 for corresponding convergence histories.

                                            RMS Error
Optimizer  Formulation  Cost      Marker          Orientation       Position
                        Function  Distances (mm)  Parameters (deg)  Parameters (mm)
PSO        Unscaled     69.5      5.44            2.63              4.47
           Scaled       69.5      5.44            2.63              4.47
GA         Unscaled     77.9      5.78            2.65              6.97
           Scaled       74.0      5.64            3.76              4.01
SQP        Unscaled     255       10.4            3.76              14.3
           Scaled       121       7.21            3.02              9.43
BFGS       Unscaled     69.5      5.44            2.63              4.47
           Scaled       69.5      5.44            2.63              4.47

Figure 9: Convergence history for unscaled (dark lines) and scaled (gray lines) parallel PSO, GA, SQP, and BFGS runs for the biomechanical test problem.
Each algorithm run was terminated after 10,000 function evaluations. Only one unscaled and one scaled PSO and GA run were required to reach 10,000 function evaluations, while repeated SQP and BFGS runs were required to reach that number. Separate SQP and BFGS runs were treated like individual particles in a single PSO run for calculating convergence history (see text).

Discussion

This chapter evaluated a recent variation of the PSO algorithm with dynamic inertia and velocity updating as a possible addition to the arsenal of methods that can be applied to difficult biomechanical optimization problems. For all problems investigated, our PSO algorithm with standard algorithm parameters performed better than did three off-the-shelf optimizers (GA, SQP, and BFGS). For the analytical test problems, PSO robustness was found to be better than that of two other global algorithms but worse than that of a third. For the biomechanical test problem with added numerical noise, PSO was found to be insensitive to design variable scaling, while GA was only mildly sensitive and SQP and BFGS were highly sensitive. Overall, the results suggest that our PSO algorithm is worth consideration for difficult biomechanical optimization problems, especially those for which design variable scaling may be an issue. Though our biomechanical optimization involved a system identification problem, PSO may be equally applicable to problems involving forward dynamic, inverse dynamic, inverse static, or image matching analyses. Other global methods such as SA and GA have already been applied successfully to such problems [47,48,62], and there is no reason to believe that PSO would not perform equally well. As with any global optimizer, PSO utilization would be limited by the computational cost of function evaluations given the large number required for a global search.
Our particle swarm implementation may also be applicable to some large-scale biomechanical optimization problems. Outside the biomechanics arena [71,72,82-91], PSO has been used to solve problems on the order of 120 design variables [89-91]. In the present study, our PSO algorithm was unsuccessful on the largest test problem, H3 with n = 32 design variables. However, in a recent study, our PSO algorithm successfully solved the Griewank global test problem with 128 design variables using population sizes ranging from 16 to 128 [76]. When the Corana test problem (H3) was attempted with 128 DVs, the algorithm exhibited worse convergence. Since the Griewank problem possesses a bumpy but continuous search space and the Corana problem a highly discrete search space, our PSO algorithm may work best on global problems with a continuous search space. It is not known how our PSO algorithm would perform on biomechanical problems with several hundred DVs, such as the forward dynamic optimizations of jumping and walking performed with parallel SQP in [44-46]. One advantage of global algorithms such as PSO, GA, and SA is that they often do not require significant algorithm parameter tuning to perform well on difficult problems. The GA used by Soest and Casius in [48] (which is not freely available) required no tuning to perform well on all of these particular analytical test problems. The SA algorithm used by Soest and Casius in [48] required tuning of two parameters to improve algorithm robustness significantly on those problems. Our PSO algorithm (which is freely available) required tuning of one parameter (w_d, which was increased from 1.0 to 1.5) to produce 100% success on the two problems where it had significant failures. For the biomechanical test problem, our PSO algorithm required no tuning, and only the population size of our GA algorithm required tuning to improve convergence speed.
Neither algorithm was sensitive to the two sources of noise present in the problem: noise added to the synthetic marker trajectories, and noise due to a somewhat loose convergence tolerance in the Levenberg-Marquardt suboptimizations. Thus, for many global algorithm implementations, poor performance on a particular problem can be rectified by minor tuning of a small number of algorithm parameters.

Figure 10: Sensitivity of gradient calculations to selected finite difference step size for one design variable. Forward and central differencing were evaluated using relative convergence tolerances of 10^-3 and 10^-6 for the nonlinear least-squares suboptimizations performed during cost function evaluation (see Eq. (3.20)).

In contrast, gradient-based algorithms such as SQP and BFGS can require a significant amount of tuning even to begin to approach global optimizer results on some problems. For the biomechanical test problem, our SQP and BFGS algorithms were highly tuned by scaling the design variables and determining the optimal FDSS for each design variable separately. FDSS tuning was especially critical due to the two sources of noise noted above. When forward and central difference gradient results were compared for one of the design variables using two different Levenberg-Marquardt relative convergence tolerances (10^-3 and 10^-6), a "sweet spot" was found near a step size of 10^-2 (Figure 10). Outside of that "sweet spot," which was automatically identified and used in generating our SQP and BFGS results, forward and central difference gradient results diverged quickly when the looser tolerance was used.
Since most users of gradient-based optimization algorithms do not scale the design variables or tune the FDSS for each design variable separately, and many do not perform multiple runs, our SQP and BFGS results for the biomechanical test problem represent best-case rather than typical results. For this particular problem, an off-the-shelf global algorithm such as PSO or GA is preferable due to the significant reduction in effort required to obtain repeatable and reliable solutions. Another advantage of PSO and GA algorithms is the ease with which they can be parallelized [48,76] and their resulting high parallel efficiency. For our PSO algorithm, Schutte et al. [76] recently reported near ideal parallel efficiency for up to 32 processors. Soest and Casius [48] reported near ideal parallel efficiency for their GA algorithm with up to 40 processors. Though SA has historically been considered more difficult to parallelize [92], Higginson et al. [93] recently developed a new parallel SA implementation and demonstrated near ideal parallel efficiency for up to 32 processors. In contrast, Koh et al. [77] reported poor SQP parallel efficiency for up to 12 processors due to the sequential nature of the line search portion of the algorithm. The caveat for these parallel efficiency results is that the time required per function evaluation was approximately constant and the computational nodes were homogeneous. As shown in [76], when function evaluations take different amounts of time, parallel efficiency of our PSO algorithm (and any other synchronous parallel algorithm, including GA, SA, SQP, and BFGS) will degrade with increasing number of processors. Synchronization between individuals in the population or between individual gradient calculations requires slave computational nodes that have completed their function evaluations to sit idle until all nodes have returned their results to the master node.
Consequently, the slowest computational node (whether loaded by other users, performing the slowest function evaluation, or possessing the slowest processor in a heterogeneous environment) will dictate the overall time for each parallel iteration. An asynchronous PSO implementation with load balancing, where the global best-found position is updated continuously as each particle completes a function evaluation, could address this limitation. However, the extent to which convergence characteristics and scale independence would be affected is not yet known. To put the results of our study into proper perspective, one must remember that optimization algorithm robustness can be influenced heavily by algorithm implementation details, and no single optimization algorithm will work for all problems. For two of the analytical test problems (H2 and H3 with n = 4), other studies have reported PSO results using formulations that did not include dynamic inertia and velocity updating. Comparisons are difficult given differences in the maximum number of function evaluations and number of particles, but in general, algorithm modifications were (not surprisingly) found to influence algorithm convergence characteristics [94-96]. For our GA and SQP algorithms, results for the analytical test problems were very different from those obtained by Soest and Casius in [48] using different GA and SQP implementations. With seven mutation and four crossover operators, the GA algorithm used by Soest and Casius in [48] was obviously much more complex than the one used here. Both SQP algorithms, in contrast, were highly developed commercial implementations. Moreover, poor performance by a gradient-based algorithm can be difficult to correct even with design variable scaling and careful tuning of the FDSS.
These findings indicate that specific algorithm implementations, rather than general classes of algorithms, must be evaluated to reach any conclusions about algorithm robustness and performance on a particular problem.

Conclusions

In summary, the PSO algorithm with dynamic inertia and velocity updating provides another option for difficult biomechanical optimization problems, with the added benefit of being scale independent. There are few algorithm-specific parameters to adjust, and standard recommended settings work well for most problems [70,94]. In biomechanical optimization problems, noise, multiple local minima, and design variables of different scale can limit the reliability of gradient-based algorithms. The PSO algorithm presented here provides a simple-to-use off-the-shelf alternative for consideration in such cases. The algorithm's main drawback is the high cost in terms of function evaluations caused by slow convergence in the final stages of the optimization, a common trait among global search algorithms. The time requirements associated with the high computational cost may be circumvented by utilizing the parallelism inherent in the swarm algorithm. The development of such a parallel PSO algorithm is detailed in the next chapter.

CHAPTER 4
PARALLELISM BY EXPLOITING POPULATION-BASED ALGORITHM STRUCTURES

Overview

The structures of population-based optimizers such as Genetic Algorithms and the Particle Swarm may be exploited to enable these algorithms to utilize concurrent processing. These algorithms require a set of fitness values for the population or swarm of individuals at each iteration during the search. The fitness of each individual is evaluated independently, and the evaluations may therefore be assigned to separate computational nodes.
The development of such a parallel computational infrastructure is detailed in this chapter and applied to a set of large-scale analytical problems and a biomechanical system identification problem for the purpose of quantifying its efficiency. The parallelization of the PSO is achieved with a master-slave, coarse-grained implementation in which slave computational nodes are associated with individual particle search trajectories and assigned their fitness evaluations. Greatly enhanced computational throughput is demonstrated using this infrastructure, with efficiencies of 95% observed under load-balanced conditions. Numerical example problems with large load imbalances yield poor parallel performance, which decreases linearly as additional nodes are added. This infrastructure is based on a two-level approach with flexibility in terms of where the search effort can be concentrated. For the two problems presented, the global search effort is applied in the upper level. This work was done in collaboration with Jeff Reinbolt, who created the biomechanical kinematic analysis software [58] and evaluated the quality of the solutions found by the parallel PSO. The work presented in this chapter was also published in [58,76,129].

Introduction

Present day engineering optimization problems often impose large computational demands, resulting in long solution times even on a modern high-end processor. To obtain enhanced computational throughput and global search capability, we detail the coarse-grained parallelization of an increasingly popular global search method, the particle swarm optimization (PSO) algorithm. Parallel PSO performance was evaluated using two categories of optimization problems possessing multiple local minima: large-scale analytical test problems with computationally cheap function evaluations, and medium-scale biomechanical system identification problems with computationally expensive function evaluations.
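The master-slave evaluation step described above can be sketched compactly. This is our own illustration using Python's multiprocessing module rather than the MPI implementation of the study, and the sphere fitness is a placeholder for an expensive simulation:

```python
from multiprocessing import Pool

def fitness(particle):
    """Placeholder fitness evaluation (sphere function); in the actual
    application each evaluation is an expensive simulation."""
    return sum(x * x for x in particle)

def evaluate_swarm(swarm, n_nodes=4):
    """Master-slave step: farm the independent particle fitness
    evaluations out to worker processes, then synchronize."""
    with Pool(processes=n_nodes) as pool:
        # map() blocks until the slowest worker finishes,
        # mirroring the synchronization point of the parallel PSO
        return pool.map(fitness, swarm)

if __name__ == "__main__":
    swarm = [[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]
    print(evaluate_swarm(swarm))  # [5.0, 0.0, 25.0]
```

The blocking map is exactly where load imbalance hurts: every worker waits for the slowest evaluation before the next swarm iteration can begin.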
For load-balanced analytical test problems formulated using 128 design variables, speedup was close to ideal and parallel efficiency above 95% for up to 32 nodes on a Beowulf cluster. In contrast, for load-imbalanced biomechanical system identification problems with 12 design variables, speedup plateaued and parallel efficiency decreased almost linearly with increasing number of nodes. The primary factor affecting parallel performance was the synchronization requirement of the parallel algorithm, which dictated that each iteration must wait for completion of the slowest fitness evaluation. When the analytical problems were solved using a fixed number of swarm iterations, a single population of 128 particles produced a better convergence rate than did multiple independent runs performed using subpopulations (8 runs with 16 particles, 4 runs with 32 particles, or 2 runs with 64 particles). These results suggest that (1) parallel PSO exhibits excellent parallel performance under load-balanced conditions, (2) an asynchronous implementation would be valuable for real-life problems subject to load imbalance, and (3) larger population sizes should be considered when multiple processors are available. Numerical optimization has been widely used in engineering to solve a variety of NP-complete problems in areas such as structural optimization, neural network training, control system analysis and design, and layout and scheduling problems. In these and other engineering disciplines, two major obstacles limiting the solution efficiency are frequently encountered. First, even medium-scale problems can be computationally demanding due to costly fitness evaluations. Second, engineering optimization problems are often plagued by multiple local optima, requiring the use of global search methods such as population-based algorithms to deliver reliable results.
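The effect of the synchronization requirement noted above can be captured with a simple back-of-the-envelope model (our own sketch, not a measurement from the study): with one fitness evaluation per node, the parallel iteration time equals the slowest evaluation, so efficiency falls as soon as evaluation times differ.

```python
def efficiency(eval_times, n_nodes):
    """Parallel efficiency of one synchronous iteration, assuming one
    fitness evaluation per node: serial time / (nodes * parallel time)."""
    serial = sum(eval_times)       # one node doing all evaluations
    parallel = max(eval_times)     # everyone waits for the slowest node
    return serial / (n_nodes * parallel)

# Load balanced: equal evaluation times give ideal efficiency
print(efficiency([1.0, 1.0, 1.0, 1.0], 4))  # 1.0

# Load imbalanced: one slow evaluation drags all nodes down
print(efficiency([1.0, 1.0, 1.0, 4.0], 4))  # 0.4375
```

Under this model, adding nodes to an imbalanced problem only increases the number of idle processors per iteration, consistent with the near-linear efficiency decrease reported above.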
Fortunately, recent advances in microprocessor and network technology have led to increased availability of low-cost computational power through clusters of low to medium performance computers. To take advantage of these advances, communication layers such as MPI [3,5] and PVM [97] have been used to develop parallel optimization algorithms, the most popular being gradient-based, genetic (GA), and simulated annealing (SA) algorithms [48,98,99]. In biomechanical optimizations of human movement, for example, parallelization has allowed problems requiring days or weeks of computation on a single-processor computer to be solved in a matter of hours on a multiprocessor machine [98]. The particle swarm optimization (PSO) algorithm is a recent addition to the list of global search methods [100]. This derivative-free method is particularly suited to continuous-variable problems and has received increasing attention in the optimization community. It has been successfully applied to large-scale problems [69,100,101] in several engineering disciplines and, being a population-based approach, is readily parallelizable. It has few algorithm parameters, and generic settings for these parameters work well on most problems [70,94]. In this study, we present a parallel PSO algorithm for application to computationally demanding optimization problems. The algorithm's enhanced throughput due to parallelization and improved convergence due to increased population size are evaluated using large-scale analytical test problems and medium-scale biomechanical system identification problems. Both types of problems possess multiple local minima. The analytical test problems utilize 128 design variables to create a tortuous design space but with computationally cheap fitness evaluations. In contrast, the biomechanical system identification problems utilize only 12 design variables, but each fitness evaluation is much more costly computationally.
These two categories of problems provide a range of load balance conditions for evaluating the parallel formulation.

Serial Particle Swarm Algorithm

Particle swarm optimization was introduced in 1995 by Kennedy and Eberhart [66]. Although several modifications to the original swarm algorithm have been made to improve performance [68,102-105] and adapt it to specific types of problems [69,106,107], a parallel version has not been previously implemented. The following is a brief introduction to the operation of the PSO algorithm. Consider a swarm of p particles, with each particle's position representing a possible solution point in the design problem space D. For each particle i, Kennedy and Eberhart proposed that its position x_k^i be updated in the following manner:

x_{k+1}^i = x_k^i + v_{k+1}^i    (4.1)

with a pseudo-velocity v_{k+1}^i calculated as follows:

v_{k+1}^i = w_k v_k^i + c_1 r_1 (p_k^i - x_k^i) + c_2 r_2 (p_k^g - x_k^i)    (4.2)

Here, subscript k indicates a (unit) pseudo-time increment, p_k^i represents the best-ever position of particle i at time k (the cognitive contribution to the pseudo-velocity vector v_{k+1}^i), and p_k^g represents the global best position in the swarm at time k (the social contribution). r_1 and r_2 represent uniform random numbers between 0 and 1. To allow the product c_1 r_1 or c_2 r_2 to have a mean of 1, Kennedy and Eberhart proposed that the cognitive and social scaling parameters c_1 and c_2 be selected such that c_1 = c_2 = 2. The result of using these proposed values is that the particles overshoot the target half the time, thereby maintaining separation within the group and allowing a greater area to be searched than if no overshoot occurred. A modification by Fourie and Groenwold [69] to the original PSO algorithm [66] allows transition to a more refined search as the optimization progresses. This operator reduces the maximum allowable velocity v_max,k and the particle inertia w_k in a dynamic manner, as dictated by the dynamic reduction parameters v_d and w_d.
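As an illustration of Eqs. (4.1) and (4.2), the per-particle update (including the velocity clamp used by the dynamic reduction operator) can be sketched in a few lines. The sketch below is in Python rather than the ANSI C of the actual implementation, and the function and variable names are hypothetical:

```python
import random

def pso_update(x, v, p_best, p_gbest, w, c1=2.0, c2=2.0, v_max=None):
    """One velocity/position update for a single particle (Eqs. 4.1-4.2).

    x, v     : current position and pseudo-velocity (lists of floats)
    p_best   : best-ever position of this particle (cognitive term)
    p_gbest  : best-ever position of the whole swarm (social term)
    w        : inertia weight w_k
    c1, c2   : cognitive/social scaling (Kennedy and Eberhart suggest 2.0)
    v_max    : optional per-component velocity cap v_max,k
    """
    r1, r2 = random.random(), random.random()  # uniform in [0, 1)
    v_new, x_new = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, p_gbest):
        vn = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)  # Eq. (4.2)
        if v_max is not None:               # clamp to the dynamic maximum
            vn = max(-v_max, min(v_max, vn))
        v_new.append(vn)
        x_new.append(xi + vn)               # Eq. (4.1)
    return x_new, v_new
```

Note that r_1 and r_2 are drawn once per particle per update here; some PSO variants draw them per component instead.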
For the sake of brevity, further details of this operator are omitted, but a detailed description can be found in References [69,70]. The serial PSO algorithm as it would typically be implemented on a single-CPU computer is described below, where p is the total number of particles in the swarm. The best-ever fitness value of a particle at design coordinates p_k^i is denoted by f_best^i, and the best-ever fitness value of the overall swarm at coordinates p_k^g by f_gbest. At time step k = 0, the particle velocities v_0^i are initialized to values within the limits 0 ≤ v_0^i ≤ v_max^0. The vector v_max^0 is calculated as a fraction γ of the distance between the upper and lower bounds, v_max^0 = γ(x_UB - x_LB) [69], with γ = 0.5. With this background, the PSO algorithm flow can be described as follows:

1. Initialize
   a. Set constants c_1, c_2, k_max, v_max^0, w^0, v_d, w_d, and d
   b. Initialize the dynamic maximum velocity v_max,k and inertia w_k
   c. Set counters k = 0, t = 0, i = 1. Set random number seed.
   d. Randomly initialize particle positions x_0^i ∈ D for i = 1, ..., p
   e. Randomly initialize particle velocities 0 ≤ v_0^i ≤ v_max^0 for i = 1, ..., p
   f. Evaluate cost function values f_0^i using design space coordinates x_0^i for i = 1, ..., p
   g. Set f_best^i = f_0^i and p_0^i = x_0^i for i = 1, ..., p
   h. Set f_gbest to the best f_best^i and p_0^g to the corresponding x_0^i
2. Optimize
   a. Update the particle velocity vector v_{k+1}^i using Eq. (4.2)
   b. If v_{k+1}^i > v_max,k for any component, then set that component to its maximum allowable value
   c. Update the particle position vector x_{k+1}^i using Eq. (4.1)
   d. Evaluate the cost function value f_{k+1}^i using design space coordinates x_{k+1}^i
   e. If f_{k+1}^i ≤ f_best^i, then f_best^i = f_{k+1}^i and p_{k+1}^i = x_{k+1}^i
   f. If f_{k+1}^i ≤ f_gbest, then f_gbest = f_{k+1}^i and p_{k+1}^g = x_{k+1}^i
   g. If f_gbest was improved in (f), then reset t = 0; else increment t. If k > k_max, go to 3
   h. If t = d, then multiply w_{k+1} by (1 - w_d) and v_max,k by (1 - v_d)
   i. If the maximum number of function evaluations is exceeded, then go to 3
   j. Increment i. If i > p, then increment k and set i = 1
   k.
Go to 2(a).
3. Report results
4. Terminate

The above logic is illustrated as a flow diagram in Figure 11, without detailing the working of the dynamic reduction parameters. Problem-independent stopping conditions based on convergence tests are difficult to define for global optimizers. Consequently, we typically use a fixed number of fitness evaluations or swarm iterations as a stopping criterion.

Parallel Particle Swarm Algorithm

The following issues had to be addressed in order to create a parallel PSO algorithm.

Concurrent Operation and Scalability

The algorithm should operate in such a fashion that it can be easily decomposed for parallel operation on a multiprocessor machine. Furthermore, it is highly desirable that it be scalable. Scalability implies that the nature of the algorithm should not place a limit on the number of computational nodes that can be utilized, thereby permitting full use of available computational resources. An example of an algorithm with limited scalability is a parallel implementation of a gradient-based optimizer. This algorithm is decomposed by distributing the workload of the derivative calculations for a single point in design space among multiple processors. The upper limit on concurrent operations using this approach is therefore set by the number of design variables in the problem. On the other hand, population-based methods such as the GA and PSO are better suited to parallel computing. Here the population of individuals representing designs can be increased or decreased according to the availability and speed of processors. Any additional agents in the population allow for a higher-fidelity search of the design space, lowering susceptibility to entrapment in local minima. However, this comes at the expense of additional fitness evaluations.
Figure 11 Serial implementation of the PSO algorithm. To avoid complicating the diagram, we have omitted the velocity/inertia reduction operations.

Asynchronous vs. Synchronous Implementation

The original PSO algorithm was implemented with a synchronized scheme for updating the best 'remembered' individual and group fitness values f_k^i and f_k^g, respectively, and their associated positions p_k^i and p_k^g. This approach entails performing the fitness evaluations for the entire swarm before updating the best fitness values. Subsequent experimentation revealed that improved convergence rates can be obtained by updating the f_k^i and f_k^g values and their positions after each individual fitness evaluation (i.e., in an asynchronous fashion) [70,94]. It is speculated that because the updating occurs immediately after each fitness evaluation, the swarm reacts more quickly to an improvement in the best-found fitness value. With the parallel implementation, however, this asynchronous improvement on the swarm is lost, since fitness evaluations are performed concurrently. The parallel algorithm requires updating f_k^i and f_k^g for the entire swarm after all fitness evaluations have been performed, as in the original particle swarm formulation. Consequently, the swarm reacts more slowly to changes of the best fitness value 'position' in the design space. This behavior produces an unavoidable performance loss in terms of convergence rate compared to the asynchronous implementation and can be considered part of the overhead associated with parallelization.
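The difference between the two update schedules can be made concrete with a small sketch. This is an illustrative Python toy, not the dissertation's implementation; the fitness and move functions are hypothetical stand-ins. It shows that in the asynchronous scheme, particles later in a sweep are already guided by improvements found earlier in the same sweep:

```python
def sweep(positions, fitness, gbest, move, asynchronous):
    """One sweep over the swarm, tracking only the group best for clarity.

    Synchronous (asynchronous=False): every particle's move is guided by
    the group best frozen at the start of the sweep, as the parallel
    implementation requires.
    Asynchronous (asynchronous=True): the group best is updated after each
    evaluation, so later particles in the same sweep already react to
    improvements found earlier in that sweep."""
    frozen = gbest                      # group best at the start of the sweep
    new_positions = []
    for x in positions:
        guide = gbest if asynchronous else frozen
        x_new = move(x, guide)          # stand-in for Eqs. (4.1)-(4.2)
        new_positions.append(x_new)
        if fitness(x_new) < fitness(gbest):
            gbest = x_new               # immediately visible if asynchronous
    return new_positions, gbest
```

With fitness(x) = |x| and a move that averages a particle with its guide, two particles starting at 1 and 8 with a group best of 4 end at [2.5, 6.0] synchronously but [2.5, 5.25] asynchronously: the second particle already follows the improved best of 2.5.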
Coherence

Parallelization should have no adverse effect on algorithm operation. Calculations sensitive to program order should appear to have occurred in exactly the same order as in the serial synchronous formulation, leading to exactly the same final answer. In the serial PSO algorithm the fitness evaluations form the bulk of the computational effort for the optimization and can be performed independently. For our parallel implementation, we therefore chose a coarse decomposition scheme in which only the fitness evaluations are performed concurrently on a parallel machine. Step 2 of the particle swarm optimization algorithm was modified accordingly to operate in a parallel manner:

2) Optimize
   a) Update all particle velocity vectors v_{k+1}^i using Eq. (4.2)
   b) If v_{k+1}^i > v_max,k for any component, then set that component to its maximum allowable value
   c) Update all particle position vectors x_{k+1}^i using Eq. (4.1)
   d) Concurrently evaluate fitness values f_{k+1}^i using design space coordinates x_{k+1}^i for i = 1, ..., p
   e) If f_{k+1}^i ≤ f_best^i, then f_best^i = f_{k+1}^i and p_{k+1}^i = x_{k+1}^i for i = 1, ..., p
   f) If f_{k+1}^i ≤ f_gbest, then f_gbest = f_{k+1}^i and p_{k+1}^g = x_{k+1}^i for i = 1, ..., p
   g) If f_gbest was improved in (e), then reset t = 0; else increment t. If k > k_max, go to 3
   h) If t = d, then multiply w_{k+1} by (1 - w_d) and v_{k+1} by (1 - v_d)
   i) If the maximum number of function evaluations is exceeded, then go to 3
   j) Increment k
   k) Go to 2(a)

The parallel PSO algorithm is represented by the flow diagram in Figure 12.

Network Communication

In a parallel computational environment, the main performance bottleneck is often the communication latency between processors. This issue is especially relevant to large clusters of computers, where the use of high-performance network interfaces is limited by their high cost. To keep communication between different computational nodes to a minimum, we use fitness evaluation tasks as the level of granularity for the parallel software.
As previously mentioned, each of these evaluations can be performed independently and requires no communication aside from receiving the design space coordinates to be evaluated and reporting the fitness value at the end of the analysis.

Figure 12 Parallel implementation of the PSO algorithm. We have again omitted velocity/inertia reduction operations to avoid complicating the diagram.

The optimization infrastructure is organized into a coordinating node and several computational nodes. PSO algorithm functions and task orchestration are performed by the coordinating node, which assigns the design coordinates to be evaluated, in parallel, to the computational nodes. With this approach, no communication is required between computational nodes, as individual fitness evaluations are independent of each other. The only necessary communication is between the coordinating node and the computational nodes and encompasses passing the following information:
1) Distinct design variable configuration vectors, assigned by the coordinating node to the slave nodes for fitness evaluation.
2) Fitness values, reported from the slave nodes to the coordinating node.
3) Synchronization signals to maintain program coherence.
4) Termination signals from the coordinating node to the slave nodes on completion of the analysis, to stop the program cleanly.
The parallel PSO scheme and required communication layer were implemented in ANSI C on the Linux operating system using the message passing interface (MPI) libraries.
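The coordinating-node side of this message pattern can be sketched as follows. The actual implementation is ANSI C with MPI; in this illustrative Python sketch a thread pool stands in for the computational nodes, and the function name is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_fitness(positions, fitness, n_nodes):
    """Coordinating-node view of step 2d: farm one fitness evaluation per
    particle out to the computational nodes, then wait for every result
    before the velocity/position update may proceed.

    pool.map() sends out only design-space coordinates and collects only
    fitness values, mirroring the minimal master/slave communication
    described above; it returns only after all evaluations finish, which
    plays the role of the synchronization signals."""
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        return list(pool.map(fitness, positions))
```

In the MPI version, the map step would correspond to sending coordinate vectors to slave processes and gathering their fitness values, with compute-bound evaluations running in separate processes rather than threads.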
Synchronization and Implementation

From the parallel PSO algorithm, it is clear that some means of synchronization is required to ensure that all of the particle fitness evaluations have been completed and the results reported before the velocity and position calculations can be executed (steps 2a-2c). Synchronization is done using a barrier function in the MPI communication library, which temporarily stops the coordinating node from proceeding with the next swarm iteration until all of the computational nodes have responded with a fitness value. Because of this approach, the time required to perform a single parallel swarm fitness evaluation is dictated by the slowest fitness evaluation in the swarm. Two networked clusters of computers were used to obtain the numerical results. The first cluster was used to solve the analytical test problems and comprised 40 1.33 GHz Athlon PCs located in the High-performance Computing and Simulation (HCS) Research Laboratory at the University of Florida. The second cluster was used to solve the biomechanical system identification problems and consisted of 32 2.40 GHz Intel PCs located in the HCS Research Laboratory at Florida State University. In both locations, 100 Mbps switched networks were used to connect the nodes.

Sample Optimization Problems

Analytical Test Problems

Two well-known analytical test problems were used to evaluate parallel PSO algorithm performance on large-scale problems with multiple local minima (see Appendix A for a mathematical description of both problems). The first was a test function (Figure 13(a)) introduced by Griewank [108] which superimposes a high-frequency sine wave on a multidimensional parabola. In contrast, the second problem used the Corana test function [109], which exhibits discrete jumps throughout the design space (Figure 13(b)). For both problems, the number of local minima increases exponentially with the number of design variables.
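As a concrete example of the first test function, the standard form of the Griewank function can be written as below (the exact formulation used in the study is the one given in Appendix A; this sketch assumes the common textbook form):

```python
import math

def griewank(x):
    """Standard Griewank test function: a high-frequency cosine product
    superimposed on a multidimensional parabola. Its global minimum is
    f = 0 at x = (0, ..., 0), and the number of local minima grows
    rapidly with the number of design variables."""
    square_sum = sum(xi * xi for xi in x) / 4000.0
    cos_prod = 1.0
    for i, xi in enumerate(x, start=1):
        cos_prod *= math.cos(xi / math.sqrt(i))
    return 1.0 + square_sum - cos_prod
```

Evaluating this function is essentially free, which is why a half-second delay had to be added to make the parallel timing experiments meaningful (see below).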
To investigate large-scale optimization issues, we formulated both problems using 128 design variables. Since fitness evaluations are extremely fast for these test problems, a delay of approximately half a second was built into each fitness evaluation so that total computation time would not be swamped by communication time. Since parallelization opens up the possibility of utilizing large numbers of processors, we used the analytical test problems to investigate how the convergence rate and final solution are affected by the number of particles employed in a parallel PSO run. To ensure that all swarms were given equally 'fair' starting positions, we generated a pool of 128 initial positions using a Latin hypercube sampler (LHS). Particle positions selected with this scheme are distributed uniformly throughout the design space [110]. This initial pool of 128 particles was divided into the following subswarms: one swarm of 128 particles, two swarms of 64 particles, four swarms of 32 particles, and eight swarms of 16 particles. Each subswarm was used independently to solve the two analytical test problems. This approach allowed us to investigate whether it is more efficient to perform multiple parallel optimizations with smaller population sizes or one parallel optimization with a larger population size, given a sufficient number of processors. To obtain comparisons for convergence speed, we allowed all PSO runs to complete 10,000 iterations before the search was terminated. This number of iterations corresponded to between 160,000 and 1,280,000 fitness evaluations, depending on the number of particles employed in the swarm.

Biomechanical System Identification Problems

In addition to the analytical test problems, medium-scale biomechanical system identification problems were used to evaluate parallel PSO performance under more realistic conditions. These problems were variations of a general problem that attempts to find joint parameters (i.e.,
positions and orientations of joint axes) that match a kinematic ankle model to experimental surface marker data [56]. The data are collected with an optoelectronic system that uses multiple cameras to record the positions of external markers placed on the body segments. To permit measurement of three-dimensional motion, we attach three noncollinear markers to the foot and lower leg. The recordings are processed to obtain marker trajectories in a laboratory-fixed coordinate system [111,112]. The general problem possesses 12 design variables and requires approximately 1 minute for each fitness evaluation. Thus, while the problem is only medium-scale in terms of the number of design variables, it is still computationally costly due to the time required for each fitness evaluation.

Figure 13 Surface plots of the (a) Griewank and (b) Corana analytical test problems showing the presence of multiple local minima. For both plots, 126 design variables were fixed at their optimal values and the remaining 2 design variables were varied in a small region about the global minimum.

The first step in the system identification procedure is to formulate a parametric ankle joint model that can emulate a patient's movement by possessing sufficient degrees of freedom. For the purposes of this paper, we approximate the talocrural and subtalar joints as simple 1 degree of freedom revolute joints. The resulting ankle joint model (Figure 6) contains 12 adjustable parameters that define its kinematic structure [56]. The model also has a set of virtual markers fixed to the limb segments in positions corresponding to the locations of real markers on the subject. The linkage parameters are then adjusted via optimization until the markers on the model follow the measured marker trajectories as closely as possible.
To quantify how closely the kinematic model with specified parameter values can follow the measured marker trajectories, we define a cumulative marker error e as follows:

e = Σ_{j=1}^{n} Σ_{i=1}^{m} Δ_{ij}^2    (4.3)

where

Δ_{ij}^2 = Δx_{ij}^2 + Δy_{ij}^2 + Δz_{ij}^2    (4.4)

Here Δx_{ij}, Δy_{ij}, and Δz_{ij} are the spatial displacement errors for marker i at time frame j in the x, y, and z directions as measured in the laboratory-fixed coordinate system, n = 50 is the number of time frames, and m = 6 (3 on the lower leg and 3 on the foot) is the number of markers. These errors are calculated between the experimental marker locations on the human subject and the virtual marker locations on the kinematic model. For each time frame, a nonlinear least squares suboptimization is performed to determine the joint angles that minimize Σ_i Δ_{ij}^2 given the current set of model parameters. The first suboptimization is started from an initial guess of zero for all joint angles. The suboptimization for each subsequent time frame is started with the solution from the previous time frame to speed convergence. By performing a separate suboptimization for each time frame and then calculating the sum of the squares of the marker coordinate errors, we obtain an estimate of how well the model fits the data for all time frames included in the analysis. By varying the model parameters and repeating the suboptimization process, the parallel PSO algorithm finds the best set of model parameters that minimizes e over all time frames. For numerical testing, three variations of this general problem were analyzed as described below. In all cases the number of particles used by the parallel PSO algorithm was set to a recommended value of 20 [94].
1) Synthetic data without numerical noise: Synthetic (i.e., computer generated) data without numerical noise were generated by simulating marker movements using a lower body kinematic model with virtual markers.
The synthetic motion was based on an experimentally measured ankle motion (see 3 below). The kinematic model used anatomically realistic joint positions and orientations. Since the joint parameters associated with the synthetic data were known, this optimization was used to verify that the parallel PSO algorithm could accurately recover the original model.
2) Synthetic data with numerical noise: Numerical noise was superimposed on each synthetic marker coordinate trajectory to emulate the effect of marker displacements caused by skin movement artifacts [53]. A previously published noise model requiring three random parameters was used to generate a perturbation N in each marker coordinate [80]:

N = A sin(ωt + φ)    (4.5)

where A is the amplitude, ω the frequency, and φ the phase angle of the noise. These noise parameters were treated as uniform random variables within the bounds 0 ≤ A ≤ 1 cm, 0 ≤ ω ≤ 25 rad/s, and 0 ≤ φ ≤ 2π (obtained from [80]).
3) Experimental data: Experimental marker trajectory data were obtained by processing three-dimensional recordings from a subject performing movements with reflective markers attached to the foot and lower leg, as previously described. Institutional review board approval was obtained for the experiments and data analysis, and the subject gave informed consent prior to participation. Marker positions were reported in a laboratory-fixed coordinate system.

Speedup and Parallel Efficiency

Parallel performance for both classes of problems was quantified by calculating speedup and parallel efficiency for different numbers of processors. Speedup is the ratio of sequential execution time to parallel execution time and ideally should equal the number of processors. Parallel efficiency is the ratio of speedup to the number of processors and ideally should equal 100%. For the analytical test problems, only the Corana problem was run, since the half-second delay added to both problems makes their parallel performance identical.
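These two metrics follow directly from their definitions; a minimal sketch (hypothetical function names):

```python
def speedup(t_serial, t_parallel):
    """Speedup: sequential execution time over parallel execution time.
    Ideally equals the number of processors used."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, n_nodes):
    """Parallel efficiency: speedup divided by node count (1.0 = 100%,
    the ideal value)."""
    return speedup(t_serial, t_parallel) / n_nodes
```

For example, a 320-minute serial run that completes in 10 minutes on 32 nodes gives a speedup of 32 and a parallel efficiency of 100%; if it takes 20 minutes instead, the efficiency drops to 50%.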
For the biomechanical system identification problems, only the synthetic data with numerical noise case was reported, since experimentation with the other two cases produced similar parallel performance.

Figure 14 Average fitness convergence histories for the (a) Griewank and (b) Corana analytical test problems for swarm sizes of 16, 32, 64, and 128 particles and 10,000 swarm iterations. Triangles indicate the location on each curve where 160,000 fitness evaluations were completed.

The number of particles and nodes used for each parallel evaluation was selected based on the requirements of the problem. The Corana problem with 128 design variables was solved using 32 particles and 1, 2, 4, 8, 16, and 32 nodes. The biomechanical problem with 12 design variables was solved using 20 particles and 1, 2, 5, 10, and 20 nodes. Both problems were allowed to run until 1000 fitness evaluations were completed.

Numerical Results

Convergence rates for the two analytical test problems differed significantly with changes in swarm size. For the Griewank problem (Figure 14(a)), individual PSO runs converged to within 1e-6 of the global minimum after 10,000 optimizer iterations, regardless of the swarm size. Run-to-run variations in final fitness value (not shown) for a fixed swarm size were small compared to variations between swarm sizes. For example, no runs with 16 particles produced a better final fitness value than any of the runs with 32 particles, and similarly for the 16-32, 32-64, and 64-128 combinations. When the number of fitness evaluations was considered instead of the number of swarm iterations, runs with a smaller swarm size tended to converge more quickly than did runs with a larger swarm size (see triangles in Figure 14). However, two of the eight runs with the smallest number of particles failed to show continued improvement near the maximum number of iterations, indicating possible entrapment in a local minimum.
Similar results were found for the Corana problem (Figure 14(b)), with two exceptions. First, the optimizer was unable to obtain the global minimum for any swarm size within the specified number of iterations (Figure 14(b)), and second, overlap in results between different swarm sizes was observed. For example, some 16-particle results were better than some 32-particle results, and similarly for the other neighboring combinations. On average, however, a larger swarm size tended to produce better results for both problems.

Table 4 Parallel PSO results for the biomechanical system identification problem using synthetic marker trajectories without and with numerical noise. Optimizations on synthetic data with and without noise used 20 particles and were terminated after 40,000 fitness evaluations.

Model parameter   Upper bound   Lower bound   Synthetic solution   Without noise   With noise
p1 (deg)          48.67         11.63         18.37                18.36           15.13
p2 (deg)          30.00         30.00         0.00                 0.01            8.01
p3 (deg)          70.23         10.23         40.23                40.26           32.97
p4 (deg)          53.00         7.00          23.00                23.03           23.12
p5 (deg)          72.00         12.00         42.00                42.00           42.04
p6 (cm)           6.27          6.27          0.00                 0.00            0.39
p7 (cm)           33.70         46.24         39.97                39.97           39.61
p8 (cm)           6.27          6.27          0.00                 0.00            0.76
p9 (cm)           0.00          6.27          1.00                 1.00            2.82
p10 (cm)          15.27         2.72          9.00                 9.00            10.21
p11 (cm)          10.42         2.12          4.15                 4.15            3.03
p12 (cm)          6.89          5.65          0.62                 0.62            0.19

The parallel PSO algorithm found ankle joint parameters consistent with the known solution or results in the literature [61-63]. The algorithm had no difficulty recovering the original parameters from the synthetic data set without noise (Table 4), producing a final cumulative error e on the order of 10^-13. The original model was recovered with mean orientation errors less than 0.05 deg and mean position errors less than 0.008 cm. Furthermore, the parallel implementation produced identical fitness and parameter histories as did a synchronous serial implementation.
For the synthetic data set with superimposed noise, an RMS marker distance error of 0.568 cm was found, which is on the order of the imposed numerical noise with its maximum amplitude of 1 cm. For the experimental data set, the RMS marker distance error was 0.394 cm (Table 5), comparable to the error for the synthetic data with noise. Convergence characteristics were similar for the three data sets considered in this study. The initial convergence rate was quite high (Figure 15(a)), after which it slowed once the approximate location of the global minimum was found.

Table 5 RMS errors for the biomechanical system identification problem using synthetic marker trajectories (without and with numerical noise) and experimental marker trajectories.

RMS errors                     Without noise   With noise   Experimental data
Marker distances (cm)          3.58E-04        0.568        0.394
Orientation parameters (deg)   1.85E-02        5.010        N/A
Position parameters (cm)       4.95E-04        1.000        N/A

As the solution process proceeded, the optimizer traded off increases in RMS joint orientation error (Figure 15(b)) for decreases in RMS joint position error (Figure 15(c)) to achieve further minor reductions in the fitness value. The analytical and biomechanical problems exhibited different parallel performance characteristics. The analytical problem demonstrated almost perfectly linear speedup (Figure 16(a), squares), resulting in parallel efficiencies above 95% for up to 32 nodes (Figure 16(b), squares). In contrast, the biomechanical problem exhibited speedup results that plateaued as the number of nodes was increased (Figure 16(a), circles), producing parallel efficiencies that decreased almost linearly with increasing number of nodes (Figure 16(b), circles). Each additional node produced roughly a 3% reduction in parallel efficiency.

Discussion

This study presented a parallel implementation of the particle swarm global optimizer. The implementation was evaluated using analytical test problems and biomechanical system identification problems.
Speedup and parallel efficiency results were excellent when each fitness evaluation took the same amount of time.

Figure 15 Fitness convergence and parameter error plots for the biomechanical system identification problem using synthetic data with noise.

Figure 16 (a) Speedup and (b) parallel efficiency for the analytical and biomechanical optimization problems.

For problems with large numbers of design variables and multiple local minima, maximizing the number of particles produced better results than repeated runs with fewer particles. Overall, parallel PSO makes efficient use of computational resources and provides a new option for computationally demanding engineering optimization problems. The agreement between the optimized and known orientation parameters p1-p4 for the biomechanical problem using synthetic data with noise was poorer than initially expected. This finding was the direct result of the sensitivity of orientation calculations to errors in marker positions caused by the injected numerical noise. Because of the close proximity of the markers to each other, even relatively small-amplitude numerical noise in marker positions can result in large fluctuations in the best-fit joint orientations. While more time frames could be used to offset the effects of noise, this approach would increase the cost of each fitness evaluation due to an increased number of suboptimizations. Nonetheless, the fitness value for the optimized parameters was lower than that for the parameters used to generate the original noiseless synthetic data. Though the biomechanical optimization problems involved only 12 design variables, multiple local minima existed when numerical or experimental noise was present.
When the noisy synthetic data set was analyzed with a gradient-based optimizer using 20 random starting points, the optimizer consistently found distinct solutions, indicating a large number of local minima. Similar observations were made for a smaller number of gradient-based runs performed on the experimental data set. To evaluate the parallel PSO's ability to avoid entrapment in these local minima, we performed 10 additional runs with the algorithm. All 10 runs converged to the same solution, which was better than any of the solutions found by the gradient-based runs. Differences in parallel PSO performance between the analytical test problem and the biomechanical system identification problem can be explained by load balancing issues. The half-second delay added to the analytical test problem made all fitness evaluations take approximately the same amount of time, and substantially more time than the communication tasks. Consequently, load imbalances were avoided, and little degradation in parallel performance was observed with increasing number of processors. In contrast, for the biomechanical system identification problem, the time required to complete the 50 suboptimizations was sensitive to the selected point in design space, thereby producing load imbalances. As the number of processors increased, so did the likelihood that at least one fitness evaluation would take much longer than the others. Due to the synchronization requirement of the current parallel implementation, the load imbalance caused by even one slow fitness evaluation was sufficient to degrade parallel performance rapidly with increasing number of nodes. An asynchronous parallel implementation could be developed to address this problem, with the added benefit of permitting high parallel efficiency on inhomogeneous clusters. Our results for the analytical and biomechanical optimization problems suggest that PSO performs best on problems with continuous rather than discrete noise.
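The load-imbalance effect described above can be illustrated with a small Monte Carlo sketch. This is an illustrative Python toy under an assumed uniform evaluation-time distribution (not the biomechanical problem's actual timing): each node's evaluation takes a random time, the barrier makes every iteration last as long as the slowest node, and the fraction of node-time spent on useful work therefore falls as nodes are added:

```python
import random

def synchronous_efficiency(n_nodes, n_iters=2000, rng=random):
    """Fraction of total node-time doing useful work when per-evaluation
    times are drawn uniformly from [1, 2] seconds and every iteration is
    barrier-synchronized on the slowest of n_nodes evaluations."""
    busy = paid = 0.0
    for _ in range(n_iters):
        times = [1.0 + rng.random() for _ in range(n_nodes)]
        busy += sum(times)             # useful work actually performed
        paid += n_nodes * max(times)   # wall time paid by every node
    return busy / paid
```

With one node the efficiency is exactly 1.0; with 4 nodes it falls to roughly the mean-over-expected-maximum ratio of about 0.83, and with 32 nodes to about 0.76, qualitatively matching the near-linear efficiency decay observed for the biomechanical problem.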
The algorithm consistently found the global minimum for the Griewank problem, even when the number of particles was low. Though the global minimum is unknown for the biomechanical problem using synthetic data with noise, multiple PSO runs consistently converged to the same solution. Both of these problems utilized continuous, sinusoidal noise functions. In contrast, PSO did not converge to the global minimum for the Corana problem with its discrete noise function. Thus, for large-scale problems with multiple local minima and discrete noise, other optimization algorithms such as GAs may provide better results [48]. Use of an LHS rather than uniform random sampling to generate initial points in design space may be a worthwhile PSO algorithm modification. Experimentation with our random number generator indicated that initial particle positions can at times be grouped together. This motivated our use of an LHS to avoid resampling the same region of design space when providing initial guesses to subswarms. To investigate the influence of sampling method on PSO convergence rate, we performed multiple runs with the Griewank problem using uniform random sampling and an LHS, with the default design variable bounds (-600 to +600) and with the bounds shifted by 200 (-400 to +800). We found that when the bounds were shifted, the convergence rate with uniform random sampling changed while the rate with an LHS did not. Thus, swarm behavior appears to be influenced by the sampling method, and an LHS may be helpful for minimizing this sensitivity. A secondary motivation for running the analytical test problems with different numbers of particles was to determine whether the use of subswarms would improve convergence. The question is whether a larger swarm in which all particles communicate with each other is more efficient than multiple smaller swarms in which particles communicate within each subswarm but not between subswarms.
It is possible that the global best position found by a large swarm may unduly influence the motion of all particles in the swarm. Creating subswarms that do not communicate eliminates this possibility. In our approach, we performed the same number of fitness evaluations for each population size. Our results for both analytical test problems suggest that when a large number of processors is available, increasing the swarm size will increase the probability of finding a better solution. Analysis of PSO convergence rate for different numbers of particles also suggests an interesting avenue for future investigation. Passing an imaginary curve through the triangles in Figure 14 reveals that for a fixed number of fitness evaluations, convergence rate increases asymptotically with decreasing number of particles. While the solution found by a smaller number of particles may be a local minimum, the final particle positions may still identify the general region in design space where the global minimum is located. Consequently, an adaptive PSO algorithm that periodically adjusts the number of particles upward during the course of an optimization may improve convergence speed. For example, an initial run with 16 particles could be performed for a fixed number of fitness evaluations. At the end of that phase, the final positions of those 16 particles would be kept, but 16 new particles would be added to bring the total up to 32. The algorithm would continue using 32 particles until the same number of fitness evaluations was completed. The process of gradually increasing the number of particles would continue until the maximum specified swarm size (e.g., 128 particles) was reached. To ensure systematic sampling of the design space, an LHS would be used to generate a pool of sample points equal in size to the maximum number of particles, from which subsamples would be drawn progressively at each phase of the optimization.
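The adaptive particle-addition scheme sketched above can be illustrated in code. The following is a minimal sketch, not the dissertation's implementation: the function names, the simple stratified LHS, and the Griewank-style bounds are illustrative assumptions.

```python
import numpy as np

def lhs_pool(n_points, bounds, rng):
    """Latin hypercube sample: one point per equal-width stratum in each dimension."""
    dim = len(bounds)
    pool = np.empty((n_points, dim))
    for d, (lo, hi) in enumerate(bounds):
        edges = np.linspace(lo, hi, n_points + 1)
        samples = rng.uniform(edges[:-1], edges[1:])  # one sample per stratum
        rng.shuffle(samples)                          # decouple strata across dimensions
        pool[:, d] = samples
    return pool

def particle_schedule(start=16, maximum=128):
    """Phase sizes 16, 16, 32, 64: each phase doubles the running swarm size."""
    sizes, total = [start], start
    while total < maximum:
        add = min(total, maximum - total)
        sizes.append(add)
        total += add
    return sizes

rng = np.random.default_rng(0)
bounds = [(-600.0, 600.0)] * 2            # Griewank-style default bounds
pool = lhs_pool(128, bounds, rng)         # pool size = maximum swarm size
taken = 0
for n_new in particle_schedule():
    new_particles = pool[taken:taken + n_new]  # next subsample from the LHS pool
    taken += n_new
    # ... run PSO with the enlarged swarm for a fixed number of evaluations ...
```

Because the pool itself is a Latin hypercube, each phase draws points that have not been sampled before, which matches the progressive-subsampling idea described in the text.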
In the scenario above with a maximum of 128 particles, the first phase with 16 particles would remove 16 sampled points from the LHS pool, the second phase another 16 points, the third phase 32 points, and the final phase the remaining 64 points.

Conclusions

In summary, the parallel Particle Swarm Optimization algorithm presented in this chapter exhibits excellent parallel performance as long as individual fitness evaluations require the same amount of time. For optimization problems where the time required for each fitness evaluation varies substantially, an asynchronous implementation may be needed to reduce wasted CPU cycles and maintain high parallel efficiency. When large numbers of processors are available, the use of larger population sizes may result in improved convergence rates to the global solution. An adaptive PSO algorithm that increases the population size incrementally may also improve algorithm convergence characteristics.

CHAPTER 5
IMPROVED GLOBAL CONVERGENCE USING MULTIPLE INDEPENDENT OPTIMIZATIONS

Overview

This chapter presents a methodology for improving the probability of global convergence in large-scale global optimization problems in which several local minima are present. The optimizer applied in this methodology is the PSO, but the strategy outlined here is applicable to any type of algorithm. The controlling idea behind this optimization approach is to utilize several independent optimizations, each using a fraction of a budget of computational resources. Although each such optimization may individually have a limited probability of convergence compared to a single optimization utilizing the full budget, it is shown that, when combined, they have a cumulative convergence probability far in excess of that of the single optimization. Since the limited optimizations are independent, they may be executed concurrently on separate computation nodes with no interaction.
This exploitation of parallelism allows us to vastly increase the probability of convergence to the global minimum while simultaneously reducing the required wall clock time if a parallel machine is used.

Introduction

If we consider the general unconstrained global optimization problem for the real-valued function f(x) defined on the set x ∈ D, one cannot state that a global solution has been found unless an exhaustive search of the set D is performed [115]. With a finite number of function evaluations, at best we can only estimate the probability of arriving at or near the global optimum. To solve global optimization problems reliably, the optimizer needs to achieve an efficient balance between sampling the entire design space and directing progressively more densely spaced sampling points towards promising regions for a more refined search [116]. Many algorithms achieve this balance, such as the deterministic DIRECT optimizer [117] or stochastic algorithms such as genetic algorithms [118], simulated annealing [119,120], clustering [121], and the particle swarm optimizer [122]. Although these population-based global optimization algorithms are fairly robust, they can be attracted, at least temporarily, towards local optima which are not global (see, for example, the Griewank problem in Figure 17).

Figure 17 Multiple local minima for the Griewank analytical problem (surface plot in two dimensions)

This difficulty can be addressed by allowing longer optimization runs or an increased population size. Both of these options often result in a decrease in algorithm efficiency, with no guarantee that the optimizer will escape from the local optimum. Restarting the algorithm when it gets stuck in a local minimum and allowing multiple optimization runs may be a more efficient approach.
This follows from the hypothesis that several limited independent optimization runs, each with a small likelihood of finding the global optimum, may be combined in a synergistic effort that yields a vastly improved global convergence probability. This approach is routinely used for global search using multistart local optimizers [123]. Le Riche and Haftka have also suggested the use of this approach with genetic algorithms for solving complex composite laminate optimization problems [124]. The main difficulty in the application of such a multirun strategy is deciding when the optimizer should be stopped. The objective of this manuscript is to solve this problem by developing an efficient and robust scheme by which to allocate computational resources to individual optimizations in a set of multiple optimizations. The organization of this manuscript is as follows. First, a brief description of the optimization algorithm applied in this study, the PSO algorithm, is given. Next, a set of analytical problems is described, along with details on calculating global convergence probabilities. After that, the multiple-run methodology is outlined and a general budget strategy is presented for dividing a fixed number of fitness evaluations among multiple searches on a single processor. The use of this method on a parallel processing machine is also discussed. Then, numerical results based on the multiple-run strategy for both single- and multiprocessor machines are reported and discussed. Finally, general conclusions about the multirun methodology are presented.

Methodology

Analytical Test Set

The convergence behavior of the PSO algorithm was analyzed with the Griewank [108], Shekel [114], and Hartman [114] analytical problems (see Appendix A for problem definitions), each of which possesses multiple local minima. Analytical test problems were used because their global solutions are known a priori.
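For reference, since Appendix A is not reproduced here, the commonly published form of the Griewank test function is assumed and can be written as:

```python
import numpy as np

def griewank(x):
    """Standard Griewank function: f(0) = 0 is the global minimum,
    surrounded by many shallow local minima."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return 1.0 + np.sum(x**2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i)))
```

In the experiments described later, this function is evaluated over the default bounds of -600 to +600 in each design variable.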
The known solution value allows us to ascertain whether an optimization has converged to the global minimum. To estimate the probability of converging to the global optimum, we performed 1000 optimization runs for each problem, with each run limited to 500,000 fitness evaluations. These optimization runs were performed with identical parameter settings, with the exception of a different random number seed for each run in order to start the population at different initial points in the design space. To evaluate the global convergence probability of the PSO algorithm as a function of population size, we solved each problem using swarms of 10, 20, 50, and 100 particles. A standard set of values was used for the other algorithm parameters (Table 6).

Table 6 Particle swarm algorithm parameters
Parameter  Description                                        Value
c1         Cognitive trust parameter                          2.0
c2         Social trust parameter                             2.0
w0         Initial inertia                                    1
wd         Inertia reduction parameter                        0.01
K          Bound on velocity fraction                         0.5
vd         Velocity reduction parameter                       0.01
d          Dynamic inertia/velocity reduction delay           200
           (function evaluations)

We assumed that convergence to the global optimum was achieved when the fitness value f was within a predetermined fixed tolerance εt (see Table 7) of the known global optimum f*:

f ≤ f* + εt (5.1)

For the Shekel and Hartman problems, the tolerance ensures that the minimum corresponding to the global optimum of the problem has been found. That is, because εt ≠ 0 the exact optimum is not obtained, but if a local optimizer is started from the PSO solution found with the given tolerance, it will converge to the global optimum.
For the Griewank problem, however, starting a local optimizer at the PSO solution will not guarantee convergence to the global optimum, since this noisy, shallowly convex problem has several local minima grouped around the global optimum that will defeat a local optimizer.

Table 7 Problem convergence tolerances
Problem    Convergence tolerance εt
Griewank   0.1
Shekel     0.001
Hartman    0.001

Multiple-run Methodology

The use of multiple optimizations with a global optimizer such as a GA was first proposed by Le Riche and Haftka [124]. However, no criterion was given for the division of computational resources among the multiple optimizations, and the efficiency of the approach was not investigated. The method entails running multiple optimizations with a reduced number of fitness evaluations, obtained either by limiting the number of algorithm iterations or by reducing the population while keeping the number of iterations constant. Individually, the convergence probability of such a limited optimization may be only a fraction of that of a single traditional optimization run. However, the cumulative convergence probability obtained by combining the limited runs can be significantly higher than that of the single run. Similar studies have previously been undertaken to investigate the efficiency of repeated optimizations using simple search algorithms such as pure random search, grid search, and random walk [130,131]. The use of multiple local optimizations or clustering [132] is common practice, but for some algorithms the efficiency of this approach decreases rapidly when problems with a high number of local minima are encountered [130]. To estimate the efficiency of the proposed strategy, and for comparison with optimizations with increased populations or allowed iterations, we are required to calculate the probability of convergence to the global optimum for an individual optimization run, Pi.
This convergence probability cannot be easily calculated for practical engineering problems with unknown solutions. For the set of analytical problems, however, the solutions are known and a large number of optimizations of these problems can be performed at little computational cost. With some reasonable assumptions, these two facts allow us to estimate the probability of convergence to the global optimum for individual optimization runs. The efficiency and exploratory-run considerations derived from the theoretical analytical results are equally applicable to practical engineering problems where solutions are not known a priori. The first step in calculating Pi is computing the convergence ratio, Cr, as follows:

Cr = Nc / N (5.2)

where Nc is the number of globally converged optimizations and N is the number of optimizations, in this case 1000. For a very large number of optimizations, the probability Pi that any individual run converges to the global optimum approaches Cr. For a finite number of runs, however, the standard error se in Pi can be quantified using

se = sqrt( Cr (1 - Cr) / N ) (5.3)

which is an estimate of the standard deviation of Cr. For example, if we obtain a convergence probability of Pi = 0.5 with N = 1000 optimizations, the standard error is se = 0.016. To obtain the combined cumulative probability of finding the global optimum by multiple independent optimizations, we apply the statistical law for the probability of success with repeated independent events. We denote the combined or cumulative probability of N multiple independent optimization runs converging to the solution as Pc, and using the fact that the convergence events are uncorrelated,

Pc = 1 - Π(i=1 to N) (1 - Pi) (5.4)

where Pi is the probability of the ith individual optimization run converging to the global optimum.
If we assume that individual optimization runs with similar parameter settings, as in the case of the following study, have equal probability of convergence, we can simplify Eq. (5.4) to

Pc = 1 - (1 - Pi)^N (5.5)

The increase in cumulative probability Pc with fixed values of Pi for an increasing number of optimization runs N is illustrated in Figure 18. It must be stressed that the above relations are only valid for uncorrelated optimizations, which may not be the case when a poor-quality random number generator is used to generate initial positions in the design space. Certain generators can exhibit a tendency to favor regions of the design space, biasing the search and the probability of convergence to a minimum in these regions.

Figure 18 Cumulative convergence probability Pc as a function of the number of optimization runs with assumed equal Pi values

To verify the cumulative probability values predicted in theory by Eq. (5.5), the Monte Carlo method is applied, sampling random pairs, triplets, quintuplets, etc. of optimizations from the pool of 1000 runs. For example, to estimate the experimental global convergence probability of two runs, we selected a large number of random pairs of optimizations among the 1000 runs. Applying Eq. (5.2), the number of cases Nc in which either or both runs of the pair converged, divided by N (the total number of pairs selected), yields the experimental global convergence probability.

Exploratory run and budgeting scheme

Using the multiple-run methodology requires a budget strategy by which to divide a fixed budget of fitness evaluations among the independent optimization runs. The budget of fitness evaluations nb is usually dictated by how much time the user is willing to allocate on a machine to solve the problem, divided by how long a single fitness evaluation takes to execute.
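The estimates in Eqs. (5.2), (5.3), and (5.5), together with the pair-sampling Monte Carlo check, can be sketched as follows. This is an illustrative sketch only: the 0/1 convergence outcomes are synthetic stand-ins for the 1000 recorded PSO runs.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.5
outcomes = rng.random(1000) < p_true      # 1000 simulated runs: True = converged

# Eq. (5.2): convergence ratio, and Eq. (5.3): its standard error
Cr = outcomes.mean()
se = np.sqrt(Cr * (1.0 - Cr) / outcomes.size)

# Eq. (5.5): theoretical cumulative probability of n independent runs
def cumulative_p(Pi, n):
    return 1.0 - (1.0 - Pi) ** n

# Monte Carlo check for n = 2: sample many random pairs of runs and count
# the fraction of pairs in which at least one run converged
pairs = rng.integers(0, outcomes.size, size=(20000, 2))
mc_two = np.mean(outcomes[pairs].any(axis=1))
```

For a well-behaved generator, mc_two agrees with cumulative_p(Cr, 2) to within Monte Carlo error, which is the verification the text describes.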
An exploratory optimization utilizing a fraction n1 of this budget is required to determine the interaction between the optimizer and the problem. The fitness history of this optimization is used to obtain an estimate of the number of fitness evaluations to be allocated to each run, ni. This strategy is based on the assumption that a single fitness history will be sufficient to characterize the optimizer behavior on a problem. For the set of problems it is observed that a correlation exists between the point where the fitness history levels off and the point where the convergence probability levels off (Figure 19).

Figure 19 Fitness history and convergence probability Pc plots for the Griewank, Hartman, and Shekel problems (20 particles each)

We hypothesize that the algorithm will either converge quickly to the optimum or stall at a similar number of fitness evaluations (Figure 20). The exploratory run is stopped using a stopping criterion which monitors the rate of change of the objective fitness value as a function of the number of fitness evaluations. As soon as this rate of improvement drops below a predetermined value (i.e., the fitness value plot levels off), the exploratory optimization is stopped and the number of fitness evaluations used is noted as n1. The stopping criterion used for obtaining the numerical results was a change of less than 0.01 in fitness value over at least 500 function evaluations.
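The stopping rule just described (improvement of less than 0.01 over at least 500 function evaluations) can be sketched as a simple scan over the best-fitness-so-far history. This is a hedged sketch: the function name is illustrative, and the exponential-decay history below is synthetic rather than an actual PSO trace.

```python
def exploratory_stop(history, window=500, tol=0.01):
    """Return the evaluation count n1 at which the best fitness improved
    by less than tol over the preceding window of function evaluations."""
    for n in range(window, len(history)):
        if history[n - window] - history[n] < tol:  # improvement leveled off
            return n
    return len(history)  # never leveled off within the recorded history

# Synthetic best-so-far history: rapid early improvement, then a plateau
history = [100.0 * 0.999**n for n in range(10000)]
n1 = exploratory_stop(history)
```

The returned n1 then serves as the per-run budget estimate used by the allocation scheme in the text.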
Figure 20 Typical Shekel fitness history plots of 20 optimizations (sampled out of 1000)

The remainder of the budgeted fitness evaluations is distributed among N independent optimizations, where

N = floor( (nb - n1) / n1 ) (5.6)

with an allowed number of fitness evaluations per run calculated by

ni = (nb - n1) / N ≥ n1 (5.7)

If a multiprocessor machine is available, very high Pc values may be reached using a multiple-run strategy. If we take the simple case where each independent optimization run is assigned to a separate node, the multiple-run approach is constrained somewhat differently than in the previous single-processor case. Rather than the number of multiple optimizations being limited by a fixed budget of fitness evaluations (divided equally among the set of multiple independent optimizations using Eq. (5.6)), the number of optimization runs is defined by the number of computational nodes and the wall clock time available to the user. A method similar to that followed for a single-processor machine for determining algorithm/problem behavior must still be followed to determine the optimal number of fitness evaluations for a single independent optimization run. This exploratory run can, however, be done using a parallel implementation of the population-based algorithm under consideration, in which concurrent processing is achieved through functional decomposition [125].

Bayesian convergence probability estimation

If the amount of time and the global probability of convergence are competing considerations, a Bayesian convergence probability estimation method may be used, as proposed by Groenwold et al. [134,135]. This criterion stops the optimization once a certain confidence level q is reached that the best solution found among all optimizations, f̃, is the global solution f*.
This probability or confidence measure is given in [135] as

Pr[f̃ = f*] = 1 - (N + b̃)! (2N + ã)! / [ (2N + b̃)! (N + ã)! ] ≥ q (5.8)

where q is the predetermined confidence level set by the user, usually 0.95, N is the total number of optimizations performed up to the time the stopping criterion is evaluated, ã = a + b - 1, and b̃ = b + Nc, with a and b suitable parameters of a Beta distribution β(a, b). The number of optimizations among the total N that yield the current best value f̃ is defined as Nc. The values of a and b were chosen as 1 and 5, respectively, as recommended by Groenwold et al. [135].

Numerical Results

Multirun Approach for a Predetermined Number of Optimizations

For the three problems under consideration, only a limited improvement in global convergence probability is achieved by applying the traditional approaches of increasing the number of fitness evaluations or the population size (Figure 21).

Figure 21 Shekel convergence probability for an individual optimization as a function of fitness evaluations and population size (10, 20, 50, and 100 particles)

For the Shekel problem, using larger swarm sizes and/or allowing an increased number of fitness evaluations yielded higher convergence probabilities only up to a point. Similar results were obtained for the Griewank and Hartman problems. On the other hand, optimizations with a small number of particles reached moderate global convergence probabilities at significantly fewer fitness evaluations than did optimizations with large swarms. This behavior was observed for all the problems in the test set (Figure 19). To exploit this behavior we replace a single optimization with several PSO runs, each with a limited population and number of iterations.
These individual optimizations utilize the same total amount of resources allocated to the original single optimization (in this case the number of fitness evaluations). To illustrate the merit of such an approach, we optimized the Hartman analytical problem with and without multiple limited optimizations. We observe that for a single optimization the probability of convergence is not significantly improved by allowing more fitness evaluations or by increasing the population size (Figure 22).

Figure 22 Convergence probability for the Hartman problem as a function of fitness evaluations for 10, 20, 50, and 100 particles, with the theoretical cumulative convergence probability Pc of multiple independent 10-particle runs shown as circled data points

We also observe that an optimization with 10 particles quickly attains a probability of convergence of Pi = 0.344 after only 10,000 fitness evaluations. Using a multiple-run strategy with 10 independent optimizations of 10,000 fitness evaluations each yields the theoretical Pc values reported in Table 8 (calculated using Eq. (5.5) with Pi = 0.344 and N = 1, ..., 10). These values are indicated as circled data points at the equivalent cumulative number of fitness evaluations in Figure 22 for comparison with a single extended optimization.
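The theoretical Pc values just described follow directly from Eq. (5.5); a quick check with the reported Pi = 0.344 (10 particles, 10,000 evaluations per run):

```python
Pi = 0.344  # convergence probability of one limited run (from the text)

# Eq. (5.5): cumulative probability of n independent limited runs, n = 1..10
Pc = [1.0 - (1.0 - Pi) ** n for n in range(1, 11)]

# One run gives Pc[0] = 0.344; ten combined runs reach roughly 0.985,
# far beyond what a single extended optimization achieves in Figure 22.
```

Each additional run multiplies the failure probability by (1 - Pi), which is why the cumulative probability climbs so quickly even though each run is individually weak.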