A VARIABLE-ORDER NONLINEAR PROGRAMMING ALGORITHM FOR USE IN COMPUTER-AIDED CIRCUIT DESIGN AND ANALYSIS

By Alberto Jose Jimenez

A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1976

To Judy

ACKNOWLEDGEMENTS

The contributions and patience of the chairman of the committee, Dr. S.W. Director, are gratefully acknowledged. Without the grant from IBM it would not have been possible for this researcher to afford the pursuit of this research; in particular, I thank P.C. Reder, of the General Systems Division of IBM at Boca Raton, Florida, for his unselfish help and support. Interactions and many lively discussions with graduate students D. Fraser, M. Lightner and W. Nye also contributed to this research. Finally, but by no means in a lesser manner, the support of my family, my wife, Judy, and our daughters, contributed in a very important way and with some great sacrifices throughout the entire period at the university. I thank Adele Koehler for her superb typing of the manuscript. I gratefully acknowledge the constructive criticisms of Dr. A.E. Durling, Dr. A. Paige, Dr. C.V. Shaffer, and Dr. A. Westerberg, who with Dr. S.W. Director constituted the examining committee.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
ABSTRACT

1. INTRODUCTION

2. A VIEW OF COMPUTER-AIDED CIRCUIT DESIGN
   2.1 Nonlinear Programming Circuit Problem
   2.2 Derivation of the Gradient
   2.3 Computational Flow
   2.4 Review of Unconstrained Minimization

3. THE VARIABLE-ORDER ALGORITHM FOR UNCONSTRAINED MINIMIZATION
   3.1 Properties of Minima
   3.2 Principal Steps in a Minimization Algorithm
   3.3 Variable-Order Transformations
       3.3.1 Approximations of Higher-Order Corrections
       3.3.2 Transformation Order Selection
   3.4 The Scalar Search
       3.4.1 Iterations Close to x*
       3.4.2 Iterations Far from x*
   3.5 Hessian Singular or Negative Definite
       3.5.1 Murray's Procedure
       3.5.2 Proposed Modification to Murray's Procedure
       3.5.3 Illustrative Example
       3.5.4 Computation of High-Order Corrections
   3.6 Convergence of the VO Algorithm
       3.6.1 Global Convergence
       3.6.2 Order of Convergence
   3.7 Summary

4. IMPLEMENTATION OF THE VARIABLE-ORDER ALGORITHM
   4.1 Nonlinear Inequality Constraints
   4.2 Box Constraints
   4.3 Hessian and Gradient Approximations
       4.3.1 Function and Gradient Values Supplied
       4.3.2 Only Function Values Supplied
   4.4 Supplied Function and Gradient Values with Errors
   4.5 The Variable-Order Algorithm
   4.6 Numerical Results
       4.6.1 Rosenbrock's Problem
       4.6.2 Powell's Problem
       4.6.3 Fletcher and Powell's Problem
       4.6.4 Wood's Problem
       4.6.5 Cragg and Levy's Problem
       4.6.6 Comparisons with Seven Minimization Algorithms
       4.6.7 Example with Nonlinear Constraints
       4.6.8 Example with Box Constraints
       4.6.9 Example with Errors in the Supplied Function and Gradient
       4.6.10 Simple Circuit Optimization Example
       4.6.11 MOSFET Nand Gate Circuit Optimization Example
       4.6.12 Power Supply Regulator Circuit Optimization Example
   4.7 Summary

5. APPLICATION OF THE VARIABLE-ORDER CONCEPT TO CIRCUIT ANALYSIS
   5.1 Approaches in Finding a Solution
       5.1.1 Equivalent Unconstrained Minimization Problems
       5.1.2 Iterative Methods
   5.2 Infinite Series Representation of a Solution
       5.2.1 A Class of Iterative Methods
       5.2.2 The Variable-Order Iterative Method
   5.3 The VO Iterative Method in Transient Analysis
       5.3.1 MOSFET Nand Gate Example
       5.3.2 MOSFET Buffer Examples
       5.3.3 ECL Gate Example
   5.4 The VO Iterative Method in DC Analysis
   5.5 Summary

6. CONCLUSIONS AND FUTURE RESEARCH SUGGESTIONS
   6.1 Conclusions
   6.2 Future Research Suggestions

APPENDIX I. DESCRIPTION OF COMPUTER PROGRAM IMPLEMENTING THE VARIABLE-ORDER ALGORITHM
   I.1 Using the Program
   I.2 Description of the Program
   I.3 Listing of the Program

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

4.1 Results for Rosenbrock's problem
4.2 Results for Powell's problem
4.3 Results for Fletcher and Powell's problem
4.4 Results for Wood's problem
4.5 Results for Wood's problem, initial guess near saddle point
4.6 Results for Cragg and Levy's problem
4.7 Summary of results for five problems with reduced accuracy
4.8 Comparative results of VO algorithm with seven other algorithms
4.9 Results for illustrative constrained problem
4.10 Results for constrained problem solved by a sequence of problems
4.11 Results for Rosenbrock's problem with random errors in the function and the gradient
4.12 Results of optimization of nand gate
4.13 Tabular description of two current sources
5.1 Comparative results of transient analysis for nand gate
5.2 Comparative results of transient analysis for 9-device MOS buffer
5.3 Comparative results of transient analysis for 18-device MOS buffer
5.4 Comparative results of transient analysis for ECL gate
5.5 Comparative results of transient analysis for ECL gate with capacitances multiplied by 10^3
5.6 Comparative results of dc analysis for ECL gate
5.7 Comparative results of dc analysis for 18-device MOS buffer

LIST OF FIGURES

3.1 Flowchart of a portion of the scalar search step of the VO algorithm
4.1 Illustration of projection of trajectory onto box constraints
4.2 Trajectories for the second-, third-, and fourth-order transformations at x^0 = (1.2, 1)^T for the minimization of Rosenbrock's function
4.3 Plot of f(h(p)) vs. p
4.4 Trajectory of VO algorithm for the minimization of Rosenbrock's function
4.5 Simple circuit to test and compare the VO algorithm
4.6 Two-input nand gate optimized
4.7 Power supply regulator optimized
5.1 Nine-device MOS buffer circuit analyzed
5.2 Eighteen-device MOS buffer analyzed
5.3 ECL gate analyzed
I.1 Example of subroutine to supply the function, the gradient and/or the hessian to the VO algorithm
I.2 Typical setup for executing the VO algorithm on an IBM computer with standard catalog procedures

LIST OF SYMBOLS

x : A column vector in R^n with components x_i.
x < y : Each component of the vector x is less than the corresponding component of the vector y.
||y|| : Any norm of the vector y.
||y||_∞ : The maximum norm of the vector y, equal to max_i |y_i|.
y^T : The transpose of the vector y, the result being a row vector, or the transpose of a matrix y.
f'(x) : The column vector first derivative of f(x), the transpose of the gradient.
[f'(x)]_j : The jth component of the gradient, ∂f(x)/∂x_j.
f''(x) : The second derivative of f(x), the hessian.
[f''(x)]_ij : The (i,j)th component of the hessian, ∂²f(x)/(∂x_i ∂x_j).
[f''(x)]^(-1) : The inverse of the hessian matrix of f(x).
f'''(x), f''''(x) : The third and fourth derivative tensors of f(x).
F(x), F'(x), F''(x), ... : A vector function and its derivatives (F'(x) is called the Jacobian).
[F'(x)]^(-1) : The inverse of the Jacobian matrix of F(x).
1 : Unit diagonal matrix of appropriate dimension.
0 : Zero vector or zero matrix of appropriate dimension.
q̇ : The time derivative of the variable q.

Abstract of Dissertation Presented to the Graduate Council of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A VARIABLE-ORDER NONLINEAR PROGRAMMING ALGORITHM FOR USE IN COMPUTER-AIDED CIRCUIT DESIGN AND ANALYSIS

By Alberto Jose Jimenez
June, 1976
Chairman: Stephen W. Director
Major Department: Electrical Engineering

An iterative algorithm, called the Variable-Order (VO) algorithm, is derived for computing a local minimum of a nonlinear function of several independent variables. The VO algorithm is shown to be very competitive with several existing algorithms. The class of functions for which the algorithm is globally convergent is established. It is shown that the VO algorithm converges with order as high as four. A major step of the algorithm is the solution of a scalar problem that may be along curved trajectories in the space of the independent variables, instead of along straight lines as in most existing algorithms. Approximations to required higher-order derivatives are given which allow the use of the VO algorithm even if only function values can be supplied. If the supplied function and gradient values are somewhat inaccurate, as is often the case in computer-aided circuit optimization procedures, the VO algorithm is shown to be still effective. The algorithm may be used to compute a solution of a general nonlinear programming problem by the use of penalty functions. Special cases of the VO algorithm are used to develop an iterative method to solve nonlinear equations that arise in circuit analysis, with resulting modest improvements in efficiency. The new method is applied to transient and dc analysis of nonlinear MOSFET and bipolar circuits.
CHAPTER 1
INTRODUCTION

It has been nine years since publication of the first, by now almost classical, special issue of the Proceedings of the IEEE [57] on computer-aided design of circuits. That issue marked the beginning of a trend to use the computer as an active partner in design rather than simply in a passive role for simulation. The circuits at that time might be described as being in the infancy of integrated electronics. The most complex integrated circuit chip might have consisted of ten devices; it was still somewhat feasible to use breadboarding techniques as aids in the design. During the ensuing nine years, several more journals have been devoted to the subject of computer-aided design of circuits, among them [58-61], and integrated electronics has matured to the point where large-scale integrated circuits which contain thousands of devices can be manufactured. Breadboarding of circuits has generally ceased to be very useful as a design tool, and the computer has become indispensable in the entire design process. The computer is being used at many stages, and for many different purposes, during the design of circuits. We will be concerned with the use of the computer to optimize a circuit by varying some designable parameters in order to achieve the design objectives. In circuit optimization, a scalar performance function representing the design objectives is minimized. There are two principal computational steps in this procedure. First, a numerical minimization algorithm must be employed to adjust the designable parameters in order to minimize the scalar performance function. Minimization algorithms begin from an initial point, a guess at the optimum set of designable parameters, and generate a sequence of points which hopefully converges to the minimum of the scalar performance function.
Most minimization algorithms require the numerical values of the performance function and its gradient with respect to the designable parameters, evaluated at several points during the minimization procedure. The number of function and gradient evaluations required is generally proportional to the computational cost of the minimization. Second, the evaluation of the scalar performance function and its gradient generally involves analysis of the circuit equations, a rather large set of nonlinear algebraic and differential equations. The large number of circuit equations is due not only to the increased circuit size, but also to the use of more complex device models for improved accuracy. The goals of this research were to improve the efficiency of the two computational steps described above. The major accomplishments were the derivation of a very promising new minimization algorithm, and a new iterative method to solve the nonlinear algebraic equations that arise in the analysis of circuits. The main contribution of this research is the development of a new minimization algorithm. Although the new algorithm has some shortcomings, numerical results on several examples show that it is quite accurate, and more efficient than other existing algorithms for the majority of the examples tried. From a theoretical standpoint the algorithm has two novel features: 1) it has a variable order of convergence, up to order four, and 2) it has a novel scalar search at each iteration which may be along curved trajectories in the space of the independent variables. A second contribution of this research is an iterative method for solving the nonlinear equations that arise in the transient analysis of the circuit equations. Modest improvements in efficiency were obtained when the new iterative method was implemented in an already very efficient transient analysis program.
Other minor contributions can be summarized as follows:

1) A potentially useful Taylor series expansion of the solution point of a system of nonlinear equations. Different forms of the series, when truncated, yield different iterative methods. An iterative method was used to obtain dc solutions of the circuit equations.

2) A modified Cholesky factorization of a symmetric matrix, the hessian, which in effect modifies the matrix when it is not positive definite. The new factorization is a modification of a previously proposed technique [36].

3) An apparently novel scheme for computing difference approximations to first and second derivatives, the gradient and the hessian. The scheme automatically takes into consideration errors that may be present in the function values that are used in the difference approximations.

4) A new method of describing minimization algorithms to account for other than straight-line directions of search. Existing theorems are extended to the new description.

The organization of the chapters is the following. In Chapter 2 we offer a brief theoretical and computational view of computer-aided circuit design. In that chapter a brief historical review of minimization algorithms is also given. In Chapter 3 the new minimization algorithm is derived and its theoretical properties are established. In Chapter 4, implementation of the new algorithm is described. In addition, several examples and comparisons with other algorithms are reported. In Chapter 5 the concepts used in the derivation of the minimization algorithm are used to derive iterative methods for finding solutions to nonlinear equations. Finally, Chapter 6 offers general conclusions and some suggestions for further research.
CHAPTER 2
A VIEW OF COMPUTER-AIDED CIRCUIT DESIGN

A successful approach in using the computer as a circuit design tool has been to minimize a scalar performance function, which represents the design objectives, by adjusting in some suitable fashion the designable parameters. This procedure requires many steps, which will be briefly outlined in this chapter. The first step in using optimization for circuit design is to recast the problem into a nonlinear programming problem by characterizing the qualitative design objectives by a scalar performance function with constraints. This step is quite heuristic and requires great insight on the part of the circuit designer. After this initial step, the computer takes over by approximating the solution of the nonlinear programming problem. The derivation and the computational steps involved in the evaluation of the scalar performance function and its gradient, needed for solving the nonlinear programming problem, will be briefly outlined. The chapter ends with a brief historical review of existing methods for solving the unconstrained minimization problem which results from the nonlinear programming problem.

2.1 Nonlinear Programming Circuit Problem

Although there have been some fairly successful attempts at synthesis, where a circuit is "grown" to meet specifications [43,6], it may be safely stated that this problem has not been solved satisfactorily. Thus it will be assumed that a circuit which somewhat adequately meets the design objectives is available, e.g. from a catalog of circuits. It is then desired to change parameters of this circuit to improve its intended objective. The design objectives of a circuit are usually specified in a somewhat qualitative manner. Some of the objectives are to minimize certain quantities, such as power dissipation, time delays, or circuit size, or to minimize the difference between a desired voltage or current curve and the actual curve.
Other objectives may be described as constraints on the solutions or on the designable parameters, such as voltage or current levels less than (or greater than) some value, propagation delays no larger than some value, low and high limits, called box constraints, on the designable parameters, etc. It is the circuit designer's job to translate the usually unspecific design objectives into a set of specific scalar functions. Often this specification step yields several scalar functions to be minimized, several constraint functions, and box constraints on the designable parameters. Most circuit optimization programs require that all the scalar functions to be minimized be combined in some manner to obtain a single scalar performance or objective function to be minimized. The simplest technique for accomplishing this combination is to choose a performance function which is a weighted sum of all the functions to be minimized. The general form of such a performance function is

    f = ∫_{t_0}^{T} e(w, q, x, t) dt                                        (2.1)

where w is a vector of branch voltages, branch currents and node voltages of the circuit; q is a vector of capacitance charges and inductor fluxes; x is the vector of time-independent designable parameters, e.g. geometries of each device in the circuit; and t is time. The scalar function e represents a numerical compromise of the design objectives, in that a different combination of the functions to be minimized yields a different function e, and thus a different minimization problem. As published applications have shown [28,12,5], this numerical compromise implies that circuit design carried out in this manner may require several minimizations to achieve the best design, as interpreted by the circuit designer. The minimization of (2.1) must be carried out subject to the circuit designer's constraints and subject to the circuit equation constraints, which is a nonlinear programming problem.
This problem may be described as follows:

    minimize    f(w, q, x) = ∫_{t_0}^{T} e(w, q, x, t) dt                   (2.2a)
     w,q,x

    subject to  u(w, x, t) ≤ 0                                              (2.2b)

                x^L ≤ x ≤ x^H                                               (2.2c)

                H(w, q, x, t) = 0                                           (2.2d)

                E w − q̇(t) = 0                                              (2.2e)

where (2.2b) are the designer's nonlinear constraints, (2.2c) are the box constraints, and (2.2d) with (2.2e) represent the circuit equations. The circuit equations, which can also be expressed by

    F(w, q, x, t) = [ H(w, q, x, t) ]
                    [ E w − q̇       ]  =  0 ,    q(0) = q_0(x)              (2.3)

where the initial conditions q_0(x) may be functions of the designable parameters, consist of Kirchhoff-laws equations and the branch constitutive equations [15,17]. The box constraints (2.2c) are usually handled by transformations [8], or directly as done by the algorithm to be given in Chapters 3 and 4. The nonlinear constraints (2.2b) are usually made part of the e function in (2.2a) by using penalty functions, as described in Chapter 4. Therefore we will discuss the numerical and theoretical considerations in solving the problem given by

    minimize    f(w, q, x) = ∫_0^T e(w, q, x, t) dt                         (2.3a)
     w,q,x

    subject to  H(w, q, x, t) = 0                                           (2.3b)

                E w − q̇ = 0 ,    q(0) = q_0(x)                              (2.3c)

where without loss of generality t_0 = 0 is assumed.

2.2 Derivation of the Gradient

Any solution of (2.3) must satisfy the necessary condition that the first variational (or gradient) of the Lagrangian vanishes [8,29]. The Lagrangian is given by

    L(w, q, x, ŵ, q̂) = ∫_0^T [ e(w, q, x, t) + ŵ^T H(w, q, x, t) + q̂^T (E w − q̇) ] dt     (2.4)

where ŵ(t) and q̂(t) are the Lagrange multiplier vectors, which are functions of time.
The first variational of L is given by

    δL = ∫_0^T ( ∂e/∂w + ŵ^T ∂H/∂w + q̂^T E ) δw dt

       + ∫_0^T ( ∂e/∂q + ŵ^T ∂H/∂q + dq̂^T/dt ) δq dt  −  q̂^T δq |_0^T

       + ∫_0^T ( ∂e/∂x + ŵ^T ∂H/∂x ) δx dt

       + ∫_0^T δŵ^T H dt  +  ∫_0^T δq̂^T (E w − q̇) dt                       (2.5)

where the variational term in (2.4) involving q̇ is obtained using integration by parts as follows:

    δ( −∫_0^T q̂^T q̇ dt ) = −∫_0^T q̂^T δq̇ dt = ∫_0^T (dq̂/dt)^T δq dt − q̂^T δq |_0^T     (2.6)

In order to satisfy the necessary conditions one must find time-dependent vectors w*(t), q*(t), ŵ*(t), and q̂*(t), and a time-independent vector x*, to satisfy

    δL = 0                                                                  (2.7)

The problem of satisfying (2.7) normally has an extremely large dimension. The dimension can be reduced substantially if one makes two assumptions. First, if the circuit equations can be satisfied at all values of x, then given an x = x^k we can obtain w^k(t) and q^k(t) such that the circuit equations

    H(w^k, q^k, x^k, t) = 0                                                 (2.8a)

    E w^k − q̇^k = 0 ,    q^k(0) = q_0(x^k)                                  (2.8b)

are satisfied. Second, if the Jacobian operator of the circuit equations, given by

    J^k = [ ∂H^k/∂w    ∂H^k/∂q ]
          [    E       −d/dt   ]                                            (2.9)

is invertible in the interval 0 ≤ t ≤ T, then the Lagrange multiplier vectors ŵ^k(t) and q̂^k(t) can be computed from

    (∂H^k/∂w)^T ŵ^k + E^T q̂^k = −(∂e^k/∂w)^T                                (2.10a)

    (∂H^k/∂q)^T ŵ^k − (d/dt̂) q̂^k = −(∂e^k/∂q)^T ,    q̂^k(t̂ = 0) = 0         (2.10b)

where t̂ = T − t (note that q̂^k(t̂ = 0) = q̂^k(t = T)). The equations in (2.10) may be written in matrix form as follows:

    [ (∂H^k/∂w)^T     E^T   ] [ ŵ^k ]      [ (∂e^k/∂w)^T ]
    [ (∂H^k/∂q)^T   −d/dt̂   ] [ q̂^k ]  = − [ (∂e^k/∂q)^T ]                  (2.11)

which can be solved because of the second assumption. Observe that the solution of (2.11) is carried out in t̂, where t̂ = T − t, which is backwards in the original time variable. These two assumptions are reasonable since the designable parameters are normally constrained by the box constraints, so that an actual physical circuit is always obtained, and therefore the circuit equations should always have a solution.
With the preceding two assumptions, using (2.8) and (2.10), the variational δL in (2.5) at x = x^k becomes

    δL^k = ∫_0^T ( ∂e^k/∂x + ŵ^{kT} ∂H^k/∂x ) δx dt + q̂^{kT}(0) δq^k(0)     (2.12)

This variational may be expressed as the gradient with respect to x evaluated at x^k by

    ∂L^k/∂x = ∫_0^T ( ∂e^k/∂x + ŵ^{kT} ∂H^k/∂x ) dt + q̂^{kT}(0) ∂q^k(0)/∂x  (2.13)

since x is not a function of time. This expression implies that the problem has been reduced in dimension. We can now use any unconstrained minimization algorithm which would vary only the designable parameters, by using function and gradient values of the Lagrangian, now effectively a function of only x.

2.3 Computational Flow

Most iterative minimization algorithms require that an initial guess to the solution point be given, and that the values of the function and the gradient be supplied to the algorithm at points generated by the algorithm (the next section and Chapters 3 and 4 offer a more detailed description of minimization algorithms). Thus at any value of x = x^k given by the minimization algorithm, one must supply L and the gradient of L with respect to x, both evaluated at x = x^k. Using the derivation in the preceding section, the computational flow will be described now. For notational convenience, the superscript k will be dropped.

STEP 1. Determine q(0). These are the initial conditions. There may be three possibilities: 1) q(0) = q_0, where q_0 is a constant vector; 2) q(0) = q_0, where q_0 is part of the designable parameters, as in a periodic steady-state problem [13]; and 3) q(0) is computed from a dc analysis of the circuit equations. That is, q(0) and w(0) satisfy

    H(w(0), q(0), x) = 0                                                    (2.14a)

    E w(0) = 0                                                              (2.14b)

In Chapter 5, dc analysis is discussed further and a new algorithm is given.

STEP 2. Compute an approximation to w(t) and q(t) by a transient analysis, from t = 0 to t = T, of the circuit equations to satisfy

    H(w, q, x, t) = 0                                                       (2.15a)

    E w − q̇ = 0                                                             (2.15b)

with q(0) obtained from STEP 1.
In this transient analysis, the value of the Lagrangian (2.4) can be computed, which due to (2.15) is now given by

    L = ∫_0^T e(w, q, x, t) dt                                              (2.16)

Transient analysis of the circuit equations is described in more detail in Chapter 5.

STEP 3. Compute an approximation to the Lagrange multipliers ŵ(t) and q̂(t) by a transient analysis from t̂ = T − t = 0 to t̂ = T − t = T (i.e., t running backwards) to satisfy

    (∂H/∂w)^T ŵ + E^T q̂ = −(∂e/∂w)^T                                        (2.17a)

    (∂H/∂q)^T ŵ − (d/dt̂) q̂ = −(∂e/∂q)^T                                     (2.17b)

with q̂(t̂ = 0) = 0. In this transient analysis, compute the vector

    (∂L/∂x)_DY = ∫_0^T ( ∂e/∂x + ŵ^T ∂H/∂x ) dt                             (2.18)

which is the dynamic portion of the gradient (see (2.13)).

STEP 4. Compute the equilibrium portion of the gradient, given by

    (∂L/∂x)_EQ = q̂^T(0) ∂q(0)/∂x                                            (2.19)

The initial conditions q(0) are determined by one of the three possibilities outlined in STEP 1, so the value of (2.19) likewise has three possibilities: 1) if q(0) = q_0, where q_0 is constant, then the term (2.19) is zero; 2) if q(0) = q_0, where q_0 is obtained from a subset of the designable parameter vector [1,16], that is, if x = (x_d^T, q_0^T)^T with q(0) = q_0, then (∂L/∂x)_EQ = (0^T, q̂^T(t = 0)) in this case; and 3) if q(0) is computed from a dc analysis satisfying (2.14), then this term requires additional work. Differentiating (2.14) with respect to x, and expressing the result in matrix notation, yields

    [ ∂H/∂w   ∂H/∂q ] [ ∂w(0)/∂x ]      [ ∂H/∂x ]
    [   E       0   ] [ ∂q(0)/∂x ]  = − [   0   ]                           (2.20)

which is a matrix-matrix linear equation. Since we want (2.19), if the system

    ( w̃^T  q̃^T ) [ ∂H/∂w   ∂H/∂q ]  =  ( 0^T   q̂^T(0) )                     (2.21)
                 [   E       0   ]

is solved for the vectors w̃ and q̃, then after multiplying both sides of (2.20) by (w̃^T  q̃^T), and using (2.21), we obtain

    (∂L/∂x)_EQ = q̂^T(0) ∂q(0)/∂x = −w̃^T ∂H/∂x                               (2.22)

Now the entire gradient is given by

    ∂L/∂x = (∂L/∂x)_DY + (∂L/∂x)_EQ

This four-step procedure has been implemented in a general circuit optimization program [27] with excellent results. Recently, it was shown how the performance function (2.3a) can be made more useful by the use of the event functional [5], which allows the inclusion of time quantities, such as time delays, within the entire procedure.
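The four-step flow can be illustrated on a toy one-state problem. The scalar ODE, objective, and all numerical values below are hypothetical stand-ins (this is not the circuit DAE of (2.3)); the sketch shows only the forward analysis (STEP 2), the backwards adjoint analysis (STEP 3), and a finite-difference check of the resulting gradient.

```python
import numpy as np

# Hypothetical one-state analogue of the four-step flow:
#   minimize f(x) = integral_0^T q(t)^2 dt   subject to  dq/dt = -x q, q(0) = 1.
# The Lagrange multiplier (adjoint) equation, integrated backwards in time, is
#   dL/dt = x L + 2 q,  L(T) = 0,   and the gradient is  df/dx = integral L q dt.

def trap(y, dt):
    # trapezoidal quadrature on a uniform grid
    return dt * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

def f_and_grad(x, T=2.0, n=20000):
    dt = T / n
    q = np.empty(n + 1)
    q[0] = 1.0
    for k in range(n):                       # STEP 2: forward transient analysis
        q[k + 1] = q[k] + dt * (-x * q[k])   # explicit Euler
    f = trap(q * q, dt)                      # value of the performance function
    lam = np.zeros(n + 1)                    # adjoint variable, lam(T) = 0
    for k in range(n, 0, -1):                # STEP 3: transient analysis backwards
        lam[k - 1] = lam[k] - dt * (x * lam[k] + 2.0 * q[k])
    return f, trap(lam * q, dt)              # dynamic portion of the gradient

f0, g = f_and_grad(1.5)
h = 1e-5                                     # central-difference sanity check
fd = (f_and_grad(1.5 + h)[0] - f_and_grad(1.5 - h)[0]) / (2.0 * h)
print(g, fd)                                 # the two estimates agree closely
```

Note the attraction of the adjoint formulation for circuit optimization: one forward and one backward transient analysis yield the whole gradient, regardless of how many designable parameters there are.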
Recently, it was shown how the performance function (2.3a) can be made more useful by the use of the event functional [5] which allows the inclusion of time quantities, such as time delays, within the entire procedure. 16 Clearly this entire procedure can be computationally very costly. Each function evaluation requested by the minimization algorithm requires a transient analysis of a system of nonlinear algebraic and differen tial equations,and a transient analysis of a system of linear time varying algebraic and differential equations is additionally required for the gradient. It is therefore essential that 1) the minimization algorithm used be extremely efficient requiring a small number of func tion and gradient evaluations to obtain the minimum, and 2) the entire fourstep procedure outlined above must be very efficiently implemented. Due to the previously mentioned heuristic procedure of generating the scalar performance function, this entire design procedure is manually iterative thus emphasizing the need for overall efficiency. 2.4 Review of Unconstrained Minimization Powell recently observed that in the last several years most of the useful work in the area of unconstrained minimization has been in understanding, improving and extending existing methods rather than devising new algorithms [41]. Indeed most, if not all, minimization algorithms can be described as first computing a direction of search from the current estimate to the solution and then obtaining the next point along this direction. That is, if the unconstrained minimization problem is expressed by minimize f(x) x most algorithms, at the kth iteration, first compute a direction of 17 k search, represented by a vector d by using the values of the function k k+1 f(x ) and perhaps some of its derivatives. Then the next point x is obtained by searching along this straight line in some manner to yield# k+l k k x = x +p d where pk is a scalar often called the steplength. 
Thus, most existing minimization algorithms have two principal steps in each iteration: choosing the direction of search, and the scalar search along this direction to obtain a suitable steplength p_k (note that subscripts will be used for scalars). The direction of search is one of the differences among algorithms. The oldest minimization algorithms are 1) steepest descent, where the direction of search is the direction of the negative gradient; 2) coordinate descent, where the direction of search is along each coordinate direction, i.e., one variable is adjusted at a time; and 3) Newton's method, where the direction of search is the product of the inverse of the second derivative matrix, the hessian, and the negative gradient [34]. The most robust of these algorithms is steepest descent, which converges with order one, while the fastest is Newton's, which converges with second order for most functions. For this reason, from 1959 to the present, much of the activity in the area of unconstrained minimization algorithms has been to devise techniques that approach the speed of Newton's method without its disadvantages, in particular its requirement of the hessian matrix. Davidon [14] in 1959 published an algorithm which uses only gradient information to in effect build an approximation of the hessian inverse as the algorithm progresses towards the solution; thus the method approaches Newton's method after a number of iterations. Davidon's method, which is based on hessian-conjugate directions, gave birth to a very large number of algorithms generally called quasi-Newton algorithms [22,23,7,34]. A characteristic of the earlier quasi-Newton algorithms was that the scalar search to compute p_k had to solve the scalar minimization problem

    minimize  f(x^k + p d^k)
       p

very accurately. The accurate solution of this problem is quite costly computationally, as many researchers have shown [46,34]. The elimination of the requirement to solve this scalar minimization problem accurately was the principal motivation for many of the latest quasi-Newton algorithms [21,7,12], although some researchers were additionally motivated by deriving algorithms which required only function values [40,49,12]. The amount of information about the function which must be supplied to minimization algorithms has been a motivation for the development of many new algorithms. The general tendency in deriving algorithms, since Davidon's classical contribution [14], has been to account for the hessian without having it supplied, i.e., making sure an algorithm would be efficient for quadratic functions. On the other hand, the new minimization algorithm developed in the next two chapters has the property of in effect accounting for even higher derivatives without having them supplied. The new algorithm requires that the function and the first two derivatives, the gradient and the hessian, be supplied. The effect of the third and fourth derivatives is approximated from values of the first two derivatives. The algorithm offers several novel ideas in the area of unconstrained minimization, such as a variable order of convergence as high as four and a scalar search that may be along curved trajectories. Thus the new algorithm does not compute a direction of search which is always a straight line, as most existing algorithms do. In circuit optimization procedures, as was shown, the function and the gradient can be computed requiring only first partial derivatives of the circuit equations and the performance function. The hessian would require second partial derivatives of the circuit equations, which in general are very difficult to derive and would require a large number of operations to handle (for linear circuits the second partial derivatives are zero, and thus the hessian can be evaluated in a straightforward manner, as done in [54] for a Newton-like minimization algorithm).
For this reason, in Chapter 4 we describe a difference scheme, built into the new algorithm, to approximate the hessian, thereby allowing the use of the new algorithm by supplying it with only function and gradient values.

CHAPTER 3
THE VARIABLE-ORDER ALGORITHM FOR UNCONSTRAINED MINIMIZATION

In this chapter a new algorithm, called the Variable-Order (VO) algorithm, is proposed for finding a solution to the unconstrained minimization problem

    minimize f(x)                                    (3.1)
        x

where f is a real-valued nonlinear function, f: E^n -> E, and x is in E^n. The equivalent maximization problem is also included in (3.1) since

    maximize f(x) = - minimize [- f(x)] .
        x                x

Solution of the unconstrained minimization problem is not only important in its own right but, as will be seen in Chapter 4, it is central to the solution of the constrained minimization problem which often arises in computer-aided circuit optimization procedures. There are several existing algorithms designed to solve (3.1). These algorithms are all iterative; that is, beginning from an initial estimate x^0 of a solution, a sequence {x^k} is generated which under certain conditions converges to a solution x* of (3.1). The exact solution x* is rarely obtained in a finite number of iterations, but if the sequence has a high order of convergence the solution can be approximated closely in a finite, and hopefully small, number of iterations. Therefore, a desirable property for a minimization algorithm is that it generate convergent sequences with a high order of convergence. If an algorithm generates convergent sequences from any initial point x^0, it is said to be globally convergent. It will be shown that the VO algorithm is globally convergent for pseudoconvex [35] functions that are twice continuously differentiable. Moreover, numerical experiments indicate that the VO algorithm is able to efficiently compute minima of some functions not satisfying these conditions.
If an algorithm has a high order of convergence, reasonable accuracy might be expected when the algorithm is stopped after several iterations. The order of convergence of an algorithm may be defined by a value r such that

    ||x^{k+1} - x*|| <= C ||x^k - x*||^r ,

where 0 < C < infinity is a constant called the convergence ratio. Observe that if the distance ||x^k - x*|| is sufficiently small, x^{k+1} will be much closer to x* for large r. Most algorithms have orders of convergence equal to two or less. For example: steepest descent converges linearly (r = 1); the conjugate gradient algorithm of Fletcher and Reeves converges linearly, but with a smaller convergence ratio than that of steepest descent; and the quasi-Newton algorithm of Davidon, Fletcher and Powell approaches second-order convergence [34]. It will be shown that the VO algorithm has up to fourth-order convergence. The definition of the order of convergence implies that the higher the order, the faster a solution is approached, provided that the point x^k is sufficiently close to the solution. Thus, while a high-order algorithm is desirable when in a small neighborhood of x*, previous studies generally indicated that when x^k is far from x* a lower-order algorithm is more efficient [30]. In fact, the very popular class of algorithms called quasi-Newton algorithms have the property of being linearly convergent initially and becoming essentially second-order as the solution is approached [34]. The VO algorithm automatically adjusts its order at each iteration, generally selecting the order which allows the most progress towards the solution. The numerical results to be given in the next chapter show that the VO algorithm is more efficient than most existing algorithms. The first section of this chapter reviews some of the existing theory associated with solution points of (3.1).
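The definition above can be illustrated numerically: for a sequence generated by a method of known order, the ratio log ||x^{k+1} - x*|| / log ||x^k - x*|| drifts towards r as the errors shrink. A minimal sketch, using the scalar Newton iteration for x^2 = 2 purely as a stand-in for a second-order method (this example is not part of the VO algorithm):

```python
import math

# Newton's iteration for F(x) = x^2 - 2, a second-order (r = 2) method.
x_star = math.sqrt(2.0)
x = 1.5
errors = []
for _ in range(4):
    errors.append(abs(x - x_star))   # record error e_k before each step
    x = 0.5 * (x + 2.0 / x)          # Newton step

# For e_{k+1} <= C e_k^r, the ratio log e_{k+1} / log e_k -> r as e_k -> 0.
estimates = [math.log(errors[i + 1]) / math.log(errors[i])
             for i in range(len(errors) - 1)]
print(estimates)   # successive estimates drifting towards r = 2
```

The same probe applied to steepest descent on the same problem would hover near r = 1, which is one simple way the convergence orders quoted in this section can be observed in practice.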
The second section discusses the two major steps of a minimization algorithm: the transformation step and the scalar search step. The new techniques being introduced for the VO algorithm are compared with the techniques of previous algorithms in this section. The theoretical derivation of the algorithm is presented in the third through the fifth sections. Although the character of these three sections is theoretical, several numerical and practical considerations are discussed. The sixth and final section establishes the conditions for global convergence of the VO algorithm and its order of convergence.

3.1 Properties of Minima

The problem to be solved is given by

    minimize f(x)                                    (3.2)
        x

where f: E^n -> E. Let x* be a solution; then if

    f(x*) <= f(x)                                    (3.3)

for all possible x, the point x* is called a global minimum. If (3.3) holds only in a small neighborhood about x*, then the solution x* is called a local minimum. One usually would like to determine the global minimum of (3.2). However, one must in general be content with a local minimum, because a global minimum can only be identified if all minima are obtained, or if the function is assumed to have a convexity property (in which case all minima are global [34]). In contrast, local minima are identified under less stringent conditions on the function. The following theorem establishes these conditions.

Local Minimum Theorem. Let f: E^n -> E, and suppose that the first derivative f'(x), the gradient of f, is continuous and that the second derivative matrix f''(x), the hessian of f, exists at x*. If x* is a local minimum of f, then 1) f'(x*) = 0, and 2) f''(x*) is positive semidefinite. Conversely, if 1) f'(x*) = 0 and 2) f''(x*) is positive definite, then x* is a local minimum of f.

The proof is straightforward [34,38], and it will not be repeated here. Note the subtle but significant difference between the necessary and the sufficient conditions. Most local minima are strict local minima.
A strict local minimum solution x* is defined by

    f(x* + y) > f(x*)                                (3.4)

for all y not equal to 0 such that x* + y is in some neighborhood about x*. The following theorem plays an important role in the test for convergence of the proposed algorithm.

Strict Local Minimum Theorem. Let f: E^n -> E have a continuous hessian in some neighborhood about x*. Then the point x* is a strict local minimum of f if and only if f'(x*) = 0 and there exists an ε > 0 such that for all y satisfying 0 < ||y|| < ε, f''(x* + y) is positive definite.

This theorem is significant because in general a minimization algorithm stops at a point x̄ which lies in a small neighborhood about x*. The theorem states that if x* is a strict local minimum, the hessian should be positive definite at x̄.

Proof: (Sufficiency) Assume f'(x*) = 0 and that there exists an ε > 0 such that for all y satisfying 0 < ||y|| < ε, f''(x* + y) is positive definite. From the Taylor series with remainder one may write

    f(x* + y) = f(x*) + (1/2) y^T f''(x* + t y) y

for some 0 < t < 1. Then for 0 < ||y|| < ε

    f(x* + y) - f(x*) = (1/2) y^T f''(x* + t y) y > 0 ,

which shows that x* is a strict local minimum.

(Necessity) Now assume x* is a strict local minimum. Then for arbitrary y

    lim_{t -> 0+} (1/t)[f(x* + t y) - f(x*)] = f'(x*)^T y

by the definition of the derivative [38]. Assume f'(x*) is not 0. Then a vector y exists such that f'(x*)^T y < 0 (e.g., y = -f'(x*)). Thus for suitably small t > 0

    (1/t)[f(x* + t y) - f(x*)] < 0 ,

which contradicts the hypothesis that x* is a strict local minimum; hence f'(x*) = 0. Now consider a Taylor series expansion with remainder

    f(x* + y) = f(x*) + (1/2) y^T f''(x* + t y) y ,

for some 0 < t < 1. Then there exists a δ > 0 such that for all y satisfying 0 < ||y|| < δ <= ε

    (1/2) y^T f''(x* + t y) y = f(x* + y) - f(x*) > 0 ,

which implies the positive definiteness of the hessian in a small neighborhood about x*, not necessarily including x* itself. This completes the proof.
If the hessian is either inaccurate or not supplied, a widely used test for stopping an algorithm at a point x̄ is

    ||f'(x̄)|| <= ε                                   (3.5)

for some small ε > 0, and

    f(x̄ + t e^i) > f(x̄),   i = 1, ..., n,            (3.6a)

and

    f(x̄ - t e^i) > f(x̄),   i = 1, ..., n,            (3.6b)

where e^1, ..., e^n are the unit coordinate vectors and t > 0 is some small scalar. The tests in (3.6) insure that the function does not decrease in value along any of the coordinate lines. However, as the following simple example [34] shows, these tests are not sufficient for x̄ to be an approximation to a local minimum solution. Consider the problem

    minimize f(x) = x1^3 - x1^2 x2 + 2 x2^2 .
        x

The point (x̄1, x̄2) = (6, 9) satisfies

    [f'(x̄)]_1 = ∂f/∂x1 = 3 x1^2 - 2 x1 x2 = 0 ,

    [f'(x̄)]_2 = ∂f/∂x2 = - x1^2 + 4 x2 = 0 .

Furthermore, with x1 fixed at x̄1 = 6, the expression

    f(6, 9 + t) > f(6, 9)

is satisfied for all t, and with x2 fixed at x̄2 = 9, the expression

    f(6 + t, 9) > f(6, 9)

is satisfied for t > -9. Therefore both (3.5) and (3.6) are satisfied. Despite this fact, the point (x̄1, x̄2) is not a local minimum, since the hessian evaluated at this point, given by

    f''(x̄) = [ 18  -12 ]
              [-12    4 ] ,

is not positive semidefinite, because its determinant is negative. Therefore, any algorithm using (3.5) and (3.6) as its sole termination test cannot guarantee the computation of a local minimum.

3.2 Principal Steps in a Minimization Algorithm

A minimization algorithm iteratively generates a sequence of points {x^k}, starting from some initial point x^0, which hopefully converges to a solution of (3.2). It is convenient to view each iteration of the algorithm in terms of the expression

    x^{k+1} = H(p_k, x^k)                            (3.7)

where H is called the iteration function, which is a function of a scalar parameter p_k, the k-th estimate x^k of the solution, the function value at x^k, and possibly derivatives of f at x^k. The iteration indicated by (3.7) may be separated into two steps.
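Returning briefly to the termination-test example above: it is easy to verify numerically. The sketch below checks that the gradient vanishes and the coordinate tests (3.6) pass at (6, 9), yet the hessian determinant is negative and a nearby point along the diagonal has a lower function value (the probe direction (1, 1) is chosen here purely for illustration):

```python
def f(x1, x2):
    return x1**3 - x1**2 * x2 + 2.0 * x2**2

def grad(x1, x2):
    return (3.0 * x1**2 - 2.0 * x1 * x2, -x1**2 + 4.0 * x2)

x1, x2, t = 6.0, 9.0, 1e-3
assert grad(x1, x2) == (0.0, 0.0)                     # test (3.5) passes exactly
# coordinate tests (3.6) pass as well
assert f(x1 + t, x2) > f(x1, x2) and f(x1 - t, x2) > f(x1, x2)
assert f(x1, x2 + t) > f(x1, x2) and f(x1, x2 - t) > f(x1, x2)

# ... but the hessian [[18, -12], [-12, 4]] is indefinite (det < 0),
det = 18.0 * 4.0 - (-12.0) * (-12.0)
assert det < 0.0
# and moving along the diagonal direction (1, 1) decreases f:
s = 1e-2
assert f(x1 + s, x2 + s) < f(x1, x2)
```

The descent along (1, 1) is exactly what the indefinite hessian predicts: the quadratic form y^T f''(x̄) y is negative for y = (1, 1), even though it is nonnegative along both coordinate axes.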
The first step consists of computing the transformation function given by

    h^k(p) = H(p, x^k)                               (3.8)

where for convenience the function h^k: E -> E^n has been defined. This step will be called the transformation step. Its purpose is to compute a direction or trajectory from x^k, using f(x^k) and possibly derivatives of f at x^k, along which the next point x^{k+1} will be selected. The transformation function represents this trajectory, and p is a scalar parameter which determines how far along this trajectory the next point x^{k+1} will lie. The second step consists of selecting or computing a suitable value p = p_k such that x^{k+1} is given by

    x^{k+1} = h^k(p_k) .                             (3.9)

This step is called the scalar search step. In most existing algorithms [34], the transformation functions are linear in the scalar parameter p and have the form

    h^k(p) = x^k + p d^k                             (3.10)

where d^k is called the direction of search. For example, in the steepest descent algorithm [34], d^k = -f'(x^k). In contrast, the transformation functions for the VO algorithm are polynomials in p of degree up to three. These polynomials follow inherently from the derivation of the transformation functions for the VO algorithm to be given in the next section. The transformation functions for the new algorithm may thus yield curved trajectories of search instead of straight-line directions of search. If the scalar search step computes a p = p_k such that

    f(x^{k+1}) = f(h^k(p_k)) < f(x^k)                (3.11)

is satisfied, then each iteration of the algorithm will progress towards a solution of (3.2). The satisfaction of (3.11) at each iteration insures that the sequence {f(x^k)} is monotonically decreasing, a property that is generally required to establish the global convergence of an algorithm. Thus most algorithms, including the VO algorithm, compute a p_k which satisfies (3.11). If a p_k cannot be found to satisfy (3.11), an algorithm has either converged to a solution of (3.2) or it has failed.
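The two-step structure of (3.8)-(3.9) can be sketched generically. In the sketch below the transformation step supplies a straight-line trajectory as in (3.10) with the steepest-descent direction, and the scalar search simply halves p until the descent condition (3.11) holds; both choices are placeholders for illustration, not the VO transformations or the VO search of Section 3.4:

```python
def descent_iteration(f, grad, x, p0=1.0, max_halvings=30):
    """One iteration of (3.7): a transformation step, then a scalar search."""
    g = grad(x)
    d = [-gi for gi in g]                                  # steepest-descent direction
    h = lambda p: [xi + p * di for xi, di in zip(x, d)]    # h^k(p), eq. (3.10)
    p = p0
    for _ in range(max_halvings):                          # crude search for (3.11)
        if f(h(p)) < f(x):
            return h(p), p
        p *= 0.5
    return x, 0.0                                          # no descent found

# stand-in quadratic test function f(x) = x1^2 + 10 x2^2, minimum at the origin
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad = lambda x: [2.0 * x[0], 20.0 * x[1]]
x = [1.0, 1.0]
for _ in range(200):
    x, _ = descent_iteration(f, grad, x)
print(f(x))   # the sequence f(x^k) is monotonically decreasing towards 0
```

Because every accepted step satisfies (3.11), the sequence {f(x^k)} is monotone, which is the property the text identifies as central to global convergence arguments.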
The following lemma, a generalization of an existing result [56], establishes sufficient conditions for the existence of a p = p_k satisfying (3.11).

Lemma 3.1. Assume f(x) is differentiable at x = x^k, h^k(p) is differentiable at p = 0, and h^k(0) = x^k. Define h^k'(p) to be the first derivative of h^k with respect to p. Then if

    f'(x^k)^T h^k'(0) < 0                            (3.12)

there exists a p > 0 such that

    f(h^k(p)) < f(x^k) .                             (3.13)

Proof: Using chain differentiation and the definition of the derivative, one may write

    lim_{p -> 0+} (1/p)[f(h^k(p)) - f(h^k(0))] = f'(h^k(0))^T h^k'(0) .   (3.14)

Since h^k(0) = x^k, using (3.12) this expression becomes

    lim_{p -> 0+} (1/p)[f(h^k(p)) - f(x^k)] < 0 .    (3.15)

Then there exists an ε > 0 such that for p not equal to 0 and -ε < p < ε

    (1/p)[f(h^k(p)) - f(x^k)] < 0 .                  (3.16)

Select 0 < p < ε to preserve the inequality, and it follows that f(h^k(p)) < f(x^k), which completes the proof.

For the transformation function (3.10) found in most algorithms, (3.12) takes the form

    f'(x^k)^T d^k < 0 ,                              (3.17)

which becomes

    - f'(x^k)^T f'(x^k) < 0                          (3.18)

for the steepest descent algorithm [34]. Thus, as long as f'(x^k) is not 0, the steepest descent algorithm should be able to obtain a decrease in the function. As will be seen, the VO algorithm also has the property of satisfying (3.12) whenever f'(x^k) is not 0. That this property is highly desirable follows from the theorems given in the preceding section. Assuming the existence of a p which satisfies (3.11), the problem to be briefly considered now is the computation of a particular value p_k to be used in obtaining x^{k+1}. There are two stages to this problem. First, the desired p_k must be defined in some concrete manner, usually as the solution of a scalar problem.
In most existing algorithms, the desired p_k is defined as the solution of the following scalar minimization problem [34]

    f(h^k(p_k)) = minimize f(h^k(p)) .               (3.19)
                      p

This value of p_k satisfies (3.11), and thus this scalar problem defines the desired p_k in a suitable manner. Other existing algorithms, such as the Davidon [14] algorithm or its more popular modification due to Fletcher and Powell [22], require that p_k be defined by problem (3.19). While it may be argued that the p_k defined by (3.19) provides the most decrease in the function at the k-th iteration, this p_k may not be the best one in the sense of minimizing the overall number of iterations. For example, when x^k is far from a solution x* of (3.2), the p_k given by (3.19) tends to force all future iterations to follow the bottom of narrow valleys, with slow progress towards x* [30]. Thus, ideally, p_k should be defined by (3.19) whenever x^k is close to x*, and in some other manner whenever x^k is far from x*. Secondly, once the problem which defines the desired p_k has been given, an approximation to the solution of the problem must be efficiently computed. It is important that the overall algorithm not fail if rough approximations to the solution are computed to achieve savings of computer time. For example, many studies have indicated that the overall efficiency of the popular algorithm due to Fletcher and Powell, and of the one due to Fletcher and Reeves (both of which theoretically require the solution of (3.19)), is sensitive to the accuracy of the approximate solution of (3.19) [34,46]. The scalar search for the VO algorithm was developed under these considerations. The details are given in Section 3.4.

3.3 Variable-Order Transformations

In this section the transformation functions for the VO algorithm are derived. It will be seen that these transformation functions require the evaluation of higher-order derivatives.
However, approximations are possible which allow the algorithm to be used even when only function values are supplied. The motivation for the variable-order transformations stems from a desire to approximate the behavior of the gradient of f with information at the present point x^k. It will be shown that it is possible to represent the point x*, at which the gradient is zero, by an infinite series. The variable-order transformation functions result from truncations of this series. A parameter p appearing in the infinite series in effect accounts for the terms of the series which are dropped. We begin by noting that a necessary condition for x* to be a solution of the minimization problem (3.2) is that the gradient of f at x* be zero. Then any solution of (3.2) must satisfy the system of equations

    f'(x) = 0 .                                      (3.20)

Now consider the change of variables denoted by

    x = X(z)                                         (3.21)

where X may be a nonlinear function. Using (3.21), equation (3.20) becomes

    f'(x) = f'(X(z)) = g(z) .                        (3.22)

Define a z* such that

    x* = X(z*) ;                                     (3.23)

then from (3.22)

    g(z*) = 0 ,                                      (3.24)

since f'(x*) = 0. If the function g is simple to invert, so that z* may be found from (3.24) by

    z* = g^{-1}(0) ,                                 (3.25)

then the solution x* of (3.20) may be found from (3.23). Clearly, specifying a change of variables which yields a function g that is simple to invert could be difficult. However, as we now show, we can start by specifying a suitable function g and determine the resulting change-of-variable function X. To this end, assume some g function has been specified. Then X(z) may be expanded in a Taylor series about some z = z^k, so that (3.21) becomes

    x = X(z^k) + X'(z^k)(z - z^k) + (1/2)[X''(z^k)(z - z^k)](z - z^k) + ...   (3.26)

where from (3.21) x^k may be associated with z^k by

    X(z^k) = x^k ,                                   (3.27)

and

    X'(z^k) = f''(x^k)^{-1} g'(z^k) ,                (3.28)

    X''(z^k) = f''(x^k)^{-1} [g''(z^k) - [f'''(x^k) X'(z^k)] X'(z^k)] ,   (3.29)

are obtained by repeated differentiation of (3.22) and evaluation of the resulting expressions at x = x^k and z = z^k. Since, as is the case at present, an x^k is known or given, a corresponding z^k may be found from (3.22) by

    z^k = g^{-1}(f'(x^k)) ,                          (3.30)

since it was assumed that g is simple to invert. Therefore all the terms of the series have been defined, assuming all the derivatives and the inverse of the hessian exist. Finally, since we are interested in x*, given by X(z*) in (3.23), we obtain the infinite series representation of a solution to (3.20):

    x* = x^k + X'(z^k)(z* - z^k) + (1/2)[X''(z^k)(z* - z^k)](z* - z^k) + ...   (3.31)

where z* is given by (3.25), z^k by (3.30), and the derivatives X'(z^k), X''(z^k), ... are obtained by differentiation as shown in (3.28) and (3.29). We now turn to the selection of a suitable function g. Since the function g should be simple to invert, a logical choice might be

    g(z) = (z1, z2, ..., zn)^T ,                     (3.32)

for which the function and its inverse are identical. For this function, the infinite series (3.31) becomes

    x* = x^k - d_2^k - d_3^k - d_4^k - ...           (3.33a)

where

    d_2^k = f''(x^k)^{-1} f'(x^k) ,                  (3.33b)

    d_3^k = (1/2) f''(x^k)^{-1} [f'''(x^k) d_2^k] d_2^k ,   (3.33c)

    d_4^k = f''(x^k)^{-1} [[f'''(x^k) d_2^k] d_3^k - (1/6)[[f^iv(x^k) d_2^k] d_2^k] d_2^k] .   (3.33d)

This infinite series is a well-known result extended to n dimensions. If the series is truncated to two terms, an iterative method can be constructed, given by

    x^{k+1} = x^k - d_2^k ,                          (3.34)

which is Newton's iteration for solving (3.20) [34]. However, this iterative method (or iterative methods obtained from any truncation of (3.33)) may not converge to a solution of the minimization problem (3.2), because the infinite series (3.33) may not itself be a convergent series. Thus we conclude that the series resulting from the simple function g given by (3.32) does not in general produce a suitable infinite series.
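Near a solution, however, the truncations of (3.33) behave exactly as advertised. The sketch below evaluates the corrections (3.33b)-(3.33d) in one dimension, where f''' and f^iv are just scalars, for the stand-in function f(x) = x - ln x with minimum x* = 1 (this function is an illustration only, not one of the dissertation's test problems); each added correction brings the truncated series closer to x*. It also previews the gradient-only approximation (3.44) derived later in the chapter:

```python
import math

# f(x) = x - ln(x), minimized at x* = 1; derivatives are explicit.
f1 = lambda x: 1.0 - 1.0 / x          # f'
f2 = lambda x: 1.0 / x**2             # f''
f3 = lambda x: -2.0 / x**3            # f'''
f4 = lambda x: 6.0 / x**4             # f''''

x = 1.2
d2 = f1(x) / f2(x)                                             # (3.33b)
d3 = 0.5 * (f3(x) * d2) * d2 / f2(x)                           # (3.33c)
d4 = (f3(x) * d2 * d3 - (1.0 / 6.0) * f4(x) * d2**3) / f2(x)   # (3.33d)

second = x - d2                # Newton truncation (3.34)
third  = x - d2 - d3
fourth = x - d2 - d3 - d4
print(second, third, fourth)   # approx. 0.96, 1.008, 0.9984: shrinking errors

# Anticipating (3.44): d3 can also be approximated from one extra gradient,
# with error O(|d2|^3), avoiding the third derivative entirely.
d3_approx = f1(x - d2) / f2(x)
```

The errors of the three truncations here are about 4e-2, 8e-3 and 1.6e-3, so each retained term helps; the chapter's point is that far from x* this need not hold, which motivates the damping parameter introduced next.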
One of the proposed modifications to (3.34) which improves its potential for convergence is the introduction of a scalar p_k to "damp" the iteration [34], given by

    x^{k+1} = x^k - p_k d_2^k .

(In 1755, Euler derived an infinite series for a solution of the scalar problem f(x) = 0 [19]. Other recent derivations and more extensive studies of this scalar series have been published [47,39,50,32,42]. The Euler series becomes (3.33) when extended to n dimensions for the solution of problem (3.20).) For the minimization problem (3.2), this iteration becomes the two-step procedure given by the computation of the transformation function

    h^k(p) = x^k - p d_2^k

and the computation of a suitable value p = p_k to obtain x^{k+1}, given by

    x^{k+1} = h^k(p_k) .

Motivated by this modification, we propose the following g function

    g(z) = (z1^p, z2^p, ..., zn^p)^T .               (3.35)

(Note that z1^p means z1 raised to the p-th power.) This g function is simple to invert. Furthermore,

    z* = g^{-1}(0) = 0 ,                             (3.36)

and any higher derivatives with respect to z are readily obtained. Taking the first two terms of the resulting series (3.31) yields the iterative method

    x^{k+1} = x^k - p_k f''(x^k)^{-1} f'(x^k) .      (3.37)

Thus the second-order transformation function may be defined by

    h_2^k(p) = x^k - p d_2^k                         (3.38)

where

    d_2^k = f''(x^k)^{-1} f'(x^k)                    (3.39)

is defined to be the second-order correction. (Note that the order refers to the order of convergence of the sequences generated by the algorithm.) Thus Newton's method for solving minimization problems falls out as a special case of the proposed algorithm. Taking three terms of (3.31) yields the third-order transformation given by

    h_3^k(p) = x^k - (1/2)(3 - p) p d_2^k - p^2 d_3^k   (3.40)

where

    d_3^k = (1/2) f''(x^k)^{-1} [f'''(x^k) d_2^k] d_2^k   (3.41)

is the third-order correction.
Similarly, using the first four terms of (3.31) yields the fourth-order transformation given by

    h_4^k(p) = x^k - (1/6)(p^2 - 6p + 11) p d_2^k - (2 - p) p^2 d_3^k - p^3 d_4^k   (3.42)

where the fourth-order correction is defined to be

    d_4^k = f''(x^k)^{-1} [[f'''(x^k) d_2^k] d_3^k - (1/6)[[f^iv(x^k) d_2^k] d_2^k] d_2^k] .   (3.43)

Transformation functions of order higher than four may be similarly derived. However, as will be seen below, it does not seem possible to adequately approximate the corrections of order greater than four. Moreover, higher-order derivatives require considerable storage and are very difficult to derive in general, and therefore we try to avoid them. Additionally, the techniques proposed in Section 3.4 for the scalar search would not be as attractive for transformation functions of order higher than four, because the zeros of polynomials of degree greater than two would be needed. Finally, it was experimentally found that, for one function tested, transformation functions of order higher than four did not increase the overall efficiency in computing the solution. The selection of which transformation function to use at each iteration will be described later.

3.3.1 Approximations of Higher-Order Corrections

Observe that the computation of the third-order correction (3.41) would require both the evaluation of f'''(x^k), a third-order tensor, and a considerable number of multiplications. This computational effort can be reduced by using the approximation

    d_3^k ≈ f''(x^k)^{-1} f'(h_2^k(1))               (3.44)

where

    h_2^k(1) = x^k - d_2^k                           (3.45)

from (3.38). The approximation (3.44) follows from the Taylor series expansion

    f'(x^k - d_2^k) = f'(x^k) - f''(x^k) d_2^k + (1/2)[f'''(x^k) d_2^k] d_2^k - ...   (3.46)

Using (3.39), the first two terms cancel, and therefore

    f'(h_2^k(1)) = f'(x^k - d_2^k) ≈ (1/2)[f'''(x^k) d_2^k] d_2^k   (3.47)

is an approximation with error on the order of ||d_2^k||^3, assuming the fourth derivative of f is bounded. Comparing with the equation for the third-order correction given by (3.41), using (3.47) yields the approximation (3.44). Similarly, the fourth-order correction (3.43) may be approximated by

    d_4^k ≈ f''(x^k)^{-1} f'(h_3^k(1))               (3.48)

where

    h_3^k(1) = x^k - d_2^k - d_3^k                   (3.49)

from (3.40). Using the approximation for d_3^k given by (3.44), the approximation (3.48) has an error of order (||d_2^k|| + ||d_3^k||)^3, assuming the fourth derivative of f is bounded. The approximation and the error bound follow from the Taylor series expansion of f'(x^k - d_2^k - d_3^k) and the use of (3.39) and (3.44). The error in these approximations continues to increase for corrections of higher order. Furthermore, these errors are enlarged because the higher-order corrections are multiplied by increasing powers of p, as can be seen from the third- (3.40) and fourth-order (3.42) transformations. This error is the major reason for considering only transformations of order four or less. In the next chapter, approximations to the hessian and the gradient of f will be presented. These approximations will allow the algorithm to be used even when only function values are supplied.

3.3.2 Transformation Order Selection

Assuming that all of the transformation functions exist (this assumption is removed later in this chapter), we wish to consider the question of which one should be used in each iteration. Recall that each transformation function may be thought of as a curved trajectory passing through x^k, with the scalar parameter p proportional to how far from x^k the next point might be. Ideally, the best transformation function order to use is the one whose trajectory passes closest to x*. It might seem that the higher the order, the closer to the solution x*, since more terms are taken in the infinite series representation of x*. However, there are two reasons why this seemingly reasonable expectation is not usually true. First, the process of computing the terms of the transformation functions involves several approximations and many arithmetic operations, with ensuing errors. Second, and perhaps more importantly, the infinite series represents x* only if it converges; it must also converge very fast. It was indeed verified numerically that usually one of the transformation functions is better, in the sense of giving a trajectory closer to x* than the others at each iteration, and the best one is not necessarily the one with the highest order. The proposed technique for selecting the best transformation function is based on the convergence of the infinite series (3.33) to a solution of the minimization problem (3.2). Recall that this infinite series resulted from the g function (3.32), or, for the function which was eventually used, from (3.35) with the scalar parameter p set to one. The procedure may be described as follows. Select the second-order transformation if

    f(h_2^k(1)) = f(x^k - d_2^k) > f(x^k) = f(h_2^k(0)) ;   (3.50)

otherwise, select the (r-1)-order transformation if

    f(h_r^k(1)) > f(h_{r-1}^k(1)) ,   r = 3 and 4.   (3.51)

If (3.51) is not satisfied for r = 4, the fourth-order transformation is selected. Thus, when order three or four is selected, a value of p = 1 always gives a point which yields a function value less than the present value. It was experimentally verified that this method selected the best order in most iterations. In those very few iterations where it did not select the best order, the order selected was only insignificantly different from the best one. Once one of the three transformation functions is selected, the dimension of the problem has been reduced to one. To see this, note that at the k-th iteration the remaining problem is to compute a value of the scalar parameter p such that

    f(h_r^k(p)) < f(x^k) = f(h_r^k(0)) ,   r = 2, 3, or 4,   (3.52)

where the transformation functions were given earlier, but will be repeated here for ease of reference. Thus

    h_2^k(p) = x^k - d_2^k p ,                       (3.53a)

    h_3^k(p) = x^k - (3/2) d_2^k p + [(1/2) d_2^k - d_3^k] p^2 ,   (3.53b)

    h_4^k(p) = x^k - (11/6) d_2^k p - (2 d_3^k - d_2^k) p^2 - [d_4^k - d_3^k + (1/6) d_2^k] p^3 ,   (3.53c)

where the terms have been rearranged in powers of p, and d_2^k, d_3^k and d_4^k are given by (3.39), (3.44), and (3.48), respectively. The computation of a suitable value of p is described in the next section.

3.4 The Scalar Search

The scalar search for an appropriate value of the parameter p, which appears in the higher-order transformation functions, is often the most time-consuming step in a minimization algorithm. The source of the difficulty is the requirement that most existing algorithms have of computing a value of p that accurately solves a scalar minimization problem, to be described below. In this section it will be shown that the scalar search step for the VO algorithm is not time-consuming, due to inherent properties of the transformation functions. Furthermore, during the initial iterations of the VO algorithm, when
Some of the considerations which lead to approximations of the best value of p are given next. Assume a pk= p exists such that from (3.54) we obtain k+l k, m x = x =h (p) < r, k 46 Clearly the best value of p in this case would be pk, and pk would be defined by f(hk (p)) = minimize f(hk(p)), (3.56) '\,Tr k r p which is a scalar minimization problem. Most existing algorithms choose pk to be an approximation to pk at each iteration; some algo rithms theoretically require that pk be an accurate approximation to Pk [46], in contrast with the VO algorithm which has no such require ment. In practice x k = h (p ) is seldom equal to x While it b ar k P m may be convincingly argued that k = k is an optimal value for some k * iterations, particularly when x is in some neighborhood of x the best value of p to choose is not pk for most iterations. In fact, k m when x is far from x choosing p = p tends to force any minimi lu *"lu k k zation algorithm to follow the bottom of narrow valleys with typically slow progress towards x [30]. Therefore, an ideal scheme would m k * choose p = p, when x is in some neighborhood of x ,and would k k choose pk to stay away from the bottom of narrow valleys whenever k * x is far from x The scalar search of the VO algorithm first attempts to establish k k * whether x is close to x .If x is close to x then p is set to an approximation of pm, a solution of the scalar minimization prob k * lem (3.56). The details are described below. If x is far from x , k+l k then pk is computed under the principle that x = h r(p) should be a k is AU %r k as far away from x as possible. The method for computing pk, when k * x is far from x will be described in two parts: 1) if the second order transformation was selected (r = 2), and 2) if the thirdor 47 the fourthorder transformation was selected (r = 3 or 4). 
* 3.4.1 Iterations Close to x k k If x is close to x then I f'(h (1)) I will be small due to the manner in which the transformation functions hk were derived. r Thus, if I f'((1)) k II M < (3.57) for some e > 0, it is concluded that the choice p, = pm should be c k k made. (For the tested functions, which are not badly scaled, E = was reasonable.) The test (3.57) can be made without c further gradient evaluations since it was shown in Section 3.3.1 that the two gradients f'(h (1)), and f'(h (1)) are evaluated while computing the approximations to the thirdand fourthorder correc tions given in (3.48). If the fourthorder transformation is selected, it was found that it is not necessary to evaluate f'(hk(1)), but rather the results of the test for f'(h (1)) could be used instead. k * Having identified that x is close to x in the above manner, m an approximation to pk needs to be computed. The following procedure was satisfactory for the functions tested. Evaluate f(h (p)) for Aur p = 2, 3, ..., L1, L, L+1, until f(hk(Ll)) > f(h (L)) < f(hk(L+l)) (3.58) %;r %T %r 48 which is a standard method of bracketing the scalar minimum [30]. The minimum of the quadratic polynomial in p which passes through the three points [30] obtained in (3.58) is computed given by f( k(L1)) 4f(h (L)) + 3f(hk(L+1)) S= L I k k k (3.59) f(h (L1)) 2f(h (L)) + f(h (L+1)) rGr r If p is close to L (Ip LI < .02), set pk = L to complete the scalar search for this iteration. If p is not close to L, then f(h (p)) is evaluated, and if f(h (p)) < f(h (L)), set pk = , otherwise set pk = L. This completes the scalar search for the case k * when x is close to x For most local minima pk = 1 and the above procedure should yield Pk = 1 requiring only one additional function evaluation. However, for local minima with a positive semidefinite hessian, pm will generally be greater than one. 
Therefore, in the actual implementation shown in Appendix I, if the minimum is located for p > 4, the function is evaluated at p = 10, 22, 46, 94, 190, ..., until the minimum is bracketed.

3.4.2 Iterations Far from x^*

When x^k is far from x^*, the expression (3.57) is not satisfied. In this case, the basic principle to be proposed is that x^{k+1} should be as far away from x^k as possible, subject to satisfying (3.55). This principle defines the desired p_k to be a solution of

    maximize_p  ||h_r^k(p) - x^k||   (3.60a)
    subject to  f(h_r^k(p)) <= f(x^k) + c_k,   (3.60b)

where c_k <= 0 will be defined to insure that f(x^{k+1}) is sufficiently less than f(x^k). An accurate solution of (3.60) would be difficult to obtain computationally. However, first, an accurate solution is not required; and second, when the third- or fourth-order transformation is selected, trial values of p that may approximate a solution of (3.60) may be found from already available information. The procedure when the second-order transformation is selected is described first.

Second-Order Transformation Selected. If the second-order transformation is selected at the kth iteration, the search for an approximation to a solution of (3.60) is along a straight line in the space of the independent variables, since h_2^k(p) is a linear function of p; thus (3.60a) is linear in p and is maximized by the largest possible value of p. In this case c_k is set to zero, which implies that we desire any descent. The procedure proposed may be generally described as fitting, and computing the minimum of, approximating polynomials, which attempts to satisfy the descent constraint (3.60b). Then, attempting to satisfy (3.60a), a constant is added to the computed minimum of the approximating polynomial. The details are given next. The following information is already available: f(h_2^k(0)), f'(h_2^k(0)), f(h_2^k(1)), and f'(h_2^k(1)).
Moreover,

    f'(h_2^k(0))^T h_2^k'(0) < 0,   (3.61)

where h_2^k'(0) is the first derivative of h_2^k(p) with respect to p, evaluated at p = 0. Expression (3.61) implies the existence of a p > 0 that satisfies (3.60b) (see Lemma 3.1). If the expression

    f(h_2^k(1)) < f(h_2^k(0)) = f(x^k)   (3.62)

is satisfied, then (3.60b) is also satisfied. Whenever (3.62) is satisfied, it is computationally efficient to select p_k = 1, since the function and the gradient are already evaluated for the next iteration. In addition, it is unlikely that the descent constraint (3.60b) will be satisfied for p_k > 1, because of the manner in which the transformation function order is selected. Thus whenever (3.62) is satisfied, p_k is set to one.

If (3.62) is not satisfied, the minimum within the interval (0, 1) of the cubic Hermite polynomial in p fitted through the available information is computed as follows [14]:

    p_c = 1 - (s_1 + a - b) / (s_1 - s_0 + 2a),   (3.63a)

where

    s_1 = f'(h_2^k(1))^T h_2^k'(1),   (3.63b)
    s_0 = f'(h_2^k(0))^T h_2^k'(0),   (3.63c)
    b = 3 [f(h_2^k(0)) - f(h_2^k(1))] + s_0 + s_1,   (3.63d)
    a = [b^2 - s_0 s_1]^{1/2}.   (3.63e)

Then let

    p_bar = max {0.1, p_c + min {p_c, 1 - p_c}/2}.   (3.64)

Observe that (3.64) is an attempt at satisfying (3.60a). Then, after evaluating f(h_2^k(p_bar)), if

    f(h_2^k(p_bar)) < f(x^k)   (3.65)

is satisfied, set p_k = p_bar, and the scalar search is done. Finally, if (3.65) is not satisfied, the procedure becomes iterative. The minimum of the quadratic in p fitted through the function and derivative at p = 0, and through the function at p = p_bar, is computed as follows:

    p_c = -.5 p_bar^2 s_0 / [f(h_2^k(p_bar)) - f(h_2^k(0)) - p_bar s_0],   (3.66)

where s_0 is given in (3.63c). Define

    p_bar = max {p_c, p_bar/4}.   (3.67)

Then, after evaluating f(h_2^k(p_bar)), if (3.65) is satisfied, the search is done with p_k = p_bar; otherwise the process is repeated beginning with (3.66). Figure 3.1 summarizes the steps in a flow chart. For the functions tested, the second-order transformation was rarely selected.
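The fitting steps (3.63)-(3.64) can be sketched as below. The variable names f0, g0, f1, g1 (for f(h(0)), s_0, f(h(1)), s_1) and the function name are illustrative, not from the original code.

```python
import math

def cubic_trial_point(f0, g0, f1, g1):
    """Minimizer in (0, 1) of the cubic Hermite fit (3.63), followed by
    the safeguard (3.64).  f0, f1 are f(h(0)) and f(h(1)); g0, g1 are the
    trajectory slopes s_0 and s_1 at p = 0 and p = 1 (g0 < 0 by (3.61))."""
    b = 3.0 * (f0 - f1) + g0 + g1                     # (3.63d)
    a = math.sqrt(b * b - g0 * g1)                    # (3.63e)
    p_c = 1.0 - (g1 + a - b) / (g1 - g0 + 2.0 * a)    # (3.63a)
    # (3.64): push somewhat past the fitted minimizer, but never below 0.1
    return max(0.1, p_c + min(p_c, 1.0 - p_c) / 2.0)
```

For a symmetric dip such as f(h(p)) = (p - 1/2)^2 the fitted minimizer is p_c = 1/2, and (3.64) yields the trial point 0.75.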
Most of the time when it was selected, the search ended with (3.64); thus only one additional function evaluation was needed most of the time the second-order transformation was selected.

Third- or Fourth-Order Transformation Selected. Whenever the third- or fourth-order transformation is selected, the search for a p_k to approximate a solution of (3.60) is no longer along straight lines in the space of the independent variables. Note that h_3^k(p) and h_4^k(p), given in (3.53b) and (3.53c), are polynomials in p of degree greater than one. For clarity of notation, the superscript and subscript of the transformation function will be dropped; i.e., for the present discussion,

    h(p) = h_r^k(p),  r = 3 or 4.   (3.68)

Additionally, the individual components of the transformation vector function will be needed. Thus let h(p) be defined by

    h(p) = (h_1(p), h_2(p), ..., h_n(p)).   (3.69)

Figure 3.1. Flowchart of a portion of the Scalar Search step of the VO algorithm.

Therefore, since x^{k+1} = h(p_k), the ith component of all the possible points that may become x^{k+1} is given by

    x_i^{k+1} = h_i(p).   (3.70)

This time, maximizing ||h(p) - x^k||, as defined in (3.60), may not be simply achieved by increasing p, as it is when the second-order transformation is selected. In particular, since each coordinate is given by (3.70), there may be certain values of p for which the square of the difference

    (x_i^{k+1} - x_i^k)^2 = (h_i(p) - x_i^k)^2   (3.71)

is a maximum. This would certainly tend to satisfy the principle that x^{k+1} should be as far as possible from x^k. A necessary condition to maximize (3.71), which would tend to satisfy (3.60a), is given by differentiating (3.71) with respect to p and setting the result equal to zero, to obtain the equation

    h_i'(p) = 0.   (3.72)

This equation is a linear equation in p for the third-order transformation, and a quadratic polynomial in p for the fourth-order transformation. Therefore, its zeros may be easily found.
If any of these zeros are positive, the ith coordinate moves away from x_i^k until, at a value of p equal to a positive zero of (3.72), it begins to move closer to x_i^k again. Therefore, positive zeros are suitable candidates to satisfy (3.60a). It is proposed that these zeros be computed for all coordinates using (3.72), and that any which are not positive be discarded. These zeros will be considered trial values of p later.

Additional trial values of p are obtained by the following argument. After expansion of f(x) in a Taylor series about x^k and substitution of x = h(p), the following expression is obtained:

    f(h(p)) = f(x^k) + f'(x^k)^T [h(p) - x^k] + ... .   (3.73)

Since it is desired to compute a p which yields f(h(p)) sufficiently less than f(x^k), the term

    f'(x^k)^T [h(p) - x^k]   (3.74)

should be as negative as possible. Therefore, values of p for which (3.74) may achieve a minimum value are points that are easily computed. The necessary condition yields

    f'(x^k)^T h'(p) = 0,   (3.75)

which is a polynomial in p with zeros that may yield additional trial values of p, if any of the zeros are positive.

Before describing how these trial values of p are used in approximating a solution of (3.60), the c_k appearing in (3.60b) needs to be defined. Recall that f(h(1)) is already evaluated and is less than f(x^k). The constant c_k is defined such that a value of p may be used provided it does not yield a function value too much greater than f(h(1)). This is accomplished by defining c_k by

    c_k = min {10 f(h(1)) - f(x^k), .1 w},  f(h(1)) >= 0,
    c_k = min {.1 f(h(1)) - f(x^k), .1 w},  f(h(1)) < 0,   (3.76)

where w = f(h(1)) - f(x^k). The constraint (3.60b) is now defined, and the zeros previously found are candidates to satisfy it. It was found experimentally that a value of p greater than six never satisfied (3.60b). In addition, since f(h(1)) is less than f(x^k), only values of p in the range 1 < p <= 6 are considered (note that noninteger values are used).
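The construction of trial values from (3.72) and (3.75) can be sketched as follows, assuming the third- and fourth-order trajectory forms derived in Section 3.3. This Python/NumPy fragment (function name and interface hypothetical) collects the positive real roots, restricts them to 1 < p <= 6, and sorts them largest first.

```python
import numpy as np

def trial_p_values(grad, d2, d3, d4=None):
    """Trial values of p: positive real roots of h_i'(p) = 0  (3.72)
    for every coordinate i, and of grad . h'(p) = 0  (3.75),
    kept only in the range 1 < p <= 6 and sorted largest first."""
    if d4 is None:
        # third order: h(p) = x - (3/2) d2 p - (d3 - d2/2) p^2,
        # so h'(p) = -2 (d3 - d2/2) p - (3/2) d2   (linear in p)
        rows = np.stack([-2.0 * (d3 - 0.5 * d2), -1.5 * d2], axis=1)
    else:
        # fourth order: h'(p) is quadratic in p
        rows = np.stack([-3.0 * (d4 - d3 + d2 / 6.0),
                         -2.0 * (2.0 * d3 - d2),
                         -11.0 / 6.0 * d2], axis=1)
    candidates = []
    for coeffs in rows:                       # roots of each h_i'(p) = 0
        candidates.extend(np.roots(coeffs))
    candidates.extend(np.roots(grad @ rows))  # roots of grad . h'(p) = 0
    keep = {round(float(r.real), 10) for r in candidates
            if abs(r.imag) < 1e-10 and 1.0 < r.real <= 6.0}
    return sorted(keep, reverse=True)
```

The returned list is then consumed largest-first, evaluating f(h(p)) until (3.60b) is met, as described in the next paragraph of the text.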
All the zeros previously obtained from (3.72) and (3.75) within the above range are sorted. Then, beginning from the largest value and proceeding to the smallest one, the function is evaluated, and as soon as (3.60b) is satisfied, the scalar search is complete. In case (3.72) and (3.75) yield no trial values of p in the range 1 < p <= 6, the function is evaluated at f(h(p)), for p = 2, 3, ..., p_k, p_k + 1, until f(h(p_k)) satisfies (3.60b) and f(h(p_k + 1)) does not. For all the functions tested, in most iterations (3.72) and (3.75) yielded trial values of p. Furthermore, in most iterations only one additional function evaluation was needed to end the search.

3.5 Hessian Singular or Negative Definite

In computing the second-, third-, and fourth-order corrections given by (3.39), (3.44), and (3.48), there are two major difficulties to be considered concerning the hessian matrix: it may be singular, and it may be negative definite. If the hessian is singular, the proposed corrections cannot be computed. If the hessian is negative definite, the current point x^k is not in some small neighborhood of a strict local minimum. Furthermore, if the hessian is not positive definite, the proposed transformations may not give descent trajectories. Recall that if

    f'(x^k)^T h_r^k'(0) < 0,   (3.77)

then the existence of a p > 0 which satisfies

    f(h_r^k(p)) < f(x^k)   (3.78)

is implied, as shown in Lemma 3.1. Observe that

    h_r^k'(0) = -a_r d_2^k,  r = 2, 3, or 4,   (3.79)

where, from (3.53), a_2 = 1, a_3 = 3/2, and a_4 = 11/6. Therefore, the descent condition (3.77) becomes

    -a_r f'(x^k)^T [f''(x^k)]^{-1} f'(x^k) < 0,   (3.80)

where (3.39), which defines d_2^k, was used. The inequality (3.80) may not necessarily be satisfied whenever f''(x^k) is not positive definite. Moreover, even if (3.80), and thus (3.78), is satisfied when the hessian is not positive definite, the descent trajectory may still be undesirable.
Recall that the transformation functions were derived to compute a point which would yield a zero gradient. While the gradient is zero at a local minimum, it is also zero at a local maximum and at a saddle point [38]. Therefore, a descent trajectory may be towards a saddle point. Saddle points are even more difficult, since the transformation functions may yield trajectories towards a saddle point even when the hessian is positive definite. This difficulty will be discussed again when the global convergence of the algorithm is established in the next section. Thus singularity and nonpositive definiteness of the hessian are signals to be used in avoiding these troublesome points.

Since the hessian inverse is used in Newton's minimization algorithm [34], several alternatives have been proposed for the case when the hessian is not positive definite [30,34]. The method we propose is a modification of the one given by Murray [36] for a Newton-like minimization algorithm. The principle of Murray's method may be described as the computation of the Newton or second-order correction, d_2^k, by solving the linear system of equations

    F^k d_2^k = f'(x^k)   (3.81a)

with

    F^k = f''(x^k) + D^k,   (3.81b)

where D^k is a diagonal matrix which is computed to insure that F^k is positive definite. If the hessian f''(x^k) is already positive definite, the matrix D^k is the zero matrix, and d_2^k is the second-order correction as defined earlier. Observe that in the VO algorithm the approximations to the third- and the fourth-order corrections, (3.44) and (3.48), may also be defined as solutions of linear systems of equations with coefficient matrices equal to F^k. Murray's procedure for computing D^k is based on the Cholesky factorization of a positive definite matrix.
The Cholesky factorization may be described as the computation of the upper triangular matrix U such that

    F^k = U^T U.   (3.82)

A byproduct of this factorization is the diagonal matrix D^k. The modification we propose adds a pivoting strategy to this factorization procedure. The result of this modification is that the diagonal matrix D^k will tend to have fewer nonzero diagonals than with the original procedure. Once the factorization is computed, the computation of all the high-order corrections is simply obtained, as shown later. The details of Murray's procedure are given next, followed by the details of the proposed modification.

3.5.1 Murray's Procedure

Equating matrix elements in (3.82) yields the ith row of U, given by

    u_ii = { [F^k]_ii - sum_{m=1}^{i-1} u_mi^2 }^{1/2},   (3.83a)
    u_ij = { [F^k]_ij - sum_{m=1}^{i-1} u_mi u_mj } / u_ii,  j = i+1, ..., n.   (3.83b)

It can be shown [53-54] that if F^k is positive definite, all diagonal elements given by (3.83a) are greater than zero, and all the elements of U are bounded by

    0 < |u_ij| <= max_i {[F^k]_ii}^{1/2}.   (3.84)

The procedure due to Murray is, in effect, to obtain D^k in (3.81) such that all diagonal elements in (3.83a) are bounded by

    delta <= u_ii <= beta,   (3.85)

where beta may be defined by

    beta = max { |[f''(x^k)]_ij|^{1/2}, i, j = 1, ..., n },   (3.86)

and delta > 0 is a given constant which is used below, in effect, to identify the square root of a numerical zero due to roundoff errors (delta = 10^{-s/2} gave good results, where s is the number of significant digits of the numbers in the computer). The off-diagonal elements are also bounded by

    |u_ij| <= beta,  i = 1, ..., n-1; j > i.   (3.87)

The ith stage of the procedure may be described as follows. Define the quantity

    u_bar_i = max { delta, | [f''(x^k)]_ii - sum_{m=1}^{i-1} u_mi^2 |^{1/2} },   (3.88)

which will be considered as the candidate for the diagonal u_ii, and

    u_bar_j = [f''(x^k)]_ij - sum_{m=1}^{i-1} u_mi u_mj,  j = i+1, ..., n.   (3.89)

Observe that u_ij = u_bar_j / u_ii, j = i+1, ..., n, will be the off-diagonal elements of the ith row once u_ii has been computed.
If the relation given by

    (1/u_bar_i) max { |u_bar_j|, j = i+1, ..., n } <= beta   (3.90)

is satisfied, then set

    u_ii = u_bar_i;   (3.91)

otherwise set

    u_ii = (1/beta) max { |u_bar_j|, j = i+1, ..., n }.   (3.92)

Then the rest of the ith row is given by

    u_ij = u_bar_j / u_ii,  j = i+1, ..., n.   (3.93)

The ith diagonal of the matrix D^k is given by

    d_ii = u_ii^2 - [f''(x^k)]_ii + sum_{m=1}^{i-1} u_mi^2.   (3.94)

It can be shown [36], in a straightforward manner, that the bounds (3.85) and (3.87) apply to this procedure.

3.5.2 Proposed Modification to Murray's Procedure

The modification to Murray's procedure is motivated by a desire that the number of nonzero elements in D^k be as few as possible, in order to use as much of the hessian as possible. If some form of diagonal pivoting is added to Murray's procedure, not only will the number of nonzero diagonals of D^k tend to be small, but numerical stability may also be gained. Therefore, it is proposed that at each stage of the factorization procedure the strongest diagonal be selected, where the strongest diagonal is defined as the diagonal which generates the smallest, in absolute value, maximum off-diagonal element in its row. An efficient implementation of this modification is described next.

First, recognize that the Cholesky factorization of F^k with diagonal pivoting may be described as the computation of the upper triangular matrix U such that

    P F^k P^T = U^T U,   (3.95)

where P is a permutation matrix with columns equal to a permutation of the columns of the unit diagonal matrix. The procedure begins by initializing the elements of U as follows:

    u_ij^(0) = [f''(x^k)]_ij,  i = 1, ..., n; j = i, ..., n.   (3.96)

The elements of U are then modified iteratively. To describe the ith stage of the procedure, we begin by defining the set

    {e_k} = { max ( |u_jk^(i-1)|, j = i, ..., k-1; |u_kj^(i-1)|, j = k+1, ..., n ), k = i, ..., n },   (3.97)

which contains the maximum absolute value of the off-diagonal elements in each row not yet processed.
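Before turning to the pivoted variant, the unpivoted procedure of Section 3.5.1, equations (3.83)-(3.94), together with the triangular solves that Section 3.5.4 builds on, can be sketched in Python. This is an illustrative rendering (the dissertation's implementation is in FORTRAN); `A` stands for the hessian f''(x^k), and the pivoted variant would additionally carry the permutation P of (3.95).

```python
import math

def murray_factor(A, delta):
    """Murray's modified Cholesky: returns (U, D) with
    A + diag(D) = U^T U, U upper triangular, diagonal of U >= delta."""
    n = len(A)
    beta = max(math.sqrt(abs(A[i][j])) for i in range(n) for j in range(n))
    U = [[0.0] * n for _ in range(n)]
    D = [0.0] * n
    for i in range(n):
        s = A[i][i] - sum(U[m][i] ** 2 for m in range(i))
        u_bar = max(delta, math.sqrt(abs(s)))                    # (3.88)
        off = [A[i][j] - sum(U[m][i] * U[m][j] for m in range(i))
               for j in range(i + 1, n)]                         # (3.89)
        big = max((abs(v) for v in off), default=0.0)
        uii = u_bar if big / u_bar <= beta else big / beta       # (3.90)-(3.92)
        U[i][i] = uii
        for j, v in zip(range(i + 1, n), off):
            U[i][j] = v / uii                                    # (3.93)
        D[i] = uii ** 2 - s                                      # (3.94)
    return U, D

def solve_utu(U, rhs):
    """Solve U^T U d = rhs: forward substitution for U^T w = rhs,
    then back substitution for U d = w (cf. (3.108)-(3.110))."""
    n = len(rhs)
    w = [0.0] * n
    for i in range(n):
        w[i] = (rhs[i] - sum(U[m][i] * w[m] for m in range(i))) / U[i][i]
    d = [0.0] * n
    for i in reversed(range(n)):
        d[i] = (w[i] - sum(U[i][j] * d[j] for j in range(i + 1, n))) / U[i][i]
    return d
```

On the three-dimensional hessian of the example in Section 3.5.3 (with delta = 10^{-7}), this reproduces D = diag(.25, 4, 800).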
The next diagonal to be selected for pivoting is based on the following sequential tests:

1) If i = n, the set {e_k} is empty. Select the remaining diagonal.
2) Otherwise, if any element of {e_k} is zero, select the diagonal corresponding to the first such zero element. This is a row where all off-diagonal elements are zero.
3) Otherwise, if the set { e_k / |u_kk^(i-1)| : u_kk^(i-1) != 0, k = i, ..., n } is not empty, select the diagonal corresponding to its smallest element (the first one if ties exist).
4) Otherwise, select the diagonal corresponding to the smallest element of the set {e_k} (the first one if ties exist). This choice occurs when all remaining diagonal elements are zero.

The appropriate interchange of rows and columns is done next in order to bring the selected diagonal into the ith row. This interchange is noted in the permutation matrix P of (3.95). Now define

    u_bar_i = max { delta, |u_ii^(i-1)|^{1/2} }.   (3.98)

If

    (1/u_bar_i) max { |u_ij^(i-1)|, j = i+1, ..., n } <= beta,   (3.99)

set

    u_ii^(i) = u_bar_i;   (3.100)

otherwise set

    u_ii^(i) = (1/beta) max { |u_ij^(i-1)|, j = i+1, ..., n }.   (3.101)

The ith diagonal of the permuted matrix D^k is given by

    d_ii = (u_ii^(i))^2 - u_ii^(i-1).   (3.102)

The rest of the row becomes

    u_ij^(i) = u_ij^(i-1) / u_ii^(i),  j = i+1, ..., n.   (3.103)

The rest of the matrix is updated as follows:

    u_kj^(i) = u_kj^(i-1) - u_ik^(i) u_ij^(i),  k = i+1, ..., n; j = k, ..., n.   (3.104)

This completes the ith stage of the procedure. The pivoting strategy proposed insures that the remaining matrix is changed by a small amount, since the maximum absolute value of the change to the remaining matrix in (3.104) is minimized. Observe that double precision is recommended to store the matrix U in the modified procedure, since inner products can no longer be efficiently accumulated [53-54] as is possible in the original method.

3.5.3 Illustrative Example

The following example illustrates the effect of pivoting. Let f''(x^k) for a three-dimensional problem be given by

    f''(x^k) = [  0   1   10
                  1   4    0
                 10   0  400 ].

Let delta = 10^{-7}, and for this matrix beta = 20 from (3.86).
Without pivoting, the method proposed by Murray yields D^k given by

    D^k = diag (.25, 4, 800)

and

    U = [ .5   2   20
           0   2  -20
           0   0   20 ].

The proposed modification, with a diagonal pivot order 3, 2, 1, yields

    D^k = diag (1, 0, 0)

and

    U = [ 20   0   .5
           0   2   .5
           0   0  (.5)^{1/2} ].

3.5.4 Computation of High-Order Corrections

After the factorization (3.95) is completed, the high-order corrections are computed as follows. Instead of (3.39), we may now write

    [f''(x^k) + D^k] d_2^k = f'(x^k).   (3.105)

Define v by

    d_2^k = P^T v,   (3.106)

where P^T is the transpose of the permutation matrix used in the factorization procedure. Now multiply (3.105) by P, and use (3.106) to obtain

    P [f''(x^k) + D^k] P^T v = P f'(x^k).   (3.107)

From (3.81) and (3.95), the factorization process transforms (3.107) into

    U^T U v = P f'(x^k),   (3.108)

which may be solved by first solving

    U^T w = P f'(x^k)   (3.109)

for w by forward substitution, and then solving

    U v = w   (3.110)

for v by back substitution. The second-order correction is then obtained from (3.106). Similarly, the approximation to the third-order correction given in (3.44) now becomes the solution of

    [f''(x^k) + D^k] d_3^k = f'(h_2^k(1)),   (3.111)

and the fourth-order correction given by (3.48) now becomes the solution of

    [f''(x^k) + D^k] d_4^k = f'(h_3^k(1)).   (3.112)

These two systems of equations have the same coefficient matrix as (3.105), and therefore the same factorization applies. The solutions of (3.111) and (3.112) are obtained similarly to the procedure outlined for solving (3.105).

3.6 Convergence of the VO Algorithm

Two convergence properties of the VO algorithm are given in this section. First, we establish the class of functions for which the algorithm is globally convergent. Second, it will be shown that the VO algorithm generates a sequence with a high order of convergence for most functions.
When an algorithm generates sequences with a high order of convergence, an approximation to a solution of the minimization problem (3.2) can be computed in a small number of iterations, provided the initial point is close to the solution.

3.6.1 Global Convergence

The global convergence of the VO algorithm will be established by using the general analysis of algorithms developed mainly by Zangwill [56]. A brief review of this analysis is given below. The new algorithm is then recast in a manner which allows the results of this analysis to be used.

A minimization algorithm may be generally described by a point-to-set mapping. A point-to-set mapping assigns to every point x in E^n a subset of E^n. Let A be a point-to-set mapping; then A(x) may be represented by

    A(x) = {y in E^n},   (3.113)

where the definition of the elements constitutes part of the algorithm. The sequence of points {x^k} generated by the algorithm is given by

    x^{k+1} in A(x^k),

beginning from some initial point x^0. The point selected from the set A(x^k) at each iteration is also part of the details of the algorithm. It is clear that the sequence {x^k} cannot be predicted solely from knowledge of the initial point x^0. As the scalar search of the VO algorithm demonstrates, similar algorithms using the same transformation functions could implement the scalar search somewhat differently. This difference may be enough to generate different sequences, but as the Global Convergence Theorem will show, the different sequences may still converge. Thus the point-to-set mapping concept aids in analyzing classes of algorithms without describing their steps in detail.

An important property of point-to-set mappings, which is required later on, is that they may be closed. A point-to-set mapping A is said to be closed at x if the assumptions

1) x^k -> x,
2) y^k -> y, y^k in A(x^k),

imply

3) y in A(x).

The closedness property is a generalization of the continuity of point-to-point mappings, or ordinary functions.
The main result due to Zangwill may now be given.

Global Convergence Theorem. Let the point-to-set map A(x) be an algorithm on E^n, and suppose that, given x^0, the sequence {x^k} is generated satisfying

    x^{k+1} in A(x^k).

Let Omega be a subset of E^n defined as the set of solution points of the minimization problem (3.2), and suppose

1) All points x^k are in a compact set.
2) The function f(x) is continuous, and
   a) if x is not in Omega, then f(y) < f(x) for all y in A(x);
   b) if x is in Omega, then either the algorithm terminates, or, for all y in A(x), f(y) <= f(x).
3) The map A is closed at points outside Omega.

Then either the algorithm stops at a solution point in Omega, or the limit of any convergent subsequence of {x^k} is a solution point in Omega.

The proof may be found in [56,34]. Condition 1 of the theorem insures the existence of a convergent subsequence. Its violation normally indicates that the minimization problem has no finite solution, and thus this condition is not very restrictive. Condition 2 is normally satisfied by a suitable transformation function and a scalar search, using the terminology of the VO algorithm. Condition 3 is usually the most challenging. For the new algorithm, the satisfaction of this condition imposes continuity requirements on the function and its first two derivatives, as well as the additional condition of pseudoconvexity.

The VO algorithm will be described as a point-to-set composition mapping given by

    A(x) = S_r(M_r(x)),

where M_r is a point-to-point map and S_r is a point-to-set map. The following lemma [56] establishes the conditions on each mapping that yield a closed composition.

Lemma 3.2. Let M: E^n -> E^m be a point-to-point map, and let S: E^m -> {y, a subset of E^n} be a point-to-set map. Assume M is continuous at x and S is closed at M(x). Then the point-to-set map A(x) = S(M(x)) is closed at x.
The point-to-point mapping M_r: E^n -> E^{rn}, which characterizes the transformation phase of the VO algorithm, may be described by

    M_r(x) = (x, d(r)),  r = 2, 3, or 4,   (3.114a)

where d(r) denotes the sets of correction terms given by

    d(r) = (d_2),            r = 2,
    d(r) = (d_2, d_3),       r = 3,   (3.114b)
    d(r) = (d_2, d_3, d_4),  r = 4,

and the corrections d_2, d_3, and d_4 are the solutions of

    [f''(x) + D] d_2 = f'(x),   (3.114c)
    [f''(x) + D] d_3 = f'(x - d_2),   (3.114d)

and

    [f''(x) + D] d_4 = f'(x - d_2 - d_3).   (3.114e)

(Note that the diagonal matrix D has been included, as discussed in Section 3.5.4.) In order to make use of Lemma 3.2, we need to establish that M_r is continuous.

Lemma 3.3. If the gradient and the hessian of f(x) are continuous, the mapping M_r(x) given in (3.114) is continuous.

The proof is immediate, since the diagonal matrix D is computed to insure that [f''(x) + D] is nonsingular, and the diagonals of D are continuous functions of the elements of the hessian f''(x), as can be readily seen in (3.98)-(3.102).

Now consider the point-to-set mapping S_r: E^{rn} -> {y, a subset of E^n}, which operates on M_r(x), characterizes the scalar search phase of the VO algorithm, and may be conveniently expressed as follows:

    S_r(x, d(r)) = { y : y = h_r(p) for p >= 0 and f(h_r(p)) < f(x) },  r = 2, 3, or 4,   (3.115a)

where

    h_r(p) = x - d_2 p,  r = 2,
    h_r(p) = x - (3/2) d_2 p - [d_3 - (1/2) d_2] p^2,  r = 3,   (3.115b)
    h_r(p) = x - (11/6) d_2 p - (2 d_3 - d_2) p^2 - [d_4 - d_3 + (1/6) d_2] p^3,  r = 4.

Observe that (3.115b) consists of the three transformation functions derived in Section 3.3. The following lemma establishes conditions that are sufficient for S_r to be closed.

Lemma 3.4. If f'(x) != 0 and f(x) is continuous, the mapping S_r given in (3.115) is closed.

Proof: Recall that in order to show that S_r is closed, the conditions

1) (x^k, d^k(r)) -> (x, d(r)),
2) y^k -> y, y^k in S_r(x^k, d^k(r)),
must imply

3) y in S_r(x, d(r)).

Suppose first that r = 2. Then

    y^k = x^k - p_k d_2^k.   (3.116)

The assumption f'(x) != 0 implies that d_2^k != 0 for all k, from (3.114c). Thus one may write from (3.116)

    p_k = ||x^k - y^k|| / ||d_2^k||,

which, upon taking limits, yields

    p = ||x - y|| / ||d_2||.

This establishes the existence of a limit for the sequence {p_k}. It then follows that

    y = x - p d_2.

It remains to be shown that y is in S_r(x, d_2). For each k, p_k satisfies

    f(y^k) = f(x^k - p_k d_2^k) < f(x^k).   (3.117)

That such a p_k exists follows from the fact that

    f'(x^k)^T h_2^k'(0) = -f'(x^k)^T d_2^k < 0   (3.118)

and from Lemma 3.1. Taking limits in (3.117), and using the assumption that f(x) is continuous, yields

    f(y) <= f(x),   (3.119)

and hence y is in S_2(x, d_2).

Now consider the generalization of the preceding proof to any r. For each k, one may write

    y^k = h_r^k(p_k) = x^k + h_r^k'(t p_k) p_k,  t in (0, 1),   (3.120)

using the Mean Value Theorem [45]. The assumption f'(x) != 0 implies that

    f(y^k) = f(h_r^k(p_k)) < f(x^k),   (3.121)

since

    f'(x^k)^T h_r^k'(0) < 0,

and Lemma 3.1 implies the existence of a p_k which satisfies (3.121). From the Mean Value Theorem and (3.120),

    f(y^k) = f(x^k + h_r^k'(t p_k) p_k) = f(x^k) + f'(x^k + t_bar h_r^k'(t p_k) p_k)^T h_r^k'(t p_k) p_k

for some t_bar in (0, 1). The above expression, together with (3.121), implies

    h_r^k'(t p_k) != 0

in (3.120).
Thus taking limits in (3.120) establishes the existence of a p given by

    p = ||y - x|| / ||h_r'(t p)||.

Thus for each k,

    f(y^k) = f(x^k + h_r^k'(t p_k) p_k) < f(x^k),

and after taking limits, and using the continuity of f(x), we obtain

    f(y) <= f(x).

Hence y is in S_r(x, d(r)). This completes the proof.

The VO algorithm may now be given as the composition of the two mappings M_r(x), given in (3.114), and S_r(x, d(r)), given in (3.115), to be

    A(x) = S_r(M_r(x)).   (3.122)

By Lemmas 3.2, 3.3, and 3.4, the VO algorithm is closed at x if both f'(x) != 0 and f(x) has continuous first and second derivatives. This implies, using the Global Convergence Theorem, that if

1) all x^{k+1} in A(x^k) are in a compact set,
2) the gradient f'(x) != 0, except at a solution of the minimization problem (3.2), and
3) the function f(x) is continuously twice differentiable,

then the VO algorithm generates a sequence which converges to a solution, or it stops at a solution. Conditions 1 and 3 may not be difficult to achieve in practice. However, condition 2 implies that the algorithm is not closed at a local maximum or at a saddle point of f(x), since f'(x) is zero at these points. Thus, theoretically, f(x) must not have any such points; a function not having these points is defined to be pseudoconvex [35]. In practice, the algorithm should generate convergent sequences if the function is pseudoconvex in the region including the desired solution and the initial point. Moreover, experimental evidence on one tested problem indicates that the algorithm is superior to others in avoiding these troublesome points even when they exist in the region of interest.

The VO algorithm may be trivially modified to prevent convergence to a strict local maximum. If x^k is a local maximum, the present algorithm will fail in the sense that f'(x^k) = 0 will cause the algorithm to stop. At this point, however, f''(x^k) will be negative semidefinite [56], which is indicated in the VO algorithm by D^k != 0 in the modified Cholesky factorization presented in Section 3.5. However, D^k != 0 may also occur when the hessian is positive semidefinite or indefinite. Thus D^k != 0 is an indication of potential problems. Therefore, if D^k != 0 at the point at which the algorithm stops, coordinate searches may be undertaken to ascertain whether the expressions

    f(x^k + t e_i) < f(x^k),  i = 1, ..., n,

are satisfied for some small value of t. If x^k is a strict local maximum, the above procedure should indicate so, and x^k may then be perturbed to start the algorithm again.
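The coordinate probe just described can be sketched as follows. This Python fragment is a hypothetical rendering (the function name is ours, and it probes both signs of each coordinate step for robustness, a detail the text leaves to "some small value of t").

```python
import numpy as np

def looks_like_strict_local_max(f, x, t=1e-4):
    """Coordinate probe: at a strict local maximum every small
    coordinate step decreases f, so finding f(x +/- t e_i) < f(x)
    for all i flags x as a suspect point to perturb and restart from."""
    fx = f(x)
    n = len(x)
    for i in range(n):
        e = np.zeros(n)
        e[i] = t
        if not (f(x + e) < fx and f(x - e) < fx):
            return False
    return True
```

For example, f(x) = -||x||^2 has a strict maximum at the origin and is flagged, while f(x) = ||x||^2 is not.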
The previous modification may not work if x^k is a saddle point, as the example in Section 3.1 shows. It must be noted, however, that numerical experiments will be given in the next chapter which show that the VO algorithm appears to be highly effective in avoiding convergence to saddle points. For one tested problem with a saddle point, three existing algorithms converged to the saddle point when the initial guess was close to the saddle point, while the VO algorithm converged to the correct solution from the same initial point.

3.6.2 Order of Convergence

The order of convergence of the VO algorithm can be established from published results dating back to Schroder in 1870, who was the first to define the concept of order of convergence [47-48]. In this section, however, the more modern results due to Ortega and Rheinboldt [38] and Traub [50] will be used. Two results are given: one applies to the convergence of the algorithm to local minima with a positive definite hessian, and the other to local minima with a positive semidefinite hessian.

Before we can use the existing results, it must be noted that the VO algorithm becomes the m-step method

    x^{k,0} = x^k,
    x^{k,i} = x^{k,i-1} - [f''(x^k)]^{-1} f'(x^{k,i-1}),  i = 1, ..., m,   (3.123)
    x^{k+1} = x^{k,m},

when x^k is in some neighborhood of a local minimum with a positive definite hessian (i.e., the scalar parameter p is one). This method was proposed and studied by Traub [50]. It has also been used by many researchers, as it can be thought of as Newton's method for solving a system of equations without updating the coefficient matrix at every iteration. The following theorem establishes that the method converges with order m + 1.

Theorem 3.1. Let f(x) have continuous gradients and hessians, with

    ||f''(x) - f''(x^*)|| <= c ||x - x^*||,  c > 0,

for all x in a neighborhood of x^*. Assuming f'(x^*) = 0 and that [f''(x^*)]^{-1} exists, the order of convergence of (3.123) is m + 1.

The proof may be found in Ortega and Rheinboldt [38] and Traub [50].
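A one-dimensional illustration of the m-step iteration (3.123): the second derivative is evaluated (in general, factored) once per outer iteration and reused for m Newton corrections. This is a sketch for intuition; the test function f(x) = x^2 + e^x is our own choice, not one from the dissertation.

```python
import math

def m_step_newton(grad, hess, x0, m, outer_iters):
    """The m-step method (3.123) in one dimension: hess is evaluated
    once per outer iteration; grad is re-evaluated at every inner step."""
    x = x0
    for _ in range(outer_iters):
        h = hess(x)                  # "coefficient matrix", held fixed
        for _ in range(m):
            x = x - grad(x) / h      # m chord-Newton corrections
    return x

# Minimize f(x) = x^2 + exp(x):  f'(x) = 2x + e^x,  f''(x) = 2 + e^x > 0
grad = lambda x: 2.0 * x + math.exp(x)
hess = lambda x: 2.0 + math.exp(x)
x_min = m_step_newton(grad, hess, 0.0, m=3, outer_iters=4)
```

Here f'' is positive everywhere, matching the positive-definite-hessian setting of Theorem 3.1, and the iterate drives f'(x) to roundoff level in a handful of outer iterations.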
Thus the VO algorithm converges with order r, where r is the transformation function order, whenever f''(x^*) is positive definite.

If f''(x^*) is positive semidefinite, [f''(x^*)]^{-1} does not exist, and therefore the preceding theorem does not apply. In this case the convergence is in general linear (order equal to one), as the following argument shows. In this case the algorithm may be described by the iteration

    x^{k+1} = G(x^k) = x^k - p_k [f''(x^k) + D^k]^{-1} [c f'(x^k) + g(p_k, x^k)],   (3.124)

where the constant c and the function g depend on the transformation function order selected, and where the diagonal matrix D is not zero at x = x^*. The iteration function is thus given by

    G(x) = x - p [f''(x) + D]^{-1} [c f'(x) + g(p, x)].

The derivative of the iteration function evaluated at x = x^* is given by

    G'(x^*) = I - p [f''(x^*) + D]^{-1} [c f''(x^*) + g'(p, x^*)].

This matrix will not be zero for any value of p in general, since D is not zero. Thus convergence cannot be higher than linear as long as G'(x^*) != 0 [38].

It should be noted that while convergence is in general linear whenever f''(x^*) is positive semidefinite, local minima having this property may be picturesquely described as being flat. This implies that for all x in a fairly large neighborhood of x^*, ||f'(x)|| is very small. Thus, for practical reasons, it is normally unnecessary to compute the local minimum with great accuracy in these cases. Two of the problems selected to test the VO algorithm have a local minimum with a positive semidefinite hessian. The new algorithm was more efficient in computing an approximation to their local minimum than several published algorithms.

3.7 Summary

The major contribution of this chapter is the derivation of a new algorithm for finding a local minimum of an unconstrained nonlinear function.
The new algorithm, called the Variable-Order (VO) algorithm, has two properties that no existing minimization algorithm has. The first new property is the order of convergence. While all existing algorithms converge with order less than or equal to two, the new algorithm converges with variable order as high as four. The second new property is the scalar search step of the algorithm. In contrast with previous algorithms, whose scalar searches are along a straight line in the space of the independent variables, the VO algorithm may have scalar searches along curved trajectories. The VO algorithm was shown to be globally convergent for pseudoconvex functions with continuous first and second derivatives. The order of convergence was also established to be from two to four for functions with a positive definite hessian at the local minimum being computed. If the hessian is positive semidefinite at the local minimum being computed, the convergence was shown to be linear.

CHAPTER 4
IMPLEMENTATION OF THE VARIABLE-ORDER ALGORITHM

In this chapter we consider the practical aspects of the implementation of the Variable-Order (VO) algorithm for nonlinear circuit optimization problems. These practical considerations lead to guidelines and to some modifications of the algorithm in order to solve the general nonlinear programming problem given by

    minimize f(x)                                                (4.1a)
        x

subject to a set of nonlinear inequality constraints

    q(x) <= 0                                                    (4.1b)

and a set of "box" constraints given by

    x^L <= x <= x^H.                                             (4.1c)

Observe that any equality constraint may be included in (4.1b) as two inequality constraints [20].

In the first section, guidelines are given for handling the nonlinear inequalities by the use of penalty functions. In circuit optimization, the nonlinear inequalities are "loose," i.e., they can be relaxed to some degree. Therefore the penalty function approach, which in effect relaxes the constraints, is an ideal technique for circuit optimization applications.
In the second section, it is shown how the VO algorithm handles the box constraints. Unlike the nonlinear inequality constraints, the box constraints must be satisfied.

The VO algorithm, as described in the last chapter, requires that a subroutine be written which supplies the value of the function, the gradient, and the hessian at each point generated by the algorithm. While writing such a subroutine may not be difficult for some problems, the VO algorithm would be more useful if the hessian could be approximated when it is difficult to write a subroutine which supplies the hessian values. Sometimes it may be just as difficult to supply even the gradient, in which case both the gradient and the hessian must be approximated. In the third section we consider approximations to the hessian and the gradient when these values are not supplied.

The fourth section considers the case when the function and the gradient values supplied to the VO algorithm contain errors. In nonlinear circuit optimization applications, the subroutine that supplies the function and the gradient values may actually be a complex computer program which includes the solution of a system of nonlinear algebraic and differential equations. The function and the gradient values depend on the solutions of these equations; thus, due to the numerical techniques used, errors may be present in the function and the gradient values. Numerical experiments will show that if any errors present in the function and the gradient values supplied to the VO algorithm can be estimated and kept small, the algorithm can still be effectively employed.

The fifth section details the steps in a FORTRAN IV implementation of the VO algorithm. Finally, the sixth section presents several numerical experiments and comparisons with other algorithms.
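The difference approximations referred to in the third section can be sketched as follows. This is a generic central-difference scheme, assuming only that function values can be evaluated; the step sizes h are illustrative choices, not the dissertation's.

```python
def fd_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient."""
    n = len(x)
    g = [0.0] * n
    for i in range(n):
        xp = list(x); xp[i] += h
        xm = list(x); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2.0 * h)
    return g

def fd_hessian(f, x, h=1e-4):
    """Central-difference approximation of the hessian; the same
    four-point formula handles both diagonal and off-diagonal entries."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shift(si, sj):
                y = list(x)
                y[i] += si * h
                y[j] += sj * h
                return f(y)
            H[i][j] = (shift(1, 1) - shift(1, -1)
                       - shift(-1, 1) + shift(-1, -1)) / (4.0 * h * h)
    return H

# Quadratic test function with known derivatives:
f = lambda v: v[0]**2 + 3.0 * v[0] * v[1] + 2.0 * v[1]**2
g = fd_gradient(f, [1.0, 1.0])  # exact gradient: (5, 7)
H = fd_hessian(f, [1.0, 1.0])   # exact hessian: [[2, 3], [3, 4]]
```

A practical caveat consistent with the fourth section: the differencing step h must balance truncation error (favoring small h) against the noise in the supplied function values (favoring large h).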
4.1 Nonlinear Inequality Constraints

This section gives guidelines to be used in solving the problem given by

    minimize f(x)                                                (4.2a)
        x

subject to

    q(x) <= 0                                                    (4.2b)

where q(x) is a vector function of inequality constraints. It is assumed that all the constraints are nonlinear.

The method recommended for finding a solution of (4.2) is to convert the constrained problem to an unconstrained one by using a penalty function method [20,34]. That is, define the problem

    minimize F(x) = f(x) + μ Q(x)                                (4.3)
        x

where Q(x) is a penalty function for the inequality constraints; the penalty function is defined such that it is zero whenever the point x satisfies all the constraints, and greater than zero whenever the point x violates any of the constraints. The constant μ is a positive scalar. For large μ, the minimum of (4.3) will tend to lie in a region where the constraints are almost satisfied. Thus, for increasing μ, the corresponding solution points of (4.3) approach a solution of (4.2) [34]. Therefore the penalty method converts the constrained problem (4.2) into an approximately equivalent unconstrained problem, or perhaps into a sequence of unconstrained problems, depending on how strictly the constraints are to be satisfied. This implies that unconstrained minimization algorithms, in particular the VO algorithm, may be used to approximate the solution of (4.3).

Two difficulties are inherent in converting problem (4.2) to problem (4.3), and in eventually solving problem (4.3). First, to ensure the global convergence of the VO algorithm, F(x) must be twice continuously differentiable; therefore penalty functions must be chosen accordingly. Second, problem (4.3) is very ill-conditioned for large values of the constant μ [34]. These considerations are explored in more detail next.
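The sequential-penalty strategy can be illustrated with a small hypothetical example: minimize (x - 2)^2 subject to x - 1 <= 0, whose constrained solution is x* = 1. The scalar minimizer below is a crude ternary search standing in for the VO algorithm, and the quadratic penalty term is an assumption for this sketch.

```python
def ternary_min(F, lo, hi, iters=200):
    """Crude scalar minimizer for a unimodal function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if F(m1) < F(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

f = lambda x: (x - 2.0) ** 2   # objective
q = lambda x: x - 1.0          # constraint q(x) <= 0

def penalized(mu):
    # F(x) = f(x) + mu * Q(x), with a quadratic penalty term
    return lambda x: f(x) + mu * max(0.0, q(x)) ** 2

# Increasing mu drives the unconstrained minimizer toward x* = 1.
sols = [ternary_min(penalized(mu), 0.0, 3.0)
        for mu in (1.0, 10.0, 100.0, 1000.0)]
```

For this example the penalized minimizer is (2 + μ)/(1 + μ), so the constraint is slightly violated for every finite μ and the violation shrinks as μ grows, which is exactly the behavior the text describes.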
A suitable penalty function Q(x) is given by

    Q(x) = sum_{i=1}^{n} w_i (max[0, q_i(x)])^3                  (4.4)

where n is the number of inequality constraints, and the constants w_i may be used to equalize the magnitudes of the constraints. The continuity conditions on F(x) are satisfied by (4.4) if the inequality constraints q_i(x), i = 1, ..., n, also satisfy them. In the literature, the most popular penalty function for inequality constraints is given by

    Q(x) = sum_{i=1}^{n} w_i (max[0, q_i(x)])^2                  (4.5)

This quadratic penalty function has a hessian which is discontinuous whenever an inequality constraint is zero [34]. However, it was found experimentally that for several problems tested, this quadratic penalty function produces an unconstrained problem which is solved by the VO algorithm more efficiently than the cubic penalty function (4.4), especially when the hessian is not supplied and is thus approximated by differences. Note that the VO algorithm is not guaranteed to be globally convergent for the quadratic penalty function (4.5) because of its discontinuous hessian.

The VO algorithm, like other existing algorithms [34], solves problem (4.3) for large μ with great difficulty, as examples will show. It was found that the best approach is to compute rough approximations to the solutions of a sequence of problems of the form (4.3) for increasing values of μ, tightening the desired accuracy of the solution for the last μ used. Thus it is recommended that initially a small value of μ be used when the VO algorithm is applied to problems such as (4.3).

4.2 Box Constraints

When the minimization problem has box constraints of the form

    x^L <= x <= x^H                                              (4.6)

the penalty function method just described can be used if (4.6) is rearranged into 2n inequality constraints. However, there are two reasons that compel the use of a different technique for handling the box constraints.
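The smoothness difference between (4.4) and (4.5) can be checked numerically on a single scalar constraint term (a hypothetical stand-alone check, taking w_i = 1): the second derivative of the quadratic term max(0, q)^2 jumps from 0 to 2 across q = 0, while that of the cubic term max(0, q)^3 tends to 0 from both sides.

```python
def quad_term(q):   # one term of the quadratic penalty (4.5)
    return max(0.0, q) ** 2

def cubic_term(q):  # one term of the cubic penalty (4.4)
    return max(0.0, q) ** 3

def d2(p, q, h=1e-4):
    """Central second-difference approximation of p''(q)."""
    return (p(q + h) - 2.0 * p(q) + p(q - h)) / (h * h)

eps = 1e-3
# Second derivative just inside vs. just outside the feasible region:
quad_jump  = d2(quad_term,  eps) - d2(quad_term,  -eps)   # stays near 2
cubic_jump = d2(cubic_term, eps) - d2(cubic_term, -eps)   # ~6*eps, vanishes as eps -> 0
```

The persistent jump of 2 in the quadratic case is the hessian discontinuity at an active constraint boundary noted above; the cubic penalty's second derivative is continuous there, which is why (4.4) preserves the twice-continuous-differentiability required for the global convergence of the VO algorithm.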
First, the penalty function method allows the violation of constraints, particularly when the multiplying constant μ is small. In circuit optimization procedures, the circuit equations may not have any solution if any of the box constraints are violated, and therefore the algorithm may fail if a box constraint is violated. Second, the box constraints are linear, and their effect can be handled very efficiently in a direct manner with some modifications to the VO algorithm.

The method proposed for handling these constraints is, in effect, to project the transformation function trajectories onto the active box constraints whenever the trajectories fall outside of the box constraints. Figure 4.1 illustrates the proposed technique. This projection can be implemented very efficiently, as will be shown.

A modification required in the implementation of the projection of the transformations is that whenever the trajectory is on a boundary, an accurate computation of the solution of the scalar minimization problem in the scalar search should be performed. The reason for this modification is that when transformation functions are projected, their theoretical properties may be different when the scalar parameter p in the transformation functions is set to one.

4.3 Hessian and Gradient Approximations

To this point the VO algorithm has been described in a way that requires supplying the function, the gradient, and the hessian of the function to be minimized at the points of the sequence {x^k} generated

Figure 4.1 Illustration of projection of trajectory onto box constraints. Box constraints are depicted by the rectangle. The trajectory is shown by the dashed curve to be outside of the box constraints over two intervals. The actual trajectory used is shown by the solid curve with arrows.
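The projection illustrated in Figure 4.1 amounts to component-wise clipping of each trajectory point onto the box. A minimal sketch (hypothetical variable names; the dissertation's actual implementation is in FORTRAN IV):

```python
def project_to_box(x, x_lo, x_hi):
    """Project a point onto the box x_lo <= x <= x_hi, component-wise."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, x_lo, x_hi)]

# A trajectory that leaves the unit box [0,1] x [0,1] over part of its length:
trajectory = [[0.2, 0.3], [0.8, 1.4], [1.3, 0.9], [0.6, -0.2]]
projected = [project_to_box(p, [0.0, 0.0], [1.0, 1.0]) for p in trajectory]
```

Points already inside the box are unchanged; points outside are moved onto the nearest face, i.e., onto the active box constraint, which is why the operation is so cheap compared with a penalty treatment of (4.6).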