A variable-order nonlinear programming algorithm for use in computer-aided circuit design and analysis

Material Information

Title: A variable-order nonlinear programming algorithm for use in computer-aided circuit design and analysis
Author: Jimenez, Alberto Jose, 1944-
Physical Description: xii, 282 leaves : ill. ; 28 cm.
Subjects / Keywords: Algorithms (lcsh); Nonlinear programming (lcsh); bibliography, theses, non-fiction (marcgt)
Notes: Thesis--University of Florida. Includes bibliographical references (leaves 276-281).
Statement of Responsibility: by Alberto José Jiménez.

Record Information

Source Institution: University of Florida
Rights Management: All applicable rights reserved by the source institution and holding location.
Resource Identifiers: aleph 000178214; notis AAU4719; oclc 03114162


Alberto Jose Jimenez





To Judy


The contributions and patience of the chairman of the committee, Dr. S.W. Director, are gratefully acknowledged. Without the grant from IBM, it would not have been possible for this researcher to afford the pursuit of this research; in particular, I thank P.C. Reder, of the General Systems Division of IBM at Boca Raton, Florida, for his unselfish help and support. Interactions and many lively discussions with graduate students D. Fraser, M. Lightner and W. Nye also contributed to this research. Finally, but by no means in a lesser manner, the support of my family, my wife, Judy, and our daughters, contributed in a very important way and with some great sacrifices throughout the entire period at the university.

I thank Adele Koehler for her self-evident superb typing of the manuscript.

I gratefully acknowledge the constructive criticisms of Dr. A.E. Durling, Dr. A. Paige, Dr. C.V. Shaffer, and Dr. A. Westerberg, who with Dr. S.W. Director constituted the examining committee.







ABSTRACT

2.1 Nonlinear Programming Circuit Problem
2.2 Derivation of the Gradient
2.3 Computational Flow
2.4 Review of Unconstrained Minimization

3.1 Properties of Minima
3.2 Principal Steps in a Minimization Algorithm
3.3 Variable-Order Transformations
3.3.1 Approximations of Higher-Order Corrections
3.3.2 Transformation Order Selection
3.4 The Scalar Search
3.4.1 Iterations Close to x
3.4.2 Iterations Far from x
3.5 Hessian Singular or Negative Definite



















3.5.1 Murray's Procedure
3.5.2 Proposed Modification to Murray's Procedure
3.5.3 Illustrative Example
3.5.4 Computation of High-Order Corrections ... 66
3.6 Convergence of the VO Algorithm ... 68
3.6.1 Global Convergence ... 68
3.6.2 Order of Convergence ... 78
3.7 Summary ... 81


4.1 Nonlinear Inequality Constraints ... 84
4.2 Box Constraints ... 86
4.3 Hessian and Gradient Approximations ... 87
4.3.1 Function and Gradient Values Supplied ... 89
4.3.2 Only Function Values Supplied ... 94
4.4 Supplied Function and Gradient Values with Errors ... 98
4.5 The Variable-Order Algorithm ... 102
4.6 Numerical Results ... 119
4.6.1 Rosenbrock's Problem ... 120
4.6.2 Powell's Problem ... 124
4.6.3 Fletcher and Powell's Problem ... 126
4.6.4 Wood's Problem ... 127
4.6.5 Cragg and Levy's Problem ... 132
4.6.6 Comparisons with Seven Minimization Algorithms
4.6.7 Example with Nonlinear Constraints
4.6.8 Example with Box Constraints ... 143
4.6.9 Example with Errors in the Supplied Function and Gradient ... 145
4.6.10 Simple Circuit Optimization Example
4.6.11 MOSFET Nand Gate Circuit Optimization Example ... 151
4.6.12 Power Supply Regulator Circuit Optimization Example ... 156
4.7 Summary


5.1 Approaches in Finding a Solution
5.1.1 Equivalent Unconstrained Minimization Problems
5.1.2 Iterative Methods
5.2 Infinite Series Representation of a Solution
5.2.1 A Class of Iterative Methods
5.2.2 The Variable-Order Iterative Method
5.3 The VO Iterative Method in Transient Analysis
5.3.1 MOSFET Nand Gate Example
5.3.2 MOSFET Buffer Examples
5.3.3 ECL Gate Example
5.4 The VO Iterative Method in DC Analysis
5.5 Summary

6.1 Conclusions
6.2 Future Research Suggestions





Chapter Page

I.1 Using the Program ... 210
I.2 Description of the Program ... 217
I.3 Listing of the Program ... 222

REFERENCES ... 276



Table Page

4.1 Results for Rosenbrock's problem ... 121
4.2 Results for Powell's problem ... 125
4.3 Results for Fletcher and Powell's problem ... 128
4.4 Results for Wood's problem ... 130
4.5 Results for Wood's problem, initial guess near saddle point ... 131
4.6 Results for Cragg and Levy's problem ... 133
4.7 Summary of results for five problems with reduced accuracy ... 136
4.8 Comparative results of VO algorithm with seven other algorithms ... 137
4.9 Results for illustrative constrained problem ... 140
4.10 Results for constrained problem solved by a sequence of problems ... 142
4.11 Results for Rosenbrock's problem with random errors in the function and the gradient ... 147
4.12 Results of optimization of nand gate ... 157
4.13 Tabular description of two current sources ... 161
5.1 Comparative results of transient analysis for nand gate ... 193
5.2 Comparative results of transient analysis for 9-device MOS buffer ... 197
5.3 Comparative results of transient analysis for 18-device MOS buffer ... 197
5.4 Comparative results of transient analysis for ECL gate ... 199
5.5 Comparative results of transient analysis for ECL gate with capacitances multiplied by 10^3 ... 199
5.6 Comparative results of dc analysis for ECL gate ... 203
5.7 Comparative results of dc analysis for 18-device MOS buffer ... 203


Figure Page

3.1 Flowchart of a portion of the scalar search step of the VO algorithm ... 53
4.1 Illustration of projection of trajectory onto box constraints ... 88
4.2 Trajectories for the second-, third-, and fourth-order transformations at x0 = (-1.2, 1)^T for the minimization of Rosenbrock's function ... 116
4.3 Plot of f(h(p)) vs. p ... 118
4.4 Trajectory of VO algorithm for the minimization of Rosenbrock's function ... 123
4.5 Simple circuit to test and compare the VO algorithm ... 149
4.6 Two-input nand gate optimized ... 152
4.7 Power supply regulator optimized ... 158
5.1 Nine-device MOS buffer circuit analyzed ... 194
5.2 Eighteen-device MOS buffer analyzed ... 195
5.3 ECL gate analyzed ... 198
I.1 Example of subroutine to supply the function, the gradient and/or the hessian to the VO algorithm ... 212
I.2 Typical setup for executing the VO algorithm in an IBM computer with standard catalog procedures ... 214






x                 A column vector in R^n with components x_j.
x < y             Each component of the vector x is less than the corresponding component of the vector y.
||y||             Any norm of the vector y.
||y||_inf         The maximum norm of the vector y, equal to max_j |y_j|.
y^T               The transpose of the vector y, the result being a row vector, or the transpose of a matrix y.
f'(x)             The column vector first derivative of f(x), the transpose of the gradient.
[f'(x)]_j         The jth component of the gradient, ∂f(x)/∂x_j.
f''(x)            The second derivative of f(x), the hessian.
[f''(x)]_ij       The (i,j)th component of the hessian, ∂²f(x)/(∂x_i ∂x_j).
[f''(x)]^-1       The inverse of the hessian matrix of f(x).
f'''(x), f''''(x)   The third and fourth derivative tensors of f(x).
F(x), F'(x), F''(x), ...   A vector function and its derivatives (F'(x) is called the Jacobian).
[F'(x)]^-1        The inverse of the Jacobian matrix of F(x).
I                 Unit diagonal matrix of appropriate dimension.
0                 Zero vector or zero matrix of appropriate dimension.
q̇                 The time derivative of the variable q.

Abstract of Dissertation Presented to the Graduate Council
of the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy



Alberto Jose Jimenez

June, 1976

Chairman: Stephen W. Director
Major Department: Electrical Engineering

An iterative algorithm, called the Variable-Order (VO) algorithm, is derived for computing a local minimum of a nonlinear function of several independent variables. The VO algorithm is shown to be very competitive with several existing algorithms. The class of functions for which the algorithm is globally convergent is established. It is shown that the VO algorithm converges with order as high as four. A major step of the algorithm is the solution of a scalar problem that may be along curved trajectories in the space of the independent variables, instead of along straight lines as in most existing algorithms. Approximations to required higher-order derivatives are given which allow the use of the VO algorithm even if only function values can be supplied. If the supplied function and gradient values are somewhat inaccurate, as is often the case in computer-aided circuit optimization procedures, the VO algorithm is shown to be still effective. The algorithm may be used to compute a solution of a general nonlinear programming problem by the use of penalty functions.

Special cases of the VO algorithm are used to develop an iterative method to solve nonlinear equations that arise in circuit analysis with resulting modest improvements in efficiency. The new method is applied to transient and dc analysis of nonlinear MOSFET and bipolar circuits.



It has been nine years since publication of the first, by now almost classical, special issue of the Proceedings of the IEEE [57] on computer-aided design of circuits. That issue marked the beginning of a trend to use the computer as an active partner in design rather than simply in a passive role for simulation. The circuits at that time might be described as being in the infancy of integrated electronics. The most complex integrated circuit chip might have consisted of ten devices; it was still somewhat feasible to use breadboarding techniques as aids in the design.

During the ensuing nine years, several more journals have been devoted to the subject of computer-aided design of circuits, among them [58-61], and integrated electronics has matured to the point where large-scale integrated circuits which contain thousands of devices can be manufactured. Breadboarding of circuits has generally ceased to be very useful as a design tool, and the computer has become indispensable in the entire design process. The computer is being used at many stages, and for many different purposes, during the design of circuits. We will be concerned with the use of the computer to optimize a circuit by varying some designable parameters in order to achieve the design objectives.

In circuit optimization, a scalar performance function representing the design objectives is minimized. There are two principal computational steps in this procedure. First, a numerical minimization algorithm must be employed to adjust the designable parameters in order to minimize the scalar performance function. Minimization algorithms begin from an initial point, a guess at the optimum set of designable parameters, and generate a sequence of points which hopefully converges to the minimum of the scalar performance function. Most minimization algorithms require the numerical values of the performance function and its gradient with respect to the designable parameters evaluated at several points during the minimization procedure. The number of function and gradient evaluations required is generally proportional to the computational cost of the minimization.

Second, the evaluation of the scalar performance function and its gradient generally involves analysis of the circuit equations, a rather large set of nonlinear algebraic and differential equations. The large number of circuit equations is not only due to the increased circuit size, but also due to the use of more complex device models for improved accuracy.

The goals of this research were to improve the efficiency of the two computational steps described above. The major accomplishments were the derivation of a very promising new minimization algorithm, and a new iterative method to solve the nonlinear algebraic equations that arise in the analysis of circuits.

The main contribution of this research is the development of a new minimization algorithm. Although the new algorithm has some shortcomings, numerical results on several examples show that it is quite accurate, and more efficient than other existing algorithms for the majority of the examples tried. From a theoretical standpoint the algorithm has two novel features: 1) it has a variable order of convergence up to order four, and 2) it has a novel scalar search at each iteration which may be along curved trajectories in the space of the independent variables.

A second contribution of this research is an iterative method for solving the nonlinear equations that arise in the transient analysis of the circuit equations. Modest improvements in efficiency were obtained when the new iterative method was implemented in an already very efficient transient analysis program.

Other minor contributions can be summarized as follows:

1) A potentially useful Taylor series expansion of the solution point of a system of nonlinear equations. Different forms of the series, when truncated, yield different iterative methods. An iterative method was used to obtain dc solutions of the circuit equations.

2) A modified Cholesky factorization of a symmetric matrix, the hessian, which in effect modifies the matrix when it is not positive definite. The new factorization is a modification of a previously proposed technique [36].

3) An apparently novel scheme for computing difference approximations to first and second derivatives, the gradient and the hessian. The scheme automatically takes into consideration errors that may be present in the function values that are used in the difference approximations.

4) A new method of describing minimization algorithms to account for other than straight-line directions of search. Existing theorems are extended to the new description.

The organization of the chapters is the following. In Chapter 2 we offer a brief theoretical and computational view of computer-aided circuit design. In that chapter a brief historical review of minimization algorithms is also given. In Chapter 3 the new minimization algorithm is derived and its theoretical properties are established. In Chapter 4, implementation of the new algorithm is described. In addition, several examples and comparisons with other algorithms are reported. In Chapter 5 the concepts used in the derivation of the minimization algorithm are used to derive iterative methods for finding solutions to nonlinear equations. Finally, Chapter 6 offers general conclusions and some suggestions for further research.



A successful approach in using the computer as a circuit design tool has been to minimize a scalar performance function, which represents the design objectives, by adjusting in some suitable fashion the designable parameters. This procedure requires many steps which will be briefly outlined in this chapter.

The first step in using optimization for circuit design is to recast the problem into a nonlinear programming problem by characterizing the qualitative design objectives by a scalar performance function with constraints. This step is quite heuristic and requires great insight on the part of the circuit designer. After this initial step, the computer takes over by approximating the solution of the nonlinear programming problem.

The derivation and the computational steps involved in the evaluation of the scalar performance function and its gradient, needed for solving the nonlinear programming problem, will be briefly outlined. The chapter ends with a brief historical review of existing methods for solving the unconstrained minimization problem which results from the nonlinear programming problem.

2.1 Nonlinear Programming Circuit Problem

Although there have been some fairly successful attempts at synthesis, where a circuit is "grown" to meet specifications [43,6], it may be safely stated that this problem has not been solved satisfactorily. Thus it will be assumed that a circuit which somewhat adequately meets the design objectives is available, e.g. from a catalog of circuits. It is then desired to change parameters of this circuit to improve its intended objective.

The design objectives of a circuit are usually specified in a somewhat qualitative manner. Some of the objectives are to minimize certain quantities, such as power dissipation, time delays, or circuit size, or to minimize the difference between a desired voltage or current curve and the actual curve. Other objectives may be described as constraints on the solutions or on the designable parameters, such as voltage or current levels less than (or greater than) some value, propagation delays no larger than some value, low and high limits, called box constraints, on the designable parameters, etc. It is the circuit designer's job to translate the usually unspecific design objectives into a set of specific scalar functions. Often this specification step yields several scalar functions to be minimized, several constraint functions, and box constraints on the designable parameters.

Most circuit optimization programs require that all the scalar functions to be minimized be combined in some manner to obtain a single scalar performance or objective function to be minimized. The simplest technique for accomplishing this combination is to choose a performance function which is a weighted sum of all the functions to be minimized. The general form of such a performance function is

    f = ∫[t0,T] e(w, q, x, t) dt                                (2.1)

where w is a vector of branch voltages, branch currents and node voltages of the circuit; q is a vector of capacitance charges and inductor fluxes; x is the vector of time-independent designable parameters, e.g. geometries of each device in the circuit; and t is time. The scalar function e represents a numerical compromise of the design objectives in that a different combination of the functions to be minimized yields a different function e, and thus a different minimization problem. As published applications have shown [28,12,5], this numerical compromise implies that circuit design carried out in this manner may require several minimizations to achieve the best design, as interpreted by the circuit designer.

The minimization of (2.1) must be carried out subject to the circuit designer constraints and subject to the circuit equation constraints, which is a nonlinear programming problem. This problem may be described as follows:

    minimize f(w, q, x) = ∫[t0,T] e(w, q, x, t) dt              (2.2a)
     w,q,x

    subject to  u(w, x, t) ≤ 0                                  (2.2b)

                x_L ≤ x ≤ x_U                                   (2.2c)

                H(w, q, x, t) = 0                               (2.2d)

                E w − q̇ = 0,  q(0) = q0(x)                      (2.2e)

where (2.2b) are the designer's nonlinear constraints, (2.2c) are the box constraints, and (2.2d) with (2.2e) represent the circuit equations. The circuit equations, which can also be expressed by

    F(w, q, x, t) = [ H(w, q, x, t) ] = 0,   q(0) = q0(x)
                    [ E w − q̇      ]

where the initial conditions q0(x) may be functions of the designable parameters, consist of Kirchhoff-laws equations and the branch constitutive equations [15,17].

The box constraints (2.2c) are usually handled by transformations [8], or directly as done by the algorithm to be given in Chapters 3 and 4. The nonlinear constraints (2.2b) are usually made part of the e function in (2.2a) by using penalty functions as described in Chapter 4. Therefore we will discuss the numerical and theoretical considerations in solving the problem given by

    minimize f(w, q, x) = ∫[0,T] e(w, q, x, t) dt               (2.3a)
     w,q,x

    subject to  H(w, q, x, t) = 0                               (2.3b)

                E w − q̇ = 0,  q(0) = q0(x)                      (2.3c)

where without loss of generality t0 = 0 is assumed.
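The penalty-function device mentioned above (and detailed in Chapter 4) can be illustrated on a one-dimensional toy problem: a quadratic exterior penalty folds a constraint u(x) ≤ 0 into the objective, and the unconstrained minimizer approaches the constrained one as the penalty weight grows. The toy objective, constraint, and crude scalar minimizer below are illustrative inventions, not the thesis's circuit formulation:

```python
# Sketch of the quadratic exterior-penalty idea: fold a constraint
# u(x) <= 0 into the objective and minimize a sequence of
# unconstrained problems with growing penalty weight r.
# Hypothetical toy problem:  minimize f(x) = (x - 3)^2  subject to
# x - 1 <= 0, whose constrained minimum is x* = 1.

def f(x):
    return (x - 3.0) ** 2

def u(x):
    return x - 1.0

def penalized(x, r):
    # the penalty term is active only where the constraint is violated
    return f(x) + r * max(0.0, u(x)) ** 2

def minimize_1d(phi, x0, step=1e-4, iters=200000):
    # crude derivative-free descent; adequate for this 1-D sketch
    x = x0
    for _ in range(iters):
        if phi(x + step) < phi(x):
            x += step
        elif phi(x - step) < phi(x):
            x -= step
        else:
            break
    return x

x = 2.0
for r in (1.0, 10.0, 100.0, 1000.0):
    # each pass starts from the previous minimizer, as is usual in
    # sequential penalty methods
    x = minimize_1d(lambda t: penalized(t, r), x)
# x has moved toward the constrained minimum x* = 1
```

Each unconstrained subproblem is only mildly harder than the last, which is why the sequence is started from the previous minimizer rather than from scratch.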

2.2 Derivation of the Gradient

Any solution of (2.3) must satisfy the necessary condition that the first variational (or gradient) of the Lagrangian vanishes [8,29]. The Lagrangian is given by

    L(w, q, x, w̃, q̃) = ∫[0,T] [ e(w, q, x, t) + w̃^T H(w, q, x, t) + q̃^T (E w − q̇) ] dt     (2.4)

where w̃(t) and q̃(t) are the Lagrange multiplier vectors, which are functions of time. The first variational of L is given by

    δL = ∫[0,T] ( ∂e/∂w + w̃^T ∂H/∂w + q̃^T E ) δw dt
       + ∫[0,T] ( ∂e/∂q + w̃^T ∂H/∂q + dq̃^T/dt ) δq dt − q̃^T δq |[0,T]
       + ∫[0,T] ( ∂e/∂x + w̃^T ∂H/∂x ) δx dt
       + ∫[0,T] δw̃^T H dt + ∫[0,T] δq̃^T (E w − q̇) dt            (2.5)

where the variational term in (2.4) involving q̇ is obtained using integration by parts as follows:

    δ( −∫[0,T] q̃^T q̇ dt ) = −∫[0,T] δq̃^T q̇ dt − ∫[0,T] q̃^T δq̇ dt
                          = −∫[0,T] δq̃^T q̇ dt + ∫[0,T] (dq̃^T/dt) δq dt − q̃^T δq |[0,T]     (2.6)

In order to satisfy the necessary conditions one must find time-dependent vectors w*(t), q*(t), w̃*(t), and q̃*(t), and a time-independent vector x* to satisfy

    δL = 0                                                      (2.7)


The problem of satisfying (2.7) normally has an extremely large dimension. The dimension can be reduced substantially if one makes two assumptions. First, if the circuit equations can be satisfied at all values of x, then given an x = x^k we can obtain w^k(t) and q^k(t) such that the circuit equations

    H(w^k, q^k, x^k, t) = 0,
    E w^k − q̇^k = 0,  q^k(0) = q0(x^k)                          (2.8)

are satisfied. Second, if the Jacobian operator of the circuit equations, given by

    J = [ ∂H/∂w   ∂H/∂q     ]
        [ E       −(d/dt) I ]                                   (2.9)

is invertible in the interval 0 ≤ t ≤ T, then the Lagrange multiplier vectors w̃^k(t) and q̃^k(t) can be computed from

    w̃^kT (∂H^k/∂w) + q̃^kT E = −∂e^k/∂w                          (2.10a)

    w̃^kT (∂H^k/∂q) − (d/dt̄) q̃^kT = −∂e^k/∂q,   q̃^k(t̄ = 0) = 0    (2.10b)

where t̄ = T − t (note that q̃^k(t̄ = 0) = q̃^k(t = T)). The equations in (2.10) may be written in matrix form as follows:

    [ (∂H^k/∂w)^T   E^T        ] [ w̃^k ]     [ (∂e^k/∂w)^T ]
    [ (∂H^k/∂q)^T  −(d/dt̄) I  ] [ q̃^k ]  = −[ (∂e^k/∂q)^T ]     (2.11)

which can be solved because of the second assumption. Observe that the solution of (2.11) is carried out in t̄, where t̄ = T − t, which is backwards in the original time variable. These two assumptions are reasonable since the designable parameters are normally constrained by the box constraints, so that an actual physical circuit is always obtained, and therefore the circuit equations should always have a solution.

With the preceding two assumptions, using (2.8) and (2.10), the variational δL in (2.5) at x = x^k becomes

    δL = ∫[0,T] ( ∂e^k/∂x + w̃^kT ∂H^k/∂x ) δx dt + q̃^kT(0) δq^k(0)     (2.12)

This variational may be expressed as the gradient with respect to x evaluated at x^k by

    ∂L/∂x |x^k = ∫[0,T] ( ∂e^k/∂x + w̃^kT ∂H^k/∂x ) dt + q̃^kT(0) ∂q^k(0)/∂x     (2.13)

since x is not a function of time. This expression implies that the problem has been reduced in dimension. We can now use any unconstrained minimization algorithm which would vary only the designable parameters by using function and gradient values of the Lagrangian, now effectively a function of only x.

2.3 Computational Flow

Most iterative minimization algorithms require that an initial guess to the solution point be given, and that the values of the function and the gradient be supplied to the algorithm at points generated by the algorithm (the next section and Chapters 3 and 4 offer a more detailed description of minimization algorithms). Thus at any value of x = x^k given by the minimization algorithm, one must supply L and the gradient of L with respect to x, both evaluated at x = x^k. Using the derivation in the preceding section, the computational flow will be described now. For notational convenience, the superscript k will be dropped.


STEP 1. Determine q(0). These are the initial conditions. There may be three possibilities: 1) q(0) = q0, where q0 is a constant vector; 2) q(0) = q0, where q0 is part of the designable parameters, as in a periodic steady state problem [13]; and 3) q(0) is computed from a dc analysis of the circuit equations. That is, q(0) and w(0) satisfy

    H(w(0), q(0), x) = 0                                        (2.14a)

    E w(0) = 0                                                  (2.14b)

In Chapter 5, dc analysis is discussed further and a new algorithm is given.

STEP 2. Compute an approximation to w(t) and q(t) by a transient analysis, from t = 0 to t = T, of the circuit equations to satisfy

    H(w, q, x, t) = 0                                           (2.15a)

    E w − q̇ = 0                                                 (2.15b)

with q(0) obtained from STEP 1. In this transient analysis, the value of the Lagrangian (2.4) can be computed, which due to (2.15) is now given by

    L = ∫[0,T] e(w, q, x, t) dt                                 (2.16)

Transient analysis of the circuit equations is described in more detail in Chapter 5.

STEP 3. Compute an approximation to the Lagrange multipliers w̃(t) and q̃(t) by a transient analysis from t̄ = T − t = 0 to t̄ = T − t = T (i.e., t running backwards) to satisfy

    w̃^T (∂H/∂w) + q̃^T E = −∂e/∂w                                (2.17a)

    w̃^T (∂H/∂q) − (d/dt̄) q̃^T = −∂e/∂q                           (2.17b)

with q̃(t̄ = 0) = 0. In this transient analysis, compute the vector

    (∂L/∂x)_DY = ∫[0,T] ( ∂e/∂x + w̃^T ∂H/∂x ) dt                (2.18)

which is the dynamic portion of the gradient (see (2.13)).

STEP 4. Compute the equilibrium portion of the gradient given by

    (∂L/∂x)_EQ = q̃^T(0) ∂q(0)/∂x                                (2.19)

The initial conditions q(0) are determined by one of the three possibilities outlined in STEP 1. The value of (2.19) therefore also has three possibilities: 1) if q(0) = q0, where q0 is constant, then the term (2.19) is zero; 2) if q(0) = q0, where q0 is obtained from a subset of the designable parameter vector [1,16], that is, if

    x = (x_d^T, q0^T)^T

with q(0) = q0, then in this case

    (∂L/∂x)_EQ = (0^T, q̃^T(t = 0))

and 3) if q(0) is computed from a dc analysis satisfying (2.14), then this term requires additional work. Differentiating (2.14) with respect to x, and expressing the result in matrix notation, yields

    [ ∂H/∂w   ∂H/∂q ] [ ∂w(0)/∂x ]     [ ∂H/∂x ]
    [ E       0     ] [ ∂q(0)/∂x ]  = −[ 0     ]                (2.20)

which is a matrix-matrix linear equation. Since we want (2.19), if the system

    [ (∂H/∂w)^T   E^T ] [ ŵ ]     [ 0    ]
    [ (∂H/∂q)^T   0   ] [ q̂ ]  =  [ q̃(0) ]                      (2.21)

is solved for the vectors ŵ and q̂, then after multiplying both sides of (2.20) by (ŵ^T, q̂^T), and using (2.21), we obtain

    q̃^T(0) ∂q(0)/∂x = −ŵ^T ∂H/∂x                                (2.22)

which yields

    (∂L/∂x)_EQ = q̃^T(0) ∂q(0)/∂x = −ŵ^T ∂H/∂x                   (2.23)

Now the entire gradient is given by

    ∂L/∂x = (∂L/∂x)_DY + (∂L/∂x)_EQ                             (2.24)
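The dc-analysis term of STEP 4 rests on a linear-algebra identity: multiplying the differentiated dc equations (2.20) on the left by the solution of an auxiliary transposed ("adjoint") system isolates the needed product with ∂q(0)/∂x without ever forming ∂q(0)/∂x itself. A small numerical check of that identity, with hypothetical constant matrices standing in for ∂H/∂w, ∂H/∂q, E and ∂H/∂x (this is a sanity check of the algebra, not circuit code):

```python
# Check of the adjoint identity behind STEP 4's dc-analysis term.
# The matrices below are small hypothetical stand-ins, chosen only
# so the block matrix of (2.20) is invertible.

def solve(A, b):
    # Gaussian elimination with partial pivoting; A is n x n, b has n entries.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Stand-ins: 2 circuit variables, 2 designable parameters.
Hw = [[4.0, 1.0], [0.0, 3.0]]   # plays dH/dw
Hq = [[1.0, 0.0], [2.0, 1.0]]   # plays dH/dq
E  = [[1.0, 2.0], [0.0, 1.0]]   # plays E
Hx = [[1.0, 0.5], [2.0, 1.0]]   # plays dH/dx
qt0 = [3.0, -1.0]               # plays the multiplier value at t = 0

# Block matrix of (2.20): A = [[dH/dw, dH/dq], [E, 0]].
A = [Hw[0] + Hq[0], Hw[1] + Hq[1], E[0] + [0.0, 0.0], E[1] + [0.0, 0.0]]

# Direct route: solve (2.20) column by column for dq(0)/dx, then
# form the product with the multiplier vector.
direct = []
for j in range(2):
    sol = solve(A, [-Hx[0][j], -Hx[1][j], 0.0, 0.0])
    dq0 = sol[2:]               # the dq(0)/dx_j part of the solution
    direct.append(qt0[0] * dq0[0] + qt0[1] * dq0[1])

# Adjoint route: solve the transposed system once, then the same
# quantity is a single product with dH/dx; dq(0)/dx is never formed.
At = [[A[r][c] for r in range(4)] for c in range(4)]
wq = solve(At, [0.0, 0.0] + qt0)
adjoint = [-(wq[0] * Hx[0][j] + wq[1] * Hx[1][j]) for j in range(2)]
# direct and adjoint agree component by component
```

The saving is exactly the one claimed in the text: one transposed solve replaces one solve per designable parameter.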

This four-step procedure has been implemented in a general circuit optimization program [27] with excellent results. Recently, it was shown how the performance function (2.3a) can be made more useful by the use of the event functional [5], which allows the inclusion of time quantities, such as time delays, within the entire procedure.


Clearly this entire procedure can be computationally very costly. Each function evaluation requested by the minimization algorithm requires a transient analysis of a system of nonlinear algebraic and differential equations, and a transient analysis of a system of linear time-varying algebraic and differential equations is additionally required for the gradient. It is therefore essential that 1) the minimization algorithm used be extremely efficient, requiring a small number of function and gradient evaluations to obtain the minimum, and 2) the entire four-step procedure outlined above be very efficiently implemented. Due to the previously mentioned heuristic procedure of generating the scalar performance function, this entire design procedure is manually iterative, thus emphasizing the need for overall efficiency.

2.4 Review of Unconstrained Minimization

Powell recently observed that in the last several years most of the useful work in the area of unconstrained minimization has been in understanding, improving and extending existing methods rather than devising new algorithms [41]. Indeed most, if not all, minimization algorithms can be described as first computing a direction of search from the current estimate to the solution and then obtaining the next point along this direction. That is, if the unconstrained minimization problem is expressed by

    minimize f(x)
        x

most algorithms, at the kth iteration, first compute a direction of search, represented by a vector d^k, by using the values of the function f(x^k) and perhaps some of its derivatives. Then the next point x^(k+1) is obtained by searching along this straight line in some manner to yield

    x^(k+1) = x^k + p_k d^k

where p_k is a scalar often called the step-length. Thus, most existing minimization algorithms have two principal steps in each iteration: choosing the direction of search, and the scalar search along this direction to obtain a suitable step-length p_k.
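The two principal steps can be sketched concretely: pick a direction d^k (here Newton's direction, with a steepest-descent fallback when it is not downhill), then choose the step-length p_k by a scalar search along x^k + p d^k. The backtracking constants and the use of Rosenbrock's function are conventional illustrative choices, not the thesis's VO algorithm:

```python
# Generic minimization iteration in the two-step form described
# above: choose a direction of search d, then pick the step-length
# p by a scalar search along x + p*d.  Test function: Rosenbrock's
# f = 100(x2 - x1^2)^2 + (1 - x1)^2, minimum at (1, 1).

def f(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def grad(x):
    return [-400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2)]

def hess(x):
    return [[1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
            [-400.0 * x[0], 200.0]]

def minimize(x, iters=100):
    for _ in range(iters):
        g = grad(x)
        if g[0] ** 2 + g[1] ** 2 < 1e-18:
            break
        H = hess(x)
        det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
        slope = 1.0
        if det != 0.0:
            # Newton direction: solve H d = -g by Cramer's rule (2 x 2)
            d = [(-g[0] * H[1][1] + g[1] * H[0][1]) / det,
                 (-g[1] * H[0][0] + g[0] * H[1][0]) / det]
            slope = g[0] * d[0] + g[1] * d[1]
        if slope >= 0.0:
            # Newton's direction unusable: fall back to steepest descent
            d = [-g[0], -g[1]]
            slope = -(g[0] ** 2 + g[1] ** 2)
        p = 1.0
        while f([x[0] + p * d[0], x[1] + p * d[1]]) > f(x) + 1e-4 * p * slope:
            p *= 0.5  # backtrack until sufficient decrease in f
        x = [x[0] + p * d[0], x[1] + p * d[1]]
    return x

x = minimize([-1.2, 1.0])  # the classical starting point for Rosenbrock
```

Near the solution the full step p = 1 is accepted and convergence is rapid; far from it the backtracking keeps each iterate downhill, which is the role the scalar search plays in the description above.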

The direction of search is one of the differences among algorithms. The oldest minimization algorithms are 1) steepest descent, where the direction of search is in the direction of the negative gradient; 2) coordinate descent, where the direction of search is along each coordinate direction, i.e., one variable is adjusted at a time; and 3) Newton's method, where the direction of search is the product of the inverse of the second derivative matrix, the hessian, and the negative gradient [34]. The most robust of these algorithms is steepest descent, which converges with order one, while the fastest is Newton's, which converges with second order for most functions. For this reason, from 1959 to the present, much of the activity in the area of unconstrained minimization algorithms has been to devise techniques that approach the speed of Newton's method without its disadvantages, in particular its requirement of the hessian matrix.

Davidon [14] in 1959 published an algorithm which uses only gradient information to in effect build an approximation of the hessian inverse as the algorithm progresses towards the solution; thus the method

(Note that subscripts will be used for scalars.)


approaches Newton's method after a number of iterations. Davidon's

method, which is based on hessian conjugate directions, gave birth to

a very large number of algorithms generally called quasi-Newton algo-

rithms [22,23,7,34].

A characteristic of the earlier quasi-Newton algorithms was that the scalar search to compute p_k had to solve the scalar minimization problem

    minimize f(x^k + p d^k)
        p

very accurately. The accurate solution of this problem is quite costly computationally, as many researchers have shown [46,34]. The elimination of the requirement to solve this scalar minimization problem accurately was the principal motivation for many of the latest quasi-Newton algorithms [21,7,12], although some researchers were additionally motivated by deriving algorithms which required only function values [40,49,12].

The amount of information about the function which must be supplied to minimization algorithms has been a motivation for the development of many new algorithms. The general tendency in deriving algorithms, since Davidon's classical contribution [14], has been to account for the hessian without having it supplied; i.e., making sure an algorithm would be efficient for quadratic functions. On the other hand, the new minimization algorithm developed in the next two chapters has the property of in effect accounting for even higher derivatives without having them supplied.

The new algorithm requires that the function and the first two derivatives, the gradient and the hessian, be supplied. The effect of the third and fourth derivatives is approximated from values of the first two derivatives. The algorithm offers several novel ideas in the area of unconstrained minimization, such as a variable order of convergence as high as four and a novel scalar search that may be along curved trajectories. Thus the new algorithm does not compute a direction of search which is always a straight line, as most existing algorithms do.

In circuit optimization procedures, as was shown, the function and

the gradient can be computed requiring only first partial derivatives

of the circuit equations and the performance function. The hessian

would require second partial derivatives of the circuit equations which

in general are very difficult to derive and would require a large number

of operations to handle (for linear circuits the second partial deriva-

tives are zero and thus the hessian can be evaluated in a straightforward

manner as done in [54] for a Newton-like minimization algorithm). For

this reason, in Chapter 4 we describe a difference scheme which is

built into the new algorithm to approximate the hessian, thereby allowing

the use of the new algorithm by supplying it with only function and

gradient values.
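The difference scheme itself is deferred to Chapter 4, but the underlying idea, building a hessian approximation from gradient values alone, can be sketched with simple forward differences. This is only an illustration of the principle, not the scheme built into the new algorithm; the step size h and the symmetrization are choices made for this sketch:

```python
import numpy as np

def approx_hessian(grad, x, h=1e-5):
    """Forward-difference hessian approximation from gradient values only.

    Column i is [f'(x + h*e_i) - f'(x)] / h; the result is symmetrized
    because a finite-difference hessian is not exactly symmetric.
    """
    n = x.size
    g0 = grad(x)
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (grad(x + e) - g0) / h
    return 0.5 * (H + H.T)

# Check on a quadratic f(x) = 0.5 x^T A x, whose hessian is exactly A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
grad = lambda x: A @ x
H = approx_hessian(grad, np.array([1.0, -2.0]))
```

For a quadratic the finite-difference hessian is essentially exact; for general functions the step size trades truncation error against roundoff.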



In this chapter a new algorithm, called the Variable-Order (VO)

algorithm, is proposed for finding a solution to the unconstrained

minimization problem

     minimize f(x)                                    (3.1)
        x

where f is a real-valued nonlinear function, f: E^n → E, and x ∈ E^n.
The equivalent maximization problem is also included in (3.1) since

     maximize f(x) = - minimize [-f(x)] .
        x                 x

Solution of the unconstrained minimization problem is not only im-

portant in its own right, but as will be seen in Chapter 4, solution

of the unconstrained problem is central to the solution of the con-

strained minimization problem which often arises in computer-aided

circuit optimization procedures.

There are several existing algorithms designed to solve (3.1).

These algorithms are all iterative; that is, beginning from an initial
estimate of a solution x^0, a sequence {x^k} is generated which under



certain conditions converges to a solution x* of (3.1). The exact
solution x* is rarely obtained in a finite number of iterations, but

if the sequence has a high order of convergence the solution can be

approximated closely in a finite, and hopefully small, number of
iterations. Therefore, a desirable property for a minimization algorithm
is that it generates convergent sequences with a high order of convergence.

If an algorithm generates convergent sequences from any initial

point x^0, it is said to be globally convergent. It will be shown

that the VO algorithm is globally convergent for pseudoconvex [35]

functions that are twice continuously differentiable. Moreover,

numerical experiments indicate that the VO algorithm is able to effi-

ciently compute minima of some functions not satisfying these conditions.


If an algorithm has a high order of convergence, reasonable

accuracy might be expected when the algorithm is stopped after several

iterations. The order of convergence of an algorithm may be defined
by a value r such that

     ||x^{k+1} - x*|| < C ||x^k - x*||^r ,

where 0 < C < ∞ is a constant called the convergence ratio. Observe
that if the distance ||x^k - x*|| is sufficiently small, x^{k+1} will be
much closer to x* for large r. Most algorithms have orders of con-

vergence equal to two or less. For example: steepest descent con-

verges linearly (r = 1), the conjugate gradients algorithm of Fletcher


and Reeves converges linearly but with a smaller convergence ratio

than the convergence ratio of steepest descent, and the quasi-Newton

algorithm of Davidon, Fletcher and Powell approaches second-order

convergence [34]. It will be shown that the VO algorithm has up to

fourth-order convergence.
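The order r can also be estimated numerically from successive errors, since e_{k+1} ≈ C e_k^r gives r ≈ log(e_{k+1}/e_k) / log(e_k/e_{k-1}) once the iterates are near the solution. A small sketch using Newton's iteration for √2, which is known to be second-order (the example and its loose bounds are illustrative only):

```python
import math

# Newton iterates for x^2 = 2, starting at x0 = 2
xs = [2.0]
for _ in range(3):
    x = xs[-1]
    xs.append(0.5 * (x + 2.0 / x))

errs = [abs(x - math.sqrt(2.0)) for x in xs]

# Estimate the order r from the last three errors:
# e_{k+1} ~ C e_k^r  =>  r ~ log(e3/e2) / log(e2/e1)
r = math.log(errs[3] / errs[2]) / math.log(errs[2] / errs[1])
```

The estimate lands near 2, consistent with the second-order convergence of Newton's iteration.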

The definition of the order of convergence implies that the

higher the order the faster a solution is approached provided that
the point x^k is sufficiently close to the solution. Thus while a
high-order-of-convergence algorithm is desirable when in a small
neighborhood of x*, previous studies generally indicated that when
x^k is far from x*, a lower-order algorithm was more efficient [30].

In fact, the very popular class of algorithms called quasi-Newton

algorithms has the property of being linearly convergent initially

and becoming essentially second-order as the solution is approached

[34]. The VO algorithm automatically adjusts its order at each

iteration, generally selecting the order which allows the most pro-

gress towards the solution. The numerical results to be given in

the next chapter show that the VO algorithm is more efficient than

most existing algorithms.

The first section of this chapter reviews some of the existing

theory associated with solution points of (3.1). The second section

discusses the two major steps of a minimization algorithm: the trans-

formation step, and the scalar search step. The new techniques being

introduced for the VO algorithm are compared with the techniques of

previous algorithms in this section. The theoretical derivation of

the algorithm is presented in the third through the fifth sections.

Although the character of these three sections is theoretical, several


numerical and practical considerations are discussed. The sixth

and final section establishes the conditions for global convergence

of the VO algorithm and its order of convergence.

3.1 Properties of Minima

The problem to be solved is given by

     minimize f(x)                                    (3.2)
        x

where f: E^n → E. Let x* be a solution; then if

     f(x*) ≤ f(x)                                     (3.3)

for all possible x, the point x* is called a global minimum. If
(3.3) holds in a small neighborhood about x*, then the solution x*
is called a local minimum.

One usually would like to determine the global minimum of (3.2).

However one must in general be content with a local minimum because

a global minimum can only be identified if all minima are obtained,

or if the function is assumed to have a convexity property (in which

case all minima are global [34]). In contrast, local minima are

identified under less stringent conditions on the function. The

following theorem establishes these conditions.

Local Minimum Theorem. Let f: E^n → E and suppose that the first
derivative f'(x), the gradient of f, is continuous, and that the
second derivative matrix f"(x), the hessian of f, exists at x*. If
x* is a local minimum of f, then

     1) f'(x*) = 0 ,

     2) f"(x*) is positive semidefinite.

Conversely, if

     1) f'(x*) = 0 ,

     2) f"(x*) is positive definite,

then x* is a local minimum of f.

The proof is straightforward [34, 38], and it will not be repeated

here. Note the subtle but significant difference between the neces-

sary and sufficient conditions.

Most local minima are strict local minima. A strict local
minimum solution x* is defined by

     f(x* + y) < f(x*)                                (3.4)

for all y ≠ 0 such that x* + y is in some neighborhood about x*.

The following theorem plays an important role in the test for con-

vergence of the proposed algorithm.


Strict Local Minimum Theorem. Let f: E^n → E have a continuous
hessian in some neighborhood about x*. Then the point x* is a strict
local minimum of f if and only if both f'(x*) = 0, and there exists
an ε > 0 such that for all y satisfying 0 < ||y|| < ε, f"(x* + y) is
positive definite.

This theorem is significant because in general a minimization algo-
rithm stops at a point x^k = x̄ which lies in a small neighborhood
about x*. The theorem states that if x* is a strict local minimum,
the hessian should be positive definite at x̄.

Proof: (Sufficiency) Assume f'(x*) = 0 and that there exists
an ε > 0 such that for all y satisfying 0 < ||y|| < ε, f"(x* + y) is
positive definite. From the Taylor series with remainder one may write

     f(x* + y) = f(x*) + (1/2) y^T f"(x* + ty) y

for some 0 < t < 1. Then for 0 < ||y|| < ε

     f(x* + y) - f(x*) = (1/2) y^T f"(x* + ty) y > 0 ,

which shows x* is a strict local minimum.

(Necessity) Now assume x* is a strict local minimum. Then for
arbitrary y

     lim (1/t)[f(x* + ty) - f(x*)] = f'(x*)^T y
     t→0+

by the definition of the derivative [38]. Assume f'(x*) ≠ 0. Then
a vector y exists such that f'(x*)^T y < 0 (e.g. y = -f'(x*)). Thus
for suitably small t > 0

     (1/t)[f(x* + ty) - f(x*)] < 0 ,

which contradicts the hypothesis that x* is a strict local minimum.
Now consider a Taylor series expansion with remainder

     f(x* + y) = f(x*) + (1/2) y^T f"(x* + ty) y ,

for some 0 < t < 1. Then there exists a δ > 0 such that for all y
satisfying 0 < ||y|| < δ < ε

     (1/2) y^T f"(x* + ty) y = f(x* + y) - f(x*) > 0 ,

which implies the positive definiteness of the hessian in a small
neighborhood about x*, not necessarily including x* itself. This
completes the proof.

If the hessian is either inaccurate or not supplied, a widely
used test is to stop an algorithm at a point x̄ for which

     ||f'(x̄)|| ≤ ε                                   (3.5)

for some small ε > 0, and

     f(x̄ + t e^i) > f(x̄) ,   i = 1, ..., n,         (3.6a)

     f(x̄ - t e^i) > f(x̄) ,   i = 1, ..., n,         (3.6b)

where e^1, ..., e^n are the unit coordinate vectors, and t > 0 is some

small scalar. The tests in (3.6) insure that the function does not

decrease in value along any of the coordinate lines. However, as the

following simple example [34] shows, these tests are not sufficient

for x̄ to be an approximation to a local minimum solution. Consider

the problem

     minimize f(x) = x1^3 - x1^2 x2 + 2 x2^2 .
        x

The point (x̄1, x̄2) = (6, 9) satisfies

     [f'(x̄)]_1 = ∂f/∂x1 = 3x1^2 - 2x1 x2 = 0 ,

     [f'(x̄)]_2 = ∂f/∂x2 = -x1^2 + 4x2 = 0 .

Furthermore, with x1 fixed at x̄1 = 6, the expression

     f(6, 9+t) > f(6, 9)

is satisfied for all t ≠ 0, and with x2 fixed at x̄2 = 9, the expression

     f(6+t, 9) > f(6, 9)

is satisfied for t > -9, t ≠ 0. Therefore both (3.5) and (3.6) are
satisfied. Despite this fact, the point (x̄1, x̄2) is not a local mini-
mum since the hessian evaluated at this point, given by

     f"(x̄) = [  18  -12 ]
              [ -12    4 ] ,

is not positive semidefinite since its determinant is negative.
Therefore, any algorithm using (3.5) and (3.6) as its sole termina-
tion test cannot guarantee the computation of a local minimum.
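This counterexample is easy to verify numerically. The sketch below checks that the gradient vanishes at (6, 9), that the coordinate-line tests (3.6) pass for a small step, and that the hessian nevertheless has a negative eigenvalue (an eigenvalue check, equivalent here to the determinant argument):

```python
import numpy as np

f = lambda x1, x2: x1**3 - x1**2 * x2 + 2 * x2**2
grad = lambda x1, x2: np.array([3*x1**2 - 2*x1*x2, -x1**2 + 4*x2])

x1, x2 = 6.0, 9.0
g = grad(x1, x2)                      # gradient vanishes: test (3.5) passes

# Coordinate-line tests (3.6) with a small step t
t = 1e-3
coord_ok = all(f(x1 + s*t*e1, x2 + s*t*e2) > f(x1, x2)
               for s in (+1, -1) for (e1, e2) in ((1, 0), (0, 1)))

H = np.array([[6*x1 - 2*x2, -2*x1],  # hessian at (6, 9)
              [-2*x1,        4.0 ]])
eigs = np.linalg.eigvalsh(H)          # one negative eigenvalue: not a minimum
```

Both termination tests pass at (6, 9), yet the indefinite hessian shows the point is a saddle rather than a local minimum.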

3.2 Principal Steps in a Minimization Algorithm

A minimization algorithm iteratively generates a sequence of

points {x^k}, starting from some initial point x^0, which hopefully con-
verges to a solution of (3.2). It is convenient to view each itera-
tion of the algorithm in terms of the expression

     x^{k+1} = H(p_k, x^k)                            (3.7)

where H is called the iteration function, which is a function of a
scalar parameter p_k, the k-th estimate of the solution x^k, the function
value at x^k, and possibly derivatives of f at x^k. The iteration in-

dicated by (3.7) may be separated into two steps. The first step

consists of computing the transformation function given by


     h^k(p) = H(p, x^k)                               (3.8)

where for convenience the function h^k: E → E^n has been defined. This
step will be called the transformation step. Its purpose is to com-
pute a direction or trajectory from x^k, using f(x^k) and possibly
derivatives of f at x^k, along which the next point x^{k+1} will be
selected. The transformation function represents this trajectory,
and p is a scalar parameter which is proportional to how far along
this trajectory the next point x^{k+1} will be. The second step con-
sists of selecting or computing a suitable value of p = p_k such that
x^{k+1} is given by

     x^{k+1} = h^k(p_k) .                             (3.9)

This step is called the scalar search step.

In most existing algorithms [34], the transformation functions

are linear in the scalar parameter p and have the form

     h^k(p) = x^k + p d^k                             (3.10)

where d^k is called the direction of search. For example, in the
steepest descent algorithm [34], d^k = -f'(x^k). In contrast, the

transformation functions for the VO algorithm are polynomials in

p, of degree up to three. These polynomials follow inherently from

the derivation of the transformation functions for the VO algorithm

to be given in the next section. The transformation functions for the



new algorithm may thus yield curved trajectories of search instead

of straight-line directions of search.

If the scalar search step computes a p = p_k such that

     f(x^{k+1}) = f(h^k(p_k)) < f(x^k)                (3.11)

is satisfied, then each iteration of the algorithm will progress

towards a solution of (3.2). The satisfaction of (3.11) at each

iteration insures that the sequence {f(x^k)} is monotonically de-

creasing, a property that is generally required to establish the

global convergence of an algorithm. Thus most algorithms, including

the VO algorithm, compute a p_k which satisfies (3.11). If a p_k
cannot be found to satisfy (3.11), an algorithm has either converged
to a solution of (3.2), or it has failed. The following lemma, a
generalization of an existing result [56], establishes sufficient
conditions for the existence of a p = p_k to satisfy (3.11).

Lemma 3.1. Assume f(x) is differentiable at x = x^k, h^k(p) is
differentiable at p = 0, and h^k(0) = x^k. Define h^k'(p) to be the first
derivative of h^k with respect to p. Then if

     f'(x^k)^T h^k'(0) < 0                            (3.12)

there exists a p > 0 such that

     f(h^k(p)) < f(x^k) .                             (3.13)


Proof: Using chain differentiation and the definition of deri-
vative one may write

     lim (1/p)[f(h^k(p)) - f(h^k(0))] = f'(h^k(0))^T h^k'(0) .
     p→0

Since h^k(0) = x^k, using (3.12) this expression becomes

     lim (1/p)[f(h^k(p)) - f(x^k)] < 0 .
     p→0

Then there exists an ε > 0 such that for p ≠ 0 and -ε < p < ε

     (1/p)[f(h^k(p)) - f(x^k)] < 0 .

Select 0 < p < ε to preserve the inequality and it follows that

     f(h^k(p)) < f(x^k) ,

which completes the proof.

For the transformation function (3.10) found in most algorithms, (3.12)
takes the form

     f'(x^k)^T d^k < 0 ,

which becomes

     -f'(x^k)^T f'(x^k) < 0

for the steepest descent algorithm [34]. Thus as long as f'(x^k) ≠ 0,
the steepest descent algorithm should be able to obtain a decrease in
the function. As will be seen, the VO algorithm also has the property
of satisfying (3.12) whenever f'(x^k) ≠ 0. That this property is
highly desirable follows from the theorems given in the preceding
section.
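Lemma 3.1 can be exercised numerically for the steepest descent choice d^k = -f'(x^k): the directional derivative is -||f'(x^k)||^2 < 0, so a small enough p must decrease f. The test function and point below are chosen only for this sketch:

```python
import numpy as np

f = lambda x: (x[0] - 1)**4 + (x[0] * x[1])**2
grad = lambda x: np.array([4*(x[0] - 1)**3 + 2*x[0]*x[1]**2,
                           2*x[0]**2*x[1]])

x = np.array([2.0, 1.0])
d = -grad(x)                       # steepest-descent direction
slope = grad(x) @ d                # = -||f'(x)||^2 < 0, so (3.12) holds

p = 1e-3
decreased = f(x + p * d) < f(x)    # (3.13): a small p gives a decrease
```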


Assuming the existence of a p which satisfies (3.11), the problem
to be briefly considered now is the computation of a particular value
p_k to be used in obtaining x^{k+1}. There are two stages to this problem.
First, the desired p_k must be defined in some concrete manner, usually
as the solution of a scalar problem. In most existing algorithms,
the desired p_k is defined as the solution of the following scalar
minimization problem [34]

     f(h^k(p_k)) = minimize f(h^k(p)) .               (3.19)
                      p

This value of p should satisfy (3.11), and thus this scalar problem
defines the desired p_k in a suitable manner. Other existing algo-
rithms, such as the Davidon [14] algorithm or its more popular modi-
fication due to Fletcher and Powell [22], have the requirement of
defining p_k by problem (3.19). While it may be argued that the p_k defined
by (3.19) provides the most decrease in the function at the k-th iter-
ation, this p_k may not be the best one in the sense of minimizing the
overall number of iterations. For example, when x^k is far from a
solution x* of (3.2), the p_k given by (3.19) tends to force all future
iterations to follow the bottom of narrow valleys with slow progress
towards x* [30]. Thus ideally p_k should be defined by (3.19) when-
ever x^k is close to x*, and in some other manner whenever x^k is far
from x*.

Secondly, once the problem which defines the desired p_k has been

given, an approximation to the solution of the problem must be effi-

ciently computed. It is important that the overall algorithm not fail

if rough approximations to the solution are computed to achieve savings

of computer time. For example, many studies have indicated that the

overall efficiency of the popular algorithm due to Fletcher and

Powell and the one due to Fletcher and Reeves (both of which theore-

tically require the solution of (3.19)) is sensitive to the accuracy

of the approximate solution of (3.19) [34,46]. The scalar search for

the VO algorithm was developed under these considerations. The de-

tails are given in Section 3.4.

3.3 Variable-Order Transformations

In this section the transformation functions for the VO algo-

rithm are derived. It will be seen that these transformation func-

tions require evaluation of higher-order derivatives. However,

approximations are possible which allow the algorithm to be used even

when only function values are supplied.

The motivation for the variable-order transformations stems from

a desire to approximate the behavior of the gradient of f with infor-
mation at the present point x^k. It will be shown that it is possible
to represent the point x*, at which the gradient is zero, by an
infinite series. The variable-order transformation functions result
from truncations of this series. A parameter p appearing in the
infinite series in effect accounts for the terms of the series which
are dropped.
We begin by noting that a necessary condition for x* to be a
solution of the minimization problem (3.2) is that the gradient of f
at x* be zero. Then any solution of (3.2) must satisfy the system
of equations

     f'(x) = 0 .                                      (3.20)

Now consider the change of variables denoted by

     x = X(z)                                         (3.21)


where X may be a nonlinear function. Using (3.21), the equation
(3.20) becomes

     f'(x) = f'(X(z)) = g(z) .                        (3.22)

Define a z* such that

     x* = X(z*) ;                                     (3.23)

then from (3.22)

     g(z*) = 0 ,                                      (3.24)




since f'(x*) = 0. If the function g is simple to invert, so that z*
may be found from (3.24) by

     z* = g^{-1}(0) ,                                 (3.25)

then the solution x* of (3.20) may be found from (3.23). Clearly,
specifying a change of variables which yields a function g which is
simple to invert could be difficult. However as we now show, we can
start by specifying a suitable function g and determine the resulting
change of variable function X. To this end assume some g function
has been specified. Then X(z) may be expanded in a Taylor series
about some z = z^k, so that (3.21) becomes

     x = X(z^k) + X'(z^k)(z - z^k) + (1/2)[X"(z^k)(z - z^k)](z - z^k) + ... ,

where from (3.21) x^k may be associated with z^k by

     X(z^k) = x^k .

The derivatives

     X'(z^k) = f"(x^k)^{-1} g'(z^k) ,                 (3.28)

     X"(z^k) = f"(x^k)^{-1} [g"(z^k) - [f"'(x^k) X'(z^k)] X'(z^k)] ,   (3.29)

are obtained by repeated differentiation of (3.22) and evaluating the
resulting expressions at x = x^k and z = z^k. Since, as is presently
the case, an x^k is known or given, a corresponding z^k may be
found from (3.22) by

     z^k = g^{-1}(f'(x^k)) ,                          (3.30)

since it was assumed that g is simple to invert. Therefore all the
terms of the series have been defined, assuming all the derivatives
and the inverse of the hessian exist. Finally, since we are interested
in x*, given by X(z*) in (3.23), we obtain the infinite series repre-
sentation of a solution to (3.20) given by

     x* = x^k + X'(z^k)(z* - z^k) + (1/2)[X"(z^k)(z* - z^k)](z* - z^k) + ... ,   (3.31)

where z* is given by (3.25), z^k by (3.30), and the derivatives
X'(z^k), X"(z^k), ... are obtained by differentiation as shown in
(3.28) and (3.29).

We now turn to the selection of a suitable function g. Since

the function g should be simple to invert, a logical choice might be

     g(z) = (z1, z2, ..., zn)^T ,                     (3.32)

for which the function and its inverse are identical. For this func-
tion, the infinite series (3.31) becomes

     x* = x^k - d2^k - d3^k - d4^k - ... ,            (3.33a)

where

     d2^k = f"(x^k)^{-1} f'(x^k) ,                    (3.33b)

     d3^k = (1/2) f"(x^k)^{-1} [f"'(x^k) d2^k] d2^k ,   (3.33c)

     d4^k = f"(x^k)^{-1} [[f"'(x^k) d2^k] d3^k - (1/6)[[f^iv(x^k) d2^k] d2^k] d2^k] .   (3.33d)

This infinite series is a well-known result extended to n-dimensions.

If this series is truncated to two terms, an iterative method can be
constructed, given by

     x^{k+1} = x^k - d2^k ,                           (3.34)

which is Newton's iteration for solving (3.20) [34]. However, this

iterative method (or iterative methods obtained from any truncations

of (3.33)) may not converge to a solution of the minimization problem

(3.2), because the infinite series (3.33) may not itself be a con-

vergent series. Thus we conclude that the series resulting from the

simple function g given by (3.32) does not in general produce a

suitable infinite series.

One of the proposed modifications to (3.34) which improves its

potential for convergence is the introduction of a scalar p_k to "damp"

the iteration [34], given by

In 1755, Euler derived an infinite series for a solution of the
scalar problem f(x) = 0 [19]. Other recent derivations and more
extensive studies of this scalar series have been published [47,
39,50,32,42]. The Euler series becomes (3.33) when extended to
n-dimensions for the solution of problem (3.20).


     x^{k+1} = x^k - p_k d2^k .

For the minimization problem (3.2), this iteration becomes the two-
step procedure given by the computation of the transformation function

     h^k(p) = x^k - p d2^k

and the computation of a suitable value p = p_k to obtain x^{k+1}, given by

     x^{k+1} = h^k(p_k) .

Motivated by this modification, we propose the following function#

     g(z) = (z1^p, z2^p, ..., zn^p)^T .               (3.35)

Note that this g function is simple to invert. Furthermore,

     z* = g^{-1}(0) = 0 ,


and any higher derivatives with respect to z are readily obtained.
Taking the first two terms of the resulting series (3.31) yields the
iterative method

     x^{k+1} = x^k - p_k f"(x^k)^{-1} f'(x^k) .

Thus the second-order transformation function may be defined by

# Note that zi^p means z_i raised to the p-th power.

     h2^k(p) = x^k - p d2^k ,                         (3.38)

where

     d2^k = f"(x^k)^{-1} f'(x^k)                      (3.39)

is defined to be the second-order correction. (Note that the order

refers to the order of convergence of the sequences which are generated

by the algorithm.) Thus Newton's method for solving minimization

problems falls out as a special case of the proposed algorithm.
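As a sketch of the second-order transformation used iteratively, the loop below applies x ← h2^k(p) = x^k - p d2^k, with a crude damping rule (halve p until (3.11) holds) standing in for the full scalar search of Section 3.4. The convex test function, chosen here so that the hessian is positive definite everywhere, has its minimum at the origin:

```python
import numpy as np

def damped_newton(f, grad, hess, x, iters=20):
    """Iterate x <- h2(p) = x - p * f''(x)^{-1} f'(x), halving p until f decreases."""
    for _ in range(iters):
        d2 = np.linalg.solve(hess(x), grad(x))   # second-order correction (3.39)
        p = 1.0
        while f(x - p * d2) >= f(x) and p > 1e-8:
            p *= 0.5
        x = x - p * d2
    return x

# Convex test function (a choice for this sketch): f(x) = sum(exp(x_i) - x_i),
# with gradient exp(x) - 1, hessian diag(exp(x)), and minimum at x = 0.
f = lambda x: np.sum(np.exp(x) - x)
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))
sol = damped_newton(f, grad, hess, np.array([1.0, -1.0]))
```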

Taking three terms of (3.31) yields the third-order transformation
given by

     h3^k(p) = x^k - (1/2)(3 - p) p d2^k - p^2 d3^k ,   (3.40)

where

     d3^k = (1/2) f"(x^k)^{-1} [f"'(x^k) d2^k] d2^k     (3.41)

is the third-order correction. Similarly, using the first four terms
of (3.31) yields the fourth-order transformation given by

     h4^k(p) = x^k - (1/6)(p^2 - 6p + 11) p d2^k - (2 - p) p^2 d3^k - p^3 d4^k ,   (3.42)

where the fourth-order correction is defined to be

     d4^k = f"(x^k)^{-1} [[f"'(x^k) d2^k] d3^k - (1/6)[[f^iv(x^k) d2^k] d2^k] d2^k] .   (3.43)


Transformation functions of order higher than four may be similarly

derived. However, as will be seen below, it does not seem possible

to adequately approximate the corrections of order greater than four.

Moreover, higher-order derivatives require considerable storage and

they are very difficult to derive in general, and therefore we try

to avoid them. Additionally, the techniques proposed in Section 3.4

for the scalar search would not be as attractive for transformation

functions of order higher than four because the zeros of polynomials

of degree greater than two would be needed. Finally, it was experi-

mentally found that for one function tested, transformation functions

of order higher than four did not increase overall efficiency in

computing the solution. The selection of which transformation func-

tion to use at each iteration will be described later.

3.3.1 Approximations of Higher-Order Corrections

Observe that the computation of the third-order correction (3.41)
would require both the evaluation of f"'(x^k), a third-order tensor,
and a considerable number of multiplications. This computational
effort can be reduced by using the approximation

     d3^k ≈ f"(x^k)^{-1} f'(h2^k(1)) ,                (3.44)

where

     h2^k(1) = x^k - d2^k                             (3.45)

from (3.38). The approximation (3.44) follows from the Taylor series

     f'(x^k - d2^k) = f'(x^k) - f"(x^k) d2^k + (1/2)[f"'(x^k) d2^k] d2^k - ... .   (3.46)

Using (3.39), the first two terms cancel, and therefore

     f'(h2^k(1)) = f'(x^k - d2^k) ≈ (1/2)[f"'(x^k) d2^k] d2^k   (3.47)

is an approximation with error on the order of ||d2^k||^3, assuming the
fourth derivative of f is bounded. Comparing the equation for the
third-order correction given by (3.41), using (3.47) yields the approx-
imation (3.44). Similarly, the fourth-order correction (3.43) may be
approximated by

     d4^k ≈ f"(x^k)^{-1} f'(h3^k(1)) ,                (3.48)

where

     h3^k(1) = x^k - d2^k - d3^k                      (3.49)

from (3.40). Using the approximation for d3^k given by (3.44), the
approximation (3.48) has error of order (||d2^k|| + ||d3^k||)^3, assuming
the fourth derivative of f is bounded. The approximation and the
error bound follow from the Taylor series expansion of f'(x^k - d2^k - d3^k)
and the use of (3.39) and (3.44). The error in these approximations

continues to increase for corrections of higher order. Furthermore,
these errors are enlarged since the higher-order corrections are
multiplied by increasing powers of p, as can be seen from the third-
(3.40) and fourth- (3.42) order transformations. This error is the
major reason for considering only transformations of order four or
less.


In the next chapter, approximations to the hessian and the gra-

dient of f will be presented. These approximations will allow the

algorithm to be used even when only function values are supplied.
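The approximations (3.44) and (3.48) can be exercised on a simple convex function with nonzero third and fourth derivatives. The sketch below forms d2^k exactly, then the approximate d3^k and d4^k from gradient values only, and checks that each added correction lands closer to the minimizer (the function and starting point are choices made for this sketch):

```python
import numpy as np

# f(x) = sum(exp(x_i) - x_i) has its minimum at x = 0 and nonzero
# third and fourth derivatives, so the corrections are exercised.
grad = lambda x: np.exp(x) - 1.0
hess = lambda x: np.diag(np.exp(x))

x = np.array([1.0, 0.8])
Hinv = np.linalg.inv(hess(x))

d2 = Hinv @ grad(x)                # (3.39): second-order correction
d3 = Hinv @ grad(x - d2)           # (3.44): f''(x)^{-1} f'(h2(1))
d4 = Hinv @ grad(x - d2 - d3)      # (3.48): f''(x)^{-1} f'(h3(1))

e2 = np.linalg.norm(x - d2)             # error of the Newton point
e3 = np.linalg.norm(x - d2 - d3)        # third-order point
e4 = np.linalg.norm(x - d2 - d3 - d4)   # fourth-order point
```

Each higher-order point lies closer to the solution, illustrating why the corrections can pay off when the series converges quickly.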

3.3.2 Transformation Order Selection

Assuming that all of the transformation functions exist (this

assumption is removed later in this chapter), we wish to consider the

question of which one should be used in each iteration.

Recall that each transformation function may be thought of as a
curved trajectory passing through x^k, with the scalar parameter p
proportional to how far from x^k the next point might be. Ideally the
best transformation function order to use is the one whose trajectory
passes the closest to x*. It might seem that the higher the order,
the closer to the solution x*, since more terms are taken in the
infinite series representation of x*. However, there are two reasons

why this seemingly reasonable expectation is not usually true. First,

the process of computing the terms of the transformation functions

involves several approximations and many arithmetic operations with

ensuing errors. Second and perhaps more importantly, the infinite

series represents x* only if it converges; it must also converge very
fast. It was indeed verified numerically that usually one of the
transformation functions is better, in the sense of giving a trajec-
tory closer to x* than the others at each iteration, and the best


one is not necessarily the one with the highest order.

The proposed technique for selecting the best transformation

function is based on the convergence of the infinite series (3.33)

to a solution of the minimization problem (3.2). Recall that this

infinite series resulted from the g function (3.32), or, for the func-
tion which was eventually used, given by (3.35) with the scalar para-
meter p set to one. The procedure may be described as follows.

Select the second-order transformation if

     f(h2^k(1)) = f(x^k - d2^k) > f(x^k) = f(h2^k(0)) ;   (3.50)

otherwise select the (r-1)-order transformation if

     f(h_r^k(1)) > f(h_{r-1}^k(1)) ,   r = 3 and 4 .   (3.51)

If (3.51) is not satisfied for r = 4, the fourth-order transformation
is selected. Thus when order three or four is selected, a value
of p = 1 always gives a point which yields a function value less than
the present value. It was experimentally verified that this method
selected the best order in most iterations. In those very few itera-
tions where it did not select the best order, the order selected was
only insignificantly different from the best one.
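The selection rule (3.50)-(3.51) amounts to walking up the orders while f(h_r^k(1)) keeps decreasing. A sketch, in which the hypothetical dictionary f_at_1 stands for the already-computed values f(h_r^k(1)):

```python
def select_order(f_xk, f_at_1):
    """Transformation-order selection per (3.50)-(3.51).

    f_xk   : f(x^k) = f(h^k(0))
    f_at_1 : dict mapping order r in {2, 3, 4} to f(h_r^k(1))
    Returns the selected transformation order.
    """
    if f_at_1[2] > f_xk:            # (3.50): Newton point does not decrease f
        return 2
    for r in (3, 4):                # (3.51): stop at r-1 when order r is worse
        if f_at_1[r] > f_at_1[r - 1]:
            return r - 1
    return 4                        # all orders improved: take the fourth

# Hypothetical function values for illustration
order_a = select_order(10.0, {2: 11.0, 3: 9.0, 4: 8.0})   # (3.50) fires
order_b = select_order(10.0, {2: 6.0, 3: 7.0, 4: 5.0})    # third order worse
order_c = select_order(10.0, {2: 6.0, 3: 4.0, 4: 3.0})    # all improve
```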

Once one of the three transformation functions is selected, the
dimension of the problem has been reduced to one. To see this, note
that at the k-th iteration the problem remaining is to compute a value
of the scalar parameter p such that

     f(h_r^k(p)) < f(x^k) = f(h_r^k(0)) ,   r = 2, 3, or 4,   (3.52)

where the transformation functions were given earlier, but will be
repeated here for ease of reference. Thus

     h2^k(p) = x^k - d2^k p ,                         (3.53a)

     h3^k(p) = x^k - (3/2) d2^k p - [d3^k - (1/2) d2^k] p^2 ,   (3.53b)

     h4^k(p) = x^k - (11/6) d2^k p - (2 d3^k - d2^k) p^2 - [d4^k - d3^k + (1/6) d2^k] p^3 ,   (3.53c)

where the terms have been rearranged in powers of p, and d2^k, d3^k, and
d4^k are given by (3.39), (3.44), and (3.48) respectively. The com-
putation of a suitable value of p is described in the next section.

3.4 The Scalar Search

The scalar search for an appropriate value of the parameter p

which appears in the higher-order transformation functions is often

the most time-consuming step in a minimization algorithm. The source

of the difficulty is the requirement that most existing algorithms

have of computing a value of p that accurately solves a scalar mini-

mization problem to be described below. In this section it will be

shown that the scalar search step for the VO algorithm is not time-

consuming due to inherent properties of the transformation functions.

Furthermore, during the initial iterations of the VO algorithm, when
the estimate x^k of the solution x* is far from the solution, the
desired value of p is defined to be the solution of a scalar problem
not used in any previously reported algorithm.

Recall that at this stage we wish to find a value of p = p_k such
that

     x^{k+1} = h_r^k(p_k) ,   r = 2, 3, or 4,         (3.54)

where h_r^k is one of the transformation functions given in (3.53). If
the current point x^k is not x*, a solution of the minimization pro-
blem (3.2), then the scalar search should select a value of p such
that the descent condition

     f(h_r^k(p)) < f(x^k)                             (3.55)

is satisfied. If x^k is equal to x*, the scalar search is unnecessary.

Normally there is an infinite number of values of p for which
(3.55) is satisfied. The best value of p to choose would be the one
that minimizes the total number of iterations needed to approximate x*.

However, the computation of this value of p is impossible for general

problems since it requires information from future iterations. Some

of the considerations which lead to approximations of the best value

of p are given next.

Assume a p_k = p_k^m exists such that from (3.54) we obtain

     x^{k+1} = x* = h_r^k(p_k^m) .

Clearly the best value of p in this case would be p_k^m, and p_k^m would
be defined by

     f(h_r^k(p_k^m)) = minimize f(h_r^k(p)) ,         (3.56)
                          p

which is a scalar minimization problem. Most existing algorithms
choose p_k to be an approximation to p_k^m at each iteration; some algo-
rithms theoretically require that p_k be an accurate approximation to
p_k^m [46], in contrast with the VO algorithm, which has no such require-
ment. In practice x^{k+1} = h_r^k(p_k^m) is seldom equal to x*. While it
may be convincingly argued that p_k = p_k^m is an optimal value for some
iterations, particularly when x^k is in some neighborhood of x*, the
best value of p to choose is not p_k^m for most iterations. In fact,
when x^k is far from x*, choosing p_k = p_k^m tends to force any minimi-
zation algorithm to follow the bottom of narrow valleys with typically
slow progress towards x* [30]. Therefore, an ideal scheme would
choose p_k = p_k^m when x^k is in some neighborhood of x*, and would
choose p_k to stay away from the bottom of narrow valleys whenever
x^k is far from x*.

The scalar search of the VO algorithm first attempts to establish
whether x^k is close to x*. If x^k is close to x*, then p_k is set to
an approximation of p_k^m, a solution of the scalar minimization prob-
lem (3.56). The details are described below. If x^k is far from x*,
then p_k is computed under the principle that x^{k+1} = h_r^k(p_k) should be
as far away from x^k as possible. The method for computing p_k when
x^k is far from x* will be described in two parts: 1) if the second-
order transformation was selected (r = 2), and 2) if the third- or


the fourth-order transformation was selected (r = 3 or 4).

3.4.1 Iterations Close to x*

If x^k is close to x*, then ||f'(h_r^k(1))|| will be small due to
the manner in which the transformation functions h_r^k were derived.
Thus, if

     ||f'(h_r^k(1))|| ≤ ε_c                           (3.57)

for some ε_c > 0, it is concluded that the choice p_k = p_k^m should be
made. (For the tested functions, which are not badly scaled, a single
fixed value of ε_c was reasonable.) The test (3.57) can be made without
further gradient evaluations, since it was shown in Section 3.3.1
that the two gradients f'(h2^k(1)) and f'(h3^k(1)) are evaluated while
computing the approximations to the third- and fourth-order correc-
tions given in (3.44) and (3.48). If the fourth-order transformation is
selected, it was found that it is not necessary to evaluate f'(h4^k(1));
rather, the results of the test for f'(h3^k(1)) could be used.

Having identified that x^k is close to x* in the above manner,
an approximation to p_k^m needs to be computed. The following procedure
was satisfactory for the functions tested. Evaluate f(h_r^k(p)) for
p = 2, 3, ..., L-1, L, L+1, until

     f(h_r^k(L-1)) > f(h_r^k(L)) < f(h_r^k(L+1)) ,    (3.58)


which is a standard method of bracketing the scalar minimum [30].

The minimum of the quadratic polynomial in p which passes through the

three points [30] obtained in (3.58) is computed given by

f( k(L-1)) 4f(h (L)) + 3f(hk(L+1))
S= L -I k k k (3.59)
f(h (L-1)) 2f(h (L)) + f(h (L+1))
rGr r

If p is close to L (Ip LI < .02), set pk = L to complete the

scalar search for this iteration. If p is not close to L, then

f(h (p)) is evaluated, and if f(h (p)) < f(h (L)), set pk = ,

otherwise set pk = L. This completes the scalar search for the case
k *
when x is close to x

For most local minima p_k^m = 1, and the above procedure should yield
p_k = 1, requiring only one additional function evaluation. However, for
local minima with a positive semidefinite hessian, p_k^m will generally
be greater than one. Therefore, in the actual implementation shown in
Appendix I, if the minimum is not located for p ≤ 4, the function is
evaluated at p = 10, 22, 46, 94, 190, ..., etc., until the minimum is
bracketed.

3.4.2 Iterations Far from x*

When x^k is far from x*, the expression (3.57) is not satisfied.
In this case, the basic principle to be proposed is that x^(k+1) should
be as far away from x^k as possible, subject to satisfying (3.55).
This principle defines the desired p_k to be a solution of

    maximize    || h_r^k(p) - x^k ||    (3.60a)

    subject to  f(h_r^k(p)) ≤ f(x^k) - c_k    (3.60b)

where c_k ≥ 0 will be defined to insure that f(x^(k+1)) is sufficiently
less than f(x^k). An accurate solution of (3.60) would be difficult
to obtain computationally. However, first, an accurate solution is
not required, and second, when the third- or fourth-order transformation
is selected, trial values of p that may approximate a solution of
(3.60) may be found from already available information. The pro-
cedure, if the second-order transformation is selected, is described
first.

Second-Order Transformation Selected. If the second-order
transformation is selected at the kth iteration, the search for an
approximation to a solution of (3.60) is along a straight line in
the space of the independent variables, since h_2^k(p) is a linear func-
tion of p; thus (3.60a) is linear in p, and it is maximized by the
largest possible value of p. In this case c_k is set to zero, which
implies that we desire any descent. The procedure proposed may be
generally described as fitting, and computing the minimum of, approx-
imating polynomials, which attempts to satisfy the descent constraint
(3.60b). Then, attempting to satisfy (3.60a), a constant is added to
the computed minimum of the approximating polynomial. The details
are given next.

The following information is already available: f(h_2^k(0)),
f'(h_2^k(0)), f(h_2^k(1)), and f'(h_2^k(1)). In addition,

    f'(h_2^k(0))^T h_2^k'(0) < 0    (3.61)

where h_2^k'(0) is the first derivative of h_2^k(p) with respect to p,
evaluated at p = 0. Expression (3.61) implies the existence of a
p > 0 that satisfies (3.60b) (see Lemma 3.1). If the expression

    f(h_2^k(1)) < f(h_2^k(0)) = f(x^k)    (3.62)

is satisfied, then (3.60b) is also satisfied. Whenever (3.62) is
satisfied, it is computationally efficient to select p_k = 1, since the
function and the gradient are already evaluated for the next itera-
tion. In addition, it is unlikely that the descent constraint (3.60b)
will be satisfied for p_k > 1 because of the manner in which the
transformation function order is selected. Thus whenever (3.62) is
satisfied, p_k is set to one.

If (3.62) is not satisfied, the minimum within the interval (0, 1)
of the cubic Hermite polynomial in p fitted through the available
information is computed as follows [14]:

    p_c = 1 - (s_1 + a - b) / (s_1 - s_0 + 2a)    (3.63a)

where

    s_1 = f'(h_2^k(1))^T h_2^k'(1)    (3.63b)

    s_0 = f'(h_2^k(0))^T h_2^k'(0)    (3.63c)

    b = 3 [f(h_2^k(0)) - f(h_2^k(1))] + s_0 + s_1    (3.63d)

    a = [b² - s_0 s_1]^(1/2)    (3.63e)

Then let

    p̄_c = max {0.1, p_c + min {p_c, 1 - p_c}/2}    (3.64)

Then after evaluating f(h_2^k(p̄_c)), if

    f(h_2^k(p̄_c)) < f(x^k)    (3.65)

set p_k = p̄_c, and the scalar search is done. Observe that (3.64) is
an attempt at satisfying (3.60a). Finally, if (3.65) is not satis-
fied, the procedure becomes iterative. The minimum of the quadratic
in p through the function and derivative at p = 0, and through the
function at p = p̄_c, is computed as follows

    p_c = -.5 s_0 p̄_c² / [f(h_2^k(p̄_c)) - f(h_2^k(0)) - s_0 p̄_c]    (3.66)

where s_0 is given in (3.63c). Define

    p̄_c = max {p_c, p̄_c/4}    (3.67)








Then after evaluating f(h_2^k(p̄_c)), if (3.65) is satisfied, the search
is done with p_k = p̄_c; otherwise the process is repeated, beginning
with (3.66). Figure 3.1 summarizes the steps in a flow chart. For
the functions tested, the second-order transformation was rarely
selected. Most of the time when it was selected, the search ended
with (3.64); thus only one additional function evaluation was needed
most of the time the second-order transformation was selected.
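The cubic step (3.63)-(3.64) is compact enough to state directly. In this sketch, f0 and f1 stand for f(h_2^k(0)) and f(h_2^k(1)), and s0 and s1 for the directional derivatives of (3.63b)-(3.63c); the returned value is the safeguarded p̄_c of (3.64), not the raw minimizer p_c:

```python
import math

def cubic_step(f0, s0, f1, s1):
    """One cubic Hermite fit on [0, 1]: minimizer p_c of (3.63a),
    followed by the safeguard of (3.64)."""
    b = 3.0 * (f0 - f1) + s0 + s1                      # (3.63d)
    a = math.sqrt(b * b - s0 * s1)                     # (3.63e); s0 < 0 keeps this real
    p_c = 1.0 - (s1 + a - b) / (s1 - s0 + 2.0 * a)     # (3.63a)
    return max(0.1, p_c + min(p_c, 1.0 - p_c) / 2.0)   # safeguard (3.64)
```

For the quadratic phi(p) = (p - 1/2)² the fit gives p_c = 1/2 exactly, and (3.64) then pushes the trial point out to 0.75.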

Third- or Fourth-Order Transformation Selected. Whenever the
third- or fourth-order transformation is selected, the search for a
p_k to approximate a solution of (3.60) is no longer along straight
lines in the space of the independent variables. Note that h_3^k(p) and
h_4^k(p), given in (3.53b) and (3.53c), are polynomials in p of degree
greater than one. For clarity of notation, the superscript and sub-
script of the transformation function will be dropped; i.e., for the
present discussion

    h(p) = h_r^k(p),   r = 3 or 4    (3.68)

Additionally, the individual components of the transformation vector
function will be needed. Thus let h(p) be defined by

    h(p) = (h_1(p), h_2(p), ..., h_n(p))^T    (3.69)

Figure 3.1 Flowchart of a portion of the Scalar Search step of the
VO algorithm.

Therefore, since x^(k+1) = h(p_k), the ith component of all the possible
points that may become x^(k+1) is given by

    x_i = h_i(p)    (3.70)

This time, maximizing || h(p) - x^k ||, as defined in (3.60), may not be
simply achieved by increasing p, as it is if the second-order trans-
formation is selected. In particular, since each coordinate is given
by (3.70), there may be certain values of p for which the square of
the difference

    (x_i - x_i^k)² = (h_i(p) - x_i^k)²    (3.71)

is a maximum. This would certainly tend to satisfy the principle
that x^(k+1) should be as far as possible from x^k. A necessary condition
to maximize (3.71), which would tend to satisfy (3.60a), is given by
differentiating (3.71) with respect to p and setting it equal to zero,
to obtain the equation

    h_i'(p) = 0    (3.72)

This equation is a linear equation in p for the third-order trans-
formation, and a quadratic polynomial in p for the fourth-order
transformation. Therefore, its zeros may be easily found. If any
of these zeros is positive, it implies that the ith coordinate moves
away from x_i^k and, at the value of p equal to a positive zero of (3.72),
it begins to move closer to x_i^k again. Therefore, positive zeros are
suitable candidates to satisfy (3.60a). It is proposed that these
zeros be computed for all coordinates using (3.72), and that
any which are not positive be discarded. These zeros will be considered trial
values of p later.

Additional trial values of p are obtained by the following
argument. After expansion of f(x) in a Taylor series about x^k and
substitution of x = h(p), the following expression is obtained

    f(h(p)) = f(x^k) + f'(x^k)^T [h(p) - x^k] + ...    (3.73)

Since it is desired to compute a p which yields f(h(p)) sufficiently
less than f(x^k), the term

    f'(x^k)^T [h(p) - x^k]    (3.74)

should be as negative as possible. Therefore, values of p for which
(3.74) may achieve a minimum value are points that are easily com-
puted. The necessary condition yields

    f'(x^k)^T h'(p) = 0    (3.75)

which is a polynomial in p with zeros that may yield additional trial
values of p, if any of the zeros are positive.

Before describing how these trial values of p are used in approxi-
mating a solution of (3.60), the c_k appearing in (3.60b) needs to be
defined. Recall that f(h(1)) is already evaluated and it is less
than f(x^k). The constant c_k is defined such that a value of p could
be used provided it does not yield a function value too much greater
than f(h(1)). This is accomplished by defining c_k by

    c_k = -min {10 f(h(1)) - f(x^k), .1 w},   f(h(1)) ≥ 0,
                                                            (3.76)
    c_k = -min {.1 f(h(1)) - f(x^k), .1 w},   f(h(1)) < 0,

where

    w = f(h(1)) - f(x^k)
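The acceptance threshold f(x^k) - c_k implied by (3.76) can be sketched as follows (a hypothetical helper; f_h1 denotes f(h(1)) and is assumed to be less than f_x = f(x^k)):

```python
def descent_bound(f_x, f_h1):
    """Largest function value a trial p may produce under (3.76):
    a trial is accepted if f(h(p)) <= f_x - c_k."""
    w = f_h1 - f_x                                   # decrease achieved at p = 1 (< 0)
    if f_h1 >= 0:
        c_k = -min(10.0 * f_h1 - f_x, 0.1 * w)
    else:
        c_k = -min(0.1 * f_h1 - f_x, 0.1 * w)
    return f_x - c_k
```

The threshold works out to min {10 f(h(1)), f(x^k) + .1 w} when f(h(1)) ≥ 0, and min {.1 f(h(1)), f(x^k) + .1 w} otherwise, i.e. a value only moderately above f(h(1)), as the text requires.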

The constraint (3.60b) is now defined, and the zeros previously
found are candidates to satisfy it. It was found experimentally that
a value of p greater than six never satisfied (3.60b). In addition,
since f(h(1)) is less than f(x^k), only values of p in the range
1 < p < 6 are considered (note that non-integer values are used).
All the zeros previously obtained from (3.72) and (3.75) within the
above range are sorted. Then, beginning from the largest value and
on to the smallest one, the function is evaluated, and as soon as
(3.60b) is satisfied, the scalar search is complete. In case (3.72)
and (3.75) yield no trial values of p in the range 1 < p < 6, the
function is evaluated at f(h(p)), for p = 2, 3, ..., p_k, p_k + 1, until
f(h(p_k)) satisfies (3.60b) and f(h(p_k + 1)) does not. For all the
functions tested, in most iterations (3.72) and (3.75) yielded trial
values of p. Furthermore, in most iterations only one additional
function evaluation was needed to end the search.
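For the fourth-order case the trial values are roots of quadratics, so they can be enumerated in closed form. The sketch below assumes the ith coordinate of the trajectory has been put in the form h_i(p) = x_i + C1[i] p + C2[i] p² + C3[i] p³ (the coefficient vectors being assembled from the corrections as in (3.53c)), and g is the gradient f'(x^k); both the per-coordinate condition (3.72) and the directional condition (3.75) are used:

```python
import math

def positive_quad_roots(a, b, c):
    """Positive real roots of a p^2 + b p + c = 0 (a may be 0: linear case)."""
    if a == 0:
        return [-c / b] if b != 0 and -c / b > 0 else []
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    r = math.sqrt(disc)
    return [p for p in ((-b + r) / (2 * a), (-b - r) / (2 * a)) if p > 0]

def trial_values(C1, C2, C3, g):
    """Sorted trial p values in (1, 6); the scalar search walks this
    list from its largest entry down."""
    trials = []
    for c1, c2, c3 in zip(C1, C2, C3):
        trials += positive_quad_roots(3 * c3, 2 * c2, c1)   # h_i'(p) = 0, (3.72)
    n = len(g)
    gc1 = sum(g[i] * C1[i] for i in range(n))
    gc2 = sum(g[i] * C2[i] for i in range(n))
    gc3 = sum(g[i] * C3[i] for i in range(n))
    trials += positive_quad_roots(3 * gc3, 2 * gc2, gc1)    # f'(x)^T h'(p) = 0, (3.75)
    return sorted(p for p in trials if 1 < p < 6)
```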

3.5 Hessian Singular or Negative Definite

In computing the second-, third-, and fourth-order corrections
given by (3.39), (3.44), and (3.48), there are two major difficulties
to be considered concerning the hessian matrix: it may be singular,
and it may be negative definite. If the hessian is singular, the
proposed corrections cannot be computed. If the hessian is negative
definite, the current point x^k is not in some small neighborhood of
a strict local minimum. Furthermore, if the hessian is not positive
definite, the proposed transformations may not give descent tra-
jectories. Recall that if

    f'(x^k)^T h_r^k'(0) < 0    (3.77)

then the existence of a p > 0 which satisfies

    f(h_r^k(p)) < f(x^k)    (3.78)

is implied, as shown in Lemma 3.1. Observe that

    h_r^k'(0) = -a_r d_2^k,   r = 2, 3, or 4    (3.79)

where, from (3.53), a_2 = 1, a_3 = 3/2, and a_4 = 11/6. Therefore, the
descent condition (3.77) becomes

    -a_r f'(x^k)^T [f"(x^k)]^(-1) f'(x^k) < 0    (3.80)

where (3.39), which defines d_2^k, was used. The inequality (3.80) may
not necessarily be satisfied whenever f"(x^k) is not positive definite.

Moreover, even if (3.80), and thus (3.78), is satisfied when the
hessian is not positive definite, the descent trajectory may still
be undesirable. Recall that the transformation functions were derived
to compute a point which would yield a zero gradient. While the
gradient is zero at a local minimum, it is also zero at a local maxi-
mum and at a saddle point [38]. Therefore, a descent trajectory may
lead towards a saddle point. Saddle points are even more difficult,
since the transformation functions may yield trajectories towards a
saddle point even when the hessian is positive definite. This dif-
ficulty will be discussed again when the global convergence of the
algorithm is established in the next section. Thus, singularity and
non-positive definiteness of the hessian are signals to be used in
avoiding these troublesome points.

Since the hessian inverse is used in Newton's minimization
algorithm [34], several alternatives have been proposed whenever the
hessian is not positive definite [30,34]. The method we propose is
a modification of the one given by Murray [36] for a Newton-like
minimization algorithm. The principle of Murray's method may be
described as the computation of the Newton or second-order correction
d_2^k by solving the linear system of equations

    F^k d_2^k = f'(x^k)    (3.81a)

where

    F^k = f"(x^k) + D^k    (3.81b)

and D^k is a diagonal matrix which is computed to insure that F^k is
positive definite. If the hessian f"(x^k) is already positive de-
finite, the matrix D^k is the zero matrix, and d_2^k is the second-order
correction as defined earlier. Observe that in the VO algorithm
the approximations to the third- and the fourth-order corrections,
(3.44) and (3.48), may also be defined as solutions of linear systems
of equations with coefficient matrices equal to F^k.

Murray's procedure for computing D^k is based on the Cholesky
factorization of a positive definite matrix. The Cholesky factor-
ization may be described as the computation of the upper triangular
matrix U such that

    F^k = U^T U    (3.82)

A by-product of this factorization is the diagonal matrix D^k. The
modification we propose adds a pivoting strategy to this factorization
procedure. The result of this modification is that the diagonal
matrix D^k will tend to have fewer nonzero diagonals than in
the original procedure. Once the factorization is computed, the
computation of all the high-order corrections is simply obtained, as
shown later. The details of Murray's procedure are given next, fol-
lowed by the details of the proposed modification.

3.5.1 Murray's Procedure

Equating matrix elements in (3.82) yields the ith row of U, given by

    u_ii = { [F^k]_ii - Σ_{m=1}^{i-1} u_mi² }^(1/2)    (3.83a)

    u_ij = { [F^k]_ij - Σ_{m=1}^{i-1} u_mi u_mj } / u_ii,   j = i+1, ..., n    (3.83b)

It can be shown [53-54] that if F^k is positive definite, all diagonal
elements given by (3.83a) are greater than zero, and that all the
elements of U are bounded by

    |u_ij| ≤ max_i { [F^k]_ii^(1/2) },   i, j = 1, ..., n    (3.84)

The procedure due to Murray is, in effect, to obtain D^k in (3.81) such
that all diagonal elements in (3.83a) are bounded by

    δ ≤ u_ii ≤ n β,   i = 1, ..., n    (3.85)

where β may be defined by

    β = max { |[f"(x^k)]_ij|^(1/2) : i, j = 1, ..., n }    (3.86)

and δ > 0 is a given constant which is used below, in effect, to iden-
tify the square root of a numerical zero due to round-off errors
(δ = 10^(-s/2) gave good results, where s is the number of significant
digits of the numbers in the computer). The off-diagonal elements
are also bounded by

    |u_ij| ≤ β,   i = 1, ..., n-1;  j > i    (3.87)

The ith stage of the procedure may be described as follows. Define
the quantity

    ū_i = max { δ, | [f"(x^k)]_ii - Σ_{m=1}^{i-1} u_mi² |^(1/2) }    (3.88)

which will be considered as the candidate for the diagonal u_ii, and

    ū_j = [f"(x^k)]_ij - Σ_{m=1}^{i-1} u_mi u_mj,   j = i+1, ..., n    (3.89)

Observe that u_ij = ū_j / u_ii, j = i+1, ..., n, will be the off-diagonal
elements of the ith row once u_ii has been computed. If the relation
given by

    (1/ū_i) max { |ū_j| : j = i+1, ..., n } ≤ β    (3.90)

is satisfied, then set

    u_ii = ū_i    (3.91)

otherwise set

    u_ii = (1/β) max { |ū_j| : j = i+1, ..., n }    (3.92)

Then the rest of the ith row is given by

    u_ij = ū_j / u_ii,   j = i+1, ..., n    (3.93)

The ith diagonal of the matrix D^k is given by

    d_ii = u_ii² - [f"(x^k)]_ii + Σ_{m=1}^{i-1} u_mi²    (3.94)

It can be shown [36], in a straightforward manner, that the bounds
(3.85) and (3.87) apply to this procedure.
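A direct transcription of this stage-by-stage procedure (without the pivoting of the next subsection) might look as follows; here H stands for f"(x^k) stored as a full symmetric array, and delta for δ:

```python
import math

def murray_factor(H, delta):
    """Murray-style modified Cholesky: returns (U, D) with
    U^T U = H + diag(D), applying the safeguards (3.88)-(3.94)."""
    n = len(H)
    beta = max(abs(H[i][j]) for i in range(n) for j in range(n)) ** 0.5   # (3.86)
    U = [[0.0] * n for _ in range(n)]
    D = [0.0] * n
    for i in range(n):
        s = H[i][i] - sum(U[m][i] ** 2 for m in range(i))
        u_bar = max(delta, math.sqrt(abs(s)))                  # candidate diagonal (3.88)
        offs = [H[i][j] - sum(U[m][i] * U[m][j] for m in range(i))
                for j in range(i + 1, n)]                      # (3.89)
        big = max((abs(v) for v in offs), default=0.0)
        if big / u_bar > beta:                                 # test (3.90) fails
            u_bar = big / beta                                 # (3.92)
        U[i][i] = u_bar
        for j, v in zip(range(i + 1, n), offs):
            U[i][j] = v / u_bar                                # (3.93)
        D[i] = u_bar ** 2 - H[i][i] + sum(U[m][i] ** 2 for m in range(i))  # (3.94)
    return U, D
```

On the indefinite matrix of the illustrative example in Section 3.5.3, this reproduces D = diag(.25, 4, 800) and the unpivoted factor shown there.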

3.5.2 Proposed Modification to Murray's Procedure

The modification to Murray's procedure is motivated by a desire
that the number of nonzero elements in D^k be as few as possible, to
in effect use as much of the hessian as possible. If some form of
diagonal pivoting is added to Murray's procedure, not only will the
number of nonzero diagonals of D^k tend to be small, but numerical
stability may also be gained. Therefore, it is proposed that at each
stage of the factorization procedure the strongest diagonal is
selected, where the strongest diagonal is defined as the diagonal
which generates the smallest, in absolute value, maximum off-diagonal
element in its row. Efficient implementation of this modification is
described next.

First, recognize that the Cholesky factorization of F^k with
diagonal pivoting may be described as the computation of the upper
triangular matrix U such that

    P F^k P^T = U^T U    (3.95)

where P is a permutation matrix with columns equal to a permutation
of the columns of the unit diagonal matrix. The procedure begins by
initializing the elements of U as follows

    u_ij^(0) = [f"(x^k)]_ij,   i = 1, ..., n;  j = i, ..., n    (3.96)

The elements of U are then modified iteratively. To describe the ith
stage of the procedure, we begin by defining the set

    {e_k} = { max ( |u_jk^(i-1)|, j=i, ..., k-1;  |u_kj^(i-1)|, j=k+1, ..., n ),
              k = i, ..., n }    (3.97)

which contains the maximum absolute value of the off-diagonal elements
in each row not yet processed. The next diagonal to be selected for
pivoting is based on the following sequential tests:

1) If i = n, the set {e_k} is empty. Select the remaining diagonal.

2) Otherwise, if any element of {e_k} is zero, select the diag-
onal corresponding to the first such zero element. This is
a row where all off-diagonal elements are zero.

3) Otherwise, if the set {e_k / u_kk^(i-1) : u_kk^(i-1) ≠ 0, k = i, ..., n}
is not empty, select the diagonal corresponding to its
smallest element (the first one if ties exist).

4) Otherwise, select the diagonal corresponding to the smallest
element of the set {e_k} (the first one if ties exist). This
choice occurs when all remaining diagonal elements are zero.
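The four tests can be sketched as a small selection routine over the active trailing block of the work matrix (stored in its upper triangle). Taking the absolute value of the diagonals in test 3 is an interpretation of the stated rule:

```python
def select_pivot(U, i):
    """Return the index of the diagonal chosen at stage i by the
    sequential tests of Section 3.5.2."""
    n = len(U)
    if i == n - 1:
        return i                                      # test 1: only one diagonal left
    def e(k):                                         # e_k of (3.97)
        return max([abs(U[j][k]) for j in range(i, k)] +
                   [abs(U[k][j]) for j in range(k + 1, n)])
    ek = {k: e(k) for k in range(i, n)}
    for k in range(i, n):
        if ek[k] == 0:
            return k                                  # test 2: a decoupled row
    ratios = {k: ek[k] / abs(U[k][k]) for k in range(i, n) if U[k][k] != 0}
    if ratios:
        return min(ratios, key=lambda k: (ratios[k], k))   # test 3
    return min(ek, key=lambda k: (ek[k], k))          # test 4: all diagonals zero
```

On the hessian of the illustrative example in Section 3.5.3, the first stage selects diagonal 3 (index 2), matching the pivot order 3, 2, 1 quoted there.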

The appropriate interchange of rows and columns is done next in order
to bring the selected diagonal into the ith row. This interchange
is noted in the permutation matrix P of (3.95). Now define

    ū_i = max { δ, |u_ii^(i-1)|^(1/2) }    (3.98)

If

    (1/ū_i) max { |u_ij^(i-1)| : j = i+1, ..., n } ≤ β    (3.99)

set

    u_ii^(i) = ū_i    (3.100)

otherwise set

    u_ii^(i) = (1/β) max { |u_ij^(i-1)| : j = i+1, ..., n }    (3.101)

The ith diagonal of the permuted D^k matrix is given by

    d_ii = (u_ii^(i))² - u_ii^(i-1)    (3.102)

The rest of the row becomes

    u_ij^(i) = u_ij^(i-1) / u_ii^(i),   j = i+1, ..., n    (3.103)

The rest of the matrix is updated as follows

    u_kj^(i) = u_kj^(i-1) - u_ik^(i) u_ij^(i),   k = i+1, ..., n;  j = k, ..., n    (3.104)


This completes the ith stage of the procedure. The pivoting strategy
proposed insures that the remaining matrix is changed by a small
amount, since the maximum absolute value of the change to the re-
maining matrix in (3.104) is minimized. Observe that double pre-
cision is recommended to store the matrix U in the modified procedure,
since inner products can no longer be efficiently accumulated [53-54]
as is possible in the original method.

3.5.3 Illustrative Example

The following example illustrates the effect of pivoting. Let
f"(x^k) for a three-dimensional problem be given by

    f"(x^k) = [   0    1  -10
                  1    4    0
                -10    0  400 ]

Let δ = 10^(-s/2) as above, and for this matrix β = 20 from (3.86). Without
pivoting, the method proposed by Murray yields D^k given by

    D^k = diag (.25, 4, 800)

and

    U = [ .5   2  -20
           0   2   20
           0   0   20 ]

The proposed modification, with a diagonal pivot order 3, 2, 1, yields

    D^k = diag (1, 0, 0)

and

    U = [ 20   0   -.5
           0   2    .5
           0   0  (.5)^(1/2) ]

3.5.4 Computation of High-Order Corrections

After the factorization (3.95) is completed, the high-order
corrections are computed as follows. Instead of (3.39), we may now
solve

    [f"(x^k) + D^k] d_2^k = f'(x^k)    (3.105)

Define v by

    d_2^k = P^T v    (3.106)

where P^T is the transpose of the permutation matrix used in the factoriza-
tion procedure. Now multiply (3.105) by P, and use (3.106) to obtain

    P [f"(x^k) + D^k] P^T v = P f'(x^k)    (3.107)

From (3.81) and (3.95), the factorization process transforms (3.107)
into

    U^T U v = P f'(x^k)    (3.108)

which may be solved by first solving

    U^T w = P f'(x^k)    (3.109)

for w by forward substitution, and then solving

    U v = w    (3.110)

for v by back substitution. The second-order correction is then
obtained from (3.106). Similarly, the approximation to the third-
order correction given in (3.44) now becomes the solution of

    [f"(x^k) + D^k] d_3^k = f'(h_2^k(1))    (3.111)

and the fourth-order correction given by (3.48) now becomes

    [f"(x^k) + D^k] d_4^k = f'(h_3^k(1))    (3.112)

These two systems of equations have the same coefficient matrix as
(3.105), and therefore the same factorization applies. The solutions
of (3.111) and (3.112) are obtained similarly to the procedure out-
lined for solving (3.105).
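The two triangular solves, together with the permutation bookkeeping of (3.106), can be sketched as follows; perm[i] here names the original row placed in position i by P:

```python
def solve_correction(U, perm, g):
    """Solve [f''(x) + D] d = g using the pivoted factors of (3.95):
    forward substitution for U^T w = P g, back substitution for
    U v = w, then d = P^T v as in (3.106)."""
    n = len(g)
    pg = [g[perm[i]] for i in range(n)]                       # P g
    w = [0.0] * n
    for i in range(n):                                        # (3.109)
        w[i] = (pg[i] - sum(U[m][i] * w[m] for m in range(i))) / U[i][i]
    v = [0.0] * n
    for i in reversed(range(n)):                              # (3.110)
        v[i] = (w[i] - sum(U[i][j] * v[j] for j in range(i + 1, n))) / U[i][i]
    d = [0.0] * n
    for i in range(n):                                        # d = P^T v
        d[perm[i]] = v[i]
    return d
```

With the same factors, (3.111) and (3.112) are solved by calling this routine again with right-hand sides f'(h_2^k(1)) and f'(h_3^k(1)).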

3.6 Convergence of the VO Algorithm

Two convergence properties of the VO algorithm are given in this
section. First, we establish the class of functions for which the
algorithm is globally convergent. Second, it will be shown that the
VO algorithm generates a sequence with a high order of convergence
for most functions. When an algorithm generates sequences with a
high order of convergence, an approximation to a solution of the mini-
mization problem (3.2) can be computed in a small number of itera-
tions, if the initial point is close to the solution.

3.6.1 Global Convergence

The global convergence of the VO algorithm will be established
by using the general analysis of algorithms developed mainly by
Zangwill [56]. A brief review of this analysis is given below.
The new algorithm is then recast in a manner which allows the
results of this analysis to be used.

A minimization algorithm may be generally described by a point-
to-set mapping. A point-to-set mapping assigns to every point
x ∈ E^n a subset of E^n. Let A be a point-to-set mapping; then A(x)
may be represented by

    A(x) = {y ∈ E^n}

where the definition of the elements y constitutes part of the algo-
rithm. The sequence of points {x^k} generated by the algorithm is
given by

    x^(k+1) ∈ A(x^k)

beginning from some initial point x^0. The point selected from the
set A(x^k) at each iteration is also part of the details of the algo-
rithm. It is clear that the sequence {x^k} cannot be predicted solely
from knowledge of the initial point x^0. As the scalar search for the
VO algorithm demonstrates, similar algorithms using the same trans-
formation functions could implement the scalar search somewhat
differently. This difference may be enough to generate different
sequences, but as the Global Convergence Theorem will show, the
different sequences may still converge. Thus the point-to-set
mapping concept aids in analyzing classes of algorithms without
describing their steps in detail.

An important property of point-to-set mappings, which is re-
quired later on, is that they may be closed. A point-to-set mapping
A is said to be closed at x̄ if the assumptions

1) x^k → x̄,

2) y^k → ȳ,   y^k ∈ A(x^k),

imply

3) ȳ ∈ A(x̄).

The closedness property is a generalization of continuity of point-
to-point mappings or ordinary functions.

The main result due to Zangwill may now be given.

Global Convergence Theorem. Let the point-to-set map A(x) be
an algorithm on E^n, and suppose that, given x^0, the sequence {x^k} is
generated satisfying

    x^(k+1) ∈ A(x^k)

Let Ω be a subset of E^n defined as the set of solution points of the
minimization problem (3.2), and suppose

1) All points x^k are in a compact set.

2) The function f(x) is continuous, and

a) if x ∉ Ω, then f(y) < f(x) for all y ∈ A(x),

b) if x ∈ Ω, then either the algorithm terminates, or
for all y ∈ A(x), f(y) ≤ f(x).

3) The map A is closed at points outside Ω.

Then either the algorithm stops at a solution point in Ω, or the
limit of any convergent subsequence of {x^k} is a solution point in Ω.

The proof may be found in [56,34]. Condition 1 of the theorem in-
sures the existence of a convergent subsequence. Its violation
normally indicates that the minimization problem has no finite


solution, and thus this condition is not very restrictive. Condition
2 is normally satisfied by a suitable transformation function and a
scalar search, using the terminology of the VO algorithm. Condition 3
is usually the most challenging. For the new algo-
rithm, the satisfaction of this condition imposes continuity require-
ments on the function and its first two derivatives, as well as the
additional condition of pseudoconvexity.

The VO algorithm will be described as a point-to-set composition
mapping given by

    A(x) = S_r(M_r(x))    (3.113)

where M_r is a point-to-point map, and S_r is a point-to-set map. The
following lemma [56] establishes the conditions on each mapping to
yield a closed composition.

Lemma 3.2. Let M : E^n → E^m be a point-to-point map and S : E^m → {y ⊂ E^n}
be a point-to-set map. Assume M is continuous at x and S is closed at
M(x). Then the point-to-set map A(x) = S(M(x)) is closed at x.

The point-to-point mapping M_r : E^n → E^(rn), which characterizes
the transformation phase of the VO algorithm, may be described by

    M_r(x) = (x, d(r)),   r = 2, 3, or 4    (3.114a)

where d(r) denotes sets of correction terms given by

    d(r) = (d_2),          r = 2,
    d(r) = (d_2, d_3),     r = 3,
    d(r) = (d_2, d_3, d_4),   r = 4,    (3.114b)

and the corrections d_2, d_3, and d_4 are the solutions of

    [f"(x) + D] d_2 = f'(x)    (3.114c)

    [f"(x) + D] d_3 = f'(x - d_2)    (3.114d)

    [f"(x) + D] d_4 = f'(x - d_2 - d_3)    (3.114e)

(Note that the diagonal matrix D has been included as discussed in
Section 3.5.4.) In order to make use of Lemma 3.2, we need to estab-
lish that M_r is continuous.

Lemma 3.3. If the gradient and the hessian of f(x) are continu-
ous, the mapping M_r(x) given in (3.114) is continuous.

The proof is immediate, since the diagonal matrix D is computed to
insure that [f"(x) + D] is non-singular, and the diagonals of D are con-
tinuous functions of the elements of the hessian f"(x), as can be
readily seen in (3.98)-(3.102).




Now consider the point-to-set mapping S_r : E^(rn) → {y ⊂ E^n},
which operates on M_r(x) and characterizes the scalar search phase
of the VO algorithm. It may be conveniently expressed as follows

    S_r(x, d(r)) = { y : y = h_r(p) for p ≥ 0 and f(h_r(p)) ≤ f(x) },
                   r = 2, 3, or 4,    (3.115a)

where

    h_r(p) = x - d_2 p,   r = 2,

    h_r(p) = x - (3/2) d_2 p - [d_3 - (1/2) d_2] p²,   r = 3,

    h_r(p) = x - (11/6) d_2 p - (2 d_3 - d_2) p² - [d_4 - d_3 + (1/6) d_2] p³,   r = 4.    (3.115b)

Observe that (3.115b) consists of the three transformation functions
derived in Section 3.3. The following lemma establishes conditions
that are sufficient for S_r to be closed.
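For concreteness, the three trajectories of (3.115b) can be evaluated per coordinate as follows (a sketch; x and the corrections are same-length sequences, and which corrections are supplied determines r):

```python
def h_r(x, d2, d3=None, d4=None, p=1.0):
    """The transformation trajectories of (3.115b), per coordinate."""
    n = len(x)
    if d3 is None:                                   # r = 2: straight line
        return [x[i] - d2[i] * p for i in range(n)]
    if d4 is None:                                   # r = 3: parabolic arc
        return [x[i] - 1.5 * d2[i] * p - (d3[i] - 0.5 * d2[i]) * p * p
                for i in range(n)]
    return [x[i] - (11.0 / 6.0) * d2[i] * p          # r = 4: cubic arc
              - (2 * d3[i] - d2[i]) * p * p
              - (d4[i] - d3[i] + d2[i] / 6.0) * p ** 3
            for i in range(n)]
```

At p = 1 each trajectory passes through x - d_2 - ... - d_r, the point whose gradient the next correction in (3.114) is built from.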

Lemma 3.4. If f'(x) ≠ 0 and f(x) is continuous, the mapping S_r
given in (3.115) is closed.

Proof: Recall that in order to show that S_r is closed, the
assumptions

1) (x^k, d^k(r)) → (x, d(r)),

2) y^k → y,   y^k ∈ S_r(x^k, d^k(r)),

must imply

3) y ∈ S_r(x, d(r)).

Suppose first that r = 2. Then for each k,

    y^k = x^k - p_k d_2^k    (3.116)

The assumption f'(x) ≠ 0 implies that d_2^k ≠ 0 for all k, from (3.114c).
Thus one may write from (3.116)

    p_k = || y^k - x^k || / || d_2^k ||

which, when taking limits, yields

    p = || y - x || / || d_2 ||

This establishes the existence of a limit for the sequence {p_k}. It
then follows that y = x - p d_2. It remains to be shown that
y ∈ S_2(x, d_2). For each k, p_k satisfies

    f(y^k) = f(x^k - p_k d_2^k) ≤ f(x^k)    (3.117)

That such a p_k exists follows from the fact that

    f'(x^k)^T h_2^k'(0) = -f'(x^k)^T d_2^k < 0    (3.118)

and from Lemma 3.1. Taking limits in (3.117), and using the assump-
tion that f(x) is continuous, yields

    f(y) ≤ f(x)

and hence y ∈ S_2(x, d_2). Now consider the generalization of the pre-
ceding proof to any r. For each k, one may write

    y^k = h_r^k(p_k) = x^k + h_r^k'(t p_k) p_k,   t ∈ (0, 1)    (3.120)

using the Mean Value Theorem [45]. For each k, p_k satisfies

    f(y^k) = f(h_r^k(p_k)) ≤ f(x^k)    (3.121)

The assumption f'(x) ≠ 0 implies

    f'(x^k)^T h_r^k'(0) < 0

and Lemma 3.1 then implies the existence of a p_k which satisfies (3.121).
From the Mean Value Theorem and (3.120),

    f(y^k) = f(x^k + h_r^k'(t p_k) p_k) = f(x^k) + f'(x^k + t̄ h_r^k'(t p_k) p_k)^T h_r^k'(t p_k) p_k

for some t̄ ∈ (0, 1). The above expression with (3.121) implies

    h_r^k'(t p_k) ≠ 0

in (3.120). Thus taking limits in (3.120) establishes the existence
of a p given by

    p = || y - x || / || h_r'(t p) ||

Thus for each k,

    f(y^k) = f(x^k + h_r^k'(t p_k) p_k) ≤ f(x^k)

and, after taking limits and using the continuity of f(x), we obtain

    f(y) ≤ f(x)

Hence y ∈ S_r(x, d(r)). This completes the proof.

The VO algorithm may now be given as the composition of the two
mappings M_r(x), given in (3.114), and S_r(x, d(r)), given in (3.115):

    A(x) = S_r(M_r(x))    (3.122)

By Lemmas 3.2, 3.3, and 3.4, the VO algorithm is closed at x if both
f'(x) ≠ 0 and f(x) has continuous first and second derivatives.
This implies, using the Global Convergence Theorem, that if

1) all x^(k+1) ∈ A(x^k) are in a compact set,

2) the gradient f'(x) ≠ 0, except at a solution of the
minimization problem (3.2), and

3) the function f(x) is twice continuously differentiable,

then the VO algorithm generates a sequence which converges to a
solution, or it stops at a solution. Conditions 1 and 3 may not
be difficult to achieve in practice. However, condition 2 implies
that the algorithm is not closed at a local maximum or at a saddle
point of f(x), since f'(x) is zero at these points. Thus, theoreti-
cally, f(x) must not have any such points; a function not having
these points is defined to be pseudoconvex [35]. In practice, the
algorithm should generate convergent sequences if the function is
pseudoconvex in the region including the desired solution and the
initial point. However, experimental evidence on one tested problem
indicates that the algorithm is superior to others in avoiding
these troublesome points even when they exist in the region of in-
terest.

The VO algorithm may be trivially modified to prevent convergence
to a strict local maximum. If x^k is a local maximum, the present
algorithm will fail in the sense that f'(x^k) = 0 will cause the
algorithm to stop. At this point, however, f"(x^k) will be negative
semidefinite [56], which is indicated in the VO algorithm by D^k ≠ 0
in the modified Cholesky factorization presented in Section 3.5.
However, D^k ≠ 0 may also occur when the hessian is positive semide-
finite, or indefinite. Thus D^k ≠ 0 is an indication of potential
problems. Thus, if D^k ≠ 0 at the point at which the algorithm stops,
coordinate searches may be undertaken to ascertain that the expres-
sions

    f(x^k ± t e_i) < f(x^k),   i = 1, ..., n

where e_i is the ith coordinate vector, are satisfied for some small
value of t. If x^k is a strict local maximum, the above procedure
should indicate so, and x^k may then be perturbed to start the
algorithm again.
The previous modification may not work if x^k is a saddle point,
as the example in Section 3.1 shows. It must be noted, however, that
numerical experiments will be given in the next chapter which show
that the VO algorithm appears to be highly effective in avoiding con-
vergence to saddle points. For one tested problem with a saddle
point, three existing algorithms converged to the saddle point when
the initial guess was close to the saddle point, while the VO algo-
rithm converged to the correct solution from the same initial point.

3.6.2 Order of Convergence

The order of convergence of the VO algorithm can be established
from published results dating back to Schröder in 1870, who was the
first to define the concept of order of convergence [47-48]. However,
in this section the more modern results due to Ortega and Rheinboldt
[38] and Traub [50] will be used. Two results are given; one applies
to the convergence of the algorithm to local minima with a positive
definite hessian, and the other to local minima with a positive semi-
definite hessian.

Before we can use the existing results, it must be noted that
the VO algorithm becomes the m-step method

    x^(k,0) = x^k,

    x^(k,i) = x^(k,i-1) - [f"(x^k)]^(-1) f'(x^(k,i-1)),   i = 1, ..., m,

    x^(k+1) = x^(k,m)    (3.123)

when x^k is in some neighborhood of a local minimum with a positive
definite hessian (e.g., the scalar parameter p_k is one). This method
was proposed and studied by Traub [50]. It has also been used by many
researchers, as it can be thought of as Newton's method for solving a
system of equations without updating the coefficient matrix at every
iteration. The following theorem establishes that the method con-
verges with order m + 1.

Theorem 3.1. Let f(x) have continuous gradients and hessians,
with

    || f"(x) - f"(x*) || ≤ c || x - x* ||,   0 < c < ∞

for all x in a neighborhood of x*. Assuming f'(x*) = 0 and that
[f"(x*)]^(-1) exists, the order of convergence of (3.123) is m + 1.

The proof may be found in Ortega and Rheinboldt [38] and Traub
[50]. Thus the VO algorithm converges with order r, where r is the
transformation function order, whenever f"(x*) is positive definite.
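The m-step iteration (3.123) is easy to state in one dimension, where the frozen second derivative plays the role of the un-updated coefficient matrix (a sketch with hypothetical function arguments):

```python
def m_step_newton(grad, second, x0, m, outer):
    """The m-step method (3.123) in one variable: evaluate f'' once
    per outer iteration and reuse it for m Newton corrections."""
    x = x0
    for _ in range(outer):
        h = second(x)              # f''(x^k), held fixed for the cycle
        for _ in range(m):
            x = x - grad(x) / h    # x^{k,i} = x^{k,i-1} - f''(x^k)^{-1} f'(x^{k,i-1})
    return x
```

With m = 1 this is Newton's method (order two); by Theorem 3.1 each extra inner step raises the order by one near a minimum with a positive second derivative.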
If f"(x ) is positive semidefinite, f"(x ) does not exist,
1,I 'q, 'v
and therefore the preceding theorem does not apply. In this case

the convergence is in general linear (order equal to one) as the

following argument shows. In this case the algorithm may be des-

cribed by the iteration


k+1 G( k k [pf"( k )+ k-1 i k k
x k+l= G(x ) = x pk[f"(x ) + D ] cf'(x ) + (pk' x )] ,
ru ru 'u k "U f r 'IV % ) 5 kk '


where c, a constant, and the function g_k depend on the transformation

function order selected, and where the diagonal matrix D_k is not zero

at x^k = x*. The iteration function is thus given by

    G(x) = x - p [f''(x) + D]^{-1} [c f'(x) + g(p, x)] .

The derivative of the iteration function evaluated at x = x* is given by

    G'(x*) = I - p [f''(x*) + D]^{-1} [c f''(x*) + g'(p, x*)] .

This matrix will not be zero for any value of p in general, since

D is not zero. Thus convergence cannot be higher than linear as

long as G'(x*) ≠ 0 [38]. It should be noted that while convergence

is in general linear whenever f''(x*) is positive semidefinite, local

minima having this property may be picturesquely described as being

flat. This implies that for all x in a fairly large neighborhood of

x*, || f'(x) || is very small. Thus for practical reasons, it is nor-

mally unnecessary to compute the local minimum with great accuracy

for these cases. Two of the problems selected to test the VO algo-

rithm have their local minimum with a positive semidefinite hessian.

The new algorithm was more efficient in computing an approximation

to their local minimum than several published algorithms.


3.7 Summary

The major contribution of this chapter is the derivation of a

new algorithm for finding a local minimum of an unconstrained non-

linear function. The new algorithm, called the Variable-Order (VO)

algorithm, has two properties that no existing minimization algorithm

has. The first new property is the order of convergence. While all

existing algorithms converge with order less than or equal to two,

the new algorithm converges with variable order as high as four. The

second new property is the scalar search step of the algorithm. In

contrast with previous algorithms that have scalar searches along a

straight line in the space of the independent variables, the VO algo-

rithm may have scalar searches along curved trajectories. The VO

algorithm was shown to be globally convergent for pseudoconvex func-

tions with continuous first and second derivatives. The order of

convergence was also established to be from two to four for functions

with a positive definite hessian at the local minimum being computed.

If the hessian is positive semidefinite at the local minimum being

computed, the convergence was shown to be linear.



In this chapter we consider the practical aspects of applying the

Variable-Order (VO) algorithm to nonlinear circuit optimization

problems. These practical considerations lead to guidelines

and to some modifications of the algorithm in order to solve the

general nonlinear programming problem given by

minimize f(x) (4.1a)

subject to a set of nonlinear inequality constraints

q(x) ≤ 0 (4.1b)

and a set of "box" constraints given by

x_L ≤ x ≤ x_U (4.1c)

Observe that any equality constraint may be included in (4.1b) as two

inequality constraints [20].

In the first section, guidelines are given for handling the non-

linear inequalities by the use of penalty functions. In circuit



optimization, the nonlinear inequalities are "loose", i.e., they can

be relaxed to some degree. Therefore, the penalty function approach,

which in effect relaxes the constraints, is an ideal technique for

circuit optimization applications.

In the second section, it is shown how the VO algorithm handles

the box constraints. Unlike the nonlinear inequality constraints, the

box constraints must be satisfied.

The VO algorithm, as described in the last chapter, requires that

a subroutine be written which supplies the value of the function, the

gradient and the hessian at each point generated by the algorithm.

While writing such a subroutine may not be difficult for some problems,

the VO algorithm would be more useful if the hessian can be approxi-

mated when it is difficult to write a subroutine which supplies the

hessian values. Sometimes it may be just as difficult to supply even

the gradient, and thus in this case both the gradient and the hessian

must be approximated. In the third section we consider approximations

to the hessian and the gradient when these values are not supplied.
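Such difference approximations can be sketched briefly. The following is an illustrative Python sketch (the thesis implementation was in FORTRAN IV); the forward-difference formulas and the test function are standard, and all names here are hypothetical.

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    # Forward-difference approximation to the gradient, for use when
    # a gradient subroutine is not supplied; costs n extra function
    # evaluations per point.
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f0) / h
    return g

def fd_hessian(grad, x, h=1e-5):
    # Hessian approximated by forward differences of the gradient,
    # symmetrized so the result can be used like a true hessian.
    x = np.asarray(x, dtype=float)
    g0 = np.asarray(grad(x), dtype=float)
    H = np.zeros((x.size, x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        H[:, i] = (np.asarray(grad(x + step), dtype=float) - g0) / h
    return 0.5 * (H + H.T)

# Check against f(x) = x_1^2 + 3 x_1 x_2, whose hessian is [[2, 3], [3, 0]].
f = lambda x: x[0]**2 + 3.0*x[0]*x[1]
g = lambda x: np.array([2.0*x[0] + 3.0*x[1], 3.0*x[0]])
H = fd_hessian(g, np.array([1.0, 1.0]))
```

The step size h trades truncation error against roundoff, which is precisely the error-in-supplied-values issue the fourth section examines.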

The fourth section considers the case when the function and the

gradient values supplied to the VO algorithm contain errors. In non-

linear circuit optimization applications, the subroutine that supplies

the function and the gradient values may be actually a complex computer

program which includes the solution of a system of nonlinear algebraic

and differential equations. The function and the gradient values

depend on the solutions to these equations; thus due to the numerical

techniques used, errors may be present in the function and the gradient

values. Numerical experiments will show that if any errors present in

the function and the gradient values supplied to the VO algorithm can


be estimated and kept small, the algorithm can still be effectively used.


The fifth section details the steps in a FORTRAN IV implementation

of the VO algorithm. Finally, the sixth section presents several numer-

ical experiments and comparisons with other algorithms.

4.1 Nonlinear Inequality Constraints

This section gives guidelines to be used in solving the problem

given by

minimize f(x) (4.2a)

subject to q(x) ≤ 0 (4.2b)

where q(x) is a vector function of inequality constraints. It is

assumed that all the constraints are nonlinear.

The method recommended for finding a solution of (4.2) is to con-

vert the constrained problem to an unconstrained one by using a penalty

function method [20,34]. That is, define the problem

minimize f̂(x) = f(x) + μQ(x) (4.3)

where Q(x) is a penalty function for the inequality constraints; the

penalty function is defined such that it is zero whenever the point x

satisfies all the constraints, and greater than zero whenever the point

x violates any of the constraints. The constant μ is a positive


scalar. For large μ, the minimum of (4.3) will tend to be in a region

where the constraints should be almost satisfied. Thus for increasing

μ, the corresponding solution points of (4.3) approach a solution of

(4.2) [34]. Therefore, the penalty method converts the constrained

problem (4.2) into an approximately equivalent unconstrained problem,

or perhaps to a sequence of unconstrained problems depending on how

strictly the constraints are to be satisfied. This implies that uncon-

strained minimization algorithms, in particular the VO algorithm, may

be used to approximate the solution of (4.3).

Two difficulties are inherent in converting problem (4.2) to

problem (4.3), and in eventually solving problem (4.3). First, to

insure the global convergence of the VO algorithm, f(x) must be twice

continuously differentiable. Therefore, it appears that penalty func-

tions must be chosen accordingly. Second, problem (4.3) is very ill-

conditioned for large values of the constant μ [34]. These considera-

tions will be explored in more detail next.

A suitable penalty function Q(x) is given by

    Q(x) = Σ_{i=1}^{n} w_i (max[0, q_i(x)])^3                  (4.4)

where n is the number of inequality constraints, and the constants

w_i may be used to equalize the magnitude of the constraints. The

continuity conditions on f(x) are satisfied by (4.4), if the inequality

constraints q_i(x), i = 1, ..., n, also satisfy them. In the litera-
ture, the most popular penalty function for inequality constraints is

given by



    Q(x) = Σ_{i=1}^{n} w_i (max[0, q_i(x)])^2                  (4.5)

This quadratic penalty function has a hessian which is discontinuous

whenever an inequality constraint is zero [34]. However, it was found

experimentally that, for several problems tested, the VO algorithm

solves the unconstrained problem produced by this quadratic penalty

function more efficiently than the one produced by the cubic penalty

function (4.4), especially when the hessian is not supplied and is

therefore approximated by differences. Note that the VO algorithm is not guaranteed to be

globally convergent for the quadratic penalty function (4.5) because

of its discontinuous hessian.
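Both penalty functions (4.4) and (4.5) fit one formula differing only in the exponent. A minimal Python sketch (the constraint functions and weights below are hypothetical examples, not from the thesis):

```python
import numpy as np

def penalty(x, q, w, power=3):
    # Q(x) = sum_i w_i * max(0, q_i(x))**power: zero whenever every
    # constraint q_i(x) <= 0 holds, positive as soon as one is violated.
    # power=3 gives the cubic penalty (4.4), whose hessian stays
    # continuous when the q_i are smooth; power=2 gives the popular
    # quadratic penalty (4.5), whose hessian jumps where q_i(x) = 0.
    violation = np.maximum(0.0, np.asarray(q(x), dtype=float))
    return float(np.dot(w, violation ** power))

# Hypothetical constraints q_1(x) = x - 1 <= 0 and q_2(x) = -x - 3 <= 0:
q = lambda x: np.array([x[0] - 1.0, -x[0] - 3.0])
w = np.array([1.0, 1.0])
cubic = penalty(np.array([2.5]), q, w, power=3)      # only q_1 violated
quadratic = penalty(np.array([2.5]), q, w, power=2)
```

The weights w_i play the equalizing role described after (4.4), scaling constraints of very different magnitudes to comparable penalties.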

The VO algorithm, as other existing algorithms [34], solves prob-

lem (4.3) for large μ with great difficulty, as examples will show.

It was found that the best approach was to compute rough approximations

to the solution of a sequence of problems, given by (4.3), for increas-

ing values of μ, and tightening the desired accuracy of the solution

for the last μ used. Thus, it is recommended that initially a small

value of μ be used when using the VO algorithm for solving problems

such as (4.3).
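The recommended continuation strategy can be sketched as follows. This is an illustrative Python sketch only: the simple ternary-search minimizer stands in for the VO algorithm, and the one-dimensional example problem is hypothetical.

```python
def ternary_minimize(obj, lo, hi, tol):
    # Stand-in scalar minimizer for the sketch; in practice the VO
    # algorithm itself would play this role.
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) < obj(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def penalty_continuation(f, Q, mus, lo, hi, loose_tol=1e-3, final_tol=1e-9):
    # Minimize f + mu*Q roughly for each mu in an increasing sequence,
    # tightening the tolerance only for the last mu, as recommended:
    # large mu alone makes (4.3) ill-conditioned, so earlier, easier
    # problems supply good starting information.
    x = 0.5 * (lo + hi)
    for i, mu in enumerate(mus):
        tol = final_tol if i == len(mus) - 1 else loose_tol
        x = ternary_minimize(lambda z, mu=mu: f(z) + mu * Q(z), lo, hi, tol)
    return x

# Example: minimize f(x) = x^2 subject to q(x) = 1 - x <= 0.
# The penalized minimizer is x = mu / (1 + mu), approaching x* = 1.
f = lambda x: x * x
Q = lambda x: max(0.0, 1.0 - x) ** 2
x_final = penalty_continuation(f, Q, mus=(1.0, 10.0, 100.0), lo=0.0, hi=2.0)
```

With μ = 100 the penalized minimizer 100/101 ≈ 0.9901 already violates the constraint by only about one percent, illustrating how the sequence approaches a solution of the constrained problem.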

4.2 Box Constraints

When the minimization problem has box constraints of the form

    x_L ≤ x ≤ x_U                                              (4.6)

the penalty function method just described can be used, if (4.6) is


rearranged into 2n inequality constraints. However, there are two

reasons that compel the use of a different technique for handling the

box constraints. First, the penalty function method allows the viola-

tion of constraints, particularly when the multiplying constant μ is

small. In circuit optimization procedures, the circuit equations may

not have any solution if any of the box constraints are violated, and

therefore the algorithm may fail if a box constraint is violated.

Second, the box constraints are linear constraints and their effect

can be very efficiently handled in a direct manner with some modifica-

tions to the VO algorithm.

The method proposed for handling the box constraints is, in effect, to

project the transformation function trajectories onto the active box

constraints whenever the trajectories are outside of the box constraints.

Figure 4.1 illustrates the proposed technique. This projection can be

very efficiently implemented as will be shown.
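Because the box constraints act on each coordinate independently, the projection reduces to clipping each coordinate to its bounds. A minimal Python sketch (names are illustrative):

```python
import numpy as np

def project_to_box(x, x_lower, x_upper):
    # Clip each coordinate to its box constraint: a trajectory point
    # outside the box is replaced by the nearest point on the active
    # bounds, as the solid curve in Figure 4.1 illustrates.  This
    # guarantees the circuit equations are never evaluated at an
    # infeasible point.
    return np.minimum(np.maximum(np.asarray(x, dtype=float), x_lower),
                      x_upper)

inside = project_to_box([0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
clipped = project_to_box([-0.3, 1.7], [0.0, 0.0], [1.0, 1.0])
```

The operation costs only two elementwise comparisons per trial point, which is why the direct projection is preferred over penalizing the box constraints.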

A modification required in the implementation of the projection

of the transformations is that whenever the trajectory is on a boundary,

an accurate computation of the solution of the scalar minimization

problem in the scalar search should be done. The reason for this modi-

fication is that when transformation functions are projected, their

theoretical properties may be different when the scalar parameter p in

the transformation functions is set to one.

4.3 Hessian and Gradient Approximations

Up to this point the VO algorithm has been described in a way that

required supplying the function, the gradient, and the hessian of the

function to be minimized at the points of the sequence {x^k} generated



Figure 4.1

Illustration of projection of trajectory onto box con-
straints. Box constraints are depicted by the rectangle.
The trajectory is shown by the dashed curve to be outside
of the box constraints over two intervals. The actual
trajectory used is shown by the solid curve with arrows.