Globally convergent neural networks

Material Information

Title:
Globally convergent neural networks
Physical Description:
x, 229 leaves : ill. ; 29 cm.
Language:
English
Creator:
Tang, Zaiyong, 1957-
Publication Date:
1992
Subjects

Subjects / Keywords:
Neural networks (Computer science)   ( lcsh )
Decision and Information Sciences thesis Ph. D
Dissertations, Academic -- Decision and Information Sciences -- UF
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1992.
Bibliography:
Includes bibliographical references (leaves 140-157).
Statement of Responsibility:
by Zaiyong Tang.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 001803869
oclc - 27779167
notis - AJM7680
System ID:
AA00002095:00001

Full Text










GLOBALLY CONVERGENT NEURAL NETWORKS












BY
ZAIYONG TANG


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE
UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1992

































Copyright © 1992 by Zaiyong Tang

All Rights Reserved











ACKNOWLEDGMENTS

I am indebted to many people without whom this work would never have become a reality. First of all, I owe deep thanks to my adviser, Dr. Gary Koehler, who has guided my dissertation research through all its ups and downs with patience, encouragement, and intellectual challenge. It is truly remarkable that, being a department chairman and an adviser of eight Ph.D. students concurrently, he still finds time to provide help whenever it is needed.

I am thankful to all my committee members: Drs. Paul Fishwick, Harold Benson, and Antal Majthay. Dr. Fishwick introduced me to the exciting world of artificial intelligence and neural networks. His open-mindedness and enthusiasm have had a great influence on me. Dr. Benson taught me the beauty and power of mathematical proof during three math programming courses, and the rigorousness of scientific research. Dr. Majthay, a guru in AI and expert systems, has offered me much valuable advice on C++ programming.

All the faculty members in the DIS department have helped me in one way or another. I would like to thank Dr. Richard Elnicki for providing computing resources, Dr. Selcuk Erenguc for general assistance in my graduate study, and Dr. Christopher Zappe for setting an example as an excellent professor.

Thanks are due to Dian and Linda, our department secretaries. They have been very helpful in making my graduate study here a pleasant one. I would also like to thank my fellow Ph.D. students for many stimulating discussions and a harmonious and cooperative environment. Bob Norris has been extremely helpful in fitting me into the American culture.

I owe special thanks to my family: my wife, Xiaoqin Zeng, and my kids, Jimmy and Dora. Their love, understanding, and encouragement have kept me in high spirits and proper perspective. Xiaoqin certainly knows more than anyone else how hard it ...
























TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTERS

1 INTRODUCTION

2 THE RENAISSANCE OF NEURAL NETWORKS
   Overview of Neural Networks
   Historical Development
   Neural Network Applications
      2.3.1 Neural Networks in AI
      2.3.2 Neural Networks in Decision Sciences
   Promise and Problems

3 FEEDFORWARD NEURAL NETWORKS
   The Processing Units (Neurons)
   The Perceptron Learning
   The Limitation of Perceptrons
   Feedforward Neural Nets and the BP Algorithm
   Backpropagation Derivation
   The Representation Capability of FNNs

4 VARIATIONS OF BACKPROPAGATION LEARNING
   Performance Criterion Function
   Momentum
      4.5.2 Transcendental Functions
      4.5.3 Higher Order Networks and Function-link Networks
      4.5.4 Gradient Descent Search in Function Space
   Dynamically Constructed Neural Nets
      4.6.1 Network Growing
      4.6.2 Network Pruning
   Miscellaneous Heuristics
      4.7.1 Initial Weights
      4.7.2 Multi-scale Training
      4.7.3 Borderline Patterns
      4.7.4 Rescaling of Error Signal
      4.7.5 Varying the Gain Factor
      4.7.6 Divide and Conquer
      4.7.7 Total Error vs. Individual Error

5 GLOBALLY GUIDED BACKPROPAGATION (GGBP)
   Limitations of BP
   The Idea of Globally Guided Backpropagation
   Learning Rule Derivation
   Convergence of GGBP
   The GGBP Algorithm
   Experiments
      5.6.1 The XOR Problem
      5.6.2 The 424 Encoding Problem
   Comparison of GGBP and BP

6 STOCHASTIC GLOBAL ALGORITHMS
   Genetic Algorithm
   Simulated Annealing
   Random Search
   Clustering Methods

7 DETERMINISTIC GLOBAL ALGORITHMS
   Branch and Bound
      7.1.1 Prototype Branch and Bound
      7.1.2 BB Algorithm Convergence
   Lipschitz Optimization
   Estimate the Lipschitz Constant for an FNN
      7.3.1 Some Lemmas on Lipschitz Constant
      7.3.2 An FNN is Lipschitzian
      7.3.3 Local Lipschitz Constant
   BB Based NN Training Algorithm

8 [chapter title illegible in source]
   Combined BB and BP
   Experiments with GOTA and LGOTA
      8.4.1 GOTA with Different Error Thresholds
      8.4.2 GOTA with Heuristic Pruning
      8.4.3 GOTA with Random Local Search
      8.4.4 GOTA with BP Local Search

9 SUMMARY AND CONCLUSIONS
   Contributions
   Further Research

REFERENCES

APPENDICES
   A  C++ Program for GOTA
   B  Classes for Neural Network Simulation Systems

BIOGRAPHICAL SKETCH



LIST OF TABLES

Training Epochs of GGBP vs BP for the XOR Problem
Training Epochs of GGBP vs BP for the 424 Encoding Problem
Lipschitz Constant over Weight Subsets
GOTA Iterations for Solving the XOR Problem
GOTA with Heuristic Pruning
GOTA with Local Random Search
LGOTA vs BP with Different η
LGOTA vs BP with Different α
LGOTA Iterations for the Parity-3 Problem



















LIST OF FIGURES

Structure of a single neuron
Typical activation functions
Geometrical explanation of the perceptron learning
The XOR problem and its geometrical representation
An example of layered perceptrons that solve the XOR problem
A ... feedforward neural network
An example of the Kolmogorov neural network
Two simple neural nets that solve the XOR problem
Output function surface of the ... x 1 x 1 network
A 3 x 4 x 2 radial basis function network
A function-link neural network used to solve Parity 3
Error surface of an XOR (... x 1) network showing valley, plateau and local minimum
A ΔW corresponding to ΔO would lead W to a global optimal solution
A typical FNN where the weights associated with Ok are independent of other output units
Learning curve of GGBP (solid line) vs BP (dotted line)
Boltzmann distribution at different temperatures
Equilibrium and non-equilibrium energy state











Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy


GLOBALLY CONVERGENT NEURAL NETWORKS

By
Zaiyong Tang


August 1992



Chairman: Gary J. Koehler
Major Department: Decision and Information Sciences

Artificial neural networks are a computational framework that has become a focus of widespread interest. One of the most widely used neural networks is the feedforward neural network (FNN). This type of neural network can be used to learn the underlying rules from examples. This learning ability enables FNNs to have wide applicability. However, the theory behind this neural network model is still immature. There are many deficiencies of the current neural network learning algorithms that have hindered their usefulness.

In this dissertation, we survey the research in FNN learning. Several new algorithms are proposed to improve the learning efficiency of FNNs. We have developed a globally guided neural network training algorithm that converges to a global optimal solution and reduces the training time. Both stochastic and deterministic global optimization approaches are employed for neural network training. The stochastic methods include genetic algorithms, simulated annealing, and pure random searches. Deterministic methods considered for neural net training are branch-and-bound based Lipschitz optimizations. By exploring the special structure of the FNN and the prop- ... training algorithms (GOTA) is that they yield a guaranteed global optimal solution. GOTA can also be combined with local search procedures, such as backpropagation, to produce more efficient, but still globally convergent algorithms.



















CHAPTER 1
INTRODUCTION

Artificial neural networks (neural networks or neural nets for short) are a computational framework that has recently become a focus of widespread interest. In contrast to conventional centralized, sequential processing, neural networks consist of massively connected simple processing units, which are analogous to the neurons in the biological brain. Through elementary local interactions (such as excitatory and inhibitory ones) among these simple processing units, sophisticated global behaviors, which resemble the high-level recognition processes of humans, emerge.

Information in a neural network is distributed across many processing units and the connections among them, rather than stored in a single location. The processing units act in parallel and communicate only with their local peers. This makes high-speed computation readily achievable through parallel computers. The parallel and distributed processing (PDP) computational paradigm exhibits many desirable features, such as fault tolerance (resistance to hardware failure), robustness in handling different types of data, "graceful degradation" (being able to process noisy or incomplete information) (Matheus and Hohensee, 1987), and the ability to learn and adapt (Rumelhart et al., 1986; Lippmann, 1987; Hinton, 1989).

Research in neural nets experienced a sudden resurgence in the early 1980s and has seen explosive growth in the last few years. The excitement about neural nets is rooted in understanding information processing in human brains. But recent interest in neural network study has grown to cover a wide spectrum of areas, from industry, to education, to business, to the military (Simpson, 1990).




... (Hornik et al., 1990), although with regard to the whole area a sound theoretic foundation has yet to be established. The fast growth of this area has been pushed by extensive applications of the neural net computation paradigm. By virtue of their inherent parallel and distributed processing, neural nets have been shown to be able to perform tasks that are extremely difficult for conventional von Neumann machines, but are easy for humans. These tasks include image recognition (Carpenter and Grossberg, 1987) and speech processing (Sejnowski and Rosenberg, 1987). More importantly, neural nets have been successfully applied to solve problems that often require human experts, such as sun spot prediction (Weigend et al., 1990) and ERP recognition (DasGupta et al., 1990).


In the business world, neural networks have been successfully applied to areas where traditional approaches are ineffective or inefficient to use. A partial list of such areas includes loan evaluation (Judge, 1989), signature recognition (Rochester, 1990), stock market prediction (Dutta and Shekhar, 1988), time series forecasting (Sharda and Patil, 1990), and classification analysis (Fisher and McKusick, 1989; Singleton and Surkan, 1990).


The leading neural net paradigm for applications is the feedforward neural net (FNN). An FNN is used by first training it with known examples. Once the network is trained successfully, or in other words the neural net has learned the concept/rule embedded in the training examples, it can be used to recognize an associated outcome given an input it has seen before. The trained neural net can also be used to estimate/predict a possible outcome when a novel input is presented.

A neural net training procedure is also called a learning algorithm. One of the most widely (and wildly) used neural net learning algorithms is the backpropagation (BP) procedure (le Cun, 1988; Rumelhart et al., 1986). Although it has contributed to many success stories, this learning procedure lacks a sound theoretic background. Backpropagation is essentially a simple gradient descent based search ...









... be a local minimum solution if the training problem has multiple minima, which is often true. Furthermore, the BP algorithm as used in practice deviates from strict gradient descent. This deviation may reduce the likelihood of a solution trapped in an unsatisfactory local minimum. However, the convergence of the procedure has now become an open question in theory. Other shortcomings of the backpropagation algorithm include a static (fixed a priori) neural network structure, ad hoc choice of learning parameters, and sensitivity to initial conditions (weight values). Because of these limitations, feedforward neural nets trained with the BP algorithm reach only a suboptimal status. The generalization ability of the neural nets, an ability to function in a domain larger than the training set, is also limited.


Extensive research has been carried out in recent years to explore the potential of feedforward neural nets and to improve the effectiveness and efficiency of the backpropagation learning procedure (Jacobs, 1988; Becker and le Cun, 1988; Moller, 1990). Remarkable progress has been made in developing new training methods and neural network architectures (Fahlman, 1989; Fahlman and Lebiere, 1990; Chan and Shatin, 1990). However, most variations of the backpropagation algorithm are based on heuristics that reduce the generality of the approach. For example, Fahlman's cascade correlation algorithm is orders of magnitude faster than the classic backpropagation algorithm, but its application is limited to input-output mappings with binary outputs. Much less work has been done in overcoming the problem of local minima. A few researchers have used stochastic global search methods in neural net training with moderate success (Montana and Davis, 1989; Fang and Li, 1991). To date, we have seen no reports that apply deterministic global optimization approaches to neural net training.


Compared with the vast volume of applications, theoretic study on neural nets (in particular, the backpropagation learning algorithm) has been weak at best. As a ...

... and generalization are constantly referred to without precise definitions. There is apparently a need for unified definitions and formalism of the FNN learning paradigm.
In this dissertation, we attempt to fill this need and address the problems associated with backpropagation learning, with a focus on developing efficient and globally convergent learning algorithms. Our approaches involve both stochastic and deterministic global optimization techniques. We propose to treat neural network training as a global optimization problem. Recent developments in global optimization research lend us some viable tools, such as the branch-and-bound method and Lipschitz optimization (Horst and Tuy, 1990). We also consider globally guided heuristic search methods.
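To make the optimization view concrete, the training problem can be written as a global minimization over the weight space. The formulation below is a standard one in our own notation (E for the training error, w for the weight vector, T for the training sample), not a quotation from a later chapter:

```latex
% Neural net training as a global optimization problem: find
% weights w* in the feasible weight set W minimizing the total
% error E over the training sample T.
\[
  w^{*} \;\in\; \arg\min_{w \in W \subset \mathbb{R}^{n}} E(w),
  \qquad
  E(w) \;=\; \sum_{(x,\,t) \in T} \bigl\| o(x; w) - t \bigr\|^{2},
\]
% where o(x; w) is the network output for input x under weights w,
% and t is the target associated with x. Backpropagation performs
% local descent on E; the global methods studied here bound E over
% regions of W instead.
```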


The dissertation is composed of nine chapters. Following the introduction, Chapter 2 presents a general account of neural networks, an outline of the historical development of neural net research, and a more detailed discussion of the promise and problems of current neural network study. Chapter 3 gives the basic concepts and definitions of feedforward neural nets. The backpropagation algorithm is derived and discussed in detail regarding its learning mechanism, its applicability and limitations, and its implementation. The next chapter (Chapter 4) focuses on the improvement of the backpropagation learning algorithm. A variety of approaches is presented, ranging from using efficient optimization procedures, to designing new network structures, to dynamically adapting learning parameters and learning mechanisms. This chapter summarizes the state-of-the-art research in feedforward neural network training.


Chapter 5 begins our work on globally convergent neural network learning procedures. We develop a search method that uses the information in neural network output space to guide the learning process, rather than search in the complicated weight space following the gradient descent. We explore the application of stochastic global optimization methods in neural network training in Chapter 6. In particular, we discuss the use of genetic algorithms, simulated annealing, and pure random search ...


... in obtaining lower bounds of the branch-and-bound procedure, through an extension of the univariate Piyavskii algorithm. Upper bounds can be obtained with or without local search in the partition elements. A procedure is developed to compute local Lipschitz constants over subsets of the weight space. This leads to tighter lower bounds and more effective pruning in the branching search process.
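The bounding idea behind this procedure can be sketched as follows (our notation, stated under the assumption that the error E is Lipschitz on each partition element):

```latex
% If |E(u) - E(v)| <= L ||u - v|| for all u, v in a partition
% element S, each evaluated point w_k gives a valid underestimate:
\[
  E(w) \;\ge\; E(w_k) - L\,\|w - w_k\|
  \qquad \text{for all } w \in S .
\]
% A Piyavskii-style bounding function takes the lower envelope over
% all evaluated points,
\[
  F_k(w) \;=\; \max_{i \le k}\,\bigl( E(w_i) - L\,\|w - w_i\| \bigr),
\]
% and its minimum over S is the lower bound used to prune elements
% that cannot contain the global minimum. Smaller (local) Lipschitz
% constants give tighter bounds, which is why computing them over
% weight subsets pays off.
```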
The implementation of the global optimization training algorithm (GOTA) is discussed in Chapter 8. We show that the computation of the local Lipschitz constant is easily carried out by exploiting the special structure of the feedforward neural network and the properties of the sigmoid activation function. We also discuss the simulation program design and different search strategies under the general framework of GOTA. Experiments on the effectiveness of GOTA and its local search augmented version (LGOTA) are carried out with some standard benchmark problems.
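One property of the sigmoid that such a computation can exploit is the uniform bound on its derivative; we state it here as background (it is a standard fact, not a quotation from Chapter 8):

```latex
% For the logistic sigmoid s(x) = 1/(1 + e^{-x}),
\[
  s'(x) \;=\; s(x)\,\bigl(1 - s(x)\bigr) \;\le\; \tfrac{1}{4}
  \qquad \text{for all } x \in \mathbb{R},
\]
% so each sigmoid unit contributes a bounded factor to the network's
% rate of change, keeping Lipschitz-constant estimates for an FNN
% cheap to compute layer by layer.
```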


Finally, in Chapter 9, we summarize our contributions in the dissertation. Several conclusions are reached based on our theoretical study and experimental investigation. Further extensions of this research are also discussed.

















CHAPTER 2
THE RENAISSANCE OF NEURAL NETWORKS


Let's face it, beyond part of the interest in connectionism is that dirty little secret that researchers in nuclear physics had during the thirties: that maybe you can build something with it.

- Gary Lynch


After more than a decade of dormancy, research in artificial neural networks came back to life in the 80's and experienced an explosive growth in recent years. The new surge of enthusiasm resembles the initial excitement in neural nets in the late 50's and the early 60's, only far more intensive and extensive. The wave of neural network research has engulfed widespread disciplines: computer science, engineering, neuroscience, psychology, linguistics, mathematics, and decision sciences. In fact, the majority of neural net research has gone so far as to have totally lost any traces of its biological roots. Thus when we quote Gary Lynch, a well-known neuroscientist, we do not really mean that we are going to build an electronic brain; rather, we mean to build "something" that will enable us to solve problems that are intractable or difficult to solve with conventional approaches.


Overview of Neural Networks


As a reflection of the relative youth and broad scope of this field, neural networks are known by various names, such as adaptive systems, connectionist machines, neurocomputers, collective decision circuits, parallel distributed processors and neuromorphic systems (Lippmann, 1987; Knight, 1990). There are as many, if not more, ...











in patterns reminiscent of biological neural nets"


(Lippmann, 1987


4) to more


complicated and specific ones such as:
A parallel, distributed information processing structure consisting of pro-
cessing elements (which can possess a local memory and can carry out
localized information processing operations) interconnected with unidi-
rectional signal channels called connections, each processing element of
which has a single output connection which branches out into as many
collateral connections as desired with each carrying the same signal, that
being of any mathematical type desired (the processing being local to the
processing element, i.e., dependent only on the current values stored in
the processing element's local memory). (Hecht-Nielson, 1989, p. 593)


By the very fact that research in neural networks was only revived recently and has found its way into such a diversified spectrum of disciplines, it is hard to give neural nets a generic and concise definition. But it is generally agreed that the essence of neural nets is parallel distributed processing (PDP) (Rumelhart, McClelland and the PDP Group, 1986).
Originally, neural nets were biologically motivated. However, research in the field has long (well, relatively long) diverged into two directions. One branch endeavors to understand our very brain. Researchers in this branch are concerned with human perception, memory, reasoning, and learning. The other branch is more interested in the computational models and their power to accomplish traditionally difficult tasks, rather than biological fidelity. The main thrust of current neural net research seems tilted towards the second area.

There are more than a dozen main neural net paradigms being actively applied today (Simpson, 1990). The most widely used one is the backpropagation (BP) model (le Cun, 1988). BP bears little resemblance to biological systems. The popularity of BP arises from its simplicity and powerful representation ability that can address a wide variety of real world problems. Other neural net models that do not have much biological flavor but find successful applications in pattern recognition, decision making and optimization include Hopfield networks (Hopfield, 1982) and Kohonen's self-organizing networks (Kohonen, 1989).









...

2. Massive inter-neuron connections and associations via those connections

3. Highly parallel processing

4. Internal information representation and distributed storage (as weights on the connections and/or the activation states of the neurons)

5. A learning rule whereby the internal representation is changed in response to the changes in the environment

6. A learning environment that provides input and feedback to the network

The basic characteristics of an artificial neural network are similar to those of its biological counterpart. But for most neural network paradigms, the learning mechanisms do not even remotely resemble the learning mechanism in biological systems. Nevertheless, neural networks provide a framework within which certain aspects of the human brain can be modeled. Those aspects include association, classification, generalization, optimization (under soft constraints) and adaptation. In large part, intelligent systems (artificial or natural) depend on those abilities, and those abilities are not easily modeled with conventional serial processing models based on von Neumann machines. The structural and nonprogramming approach of neural networks lends itself to dealing with difficult artificial intelligence (AI) problems such as pattern recognition. While it is often difficult or impossible to explicitly write down a set of rules for such problems (hence symbolic approaches fail), neural networks can learn from training data to produce a solution. In recent years neural networks have made strong advances in AI areas (Caudill, 1989).
Conventional expert system inferences slow down with an increase in their knowledge base. This is counterintuitive: humans get faster as we possess more knowledge about a problem domain. This deficiency in expert systems is due to the sequential search nature of the inference engine. ... be retrieved by using any part of it as a key (Rumelhart, McClelland and the PDP Group, 1986).


The neural network paradigm makes itself easily adaptive. This ability is essential in a dynamic environment. Some neural network models have been shown to be equivalent to statistical classifiers (White, 1989). Compared with statistical approaches, neural networks have the advantage of robustness, by virtue of their distributed representation and adaptation. Also, neural networks make little or no assumption concerning the underlying distribution of the training data. They may be applied to data sets generated by non-Gaussian processes where traditional statistical methods cease to be effective (Lippmann, 1987).


In a distributed processing system, the job is done by the joint effort of many processing units. If one or a few of those units fail, they do not significantly affect the performance of other processing units, and the system as a whole still works. This property is known as fault tolerance, which is not shared by traditional computing paradigms. The human brain presents an excellent example of fault tolerance, where some neurons die out daily and the brain keeps functioning in every practical sense. On the contrary, a serial processing machine comes to a complete halt with a failure in virtually any part of it. Even with continued damage to the processing units, a distributed system shows "graceful degradation." That is, the system's performance deteriorates gradually, rather than with a catastrophic breakdown.


Historical Development


The study of neural networks has a long and colorful history. Pioneering work on neural nets dates back to the early 1940s, when McCulloch and Pitts (1943/1988) proposed that the brain, as a computing device, consists of simple processing units (neurons). They built a simple, yet elegant, model of a neuron (later known as a McCulloch-Pitts neuron or simply an M-P neuron) in which a well-defined process ... The basic structure and operations of the M-P neuron can still be found in some of today's neural network models.
The M-P neurons provide a model of computation that enables the idea of connectionism. The activations of the neurons are determined by the combined effects of incoming excitatory and inhibitory stimuli. But nothing was known about how the connection strength between neurons could be changed to adapt to a new environment until Donald Hebb (1949/1988) made known in his Organization of Behavior the first neural network learning rule, which has come to be known as the Hebbian learning rule. The essence of the Hebbian learning rule states that the synapse (weight) between two neurons should be strengthened if both neurons fire (are in active states), and the synapse should be weakened if only one of them fires. The Hebbian learning rule was proposed without rigorous mathematical derivation, but it has been regarded as a foundation of many more sophisticated learning rules. Its generic nature and its ability to capture the learning behavior in biological systems (Caudill, 1989) have contributed to its continued utilization.
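In its simplest modern statement (our formulation; Hebb's book gives the rule verbally), the Hebbian update for the weight between units i and j is proportional to the product of their activities:

```latex
% Hebbian learning: strengthen w_ij when units i and j are active
% together (a_i, a_j are activations, eta > 0 a small learning rate).
\[
  \Delta w_{ij} \;=\; \eta \, a_i \, a_j .
\]
```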
A milestone in neural network history was the introduction of the perceptron by Frank Rosenblatt (1962). A perceptron is a single M-P neuron or a set of M-P neurons that systematically adjusts its (their) weights and excitatory thresholds to learn a given input-output association. The perceptron learning rule is an adapted, systemized Hebbian rule. In Principles of Neurodynamics, Rosenblatt (1962) proved the perceptron convergence theorem. This theorem shows that a perceptron can learn in finite time any pattern association that is linearly separable. The perceptron convergence theorem was powerful enough to stimulate widespread interest in perceptron learning. There was much speculation about how intelligence could arise from such neuron-like devices.
The limitation of perceptrons to binary outputs was removed by Widrow and Hoff (1960). They replaced the hard-limit activation function in perceptrons with a ... classes. Adaline and Madaline were also proved to be convergent to any function they could represent (Wasserman, 1989).
The enthusiasm with perceptrons dwindled when researchers in the area found that perceptrons failed to live up to their expectations. The publication of the book Perceptrons by Minsky and Papert (1969) initiated a dark age for neural network research. The authors performed a rigorous mathematical analysis of the capability and limitations of the perceptron. They showed that the class of problems that can be effectively solved by perceptrons is limited to linearly separable problems. Indeed, perceptrons fail to solve such simple problems as the Exclusive-Or (XOR) problem. (More detailed discussion on the XOR problem is presented in Chapter 3.) With linear activation functions, a multilayered perceptron is equivalent to a single-layer perceptron. So multilayer perceptrons could do no better than solving linearly separable problems. For multilayer perceptrons with a nonlinear activation function, there still did not exist an effective training algorithm. This seemingly insurmountable difficulty in training multilayered perceptrons led to the following inconclusive conclusion of Minsky and Papert (1969, p. 231). They wrote:


The perceptron has many features that attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.


Despite Minsky and Papert's recognition of the importance of multilayered perceptrons, their pessimism, backed up with their reputation and the rigor of their work, effectively turned mainstream research away from neural networks.
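The XOR limitation cited above is easy to verify directly. With a single perceptron computing output 1 when w1 x1 + w2 x2 > theta (our notation), the four XOR input-output pairs impose inconsistent constraints:

```latex
% XOR targets: (0,0) -> 0, (1,0) -> 1, (0,1) -> 1, (1,1) -> 0.
\[
  0 \le \theta, \qquad w_1 > \theta, \qquad w_2 > \theta,
  \qquad w_1 + w_2 \le \theta .
\]
% Adding the two middle inequalities gives w_1 + w_2 > 2\theta, and
% 2\theta >= \theta since \theta >= 0, contradicting the last one.
% Hence no separating line, and no single perceptron, realizes XOR.
```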


Nevertheless, research in neural networks did not completely die out. With dedicated effort, a small group of researchers continued their work in this largely abandoned field. Some important progress was made during the "post perceptron era" ...








... time that a fully connected recurrent network exhibits emergent collective computational capability (Hopfield, 1982); that is, the local interactions among the processing units can produce global behaviors. His model was later expanded to allow neurons to have continuous values (Hopfield, 1984) and be applied to hard optimization problems (Hopfield and Tank, 1985).
The new era of neural network study witnessed a resurgence with the publication of the three-volume Parallel Distributed Processing by Rumelhart, McClelland and the PDP Research Group in 1986. By then, some theoretical background had been established, and there had been breakthroughs in neurobiological understanding and computer capabilities (which made it feasible to develop and test more sophisticated models). The PDP books were well publicized and stimulated a new fever of neural net research that more than rivaled that which had occurred in the early 60's. Of particular importance is the backpropagation (BP) learning algorithm developed by Rumelhart, Hinton, and Williams (1986). BP provides a procedure that successfully solves the "credit assignment" problem in multilayered perceptron training, and hence provides a rebuttal to Minsky and Papert's conjecture that research in multilayered perceptrons would be futile. Indeed, Rumelhart, Hinton, and Williams (1986) showed that multilayered networks with BP learning were able to solve a wide variety of nonlinear classification problems, including the notorious XOR problem. Backpropagation has become the backbone of current neural network research.


Neural Network Applications


The continued and ever-increasing interest in neural net study has been both a consequence of and a driving force for successful applications. In many areas neural nets offer a different (sometimes drastically so) method of approaching a problem, and open new avenues to attack traditionally intractable tasks or to solve more efficiently problems that are being solved with traditional methods. In the following we will survey the applications of neural nets in artificial intelligence (AI), decision sciences, business, and engineering, while largely omitting the bulk of research in cognitive science, psychology, and neuroscience.


2.3.1 Neural Networks in AI


Traditional AI, as a rival of neural networks, was successful in the 70's. AI, in particular expert systems, has found many fruitful applications. Tasks that were regarded as requiring high intelligence, such as chess playing and theorem proving, can be accomplished by expert systems with remarkable performance. Traditional AI approaches are, however, inefficient in solving pattern recognition problems, such as vision and speech processing, due to their nature of symbolic representation and serial processing. Expert system development has been hindered by the notorious knowledge acquisition bottleneck. For one thing, experts are rare. Perhaps more importantly, expert knowledge cannot simply be put down as a set of precise rules. The parallel distributed processing paradigm of neural nets seems a promising alternative to overcome the difficulties in AI.

On the other hand, the success and advantages of traditional AI approaches are not deniable. One noticeable inroad that neural nets have made into traditional AI is the integration of the two seemingly different approaches. Several ways of integrating neural nets with AI systems are discussed in Caudill (1990). Lamberts (1988) built a hybrid system where neural nets were used as a front-end processor that performs low level learning while an expert system performs high level reasoning. The inference attained by the expert system from processing the output of the neural nets is used as a guide to modify the neural network weights.
Becker and Peng (1987) proposed a method for integrating neural nets and symbolic processing. Gallant (1988) worked on the problem of extracting production rules from neural nets, using a limited set of values for the activation functions. The ... multilayer neural network. Maskara and Noetzel (1990) used neural nets as an efficient front-end for a rule-based system where the neural network was trained to learn the associations of the expert system rules. Similar to a content addressable memory, upon receiving partial rule descriptions, the neural network outputs all applicable rules.


Neural nets appear well suited for fuzzy learning. Shiue and Grondin (1987) developed a fuzzy-learning neural automaton. Hayashi and Nakai (1989, 1990) used neural nets to generate fuzzy rules. Fuzzy production rules and their membership functions can be implemented in structured neural nets (Yamaguchi et al., 1990).


In the mapping of rule-based systems to neural nets, a concept (feature, word, symbol, variable, fact, predicate, etc.) may be represented as a unit, and logic relations between concepts may be represented by the connections between units. The strength (weights) of the connections then corresponds to the degree of certainty of the logic relations (Tan et al., 1990; Yang and Bhargava, 1990). Thus learning in neural nets can be regarded as modifying the certainty of the rules. Kuncicky (1990) proposed an isomorphism that maps not only from rule-based systems to neural nets, but also from neural nets to a rule-based system. The number and structure of the rules may change in such a hybrid system as a result of neural network learning.


Kerce and Mueller (1990) used a heuristic link neural network applied to state space search. A feedforward neural network is employed that takes the state description as inputs, and its output is used as a guiding heuristic for the state space search.


Successful applications of neural nets in AI areas (such as control, vision, robotics, speech, and game playing) are numerous (Wang and Yeh, 1990). One of the most influential applications is NETtalk by Sejnowski and Rosenberg (1986). NETtalk is a simple two-layer feedforward neural network. Given a series of examples of English text and the correct pronunciation, NETtalk was able to learn to read English ...

... exposition to examples that embed the target concept (to be learned). In contrast, a conventional computer requires algorithmic approaches, or "intensional programming," where strict instructions or rules are followed with no reference to specific examples. Extensional programming cuts down the needs in knowledge acquisition, and hence represents a powerful technique (Knight, 1990).

2.3.2 Neural Networks in Decision Sciences


Neural nets provide a powerful computational framework that extends their application scope far beyond traditional AI problems. As mentioned above, neural nets can be integrated with expert systems, and hence provide a new way of implementing decision support systems. Under certain conditions, neural nets are equivalent to Bayesian classifiers. This opens wide possibilities for using neural nets in decision sciences. The inherent properties of neural nets enable them to do more than just statistical decision analysis. Weigend (1990) reported neural net classifiers that have been shown to outperform statistical methods. Burke (1991) and Burke and Ignizio (1992) described several neural network systems and their applications in decision making. They also discussed conditions under which neural nets would be preferable to conventional procedures and gave some guidelines for using neural nets in operations research.


Hornik, Stinchcombe and White (1989) and others (Hecht-Nielsen, 1989; Cybenko, 1989) have shown that multilayer feedforward neural nets are universal approximators. Simple feedforward neural nets with as few as one hidden layer can approximate any continuous input-output mapping to arbitrarily specified accuracy (the number of hidden units may have to go to infinity, though). This result solved theoretically the representation issue and made neural nets a legitimate tool for function approximation, with numerous applications in system identification, design, control, modeling and prediction (Werbos, 1989).
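As a sketch of what these theorems assert (in our notation; h, v_j, w_j and b_j are generic parameters, not symbols taken from the dissertation): for any continuous f on a compact domain and any accuracy eps > 0, there is a one-hidden-layer network with

```latex
% Universal approximation by a one-hidden-layer FNN with h sigmoid
% hidden units and a linear output:
\[
  \Bigl|\, f(x) \;-\; \sum_{j=1}^{h} v_j\,
    \sigma\!\bigl( w_j \cdot x + b_j \bigr) \Bigr| \;<\; \varepsilon
  \qquad \text{for all } x \text{ in the domain,}
\]
% provided the number of hidden units h is allowed to grow as needed.
```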











... and Tank, 1985). Besides Hopfield networks, other neural nets used in combinatorial optimization include Boltzmann machines (Hinton and Sejnowski, 1986), Cauchy machines (Jeong and Park, 1989) and self-organizing networks (Durbin and Willshaw, 1987; Hueter, 1988).


Ramanujam and Sadayappan (1988) showed how to map to neural networks a number of combinatorial optimization problems, including the traveling salesman problem (TSP), the graph partition problem, the vertex covering problem, and the maximum clique problem. Compared with conventional approaches, they reported that neural network results showed promise. Xu and Tsai (1991) did extensive experiments on the TSP. One of their neural-net-based algorithms matches or outperforms the best known heuristic, the Lin and Kernighan algorithm (Lin and Kernighan, 1973). Also, the neural-net-based algorithm was shown to scale up better than the Lin and Kernighan algorithm. Foo and Takefuji (1988a,b) applied a stochastic neural network to job-shop scheduling. A deterministic approach was also used by Foo and Takefuji (1988c) to solve the same problem with neural network implemented integer linear programming.


A relatively new advance of neural nets has been made in the area of mathematical programming. Maa and Shanblatt (1989, 1990) applied neural nets to linear programming problems. Kennedy and Chua (1988) used neural nets for nonlinear programming. Barbosa and de Carvalho (1990) applied neural nets in feasible direction linear programming. An adaptive feedforward neural net was used in multiple criteria decision making (Zhen and Malakooti, 1990). Other applications include the shortest path (Helton, 1990), routing (Zhang and Thomopoulos, 1989), the knapsack problem (Li, Fang and Wilson, 1989), and task assignment (Tanaka et al., 1989).
Neural nets are rivalling traditional statistical analysis in classification (Pratt and Kamm, 1991), principal components analysis (Baldi, 1989), regression (Orris and Feeser, 1991), and forecasting (Sharda and Patil, 1990). Choukri et al. (1991) re- ...

... with past data, generated accurate predictions and consistently outperformed traditional statistical methods such as the TAR (threshold autoregressive) model (Tong et al., 1980). Compared with an established time series forecasting technique, the Box-Jenkins method, neural nets have the advantages of automatic learning, better performance for nonstationary series and long-term forecasting (Tang, de Almeida and Fishwick, 1990).
With the abilities of model identification, generalization, and prediction, neural nets have found many applications in business and engineering. In business, neural nets have been successfully applied to loan evaluation (Judge, 1989), signature recognition (Rochester, 1990), stock market forecasting (Dutta and Shekhar, 1988) and other classification analyses (Fisher and McKusick, 1989; Singleton and Surkan, 1990). In engineering, neural nets have been applied to hardware fault diagnosis (Tan et al., 1990), power system state evaluation (Nishimura and Arai, 1990), wastewater treatment systems (Krovvidy and Wee, 1990), and intelligent FMS (flexible manufacturing system) scheduling (Rabelo, Alptekin and Kiran, 1990). The potential of neural nets as an engineering design tool is still being explored. New applications are emerging in a variety of engineering areas. Wu et al. (1990) used neural-net-based systems to model the behavior of materials and obtained promising results. Neubauer (1991) applied neural networks to metal processing. Neural nets have also been used in structural mechanics computation, transportation and other engineering applications (Sun and Fu, 1991; Dagli and Lammers, 1989).


Promise and Problems


Unlike the hype surrounding neural nets 30 years ago, today's neural net research has aimed at solving real-world problems. Nearly all the big companies in the computer industry (AT&T, IBM, Texas Instruments and others) are involved in ...







... Diego, has retained its momentum with the participation of researchers from more and more diversified areas and pump-priming funding from NSF, NASA, DARPA, and other major sponsors. Judging from their success in the past few years and the still widening and deepening scope, we may conclude that neural nets indeed hold great promise.


The current optimism in neural nets' future is no less fantastic than that in the early 60's. Neural nets, along with nuclear technology and superconductivity, have been dubbed one of the greatest inventions of our modern society. Leon Cooper, a Nobel laureate, commented (at IJCNN 1990) that what neural nets would be for the next century is what the computer is for today. Hecht-Nielsen (1986) went further, saying:


It is clear that if [neural network technology] realizes its stated goals, its impact on human society will be profound. It may thus come to pass that we are now living at the boundary between two great epochs of human existence; namely, the transition from Civilization to [...] [a term coined by Hecht-Nielsen to describe the imaginary future noble society]. It has been 10,000 years since the last such transition (from Culture to Civilization). If all of this is true, we are most fortunate to be alive to witness and participate in this change.


While a repeat of neural network history in the late 60's seems unlikely, we need to be very cautious about overly optimistic expectations. None of those startling claims, such as "brain-like machines" in the nontechnical literature, has really been realized. It is true that great progress has been made. However, the field is far from mature. Current research in neural nets faces many challenges in both theoretical study and practical implementation. In the theoretical aspect, a solid general foundation has yet to be established. There exist more than a dozen different neural network architectures that are being used in different problem domains. Each model has its own theory and implementation peculiarities. Little has been done to establish a common ground for those models, although Grossberg at Boston University is reportedly attempting a theoretical framework that would explain all neural behaviors (Miller, ...












... than they would be otherwise. Recent progress has shed some light into the "black boxes" (Fu, 1991), but the overall picture is still obscure.

The leading neural network model, the multilayered feedforward neural network with backpropagation, suffers the same obscurity. BP has been widely used in many applications, often with encouraging results. The theory behind BP is, however, far from soundly established. BP is a simple and elegant procedure that overcomes the difficulty of "credit assignment." But this procedure has some fundamental limitations, as listed below:


1. Learning (training) is generally slow.

2. No convergence results have been established for pattern training, the most commonly used training procedure.

3. Convergence of epoch training to a local minimum is achieved, but a strictly local minimum may not represent a desired solution.

4. The parameters, namely, the learning rate η and the momentum α, need to be set empirically.

5. The structure of the network (number of layers and units) is determined arbitrarily.


The model offers the flexibility of choosing training schemes (epoch or pattern) and different global criterion functions and neuron activation functions, but no general guidelines exist.
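For reference, the two parameters in item 4 enter through the standard BP weight update (the usual textbook form, stated here for orientation; the derivation appears in Chapter 3):

```latex
% Gradient descent with momentum: eta is the learning rate, alpha
% the momentum coefficient, E the error function.
\[
  \Delta w(t) \;=\; -\,\eta\,\nabla E\bigl(w(t)\bigr)
              \;+\; \alpha\,\Delta w(t-1).
\]
```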


Extensive work has been done to explore BP's potential and overcome its limitations in the last few years. A great research effort is devoted to overcoming the first problem mentioned above. A number of local acceleration heuristics are discussed in ... (... et al., 1990). Those improvements on backpropagation often increase the learning speed significantly in terms of training epochs at the cost of an increased computational effort.


Few researchers have considered the second and third problems of BP. It has been reported that BP with pattern training works better than epoch training for a large training sample. But no theoretical account of this phenomenon has been thoroughly carried out. Most people choose to use epoch training or pattern training arbitrarily. This leads to potentially erroneous conclusions about the efficacy of the algorithm.
For the global convergence problem, empirical results have shown that with ample hidden units embedded in the network, BP can usually escape a local minimum (Rumelhart et al., 1986), probably due to large degrees of freedom. However, increasing hidden units in the network may not be an appealing idea, since an unnecessarily large number of hidden units is likely to decrease the generalization capability of the network (Kruschke and Movellan, 1989; Baum and Haussler, 1989) and may cause overfitting problems (Weigend et al., 1990). Fang and Li (1991) have adapted simulated annealing methods to neural network training. Their approach guarantees the solution will be globally optimal if a proper annealing schedule is derived for the given problem. Montana and Davis (1989) and Belew et al. (1990) used genetic algorithms to train feedforward neural nets. The drawback of these approaches is that they involve a random search (sometimes blind) and, hence, are not efficient in general.
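For orientation, the simulated annealing approach mentioned above rests on the standard Metropolis acceptance rule (a textbook form, not a detail taken from Fang and Li's paper): a candidate weight change that worsens the error by ΔE > 0 is still accepted with probability

```latex
% Metropolis acceptance probability at temperature T:
\[
  P(\mathrm{accept}) \;=\; \exp\!\Bigl( -\frac{\Delta E}{T} \Bigr),
\]
% which permits uphill moves while T is large and freezes the search
% as T is annealed toward zero; with a proper cooling schedule this
% yields convergence in probability to a global optimum.
```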
In the interest of efficiency and generalization, the complexity of a neural network should be kept to its bare minimum. Some researchers (Teh and Yu, 1988; Sietsma and Dow, 1988) developed heuristic rules for pruning away inessential hidden units during training, starting with an oversized network. Others (Tenorio and Lee, 1989) used dynamic procedures to generate new units as needed. Those approaches ...











... large network. This method has been used in Chauvin (1990) and others. One of the drawbacks of this approach is that training time increases noticeably.

The deficiency of neural nets, in particular of backpropagation, indicates that much theoretical work needs to be done before we can fully explore the potential of this emerging computation framework. We are not sure whether or when a profound common theoretical basis for all neural network paradigms will emerge. But what we can do now is to conduct a rigorous, systematic study of the major neural net models, study their efficacy and efficiency, identify the conditions under which they may be effectively applied, explore the theoretical capabilities and limitations, and build new and improved procedures based on the theoretical guidelines. By doing so we can hope to better understand this new field and its future and proceed gradually to realize its potential to the fullest extent.


















CHAPTER 3
FEEDFORWARD NEURAL NETWORKS

Feedforward neural nets (FNNs) are the most popular neural network paradigm in the computation modeling branch of neural net research. The principal learning algorithm for training FNNs is the backpropagation (BP) algorithm. The popularity of BP arises from its simplicity and successful applications to many real-world problems. This chapter will discuss the development of the backpropagation learning algorithm. The efficacy and limitations of the BP algorithm will be analyzed, while improvements of the classic algorithm will be presented in the next chapter. We will give basic definitions and present theorems about the representation capability of general FNNs. We start with the building blocks of a neural network, the neurons, and then the workable neural network, the perceptron. (Feedforward neural nets are built upon perceptrons.)


The Processing Units (Neurons)


There have been many nonstandard terminologies used in the neural net literature. We will stick to the most general ones throughout our discussion. In some cases we use two terms interchangeably, e.g., processing unit and neuron; we will include both terms in the definition.


Definition 3.1 (Processing Unit) A processing unit (neuron) is the basic element of an artificial neural network. A neuron consists of multiple input connections from other neurons; a transfer function that maps the inputs to a scalar; an activation function that maps the scalar to a real or binary activation value (state); and an output that transmits the activation to other units in the network.














Figure 3.1. Structure of a single neuron


The first such processing unit was the McCulloch-Pitts neuron. This basic model is still widely used today. It has a multiple input port and a single output port. Before the inputs are fed into the neuron, they are multiplied by corresponding weights on their pathways. The output is produced by taking the weighted sum of the inputs and thresholding it via a heaviside (threshold) function. A heaviside function returns one of two discrete values, a and b, where a, b ∈ R. Depending on whether the input is greater than or less than the threshold θ, b or a is returned. It is common to set a = 0 and b = 1. A sketch of the model is shown in Figure 3.1.
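As a concrete illustration (our sketch in C++, not code from the dissertation's appendices; the function names and the defaults a = 0, b = 1 simply follow the conventions just described), an M-P neuron can be computed as:

```cpp
#include <cstddef>
#include <vector>

// Heaviside threshold: returns b if the net input exceeds the
// threshold theta, and a otherwise (commonly a = 0, b = 1).
double heaviside(double net, double theta, double a = 0.0, double b = 1.0) {
    return net > theta ? b : a;
}

// A McCulloch-Pitts neuron: weighted sum of the inputs, then threshold.
double mpNeuron(const std::vector<double>& x,
                const std::vector<double>& w, double theta) {
    double net = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        net += w[i] * x[i];          // transfer function: weighted sum
    return heaviside(net, theta);    // activation function
}
```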


Definition 3.2 (Net Input) The net input results from mapping multiple inputs to a real or integer value. Frequently this takes the form of a weighted sum of the inputs.


Definition 3.3 (Activation Function) The activation function is a function that maps the net input to a real or binary activation value (state) of the processing unit.

Besides the heaviside function, other commonly used activation functions include the semilinear function and the sigmoid function. The semilinear function is a nondecreasing function, linear in a certain range and constant outside that range. The ...
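The source page describing the sigmoid is damaged; for completeness, the logistic sigmoid conventionally used in FNNs (and assumed in later chapters' references to its properties) is

```latex
\[
  \sigma(x) \;=\; \frac{1}{1 + e^{-x}},
\]
% a smooth, strictly increasing map from R onto (0, 1) that acts as
% a soft threshold.
```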






















Figure 3.2. Typical activation functions: heaviside, semi-linear, and sigmoid


The Perceptron Learning


In the following we give definitions concerning perceptron learning and then present the learning algorithm and its finite convergence theorem.


Definition 3.4 (Learning Rule) A learning rule is the procedure by which an artificial neural network adjusts the internal representation (weights and thresholds) of its units in response to the environment.

Definition 3.5 (Perceptron) A perceptron is a simple neural network consisting of a single or a set of processing units with heaviside activation functions and the perceptron learning algorithm.

Definition 3.6 (Training Sample) A training sample T is a sample drawn from a given population. The sample is used as the environment of the neural network, providing inputs and target values (if applicable).

Definition 3.7 (Instance) Any particular element x of the training set T is an instance. x may have binary or real-valued attributes.

Definition 3.8 (Sample Training) Sample (epoch) training refers to a neural net training scheme in which the weights are adjusted once for the whole training sample.

Definition 3.9 (Instance Training) Instance (pattern) training refers to a neural net training scheme in which the weights are adjusted after each instance of the training sample. If the instance is chosen sequentially from the sample, it is called sequential instance training (sequential training). If the instance is chosen randomly from the sample, it is called randomized instance training (randomized training).


Note that an instance, x, is an example of some concept (hypothesis) to be learned. In the neural net training process, both the instances and the concepts associated with the instances are provided to the network. Let x ∈ R^n be an instance, let T+ denote the set of positive instances (a positive instance is an example of the target concept or class), and let T- denote the set of negative instances (a negative instance is a counterexample of the target concept or class). Let w ∈ R^n be a weight vector. The perceptron learning algorithm can be stated as follows:


The Perceptron Learning Algorithm (PLA):

START: Set w ∈ R^n randomly.

TEST: Let X = {x : (x ∈ T+ and w · x ≤ 0) or (x ∈ T− and w · x > 0)}.
If X = ∅, stop.
Otherwise: pick any x ∈ X;
if x ∈ T+, go to ADD;
if x ∈ T−, go to SUBTRACT.

ADD: w ← w + x; go to TEST.

SUBTRACT: w ← w − x; go to TEST.
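The following is a minimal NumPy sketch of the PLA above, assuming finite lists of positive and negative instances; the function name and the toy data are illustrative assumptions, not the author's code.

```python
# Perceptron Learning Algorithm: repeat TEST/ADD/SUBTRACT until no
# instance is misclassified (assumes the data are linearly separable).
import numpy as np

def pla(T_pos, T_neg, max_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1, 1, T_pos[0].shape[0])      # START: random w
    for _ in range(max_iter):
        # TEST: collect the currently misclassified instances
        X = [(x, +1) for x in T_pos if w @ x <= 0] + \
            [(x, -1) for x in T_neg if w @ x > 0]
        if not X:
            return w                                # X empty: w separates T+ and T-
        x, sign = X[0]
        w = w + sign * x                            # ADD for T+, SUBTRACT for T-
    return w

# Usage on a small linearly separable set (homogeneous form, no threshold):
w = pla([np.array([2.0, 1.0]), np.array([1.0, 2.0])],
        [np.array([-1.0, -1.0]), np.array([-2.0, 0.5])])
```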












Definition 3.10 (Convex Set) A set S is convex if for each x, y ∈ S and any λ ∈ [0, 1], z = λx + (1 − λ)y ∈ S.

Definition 3.11 (Convex Hull) Let S be either finite or infinite; the convex hull of S, denoted by h(S), is the smallest convex set that contains S.

Definition 3.12 (Bounded Set) A set S ⊂ R^n is bounded if there exists an M ∈ R, M < ∞, such that S ⊆ Bo(M) = {x ∈ R^n : ||x|| ≤ M}.

Definition 3.13 (Linearly Separable) Two sets S1 and S2, either finite or infinite, are linearly separable4 if there exists a nonzero vector w ∈ R^n and a scalar θ such that


w · x > θ for every x ∈ S1 and w · x < θ for every x ∈ S2.

Theorem 3.1 (Perceptron Convergence) Suppose T+ and T− are bounded in R^n and are linearly separable; then the perceptron learning algorithm will find a hyperplane that separates T+ and T− in finite time.

Proof: Let H = T+ ∪ (−T−), where −T− = {−x : x ∈ T−}; then the PLA produces the sequence of vectors

\[
w^n = w^{n-1} + x^{n-1}, \qquad n = 1, 2, \ldots,
\]

where w^0 is arbitrary, and x^{n−1} ∈ H is picked such that w^{n−1} · x^{n−1} ≤ 0. By assumption, there exist a w* and an α > 0 such that w* · x ≥ α for all x ∈ H, and, by boundedness, a β < ∞ such that ||x||² ≤ β for all x ∈ H.

At step n, we have

\[
w^n \cdot w^* = w^{n-1} \cdot w^* + x^{n-1} \cdot w^* \ \ge\ w^{n-1} \cdot w^* + \alpha \ \ge\ \cdots \ \ge\ w^0 \cdot w^* + n\alpha,
\]

so that, by the Cauchy inequality,

\[
\|w^n\| \, \|w^*\| \ \ge\ w^n \cdot w^* \ \ge\ w^0 \cdot w^* + n\alpha,
\]

and, since w^{n−1} · x^{n−1} ≤ 0 and ||x^{n−1}||² ≤ β,

\[
\|w^n\|^2 = \|w^{n-1}\|^2 + 2\, w^{n-1} \cdot x^{n-1} + \|x^{n-1}\|^2 \ \le\ \|w^{n-1}\|^2 + \beta \ \le\ \|w^0\|^2 + n\beta.
\]

Thus we have

\[
(w^0 \cdot w^* + n\alpha)^2 \ \le\ \|w^n\|^2 \|w^*\|^2 \ \le\ (\|w^0\|^2 + n\beta)\, \|w^*\|^2,
\]

or the quadratic inequality

\[
\alpha^2 n^2 + \big(2\alpha (w^0 \cdot w^*) - \beta \|w^*\|^2\big)\, n + \big((w^0 \cdot w^*)^2 - \|w^0\|^2 \|w^*\|^2\big) \ \le\ 0. \tag{3.1}
\]

Since α² > 0, given any α and β, the largest n satisfying (3.1) exists and is finite. Thus, after at most that many weight updates, the test set X is empty and the algorithm stops with a separating hyperplane.

















Figure 3.3. Geometrical explanation of the perceptron learning algorithm


Note that the proof does not assume finiteness of H. The PLA procedure can be applied to infinite sets, as long as provisions are made to carry out the stopping criterion test.

To understand the perceptron convergence procedure geometrically, the following concepts are useful:

Definition 3.14 (Convex Cone) A convex set S is a convex cone if λx ∈ S for any λ ≥ 0 and any x ∈ S.

Definition 3.15 (Dual Cone) Given a set S, the dual cone of S, denoted by S*, is S* = {w : w · x > 0 for every x ∈ S}.

Geometrically, the perceptron learning procedure finds an interior point in the dual cone of H = T+ ∪ (−T−). Starting with any random vector w, the ADD procedure (by the definition of H, the ADD procedure now includes the SUBTRACT procedure) moves w toward the dual cone of H. This geometrical view of perceptron learning opens a rich body of related research, using approaches known as relaxation methods (see, for example, Agmon, 1954).
Various modifications have been suggested to the basic perceptron learning algorithm. In step 3 (ADD weights), w^(n+1) = w^(n) + x^(n) can be replaced by

\[
w^{(n+1)} = w^{(n)} + \rho\, x^{(n)}
\]

where ρ > 0 is a constant. Taking ρ = 1/||x||² normalizes the weight change in the direction of x. Agmon (1954) suggested (in a different context) ρ = −c(w · x)/||x||², where c ∈ (0, 2). The number of iterations of the algorithm changes with these variations, but the finite convergence property is retained. The convergence proofs of the perceptron variations Adaline (Widrow and Hoff, 1960) and Madaline (Widrow and Stearns, 1985) can be found in Poliac (1989).
The basic perceptron learning rule can be easily generalized to handle multiple class problems. Let H1, H2, ..., HK be the sets of instances for each class. The classification problem requires finding weight vectors w*_i, i = 1, ..., K, such that for each x ∈ H_i, w*_i · x > w*_j · x + δ for all j ≠ i, where δ > 0 is a scalar. The learning procedure is presented in the following. Proof of the convergence of this procedure is omitted since it is a direct extension of Theorem 3.1.

Multi-class Perceptron Algorithm:


START: Set w_i ∈ R^n, i = 1, ..., K, to any random values.

TEST: Let X_i = {x ∈ H_i : w_i · x ≤ w_j · x + δ for some j ≠ i}, i = 1, ..., K.
If X_i = ∅ for all i = 1, ..., K, stop.
Otherwise pick any x ∈ X_i and go to UPDATE.

UPDATE: w_i ← w_i + x, and w_j ← w_j − x for each j ≠ i with w_i · x ≤ w_j · x + δ; go to TEST.
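The following is a small sketch of one pass of the multi-class rule as reconstructed above; the data layout (a list of per-class instance lists) and the function name are illustrative assumptions.

```python
# One TEST/UPDATE pass of the multi-class perceptron: H is a list of K lists
# of NumPy instance vectors, W a list of K weight vectors, delta the margin.
import numpy as np

def multiclass_pass(W, H, delta):
    updated = False
    for i, Hi in enumerate(H):
        for x in Hi:
            # classes j that violate w_i . x > w_j . x + delta
            J = [j for j in range(len(W))
                 if j != i and W[i] @ x <= W[j] @ x + delta]
            if J:
                W[i] = W[i] + x          # reinforce the correct class
                for j in J:
                    W[j] = W[j] - x      # penalize the violating classes
                updated = True
    return updated  # False once every instance satisfies the margin
```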















Figure 3.4. The XOR problem and its geometrical representation


One of the intriguing properties of the perceptron learning algorithm is that it uses only locally available information, modifying weights after the presentation of each input pattern. Yet the procedure constructs a globally optimal solution (for linearly separable patterns). Local procedures are suitable for parallel implementation and hence have the potential for fast, real-time applications. Minsky and Papert (1969) pointed out that it would be interesting to compare the relative efficiency of the perceptron procedure with global analytic methods, such as linear programming, for solving the system of inequalities. No systematic study has been done in the comparison of perceptron learning with global analytic approaches. Many researchers have, however, realized the importance of "locality" in learning (see, for example, Jacobs, 1988). This issue is further explored in later chapters.


The Limitation of Perceptrons


Minsky and Papert (1969) showed that perceptrons failed to solve a number of


simple pattern classification problems, in particular, the Exclusive Or (XOR) problem. The XOR problem has been used extensively as a benchmark for neural network algorithm evaluation for this historical reason. The problem has four patterns. Each pattern has two binary inputs and one binary output. The output is true (with value 1) when exactly one of the two inputs is 1, and false (with value 0) otherwise.































Figure 3.5. An example of layered perceptrons that solve the XOR problem

The failure of the perceptron is due to its insufficient knowledge representation, not its learning procedure. Perceptrons construct only linearly separable decision regions, but there is no linearly separable region that can solve the XOR problem, as can be seen in Figure 3.4. To solve the XOR problem, a more complex convex decision region is needed. Multilayered perceptrons could form such a decision region. For example, let one perceptron separate pattern (0,0) from the others, and another perceptron separate pattern (1,1) from the others. A third perceptron, taking the outputs of the first two as input, could produce a convex decision region that successfully classifies patterns (0,1) and (1,0) into one group. The idea is depicted in Figure 3.5 (following Beale and Jackson, 1990).
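The following sketch checks this construction numerically; the particular weights and thresholds are illustrative assumptions consistent with the description above, not taken from Figure 3.5.

```python
# A 2-2-1 layered perceptron with heaviside units that solves XOR:
# hidden unit 1 separates (0,0) from the rest, hidden unit 2 separates (1,1).
def step(net, theta):
    return 1 if net > theta else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2, 0.5)       # fires for everything except (0,0)
    h2 = step(-x1 - x2, -1.5)     # fires for everything except (1,1)
    return step(h1 + h2, 1.5)     # fires only when both hidden units fire

assert [xor_net(*p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```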


Thus multilayered perceptrons are powerful enough to form polyhedral convex decision regions. The weakness is the training problem of such layered perceptrons with the heaviside threshold function. The perceptron learning procedure can correctly adjust only the weights between inputs and outputs, but not the weights between perceptrons. This difficulty is overcome by introducing continuous activation functions (Rumelhart et al., 1986). This is shown in the next section.


Feedforward Neural Nets and the BP Algorithm


Definition 3.16 (FNN) A feedforward neural network (FNN) is a neural network consisting of neurons that are arranged in layers, namely, an input layer, hidden layer(s), and an output layer. Connections are unidirectional from lower layers to higher layers with no feedback paths.


By definition, multilayer perceptrons are a subset of feedforward neural nets with heaviside activation functions. But, conventionally, when we say feedforward neural nets we mean feedforward neural nets with continuous activation functions, as distinguished from perceptrons. Multilayer perceptrons are able to represent linearly non-separable problems, but there is no efficient learning procedure for them. Using FNN enables us to solve the neural net "credit assignment" problem: given the output generated from an input, which weights should be changed, and how, to approximate the desired output? The classic algorithm to train an FNN is called backpropagation, a learning algorithm that modifies the network weights based on their contributions to a global performance criterion function. A gradient descent search procedure is employed.


Let (x, y) denote a training example (pattern), where x is an input vector and y is the target output vector. Also, let o denote the network output and w denote the weights of the network. We use N_I × N_H × N_O to represent the structure of a feedforward neural net, where N_I, N_H and N_O are the numbers of input units, hidden units and output units, respectively. Figure 3.6 shows a 2 × 2 × 2 fully connected feedforward neural network. For convenience, only two processing units are used in each layer.
















Figure 3.6. A 2 × 2 × 2 feedforward neural network (input, hidden, and output layers)


Typically, each noninput processing unit uses the sigmoid function

\[
f(x) = \frac{1}{1 + e^{-\gamma x}} \tag{3.3}
\]

where γ is a constant controlling the slope of the function. The net input to a processing unit j is given by

\[
net_j = \sum_i w_{ij}\, x_i - \theta_j, \tag{3.4}
\]

where the x_i are the outputs from the previous layer, w_ij is the weight (connection strength) of the link connecting unit i to unit j, and θ_j is the bias, which determines the location of the sigmoid function on the x axis. For notational convenience, we let x_0 = 1 and w_0j = −θ_j; then we have

\[
net_j = \sum_{i \ge 0} w_{ij}\, x_i, \tag{3.5}
\]

and the activation (output) of unit j is o_j = f(net_j).











A feedforward neural net works by training it with known examples. A random example (x_p, y_p) is drawn from the training set {(x_p, y_p) : p = 1, 2, ..., P}, and x_p is fed into the network through the input layer. The network computes an output vector o_p based on the hidden layer output. o_p is compared against the training target y_p. A performance criterion function is defined based on the difference between o_p and y_p. A commonly used criterion function is the sum of squared error (SSE) function

\[
F = \sum_p F_p = \frac{1}{2} \sum_p \sum_k (y_{pk} - o_{pk})^2 \tag{3.7}
\]

where p is the index for the pattern (example) and k the index for output units; the factor 1/2 merely simplifies the derivatives below. The error computed from the output layer is backpropagated through the network, and the weights w_ij are modified according to their contribution to the performance criterion function:

\[
\Delta w_{ij} = -\eta\, \frac{\partial F}{\partial w_{ij}} \tag{3.8}
\]

where η is called the learning rate, which determines the step size of the weight updating.

3.5 Backpropagation Derivation

For ease of exposition, let us consider the error resulting from a single training instance:


\[
F_p = \frac{1}{2} \sum_k (y_{pk} - o_{pk})^2 \tag{3.9}
\]

For connections leading to the output layer (refer to Figure 3.6), the partial derivative of F_p with respect to weight w_jk can be written

\[
\frac{\partial F_p}{\partial w_{jk}} = \frac{\partial F_p}{\partial o_{pk}} \, \frac{\partial o_{pk}}{\partial net_k} \, \frac{\partial net_k}{\partial w_{jk}} \tag{3.10}
\]

using the chain rule. Here

\[
\frac{\partial F_p}{\partial o_{pk}} = -(y_{pk} - o_{pk}), \qquad \frac{\partial o_{pk}}{\partial net_k} = f'(net_k), \qquad \frac{\partial net_k}{\partial w_{jk}} = o_j. \tag{3.11–3.13}
\]

Denote

\[
\delta_k = -\frac{\partial F_p}{\partial net_k} = (y_{pk} - o_{pk})\, f'(net_k). \tag{3.14}
\]

Then we have

\[
\frac{\partial F_p}{\partial w_{jk}} = -\delta_k\, o_j \tag{3.15}
\]

and

\[
\Delta w_{jk} = -\eta\, \frac{\partial F_p}{\partial w_{jk}} = \eta\, \delta_k\, o_j. \tag{3.16}
\]

This weight updating applies to the output layer weights (i.e., the weights leading to the output layer). Similarly, for hidden layer weights we have, by the chain rule,

\[
\frac{\partial F_p}{\partial w_{ij}} = \sum_k \frac{\partial F_p}{\partial net_k} \, \frac{\partial net_k}{\partial o_j} \, \frac{\partial o_j}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ij}}. \tag{3.17}
\]

Since

\[
\frac{\partial net_k}{\partial o_j} = w_{jk} \quad \text{and} \quad \frac{\partial net_j}{\partial w_{ij}} = o_i, \tag{3.18–3.19}
\]

define

\[
\delta_j = -\frac{\partial F_p}{\partial net_j} = f'(net_j) \sum_k \delta_k\, w_{jk}. \tag{3.20}
\]

Then

\[
\frac{\partial F_p}{\partial w_{ij}} = -\delta_j\, o_i \quad \text{and} \quad \Delta w_{ij} = \eta\, \delta_j\, o_i. \tag{3.21–3.22}
\]

If the sigmoid activation function is used, we have

\[
f'(net_j) = \frac{\gamma\, e^{-\gamma net_j}}{(1 + e^{-\gamma net_j})^2} = \gamma f(net_j)\big(1 - f(net_j)\big) = \gamma\, o_j (1 - o_j). \tag{3.23}
\]

Thus the derivative is easily obtained from the output of the processing units. Other performance criterion functions may be defined and other activation functions may be used. These variations will be covered in the next chapter.


The backpropagation algorithm is formally stated below:

Algorithm BP

1. INITIALIZE:

* Construct the feedforward neural network. Choose the number of input units and the number of output units equal to the length of the input vector x and the length of the target vector y, respectively.

* Randomize the weights and biases in a small interval around zero.

* Specify a stopping criterion such as F < F_stop or n > n_max. Set iteration number n = 0.

2. FEEDFORWARD:

* Compute the outputs of the noninput units. The network output for a given example p is

\[
o_{pk} = f\Big(\sum_j w_{jk}\, f\Big(\sum_i w_{ij}\, x_i\Big)\Big).
\]

* Compute the error using Equation 3.7.

3. BACKPROPAGATE:

* For each output unit k, compute δ_k = (y_k − o_k) f'(net_k).

* For each hidden unit j, compute δ_j = f'(net_j) Σ_k δ_k w_jk.

4. UPDATE:

\[
\Delta w_{ij}(n+1) = \eta\, \delta_j\, o_i + \alpha\, \Delta w_{ij}(n)
\]

where η > 0 is the learning rate (step size) and α ∈ [0, 1) is a constant called the momentum.

5. REPEAT: Set n = n + 1 and go to Step 2.
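The following is a minimal NumPy sketch of Algorithm BP for a single-hidden-layer network with sigmoid units (γ = 1) and the momentum update of Step 4. The function names, the initialization range, and the XOR usage example are illustrative assumptions, not the author's code.

```python
# Instance training of a N_I x N_H x N_O feedforward net by backpropagation.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, Y, n_hidden, eta=0.5, alpha=0.9, n_max=10000, f_stop=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], Y.shape[1]
    # Biases are folded in via a constant input x0 = 1 (w_0j = -theta_j).
    W1 = rng.uniform(-1, 1, (n_in + 1, n_hidden))
    W2 = rng.uniform(-1, 1, (n_hidden + 1, n_out))
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
    for n in range(n_max):
        F = 0.0
        for x, y in zip(X, Y):
            # FEEDFORWARD
            x1 = np.append(x, 1.0)
            h = sigmoid(x1 @ W1)
            h1 = np.append(h, 1.0)
            o = sigmoid(h1 @ W2)
            F += 0.5 * np.sum((y - o) ** 2)          # Eq. 3.7 (one pattern)
            # BACKPROPAGATE: with gamma = 1, f'(net) = o(1 - o)
            delta_k = (y - o) * o * (1 - o)          # Eq. 3.14
            delta_j = h * (1 - h) * (W2[:-1] @ delta_k)   # Eq. 3.20
            # UPDATE with momentum
            dW2 = eta * np.outer(h1, delta_k) + alpha * dW2
            dW1 = eta * np.outer(x1, delta_j) + alpha * dW1
            W2 += dW2
            W1 += dW1
        if F < f_stop:                               # stopping criterion
            break
    return W1, W2, F

# Usage: the XOR problem with two hidden units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Y = np.array([[0], [1], [1], [0]], float)
W1, W2, F = train_bp(X, Y, n_hidden=2)
```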


3.6 The Representation Capability of FNN


A feedforward neural net can be regarded as a general nonlinear model. In effect, it is a complex function consisting of a convoluted set of transfer functions and activation functions f ∈ C, where C is a set of continuously differentiable functions, and a parameter set called weights (including thresholds). The output of a feedforward neural net can be written as:

\[
o = f\Big(\sum w_{jk}\, f\Big(\sum w_{ij} \cdots f\Big(\sum w_{ri}\, x_i\Big)\Big)\Big) \tag{3.24}
\]

The next result shows that a two-layer FNN can approximate a large class of functions.

Theorem 3.2 For any absolutely integrable function g, there exists a two-layer feedforward neural network that approximates g to any desired degree of accuracy.








This theorem is a direct result of Poliac's (1989) Theorem 4.8.1. The requirement of g to be absolutely integrable is relaxed by Hornik, Stinchcombe and White (1989), Cybenko (1989) and others to include the use of sigmoid activation functions. Hornik (1991) further proved that an FNN with as few as a single hidden layer and an arbitrary bounded and nonconstant activation function is a universal approximator to any continuous function based on an Lp norm performance criterion.
The above results assume that the number of processing units in the hidden layer is unlimited. A theorem by Kolmogorov (1957) can be applied to FNN to yield a three-layer neural network that, with finite hidden layer units, can exactly represent any continuous function.8

Theorem 3.3 (Kolmogorov) There exist fixed increasing continuous functions h_ij on I = [0,1] such that each continuous function g on I^n = [0,1]^n can be written in the form

\[
g(x_1, \ldots, x_n) = \sum_{j=1}^{2n+1} f_j\Big(\sum_{i=1}^{n} h_{ij}(x_i)\Big)
\]

where the f_j are properly chosen continuous functions of one variable.


The theorem suggests that any continuous function of many variables can be represented as the linear superposition of some continuous univariate functions. In terms of neural nets, this can be interpreted as follows. For any continuous function g of n variables, there exists a feedforward neural network with two hidden layers (each processing unit in the hidden layers has a continuous activation function) that exactly represents g. A two-input network structure corresponding to Kolmogorov's theorem is shown in Figure 3.7.

Several variations of Kolmogorov's theorem exist (Lorentz, 1976). In particular, each function f_j can be chosen identically, and the function h_ij can be replaced by λ_i h_j, where λ_i is constant and h_j(x) is continuous and nondecreasing (cf. Poggio and Girosi, 1990).
































Figure 3.7. An example of the Kolmogorov neural network


Correspondingly, we have the following theorem.

Theorem 3.4 Given any continuous function g : I^n → R, there exists a three-layer feedforward neural network that exactly represents g, with n(n + 1) processing units in the first hidden layer and 2n + 1 processing units in the second hidden layer.

Kolmogorov's theorem shows that FNNs have a powerful representation capability. However, this theorem is nonconstructive. That is, we know that such functions exist, but we have no idea as to how to construct them. Hence the application of Kolmogorov's theorem in neural nets has been limited to theory.


As an illustration of FNN's capability, we can construct simple neural nets with one or two hidden units that solve the XOR problem. Two such networks are shown in Figure 3.8. In both cases, the points (0,0) and (1,1) are grouped together to form one class (with low output values) while the other two points make the other class.

Figure 3.8. Two simple neural nets that solve the XOR problem

















CHAPTER 4
VARIATIONS OF BACKPROPAGATION LEARNING

The backpropagation algorithm, due to its simplicity and general applicability,
has quickly become the dominant training algorithm for feedforward neural networks.
Although successful applications of the BP algorithm are numerous, neural network
researchers soon found that the algorithm has some fundamental limitations. First of
all, BP training may fail to converge. Secondly, BP may reach only a local minimum


solution when it does converge, as in any gradient descent based algorithm. A local minimum may or may not represent an acceptable solution. Furthermore, BP training is generally very slow as compared to non-neural net approaches. This has prevented the use of feedforward neural nets in real-time applications.
An enormous amount of work has been done to improve BP learning in the last


few years. In the following we present new developments in this area concerning convergence, generalization and learning rate, while leaving the discussion on globally optimal solutions to Chapter 7. We consider BP variations in criterion function, activation functions, network structure, second order training algorithms and some heuristics.


Performance Criterion Function


We have used the total sum of squared (TSS) error as the performance criterion in our discussion in Chapter 3. TSS is the standard and most widely used performance criterion. Besides its conceptual and implementational simplicity, it has the advantage that, under the assumption that training samples are independently chosen from a Gaussian distribution, the least squared error (minimizing TSS) estimation is statistically well founded. When the data deviate from this assumption, however, other criteria may be more appropriate than the TSS criterion. Burrascano and Lucci (1990) compared the least square error (L2 norm) and the min-max (L∞ norm) performance criteria. The former is better if the data follow a Gaussian distribution, while the latter should be used if the data distribution is nearly uniform.


The min-max criterion function is non-differentiable. To carry out gradient descent search, a pseudo derivative is defined as

\[
\frac{\partial F_p}{\partial o_{pk}} =
\begin{cases}
\dfrac{o_{pk} - y_{pk}}{|y_{pk} - o_{pk}|} & \text{if } k = k^* \\
0 & \text{otherwise}
\end{cases} \tag{4.1}
\]

where k* = argmax_k |y_pk − o_pk|. Correspondingly, we have

\[
\frac{\partial F_p}{\partial net_k} =
\begin{cases}
+\,o_{pk}(1 - o_{pk}) & \text{if } k = k^* \text{ and } y_{pk} < o_{pk} \\
-\,o_{pk}(1 - o_{pk}) & \text{if } k = k^* \text{ and } y_{pk} > o_{pk} \\
0 & \text{otherwise.}
\end{cases} \tag{4.2}
\]

This is used in the updating rule for the output layer. The δ_j's of the hidden layer(s) are not changed. With the above modification, the standard backpropagation algorithm (Section 3.5) can be employed. Burrascano and Lucci (1990) reported that better performance was achieved with the min-max criterion for the parity problem.1


For classification problems, Hampshire and Waibel (1990) proposed the "classification figure-of-merit" (CFM) criterion function, which is defined as

\[
CFM = \sum_{k \ne t} \frac{1}{1 + e^{-\gamma (o_t - o_k)}} \tag{4.3}
\]

where o_t is the output from the "true" (correct classification) unit and o_k is the output from a non-true unit. We observe that CFM is comprised of a sum of sigmoid functions. Maximizing CFM has several desirable properties:

* It requires the output representing the correct classification to have a higher activation value than any other output units.

* It discourages the network from learning specific examples, and encourages learning a general representation of the training data.

* It alleviates the problem of the TSS criterion where outliers tend to mislead the learning process.


Hampshire and Waibel reported that slightly better results were obtained using the CFM criterion than the sum of squared errors. Assisted by an ad-hoc post-processing procedure, the results from the CFM criterion became significantly better than those obtained with the TSS criterion.


Standard BP uses a sigmoid function as the non-linear activation function (see Figure 3.2). The sigmoid function has an automatic gain control property. That is, when the activation value is close to saturation (-1 or 1)2, the output change corresponding to an input change is small; when the activation value is far from saturation, the output change corresponding to an input change is large. This property is important to the stability of a dynamic network. However, the sigmoid nonlinearity hinders the learning process with its near-zero derivative over a large range of input values. This is easily seen from the BP learning rule (for the output layer):

\[
\Delta w_{jk} = \eta\, (y_{pk} - o_{pk})\, f'(net_k)\, o_j. \tag{4.4}
\]

When (y_pk − o_pk) → 0, we do not need to change the weights, as the target values are learned. When o_j → 0, there is no need to adjust the corresponding weight w_jk, since w_jk has no effect on the net input. But the case f'(net_k) → 0 does not tell us much. Since f'(net_k) = γ o_k (1 − o_k), f'(net_k) → 0 whether o_k approaches the target value (0 or 1) or saturates at the wrong extreme with a large error. This fact increases the probability that the neural nets get stuck in a local minimum.
Burrascano and Lucci (1990) proposed a delta rule of the form

\[
\delta_k = \frac{y_{pk} - o_{pk}}{1 + e^{-\gamma net_k}} \tag{4.5}
\]

which, contrary to the standard delta rule, takes larger values when the activation approaches 1. Their experiments showed that with the new delta rule, the modified BP algorithm performed slightly better than the classic BP algorithm. What is more important is that the modified version had a much smaller failure fraction than the normal BP algorithm. The authors claimed that the proposed modification virtually eliminates non-convergence problems if a moderate learning rate is applied.


Another alternative to the sum-of-square error criterion is the cross-entropy performance function defined as:

\[
F = -\sum_p \sum_k \big( y_{pk} \log(o_{pk}) + (1 - y_{pk}) \log(1 - o_{pk}) \big). \tag{4.6}
\]

The derivative of F with respect to o_pk is

\[
\frac{\partial F}{\partial o_{pk}} = \frac{o_{pk} - y_{pk}}{o_{pk}(1 - o_{pk})}. \tag{4.7}
\]

Note that ∂F/∂o_pk → ∞ as o_pk → 1 and ∂F/∂o_pk → −∞ as o_pk → 0. This brings a counteracting effect to the problem mentioned above, i.e., learning is hindered when the output approaches saturation. In fact, with the sigmoid activation the f'(net_k) factor cancels, giving ∂F/∂net_k = γ(o_pk − y_pk), so the error signal no longer vanishes at saturation. Indeed, experiments by Fahlman (1988) showed that using the cross entropy criterion, the learning speed of a neural network on the encoding problem increased by 50% as compared to using the standard sum-of-squared-error criterion.
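A small numeric illustration of this counteracting effect follows; it is a sketch (not from the text, γ = 1) comparing the SSE error signal of Equation 3.14 with the cross-entropy error signal derived above.

```python
# As the output saturates at the wrong extreme, the SSE delta vanishes while
# the cross-entropy delta stays proportional to the error (y - o).
import numpy as np

o = np.array([0.5, 0.9, 0.99, 0.999])   # output saturating toward 1
y = 0.0                                  # target at the opposite extreme
sse_delta = (y - o) * o * (1 - o)        # (y - o) f'(net), Eq. 3.14
ce_delta = y - o                         # cross-entropy dF/dnet (sign flipped)
print(sse_delta)   # [-0.125, -0.081, -0.0098, -0.000998]: signal dies out
print(ce_delta)    # [-0.5, -0.9, -0.99, -0.999]: signal grows with the error
```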


Momentum

A simple variation of the classic backpropagation algorithm is to add a "momentum" term to the weight updating rule:

\[
\Delta w(t) = -\eta\, \frac{\partial F}{\partial w} + \alpha\, \Delta w(t-1). \tag{4.8}
\]

The effect of the momentum term is to accelerate the weight changes when successive gradients have the same signs and to slow down weight changes when successive gradients have different signs. Thus, it helps to speed up the search in the weight space where the down-hill gradient is small, and to damp oscillations that are likely to occur in the ravine areas if only a fixed learning rate is used. Reports (e.g., Chauvin, 1990) have shown the momentum term can speed up the learning process significantly. Since the use of the momentum was proposed by Rumelhart et al. (1986), the authors who popularized the backpropagation algorithm, and it is used almost always in backpropagation learning, we will refer to the backpropagation algorithm with the use of momentum as the standard or classic BP algorithm in our later discussion.
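The following sketch traces the update of Equation 4.8 for a single weight; the function name and parameter values are illustrative assumptions.

```python
# Momentum update: gradient step plus a fraction alpha of the previous change.
def momentum_step(grad, prev_dw, eta=0.1, alpha=0.9):
    return -eta * grad + alpha * prev_dw

# Gradients of the same sign accumulate speed; a sign flip damps the change.
dw = 0.0
for g in [1.0, 1.0, 1.0, -1.0]:
    dw = momentum_step(g, dw)
    print(dw)   # -0.1, -0.19, -0.271, then damped to -0.1439
```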


Adding the momentum term is analogous to signal smoothing. This observation led Adams (1991) to propose using both past and future information in the momentum, analogous to a symmetric smoothing. The idea is simple: in the standard BP algorithm, when the hidden layer weights are updated, we already have the information to compute the weight change in the next iteration, since

\[
\delta_j(t+1) = o_j (1 - o_j) \sum_k \delta_k(t+1)\, w_{jk}(t+1) \tag{4.9}
\]

\[
\Delta w_{ij}(t+1) = \eta\, \delta_j(t+1)\, o_i \tag{4.10}
\]

where the future δ_j(t+1) is obtained through the newly computed output layer weights. Hence the hidden layer weight updating can be modified as

\[
\Delta w_{ij}(t) = \eta\, \delta_j(t)\, o_i + \alpha_1 \Delta w_{ij}(t-1) + \alpha_2 \Delta w_{ij}(t+1) \tag{4.11}
\]

where α1 and α2 are the coefficients corresponding to the past and future momentum. The improvement of learning speed obtained by the author was moderate with this modification.









Many of the speed-up techniques view neural net training as some iterative process in which an approximation of the criterion function is minimized. Commonly used approximations are given by the first order or second order Taylor-series expansion, i.e.,

\[
F(w + \Delta w) = F(w) + \Delta w^T \nabla F + \cdots \tag{4.12}
\]

\[
F(w + \Delta w) = F(w) + \Delta w^T \nabla F + \frac{1}{2} \Delta w^T \nabla^2 F(w)\, \Delta w + \cdots \tag{4.13}
\]

where ∇F denotes the gradient of F and ∇²F denotes the Hessian of F. Classic backpropagation is an example of using a first order approximation.3 First order and second order approximations are also referred to as linear and quadratic approximations, respectively.


First order approximations use only local gradient and function values, while second order approximations also use curvature information. Hence second order methods usually have faster convergence. Among the most successful applications of second order methods in neural networks are the conjugate gradient (CG) algorithms and Newton's methods.

Let us consider a general iterative process. Suppose we want to minimize a criterion function F(w). We determine a search direction d_t and a stepsize λ_t. The iterate is

\[
w^{t+1} = w^t + \lambda_t\, d_t \tag{4.14}
\]

where d_t and λ_t are determined such that F(w^{t+1}) < F(w^t) or F(w^{t+1}) is minimized. Most optimization algorithms fall into this framework. They differ by the way d_t and λ_t are computed. If d_t is set to be the negative gradient −∇F(w), and λ_t to be a constant η, then we have the simple gradient descent algorithm discussed in Chapter 3.

3We say the approximation is first order if the first order Taylor-series expansion is used. Similarly, it is second order if the second order expansion is used.












4.3.1 Conjugate Gradient Methods


Let F_a(z) denote the second order approximation to F(w) in the neighborhood of w:

\[
F_a(z) = F(w) + \nabla F(w)^T z + \frac{1}{2} z^T \nabla^2 F(w)\, z. \tag{4.15}
\]

The necessary condition for F_a to be minimized is

\[
\nabla F_a(z) = \nabla F(w) + \nabla^2 F(w)\, z = 0. \tag{4.16}
\]

At the current solution w^t, Equation 4.16 represents a system of linear equations with variable z (an S × 1 vector). The solution to this system of equations can be greatly simplified if a set of vectors, called a conjugate system, can be found.

Definition (Conjugate System) Let d_1, d_2, ..., d_k be a set of non-zero vectors in R^p, and A be a p × p nonsingular matrix. Then d_1, ..., d_k is a conjugate system with respect to A if d_1, ..., d_k are linearly independent and d_i^T A d_j = 0 for i ≠ j.

Suppose we have a conjugate system d_1, ..., d_S ∈ R^S with respect to ∇²F(w). Let z* be a solution to Equation 4.16 and z^0 ∈ R^S be an arbitrary initial point. Since d_1, d_2, ..., d_S forms a basis of R^S, any vector in R^S can be expressed as a linear combination of the conjugate vectors. In particular,

\[
z^* - z^0 = \sum_{i=1}^{S} \lambda_i\, d_i \tag{4.17}
\]

where λ_i ∈ R. Multiplying both sides with d_j^T ∇²F(w) gives

\[
d_j^T \nabla^2 F(w)(z^* - z^0) = \lambda_j\, d_j^T \nabla^2 F(w)\, d_j.
\]










Solving for λ_j, and using ∇²F(w) z* = −∇F(w) from Equation 4.16, gives

\[
\lambda_j = \frac{d_j^T \big(-\nabla F(w) - \nabla^2 F(w)\, z^0\big)}{d_j^T \nabla^2 F(w)\, d_j} = \frac{-\,d_j^T \nabla F_a(z^0)}{d_j^T \nabla^2 F(w)\, d_j}. \tag{4.20}
\]

If we find the conjugate system in S steps, then we can determine z* in S steps using the above equations. The conjugate vectors d_i, i = 1, 2, ..., S, can be determined recursively: d_1 can be set equal to the negative gradient −∇F_a(z^0), and d_t can be determined as a linear combination of the current negative gradient −∇F_a(z^t) and the previous direction (Moller, 1990). Detailed treatment of the conjugate gradient algorithm can be found in Johansson et al. (1990).


Note that the iterative process converges in S steps if F(w) is a quadratic function; F_a(z) then becomes an exact representation of F(w). In practice the conjugate gradient algorithm takes more than S steps to converge, since F(w) is usually not quadratic.

Computing and storing the Hessian matrix ∇²F(w) is expensive or infeasible for large problems (they require O(S³) and O(S²) operations, respectively). In implementing the CG algorithm, the following estimation is often used:

\[
\nabla^2 F(w^t)\, d_t \approx \frac{\nabla F(w^t + \sigma_t d_t) - \nabla F(w^t)}{\sigma_t} \tag{4.21}
\]

for some small σ_t ∈ R, σ_t > 0.

Conjugate gradient methods are generally regarded as among the most efficient methods for large-scale optimization problems. Johansson et al. (1990) reported that their implementation of the CG algorithm outperformed standard BP by an order of magnitude in terms of training speed. Moller (1990) improved the CG algorithm further with a scaled version that avoids the line search.

4.3.2 Newtonian Algorithms

Assume that F is twice continuously differentiable. Newton's method finds a fixed point through the following iterate:

\[
w(t+1) = w(t) - \alpha \big(\nabla^2 F(w)\big)^{-1} \nabla F(w). \tag{4.22}
\]

Note that if F is quadratic, with A positive definite, and α = 1, then Newton's method converges to the minimum in a single step. This is seen by letting F(w) = ½ w^T A w − b^T w; then we have w^{t+1} = w^t − A^{-1}(A w^t − b) = A^{-1} b. Even if F is not quadratic, under reasonable assumptions, Newton's method is guaranteed to converge to a local minimum from an arbitrary initial point (Schneider et al., 1991). It also converges fast when it reaches the neighborhood of a solution. However, Newton's method is rarely used in its unmodified form because of the cost associated with computing the Hessian matrix and its inverse. Also, the method works well only when it has a good initial solution (Becker and le Cun, 1988).

A class of modified Newton's methods is called Quasi-Newton methods, where the search direction is computed via

\[
d = -H^{-1} \nabla F(w) \tag{4.23}
\]

where H is an approximation to the Hessian matrix ∇²F(w). The most successful Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. In the BFGS method, H^{-1} is obtained iteratively by

\[
H_t^{-1} = \Big(I - \frac{s_t g_t^T}{g_t^T s_t}\Big) H_{t-1}^{-1} \Big(I - \frac{g_t s_t^T}{g_t^T s_t}\Big) + \frac{s_t s_t^T}{g_t^T s_t} \tag{4.24}
\]

where s_t = w^t − w^{t−1} and g_t = ∇F(w^t) − ∇F(w^{t−1}). At each iteration H^{-1} can be determined through the two new vectors s and g and the previous H^{-1}; hence the method is very efficient. A drawback of these matrix-based methods is that they do not preserve the computational locality properties of backpropagation, where the weight updating can be carried out in local units.


Becker and le Cun (1988) proposed using a simple diagonal approximation to the Hessian matrix. They replace Δw_ij = −η ∂F/∂w_ij with what they called a "Pseudo-Newton step"

\[
\Delta w_{ij} = -\frac{\eta}{\left|\dfrac{\partial^2 F}{\partial w_{ij}^2}\right| + \mu}\ \frac{\partial F}{\partial w_{ij}} \tag{4.25}
\]

where μ is used to improve the conditioning of the Hessian matrix. The magnitude of μ determines how much curvature information is to be used in the weight updating rule.

4.3.3 Quickpropagation

Most second order methods are considerably more difficult to implement than first


order methods, especially those that require global information. Fahlman (1988) developed a heuristic algorithm he called quickpropagation (quickprop for short) based on two assumptions: (1) the error (i.e., the criterion function) surface in weight


space can be approximated by a parabola, and (2) the change in the slope of the error
surface in one weight axis is not affected by other weights that are changing at the


same time. Thus each weight can be updated independently by using the previous and current error slopes and the previous weight change:

\[
\Delta w(t) = \frac{\partial F(t)/\partial w}{\partial F(t-1)/\partial w \; - \; \partial F(t)/\partial w}\ \Delta w(t-1). \tag{4.26}
\]


This weight change leads directly to the minimum point of the parabola.4


Thus the


quickprop method would converge very fast if the criterion function surface were near
quadratic.
Although the assumptions are very crude, the quickprop algorithm turned out to
be very effective in reducing neural net training time in many standard test problems,




Compared to standard BP, the quickprop weight updating rule has the denominator ∂F(t−1)/∂w − ∂F(t)/∂w. This factor is relatively large when the weight gradient changes a lot, hence resulting in a small stepsize. In flat error surface areas, the gradient changes very little, hence creating a large stepsize. This effectively overcomes the problems with the fixed stepsize of the standard BP method.
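The following is a per-weight sketch of the quickprop step of Equation 4.26; the function name and the one-dimensional worked example are illustrative assumptions.

```python
# Quickprop: fit a parabola through the current and previous slopes and jump
# to its minimum (Eq. 4.26).
def quickprop_step(slope, prev_slope, prev_dw):
    return slope / (prev_slope - slope) * prev_dw

# For a true parabola F(w) = (w - 3)^2 the jump lands exactly at the minimum:
w_prev, w = 0.0, 1.0                           # previous and current weights
slope_prev, slope = 2 * (w_prev - 3), 2 * (w - 3)   # F'(w) = 2(w - 3)
dw = quickprop_step(slope, slope_prev, w - w_prev)
print(w + dw)                                  # -> 3.0, the parabola's minimum
```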

4.4 Parameter Adjusting

Tollenaere (1990) conducted a series of experiments to investigate the effect of the learning parameters (η, α) on the learning speed (measured in epochs). Those experiments cleared up to some extent the confusion about how to choose the parameters caused by conflicting reports, where only non-systematic studies were carried out. Some general conclusions from Tollenaere's study can be summarized as follows.

* Learning time decreases exponentially as η increases up to a certain point. After that point, the iterative process becomes unstable.

* The optimal learning rate η (with which the learning time is the least) decreases as the momentum α increases from 0 to 1.

* The use of momentum usually increases the learning speed by a factor of 2 to 3.


It has long been realized that part of the standard BP's low efficiency is due to its fixed parameters. Usually the parameters need to be chosen empirically for a particular problem. Even after the best parameter combination is found through extensive experiments, using those fixed parameters cannot meet the conflicting needs: a large stepsize is desired in flat functional surface areas and a small stepsize is required in areas with narrow ravines.

Numerous dynamic parameter adjusting schemes have been developed. Most of them are heuristics (e.g., Silva and Almeida, 1990) that adjust the parameters based on local curvature information.


Several principles for adjusting parameters are given in Jacobs (1988):

1. An individual learning rate (stepsize) should be assigned to each weight (and threshold).

2. The learning rate should be adjusted according to the curvature of the criterion function where the change is taking place.

3. The learning rate should be increased when the current partial derivative of the criterion function with respect to the weight in consideration has the same sign as the previous partial derivative; otherwise, the learning rate should be decreased.


Based on these principles, Jacobs proposed the "delta-bar-delta" (DBD) learning rule. A learning rate η_ij is allocated to each weight w_ij, and δ̄_ij is introduced as an exponentially decaying trace of the gradient δ_ij = ∂F/∂w_ij (it is slightly different from the δ_j in standard BP). The formulae for weight updating are:

\[
\Delta\eta_{ij}(t) =
\begin{cases}
\kappa & \text{if } \bar\delta_{ij}(t-1)\,\delta_{ij}(t) > 0 \\
-\phi\,\eta_{ij}(t) & \text{if } \bar\delta_{ij}(t-1)\,\delta_{ij}(t) < 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.27}
\]

\[
\Delta w_{ij}(t) = -\eta_{ij}(t)\,\delta_{ij}(t) + \alpha\,\Delta w_{ij}(t-1) \tag{4.28}
\]

\[
\bar\delta_{ij}(t) = (1 - \theta)\,\delta_{ij}(t) + \theta\,\bar\delta_{ij}(t-1) \tag{4.29}
\]

where κ, φ and θ are user determined parameters. Note that the increase in the learning rate is additive while the decrease is multiplicative. This strategy prevents the learning rate from growing too fast (which may lead to weight saturation) and allows it to decrease rapidly, but keep a positive sign.
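The following sketch applies Equations 4.27 through 4.29 to a single weight's learning rate; the function name and the parameter defaults are illustrative assumptions.

```python
# Delta-bar-delta: additive increase when the smoothed and current gradients
# agree in sign, multiplicative decrease when they disagree (Eq. 4.27), plus
# the exponentially decaying trace of Eq. 4.29.
def dbd_update(eta, delta, delta_bar, kappa=0.01, phi=0.1, theta=0.7):
    if delta_bar * delta > 0:
        eta += kappa             # same direction: additive increase
    elif delta_bar * delta < 0:
        eta *= (1 - phi)         # direction flipped: multiplicative decrease
    delta_bar = (1 - theta) * delta + theta * delta_bar
    return eta, delta_bar
```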
The DBD algorithm leads to significant speed-up over the standard BP algorithm. However, the algorithm is very sensitive to the new parameters, especially κ. Also, while the momentum term increases learning speed, it leads to instability. Minai and Williams (1990) proposed the extended delta-bar-delta (EDBD) rule, in which both the learning rate and the momentum are made adaptive. Upper bounds are put on both η and α. The new weight updating rules become:


\[
\Delta w_{ij}(t) = -\eta_{ij}(t)\,\frac{\partial F(t)}{\partial w_{ij}} + \alpha_{ij}(t)\,\Delta w_{ij}(t-1) \tag{4.30}
\]

\[
\eta_{ij}(t+1) = \min\{\eta_{max},\ \eta_{ij}(t) + \Delta\eta_{ij}(t)\} \tag{4.31}
\]

\[
\alpha_{ij}(t+1) = \min\{\alpha_{max},\ \alpha_{ij}(t) + \Delta\alpha_{ij}(t)\} \tag{4.32}
\]

\[
\Delta\eta_{ij}(t) =
\begin{cases}
\kappa_l\, e^{-\gamma_l |\bar\delta_{ij}(t)|} & \text{if } \bar\delta_{ij}(t-1)\,\delta_{ij}(t) > 0 \\
-\phi_l\,\eta_{ij}(t) & \text{if } \bar\delta_{ij}(t-1)\,\delta_{ij}(t) < 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.33}
\]

\[
\Delta\alpha_{ij}(t) =
\begin{cases}
\kappa_m\, e^{-\gamma_m |\bar\delta_{ij}(t)|} & \text{if } \bar\delta_{ij}(t-1)\,\delta_{ij}(t) > 0 \\
-\phi_m\,\alpha_{ij}(t) & \text{if } \bar\delta_{ij}(t-1)\,\delta_{ij}(t) < 0 \\
0 & \text{otherwise}
\end{cases} \tag{4.34}
\]

where κ_l, φ_l, γ_l, κ_m, φ_m, γ_m, η_max and α_max are parameters furnished by the user. EDBD was reported to provide significant speed-up over DBD and to be more robust on learning the logistic function f(x) = ax(1 − x), a = 3.95, 0 ≤ x ≤ 1.


The authors of EDBD also suggested implementing a memory and recovery mech-
anism into the learning algorithm. Specifically, the current best solution is retained.


A control parameter λ ∈ R, λ > 0, is defined. If the criterion function value becomes greater than λ times the best criterion value retained so far, then the search is abandoned and restarted from the current best point with attenuated learning rate and momentum. However, the experiments on this idea showed somewhat negative results.


Devos and Orban's SAB (self-adapting backpropagation) algorithm advocates similar ideas. The algorithm starts without momentum, and increases the learning rate exponentially as long as the weight gradient keeps the same sign. It differs from the EDBD algorithm in that, when the weight gradient changes sign, instead of reducing η_ij by some rule, it is reset to its starting value, and then the algorithm starts over from there.


Tollenaere (1990) modified the SAB method and named his version SuperSAB.


The motivation behind SuperSAB is that whenever the gradient changes sign, the weights should not be changed. The weight change halts until the stepsize is reduced to such an extent that a step can be taken without changing the sign of the gradient. The learning rate changes simply by

\[
\eta_{ij}(t+1) = \eta^{+}\,\eta_{ij}(t) \tag{4.35}
\]

\[
\eta_{ij}(t+1) = \eta_{ij}(t)/\eta^{-} \tag{4.36}
\]

where η⁺ and η⁻ are the increase factor and the decrease factor, respectively. Tollenaere reported that SuperSAB is insensitive to the parameters, and η⁺ = 1.05 and η⁻ = 2 are shown to be good for a wide variety of problems.


Compared with the standard BP method, SuperSAB learning is significantly faster. One important feature of SuperSAB is that the range of the initial stepsize that leads to reasonably fast learning (Tollenaere referred to it as the osr, the optimal stepsize region) is orders of magnitude wider than that of standard BP. A drawback of SuperSAB is that it is slightly more unstable than BP. But it was argued that SuperSAB, with restart after divergence was detected, still outperformed standard BP.

An interesting and important observation Tollenaere made is that the optimum stepsize regions of different learning algorithms do not necessarily overlap. Thus, comparison of different algorithms based on the same parameter values is inappropriate.


An idea similar to SuperSAB was used by Silva and Almeida (1990) in their Adaptive Backpropagation Algorithm (ABA). However, Silva and Almeida studied the effectiveness of the algorithm in the context of varying criterion surface orientation in the input space. They argued that because an individual learning rate is used for each weight, the performance of the method may be affected by the orientation of the criterion surface.


Chan and Shatin (1990) used the angle θ(t) between consecutive weight gradients, instead of the sign, to detect the curvature of the criterion surface in the weight space. Only a global learning rate is used, and it is adapted by

\[
\eta(t) = \eta(t-1)\Big(1 + \frac{1}{2}\cos\theta(t)\Big). \tag{4.37}
\]

The momentum is also made adaptive in their algorithm by

\[
\alpha(t) = \lambda(t)\,\eta(t) \tag{4.38}
\]

with

\[
\lambda(t) = \lambda_0\, \frac{\|\nabla F(t)\|}{\|\Delta w(t-1)\|} \tag{4.39}
\]

where λ_0 ∈ (0, 1). This in effect attenuates the momentum term such that it never exceeds the current gradient term, hence it will not dominate the effect of the current weight gradient. The weight updating rule is then

\[
\Delta w(t) = \eta(t)\Big(-\frac{\partial F(t)}{\partial w}\Big) + \alpha(t)\,\Delta w(t-1). \tag{4.40}
\]

A backtracking heuristic is also implemented. The learning rate η(t) is reduced by half whenever the criterion value F(t) is greater than the previous one, F(t−1), by a certain percentage (say, 1%).


Chan and Shatin's Adaptive Training Algorithm (ATA) was tested against the Delta-Bar-Delta algorithm and a conjugate gradient algorithm on the XOR problem and the 4-2-4 encoding problem (it will be discussed in Chapter 5; see also Rumelhart et al., 1986). ATA was shown to learn much faster than the other two algorithms and was insensitive to initial parameters (although it still suffered the local minimum problem as the others did).

4.5 Activation Functions











4.5.1 Radial Basis Functions


Powell (1985) introduced the radial basis function (RBF) for multivariate interpolation problems.


Learning in supervised feedforward neural nets can be viewed


as surface interpolation.


This observation led to the use of radial basis function as


the activation function in neural nets by Broomhead and Lowe (1988), Moody and
Darken (1989), and Poggio and Girosi (1990).


Standard feedforward neural networks use sigmoid activation functions. The net input (Σ_i w_ij x_i) to each processing unit forms a hyperplane. Multilayer perceptrons


partition the input space with the hyperplanes from each unit, while in a feedforward
neural net those hyperplanes are smoothed through the sigmoid nonlinear filter before


being used to form a decision region (partition). The radial basis function, in contrast, forms hyperellipsoid regions in the input space. An RBF network consists of two layers (see Figure 4.1). Each hidden unit has a radial basis function φ_i : R^n → R defined by

\[
\phi_i(x) = \phi(\|x - \mu_i\|), \qquad i = 1, 2, \ldots, N \tag{4.41}
\]

where μ_i ∈ R^n, i = 1, 2, ..., N, are parameters, ||x − μ_i|| measures the distance from the input vector x to the radial basis function center μ_i, and N is the number of radial basis centers. The network output at node k is

\[
f_k(w, x) = \sum_{i=1}^{N} w_{ik}\, \phi(\|x - \mu_i\|). \tag{4.42}
\]

A frequently used radial basis function is the Gaussian function

\[
\phi_i(x) = e^{-\frac{1}{2} d_i(x)} \tag{4.43}
\]

where

\[
d_i(x) = (x - \mu_i)^T \Sigma^{-1} (x - \mu_i). \tag{4.44}
\]

To simplify computation, the covariance matrix Σ is usually chosen to be a diagonal matrix.
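The following sketch implements the RBF forward pass of Equations 4.42 and 4.43 with a spherical (diagonal, equal-variance) covariance; the function name and the example centers and weights are illustrative assumptions.

```python
# RBF network output: Gaussian basis functions (Eq. 4.43) combined linearly
# by the output weights (Eq. 4.42).
import numpy as np

def rbf_output(x, centers, W, sigma=1.0):
    d2 = np.sum((centers - x) ** 2, axis=1)      # ||x - mu_i||^2 per center
    phi = np.exp(-d2 / (2 * sigma ** 2))         # phi_i(x), Sigma = sigma^2 I
    return W.T @ phi                             # f_k(w, x) = sum_i w_ik phi_i

centers = np.array([[0.0, 0.0], [1.0, 1.0]])     # centers fixed at data points
W = np.array([[1.0], [-1.0]])                    # linear output weights w_ik
print(rbf_output(np.array([0.1, 0.0]), centers, W))
```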
































Figure 4.1. A radial basis function network


When the radial basis centers μ_i, i = 1, 2, ..., N, are fixed at the data points x^i, i = 1, 2, ..., N, what is left for the network to learn is only the linear coefficients w_ik in the output layer. In this case, RBF networks can be trained very fast and without suffering the problem of local minima. Moody and Darken (1989) reported that their RBF network reduced training time on learning the Mackey-Glass equation5 by a factor of 10² to 10³ compared with standard BP.


However, the RBF network is not appropriate for large data sets, as the size of the network grows with the number of training instances. Poggio and Girosi (1989) proposed to treat the radial basis centers as variables, and neural nets are allowed to estimate the centers μ_j, j = 1, 2, ..., K, where K may be much less than N (the number of data points). They called the extension the Generalized Radial Basis Function (GRBF) network. A very rigorous and thorough treatment of RBF and GRBF networks is given in Poggio and Girosi (1989).


5The Mackey-Glass equation is the time-delay differential equation dx(t)/dt = a x(t − τ)/(1 + x^10(t − τ)) − b x(t), commonly used as a chaotic time-series benchmark.











4.5.2 Transcendental Functions


Although the sigmoid and the hyperbolic tangent functions have been the most frequently used activation functions in feedforward neural nets, other monotonic, differentiable functions can also be used (Cybenko, 1989). In particular, we have tested using transcendental functions, such as the sine or cosine function, as the activation function. The XOR problem can be solved in a few iterations with the new activation function. Rosen et al. (1990) reported that their neural nets using sine and/or cosine activation functions outperformed the standard BP on the parity problem and on learning the x^9 and x^3 functions. A justification suggested by Rosen et al. is that transcendental functions can be expanded (via Taylor-series expansion) as the sum of infinite order polynomials. Although the polynomials are not independent within each activation function, in a multilayer network the weighted sum of outputs from the hidden units in effect produces a weighted sum of infinite order polynomials. But the sigmoid function can also be expanded to a sum of polynomials. Experiments by Lapedes and Farber (1987) showed that trigonometric activation functions are less robust than the sigmoid function.

4.5.3 Higher Order Networks and Function-link Networks

Instead of using the sum of weighted inputs as the net input, some researchers (e.g., Pineda) have explored the use of net inputs with higher order correlations among the inputs (e.g., higher order links may be created that take the product of input variables as input). The correlations are usually captured by the cross terms of a polynomial. Volper and Hampson used quadratic terms, in particular, and concluded that higher order networks can be trained noticeably faster than the standard network. Durbin and Rumelhart (1989) studied net inputs using product forms, and called those processing units product units. Their conclusion was that product units could be a computationally powerful extension to the standard network.






















Figure 4.2. A function-link neural network used to solve the parity-3 problem (inputs x1, x2, x3, x1x2, x1x3, x2x3, x1x2x3)

Although the expanded input representation is more complex, this creates a powerful method that usually permits simple networks without hidden layers to solve hard problems. A function-link network that solves the parity-3 problem is shown in Figure 4.2. This functional network outperformed a standard feedforward neural network by nearly an order of magnitude. The efficacy of function-link neural nets was also shown through learning functions of one and two variables.


4.5.4 Gradient Descent Search in Function Space

Instead of using fixed activation functions in the processing units, Mani (1990) considered providing a pool of functions to the processing units and letting the learning algorithm decide which of the candidate activation functions are the best to use. (Different function pools may be provided to different processing units.) The learning procedure he proposed is similar to that of the standard BP, but now the gradient descent is applied in the function space, rather than the weight space (though the two might be combined, as suggested by Mani).

Unfortunately, the order of a set of general functions cannot be readily defined, hence the function gradients are not easily obtained. This difficulty makes the approach more ideological than practical. The only problem the author attempted to solve was the XOR problem.












4.6 Dynamically Constructed Neural Nets

The algorithms we have discussed so far apply only to neural nets with fixed structures. That is, the number of hidden processing units, the connections between the units, and the layout of the network are determined before the training algorithms are applied. Many researchers have realized that there are drawbacks with fixed neural net structures (see Honavar and Uhr, 1988; Tenorio and Lee, 1989; Frean, 1990). For any particular problem we want to solve, some neural net structures are more appropriate than others. Since there are no general guidelines as to how a neural net should be designed for a given problem, it has been a common practice for neural net users to copy a neural net structure from other applications (without questioning the validity), or simply to make one up arbitrarily. This is hardly a scientific approach, even though success may have been claimed.


Generally, small neural nets are preferred, given that they are large enough to be capable of solving the problem at hand. The rationales are that (1) parsimony is always desirable; (2) neural nets with fewer parameters are easier to interpret, when interpretation is necessary; (3) small-sized networks can be trained more reliably given a fix-sized training sample (see, e.g., Haussler, 1991); and (4) neural nets with fewer hidden units seem to generalize better with novel patterns (Kruschke and Movellan, 1991).


Although the general representation theorem (see Chapter 3) guarantees that a feedforward neural network with a single hidden layer is sufficient for learning practically any input-output mapping, there is no theoretical result yet that specifies how many hidden units are needed. Honavar and Uhr (1988) pointed out that it is desirable to restrict the fan-in and fan-out6 sizes to create local receptive fields. Then the number of hidden units in each layer is limited, and multiple hidden layers become necessary to learn a desired mapping. Indeed, experiments conducted by Gorman and Sejnowski (1988) suggested the benefits of such layered structures.


Two broad approaches have been employed to construct neural nets with optimal (or appropriate) size. The first is to start with a small network, and let it grow as needed. The second approach is to train an excessively large (estimated) network, and then prune away units that do not have significant impact on the network performance.


4.6.1 Network Growing Methods

Fahlman and Lebiere (1990) identified two major problems that contribute to the inefficiency of the standard BP: the step-size problem and the moving target problem. The first problem has been covered in a previous section. It is the second problem that is caused by the fixed structure of a neural net. In such a network the hidden units have no communication with one another, as no lateral connections are provided. During the training process, each hidden unit modifies its link weights according to the error signal backpropagated from the output layer. The problem is that all units are trying to learn the same training pattern at the same time. As the training pattern changes constantly (for instance training, the most common case), it takes a long time for the hidden units to split their roles and to commit to different patterns.

A possible way to combat the moving target effect is to train part of the network at a time. The cascade-correlation algorithm developed by Fahlman and Lebiere uses this approach to its extreme. Only one hidden unit (including associated weights and bias) is allowed to change at any stage of the training process.

The cascade-correlation algorithm starts with a feedforward neural network without a hidden layer. The algorithm builds up the network (the cascade architecture) by adding hidden units one at a time. Whenever a hidden unit is added, it forms a new hidden layer with connections from all input units and previously added hidden units.












A candidate hidden unit is trained over all training patterns, and the covariance S of the hidden unit output V_p and the current network error is maximized. S is given by

\[
S = \sum_k \Big| \sum_p (V_p - \bar V)(E_{pk} - \bar E_k) \Big| \tag{4.46}
\]

where k is the output unit index, and V̄ and Ē_k are averages over all p patterns. The weights leading to the candidate hidden unit are modified to maximize S with a gradient ascent algorithm similar to that of backpropagation. When these weights converge (the maximization problem is solved), they are frozen, and added to the current net with the candidate hidden unit. Then the training of the net resumes until the stopping criterion is met or new hidden units are needed.
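The following sketch computes the covariance score S of Equation 4.46 for a candidate unit; the function name and the toy data are illustrative assumptions.

```python
# Cascade-correlation score: V holds the candidate unit's output per pattern,
# E the residual error per pattern and output unit (Eq. 4.46).
import numpy as np

def cascade_score(V, E):
    Vc = V - V.mean()              # V_p - V_bar
    Ec = E - E.mean(axis=0)        # E_pk - E_bar_k
    return np.abs(Vc @ Ec).sum()   # sum_k | sum_p (V_p - V_bar)(E_pk - E_bar_k) |

V = np.array([0.2, 0.9, 0.1, 0.8])             # candidate output, 4 patterns
E = np.array([[0.5], [-0.4], [0.6], [-0.5]])   # residual errors, 1 output unit
print(cascade_score(V, E))
```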


A number of benchmark test problems were performed by Fahlman and Lebiere (1990). They reported that the cascade-correlation algorithm beat quickprop by a factor of 5 and standard BP by a factor of 10 on the two-spirals problem. On the 8-bit parity problem, the cascade-correlation algorithm not only outperformed the standard BP by a factor of 5, but it also built a much more compact network. Furthermore, it was shown to generalize well on the 10-bit parity problem.
Frean (1990) developed an interesting net growing algorithm, the Upstart Algorithm. The algorithm deals with multilayer perceptrons, i.e., feedforward neural nets with threshold processing units. It creates new units, called daughters, that correct the errors made by each parent unit. The algorithm proceeds recursively, creating new daughter units until none of the terminal (leaf) daughters makes any mistakes. In other words, the Upstart algorithm expands the network until the problem is solved. Convergence to zero error is guaranteed for learning boolean functions. Tests on the parity problem showed the Upstart algorithm was efficient. It solved the n-bit parity problem with n less than 10 in less than 1000 iterations. The algorithm probably doesn't scale up well, though, since it took more than 10,000 iterations to solve the 10-bit parity problem.

7At first glance, this approach seems anti-connectionist. But we need to realize that sequential construction of the network does not preclude parallel operation of the trained network.


The SONN (Self Organizing Neural Net) algorithm proposed by Tenorio and Lee (1989) was designed for system identification problems. A new node is generated with polynomial activation functions of all inputs and outputs from previous layers. The polynomial is limited to order two; thus each new unit has at most two parent units. The best polynomial function is determined by a Structure Estimation Criterion (SEC), which provides a trade-off between performance and complexity of the model. Simulated annealing is used in the search process. When applied to learning the Mackey-Glass (see footnote 5) time series, the SONN algorithm produced far more compact models (net structures) than the standard feedforward neural networks used, for comparable performance.


Hirose et al. (1991) considered some heuristics that perform both growing and pruning of feedforward neural nets. The performance criterion F (in this case, the sum of squared errors) is checked every 100 weight updates. If F fails to decrease by more than one percent of the previously checked value, a new unit is added to the hidden layer. When a network is successfully trained, the pruning process is invoked, which simply removes one hidden unit at a time and then restarts training of the reduced network, until no more hidden units can be removed. This occurs when the net fails to converge with a unit removed. These heuristics appear very crude, but they do help to overcome the non-convergence problem. The authors even claimed that their heuristics could avoid local minimum solutions.

4.6.2 Network Pruning

The network growing methods usually have a goal to minimize the net size. How-
ever, there are also reasons to train a neural network with a larger than minimum size.


Extra hidden units may increase the robustness of the network (performing well in noisy environments).












Thus many researchers studied


pruning the nets after they are trained with


sufficiently large number of hidden units (Mozer and Smolensky, 1989; Karnin, 1990).


Sietsma and Dow (1987) proposed a two-stage pruning method. In the first stage, the outputs of the hidden units of a trained net are analyzed. Those hidden units whose output does not change over all input patterns are removed; if two hidden units have the same or opposite outputs across the training patterns, then one of them may be removed. In the second stage, the contribution of each hidden unit to the learning task (classification) is analyzed, and the redundant units and hidden layer(s) are removed. The result is a much smaller net that can be trained quickly. An interesting fact is that a net with the same size as the net obtained from pruning could not be trained starting with random weights. Karnin (1990) used a similar pruning procedure where the hidden units are ordered by the amount the global error (F) changes when the unit is pruned. Those units with negligible effects on the global error are removed.
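A minimal sketch of such sensitivity-based pruning (added for illustration; error, remove, and restore are hypothetical callables supplied by the surrounding training code) might look like this:

    def prune_by_sensitivity(net, hidden_units, error, remove, restore,
                             tolerance=1e-3):
        # Rank hidden units by the change in global error F caused by
        # removing each one, then delete those with negligible effect.
        base_F = error(net)
        sensitivity = {}
        for unit in hidden_units:
            remove(net, unit)                  # temporarily disable
            sensitivity[unit] = error(net) - base_F
            restore(net, unit)
        for unit, dF in sorted(sensitivity.items(), key=lambda kv: kv[1]):
            if dF <= tolerance:                # negligible effect on F
                remove(net, unit)
        return net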
Sankar and Mammone (1991a) proposed a new neural net architecture called the Neural Tree Network (NTN), which combines feedforward neural nets with decision trees. A feedforward neural net is used at the root node of the NTN to divide the instance space into N subsets, where N is the number of concept (output) classes. If each subset corresponds to a single concept class, the job is done. Otherwise, each subset with non-unique concept classes is assigned to a child node, where again a feedforward neural net is used to divide the subset further. This process continues until each subset contains only instances from a single class. It has been reported that when feedforward neural nets are compared with decision trees for classification, neural nets usually give smaller classification errors but take a longer time to learn (Tsoi and Pearson, 1990; Fisher and McKusick, 1989; Piramuthu et al., 1990). Sankar and Mammone showed that NTN outperformed both feedforward neural nets and decision trees, and gave a pruning algorithm for the NTN based on a complexity parameter a: when a is zero the optimally pruned subtree is the NTN itself, and as a increases the optimally pruned subtree reduces in size with the root node as a limit (Sankar and Mammone, 1991b).
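The recursive construction can be sketched as follows (an illustration added here, with hypothetical train_net and classify helpers standing in for the feedforward nets at the tree nodes):

    def build_ntn(patterns, labels, train_net, classify, max_depth=10):
        # Train a small feedforward net to split the instance space,
        # then recurse on every subset that still mixes classes.
        net = train_net(patterns, labels)
        node = {'net': net, 'children': {}}
        if max_depth == 0:
            return node
        subsets = {}
        for x, y in zip(patterns, labels):
            subsets.setdefault(classify(net, x), []).append((x, y))
        for branch, subset in subsets.items():
            xs, ys = zip(*subset)
            if len(set(ys)) > 1:               # impure subset: recurse
                node['children'][branch] = build_ntn(
                    list(xs), list(ys), train_net, classify, max_depth - 1)
        return node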
Weigend et al. (1990) used the information theoretic concept of "minimum description length" (as in the SONN algorithm by Tenorio and Lee, 1989). A penalty for the network complexity, measured in the number of connections, was added to the criterion function. Thus, by minimizing the augmented criterion function through standard BP, a trade-off is achieved between the performance and the network complexity. This approach reduced the size of the trained network and improved its generalization property. Similar pruning approaches were discussed in the "Skeletonization" procedure (Mozer and Smolensky, 1989) and the "Optimal Brain Damage" method (le Cun et al., 1990). Chauvin (1989) used a penalty term for large weights in the criterion function. Hanson and Pratt (1989) defined a bias term in the criterion function that served to decay the weights (pushing the weights not increased by the updating rule to zero), and obtained trained nets with smaller numbers of hidden units.
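The common idea behind these penalty approaches can be illustrated with a simple sketch (a plain quadratic weight penalty is used here for concreteness; Weigend et al., Chauvin, and Hanson and Pratt each used their own penalty form):

    import numpy as np

    def penalized_criterion(errors, weights, lam=0.01):
        # Augmented criterion: fit term (sum of squared errors) plus a
        # complexity penalty on the weights, traded off by lambda.
        fit = 0.5 * np.sum(np.asarray(errors) ** 2)
        complexity = lam * np.sum(np.asarray(weights) ** 2)
        return fit + complexity

Minimizing such an augmented function by standard BP drives unneeded weights toward zero, which is what shrinks the effective network size.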


The GAL (grow and learn) algorithm introduced by Alpaydin (1991) can both grow and prune the net. It is basically a variant of the nearest neighbor method which, instead of storing the whole training set, stores only a subset of the training patterns close to the class boundaries. A recent summary of dynamically structured neural nets can also be found in Alpaydin (1991).


Contrary to common belief, Sietsma and Dow (1991) showed that for the classification problems they attempted, pruning to the minimum number of hidden units decreased the generalization ability of feedforward neural nets in noisy environments, although the pruned nets did very well on the training set.


4.7 Miscellaneous Heuristics

There are many variations of the standard BP algorithm that do not fit in the preceding sections.












4.7.1 Initial Weights

In most nonlinear optimization problems, identifying a good initial solution can be crucial to the efficiency of the algorithm. Similarly, initial weights play an important role in neural network training. Kolen and Pollack (1990) performed extensive tests on the sensitivity of backpropagation to initial network weights. Their results showed that standard BP is very sensitive to the initial weight range. Specifically, for the 2 x 2 x 1 XOR net, BP gets stuck in local minima easily when the range of the initial weights is set too large.


Chen and Bastani (1989) introduced a weight initialization algorithm for two-layer feedforward neural nets. A least squared error (LSE) feature selection method based on the Walsh Transform is used; the Walsh transform produces an initial weight matrix that has the best projection from the training sample. The learning speed of the XOR network with this weight initialization technique was shown to be much higher than that of the same network with random initial weights. Specifically, networks so initialized performed nearly as well as the best randomly initialized networks from 150 tests.

4.7.2 Multi-scale Training


Felten et al. (1990) also considered incorporating features of the problem into the neural net weight space. They reasoned that it is only natural to use any knowledge about the training set in order to restrict the search space (hypothesis space). Since real world problems are inherently structured, it is possible to incorporate this information into neural network learning. Specifically, they proposed a multi-scale training algorithm. It starts with small networks, and then uses the results from the trained small networks to help train a larger network. The networks of different sizes are related through a rescaling or dilation operator. For a hand-written character recognition problem, for example, the operator relates networks trained on images of the characters at different resolutions.












4.7.3 Borderline Patterns

Ahmad and Tesauro (1988) found that the number of training examples needed to train a neural net successfully scales linearly with the number of inputs for learning the majority function.10 More importantly, the most useful training instances are those close to the class boundary. Thus they proposed to use only borderline patterns to train the neural nets. Their experiments showed that nets trained with borderline patterns performed significantly better than nets trained with random patterns; they also had substantially better generalization ability. An upper bound on the number of random training patterns sufficient to learn the majority function was derived based on the borderline pattern notion.
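For the majority function the borderline patterns are easy to characterize: they are the inputs whose class flips when a single bit changes. A small illustrative generator (added here; Ahmad and Tesauro give the precise construction) is:

    from itertools import combinations

    def borderline_majority_patterns(n):
        # Patterns of the n-bit majority function whose number of ones
        # is just below or just at the majority threshold, so flipping
        # one bit changes the class.
        threshold = n // 2 + 1
        patterns = []
        for ones in (threshold - 1, threshold):
            for idx in combinations(range(n), ones):
                p = [0] * n
                for i in idx:
                    p[i] = 1
                patterns.append(p)
        return patterns

For example, borderline_majority_patterns(5) returns exactly the inputs with two or three ones.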


4.7.4 Rescaling of Error Signal

Rigler et al. (1991), besides providing a general account of gradient descent methods, noted that in a feedforward neural net with sigmoid activation functions the algorithm generates a factor f' = o(1 - o) at each layer. Hence, by the chain rule, the gradient vectors in different layers contain exponentially decreasing factors (1/4, 1/16, 1/64, ...). To compensate for this diminishing effect, they suggested rescaling the gradient factor, that is, multiplying the gradient factor by exponentially increasing scalars. One particular set of rescalings they used was 6, 36, 216, ..., obtained by taking the inverse of the expected diminishing factors. Experiments showed that this simple rescaling method could reduce training time by as much as an order of magnitude.

Fahlman (1988) studied what he called the sigmoid prime function. We have discussed that the value of the sigmoid prime function goes to zero when the output approaches 0 or 1. This also causes the backpropagated error signal to become vanishingly small, hence learning is slowed down. By simply adding a constant 0.1 to the sigmoid prime function before it is used, Fahlman reduced the training time to nearly half of that of standard BP.
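Fahlman's modification is a one-line change; the sketch below (added for illustration) shows it next to the ordinary sigmoid prime:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_prime(o):
        # Ordinary derivative, expressed in terms of the output o:
        # vanishes as o approaches 0 or 1.
        return o * (1.0 - o)

    def sigmoid_prime_offset(o, c=0.1):
        # Fahlman's modification: the added constant keeps the
        # backpropagated error signal from vanishing.
        return o * (1.0 - o) + c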











4.7.5 Varying the Gain Factor

Kruschke and Movellan (1991) performed gradient descent with respect to the gain factor, hence making it adaptive. The adaptive gain factor modifies the magnitude of the weight change, and creates an effect similar to that of an adaptive learning rate. The BPG (backpropagation with adaptive gain) algorithm was shown to give a remarkable speed-up (by a factor of about 2) over standard BP. The gain factor was also used to create hidden layer bottlenecks (reducing the number of hidden units) for improving generalization.


4.7.6 Divide and Conquer

The divide-and-conquer strategy has a long tradition in computer science and artificial intelligence. Jacobs (1990) developed a theory of a modular connectionist architecture. A similar methodology was studied by Thrun et al. (1991), who examined task modularization through network modulation. Another method proposed in 1991 combines Kohonen's feature map (Kohonen, 1989) with feedforward neural nets in an error-driven decomposition scheme that was shown to outperform the feature map or backpropagation alone in approximating the Mexican hat function.11


Pratt et al. (1991) studied the direct transfer of learned information among neural nets. They were able to train a large net starting with weights transferred from a smaller net trained on subtasks. Compared with nets using random initial weights, the weight-preset nets achieved speed-ups of up to an order of magnitude (even when the time to train the smaller nets was taken into consideration). The decomposition technique, borrowed from Waibel et al. (1989), includes the following steps:

1. Subnet training: subnets are set up and trained individually.

2. Glue training: the trained subnets are bonded together through additional connections.











4.7.7 Total Error vs. Individual Error

Some researchers, in particular Yu and Simmons (1990), considered using individual pattern errors, instead of the total sum of squared error, to guide the learning process. Their argument was that total error is not as effective a measure as a correctness ratio in classification problems. They developed an algorithm called Descent Epsilon in which a parameter e is used to gauge the difference between a network output and its target value. The output is considered correct if the difference is less than e. Only those errors that are greater than e are backpropagated to modify the network weights. The magnitude of e is gradually decreased. Hence the total error also goes down, with the individual errors kept within the e bound.
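The error-masking step of Descent Epsilon can be sketched in a few lines (an illustration added here, not the authors' code):

    import numpy as np

    def descent_epsilon_errors(targets, outputs, eps):
        # Outputs within eps of their targets count as correct and
        # contribute no error signal; only larger errors are
        # backpropagated.  The training loop gradually decreases eps.
        diff = np.asarray(targets) - np.asarray(outputs)
        return np.where(np.abs(diff) > eps, diff, 0.0)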
In conclusion, this chapter has summarized the state-of-the-art research in feedforward neural network training. Most variations of the back-propagation algorithm are aimed at improving the training speed and increasing the generalization ability of feedforward neural networks. Successes of various degrees have been achieved. However, more efficient and globally convergent training algorithms are needed to deal with more challenging real world problems. The next three chapters will focus on globally optimal neural network training algorithms.


















CHAPTER 5
GLOBALLY GUIDED BACKPROPAGATION (GGBP)

In this chapter we propose a modification to the standard backpropagation algorithm. The modification, while retaining the simplicity of standard BP, introduces two nice properties: (1) there is a training time speed-up, and (2) convergence to a global optimal solution is guaranteed. We start with a brief discussion of the shortcomings of standard backpropagation. Then we develop the ideas behind our approach and present the globally guided backpropagation algorithm (GGBP). Experiments on two standard test problems are presented.


5.1 Limitations of BP

The backpropagation (BP) method is one of the most widely used learning algorithms for multi-layered feed-forward neural networks. The popularity of BP arises from its simplicity and successful applications to many real world problems. It is commonly recognized, however, that BP has some inherent shortcomings. Two of the often cited shortcomings are (1) slow or no convergence, and (2) the possibility of getting stuck in local minimum solutions (Tollenaere, 1990; Hirose et al., 1991).
The objective of backpropagation learning is to find a set of network weights such that the total error function defined by some measure is minimized. Unfortunately, the error surface of a feed-forward neural network is generally very complicated due to the convoluted nonlinear transfer functions. The error surface is generally characterized by a large number of flat areas (troughs that have very small slope) and long narrow ravines with sharp curvature (Battiti and Masulli, 1990). Training gets stuck in the flat areas or oscillates along the ravines. Also it is clear that, with steepest descent, once a solution gets stuck in a local minimum it has no way to escape.


Many variations of BP have been developed, as discussed in the last chapter. The effort to deal with the first problem, that is, to develop more efficient neural net training algorithms, has met with only partial success. Few researchers have considered the problem of local minimum solutions. Local refinements of the algorithm, such as using second order information of the criterion function, improve the learning speed but suffer the same problem of staying stuck in a local minimum once the solution is trapped.

5.2 The Idea of Globally Guided Backpropagation

The error surface of a feedforward neural network in the weight space is generally very complicated. Figure 5.1 shows a typical error surface of the simple XOR network (see Section 3.3), where large flat areas and narrow valleys exist. It is clear that a strict gradient descent approach will encounter difficulties in such a weight space. However, the error surface of a feedforward neural network in the output space is quite simple. If we use a sum of squared error function, the error surface is convex quadratic in the output space:

    F = \sum_p F_p = \frac{1}{2} \sum_p \sum_k (y_{pk} - o_{pk})^2    (5.1)


Note that the error in Equation (5.1) is separable in p and k, which are the indices for the pattern (example) and the output unit of the network, respectively. Minimization of the quadratic function is easy if the output of the network can be controlled: the unique local minimum of F is also a global minimum solution, and the optimal outputs are the target values.

Unfortunately, solving for the weights W through the inverse function of the output O is extremely difficult, if not impossible, because the neural network output is built from weighted sums nested inside nonlinear transfer functions.









Figure 5.1. Error surface of an XOR (2 x 2 x 1) network showing valley, plateau and local minimum (error plotted against weights w2 and w6).


However, if we change the output by a small amount, we will be able to find the corresponding changes in the weights W via a Taylor series expansion of O:

    O(W + \Delta W, X) - O(W, X) \approx \nabla_W O(W, X) \, \Delta W    (5.2)

where the higher order terms are ignored. If we update the weights of the network based on the changes \Delta W, instead of -\eta \nabla_W E as in standard backpropagation, then we have reason to hope that this weight updating scheme would (1) lead to faster convergence, since the search in the weight space is guided directly by the search in the output space, and (2) lead to a global optimal solution.






















Figure 5.2. The weight change \Delta W corresponding to \Delta O would lead W to a global optimal solution.


5.3 Learning Rule Derivation

The learning rule of GGBP is derived based on the changes in the output space. Let us consider a given training pattern. The error function is

    E = \frac{1}{2} \sum_{k=1}^{K} (T_k - O_k)^2    (5.3)

where k is the index for the output units. Changing the output O = (O_1, O_2, ..., O_K)^T based on gradient descent in the output space gives

    \Delta O(n) = O(n+1) - O(n) = -\eta \nabla_O E(n)    (5.4)

where n is the iteration index. Using Equation (5.3),

    \Delta O(n) = \eta (T - O(n)).    (5.5)






























Figure 5.3. A typical FNN (inputs x1, ..., x4) where the output layer weights associated with O_1 are independent of the other output units.


From Equation (5.2), the output change is

    \Delta O = \nabla_W O \, \Delta W.    (5.6)

Note that here \Delta O(n) is a K-dimensional vector, W is an S-dimensional vector, and \nabla_W O is a K x S matrix. Finding a \Delta W requires the pseudoinverse of this matrix, which is computationally undesirable. Considering the special structure of the feed-forward neural network, we notice that the weights of the output layer associated with output unit i are independent of the output units O_k, k = 1, 2, ..., K, k \neq i (see Figure 5.3). We can rewrite \Delta O as

    \Delta O = [\nabla_{W_H} O, \nabla_{W_{O_1}} O, ..., \nabla_{W_{O_K}} O] (\Delta W_H, \Delta W_{O_1}, ..., \Delta W_{O_K})^T    (5.7)

where W_H denotes the weights in the hidden layer(s) and W_{O_k} the output layer weights associated with output node k. Each component of \Delta O becomes

    \Delta O_k = \sum_{s \in W_H} \frac{\partial O_k}{\partial w_s} \Delta w_s + \sum_{s \in W_{O_k}} \frac{\partial O_k}{\partial w_s} \Delta w_s.    (5.8)












For the output layer weights, we choose \Delta W_k in the direction of \nabla_{W_k} O_k. Thus we have

    \Delta O_k = \|\nabla_{W_k} O_k\| \, \|\Delta W_k\|    (5.9)

    \|\Delta W_k\| = \frac{\Delta O_k}{\|\nabla_{W_k} O_k\|}.    (5.10)

The normalized component of \Delta W_k is

    \Delta w_s = \frac{\partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|} \, \|\Delta W_k\|.    (5.11)

Substituting \|\Delta W_k\| with Equation (5.10) gives

    \Delta w_s = \frac{\partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|^2} \, \Delta O_k.    (5.12)

Replacing \Delta O_k using Equation (5.5) results in

    \Delta w_s = \frac{\eta (T_k - O_k)}{\|\nabla_{W_k} O_k\|^2} \, \frac{\partial O_k}{\partial w_s}.    (5.13)

If w_s is a weight in the output layer, Equation (5.13) is used as the weight updating rule. For a weight in a hidden layer, we need to consider the effect of all the outputs on it. The changes due to each output O_k, k = 1, 2, ..., K, are summed up. Hence we have, for all s \in W_H,

    \Delta w_s = \sum_{k=1}^{K} \frac{\eta (T_k - O_k)}{\|\nabla_{W_H} O_k\|^2} \, \frac{\partial O_k}{\partial w_s}.    (5.14)

This heuristic approach (summing up the \Delta w_s's) is also used in White (1990), where similar results are obtained from an application of Newton's method. The advantage of this approach is the simplicity of the weight updating rule. The down-side is that the summed changes no longer produce the intended output changes exactly.












Summing up the components of \Delta O (cf. Equation (5.8)), we have

    \Delta O_k = \nabla_{W_H} O_k \, \Delta W_H + \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}.    (5.15)

Because of the special structure of the feedforward network, \Delta W_H is the same for all k = 1, 2, ..., K. Thus we can separate \Delta W_H and obtain

    \|\Delta W_H\| = \frac{\Delta O_k - \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}}{\|\nabla_{W_H} O_k\|}.    (5.16)

Similar to the derivation of Equation (5.13), we have, for all s \in W_H,

    \Delta w_s = \sum_{k=1}^{K} \frac{\eta (T_k - O_k) - \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}}{\|\nabla_{W_H} O_k\|^2} \, \frac{\partial O_k}{\partial w_s}.    (5.17)

Comparing Equation (5.17) with Equation (5.13), we note that the equation is more complicated and requires explicitly the computation of the weight changes in the output layer.
Recall that in standard backpropagation the weights are updated with

    \Delta w_s = -\eta \frac{\partial E}{\partial w_s} = \eta (T_k - O_k) \frac{\partial O_k}{\partial w_s}    (5.18)

for the output layer, and

    \Delta w_s = -\eta \frac{\partial E}{\partial w_s} = \eta \sum_{k=1}^{K} (T_k - O_k) \frac{\partial O_k}{\partial w_s}    (5.19)

for the hidden layer(s).












Both updating rules are of the same general form, where the weight change is a function of the partial derivatives of the output with respect to the weights. The concepts of the two approaches are, however, quite different. With GGBP, \eta is a fixed learning parameter in the output space, while \eta of the standard BP is a fixed learning rate in the weight space.


5.4 Convergence of GGBP

Updating the weights with Equations (5.13) and (5.14) (or (5.17)) will ensure that the global error is decreasing, as long as the approximation used in the Taylor expansion is valid. Following Equation (5.5) and changing notation slightly (by explicitly putting in W and X), we have

    \Delta O(W^n, X) = O(W^n, X) - O(W^{n-1}, X) = \eta (T - O(W^{n-1}, X))    (5.21)

so that, unrolling the recursion,

    O(W^n, X) = \eta T [1 + (1-\eta) + ... + (1-\eta)^{n-1}] + (1-\eta)^n O(W^0, X)
              = [1 - (1-\eta)^n] T + (1-\eta)^n O(W^0, X).    (5.22)

For any \eta \in (0, 1), O(W^n, X) \to T as n \to \infty. That is, the output converges to the target value.

Note that the convergence property is guaranteed by (5.21) only for the case of a single example. For a multi-example training set, the weight updating rules of GGBP are still valid if the instance training method is used, but the convergence proof remains an open issue, as in the case of standard BP. Empirical results, though, have shown that convergence is typical when \eta is small.
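The geometric convergence in Equation (5.22) is easy to see numerically; the following small illustration (added here) iterates the output update for a single scalar output:

    # With O updated as O <- O + eta*(T - O), the gap T - O shrinks by
    # a factor (1 - eta) per iteration, for any eta in (0, 1).
    T, O, eta = 1.0, 0.1, 0.3
    for n in range(1, 6):
        O = O + eta * (T - O)
        print(n, O)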
The extension of the GGBP algorithm to sample training is not straightforward, because the output becomes a matrix when all patterns are considered. Conceptually, the GGBP approach is still applicable; the derivation of the weight updating rules then requires iterative solutions to a system of linear equations. On the other hand, a heuristic for applying GGBP to sample training is to simply add up the weight changes computed for the individual patterns.












5.5 The GGBP Algorithm

The GGBP algorithm is similar to the standard backpropagation algorithm, and its implementation is straightforward. Note that there is a slight difference in the definition of \delta in the two algorithms. Also, we do not use the momentum term in our algorithm, since GGBP is supposed to search in the output space, where there are no ravines and/or plateaus. GGBP is formally stated below.

Algorithm GGBP

1. INITIALIZE:

   * Construct the feedforward neural network. Choose the number of input units and the number of output units equal to the length of the input vector X and the length of the target vector T, respectively.

   * Randomize the weights and biases in the range (-0.5, 0.5).

   * Specify a stopping criterion such as E < E_stop or n > n_max.

2. FEEDFORWARD:

   * Compute the output for the non-input units. The network output for a given example p is

        O_pk = f(\sum_j w_jk f(\sum_i w_ij f( ... f(\sum_l w_l x_l) ... ))).

     Note that \theta_j is replaced by w_0j for notational convenience.

   * Compute the error using Equation 3.7.

   * If a stopping criterion is met, stop.

3. BACKPROPAGATE:

   For k = 1, 2, ..., K, repeat:

   * For output unit k, compute \delta_k = f'(net_k).

   * For each hidden unit j, compute \delta_j = \delta_k w_jk f'(net_j).

   End repeat.

4. UPDATE:

   * For the output layer: \Delta w_jk = \eta (T_k - O_k) \delta_k O_j / \|\nabla_{W_k} O_k\|^2 (cf. Equation (5.13)).

   * For the hidden layer: \Delta w_ij = \sum_k \eta (T_k - O_k) \delta_j O_i / \|\nabla_{W_H} O_k\|^2 (cf. Equation (5.14)).

5. REPEAT: Go to Step 2.
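As an illustration of the output-layer rule (an added sketch, not the dissertation's original code), Equation (5.13) translates into a few lines:

    import numpy as np

    def ggbp_output_update(T_k, O_k, grad_Ok, eta=0.5):
        # grad_Ok holds the partials of output O_k with respect to its
        # output-layer weights; the desired output change eta*(T_k - O_k)
        # is projected back through that gradient, normalized by its
        # squared length as in Equation (5.13).
        norm_sq = np.dot(grad_Ok, grad_Ok)
        if norm_sq == 0.0:
            return np.zeros_like(grad_Ok)
        return eta * (T_k - O_k) * grad_Ok / norm_sq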


5.6 Experiments

Two test problems are used to illustrate and evaluate the performance of GGBP. Both are standard test problems. All tests were run on an 80386-based microcomputer. The reported results are averages of 20 runs starting with the same random initial weights for both GGBP and the standard BP. All numbers are rounded to their nearest integers.


5.6.1 The XOR Problem

The Exclusive Or (XOR) problem has been used extensively as a benchmark for neural network algorithm evaluation, due to historical reasons. The problem has been described in Section 3.3. Solving it requires classifying the inputs into two linearly non-separable classes.















Table 5.1. Training epochs of GGBP vs. BP for the XOR problem

    E-stop    BP (lr=0.5, mo=0.9)    GGBP (lr=0.5)     BP (lr=0.5)
              mean     std dev       mean    std dev   mean    std dev
    0.04       206       89            62      10      2148      410
    0.01       292       45            71      10        --       --
    0.001     1369      310           110      52        --       --

For the sake of comparison, standard BP without the momentum term was also tested, which resulted in a convergence speed about 35 times slower than that of GGBP. As the stopping criterion becomes more stringent, the difference between GGBP and BP becomes more significant. This is no surprise, as GGBP uses an approximation scheme that is best in the neighborhood of the global minimum, while standard BP slows down when the error signal becomes small. Typical learning curves of both GGBP and BP are shown in Figure 5.4. Note that the GGBP solution oscillates in the beginning. This shows that the linear approximation used in the algorithm is very crude while the random initial weights dominate; the approximation becomes more effective when the weights are brought closer to the global optimal point. We used the heuristic method in the hidden layer weight updating, which may also contribute to the inaccuracy during the initial learning period.


5.6.2 The 424 Encoding Problem

The encoding problem was proposed by Ackley, Hinton and Sejnowski (1985). The problem is to map N-tuple input patterns to N-tuple output patterns through a hidden layer with log2 N units. Passing through the hidden layer requires the data to be compressed and then decoded.

































Figure 5.4. Learning curves of GGBP (solid line) vs. BP (dotted line) over the number of epochs.

We tested GGBP on a 4 x 2 x 4 network. The results are summarized in Table 5.2.

Table 5.2. Training epochs of GGBP vs. BP for the 424 encoding problem

    E-stop    BP (lr=0.5, mo=0.9)    GGBP (lr=0.5)
              mean     std dev       mean    std dev
    0.04       935       647          177      155
    0.01      4635      2545          187      182

The speed-up of GGBP over the standard BP is a factor of 5 to nearly 25. Similar to the case of the XOR problem, the performance of GGBP is significantly better than that of the standard BP when the solution standard is set higher. While the number of training epochs of BP increased about 4 times when the stopping criterion was decreased from 0.04 to 0.01, the training epochs of GGBP remained nearly unchanged.












GGBP is based on a simple concept: the algorithm considers optimization of the global criterion function in the output space. This leads to faster learning and convergence to a global optimal solution. The speed advantage can be attributed to the fact that the search is guided by the changes in the output space; that is, the weight change in the weight space does not necessarily follow the gradient descent direction. The problems of standard BP associated with flat plateaus and deep ravines in the weight space are thus avoided.

The second advantage of GGBP is that it does not use the momentum term. Choosing a good combination of learning rate and momentum with standard BP often poses a challenge to inexperienced neural network users. In this sense, GGBP is easier to use than standard BP. We noticed in our experiments that a learning rate less than 0.5 usually produces fast and stable solutions.

Although in this implementation GGBP has a constant learning rate, this need not be so; a dynamically adjusted learning rate might improve its performance. Even with a fixed learning rate (in the output space), GGBP is analogous to standard BP with a dynamic learning rate in the weight space, and the dynamics of the learning rate adjustment in the weight space is well-founded by the derivation of the algorithm. BP with a dynamically adjusted learning rate has been studied by several researchers (Vogl et al., 1988; Jacobs, 1988; Silva and Almeida, 1990). Those approaches are heuristics; they work in some limited domains and may produce controversial results. Viewed as BP with a dynamic learning rate, GGBP provides a learning rate adjusting mechanism that avoids detailed considerations of the shape of the error surface in the weight space.

The speed-up of GGBP over BP is evidenced in the experiments. A remarkable feature of GGBP is that it still has a fast learning speed even when the error becomes small, while BP becomes hopelessly slow. This feature could be especially beneficial to problem domains where accurate learning is required.











GGBP assumes that the weight changes computed by the updating rule will produce the desired output change, which leads to a decrease of the global error. Careful examination of this assumption reveals that it is only approximately true. Part of the inaccuracy results from the first order approximation via the Taylor expansion of the output function O(W, X). Another factor that may adversely affect the approximation is that the hidden weights of the neural network depend on all the output units: the asynchronous presentation of the target values (for a given pattern) renders the computation of the hidden layer weight changes inaccurate. Nevertheless, the GGBP algorithm is shown to perform significantly better than the standard BP. The performance of GGBP could be improved by considering higher order approximations and a synchronized parallel implementation. It is not clear how those improvements can be carried out, but the concept of computing weight changes to produce a desired output change is appealing. Research along this line could be promising.

















CHAPTER 6
STOCHASTIC GLOBAL ALGORITHMS


The globally guided backpropagation (GGBP) algorithm introduced in Chapter 5 guarantees a global optimal solution as long as the learning rate is small enough. However, the requirement of a small learning rate may cause slow convergence. The interest in finding global optimal solutions with efficient learning algorithms has prompted neural network researchers to look into the global optimization literature. Some researchers have explored the use of genetic algorithms and simulated annealing in neural network training. In this chapter, we will discuss the search mechanisms and their implementation in feedforward neural network training using stochastic global algorithms: genetic algorithms, simulated annealing, random search methods, and clustering methods.


6.1 Genetic Algorithm

The concept of the genetic algorithm (GA) was introduced by Holland (1975). Genetic algorithms are a class of search algorithms based on several features of biological evolution, such as cross-over (mating) and random perturbation (mutation). In recent years, genetic algorithms have been successfully applied to a large variety of problems in optimization, learning, and operations management (Goldberg, 1989). Generally, a genetic algorithm has the following components:

1. An encoding/decoding scheme that maps the solution of the problem to a bit stream (chromosome).

2. An initial population consisting of initial possible solutions.












A genetic algorithm starts with an initial population. The members of the population are evaluated with the criterion function. Part of the population is chosen to create the next generation through cross-over, mutation, and/or other domain-specific operators. Selection of the parent members is determined by a probability distribution based on their fitness as measured by the criterion function (Holland, 1975).

The cross-over operator is applied to two parents. A random bit of the bit stream is chosen, and at that point the parents' bits are crossed over; that is, the parents exchange parts of their bit streams starting from the chosen bit. The mutation operator is applied to a single parent: a random bit of the parent is chosen and changed to its complement.
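The two operators are simple to state in code; the following sketch (added for illustration) works on bit streams represented as Python lists:

    import random

    def one_point_crossover(p1, p2):
        # Exchange the tails of two bit streams from a random point.
        point = random.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    def mutate(bits):
        # Choose one random bit of the parent and complement it.
        i = random.randrange(len(bits))
        child = list(bits)
        child[i] = 1 - child[i]
        return child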
For the application of genetic algorithms to feedforward neural networks, a simple implementation is to encode all weights and biases as a single vector (Montana and Davis, 1989). For example, for the XOR network with a single hidden node (cf. Figure 3.8), a solution is represented by a vector w = (w_1, w_2, w_3, w_4, w_5, w_6, w_7). An initial population P_0 = {w^1, ..., w^n} can be generated with each w_i, i = 1, 2, ..., 7, taken from a random distribution, say, a uniform or Gaussian distribution. The cross-over operator is applied as discussed before. The mutation operator can be modified such that a random perturbation is added to a randomly chosen component of the parent.

Montana and Davis (1989) reported that their genetic algorithm outperformed the classic backpropagation algorithm (without momentum). A more involved coding scheme was used by Chalmers (1990), where the weight-space dynamics was coded as linear genomes consisting of bit streams. Belew et al. (1990) considered using the genetic algorithm to generate a good initial weight set W_0 that is then used in place of the random initial weights of the backpropagation algorithm. As can be expected, the performance of BP was improved with W_0 chosen by GA. The results of Offutt (1989) showed that GA could train a feedforward neural network.












The search mechanism of the genetic algorithm can be implemented within the BP algorithm to help increase learning speed and avoid local minima. The idea is that when the BP algorithm is detected to be in a flat region, where the gradient in the weight space is nearly zero, a large jump incurred by sufficient mutation of the current solution should be more efficient in bringing the solution out of its stagnant status than a gradient descent move. If the solution is stuck at a local minimum, the gradient descent approach simply fails to proceed, while genetic mutation (possibly with cross-over of different solutions) may make a solution tunnel through the surrounding peaks of the local minimum, and lead to the attraction region of some more promising (local) minimum.


When to apply GA can be determined by the following heuristics. A gradient threshold \theta is defined; \theta can be preset or dynamically derived. A weight w_i is labeled inert whenever |\partial F / \partial w_i| \leq \theta. Between regular BP sessions, those weights labeled inert are perturbed by a random amount (mutation). If |\partial F / \partial w_i| \leq \theta for all w_i, then the current solution must be in a flat area of the weight space, and a cross-over between the current solution and a different solution can be performed. The genetic algorithm augmented backpropagation (GAABP) algorithm is stated below.

Algorithm GAABP

1. INITIALIZE:

   * Construct the feedforward neural network. Choose the number of input units and the number of output units equal to the length of the input vector x and the length of the target vector t, respectively.

   * Randomize the weights w(0) (including biases) in the range (-0.5, 0.5).

   * Specify a stopping criterion such as F < F_stop or n > n_max. Set the iteration number n = 0.

2. FEEDFORWARD:

   * Compute the output for the non-input units. The network output for a given example p is

        O_pk = f(\sum_m w_mk f( ... f(\sum_i w_il x_i) ... )).

   * Compute the error using Equation 3.7.

   * If a stopping criterion is met, stop.

3. BACKPROPAGATE:

   Set n = n + 1.

   * For each output unit k, compute \delta_k = (t_k - o_k) f'(net_k).

   * For each hidden unit j, compute \delta_j = f'(net_j) \sum_k \delta_k w_jk.

   * If |\delta_j O_i| \leq \theta, then label(w_ij) = inert.

4. UPDATE:

   * Mutation: if label(w_ij) = inert, then

        \Delta w_ij(n+1) = Random(F - F_stop)

     where F is the current criterion value and F_stop the desired one. Random() is a function returning a random value of \Delta w_ij with a given probability distribution.

   * Gradient descent: otherwise

        \Delta w_ij(n+1) = \eta \delta_j O_i + \alpha \Delta w_ij(n)

     where \eta > 0 is the learning rate (step size) and \alpha \in [0, 1) is the momentum.

5. REPEAT: Go to Step 2.


Generally, mutation produces variations while cross-over enables larger changes that may help a stagnant solution to move out of local minima. Random mutation may follow a uniform distribution or a Gaussian distribution. The cross-over operator returns two new weight sets; the one with the better objective value is taken as the updated solution, and the other is used as the candidate for the next cross-over operation.
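A single GAABP weight update can be sketched as follows (an illustration added here; a Gaussian mutation is assumed for Random()):

    import random

    def gaabp_update(w, grad, F, F_stop, theta, eta=0.5, alpha=0.9,
                     prev_dw=0.0):
        # Inert weight: gradient component below threshold theta, so
        # mutate by a random amount scaled by the remaining error.
        if abs(grad) <= theta:
            dw = random.gauss(0.0, 1.0) * (F - F_stop)
        # Otherwise: the usual gradient descent step with momentum.
        else:
            dw = -eta * grad + alpha * prev_dw
        return w + dw, dw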


6.2 Simulated Annealing

Simulated annealing is a general heuristic optimization algorithm based on concepts from statistical physics. Kirkpatrick et al. (1983), in the early eighties, noticed that there is a strong similarity between combinatorial optimization and the annealing of solid materials such as metals. In a physical thermodynamic system, the system state is characterized by a probability distribution known as the Boltzmann distribution at thermal equilibrium, as shown in Figure 6.1. The horizontal axis is the system energy and the vertical axis is the probability of the system being in a state with energy E. From the distribution we notice that (1) a system state with lower energy has higher probability, and (2) as the temperature T decreases, the system becomes stable at a low energy state, because the probability of the system being in a high energy state approaches zero as the temperature decreases. The annealing process is to reduce the system temperature slowly, so that thermal equilibrium is maintained as the system settles into a low energy state.

















Figure 6.1. Boltzmann distribution at different temperatures.
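In an optimization setting, the Boltzmann distribution appears as the acceptance rule for candidate moves; a minimal sketch (added here; this is the standard Metropolis form, not code from the works cited) is:

    import math, random

    def boltzmann_accept(dE, T):
        # A move that lowers the energy is always taken; an uphill move
        # of size dE is accepted with probability exp(-dE / T), which
        # vanishes as the temperature T is lowered.
        return dE <= 0 or random.random() < math.exp(-dE / T)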


[Figure: system energy trajectories during annealing, comparing equilibrium and non-equilibrium cooling.]