GLOBALLY CONVERGENT NEURAL NETWORKS

By ZAIYONG TANG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 1992

Copyright 1992 by Zaiyong Tang. All Rights Reserved.

ACKNOWLEDGMENTS

I am indebted to many people without whom this work would never have become a reality. First of all, I owe deep thanks to my adviser, Dr. Gary Koehler, who has guided my dissertation research through all its ups and downs with patience, encouragement, and intellectual challenge. It is truly remarkable that, being a department chairman and an adviser of eight Ph.D. students concurrently, he still finds time to provide help whenever it is needed.

I am thankful to all my committee members: Drs. Paul Fishwick, Harold Benson, and Antal Majthay. Dr. Fishwick introduced me to the exciting world of artificial intelligence and neural networks. His open-mindedness and enthusiasm have had a great influence on me. Dr. Benson taught me the beauty and power of mathematical proof during three math programming courses, and the rigorousness of scientific research. Dr. Majthay, a guru in AI and expert systems, has offered me much valuable advice on C++ programming.

All the faculty members in the DIS department have helped me in one way or another. I would like to thank Dr. Richard Elnicki for providing computing resources, Dr. Selcuk Erenguc for general assistance in my graduate study, and Dr. Christopher Zappe for setting an example as an excellent professor. Thanks are due to Dian and Linda, our department secretaries. They have been very helpful in making my graduate study here a pleasant one. I would also like to thank my fellow Ph.D. students for many stimulating discussions and a harmonious and cooperative environment. Bob Norris has been extremely helpful in fitting me into the American culture.

I owe special thanks to my family: my wife, Xiaoqin Zeng, and my kids, Jimmy and Dora. Their love, understanding, and encouragement have kept me in high spirits and proper perspective. Xiaoqin certainly knows more than anyone else how hard it ...

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION

2 THE RENAISSANCE OF NEURAL NETWORKS
   Overview of Neural Networks
   Historical Development
   Neural Network Applications
      2.3.1 Neural Networks in AI
      2.3.2 Neural Networks in Decision Sciences
   Promise and Problems

3 FEEDFORWARD NEURAL NETWORKS
   The Processing Units (Neurons)
   The Perceptron Learning
   The Limitation of Perceptrons
   Feedforward Neural Nets and the BP Algorithm
   Backpropagation Derivation
   The Representation Capability of FNN

4 VARIATIONS OF BACKPROPAGATION LEARNING
   Performance Criterion Function
   Momentum
   ...
      4.5.2 Transcendental Functions
      4.5.3 Higher Order Networks and Function-Link Networks
      4.5.4 Gradient Descent Search in Function Space
   Dynamically Constructed Neural Nets
      4.6.1 Network Growing
      4.6.2 Network Pruning
   Miscellaneous Heuristics
      4.7.1 Initial Weights
      4.7.2 Multiscale Training
      4.7.3 Borderline Patterns
      4.7.4 Rescaling of Error Signal
      4.7.5 Varying the Gain Factor
      4.7.6 Divide and Conquer
      4.7.7 Total Error vs. Individual Error

5 GLOBALLY GUIDED BACKPROPAGATION (GGBP)
   Limitations of BP
   The Idea of Globally Guided Backpropagation
   Learning Rule Derivation
   Convergence of GGBP
   The GGBP Algorithm
   Experiments
      5.6.1 The XOR Problem
      5.6.2 The 4-2-4 Encoding Problem
   Comparison of GGBP and BP

6 STOCHASTIC GLOBAL ALGORITHMS
   Genetic Algorithm
   Simulated Annealing
   Random Search
   Clustering Methods

7 DETERMINISTIC GLOBAL ALGORITHMS
   Branch and Bound
      7.1.1 Prototype Branch and Bound
      7.1.2 BB Algorithm Convergence
   Lipschitz Optimization
   Estimate the Lipschitz Constant for an FNN
      7.3.1 Some Lemmas on Lipschitz Constant
      7.3.2 An FNN is Lipschitzian
      7.3.3 Local Lipschitz Constant
   BB-Based NN Training Algorithm

8 ...
   Combined BB and BP
   Experiments with GOTA and LGOTA
      8.4.1 GOTA with Different Error Thresholds
      8.4.2 GOTA with Heuristic Pruning
      8.4.3 GOTA with Random Local Search
      8.4.4 GOTA with BP Local Search

9 SUMMARY AND CONCLUSIONS
   Contributions
   Further Research

REFERENCES

APPENDICES
   A C++ Program for GOTA
   B Classes for Neural Network Simulation Systems

BIOGRAPHICAL SKETCH

LIST OF TABLES

Training Epochs of GGBP vs BP for the XOR Problem
Training Epochs of GGBP vs BP for the 4-2-4 Encoding Problem
Lipschitz Constant over Weight Subsets
GOTA Iterations for Solving the XOR Problem
GOTA with Heuristic Pruning
GOTA with Local Random Search
LGOTA vs BP with Different eta
LGOTA vs BP with Different gamma
LGOTA vs BP with Different alpha
LGOTA Iterations for Parity-3 Problem

LIST OF FIGURES

Structure of a single neuron
Typical activation functions
Geometrical explanation of the perceptron learning
The XOR problem and its geometrical representation
An example of layered perceptrons that solve the XOR problem
A 2 x 2 x 2 feedforward neural network
An example of the Kolmogorov neural network
Two simple neural nets that solve the XOR problem
Output function surface of the 2 x 1 x 1 network
A 3 x 4 x 2 radial basis function network
A function-link neural network used to solve Parity-3
Error surface of an XOR network showing valley, plateau and local minimum
A weight change corresponding to a change in output that would lead the weights to a global optimal solution
A typical FNN where the weights associated with one output unit are independent of the other output units
Learning curve of GGBP (solid line) vs BP (dotted line)
Boltzmann distribution at different temperatures
Equilibrium and nonequilibrium energy states

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

GLOBALLY CONVERGENT NEURAL NETWORKS

By Zaiyong Tang

August 1992

Chairman: Gary J. Koehler
Major Department: Decision and Information Sciences

Artificial neural networks are a computational framework that has become a focus of widespread interest.
One of the most widely used neural networks is the feedforward neural network (FNN). This type of neural network can be used to learn the underlying rules from examples. This learning ability gives FNNs wide applicability. However, the theory behind this neural network model is still immature, and there are many deficiencies in current neural network learning algorithms that have hindered their usefulness.

In this dissertation, we survey the research in FNN learning. Several new algorithms are proposed to improve the learning efficiency of FNNs. We have developed a globally guided neural network training algorithm that converges to a global optimal solution and reduces the training time. Both stochastic and deterministic global optimization approaches are employed for neural network training. The stochastic methods include genetic algorithms, simulated annealing, and pure random searches. The deterministic methods considered for neural net training are branch-and-bound based Lipschitz optimizations, which exploit the special structure of the FNN and the properties of the sigmoid activation function. An important feature of the global optimization training algorithms (GOTA) is that they yield a guaranteed global optimal solution. GOTA can also be combined with local search procedures, such as backpropagation, to produce more efficient, but still globally convergent, algorithms.

CHAPTER 1
INTRODUCTION

Artificial neural networks (neural networks or neural nets for short) are a computational framework that has recently become a focus of widespread interest. In contrast to conventional centralized, sequential processing, neural networks consist of massively connected simple processing units, which are analogous to the neurons in the biological brain. Through elementary local interactions (such as excitatory and inhibitory ones) among these simple processing units, sophisticated global behaviors emerge, which resemble the high-level recognition processes of humans.

Information in a neural network is distributed across many processing units and the connections among them, rather than stored in a single location. The processing units act in parallel and communicate only with their local peers. This makes high-speed computation readily achievable through parallel computers. The parallel and distributed processing (PDP) computational paradigm exhibits many desirable features, such as fault tolerance (resistance to hardware failure), robustness in handling different types of data, "graceful degradation" (being able to process noisy or incomplete information) (Matheus and Hohensee, 1987), and the ability to learn and adapt (Rumelhart et al., 1986; Lippmann, 1987; Hinton, 1989).

Research in neural nets experienced a sudden resurgence in the early 1980s and has seen an explosive growth in the last few years. The excitement about neural nets is rooted in understanding information processing in human brains. But recent interest in neural network study has grown to cover a wide spectrum of areas from industry, to education, to business, to the military (Simpson, 1990).

... (Hornik et al., 1990), although with regard to the whole area a sound theoretical foundation has yet to be established. The fast growth of this area has been pushed by extensive applications of the neural net computation paradigm. By virtue of their inherent parallel and distributed processing, neural nets have been shown to be able to perform tasks that are extremely difficult for conventional von Neumann machines, but are easy for humans.
These tasks include image recognition (Carpenter and Grossberg, 1987) and speech processing (Sejnowski and Rosenberg, 1987). More importantly, neural nets have been successfully applied to solve problems that often require human experts, such as sun spot prediction (Weigend et al., 1990) and ERP recognition (DasGupta et al., 1990). In the business world, neural networks have been successfully applied to areas where traditional approaches are ineffective or inefficient. A partial list of such areas includes loan evaluation (Judge, 1989), signature recognition (Rochester, 1990), stock market prediction (Dutta and Shekhar, 1988), time series forecasting (Sharda and Patil, 1990), and classification analysis (Fisher and McKusick, 1989; Singleton and Surkan, 1990).

The leading neural net paradigm for applications is the feedforward neural net (FNN). An FNN is used by first training it with known examples. Once the network is trained successfully, or in other words the neural net has learned the concept/rule embedded in the training examples, it can be used to recognize an associated output given an input it has seen before. The trained neural net can also be used to estimate/predict a possible outcome when a novel input is presented.

A neural net training procedure is also called a learning algorithm. One of the most widely (and wildly) used neural net learning algorithms is the backpropagation (BP) procedure (le Cun, 1988; Rumelhart et al., 1986). Although it has contributed to many success stories, this learning procedure lacks a sound theoretical background. Backpropagation is essentially a simple gradient descent based search procedure, and the solution it reaches may be a local minimum solution if the training problem has multiple minima, which is often true. Furthermore, the BP algorithm as used in practice deviates from strict gradient descent. This deviation may reduce the likelihood of a solution trapped in an unsatisfactory local minimum; however, the convergence of the procedure then becomes an open question in theory.

Other shortcomings of the backpropagation algorithm include a static (fixed a priori) neural network structure, ad hoc choice of learning parameters, and sensitivity to initial conditions (weight values). Because of these limitations, feedforward neural nets trained with the BP algorithm reach only a suboptimal status. The generalization ability of the neural nets, an ability to function in a domain larger than the training set, is also limited.

Extensive research has been carried out in recent years to explore the potential of feedforward neural nets and to improve the effectiveness and efficiency of the backpropagation learning procedure (Jacobs, 1988; Becker and le Cun, 1988; Moller, 1990). Remarkable progress has been made in developing new training methods and neural network architectures (Fahlman, 1989; Fahlman and Lebiere, 1990; Chan and Shatin, 1990). However, most variations of the backpropagation algorithm are based on heuristics that reduce the generality of the approach. For example, Fahlman's cascade correlation algorithm is orders of magnitude faster than the classic backpropagation algorithm, but its application is limited to input-output mappings with binary outputs.

Much less work has been done in overcoming the problem of local minima. A few researchers have used stochastic global search methods in neural net training with moderate success (Montana and Davis, 1989; Fang and Li, 1991). To date, we have seen no reports that apply deterministic global optimization approaches to neural net training.
Compared with the vast volume of applications, theoretical study of neural network learning (in particular, the backpropagation learning algorithm) has been weak at best. As a result, notions such as learning and generalization are constantly referred to without precise definitions. There is apparently a need for unified definitions and a formalism of the FNN learning paradigm.

In this dissertation, we attempt to fill this need and address the problems associated with backpropagation learning, with a focus on developing efficient and globally convergent learning algorithms. Our approaches involve both stochastic and deterministic global optimization techniques. We propose to treat neural network training as a global optimization problem. Recent developments in global optimization research lend us some viable tools, such as the branch-and-bound method and Lipschitz optimization (Horst and Tuy, 1990). We also consider globally guided heuristic search methods.

The dissertation is composed of nine chapters. Following the introduction, Chapter 2 presents a general account of neural networks, an outline of the historical development of neural net research, and a more detailed discussion of the promise and problems of current neural network study. Chapter 3 gives the basic concepts and definitions of feedforward neural nets. The backpropagation algorithm is derived and discussed in detail regarding its learning mechanism, its applicability and limitations, and its implementation. The next chapter (Chapter 4) focuses on the improvement of the backpropagation learning algorithm. A variety of approaches is presented, ranging from using efficient optimization procedures, to designing new network structures, to dynamically adapting learning parameters and learning mechanisms. This chapter summarizes the state-of-the-art research in feedforward neural network training.

Chapter 5 begins our work on globally convergent neural network learning procedures. We develop a search method that uses the information in the neural network output space to guide the learning process, rather than searching in the complicated weight space following the gradient descent. We explore the application of stochastic global optimization methods in neural network training in Chapter 6. In particular, we discuss the use of genetic algorithms, simulated annealing, pure random search, and clustering methods. Chapter 7 develops deterministic global optimization methods, where the Lipschitz constant is used in obtaining lower bounds for the branch-and-bound procedure, through an extension of the univariate Piyavskii algorithm. Upper bounds can be obtained with or without local search in the partition elements. A procedure is developed to compute a local Lipschitz constant over subsets of the weight space. This leads to tighter lower bounds and more effective pruning in the branching search process.

The implementation of the global optimization training algorithm (GOTA) is discussed in Chapter 8. We show that the computation of the local Lipschitz constant is easily carried out by exploiting the special structure of the feedforward neural network and the properties of the sigmoid activation function. We also discuss the simulation program design and different search strategies under the general framework of GOTA. Experiments on the effectiveness of GOTA and its local search augmented version (LGOTA) are carried out with some standard benchmark problems.

Finally, in Chapter 9, we summarize our contributions in the dissertation. Several conclusions are reached based on our theoretical study and experimental investigation. Further extensions of this research are also discussed.
CHAPTER 2
THE RENAISSANCE OF NEURAL NETWORKS

Let's face it, part of the interest in connectionism is that dirty little secret that researchers in nuclear physics had during the thirties, that maybe you can build something with it.
Gary Lynch

After more than a decade of dormancy, research in artificial neural networks came back to life in the 80's and experienced an explosive growth in recent years. The new surge of enthusiasm resembles the initial excitement in neural nets in the late 50's and early 60's, only far more intensive and extensive. The wave of neural network research has engulfed widespread disciplines: computer science, engineering, neuroscience, mathematics, psychology, linguistics, and decision sciences. In fact, the majority of neural net research has gone so far as to have totally lost any trace of its biological roots. Thus when we quote Gary Lynch, a well-known neuroscientist, we do not really mean that we are going to build an electronic brain; rather, we mean to build "something" that will enable us to solve problems that are intractable or difficult to solve with conventional approaches.

Overview of Neural Networks

As a reflection of the relative youth and broad scope of this field, neural networks are known by various names such as adaptive systems, connectionist machines, neurocomputers, collective decision circuits, parallel distributed processors and neuromorphic systems (Lippmann, 1987; Knight, 1990). There are as many, if not more, definitions of neural networks, ranging from simple ones describing systems that interact "in patterns reminiscent of biological neural nets" (Lippmann, 1987, p. 4) to more complicated and specific ones such as:

A parallel, distributed information processing structure consisting of processing elements (which can possess a local memory and can carry out localized information processing operations) interconnected with unidirectional signal channels called connections, each processing element of which has a single output connection which branches out into as many collateral connections as desired with each carrying the same signal, that being of any mathematical type desired (the processing being local to the processing element, i.e., dependent only on the current values stored in the processing element's local memory). (Hecht-Nielsen, 1989, p. 593)

By the very fact that research in neural networks was only revived recently and has found its way into such a diversified spectrum of disciplines, it is hard to give neural nets a generic and concise definition. But it is generally agreed that the essence of neural nets is parallel distributed processing (PDP) (Rumelhart, McClelland and the PDP Group, 1986).

Originally, neural nets were biologically motivated. However, research in the field has long (well, relatively long) diverged into two directions. One branch endeavors to understand our very brain. Researchers in this branch are concerned with human perception, memory, reasoning, and learning. The other branch is more interested in the computational models and their power to accomplish traditionally difficult tasks, rather than biological fidelity. The main thrust of current neural net research seems tilted towards the second area. There are more than a dozen main neural net paradigms being actively applied today (Simpson, 1990). The most widely used one is the backpropagation (BP) model (le Cun, 1988), which bears little resemblance to biological systems. The popularity of BP arises from its simplicity and a powerful representation ability that can address a wide variety of real-world problems.
Other neural net models that do not have much biological flavor but find successful applications in pattern recognition, decision making and optimization include Hopfield networks (Hopfield, 1982) and Kohonen's self-organizing networks (Kohonen, 1989).

...
2. Massive interneuron connections and associations via those connections
3. Highly parallel processing
4. Internal information representation and distributed storage (as weights on the connections and/or the activation states of the neurons)
5. A learning rule whereby the internal representation is changed in response to changes in the environment
6. A learning environment that provides input and feedback to the network

The basic characteristics of an artificial neural network are similar to those of its biological counterpart. But for most neural network paradigms, the learning mechanisms do not even remotely resemble the learning mechanism in biological systems. Nevertheless, neural networks provide a framework within which certain aspects of the human brain can be modeled. Those aspects include association, classification, generalization, optimization (under soft constraints) and adaptation. In large part, intelligent systems (artificial or natural) depend on those abilities, and those abilities are not easily modeled with conventional serial processing models based on von Neumann machines.

The structural and nonprogramming approach of neural networks lends itself to dealing with difficult artificial intelligence (AI) problems such as pattern recognition. While it is often difficult or impossible to explicitly write down a set of rules for such problems (hence symbolic approaches fail), neural networks can learn from training data to produce a solution. In recent years neural networks have made strong advances in AI areas (Caudill, 1989).

Conventional expert system inference slows down with an increase in the knowledge base. This is counterintuitive: humans get faster as we possess more knowledge about a problem domain. This deficiency in expert systems is due to the sequential search nature of the inference engine. ... In a neural network, information can be retrieved by using any part of it as a key (Rumelhart, McClelland and the PDP Group, 1986).

The neural network paradigm makes itself easily adaptive. This ability is essential in a dynamic environment. Some neural network models have been shown to be equivalent to statistical classifiers (White, 1989). Compared with statistical approaches, neural networks have the advantage of robustness, by virtue of their distributed representation and adaptation. Also, neural networks make little or no assumptions concerning the underlying distribution of the training data. They may be applied to data sets generated by non-Gaussian processes where traditional statistical methods cease to be effective (Lippmann, 1987).

In a distributed processing system, the job is done by the joint effort of many processing units. If one or a few of those units fail, they do not significantly affect the performance of the other processing units, and the system as a whole still works. This property is known as fault tolerance, which is not shared by traditional computing paradigms. The human brain presents an excellent example of fault tolerance, where some neurons die out daily and the brain keeps functioning in every practical sense. On the contrary, a serial processing machine comes to a complete halt with a failure in virtually any part of it. Even with continued damage to the processing units, a distributed system exhibits "graceful degradation." That is, the system's performance deteriorates gradually, rather than with a catastrophic breakdown.

Historical Development

The study of neural networks has a long and colorful history. Pioneering work on neural nets dates back to the early 1940s when McCulloch and Pitts (1943/1988) proposed that the brain, as a computing device, consists of simple processing units (neurons). They built a simple yet elegant model of a neuron (later known as a McCulloch-Pitts neuron or simply an MP neuron) in which a well-defined process governs the unit's response to its inputs. The basic structure and operations of the MP neuron can still be found in some of today's neural network models.

The MP neurons provide a model of computation that enables the idea of connectionism. The activations of the neurons are determined by the combined effects of incoming excitatory and inhibitory stimuli. But nothing was known about how the connection strength between neurons could be changed to adapt to a new environment until Donald Hebb (1949/1988) made known in his Organization of Behavior the first neural network learning rule, which has come to be known as the Hebbian learning rule. The essence of the Hebbian learning rule is that the synapse (weight) between two neurons should be strengthened if both neurons fire (are in active states), and the synapse should be weakened if only one of them fires. The Hebbian learning rule was proposed without rigorous mathematical derivation, but it has been regarded as a foundation of many more sophisticated learning rules. Its generic nature and its ability to capture the learning behavior in biological systems (Caudill, 1989) have contributed to its continued utilization.

A milestone in neural network history was the introduction of the perceptron by Frank Rosenblatt (1962). A perceptron is a single MP neuron or a set of MP neurons that systematically adjusts its (their) weights and excitatory thresholds to learn a given input-output association. The perceptron learning rule is an adapted, systematized Hebbian rule. In Principles of Neurodynamics, Rosenblatt (1962) proved the perceptron convergence theorem. This theorem shows that a perceptron can learn in finite time any pattern association that is linearly separable. The perceptron convergence theorem was powerful enough to stimulate widespread interest in perceptron learning. There was much speculation about how intelligence could arise from such neuron-like devices. The limitation of perceptrons to binary outputs was removed by Widrow and Hoff (1960), who replaced the hard-limit activation function in perceptrons with a linear one, yielding the Adaline and Madaline models, which can discriminate among multiple classes. Adaline and Madaline were also proved to converge to any function they could represent (Wasserman, 1989).

The enthusiasm with perceptrons dwindled when researchers in the area found that perceptrons failed to live up to their expectations. The publication of the book Perceptrons by Minsky and Papert (1969) initiated a dark age for neural network research. The authors performed a rigorous mathematical analysis of the capability and limitations of the perceptron. They showed that the class of problems that can be effectively solved by perceptrons is limited to linearly separable problems. Indeed, perceptrons fail to solve such simple problems as the Exclusive-Or (XOR) problem. (A more detailed discussion of the XOR problem is presented in Chapter 3.)
With linear activation functions, a multilayered perceptron is equivalent to a single-layer perceptron, so multilayer perceptrons could do no better than solving linearly separable problems. For multilayer perceptrons with a nonlinear activation function, there still did not exist an effective training algorithm. This seemingly insurmountable difficulty in training multilayered perceptrons led to the following inconclusive conclusion of Minsky and Papert (1969, p. 231). They wrote:

The perceptron has many features that attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.

Despite Minsky and Papert's recognition of the importance of multilayered perceptrons, their pessimism, backed up with their reputation and the rigor of their work, effectively turned mainstream research away from neural networks.

Nevertheless, research in neural networks did not completely die out. With dedicated effort, a small group of researchers continued their work in this largely abandoned field. Some important progress was made during the "post-perceptron era." ... Hopfield showed for the first time that a fully connected recurrent network exhibits emergent collective computational capability (Hopfield, 1982); that is, the local interactions among the processing units can produce global behaviors. His model was later expanded to allow neurons to have continuous values (Hopfield, 1984) and applied to hard optimization problems (Hopfield and Tank, 1985).

The new era of neural network study witnessed a resurgence with the publication of the three-volume Parallel Distributed Processing by Rumelhart, McClelland and the PDP Research Group in 1986. By then, some theoretical background had been established, and there had been breakthroughs in neurobiological understanding and in computer capabilities (which made it feasible to develop and test more sophisticated models). The PDP books were well publicized and stimulated a new fever of neural net research that more than rivaled that which had occurred in the early 60's. Of particular importance is the backpropagation (BP) learning algorithm developed by Rumelhart, Hinton, and Williams (1986). BP provides a procedure that successfully solves the "credit assignment" problem in multilayered perceptron training, and hence provides a rebuttal to Minsky and Papert's conjecture that research in multilayered perceptrons would be futile. Indeed, Rumelhart, Hinton, and Williams (1986) showed that multilayered networks with BP learning were able to solve a wide variety of nonlinear classification problems, including the notorious XOR problem. Backpropagation has become the backbone of current neural network research.

Neural Network Applications

The continued and ever-increasing interest in neural net study has been both a consequence of and a driving force for successful applications. In many areas neural nets offer a (sometimes drastically) different method of approaching a problem, and open new avenues to attack traditionally intractable tasks or to solve more efficiently problems that are being solved with traditional methods. In the following we will
In many areas neural nets offer a different (drastically, sometimes) method of approaching a problem, and open new avenues to attack traditionally intractable tasks or to solve more efficiently problems that are being solved with traditional methods. the following we will 13 survey the applications of neural nets in artificial intelligence (AI), decision sciences, business, and engineering, while largely omitting the bulk of research in cognitive science, psychology, and neuroscience. 2.3.1 Neural Networks in AI Traditional AI, as a rival of neural networks, has been successful in the 70's. in particular expert systems, has found many fruitfu applications. Tasks that were regarded as requiring high intelligence, such as chess playing and theorem proving, can be accomplished by expert systems with remarkable performance. Traditional AI approaches are, however, inefficient in solving pattern recognition problems, such as vision and speech processing, due to their nature of symbolic representation and serial processing. Expert system development has knowledge acquisition bottleneck. been hindered by the notorious For one thing, experts are rare. Perhaps more im portantly, expert knowledge cannot simply be put down as a set of precise rules. The parallel distributed processing paradigm of neural nets seems a promising alternative to overcome the difficulties in AI. On the other hand, the success and advantages of traditional AI approaches are not deniable. One noticeable inroad that neural nets have made into traditional AI is the integration of the two seemingly different approaches. Several ways of integrating neural nets with AI systems are discussed in Caudill (1990). hybrid system where neural net level learning while an expert Lamberts (1988) built a s were used as a frontend processor that performs low system performs high level reasoning. The inference attained by the expert system from processing the output of the neural nets is used as a guide to modify the neural network weights. Becker and Peng (1987) proposed a method for integrating neural nets and sym bolic processing. Gallant (1988) worked on problem of extracting production rules from neural nets, using a limited set of values for the activation functions. The multilayer neural network. Maskara and Neetzed (1990) used neural nets as an effi cient frontend for a rulebased system where the neural network was trained to learn the associations of the expert system rules. Similar to a content addressable mem ory, upon receiving partial rule descriptions, the neural network outputs all applicable rules. Neural nets appear well suited for fuzzy learning. Shiue and Grondin (1987) developed a fuzzylearning neural automata. neural nets to generate fuzzy rules. Fuzzy I Hayashi and Nakai production rules and (1989, 1990) used their membership function can be implemented in structured neural nets (Yamaguchi et al., 1990). In the mapping of rulebased systems to neural nets, a concept (feature, word, symbol, variable, fact, predicate, etc.) may be represented as a unit, and logic rela tions between concepts may be represented by the connections between units. The strength (weights) of the connections then correspond to the degree of certainty of the logic relations (Tan et al., 1990; Yang andI Bhargava, 1990). Thus learning in neural nets can be regarded as modifying tile certainty of the rules. 
Kuncicky (1990) proposed an isomorphism that maps from not only rulebased systems to neural nets, but also from neural nets to a rulebased The number and structure of the rules may change in such a hybrid system as a result of neural network learning. Kerce and Mueller (1990) used a heuristic link neural network that is applied state space search. A feedforward neural network is employed that takes the state description as inputs, and its output is used as a guiding heuristic for the state space search. Successful applications of neural nets in AI areas (such as control, vision, robot, speech, and game playing) are numerous (Wang and Yeh, 1990). One of the most influential applications is the NETtalk by Sejnowski and Rosenberg (1986). NETtalk is a simple twolayer feedforward neural network. Given a series of examples of English text and the correct pronunciation NETtalk was able to learn to read English , exposition to examples that embed the target concept (to be learned). In contrast a conventional computer requires algorithmic approaches, or intensionall program ming," where strict instructions or rules are followed with no reference to specific examples. Extensional programming cuts down the needs in knowledge acquisition, and hence represents a powerful technique (Knight, 2.3.2 1990). Neural Networks in Decision Sciences Neural nets provide a powerful computational framework that extends its appli cation scope far beyond traditional AI problems. As mentioned above, neural nets can be integrated with expert systems, and hence provide a new way of implement ing decision support systems. Under certain condi tions, neural nets are equivalent to Bayesian classifiers. sciences. This opens wide possibilities for using neural nets in decision The inherent properties of neural nets enable them to do more than just statistical decision analysis. Weigend (1990) reported neural net classifiers that have been shown to outperform statistical methods. Burke (1991) and Burke and Ignizio (1992) described several neural network systems and their applications in decision making. They also discussed conditions under which neural nets would be prefer able to conventional procedures and gave some guidelines for using neural nets in operations research. Hornik, Stincheombe and White (1989 others (HechtNielsen, 1989; benko, 1989) have shown that multilayer feedforward neural nets are universal ap proximators. Simple feedforward neural nets with as as one hidden layer can approximate any continuous inputoutput mapping to arbitrarily specified accuracy (the number of hidden units may have to go infinity, though). This result solved theoretically the representation issue and made neural nets a legitimate tool for function approximation with numerous appli cations in system identification, sign, control, modeling and prediction (Werbos, 1989). Ik t S ** and Tank, 1985). Besides Hopfield networks, other neural nets used in combinatorial optimization include Boltzmann machines (Hinton and Sejnowski, 1986), Cauchy ma chines, (Jeong and Park, 1989) and selforganizing networks (Durbin and Willshaw, 1987 Hueter, 1988). Ramanujam and Sadayappan (1988) showed how to map to neural networks a number of combinatorial optimization problems, including the traveling salesman problem (TSP), the graph partition problem, the vertex covering problem, and the maximum clique problem. Compared with conventional approaches, they reported that neural network results showed promise. 
Xu and Tsai (1991) did extensive exper iments on the TSP. One of their neuralnetbased algorithms matches or outperforms the best known heuristics, the Lin and Kernighan algorithm (Lin and Kernighan, 1973). Also the neuralnetbased algorithm was shown to scaleup better than the Lin and Kernighan algorithm. Foo and Takefuji (1988a,b) applied a stochastic neural network for jobshop scheduling. A deterministi c approach was also used by Foo and Takefuji (1988c) to solve the same problem with neural network implemented integer linear programming. A relatively new advance of neural nets been made in the area of mathe matical programming. Maa and Shanblatt (1989, 1990 applied neural nets to linear programming problems. Kennedy and Chua (1988) used neural nets for nonlinear programming. Barbosa and de Carralho (1990) applied neural nets in feasible direc tion linear programming. An adaptive feedforward neural net was used in multiple criteria decision making (Zhen and Malakooti, 1990). Other applications include the shortest path (Helton, 1990), routing (Zhang and Thomopoulos, 1989), the knapsack problem (Li, Fang and Wilson, 1989), and the task assignment (Tanaka et al., 1989). Neural nets are rivalling traditional statistical analysis in classification (Pratt and Kamm , 1991), principal components analysis Feeser, 1991), and forecasting (Sharda and (Baldi, Patil. 1990). 1989), regression Choukri et al. (Orris and (1991) re 17 with past data, generated accurate predictions and consistently outperformed tradi tional statistical methods such as the TAR (threshold autoregressive) model (Tong et al., 1980). Compared with an established time series forecasting techniquethe BoxJenkins methodneural nets have the advantages of automatic learning, better performance for nonstationary series and longterm forecasting (Tang, de Almeida and Fishwick, 1990). With the abilities of model identification, generalization, and prediction, neural nets have found many applications in ral nets have been successfully applied to business and engineering. loan evaluation (Judge, In business, neu 1989), signature recognition (Rochester, 1990), stock market forecasting (Dutta and Shekhar, 1988) and other classification analysis (Fisher and McKusick, 1989; Singleton and Surkan, 1990). In engineering, neural nets have been et al. applied to hardware fault diagnosis 1990), power system state evaluation (Nishimura and Arai, 1990), (Tan wastewater treatment system (Krovvidy and Wee , 1990), intelligent FM facturing system) scheduling (Rabelo, Alptekin and Kiran, 1990). neural nets as an engineering design are emerging in a variety of engineering areas. is still being explored. Wu et al. (1990) S (flexible manu The potential of New applications used neuralnet based teams to model the behavior of materials and obtained promising results. Neubauer (1991) applied neural networks to metal processing. Neural nets have also been used in structural mechanics computation, transportation and other engineering applications (Sun and Fu, 1991; Dagli and Lammers, 1989). Promise and Problems Unlike the hype surrounding neural nets 30 years ago, has aimed at solving realworld problems. today's neural net research Nearly all the big companies in the com puter industryAT&T IBMRl, Texas Instruments and othersare involved in 1 1 ~~~~~~r i. l I I 18 Diego, has retained its momentum with the participation of researchers from more and more diversified areas and pumppriming funding from NSF NASA, DARPA, and other major sponsors. 
Judging from their success in the past few years and the still widening and deepening scope, we may conclude that neural nets indeed hold great promise. The current optimism in neural nets ' future is no less fantastic than that in the early 60's. Neural net s, along with nuclear technology and superconductivity, has been dubbed one of the greatest inventions in our modern society. Leon Cooper, a Nobel laureate, commented (in IJCNN next century is what the computer is 1990) that what neural net for to s would be for the HechtNielsen (1986) went further saying: . It is clear that if [neural network technology] realizes its stated its impact on human society will be profound. It may thus co pass that we are now living at the boundary between two great e of human existence; namely, the transition from Civilization to I [a term coined by HechtNielson to describe the imaginary future society]. It has been 10,000 goals, me to ?pochs ability noble years since tile last such transition (from Culture to Civilization). If all of this is true. we are most fortunate to be alive to witness and participate in this change. While a repeat of neural network history in the late 60's seems unlikely, we need to be very cautious about overly optimistic expectations. None of those startling claims such "brainlike machines " in the nontechnical literature has really been realized. It is true that great progress has been made. However, the field is far from mature. Current research in neural nets faces many challenges in both theoretical study and practical implementation. yet to be established. The In the theoretical aspect, a solid general foundation has ere exist more than a dozen different neural network archi lectures that are being used in different problem domains. Each model has its own theory and implementation peculiarities. Little has been done to establish a com mon ground for those models, although Grossberg at Boston University is reportedly attemntine a theoretical framework that would explain all neural behaviors (Miller. than they would be otherwise. Recent progress has shed some light into the "black boxes" (Fu, 1991), but the overall picture is still obscure. The leading neural network modelthe multilayered feedforward neural network with backpropagationsuffers the same obscurity. BP has been widely used in many applications, often with encouraging results. from soundly established. The theory behind BP is, however, far BP is a simple and elegant procedure that overcomes the difficulty of "credit assignment." But this procedure has some fundamental limita tions as isted below:4 1. Learning (training) is generally slow. No convergence results have been established for pattern trainingthe most commonly used training procedure. Convergence of epoch training to a local minimum is achieved, but a strictly local minimum may not represent a desired solution. 4. The parameters, namely, the learning rate 7 and the momentum a, need to be set empirically. 5. The structure of the network (number of layers and units) is determined arbi trarilv. The model offers the flexibility of choosing training schemes (epoch or pattern) and different global criterion function and neuron activation functions, but no general guidelines exist. Extensive work has been done to explore BP's potential and overcome its limita tions in the last few years. problem mentioned above. A great research effort is devoted to overcome the first A number of local acceleration heuristics are discussed in 1 IA flfl~*~ 11 .a Ir al., 1990). 
Those improvements on backpropagation often increase the learning speed significantly in terms of training epochs at the cost of an increased computational effort. Few researchers have considered the second third problems of BP. It has been reported that BP with pattern training works better than epoch training for a large training sample. thoroughly carried out. arbitrarily. But no theoretical account for this phenomenon has been Most people choose to use epoch training or pattern training This leads to potentially erroneous conclusions about the efficacy of the algorithm. For the global convergence problem, empirical results have shown that with am ple hidden units embedded in the network, BP can usually escape a local minimum (Rumelhart et al., 1986) probably due to large degrees of freedom. However, increas ing hidden units in the network may not be an appealing idea, since an unnecessarily large number of hidden units is likely to decrease the generalization capability of the network (Kruschke and Movellan, 1989; Ba.um and Haussler, 1989) and may cause overfitting problems (Weigend et al., 1990). Fang and Li (1991) have adapted simu lated annealing methods to neural network training. Their approach guarantees the solution will be globally optimal, if a proper annealing schedule is derived for the given problem. Montana and Davis 1989)) Belew et al. (1990) used genetic algorithms to train the feedforward neural nets. is that they involve a random search (sometimes The drawback of these approaches )litld y) and, hence, are not efficient n general. In the interest of efficiency and generalization, the complexity of a neural network should be kept to its bare minimum. Some researchers (Teh and Yu, 1988; Sietsma and Dow, 1988) developed heuristic rules for pruning away inessential hidden units during training, starting with an oversized network. Others ( Tenorio and Lee, 1989) used dynamic procedures generate new units as needed. those ap r large network. This method has been used in Chauvin (1990) and others. One of the drawbacks of this approach i The deficiency of neural s that training time increases noticeably. nets, in particular of backpropagation, indicates that much theoretical work needs to be done before we can fully explore the potential of this emerging computation framework. We are not sure whether or when a profound common theoretical basis for all neural network paradigms will emerge. But what we can do now is to conduct a rigorous, systematic study of the major neural net models, study the efficacy and efficiency of them, identify the conditions under which they may be effectively applied, explore the theoretical capabilities and limitations, and build new and improved procedures based on the theoretical guidelines. By doing so we can hope to better understand this new field and its future and proceed gradually to realize its potential to the fullest extent. CHAPTER 3 FEEDFORWARD NEURAL NETWORKS Feedforward neural nets (FNN) are the most popular neural network paradigms in the computation modeling branch of neural net research. The principal learning algorithm for training FNN is the backpropagation (BP) algorithm. The popularity of BP ari ses from its simplicity and successful applications to many realworld problems. This chapter will discuss the development of the backpropagation learning algorithm. The efficacy and limitations of the BP algorithm will be analyzed while improvement of the classic algorithm will be presented in the next chapter. 
We will give basic definitions and present theorems about the representation capability of general FNN. We start with the building block of a neural networkthe neuronsand then the workable neural networkthe perception. Feedforward neural nets are built upon perceptrons.) The Processine Units (Neurons) There have been many nonstandard terminology es used in the neural net literature. We will stick to the most general ones throughout our discussion. In some cases we use two terms interchangably, e.g., processing unit and neuron; we will include both terms in the definition Definition (Processing Unit) A processing unit (neuron) is the basic element an artificial neural network. neuron conszs ts of multiple input connections from other neurons; a transfer function. maps the function that maps the scaler to a real or binary ?flp uts activation to a scaler; (state) an activation and an output thf ht drnnirat th if nrtfltfinfl 111' 11 a rr'uaa...a 1. n~ ri' r4S l t C nm nl n n r mn nl r~ n" nnrk 4 1 / V W2 Figure 3.1. Structure of a single neuron The first such processing unit w is still widely used today. as the McCullochPitts neuron. This basic model It has a multiple input port and a single output port. Before the inputs are fed into the neuron, they are multiplied by corresponding weights on their pathways. The output is produced by taking the and thresholding it via a hea one of two discrete values, a and b, viside (threshold) function. where a, b weighted sum of the inputs A heaviside function returns 5. Depending on whether the input is greater than or less than the threshold 0, b or a is returned. It is common to set a = Definition 0, and b = 1. (Net Input ) A sketch of the mod The net is shown in Figure 3.1. input results from mapping multiple inputs to a real or integer value. Frequently this takes the bform of a weighted sum of the inputs. Definition 3 (Activation Function) The activation fun action is a function that maps the net input to a real or binary activation value (state) of the processing unit. Besides the heaviside function, other commonly used activation functions include the semilinear function and the sigmoid function. decreasing function, linear in a certain range and The semilinear function is a non constant outside that range. The Heaviside Semilinear Sigmoid Figure Typical activation functions The Perceptron Learning following we give definitions concerning perception learning then present the learning algorithm and its finit e co nvergence theorem. Definition an artificial olds) of its (Learninq R neural network ad A learn ts the C CU. (I rese environment. Definition single or set of p process zng un crcep its with h is a simple neural network eaC(ZS nationn fun actions cons and the isting of a perception learning algorithm. Definition 3.6 (Traininuo rT nl a sample tda n from a give n popu sample is used as the cuvitro0l nme ut of tht neural work providing inputs and ta values (if applicab Definition stance) Any particular mle a1t x of the training set T an in stance. x may have binary or realvalued att Iributes. Definit (Samn Train Sample och) training trefe rs a neura net train is the procedure (Perceptron) by which ion (weights and th resh netzoorl~. each instance of the training sample. If the instance is chosen sequentially from the sample, is called sequential instance training (sequential training). If the instance is chosen randomly from the sample, domized training). Note that an instance, ax, is an example of some concept (hypothesi s) to be learned. 
In the neural net training process, both the instances and the concepts associated with the instances are provided to the network.2 XE PR. be an instance, T+ denote the set of positive instances (a positive instance is an example of the target concept or class) and T denote the set of negative instances (a negative instance is a counterexample of the target concept or class). The perception learning algorithm can be stated as Let w E RT be a weight vector. follows: The Perceptron Learning Algorithm(PLA): TART: Set w Let X TEST: E R" randomly. WX < or (x T and wx >0)} = 0, stop Otherwise: pick any x c X, if x e T+, go to ADD, if x e T, go to SUBTRACT. ADD: 4 10 go to SUBTRACT go to TEST TE X, TEST I~I ' I 1 A ii 1 9 I is called randomized instance training (ran (sl(z ... rI1 I I' Definition 3.10 (Convex Set set S is convex if for each x, y ES and any E [0, =Ax +(1 A)y Definition (Convex Hull) be either finite or infinite, the convex hull , denoted by h( smallest convex set that contains Definition such that 3.12 (Bounded set S E R: bounded if there exists R. M Bo(M E RT Definition 3.13 are linearly (Linearly Separable ) separable4 if there exists a nonzero either finite vector p or infinite, Si r and a scalar such that Theorem rceptron onierence.) Suppose and T are bounded in R' and are linearly separable, then the perception learning algorithm will find a hyper plane separates and T I finite tzic~i. Proof: Let H T~u then the PLA produces the sequence of vectors: " U = 0,1..../* q* where wo arbitrary, and is picked such that w By assumption, there exists an.'1 and a such that iw .x > a, for all xEH 1 1 1 1 fl> C,  .~ I... I ' ~...    5 U" T ~M) r^ At step n, we have by the Cauchy Inequality, where *W (WiI wiI +na and, since wn1 xn1  x S0 and 1x112 13, IlwnII2 n1 w wn1 + 2w ni1 + Ijxfll2 n1 112 wI101 2 + n. Thus we have I)u)o11))2 + nflj j w or the quadratic inequality 2a2 an r + (2aw K j~iI)  (i0 (3.1) Since k = (2au,  Iw*lIl given any a and p a solution to ( 3.1 exists and is finite. Thus, after at most r .... .. .0 JI1 Il111 I'V 1 I '' '' *" * w II~U"IIIIU~*II $ Zn1I12 + n.cy  PIIZ1'*II2)a zu*>2 ?)2 + q~Z(I/11) W*)2) ...r I ii: .\w2 w3x =O Figure 3.3. Geometrical explanation of the perception learning Note that the proof does not assume finiteness of The PLA procedure can be applied to infinite sets, as long as provisions are made to carry out the stopping criterion test. To understand the perception convergence procedure geometrically, the following concept s are useful: Definition (Convex Conc) (1 conveX Set. is a convex cone if ES for any A 0 and any x EmL; Definition 3.15 (Dual Cone) , the dual cone of x >O for' every aCES}. Geometrically the perception learning procedure finds an cone of H T~u Startin g with any random vector iw interior point in the dual , the ADD procedure (by the definition of H ADD procedure now includes the UBTRACT procedure) ...I . denoted by T R'' I 29 opens a rich body of related research, using approaches known as relaxation methods (see, for example, Agmon, 1954). Various modifications have been suggested to the basic perception learning algo rithm. In step 3 (ADD weights) w(+')  w(") + x(n) can be replaced by w(n+') u() + wk()) where > 0 is a constant. 1/IlI "2 would make weight change unit vector the direction of x. Agmon (1954) suggested (in a different context) = c(w x*)/ 12 where cE (0,2). The number of iterations of the algorithm changes with these variations, but the finite convergence property is retained. 
The conver gence proof of perception variations, Adaline (Widrow and Hoff, 1960) and Madaline (Widrow and Stearns, 1985) can be found in Poliac (1989). The basic perception learning rule can be easily generalized to handle multiple class problems. Let H, H2, ..., H1 be the sets of instances for each class. The classi fiction problem requires finding a w* .i/i Xi such that for each > w* .11 for all S..,IL 5 where S > 0 is a scalar . The learning procedure is presented the following. Proof of the since it is a direct extension of Theorem 3.1. Multiclass Perceptron Algorithm convergence of this procedure is omitted START TEST: Set wi Let Xi E R'  {xil, . , H2 and for to any random values. some i such that w; * :t: c .. AK, stop. = 0 for all Otherwise pick any go to UPDATE. If Xi '(fyi * .z; E H; ,2 n:; + (5 (1,0) .. O * III III III II (0,0), / (o~o) OOS (1 ) (0,1) Figure 3.4. The XOR problem and its geometrical representation. One of the intriguing properties of the perceptron learning algorithm is that it uses only locally available informationmodifying weights after the presentation of each input pattern. Yet the procedure constructs a globally optimal solution (for linearly separable patterns). Local procedures are suitable for parallel implementation and hence have the potential for fast, realtime applications. Minsky and Papert (1969) pointed out perception it would procedure with be interesting to compare the relative efficiency of the global analytic methods, such as linear programming, for solving the system of inequalities No systematic study has been done in the comparison of perception learning with global anal vtic approaches. Many researchers have, however Jacobs, 1988). , realized tile importance of ocalitv" in learning for example, This issue is further explored in later chapters. The Limitation of Perceptrons Minsky and Papert (1969) showed that perceptrons failed to solve a number of simple pattern classification problems, in particular, the Exclusive Or (XOR) prob lem. The XOR problem has been used extensively as a benchmark for neural network algorithm evaluation due to this historical reason. The problem has four patterns. Each pattern has two binary inputs and one binary output. The output is true (with Figure 3.5. An example of layered perceptrons that solve the XOR problem The failure of the perception is due to its insufficient knowledge representation, not its learning procedure. Perceptrons construct only linearly separable decision regions, but there is no linearly separable region that can solve the XOR problem as can be seen in Figure 3.4. To solve the XOR problem, a more complex convex decision region is needed. multilayered perceptrons could form such a decision region. For example, let one perception separate pattern (0,0) from the others, and another perception separate pattern (1,1) from the others. A third perception, taking the output of the first two as input, could produce a convex decision region that successfully classify pattern (0,1) and (1,0) into one group. The idea is depicted in Figure 3.5 (following Beals and Jackson, 1990). Thus multilayered perceptrons are powerful enough to form polyhedral convex IT'b iok \onlTf iro + b rarcn + In nrnl'Jnrn rvf c' nr nt 1 rnInr^ nnrP\rronfrnno '.~~ rlar;a;nn ra n; nn ~ the heaviside threshold function. The perception learning procedure can correctly adjust only the weights between inputs and outputs, perceptrons. 
but not the weights between This difficult is overcome by introducing continuous activation functions (Rumelhart et al., 1986). This is shown in the next section. Feedforwa.rd Neural Nets and the BP Algorithm Definition 16 (FNN) A feedforward neural network (FNN) a neural network con sisting of neurons that are arranged in layers, namely, an input layer, hidden layer (s), and an output layer. Connections are unidirectional from lower layers to higher layers with no feedback paths. By definition, multilayer perceptrons are a subset of feedforward neural nets heaviside activation functions. with But, conventionally, when we say feedforward neural nets we mean feedforward neural nets with continuous activation functions guished from perceptrons. as distin Mult.ilayer perceptions are able to represent linearly non separable problems, but there is no efficient learning procedure. Using FNN enables us to solve the neural net "credit assignment" problem. Given the output gener ated from an input, which weights and how should they be changed to approximate the desired output The classic algorithm to train an FNN is called backpropaga tion which is a learning algorithm that modifies tile network weights based on their contributions to a global performance e criterion function. A gradient descent search procedure is employed. Let (x, y) denote a training example (pattern), where x is an input vector and y is the target output vector. Also, let o denote th e network output and w denote the weights of the network. We use NI X N;H xNo to represent the structure of a feedforward neural net where N;, NH and are the number of input units, hidden units and output units, respectively. Figure 6 shows a2x x 2 fully connected feedforward neural net.wnrk Pnfr OIrnnvpnIIC non v two nrnrcesinTr units are llnPr in Output Hidden Input Figure 3.6. x2x2 feedforward neural network sigmoid function 1+e  (3.3) where 7y is a constant controlling the slope of the function. The net input to a processing unit j is given by netj = ev~x~4 0, (3.4) where x are the outputs from the previous layer, w0, is the weight (connection strength) of the link connecting unit to unit j, and Oj the bias, which determines the location of the sigmoid function on the x axis. For notational convenience, we let xo  1 and Woj = O, then we have0 zC tv~~:v, (3.5) 1 f() A feedforward neural net works by training it with known examples. example (xp, yp) is drawn from the training set { (xp, y,) p A random = 1, 2, .., P}, and xp is fed into the network through the input layer. The network computes an output vector o, based on the hidden layer output. A performance criterion function is y,. A commonly used criterion functi op is compared against the training target y,. defined based on the difference between o, and ion is the sum of squared error (SSE) function = z p F= E(ypk Opk)2 (3.7) where p is the index for the pattern (example) and k the index for output units. The error computed from the output layer is backpropagated through the network, weights (wij) are modified according to their contribution to the performance criterion function. (3.8) drvij where 77 is called learning rate, which determines the step size of the weight updating. 
3.5 Backpropagation Derivation For easy of exposition, let us consider the error resulting from a single training instance: F (yJk o,,k)2 (3.9) For connections leading to the output layer (refer to Figure 3.6), the partial derivative of Fp with respect to weight wk can be written 0EJ, tJo,, dfllik One 4~j (3.10) using the chain rule. Here OF, dlOpk  (yOp ) (3.11) awj aF, aruj~ Denote dnetk  pk)f (netk). (3.14) Then we have t1F, OWjk bko (3.15) Aw1Uk 0$, O'Wujk 3.16) This weight updating leading to the output layer applies to output layer weights (i.e., Similarly for hidden layer weights we have, Oneitk the weights by the chain Onef. (3.17) Since Fnetk 9netk Ofleik do,  il , (3.19) define Odne SkWJC 'ntri (3.20) Then  Sioi (3.21' do' t37netj ~ 6k oj dzo;j dnetk If the sigmoid activation function is used, we have f'(netj) (1 + eme)2 7f(nety)(1  0j ).  f(netj)) (3.23) Thus the derivative is easily obtained from the output of the processing units. Other performance criterion functions may be defined and other activation functions may be used. These variations will be covered in the next chapter. The backpropagation algorithm is formally stated below: Algorithm BP 1. INITIALIZE: * Construct the feedforward neural network. Choose the number of input s and the number of output units equal to the length of input vector x and the length of target vector y, * Randomize the weights and bias respectively. n the range ( * Specify a stopping criterion such < Fstop or ni ~max Set iteration number n = 0. FEEDFORWARD: * Compute the output for the noninput units. The network output for a given example p is Opk  f(E :jk 7f( Vit S S waxi )))). * Compute the error using Equation 3.7. yrretJ yoj 1 * For each output unit k, compute =k = (ok yk)f (netk). * For each hidden unit j compute 6 = .7 f (netj) 6kWjk. UPDATE: Atw w(n + 1) = 1sojo + tAw;y(n) where ij > 0 is the learning rate (step size) and ar [0, 1) is a constant called the momentum. REPEAT: Go to Step 3.6 The Representation Capability of FNN A feedforward neural net can be regarded as a general nonlinear model. In effect, it is a complex function consisting a convoluted set of transfer functions and activation functions cC, the parameter set where called is a set of continuously differentiable functions, weights includingn thresholds). The output feedforward neural net can be written as: o=f( wjkf ( w. *, **. ( W~rix;)))) (3.24) The next result shows that a twolayer FNN can approximate a large class of func tions. Theorem 2 For any absolutely integrab function g , there exists a two 17... n ATltT 7 tR '11 .. I__i i. I .' 1 II i' r r rl This theorem is a direct result of Poliac's (1989) Theorem 4.8.1. The requirement of f to be absolutely integrable is relaxed by Hornik, Stinchombe and Write (1989), Cybenko (1989) and others to include the use of sigmoid activation functions. Hornik (1991) further proved that an FNN with as few as a single hidden layer and arbitrary bounded and nonconstant activation functions are universal approximators to any continuous function based on an Lp norm performance criterion. The above results assume that the number of processing units in the hidden layer is unlimited. 
theorem by Kolmogorov (1957) can be applied to FNN to yield a three layer neural network that, any continuous function.8 with finite hidden layer units, can exactly represent Theorem I = [0,1] LIP1 rn~~qo ran) There exist such that each continuous fixed increasing continuous functions hij on I" = [0, 1]" can be written in the form g(x1, x1) = 2n+1 S f1 hj(.C)) fj are properly chosen continuous functions of one variable. The theorem suggests that any continuous function s of many variables can be repre sented as the linear superposition of some continuous s univa.riate functions. In terms of neural nets, this can be interpreted as follows. For any continuous function of n variables, there exists a feedforward neural network with two hidden layers, (each pro cessing unit in the hidden layers has a continuous activation function), that exactly represent g. A twoinput network structure corresponding to Kolmogorov's theorem is shown in Figure 3.7. Several variations of Kolmogorov's theorem exi (Lorentz, 1976). In particular, each function f, can be chosen identically a.nd fun action hij can be replaced by lhA, where 14 is constant and hj(x) is continuous and nond decreasing (cf. Poggio and Griosi, function g where xlx2Q Figure 3.7 An example of the Kolmogorov neural network Correspondingly, we have the following theorem. Theorem 3.4 Given any continuous function 4R ere exists a threelayer feedforward neural network that exa ctly represe with n(n + 1) processing units in the first hidden layer and 2n + 1 processing units n the second hidden layer. Kolmogorov's theorem shows that FNN powerful representation capability. However, this theorem is nonconstructive. That we know that there exist such functions we have no as how to construct them. Hence the application of Kolmogorov's theorem in neural nets has been limited to theory. As an illustration of FNN's capability, we can construct simple neural nets with nnp nr twn hidden nnite that 'nlvp tlp Yn R nrlnl~m l iihnn tb vnnA nrd ha rlrnrn xl x2 xl x2 Figure 3.8. Two simple neural nets that solve the XOR problem the point (0,0) (1,1) are grouped together to from one class (with low values) while the other two points make the other class. CHAPTER 4 VARIATIONS OF BACKPROPAGATION LEARNING The backpropagation algorithm, due to its simplicity and general applicability, has quickly become the dominate training algorithm for feedforward neural networks. Although successful applications of the BP algorithm are numerous, neural network researchers soon found that the algorithm has some fundamental limitations. First of all, BP training may fail to converge. Secondly, BP may reach only a local minimum solution when it does converge, as in any gradient descent based algorithm. local minimum may or may not represent an acceptable solution. Furthermore, BP training is generally very slow as compared to nonneural net approaches. This has prevented the use of feedforward neural nets from real time applications. An enormous amount of work has been done to improve BP learning in the last few years. the following we present new developments in this area concerning convergence, generalization and learning rate, optimal solutions to Chapter while leaving the discussion on global 7. We consider BP variations in criterion function, activation functions network structure , second order training algorithms and some heuristics. Performance Criterion Function We have used total sum of squared our discussion in Chapter 3. criterion. 
(TSS) error as the performance criterion in TSS is the standard and most widely used performance Besides its conceptual and implementational simplicity, it has the advan tage that under the assumption that training samples are independently chosen from a Gaussian distribution, the least squared error (minimizing TSS) estimation is sta more appropriate than TSS criterion. Burrascano and Lucci (1990) compared the least square error (L2 norm) and the minmax (Loo norm) performance criteria. The former is better if the data follow a Gaussian distribution, while the later should be used if the data distribution is nearly uniform. The minmax criterion function is nondifferentiable. To carry out gradient descent search, a pseudo derivative is defined as OF, 490$ Opk Ipk" Opk and ypk and ypk (4.1) where k* = argmax IIypk Opk Correspondingly, we have 9F, Onetk S0 +04k(1 SOpk(l1  pk )  0opk) and Ypk (4.2) = k* and ypk This is used in the updating rule for the output layer. The bj's the hidden layer(s) are not changed. With the above modification, the standard backpropagation algorithm (Section 3.5) can be employed. Burrascano and Lucci (1990) reported that better performance was achieved with the minmax criterion for the parity problem.1 For classification problems, Hampshire and Waibel 1990 proposed the "classifi cation figureofmerit" (CFM) criterion function, which is defined as CFM= k (4.3) 1 + e'Y(ton)+ Where ot is the output from "true" (correct classification) unit and is the output from nontrue unit. We observe that CFM is comprised of the sum of sigmoid I~* I;' I;' requires the output representing the correct classification to have a higher activation value than any other output units. discourages the network from learning specific examples, and encourages learning a general representation of the training data. * It alleviates the problem of the. TSS criterion where outliers tend to mislead the learning process. Hampshire and Waibel reported slightly better results were obtained using CFM criterion than the sum squared errors. Assisted an adhoc postprocessing procedure, the results from CFM criterion became significantly better than those obtained with the TSS criterion. Standard Figure 3.2). BP uses a sigmoid function as the nonlinear activation function (see The sigmoid function has an automatic gain control property. when the activation value is close to saturation (1 or 1)2 That is, , the output change corre spending to a input change is small; when the activation value is far from saturation, the output change corresponding to an input change is large. This property is im portant to the stability of a dynamic network. However, the sigmoid nonlinearity hinders the learning process with its nearzero derivative over a large range of input values. This is easily seen from the BP learning rule for the output aver): Awgk, 9(,,pk Ok)' ( tk)oj. (4.4) When (yk learned.  ok) we do not need to c When oj * 0, there is ange tile weights as the target values are no need to adjust tile corresponding weight wjk, since wjk has no effect on the net input. But th e case + 0 does not tell us much. Since f'(net;) '~(4f * 0 whether o0. approaches the target value (0 or 1) or or dcuji~ a large error. This fact increases the probability that the neural nets get stuck in a local minimum. Burrascano and Lucci (1990) proposed a delta rule of the form 4= (4.5) 1 + eynetk which, contrary to the standard delta rule larger values when the activation approaches 1. 
Their experiments showed that with the new delta rule, the modified BP algorithm performed slightly better than the classic BP algorithm. What is more important is that the modified version had a much smaller failure fraction than the normal BP algorithm. The authors claimed that the proposed modification virtually eliminates nonconvergence problems if a moderate e learning rate is applied. Another alternative to the sumofsquare error criterion is the crossentropy per formance function defined as: The derivative of F = Z (ypklol(opk)+ (1 P k with respect to opk is  ypk)log( Opk)). (4.6)  pk  Y7k Opk (4.7) Note 4 co as Opk  1 i dyF  00 as Opk *0 This brings a counteracting effect to the problem mentioned above, i.e., learning is hindered when the output approaches saturation. Indeed, experiments by Fahlman (1988) showed using the cross entropy criterion, learning speed of a neural network on the encoding problem increased by 50% as compared to using the standard sumof squarederror criterion. Momentum A simple variation of the classic backpropagation algorithm is to add a "momen +lm))~ 11 Cl2' + rnn\ ti + l ,'~ + r V n, u A 4 frtn, OF 46 the weight changes when successive gradients have the same signs and to slow down weight changes when successive gradients have different signs. Thus, it helps to speed up the search in the weight space where the downhill gradient is small, and to damp oscillations that are likely to occur in the ravine areas if only a fixed learning rate is used. Reports (e.g., Chauvin, 1990) have shown the momentum term can speed the learning process significantly. Since the use of the momentum was proposed by Rumelhart et al. (1986), the authors popularized the backpropagation algorithm, and it is used almost always in backpropagation learning, we will refer the backpropagation algorithm with the use of momentum as the standard or classic BP algorithm in our later discussion. Adding the momentum term is analogous to signal smoothing. This observation led Adams (1991) to propose using both past and the future information in momen turn, analogous to a symmetric smoothing. The idea is simple: In the standard BP algorithm, when the hidden layer weights are updated, we have already the informa tion to compute the weight change in the next iteration, since &,(t +1) = oy(1 oj) S1)wjk(t + l) (4.9) ,ij(t + 1) = 6&j(t + 1)o; (4.10) where the future Sj(t + is obtained through the newly computed output layer weights. Hence the hidden layer weight updating can be modified as Awij(t) = i16j(t)o; + azawj(t  1) + 2Awj(t + 1) (4.11) where al and a2 are the coefficients corresponding to the past and future momentum. The improvement of learning speed obtained by the author was moderate with this modification. 47 some iterative process in which an approximation of the criterion function is mini mized. Commonly used approximations are given by the first order or second order Taylorseries expansion, i.e., F(w + Aw) = F(w) + AwVF +. (4.12) F(w + Aw) = F(w) + AwVF + AwTV2F(w)Aw + . (4.13) where denotes the gradient of and V denotes the Hessian of F . Classic backpropagation is an example of using a first order approximation.3 First order and second order approximations are also referred to as linear and quadratic approxima tions, respectively. First order second order approximations use only local approximations use also curvature gradient and information. function values, Hence second while order methods usually have faster convergence. 
Among the most successful applications of second order methods in neural networks are the conjugate gradient (CG) algorithms and Newton's methods. Let us consider a general iterative process. Suppose we want to minimize a crite rion function F(w). We determine a search direction df and a stepsize At. The iterate wt+1 I11 + At dt (4.14) where dt and A\ are determined sucdl that F(wt+l < F(wt) or F(wt+l) is minimized. Most optimization algorithms fall into this framework. They differ by the way dt and At are computed. If dt is set to be the negative gradient VF(w), and At to be a constant r7, then we have the simple gradient descent algorithm discussed in Chapter 3. 3We say the approximation is first order if the first order Taylorseries expansion is used. Simi 4.3.1 Conjugate Gradient Methods Let Fa(z ) denote the second order approximation to F(w) in the neighborhood of w Fa(z F(w)z ITV F(w)z. (4.15) The necessary condition for Fa to be minimized is VFo(z F(w)z (4.16 At the current solution wt, Equation 4.16 represents a system of linear equations with variable z (an x 1 vector The solution to this system of equations can be greatly simplified if a set of vectors, called a conjugate system, can be found. Definition (Coniuaate System) Let di,d2,.. a set of nonZero vectors in , and A be a p x p nonsingular matrix. Then dl ..., dk is a conjugate system with respect to A if dl ...,dk are linearly independent and dTAdj uppose we have a conjugate (IId, E Rs with respect to F(w z* be a solution to Equation 4.16 and z ER be an arbitrary initial point. Since ,di, d, ..., ds forms a basis of Rs then any vector iii]? can be expressed as a linear combination of the conjugate vectors. _ z z (4.17) where A E R. Multiplying both sides with (FV2 F(w) gi F(w)(z  z") z F(wi)d;. *o, k. C7 F(w) $ F(w) + Solving for A, gives d(VF(w) V2F(w)zo) dffV2F(w)dj EVF (zo) (4.20) If we find the conjugate system in S steps, then we can determine in S steps using the above equations. The conjugate vectors di,i = ,2,..., S can be determined recursively. dl can be set equal to the negative gradient VFa(z), and dt can be determined as a linear combination of the current negative gradient (Moller, 1990). found in Johansson et al. Fa(zt) and the previous direction Detailed treatment of the conjugate gradient algorithm can be (1990). Note that the iterative process converges in S steps if F(w) is a quadratic function. F,(z) then becomes an exact representation F(w). practice the conjugate gradient algorithm takes more than steps to converge since F(w) is usually not quadratic. Computing and large problems. storing the Hessian (They require O(S3 matrix F(w) is expensive or infeasible O(S2) operations, respectively). implementing the CG algorithm, the following estimation is often used: F(wt)det ' F(wt + Cadt)  (4.21) for some small at E R, a Conjugate gradient methods are generally regarded as among the most efficient methods for largescale optimization problems. Johansson et al. (1990) reported that their implementation of CG algorithm outperformed standard BP by an order of magnitude in terms of training speed. Moller (1990) improved the CG algorithm F(w)dy r r r I ,I n I I r I r r I n 4.3.2 Newtonian Algorithms Assume that F is twice continuously differentiable, Newton's method finds a fixed point through the following iterate: w(t + 1) = w(t) a(V2F(w Note that in a single step. 
definite, and a = (4.22) quadratic, then the Newton's method converges to the minimum This is seen by letting F(w) then we have wt+' = wt 1i T 2w Aw bw where A is positive  A'(Awt = A'b. Even if F is not quadratic, under reasonable assumptions, Newton's method is guaranteed to converge to a local minimum from an arbitrary initial point (Schneider et al., 1991). also converges fast when it reaches the neighborhood a solution. However, Newton's method is rarely used in its unmodified form because of the cost associated with computing the Hessian matrix and its inverse. Also, the method works well only when it has a good initial solution (Becker and le Cun, 1988). A class of modified Newton's method is called QuasiNewton methods where the search direction is computed via d = H'VF(w) (4.23) where H is an approximation to the Hessian matrix F(w) The most successful QuasiNewton algorithm is the BroydenFletcherGoldfarb hannon (BFGS) algo rithm. In the BFGS method H1 is obtained iterativel  V T1 V where = f(w f(wt) = VF(w'1)  +gF(, 7F(w' y by (4.24) . At each iteration H1 can be determined through two new vector 58 and g, and the previous H Hence the method is very efficient. lA~winrn7r /,,,,1 AQI\ n n~ * 1CO~1CA..~ ">)' VF(wl). 73 D c L , 51 the computational locality properties of backpropagation where the weight updating can be carried out in local units. Becker and le Cun (1988) proposed using a simple diagonal approximation the Hessian matrix. They replace Awij = _49L with what they called a "Pseudo Newton Step" 9 F yaw" (4.25) where is used improve the conditioning Hessian matrix. magnitude of p determines how much curvature information is to be used weight updating rule. 4.3.3 Quickpropagation Most second order methods are considerably more difficult to implement than first order methods, especially those require global information. Fahlman (1988) developed a heuristic algorithm he called quickpropagation (quickprop for short) based on two assumptions: (1) the error (i .e., the criterion function) surface in weight space can be approximated by a parabola, and (2) the change in the slope of the error surface in one weight axis is not affected by other weights that are changing at the same time. Thus each weight can be updated independently by using previous and current error slopes, and previous weight changes by Aw(t) = C'l OF(t1) Ow " oF(_) Aw(t 1). O w (4.26) This weight change leads directly to the minimum point of the parabola.4 Thus the quickprop method would converge very fast if the criterion function surface were near quadratic. Although the assumptions are very crude, the quickprop algorithm turned out to be very effective in reducing neural net training time in many standard test problems, Awij = 52 to standard BP, the quickprop weight updating rule has a denominator aF a(t) This factor is relatively large when the weight gradient changes a lot. Hence, this results in a small stepsize. While in the flat error surface areas, the gradient changes very little, hence creating a large stepsize. This effectively overcomes the problems with fixed stepsize of the standard BP method. 4.4 Parameter Adjusting Tollenaere (1990) conducted a series of experiments to investigate the effect of the learning parameters (77. a) on the learning speed (measured in epochs). Those experiments cleared to some extent the confusion about how to choose the parameters caused by conflicting reports, where only nonsystematic studies were carried out. 
Some general conclusions from Tollenaere's study can be summarized as follows. * Learning time decreases exponentially as r increases up to a certain point. After that point, the iterative process becomes unstable. * The optimal learning rate y (with which the learning time is the least) decreases as momentum a increases from 0 to 1. * The use of momentum usually increases the learning speed by a factor of 2 to 3. It has long been realized part of the standard ow efficiency is due to its fixed parameters. Usually the parameters need to be chosen empirically for a particular problem. Even after the best Iparameter combination is found through extensive experiments, using those fixed parameters can not meet the conflicting needs, a large stepsize is desired in flat. functional surface area and a small stepsize is required in areas with narrow ravines. Numerous dynamic parameter adjusting schemes have been developed. Most of thmr aro hinrpiitirc (I rr QihITV !nrl A 11lP; l 1 00 mnn rhn a2c1'7P^ IlnrCal rnrnnmt+ a 1 Il()(l 53 Several principles for adjusting parameters are given in Jacobs (1988): 1. An individual learning step should be assigned to each weight (and threshold). The learning rate (stepsize) should be adjusted according to the curvature of the criterion function where change is taking place. The learning rate should be increased when the current partial derivative of the criterion function with respect to the weight n consideration has the same sign as the previous partial derivative; otherwise, the learning rate should be decreased. Based on these principles, Jacobs proposed "deltabardelta" (DBD) learning rule. A learning rate ry, is allocated to each weight wij, exponentially decaying trace of the gradient and 6ij is introduced as an Tile formulae for weight updating is: if ( (t if 56,,(t  i)8s(t)  1) i () (4.27) otherwise AwUy(t) =  siSj*(ti) + aAw;j(t  (4.28) j (t) = (1 )Sij(t) + O;j(t  (4.29) where are user determined parameters, tip  l (it is slightly different from the bi) in standard BP). Note that the increase in the learning rate is additive while the decrease is multiplicative. This strategy prevents the learning rate from growing too fast which may lead to weight saturation) and allows decrease rapidly, but keep a positive sign. The DBD algorithm leads to significant speedup of the standard BP algorithm. However, the algorithm is very sensitive to tile new parameters, especially k. Also, while the axrn:ii_ momentum term increases learning speed , it leads to instability. Sinai and I'i (ifn\ I/U 1111aI* I'4'Ie nrnnnp~cnAlr cerlxn;rl rnr~;r dnr c n tit TnR hRi, ,rnvhm anrl lb k,hol ,t = { k+77~ , i adaptive. Upper bounds are put on both y and The new weight updating rules becomes: SOF (t ) Aw^;(t) = F  + aAw;yj(t Sw.,1 1) (4.30) iij(t + 1) = Min{qmax, ij(t) + A7.y(t)} (4.31) aij(t + 1) = Min{arma,ax1ij(t) + Aaj(t)} (4.32) r16i~~ ,Aj(,) = aj(t) = if ij(t if 6j(t  1)6;(t) 1)( (4.33) otherwise fml6i(t) if b~j(t if 6j(t  1)iij(t)  1)6;j(t)  l()S~t (4.34) otherwise where ke, \I, 7t, km, A,, 7m, TImar and Omax are parameters furnished by the user. EDBD was reported to provide significant sp learning the logistic function f(x) eedup over = ax(1 x), a r DBD a = 3.95, 0 nd to be more robust on x <1. The authors of EDBD also suggested implementing a memory and recovery mech anism into the learning algorithm. Specifically, the current best solution is retained. control parameter E R, > 0 is defined. 
If the criterion function value be comes greater the times the best criterion value retained so far, then the search is abandoned and restarted from the current best point with attenuated learning rate and momentum. However, the experiments on this idea showed somewhat negative results. Davos and Orban 's (19 SAB (selfadapting backpropagation) algorithm ad vocates similar ideas. The algorithm starts without momentum, and increases the learning rate exponentially as long as the weight gradient keeps the same sign. It dif fers from the EDBD algorithm in that when the weight gradient changes sign, instead of reducing no; by some rule. it is reset to its starting value, and then the algorithm Xr9ij( t.) Xa;j(t) 55 Tollenaere (1990) modified the SAB method and named his version SuperSAB. The motivation behind SuperSAB s that whenever the gradient changes sign, the weights should not be changed. The weight change halts until the stepsize is reduced to such an extent that a step can be taken without changing the sign of the gradient. The learning rate changes simply by 7ij(t + 1) = 7+7ij(t) (4.35) 'hit. where + and are the increase factor and1)= where TJ^ and 1?_ are the increase factor and (4.36) tile decrease factor, respectively. lenaere reported that SuperSAB is insensitive to the parameters, and r7+ = 1.05 and = 2 are shown to be good for a wide variety of problems. Compared with standard BP method, SuperSAB learning is significantly faster. One important feature of SuperSAB is the range of the initial stepsize that leads to reasonably fast learning (Tollenaere referred to it as osr  optimal stepsize region) is orders of magnitude wider than that of standard BP. A drawback of SuperSAB is that it is slightly more instable than BP. But it was argued that SuperSAB with restart after divergence was detected  An interesting and important observatic still outperformed standard BP. )n Tollenaere made is that the optimum stepsize region of different learning algorithms do not necessarily overlap. Thus, com prison of different algorithms based on the same parameter values are inappropriate. idea similar to SuperSAB was used Silva and Almeida (1990) their Adaptive Backpropagation Algorithm (ABA). However, Silva and Almeida studied the effectiveness of the algorithm in the context of varying criterion surface orientation in the input space. They argued that becau se an individual learning rate is used for each weight, the performance of the method may be affected by the orientation of the I, S  56 Chan and Shatin (1990) used the angle 0(t) between consecutive weight gradients, instead of sign, to detect the curvature of the criterion surface in the weight space. Only a global learning rate is used, and it is adapted by (t}) = r)(t  1)(1 + cose(t)). 2 (4.37) The momentum is also made adaptive in their algorithm by a(i)= A(t)n(t) (4.38) with A(t) = 0Xo I F(t) \ ! (4.39)  1)11 where Ao E (0, 1). This in effect attenuates the momentum term such that it never exceeds the current gradient term, hence will not dominate the effect by the current weight gradient. The weight updating rule is then Aw(t) = 7(t)(F() + a(t)Aw(t 1). dw A backtracking heuristic is also implemented. (4.40) The learning rate y?(t) is reduced by half whenever the criterion value F(t) is greater than the previous one F(t 1) by a certain percentage (say, 1%). 
Chen and Shatin's Adaptive Training Algorithm (ATA) was tested against DeltaBarDelta algorithm and a conjugat and the 424 encoding problem (It will be ;e gradient algorithm on the XOR. problem discussed in Chapter 5. See also Rumelhart et al., 1986). ATA was shown to learn much faster than the other two algorithms and was insensitive to initial parameters (although it still suffered the local minimum problem as the others did). Activation Functions 4.5.1 Radial Basis Functions Powell (1985) introduced the radial basis function (RBF) for multivariate inter polation problems. Learning in supervised feedforward neural nets can be viewed as surface interpolation. This observation led to the use of radial basis function as the activation function in neural nets by Broomhead and Lowe (1988), Moody and Darken (1989), and Poggio and Girosi (1990). Standard feedforward neural networks use sigmoid activation functions. input (E wijXi) to each processing unit forms a hyperplane. The net Multilayer perceptrons partition the input space with the hyperplanes from each unit, while in a feedforward neural net those hyperplanes are smoothed through the sigmoid nonlinear filter before being used to form a decision region (partition). The radial basis function forms hyperellipsoid regions in the input space. A R.BF network consists of two layers (see Figure 4.1). Each hidden unit has a radial basis function 4 + R defined by ,(x) = ,( x ) i= 1,2,...,N (4.41) where p centers. E R fiji , 2, ..., are parameters, measures the distance from the and N is the number of radial basis input vector x to the radial basis function center ti The network output at node k is fk (w,x) = wi;k (Ik pi ll). (4.42) A frequently used radial basis function is the Gaussian function Ik,t~112 (4.43) where J 1 (2l  (x (x ti). (4.44) To simDlifv comDutation. the covariance matrix Z is usually chosen to be a diagonal )= . Ir;)T Figure 4.1. 2 X3 radial basis function network When the radial basis center, p/, = 1, 2,..., are fixed at data points x', = 1,2,..., N, what is left to the network to learn is then only the linear coefficients of Wik in the output layer. case, RBF networks can be trained very fast and without suffering the problem of local minima. Moody and Darken (1989) reported their RBF network reduced training time on learning the MackeyGlass equation5 by a factor of 102 to 103 compared with standard BP. However , RBF network is not appropriate for large data sets as the size of the net work grows with the number of training instances. Poggio and Girosi (1989) proposed to treat the radial basis center as variables, and neural nets are allowed to estimate the centers p/,j = 1,2,..., K, where K may be much less than N (the number of data points). They called the extension Generalized Radial Basis Function (GRBF) network. very rigorous and thorough treatment of RBF GRBF networks given in Poggio and Giros (1989). T* . L:.L).. .Lnk Ca.. i)r(t crz(tT) 517L~ : I.r 4.5.2 Transcendental Functions Although the sigmoid and the hyperbolic tangent functions have been the most frequently used activation function in feedforward neural nets, other monotonic, dif ferentiable functions can also be used (Cybenko, 1989). In particular, we have tested using transcendental functions, such as sine or cosine function as the activation func tion. The XOR problem can be solved in a few iterations with the new activation function. Rosen et al. 
(1990) reported that their neural nets using sine and/or cosine activation function outperformed and learning x9 and x3 functions. transcendental functions can be e the standard BP on the parity problem (n A justification suggested by Rosen et al. Expanded (via is that Taylorseries expansion) as the sum of infinite order polynomials. Although the polynomials are not independent within each activation function, in a multilayer network the weighted sum of outputs from the hidden units in effect produces a weighted sum of infinite order polynomials. But sigmoid function can Lapedes and Farber's also be expanded to a sum of polynomials. Experiments by (1987) showed that trigonometric activation functions are less robust than the sigmoid function. 4.5.3 Higher Order Networks and Functionlink Networks Instead of using the sum of weighted inputs as net input, some researchers (Pineda, puts ) have explored the use of net input with higher order correlations among the in (e.g., higher order links may be created that take the product of input variables nput). The correlations are usually captured by the cross terms of a polynomial. Volper and Hampson used quadratic terms, in particular, and concluded that higher order network can be trained noticeably faster than the standard network. and Rumelhart (1989) studied net input using product forms Durbin , and called those pro cessing units product units. Their conclusion was product units could be a computationally powerful extension to thle standard network. xl x2 x3 x1x2 x1x3 x2x3 xlx2x3 Figure 4.2. complex, this A functionlink neural network used to solve Parity 3 creates a powerful method that usually permits simple networks without hidden layers to solve hard problems. 3 problem is shown in Figure 4.2. This functional network outperformed a network by nearly an order of magnitude. A functionlink network that solves the parity standard feedforward neural The efficacy of functionlink neural nets were also shown through learning functions of one and two variables. 4.5.4 Gradient Descent Search in Function Spa.ce Instead of using fixed activation functions in the processing units, Mani (1990) considered providing a pool of functions to the processing units and let the learning algorithm decide which of the candidate activation functions are the best to use. (Different function pools may provided to different processing units). The learning procedure he proposed is similar to that of thle standard BP. But now the gradient descent is applied in the function space, rather than the weight space (though the two might be combined as suggested by Mani). Unfortunately, the order of a set of general functions can not be readily defined, hence the function gradients are not easily obtained. proach more ideological than practical. This difficulty makes the ap The only problem the author attempted to i Dynamically Constructed Neural Nets The algorithms we have discussed so apply only to neural nets with fixed structures. That is, the number of hidden processing units, the connections between the units, and the layout of the network are determined before the training algorithms are applied. Many researchers have realized there are drawbacks with fixed neural 1990). net structures (see Honavar and Uhr, 1988; Tenorio and Lee, 1989; Frean, For any particular problem we want to solve, some neural net structures are more appropriate than others. 
Since there is no general guidelines as how a neural net should be designed for a given problem, it has been a common practice for neural net users to copy neural net structure from other applications (without questioning the validity), or simply make up one arbitrarily. even though success may have been claimed . This is hardly a scientific approach Generally, small neural nets are preferred, given that they are capable of solving the problem at hand. Tile rationales are that arge enough to be (1) parsimony is always desirable (2) neural nets with fewer parameters are easier to interpret, when interpretation is necessary; (3) smallsized networks can be trained more reliably given a fixsized training sample (see, e.g., Haussler, 1991); and (4) neural nets with fewer hidden units seem to generalize better with novel pattern 1991). s (Kruschke and Movellan, Although the general representation theorem (see Chapter 3) guarantees that a feedforward neural network with a single hidden layer is sufficient for learning practically any inputoutput mapping, there is no theoretical result yet that specifies how many hidden units are needed. Honavar and Uhr (1988) pointed out that fanout6 sizes to create local receptive fields. t is desirable to restrict the fanin and Then the number of hidden units each layer is limited, and multiple hidden layers become necessary to learn a desired mapping. Indeed, experiments conducted by Gorman and Sejnowski (1988) suggested 62 Two broad approaches have been employed to construct neural nets with optimal appropriate) size. The first is to start with a small network, and let it grow as needed. The second approach is to train an exces sively large (estimated) net work, and then prune away units that do not have significant impact on the network performance. 4.6.1 Network Growing Methods Fahlman and Lebiere (1990 identified two major problems that contribute to the inefficiency of the standard stepsize problem and moving target problem. problem The first problem has been covered in a previous section. that is caused by the fixed structure of a neural net. It is the second In such a network the hidden units have no communication with one another, as no lateral connections are provided. During the training process, each hidden unit modifies its link weights according to the error signal backpropagated from the output layer. The problem is that all units are trying to learn the same training pattern at the same time. As the training pattern changes constantly (for instance training, the most common case), it takes a long time for the hidden units to split their roles and to commit to different patterns. A possible way to combat the mo at a time. ving target effect is to train part of the network The cascadecorrelation algorithm developed by Fahlman and Lebiere uses this approach to its extreme. Only one hidden unit (including associated weights and bias) is allowed to change at any stage of the training process. The cascadecorrelation algorithm starts with a feedforward neural network with out a hidden layer. The algorithm builds up the network (the cascade architecture) by adding hidden units one at a time. Whenever a hidden unit is added, it forms a new hidden layer with connections from all input units and previous added hidden units. patterns, and the covariance S of the hidden unit output error is maximized. Vp and the current network S is given by z k where k is the output The weights unit index, and  )(Epk Ek ) (4.46) are averages over all p patterns. 
leading to the candidate hidden unit are modified to maximize S with a gradient ascent algorithm similar to that of backpropagation. When these weights converge (the maximization problem is solved), they are frozen, and added to the current net with the candidate hidden unit. Then the training of the net resumes until the stopping criterion i s met or new hidden units are needed. A number of benchmark test problems were performed by Fahlman and Lebiere (1990). They reported the cascadecorrelation algorithm beat quickprops by a factor of 5 and standard a factor of 10 on the twospirals problem." the 8bit parity problem, the cascadecorrelation algorithm not only outperformed the standard BP by a factor of 5, but it also built a much more compact network. Furthermore, it was shown to generalize well on the 10bit parity problem. Frean (1990) developed an interesting net growing algorithm the Upstart Algo rithm. The algorithm deals with multilayer perceptrons, i.e., feedforward neural nets with threshold processing units. It cr the errors made by each parent unit. eates new units, called daughters, that correct The algorithm proceeds recursively creating new daughters units until none of the terminal (the leaf) daughters makes any mis takes. In other words, the Upstart algorithm expands the network until the problem is solved. Tests on Convergence to zero error is guaranteed parity problem showed for learning boolean Upstart algorithm functions. was efficient. solved the nbit parity problem with n less than 10 in less than 1000 iterations. 7At first glance, this approach seems anticonnectionist. But we need to realize that sequential rlll.r .rt    .~~~L ~ ~. C(Vp 64 algorithm probably doesn't scaleup well since it took more than 10,000 iterations to solve the 10bit parity problem. The SONN (Self Organizing Neural Net) algorithm proposed by Tenorio and Lee (1989) was designed for system identification problems. A new node is generated with polynomial activation functions of all inputs and outputs from previous layers. The polynomial is limited to order two. Thus each new unit has at most two parent units. The best polynomial functions is determined by a Structure Estimation Criterion (SEC) which provides a tradeoff between performance and complexity of the model. Simulated annealing is used in the search process. When applied to learn the Mackey Glass (see footnote on page 58) time series, the SONN algorithm produced far more compact models (net structures) than the standard feedforward neural networks used for comparable performance. Hirose et al. (1991) considered some heuristics that perform both growing and pruning of the feedforward neural nets. sum of squared errors) is checked every The performance criterion F (in this case, 100 weight updating. If F fails to decrease by more than one percent of the previously checked value, a new unit is added to the hidden layer. When a network is successfully trained, the pruning process is envoked which simply removes one hidden unit at a time, and then restarts training of the reduced network until no more hidden units can be removed. This occurs when the net fails to converge with a unit removed. These heuristics appear very crude, but they do help to overcome the nonconvergent problem. The authors even claimed that their heuristics could avoid local minimal solutions. 4.6.2 Network Pruning The network growing methods usually have a goal to minimize the net size. 
How ever, there are also reasons to train a neural network with a larger than minimum size. Extra hidden units may increase the robustness (performing well in noisy environ Thus many researchers studied pruning the nets after they are trained with sufficiently large number of hidden units (Mozer and Smolensky, 1989; Karnin, 1990). Sietsma and Dow (1987) proposed a twostage pruning method. In the first stage, the output of the hidden units of a trained net are analyzed. whose output do not change for all input patterns are removed. Those hidden units If two hidden units have the same or opposite outputs across training patterns, then one of them may be removed. In the second stage, the contribution of each hidden unit to the learning task (classification) is analyzed. are removed. The redundant units and hidden layer(s) The resultant is a much smaller net that can be trained quickly. interesting fact is that a net with the same si ze as the net obtained from pruning could not be trained starting with random weights. Karnin (1990) used a similar pruning procedure where the hidden units are ordered by the amount of global error (F) changed when the unit is pruned. Those units with negligible effects on global error are removed. Sankar and Mammone (1991a) proposed a new neural net architecture called the Neural Tree Network (NTN) which combines feedforward neural nets with decision trees. A feedforward neural net is used at the root node of the NTN to divide the instance space into N subsets, where N is the number of concept (output) classes. If each subset corresponds to a single concept class, then the job is done. Otherwise. each of those subsets with nonunique concept cl asses is assigned to a child node, where again a feedforward neural net is used to divide the subset further. This process continues until each subset contains only instances from a single class. been reported that when feedforward neural nets It has are compared with decision trees for classification, neural nets usually give smaller classification errors but take a longer time to learn (Tsoi and Pearson, 1990; Fisher and McKusick, 1989; Piramuthu et al., 1990). Sankar and Mammone showed that NTN outperformed both feedforward n ITraI notc 2An cr ^ rioin tr'noc nrr\r sa nnn nr .ilnnrnn+ ,,nt ral rnrnetnif fln 2 01r pruned subtree is NTN itself. Asac Increases, the optimally pruned subtree reduces in size with the root node as a limit (Sankar and Mammone, 1991b). Weigend et al. (1990) used the information theoretic concept of "minimum de scription length" (as in the SONN algorithm by Tenorio and Lee, 1989). A penalty for the network complexity measured in number of connections was added to the criterion function. Thus, by minimizing the augmented criterion function through standard BP, a tradeoff is achieved between the performance and the network com plexity. This approach led to a reduced size of the trained network and improved its generalization property. Similar pruning approaches were discussed in (Mozer and Smolensky, 1989 "Skelentonization" "Optimal Brain Damage" procedure method (le Cun et al., 1990). Chauvin (1989) used a penalty term for large weights in the criterion function. Hanson and Pratt (1989) defined a bias term in the criterion function that served to decay the weights (pushing the weights not increased by the updating rule to zero), and obtained trained nets with smaller numbers of hidden units. The GAL (grow and learn) algorithm introduced by Alpaydin (1991) can both grow and prune the net. 
It is basically a variant of the nearest neighbor method, which, instead of storing the whole training set, stores only a subset of the training set with training pattern s close to class boundaries. A recent summary of dynamic structured neural nets can also be found in Alpaydin (1991). Contradictory to common belief, Sietsma and Dow (1991 showed that for the classification problems they attempted, pruning to the minimum number of hidden units decreased the generalization ability of feedforward neural nets in noisy environ ment , although the pruned nets did very well on the training set. Miscellenous Heuristics There are many variations of the standard do not fit in the sections 4.7.1 Initial Weights In most nonlinear optimization problems, identifying a good initial solution could be crucial to the efficiency of the algorithm. play an important role in network training. Similarly, initial weights in neural nets Kolen and propagation t' Pollack (1990) performed extensive tests on o initial network weights. Their results showed te sensitivity of back that standard BP is very sensitive to the initial weight range. Specifically, for the 2 x 1 XOR net, BP gets stuck in local minima easily when the range of initial weights was set to larger than Chen and Bastani (1989) introduced a weight nitialization algorithm for two layer feedforward neural nets. A least squared error (LSE) feature selection method called the Walsh Transform is used. What the Walsh transform does is producing an initial weight matrix that has the best projection from the training sample. The learning speed of the XOR network with the use of this weight initialization technique was shown to be much higher than the same network with random initial weights. Specifically, networks so initialized performed nearly as well as the best randomly initialized networks from 150 tests. 4.7.2 Multiscale Training Felten et al. (1990) also considered incorporating features of the problem into the neural net weight space. They reasoned that it s only natural to use any knowl edge about the training set in order to restrict the search space (hypothesis space). Since real world problems are inherently structured, it is possible to incorporate the information into neural network learning. Specifically they proposed a multiscale training algorithm. It starts with small networks, and then uses the results from the trained small networks to help train a larger network. are related through the rescaling or dilation operator. The networks of different size For a handwritten character 4.7.3 Borderline Patterns Ahmad and Tesauro (1988) found that the number of training examples needed to train a neural net successfully scales linearly with the number of inputs for learning the majority function.10 More importantly, the most useful training instances are those close to the class boundary. to train the neural nets. Their ex Thus they proposed to use only borderline patterns :periments showed that nets trained with borderline patterns performed significantly better than nets trained with random patterns. They also had a substantially better generalization ability. An upper bound on the number of random training patterns sufficient to learn the majority function was derived based on the borderline pattern notion. 
4.7.4 Rescaline of Error Signal Rigler et (1991), besides providing a general account gradient descent methods , noted that in a feedforward neural net with sigmoid activation functions, algorithm generates a factor = o(1 o) Hence by the chain rule the gradient vectors in different layers contain exponentially decreasing factors (1/4, To compensate this diminishing effect, they suggested rescaling the gradient factor, that is, multiplying the gradient factor with exponentially in creasing scalars, One particular set of rescalings thev used was 6, 36,216,..., obtained from taking the inverse of the expected diminishing factors. Experiments showed that this simple rescaling method could reduce training time by as much as an order of magnitude. Fahlman (1988) called the sigimoid prime function. We have discussed that the value of the sigmoid prime function goes to zero when the output approaches 0 or 1, This also causes the backpropagation error signal to become vanishingly small, hence learning is slowed down. By simply adding a constant 0.1 to the sigmoid prime function before it is used, Fahlman reduced the training time to nearly half of that 1/64,...). 69 4.7.5 Varying the Gain Factor Kruschke and Movellan (1991) performed gradient descent with respect to the gain factor, hence making it adaptive. of the weight change, and create rate. The adaptive gain factor modifies the magnitude s an effect similar to that of an adaptive learning The BPG (backpropagation with adaptive gain) algorithm was shown to give a remarkable speedup (by a factor of about 2) over standard BP. The gain factor was also used to create hidden layer bottlenecks (reducing the number of hidden units) for improving generalization. 4.7.6 Divide and Conquer The divideandconquer strategy artificial intelligent systems. of a modular connectionist Jacobs has a (1990 architecture. ong tradition developed Similar in computer science and a theory Thrun et al. methodology (1991) studied task modularization through network modulation. et al. (1991) proposed method that combines Kohonen's feature map (Kohonen, 1989) with the feedforward neural nets, and developed an errordriven decomposition scheme that was shown to outperform the feature map or backpropagation alone in approximating the Mexican hat function. 1 Pratt et al (1991 nets. studied direct transfer of learned information among neural They were able to train a large net starting with weights transferred from a smaller net trained on subtasks. Compared with nets using random initial weights, the weightpreset nets achieved speedups of up to an order of magnitude (even if the time to train the smaller nets was taken into consideration). The decomposition technique, borrowed from Waibel et al. (1989 includes the following steps: 1. Subnet training: subnets are set up and trained individually. Glue training: The trained subnets are bonded together through additional 4.7.7 Total Error vs. Individual Error Some researchers, in particular, Yu and immons (1990), considered using indi vidual pattern error, instead of the total sum of squared error, to guide the learning process. Their argument was that total error is not as effective a measure as a cor rectness ratio in classification problems. They developed an algorithm called Descent Epsilon where a parameter e is used to gauge the difference between a network out put and target value. The output is considered correct if the difference is less than e. 
Only those errors that are greater than e are backpropagated to modify the network weights. The magnitude of e is gradually decreased. Hence the total error also goes down with individual errors kept within the e bound. In conclusion, this chapter has summarized the stateoftheart research in feed forward neural network training. Most variations of tihe backpropagation algorithm are aimed at improving the training speed and increasing generalization ability of the feedforward neura networks. However, more efficient and Successes of globally convergen various degrees have been achieved. t training algorithms are needed to deal with more challenging real world problems. The next three chapters will focus on global optimal neural network training algorithms. CHAPTER 5 GLOBALLY GUIDED BACKPROPAGATION (GGBP) In this chapter we propose a modification to the standard backpropagation algo rithm. The modification, while retaining the simplicity of the standard BP, intro duces two nice properties: (1) There is a training time speed up, and (2) convergence to a global optimal solution is guaranteed. We start with a briefly discussion the shortcomings of standard backpropagation. Then we develop the ideas behind our approach and present the globally guided backpropagation algorithm (GGBP). Experiments on two standard test problems are presented. Limitations of BP The backpropagation (BP) method is one of the most widely used learning algo rithm for multilayered feedforward neural networks. The popularity of BP arises from its simplicity and successful applications to many real world problems. commonly recognized, however, that BP has some inherent shortcomings. Two of the often cited BP shortcomings are (1 slow or no convergence, and (2) the pos sibility of getting stuck in local minimum solutions (Tollenaere, 1990; Hirose et al., 1991). The objective of backpropagation learning is to find a set of network weights such that the total error function defined by some measure is minimized. Unfortunately, the error surface of a feedforward neural network is generally very complicated due to the convoluted nonlinear transfer functions. The error surface is generally char acterized by a large number of flat areas troughs that have very small slope (TTorbfJ'Jalcon 1 (1gm 1 nf lt rr;nirinc ALt#~.*I I) A. 11I ~ll LU tJLJ. &l lJ* Z I IILL with shamrn crvature (Battit and Masulli. the flat areas or by oscillating along the ravines. Also it is clear that, with steepest descent, once a solution gets stuck in a local minimum it has no way to escape. Although many variations of BP have been developed as discussed in the last chapter. The effort to deal with the first problem, that is, to develop more efficient neural net training algorithms, considered has met only partial success. the problem of local minimum solutions. Few researchers have Local refinements of the algorithm, such as using second order information of the criterion function, improve learning speed, suffer the same problem staying stuck in a local minimum once the solution is trapped. The Idea of Globally Guided Backpropagation The error surface of a feedforward neural networks in the weight space is generally very complicated. Figure 5.1 shows a typical error surface of the simple XOR network Section 3.3) where large flat areas and narrow valleys exist. It is clear that a strict gradient descent approach will encounter difficulties n such a weight space. However, quite simple. 
However, the error surface of a feedforward neural network in the output space is quite simple. If we use a sum of squared error function, the error surface is convex quadratic in the output space:

E = \sum_{p} E_p = \frac{1}{2} \sum_{p} \sum_{k} (y_{pk} - o_{pk})^2    (5.1)

Note that the error in Equation (5.1) is separable in p and k, which are the indices for the pattern (example) and the output unit of the network, respectively. Minimization of the quadratic function is easy if the output of the network can be controlled. The unique local minimum of E is also a global minimum solution. The optimal outputs are the target values.

Unfortunately, solving for the weights W through the inverse function of the output O is extremely difficult, if not impossible, because the neural network output is a composition of nonlinear functions of weighted sums.

Figure 5.1. Error surface of an XOR (2 x 1) network showing valley, plateau, and local minimum (axes: w2 and w6).

However, if we change the output by a small amount, we will be able to find the corresponding changes in the weights W via a Taylor series expansion of O:

O(W + \Delta W, X) \approx O(W, X) + \nabla_W O(W, X) \, \Delta W    (5.2)

where higher order terms are ignored. If we update the weights of the network based on the output changes \Delta O, instead of -\eta \nabla_W E as in standard backpropagation, then we have reason to hope that the weight updating scheme would (1) lead to faster convergence, since the search in the weight space is guided directly by the search in the output space, and (2) lead to a global optimal solution.

Figure 5.2. The weight change \Delta W corresponding to \Delta O leads W toward a global optimal solution.

5.3 Learning Rule Derivation

The learning rule of GGBP is derived based on the changes in the output space. Let us consider a given training pattern. The error function is

E = \frac{1}{2} \sum_{k=1}^{K} (T_k - O_k)^2    (5.3)

where k is the index for the output units. Changing the output O = (O_1, O_2, ..., O_K)^T based on gradient descent in the output space gives

\Delta O(n) = O(n+1) - O(n) = -\eta \nabla_O E(n)    (5.4)

where n is the iteration index. Using Equation (5.3),

\Delta O(n) = \eta (T - O(n)).    (5.5)

Figure 5.3. A typical FNN, in which the output layer weights associated with O_k are independent of the other output units.

Note that here \Delta O(n) is a K-dimensional vector, W is an S-dimensional vector, and \nabla_W O is a K x S matrix. Finding a \Delta W from \Delta O requires the pseudoinverse of the matrix \nabla_W O. This is computationally undesirable. Considering the special structure of the feedforward neural network, we notice that the weights of the output layer associated with output unit i are independent of the other output units O_k, k = 1, 2, ..., K, k != i (see Figure 5.3). We can rewrite \Delta O as

\Delta O = [\nabla_{W_H} O, \nabla_{W_{O_1}} O_1, ..., \nabla_{W_{O_K}} O_K] \, (\Delta W_H, \Delta W_{O_1}, ..., \Delta W_{O_K})^T    (5.7)

where W_H denotes the weights in the hidden layer(s) and W_{O_k} the output layer weights associated with output node k. Each component of \Delta O becomes

\Delta O_k = \nabla_{W_k} O_k \cdot \Delta W_k.    (5.8)

We choose \Delta W_k in the direction of \nabla_{W_k} O_k. Thus we have

\Delta O_k = \|\nabla_{W_k} O_k\| \, \|\Delta W_k\|    (5.9)

\|\Delta W_k\| = \frac{\Delta O_k}{\|\nabla_{W_k} O_k\|}.    (5.10)

The normalized component of \Delta W_k for a weight w_s is

\Delta w_s = \frac{\partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|} \, \|\Delta W_k\|.    (5.11)

Substituting \|\Delta W_k\| with Equation (5.10) gives

\Delta w_s = \frac{\partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|^2} \, \Delta O_k.    (5.12)

Replacing \Delta O_k using Equation (5.5) results in

\Delta w_s = \eta (T_k - O_k) \frac{\partial O_k / \partial w_s}{\|\nabla_{W_k} O_k\|^2}.    (5.13)

If w_s is a weight in the output layer, Equation (5.13) is used as the weight updating rule. If w_s is a weight in a hidden layer, we need to consider the effect of all the outputs O_k, k = 1, 2, ..., K. The changes due to each output are summed up. Hence we have

\Delta w_s = \sum_{k=1}^{K} \eta (T_k - O_k) \frac{\partial O_k / \partial w_s}{\|\nabla_{W_H} O_k\|^2}, for all w_s \in W_H.    (5.14)

This heuristic approach (summing up the \Delta w_s's) is also used in White (1990), where similar results are obtained from an application of Newton's method. The advantage of this approach is the simplicity of the weight updating rule.
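The rules (5.13) and (5.14) are simple enough to state compactly in code. Below is a minimal NumPy sketch for a 2-2-1 sigmoid network on a single pattern, omitting biases; all names, shapes, and the reading of the norm in (5.14) as the hidden-layer gradient norm of O_k are our assumptions, not the dissertation's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = np.array([1.0, 0.0])             # a single training pattern
t = np.array([1.0])                  # its target vector (K = 1)
W1 = rng.uniform(-0.5, 0.5, (2, 2))  # hidden weights W_H (biases omitted)
W2 = rng.uniform(-0.5, 0.5, (1, 2))  # output weights W_O
eta = 0.5                            # learning rate in the *output* space

h = sigmoid(W1 @ x)                  # hidden activations
o = sigmoid(W2 @ h)                  # network outputs
fo = o * (1.0 - o)                   # f'(net_k) at the output layer
fh = h * (1.0 - h)                   # f'(net_j) at the hidden layer

# Output layer, Eq. (5.13): dO_k/dw_jk = f'(net_k) * h_j.
gradW2 = fo[:, None] * h[None, :]            # shape (K, J)
normsq = (gradW2 ** 2).sum(axis=1)           # ||grad_{W_k} O_k||^2 per output
dW2 = eta * ((t - o) / normsq)[:, None] * gradW2

# Hidden layer, heuristic Eq. (5.14): sum the normalized changes over all k.
dW1 = np.zeros_like(W1)
for k in range(len(o)):
    # dO_k/dw_ji = f'(net_k) * w2_kj * f'(net_j) * x_i for hidden weight w_ji
    g = fo[k] * (W2[k] * fh)[:, None] * x[None, :]
    dW1 += eta * (t[k] - o[k]) * g / (g ** 2).sum()  # assumes nonzero gradient

W2 += dW2
W1 += dW1
```

Both updates are computed from the same forward pass before either weight matrix is modified, which sidesteps the asynchrony noted later in the chapter.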
The downside of this heuristic is that the hidden layer weight changes it produces are only approximate. A more careful derivation proceeds as follows. Summing up the components of \Delta O (cf. Equation (5.8)), we have

\Delta O_k = \nabla_{W_H} O_k \, \Delta W_H + \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}.    (5.15)

Because of the special structure of the feedforward network, \Delta W_H is the same for all k. Thus we can separate \Delta W_H and obtain

\|\Delta W_H\| = \frac{\sum_k \Delta O_k - \sum_k \nabla_{W_{O_k}} O_k \, \Delta W_{O_k}}{\|\sum_k \nabla_{W_H} O_k\|}.    (5.16)

Similar to the derivation of Equation (5.13), we have, for all w_s \in W_H,

\Delta w_s = \frac{\sum_k \partial O_k / \partial w_s}{\|\sum_k \nabla_{W_H} O_k\|^2} \left( \sum_k \eta (T_k - O_k) - \sum_k \nabla_{W_{O_k}} O_k \, \Delta W_{O_k} \right).    (5.17)

Comparing Equation (5.17) with Equation (5.13), we note that the equation is more complicated and explicitly requires the computation of the weight changes in the output layer.

Recall that in standard backpropagation the weights are updated with the following formulas:

\Delta w_s = -\lambda \frac{\partial E}{\partial w_s} = \lambda (T_k - O_k) \frac{\partial O_k}{\partial w_s}    (5.18)

for the output layer, and

\Delta w_s = -\lambda \frac{\partial E}{\partial w_s} = \lambda \sum_{k=1}^{K} (T_k - O_k) \frac{\partial O_k}{\partial w_s}    (5.19)

for the hidden layer(s). In both GGBP and standard BP, the update is a function of the partials of the output with respect to the weights. The concepts of the two approaches are, however, quite different. With GGBP, \eta is a fixed learning parameter in the output space, while \lambda of standard BP is a fixed learning rate in the weight space.

5.4 Convergence of GGBP

Updating the weights with Equations (5.13) and (5.14) (or (5.17)) will ensure that the global error is decreasing, as long as the approximation used in the Taylor expansion is valid. Following Equation (5.5), and changing notation slightly (by explicitly putting in W and X), we have

O(W^{n+1}, X) - O(W^n, X) = \eta (T - O(W^n, X))    (5.21)

and

O(W^{n+1}, X) = \eta T [1 + (1-\eta) + ... + (1-\eta)^n] + (1-\eta)^{n+1} O(W^0, X)
             = T [1 - (1-\eta)^{n+1}] + (1-\eta)^{n+1} O(W^0, X).    (5.22)

For any \eta \in (0, 1), O(W^n, X) -> T as n -> \infty. That is, the output converges to the target value.

Note that the convergence property is guaranteed by (5.21) only for the case of a single example. For a multi-example training set, the weight updating rules of GGBP are still valid if the instance training method is used, but the convergence proof remains an open issue, as in the case of standard BP. Empirical results, though, have shown that convergence is typical when \eta is small.

The extension of the GGBP algorithm to sample training is not straightforward, because the output becomes a matrix when all patterns are considered. Conceptually, the GGBP approach is still applicable. The derivation of the weight updating rules then requires iterative solutions to a system of linear equations. On the other hand, a heuristic for applying GGBP to sample training is to simply add up the weight changes computed for the individual patterns.
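Equation (5.22) predicts geometric convergence of the output to the target. A quick numeric check, with illustrative values of eta, T, and O(0):

```python
# Iterating Eq. (5.21), O(n+1) = O(n) + eta*(T - O(n)), gives Eq. (5.22):
# O(n) = T*(1 - (1-eta)**n) + (1-eta)**n * O(0), so O(n) -> T for eta in (0,1).
eta, T, O0 = 0.3, 1.0, 0.05
O = O0
for n in range(20):
    O = O + eta * (T - O)
closed_form = T * (1 - (1 - eta) ** 20) + (1 - eta) ** 20 * O0
print(O, closed_form)  # both ~0.99924: geometric convergence to the target
```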
5.5 The GGBP Algorithm

The GGBP algorithm is similar to the standard backpropagation algorithm, and its implementation is straightforward. Note that there is a slight difference in the definition of \delta in the two algorithms. Also, we do not use the momentum term in our algorithm, since GGBP is supposed to search in the output space, where there are no ravines and/or plateaus. GGBP is formally stated below.

Algorithm GGBP

1. INITIALIZE:
* Construct the feedforward neural network. Choose the number of input units and the number of output units equal to the length of the input vector X and the length of the target vector T, respectively.
* Randomize the weights and biases in the range (-0.5, 0.5).
* Specify a stopping criterion such as E < E_stop or n > n_max.

2. FEEDFORWARD:
* Compute the output for the non-input units. The network output for a given example p is
O_{pk} = f(\sum_j w_{jk} f(\sum_i w_{ij} f( ... \sum_l w_{l} x_l))).
Note that the bias \theta_j is replaced by w_{j0} for notational convenience.
* Compute the error using Equation 3.7. If a stopping criterion is met, stop.

3. BACKPROPAGATE:
For k = 1, 2, ..., K repeat:
* For each hidden unit j, compute \delta_j = \delta_k w_{jk} f'(net_j).
End repeat.

4. UPDATE:
* For the output layer: \Delta w_{jk} = \eta (T_k - O_k) (\partial O_k / \partial w_{jk}) / \|\nabla_{W_k} O_k\|^2 (cf. Equation (5.13)).
* For the hidden layer: \Delta w_{ij} = \sum_k \eta (T_k - O_k) (\partial O_k / \partial w_{ij}) / \|\nabla_{W_H} O_k\|^2 (cf. Equation (5.14)).

5. REPEAT: Go to Step 2.

5.6 Experiments

Two test problems are used to illustrate and evaluate the performance of GGBP. Both are standard test problems. All tests were run on an 80386 microcomputer. The reported results are averages of 20 runs starting with the same random initial weights for both GGBP and standard BP. All numbers are rounded to the nearest integer.

5.6.1 The XOR Problem

The Exclusive-Or (XOR) problem has been used extensively as a benchmark for neural network algorithm evaluation, due to historical reasons. The problem has been described in Section 3.3. Solving the problem requires classifying the input patterns into two classes that are not linearly separable.

Table 5.1. Training Epochs of GGBP vs BP for the XOR Problem

  E_stop    BP (lr=0.5, mo=0.9)    GGBP (lr=0.5)      BP (lr=0.5)
            mean     std dev       mean    std dev    mean    std dev
  0.04      206      89            62      10         2148    410
  0.01      292      45            71      10         -       -
  0.001     1369     310           110     52         -       -

For the sake of comparison, standard BP without the momentum term was also tested, which resulted in a convergence speed about 35 times slower than that of GGBP. As the stopping criterion becomes more stringent, the difference between GGBP and BP becomes more significant. This is no surprise, as GGBP uses an approximation scheme that is best in the neighborhood of the global minimum, while standard BP slows down when the error signal becomes small.

Typical learning curves of both GGBP and BP are shown in Figure 5.4. Note that the GGBP solution oscillates in the beginning. This shows that the linear approximation used in the algorithm is very crude while the random initial weights dominate. The approximation becomes more effective when the weights are brought closer to the global optimal point. We used the heuristic method in the hidden layer weight updating, which may also contribute to the inaccuracy during the initial learning period.

Figure 5.4. Learning curves of GGBP (solid line) vs BP (dotted line); horizontal axis: number of epochs.

5.6.2 The 4-2-4 Encoding Problem

The encoding problem was proposed by Ackley, Hinton and Sejnowski (1985). The problem is to map N-tuple input patterns to N-tuple output patterns through a hidden layer with log2 N units. Passing through the hidden layer requires data compression.

We tested GGBP on a 4-2-4 network. The results are summarized in Table 5.2. The speedup of GGBP over standard BP is a factor of 5 to nearly 25. Similar to the case of the XOR problem, the performance of GGBP is significantly better than standard BP when the solution standard is set higher: while the number of training epochs of BP increased about 4 times as the stopping criterion was tightened from 0.04 to 0.01, that of GGBP increased only marginally.

Table 5.2. Training Epochs of GGBP vs BP for the 4-2-4 Encoding Problem

  E_stop    BP (lr=0.5, mo=0.9)    GGBP (lr=0.5)
            mean     std dev       mean    std dev
  0.04      935      647           177     155
  0.01      4635     2545          187     182
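A sketch of the experimental protocol (repeated runs, counting epochs until E < E_stop), using standard per-pattern BP with momentum as the baseline learner; the seeds, epoch cap, and implementation details are our assumptions, so the printed numbers will not reproduce Table 5.1 exactly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # XOR inputs
T = np.array([0., 1., 1., 0.])                          # XOR targets

def train_once(rng, lr=0.5, mo=0.9, e_stop=0.04, max_epochs=5000):
    """Standard BP with momentum on a 2-2-1 net; returns epochs to E < e_stop."""
    W1 = rng.uniform(-0.5, 0.5, (2, 3))  # hidden weights (last column = bias)
    W2 = rng.uniform(-0.5, 0.5, 3)       # output weights (last entry = bias)
    dW1 = np.zeros_like(W1)
    dW2 = np.zeros_like(W2)
    for epoch in range(1, max_epochs + 1):
        E = 0.0
        for x, t in zip(X, T):
            xb = np.append(x, 1.0)
            h = sigmoid(W1 @ xb)
            hb = np.append(h, 1.0)
            o = sigmoid(W2 @ hb)
            E += 0.5 * (t - o) ** 2
            delta_o = (t - o) * o * (1 - o)            # output delta
            delta_h = delta_o * W2[:2] * h * (1 - h)   # hidden deltas
            dW2 = lr * delta_o * hb + mo * dW2         # momentum update
            dW1 = lr * np.outer(delta_h, xb) + mo * dW1
            W2 += dW2
            W1 += dW1
        if E < e_stop:
            return epoch
    return max_epochs  # runs that stall return the cap

runs = [train_once(np.random.default_rng(s)) for s in range(20)]
print(np.mean(runs), np.std(runs))  # mean/std epochs over 20 runs
```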
In summary, GGBP is based on a new concept: the algorithm considers optimization of the global criterion function in the output space. This leads to faster learning and convergence to a global optimal solution. The speed advantage can be attributed to the fact that the search is guided by the changes in the output space. That is, the weight change in the weight space does not necessarily follow the gradient descent direction. The problems that flat plateaus and deep ravines in the weight space pose for standard BP are avoided.

The second advantage of GGBP is that it does not use the momentum term. Choosing a good combination of learning rate and momentum with standard BP often poses a challenge to inexperienced neural network users. In this sense, GGBP is easier to use than standard BP. We noticed in our experiments that a learning rate less than 0.5 usually produces fast and stable solutions. Although in this implementation GGBP has a constant learning rate, this need not be true. A dynamically adjusted learning rate might improve its performance.

Even with a fixed learning rate (in the output space), GGBP is analogous to standard BP with a dynamic learning rate in the weight space. BP with a dynamically adjusted learning rate has been studied by several researchers (Vogl et al., 1988; Jacobs, 1988; Silva and Almeida, 1990). Those approaches are heuristics; they work in some limited domains and may produce inconsistent results. Viewed as BP with a dynamic learning rate, GGBP provides a learning rate adjusting mechanism that is well founded by the derivation of the algorithm and that avoids detailed consideration of the shape of the error surface in the weight space.

The speedup of GGBP over BP is evidenced in our experiments. A remarkable feature of GGBP is that it still learns quickly even when the error becomes small, while BP becomes hopelessly slow. This feature could be especially beneficial in problem domains where accurate learning is required.

The underlying assumption of GGBP is that the weight changes computed with the updating rules will produce the desired output change, which in turn decreases the global error. Careful examination of this assumption reveals that it is only approximately true. Part of the inaccuracy results from the first order approximation via the Taylor expansion of the output function O(W, X). Another factor that may adversely affect the approximation is that the hidden weights of the neural network depend on all the output units. The asynchronous presentation of target values (for a given pattern) renders the computation of the hidden layer weight changes inaccurate. Nevertheless, the GGBP algorithm is shown to perform significantly better than standard BP.

The performance of GGBP could be improved by considering higher order approximations and a synchronized parallel implementation. It is not yet clear how those improvements can be carried out, but the concept of computing weight changes to produce a desired output change is appealing. Research along this line could be promising.

CHAPTER 6
STOCHASTIC GLOBAL ALGORITHMS

The globally guided backpropagation (GGBP) algorithm introduced in Chapter 5 guarantees a global optimal solution as long as the learning rate is small enough. However, the requirement of a small learning rate may cause slow convergence. The interest in finding global optimal solutions with efficient learning algorithms has prompted neural network researchers to look into the global optimization literature. Some researchers have explored the use of genetic algorithms and simulated annealing in neural network training. In this chapter, we will discuss the search mechanisms of stochastic global algorithms (genetic algorithms, simulated annealing, random search methods, and clustering methods) and their implementation in feedforward neural network training.

6.1 Genetic Algorithm

The concept of the genetic algorithm (GA) was introduced by Holland (1975). Genetic algorithms are a class of search algorithms based on several features of biological evolution, such as crossover (mating) and random perturbation (mutation).
In recent years, genetic algorithms have been successfully applied to a large variety of problems in optimization, learning, and operations management (Goldberg, 1989). Generally, a genetic algorithm has the following components:

1. An encoding/decoding scheme that maps a solution of the problem to a bit stream (chromosome).
2. An initial population consisting of initial possible solutions.

A genetic algorithm starts with an initial population. The members of the population are evaluated with the criterion function. Part of the population is chosen to create the next generation through crossover, mutation, and/or other domain-specific operators. Selection of the parent members is determined by a probability distribution over their fitness as measured by the criterion function (Holland, 1975).

The crossover operator is applied to two parents. A random bit position of the bit stream is chosen, and at that point the parents' bits are crossed over. That is, the parents exchange parts of their bit streams starting from the chosen bit. The mutation operator is applied to a single parent or child: a random bit is chosen and changed to its complement.

For the application of genetic algorithms to feedforward neural networks, a simple implementation is to encode all weights and biases as a single vector (Montana and Davis, 1989). For example, for the XOR network with a single hidden node (cf. Figure 3.8), a solution is represented by a vector w = (w1, w2, w3, w4, w5, w6, w7). An initial population P0 = {w^1, ..., w^m} can be generated with each w_i, i = 1, 2, ..., 7, taken from a random distribution, say, a uniform or Gaussian distribution. The crossover operator is applied as discussed before. The mutation operator can be modified such that a random perturbation is added to a randomly chosen component of the parent.

Montana and Davis (1989) reported that their genetic algorithm outperformed the classic backpropagation algorithm (without momentum). A more involved coding scheme was used by Chalmers (1990), where the weight-space dynamics was coded as linear genomes consisting of bit streams. Belew et al. (1990) considered using the genetic algorithm to generate a good initial weight set W0 that is then used in place of the random initial weights of the backpropagation algorithm. As can be expected, the performance of BP was improved with W0 chosen by GA. The results of Offutt (1989) showed that GA could train a feedforward neural network.

The search mechanism of the genetic algorithm can be implemented within the BP algorithm to help increase learning speed and avoid local minima. The idea is that when the BP algorithm is detected to be in a flat region, where the gradient in the weight space is nearly zero, a large jump incurred by sufficient mutation of the current solution should be more efficient in bringing the solution out of the stagnant status than a gradient descent move. If the solution is stuck at a local minimum, the gradient descent approach simply fails to proceed, while genetic mutation (or crossover of different solutions) may make a solution tunnel through the peaks surrounding the local minimum, and lead to the attraction region of some more promising (local) minimum.

When to apply GA can be determined by the following heuristics. A gradient threshold \theta is defined; \theta can be preset or dynamically derived. A weight w_i is labeled inert whenever |\partial E / \partial w_i| < \theta. Between regular BP sessions, those weights labeled inert are perturbed by a random amount (mutation). If |\partial E / \partial w_i| < \theta for all w_i, then the current solution must be in a flat area of the weight space, and a crossover between the current solution and a different solution can be performed.
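A minimal sketch of the two operators on weight-vector chromosomes, in the spirit of Montana and Davis (1989); the population size, perturbation scale, and function names are our choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover(p1, p2):
    """One-point crossover: parents exchange tails after a random position."""
    point = rng.integers(1, len(p1))
    c1 = np.concatenate([p1[:point], p2[point:]])
    c2 = np.concatenate([p2[:point], p1[point:]])
    return c1, c2

def mutate(w, scale=0.1):
    """Real-valued mutation variant: perturb one randomly chosen weight."""
    w = w.copy()
    w[rng.integers(len(w))] += rng.normal(0.0, scale)
    return w

# XOR net with one hidden node: 7 weights per solution (cf. the text).
pop = [rng.uniform(-1.0, 1.0, 7) for _ in range(10)]  # initial population P0
child1, child2 = crossover(pop[0], pop[1])
print(mutate(child1))
```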
The genetic algorithm augmented backpropagation (GAABP) algorithm is stated below.

Algorithm GAABP

1. INITIALIZE:
* Construct the feedforward neural network. Choose the number of input units and the number of output units equal to the length of the input vector x and the length of the target vector t, respectively.
* Randomize the weights w(0) (including biases) in the range (-0.5, 0.5).
* Specify a stopping criterion such as E < E_stop or n > n_max. Set the iteration number n = 0.

2. FEEDFORWARD:
* Compute the output for the non-input units. The network output for a given example p is
O_{pk} = f(\sum_j w_{jk} f(\sum_i w_{ij} f( ... \sum_l w_{l} x_l))).
* Compute the error using Equation 3.7. If a stopping criterion is met, stop.

3. BACKPROPAGATE:
n = n + 1.
* For each output unit k, compute \delta_k = (y_k - o_k) f'(net_k).
* For each hidden unit j, compute \delta_j = f'(net_j) \sum_k \delta_k w_{jk}.
* If |\partial E / \partial w_{ij}| < \theta, then label(w_{ij}) = inert.

4. UPDATE:
* Mutation: If label(w_{ij}) = inert, then \Delta w_{ij}(n+1) = Random(F, F_stop), where F is the current criterion value, F_stop the desired one, and Random() a function returning a random value of \Delta w_{ij} with a given probability distribution.
* Gradient descent: Otherwise, \Delta w_{ij}(n+1) = \eta \delta_j O_i + \alpha \Delta w_{ij}(n), where \eta > 0 is the learning rate (step size) and \alpha \in [0, 1) is the momentum.

5. REPEAT: Go to Step 2.

Generally, mutation produces variations that may help a stagnant solution to move, while crossover enables larger changes out of local minima. Random mutation may follow a uniform distribution or a Gaussian distribution. The crossover operator returns two new weight sets: the one with the better objective value is taken as the updated solution, and the other is used as the candidate for the next crossover operation.

6.2 Simulated Annealing

Simulated annealing is a general heuristic optimization algorithm based on concepts from statistical physics. Kirkpatrick et al. (1983) noticed that there is a strong similarity between combinatorial optimization and the annealing of solid materials such as metals. In a physical thermodynamic system at thermal equilibrium, the system state is characterized by a probability distribution known as the Boltzmann distribution, as shown in Figure 6.1, where the horizontal axis is the system energy and the vertical axis is the probability of the system being in a state with energy E. From the distribution we notice that (1) a system state with lower energy has a higher probability, and (2) as the temperature T decreases, the system becomes stable at low energy states, because the probability of the system being in a high energy state approaches zero as the temperature decreases. The annealing process is to reduce the system temperature slowly, such that thermal equilibrium is maintained.

Figure 6.1. Boltzmann distribution at different temperatures (T_high and T_low).

[Figure: system energy over time for nonequilibrium and equilibrium states.]
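A small numeric illustration of the two properties just noted, assuming P(E) proportional to exp(-E/T) with arbitrary energy levels of our choosing:

```python
import numpy as np

def boltzmann(energies, T):
    """P(E) ~ exp(-E/T): normalized Boltzmann probabilities at temperature T."""
    p = np.exp(-np.asarray(energies) / T)
    return p / p.sum()

E = [0.0, 1.0, 2.0]  # three system states, ordered by energy
for T in (10.0, 1.0, 0.1):
    print(T, boltzmann(E, T))
# At high T the three states are nearly equiprobable; as T decreases, the
# probability mass concentrates on the lowest-energy state.
```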