## Citation

- Permanent Link:
- http://ufdc.ufl.edu/AA00029777/00001
## Material Information
- Title:
- Temporal self-organization for neural networks
- Creator:
- Euliano, Neil R., 1964-
- Publication Date:
- 1998
- Language:
- English
- Physical Description:
- vii,177 leaves : ill. ; 29 cm.
## Subjects
- Subjects / Keywords:
- Landmarks, Learning, Maps, Memory, Neural networks, Neurons, Phonemes, Signals, Temporal data, Trajectories ( jstor )
- Dissertations, Academic -- Electrical and Computer Engineering -- UF ( lcsh )
- Electrical and Computer Engineering thesis, Ph.D ( lcsh )
- Genre:
- bibliography ( marcgt )
- theses ( marcgt )
- non-fiction ( marcgt )
## Notes
- Thesis:
- Thesis (Ph. D.)--University of Florida, 1998.
- Bibliography:
- Includes bibliographical references (leaves 169-176).
- Additional Physical Form:
- Also available online.
- General Note:
- Typescript.
- General Note:
- Vita.
- Statement of Responsibility:
- by Neil R. Euliano, II.
## Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Neil R. Euliano. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Resource Identifier:
- 40074572 ( OCLC )
- 029539973 ( ALEPH )
## Full Text

TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS

By

NEIL R. EULIANO II

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1998

ACKNOWLEDGMENTS

It is only appropriate that I first acknowledge the guidance and help of my advisor and friend Dr. Jose Principe. Without his support this work would never have been completed. I would also like to thank the members of my committee for their efforts and time spent on my behalf, as well as the members of the Computational NeuroEngineering Laboratory (CNEL). I must also acknowledge my wife Tammy, who was incredibly patient and never wavered in her support of this endeavor. I would also like to thank my children Erin and Matthew, who are just too fun to ignore. Although they extended the amount of time required to graduate, I would not trade the time I spent with them for anything in the world. Lastly, I should thank my family and friends for not treating me like a dead-beat Ph.D. student.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION AND PROBLEM DESCRIPTION
   Temporal Processing
   Static Supervised and Unsupervised Learning
   Adding Memory to Neural Networks
      Short-Term Memory Structures
      Recurrent Networks
   Training Dynamic Neural Networks
   Summary of Problems with Standard ANN Architectures
   The Approach

2 LITERATURE REVIEW
   Biological Research
      Neurons and Learning
      Hippocampus
      Diffusion Equations (Re-Di Equations)
      Biological Representations of Time
      Biological Models for Temporal Processing
   Static Neural Network Learning
      Unsupervised Learning
      Kohonen SOMs
      Neural Gas
      Supervised Training
      Second Order Methods
   Temporal Neural Networks
      Temporal Unsupervised Learning
      Temporal Supervised Neural Networks
         Architectural approaches
         Algorithmic approaches
         Second order methods
      Sequence Recognition
      Comparison of Hidden Markov Models with ANNs

3 TEMPORAL SELF-ORGANIZATION
   Introduction and Motivation
   The Model
   Temporal Self-Organization in Unsupervised Networks
      Temporal Activity Diffusion Through a SOM (SOTPAR)
         Algorithm description
         Representation of memory
         A simple illustrative example
         SOTPAR summary
      Temporal Activity Diffusion in the Neural Gas Algorithm (SOTPAR2)
         SOTPAR2 - algorithm details
         Operation of the SOTPAR2 network
         SOTPAR2 summary
   Temporal Self-Organization for Training Supervised Networks
      Using Temporal Neighborhoods in RTRL
      Review of RTRL and Zipser's Technique
      Dynamic Subgrouping with t
      Estimating the Z matrix
      Illustrative Example
      Grouping Dynamics
      Second Order Methods
      Summary of the Dynamic Subgrouping Algorithm

4 APPLICATIONS AND RESULTS
   SOTPAR
      Landmark Discrimination and Recognition for Robotics
         SOTPAR solution
         Real data collected from the robot
         Summary
      Self-Organization of Phoneme Sequences
         Summary
   SOTPAR2
      SOTPAR2 Vector Quantization of Speech Data
      Time Series Prediction
         Results
         Summary of chaotic prediction
   Dynamic Subgrouping of RTRL in Recurrent Neural Networks
      System Identification
      Comparison of the Number of Neighbors
      Modeling a Set of Nonlinear Passage Dynamics
      Summary of Dynamic Subgrouping

5 CONCLUSIONS AND FUTURE RESEARCH POTENTIAL
   Conclusions
   Future Directions

REFERENCES
BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS

By

Neil R. Euliano II

August, 1998

Chairman: Dr. Jose C. Principe
Major Department: Electrical and Computer Engineering

The field of artificial neural networks (ANNs) has reached a point where they are now being used in everyday products. ANNs, however, have been largely unsuccessful at processing signals that evolve over time. Temporal patterns have traditionally provided the most challenging problems for scientists and engineers and include language skills, vision skills, locomotion skills, process control, time series prediction, and many others. The fundamental concept presented in this dissertation is the formation of temporally organized neighborhoods in ANNs. This temporal self-organization enables the networks to process temporal patterns in a more organized and efficient manner. The concept is biologically inspired and uses activity diffusion to organize the processing elements of the network in an unsupervised manner.

The self-organization in space and time created by my methodology has been applied to three distinct ANN architectures. The new network architectures created by adding the temporal organization are easy to implement and contain properties that are unique in the neural network field. A self-organizing map (SOM) network obtains a unique combination of long-term and short-term memory and becomes organized such that temporal patterns in the input fire sequentially ordered output PEs. These features are utilized in two different applications, a robotic landmark recognition problem and a temporally ordered vector quantization of phonemes in spoken words. When applied to the neural gas algorithm, the resulting network becomes a dynamic vector quantization network. The network anticipates future inputs and adjusts the size of the Voronoi regions dynamically. It was used to vector quantize speech data for a digit recognition problem and to predict a chaotic signal. Lastly, the temporal organization was applied to the training of fully recurrent neural networks. It reduces the computational complexity of the training algorithm from O(N^4) operations to only O(N^2) operations and maintains nearly all of the power of the RTRL algorithm. This training method was tested on two inverse modeling tasks and provided a dramatic improvement in training times over the RTRL algorithm.

CHAPTER 1
INTRODUCTION AND PROBLEM DESCRIPTION

This dissertation focuses on neural network architectures and training methods for processing signals that evolve over time. The fundamental concept underlying the techniques described herein involves the formation of neighborhoods where temporally correlated processing elements (PEs) are clustered together.
We have applied this concept to three different neural network architectures and found that it improves the performance of each one - either by increasing the functionality of the neural network or by improving its training. This chapter contains a description of the problem as well as background information that will help describe the shortcomings of the present methods. Chapter 2 presents a review of the relevant literature necessary to understand the material in the context of the current state of the art. Chapter 3 contains the theoretical description of the techniques and networks proposed by this work, including a few simple examples to elucidate the fundamental concepts. In Chapter 4, six more extensive and practical problems are solved using the temporal neighborhood concepts. These examples include speech recognition, chaotic prediction, system identification and control, and robotics. Chapter 5 concludes the dissertation with a summary of the work and possible future research directions. 2 Temporal Processing Most scientific problems can be grouped into two domains, static and dynamic problems. Static problems consist of information that is independent of time. For instance, in static image recognition, the image does not change over time. On the other hand, time is fundamental to the dynamic problem. The output of a dynamical system, for example, depends not only on the present input but also on the current state of the system, which encapsulates the past of the input. Temporal processing is the analysis, modeling, prediction, and/or classification of systems that vary with time. Patterns that evolve over time have traditionally provided the most challenging problems for scientists and engineers. Language skills (speech recognition, speech synthesis, sound identification, etc.), vision skills (motion detection, target tracking, object recognition, etc.), locomotion skills (synchronized movement, robotics, mapping, navigation, etc.), process control (both human and mechanical), time series prediction, and many other applications all require temporal pattern processing. In fact, the ability to properly recognize or generate temporal patterns is fundamental to human intelligence. Traditional analysis models (ARMA models, etc.) are well known but are usually linear and require significant expertise on the subject and a strict correspondence between the studied process and the constructed model. Artificial Neural Networks (ANNs) offer robust, model-free methods without requiring as much application specific expertise. Secondly, neural nets are adaptive (similar to ARMA models). This is a natural way to compensate for the drift of measuring devices and slow parameter changes inherent in 3 real systems. Thirdly, neural nets are naturally parallel systems that offer more speed in computation and fault tolerance than traditional computing models. [Kan94] Most of the major neural network success, however, has been mainly in the realm of static, instantaneous mappings (for example, static image recognition or pattern matching). Conventional neural net architectures and algorithms are not well suited for patterns that vary over time. Typically, in static pattern recognition a collection of features - visual, semantic, or otherwise - is presented and the network must categorize the input feature pattern into one or more classes. In such tasks, the network is presented with all relevant information simultaneously. 
In contrast, temporal pattern recognition involves processing patterns that evolve over time. The appropriate response at a particular point in time depends not only on the current input, but also potentially on an unspecified number of previous inputs. Static ANNs have been modified in various ways to process time-varying patterns, typically by adding short-term memory to the static pattern classification ability of the various architectures. The short-term memory holds onto some of the past events so that the static ANN can then classify or predict the temporal pattern. As I will explain in the next few sections, however, these hybrid structures (memory added to static architectures) have not been widely successful in the various temporal processing areas. Static Supervised and Unsupervised Learning The purpose of neural processing is to capture the information from an external signal in the neural network structure. This is a form of organization. It can be accomplished in an unsupervised manner using only the input, or in a supervised manner 4 guided by an extra input called the desired signal. Unsupervised training can only extract information from the input signal whereas supervised training can learn mappings between the input signal and the desired signal. They differ in the methods, but at the core they share the same function, learning a representation of the external world. The most common supervised network is the multilayer perceptron (MLP) which uses the error back-propagation [Rum86] learning algorithm. The MLP is characterized by layers (input, hidden, and output) of processing elements (PEs) that have a smooth non-linearity at their output. The nonlinear output of the MLP PEs is what differentiates the MLP from a typical adaptive filter. It provides the capability to map problems that are not linearly separable. In fact, it has been proven that an MLP with one hidden layer can uniformly approximate any continuous function with support in a unit hypercube [Cyb89]. Like in adaptive signal processing using the LMS algorithm, the backpropagation algorithm applies a correction Awji(n) to the synaptic weight wji(n) that is proportional to the gradient of the error a8(n) / aw,(n). The chain rule is used to recursively calculate the error for each layer of the network. Unsupervised networks are typically based on or derived from Hebbian learning. Hebbian learning is a biologically inspired learning rule that finds the correlations present in the input data. Because unsupervised networks can extract information only from the input, they are typically used for data analysis and preprocessing. They cannot reliably be used directly for classification since a labeling of the inputs is required for classification. Both supervised and unsupervised learning will be described in detail in Chapter 2. 5 Adding Memory to Neural Networks How do you use a static neural network architecture to process temporal patterns? The answer is to simply add memory. Without an appropriate memory to store information from the past, a neural network is limited to static pattern recognition or function approximation. The key questions that need to be answered while creating temporal neural networks are what type of memory do you use and how is the memory integrated into the training algorithm. Memory in neural networks can be classified into two categories: short-term memory and long-term memory. 
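Before turning to the memory structures themselves, the backpropagation correction described above can be illustrated for a single sigmoid PE. The sketch below is a minimal Python example with illustrative names and values (not code from this dissertation): it computes the gradient of the instantaneous error with the chain rule and applies the correction proportional to the gradient of the error with respect to each weight.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def delta_rule_step(w, x, d, eta=0.1):
    """One gradient-descent correction for a single sigmoid PE.

    With the instantaneous cost E(n) = 0.5 * (d(n) - y(n))**2, the chain
    rule gives dE/dw_ji = -(d - y) * y * (1 - y) * x_i, and the correction
    is delta_w = -eta * dE/dw (proportional to the error gradient).
    """
    y = sigmoid(np.dot(w, x))          # PE output (smooth nonlinearity)
    e = d - y                          # instantaneous error
    grad = -e * y * (1.0 - y) * x      # dE/dw via the chain rule
    return w - eta * grad              # corrected weights

# Toy usage: drive the PE output toward a target of 0.8.
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.2])
for _ in range(500):
    w = delta_rule_step(w, x, d=0.8)
```

In a multilayer network the same chain rule is applied layer by layer, which is what produces the recursive error propagation mentioned above.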
Short-term memory typically involves a representation of the temporal data, usually by creating multiple copies of the input data at various time delays (e.g. tapped delay line). Long-term memory, on the other hand, is the storage of information from the past into the structure of the network. For example, over time, the training of the network captures information about the input signal and this information can be considered long-term memory. Another example of long-term memory is the storage of patterns in an associative memory. Long-term memory corresponds more closely with the traditional biological concepts of memory. The main difference between the two is that the short-term memory is used for signal representation while the longterm memory is a trained memory that typically cannot represent unknown patterns. Another way to differentiate the two is that short-term memory is usually described by activations of nodes or taps (dynamical information), and long-term memory is stored in the weights of the network (statistical information). 6 Most of the work in temporal ANN research has focused on the application of short-term memories since they provide a mechanism to represent a temporal pattern in a static manner. For instance, a tapped delay line converts a temporal signal into a static pattern (the present input and the N past inputs) which can then be processed by a standard static ANN. Most short-term memory techniques fall into two categories. The first is to explicitly add memory structures and the second is to use recurrent loops in the network to save information. Long-term memory (the weights) has largely been ignored by the ANN research community for the storage of temporal patterns, but I will use it to store temporal correlations in the structure of the network. Short-Term Memory Structures The simplest form of memory is a buffer containing the N most recent inputs. This is often called a tapped delay line or a delay space embedding and forms the basis of traditional statistical autoregressive (AR) models, as well as dynamical system state space manipulations. This is a very popular model and has been used in many applications. The time-delay neural network (TDNN) [Wai90] uses a tapped delay line to convert the temporal pattern into a spatial pattern allowing the architecture to be trained using only standard back-propagation methods. The TDNN, however, has several drawbacks. First, the length of the delay line must be chosen a priori, we cannot work with arbitrary length sequences. In addition, the TDNN requires that the data is properly registered in time with the clock controlling the shift register. It imposes a rigid limit on the duration of patterns and suggests that all input vectors be the same length. Most importantly, two patterns which are very similar temporally (e.g. shifted one step in time) will be very 7 different spatially, which is the metric used by ANNs. For example, [1 0 0], [0 1 0], [0 0 1] are temporally shifted but are spatially on the corners of a unit cube. Using decay traces or exponential kernels to sample the history of the input helps alleviate some of the problems with the TDNN. A common methodology to describe the various memory architectures is to represent the short-term memory as a convolution of the input sequence with a kernel function, kl: .j,(t) = E k,(t - t)x(r), where x(t) is the input. Tank and Hopfield [Tan87] proposed a set of Gaussian kernels that are distributed over time with varying means and widths to sample the time history. 
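The kernel-convolution view just described can be made concrete. The sketch below (illustrative Python with arbitrary parameter choices, not the dissertation's code) builds the two memories discussed in this section: a tapped delay line, where the kernel simply copies the N most recent samples, and an exponential decay trace, where recent samples dominate and older ones fade smoothly.

```python
import numpy as np

def tapped_delay_line(x, n_taps):
    """Row t holds [x(t), x(t-1), ..., x(t-N+1)]: the static pattern a
    TDNN sees after the time-to-space conversion."""
    X = np.zeros((len(x), n_taps))
    for k in range(n_taps):
        X[k:, k] = x[:len(x) - k]
    return X

def exponential_trace(x, mu=0.2):
    """Decay-trace memory m(t) = (1 - mu) * m(t-1) + mu * x(t):
    a smooth, exponentially fading record of the past."""
    m = np.zeros(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = (1.0 - mu) * prev + mu * xt
        m[t] = prev
    return m

x = np.sin(np.linspace(0, 6 * np.pi, 100))   # a toy temporal signal
delayed = tapped_delay_line(x, n_taps=5)     # rigid, pre-wired buffer
trace = exponential_trace(x, mu=0.2)         # smooth, decaying memory
```

A static network fed the rows of `delayed` is exactly the TDNN arrangement criticized above: the buffer length must be fixed in advance, and a one-step shift of the input pattern produces a very different spatial vector.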
The gamma model [DeV91] is an example of an exponential trace memory that uses the set of gamma kernels. The exponential trace memory has a more smooth representation of the past of the input since it decays exponentially. It gives more strength to the more recent inputs. The gamma memory also has a tunable parameter that trades off depth for resolution when the system requires information from farther in the past. Depth roughly refers to how far back into the past the memory stores information and resolution refers to the degree to which information concerning the individual elements of the input sequence are preserved. The exponential trace memories can be computed incrementally and easily, thus greatly increasing its usability. Viewing memory in this way, as a kernel function passed over the input, one can see that almost any kernel function will result in a distinct form of memory. The main problem with all of these memory architectures, however, is that they are all "prewired" one-dimensional cascades of delay elements. TDNNs are also known to train very slowly. 8 Theoretically, memory added to a system can be thought of as creating an embedding of the dynamics into a space larger than the original input space. An embedding of a dynamical system is based on the similarity between delays and derivatives (the first order approximation to a derivative is the difference between the signal and the delayed signal). The delayed values of a single variable can be used to represent the dynamics of a multi-dimensional system. Conceptually this can be rationalized as combining the first-order differential equations for the system (state space description) into a single high-order differential equation for one variable and then using the delay technique to approximate the derivatives of this equation - giving a new representation of the system states. This mathematical construct is effective but not necessarily efficient. For example, a dynamical system requires a minimum of 2D+1 taps to preserve the dynamics of a D dimensional system [Tak81]. If the dimension of the system is unknown, as is often the case, a large embedding is usually used. The embedding also does not efficiently encode the input ordering. It does a time-to-space mapping that treats the temporal information the same as a spatial input, allowing for all permutations of the order of inputs without regard to the limitations imposed by the dynamics of the system. The gamma memory and other convolution memory kernels warp or rotate the embedding space to more accurately (or efficiently) represent the system dynamics. A proper use of the embedding methodology requires a significant amount of work to determine a number of parameters, including the number of taps, the time between taps, the time between vectors, and the number of data samples. This is rarely done. 9 Recurrent Networks The MLP and TDNN are both feedforward networks where the data flow in the network moves strictly forward. No feedback is used. The feedback in recurrent networks can also create memory. The important distinction between the two types of memory is that memory created with feedback can be adapted and trained on-line, creating a flexible and adjustable memory mechanism. Feeding back outputs between different layers can lead to a generalization of storing not only the input but the "state" of the network (i.e. a processed version of the input) [Elm90] [Moz94]. 
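Returning briefly to the gamma memory described above: because each gamma stage is a leaky integrator fed by the previous stage, the taps can be computed incrementally, sample by sample. The sketch below is a minimal illustration (the class name and the choice of the parameter mu are mine); mu near 1 behaves like an ordinary delay line (high resolution, shallow memory), while small mu trades resolution for depth.

```python
import numpy as np

class GammaMemory:
    """Cascade of K leaky integrators (gamma kernels).

    Stage update: g_k(n) = (1 - mu) * g_k(n-1) + mu * g_{k-1}(n-1),
    with g_0(n) = x(n). The K taps are available at every sample.
    """

    def __init__(self, n_taps, mu):
        self.mu = mu
        self.g = np.zeros(n_taps + 1)   # g[0] holds the current input

    def step(self, x):
        prev = self.g.copy()
        self.g[0] = x
        self.g[1:] = (1.0 - self.mu) * prev[1:] + self.mu * prev[:-1]
        return self.g[1:].copy()        # the K memory taps

mem = GammaMemory(n_taps=4, mu=0.5)
for x in np.sin(np.linspace(0, 2 * np.pi, 50)):
    taps = mem.step(x)                  # feed the taps to a static network
```

Setting mu = 1 in this recursion reproduces the tapped delay line exactly, which is how the gamma memory generalizes the fixed buffer.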
In theory, the recurrent architecture is sufficiently powerful to handle arbitrarily complex temporal problems. The focused memory architectures such as the TDNN can also [San97], but may require a very large number of taps and weights. In practice, however, recurrent networks are much more difficult to train than the static networks. The recurrency adds tremendous power to the network (any memory architecture can be created with a recurrent neural network). This power, however, creates very complicated error surfaces. In recurrent networks, the states of the PEs in the network affect both the output and gradients. Therefore calculating the gradients and updating the weights of a recurrent network is a much more difficult and time consuming process. Because of these difficulties, the mainstream engineering community has largely ignored recurrent networks. Recently, however, the recurrent networks are being used more and more as engineers reach the limits of the capabilities of TDNNs and other 10 simpler architectures. Recurrent networks are hot topics in the fields of dynamic modeling and control. Training Dynamic Neural Networks Recurrent networks, either fully recurrent or partially recurrent (e.g. the gamma network), cannot directly use static backpropagation methods since the time history of the network and its inputs are critical to the outputs produced by it. Static backpropagation computes only the gradients based upon the current inputs and outputs. To train a dynamical system, the past information is at least as important as the present and thus a temporal backpropagation technique must be used. Recurrent backpropagation (fixedpoint learning) can be used to train a general recurrent network to move to stationary states. Its assumption of constant inputs and an approach to an attractor, however, precludes the recurrent back-propagation algorithm from real-time temporal processing. The TDNN can use static backpropagation because its memory is fixed and is at beginning of the network. The tapped delay line can be thought of as a temporal preprocessor converting dynamic patterns to static patterns, thus the network is trained in a completely static manner. Most other temporal networks, however, are trained using one of two first-order temporal methods: back-propagation through time (BPTT) [Rum86] or real-time recurrent learning (RTRL) [Wi189]. Both of these methods are gradient descent methods. The RTRL method brings the activations and their derivatives forward in time until the desired signal is available, and the BPTT method propagates back the errors from the desired signal to the beginning of the pattern. RTRL recurrently calculates the gradients of each PE with respect to every weight. This process allows on- 11 line updates (updates every sample), but calculating all the gradients is a time consuming process. In fact, if there are N fully recurrent PEs in a network, the RTRL algorithm requires O(N') operations per sample. The BPTT method requires fewer computations, but is non-causal. Thus it cannot be directly implemented in an on-line fashion. Both methods suffer from the following problems: * The computation of the gradient must occur over time. But the nonlinearity in each layer (actually it is the derivative of the nonlinearity required for the gradients) attenuates these gradients. Thus, if information is required from more than a few samples in the past, these training methods may have a difficult time maintaining and using this information. 
As the errors are propagated, the gradients get small and the impact of a connection weight - even if appropriate-will be masked by other weights if their values are inappropriate. This is true for large feedforward nets as well, but the feedback nature of the recurrent network in time makes this a much bigger problem in recurrent networks. * The desired signal must be defined over time. For example, how do you define a target signal when trying to detect a sequence? If the target is high throughout the pattern, the network will recognize partial sequences. But if the target is high only at the end, the network may be punished for partially recognizing a major portion of the sequence. * Temporal backpropagation is inherently slow both computationally and in terms of the number of training samples required to find an adequate solution. 12 Recently, second order gradient methods like the recurrent least squares (RLS) and the extended Kalman filter have been used in order to reduce the number of training samples required for a good solution. These methods use second order gradient information to determine more accurate data on the shape of the performance surface at the current operating point. This allows for much faster convergence but requires more computations per sample. These second order gradient methods still need to compute the dynamic gradient information and thus suffer from the same problems listed above. Summary of Problems with Standard ANN Architectures In summary, the standard ANN architectures when applied to temporal processing suffer from problems with supervision and problems with short-term memory. The problems that can be attributed to supervised training include: * The problem of assigning credit or blame to actions when the overall success or failure of the system results from a series of actions and cannot be judged instantaneously (i.e. how do you design a target signal?). * Back-propagation training can be very slow, often requiring thousands of training epochs. This problem is derived from many sources. The backpropagation algorithm must either take small steps in the gradient descent or use more computationally intensive error calculations (higher order derivatives). Since all nodes in a network are typically learning independently, several problems may occur. First all the hidden nodes may move together to try to solve the largest source of error, instead of dividing up the problem and each solving a different portion. Second, once the nodes have divided the problem, each tries to solve their portions independently. The 13 movement of each node through the error surface affects all the other nodes, creating a moving target for each node. Thus instead of a direct movement of the nodes to useful roles, we see a "complex dance" among all units [Fah91]. * Recurrent back-propagation trains even slower for many reasons. First, the training methods require more computations than the static backpropagation. Second, the error gradients tend to vanish exponentially as they are propagated through time. Thirdly, the recurrent networks tend to have a much more complicated error performance surface with many local minima, making the gradient search very difficult. * Supervised techniques require presegmented and prelabeled training data. This often must be done by hand and is quite time consuming. The rule of thumb for ANN training is 10 training exemplars for each adjustable weight. Thus for large networks, finding enough training data is a difficult task. 
If there is an insufficient amount of training data, the network will tend to memorize the data rather than draw reasonable generalizations about the data. Problems related to short-term memory structures include the following: * The common short-term memory techniques (tap delay lines, etc.) use a time-to-space mapping to represent the past of the signal. By converting time into just another spatial dimension, the unique features of the temporal information are lost (e.g. continuity, limitations based on the dynamics of the system, etc.). The short-term memory is a representation of the data, not a memory structure. " The typical short-term memory structure is a rigid architecture that must be pre-wired. 14 Short-term memory structures typically add many weights to the input (or interior) layer (e.g. A TDNN with N taps will create N times more weights in the first layer), which exacerbates the problems with the performance surface and the amount of training data. The resulting networks tend to have so many degrees of freedom that they do not generalize well (i.e. memorization due to insufficient training exemplars). The Approach It is a Herculean challenge to attempt to solve all of the above problems. This work focuses on a method of self-organizing PEs in a network architecture based on their temporal correlations. This concept is biologically inspired and has been applied to three different types of neural networks. By creating temporal neighborhoods of PEs in the architecture, we have increased the performance of the networks - either through increased functionality and power or through better training methods. When this technique is applied to a self-organizing feature map (SOFM or SOM), the temporal neighborhoods create traveling waves of activity which diffuse through the PEs. The resulting architecture has a spatio-temporal memory that is selective and recognizes temporal patterns similar to those it has been trained with. The typical ANN memory simply embeds the data for further processing by the ANN, without any mechanism for recognition. This architecture, however, is similar to biological memories in that it responds preferentially to known temporal patterns - this is unique in the neural network literature. When the temporal neighborhood approach is applied to the neural gas algorithm, the network becomes a temporal vector quantizer that again responds preferentially to 15 known temporal patterns. The temporal vector quantizer uses the past of the signal to anticipate the next input by expanding the Voronoi region associated with the expected next input. This allows the network to remove noise in the signal and generate better vector quantization based upon the temporal training and recent past of the signal. This anticipation is similar to how the human brain deals with noise in its environment - it uses the past to predict the future and correlates what it is sensing with this prediction. This is part of the reason humans can understand speech in very noisy environments, and also why two people can hear completely different things from the same set of sounds. When we apply the technique to the training of recurrent neural networks, the new training technique reduces the computational complexity of the RTRL algorithm from O(N4) to O(N2). This dramatic improvement comes with only a slight increase in the number of iterations of training data required. 
The overall speed-up taking into account both the decreased computational complexity and increased number of training samples is still dramatically better. In fact, the O(N4) property of the RTRL algorithm makes it virtually unusable for sizeable networks. In general, the self-organizing nature of the temporal neighborhoods helps alleviate many of the problems with the supervised techniques. Additionally, the novel spatio-temporal memory architectures provide a unique methodology for solving the problems with short-term memory. CHAPTER 2 LITERATURE REVIEW This chapter presents background information and a literature review of topics that either influenced this work, relate to this work, or will be compared and contrasted with this work. The chapter begins with a presentation of current research on biological neural networks and methods of temporal processing. This section is important because it motivated my work. I do not, however, claim that my work is biologically feasible or occurs in nature. Next, this chapter contains a description of the state of temporal neural network research. Since most of the work in temporal neural networks takes the form of extensions to static neural networks, an overview of static neural network learning is also presented. The contrast between the biological and artificial neural networks and the way they process time is striking. Static artificial neural networks are very similar to the static characteristics of real neurons, but temporal neural networks share little in common with their biological counterparts. Biological Research This section contains a description of biological neurons and their temporal characteristics, as well as other biological mechanisms that may help in processing time based signals. Recently, there has been extensive research into the temporal characteristics of the brain as well as in learning dynamics. This research has not yet 16 17 been integrated into the artificial neural network community, but holds promise for creating powerful, temporal ANNs. This information provides a motivation for the main principal of this work - that the creation of temporally organized neighborhoods in a neural network improves the performance of the network for temporal processing. The concept of diffusing temporal information through the network is one of the fundamental concepts used to rationalize the formation of these neighborhoods. Neurons and Learning Fundamentally, the artificial neural network is modeled after a collection of neurons in the brain. Each neuron is composed of three basic components: the cell body, the dendrites and the axon. [Fre92] The dendrites are a widely branching set of filaments that collect information from other neurons. The axon is a long transmission medium that contains fewer branches and transmits the output of the neuron to other neurons. Synapses are the junctions between axons and dendrites. The dendrites collect incoming pulses from other synapses, convert them to currents and sum them all at the initial segment of the axon. This summation works across both dendritic space (summation over all the dendrites) and across time. Each synaptic membrane acts as a leaky integrator with an associated time constant. The critical function of the axon is to transmit the timevarying amplitude of current summed by the dendrites to distant targets without attenuation. [Fre92] If the neuron reaches a certain threshold, itfires or depolarizes, which means that it produces an energy spike on its axon. 
The firing contains a refractory period such that a constantly active neuron will produce an impulse train on its axon. How biological neural networks are trained is not well known, but most of what is known 18 about the training is based on the Hebbian learning concept (which will be discussed later). The Hebbian learning law strengthens synapses (allowing more responsiveness from the post-synaptic neuron) when the two neurons fire at the same time. If there is a consistent correlation between the firing of two neurons, then the pre-synaptic neuron must be at least partially responsible for the firing of the post-synaptic neuron. A static artificial neural network is modeled loosely on an interconnected cluster of neurons. Each neuron is modeled by a processing element (PE) and a set of connections between processing elements. Typically, a processing element simply sums the inputs, nonlinearly warps the output, and then passes this output to its downstream connections. Training is implemented in either an unsupervised manner, usually using a form of Hebbian learning, or in a supervised manner, which has no biological parallel. Notice that none of the temporal characteristics of a neuron are used in static neural networks or their temporal extensions. Recently, there has been significant work on a more complete modeling of individual neurons and their temporal characteristics. Christodoulou and others [Chr95a][Chri93] have modeled the biological neuron including the random spiking nature, excitatory/inhibitory synapses, the transmission delay down the axon, and especially the membrane time constant. The membrane time constant is the main temporal property modeled today. Most modeling approaches use simplifications of the Hodgkin-Huxley equations that result in a leaky integrator model of the neuron membrane potential. This is an important feature of biological neurons, since the past history of the signal remains active on neurons for a short period and can influence the result of future inputs. 19 Additionally, the gas nitric oxide (NO) has been found to be involved in many processes in the central nervous system. One such process is the modification of synaptic strength thought to be the mechanism for learning (and most commonly used in ANNs). Neurons produce NO post-synaptically after depolarization. The NO diffuses rapidly (3.3 x 10-5 cm2/s) and has a long half-life (-4-6 seconds), creating an effective range of at least 150 pm. Large quantities of NO at an active synapse strengthen the synapse (called Long Term Potentiation, or LTP). If the NO level is low, the synaptic strength is decreased (Long Term Depression or LTD) even if the site is strongly depolarized. NO is thus commonly called a diffusing messenger as it has the ability to carry information through diffusion, without any direct electrical contact (synapses) over much larger distances than normally considered (non-local). The NO diffusion and non-linear synaptic change mechanism has been shown to be capable of supporting the development of topographical maps without the need for a Mexican Hat lateral interaction (described later). This seems to be a more biologically plausible explanation of the short range excitation and long range inhibition than the preprogrammed weights of synaptic connections which are typically assumed to implement the same effect [Kre96a][Kre96b]. In addition to the possibility of lateral diffusive messenger effects, the long life of NO can produce interesting temporal effects. 
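The two temporal mechanisms in this subsection, the membrane time constant and the slowly decaying, spatially diffusing messenger, can both be caricatured in a few lines of code. The sketch below is only a loose illustration (the constants and the one-dimensional layout are arbitrary choices of mine, not a model from the cited studies): a leaky integrator keeps recent input active, and a diffusing trace lets activity at one unit influence nearby units a short time later.

```python
import numpy as np

def leaky_integrate(inputs, tau=10.0, dt=1.0):
    """Membrane potential with time constant tau: v' = (-v + input) / tau.
    Recent inputs linger for roughly tau time steps."""
    v = np.zeros(len(inputs))
    for t in range(1, len(inputs)):
        v[t] = v[t - 1] + dt * (-v[t - 1] + inputs[t]) / tau
    return v

def diffusing_trace(firing, decay=0.05, D=0.2):
    """Trace deposited by firing units spreads to neighbours and decays.

    firing: (time, n_units) array of 0/1 spikes laid out on a ring.
    The returned trace lets a unit 'see' activity that occurred nearby
    and slightly earlier, loosely like a slow diffusing messenger.
    """
    trace = np.zeros(firing.shape[1])
    history = np.zeros_like(firing, dtype=float)
    for t, spikes in enumerate(firing):
        lap = np.roll(trace, 1) + np.roll(trace, -1) - 2.0 * trace
        trace = trace + D * lap - decay * trace + spikes
        history[t] = trace
    return history

spikes = np.zeros((100, 20))
spikes[10, 5] = spikes[30, 10] = 1.0    # two events separated in space and time
membrane = leaky_integrate(spikes[:, 5], tau=8.0)
trace_history = diffusing_trace(spikes)
```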
Krekelberg has shown that NO can act as a memory trace in the brain that can allow the temporal correlations in the input to be converted into spatial connection strengths. [Kre96b] This mechanism for capturing the temporal correlations of the input using an NO diffusion process is similar to the method I will present in more detail in Chapter 3. 20 Hippocampus The hippocampus is the primary region in the mammalian brain for the study of memory and learning because: [Bur95] * hippocampal damage causes memory loss, * the hippocampus is the simplest form of cortex * long-term potentiation (LTP) has been found in the hippocampus (synaptic plasticity) * cell firing in the hippocampus is spatially coded (place cells). * all sensory inputs converge on the hippocampus and the output from the hippocampus is extensively divergent with projections onto most of the cortical areas. Figure 2-1 shows the major subfields and their projections of the hippocampus. The hippocampus is formed from sheets of cells, with most of the interconnections contained in these sheets (minimal connections between sheets). Most projections have large divergence and convergence, except the dentate gyrus to CA3 projection which has CA1 Subsculum Dentate CA3 Hippocompus Figure 2-1: The major subfield of the hippocampus 21 mossy fiber projections from each granule cell, making very large synapses onto only 14 or so pyramidal cells. Hebbian LTP has been observed in much of the hippocampus. A variety of intemeurons provide feed-forward and feed-back inhibition. One of the most interesting (and for this work, most relevant) aspects of the Hippocampus is that it contains "place cells" and other functional clusters of neurons. Place cells are small patches of neurons that selectively fire only when the animal is in a specific location of its environment. These are groups of thousands of neurons that fire together and are linked to other place cells. As the subject moves through a familiar set of locations, the patches fire sequentially and the linking of these patches allows for predictive navigation. They have been found in fields CA3 and CAl of the rat hippocampus. [Bur93] These place cells are temporally and spatially organized neurons that are correlated in their reaction to temporally occuring events. Diffusion Equations (Re-Di Equations) The diffusion equation (or the reaction-diffusion equation if the medium is active) can be used to explain certain characteristics of a neuron and neuronal clusters. In its generic form, however, it is used in many other fields. Objects such as cells, bacteria, chemicals and animals often have the property that each individual moves about in a random manner (e.g. brownian motion). When a concentration of these objects occurs, this random motion causes the objects to spread out into lower concentration areas of the environment. When this microscopic movement of the group results in macroscopic motion, we call it a diffusion process. If we assume a one-dimensional motion and a random walk process, we can derive the diffusion equation 22 from a probabilistic treatment of the process. By finding the probability p(m,n) that a particle reaches a point m steps away at n time steps in the future, we find the distribution of particles at time n. 
Using the random walk assumption and allowing n to be large, it can be shown that the resulting distribution is the Gaussian or normal probability distribution:

$$p(m,n) \approx \sqrt{\frac{2}{\pi n}}\,\exp\!\left(-\frac{m^2}{2n}\right), \qquad m \gg 1,\; n \gg 1.$$

Next, we determine the probability of finding a particle in an area between (x-Δx, x+Δx) at time t by rewriting the equation for p(m,n) as the sum of the probability of moving right from x-Δx at time t-Δt or moving left from x+Δx at time t-Δt. If we take the partial derivative of p with respect to t and let Δx → 0 and Δt → 0, we obtain the diffusion equation:

$$\frac{\partial p}{\partial t} = D\,\frac{\partial^2 p}{\partial x^2},$$

where D is the diffusion coefficient, which defines how fast the particles spread. A typical diffusing process creates a spreading of a concentration into ever shallower and shallower Gaussians as shown in Figure 2-2.

Figure 2-2: Diffusion process

The reaction-diffusion equations were originally proposed by Turing in 1952 and are typically used to explain natural pattern formation [Tur52]. They have been used to model insect populations, the formation of zebra stripes, crystal formation, galaxy formation, and many other naturally occurring patterns and self-organizing systems. Turing's proposal modeled patterns found in nature by an interaction of chemicals called "morphogens". The different morphogens react with each other and also diffuse throughout the substance via the equation:

$$\frac{\partial m_i(x,t)}{\partial t} = f\big(m_i(x,t), m_j(x,t)\big) + D_m\,\frac{\partial^2 m_i(x,t)}{\partial x^2},$$

where m_i(x,t) is the concentration of morphogen i at time t, D_m is the diffusion coefficient, and f(m_i, m_j) is a function (typically nonlinear) that represents the interaction between morphogens. By varying the interaction between chemicals and the speed of diffusion, complicated spatial patterns of chemicals are created.

The reaction-diffusion equations have also been used to explain traveling waves such as the traveling impulse down the axon of a neuron. If the reaction portion of the Re-Di equations represents the kinetics of the system and these kinetics are nonlinear, then the system can create a traveling wave. One requirement for a traveling wave is that the kinetics of the system are excitable, where excitable implies two stable states such that a small excursion away from one state may drive the system to the next state. Another requirement is that after excitation, the system must relax back to the original state. An example of such a system is the FitzHugh-Nagumo (FHN) equations, a simplified version of the Hodgkin-Huxley model that describes the transmission of energy down the axon of a neuron. The FHN equations can be described by the following system of three equations [Mur89]:

$$\frac{\partial u}{\partial t} = f(u) - v + D\,\frac{\partial^2 u}{\partial x^2}, \qquad \frac{\partial v}{\partial t} = b\,u - \gamma\,v, \qquad f(u) = u(a-u)(u-1),$$

where u is roughly equivalent to the membrane potential, v lumps the effects of most of the ionic membrane currents, and a, b, and γ are constants. The null clines of the kinetics in the (u,v) phase plane are shown in Figure 2-3.

Figure 2-3: Null clines for dynamics of FHN equations

The general concept is that when one element fires, its activity is diffused to its neighbors and pushes them just far enough from their stable state to move them to the "excited" state. Next, these newly excited elements excite their neighbors, and so on. The elements which were excited originally then begin to relax, creating a traveling wave of activity. The traveling wave from the FHN equations is shown in Figure 2-4. In this case, not only does the system relax, it also has a refractory phase which inhibits future excitation for a period of time [Tys88].

Figure 2-4: Traveling waves caused by the FHN equations
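The behavior described above is easy to reproduce numerically. The sketch below integrates the one-dimensional FHN system with explicit Euler steps (the grid size, time step, and the constants a, b, γ, and D are illustrative choices of mine, not values from the text): a brief supra-threshold stimulus at one end launches a pulse of activity that travels along the line while the stimulated region recovers behind it.

```python
import numpy as np

def fhn_wave(n=200, steps=5000, dt=0.05, a=0.25, b=0.002, gamma=0.003, D=1.0):
    """Explicit-Euler integration of the 1-D FitzHugh-Nagumo medium:

        du/dt = u(a - u)(u - 1) - v + D * d2u/dx2
        dv/dt = b*u - gamma*v

    A supra-threshold stimulus on the left edge excites its neighbours
    through the diffusion term, producing a traveling pulse.
    """
    u = np.zeros(n)
    v = np.zeros(n)
    u[:5] = 1.0                                     # initial stimulus
    snapshots = []
    for step in range(steps):
        lap = np.zeros(n)
        lap[1:-1] = u[2:] - 2.0 * u[1:-1] + u[:-2]  # discrete Laplacian (interior)
        u = u + dt * (u * (a - u) * (u - 1.0) - v + D * lap)
        v = v + dt * (b * u - gamma * v)
        if step % 1000 == 0:
            snapshots.append(u.copy())              # wavefront advances between snapshots
    return np.array(snapshots)

waves = fhn_wave()   # each row is a later snapshot of the excitation profile
```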
Diffusion and other biologically plausible local communication techniques have increasingly been used in neural networks. For example, the Kohonen algorithm can be implemented in analog hardware with an active medium using diffusion [Ruw93]. Diffusion has also been used frequently in visual imaging systems [Cun94]. Sherstinsky and Picard have proposed a cellular neural network based on Re-Di equations that can solve optimization problems [She94]. One key aspect of this work is that diffusion in the PE space of a neural network allows temporal information to be transmitted and stored using only local communication. This is similar to the diffusion of NO in the brain, which is thought to affect the plasticity of synapses in areas where many neurons are firing at once. Without direct connectivity between two PEs, communication and temporal memory can be implemented using the local storage and transmission of a diffusing object (in our case, diffusing activity).

Biological Representations of Time

Another example of neurobiological research that has not been used in ANNs is the concept of rhythm. Recently, there has been some interesting research on oscillators, central pattern generators, rhythm, and their effect on human pattern recognition. Rhythm has been studied in biology, and it has been found that rhythmic signals from insects can be entrained or phase-locked to an external rhythmic pattern - without high-level processing (the patterns are faster than the minimum response latency) [McA94]. There is evidence that the dynamics of many biological systems have natural rhythms that share the same frequency. Communication and locomotion, for instance, are highly dependent on rhythm and pacing. It has also been suggested that EEG rhythms play an important role in learning and temporal recognition. For instance, neurons are thought to modify their synaptic strengths only when the θ rhythm is in the correct phase. The θ rhythm is a sinusoidal component of the EEG that ranges from 7-12 Hz. The θ rhythm has been linked with displacement movements (e.g. walking) and many other repetitive actions. Since the θ rhythm must propagate through the neural tissue, it also could play the role of a moving wavefront that controls learning. Rhythm can be thought of in two ways: either as an external pacemaker that synchronizes the network in some fashion, or as the output of a collection of neurons that are working in unison. For the first case, there is little if any research on the effects of an external pacemaker on temporal ANNs. The pacemaker would create a time-varying network where the output of the network is dependent on the time or phase of the pacemaker. The pacemaker could also act as a sampling signal. For instance, learning may only occur at a specific phase of the θ rhythm. In the second case, the rhythm could be the result of synchronized processing. For instance, waves of activity in the brain could be caused by the processing of the spatio-temporal patterns constantly input to the network by the continuous motions of the eyes and other sensory muscles.
It is based on the anatomy of the dentate gyrus (in the mammalian hippocampus) and can be summarized as follows: * The hippocampus is organized into transverse slices called lamellae * The majority of connections in the hippocampus do not leave a lamella (small longitudinal spread) * Sensory inputs arrive via the perforant path to excite cells directly * A small number of mossy fibers connect cells longitudinally (across lamellae) * Cells excited by an input spread excitation to its neighbors, causing a wave of activity to travel down the cell's lamella The wave formation is based on the pyramid and granule cells receiving excitatory influences from the hippocampal input pathways that in turn excite intemeurons whose axons inhibit the pyramid and granule cells. This excitation and inhibition create the waves of activity in the lamella. The memory is created by the association of the various waves in different lamellae via the mossy fibers that interconnect the lamellae. Each wave is created by a sensory input that triggers a cell in a lamella and can move a short distance before dying. Randomly distributed mossy fibers interconnect the lamella. The connection weights are strengthened in a Hebbian manner - when two waves from different lamella are coincident with a connecting mossy fiber, this connection is strengthened. Thus, the next time the first input wave passes the same position, it can automatically trigger the second wave even without the corresponding input. This is shown in Figure 2-5. For longer 28 temporal relationships, one wave will trigger a second wave in another lamella via prestrengthened longitudinal connections which will continue after the first wave has died. Input I at Time TO Wave at TO+01 Strenoghtned Mossy Fiber Input 2 at Time T1 Mossy/ Figure 2-5: Stanley and Kilmer's wave model [Sta75] Biological Models for Temporal Processing Living neurons act as leaky integrators with time constants on the order of tens to hundreds of milliseconds. This can lead to the storage of information in a way that may lead to temporal sequence processing. Most ANN temporal methods store the information in a spatial manner. The spatial approach to signal storage is used in the brain for auditory and visual processing (e.g. SOMs). The way in which these maps are then processed is not necessarily spatial. Reiss and Taylor propose an interesting temporal sequence storage mechanism based on a leaky integrator network [Rei91 ]. The basic concept is to use the leaky integrator neurons as temporary storage for an associative memory that is implemented like a single layer neural net. The network has been shown to have a capacity proportional to the number of neurons. The problem with this network is that the connection matrix seems to be very heavily skewed to only predicting the next input with little information from further in the past. This is similar to 29 a simple state machine or markov chain. An interesting part of this work is the possible connection to the function of the hippocampus. The memory network corresponds to the dentate gyrus, the CA3 corresponds to the predictor, and the input line is similar to the perforant path (between EHC, DG, and CA3). Kargupta and Ray proposed a temporal sequence processor that is based on the reaction-diffusion equations. [Kar94] Drawing an analogy between chemical diffusions in biology and spatio-temporal sequential processing, their model is based on a collection of cells that react to different inputs. 
When a cell becomes active (by recognizing its input), it outputs its own specific chemical. This chemical diffuses throughout the medium containing the cells. Each cell contains a memory of the chemical makeup at its location when it fires. The background medium thus stores the temporal history of the signal by diffusing all the various chemicals. This approach is more of a chemical model than an information processing model and has several difficulties when applied to realistic problems.

Static Neural Network Learning

This section contains a summary of the static neural network learning mechanisms. Almost all of the work in temporal ANNs is based on the principles from static ANNs. Since unsupervised training is most similar to known biological learning mechanisms, it will be presented first. Unsupervised learning does not have a desired signal and extracts information only from the input signal. As such, unsupervised techniques typically do not directly implement classifiers, but are usually used for preprocessing the input. For example, unsupervised networks can be trained to perform principal component analysis (PCA), vector quantization (VQ), and data reduction. Supervised learning is presented next; these algorithms use a desired signal to train the network to mimic the desired input-output map. The desired signal can be thought of as a teacher or external influence that guides the network to the desired state. As we mentioned before, there is no known biological analog to supervised training.

Unsupervised Learning

Most unsupervised (also known as competitive or self-organizing) learning is based on Hebbian learning. Hebbian learning is derived from the work of the neuropsychologist Hebb, who noted in 1949 that when cell A repeatedly participates in the firing of cell B, a growth process occurs between the two cells which increases the efficiency of the link between cell A and cell B. This can be stated as "neurons that fire together, wire together". This mechanism is often called correlation learning because the links are strengthened when there is a statistical correlation over time between the presynaptic and postsynaptic activities. To avoid excessive weight growth, Hebbian synapses typically also include a decrease in the strength of a connection between two cells which are uncorrelated. Conversely, anti-Hebbian learning is a learning rule that increases the strength of a connection when the presynaptic and postsynaptic signals are negatively correlated and weakens it otherwise. A typical expression for Hebbian learning is

Δw_kj(n) = η y_k(n) x_j(n)

where w_kj represents the synaptic weight between cell k and cell j, x_j is the presynaptic activity, y_k is the postsynaptic activity, and η is the learning rate. This rule, however, does not include the weakening of uncorrelated signals, and thus the weights will grow forever. Introducing a nonlinear forgetting factor into the equation can control the weight growth:

Δw_kj(n) = η y_k(n) x_j(n) − α y_k(n) w_kj(n)

where α is the decay constant. This equation can be rewritten as

Δw_kj(n) = α y_k(n) [c x_j(n) − w_kj(n)]

(with c = η/α), which is the standard Hebbian learning rule. Notice that when the postsynaptic neuron fires, w_kj moves exponentially toward c x_j. By manipulating the definitions of the variables, this equation can be reformulated into the competitive learning rule. In competitive learning, a group of neurons is clustered such that one and only one neuron wins a competition for each input.
Algorithmically, the winner is simply selected by choosing the PE with the highest/lowest output, which can be physically implemented using lateral inhibition between nodes. Biologically, neurons fire in clusters, and the competition between clusters is believed to be due to long-range inhibition and short-range excitation (a concept that will come up again and again). In the case of a competitive cluster, the winning node has an output value of 1, and the others are all zero. Thus the Hebbian learning rule becomes

Δw_kj(n) = η [x_j(n) − w_kj(n)]   if neuron k wins the competition
Δw_kj(n) = 0                      if neuron k loses the competition

Only one neuron (or cluster, in biology) learns at each stage and its weights move toward the location of the input. Thus, the individual nodes specialize on sets of similar patterns and become feature detectors. Competitive learning is typically used for clustering or vector quantization. Hebbian learning is used widely throughout the neural network field, but in its simplest form is often used for principal component analysis.
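For concreteness, the sketch below implements the two updates just described: the Hebbian rule with a forgetting factor and its hard-competition (winner-take-all) special case. It is only a minimal illustration; the learning rates, toy data, and function names are assumptions made here and are not taken from this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, alpha = 0.1, 0.1
W = rng.random((5, 2))                      # one weight vector per PE (5 PEs, 2-D input)

def hebbian_with_decay(W, x, y):
    """dw_kj = eta*y_k*x_j - alpha*y_k*w_kj (Hebbian term plus forgetting term)."""
    return W + eta * np.outer(y, x) - alpha * y[:, None] * W

def competitive(W, x):
    """Winner-take-all special case: only the best-matching PE moves toward x."""
    k = np.argmin(np.linalg.norm(W - x, axis=1))
    W[k] += eta * (x - W[k])
    return W

for _ in range(2000):
    x = rng.random(2)
    W = competitive(W, x)                   # or: W = hebbian_with_decay(W, x, np.tanh(W @ x))
print(np.round(W, 2))                       # rows of W approximate cluster centers of the input
```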
Kohonen SOMs

The Kohonen map or self-organizing feature map (SOM) is a neural network inspired by sensory mappings commonly found in the brain [Wil76][Koh82]. A self-organizing feature map creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. The result of mapping similar inputs to neighboring outputs is a global organization that is extracted from the local neighborhoods. Topographical computational maps have been found in many locations in the brain, including the vision areas (angle of tilt of a line stimulus, motion direction), auditory areas (representations of frequency, of amplitude, and of time intervals between acoustic events) and motor control areas (control of eye movements). More abstract topographic maps have been found in other parts of the brain. For example, there is a map for the representation of the location of a sound source based on the interaural differences in an acoustic signal. The SOM is one of the most widely used unsupervised artificial neural network algorithms [Kan94].

The typical SOM is composed of an input layer and an output layer as shown in Figure 2-7. The input layer broadcasts the vector input to each node in the output layer, scaled by the weights of each connection. Each node has an input term and a lateral feedback term. The topographic mapping is created by the local lateral feedback, where neighboring connections are excitatory and more distant connections are inhibitory. This is called a "mexican hat" lateral connectivity and is shown in Figure 2-6. The result is similar to the standard competitive network except that the network creates a more gentle cutoff, thus creating a Gaussian-shaped output after the lateral interconnections have stabilized. This is called a "soft-max" rule (or soft competition), where the winning PE and a few "near-winner" PEs remain active. The competitive rule is called a "hard-max" rule, hard competition, or winner-take-all rule. Depending on the characteristics of the mexican hat lateral interconnections, the resulting output will be a Gaussian of varying widths centered roughly at the location of the maximum output. The process can be described by the following equation

y_j = φ( I_j + Σ_{k=-K}^{K} c_jk y_{j+k} )

where y_j is the output of the j'th node, I_j is the input to the j'th node scaled by the weights into the j'th node, c_jk are the lateral weights described above as the mexican hat function, and φ is a nonlinear saturating function which keeps the node outputs from growing without bound.

Figure 2-6: Mexican hat lateral connectivity and Gaussian shaped output

Figure 2-7: Connectivity of an SOM

After the outputs have stabilized, the network can be trained with a simple Hebbian-like rule to train the weights of the winning node and its neighbors. The neighboring nodes can be trained in proportion to their activity (Gaussian), or all neighbors within a certain distance can be trained equally. The learning rule can be described as follows:

w_j(n+1) = w_j(n) + η(n) h_{j,i(x)}(n) [x(n) − w_j(n)]

where w_j are the weights of node j, x(n) is the input at time n, h_{j,i(x)}(n) is the neighborhood function centered around the winning node i(x), and η(n) is the learning rate. Notice that both the learning rate and the neighborhood size are time dependent and are typically annealed (from large to small) to provide the best performance with the smallest training time. A simplified approximation to this algorithm consists of two stages: first, find the winning node (the one whose weights are closest to the input), then update the weights of the winner and its neighbors in a Hebbian manner. The SOM is an unsupervised network with large local connectivity, but unsupervised networks do not typically suffer from overtraining. Because the input is mapped onto a discrete, usually lower-dimensional output space, the SOM is typically used as a vector quantization (VQ) algorithm. The weights of the winning node are the vector-quantized representation of the input.

A typical example of an SOM is mapping a two-dimensional input space onto a one-dimensional SOM. Figure 2-8 shows a random distribution of points that make up the input space in two dimensions. The points are plotted such that the coordinates of the point represent the input data. When this input data is presented to the 1-D SOM, the map trains the nodes to maintain local neighborhoods in the input space. These local neighborhoods force a global ordering of the output nodes. After training, the nodes of the SOM are ordered and the weights of the nodes represent the center of mass of the region of the input space to which they respond. By plotting the weights of the SOM PEs onto the input space, one can see where the center of each VQ cluster is located. The SOM is more than just a clustering algorithm: it also orders the PEs such that neighboring PEs respond to neighboring inputs. To show this, we connect neighboring PEs with a line. The right side of Figure 2-8 shows how the SOM maps a one-dimensional structure to cover the two-dimensional input space. This clearly shows how the global ordering has occurred and that the 1-D output snakes its way through the input space in order to maintain its topographic ordering and still cover the input space.

Figure 2-8: Example of a 1-D SOM mapping a 2-D input
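The simplified two-stage procedure can be sketched as follows. The annealing schedules, map size, and uniform 2-D toy data are assumptions chosen only for illustration, not settings used in this work.

```python
import numpy as np

rng = np.random.default_rng(1)
n_nodes, n_iter = 20, 5000
W = rng.random((n_nodes, 2))                    # 1-D SOM string mapping a 2-D input
lattice = np.arange(n_nodes)                    # node positions along the string

for n in range(n_iter):
    x = rng.random(2)                           # next input sample
    frac = n / n_iter
    eta = 0.5 * (0.01 / 0.5) ** frac            # annealed learning rate
    sigma = 5.0 * (0.5 / 5.0) ** frac           # annealed neighborhood width
    winner = np.argmin(np.linalg.norm(W - x, axis=1))            # stage 1: competition
    h = np.exp(-(lattice - winner) ** 2 / (2 * sigma ** 2))      # Gaussian neighborhood
    W += eta * h[:, None] * (x - W)                               # stage 2: update
# after training, plotting the rows of W and joining neighbors traces an ordered
# string of prototypes through the input square (cf. Figure 2-8)
```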
Neural Gas

The neural gas algorithm is similar to the SOM algorithm without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distances to the reference vectors, not on the distance to the winning PE in the lattice. The neural gas algorithm has been shown to converge quickly to low distortion errors which are smaller than those of k-means, maximum entropy clustering or the SOM algorithm [Mar93]. It has no predefined neighborhood structure as in the SOM and for this reason works better on disjoint or complicated input spaces.

Martinetz et al. [Mar93] showed an interesting parallel between most of the major clustering algorithms. The main difference between them is how the neighborhood is defined. For k-means clustering, there is no neighborhood; only the winner is trained. This is a hard max:

Δw_i = ε δ_{i,i(x)} (x − w_i)

where δ_{i,i(x)} is 1 when node i is the winner i(x) and 0 otherwise. For maximum entropy clustering, the neighborhood is defined as a soft max based on the distance between the input and the weight vector:

Δw_i = ε h_λ(‖x − w_i‖) (x − w_i)

For the SOM, the neighborhood is based on the position in the SOM lattice:

Δw_i = ε h_σ(i, i(x)) (x − w_i)

and for the neural gas algorithm the soft max is based on the ranking of the node; for instance, the closest node gets the largest update, followed by the second closest, etc.:

Δw_i = ε h_λ(k_i(x, w)) (x − w_i)

where k_i(x, w) is the rank of node i when all nodes are ordered by their distance to the input.
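The parallel can be made concrete with a single generic update in which only the neighborhood function changes. The sketch below uses assumed decay parameters and toy data and is meant only to illustrate the comparison above.

```python
import numpy as np

def update(W, x, h, eps=0.1):
    """Generic clustering step: every prototype moves toward x, weighted by h."""
    return W + eps * h[:, None] * (x - W)

def h_kmeans(W, x, lattice, lam):                 # hard max: only the winner
    h = np.zeros(len(W)); h[np.argmin(np.linalg.norm(W - x, axis=1))] = 1.0
    return h

def h_maxent(W, x, lattice, lam):                 # soft max on input-space distance
    return np.exp(-np.linalg.norm(W - x, axis=1) ** 2 / lam)

def h_som(W, x, lattice, lam):                    # soft max on lattice distance to winner
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    return np.exp(-np.abs(lattice - lattice[winner]) / lam)

def h_neural_gas(W, x, lattice, lam):             # soft max on the distance ranking
    ranks = np.argsort(np.argsort(np.linalg.norm(W - x, axis=1)))
    return np.exp(-ranks / lam)

rng = np.random.default_rng(2)
W, lattice = rng.random((10, 2)), np.arange(10)
for _ in range(3000):
    x = rng.random(2)
    W = update(W, x, h_neural_gas(W, x, lattice, 1.0))   # swap in any h_* above
```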
Supervised Training

Supervised training is the most commonly used mechanism for training neural networks, especially for classification and function approximation. Many applications can be framed as a function approximation problem. The main supervised training technique is called backpropagation [Rum86]. Typically it is applied to multilayer perceptrons (MLPs), which consist of multiple layers of PEs, each of which computes a sum of products on its input and then applies a saturating nonlinearity. The backpropagation algorithm works by first computing the forward activations of the network: the inputs are applied to the network and the output of every PE in the network is computed and stored. The activity of the network outputs is then compared to the desired activity of the network outputs and an error is computed. This error is then propagated backwards (thus the name backpropagation) through the network and is used to adapt each of the weights in the system. This is graphically depicted in Figure 2-9.

Figure 2-9: Activation and error propagation in a static neural network

The output of each PE in an MLP can be described by the following equation:

y_j = f(net_j) = f( Σ_i w_ji x_i + b_j )

where w_ji represents the weight from PE i to PE j, x_i represents the output of PE i (or the external input for the first layer), b_j represents the bias for PE j, and f(·) is the nonlinearity of the PE, which is typically a logistic function (which ranges from 0 to 1) or a tanh function (which ranges from -1 to 1). The performance surface that is searched using gradient descent is defined by

J = (1/2N) Σ_p Σ_i e_pi² = (1/2N) Σ_p Σ_i (d_pi − y_pi)²

where e_pi is the difference between the output and the desired signal, p is the index over the patterns and i the index over the output PEs. We want to update each weight based on the partial derivative of J with respect to that weight. For an output PE this can be written as

∂J/∂w_ij = (∂J/∂y_i)(∂y_i/∂net_i)(∂net_i/∂w_ij) = −(d_i − y_i) f'(net_i) x_j = −e_i f'(net_i) x_j

If we define the local error δ_i for the i'th PE as

δ_i(n) = e_i(n) f'(net_i(n))

then we can generalize the backpropagation algorithm for the MLP and the LMS algorithm for linear systems. All the weights in gradient descent learning are updated by multiplying the local error δ_i(n) by the local activation x_j(n), according to Widrow's estimation of the instantaneous gradient first used in the LMS rule:

Δw_ij(n) = η δ_i(n) x_j(n)

The difference between these algorithms is the calculation of the local error. If the PE is linear, then we have a linear combiner and the derivative of f is a constant; the equation then becomes the LMS rule. If the PE is nonlinear and is an output PE, then the local error is simply the difference between the output and the desired signal scaled by the derivative of the nonlinearity; this is the delta rule. If the PE is nonlinear and is a hidden-layer PE, then the error is the sum of the backpropagated errors from the PEs that follow it:

δ_i(n) = f'(net_i(n)) Σ_k δ_k(n) w_ki(n)

This simple rule nicely summarizes the backpropagation algorithm and shows its relationship to other adaptive algorithms.

Second Order Methods

The standard backpropagation method of training a neural network uses the LMS approximation to gradient descent, which uses only an instantaneous estimate of the gradient. Second order methods collect information over time to get a better estimate of the gradient, thus allowing faster convergence at the cost of more computation per cycle. In linear adaptive filtering, the recursive least squares (RLS) algorithm is used for exactly this purpose. The RLS algorithm is based upon estimating the inverse of the correlation matrix of the input. With this information, the RLS algorithm can often adapt as much as ten times faster than LMS. The RLS algorithm can also be formulated as a special case of the Kalman filter. Besides faster convergence, the RLS algorithm has two other advantages [Hay96]: the eigenvalue spread of the correlation matrix does not adversely affect the training (unlike in LMS), and the learning rate is automatically determined (the Kalman gain). Since RLS and Kalman filtering are derived for linear systems, they must be modified for use with nonlinear systems. These modifications are typically called extended RLS or extended Kalman filtering. The most straightforward approach is to linearize the total cost function and directly apply RLS. This requires the storage and update of the complete error covariance matrix, whose size is the square of the number of weights in the network [Hay94]. A better approach is to apply RLS to each node individually and linearize the activation function of the PE using a Taylor series about the current operating point. This method is called the multiple extended Kalman algorithm (MEKA) [Hay94] and reduces the computational requirements by ignoring the cross-terms between PEs.

Temporal Neural Networks

As we stated previously, the majority of the temporal neural networks are extensions of the static neural networks, either by adding memory or by adding recursive connections. This section could just as easily be called "Extending static architectures to include time". Again, this topic will be discussed in two sections, supervised and unsupervised neural networks.

Temporal Unsupervised Learning

This section presents the methodologies currently available to add temporal information to unsupervised networks. Almost all work done on temporal unsupervised training has used self-organizing maps. As mentioned before, a self-organizing map (SOM) creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns.
In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. There have been many attempts at integrating temporal information into the SOM. One major technique is to add temporal information to the input of the SOM. For example, exponential averaging and tappeddelay lines were tested in [Kan90][Kan91], while coding in the complex domain was implemented in [Moz95]. Another common method is to use layered or hierarchical SOMs where a second map tries to capture the spatial dynamics of the input moving through the first map [Kan90][Kan91]. More recently, researchers have begun integrating memory inside the SOM, typically with exponentially decaying memory traces. Privitera and Morasso have created a SOM with leaky integrators and thresholds at each node which activate only after the pattern has been stable in an area of the map for a certain amount of time. This allows the map to pick out only the "stationary" regions of the input signal and use these sequences of regions to detect the input sequence [Pri93][Pri94][Pri96]. The SARDNET architecture [Jam95] adds exponential decays to each neuron for use in the detection of node firing sequences. Once a node fires for a particular sequence, it is not allowed to fire again. Therefore, at the end of the sequence presentation, the sequence of node firings can be detected (or recreated) using the decayed outputs of the SOM. The exponential decay, however, provides poor resolution at high depths and thus will perform poorly with noisy and/or long sequences. 42 Chappell and Taylor have created a SOM which has neurons that hold the activity on their surface via leaky integrator storage [Cha93]. This activity is added to the typical spatial distance between input and weight vector to determine the next winner. The same or neighboring nodes will thus be more likely to win the competition for successive elements in a sequence. This creates neighborhoods with sensitivity to the previous input (i.e. context). There is not a successful method available yet to train these networks. The learning law proposed by Chappel and Taylor can lead to an unstable weight space. The methodology seems to work for patterns of binary inputs with at most length 3. Critchley [Cri94] has improved the architecture by moving the leaky integration to the synapses. This gives the network a much better picture of the temporal input space and has much more stable training, but becomes nothing more than an exponentially windowed input to a standard Kohonen map, as proposed by Kangas [Kan90]. The temporal organization map (TOM) integrates a cortical column model, SOM learning and separate temporal links to create a temporal Kohonen map [Dur96]. The TOM is split into super-units that are trained via the SOM learning algorithm. Winning units from each super-unit fire and then decay. Temporal links are made between the currently firing node and any node which has an activity above a threshold. Thus there can be multiple links created for each firing, allowing for the pattern to skip states. Kohonen and Kangas have proposed the hypermap architecture to include context in the SOM architecture. Kohonen's original hypermap architecture included two sets of inputs and weights [Koh91]. The first set is a context vector that is a tapped delay line of the past and future pattern vectors. This input is used to determine a "context domain" in the SOM. 
All nodes in the context domain are labeled active and are then presented with 43 the current input pattern. The "pattern" weights and context weights are then trained in the typical SOM manner. Kangas extended this concept by eliminating the context weights and allowing only nodes in the vicinity of the last winner to be selected. This smoothes the trajectory of winning nodes throughout the map and allows context to affect the selection of the winner without the addition of parameters like the width of the context window [Kan92]. Kangas has also proposed an SOM architecture that has an LPC predictor at each node in the Kohonen net. This provides temporal pattern recognition by using a filter at each node where the AR filters were trained via either genetic programming or gradient descent [Kan94]. Goppert and Rosenstiel conceptually extend this concept to include the notion of attention [Gop94a][Gop94b][Gop95]. The theory being that the probability of selecting a winner is affected by either higher-cognitive processes (which may be considered a type of supervision) or by information from the past activations of the network. This gives two components to the selection of a winner, the extrasensory distance (context or higher processes) and sensory distance (normal distance form weight to input). These two components can be added or multiplied. They focus on the concept of context and create a moving area of attention, which is the region that has been activated most in the recent past. The center of attention moves as each winner is selected and the region of attention has a Gaussian weighting applied to it so that nodes near the last winner will be more likely to fire the next time. The architecture outperformed the standard SOM on simple temporal tasks but did not train well on more complicated trajectories. 44 Temporal Supervised Neural Networks The main problem with temporal supervised neural networks is the complexity in training them. When the desired architecture contains recurrent connections or memory in one of the hidden layers, the network must be trained with a temporal gradient descent algorithm. There are two distinct approaches to the problem, modifying the architecture to simplify the temporal gradient calculations and creating better and/or faster methods of training the temporal neural networks. Architectural approaches The focused time-delay neural network (TDNN) has memory added only at the first layer and is the simplest example of an architecture designed to avoid many of the complications of temporal neural networks. It is simply a static MLP with a tap delay line between the input and the first layer. Because the memory is restricted to the first layer, the network can still be trained using static backpropagation. The tap delay line maps a segment of the input trajectory into an N-dimensional static image that is then mapped by the MLP. This works quite well for many applications, but has a number of difficulties as mentioned previously. The main difficulties are the increased number of weights required for TDNNs (each input now requires m weights where m is the number of taps in the tap delay line) and the inflexible, prewired nature of the tap delay line. Some of the problems with TDNNs have been attacked by defining the connectivity between layers such that only certain regions of each layer are connected. By doing this, certain regions in the input layer, corresponding to certain time periods of the input, can be connected to a single region of the second layer. 
This provides a more goal-directed architecture that can be time-shift or frequency-shift invariant. Although this can reduce the effects of the problems of TDNNs, the problems still remain and each network must be tailored for each application [Saw91][Haf90].

Two other networks deal with temporal information by using a very restrictive type of feedback. The Jordan network [Jor86] uses recurrence between the output and the input of the network. The output of the network is fed back to a context unit, which is simply a leaky integrator. The Elman network [Elm90] provides feedback from the hidden layer to context units in the input layer. This is potentially more powerful than the Jordan network because it stores and uses the past state of the network, not just the past output of the network. Although both networks are commonly found in the neural network literature, neither is particularly powerful or easily trained.

Recurrent networks are also continuously being modified in an attempt to improve their performance on temporal problems. Mozer has proposed a "multiscale integration model" that uses recurrent hidden units with different time constants of integration, the slow integrators forming a coarse but global sequence memory and the fast integrators forming a fine-grained but local memory [Moz92]. This work, however, is based only on exponentially decaying memory, and the problem of selecting the time constants has not been solved (the time constants have to be hand tuned).

A different spin on recurrent networks is the use of "higher order networks". These networks are recurrent networks where:

* hidden units represent states and the outputs of these states are fed back and multiplied with the inputs of the nodes, thus allowing second order statistics to be used [Gil91][Wat91]
* one network computes the weights for a second network [Pol91][Sch92a][Sch92b]

The higher order networks have proven to be excellent sequence recognizers (grammar recognizers), but have failed to make a serious impact on temporal processing. These networks provide a representation for states in the neural network and allow the computation of higher order statistics. For example, a second order network can compute the autocorrelation of the input, thus creating a translation-invariant architecture. The main disadvantage of this work is that for complex tasks, higher order networks require even more weights and have even more complicated performance surfaces than standard ANNs.

Algorithmic approaches

There are two fundamental methods of computing the gradient for a dynamic neural network. First, the gradients can be computed in the backward direction, similar to the static backpropagation techniques for feedforward networks. Unfolding a recurrent network in time creates a large, static, feedforward network where each "layer" consists of an instance of the recurrent network at one time step. Backpropagation can then be applied to this large feedforward network and the gradient can be computed. This is called backpropagation through time (BPTT) [Rum86]. The main shortcoming of this technique is that it is non-causal: the BPTT algorithm must be used in batch mode, where the data travels first in the forward direction while the entire state of the network is saved at each step, and the error is then backpropagated in reverse temporal order. A secondary shortcoming of BPTT is the memory required to store the state of the network at each iteration.
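A minimal sketch of this unfolding, for a single fully recurrent tanh layer with an error defined only at the final time step, is shown below. The toy dimensions and the restriction to a final-step error are simplifying assumptions made for this sketch, not features of any particular published implementation.

```python
import numpy as np

def bptt_grads(W_in, W_rec, xs, d, h0):
    """Unroll h_t = tanh(W_in x_t + W_rec h_{t-1}) over len(xs) steps, then
    backpropagate the final-step error through the unrolled 'layers'."""
    hs, pre = [h0], []
    for x in xs:                                    # forward pass, state saved each step
        a = W_in @ x + W_rec @ hs[-1]
        pre.append(a)
        hs.append(np.tanh(a))
    gW_in, gW_rec = np.zeros_like(W_in), np.zeros_like(W_rec)
    delta = hs[-1] - d                              # error at the final time step only
    for t in reversed(range(len(xs))):              # backward pass in reverse time order
        delta = delta * (1.0 - np.tanh(pre[t]) ** 2)
        gW_in += np.outer(delta, xs[t])
        gW_rec += np.outer(delta, hs[t])
        delta = W_rec.T @ delta                     # push the error one step further back
    return gW_in, gW_rec

rng = np.random.default_rng(3)
W_in = 0.1 * rng.standard_normal((4, 2))
W_rec = 0.1 * rng.standard_normal((4, 4))
xs = [rng.standard_normal(2) for _ in range(10)]
g_in, g_rec = bptt_grads(W_in, W_rec, xs, np.zeros(4), np.zeros(4))
```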
Many alterations have been made to the BPTT algorithm to improve its utility, in particular to make it usable as an on-line algorithm. Williams and Peng [Wil90] used a history cutoff, assuming that the gradient information from the distant past is relatively inconsequential and thus can be ignored. Combining this with the use of a small step size, the resulting algorithm, BPTT(k), can be used in an on-line manner. See Pearlmutter [Pea95] for a review of this technique and others.

The second fundamental method of computing the gradient for a recurrent neural network computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node with respect to each weight at every iteration. The method is completely on-line and simple to implement. The main difficulty with the RTRL method is its computational complexity. If we define n to be the number of PEs and m to be the number of weights, then the computation of the gradients of each PE with respect to each weight is O(n²m). For a fully recurrent network, this dominates the computational complexity and requires O(n⁴) computations per step. The algorithm works quite well on small networks, but the n⁴ factor becomes overwhelming as the number of nodes increases.

The RTRL algorithm for a recurrent network can be summarized by the following set of equations [Hay94]. First, we define set A as the set of all inputs, set B as the set of all PEs, and set C as the set of outputs with desired signals. The forward activation equations are

net_j(n) = Σ_{i ∈ A∪B} w_ji(n) u_i(n)
y_j(n+1) = φ(net_j(n))

where u represents the input vector at each time step and is composed of both the external inputs and the outputs of each PE (the values of the feedback). The gradient descent technique is based upon computing the sensitivity of each PE with respect to each weight. The weights are updated on-line using these sensitivities:

Δw_kl(n) = η Σ_{j ∈ C} e_j(n) ∂y_j(n)/∂w_kl(n)

For implementation, we create a matrix π that represents these sensitivities and write an update equation for it:

π_kl^j(n) = ∂y_j(n)/∂w_kl(n),   j ∈ B, k ∈ B, l ∈ A∪B
π_kl^j(n+1) = φ'(net_j(n)) [ Σ_{i ∈ B} w_ji(n) π_kl^i(n) + δ_jk u_l(n) ],   π_kl^j(0) = 0

π is a matrix of gradients with the rows representing weights and the columns representing nodes; thus it contains mn elements. Many methods have been proposed to increase the speed of RTRL. Schmidhuber and others have mixed BPTT and RTRL, which reduces the complexity to O(nm) [Sch92]. This technique takes blocks of BPTT and uses RTRL to encapsulate the history before the start of each block. Sun, Chen, and Lee have developed an O(nm) on-line method based on a Green's function approach [Sun92]. By solving an auxiliary set of equations, the redundancies in the computation of the sensitivities over time can be removed. Zipser approached the problem in a different way and reduced the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs [Zip89]. The PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, then this reduces the complexity of the RTRL algorithm to O(m). This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm. It sometimes requires more PEs than the standard RTRL algorithm to converge.
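For illustration, the sensitivity update above can be written out directly for a small fully recurrent tanh layer. The dimensions, learning rate, and toy desired signal below are assumptions made only for this sketch.

```python
import numpy as np

def rtrl_step(W, P, x, y, M):
    """One RTRL step for a fully recurrent tanh layer.
    W has shape (N, M+N); P[j, k, l] holds the sensitivity dy_j / dw_kl."""
    N = len(y)
    u = np.concatenate([x, y])                   # external inputs plus fed-back outputs
    y_new = np.tanh(W @ u)
    phi_prime = 1.0 - y_new ** 2
    # sum_i w_ji * P[i, k, l] (recurrent part of the Jacobian), plus delta_jk * u_l
    P_new = np.einsum('ji,ikl->jkl', W[:, M:], P)
    P_new[np.arange(N), np.arange(N), :] += u
    P_new *= phi_prime[:, None, None]
    return y_new, P_new

rng = np.random.default_rng(4)
M, N, eta = 2, 3, 0.05                           # 2 external inputs, 3 PEs
W = 0.1 * rng.standard_normal((N, M + N))
P = np.zeros((N, N, M + N))
y = np.zeros(N)
for t in range(100):
    x = rng.standard_normal(M)
    y, P = rtrl_step(W, P, x, y, M)
    e = np.array([1.0 - y[0], 0.0, 0.0])         # toy desired signal on PE 0 only
    W += eta * np.einsum('j,jkl->kl', e, P)      # dw_kl = eta * sum_j e_j * P[j,k,l]
```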
Second order methods Many researchers argue that simple gradient descent [Moz94] is not sufficiently powerful to discover the sort of relationships that exist in temporal patterns, especially those that cover long time sequences or involve high order statistics. Bengio, Frasconi, and Simard [Ben93] also present theoretical arguments for the inherent limitations of learning in recurrent networks. Many researchers have recently started using the extended Kalman filter algorithm, which is very similar to the RLS algorithm, for training dynamic neural networks. As described previously, the extended Kalman filter algorithm uses information from the correlation matrix, which is accumulated over time, to better approximate the direction to the bottom of the performance surface. Again, the problem with the extended Kalman filter algorithm is that it requires the computation and storage of the correlation matrix between the weights of the system. The computational requirement for this method is O(N'). The standard method of reducing this computational load is to decouple each PE of the network, such that the correlation 50 matrix is only computed between weights that terminate at the same PE. Puskorius and Feldkamp [Pus94] call this method the decoupled extended Kalman filter (DEKF) algorithm. The main difference between the dynamic version of the RLS/EKF algorithm and the static version is that the gradients that are used in the second order calculation are the dynamic gradients, not the static gradients. Thus, the BPTT or RTRL algorithms must still be used to compute these gradients. Sequence Recognition There are two broad categories of temporal problems which are typically addressed in the literature: sequence recognition and temporal pattern processing. Sequence recognition is typically a process of recognizing (and often reproducing) discrete symbolic sequences. These problems typically focus on recognizing grammars and symbolic patterns. Temporal pattern processing, however, involves the recognition, identification, control or other processing of a continuous signal which varies with time. Speech recognition is an example of temporal pattern recognition. The continuous signal can be vector quantized and turned into a symbolic pattern, but it is not practical to then treat it as a sequence recognition problem. Temporal patterns of interest are difficult to accurately quantize and typically have various forms of time warping and noise which make sequence recognition of quantized temporal patterns nearly impossible. Since the emphasis of this proposal is not sequence recognition, we will only briefly introduce a few interesting neural networks which accomplish this task. Wang and Arbib have proposed models based on the two dominant theories of "forgetting" - the 51 decay theory of forgetting where the memories decay from the time they are entered [Wan90] and the interference theory of forgetting where memory only decays when new inputs which must be remembered arrive [Wan93]. Both architectures are based on a winner-take-all field of neurons where the winning node fires and is then decremented slowly. The sequence is detected using an extra "detector unit" which is trained by the Hebbian rule using attentional learning. The main difficulty with these and other sequence recognizers is that they tend to be intolerant of time-warping and missing or noisy data - problems which are prevalent in temporal pattern recognition. 
The outstar avalanche was an early neural network that was used to learn and generate temporal patterns [Gro82]. The outstar avalanche is composed of N sequential outstars which detect an input and each outstar triggers the next in a chain producing an avalanche effect. This architecture was modified to include the combined effect of the input dot product and the avalanche input from preceding nodes and was called the spatio-temporal network (STN) [Fre9 1]. The sequential competitive avalanche field (SCAF) [Hec86] is a further extension of the STN where each node has lateral interconnections allowing the outstars to be competitive. Comparison of Hidden Markov Models with ANNs Due to the difficulties in modeling sequential structure with ANNs, hidden Markov models have become the gold standard for modeling many temporal processes (e.g. speech). Time sequence matching is a major problem in applying neural nets to temporal/dynamical, non-stationary processes. Although ANNs have been successfully applied to time series prediction [Wei94], they have not been as successful in tasks that 52 have synchronization problems such as time-warping. For example, different utterances of the same word can have very different timescales; both the overall duration and the details of timing can vary greatly. ANN models for speech have been shown to yield good performance only on short isolated speech units (e.g. phoneme detection). They have not been shown to be effective for large-scale recognition of continuous speech. The TDNN, for example, has powerful methods for dealing with local dynamic properties, but cannot deal with sequences explicitly. The HMM provides a compact, tractable mechanism for handling this temporal information by including explicit state information. Various neural network techniques have attempted to add state information, typically via feedback, but have been only successful on modest size applications. HMMs are stochastic in nature and thus can succeed even when the temporal nature of the system is locally very noisy. Speech patterns, for example, are to some extents a sequential process, however, they are sufficiently ambiguous locally that it is not adequate to make decisions locally and then process sequences of symbols. Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified - e.g. the temporal information is stored in the current state of the system and all relevant temporal information must be able to be stored in this way (there is no other memory in the system). The second assumption is that the outputs depend stochastically only on the state of the system. The two main advantages of ANNs over HMMs is that ANNs are discriminative and ANNs do not rely on the Markov assumptions. Typically HMMs are trained using a 53 within-class method (each model is trained only on in-class, segmented, data). ANNs, however, can be trained to find the differences between classes, thus they can discriminate between classes, not just detect/model classes. ANNs have few restrictions on the systems they can model. The HMMs, however, assume that the observations are independent and that the underlying process that is modeled is a Markov Process. New methods which marry the discriminative power of the ANN with the temporal nature of the HMM have been relatively successful [Bou90]. 
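To make the two assumptions concrete, the scaled forward recursion of a discrete HMM is sketched below; the two-state model and its probabilities are made up purely for illustration.

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Scaled forward recursion for a discrete HMM.
    A[i, j] = P(state j at t+1 | state i at t)   (first-order Markov assumption)
    B[i, k] = P(symbol k | state i)               (output depends only on the state)"""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum(); alpha = alpha / c
    loglik = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum(); alpha = alpha / c
        loglik += np.log(c)
    return loglik

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_loglik(pi, A, B, [0, 1, 2, 2, 1]))   # log P(observation sequence | model)
```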
CHAPTER 3 TEMPORAL SELF-ORGANIZATION Introduction and Motivation As described in the previous chapters, working with temporal patterns has been a very difficult task for neural networks. This problem is largely due to the fact that the methodologies applied to temporal processing are simple extensions of static neural networks with little regard for the unique nature of time and time based signals. Most of these architectures simply add memory to a well-known static network and can achieve reasonable performance for simple problems, but do not perform as well on more complex problems. Like in the 1980s when pattern recognition and classification drove the research community to develop neural networks, biological systems still easily outperform state-of-the-art solutions to temporal processing problems. For this reason, I began researching biological neural networks and biological mechanisms that might help us better solve these problems. As my research progressed, two key aspects continually resonated with my underlying goal of creating better neural networks for temporal pattern processing. These two elements are the self-organization of similar or correlated cells into clusters or neighborhoods (similar to place cells in the Hippocampus), and the diffusion of information over time and space. Self-organization describes a system where each 54 55 individual entity in the system has only simple local rules regarding its behavior. These simple local rules, however, can create global organization without any global control. Self-organization applies at virtually every layer of the universe, from neurons and brain cells, to bug populations, to solar systems and galaxies. It is tremendously important in the formation of the brain and in my opinion is greatly underutilized in artificial neural networks. The second element is diffusion. Like self-organization, diffusion is found everywhere. It can be derived from simple random Brownian motion (simple local rules as well), where particles and other objects move from areas of large densities to areas of small densities. Diffusion itself is a rather simple concept that may not appear to add much to neural network theory. However, when you add diffusion to a dynamical system (for instance, the reaction-diffusion equations), the resulting system can obtain some tremendously interesting and powerful dynamics. The Model Most temporal neural networks use short-term memory to transform time into space. This time-to-space mapping is usually the only mechanism for dealing with temporal information. The neural network operates as if the temporal pattern was simply a much larger spatial pattern. This is clearly inefficient. My method uses diffsion to create self-organization in time and space. The theory is to leave the fundamentals of the neural network the same (in order to use the theory and knowledge we have already accumulated) but to add self-organization in space-time to the PEs in the network. By creating temporally correlated neighborhoods in the field of PEs making up the network, 56 the basic functionality of the network is more organized and temporally sensitive, without drastically changing its underlying operation. The mechanism for the creation of these temporally correlated neighborhoods is the diffusion mechanism. In the brain, NO is given off from firing neurons in the brain, and diffuses throughout. NO has also been shown to affect the sensitivity of the neuron to synaptic changes (e.g. weight changes in neural networks). 
It has been theorized that this diffusion of NO may be responsible for the creation of place cells and other organization in the brain. In a more abstract sense, the diffusion of NO can be considered the diffusion of the neural activity. When a large group of neurons fire in close proximity (both temporally and spatially), a local build-up of NO probably occurs and diffuses throughout the brain. In my architectures, I use the concept of activity diffusion to create the temporally correlated neighborhoods. When a PE or group of PEs fire, they influence their neighbors, typically lowering their threshold such that they are more likely to fire in the near future. Because the underlying mechanism of most neural network training is Hebbian in nature, when neighboring PEs fire in a correlated fashion, they tend to continue to fire in a correlated fashion. This creates the temporally correlated neighborhoods and the self-organization in space-time. I have applied this concept to three different ANN architectures. The first is based on the self-organizing map (SOM) and is the most biologically inspired. The second is based on the neural gas algorithm which provides a more powerful but functionally similar solution. Lastly, to prove the robustness of the method, I applied it to the training of recurrent MLPs. MLPs are a totally different architecture and are trained in a totally different manner (e.g. supervised vs. unsupervised training). The MLP is not biologically 57 relevant, but the temporal self-organization method still proved to decrease training times dramatically. The rest of this chapter is divided into three sections based on these architectures. It is arranged chronologically so that the presentation will be more smooth, even though the MLP architecture may be the most useful of the three. This chapter only presents the theoretical derivation of each architecture and a simple illustrative example for each. Detailed application of each method to more practical problems will be presented in the next chapter. Temporal Self-Organization in Unsupervised Networks This section describes the two unsupervised networks which I have applied the concept of temporal clustering. The first architecture is based on the self-organizing map and is called the self-organizing temporal pattern recognizer (SOTPAR). The second architecture is based on the neural gas algorithm and is called the SOTPAR2. Temporal Activity Diffusion Through a SOM (SOTPAR) The self-organizing temporal pattern recognizer (SOTPAR) [Eul96a][Eul96b] is a biologically inspired architecture for embedded temporal pattern recognition (finding patterns in an unbounded input sequence without segmentation or markings). This is a difficult task since the patterns must be searched from every possible starting point. Although the SOTPAR architecture is unsupervised and thus cannot be used efficiently as a pattern recognition device, it does preprocess the input such that patterns commonly found in the training data will be easily detectable from the output of the SOTPAR. Most of the emphasis in this work is in the proper temporal representation of the spatiotemporal data. 58 The SOTPAR architecture adds two temporal characteristics to the SOM architecture, activity diffusion through the space of output PEs and the temporal decay of activations. 
Using these concepts, the SOTPAR converts and distributes the temporal information embedded in the input data into spatial connections and ordered PE firings in the network, all using self-organizing principles. Similar to self-organizing maps, the network uses competitive learning with neighborhood functions [Koh82]. In the SOM, the input is simultaneously compared to the weights of each PE in the system and the PE that has the closest match between the input and its stored weights is the winner. The winner and its neighbors are then trained in a Hebbian manner, which brings their weights closer to the current input. The key concept in the SOTPAR architecture is the activity diffusion through the output space. The firing of a PE in the network causes activity to diffuse through the network and affects both the training of the network and the recognition of the network. In the SOTPAR, the activity diffusion moves through the lattice of an SOM structure and is modeled after the reaction-diffusion equation [Mur89] n, (x,t) 2 m(x,t) - f(m,(x,t),,m(x,t))+ D, 2 where mi can be considered the activity of PE i, f(*) can be considered the current match, and the second derivative is the diffusion of activity over space and time. If the system is "excitable media" (multi-stable dynamical system), then the diffusion of activity can create traveling pulses or wavefronts in the system. When the activity diffusion spreads to neighboring PEs, the thresholds of these neighboring PEs are lowered, creating a 59 situation where the neighboring PEs are more likely to fire next. I define enhancement as the amount by which a PE's threshold is lowered. In the SOTPAR model, the local enhancement acts like a traveling wave. This significantly reduces computation of diffusion equations and provides a mechanism where temporally ordered inputs will trigger spatially ordered outputs. This is the key aspect of this network architecture. The traveling wave decays over time because of competition for limited resources with other traveling waves. It can only remain strong if spatially neighboring PEs are triggered from temporally ordered inputs, in which case the traveling waves are reinforced. In a simple one dimensional case, Figure 3-1 shows the enhancement for a sequence of spatially ordered winners (winners in order were PE1, 0.3 0.4 0.2 (b) Figure 3-1: Temporal activity in the SOTPAR network. a) activity created by temporally ordered input; b) activity created by unordered input 60 PE2, PE3, PE4) and for a sequence of random winners (winners in order were PE4, PE2, PEl, PE5), which would be the case if the input was noise or unknown. In the ordered case, the enhancement will lower the threshold for PE 5 dramatically more than the other PEs making PE 5 likely to win the next competition. In the unordered case, the enhancement becomes weak and affects all PEs roughly evenly. The second temporal functionality added to the SOM is the decay of output activation over time. This is also biologically realistic [Cha93]. When a PE fires or becomes active, it maintains an exponentially decaying portion of its activity after it fires. Because the PE gradually decays, the wavefront it creates is more spread out over time, rather than a simple traveling impulse. This spreading creates a more robust architecture that can gracefully handle both time-warping and missing or noisy data. The decay of the activity also creates another biological possibility for explaining the movement of the enhancement throughout the network. 
If we define a neighborhood around a neuron as one where it has strong excitatory connections with its neighbors, then the decay of activity from a neuron which fired in the past will help to fire (or lower the threshold of) its neighboring PEs.

Algorithm description

To simplify the description of the algorithm, I will use 1D maps and let the activity propagate in only one direction, since the diffusion of the activity is severely restricted in the one-dimensional case. Thus, the output space can be considered a set of PEs connected by a string, where information is passed between PEs along this string. The activity/enhancement moves in the direction of increasing PE number and decays at each step. An implementation of the activity diffusion in one string is shown in Figure 3-2 and includes the activity decay at each PE and the activity movement through the net in the left-to-right direction. The factors μ and (1−μ) are used to normalize the total activity in the network.

Figure 3-2: Model for activity diffusion in one string of the SOTPAR

This activity diffusion mechanism serves to store the temporal information in the network. During training, the PEs will be spatially ordered to sequentially follow any temporal sequences presented. At each iteration, the activity of the network is determined by calculating the distance between the input and the weights of each PE and allowing for membrane potential decay:

act(t, x) = act(t−1, x)·(1−μ) + dist(inp(t), w_x)·μ

where act(t,x) represents the activity at PE x at time t, and dist(inp(t), w_x) represents the distance between the input at time t and the weights of PE x. Typically the activity is thresholded and enhanced before being propagated, for example

act' = max(act − 0.5, 0) · 2

Next, the winning PE is selected by

winner = arg max_x ( act(t, x) + β · enhancement(t, x) )

where the enhancement is the activity being propagated from the left. The parameter β is the spatio-temporal parameter that determines the amount by which a temporal wavefront can lower the threshold for PE firing. By increasing β you can lower the threshold of neighboring PEs to the point where the next winner is almost guaranteed to be a neighbor of the current winner, which forces the input patterns to be sequential in the output map. It is interesting to note that as β → 0 the system operates like a standard SOM, and as β → ∞ the system operates like an avalanche network [Gro82]. Once the winner is selected, it is trained along with its neighbors in a Hebbian manner with normalization as follows:

w_x = w_x + η · neigh(x) · (inp(t) − w_x)

where the neighborhood function, neigh(x), defines the closeness to the winner (typically a Gaussian function), and η is the learning rate. In our current implementation, the spatio-temporal parameter, the learning rate, and the neighborhood size are all annealed for better convergence.
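One possible reading of these updates for a single string is sketched below. The similarity measure, the parameter values, and the exact form of the left-to-right propagation are assumptions made for illustration; they are not taken from the dissertation.

```python
import numpy as np

class SOTPARString:
    """Rough sketch of one SOTPAR string (method 1 enhancement); illustrative only."""
    def __init__(self, n_nodes, dim, mu=0.5, beta=0.5, eta=0.1, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.random((n_nodes, dim))
        self.act = np.zeros(n_nodes)          # decaying activity A(n, t)
        self.enh = np.zeros(n_nodes)          # left-to-right enhancement E(n, t)
        self.mu, self.beta, self.eta, self.sigma = mu, beta, eta, sigma
        self.pos = np.arange(n_nodes)

    def step(self, x):
        match = 1.0 / (1.0 + np.linalg.norm(self.W - x, axis=1))  # assumed similarity
        self.act = self.act * (1 - self.mu) + match * self.mu
        boosted = np.maximum(self.act - 0.5, 0.0) * 2.0            # threshold and enhance
        winner = np.argmax(self.act + self.beta * self.enh)
        h = np.exp(-(self.pos - winner) ** 2 / (2 * self.sigma ** 2))
        self.W += self.eta * h[:, None] * (x - self.W)             # Hebbian update
        # wavefront moves one node to the right: E(n,t) = mu*E(n-1,t-1) + A'(n,t)
        self.enh = self.mu * np.concatenate(([0.0], self.enh[:-1])) + boosted
        return winner

net = SOTPARString(n_nodes=10, dim=2)
rng = np.random.default_rng(1)
winners = [net.step(rng.random(2)) for _ in range(200)]
```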
Representation of memory

The activity diffusion in this network creates a unique spatio-temporal memory that stores and distributes the temporal information in the network itself. Most short-term memory structures can be described by convolving the input sequence with a kernel that describes the structure of the memory. This kernel is typically one-dimensional and describes the temporal features of the memory, i.e. the depth of the memory. The SOTPAR's memory is implemented in its "enhancement", which moves through time and space. Thus, the SOTPAR memory kernel is spatio-temporal and must be described in at least two dimensions.

There are two slightly different ways to implement the temporal enhancement in a 1D SOTPAR. The difference lies in the decaying exponential portion. In method number 1, only the activity at each node is decayed; the contributions from the wavefronts do not contribute to the time-dependent behavior of each node. The equations for this system are

E(n, t) = E(n−1, t−1)·μ + A(n, t)
A(n, t) = A(n, t−1)·(1−μ) + In(n, t)

where E(n,t) is the enhancement at node n at time t, A(n,t) is the activity at node n at time t, and In(n,t) is the matching result between the input and the weights of node n at time t. Expanding these equations gives the following result:

E(n, t) = E(n−1, t−1)·μ + A(n, t−1)·(1−μ) + In(n, t)
        = (E(n−2, t−2)·μ + A(n−1, t−2)·(1−μ) + In(n−1, t−1))·μ + A(n, t−2)·(1−μ)² + In(n, t−1)·(1−μ) + In(n, t)
        = Σ_{k≥0} Σ_{τ≥0} In(n−k, t−k−τ)·μ^k·(1−μ)^τ

This equation shows how the results of the matching activity (which is called "input", for lack of a better word) contribute to the enhancement. The traveling waves create two decaying exponentials, one which moves through space (μ^k) and one which moves through time ((1−μ)^τ). The past history of the node is added to the enhancement via the recursive self-loop in (1−μ). The wavefront motion is added to the enhancement via the diagonal movement through the left-to-right channel scaled by μ. The farther the node is off the diagonal and the farther back in time, the less influence it has on the enhancement. The SOTPAR enhancement equation is similar to the gamma memory impulse response for tap n:

g_n(t) = C(t−1, n−1) μ^n (1−μ)^(t−n) U(t−n)

where C(·,·) is the binomial coefficient and U(·) is the unit step. By doing a variable substitution, τ can be replaced with t−n in the SOTPAR equation, making the two equations even more similar. The SOTPAR enhancement, however, is not an impulse response equation. The SOTPAR allows input at each element of the memory structure, unlike the gamma memory, which is a generalized tapped delay line; thus the input at different times and spatial locations is required to describe the enhancement (i.e. an impulse response does not represent the desired information). In summary, the SOTPAR enhancement is a spatially distributed gamma memory with inputs at each tap.

The second method for implementing the enhancement is to allow the enhancement to also pass through the self-feedback at each node. This will allow an input to add to the enhancement multiple times by following different paths in the network. For example, In(n−1, t−2) can reach E2(n, t) either by looping first at node n−1 and then moving to position (n, t), or by first moving to position (n, t−1) and then looping at node n until (n, t). The equation for the enhancement in this case is

E2(n, t) = E2(n−1, t−1)·μ + E2(n, t−1)·(1−μ) + In(n, t)
         = (E2(n−2, t−2)·μ + E2(n−1, t−2)·(1−μ) + In(n−1, t−1))·μ
           + (E2(n−1, t−2)·μ + E2(n, t−2)·(1−μ) + In(n, t−1))·(1−μ) + In(n, t)
         = Σ_{k≥0} Σ_{τ≥0} In(n−k, t−k−τ)·μ^k·(1−μ)^τ·(τ+1)

This method of enhancement increases the contribution of the off-diagonal elements via the term (τ+1) and allows more flexibility in non-sequential node firings.
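A small numerical check of the two recursions is given below; it simply iterates the equations for a single matching event at node 0 and time 0, so that E and E2 display the corresponding kernels. The boundary handling (zero activity outside the string and before t = 0) is an assumption of this sketch.

```python
import numpy as np

def enhancement_kernels(inp, mu=0.5):
    """Iterate the method 1 and method 2 recursions for a matrix inp[n, t]."""
    N, T = inp.shape
    A, E, E2 = np.zeros((N, T)), np.zeros((N, T)), np.zeros((N, T))
    for t in range(T):
        for n in range(N):
            a_prev  = A[n, t - 1]      if t > 0 else 0.0
            e_diag  = E[n - 1, t - 1]  if n > 0 and t > 0 else 0.0
            e2_diag = E2[n - 1, t - 1] if n > 0 and t > 0 else 0.0
            e2_self = E2[n, t - 1]     if t > 0 else 0.0
            A[n, t]  = a_prev * (1 - mu) + inp[n, t]
            E[n, t]  = e_diag * mu + A[n, t]                          # method 1
            E2[n, t] = e2_diag * mu + e2_self * (1 - mu) + inp[n, t]  # method 2
    return E, E2

inp = np.zeros((6, 6))
inp[0, 0] = 1.0                       # a single matching event at node 0, time 0
E, E2 = enhancement_kernels(inp)
print(np.round(E, 3))
print(np.round(E2 - E, 3))            # the extra off-diagonal contribution of method 2
```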
The two enhancement techniques are shown for two values of μ in Figure 3-3 and Figure 3-4. Both figures show enhancement method 1, enhancement method 2, and the difference between method 2 and method 1, which shows the increased influence of the off-diagonal elements.

Figure 3-3 - Enhancement in the network with μ = 0.5

These two figures also illustrate the effect of μ on the enhancement. With μ = 1, the time decay at each node is disconnected and the enhancement moves only from node to node. With μ = 0, the spatial movement of the enhancement is disconnected and only the node decay contributes to the enhancement. Lower values of μ create a broader enhancement, while higher values of μ create narrower enhancement waves where almost all of the activity moves from one node to the next (down the diagonal of time and space). This can be seen in the figures as a much sharper contribution to the enhancement along the diagonal as μ moves from 0.5 to 0.75.

Figure 3-4 - Enhancement in the network with μ = 0.75

Another possible approach is to decouple the two exponentials μ and (1−μ). This would require external normalization to keep the enhancement from growing without bound, but would provide more flexibility.

A simple illustrative example

A simple, descriptive test case involves an input that is composed of two-dimensional vectors randomly distributed between 0 and 1. Embedded in the input are 20 'L'-shaped sequences located in the upper right hand corner of the input space (from [0.5,1.0] to [0.5,0.5] to [1.0,0.5]). Uniform noise between -0.05 and 0.05 was added to the target sequences. When a standard 1D SOM maps this input space, it maps the PEs without regard to temporal order; it simply needs to cover the 2D input space with its 1D structure. To show how this happens, we plot an 'X' at the position in the input space represented by the weights of each PE (remember, the weights of each PE are the center point of the Voronoi region that contains the inputs that trigger that PE). Since the neighborhood relationship between PEs is important, we connect neighboring PEs with a line. In a 1D SOM, the result is a "string" of PEs, and this string of PEs is stretched and manipulated by the training algorithm so that the entire input space is mapped with the minimum distortion error while maintaining the neighborhood relationships (e.g. the string cannot be broken). The orientation of the output is not important, as long as it covers the input with minimal residual energy. A typical example is shown on the left side of Figure 3-5. Note the slightly higher density of the input in the 'L'-shaped region. When the SOTPAR temporal activity is added to the SOM, the mapping has the additional constraint that temporal neighbors (sequential winners) should fire sequentially.

Figure 3-5 - One-dimensional mapping of a two-dimensional input space, both with and without spatio-temporal coupling

Thus, the string should not only cover the input space, but also follow prevalent temporal patterns found in the input. This is shown on the right side of Figure 3-5. Notice in the figure that sequential nodes have aligned themselves to cover the L-shaped temporal patterns found in the input. Although not the main goal in creating the Spatio-Temporal SOM, recall is possible after the first few samples of the sequence have been input to the network.
The rest of the pattern can be determined by following the sequence of nodes in the SOM, although the length of the sequence is not readily determined by the map. With a single string, the network can be trained to represent a single pattern or multiple patterns. Multiple patterns, however, require the string to be long, and a long string may be difficult to train properly since it must weave its way through the input space, moving from the end of one pattern to the beginning of the next. Additional flexibility can be added by breaking the large string into several smaller strings. Multiple strings can be considered a 2D array of output nodes with a 1D neighborhood function. This allows the network to follow either multiple trajectories or long, complicated trajectories in a simplified manner.

Figure 3-6 shows an example of the storage of two temporal patterns with two strings. The left plot shows the input space, which consists of two-dimensional input vectors. Two 8-point temporal patterns (diagonal lines: bottom-left to top-right and bottom-right to top-left) are intermixed with random noise in the input. The diagonal lines are drawn in for clarity. Between each pattern, there is random noise. This problem can be thought of as a motion detection problem across a visual topographic map; a number of strings could be trained to detect motion in a variety of directions and orientations. On the left side of Figure 3-6, the trained strings are shown as sequences of 8 PEs represented as 'X's (the 'O' PE denotes the beginning of the string), with neighboring PEs connected by lines. As one can see from this figure, the memory structure was able to extract the predominant temporal features of the input data. The right side of Figure 3-6 shows a graphical representation of the sequence of winning PEs after training. The horizontal axis is time, and the vertical axis is the number of the winning PE. The input signal is labeled along the top of the plot. This plot clearly shows that the patterns cause the network to respond with sequential PE firings (smooth diagonal lines), whereas the random noise between patterns causes random output firings. Notice also that the temporal information is crucially important in the training of the memory, especially at the center of the figure where the next point could be in one of two possible directions. This ambiguity is responsible for the misalignment of the PEs near the center of the input space.

Figure 3-6 - The storage of two temporal patterns in a memory network (left: input space and output mapping; right: winning PE versus time, with the input labeled Seq1/noise/Seq2 along the top)

Figure 3-7 shows an example of how the network gracefully handles time warping. In this example, the input was as in the previous example except that the target sequences were warped to lengths 6, 8, and 10. The network mapped two 6-PE strings to the diagonal targets, as shown in the left side of the figure. The right side of the figure shows the winning nodes for the three different sequence lengths - the first two are length 6, the second two are length 8, and the last one is length 10. The strings stretch to cover the entire pattern, and certain PEs fire more than once for a longer sequence, thus extending the time that can be covered by the string. In general, if the network is trained with time-warped data, it will tend to represent the target trajectories with the minimum number of nodes (shortest pattern).
The network will still respond to longer patterns by having certain nodes win multiple times.

Figure 3-7 - Time warping: diagonal targets covered by smaller sequences (left: input space and output mapping; right: winning PE versus time for the differently warped sequences)

The left and middle plots of Figure 3-8 show the traveling activity over time and space for the above example. The left side shows the activity for string 1 and the center shows the activity for string 2. These two plots clearly show how the traveling activity builds up and reinforces the sequential firing of the output PEs (i.e. when a target sequence is presented, the activity builds up and moves along the string). The right side of Figure 3-8 shows the maximum traveling activity for string 1 (solid) and string 2 (dashed). In a simple system, this plot shows how a simple threshold on the traveling activity could be used to detect the target sequence.

Figure 3-8: Plots of enhancement over time for string 1 and string 2, and the maximum enhancement over time for both strings

SOTPAR summary

The SOTPAR methodology creates an array of PEs that self-organizes in space-time with the help of temporal information. The system is trained in an unsupervised manner and self-organizes so that sequences seen during training are mapped into unique spatial sequential firings of the PEs at the output. The output space is similar to a topographic map except that it maps both the temporal and spatial information. The network embeds the temporal and input data into one output space with both temporal and spatial locality. Instead of the standard time-to-space mapping produced by most short-term memories, the SOTPAR produces a time-to-"time and space" mapping. The representation is distributed throughout the self-organizing network and is stored not only in the activations of the PEs but also in the connectivity and weights of the PEs. It is a radical departure from typical neural network architectures with memory, but is actually more biologically plausible.

The SOTPAR is a unique combination of short-term and long-term memory. It contains short-term memory because the activations of the network can be used to represent a general input sequence. The interesting part of the SOTPAR, however, is that it also contains attributes of a long-term memory. It stores commonly found input patterns into the network weights and produces enhanced responses to these temporal inputs. The known sequences produce an ordered response in a specific area of the output space. This is a discriminant mapping because only known sequences produce an ordered response. The sequential firing facilitates the recognition of temporal patterns by subsequent processing layers. It can also gracefully handle time warping.

Temporal Activity Diffusion in the Neural Gas Algorithm (SOTPAR2)

The SOTPAR2 network was developed to overcome a few difficulties with the original SOTPAR network. The main difficulty with the SOTPAR is the SOM map that it is built upon. The SOM's neighborhood lattice structure restricts the movement of a trajectory through the output space of the network (e.g. the distance between successive inputs) and also limits the number of neighbors for each PE. For these reasons the neural gas algorithm is used as the basis for the SOTPAR2 architecture.
The neural gas algorithm is similar to the SOM algorithm without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distance to the reference vectors, not on the distance to the winning PE in the lattice. Since the neural gas algorithm has no predefined structure, each PE acts relatively independently. This is how the algorithm derived its name: each PE is like a molecule of gas, and together they spread to evenly cover the desired space. Since there is no predefined structure for the activity diffusion to move through, there is the flexibility to create a diffusion structure that can be trained to best fit the input data. The SOTPAR2 diffuses activity through a secondary connection matrix that is trained with temporal Hebbian learning. This flexible structure decouples much of the spatial component from the temporal component in the network. In the SOTPAR, two neighboring nodes in time also needed to be relatively close in space in order for the system to train properly (since time and space were coupled). This is no longer a restriction in the SOTPAR2. It is still a space-time mapping, but now the coupling between space and time is directly controllable.

The most interesting concept that falls out of this structure is the ability of the network to focus on temporal correlations. Temporal correlation can be thought of as the simple concept of anticipation. The human brain uses information from the past to enhance the recognition of "expected" patterns. For instance, during a conversation a listener uses the context from the past to determine what they expect to hear in the future. This methodology can greatly improve the recognition of noisy input signals such as slurred or mispronounced speech.

SOTPAR2 - algorithm details

Based on previous experience (training), the SOTPAR2 algorithm uses temporal information to lower the threshold of PEs that are likely to fire next. The standard neural gas network is appended with a connection matrix that is trained using temporal Hebbian learning. These secondary weights are similar to transition probabilities in Hidden Markov Models (HMMs) and are the pathways used to diffuse the temporal information. As in the SOTPAR, the temporal activity diffusion is used to alter the selection of the winning PE and affects both the training and the operation of the network.

The SOTPAR2 algorithm works as follows. First, the distance d_i from the input to each PE is calculated. The temporal activity in the network is similar to the SOTPAR diffusive wavefronts except that the wavefronts are scaled by the connection strengths between PEs. Thus, the temporal activity diffuses through the space defined by the connection matrix as follows:

a_i(t+1) = \alpha\, a_i(t) + \sum_{k}\bigl[\mu\, f(d,k) + (1-\mu)\, a_k(t)\bigr]\,\frac{p_{ki}}{\max(p)}

where a_i(t) is the activity at PE i at time t, \alpha is a decay constant less than 1, p_{ij} is the connection strength from PE i to PE j, d is the vector of distances from the input to each PE, \mu is the parameter that smoothes the activity, giving more or less importance to the past activity in the network, and \max(p) normalizes the connection strengths. The function f(d,k) determines how the current match (distances) of the network contributes to the activity. At the present time, my implementation treats the case where f(d,k) is simply a \delta function (and the summation is removed) such that only the activity from the past winner is propagated.
This is similar to a Markov model, where all temporal information is stored in the state itself. Unlike the Markov model, however, the previous winners affect the output activity of the current winner. Therefore, a previous winner that has followed a "known" path through the network will have higher activity and thus will have more influence on the next selection. In the general case of the activity equation, the temporal activity at each PE is affected by contributions from all other PEs. In this case the function f(d,k) is typically an enhanced/sharpened version of the output and the summation is over all PEs. This allows all the activity in the network to influence the current selection. It makes the network more robust, since the wavefronts will continue to propagate (but will decay rapidly) even if the selected winner temporarily transitions to an unlikely path.

The next step of the SOTPAR2 algorithm is to modify the output (competition selection criterion) of each PE by the temporal activity in the network via the following equation:

out_i = d_i - \beta\, a_i

where \beta is the spatio-temporal parameter that determines how much the temporal information affects the selection of the winner. This parameter should be set based upon the expected magnitude of the noise present in the system. For example, if the data is normalized to [0,1], then a setting of \beta = 0.1 allows the network to select a new winner that is at most a distance of 0.1 farther away than the PE closest to the spatial input. To adjust the weights, we use the standard neural gas algorithm, which is simply competitive learning with a neighborhood function based on an ordering of the temporally modified distance to the input:

\Delta w_i = \eta\, h_{\lambda}\bigl(k_i(out)\bigr)\,(x - w_i)

where \eta is the learning rate (step size), h_{\lambda}(\cdot) is an exponential neighborhood with the parameter \lambda defining the width of the exponential, and k_i(out) is the ranking of PE i based on its modified distance from the input.

The connection strengths are trained using temporal Hebbian learning with normalization. Temporal Hebbian learning is Hebbian learning applied over time, such that PEs that fire sequentially enhance their connection strength. The rationale for this rule is that PEs remain active for a period of time after they fire, so both the current and the previous winners are active at the same time. In the current implementation, the connection strengths are updated similarly to the conscience algorithm for competitive learning:

\Delta p_{\arg\min(out(t-1)),\;\arg\min(out(t))} = b

The strength of the connection between the last winner and the present winner is increased by a small constant b, and all connections are decreased by a fraction that maintains constant energy across the set of connections. Another possibility for normalization would be to normalize all connections leaving each PE. This method gives poorer performance if a PE is shared between two portions of a trajectory, since the connection strength would have to be shared between the two outbound PEs. It does, however, give an interpretation of the connection strengths as probabilities and points out the similarity between the SOTPAR2 and the HMM. The parameters \eta and \lambda are annealed exponentially as in the neural gas algorithm, while \beta takes the form of an offset sine wave. This allows the initial phases of learning to proceed without interference so that the PEs start out with an even distribution across the input space. The temporal enhancement then reaches a peak and slowly declines for fine tuning at the end of learning.
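To show how these pieces fit together, the sketch below collects one SOTPAR2 iteration for the special case described above, where f(d,k) acts as a delta function on the previous winner. The function and variable names, the use of exp(-d) as the sharpened match, and the exact way the connection energy is renormalized are illustrative assumptions for this sketch, not the author's implementation.

```python
import numpy as np

def sotpar2_step(x, W, P, a, prev_winner, alpha, mu, beta, eta, lam, b):
    """One SOTPAR2 iteration (delta-function case: only the activity of the
    previous winner is propagated along its trained connections).
    x: input vector; W: (n_pes, dim) reference vectors;
    P: (n_pes, n_pes) connection strengths; a: temporal activity per PE."""
    d = np.linalg.norm(W - x, axis=1)              # spatial match d_i

    # Activity diffusion: decay the old activity, then propagate a mix of the
    # previous winner's match and its past activity along p_{k,i}/max(p).
    a_prev = a
    a = alpha * a_prev
    if prev_winner is not None:
        k = prev_winner
        match_k = np.exp(-d[k])                    # illustrative sharpening f(d, k)
        norm = P.max() if P.max() > 0 else 1.0
        a = a + (mu * match_k + (1 - mu) * a_prev[k]) * P[k] / norm

    out = d - beta * a                             # temporally modified output
    winner = int(np.argmin(out))

    # Neural gas update driven by the rank of the modified output.
    ranks = np.argsort(np.argsort(out))
    W = W + eta * np.exp(-ranks / lam)[:, None] * (x - W)

    # Temporal Hebbian update of the connection strengths, renormalized so the
    # total connection energy stays constant (conscience-style rule).
    if prev_winner is not None:
        total = P.sum()
        P[prev_winner, winner] += b
        if total > 0:
            P *= total / (total + b)
    return W, P, a, winner
```

The key point of the sketch is that the winner is chosen from the temporally modified output d_i - \beta a_i, so the same spatial input can fire different PEs depending on the recent past of the signal.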
Operation of the SOTPAR2 network

I will use an artificial example to illustrate the features of the SOTPAR2. The input for this example is 15 pairs of noisy 8-point diagonal lines from (0,0) to (1,1) and from (1,0) to (0,1). The diagonal lines have uniform noise (±0.15 in both dimensions) added to each point (notice that the distance between each noise-free point of the diagonal lines is only 0.14). There is uniform noise on [0,1] interspersed between the diagonal lines (16 points between each line, so that there is twice as much noise as signal). The network extracts the temporal information from the diagonal lines without supervision, segmentation, or labeling. A 30-PE network was trained with and without temporal enhancement (200 iterations through the data set), and the resulting PE locations are shown in Figure 3-9 and Figure 3-10 with the diagonal lines superimposed on the figures. Notice that the temporal enhancement during training has slightly modified the positions of the PEs. The network trained with temporal enhancement has its PEs placed more consistently near the centers of the points along the diagonal lines (in particular, look at the line segment in the lower right). The temporal training provides a portion of the improvement made by the SOTPAR2 algorithm, but the static comparison of the two networks is not dramatically different.

Figure 3-9: Reference vector locations after training with enhancement

Figure 3-10: Reference vector locations after training without enhancement

During operation, the trained weights and information from the past create temporal wavefronts in the network that allow plasticity during recognition. This temporal activity is mixed with the standard spatial activity (distance from the input to the weights) via \beta, the spatio-temporal parameter. Two identical inputs may fire different PEs depending on the temporal past of the signal. Figure 3-11 shows the Voronoi diagrams for the SOTPAR2 network with two different temporal histories. Voronoi diagrams graphically describe the region of the input space that fires each PE. In these particular diagrams, the number in each Voronoi region represents the PE number for that region and is located at the center of the static Voronoi region. Remember that the center is the same as the weights of the PE. These diagrams show the regions of the input space that will fire each PE in the network. The left side of Figure 3-11 shows the Voronoi diagram during a presentation of random noise to the network. Since this input pattern was unlikely to have been seen in the training input, temporal wavefronts were not created and the Voronoi diagram is very similar to the static Voronoi diagram. The right side of Figure 3-11 shows the Voronoi diagram during the presentation of the bottom-left to top-right diagonal line. The temporal wavefront grew to an amplitude of 0.5 by the time PE 18 fired. Also, from the training of the network, the connection strength between PE 18 and PE 27 was large compared to the other PEs. Thus, the temporal wavefront flowed preferentially to PE 27, enhancing its chance of winning the next competition.
Figure 3-11: Voronoi diagrams without and with enhancement (left: previous winners 23, 18, 10, 29; right: previous winners 20, 26, 14, 18; \beta = 0.2)

Notice how large region 27 is in the right side of Figure 3-11, since it is the next expected winner. This plasticity seems similar to the way humans recognize temporal patterns (e.g. speech). Notice that the network uses temporal information and its previous training to "anticipate" the next input. The anticipated result is much more likely to be detected since the network is expecting to see it. It is important to point out how dramatically different the static and dynamic conditions are. In the dynamic SOTPAR2 the centroids (reference vectors) are not as important - the temporal information changes the entire character of the vector quantization, creating data-dependent Voronoi regions. An animation can demonstrate the operation of the SOTPAR2 Voronoi regions much better than static figures.

Next I created a new set of 14 noisy diagonal lines to be run through the network as a test set. Each noisy line was passed through both a standard neural gas vector quantization network and a SOTPAR2 VQ network. The results will be analyzed using the 5th point in the bottom-left to top-right diagonal line. Figure 3-12 shows the locations of this point in each of the 14 noisy diagonal lines along with the neural gas Voronoi diagram.

Figure 3-12: Voronoi diagram without enhancement. VQ outputs were [12,12,16,16,25,25,25,25,27,27,27,27,27,27]

Notice that the static vector quantization cannot consistently quantize this 5th point to the same Voronoi region. In fact, this point falls into four different regions. The SOTPAR2 network, however, was able to quantize every one of the 5th points into the same region. Figure 3-13 shows why. It shows a typical Voronoi diagram for the trained SOTPAR2 network after the input of the first four points of a single noisy diagonal line. The location of the 5th point in each of the 14 noisy diagonal lines was again plotted. Notice that now all 14 points fall into the correct Voronoi region. Remember that each particular input sequence will create a different Voronoi diagram, but Figure 3-13 illustrates the mechanism behind the SOTPAR2's improved vector quantization. The temporal plasticity has increased the size of the anticipated next region and reduced the variability of the SOTPAR2 vector quantization.

Figure 3-13: Voronoi diagram and VQ with enhancement (node 18 as the previous winner). VQ outputs were [27,27,27,27,27,27,27,27,27,27,27,27,27,27]

Next I ran the new noisy diagonal lines through the network and histogrammed the VQ outputs for each point of the two lines. Figure 3-14 shows the results with the point number along the horizontal axis and the node number along the vertical axis. The number of firings for each node is indicated by the shading - white is high and gray is low. The left-to-right diagonal line is shown in the first 8 points of the horizontal axis and the right-to-left diagonal line is shown as the second 8 points of the horizontal axis. Notice how much cleaner the temporally enhanced VQ output is than the standard neural gas VQ.
Figure 3-14: Histograms of the number of firings for each PE (bright = more) for the networks with and without enhancement

Figure 3-15 shows a specific example of the VQ output of the two networks and illustrates how the SOTPAR2 uses temporal information to remove noise from the input. The input is a noisy diagonal line from bottom-right to top-left (solid line). The SOTPAR2 output is the dotted line, and the static VQ output is the dashed line. Notice how much closer the temporal VQ output is to the diagonal than either the noisy input or the output of the static VQ.

Figure 3-15: The SOTPAR2 VQ (dotted) is closer to the noise-free signal than the original (solid) or the neural gas VQ (dashed)

SOTPAR2 summary

The SOTPAR2 algorithm uses temporal plasticity induced by the diffusion of activity through time and space. The SOTPAR2 algorithm is a temporal version of the neural gas algorithm that uses activity diffusion to couple space and time into a single set of dynamics that can help disambiguate the static spatial information with temporal information. This creates time-varying Voronoi diagrams based on the past of the input signal. This dynamic vector quantization helps reduce the variability inherent in the input by anticipating (based on training) future inputs.

Temporal Self-Organization for Training Supervised Networks

This section shows how the concepts of temporally trained clustering can help speed up the training of supervised neural networks. In particular, we have applied it to recurrent neural network training. Recurrent neural networks are more powerful than feedforward neural networks, but their training is very difficult and time-consuming. Supervised neural networks are typically trained with gradient descent learning, which provides a more mathematically sound foundation than in the unsupervised networks. This allows for a goal-driven approach with mathematical derivations of the concepts. The goal of this architecture is to temporally organize the training of a recurrent neural network. A mathematical analysis will derive a principle very similar to that used in the neural gas network: temporal correlation can be used to train PEs to form temporal neighborhoods.

Using Temporal Neighborhoods in RTRL

In the past, static neural networks and feedforward networks with memory (TDNN, etc.) have been the workhorses of the neural network world. Recently, recurrent neural networks have been getting more attention, especially when applied to dynamical modeling and system identification and control. The main difficulty in training recurrent neural networks is that the gradient is a function of time. The gradient at the current time depends not only on the current input, output, and desired signal, but also on all the values in the past. As discussed in Chapter 2, there are two fundamental methods of computing the gradient for a recurrent neural network. First, the gradients can be computed in the backward direction, similar to the static backpropagation techniques from feedforward networks. This is called backpropagation through time (BPTT) [Rum86]. The main shortcoming of this technique is that it is non-causal. The second fundamental method computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node with respect to each weight at every iteration.
The method is completely on-line and simple to implement. The main difficulty with the RTRL method is its computational complexity. If n is the number of PEs in a fully recurrent network, then the computation of the gradients of each PE with respect to each weight is O(n4). This algorithm can only be used for small networks. Many methods have been proposed to increase the speed of RTRL. Zipser's approach [Zip89][Zip9O] will be used here because it lends itself to our techniques. Zipser approached the problem of reducing the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs. The PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, then this reduces the complexity of the RTRL algorithm to O(n2). This is a tremendous improvement, however, the method lacks some of the power of the full RTRL algorithm. For example, it will sometimes require more PEs than the standard RTRL algorithm to converge. Our methodology extends Zipser's technique by allowing the subgroups to change dynamically during learning. The dynamic subgroups are created by using an unsupervised temporal clustering very similar to that used in the SOTPAR and SOTPAR2. A derivation of a first-order approximation to the full sensitivity matrix shows that temporal correlation (temporal Hebbian learning) can be used to determine which nodes should be in each group. This method has the same computational complexity as Zipser's, but trains better and more consistently. 86 Review of RTRL and Zipser's Technique The computational complexity of the RTRL algorithm is dominated by the need to update a large array of sensitivities at each step of the algorithm. For a network with n nodes and m weights, the sensitivity matrix has O(nm) elements, each requiring O(n) computations per element, giving O(n2m) calculations per step. For a fully recurrent network, this dominates the computational complexity and requires O(n4) computations per step. The algorithm works quite well on small networks, but the n4 factor becomes overwhelming as the number of nodes increases. The value of n in the O(n2m) equation is the number of recurrently connected units. Zipser's algorithm reduces this value by creating subgroups of nodes where sensitivity information is only passed between nodes of the same subgroup. All connections still exist in the forward dynamical system, the subgroups only affect the training of the network. Connections between subnets are treated as inputs. If g is the number of subgroups, then the speed-up of the sensitivity calculations is approximately g2. For instance, dividing a network into two subsets (g=2) gives a 4-fold speed-up in computing the sensitivities. If the size of the subnets remains constant as the size of the network is increased (increasing the number of subnets), then the complexity of the RTRL algorithm is reduced from O(n2m) to O(m). The performance gains are substantial, but the question is whether the algorithm can train networks as well as the full RTRL. One might think that the subgrouping will limit the capabilities of the network to share nodes. This is not the case, however, since the activations of the network are unchanged -- it is still fully recurrent except in the 87 training methodology. Even though the error propagation is limited to the subnets, all units have access to the activities of all other units, just not all of their sensitivities. 
Zipser's empirical tests indicate that subgrouped networks can solve many of the same problems, but for certain applications, networks trained with subgrouped RTRL require more PEs than when they are trained with full RTRL. In my experience, the subgrouping algorithm typically also requires more training epochs to reach the same MSE. One caveat of subgrouped RTRL training is that each subnet must have at least one unit for which a target exists, since gradient information is not exchanged between groups. This problem can be solved by wrapping a feedforward network around the recursive network, creating a feedforward MLP with a fully recursive hidden layer. This is often termed a recurrent multilayer perceptron (RMLP) and is shown in Figure 3-16. The feedforward network is simply one additional layer that distributes the gradient between the groups. With simple extensions to the algorithms, multiple fully recurrent layers can be added to the network.

Figure 3-16: Diagram of a fully recurrent multi-layer perceptron (RMLP), with an input layer, a fully connected hidden layer, and an output layer

Dynamic Subgrouping with π

The goal of my method is to create local neighborhoods (subgroups) in the RTRL algorithm where the majority of the gradient information required for each node is confined to its local neighborhood. This requires organizing the recurrent PEs such that those that have strong temporal dependencies are neighbors. This technique replaces the static, preallocated grouping of Zipser's technique with a dynamic method of determining the best set of neighbors for each PE. This dynamic grouping provides faster and more robust training than Zipser's technique while maintaining its O(n^2) performance.

First, the RTRL equations must be modified slightly to better suit the RMLP architecture described above. The time indices are now defined such that the input vector contains the external inputs from this time period plus the values of the PE outputs from the previous time period (i.e. the feedback):

IN(n+1) = [x(n+1),\; y(n)]

\pi^{i}_{kL}(n+1) = \varphi'\bigl(v_i(n+1)\bigr)\Bigl[\sum_{j\in B} w_{ij}(n)\,\pi^{j}_{kL}(n) + \delta_{ik}\,IN_L(n+1)\Bigr]

Next, we must determine the criteria we will use to group the PEs in the network. If we assume that each PE is responsible for updating the weights of the arcs that terminate at that PE (i.e. the incoming connections), then the PEs that have the highest sensitivities relative to those connections should be in the same neighborhood. For example, PE j is responsible for updating all weights w_{jB}, where B is the set of recurrent PEs. If we define

Z_{jk} = \sum_{L}\bigl|\pi^{j}_{kL}(n)\bigr|

then the value of Z_{jk} provides a measure of how much PE k affects the weights of PE j. Thus the neighbors of PE j should be the ones that have the highest Z_{jk}. The "dynamic subgrouping with π" (DS-π) methodology implements the RTRL algorithm with the subgroups chosen using the Z measure defined above. It should be noted that since it requires the computation of the complete π matrix, this algorithm is no more efficient than the full RTRL algorithm. It will, however, show how the neighborhood technique with "optimal" switching performs compared to the full RTRL and Zipser's technique.
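A minimal sketch of how the Z measure can be turned into neighborhoods is given below. The indexing convention pi[node, weight owner, input] and the fixed group size are assumptions made only for illustration.

```python
import numpy as np

def ds_pi_neighbors(pi, group_size):
    """DS-pi grouping: Z[j, k] = sum_L |pi[j, k, L]| aggregates the sensitivities
    linking PE j and PE k.  Each PE keeps itself plus the (group_size - 1) PEs
    with the largest Z as its training neighborhood."""
    Z = np.abs(pi).sum(axis=2)                  # (n, n) relevance measure
    n = Z.shape[0]
    neighborhoods = []
    for j in range(n):
        z = Z[j].copy()
        z[j] = np.inf                           # PE j is always its own neighbor
        best = np.argsort(-z)[:group_size]      # largest Z first (j included)
        neighborhoods.append(best)
    return neighborhoods
```

Note that the neighborhoods produced this way need not be symmetric: PE k may appear in PE j's list without PE j appearing in PE k's, which matches the grouping rules discussed below.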
The DS-7 network 0.7 rtri -- our met 0.5 0.4 0.3 0.2 0.1 .. . . 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Epochs (16 samples per epoch) Figure 3-17: Average learning curve for the three algorithms using the frequency doubling problem 90 has 6 fully recurrent hidden layer PEs and one linear output node. Both Zipser's method and the DS-i method use two groups of three PEs. Each of the three algorithms was trained using the same five sets of random initial weights and the results were averaged to obtain the learning curves. Figure 3-17 shows the average learning curve for each algorithm. Notice that the full RTRL and the DS-nt method performed nearly identically. In fact, in a few cases, DS-n actually trained in fewer epochs. The third set of initial weights led all three algorithms to a deep local minimum. The first 100 epochs mainly depict the learning curve from the other four initial conditions (notice that the DS-n method and RTRL are nearly identical here), whereas the last 400 iterations are dominated by the learning curve for the initial weights with the deep local minimum. Zipser's method performed worse on all 5 sets of initial conditions and couldn't solve the problem at all (even with more training) for the 3rd set. In every application I have tested, the DS-n algorithm trains the networks in almost the same number of epochs as the full RTRL algorithm and performs significantly better than Zipser's subgrouping technique. The problem, however, is that the full 7E matrix is required to compute the neighborhoods. Since the computation of the n matrix is the computationally expensive part of the task, we have not gained anything here. This methodology, however, proves that the technique is feasible and that the all the gradient information is not necessary to train the networks. Also, when a neighborhood changes in the DS-n algorithm the gradient information from the ex-neighbor is discarded and new gradient information from the new neighbor starts building up. The technique using the full n matrix shows that this resetting and restarting of gradient information between 91 nodes does not affect the performance of the algorithm. The DS-nr algorithm will be used as an "ideal grouping" methodology since it uses all of the information of the sensitivities to determine the groupings. Estimating the Z matrix We now need an estimate of Z that will allow us to efficiently compute the temporal neighborhoods. The logical choice for an estimate of Z is to use the first-order estimate of the 7 matrix to compute Z. We start by writing out the equation for Z and simplifying: ZJk =j -nï¿½ i (i(){ wi ((vj(n)) wj, r(n-1) + IN(n)I ZIk =p'(v (n))Y wj,7 (n -1) + IN (n) At this point, I will stop and discuss some grouping rules that I have implemented. First, unlike Zipser's work, the groupings do not need to be symmetric. PEj can be a neighbor of PE k without PEj being a neighbor of PE k. Thus, the baseline method is not a true grouping, but a linking of PEs which are sensitive to each other. A true grouping can be determined by modifying the grouping criteria to include both directions (e.g. Zjk+Zkj). Since symmetry is not being enforced, the methodology enforces the rule that PEj is always a neighbor of PEj. This does not have to be the case, but seems to be a reasonable assumption. Much of the gradient information from a recurrent network comes from the self-recurrent loop in each PE. 
Since we assume that PE j is always a neighbor of PE j, we only need to compare the total sensitivity of all the other PEs. Thus, we do not need to worry about the Z(j,j) terms, which means that the \delta_{jk} term can be removed. Reorganizing the summations slightly leads to:

Z_{jk} = \sum_{L}\Bigl|\varphi'\bigl(v_j(n)\bigr)\sum_{i\in B} w_{ji}\,\pi^{i}_{kL}(n-1)\Bigr|

where j \neq k, i \in B, and L runs over the elements of the input vector. Expanding, we get:

Z_{jk} = \sum_{L}\Bigl|\varphi'\bigl(v_j(n)\bigr)\sum_{i\in B} w_{ji}\,\varphi'\bigl(v_i(n-1)\bigr)\Bigl[\sum_{m\in B} w_{im}\,\pi^{m}_{kL}(n-2) + \delta_{ik}\,IN_L(n-1)\Bigr]\Bigr|

Now we separate the equation into its first-order parts (the direct contributions from the input vector, i.e. the terms with i = k) and the rest:

Z_{jk} = \sum_{L}\Bigl|\varphi'\bigl(v_j(n)\bigr)\,w_{jk}\,\varphi'\bigl(v_k(n-1)\bigr)\,IN_L(n-1) + \varphi'\bigl(v_j(n)\bigr)\sum_{i\in B} w_{ji}\,\varphi'\bigl(v_i(n-1)\bigr)\sum_{m\in B} w_{im}\,\pi^{m}_{kL}(n-2)\Bigr|

This is a very interesting equation. Let us approximate Z with the first-order terms:

\hat{Z}_{jk} = \varphi'\bigl(v_j(n)\bigr)\,w_{jk}\,\varphi'\bigl(v_k(n-1)\bigr)\sum_{L} IN_L(n-1)

Notice that the sum over the input scales all terms of Z_{jk} the same, so it can also be eliminated, leaving only:

\hat{Z}_{jk} = \varphi'\bigl(v_j(n)\bigr)\,w_{jk}\,\varphi'\bigl(v_k(n-1)\bigr)

This is a very easy and computationally efficient method for estimating the Z matrix. It is conceptually appealing as well: the equation is a time correlation between the derivatives of the nonlinearities of PEs j and k. If this were a static, linear network, the equation would simply be \hat{Z}_{jk} = y_j(n)\,x_j(n-1), where y_j is the output of PE j and x_j is the input to PE j from PE k. This is a temporal version of Hebbian learning. The first-order estimate of Z can therefore be considered a nonlinear version of Hebbian learning: the derivative of the nonlinear function at the operating point determines the sensitivity of each PE to the current training input. Thus, we are calculating a correlation of the sensitivities of each PE, which can also be considered a correlation in the dual of the network.

If the PEs of the network use a tanh activation function, the estimate for the Z matrix can become a local rule. Local rules are advantageous because they are easily implemented in parallel and easily analyzed. Figure 3-18 shows the plot of f'(net) versus f(out), which is equal to f(f(net)), for a tanh PE. Since these two shapes are very similar, the estimate can be computed from quantities that are locally available at each PE.

Figure 3-18: Using a tanh PE, f'(net) takes the same shape as f(out)
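As a rough illustration of how cheap this estimate is, the sketch below computes \hat{Z} for one time step. The tanh derivative, the random weights, and the network size are placeholders used only to make the example runnable.

```python
import numpy as np

def zhat_first_order(v_now, v_prev, W, phi_prime):
    """First-order estimate of the grouping measure:
        Zhat[j, k] = phi'(v_j(n)) * w_jk * phi'(v_k(n-1)),
    a temporal, Hebbian-like correlation between the sensitivity of PE j at
    time n and the sensitivity of PE k at time n-1, scaled by the weight w_jk."""
    return np.outer(phi_prime(v_now), phi_prime(v_prev)) * W

# Illustrative use with tanh PEs, where phi'(v) = 1 - tanh(v)**2.
rng = np.random.default_rng(0)
n = 6
W = rng.normal(scale=0.5, size=(n, n))      # w_jk: weight from PE k into PE j
v_now, v_prev = rng.normal(size=n), rng.normal(size=n)
Zhat = zhat_first_order(v_now, v_prev, W, lambda v: 1 - np.tanh(v) ** 2)
print(Zhat.round(3))                        # PE j's neighbors: largest entries in row j
```

Each row j of \hat{Z} can then be used, exactly as in the DS-π sketch earlier, to pick the PEs that join PE j's subgroup for the next training step.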

PAGE 1 TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS By NEIL R. EULIANO II A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1998 PAGE 2 ACKNOWLEDGMENTS It is only appropriate that I first acknowledge the guidance and help of my advisor and friend Dr. Jose Principe. Without his support this work would never have been completed. I would also like to thank the members of my committee for their efforts and time spent on my behalf, as well as the members of the Computational NeuroEngineering Laboratory (CNEL). I must also acknowledge my wife Tammy who was incredibly patient and never wavered in her support of this endeavor. I would also like to thank my children Erin and Matthew who are just too fun to ignore. Although they extended the amount of time required to graduate, I would not trade the time I spent with them for anything in the world. Lastly I should thank my family and friends for not treating me like a dead-beat Ph.D. student. PAGE 3 TABLE OF CONTENTS page ACKNOWLEDGMENTS Â» ABSTRACT vi CHAPTERS 1 INTRODUCTION AND PROBLEM DESCRIPTION 1 Temporal Processing 2 Static Supervised and Unsupervised Learning 3 Adding Memory to Neural Networks 5 Short-Term Memory Structures 6 Recurrent Networks 9 Training Dynamic Neural Networks 10 Summary of Problems with Standard ANN Architectures 12 The Approach 14 2 LITERATURE REVIEW 16 Biological Research 16 Neurons and Learning 17 Hippocampus 20 Diffusion Equations (Re-Di Equations) 21 Biological Representations of Time 25 Biological Models for Temporal Processing 28 Static Neural Network Learning 29 Unsupervised Learning 30 KohonenSOMs 32 Neural Gas 36 Supervised Training 37 Second Order Methods 39 Temporal Neural Networks 40 Temporal Unsupervised Learning 40 Temporal Supervised Neural Networks 44 Architectural approaches 44 Algorithmic approaches 46 PAGE 4 Second order methods 49 Sequence Recognition 50 Comparison of Hidden Markov Models with ANNs 51 3 TEMPORAL SELF-ORGANIZATION 54 Introduction and Motivation 54 The Model 55 Temporal Self-Organization in Unsupervised Networks 57 Temporal Activity Diffusion Through a SOM (SOTPAR) 57 Algorithm description 60 Representation of memory 62 A simple illustrative example 66 SOTPAR summary 71 Temporal Activity Diffusion In the Neural Gas Algorithm (SOTPAR2) 72 SOTPAR2 algorithm details 73 Operation of the SOTPAR2 network 77 SOTPAR2 summary 83 Temporal Self-Organization for Training Supervised Networks 83 Using Temporal Neighborhoods in RTRL 84 Review of RTRL and Zipser's Technique 86 Dynamic Subgrouping with n 88 Estimating the Z matrix 91 Illustrative Example 94 Grouping Dynamics 95 Second Order Methods 96 Summary of the Dynamic Subgrouping Algorithm 97 4 APPLICATIONS AND RESULTS 99 SOTPAR 99 Landmark Discrimination and Recognition for Robotics 100 SOTPAR solution 102 Real data collected from the robot 113 Summary 117 Self-Organization of Phoneme Sequences 118 Summary 128 SOTPAR2 129 SOTPAR2 Vector Quantization of Speech Data 129 Time Series Prediction 138 Results 140 Summary of chaotic prediction 147 Dynamic Subgrouping of RTRL in Recurrent Neural Networks 148 System Identification 148 Comparison of the Number of Neighbors 152 PAGE 5 Modeling a Set of Nonlinear Passage Dynamics 155 Summary of Dynamic Subgrouping 160 5 CONCLUSIONS AND FUTURE RESEARCH POTENTIAL 162 Conclusions 162 Future Directions 167 REFERENCES 169 BIOGRAPHICAL SKETCH 177 PAGE 6 Abstract of Dissertation 
Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS By Neil R. Euliano II August, 1998 Chairman: Dr. Jose C. Principe Major Department: Electrical and Computer Engineering The field of artificial neural networks (ANNs) has reached a point where they are now being used in everyday products. ANNs, however, have been largely unsuccessful at processing signals that evolve over time. Temporal patterns have traditionally provided the most challenging problems for scientists and engineers and include language skills, vision skills, locomotion skills, process control, time series prediction, and many others. The fundamental concept presented in this dissertation is the formation of temporally organized neighborhoods in ANNs. This temporal self-organization enables the networks to process temporal patterns in a more organized and efficient manner. The concept is biologically inspired and uses activity diffusion to organize the processing elements of the network in an unsupervised manner. PAGE 7 The self-organization in space and time created by my methodology has been applied to three distinct ANN architectures. The new network architectures created by adding the temporal organization are easy to implement and contain properties that are unique in the neural network field. A self-organizing map (SOM) network obtains a unique combination of long-term and short-term memory and becomes organized such that temporal patterns in the input fire sequentially ordered output PEs. These features are utilized in two different applications, a robotic landmark recognition problem and a temporally ordered vector quantization of phonemes in spoken words. When applied to the neural gas algorithm, the resulting network becomes a dynamic vector quantization network. The network anticipates the future inputs and adjusts the size of the Voronoi regions dynamically. It was used to vector quantize speech data for a digit recognition problem and to predict a chaotic signal. Lastly, the temporal organization was applied to the training of fully recurrent neural networks. It reduces the computational complexity of the training algorithm from 0(N^) operations to only 0(N 2 ) operations and maintains nearly all of the power of the RTRL algorithm. This training method was tested on two inverse modeling tasks and provided a dramatic improvement in training times over the RTRL algorithm. PAGE 8 CHAPTER 1 INTRODUCTION AND PROBLEM DESCRIPTION This dissertation focuses on neural network architectures and training methods for processing signals that evolve over time. The fundamental concept underlying the techniques described herein involves the formation of neighborhoods where temporally correlated processing elements (PEs) are clustered together. We have applied this concept to three different neural network architectures and found that it improves the performance of each one either by increasing the functionality of the neural network or by improving its training. This chapter contains a description of the problem as well as background information that will help describe the shortcomings of the present methods. Chapter 2 presents a review of the relevant literature necessary to understand the material in the context of the current state of the art. 
Chapter 3 contains the theoretical description of the techniques and networks proposed by this work, including a few simple examples to elucidate the fundamental concepts. In Chapter 4, six more extensive and practical problems are solved using the temporal neighborhood concepts. These examples include speech recognition, chaotic prediction, system identification and control, and robotics. Chapter 5 concludes the dissertation with a summary of the work and possible future research directions. PAGE 9 2 Temporal Processing Most scientific problems can be grouped into two domains, static and dynamic problems. Static problems consist of information that is independent of time. For instance, in static image recognition, the image does not change over time. On the other hand, time is fundamental to the dynamic problem. The output of a dynamical system, for example, depends not only on the present input but also on the current state of the system, which encapsulates the past of the input. Temporal processing is the analysis, modeling, prediction, and/or classification of systems that vary with time. Patterns that evolve over time have traditionally provided the most challenging problems for scientists and engineers. Language skills (speech recognition, speech synthesis, sound identification, etc.), vision skills (motion detection, target tracking, object recognition, etc.), locomotion skills (synchronized movement, robotics, mapping, navigation, etc.), process control (both human and mechanical), time series prediction, and many other applications all require temporal pattern processing. In fact, the ability to properly recognize or generate temporal patterns is fundamental to human intelligence. Traditional analysis models (ARMA models, etc.) are well known but are usually linear and require significant expertise on the subject and a strict correspondence between the studied process and the constructed model. Artificial Neural Networks (ANNs) offer robust, model-free methods without requiring as much application specific expertise. Secondly, neural nets are adaptive (similar to ARMA models). This is a natural way to compensate for the drift of measuring devices and slow parameter changes inherent in PAGE 10 real systems. Thirdly, neural nets are naturally parallel systems that offer more speed in computation and fault tolerance than traditional computing models. [Kan94] Most of the major neural network success, however, has been mainly in the realm of static, instantaneous mappings (for example, static image recognition or pattern matching). Conventional neural net architectures and algorithms are not well suited for patterns that vary over time. Typically, in static pattern recognition a collection of features visual, semantic, or otherwise is presented and the network must categorize the input feature pattern into one or more classes. In such tasks, the network is presented with all relevant information simultaneously. In contrast, temporal pattern recognition involves processing patterns that evolve over time. The appropriate response at a particular point in time depends not only on the current input, but also potentially on an unspecified number of previous inputs. Static ANNs have been modified in various ways to process time-varying patterns, typically by adding short-term memory to the static pattern classification ability of the various architectures. The short-term memory holds onto some of the past events so that the static ANN can then classify or predict the temporal pattern. 
As I will explain in the next few sections, however, these hybrid structures (memory added to static architectures) have not been widely successful in the various temporal processing areas. Static Supervised and Unsupervised Learniniz The purpose of neural processing is to capture the information from an external signal in the neural network structure. This is a form of organization. It can be accomplished in an unsupervised manner using only the input, or in a supervised manner PAGE 11 guided by an extra input called the desired signal. Unsupervised training can only extract information from the input signal whereas supervised training can learn mappings between the input signal and the desired signal. They differ in the methods, but at the core they share the same function, learning a representation of the external world. The most common supervised network is the multilayer perceptron (MLP) which uses the error back-propagation [Rum86] learning algorithm. The MLP is characterized by layers (input, hidden, and output) of processing elements (PEs) that have a smooth non-linearity at their output. The nonlinear output of the MLP PEs is what differentiates the MLP from a typical adaptive filter. It provides the capability to map problems that are not linearly separable. In fact, it has been proven that an MLP with one hidden layer can uniformly approximate any continuous function with support in a unit hypercube [Cyb89], Like in adaptive signal processing using the LMS algorithm, the backpropagation algorithm applies a correction Aw j; (n) to the synaptic weight w^n) that is proportional to the gradient of the error 8^(n) 1 5w ; ,(n) The chain rule is used to recursively calculate the error for each layer of the network. Unsupervised networks are typically based on or derived from Hebbian learning. Hebbian learning is a biologically inspired learning rule that finds the correlations present in the input data. Because unsupervised networks can extract information only from the input, they are typically used for data analysis and preprocessing. They cannot reliably be used directly for classification since a labeling of the inputs is required for classification. Both supervised and unsupervised learning will be described in detail in Chapter 2. PAGE 12 5 Adding Memory to Neural Networks How do you use a static neural network architecture to process temporal patterns? The answer is to simply add memory. Without an appropriate memory to store information from the past, a neural network is limited to static pattern recognition or function approximation. The key questions that need to be answered while creating temporal neural networks are what type of memory do you use and how is the memory integrated into the training algorithm. Memory in neural networks can be classified into two categories: short-term memory and long-term memory. Short-term memory typically involves a representation of the temporal data, usually by creating multiple copies of the input data at various time delays (e.g. tapped delay line). Long-term memory, on the other hand, is the storage of information from the past into the structure of the network. For example, over time, the training of the network captures information about the input signal and this information can be considered long-term memory. Another example of long-term memory is the storage of patterns in an associative memory. Long-term memory corresponds more closely with the traditional biological concepts of memory. 
The main difference between the two is that the short-term memory is used for signal representation while the longterm memory is a trained memory that typically cannot represent unknown patterns. Another way to differentiate the two is that short-term memory is usually described by activations of nodes or taps (dynamical information), and long-term memory is stored in the weights of the network (statistical information). PAGE 13 Most of the work in temporal ANN research has focused on the application of short-term memories since they provide a mechanism to represent a temporal pattern in a static manner. For instance, a tapped delay line converts a temporal signal into a static pattern (the present input and the N past inputs) which can then be processed by a standard static ANN. Most short-term memory techniques fall into two categories. The first is to explicitly add memory structures and the second is to use recurrent loops in the network to save information. Long-term memory (the weights) has largely been ignored by the ANN research community for the storage of temporal patterns, but I will use it to store temporal correlations in the structure of the network. Short-Term Memory Structures The simplest form of memory is a buffer containing the N most recent inputs. This is often called a tapped delay line or a delay space embedding and forms the basis of traditional statistical autoregressive (AR) models, as well as dynamical system state space manipulations. This is a very popular model and has been used in many applications. The time-delay neural network (TDNN) [Wai90] uses a tapped delay line to convert the temporal pattern into a spatial pattern allowing the architecture to be trained using only standard back-propagation methods. The TDNN, however, has several drawbacks. First, the length of the delay line must be chosen a priori, we cannot work with arbitrary length sequences. In addition, the TDNN requires that the data is properly registered in time with the clock controlling the shift register. It imposes a rigid limit on the duration of patterns and suggests that all input vectors be the same length. Most importantly, two patterns which are very similar temporally (e.g. shifted one step in time) will be very PAGE 14 different spatially, which is the metric used by ANNs. For example, [1 0], [0 1 0], [0 1] are temporally shifted but are spatially on the corners of a unit cube. Using decay traces or exponential kernels to sample the history of the input helps alleviate some of the problems with the TDNN. A common methodology to describe the various memory architectures is to represent the short-term memory as a convolution of the input sequence with a kernel function, k,: x,(f) = Â£ *,(' " T M T ) Â• wnere X W is the input. Tank and Hopfield [Tan87] proposed a set of Gaussian kernels that are distributed over time with varying means and widths to sample the time history. The gamma model [DeV91] is an example of an exponential trace memory that uses the set of gamma kernels. The exponential trace memory has a more smooth representation of the past of the input since it decays exponentially. It gives more strength to the more recent inputs. The gamma memory also has a tunable parameter that trades off depth for resolution when the system requires information from farther in the past. 
Depth roughly refers to how far back into the past the memory stores information and resolution refers to the degree to which information concerning the individual elements of the input sequence are preserved. The exponential trace memories can be computed incrementally and easily, thus greatly increasing its usability. Viewing memory in this way, as a kernel function passed over the input, one can see that almost any kernel function will result in a distinct form of memory. The main problem with all of these memory architectures, however, is that they are all "prewired" one-dimensional cascades of delay elements. TDNNs are also known to train very slowly. PAGE 15 Theoretically, memory added to a system can be thought of as creating an embedding of the dynamics into a space larger than the original input space. An embedding of a dynamical system is based on the similarity between delays and derivatives (the first order approximation to a derivative is the difference between the signal and the delayed signal). The delayed values of a single variable can be used to represent the dynamics of a multi-dimensional system. Conceptually this can be rationalized as combining the first-order differential equations for the system (state space description) into a single high-order differential equation for one variable and then using the delay technique to approximate the derivatives of this equation giving a new representation of the system states. This mathematical construct is effective but not necessarily efficient. For example, a dynamical system requires a minimum of 2D+1 taps to preserve the dynamics of a D dimensional system [Tak81], If the dimension of the system is unknown, as is often the case, a large embedding is usually used. The embedding also does not efficiently encode the input ordering. It does a time-to-space mapping that treats the temporal information the same as a spatial input, allowing for all permutations of the order of inputs without regard to the limitations imposed by the dynamics of the system. The gamma memory and other convolution memory kernels warp or rotate the embedding space to more accurately (or efficiently) represent the system dynamics. A proper use of the embedding methodology requires a significant amount of work to determine a number of parameters, including the number of taps, the time between taps, the time between vectors, and the number of data samples. This is rarely done. PAGE 16 Recurrent Networks The MLP and TDNN are both feedforward networks where the data flow in the network moves strictly forward. No feedback is used. The feedback in recurrent networks can also create memory. The important distinction between the two types of memory is that memory created with feedback can be adapted and trained on-line, creating a flexible and adjustable memory mechanism. Feeding back outputs between different layers can lead to a generalization of storing not only the input but the "state" of the network (i.e. a processed version of the input) [Elm90][Moz94]. In theory, the recurrent architecture is sufficiently powerful to handle arbitrarily complex temporal problems. The focused memory architectures such as the TDNN can also [San97], but may require a very large number of taps and weights. In practice, however, recurrent networks are much more difficult to train than the static networks. The recurrency adds tremendous power to the network (any memory architecture can be created with a recurrent neural network). 
This power, however, creates very complicated error surfaces. In recurrent networks, the states of the PEs in the network affect both the output and the gradients, so calculating the gradients and updating the weights of a recurrent network is a much more difficult and time-consuming process. Because of these difficulties, the mainstream engineering community has largely ignored recurrent networks. Recently, however, recurrent networks are being used more and more as engineers reach the limits of the capabilities of TDNNs and other simpler architectures, and they are now hot topics in the fields of dynamic modeling and control.

Training Dynamic Neural Networks

Recurrent networks, either fully recurrent or partially recurrent (e.g., the gamma network), cannot directly use static backpropagation methods, since the time history of the network and of its inputs is critical to the outputs it produces. Static backpropagation computes only the gradients based upon the current inputs and outputs. To train a dynamical system, the past information is at least as important as the present, and thus a temporal backpropagation technique must be used. Recurrent backpropagation (fixed-point learning) can be used to train a general recurrent network to move to stationary states; its assumption of constant inputs and of an approach to an attractor, however, precludes it from real-time temporal processing. The TDNN can use static backpropagation because its memory is fixed and sits at the beginning of the network: the tapped delay line can be thought of as a temporal preprocessor converting dynamic patterns to static patterns, so the network is trained in a completely static manner. Most other temporal networks, however, are trained using one of two first-order temporal methods: back-propagation through time (BPTT) [Rum86] or real-time recurrent learning (RTRL) [Wil89]. Both are gradient descent methods. The RTRL method brings the activations and their derivatives forward in time until the desired signal is available, and the BPTT method propagates the errors back from the desired signal to the beginning of the pattern. RTRL recurrently calculates the gradients of each PE with respect to every weight. This allows on-line updates (updates every sample), but calculating all the gradients is time consuming: if there are N fully recurrent PEs in a network, the RTRL algorithm requires O(N⁴) operations per sample. The BPTT method requires fewer computations but is non-causal, so it cannot be directly implemented in an on-line fashion. Both methods suffer from the following problems:

• The computation of the gradient must occur over time, but the nonlinearity in each layer (actually, the derivative of the nonlinearity required for the gradients) attenuates these gradients. Thus, if information is required from more than a few samples in the past, these training methods may have a difficult time maintaining and using it. As the errors are propagated, the gradients get small, and the impact of a connection weight, even if appropriate, will be masked by other weights whose values are inappropriate. This is true for large feedforward nets as well, but the feedback of the recurrent network in time makes it a much bigger problem in recurrent networks.

• The desired signal must be defined over time. For example, how do you define a target signal when trying to detect a sequence? If the target is high throughout the pattern, the network will recognize partial sequences. But if the target is high only at the end, the network may be punished for partially recognizing a major portion of the sequence.

• Temporal backpropagation is inherently slow, both computationally and in terms of the number of training samples required to find an adequate solution.

Recently, second-order gradient methods such as recursive least squares (RLS) and the extended Kalman filter have been used to reduce the number of training samples required for a good solution. These methods use second-order gradient information to obtain a more accurate picture of the shape of the performance surface at the current operating point. This allows much faster convergence but requires more computations per sample. These second-order gradient methods still need to compute the dynamic gradient information and thus suffer from the same problems listed above.

Summary of Problems with Standard ANN Architectures

In summary, the standard ANN architectures, when applied to temporal processing, suffer from problems with supervision and problems with short-term memory. The problems that can be attributed to supervised training include:

• The problem of assigning credit or blame to actions when the overall success or failure of the system results from a series of actions and cannot be judged instantaneously (i.e., how do you design a target signal?).

• Back-propagation training can be very slow, often requiring thousands of training epochs. This problem has many sources. The backpropagation algorithm must either take small steps in the gradient descent or use more computationally intensive error calculations (higher-order derivatives). Since all nodes in a network typically learn independently, several problems may occur. First, all the hidden nodes may move together to try to solve the largest source of error, instead of dividing up the problem and each solving a different portion. Second, once the nodes have divided the problem, each tries to solve its portion independently. The movement of each node through the error surface affects all the other nodes, creating a moving target for each node. Thus, instead of a direct movement of the nodes to useful roles, we see a "complex dance" among all units [Fah91].

• Recurrent back-propagation trains even more slowly, for several reasons. First, the training methods require more computations than static backpropagation. Second, the error gradients tend to vanish exponentially as they are propagated through time. Third, recurrent networks tend to have a much more complicated error performance surface with many local minima, making the gradient search very difficult.

• Supervised techniques require presegmented and prelabeled training data. This often must be done by hand and is quite time consuming. The rule of thumb for ANN training is 10 training exemplars for each adjustable weight, so for large networks finding enough training data is a difficult task. If there is an insufficient amount of training data, the network will tend to memorize the data rather than draw reasonable generalizations about it.

Problems related to short-term memory structures include the following:

• The common short-term memory techniques (tap delay lines, etc.) use a time-to-space mapping to represent the past of the signal. By converting time into just another spatial dimension, the unique features of the temporal information are lost (e.g., continuity, limitations imposed by the dynamics of the system). The short-term memory is a representation of the data, not a memory structure.

• The typical short-term memory structure is a rigid architecture that must be pre-wired.

• Short-term memory structures typically add many weights to the input (or interior) layer (e.g., a TDNN with N taps creates N times more weights in the first layer), which exacerbates the problems with the performance surface and with the amount of training data. The resulting networks tend to have so many degrees of freedom that they do not generalize well (i.e., memorization due to insufficient training exemplars).

The Approach

It is a Herculean challenge to attempt to solve all of the above problems. This work focuses on a method of self-organizing PEs in a network architecture based on their temporal correlations. The concept is biologically inspired and has been applied to three different types of neural networks. By creating temporal neighborhoods of PEs in the architecture, we have increased the performance of the networks either through increased functionality and power or through better training methods. When this technique is applied to a self-organizing feature map (SOFM or SOM), the temporal neighborhoods create traveling waves of activity which diffuse through the PEs. The resulting architecture has a spatio-temporal memory that is selective and recognizes temporal patterns similar to those it has been trained with. The typical ANN memory simply embeds the data for further processing by the ANN, without any mechanism for recognition. This architecture, however, is similar to biological memories in that it responds preferentially to known temporal patterns, which is unique in the neural network literature. When the temporal neighborhood approach is applied to the neural gas algorithm, the network becomes a temporal vector quantizer that again responds preferentially to known temporal patterns. The temporal vector quantizer uses the past of the signal to anticipate the next input by expanding the Voronoi region associated with the expected next input. This allows the network to remove noise in the signal and to generate better vector quantization based upon the temporal training and the recent past of the signal. This anticipation is similar to how the human brain deals with noise in its environment: it uses the past to predict the future and correlates what it is sensing with this prediction. This is part of the reason humans can understand speech in very noisy environments, and also why two people can hear completely different things from the same set of sounds. When we apply the technique to the training of recurrent neural networks, the new training technique reduces the computational complexity of the RTRL algorithm from O(N⁴) to O(N²). This dramatic improvement comes with only a slight increase in the number of iterations of training data required; the overall speed-up, taking into account both the decreased computational complexity and the increased number of training samples, is still dramatic. In fact, the O(N⁴) property of the RTRL algorithm makes it virtually unusable for sizeable networks. In general, the self-organizing nature of the temporal neighborhoods helps alleviate many of the problems with the supervised techniques.
Additionally, the novel spatio-temporal memory architectures provide a unique methodology for solving the problems with short-term memory. PAGE 23 CHAPTER 2 LITERATURE REVIEW This chapter presents background information and a literature review of topics that either influenced this work, relate to this work, or will be compared and contrasted with this work. The chapter begins with a presentation of current research on biological neural networks and methods of temporal processing. This section is important because it motivated my work. I do not, however, claim that my work is biologically feasible or occurs in nature. Next, this chapter contains a description of the state of temporal neural network research. Since most of the work in temporal neural networks takes the form of extensions to static neural networks, an overview of static neural network learning is also presented. The contrast between the biological and artificial neural networks and the way they process time is striking. Static artificial neural networks are very similar to the static characteristics of real neurons, but temporal neural networks share little in common with their biological counterparts. Biological Research This section contains a description of biological neurons and their temporal characteristics, as well as other biological mechanisms that may help in processing time based signals. Recently, there has been extensive research into the temporal characteristics of the brain as well as in learning dynamics. This research has not yet 16 PAGE 24 17 been integrated into the artificial neural network community, but holds promise for creating powerful, temporal ANNs. This information provides a motivation for the main principal of this work that the creation of temporally organized neighborhoods in a neural network improves the performance of the network for temporal processing. The concept of diffusing temporal information through the network is one of the fundamental concepts used to rationalize the formation of these neighborhoods. Neurons and Learning Fundamentally, the artificial neural network is modeled after a collection of neurons in the brain. Each neuron is composed of three basic components: the cell body, the dendrites and the axon. [Fre92] The dendrites are a widely branching set of filaments that collect information from other neurons. The axon is a long transmission medium that contains fewer branches and transmits the output of the neuron to other neurons. Synapses are the junctions between axons and dendrites. The dendrites collect incoming pulses from other synapses, convert them to currents and sum them all at the initial segment of the axon. This summation works across both dendritic space (summation over all the dendrites) and across time. Each synaptic membrane acts as a leaky integrator with an associated time constant. The critical function of the axon is to transmit the timevarying amplitude of current summed by the dendrites to distant targets without attenuation. [Fre92] If the neuron reaches a certain threshold, it fires or depolarizes, which means that it produces an energy spike on its axon. The firing contains a refractory period such that a constantly active neuron will produce an impulse train on its axon. How biological neural networks are trained is not well known, but most of what is known PAGE 25 about the training is based on the Hebbian learning concept (which will be discussed later). 
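The leaky-integrator picture of the neuron described above (dendritic summation, a membrane that integrates and leaks, threshold firing, and a refractory period) can be made concrete with a small sketch. This is my own illustration, not a model from the dissertation; the time constant, threshold, and refractory length are arbitrary values chosen only so the behavior is visible.

```python
import numpy as np

def leaky_integrate_and_fire(current, dt=1.0, tau=20.0, threshold=1.0,
                             refractory=5):
    """Toy leaky integrate-and-fire neuron (illustrative constants).
    The membrane sums the input current, leaks with time constant tau,
    and emits a spike followed by a refractory period when it crosses
    the threshold."""
    v, spikes, hold = 0.0, [], 0
    for i_t in current:
        if hold > 0:                       # refractory: ignore input, stay reset
            hold -= 1
            v = 0.0
        else:
            v += dt * (-v / tau + i_t)     # leaky integration of dendritic current
            if v >= threshold:             # depolarize: emit a spike on the axon
                spikes.append(1)
                v, hold = 0.0, refractory
                continue
        spikes.append(0)
    return np.array(spikes)

if __name__ == "__main__":
    drive = np.full(200, 0.08)             # constant input -> regular impulse train
    train = leaky_integrate_and_fire(drive)
    print("spikes:", train.sum(), "intervals:", np.diff(np.flatnonzero(train))[:5])
```

A constant input produces the regular impulse train mentioned above, and because the membrane holds a decaying trace of recent input, the recent past can influence the response to future inputs.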
The Hebbian learning law strengthens synapses (allowing more responsiveness from the post-synaptic neuron) when the two neurons fire at the same time. If there is a consistent correlation between the firing of two neurons, then the pre-synaptic neuron must be at least partially responsible for the firing of the post-synaptic neuron. A static artificial neural network is modeled loosely on an interconnected cluster of neurons. Each neuron is modeled by a processing element (PE) and a set of connections between processing elements. Typically, a processing element simply sums the inputs, nonlinearly warps the output, and then passes this output to its downstream connections. Training is implemented in either an unsupervised manner, usually using a form of Hebbian learning, or in a supervised manner, which has no biological parallel. Notice that none of the temporal characteristics of a neuron are used in static neural networks or their temporal extensions. Recently, there has been significant work on a more complete modeling of individual neurons and their temporal characteristics. Christodoulou and others [Chr95a][Chri93] have modeled the biological neuron including the random spiking nature, excitatory/inhibitory synapses, the transmission delay down the axon, and especially the membrane time constant. The membrane time constant is the main temporal property modeled today. Most modeling approaches use simplifications of the Hodgkin-Huxley equations that result in a leaky integrator model of the neuron membrane potential. This is an important feature of biological neurons, since the past history of the signal remains active on neurons for a short period and can influence the result of future inputs. PAGE 26 19 Additionally, the gas nitric oxide (NO) has been found to be involved in many processes in the central nervous system. One such process is the modification of synaptic strength thought to be the mechanism for learning (and most commonly used in ANNs). Neurons produce NO post-synaptically after depolarization. The NO diffuses rapidly (3.3 x 10" 5 cm 2 /s) and has a long half-life (-4-6 seconds), creating an effective range of at least 150 urn. Large quantities of NO at an active synapse strengthen the synapse (called Long Term Potentiation, or LTP). If the NO level is low, the synaptic strength is decreased (Long Term Depression or LTD) even if the site is strongly depolarized. NO is thus commonly called a diffusing messenger as it has the ability to carry information through diffusion, without any direct electrical contact (synapses) over much larger distances than normally considered (non-local). The NO diffusion and non-linear synaptic change mechanism has been shown to be capable of supporting the development of topographical maps without the need for a Mexican Hat lateral interaction (described later). This seems to be a more biologically plausible explanation of the short range excitation and long range inhibition than the preprogrammed weights of synaptic connections which are typically assumed to implement the same effect [Kre96a][Kre96b]. In addition to the possibility of lateral diffusive messenger effects, the long life of NO can produce interesting temporal effects. Krekelberg has shown that NO can act as a memory trace in the brain that can allow the temporal correlations in the input to be converted into spatial connection strengths. 
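The idea of a diffusing, slowly decaying messenger acting as a memory trace can be sketched in a few lines. The toy code below is mine and is not Krekelberg's model: the diffusion and decay constants, the firing probability, and the thresholded LTP/LTD rule are all assumptions, intended only to show how recent activity can spread to neighboring sites and gate synaptic change without direct connections.

```python
import numpy as np

def diffuse_trace(concentration, d=0.2, decay=0.02):
    """One step of 1-D diffusion plus slow decay of a messenger concentration."""
    left = np.roll(concentration, 1)
    right = np.roll(concentration, -1)
    laplacian = left - 2 * concentration + right
    return (1.0 - decay) * (concentration + d * laplacian)

def run(steps=50, n_sites=40, seed=0):
    rng = np.random.default_rng(seed)
    messenger = np.zeros(n_sites)          # concentration along a sheet of sites
    weights = np.ones(n_sites)
    for _ in range(steps):
        active = rng.random(n_sites) < 0.05        # sparse random firing
        messenger[active] += 1.0                   # firing sites release messenger
        messenger = diffuse_trace(messenger)
        # Active sites in a high-concentration region are strengthened (LTP-like);
        # active sites where the concentration is low are weakened (LTD-like).
        weights[active & (messenger > 0.5)] += 0.05
        weights[active & (messenger <= 0.5)] -= 0.05
    return messenger, weights

if __name__ == "__main__":
    trace, w = run()
    print("peak concentration:", round(trace.max(), 2))
    print("weights:", np.round(w, 2))
```

The point of the sketch is only that the diffusing trace carries information about where and when firing occurred, so temporally correlated activity at nearby sites ends up reflected in the spatial pattern of weights.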
Krekelberg's mechanism for capturing the temporal correlations of the input using an NO diffusion process [Kre96b] is similar to the method I will present in more detail in Chapter 3.

Hippocampus

The hippocampus is the primary region in the mammalian brain for the study of memory and learning because [Bur95]:

• hippocampal damage causes memory loss,
• the hippocampus is the simplest form of cortex,
• long-term potentiation (LTP) has been found in the hippocampus (synaptic plasticity),
• cell firing in the hippocampus is spatially coded (place cells), and
• all sensory inputs converge on the hippocampus, and the output from the hippocampus is extensively divergent, with projections onto most of the cortical areas.

Figure 2-1 shows the major subfields of the hippocampus and their projections.

Figure 2-1: The major subfields of the hippocampus.

The hippocampus is formed from sheets of cells, with most of the interconnections contained in these sheets (minimal connections between sheets). Most projections have large divergence and convergence, except the dentate gyrus to CA3 projection, which has mossy fiber projections from each granule cell, making very large synapses onto only 14 or so pyramidal cells. Hebbian LTP has been observed in much of the hippocampus. A variety of interneurons provide feed-forward and feed-back inhibition. One of the most interesting (and, for this work, most relevant) aspects of the hippocampus is that it contains "place cells" and other functional clusters of neurons. Place cells are small patches of neurons that selectively fire only when the animal is in a specific location of its environment. They are groups of thousands of neurons that fire together and are linked to other place cells. As the subject moves through a familiar set of locations, the patches fire sequentially, and the linking of these patches allows for predictive navigation. They have been found in fields CA3 and CA1 of the rat hippocampus [Bur93]. These place cells are temporally and spatially organized neurons that are correlated in their reaction to temporally occurring events.

Diffusion Equations (Re-Di Equations)

The diffusion equation (or the reaction-diffusion equation, if the medium is active) can be used to explain certain characteristics of a neuron and of neuronal clusters. In its generic form, however, it is used in many other fields. Objects such as cells, bacteria, chemicals, and animals often have the property that each individual moves about in a random manner (e.g., Brownian motion). When a concentration of these objects occurs, this random motion causes the objects to spread out into lower-concentration areas of the environment. When this microscopic movement of the group results in macroscopic motion, we call it a diffusion process. If we assume one-dimensional motion and a random-walk process, we can derive the diffusion equation from a probabilistic treatment of the process. By finding the probability p(m,n) that a particle reaches a point m steps away at n time steps in the future, we find the distribution of particles at time n.
Using the random walk assumption and allowing n to be large, it can be shown that the resulting distribution is the Gaussian or normal probability distribution:

p(m,n) \approx \left( \frac{2}{\pi n} \right)^{1/2} \exp\!\left( -\frac{m^2}{2n} \right), \qquad m \gg 1, \; n \gg 1

Next, we determine the probability of finding a particle in an interval (x - Δx, x + Δx) at time t by rewriting the equation for p(m,n) as the sum of the probability of moving right from x - Δx at time t - Δt and the probability of moving left from x + Δx at time t - Δt. If we take the partial derivative of p with respect to t and let Δx → 0 and Δt → 0, we obtain the diffusion equation:

\frac{\partial p}{\partial t} = D \frac{\partial^2 p}{\partial x^2}

where D is the diffusion coefficient, which defines how fast the particles spread. A typical diffusion process spreads a concentration into ever shallower Gaussians, as shown in Figure 2-2.

Figure 2-2: Diffusion process.

The reaction-diffusion equations were originally proposed by Turing in 1952 and are typically used to explain natural pattern formation [Tur52]. They have been used to model insect populations, the formation of zebra stripes, crystal formation, galaxy formation, and many other naturally occurring patterns and self-organizing systems. Turing's proposal modeled patterns found in nature by an interaction of chemicals called "morphogens". The different morphogens react with each other and diffuse throughout the substance via the equation:

\frac{\partial m_i(x,t)}{\partial t} = f\big( m_i(x,t), m_j(x,t) \big) + D_m \frac{\partial^2 m_i(x,t)}{\partial x^2}

where m_i(x,t) is the concentration of morphogen i at time t, D_m is the diffusion coefficient, and f(m_i, m_j) is a (typically nonlinear) function that represents the interaction between morphogens. By varying the interaction between chemicals and the speed of diffusion, complicated spatial patterns of chemicals are created. The reaction-diffusion equations have also been used to explain traveling waves, such as the impulse traveling down the axon of a neuron. If the reaction portion of the Re-Di equations represents the kinetics of the system and these kinetics are nonlinear, then the system can create a traveling wave. One requirement for a traveling wave is that the kinetics of the system be excitable, where excitable implies two stable states such that a small excursion away from one state may drive the system to the other. Another requirement is that, after excitation, the system must relax back to the original state. An example of such a system is the FitzHugh-Nagumo (FHN) equations, a simplified version of the Hodgkin-Huxley model that describes the transmission of energy down the axon of a neuron. The FHN equations can be described by the following system of three equations [Mur89]:

\frac{\partial u}{\partial t} = f(u) - v + D \frac{\partial^2 u}{\partial x^2}, \qquad \frac{\partial v}{\partial t} = b u - \gamma v, \qquad f(u) = u(a-u)(u-1)

where u is roughly equivalent to the membrane potential, v lumps together the effects of most of the ionic membrane currents, and a, b, and γ are constants. The null clines of the kinetics in the (u,v) phase plane are shown in Figure 2-3.

Figure 2-3: Null clines for the dynamics of the FHN equations.

The general concept is that when one element fires, its activity diffuses to its neighbors and pushes them just far enough from their stable state to move them to the "excited" state. These newly excited elements then excite their neighbors, and so on, while the elements that were excited originally begin to relax, creating a traveling wave of activity. The traveling wave from the FHN equations is shown in Figure 2-4. In this case, the system not only relaxes, it also has a refractory phase which inhibits future excitation for a period of time [Tys88].
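A traveling wave of this kind is easy to produce numerically. The sketch below is my own explicit-Euler simulation of the FHN kinetics with diffusion in u on a one-dimensional line; the parameter values (a, b, gamma, D, the step sizes, and the boundary handling) are assumptions chosen only so that an excitation launched at one end propagates down the medium.

```python
import numpy as np

def fhn_wave(n=200, steps=1500, dt=0.05, dx=1.0,
             d=1.0, a=0.1, b=0.01, gamma=0.02):
    """1-D FitzHugh-Nagumo medium:
       du/dt = f(u) - v + D d2u/dx2,  dv/dt = b*u - gamma*v,
       f(u) = u*(a - u)*(u - 1).  Illustrative parameters only."""
    u = np.zeros(n)
    v = np.zeros(n)
    u[:5] = 1.0                                    # excite the left end
    for _ in range(steps):
        lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx ** 2
        lap[0] = lap[-1] = 0.0                     # crude no-flux boundaries
        f = u * (a - u) * (u - 1.0)
        u = u + dt * (f - v + d * lap)             # excitable kinetics + diffusion
        v = v + dt * (b * u - gamma * v)           # slow recovery variable
    return u, v

if __name__ == "__main__":
    u, _ = fhn_wave()
    print("excitation front is now near index:", int(np.argmax(u)))
```

Each site is pushed past threshold by diffusion from its already-excited neighbor, fires, and then relaxes while the recovery variable v temporarily blocks re-excitation, which is exactly the excite-relax-refractory cycle described above.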
Figure 2-4: Traveling waves caused by the FHN equations.

Diffusion and other biologically plausible local communication techniques have increasingly been used in neural networks. For example, the Kohonen algorithm can be implemented in analog hardware with an active medium using diffusion [Ruw93]. Diffusion has also been used frequently in visual imaging systems [Cun94], and Sherstinsky and Picard have proposed a cellular neural network based on Re-Di equations that can solve optimization problems [She94]. One key aspect of this work is that diffusion in the PE space of a neural network allows temporal information to be transmitted and stored using only local communication. This is similar to the diffusion of NO in the brain, which is thought to affect the plasticity of synapses in areas where many neurons are firing at once. Without direct connectivity between two PEs, communication and temporal memory can be implemented using the local storage and transmission of a diffusing object (in our case, diffusing activity).

Biological Representations of Time

Another example of neurobiological research that has not been used in ANNs is the concept of rhythm. Recently, there has been some interesting research on oscillators, central pattern generators, rhythm, and their effect on human pattern recognition. Rhythm has been studied in biology, and it has been found that rhythmic signals from insects can be entrained or phase-locked to an external rhythmic pattern without high-level processing (the patterns are faster than the minimum response latency) [McA94]. There is evidence that the dynamics of many biological systems have natural rhythms that share the same frequency. Communication and locomotion, for instance, are highly dependent on rhythm and pacing. It has also been suggested that EEG rhythms play an important role in learning and temporal recognition. For instance, neurons are thought to modify their synaptic strengths only when the θ rhythm is in the correct phase. The θ rhythm is a sinusoidal component of the EEG that ranges from 7-12 Hz; it has been linked with displacement movements (e.g., walking) and many other repetitive actions. Since the θ rhythm must propagate through the neural tissue, it could also play the role of a moving wavefront that controls learning. Rhythm can be thought of in two ways: either as an external pacemaker that synchronizes the network in some fashion, or as the output of a collection of neurons that are working in unison. For the first case, there is little if any research on the effects of an external pacemaker on temporal ANNs. The pacemaker would create a time-varying network where the output of the network depends on the time or phase of the pacemaker; the pacemaker could also act as a sampling signal, so that, for instance, learning occurs only at a specific phase of the θ rhythm. In the second case, the rhythm could be the result of synchronized processing. For instance, waves of activity in the brain could be caused by the processing of the spatio-temporal patterns constantly input to the network by the continuous motions of the eyes and other sensory muscles. Stanley and Kilmer [Sta75] have proposed a "wave model" of memory that can learn sequences.
It is based on the anatomy of the dentate gyrus (in the mammalian hippocampus) and can be summarized as follows:

• The hippocampus is organized into transverse slices called lamellae.
• The majority of connections in the hippocampus do not leave a lamella (small longitudinal spread).
• Sensory inputs arrive via the perforant path and excite cells directly.
• A small number of mossy fibers connect cells longitudinally (across lamellae).
• Cells excited by an input spread excitation to their neighbors, causing a wave of activity to travel down the cell's lamella.

The wave formation is based on the pyramid and granule cells receiving excitatory influences from the hippocampal input pathways, which in turn excite interneurons whose axons inhibit the pyramid and granule cells. This excitation and inhibition create the waves of activity in the lamella. The memory is created by the association of the various waves in different lamellae via the mossy fibers that interconnect the lamellae. Each wave is created by a sensory input that triggers a cell in a lamella, and it can move a short distance before dying. Randomly distributed mossy fibers interconnect the lamellae, and the connection weights are strengthened in a Hebbian manner: when two waves from different lamellae are coincident with a connecting mossy fiber, that connection is strengthened. Thus, the next time the first input wave passes the same position, it can automatically trigger the second wave even without the corresponding input. This is shown in Figure 2-5. For longer temporal relationships, one wave will trigger a second wave in another lamella via pre-strengthened longitudinal connections, and that wave will continue after the first wave has died.

Figure 2-5: Stanley and Kilmer's wave model [Sta75].

Biological Models for Temporal Processing

Living neurons act as leaky integrators with time constants on the order of tens to hundreds of milliseconds. This can lead to the storage of information in a way that may support temporal sequence processing. Most ANN temporal methods store the information in a spatial manner. The spatial approach to signal storage is used in the brain for auditory and visual processing (e.g., SOMs), but the way in which these maps are then processed is not necessarily spatial. Reiss and Taylor propose an interesting temporal sequence storage mechanism based on a leaky integrator network [Rei91]. The basic concept is to use the leaky integrator neurons as temporary storage for an associative memory that is implemented like a single-layer neural net. The network has been shown to have a capacity proportional to the number of neurons. The problem with this network is that the connection matrix seems to be heavily skewed toward predicting only the next input, with little information from further in the past, making it similar to a simple state machine or Markov chain. An interesting part of this work is the possible connection to the function of the hippocampus: the memory network corresponds to the dentate gyrus, the CA3 corresponds to the predictor, and the input line is similar to the perforant path (between EHC, DG, and CA3). Kargupta and Ray proposed a temporal sequence processor that is based on the reaction-diffusion equations [Kar94]. Drawing an analogy between chemical diffusion in biology and spatio-temporal sequential processing, their model is based on a collection of cells that react to different inputs.
When a cell becomes active (by recognizing its input), it outputs its own specific chemical. This chemical diffuses throughout the medium containing the cells, and each cell retains a memory of the chemical makeup at its location when it fires. The background medium thus stores the temporal history of the signal by diffusing all the various chemicals. This approach is more of a chemical model than an information processing model and has several difficulties when applied to realistic problems.

Static Neural Network Learning

This section contains a summary of static neural network learning mechanisms. Almost all of the work in temporal ANNs is based on the principles from static ANNs. Since unsupervised training is most similar to known biological learning mechanisms, it is presented first. Unsupervised learning does not have a desired signal and extracts information only from the input signal. As such, unsupervised techniques typically do not directly implement classifiers but are usually used for preprocessing the input; for example, unsupervised networks can be trained to perform principal component analysis (PCA), vector quantization (VQ), and data reduction. Supervised learning is presented next. These algorithms use a desired signal to train the network to mimic the desired input-output map; the desired signal can be thought of as a teacher or external influence that guides the network to the desired state. As mentioned before, there is no known biological analog to supervised training.

Unsupervised Learning

Most unsupervised (also known as competitive or self-organizing) learning is based on Hebbian learning. Hebbian learning is derived from the work of the neuropsychologist Hebb, who noted in 1949 that when cell A repeatedly participates in the firing of cell B, a growth process occurs between the two cells which increases the efficiency of the link between cell A and cell B. This can be stated as "neurons that fire together, wire together." The mechanism is often called correlation learning because the links are strengthened when there is a statistical correlation over time between the presynaptic and postsynaptic activities. To avoid excessive weight growth, Hebbian synapses typically also include a decrease in the strength of a connection between two cells that are uncorrelated. Conversely, anti-Hebbian learning is a rule that increases the strength of a connection when the presynaptic and postsynaptic signals are negatively correlated and weakens it otherwise. A typical expression for Hebbian learning is

\Delta w_{kj}(n) = \eta \, y_k(n) \, x_j(n)

where w_{kj} represents the synaptic weight between cell k and cell j, x_j is the presynaptic activity, y_k is the postsynaptic activity, and η is the learning rate. This rule, however, does not include the weakening of uncorrelated signals, and thus the weights will grow forever. Introducing a nonlinear forgetting factor into the equation can control the weight growth:

\Delta w_{kj}(n) = \eta \, y_k(n) \, x_j(n) - \alpha \, y_k(n) \, w_{kj}(n)

where α is the decay constant. This equation can be rewritten as

\Delta w_{kj}(n) = \alpha \, y_k(n) \left[ c \, x_j(n) - w_{kj}(n) \right], \qquad c = \eta / \alpha

which is the standard Hebbian learning rule. Notice that when the postsynaptic neuron fires, w_{kj} moves exponentially toward c x_j. By manipulating the definitions of the variables, this equation can be reformulated into the competitive learning rule. In competitive learning, a group of neurons is clustered such that one and only one neuron wins a competition for each input. Algorithmically, the winner is simply selected by choosing the PE with the highest (or lowest) output, which can be physically implemented using lateral inhibition between nodes. Biologically, neurons fire in clusters, and the competition between clusters is believed to be due to long-range inhibition and short-range excitation (a concept that will come up again and again). In the case of a competitive cluster, the winning node has an output value of 1 and the others are all zero. Thus the Hebbian learning rule becomes:

\Delta w_{kj}(n) = \begin{cases} \alpha \left[ x_j(n) - w_{kj}(n) \right] & \text{if neuron } k \text{ wins} \\ 0 & \text{if neuron } k \text{ loses} \end{cases}

Only one neuron (or, in biology, one cluster) learns at each stage, and its weights move toward the location of the input. Thus, the individual nodes specialize on sets of similar patterns and become feature detectors. Competitive learning is typically used for clustering or vector quantization. Hebbian learning is used widely throughout the neural network field, but in its simplest form it is most often used for principal component analysis.

Kohonen SOMs

The Kohonen map or self-organizing feature map (SOM) is a neural network inspired by sensory mappings commonly found in the brain [Wil76][Koh82]. A self-organizing feature map creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. The result of mapping similar inputs to neighboring outputs is a global organization that is extracted from the local neighborhoods. Topographical computational maps have been found in many locations in the brain, including the visual areas (angle of tilt of a line stimulus, motion direction), auditory areas (representations of frequency, of amplitude, and of time intervals between acoustic events), and motor control areas (control of eye movements). More abstract topographic maps have been found in other parts of the brain; for example, there is a map for the location of a sound source based on the interaural differences in an acoustic signal. The SOM is one of the most widely used unsupervised artificial neural network algorithms [Kan94]. The typical SOM is composed of an input layer and an output layer, as shown in Figure 2-7. The input layer broadcasts the vector input to each node in the output layer, scaled by the weights of each connection. Each node has an input term and a lateral feedback term. The topographic mapping is created by the local lateral feedback, where neighboring connections are excitatory and more distant connections are inhibitory. This is called a "Mexican hat" lateral connectivity and is shown in Figure 2-6. The result is similar to the standard competitive network except that the network creates a more gentle cutoff, producing a Gaussian-shaped output after the lateral interconnections have stabilized. This is called a "soft-max" rule (or soft competition), where the winning PE and a few "near-winner" PEs remain active; the competitive rule is called a "hard-max" rule, hard competition, or winner-take-all rule. Depending on the characteristics of the Mexican hat lateral interconnections, the resulting output will be a Gaussian of varying width centered roughly at the location of the maximum output. The process can be described by the following equations:

y_j = \varphi\!\left( I_j + \sum_{k=-K}^{K} c_k \, y_{j+k} \right), \qquad I_j = \sum_{i} w_{ji} \, x_i

where y_j is the output of the jth node, I_j is the input to the jth node scaled by the weights into that node, c_k are the lateral weights described above as the Mexican hat function, and φ is a nonlinear saturating function which keeps the node outputs from growing without bound.

Figure 2-6: Mexican hat lateral connectivity and Gaussian-shaped output.

Figure 2-7: Connectivity of an SOM.

After the outputs have stabilized, the network can be trained with a simple Hebbian-like rule that updates the weights of the winning node and its neighbors. The neighboring nodes can be trained in proportion to their activity (Gaussian), or all neighbors within a certain distance can be trained equally. The learning rule can be described as follows:

w_j(n+1) = w_j(n) + \eta(n) \, \Lambda_{j,i(x)}(n) \left[ x(n) - w_j(n) \right]

where w_j are the weights of node j, x(n) is the input at time n, Λ_{j,i(x)} is the neighborhood function centered around the winning node i(x), and η(n) is the learning rate. Notice that both the learning rate and the neighborhood size are time dependent and are typically annealed (from large to small) to provide the best performance with the smallest training time. A simplified approximation to this algorithm consists of two stages: first, find the winning node (the one whose weights are closest to the input); then update the weights of the winner and its neighbors in a Hebbian manner. The SOM is an unsupervised network with large local connectivity, but unsupervised networks do not typically suffer from overtraining. Because the input is mapped onto a discrete, usually lower-dimensional output space, the SOM is typically used as a vector quantization (VQ) algorithm; the weights of the winning node are the vector-quantized representation of the input. A typical example is mapping a two-dimensional input space onto a one-dimensional SOM. Figure 2-8 shows a random distribution of points that make up the input space in two dimensions, plotted so that the coordinates of each point represent the input data. When this input data is presented to the 1-D SOM, the map trains the nodes to maintain local neighborhoods in the input space, and these local neighborhoods force a global ordering of the output nodes. After training, the nodes of the SOM are ordered and the weights of the nodes represent the center of mass of the portion of the input space to which they respond. By plotting the weights of the SOM PEs onto the input space, one can see where the center of each VQ cluster is located. The SOM is more than just a clustering algorithm: it also orders the PEs such that neighboring PEs respond to neighboring inputs. To show this, we connect neighboring PEs with a line. The right side of Figure 2-8 shows how the SOM maps a one-dimensional structure to cover the two-dimensional input space. This clearly shows that the global ordering has occurred and that the 1-D output snakes its way through the input space in order to maintain its topographic ordering while still covering the input space.

Figure 2-8: Example of a 1-D SOM mapping a 2-D input.

Neural Gas

The neural gas algorithm is similar to the SOM algorithm but without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distances to the reference vectors, not on the distance to the winning PE in the lattice. The neural gas algorithm has been shown to converge quickly to distortion errors which are smaller than those of k-means, maximum-entropy clustering, or the SOM algorithm [Mar93]. Because it has no predefined neighborhood structure as in the SOM, it works better on disjoint or complicated input spaces. Martinetz et al. [Mar93] showed an interesting parallel between most of the major clustering algorithms: the main difference between them is how the neighborhood is defined. For k-means clustering there is no neighborhood; only the winner is trained (a hard max):

\Delta w_i = \varepsilon \, \delta_{i, i(x)} \, (x - w_i)

For maximum-entropy clustering, the neighborhood is a soft max based on the distance between the input and each weight vector:

\Delta w_i = \varepsilon \, h_{\beta}\!\left( \lVert x - w_i \rVert \right) (x - w_i)

For the SOM, the neighborhood is based on the position in the SOM lattice:

\Delta w_i = \varepsilon \, h_{\sigma}\!\left( d_{\text{lattice}}\big(i, i(x)\big) \right) (x - w_i)

and for the neural gas algorithm the soft max is based on the ranking of the node, so that the closest node gets the largest update, followed by the second closest, and so on:

\Delta w_i = \varepsilon \, h_{\lambda}\!\left( k_i(x) \right) (x - w_i)

where k_i(x) is the rank of node i with respect to its distance from the input.

Supervised Training

Supervised training uses a desired signal and an error criterion, and the standard approach performs gradient descent on that criterion to adapt each of the weights in the system. This is graphically depicted in Figure 2-9. The output of each PE in an MLP can be described by the following equation:

y_j = f(\text{net}_j) = f\!\left( \sum_i w_{ji} \, x_i + b_j \right)

where w_{ji} represents the weight from PE i to PE j, x_i represents the output of PE i (or the external input for the first layer), b_j represents the bias for PE j, and f(·) is the nonlinearity of the PE, typically a logistic function (ranging from 0 to 1) or a tanh function (ranging from -1 to 1). The performance surface that is searched using gradient descent is defined by:

J = \frac{1}{2} \sum_{p=1}^{N} \sum_{i} e_{ip}^2 = \frac{1}{2} \sum_{p=1}^{N} \sum_{i} \left( d_{ip} - y_{ip} \right)^2

where e is the difference between the desired signal and the output, p indexes the patterns, and i indexes the output PEs. We want to update each weight based on the partial derivative of J with respect to that weight. By the chain rule,

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_{ip}} \frac{\partial y_{ip}}{\partial \text{net}_{ip}} \frac{\partial \text{net}_{ip}}{\partial w_{ij}} = -\left( d_{ip} - y_{ip} \right) f'(\text{net}_{ip}) \, x_{jp} = -e_{ip} \, f'(\text{net}_{ip}) \, x_{jp}

If we define the local error δ_i for the ith PE as

\delta_i(n) = -\frac{\partial J}{\partial \text{net}_i(n)}

then we can generalize the backpropagation algorithm for the MLP and the LMS algorithm for linear systems. All the weights in gradient descent learning are updated by multiplying the local error δ_i(n) by the local activation x_j(n), according to Widrow's estimation of the instantaneous gradient first shown in the LMS rule:

\Delta w_{ij}(n) = \eta \, \delta_i(n) \, x_j(n)

The difference between these algorithms lies in the calculation of the local error. If the PE is linear, then we have a linear combiner, the derivative of f is a constant, and the equation becomes the LMS rule. If the PE is nonlinear and is an output PE, then the local error is simply the difference between the desired signal and the output, scaled by the derivative of the nonlinearity; this is the delta rule. If the PE is nonlinear and is a hidden-layer PE, then the error is the sum of the back-propagated errors from the PEs that follow it:

\delta_i(n) = f'(\text{net}_i) \sum_{k} \delta_k(n) \, w_{ki}

This simple rule nicely summarizes the backpropagation algorithm and shows its relationship to other adaptive algorithms.

Second Order Methods

The standard backpropagation method of training a neural network uses the LMS approximation to gradient descent, which uses only an instantaneous estimate of the gradient. Second-order methods collect information over time to get a better estimate of the gradient, thus allowing faster convergence at the cost of more computations per cycle.
In linear adaptive filtering, the recursive least squares (RLS) algorithm is used for exactly this purpose. The RLS algorithm is based upon estimating the inverse of the correlation matrix of the input. With this information, the RLS algorithm can often adapt as much as ten times faster than LMS. The RLS algorithm can also be formulated as a special case of the Kalman filter. Besides faster convergence, the RLS algorithm also has two other advantages [Hay96]: the eigenvalue spread of the correlation matrix does not PAGE 47 40 adversely affect the training (unlike in LMS) and the learning rate is automatically determined (the Kalman gain). Since RLS and Kalman filtering are derived for linear systems, they must be modified for use with nonlinear systems. These are typically called extended RLS or extended Kalman filtering. The most straightforward approach is to linearize the total cost function and directly apply RLS. This requires the storage and update of the complete error covariance matrix whose size is the square of the number of weights in the network [Hay94]. A better approach is to apply RLS to each node individually and linearize the activation function of the PE using a Taylor series about the current operating point. This method is called the multiple extended Kalman algorithm (MEKA) [Hay94] and reduces the computational requirements by ignoring the cross-terms between PEs. Temporal Neural Networks As we stated previously, the majority of the temporal neural networks are extensions of the static neural networks, either by adding memory or adding recursive connections. This section could just as easily be called "Extending static architectures to include time". Again, this topic will be discussed in two sections, supervised and unsupervised neural networks. Temporal Unsupervised Learning This section presents the methodologies currently available to add temporal information to unsupervised networks. Almost all work done on temporal unsupervised training has used self-organizing maps. PAGE 48 41 As mentioned before, a self-organizing map (SOM) creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. There have been many attempts at integrating temporal information into the SOM. One major technique is to add temporal information to the input of the SOM. For example, exponential averaging and tappeddelay lines were tested in [Kan90][Kan91], while coding in the complex domain was implemented in [Moz95]. Another common method is to use layered or hierarchical SOMs where a second map tries to capture the spatial dynamics of the input moving through the first map [Kan90][Kan91]. More recently, researchers have begun integrating memory inside the SOM, typically with exponentially decaying memory traces. Privitera and Morasso have created a SOM with leaky integrators and thresholds at each node which activate only after the pattern has been stable in an area of the map for a certain amount of time. This allows the map to pick out only the "stationary" regions of the input signal and use these sequences of regions to detect the input sequence [Pri93][Pri94][Pri96]. The SARDNET architecture [Jam95] adds exponential decays to each neuron for use in the detection of node firing sequences. Once a node fires for a particular sequence, it is not allowed to fire again. 
Therefore, at the end of the sequence presentation, the sequence of node firings can be detected (or recreated) using the decayed outputs of the SOM. The exponential decay, however, provides poor resolution at high depths and thus will perform poorly with noisy and/or long sequences. PAGE 49 42 Chappell and Taylor have created a SOM which has neurons that hold the activity on their surface via leaky integrator storage [Cha93]. This activity is added to the typical spatial distance between input and weight vector to determine the next winner. The same or neighboring nodes will thus be more likely to win the competition for successive elements in a sequence. This creates neighborhoods with sensitivity to the previous input (i.e. context). There is not a successful method available yet to train these networks. The learning law proposed by Chappel and Taylor can lead to an unstable weight space. The methodology seems to work for patterns of binary inputs with at most length 3. Critchley [Cri94] has improved the architecture by moving the leaky integration to the synapses. This gives the network a much better picture of the temporal input space and has much more stable training, but becomes nothing more than an exponentially windowed input to a standard Kohonen map, as proposed by Kangas [Kan90]. The temporal organization map (TOM) integrates a cortical column model, SOM learning and separate temporal links to create a temporal Kohonen map [Dur96]. The TOM is split into super-units that are trained via the SOM learning algorithm. Winning units from each super-unit fire and then decay. Temporal links are made between the currently firing node and any node which has an activity above a threshold. Thus there can be multiple links created for each firing, allowing for the pattern to skip states. Kohonen and Kangas have proposed the hypermap architecture to include context in the SOM architecture. Kohonen's original hypermap architecture included two sets of inputs and weights [Koh91]. The first set is a context vector that is a tapped delay line of the past and future pattern vectors. This input is used to determine a "context domain" in the SOM. All nodes in the context domain are labeled active and are then presented with PAGE 50 43 the current input pattern. The "pattern" weights and context weights are then trained in the typical SOM manner. Kangas extended this concept by eliminating the context weights and allowing only nodes in the vicinity of the last winner to be selected. This smoothes the trajectory of winning nodes throughout the map and allows context to affect the selection of the winner without the addition of parameters like the width of the context window [Kan92]. Kangas has also proposed an SOM architecture that has an LPC predictor at each node in the Kohonen net. This provides temporal pattern recognition by using a filter at each node where the AR filters were trained via either genetic programming or gradient descent [Kan94]. Goppert and Rosenstiel conceptually extend this concept to include the notion of attention [Gop94a][Gop94b][Gop95], The theory being that the probability of selecting a winner is affected by either higher-cognitive processes (which may be considered a type of supervision) or by information from the past activations of the network. This gives two components to the selection of a winner, the extrasensory distance (context or higher processes) and sensory distance (normal distance form weight to input). These two components can be added or multiplied. 
They focus on the concept of context and create a moving area of attention, which is the region that has been activated most in the recent past. The center of attention moves as each winner is selected and the region of attention has a Gaussian weighting applied to it so that nodes near the last winner will be more likely to fire the next time. The architecture outperformed the standard SOM on simple temporal tasks but did not train well on more complicated trajectories. PAGE 51 44 Temporal Supervised Neural Networks The main problem with temporal supervised neural networks is the complexity in training them. When the desired architecture contains recurrent connections or memory in one of the hidden layers, the network must be trained with a temporal gradient descent algorithm. There are two distinct approaches to the problem, modifying the architecture to simplify the temporal gradient calculations and creating better and/or faster methods of training the temporal neural networks. Architectural approaches The focused time-delay neural network (TDNN) has memory added only at the first layer and is the simplest example of an architecture designed to avoid many of the complications of temporal neural networks. It is simply a static MLP with a tap delay line between the input and the first layer. Because the memory is restricted to the first layer, the network can still be trained using static backpropagation. The tap delay line maps a segment of the input trajectory into an N-dimensional static image that is then mapped by the MLP. This works quite well for many applications, but has a number of difficulties as mentioned previously. The main difficulties are the increased number of weights required for TDNNs (each input now requires m weights where m is the number of taps in the tap delay line) and the inflexible, prewired nature of the tap delay line. Some of the problems with TDNNs have been attacked by defining the connectivity between layers such that only certain regions of each layer are connected. By doing this, certain regions in the input layer, corresponding to certain time periods of the input, can be connected to a single region of the second layer. This provides a more PAGE 52 45 goal directed architecture that can be time-shift or frequency-shift invariant. Although this can reduce the effects of the problems of TDNNs, the problems still remain and each network must be tailored for each application. [Saw91][Haf90] Two other networks deal with temporal information by using a very restrictive type of feedback. The Jordan network [Jor86] use recurrency between the output and the input of the network. The output of the network is fed back to a context unit which is simply a leaky integrator. The Elman network [Elm90] provides feedback from the hidden layer to the context units in the input layer. This is potentially more powerful than the Jordan network because it stores and uses the past state of the network, not just the past output of the network. Although both networks are commonly found in the neural network literature, neither is particularly powerful or easily trained. Recurrent networks are also continuously being modified in an attempt to improve their performance on temporal problems. Mozer has proposed a "multiscale integration model" that uses recurrent hidden units that have different time constants of integration, the slow integrators forming a coarse but global sequence memory and the fast integrators forming a fine grain but local memory [Moz92]. 
This work, however, is based only on exponentially decaying memory and the problem of selecting the time constants has not been solved (the time constants have to be hand tuned). A different spin on recurrent networks is the use of "higher order networks". These networks are recurrent networks where: PAGE 53 46 Â• hidden units represent states and the output of these states are fed back and multiplied with the inputs of the nodes, thus allowing second order statistics to be used [Gil91][Wat91] Â• one network computes the weights for a second network [Pol91][Sch92a][Sch92b] The higher order networks have proven to be excellent sequence recognizers (grammar recognizers), but have failed to make a serious impact on temporal processing. These networks provide a representation for states in the neural network and allow the computation of high order statistics. For example, a second order network can compute the autocorrelation of the input, thus creating a translation invariant architecture. The main disadvantage of this work is that for complex tasks, higher order networks require even more weights and have even more complicated performance surfaces than standard ANNs, Algorithmic approaches There are two fundamental methods of computing the gradient for a dynamic neural network. First, the gradients can be computed in the backward direction similar to the static backpropagation techniques from feedforward networks. Unfolding a recurrent network in time creates a large, static, feedforward network where each "layer" consists of an instance of the recurrent network at each time step. Backpropagation can then be applied to this large feedforward network and the gradient can be computed. This is called backpropagation through time (BPTT) [Rum86], The main shortcoming of this technique is that it is non-causal. The BPTT algorithm must be used in a batch mode, the data travels first in the forward direction while the entire state of the network is saved at PAGE 54 47 each step. Next, the error is backpropagated in the reverse temporal order. A secondary shortcoming of BPTT is the memory required to store the state of the network at each iteration. Many alterations have been made to the BPTT algorithm to improve its utility, in particular to make it usable as an on-line algorithm. Williams and Peng [Wil90] used a history cutoff where they assumed that the gradient information from the distant past is relatively inconsequential and thus can be ignored. Combining this and the use of a small step size, the new algorithm, BPTT(k) can be used in an on-line manner. See Pearlmutter [Pea95] for a review of this technique and others. The second fundamental method of computing the gradient for a recurrent neural network computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial of each node with respect to each weight at every iteration. The method is completely on-line and simple to implement. The main difficulty with the RTRL method is its computational complexity. If we define n to be the number of PEs and m to be the number of weights, then the computation of the gradients of each PE with respect to each weight is O(n^m). For a fully recurrent network, this dominates the computational complexity and requires O(n^) computations per step. The algorithm works quite well on small networks, but the rr factor becomes overwhelming as the number of nodes increases. 
The RTRL algorithm for a recurrent network can be summarized by the following set of equations [Hay94]. First, we define set A as the set of all inputs, set B as the set of all PEs, and set C as the set of outputs with desired signals. The forward activation equations are

\[ \mathrm{net}_j(n) = \sum_{l \in A \cup B} w_{jl}(n)\, u_l(n), \qquad y_j(n+1) = \varphi\big(\mathrm{net}_j(n)\big) \]

where u represents the input vector at each time step and is composed of both the external inputs and the outputs of each PE (the feedback values). The gradient descent technique is based upon computing the sensitivity of each PE with respect to each weight. The weights are updated on-line using these sensitivities:

\[ \Delta w_{kl}(n) = -\eta\, \frac{\partial e(n)}{\partial w_{kl}(n)} \]

For implementation, we create a matrix π that represents these sensitivities and write an update equation for it:

\[ \pi^{j}_{kl}(n) = \frac{\partial y_j(n)}{\partial w_{kl}(n)}, \qquad j \in B,\; k \in B,\; l \in A \cup B \]

\[ \pi^{j}_{kl}(n+1) = \varphi'\big(\mathrm{net}_j(n)\big)\Big[\sum_{i \in B} w_{ji}(n)\,\pi^{i}_{kl}(n) + \delta_{kj}\, u_l(n)\Big], \qquad \pi^{j}_{kl}(0) = 0 \]

π is a matrix of gradients with the rows representing weights and the columns representing nodes; thus it contains mn elements.

Many methods have been proposed to increase the speed of RTRL. Schmidhuber and others have mixed BPTT and RTRL, which reduces the complexity to O(nm) [Sch92]; this technique takes blocks of BPTT and uses RTRL to encapsulate the history before the start of each block. Sun, Chen, and Lee have developed an O(nm) on-line method based on a Green's function approach [Sun92]: by solving an auxiliary set of equations, the redundancies in the computation of the sensitivities over time can be removed. Zipser approached the problem in a different way and reduced the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs [Zip89]. The PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, this reduces the complexity of the RTRL algorithm to O(m). This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm and sometimes requires more PEs than the standard RTRL algorithm to converge.

Second order methods

Many researchers argue that simple gradient descent is not sufficiently powerful to discover the sort of relationships that exist in temporal patterns, especially those that cover long time sequences or involve high order statistics [Moz94]. Bengio, Frasconi, and Simard [Ben93] also present theoretical arguments for the inherent limitations of learning in recurrent networks. Many researchers have recently started using the extended Kalman filter algorithm, which is very similar to the RLS algorithm, for training dynamic neural networks. As described previously, the extended Kalman filter uses information from the correlation matrix, accumulated over time, to better approximate the direction to the bottom of the performance surface. Again, the problem with the extended Kalman filter is that it requires the computation and storage of the correlation matrix between the weights of the system; the computational requirement is O(N²) in the number of weights. The standard method of reducing this load is to decouple the PEs of the network so that the correlation matrix is only computed between weights that terminate at the same PE. Puskorius and Feldkamp [Pus94] call this method the decoupled extended Kalman filter (DEKF) algorithm.
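Before moving on, here is a minimal Python sketch of the forward RTRL recursions summarized above, for a single fully recurrent layer trained on-line. The network size, nonlinearity, learning rate, and toy task are assumptions made for the example; the sensitivity array pi[j, k, l] stores ∂y_j/∂w_kl.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 2                        # recurrent PEs, external inputs
m_cols = p + n                     # columns of u = [external inputs, fed-back outputs]
W = rng.normal(scale=0.3, size=(n, m_cols))   # weights w_{jl}
pi = np.zeros((n, n, m_cols))      # pi[j, k, l] = dy_j / dw_{kl}, pi(0) = 0
y = np.zeros(n)
eta = 0.05

def phi(v):  return np.tanh(v)
def dphi(v): return 1.0 - np.tanh(v) ** 2

def rtrl_step(x, d, idx_out=0):
    """One forward step plus the RTRL sensitivity and weight updates."""
    global W, pi, y
    u = np.concatenate([x, y])                 # u(n): inputs plus fed-back outputs
    v = W @ u                                  # net_j(n)
    y_next = phi(v)                            # y_j(n+1)
    # pi_{kl}^j(n+1) = phi'(net_j)[ sum_i w_{ji} pi_{kl}^i(n) + delta_{kj} u_l(n) ]
    Wrec = W[:, p:]                            # weights on fed-back outputs only
    new_pi = np.einsum('ji,ikl->jkl', Wrec, pi)
    for j in range(n):
        new_pi[j, j, :] += u
    new_pi *= dphi(v)[:, None, None]
    # on-line weight update from the error at the single target PE (an assumption)
    e = d - y_next[idx_out]
    W += eta * e * new_pi[idx_out]
    pi, y = new_pi, y_next
    return y_next

# toy usage: drive the network with a sinusoid and track a delayed copy of it
for t in range(500):
    x = np.array([np.sin(0.3 * t), 1.0])
    rtrl_step(x, d=np.sin(0.3 * (t - 1)))
```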
The main difference between the dynamic version of the RLS/EKF algorithm and the static version is that the gradients used in the second order calculation are the dynamic gradients, not the static gradients. Thus, the BPTT or RTRL algorithms must still be used to compute these gradients.

Sequence Recognition

There are two broad categories of temporal problems typically addressed in the literature: sequence recognition and temporal pattern processing. Sequence recognition is typically the process of recognizing (and often reproducing) discrete symbolic sequences; these problems focus on recognizing grammars and symbolic patterns. Temporal pattern processing, in contrast, involves the recognition, identification, control, or other processing of a continuous signal that varies with time. Speech recognition is an example of temporal pattern recognition. The continuous signal can be vector quantized and turned into a symbolic pattern, but it is not practical to then treat it as a sequence recognition problem: temporal patterns of interest are difficult to quantize accurately and typically contain various forms of time warping and noise, which make sequence recognition of quantized temporal patterns nearly impossible.

Since the emphasis of this work is not sequence recognition, we will only briefly introduce a few interesting neural networks that accomplish this task. Wang and Arbib have proposed models based on the two dominant theories of "forgetting": the decay theory, where memories decay from the time they are entered [Wan90], and the interference theory, where memory only decays when new inputs that must be remembered arrive [Wan93]. Both architectures are based on a winner-take-all field of neurons where the winning node fires and is then decremented slowly. The sequence is detected using an extra "detector unit" trained by the Hebbian rule with attentional learning. The main difficulty with these and other sequence recognizers is that they tend to be intolerant of time warping and of missing or noisy data, problems that are prevalent in temporal pattern recognition.

The outstar avalanche was an early neural network used to learn and generate temporal patterns [Gro82]. It is composed of N sequential outstars, each of which detects an input and triggers the next in a chain, producing an avalanche effect. This architecture was modified to include the combined effect of the input dot product and the avalanche input from preceding nodes and was called the spatio-temporal network (STN) [Fre91]. The sequential competitive avalanche field (SCAF) [Hec86] is a further extension of the STN in which each node has lateral interconnections, allowing the outstars to be competitive.

Comparison of Hidden Markov Models with ANNs

Due to the difficulties of modeling sequential structure with ANNs, hidden Markov models have become the gold standard for modeling many temporal processes (e.g. speech). Time sequence matching is a major problem in applying neural networks to temporal, non-stationary processes. Although ANNs have been successfully applied to time series prediction [Wei94], they have not been as successful in tasks that have synchronization problems such as time warping. For example, different utterances of the same word can have very different timescales; both the overall duration and the details of timing can vary greatly.
ANN models for speech have been shown to yield good performance only on short, isolated speech units (e.g. phoneme detection). They have not been shown to be effective for large-scale recognition of continuous speech. The TDNN, for example, has powerful methods for dealing with local dynamic properties but cannot deal with sequences explicitly. The HMM provides a compact, tractable mechanism for handling this temporal information by including explicit state information. Various neural network techniques have attempted to add state information, typically via feedback, but have been successful only on modest-size applications. HMMs are stochastic in nature and thus can succeed even when the temporal nature of the system is locally very noisy. Speech patterns, for example, are to some extent a sequential process, yet they are sufficiently ambiguous locally that it is not adequate to make decisions locally and then process sequences of symbols.

Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified; that is, the temporal information is stored in the current state of the system, and all relevant temporal information must be representable this way (there is no other memory in the system). The second assumption is that the outputs depend stochastically only on the state of the system.

The two main advantages of ANNs over HMMs are that ANNs are discriminative and that ANNs do not rely on the Markov assumptions. Typically, HMMs are trained using a within-class method (each model is trained only on in-class, segmented data). ANNs, however, can be trained to find the differences between classes, so they can discriminate between classes, not just detect or model them. ANNs also place few restrictions on the systems they can model, whereas HMMs assume that the observations are independent and that the underlying process is a Markov process. New methods that marry the discriminative power of the ANN with the temporal nature of the HMM have been relatively successful [Bou90].

CHAPTER 3
TEMPORAL SELF-ORGANIZATION

Introduction and Motivation

As described in the previous chapters, working with temporal patterns has been a very difficult task for neural networks. This is largely because the methodologies applied to temporal processing are simple extensions of static neural networks, with little regard for the unique nature of time and time-based signals. Most of these architectures simply add memory to a well-known static network; they can achieve reasonable performance on simple problems but do not perform as well on more complex ones. As in the 1980s, when pattern recognition and classification drove the research community to develop neural networks, biological systems still easily outperform state-of-the-art solutions to temporal processing problems. For this reason, I began researching biological neural networks and biological mechanisms that might help us better solve these problems. As my research progressed, two key aspects continually resonated with my underlying goal of creating better neural networks for temporal pattern processing. These two elements are the self-organization of similar or correlated cells into clusters or neighborhoods (similar to place cells in the Hippocampus), and the diffusion of information over time and space.
Self-organization describes a system in which each individual entity has only simple local rules governing its behavior. These simple local rules, however, can create global organization without any global control. Self-organization applies at virtually every scale of the universe, from neurons and brain cells, to insect populations, to solar systems and galaxies. It is tremendously important in the formation of the brain and, in my opinion, is greatly underutilized in artificial neural networks.

The second element is diffusion. Like self-organization, diffusion is found everywhere. It can be derived from simple random Brownian motion (again, simple local rules), in which particles and other objects move from areas of high density to areas of low density. Diffusion itself is a rather simple concept that may not appear to add much to neural network theory. However, when diffusion is added to a dynamical system (for instance, in the reaction-diffusion equations), the resulting system can exhibit tremendously interesting and powerful dynamics.

The Model

Most temporal neural networks use short-term memory to transform time into space. This time-to-space mapping is usually the only mechanism for dealing with temporal information: the neural network operates as if the temporal pattern were simply a much larger spatial pattern. This is clearly inefficient. My method uses diffusion to create self-organization in time and space. The idea is to leave the fundamentals of the neural network the same (in order to use the theory and knowledge we have already accumulated) but to add self-organization in space-time to the PEs of the network. By creating temporally correlated neighborhoods in the field of PEs making up the network, the basic functionality of the network becomes more organized and temporally sensitive, without drastically changing its underlying operation.

The mechanism for the creation of these temporally correlated neighborhoods is diffusion. In the brain, nitric oxide (NO) is given off by firing neurons and diffuses through the surrounding tissue. NO has also been shown to affect the sensitivity of a neuron to synaptic changes (the analogue of weight changes in neural networks). It has been theorized that this diffusion of NO may be responsible for the creation of place cells and other organization in the brain. In a more abstract sense, the diffusion of NO can be considered the diffusion of neural activity: when a large group of neurons fires in close proximity (both temporally and spatially), a local build-up of NO probably occurs and diffuses outward.

In my architectures, I use this concept of activity diffusion to create the temporally correlated neighborhoods. When a PE or group of PEs fires, it influences its neighbors, typically lowering their thresholds so that they are more likely to fire in the near future. Because the underlying mechanism of most neural network training is Hebbian in nature, when neighboring PEs fire in a correlated fashion they tend to continue to fire in a correlated fashion. This creates the temporally correlated neighborhoods and the self-organization in space-time.

I have applied this concept to three different ANN architectures. The first is based on the self-organizing map (SOM) and is the most biologically inspired. The second is based on the neural gas algorithm, which provides a more powerful but functionally similar solution.
Lastly, to prove the robustness of the method, I applied it to the training of recurrent MLPs. MLPs are a totally different architecture and are trained in a totally different manner (supervised rather than unsupervised). The MLP is not biologically relevant, but the temporal self-organization method still proved to decrease training times dramatically.

The rest of this chapter is divided into three sections based on these architectures. It is arranged chronologically so that the presentation flows more smoothly, even though the MLP architecture may be the most useful of the three. This chapter presents only the theoretical derivation of each architecture and a simple illustrative example for each. Detailed application of each method to more practical problems will be presented in the next chapter.

Temporal Self-Organization in Unsupervised Networks

This section describes the two unsupervised networks to which I have applied the concept of temporal clustering. The first architecture is based on the self-organizing map and is called the self-organizing temporal pattern recognizer (SOTPAR). The second architecture is based on the neural gas algorithm and is called the SOTPAR2.

Temporal Activity Diffusion Through a SOM (SOTPAR)

The self-organizing temporal pattern recognizer (SOTPAR) [Eul96a][Eul96b] is a biologically inspired architecture for embedded temporal pattern recognition (finding patterns in an unbounded input sequence without segmentation or markings). This is a difficult task, since the patterns must be searched for from every possible starting point. Although the SOTPAR architecture is unsupervised and thus cannot be used efficiently as a pattern recognition device by itself, it preprocesses the input so that patterns commonly found in the training data are easily detectable from its output. Most of the emphasis in this work is on the proper temporal representation of the spatio-temporal data.

The SOTPAR architecture adds two temporal characteristics to the SOM architecture: activity diffusion through the space of output PEs and the temporal decay of activations. Using these concepts, the SOTPAR converts and distributes the temporal information embedded in the input data into spatial connections and ordered PE firings in the network, all using self-organizing principles. Like self-organizing maps, the network uses competitive learning with neighborhood functions [Koh82]. In the SOM, the input is simultaneously compared to the weights of each PE in the system, and the PE with the closest match between the input and its stored weights is the winner. The winner and its neighbors are then trained in a Hebbian manner, which brings their weights closer to the current input.

The key concept in the SOTPAR architecture is the activity diffusion through the output space. The firing of a PE causes activity to diffuse through the network and affects both the training and the recognition behavior of the network. In the SOTPAR, the activity diffusion moves through the lattice of the SOM structure and is modeled after the reaction-diffusion equation [Mur89]

\[ \frac{\partial m_i(x,t)}{\partial t} = f\big(m_i(x,t)\big) + D\,\frac{\partial^2 m_i(x,t)}{\partial x^2} \]

where m_i can be considered the activity of PE i, f(·) can be considered the current match, and the second-derivative term is the diffusion of activity over space and time. If the system is an "excitable medium" (a multi-stable dynamical system), the diffusion of activity can create traveling pulses or wavefronts in the system.
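The traveling pulses mentioned above are easy to reproduce numerically. The sketch below uses a generic FitzHugh–Nagumo-style excitable medium discretized on a one-dimensional line of cells; the reaction term, parameter values, and grid size are my own illustrative choices and this is not the SOTPAR update itself. A brief stimulus at one end launches a wavefront that travels along the medium.

```python
import numpy as np

# 1-D excitable medium: dm/dt = f(m, w) + D * d^2 m / dx^2
# (FitzHugh-Nagumo-style reaction term; all parameter values are illustrative.)
N, D, dt = 200, 1.0, 0.05
a, eps, gamma = 0.1, 0.01, 0.5

m = np.zeros(N)          # fast "activity" variable of each cell
w = np.zeros(N)          # slow recovery variable
m[:5] = 1.0              # brief local stimulus at the left end

for step in range(4001):
    lap = np.zeros(N)                                   # no-flux boundaries
    lap[1:-1] = m[:-2] - 2 * m[1:-1] + m[2:]
    lap[0], lap[-1] = m[1] - m[0], m[-2] - m[-1]
    m = m + dt * (m * (1 - m) * (m - a) - w + D * lap)  # reaction + diffusion
    w = w + dt * eps * (m - gamma * w)
    if step % 1000 == 0:
        front = np.where(m > 0.5)[0]
        print(step, front.max() if front.size else '-')  # wavefront position
```

The same qualitative behavior, activity that keeps moving forward only while it is reinforced, is what the SOTPAR approximates with its much cheaper traveling enhancement.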
When the activity diffusion spreads to neighboring PEs, their thresholds are lowered, creating a situation where the neighboring PEs are more likely to fire next. I define enhancement as the amount by which a PE's threshold is lowered. In the SOTPAR model, the local enhancement acts like a traveling wave. This significantly reduces the computation of the diffusion equations and provides a mechanism by which temporally ordered inputs trigger spatially ordered outputs. This is the key aspect of the architecture. The traveling wave decays over time because of competition for limited resources with other traveling waves; it can remain strong only if spatially neighboring PEs are triggered by temporally ordered inputs, in which case the traveling waves are reinforced. For a simple one-dimensional case, Figure 3-1 shows the enhancement for a sequence of spatially ordered winners (in order: PE1, PE2, PE3, PE4) and for a sequence of random winners (in order: PE4, PE2, PE1, PE5), which would be the case if the input were noise or unknown.

Figure 3-1: Temporal activity in the SOTPAR network. (a) Activity created by temporally ordered input; (b) activity created by unordered input.

In the ordered case, the enhancement lowers the threshold of PE5 dramatically more than those of the other PEs, making PE5 likely to win the next competition. In the unordered case, the enhancement becomes weak and affects all PEs roughly evenly.

The second temporal functionality added to the SOM is the decay of output activation over time. This is also biologically realistic [Cha93]. When a PE fires or becomes active, it maintains an exponentially decaying portion of its activity afterward. Because the PE's activity decays gradually, the wavefront it creates is spread out over time rather than being a simple traveling impulse. This spreading creates a more robust architecture that can gracefully handle both time warping and missing or noisy data. The decay of activity also suggests another biological explanation for the movement of the enhancement through the network: if we define a neighborhood around a neuron as one in which it has strong excitatory connections with its neighbors, then the decaying activity of a neuron that fired in the past will help to fire (or lower the threshold of) its neighboring PEs.

Algorithm description

To simplify the description of the algorithm, I will use 1-D maps and let the activity propagate in only one direction, since the diffusion of activity is severely restricted in the one-dimensional case. The output space can thus be considered a set of PEs connected by a string, with information passed between PEs along this string. The activity/enhancement moves in the direction of increasing PE number and decays at each step. An implementation of the activity diffusion in one string is shown in Figure 3-2; it includes the activity decay at each PE and the activity movement through the net in the left-to-right direction. The factors μ and (1-μ) are used to normalize the total activity in the network.

Figure 3-2: Model for activity diffusion in one string of the SOTPAR.

This activity diffusion mechanism serves to store the temporal information in the network.
During training, the PEs become spatially ordered so as to sequentially follow any temporal sequences presented. At each iteration, the activity of the network is determined by calculating the distance between the input and the weights of each PE and allowing for membrane potential decay:

\[ act(t,x) = act(t-1,\,x)\,(1-\mu) + dist\big(inp(t),\, w_x\big)\,\mu \]

where act(t,x) represents the activity of PE x at time t, and dist(inp(t), w_x) represents the distance between the input at time t and the weights of PE x. Typically the activity is thresholded and enhanced before being propagated, for example

\[ act' = \max(act - 0.5,\; 0) \times 2 \]

Next, the winning PE is selected by

\[ winner = \arg\max_x \big(act + \beta \cdot enhancement\big) \]

where the enhancement is the activity being propagated from the left. The parameter β is the spatio-temporal parameter that determines how much a temporal wavefront can lower the threshold for PE firing. By increasing β, the thresholds of neighboring PEs can be lowered to the point where the next winner is almost guaranteed to be a neighbor of the current winner, forcing the input patterns to be sequential in the output map. It is interesting to note that as β → 0 the system operates like a standard SOM, while as β → ∞ it operates like an avalanche network [Gro82].

Once the winner is selected, it is trained along with its neighbors in a Hebbian manner with normalization as follows:

\[ w_x = w_x + \eta \cdot neigh(x)\cdot\big(inp(t) - w_x\big) \]

where the neighborhood function neigh(x) defines the closeness to the winner (typically a Gaussian function) and η is the learning rate. In the current implementation, the spatio-temporal parameter, the learning rate, and the neighborhood size are all annealed for better convergence.

Representation of memory

The activity diffusion in this network creates a unique spatio-temporal memory that stores and distributes the temporal information in the network itself. Most short-term memory structures can be described by convolving the input sequence with a kernel that describes the structure of the memory. This kernel is typically one-dimensional and describes the temporal features of the memory, i.e. its depth. The SOTPAR's memory is implemented in its "enhancement", which moves through time and space. Thus, the SOTPAR memory kernel is spatio-temporal and must be described in at least two dimensions.

There are two slightly different ways to implement the temporal enhancement in a 1-D SOTPAR; the difference lies in the decaying exponential portion. In method 1, only the activity at each node is decayed; the contributions from the wavefronts do not add to the time-dependent behavior of each node. The equations for this system are:

\[ E(n,t) = E(n-1,\,t-1)\,\mu + A(n,t) \]
\[ A(n,t) = A(n,\,t-1)\,(1-\mu) + In(n,t) \]

where E(n,t) is the enhancement at node n at time t, A(n,t) is the activity at node n at time t, and In(n,t) is the matching result between the input and the weights of node n at time t. Expanding these equations gives:

\[ E(n,t) = E(n-1,t-1)\,\mu + A(n,t-1)\,(1-\mu) + In(n,t) \]
\[ = \big(E(n-2,t-2)\,\mu + A(n-1,t-2)\,(1-\mu) + In(n-1,t-1)\big)\,\mu + A(n,t-2)\,(1-\mu)^2 + In(n,t-1)\,(1-\mu) + In(n,t) \]
\[ = \sum_{k\ge 0}\;\sum_{\tau\ge 0} In(n-k,\; t-k-\tau)\;\mu^{k}\,(1-\mu)^{\tau} \]

This expansion shows how the results of the matching activity (called the "input", for lack of a better word) contribute to the enhancement. The traveling waves create two decaying exponentials, one that moves through space (μ^k) and one that moves through time ((1-μ)^τ).
The past history of the node is added to the enhancement via the recursive self-loop in (1-μ); the wavefront motion is added via the diagonal movement through the left-to-right channel, scaled by μ. The farther a node is off the diagonal and the farther back in time, the less influence it has on the enhancement. The SOTPAR enhancement equation is similar to the gamma memory impulse response for tap n,

\[ g_n(t) = \binom{t-1}{n-1}\,\mu^{n}\,(1-\mu)^{\,t-n} \]

By substituting t-n for τ, the SOTPAR equation becomes even more similar to the gamma kernel. The SOTPAR enhancement, however, is not an impulse response: the SOTPAR allows input at each element of the memory structure, unlike the gamma memory, which is a generalized tapped delay line. The input at different times and spatial locations is therefore required to describe the enhancement (an impulse response alone does not capture the desired information). In summary, the SOTPAR enhancement is a spatially distributed gamma memory with inputs at each tap.

The second method for implementing the enhancement is to allow the enhancement itself to pass through the self-feedback at each node. This allows an input to add to the enhancement multiple times by following different paths through the network. For example, In(n-1, t-2) can reach E2(n,t) either by looping first at node n-1 and then moving to position (n,t), or by first moving to position (n, t-1) and then looping at node n until time t. The equation for the enhancement in this case is:

\[ E2(n,t) = E2(n-1,\,t-1)\,\mu + E2(n,\,t-1)\,(1-\mu) + In(n,t) \]
\[ = \big(E2(n-2,t-2)\,\mu + E2(n-1,t-2)\,(1-\mu) + In(n-1,t-1)\big)\,\mu + \big(E2(n-1,t-2)\,\mu + E2(n,t-2)\,(1-\mu) + In(n,t-1)\big)\,(1-\mu) + In(n,t) \]
\[ = \sum_{k\ge 0}\;\sum_{\tau\ge 0} In(n-k,\; t-k-\tau)\;\mu^{k}\,(1-\mu)^{\tau}\,(\tau+1) \]

This method of enhancement increases the contribution of the off-diagonal elements via the term (τ+1) and allows more flexibility for non-sequential node firings. The two enhancement techniques are shown for two values of μ in Figure 3-3 and Figure 3-4. Both figures show enhancement method 1, enhancement method 2, and the difference between method 2 and method 1, which reveals the increased influence of the off-diagonal elements. The figures also illustrate the effect of μ on the enhancement. With μ = 1, the time decay at each node is disconnected and the enhancement moves only from node to node. With μ = 0, the spatial movement of the enhancement is disconnected and only node decay contributes to the enhancement. Lower values of μ create a broader enhancement, while higher values create narrower enhancement waves in which almost all of the activity moves from one node to the next (down the diagonal of time and space). This can be seen in the figures as a much sharper contribution to the enhancement along the diagonal as μ moves from 0.5 to 0.75.

Figure 3-3: Enhancement in the network with μ = 0.5 (method 1, method 2, and their difference, plotted over node and time).

Figure 3-4: Enhancement in the network with μ = 0.75 (method 1, method 2, and their difference, plotted over node and time).

Another possible approach would be to decouple the two exponentials μ and (1-μ). This would require external normalization to keep the enhancement from growing without bound, but would provide more flexibility.
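For readers who want to visualize the two kernels, the following sketch simply iterates the recursions above over a small node-time grid; the grid size, the value of μ, and the single-impulse test input are assumptions made for the example.

```python
import numpy as np

def enhancement(In, mu, method=1):
    """Spatio-temporal enhancement kernels for a 1-D SOTPAR string.

    In[n, t] is the matching result of node n at time t.
    Method 1: only node activity decays:
        A(n,t) = (1-mu)*A(n,t-1) + In(n,t);  E(n,t) = mu*E(n-1,t-1) + A(n,t)
    Method 2: the enhancement itself recirculates through the node:
        E(n,t) = mu*E(n-1,t-1) + (1-mu)*E(n,t-1) + In(n,t)
    """
    N, T = In.shape
    E = np.zeros((N, T))
    A = np.zeros((N, T))
    for t in range(T):
        for n in range(N):
            prev_diag = E[n - 1, t - 1] if (n > 0 and t > 0) else 0.0
            if method == 1:
                A[n, t] = (1 - mu) * (A[n, t - 1] if t > 0 else 0.0) + In[n, t]
                E[n, t] = mu * prev_diag + A[n, t]
            else:
                prev_self = E[n, t - 1] if t > 0 else 0.0
                E[n, t] = mu * prev_diag + (1 - mu) * prev_self + In[n, t]
    return E

# Impulse at node 0, time 0: method 2 spreads more off the space-time diagonal.
In = np.zeros((6, 6))
In[0, 0] = 1.0
print(np.round(enhancement(In, mu=0.5, method=1), 3))
print(np.round(enhancement(In, mu=0.5, method=2), 3))
```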
A simple illustrative example

A simple, descriptive test case involves an input composed of two-dimensional vectors randomly distributed between 0 and 1. Embedded in the input are 20 'L'-shaped sequences located in the upper right-hand corner of the input space (from [0.5, 1.0] to [0.5, 0.5] to [1.0, 0.5]). Uniform noise between -0.05 and 0.05 was added to the target sequences. When a standard 1-D SOM maps this input space, it maps the PEs without regard to temporal order; it simply needs to cover the 2-D input space with its 1-D structure. To show how this happens, we plot an 'X' at the position in the input space represented by the weights of each PE (recall that the weights of each PE are the center point of the Voronoi region containing the inputs that trigger that PE). Since the neighborhood relationship between PEs is important, we connect neighboring PEs with a line. In a 1-D SOM the result is a "string" of PEs, and this string is stretched and manipulated by the training algorithm so that the entire input space is mapped with minimum distortion error while maintaining the neighborhood relationships (i.e. the string cannot be broken). The orientation of the output is not important as long as it covers the input with minimal residual energy. A typical example is shown on the left side of Figure 3-5; note the slightly higher density of the input in the 'L'-shaped region. When the SOTPAR temporal activity is added to the SOM, the mapping has the additional constraint that temporal neighbors (sequential winners) should fire sequentially. Thus, the string should not only cover the input space but also follow prevalent temporal patterns found in the input. This is shown on the right side of Figure 3-5; notice that sequential nodes have aligned themselves to cover the L-shaped temporal patterns found in the input.

Figure 3-5: One-dimensional mapping of a two-dimensional input space, with and without spatio-temporal coupling.

Although recall was not the main goal in creating the spatio-temporal SOM, it is possible after the first few samples of a sequence have been input to the network: the rest of the pattern can be determined by following the sequence of nodes in the SOM, although the length of the sequence is not readily determined by the map.

With a single string, the network can be trained to represent a single pattern or multiple patterns. Multiple patterns, however, require the string to be long, and a long string may be difficult to train properly since it must weave its way through the input space, moving from the end of one pattern to the beginning of the next. Additional flexibility can be added by breaking the large string into several smaller strings. Multiple strings can be considered a 2-D array of output nodes with a 1-D neighborhood function. This allows the network to follow either multiple trajectories or long, complicated trajectories in a simplified manner.

Figure 3-6 shows an example of the storage of two temporal patterns with two strings. The left plot shows the input space, which consists of two-dimensional input vectors. Two 8-point temporal patterns (diagonal lines: bottom-left to top-right and bottom-right to top-left) are intermixed with random noise in the input; the diagonal lines are drawn in for clarity, and between each pattern there is random noise. This problem can be thought of as a motion detection problem across a visual topographic map: a number of strings could be trained to detect motion in a variety of directions and orientations.
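For readers who want to reproduce this kind of experiment, a sketch of the input generation is below. The structure (short deterministic trajectories embedded in uniform noise) follows the description above, but the noise amplitude, the number of noise samples between patterns, and the random seed are assumed values that the text does not fix for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

def diagonal(start, end, n_pts=8, noise=0.05):
    """One noisy n_pts-point trajectory from start to end (noise level assumed)."""
    line = np.linspace(start, end, n_pts)
    return line + rng.uniform(-noise, noise, size=line.shape)

def build_input(n_repeats=20, n_noise=10):
    """Interleave the two diagonal patterns with bursts of uniform random noise."""
    chunks = []
    for _ in range(n_repeats):
        chunks.append(diagonal([0.0, 0.0], [1.0, 1.0]))   # bottom-left -> top-right
        chunks.append(rng.uniform(0, 1, size=(n_noise, 2)))
        chunks.append(diagonal([1.0, 0.0], [0.0, 1.0]))   # bottom-right -> top-left
        chunks.append(rng.uniform(0, 1, size=(n_noise, 2)))
    return np.vstack(chunks)

X = build_input()
print(X.shape)   # one long, unsegmented sequence of 2-D input vectors
```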
On the left side of Figure 3-6, the trained strings are shown as sequences of 8 PEs represented as 'X's (the 'O' PE denotes the beginning of the string), with neighboring PEs connected by lines. As this figure shows, the memory structure was able to extract the predominant temporal features of the input data. The right side of Figure 3-6 shows a graphical representation of the sequence of winning PEs after training: the horizontal axis is time, the vertical axis is the number of the winning PE, and the input signal is labeled along the top of the plot. The plot clearly shows that the patterns elicit sequential PE firings (smooth diagonal lines), whereas the random noise between patterns causes random output firings. Notice also that the temporal information is crucially important in the training of the memory, especially at the center of the figure, where the next point could lie in one of two possible directions. This ambiguity is responsible for the misalignment of the PEs near the center of the input space.

Figure 3-6: The storage of two temporal patterns in a memory network (left: input space and output mapping; right: winning PE over time for the segments Seq1, noise, Seq2, noise, Seq1).

Figure 3-7 shows an example of how the network gracefully handles time warping. In this example, the input was as in the previous example except that the target sequences were warped to lengths 6, 8, and 10. The network mapped two 6-PE strings to the diagonal targets, as shown in the left side of the figure. The right side of the figure shows the winning nodes for the three different sequence sizes: the first two are length 6, the second two are length 8, and the last one is length 10. The strings stretch to cover the entire pattern, and certain PEs fire more than once for a longer sequence, thus extending the time that can be covered by the string. In general, if the network is trained with time-warped data, it will tend to represent the target trajectories with the minimum number of nodes (the shortest pattern); it will still respond to longer patterns by having certain nodes win multiple times.

Figure 3-7: Time warping: diagonal targets covered by smaller sequences (left: input space and output mapping; right: winning PE over time).

The left and middle plots of Figure 3-8 show the traveling activity over time and space for the above example; the left plot shows the activity for string 1 and the center plot the activity for string 2. These two plots clearly show how the traveling activity builds up and reinforces the sequential firing of the output PEs (when a target sequence is presented, the activity builds up and moves along the string). The right side of Figure 3-8 shows the maximum traveling activity for string 1 (solid) and string 2 (dashed). In a simple system, this plot shows how a simple threshold on the traveling activity could be used to detect the target sequence.

Figure 3-8: Enhancement over time for string 1 and string 2, and the maximum enhancement over time for both strings.

SOTPAR summary

The SOTPAR methodology creates an array of PEs that self-organizes in space-time with the help of temporal information. The system is trained in an unsupervised manner and self-organizes so that sequences seen during training are mapped into unique, spatially sequential firings of the PEs at the output.
The output space is similar to a topographic map except that it maps both temporal and spatial information. The network embeds the temporal and input data into one output space with both temporal and spatial locality. Instead of the standard time-to-space mapping produced by most short-term memories, the SOTPAR produces a time-to-"time and space" mapping. The representation is distributed throughout the self-organizing network and is stored not only in the activations of the PEs but also in their connectivity and weights. It is a radical departure from typical neural network architectures with memory, but it is actually more biologically plausible.

The SOTPAR is a unique combination of short-term and long-term memory. It contains short-term memory because the activations of the network can be used to represent a general input sequence. The interesting part of the SOTPAR, however, is that it also has attributes of a long-term memory: it stores commonly found input patterns in the network weights and produces enhanced responses to these temporal inputs. Known sequences produce an ordered response in a specific area of the output space. This is a discriminant mapping, because only known sequences produce an ordered response. The sequential firing facilitates the recognition of temporal patterns by subsequent processing layers, and the network can gracefully handle time warping.

Temporal Activity Diffusion in the Neural Gas Algorithm (SOTPAR2)

The SOTPAR2 network was developed to overcome a few difficulties with the original SOTPAR network. The main difficulty with the SOTPAR is the SOM map on which it is built: the SOM's neighborhood lattice structure restricts the movement of a trajectory through the output space of the network (e.g. the distance between successive inputs) and also limits the number of neighbors for each PE. For these reasons, the neural gas algorithm is used as the basis for the SOTPAR2 architecture. The neural gas algorithm is similar to the SOM algorithm but without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distances to the reference vectors, not on the distance to the winning PE in the lattice. Since the neural gas algorithm has no predefined structure, each PE acts relatively independently; this is how the algorithm got its name, each PE being like a molecule of gas, with all of them spreading to cover the desired space evenly. Since there is no predefined structure for the activity diffusion to move through, there is the flexibility to create a diffusion structure that can be trained to best fit the input data. The SOTPAR2 diffuses activity through a secondary connection matrix that is trained with temporal Hebbian learning.

This flexible structure decouples much of the spatial component from the temporal component in the network. In the SOTPAR, two nodes that are neighbors in time also needed to be relatively close in space for the system to train properly (since time and space were coupled). This is no longer a restriction in the SOTPAR2. It is still a space-time mapping, but now the coupling between space and time is directly controllable. The most interesting concept that falls out of this structure is the ability of the network to focus on temporal correlations. Temporal correlation can be thought of as the simple concept of anticipation: the human brain uses information from the past to enhance the recognition of "expected" patterns. For instance, during a conversation a listener uses the context from the past to determine what they expect to hear next. This methodology can greatly improve the recognition of noisy input signals such as slurred or mispronounced speech.

SOTPAR2 algorithm details

Based on previous experience (training), the SOTPAR2 algorithm uses temporal information to lower the threshold of PEs that are likely to fire next.
The human brain uses information from the past to enhance the recognition of "expected" patterns. For instance, during a conversation a speaker uses the context from the past to determine what they expect to hear in the future. This methodology can greatly improve the recognition of noisy input signals such as slurred or mispronounced speech. SOTPAR2 algorithm details Based on previous experience (training), the SOTPAR2 algorithm uses temporal information to lower the threshold of PEs that are likely to fire next. The standard neural PAGE 81 74 gas network is appended with a connection matrix that is trained using temporal Hebbian learning. These secondary weights are similar to transition probabilities in Hidden Markov Models (HMM) and are the pathways used to diffuse the temporal information. As in the SOTPAR, the temporal activity diffusion is used to alter the selection of the winning PE and affects both the training and the operation of the network. The SOTPAR2 algorithm works as follows: First, you calculate the distance (dj) from the input to all the PEs. The temporal activity in the network is similar to the SOTPAR diffusive wavefronts except that the wavefronts are scaled by the connection strengths between PEs. Thus, the temporal activity diffuses through the space defined by the connection matrix as follows: j;[(i/(*.*)+o-iÂ»)Â«.('))pu] a, (( + 1) = aa, (/) + Â— Â— max(/>) where a ; (t) is the activity at PE i at time t, a is a decay constant less than 1, p ti is the connection strength from PE i to PEy, d is the vector of distances from the input to each PE, p. is the parameter which smoothes the activity giving more or less importance to the past activity in the network, and max(p) normalizes the connection strengths. The function f(d,k) determines how the current match (distances) of the network contributes to the activity. At the present time, my implementation implements the case where f(d,k) is simply a 5 function (and the summation is removed) such that only the activity from the past winner is propagated. This is similar to the Markov model where all temporal information is stored in the state itself. Unlike the Markov model, however, the previous winners affect the output activity of the current winner. Therefore, a previous winner that PAGE 82 75 has followed a "known" path through the network will have higher activity and thus will have more influence on the next selection. In the general case for the activity equation the temporal activity at each PE is affected by contributions from all other PEs. In this case the function f(d,k) is typically an enhanced/sharpened version of the output and the summation is over all PEs. This allows all the activity in the network to influence the current selection. It makes the network more robust since the wavefronts will continue to propagate (but will decay rapidly) even if the selected winner temporarily transitions to an unlikely path. The next step of the SOTPAR2 algorithm is to modify the output (competition selection criteria) of each PE by the temporal activity in the network via the following equation: out, = d, p\x, where p is the spatio-temporal parameter that determines how much the temporal information affects the selection of the winner. This parameter should be set based upon the expected magnitude of the noise present in the system. 
For example, if the data is normalized to [0,1], then a setting of β = 0.1 allows the network to select a new winner that is at most a distance of 0.1 farther from the input than the spatially closest PE. To adjust the weights, we use the standard neural gas rule, which is simply competitive learning with a neighborhood function based on an ordering of the temporally modified distances to the input:

\[ \Delta w_i = \eta\, h_\lambda\big(k_i(out)\big)\,(in - w_i) \]

where η is the learning rate (step size), h_λ(·) is an exponential neighborhood function whose width is defined by the parameter λ, and k_i(out) is the ranking of PE i based on its modified distance from the input.

The connection strengths are trained using temporal Hebbian learning with normalization. Temporal Hebbian learning is Hebbian learning applied over time, such that PEs that fire sequentially enhance their connection strength. The rationale for this rule is that PEs remain active for a period of time after they fire, so both the current and the previous winners are active at the same time. In the current implementation, the connection strengths are updated similarly to the conscience algorithm for competitive learning:

\[ \Delta p_{\,\arg\min(out(t-1)),\;\arg\min(out(t))} = b \]

The strength of the connection between the last winner and the present winner is increased by a small constant b, and all connections are then decreased by a fraction that maintains constant total energy across the set of connections. Another possibility would be to normalize all connections leaving each PE. That method gives poorer performance if a PE is shared between two portions of a trajectory, since the connection strength would have to be shared between the two outbound PEs; it does, however, give an interpretation of the connection strengths as probabilities and points out the similarity between the SOTPAR2 and the HMM. The parameters η and λ are annealed exponentially, as in the neural gas algorithm, while β takes the form of an offset sine wave.
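Putting the pieces of this subsection together, here is a compact Python sketch of one SOTPAR2 step. It follows the simplified case described above in which only the previous winner propagates activity; the specific parameter values, the form chosen for the match term f(d,k) (here one minus the previous winner's distance), and the normalization details are assumptions of the sketch rather than prescriptions.

```python
import numpy as np

class Sotpar2:
    """Sketch of the SOTPAR2 update (delta-function case: only the previous
    winner diffuses activity). All parameter values are illustrative only."""

    def __init__(self, n_pe=30, dim=2, eta=0.1, lam=2.0,
                 alpha=0.8, mu=0.5, beta=0.1, b=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.uniform(0, 1, size=(n_pe, dim))     # reference vectors
        self.p = np.full((n_pe, n_pe), 1.0 / n_pe)       # connection strengths
        self.a = np.zeros(n_pe)                          # temporal activity a_i
        self.prev = None                                 # previous winner
        self.eta, self.lam = eta, lam
        self.alpha, self.mu, self.beta, self.b = alpha, mu, beta, b

    def step(self, x, train=True):
        d = np.linalg.norm(self.w - x, axis=1)           # distances d_i
        if self.prev is not None:
            k = self.prev
            # f(d,k) is assumed here to be the previous winner's match (1 - d_k)
            inflow = (max(0.0, 1.0 - d[k]) + (1 - self.mu) * self.a[k]) * self.p[k]
            self.a = self.alpha * self.a + inflow / self.p.max()
        out = d - self.beta * self.a                     # temporally modified output
        win = int(np.argmin(out))
        if train:
            rank = np.argsort(np.argsort(out))           # neural-gas ranking k_i(out)
            h = np.exp(-rank / self.lam)
            self.w += self.eta * h[:, None] * (x - self.w)
            if self.prev is not None:                    # temporal Hebbian update
                total = self.p.sum()
                self.p[self.prev, win] += self.b
                self.p *= total / self.p.sum()           # keep total strength constant
        self.prev = win
        return win
```

Run over an input stream, step(x) both trains the map and returns the temporally disambiguated VQ index; with train=False it acts as the dynamic vector quantizer illustrated in the example that follows.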
The temporal training provides a portion of the improvement made by the SOTPAR2 algorithm, but the static comparison of the network is not dramatically different. PAGE 85 7S Training WITH Temporal Enhancement 1 0.8 0.6 0.4 0.2 0.5 1 Figure 3-9: Reference vector locations after training with enhancement 17, 9 7 ^24 15 1 23 5 sri 28 4 6 /18 30 22 / <4 19 ?sf 26 10 8 26Training WITHOUT Temporal Enhancement 170.5 1 Figure 3-10: Reference vector locations after training without enhancement During operation, the trained weights and information from the past create temporal wavefronts in the network that allow plasticity during recognition. This temporal activity is mixed with the standard spatial activity (distance from input to the weights) via p, the spatio-temporal parameter. Two identical inputs may fire different PEs depending on the temporal past of the signal. Figure 3-1 1 shows the Voronoi diagrams PAGE 86 79 for the SOTPAR2 network with two different temporal histories. Voronoi diagrams graphically describe the region in the input space that fires each PE. In these particular diagrams, the number in each Voronoi region represents the PE number for that particular region and is located at the center of the static Voronoi region. Remember that the center is the same as the weights of the PE. These diagrams show the regions of the input space that will fire each PE in the network. The left side of Figure 3-11 shows the Voronoi diagram during a presentation of random noise to the network. Since this input pattern was unlikely to be seen in the training input, temporal wavefronts were not created and the Voronoi diagram is very similar to the static Voronoi diagram. The right side of Figure 3-1 1 shows the Voronoi diagram during the presentation of the bottom-left to topright diagonal line. The temporal wavefront grew to an amplitude of 0.5 by the time PE 18 fired. Also, from the training of the network, the connection strength between PE 18 and PE 27 was large compared to the other PEs. Thus, the temporal wavefront flowed preferentially to PE 27 enhancing its possibilities of winning the next competition. Voronoi diagram win previous Winers: 23, 18, 10, 29 Beta=0 2 diagram win previous winners: 20. 26, 14, IB Be1a=0.2 02 04 06 0,8 1 0.2 04 Figure 3-11: Voronoi diagrams without and with enhancement PAGE 87 80 Notice how large region 27 is in right side of Figure 3-1 lsince it is the next expected winner. This plasticity seems similar to the way humans recognize temporal patterns (e.g. speech). Notice that the network uses temporal information and its previous training to "anticipate" the next input. The anticipated result is much more likely to be detected since the network is expecting to see it. It is important to point out how the static and dynamic conditions are dramatically different. In the dynamic SOTPAR2 the centroids (reference vectors) are not as important the temporal information changes the entire characteristics of vector quantization creating data dependent Voronoi regions. An animation can demonstrate the operation of the SOTPAR2 Voronoi regions much better than static figures. Next I created a new set of 14 noisy diagonal lines to be run through the network as a test set. Each noisy line was passed through both a standard neural gas vector quantization network and a SOTPAR2 VQ network. The results will be analyzed using the 5 lh point in the bottom-left to top-right diagonal line. 
Figure 3-12 shows the locations of this point in each of the 14 noisy diagonal lines, along with the neural gas Voronoi diagram. Notice that the static vector quantization cannot consistently quantize this 5th point to the same Voronoi region; in fact, the point falls into four different regions. The SOTPAR2 network, however, was able to quantize every one of the 5th points into the same region.

Figure 3-12: Voronoi diagram without enhancement. VQ outputs were [12, 12, 16, 16, 25, 25, 25, 25, 27, 27, 27, 27, 27, 27].

Figure 3-13 shows why. It shows a typical Voronoi diagram for the trained SOTPAR2 network after the input of the first four points of a single noisy diagonal line. The location of the 5th point in each of the 14 noisy diagonal lines is again plotted; notice that now all 14 points fall into the correct Voronoi region. Each particular input sequence creates a different Voronoi diagram, but Figure 3-13 illustrates the mechanism behind the SOTPAR2's improved vector quantization: the temporal plasticity has increased the size of the anticipated next region and reduced the variability of the SOTPAR2 vector quantization.

Figure 3-13: Voronoi diagram (with node 18 as the previous winner) and VQ with enhancement. VQ outputs were [27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27].

Next, I ran the new noisy diagonal lines through the network and histogrammed the VQ outputs for each point of the two lines. Figure 3-14 shows the results, with the point number along the horizontal axis and the node number along the vertical axis. The number of firings for each node is indicated by the shading: white is high and gray is low. The left-to-right diagonal line occupies the first 8 points of the horizontal axis and the right-to-left diagonal line the second 8 points. Notice how much cleaner the temporally enhanced VQ output is than the standard neural gas VQ.

Figure 3-14: Histograms of the number of firings for each PE (brighter means more firings) for the networks with and without enhancement.

Figure 3-15 shows a specific example of the VQ output of the two networks and illustrates how the SOTPAR2 uses temporal information to remove noise from the input. The input is a noisy diagonal line from bottom-right to top-left (solid line). The SOTPAR2 output is the dotted line, and the static VQ output is the dashed line. Notice how much closer the temporal VQ output is to the diagonal than either the noisy input or the output of the static VQ.

Figure 3-15: The SOTPAR2 VQ (dotted) is closer to the noise-free signal than the original input (solid) or the neural gas VQ (dashed).

SOTPAR2 summary

The SOTPAR2 algorithm uses temporal plasticity induced by the diffusion of activity through time and space.
The SOTPAR2 algorithm is a temporal version of the neural gas algorithm that uses activity diffusion to couple space and time into a single set of dynamics that can help disambiguate the static spatial information with temporal information. This creates time-varying Voronoi diagrams based on the past of the input signal. This dynamic vector quantization helps reduce the variability inherent in the input by anticipating (based on training) future inputs.

Temporal Self-Organization for Training Supervised Networks

This section shows how the concepts of temporally trained clustering can help speed up the training of supervised neural networks; in particular, we have applied it to recurrent neural network training. Recurrent neural networks are more powerful than feedforward neural networks, but their training is very difficult and time-consuming. Supervised neural networks are typically trained with gradient descent learning, which provides a more mathematically sound foundation than exists for the unsupervised networks. This allows for a goal-driven approach with mathematical derivations of the concepts. The goal of this architecture is to temporally organize the training of a recurrent neural network. A mathematical analysis will derive a principle very similar to that used in the neural gas network: temporal correlation can be used to train PEs to form temporal neighborhoods.

Using Temporal Neighborhoods in RTRL

In the past, static neural networks and feedforward networks with memory (TDNN, etc.) have been the workhorses of the neural network world. Recently, recurrent neural networks have been getting more attention, especially when applied to dynamical modeling, system identification, and control. The main difficulty in training recurrent neural networks is that the gradient is a function of time: the gradient at the current time depends not only on the current input, output, and desired signal, but also on all of the values in the past. As discussed in Chapter 2, there are two fundamental methods of computing the gradient for a recurrent neural network. First, the gradients can be computed in the backward direction, similar to static backpropagation in feedforward networks. This is called backpropagation through time (BPTT) [Rum86]; its main shortcoming is that it is non-causal. The second fundamental method computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node output with respect to each weight at every iteration. The method is completely on-line and simple to implement. The main difficulty with RTRL is its computational complexity: if n is the number of PEs in a fully recurrent network, the computation of the gradients of each PE with respect to each weight is O(n⁴), so the algorithm can only be used for small networks.

Many methods have been proposed to increase the speed of RTRL. Zipser's approach [Zip89][Zip90] will be used here because it lends itself to our techniques. Zipser reduced the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs: the PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, this reduces the complexity of the RTRL algorithm to O(n²).
This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm. For example, it will sometimes require more PEs than the standard RTRL algorithm to converge. Our methodology extends Zipser's technique by allowing the subgroups to change dynamically during learning. The dynamic subgroups are created using an unsupervised temporal clustering very similar to that used in the SOTPAR and SOTPAR2. A derivation of a first-order approximation to the full sensitivity matrix shows that temporal correlation (temporal Hebbian learning) can be used to determine which nodes should be in each group. This method has the same computational complexity as Zipser's, but trains better and more consistently.

Review of RTRL and Zipser's Technique

The computational complexity of the RTRL algorithm is dominated by the need to update a large array of sensitivities at each step. For a network with n nodes and m weights, the sensitivity matrix has O(nm) elements, each requiring O(n) computations, giving O(n²m) calculations per step. For a fully recurrent network this dominates the computational complexity and requires O(n⁴) computations per step. The algorithm works quite well on small networks, but the n⁴ factor becomes overwhelming as the number of nodes increases.

The value of n in the O(n²m) expression is the number of recurrently connected units. Zipser's algorithm reduces this value by creating subgroups of nodes in which sensitivity information is only passed between nodes of the same subgroup. All connections still exist in the forward dynamical system; the subgroups affect only the training of the network, and connections between subnets are treated as inputs. If g is the number of subgroups, then the speed-up of the sensitivity calculations is approximately g². For instance, dividing a network into two subsets (g = 2) gives a 4-fold speed-up in computing the sensitivities. If the size of the subnets remains constant as the size of the network is increased (by increasing the number of subnets), then the complexity of the RTRL algorithm is reduced from O(n²m) to O(m).

The performance gains are substantial, but the question is whether the algorithm can train networks as well as the full RTRL. One might think that the subgrouping limits the ability of the network to share nodes. This is not the case, however, since the activations of the network are unchanged: the network is still fully recurrent except in the training methodology. Even though the error propagation is limited to the subnets, all units have access to the activities of all other units, just not to all of their sensitivities. Zipser's empirical tests indicate that subgrouped networks can solve many of the same problems, but for certain applications networks trained with subgrouped RTRL require more PEs than when they are trained with full RTRL. In my experience, the subgrouping algorithm also typically requires more training epochs to reach the same MSE. One caveat of subgrouped RTRL training is that each subnet must have at least one unit for which a target exists, since gradient information is not exchanged between groups. This problem can be solved by wrapping a feedforward network around the recursive network, creating a feedforward MLP with a fully recursive hidden layer. This is often termed a recurrent multilayer perceptron (RMLP) and is shown in Figure 3-16. The feedforward network is simply one additional layer that distributes the gradient between the groups.
Dynamic Subgrouping with π

The goal of my method is to create local neighborhoods (subgroups) in the RTRL algorithm such that the majority of the gradient information required for each node is confined to its local neighborhood. This requires organizing the recurrent PEs so that those with strong temporal dependencies are neighbors. The technique replaces the static, preallocated grouping of Zipser's technique with a dynamic method of determining the best set of neighbors for each PE. This dynamic grouping provides faster and more robust training than Zipser's technique while maintaining its O(n^2) complexity.

First, the RTRL equations must be modified slightly to better suit the RMLP architecture described above. The time indices are now defined such that the input vector contains the external inputs from the current time step plus the PE outputs from the previous time step (i.e., the feedback):

u(n+1) = [\, x(n+1),\; y(n) \,]

\pi^{k}_{ij}(n+1) = \varphi'\!\big(v_k(n+1)\big)\,\Big[\, \sum_{l \in B} w_{kl}\,\pi^{l}_{ij}(n) \;+\; \delta_{ik}\, u_j(n+1) \Big]

where \pi^{k}_{ij}(n) = \partial y_k(n) / \partial w_{ij}, v_k is the net input of PE k, \varphi' is the derivative of the PE nonlinearity, and B is the set of recurrent PEs.

Next, we must determine the criteria used to group the PEs in the network. If we assume that each PE is responsible for updating the weights of the arcs that terminate at it (i.e., its incoming connections), then the PEs that have the highest sensitivities relative to those connections should be in the same neighborhood. For example, PE j is responsible for updating all weights w_{ji} with i ranging over its inputs, where B is the set of recurrent PEs. If we define

Z_{jk}(n) = \sum_{i} \pi^{k}_{ji}(n),

summing over the incoming connections i of PE j, then the value of Z_{jk} provides a measure of how much PE k affects the weights of PE j. Thus the neighbors of PE j should be the PEs with the highest Z_{jk}.

The "dynamic subgrouping with π" (DS-π) methodology implements the RTRL algorithm with the subgroups chosen using the Z measure defined above. It should be noted that, since it requires the computation of the complete π matrix, this algorithm is no more efficient than the full RTRL algorithm. It does, however, address how the neighborhood technique with "optimal" switching performs compared to the full RTRL and Zipser's technique.
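As a sketch of how such groupings might be formed from the full sensitivity tensor, the routine below computes Z and builds disjoint neighborhoods greedily. The use of absolute values as a magnitude measure and the greedy construction of disjoint groups are assumptions of this sketch, not details taken from the text; the function and variable names are illustrative.

```python
import numpy as np

def ds_pi_groups(P, group_size):
    """Choose temporal neighborhoods from the full sensitivity tensor (DS-pi sketch).

    Z[j, k] sums, over PE j's incoming connections i, the sensitivities of
    PE k's output to the weights w_ji; PE j's neighbors are the PEs with the
    largest Z[j, k].  A simple greedy pass builds disjoint subgroups.
    """
    n = P.shape[0]
    # Z[j, k] = sum_i |d y_k / d w_ji|  (P[k, j, i] summed over i, transposed to (j, k))
    Z = np.abs(P).sum(axis=2).T
    group = -np.ones(n, dtype=int)          # -1 means "not yet assigned"
    g = 0
    for j in range(n):
        if group[j] >= 0:
            continue                        # already placed in a neighborhood
        group[j] = g
        free = np.flatnonzero(group < 0)    # PEs still unassigned
        picks = free[np.argsort(-Z[j, free])][:group_size - 1]
        group[picks] = g                    # PE j plus its strongest free neighbors
        g += 1                              # the last group may be smaller
    return group
```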
The test case for the DS-π algorithm is a function approximation problem in which we try to map a frequency doubler. The input to the network is a sinusoid with a 16-point period and the desired signal is a sinusoid with an 8-point period. This is a nonlinear mapping, since linear functions cannot "create" frequencies. The DS-π network has 6 fully recurrent hidden-layer PEs and one linear output node. Both Zipser's method and the DS-π method use two groups of three PEs. Each of the three algorithms was trained with the same five sets of random initial weights, and the results were averaged to obtain the learning curves.

Figure 3-17: Average learning curve for the three algorithms (full RTRL, our method, and Zipser's) using the frequency doubling problem; epochs on the horizontal axis (16 samples per epoch).

Figure 3-17 shows the average learning curve for each algorithm. Notice that the full RTRL and the DS-π method performed nearly identically; in fact, in a few cases DS-π actually trained in fewer epochs. The third set of initial weights led all three algorithms to a deep local minimum. The first 100 epochs mainly depict the learning curve of the other four initial conditions (notice that the DS-π method and RTRL are nearly identical here), whereas the last 400 iterations are dominated by the learning curve for the initial weights with the deep local minimum. Zipser's method performed worse on all five sets of initial conditions and could not solve the problem at all (even with more training) for the third set.

In every application I have tested, the DS-π algorithm trains the networks in almost the same number of epochs as the full RTRL algorithm and performs significantly better than Zipser's subgrouping technique. The problem, however, is that the full π matrix is required to compute the neighborhoods. Since the computation of the π matrix is the computationally expensive part of the task, we have not gained anything here. This methodology does, however, prove that the technique is feasible and that not all of the gradient information is necessary to train the networks. Also, when a neighborhood changes in the DS-π algorithm, the gradient information from the ex-neighbor is discarded and new gradient information from the new neighbor starts building up. The results obtained with the full π matrix show that this resetting and restarting of gradient information between nodes does not affect the performance of the algorithm. The DS-π algorithm will be used as an "ideal grouping" methodology, since it uses all of the sensitivity information to determine the groupings.

Estimating the Z matrix

We now need an estimate of Z that will allow us to compute the temporal neighborhoods efficiently. The logical choice is to build it from a first-order estimate of the π matrix. We start by writing out the equation for Z and expanding it with the RTRL recursion:

Z_{jk}(n) = \sum_{i} \pi^{k}_{ji}(n) = \sum_{i} \varphi'\!\big(v_k(n)\big)\,\Big[\, \sum_{l \in B} w_{kl}\,\pi^{l}_{ji}(n-1) \;+\; \delta_{kj}\, u_i(n) \Big]

Expanding each \pi^{l}_{ji}(n-1) one more step in the same way gives

Z_{jk}(n) = \sum_{i} \varphi'\!\big(v_k(n)\big)\Big[\, \sum_{l \in B} w_{kl}\,\varphi'\!\big(v_l(n-1)\big)\Big( \sum_{m \in B} w_{lm}\,\pi^{m}_{ji}(n-2) \;+\; \delta_{lj}\, u_i(n-1) \Big) \;+\; \delta_{kj}\, u_i(n) \Big]

Now we separate the equation into its first-order parts (the direct contributions from the input vector, which appear only when the sensitivity's PE index matches the weight's destination PE) and the rest. Keeping, for k ≠ j, only the lowest-order non-vanishing contribution, the path through PE j's direct term at time n-1, gives

Z_{jk}(n) \approx \varphi'\!\big(v_k(n)\big)\, w_{kj}\, \varphi'\!\big(v_j(n-1)\big) \sum_{i} u_i(n-1)

This is a very easy and computationally efficient method for estimating the Z matrix. It is conceptually appealing as well: the equation is a time correlation between the derivative of PE j at time n-1, weighted by its total input, and the derivative of PE k at time n, coupled through the weight w_{kj} that connects them.
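A minimal sketch of this estimate is given below. It is not the dissertation's code: the exponentially decayed accumulation (the `decay` parameter) and the absolute value are assumptions added so that the grouping reflects recent history rather than a single time step, and `phi_prime` is assumed to be vectorized over arrays.

```python
import numpy as np

def foe_z_update(Z, W, v_now, v_prev, u_prev, phi_prime, decay=0.9):
    """Running first-order estimate of the Z matrix (DS-FOE sketch).

    Implements  Z[j, k] ~ phi'(v_k(n)) * w_kj * phi'(v_j(n-1)) * sum_i u_i(n-1),
    accumulated with an exponential decay (the accumulation scheme is an
    assumption of this sketch, not taken from the text).
    """
    drive = u_prev.sum()                       # sum_i u_i(n-1)
    dk = phi_prime(v_now)                      # phi'(v_k(n)),   length n
    dj = phi_prime(v_prev)                     # phi'(v_j(n-1)), length n
    # instant[j, k] = phi'(v_j(n-1)) * w_kj * phi'(v_k(n)) * drive
    instant = np.abs(np.outer(dj, dk) * W.T * drive)
    return decay * Z + (1.0 - decay) * instant
```

The resulting Z can then be fed to the same neighborhood-selection step sketched earlier for DS-π, so that the groups track the network's temporal dependencies as training proceeds.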
The local implementation assumes that each PE stores the weights of its incoming connections locally and thus has access to all of its inputs and to its own output. This methodology was tested and produced results identical to those of the DS-FOE method.

Illustrative Example

The performance of the network trained with dynamic subgrouping and the first-order estimate of the π matrix (DS-FOE) will be compared against the full RTRL algorithm and Zipser's method. The example data set will again be the frequency doubling problem. Remember that for this problem, the DS-π algorithm achieved performance nearly identical to that of the full RTRL algorithm.

Figure 3-19: Average learning curve for the RTRL, DS-FOE, and Zipser's algorithms using the frequency doubling problem; epochs on the horizontal axis (16 samples per epoch).
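For reference, the frequency-doubling data used in these comparisons is easy to synthesize: a sinusoid with a 16-point period as the input and a sinusoid with an 8-point period as the target, 16 samples per epoch. The sketch below assumes unit amplitude and zero phase, which the text does not specify.

```python
import numpy as np

def frequency_doubler_epoch(n_samples=16):
    """One epoch of the frequency-doubling task: input has a 16-point period,
    the desired signal has an 8-point period (unit amplitude, zero phase assumed)."""
    t = np.arange(n_samples)
    x = np.sin(2 * np.pi * t / 16.0)   # input: 16-point period
    d = np.sin(2 * np.pi * t / 8.0)    # desired: 8-point period
    return x, d
```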