Citation
Temporal self-organization for neural networks

Material Information

Title:
Temporal self-organization for neural networks
Creator:
Euliano, Neil R., 1964-
Publication Date:
1998
Language:
English
Physical Description:
vii,177 leaves : ill. ; 29 cm.

Subjects

Subjects / Keywords:
Landmarks ( jstor )
Learning ( jstor )
Maps ( jstor )
Memory ( jstor )
Neural networks ( jstor )
Neurons ( jstor )
Phonemes ( jstor )
Signals ( jstor )
Temporal data ( jstor )
Trajectories ( jstor )
Dissertations, Academic -- Electrical and Computer Engineering -- UF ( lcsh )
Electrical and Computer Engineering thesis, Ph.D ( lcsh )
Genre:
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1998.
Bibliography:
Includes bibliographical references (leaves 169-176).
Additional Physical Form:
Also available online.
General Note:
Typescript.
General Note:
Vita.
Statement of Responsibility:
by Neil R. Euliano, II.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Neil R. Euliano. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
40074572 ( OCLC )
029539973 ( ALEPH )

Full Text










TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS













By

NEIL R. EULIANO II













A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

1998














ACKNOWLEDGMENTS

It is only appropriate that I first acknowledge the guidance and help of my advisor and friend Dr. Jose Principe. Without his support this work would never have been completed. I would also like to thank the members of my committee for their efforts and time spent on my behalf, as well as the members of the Computational NeuroEngineering Laboratory (CNEL).

I must also acknowledge my wife Tammy who was incredibly patient and never wavered in her support of this endeavor. I would also like to thank my children Erin and Matthew who are just too fun to ignore. Although they extended the amount of time required to graduate, I would not trade the time I spent with them for anything in the world. Lastly I should thank my family and friends for not treating me like a dead-beat Ph.D. student.


































TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION AND PROBLEM DESCRIPTION

Temporal Processing
Static Supervised and Unsupervised Learning
Adding Memory to Neural Networks
Short-Term Memory Structures
Recurrent Networks
Training Dynamic Neural Networks
Summary of Problems with Standard ANN Architectures
The Approach

2 LITERATURE REVIEW

Biological Research
Neurons and Learning
Hippocampus
Diffusion Equations (Re-Di Equations)
Biological Representations of Time
Biological Models for Temporal Processing
Static Neural Network Learning
Unsupervised Learning
Kohonen SOMs
Neural Gas
Supervised Training
Second Order Methods
Temporal Neural Networks
Temporal Unsupervised Learning
Temporal Supervised Neural Networks
Architectural approaches
Algorithmic approaches
Second order methods
Sequence Recognition
Comparison of Hidden Markov Models with ANNs

3 TEMPORAL SELF-ORGANIZATION

Introduction and Motivation
The Model
Temporal Self-Organization in Unsupervised Networks
Temporal Activity Diffusion Through a SOM (SOTPAR)
Algorithm description
Representation of memory
A simple illustrative example
SOTPAR summary
Temporal Activity Diffusion in the Neural Gas Algorithm (SOTPAR2)
SOTPAR2 - algorithm details
Operation of the SOTPAR2 network
SOTPAR2 summary
Temporal Self-Organization for Training Supervised Networks
Using Temporal Neighborhoods in RTRL
Review of RTRL and Zipser's Technique
Dynamic Subgrouping with t
Estimating the Z matrix
Illustrative Example
Grouping Dynamics
Second Order Methods
Summary of the Dynamic Subgrouping Algorithm

4 APPLICATIONS AND RESULTS

SOTPAR
Landmark Discrimination and Recognition for Robotics
SOTPAR solution
Real data collected from the robot
Summary
Self-Organization of Phoneme Sequences
Summary
SOTPAR2
SOTPAR2 Vector Quantization of Speech Data
Time Series Prediction
Results
Summary of chaotic prediction
Dynamic Subgrouping of RTRL in Recurrent Neural Networks
System Identification
Comparison of the Number of Neighbors
Modeling a Set of Nonlinear Passage Dynamics
Summary of Dynamic Subgrouping

5 CONCLUSIONS AND FUTURE RESEARCH POTENTIAL

Conclusions
Future Directions

REFERENCES

BIOGRAPHICAL SKETCH















Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS

By

Neil R. Euliano II

August, 1998


Chairman: Dr. Jose C. Principe
Major Department: Electrical and Computer Engineering

The field of artificial neural networks (ANNs) has reached a point where ANNs are now being used in everyday products. ANNs, however, have been largely unsuccessful at processing signals that evolve over time. Temporal patterns have traditionally provided some of the most challenging problems for scientists and engineers and include language skills, vision skills, locomotion skills, process control, time series prediction, and many others.

The fundamental concept presented in this dissertation is the formation of temporally organized neighborhoods in ANNs. This temporal self-organization enables the networks to process temporal patterns in a more organized and efficient manner. The concept is biologically inspired and uses activity diffusion to organize the processing elements of the network in an unsupervised manner.














The self-organization in space and time created by my methodology has been applied to three distinct ANN architectures. The new network architectures created by adding the temporal organization are easy to implement and contain properties that are unique in the neural network field. A self-organizing map (SOM) network obtains a unique combination of long-term and short-term memory and becomes organized such that temporal patterns in the input fire sequentially ordered output PEs. These features are utilized in two different applications, a robotic landmark recognition problem and a temporally ordered vector quantization of phonemes in spoken words.

When applied to the neural gas algorithm, the resulting network becomes a dynamic vector quantization network. The network anticipates the future inputs and adjusts the size of the Voronoi regions dynamically. It was used to vector quantize speech data for a digit recognition problem and to predict a chaotic signal. Lastly, the temporal organization was applied to the training of fully recurrent neural networks. It reduces the computational complexity of the training algorithm from O(N⁴) operations to only O(N²) operations and maintains nearly all of the power of the RTRL algorithm. This training method was tested on two inverse modeling tasks and provided a dramatic improvement in training times over the RTRL algorithm.




























CHAPTER 1
INTRODUCTION AND PROBLEM DESCRIPTION

This dissertation focuses on neural network architectures and training methods for processing signals that evolve over time. The fundamental concept underlying the techniques described herein involves the formation of neighborhoods where temporally correlated processing elements (PEs) are clustered together. We have applied this concept to three different neural network architectures and found that it improves the performance of each one - either by increasing the functionality of the neural network or by improving its training.

This chapter contains a description of the problem as well as background information that will help describe the shortcomings of the present methods. Chapter 2 presents a review of the relevant literature necessary to understand the material in the context of the current state of the art. Chapter 3 contains the theoretical description of the techniques and networks proposed by this work, including a few simple examples to elucidate the fundamental concepts. In Chapter 4, six more extensive and practical problems are solved using the temporal neighborhood concepts. These examples include speech recognition, chaotic prediction, system identification and control, and robotics. Chapter 5 concludes the dissertation with a summary of the work and possible future research directions.







Temporal Processing


Most scientific problems can be grouped into two domains, static and dynamic problems. Static problems consist of information that is independent of time. For instance, in static image recognition, the image does not change over time. On the other hand, time is fundamental to the dynamic problem. The output of a dynamical system, for example, depends not only on the present input but also on the current state of the system, which encapsulates the past of the input. Temporal processing is the analysis, modeling, prediction, and/or classification of systems that vary with time. Patterns that evolve over time have traditionally provided the most challenging problems for scientists and engineers. Language skills (speech recognition, speech synthesis, sound identification, etc.), vision skills (motion detection, target tracking, object recognition, etc.), locomotion skills (synchronized movement, robotics, mapping, navigation, etc.), process control (both human and mechanical), time series prediction, and many other applications all require temporal pattern processing. In fact, the ability to properly recognize or generate temporal patterns is fundamental to human intelligence.

Traditional analysis models (ARMA models, etc.) are well known but are usually linear and require significant expertise on the subject and a strict correspondence between the studied process and the constructed model. Artificial Neural Networks (ANNs) offer robust, model-free methods without requiring as much application-specific expertise. Secondly, neural nets are adaptive (similar to ARMA models). This is a natural way to compensate for the drift of measuring devices and slow parameter changes inherent in real systems. Thirdly, neural nets are naturally parallel systems that offer more speed in computation and fault tolerance than traditional computing models. [Kan94]

Most major neural network successes, however, have been in the realm of static, instantaneous mappings (for example, static image recognition or pattern matching). Conventional neural net architectures and algorithms are not well suited for patterns that vary over time. Typically, in static pattern recognition a collection of features - visual, semantic, or otherwise - is presented and the network must categorize the input feature pattern into one or more classes. In such tasks, the network is presented with all relevant information simultaneously. In contrast, temporal pattern recognition involves processing patterns that evolve over time. The appropriate response at a particular point in time depends not only on the current input, but also potentially on an unspecified number of previous inputs.

Static ANNs have been modified in various ways to process time-varying patterns, typically by adding short-term memory to the static pattern classification ability of the various architectures. The short-term memory holds onto some of the past events so that the static ANN can then classify or predict the temporal pattern. As I will explain in the next few sections, however, these hybrid structures (memory added to static architectures) have not been widely successful in the various temporal processing areas.

Static Supervised and Unsupervised Learning


The purpose of neural processing is to capture the information from an external signal in the neural network structure. This is a form of organization. It can be accomplished in an unsupervised manner using only the input, or in a supervised manner guided by an extra input called the desired signal. Unsupervised training can only extract information from the input signal whereas supervised training can learn mappings between the input signal and the desired signal. They differ in the methods, but at the core they share the same function, learning a representation of the external world.

The most common supervised network is the multilayer perceptron (MLP), which uses the error back-propagation [Rum86] learning algorithm. The MLP is characterized by layers (input, hidden, and output) of processing elements (PEs) that have a smooth nonlinearity at their output. The nonlinear output of the MLP PEs is what differentiates the MLP from a typical adaptive filter. It provides the capability to map problems that are not linearly separable. In fact, it has been proven that an MLP with one hidden layer can uniformly approximate any continuous function with support in a unit hypercube [Cyb89]. As in adaptive signal processing using the LMS algorithm, the backpropagation algorithm applies a correction Δw_ji(n) to the synaptic weight w_ji(n) that is proportional to the gradient of the error, ∂E(n)/∂w_ji(n). The chain rule is used to recursively calculate the error for each layer of the network.
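To make the update rule concrete, here is a minimal sketch (my addition, not code from the dissertation; the layer sizes, tanh nonlinearity, and learning rate are illustrative assumptions) of one backpropagation step for a one-hidden-layer MLP, with the chain rule carrying the error gradient from the output layer back to the hidden layer:

    import numpy as np

    def tanh_prime(a):
        # Derivative of the smooth nonlinearity at net input a.
        return 1.0 - np.tanh(a) ** 2

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(4, 3))    # hidden-layer weights
    W2 = rng.normal(scale=0.1, size=(1, 4))    # output-layer weights
    eta = 0.05                                 # learning rate

    x = np.array([0.2, -0.7, 0.5])             # input pattern
    d = np.array([1.0])                        # desired signal

    # Forward pass through the nonlinear layers.
    a1 = W1 @ x;  y1 = np.tanh(a1)
    a2 = W2 @ y1; y2 = np.tanh(a2)

    # Backward pass: deltas are the gradients of E(n) = 0.5*(d - y2)^2
    # with respect to each PE's net input, via the chain rule.
    delta2 = (d - y2) * tanh_prime(a2)
    delta1 = (W2.T @ delta2) * tanh_prime(a1)

    # Weight corrections proportional to the negative error gradient.
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)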

Unsupervised networks are typically based on or derived from Hebbian learning. Hebbian learning is a biologically inspired learning rule that finds the correlations present in the input data. Because unsupervised networks can extract information only from the input, they are typically used for data analysis and preprocessing. They cannot reliably be used directly for classification since a labeling of the inputs is required for classification. Both supervised and unsupervised learning will be described in detail in Chapter 2.







Adding Memory to Neural Networks


How do you use a static neural network architecture to process temporal patterns? The answer is to simply add memory. Without an appropriate memory to store information from the past, a neural network is limited to static pattern recognition or function approximation. The key questions that need to be answered when creating temporal neural networks are what type of memory to use and how to integrate the memory into the training algorithm.

Memory in neural networks can be classified into two categories: short-term memory and long-term memory. Short-term memory typically involves a representation of the temporal data, usually by creating multiple copies of the input data at various time delays (e.g. tapped delay line). Long-term memory, on the other hand, is the storage of information from the past into the structure of the network. For example, over time, the training of the network captures information about the input signal, and this information can be considered long-term memory. Another example of long-term memory is the storage of patterns in an associative memory. Long-term memory corresponds more closely with the traditional biological concepts of memory. The main difference between the two is that short-term memory is used for signal representation while long-term memory is a trained memory that typically cannot represent unknown patterns. Another way to differentiate the two is that short-term memory is usually described by activations of nodes or taps (dynamical information), and long-term memory is stored in the weights of the network (statistical information).







Most of the work in temporal ANN research has focused on the application of short-term memories since they provide a mechanism to represent a temporal pattern in a static manner. For instance, a tapped delay line converts a temporal signal into a static pattern (the present input and the N past inputs) which can then be processed by a standard static ANN. Most short-term memory techniques fall into two categories: the first is to explicitly add memory structures, and the second is to use recurrent loops in the network to save information. Long-term memory (the weights) has largely been ignored by the ANN research community for the storage of temporal patterns, but I will use it to store temporal correlations in the structure of the network.

Short-Term Memory Structures

The simplest form of memory is a buffer containing the N most recent inputs. This is often called a tapped delay line or a delay space embedding and forms the basis of traditional statistical autoregressive (AR) models, as well as dynamical system state space manipulations. This is a very popular model and has been used in many applications. The time-delay neural network (TDNN) [Wai90] uses a tapped delay line to convert the temporal pattern into a spatial pattern, allowing the architecture to be trained using only standard back-propagation methods. The TDNN, however, has several drawbacks. First, the length of the delay line must be chosen a priori; we cannot work with arbitrary-length sequences. In addition, the TDNN requires that the data be properly registered in time with the clock controlling the shift register. It imposes a rigid limit on the duration of patterns and suggests that all input vectors be the same length. Most importantly, two patterns which are very similar temporally (e.g. shifted one step in time) will be very different spatially, which is the metric used by ANNs. For example, [1 0 0], [0 1 0], [0 0 1] are temporally shifted but are spatially on the corners of a unit cube.
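The cube-corner effect is easy to verify numerically. The sketch below (an illustration I added, using a toy impulse signal) builds the tapped-delay-line patterns and shows that a one-step time shift produces a spatial distance of sqrt(2):

    import numpy as np

    def delay_line(signal, n_taps):
        # Each row is the static pattern [x(t), x(t-1), ..., x(t-N+1)].
        return np.array([signal[t - n_taps + 1:t + 1][::-1]
                         for t in range(n_taps - 1, len(signal))])

    patterns = delay_line(np.array([0, 0, 1, 0, 0]), n_taps=3)
    # Consecutive rows hold the same impulse shifted by one step...
    a, b = patterns[0], patterns[1]
    # ...yet they sit on different corners of the unit cube.
    print(a, b, np.linalg.norm(a - b))       # distance sqrt(2)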

Using decay traces or exponential kernels to sample the history of the input helps alleviate some of the problems with the TDNN. A common methodology to describe the various memory architectures is to represent the short-term memory as a convolution of the input sequence with a kernel function k_l: x̃_l(t) = Σ_τ k_l(t - τ) x(τ), where x(t) is the input. Tank and Hopfield [Tan87] proposed a set of Gaussian kernels that are distributed over time with varying means and widths to sample the time history. The gamma model [DeV91] is an example of an exponential trace memory that uses the set of gamma kernels. The exponential trace memory has a smoother representation of the past of the input since it decays exponentially, giving more strength to the more recent inputs. The gamma memory also has a tunable parameter that trades off depth for resolution when the system requires information from farther in the past. Depth roughly refers to how far back into the past the memory stores information, and resolution refers to the degree to which information concerning the individual elements of the input sequence is preserved. The exponential trace memories can be computed incrementally and easily, greatly increasing their usability. Viewing memory in this way, as a kernel function passed over the input, one can see that almost any kernel function will result in a distinct form of memory.
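A sketch of the incremental computation (my illustration; the leaky-cascade recursive form and the value of the depth/resolution parameter mu are assumptions consistent with the gamma model's description above):

    import numpy as np

    def gamma_memory(x, n_stages=3, mu=0.3):
        # Cascade of leaky integrators: stage 0 is the raw input, and
        # each later stage holds a progressively deeper, smoother trace.
        stages = np.zeros(n_stages + 1)
        history = []
        for sample in x:
            prev = stages.copy()
            stages[0] = sample
            for k in range(1, n_stages + 1):
                stages[k] = (1.0 - mu) * prev[k] + mu * prev[k - 1]
            history.append(stages[1:].copy())
        return np.array(history)

    taps = gamma_memory(np.sin(0.3 * np.arange(50)))

Smaller mu gives greater depth (the trace reaches further into the past) at the cost of resolution, matching the trade-off described above.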

The main problem with all of these memory architectures, however, is that they are all "prewired" one-dimensional cascades of delay elements. TDNNs are also known to train very slowly.







Theoretically, memory added to a system can be thought of as creating an embedding of the dynamics into a space larger than the original input space. An embedding of a dynamical system is based on the similarity between delays and derivatives (the first-order approximation to a derivative is the difference between the signal and the delayed signal). The delayed values of a single variable can be used to represent the dynamics of a multi-dimensional system. Conceptually this can be rationalized as combining the first-order differential equations for the system (state space description) into a single high-order differential equation for one variable and then using the delay technique to approximate the derivatives of this equation - giving a new representation of the system states. This mathematical construct is effective but not necessarily efficient. For example, a dynamical system requires a minimum of 2D+1 taps to preserve the dynamics of a D-dimensional system [Tak81]. If the dimension of the system is unknown, as is often the case, a large embedding is usually used. The embedding also does not efficiently encode the input ordering. It does a time-to-space mapping that treats the temporal information the same as a spatial input, allowing for all permutations of the order of inputs without regard to the limitations imposed by the dynamics of the system. The gamma memory and other convolution memory kernels warp or rotate the embedding space to more accurately (or efficiently) represent the system dynamics. A proper use of the embedding methodology requires a significant amount of work to determine a number of parameters, including the number of taps, the time between taps, the time between vectors, and the number of data samples. This is rarely done.







Recurrent Networks

The MLP and TDNN are both feedforward networks in which the data flow moves strictly forward; no feedback is used. The feedback in recurrent networks can also create memory. The important distinction between the two types of memory is that memory created with feedback can be adapted and trained on-line, creating a flexible and adjustable memory mechanism. Feeding back outputs between different layers can lead to a generalization of storing not only the input but the "state" of the network (i.e. a processed version of the input) [Elm90] [Moz94]. In theory, the recurrent architecture is sufficiently powerful to handle arbitrarily complex temporal problems. The focused memory architectures such as the TDNN can be as well [San97], but they may require a very large number of taps and weights.
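A minimal sketch of this state-feedback idea (my illustration of an Elman-style context loop; the sizes and weight scales are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hid = 2, 5
    W_in = rng.normal(scale=0.3, size=(n_hid, n_in))
    W_ctx = rng.normal(scale=0.3, size=(n_hid, n_hid))  # feedback loop

    # The hidden state is fed back as input, so the network carries a
    # processed version of the past; because W_ctx is an ordinary weight
    # matrix, this memory can itself be adapted by training.
    state = np.zeros(n_hid)
    for x in np.eye(2)[[0, 1, 1, 0]]:        # short input sequence
        state = np.tanh(W_in @ x + W_ctx @ state)
    print(state)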

In practice, however, recurrent networks are much more difficult to train than the static networks. The recurrency adds tremendous power to the network (any memory architecture can be created with a recurrent neural network). This power, however, creates very complicated error surfaces. In recurrent networks, the states of the PEs in the network affect both the output and gradients. Therefore calculating the gradients and updating the weights of a recurrent network is a much more difficult and time consuming process.

Because of these difficulties, the mainstream engineering community has largely ignored recurrent networks. Recently, however, recurrent networks have been used more and more as engineers reach the limits of the capabilities of TDNNs and other simpler architectures. Recurrent networks are hot topics in the fields of dynamic modeling and control.

Training Dynamic Neural Networks


Recurrent networks, either fully recurrent or partially recurrent (e.g. the gamma network), cannot directly use static backpropagation methods since the time history of the network and its inputs is critical to the outputs it produces. Static backpropagation computes the gradients based only upon the current inputs and outputs. To train a dynamical system, the past information is at least as important as the present, and thus a temporal backpropagation technique must be used. Recurrent backpropagation (fixed-point learning) can be used to train a general recurrent network to move to stationary states. Its assumption of constant inputs and an approach to an attractor, however, precludes the recurrent back-propagation algorithm from real-time temporal processing.

The TDNN can use static backpropagation because its memory is fixed and is at the beginning of the network. The tapped delay line can be thought of as a temporal preprocessor converting dynamic patterns to static patterns, thus the network is trained in a completely static manner. Most other temporal networks, however, are trained using one of two first-order temporal methods: back-propagation through time (BPTT) [Rum86] or real-time recurrent learning (RTRL) [Wil89]. Both of these methods are gradient descent methods. The RTRL method brings the activations and their derivatives forward in time until the desired signal is available, and the BPTT method propagates back the errors from the desired signal to the beginning of the pattern. RTRL recursively calculates the gradients of each PE with respect to every weight. This process allows on-line updates (updates every sample), but calculating all the gradients is a time consuming process. In fact, if there are N fully recurrent PEs in a network, the RTRL algorithm requires O(N⁴) operations per sample. The BPTT method requires fewer computations, but is non-causal; thus it cannot be directly implemented in an on-line fashion.
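The source of the O(N⁴) cost is visible in a direct sketch of the RTRL sensitivity recursion (my illustration; the recursion shown is the standard form p_kij(t+1) = φ'(s_k)[Σ_l w_kl p_lij(t) + δ_ki y_j(t)], and the network size and inputs are arbitrary):

    import numpy as np

    N = 5
    rng = np.random.default_rng(2)
    W = rng.normal(scale=0.2, size=(N, N))
    y = np.zeros(N)                  # PE activations
    p = np.zeros((N, N, N))          # p[k, i, j] ~ dy_k / dw_ij

    def rtrl_step(x_ext, W, y, p):
        s = W @ y + x_ext
        y_new = np.tanh(s)
        phi_prime = 1.0 - y_new ** 2
        p_new = np.empty_like(p)
        # N^3 sensitivities, each updated with an O(N) sum: O(N^4).
        for k in range(N):
            for i in range(N):
                for j in range(N):
                    recur = W[k] @ p[:, i, j]
                    direct = y[j] if k == i else 0.0
                    p_new[k, i, j] = phi_prime[k] * (recur + direct)
        return y_new, p_new

    y, p = rtrl_step(rng.normal(size=N), W, y, p)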

Both methods suffer from the following problems:

* The computation of the gradient must occur over time. But the nonlinearity in each layer (actually it is the derivative of the nonlinearity required for the gradients) attenuates these gradients. Thus, if information is required from more than a few samples in the past, these training methods may have a difficult time maintaining and using this information. As the errors are propagated, the gradients get small and the impact of a connection weight - even if appropriate - will be masked by other weights if their values are inappropriate. This is true for large feedforward nets as well, but the feedback nature of the recurrent network in time makes this a much bigger problem in recurrent networks.

* The desired signal must be defined over time. For example, how do you define a target signal when trying to detect a sequence? If the target is high throughout the pattern, the network will recognize partial sequences. But if the target is high only at the end, the network may be punished for partially recognizing a major portion of the sequence.

* Temporal backpropagation is inherently slow, both computationally and in terms of the number of training samples required to find an adequate solution.







Recently, second order gradient methods like recursive least squares (RLS) and the extended Kalman filter have been used in order to reduce the number of training samples required for a good solution. These methods use second order gradient information to determine more accurate data on the shape of the performance surface at the current operating point. This allows for much faster convergence but requires more computations per sample. These second order gradient methods still need to compute the dynamic gradient information and thus suffer from the same problems listed above.

Summary of Problems with Standard ANN Architectures


In summary, the standard ANN architectures, when applied to temporal processing, suffer from problems with supervision and problems with short-term memory. The problems that can be attributed to supervised training include:

* The problem of assigning credit or blame to actions when the overall success or failure of the system results from a series of actions and cannot be judged instantaneously (i.e. how do you design a target signal?).

* Back-propagation training can be very slow, often requiring thousands of training epochs. This problem is derived from many sources. The backpropagation algorithm must either take small steps in the gradient descent or use more computationally intensive error calculations (higher order derivatives). Since all nodes in a network are typically learning independently, several problems may occur. First, all the hidden nodes may move together to try to solve the largest source of error, instead of dividing up the problem and each solving a different portion. Second, once the nodes have divided the problem, each tries to solve its portion independently. The movement of each node through the error surface affects all the other nodes, creating a moving target for each node. Thus, instead of a direct movement of the nodes to useful roles, we see a "complex dance" among all units [Fah91].

* Recurrent back-propagation trains even slower for many reasons. First, the training methods require more computations than static backpropagation. Second, the error gradients tend to vanish exponentially as they are propagated through time. Third, recurrent networks tend to have a much more complicated error performance surface with many local minima, making the gradient search very difficult.

* Supervised techniques require presegmented and prelabeled training data. This often must be done by hand and is quite time consuming. The rule of thumb for ANN training is 10 training exemplars for each adjustable weight. Thus, for large networks, finding enough training data is a difficult task. If there is an insufficient amount of training data, the network will tend to memorize the data rather than draw reasonable generalizations about it.

Problems related to short-term memory structures include the following:

* The common short-term memory techniques (tap delay lines, etc.) use a time-to-space mapping to represent the past of the signal. By converting time into just another spatial dimension, the unique features of the temporal information are lost (e.g. continuity, limitations based on the dynamics of the system, etc.). The short-term memory is a representation of the data, not a memory structure.

* The typical short-term memory structure is a rigid architecture that must be pre-wired.

* Short-term memory structures typically add many weights to the input (or interior) layer (e.g. a TDNN with N taps will create N times more weights in the first layer), which exacerbates the problems with the performance surface and the amount of training data. The resulting networks tend to have so many degrees of freedom that they do not generalize well (i.e. memorization due to insufficient training exemplars).

The Approach


It is a Herculean challenge to attempt to solve all of the above problems. This work focuses on a method of self-organizing PEs in a network architecture based on their temporal correlations. This concept is biologically inspired and has been applied to three different types of neural networks. By creating temporal neighborhoods of PEs in the architecture, we have increased the performance of the networks - either through increased functionality and power or through better training methods.

When this technique is applied to a self-organizing feature map (SOFM or SOM), the temporal neighborhoods create traveling waves of activity which diffuse through the PEs. The resulting architecture has a spatio-temporal memory that is selective and recognizes temporal patterns similar to those it has been trained with. The typical ANN memory simply embeds the data for further processing by the ANN, without any mechanism for recognition. This architecture, however, is similar to biological memories in that it responds preferentially to known temporal patterns - this is unique in the neural network literature.

When the temporal neighborhood approach is applied to the neural gas algorithm, the network becomes a temporal vector quantizer that again responds preferentially to known temporal patterns. The temporal vector quantizer uses the past of the signal to anticipate the next input by expanding the Voronoi region associated with the expected next input. This allows the network to remove noise in the signal and generate better vector quantization based upon the temporal training and recent past of the signal. This anticipation is similar to how the human brain deals with noise in its environment - it uses the past to predict the future and correlates what it is sensing with this prediction. This is part of the reason humans can understand speech in very noisy environments, and also why two people can hear completely different things from the same set of sounds.

When we apply the technique to the training of recurrent neural networks, the new training technique reduces the computational complexity of the RTRL algorithm from O(N⁴) to O(N²). This dramatic improvement comes with only a slight increase in the number of iterations of training data required. The overall speed-up, taking into account both the decreased computational complexity and the increased number of training samples, is still dramatic. In fact, the O(N⁴) property of the RTRL algorithm makes it virtually unusable for sizeable networks.

In general, the self-organizing nature of the temporal neighborhoods helps alleviate many of the problems with the supervised techniques. Additionally, the novel spatio-temporal memory architectures provide a unique methodology for solving the problems with short-term memory.















CHAPTER 2
LITERATURE REVIEW

This chapter presents background information and a literature review of topics that either influenced this work, relate to this work, or will be compared and contrasted with this work. The chapter begins with a presentation of current research on biological neural networks and methods of temporal processing. This section is important because it motivated my work. I do not, however, claim that my work is biologically feasible or occurs in nature. Next, this chapter contains a description of the state of temporal neural network research. Since most of the work in temporal neural networks takes the form of extensions to static neural networks, an overview of static neural network learning is also presented. The contrast between the biological and artificial neural networks and the way they process time is striking. Static artificial neural networks are very similar to the static characteristics of real neurons, but temporal neural networks share little in common with their biological counterparts.

Biological Research


This section contains a description of biological neurons and their temporal characteristics, as well as other biological mechanisms that may help in processing time-based signals. Recently, there has been extensive research into the temporal characteristics of the brain as well as into learning dynamics. This research has not yet been integrated into the artificial neural network community, but holds promise for creating powerful, temporal ANNs. This information provides a motivation for the main principle of this work - that the creation of temporally organized neighborhoods in a neural network improves the performance of the network for temporal processing. The concept of diffusing temporal information through the network is one of the fundamental concepts used to rationalize the formation of these neighborhoods.

Neurons and Learning

Fundamentally, the artificial neural network is modeled after a collection of neurons in the brain. Each neuron is composed of three basic components: the cell body, the dendrites, and the axon [Fre92]. The dendrites are a widely branching set of filaments that collect information from other neurons. The axon is a long transmission medium that contains fewer branches and transmits the output of the neuron to other neurons. Synapses are the junctions between axons and dendrites. The dendrites collect incoming pulses from other synapses, convert them to currents, and sum them all at the initial segment of the axon. This summation works both across dendritic space (summation over all the dendrites) and across time. Each synaptic membrane acts as a leaky integrator with an associated time constant. The critical function of the axon is to transmit the time-varying amplitude of current summed by the dendrites to distant targets without attenuation [Fre92]. If the neuron reaches a certain threshold, it fires or depolarizes, which means that it produces an energy spike on its axon. The firing contains a refractory period such that a constantly active neuron will produce an impulse train on its axon. How biological neural networks are trained is not well known, but most of what is known about the training is based on the Hebbian learning concept (which will be discussed later). The Hebbian learning law strengthens synapses (allowing more responsiveness from the post-synaptic neuron) when the two neurons fire at the same time. If there is a consistent correlation between the firing of two neurons, then the pre-synaptic neuron must be at least partially responsible for the firing of the post-synaptic neuron.

A static artificial neural network is modeled loosely on an interconnected cluster of neurons. Each neuron is modeled by a processing element (PE) and a set of connections between processing elements. Typically, a processing element simply sums its inputs, warps the sum through a nonlinearity, and then passes this output to its downstream connections. Training is implemented either in an unsupervised manner, usually using a form of Hebbian learning, or in a supervised manner, which has no biological parallel. Notice that none of the temporal characteristics of a neuron are used in static neural networks or their temporal extensions.

Recently, there has been significant work on a more complete modeling of individual neurons and their temporal characteristics. Christodoulou and others [Chr95a][Chri93] have modeled the biological neuron including the random spiking nature, excitatory/inhibitory synapses, the transmission delay down the axon, and especially the membrane time constant. The membrane time constant is the main temporal property modeled today. Most modeling approaches use simplifications of the Hodgkin-Huxley equations that result in a leaky integrator model of the neuron membrane potential. This is an important feature of biological neurons, since the past history of the signal remains active on neurons for a short period and can influence the result of future inputs.







Additionally, the gas nitric oxide (NO) has been found to be involved in many processes in the central nervous system. One such process is the modification of synaptic strength, thought to be the mechanism for learning (and most commonly used in ANNs). Neurons produce NO post-synaptically after depolarization. The NO diffuses rapidly (3.3 × 10⁻⁵ cm²/s) and has a long half-life (~4-6 seconds), creating an effective range of at least 150 μm. Large quantities of NO at an active synapse strengthen the synapse (called Long Term Potentiation, or LTP). If the NO level is low, the synaptic strength is decreased (Long Term Depression, or LTD) even if the site is strongly depolarized. NO is thus commonly called a diffusing messenger, as it has the ability to carry information through diffusion, without any direct electrical contact (synapses), over much larger distances than normally considered (non-local). The NO diffusion and non-linear synaptic change mechanism has been shown to be capable of supporting the development of topographical maps without the need for a Mexican Hat lateral interaction (described later). This seems to be a more biologically plausible explanation of the short range excitation and long range inhibition than the preprogrammed weights of synaptic connections which are typically assumed to implement the same effect [Kre96a][Kre96b].

In addition to the possibility of lateral diffusive messenger effects, the long life of NO can produce interesting temporal effects. Krekelberg has shown that NO can act as a memory trace in the brain that can allow the temporal correlations in the input to be converted into spatial connection strengths. [Kre96b] This mechanism for capturing the temporal correlations of the input using an NO diffusion process is similar to the method I will present in more detail in Chapter 3.







Hippocampus

The hippocampus is the primary region in the mammalian brain for the study of memory and learning because: [Bur95]

* hippocampal damage causes memory loss,

* the hippocampus is the simplest form of cortex,

* long-term potentiation (LTP) has been found in the hippocampus (synaptic plasticity),

* cell firing in the hippocampus is spatially coded (place cells),

* all sensory inputs converge on the hippocampus, and the output from the hippocampus is extensively divergent with projections onto most of the cortical areas.

Figure 2-1 shows the major subfields of the hippocampus and their projections. The hippocampus is formed from sheets of cells, with most of the interconnections contained in these sheets (minimal connections between sheets). Most projections have large divergence and convergence, except the dentate gyrus to CA3 projection, which has mossy fiber projections from each granule cell, making very large synapses onto only 14 or so pyramidal cells. Hebbian LTP has been observed in much of the hippocampus. A variety of interneurons provide feed-forward and feed-back inhibition.

Figure 2-1: The major subfields of the hippocampus (dentate gyrus, CA3, CA1, subiculum) and their projections

One of the most interesting (and for this work, most relevant) aspects of the hippocampus is that it contains "place cells" and other functional clusters of neurons. Place cells are small patches of neurons that selectively fire only when the animal is in a specific location of its environment. These are groups of thousands of neurons that fire together and are linked to other place cells. As the subject moves through a familiar set of locations, the patches fire sequentially, and the linking of these patches allows for predictive navigation. They have been found in fields CA3 and CA1 of the rat hippocampus [Bur93]. These place cells are temporally and spatially organized neurons that are correlated in their reaction to temporally occurring events.

Diffusion Equations (Re-Di Equations)

The diffusion equation (or the reaction-diffusion equation if the medium is active) can be used to explain certain characteristics of a neuron and neuronal clusters. In its generic form, however, it is used in many other fields.

Objects such as cells, bacteria, chemicals, and animals often have the property that each individual moves about in a random manner (e.g. Brownian motion). When a concentration of these objects occurs, this random motion causes the objects to spread out into lower concentration areas of the environment. When this microscopic movement of the group results in macroscopic motion, we call it a diffusion process. If we assume a one-dimensional motion and a random walk process, we can derive the diffusion equation from a probabilistic treatment of the process. By finding the probability p(m,n) that a particle reaches a point m steps away at n time steps in the future, we find the distribution of particles at time n. Using the random walk assumption and allowing n to be large, it can be shown that the resulting distribution is the Gaussian or normal probability distribution:

p(m,n) ≈ (2/(πn))^(1/2) exp(-m²/(2n)),  m >> 1, n >> 1

Next, we determine the probability of finding a particle in an area between (x-Δx, x+Δx) at time t by rewriting the equation for p(m,n) as the sum of the probability of moving right from x-Δx at time t-Δt or moving left from x+Δx at time t-Δt. If we take the partial derivative of p with respect to t and let Δx → 0 and Δt → 0, we obtain the diffusion equation:

∂p/∂t = D ∂²p/∂x²

where D is the diffusion coefficient, which defines how fast the particles spread. A typical diffusion process spreads a concentration into ever shallower and shallower Gaussians, as shown in Figure 2-2.

Figure 2-2: Diffusion process
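The spreading behavior is simple to reproduce numerically. The sketch below (my illustration; the grid spacing, time step, and initial profile are arbitrary, with the time step chosen inside the usual explicit-scheme stability bound Δt ≤ Δx²/(2D)) integrates ∂p/∂t = D ∂²p/∂x² from an initial spike:

    import numpy as np

    D, dx, dt = 1.0, 0.1, 0.004
    x = np.arange(0.0, 8.0, dx)
    p = np.exp(-((x - 4.0) ** 2) / 0.05)   # initial concentration

    for _ in range(500):
        # Central-difference approximation of the second derivative.
        lap = (np.roll(p, 1) - 2 * p + np.roll(p, -1)) / dx ** 2
        p = p + dt * D * lap
    print(p.max())   # the peak decays as the Gaussian spreads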







The reaction-diffusion equations were originally proposed by Turing in 1952 and are typically used to explain natural pattern formation [Tur52]. They have been used to model insect populations, the formation of zebra stripes, crystal formation, galaxy formation and many other naturally occurring patterns and self-organizing systems. Turing's proposal modeled patterns found in nature by an interaction of chemicals called "morphogens". The different morphogens react with each other AND diffuse throughout the substance via the equation:

∂m_i(x,t)/∂t = f(m_i(x,t), m_j(x,t)) + D_m ∂²m_i(x,t)/∂x²

where m_i(x,t) is the concentration of morphogen i at time t, D_m is the diffusion coefficient, and f(m_i, m_j) is a function (typically nonlinear) that represents the interaction between morphogens. By varying the interaction between chemicals and the speed of diffusion, complicated spatial patterns of chemicals are created.

The reaction-diffusion equations have also been used to explain traveling waves, such as the traveling impulse down the axon of a neuron. If the reaction portion of the Re-Di equations represents the kinetics of the system and these kinetics are nonlinear, then the system can create a traveling wave. One requirement for a traveling wave is that the kinetics of the system are excitable, where excitable implies two stable states where a small excursion away from one state may drive it to the next state. Another requirement is that after excitation, the system must relax back to the original state. An example of such a system is the FitzHugh-Nagumo (FHN) equations, a simplified version of the Hodgkin-Huxley model that describes the transmission of energy down the axon of a neuron. The FHN equations can be described by the following system of three equations [Mur89]:

∂u/∂t = f(u) - v + D ∂²u/∂x²

∂v/∂t = bu - γv

f(u) = u(a - u)(u - 1)

where u is roughly equivalent to the membrane potential, v lumps the effects of most of the ionic membrane currents, and a, b, and γ are constants. The null clines of the kinetics in the (u,v) phase plane are shown in Figure 2-3.




Figure 2-3: Null clines for the dynamics of the FHN equations



The general concept is that when one element fires, its activity is diffused to its neighbors and pushes them just far enough from their stable state to move them to the "excited" state. Next, these newly excited elements excite their neighbors, etc. The elements which were excited originally then begin to relax, creating a traveling wave of activity. The traveling wave from the FHN equations is shown in Figure 2-4. In this case, not only does the system relax, it also has a refractory phase which inhibits future excitation for a period of time. [Tys88]





25


Figure 2-4: Traveling waves caused by the FHN equations
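The traveling pulse can be reproduced with a direct integration of the FHN system above (my illustration; the constants a, b, γ, D and the discretization are arbitrary choices that keep the kinetics excitable, not values from the dissertation):

    import numpy as np

    a, b, gamma, D = 0.15, 0.008, 0.02, 1.0
    dx, dt, nx = 0.5, 0.05, 200
    u = np.zeros(nx)                 # membrane-potential-like variable
    v = np.zeros(nx)                 # slow recovery variable
    u[:5] = 1.0                      # excite the left edge

    for _ in range(4000):
        lap = (np.roll(u, 1) - 2 * u + np.roll(u, -1)) / dx ** 2
        lap[0] = lap[-1] = 0.0       # crude no-flux boundaries
        f = u * (a - u) * (u - 1.0)  # cubic, excitable kinetics
        u = u + dt * (f - v + D * lap)
        v = v + dt * (b * u - gamma * v)
    print(np.argmax(u))              # the excited region has moved right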



Diffusion and other biologically plausible local communication techniques have increasingly been used in neural networks. For example, the Kohonen algorithm can be implemented in analog hardware with an active medium using diffusion [Ruw93]. Diffusion has also been used frequently in visual imaging systems [Cun94]. Sherstinsky and Picard have proposed a cellular neural network based on Re-Di equations that can solve optimization problems [She94]. One key aspect of this work is that diffusion in the PE space of a neural network allows temporal information to be transmitted and stored using only local communication. This is similar to the diffusion of NO in the brain, which is thought to affect the plasticity of synapses in areas where many neurons are firing at once. Without direct connectivity between two PEs, communication and temporal memory can be implemented using the local storage and transmission of a diffusing object (in our case, diffusing activity).

Biological Representations of Time

Another example of neurobiological research that has not been used in ANNs is the concept of rhythm. Recently, there has been some interesting research on oscillators, central pattern generators, rhythm, and their effect on human pattern recognition. Studies of rhythm in biology have found that rhythmic signals from insects can be entrained or phase-locked to an external rhythmic pattern - without high-level processing (the patterns are faster than the minimum response latency) [McA94]. There is evidence that the dynamics of many biological systems have natural rhythms that share the same frequency. Communication and locomotion, for instance, are highly dependent on rhythm and pacing. It has also been suggested that EEG rhythms play an important role in learning and temporal recognition. For instance, neurons are thought to modify their synaptic strengths only when the θ rhythm is in the correct phase. The θ rhythm is a sinusoidal component of the EEG that ranges from 7-12 Hz. The θ rhythm has been linked with displacement movements (e.g. walking) and many other repetitive actions. Since the θ rhythm must propagate through the neural tissue, it also could play the role of a moving wavefront that controls learning.

Rhythm can be thought of in two ways: either as an external pacemaker that synchronizes the network in some fashion, or as the output of a collection of neurons that are working in unison. For the first case, there is little if any research on the effects of an external pacemaker on temporal ANNs. The pacemaker would create a time-varying network where the output of the network is dependent on the time or phase of the pacemaker. The pacemaker could also act as a sampling signal. For instance, learning may only occur at a specific phase of the θ rhythm. In the second case, the rhythm could be the result of synchronized processing. For instance, waves of activity in the brain could be caused by the processing of the spatio-temporal patterns constantly input to the network by the continuous motions of the eyes and other sensory muscles.







Stanley and Kilmer [Sta75] have proposed a "wave model" of memory that can learn sequences. It is based on the anatomy of the dentate gyrus (in the mammalian hippocampus) and can be summarized as follows:

* The hippocampus is organized into transverse slices called lamellae.

* The majority of connections in the hippocampus do not leave a lamella (small longitudinal spread).

* Sensory inputs arrive via the perforant path to excite cells directly.

* A small number of mossy fibers connect cells longitudinally (across lamellae).

* Cells excited by an input spread excitation to their neighbors, causing a wave of activity to travel down the cell's lamella.

The wave formation is based on the pyramid and granule cells receiving excitatory influences from the hippocampal input pathways, which in turn excite interneurons whose axons inhibit the pyramid and granule cells. This excitation and inhibition create the waves of activity in the lamella.

The memory is created by the association of the various waves in different lamellae via the mossy fibers that interconnect the lamellae. Each wave is created by a sensory input that triggers a cell in a lamella and can move a short distance before dying. Randomly distributed mossy fibers interconnect the lamellae. The connection weights are strengthened in a Hebbian manner - when two waves from different lamellae are coincident with a connecting mossy fiber, this connection is strengthened. Thus, the next time the first input wave passes the same position, it can automatically trigger the second wave even without the corresponding input. This is shown in Figure 2-5. For longer temporal relationships, one wave will trigger a second wave in another lamella via prestrengthened longitudinal connections which will continue after the first wave has died.


Figure 2-5: Stanley and Kilmer's wave model [Sta75] (waves in two lamellae linked by a strengthened mossy fiber)



Biological Models for Temporal Processing

Living neurons act as leaky integrators with time constants on the order of tens to hundreds of milliseconds. This can lead to the storage of information in a way that may support temporal sequence processing. Most ANN temporal methods store the information in a spatial manner. The spatial approach to signal storage is used in the brain for auditory and visual processing (e.g. SOMs). The way in which these maps are then processed is not necessarily spatial. Reiss and Taylor propose an interesting temporal sequence storage mechanism based on a leaky integrator network [Rei91]. The basic concept is to use the leaky integrator neurons as temporary storage for an associative memory that is implemented like a single layer neural net. The network has been shown to have a capacity proportional to the number of neurons. The problem with this network is that the connection matrix seems to be very heavily skewed toward only predicting the next input, with little information from further in the past. This is similar to a simple state machine or Markov chain. An interesting part of this work is the possible connection to the function of the hippocampus. The memory network corresponds to the dentate gyrus, the CA3 corresponds to the predictor, and the input line is similar to the perforant path (between EHC, DG, and CA3).

Kargupta and Ray proposed a temporal sequence processor based on the reaction-diffusion equations [Kar94]. Drawing an analogy between chemical diffusion in biology and spatio-temporal sequence processing, their model is based on a collection of cells that react to different inputs. When a cell becomes active (by recognizing its input), it outputs its own specific chemical. This chemical diffuses throughout the medium containing the cells. Each cell contains a memory of the chemical makeup at its location when it fires. The background medium thus stores the temporal history of the signal by diffusing all the various chemicals. This approach is more of a chemical model than an information processing model and has several difficulties when applied to realistic problems.

Static Neural Network Learning


This section contains a summary of the static neural network learning mechanisms. Almost all of the work in temporal ANNs is based on principles from static ANNs. Since unsupervised training is most similar to known biological learning mechanisms, it is presented first. Unsupervised learning does not have a desired signal and extracts information only from the input signal. As such, unsupervised techniques typically do not directly implement classifiers, but are usually used for preprocessing the input. For example, unsupervised networks can be trained to perform







principal component analysis (PCA), vector quantization (VQ), and data reduction. Supervised learning is presented next; these algorithms use a desired signal to train the network to mimic the desired input-output map. The desired signal can be thought of as a teacher or external influence that guides the network to the desired state. As mentioned before, there is no known biological analog to supervised training.

Unsupervised Learning

Most unsupervised (also known as competitive or self-organizing) learning is based on Hebbian learning. Hebbian learning is derived from the work of the neuropsychologist Hebb, who noted in 1949 that when cell A repeatedly participates in the firing of cell B, a growth process occurs between the two cells which increases the efficiency of the link between them. This can be stated as "neurons that fire together, wire together". This mechanism is often called correlation learning because the links are strengthened when there is a statistical correlation over time between the presynaptic and postsynaptic activities. To avoid excessive weight growth, Hebbian synapses typically also include a decrease in the strength of a connection between two cells which are uncorrelated. Conversely, anti-Hebbian learning is a learning rule that increases the strength of a connection when the presynaptic and postsynaptic signals are negatively correlated and weakens it otherwise.

A typical expression for Hebbian learning is

\Delta w_{kj}(n) = \eta \, y_k(n) \, x_j(n)

where w_kj represents the synaptic weight between cell k and cell j, x_j is the presynaptic activity, y_k is the postsynaptic activity, and η is the learning rate.







This rule, however, does not include the weakening of uncorrelated signals, and thus the weights will forever increase. Introducing a nonlinear forgetting factor into the equation can control the weight growth:

\Delta w_{kj}(n) = \eta \, y_k(n) \, x_j(n) - \alpha \, y_k(n) \, w_{kj}(n)

where α is the decay constant. This equation can be rewritten as:

\Delta w_{kj}(n) = \alpha \, y_k(n) \left[ c \, x_j(n) - w_{kj}(n) \right]

with c = η/α,

which is the standard Hebbian learning rule. Notice that when the postsynaptic neuron fires, w_kj moves exponentially toward c·x_j. By manipulating the definitions of the variables, this equation can be reformulated into the competitive learning rule.
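As an illustration, this update collapses to a single line of code. The following is a minimal Python sketch with hypothetical names (a sketch, not this dissertation's code), assuming scalar pre- and postsynaptic activities:

    # Hebbian update with a forgetting factor (hypothetical names).
    # When y_k is active, w_kj moves toward c*x_j instead of growing
    # without bound.
    def hebbian_with_forgetting(w_kj, x_j, y_k, eta=0.1, alpha=0.1):
        c = eta / alpha
        return w_kj + alpha * y_k * (c * x_j - w_kj)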

In competitive learning, a group of neurons is organized such that one and only one neuron wins a competition for each input. Algorithmically, the winner is simply selected by choosing the PE with the highest/lowest output, which can be physically implemented using lateral inhibition between nodes. Biologically, neurons fire in clusters, and the competition between clusters is believed to be due to long-range inhibition and short-range excitation (a concept that will come up again and again). In the case of a competitive cluster, the winning node has an output value of 1, and the others are all zero. Thus the Hebbian learning rule becomes:

\Delta w_{kj}(n) = \begin{cases} \eta \left[ x_j(n) - w_{kj}(n) \right] & \text{if neuron } k \text{ wins} \\ 0 & \text{if neuron } k \text{ loses} \end{cases}

Only one neuron (or cluster, in biology) learns at each stage, and its weights move toward the location of the input. Thus, the individual nodes specialize on sets of similar patterns and become feature detectors. Competitive learning is typically used for clustering or vector quantization.
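To make the rule concrete, the sketch below (Python/NumPy, hypothetical names; a sketch rather than code from this work) selects a hard-max winner and moves only that PE's weights toward the input:

    import numpy as np

    # Hard-max competitive learning: each row of w is one PE's weight
    # vector; only the closest PE is trained.
    def competitive_update(w, x, eta=0.1):
        winner = np.argmin(np.linalg.norm(w - x, axis=1))  # hard competition
        w[winner] += eta * (x - w[winner])                 # move winner toward input
        return winner

    rng = np.random.default_rng(0)
    w = rng.random((10, 2))            # 10 PEs quantizing a 2-D input space
    for x in rng.random((1000, 2)):    # stream of random inputs
        competitive_update(w, x)

After training, each PE's weight vector sits near the center of mass of the inputs it wins, which is exactly the feature-detector behavior described above.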







Hebbian learning is used widely throughout the neural network field; in its simplest form it is often used for principal component analysis.

Kohonen SOMs

The Kohonen map or self-organizing feature map (SOM) is a neural network inspired by sensory mappings commonly found in the brain [Wil76][Koh82]. A self-organizing feature map creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. The result of mapping similar inputs to neighboring outputs is a global organization that is extracted from the local neighborhoods. Topographical computational maps have been found in many locations in the brain, including the vision areas (angle of tilt of a line stimulus, motion direction), auditory areas (representations of frequency, amplitude, and time intervals between acoustic events), and motor control areas (control of eye movements). More abstract topographic maps have been found in other parts of the brain; for example, there is a map for the representation of the location of a sound source based on the interaural differences in an acoustic signal. The SOM is one of the most widely used unsupervised artificial neural network algorithms [Kan94].

The typical SOM is composed of an input layer and an output layer as shown in Figure 2-7. The input layer broadcasts the vector input to each node in the output layer, scaled by the weights of each connection. Each node has an input term and lateral feedback term. The topographic mapping is created by the local lateral feedback, where







neighboring connections are excitatory and more distant connections are inhibitory. This is called "Mexican hat" lateral connectivity and is shown in Figure 2-6. The result is similar to the standard competitive network except that the network creates a gentler cutoff, producing a Gaussian-shaped output after the lateral interconnections have stabilized. This is called a "soft-max" rule (or soft competition), where the winning PE and a few "near-winner" PEs remain active. The competitive rule is called a "hard-max" rule, hard competition, or winner-take-all rule. Depending on the characteristics of the Mexican hat lateral interconnections, the resulting output will be a Gaussian of varying width centered roughly at the location of the maximum output. The process can be described using the following equation


y_j = \varphi\!\left( I_j + \sum_{k=-K}^{K} c_{jk} \, y_{j+k} \right)




where y_j is the output of the j'th node, I_j is the input to the j'th node scaled by the weights into the j'th node, c_jk are the lateral weights described above as the Mexican hat function, and φ is a nonlinear saturating function which keeps the nodes from growing without bound.









Figure 2-6: Mexican hat lateral connectivity and Gaussian shaped output








Figure 2-7: Connectivity of an SOM (an input layer broadcasting to the output layer)



After the outputs have stabilized, the network can be trained with a simple Hebbian-like rule that adapts the weights of the winning node and its neighbors. The neighboring nodes can be trained in proportion to their activity (Gaussian), or all neighbors within a certain distance can be trained equally. The learning rule can be described as follows:

w_j(n+1) = w_j(n) + \eta(n) \, h_{j,i(x)}(n) \left[ x(n) - w_j(n) \right]

where w_j are the weights of node j, x(n) is the input at time n, h_{j,i(x)}(n) is the neighborhood function centered around the winning node i(x), and η(n) is the learning rate. Notice that both the learning rate and neighborhood size are time dependent and are typically annealed (from large to small) to provide the best performance with the smallest training time.

A simplified approximation to this algorithm consists of two stages: first, find the winning node (the one whose weights are closest to the input), then update the weights of the winner and its neighbors in a Hebbian manner.
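This two-stage approximation translates directly into code. The following Python/NumPy fragment is a minimal sketch (hypothetical names; not code from this work), with the learning rate and neighborhood width annealed from large to small as described:

    import numpy as np

    # Two-stage SOM update on a 1-D lattice: find the winner, then train
    # the winner and its lattice neighbors with a Gaussian neighborhood.
    def som_update(w, x, eta, sigma):
        winner = np.argmin(np.linalg.norm(w - x, axis=1))        # stage 1
        lattice = np.arange(len(w))
        h = np.exp(-((lattice - winner) ** 2) / (2 * sigma**2))  # neighborhood
        w += eta * h[:, None] * (x - w)                          # stage 2
        return winner

    rng = np.random.default_rng(0)
    w = rng.random((25, 2))                    # 25-node 1-D map of 2-D inputs
    n_steps = 2000
    for n, x in enumerate(rng.random((n_steps, 2))):
        eta = 0.5 * (1 - n / n_steps)          # annealed learning rate
        sigma = 5.0 * (1 - n / n_steps) + 0.5  # annealed neighborhood size
        som_update(w, x, eta, sigma)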

The SOM is an unsupervised network with large local connectivity, but unsupervised networks do not typically suffer from overtraining. Because the input is mapped onto a discrete, usually lower-dimensional output space, the SOM is typically used







as a vector quantization (VQ) algorithm. The weights of the winning node are the vector quantized representation of the input.

A typical example of an SOM is mapping a two-dimensional input space onto a one-dimensional SOM. Figure 2-8 shows a random distribution of points that make up the input space in two dimensions. The points are plotted such that the coordinates of each point represent the input data. When this input data is presented to the 1-D SOM, the map trains the nodes to maintain local neighborhoods in the input space. These local neighborhoods force a global ordering of the output nodes. After training, the nodes of the SOM are ordered, and the weights of the nodes represent the center of mass of the region of the input space to which they respond. By plotting the weights of the SOM PEs onto the input space, one can see where the center of each VQ cluster is located. The SOM is more than just a clustering algorithm: it also orders the PEs such that neighboring PEs respond to neighboring inputs. To show this, we connect neighboring PEs with a line. The right side of Figure 2-8 shows how the SOM maps a one-dimensional structure to cover the two-dimensional input space. This clearly shows how the global ordering has occurred













Figure 2-8: Example of a 1-D SOM mapping a 2-D input







and that the 1-D output snakes its way through the input space in order to maintain its topographic ordering and still cover the input space.

Neural Gas

The neural gas algorithm is similar to the SOM algorithm but without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distances to the reference vectors, not on the distance to the winning PE in the lattice. The neural gas algorithm has been shown to converge quickly to distortion errors smaller than those of k-means, maximum entropy clustering, or the SOM algorithm [Mar93]. Because it has no predefined neighborhood structure, it works better than the SOM on disjoint or complicated input spaces.

Martinetz et al. [Mar93] showed an interesting parallel between most of the major clustering algorithms. The main difference between them is how the neighborhood is defined. For k-means clustering, there is no neighborhood; only the winner is trained. This is a hard max:

\Delta w_i = \varepsilon \, \delta_{i,i(x)} \, (x - w_i)

For maximum entropy clustering, the neighborhood is defined as a soft max based on the distance in an entropy space

\Delta w_i = \varepsilon \, h(\lVert x - w_i \rVert) \, (x - w_i)

For the SOM, the neighborhood is based on the position in the SOM lattice:

\Delta w_i = \varepsilon \, h_\sigma(i, i(x)) \, (x - w_i)







and for the neural gas algorithm the softmax is based on the ranking of the node. For instance, the closest node gets the largest update, followed by the second closest, etc.

\Delta w_i = \varepsilon \, h_\lambda(k_i(x, w)) \, (x - w_i)
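For comparison with the SOM, a minimal neural gas update (Python/NumPy, hypothetical names; a sketch, not code from this work) replaces the lattice neighborhood with the rank-based one:

    import numpy as np

    # Neural gas: every PE is trained in proportion to the rank of its
    # distance to the input, not its position in a lattice.
    def neural_gas_update(w, x, eps=0.1, lam=2.0):
        dists = np.linalg.norm(w - x, axis=1)
        ranks = np.argsort(np.argsort(dists))  # k_i: 0 = closest PE
        h = np.exp(-ranks / lam)               # rank-based soft-max
        w += eps * h[:, None] * (x - w)

    rng = np.random.default_rng(0)
    w = rng.random((20, 2))
    for x in rng.random((2000, 2)):
        neural_gas_update(w, x)

Because the update depends only on distance rankings, nothing ties a PE to any fixed neighbor, which is why the algorithm handles disjoint input spaces more gracefully than the SOM.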

Supervised Training

Supervised training is the most commonly used mechanism for training neural networks, especially for classification and function approximation. Many applications can be framed as a function approximation problem. The main supervised training technique is called backpropagation [Rum86]. Typically it is applied to multilayer perceptrons (MLPs) that consist of multiple layers of PEs, each of which computes a sum-of-products on its input and then applies a saturating nonlinearity. The backpropagation algorithm works by first computing the forward activations of the network: the inputs are applied to the network, and the output of every PE is computed and stored. The activity of the network outputs is then compared to the desired activity and an error is computed. This error is then propagated backwards (thus the name backpropagation) through the network and is used to adapt each of the weights in the system. This is graphically depicted in Figure 2-9.

Figure 2-9: Activation and error propagation in a static neural network

The output of each PE in an MLP can be described by the following equation:

y_j = f(\text{net}_j) = f\!\left( \sum_i w_{ji} \, x_i + b_j \right)


where w_ji represents the weight from PE i to PE j, x_i represents the output of PE i (or the external input for the first layer), b_j represents the bias for PE j, and f(·) is the nonlinearity of the PE, typically a logistic function (which ranges from 0 to 1) or a tanh function (which ranges from -1 to 1). The performance surface that is searched using gradient descent is defined by:

J = \frac{1}{2N} \sum_{p=1}^{N} \sum_{i} e_{pi}^2 = \frac{1}{2N} \sum_{p=1}^{N} \sum_{i} \left( d_{pi} - y_{pi} \right)^2

where e is the difference between the output and the desired signal, p is the index over the patterns, and i is the index over the output PEs. We want to update each weight based on the partial derivative of J with respect to that weight. This can be written as:

\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_i} \frac{\partial y_i}{\partial \text{net}_i} \frac{\partial \text{net}_i}{\partial w_{ij}} = -(d_i - y_i) \, f'(\text{net}_i) \, x_j = -e_i \, f'(\text{net}_i) \, x_j



if we define the local error δ_i for the i'th PE as

\delta_i(n) = e_i(n) \, f'(\text{net}_i(n))


Then we can generalize the backpropagation algorithm for the MLP and the LMS algorithm for linear systems. All the weights in gradient descent learning are updated by multiplying the local error (δ_i(n)) by the local activation (x_j(n)), according to Widrow's estimation of the instantaneous gradient first shown in the LMS rule:







\Delta w_{ij}(n) = \eta \, \delta_i(n) \, x_j(n)

The difference between these algorithms is the calculation of the local error. If the PE is linear, then we have a linear combiner and the derivative of f is a constant; the equation then becomes the LMS rule. If the PE is nonlinear and is an output PE, then the local error is simply the difference between the output and the desired signal scaled by the derivative of the nonlinearity. This is simply the delta rule. If the PE is nonlinear and is a hidden-layer PE, then the error is the sum of the backpropagated errors from the PEs that follow it:

\delta_i(n) = f'(\text{net}_i(n)) \sum_k \delta_k(n) \, w_{ki}(n)


This simple rule nicely summarizes the backpropagation algorithm and shows its relationship to other adaptive algorithms.
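As an illustration of these rules working together, here is a minimal Python/NumPy sketch of backpropagation for a one-hidden-layer MLP with tanh PEs (hypothetical names and a toy task; a sketch, not code from this work):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(0, 0.5, (4, 2)), np.zeros(4)  # input -> hidden
    W2, b2 = rng.normal(0, 0.5, (1, 4)), np.zeros(1)  # hidden -> output
    eta = 0.1

    def train_step(x, d):
        global W1, b1, W2, b2
        # forward activations
        net1 = W1 @ x + b1; y1 = np.tanh(net1)
        net2 = W2 @ y1 + b2; y2 = np.tanh(net2)
        # local errors: delta = f'(net) * (error term)
        delta2 = (1 - y2**2) * (d - y2)         # output PE: delta rule
        delta1 = (1 - y1**2) * (W2.T @ delta2)  # hidden PE: backpropagated sum
        # weight updates: eta * local error * local activation
        W2 += eta * np.outer(delta2, y1); b2 += eta * delta2
        W1 += eta * np.outer(delta1, x);  b1 += eta * delta1

    # toy example: learn XOR with inputs and targets in {-1, 1}
    X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], float)
    D = np.array([-1, 1, 1, -1], float)
    for _ in range(5000):
        for x, d in zip(X, D):
            train_step(x, np.array([d]))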

Second Order Methods

The standard backpropagation method of training a neural network uses the LMS approximation to gradient descent, which uses only an instantaneous estimate of the gradient. Second order methods collect information over time to get a better estimate of the gradient, thus allowing for faster convergence at the cost of more computations per cycle. In linear adaptive filtering, the recursive least squares (RLS) algorithm is used for exactly this purpose. The RLS algorithm is based upon estimating the inverse of the correlation matrix of the input. With this information, the RLS algorithm can often adapt as much as ten times faster than LMS. The RLS algorithm can also be formulated as a special case of the Kalman filter. Besides faster convergence, the RLS algorithm also has two other advantages [Hay96]: the eigenvalue spread of the correlation matrix does not







adversely affect the training (unlike in LMS) and the learning rate is automatically determined (the Kalman gain).

Since RLS and Kalman filtering are derived for linear systems, they must be modified for use with nonlinear systems. These are typically called extended RLS or extended Kalman filtering. The most straightforward approach is to linearize the total cost function and directly apply RLS. This requires the storage and update of the complete error covariance matrix whose size is the square of the number of weights in the network [Hay94]. A better approach is to apply RLS to each node individually and linearize the activation function of the PE using a Taylor series about the current operating point. This method is called the multiple extended Kalman algorithm (MEKA) [Hay94] and reduces the computational requirements by ignoring the cross-terms between PEs.

Temporal Neural Networks


As we stated previously, the majority of temporal neural networks are extensions of static neural networks, created either by adding memory or by adding recursive connections. This section could just as easily be called "Extending static architectures to include time". Again, this topic will be discussed in two sections, covering unsupervised and supervised neural networks.

Temporal Unsupervised Learning

This section presents the methodologies currently available for adding temporal information to unsupervised networks. Almost all work on temporal unsupervised training has used self-organizing maps.







As mentioned before, a self-organizing map (SOM) creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. There have been many attempts at integrating temporal information into the SOM. One major technique is to add temporal information to the input of the SOM. For example, exponential averaging and tapped-delay lines were tested in [Kan90][Kan91], while coding in the complex domain was implemented in [Moz95]. Another common method is to use layered or hierarchical SOMs where a second map tries to capture the spatial dynamics of the input moving through the first map [Kan90][Kan91].

More recently, researchers have begun integrating memory inside the SOM, typically with exponentially decaying memory traces. Privitera and Morasso have created a SOM with leaky integrators and thresholds at each node which activate only after the pattern has been stable in an area of the map for a certain amount of time. This allows the map to pick out only the "stationary" regions of the input signal and use these sequences of regions to detect the input sequence [Pri93][Pri94][Pri96].

The SARDNET architecture [Jam95] adds exponential decays to each neuron for use in the detection of node firing sequences. Once a node fires for a particular sequence, it is not allowed to fire again. Therefore, at the end of the sequence presentation, the sequence of node firings can be detected (or recreated) using the decayed outputs of the SOM. The exponential decay, however, provides poor resolution at high depths and thus will perform poorly with noisy and/or long sequences.







Chappell and Taylor have created a SOM whose neurons hold activity on their surface via leaky integrator storage [Cha93]. This activity is added to the typical spatial distance between input and weight vector to determine the next winner. The same or neighboring nodes will thus be more likely to win the competition for successive elements in a sequence. This creates neighborhoods with sensitivity to the previous input (i.e. context). There is not yet a successful method available to train these networks; the learning law proposed by Chappell and Taylor can lead to an unstable weight space. The methodology seems to work for patterns of binary inputs of length at most 3. Critchley [Cri94] has improved the architecture by moving the leaky integration to the synapses. This gives the network a much better picture of the temporal input space and trains much more stably, but it becomes nothing more than an exponentially windowed input to a standard Kohonen map, as proposed by Kangas [Kan90].

The temporal organization map (TOM) integrates a cortical column model, SOM learning and separate temporal links to create a temporal Kohonen map [Dur96]. The TOM is split into super-units that are trained via the SOM learning algorithm. Winning units from each super-unit fire and then decay. Temporal links are made between the currently firing node and any node which has an activity above a threshold. Thus there can be multiple links created for each firing, allowing for the pattern to skip states.

Kohonen and Kangas have proposed the hypermap architecture to include context in the SOM architecture. Kohonen's original hypermap architecture included two sets of inputs and weights [Koh91]. The first set is a context vector that is a tapped delay line of the past and future pattern vectors. This input is used to determine a "context domain" in the SOM. All nodes in the context domain are labeled active and are then presented with







the current input pattern. The "pattern" weights and context weights are then trained in the typical SOM manner. Kangas extended this concept by eliminating the context weights and allowing only nodes in the vicinity of the last winner to be selected. This smoothes the trajectory of winning nodes throughout the map and allows context to affect the selection of the winner without the addition of parameters like the width of the context window [Kan92]. Kangas has also proposed an SOM architecture that has an LPC predictor at each node in the Kohonen net. This provides temporal pattern recognition by using a filter at each node where the AR filters were trained via either genetic programming or gradient descent [Kan94].

Goppert and Rosenstiel conceptually extend this concept to include the notion of attention [Gop94a][Gop94b][Gop95]. The theory is that the probability of selecting a winner is affected either by higher cognitive processes (which may be considered a type of supervision) or by information from the past activations of the network. This gives two components to the selection of a winner: the extrasensory distance (context or higher processes) and the sensory distance (the normal distance from weight to input). These two components can be added or multiplied. The authors focus on the concept of context and create a moving area of attention, which is the region that has been activated most in the recent past. The center of attention moves as each winner is selected, and the region of attention has a Gaussian weighting applied to it so that nodes near the last winner will be more likely to fire the next time. The architecture outperformed the standard SOM on simple temporal tasks but did not train well on more complicated trajectories.







Temporal Supervised Neural Networks

The main problem with temporal supervised neural networks is the complexity of training them. When the desired architecture contains recurrent connections or memory in one of the hidden layers, the network must be trained with a temporal gradient descent algorithm. There are two distinct approaches to the problem: modifying the architecture to simplify the temporal gradient calculations, and creating better and/or faster methods of training the temporal neural networks.

Architectural approaches

The focused time-delay neural network (TDNN) has memory added only at the first layer and is the simplest example of an architecture designed to avoid many of the complications of temporal neural networks. It is simply a static MLP with a tap delay line between the input and the first layer. Because the memory is restricted to the first layer, the network can still be trained using static backpropagation. The tap delay line maps a segment of the input trajectory into an N-dimensional static image that is then mapped by the MLP. This works quite well for many applications but has a number of difficulties, as mentioned previously. The main difficulties are the increased number of weights required for TDNNs (each input now requires m weights, where m is the number of taps in the tap delay line) and the inflexible, prewired nature of the tap delay line.
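The front end of a focused TDNN is easy to state in code. The sketch below (Python/NumPy, hypothetical names; not code from this work) shows the tap delay line turning a scalar signal into the static vectors that an ordinary MLP, trained by static backpropagation, would then map:

    import numpy as np

    # Tap delay line: each row is the m-sample static "image" of the
    # input trajectory that feeds the static MLP.
    def tap_delay(signal, m):
        return np.stack([signal[t - m:t] for t in range(m, len(signal) + 1)])

    x = np.sin(np.linspace(0, 10, 100))
    X = tap_delay(x, m=5)  # shape (96, 5): one static pattern per time step

The m-fold growth in input dimension is visible immediately: every input line to the first hidden layer now carries m weights.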

Some of the problems with TDNNs have been attacked by defining the connectivity between layers such that only certain regions of each layer are connected. By doing this, certain regions in the input layer, corresponding to certain time periods of the input, can be connected to a single region of the second layer. This provides a more goal-directed architecture that can be time-shift or frequency-shift invariant. Although this can reduce the effects of the problems of TDNNs, the problems still remain and each network must be tailored for each application [Saw91][Haf90].

Two other networks deal with temporal information by using a very restrictive type of feedback. The Jordan network [Jor86] uses recurrence between the output and the input of the network: the output of the network is fed back to a context unit which is simply a leaky integrator. The Elman network [Elm90] provides feedback from the hidden layer to context units in the input layer. This is potentially more powerful than the Jordan network because it stores and uses the past state of the network, not just the past output of the network. Although both networks are commonly found in the neural network literature, neither is particularly powerful or easily trained.

Recurrent networks are also continuously being modified in an attempt to improve their performance on temporal problems. Mozer has proposed a "multiscale integration model" that uses recurrent hidden units with different time constants of integration, the slow integrators forming a coarse but global sequence memory and the fast integrators forming a fine-grained but local memory [Moz92]. This work, however, is based only on exponentially decaying memory, and the problem of selecting the time constants has not been solved (the time constants have to be hand-tuned).

A different spin on recurrent networks is the use of "higher order networks". These networks are recurrent networks where:







* hidden units represent states, and the outputs of these states are fed back and multiplied with the inputs of the nodes, thus allowing second order statistics to be used [Gil91][Wat91]
* one network computes the weights for a second network [Pol91][Sch92a][Sch92b]

Higher order networks have proven to be excellent sequence recognizers (grammar recognizers) but have failed to make a serious impact on temporal processing. These networks provide a representation for states in the neural network and allow the computation of high order statistics. For example, a second order network can compute the autocorrelation of the input, thus creating a translation-invariant architecture. The main disadvantage of this work is that for complex tasks, higher order networks require even more weights and have even more complicated performance surfaces than standard ANNs.

Algorithmic approaches

There are two fundamental methods of computing the gradient for a dynamic neural network. First, the gradients can be computed in the backward direction, similar to the static backpropagation technique for feedforward networks. Unfolding a recurrent network in time creates a large, static, feedforward network where each "layer" consists of an instance of the recurrent network at one time step. Backpropagation can then be applied to this large feedforward network and the gradient can be computed. This is called backpropagation through time (BPTT) [Rum86]. The main shortcoming of this technique is that it is non-causal: the BPTT algorithm must be used in a batch mode, where the data travels first in the forward direction while the entire state of the network is saved at each step, and the error is then backpropagated in the reverse temporal order. A secondary shortcoming of BPTT is the memory required to store the state of the network at each iteration.

Many alterations have been made to the BPTT algorithm to improve its utility, in particular to make it usable as an on-line algorithm. Williams and Peng [Wil90] used a history cutoff, assuming that gradient information from the distant past is relatively inconsequential and thus can be ignored. Combining this with a small step size, the resulting algorithm, BPTT(k), can be used in an on-line manner. See Pearlmutter [Pea95] for a review of this technique and others.

The second fundamental method of computing the gradient for a recurrent neural network computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node with respect to each weight at every iteration. The method is completely on-line and simple to implement. The main difficulty with the RTRL method is its computational complexity. If we define n to be the number of PEs and m to be the number of weights, then the computation of the gradients of each PE with respect to each weight is O(n^2 m). For a fully recurrent network, this dominates the computational complexity and requires O(n^4) computations per step. The algorithm works quite well on small networks, but the n^4 factor becomes overwhelming as the number of nodes increases.

The RTRL algorithm for a recurrent network can be summarized by the following set of equations [Hay94]. First, we define set A as the set of all inputs, set B as the set of all PEs, and set C as the set of outputs with desired signals. The forward activation equations are:







\text{net}_j(n) = \sum_{i \in A \cup B} w_{ji}(n) \, u_i(n), \qquad y_j(n+1) = \varphi(\text{net}_j(n))

where u represents the input vector at each time step, composed of both the external inputs and the outputs of each PE (the values of the feedback). The gradient descent technique is based upon computing the sensitivity of each PE with respect to each weight. The weights are updated on-line using these sensitivities:

\Delta w_{kl}(n) = \eta \sum_{j \in C} e_j(n) \, \frac{\partial y_j(n)}{\partial w_{kl}(n)}

For implementation, we create a matrix π that represents these sensitivities and write an update equation for it:

\pi_{kl}^{j}(n) = \frac{\partial y_j(n)}{\partial w_{kl}(n)}, \qquad j \in B, \; k \in B, \; l \in A \cup B

\pi_{kl}^{j}(n+1) = \varphi'(\text{net}_j(n)) \left[ \sum_{i \in B} w_{ji}(n) \, \pi_{kl}^{i}(n) + \delta_{jk} \, u_l(n) \right]

\pi_{kl}^{j}(0) = 0

π is a matrix of gradients, with the rows representing weights and the columns representing nodes; it thus contains mn elements.
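The sensitivity recursion maps directly onto array operations. The following Python/NumPy fragment sketches one RTRL step for a fully recurrent tanh network (hypothetical names; a sketch, not code from this work); d_mask marks the PEs in set C that have desired signals:

    import numpy as np

    # pi[j, k, l] holds the sensitivity of PE j's output to weight w[k, l];
    # u concatenates the external inputs and the fed-back PE outputs.
    def rtrl_step(w, pi, y, x, d_mask, d, eta=0.01):
        n = len(y)
        u = np.concatenate([x, y])                     # inputs + feedback
        y_new = np.tanh(w @ u)
        fprime = 1 - y_new**2
        # pi' = f'(net_j) * [ sum_i w_ji * pi_i + delta_jk * u_l ]
        rec = np.tensordot(w[:, len(x):], pi, axes=1)  # sum over fed-back PEs
        kron = np.zeros_like(pi)
        kron[np.arange(n), np.arange(n), :] = u        # delta_jk * u_l term
        pi_new = fprime[:, None, None] * (rec + kron)
        # on-line weight update from the output errors
        e = np.where(d_mask, d - y_new, 0.0)
        w += eta * np.tensordot(e, pi_new, axes=1)
        return w, pi_new, y_new

The O(n^2 m) cost is visible in the shape of pi itself: n nodes by m weights, with each entry updated by a sum over the n PEs.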

Many methods have been proposed to increase the speed of RTRL. Schmidhuber and others have mixed BPTT and RTRL, which reduces the complexity to O(nm) [Sch92]. This technique takes blocks of BPTT and uses RTRL to encapsulate the history before the start of each block. Sun, Chen, and Lee have developed an O(nm) on-line method based on a Green's function approach [Sun92]. By solving an auxiliary set of equations,







the redundancies in the computation of the sensitivities over time can be removed. Zipser approached the problem in a different way and reduced the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs [Zip89]. The PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, this reduces the complexity of the RTRL algorithm to O(m). This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm and sometimes requires more PEs than the standard RTRL algorithm to converge.

Second order methods

Many researchers argue that simple gradient descent [Moz94] is not sufficiently powerful to discover the sort of relationships that exist in temporal patterns, especially those that cover long time sequences or involve high order statistics. Bengio, Frasconi, and Simard [Ben93] also present theoretical arguments for the inherent limitations of learning in recurrent networks. Many researchers have recently started using the extended Kalman filter algorithm, which is very similar to the RLS algorithm, for training dynamic neural networks. As described previously, the extended Kalman filter algorithm uses information from the correlation matrix, which is accumulated over time, to better approximate the direction to the bottom of the performance surface. Again, the problem with the extended Kalman filter algorithm is that it requires the computation and storage of the correlation matrix between the weights of the system, a computational requirement that grows with the square of the number of weights. The standard method of reducing this computational load is to decouple each PE of the network, such that the correlation matrix is only computed between weights that terminate at the same PE. Puskorius and Feldkamp [Pus94] call this method the decoupled extended Kalman filter (DEKF) algorithm.

The main difference between the dynamic version of the RLS/EKF algorithm and the static version is that the gradients that are used in the second order calculation are the dynamic gradients, not the static gradients. Thus, the BPTT or RTRL algorithms must still be used to compute these gradients.

Sequence Recognition


There are two broad categories of temporal problems which are typically addressed in the literature: sequence recognition and temporal pattern processing. Sequence recognition is typically a process of recognizing (and often reproducing) discrete symbolic sequences. These problems typically focus on recognizing grammars and symbolic patterns. Temporal pattern processing, however, involves the recognition, identification, control or other processing of a continuous signal which varies with time. Speech recognition is an example of temporal pattern recognition. The continuous signal can be vector quantized and turned into a symbolic pattern, but it is not practical to then treat it as a sequence recognition problem. Temporal patterns of interest are difficult to accurately quantize and typically have various forms of time warping and noise which make sequence recognition of quantized temporal patterns nearly impossible.

Since the emphasis of this work is not sequence recognition, we will only briefly introduce a few interesting neural networks which accomplish this task. Wang and Arbib have proposed models based on the two dominant theories of "forgetting": the decay theory, where memories decay from the time they are entered [Wan90], and the interference theory, where memory decays only when new inputs which must be remembered arrive [Wan93]. Both architectures are based on a winner-take-all field of neurons where the winning node fires and is then decremented slowly. The sequence is detected using an extra "detector unit" which is trained by the Hebbian rule using attentional learning. The main difficulty with these and other sequence recognizers is that they tend to be intolerant of time-warping and missing or noisy data, problems which are prevalent in temporal pattern recognition.

The outstar avalanche was an early neural network used to learn and generate temporal patterns [Gro82]. The outstar avalanche is composed of N sequential outstars which detect an input; each outstar triggers the next in a chain, producing an avalanche effect. This architecture was modified to include the combined effect of the input dot product and the avalanche input from preceding nodes and was called the spatio-temporal network (STN) [Fre91]. The sequential competitive avalanche field (SCAF) [Hec86] is a further extension of the STN where each node has lateral interconnections, allowing the outstars to be competitive.

Comparison of Hidden Markov Models with ANNs


Due to the difficulties in modeling sequential structure with ANNs, hidden Markov models have become the gold standard for modeling many temporal processes (e.g. speech). Time sequence matching is a major problem in applying neural nets to temporal/dynamical, non-stationary processes. Although ANNs have been successfully applied to time series prediction [Wei94], they have not been as successful in tasks that have synchronization problems such as time-warping. For example, different utterances of the same word can have very different timescales; both the overall duration and the details of timing can vary greatly. ANN models for speech have been shown to yield good performance only on short isolated speech units (e.g. phoneme detection). They have not been shown to be effective for large-scale recognition of continuous speech. The TDNN, for example, has powerful methods for dealing with local dynamic properties, but cannot deal with sequences explicitly.

The HMM provides a compact, tractable mechanism for handling this temporal information by including explicit state information. Various neural network techniques have attempted to add state information, typically via feedback, but have been successful only on modest-size applications. HMMs are stochastic in nature and thus can succeed even when the temporal nature of the system is locally very noisy. Speech patterns, for example, are to some extent a sequential process; however, they are sufficiently ambiguous locally that it is not adequate to make decisions locally and then process sequences of symbols.

Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified, i.e. the temporal information is stored in the current state of the system, and all relevant temporal information must be able to be stored in this way (there is no other memory in the system). The second assumption is that the outputs depend stochastically only on the state of the system.
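These two assumptions are exactly what make the HMM's forward recursion tractable. A minimal Python/NumPy sketch with toy numbers (hypothetical values; not from this work) computes an observation-sequence likelihood while carrying forward nothing but the current state distribution:

    import numpy as np

    A = np.array([[0.9, 0.1],   # state transition probabilities
                  [0.2, 0.8]])
    B = np.array([[0.7, 0.3],   # observation probabilities per state
                  [0.1, 0.9]])
    pi = np.array([0.5, 0.5])   # initial state distribution

    def likelihood(obs):
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            # first-order Markov step + state-conditional output
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()

    print(likelihood([0, 0, 1, 1]))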

The two main advantages of ANNs over HMMs are that ANNs are discriminative and that ANNs do not rely on the Markov assumptions. Typically HMMs are trained using a within-class method (each model is trained only on in-class, segmented data). ANNs, however, can be trained to find the differences between classes; thus they can discriminate between classes, not just detect/model classes. ANNs have few restrictions on the systems they can model. HMMs, however, assume that the observations are independent and that the underlying process being modeled is a Markov process. New methods which marry the discriminative power of the ANN with the temporal nature of the HMM have been relatively successful [Bou90].















CHAPTER 3
TEMPORAL SELF-ORGANIZATION


Introduction and Motivation


As described in the previous chapters, working with temporal patterns has been a very difficult task for neural networks. This problem is largely due to the fact that the methodologies applied to temporal processing are simple extensions of static neural networks with little regard for the unique nature of time and time-based signals. Most of these architectures simply add memory to a well-known static network; they can achieve reasonable performance on simple problems but do not perform as well on more complex ones. As in the 1980s, when pattern recognition and classification drove the research community to develop neural networks, biological systems still easily outperform state-of-the-art solutions to temporal processing problems. For this reason, I began researching biological neural networks and biological mechanisms that might help us better solve these problems.

As my research progressed, two key aspects continually resonated with my underlying goal of creating better neural networks for temporal pattern processing. These two elements are the self-organization of similar or correlated cells into clusters or neighborhoods (similar to place cells in the hippocampus), and the diffusion of information over time and space. Self-organization describes a system where each individual entity has only simple local rules regarding its behavior. These simple local rules, however, can create global organization without any global control. Self-organization applies at virtually every scale of the universe, from neurons and brain cells, to bug populations, to solar systems and galaxies. It is tremendously important in the formation of the brain and, in my opinion, is greatly underutilized in artificial neural networks.

The second element is diffusion. Like self-organization, diffusion is found everywhere. It can be derived from simple random Brownian motion (simple local rules as well), where particles and other objects move from areas of high density to areas of low density. Diffusion itself is a rather simple concept that may not appear to add much to neural network theory. However, when diffusion is added to a dynamical system (for instance, in the reaction-diffusion equations), the resulting system can exhibit some tremendously interesting and powerful dynamics.

The Model


Most temporal neural networks use short-term memory to transform time into space. This time-to-space mapping is usually the only mechanism for dealing with temporal information; the neural network operates as if the temporal pattern were simply a much larger spatial pattern. This is clearly inefficient. My method uses diffusion to create self-organization in time and space. The theory is to leave the fundamentals of the neural network the same (in order to use the theory and knowledge we have already accumulated) but to add self-organization in space-time to the PEs in the network. By creating temporally correlated neighborhoods in the field of PEs making up the network, the basic functionality of the network becomes more organized and temporally sensitive, without drastically changing its underlying operation. The mechanism for the creation of these temporally correlated neighborhoods is diffusion. In the brain, NO (nitric oxide) is given off by firing neurons and diffuses through the surrounding tissue. NO has also been shown to affect the sensitivity of a neuron to synaptic changes (e.g. weight changes in neural networks). It has been theorized that this diffusion of NO may be responsible for the creation of place cells and other organization in the brain.

In a more abstract sense, the diffusion of NO can be considered the diffusion of the neural activity. When a large group of neurons fire in close proximity (both temporally and spatially), a local build-up of NO probably occurs and diffuses throughout the brain. In my architectures, I use the concept of activity diffusion to create the temporally correlated neighborhoods. When a PE or group of PEs fire, they influence their neighbors, typically lowering their threshold such that they are more likely to fire in the near future. Because the underlying mechanism of most neural network training is Hebbian in nature, when neighboring PEs fire in a correlated fashion, they tend to continue to fire in a correlated fashion. This creates the temporally correlated neighborhoods and the self-organization in space-time.

I have applied this concept to three different ANN architectures. The first is based on the self-organizing map (SOM) and is the most biologically inspired. The second is based on the neural gas algorithm, which provides a more powerful but functionally similar solution. Lastly, to prove the robustness of the method, I applied it to the training of recurrent MLPs. MLPs are a totally different architecture and are trained in a totally different manner (supervised rather than unsupervised training). The MLP is not biologically relevant, but the temporal self-organization method still proved to decrease training times dramatically. The rest of this chapter is divided into three sections based on these architectures. It is arranged chronologically so that the presentation flows more smoothly, even though the MLP architecture may be the most useful of the three. This chapter presents only the theoretical derivation of each architecture and a simple illustrative example for each. Detailed application of each method to more practical problems will be presented in the next chapter.

Temporal Self-Organization in Unsupervised Networks


This section describes the two unsupervised networks to which I have applied the concept of temporal clustering. The first architecture is based on the self-organizing map and is called the self-organizing temporal pattern recognizer (SOTPAR). The second architecture is based on the neural gas algorithm and is called the SOTPAR2.

Temporal Activity Diffusion Through a SOM (SOTPAR)

The self-organizing temporal pattern recognizer (SOTPAR) [Eul96a][Eul96b] is a biologically inspired architecture for embedded temporal pattern recognition (finding patterns in an unbounded input sequence without segmentation or markings). This is a difficult task since the patterns must be searched for from every possible starting point. Although the SOTPAR architecture is unsupervised and thus cannot be used efficiently as a pattern recognition device by itself, it preprocesses the input such that patterns commonly found in the training data will be easily detectable from its output. Most of the emphasis in this work is on the proper temporal representation of the spatio-temporal data.






The SOTPAR architecture adds two temporal characteristics to the SOM architecture: activity diffusion through the space of output PEs, and the temporal decay of activations. Using these concepts, the SOTPAR converts and distributes the temporal information embedded in the input data into spatial connections and ordered PE firings in the network, all using self-organizing principles.

Similar to self-organizing maps, the network uses competitive learning with neighborhood functions [Koh82]. In the SOM, the input is simultaneously compared to the weights of each PE in the system, and the PE that has the closest match between the input and its stored weights is the winner. The winner and its neighbors are then trained in a Hebbian manner, which brings their weights closer to the current input.

The key concept in the SOTPAR architecture is the activity diffusion through the output space. The firing of a PE causes activity to diffuse through the network, affecting both the training and the recognition behavior of the network. In the SOTPAR, the activity diffusion moves through the lattice of an SOM structure and is modeled after the reaction-diffusion equation [Mur89]

\frac{\partial m_i(x,t)}{\partial t} = f(m_i(x,t), m_j(x,t)) + D_i \frac{\partial^2 m_i(x,t)}{\partial x^2}

where m_i can be considered the activity of PE i, f(·) can be considered the current match, and the second-derivative term is the diffusion of activity over space and time. If the system is an "excitable medium" (a multi-stable dynamical system), then the diffusion of activity can create traveling pulses or wavefronts in the system. When the activity diffusion spreads to neighboring PEs, the thresholds of those PEs are lowered, creating a situation where the neighboring PEs are more likely to fire next. I define enhancement as the amount by which a PE's threshold is lowered.

In the SOTPAR model, the local enhancement acts like a traveling wave. This significantly reduces the computation of the diffusion equations and provides a mechanism where temporally ordered inputs will trigger spatially ordered outputs. This is the key aspect of this network architecture. The traveling wave decays over time because of competition for limited resources with other traveling waves. It can only remain strong if spatially neighboring PEs are triggered by temporally ordered inputs, in which case the traveling waves are reinforced. In a simple one-dimensional case, Figure 3-1 shows the enhancement for a sequence of spatially ordered winners (winners in order were PE1, PE2, PE3, PE4) and for a sequence of random winners (winners in order were PE4, PE2, PE1, PE5), which would be the case if the input was noise or unknown.



Figure 3-1: Temporal activity in the SOTPAR network. a) activity created by temporally ordered input; b) activity created by unordered input


In the ordered case, the enhancement lowers the threshold for PE5 dramatically more than for the other PEs, making PE5 likely to win the next competition. In the unordered case, the enhancement becomes weak and affects all PEs roughly evenly.

The second temporal functionality added to the SOM is the decay of output activation over time, which is also biologically realistic [Cha93]. When a PE fires or becomes active, it maintains an exponentially decaying portion of its activity after it fires. Because the PE decays gradually, the wavefront it creates is spread out over time rather than being a simple traveling impulse. This spreading creates a more robust architecture that can gracefully handle both time-warping and missing or noisy data. The decay of the activity also suggests another biological explanation for the movement of the enhancement throughout the network: if we define a neighborhood around a neuron as one where it has strong excitatory connections with its neighbors, then the decay of activity from a neuron which fired in the past will help to fire (or lower the threshold of) its neighboring PEs.

Algorithm description

To simplify the description of the algorithm, I will use 1D maps and let the activity propagate in only one direction, since the diffusion of the activity is severely restricted in the one-dimensional case. Thus, the output space can be considered a set of PEs connected by a string, where information is passed between PEs along this string. The activity/enhancement moves in the direction of increasing PE number and decays at each step. An implementation of the activity diffusion in one string is shown in Figure 3-2; it includes the activity decay at each PE and the activity movement through the net in the left-to-right direction. The factors µ and (1-µ) are used to normalize the total activity in the network.


Figure 3-2: Model for activity diffusion in one string of the SOTPAR



This activity diffusion mechanism serves to store the temporal information in the network. During training, the PEs will be spatially ordered to sequentially follow any temporal sequences presented. At each iteration, the activity of the network is determined by calculating the distance between the input and the weights of each PE and allowing for membrane potential decay:

\text{act}(t,x) = \text{act}(t-1,x) \, (1-\mu) + \text{dist}(\text{inp}(t), w_x) \, \mu

where act(t,x) represents the activity at PE x at time t, and dist(inp(t), w_x) represents the distance between the input at time t and the weights of PE x. Typically the activity is thresholded and enhanced before being propagated, for example

\text{act}' = \max(\text{act} - 0.5, \, 0) \times 2

Next, the winning PE is selected by

\text{winner} = \arg\max_x \left( \text{act}(t,x) + \beta \cdot \text{enhancement}(t,x) \right)







where the enhancement is the activity being propagated from the left. The parameter β is the spatio-temporal parameter that determines the amount by which a temporal wavefront can lower the threshold for PE firing. By increasing β you can lower the threshold of neighboring PEs to the point where the next winner is almost guaranteed to be a neighbor of the current winner, forcing the input patterns to be sequential in the output map. It is interesting to note that as β→0 the system operates like a standard SOM, and as β→∞ the system operates like an avalanche network [Gro82].

Once the winner is selected, it is trained along with its neighbors in a Hebbian manner with normalization as follows:

w_x = w_x + \eta \cdot \text{neigh}(x) \cdot (\text{inp}(t) - w_x)

where the neighborhood function, neigh(x), defines the closeness to the winner (typically a Gaussian function), and η is the learning rate. In our current implementation, the spatio-temporal parameter, the learning rate, and the neighborhood size are all annealed for better convergence.
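Putting the pieces together, one SOTPAR iteration can be sketched as follows (Python/NumPy, hypothetical names and a simple similarity measure; a sketch under those assumptions, not the dissertation's code). The enhancement propagation follows the left-to-right model of Figure 3-2:

    import numpy as np

    def sotpar_step(w, act, enh, x, mu=0.5, beta=0.3, eta=0.05, sigma=1.0):
        # similarity between the input and each PE's weights (one simple choice)
        match = np.maximum(1.0 - np.linalg.norm(w - x, axis=1), 0.0)
        act = act * (1 - mu) + match * mu            # membrane potential decay
        act_p = np.maximum(act - 0.5, 0.0) * 2.0     # thresholded activity act'
        winner = np.argmax(act + beta * enh)         # enhancement lowers thresholds
        # train the winner and its neighbors (annealed in practice)
        idx = np.arange(len(w))
        neigh = np.exp(-((idx - winner) ** 2) / (2 * sigma**2))
        w += eta * neigh[:, None] * (x - w)
        # wave moves one node to the right: E(n,t) = E(n-1,t-1)*mu + A(n,t)
        enh = np.concatenate([[0.0], enh[:-1]]) * mu + act_p
        return w, act, enh, winner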

Representation of memory

The activity diffusion in this network creates a unique spatio-temporal memory that stores and distributes the temporal information in the network itself. Most short-term memory structures can be described by convolving the input sequence with a kernel that describes the structure of the memory. This kernel is typically one-dimensional and describes the temporal features of the memory, i.e. the depth of the memory. The SOTPAR's memory is implemented in its "enhancement" which moves through time and







space. Thus, the SOTPAR memory kernel is spatio-temporal and must be described in at least two dimensions.

There are two slightly different ways to implement the temporal enhancement in a 1D SOTPAR. The difference lies in the decaying exponential portion. In method 1, only the activity at each node is decayed; the wavefronts do not contribute to the time-dependent behavior of each node. The equations for this system are:

E(n,t) = E(n-1, t-1) \, \mu + A(n,t)
A(n,t) = A(n, t-1) \, (1-\mu) + \text{In}(n,t)

where E(n,t) is the enhancement at node n at time t, A(n,t) is the activity at node n at time t, and In(n,t) is the matching result between the input and the weights of node n at time t. Expanding these equations gives the following result:

E(n,t) = E(n-1,t-1) \, \mu + A(n,t-1) \, (1-\mu) + \text{In}(n,t)
= \left( E(n-2,t-2) \, \mu + A(n-1,t-1) \right) \mu + A(n,t-2) \, (1-\mu)^2 + \text{In}(n,t-1) \, (1-\mu) + \text{In}(n,t)
\;\vdots
= \sum_{k \ge 0} \sum_{\tau \ge 0} \text{In}(n-k, \, t-k-\tau) \, \mu^{k} \, (1-\mu)^{\tau}


This equation shows how the results of the matching activity (which is called "input", for lack of a better word) contribute to the enhancement. The traveling waves create two decaying exponentials, one which moves through space (µ^k) and one which moves through time ((1-µ)^τ). The past history of the node is added to the enhancement via the recursive self-loop in (1-µ). The wavefront motion is added to the enhancement via the diagonal movement through the left-to-right channel scaled by µ. The farther the node is off the diagonal and the farther back in time, the less influence it has on the enhancement.

The SOTPAR enhancement equation is similar to the gamma memory impulse response for tap n:

g_n(t) = \binom{t-1}{n-1} \, \mu^{n} \, (1-\mu)^{t-n} \, U(t-n)


By a variable substitution, the τ can be replaced with t-n in the SOTPAR equation, making the two equations even more similar. The SOTPAR enhancement, however, is not an impulse response equation. The SOTPAR allows input at each element of the memory structure, unlike the gamma memory, which is a generalized tapped delay line; thus the input at different times and spatial locations is required to describe the enhancement (i.e. an impulse response does not represent the desired information). In summary, the SOTPAR enhancement is a spatially distributed gamma memory with inputs at each tap.

The second method for implementing the enhancement is to allow the enhancement to also pass through the self-feedback at each node. This allows an input to add to the enhancement multiple times by following different paths in the network. For example, In(n-1, t-2) can reach E2(n,t) either by looping first at node n-1 and then moving to position (n,t), or by first moving to position (n,t-1) and then looping at node n until (n,t). The equation for the enhancement in this case is:







E_2(n,t) = E_2(n-1,t-1) \, \mu + E_2(n,t-1) \, (1-\mu) + \text{In}(n,t)
= \left( E_2(n-2,t-2) \, \mu + E_2(n-1,t-2) \, (1-\mu) + \text{In}(n-1,t-1) \right) \mu
\quad + \left( E_2(n-1,t-2) \, \mu + E_2(n,t-2) \, (1-\mu) + \text{In}(n,t-1) \right) (1-\mu) + \text{In}(n,t)
\;\vdots
= \sum_{k \ge 0} \sum_{\tau \ge 0} \text{In}(n-k, \, t-k-\tau) \, \mu^{k} \, (1-\mu)^{\tau} \, (\tau+1)

This method of enhancement increases the contribution of the off-diagonal elements via the term (τ+1) and allows more flexibility in non-sequential node firings.
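The difference between the two recursions is easiest to see numerically. The sketch below (Python/NumPy, hypothetical names; not code from this work) fills in both enhancement surfaces for a single input impulse In(0,0) = 1, producing the kind of surfaces plotted in Figures 3-3 and 3-4:

    import numpy as np

    N, T, mu = 6, 6, 0.5
    In = np.zeros((N, T)); In[0, 0] = 1.0
    A = np.zeros((N, T)); E1 = np.zeros((N, T)); E2 = np.zeros((N, T))
    for t in range(T):
        for n in range(N):
            a_prev = A[n, t-1] if t > 0 else 0.0
            A[n, t] = a_prev * (1 - mu) + In[n, t]          # node activity
            diag1 = E1[n-1, t-1] if (n > 0 and t > 0) else 0.0
            E1[n, t] = diag1 * mu + A[n, t]                 # method 1
            diag2 = E2[n-1, t-1] if (n > 0 and t > 0) else 0.0
            self2 = E2[n, t-1] if t > 0 else 0.0
            E2[n, t] = diag2 * mu + self2 * (1 - mu) + In[n, t]  # method 2

Plotting E1, E2, and E2 - E1 over the (node, time) plane reproduces the increased off-diagonal contribution of method 2.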

The two enhancement techniques are shown for two values of µ in Figure 3-3 and Figure 3-4. Both figures show enhancement method 1, enhancement method 2, and the difference between method 2 and method 1, which shows the increased influence of the off-diagonal elements.

These two figures also illustrate the effect of µ on the enhancement. With µ = 1, the time decay at each node is disconnected and the enhancement moves only from node to node. With µ = 0, the spatial movement of the enhancement is disconnected and only

[Figure 3-3 - Enhancement in the network with μ = 0.5. Panels: Enhancement 1, Enhancement 2, and Enh 2 - Enh 1, each plotted against Node and Time.]







Lower values of $\mu$ create a broader enhancement, while higher values of $\mu$ create narrower enhancement waves in which almost all of the activity moves from one node to the next (down the diagonal of time and space). This can be seen in the figures as a much sharper contribution to the enhancement along the diagonal as $\mu$ moves from 0.5 to 0.75.

[Figure 3-4 - Enhancement in the network with μ = 0.75. Panels: Enhancement 1, Enhancement 2, and Enh 2 - Enh 1, each plotted against Node and Time.]



Another possible approach is to decouple the two exponentials $\mu$ and $(1-\mu)$. This would require external normalization to keep the enhancement from growing without bound, but would provide more flexibility.

A simple illustrative example

A simple, descriptive test case involves an input composed of two-dimensional vectors randomly distributed between 0 and 1. Embedded in the input are 20 'L'-shaped sequences located in the upper right-hand corner of the input space (from [0.5,1.0] → [0.5,0.5] → [1.0,0.5]). Uniform noise between -0.05 and 0.05 was added to the target sequences.







When a standard 1D SOM maps this input space, it maps the PEs without regard to temporal order; it simply needs to cover the 2D input space with its 1D structure. To show how this happens, we plot an 'X' at the position in the input space represented by the weights of each PE (remember, the weights of each PE are the center point of the Voronoi region that contains the inputs that trigger that PE). Since the neighborhood relationship between PEs is important, we connect neighboring PEs with a line. In a 1D SOM, the result is a "string" of PEs, and this string is stretched and manipulated by the training algorithm so that the entire input space is mapped with minimum distortion error while the neighborhood relationships are maintained (e.g. the string cannot be broken). The orientation of the output is not important, as long as it covers the input with minimal residual energy. A typical example is shown on the left side of Figure 3-5. Note the slightly higher density of the input in the 'L'-shaped region.

When the SOTPAR temporal activity is added to the SOM, the mapping has the additional constraint that temporal neighbors (sequential winners) should fire


[Figure panels: 1D Kohonen mapping without Temporal Enhancement (left) and 1D Kohonen mapping with Temporal Enhancement (right), each over the unit square.]

Figure 3-5 - One-dimensional mapping of a two-dimensional input space, both with and without spatio-temporal coupling







sequentially. Thus, the string should not only cover the input space, but also follow prevalent temporal patterns found in the input. This is shown on the right side of Figure 3-5. Notice in the figure that sequential nodes have aligned themselves to cover the L shaped temporal patterns found in the input.

Although not the main goal in creating the spatio-temporal SOM, recall is possible after the first few samples of the sequence have been input to the network. The rest of the pattern can be determined by following the sequence of nodes in the SOM, although the length of the sequence is not readily determined by the map.

With a single string, the network can be trained to represent a single pattern or multiple patterns. Multiple patterns, however, require the string to be long. A long string may be difficult to train properly since it must weave its way through the input space, moving from the end of one pattern to the beginning of the next. Additional flexibility can be added by breaking up the large string into several smaller strings. Multiple strings can be considered a 2D array of output nodes with a 1D neighborhood function. This allows the network to follow either multiple trajectories or long, complicated trajectories in a simplified manner.

Figure 3-6 shows an example of the storage of two temporal patterns with two strings. The left plot shows the input space that consists of two-dimensional input vectors. Two 8-point temporal patterns (diagonal lines: bottom-left to top-right and bottom-right to top-left) are intermixed with random noise in the input. The diagonal lines are drawn in for clarity. Between each pattern, there is random noise. This problem can be thought of as a motion detection problem across a visual topographic map. A number of strings could be trained to detect motion in a variety of directions and







orientations. On the left side of Figure 3-6, the trained strings are shown as sequences of 8 PEs represented as 'X's (the 'O' PE denotes the beginning of the string), with neighboring PEs connected by lines. As one can see from this figure, the memory structure was able to extract the predominant temporal features of the input data.

The right side of Figure 3-6 shows a graphical representation of the sequence of winning PEs after training. The horizontal axis is time, and the vertical axis is the number of the winning PE. The input signal is labeled along the top of the plot. This plot clearly shows that the patterns cause the network to respond with sequential PE firings (smooth diagonal lines), whereas the random noise between patterns causes random output firings. Notice also that the temporal information is crucially important in the training of the memory, especially at the center of the figure where the next point could be in one of two possible directions. This ambiguity is responsible for the misalignment of the PEs near the center of the input space.

[Figure 3-6 - The storage of two temporal patterns in a memory network. Left panel: input space and output mapping (X1 on the horizontal axis). Right panel: winning PE number versus time, with the input labeled Seq1 | noise | Seq2 | noise | Seq1 along the top.]







Figure 3-7 shows an example of how the network gracefully handles time warping. In this example, the input was as in the previous example except that the target sequences were warped to lengths 6, 8, and 10. The network mapped two 6-PE strings to the diagonal targets, as shown in the left side of the figure. The right side of the figure shows the winning nodes for the three different sequence sizes: the first two are length 6, the second two are length 8, and the last one is length 10. The strings stretch to cover the entire pattern, and certain PEs fire more than once for a longer sequence, thus extending the time that can be covered by the string. In general, if the network is trained with time-warped data, it will tend to represent the target trajectories with the minimum number of nodes (shortest pattern). The network will still respond to longer patterns by having certain nodes win multiple times.

[Figure 3-7 - Time warping: diagonal targets covered by smaller sequences. Left panel: input space and output mapping. Right panel: winning nodes versus time, with the input labeled S1 | n | S2 | n | S1 | n | S2 | n | S1 along the top.]



The left and middle plots of Figure 3-8 show the traveling activity over time and space for the above example. The left side shows the activity for string 1 and the center shows the activity for string 2. These two plots clearly show how the traveling activity







builds up and reinforces the sequential firing of the output PEs (i.e. when a target sequence is presented, the activity builds up and moves along the string). The right side of Figure 3-8 shows the maximum traveling activity for string 1 (solid) and string 2 (dashed). This plot shows how, in a simple system, a simple threshold on the traveling activity could be used to detect the target sequence.

[Figure 3-8: Plots of enhancement over time for string 1 and string 2, and the maximum enhancement over time for both strings.]



SOTPAR summary

The SOTPAR methodology creates an array of PEs that self-organizes in space-time with the help of temporal information. The system is trained in an unsupervised manner and self-organizes so that sequences seen during training are mapped into unique spatially sequential firings of the PEs at the output. The output space is similar to a topographic map except that it maps both the temporal and spatial information. The network embeds the temporal and input data into one output space with both temporal and spatial locality. Instead of the standard time-to-space mapping produced by most short-term memories, the SOTPAR produces a time-to-"time and space" mapping. The representation is distributed throughout the self-organizing network and is stored not only







in the activations of the PEs but also in the connectivity and weights of the PEs. It is a radical departure from typical neural network architectures with memory, but is actually more biologically plausible.

The SOTPAR is a unique combination of short-term and long-term memory. It contains short-term memory because the activations of the network can be used to represent a general input sequence. The interesting part of the SOTPAR, however, is that it contains attributes of a long-term memory. It stores commonly found input patterns in the network weights and produces enhanced responses to these temporal inputs. The known sequences produce an ordered response in a specific area of the output space. This is a discriminant mapping because only known sequences produce an ordered response. The sequential firing facilitates the recognition of temporal patterns by subsequent processing layers. It can also gracefully handle time warping.

Temporal Activity Diffusion In the Neural Gas Algorithm (SOTPAR2)

The SOTPAR2 network was developed to overcome a few difficulties with the original SOTPAR network. The main difficulty with the SOTPAR is the SOM map that it is built upon. The SOM's neighborhood lattice structure restricts both the movement of a trajectory through the output space of the network (e.g. the distance between successive inputs) and also limits the number of neighbors for each PE. For these reasons the neural gas algorithm is used as the basis for the SOTPAR2 architecture.

The neural gas algorithm is similar to the SOM algorithm without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distance to







the reference vectors, not on the distance to the winning PE in the lattice. Since the neural gas algorithm has no predefined structure, each PE acts relatively independently. This is how the algorithm derived its name: each PE is like a molecule of gas, and together they spread to evenly cover the desired space. Since there is no predefined structure for the activity diffusion to move through, there is the flexibility to create a diffusion structure that can be trained to best fit the input data. The SOTPAR2 diffuses activity through a secondary connection matrix that is trained with temporal Hebbian learning. This flexible structure decouples much of the spatial component from the temporal component in the network. In the SOTPAR, two nodes that were neighbors in time also needed to be relatively close in space for the system to train properly (since time and space were coupled). This is no longer a restriction in the SOTPAR2. The result is still a space-time mapping, but now the coupling between space and time is directly controllable.

The most interesting concept that falls out of this structure is the ability of the network to focus on temporal correlations. Temporal correlation can be thought of as the simple concept of anticipation. The human brain uses information from the past to enhance the recognition of "expected" patterns. For instance, during a conversation a listener uses past context to determine what they expect to hear next. This methodology can greatly improve the recognition of noisy input signals such as slurred or mispronounced speech.

SOTPAR2 - algorithm details

Based on previous experience (training), the SOTPAR2 algorithm uses temporal information to lower the threshold of PEs that are likely to fire next. The standard neural







gas network is augmented with a connection matrix that is trained using temporal Hebbian learning. These secondary weights are similar to the transition probabilities in hidden Markov models (HMMs) and are the pathways used to diffuse the temporal information. As in the SOTPAR, the temporal activity diffusion is used to alter the selection of the winning PE and affects both the training and the operation of the network.

The SOTPAR2 algorithm works as follows. First, the distance ($d_i$) from the input to each PE is calculated. The temporal activity in the network is similar to the SOTPAR diffusive wavefronts except that the wavefronts are scaled by the connection strengths between PEs. Thus, the temporal activity diffuses through the space defined by the connection matrix as follows:

$$a_i(t+1) = \alpha\,a_i(t) + \frac{\sum_j \big(\mu\,f(d,k)_j + (1-\mu)\,a_j(t)\big)\,p_{ji}}{\max(p)}$$

where $a_i(t)$ is the activity at PE i at time t, $\alpha$ is a decay constant less than 1, $p_{ji}$ is the connection strength from PE j to PE i, d is the vector of distances from the input to each PE, $\mu$ is the parameter that smoothes the activity, giving more or less importance to the past activity in the network, and $\max(p)$ normalizes the connection strengths. The function f(d,k) determines how the current match (the distances) of the network contributes to the activity. At present, the implementation handles the case where f(d,k) is simply a $\delta$ function (and the summation is removed), such that only the activity from the past winner is propagated. This is similar to a Markov model, where all temporal information is stored in the state itself. Unlike the Markov model, however, the previous winners affect the output activity of the current winner. Therefore, a previous winner that







has followed a "known" path through the network will have higher activity and thus will have more influence on the next selection.
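As a concrete illustration, here is one plausible NumPy reading of the δ-function case just described; since only the prose definitions survive cleanly, the exact placement of μ and α is an assumption:

```python
import numpy as np

def diffuse_activity(a, p, winner, match, mu, alpha):
    """One SOTPAR2 activity-diffusion step for the delta-function case,
    where only the previous winner propagates activity (a sketch; the
    blend of mu and alpha is reconstructed from the prose, not certain).

    a      : (N,) activities a_i(t)
    p      : (N, N) connection strengths, p[j, i] from PE j to PE i
    winner : index of the previous winning PE
    match  : f(d, k) evaluated at the previous winner (its match quality)
    """
    a_new = alpha * a  # each PE's own activity decays
    # only the previous winner's term survives the delta function:
    a_new += (mu * match + (1 - mu) * a[winner]) * p[winner] / p.max()
    return a_new
```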

In the general case of the activity equation, the temporal activity at each PE is affected by contributions from all other PEs. In this case the function f(d,k) is typically an enhanced/sharpened version of the output and the summation runs over all PEs. This allows all the activity in the network to influence the current selection. It makes the network more robust, since the wavefronts will continue to propagate (but will decay rapidly) even if the selected winner temporarily transitions to an unlikely path.

The next step of the SOTPAR2 algorithm is to modify the output (competition selection criteria) of each PE by the temporal activity in the network via the following equation:

$$out_i = d_i - \beta\,a_i$$

where $\beta$ is the spatio-temporal parameter that determines how much the temporal information affects the selection of the winner. This parameter should be set based upon the expected magnitude of the noise present in the system. For example, if the data is normalized to [0,1], then a setting of $\beta = 0.1$ allows the network to select a new winner that is at most a distance of 0.1 farther away than the PE closest to the spatial input.

To adjust the weights, we use the standard neural gas update, which is simply competitive learning with a neighborhood function based on an ordering of the temporally modified distances to the input:

$$\Delta w_i = \eta\,h_\lambda\big(k_i(out)\big)\,(in - w_i)$$







where $\eta$ is the learning rate (step size), $h_\lambda(\cdot)$ is an exponential neighborhood with the parameter $\lambda$ defining the width of the exponential, and $k_i(out)$ is the ranking of PE i based on its modified distance from the input.
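A compact sketch of the winner selection and weight update just described; the specific exponential rank neighborhood $h(k) = e^{-k/\lambda}$ is an assumption consistent with the standard neural gas rule:

```python
import numpy as np

def sotpar2_step(w, x, a, beta, eta, lam):
    """Winner selection and neural gas weight update driven by the
    temporally modified distances (an illustrative sketch).

    w : (N, D) reference vectors, x : (D,) input, a : (N,) activities.
    """
    d = np.linalg.norm(w - x, axis=1)    # spatial distances d_i
    out = d - beta * a                   # out_i = d_i - beta * a_i
    rank = np.argsort(np.argsort(out))   # k_i(out): 0 for the winner
    h = np.exp(-rank / lam)              # exponential neighborhood
    w += eta * h[:, None] * (x - w)      # soft-max competitive update
    return w, int(np.argmin(out))        # weights and the new winner
```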

The connection strengths are trained using temporal Hebbian learning with normalization. Temporal Hebbian learning is Hebbian learning applied over time, such that PEs that fire sequentially enhance their connection strength. The rationalization for this rule is that PEs remain active for a period of time after they fire, so both the current and the previous winners are active at the same time. In the current implementation, the connection strengths are updated similarly to the conscience algorithm for competitive learning:


$$p_{\mathrm{argmin}(out(t-1)),\;\mathrm{argmin}(out(t))} \;\leftarrow\; p_{\mathrm{argmin}(out(t-1)),\;\mathrm{argmin}(out(t))} + b$$
The strength of the connection between the last winner and the present winner is increased by a small constant b and all connections are decreased by a fraction that maintains constant energy across the set of connections. Another possibility for normalization would be to normalize all connections leaving each PE. This method gives poorer performance if a PE is shared between two portions of a trajectory since the connection strength would have to be shared between the two outbound PEs. It does, however, give an interpretation of the connection strengths as probabilities and points out the similarity between the SOTPAR2 and the HMM.
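The conscience-style update might be sketched as follows; treating the conserved "energy" as the sum of all connection strengths is an assumption, since the original does not spell out the energy measure:

```python
def hebbian_connection_update(p, prev_winner, winner, b):
    """Temporal Hebbian update with constant-energy normalization (a
    sketch; p is assumed to start with a nonzero total strength)."""
    total = p.sum()                  # "energy" before the update
    p[prev_winner, winner] += b      # reinforce last-winner -> winner
    p *= total / (total + b)         # shrink everything so the total
    return p                         # energy stays constant
```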

The parameters $\eta$ and $\lambda$ are annealed exponentially as in the neural gas algorithm, while $\beta$ takes the form of an offset sine wave. This allows the initial phases of learning to







proceed without interference so that the PEs start out with an even distribution across the input space. Then the temporal enhancement reaches a peak and slowly declines for fine tuning at the end of learning.
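For illustration, one possible set of schedules with the shapes described above; all constants, and the use of $\sin(\pi n / n_{max})$ for the "offset sine," are assumptions rather than the dissertation's values:

```python
import numpy as np

def schedules(n, n_max, eta0=0.5, eta_f=0.005, lam0=10.0, lam_f=0.5,
              beta_max=0.2):
    """Illustrative annealing: eta and lambda decay exponentially as in
    neural gas, while beta rises from zero, peaks mid-training, and
    declines for fine tuning at the end."""
    frac = n / n_max
    eta = eta0 * (eta_f / eta0) ** frac     # exponential annealing
    lam = lam0 * (lam_f / lam0) ** frac
    beta = beta_max * np.sin(np.pi * frac)  # zero at start and end
    return eta, lam, beta
```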

Operation of the SOTPAR2 network

I will use an artificial example to illustrate the features of the SOTPAR2. The input for this example is 15 pairs of noisy 8-point diagonal lines from (0,0) → (1,1) and from (1,0) → (0,1). The diagonal lines have uniform noise (±0.15 in both dimensions) added to each point (notice that the distance between each noise-free point of the diagonal lines is only 0.14). Uniform noise over [0,1] is interspersed between the diagonal lines (16 points between each line, so there is twice as much noise as signal). The network extracts the temporal information from the diagonal lines without supervision, segmentation, or labeling. A 30-PE network was trained with and without temporal enhancement (200 iterations through the data set), and the resulting PE locations are shown in Figure 3-9 and Figure 3-10 with the diagonal lines superimposed on the figures.

Notice that the temporal enhancement during training has slightly modified the positions of the PEs. The network trained with temporal enhancement has its PEs placed more consistently near the centers of the points along the diagonal lines (in particular, look at the line segment in the lower-right). The temporal training provides a portion of the improvement made by the SOTPAR2 algorithm, but the static comparison of the network is not dramatically different.






[Figure panel: Training WITH Temporal Enhancement; numbered PE locations plotted in the unit square.]
Figure 3-9: Reference vector locations after training with enhancement



[Figure panel: Training WITHOUT Temporal Enhancement; numbered PE locations plotted in the unit square.]
Figure 3-10: Reference vector locations after training without enhancement



During operation, the trained weights and information from the past create temporal wavefronts in the network that allow plasticity during recognition. This temporal activity is mixed with the standard spatial activity (the distance from the input to the weights) via $\beta$, the spatio-temporal parameter. Two identical inputs may fire different PEs depending on the temporal past of the signal. Figure 3-11 shows the Voronoi diagrams







for the SOTPAR2 network with two different temporal histories. Voronoi diagrams graphically describe the region of the input space that fires each PE. In these particular diagrams, the number in each Voronoi region represents the PE number for that region and is located at the center of the static Voronoi region. Remember that the center is the same as the weights of the PE. These diagrams show the regions of the input space that will fire each PE in the network. The left side of Figure 3-11 shows the Voronoi diagram during a presentation of random noise to the network. Since this input pattern was unlikely to have been seen in the training input, temporal wavefronts were not created and the Voronoi diagram is very similar to the static Voronoi diagram. The right side of Figure 3-11 shows the Voronoi diagram during the presentation of the bottom-left to top-right diagonal line. The temporal wavefront grew to an amplitude of 0.5 by the time PE 18 fired. Also, from the training of the network, the connection strength between PE 18 and PE 27 was large compared to the other PEs. Thus, the temporal wavefront flowed preferentially to PE 27, enhancing its chances of winning the next competition.

[Figure 3-11: Voronoi diagrams without and with enhancement. Left panel: previous winners 23, 18, 10, 29, β = 0.2. Right panel: previous winners 20, 26, 14, 18, β = 0.2.]







Notice how large region 27 is in the right side of Figure 3-11, since it is the next expected winner. This plasticity seems similar to the way humans recognize temporal patterns (e.g. speech). Notice that the network uses temporal information and its previous training to "anticipate" the next input. The anticipated result is much more likely to be detected since the network is expecting to see it. It is important to point out how dramatically different the static and dynamic conditions are. In the dynamic SOTPAR2 the centroids (reference vectors) are not as important; the temporal information changes the entire character of the vector quantization, creating data-dependent Voronoi regions. An animation can demonstrate the operation of the SOTPAR2 Voronoi regions much better than static figures.

Next I created a new set of 14 noisy diagonal lines to be run through the network as a test set. Each noisy line was passed through both a standard neural gas vector quantization network and a SOTPAR2 VQ network. The results will be analyzed using the 5th point in the bottom-left to top-right diagonal line. Figure 3-12 shows the locations of this point in each of the 14 noisy diagonal lines, along with the neural gas Voronoi diagram.
[Figure 3-12: Voronoi diagram without enhancement (non-enhanced Voronoi diagram with the 14 test points marked). VQ outputs were [12,12,16,16,25,25,25,25,27,27,27,27,27,27].]







Notice that the static vector quantization cannot consistently quantize this 5th point to the same Voronoi region; in fact, this point falls into four different regions. The SOTPAR2 network, however, was able to quantize every one of the 5th points into the same region. Figure 3-13 shows why: it shows a typical Voronoi diagram for the trained SOTPAR2 network after the input of the first four points of a single noisy diagonal line. The location of the 5th point in each of the 14 noisy diagonal lines was again plotted. Notice that now all 14 points fall into the correct Voronoi region. Remember that each particular input sequence will create a different Voronoi diagram, but Figure 3-13 illustrates the mechanism for the SOTPAR2's improved vector quantization. The temporal plasticity has increased the size of the anticipated next region and reduced the variability of the SOTPAR2 vector quantization.

[Figure 3-13: Voronoi diagram and VQ with enhancement (Voronoi diagram with node 18 as the previous winner; the 14 test points marked). VQ outputs were [27,27,27,27,27,27,27,27,27,27,27,27,27,27].]



Next I ran the new noisy diagonal lines through the network and histogrammed the VQ outputs for each point of the two lines. Figure 3-14 shows the results with the







point number along the horizontal axis and the node number along the vertical axis. The number of firings for each node is indicated by the shading: white is high and gray is low. The left-to-right diagonal line occupies the first 8 points of the horizontal axis and the right-to-left diagonal line the second 8 points. Notice how much cleaner the temporally enhanced VQ output is than the standard neural gas VQ.

[Figure 3-14: Histograms of the number of firings for each PE (bright = more) for the networks with and without enhancement. Panels: Winning Nodes WITH enhancement, Winning Nodes WITHOUT enhancement; node number versus time.]



Figure 3-15 shows a specific example of the VQ output of the two networks and illustrates how the SOTPAR2 uses temporal information to remove noise from the input. The input is a noisy diagonal line from bottom-right to top-left (solid line). The SOTPAR2 output is the short dashed line, and the static VQ output is the long dashed line. Notice how much closer the temporal VQ output is to the diagonal than the noisy input or the output of the static VQ.







[Figure 3-15: The SOTPAR2 VQ (dotted) is closer to the noise-free signal than the original input (solid) or the neural gas VQ (dashed).]

SOTPAR2 summary

The SOTPAR2 algorithm uses temporal plasticity induced by the diffusion of activity through time and space. The SOTPAR2 algorithm is a temporal version of the neural gas algorithm that uses activity diffusion to couple space and time into a single set of dynamics that can help disambiguate the static spatial information with temporal information. This creates time-varying Voronoi diagrams based on the past of the input signal. This dynamic vector quantization helps reduce the variability inherent in the input by anticipating (based on training) future inputs.


Temporal Self-Organization for Training Supervised Networks


This section shows how the concepts of temporally trained clustering can help speed up the training of supervised neural networks. In particular, we have applied it to recurrent neural network training. Recurrent neural networks are more powerful than feedforward neural networks, but their training is very difficult and time-consuming.







Supervised neural networks are typically trained with gradient descent learning, which provides a more mathematically sound foundation than the unsupervised networks have. This allows for a goal-driven approach with mathematical derivations of the concepts. The goal of this architecture is to temporally organize the training of a recurrent neural network. A mathematical analysis will derive a principle very similar to that used in the neural gas network: temporal correlation can be used to train PEs to form temporal neighborhoods.

Using Temporal Neighborhoods in RTRL

In the past, static neural networks and feedforward networks with memory (TDNN, etc.) have been the workhorses of the neural network world. Recently, recurrent neural networks have been getting more attention, especially when applied to dynamical modeling, system identification, and control. The main difficulty in training recurrent neural networks is that the gradient is a function of time. The gradient at the current time depends not only on the current input, output, and desired signal, but also on all the values in the past.

As discussed in Chapter 2, there are two fundamental methods of computing the gradient for a recurrent neural network. First, the gradients can be computed in the backward direction, similar to the static backpropagation techniques from feedforward networks. This is called backpropagation through time (BPTT) [Rum86]. The main shortcoming of this technique is that it is non-causal. The second fundamental method computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node






with respect to each weight at every iteration. The method is completely on-line and simple to implement. The main difficulty with the RTRL method is its computational complexity. If n is the number of PEs in a fully recurrent network, then the computation of the gradients of each PE with respect to each weight is O(n⁴). This algorithm can only be used for small networks.

Many methods have been proposed to increase the speed of RTRL. Zipser's approach [Zip89][Zip90] will be used here because it lends itself to our techniques. Zipser approached the problem of reducing the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs. The PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, this reduces the complexity of the RTRL algorithm to O(n²). This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm. For example, it will sometimes require more PEs than the standard RTRL algorithm to converge.

Our methodology extends Zipser's technique by allowing the subgroups to change dynamically during learning. The dynamic subgroups are created by using an unsupervised temporal clustering very similar to that used in the SOTPAR and SOTPAR2. A derivation of a first-order approximation to the full sensitivity matrix shows that temporal correlation (temporal Hebbian learning) can be used to determine which nodes should be in each group. This method has the same computational complexity as Zipser's, but trains better and more consistently.







Review of RTRL and Zipser's Technique

The computational complexity of the RTRL algorithm is dominated by the need to update a large array of sensitivities at each step of the algorithm. For a network with n nodes and m weights, the sensitivity matrix has O(nm) elements, each requiring O(n) computations, giving O(n²m) calculations per step. For a fully recurrent network, this dominates the computational complexity and requires O(n⁴) computations per step. The algorithm works quite well on small networks, but the n⁴ factor becomes overwhelming as the number of nodes increases.

The value of n in the O(n²m) expression is the number of recurrently connected units. Zipser's algorithm reduces this value by creating subgroups of nodes where sensitivity information is only passed between nodes of the same subgroup. All connections still exist in the forward dynamical system; the subgroups only affect the training of the network. Connections between subnets are treated as inputs. If g is the number of subgroups, then the speed-up of the sensitivity calculations is approximately g². For instance, dividing a network into two subsets (g = 2) gives a 4-fold speed-up in computing the sensitivities. If the size of the subnets remains constant as the size of the network is increased (increasing the number of subnets), then the complexity of the RTRL algorithm is reduced from O(n²m) to O(m).
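A small sketch of the subgrouping idea: a Boolean mask that zeroes out sensitivity propagation between groups (the equal-size partition and names are assumptions for illustration):

```python
import numpy as np

def subgroup_mask(n, g):
    """Boolean mask M[k, j] that is True only when PEs k and j share a
    subgroup; sensitivities are propagated only where the mask is True
    (a sketch of Zipser's scheme with n PEs split into g equal groups)."""
    groups = np.arange(n) * g // n            # group id of each PE
    return groups[:, None] == groups[None, :]
```

With g groups only about a 1/g² fraction of the mask is True, which is where the approximate g² speed-up in the sensitivity calculations comes from.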

The performance gains are substantial, but the question is whether the algorithm can train networks as well as the full RTRL. One might think that the subgrouping will limit the capabilities of the network to share nodes. This is not the case, however, since the activations of the network are unchanged -- it is still fully recurrent except in the







training methodology. Even though the error propagation is limited to the subnets, all units have access to the activities of all other units, just not to all of their sensitivities. Zipser's empirical tests indicate that subgrouped networks can solve many of the same problems, but for certain applications, networks trained with subgrouped RTRL require more PEs than when they are trained with full RTRL. In my experience, the subgrouping algorithm typically also requires more training epochs to reach the same MSE.

One caveat of the subgrouped RTRL training is that each subnet must have at least one unit for which a target exists, since gradient information is not exchanged between groups. The problem can be solved by wrapping a feedforward network around the recursive network, creating a feedforward MLP with a fully recursive hidden layer. This is often termed a recurrent multilayer perceptron (RMLP) and is shown in Figure 3-16. The feedforward network is simply one additional layer that distributes the gradient between the groups. With simple extensions to the algorithms, multiple fully recurrent layers can be added to the network.

[Figure 3-16: Diagram of a fully recurrent multi-layer perceptron (RMLP). A fully connected hidden layer sits between the input and output layers.]







Dynamic Subgrouping with π

The goal of my method is to create local neighborhoods (subgroups) in the RTRL algorithm where the majority of the gradient information required for each node is confined to its local neighborhood. This requires organizing the recurrent PEs such that those with strong temporal dependencies are neighbors. This technique replaces the static, preallocated grouping of Zipser's technique with a dynamic method of determining the best set of neighbors for each PE. This dynamic grouping provides faster and more robust training than Zipser's technique while maintaining its O(n²) performance.

First, the RTRL equations must be modified slightly to better suit the RMLP architecture described above. The time indices are now defined such that the input vector contains the external inputs from this time period plus the values of the PE outputs from the previous time period (i.e. the feedback).

$$IN(n+1) = \big[\,x(n+1),\; y(n)\,\big]$$


$$\pi_{ij}^{k}(n+1) = \varphi'\big(v_k(n+1)\big)\left[\sum_{l \in B} w_{kl}(n)\,\pi_{ij}^{l}(n) + \delta_{ki}\,IN_j(n+1)\right]$$
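In code, one step of this sensitivity update might look as follows; the tensor layout pi[k, i, j] ≈ ∂y_k/∂w_ij is a chosen convention for the sketch, not the dissertation's notation:

```python
import numpy as np

def rtrl_step(pi, W, v_new, IN_new, dphi):
    """One RTRL sensitivity update under the time convention above.

    pi     : (N, N, M) sensitivities from the previous step
    W      : (N, N) recurrent weights, W[k, l] from PE l to PE k
    v_new  : (N,) net inputs at step n+1
    IN_new : (M,) input vector [x(n+1), y(n)]
    dphi   : derivative of the PE nonlinearity
    """
    N = pi.shape[0]
    pi_new = np.einsum('kl,lij->kij', W, pi)          # sum_l w_kl * pi^l_ij
    pi_new[np.arange(N), np.arange(N), :] += IN_new   # delta_ki * IN_j(n+1)
    return dphi(v_new)[:, None, None] * pi_new
```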


Next, we must determine the criteria we will use to group the PEs in the network. If we assume that each PE is responsible for updating the weights of the arcs that terminate at it (i.e. the incoming connections), then the PEs that have the highest sensitivities relative to those connections should be in the same neighborhood. For example, PE j is responsible for updating all weights $w_{jB}$, where B is the set of recurrent PEs. If we define


$$Z_{jk}(n) = \sum_{l}\pi_{kl}^{j}(n), \qquad \pi_{kl}^{j}(n) = \frac{\partial y_j(n)}{\partial w_{kl}(n)}$$






then the value of $Z_{jk}$ provides a measure of how much PE k affects the weights of PE j. Thus the neighbors of PE j should be the ones with the highest $Z_{jk}$.

The "dynamic subgrouping with 7t" (DS-nt) methodology implements the RTRL algorithm with the subgroups chosen using the Z measure defined above. It should be noted that since it requires the computation of the complete At matrix, this algorithm is no more efficient than the full RTRL algorithm. It will, however, address how the neighborhood technique with "optimal" switching will perform compared to the full RTRL and Zipser's technique.

The test case for the DS-π algorithm is a function approximation problem where we are trying to map a frequency doubler. The input to the network is a sinusoid with a 16-point period and the desired signal is a sinusoid with an 8-point period. This is a nonlinear function since linear functions cannot "create" frequencies.



[Figure 3-17: Average learning curve for the three algorithms on the frequency doubling problem. MSE versus epochs (16 samples per epoch); curves for full RTRL, Zipser's method, and the DS-π method.]






The DS-π network has 6 fully recurrent hidden-layer PEs and one linear output node. Both Zipser's method and the DS-π method use two groups of three PEs. Each of the three algorithms was trained using the same five sets of random initial weights and the results were averaged to obtain the learning curves. Figure 3-17 shows the average learning curve for each algorithm. Notice that the full RTRL and the DS-π method performed nearly identically; in fact, in a few cases DS-π actually trained in fewer epochs. The third set of initial weights led all three algorithms to a deep local minimum. The first 100 epochs mainly depict the learning curve from the other four initial conditions (notice that the DS-π method and RTRL are nearly identical here), whereas the last 400 iterations are dominated by the learning curve for the initial weights with the deep local minimum. Zipser's method performed worse on all 5 sets of initial conditions and couldn't solve the problem at all (even with more training) for the 3rd set.

In every application I have tested, the DS-π algorithm trains the networks in almost the same number of epochs as the full RTRL algorithm and performs significantly better than Zipser's subgrouping technique. The problem, however, is that the full π matrix is required to compute the neighborhoods. Since the computation of the π matrix is the computationally expensive part of the task, we have not gained anything yet. This methodology, however, proves that the technique is feasible and that all of the gradient information is not necessary to train the networks. Also, when a neighborhood changes in the DS-π algorithm, the gradient information from the ex-neighbor is discarded and new gradient information from the new neighbor starts building up. The technique using the full π matrix shows that this resetting and restarting of gradient information between






nodes does not affect the performance of the algorithm. The DS-π algorithm will be used as an "ideal grouping" methodology since it uses all of the information in the sensitivities to determine the groupings.

Estimating the Z matrix

We now need an estimate of Z that will allow us to efficiently compute the temporal neighborhoods. The logical choice is to use a first-order estimate of the π matrix to compute Z. We start by writing out the equation for Z and simplifying:


$$Z_{jk} = \sum_{l}\pi_{kl}^{j}(n) = \sum_{l}\varphi'\big(v_j(n)\big)\left[\sum_{i \in B} w_{ji}\,\pi_{kl}^{i}(n-1) + \delta_{jk}\,IN_l(n)\right]$$

$$Z_{jk} = \varphi'\big(v_j(n)\big)\left[\sum_{i \in B} w_{ji}\sum_{l}\pi_{kl}^{i}(n-1) + \delta_{jk}\sum_{l}IN_l(n)\right]$$

At this point, I will stop and discuss some grouping rules that I have implemented. First, unlike Zipser's work, the groupings do not need to be symmetric: PE j can be a neighbor of PE k without PE k being a neighbor of PE j. Thus, the baseline method is not a true grouping but a linking of PEs that are sensitive to each other. A true grouping can be obtained by modifying the grouping criterion to include both directions (e.g. $Z_{jk} + Z_{kj}$). Since symmetry is not being enforced, the methodology enforces the rule that PE j is always a neighbor of itself. This does not have to be the case, but seems to be a reasonable assumption. Much of the gradient information in a recurrent network comes from the self-recurrent loop in each PE.

Since we assume that PE j is always a neighbor of itself, we only need to compare the total sensitivity of all the other PEs. Thus, we do not need to worry about the $Z_{jj}$







terms, which means that the $\delta_{jk}$ term can be removed. Reorganizing the summations slightly leads to:


$$Z_{jk} = \varphi'\big(v_j(n)\big)\sum_{i \in B} w_{ji}\sum_{l}\pi_{kl}^{i}(n-1), \qquad j \neq k$$

Expanding, we get:


$$Z_{jk} = \varphi'\big(v_j(n)\big)\sum_{i \in B} w_{ji}\,\varphi'\big(v_i(n-1)\big)\left[\sum_{m} w_{im}\sum_{l}\pi_{kl}^{m}(n-2) + \delta_{ik}\sum_{l}IN_l(n-1)\right]$$

and, regrouping the terms,

$$Z_{jk} = \varphi'\big(v_j(n)\big)\sum_{i \in B} w_{ji}\,\varphi'\big(v_i(n-1)\big)\left[\delta_{ik}\sum_{l}IN_l(n-1) + \sum_{m} w_{im}\sum_{l}\pi_{kl}^{m}(n-2)\right]$$


Now we separate the equation into its first-order part (the direct contribution from the input vector, when i = k) and the rest:


$$Z_{jk} = \varphi'\big(v_j(n)\big)\,w_{jk}\,\varphi'\big(v_k(n-1)\big)\sum_{l}IN_l(n-1)$$
$$\qquad + \varphi'\big(v_j(n)\big)\sum_{\substack{i \in B \\ i \neq k}} w_{ji}\,\varphi'\big(v_i(n-1)\big)\sum_{m} w_{im}\sum_{l}\pi_{kl}^{m}(n-2)$$

This is a very interesting equation. Let's say we approximate Z with the first-order terms:

$$\hat{Z}_{jk} = \varphi'\big(v_j(n)\big)\,w_{jk}\,\varphi'\big(v_k(n-1)\big)\sum_{l}IN_l(n-1)$$

Notice that the sum of the inputs scales all terms of $\hat{Z}_{jk}$ the same, so it can also be eliminated, leaving only

$$\hat{Z}_{jk} = \varphi'\big(v_j(n)\big)\,w_{jk}\,\varphi'\big(v_k(n-1)\big)$$

This is a very easy and computationally efficient method for estimating the Z matrix. It is conceptually appealing as well. You can see that this equation is a time correlation






between the derivatives of the nonlinearities of PEs j and k. If this were a static, linear network, the equation would simply be

$$\hat{Z}_{jk} = y_j(n)\,x_j(n-1)$$

where $y_j$ is the output of PE j and $x_j$ is the input to PE j from PE k. This is a temporal version of Hebbian learning. The first-order estimate of Z can thus be considered a nonlinear version of Hebbian learning. The derivative of the nonlinear function at the operating point determines the sensitivity of each PE to the current training input. Thus, we are calculating a correlation of the sensitivities of each PE, which can also be considered a correlation in the dual of the network.
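The resulting estimate is cheap to compute; a one-line sketch, using the same weight convention as the earlier RTRL sketch:

```python
import numpy as np

def estimate_Z(v_now, v_prev, W, dphi):
    """First-order estimate Z_hat[j, k] = phi'(v_j(n)) * w_jk *
    phi'(v_k(n-1)): a nonlinear, temporal Hebbian correlation of PE
    sensitivities (a sketch assuming W[j, k] = w_jk)."""
    return dphi(v_now)[:, None] * W * dphi(v_prev)[None, :]
```

For tanh PEs, dphi can be taken as `lambda v: 1 - np.tanh(v) ** 2`, which ties into the local-rule observation below.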

If the PEs of the network use a tanh activation function, the estimate for the Z matrix can become a local rule. Local rules are advantageous because they are easily implemented in parallel and easily analyzed. Figure 3-18 shows the plot of f'(net) versus f'(out), which is equal to f'(f(net)), for a tanh PE. Since these two shapes are very similar,



[Figure 3-18: Using a tanh PE, f'(net) takes the same shape as f'(out) = f'(f(net)).]




Full Text

PAGE 1

TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS By NEIL R. EULIANO II A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 1998

PAGE 2

ACKNOWLEDGMENTS It is only appropriate that I first acknowledge the guidance and help of my advisor and friend Dr. Jose Principe. Without his support this work would never have been completed. I would also like to thank the members of my committee for their efforts and time spent on my behalf, as well as the members of the Computational NeuroEngineering Laboratory (CNEL). I must also acknowledge my wife Tammy who was incredibly patient and never wavered in her support of this endeavor. I would also like to thank my children Erin and Matthew who are just too fun to ignore. Although they extended the amount of time required to graduate, I would not trade the time I spent with them for anything in the world. Lastly I should thank my family and friends for not treating me like a dead-beat Ph.D. student.

PAGE 3

TABLE OF CONTENTS page ACKNOWLEDGMENTS » ABSTRACT vi CHAPTERS 1 INTRODUCTION AND PROBLEM DESCRIPTION 1 Temporal Processing 2 Static Supervised and Unsupervised Learning 3 Adding Memory to Neural Networks 5 Short-Term Memory Structures 6 Recurrent Networks 9 Training Dynamic Neural Networks 10 Summary of Problems with Standard ANN Architectures 12 The Approach 14 2 LITERATURE REVIEW 16 Biological Research 16 Neurons and Learning 17 Hippocampus 20 Diffusion Equations (Re-Di Equations) 21 Biological Representations of Time 25 Biological Models for Temporal Processing 28 Static Neural Network Learning 29 Unsupervised Learning 30 KohonenSOMs 32 Neural Gas 36 Supervised Training 37 Second Order Methods 39 Temporal Neural Networks 40 Temporal Unsupervised Learning 40 Temporal Supervised Neural Networks 44 Architectural approaches 44 Algorithmic approaches 46

PAGE 4

Second order methods 49 Sequence Recognition 50 Comparison of Hidden Markov Models with ANNs 51 3 TEMPORAL SELF-ORGANIZATION 54 Introduction and Motivation 54 The Model 55 Temporal Self-Organization in Unsupervised Networks 57 Temporal Activity Diffusion Through a SOM (SOTPAR) 57 Algorithm description 60 Representation of memory 62 A simple illustrative example 66 SOTPAR summary 71 Temporal Activity Diffusion In the Neural Gas Algorithm (SOTPAR2) 72 SOTPAR2 algorithm details 73 Operation of the SOTPAR2 network 77 SOTPAR2 summary 83 Temporal Self-Organization for Training Supervised Networks 83 Using Temporal Neighborhoods in RTRL 84 Review of RTRL and Zipser's Technique 86 Dynamic Subgrouping with n 88 Estimating the Z matrix 91 Illustrative Example 94 Grouping Dynamics 95 Second Order Methods 96 Summary of the Dynamic Subgrouping Algorithm 97 4 APPLICATIONS AND RESULTS 99 SOTPAR 99 Landmark Discrimination and Recognition for Robotics 100 SOTPAR solution 102 Real data collected from the robot 113 Summary 117 Self-Organization of Phoneme Sequences 118 Summary 128 SOTPAR2 129 SOTPAR2 Vector Quantization of Speech Data 129 Time Series Prediction 138 Results 140 Summary of chaotic prediction 147 Dynamic Subgrouping of RTRL in Recurrent Neural Networks 148 System Identification 148 Comparison of the Number of Neighbors 152

PAGE 5

Modeling a Set of Nonlinear Passage Dynamics 155 Summary of Dynamic Subgrouping 160 5 CONCLUSIONS AND FUTURE RESEARCH POTENTIAL 162 Conclusions 162 Future Directions 167 REFERENCES 169 BIOGRAPHICAL SKETCH 177

PAGE 6

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy TEMPORAL SELF-ORGANIZATION FOR NEURAL NETWORKS By Neil R. Euliano II August, 1998 Chairman: Dr. Jose C. Principe Major Department: Electrical and Computer Engineering The field of artificial neural networks (ANNs) has reached a point where they are now being used in everyday products. ANNs, however, have been largely unsuccessful at processing signals that evolve over time. Temporal patterns have traditionally provided the most challenging problems for scientists and engineers and include language skills, vision skills, locomotion skills, process control, time series prediction, and many others. The fundamental concept presented in this dissertation is the formation of temporally organized neighborhoods in ANNs. This temporal self-organization enables the networks to process temporal patterns in a more organized and efficient manner. The concept is biologically inspired and uses activity diffusion to organize the processing elements of the network in an unsupervised manner.

PAGE 7

The self-organization in space and time created by my methodology has been applied to three distinct ANN architectures. The new network architectures created by adding the temporal organization are easy to implement and contain properties that are unique in the neural network field. A self-organizing map (SOM) network obtains a unique combination of long-term and short-term memory and becomes organized such that temporal patterns in the input fire sequentially ordered output PEs. These features are utilized in two different applications, a robotic landmark recognition problem and a temporally ordered vector quantization of phonemes in spoken words. When applied to the neural gas algorithm, the resulting network becomes a dynamic vector quantization network. The network anticipates the future inputs and adjusts the size of the Voronoi regions dynamically. It was used to vector quantize speech data for a digit recognition problem and to predict a chaotic signal. Lastly, the temporal organization was applied to the training of fully recurrent neural networks. It reduces the computational complexity of the training algorithm from 0(N^) operations to only 0(N 2 ) operations and maintains nearly all of the power of the RTRL algorithm. This training method was tested on two inverse modeling tasks and provided a dramatic improvement in training times over the RTRL algorithm.

PAGE 8

CHAPTER 1 INTRODUCTION AND PROBLEM DESCRIPTION This dissertation focuses on neural network architectures and training methods for processing signals that evolve over time. The fundamental concept underlying the techniques described herein involves the formation of neighborhoods where temporally correlated processing elements (PEs) are clustered together. We have applied this concept to three different neural network architectures and found that it improves the performance of each one either by increasing the functionality of the neural network or by improving its training. This chapter contains a description of the problem as well as background information that will help describe the shortcomings of the present methods. Chapter 2 presents a review of the relevant literature necessary to understand the material in the context of the current state of the art. Chapter 3 contains the theoretical description of the techniques and networks proposed by this work, including a few simple examples to elucidate the fundamental concepts. In Chapter 4, six more extensive and practical problems are solved using the temporal neighborhood concepts. These examples include speech recognition, chaotic prediction, system identification and control, and robotics. Chapter 5 concludes the dissertation with a summary of the work and possible future research directions.

PAGE 9

2 Temporal Processing Most scientific problems can be grouped into two domains, static and dynamic problems. Static problems consist of information that is independent of time. For instance, in static image recognition, the image does not change over time. On the other hand, time is fundamental to the dynamic problem. The output of a dynamical system, for example, depends not only on the present input but also on the current state of the system, which encapsulates the past of the input. Temporal processing is the analysis, modeling, prediction, and/or classification of systems that vary with time. Patterns that evolve over time have traditionally provided the most challenging problems for scientists and engineers. Language skills (speech recognition, speech synthesis, sound identification, etc.), vision skills (motion detection, target tracking, object recognition, etc.), locomotion skills (synchronized movement, robotics, mapping, navigation, etc.), process control (both human and mechanical), time series prediction, and many other applications all require temporal pattern processing. In fact, the ability to properly recognize or generate temporal patterns is fundamental to human intelligence. Traditional analysis models (ARMA models, etc.) are well known but are usually linear and require significant expertise on the subject and a strict correspondence between the studied process and the constructed model. Artificial Neural Networks (ANNs) offer robust, model-free methods without requiring as much application specific expertise. Secondly, neural nets are adaptive (similar to ARMA models). This is a natural way to compensate for the drift of measuring devices and slow parameter changes inherent in

PAGE 10

real systems. Thirdly, neural nets are naturally parallel systems that offer more speed in computation and fault tolerance than traditional computing models. [Kan94] Most of the major neural network success, however, has been mainly in the realm of static, instantaneous mappings (for example, static image recognition or pattern matching). Conventional neural net architectures and algorithms are not well suited for patterns that vary over time. Typically, in static pattern recognition a collection of features visual, semantic, or otherwise is presented and the network must categorize the input feature pattern into one or more classes. In such tasks, the network is presented with all relevant information simultaneously. In contrast, temporal pattern recognition involves processing patterns that evolve over time. The appropriate response at a particular point in time depends not only on the current input, but also potentially on an unspecified number of previous inputs. Static ANNs have been modified in various ways to process time-varying patterns, typically by adding short-term memory to the static pattern classification ability of the various architectures. The short-term memory holds onto some of the past events so that the static ANN can then classify or predict the temporal pattern. As I will explain in the next few sections, however, these hybrid structures (memory added to static architectures) have not been widely successful in the various temporal processing areas. Static Supervised and Unsupervised Learniniz The purpose of neural processing is to capture the information from an external signal in the neural network structure. This is a form of organization. It can be accomplished in an unsupervised manner using only the input, or in a supervised manner

PAGE 11

guided by an extra input called the desired signal. Unsupervised training can only extract information from the input signal whereas supervised training can learn mappings between the input signal and the desired signal. They differ in the methods, but at the core they share the same function, learning a representation of the external world. The most common supervised network is the multilayer perceptron (MLP) which uses the error back-propagation [Rum86] learning algorithm. The MLP is characterized by layers (input, hidden, and output) of processing elements (PEs) that have a smooth non-linearity at their output. The nonlinear output of the MLP PEs is what differentiates the MLP from a typical adaptive filter. It provides the capability to map problems that are not linearly separable. In fact, it has been proven that an MLP with one hidden layer can uniformly approximate any continuous function with support in a unit hypercube [Cyb89], Like in adaptive signal processing using the LMS algorithm, the backpropagation algorithm applies a correction Aw j; (n) to the synaptic weight w^n) that is proportional to the gradient of the error 8^(n) 1 5w ; ,(n) The chain rule is used to recursively calculate the error for each layer of the network. Unsupervised networks are typically based on or derived from Hebbian learning. Hebbian learning is a biologically inspired learning rule that finds the correlations present in the input data. Because unsupervised networks can extract information only from the input, they are typically used for data analysis and preprocessing. They cannot reliably be used directly for classification since a labeling of the inputs is required for classification. Both supervised and unsupervised learning will be described in detail in Chapter 2.

PAGE 12

5 Adding Memory to Neural Networks How do you use a static neural network architecture to process temporal patterns? The answer is to simply add memory. Without an appropriate memory to store information from the past, a neural network is limited to static pattern recognition or function approximation. The key questions that need to be answered while creating temporal neural networks are what type of memory do you use and how is the memory integrated into the training algorithm. Memory in neural networks can be classified into two categories: short-term memory and long-term memory. Short-term memory typically involves a representation of the temporal data, usually by creating multiple copies of the input data at various time delays (e.g. tapped delay line). Long-term memory, on the other hand, is the storage of information from the past into the structure of the network. For example, over time, the training of the network captures information about the input signal and this information can be considered long-term memory. Another example of long-term memory is the storage of patterns in an associative memory. Long-term memory corresponds more closely with the traditional biological concepts of memory. The main difference between the two is that the short-term memory is used for signal representation while the longterm memory is a trained memory that typically cannot represent unknown patterns. Another way to differentiate the two is that short-term memory is usually described by activations of nodes or taps (dynamical information), and long-term memory is stored in the weights of the network (statistical information).

PAGE 13

Most of the work in temporal ANN research has focused on the application of short-term memories since they provide a mechanism to represent a temporal pattern in a static manner. For instance, a tapped delay line converts a temporal signal into a static pattern (the present input and the N past inputs) which can then be processed by a standard static ANN. Most short-term memory techniques fall into two categories. The first is to explicitly add memory structures and the second is to use recurrent loops in the network to save information. Long-term memory (the weights) has largely been ignored by the ANN research community for the storage of temporal patterns, but I will use it to store temporal correlations in the structure of the network. Short-Term Memory Structures The simplest form of memory is a buffer containing the N most recent inputs. This is often called a tapped delay line or a delay space embedding and forms the basis of traditional statistical autoregressive (AR) models, as well as dynamical system state space manipulations. This is a very popular model and has been used in many applications. The time-delay neural network (TDNN) [Wai90] uses a tapped delay line to convert the temporal pattern into a spatial pattern allowing the architecture to be trained using only standard back-propagation methods. The TDNN, however, has several drawbacks. First, the length of the delay line must be chosen a priori, we cannot work with arbitrary length sequences. In addition, the TDNN requires that the data is properly registered in time with the clock controlling the shift register. It imposes a rigid limit on the duration of patterns and suggests that all input vectors be the same length. Most importantly, two patterns which are very similar temporally (e.g. shifted one step in time) will be very


different spatially, which is the metric used by ANNs. For example, [1 0 0], [0 1 0], and [0 0 1] are temporally shifted versions of one another but lie on different corners of the unit cube spatially. Using decay traces or exponential kernels to sample the history of the input helps alleviate some of the problems with the TDNN. A common methodology to describe the various memory architectures is to represent the short-term memory as a convolution of the input sequence with a kernel function $k_i$:

$$x_i(t) = \sum_{\tau} k_i(t - \tau)\, x(\tau)$$

where $x(t)$ is the input. Tank and Hopfield [Tan87] proposed a set of Gaussian kernels that are distributed over time with varying means and widths to sample the time history. The gamma model [DeV91] is an example of an exponential trace memory that uses the set of gamma kernels. The exponential trace memory has a smoother representation of the past of the input since it decays exponentially, giving more strength to the more recent inputs. The gamma memory also has a tunable parameter that trades off depth for resolution when the system requires information from farther in the past. Depth roughly refers to how far back into the past the memory stores information, and resolution refers to the degree to which information concerning the individual elements of the input sequence is preserved. The exponential trace memories can be computed incrementally and easily, greatly increasing their usability. Viewing memory in this way, as a kernel function passed over the input, one can see that almost any kernel function will result in a distinct form of memory. The main problem with all of these memory architectures, however, is that they are all "prewired" one-dimensional cascades of delay elements. TDNNs are also known to train very slowly.
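To make the kernel view of short-term memory concrete, here is a minimal sketch (mine, not the dissertation's) contrasting a tapped delay line with an incrementally computed exponential trace; the function names and the decay parameter mu are my own, and the trace shown is a single-stage version of the kind of memory the gamma model generalizes.

```python
import numpy as np

def tapped_delay_line(x, n_taps):
    """Convert a scalar series into static patterns: row t holds the current
    sample and the n_taps-1 past samples (zero-padded at the start)."""
    padded = np.concatenate([np.zeros(n_taps - 1), x])
    cols = [padded[i:i + len(x)] for i in range(n_taps)]
    return np.stack(cols[::-1], axis=1)   # column 0 = x(t), column 1 = x(t-1), ...

def exponential_trace(x, mu):
    """Exponential kernel memory computed incrementally:
    m(t) = (1 - mu) * m(t-1) + mu * x(t)."""
    m, out = 0.0, []
    for sample in x:
        m = (1.0 - mu) * m + mu * sample
        out.append(m)
    return np.array(out)
```

Note how the trace needs only one state variable, while the delay line multiplies the input dimension by the number of taps.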


Theoretically, memory added to a system can be thought of as creating an embedding of the dynamics into a space larger than the original input space. An embedding of a dynamical system is based on the similarity between delays and derivatives (the first-order approximation to a derivative is the difference between the signal and the delayed signal). The delayed values of a single variable can be used to represent the dynamics of a multi-dimensional system. Conceptually, this can be rationalized as combining the first-order differential equations for the system (the state space description) into a single high-order differential equation for one variable and then using the delay technique to approximate the derivatives of this equation, giving a new representation of the system states. This mathematical construct is effective but not necessarily efficient. For example, a minimum of 2D+1 taps is required to preserve the dynamics of a D-dimensional system [Tak81]. If the dimension of the system is unknown, as is often the case, a large embedding is usually used. The embedding also does not efficiently encode the input ordering. It does a time-to-space mapping that treats the temporal information the same as a spatial input, allowing for all permutations of the order of inputs without regard to the limitations imposed by the dynamics of the system. The gamma memory and other convolution memory kernels warp or rotate the embedding space to more accurately (or efficiently) represent the system dynamics. A proper use of the embedding methodology requires a significant amount of work to determine a number of parameters, including the number of taps, the time between taps, the time between vectors, and the number of data samples. This is rarely done.
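Constructing such a delay embedding takes only a few lines; the sketch below is my illustration, with hypothetical parameter names (dim for the embedding dimension, lag for the spacing between taps), not a procedure given in the text.

```python
import numpy as np

def delay_embed(x, dim, lag):
    """Embed a scalar series into dim-dimensional delay vectors with tap
    spacing lag; Takens' result suggests dim >= 2*D + 1 for a D-dimensional
    system."""
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag : i * lag + n] for i in range(dim)], axis=1)

# Example: embed a sine wave into 3-D delay vectors.
t = np.linspace(0, 20, 500)
vectors = delay_embed(np.sin(t), dim=3, lag=10)
```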


Recurrent Networks

The MLP and TDNN are both feedforward networks, where the data flow in the network moves strictly forward and no feedback is used. The feedback in recurrent networks can also create memory. The important distinction between the two types of memory is that memory created with feedback can be adapted and trained on-line, creating a flexible and adjustable memory mechanism. Feeding back outputs between different layers leads to a generalization that stores not only the input but the "state" of the network (i.e. a processed version of the input) [Elm90][Moz94]. In theory, the recurrent architecture is sufficiently powerful to handle arbitrarily complex temporal problems. The focused memory architectures such as the TDNN can also do so [San97], but may require a very large number of taps and weights. In practice, however, recurrent networks are much more difficult to train than static networks. The recurrence adds tremendous power to the network (any memory architecture can be created with a recurrent neural network). This power, however, creates very complicated error surfaces. In recurrent networks, the states of the PEs affect both the outputs and the gradients; therefore, calculating the gradients and updating the weights of a recurrent network is a much more difficult and time-consuming process. Because of these difficulties, the mainstream engineering community has largely ignored recurrent networks. Recently, however, recurrent networks are being used more and more as engineers reach the limits of the capabilities of TDNNs and other


simpler architectures. Recurrent networks are now hot topics in the fields of dynamic modeling and control.

Training Dynamic Neural Networks

Recurrent networks, either fully recurrent or partially recurrent (e.g. the gamma network), cannot directly use static backpropagation methods, since the time history of the network and its inputs is critical to the outputs it produces. Static backpropagation computes the gradients based only upon the current inputs and outputs. To train a dynamical system, the past information is at least as important as the present, and thus a temporal backpropagation technique must be used. Recurrent backpropagation (fixed-point learning) can be used to train a general recurrent network to move to stationary states. Its assumption of constant inputs and an approach to an attractor, however, precludes the recurrent back-propagation algorithm from real-time temporal processing. The TDNN can use static backpropagation because its memory is fixed and located at the beginning of the network. The tapped delay line can be thought of as a temporal preprocessor converting dynamic patterns to static patterns, so the network is trained in a completely static manner. Most other temporal networks, however, are trained using one of two first-order temporal methods: back-propagation through time (BPTT) [Rum86] or real-time recurrent learning (RTRL) [Wil89]. Both of these methods are gradient descent methods. The RTRL method brings the activations and their derivatives forward in time until the desired signal is available, and the BPTT method propagates the errors back from the desired signal to the beginning of the pattern. RTRL recurrently calculates the gradients of each PE with respect to every weight.


This process allows on-line updates (updates every sample), but calculating all the gradients is a time-consuming process. In fact, if there are N fully recurrent PEs in a network, the RTRL algorithm requires $O(N^4)$ operations per sample (a sketch of the sensitivity recursion behind this cost follows the list below). The BPTT method requires fewer computations, but is non-causal; thus it cannot be directly implemented in an on-line fashion. Both methods suffer from the following problems:

• The computation of the gradient must occur over time, but the nonlinearity in each layer (actually, the derivative of the nonlinearity required for the gradients) attenuates these gradients. Thus, if information is required from more than a few samples in the past, these training methods may have a difficult time maintaining and using this information. As the errors are propagated, the gradients get small, and the impact of a connection weight, even if appropriate, will be masked by other weights if their values are inappropriate. This is true for large feedforward nets as well, but the feedback through time makes this a much bigger problem in recurrent networks.

• The desired signal must be defined over time. For example, how do you define a target signal when trying to detect a sequence? If the target is high throughout the pattern, the network will recognize partial sequences. But if the target is high only at the end, the network may be punished for partially recognizing a major portion of the sequence.

• Temporal backpropagation is inherently slow, both computationally and in terms of the number of training samples required to find an adequate solution.
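To see where the $O(N^4)$ cost comes from, here is a minimal sketch of the RTRL sensitivity recursion for a fully recurrent tanh network; the variable names and the omission of external inputs are my simplifications, not the dissertation's formulation.

```python
import numpy as np

def rtrl_step(W, y, p):
    """One RTRL step for the recurrence y(n+1) = tanh(W @ y(n)).
    p[k, i, j] holds the sensitivity d y_k / d w_ij; updating all N^3 of
    them, each via an O(N) sum, costs O(N^4) operations per sample."""
    y_new = np.tanh(W @ y)
    fprime = 1.0 - y_new ** 2                    # derivative of tanh at net
    p_new = np.einsum('kl,lij->kij', W, p)       # propagate old sensitivities
    for i in range(len(y)):
        p_new[i, i, :] += y                      # explicit term: d net_k / d w_ij = delta_ki * y_j
    p_new *= fprime[:, None, None]
    return y_new, p_new
```

With the sensitivities in hand, the weight gradients at a given step are obtained by weighting p with the output errors, which is what makes on-line updates possible.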


Recently, second-order gradient methods like recursive least squares (RLS) and the extended Kalman filter have been used to reduce the number of training samples required for a good solution. These methods use second-order gradient information to determine more accurate data on the shape of the performance surface at the current operating point. This allows for much faster convergence but requires more computations per sample. These second-order gradient methods still need to compute the dynamic gradient information and thus suffer from the same problems listed above.

Summary of Problems with Standard ANN Architectures

In summary, the standard ANN architectures, when applied to temporal processing, suffer from problems with supervision and problems with short-term memory. The problems that can be attributed to supervised training include:

• The problem of assigning credit or blame to actions when the overall success or failure of the system results from a series of actions and cannot be judged instantaneously (i.e. how do you design a target signal?).

• Back-propagation training can be very slow, often requiring thousands of training epochs. This problem derives from many sources. The backpropagation algorithm must either take small steps in the gradient descent or use more computationally intensive error calculations (higher-order derivatives). Since all nodes in a network typically learn independently, several problems may occur. First, all the hidden nodes may move together to try to solve the largest source of error, instead of dividing up the problem and each solving a different portion. Second, once the nodes have divided the problem, each tries to solve its portion independently. The


movement of each node through the error surface affects all the other nodes, creating a moving target for each node. Thus, instead of a direct movement of the nodes to useful roles, we see a "complex dance" among all units [Fah91].

• Recurrent back-propagation trains even slower, for several reasons. First, the training methods require more computations than static backpropagation. Second, the error gradients tend to vanish exponentially as they are propagated through time. Third, recurrent networks tend to have a much more complicated error performance surface with many local minima, making the gradient search very difficult.

• Supervised techniques require presegmented and prelabeled training data. This often must be done by hand and is quite time-consuming. The rule of thumb for ANN training is 10 training exemplars for each adjustable weight; thus, for large networks, finding enough training data is a difficult task. If there is an insufficient amount of training data, the network will tend to memorize the data rather than draw reasonable generalizations about it.

Problems related to short-term memory structures include the following:

• The common short-term memory techniques (tapped delay lines, etc.) use a time-to-space mapping to represent the past of the signal. By converting time into just another spatial dimension, the unique features of the temporal information are lost (e.g. continuity, limitations based on the dynamics of the system, etc.). The short-term memory is a representation of the data, not a memory structure.

• The typical short-term memory structure is a rigid architecture that must be pre-wired.


• Short-term memory structures typically add many weights to the input (or interior) layer (e.g. a TDNN with N taps will create N times more weights in the first layer), which exacerbates the problems with the performance surface and the amount of training data. The resulting networks tend to have so many degrees of freedom that they do not generalize well (i.e. memorization due to insufficient training exemplars).

The Approach

It is a Herculean challenge to attempt to solve all of the above problems. This work focuses on a method of self-organizing PEs in a network architecture based on their temporal correlations. This concept is biologically inspired and has been applied to three different types of neural networks. By creating temporal neighborhoods of PEs in the architecture, we have increased the performance of the networks, either through increased functionality and power or through better training methods. When this technique is applied to a self-organizing feature map (SOFM or SOM), the temporal neighborhoods create traveling waves of activity which diffuse through the PEs. The resulting architecture has a spatio-temporal memory that is selective and recognizes temporal patterns similar to those it has been trained with. The typical ANN memory simply embeds the data for further processing by the ANN, without any mechanism for recognition. This architecture, however, is similar to biological memories in that it responds preferentially to known temporal patterns; this is unique in the neural network literature. When the temporal neighborhood approach is applied to the neural gas algorithm, the network becomes a temporal vector quantizer that again responds preferentially to


known temporal patterns. The temporal vector quantizer uses the past of the signal to anticipate the next input by expanding the Voronoi region associated with the expected next input. This allows the network to remove noise in the signal and generate better vector quantization based upon the temporal training and the recent past of the signal. This anticipation is similar to how the human brain deals with noise in its environment: it uses the past to predict the future and correlates what it is sensing with this prediction. This is part of the reason humans can understand speech in very noisy environments, and also why two people can hear completely different things from the same set of sounds. When we apply the technique to the training of recurrent neural networks, the new training technique reduces the computational complexity of the RTRL algorithm from $O(N^4)$ to $O(N^2)$. This dramatic improvement comes with only a slight increase in the number of iterations of training data required. The overall speed-up, taking into account both the decreased computational complexity and the increased number of training samples, is still dramatic. In fact, the $O(N^4)$ cost of the RTRL algorithm makes it virtually unusable for sizeable networks. In general, the self-organizing nature of the temporal neighborhoods helps alleviate many of the problems with the supervised techniques. Additionally, the novel spatio-temporal memory architectures provide a unique methodology for solving the problems with short-term memory.


CHAPTER 2
LITERATURE REVIEW

This chapter presents background information and a literature review of topics that either influenced this work, relate to this work, or will be compared and contrasted with this work. The chapter begins with a presentation of current research on biological neural networks and methods of temporal processing. This section is important because it motivated my work; I do not, however, claim that my work is biologically feasible or occurs in nature. Next, this chapter contains a description of the state of temporal neural network research. Since most of the work in temporal neural networks takes the form of extensions to static neural networks, an overview of static neural network learning is also presented. The contrast between biological and artificial neural networks and the way they process time is striking. Static artificial neural networks are very similar to the static characteristics of real neurons, but temporal neural networks share little in common with their biological counterparts.

Biological Research

This section contains a description of biological neurons and their temporal characteristics, as well as other biological mechanisms that may help in processing time-based signals. Recently, there has been extensive research into the temporal characteristics of the brain as well as into learning dynamics. This research has not yet


been integrated into the artificial neural network community, but it holds promise for creating powerful temporal ANNs. This information provides motivation for the main principle of this work: that the creation of temporally organized neighborhoods in a neural network improves the performance of the network for temporal processing. The concept of diffusing temporal information through the network is one of the fundamental concepts used to rationalize the formation of these neighborhoods.

Neurons and Learning

Fundamentally, the artificial neural network is modeled after a collection of neurons in the brain. Each neuron is composed of three basic components: the cell body, the dendrites, and the axon [Fre92]. The dendrites are a widely branching set of filaments that collect information from other neurons. The axon is a long transmission medium that contains fewer branches and transmits the output of the neuron to other neurons. Synapses are the junctions between axons and dendrites. The dendrites collect incoming pulses from other synapses, convert them to currents, and sum them all at the initial segment of the axon. This summation works both across dendritic space (summation over all the dendrites) and across time. Each synaptic membrane acts as a leaky integrator with an associated time constant. The critical function of the axon is to transmit the time-varying amplitude of current summed by the dendrites to distant targets without attenuation [Fre92]. If the neuron reaches a certain threshold, it fires or depolarizes, which means that it produces an energy spike on its axon. The firing includes a refractory period, such that a constantly active neuron will produce an impulse train on its axon. How biological neural networks are trained is not well known, but most of what is known


about the training is based on the Hebbian learning concept (which will be discussed later). The Hebbian learning law strengthens synapses (allowing more responsiveness from the post-synaptic neuron) when the two neurons fire at the same time. If there is a consistent correlation between the firing of two neurons, then the pre-synaptic neuron must be at least partially responsible for the firing of the post-synaptic neuron. A static artificial neural network is modeled loosely on an interconnected cluster of neurons: each neuron is modeled by a processing element (PE) with a set of connections to other processing elements. Typically, a processing element simply sums the inputs, nonlinearly warps the result, and then passes this output to its downstream connections. Training is implemented either in an unsupervised manner, usually using a form of Hebbian learning, or in a supervised manner, which has no biological parallel. Notice that none of the temporal characteristics of a neuron are used in static neural networks or their temporal extensions. Recently, there has been significant work on a more complete modeling of individual neurons and their temporal characteristics. Christodoulou and others [Chr95a][Chri93] have modeled the biological neuron, including its random spiking nature, excitatory/inhibitory synapses, the transmission delay down the axon, and especially the membrane time constant. The membrane time constant is the main temporal property modeled today. Most modeling approaches use simplifications of the Hodgkin-Huxley equations that result in a leaky integrator model of the neuron membrane potential. This is an important feature of biological neurons, since the past history of the signal remains active on neurons for a short period and can influence the response to future inputs.


Additionally, the gas nitric oxide (NO) has been found to be involved in many processes in the central nervous system. One such process is the modification of synaptic strength, thought to be the mechanism for learning (and most commonly used in ANNs). Neurons produce NO post-synaptically after depolarization. The NO diffuses rapidly (3.3 × 10⁻⁵ cm²/s) and has a long half-life (~4-6 seconds), creating an effective range of at least 150 μm. Large quantities of NO at an active synapse strengthen the synapse (called long-term potentiation, or LTP). If the NO level is low, the synaptic strength is decreased (long-term depression, or LTD), even if the site is strongly depolarized. NO is thus commonly called a diffusing messenger, as it has the ability to carry information through diffusion, without any direct electrical contact (synapses), over much larger distances than normally considered (non-local). The NO diffusion and nonlinear synaptic change mechanism has been shown to be capable of supporting the development of topographical maps without the need for a Mexican hat lateral interaction (described later). This seems to be a more biologically plausible explanation of the short-range excitation and long-range inhibition than the preprogrammed weights of synaptic connections which are typically assumed to implement the same effect [Kre96a][Kre96b]. In addition to the possibility of lateral diffusive messenger effects, the long life of NO can produce interesting temporal effects. Krekelberg has shown that NO can act as a memory trace in the brain that allows the temporal correlations in the input to be converted into spatial connection strengths [Kre96b]. This mechanism for capturing the temporal correlations of the input using an NO diffusion process is similar to the method I will present in more detail in Chapter 3.


Hippocampus

The hippocampus is the primary region in the mammalian brain for the study of memory and learning because [Bur95]:

• hippocampal damage causes memory loss,
• the hippocampus is the simplest form of cortex,
• long-term potentiation (LTP) has been found in the hippocampus (synaptic plasticity),
• cell firing in the hippocampus is spatially coded (place cells),
• all sensory inputs converge on the hippocampus, and the output from the hippocampus is extensively divergent, with projections onto most of the cortical areas.

Figure 2-1 shows the major subfields of the hippocampus and their projections.

Figure 2-1: The major subfields of the hippocampus

The hippocampus is formed from sheets of cells, with most of the interconnections contained in these sheets (minimal connections between sheets). Most projections have large divergence and convergence, except the dentate gyrus to CA3 projection, which has


mossy fiber projections from each granule cell, making very large synapses onto only 14 or so pyramidal cells. Hebbian LTP has been observed in much of the hippocampus. A variety of interneurons provide feed-forward and feed-back inhibition. One of the most interesting (and, for this work, most relevant) aspects of the hippocampus is that it contains "place cells" and other functional clusters of neurons. Place cells are small patches of neurons that selectively fire only when the animal is in a specific location of its environment. These are groups of thousands of neurons that fire together and are linked to other place cells. As the subject moves through a familiar set of locations, the patches fire sequentially, and the linking of these patches allows for predictive navigation. They have been found in fields CA3 and CA1 of the rat hippocampus [Bur93]. These place cells are temporally and spatially organized neurons that are correlated in their reaction to temporally occurring events.

Diffusion Equations (Re-Di Equations)

The diffusion equation (or the reaction-diffusion equation if the medium is active) can be used to explain certain characteristics of a neuron and of neuronal clusters. In its generic form, however, it is used in many other fields. Objects such as cells, bacteria, chemicals, and animals often have the property that each individual moves about in a random manner (e.g. Brownian motion). When a concentration of these objects occurs, this random motion causes the objects to spread out into lower-concentration areas of the environment. When this microscopic movement of the group results in macroscopic motion, we call it a diffusion process. If we assume a one-dimensional motion and a random walk process, we can derive the diffusion equation


from a probabilistic treatment of the process. By finding the probability p(m,n) that a particle reaches a point m steps away at n time steps in the future, we find the distribution of particles at time n. Using the random walk assumption and allowing n to be large, it can be shown that the resulting distribution is the Gaussian or normal probability distribution:

$$p(m,n) \approx \left(\frac{2}{\pi n}\right)^{1/2} \exp\!\left(-\frac{m^2}{2n}\right), \qquad m \gg 1,\; n \gg 1$$

Next, we determine the probability of finding a particle in an area between (x-Δx, x+Δx) at time t by rewriting the equation for p(m,n) as the sum of the probability of moving right from x-Δx at time t-Δt and the probability of moving left from x+Δx at time t-Δt. If we take the partial derivative of p with respect to t and let Δx→0 and Δt→0, we obtain the diffusion equation:

$$\frac{\partial p}{\partial t} = D \frac{\partial^2 p}{\partial x^2}$$

where D is the diffusion coefficient, which defines how fast the particles spread. A typical diffusing process creates a spreading of a concentration into ever shallower and shallower Gaussians, as shown in Figure 2-2.

Figure 2-2: Diffusion process
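The spreading shown in Figure 2-2 can be reproduced numerically; the explicit finite-difference sketch below is my illustration, not part of the dissertation, and the grid and parameter names are hypothetical.

```python
import numpy as np

def diffuse(c, D, dx, dt, steps):
    """Explicit finite differences for dc/dt = D * d2c/dx2.
    Numerically stable when D*dt/dx**2 <= 0.5."""
    c = c.copy()
    r = D * dt / dx ** 2
    for _ in range(steps):
        c[1:-1] += r * (c[2:] - 2 * c[1:-1] + c[:-2])   # interior points only
    return c

# An initial spike spreads into ever shallower Gaussian profiles.
c0 = np.zeros(101)
c0[50] = 1.0
profile = diffuse(c0, D=1.0, dx=1.0, dt=0.25, steps=200)
```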


The reaction-diffusion equations were originally proposed by Turing in 1952 and are typically used to explain natural pattern formation [Tur52]. They have been used to model insect populations, the formation of zebra stripes, crystal formation, galaxy formation, and many other naturally occurring patterns and self-organizing systems. Turing's proposal modeled patterns found in nature by an interaction of chemicals called "morphogens". The different morphogens react with each other and diffuse throughout the substance via the equation

$$\frac{\partial m_i(x,t)}{\partial t} = f\big(m_i(x,t),\, m_j(x,t)\big) + D_m \frac{\partial^2 m_i(x,t)}{\partial x^2}$$

where $m_i(x,t)$ is the concentration of morphogen i at time t, $D_m$ is the diffusion coefficient, and $f(m_i, m_j)$ is a function (typically nonlinear) that represents the interaction between morphogens. By varying the interaction between chemicals and the speed of diffusion, complicated spatial patterns of chemicals are created. The reaction-diffusion equations have also been used to explain traveling waves, such as the impulse traveling down the axon of a neuron. If the reaction portion of the Re-Di equations represents the kinetics of the system and these kinetics are nonlinear, then the system can create a traveling wave. One requirement for a traveling wave is that the kinetics of the system be excitable, where excitable implies two stable states such that a small excursion away from one state may drive the system to the next state. Another requirement is that after excitation, the system must relax back to the original state. An example of such a system is the Fitzhugh-Nagumo (FHN) equations, a simplified version of the Hodgkin-Huxley model that describes the transmission of energy down the axon of a


neuron. The FHN equations can be described by the following system of equations [Mur89]:

$$\frac{\partial u}{\partial t} = f(u) - v + D\frac{\partial^2 u}{\partial x^2}, \qquad \frac{\partial v}{\partial t} = b\,u - \gamma\, v, \qquad f(u) = u(a-u)(u-1)$$

where u is roughly equivalent to the membrane potential, v lumps together the effects of most of the ionic membrane currents, and a, b, and γ are constants. The null clines of the kinetics in the (u,v) phase plane are shown in Figure 2-3.

Figure 2-3: Null clines for the dynamics of the FHN equations

The general concept is that when one element fires, its activity diffuses to its neighbors and pushes them just far enough from their stable state to move them to the "excited" state. Next, these newly excited elements excite their neighbors, and so on. The elements which were excited originally then begin to relax, creating a traveling wave of activity. The traveling wave from the FHN equations is shown in Figure 2-4. In this case, not only does the system relax, it also has a refractory phase which inhibits future excitation for a period of time [Tys88].

Figure 2-4: Traveling waves caused by the FHN equations
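A traveling pulse of this kind can be produced by direct time-stepping of the FHN system; the sketch below is mine, and the parameter values (a, b, g, D) are illustrative choices that may need tuning, not values taken from the dissertation or [Mur89].

```python
import numpy as np

def fhn_wave(n=200, steps=5000, a=0.25, b=0.001, g=0.003, D=1.0, dx=1.0, dt=0.1):
    """Explicit Euler integration of
       du/dt = u(a-u)(u-1) - v + D d2u/dx2,   dv/dt = b*u - g*v.
    A supra-threshold excitation at the left edge can launch a pulse that
    travels right while the wake relaxes back toward rest."""
    u = np.zeros(n)
    v = np.zeros(n)
    u[:5] = 1.0                                    # initial excitation
    for _ in range(steps):
        lap = np.zeros(n)
        lap[1:-1] = u[2:] - 2 * u[1:-1] + u[:-2]   # crude fixed-end boundaries
        u = u + dt * (u * (a - u) * (u - 1) - v + D * lap / dx ** 2)
        v = v + dt * (b * u - g * v)
    return u, v
```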


Diffusion and other biologically plausible local communication techniques have increasingly been used in neural networks. For example, the Kohonen algorithm can be implemented in analog hardware with an active medium using diffusion [Ruw93]. Diffusion has also been used frequently in visual imaging systems [Cun94]. Sherstinsky and Picard have proposed a cellular neural network based on Re-Di equations that can solve optimization problems [She94]. One key aspect of this work is that diffusion in the PE space of a neural network allows temporal information to be transmitted and stored using only local communication. This is similar to the diffusion of NO in the brain, which is thought to affect the plasticity of synapses in areas where many neurons are firing at once. Without direct connectivity between two PEs, communication and temporal memory can be implemented using the local storage and transmission of a diffusing object (in our case, diffusing activity).

Biological Representations of Time

Another example of neurobiological research that has not been used in ANNs is the concept of rhythm. Recently, there has been some interesting research on oscillators, central pattern generators, rhythm, and their effect on human pattern recognition. Rhythm


has been studied in biology, where it has been found that rhythmic signals from insects can be entrained or phase-locked to an external rhythmic pattern without high-level processing (the patterns are faster than the minimum response latency) [McA94]. There is evidence that the dynamics of many biological systems have natural rhythms that share the same frequency. Communication and locomotion, for instance, are highly dependent on rhythm and pacing. It has also been suggested that EEG rhythms play an important role in learning and temporal recognition. For instance, neurons are thought to modify their synaptic strengths only when the θ rhythm is in the correct phase. The θ rhythm is a sinusoidal component of the EEG that ranges from 7-12 Hz. The θ rhythm has been linked with displacement movements (e.g. walking) and many other repetitive actions. Since the θ rhythm must propagate through the neural tissue, it could also play the role of a moving wavefront that controls learning. Rhythm can be thought of in two ways: either as an external pacemaker that synchronizes the network in some fashion, or as the output of a collection of neurons that are working in unison. For the first case, there is little if any research on the effects of an external pacemaker on temporal ANNs. The pacemaker would create a time-varying network where the output of the network is dependent on the time or phase of the pacemaker. The pacemaker could also act as a sampling signal; for instance, learning may only occur at a specific phase of the θ rhythm. In the second case, the rhythm could be the result of synchronized processing. For instance, waves of activity in the brain could be caused by the processing of the spatio-temporal patterns constantly input to the network by the continuous motions of the eyes and other sensory muscles.


Stanley and Kilmer [Sta75] have proposed a "wave model" of memory that can learn sequences. It is based on the anatomy of the dentate gyrus (in the mammalian hippocampus) and can be summarized as follows:

• The hippocampus is organized into transverse slices called lamellae.
• The majority of connections in the hippocampus do not leave a lamella (small longitudinal spread).
• Sensory inputs arrive via the perforant path to excite cells directly.
• A small number of mossy fibers connect cells longitudinally (across lamellae).
• Cells excited by an input spread excitation to their neighbors, causing a wave of activity to travel down the cell's lamella.

The wave formation is based on the pyramid and granule cells receiving excitatory influences from the hippocampal input pathways, which in turn excite interneurons whose axons inhibit the pyramid and granule cells. This excitation and inhibition create the waves of activity in the lamella. The memory is created by the association of the various waves in different lamellae via the mossy fibers that interconnect the lamellae. Each wave is created by a sensory input that triggers a cell in a lamella, and the wave can move a short distance before dying. Randomly distributed mossy fibers interconnect the lamellae. When two waves from different lamellae are coincident with a connecting mossy fiber, that connection is strengthened in a Hebbian manner. Thus, the next time the first input wave passes the same position, it can automatically trigger the second wave, even without the corresponding input. This is shown in Figure 2-5. For longer


temporal relationships, one wave will trigger a second wave in another lamella via prestrengthened longitudinal connections, and the second wave will continue after the first wave has died.

Figure 2-5: Stanley and Kilmer's wave model [Sta75]

Biological Models for Temporal Processing

Living neurons act as leaky integrators with time constants on the order of tens to hundreds of milliseconds. This can lead to the storage of information in a way that may support temporal sequence processing. Most ANN temporal methods store the information in a spatial manner. The spatial approach to signal storage is used in the brain for auditory and visual processing (e.g. SOMs); the way in which these maps are then processed, however, is not necessarily spatial. Reiss and Taylor propose an interesting temporal sequence storage mechanism based on a leaky integrator network [Rei91]. The basic concept is to use the leaky integrator neurons as temporary storage for an associative memory that is implemented like a single-layer neural net. The network has been shown to have a capacity proportional to the number of neurons. The problem with this network is that the connection matrix seems to be very heavily skewed toward predicting only the next input, with little information from further in the past. This is similar to


a simple state machine or Markov chain. An interesting part of this work is the possible connection to the function of the hippocampus: the memory network corresponds to the dentate gyrus, the CA3 corresponds to the predictor, and the input line is similar to the perforant path (between EHC, DG, and CA3). Kargupta and Ray proposed a temporal sequence processor that is based on the reaction-diffusion equations [Kar94]. Drawing an analogy between chemical diffusion in biology and spatio-temporal sequential processing, their model is based on a collection of cells that react to different inputs. When a cell becomes active (by recognizing its input), it outputs its own specific chemical. This chemical diffuses throughout the medium containing the cells. Each cell contains a memory of the chemical makeup at its location when it fires. The background medium thus stores the temporal history of the signal by diffusing all the various chemicals. This approach is more of a chemical model than an information processing model and has several difficulties when applied to realistic problems.

Static Neural Network Learning

This section contains a summary of static neural network learning mechanisms. Almost all of the work in temporal ANNs is based on the principles from static ANNs. Since unsupervised training is most similar to known biological learning mechanisms, it will be presented first. Unsupervised learning does not have a desired signal and extracts information only from the input signal. As such, unsupervised techniques typically do not directly implement classifiers, but are usually used for preprocessing the input. For example, unsupervised networks can be trained to perform


principal component analysis (PCA), vector quantization (VQ), and data reduction. Supervised learning is presented next; these algorithms use a desired signal to train the network to mimic the desired input-output map. The desired signal can be thought of as a teacher or external influence that guides the network to the desired state. As mentioned before, there is no known biological analog to supervised training.

Unsupervised Learning

Most unsupervised (also known as competitive or self-organizing) learning is based on Hebbian learning. Hebbian learning is derived from the work of the neuropsychologist Hebb, who noted in 1949 that when cell A repeatedly participates in the firing of cell B, a growth process occurs between the two cells which increases the efficiency of the link between cell A and cell B. This can be stated as "neurons that fire together, wire together". This mechanism is often called correlation learning, because the links are increased when there is a statistical correlation over time between the presynaptic and postsynaptic activities. To avoid excessive weight growth, Hebbian synapses typically also include a decrease in the strength of a connection between two cells which are uncorrelated. Conversely, anti-Hebbian learning is a learning rule that increases the strength of a connection when the presynaptic and postsynaptic signals are negatively correlated and weakens it otherwise. A typical expression for Hebbian learning is

$$\Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n)$$

where $w_{kj}$ represents the synaptic weight between cell k and cell j, $x_j$ is the presynaptic activity, $y_k$ is the postsynaptic activity, and η is the learning rate.


This rule, however, does not include the weakening of uncorrelated signals, and thus the weights will grow forever. Introducing a nonlinear forgetting factor into the equation can control the weight growth:

$$\Delta w_{kj}(n) = \eta\, y_k(n)\, x_j(n) - \alpha\, y_k(n)\, w_{kj}(n)$$

where α is the decay constant. This equation can be rewritten as

$$\Delta w_{kj}(n) = \alpha\, y_k(n)\big[c\, x_j(n) - w_{kj}(n)\big], \qquad c = \eta / \alpha$$

which is the standard Hebbian learning rule. Notice that when the postsynaptic neuron fires, $w_{kj}$ moves toward $c\,x_j(n)$ exponentially. By manipulating the definitions of the variables, this equation can be reformulated into the competitive learning rule. In competitive learning, a group of neurons is clustered such that one and only one neuron wins a competition for each input. Algorithmically, the winner is simply selected by choosing the PE with the highest/lowest output, which can be physically implemented using lateral inhibition between nodes. Biologically, neurons fire in clusters, and the competition between clusters is believed to be due to long-range inhibition and short-range excitation (a concept that will come up again and again). In the case of a competitive cluster, the winning node has an output value of 1 and the others are all zero. Thus the Hebbian learning rule becomes

$$\Delta w_{kj}(n) = \begin{cases} \eta\,\big(x_j(n) - w_{kj}(n)\big) & \text{if neuron } k \text{ wins} \\ 0 & \text{if neuron } k \text{ loses} \end{cases}$$

Only one neuron (or cluster, in biology) learns at each stage, and its weights move toward the location of the input. Thus, the individual nodes specialize on sets of similar patterns and become feature detectors. Competitive learning is typically used for clustering or vector quantization.
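As a compact illustration of the two rules just given, here is a minimal sketch, mine rather than the dissertation's, of the Hebbian update with a forgetting term and of the hard-competition update; the parameter names eta and alpha follow the equations above.

```python
import numpy as np

def hebbian_update(w, x, y, eta, alpha):
    """Hebbian rule with forgetting: dw = eta*y*x - alpha*y*w.
    When the postsynaptic activity y is high, w relaxes toward (eta/alpha)*x."""
    return w + eta * y * x - alpha * y * w

def competitive_update(W, x, eta):
    """Hard competition: only the weight row closest to the input (the winner)
    moves toward the input; all other rows are left unchanged."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    W[winner] += eta * (x - W[winner])
    return W
```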


Hebbian learning is used widely throughout the neural network field, and in its simplest form it is often used for principal component analysis.

Kohonen SOMs

The Kohonen map, or self-organizing feature map (SOM), is a neural network inspired by the sensory mappings commonly found in the brain [Wil76][Koh82]. A self-organizing feature map creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. The result of mapping similar inputs to neighboring outputs is a global organization that is extracted from the local neighborhoods. Topographical computational maps have been found in many locations in the brain, including the vision areas (angle of tilt of a line stimulus, motion direction), auditory areas (representations of frequency, of amplitude, and of time intervals between acoustic events), and motor control areas (control of eye movements). More abstract topographic maps have been found in other parts of the brain; for example, there is a map for the representation of the location of a sound source based on the interaural differences in an acoustic signal. The SOM is one of the most widely used unsupervised artificial neural network algorithms [Kan94]. The typical SOM is composed of an input layer and an output layer, as shown in Figure 2-7. The input layer broadcasts the vector input to each node in the output layer, scaled by the weights of each connection. Each node has an input term and a lateral feedback term. The topographic mapping is created by the local lateral feedback, where


neighboring connections are excitatory and more distant connections are inhibitory. This is called a "Mexican hat" lateral connectivity and is shown in Figure 2-6. The result is similar to the standard competitive network, except that the network creates a more gentle cutoff, producing a Gaussian-shaped output after the lateral interconnections have stabilized. This is called a "soft-max" rule (or soft competition), where the winning PE and a few "near-winner" PEs remain active. The competitive rule is called a "hard-max" rule, hard competition, or winner-take-all rule. Depending on the characteristics of the Mexican hat lateral interconnections, the resulting output will be a Gaussian of varying width centered roughly at the location of the maximum output. The process can be described using the following equations:

$$y_j = \varphi\!\left(I_j + \sum_{k=-K}^{K} c_k\, y_{j+k}\right), \qquad I_j = \sum_i w_{ji}\, x_i$$

where $y_j$ is the output of the jth node, $I_j$ is the input to the jth node scaled by the weights into that node, $c_k$ are the lateral weights described above as the Mexican hat function, and φ is a nonlinear saturating function which keeps the nodes from growing without bound.

Figure 2-6: Mexican hat lateral connectivity and Gaussian shaped output


Figure 2-7: Connectivity of an SOM

After the outputs have stabilized, the network can be trained with a simple Hebbian-like rule that adapts the weights of the winning node and its neighbors. The neighboring nodes can be trained in proportion to their activity (Gaussian), or all neighbors within a certain distance can be trained equally. The learning rule can be described as follows:

$$w_j(n+1) = w_j(n) + \eta(n)\, \Lambda_{j,i(x)}(n)\,\big[x(n) - w_j(n)\big]$$

where $w_j$ are the weights of node j, x(n) is the input at time n, $\Lambda_{j,i(x)}$ is the neighborhood function centered around the winning node i(x), and η(n) is the learning rate. Notice that both the learning rate and the neighborhood size are time dependent and are typically annealed (from large to small) to provide the best performance with the smallest training time. A simplified approximation to this algorithm consists of two stages: first, find the winning node (the one whose weights are closest to the input), then update the weights of the winner and its neighbors in a Hebbian manner. The SOM is an unsupervised network with large local connectivity, but unsupervised networks do not typically suffer from overtraining. Because the input is mapped onto a discrete, usually lower-dimensional output space, the SOM is typically used as a vector quantization (VQ) algorithm. The weights of the winning node are the vector-quantized representation of the input.
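The simplified two-stage algorithm just described fits in a few lines; the sketch below is my own rendering with a 1-D lattice, a Gaussian neighborhood, and linear annealing of both the learning rate and the neighborhood width (all illustrative choices, not prescriptions from the text).

```python
import numpy as np

def som_train(X, n_nodes, epochs, eta0=0.5, sigma0=None):
    """Simplified SOM training: find the winner, then pull the winner and its
    lattice neighbors toward the input, annealing eta and sigma over time."""
    rng = np.random.default_rng(0)
    W = rng.random((n_nodes, X.shape[1]))
    sigma0 = sigma0 if sigma0 is not None else n_nodes / 2.0
    lattice = np.arange(n_nodes)
    total, t = epochs * len(X), 0
    for _ in range(epochs):
        for x in X:
            frac = t / total
            eta = eta0 * (1.0 - frac)                   # annealed learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3        # annealed neighborhood
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            h = np.exp(-(lattice - winner) ** 2 / (2.0 * sigma ** 2))
            W += eta * h[:, None] * (x - W)             # Hebbian-style pull
            t += 1
    return W
```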


A typical example of an SOM is mapping a two-dimensional input space onto a one-dimensional SOM. Figure 2-8 shows a random distribution of points that make up the input space in two dimensions. The points are plotted such that the coordinates of each point represent the input data. When this input data is presented to the 1-D SOM, the map trains the nodes to maintain local neighborhoods in the input space. These local neighborhoods force a global ordering of the output nodes. After training, the nodes of the SOM are ordered and the weights of the nodes represent the center of mass of the region of input space to which they respond. By plotting the weights of the SOM PEs onto the input space, one can see where the center of each VQ cluster is located. The SOM is more than just a clustering algorithm, however; it also orders the PEs such that neighboring PEs respond to neighboring inputs. To show this, we connect neighboring PEs with a line. The right side of Figure 2-8 shows how the SOM maps a one-dimensional structure to cover the two-dimensional input space.

Figure 2-8: Example of a 1-D SOM mapping a 2-D input

This clearly shows how the global ordering has occurred


and that the 1-D output snakes its way through the input space in order to maintain its topographic ordering and still cover the input space.

Neural Gas

The neural gas algorithm is similar to the SOM algorithm without the imposition of a predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distances to the reference vectors, not on the distance to the winning PE in the lattice. The neural gas algorithm has been shown to converge quickly to distortion errors smaller than those of k-means, maximum entropy clustering, or the SOM algorithm [Mar93]. It has no predefined neighborhood structure as in the SOM, and for this reason it works better on disjoint or complicated input spaces. Martinetz et al. [Mar93] showed an interesting parallel between most of the major clustering algorithms: the main difference between them is how the neighborhood is defined. For k-means clustering there is no neighborhood; only the winner is trained. This is a hard max:

$$\Delta w_i = \varepsilon\, \delta_{i,i(x)}\, (x - w_i)$$

For maximum entropy clustering, the neighborhood is defined as a soft max based on the distance to the input:

$$\Delta w_i = \varepsilon\, h\big(\lVert x - w_i \rVert\big)\, (x - w_i)$$

For the SOM, the neighborhood is based on the position in the SOM lattice:

$$\Delta w_i = \varepsilon\, h\big(d_{\text{lattice}}(i, i(x))\big)\, (x - w_i)$$

and for the neural gas algorithm, the soft max is based on the ranking of the nodes. For instance, the closest node gets the largest update, followed by the second closest, and so on:

$$\Delta w_i = \varepsilon\, h\big(k_i(x)\big)\, (x - w_i)$$

where $k_i(x)$ is the rank of node i when the nodes are sorted by distance to the input x.
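A minimal sketch of this rank-based update (my own, with lam denoting the soft-max decay over ranks) makes the contrast with the lattice-based SOM update clear:

```python
import numpy as np

def neural_gas_update(W, x, eta, lam):
    """Neural gas step: every node is updated by an amount that decays with
    its distance *rank*; no lattice or predefined neighborhood is assumed."""
    dists = np.linalg.norm(W - x, axis=1)
    ranks = np.argsort(np.argsort(dists))     # rank 0 = closest node
    h = np.exp(-ranks / lam)
    W += eta * h[:, None] * (x - W)
    return W
```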


Supervised Learning

In supervised learning, the error between the network output and the desired signal is propagated backward through the network to adapt each of the weights in the system. This is graphically depicted in Figure 2-9.

Figure 2-9: Activation and error propagation in a static neural network

The output of each PE in an MLP can be described by the following equation:

$$y_j = f(net_j) = f\!\left(\sum_i w_{ji}\, x_i + b_j\right)$$

where $w_{ji}$ represents the weight from PE i to PE j, $x_i$ represents the output of PE i (or the external input for the first layer), $b_j$ represents the bias for PE j, and f() is the nonlinearity of the PE, which is typically a logistic function (ranging from 0 to 1) or a tanh function (ranging from -1 to 1). The performance surface that is searched using gradient descent is defined by

$$J = \frac{1}{2}\sum_{p=1}^{N}\sum_{i=1}^{m} e_{ip}^2$$

where $e_{ip}$ is the difference between the output and the desired signal, p is the index over the patterns, and i is the index over the output PEs. We want to update each weight based on the partial derivative of J with respect to that weight. Using the chain rule,

$$\frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial y_{ip}}\,\frac{\partial y_{ip}}{\partial net_{ip}}\,\frac{\partial net_{ip}}{\partial w_{ij}} = -(d_{ip} - y_{ip})\, f'(net_{ip})\, x_{jp} = -e_{ip}\, f'(net_{ip})\, x_{jp}$$

If we define the local error $\delta_i$ for the ith PE as

$$\delta_i(n) = e_i(n)\, f'(net_i(n))$$

then we can generalize the backpropagation algorithm for the MLP and the LMS algorithm for linear systems. All the weights in gradient descent learning are updated by multiplying the local error $\delta_i(n)$ by the local activation $x_j(n)$, according to Widrow's estimation of the instantaneous gradient first shown in the LMS rule:


$$\Delta w_{ij}(n) = \eta\, \delta_i(n)\, x_j(n)$$

The difference between these algorithms is the calculation of the local error. If the PE is linear, then we have a linear combiner, the derivative of f is a constant, and the equation becomes the LMS rule. If the PE is nonlinear and is an output PE, then the local error is simply the difference between the output and the desired signal, scaled by the derivative of the nonlinearity; this is the delta rule. If the PE is nonlinear and is a hidden-layer PE, then the error is the sum of the backpropagated errors from the PEs that follow it:

$$\delta_i(n) = f'(net_i(n)) \sum_k \delta_k(n)\, w_{ki}(n)$$

This simple rule nicely summarizes the backpropagation algorithm and shows its relationship to other adaptive algorithms.
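The three cases above can be read directly off a short sketch; this one is mine, for a single hidden layer, with f_prime standing for the derivative of the PE nonlinearity (hypothetical names throughout).

```python
import numpy as np

def local_errors(d, y_out, net_out, W_out, net_hid, f_prime):
    """Local errors for gradient descent training: the delta rule at the
    output PEs, and backpropagated deltas one layer back. Every weight is
    then updated as dw_ij = eta * delta_i * x_j."""
    delta_out = f_prime(net_out) * (d - y_out)            # output-layer delta rule
    delta_hid = f_prime(net_hid) * (W_out.T @ delta_out)  # sum of backpropagated errors
    return delta_out, delta_hid

# Example with tanh PEs: f'(net) = 1 - tanh(net)**2.
tanh_prime = lambda net: 1.0 - np.tanh(net) ** 2
```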


Second Order Methods

The standard backpropagation method of training a neural network uses the LMS approximation to gradient descent, which uses only an instantaneous estimate of the gradient. Second-order methods collect information over time to get a better estimate of the gradient, thus allowing for faster convergence at the cost of more computations per cycle. In linear adaptive filtering, the recursive least squares (RLS) algorithm is used for exactly this purpose. The RLS algorithm is based upon estimating the inverse of the correlation matrix of the input. With this information, the RLS algorithm can often adapt as much as ten times faster than LMS. The RLS algorithm can also be formulated as a special case of the Kalman filter. Besides faster convergence, the RLS algorithm has two other advantages [Hay96]: the eigenvalue spread of the correlation matrix does not adversely affect the training (unlike in LMS), and the learning rate is automatically determined (the Kalman gain). Since RLS and Kalman filtering are derived for linear systems, they must be modified for use with nonlinear systems; the modified algorithms are typically called extended RLS or extended Kalman filters. The most straightforward approach is to linearize the total cost function and directly apply RLS. This requires the storage and update of the complete error covariance matrix, whose size is the square of the number of weights in the network [Hay94]. A better approach is to apply RLS to each node individually and linearize the activation function of the PE using a Taylor series about the current operating point. This method is called the multiple extended Kalman algorithm (MEKA) [Hay94] and reduces the computational requirements by ignoring the cross-terms between PEs.
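For reference, here is a minimal sketch of one standard RLS step for a linear combiner, written by me from the textbook form of the algorithm (lam is the usual forgetting factor; P tracks the inverse input correlation matrix):

```python
import numpy as np

def rls_step(w, P, x, d, lam=0.99):
    """One recursive least squares update. The gain k plays the role of the
    Kalman gain, so the effective learning rate is set automatically."""
    k = P @ x / (lam + x @ P @ x)       # gain vector
    e = d - w @ x                       # a priori error
    w = w + k * e                       # weight update
    P = (P - np.outer(k, x @ P)) / lam  # inverse-correlation update
    return w, P, e
```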


Temporal Neural Networks

As stated previously, the majority of temporal neural networks are extensions of static neural networks, created either by adding memory or by adding recursive connections. This section could just as easily be called "Extending static architectures to include time". Again, this topic will be discussed in two sections, covering supervised and unsupervised neural networks.

Temporal Unsupervised Learning

This section presents the methodologies currently available for adding temporal information to unsupervised networks. Almost all work done on temporal unsupervised training has used self-organizing maps. As mentioned before, a self-organizing map (SOM) creates a topographic map of the input patterns, in which the spatial locations of the neurons in the lattice correspond to intrinsic features of the input patterns. In this structure, neurons are organized in a lattice where neighboring neurons respond to similar inputs. There have been many attempts at integrating temporal information into the SOM. One major technique is to add temporal information to the input of the SOM. For example, exponential averaging and tapped-delay lines were tested in [Kan90][Kan91], while coding in the complex domain was implemented in [Moz95]. Another common method is to use layered or hierarchical SOMs, where a second map tries to capture the spatial dynamics of the input moving through the first map [Kan90][Kan91]. More recently, researchers have begun integrating memory inside the SOM, typically with exponentially decaying memory traces. Privitera and Morasso have created a SOM with leaky integrators and thresholds at each node, which activate only after the pattern has been stable in an area of the map for a certain amount of time. This allows the map to pick out only the "stationary" regions of the input signal and use these sequences of regions to detect the input sequence [Pri93][Pri94][Pri96]. The SARDNET architecture [Jam95] adds exponential decays to each neuron for use in the detection of node firing sequences. Once a node fires for a particular sequence, it is not allowed to fire again. Therefore, at the end of the sequence presentation, the sequence of node firings can be detected (or recreated) using the decayed outputs of the SOM. The exponential decay, however, provides poor resolution at high depths and thus will perform poorly with noisy and/or long sequences.
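A sketch of the SARDNET-style readout, as I understand it from the description above (the variable names and the decay value are my own):

```python
import numpy as np

def sardnet_respond(W, seq, decay=0.9):
    """For each input, the closest still-eligible node fires at activation 1.0
    and is then excluded from later competitions; all activations decay each
    step, so the final activation pattern encodes the firing order."""
    act = np.zeros(len(W))
    eligible = np.ones(len(W), dtype=bool)
    for x in seq:
        act *= decay                                   # exponential decay
        dists = np.linalg.norm(W - x, axis=1)
        dists[~eligible] = np.inf                      # fired nodes cannot win
        winner = np.argmin(dists)
        act[winner] = 1.0
        eligible[winner] = False
    return act
```

Reading the activations in decreasing order recovers the firing sequence, which is exactly why resolution degrades once the decayed values of a long sequence become indistinguishable.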


Chappell and Taylor have created a SOM whose neurons hold activity on their surface via leaky integrator storage [Cha93]. This activity is added to the typical spatial distance between the input and weight vectors to determine the next winner. The same or neighboring nodes will thus be more likely to win the competition for successive elements in a sequence. This creates neighborhoods with sensitivity to the previous input (i.e. context). There is not yet a successful method available to train these networks: the learning law proposed by Chappell and Taylor can lead to an unstable weight space, and the methodology seems to work only for patterns of binary inputs of length at most 3. Critchley [Cri94] has improved the architecture by moving the leaky integration to the synapses. This gives the network a much better picture of the temporal input space and much more stable training, but it becomes nothing more than an exponentially windowed input to a standard Kohonen map, as proposed by Kangas [Kan90]. The temporal organization map (TOM) integrates a cortical column model, SOM learning, and separate temporal links to create a temporal Kohonen map [Dur96]. The TOM is split into super-units that are trained via the SOM learning algorithm. Winning units from each super-unit fire and then decay. Temporal links are made between the currently firing node and any node whose activity is above a threshold; thus multiple links can be created for each firing, allowing the pattern to skip states. Kohonen and Kangas have proposed the hypermap architecture to include context in the SOM architecture. Kohonen's original hypermap architecture included two sets of inputs and weights [Koh91]. The first set is a context vector that is a tapped delay line of the past and future pattern vectors. This input is used to determine a "context domain" in the SOM. All nodes in the context domain are labeled active and are then presented with


the current input pattern. The "pattern" weights and context weights are then trained in the typical SOM manner. Kangas extended this concept by eliminating the context weights and allowing only nodes in the vicinity of the last winner to be selected. This smoothes the trajectory of winning nodes throughout the map and allows context to affect the selection of the winner without the addition of parameters like the width of the context window [Kan92]. Kangas has also proposed an SOM architecture that has an LPC predictor at each node in the Kohonen net. This provides temporal pattern recognition by using a filter at each node, where the AR filters were trained via either genetic programming or gradient descent [Kan94]. Goppert and Rosenstiel conceptually extend this concept to include the notion of attention [Gop94a][Gop94b][Gop95]. The theory is that the probability of selecting a winner is affected either by higher cognitive processes (which may be considered a type of supervision) or by information from the past activations of the network. This gives two components to the selection of a winner: the extrasensory distance (context or higher processes) and the sensory distance (the normal distance from weight to input). These two components can be added or multiplied. They focus on the concept of context and create a moving area of attention, which is the region that has been activated most in the recent past. The center of attention moves as each winner is selected, and the region of attention has a Gaussian weighting applied to it so that nodes near the last winner will be more likely to fire the next time. The architecture outperformed the standard SOM on simple temporal tasks but did not train well on more complicated trajectories.
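Several of the schemes just described share one mechanism: the winner is chosen by combining the usual input distance with a decaying record of recent activity. Below is a minimal sketch of that mechanism, in the spirit of Chappell and Taylor's leaky activations and the attention weighting of Goppert and Rosenstiel; the mixing weight alpha and the decay value are my own hypothetical parameters.

```python
import numpy as np

def context_winner(W, x, act, alpha=0.5, decay=0.8):
    """Select a winner from the input distance minus a bonus for leaky-
    integrated past activity, so recently active regions are favored."""
    act = decay * act                                  # leaky integration
    score = np.linalg.norm(W - x, axis=1) - alpha * act
    winner = np.argmin(score)
    act[winner] += 1.0                                 # winner deposits activity
    return winner, act
```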


Temporal Supervised Neural Networks

The main problem with temporal supervised neural networks is the complexity of training them. When the desired architecture contains recurrent connections or memory in one of the hidden layers, the network must be trained with a temporal gradient descent algorithm. There are two distinct approaches to the problem: modifying the architecture to simplify the temporal gradient calculations, and creating better and/or faster methods of training temporal neural networks.

Architectural approaches

The focused time-delay neural network (TDNN) has memory added only at the first layer and is the simplest example of an architecture designed to avoid many of the complications of temporal neural networks. It is simply a static MLP with a tapped delay line between the input and the first layer. Because the memory is restricted to the first layer, the network can still be trained using static backpropagation. The tapped delay line maps a segment of the input trajectory into an N-dimensional static image that is then mapped by the MLP. This works quite well for many applications, but has a number of difficulties as mentioned previously. The main difficulties are the increased number of weights required for TDNNs (each input now requires m weights, where m is the number of taps in the delay line) and the inflexible, prewired nature of the tapped delay line. Some of the problems with TDNNs have been attacked by defining the connectivity between layers such that only certain regions of each layer are connected. In this way, certain regions of the input layer, corresponding to certain time periods of the input, can be connected to a single region of the second layer. This provides a more goal-directed architecture that can be time-shift or frequency-shift invariant. Although this can reduce the effects of the problems of TDNNs, the problems still remain and each network must be tailored to its application [Saw91][Haf90].
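A minimal sketch of a focused TDNN forward pass (the two-layer MLP, the tanh nonlinearity, and the layer shapes are illustrative assumptions):

    import numpy as np

    def tdnn_forward(signal, W1, b1, W2, b2, taps):
        """Focused TDNN: a tapped delay line feeds a static MLP.
        signal: 1-D input sequence; taps: number of delays (m)."""
        outputs = []
        for t in range(taps - 1, len(signal)):
            window = signal[t - taps + 1 : t + 1]    # time-to-space mapping
            h = np.tanh(W1 @ window + b1)            # hidden layer
            outputs.append(np.tanh(W2 @ h + b2))     # output layer
        return np.array(outputs)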


Two other networks deal with temporal information by using a very restrictive type of feedback. The Jordan network [Jor86] uses recurrency between the output and the input of the network: the output is fed back to a context unit, which is simply a leaky integrator. The Elman network [Elm90] provides feedback from the hidden layer to context units in the input layer. This is potentially more powerful than the Jordan network because it stores and uses the past state of the network, not just its past output. Although both networks are commonly found in the neural network literature, neither is particularly powerful or easily trained. Recurrent networks are also continuously being modified in an attempt to improve their performance on temporal problems. Mozer has proposed a "multiscale integration model" that uses recurrent hidden units with different time constants of integration, the slow integrators forming a coarse but global sequence memory and the fast integrators forming a fine-grained but local memory [Moz92]. This work, however, is based only on exponentially decaying memory, and the problem of selecting the time constants has not been solved (they must be hand tuned). A different spin on recurrent networks is the use of "higher order networks." These are recurrent networks where:

• hidden units represent states, and the outputs of these states are fed back and multiplied with the inputs of the nodes, allowing second-order statistics to be used [Gil91][Wat91]

• one network computes the weights for a second network [Pol91][Sch92a][Sch92b]

Higher order networks have proven to be excellent sequence (grammar) recognizers, but have failed to make a serious impact on temporal processing. These networks provide a representation for states in the neural network and allow the computation of higher-order statistics; a second-order network, for example, can compute the autocorrelation of the input, creating a translation-invariant architecture. The main disadvantage of this work is that for complex tasks, higher order networks require even more weights and have even more complicated performance surfaces than standard ANNs.
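A minimal sketch of a second-order recurrent state update in the spirit of [Gil91][Wat91] (the sigmoid choice and tensor shapes are assumptions):

    import numpy as np

    def second_order_step(state, x, W):
        """One step of a second-order recurrent network: the new state of
        node i depends on products of current states and current inputs.
        W has shape (n_states, n_states, n_inputs)."""
        # s_i(t+1) = sigmoid( sum_jk W[i,j,k] * s_j(t) * x_k(t) )
        net = np.einsum('ijk,j,k->i', W, state, x)
        return 1.0 / (1.0 + np.exp(-net))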


Algorithmic approaches

There are two fundamental methods of computing the gradient for a dynamic neural network. First, the gradients can be computed in the backward direction, similar to static backpropagation in feedforward networks. Unfolding a recurrent network in time creates a large, static, feedforward network in which each "layer" is an instance of the recurrent network at one time step. Backpropagation can then be applied to this large feedforward network to compute the gradient. This is called backpropagation through time (BPTT) [Rum86]. The main shortcoming of this technique is that it is non-causal: the BPTT algorithm must be used in batch mode, with the data first traveling in the forward direction while the entire state of the network is saved at each step; the error is then backpropagated in reverse temporal order. A secondary shortcoming of BPTT is the memory required to store the state of the network at each iteration. Many alterations have been made to the BPTT algorithm to improve its utility, in particular to make it usable as an on-line algorithm. Williams and Peng [Wil90] used a history cutoff, assuming that gradient information from the distant past is relatively inconsequential and can be ignored. Combining this with a small step size, the resulting algorithm, BPTT(k), can be used in an on-line manner. See Pearlmutter [Pea95] for a review of this and other techniques.

The second fundamental method of computing the gradient for a recurrent neural network computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node with respect to each weight at every iteration. The method is completely on-line and simple to implement. Its main difficulty is computational complexity: if we define n as the number of PEs and m as the number of weights, the computation of the gradients of each PE with respect to each weight is O(n²m) per step. For a fully recurrent network this dominates the computational complexity and requires O(n⁴) computations per step. The algorithm works quite well on small networks, but the n⁴ factor becomes overwhelming as the number of nodes increases. The RTRL algorithm for a recurrent network can be summarized by the following set of equations [Hay94]. First, we define set A as the set of all inputs, set B as the set of all PEs, and set C as the set of outputs with desired signals. The forward activation equations are

$$net_i(n) = \sum_{l \in A \cup B} w_{il}(n)\, u_l(n), \qquad y_i(n+1) = \varphi(net_i(n)),$$

where u represents the input vector at each time step and is composed of both the external inputs and the outputs of each PE (the values of the feedback). The gradient descent technique is based upon computing the sensitivity of each PE with respect to each weight. The weights are updated on-line using these sensitivities:

$$\Delta w_{kl}(n) = -\eta\, \frac{\partial e(n)}{\partial w_{kl}(n)} = \eta \sum_{j \in C} e_j(n)\, \frac{\partial y_j(n)}{\partial w_{kl}(n)}.$$

For implementation, we create a matrix $\pi$ that represents these sensitivities and write an update equation for it:

$$\pi_{kl}^{j}(n) = \frac{\partial y_j(n)}{\partial w_{kl}(n)}, \qquad j \in B,\; k \in B,\; l \in A \cup B,$$

$$\pi_{kl}^{j}(n+1) = \varphi'(net_j(n)) \left[ \sum_{i \in B} w_{ji}(n)\, \pi_{kl}^{i}(n) + \delta_{jk}\, u_l(n) \right], \qquad \pi_{kl}^{j}(0) = 0.$$

$\pi$ is a matrix of gradients with the rows representing weights and the columns representing nodes; it thus contains mn elements.
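A minimal numpy sketch of the sensitivity recursion above (a single fully recurrent layer, tanh nonlinearity, and the error convention e = d − y are assumptions):

    import numpy as np

    def rtrl_step(W, u, P, e, lr=0.01):
        """One RTRL step for a fully recurrent net.
        W: (n, n+m) weights; u: (n+m,) concatenated [feedback y, external inputs];
        P: (n, n, n+m) sensitivities P[j,k,l] = dy_j / dw_kl; e: (n,) errors."""
        net = W @ u
        y_new = np.tanh(net)
        dphi = 1.0 - y_new**2                          # tanh'(net)
        n = W.shape[0]
        # pi[j] <- phi'(net_j) * ( sum_i W[j,i]*pi[i] + delta_jk * u_l )
        P_new = np.einsum('ji,ikl->jkl', W[:, :n], P)  # feedback term (first n columns of W)
        P_new += np.einsum('jk,l->jkl', np.eye(n), u)  # delta_jk * u_l
        P_new *= dphi[:, None, None]
        # on-line weight update from the sensitivities and the output errors
        dW = lr * np.einsum('j,jkl->kl', e, P_new)
        return W + dW, y_new, P_new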


Many methods have been proposed to increase the speed of RTRL. Schmidhuber and others have mixed BPTT and RTRL, which reduces the complexity to O(nm) [Sch92]. This technique takes blocks of BPTT and uses RTRL to encapsulate the history before the start of each block. Sun, Chen, and Lee have developed an O(nm) on-line method based on a Green's function approach [Sun92]: by solving an auxiliary set of equations, the redundancies in the computation of the sensitivities over time can be removed. Zipser approached the problem in a different way and reduced the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs [Zip89]. The PEs are grouped arbitrarily, and sensitivities between groups are ignored. If the size of the subgroups remains constant, this reduces the complexity of the RTRL algorithm to O(m). This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm and sometimes requires more PEs than standard RTRL to converge.

Second order methods

Many researchers argue that simple gradient descent [Moz94] is not sufficiently powerful to discover the sort of relationships that exist in temporal patterns, especially those that cover long time sequences or involve high-order statistics. Bengio, Frasconi, and Simard [Ben93] also present theoretical arguments for the inherent limitations of learning in recurrent networks. Many researchers have recently started using the extended Kalman filter algorithm, which is very similar to the RLS algorithm, for training dynamic neural networks. As described previously, the extended Kalman filter uses information from the correlation matrix, accumulated over time, to better approximate the direction to the bottom of the performance surface. Again, the problem with the extended Kalman filter is that it requires the computation and storage of the correlation matrix between the weights of the system; the computational requirement is O(N²) in the number of weights. The standard method of reducing this computational load is to decouple the PEs of the network, such that the correlation matrix is only computed between weights that terminate at the same PE.


Puskorius and Feldkamp [Pus94] call this method the decoupled extended Kalman filter (DEKF) algorithm. The main difference between the dynamic version of the RLS/EKF algorithm and the static version is that the gradients used in the second-order calculation are the dynamic gradients, not the static gradients; the BPTT or RTRL algorithms must still be used to compute them.

Sequence Recognition

There are two broad categories of temporal problems typically addressed in the literature: sequence recognition and temporal pattern processing. Sequence recognition is typically a process of recognizing (and often reproducing) discrete symbolic sequences; such problems typically focus on recognizing grammars and symbolic patterns. Temporal pattern processing, by contrast, involves the recognition, identification, control, or other processing of a continuous signal that varies with time. Speech recognition is an example of temporal pattern processing. The continuous signal can be vector quantized and turned into a symbolic pattern, but it is not practical to then treat it as a sequence recognition problem: temporal patterns of interest are difficult to quantize accurately and typically exhibit various forms of time warping and noise that make sequence recognition of quantized temporal patterns nearly impossible. Since the emphasis of this proposal is not sequence recognition, we only briefly introduce a few interesting neural networks that accomplish this task. Wang and Arbib have proposed models based on the two dominant theories of "forgetting": the decay theory, in which memories decay from the time they are entered [Wan90], and the interference theory, in which memory decays only when new inputs that must be remembered arrive [Wan93]. Both architectures are based on a winner-take-all field of neurons in which the winning node fires and is then decremented slowly. The sequence is detected using an extra "detector unit" trained by the Hebbian rule using attentional learning. The main difficulty with these and other sequence recognizers is that they tend to be intolerant of the time warping and missing or noisy data that are prevalent in temporal pattern recognition. The outstar avalanche was an early neural network used to learn and generate temporal patterns [Gro82]. It is composed of N sequential outstars, each of which detects an input and triggers the next in a chain, producing an avalanche effect. This architecture was modified to include the combined effect of the input dot product and the avalanche input from preceding nodes, and was called the spatio-temporal network (STN) [Fre91]. The sequential competitive avalanche field (SCAF) [Hec86] is a further extension of the STN in which each node has lateral interconnections, allowing the outstars to be competitive.

Comparison of Hidden Markov Models with ANNs

Due to the difficulties of modeling sequential structure with ANNs, hidden Markov models have become the gold standard for modeling many temporal processes (e.g., speech). Time-sequence matching is a major problem in applying neural nets to temporal, non-stationary processes. Although ANNs have been successfully applied to time series prediction [Wei94], they have not been as successful in tasks that have synchronization problems such as time warping. For example, different utterances of the same word can have very different timescales; both the overall duration and the details of timing can vary greatly. ANN models for speech have been shown to yield good performance only on short, isolated speech units (e.g., phoneme detection); they have not been shown to be effective for large-scale recognition of continuous speech. The TDNN, for example, has powerful methods for dealing with local dynamic properties but cannot deal with sequences explicitly. The HMM provides a compact, tractable mechanism for handling this temporal information by including explicit state information. Various neural network techniques have attempted to add state information, typically via feedback, but have been successful only on modest-size applications. HMMs are stochastic in nature and thus can succeed even when the temporal nature of the system is locally very noisy. Speech patterns, for example, are to some extent a sequential process, yet they are sufficiently ambiguous locally that it is not adequate to make decisions locally and then process sequences of symbols. Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified; that is, the temporal information is stored in the current state of the system, and all relevant temporal information must be representable in this way (there is no other memory in the system). The second assumption is that the outputs depend stochastically only on the state of the system.
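Stated formally (with hidden states $q_t$ and observations $o_t$, a notation introduced here for illustration), the two assumptions are

$$P(q_t \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t \mid q_{t-1}),$$

$$P(o_t \mid q_t, q_{t-1}, \ldots, q_1, o_{t-1}, \ldots, o_1) = P(o_t \mid q_t).$$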


The two main advantages of ANNs over HMMs are that ANNs are discriminative and that ANNs do not rely on the Markov assumptions. Typically, HMMs are trained using a within-class method (each model is trained only on in-class, segmented data). ANNs, however, can be trained to find the differences between classes; they can thus discriminate between classes, not just detect or model them. ANNs also place few restrictions on the systems they can model, whereas HMMs assume that the observations are independent and that the underlying process is a Markov process. New methods that marry the discriminative power of the ANN with the temporal nature of the HMM have been relatively successful [Bou90].

CHAPTER 3
TEMPORAL SELF-ORGANIZATION

Introduction and Motivation

As described in the previous chapters, working with temporal patterns has been a very difficult task for neural networks. This difficulty arises largely because the methodologies applied to temporal processing are simple extensions of static neural networks, with little regard for the unique nature of time and time-based signals. Most of these architectures simply add memory to a well-known static network; they can achieve reasonable performance on simple problems but do not perform as well on more complex ones. As in the 1980s, when pattern recognition and classification drove the research community to develop neural networks, biological systems still easily outperform state-of-the-art solutions to temporal processing problems. For this reason, I began researching biological neural networks and biological mechanisms that might help us better solve these problems. As my research progressed, two key aspects continually resonated with my underlying goal of creating better neural networks for temporal pattern processing. These two elements are the self-organization of similar or correlated cells into clusters or neighborhoods (similar to place cells in the hippocampus), and the diffusion of information over time and space. Self-organization describes a system in which each individual entity has only simple local rules governing its behavior. These simple local rules, however, can create global organization without any global control. Self-organization applies at virtually every scale of the universe, from neurons and brain cells, to insect populations, to solar systems and galaxies. It is tremendously important in the formation of the brain and, in my opinion, is greatly underutilized in artificial neural networks. The second element is diffusion. Like self-organization, diffusion is found everywhere. It can be derived from simple random Brownian motion (simple local rules as well), where particles and other objects move from areas of high density to areas of low density. Diffusion itself is a rather simple concept that may not appear to add much to neural network theory. However, when diffusion is added to a dynamical system (for instance, in the reaction-diffusion equations), the resulting system can exhibit tremendously interesting and powerful dynamics.

The Model

Most temporal neural networks use short-term memory to transform time into space. This time-to-space mapping is usually the only mechanism for dealing with temporal information: the neural network operates as if the temporal pattern were simply a much larger spatial pattern. This is clearly inefficient. My method uses diffusion to create self-organization in time and space. The idea is to leave the fundamentals of the neural network unchanged (in order to use the theory and knowledge we have already accumulated) but to add self-organization in space-time to the PEs of the network. By creating temporally correlated neighborhoods in the field of PEs making up the network, the basic functionality of the network becomes more organized and temporally sensitive, without drastically changing its underlying operation. The mechanism for the creation of these temporally correlated neighborhoods is diffusion. In the brain, NO is given off by firing neurons and diffuses throughout the surrounding tissue. NO has also been shown to affect the sensitivity of the neuron to synaptic changes (e.g., weight changes in neural networks). It has been theorized that this diffusion of NO may be responsible for the creation of place cells and other organization in the brain. In a more abstract sense, the diffusion of NO can be considered the diffusion of neural activity: when a large group of neurons fires in close proximity (both temporally and spatially), a local build-up of NO probably occurs and diffuses outward. In my architectures, I use this concept of activity diffusion to create the temporally correlated neighborhoods. When a PE or group of PEs fires, it influences its neighbors, typically lowering their thresholds so that they are more likely to fire in the near future. Because the underlying mechanism of most neural network training is Hebbian in nature, when neighboring PEs fire in a correlated fashion, they tend to continue to fire in a correlated fashion. This creates the temporally correlated neighborhoods and the self-organization in space-time. I have applied this concept to three different ANN architectures. The first is based on the self-organizing map (SOM) and is the most biologically inspired. The second is based on the neural gas algorithm, which provides a more powerful but functionally similar solution. Lastly, to prove the robustness of the method, I applied it to the training of recurrent MLPs. MLPs are a totally different architecture and are trained in a totally different manner (supervised rather than unsupervised). The MLP is not biologically relevant, but the temporal self-organization method still proved to decrease training times dramatically. The rest of this chapter is divided into three sections based on these architectures. It is arranged chronologically so that the presentation flows more smoothly, even though the MLP architecture may be the most useful of the three. This chapter presents only the theoretical derivation of each architecture and a simple illustrative example for each; detailed applications of each method to more practical problems are presented in the next chapter.

Temporal Self-Organization in Unsupervised Networks

This section describes the two unsupervised networks to which I have applied the concept of temporal clustering. The first architecture is based on the self-organizing map and is called the self-organizing temporal pattern recognizer (SOTPAR). The second architecture is based on the neural gas algorithm and is called the SOTPAR2.

Temporal Activity Diffusion Through a SOM (SOTPAR)

The self-organizing temporal pattern recognizer (SOTPAR) [Eul96a][Eul96b] is a biologically inspired architecture for embedded temporal pattern recognition (finding patterns in an unbounded input sequence without segmentation or markings). This is a difficult task, since the patterns must be searched for from every possible starting point. Although the SOTPAR architecture is unsupervised and thus cannot be used efficiently as a pattern recognition device by itself, it preprocesses the input such that patterns commonly found in the training data are easily detectable from its output. Most of the emphasis in this work is on the proper temporal representation of the spatio-temporal data.

The SOTPAR architecture adds two temporal characteristics to the SOM architecture: activity diffusion through the space of output PEs, and the temporal decay of activations. Using these concepts, the SOTPAR converts and distributes the temporal information embedded in the input data into spatial connections and ordered PE firings in the network, all using self-organizing principles. Like self-organizing maps, the network uses competitive learning with neighborhood functions [Koh82]. In the SOM, the input is simultaneously compared to the weights of each PE in the system, and the PE with the closest match between the input and its stored weights is the winner. The winner and its neighbors are then trained in a Hebbian manner, which brings their weights closer to the current input. The key concept in the SOTPAR architecture is the activity diffusion through the output space. The firing of a PE causes activity to diffuse through the network, affecting both the training and the recognition behavior of the network. In the SOTPAR, the activity diffusion moves through the lattice of the SOM structure and is modeled after the reaction-diffusion equation [Mur89]

$$\frac{\partial m_i(x,t)}{\partial t} = f(m_i(x,t)) + D\, \frac{\partial^2 m_i(x,t)}{\partial x^2},$$

where $m_i$ can be considered the activity of PE i, $f(\cdot)$ the current match, and the second-derivative term the diffusion of activity over space and time. If the system is an "excitable medium" (a multi-stable dynamical system), the diffusion of activity can create traveling pulses or wavefronts in the system. When the activity diffusion spreads to neighboring PEs, their thresholds are lowered, creating a situation where the neighboring PEs are more likely to fire next.
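As a minimal sketch, a discretized 1-D version of this activity diffusion (the explicit Euler step, the periodic boundary, and the diffusion constant are assumptions):

    import numpy as np

    def diffuse_activity(m, match, D=0.2):
        """One explicit step of a discretized 1-D reaction-diffusion update:
        activity reacts to the current match and diffuses to its neighbors."""
        lap = np.roll(m, 1) + np.roll(m, -1) - 2 * m   # discrete Laplacian (periodic ends)
        return m + match + D * lap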


I define enhancement as the amount by which a PE's threshold is lowered. In the SOTPAR model, the local enhancement acts like a traveling wave. This significantly reduces the computation of the diffusion equations and provides a mechanism by which temporally ordered inputs will trigger spatially ordered outputs; this is the key aspect of the network architecture. The traveling wave decays over time because of competition for limited resources with other traveling waves. It can remain strong only if spatially neighboring PEs are triggered by temporally ordered inputs, in which case the traveling waves are reinforced. For a simple one-dimensional case, Figure 3-1 shows the enhancement for a sequence of spatially ordered winners (in order: PE1, PE2, PE3, PE4) and for a sequence of random winners (in order: PE4, PE2, PE1, PE5), which would be the case if the input were noise or unknown.

[Figure 3-1: Temporal activity in the SOTPAR network. (a) Activity created by temporally ordered input; (b) activity created by unordered input.]

In the ordered case, the enhancement lowers the threshold of PE5 dramatically more than those of the other PEs, making PE5 likely to win the next competition. In the unordered case, the enhancement becomes weak and affects all PEs roughly evenly. The second temporal functionality added to the SOM is the decay of output activation over time, which is also biologically realistic [Cha93]. When a PE fires or becomes active, it maintains an exponentially decaying portion of its activity after it fires. Because the PE's activity decays gradually, the wavefront it creates is spread out over time rather than being a simple traveling impulse. This spreading creates a more robust architecture that can gracefully handle both time warping and missing or noisy data. The decay of activity also suggests another biological explanation for the movement of the enhancement through the network: if we define a neighborhood around a neuron as one in which it has strong excitatory connections with its neighbors, then the decaying activity of a neuron that fired in the past will help to fire (or lower the threshold of) its neighboring PEs.

Algorithm description

To simplify the description of the algorithm, I will use 1-D maps and let the activity propagate in only one direction, since diffusion of the activity is severely restricted in the one-dimensional case. The output space can thus be considered a set of PEs connected by a string, with information passed between PEs along this string. The activity/enhancement moves in the direction of increasing PE number and decays at each step. An implementation of the activity diffusion in one string is shown in Figure 3-2; it includes the activity decay at each PE and the activity movement through the net in the left-to-right direction. The factors μ and (1−μ) are used to normalize the total activity in the network.

[Figure 3-2: Model for activity diffusion in one string of the SOTPAR.]

This activity diffusion mechanism serves to store the temporal information in the network. During training, the PEs become spatially ordered to sequentially follow any temporal sequences presented. At each iteration, the activity of the network is determined by calculating the distance between the input and the weights of each PE and allowing for membrane potential decay:

$$act(t,x) = act(t-1,x)\,(1-\mu) + dist(inp(t), w_x)\,\mu,$$

where act(t,x) represents the activity at PE x at time t, and dist(inp(t), w_x) represents the distance between the input at time t and the weights of PE x. Typically the activity is thresholded and enhanced before being propagated, for example

$$act' = \max(act - 0.5,\, 0) \times 2.$$

Next, the winning PE is selected by

$$winner = \arg\max_x\,(act + \beta \cdot enhancement),$$

where the enhancement is the activity being propagated from the left. The parameter β is the spatio-temporal parameter that determines the amount by which a temporal wavefront can lower the threshold for PE firing. By increasing β, the thresholds of neighboring PEs can be lowered to the point where the next winner is almost guaranteed to be a neighbor of the current winner, forcing the input patterns to be sequential in the output map. It is interesting to note that as β → 0 the system operates like a standard SOM, and as β → ∞ it operates like an avalanche network [Gro82]. Once the winner is selected, it is trained along with its neighbors in a Hebbian manner with normalization:

$$w_x = w_x + \eta \cdot neigh(x) \cdot (inp(t) - w_x),$$

where the neighborhood function neigh(x) defines the closeness to the winner (typically a Gaussian function), and η is the learning rate. In our current implementation, the spatio-temporal parameter, the learning rate, and the neighborhood size are all annealed for better convergence.
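A minimal sketch of one SOTPAR iteration combining the equations above (the similarity measure, the exact enhancement-propagation rule, and all parameter values are assumptions):

    import numpy as np

    def sotpar_step(x, W, act, enh, mu=0.5, beta=0.3, eta=0.1, sigma=1.0):
        """One iteration: update activity, pick a winner with enhancement,
        train the weights, and propagate the enhancement left-to-right."""
        match = 1.0 - np.linalg.norm(W - x, axis=1)    # assumed similarity measure
        act = act * (1 - mu) + match * mu
        act_thr = np.maximum(act - 0.5, 0.0) * 2.0     # threshold and enhance
        winner = int(np.argmax(act_thr + beta * enh))
        # Hebbian update with a Gaussian neighborhood around the winner
        idx = np.arange(len(W))
        neigh = np.exp(-(idx - winner) ** 2 / (2 * sigma ** 2))
        W += eta * neigh[:, None] * (x - W)
        # enhancement wave: decayed activity moves one PE to the right
        enh = np.roll(act_thr * mu + enh * mu, 1)
        enh[0] = 0.0
        return W, act, enh, winner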


Representation of memory

The activity diffusion in this network creates a unique spatio-temporal memory that stores and distributes the temporal information in the network itself. Most short-term memory structures can be described by convolving the input sequence with a kernel that describes the structure of the memory. This kernel is typically one-dimensional and describes the temporal features of the memory, i.e., its depth. The SOTPAR's memory is implemented in its "enhancement," which moves through time and space; the SOTPAR memory kernel is therefore spatio-temporal and must be described in at least two dimensions. There are two slightly different ways to implement the temporal enhancement in a 1-D SOTPAR; the difference lies in the decaying-exponential portion. In method 1, only the activity at each node is decayed; the contributions from the wavefronts do not contribute to the time-dependent behavior of each node. The equations for this system are

$$E(n,t) = E(n-1,t-1)\,\mu + A(n,t),$$

$$A(n,t) = A(n,t-1)\,(1-\mu) + In(n,t),$$

where E(n,t) is the enhancement at node n at time t, A(n,t) is the activity at node n at time t, and In(n,t) is the matching result between the input and the weights of node n at time t. Expanding these equations gives

$$E(n,t) = E(n-1,t-1)\,\mu + A(n,t-1)\,(1-\mu) + In(n,t) = \cdots = \sum_{k}\sum_{\tau} In(n-k,\, t-k-\tau)\, \mu^{k}\, (1-\mu)^{\tau}.$$

This equation shows how the results of the matching activity (called the "input," for lack of a better word) contribute to the enhancement. The traveling waves create two decaying exponentials, one moving through space ($\mu^k$) and one moving through time ($(1-\mu)^\tau$). The past history of the node is added to the enhancement via the recursive self-loop in (1−μ); the wavefront motion is added via the diagonal movement through the left-to-right channel scaled by μ. The farther a node is off the diagonal and the farther back in time, the less influence it has on the enhancement. The SOTPAR enhancement equation is similar to the gamma memory impulse response for tap n,

$$g_n(t) = \binom{t-1}{n-1}\, \mu^{n}\, (1-\mu)^{t-n},$$

and by substituting t − n for τ in the SOTPAR equation, the two equations become even more similar. The SOTPAR enhancement, however, is not an impulse response: the SOTPAR allows input at each element of the memory structure, unlike the gamma memory, which is a generalized tapped delay line, so the input at different times and spatial locations is required to describe the enhancement (i.e., an impulse response does not represent the desired information). In summary, the SOTPAR enhancement is a spatially distributed gamma memory with inputs at each tap. The second method for implementing the enhancement is to allow the enhancement itself to pass through the self-feedback at each node. This allows an input to add to the enhancement multiple times by following different paths through the network. For example, In(n−1, t−2) can reach E2(n,t) either by looping first at node n−1 and then moving to position (n,t), or by first moving to position (n,t−1) and then looping at node n until (n,t). The equation for the enhancement in this case is

$$E2(n,t) = E2(n-1,t-1)\,\mu + E2(n,t-1)\,(1-\mu) + In(n,t) = \cdots = \sum_{k}\sum_{\tau} In(n-k,\, t-k-\tau)\, \mu^{k}\, (1-\mu)^{\tau}\, (\tau+1).$$

This method of enhancement increases the contribution of the off-diagonal elements via the term (τ+1) and allows more flexibility for non-sequential node firings. The two enhancement techniques are shown for two values of μ in Figure 3-3 and Figure 3-4. Both figures show enhancement method 1, enhancement method 2, and the difference between method 2 and method 1, which illustrates the increased influence of the off-diagonal elements. The figures also illustrate the effect of μ on the enhancement. With μ = 1, the time decay at each node is disconnected and the enhancement moves only from node to node; with μ = 0, the spatial movement of the enhancement is disconnected and only node decay contributes to the enhancement.

[Figure 3-3: Enhancement in the network with μ = 0.5 (enhancement method 1, enhancement method 2, and their difference, plotted over node and time).]

Lower values of μ create a broader enhancement, while higher values create narrower enhancement waves in which almost all of the activity moves from one node to the next (down the diagonal of time and space). This can be seen in the figures as a much sharper contribution to the enhancement along the diagonal as μ moves from 0.5 to 0.75.

[Figure 3-4: Enhancement in the network with μ = 0.75.]

Another possible approach is to decouple the two exponentials μ and (1−μ). This would require external normalization to keep the enhancement from growing without bound, but would provide more flexibility.

A simple illustrative example

A simple, descriptive test case involves an input composed of two-dimensional vectors randomly distributed between 0 and 1. Embedded in the input are 20 'L'-shaped sequences located in the upper right-hand corner of the input space (from [0.5,1.0] → [0.5,0.5] → [1.0,0.5]). Uniform noise between −0.05 and 0.05 was added to the target sequences.
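A minimal sketch of how such a training set could be generated (the corner points and noise level follow the description above; the five-point sampling of the 'L', the amount of filler noise, and the seed are assumptions):

    import numpy as np

    def make_input(n_sequences=20, noise_between=30, seed=0):
        """Random 2-D vectors with embedded, slightly noisy 'L'-shaped sequences."""
        rng = np.random.default_rng(seed)
        L = np.array([[0.5, 1.0], [0.5, 0.75], [0.5, 0.5],
                      [0.75, 0.5], [1.0, 0.5]])          # assumed 5-point 'L'
        data = []
        for _ in range(n_sequences):
            data.append(rng.random((noise_between, 2)))  # random filler
            data.append(L + rng.uniform(-0.05, 0.05, L.shape))
        return np.vstack(data)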


When a standard 1-D SOM maps this input space, it places the PEs without regard to temporal order; it simply needs to cover the 2-D input space with its 1-D structure. To show how this happens, we plot an 'X' at the position in the input space represented by the weights of each PE (recall that the weights of each PE are the center of the Voronoi region containing the inputs that trigger that PE). Since the neighborhood relationship between PEs is important, we connect neighboring PEs with a line. In a 1-D SOM the result is a "string" of PEs, and this string is stretched and manipulated by the training algorithm so that the entire input space is mapped with minimum distortion error while the neighborhood relationships are maintained (the string cannot be broken). The orientation of the output is not important, as long as it covers the input with minimal residual energy. A typical example is shown on the left side of Figure 3-5; note the slightly higher density of the input in the 'L'-shaped region. When the SOTPAR temporal activity is added to the SOM, the mapping has the additional constraint that temporal neighbors (sequential winners) should fire sequentially.

[Figure 3-5: One-dimensional mapping of a two-dimensional input space, with (right) and without (left) spatio-temporal coupling.]

Thus, the string should not only cover the input space but also follow prevalent temporal patterns found in the input. This is shown on the right side of Figure 3-5: sequential nodes have aligned themselves to cover the 'L'-shaped temporal patterns found in the input. Although recall was not the main goal in creating the spatio-temporal SOM, it is possible once the first few samples of a sequence have been input to the network: the rest of the pattern can be determined by following the sequence of nodes in the SOM, although the length of the sequence is not readily determined from the map. With a single string, the network can be trained to represent a single pattern or multiple patterns. Multiple patterns, however, require the string to be long, and a long string may be difficult to train properly since it must weave its way through the input space, moving from the end of one pattern to the beginning of the next. Additional flexibility can be gained by breaking the large string into several smaller strings. Multiple strings can be considered a 2-D array of output nodes with a 1-D neighborhood function. This allows the network to follow either multiple trajectories or long, complicated trajectories in a simplified manner. Figure 3-6 shows an example of the storage of two temporal patterns with two strings. The left plot shows the input space, which consists of two-dimensional input vectors. Two 8-point temporal patterns (diagonal lines: bottom-left to top-right and bottom-right to top-left) are intermixed with random noise in the input; the diagonal lines are drawn in for clarity. Between each pattern there is random noise. This problem can be thought of as a motion-detection problem across a visual topographic map: a number of strings could be trained to detect motion in a variety of directions and orientations. On the left side of Figure 3-6, the trained strings are shown as sequences of 8 PEs represented as 'X's (the 'O' PE denotes the beginning of the string), with neighboring PEs connected by lines. As one can see from this figure, the memory structure was able to extract the predominant temporal features of the input data. The right side of Figure 3-6 shows a graphical representation of the sequence of winning PEs after training: the horizontal axis is time, the vertical axis is the number of the winning PE, and the input signal is labeled along the top of the plot. This plot clearly shows that the patterns elicit sequential PE firings (smooth diagonal lines), whereas the random noise between patterns causes random output firings. Notice also that the temporal information is crucially important in the training of the memory, especially at the center of the figure, where the next point could lie in one of two possible directions; this ambiguity is responsible for the misalignment of the PEs near the center of the input space.

[Figure 3-6: The storage of two temporal patterns in a memory network. Left: input space and output mapping; right: winning PEs over time, labeled Seq1 / noise / Seq2 / noise / Seq1.]

Figure 3-7 shows an example of how the network gracefully handles time warping. In this example, the input was as in the previous example except that the target sequences were warped to lengths 6, 8, and 10. The network mapped two 6-PE strings to the diagonal targets, as shown in the left side of the figure. The right side of the figure shows the winning nodes for the three different sequence sizes: the first two are length 6, the second two are length 8, and the last one is length 10. The strings stretch to cover the entire pattern, and certain PEs fire more than once for a longer sequence, extending the time that can be covered by the string. In general, if the network is trained with time-warped data, it will tend to represent the target trajectories with the minimum number of nodes (the shortest pattern); it will still respond to longer patterns by having certain nodes win multiple times.

[Figure 3-7: Time warping: diagonal targets covered by smaller sequences. Left: input space and output mapping; right: winning nodes over time, labeled S1 / n / S2 / n / S1 / n / S2 / n / S1.]

The left and middle plots of Figure 3-8 show the traveling activity over time and space for the above example; the left side shows the activity for string 1 and the center the activity for string 2. These two plots clearly show how the traveling activity builds up and reinforces the sequential firing of the output PEs (i.e., when a target sequence is presented, the activity builds up and moves along the string). The right side of Figure 3-8 shows the maximum traveling activity for string 1 (solid) and string 2 (dashed). In a simple system, this plot shows how a simple threshold on the traveling activity could be used to detect the target sequence.

[Figure 3-8: Enhancement over time for string 1 and string 2, and the maximum enhancement over time for both strings.]

SOTPAR summary

The SOTPAR methodology creates an array of PEs that self-organizes in space-time with the help of temporal information. The system is trained in an unsupervised manner and self-organizes so that sequences seen during training are mapped into unique sequential spatial firings of the PEs at the output. The output space is similar to a topographic map except that it maps both the temporal and the spatial information. The network embeds the temporal and input data into one output space with both temporal and spatial locality. Instead of the standard time-to-space mapping produced by most short-term memories, the SOTPAR produces a time-to-"time and space" mapping. The representation is distributed throughout the self-organizing network and is stored not only in the activations of the PEs but also in their connectivity and weights. It is a radical departure from typical neural network architectures with memory, yet it is actually more biologically plausible. The SOTPAR is a unique combination of short-term and long-term memory. It contains short-term memory because the activations of the network can represent a general input sequence. The interesting part of the SOTPAR, however, is that it also has attributes of a long-term memory: it stores commonly found input patterns in the network weights and produces enhanced responses to these temporal inputs. Known sequences produce an ordered response in a specific area of the output space. This is a discriminant mapping, because only known sequences produce an ordered response. The sequential firing facilitates the recognition of temporal patterns by subsequent processing layers, and the network can gracefully handle time warping.

Temporal Activity Diffusion in the Neural Gas Algorithm (SOTPAR2)

The SOTPAR2 network was developed to overcome a few difficulties with the original SOTPAR network. The main difficulty with the SOTPAR is the SOM on which it is built: the SOM's neighborhood lattice restricts both the movement of a trajectory through the output space of the network (e.g., the distance between successive inputs) and the number of neighbors of each PE. For these reasons, the neural gas algorithm is used as the basis for the SOTPAR2 architecture. The neural gas algorithm is similar to the SOM algorithm but imposes no predefined neighborhood structure on the output PEs. The neural gas PEs are trained with a soft-max rule, but the soft-max is applied based on the ranking of the distances to the reference vectors, not on the distance to the winning PE in the lattice. Since the neural gas algorithm has no predefined structure, each PE acts relatively independently; this is how the algorithm derives its name, each PE being like a molecule of a gas, all spreading to cover the desired space evenly. Since there is no predefined structure for the activity diffusion to move through, the network has the flexibility to create a diffusion structure that can be trained to best fit the input data. The SOTPAR2 diffuses activity through a secondary connection matrix that is trained with temporal Hebbian learning. This flexible structure decouples much of the spatial component from the temporal component in the network. In the SOTPAR, two nodes that neighbor each other in time also needed to be relatively close in space for the system to train properly (since time and space were coupled); this is no longer a restriction in the SOTPAR2. The result is still a space-time mapping, but the coupling between space and time is now directly controllable. The most interesting concept that falls out of this structure is the network's ability to focus on temporal correlations. Temporal correlation can be thought of as the simple concept of anticipation. The human brain uses information from the past to enhance the recognition of "expected" patterns: during a conversation, for instance, a listener uses context from the past to determine what they expect to hear next. This methodology can greatly improve the recognition of noisy input signals such as slurred or mispronounced speech.

SOTPAR2 algorithm details

Based on previous experience (training), the SOTPAR2 algorithm uses temporal information to lower the thresholds of the PEs that are likely to fire next.


The standard neural gas network is appended with a connection matrix trained using temporal Hebbian learning. These secondary weights are similar to the transition probabilities of a hidden Markov model (HMM) and are the pathways used to diffuse the temporal information. As in the SOTPAR, the temporal activity diffusion is used to alter the selection of the winning PE and affects both the training and the operation of the network. The SOTPAR2 algorithm works as follows. First, the distances d_i from the input to all the PEs are calculated. The temporal activity in the network is similar to the SOTPAR's diffusive wavefronts, except that the wavefronts are scaled by the connection strengths between PEs. The temporal activity thus diffuses through the space defined by the connection matrix:

$$a_i(t+1) = \alpha\, a_i(t) + \frac{\sum_k \left[ f(d,k)\,\mu + (1-\mu)\, a_k(t) \right] p_{ki}}{\max(p)},$$

where $a_i(t)$ is the activity at PE i at time t, α is a decay constant less than 1, $p_{ki}$ is the connection strength from PE k to PE i, d is the vector of distances from the input to each PE, μ is the parameter that smooths the activity, giving more or less importance to the past activity in the network, and max(p) normalizes the connection strengths. The function f(d,k) determines how the current match (the distances) of the network contributes to the activity. At present, my implementation uses the case where f(d,k) is simply a δ function (and the summation is removed), so that only the activity from the past winner is propagated. This is similar to a Markov model, in which all temporal information is stored in the state itself. Unlike the Markov model, however, the previous winners affect the output activity of the current winner: a previous winner that has followed a "known" path through the network will have higher activity and thus more influence on the next selection.
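A minimal sketch of this activity update in the winner-only (δ function) case; the match measure and the parameter values are assumptions:

    import numpy as np

    def sotpar2_activity(a, d, P, last_winner, alpha=0.9, mu=0.5):
        """Diffuse activity along learned connections P (P[k, i] = strength
        from PE k to PE i); only the previous winner's activity propagates."""
        match = 1.0 - d[last_winner]           # assumed match measure for the winner
        inject = match * mu + (1 - mu) * a[last_winner]
        return alpha * a + inject * P[last_winner] / P.max()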


In the general case of the activity equation, the temporal activity at each PE receives contributions from all other PEs: the function f(d,k) is typically an enhanced/sharpened version of the output, and the summation runs over all PEs. This allows all the activity in the network to influence the current selection and makes the network more robust, since the wavefronts continue to propagate (though decaying rapidly) even if the selected winner temporarily transitions to an unlikely path. The next step of the SOTPAR2 algorithm is to modify the output (the competition selection criterion) of each PE by the temporal activity in the network:

$$out_i = d_i - \beta\, a_i,$$

where β is the spatio-temporal parameter that determines how much the temporal information affects the selection of the winner. This parameter should be set according to the expected magnitude of the noise in the system. For example, if the data are normalized to [0,1], then a setting of β = 0.1 allows the network to select a new winner that is at most a distance of 0.1 farther away than the PE closest to the spatial input. To adjust the weights, we use the standard neural gas rule, which is simply competitive learning with a neighborhood function based on an ordering of the temporally modified distances to the input:

$$\Delta w_i = \eta\, h_\lambda(k_i(out))\, (in - w_i),$$

where η is the learning rate (step size), $h_\lambda(\cdot)$ is an exponential neighborhood whose width is defined by the parameter λ, and $k_i(out)$ is the ranking of PE i based on its modified distance from the input. The connection strengths are trained using temporal Hebbian learning with normalization. Temporal Hebbian learning is Hebbian learning applied over time, such that PEs that fire sequentially enhance their connection strength; the rationale is that PEs remain active for a period of time after they fire, so both the current and the previous winner are active at the same time. In the current implementation, the connection strengths are updated in a manner similar to the conscience algorithm for competitive learning:

$$p_{ij}(t+1) = p_{ij}(t) + b \quad \text{for } i = \arg\min_x out_x(t-1),\; j = \arg\min_x out_x(t),$$

after which all connections are decreased by a fraction that maintains constant energy across the set of connections. In other words, the strength of the connection between the last winner and the present winner is increased by a small constant b, and the whole set of connections is then renormalized. Another possibility would be to normalize all connections leaving each PE. This method gives poorer performance when a PE is shared between two portions of a trajectory, since the connection strength must then be shared between the two outbound PEs; it does, however, give an interpretation of the connection strengths as probabilities and points out the similarity between the SOTPAR2 and the HMM. The parameters η and λ are annealed exponentially, as in the neural gas algorithm, while β takes the form of an offset sine wave. This allows the initial phases of learning to proceed without interference, so that the PEs start out with an even distribution across the input space; the temporal enhancement then reaches a peak and slowly declines for fine-tuning at the end of learning.
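A minimal sketch of this temporal Hebbian update with global renormalization (using the sum of the connection strengths as the conserved "energy" is an assumption):

    import numpy as np

    def update_connections(P, prev_winner, winner, b=0.01):
        """Strengthen the link from the previous winner to the current winner,
        then rescale all connections to keep their total energy constant."""
        total = P.sum()
        P[prev_winner, winner] += b
        P *= total / P.sum()             # assumed constant-energy normalization
        return P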


Operation of the SOTPAR2 network

I will use an artificial example to illustrate the features of the SOTPAR2. The input for this example is 15 pairs of noisy 8-point diagonal lines from (0,0) → (1,1) and from (1,0) → (0,1). The diagonal lines have uniform noise (±0.15 in both dimensions) added to each point (notice that the distance between successive noise-free points of the diagonal lines is only 0.14). Uniform noise on [0,1] is interspersed between the diagonal lines (16 points between each line, so that there is twice as much noise as signal). The network extracts the temporal information from the diagonal lines without supervision, segmentation, or labeling. A 30-PE network was trained with and without temporal enhancement (200 iterations through the data set), and the resulting PE locations are shown in Figure 3-9 and Figure 3-10 with the diagonal lines superimposed. Notice that the temporal enhancement during training has slightly modified the positions of the PEs: the network trained with temporal enhancement has its PEs placed more consistently near the centers of the points along the diagonal lines (in particular, look at the line segment in the lower right). The temporal training provides a portion of the improvement made by the SOTPAR2 algorithm, but the static comparison of the two networks is not dramatically different.

[Figure 3-9: Reference vector locations after training with enhancement.]

[Figure 3-10: Reference vector locations after training without enhancement.]

During operation, the trained weights and information from the past create temporal wavefronts in the network that allow plasticity during recognition. This temporal activity is mixed with the standard spatial activity (the distance from the input to the weights) via β, the spatio-temporal parameter. Two identical inputs may fire different PEs depending on the temporal past of the signal. Figure 3-11 shows the Voronoi diagrams for the SOTPAR2 network with two different temporal histories. Voronoi diagrams graphically describe the region of the input space that fires each PE. In these particular diagrams, the number in each Voronoi region is the PE number for that region and is located at the center of the static Voronoi region (recall that the center is the same as the weights of the PE). These diagrams show the regions of the input space that will fire each PE in the network. The left side of Figure 3-11 shows the Voronoi diagram during a presentation of random noise to the network: since this input pattern was unlikely to have been seen in the training input, temporal wavefronts were not created, and the diagram is very similar to the static Voronoi diagram. The right side of Figure 3-11 shows the Voronoi diagram during the presentation of the bottom-left to top-right diagonal line. The temporal wavefront grew to an amplitude of 0.5 by the time PE 18 fired. Also, as a result of the training of the network, the connection strength between PE 18 and PE 27 was large compared to those of the other PEs; the temporal wavefront therefore flowed preferentially to PE 27, enhancing its chances of winning the next competition.

[Figure 3-11: Voronoi diagrams without enhancement (left, previous winners 23, 18, 10, 29) and with enhancement (right, previous winners 20, 26, 14, 18), both with β = 0.2.]

Notice how large region 27 is in the right side of Figure 3-11, since PE 27 is the next expected winner. This plasticity seems similar to the way humans recognize temporal patterns (e.g., speech): the network uses temporal information and its previous training to "anticipate" the next input, and the anticipated result is much more likely to be detected since the network is expecting to see it. It is important to point out how dramatically the static and dynamic conditions differ. In the dynamic SOTPAR2, the centroids (reference vectors) are not as important; the temporal information changes the entire character of the vector quantization, creating data-dependent Voronoi regions. An animation can demonstrate the operation of the SOTPAR2 Voronoi regions much better than static figures. Next, I created a new set of 14 noisy diagonal lines to be run through the network as a test set. Each noisy line was passed through both a standard neural gas vector quantization (VQ) network and a SOTPAR2 VQ network. The results are analyzed using the 5th point of the bottom-left to top-right diagonal line. Figure 3-12 shows the locations of this point in each of the 14 noisy diagonal lines, along with the neural gas Voronoi diagram.

[Figure 3-12: Voronoi diagram without enhancement. VQ outputs were [12, 12, 16, 16, 25, 25, 25, 25, 27, 27, 27, 27, 27, 27].]


81 of this point in each of the 14 noisy diagonal lines along with the neural gas Voronoi diagram. Notice that the static vector quantization cannot consistently quantize this 5' h point to the same Voronoi region. In fact, this point falls into four different regions. The SOTPAR2 network, however, was able to quantize every one of the 5 th points into the same region. Figure 3-13 shows why. Figure 3-13 shows a typical Voronoi diagram for the trained SOTPAR2 network after the input of the first four points of a single noisy diagonal line. The location of the 5 th point in each of the 14 noisy diagonal lines was again plotted. Notice that now all 14 points fall into the correct Voronoi region. Remember that each particular input sequence will create a different Voronoi diagram, but Figure 3-13 illustrates the mechanism for the SOTPAR2's improved vector quantization. The temporal plasticity has increased the size of the anticipated next region and reduced the variability of the SOTPAR2 vector quantization. Voronoi diagram with node 18 as the previous winner 1 08 21 j . y x * ^ 13 X V . tj^15 5 \ ^r 4 f 6 J J r \ 7 I \ X v. Jp-^_ 17 0.6 0.4 r 28 30 0.2 22 1 14 '/ 2 J*~19 11 26 20 1 10 / r \ f 8 Jf 29 0.2 0.4 0.6 0.8 1 Figure 3-13: Voronoi diagram and VQ with enhancement. VQ outputs were [27,27,27,27,27,27,27,27,27,27,27,27,27,27] Next I ran the new noisy diagonal lines through the network and histogrammed the VQ outputs for each point of the two lines. Figure 3-14 shows the results with the


point number along the horizontal axis and the node number along the vertical axis. The number of firings for each node is indicated by the shading; white is high and gray is low. The left-to-right diagonal line is shown in the first 8 points of the horizontal axis and the right-to-left diagonal line is shown as the second 8 points of the horizontal axis. Notice how much cleaner the temporally enhanced VQ output is than the standard neural gas VQ.

Figure 3-14: Histograms of the number of firings for each PE (brighter = more) for the networks with and without enhancement

Figure 3-15 shows a specific example of the VQ output of the two networks and illustrates how the SOTPAR2 uses temporal information to remove noise from the input. The input is a noisy diagonal line from bottom-right to top-left (solid line). The SOTPAR2 output is the short dashed line, and the static VQ output is the long dashed line. Notice how much closer the temporal VQ output is to the diagonal than the noisy input or the output of the static VQ.

Figure 3-15: The SOTPAR2 VQ (dotted) is closer to the noise-free signal than the original input (solid) or the neural gas VQ (dashed)
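To make the comparison in Figures 3-12 and 3-13 concrete, the following toy simulation quantizes the 5th point of several noisy diagonal lines with and without the temporal bonus. Everything here is an illustrative assumption (the hand-built codebook, the connection matrix C standing in for the trained temporal connections, the 0.7 decay, β = 0.5, and the noise level); on a configuration like this the static winner typically scatters over several regions while the enhanced winner typically collapses to one.

    import numpy as np

    rng = np.random.default_rng(0)

    def winner(x, W, enh, beta):
        return int(np.argmin(np.linalg.norm(W - x, axis=1) - beta * enh))

    # Codebook PEs spaced along the diagonal; C[i, i+1] stands in for the
    # learned temporal connection strengths between successive PEs.
    W = np.stack([np.linspace(0, 1, 8)] * 2, axis=1)
    C = np.zeros((8, 8))
    for i in range(7):
        C[i, i + 1] = 1.0

    static_5th, temporal_5th = set(), set()
    for _ in range(14):                            # 14 noisy diagonal lines
        line = W + rng.normal(0, 0.08, W.shape)
        enh = np.zeros(8)
        for t, x in enumerate(line):
            s = winner(x, W, np.zeros(8), 0.0)     # static neural gas VQ
            d = winner(x, W, enh, beta=0.5)        # temporally enhanced VQ
            enh = 0.7 * (enh + C[d])               # wavefront flows forward
            if t == 4:                             # the 5th point of each line
                static_5th.add(s)
                temporal_5th.add(d)

    print("static VQ regions for the 5th point:  ", sorted(static_5th))
    print("temporal VQ regions for the 5th point:", sorted(temporal_5th))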


SOTPAR2 summary

The SOTPAR2 algorithm uses temporal plasticity induced by the diffusion of activity through time and space. It is a temporal version of the neural gas algorithm that uses activity diffusion to couple space and time into a single set of dynamics, which can help disambiguate the static spatial information with temporal information. This creates time-varying Voronoi diagrams based on the past of the input signal. This dynamic vector quantization helps reduce the variability inherent in the input by anticipating (based on training) future inputs.

Temporal Self-Organization for Training Supervised Networks

This section shows how the concepts of temporally trained clustering can help speed up the training of supervised neural networks. In particular, we have applied it to recurrent neural network training. Recurrent neural networks are more powerful than feedforward neural networks, but their training is very difficult and time-consuming.


Supervised neural networks are typically trained with gradient descent learning, which provides a more mathematically sound foundation than that of the unsupervised networks. This allows for a goal-driven approach with mathematical derivations of the concepts. The goal of this architecture is to temporally organize the training of a recurrent neural network. A mathematical analysis will derive a principle very similar to that used in the neural gas network: that temporal correlation can be used to train PEs to form temporal neighborhoods.

Using Temporal Neighborhoods in RTRL

In the past, static neural networks and feedforward networks with memory (TDNN, etc.) have been the workhorses of the neural network world. Recently, recurrent neural networks have been getting more attention, especially when applied to dynamical modeling, system identification, and control. The main difficulty in training recurrent neural networks is that the gradient is a function of time. The gradient at the current time depends not only on the current input, output, and desired signal, but also on all the values in the past. As discussed in Chapter 2, there are two fundamental methods of computing the gradient for a recurrent neural network. First, the gradients can be computed in the backward direction, similar to the static backpropagation techniques from feedforward networks. This is called backpropagation through time (BPTT) [Rum86]. The main shortcoming of this technique is that it is non-causal. The second fundamental method computes the gradients in the forward direction. This method, called RTRL [Wil89], computes the partial derivative of each node


with respect to each weight at every iteration. The method is completely on-line and simple to implement. The main difficulty with the RTRL method is its computational complexity. If n is the number of PEs in a fully recurrent network, then the computation of the gradients of each PE with respect to each weight is O(n⁴). This algorithm can only be used for small networks. Many methods have been proposed to increase the speed of RTRL. Zipser's approach [Zip89][Zip90] will be used here because it lends itself to our techniques. Zipser approached the problem of reducing the complexity of the RTRL algorithm by simply leaving out elements of the sensitivity matrix based upon a subgrouping of the PEs. The PEs are grouped arbitrarily and sensitivities between groups are ignored. If the size of the subgroups remains constant, then this reduces the complexity of the RTRL algorithm to O(n²). This is a tremendous improvement; however, the method lacks some of the power of the full RTRL algorithm. For example, it will sometimes require more PEs than the standard RTRL algorithm to converge. Our methodology extends Zipser's technique by allowing the subgroups to change dynamically during learning. The dynamic subgroups are created by using an unsupervised temporal clustering very similar to that used in the SOTPAR and SOTPAR2. A derivation of a first-order approximation to the full sensitivity matrix shows that temporal correlation (temporal Hebbian learning) can be used to determine which nodes should be in each group. This method has the same computational complexity as Zipser's, but trains better and more consistently.
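For reference, a compact sketch of the full RTRL sensitivity update follows, mainly to make the O(n⁴) cost visible in the loop structure. The variable names and the tanh nonlinearity are illustrative assumptions, not the exact formulation used in this work.

    import numpy as np

    def rtrl_step(W, y, x, p):
        # One forward + sensitivity step for a fully recurrent layer.
        # W: (n, n+m) weights; y: (n,) previous outputs; x: (m,) inputs;
        # p[k, i, j] = dy_k / dw_ij. p has n*n*(n+m) entries, each updated
        # with an O(n) sum: O(n^4) per step for a fully recurrent network.
        n, m = len(y), len(x)
        u = np.concatenate([x, y])          # external inputs + fed-back outputs
        y_new = np.tanh(W @ u)
        dphi = 1.0 - y_new ** 2             # tanh'(v) at the operating point
        p_new = np.empty_like(p)
        for k in range(n):
            for i in range(n):
                for j in range(n + m):
                    rec = sum(W[k, m + l] * p[l, i, j] for l in range(n))
                    direct = u[j] if k == i else 0.0
                    p_new[k, i, j] = dphi[k] * (rec + direct)
        return y_new, p_new

    n, m = 4, 2
    rng = np.random.default_rng(1)
    W = rng.normal(0.0, 0.3, (n, n + m))
    y, p = rtrl_step(W, np.zeros(n), rng.normal(size=m), np.zeros((n, n, n + m)))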


Review of RTRL and Zipser's Technique

The computational complexity of the RTRL algorithm is dominated by the need to update a large array of sensitivities at each step of the algorithm. For a network with n nodes and m weights, the sensitivity matrix has O(nm) elements, each requiring O(n) computations, giving O(n²m) calculations per step. For a fully recurrent network, this dominates the computational complexity and requires O(n⁴) computations per step. The algorithm works quite well on small networks, but the n⁴ factor becomes overwhelming as the number of nodes increases. The value of n in the O(n²m) expression is the number of recurrently connected units. Zipser's algorithm reduces this value by creating subgroups of nodes where sensitivity information is only passed between nodes of the same subgroup. All connections still exist in the forward dynamical system; the subgroups only affect the training of the network. Connections between subnets are treated as inputs. If g is the number of subgroups, then the speed-up of the sensitivity calculations is approximately g². For instance, dividing a network into two subsets (g = 2) gives a 4-fold speed-up in computing the sensitivities. If the size of the subnets remains constant as the size of the network is increased (increasing the number of subnets), then the complexity of the RTRL algorithm is reduced from O(n²m) to O(m). The performance gains are substantial, but the question is whether the algorithm can train networks as well as the full RTRL. One might think that the subgrouping will limit the capabilities of the network to share nodes. This is not the case, however, since the activations of the network are unchanged; it is still fully recurrent except in the


training methodology. Even though the error propagation is limited to the subnets, all units have access to the activities of all other units, just not all of their sensitivities. Zipser's empirical tests indicate that subgrouped networks can solve many of the same problems, but for certain applications, networks trained with subgrouped RTRL require more PEs than when they are trained with full RTRL. In my experience, the subgrouping algorithm typically also requires more training epochs to reach the same MSE. One caveat of subgrouped RTRL training is that each subnet must have at least one unit for which a target exists, since gradient information is not exchanged between groups. The problem can be solved by wrapping a feedforward network around the recursive network, creating a feedforward MLP with a fully recursive hidden layer. This is often termed a recurrent multilayer perceptron (RMLP) and is shown in Figure 3-16. The feedforward network is simply one additional layer that distributes the gradient between the groups. With simple extensions to the algorithms, multiple fully recurrent layers can be added to the network.

Figure 3-16: Diagram of a fully recurrent multi-layer perceptron (RMLP)
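A minimal sketch of Zipser's subgrouped update follows; the only changes from the full RTRL sketch above are the two group tests, which is where the roughly g² saving comes from. The `groups` labeling and all names are illustrative assumptions.

    import numpy as np

    def subgrouped_rtrl_step(W, y, x, p, groups):
        # Forward pass is unchanged (still fully recurrent); only the
        # sensitivity bookkeeping is restricted to within-group terms.
        n, m = len(y), len(x)
        u = np.concatenate([x, y])
        y_new = np.tanh(W @ u)
        dphi = 1.0 - y_new ** 2
        p_new = np.zeros_like(p)
        for k in range(n):
            for i in range(n):
                if groups[k] != groups[i]:      # cross-group sensitivities dropped
                    continue
                for j in range(n + m):
                    rec = sum(W[k, m + l] * p[l, i, j]
                              for l in range(n) if groups[l] == groups[k])
                    p_new[k, i, j] = dphi[k] * (rec + (u[j] if k == i else 0.0))
        return y_new, p_new

    n, m = 4, 2
    rng = np.random.default_rng(1)
    W = rng.normal(0.0, 0.3, (n, n + m))
    groups = np.array([0, 0, 1, 1])             # g = 2 subgroups: ~g^2 = 4x saving
    y, p = subgrouped_rtrl_step(W, np.zeros(n), rng.normal(size=m),
                                np.zeros((n, n, n + m)), groups)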


Dynamic Subgrouping with π

The goal of my method is to create local neighborhoods (subgroups) in the RTRL algorithm where the majority of the gradient information required for each node is confined to its local neighborhood. This requires organizing the recurrent PEs such that those that have strong temporal dependencies are neighbors. This technique replaces the static, preallocated grouping of Zipser's technique with a dynamic method of determining the best set of neighbors for each PE. This dynamic grouping provides faster and more robust training than Zipser's technique while maintaining its O(n²) performance. First, the RTRL equations must be modified slightly to better suit the RMLP architecture described above. The time indices are now defined such that the input vector contains the external inputs from this time period plus the values of the PE outputs from the previous time period (i.e. the feedback):

    u(n+1) = [x(n+1), y(n)]

    \pi^j_{kl}(n+1) = \varphi'(v_j(n+1)) \Big[ \sum_{i \in B} w_{ji} \, \pi^i_{kl}(n) + \delta_{jk} \, u_l(n+1) \Big]

Next, we must determine the criteria we will use to group the PEs in the network. If we assume that each PE is responsible for updating the weights of the arcs which terminate at that PE (i.e. the incoming connections), then the PEs that have the highest sensitivities relative to those connections should be in the same neighborhood. For example, PE j is responsible for updating all weights w_{jl}, l ∈ B, where B is the set of recurrent PEs. If we define

    Z_{jk}(n) = \sum_{l \in B} \pi^j_{kl}(n)


then the value of Z_{jk} provides a measure of how much PE k affects the weights of PE j. Thus, the neighbors of PE j should be the PEs with the highest Z_{jk}. The "dynamic subgrouping with π" (DS-π) methodology implements the RTRL algorithm with the subgroups chosen using the Z measure defined above. It should be noted that since it requires the computation of the complete π matrix, this algorithm is no more efficient than the full RTRL algorithm. It will, however, address how the neighborhood technique with "optimal" switching performs compared to the full RTRL and Zipser's technique. The test case for the DS-π algorithm is a function approximation problem where we are trying to map a frequency doubler. The input to the network is a sinusoid with a 16-point period and the desired signal is a sinusoid with an 8-point period. This is a nonlinear function since linear functions cannot "create" frequencies.

Figure 3-17: Average learning curve for the three algorithms (RTRL, our method, and Zipser's) using the frequency doubling problem

The DS-π network


has 6 fully recurrent hidden-layer PEs and one linear output node. Both Zipser's method and the DS-π method use two groups of three PEs. Each of the three algorithms was trained using the same five sets of random initial weights and the results were averaged to obtain the learning curves. Figure 3-17 shows the average learning curve for each algorithm. Notice that the full RTRL and the DS-π method performed nearly identically. In fact, in a few cases, DS-π actually trained in fewer epochs. The third set of initial weights led all three algorithms to a deep local minimum. The first 100 epochs mainly depict the learning curve from the other four initial conditions (notice that the DS-π method and RTRL are nearly identical here), whereas the last 400 iterations are dominated by the learning curve for the initial weights with the deep local minimum. Zipser's method performed worse on all 5 sets of initial conditions and could not solve the problem at all (even with more training) for the 3rd set. In every application I have tested, the DS-π algorithm trains the networks in almost the same number of epochs as the full RTRL algorithm and performs significantly better than Zipser's subgrouping technique. The problem, however, is that the full π matrix is required to compute the neighborhoods. Since the computation of the π matrix is the computationally expensive part of the task, we have not gained anything here. This methodology, however, proves that the technique is feasible and that all of the gradient information is not necessary to train the networks. Also, when a neighborhood changes in the DS-π algorithm, the gradient information from the ex-neighbor is discarded and new gradient information from the new neighbor starts building up. The technique using the full π matrix shows that this resetting and restarting of gradient information between


nodes does not affect the performance of the algorithm. The DS-π algorithm will be used as an "ideal grouping" methodology since it uses all of the information in the sensitivities to determine the groupings.

Estimating the Z matrix

We now need an estimate of Z that will allow us to efficiently compute the temporal neighborhoods. The logical choice is to use a first-order estimate of the π matrix. We start by writing out the equation for Z, substituting the RTRL recursion, and simplifying:

    Z_{jk}(n) = \sum_{l \in B} \pi^j_{kl}(n) = \varphi'(v_j(n)) \Big[ \sum_{i \in B} w_{ji} \, Z_{ik}(n-1) + \delta_{jk} \sum_{l \in B} u_l(n) \Big]

Since a PE is always a member of its own neighborhood, we are only interested in the off-diagonal

terms, which means that the δ_{jk} term can be removed. Reorganizing the summations slightly leads to:

    Z_{jk}(n) = \varphi'(v_j(n)) \sum_{i \in B} w_{ji} \, Z_{ik}(n-1), \quad j \neq k

Expanding Z_{ik}(n-1) one step further, we get:

    Z_{ik}(n-1) = \varphi'(v_i(n-1)) \Big[ \sum_{l \in B} w_{il} \, Z_{lk}(n-2) + \delta_{ik} \sum_{l \in B} u_l(n-1) \Big]

Now we separate the equation into its first-order part (the direct contribution from the input vector, when i = k) and the rest. Keeping only the first-order part, and dropping the input sum because it is common to every k and therefore does not affect the ranking, yields

    Z_{jk}(n) \approx \varphi'(v_j(n)) \, w_{jk} \, \varphi'(v_k(n-1))

This is a very easy and computationally efficient method for estimating the Z matrix. It is conceptually appealing as well. You can see that this equation is a time correlation


between the derivatives of the nonlinearities of PEs j and k. If this were a static, linear network, the equation would simply be

    z_{jk} = y_j(n) \, x_k(n-1)

where y_j is the output of PE j and x_k is the input to PE j from PE k. This is a temporal version of Hebbian learning. The first-order estimate of Z can be considered a nonlinear version of Hebbian learning. The derivative of the nonlinear function at the operating point determines the sensitivity of each PE to the current training input. Thus, we are calculating a correlation of the sensitivities of each PE, which can also be considered a correlation in the dual of the network. If the PEs of the network use a tanh activation function, the estimate for the Z matrix can become a local rule. Local rules are advantageous because they are easily implemented in parallel and easily analyzed. Figure 3-18 shows the plot of f'(net) versus f'(out), which is equal to f'(f(net)), for a tanh PE.

Figure 3-18: Using a tanh PE, f'(net) takes the same shape as f'(out)

Since these two shapes are very similar,


the subgrouping criterion can replace f'(net) with f'(out), since the output of the source PE is the input to the destination PE. The new equation for Z becomes

    Z_{jk}(n) = \varphi'(v_j(n)) \, w_{jk} \, \varphi'(y_k(n-1))

The local implementation assumes that each PE stores the weights for its incoming nodes locally and thus has access to all of its inputs and its output. This methodology was tested and produced identical results to the DS-FOE method.

Illustrative Example

The performance of the network trained with dynamic subgrouping and the first-order estimate of the π matrix (DS-FOE) will be compared against the full RTRL algorithm and Zipser's method. The example data set will again be the frequency doubling problem.

Figure 3-19: Average learning curve for the RTRL, DS-FOE, and Zipser's algorithms using the frequency doubling problem

Remember that for this problem, the DS-π algorithm achieves nearly


identical performance to the full RTRL algorithm, so for clarity it is not shown in the plots. Figure 3-19 shows the learning curve averaged over 5 random initial conditions, as described previously. Notice that the DS-FOE method performs significantly better than Zipser's method, yet not as well as the theoretical limit. The DS-FOE method also could not solve the frequency doubling problem on the third set of data in the allotted time. For the rest of the initial conditions (look at the first 200 samples of the learning curve), the full RTRL algorithm required 750 epochs to converge, the DS-FOE method required 1000 epochs, and Zipser's method required 1500 epochs.

Grouping Dynamics

One of the assumptions of the dynamic subgrouping algorithm is that the Z matrix, which represents the temporal "distance" between each pair of PEs, changes slowly. If the Z matrix changes rapidly, the neighborhoods will be constantly changing and a good approximation to the Z matrix will be impossible to achieve. Beyond that, the constantly changing neighborhoods will cause an increased number of transients in the training, since each time a neighborhood changes, some of the gradient information is discarded. Figure 3-20 shows a plot of Z(1,i) over time for the DS-π algorithm. This plot shows the distances from PE 1 to each of the other PEs in the network. The three nearest neighbors of PE 1 are shown at the bottom of the figure (remember that PE 1 is always a neighbor of PE 1). As expected, the curves are all smooth over the 100-epoch plot. Specifically, at the beginning of the training, the two PEs closest to PE 1 are PE 3 and PE 4. After 15 epochs, PE 2 begins to have a larger influence on PE 1 and replaces PE 4 as a neighbor of PE 1. Soon thereafter, PE 2 grows quickly and becomes the dominant neighbor of PE 1. After


approximately 80 epochs, PE 3's influence on PE 1 wanes and PE 4 again becomes a neighbor of PE 1. This type of neighborhood plot is typical, and the smoothness and infrequent switching are common to all the applications I have tried. Occasionally, two PEs will have very similar values and will swap places frequently for a period of time. Normally when this happens, however, the Z values are fairly small and the gradient updates are being dominated by a much larger PE.

Figure 3-20: Plot of the Z matrix over time showing the temporal neighbors of PE 1

Second Order Methods

Many researchers have found that first-order gradient descent techniques like RTRL and BPTT do not provide satisfactory performance for training recurrent neural networks. Most commonly they use the extended Kalman filter (EKF) (also known as the extended RLS algorithm). The EKF algorithm does not actually compute the dynamic gradients differently; it simply uses more information about the gradients to achieve a


better estimate of the shape of the performance surface. Thus, the EKF algorithm still needs to use either BPTT or RTRL to compute the instantaneous estimates of the gradients. The dynamic subgrouping technique proposed here can be used to drastically speed up these computations in the EKF algorithm. A more interesting application of the dynamic subgrouping mechanism may be in the actual EKF formulation. Just like the RTRL algorithm, the EKF algorithm is too computationally burdensome to be fully implemented. Instead, most practitioners use the decoupled extended Kalman filter (DEKF) approach, which ignores all cross-terms between PEs in the covariance matrix. This simple approach is similar to Zipser's approach for RTRL. The dynamic subgrouping could be used at the DEKF level as well, allowing the network to extract enough data from the input to determine which portions of the covariance matrix are important. This is equivalent to determining which PEs in the system should be temporally clustered.

Summary of the Dynamic Subgrouping Algorithm

By using temporal self-organization in the RTRL algorithm, we created a dynamic version of Zipser's method of subgrouped RTRL that can provide a robust O(n²) training method for recurrent neural networks. The method makes use of a first-order estimate of the π matrix to compute a metric (Z) that is used to group the PEs in the network. This method performs roughly halfway between the performance of Zipser's method and that of the "optimal" grouping. The current method of estimating the Z matrix is not only first order but also has a memory of only one step. It seems reasonable to believe that other


estimates of Z may provide better performance without significantly increasing the computational complexity beyond O(n²). The most interesting theoretical result is that with an "optimal" determination of the subgroups, the dynamic subgrouping algorithm achieves nearly identical performance to the full RTRL algorithm. In fact, it will occasionally outperform it. This could be because the PEs that contribute the least to the gradients of certain weights are not necessary and may only contribute noise to the overall gradient. By choosing only the PEs that contribute the largest fraction of the gradient, the DS algorithm may be achieving a PCA-like effect of removing the noise in the gradient.
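Putting the estimate and the regrouping together, a minimal sketch of the DS-FOE bookkeeping might look like the following. Only the product φ'(v_j(n))·w_jk·φ'(v_k(n−1)) comes from the derivation above; the exponential smoothing of Z, the use of its absolute value, and all names are added assumptions for illustration.

    import numpy as np

    def update_Z(Z, W_rec, dphi_now, dphi_prev, rho=0.95):
        # Z[j, k] tracks phi'(v_j(n)) * w_jk * phi'(v_k(n-1)): the temporal
        # Hebbian correlation of PE sensitivities from the derivation above.
        inst = np.outer(dphi_now, dphi_prev) * W_rec
        return rho * Z + (1.0 - rho) * np.abs(inst)

    def neighborhoods(Z, size=3):
        # Each PE's subgroup: itself plus the PEs with the largest Z values.
        hoods = []
        for j in range(Z.shape[0]):
            order = np.argsort(-Z[j])
            hood = [j] + [int(k) for k in order if k != j][: size - 1]
            hoods.append(sorted(hood))
        return hoods

    n = 6
    rng = np.random.default_rng(2)
    W_rec = rng.normal(0.0, 0.5, (n, n))
    Z = update_Z(np.zeros((n, n)), W_rec, rng.random(n), rng.random(n))
    print(neighborhoods(Z))          # the temporal neighbors chosen for each PE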


CHAPTER 4
APPLICATIONS AND RESULTS

The previous chapter presented the theory and background information required for each of the three architectures I have created by applying temporal locality and activity diffusion. For illustration, each section also included a simple application of each architecture. This chapter presents the results of applying these architectures to a series of practical and/or more extensive problems. Again, we will begin with the SOTPAR architecture that is based on the self-organizing map, then the SOTPAR2 architecture based on the neural gas algorithm, and lastly the dynamic subgrouping algorithm based on real-time recurrent learning (RTRL).

SOTPAR

The SOTPAR architecture uses activity diffusion to create a spatio-temporal self-organizing map. The SOM has been used for many different applications, but in its most simple form it is a vector quantization technique. The first application uses the SOTPAR dynamics to create a landmark discrimination and recognition mechanism. The complete SOTPAR architecture could be trained to solve this problem, but for improved speed of training, we modified the architecture to allow one-shot training. The second application of the SOTPAR architecture is based on the work of Ruwisch et al. This application uses the SOTPAR to organize a linear SOM for sequences of spoken phonemes. When


the network is repeatedly trained with phoneme sequences, it organizes itself such that the phonemes are sequentially located in the map.

Landmark Discrimination and Recognition for Robotics

Pedro Kulzer, from the E.E. robotics group, and I have applied the SOTPAR to the problem of determining the shape of a landmark using only local information obtained from a very simple robot [Eul96b]. The robot has only forward and lateral infrared sensors that can detect if it is near an object. There is no long-distance vision or perspective obtained by these sensors. The robot uses wall-following techniques (similar to a blind person who can only feel the walls of a building) to determine the shape of the landmark. Using the limited information obtained by the shaft encoder and object detection sensors, the robot must discriminate between different landmark shapes while it traverses the walls of an object/landmark. Much of the problem description and robot control algorithms and some of the figures come from Pedro Kulzer's Master's Thesis [Kul96]. The robot wanders about its "space" until it finds an object in front of it. The robot may approach this object from any possible direction and orientation. In order to simplify the procedure, when the robot detects an object, it first docks with the object. Docking consists of rotating the robot until its right side is parallel to the landmark's wall. The robot is determined to be parallel to the landmark when the front-right and rear-right sensors indicate equal distances to the landmark. After successful docking, the robot follows the walls of the landmark in a clockwise direction. This is the simplest algorithm


for circumventing an object, but it creates difficulties for the processing of the data. Figure 4-1 shows the docking behavior and a simplified wall-following diagram.

Figure 4-1: Docking behavior and an example of wall-following [Kul96]

One of the difficulties with the wall-following approach is that the data collected by this simple robot is imprecise. The wheels and gears have a tendency to slip, and the turning data and segment distances are often incorrect. Additionally, the wall-following control algorithm can overshoot or undershoot the turns, just like any other control algorithm. To obtain more accurate information, the speed at which the robot circumvents the object must be much slower. The goal of our approach is to allow the robot to operate at higher speeds by creating an algorithm that will accept less precise local turn data. Figure 4-2 shows examples of turns at normal speed, slow speed, and faster speed.

Figure 4-2: Robot wall-following at different speeds [Kul96]

The main difficulties with the landmark discrimination problem are:
• the exact location where the robot first encounters the landmark is unknown


• the circumference of the object is unknown (how do you know when you have traversed it completely?)
• the robot would like to learn the landmarks very quickly, preferably in a single circumvention
• the slipping of wheels and gears creates noisy and inexact data during turns
• symmetric shapes will create uncertainty in the exact location of the robot relative to the landmark (e.g. a square has 4 identical sides and angles)

SOTPAR solution

To simplify the problem we added a simple compass to the robot. This helps solve the symmetry problem since now we not only have turn angles and distances, we also have orientation information. At preselected intervals the robot collects its turn angle and compass settings. The relative-coordinate turn angle is the derivative of the compass setting, and thus this creates a dynamic state-space description of the motion of the robot. If a compass is not present, the turn angle can be derived from the wheel motions and the compass would be the integral of the turns, but the compass would drift over time and would soon be unusable. The sampled data can be constructed into a trajectory through turn-angle/compass space. The SOTPAR can be used to map such trajectories. Figure 4-3 shows the trajectory the robot must follow to traverse the L-shaped landmark. The trajectory begins when the robot is heading north (0°) after making a right turn at the bottom-left portion of the landmark. The interpolated turn slowly moves back to zero and then back to -π/2 as the robot turns right again and begins to head east. The trajectory is very difficult to map


because there is an overlap in the input space where the trajectory doubles back on itself. This corresponds to the concave corner in the figure and is shown as the line moving from approximately (3π/2, -π/2) to (π, π/2) and vice-versa. The trajectory passes through the same points in state space and the only difference is the direction of travel. Without temporal information, the standard SOM cannot properly map this trajectory.

Figure 4-3: Robot trajectory around the L-shaped landmark

A SOTPAR network with 6 strings of 10 PEs was created to map the target trajectory. Since this is a complicated target trajectory that is 40 samples long, training a single long SOTPAR network for this problem is very difficult. Multiple strings of PEs provide improved flexibility in the mapping of complex trajectories by allowing different clusters of PEs to move independently. Each string of PEs is like a single 1-D SOTPAR network. The only mechanism linking the 6 strings of PEs is the wavefront of enhancement that travels from the last PE of one string to the first PE of the next. The neighborhood training of the strings is local and does not cross string borders. The SOTPAR was trained with noisy versions of the target trajectory (the turns and compass headings) intermixed with noise that represents random searching motions


of the robot looking for landmarks. There was no supervision or indication as to which part of the input was noise and which was signal. Table 4-1 shows the training parameters for the SOTPAR network. The learning rate and SOM neighborhood width are generic parameters that are typically annealed from high to low values during training. The annealing of the learning rate allows for fast but coarse movements of the PEs at the beginning of training and slower fine-tuning of the PE positions at the end of training. The annealing of the neighborhood width organizes the network globally at first and then fine-tunes the local neighborhoods at the end of training. The two new SOTPAR parameters are the spatio-temporal parameter, β, and the temporal decay parameter, μ. The spatio-temporal parameter is annealed similarly to the other SOM parameters and allows the network to concentrate on temporal ordering first (high β) and then on fine-tuning the spatial map at the end of training. The temporal decay parameter sets the feedback proportion of the leaky integrator at each PE. I typically use a value of 0.7, which provides a good tradeoff between forcing the network to always move forward and allowing repeated PE firings.

Table 4-1: Training parameters for robot trajectory

    Training Parameter              Value
    Spatio-temporal parameter, β    Linearly annealed from 2 to 0.3
    Learning rate, η                Linearly annealed from 0.01 to 0.002
    SOM neighborhood width          Linearly annealed from 2.5 to 0.7
    Temporal decay parameter, μ     0.7
    Number of training epochs       100
    Number of strings               6
    Number of nodes per string      10
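The two SOTPAR-specific mechanisms in Table 4-1 are simple to state in code. The sketch below shows linear annealing of a parameter over the 100-epoch schedule and the leaky-integrator enhancement with decay μ; the function names and the printed schedule are illustrative, not the actual implementation.

    import numpy as np

    def linear_anneal(start, end, epoch, n_epochs):
        # Linearly interpolate a training parameter from `start` to `end`.
        t = min(epoch / max(n_epochs - 1, 1), 1.0)
        return start + t * (end - start)

    def leaky_enhancement(prev_enh, incoming, mu=0.7):
        # Each PE integrates incoming wavefront activity with feedback mu;
        # mu ~ 0.7 trades off forward motion against repeated PE firings.
        return mu * prev_enh + incoming

    for epoch in (0, 50, 99):
        beta = linear_anneal(2.0, 0.3, epoch, 100)
        lr = linear_anneal(0.01, 0.002, epoch, 100)
        print(f"epoch {epoch}: beta={beta:.3f}, lr={lr:.4f}")

    print(leaky_enhancement(np.array([0.2, 0.7, 0.0]),
                            np.array([0.0, 0.0, 1.0])))   # -> [0.14 0.49 1.0]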


Figure 4-4 shows the trained PE weights of each string mapped back to the input space and represented as X's (O's represent the first PE in each string). Four of the strings (labels 1-4) mapped to different locations in the trajectory (as labeled on the right side of the figure) and two others (labels 5 and 6) account for most of the inter-signal noise. As the robot moves along the periphery of the object, the nodes of the SOTPAR will fire sequentially.

Figure 4-4: SOTPAR mapping of the L-shaped landmark trajectory (horizontal axis: compass heading)

The SOTPAR dynamics use temporal information to smooth noisy signals and can also gracefully handle time-warping. If the input data is noisy, the temporal wavefronts that move through the network will influence the selection of the winning PEs and be able to ignore a certain amount of noise. The spatio-temporal parameter, β, determines how much noise the system will accept. If β is set to 0.5, then the temporal information in the network can influence the network to choose a PE that is up to 0.5 units farther away from the spatial input than the closest spatial PE. Thus, β should be set based upon


the dynamic range of the input and the amount of expected noise in the signal. Of course, increasing β also decreases the ability of the network to discriminate between two similar but different temporal patterns. Figure 4-5 shows two trajectories with 0.1-amplitude and 0.3-amplitude zero-mean additive noise. The trajectories are significantly different from the training signal, but the dominant temporal pattern is still clearly visible.

Figure 4-5: Noisy trajectories with 0.1 and 0.3 amplitude noise

Figure 4-6 shows the sequence of winning PEs and the SOTPAR enhancement in the network for both trajectories, with random noise interspersed. Remember that a diagonal line in the plot of winning PEs means that the PEs fired in sequential order, as desired. β was set to 0.5, and thus the network should be able to use the temporal information in the signal to remove much of the variation in the signal. The trajectory with 0.1-amplitude noise is shown in the first 40 points of the figures. Notice the straight diagonal line that shows that the PEs fired in perfect sequential order. The wavefront is perfectly synchronized with the signal and moves along just ahead of the previous winning PE, helping the network choose the


correct PE. The second trajectory is shown between points 60 and 100. The SOTPAR was still able to correctly map the majority of the signal and misclassified a point on the trajectory only three times. These locations are clearly shown on the plot of winning PEs and also show up as a "dimming" of the enhancement wavefront in the enhancement chart.

Figure 4-6: Enhancement and winning PEs for noisy trajectories with +/-0.1 and +/-0.3 amplitude noise

If the input data is time-warped, the memory kernel in the SOTPAR allows the network to either skip PEs for shorter patterns or fire PEs more than once for longer patterns, without greatly reducing the wavefront strength of the network. Maintaining the wavefront strength allows the network to continuously smooth spatial noise with temporal information. Two sequences were created that warped the 40-point trajectory to 56 points and 30 points by upsampling and downsampling the signal. Figure 4-7 shows the enhancement of each PE over time and the sequence of winning PEs. The longer trajectory is shown in the first 56 samples, and the enhancement plot shows that the


wavefronts periodically die out and restart at the same PE one time period later. The plot of the winning PEs also shows that certain PEs are fired twice so that the SOTPAR can warp the signal back onto its output map. Samples 72-102 show that the network periodically skips a PE to adjust the network to the faster sampling of the shorter trajectory. The enhancement shows disconnected wavefronts that continue in the network just ahead of where the previous one ended.

Figure 4-7: SOTPAR enhancement for time-warped trajectories

The ability of the SOTPAR network to remove noise and deal with time-warping is very advantageous, but even the relatively short 100-epoch training time required for this network is too long for the robotic application. As a general rule, increasing the constraints of the system by using application-specific information leads to simplified and faster training. By adapting the SOTPAR to the specific characteristics of the landmark object recognition problem, we created a new network that can be trained much faster. First, since the robot can detect when it encounters an object, multiple simplifications can


be made. Each object will be represented by a single string of nodes, each with all the SOTPAR dynamics as defined above. Additionally, since the exact location where the robot first encounters the object is unknown, the endpoint of the string will wrap around to the starting point to form a ring. Thus, a wavefront will be created and will rotate around the ring from any starting position. A one-shot training algorithm is implemented which trains the landmark string as the robot circumvents the object for the first time. After the robot docks with the object, it follows the walls of the landmark in a clockwise direction. At predefined intervals, a PE is added to the current string, with its weights set to the current turn angle and compass heading. This continues until the robot completely circumvents the landmark, at which point the end of the string is connected to its beginning, forming a ring network. Another modification, designed by Pedro Kulzer [Kul96], makes the network more forgiving of imprecise turn angles and also provides an interesting link to biology and other technologies. The input of the network is first pre-processed by a set of PEs based upon a "tuning curve". The tuning curves shown in Figure 4-8 provide a more gentle roll-


off of the activation of a PE with increased distance between the stored angle and the actual angle. The stored angles are discretized such that they are stored in increments of 30 degrees. Each node responds to one turn angle based upon the "tuning curve" shown in the figure.

Figure 4-8: Tuning curve for turn angles [Kul96]

This tuning curve mechanism allows for an interesting rationale for the one-shot training algorithm. The tuning curve acts as a preprocessor that resembles a radial basis function network or a fuzzy logic system. In biology, SOM-like structures are commonly found doing such tasks, clustering sensory inputs. Each preprocessor PE represents a certain turning angle and fires when the current turn matches its stored turn. The one-shot training algorithm can be mimicked if each tuning curve PE is linked to the SOTPAR chain and the weights are trained in a Hebbian manner with very high learning rates, as shown in Figure 4-9. The figure shows only one set of weights from the preprocessor to PE 4, with the weight between the matching turn angle and PE 4 being much stronger than the others.

Figure 4-9: Preprocessor feeds the SOTPAR string using fast Hebbian learning
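A minimal sketch of the one-shot ring construction follows. The sample traversal data and the function names are hypothetical, and the real system additionally stores the tuning-curve weights described above; only the append-one-PE-per-interval and close-the-ring steps come from the description in the text.

    import numpy as np

    def one_shot_train(samples):
        # samples: list of (turn_angle, compass) pairs collected at predefined
        # intervals during one circumvention of the landmark.
        weights = np.array(samples, dtype=float)     # one PE per interval
        n = len(weights)
        successor = [(i + 1) % n for i in range(n)]  # wrap tail to head: a ring
        return weights, successor

    # Hypothetical traversal of a small rectangular landmark.
    traversal = [(0.0, 0.0), (-1.57, 4.71), (0.0, 3.14), (1.57, 3.14)]
    ring, nxt = one_shot_train(traversal)
    print(len(ring), nxt)                            # 4 PEs, ring order 1,2,3,0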


The SOTPAR dynamics are used for the recognition of the objects. The turn angles and compass headings are fed into the network and wavefronts are created as the robot traverses the outside of the landmark. A new ring is used for each landmark. Three different properties found in the SOTPAR provide useful information in the landmark recognition problem.
• Landmark Recognition. As the robot moves around an unknown landmark, the network with the largest continuous wavefront will determine which landmark the robot is processing (a small sketch of this rule follows the list). To demonstrate this capability, I have simulated a circumvention of two different landmarks shown in Figure 4-10, the L-shaped landmark and the square-shaped landmark. These two landmarks are difficult to discriminate since the turns that are required to traverse them are very similar, differing only at two locations. Zero-mean random noise is added to the inputs to simulate the "sloppy" turning and wall-following algorithm that the robot will use. After the one-shot training on both landmarks, a new noisy input set was created for the L-shaped landmark and run through both rings from an unknown starting point. The maximum activity of the two strings of 8 PEs is shown in Figure 4-10. The figure shows that the activity wavefront moves through the L-shaped landmark ring with little attenuation. The square landmark ring, however, starts to build up as it traverses the left and top of the figure and then decays rapidly as two wrong turns are made.
• Endpoint Detection. Endpoint detection can be automatically determined while the network is training. When the robot passes the point where it first encountered the landmark, a second wavefront will begin forming at the beginning of the network and


will move around the ring. The distance between the current location (the position of the original wavefront, which should always be at the most recently added node) and the redundant wavefront is the estimated circumference of the landmark. Figure 4-11 shows the wavefronts moving around a network being created as the robot traverses the L-shaped landmark. When the robot detects a sustained secondary wavefront, it should clip the network at the point where the secondary wavefront begins and reconnect the tail to the start of the ring.
• Location Relative to the Landmark. As the robot traverses the landmark, the location of the wavefront tells the robot its location relative to the landmark. This is not absolute information; it only describes where the robot is relative to the first time it traversed this landmark. Absolute location information, however, is not necessary. Figure 4-11 shows the wavefronts from a sample traversal. The current winning PE provides the best estimate as to the location of the robot relative to the landmark.

Figure 4-10: Maximum activity in the two networks (solid line = L-shape, dashed line = square)
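The recognition rule from the first bullet above can be sketched as follows, using the additive model: each landmark ring propagates a wavefront driven by how well the incoming turn/compass samples match the expected next PE, and the ring with the strongest sustained activity names the landmark. The Gaussian matching score, the 0.7 decay, and the random toy rings are all illustrative assumptions.

    import numpy as np

    def ring_activity(ring, seq, mu=0.7, start=0):
        # Propagate a wavefront around `ring`; activity builds on good matches
        # and decays through a string of bad ones. Return the peak activity.
        n, pos, act, peak = len(ring), start, 0.0, 0.0
        for x in seq:
            match = np.exp(-np.linalg.norm(ring[pos] - x) ** 2 / 0.1)
            act = mu * act + match          # wavefront strengthens or fades
            peak = max(peak, act)
            pos = (pos + 1) % n             # wavefront advances around the ring
        return peak

    rng = np.random.default_rng(3)
    L_ring = rng.random((8, 2))             # toy stand-ins for two trained rings
    square_ring = rng.random((8, 2))
    seq = L_ring + rng.normal(0, 0.05, L_ring.shape)   # noisy L-shape traversal
    scores = {"L": ring_activity(L_ring, seq),
              "square": ring_activity(square_ring, seq)}
    print(max(scores, key=scores.get))      # expected: "L"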


Figure 4-11: The wavefronts moving through time and space tell the circumference of the landmark

Real data collected from the robot

The SOTPAR methodology described thus far has been based on the addition of the enhancement and the distance measure. Another possibility, as discussed in Goppert [Gop94] and as implemented by Pedro Kulzer, is to multiply the enhancement and the distance measure. The multiplication lowers the noise resistance of the network, since a single bad input will bring the output of the network down to zero, but it improves the discrimination between two similar landmarks. Kulzer implemented both the additive and multiplicative models and achieved very similar results. The results presented in this section are based on the multiplicative model. Additionally, the endpoint detection algorithm described above was not discovered until after the tests with the real robot were completed. This adversely affected the results, since the proper endpoints could not always be obtained. One further difference between the theoretical results and the actual


collected data is that the existing robot did not have a compass input. Therefore, the discrimination is much more difficult using only local turn information. The data was collected from a prototype robot as described previously. The first landmark that was traversed is a rectangular sofa. Without the compass data, the turning information and stopping points were very noisy. Figure 4-12 shows a diagram of the sofa and a representation of 13 traversals around the sofa by the robot. The traversals started at random locations and the endpoint was determined by trying to detect when the robot had gone through 360 degrees. The figure plainly shows how noisy and inaccurate both the turn data and endpoints were with this robot.

Figure 4-12: Sofa landmark and 13 traversals of the sofa [Kul96]

Similarly, a sofa and footstool combination was circumvented. This combination represents an L-shape. Figure 4-13 shows a diagram of the landmark and six traversals of the landmark. Notice how difficult it will be to discriminate between the sofa and the


sofa/chair landmarks using the collected data. The two sets are very similar, and the chair portion of the landmark is noticeable but not a dominant feature of the datasets. For additional testing, a non-polygonal landmark was traversed, as shown in Figure 4-14.

Figure 4-13: Sofa and chair combination landmark [Kul96]

Figure 4-14: Non-polygonal landmark [Kul96]

With 3 landmarks and multiple circumventions of each landmark, there are many combinations of tests that could be done. Each circumvention can be used to create a


network which can then be tested against other networks and other circumventions. First, a comparison of the sofa and sofa/chair landmarks was done. A circumvention of each landmark was randomly selected and used for testing networks which were created from all other circumventions. The testing resulted in an 18.8% error rate, much of which can be explained by improper stopping points. Table 4-2 contains a sampling of the discrimination tests and presents the results by showing the ratio of the output of the correct network to the output of the incorrect network. A value of 1 or greater shows correct discrimination. The values in the table that are less than one are bolded and show incorrect discrimination.

Table 4-2: Sofa vs. sofa/chair landmark discrimination ratios [Kul96]

                test1  test2  test3  test4  test5  test6  test7  test8
    Sofa         1.51   1.65   0.98   1.13   0.90   1.23   1.61   1.99
    Sofa/Chair   1.34   1.37   1.40   1.32   1.22   0.90   1.35   1.20

Next, the non-polygonal landmark was compared against both the sofa and sofa/chair landmarks. The error rate here was 12.5%, which implies that the non-polygonal landmark is easier to discriminate. Table 4-3 and Table 4-4 show a subset of the data collected, again using the ratio of the correct versus incorrect network outputs to determine the quality of the results.

Table 4-3: Sofa vs. non-polygonal landmark discrimination ratios [Kul96]

                    test1  test2  test3  test4
    Sofa             2.78   2.50   0.97   0.86
    Non-Polygonal    1.05   1.33   1.40   1.77
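Reading the tables is mechanical: each entry is the correct network's output divided by the incorrect network's output, so entries below 1 count as discrimination errors. A two-line check over the Sofa row of Table 4-2 (the row data is copied from the table above):

    sofa_ratios = [1.51, 1.65, 0.98, 1.13, 0.90, 1.23, 1.61, 1.99]
    errors = sum(r < 1.0 for r in sofa_ratios)      # ratios below 1 are errors
    print(f"{errors} errors in {len(sofa_ratios)} tests "
          f"({100 * errors / len(sofa_ratios):.1f}% error rate for this row)")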


Table 4-4: Sofa/chair vs. non-polygonal landmark discrimination ratios [Kul96]

                    test1  test2  test3  test4
    Sofa/Chair       1.37   2.18   1.51   1.42
    Non-Polygonal    1.30   1.25   1.30   1.39

Summary

The typical approach to solving the object recognition problem in the Robotics group at the University of Florida is to store information about each of the corners of the object and the distances between them. Then, with AI techniques, they would attempt to recognize the object. This algorithm, however, works only for polygonal landmarks and is very brittle when the robot generates imprecise turning data. The SOTPAR methodology described above performs better than the techniques being used at the time. Although the results are preliminary, our architecture has many advantages over the typical approach, including:
• The landmark can be of any shape, not only polygonal
• Spacewarping (timewarping in terms of the neural network) created by varying speeds of the robot is gracefully handled
• The matching process is straightforward and resistant to noise, and does not require excessive amounts of computation
• The position in the circumvention is easily determined
• Endpoint detection can be done automatically
• Rotations and translations of the landmark do not affect the performance [Kul96]


Self-Organization of Phoneme Sequences

In 1993, Ruwisch et al. [Ruw93] created parallel hardware versions of SOM networks based on reaction-diffusion equations. The fundamental concept is that each PE is part of an active medium, and when one PE fires, it eventually causes its neighbors to fire as well. Once a PE fires, it remains active until a global reset is received from the controller. When a sample input is presented to the network, the output of each PE is proportional to the closeness of the match between the PE's weights and the input. A global threshold is then progressively lowered until a PE reaches saturation and fires. The PE that fires then triggers its neighbors as its activity diffuses to neighboring PEs. This causes a wavefront that travels through the network. The wavefront moves at a velocity c and at its simplest can be represented by the equation

    \Phi(r,t) = H(ct - \| r_{win} - r \|)

where H is the Heaviside (step) function, r is the position of a PE in the lattice, and r_{win} is the position of the winning PE. Although this work uses a moving wavefront through time and space, they use it solely to implement the SOM algorithm in parallel. They do not use it to create a spatio-temporal mapping of the input. Although the equation bases the wavefront on the position of the winner, it is not necessary to actually compute the location of the winner; the active medium does this automatically. They use this wavefront to train the network. When a PE becomes active, it can begin to train its weights (i.e. move them toward the input). Since the PEs which were closer to the winner will be active for a longer period of time, this method naturally implements the neighborhood function of the SOM in parallel. The wavefront


propagation and training continue until a global reset is received from the controller and the next input sample is presented to the network. More recently, they expanded their active medium concept by including a scaled-down version of our SOTPAR dynamics in their network. They modified and simplified the SOTPAR dynamics such that they could easily be implemented in hardware. The equation for the temporal enhancement wavefront is

    \Psi(r,t) = H[ct_1 - d_{win}(r)] \, H[d_{win}(r) - ct_1 + b] \, h(r,t)

where H is again the Heaviside function and d_{win}(r) is the distance to the winning PE. They also include a "history function", h(r,t), which precludes a PE from firing more than once during the presentation of a single sequence. This history function keeps the network wavefronts from reversing direction and retraversing a portion of the network. The two Heaviside functions create a wavefront in the form of an expanding concentric wave crest with a specified width b.

Figure 4-15: Ruwisch enhancement through a 2-D SOM [Ruw97]
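Numerically, the wave-crest equation above selects the band of PEs whose distance from the last winner lies between ct₁ − b and ct₁, gated by the history function. A small sketch (names and toy values are illustrative):

    import numpy as np

    def H(x):
        return (x >= 0).astype(float)       # Heaviside step function

    def crest(d_win, c, t1, b, h):
        # Psi(r,t) = H[c*t1 - d_win(r)] * H[d_win(r) - c*t1 + b] * h(r,t)
        return H(c * t1 - d_win) * H(d_win - c * t1 + b) * h

    d_win = np.arange(10.0)                 # distance of each PE to the winner
    h = np.ones(10)
    h[3] = 0.0                              # PE 3 already fired this sequence
    print(crest(d_win, c=1.0, t1=4.0, b=3.0, h=h))
    # PEs at distances 1, 2, and 4 are enhanced; distance 3 is blocked by h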


Figure 4-15 shows an example of the wave crests moving through a 2-D SOM, including the history function and successive winners. As in the SOTPAR, the winner is chosen as a combination of the spatial and temporal distances:

    r^{win} = \arg\min_r \big( \| u_t - w_r \| - \beta \, \Psi(r, t_1) \big)

The wavefront spreads until a global reset is sent from the controller. Notice that the only information used in the network is from one step in the past, other than the history function, which does not provide significant information about the past. The attenuation and reinforcement of waves in the SOTPAR were not used. The elegant part of this methodology is that the process used to train the network (implementing the neighborhood training of the SOM) and the process used to add the temporal information to the network (the limited SOTPAR dynamics) can use the same wavefront. With one activation of the active medium, both the temporal information and the training information can be spread throughout the network in a fully parallel manner. As the wavefront spreads, the PEs are trained both spatially and temporally. I have recreated their methodology and tested the system on a set of data similar to that reported in [Ruw97]. The network is a one-dimensional string of 36 PEs. The input is phonemes from the TIMIT database of labeled and segmented speech. The subdivision of each word into phonemes was done by a TIMIT computer algorithm that sometimes produces less than ideal results. The acoustic signal was sampled at 16 kHz and filtered by a set of bandpass filters. For simplicity, we used only three filters whose passbands were 0.6-1.0 kHz, 1.0-3.5 kHz, and 3.5-7.4 kHz. Each phoneme was averaged over its entire duration to produce a single feature vector per phoneme. The two


words used were "suit" and "small". Suit has three phonemes ('s', 'oo', 't') and small has four phonemes ('s', 'm', 'a', 'l'). A SOTPAR string of 36 PEs was then trained using the wavefront methodology described above. The history function and enhancement were reset after the complete presentation of each word. Table 4-5 shows the parameters for the training of the Ruwisch dynamics.

Table 4-5: Training parameters for phoneme mapping

    Training Parameter            Value
    Spatio-temporal parameter, β  Exponentially decayed from 1 to 0.3
    Learning rate, η              Exponentially decayed from 0.03 to 0.001
    SOM neighborhood width        Exponentially decayed from 8 to 1
    Number of training epochs     200
    Number of nodes               36
    Wavefront speed               4 PEs per sample
    Wavefront width               3 PEs

The spatio-temporal parameter, learning rate, and SOM neighborhood width are all exponentially decayed. Often an exponential decay improves the training time of the network over a linear decay of the parameters. The wavefront width and speed determine the number of PEs used to store each phoneme. In this case, the wavefront moved 4 PEs per sample of training data and was 3 PEs wide, providing 3 PEs to map the spatial characteristics of each phoneme. The 36-PE network was trained for 200 epochs over the training set of 6 instances of each word. For a training set this size, using a wavefront more than 3 PEs wide to store each phoneme is excessive. For example, a 50-PE network with a wavefront speed of 7 and wavefront width of 5 did not train as well as the above network. There was enough flexibility in the network to map multiple phonemes to each phoneme neighborhood. Three of the 6 instances of the word suit moved in the same direction as the 6 instances of the word small. Figure 4-16 shows the probability of a phoneme firing each PE (i.e. being vector quantized to each cluster) using the Ruwisch dynamics. For example, for the 's' phoneme in the top left-hand plot, PEs 9 through 14 fire exclusively. The dynamics of the system organized the rest of the phonemes such that temporally neighboring phonemes will be located in spatially neighboring regions. If you follow the plots from top to bottom, the phonemes from suit are stored in the network from left to right and the phonemes for the


word small are stored in the network from right to left.

Figure 4-16: Probabilities of each PE firing for each phoneme (panels: 's', 'm', 'u', 't', 'a', 'l')

The figure shows a few locations where incorrect PEs fire; this can usually be explained by the poor segmentation of the phonemes. For instance, the 'm' phoneme has a small peak in the 'u' region. This is due to


the improper segmentation of the 'm' phoneme, which included some of the 'a' phoneme that follows it. Since the 'm' is much lower energy than the 'a', the small amount of 'a' contained in the 'm' created enough higher-frequency energy to make it look like the 'u'. The key attribute of using the traveling wave-crest version of the SOTPAR for this application is that the SOTPAR dynamics allow us to use multiple PEs to store spatial variations (corresponding to differences in the input) and yet still create global temporal ordering between these local clusters. Figure 4-17 shows an intensity-coded image of the reference vectors for each PE in the network. It is clear from a cursory look at the image that the reference vectors are clustered into regions approximately 3 to 5 PEs wide. Since the winning PEs vary from speaker to speaker, the actual phonemes are clustered in a neighborhood of approximately 3-5 PEs in the output map, even though the wave crest was only 3 PEs wide.

Figure 4-17: Reference vectors for the phoneme network

A good example of how the network captured the spatial variations


of each phoneme in the network is the region between PE 7 and PE 20. PEs 7 through 14 make up the 's' phoneme, showing most of the energy in the upper frequency bands. PEs 15-20 represent the 'm' phoneme, showing most of the energy in the low frequency bands. Notice how, from left to right, there is a lot of variation in the upper frequency bands of the 'm' phoneme. This is caused by the poor segmentation of the TIMIT database, as the 's' sound is leaking into the 'm' phoneme. The SOTPAR network, however, is mapping and clustering these variations properly. Figure 4-18 shows a graph that intensity codes the enhancement and winning PEs over time. The PEs receiving temporal enhancement are shown in shades of gray, with brighter indicating more enhancement, and the winning PE is shown in white. In the Ruwisch dynamics, however, all enhancement is at the same level, so there is only one shade of gray in the graph. On the horizontal axis is time, where the phonemes of each word are presented repeatedly. The first column represents the 's' of the first version of the word 'suit'. The PEs are on the vertical axis. Thus, the winning PE for the 's' of the first version of the word 'suit' is PE 13. According to the dynamics of the system, the winning PE causes the enhancement wavefront to travel in both directions from the winning PE. In this case, the wave crest is 3 PEs wide and travels at a speed such that it is 2 PEs from the last winning PE. Thus, PEs 8-10 and 16-18 receive an enhanced chance of winning the next competition. The next winner is PE 8, which then creates two more wavefronts. Notice that in these dynamics there is no building of energy from winner to winner; only the last winner determines the enhancement. The amount of enhancement is a constant determined by the wave equations. Notice in the graph that the 3-phoneme sequences (suit) move in the upward direction (from higher to lower numbered PEs) and

PAGE 132
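These wavecrest dynamics are easy to state procedurally. A minimal Python sketch of the enhancement rule just described (the function name, the coupling into the competition through a parameter beta, and the distance-based winner rule are my assumptions, not the dissertation's implementation):

    import numpy as np

    def ruwisch_enhancement(n_pes, last_winner, gap=2, width=3):
        # Two wavecrests, one on each side of the last winner, skipping
        # 'gap' PEs and enhancing the next 'width' PEs at a constant level.
        mask = np.zeros(n_pes)
        for direction in (-1, +1):
            for k in range(width):
                idx = last_winner + direction * (gap + 1 + k)
                if 0 <= idx < n_pes:
                    mask[idx] = 1.0      # constant level: no energy build-up
        return mask

    # Winner PE 13 enhances PEs 8-10 and 16-18 for the next competition:
    mask = ruwisch_enhancement(36, 13)
    # winner = np.argmin(distances - beta * mask)   # enhanced PEs win more easily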

Figure 4-18: Winners (white) and enhancement (gray) under the Ruwisch dynamics for the words 'suit' and 'small'

To verify that the temporal dynamics were responsible for this organization, I set the spatio-temporal parameter, β, to zero. All other training parameters were identical to those used for the previous network. This short-circuits all temporal information in the system and defaults to the static SOM. The probability plots are shown in Figure 4-19. Notice that there is no temporal organization whatsoever in the network and that a few of the phonemes were trained into multiple spatial locations (e.g., the 'l' phoneme). This confirms that the temporal dynamics of the SOTPAR are responsible for the spatio-temporal organization of these maps.

A comparison of this figure and Figure 4-16 shows how the temporal information significantly cleaned the output of the network.

Figure 4-19: Probabilities of each PE winning for each phoneme without the SOTPAR dynamics

As a final experiment, I tested the use of the full SOTPAR dynamics in this application. The history function was retained to avoid backtracking in a one-dimensional map. The main difference between the full SOTPAR dynamics and the partial SOTPAR dynamics discussed above is that the full dynamics strengthen the enhancement when multiple correct firings occur sequentially. The more correct sequential information the network receives, the more convinced it becomes that it is recognizing a known pattern, and the higher the enhancement for the next phoneme.

For the first two phonemes of each word there should be little difference; for the last phonemes of each word there should be less noise with the full SOTPAR dynamics. Figure 4-20 shows the winners and enhancements using the full dynamics. Notice that the enhancement grows brighter after successive correct firings. Also, the wavefronts that move off in the wrong direction eventually die away.

Figure 4-20: Winners (white) and enhancement (gray) using the full SOTPAR dynamics

Figure 4-21 shows the probabilities for each phoneme to fire each PE. The results are very similar to those of the partial SOTPAR dynamics because the sequences are very short. The short duration does not allow the enhancement to grow significantly higher than the simplified one-step enhancement. The final phonemes are, however, more localized with the full SOTPAR dynamics than with the partial dynamics. One interesting aspect of using the full SOTPAR dynamics is that the network is more tolerant of large wavecrests.

The Ruwisch dynamics could not properly map the test data with a 5-PE wavecrest. The full SOTPAR dynamics, however, provided enough extra information in the network for it to properly map all of the 12 words in the training set.

Figure 4-21: Probabilities of each PE winning for each phoneme using the full SOTPAR dynamics

Summary

This application showed that the SOTPAR dynamics create a flexible architecture that allows phonemes to be stored among a cluster of PEs for local spatial flexibility, while these local groups are globally organized with temporal information.

Each cluster represents a single noisy phoneme, and adjacent clusters represent temporally sequential phonemes in the training set. This organization is created by waves of activity moving throughout the network. The temporal dynamics help remove some of the variability in the vector quantization produced by the network, and the temporal ordering can be used to reduce the interconnectivity of neural networks designed for word recognition.

SOTPAR2

This section illustrates two applications of the SOTPAR2 network. The SOTPAR2 network uses temporal Hebbian learning and activity diffusion to determine temporal neighborhoods for each PE in a neural gas architecture. The first application uses the SOTPAR2 to vector quantize acoustic signals from 15 speakers saying the words one through ten. The second application is the prediction of the Mackey-Glass chaotic signal using piecewise linear approximations.

SOTPAR2 Vector Quantization of Speech Data

The goal of this application is to recognize spoken English digits from one to ten. The SOTPAR2 will be used to vector quantize the sampled frequency representation of each digit. The corpus is a set of 15 speakers saying the digits one through ten; the first 10 speakers are used for training and the last 5 for testing. The 15 speakers were graduate students and professors in the Electrical Engineering Department at the University of Florida. The speakers represent a wide variety of nationalities and accents, making this task significantly more difficult than one might think. The preprocessing comprised calculating the first 12 cepstral coefficients from 25.6 ms frames overlapped every 12.8 ms (10 kHz sampling). The cepstral coefficients were liftered by a raised sine to control the noninformation-bearing cepstral variabilities for a more reliable discrimination of sounds. These cepstral coefficients were then mean filtered three at a time to reduce the number of input vectors.
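A minimal Python sketch of this preprocessing step, assuming the standard raised-sine lifter w(k) = 1 + (L/2)·sin(πk/L) from the speech literature (the dissertation does not spell out the lifter formula, so this exact form is an assumption):

    import numpy as np

    def lifter_raised_sine(cepstra, L=12):
        # cepstra: (n_frames, 12) array of cepstral coefficients per frame.
        k = np.arange(1, cepstra.shape[1] + 1)
        w = 1.0 + (L / 2.0) * np.sin(np.pi * k / L)   # raised-sine lifter
        return cepstra * w

    def mean_filter_3(frames):
        # Average three consecutive frames to reduce the number of vectors.
        n = (len(frames) // 3) * 3
        return frames[:n].reshape(-1, 3, frames.shape[1]).mean(axis=1)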

The major difference between the SOTPAR2 VQ method and a standard VQ method is that the SOTPAR2 algorithm is trained to enhance patterns that it was trained with. There are two options for incorporating the temporal characteristics of the SOTPAR2 into this architecture. Typically, one vector quantizer is used to quantize every word in the corpus, and this can be done with the SOTPAR2 VQ as well. In that case, the network would need to store the temporal information from all ten digits in a single network. Although this is possible, the task of the SOTPAR2 is simplified by training a separate SOTPAR2 network for each digit (i.e., each network stores the temporal characteristics of only a single digit). Similarly, we will use a separate MLP to detect each digit. Figure 4-22 shows the overall block diagram of the system.

Figure 4-22: Block diagrams for the digit recognition system: a) typical vector quantization training system, b) SOTPAR2 vector quantization training system

First we trained the 10 SOTPAR2 VQ networks by feeding each network an input consisting of the target digit spoken by the 10 training speakers, interspersed with random vectors from the other 9 digits. The training parameters for the networks are shown in Table 4-6. The learning rate and neighborhood width are similar to those of the other networks we have trained. The spatio-temporal parameter β, however, typically takes the shape of a raised sine over the training session. β is handled differently than in the SOTPAR because the neural gas algorithm has no spatial structure and the temporal neighborhoods are added externally. The raised sine allows the neural gas PEs to distribute themselves freely at the beginning without temporal interference. In the middle of training, after the neural gas PEs are fairly uniformly distributed, the temporal training reaches its maximum; it then tails off for fine tuning at the end. The temporal Hebbian increment determines the time constant used to update the temporal weights. If this increment is too large, the network puts too much emphasis on the recent past; if too small, the weights do not change quickly enough to significantly affect the training. In my experience, an increment of 0.01 works fairly well for most normalized data (see the sketch below).
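As a rough illustration of how b1 acts as a time constant, here is one plausible moving-average form of the temporal Hebbian update (the exact rule is defined in an earlier chapter; the form and names below are assumptions):

    import numpy as np

    def temporal_hebbian_update(C, prev_winner, winner, b1=0.01):
        # Move the temporal weights of the previous winner toward the
        # observed transition; b1 sets how fast older transitions are forgotten.
        target = np.zeros(C.shape[1])
        target[winner] = 1.0
        C[prev_winner] += b1 * (target - C[prev_winner])
        return C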

Although not required, I also used a conscience term in this network. A conscience algorithm is a standard method of creating a density match between the PEs and the data. If a large set of data lies in one small region of space, an SOM without a conscience may map a single PE to this region. Including a conscience forces the SOM to map multiple PEs to the region by forcing each PE to fire at roughly the same rate: a PE that fires too often is severely penalized such that it cannot fire for a while. The two conscience parameters are typical settings and have little effect on the network; a sketch of the mechanism appears after the table.

Table 4-6: Training parameters for the digit recognition VQ

    Training Parameter               Value
    Spatio-temporal parameter, β     Raised sine with maximum 0.2
    Learning rate, η                 Exponential decay from 0.1 to 0.01
    SOM neighborhood width           Exponential decay from 5.33 to …
    Number of training epochs        50
    Number of PEs                    16
    Temporal Hebbian increment, b1   0.01
    Conscience increment, b2         0.01
    Conscience factor, c             0.2
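A minimal sketch of a DeSieno-style conscience (the exact bias form and the roles of b2 and c below are assumptions consistent with the parameter names in Table 4-6):

    import numpy as np

    def conscience_winner(distances, win_freq, b2=0.01, c=0.2):
        # Bias the competition against PEs that have been winning too often;
        # win_freq is a running estimate of each PE's winning frequency.
        n = len(distances)
        bias = c * (1.0 / n - win_freq)          # negative for over-active PEs
        winner = int(np.argmin(distances - bias))
        indicator = np.zeros(n)
        indicator[winner] = 1.0
        win_freq += b2 * (indicator - win_freq)  # update frequency estimates
        return winner, win_freq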

While training, the activity wavefronts could easily be seen in a plot of the maximum activity in the network over time, which picks up the wavefront activity quite well. Figure 4-23 shows the activity of the digit-six network on the training data. The instances where the word six is spoken are highlighted between dashed lines; the input data interspersed between the presentations of 'six' are random vectors from the other digits. Clearly, the activity of the network is much higher when the word six is presented. Notice, however, that certain speakers do not adequately match the "global average": speaker 10, near sample 400, does not create a large activity spike in the network. For larger systems this can be solved by using multiple networks for each digit, allowing for more variation among the speakers.

Figure 4-23: Maximum activity in the SOTPAR2 network while training for digit six

After training all 10 models, we verified the validity of the SOTPAR2 networks by running the digits from each speaker through all 10 models, plotting the maximum activity in the network at each input. Figure 4-24 shows the activity for speaker 14 (one of the test-set speakers), with the spoken digit along the horizontal axis and the SOTPAR2 model (i.e., the vector quantizer for each digit) along the vertical axis. Notice that the wavefronts are quite noticeable along the diagonal, which corresponds to the proper model responding favorably to the proper digit. Notice also that the wavefronts for words six and seven both started out high with the pronunciation of the 's' sound, but this particular speaker did not pronounce the end of either word similarly to the speakers in the training set. Remember, this activity comes from an unsupervised, unsegmented, unlabeled data set processed using self-organizing principles. The activity shown in these networks is helping the proper model remove noise and enhance the signal for later recognition.

Figure 4-24: Maximum activity in each model for each digit for speaker 14

After the VQ networks were trained, we vector quantized each digit from all 15 speakers, 10 from the training set and 5 from the test set. We also vector quantized the data using a standard neural gas algorithm so that we could compare our results. To remove some of the variability caused by the different rates at which the words and phonemes were spoken, we passed each sequence of reference vectors from each spoken word through a gamma memory. The gamma memory has 6 taps and a μ of 0.5, giving it a depth of 12 samples, which corresponds to the maximum length of any spoken digit in the corpus (the minimum was 6 vectors). The output of the six taps of the memory is fed into an MLP with 6 hidden PEs and a sigmoidal output PE. All 10 networks are trained simultaneously, and a winner-take-all PE selects the network with the largest output. The digit detector with the largest output is declared the winner, and this is compared against the desired signal, which indicates which digit was actually spoken.
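A minimal sketch of this memory stage (the cascade-of-leaky-integrators form is the standard gamma memory of de Vries and Principe [DeV91]; feeding only the K filtered taps, and not the raw input, to the MLP is an assumption):

    import numpy as np

    def gamma_memory(vectors, K=6, mu=0.5):
        # Cascade of K identical leaky integrators:
        #   x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1),  x_0(n) = input(n).
        # Mean memory depth is K / mu = 12 samples for K = 6, mu = 0.5.
        dim = vectors.shape[1]
        x = np.zeros((K + 1, dim))               # x[0] holds the current input
        out = []
        for v in vectors:
            x_prev = x.copy()
            x[0] = v
            for k in range(1, K + 1):
                x[k] = (1 - mu) * x[k] + mu * x_prev[k - 1]
            out.append(x[1:].flatten())          # the K taps fed to the MLP
        return np.array(out)                     # one row per input vector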

Table 4-7 shows the training parameters used for each of the MLP subnetworks. These parameters are based upon well-known rules of thumb, and variations in these parameters do not greatly affect the performance of the trained network. Figure 4-25 shows a block diagram of the recognition system: the spoken digit's frequency information feeds one vector quantizer per digit, each VQ output passes through a gamma memory embedding into an MLP detector for that digit, and a winner-take-all stage determines the spoken digit.

Figure 4-25: Digit recognition system

Table 4-7: Training parameters for each subnetwork of the MLP recognition network

    Training Parameter            Value
    Hidden layer PEs              6 tanh PEs
    Hidden layer learning rate    1
    Hidden layer momentum rate    0.7
    Output layer PEs              1 logistic PE
    Output layer learning rate    0.1
    Output layer momentum rate    0.7
    Number of training epochs     1000

The entire system was trained with three different sets of data. First, the original data from the preprocessor was used to train the system; this allows us to validate that vector quantization reduces the variability of the data and allows for easier recognition of the digits. Second, the neural gas algorithm was used to vector quantize the input data.

Third, the SOTPAR2 network was used for each vector quantizer. To remove random variations due to initial conditions, each system was trained and tested 5 different times and the results were averaged. Table 4-8 shows the results of the training. The key figures are the numbers of misclassifications in the test set. Since MLPs are universal mappers, a sufficiently large MLP can learn to classify virtually any data set. A common problem with MLPs is that they can be overtrained, which can be thought of as memorizing the input instead of finding the important features in it. An overtrained network will classify the training set very well but the test set very poorly. Thus, the true indication of performance for MLPs is the performance on the test set.

Table 4-8: Summary of the digit recognition system performance

    System Type              Training MSE  Testing MSE  Training errors  Testing errors  % correct (test)
    No vector quantization   0.0005        0.0216       0.0              12.2            75.5
    Neural gas VQ            0.0010        0.0321       0.2              10.0            80.0
    SOTPAR2 VQ               0.0009        0.0199       0.2              7.6             84.8

The table shows that the SOTPAR2 VQ system reduced the number of errors in the testing set by 25% relative to the neural gas VQ system and by 40% relative to the system without vector quantization. This performance is due to the reduction in the variability in the systems. A vector quantization technique removes some of the variability of the signal by clustering the inputs and representing every input in a cluster with a single reference vector. The SOTPAR2 VQ system takes this one step further by using temporal information to enhance the clustering; the temporal sequence of the feature vectors plays an important role in the vector quantization.

For additional comparison, a hidden Markov model (HMM) with 5 states was trained using the original input. The HMM was trained for 50 cycles starting from 5 different initial conditions; its average result was 81% correct over the test set. To further analyze the system, Table 4-9 and Table 4-10 show the "confusion matrices" for two sample systems (one SOTPAR2 VQ system and one neural gas VQ system). A confusion matrix uses a two-dimensional grid to represent the classification of the digits: one axis gives the classification by the network and the other gives the correct classification. Perfect classification places all values on the diagonal of the matrix; misclassifications show up as off-diagonal terms, which reveal which digits are being confused. For example, the first row in Table 4-9 shows the network's classifications for the 5 instances of the digit one in the test set: four of these "ones" were correctly classified and the fifth was classified as a five. These confusion matrices show interesting details about the classification of the network. For instance, nine is confused with one and five because one has the same 'n' sound as nine and five has the same 'i' sound as nine. Similarly, the 'n' sound in seven and ten also tends to confuse the networks, and they misclassify them as one as well. The neural gas VQ system performed very poorly when presented the digits nine and ten. In summary, the SOTPAR2 VQ system performed significantly better than the static VQ system and the system without vector quantization. The temporal plasticity of the algorithm allowed for better quantization performance and noise resistance.

Table 4-9: Confusion matrix for the SOTPAR2 VQ system (rows: spoken digit, 5 test instances each; off-diagonal counts listed with the digit they were classified as)

    Digit 1:  4 correct, 1 classified as five
    Digit 2:  5 correct
    Digit 3:  5 correct
    Digit 4:  5 correct
    Digit 5:  5 correct
    Digit 6:  5 correct
    Digit 7:  3 correct, 1 classified as one, 1 other misclassification
    Digit 8:  5 correct
    Digit 9:  3 correct, 1 classified as one, 1 classified as five
    Digit 10: 4 correct, 1 classified as one

Table 4-10: Confusion matrix for the neural gas VQ system

    Digit 1:  4 correct, 1 misclassification
    Digit 2:  3 correct, 2 misclassifications
    Digit 3:  5 correct
    Digit 4:  5 correct
    Digit 5:  5 correct
    Digit 6:  5 correct
    Digit 7:  5 correct
    Digit 8:  5 correct
    Digit 9:  2 correct, 1 classified as one, 1 classified as five, 1 other
    Digit 10: 2 correct, 3 misclassifications (including one as one)

Time Series Prediction

In this application we compare the performance of the neural gas algorithm with that of the SOTPAR2 algorithm for prediction. The application is based on the study of Martinetz et al. [Mar93], in which the neural gas algorithm was used for time series prediction. They partition the input space into regions, with each region having its own local linear predictor; thus, each Voronoi region of the input space has a single linear predictor.

They used the Mackey-Glass time series,

    dx(t)/dt = a·x(t-τ) / (1 + x(t-τ)^10) + β·x(t),

with the parameters a = 0.2, β = -0.1, and τ = 17. For these parameters, x(t) is a quasi-periodic, chaotic signal with a fractal attractor dimension of 2.1. The characteristic time constant of x(t) is 50, which makes it particularly difficult to forecast x(t+δt) with δt > 50 [Mar93]. The time series is predicted using a 4-dimensional embedding with a time lag of 6 between inputs; thus the predictor is trying to estimate

    x(t+6) ≈ f(v),  where v = (x(t), x(t-6), x(t-12), x(t-18)).

For each neural gas node i there is one linear predictor with coefficients ȳ_i and a_i defined by

    y = ȳ_i + a_i·(v - w_i),

where y is the prediction of the next point in the sequence. The equation contains an average prediction for the region, ȳ_i, and an offset from this average prediction based on the distance between the reference vector w_i and the input vector. The vector a_i has 4 components, one for each input dimension. Using an LMS approximation to gradient descent, the update equations for the ȳ's and a's are

    err  = d - ȳ_i - a_i·(v - w_i)
    Δȳ_i = η·h(k_i)·err
    Δa_i = η·h(k_i)·err·(v - w_i),

where err is the prediction error (d being the desired next value), η is the learning rate, h(k_i) is the neighborhood function, and k_i is the neighborhood ranking (the updates are applied in a manner similar to the neural gas algorithm itself).
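A minimal Python sketch of one training step for the local linear predictors (names are mine; the exponential neighborhood weighting over the distance ranking is assumed to follow the standard neural gas form):

    import numpy as np

    def train_predictor_step(v, d, w, ybar, a, eta, lam):
        # Rank nodes by distance to the input (neural gas style).
        d2 = np.sum((w - v) ** 2, axis=1)
        rank = np.argsort(np.argsort(d2))        # k_i: 0 for the closest node
        h = np.exp(-rank / lam)                  # neighborhood function h(k_i)
        for i in range(len(w)):
            err = d - ybar[i] - a[i] @ (v - w[i])
            ybar[i] += eta * h[i] * err              # update average prediction
            a[i] += eta * h[i] * err * (v - w[i])    # update linear coefficients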

Results

For most signals, predicting the next value in a sequence given the previous values is actually fairly simple. Most signals are continuous and vary fairly slowly, so the prediction can be done very accurately with very small models. The key issue in dynamical prediction, however, is how well the model has captured the dynamics of the system, and one of the best ways to test this is multi-step prediction. Multi-step prediction is implemented by initializing the memory of the system to a valid point on the trajectory of the system dynamics and then predicting the next point; this predicted point is then fed back to the input of the system and used to predict the following point. Small errors in the predictions that are fed back and used to make further predictions can produce very large errors after a few iterations. If the network has actually captured the system dynamics, then it should be able to start at any point on the trajectory of the system and follow that trajectory autonomously for a certain number of steps. How far the network can follow a trajectory depends on the system being modeled. For instance, a chaotic system can be modeled only for a short period of time before the smallest of errors drives the network off the correct trajectory. In fact, this is one possible definition of chaotic systems: systems where small changes in initial conditions create large changes in the system dynamics.
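A minimal sketch of this multi-step scheme using the local linear predictors defined above (the series is assumed to be subsampled at the embedding lag of 6, so each iteration advances one lag step; region selection here is the static nearest-reference rule, whereas the SOTPAR2 additionally biases the selection with its diffused temporal activity):

    import numpy as np

    def predict_multistep(past, w, ybar, a, n_steps):
        # 'past': at least 4 values of the lag-subsampled series, newest last.
        h = list(past)
        preds = []
        for _ in range(n_steps):
            v = np.array(h[-1:-5:-1])          # (x(t), x(t-6), x(t-12), x(t-18))
            i = int(np.argmin(np.sum((w - v) ** 2, axis=1)))   # Voronoi region
            y = ybar[i] + a[i] @ (v - w[i])    # local linear prediction
            preds.append(y)
            h.append(y)                        # feed the prediction back
        return np.array(preds)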

The static neural gas algorithm and the SOTPAR2 algorithm were both trained for 100 epochs over a 500-point segment of the Mackey-Glass time series. The time series was generated using a 4th-order Runge-Kutta algorithm (a sketch of this integration appears below). We first trained the PE locations and then trained the linear predictors, although these can easily be done simultaneously. Table 4-11 shows the training parameters for both the SOTPAR2 and neural gas VQ networks. The only difference between the two networks is that the neural gas system does not use the spatio-temporal parameter or the temporal Hebbian increment. Over the training period, the SOTPAR2 algorithm consistently produced better prediction results than the neural gas. The final average one-step prediction error was 7.63×10⁻⁶ for the SOTPAR2 algorithm and 9.15×10⁻⁶ for the neural gas algorithm. The SOTPAR2 reached an MSE 16% lower than the neural gas, but as we said previously, the single-step prediction error does not determine how well the dynamics of the system have been captured. The final maximum single-step prediction error may provide a better indication of how well the system will perform under multi-step prediction. The SOTPAR2's maximum single-step prediction error was 1.36×10⁻⁴, while the neural gas had a maximum error of 5.86×10⁻⁴; the SOTPAR2's maximum error was thus 76% lower. Since large errors will quickly drive the network away from the desired trajectory of the chaotic system, the maximum error is more relevant than the MSE. Next I tested the networks using multi-step prediction. After each training epoch, the weights were frozen and 50 randomly selected initial conditions were loaded into the networks; the predicted outputs were fed back into the networks to predict further values.
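A minimal sketch of generating the series (the step size, unit sample spacing, constant initial history, and holding the delayed term fixed across the Runge-Kutta sub-stages are all assumptions; b = 0.1 corresponds to β = -0.1 in the equation above):

    import numpy as np

    def mackey_glass(n_samples, a=0.2, b=0.1, tau=17.0, h=0.1, x0=1.2):
        # Integrate dx/dt = a*x(t-tau)/(1 + x(t-tau)**10) - b*x(t) with a
        # 4th-order Runge-Kutta step; the delayed term is read from the
        # history buffer at the nearest sample, which is adequate for small h.
        d = int(round(tau / h))                  # delay measured in steps
        hist = [x0] * (d + 1)                    # constant initial history
        f = lambda x, x_tau: a * x_tau / (1.0 + x_tau ** 10) - b * x
        steps_per_sample = int(round(1.0 / h))   # record once per unit time
        out = []
        for n in range(n_samples * steps_per_sample):
            x, x_tau = hist[-1], hist[-(d + 1)]
            k1 = f(x, x_tau)
            k2 = f(x + 0.5 * h * k1, x_tau)
            k3 = f(x + 0.5 * h * k2, x_tau)
            k4 = f(x + h * k3, x_tau)
            hist.append(x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4))
            if (n + 1) % steps_per_sample == 0:
                out.append(hist[-1])
        return np.array(out)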

Table 4-11: Training parameters for the SOTPAR2 and neural gas VQ networks for prediction

    Training Parameter               Value
    Spatio-temporal parameter, β     Raised sine with maximum 0.2
    Learning rate, η                 Exponential decay from 0.1 to 0.01
    SOM neighborhood width           Exponential decay from 16.7 to …
    Number of training epochs        50
    Number of PEs                    50
    Temporal Hebbian increment, b1   0.01
    Conscience increment, b2         0.01
    Conscience factor, c             0.2

The MSE between the desired trajectory and the 50 autonomous trajectories generated by the networks was then computed. The top plot in Figure 4-26 shows the MSE versus training epoch for both the neural gas (dashed line) and the SOTPAR2 (solid line) for 10-step prediction on a semilog scale. The bottom plot shows the average performance improvement of the SOTPAR2 over the neural gas versus training time, computed as (MSE_gas - MSE_SOTPAR2) / MSE_gas.

Figure 4-26: 10-step prediction error and average improvement for the neural gas (dashed line) and SOTPAR2 (solid line) predictors

Initially, the SOTPAR2 trained faster than the neural gas, but in the end the SOTPAR2 performed only 10% better on average than the neural gas. As the prediction length increases, the ability of the SOTPAR2 to use temporal information to remove noise from the signal should improve the performance of the system. The embedding of the data creates a state-space representation of the system output, which can be visualized as a trajectory moving through an N-dimensional space. The networks carve this N-dimensional space into regions, and one predictor is assigned to each Voronoi region. When the current position of the trajectory (of either the original system or the autonomously running network trying to predict it) ends up in a certain region, that region's predictor is used to predict the next output. The neural gas algorithm determines which region the trajectory is in using only the static information available to it. The SOTPAR2 algorithm also uses the past locations of the trajectory to help guide the selection of the proper Voronoi region. Thus, if noise in the system has moved the trajectory from the correct region to a neighboring region, the SOTPAR2 algorithm uses its stored temporal information (obtained from the training data) to select the more appropriate predictor. Remember that even if the original signal is "noise-free", the prediction of the signal will induce a significant amount of noise into the reconstruction of the trajectory. Figure 4-27 shows the results for the 25-step prediction. Notice that the SOTPAR2 is now beginning to significantly outperform the neural gas, providing a 60% decrease in the prediction error.

Figure 4-27: 25-step prediction error and average improvement for the neural gas (dashed line) and SOTPAR2 (solid line) predictors

Figure 4-28 and Figure 4-29 show the performance of the two systems for 50 steps and 100 steps. The 50-step prediction still shows a significant decrease in the MSE for the SOTPAR2 algorithm, which performs 45% better. The 100-step prediction, however, shows that we have reached the maximum capabilities of the network. This may not be the result of poor network performance: chaotic systems can only be predicted accurately in the short term, and even with identical systems, a very small change in the initial conditions produces large long-term errors. As mentioned previously, predicting beyond 50 steps is very difficult for this data set. To further illustrate this point, I calculated the Lyapunov exponents for our segment of the Mackey-Glass signal. The Lyapunov exponents are dynamical system parameters that measure the average rate of divergence of nearby trajectories; they can be thought of as the eigenvalues of the local linearization of the dynamics [Ger94].

If any Lyapunov exponent is greater than zero, the signal is chaotic. In our case, the largest Lyapunov exponent was approximately 0.067 bits per second. Given a small difference between two sets of initial conditions, two trajectories of a chaotic system will diverge at a rate of approximately

    d_n = d_0 · 2^(λn),

where d_n is the distance between the two trajectories at time n, d_0 is the initial distance between the two trajectories, n is the number of iterations between the initial and final positions, and λ is the dominant Lyapunov exponent. Using this equation and the average MSE of the SOTPAR2 predictor, Figure 4-30 shows the trajectory divergence for the input data.

Figure 4-28: 50-step prediction error and average improvement for the neural gas (dashed line) and SOTPAR2 (solid line) predictors
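A small worked version of this calculation (treating λ as a per-iteration rate and taking the predictor's error as the initial distance d_0 are assumptions):

    import numpy as np

    def divergence_horizon(d0, limit, lam=0.067):
        # Iterations until an initial error d0, growing as d_n = d0 * 2**(lam*n),
        # reaches the given limit (here: the variance of the input signal).
        return np.log2(limit / d0) / lam

    # Beyond this horizon the predictor is no better than predicting the
    # signal mean; with the trained SOTPAR2's error and the measured signal
    # variance this comes out to roughly 60 samples (Figure 4-30).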

The horizontal line in the figure indicates the variance of the input signal and signifies the practical limit of predicting the signal with these parameters. Any predictor with an MSE greater than the variance is no better than simply predicting the mean of the signal at all times. Thus, for all practical purposes, this signal cannot be predicted accurately beyond approximately 60 samples with the given mean squared error. Additional training could reduce the MSE further, but since the slope of the divergence is so steep near 60 samples, it would not significantly improve the predictability of the signal.

Figure 4-29: 100-step prediction error and average improvement for the neural gas (dashed line) and SOTPAR2 (solid line) predictors

Lastly, I tested both systems by determining their average prediction length (the number of predictions before the error exceeded 0.001). On average, the SOTPAR2 could predict 30% further than the neural gas algorithm. Table 4-12 shows the results, collected over 8 different training runs with 40 different starting points per run. The table also shows the dependence of the prediction length on β, the spatio-temporal parameter.

As β increases beyond 0.10, the performance decreases due to overemphasis on temporal information.

Figure 4-30: Trajectory divergence for Mackey-Glass prediction

Table 4-12: Average prediction length for the neural gas and SOTPAR2 predictors with different values of β for the SOTPAR2

    Model             Average Prediction Length
    Neural gas        33
    SOTPAR2, β=0.05   40
    SOTPAR2, β=0.10   43
    SOTPAR2, β=0.15   39
    SOTPAR2, β=0.20   33

Summary of chaotic prediction

In this application we have shown how the temporal plasticity of the SOTPAR2 can help disambiguate the static spatial information of the multi-step predictor. The dynamic vector quantization helps reduce the variability inherent in the input by anticipating (based on training) the future inputs.

Specifically, the improvements can be attributed to two factors. First, the SOTPAR2 trains faster because the vector quantization is based on a trajectory in state space, not just the current location in state space. Second, the SOTPAR2 improves the multi-step prediction because it can help remove the noise created by the iterative predictions. Both of these factors are due to the temporal plasticity in the network, which causes the Voronoi regions to shrink and grow based upon the previous inputs and training. The anticipation inherent in the dynamics of the network allows for a better selection of the linear predictor.

Dynamic Subgrouping of RTRL in Recurrent Neural Networks

This section presents the applications that were used to test the dynamic subgrouping methodology for training recurrent neural networks. The activity diffusion and temporal Hebbian learning concepts allow for the temporal self-organization of the PEs in a fully recurrent network. Using the temporal self-organization, many of the gradient calculations in the RTRL algorithm can be ignored, reducing the computational requirements from O(N⁴) to O(N²). The first application is the system identification of a nonlinear system with memory. The second application is also system identification, applied to the nonlinear passage dynamics in a noise cancellation problem.

System Identification

The first experiment evaluates the identification of the nonlinear system with memory presented in [Cam95]. System identification is a common task in control theory where the input-output characteristics of an unknown system are modeled as accurately as possible. This is accomplished by injecting the same signal (typically random noise, if possible) into both the unknown plant and the model. The model is then trained to mimic the response of the plant by using the difference between the plant output and the model output as the criterion to minimize.

Figure 4-31 shows a block diagram for system identification. For simplicity, the random input is often first injected into the plant, and the plant's input and output are stored so that the model can be trained off-line. ANNs provide a powerful methodology for system identification because they are universal mappers and thus, in theory, can model any plant.

Figure 4-31: System identification block diagram

The system we will model is described by the following equations:

    z(t) = 0.0154·x(t) + 0.0462·x(t-1) + 0.0462·x(t-2) + 0.0154·x(t-3)
           + 1.99·z(t-1) - 1.572·z(t-2) + 0.4583·z(t-3)
    y(t) = sin[z(t)]

This system uses the past three values of the state z, together with the present and past three values of the input, to update the state z; the state is then passed through a sine function to determine the system output. A sequence of 1000 points of random uniform noise between -1 and +1 was generated and passed through the system equations.
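A minimal sketch of generating this training data (zero initial conditions for the state and input history are an assumption):

    import numpy as np

    def simulate_plant(x):
        # Linear IIR filter on the input followed by a sine nonlinearity.
        n = len(x)
        z = np.zeros(n + 3)                      # z[0:3] hold z(-3), z(-2), z(-1)
        xp = np.concatenate([np.zeros(3), x])    # x(-3), x(-2), x(-1) = 0
        for t in range(n):
            z[t + 3] = (0.0154 * xp[t + 3] + 0.0462 * xp[t + 2]
                        + 0.0462 * xp[t + 1] + 0.0154 * xp[t]
                        + 1.99 * z[t + 2] - 1.572 * z[t + 1] + 0.4583 * z[t])
        return np.sin(z[3:])

    # x = np.random.uniform(-1, 1, 1000); d = simulate_plant(x)   # training pair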

This set of input-output data was then used to train the network; each 1000 points is an epoch of training. The data was used to train four networks using different learning mechanisms: the full RTRL algorithm, Zipser's static subgrouping algorithm, the dynamic subgrouping algorithm using the full π matrix to determine the subgroups (DS-π), and the dynamic subgrouping algorithm using the first-order estimate of the π matrix to select the subgroups (DS-FOE). The results of each network were averaged over 5 training runs with different sets of initial conditions (the same 5 sets for each network). Figure 4-32 shows the learning curves for the 4 networks when the subgrouping algorithms used two groups of 3 PEs (six PEs total). As expected, the full RTRL and DS-π algorithms were very similar. Remember that the DS-π algorithm produces the best possible results for each particular configuration; this indicates that the chosen network setup (6 PEs with 2 groups) is sufficient to train the network properly.

Figure 4-32: Learning curves for the system identification problem using the RTRL, Zipser, DS-π, and DS-FOE algorithms

The dynamic subgrouping with the first-order estimate (DS-FOE) produced slightly worse performance than the full RTRL and DS-π algorithms but required significantly less time per epoch to train. Zipser's algorithm took the most epochs to train. The DS-FOE and Zipser algorithms both require O(N²) operations, while the full RTRL and DS-π algorithms require O(N⁴) operations. Table 4-13 shows the time per epoch, the number of epochs to reach an error of 0.01, and the overall time required to reach an MSE of 0.01. The timing was done on a Pentium 166MMX PC using MATLAB code and the MathWorks Visual MATLAB compiler. Compiling the code helps reduce the sensitivity of the algorithms to the actual code, but there is certainly some variation between these results and code that is fully optimized and/or written in C. In theory, the DS-FOE algorithm will be N²/m² times faster and Zipser's algorithm will be g² times faster than the full RTRL algorithm, where N is the number of PEs, m is the number of temporal neighbors, and g is the number of subgroups. When the number of temporal neighbors divides evenly into the number of PEs, Zipser's algorithm and the DS-FOE algorithm have exactly the same per-epoch cost. Since we have 2 subgroups in this case (g=2, m=3), both algorithms should be 4 times faster; in practice they were approximately 3 times faster (more on this later). Since the full RTRL trained in fewer epochs, the actual performance improvement was only 44% for Zipser's algorithm and 57% for DS-FOE. For larger networks, however, the improvement in training time per epoch will dominate the total training time, and the DS-FOE algorithm will drastically decrease the overall time required to train the network.

For instance, with a network as small as 24 PEs using 3 temporal neighbors, the DS-FOE algorithm will execute one epoch of training 32 times faster than the full RTRL.

Table 4-13: Training times for the four algorithms on the system identification problem

    Algorithm   Time per epoch (s)   Epochs to MSE 0.01   Total training time (s)   % Improvement
    Full RTRL   96                   88                   8,448                     0%
    Zipser      33                   140                  4,620                     44%
    DS-FOE      33                   110                  3,630                     57%

Comparison of the Number of Neighbors

Figure 4-33 shows the learning curves for the system identification problem using different numbers of neighbors. Every network had a total of 6 fully recurrent PEs but used between 2 and 6 temporal neighbors. If 1 neighbor is used, the algorithm defaults to a first-order estimate of the RTRL algorithm in which the only long-term gradient information comes from within the self-recurrent loops.

Figure 4-33: Learning curves for the DS-FOE algorithm using 2, 3, 4, 5, and 6 neighbors

When 6 neighbors are used, the algorithm is identical to the full RTRL algorithm. As expected, the learning curve with only 2 temporal neighbors was the slowest; the more temporal neighbors are added, the faster the convergence. In this particular application, the law of diminishing returns seemed to take effect after 3 or 4 temporal neighbors. Next, I studied the training times per epoch to determine whether the theoretical speedup estimates are correct. Table 4-14 summarizes the results. The left column shows the number of temporal neighbors used in each network (using a 6-PE network). As mentioned before, the dynamic subgrouping algorithm with one temporal neighbor is nearly a static approximation to the RTRL algorithm: the only temporal gradient information comes from the self-recurrent loop of each PE. Thus, this network requires the least time per epoch but generally requires many more epochs (if it trains at all) to be properly trained. Using 6 temporal neighbors is identical to using the full RTRL algorithm. The table shows that the N²/m² speedup estimate is not very accurate, especially when the number of temporal neighbors is low. A quick analysis of the algorithm showed that the actual speedup is closer to N(N+3)/(m(m+3)); the 3 comes from the fact that each weight update requires 3 extra computations. When the number of temporal neighbors is close to 3, the (m+3) term in the denominator has a large impact on the speedup. The 4th column of the table shows that this new estimate is much closer to the actual timed values, and Figure 4-34 shows that as the number of temporal neighbors increases, the estimates and actual values become much closer.
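The two estimate columns of Table 4-14 below can be reproduced with a few lines (scaling both estimates so that m = N matches the measured 96-second full-RTRL epoch is an assumption about how the table was normalized):

    def speedup_estimates(N=6, full_epoch_time=96.0):
        # Naive estimate: per-epoch time ~ m^2; refined estimate: ~ m*(m+3),
        # since each weight update costs 3 extra computations.  Both are
        # scaled so that m = N reproduces the full-RTRL time.
        for m in range(1, N + 1):
            naive = full_epoch_time * m * m / (N * N)
            precise = full_epoch_time * m * (m + 3) / (N * (N + 3))
            print(m, round(naive, 1), round(precise, 1))

    # speedup_estimates() -> compare with the estimate columns of Table 4-14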

Table 4-14: Training times and theoretical speedup for the DS-FOE algorithm

    Number of    Actual time      O(N²/m²) speedup   Precise speedup   Epochs to
    neighbors    per epoch (s)    estimate           estimate          MSE < 0.01
    1            10               3                  8                 475
    2            21               11                 18                155
    3            33               24                 32                110
    4            50               43                 50                100
    5            70               67                 71                98
    6 (RTRL)     96               96                 96                88

Figure 4-34: Execution time per training epoch for a 6-PE RMLP using DS-FOE, varying the number of temporal neighbors

Finally, Figure 4-35 shows the overall execution time required to train each network to a mean squared error of 0.01 as the number of temporal neighbors was varied. The plot clearly shows that using either 2 or 3 temporal neighbors minimizes the overall training time of the network. One neighbor did not contain enough temporal gradient information, and more than 3 neighbors required too much computation for the small increase in gradient information.

Figure 4-35: Training time in seconds for the networks to achieve an MSE of 0.01, varying the number of temporal neighbors

Modeling a Set of Nonlinear Passage Dynamics

To further demonstrate the effectiveness and broad applicability of the dynamic subgrouping methodology, I will present another system identification problem. This problem consists of modeling the nonlinear passage dynamics for use in an adaptive noise cancellation system [Jan97]. The system we are trying to model is described by the following equation:

    d(k) = sin(n(k)·n(k-1)) / (1 + n(k-1)^2),

where n(k) is the system input, typically a random noise source.
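A minimal sketch of this plant (the exact form of the equation was reconstructed from a garbled source and may differ in detail from [Jan97]; d(0) = 0 is an assumption):

    import numpy as np

    def passage_dynamics(n):
        # Distorted noise d(k) produced by the nonlinear passage dynamics
        # acting on the measurable noise source n(k).
        d = np.zeros_like(n)
        for k in range(1, len(n)):
            d[k] = np.sin(n[k] * n[k - 1]) / (1.0 + n[k - 1] ** 2)
        return d

    # noise = np.random.uniform(-1, 1, 500)
    # detected = information_signal + passage_dynamics(noise)   # y(k) = x(k) + d(k)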

Modeling the passage dynamics of a noise source is the key aspect of an adaptive noise cancellation system. The problem setup is shown in Figure 4-36 and assumes that we have a signal known to contain additive noise. The noise source is measurable, but the noise is distorted by the passage dynamics before it is added to the information signal. The goal is to remove the distorted noise by modeling the passage dynamics, thus allowing us to pass the measured noise through the modeled passage dynamics and directly subtract the result from the noisy signal. The classic example of such a system is the measurement of a fetal ECG in utero. The fetal ECG is contaminated by the mother's heartbeat. The mother's ECG can be measured, but it is distorted as it passes through her body before being measured again at her abdominal region. In this case, the passage dynamics to be modeled are the transfer function of the mother's ECG as it passes through her abdominal region and is picked up by the fetal ECG monitor. If the passage dynamics can be accurately modeled, the mother's ECG can be removed efficiently.

Figure 4-36: Signal diagram of the noise cancellation problem: the information signal x(k) (not measurable) is summed with the distorted noise d(k) (not measurable), produced by passing the measurable noise n(k) through the passage dynamics; only y(k) = x(k) + d(k) and n(k) are measurable

Five hundred points of random noise were injected into the system, and the input and output were collected. A 6-PE RMLP was trained with the full RTRL algorithm, Zipser's algorithm, the DS-π algorithm, and the DS-FOE algorithm. Figure 4-37 shows the learning curve and the semilog plot of the learning curve for the four algorithms; each learning curve is averaged over the same four sets of random initial conditions. As expected, the DS-π algorithm again performed virtually the same as the full RTRL algorithm, which proves that the problem can be solved using only the full gradient information from at most 3 temporal neighbors of each PE. Zipser's algorithm again performed the worst of all the algorithms, and the DS-FOE algorithm performed between Zipser's and the full RTRL algorithm.

Figure 4-37: Learning curves and the log of the learning curves for the passage dynamics system identification

The DS-FOE reached the bottom of the learning curve significantly faster than Zipser's method. Also, although the learning curves all look very similar, since they all have a shallow slope, it will take Zipser's algorithm significantly longer to reach the same MSE as the DS-FOE algorithm. Figure 4-38 plots the elements of the 6th row of the Z matrix over time for the passage dynamics problem. This plot shows the amount of temporal information shared between PE 6 and all the other PEs. By choosing the 2 PEs (along with the self-recurrent loop between PE 6 and itself) that have the highest temporal correlation with PE 6, the algorithm trains in nearly the same number of epochs with much less computation per epoch. This produces a much lower overall training time. This particular application does not have very interesting switching dynamics: at first, the winning neighbors were PE 3 and PE 4, but after 30 epochs PE 1 replaced PE 4.

Figure 4-38: Elements of the 6th row of the Z matrix over time, showing the temporal neighbors selected for PE 6
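A minimal sketch of this selection step (names are mine; forcing the self-recurrent term to always be kept is an assumption consistent with the text):

    import numpy as np

    def select_temporal_neighbors(Z, m=3):
        # For each PE, keep the m PEs (including itself) that share the most
        # temporal gradient information, i.e. the largest-magnitude entries
        # of its row of the first-order estimate matrix Z.  Only these
        # cross-terms are retained in the RTRL sensitivity update.
        neighbors = []
        for i in range(Z.shape[0]):
            row = np.abs(Z[i]).astype(float)
            row[i] = np.inf                       # always keep the self-loop
            neighbors.append(np.argsort(row)[::-1][:m])
        return neighbors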

Again, as expected, this plot verifies that the temporal gradient information content between PEs is smooth and slowly changing. This allows us to subgroup the PEs without switching too often, which could disrupt the training. For comparison, the same input was tested using 4 other common networks. First, a linear combiner (FIR) with 50 taps was tested. Since the passage dynamics are nonlinear, it came as no surprise that the linear combiner could not model the system; the best it could do was an MSE of 0.10. Next, a tap-delay neural network (TDNN) was tested. The TDNN is a nonlinear version of the FIR in which a tap-delay line is added at the input of an MLP, giving the MLP the ability to map temporal signals. The TDNN was set up with 7 taps and 6 hidden PEs so that both it and the RMLP would have the same number of weights (48). The TDNN trained more quickly than the RMLP architecture: the average training time was 40 epochs versus approximately 100 epochs for the RMLP. The average MSE after 200 training epochs, however, was twice that of the RMLP. The RMLP is a more powerful architecture than the TDNN and thus was able to capture more of the dynamics of the system. The TDNN simply embeds the input into a multi-dimensional static pattern, which must then be mapped by the MLP; the limited flexibility of the memory in the TDNN prevents it from being applied to some complex problems. Using a gamma memory in place of the tap-delay line did not significantly improve the performance of the TDNN. Lastly, I trained an identical RMLP with BPTT. BPTT is a trajectory algorithm for computing the temporal gradients, which means that the network is updated in batch mode after each trajectory has been processed in both the forward and backward directions.

Since batch-mode algorithms train more slowly than on-line algorithms at the same step size (the gradients are already being averaged), the step size for the BPTT algorithm was set larger than the RMLP step size. The BPTT algorithm still required over 2000 epochs to properly train the network. The trajectory size (the depth at which the gradients are calculated) was varied between 5 and 50 samples without significant improvement in the performance of the BPTT algorithm.

Summary of Dynamic Subgrouping

This section presented two control applications that were solved with a recursive multi-layer perceptron and the forward propagation of gradients. The DS-π algorithm showed that, with the proper selection of neighbors, a dynamic subgrouping algorithm can greatly decrease the required computation per epoch while consistently training in nearly the same number of epochs as the full RTRL algorithm. The DS-FOE algorithm showed that, with a simple first-order approximation to the π matrix, the neighbors can be selected well enough to consistently outperform Zipser's algorithm. This is the last demonstration of the temporal organization of PEs using diffusion and temporal Hebbian learning concepts, and it may be the most important because it is a dramatic improvement on a generically applied method. Any methodology that uses real-time recurrent learning to train a neural network can use this approach and expect significant performance improvements with little loss of power. The methodology reduces the number of computations by a factor of N²/m², where N is the number of PEs in the network and m is the number of temporal neighbors used. When the size of the groups is held constant as the networks grow (larger networks simply mean more subnetworks), which makes sense both practically and theoretically, the overall computational requirements for the DS-FOE algorithm are only O(N²), versus O(N⁴) for RTRL.


CHAPTER 5
CONCLUSIONS AND FUTURE RESEARCH POTENTIAL

Conclusions

The goal of this work was to provide a new mechanism that allows neural networks to process temporal patterns and signals more efficiently. Inspired by biology, the fundamental concept underlying the proposed mechanism is the temporal self-organization of the PEs in a network. The temporal organization of the PEs allows them to process temporal data in a more organized and structured manner. Because it is done in an unsupervised (or self-organizing) manner, the temporal organization can be considered independent of the normal operation of the network: the fundamental operation of the underlying neural network architecture is unchanged, allowing simple integration of the method with existing algorithms. This is my main theoretical contribution: temporal self-organization in space-time can easily be added to existing neural network architectures with little change in their fundamental operation, yet significantly improve their performance on temporal applications. Most temporal neural networks simply add short-term memory to an existing static neural network. My methodology can be viewed as a unique combination of short-term and long-term memory: the temporal information from the input is stored both in the activations of the network and in the structure of the network. This creates a unique trainable memory architecture that responds favorably to signals similar to those it was trained with. This memory is more similar to biological memories.

Activity diffusion is the local mechanism used to transfer information throughout the network. It is biologically inspired and does not require global connections or communication. Activity diffusion and temporal Hebbian learning are used to train the temporal components of the network. These techniques were applied to three different neural network architectures and increased the performance of each of them; there is no reason to believe that the mechanism could not also be applied to many other architectures. When applied to a self-organizing map, the activity diffusion concept creates a spatio-temporal mapping in the network. The SOM maps similar input vectors to similar locations in the space of the SOM, and the SOTPAR dynamics map temporally similar inputs (e.g., successive inputs) to similar locations in the SOM. The combination of these two concepts creates the unique concept of self-organization in space and time. When the architecture processes its data, the activity diffusion creates waves of temporal activity that travel through the output map and influence the future operation of the network. These traveling waves provide a truly unique memory mechanism for the system. This architecture can be trained in an unsupervised manner with unsegmented data and will map common sequences in the input to sequential PE locations in the SOM. When applied to a landmark recognition problem in robotics, the SOTPAR architecture was able to use the temporal information in the signals to smooth the noisy turning data and gracefully handle the time warping caused by variations in the speed of the robot. A specially modified version of the SOTPAR provided one-shot training, landmark recognition, endpoint detection, and location information. This neural implementation of a landmark recognition system is unique in the robotics field.

Real data from a robot was collected and tested by Pedro Kulzer. The results were very promising, but testing was discontinued after he graduated. More research, testing, and comparison to other methods are required, but the approach may provide a new and powerful mechanism for robotic navigation. The SOTPAR was also applied to the mapping of phoneme sequences. Phoneme sequences from an industry-standard word database were mapped to an output space using traveling wavefronts with a pre-specified width. This mechanism creates local clusters of PEs that represent variations of a single phoneme, but the local clusters are globally organized by their temporal sequence. When compared against standard vector quantization techniques, the global temporal ordering created a much more organized map, with most instances of each phoneme clustered in a single small region of the output map. This architecture has been implemented in hardware and interactively demonstrated at neural network conferences by the German research group headed by Ruwisch. Next, I applied the self-organization in space-time method to a neural gas architecture, which freed the network from the spatial lattice of the SOM. The SOTPAR2 architecture uses a secondary connectivity matrix to store the temporal information from the training data. The connectivity matrix is trained using temporal Hebbian learning, and the dynamics of the activity diffusion are very similar to those of the SOTPAR. As each PE fires, the activity from this PE diffuses over time and space to affect the future operation of the network. In this case, however, the activity diffuses through the connectivity matrix, which allows the temporal neighbors to be trained more effectively. The concept that best describes the training and operation of this structure is anticipation.

The network is again trained in an unsupervised manner, without segmentation or labeling of the data. The network is trained so that it anticipates future inputs based on past inputs (and its training) and responds favorably to these anticipated inputs. When applied to a vector quantization algorithm, the network uses temporal information to remove spatial noise from the input by creating dynamic Voronoi regions that shrink or grow based upon the past of the signal. This concept is unique in the neural network community. For instance, if the previous three inputs to a network were the phonemes 's', 'm', and 'a', then the phoneme 'l' would be much more likely to be accepted next than the phoneme 'd'. The SOTPAR2 architecture was applied to the vector quantization of speech data for a digit recognition problem. The input data consisted of 15 graduate students and professors speaking the words one through ten; the first 10 speakers were used as a training set and the last 5 as a test set. The SOTPAR2 and neural gas algorithms were trained and used to vector quantize the speech data before it was processed by a set of MLP digit detectors. The SOTPAR2 VQ methodology reduced the number of errors in the test set by 25%, and the SOTPAR2/MLP architecture also outperformed an HMM system by 20%. These results are promising, but a large-scale effort is required to verify the usefulness of these techniques. The SOTPAR2 was also used to predict the Mackey-Glass chaotic signal. The SOTPAR2 clustered the input trajectories into regions, each of which used a local linear predictor. This prediction system was compared against one that used the static neural gas algorithm. The SOTPAR2 predictor reduced the mean squared error of a 25-step iterative prediction by over 60%. On average, it was able to predict 30% farther before reaching an error greater than 0.01.

The temporal information in the SOTPAR2 created a dynamically changing vector quantization algorithm that allowed for better selection of the local linear predictors based upon the past information in the signal. Lastly, the activity diffusion and temporal correlation concepts were applied to the temporal training of recurrent networks. Recurrent networks have largely been ignored in the past because of the difficulty of training them; as the other methodologies are pushed beyond their capabilities, recurrent networks are starting to be used more and more. Using the temporal self-organization concepts inside the real-time recurrent learning algorithm allows the gradient information to be clustered or subgrouped. This subgrouping of the gradients reduces the number of cross-terms that must be computed in the RTRL algorithm and greatly reduces the number of operations required to train the network; in a fully recurrent network, the number of operations is reduced from O(N⁴) to O(N²). The dynamic subgrouping greatly enhances the performance of this subgrouping technique over the static arrangement proposed by Zipser. Because the use of recurrent neural networks has been dominated by control applications, we applied this architecture to two control applications: the system identification of a nonlinear dynamic system and the passage dynamics of a noise cancellation system. For both systems, the computational cost of each training epoch was significantly lower than that of the full RTRL algorithm, and the number of epochs required for training was not significantly greater. On a small network with only 6 PEs, my dynamic subgrouping method could train the network over twice as fast as the RTRL method. On larger networks, the O(N⁴) term will make the results even more impressive.

167 The methodology of using activity diffusion and temporal Hebbian learning to self-organize neural networks in space and time seems to have great potential. Each network 1 applied the network to obtained properties slightly different than the others and unique in the field of neural networks. Future Directions There are many possible directions for future research in this area in fact, it was difficult to stay focused on only these three architectures as more and more possible applications came to mind. The SOTPAR and SOTPAR2 algorithms are a unique methodology for mapping spatio-temporal patterns and better methodologies need to be determined to use these architectures for purposes other than vector quantization. The dynamic subgrouping algorithm uses a first-order approximation to the matrix of gradients to determine which PEs should be temporal neighbors. This method works well, but is suboptimal and could work better. New grouping criteria can be determined that will improve the performance even more. Additionally, this methodology should be studied in regards to the Extended Kalman Filter methodology of using second-order information to train neural networks. Most EKF implementations use similar concepts to Zipser's in that they ignore cross terms for the second-order information. This static disposal of gradient information should easily be replaced by a dynamic subgrouping similar to ours. Additionally, our method can be even more easily applied to this algorithm since the EKF algorithm already requires the first-order gradient information which can be obtained using the dynamics subgrouping technique.


In a broader sense, the temporal self-organization technique could be applied to other architectures. For instance, it might be used in the feedforward portion of a recurrent neural network to prune unneeded connections. It could also be used in MLPs or TDNNs to organize the hidden-layer PEs. In general, one of the main difficulties with the MLP or TDNN is that the PEs are all independent: without any outside organization, each PE attacks the problem as if it were the only PE (e.g., trying to solve the largest source of error). If a temporal ordering were imposed on these PEs, the network might be able to achieve a better global learning style, like divide-and-conquer. The concept of self-organization in space and time may provide many important improvements in the ability of neural networks to solve temporal problems.


REFERENCES

[Ans94] B. Ans, Y. Coiton, J.-C. Gilhodes, J.-L. Velay, A Neural Network Model for Temporal Sequence Learning and Motor Programming, Neural Networks, Vol. 7, No. 9, pp. 1461-1476, 1994.
[Bou90] H. Bourlard, How Connectionist Models Could Improve Markov Models for Speech Recognition, in R. Eckmiller (ed.), Advanced Neural Computers, pp. 247-254, North-Holland, Amsterdam, 1990.
[Bur93] N. Burgess, J. O'Keefe, and M. Recce, Using Hippocampal "Place Cells" for Navigation, Exploiting Phase Coding, in Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, 1993.
[Bur95] N. Burgess, M. Recce, and J. O'Keefe, Hippocampus: Spatial Models, in M.A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, Bradford Books/MIT Press, 1995.
[Cam95] P. Campolucci, F. Piazza, A. Uncini, Online Algorithms for Neural Networks with IIR Synapses, in Proceedings of the IEEE International Conference on Neural Networks (ICNN '95), Perth, Australia, December 1995.
[Cha93] G.J. Chappell and J.G. Taylor, The Temporal Kohonen Map, Neural Networks, Vol. 6, pp. 441-445, 1993.
[Chr93] C. Christodoulou, G. Bugmann, T.G. Clarkson, J.G. Taylor, The Temporal Noisy-Leaky Integrator Neuron Model, in R. Beale (ed.), Recent Advances in Neural Networks, Ellis Horwood, 1993.
[Chr95a] C. Christodoulou and T. Clarkson, A Review on the Stochastic Firing Behaviour of Real Neurons and How It Can Be Modelled, in J. Mira and F. Sandoval (eds.), From Natural to Artificial Neural Computation, Lecture Notes in Computer Science, Vol. 930, pp. 223-230, Springer-Verlag, 1995.
[Chr95b] C. Christodoulou, T. Clarkson, and J.G. Taylor, Temporal Pattern Detection and Recognition Using the Temporal Noisy Leaky Integrator Neuron Model with the Postsynaptic Delays Trained Using Hebbian Learning, in Proceedings of the World Congress on Neural Networks, 1995.


[Cri94] D.A. Critchley, Extending the Kohonen Self-Organising Map by Use of Adaptive Parameters and Temporal Neurons, Ph.D. Thesis, University College London, Department of Computer Science, February 1994.
[Cun94] R.K. Cunningham and A.M. Waxman, Diffusion-Enhancement Bilayer: Realizing Long-Range Apparent Motion and Spatiotemporal Grouping in a Neural Architecture, Neural Networks, Vol. 7, Nos. 6/7, pp. 895-924, 1994.
[Cyb89] G. Cybenko, Approximation by Superpositions of a Sigmoidal Function, Mathematics of Control, Signals, and Systems, Vol. 2, pp. 303-314, 1989.
[DeV91] B. de Vries and J.C. Principe, A Theory for Neural Networks with Time Delays, in R.P. Lippmann, J. Moody, and D.S. Touretzky (eds.), Advances in Neural Information Processing Systems 3, pp. 162-168, Morgan Kaufmann, 1991.
[Dur96] S. Durand and F. Alexandre, TOM, A New Temporal Neural Net Architecture for Speech Signal Processing, in Proceedings of ICASSP '96, Vol. 6, pp. 3550-3553, 1996.
[Elm90] J.L. Elman, Finding Structure in Time, Cognitive Science, Vol. 14, pp. 179-211, 1990.
[Eul96a] N.R. Euliano and J.C. Principe, Spatio-Temporal Self-Organizing Feature Maps, in Proceedings of ICNN '96, Washington, DC, June 1996, pp. 1900-1905.
[Eul96b] N.R. Euliano, J.C. Principe, P. Kulzer, A Self-Organizing Temporal Pattern Recognizer with Application to Robot Landmark Recognition, accepted to the Sintra Spatiotemporal Models in Biological and Artificial Systems Workshop, November 1996.
[Fah91] S.E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture, Carnegie Mellon University, Computer Science Technical Report CMU-CS-90-100, 1991.
[Far95] I. Farkas, On Vector-Coded Feature Mapping Using Self-Organizing Neural Maps, Ph.D. Thesis, Slovak Technical University, Bratislava, 1995.
[Fre91] J.A. Freeman and D.M. Skapura, Neural Networks: Algorithms, Applications, and Programming Techniques, Addison-Wesley, Reading, MA, 1991.


[Fre92] W.J. Freeman, Tutorial on Neurobiology: From Single Neurons to Brain Chaos, International Journal of Bifurcation and Chaos, Vol. 2, No. 3, pp. 451-482, 1992.
[Fuk88] K. Fukushima, Neocognitron: A Hierarchical Neural Network Capable of Visual Pattern Recognition, Neural Networks, Vol. 1, pp. 119-130, 1988.
[Gil91] C. Giles, D. Chen, C. Miller, H. Chen, G. Sun, Y. Lee, Second-Order Recurrent Networks for Grammatical Inference, in Proceedings of the International Joint Conference on Neural Networks, Vol. 2, pp. 273-281, Seattle, WA, 1991.
[Gop94a] J. Goppert and W. Rosenstiel, Selective Attention and Self-Organizing Maps, in Proceedings of Neural Networks and Their Applications, IUSPIM, Marseille, France, 1994.
[Gop94b] J. Goppert and W. Rosenstiel, Dynamic Extensions of Self-Organizing Maps, in Proceedings of the International Conference on Artificial Neural Networks, Sorrento, Springer, London, 1994.
[Gop95] J. Goppert and W. Rosenstiel, Neurons with Continuous Varying Activation in Self-Organizing Maps, in From Natural to Artificial Neural Computation, Lecture Notes in Computer Science, Vol. 930, pp. 416-426, Springer-Verlag, 1995.
[Gro82] S. Grossberg, Learning by Neural Networks, in S. Grossberg (ed.), Studies of Mind and Brain, pp. 65-156, D. Reidel Publishing, Boston, MA, 1982.
[Hay94] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, 1994.
[Hay96] S. Haykin, Adaptive Filter Theory, 3rd edition, Prentice-Hall, pp. 562-588, 1996.
[Haf90] P. Haffner, M. Franzini, A. Waibel, Integrating Time Alignment and Neural Networks for High Performance Continuous Speech Recognition, in Proceedings of ICASSP, Vol. 1, pp. 425-428, IEEE, 1990.
[Hec86] R. Hecht-Nielsen, Nearest Matched Filter Classification of Spatiotemporal Patterns, Technical Report, Hecht-Nielsen Neurocomputer Corporation, San Diego, CA, June 1986.
[Hec90] R. Hecht-Nielsen, Neurocomputing, Addison-Wesley Publishing Company, p. 168, 1990.
[Het93] P.A. Hetherington and M.L. Shapiro, A Simple Network Model Simulates Hippocampal Place Fields: II. Computing Goal-Directed Trajectories and Memory Fields, Behavioral Neuroscience, Vol. 107, No. 3, pp. 434-443, 1993.
[Jan97] J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, pp. 523-533, 1997.
[Jor86] M.I. Jordan, Attractor Dynamics and Parallelism in a Connectionist Sequential Machine, in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 531-546, 1986.
[Kan90] J. Kangas, Time-Delayed Self-Organizing Maps, in Proceedings of the International Joint Conference on Neural Networks, Part 2, pp. 331-336, 1990.
[Kan91] J. Kangas, Phoneme Recognition Using Time-Dependent Versions of Self-Organizing Maps, in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 101-104, 1991.
[Kan92] J. Kangas, Temporal Knowledge in Locations of Activations in a Self-Organizing Map, in Artificial Neural Networks 2, pp. 117-120, 1992.
[Kan94] J. Kangas, On the Analysis of Pattern Sequences by Self-Organizing Maps, Ph.D. Dissertation, Helsinki University of Technology, 1994.
[Kar94] H. Kargupta and S.R. Ray, Temporal Sequence Processing Based on the Biological Reaction-Diffusion Process, in Proceedings of the IEEE ICNN '94, Vol. 4, pp. 2315-2320, 1994.
[Koh82] T. Kohonen, Self-Organized Formation of Topologically Correct Feature Maps, Biological Cybernetics, Vol. 43, pp. 59-69, 1982.
[Koh91] T. Kohonen, The Hypermap Architecture, in Proceedings of the International Conference on Artificial Neural Networks, pp. 1357-1360, 1991.
[Kre96a] B. Krekelberg and J.G. Taylor, Nitric Oxide and the Development of Long-Range Horizontal Connectivity, Neural Network World, Vol. 6, No. 2, pp. 185-189, 1996.
[Kre96b] B. Krekelberg and J.G. Taylor, Nitric Oxide in Cortical Map Formation, in Proceedings of the International Conference on Artificial Neural Networks, 1996.
[Kul96] P. Kulzer, NAVBOT: Autonomous Robotic Agent with Neural Network Learning of Autonomous Mapping and Navigation Strategies, Master's Thesis, University of Aveiro, Portugal, 1996.


[Mac62] D.M. MacKay, Self-Organization in the Time Domain, in Self-Organizing Systems, 1962.
[Mar90] T.M. Martinetz and K.J. Schulten, Hierarchical Neural Net for Learning Control of a Robot's Arm and Gripper, in Proceedings of the International Joint Conference on Neural Networks, pp. 747-752, 1990.
[Mar93] T.M. Martinetz, S.G. Berkovich, K.J. Schulten, "Neural-Gas" Network for Vector Quantization and Its Application to Time-Series Prediction, IEEE Transactions on Neural Networks, Vol. 4, No. 4, pp. 558-569, July 1993.
[McA94] J.D. McAuley, Time as Phase: A Dynamic Model of Time Perception, in Proceedings of the 16th Annual Conference of the Cognitive Science Society, pp. 607-612, Lawrence Erlbaum, 1994.
[Miy93] H. Miyamoto and K. Fukushima, Recognition of Spatio-Temporal Patterns by a Multi-Layered Neural Network Model, in Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 2267-2270, 1993.
[Moz92] M.C. Mozer, Induction of Multiscale Temporal Structure, in Advances in Neural Information Processing Systems IV, pp. 275-282, Morgan Kaufmann, 1992.
[Moz94] M. Mozer, Neural Net Architectures for Temporal Sequence Processing, in A.S. Weigend and N.A. Gershenfeld (eds.), Time Series Prediction: Forecasting the Future and Understanding the Past, pp. 243-264, Addison-Wesley Publishing Company, 1994.
[Mur89] J. Murray, Mathematical Biology, Springer-Verlag, New York, 1989.
[Ore94] R.C. O'Reilly and J.L. McClelland, Hippocampal Conjunctive Encoding, Storage, and Recall: Avoiding a Tradeoff, Parallel Distributed Processing and Cognitive Neuroscience Technical Report PDP.CNS.94.4, June 1994.
[Pea95] B. Pearlmutter, Gradient Calculations for Dynamic Recurrent Neural Networks: A Survey, IEEE Transactions on Neural Networks, Vol. 6, No. 5, pp. 1212-1228, 1995.
[Pol91] J.B. Pollack, The Induction of Dynamical Recognizers, Machine Learning, Vol. 7, pp. 227-252, 1991.
[Pri94] C.M. Privitera and P. Morasso, The Analysis of Continuous Temporal Sequences by a Map of Sequential Leaky Integrators, in Proceedings of ICNN '94, pp. 3127-3130, 1994.
[Pri96] C.M. Privitera and L. Shastri, A DSOM Hierarchical Model for Reflexive Processing: An Application to Visual Trajectory Classification, International Computer Science Institute, Berkeley, CA, Technical Report TR-96-011, June 1996.
[Rei91] M. Reiss and J.G. Taylor, Storing Temporal Sequences, Neural Networks, Vol. 4, pp. 773-787, 1991.
[Rol93] E.T. Rolls and A. Treves, Neural Networks in the Brain Involved in Memory and Recall, in Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 9-14, 1993.
[Rum86] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning Internal Representations by Error Propagation, in D.E. Rumelhart and J.L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Chapter 8, MIT Press, Cambridge, MA, 1986.
[Ruw93] D. Ruwisch, M. Bode, H.-G. Purwins, Parallel Hardware Implementation of Kohonen's Algorithm with an Active Medium, Neural Networks, Vol. 6, pp. 1147-1157, 1993.
[Ruw97] D. Ruwisch, B. Dobrzewski, and M. Bode, Wave Propagation as a Neural Coupling Mechanism: Hardware for Self-Organizing Feature Maps and the Representation of Temporal Sequences, in IEEE Workshop on Neural Networks for Signal Processing Proceedings, pp. 306-315, 1997.
[San97] I.W. Sandberg and L. Xu, Uniform Approximation and Gamma Networks, Neural Networks, Vol. 10, pp. 781-784, 1997.
[Saw91] H. Sawai, Frequency Shift Invariant Time-Delay Neural Networks for Robust Continuous Speech Recognition, in Proceedings of ICASSP, Vol. 1, IEEE, 1991.
[Sch90] N.A. Schmajuk, Role of the Hippocampus in Temporal and Spatial Navigation: An Adaptive Neural Network, Behavioral Brain Research, Vol. 39, pp. 205-229, Elsevier, 1990.
[Sch92a] J. Schmidhuber, Learning Unambiguous Reduced Sequence Descriptions, in Advances in Neural Information Processing Systems IV, pp. 291-298, Morgan Kaufmann, 1992.
[Sch92b] J. Schmidhuber, Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks, Neural Computation, Vol. 4, No. 1, pp. 131-139, 1992.


[Sch92c] J. Schmidhuber, A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks, Neural Computation, Vol. 4, pp. 243-248, 1992.
[Sha91] P.E. Sharp, Computer Simulation of Hippocampal Place Cells, Psychobiology, Vol. 19, No. 2, pp. 103-115, 1991.
[Sha93] M.L. Shapiro and P.A. Hetherington, A Simple Network Model Simulates Hippocampal Place Fields: Parametric Analysis and Physiological Predictions, Behavioral Neuroscience, Vol. 107, No. 1, pp. 34-50, 1993.
[She94] A. Sherstinsky and R.W. Picard, M-Lattice: A Novel Non-Linear Dynamical System and Its Application to Halftoning, in Proceedings of IEEE ICASSP, Vol. II, pp. 565-568, 1994.
[Sta75] J.C. Stanley and W.L. Kilmer, A Wave Model of Temporal Sequence Learning, International Journal of Man-Machine Studies, Vol. 7, pp. 395-412, 1975.
[Sun92] G.Z. Sun, H.H. Chen, Y.C. Lee, Green's Function Method for Fast On-line Learning Algorithm of Recurrent Neural Networks, in Advances in Neural Information Processing Systems 4, pp. 333-340, 1992.
[Tak81] F. Takens, Detecting Strange Attractors in Turbulence, in D.A. Rand and L.-S. Young (eds.), Dynamical Systems and Turbulence, Lecture Notes in Mathematics, Vol. 898, Springer-Verlag, Berlin, 1981.
[Tan87] D.W. Tank and J.J. Hopfield, Neural Computation by Concentrating Information in Time, Proceedings of the National Academy of Sciences USA, Vol. 84, pp. 1896-1900, 1987.
[Tsu93] M. Tsukada, A Theoretical Model of the Hippocampal-Cortical Memory System Motivated by Physiological Functions in the Hippocampus, in Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 1120-1123, 1993.
[Tur52] A. Turing, The Chemical Basis of Morphogenesis, Philosophical Transactions of the Royal Society of London, Series B, Vol. 237, pp. 37-72, 1952.
[Tys88] J.J. Tyson and J.P. Keener, Singular Perturbation Theory of Traveling Waves in Excitable Media (A Review), Physica D, Vol. 32, pp. 327-361, 1988.
[Wai89a] A. Waibel, Modular Construction of Time-Delay Neural Networks for Speech Recognition, Neural Computation, Vol. 1, pp. 39-46, 1989.


[Wai89b] A. Waibel, T. Hanazawa, G.E. Hinton, K. Shikano, K.J. Lang, Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, pp. 328-339, 1989.
[Wan90] D. Wang and M.A. Arbib, Complex Temporal Sequence Learning Based on Short-Term Memory, in Proceedings of the IEEE, 1990.
[Wan93] D. Wang and M.A. Arbib, Timing and Chunking in Processing Temporal Order, IEEE Transactions on Systems, Man, and Cybernetics, 1993.
[Wan95] D. Wang, Temporal Pattern Encoding in Neural Networks, in The Handbook of Brain Theory and Neural Networks, MIT Press, 1995.
[Wat90] R.L. Watrous, Phoneme Discrimination Using Connectionist Networks, Journal of the Acoustical Society of America, Vol. 87, No. 4, pp. 1753-1772, April 1990.
[Wat91] R.L. Watrous, Context-Modulated Vowel Discrimination Using Connectionist Networks, Computer Speech and Language, Vol. 5, pp. 341-362, 1991.
[Wei94] A.S. Weigend and N.A. Gershenfeld (eds.), Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley Publishing Company, 1994.
[Wil76] D.J. Willshaw and C. von der Malsburg, How Patterned Neural Connections Can Be Set Up by Self-Organization, Proceedings of the Royal Society of London, Series B, Vol. 194, pp. 431-445, 1976.
[Wil89] R.J. Williams and D. Zipser, A Learning Algorithm for Continually Running Fully Recurrent Neural Networks, Neural Computation, Vol. 1, pp. 270-280, 1989.
[Wil90] R.J. Williams and J. Peng, An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories, Neural Computation, Vol. 2, pp. 490-501, 1990.
[Yam93] K. Yamauchi, M. Fukuda, K. Fukushima, A Speech Recognition System Consisting of Auditory Feature Extracting Cells and Velocity-Controlled Delay-Lines, Part II: Recognition Model, in Proceedings of the 1993 International Joint Conference on Neural Networks, pp. 259-262, 1993.
[Zip89] D. Zipser, A Subgrouping Strategy that Reduces Complexity and Speeds Up Learning in Recurrent Networks, Neural Computation, Vol. 1, pp. 552-558, 1989.
[Zip90] D. Zipser, Subgrouping Reduces Complexity and Speeds Up Learning in Recurrent Networks, in Advances in Neural Information Processing Systems 2, pp. 638-641, 1990.


BIOGRAPHICAL SKETCH

Neil R. Euliano II was born in Erie, Pennsylvania, on April 2, 1964. Neil attended Lake Brantley High School in Orlando and graduated as the academic salutatorian. Neil continued his education at the University of Florida, receiving a bachelor's degree in computer engineering; in August 1986 he was recognized as the Four Year Scholar of his graduating class. In 1986 Neil became the first recipient of the Challenger Memorial Scholarship for graduate study in the field of electrical engineering. He graduated with a Master of Engineering in electrical engineering in August 1988. His master's thesis was titled "The Implementation of General-Purpose Systolic Arrays for Digital Signal Processing and Linear Algebra." Neil worked for AT&T Bell Laboratories in the Avionics Systems Division as a systems engineer from 1988 until 1992, when he returned to the University of Florida for his Ph.D. After graduation Neil will remain in Gainesville, working as a consultant for two local startup companies. Neil has a wife, Tammy, and two children, Erin (age 2¾) and Matthew (age 1½).


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Jose C. Principe, Chairman
Professor of Electrical and Computer Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Fred J. Taylor
Professor of Electrical and Computer Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Donald G. Childers
Professor of Electrical and Computer Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

William W. Edmonson
Assistant Professor of Electrical and Computer Engineering


I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Jon G. Allen
Professor of Entomology and Nematology

This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 1998

Winfred M. Phillips
Dean, College of Engineering

Karen A. Holbrook
Dean, Graduate School
