TEMPORAL PROCESSING WITH NEURAL NETWORKS -
THE DEVELOPMENT OF THE GAMMA MODEL
BY
BERT DE VRIES
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1991
ACKNOWLEDGMENTS
The chances of mental depression during one's Ph.D. studies are not small.
Poverty and personal, social and intellectual isolation are just a few of the risks of the
"profession". Thus the real-world environment as provided by family, friends and
faculty largely determines the sanity of the student and therefore the quality of the
resulting dissertation. I have been very fortunate in this respect, although my friends,
committee members, and least of all my family, deserve no blame for the quality of this work.
My Ph.D. supervisor, Dr. Jose C. Principe, has been much more than an advisor.
His limitless supply of ideas, his personal commitment and warm character make him
an ideal supervisor, a fact which many students besides me have recognized. Here
I wish to thank him for his collaboration and his friendship. In the Spring of 1989
I worked in the Hearing Research Laboratory headed by Dr. David Green. To witness
Dr. Green's approach toward conducting science is one of the best lessons a Ph.D.
student can receive. Dr. Jan van der Aa has served on both my master's and Ph.D.
supervising committees. His commitment to offer outstanding help at any time and his
personal friendship are very much appreciated. Dr. Donald Childers has been very
helpful at several times in guiding the next research steps. Dr. Fred Taylor's willingness
to serve on my doctoral committee is very much appreciated. I have had much support
from Dr. James Keesling from the mathematics department and Dr. Antonio Arroyo
from the electrical engineering department. Dr. Pedro Guedes de Oliviera from the
electrical engineering department of the University of Aveiro in Portugal visited our
laboratory during the 1991 spring semester. He has made significant contributions to
the understanding of the gamma model. His help and friendship are also very much
appreciated.
Several graduate students in the Computational Neuro-Engineering Laboratory
(CNEL) have directly contributed to the work that is presented here. James Kuo
performed the experiments on noise reduction which are discussed in chapter 5. The
experiments on prediction of the Mackey-Glass series were carried out by Alok Rathie.
Curt Lefebvre, Samel Selebi, James Tracey and Mark Goldberg have done significant
work on the gamma model as well.
Furthermore I should thank my friends and the students in our laboratory for
their friendship.
Most of all, I am indebted to my dear parents and my sisters Karin and Marleen.
Their support, encouragement and love cannot be compensated by a few simple lines.
I dedicate this work for what it is worth to their health and happiness.
TABLE OF CONTENTS
CHAPTER 1
THE TEMPORAL PROCESSING PROBLEM AND RESEARCH GOALS
1.1 Introduction
1.2 A Statement of the Problem
1.3 Research Goals
1.4 A Summary of the Next Chapters
CHAPTER 2
A REVIEW OF NEURAL NETS FOR TEMPORAL PROCESSING
2.1 Introduction
2.2 A Recapitulation of Linear Digital Filters
2.3 Introduction to Neural Networks
2.4 The Adaptive Linear Combiner
2.5 Neural Network Paradigms: Static Models
2.5.1 The Continuous Mapper
2.5.2 The Associative Memory
2.6 Neural Network Paradigms: Dynamic Nets
2.6.1 Short Term Memory by Local Positive Feedback
2.6.2 Short Term Memory by Delays
2.6.3 The Sequential Associative Memory
2.7 Other Dynamic Neural Nets
2.8 Discussion
CHAPTER 3
THE GAMMA NEURAL MODEL
3.1 Introduction: Convolution Memory versus ARMA Model
3.2 The Gamma Memory Model
3.3 Characteristics of Gamma Memory
3.3.1 Transformation to s- and z-Domain
3.3.2 Frequency Domain Analysis
3.3.3 Time Domain Analysis
3.3.4 Discussion
3.4 The Gamma Neural Net
3.4.1 The Model
3.4.2 The Gamma Model versus the Additive Neural Net
3.4.3 The Gamma Model versus the Convolution Model
3.4.4 The Gamma Model versus the Concentration-in-Time Net
3.4.5 The Gamma Model versus the Time Delay Neural Net
3.4.6 The Gamma Model versus Adaline
3.5 Discussion
CHAPTER 4
GRADIENT DESCENT LEARNING IN THE GAMMA NET
4.1 Introduction: Learning as an Optimization Problem
4.2 Gradient Computation in Simple Static Networks
4.2.1 Gradient Computation by Direct Numerical Differentiation
4.2.2 The Backpropagation Procedure
4.2.3 An Evaluation of the Direct Method versus Backpropagation
4.3 Error Gradient Computation in the Gamma Model
4.3.1 The Direct Method
4.3.2 Backpropagation in the Gamma Net
4.4 The Focused Gamma Net Architecture
4.4.1 Architecture
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Introduction
5.2 Gamma Net Simulation and Training Issues
5.2.1 Gamma Net Adaptation
5.3 (Non-)linear Prediction of a Complex Time Series
5.3.1 Prediction/Noise Removal of Sinusoids Contaminated by Gaussian Noise
5.3.2 Prediction of an EEG Sleep Stage Two Segment
5.3.3 Prediction of the Mackey-Glass Chaotic Time Series
5.4 System Identification
5.5 Temporal Pattern Classification: Training a Concentration-in-Time Net
5.6 Noise Reduction in State Space
5.7 Discussion
CHAPTER 6
THE LINEAR FILTERING PERSPECTIVE
6.1 Introduction
6.2 A Recapitulation of Linear Digital Filter Architectures
6.3 Generalized Feedforward Filters: Definitions
6.4 The Adaptive Gamma Filter
6.4.1 Definitions
6.4.2 Stability
6.4.3 Memory Depth versus Filter Order
6.4.4 LMS Adaptation
6.4.5 Wiener-Hopf Equations for the Adaptive Gamma Filter
6.5 Experimental Results
6.6 The Gamma Transform: A Design and Analysis Tool for Gamma Filters
6.7 A Second-order Memory Delay Element
6.8 Discussion
CHAPTER 7
CONCLUSIONS AND FUTURE RESEARCH RECOMMENDATIONS
7.1 A Recapitulation of the Research
7.2 Ongoing Research Projects
7.3 Future Research Directions
REFERENCES
BIOGRAPHICAL SKETCH
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
TEMPORAL PROCESSING WITH NEURAL NETWORKS -
THE DEVELOPMENT OF THE GAMMA MODEL
By
Bert de Vries
October 1991
Chairman: Dr. Jose C. Principe
Major Department: Electrical Engineering
This dissertation discusses the problem of processing complex temporal
patterns by artificial neural networks. The relatively broad topic of this work is
intentional; processing here includes such specialties as system identification, time
series prediction, interference canceling and sequence classification. Rather than
focusing on a particular application, this research concentrates on the paradigm of time
representation in neural network structures.
In all temporal processing applications, an essential capacity for a neural net is
to store information from the recent past (we refer to this capacity as short term
memory). The main contribution of this work is the introduction of a new (neural net)
mechanism to store temporal information. This model, the gamma neural model,
compares very favorably to competing memory structures, such as the tapped delay line
and first-order self-recurrent memory units. The gamma memory mechanism is
characterized by a cascade of uniform locally self-recurrent delay units. An interesting
feature of the gamma memory mechanism is the adaptability of the memory depth and
resolution.
The gamma model is analyzed and compared with competing neural models. A
temporal backpropagation training procedure for gamma neural nets is derived.
Experiments in time series prediction (electro-encephalogram (EEG) and
synthetic chaotic signals), noise removal from a chaotic signal and system
identification are discussed. In all experiments, the gamma model outperforms
competing network architectures.
Interestingly, the application of the gamma memory structure is not limited to
neural nets. A chapter is devoted to introducing adaline(γ), an adaptive linear filter with
gamma memory. Adaline(γ) generalizes Widrow's adaptive linear combiner (adaline),
the most widely used structure in adaptive signal processing. The signal characteristics
and processing applications where adaline(γ) improves on the performance of adaline
are identified.
CHAPTER 1
THE TEMPORAL PROCESSING PROBLEM AND RESEARCH GOALS
1.1 Introduction
When I started this research somewhere around January 1988, the goal was to
develop a speech recognition system that was characterized by a relatively high degree
of biological plausibility. Thus I spent the first semester of 1988 in the Hearing
Research Center at the University of Florida. This speech system-in-waiting was going
to be developed from the bottom up! First I would develop a neural auditory model that
captures the essential features for speech analysis, followed by new techniques for
neural temporal pattern classification and so on. As time passed by, the topic of this
dissertation has shifted more toward the fundamental problem of time representation in
network structures per se. As a result, a neural network model for processing of
temporal patterns has been developed, and at the time of the writing of this thesis we
are conducting experiments in isolated word recognition and phoneme classification
making use of this new model. As it turns out, the application of the gamma model (as
the model is called) is not limited to the temporal classification problem.
Successful experiments were performed as well for problems in noise reduction,
system identification and time series prediction. For these tasks the gamma model
performance is very promising, in fact better than many competing techniques. In this
thesis the development of the gamma model is presented. I try to explain why and how
it works and report on a few experimental results.
1.2 A Statement of the Problem
This dissertation deals with processing of time-varying signals by a neural
network. By a signal we mean a time sequence of patterns and consequently the
adjectives "time-varying" or "temporal" will often be deleted. The temporal context is
always assumed. The L-dimensional (input) signal to be processed is denoted by s(t)
and for the M-dimensional processed (output) signal we will write x(t). Processing of
s(t) implies the application of a map Φ which transforms s(t) into the signal x(t). In this
work we are interested in processing applications where the past of s(t) affects the
computation of x(t). Thus, Φ is actually a map from s(τ) for τ < t to x(t). Typical
applications include speech recognition, dynamic system identification and prediction
of a time series. The main problem of this thesis is how to effectively represent the past
of s(t). The following example shows why this is a difficult task.
Consider an isolated-word recognition system Φ which processes the incoming
speech signal in real-time. Assume a 16 word vocabulary and hence the output space
can be described in 4 bits. The speech signal is sampled at 8000 [Hz] and quantized by
8 bits. Then, for a typical word, which lasts 0.5 seconds, the input signal is represented
by 8000 x 8 x 0.5 = 32,000 bits. Thus, Φ is a map from 2^32000 input pattern
combinations to 4 bits. This sequence of 32,000 bits obviously contains many redundant
copies of information and is contaminated by irrelevant noise. Since Φ is a real-time
system, it must throw away data as quickly as possible, while preserving enough
information to classify the words. Note the inherent optimization task associated with
this problem: if we throw away too much data there may not be enough information left
for word classification, but yet we do not want to store redundant data and noise for it
will significantly complicate the processing task.
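To make the arithmetic of this example explicit, the following short Python sketch (an editorial illustration; the sampling rate, quantization, word duration and vocabulary size are simply the values assumed above) computes the size of the input and output spaces.

```python
sample_rate_hz = 8000      # sampling rate assumed in the example
bits_per_sample = 8        # quantization
word_duration_s = 0.5      # duration of a typical word
vocabulary_size = 16       # 16 words can be coded in 4 output bits

input_bits = int(sample_rate_hz * bits_per_sample * word_duration_s)   # 32000 bits
output_bits = (vocabulary_size - 1).bit_length()                       # 4 bits

print(f"Phi maps 2^{input_bits} input patterns onto {output_bits} output bits")
```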
As another example, consider the problem of predicting a discrete time series
s(t) by a model Φ. How many samples from the past of s(t) does the model Φ need in
order to reliably predict the next sample s(t+1)? Do we need to store the entire history
of s(t)? Very likely not, and in fact the additional noise associated with a deep memory
will be detrimental to the performance of Φ.
In general we conclude that before the data stream is submitted to the actual
processing task, first we need to rid the data stream of irrelevant information as much
as possible. Processing of a data stream so as to reduce the amount of data for further
processing is captured by the name pre-processing.
Figure 1.1 Modular feedforward information processing architecture.
In Figure 1.1 the general architecture for a feedforward information processing
scheme is shown. In this simple scheme two stages are distinguished. The first stage,
the pre-processing stage, serves to reduce the data flow to the second stage. In practice,
pre-processing of a data stream commonly consists of segmentation and feature
extraction. The second stage performs the actual processing goal. For instance in a
pattern recognition environment, the second stage is a classifier. For the purpose of this
discussion it is assumed that the second processing stage is implemented by an adaptive
neural network.
Let the signal s(t) in Figure 1.1 be defined for t0 ≤ t ≤ tf in the temporal
dimension. At any time, only part of the temporal domain of s(t) is available for further
processing. The selection mask is called a window. The width of the window is denoted
by δ. In some cases, before the data sequence is submitted to the actual processing task
Φ, signal features are computed from the windowed data segment. As an example in
word classification, the pitch period provides important information concerning the
excitation source of the speech signal. Note that both (windowed) segmentation and
feature extraction contribute to the data reduction process.
The feature extraction process is usually very problem dependent. For now, let
us concentrate on the implications of the choice of the window width δ. Three
possibilities arise. First, the width δ of the window equals the duration tf − t0 of the
signal. This case effectively transforms the temporal dimension to an extra spatial
dimension. Since the entire past of the signal is always available, no temporal
processing capabilities are required and the problem is transferred to a static processing
problem. The second possibility concerns 0 < δ < tf − t0, which is called the sliding
window technique. In this case, the selection mask moves in some fashion over the
signal domain so as to cover the input signal space as time evolves. The extreme case
of the sliding window technique occurs when δ → 0, that is, only the current signal
values are available for further processing. We will refer to this choice as current-time
processing, and effectively when δ → 0 there is no segmentation. Obviously, in this
case the memory has to be moved to the second processing stage.
The choice of the sliding window width influences the system performance. We
identify two problems associated with the selection of δ. First, for large δ, the
dimensionality of the processing system increases. This fact, at the very least, complicates
the neural net training. It has been shown that neural net adaptation time unfortunately
scales worse than proportional with the dimension of the weight vectors (Perugini and
Engeler, 1989). There are more problems associated with large networks, such as the
required increased dimension of the training set. For smaller windows the neural
network may not have enough information to appropriately learn the signal dynamics
because only a fraction of the decision space is available. However, the network
dimension gets progressively smaller which eases the learning requirements in terms
of training set size and number of learning steps.
A second issue that complicates the choice of δ involves segmentation of non-
stationary signals. Normally the length of the stationary interval is not known a priori,
and can very well change with time. A large δ tends to average the time-varying
statistics of non-stationary signals before the signals enter the neural net. A smaller δ
makes the classification very sensitive to the actual signal segment utilized, and tends
to make the classification less robust. The balance is very difficult to achieve and in
general varies with time. The common practice in speech is to bias the selection to
fixed, relatively small segments of approximately 10 milliseconds (Rabiner and
Schafer, 1978).
Apart from the difficulty of choosing the window width, the temporal
resolution of the window is another important pre-processing parameter. We define the
(temporal) resolution R of the window as the number of outputs of the window divided
by the window width (in seconds). As is the case for the width δ, the optimal resolution
is dependent on the processing goal. For instance if the vocabulary size were 1000
instead of 16 in the isolated-word recognition example, the demands on the resolution
of the window would obviously increase.
In speech processing, it is common to determine the window width based on
statistical measures of the input signal. For instance, zero-crossing rate and energy
measures have been used to estimate pseudo-stationary signal segments (Rabiner and
Schafer, 1978). Note that this approach does not use any system performance feedback
to determine the pre-processing parameters. However, as is clear from the foregoing
discussion, optimal values for the pre-processing parameters such as window size and
resolution are a function of the processing goal as expressed by a system performance
criterion. Ideally the representation of the input signal would be adapted by
performance feedback of the total processing system.
This observation forms the basis for the neural net system that is proposed in
this dissertation. The system that I propose stores the signal history in an adaptive
short-term memory structure of a neural net. The capacity of a neural net to store and
compute with information from the recent past is referred to as short term memory. The
architecture of this system is shown in Figure 1.2. The neural short-term memory
mechanism substitutes for and obviates a priori signal segmentation. An important
advantage of this approach is that neural network structures can be adapted so as to
optimize a system performance criterion. In the figure, the performance of the system
is measured by the error signal e(t), the difference between a desired output signal d(t)
and the system output x(t). Other measures of system performance are also possible. Of
central importance however is that in this framework the signal representation is
optimized by performance feedback instead of input signal statistics.
It will be shown in this thesis that adaptive pre-processing can be integrated
with information processing in the same neural network framework and
implementation.
Traditionally, linear signal processing is implemented by linear filter structures.
Digital filters can be categorized into two main architectural groups: the finite impulse
response (FIR) filters and infinite impulse response (IIR) filters. FIR filters are
feedforward and the past of the input signal is stored in a tapped delay line. IIR filters
are of recurrent (feedback) nature. As a result, more complicated memory structures
based on feedback are possible. As will be discussed in chapter 2, the same principles
for short term memory hold in neural networks. In fact, from an engineering viewpoint,
neural networks can be considered as a generalized class of non-linear adaptive filters.
The combination of adaptation and non-linearity makes neural nets very versatile
processing architectures that can be applied to a wide range of complex problems.
Indeed the possibility to incorporate the input signal representation in a unified
adaptive framework with the system overall performance is a premier stimulus for the
research reported herein.
1.3 Research Goals
The central theme of this dissertation is the representation of temporal
information in a neural net. The main goal of this research is the development of a
neural network where the input signal representation is optimized adaptively by the
neural net itself. Thus I try to achieve that preprocessing parameters such as window
resolution or depth are adaptively optimized with respect to a performance measure of
the total system.
Figure 1.2 Adaptive signal representation using a neural network processing framework.
A literature review, scheduled for chapter 2, will reveal that the main techniques
for temporal processing with neural networks are quite mediocre with respect to the
capacity to adapt to a varying signal environment. Since experimentation with a wide
range of network architectures is still going on, it seems that a consistent framework for
dynamic neural networks is still missing. Based on these premises the research plan is
scheduled as follows:
- Design a neural network for temporal processing with adaptive short term memory.
- Develop training algorithms for this neural model.
- Evaluate both in theory and by experimentation the applicability of the new model. In particular we are interested in a comparison with alternative widely used neural processing architectures.
- Determine appropriate application areas for the new model.
1.4 A Summary of the Next Chapters
Chapter 2 starts with a concise review of neural networks. Neural nets are being
studied from a variety of viewpoints. I will take the "electrical engineering view" and
emphasize the relation to linear digital filters. Mechanisms for short term memory in
neural nets are reviewed. The analysis focuses in particular on two widely used
structures, the tapped delay line and the first-order self-recurrent units (context units).
Both mechanisms are shown to have limited applicability. For example, the tapped
delay line has limited fixed memory depth whereas the context units always overwrite
information from the past with more recent information.
In chapter 3 a new framework for storage of past information in neural nets is
introduced. The new memory model, gamma memory, is supported by a mathematical
CHAPTER 2
A REVIEW OF NEURAL NETS FOR TEMPORAL PROCESSING
2.1 Introduction
In this chapter various neural network architectures for processing of time-
varying patterns are reviewed.
In order to set the framework, we are concerned with extracting information
from a temporal sequence, but leave it unspecified whether the processing goal is to
predict a future trend of the time series or to classify the sequence.
First, in the next section linear digital filters are recapitulated. Digital filters are
the basic computational tool for temporal processing. Moreover, the architectural
principles of digital filters underlie most neural network models.
2.2 A Recapitulation of Linear Digital Filters
Linear signal processing is traditionally implemented by linear filters. In a
discrete time environment, linear digital filters are networks of delay elements,
summation elements and constant-factor multipliers. Digital filters are distinguished
into two main categories: Feedforward or finite impulse response (FIR) filters and
recurrent or infinite impulse response (IIR) filters.
In a FIR filter, the input signal history is stored in a tapped delay line. The
signals at the taps are referred to as state variables. The output of the FIR filter is a
linear weighted combination of the tap variables (see Figure 2.1). FIR filters are always
stable but note that the depth of the memory is fixed and equals the number of taps of
the delay line.
Figure 2.1 The feedforward (FIR) filter.
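As a minimal illustration of the structure in Figure 2.1, the following Python sketch implements the tapped delay line and the weighted combination of the tap variables; the four-tap moving-average weights are arbitrary example values, not taken from the text.

```python
import numpy as np

def fir_filter(x, w):
    """Tapped-delay-line (FIR) filter: y(t) = sum_k w[k] * x(t - k).
    The memory depth is fixed and equals the number of taps."""
    taps = np.zeros(len(w))            # state variables of the delay line
    y = np.zeros(len(x))
    for t, sample in enumerate(x):
        taps = np.roll(taps, 1)        # shift the delay line one step
        taps[0] = sample               # newest sample enters at the first tap
        y[t] = w @ taps                # linear weighted combination of the taps
    return y

# Example: a 4-tap moving average applied to a unit impulse.
impulse = np.zeros(10); impulse[0] = 1.0
print(fir_filter(impulse, w=np.ones(4) / 4))   # response dies out after exactly 4 samples
```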
IIR filters are more complex structures since recurrent connections are also
allowed. In control theory similar linear structures are known as auto-regressive
moving-average (ARMA) systems. A so-called observer canonical form
implementation of the IIR filter is shown in Figure 2.2. The existence of recurrent
connections implies the risk of instability of the system, but increases the
computational power of the system. The state variables xi(t) are a function of both the
lower index state variables (memory by delay as in the FIR filter) as well as higher
indexed variables (memory by feedback). The feedback connections in the IIR model
imply that the depth of the memory is no longer coupled to the number of delay
elements.
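To make the contrast with the tapped delay line concrete, the sketch below shows a hypothetical first-order recurrent filter (not a structure taken from the text): it contains a single delay element, yet its impulse response, and hence its memory depth, stretches over many samples.

```python
import numpy as np

def first_order_recurrent(x, mu):
    """y(t) = (1 - mu) * y(t-1) + mu * x(t): memory by feedback.
    One delay element, but the impulse response decays over roughly 1/mu samples."""
    y = np.zeros(len(x))
    prev = 0.0
    for t, sample in enumerate(x):
        prev = (1.0 - mu) * prev + mu * sample
        y[t] = prev
    return y

impulse = np.zeros(100); impulse[0] = 1.0
response = first_order_recurrent(impulse, mu=0.1)
print(response[:4])      # 0.1, 0.09, 0.081, ... a slowly decaying tail
print(response.sum())    # close to 1: the single delay "remembers" about 1/mu = 10 steps
```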
Linear filters are widely applied but the processing power is limited. Linear
processing is appropriate for tasks such as removal of signal-independent noise and
rearranging the temporal structure of a signal. For many important tasks linear
processing does not suffice though. Examples are removal of signal-dependent noise,
classification (decision making!) and modeling of a chaotic time series.
Neural networks as an engineering tool are probably best interpreted as a
generalized class of nonlinear adaptive filters. As such, they provide the computational
features that potentially better cope with solving complex non-linear problems.
Figure 2.2 The recurrent (IIR) filter.
Next, an introduction to neural nets is presented.
2.3 Introduction to Neural Networks
This section contains a brief introduction to neural networks. In the literature
we also find names such as connectionist models, parallel distributed processing devices or
artificial neural nets, all denoting the same kind of processing architecture. The
discussion will be of general nature. For a deeper look into some equations and
implementations of neural networks, I refer the reader to a paper by Lippmann (1987) and
a book by Simpson (1990). A more thorough look at neural networks is offered in
books by Hertz et al. (1991) and Hecht-Nielsen (1990).
There is not a single best definition for a neural network. Neural net research is
being approached from various viewpoints. Different models range widely in
biological plausibility. In the context of this thesis, an electrical engineering
dissertation, the biological plausibility is not considered a high priority. We are more
interested in the computational properties of a model. In a computer science book I
have seen a definition as short as the following:
- A neural net is a weighted directed graph of simple processors.
As may be clear from the previous discussion, I prefer to interpret neural nets as a
generalized class of non-linear adaptive filters. The following features of a neural net
processor are typical:
parallel architecture: a weighted network of simple processors.
adaptation: the connection weights are adaptive.
non-linearity: the processor transfer function is in general non-linear.
The mathematical framework for neural nets is non-linear dynamics. In a
continuous-time setting, neural nets are described by a set of differential equations. In
discrete time, the dynamics are described by difference equations. Characteristically,
the constant coefficients of the equations, called weights, adapt when examples of the
problem at hand are presented to the net. Ideally, the adaptation or learning of the
weights is also determined by a differential equation. As mentioned before, neural
networks are non-linear. For one, the computational power of non-linear dynamical
systems far exceeds that of linear systems. Secondly, the non-linearity of neural nets
originates from the fact that it is believed that most interesting primitive cognitive
functions such as associative memory are non-linear. Generally, let
$x(t) = [x_1(t), \ldots, x_N(t)]^T$ hold the N-dimensional state of a neural net,
$$w = \begin{bmatrix} w_{11} & \cdots & w_{1N} \\ \vdots & & \vdots \\ w_{N1} & \cdots & w_{NN} \end{bmatrix}$$
an $N^2$-dimensional vector (for a fully connected net) of adaptive weights and $I(t) = [I_1(t), \ldots, I_N(t)]^T$ the external input to the net. Then the system
is completely described by the following set of equations:
$$\frac{dx_i}{dt}(t) = f_i(x, I, w), \qquad \text{Eq. 1}$$
$$\frac{dw_{ij}}{dt}(t) = g_{ij}(w, x). \qquad \text{Eq. 2}$$
The dynamics for the state x(t) are described by Eq. 1 and Eq. 2 describes the adaptation
dynamics. The equilibria of system Eq. 1 are computed by
$$0 = f_i(x^*, I, w), \qquad \text{Eq. 3}$$
where $x^*$ holds the steady state.
The most widely used neural network model is the so-called additive model described
by
$$\frac{dx_i}{dt}(t) = -a\, x_i(t) + \sigma\Big(\sum_j w_{ij}\, x_j(t)\Big) + I_i(t). \qquad \text{Eq. 4}$$
The additive model is used in the great majority of practical applications of
neural networks today. Sejnowski provides a biological motivation for the additive
model (Sejnowski, 1981). A flow diagram of the additive model is shown in Figure 2.3.
The state x_i(t) is affected by a passive decay -a x_i(t), yielding short term
memory, a non-linear neural feedback signal σ(Σ_j w_ij x_j(t)), and an external input I_i(t). The
neuron signal function σ(·) normally is a non-linear function. A typical choice is the
hyperbolic tangent, σ(x) = tanh(x). The feedback signals from the net itself are
sometimes shortly denoted by the variable net, that is,
$$net_i(t) = \sum_{j=1}^{N} w_{ij}\, x_j(t).$$
The system described by Eq.4 is called additive since the weights w are not a
function of the states x. In case w = w (x), the model exhibits mass-action behavior;
such systems are called mass-action, shunting or multiplicative models. In order for
Eq.4 to be computationally interesting, the three dynamic variables I, x and w must
perform over three different time scales.
Figure 2.3 The additive neural network model.
From a neurodynamic viewpoint, we can interpret x(t) to hold a short term memory (stm) trace and w(t) to process long term memory traces (ltm). The philosophy behind system Eq. 4 as a pattern recognition device for temporal patterns then basically runs as follows. As time passes by, the ltm traces w(t) sample and average over time the neuronal activity x(t), thus forming some kind of template or reference pattern of neural activity. At any time a short term average of the current external environment I(t) is reflected in the stm traces x(t). The degree of matching between the stm traces and the ltm traces determines how well the current environmental input is recognized.
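As a small numerical sketch of these dynamics, the code below integrates Eq. 4 with a simple Euler scheme for a fully connected net of four units; the decay, weights and input are arbitrary illustrative values, and the learning dynamics of Eq. 2 are left out.

```python
import numpy as np

def simulate_additive_net(w, I, a=1.0, dt=0.01, steps=2000):
    """Euler integration of the additive model (Eq. 4):
    dx_i/dt = -a*x_i + sigma(sum_j w_ij x_j) + I_i, with sigma = tanh."""
    x = np.zeros(w.shape[0])
    for _ in range(steps):
        net = w @ x                      # net_i(t) = sum_j w_ij x_j(t)
        x = x + dt * (-a * x + np.tanh(net) + I)
    return x

rng = np.random.default_rng(0)
w = 0.5 * rng.standard_normal((4, 4))    # example weight matrix (ltm traces, here fixed)
I = np.array([0.2, 0.0, -0.1, 0.3])      # constant external input
print(simulate_additive_net(w, I))       # approximate equilibrium x* of Eq. 3
```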
The basic architectural component of neural networks and adaptive signal
processing structures is the adaptive linear combiner. An understanding of the working
of the adaptive linear combiner and the least mean square (LMS) algorithm is essential
for the neural network structures that are surveyed in this thesis. The next section
introduces the adaptive linear combiner.
2.4 The Adaptive Linear Combiner
The adaptive linear combiner or non-recursive adaptive filter is fundamental to
adaptive signal processing and neural network theory and applications. This structure,
normally shortly referred to as adaline (from adaptive linear neuron), was introduced
by Widrow and Hoff in 1960 (Widrow and Hoff, 1960). The adaline structure appears
in some form in nearly all feedforward neural network structures. The processing and
adaptation properties of adaline are well understood and documented in Widrow and
Stearns (1985) and Haykin (1990). In this thesis we will only introduce the properties
that are essential in the context of this work.
The describing equations for adaline are given by
$$y(t) = \sum_{k=0}^{K} w_k\, x_k(t),$$
where x_k(t) are the input signals, y(t) the output signal and w_k the adaptive parameters
or weights. Adaline is a discrete-time structure, that is, the independent time variable t
runs through the natural numbers t_0, t_0+1, ... and so on. Adaline is shown in Figure 2.4.
Figure 2.4 The adaptive linear combiner structure.
Although the input signals xk(t) may originate from any source, very often the
input signals are generated from a tapped delay line as shown in the figure. For this
case, the adaline structure is similar to a regular transversal FIR filter. In the adaptive
signal processing literature, it is common to define the following vectors
$$w = [w_0\; \ldots\; w_K]^T, \qquad x(t) = [x_0(t)\; \ldots\; x_K(t)]^T. \qquad \text{Eq. 7}$$
Thus we can write the describing equation for adaline as
$$y(t) = w^T x(t). \qquad \text{Eq. 8}$$
Let a desired output signal be given by d(t). d(t) is also referred to as target
signal or teacher signal. The difference between the desired output and actual output is
defined as the (instantaneous) error signal e (t) = d(t) -y (t). Substitution of Eq.8
and squaring leads to the following expression for the instantaneous squared error
signal:
$$e^2(t) = d^2(t) + w^T x(t)\, x^T(t)\, w - 2\, d(t)\, x^T(t)\, w. \qquad \text{Eq. 9}$$
An important assumption in the theory of adaptive signal processing is that the signals
e(t), d(t) and x(t) are statistically stationary, that is, their statistical moments are
constant over time. In that case, taking the expected value of Eq.9 yields
$$E[e^2(t)] = E[d^2(t)] + w^T R\, w - 2\, P^T w, \qquad \text{Eq. 10}$$
where we defined the input correlation matrix R = E[x(t) x^T(t)] and the cross-
correlation vector P = E[d(t) x(t)]. Note that the expression for the mean squared
error ξ is quadratic in the parameters w. The minimal mean squared error is obtained
by setting the error gradient to zero. Differentiating Eq. 10 yields for the error
gradient:
$$\frac{\partial \xi}{\partial w} = 2\, R\, w - 2\, P. \qquad \text{Eq. 11}$$
Thus, the optimal weight vector w_opt is given by
$$w_{opt} = R^{-1} P. \qquad \text{Eq. 12}$$
The expression Eq.12 is known by the name Wiener-Hopf equation or normal equation.
The Wiener-Hopf equation provides an expression for the minimal mean-square-error
weight vector, assuming the stationarity conditions hold. This expression is
fundamental in adaptive signal processing and linear neural network theory. Note that
if the stationarity conditions do not hold, the correlation matrices R and P are time-
varying, and consequently the optimal weight vector is time-varying.
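The following sketch estimates R and P from data and solves the normal equation for a toy identification problem; the "unknown" weights are hypothetical values chosen only to verify that Eq. 12 recovers them.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3, 0.2, 0.1])      # hypothetical unknown system
s = rng.standard_normal(5000)                 # stationary white input signal

# Tap-input vectors x(t) = [s(t), s(t-1), s(t-2), s(t-3)] and desired response d(t).
X = np.stack([np.roll(s, k) for k in range(len(w_true))], axis=1)
d = X @ w_true + 0.01 * rng.standard_normal(len(s))

R = (X.T @ X) / len(s)                        # input correlation matrix E[x(t) x^T(t)]
P = (X.T @ d) / len(s)                        # cross-correlation vector E[d(t) x(t)]
w_opt = np.linalg.solve(R, P)                 # Wiener-Hopf solution (Eq. 12)
print(np.round(w_opt, 3))                     # close to w_true
```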
The computation of the correlation matrices R^{-1} and P is usually very expensive,
in particular when the network dimension K is large. Instead it is common to adapt the
weights on a sample-by-sample basis so as to search for the optimal values. As is
apparent from Eq. 9, the mean-square-error is quadratic in the weights. Thus, the
performance surface ξ is a (hyper-)paraboloid with a minimum at w_opt. A gradient
descent procedure should therefore in theory lead to the optimal weights. The steepest
descent update algorithm adapts the weights as follows:
$$w(t+1) = w(t) - \eta\, \frac{\partial \xi}{\partial w}. \qquad \text{Eq. 13}$$
The step size parameter η controls the rate of adaptation. In the neural net literature, η
is referred to as the learning rate. Note that adaptation comes naturally to a halt when
the weights are optimal, since at the minimum of the performance surface we have
∂ξ/∂w = 0. The computation of the error gradients determines the complexity of the
learning algorithm. Widely used and very efficient is the Least Mean Square (LMS)
algorithm. We will now proceed to derive the LMS algorithm for the adaline structure,
as it is the precursor for the widely used backpropagation procedure in neural net
adaptation. At a later stage in this thesis, the backpropagation procedure is derived and
applied to several signal processing problems.
The central idea of the LMS algorithm is to approximate the stochastic gradient
∂E[e^2(t)]/∂w by the instantaneous (time-varying) gradient ∂e^2(t)/∂w. Note that the
instantaneous error gradient is an unbiased estimator of the stochastic gradient, that
is, E[∂e^2(t)/∂w] = ∂E[e^2(t)]/∂w. Substituting e(t) = d(t) - x^T(t) w leads to
$$\frac{\partial e^2(t)}{\partial w} = 2\, e(t)\, \frac{\partial e(t)}{\partial w} = -2\, e(t)\, x(t). \qquad \text{Eq. 14}$$
Thus, the LMS update equation evaluates to
$$w(t+1) = w(t) + 2\, \eta\, e(t)\, x(t). \qquad \text{Eq. 15}$$
Note how simple the final equation for the LMS algorithm is. The signals e(t) and x(t)
are readily available. The combination of simplicity and accuracy has made the LMS
algorithm the most popular algorithm in adaptive signal processing. Widrow discusses
in his book a number of successful practical applications, such as adaptive
equalization, system identification, adaptive control, interference canceling and adaptive
beamforming (Widrow and Stearns, 1985).
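A minimal sketch of the LMS update (Eq. 15) on the same kind of toy identification problem follows; the step size and the system weights are arbitrary example values and are not taken from the experiments reported later in this thesis.

```python
import numpy as np

def lms(X, d, eta):
    """Sample-by-sample LMS adaptation: w(t+1) = w(t) + 2*eta*e(t)*x(t)  (Eq. 15)."""
    w = np.zeros(X.shape[1])
    for x_t, d_t in zip(X, d):
        e_t = d_t - w @ x_t              # instantaneous error e(t) = d(t) - y(t)
        w = w + 2.0 * eta * e_t * x_t
    return w

rng = np.random.default_rng(2)
s = rng.standard_normal(5000)
X = np.stack([np.roll(s, k) for k in range(4)], axis=1)
d = X @ np.array([0.5, -0.3, 0.2, 0.1])      # response of an "unknown" 4-tap FIR system
print(np.round(lms(X, d, eta=0.01), 3))      # converges toward the Wiener solution
```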
In the next section, the core architectural paradigms for neural nets are
introduced. While adaline enjoys a wide application in neural network architectures,
the inherent linearity limits its computational power substantially. Neural nets in
general are more powerful, since they can be non-linear, recurrent and multi-input-
multi-output systems.
2.5 Neural Network Paradigms: Static Models
Over the years two different paradigms have emerged that exploit the dynamics
of system Eq.4 to serve as a non-linear information processor. Nearly all theory and
practice deals with processing of static patterns. I will shortly introduce the two
concepts for the static case, since an understanding of this is essential in order to
comprehend efforts of extension to processing of space-time patterns. The section on
static nets is followed by dynamic nets.
2.5.1 The Continuous Mapper
The first paradigm offers the continuous mapper or many-to-many map. The
computational result of the continuous mapper is a continuous function from an L-
dimensional input space to an M-dimensional output space. The standard method to
implement such a map is by way of a multi-layer feedforward network. An important
historical (and current) example of the continuous mapper is the perceptron
architecture (Rosenblatt, 1962). In Figure 2.5, a neural net implementation of the
continuous map is displayed.
The network in Figure 2.5 is feedforward, that is, there are no closed loops in
this structure. If we assume that the neurons are labeled sequentially starting at the
input layer, then the weight matrix w is lower triangular for feedforward nets, since
each neuron receives inputs from nodes with lower index. The states of the neurons are
independent of time. In particular, the additive static neuronal states are described by
the following algebraic relation:
$$x_i = \sigma\Big(\sum_{j} w_{ij}\, x_j\Big) + I_i. \qquad \text{Eq. 16}$$
It has been proven that a three-layer network (two hidden layers) is in principle
capable of computing an arbitrary continuous map from the L-dimensional input space to
the real numbers (Hecht-Nielsen, 1987).
Figure 2.5 Structure of the three-layer feedforward mapper.
Although this may be impressive, the problem
of finding the correct set of weights may be very hard. The problem of finding good
weights is called the loading problem. Theoretically, a learning mechanism such as
simulated annealing can be used to obtain the map that minimizes the error between the
desired map and the actual (network) implementation. However, simulated annealing
(stochastic optimization) is very slow and in practice, although not perfect, the back-
propagation training procedure has been quite effective for many applications. Back-
propagation involves adaptation of the weights by gradient descent so as to minimize a
performance criterion.
2.5.2 The Associative Memory
The second prototype entails the associative memory or many-to-one map, for
which the Hopfield net is the prime illustration (Hopfield, 1982 and 1984, see Figure
2.6). In terms of topology, the Hopfield net is a recurrent net with symmetric weights
(w_ij = w_ji, w_ii = 0), which enables the association of a Lyapunov (energy) function with
the system dynamics. Using Lyapunov's stability theory, it can be shown that this
system always converges to a point attractor. There is no external input to the
associative memory. The input is the initial state of the neurons x(to). Used as a
processing device, information is stored by locating point attractors at positions in the
state space that correspond to memories. Recognition then consists of settling into the
minimum closest to the initial state vector x(to).
Figure 2.6 An associative memory neural net: the Hopfield net structure.
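A minimal sketch of associative recall follows. It uses a discrete, binary version of the Hopfield net with outer-product (Hebbian) storage and a synchronous threshold update; the two stored patterns are made-up examples, and the discrete update rule is a simplification of the continuous dynamics discussed above.

```python
import numpy as np

patterns = np.array([[ 1, -1,  1, -1,  1, -1],
                     [ 1,  1,  1, -1, -1, -1]])     # example memories (+/-1 coded)
N = patterns.shape[1]
W = (patterns.T @ patterns) / N                     # symmetric outer-product weights
np.fill_diagonal(W, 0.0)                            # w_ij = w_ji, w_ii = 0

def recall(x0, steps=10):
    """Settle into the point attractor closest to the initial state x(t0)."""
    x = x0.copy()
    for _ in range(steps):
        x = np.sign(W @ x)
        x[x == 0] = 1                               # break ties toward +1
    return x

probe = np.array([1, -1, 1, -1, 1, 1])              # noisy version of the first memory
print(recall(probe))                                # settles back into the first memory
```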
Both the continuous mapper and the associative memory work only in a static
pattern environment. Although the Hopfield net processes information by a dynamic
relaxation process, the input pattern (initial state vector) is assumed to be static. Next,
temporal extensions of both the continuous mapper and the associative memory are
discussed. It will become clear that the ideas for computing with time in dynamic
neural nets correspond strongly to linear signal processing theory.
2.6 Neural Network Paradigms: Dynamic Nets
The basic neural network model for processing of static patterns is the static
additive model. The activations of the units are computed by
$$x_i = \sigma\Big(\sum_{j} w_{ij}\, x_j\Big) + I_i, \qquad \text{Eq. 17}$$
where x_i is the activation of neuron (unit, node) i. The weight factor w_ij connects node
j to node i. σ(·) is a (non-)linear squashing function and I_i represents the external input.
We assume a system dimensionality of N. Sometimes the shorthand notation
net_i = Σ_j w_ij x_j will be used.
Static neural nets have no memory. As a result, temporal relations can not be
stored or computed on by static neural nets. In order to process a temporal flow of
information, a neural net needs a short term memory mechanism. Neural network
models with short term memory are called dynamic neural nets. The simplest way to
add dynamics (memory) to the static model is to add a capacitive term to the left-hand
side of Eq. 17. After rearrangement of terms, the so-called dynamic additive
model is obtained:
$$\tau\, \frac{dx_i}{dt} = -x_i + \sigma\Big(\sum_j w_{ij}\, x_j\Big) + I_i. \qquad \text{Eq. 18}$$
This model is mathematically equivalent to the system described by Eq. 4, where the
time constant is expressed by the decay parameter a = 1/τ. Let us look at the biological
picture of neural nets. In nature, the neural time constants are fixed and equal
approximately 5 msecs (Shen, 1989). This number is estimated by assuming an average
action potential rate of 200 per second. Higher rates are quite rare due to the refractory
period of the neurons. However, recognition of a spoken word requires the ability to
remember the contents of a passage for approximately 1 second. To accomplish this,
neural temporal resolution decreases while the "temporal window of susceptibility"
increases toward the cortex. Apparently, the brain is able to modulate temporal
resolution and depth of short-term memory making use of processing units with fixed
small time constants. The dominant biological principles for increasing the time
constants are feedback and delays. These are exactly the same strategies that are used
in digital filters to implement a temporal data buffer. Naturally, neural net researchers
have concentrated on the same concepts of feedback and delays when designing neural
nets for temporal processing applications. Next, the characteristics of both approaches
are analyzed.
2.6.1 Short Term Memory by Local Positive Feedback
The additive model can be extended with a positive state feedback term,
yielding
$$\tau\, \frac{dx_i}{dt} = -x_i + \sigma(net_i) + I_i + k\, x_i, \qquad \text{Eq. 19}$$
where k is a positive constant. In the biological literature, such local positive feedback
is often named reverberation, while neural net researchers speak of self-excitation.
Eq.19 can be rewritten as
$$\frac{\tau}{1-k}\, \frac{dx_i}{dt} = -x_i + \tilde{\sigma}(net_i) + \tilde{I}_i, \qquad \text{Eq. 20}$$
where $\tilde{\sigma}(net_i) = \frac{\sigma(net_i)}{1-k}$ and $\tilde{I}_i = \frac{I_i}{1-k}$. For τ = 5 msec and k = 0.995, we get the
new time constant T = τ/(1-k) = 1 sec. Units that self-excite over a time span that is
relevant with respect to the processing problem are referred to in the neural net
literature as context units. Several investigators have explored the temporal
computational properties of additive feedforward nets, extended by context units
(Jordan, 1986; Elman, 1990; Mozer, 1989; Stornetta et al., 1988). In Hertz et al. (1991),
neural models of this kind are collectively referred to as sequential nets. In sequential
neural nets, all units are additive and static, apart from the context units. The context
units are of type Eq.20 or a similar model. Sequential neural nets are a kind of
extension of the continuous mapper to the spatiotemporal domain. In Figure 2.7,
various architectural examples are displayed. In 1986, Jordan developed the
architecture as displayed in Figure 2.7a for learning of spatiotemporal maps. The state
of the context units evolves according to
$$x(t+1) = \lambda\, x(t) + x_{out}(t), \qquad \text{Eq. 21}$$
where xout(t) is the state of an output unit. Note that Jordan's architecture makes use of
global recurrent loops (context to hidden to output to context units). As a result, care
must be taken to keep the total system stable. In Jordan (1986), he shows that this
network can successfully mimic co-articulation data. Anderson et al. (1989) have used
this architecture to categorize a class of English syllables.
Elman (1990) utilizes non-linear self-recurrent hidden units of the type
x(t+1) = g(σ(x(t))) to store the past (Figure 2.7b). This network was able to recognize
sequences, and even to produce continuations of sequences. Cleeremans et al. (1989)
showed that this architecture is able to learn and mimic a finite state machine, where
the hidden units represent the internal states of the automaton.
Stornetta et al. (1988) have used recurrent units at the input layer only to
represent a temporally weighted trace of the input signal (Figure 2.7c). There are no
weighted connections from the hidden or output units toward the context units. This
restriction results in several advantages when the network is trained by a gradient
descent technique.
Figure 2.7 Various sequential network architectures. (a) Michael Jordan's architecture feeds the output back to an additional set of recurrent input units. (b) Elman's structure uses recurrent non-linear hidden units. (c) Stornetta et al. keep a history trace at the input units. This structure offers particular advantages when back-propagation learning is used.
We will discuss this issue in more detail in chapter 4 on training a
neural net. The author performed successful experiments in recognition of short
sequences. Mozer (1988) and Gori et al. (1989) have also made use of similar
architectural restrictions.
While the positive feedback mechanism is simple and used in biological
information processing, there are two computational problems associated with this
method. First, the new time constant is very sensitive to k. For our example, an increase
of 0.5% in k from k = 0.995 to k = 1 makes the model unstable. The time-varying nature
of biological parameters makes it therefore unlikely that reverberation is the
predominant mechanism for short term memory over long periods. The second
handicap of Eq.20 is that the new model is still governed by first-order dynamics. As a
result, weighting in the temporal domain is limited to a recency gradient (exponential
for linear feedback), that is, the most recent items carry a larger weight than previous
inputs. Note that the analytical solution to Eq.20 can be written as
$$x(t) = \int_0^t e^{-\frac{t-s}{T}} \left[\tilde{\sigma}(net(s)) + \tilde{I}(s)\right] ds. \qquad \text{Eq. 22}$$
Thus, the past input is weighted by a factor $e^{-\frac{t-s}{T}}$ which exponentially decays over time.
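The recency gradient can be made concrete with a short numerical sketch of a discrete-time self-excited (context) unit; the value k = 0.995 is the one used in the example above, while the input is an arbitrary unit impulse.

```python
import numpy as np

def context_unit_trace(inputs, k):
    """x(t+1) = k * x(t) + input(t): first-order self-recurrent (context) unit."""
    x, trace = 0.0, []
    for u in inputs:
        x = k * x + u
        trace.append(x)
    return np.array(trace)

impulse = np.zeros(1000); impulse[0] = 1.0
trace = context_unit_trace(impulse, k=0.995)
# The contribution of an input decays as k**age: a pure recency gradient with an
# effective memory span of roughly 1/(1 - k) = 200 steps.
print(trace[[0, 200, 999]])       # 1.0, ~0.37, ~0.007
```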
For a neural net composed of N neurons, the number of weights in the spatial
domain is O(N^2), while the temporal domain is governed only by T. The use of a fixed
passive memory function then implies a limit to how structured the representation of
the past in the net can be. As an example, optimal temporal weighting for the
discrimination of the words "cat" and "mat" will not be a recency but rather a primacy
gradient. Another example, in a time-series analysis framework, the input signals
sometimes change very fast, sometimes slow. For fast changing input, we like the time
constant small, so that the net state can follow the input. For slow moving input, the
time constant may be larger in order to have a deeper memory available. This argument
pleads for the short term memory time constant to be a variable that should be learned
for each neuron and may even be modulated by the input. For physiologic mechanisms
of short-term modulation of τ, one may think of adaptation (decreased sensitivity of
receptor neuron to a maintained stimulus) or heterosynaptic facilitation (the ability of
one synapse of a cell to temporarily increase the efficacy of another synapse; see Wong
and Chun (1986) for an application to neural nets).
In conclusion, short term memory by local positive feedback is simple and has
been applied successfully in artificial neural nets. However, reverberation may lead to
instability. Secondly, this mechanism restricts computational flexibility in the temporal
domain. In the next section, short term memory by delays is reviewed.
2.6.2 Short Term Memory by Delays
A general delay mechanism can be represented by temporal convolutions
instead of (instantaneous) multiplicative interactions. Consider the following extension
of the static additive model,
$$\tau\, \frac{dx_i}{dt} = -x_i + \sigma\Big(\sum_j \int_0^t w_{ij}(t-s)\, x_j(s)\, ds\Big) + I_i. \qquad \text{Eq. 23}$$
We will call this model the (additive) convolution model. In the convolution
model the net input is given by
$$net_i(t) = \sum_j \int_0^t w_{ij}(t-s)\, x_j(s)\, ds. \qquad \text{Eq. 24}$$
In a discrete time environment, this translates to
$$net_i(t) = \sum_j \sum_{n=0}^{t} w_{ij}(t-n)\, x_j(n). \qquad \text{Eq. 25}$$
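A direct, deliberately naive implementation of the discrete-time net input of Eq. 25 for a single input line is sketched below; the kernel values are arbitrary examples. Note that the cost per time step grows with the length of the stored history, which is one of the complications identified later in this section.

```python
import numpy as np

def convolution_net_input(w_kernel, x_history):
    """net(t) = sum_{n=0}^{t} w(t - n) * x(n)  (Eq. 25, one input line).
    x_history holds x(0), ..., x(t); w_kernel[m] weights an input m steps old."""
    t = len(x_history) - 1
    return sum(w_kernel[t - n] * x_history[n] for n in range(t + 1))

# Example kernel: weight inputs from 0..4 steps ago, ignore anything older.
w_kernel = np.concatenate([np.array([0.4, 0.3, 0.2, 0.1, 0.05]), np.zeros(95)])
x_history = np.ones(20)                              # a constant input held for 20 steps
print(convolution_net_input(w_kernel, x_history))    # 0.4+0.3+0.2+0.1+0.05 = 1.05
```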
There is ample biological support for the substitution of weight constants w by time
varying weights w(t). Miller has reviewed experimental evidence that "... cortico-
cortical axonal connections impose a range of conduction delays sufficient to permit
temporal convergence at the single neuron level between signals originating up to 100-
200 msec apart" (Miller, 1987). Several artificial neural net researchers have also
experimented with additive delay models of type Eq.23. However, due to the
complexity of general convolution models, only strong simplifications of the weight
kernels have been proposed.
Lang et al. (1990) used the discrete delay kernels $w(t) = \sum_k w_k\, \delta(t - t_k)$ in the
time delay neural network (TDNN). The TDNN architecture is shown in Figure 2.8. The
TDNN, considered the state-of-the-art, is a multilayer feedforward net that is trained
by error backpropagation. The past is represented by tapped delay lines as in FIR
filters. The authors reported excellent results on a phoneme recognition task. A
recognition rate of 98.5% at a phoneme recognition task ("B", "D" and "G") compared
to 93.7% for a hidden Markov model was achieved. Recently, the CMU-group
introduced the TEMPO 2 model, where adaptive gaussian distributed delay kernels
store the past (Bodenhausen and Waibel, 1991). Distributed delay kernels such as used
in the TEMPO 2 model improve on the TDNN with respect to the capture of temporal
context.
Tank and Hopfield (1987) also prewired w(t) as a linear combination of
dispersive kernels, in particular $w(t) = \sum_k W_k f_k(t)$ with $f_k(t) = \left(\frac{t}{t_k}\right)^{\alpha} e^{\alpha\left(1 - \frac{t}{t_k}\right)}$. This
technique was utilized as a preprocessor to a Hopfield net for classification of temporal
patterns. The ideas are illustrated in Figure 2.9. Let an input signal successively
activate I_1 through I_4. The delay between consecutive activations is one time step. If the
delays associated with the weights are as shown in the figure, then the input unit
activations arrive at the output unit at the same time. Thus, the output neuron is very
sensitive to an impulse moving over the input layer in the direction of the time arrow,
while an impulse moving in opposite direction does not activate the output node.
Neural nets of this type, where information of several neurons at different times
integrates at one neuron, were called Concentration-of-Information-in-Time (CIT)
neural nets. The weight factors Wk were non-adaptive and determined a priori. They
successfully built such a system in hardware for an isolated word recognition task. In
particular, the robustness against time warped input signals should be mentioned. In a
later publication successful experiments were reported with adaptive gaussian
distributed delay kernels (Unnikrishnan et al., 1991).
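For concreteness, the sketch below evaluates a small bank of dispersive delay kernels of the normalized form f_k(t) = (t/t_k)^α exp[α(1 − t/t_k)], each peaking at its nominal delay t_k; the α and t_k values are made-up examples, and the exact parameterization used by Tank and Hopfield may differ in detail.

```python
import numpy as np

def dispersive_kernel(t, t_k, alpha):
    """Normalized dispersive delay kernel (t/t_k)**alpha * exp(alpha*(1 - t/t_k)).
    It peaks with value 1 at t = t_k and widens as t_k grows, so a bank of such
    kernels spreads an input event over a range of delays instead of a single tap."""
    return (t / t_k) ** alpha * np.exp(alpha * (1.0 - t / t_k))

t = np.linspace(0.01, 10.0, 1000)
for t_k in (1.0, 2.0, 4.0):                          # example nominal delays
    g = dispersive_kernel(t, t_k, alpha=3.0)         # example sharpness parameter
    print(f"t_k = {t_k}: kernel peaks near t = {t[np.argmax(g)]:.2f}")
```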
When compared to the first-order context-unit networks, the convolution model
in its general formulation is more flexible in the temporal domain, since the weighting
of the past is not restricted to a recency gradient. However, a high price has to be paid
for the increased flexibility. I identify three complications for the convolution model
when compared to the additive model.
Analysis. The convolution model is described by a set of functional
differential equations (FDE) instead of ordinary differential equations (ODE) for the
additive model. Such equations are in general harder to analyze, a handicap when we
need to check (or design for) certain system characteristics such as stability and
convergence.
Figure 2.8 An example of a Time-Delay Neural Network.
Numerical Simulation. For an N-dimensional convolution model, the required
number of operations to compute the next state for the FDE set scales with O(N^2 T),
where T is the number of time steps necessary to evaluate the convolution integral
(using the Euler method: x(t+h) = x(t) + h\,\frac{dx}{dt}). An N-dimensional additive model
scales by O(N^2).
Learning. The weights in the convolution model are the time-varying
parameters w(t). Thus, the dimensionality of the free parameters grows linearly with
time. For a long temporal segment, the large weight vector dimensionality impairs the
ability to train the network.
The two models for incorporating short term memory in neural networks,
positive feedback and delays, have led to a number of architectures that essentially
generalize the continuous mapper to the space-time domain.
Figure 2.9 Principle of the Concentration-in-Time neural net. The output node is tuned to classify the sequence I1-I2-I3-I4.
In the discussion on static neural nets we introduced the associative memory model. Is there also a temporal
extension for this model? Indeed, in particular in the physics community, several
researchers have experimented with temporal associative memories. The principal
ideas of the sequential associative memory are now shortly reviewed.
2.6.3 The Sequential Associative Memory
The sequential associative memory is a (recurrent) dynamical system that stores
memories in attractors (sinks) of zeroth order (point attractors) (Kleinfeld, 1986;
Sompolinsky and Kanter, 1986). Physicists have explored several ways to ignite
attractor transitions under influence of an external stimulus. The most widely used
method consists of forcing a combination of the external signal and the delayed
network state upon the net. The delayed state wants to keep the net in its current state,
while the external input tries to alter the state. As a result, the net state ideally hops
from one stable attractor to another. In a pattern recognition environment, the sequence
of visited states identifies the external input.
Sequential associative memories are theoretically very interesting and moreover
provide a neural explanation of categorical perception, due to the corrective properties
of the basins of attraction. However, it has been noticed that these nets may not be very
selective pattern recognizers; in other words, nearly every input (if strong enough) will
induce transitions (Amit, 1988). Secondly, the memory capacities of such nets are very
limited. It is the nature of point attractors that falling into a basin of one induces the
forgetting of previous states. Consequently, the depth of memory is fixed, short
and cannot be modulated. In my opinion, the sequential associative memory may be a
useful module for tasks like central pattern generation (Kleinfeld, 1986), but it is not
(yet) flexible enough to encode the varying temporal relations of complex signals such
as speech.
2.7 Other Dynamic Neural Nets
We have discussed dynamic extensions of both the continuous mapper and the
associative memory. However, the architectures that were discussed in this chapter are
not the only viable constructions. A very important class of neural nets consists of networks
with globally recurrent connections. Note that the memory structures that have been
discussed so far, the tapped delay line and the context units, are mechanisms to store the
activations of local units. Generation of memory by feedback on a global scale has not
been discussed. The difference between local feedback at the unit level and global
feedback at the network level is displayed in Figure 2.10. Globally recurrent nets or
fully recurrent nets are an important area of current dynamic neural net research
(Williams and Zipser, 1989; Gherrity, 1989; Pearlmutter, 1989). The important issues
that confine applications of fully recurrent nets are the control of stability and adaptation
problems. In a fully recurrent net, the performance surface is not necessarily convex,
which is a severe handicap for gradient-based adaptation methods. Secondly, learning
in recurrent networks has been found to progress much more slowly than in feedforward nets.
Figure 2.10 An example of a globally recurrent network.
2.8 Discussion
In this chapter various neural architectures for temporal processing have been
discussed. The main principles for storage of and computing with a temporal data flow,
delays and feedback, were analyzed in some detail.
The feedforward tapped delay line has been used very successfully in the time-
delay neural net. Yet, the delay period per tap and window width are fixed and must be
chosen a priori. It was already discussed in chapter 1 that such a fixed signal
representation scheme likely leads to sub-optimal system performance.
Most context-unit neural networks do adapt the decay parameter of the context
units. As a result, the depth of memory can be controlled so as to match the goal of
processing. On the other hand, the weighting of the past is always restricted to a
recency gradient. Moreover, context units overwrite the past with new information.
In order to get more processing power, some investigators use globally recurrent
networks. These systems suffer from the same problem as recurrent filters: how do we
control stability during adaptation? Additionally, in particular for moderate to large
networks the currently available training algorithms do not suffice.
Although the future importance of global feedback neural nets should not
be underestimated, in this work we will concentrate on developing an improved
architecture for local (unit) short term memory. There are several reasons why this is an
important research direction. First, it will be shown later in this thesis that feedforward
networks of units with local (feedback) memory have some practical advantages over
globally recurrent nets. Stability, for one, is much more easily controlled in such networks.
Secondly, the local short term mechanism that will be developed here does not exclude
global recurrence in the network. In fact, the integration of local feedback and global
feedback may lead to very interesting dynamic network architectures.
CHAPTER 3
THE GAMMA NEURAL MODEL
3.1 Introduction - Convolution Memory versus ARMA Model
It was discussed in chapter 2 that a general delay mechanism can be written as

net(t) = \int_0^t w(t-s)\,x(s)\,ds    (Eq.26)

for the continuous time domain and

net(t) = \sum_{n=0}^{t} w(t-n)\,x(n)    (Eq.27)
in the discrete time domain. It was mentioned that a problem associated with time-dependent weight functions is that the number of parameters grows linearly with time. This presents a severe modeling problem, since the number of degrees of freedom of a system does not always increase linearly with the memory depth of that system.
Although a convolution model is interesting as a biological model for memory by delay
and powerful as a computational model, the previous arguments make this model not
very attractive as a model for engineering applications. It makes sense to investigate
under what conditions an arbitrary time varying kernel w(t) can be adequately
approximated by a fixed dimensional set of constant weights. This problem was studied
by Fargue (1973) and the answer is provided by the following theorem.
Theorem 3.1 The (scalar) integral equation

net(t) = \int_0^t w(t-s)\,x(s)\,ds    (Eq.28)

can be reduced to a K-dimensional system of ordinary differential equations with constant coefficients if (and only if) w(t) is a solution of

\frac{d^K w}{dt^K}(t) = \sum_{k=0}^{K-1} a_k \frac{d^k w}{dt^k}(t),    (Eq.29)

where a_0, a_1, ..., a_{K-1} are constants.
Proof. A constructive proof for sufficiency is provided. The initial conditions for Eq.29 are written as \frac{d^k w}{dt^k}(0), where k = 0,...,K-1, and we define the variables w_k(t) = \frac{d^k w}{dt^k}(t), k = 0,...,K-1, which allows us to rewrite Eq.29 as the following set of K first-order differential equations:

\frac{dw_k}{dt}(t) = w_{k+1}(t), k = 0,...,K-2,

\frac{dw_{K-1}}{dt}(t) = \sum_{k=0}^{K-1} a_k w_k(t).    (Eq.30)
Next, we introduce the state variables

x_k(t) = \int_0^t w_k(t-s)\,x(s)\,ds, k = 0,...,K-1.    (Eq.31)

Note that the system output is given by

net(t) = \int_0^t w_0(t-s)\,x(s)\,ds = x_0(t).    (Eq.32)
The state variables x_k(t) can be computed recursively. Differentiating Eq.31 with respect to t using Leibniz' rule gives

\frac{dx_k}{dt}(t) = \int_0^t \frac{dw_k}{dt}(t-s)\,x(s)\,ds + w_k(0)\,x(t),    (Eq.33)

which, using the recurrence relations from Eq.30, evaluates to

\frac{dx_k}{dt}(t) = x_{k+1}(t) + w_k(0)\,x(t), for k = 0,...,K-2, and

\frac{dx_{K-1}}{dt}(t) = \sum_{k=0}^{K-1} a_k x_k(t) + w_{K-1}(0)\,x(t).    (Eq.34)

Thus, if the weight kernel w(t) is a solution of the recurrence relation Eq.29, then the integral equation Eq.28 can be reduced to a system of differential equations with constant coefficients (Eq.34). (end proof)
The following theorem reveals what is meant by imposing the condition Eq.29 on w(t).

Theorem 3.2 Solutions of the system

\frac{d^K w}{dt^K}(t) = \sum_{k=0}^{K-1} a_k \frac{d^k w}{dt^k}(t)

can be written as a linear combination of the functions t^{k_i} e^{\mu_i t}, where 1 \le i \le m, 0 \le k_i < K_i and \sum_{i=1}^{m} K_i = K. (end theorem)

The proof of Theorem 3.2 is provided in most textbooks on ordinary differential equations (e.g. Braun, 1983, page 258). The \mu_i's are the eigenvalues of the system; in particular, they are the solutions of the characteristic equation of Eq.29,

s^K - \sum_{k=0}^{K-1} a_k s^k = 0.

m is the number of different eigenvalues and K_i is the multiplicity of eigenvalue \mu_i. The functions t^{k_i} e^{\mu_i t} are the eigenfunctions of the system Eq.29, where i enumerates the various eigenmodes of the system.
In the signal processing and control community, the system described by Eq.32
and Eq.34 is called an auto-regressive moving average (ARMA) model (see Figure
3.1). It was discussed in section 2.2 that the memory of an ARMA system is
represented in the state variables xk(t). It is interesting to observe the relation between
the ARMA model parameters and the convolution model. The auto-regressive parameters a_k are the coefficients of the recurrence relation Eq.29 for w(t). The moving average parameters equal the initial conditions w_k(0) of Eq.29.
Figure 3.1 An ARMA model implementation of a convolution model
with recursive w(t).
In the context of this exposition, I like to think of an ARMA model as a dynamic model for a memory system. It was just proved that this configuration is equivalent to a convolution memory model if the condition described by Eq.29 is obeyed. Yet, I do not know of neural network models that utilize the full ARMA model
to store the past of x(t). The reason is that the global recurrent loops in the ARMA
model make it difficult to control stability in this configuration. This is particularly true
when the auto-regressive parameters ak are adaptive. There are some substructures of
the ARMA model for which stability is easily controlled. Examples are the feedforward
tapped delay line and the first order autoregressive model (or context unit). These two
structures, which are shaded differently for clarity in Figure 3.1, have been used
extensively in neural networks as a memory mechanism. The virtues and shortcomings
of either approach have already been discussed in chapter 2. In this chapter, a different
approximation to the ARMA memory model is introduced. A look ahead to Figure 3.4
shows that the memory model that will be introduced utilizes a cascade of locally
recursive structures in contrast to the global loops in the full ARMA model. As a result,
this model will provide a more flexible approximation to the ARMA or convolution
model. Yet, the stability conditions will prove to be trivial. In the next section, a
mathematical framework for this new memory model is presented.
3.2 The Gamma Memory Model
Let us consider the following case, a specific subset of the class of functions that admit Eq.29:

w(t) = \sum_{k=1}^{K} w_k\,g_k^{\mu}(t),    (Eq.37)

where

g_k^{\mu}(t) = \frac{\mu^k}{(k-1)!}\,t^{k-1} e^{-\mu t}, k = 1,...,K, (\mu > 0).    (Eq.38)

It is easily checked that the kernels g_k(t) (the superscript \mu is dropped) are a solution to Eq.29. Since the functions g_k(t) are the integrands of the (normalized) \Gamma-function (\Gamma(x) = \int_0^{\infty} t^{x-1} e^{-t}\,dt), they will be referred to as gamma kernels. In view of the solutions of Eq.29, the gamma kernels are characterized by the fact that all eigenvalues are the same, that is, \mu_i = \mu. Thus, the gamma kernels are the eigenfunctions of the following recurrence relation:

\left(\frac{d}{dt} + \mu\right)^K g(t) = 0.    (Eq.39)

The factor \frac{\mu^k}{(k-1)!} normalizes the area, that is

\int_0^{\infty} g_k(s)\,ds = 1, k = 1, 2, ...    (Eq.40)

The shape of the gamma kernels g_k(t) is pictured in Figure 3.2 for \mu = 0.7.
It is straightforward to derive an equivalent ARMA model for net(t) when w(t) is constrained by Eq.37. The procedure is similar to the proof of Theorem 3.1. First, the kernels g_k(t) are written as the following set of first-order differential equations:

\frac{dg_1}{dt}(t) = -\mu\,g_1(t), \quad \frac{dg_k}{dt}(t) = -\mu\,g_k(t) + \mu\,g_{k-1}(t), k = 2,...,K.    (Eq.41)

Substitution of Eq.37 into Eq.26 yields

net(t) = \sum_{k=1}^{K} w_k\,x_k(t),    (Eq.42)

where the gamma state variables are defined as

x_k(t) = \int_0^t g_k(t-s)\,x(s)\,ds, k = 1,...,K.    (Eq.43)
Figure 3.2 The gamma kernels g_k(t) for \mu = 0.7.
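As a side illustration (not part of the dissertation), the following short Python sketch evaluates the gamma kernels of Eq.38 for mu = 0.7, the value used in Figure 3.2, and numerically confirms the unit-area property of Eq.40. The grid size and plotting-free printout are my own choices.

# A minimal sketch, assuming numpy is available; not code from the text.
import numpy as np
from math import factorial

def gamma_kernel(k, t, mu):
    """Continuous gamma kernel g_k(t) of Eq.38 for tap k and bandwidth mu."""
    return mu**k * t**(k - 1) * np.exp(-mu * t) / factorial(k - 1)

mu = 0.7
t = np.linspace(0.0, 40.0, 40001)          # fine grid for numerical integration
for k in range(1, 6):
    g = gamma_kernel(k, t, mu)
    area = np.trapz(g, t)                  # should be close to 1 (Eq.40)
    peak = t[np.argmax(g)]                 # kernel peaks near t = (k-1)/mu
    print(f"k={k}: area={area:.4f}, peak at t={peak:.2f}")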
The gamma state variables hold memory traces of the neural states x(t). How are the state variables x_k(t) computed? Differentiating Eq.43 leads to

\frac{dx_k}{dt}(t) = \int_0^t \frac{dg_k}{dt}(t-s)\,x(s)\,ds + g_k(0)\,x(t),    (Eq.44)

which, since g_k(0) = 0 for k \ge 2 and g_1(0) = \mu, evaluates to

\frac{dx_k}{dt}(t) = -\mu\,x_k(t) + \mu\,x_{k-1}(t), k = 1,...,K,    (Eq.45)

where we defined x_0(t) \equiv x(t). The initial conditions for Eq.45 can be obtained from evaluating x_k(0) = g_k(0)\,x(0), which reduces to

x_0(0) = x(0), \quad x_1(0) = \mu\,x(0), \quad x_k(0) = 0, k = 2,...,K.
Thus, when w(t) admits Eq.37, net(t) can be computed by a K-dimensional system of ordinary differential equations, Eq.45. The following theorem states that the approximation of an arbitrary w(t) by a linear combination of gamma kernels can be made as close as desired.

Theorem 3.3 The system g_k(t), k = 1, 2, ..., K, is closed in L^2[0, \infty).

Theorem 3.3 is equivalent to the statement that for all w(t) in L^2[0, \infty) (that is, any w(t) for which \int_0^{\infty} w(t)^2\,dt exists), for every \epsilon > 0, there exists a set of parameters w_k, k = 1,...,K, such that

\int_0^{\infty} \left[w(t) - \sum_{k=1}^{K} w_k\,g_k(t)\right]^2 dt < \epsilon.

The proof for this theorem is based upon the completeness of the Laguerre polynomials and can be found in Szego (1939, page 108, theorem 5.7.1). The foregoing discussion can be summarized by the following important result.
Theorem 3.4 The convolution model described by

net(t) = \int_0^t w(t-s)\,x(s)\,ds    (Eq.48)

is equivalent to the following system:

net(t) = \sum_{k=1}^{K} w_k\,x_k(t),

where x_0(t) = x(t), and

\frac{dx_k}{dt}(t) = -\mu\,x_k(t) + \mu\,x_{k-1}(t), k = 1,...,K, \mu > 0.    (Eq.49)

(end theorem)

The term gamma memory will be reserved to indicate the delay structure described by Eq.49. The recursive nature of the gamma memory computation is illustrated in Figure 3.3.

Figure 3.3 The gamma memory structure.
For the discrete time case, the derivative in Eq.49 is approximated by the first-order forward difference, that is

\frac{dx_k}{dt}(t) \approx x_k(t+1) - x_k(t).    (Eq.50)

This approximation is not the most accurate, but it is the simplest. Also, this particular choice implies that the boundary value \mu = 1 reduces the gamma memory to a tapped delay line. This feature facilitates the comparison of the discrete gamma model to tapped delay line structures. Applying Eq.50 leads to the following recurrence relations for the discrete (time) gamma memory:

x_0(t) = x(t)
x_k(t) = (1-\mu)\,x_k(t-1) + \mu\,x_{k-1}(t-1), k = 1,...,K, and t = t_0, t_1, t_2, ...    (Eq.51)

The time index t now runs through the iteration numbers t_0, t_1, t_2, ... The discrete gamma memory structure is displayed in Figure 3.4.
3.3 Characteristics of Gamma Memory
In this section the gamma memory structure is analyzed both in the time and
frequency domain.
3.3.1 Transformation to s- and z-Domain
Since the recursive relations that generate the gamma state variables xk(t) at
successive taps k are linear, the Laplace transformation can be applied. The (one-sided)
Laplace transform is defined as
X_k(s) = \int_0^{\infty} x_k(t)\,e^{-st}\,dt = L\{x_k(t)\}.    (Eq.52)

Application of Eq.52 to Eq.49 leads to the following recursive relations for the generation of the gamma state variables in the s-domain:

X_0(s) = X(s)
X_k(s) = \frac{\mu}{s+\mu}\,X_{k-1}(s), k = 1,...,K.    (Eq.53)

The operator G(s) = \frac{\mu}{s+\mu} will be referred to as the gamma delay operator. Note that X_k(s) can be expressed as a function of the memory input X(s) only. Repeated application of Eq.53 yields

X_k(s) = \left(\frac{\mu}{s+\mu}\right)^k X(s) = G^k(s)\,X(s).    (Eq.54)

It can be verified that G^k(s) = L\{g_k(t)\}.
The system Eq.53 also suggests a hardware implementation of gamma memory, which is shown in Figure 3.5. It follows that gamma memory can be interpreted as a tapped low-pass ladder filter. There are two gamma memory parameters: the order K and the bandwidth \mu = (RC)^{-1}.

Figure 3.5 A hardware implementation of gamma memory.
The corresponding frequency domain for discrete-time systems is the z-domain. It follows from Eq.53 that the z-transform can be found by substitution of s = z-1 in the Laplace transform. This leads to

X_k(z) = \frac{\mu}{z - (1-\mu)}\,X_{k-1}(z), k = 1,...,K.    (Eq.55)

The discrete gamma delay operator is G(z) = \frac{\mu}{z - (1-\mu)}. The transfer function from the memory input X to the kth tap X_k follows from Eq.55:

G_k(z) = \left(\frac{\mu}{z - (1-\mu)}\right)^k.    (Eq.56)

Inverse z-transformation of Eq.56 leads to the discrete gamma kernels

g_k(t) = \binom{t-1}{k-1}\,\mu^k\,(1-\mu)^{t-k}, k = 1,...,K, t = k, k+1, ...    (Eq.57)

The discrete gamma kernel g_k(t) is the impulse response for the kth tap of the discrete gamma memory model. Eq.57 can be interpreted as follows. In order to get from the memory input to the kth tap x_k(t) in time t, the signal has to take k forward steps and pass through t-k loops. Each forward step involves a multiplication by \mu, and a pass through a loop involves a multiplication by 1-\mu. The number of different paths from x(t) to x_k(t) in time t equals \binom{t-1}{k-1}.
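A quick numerical check, under my own naming conventions (this is an illustration, not material from the dissertation), confirms that the impulse response of the recursion Eq.51 equals the closed form Eq.57.

# A minimal sketch, assuming numpy and Python 3.8+ (math.comb).
import numpy as np
from math import comb

K, mu, T = 3, 0.4, 12
taps = np.zeros(K + 1)
recursion, closed_form = [], []
for t in range(0, T + 1):
    x_t = 1.0 if t == 0 else 0.0                       # unit impulse at t = 0
    new = np.empty_like(taps)
    new[0] = x_t
    new[1:] = (1.0 - mu) * taps[1:] + mu * taps[:-1]   # Eq.51
    taps = new
    recursion.append(taps[K])
    closed_form.append(comb(t - 1, K - 1) * mu**K * (1 - mu)**(t - K) if t >= K else 0.0)

print(np.allclose(recursion, closed_form))             # True: Eq.51 and Eq.57 agree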
Next, some of the frequency domain characteristics of the gamma memory will
be analyzed. The analysis will be performed using the discrete model, although similar
properties may be derived for the continuous time version.
3.3.2 Frequency Domain Analysis
The transfer function for the (kth tap of the) discrete gamma memory is given
by Eq.56. The Kth order discrete gamma memory has a Kth order eigenvalue at z = 1-\mu. Since a linear discrete-time model is stable when all eigenvalues lie within the unit circle, it follows that the discrete gamma memory is a stable system when 0 < \mu < 2. The group delay and magnitude response for this structure are displayed for the second tap (k=2) in Figure 3.6 and Figure 3.7 respectively.
The group delay of a filter structure is defined as the negated derivative of the phase response with respect to frequency. It provides a measure of the delay in the filter with respect to the frequency of the input signal. When \mu = 1, gamma memory reduces to a tapped delay line structure. In this case, all frequencies pass with gain one and the group delay is 2, the tap index.
When 0 < \mu < 1, the gamma memory implements a linear Kth order low-pass filter. The low frequencies are delayed more than the high frequencies. In fact, the low frequencies can be delayed by more than the tap index, which is the (maximal) delay for a tapped delay line. For instance, for \mu = 0.25 at tap k=2, a delay of up to 8 can be achieved for the low frequencies. The cost of the additional delay for the low frequencies is paid by the high frequencies: the high frequencies are attenuated and their group delay is less than the tap index. Thus, for 0 < \mu < 1, the storage of the low frequencies is favored at a cost for the high frequencies.

Figure 3.6 Group delay for discrete gamma memory at k=2.
The gamma memory behaves as a high-pass filter when 1 < \mu < 2. As a result, the high frequencies are delayed by more than the tap index.
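These frequency-domain properties are easy to reproduce numerically. The sketch below, assuming numpy and scipy are available (an illustration of mine, not material from the dissertation), evaluates the near-DC group delay and gain of G_2(z) for a few values of mu; for mu = 0.25 the low-frequency delay approaches 8, as noted above.

# A minimal sketch of the k = 2 transfer function G_2(z) = (mu/(z-(1-mu)))^2.
import numpy as np
from scipy.signal import group_delay, freqz

for mu in (0.25, 0.5, 1.0, 1.75):
    b = [0.0, 0.0, mu**2]                              # numerator in powers of z^-1
    a = [1.0, -2.0 * (1.0 - mu), (1.0 - mu)**2]        # denominator (1-(1-mu)z^-1)^2
    w, gd = group_delay((b, a), w=np.array([0.01]))    # group delay near DC
    _, h = freqz(b, a, worN=np.array([0.01]))          # complex gain near DC
    print(f"mu={mu}: low-frequency group delay ~ {gd[0]:.2f}, gain ~ {abs(h[0]):.2f}")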
3.3.3 Time Domain Analysis
Although the impulse response g_k(t) of the kth tap of the gamma memory extends to infinite time for 0 < \mu < 1, it is possible to formulate a mean memory depth for a given memory structure g_k(t). Let us define the mean sampling time \bar{t}_k for the kth tap as

\bar{t}_k = \sum_{t=0}^{\infty} t\,g_k(t) = \left[-z\frac{d}{dz} Z\{g_k(t)\}\right]_{z=1} = \frac{k}{\mu}.    (Eq.58)

We also define the mean sampling period \Delta\bar{t}_k (at tap k) as \Delta\bar{t}_k = \bar{t}_k - \bar{t}_{k-1} = \frac{1}{\mu}. The mean memory depth D_k for a gamma memory of order k then becomes

D_k = \sum_{i=1}^{k} \Delta\bar{t}_i = \frac{k}{\mu}.    (Eq.59)

In the following, we drop the subscript when k = K. If we define the resolution R_k as R_k = \frac{k}{D_k} = \mu, the following formula arises which is of fundamental importance for the characterization of gamma memory structures:

K = D\,R.    (Eq.60)
Figure 3.7 Magnitude response for discrete gamma memory at k=2.

Formula Eq.60 reflects the possible trade-off of resolution versus memory depth in a memory structure for fixed dimensionality K. Such a trade-off is not possible in a non-dispersive tapped delay line, since the fixed choice of \mu = 1 sets the depth and resolution to D = K and R = 1 respectively. However, in the gamma memory, depth and resolution can be adapted by variation of \mu.
In most neural net structures, the number of adaptive parameters is proportional to the number of taps (K). Thus, when \mu = 1, the number of weights is proportional to the memory depth. Very often this coupling leads to overfitting of the data set (using parameters to model the noise). The parameter \mu provides a means to uncouple the memory order and depth.
As an example, assume a signal whose dynamics are described by a system with 5 parameters and maximal delay 10, that is, y(t) = f(x(t-n_i), w_i) where i = 1,...,5, and max_i(n_i) = 10. If we try to model this signal with an adaline structure, the choice K = 10 leads to overfitting while K < 10 leaves the network unable to incorporate the influence of x(t-10). In an adaline with gamma memory network, the choice K = 5 and \mu = 0.5 leads to 5 free network parameters and a mean memory depth of 10, obviously a better compromise.
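The depth estimate D = K/mu of Eq.59 can be checked directly from the discrete kernels of Eq.57. The helper below is my own construction (not the author's code); it computes the mean sampling time of the Kth tap for the example above (K = 5, mu = 0.5) and for the equivalent tapped delay line (K = 10, mu = 1).

# A minimal sketch, assuming numpy and Python 3.8+.
import numpy as np
from math import comb

def mean_delay(K, mu, T=500):
    """Mean sampling time of the Kth tap, sum_t t*g_K(t), using the kernels of Eq.57."""
    t = np.arange(K, T)
    g = np.array([comb(n - 1, K - 1) * mu**K * (1 - mu)**(n - K) for n in t])
    return float(np.sum(t * g))

print(mean_delay(5, 0.5))    # ~ 10.0, i.e. K/mu for the gamma memory
print(mean_delay(10, 1.0))   # ~ 10.0: the tapped delay line needs twice the taps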
3.3.4 Discussion
In this section the gamma memory structure was analyzed in both the time and
frequency domains. When 0 < \mu < 1 the storage of the low frequencies is favored over
the high frequencies. In the time domain this translates to a loss of resolution but a gain
in memory depth. There are many applications for which this bias towards storing low
frequencies can be exploited. In particular we think of applications where a long
memory depth is required, as in echo cancelation or room equalization.
Figure 3.8 Memory depth of discrete gamma memory vs. \mu.

Next the gamma memory model is incorporated into the additive neural net. The additive model with adaptive gamma memory provides a unified framework for non-linear processing with adaptive short term memory capacities.
3.4 The Gamma Neural Net
In this section, the additive neural net is extended with gamma memory. This
new model will be compared to the various competing networks for temporal
processing as described in chapter 2.
3.4.1 The Model
Recall the general model for additive neural nets:

\frac{dx_i}{dt}(t) = -a_i\,x_i(t) + \sigma_i\left(\sum_{j=1}^{N} w_{ij}\,x_j(t)\right) + I_i(t), i = 1,...,N.    (Eq.61)

Let us assume that each neuron has the capability to store its past in a gamma memory structure of (maximal) order K and bandwidth \mu_i. The activation of the kth tap of neuron i is written as x_{ik}(t). The weight w_{ijk} connects the kth tap of neuron j to neuron i. The system equations for this model are

\frac{dx_i}{dt}(t) = -a_i\,x_i(t) + \sigma_i\left(\sum_j\sum_k w_{ijk}\,x_{jk}(t)\right) + I_i(t)    (Eq.62)

for the activations x_i(t), and for the taps x_{ik}(t)

\frac{dx_{ik}}{dt}(t) = -\mu_i\,x_{ik}(t) + \mu_i\,x_{i,k-1}(t), k = 1,...,K.    (Eq.63)

The system described by Eq.62 and Eq.63 will be referred to as the (additive) gamma neural net or gamma model. In Eq.62 the time constant is absorbed in the decay parameter a_i. Also, for notational convenience we defined x_i(t) \equiv x_{i0}(t). The structure of the gamma neural model is displayed in Figure 3.9.
The discrete gamma memory can be applied in discrete-time neural network models. The discrete gamma neural model is defined as follows:

x_i(t) = \sigma_i\left(\sum_j\sum_k w_{ijk}\,x_{jk}(t)\right) + I_i(t)    (Eq.64)

x_{ik}(t) = (1-\mu_i)\,x_{ik}(t-1) + \mu_i\,x_{i,k-1}(t-1), k = 1,...,K.    (Eq.65)

Note that \mu = 1 leads to an additive network where the memory is implemented as a tapped delay line. A feedforward network of this type is equivalent to the time-delay neural net. The other extreme, \mu = 0, obviates the gamma memory and reduces the structure to a "normal" additive model.
Next, the gamma model is compared to previously introduced neural models for
temporal processing.
3.4.2 The Gamma Model versus the Additive Neural Net
Figure 3.9 The additive gamma neural model.

Additive neural models are characterized by the fact that the adaptive net parameters are constants, that is, neither dependent on the neural activation nor time dependent. The free parameters in the gamma neural model are \mu_i and w_{ijk}, hence the
gamma net is an additive model. This is an important result, since the substantial theory
on additive neural nets applies directly to the gamma model. Also, existing learning
procedures such as Hebbian learning and backpropagation apply without restriction to
the gamma net. Since the gamma model is additive, it is possible to express the system
equations (Eq.62-Eq.65) as a Grossberg additive model. This is easiest shown by rewriting the gamma model equations in matrix form. This conversion will now be carried out for the continuous time model, since the result hints at an interesting interpretation of the gamma model. First the node indices i and j are eliminated. We define the signal vectors x_k = [x_{1k}, ..., x_{Nk}]^T, I = [I_1, ..., I_N]^T and the parameter matrices a = diag_N(a_i), \mu = diag_N(\mu_i) and w_k = [w_{ijk}] (the N x N matrix of weights from tap k). Then, the gamma model equations can be written in matrix form as
\frac{dx_0}{dt}(t) = -a\,x_0(t) + \sigma\left(\sum_{k=0}^{K} w_k\,x_k(t)\right) + I(t)

\frac{dx_k}{dt}(t) = -\mu\,x_k(t) + \mu\,x_{k-1}(t), k = 1,...,K.    (Eq.66)

Next, index k is eliminated. We define the gamma state vector X = [x_0, x_1, ..., x_K]^T, the input \Xi = [I, 0, ..., 0]^T, the squashing function \Sigma(\cdot) = [\sigma(\cdot), 0, ..., 0]^T, the block matrix of decay parameters M (with the decay a acting on x_0 and the gamma recursion coefficients \mu, -\mu on the remaining blocks) and the matrix of weights \Omega (with first block row [w_0, w_1, ..., w_K] and zeros elsewhere). Then the gamma model evaluates to

\frac{dX}{dt} = -M X + \Sigma(\Omega X) + \Xi,    (Eq.67)

an N(K+1)-dimensional Grossberg additive model.
Notice the form of the weight matrix \Omega. Many entries are preset to zero; in other words, the gamma model is a pre-wired additive model. As a result, the pre-wiring of the gamma model for processing of temporal patterns has left this model with fewer free weights than a general additive model. This property is important, since both the learning time and the number of training patterns needed grow with the number of free weights in a neural net.
3.4.3 The Gamma Model versus the Convolution Model
In chapter 2, I identified theoretical analysis, numerical simulation and learning
as problem areas for the convolution model. It was also shown that the gamma model
is mathematically equivalent to the convolution model for sufficiently large order K.
Next these problems are re-evaluated when the convolution model is expressed as a
gamma model.
Analysis As was discussed in section 3.4.2, the gamma model can be
represented as a pre-wired additive model. Consequently, theoretical results for the
additive model are entirely applicable to the gamma model.
Numerical Simulation Whereas the complexity of numerical integration of the convolution model scales as O(N²T), the gamma model scales as O(N²K). Thus, as time
progresses the evaluation of the next state for the convolution model involves an
increasing amount of computation. The computational load for the gamma model is
independent of time.
Learning While the weights in the convolution model are time-varying, the
gamma weights are constants. Thus, the dimensionality of the weight vector is
independent of time. This is a very important distinction from an engineering
standpoint, since it allows a direct generalization of conventional learning algorithms
to the gamma model.
3.4.4 The Gamma Model versus the Concentration-in-Time net
In 1987, Tank and Hopfield presented an analog neural net with dispersive delay
kernels for temporal processing (Tank and Hopfield, 1987). The memory taps xk(t) for
this "Concentration-in-Time net" (CITN) are obtained by convolving the input signal
with the kernels
f_k(t) = \left(\frac{t}{k}\right)^{\alpha} e^{\alpha\left(1 - \frac{t}{k}\right)}, k = 1,...,K,    (Eq.68)

where \alpha is a positive integer. f_k(t) is normalized to have maximal value 1 at t = k. The degree of dispersion is regulated by the parameter \alpha. In Figure 3.10, the kernels f_k(t) are displayed for k=1 to 5 when \alpha = 5.
Although f_k(t) visually resembles a peak-normalized gamma kernel, it is not possible to generate the kernels f_k(t) by a recursive set of ordinary differential equations with constant coefficients, as is the case for the gamma kernels. In fact, differentiating Eq.68 leads to the following time-varying differential equation for f_k(t):

\frac{df_k}{dt}(t) = \alpha\left(\frac{1}{t} - \frac{1}{k}\right) f_k(t).    (Eq.69)

Figure 3.10 Tank and Hopfield's delay kernels; \alpha = 5, k = 1,...,5.
While Tank and Hopfield's model shares with the gamma model the capability of regulating temporal dispersion, only the gamma model provides an additive neural mechanism for this capacity. As a result, in the gamma model the dispersion control parameter \mu can be treated as an adaptive weight. In Tank and Hopfield's model, \alpha is fixed. The relative merits of peak- versus area-normalization of dispersive delay kernels have not been investigated.
Similar arguments hold when the gamma memory is compared to the adaptive
gaussian distributed delay models such as the TEMPO 2 model (Bodenhausen and
Waibel, 1991) and the gaussian version of the concentration-in-time neural net
(Unnikrishnan et al., 1991). These memory models do offer the advantage of adaptive
dispersion, yet only the gamma memory offers an additive neural mechanism to create
dispersive delays. The other models require evaluation of a convolution integral with
respect to the delay kernels in order to compute the memory traces. Although not a
priority for engineering applications, the gamma memory is biologically plausible,
since there is no (non-neural) external mechanism required to generate delay kernels.
3.4.5 The Gamma Model versus the Time Delay Neural Net
The memory structure in the time delay neural net (TDNN) is a tapped delay
line. In fact, TDNN structures can be created in the gamma memory by fixing \mu = 1 in the discrete gamma model. Thus, the TDNN is a special case of the discrete gamma model. When 0 < \mu < 1, the discrete gamma memory implements a tapped dispersive delay line. The amount of dispersion is regulated by the adaptive memory parameter \mu. We discussed that the memory depth of the gamma memory can be estimated by K/\mu.
Hence, the memory depth can be adapted independently from the number of taps (and
weights!) in the structure. In the TDNN, the memory depth and the memory order both
equal K. As a result, increasing the memory depth in the TDNN is always coupled with
an increase in the number of weights in the net, which is sometimes not desirable.
3.4.6 The Gamma Model versus Adaline
The simplest discrete-time gamma model is a linear one-layer feedforward structure with one output unit. The equations for this network are given by

y(t) = \sum_{k=0}^{K} w_k\,x_k(t)
x_0(t) = I(t)
x_k(t) = (1-\mu)\,x_k(t-1) + \mu\,x_{k-1}(t-1), k = 1,...,K.    (Eq.70)

This structure is depicted in Figure 3.11. For \mu = 1, Widrow's adaline is obtained. Also, adaline is the simplest (linear, one-layer, one output) implementation of the time-delay neural net.

Figure 3.11 The adaline(\mu) structure.

The gamma memory generalizes adaline to an adaptive filter with a dispersive tapped delay line. Adaline with gamma memory will be referred to as adaline(\mu) or the adaptive gamma filter. Several interesting aspects of this filter are worth a deeper look. A special chapter (5) will be dedicated to the analysis of adaline(\mu).
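A short sketch of the adaline(mu) forward pass of Eq.70 (my own helper function, not the author's code) also illustrates the claim that mu = 1 reduces the gamma filter to an ordinary tapped delay line.

# A minimal sketch, assuming numpy; the test signal and weights are arbitrary.
import numpy as np

def gamma_filter(signal, weights, mu):
    """y(t) = sum_k w_k x_k(t) with the gamma taps of Eq.70."""
    K = len(weights) - 1
    taps = np.zeros(K + 1)
    out = []
    for s in signal:
        new = np.empty_like(taps)
        new[0] = s
        new[1:] = (1 - mu) * taps[1:] + mu * taps[:-1]
        taps = new
        out.append(float(weights @ taps))
    return np.array(out)

rng = np.random.default_rng(1)
x = rng.standard_normal(50)
w = np.array([0.5, -0.3, 0.2, 0.1])
y_gamma = gamma_filter(x, w, mu=1.0)            # tapped delay line behaviour
y_tdl = np.convolve(x, w)[:len(x)]              # adaline: direct FIR convolution
print(np.allclose(y_gamma, y_tdl))              # True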
3.5 Discussion
In this chapter the gamma neural model has been developed and analyzed. The
gamma model is characterized by a specific short term memory architecture, the
gamma memory structure. Gamma memory is an adaptive local short term memory
structure. Memory depth and resolution can be altered by variation of a continuous
parameter \mu. In the next chapter, gradient descent adaptation procedures for the gamma
model are presented.
CHAPTER 4
GRADIENT DESCENT LEARNING IN THE GAMMA NET
4.1 Introduction - Learning as an Optimization Problem
Learning in a neural net concerns the modification of the weights of the net so
as to improve performance of the system. The term adaptation will be used as a
synonym for learning. Commonly it is assumed that the performance of the neural net
is expressed by a scalar performance index or total error E. In general we write
E = \sum_p\sum_t\sum_m \epsilon_m^p(t).    (Eq.71)

\epsilon_m^p(t) is a cost functional which describes the error measure at output node m \in M at time t \in [0, T] when pattern p \in P is presented to the system. Often the (weighted by \frac{1}{2}) quadratic deviation from a given target trajectory d_m^p(t) is chosen as the cost. For this case, Eq.71 evaluates to

E = \frac{1}{2}\sum_{p,t,m}[d_m^p(t) - x_m^p(t)]^2 = \frac{1}{2}\sum_{p,t,m}[e_m^p(t)]^2,    (Eq.72)

where e_m^p(t) \equiv d_m^p(t) - x_m^p(t) is the instantaneous error signal which is immediately measurable at any time.
The learning goal is to minimize E over the system parameters w and \mu, constrained by the network state equations \dot{x} = f(x, I; w, \mu). This problem has been studied extensively in optimal control theory (Bryson and Ho, 1975). The most common approach to search for the minimum of E involves the use of the gradients \partial E/\partial w and \partial E/\partial\mu. When E is minimal, these gradients necessarily vanish, that is, at the optimum we have

\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \mu} = 0.    (Eq.73)
An algorithmic method is now discussed which searches for the values w and \mu that minimize E. Assume an available set of training pattern pairs (I^p(t), d^p(t)) that adequately represents the problem at hand. This set is referred to as the training set P. Next, the training set is presented to the network and the activations x^p(t) are recorded. The presentation of the entire pattern set P is called an epoch or batch. Note that the availability of x^p(t) and d^p(t) allows the evaluation of the performance index E using Eq.72. This measure can be used to determine when to stop training of the system. Next the gradients \partial E/\partial w and \partial E/\partial\mu are computed and the weights are updated in the direction of the negative gradients:

w_{new} = w_{old} - \eta\frac{\partial E}{\partial w}    (Eq.74)

\mu_{new} = \mu_{old} - \eta\frac{\partial E}{\partial \mu}.    (Eq.75)
If the learning rate or step size \eta is small enough, this update will decrease the total error E on the next batch run. There are other methods of utilizing the error gradients to search for the minimum of E. As an example, the successive weight updates can be made orthogonal to each other, a process which is called conjugate gradient descent. In this work we are not interested in optimizing the learning process per se. Our goal will be to generalize gradient descent adaptation to the gamma net and evaluate the properties of this generalization. The equations Eq.74 and Eq.75 implement an update strategy which is called steepest descent. The process of running training set epochs followed by a weight update is repeated until the total error E no longer decreases. Note that if the error surface E = E(w, \mu) is convex, this procedure leads to a global optimization of the network performance.
The learning process described above updates the weights only after
presentation of the entire epoch. We call this epochwise learning or learning in batch
mode. A faster training method would be to adapt the weights after each time step t.
This involves computation of the time-dependent error gradients \frac{\partial E}{\partial w}(t) and \frac{\partial E}{\partial \mu}(t). The update equations then become

w(t) = w(t-1) - \eta\frac{\partial E}{\partial w}(t)    (Eq.76)

\mu(t) = \mu(t-1) - \eta\frac{\partial E}{\partial \mu}(t).    (Eq.77)
This mode is called real-time learning or on-line learning. Real-time learning converges faster than learning in batch mode, but the updates are no longer in the opposite direction of the total error gradients \partial E/\partial w and \partial E/\partial\mu. Thus, even for convex error surfaces, real-time adaptation does not necessarily lead to global optimization of the network performance.
A look at the update equations makes clear that the crucial aspect of the learning process just described is the computation of the error gradients \partial E/\partial w and \partial E/\partial\mu. Therefore the remaining part of this chapter concerns methods of evaluating these gradients.
This chapter is organized as follows. In the next section the literature on
gradient computation in simple neural nets is reviewed. Two different methods will be
evaluated. The direct method computes the gradients by direct numerical
differentiation of the describing system equations. The alternative method, (error)
backpropagation, utilizes the specific network architecture to compute the gradients.
As a result, backpropagation will prove to be a more efficient technique (in terms of
number of operations) than the direct method. However, as we will see when dynamic
networks such as the gamma net are introduced, application of backpropagation is
restricted to short time intervals. The trade-offs of backpropagation versus the direct
method will be evaluated for the gamma net. Finally, a special gamma net architecture,
the focused gamma net, is introduced. This structure is of special interest, since it can
be trained by a fast hybrid learning procedure. Some well-known signal processing
structures such as adaline and the feedforward networks are special cases of the
focused gamma net.
4.2 Gradient Computation in Simple Static Networks
In this section the essentials of error gradient computation in neural nets are
treated. For the time being we deal with the simplest cases, since we only want to
convey the strategy of error gradient computation. The weight update equations for the
gamma model are postponed to section 4.3.
It is assumed that the processing system can be described by a static additive
neural model:
x_i = \sigma_i\left(\sum_j w_{ij}\,x_j\right) + I_i.    (Eq.78)

We will also write net_i = \sum_j w_{ij}\,x_j. Also, we assume that the states x_i are computed in increasing index order. Thus, first x_1 is computed, then x_2 and so forth until x_N. There are no temporal dynamics associated with Eq.78. As discussed before, the central task of all gradient descent adaptation procedures is to compute \partial E/\partial w_{ij} for all weights. It will be assumed that the total error measure E can be expressed as

E = \sum_{m\in M}\epsilon_m = \frac{1}{2}\sum_m e_m^2 = \frac{1}{2}\sum_m (d_m - x_m)^2.    (Eq.79)

Thus the training set consists of one static pattern. The learning task is to adapt the weights such that the mean square error between the target d_m and the net activation x_m is minimal when I_i is presented to the system. Next, two exact algorithms to compute \partial E/\partial w_{ij} are presented.
4.2.1 Gradient Computation by Direct Numerical Differentiation
The simplest method to compute the gradients \partial E/\partial w_{ij} is just by differentiating the equations Eq.78 and Eq.79. Applying the chain rule to Eq.79 yields

\frac{\partial E}{\partial w_{ij}} = -\sum_m e_m\,\frac{\partial x_m}{\partial w_{ij}}.    (Eq.80)

Next the gradient variable \beta_{ij}^m \equiv \frac{\partial x_m}{\partial w_{ij}} is defined. \beta_{ij}^m can be computed directly by differentiating the state equation Eq.78. This leads to

\beta_{ij}^m = \frac{d\sigma_m(net_m)}{d\,net_m}\,\frac{\partial net_m}{\partial w_{ij}} = \sigma_m'(net_m)\left[\delta_{im}\,x_j + \sum_n w_{mn}\,\beta_{ij}^n\right],    (Eq.81)

where \delta_{im} is the Kronecker delta function.
The set of equations Eq.78, Eq.80 and Eq.81 provides a system to compute the error gradients. Together with an update rule such as

\Delta w_{ij} = -\eta\frac{\partial E}{\partial w_{ij}},    (Eq.82)

they form a neural net learning system.
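To make the bookkeeping of Eq.80-Eq.82 concrete, the following sketch runs the direct method on a small ordered feedforward net and checks one gradient against a finite difference. The network size, the tanh squashing function, the input and the target are illustrative assumptions of mine, not values from the text.

# A minimal sketch, assuming numpy.
import numpy as np

N = 4
rng = np.random.default_rng(2)
w = np.tril(rng.standard_normal((N, N)), k=-1)   # ordered net: w[i, j] != 0 only for j < i
I = np.array([1.0, 0.5, 0.0, 0.0])
d = np.array([0.0, 0.0, 0.0, 1.0])               # target; E = 0.5*sum_m (d_m - x_m)^2

def forward(w):
    x = np.zeros(N)
    for i in range(N):                           # states evaluated in increasing index order
        x[i] = np.tanh(w[i] @ x) + I[i]
    return x

x = forward(w)
net = w @ x
e = d - x

# Direct method: beta[m, i, j] = dx_m/dw_ij, built up in increasing m (Eq.81)
beta = np.zeros((N, N, N))
for m in range(N):
    dsig = 1.0 - np.tanh(net[m])**2
    beta[m] = dsig * (w[m, :, None, None] * beta).sum(axis=0)   # sum_n w_mn beta^n_ij
    beta[m, m, :m] += dsig * x[:m]               # the delta_im x_j term (only j < m feed node m)
grad = -np.einsum('m,mij->ij', e, beta)          # Eq.80

# Finite-difference check of one entry
eps, (i, j) = 1e-6, (3, 1)
wp = w.copy(); wp[i, j] += eps
Ep = 0.5 * np.sum((d - forward(wp))**2)
E0 = 0.5 * np.sum((d - forward(w))**2)
print(grad[i, j], (Ep - E0) / eps)               # the two numbers should agree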
4.2.2 The Backpropagation Procedure
In contrast to the direct method, the backpropagation method exploits the
specific network structure of neural nets in order to compute the error gradients. As a
result, backpropagation is a computationally more efficient procedure than the direct
method.
Before the backpropagation method is introduced, it is necessary to define more
precisely what is meant by the error gradients. We now proceed by an intermezzo in
order to explain how we define partial (error) derivatives in networks. Consider the
network in Figure 4.1. The state equations for this network are given by
x_1 = I_1
x_2 = w_{21}\,x_1 + I_2
x_3 = w_{32}\,x_2
x_4 = w_{42}\,x_2 + w_{43}\,x_3
x_5 = w_{52}\,x_2 + w_{53}\,x_3 + w_{54}\,x_4    (Eq.83)

It is assumed that the variables x_1 through x_5 are computed in indexed order, that is, first x_1, then x_2 and so forth until x_5. A network whose state variables are computed one at a time in a specified order will be called an ordered network. Let us compute \partial x_5/\partial x_2. Explicit partial differentiation of the equation for x_5 in Eq.83 gives \partial x_5/\partial x_2 = w_{52}. However, this only reflects the direct or explicit dependence of x_5 on x_2. x_2 also affects x_5 indirectly through the network. Incorporating these indirect or implicit influences leads to the following expression:

\frac{\partial x_5}{\partial x_2} = w_{52} + w_{32}\frac{\partial x_5}{\partial x_3} + w_{42}\frac{\partial x_5}{\partial x_4}
= w_{52} + w_{32}\left(w_{53} + w_{43}\frac{\partial x_5}{\partial x_4}\right) + w_{42}\,w_{54}
= w_{52} + w_{32}(w_{53} + w_{43}\,w_{54}) + w_{42}\,w_{54}.    (Eq.84)
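The distinction between the explicit derivative and the ordered derivative is easy to verify numerically. The sketch below (my own check, with arbitrary weight values) evaluates the network of Eq.83 and confirms that a finite difference reproduces Eq.84 rather than the explicit value w_52.

# A minimal sketch; the numerical weight values are arbitrary illustrative choices.
w21, w32, w42, w43, w52, w53, w54 = 0.3, -0.7, 0.5, 1.1, 0.2, -0.4, 0.9

def x5_of_x2(x2):
    """Evaluate x3, x4, x5 of Eq.83 for a given value of x2."""
    x3 = w32 * x2
    x4 = w42 * x2 + w43 * x3
    return w52 * x2 + w53 * x3 + w54 * x4

eps = 1e-6
numeric = (x5_of_x2(1.0 + eps) - x5_of_x2(1.0)) / eps
ordered = w52 + w32 * (w53 + w43 * w54) + w42 * w54      # Eq.84
explicit = w52                                           # direct dependence only
print(numeric, ordered, explicit)                        # numeric matches Eq.84, not w52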
This difference between explicit and implicit dependencies in networks has been treated in a backpropagation context by Werbos (1989). He introduced the term ordered derivative to denote the total partial derivative (including the network influences). In this work, whenever we speak of a partial derivative with respect to x, the ordered partial derivative is meant, which is denoted by the symbol \frac{\partial}{\partial x}. If we only want to include the explicit (direct) dependence on x we speak of the explicit partial derivative, for which the symbol \frac{\partial^e}{\partial x} will be reserved. Werbos (1989) proved the following theorem for ordered networks:

Theorem 4.1 Consider a network of N variables x_i whose dependencies are ordered by the list L = [x_1, x_2, ..., x_N]; this means that x_i only depends on x_j where j < i. Let a performance function E be defined by

E = E(x_1, x_2, ..., x_N).    (Eq.85)

Then, the (ordered) partial derivatives \frac{\partial E}{\partial x_i} can be computed by

\frac{\partial E}{\partial x_i} = \frac{\partial^e E}{\partial x_i} + \sum_{j>i}\frac{\partial E}{\partial x_j}\,\frac{\partial^e x_j}{\partial x_i}.    (Eq.86)

(end theorem)
It can be checked that application of Theorem 4.1 to the computation of \partial x_5/\partial x_2 in the network Eq.83 leads to expression Eq.84. Note that the computation of the gradients \partial E/\partial x_i requires knowledge of the error gradients with respect to x_j where j > i. As a result, the error gradients must be computed in descending index order. This feature has inspired the name backpropagation for algorithmic procedures that make use of Eq.86 in order to compute the error gradients in neural networks. Here the intermezzo on partial derivatives ends. We now proceed to derive the backpropagation method.
Let the network equations and performance index be given by Eq.78 and Eq.79 respectively. An order or sequence of computation has to be determined for all variables involved in the state equation Eq.78. Since the set of weights \{w_{ij}\} is initialized before the state variables x_i are evaluated, they can be put at the beginning of the list. This leads to the following list:

L = [\{w_{ij}\}, x_1, ..., x_N].    (Eq.87)

Next the partial derivatives of the performance index E with respect to all variables in L are computed, making use of the rule for partial derivatives in ordered networks, Eq.86. For the state variables x_i this leads to

\frac{\partial E}{\partial x_i} = \frac{\partial^e E}{\partial x_i} + \sum_{j>i}\frac{\partial E}{\partial x_j}\frac{\partial^e x_j}{\partial x_i} = -e_i + \sum_{j>i}\frac{\partial E}{\partial x_j}\,\sigma_j'(net_j)\,w_{ji}.    (Eq.88)
For the gradients \frac{\partial E}{\partial w_{ij}} we obtain the following expression:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial^e E}{\partial w_{ij}} + \sum_{n\le N}\frac{\partial E}{\partial x_n}\frac{\partial^e x_n}{\partial w_{ij}} = \frac{\partial E}{\partial x_i}\,\sigma_i'(net_i)\,x_j.    (Eq.89)
In the backpropagation literature it is customary to define the variables

\epsilon_i = \frac{\partial E}{\partial x_i} and \delta_i = \frac{\partial E}{\partial net_i} = \epsilon_i\,\sigma_i'(net_i).    (Eq.90)

Substitution of Eq.90 into Eq.88 and Eq.89 yields for the computation of the error gradients

\epsilon_i = -e_i + \sum_{j>i}\delta_j\,w_{ji}    (Eq.91)

\delta_i = \epsilon_i\,\sigma_i'(net_i)    (Eq.92)

\frac{\partial E}{\partial w_{ij}} = \delta_i\,x_j.    (Eq.93)

The set of equations Eq.91, Eq.92 and Eq.93 constitutes the backpropagation method to compute the error gradients.
Now let us see how to apply all these equations to the learning problem. Say we have a network described by Eq.78 and a training data set consisting of one input pattern I_i and a target pattern d_m. To start the learning system, the input pattern is presented to the net and the state variables x_i are evaluated by Eq.78. The variables x_i are computed in increasing index order i. This completes the forward pass. Next, the error variables e_i = d_i - x_i are evaluated and stored. The next phase, the backward or backpropagation pass, computes the variables \delta_i = \partial E/\partial net_i by evaluating Eq.91 and Eq.92. The quantities \delta_i are called backpropagation errors. They measure the sensitivity of the total error E with respect to an infinitesimal change in net_i. It has been mentioned before that the backpropagation errors are computed in descending index order, that is, first \delta_N followed by \delta_{N-1} until \delta_1. After the backpropagation pass, the error gradients with respect to the weights are computed by Eq.93. Next, if a steepest descent update rule is used, the weights are adapted according to

\Delta w_{ij} = -\eta\frac{\partial E}{\partial w_{ij}} = -\eta\,\delta_i\,x_j.    (Eq.94)
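For comparison with the direct-method sketch above, here is a minimal sketch of the backpropagation pass Eq.91-Eq.93 on the same kind of toy ordered net; again the particular network, nonlinearity and target are my own illustrative choices, not the author's code.

# A minimal sketch, assuming numpy.
import numpy as np

N = 4
rng = np.random.default_rng(2)
w = np.tril(rng.standard_normal((N, N)), k=-1)   # ordered net: w[i, j] != 0 only for j < i
I = np.array([1.0, 0.5, 0.0, 0.0])
d = np.array([0.0, 0.0, 0.0, 1.0])

# forward pass in increasing index order
x = np.zeros(N)
for i in range(N):
    x[i] = np.tanh(w[i] @ x) + I[i]
e = d - x                                        # injection errors

# backward pass in decreasing index order (Eq.91-Eq.92)
eps_bp = np.zeros(N)                             # eps_i = dE/dx_i
delta = np.zeros(N)                              # delta_i = dE/dnet_i
for i in reversed(range(N)):
    eps_bp[i] = -e[i] + delta[i+1:] @ w[i+1:, i]
    delta[i] = eps_bp[i] * (1.0 - np.tanh(w[i] @ x)**2)
grad = np.tril(np.outer(delta, x), k=-1)         # dE/dw_ij = delta_i x_j (Eq.93), used entries only

# finite-difference check of one entry
def total_error(w):
    xs = np.zeros(N)
    for i in range(N):
        xs[i] = np.tanh(w[i] @ xs) + I[i]
    return 0.5 * np.sum((d - xs)**2)
eps = 1e-6
wp = w.copy(); wp[3, 1] += eps
print(grad[3, 1], (total_error(wp) - total_error(w)) / eps)   # should agree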
In order to appreciate the architecture of the backpropagation method, we rewrite the backpropagation equation Eq.88 as

\epsilon_i = -e_i + \sum_{j>i}\epsilon_j\,\sigma_j'(net_j)\,w_{ji}.    (Eq.95)

Note the structural similarity between the state equation Eq.78 and Eq.95. In the backpropagation equation the backprop errors \epsilon_i serve as the network states and -e_i is the external input. In order to discriminate between the error variables e_i and \epsilon_i, e_i is sometimes referred to as the injection error whereas \epsilon_i (and \delta_i) are called backpropagation errors. Since the backprop weights w_{ji} connect node j to node i (note the arrow reversal as compared to Eq.78), the backpropagation network structure is the transposed network of the network during the forward pass. This property is visualized in Figure 4.2, where the backpropagation structure is drawn for an example feedforward network. Thus, the backpropagation method makes explicit use of the network structure in order to compute the error gradients. As a result, the backpropagation method is computationally more efficient than the direct method. Specifically how the computational cost of the backpropagation method compares to that of the direct method will be evaluated next.

Figure 4.2 Backpropagation architecture for a static feedforward net.
4.2.3 An Evaluation of the Direct Method versus Backpropagation
There are hardly any results with respect to the convergence speed of gradient descent update rules applied to neural networks. All we can say is that the success of training a neural net using gradients often depends on the randomly selected initial weights. It is however interesting to make a comparison of the computational complexity of competing learning strategies. When learning algorithms are compared with respect to their time and space resource consumption, it will be assumed that the learning process is carried out on one sequential processor. In order to describe the complexity of algorithms the notation O(\varphi(n)) is used, which is defined as the set of (positive integer valued) functions which are less than or equal to some constant positive multiple of \varphi(n). We will assume that the cost of one operation (addition or multiplication) carried out on one processor is O(1). As an example, the evaluation of the system equations for the additive net Eq.78 costs O(N²) operations. The reasoning goes as follows. The evaluation of x_i costs O(N) operations since node i is connected to at most N nodes. Since we need to evaluate the activations for all i, that is i = 1,...,N, the total cost becomes O(N²). The cost of storage (space) for the system Eq.78 is O(N²) since the space requirements are dominated by the (at most) N² weights w_{ij}.
Now let us evaluate the cost of the error computation by the direct method. The number of required operations is dominated by the evaluation of the gradients \beta_{ij}^m. The computation of a variable \beta_{ij}^m involves O(N) operations. Since there are at most N³ variables \beta_{ij}^m, it follows that the total number of operations scales as O(N⁴). We need O(N³) space to store \beta_{ij}^m.
As for the computational cost of the backpropagation method, note that the computation of the backpropagation errors requires evaluation of the (transposed) network. It was already discussed that evaluation of the network requires O(N²) operations. The space requirements are dominated by the weights w_{ij}, hence O(N²) storage is needed. Thus, both the number of operations (time) and the space requirements for gradient evaluation scale favorably for the backpropagation method in comparison to the direct method.
The computational complexity is of course only a ballpark measure of the
merits of an algorithm. In particular for neural nets it is important whether, or to what degree, the algorithm can be carried out on parallel hardware. An interesting property for network algorithms in this respect is locality. We will say that an algorithm is local if the states of the network can be computed from information that is locally available at the site of computation. Locality is not only a property of the biological archetype, it greatly facilitates implementation in parallel hardware. For instance, the backpropagation errors are computed by means of the transposed network. As a result, in a hardware implementation only the direction of the communication paths between processors needs to be reversed. The direct method, on the other hand, is not local. The gradients \beta_{ij}^n need to be computed for all nodes n \in N and all weight indices (i, j) \in N \times N.
In conclusion, both the computational complexity and the locality criterion
favor the backpropagation method over the direct method for static networks.
Therefore, the direct method should not be used for computation of error gradients in
static nets. However, it will be shown that the situation is more complicated for
dynamic networks. In the next section, the direct method and backpropagation are
extended to the gamma net operating in a temporal environment.
4.3 Error Gradient Computation in the Gamma Model
Since the gamma model can be formulated as a regular additive model, it follows that both the direct method and the backpropagation procedure can be extended to the gamma net. In this section the error gradients \partial E/\partial w_{ijk} and \partial E/\partial\mu_i are derived for the discrete gamma model as described by

x_i(t) = \sigma_i\left(\sum_j\sum_k w_{ijk}\,x_{jk}(t)\right) + I_i(t),    (Eq.96)

where the gamma state variables are computed by

x_{ik}(t) = (1-\mu_i)\,x_{ik}(t-1) + \mu_i\,x_{i,k-1}(t-1).    (Eq.97)

In contrast to the previous section, it is assumed that the activations and target patterns are time-varying. Thus, the performance index E is defined as

E = \sum_t\sum_m \epsilon_m(t) = \frac{1}{2}\sum_{t,m}[d_m(t) - x_m(t)]^2 = \frac{1}{2}\sum_{t,m}[e_m(t)]^2.    (Eq.98)
We now proceed to derive the error gradients using the direct method.
4.3.1 The Direct Method
The procedure is similar to the derivation for the static model. Partially differentiating E with respect to w_{ijk} yields

\frac{\partial E}{\partial w_{ijk}} = -\sum_{t,m} e_m(t)\,\frac{\partial x_m(t)}{\partial w_{ijk}}.    (Eq.99)

We define the gradient signal \beta_{ijk}^m(t) \equiv \frac{\partial x_m(t)}{\partial w_{ijk}}. \beta_{ijk}^m(t) can be evaluated by partial differentiation of Eq.96, which leads to

\beta_{ijk}^m(t) = \sigma_m'(net_m(t))\left[\delta_{im}\,x_{jk}(t) + \sum_{n\in N} w_{mn}\,\beta_{ijk}^n(t)\right],    (Eq.100)

where \delta_{im} is the Kronecker delta (and remember the shorthand notation w_{mn} \equiv w_{mn0}).
A similar derivation can be applied to obtain the gradients \partial E/\partial\mu_i. Analogously to Eq.99 we write

\frac{\partial E}{\partial \mu_i} = -\sum_{t,m} e_m(t)\,\frac{\partial x_m(t)}{\partial \mu_i}.    (Eq.101)

Applying the chain rule to the partial derivatives \partial x_m(t)/\partial\mu_i leads to

\frac{\partial x_m(t)}{\partial \mu_i} = \sum_{n\in N}\sum_k \frac{\partial x_m(t)}{\partial x_{nk}(t)}\frac{\partial x_{nk}(t)}{\partial \mu_i} = \sum_k \frac{\partial x_m(t)}{\partial x_{ik}(t)}\frac{\partial x_{ik}(t)}{\partial \mu_i}.    (Eq.102)

The signal \partial x_m(t)/\partial x_{ik}(t) follows by differentiation of Eq.96, yielding

\frac{\partial x_m(t)}{\partial x_{ik}(t)} = \sigma_m'(net_m(t))\,w_{mik}.    (Eq.103)

Substitution of Eq.103 into Eq.101 yields for the error gradients

\frac{\partial E}{\partial \mu_i} = -\sum_{t,m} e_m(t)\,\sigma_m'(net_m(t))\sum_k w_{mik}\,\alpha_k^i(t),    (Eq.104)

where we defined \alpha_k^i(t) \equiv \frac{\partial x_{ik}(t)}{\partial \mu_i}. The signals \alpha_k^i(t) can be computed on line by differentiation of Eq.97, which evaluates to

\alpha_k^i(t) = (1-\mu_i)\,\alpha_k^i(t-1) + \mu_i\,\alpha_{k-1}^i(t-1) + [x_{i,k-1}(t-1) - x_{ik}(t-1)].    (Eq.105)
The set of equations Eq.99 and Eq.100 provides the gradients \partial E/\partial w_{ijk}, whereas Eq.104 and Eq.105 compute the gradients \partial E/\partial\mu_i. A steepest descent adaptive procedure would use these variables in an update rule of the form

\Delta w_{ijk} = -\eta\frac{\partial E}{\partial w_{ijk}},    (Eq.106)

and an analogous expression for the adaptation of \mu_i. Together with the gamma system equations Eq.96 and Eq.97 they constitute a gamma model learning system.
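The mu-gradient machinery is compact enough to verify in isolation. The sketch below (my own illustration, restricted to a single gamma memory so that alpha_0 = 0) runs the recursion Eq.105 alongside Eq.51 and checks the resulting derivative of x_k(t) with respect to mu against a finite difference.

# A minimal sketch, assuming numpy.
import numpy as np

def run_taps(signal, K, mu):
    taps = np.zeros(K + 1)
    hist = []
    for s in signal:
        new = np.empty_like(taps)
        new[0] = s
        new[1:] = (1 - mu) * taps[1:] + mu * taps[:-1]
        taps = new
        hist.append(taps.copy())
    return np.array(hist)

rng = np.random.default_rng(3)
signal, K, mu, k = rng.standard_normal(30), 3, 0.6, 2

# Eq.105 run alongside Eq.51
taps = np.zeros(K + 1)
alpha = np.zeros(K + 1)                      # alpha[k] = d x_k(t) / d mu  (alpha[0] = 0)
for s in signal:
    new_a = np.empty_like(alpha)
    new_a[0] = 0.0
    new_a[1:] = (1 - mu) * alpha[1:] + mu * alpha[:-1] + (taps[:-1] - taps[1:])
    new_t = np.empty_like(taps)
    new_t[0] = s
    new_t[1:] = (1 - mu) * taps[1:] + mu * taps[:-1]
    alpha, taps = new_a, new_t

eps = 1e-6
fd = (run_taps(signal, K, mu + eps)[-1, k] - run_taps(signal, K, mu)[-1, k]) / eps
print(alpha[k], fd)                          # the two values should agree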
The learning system as derived here assumed adaptation in batch mode. However, this algorithm is easily converted to a real-time learning system. We just define a time-dependent performance index E_t by

E_t = \frac{1}{2}\sum_{m\in M}[e_m(t)]^2.    (Eq.107)

Note that since E = \sum_t E_t, the only change in the formulae is to take out the \sum_t from the error gradients, which reduces the error gradient expressions to

\frac{\partial E_t}{\partial w_{ijk}} = -\sum_m e_m(t)\,\beta_{ijk}^m(t)    (Eq.108)

and

\frac{\partial E_t}{\partial \mu_i} = -\sum_m e_m(t)\,\sigma_m'(net_m(t))\sum_k w_{mik}\,\alpha_k^i(t).    (Eq.109)

The signals \beta_{ijk}^m(t) and \alpha_k^i(t) are computed by the same equations as in the batch mode, that is, Eq.100 and Eq.105 respectively. The real-time mode of this algorithm is particularly interesting since the required number of operations is equivalent to the batch mode algorithm. However, the storage requirements for the real-time mode are greatly reduced (by a factor T, the number of time steps) since we update on-line by

\Delta w_{ijk}(t) = -\eta\frac{\partial E_t}{\partial w_{ijk}} and \Delta\mu_i(t) = -\eta\frac{\partial E_t}{\partial \mu_i}.

In addition, real-time adaptation usually converges faster than epochwise updating. Therefore in practice real-time updating is used far more than learning in batch mode. In fact, the real-time mode of the algorithm described here was derived for recurrent neural nets by Williams and Zipser (1989). They coined the name real-time recurrent learning algorithm (RTRL). We will take over their terminology. Thus, the direct method for error gradient computation in gamma nets leads to a (special) RTRL algorithm.
Let us analyze the locality and complexity properties of the RTRL algorithm for the gamma net. Assume that the number of units in the system equals N. Each unit stores a history trace of its activation in a gamma memory structure of maximal order K. The (maximal) number of weights w_{ijk} then becomes N²K. The number of memory parameters equals N. Also, it is assumed that the system is run for T time steps.
The gradient variables \alpha_k^i(t) and \beta_{ijk}^m(t) determine the complexity of the procedure. There are at most N³K variables \beta_{ijk}^m(t), each of which is evaluated by Eq.100 at a cost O(N) per time step. Thus the total cost is O(N⁴K) per time step. It follows from Eq.104 that the evaluation of \partial E/\partial\mu_i requires a cost O(NK) per time step. The cost of evaluating \alpha_k^i(t) is O(1). Since there are N memory parameters \mu_i, it follows that the total cost per time step for memory adaptation is O(N²K). The space costs are dominated by \beta_{ijk}^m(t), requiring O(N³K) memory locations. Note again that the gradients \alpha_k^i(t) and \beta_{ijk}^m(t) cannot be computed locally in space, but all computations are local in time since the algorithm is real-time. The results for RTRL and other algorithms are summarized in Figure 4.6.
4.3.2 Backpropagation in the Gamma Net
In this section the backpropagation procedure as derived in section 4.2.2 is
generalized to the gamma neural net. We start by defining the list L that holds the order
of evaluation of the system variables. It is assumed that the activations x_{ik}(t) are evaluated in the order schematically specified by Figure 4.3.

for t = 0 to T do
  for i = 1 to N do
    for k = 0 to K do
      evaluate x_{ik}(t)
end; end; end

Figure 4.3 Evaluation order in the gamma model.

This leads to the following list:

L = [\{\mu_i\}, \{w_{ijk}\}, x_1(0), x_{11}(0), ..., x_{NK}(0), x_1(1), ..., x_{NK}(T)].    (Eq.111)

The same performance index as defined for the RTRL procedure is used, that is

E = \frac{1}{2}\sum_{t,m}[e_m(t)]^2.    (Eq.112)
t, m
Recall that in order to compute the error gradients in the backpropagation algorithm,
we make use of Werbos' formula for ordered derivatives, which evaluates for the
activations xik (t) to
e
BE aeE E xjl ()
DE + x ------ E 113
xik (0t) Xik (, j, 1) > (t, i, k) j) xik(t '
The expression (T,j, 1) > (t, i, k) under the summation sign refers to all index
combinations (T,j, 1) that appear after (t, i, k) in the list L. Although cumbersome,
working out Eq.113 is straightforward. In order to simplify Eq.113, the two cases
k = 0 and k 0 have to be considered.
e
JE .Xjl (Q)
First -- is worked out (k = 0). In order to evaluate the factor in
ax, (t) axi (t)
expression Eq.113 we need to find the activations xjl () that explicitly depend on
xi(t). It follows from the gamma system equations Eq.96 and Eq.97 that only
xil (t+ 1) and x (t) (j > i ) directly depend on xi (t) Thus, Eq.113 evaluates to
DE DE DE
axi(t) = ei(t)+piaxi(t+1) + '(netx(t)
Next the formula for ordered derivatives Eq. 113 is evaluated for the tap variables xik(t)
for k = 1,...,K. Applying Eq.113 to the gamma state equations yields
0
JE _aeE aE E vE
DE = + (1 -i)x E +. E + .'(net (t))wjik
Jxik(t) zk(t) +xik(t+l1) ixi, k+l(t+l) j>i kx(t)
Eq.115
The equations Eq.114 and Eq.115 backpropagate the gradients \partial E/\partial x_{ik}(t). Note that the gradients at time t are a function of the gradients at time t+1. Therefore, the backpropagation system has to be run backwards in time, that is, from t = T backwards to t = 0. In fact, this is also clear when we recall that the list L is run backwards during the backpropagation pass. For this reason the procedure described here is called backpropagation-through-time (BPTT). Next the error gradients are computed with respect to the system parameters by applying Eq.113 to the list L. For the weights w_{ijk} we get

\frac{\partial E}{\partial w_{ijk}} = \sum_t \frac{\partial E}{\partial x_i(t)}\,\sigma_i'(net_i(t))\,x_{jk}(t),    (Eq.116)

and for the gradients \partial E/\partial\mu_i,

\frac{\partial E}{\partial \mu_i} = \sum_t\sum_k \frac{\partial E}{\partial x_{ik}(t)}\,\frac{\partial^e x_{ik}(t)}{\partial \mu_i} = \sum_t\sum_k \frac{\partial E}{\partial x_{ik}(t)}\,[x_{i,k-1}(t-1) - x_{ik}(t-1)].    (Eq.117)
In Figure 4.4 the set of equations that describe the backpropagation method for
the gamma model is summarized. In Figure 4.4 for convenience the notation
DE DE
- (t) = and 8i (t) = E) is used for the backpropagation errors.
k axik(t) W aneti(t)
state equations (run for t = 0 to T, i = 1 to N, k = 0 to K):

    x_i(t)    = \sigma_i( \sum_{j=1}^{N} \sum_{k=0}^{K} w_{ijk} x_{jk}(t) ) + I_i(t)
    x_{ik}(t) = (1 - \mu_i) x_{ik}(t-1) + \mu_i x_{i,k-1}(t-1)

backpropagation equations (run for k = K to 0, i = N to 1, t = T to 0):

    E_i(t)      = -e_i(t) + \mu_i E_{i1}(t+1) + \sum_{j>i} w_{ji0} \delta_j(t)
    E_{ik}(t)   = (1 - \mu_i) E_{ik}(t+1) + \mu_i E_{i,k+1}(t+1) + \sum_{j>i} w_{jik} \delta_j(t)
    \delta_i(t) = E_i(t) \sigma_i'(net_i(t))

error gradients:

    \partial E / \partial w_{ijk} = \sum_t \delta_i(t) x_{jk}(t)
    \partial E / \partial \mu_i   = \sum_t \sum_k E_{ik}(t) [x_{i,k-1}(t-1) - x_{ik}(t-1)]

Figure 4.4 Backpropagation-through-time equations for the gamma
neural model.

The temporal aspect of gamma backpropagation impacts the use of this
procedure substantially. Similarly to regular backpropagation, the backpropagation
network is of the same complexity as the forward pass net. Note that this algorithm is
not local in time: the backpropagation errors can only be computed after a complete
epoch has ended (at t = T). Thus real-time learning is excluded as well. Additionally, it
follows that the states x_ik(t) and the errors e_i(t), E_ik(t) and \delta_i(t) must be stored for the entire
epoch. Thus the storage requirements scale as O(NKT + N^2 K) (the first term for x_ik(t) and
the second term for w_ijk). Obviously this limits the applicability of this algorithm to a
small epoch size T.
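To make the backward-in-time sweep and the epoch-long storage concrete, the following C fragment sketches the recursions of Figure 4.4 for the simplest special case: a single gamma memory channel feeding a linear readout node, so that the readout delta reduces to the negated instantaneous error. It is only an illustration of the BPTT mechanics; the array names, sizes and the linear readout are assumptions of this sketch, not the simulator used in this work.

/*
 * Sketch of BPTT for one gamma memory channel (taps k = 0..K) with a
 * linear readout y(t) = sum_k w[k]*x[k][t].  Note the O(K*T) storage of
 * the states over the whole epoch, and the backward sweep from t = T to 0.
 */
#include <string.h>

#define K 4          /* memory order                */
#define T 100        /* epoch length (t = 0..T)     */

static double x[K + 1][T + 1];    /* gamma states, stored for the whole epoch */
static double E[K + 1][T + 2];    /* backprop errors E[k][t] (room for t+1)   */

void gamma_bptt_epoch(const double in[T + 1], const double des[T + 1],
                      const double w[K + 1], double mu,
                      double grad_w[K + 1], double *grad_mu)
{
    double e[T + 1];

    /* forward pass: gamma state equations, t = 0..T */
    for (int t = 0; t <= T; t++) {
        x[0][t] = in[t];
        for (int k = K; k >= 1; k--)
            x[k][t] = (t == 0) ? 0.0
                               : (1.0 - mu) * x[k][t - 1] + mu * x[k - 1][t - 1];
        double y = 0.0;
        for (int k = 0; k <= K; k++) y += w[k] * x[k][t];
        e[t] = des[t] - y;                       /* instantaneous error        */
    }

    /* backward pass: must run from t = T down to 0 (not local in time) */
    memset(E, 0, sizeof E);
    memset(grad_w, 0, (K + 1) * sizeof(double));
    *grad_mu = 0.0;
    for (int t = T; t >= 0; t--) {
        for (int k = K; k >= 1; k--) {
            double above = (k < K) ? E[k + 1][t + 1] : 0.0;
            /* E_k(t) = (1-mu)E_k(t+1) + mu E_{k+1}(t+1) + w_k*delta(t),
               with delta(t) = -e(t) for the linear readout                   */
            E[k][t] = (1.0 - mu) * E[k][t + 1] + mu * above - e[t] * w[k];
            if (t >= 1)    /* dE/dmu += E_k(t) [x_{k-1}(t-1) - x_k(t-1)]      */
                *grad_mu += E[k][t] * (x[k - 1][t - 1] - x[k][t - 1]);
        }
        for (int k = 0; k <= K; k++)             /* dE/dw_k = -sum_t e(t)x_k(t) */
            grad_w[k] += -e[t] * x[k][t];
    }
    /* a steepest descent step would then be w[k] -= eta * grad_w[k], etc.     */
}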
There is another disadvantage associated with deep backpropagation paths. Recall
that the backpropagated error signal traverses the transposed network in reverse
direction. The backprop errors hold an estimation of the sensitivity of the total error
with respect to a change in the local activation. If the system parameters are not close
to the optimal values, the backpropagation pass will soon degrade the accuracy of the
backprop errors. Also, in dense networks, the backprop errors will disperse through the
network and hence degrade other error estimates as well. Thus, for fast adaptation,
backpropagation paths should be kept as short as possible.
In practice, when there is a natural temporal boundary, as in word classification
problems, BPTT is a good choice. For typical real-time learning applications, as in
prediction or system identification, most researchers apply RTRL. Yet at this time both
methods are restricted to relatively small problems when processed by sequential
machines. The computational cost of RTRL is excessive for large networks, while the
application of BPTT is hampered by increasing memory requirements over time. The
results of this section have been summarized in Figure 4.6. In the next section methods
to overcome the sharp increase in computational cost when neural nets are used in a
temporal environment are investigated.
4.4 The Focused Gamma Net Architecture
In this chapter general exact gradient descent adaptive procedures for the
gamma neural net have been studied. We have come up with two methods, the
backpropagation-through-time algorithm and the real-time-recurrent-learning
procedure. The recent renewed interest in neural net research has been largely propelled
by the application of the backpropagation method. Indeed, for static networks, the
backpropagation method provides an algorithmic approach to a large class of
problems that were previously not approachable due to the amount of computation
involved. This advantage is not so obvious when we generalize backpropagation to dynamic
networks such as the gamma model. The resulting procedure (BPTT) is severely restricted in the sense that
the storage requirements grow linearly with time. The alternative method, real-time-
recurrent-learning, imposes a constant (in time) load on the computational resources.
Yet the application of RTRL is restricted to small networks.
So what is then the status of gradient descent learning in dynamic neural nets?
For small applications, the methods described so far have been applied quite
successfully. Currently, research concentrates on how to adapt the procedures
described here such that larger problems can be attacked at reasonable
computational cost. Two strategies prevail in this search. One area of research
focuses on approximate error gradient computation with reduced complexity as
compared to the exact methods described here. For instance, Williams has developed
the truncated backpropagation-through-time procedure (Williams and Zipser, 1991).
This algorithm is less accurate than BPTT but the memory requirements are constant
over time since the backpropagation pass involves a fixed number of time steps. The
other way to speed up error gradient computation is to prewire the network architecture
in order to reduce the complexity of the learning algorithm. In this section we propose
a restricted gamma net architecture, the focused gamma net. The focused gamma net is
inspired by Mozer's efforts to design an efficient dynamic neural net architecture
(Mozer, 1989).
Next the architecture of the focused gamma net is introduced. Some of the
characteristics of this structure are discussed, followed by a derivation of the error
gradients. The specific net architecture allows a very efficient hybrid approach to
gradient computation.
4.4.1 Architecture
The focused gamma net is schematically drawn in Figure 4.5. Assume a D-dimensional
input signal I(t). The past of this signal is represented in a gamma memory
structure as described by

x_{i0}(t) = I_i(t)                                                        Eq.118
x_{ik}(t) = (1 - \mu_i) x_{ik}(t-1) + \mu_i x_{i,k-1}(t-1)                Eq.119

where t = 0,...,T, i = 1,...,D and k = 1,...,K. This layer, the input layer, has D memory
parameters \mu_i. The activations in the input layer are mapped onto a set of output nodes
by way of a (non-linear) static, strictly feedforward net. The nodes in the feedforward
net are indexed D+1 through N. Thus, this map can be written as

x_i(t) = \sigma_i( \sum_{j=D+1}^{N} w_{ij} x_j(t) + \sum_{j=1}^{D} \sum_{k=0}^{K} w_{ijk} x_{jk}(t) )        Eq.120

where the first sum collects the contributions from the feedforward net and the second
sum those from the input layer. For convenience Eq.120 will be written as

x_i(t) = \sigma_i( \sum_{j,k} w_{ijk} x_{jk}(t) )                         Eq.121

where we have utilized the notation x_{j0}(t) \equiv x_j(t) and w_{ij0} \equiv w_{ij}.
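The following C fragment sketches one forward step of Eq.118 through Eq.121, assuming a concrete instance of the static map, namely a single tanh hidden layer followed by a linear output. The dimensions, the struct layout and the choice of hidden layer are assumptions of this illustration; any strictly feedforward static net could take its place.

/*
 * Sketch of one forward step of the focused gamma net: a D-channel gamma
 * memory input layer (Eq.118, Eq.119) followed by an assumed static map
 * (one tanh hidden layer, linear output) standing in for Eq.120/121.
 */
#include <math.h>

#define D 2          /* input channels            */
#define K 4          /* gamma memory order        */
#define H 8          /* hidden units (assumed)    */

typedef struct {
    double mu[D];                         /* memory parameters, one per channel */
    double x[D][K + 1];                   /* gamma memory states x_ik(t)        */
    double w_hid[H][D][K + 1], b_hid[H];  /* static feedforward net (assumed)   */
    double w_out[H], b_out;
} FocusedGammaNet;

double focused_gamma_forward(FocusedGammaNet *net, const double input[D])
{
    /* input layer: update the gamma memory, Eq.118 and Eq.119 */
    for (int i = 0; i < D; i++) {
        for (int k = K; k >= 1; k--)      /* use the t-1 values of the lower taps */
            net->x[i][k] = (1.0 - net->mu[i]) * net->x[i][k]
                         + net->mu[i] * net->x[i][k - 1];
        net->x[i][0] = input[i];          /* tap 0 holds the current input        */
    }

    /* static feedforward map (example instance of Eq.120/121) */
    double y = net->b_out;
    for (int h = 0; h < H; h++) {
        double netsum = net->b_hid[h];
        for (int i = 0; i < D; i++)
            for (int k = 0; k <= K; k++)
                netsum += net->w_hid[h][i][k] * net->x[i][k];
        y += net->w_out[h] * tanh(netsum);
    }
    return y;
}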
Similar architectures have been used by Stornetta et al. (1988) and Mozer
(1989). These investigators however only used a first-order memory structure (K = 1).
Mozer analyzed some of the properties of structures of this kind and coined the term
focused backpropagation architecture. It turns out that the focused network
architecture enjoys a number of advantages in comparison to the fully connected
dynamic networks.
Let us first derive the update equations for the weights w_ijk. The
backpropagation method will be used. As before, the derivations are based on the
performance index E = \sum_{t,m} [e_m(t)]^2 and the evaluation order is determined by the
list L = [ {\mu_i}, {w_{ijk}}, {x_{ik}(t)} ]. We have already discussed in section 4.2.2 how
to apply backpropagation to feedforward nets. Thus, applying Werbos' formula for
ordered derivatives to the activations xi(t) in the feedforward net leads to the following
backpropagation system:
E_i(t) = -e_i(t) + \sum_{j>i} w_{ji} \delta_j(t)                          Eq.122
\delta_i(t) = \sigma_i'(net_i(t)) E_i(t),                                 Eq.123

where we defined E_i(t) \equiv \partial^+ E / \partial x_i(t) and \delta_i(t) \equiv \partial^+ E / \partial net_i(t). Similarly, it follows from
section 4.2.2 that the gradients \partial E / \partial w_{ijk} can be computed as

\frac{\partial E}{\partial w_{ijk}} = \sum_t \delta_i(t) x_{jk}(t).       Eq.124

Figure 4.5 The focused gamma net architecture: a gamma memory input
layer followed by a static feedforward net.
Note that since the mapping network is static and feedforward, we do not need to
backpropagate through time in order to find the backprop errors \delta_i(t). In fact, since the
\delta_i(t)'s are computed in real time, it is possible to convert Eq.124 into a real-time
procedure by defining the instantaneous gradient

\frac{\partial E}{\partial w_{ijk}}(t) = \delta_i(t) x_{jk}(t)

and updating the weights at every time step.
Application of backpropagation to compute the error gradients with respect to
the parameters \mu_i leads to a backpropagation-through-time procedure, since the input
layer is recurrent in nature. In most networks, however, the number of memory
parameters is relatively small, so it is efficient to use the direct method here. This
procedure has already been derived for the more general gamma nets in section 4.3.1.
Hence, without further explanation we write the error gradients \partial E / \partial \mu_i as

\frac{\partial E}{\partial \mu_i} = \sum_t \sum_m \frac{\partial E}{\partial x_m(t)} \sum_k \frac{\partial x_m(t)}{\partial x_{ik}(t)} \frac{\partial x_{ik}(t)}{\partial \mu_i}
                                  = -\sum_t \sum_m e_m(t) \sigma_m'(net_m(t)) \sum_k w_{mik} \alpha_{ik}(t),

where \alpha_{ik}(t) \equiv \partial x_{ik}(t) / \partial \mu_i can be computed by evaluation of Eq.105.
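As an illustration of the direct method, the fragment below advances one gamma memory channel and its sensitivities alpha_k(t) = dx_k(t)/dmu by one time step, and accumulates the corresponding contribution to dE/dmu. The sensitivity recursion is obtained by differentiating the gamma state equation Eq.119 with respect to mu; it is assumed here that this is the recursion referred to as Eq.105. The single-channel setting and all names are choices of this sketch.

/*
 * One time step of the direct (real-time) mu-gradient computation for a
 * single gamma memory channel.  Differentiating
 *   x_k(t) = (1-mu)x_k(t-1) + mu*x_{k-1}(t-1)
 * with respect to mu gives
 *   alpha_k(t) = (1-mu)alpha_k(t-1) + mu*alpha_{k-1}(t-1)
 *              + x_{k-1}(t-1) - x_k(t-1),        alpha_0(t) = 0,
 * which is the recursion used below (assumed to be Eq.105 of the text).
 */
#define K 4

static double x[K + 1];        /* gamma memory taps x_k(t)          */
static double alpha[K + 1];    /* sensitivities dx_k(t)/dmu; [0]==0 */

/* advance one step; err[k] = dE/dx_k(t) supplied by the caller;
 * returns this step's contribution to dE/dmu                        */
double gamma_mu_gradient_step(double input, double mu, const double err[K + 1])
{
    double x_old[K + 1], grad = 0.0;
    for (int k = 0; k <= K; k++) x_old[k] = x[k];

    x[0] = input;                          /* tap 0 is the input; alpha_0 stays 0 */
    for (int k = K; k >= 1; k--) {         /* downward so t-1 values are reused   */
        x[k]     = (1.0 - mu) * x_old[k] + mu * x_old[k - 1];
        alpha[k] = (1.0 - mu) * alpha[k] + mu * alpha[k - 1]
                 + (x_old[k - 1] - x_old[k]);
        grad += err[k] * alpha[k];         /* dE/dmu += dE/dx_k * dx_k/dmu        */
    }
    return grad;
}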
An important property is that the backpropagation path is short since the
feedforward net is static. The error estimates in the dynamic input layer do not
disperse during training since the gamma memory structures do not have lateral
connections in the input layer. This property is confirmed by considering Eq.105,
which propagates the errors through time. Thus the error estimates do not disperse in
the focused gamma net, which explains the adjective "focused".
                        RTRL          BPTT          FOCUSED
    space               O(N^3 K)      O(NKT)        O(DNK)
    time                O(N^4 KT)     O(N^2 KT)     O(NK^2 T)
    local in space      no            yes           yes
    local in time       yes           no            yes

Figure 4.6 A complexity comparison of gradient descent learning procedures
for the gamma net (N units, memory order K, T time steps).
In this architecture we have taken advantage of the particular characteristics of
both the BPTT and RTRL adaptive procedures. Since the feedforward net is static, the
very efficient backpropagation procedure is used to update the weights w_ijk. The input
layer of the focused net is dynamic however and as a result, application of
backpropagation would introduce the burden of time-dependent storage requirements
and error dispersion during the backward pass. RTRL on the other hand is a real-time
procedure that is tailored to application in small dynamic networks. Thus RTRL is used
in the recurrent input layer to compute the error gradients with respect to the memory parameters \mu_i.
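To illustrate the hybrid scheme in its simplest setting, the following per-sample update routine is a sketch for a one-layer linear focused gamma net (adaline(\mu)) with a single input channel: the weights follow the ordinary real-time delta rule, while \mu follows the direct gradient built from the sensitivities alpha_k of the previous sketch. The step size handling, the single channel, and clipping \mu to the range [0,1] used in the experiments of chapter 5 are assumptions of this illustration.

/*
 * Hybrid real-time update for a single-channel adaline(mu) sketch:
 * delta rule for the weights, direct-method gradient for mu.
 */
#define K 4

static double x[K + 1], alpha[K + 1], w[K + 1];
static double mu = 1.0;      /* mu = 1: the memory starts as a tapped delay line */

void adaline_mu_step(double input, double desired, double eta)
{
    double x_old[K + 1];
    for (int k = 0; k <= K; k++) x_old[k] = x[k];

    /* forward: gamma memory, its mu-sensitivities, and the linear output */
    x[0] = input;                                  /* alpha_0 stays 0          */
    for (int k = K; k >= 1; k--) {
        x[k]     = (1.0 - mu) * x_old[k] + mu * x_old[k - 1];
        alpha[k] = (1.0 - mu) * alpha[k] + mu * alpha[k - 1]
                 + (x_old[k - 1] - x_old[k]);
    }
    double y = 0.0;
    for (int k = 0; k <= K; k++) y += w[k] * x[k];
    double e = desired - y;                        /* e(t) = d(t) - y(t)       */

    /* hybrid real-time updates */
    double grad_mu = 0.0;
    for (int k = 1; k <= K; k++)
        grad_mu += -e * w[k] * alpha[k];           /* dE/dmu for this sample   */
    for (int k = 0; k <= K; k++)
        w[k] += eta * e * x[k];                    /* delta rule for the weights */
    mu -= eta * grad_mu;
    if (mu < 0.0) mu = 0.0;                        /* keep mu in [0,1], as in  */
    if (mu > 1.0) mu = 1.0;                        /* the chapter 5 experiments */
}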
The focused gamma model is not as general as a fully connected architecture.
Thus, certain dynamic input-output maps cannot be computed by the focused
architecture. For example, this representation assumes that the output can be encoded
as a static map of (the past of) the input pattern. Yet, some very interesting architectures
can be created in this framework. Mozer (1989) and Stornetta et al. (1988) have
obtained promising results in word recognition experiments using a first-order memory
focused gamma net. Note that a linear one-layer focused gamma net generalizes
Widrow's adaline structure.
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Introduction
In this chapter experimental simulation results for the gamma model are
presented. The goals for the simulation experiments are the following:
1. How does the gamma model perform when it is applied to various temporal
processing protocols? In particular, we are interested in an experimental comparison to
alternative neural network architectures.
2. How well do the adaptation algorithms that were derived in chapter 4
perform? The following questions are interesting in this respect and will be addressed
in this chapter. How well does the gradient descent procedure for the focused gamma
net work? Can we learn the weights w? Can we learn the gamma memory parameters
\mu_i? How does the adaptation time for the focused gamma net compare to alternative
neural net models?
With respect to the first goal, simulation experiments for problems in
prediction, system identification, temporal pattern classification and noise reduction
were selected. All experiments were carried out by members of the CNEL group. We
used 386-based DOS personal computers and 68030- and 68040-CPU based NeXT
computers for all simulations. The programming language was C.
For all neural net simulations we used a version of the focused gamma neural
net. The focused gamma net is a very versatile structure as it reduces to a time-delay-
neural-net when \mu is fixed to 1. Also, a one-layer linear focused gamma net reduces to
adaline(\mu). More complex architectures are certainly possible but in this work we are
mainly interested in a comparative evaluation of the gamma memory structure per se.
The topic of designing complex globally recurrent neural net architectures with gamma
memory is not addressed here. Also, experimental evaluation of neural networks in
relation to alternative non-neural processing techniques is not presented here (with the
exception of the noise reduction experiments). The latter topic has been studied by a
special DARPA committee (DARPA, 1988).
Before the experimental results are presented, some general practical issues
concerning gamma net simulation and adaptation are discussed.
5.2 Gamma Net Simulation and Training Issues
The system architecture that is used in the experiments is shown in Figure 5.1.
The signal to be processed, the source signal, is denoted by s(t). Both the neural net input
signal and the desired signal d(t) are derived from s(t). The particular form of this
transformation depends on the processing goal. A subset of the neural net states x(t),
the outputs, are measured and compared to the desired signals. The difference signal,
e(t) = d(t) - x(t), is called the instantaneous error signal and it is used as the input to the
steepest descent training procedure.
Figure 5.1 Experimental architecture.
5.2.1 Gamma Net Adaptation
The training strategy of the gamma neural net deserves more attention. In all
cases the network parameters w and \mu were adapted using the focused backpropagation
method as derived in chapter 4. We used the simple steepest descent update method,
that is, \Delta w = -\eta \, \partial E / \partial w. In all experiments, we used real-time updating. Thus, the weights
were adapted after each new sample. The stepsize (learning rate) \eta is an important
parameter. For large \eta the adaptation algorithm may become unstable, while a small \eta
leads to slow adaptation. We were not so much interested in optimizing the speed of
adaptation. A value between 0.01 and 0.1 for \eta provided in all cases a stable adaptation
phase. Another central problem is when to halt adaptation. Let us assume that the
network is trained by presentation of a set of pattern pairs, where each pair consists of
an input pattern and the corresponding target pattern. This set of patterns is called the
training data set. The presentation of all patterns from the training set is called an
epoch. In the experimental setting of this work, we obtain the training set by selecting
an appropriate source signal segment. The performance index (total error) for the
training set as a function of the epoch number provides a good measure as to how well
the neural net is able to model the training set. However, accurate modeling of the
training set is not the goal of adaptation. The idea of adaptation by exemplar patterns
over time is to present a good representation of the problem at hand to the neural net,
which after adaptation is able to extrapolate the information contained in the training
set to new input patterns. Thus, it is a good habit to test how well the neural net is able
to generalize to an additional set of patterns that are not used for adaptation. This set
of patterns is called the validation set. In general, whereas the total error for the training
set decreases as adaptation progresses, this is not necessarily the case for the
performance index of the validation set (Hecht-Nielsen, 1990). In practice it has been
found that gradient descent procedures first adapt to the gross features of the training
set. As training progresses, the system starts to adapt to the finer features of the training
set. The fine details of the training set very often do not represent features of the
problem, as is the case when the training data is corrupted by noise. When the system
starts to adapt to model the training data specific noise, the total error for the validation
data usually increases. This is the time when adaptation should be stopped. In this way,
the stop criterion detects when the adaptation process has reached a maximal
performance with respect to extrapolation to other patterns (not from the training set)
that are representative of the problem task. In all of the experiments that are presented
here we have used this strategy to determine when to stop training.
As an example, consider the learning curves as displayed in Figure 5.2. The
normalized square error for both the training and validation set is plotted as a function
of the epoch number. This example was taken from an elliptic filter modeling
experiment that will be discussed in section 5.4. In our experiments we stop training,
that is, we detect convergence, if the normalized error for the validation data set increases
over four consecutive epochs.
Figure 5.2 Normalized total error for the training set (Etrain) and the validation
set (Eval) versus epoch number in the sinusoidal prediction experiment. Convergence
is detected after 4 successive increases of error in the validation set.
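As a sketch of this stop criterion, the fragment below flags convergence once the normalized validation error has increased over four consecutive epochs. The function and variable names are illustrative only.

/*
 * Sketch of the stop criterion used in the experiments: adaptation is
 * halted once the normalized validation error has increased over four
 * consecutive epochs.
 */
#include <float.h>

#define MAX_BAD_EPOCHS 4

static double prev_eval = DBL_MAX;   /* validation error of the previous epoch   */
static int    bad_epochs = 0;        /* consecutive epochs with increasing error */

/* call once per epoch with the normalized validation-set error;
 * returns 1 when convergence is detected, 0 otherwise                            */
int convergence_detected(double eval)
{
    if (eval > prev_eval)
        bad_epochs++;
    else
        bad_epochs = 0;
    prev_eval = eval;
    return bad_epochs >= MAX_BAD_EPOCHS;
}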
Another important issue when considering neural net training is the adaptation
time. The adaptation time is the number of patterns (or epochs for batch learning) that
have to be presented to the neural net before the weights converge. There is not much
theory about the adaptation time of backpropagation algorithms. However, the question
whether and how the value of \mu affects the adaptation time can be experimentally
tackled. This problem was studied for a third-order elliptic filter modeling problem,
which is covered in more detail in section 5.4. The architecture was an adaline(\mu)
structure with K=3. The adaptation time, expressed in the number of samples, was
measured as a function of \mu and the results are plotted in Figure 5.3. The plot shows
that the adaptation time is nearly unaffected if \mu is greater than 0.2. This is rather good
news, although we have not been able to establish explicit formulae for the dependence
of the adaptation time on \mu.
Figure 5.3 Adaptation time (number of samples) as a function of \mu for the
elliptic filter modeling experiment in section 5.4.
In the next sections the experimental results with respect to application of the
gamma model to temporal processing problems are discussed.
5.3 (Non-)linear Prediction of a Complex Time Series
5.3.1 Prediction/Noise Removal of Sinusoidals contaminated by Gaussian Noise
We constructed an input signal consisting of a sum of sinusoids, contaminated
by additive white gaussian noise (AWGN). Specifically, I(t) was described by
I(t) = \sin(\pi(0.06t + 0.1)) + 3\sin(\pi(0.12t + 0.45)) + 1.5\sin(\pi(0.2t + 0.34))
     + \sin(\pi(0.4t + 0.67)) + AWGN.
The signal-to-noise ratio is 10 dB. This signal is shown in Figure 5.4.
Figure 5.4 (a) The sinusoidal signal plus AWGN (SNR = 10 dB). (b) Power
spectrum of the contaminated signal.
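For concreteness, the following C fragment generates a signal of this form. The Box-Muller noise generator and the way the noise variance is derived from the 10 dB SNR (taking the SNR as the ratio of average signal power to noise variance) are assumptions of this illustration rather than details taken from the experiment.

/*
 * Sketch of the test-signal generation of section 5.3.1: four sinusoids
 * plus additive white Gaussian noise at an assumed 10 dB power SNR.
 */
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* zero-mean, unit-variance Gaussian sample (Box-Muller) */
static double gauss(void)
{
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* avoid log(0) */
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

void make_signal(double *I, int n)
{
    /* average power of the clean signal: sum of A^2/2 over the four terms */
    double psig  = (1.0 * 1.0 + 3.0 * 3.0 + 1.5 * 1.5 + 1.0 * 1.0) / 2.0;
    double sigma = sqrt(psig / 10.0);                /* 10 dB SNR            */

    for (int t = 0; t < n; t++)
        I[t] = sin(M_PI * (0.06 * t + 0.10))
             + 3.0 * sin(M_PI * (0.12 * t + 0.45))
             + 1.5 * sin(M_PI * (0.20 * t + 0.34))
             +       sin(M_PI * (0.40 * t + 0.67))
             + sigma * gauss();
}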
The processing goal was to predict the next sample of the sum of sinusoidals.
Hence, the processing problem involves a combination of prediction and noise
cancelation. The processing system was adaline(\mu). The goals of this experiment are
the following:
1. Determine the optimal system performance as a function of \mu for 0 < \mu < 1
and K. Note that this implies a comparison of the gamma memory structure versus the
tapped delay line (for \mu=1) and the context-unit memories (for K=1).
2. Can the system parameters w_k and \mu be adapted to converge to the optimal
values?
A training set consisting of 300 samples was selected. After a run on the training
data, the system was run on a validation set, a different signal segment of 300 samples.
The noise in the validation set is sample-by-sample different from the noise in the
training set, but the statistics of both noise sources are the same. The system was
adapted until convergence for various values of K and \mu. \mu was parametrized over the
domain [0,1] using a step size \Delta\mu = 0.1. The normalized performance index after
training is displayed versus \mu in Figure 5.5.
Clearly, the first-order gamma memory with \mu = 0.1 outperforms even the fifth-order
adaline structure. For this experiment, the context-unit memory with \mu = 0.1 is optimal.
In a next experiment we let \mu be adaptive. The system is initialized with \mu=1.
The performance curves in Figure 5.5 seem to increase monotonically over the range
\mu_opt to \mu=1. Thus, a gradient descent algorithm with initial \mu=1 should in principle
converge to the optimal \mu. However, the simple structure of the performance surface is
Figure 5.5 The normalized performance index versus \mu after
training the adaline(\mu) structure to predict sinusoids
contaminated by white Gaussian noise. Curves are shown for
K=1, K=2, K=3 and K=4.