
Title: Temporal processing with neural networks
Full Citation
Permanent Link: http://ufdc.ufl.edu/UF00082173/00001
 Material Information
Title: Temporal processing with neural networks the development of the gamma model
Physical Description: ix, 148 leaves : ill. ; 29 cm.
Language: English
Creator: De Vries, Bert, 1962-
Publication Date: 1991
Subject: Neural networks (Computer science)   ( lcsh )
Neural computers   ( lcsh )
Signal processing   ( lcsh )
Electrical Engineering thesis Ph. D
Dissertations, Academic -- Electrical Engineering -- UF
Genre: bibliography   ( marcgt )
non-fiction   ( marcgt )
Thesis: Thesis (Ph. D.)--University of Florida, 1991.
Bibliography: Includes bibliographical references (leaves 143-147)
Statement of Responsibility: by Bert De Vries.
General Note: Typescript.
General Note: Vita.
 Record Information
Bibliographic ID: UF00082173
Volume ID: VID00001
Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
Resource Identifier: aleph - 001709827
oclc - 25541247
notis - AJC2112

Table of Contents
    Title Page
    Table of Contents
    The temporal processing problem and research goals
    A review of neural nets for temporal processing
    The gamma neural model
    Gradient descent learning in the gamma net
    Experimental results
    Conclusions and future research recommendations
    Biographical sketch
Full Text







The chances of mental depression during one's Ph.D. studies are not small.

Poverty and personal, social and intellectual isolation are just a few of the risks of the

"profession". Thus the real-world environment as provided by family, friends and

faculty largely determines the sanity of the student and therefore the quality of the

resulting dissertation. I have been very fortunate in this respect, although my friends,

committee members, let alone my family deserve no blame for the quality of this work.

My Ph.D. supervisor, Dr. Jose C. Principe, has been much more than an advisor.

His limitless supply of ideas, his personal commitment and warm character make him

an ideal supervisor, a fact which many students besides me have recognized. At this

place I wish to thank him for his collaboration and his friendship. In the Spring of 1989

I worked in the Hearing Research Laboratory headed by Dr. David Green. To witness

Dr. Green's approach toward conducting science is one of the best lessons a Ph.D.

student can receive. Dr. Jan van der Aa has been both on my master's and Ph.D.

supervising committee. His commitment to offer outstanding help at any time and his

personal friendship are very much appreciated. Dr. Donald Childers has been very

helpful at several times in guiding the next research steps. Dr. Fred Taylor's willingness

to serve on my doctoral committee is very much appreciated. I have had much support

from Dr. James Keesling from the mathematics department and Dr. Antonio Arroyo

from the electrical engineering department. Dr. Pedro Guedes de Oliviera from the

electrical engineering department of the University of Aveiro in Portugal visited our

laboratory during the 1991 spring semester. He has made significant contributions to

the understanding of the gamma model. His help and friendship are also very much appreciated.


Several graduate students in the Computational Neuro-Engineering Laboratory

(CNEL) have directly contributed to the work that is presented here. James Kuo

performed the experiments on noise reduction which are discussed in chapter 5. The

experiments on prediction of the Mackey-Glass series were carried out by Alok Rathie.

Curt Lefebvre, Samel Selebi, James Tracey and Mark Goldberg have done significant

work on the gamma model as well.

Furthermore I should thank my friends and the students in our laboratory for

their friendship.

Most of all, I am indebted to my dear parents and my sisters Karin and Marleen.

Their support, encouragement and love cannot be compensated by a few simple lines.

I dedicate this work for what it is worth to their health and happiness.




1.1 Introduction
1.2 A Statement of the Problem
1.3 Research Goals
1.4 A Summary of the Next Chapters

2.1 Introduction
2.2 A Recapitulation of Linear Digital Filters
2.3 Introduction to Neural Networks
2.4 The Adaptive Linear Combiner
2.5 Neural Network Paradigms: Static Models
2.5.1 The Continuous Mapper
2.5.2 The Associative Memory
2.6 Neural Network Paradigms: Dynamic Nets
2.6.1 Short Term Memory by Local Positive Feedback
2.6.2 Short Term Memory by Delays
2.6.3 The Sequential Associative Memory
2.7 Other Dynamic Neural Nets
2.8 Discussion

3.1 Introduction: Convolution Memory versus ARMA Model
3.2 The Gamma Memory Model
3.3 Characteristics of Gamma Memory
3.3.1 Transformation to s- and z-Domain
3.3.2 Frequency Domain Analysis
3.3.3 Time Domain Analysis
3.3.4 Discussion
3.4 The Gamma Neural Net
3.4.1 The Model
3.4.2 The Gamma Model versus the Additive Neural Net
3.4.3 The Gamma Model versus the Convolution Model
3.4.4 The Gamma Model versus the Concentration-in-Time Net
3.4.5 The Gamma Model versus the Time Delay Neural Net
3.4.6 The Gamma Model versus Adaline
3.5 Discussion

4.1 Introduction: Learning as an Optimization Problem
4.2 Gradient Computation in Simple Static Networks
4.2.1 Gradient Computation by Direct Numerical Differentiation
4.2.2 The Backpropagation Procedure
4.2.3 An Evaluation of the Direct Method versus Backpropagation
4.3 Error Gradient Computation in the Gamma Model
4.3.1 The Direct Method
4.3.2 Backpropagation in the Gamma Net
4.4 The Focused Gamma Net Architecture
4.4.1 Architecture

5.1 Introduction
5.2 Gamma Net Simulation and Training Issues
5.2.1 Gamma Net Adaptation
5.3 (Non-)linear Prediction of a Complex Time Series
5.3.1 Prediction/Noise Removal of Sinusoidals Contaminated by Gaussian Noise
5.3.2 Prediction of an EEG Sleep Stage Two Segment
5.3.3 Prediction of Mackey-Glass Chaotic Time Series
5.4 System Identification
5.5 Temporal Pattern Classification: Training a Concentration-in-Time Net
5.6 Noise Reduction in State Space
5.7 Discussion

6.1 Introduction
6.2 A Recapitulation of Linear Digital Filter Architectures
6.3 Generalized Feedforward Filters: Definitions
6.4 The Adaptive Gamma Filter
6.4.1 Definitions
6.4.2 Stability
6.4.3 Memory Depth versus Filter Order
6.4.4 LMS Adaptation
6.4.5 Wiener-Hopf Equations for the Adaptive Gamma Filter
6.5 Experimental Results
6.6 The Gamma Transform: A Design and Analysis Tool for Gamma Filters
6.7 A Second-order Memory Delay Element
6.8 Discussion

7.1 A Recapitulation of the Research
7.2 Ongoing Research Projects
7.3 Future Research Directions

REFERENCES

BIOGRAPHICAL SKETCH

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy



Bert de Vries

October 1991

Chairman: Dr. Jose C. Principe

Major Department: Electrical Engineering

This dissertation discusses the problem of processing complex temporal

patterns by artificial neural networks. The relatively broad topic of this work is

intentional; processing here includes such specialties as system identification, time

series prediction, interference canceling and sequence classification. Rather than

focusing on a particular application, this research concentrates on the paradigm of time

representation in neural network structures.

In all temporal processing applications, an essential capacity for a neural net is

to store information from the recent past (we refer to this capacity as short term

memory). The main contribution of this work is the introduction of a new (neural net)

mechanism to store temporal information. This model, the gamma neural model,

compares very favorably to competing memory structures, such as the tapped delay line

and first-order self-recurrent memory units. The gamma memory mechanism is

characterized by a cascade of uniform locally self-recurrent delay units. An interesting

feature of the gamma memory mechanism is the adaptability of the memory depth and resolution.


The gamma model is analyzed and compared with competing neural models. A

temporal backpropagation training procedure for gamma neural nets is derived.

Experiments in time series prediction (electro-encephalogram (EEG) and

synthetic chaotic signals), noise removal from a chaotic signal and system

identification are discussed. In all experiments, the gamma model outperforms

competing network architectures.

Interestingly, the application of the gamma memory structure is not limited to

neural nets. A chapter is devoted to introducing adaline(μ), an adaptive linear filter with

gamma memory. Adaline(μ) generalizes Widrow's adaptive linear combiner (adaline),

the most widely used structure in adaptive signal processing. The signal characteristics

and processing applications where adaline(μ) improves on the performance of adaline

are identified.



1.1 Introduction

When I started this research somewhere around January 1988, the goal was to

develop a speech recognition system that was characterized by a relatively high degree

of biological plausibility. Thus I spent the first semester of 1988 in the Hearing

Research Center at the University of Florida. This speech system-in-waiting was going

to be developed from the bottom up! First I would develop a neural auditory model that

captures the essential features for speech analysis, followed by new techniques for

neural temporal pattern classification and so on. As time passed by, the topic of this

dissertation has shifted more toward the fundamental problem of time representation in

network structures per se. As a result, a neural network model for processing of

temporal patterns has been developed, and at the time of the writing of this thesis we

are conducting experiments in isolated word recognition and phoneme classification

making use of this new model. As it turns out, the application of the gamma model (as

the new model is called) is not limited to the temporal classification problem.

Successful experiments were performed as well for problems in noise reduction,

system identification and time series prediction. For these tasks the gamma model

performance is very promising, in fact better than many competing techniques. In this

thesis the development of the gamma model is presented. I try to explain why and how

it works and report on a few experimental results.

1.2 A Statement of the Problem

This dissertation deals with processing of time-varying signals by a neural

network. By a signal we mean a time sequence of patterns and consequently the

adjectives "time-varying" or "temporal" will often be deleted. The temporal context is

always assumed. The 6-dimensional (input) signal to be processed is denoted by s(t)

and for the M-dimensional processed (output) signal we will write x(t). Processing of

s(t) implies the application of a map Φ which transforms s(t) into the signal x(t). In this

work we are interested in processing applications where the past of s(t) affects the

computation of x(t). Thus, Φ is actually a map from s(τ) for τ < t to x(t). Typical

applications include speech recognition, dynamic system identification and prediction

of a time series. The main problem of this thesis is how to effectively represent the past

of s(t). The following example shows why this is a difficult task.

Consider an isolated-word recognition system Φ which processes the incoming

speech signal in real-time. Assume a 16-word vocabulary and hence the output space

can be described in 4 bits. The speech signal is sampled at 8000 Hz and quantized to

8 bits. Then, for a typical word, which lasts 0.5 seconds, the input signal is represented

by 8000 × 8 × 0.5 = 32,000 bits. Thus, Φ is a map from 2^32000 input pattern

combinations to 4 bits. This sequence of 32,000 bits obviously contains many redundant

copies of information and is contaminated by irrelevant noise. Since Φ is a real-time

system, it must throw away data as quickly as possible, while preserving enough

information to classify the words. Note the inherent optimization task associated with

this problem: if we throw away too much data there may not be enough information left

for word classification, but yet we do not want to store redundant data and noise for it

will significantly complicate the processing task.
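
The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are assumptions introduced here, and the final ratio is an added illustration that does not appear in the original text.

```python
import math

# Figures taken from the isolated-word recognition example above.
sample_rate_hz = 8000      # sampling frequency
bits_per_sample = 8        # quantization depth
word_duration_s = 0.5      # duration of a typical word
vocabulary_size = 16       # number of words to distinguish

# Number of bits that represent one spoken word at the input.
input_bits = int(sample_rate_hz * bits_per_sample * word_duration_s)

# Number of bits needed to name one word of the vocabulary at the output.
output_bits = math.ceil(math.log2(vocabulary_size))

print(input_bits)                 # 32000
print(output_bits)                # 4
print(input_bits // output_bits)  # 8000 input bits per output bit
```

The ratio of 8000 input bits per output bit gives a rough feeling for how much data the system has to discard.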

As another example, consider the problem of predicting a discrete time series

s(t) by a model Φ. How many samples from the past of s(t) does the model Φ need in

order to reliably predict the next sample s(t+1)? Do we need to store the entire history

of s(t)? Very likely not, and in fact the additional noise associated with a deep memory

will be detrimental to the performance of Φ.

In general we conclude that before the data stream is submitted to the actual

processing task, first we need to rid the data stream of irrelevant information as much

as possible. Processing of a data stream so as to reduce the amount of data for further

processing is captured by the name pre-processing.

Figure 1.1 Modular feedforward information processing architecture.

In Figure 1.1 the general architecture for a feedforward information processing

scheme is shown. In this simple scheme two stages are distinguished. The first stage,

the pre-processing stage, serves to reduce the data flow to the second stage. In practice,

pre-processing of a data stream commonly consists of segmentation and feature

extraction. The second stage performs the actual processing goal. For instance in a

pattern recognition environment, the second stage is a classifier. For the purpose of this

discussion it is assumed that the second processing stage is implemented by an adaptive

neural network.

Let the signal s(t) in Figure 1.1 be defined for t0 ≤ t ≤ tf in the temporal

dimension. At any time, only part of the temporal domain of s(t) is available for further

processing. The selection mask is called a window. The width of the window is denoted

by δ. In some cases, before the data sequence is submitted to the actual processing task

Φ, signal features are computed from the windowed data segment. As an example in

word classification, the pitch period provides important information concerning the

excitation source of the speech signal. Note that both (windowed) segmentation and

feature extraction contribute to the data reduction process.

The feature extraction process is usually very problem dependent. For now, let

us concentrate on the implications of the choice of the window width δ. Three

possibilities arise. First, the width δ of the window equals the duration tf − t0 of the

signal. This case effectively transforms the temporal dimension to an extra spatial

dimension. Since the entire past of the signal is always available, no temporal

processing capabilities are required and the problem is reduced to a static processing

problem. The second possibility concerns 0 < δ < tf − t0, which is called the sliding

window technique. In this case, the selection mask moves in some fashion over the

signal domain so as to cover the input signal space as time evolves. The extreme case

of the sliding window technique occurs when δ → 0, that is, only the current signal

values are available for further processing. We will refer to this choice as current-time

processing, and effectively when δ → 0 there is no segmentation. Obviously, in this

case the memory has to be moved to the second processing stage.
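
The three choices of window width discussed above can be illustrated on a short discrete-time sequence. The following is a minimal sketch under stated assumptions: the signal is a plain list of samples, the window width is counted in samples, and the function name `sliding_windows` and its `step` parameter are hypothetical names introduced here for illustration only.

```python
def sliding_windows(signal, width, step=1):
    """Return successive segments of `signal`, each `width` samples long."""
    if width < 1 or step < 1:
        raise ValueError("width and step must be positive integers")
    last_start = len(signal) - width
    return [signal[i:i + width] for i in range(0, last_start + 1, step)]

s = [0, 1, 2, 3, 4, 5]

# Window width equal to the signal duration: one static pattern.
whole_signal = sliding_windows(s, width=len(s))

# Intermediate width: the sliding window technique.
sliding = sliding_windows(s, width=3)

# One-sample window: the discrete-time counterpart of current-time processing.
current_time = sliding_windows(s, width=1)

print(whole_signal)   # [[0, 1, 2, 3, 4, 5]]
print(sliding)        # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(current_time)   # [[0], [1], [2], [3], [4], [5]]
```

With a one-sample window no history reaches the second stage, which is why any memory must then reside in the processing stage itself.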

The choice of the sliding window width influences the system performance. We

identify two problems associated with the selection of δ. First, for large δ the

dimensionality of the processing system increases. At the very least, this complicates

the neural net training. It has been shown that neural net adaptation time unfortunately

scales worse than proportionally with the dimension of the weight vectors (Perugini and

Engeler, 1989). There are more problems associated with large networks, such as the

required increased dimension of the training set. For smaller windows the neural

network may not have enough information to appropriately learn the signal dynamics

because only a fraction of the decision space is available. However, the network

dimension gets progressively smaller which eases the learning requirements in terms

of training set size and number of learning steps.

A second issue that complicates the choice of δ involves segmentation of non-stationary

signals. Normally the length of the stationary interval is not known a priori,

and can very well change with time. A large δ tends to average the time-varying

statistics of non-stationary signals before the signals enter the neural net. A smaller δ

makes the classification very sensitive to the actual signal segment utilized, and tends

to make the classification less robust. The balance is very difficult to achieve and in

general varies with time. The common practice in speech is to bias the selection to

fixed, relatively small segments of approximately 10 milliseconds (Rabiner and

Schafer, 1978).

Apart from the difficulty of choosing the window width, the temporal

resolution of the window is another important pre-processing parameter. We define the

(temporal) resolution R of the window as the number of outputs of the window divided

by the window width (in seconds). As is the case for the width δ, the optimal resolution

is dependent on the processing goal. For instance if the vocabulary size were 1000

instead of 16 in the isolated-word recognition example, the demands on the resolution

of the window would obviously increase.

In speech processing, it is common to determine the window width based on

statistical measures of the input signal. For instance, zero-crossing rate and energy

measures have been used to estimate pseudo-stationary signal segments (Rabiner and

Schafer, 1978). Note that this approach does not use any system performance feedback

to determine the pre-processing parameters. However, as is clear from the foregoing

discussion, optimal values for the pre-processing parameters such as window size and

resolution are a function of the processing goal as expressed by a system performance

criterion. Ideally the representation of the input signal would be adapted by

performance feedback of the total processing system.

This observation forms the basis for the neural net system that is proposed in

this dissertation. The system that I propose stores the signal history in an adaptive

short-term memory structure of a neural net. The capacity of a neural net to store and

compute with information from the recent past is referred to as short term memory. The

architecture of this system is shown in Figure 1.2. The neural short-term memory

mechanism substitutes and obviates a priori signal segmentation. An important

advantage of this approach is that neural network structures can be adapted so as to

optimize a system performance criterion. In the figure, the performance of the system

is measured by the error signal e(t), the difference between a desired output signal d(t)

and the system output x(t). Other measures of system performance are also possible. Of

central importance however is that in this framework the signal representation is

optimized by performance feedback instead of input signal statistics.

It will be shown in this thesis that adaptive pre-processing can be integrated

with information processing in the same neural network framework and


Traditionally, linear signal processing is implemented by linear filter structures.

Digital filters can be categorized into two main architectural groups: the finite impulse

response (FIR) filters and infinite impulse response (IIR) filters. FIR filters are

feedforward and the past of the input signal is stored in a tapped delay line. IIR filters

are of recurrent (feedback) nature. As a result, more complicated memory structures


based on feedback are possible. As will be discussed in chapter 2, the same principles

for short term memory hold in neural networks. In fact, from an engineering viewpoint,

neural networks can be considered as a generalized class of non-linear adaptive filters.

The combination of adaptation and non-linearity makes neural nets very versatile

processing architectures that can be applied to a wide range of complex problems.

Indeed the possibility to incorporate the input signal representation in a unified

adaptive framework with the overall system performance is a premier stimulus for the

research reported herein.

1.3 Research Goals

The central theme of this dissertation is the representation of temporal

information in a neural net. The main goal of this research is the development of a

neural network where the input signal representation is optimized adaptively by the

neural net itself. Thus I aim to have pre-processing parameters such as window

resolution or depth adaptively optimized with respect to a performance measure of

the total system.

Figure 1.2 Adaptive signal representation using a neural network processing framework.

A literature review, scheduled for chapter 2, will reveal that the main techniques

for temporal processing with neural networks are quite mediocre with respect to the

capacity to adapt to a varying signal environment. Since experimentation with a wide

range of network architectures is still going on, it seems that a consistent framework for

dynamic neural networks is still missing. Based on these premises the research plan is

scheduled as follows:

- Design a neural network for temporal processing with adaptive short term memory.

- Develop training algorithms for this neural model.

- Evaluate both in theory and by experimentation the applicability of the new model. In particular we are interested in a comparison with alternative widely used neural processing architectures.

- Determine appropriate application areas for the new model.

1.4 A Summary of the Next Chapters

Chapter 2 starts with a concise review of neural networks. Neural nets are being

studied from a variety of viewpoints. I will take the "electrical engineering view" and

emphasize the relation to linear digital filters. Mechanisms for short term memory in

neural nets are reviewed. The analysis focuses in particular on two widely used

structures, the tapped delay line and the first-order self-recurrent units (context units).

Both mechanisms are shown to have limited applicability. For example, the tapped

delay line has limited fixed memory depth whereas the context units always overwrite

information from the past with more recent information.

In chapter 3 a new framework for storage of past information in neural nets is

introduced. The new memory model, gamma memory, is supported by a mathematical



2.1 Introduction

In this chapter various neural network architectures for processing of time-

varying patterns are reviewed.

In order to set the framework, we are concerned with extracting information

from a temporal sequence, but leave it unspecified whether the processing goal is to

predict a future trend of the time series or to classify the sequence.

First, in the next section linear digital filters are recapitulated. Digital filters are

the basic computational tool for temporal processing. Moreover, the architectural

principles of digital filters underlie most neural network models.

2.2 A Recapitulation of Linear Digital Filters

Linear signal processing is traditionally implemented by linear filters. In a

discrete time environment, linear digital filters are networks of delay elements,

summation elements and constant-factor multipliers. Digital filters are divided

into two main categories: feedforward or finite impulse response (FIR) filters and

recurrent or infinite impulse response (IIR) filters.

In an FIR filter, the input signal history is stored in a tapped delay line. The

signals at the taps are referred to as state variables. The output of the FIR filter is a

linear weighted combination of the tap variables (see Figure 2.1). FIR filters are always

stable but note that the depth of the memory is fixed and equals the number of taps of

the delay line.

Figure 2.1 The feedforward (FIR) filter (a chain of z^-1 delay elements whose tap signals are weighted and summed).
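
The tapped delay line described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation taken from the dissertation: the function name `fir_filter` is hypothetical, and the sketch assumes that the first tap holds the current input sample and that the delay line starts out empty.

```python
def fir_filter(signal, weights):
    """Filter `signal` with a tapped delay line weighted by `weights`."""
    taps = [0.0] * len(weights)          # state variables, newest sample first
    output = []
    for sample in signal:
        taps = [sample] + taps[:-1]      # shift the delay line by one step
        # The output is a linear weighted combination of the tap signals.
        output.append(sum(w * x for w, x in zip(weights, taps)))
    return output

impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = fir_filter(impulse, weights=[0.5, 0.3, 0.2])
print(response)   # [0.5, 0.3, 0.2, 0.0, 0.0, 0.0]
```

The impulse response simply replays the weights and is exactly zero once the impulse has left the delay line, which is the fixed memory depth noted in the text.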

IIR filters are more complex structures since recurrent connections are also

allowed. In control theory similar linear structures are known as auto-regressive

moving-average (ARMA) systems. A so-called observer canonical form

implementation of the IIR filter is shown in Figure 2.2. The existence of recurrent

connections implies the risk of instability of the system, but increases the

computational power of the system. The state variables xi(t) are a function of both the

lower index state variables (memory by delay as in the FIR filter) as well as higher

indexed variables (memory by feedback). The feedback connections in the IIR model

imply that the depth of the memory is no longer coupled to the number of delay

elements.
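
The effect of feedback on memory depth can be seen in the simplest recurrent filter, which uses a single delay element. The sketch below is an assumption made for illustration (a first-order recursion with a hypothetical `feedback` coefficient), not the observer canonical form of Figure 2.2.

```python
def recursive_filter(signal, feedback):
    """First-order recurrent filter: y(t) = feedback * y(t-1) + s(t)."""
    if abs(feedback) >= 1.0:
        # Recurrent connections carry the risk of instability.
        raise ValueError("unstable: |feedback| must be smaller than 1")
    y = 0.0                              # the single state variable
    output = []
    for sample in signal:
        y = feedback * y + sample        # memory by feedback
        output.append(y)
    return output

impulse = [1.0] + [0.0] * 5
response = recursive_filter(impulse, feedback=0.5)
print(response)   # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Although only one delay element is used, the impulse response decays geometrically and never reaches exactly zero, so the depth of the memory is set by the feedback coefficient rather than by the number of delays.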


Linear filters are widely applied but the processing power is limited. Linear

processing is appropriate for tasks such as removal of signal-independent noise and

rearranging the temporal structure of a signal. For many important tasks linear

processing does not suffice though. Examples are removal of signal-dependent noise,

classification (decision making!) and modeling of a chaotic time series.

Neural networks as an engineering tool are probably best interpreted as a

generalized class of nonlinear adaptive filters. As such, they provide the computational

features that potentially better cope with solving complex non-linear problems. Next,

an introduction to neural nets is presented.

Figure 2.2 The recurrent (IIR) filter.

2.3 Introduction to Neural Networks

This section contains a brief introduction to neural networks. In the literature

we also find names such as connectionist models, parallel distributed processing devices or

artificial neural nets, all denoting the same kind of processing architecture. The

discussion will be of a general nature. For a deeper look into some equations and

implementations of neural networks, I refer the reader to a paper by Lippmann (1987) and

a book by Simpson (1990). A more thorough look at neural networks is offered in

books by Hertz et al. (1991) and Hecht-Nielsen (1990).

There is not a single best definition for a neural network. Neural net research is

being approached from various viewpoints. Different models range widely in

biological plausibility. In the context of this thesis, an electrical engineering

dissertation, the biological plausibility is not considered a high priority. We are more

interested in the computational properties of a model. In a computer science book I

have seen a definition as short as the following:

- A neural net is a weighted directed graph of simple processors.

As may be clear from a previous discussion, I like to interpret neural nets as a

generalized class of non-linear adaptive filters. The following features of a neural net

processor are typical:

parallel architecture: a weighted network of simple processors.

adaptation: the connection weights are adaptive.

non-linearity: the processor transfer function is in general non-linear.

The mathematical framework for neural nets is non-linear dynamics. In a

continuous-time setting, neural nets are described by a set of differential equations. In

discrete time, the dynamics are described by difference equations. Characteristically,

the constant coefficients of the equations, called weights, adapt when examples of the

problem at hand are presented to the net. Ideally, the adaptation or learning of the

weights is also determined by a differential equation. As mentioned before, neural

networks are non-linear. For one, the computational power of non-linear dynamical

systems far exceeds that of linear systems. Secondly, the non-linearity of neural nets

originates from the fact that it is believed that most interesting primitive cognitive

functions such as associative memory are non-linear. Generally, let

$x(t) = [x_1(t) \; \dots \; x_N(t)]$ hold the N-dimensional state of a neural net,

$$w = \begin{bmatrix} w_{11} & \dots & w_{1N} \\ \vdots & & \vdots \\ w_{N1} & \dots & w_{NN} \end{bmatrix}$$

an $N^2$-dimensional vector (for a fully connected net) of adaptive

weights and $I(t) = [I_1(t) \; \dots \; I_N(t)]$ the external input to the net. Then the system

is completely described by the following set of equations:

$$\frac{dx_i}{dt}(t) = f_i(x, I, w) \qquad \text{(Eq.1)}$$

$$\frac{dw_{ij}}{dt}(t) = g_{ij}(x, I, w) \qquad \text{(Eq.2)}$$

The dynamics for the state x(t) are described by Eq.1 and Eq.2 describes the adaptation

dynamics. The equilibria of system Eq.1 are computed by

$$0 = f_i(x^*, I, w), \qquad \text{(Eq.3)}$$

where $x^*$ holds the steady state.

The most widely used neural network model is the so-called additive model described

by

$$\frac{dx_i}{dt}(t) = -a x_i(t) + \sigma\Big(\sum_j w_{ij} x_j(t)\Big) + I_i(t). \qquad \text{(Eq.4)}$$

The additive model is used in the great majority of practical applications of

neural networks today. Sejnowski provides a biological motivation for the additive
model (Sejnowski, 1981). A flow diagram of the additive model is shown in Figure 2.3.

The state $x_i(t)$ is affected by a passive decay $-a x_i(t)$, yielding short term

memory, non-linear neural feedback signals $\sigma(\sum_j w_{ij} x_j(t))$, and an external input $I_i(t)$. The

neuron signal function $\sigma()$ normally is a non-linear function. A typical choice is the

sigmoidal function $\sigma(x) = \tanh(x)$. The feedback signals from the net itself are

sometimes denoted for short by the variable net, that is,

$$net_i(t) = \sum_j w_{ij} x_j(t). \qquad \text{(Eq.5)}$$

The system described by Eq.4 is called additive since the weights w are not a

function of the states x. In case w = w (x), the model exhibits mass-action behavior;

such systems are called mass-action, shunting or multiplicative models. In order for

Eq.4 to be computationally interesting, the three dynamic variables I, x and w must

perform over three different time scales. From a neurodynamic viewpoint, we can

interpret x(t) to hold a short term memory (stm) trace and w(t) to process long term

memory traces (ltm). The philosophy behind system Eq.4 as a pattern recognition

device for temporal patterns then basically runs as follows. As time passes by, the ltm

traces w(t) sample and average over time the neuronal activity x(t), thus forming some

kind of template or reference pattern of neural activity. At any time a short term average

of the current external environment I(t) is reflected in the stm traces x(t). The degree of

matching between the stm traces and the ltm traces determines how well the current

environmental input is recognized.

Figure 2.3 The additive neural network model.
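
The short term memory dynamics of the additive model can be simulated with a simple Euler scheme. The sketch below assumes the form dx_i/dt = -a x_i + tanh(sum_j w_ij x_j) + I_i with fixed weights; the two-unit weight matrix, the input and the step size are hypothetical values chosen only for illustration.

```python
import math

def additive_derivative(x, w, ext_input, decay):
    """Right-hand side -decay*x_i + tanh(sum_j w[i][j]*x[j]) + I_i for every unit i."""
    n = len(x)
    return [
        -decay * x[i] + math.tanh(sum(w[i][j] * x[j] for j in range(n))) + ext_input[i]
        for i in range(n)
    ]

def euler_step(x, w, ext_input, decay, h):
    """Advance the state by one Euler step of size h."""
    dx = additive_derivative(x, w, ext_input, decay)
    return [xi + h * dxi for xi, dxi in zip(x, dx)]

w = [[0.0, 0.5], [-0.5, 0.0]]      # hypothetical fixed (long term memory) weights
ext_input = [1.0, 0.0]             # constant external input
x = [0.0, 0.0]                     # short term memory trace
for _ in range(2000):
    x = euler_step(x, w, ext_input, decay=1.0, h=0.01)

# At an equilibrium the right-hand side vanishes for every unit.
residual = max(abs(d) for d in additive_derivative(x, w, ext_input, decay=1.0))
```

For this particular choice of weights the state settles into a steady state, so the residual becomes negligibly small.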

The basic architectural component of neural networks and adaptive signal

processing structures is the adaptive linear combiner. An understanding of the working

of the adaptive linear combiner and the least mean square (LMS) algorithm is essential

for the neural network structures that are surveyed in this thesis. The next section

introduces the adaptive linear combiner.

2.4 The Adaptive Linear Combiner

The adaptive linear combiner or non-recursive adaptive filter is fundamental to

adaptive signal processing and neural network theory and applications. This structure,

usually called adaline for short (from adaptive linear neuron), was introduced

by Widrow and Hoff in 1960 (Widrow and Hoff, 1960). The adaline structure appears

in some form in nearly all feedforward neural network structures. The processing and

adaptation properties of adaline are well understood and documented in Widrow and

Stearns (1985) and Haykin (1990). In this thesis we will only introduce the properties

that are essential in the context of this work.

The describing equations for adaline are given by

$$y(t) = \sum_k w_k x_k(t), \qquad \text{(Eq.6)}$$

where xk(t) are the input signals, y(t) the output signal and wk the adaptive parameters

or weights. Adaline is a discrete-time structure, that is, the independent time variable t

runs through the natural numbers $t_0, t_0+1, \dots$ and so on. Adaline is shown in Figure 2.4.

Figure 2.4 The adaptive linear combiner structure.

Although the input signals xk(t) may originate from any source, very often the

input signals are generated from a tapped delay line as shown in the figure. For this

case, the adaline structure is similar to a regular transversal FIR filter. In the adaptive
signal processing literature, it is common to define the following vectors

$$w = [w_1 \; \dots \; w_K]^T, \qquad x(t) = [x_1(t) \; \dots \; x_K(t)]^T. \qquad \text{(Eq.7)}$$

Thus we can write the describing equation for adaline as

$$y(t) = w^T x(t). \qquad \text{(Eq.8)}$$

Let a desired output signal be given by d(t); d(t) is also referred to as the target
signal or teacher signal. The difference between the desired output and actual output is

defined as the (instantaneous) error signal $e(t) = d(t) - y(t)$. Substitution of Eq.8

and squaring leads to the following expression for the instantaneous squared error

$$e^2(t) = d^2(t) + w^T x(t) x^T(t) w - 2 d(t) x^T(t) w. \qquad \text{(Eq.9)}$$

An important assumption in the theory of adaptive signal processing is that the signals

e(t), d(t) and x(t) are statistically stationary, that is, their statistical moments are
constant over time. In that case, taking the expected value of Eq.9 yields

$$E[e^2(t)] = E[d^2(t)] + w^T R w - 2 P^T w, \qquad \text{(Eq.10)}$$

where we defined the input correlation matrix $R \equiv E[x(t) x^T(t)]$ and the cross-correlation

vector $P \equiv E[d(t) x(t)]$. Note that the expression for the mean squared

error $\xi \equiv E[e^2(t)]$ is quadratic in the parameters w. The minimal mean squared error is obtained

by setting the error gradient to zero. Differentiating Eq.10 yields for the error

gradient

$$\frac{\partial \xi}{\partial w} = 2Rw - 2P. \qquad \text{(Eq.11)}$$

Thus, the optimal weight vector $w_{opt}$ is given by

$$w_{opt} = R^{-1} P. \qquad \text{(Eq.12)}$$

The expression Eq.12 is known by the name Wiener-Hopf equation or normal equation.

The Wiener-Hopf equation provides an expression for the minimal mean-square-error

weight vector, assuming the stationarity conditions hold. This expression is

fundamental in adaptive signal processing and linear neural network theory. Note that

if the stationarity conditions do not hold, the correlation matrices R and P are time-

varying, and consequently the optimal weight vector is time-varying.
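
As a minimal sketch of the Wiener-Hopf solution, the correlations R and P can be estimated by sample averages and the normal equation solved directly. The two-weight "true" system, the noise-free target and the use of Cramer's rule for the 2x2 case are assumptions made to keep the example short.

```python
import random

random.seed(0)
true_w = [0.8, -0.3]                       # hypothetical system to be identified
xs = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(5000)]
ds = [true_w[0] * x[0] + true_w[1] * x[1] for x in xs]   # noise-free desired signal

n = len(xs)
# Sample estimates of R = E[x x^T] and P = E[d x].
R = [[sum(x[i] * x[j] for x in xs) / n for j in range(2)] for i in range(2)]
P = [sum(d * x[i] for d, x in zip(ds, xs)) / n for i in range(2)]

# Solve R w = P for the 2x2 case with Cramer's rule.
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
w_opt = [
    (R[1][1] * P[0] - R[0][1] * P[1]) / det,
    (R[0][0] * P[1] - R[1][0] * P[0]) / det,
]
```

Because the target here is an exact linear function of the input, the estimated optimum coincides with the generating weights up to rounding error.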

The computation of the correlation matrices $R^{-1}$ and P is usually very expensive,

in particular when the network dimension K is large. Instead it is common to adapt the

weights on a sample-by-sample basis so as to search for the optimal values. As is

apparent from Eq.9, the mean-square-error is quadratic in the weights. Thus, the

performance surface $\xi$ is a (hyper-)paraboloid with a minimum at $w_{opt}$. A gradient

descent procedure should therefore in theory lead to the optimal weights. The steepest

descent update algorithm adapts the weights as follows:

$$w(t+1) = w(t) - \eta \frac{\partial \xi}{\partial w}. \qquad \text{(Eq.13)}$$

The step size parameter $\eta$ controls the rate of adaptation. In the neural net literature, $\eta$

is referred to as the learning rate. Note that adaptation comes naturally to a halt when

the weights are optimal, since at the minimum of the performance surface we have

$\partial \xi / \partial w = 0$. The computation of the error gradients determines the complexity of the

learning algorithm. Widely used and very efficient is the Least Mean Square (LMS)

algorithm. We will now proceed to derive the LMS algorithm for the adaline structure,

as it is the precursor for the widely used backpropagation procedure in neural net

adaptation. At a later stage in this thesis, the backpropagation procedure is derived and
applied to several signal processing problems.

The central idea of the LMS algorithm is to approximate the stochastic gradient

$\partial E[e^2(t)] / \partial w$ by the instantaneous (time-varying) gradient $\partial e^2(t) / \partial w$. Note that the

instantaneous error gradient is an unbiased estimator of the stochastic gradient, that

is, $E[\partial e^2(t) / \partial w] = \partial E[e^2(t)] / \partial w$. Substituting $e(t) = d(t) - x^T(t) w$ leads to

$$\frac{\partial e^2(t)}{\partial w} = 2 e(t) \frac{\partial e(t)}{\partial w} = -2 e(t) x(t). \qquad \text{(Eq.14)}$$

Thus, the LMS update equation evaluates to

$$w(t+1) = w(t) + 2 \eta e(t) x(t). \qquad \text{(Eq.15)}$$

Note how simple the final equation for the LMS algorithm is. The signals e(t) and x(t)
are readily available. The combination of simplicity and accuracy has made the LMS
algorithm the most popular algorithm in adaptive signal processing. Widrow discusses

in his book a number of successful practical applications, such as adaptive
equalization, system identification, adaptive control, interference canceling and adaptive
beamforming (Widrow and Stearns, 1985).
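
The LMS update w(t+1) = w(t) + 2 eta e(t) x(t) translates almost literally into code. The sketch below is illustrative only: the system to be identified, the learning rate and the white Gaussian input are hypothetical choices.

```python
import random

def lms(inputs, targets, eta, n_weights):
    """Sample-by-sample LMS adaptation: w <- w + 2 * eta * e(t) * x(t)."""
    w = [0.0] * n_weights
    for x, d in zip(inputs, targets):
        y = sum(wk * xk for wk, xk in zip(w, x))      # adaline output
        e = d - y                                     # instantaneous error
        w = [wk + 2.0 * eta * e * xk for wk, xk in zip(w, x)]
    return w

random.seed(1)
true_w = [0.8, -0.3]                                   # hypothetical target system
inputs = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(2000)]
targets = [true_w[0] * x[0] + true_w[1] * x[1] for x in inputs]
w_lms = lms(inputs, targets, eta=0.01, n_weights=2)
```

Only the current error and input are needed at each step; no correlation matrix is formed or inverted.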

In the next section, the core architectural paradigms for neural nets are

introduced. While adaline enjoys a wide application in neural network architectures,

the inherent linearity limits its computational power substantially. Neural nets in
general are more powerful, since they can be non-linear, recurrent and multi-input-

multi-output systems.

2.5 Neural Network Paradigms - Static Models

Over the years two different paradigms have emerged that exploit the dynamics

of system Eq.4 to serve as a non-linear information processor. Nearly all theory and

practice deals with processing of static patterns. I will briefly introduce the two

concepts for the static case, since an understanding of this is essential in order to

comprehend efforts of extension to processing of space-time patterns. The section on

static nets is followed by dynamic nets.

2.5.1 The Continuous Mapper

The first paradigm offers the continuous mapper or many-to-many map. The

computational result of the continuous mapper is a continuous function from an 6-

dimensional input space to an M-dimensional output space. The standard method to

implement such a map is by way of a multi-layer feedforward network. An important

historical (and current) example of the continuous mapper is the perceptron

architecture (Rosenblatt, 1962). In Figure 2.5, a neural net implementation of the

continuous map is displayed.

The network in Figure 2.5 is feedforward, that is, there are no closed loops in

this structure. If we assume that the neurons are labeled sequentially starting at the

input layer, then the weight matrix w is lower triangular for feedforward nets, since

each neuron receives inputs from nodes with lower index. The states of the neurons are

independent of time. In particular, the additive static neuronal states are described by

the following algebraic relation:

$$x_i = \sigma\Big(\sum_j w_{ij} x_j\Big) + I_i. \qquad \text{(Eq.16)}$$
It has been proven that a three-layer network (two hidden layers) in principle is

capable of computing an arbitrary continuous map from the O-dimensional input space to

the real numbers (Hecht-Nielsen, 1987). Although this may be impressive, the problem

of finding the correct set of weights may be very hard. The problem of finding good

weights is called the loading problem. Theoretically, a learning mechanism such as

simulated annealing can be used to obtain the map that minimizes the error between the

desired map and the actual (network) implementation. However, simulated annealing

(stochastic optimization) is very slow, and in practice the back-propagation training

procedure, although not perfect, has been quite effective for many applications.

Back-propagation involves adaptation of the weights by gradient descent so as to minimize a

performance criterion.

Figure 2.5 Structure of three-layer feedforward mapper (input layer, 1st hidden layer, 2nd hidden layer, output layer).
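
Because the weight matrix of a feedforward net is lower triangular, the static states can be evaluated in a single pass over the sequentially labeled neurons. The following sketch assumes a tanh squashing function and a hypothetical three-neuron chain; it only illustrates the evaluation order, not a trained network.

```python
import math

def static_feedforward(w, ext_input):
    """Evaluate x_i = tanh(sum_{j<i} w[i][j] * x[j]) + I_i in index order."""
    x = []
    for i in range(len(w)):
        # Only lower-indexed neurons feed neuron i, so their states are already known.
        net_i = sum(w[i][j] * x[j] for j in range(i))
        x.append(math.tanh(net_i) + ext_input[i])
    return x

w = [
    [0.0, 0.0, 0.0],     # input neuron: no incoming connections
    [1.0, 0.0, 0.0],     # hidden neuron driven by the input neuron
    [0.5, -1.0, 0.0],    # output neuron driven by both
]
states = static_feedforward(w, ext_input=[0.2, 0.0, 0.0])
```

No iteration or relaxation is required, in contrast to the recurrent nets discussed next.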

2.5.2 The Associative Memory

The second prototype entails the associative memory or many-to-one map, for

which the Hopfield net is the prime illustration (Hopfield, 1982 and 1984, see Figure

2.6). In terms of topology, the Hopfield net is a recurrent net with symmetric weights

($w_{ij} = w_{ji}$, $w_{ii} = 0$), which enables the association of a Lyapunov (energy) function with

the system dynamics. Using Lyapunov's stability theory, it can be shown that this

system always converges to a point attractor. There is no external input to the

associative memory. The input is the initial state of the neurons $x(t_0)$. Used as a

processing device, information is stored by locating point attractors at positions in the

state space that correspond to memories. Recognition then consists of settling into the

minimum closest to the initial state vector $x(t_0)$.
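
A minimal sketch of recall in an associative memory is given below. It assumes the discrete, binary (+1/-1) version of the Hopfield net with asynchronous updates and a Hebbian outer-product storage rule; the storage rule and the two stored patterns are illustrative assumptions and are not taken from the discussion above.

```python
def hebbian_weights(patterns):
    """Symmetric weights with zero diagonal: w_ij = (1/n) * sum_p p_i * p_j for i != j."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, initial_state, sweeps=5):
    """Relax from the initial state; there is no external input."""
    x = list(initial_state)
    for _ in range(sweeps):
        for i in range(len(x)):                        # asynchronous unit updates
            net_i = sum(w[i][j] * x[j] for j in range(len(x)))
            x[i] = 1 if net_i >= 0 else -1
    return x

stored = [
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, 1, 1, -1, -1, -1, -1],
]
w = hebbian_weights(stored)
noisy = [-1, -1, 1, -1, 1, -1, 1, -1]    # first memory with its first element flipped
recovered = recall(w, noisy)
```

The corrupted initial state falls into the basin of the first stored pattern, which is a fixed point of the update.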


Figure 2.6 An associative memory neural net: the Hopfield net structure.

Both the continuous mapper and the associative memory work only in a static

pattern environment. Although the Hopfield net processes information by a dynamic

relaxation process, the input pattern (initial state vector) is assumed to be static. Next,

temporal extensions of both the continuous mapper and the associative memory are

discussed. It will become clear that the ideas for computing with time in dynamic

neural nets correspond strongly to linear signal processing theory.

2.6 Neural Network Paradigms - Dynamic Nets

The basic neural network model for processing of static patterns is the static

additive model. The activations of the units are computed by

$$x_i = \sigma\Big(\sum_j w_{ij} x_j\Big) + I_i, \qquad \text{(Eq.17)}$$
where $x_i$ is the activation of neuron (unit, node) i. The weight factor $w_{ij}$ connects node

j to node i. $\sigma()$ is a (non)-linear squashing function and $I_i$ represents the external input.

We assume a system dimensionality of N. Sometimes the shorthand notation

$net_i = \sum_j w_{ij} x_j$ will be used.
Static neural nets have no memory. As a result, temporal relations can not be

stored or computed on by static neural nets. In order to process a temporal flow of

information, a neural net needs a short term memory mechanism. Neural network

models with short term memory are called dynamic neural nets. The simplest way to

add dynamics (memory) to the static model is to add a capacitive term $\tau \, dx_i/dt$ to the left-hand

side of Eq.17. After rearrangement of terms, the so-called dynamic additive

model is obtained:

$$\tau \frac{dx_i}{dt} = -x_i + \sigma\Big(\sum_j w_{ij} x_j\Big) + I_i. \qquad \text{(Eq.18)}$$

This model is mathematically equivalent to the system described by Eq.4, where the

time constant is expressed by the decay parameter $a = 1/\tau$. Let us look at the biological

picture of neural nets. In nature, the neural time constants are fixed and equal

approximately 5 msecs (Shen, 1989). This number is estimated by assuming an average

action potential rate of 200 per second. Higher rates are quite rare due to the refractory

period of the neurons. However, recognition of a spoken word requires the ability to

remember the contents of a passage for approximately 1 second. To accomplish this,

neural temporal resolution decreases while the "temporal window of susceptibility"

increases toward the cortex. Apparently, the brain is able to modulate temporal

resolution and depth of short-term memory making use of processing units with fixed

small time constants. The dominant biological principles for increasing the time

constants are feedback and delays. These are exactly the same strategies that are used

in digital filters to implement a temporal data buffer. Naturally, neural net researchers

have concentrated on the same concepts of feedback and delays when designing neural

nets for temporal processing applications. Next, the characteristics of both approaches

are analyzed.

2.6.1 Short Term Memory by Local Positive Feedback

The additive model can be extended with a positive state feedback term,


$$\tau \frac{dx_i}{dt} = -x_i + \sigma(net_i) + I_i + k x_i, \qquad \text{(Eq.19)}$$

where k is a positive constant. In the biological literature, such local positive feedback

is often named reverberation, while neural net researchers speak of self-excitation.

Eq.19 can be rewritten as

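$$\frac{\tau}{1-k} \frac{dx_i}{dt} = -x_i + \tilde{\sigma}(net_i) + \tilde{I}_i, \qquad \text{(Eq.20)}$$

where $\tilde{\sigma}(net_i) \equiv \sigma(net_i)/(1-k)$ and $\tilde{I}_i \equiv I_i/(1-k)$. For $\tau = 5$ msec and k = 0.995, we get the

new time constant $\tau' = \tau/(1-k) = 1$ sec. Units that self-excite over a time span that is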

relevant with respect to the processing problem are referred to in the neural net

literature as context units. Several investigators have explored the temporal

computational properties of additive feedforward nets, extended by context units

(Jordan, 1986; Elman, 1990; Mozer, 1989; Stornetta et al., 1988). In Hertz et al. (1991),

neural models of this kind are collectively referred to as sequential nets. In sequential

neural nets, all units are additive and static, apart from the context units. The context

units are of type Eq.20 or a similar model. Sequential neural nets are a kind of

extension of the continuous mapper to the spatiotemporal domain. In Figure 2.7,

various architectural examples are displayed.

In 1986, Jordan developed the architecture displayed in Figure 2.7a for learning of spatiotemporal maps. The states

of the context units evolve according to

$$x(t+1) = \mu x(t) + x_{out}(t), \qquad \text{(Eq.21)}$$

where $x_{out}(t)$ is the state of an output unit. Note that Jordan's architecture makes use of

global recurrent loops (context to hidden to output to context units). As a result, care

must be taken to keep the total system stable. Jordan (1986) shows that this

network can successfully mimic co-articulation data. Anderson et al. (1989) have used

this architecture to categorize a class of English syllables.

Elman (1990) utilizes non-linear self-recurrent hidden units of the type

$x(t+1) = g(\sigma(x(t)))$ to store the past (Figure 2.7b). This network was able to recognize

sequences, and even to produce continuations of sequences. Cleeremans et al. (1989)

showed that this architecture is able to learn and mimic a finite state machine, where

the hidden units represent the internal states of the automaton.

Stornetta et al. (1988) have used recurrent units at the input layer only to

represent a temporally weighted trace of the input signal (Figure 2.7c). There are no

weighted connections from the hidden or output units toward the context units. This

restriction results in several advantages when the network is trained by a gradient

descent technique. We will discuss this issue in more detail in chapter 4 on training a

neural net. The authors performed successful experiments in recognition of short

sequences. Mozer (1988) and Gori et al. (1989) have also made use of similar

architectural restrictions.

Figure 2.7 Various sequential network architectures. (a) Michael Jordan's
architecture feeds the output back to an additional set of recurrent input
units. (b) Elman's structure uses recurrent non-linear hidden units. (c)
Stornetta et al. keep a history trace at the input units. This structure offers
particular advantages when back-propagation learning is used.

While the positive feedback mechanism is simple and used in biological

information processing, there are two computational problems associated with this

method. First, the new time constant is very sensitive to k. For our example, an increase

of 0.5% in k from k = 0.995 to k = 1 makes the model unstable. The time-varying nature

of biological parameters makes it therefore unlikely that reverberation is the

predominant mechanism for short term memory over long periods. The second

handicap of Eq.20 is that the new model is still governed by first-order dynamics. As a

result, weighting in the temporal domain is limited to a recency gradient (exponential

for linear feedback), that is, the most recent items carry a larger weight than previous

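inputs. Note that the analytical solution to Eq.20 can be written as

$$x(t) = \frac{1}{\tau'} \int_{-\infty}^{t} e^{-(t-s)/\tau'} \left[ \tilde{\sigma}(net(s)) + \tilde{I}(s) \right] ds, \qquad \text{(Eq.22)}$$

where $\tau' = \tau/(1-k)$ is the time constant of Eq.20 and the tilde marks quantities rescaled by $1/(1-k)$.

Thus, the past input is weighted by a factor $e^{-(t-s)/\tau'}$ which exponentially decays over time.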
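
The sensitivity of the new time constant to the feedback gain follows directly from the relation tau / (1 - k). The helper below is a hypothetical illustration using the 5 msec example from the text.

```python
def effective_time_constant(tau, k):
    """Time constant tau / (1 - k) of a unit with local positive feedback gain k < 1."""
    if k >= 1.0:
        # The net decay -(1 - k) * x vanishes or changes sign: the memory no longer fades.
        raise ValueError("k >= 1 removes the passive decay")
    return tau / (1.0 - k)

tau = 0.005   # 5 msec
constants = {k: effective_time_constant(tau, k) for k in (0.99, 0.995, 0.999)}
```

Changing k by less than one percent, from 0.99 to 0.999, moves the time constant from 0.5 sec to 5 sec, which illustrates how delicately the memory depth depends on the feedback gain.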

For a neural net composed of N neurons, the number of weights in the spatial

domain is $O(N^2)$, while the temporal domain is governed only by $\tau$. The use of a fixed

passive memory function then implies a limit to how structured the representation of

the past in the net can be. As an example, optimal temporal weighting for the

discrimination of the words "cat" and "mat" will not be a recency but rather a primacy

gradient. Another example, in a time-series analysis framework, the input signals

sometimes change very fast, sometimes slowly. For fast changing input, we would like the time

constant to be small, so that the net state can follow the input. For slow moving input, the

time constant may be larger in order to have a deeper memory available. This argument

pleads for the short term memory time constant to be a variable that should be learned

for each neuron and may even be modulated by the input. For physiologic mechanisms

of short-term modulation of $\tau$, one may think of adaptation (decreased sensitivity of

receptor neuron to a maintained stimulus) or heterosynaptic facilitation (the ability of

one synapse of a cell to temporarily increase the efficacy of another synapse; see Wong

and Chun (1986) for an application to neural nets).

In conclusion, short term memory by local positive feedback is simple and has

been applied successfully in artificial neural nets. However, reverberation may lead to

instability. Secondly, this mechanism restricts computational flexibility in the temporal

domain. In the next section, short term memory by delays is reviewed.

2.6.2 Short Term Memory by Delays

A general delay mechanism can be represented by temporal convolutions

instead of (instantaneous) multiplicative interactions. Consider the following extension

of the static additive model,

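$$\tau \frac{dx_i}{dt} = -x_i + \sigma\Big(\sum_j \int_0^t w_{ij}(t-s) x_j(s) \, ds\Big) + I_i. \qquad \text{(Eq.23)}$$

We will call this model the (additive) convolution model. In the convolution

model the net input is given by

$$net_i(t) = \sum_j \int_0^t w_{ij}(t-s) x_j(s) \, ds. \qquad \text{(Eq.24)}$$

In a discrete time environment, this translates to

$$net_i(t) = \sum_j \sum_{n=0}^{t} w_{ij}(t-n) x_j(n). \qquad \text{(Eq.25)}$$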

There is ample biological support for the substitution of weight constants w by time

varying weights w(t). Miller has reviewed experimental evidence that "... cortico-

cortical axonal connections impose a range of conduction delays sufficient to permit

temporal convergence at the single neuron level between signals originating up to 100-

200 msec apart" (Miller, 1987). Several artificial neural net researchers have also

experimented with additive delay models of type Eq.23. However, due to the

complexity of general convolution models, only strong simplifications of the weight

kernels have been proposed.
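
The discrete-time net input, a sum over units j and over past time steps n of w_ij(t - n) x_j(n), can be sketched as follows. The data layout (one finite kernel per connection, zero beyond its stored length) and the example kernels are assumptions for illustration.

```python
def convolution_net_input(kernels, history, i):
    """net_i(t) = sum_j sum_{n=0}^{t} w_ij(t - n) * x_j(n), with t = len(history) - 1.

    kernels[i][j] is the list [w_ij(0), w_ij(1), ...]; history[n][j] is x_j(n).
    """
    t = len(history) - 1
    total = 0.0
    for j, kernel in enumerate(kernels[i]):
        for n in range(t + 1):
            lag = t - n
            if lag < len(kernel):          # the kernel is zero beyond its stored length
                total += kernel[lag] * history[n][j]
    return total

# One receiving unit: input 0 arrives with a delay of one step, input 1 without delay.
kernels = [[[0.0, 1.0], [1.0]]]
forward = convolution_net_input(kernels, [[1.0, 0.0], [0.0, 1.0]], i=0)   # 0 fires, then 1
backward = convolution_net_input(kernels, [[0.0, 1.0], [1.0, 0.0]], i=0)  # 1 fires, then 0
```

With these kernels the two activations coincide at the receiving unit only when input 0 precedes input 1, so the net input depends on the temporal order of the inputs.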

Lang et al. (1990) used the discrete delay kernels $w(t) = \sum_k w_k \delta(t - t_k)$ in the

time delay neural network (TDNN). The TDNN architecture is shown in Figure 2.8. The

TDNN, considered the state-of-the-art, is a multilayer feedforward net that is trained

by error backpropagation. The past is represented by tapped delay lines as in FIR

filters. The authors reported excellent results on a phoneme recognition task

("B", "D" and "G"): a recognition rate of 98.5% was achieved, compared

to 93.7% for a hidden Markov model. Recently, the CMU-group

introduced the TEMPO 2 model, where adaptive gaussian distributed delay kernels

store the past (Bodenhausen and Waibel, 1991). Distributed delay kernels such as used

in the TEMPO 2 model improve on the TDNN with respect to the capture of temporal


Tank and Hopfield (1987) also prewired w(t) as a linear combination of

dispersive kernels, in particular $w(t) = \sum_k w_k f_k(t) = \sum_k w_k (t/k)^{\alpha} e^{\alpha(1 - t/k)}$. This

technique was utilized as a preprocessor to a Hopfield net for classification of temporal

patterns. The ideas are illustrated in Figure 2.9. Let an input signal successively

activate $I_1$ through $I_4$. The delay between consecutive activations is one time step. If the

delays associated with the weights are as shown in the figure, then the input unit

activations arrive at the output unit at the same time. Thus, the output neuron is very

sensitive to an impulse moving over the input layer in the direction of the time arrow,

while an impulse moving in opposite direction does not activate the output node.
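
The delaying effect of a dispersive kernel can be made concrete with a small sketch. It assumes the kernel form f_k(t) = (t/k)^alpha exp(alpha (1 - t/k)), which peaks with value one at t = k; the value of alpha and the sampling grid are hypothetical.

```python
import math

def dispersive_kernel(t, k, alpha):
    """Assumed kernel f_k(t) = (t/k)**alpha * exp(alpha * (1 - t/k)), maximal at t = k."""
    return (t / k) ** alpha * math.exp(alpha * (1.0 - t / k))

times = [0.5 * n for n in range(1, 21)]    # 0.5, 1.0, ..., 10.0
peak_times = {
    k: max(times, key=lambda t: dispersive_kernel(t, k, alpha=5.0))
    for k in (1, 2, 3, 4)
}
```

Each kernel passes an input most strongly after a delay of about k steps, so a set of kernels with suitably chosen k can make successive input activations arrive at the output unit simultaneously. A larger alpha gives a narrower peak, that is, a more sharply defined delay.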


Neural nets of this type, where information of several neurons at different times

integrates at one neuron, were called Concentration-of-Information-in-Time (CIT)

neural nets. The weight factors $w_k$ were non-adaptive and determined a priori. They

successfully built such a system in hardware for an isolated word recognition task. In

particular, the robustness against time warped input signals should be mentioned. In a

later publication successful experiments were reported with adaptive gaussian

distributed delay kernels (Unnikrishnan et al., 1991).

When compared to the first-order context-unit networks, the convolution model

in its general formulation is more flexible in the temporal domain, since the weighting

of the past is not restricted to a recency gradient. However, a high price has to be paid

for the increased flexibility. I identify three complications for the convolution model

when compared to the additive model.

Figure 2.8 An example of a Time-Delay Neural Network.

Analysis. The convolution model is described by a set of functional

differential equations (FDE) instead of ordinary differential equations (ODE) for the

additive model. Such equations are in general harder to analyze, a handicap when we

need to check (or design for) certain system characteristics such as stability and


Numerical Simulation. For an N-dimensional convolution model, the required

number of operations to compute the next state for the FDE set scales with $O(N^2 T)$,

where T is the number of time steps necessary to evaluate the convolution integral

(using the Euler method: $x(t+h) = x(t) + h \, dx/dt$). An N-dimensional additive model

scales by $O(N^2)$.

Learning. The weights in the convolution model are the time-varying

parameters w(t). Thus, the dimensionality of the free parameters grows linearly with

time. For a long temporal segment, the large weight vector dimensionality impairs the

ability to train the network.

The two models for incorporating short term memory in neural networks,

positive feedback and delays, have led to a number of architectures that essentially

generalize the continuous mapper to the space-time domain. In the discussion on static

neural nets we introduced the associative memory model. Is there also a temporal

extension for this model? Indeed, in particular in the physics community, several

researchers have experimented with temporal associative memories. The principal

ideas of the sequential associative memory are now briefly reviewed.

Figure 2.9 Principle of the Concentration-in-Time neural net. The output node is tuned to
classify the sequence $I_1$-$I_2$-$I_3$-$I_4$.

2.6.3 The Sequential Associative Memory

The sequential associative memory is a (recurrent) dynamical system that stores

memories in attractors (sinks) of zeroth order (point attractors) (Kleinfeld, 1986;

Sompolinsky and Kanter, 1986). Physicists have explored several ways to ignite

attractor transitions under influence of an external stimulus. The most widely used

method consists of forcing a combination of the external signal and the delayed

network state upon the net. The delayed state wants to keep the net in its current state,

while the external input tries to alter the state. As a result, the net state ideally hops

from one stable attractor to another. In a pattern recognition environment, the sequence

of visited states identifies the external input.

Sequential associative memories are theoretically very interesting and moreover

provide a neural explanation of categorical perception, due to the corrective properties

of the basins of attraction. However it has been noticed that these nets may not be very

selective pattern recognizers, in other words, nearly every input (if high enough) will

induce transitions (Amit, 1988). Secondly, the memory capacities of such nets are very

limited. It is the nature of point attractors that falling into a basin of one induces the

forgetting of previous states. Consequently, the 'deepness' of memory is fixed, short

and cannot be modulated. In my opinion, the sequential associative memory may be a

useful module for tasks like central pattern generation (Kleinfeld, 1986), but is not

(yet) flexible enough to encode the varying temporal relations of complex signals such

as speech.

2.7 Other Dynamic Neural Nets

We have discussed dynamic extensions of both the continuous mapper and the

associative memory. However, the architectures that were discussed in this chapter are

not the only viable constructions. A very important class of neural nets are networks

with globally recurrent connections. Note that the memory structures that have been

discussed so far, the tapped delay line and the context units, are mechanisms to store the

activations of local units. Generation of memory by feedback on a global scale has not

been discussed. The difference between local feedback at the unit level and global

feedback at the network level is displayed in Figure 2.10. Globally recurrent nets or

fully recurrent nets are an important area of current dynamic neural net research

(Williams and Zipser, 1989; Gherrity, 1989; Pearlmutter, 1989). The important issues

that confine applications of fully recurrent nets are control of stability and adaptation

problems. In a fully recurrent net, the performance surface is not necessarily convex

which is a severe handicap for gradient-based adaptation methods. Secondly, learning

in recurrent networks has been found to progress much more slowly than in feedforward nets.

Figure 2.10 An example of a globally recurrent network.





2.8 Discussion

In this chapter various neural architectures for temporal processing have been

discussed. The main principles for storage of and computing with a temporal data flow,

delays and feedback, were analyzed in some detail.

The feedforward tapped delay line has been used very successfully in the time-

delay neural net. Yet, the delay period per tap and window width are fixed and must be

chosen a priori. It was already discussed in chapter 1 that such a fixed signal

representation scheme likely leads to sub-optimal system performance.

Most context-unit neural networks do adapt the decay parameter of the context

units. As a result, the depth of memory can be controlled so as to match the goal of

processing. On the other hand, the weighting of the past is always restricted to a

recency gradient. Moreover, context units overwrite the past with new information.

In order to get more processing power, some investigators use globally recurrent

networks. These systems suffer from the same problem as recurrent filters: how do we

control stability during adaptation? Additionally, in particular for moderate to large

networks the currently available training algorithms do not suffice.

Although the importance of the future of global feedback neural nets should not

be underestimated, in this work we will concentrate on developing an improved

architecture for local (unit) short term memory. There are several reasons why this is an

important research direction. First, it will be shown later in this thesis that feedforward

networks of units with local (feedback) memory have some practical advantages over

globally recurrent nets. Stability, for one, is much more easily controlled in such networks.

Secondly, the local short term mechanism that will be developed here does not exclude

global recurrence in the network. In fact, the integration of local feedback and global

feedback may lead to very interesting dynamic network architectures.




3.1 Introduction: Convolution Memory versus ARMA Model

It was discussed in chapter 2 that a general delay mechanism can be written as

\[ \mathrm{net}(t) = \int_0^t w(t-s)\,x(s)\,ds \tag{Eq.26} \]

for the continuous time domain and

\[ \mathrm{net}(t) = \sum_{n=0}^{t} w(t-n)\,x(n) \tag{Eq.27} \]

in the discrete time domain. It was mentioned that a problem associated with time-

dependent weight functions is that the number of parameters grows linearly with time.

This presents a severe modeling problem since the number of degrees of freedom

of a system does not always increase linearly with the memory depth of that system.

Although a convolution model is interesting as a biological model for memory by delay

and powerful as a computational model, the previous arguments make this model not

very attractive as a model for engineering applications. It makes sense to investigate

under what conditions an arbitrary time varying kernel w(t) can be adequately

approximated by a fixed dimensional set of constant weights. This problem was studied

by Fargue (1973) and the answer is provided by the following theorem.

Theorem 3.1 The (scalar) integral equation

\[ \mathrm{net}(t) = \int_0^t w(t-s)\,x(s)\,ds \tag{Eq.28} \]

can be reduced to a K-dimensional system of ordinary differential

equations with constant coefficients if (and only if) w(t) is a solution of

\[ \frac{d^K w}{dt^K}(t) = \sum_{k=0}^{K-1} a_k \frac{d^k w}{dt^k}(t), \tag{Eq.29} \]

where $a_0, a_1, \ldots, a_{K-1}$ are constants.

Proof. A constructive proof for sufficiency is provided. The initial conditions

for Eq.29 are written as $\mu_k \equiv \frac{d^k w}{dt^k}(0)$, where k = 0,...,K-1, and we define the

variables $w_k(t) \equiv \frac{d^k w}{dt^k}(t)$, k = 0,...,K-1, which allows us to rewrite Eq.29 as the

following set of K first-order differential equations:

\[ \frac{dw_k}{dt}(t) = w_{k+1}(t), \quad k = 0,\ldots,K-2, \qquad \frac{dw_{K-1}}{dt}(t) = \sum_{k=0}^{K-1} a_k\, w_k(t). \tag{Eq.30} \]

Next, we introduce the state variables

\[ x_k(t) \equiv \int_0^t w_k(t-s)\,x(s)\,ds, \quad k = 0,\ldots,K-1. \tag{Eq.31} \]

Note that the system output is given by

\[ \mathrm{net}(t) = \int_0^t w_0(t-s)\,x(s)\,ds = x_0(t). \tag{Eq.32} \]

The state variables xk(t) can be recursively computed. Differentiating Eq.31 with

respect to t using Leibniz' rule gives

\[ \frac{dx_k}{dt}(t) = \int_0^t \dot{w}_k(t-s)\,x(s)\,ds + w_k(0)\,x(t), \tag{Eq.33} \]

which using the recurrence relations from Eq.30 evaluates to

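\[ \frac{dx_k}{dt}(t) = x_{k+1}(t) + \mu_k\,x(t), \quad \text{for } k = 0,\ldots,K-2, \text{ and} \]

\[ \frac{dx_{K-1}}{dt}(t) = \sum_{k=0}^{K-1} a_k\,x_k(t) + \mu_{K-1}\,x(t), \tag{Eq.34} \]

with $\mu_k = w_k(0)$.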

Thus, if the weight kernel w(t) is a solution of the recurrence relation Eq.29, then the

integral equation Eq.28 can be reduced to a system of differential equations with

constant coefficients (Eq.34). (end proof).

The following theorem reveals what is meant by imposing the condition Eq.29

on w(t).

Theorem 3.2 Solutions of the system

\[ \frac{d^K w}{dt^K}(t) = \sum_{k=0}^{K-1} a_k \frac{d^k w}{dt^k}(t) \tag{Eq.35} \]

can be written as a linear combination of the functions $t^{k_i} e^{\mu_i t}$, where

$1 \le i \le m$, $0 \le k_i < K_i$ and $\sum_{i=1}^{m} K_i = K$. (end theorem).

The proof of Theorem 3.2 is provided in most textbooks on ordinary differential

equations (e.g. Braun, 1983, page 258). The μi's are the eigenvalues of the system. In

particular, the μi's are the solutions of the characteristic equation of Eq.29,

\[ \mu^K - \sum_{k=0}^{K-1} a_k\,\mu^k = 0. \tag{Eq.36} \]

m is the number of different eigenvalues and Ki the multiplicity of eigenvalue μi. The

functions $t^{k_i} e^{\mu_i t}$ are the eigenfunctions of system Eq.29, where i enumerates the

various eigenmodes of the system.
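
As a small worked case of Theorem 3.2, consider K = 2 with the coefficients chosen so that the characteristic equation has one repeated root. The constant c > 0 and the integration constants A and B are illustrative symbols introduced only for this sketch; they do not appear in the text.

```latex
\begin{align*}
\text{Choose } a_0 &= -c^2,\quad a_1 = -2c, \quad\text{so that Eq.29 reads}\quad
  \frac{d^2 w}{dt^2}(t) = -c^2\,w(t) - 2c\,\frac{dw}{dt}(t). \\
\text{Characteristic equation: }\;
  \mu^2 + 2c\,\mu + c^2 &= (\mu + c)^2 = 0
  \;\Rightarrow\; m = 1,\quad \mu_1 = -c,\quad K_1 = 2. \\
\text{General solution: }\;
  w(t) &= (A + B\,t)\,e^{-c t}.
\end{align*}
```

With a single eigenvalue of multiplicity two, the admissible indices are k_1 = 0 and k_1 = 1, which gives exactly the two eigenfunctions e^{-ct} and t e^{-ct}. For c > 0 both decay, so a kernel of this type has a fading memory of the input.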

In the signal processing and control community, the system described by Eq.32

and Eq.34 is called an auto-regressive moving average (ARMA) model (see Figure

3.1). It was discussed in section 2.2 that the memory of an ARMA system is

represented in the state variables xk(t). It is interesting to observe the relation between

the ARMA model parameters and the convolution model. The auto-regressive

parameters ak are the coefficients of the recurrence relation Eq.29 for w(t). The moving

average parameters μk equal the initial conditions of Eq.29.

Figure 3.1 An ARMA model implementation of a convolution model
with recursive w(t).

In the context of this exposition, I like to think of an ARMA model as a

dynamic model for a memory system. It was just proved that this configuration is


equivalent to a convolution memory model if the condition described by Eq.29 is

obeyed. Yet, I do not know of neural network models that utilize the full ARMA model

to store the past of x(t). The reason is that the global recurrent loops in the ARMA

model make it difficult to control stability in this configuration. This is particularly true

when the auto-regressive parameters ak are adaptive. There are some substructures of

the ARMA model for which stability is easily controlled. Examples are the feedforward

tapped delay line and the first order autoregressive model (or context unit). These two

structures, which are shaded differently for clarity in Figure 3.1, have been used

extensively in neural networks as a memory mechanism. The virtues and shortcomings

of either approach have already been discussed in chapter 2. In this chapter, a different

approximation to the ARMA memory model is introduced. A look ahead to Figure 3.4

shows that the memory model that will be introduced utilizes a cascade of locally

recursive structures in contrast to the global loops in the full ARMA model. As a result,

this model will provide a more flexible approximation to the ARMA or convolution

model. Yet, the stability conditions will prove to be trivial. In the next section, a

mathematical framework for this new memory model is presented.

3.2 The Gamma Memory Model

Let us consider the following case, a specific subset of the class of functions that

admit Eq.29:

\[ w(t) = \sum_{k=1}^{K} w_k\, g_k^{\mu}(t), \tag{Eq.37} \]

where

\[ g_k^{\mu}(t) = \frac{\mu^k}{(k-1)!}\, t^{k-1} e^{-\mu t}, \quad k = 1,\ldots,K, \; (\mu > 0). \tag{Eq.38} \]

It is easily checked that the kernels gk(t) (the superscript μ is dropped) are a solution to

Eq.29. Since the functions gk(t) are the integrands of the (normalized) Γ-function

($\Gamma(x) \equiv \int_0^\infty t^{x-1} e^{-t}\,dt$), they will be referred to as gamma kernels. In view of the

solutions of Eq.29, the gamma kernels are characterized by the fact that all eigenvalues

are the same, that is, μi = -μ for all i. Thus, the gamma kernels are the eigenfunctions of the

following recurrence relation:

\[ \left( \frac{d}{dt} + \mu \right)^{K} g(t) = 0. \tag{Eq.39} \]

The factor $\frac{\mu^k}{(k-1)!}$ normalizes the area, that is

\[ \int_0^\infty g_k(s)\,ds = 1, \quad k = 1, 2, \ldots \tag{Eq.40} \]

The shape of the gamma kernels gk(t) is pictured in Figure 3.2 for μ = 0.7.
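
The area normalization of the gamma kernels can be checked numerically. The following is a minimal sketch, assuming the kernel form g_k(t) = μ^k/(k-1)! t^(k-1) e^(-μt); the function names, the integration range and the step count are illustrative choices, not part of the original text.

```python
import math

def gamma_kernel(k, mu, t):
    """Continuous gamma kernel: mu^k / (k-1)! * t^(k-1) * exp(-mu*t)."""
    return mu**k / math.factorial(k - 1) * t**(k - 1) * math.exp(-mu * t)

def kernel_area(k, mu, t_max=200.0, steps=20_000):
    """Trapezoidal estimate of the integral of g_k over [0, t_max]."""
    dt = t_max / steps
    total = 0.5 * (gamma_kernel(k, mu, 0.0) + gamma_kernel(k, mu, t_max))
    for i in range(1, steps):
        total += gamma_kernel(k, mu, i * dt)
    return total * dt

MU = 0.7  # the value quoted for Figure 3.2
areas = [kernel_area(k, MU) for k in range(1, 6)]
for k, a in enumerate(areas, start=1):
    print(f"k={k}: estimated area = {a:.6f}")
```

Each estimate should come out close to one, up to the discretization error of the trapezoidal rule. Increasing k moves the bulk of the kernel to later times without changing its area.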

It is straightforward to derive an equivalent ARMA model for net(t) when w(t) is

constrained by Eq.37. The procedure is similar to the proof of Theorem 3. 1. First, the

kernels gk(t) are written as the following set of first-order differential equations

\[ \frac{dg_1}{dt} = -\mu\, g_1, \qquad \frac{dg_k}{dt} = -\mu\, g_k + \mu\, g_{k-1}, \quad k = 2,\ldots,K. \tag{Eq.41} \]

Substitution of Eq.37 into Eq.26 yields

\[ \mathrm{net}(t) = \sum_{k=1}^{K} w_k\, x_k(t), \tag{Eq.42} \]

where the gamma state variables are defined as

\[ x_k(t) \equiv \int_0^t g_k(t-s)\,x(s)\,ds, \quad k = 1,\ldots,K. \tag{Eq.43} \]



Figure 3.2 The gamma kernels gk(t) for μ=0.7 (curves for k = 1,...,5).

The gamma state variables hold memory traces of the neural states x(t). How are the

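variables xk(t) computed? Differentiating Eq.43 leads to

\[ \frac{dx_k}{dt}(t) = \int_0^t \dot{g}_k(t-s)\,x(s)\,ds + g_k(0)\,x(t), \tag{Eq.44} \]

which, since gk(0) = 0 for k ≥ 2 and g1(0) = μ, evaluates to

\[ \frac{dx_k}{dt}(t) = -\mu\, x_k(t) + \mu\, x_{k-1}(t), \quad k = 1,\ldots,K, \tag{Eq.45} \]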

where we defined x0(t) ≡ x(t). The initial conditions for Eq.45 can be obtained from

evaluating xk(0) = gk(0) x(0), which reduces to

\[ x_0(0) = x(0), \qquad x_1(0) = \mu\, x(0), \qquad x_k(0) = 0, \quad k = 2,\ldots,K. \tag{Eq.46} \]

Thus, when w(t) admits Eq.37, net(t) can be computed by a K-dimensional

system of ordinary differential equations (Eq.45). The following theorem states that the

approximation of arbitrary w(t) by a linear combination of gamma kernels can be made

as close as desired.

Theorem 3.3 The system gk(t), k=1,2,...,K, is closed in L2[0, ∞].

Theorem 3.3 is equivalent to the statement that for all w(t) in L2[0, ∞] (that is,

any w(t) for which $\int_0^\infty |w(t)|^2\,dt$ exists), for every ε > 0, there exists a set of parameters

wk, k = 1,...,K, such that

\[ \int_0^\infty \Big| w(t) - \sum_{k=1}^{K} w_k\, g_k(t) \Big|^2 dt < \epsilon. \tag{Eq.47} \]

The proof for this theorem is based upon the completeness of the Laguerre polynomials

and can be found in Szego (1939, page 108, theorem 5.7.1). The foregoing discussion

can be summarized by the following important result.

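Theorem 3.4 The convolution model described by

\[ \mathrm{net}(t) = \int_0^t w(t-s)\,x(s)\,ds \tag{Eq.48} \]

is equivalent to the following system:

\[ \mathrm{net}(t) = \sum_{k=1}^{K} w_k\, x_k(t), \]

where x0(t) = x(t), and

\[ \frac{dx_k}{dt}(t) = -\mu\, x_k(t) + \mu\, x_{k-1}(t), \quad k = 1,\ldots,K, \; \mu > 0. \tag{Eq.49} \]

(end theorem).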

The term gamma memory will be reserved to indicate the delay structure

described by Eq.49. The recursive nature of the gamma memory computation is
illustrated in Figure 3.3.


Figure 3.3 The gamma memory structure.

For the discrete time case, the derivative in Eq.49 is approximated by the first-

order forward difference, that is

\[ \frac{dx_k}{dt}(t) \approx x_k(t+1) - x_k(t). \tag{Eq.50} \]

This approximation is not the most accurate, but it is the simplest. Also, this particular

choice implies that the boundary value μ = 1 reduces the gamma memory to a tapped
delay line. This feature facilitates the comparison of the discrete gamma model to
tapped delay line structures. Applying Eq.50 leads to the following recurrence relations

for the discrete (time) gamma memory

\[ x_0(t) = x(t), \]

\[ x_k(t) = (1-\mu)\,x_k(t-1) + \mu\,x_{k-1}(t-1), \quad k = 1,\ldots,K, \; t = t_0, t_1, t_2, \ldots \tag{Eq.51} \]

The time index t now runs through the iteration numbers t0, t1, t2,... The discrete gamma

memory structure is displayed in Figure 3.4.
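
The discrete recurrence x_k(t) = (1-μ) x_k(t-1) + μ x_{k-1}(t-1) translates almost line by line into code. The sketch below is a minimal illustration under the assumption that all taps are zero before the first sample; the function name and the list-of-rows layout are hypothetical choices.

```python
def gamma_memory(x, K, mu):
    """Run a discrete gamma memory of order K over the input sequence x.

    Returns one row per time step; row t holds [x_0(t), x_1(t), ..., x_K(t)].
    Assumption: every tap is zero before the first sample arrives.
    """
    prev = [0.0] * (K + 1)
    history = []
    for sample in x:
        cur = [float(sample)]                      # x_0(t) = x(t)
        for k in range(1, K + 1):
            # x_k(t) = (1 - mu) * x_k(t-1) + mu * x_{k-1}(t-1)
            cur.append((1.0 - mu) * prev[k] + mu * prev[k - 1])
        history.append(cur)
        prev = cur
    return history

# mu = 1: every tap is a pure one-step delay of the previous tap.
delay_line = gamma_memory([1, 2, 3, 4], K=2, mu=1.0)

# mu = 0.5: an impulse is smeared out over time at each tap.
impulse = gamma_memory([1, 0, 0, 0], K=2, mu=0.5)

for row in delay_line:
    print(row)
```

With μ = 1 the rows reproduce a tapped delay line (tap k holds x(t-k)), while for μ = 0.5 each tap holds a decaying, dispersed trace of the impulse.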

3.3 Characteristics of Gamma Memory

In this section the gamma memory structure is analyzed both in the time and

frequency domain.

3.3.1 Transformation to s- and z-Domain

Since the recursive relations that generate the gamma state variables xk(t) at

successive taps k are linear, the Laplace transformation can be applied. The (one-sided)

Laplace transform is defined as

\[ X_k(s) = \int_0^\infty x_k(t)\,e^{-st}\,dt \equiv \mathcal{L}\{x_k(t)\}. \tag{Eq.52} \]

Application of Eq.52 to Eq.49 leads to the following recursive relations for the

generation of the gamma state variables in the s-domain:

\[ X_0(s) = X(s), \qquad X_k(s) = \frac{\mu}{s+\mu}\, X_{k-1}(s), \quad k = 1,\ldots,K. \tag{Eq.53} \]

The operator $G(s) \equiv \frac{\mu}{s+\mu}$ will be referred to as the gamma delay operator. Note that

Xk(s) can be expressed as a function of the memory input X(s) only. Repeated

application of Eq.53 yields

\[ X_k(s) = \left( \frac{\mu}{s+\mu} \right)^{k} X(s) \equiv G_k(s)\,X(s). \tag{Eq.54} \]

It can be verified that $G_k(s) = \mathcal{L}\{g_k(t)\}$.

The system Eq.53 also suggests a hardware implementation of gamma memory,
which is shown in Figure 3.5. It follows that gamma memory can be interpreted as a
tapped low-pass ladder filter. There are two gamma memory parameters: the order K
and bandwidth μ = (RC)^-1.

Figure 3.5 A hardware implementation of gamma memory (taps x(t), x1(t), x2(t), ..., xK(t)).

The corresponding frequency domain for discrete-time systems is the z-domain.
It follows from Eq.53 that the z-transform can be found by substitution of s = z - 1 in the
Laplace transform. This leads to

\[ X_k(z) = \frac{\mu}{z-(1-\mu)}\, X_{k-1}(z), \quad k = 1,\ldots,K. \tag{Eq.55} \]

The discrete gamma delay operator is $G(z) \equiv \frac{\mu}{z-(1-\mu)}$. The transfer function from

memory input X to the kth tap Xk follows from Eq.55:

\[ G_k(z) = \left( \frac{\mu}{z-(1-\mu)} \right)^{k}. \tag{Eq.56} \]

Inverse z-transformation of Eq.56 leads to the discrete gamma kernels

\[ g_k(t) = \binom{t-1}{k-1}\, \mu^{k} (1-\mu)^{t-k}, \quad k = 1,\ldots,K, \; t = k, k+1, \ldots \tag{Eq.57} \]

The discrete gamma kernel gk(t) is the impulse response for the kth tap of the discrete

gamma memory model. Eq.57 can be interpreted as follows. In order to get from the
memory input to the kth tap xk(t) in time t, the signal has to take k forward steps and

pass through t-k loops. Each forward step involves a multiplication by μ, and a pass
through a loop involves multiplication by 1-μ. The number of different paths from x(t)

to xk(t) in time t equals $\binom{t-1}{k-1}$.
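
The path-counting interpretation can be checked by comparing the closed-form kernel, C(t-1, k-1) μ^k (1-μ)^(t-k) for t ≥ k, against the impulse response produced by the tap recurrence itself. This is a sketch with hypothetical function names and arbitrary test values (K = 4, μ = 0.3); a zero initial state is assumed.

```python
import math

def discrete_gamma_kernel(k, mu, t):
    """Closed-form discrete gamma kernel; zero for t < k."""
    if t < k:
        return 0.0
    return math.comb(t - 1, k - 1) * mu**k * (1.0 - mu) ** (t - k)

def impulse_response(K, mu, T):
    """Tap values for a unit impulse at t = 0, computed with the tap recurrence.

    Row t holds [x_0(t), ..., x_K(t)]; all taps are assumed zero before t = 0.
    """
    prev = [0.0] * (K + 1)
    rows = []
    for t in range(T):
        cur = [1.0 if t == 0 else 0.0]
        for k in range(1, K + 1):
            cur.append((1.0 - mu) * prev[k] + mu * prev[k - 1])
        rows.append(cur)
        prev = cur
    return rows

K, MU, T = 4, 0.3, 30
rows = impulse_response(K, MU, T)
max_diff = max(
    abs(rows[t][k] - discrete_gamma_kernel(k, MU, t))
    for t in range(T)
    for k in range(1, K + 1)
)
print("largest deviation between recursion and closed form:", max_diff)
```

The deviation should be at the level of floating-point rounding, which supports reading the binomial coefficient as the number of distinct paths of k forward steps and t-k loop passes.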

Next, some of the frequency domain characteristics of the gamma memory will
be analyzed. The analysis will be performed using the discrete model, although similar

properties may be derived for the continuous time version.

3.3.2 Frequency Domain Analysis

The transfer function for the (kth tap of the) discrete gamma memory is given
by Eq.56. The Kth order discrete gamma memory has a Kth order eigenvalue at

z = 1 - μ. Since a linear discrete-time model is stable when all eigenvalues lie within

the unit circle, it follows that the discrete gamma memory is a stable system when

0 < μ < 2. The group delay and magnitude response for this structure are displayed for

the second tap (k=2) in Figure 3.6 and Figure 3.7 respectively.

The group delay of a filter structure is defined as the negated derivative of the
phase response with respect to frequency. It provides a measure of the delay in the filter as a
function of the frequency of the input signal. When μ=1, gamma memory reduces to a
tapped delay line structure. In this case, all frequencies pass with gain one and the
group delay is 2, the tap index.

When 0 < μ < 1, the gamma memory implements a linear Kth order low-pass

filter. The low frequencies are delayed more than the high frequencies. In fact, the low
frequencies can be delayed by more than the tap index, which is the (maximal) delay
for a tapped delay line. For instance, for μ=0.25 at tap k=2, a delay up to 8 can be




Figure 3.6 Group delay for discrete gamma memory at k=2.





achieved for the low frequencies. The additional delay for low frequencies

is paid for by the high frequencies. The high frequencies are attenuated and the group

delay is less than the tap index. Thus, for 0 < μ < 1, the storage of the low frequencies

is favored at a cost for the high frequencies.

The gamma memory behaves as a high pass filter when 1 < μ < 2. As a result,

the high frequencies are delayed by more than the tap index.
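
The low-pass and high-pass regimes can be reproduced by evaluating the tap transfer function G_k(z) = (μ/(z-(1-μ)))^k on the unit circle. In the sketch below, the closed-form group delay k(1 - a cos ω)/(1 - 2a cos ω + a²) with a = 1-μ is derived here from that transfer function rather than quoted from the text, and the function names are hypothetical.

```python
import cmath
import math

def gamma_transfer(k, mu, omega):
    """Tap-k transfer function evaluated at z = exp(j*omega)."""
    z = cmath.exp(1j * omega)
    return (mu / (z - (1.0 - mu))) ** k

def group_delay(k, mu, omega):
    """Negated derivative of the phase of the tap-k transfer function.

    Closed form for k cascaded first-order sections with pole a = 1 - mu
    (derived for this sketch, valid for 0 < mu < 2).
    """
    a = 1.0 - mu
    return k * (1.0 - a * math.cos(omega)) / (1.0 - 2.0 * a * math.cos(omega) + a * a)

for mu in (0.25, 1.0, 1.5):
    for omega in (0.0, math.pi):
        gain = abs(gamma_transfer(2, mu, omega))
        delay = group_delay(2, mu, omega)
        print(f"mu={mu}, omega={omega:.2f}: gain={gain:.4f}, group delay={delay:.4f}")
```

For μ = 0.25 and k = 2 the delay at zero frequency is k/μ = 8, in line with the figure quoted in the text, while the delay at ω = π falls below the tap index and the gain there is well below one. For μ = 1.5 the roles of low and high frequencies are exchanged.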

3.3.3 Time Domain Analysis

Although the impulse response gk(t) of the kth tap of the gamma memory

extends to infinite time for 0 < μ < 1, it is possible to formulate a mean memory depth

for a given memory structure gk(t). Let us define the mean sampling time $\bar{t}_k$ for the kth

tap as

\[ \bar{t}_k \equiv \sum_{t=0}^{\infty} t\, g_k(t) = Z\{ t\, g_k(t) \}\Big|_{z=1} = -z \frac{dG_k(z)}{dz}\Big|_{z=1} = \frac{k}{\mu}. \tag{Eq.58} \]

We also define the mean sampling period $\Delta\bar{t}_k$ (at tap k) as $\Delta\bar{t}_k \equiv \bar{t}_k - \bar{t}_{k-1} = \frac{1}{\mu}$. The

mean memory depth Dk for a gamma memory of order k then becomes

\[ D_k \equiv \sum_{i=1}^{k} \Delta\bar{t}_i = \frac{k}{\mu}. \tag{Eq.59} \]

In the following, we drop the subscript when k = K. If we define the resolution Rk as
$R_k \equiv 1/\Delta\bar{t}_k = \mu$, the following formula arises which is of fundamental importance for the

characterization of gamma memory structures:

\[ K = D \times R. \tag{Eq.60} \]

Formula Eq.60 reflects the possible trade-off of resolution versus memory depth in a



Figure 3.7 Magnitude response for discrete gamma memory at k=2.

memory structure for fixed dimensionality K. Such a trade-off is not possible in a

non-dispersive tapped delay line, since the fixed choice of μ = 1 sets the depth and

resolution to D = K and R = 1 respectively. However, in the gamma memory, depth

and resolution can be adapted by variation of μ.

In most neural net structures, the number of adaptive parameters is proportional

to the number of taps (K). Thus, when μ = 1, the number of weights is proportional to

the memory depth. Very often this coupling leads to overfitting of the data set (using

parameters to model the noise). The parameter μ provides a means to uncouple the

memory order and depth.

As an example, assume a signal whose dynamics are described by a system with

5 parameters and maximal delay 10, that is, y(t) = f(x(t - ni), wi) where i = 1,...,5,

and max_i(ni) = 10. If we try to model this signal with an adaline structure, the choice K

= 10 leads to overfitting while K < 10 leaves the network unable to incorporate the

influence of x(t - 10). In an adaline with gamma memory, the choice K = 5 and

μ = 0.5 leads to 5 free network parameters and a mean memory depth of 10, obviously a

better compromise.
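
The numbers in this example can be reproduced by summing t·g_K(t) directly, using the discrete kernel C(t-1, k-1) μ^k (1-μ)^(t-k). This is a minimal sketch; the truncation length t_max and the variable names are illustrative assumptions.

```python
import math

def mean_sampling_time(k, mu, t_max=400):
    """Truncated sum over t of t * g_k(t) for the discrete gamma kernel."""
    total = 0.0
    for t in range(k, t_max):
        g = math.comb(t - 1, k - 1) * mu**k * (1.0 - mu) ** (t - k)
        total += t * g
    return total

K, MU = 5, 0.5
depth = mean_sampling_time(K, MU)   # numerical estimate of D = K / mu
resolution = MU                     # R = mu
print(f"K={K}, mu={MU}: depth ~ {depth:.6f}, depth * resolution ~ {depth * resolution:.6f}")
```

The estimated depth agrees with K/μ = 10, and the product of depth and resolution returns the order K = 5. A tapped delay line (μ = 1) would need K = 10 taps, and hence ten weights, to reach the same depth.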

3.3.4 Discussion

In this section the gamma memory structure was analyzed in both the time and

frequency domains. When 0 < μ < 1 the storage of the low frequencies is favored over

the high frequencies. In the time domain this translates to a loss of resolution but a gain

in memory depth. There are many applications for which this bias towards storing low

frequencies can be exploited. In particular we think of applications where a long

memory depth is required, as in echo cancelation or room equalization.

Next the gamma memory model is incorporated into the additive neural net. The

additive model with adaptive gamma memory provides a unified framework for nonlinear

processing with adaptive short term memory capacities.

Figure 3.8 Memory depth of discrete gamma memory vs. μ (curves for several values of K).

3.4 The Gamma Neural Net

In this section, the additive neural net is extended with gamma memory. This

new model will be compared to the various competing networks for temporal

processing as described in chapter 2.

3.4.1 The Model

Recall the general model for additive neural nets:

\[ \frac{dx_i}{dt}(t) = -a_i\, x_i(t) + \sigma_i\Big( \sum_{j=1}^{N} w_{ij}\, x_j(t) \Big) + I_i(t), \quad i = 1,\ldots,N. \tag{Eq.61} \]

Let us assume that each neuron has the capability to store its past in a gamma

memory structure of (maximal) order K and bandwidth μi. The activation of the kth tap

of neuron i is written as xik(t). The weight wijk connects the kth tap of neuron j to neuron

i. The system equations for this model are

\[ \frac{dx_i}{dt}(t) = -a_i\, x_i(t) + \sigma_i\Big( \sum_{j} \sum_{k} w_{ijk}\, x_{jk}(t) \Big) + I_i(t) \tag{Eq.62} \]

for the activations xi(t) and for the taps xik(t)

\[ \frac{dx_{ik}}{dt}(t) = -\mu_i\, x_{ik}(t) + \mu_i\, x_{i,k-1}(t), \quad k = 1,\ldots,K. \tag{Eq.63} \]

The system described by Eq.62 and Eq.63 will be referred to as the (additive) gamma
neural net or gamma model. In Eq.62 the time constant is absorbed into the decay

parameter ai. Also, for notational convenience we defined xi(t) ≡ xi0(t). The structure

of the gamma neural model is displayed in Figure 3.9.

The discrete gamma memory can be applied in discrete-time neural network
models. The discrete gamma neural model is defined as follows:

\[ x_i(t) = \sigma_i\Big( \sum_{j} \sum_{k} w_{ijk}\, x_{jk}(t) \Big) + I_i(t), \tag{Eq.64} \]

\[ x_{ik}(t) = (1-\mu_i)\, x_{ik}(t-1) + \mu_i\, x_{i,k-1}(t-1), \quad k = 1,\ldots,K. \tag{Eq.65} \]

Note that μ=1 leads to an additive network where the memory is implemented as a
tapped delay line. A feedforward network of this type is equivalent to the time-delay
neural net. The other extreme at μ=0 obviates the gamma memory and reduces the
structure to a "normal" additive model.
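
One time step of the discrete gamma neural model can be sketched as follows. Several choices here are assumptions made only for illustration: the squashing function is taken to be tanh, the weighted sum is restricted to the delayed taps k = 1,...,K (which avoids an instantaneous loop between activations), and the nested-list layout and the name `gamma_net_step` are hypothetical.

```python
import math

def gamma_net_step(x_taps, inputs, weights, mu, squash=math.tanh):
    """Advance a discrete gamma net by one time step.

    x_taps[i][k]   : value of tap k of neuron i at time t-1 (k = 0 is the activation)
    inputs[i]      : external input I_i(t)
    weights[i][j][k]: weight from tap k of neuron j to neuron i (index k = 0 unused here)
    mu[i]          : memory parameter of neuron i
    """
    n = len(x_taps)
    order = len(x_taps[0]) - 1
    new = [[0.0] * (order + 1) for _ in range(n)]
    # Tap update: x_ik(t) = (1 - mu_i) x_ik(t-1) + mu_i x_i,k-1(t-1)
    for i in range(n):
        for k in range(1, order + 1):
            new[i][k] = (1.0 - mu[i]) * x_taps[i][k] + mu[i] * x_taps[i][k - 1]
    # Activation: squashed weighted sum over the (assumed) taps k >= 1, plus input
    for i in range(n):
        net = sum(
            weights[i][j][k] * new[j][k]
            for j in range(n)
            for k in range(1, order + 1)
        )
        new[i][0] = squash(net) + inputs[i]
    return new

# One neuron, order 2, zero weights, mu = 1: the taps act as a delay line on I(t).
N, K = 1, 2
state = [[0.0] * (K + 1) for _ in range(N)]
weights = [[[0.0] * (K + 1) for _ in range(N)] for _ in range(N)]
for external in ([1.0], [2.0], [3.0]):
    state = gamma_net_step(state, external, weights, mu=[1.0])
print(state)
```

With zero weights and μ = 1 the neuron simply passes its input through, and its taps hold the two previous activations, which is the tapped-delay-line limit described above.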

Next, the gamma model is compared to previously introduced neural models for
temporal processing.

3.4.2 The Gamma Model versus the Additive Neural Net

Figure 3.9 The additive gamma neural model.

Additive neural models are characterized by the fact that the adaptive net

parameters are constants, that is, they depend neither on the neural activation nor on

time. The free parameters in the gamma neural model are μi and wijk, hence the

gamma net is an additive model. This is an important result, since the substantial theory

on additive neural nets applies directly to the gamma model. Also, existing learning

procedures such as Hebbian learning and backpropagation apply without restriction to

the gamma net. Since the gamma model is additive, it is possible to express the system

equations (Eq.62 and Eq.63, or Eq.64 and Eq.65) as a Grossberg additive model. This is most easily shown by

rewriting the gamma model equations in matrix form. This conversion will now be

carried out for the continuous time model, since the result hints at an interesting

interpretation of the gamma model. First the node indices i and j are eliminated. We

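define the signal vectors $x_k = [x_{1k}, \ldots, x_{Nk}]^T$, $I = [I_1, \ldots, I_N]^T$ and parameter matrices $a = \mathrm{diag}_N(a_i)$, $\mu = \mathrm{diag}_N(\mu_i)$ and

\[ w_k = \begin{bmatrix} w_{11k} & \cdots & w_{1Nk} \\ \vdots & & \vdots \\ w_{N1k} & \cdots & w_{NNk} \end{bmatrix}. \]

Then, the gamma model equations can be written in matrix form as

\[ \frac{dx_0}{dt}(t) = -a\, x_0(t) + \sigma\Big( \sum_{k=0}^{K} w_k\, x_k(t) \Big) + I(t), \]

\[ \frac{dx_k}{dt}(t) = -\mu\, x_k(t) + \mu\, x_{k-1}(t), \quad k = 1,\ldots,K. \tag{Eq.66} \]

Next, index k is eliminated. We define the gamma state vector X, the input E, the squashing function Σ(·), the matrix of decay parameters M and the matrix of weights Ω as

\[ X = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_K \end{bmatrix}, \quad
   E = \begin{bmatrix} I \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
   \Sigma(\cdot) = \begin{bmatrix} \sigma(\cdot) \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
   M = \begin{bmatrix} a & 0 & \cdots & 0 \\ -\mu & \mu & & \vdots \\ & \ddots & \ddots & 0 \\ 0 & \cdots & -\mu & \mu \end{bmatrix}, \quad
   \Omega = \begin{bmatrix} w_0 & w_1 & \cdots & w_K \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}. \]

Then the gamma model evaluates to

\[ \frac{dX}{dt} = -M X + \Sigma(\Omega X) + E, \tag{Eq.67} \]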

an N(K+1)-dimensional Grossberg additive model.

Notice the form of the weight matrix Ω. Many entries are preset to zero; in other

words, the gamma model is a pre-wired additive model. As a result, the pre-wiring of

the gamma model for processing of temporal patterns has left this model with fewer

free weights than a general additive model. This property is important, since both the

learning time and the number of training patterns needed grow with the number of free

weights in a neural net.

3.4.3 The Gamma Model versus the Convolution Model

In chapter 2, I identified theoretical analysis, numerical simulation and learning

as problem areas for the convolution model. It was also shown that the gamma model

is mathematically equivalent to the convolution model for sufficiently large order K.

Next these problems are re-evaluated when the convolution model is expressed as a

gamma model.

Analysis As was discussed in section 3.4.2, the gamma model can be

represented as a pre-wired additive model. Consequently, theoretical results for the

additive model are entirely applicable to the gamma model.

Numerical Simulation Whereas the complexity of numerical integration of the

convolution model scales as O(N²T), the gamma model scales as O(N²K). Thus, as time

progresses the evaluation of the next state for the convolution model involves an

increasing amount of computation. The computational load for the gamma model is

independent of time.
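
The difference in computational load can be made concrete for a single unit in discrete time. The sketch below computes the same output twice: once by explicit convolution with the kernel w(τ) = Σ_k w_k g_k(τ), where the inner loop grows with t, and once with the gamma recurrence, where each step costs K updates. The function names, the test signal and the weight values are illustrative assumptions; the discrete kernel C(τ-1, k-1) μ^k (1-μ)^(τ-k) and a zero initial state are assumed.

```python
import math

def kernel(k, mu, tau):
    """Discrete gamma kernel g_k(tau); zero for tau < k."""
    if tau < k:
        return 0.0
    return math.comb(tau - 1, k - 1) * mu**k * (1.0 - mu) ** (tau - k)

def net_by_convolution(x, w, mu):
    """net(t) = sum_{n=0..t} w(t-n) x(n); the work per step grows with t."""
    K = len(w)
    out = []
    for t in range(len(x)):
        total = 0.0
        for n in range(t + 1):
            tau = t - n
            w_tau = sum(w[k - 1] * kernel(k, mu, tau) for k in range(1, K + 1))
            total += w_tau * x[n]
        out.append(total)
    return out

def net_by_gamma_memory(x, w, mu):
    """Same output from the tap recurrence; the work per step is fixed at K updates."""
    K = len(w)
    prev = [0.0] * (K + 1)
    out = []
    for sample in x:
        cur = [float(sample)]
        for k in range(1, K + 1):
            cur.append((1.0 - mu) * prev[k] + mu * prev[k - 1])
        out.append(sum(w[k - 1] * cur[k] for k in range(1, K + 1)))
        prev = cur
    return out

signal = [math.sin(0.3 * t) for t in range(40)]
tap_weights = [0.5, -0.2, 0.1]
conv = net_by_convolution(signal, tap_weights, 0.4)
rec = net_by_gamma_memory(signal, tap_weights, 0.4)
max_gap = max(abs(a - b) for a, b in zip(conv, rec))
print("largest difference between the two computations:", max_gap)
```

Both routes give the same output up to rounding, but only the recursive one has a per-step cost that does not depend on how long the system has been running.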

Learning While the weights in the convolution model are time-varying, the

gamma weights are constants. Thus, the dimensionality of the weight vector is

independent of time. This is a very important distinction from an engineering

standpoint, since it allows a direct generalization of conventional learning algorithms

to the gamma model.

3.4.4 The Gamma Model versus the Concentration-in-Time net

In 1987, Tank and Hopfield presented an analog neural net with dispersive delay

kernels for temporal processing (Tank and Hopfield, 1987). The memory taps xk(t) for

this "Concentration-in-Time net" (CITN) are obtained by convolving the input signal

with the kernels

\[ f_k(t) = \Big( \frac{t}{k} \Big)^{\alpha} e^{\alpha (1 - t/k)}, \quad k = 1,\ldots,K, \tag{Eq.68} \]

where α is a positive integer. fk(t) is normalized to have maximal value 1 for t = k. The

degree of dispersion is regulated by parameter α. In Figure 3.10, the kernels fk(t) are

displayed for k=1 to 5 when α = 5.

Although fk(t) visually resembles peak-normalized gamma kernels, it is not

possible to generate the kernels fk(t) by a recursive set of ordinary differential equations

with constant coefficients, as is the case for the gamma kernels. In fact, differentiating

Eq.68 leads to the following time-varying differential equation for fk(t):

\[ \frac{df_k}{dt}(t) = \alpha \Big( \frac{1}{t} - \frac{1}{k} \Big) f_k(t). \tag{Eq.69} \]

Figure 3.10 Tank and Hopfield's delay kernels; α=5, k=1,...,5

While Tank and Hopfield's model shares with the gamma model the capability of

regulating temporal dispersion, only the gamma model provides an additive neural

mechanism for this capacity. As a result, in the gamma model the dispersion control

parameter μ can be treated as an adaptive weight. In Tank and Hopfield's model, α is

fixed. The relative merits of peak- versus area-normalization of dispersive delay

kernels have not been investigated.

Similar arguments hold when the gamma memory is compared to the adaptive

gaussian distributed delay models such as the TEMPO 2 model (Bodenhausen and

Waibel, 1991) and the gaussian version of the concentration-in-time neural net

(Unnikrishnan et al., 1991). These memory models do offer the advantage of adaptive

dispersion, yet only the gamma memory offers an additive neural mechanism to create

dispersive delays. The other models require evaluation of a convolution integral with

respect to the delay kernels in order to compute the memory traces. Although not a

priority for engineering applications, the gamma memory is biologically plausible,

since there is no (non-neural) external mechanism required to generate delay kernels.

3.4.5 The Gamma Model versus the Time Delay Neural Net

The memory structure in the time delay neural net (TDNN) is a tapped delay

line. In fact, TDNN structures can be created in the gamma memory by fixing μ = 1 in

the discrete gamma model. Thus, the TDNN is a special case of the discrete gamma

model. When 0 < μ < 1, the discrete gamma memory implements a tapped dispersive

delay line. The amount of dispersion is regulated by the adaptive memory parameter μ.

We discussed that the memory depth of the gamma memory can be estimated by K/μ.

Hence, the memory depth can be adapted independently from the number of taps (and

weights!) in the structure. In the TDNN, the memory depth and the memory order both

equal K. As a result, increasing the memory depth in the TDNN is always coupled with

an increase in the number of weights in the net, which is sometimes not desirable.

3.4.6 The Gamma Model versus Adaline

The simplest discrete-time gamma model is a linear one-layer feedforward

structure with one output unit. The equations for this network are given by

\[ y(t) = \sum_{k=0}^{K} w_k\, x_k(t), \]

\[ x_0(t) = I(t), \]

\[ x_k(t) = (1-\mu)\, x_k(t-1) + \mu\, x_{k-1}(t-1), \quad k = 1,\ldots,K. \tag{Eq.70} \]

This structure is depicted in Figure 3.11. For μ=1, Widrow's adaline is

obtained. Also, adaline is the simplest (linear, one-layer, one output) implementation

of the time-delay neural net.

Figure 3.11 The adaline(μ) structure.

The gamma memory generalizes adaline to an adaptive filter with a dispersive

tapped delay line. Adaline with gamma memory will be referred to as adaline(μ) or

adaptive gamma filter. Several interesting aspects of this filter are worth a deeper look.

A special chapter (5) will be dedicated to the analysis of adaline(μ).

3.5 Discussion

In this chapter the gamma neural model has been developed and analyzed. The

gamma model is characterized by a specific short term memory architecture, the

gamma memory structure. Gamma memory is an adaptive local short term memory

structure. Memory depth and resolution can be altered by variation of a continuous

parameter μ. In the next chapter, gradient descent adaptation procedures for the gamma

model are presented.



4.1 Introduction: Learning as an Optimization Problem

Learning in a neural net concerns the modification of the weights of the net so

as to improve performance of the system. The term adaptation will be used as a

synonym for learning. Commonly it is assumed that the performance of the neural net

is expressed by a scalar performance index or total error E. In general we write

\[ E = \sum_{p} \sum_{t} \sum_{m} E_m^p(t). \tag{Eq.71} \]

$E_m^p(t)$ is a cost functional which describes the error measure at output node m ∈ M at

time t ∈ [0, T] when pattern p ∈ P is presented to the system. Often the (weighted

by ½) quadratic deviation from a given target trajectory $d_m^p(t)$ is chosen as the cost.

For this case, Eq.71 evaluates to

\[ E = \frac{1}{2} \sum_{p,t,m} \big[ d_m^p(t) - x_m^p(t) \big]^2 = \frac{1}{2} \sum_{p,t,m} \big[ e_m^p(t) \big]^2, \tag{Eq.72} \]

where $e_m^p(t) \equiv d_m^p(t) - x_m^p(t)$ is the instantaneous error signal which is immediately

measurable at any time.

The learning goal is to minimize E over the system parameters w and μ,

constrained by the network state equations $\dot{x} = f(x, I; w, \mu)$. This problem has been

studied extensively in optimal control theory (Bryson and Ho, 1975). The most

common approach to search for the minimum of E involves the use of the gradients ∂E/∂w

and ∂E/∂μ. When E is minimal, these gradients necessarily vanish, that is, at the optimum

we have

\[ \frac{\partial E}{\partial w} = \frac{\partial E}{\partial \mu} = 0. \tag{Eq.73} \]

An algorithmic method is now discussed which searches for the values w and μ

that minimize E. Assume an available set of training pattern pairs (I^p(t), d^p(t)) that

adequately represents the problem at hand. This set is referred to as the training set P.

Next, the training set is presented to the network and the activations x^p(t) are

recorded. The presentation of the entire pattern set P is called an epoch or batch. Note

that the availability of x^p(t) and d^p(t) allows the evaluation of the performance index

E using Eq.72. This measure can be used to determine when to stop training of the

system. Next the gradients ∂E/∂w and ∂E/∂μ are computed and the weights are updated in

the direction of the negative gradients:

\[ w_{new} = w_{old} - \eta\, \frac{\partial E}{\partial w}, \tag{Eq.74} \]

\[ \mu_{new} = \mu_{old} - \eta\, \frac{\partial E}{\partial \mu}. \tag{Eq.75} \]

If the learning rate or step size η is small enough, this update will decrease the

total error E on the next batch run. There are other methods of utilizing the error
gradients to search for the minimum of E. As an example, the successive weight
updates can be made orthogonal to each other, a process which is called conjugate

gradient descent. In this work we are not interested in optimizing the learning process

per se. Our goal will be to generalize gradient descent adaptation to the gamma net and

evaluate the properties of this generalization. The equations Eq.74 and Eq.75

implement an update strategy which is called steepest descent. The process of running

training set epochs followed by a weight update is repeated until the total error E no

longer decreases. Note that if the error surface E = E(w, μ) is convex, this procedure

leads to a global optimization of the network performance.

The learning process described above updates the weights only after

presentation of the entire epoch. We call this epochwise learning or learning in batch

mode. A faster training method would be to adapt the weights after each time step t.

This involves computation of the time-dependent error gradients $\frac{\partial E}{\partial w}(t)$ and $\frac{\partial E}{\partial \mu}(t)$.

The update equations then become

\[ w(t) = w(t-1) - \eta\, \frac{\partial E}{\partial w}(t), \tag{Eq.76} \]

\[ \mu(t) = \mu(t-1) - \eta\, \frac{\partial E}{\partial \mu}(t). \tag{Eq.77} \]

This mode is called real-time learning or on-line learning. Real-time learning

converges faster than learning in batch mode, but the updates are no longer in the

opposite direction of the total error gradients ∂E/∂w and ∂E/∂μ. Thus, even for convex error

surfaces, real-time adaptation does not necessarily lead to global optimization of the

network performance.
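
The contrast between epochwise and real-time updating can be illustrated on a deliberately small problem. In this sketch a single scalar weight w stands in for the full parameter set (w, μ), the "network" is the linear map y = w·x, and the two training patterns and the step size are hypothetical values chosen so that the arithmetic is easy to follow.

```python
# Toy training set of (input, target) pairs; the exact solution is w = 2.
PATTERNS = [(1.0, 2.0), (2.0, 4.0)]
ETA = 0.1  # step size (illustrative value)

def total_error(w, patterns):
    """E = 1/2 * sum of squared errors over the whole pattern set."""
    return 0.5 * sum((d - w * x) ** 2 for x, d in patterns)

def batch_epoch(w, patterns, eta):
    """Epochwise learning: accumulate dE/dw over all patterns, then update once."""
    grad = sum(-(d - w * x) * x for x, d in patterns)
    return w - eta * grad

def online_epoch(w, patterns, eta):
    """Real-time learning: update after every pattern with the instantaneous gradient."""
    for x, d in patterns:
        grad = -(d - w * x) * x
        w = w - eta * grad
    return w

w_batch = w_online = 0.0
for epoch in range(100):
    w_batch = batch_epoch(w_batch, PATTERNS, ETA)
    w_online = online_epoch(w_online, PATTERNS, ETA)

print("batch :", w_batch, total_error(w_batch, PATTERNS))
print("online:", w_online, total_error(w_online, PATTERNS))
```

After the first epoch the two schemes already sit at different weights, because the real-time scheme evaluates the second pattern with a weight that has just been changed. On this noise-free convex toy problem both end up at the same minimum; the caveat about real-time learning concerns the direction of individual updates, which no longer follow the total error gradient.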

A look at the update equations makes clear that the crucial aspect of the learning

process just described is the computation of the error gradients ∂E/∂w and ∂E/∂μ. Therefore

the remaining part of this chapter concerns methods of evaluating these gradients.

This chapter is organized as follows. In the next section the literature on

gradient computation in simple neural nets is reviewed. Two different methods will be

evaluated. The direct method computes the gradients by direct numerical

differentiation of the describing system equations. The alternative method, (error)

backpropagation, utilizes the specific network architecture to compute the gradients.

As a result, backpropagation will prove to be a more efficient technique (in terms of

number of operations) than the direct method. However, as we will see when dynamic

networks such as the gamma net are introduced, application of backpropagation is

restricted to short time intervals. The trade-offs of backpropagation versus the direct

method will be evaluated for the gamma net. Finally, a special gamma net architecture,

the focused gamma net, is introduced. This structure is of special interest, since it can

be trained by a fast hybrid learning procedure. Some well-known signal processing

structures such as adaline and the feedforward networks are special cases of the

focused gamma net.

4.2 Gradient Computation in Simple Static Networks

In this section the essentials of error gradient computation in neural nets are

treated. For the time being we deal with the simplest cases, since we only want to

convey the strategy of error gradient computation. The weight update equations for the

gamma model are postponed to section 4.3.

It is assumed that the processing system can be described by a static additive

neural model:

$$x_i = \sigma_i\Big(\sum_{j<i} w_{ij} x_j\Big) + I_i$$ (Eq.78)

We will also write $net_i = \sum_{j<i} w_{ij} x_j$. Also, we assume that the states $x_i$ are computed by
increasing index order. Thus, first $x_1$ is computed, then $x_2$ and so forth until $x_N$. There

are no temporal dynamics associated with Eq.78. As discussed before, the central task

of all gradient descent adaptation procedures is to compute $\partial E/\partial w_{ij}$ for all weights. It will
be assumed that the total error measure E can be expressed as

$$E = \sum_{m \in M} E_m = \frac{1}{2}\sum_{m} e_m^2 = \frac{1}{2}\sum_{m} (d_m - x_m)^2$$ (Eq.79)

Thus the training set consists of one static pattern. The learning task is to adapt the

weights such that the mean square error between the target $d_m$ and net activation $x_m$ is

minimal when $I_i$ is presented to the system. Next two exact algorithms to compute

$\partial E/\partial w_{ij}$ are presented.

4.2.1 Gradient Computation by Direct Numerical Differentiation

The simplest method to compute the gradients is just by differentiating the

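equations Eq.78 and Eq.79. Applying the chain rule to Eq.79 yields

$$\frac{\partial E}{\partial w_{ij}} = -\sum_{m} e_m \frac{\partial x_m}{\partial w_{ij}}$$ (Eq.80)

Next the gradient variable $\beta^m_{ij} \equiv \frac{\partial x_m}{\partial w_{ij}}$ is defined. $\beta^m_{ij}$ can be directly computed by

differentiating the state equation Eq.78. This leads to

$$\beta^m_{ij} = \frac{d\sigma_m(net_m)}{d\,net_m}\,\frac{\partial net_m}{\partial w_{ij}} = \sigma_m'(net_m)\Big[\delta_{im} x_j + \sum_{n} w_{mn}\,\beta^n_{ij}\Big]$$ (Eq.81)

where $\delta_{im}$ is the Kronecker delta function.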

The set of equations Eq.78, Eq.80 and Eq.81 provide a system to compute the

error gradients. Together with an update rule such as

$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$$ (Eq.82)

they form a neural net learning system.
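As an illustration, the direct method of Eq.80 and Eq.81 can be sketched in a few lines of Python. The fragment below is a minimal sketch under our own assumptions: all nodes use $\sigma = \tanh$, the weights are stored in a strictly lower-triangular matrix `W` (so that $w_{ij} = 0$ for $j \geq i$), and `out` lists the output nodes $m \in M$. The function names are hypothetical.

```python
import numpy as np

def forward(W, I):
    """Evaluate x_i = tanh(sum_{j<i} W[i, j] * x_j) + I[i] in index order."""
    N = len(I)
    x, net = np.zeros(N), np.zeros(N)
    for i in range(N):
        net[i] = W[i, :i] @ x[:i]
        x[i] = np.tanh(net[i]) + I[i]
    return x, net

def error(W, I, d, out):
    """Total error E = 0.5 * sum_m (d_m - x_m)**2 over the output nodes."""
    x, _ = forward(W, I)
    return 0.5 * np.sum((d - x[out]) ** 2)

def direct_gradient(W, I, d, out):
    """dE/dW from the sensitivities beta[m, i, j] = dx_m/dw_ij (Eq.80, Eq.81)."""
    N = len(I)
    x, net = forward(W, I)
    beta = np.zeros((N, N, N))                 # O(N^3) storage
    for m in range(N):
        # implicit part: sum_n w_mn * beta[n, i, j]; rows n >= m are still zero
        inner = (W[m] @ beta.reshape(N, N * N)).reshape(N, N)
        # explicit part: delta_im * x_j, only for existing weights j < m
        inner[m, :m] += x[:m]
        beta[m] = (1.0 - np.tanh(net[m]) ** 2) * inner
    e = d - x[out]
    return -np.tensordot(e, beta[out], axes=1)  # Eq.80
```

The three-index array `beta` makes the storage demand of the direct method visible: every node carries a sensitivity for every weight in the net.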

4.2.2 The Backpropagation Procedure

In contrast to the direct method, the backpropagation method exploits the
specific network structure of neural nets in order to compute the error gradients. As a
result, backpropagation is a computationally more efficient procedure than the direct method.

Before the backpropagation method is introduced, it is necessary to define more
precisely what is meant by the error gradients. We now proceed by an intermezzo in
order to explain how we define partial (error) derivatives in networks. Consider the
network in Figure 4.1. The state equations for this network are given by

$$x_1 = I_1$$
$$x_2 = w_{21} x_1 + I_2$$
$$x_3 = w_{32} x_2$$
$$x_4 = w_{42} x_2 + w_{43} x_3$$
$$x_5 = w_{52} x_2 + w_{53} x_3 + w_{54} x_4$$ (Eq.83)

It is assumed that the variables $x_1$ through $x_5$ are computed in indexed order, that is, first

$x_1$, then $x_2$ and so forth until $x_5$. A network whose state variables are computed one at

a time in a specified order will be called an ordered network. Let us compute $\frac{\partial x_5}{\partial x_2}$.

Explicit partial differentiation of the equation for $x_5$ in Eq.83 gives $\frac{\partial^e x_5}{\partial x_2} = w_{52}$.

However, this only reflects the direct or explicit dependence of $x_5$ on $x_2$. $x_2$ also affects

x5 indirectly through the network. Incorporating these indirect or implicit influences

leads to the following expression:

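$$\frac{\partial x_5}{\partial x_2} = w_{52} + w_{32}\frac{\partial x_5}{\partial x_3} + w_{42}\frac{\partial x_5}{\partial x_4}$$

$$= w_{52} + w_{32}\Big(w_{53} + w_{43}\frac{\partial x_5}{\partial x_4}\Big) + w_{42} w_{54}$$

$$= w_{52} + w_{32}\,(w_{53} + w_{43} w_{54}) + w_{42} w_{54}$$ (Eq.84)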

This difference between explicit and implicit dependencies in networks has been
treated in a backpropagation context by Werbos (1989). He introduced the term ordered
derivative to denote the total partial derivative (including the network influences). In
this work, whenever we speak of a partial derivative to x the ordered partial derivative
is meant, which is denoted by the symbol $\partial/\partial x$. If we only want to include the explicit

(direct) dependence on $x$ we speak of the explicit partial derivative, for which the symbol
$\partial^e/\partial x$ will be reserved. Werbos (1989) proved the following theorem for ordered networks.

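Theorem 4.1 Consider a network of $N$ variables $x_i$ whose dependencies are
ordered by the list $L = [x_1, x_2, \ldots, x_N]$; this means that $x_i$ only depends
on $x_j$ where $j < i$. Let a performance function $E$ be defined by

$$E = E(x_1, x_2, \ldots, x_N).$$ (Eq.85)

Then, the (ordered) partial derivatives $\frac{\partial E}{\partial x_i}$ can be computed by

$$\frac{\partial E}{\partial x_i} = \frac{\partial^e E}{\partial x_i} + \sum_{j>i} \frac{\partial E}{\partial x_j}\,\frac{\partial^e x_j}{\partial x_i}$$ (Eq.86)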

(end theorem).
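The recursion of Theorem 4.1 is easily checked numerically for the example network of Eq.83. The Python sketch below takes $E = x_5$ and uses arbitrary weight values of our own choosing; the function name `ordered_derivative` and the dictionary layout are assumptions made for this illustration only.

```python
def ordered_derivative(explicit, target, source):
    """Return the ordered derivative d x_target / d x_source.

    explicit[(j, i)] holds the explicit derivative of x_j with respect to x_i
    (for the linear network of Eq.83 this is simply the weight w_ji); pairs
    that are not listed are zero. Following Eq.86 with E = x_target, the
    recursion runs in descending index order.
    """
    total = {target: 1.0}                    # d x_target / d x_target
    for i in range(target - 1, source - 1, -1):
        total[i] = sum(total[j] * explicit.get((j, i), 0.0)
                       for j in range(i + 1, target + 1))
    return total[source]

# arbitrary example weights w_ji for the network of Eq.83
w = {(2, 1): 0.5, (3, 2): -1.0, (4, 2): 2.0, (4, 3): 0.3,
     (5, 2): 1.5, (5, 3): -0.7, (5, 4): 0.4}

value = ordered_derivative(w, target=5, source=2)
closed_form = (w[5, 2] + w[3, 2] * (w[5, 3] + w[4, 3] * w[5, 4])
               + w[4, 2] * w[5, 4])          # right-hand side of Eq.84
```

Both numbers agree up to rounding, which mirrors the statement that the theorem reproduces Eq.84.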

It can be checked that application of Theorem 4.1 to the computation of $\frac{\partial x_5}{\partial x_2}$ in

the network Eq.83 leads to expression Eq.84. Note that the computation of the

gradients $\frac{\partial E}{\partial x_i}$ requires knowledge of the error gradients with respect to $x_j$ where $j > i$.

As a result, the error gradients must be computed in descending index order. This

feature has inspired the name backpropagation for algorithmic procedures that make

use of Eq.86 in order to compute the error gradients in neural networks. Here the
intermezzo on partial derivatives ends. We now proceed to derive the backpropagation method.


Let the network equations and performance index be given by Eq.78 and

Eq.79 respectively. An order or sequence of computation has to be determined for all

variables involved in the state equation Eq.78. Since the set of weights {wij} are

initialized before the state variables $x_i$ are evaluated, they can be put at the beginning

of the list. This leads to the following list:

$$L = [\{w_{ij}\}, x_1, \ldots, x_N]$$ (Eq.87)

Next the partial derivatives of the performance index E to all variables in L are

computed, making use of the rule for partial derivatives in ordered networks Eq.86. For

the state variables xi this leads to

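$$\frac{\partial E}{\partial x_i} = \frac{\partial^e E}{\partial x_i} + \sum_{j>i} \frac{\partial E}{\partial x_j}\,\frac{\partial^e x_j}{\partial x_i} = -e_i + \sum_{j>i} \frac{\partial E}{\partial x_j}\,\sigma_j'(net_j)\,w_{ji}$$ (Eq.88)

For the gradients $\frac{\partial E}{\partial w_{ij}}$ we obtain the following expression:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial^e E}{\partial w_{ij}} + \sum_{n \in N} \frac{\partial E}{\partial x_n}\,\frac{\partial^e x_n}{\partial w_{ij}} = \frac{\partial E}{\partial x_i}\,\sigma_i'(net_i)\,x_j$$ (Eq.89)

In the backpropagation literature it is customary to define the variables

$$\varepsilon_i \equiv \frac{\partial E}{\partial x_i} \quad \text{and} \quad \delta_i \equiv \frac{\partial E}{\partial net_i} = \varepsilon_i\,\sigma_i'(net_i).$$ (Eq.90)

Substitution of Eq.90 into Eq.88 and Eq.89 yields for the computation of the error gradients

$$\varepsilon_i = -e_i + \sum_{j>i} w_{ji}\,\delta_j$$ (Eq.91)

$$\delta_i = \varepsilon_i\,\sigma_i'(net_i)$$ (Eq.92)

$$\frac{\partial E}{\partial w_{ij}} = \delta_i\,x_j$$ (Eq.93)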

The set of equations Eq.91, Eq.92 and Eq.93 constitute the backpropagation method to

compute the error gradients.

Now let us see how to apply all these equations to the learning problem. Say we

have a network described by Eq.78 and a training data set consisting of one input

pattern $I_i$ and a target pattern $d_m$. To start the learning system, the input pattern is

presented to the net and the state variables xi are evaluated by Eq.78. The variables xi

are computed by increasing index order i. This completes the forward pass. Next, the

error variables $e_i = d_i - x_i$ are evaluated and stored. The next phase, the backward or

backpropagation pass, computes the variables $\delta_i = \frac{\partial E}{\partial net_i}$ by evaluating Eq.91 and

Eq.92. The quantities $\delta_i$ are called backpropagation errors. They measure the

sensitivity of the total error $E$ with respect to an infinitesimal change in $net_i$. It has been

mentioned before that the backpropagation errors are computed in descending index

order, that is, first $\delta_N$ followed by $\delta_{N-1}$ until $\delta_1$. After the backpropagation pass, the

error gradients with respect to the weights are computed by Eq.93. Next, if a steepest

descent update rule is used, the weights are adapted according to

$$\Delta w_{ij} = -\eta\,\delta_i\,x_j$$ (Eq.94)
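The forward pass, backward pass and weight update just described can be sketched as follows. This Python fragment is a minimal illustration under our own assumptions ($\sigma = \tanh$ for all nodes, a strictly lower-triangular weight matrix `W`, output nodes listed in `out`); the function names are hypothetical.

```python
import numpy as np

def forward(W, I):
    """Forward pass: x_i = tanh(sum_{j<i} W[i, j] * x_j) + I[i], ascending i."""
    N = len(I)
    x, net = np.zeros(N), np.zeros(N)
    for i in range(N):
        net[i] = W[i, :i] @ x[:i]
        x[i] = np.tanh(net[i]) + I[i]
    return x, net

def error(W, I, d, out):
    """Total error E = 0.5 * sum_m (d_m - x_m)**2 over the output nodes."""
    x, _ = forward(W, I)
    return 0.5 * np.sum((d - x[out]) ** 2)

def backprop_gradient(W, I, d, out):
    """dE/dW by the backward pass of Eq.91, Eq.92 and Eq.93."""
    N = len(I)
    x, net = forward(W, I)
    e = np.zeros(N)
    e[out] = d - x[out]                  # injection errors, zero elsewhere
    eps, delta = np.zeros(N), np.zeros(N)
    for i in reversed(range(N)):         # descending index order
        # column i of W is the transposed network; delta[j] is still zero
        # for j <= i, so only nodes j > i contribute (Eq.91)
        eps[i] = -e[i] + W[:, i] @ delta
        delta[i] = eps[i] * (1.0 - np.tanh(net[i]) ** 2)   # Eq.92
    return np.tril(np.outer(delta, x), k=-1)               # Eq.93, j < i only

def steepest_descent_step(W, I, d, out, eta):
    """One weight update according to Eq.94."""
    return W - eta * backprop_gradient(W, I, d, out)
```

Only two vectors of length $N$ are needed in the backward pass, which anticipates the storage comparison with the direct method.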

In order to appreciate the architecture of the backpropagation method, we

rewrite the backpropagation equation Eq.88 as

$$\varepsilon_i = -e_i + \sum_{j>i} \sigma_j'(net_j)\,w_{ji}\,\varepsilon_j$$ (Eq.95)

Note the structural similarity between the state equation Eq.78 and Eq.88. In the

backpropagation equation the backprop errors $\varepsilon_i$ serve as the network states and $-e_i$ is

the external input. In order to discriminate between the error variables $e_i$ and $\varepsilon_i$, $e_i$ is

sometimes referred to as the injection error whereas $\varepsilon_i$ (and $\delta_i$) are called

backpropagation errors. Since the backprop weights $w_{ji}$ connect node $j$ to $i$ (note the

arrow reversal as compared to Eq.78), the backpropagation network structure is the

transposed network of the network during the forward pass. This property is visualized

in Figure 4.2 where the backpropagation structure is drawn for an example feedforward

network. Thus, the backpropagation method makes explicit use of the network

structure in order to compute the error gradients. As a result, the backpropagation

method is computationally more efficient than the direct method. Specifically how the

computational cost of the backpropagation method compares to the direct method will

be evaluated next.

Figure 4.2 Backpropagation architecture for a static feedforward net.

4.2.3 An Evaluation of the Direct Method versus Backpropagation

There are hardly any results with respect to convergence speed of gradient

descent update rules applied to neural networks. All we can say is that the success of

training a neural net using gradients often depends on the randomly selected initial

weights. It is however interesting to make a comparison of computational complexity

of competing learning strategies. When learning algorithms are compared with respect to their

time and space resource consumption it will be assumed that the learning process is

carried out on one sequential processor. In order to describe the complexity of

algorithms the notation $O(\varphi(n))$ is used, which is defined as the set of (positive integer

valued) functions which are less than or equal to some constant positive multiple of $\varphi(n)$.

We will assume that the cost of one operation (an addition or a multiplication) carried out

on one processor is $O(1)$. As an example, the evaluation of the system equations for the

additive net Eq.78 costs $O(N^2)$ operations. The reasoning goes as follows. The

evaluation of $x_i$ costs $O(N)$ operations since node $i$ is connected to at most $N$ nodes.

Since we need to evaluate the activations for all $i$, that is $i = 1,\ldots,N$, the total cost

becomes $O(N^2)$. The cost of storage (space) for the system Eq.78 is $O(N^2)$ since the

space requirements are dominated by the (at most) $N^2$ weights $w_{ij}$.

Now let us evaluate the cost of the error computation by the direct method. The

number of required operations is dominated by the evaluation of the gradients $\beta^m_{ij}$. The

computation of a variable $\beta^m_{ij}$ involves $O(N)$ operations. Since there are at most $N^3$

variables $\beta^m_{ij}$ it follows that the total number of operations scales by $O(N^4)$. We need

$O(N^3)$ space to store $\beta^m_{ij}$.

As for the computational cost of the backpropagation method, note that the

computation of the backpropagation errors requires evaluation of the (transposed)

network. It was already discussed that evaluation of the network requires $O(N^2)$

operations. The space requirements are dominated by the weights $w_{ij}$, hence $O(N^2)$

storage is needed. Thus, both the number of operations (time) and space requirements

for gradient evaluation scale favorably for the backpropagation method in comparison

to the direct method.

The computational complexity is of course only a ballpark measure of the

merits of an algorithm. In particular for neural nets it is important if or to what degree

the algorithm can be carried out on parallel hardware. An interesting property for

network algorithms in this respect is locality. We will say that an algorithm is local if

the states of the network can be computed from information that is locally available at

the site of computation. Locality is not only a property of the biological archetype, it

greatly facilitates implementation in parallel hardware. For instance, the

backpropagation errors are computed by means of the transposed network. As a result,

in a hardware implementation only the direction of the communication paths between

processors need to be reversed. The direct method, on the other hand, is not local. The

gradients $\beta^n_{ij}$ need to be computed for all nodes $n \in N$ and all weight indices

$(i,j) \in N \times N$.

As a conclusion, both the computational complexity and the locality criterion

favor the backpropagation method over the direct method for static networks.

Therefore, the direct method should not be used for computation of error gradients in

static nets. However, it will be shown that the situation is more complicated for

dynamic networks. In the next section, the direct method and backpropagation are

extended to the gamma net operating in a temporal environment.

4.3 Error Gradient Computation in the Gamma Model

Since the gamma model can be formulated as a regular additive model, it

follows that both the direct method and the backpropagation procedure can be extended

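to the gamma net. In this section the error gradients $\frac{\partial E}{\partial w_{ijk}}$ and $\frac{\partial E}{\partial \mu_i}$ are derived for the
discrete gamma model as described by

$$x_i(t) = \sigma_i\Big(\sum_{j}\sum_{k} w_{ijk}\,x_{jk}(t)\Big) + I_i(t)$$ (Eq.96)

where the gamma state variables are computed by

$$x_{ik}(t) = (1-\mu_i)\,x_{ik}(t-1) + \mu_i\,x_{i,k-1}(t-1)$$ (Eq.97)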
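To make the recursion of Eq.97 concrete, the following Python sketch runs a single gamma memory on an impulse. It is an illustration only: the function name `gamma_memory` is hypothetical and we assume zero initial conditions for all taps $k \geq 1$.

```python
import numpy as np

def gamma_memory(signal, mu, K):
    """Return the taps x[t, k], k = 0..K, of a gamma memory driven by `signal`.

    Tap 0 is the driving signal itself; taps k >= 1 follow Eq.97 and are
    assumed to start from zero.
    """
    T = len(signal)
    x = np.zeros((T, K + 1))
    x[:, 0] = signal
    for t in range(1, T):
        for k in range(1, K + 1):
            x[t, k] = (1.0 - mu) * x[t - 1, k] + mu * x[t - 1, k - 1]
    return x

impulse = np.zeros(8)
impulse[0] = 1.0
taps = gamma_memory(impulse, mu=0.5, K=2)   # impulse responses of taps 0, 1, 2
delay = gamma_memory(impulse, mu=1.0, K=3)  # mu = 1: each tap delays by one step
```

For $\mu < 1$ the first tap decays geometrically as $\mu(1-\mu)^{t-1}$, while for $\mu = 1$ the taps form a plain delay line.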

In contrast to the previous section, it is assumed that the activations and target

patterns are time-varying. Thus, the performance index E is defined as

$$E = \sum_{t}\sum_{m} E_m(t) = \frac{1}{2}\sum_{t,m} [d_m(t) - x_m(t)]^2 = \frac{1}{2}\sum_{t,m} [e_m(t)]^2$$ (Eq.98)

We now proceed to derive the error gradients using the direct method.

4.3.1 The Direct Method

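The procedure is similar to the derivation for the static model. Partially
differentiating $E$ to $w_{ijk}$ yields

$$\frac{\partial E}{\partial w_{ijk}} = -\sum_{t,m} e_m(t)\,\frac{\partial x_m(t)}{\partial w_{ijk}}$$ (Eq.99)

We define the gradient signal $\beta^m_{ijk}(t) \equiv \frac{\partial x_m(t)}{\partial w_{ijk}}$. $\beta^m_{ijk}(t)$ can be evaluated by partial

differentiation of Eq.96, which leads to

$$\beta^m_{ijk}(t) = \sigma_m'(net_m(t))\Big[\delta_{im}\,x_{jk}(t) + \sum_{n \in N} w_{mn}\,\beta^n_{ijk}(t)\Big]$$ (Eq.100)

where $\delta_{im}$ is the Kronecker delta (and remember the notation $w_{mn} \equiv w_{mn0}$).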

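A similar derivation can be applied to obtain the gradients $\frac{\partial E}{\partial \mu_i}$. Analogously to

Eq.99 we write

$$\frac{\partial E}{\partial \mu_i} = -\sum_{t,m} e_m(t)\,\frac{\partial x_m(t)}{\partial \mu_i}$$ (Eq.101)

Applying the chain rule to the partial derivatives $\frac{\partial x_m(t)}{\partial \mu_i}$ leads to

$$\frac{\partial x_m(t)}{\partial \mu_i} = \sum_{n \in N}\sum_{k} \frac{\partial x_m(t)}{\partial x_{nk}(t)}\,\frac{\partial x_{nk}(t)}{\partial \mu_i} = \sum_{k} \frac{\partial x_m(t)}{\partial x_{ik}(t)}\,\frac{\partial x_{ik}(t)}{\partial \mu_i}$$ (Eq.102)

The signal $\frac{\partial x_m(t)}{\partial x_{ik}(t)}$ follows by differentiation of Eq.96, yielding

$$\frac{\partial x_m(t)}{\partial x_{ik}(t)} = \sigma_m'(net_m(t))\,w_{mik}.$$ (Eq.103)

Substitution of Eq.103 into Eq.101 yields for the error gradients

$$\frac{\partial E}{\partial \mu_i} = -\sum_{t,m} e_m(t)\,\sigma_m'(net_m(t)) \sum_{k} w_{mik}\,\alpha_i^k(t)$$ (Eq.104)

where we defined $\alpha_i^k(t) \equiv \frac{\partial x_{ik}(t)}{\partial \mu_i}$. The signals $\alpha_i^k(t)$ can be computed on line by

differentiation of Eq.97, which evaluates to

$$\alpha_i^k(t) = (1-\mu_i)\,\alpha_i^k(t-1) + \mu_i\,\alpha_i^{k-1}(t-1) + [x_{i,k-1}(t-1) - x_{ik}(t-1)]$$ (Eq.105)

The set of equations Eq.99 and Eq.100 provide the gradients $\frac{\partial E}{\partial w_{ijk}}$ whereas
Eq.104 and Eq.105 compute the gradients $\frac{\partial E}{\partial \mu_i}$. A steepest descent adaptive procedure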

would use these variables in an update rule of the form

$$\Delta w_{ijk} = -\eta\,\frac{\partial E}{\partial w_{ijk}}$$ (Eq.106)

and an analogous expression for the adaptation of $\mu_i$. Together with the gamma system

equations Eq.96 and Eq.97 they constitute a gamma model learning system.
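The sensitivity recursion of Eq.105 can be run side by side with the memory recursion of Eq.97. The Python sketch below does this for a single gamma memory; it assumes (our assumption for this illustration) that tap 0 is an external signal that does not depend on $\mu$ and that all taps start from zero. The function name is hypothetical.

```python
import numpy as np

def gamma_memory_with_sensitivity(signal, mu, K):
    """Run Eq.97 and Eq.105 together for one gamma memory.

    Returns x[t, k] and alpha[t, k] = d x[t, k] / d mu for k = 0..K.
    Tap 0 is the external signal, so alpha[:, 0] stays zero.
    """
    T = len(signal)
    x = np.zeros((T, K + 1))
    alpha = np.zeros((T, K + 1))
    x[:, 0] = signal
    for t in range(1, T):
        for k in range(1, K + 1):
            x[t, k] = (1.0 - mu) * x[t - 1, k] + mu * x[t - 1, k - 1]
            alpha[t, k] = ((1.0 - mu) * alpha[t - 1, k]
                           + mu * alpha[t - 1, k - 1]
                           + x[t - 1, k - 1] - x[t - 1, k])
    return x, alpha

rng = np.random.default_rng(0)
signal = rng.normal(size=40)
x, alpha = gamma_memory_with_sensitivity(signal, mu=0.6, K=3)
```

A central finite difference of the taps with respect to $\mu$ should reproduce `alpha`, which gives a simple check of the recursion before it is embedded in a learning rule.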

The learning system as derived here assumed adaptation in batch mode.
However, this algorithm is easily converted to a real-time learning system. We just
define a time-dependent performance index $E_t$ by

$$E_t = \frac{1}{2}\sum_{m} [e_m(t)]^2$$ (Eq.107)

Note that since $E = \sum_t E_t$, the only change in the formulae is to take out the $\sum_t$ from
the error gradients, which reduces the error gradient expressions to

$$\frac{\partial E_t}{\partial w_{ijk}} = -\sum_{m} e_m(t)\,\beta^m_{ijk}(t)$$ (Eq.108)

and

$$\frac{\partial E_t}{\partial \mu_i} = -\sum_{m} e_m(t)\,\sigma_m'(net_m(t)) \sum_{k} w_{mik}\,\alpha_i^k(t)$$ (Eq.109)

The signals $\beta^m_{ijk}(t)$ and $\alpha_i^k(t)$ are computed by the same equations as in the batch

mode, that is, Eq.100 and Eq.105 respectively. The real-time mode for this algorithm is

particularly interesting since the required number of operations is equivalent to the

batch mode algorithm. However, the storage requirements for the real-time mode are

greatly reduced (by a factor $T$, the number of time steps) since we update on-line by

$$\Delta w_{ijk}(t) = -\eta\,\frac{\partial E_t}{\partial w_{ijk}} \quad \text{and} \quad \Delta \mu_i(t) = -\eta\,\frac{\partial E_t}{\partial \mu_i}$$ (Eq.110)

In addition real-time adaptation usually converges faster than epochwise

updating. Therefore in practice real-time updating is used far more than learning in
batch mode. In fact, the real-time mode of the algorithm described here was derived for

recurrent neural nets by Williams and Zipser (1989). They coined the name real time

recurrent learning algorithm (RTRL). We will take over their terminology. Thus, the

direct method for error gradient computation in gamma nets leads to a (special) RTRL algorithm.


Let us analyze the locality and complexity properties of the RTRL algorithm for

the gamma net. Assume that the number of units in the system equals N. Each unit
stores a history trace of its activation in a gamma memory structure of maximal order

$K$. The (maximal) number of weights $w_{ijk}$ then becomes $N^2K$. The number of memory

parameters equals $N$. Also, it is assumed that the system is run for $T$ time steps.

The gradient variables $\alpha_i^k(t)$ and $\beta^m_{ijk}(t)$ determine the complexity of the

procedure. There are at most $N^3K$ variables $\beta^m_{ijk}(t)$, each of which is evaluated by

Eq.100 at a cost $O(N)$ per time step. Thus the total cost is $O(N^4K)$ per time step. It

follows from Eq.104 that the evaluation of $\frac{\partial E}{\partial \mu_i}$ requires a cost $O(NK)$ per time step.

The cost of evaluating $\alpha_i^k(t)$ is $O(1)$. Since there are $N$ parameters $\mu_i$ it follows that

the total cost per time step for memory adaptation is $O(N^2K)$. The space costs are

dominated by $\beta^m_{ijk}(t)$, requiring $O(N^3K)$ memory locations. Note again that the

gradients $\alpha_i^k(t)$ and $\beta^m_{ijk}(t)$ cannot be computed locally in space, but all computations
are local in time since the algorithm is real-time. The results for RTRL and the other

algorithms are summarized in Figure 4.6.

4.3.2 Backpropagation in the Gamma Net

In this section the backpropagation procedure as derived in section 4.2.2 is

generalized to the gamma neural net. We start by defining the list L that holds the order

of evaluation of the system variables. It is assumed that the activations $x_{ik}(t)$ are

evaluated in the order as schematically specified by Figure 4.3.

for t = 0 to T do
    for i = 1 to N do
        for k = 0 to K do
            evaluate x_ik(t)
end; end; end
Figure 4.3 Evaluation order in gamma model

This leads to the following list:

$$L = [\{\mu_i\}, \{w_{ijk}\}, x_1(0), x_{11}(0), \ldots, x_{NK}(0), x_1(1), \ldots, x_{NK}(T)].$$ (Eq.111)

The same performance index as defined for the RTRL procedure is used, that is

$$E = \frac{1}{2}\sum_{t,m} [e_m(t)]^2.$$ (Eq.112)
t, m

Recall that in order to compute the error gradients in the backpropagation algorithm,
we make use of Werbos' formula for ordered derivatives, which evaluates for the

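activations $x_{ik}(t)$ to

$$\frac{\partial E}{\partial x_{ik}(t)} = \frac{\partial^e E}{\partial x_{ik}(t)} + \sum_{(\tau,j,l) > (t,i,k)} \frac{\partial E}{\partial x_{jl}(\tau)}\,\frac{\partial^e x_{jl}(\tau)}{\partial x_{ik}(t)}$$ (Eq.113)

The expression $(\tau,j,l) > (t,i,k)$ under the summation sign refers to all index

combinations $(\tau,j,l)$ that appear after $(t,i,k)$ in the list $L$. Although cumbersome,

working out Eq.113 is straightforward. In order to simplify Eq.113, the two cases

$k = 0$ and $k \neq 0$ have to be considered.

First $\frac{\partial E}{\partial x_i(t)}$ is worked out ($k = 0$). In order to evaluate the factor $\frac{\partial^e x_{jl}(\tau)}{\partial x_i(t)}$ in

expression Eq.113 we need to find the activations $x_{jl}(\tau)$ that explicitly depend on

$x_i(t)$. It follows from the gamma system equations Eq.96 and Eq.97 that only

$x_{i1}(t+1)$ and $x_j(t)$ ($j > i$) directly depend on $x_i(t)$. Thus, Eq.113 evaluates to

$$\frac{\partial E}{\partial x_i(t)} = -e_i(t) + \mu_i\,\frac{\partial E}{\partial x_{i1}(t+1)} + \sum_{j>i} \frac{\partial E}{\partial x_j(t)}\,\sigma_j'(net_j(t))\,w_{ji}$$ (Eq.114)

Next the formula for ordered derivatives Eq.113 is evaluated for the tap variables $x_{ik}(t)$

for $k = 1,\ldots,K$. Applying Eq.113 to the gamma state equations yields

$$\frac{\partial E}{\partial x_{ik}(t)} = \frac{\partial^e E}{\partial x_{ik}(t)} + (1-\mu_i)\,\frac{\partial E}{\partial x_{ik}(t+1)} + \mu_i\,\frac{\partial E}{\partial x_{i,k+1}(t+1)} + \sum_{j>i} \frac{\partial E}{\partial x_j(t)}\,\sigma_j'(net_j(t))\,w_{jik}$$ (Eq.115)

The equations Eq.114 and Eq.115 backpropagate the gradients $\frac{\partial E}{\partial x_{ik}(t)}$. Note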

that the gradients at time t are a function of the gradients at time t+l. Therefore, the
backpropagation system has to be run backwards in time, that is from t=T backwards
to t = 0. In fact, this is also clear when we recall that the list L is run backwards during
the backpropagation pass. For this reason the procedure described here is called
backpropagation-through-time (BPTT). Next the error gradients are computed with
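respect to the system parameters by applying Eq.113 to the list $L$. For the weights $w_{ijk}$
we get

$$\frac{\partial E}{\partial w_{ijk}} = \sum_{t} \frac{\partial E}{\partial x_i(t)}\,\sigma_i'(net_i(t))\,x_{jk}(t)$$ (Eq.116)

and for the gradients $\frac{\partial E}{\partial \mu_i}$,

$$\frac{\partial E}{\partial \mu_i} = \sum_{t}\sum_{k} \frac{\partial E}{\partial x_{ik}(t)}\,\frac{\partial^e x_{ik}(t)}{\partial \mu_i} = \sum_{t}\sum_{k} \frac{\partial E}{\partial x_{ik}(t)}\,[x_{i,k-1}(t-1) - x_{ik}(t-1)]$$ (Eq.117)

In Figure 4.4 the set of equations that describe the backpropagation method for
the gamma model is summarized. In Figure 4.4 for convenience the notation

$\varepsilon_{ik}(t) \equiv \frac{\partial E}{\partial x_{ik}(t)}$ and $\delta_i(t) \equiv \frac{\partial E}{\partial net_i(t)}$ is used for the backpropagation errors.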

State equations (run from $t = 0$ to $T$, $i = 1$ to $N$, $k = 0$ to $K$):

$$x_i(t) = \sigma_i\Big(\sum_{j}\sum_{k=0}^{K} w_{ijk}\,x_{jk}(t)\Big) + I_i(t)$$

$$x_{ik}(t) = (1-\mu_i)\,x_{ik}(t-1) + \mu_i\,x_{i,k-1}(t-1)$$

Backpropagation equations (run from $k = K$ to $0$, $i = N$ to $1$, $t = T$ to $0$):

$$\varepsilon_i(t) = -e_i(t) + \mu_i\,\varepsilon_{i1}(t+1) + \sum_{j>i} w_{ji}\,\delta_j(t)$$

$$\varepsilon_{ik}(t) = (1-\mu_i)\,\varepsilon_{ik}(t+1) + \mu_i\,\varepsilon_{i,k+1}(t+1) + \sum_{j>i} w_{jik}\,\delta_j(t)$$

$$\delta_i(t) = \varepsilon_i(t)\,\sigma_i'(net_i(t))$$

Error gradients:

$$\frac{\partial E}{\partial w_{ijk}} = \sum_{t} \delta_i(t)\,x_{jk}(t)$$

$$\frac{\partial E}{\partial \mu_i} = \sum_{t}\sum_{k} \varepsilon_{ik}(t)\,[x_{i,k-1}(t-1) - x_{ik}(t-1)]$$

Figure 4.4 Backpropagation-through-time equations for the gamma
neural model.

The temporal aspect of gamma backpropagation impacts the use of this
procedure substantially. Similarly to regular backpropagation, the backpropagation
network is of the same complexity as the forward pass net. Note that this algorithm is
not local in time: the backpropagation errors can only be computed after a complete
epoch has ended (at $t = T$). Thus real-time learning is excluded as well. Additionally, it
follows that the states $x_{ik}(t)$ and errors $e_i(t)$, $\varepsilon_i(t)$ and $\delta_i(t)$ must be stored for the entire
epoch. Thus the storage requirements scale by $O(NKT + N^2K)$ (first term for $x_{ik}(t)$ and
second term for $w_{ijk}$). Obviously this limits the applicability of this algorithm to a
small epoch size $T$.
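The backward-in-time recursion can be illustrated on the smallest possible case. The Python sketch below assumes (for illustration only) a single gamma memory driven by an external signal and read out by one linear output unit, $y(t) = \sum_k w_k x_k(t)$, so that the backprop error of the output is $\delta(t) = -e(t)$. The function names are hypothetical; the tap recursion and the $\mu$ gradient follow the form of Eq.115 and Eq.117.

```python
import numpy as np

def forward(signal, w, mu):
    """Gamma memory of order K = len(w) - 1 with a linear readout y = x @ w."""
    K = len(w) - 1
    T = len(signal)
    x = np.zeros((T, K + 1))
    x[:, 0] = signal
    for t in range(1, T):
        for k in range(1, K + 1):
            x[t, k] = (1.0 - mu) * x[t - 1, k] + mu * x[t - 1, k - 1]
    return x, x @ w

def bptt_gradients(signal, d, w, mu):
    """Return (dE/dw, dE/dmu) for E = 0.5 * sum_t (d[t] - y[t])**2."""
    x, y = forward(signal, w, mu)          # all states of the epoch are stored
    T, K1 = x.shape                        # K1 = K + 1 taps
    delta = -(d - y)                       # backprop error of the linear output
    eps = np.zeros((T + 1, K1 + 1))        # padded: eps[T, :] = eps[:, K+1] = 0
    grad_mu = 0.0
    for t in range(T - 1, 0, -1):          # backwards in time
        for k in range(K1 - 1, 0, -1):     # taps K down to 1
            eps[t, k] = ((1.0 - mu) * eps[t + 1, k]
                         + mu * eps[t + 1, k + 1]
                         + w[k] * delta[t])
            grad_mu += eps[t, k] * (x[t - 1, k - 1] - x[t - 1, k])
    grad_w = x.T @ delta                   # sum_t delta(t) * x_k(t)
    return grad_w, grad_mu
```

Note that the array `x` holds the taps of the whole epoch, which is precisely the $O(KT)$ storage burden per unit discussed above.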

There is another disadvantage associated with deep backpropagation paths. Recall

that the backpropagated error signal traverses the transposed network in reverse

direction. The backprop errors hold an estimation of the sensitivity of the total error

with respect to a change in the local activation. If the system parameters are not close

to the optimal values, the backpropagation pass will soon degrade the accuracy of the

backprop errors. Also, in dense networks, the backprop errors will disperse through the

network and hence degrade other error estimates as well. Thus, for fast adaptation,

backpropagation paths should be kept as short as possible.

In practice, when there is a natural temporal boundary, as in word classification

problems, BPTT is a good choice. For typical real time learning applications, as in

prediction or system identification, most researchers apply RTRL. Yet at this time both

methods are restricted to relatively small problems when processed by sequential

machines. The computational cost of RTRL is excessive for large networks, while the

application of BPTT is hampered by increasing memory requirements over time. The

results of this section have been summarized in Figure 4.6. In the next section methods

to overcome the sharp increase in computational cost when neural nets are used in a

temporal environment are investigated.

4.4 The Focused Gamma Net Architecture

In this chapter general exact gradient descent adaptive procedures for the

gamma neural net have been studied. We have come up with two methods, the

backpropagation-through-time algorithm and the real-time-recurrent-learning

procedure. The renewed interest lately in neural net research has been largely propelled

by the application of the backpropagation method. Indeed, for static networks, the

backpropagation method provides an algorithmic approach to solve a large area of

problems that were previously not approachable due to the amount of computation

involved. This advantage is not so obvious when we generalize BP to dynamic

networks such as the gamma model. This procedure is very restricted in the sense that

the storage requirements grow linearly with time. The alternative method, real-time-

recurrent-learning, imposes a constant (in time) load on the computational resources.

Yet the application of RTRL is restricted to small networks.

So what is then the status of gradient descent learning in dynamic neural nets?

For small applications, the methods described so far have been applied quite

successfully. Currently, research is concentrated on how to adapt the procedures

described here such that larger problems can be attacked with reasonable

computational cost. Two strategies are prevailing in this search. One area of research

focuses on approximate error gradient computation with reduced complexity as

compared to the exact methods described here. For instance, Williams has developed

the truncated backpropagation-through-time procedure (Williams and Zipser, 1991).

This algorithm is less accurate than BPTT but the memory requirements are constant

over time since the backpropagation pass involves a fixed number of time steps. The

other way to speed up error gradient computation is to prewire the network architecture

in order to reduce the complexity of the learning algorithm. In this section we propose

a restricted gamma net architecture, the focused gamma net. The focused gamma net is

inspired by Mozer's efforts to design an efficient dynamic neural net architecture

(Mozer, 1989).

Next the architecture of the focused gamma net is introduced. Some of the

characteristics of this structure are discussed, followed by a derivation of the error

gradients. The specific net architecture allows a very efficient hybrid approach to

gradient computation.

4.4.1 Architecture

The focused gamma net is schematically drawn in Figure 4.5. Assume the 6-dimensional input signal $I(t)$. The past of this signal is represented in a gamma memory

structure as described by

$$x_i(t) = I_i(t)$$ (Eq.118)

$$x_{ik}(t) = (1-\mu_i)\,x_{ik}(t-1) + \mu_i\,x_{i,k-1}(t-1)$$ (Eq.119)

where $t = 0,\ldots,T$, $i = 1,\ldots,6$ and $k = 1,\ldots,K$. This layer, the input layer, has 6 memory
parameters $\mu_i$. The activations in the input layer are mapped onto a set of output nodes

by way of a (non-linear) static strictly feedforward net. The nodes in the feedforward
net are indexed 6+1 through $N$. Thus, this map can be written as

$$x_i(t) = \sigma_i\Big(\underbrace{\sum_{6<j<i} w_{ij}\,x_j(t)}_{\text{from feedforward net}} + \underbrace{\sum_{j \le 6}\sum_{k} w_{ijk}\,x_{jk}(t)}_{\text{from input layer}}\Big)$$ (Eq.120)

For convenience Eq.120 will be written as

$$x_i(t) = \sigma_i\Big(\sum_{j<i}\sum_{k} w_{ijk}\,x_{jk}(t)\Big)$$ (Eq.121)

where we have utilized the notation $x_{i0}(t) \equiv x_i(t)$ and $w_{ij0} \equiv w_{ij}$.

Similar architectures have been used by Stornetta et al. (1988) and Mozer
(1989). These investigators however only used a first-order memory structure (K = 1).
Mozer analyzed some of the properties of structures of this kind and coined the term
focused backpropagation architecture. It turns out that the focused network
architecture enjoys a number of advantages in comparison to the fully connected

dynamic networks.

Let us first derive the update equations for the weights $w_{ijk}$. The

backpropagation method will be used. As before, the derivations are based on the

performance index $E = \frac{1}{2}\sum_{t,m} [e_m(t)]^2$ and the evaluation order is determined by the
list $L = [\{\mu_i\}, \{w_{ijk}\}, \{x_{ik}(t)\}]$. We have already discussed in section 4.2.2 how

to apply backpropagation to feedforward nets. Thus, applying Werbos' formula for
ordered derivatives to the activations xi(t) in the feedforward net leads to the following

backpropagation system:

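$$\varepsilon_i(t) = -e_i(t) + \sum_{j>i} w_{ji}\,\delta_j(t)$$ (Eq.122)

$$\delta_i(t) = \sigma_i'(net_i(t))\,\varepsilon_i(t),$$ (Eq.123)

where we defined $\varepsilon_i(t) \equiv \frac{\partial E}{\partial x_i(t)}$ and $\delta_i(t) \equiv \frac{\partial E}{\partial net_i(t)}$. Similarly, it follows from

section 4.2.2 that the gradients $\frac{\partial E}{\partial w_{ijk}}$ can be computed as

$$\frac{\partial E}{\partial w_{ijk}} = \sum_{t} \delta_i(t)\,x_{jk}(t).$$ (Eq.124)

Figure 4.5 The focused gamma net architecture (a gamma memory input layer feeding a static feedforward net)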

Note that since the mapping network is static and feedforward, we do not need to
backpropagate through time in order to find the backprop errors 8i(t). In fact, since the

$\delta_i(t)$'s are computed in real-time, it is straightforward to convert Eq.124 into a real-time procedure by

$$\frac{\partial E}{\partial w_{ijk}}(t) = \delta_i(t)\,x_{jk}(t)$$

Application of backpropagation to compute the error gradients with respect to

the parameters $\mu_i$ leads to a backpropagation-through-time procedure, since the input

layer is recurrent in nature. In most networks however, the number of memory
parameters is relatively small so it is efficient to use the direct method here. This
procedure has already been derived for the more general gamma nets in section 4.3.1.

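Hence, without explanation we derive the error gradients $\frac{\partial E}{\partial \mu_i}$ as follows:

$$\frac{\partial E}{\partial \mu_i}(t) = \sum_{m} \frac{\partial E}{\partial x_m(t)} \sum_{k} \frac{\partial x_m(t)}{\partial x_{ik}(t)}\,\frac{\partial x_{ik}(t)}{\partial \mu_i} = -\sum_{m} e_m(t)\,\sigma_m'(net_m(t)) \sum_{k} w_{mik}\,\alpha_i^k(t)$$

where $\alpha_i^k(t) \equiv \frac{\partial x_{ik}(t)}{\partial \mu_i}$. $\alpha_i^k(t)$ can be computed by evaluation of Eq.105.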

An important property is that the backpropagation path is short since the
feedforward net is static. The error estimates in the dynamic input layer do not
disperse during training since the gamma memory structures do not have lateral
connections in the input layer. This property is confirmed by considering Eq.105,
which propagates the errors through time. Thus the error estimates do not disperse in

the focused gamma net, which explains the adjective "focused".
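The hybrid procedure can be sketched for the simplest focused structure, a linear one-layer net with a single gamma memory. The Python fragment below is an illustration under our own assumptions: one input signal, one linear output unit (so that $\delta(t) = -e(t)$), separate step sizes for $w$ and $\mu$, and a clipping of $\mu$ to the interval [0.05, 1], which is our own safeguard and not part of the derivation. The target is generated by a hypothetical teacher memory with $\mu = 0.5$; all names are hypothetical.

```python
import numpy as np

def gamma_taps(signal, mu, K):
    """Taps x[t, k] of a gamma memory (Eq.119), zero initial conditions."""
    x = np.zeros((len(signal), K + 1))
    x[:, 0] = signal
    for t in range(1, len(signal)):
        for k in range(1, K + 1):
            x[t, k] = (1.0 - mu) * x[t - 1, k] + mu * x[t - 1, k - 1]
    return x

def train_online(signal, d, K, mu0, eta_w=0.01, eta_mu=0.005):
    """Real-time adaptation of the weights w_k and the memory parameter mu.

    Weights use the instantaneous gradient delta(t) * x_k(t); mu uses the
    sensitivities alpha_k(t) propagated by Eq.105.
    """
    w = np.zeros(K + 1)
    mu = mu0
    x = np.zeros(K + 1)               # taps x_k(t)
    alpha = np.zeros(K + 1)           # alpha_k(t) = d x_k(t) / d mu
    errors = np.zeros(len(signal))
    for t, s in enumerate(signal):
        x_prev, a_prev = x.copy(), alpha.copy()
        x[0] = s                      # tap 0 is the input, alpha[0] stays 0
        for k in range(1, K + 1):
            x[k] = (1.0 - mu) * x_prev[k] + mu * x_prev[k - 1]
            alpha[k] = ((1.0 - mu) * a_prev[k] + mu * a_prev[k - 1]
                        + x_prev[k - 1] - x_prev[k])
        e = d[t] - w @ x
        errors[t] = e
        grad_w = -e * x               # instantaneous gradient w.r.t. w_k
        grad_mu = -e * (w @ alpha)    # instantaneous gradient w.r.t. mu
        w = w - eta_w * grad_w
        # clipping is an assumption of this sketch to keep the memory stable
        mu = float(np.clip(mu - eta_mu * grad_mu, 0.05, 1.0))
    return w, mu, errors

rng = np.random.default_rng(0)
signal = rng.normal(size=4000)
target = gamma_taps(signal, mu=0.5, K=2) @ np.array([0.5, 0.3, 0.2])
w, mu, errors = train_online(signal, target, K=2, mu0=0.8)
```

No states need to be stored beyond the current taps and sensitivities, so the memory load of this sketch is constant in time.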

Gamma net architecture: N units, memory order K, T time steps

                      RTRL           BPTT           FOCUSED
complexity: space     $O(N^3K)$      $O(NKT)$       O(ONK)
complexity: time      $O(N^4KT)$     $O(N^2KT)$     O(N'K2T)
local in space        no             yes            yes
local in time         yes            no             yes

Figure 4.6 A complexity comparison of gradient descent
learning procedures for the gamma net.

In this architecture we have taken advantage of the particular characteristics of

both the BPTT and RTRL adaptive procedures. Since the feedforward net is static, the

very efficient backpropagation procedure is used to update the weights $w_{ijk}$. The input

layer of the focused net is dynamic however and as a result, application of

backpropagation would introduce the burden of time-dependent storage requirements

and error dispersion during the backward pass. RTRL on the other hand is a real-time

procedure that is tailored to application in small dynamic networks. Thus RTRL is used

in the recurrent input layer to compute the error gradients to the memory parameters $\mu_i$.

The focused gamma model is not as general as a fully connected architecture.

Thus, certain dynamic input-output maps can not be computed by the focused

architecture. For example, this representation assumes that the output can be encoded

as a static map of (the past of) the input pattern. Yet, some very interesting architectures

can be created in this framework. Mozer (1989) and Stornetta et al. (1988) have

obtained promising results in word recognition experiments using a first-order memory


focused gamma net. Note that a linear one-layer focused gamma net generalizes

Widrow's adaline structure.



5.1 Introduction

In this chapter experimental simulation results for the gamma model are

presented. The goals for the simulation experiments are the following:

1. How does the gamma model perform when it is applied to various temporal

processing protocols? In particular, we are interested in an experimental comparison to

alternative neural network architectures.

2. How well do the adaptation algorithms that were derived in chapter 4

perform? The following questions are interesting in this respect and will be addressed

in this chapter. How well does the gradient descent procedure for the focused gamma

net work? Can we learn the weights w? Can we learn the gamma memory parameters

$\mu_i$? How does the adaptation time for the focused gamma net compare to alternative

neural net models?

With respect to the first goal, simulation experiments for problems in

prediction, system identification, temporal pattern classification and noise reduction

were selected. All experiments were carried out by members of the CNEL group. We

used 386-based DOS personal computers and 68030- and 68040-CPU based NeXT

computers for all simulations. The programming language was C.

For all neural net simulations we used a version of the focused gamma neural

net. The focused gamma net is a very versatile structure as it reduces to a time-delay

neural net when μ is fixed to 1. Also, a one-layer linear focused gamma net reduces to

adaline(μ). More complex architectures are certainly possible but in this work we are

mainly interested in a comparative evaluation of the gamma memory structure per se.

The topic of designing complex globally recurrent neural net architectures with gamma

memory is not addressed here. Also, experimental evaluation of neural networks in

relation to alternative non-neural processing techniques is not presented here (with the

exception of the noise reduction experiments). The latter topic has been studied by a

special DARPA committee (DARPA, 1988).

Before the experimental results are presented, some general practical issues

concerning gamma net simulation and adaptation are discussed.

5.2 Gamma Net Simulation and Training Issues

The system architecture that is used in the experiments is shown in Figure 5.1.

The source signal to be processed is denoted by s(t). Both the neural net input

signal and the desired signal d(t) are derived from s(t). The particular form of this

transformation depends on the processing goal. A subset of the neural net states x(t),

the outputs, are measured and compared to the desired signals. The difference signal,

e(t) = d(t) - x(t), is called the instantaneous error signal and it is used as the input to the

steepest descent training procedure.

Figure 5.1 Experimental architecture.
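The role of the error signal in real-time steepest descent can be illustrated for the simplest case. The sketch below assumes a single linear output y = sum_i w[i] x[i], the instantaneous error measure E = e^2/2 with e = d - y, and a step size eta. Under those assumptions the update is w[i] += eta * e * x[i], applied after every sample. The function name sgd_update is hypothetical and the sketch covers only the weights, not the memory parameters.

```c
#include <stddef.h>

/* One real-time steepest descent update for a linear output unit.
 *   w   : weights, updated in place
 *   x   : current memory tap values (the unit's inputs)
 *   n   : number of weights
 *   d   : desired output for this sample
 *   eta : step size (learning rate)
 * Returns the instantaneous error e = d - y computed before the update. */
double sgd_update(double w[], const double x[], size_t n, double d, double eta)
{
    double y = 0.0;
    for (size_t i = 0; i < n; ++i)
        y += w[i] * x[i];

    double e = d - y;

    /* dE/dw[i] = -e * x[i] for E = e^2 / 2, so descending the gradient
     * adds eta * e * x[i] to each weight. */
    for (size_t i = 0; i < n; ++i)
        w[i] += eta * e * x[i];

    return e;
}
```

Presenting the same pattern twice with a small eta should give a smaller error on the second presentation, which is a quick sanity check for the sign of the update.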

5.2.1 Gamma Net Adaptation

The training strategy of the gamma neural net deserves more attention. In all

cases the network parameters w and μ were adapted using the focused backpropagation

method as derived in chapter 4. We used the simple steepest descent update method,

that is, Δw = -η ∂E/∂w. In all experiments, we used real-time updating. Thus, the weights

were adapted after each new sample. The stepsize (learning rate) η is an important

parameter. For large η the adaptation algorithm may become unstable, while a small η

leads to slow adaptation. We were not so much interested in optimizing the speed of

adaptation. A value between 0.01 and 0.1 for η provided in all cases a stable adaptation

phase. Another central problem is when to halt adaptation. Let us assume that the

network is trained by presentation of a set of pattern pairs, where each pair consists of

an input pattern and the corresponding target pattern. This set of patterns is called the

training data set. The presentation of all patterns from the training set is called an
epoch. In the experimental setting of this work, we obtain the training set by selecting

an appropriate source signal segment. The performance index (total error) for the

training set as a function of the epoch number provides a good measure as to how well

the neural net is able to model the training set. However, accurate modeling of the

training set is not the goal of adaptation. The idea of adaptation by exemplar patterns

over time is to present a good representation of the problem at hand to the neural net,

which after adaptation is able to extrapolate the information contained in the training

set to new input patterns. Thus, it is a good habit to test how well the neural net is able

to generalize to an additional set of patterns that are not used for adaptation. This set

of patterns is called the validation set. In general, whereas the total error for the training

set decreases as adaptation progresses, this is not necessarily the case for the

performance index of the validation set (Hecht-Nielsen, 1990). In practice it has been

found that gradient descent procedures first adapt to the gross features of the training

set. As training progresses, the system starts to adapt to the finer features of the training

set. The fine details of the training set very often do not represent features of the

problem, as is the case when the training data is corrupted by noise. When the system

starts to adapt to model the training data specific noise, the total error for the validation

data usually increases. This is the time when adaptation should be stopped. In this way,

the stop criterion detects when the adaptation process has reached a maximal

performance with respect to extrapolation to other patterns (not from the training set)

that are representative for the problem task. In all of the experiments that are presented

here we have used this strategy to determine when to stop training.

As an example, consider the learning curves as displayed in Figure 5.2. The

normalized square error for both the training and validation set is plotted as a function

of the epoch number. This example was taken from an elliptic filter modeling

experiment that will be discussed in section 5.4. In our experiments we stop training

(or detect convergence) if the normalized error for the validation data set increases over

four consecutive epochs.
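The stop criterion can be sketched as a small monitor that is fed the validation error once per epoch. This is an illustrative sketch with hypothetical names (stop_monitor, stop_monitor_update). It assumes that "increases over four consecutive epochs" means four successive epoch-to-epoch rises, and that any epoch without a rise resets the count.

```c
/* Tracks consecutive increases of the validation error across epochs. */
typedef struct {
    double prev_error;  /* validation error of the previous epoch */
    int    have_prev;   /* 0 until the first epoch has been recorded */
    int    increases;   /* length of the current run of increases */
    int    patience;    /* run length that triggers the stop (4 in the text) */
} stop_monitor;

void stop_monitor_init(stop_monitor *m, int patience)
{
    m->prev_error = 0.0;
    m->have_prev = 0;
    m->increases = 0;
    m->patience = patience;
}

/* Record the validation error of the epoch that just finished.
 * Returns 1 if adaptation should stop, 0 otherwise. */
int stop_monitor_update(stop_monitor *m, double val_error)
{
    if (m->have_prev && val_error > m->prev_error)
        m->increases += 1;
    else
        m->increases = 0;  /* a decrease or a tie breaks the run */

    m->prev_error = val_error;
    m->have_prev = 1;
    return m->increases >= m->patience;
}
```

In a training loop the monitor would be initialized with a patience of 4 and queried after each pass over the validation set. Keeping a copy of the parameters from the epoch with the lowest validation error is a common companion to this rule, although the text does not say whether that was done here.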


[Plot: normalized total error versus epoch no. for the training and validation sets, with the epoch at which convergence is detected marked.]

Figure 5.2 Normalized total error for training set and validation
set in sinusoidal prediction experiment. Convergence is detected
after 4 successive increases of error in validation set.

Another important issue when considering neural net training is the adaptation

time. The adaptation time is the number of patterns (or epochs for batch learning) that

have to be presented to the neural net before the weights converge. There is not much

theory about the adaptation time of backpropagation algorithms. However the question

whether and how the value of μ affects the adaptation time can be experimentally

tackled. This problem was studied for a third-order elliptic filter modeling problem,

which is covered in more detail in section 5.4. The architecture was an adaline(μ)

structure with K=3. The adaptation time expressed in the number of samples was

measured as a function of μ and the results are plotted in Figure 5.3. The plot shows

that the adaptation time is nearly unaffected if μ is greater than 0.2. This is rather good

news, although we have not been able to establish explicit formulae for the adaptation

time dependence on μ.






Figure 5.3 Adaptation time as a function of μ for the
elliptic filter modeling experiment in section 5.4

In the next sections the experimental results with respect to application of the

gamma model to temporal processing problems are discussed.

5.3 (Non-)linear Prediction of a Complex Time Series

5.3.1 Prediction/Noise Removal of Sinusoids Contaminated by Gaussian Noise

We constructed an input signal consisting of a sum of sinusoids, contaminated

by additive white gaussian noise (AWGN). Specifically, I(t) was described by

I(t) = sin(π(0.06t + 0.1)) + 3 sin(π(0.12t + 0.45)) + 1.5 sin(π(0.2t + 0.34))
+ sin(π(0.4t + 0.67)) + AWGN.

The signal-to-noise ratio is 10 dB. This signal is shown in Figure 5.4.


Figure 5.4 (a) The sinusoidal signal plus AWGN (SNR = 10 dB). (b) Power
spectrum of the contaminated signal.

The processing goal was to predict the next sample of the sum of sinusoids.

Hence, the processing problem involves a combination of prediction and noise

cancellation. The processing system was adaline(μ). The goals of this experiment are

the following:

1. Determine the optimal system performance as a function of μ for 0 < μ < 1

and K. Note that this implies a comparison of the gamma memory structure versus the

tapped delay line (for μ=1) and the context-unit memories (for K=1).

2. Can the system parameters w_k and μ be adapted to converge to the optimal values?


A training set consisting of 300 samples was selected. After a run on the training

data, the system was run on a validation set, a different signal segment of 300 samples.

The noise in the validation set is sample-by-sample different from the noise in the

training set, but the statistics of both noise sources are the same. The system was

adapted until convergence for various values of K and μ. The parameter μ was swept over

the domain [0,1] using a step size Δμ = 0.1. The normalized performance index after

training is displayed versus μ in Figure 5.5.

Clearly, the first-order gamma memory with μ = 0.1 outperforms even the fifth-order

adaline structure. For this experiment, the context-unit memory with μ = 0.1 is optimal.

In a next experiment we let μ be adaptive. The system is initialized with μ=1.

The performance curves in Figure 5.5 seem to increase monotonically over the range

μ_opt to μ=1. Thus, a gradient descent algorithm with initial μ=1 should in principle

converge at the optimal μ. However, the simple structure of the performance surface is

[Plot: performance curves labeled K=2, K=3 and K=4; horizontal axis from 0 to 1 in steps of 0.1.]

Figure 5.5 The normalized performance index versus μ after
training the adaline(μ) structure to predict sinusoids
contaminated by white gaussian noise.

