DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS
By
MUSTAFA CAN OZTURK
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2007
© 2007 Mustafa Can Ozturk
To my parents, Necmi Ozturk and Saliha Ozturk, who made everything I have achieved possible,
for their enduring love, support and generosity.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my supervisory committee chair, Dr. Jose
Principe, for his inspiring ideas, constant support, and guidance. His enthusiasm for life and
imaginative way of understanding concepts inspired me to learn different aspects of science and
engineering. Without him, this dissertation would not have been possible.
Secondly, I would like to thank Dr. John Harris, for serving on my supervisory committee
and offering wise insights into academia and other aspects of life. I also wish to thank the other
members of my supervisory committee, Dr. Arunava Banerjee and Dr. Thomas DeMarse, for their
insightful comments and suggestions.
During my studies, I gained much from the interactive environment at the Computational
NeuroEngineering Laboratory. All of my colleagues helped me by sharing their knowledge and
inspirational ideas. I especially thank Dongming Xu for his help and long hours of productive
discussions. Most of this dissertation was a result of collaboration with Aysegul Gunduz,
Nicholas Dedual, Rodrigo Sachhi, Sohan Seth, Johan Nyqvist, Kyu-Hwa Jeong, Ismail Uysal,
Harsha Sathyendra, and Mark Skowronski. I would like to thank each one of them for their expertise
and insightful comments.
I would like to specifically thank my wonderful friends, Jianwu Xu, Rui Yan, and Anant
Hegde. Our deep philosophical conversations and sleepless nights working on projects are
unforgettable. They were certainly more than friends to me. I will always remember them with
great respect.
I would like to thank my big family for their constant support and trust in me. Everything I
have achieved is possible only with their love, encouragement, and guidance.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Literature Review of Dynamic Computation
        Short-term Memory Structures
        Network Architectures for Dynamical Computation
    Literature Review of Echo State Networks
    Thesis Goals

2 ANALYSIS OF ECHO STATE NETWORKS
    Echo States as Bases and Projections
    ESN Dynamics as a Combination of Linear Systems
    Average State Entropy as a Measure of the Richness of the ESN Reservoir

3 DESIGN OF THE DYNAMICAL RESERVOIR
    Design of the Recurrent Connections
    Design of the Adaptive Bias
    Experiments
        Short-term Memory Structures
        Binary Parity Check
        System Identification

4 DESIGN OF THE READOUT
    Design of the Readout for Function Approximation
    Design of the Readout for Dynamical Pattern Recognition
        Linear Associative Memories
        The MACE Filter
        The ESN/LSM MACE for Dynamical Pattern Recognition
        The Design of ESNs for Dynamical Pattern Recognition
    Experiments
        Classification of temporal signals with unknown peak values
        Robust signal detection in a digital communication channel
        Classification of odors with an electronic nose
        Spike train classification with LSM

5 IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS
    Transiently Stable Computation
        Conventional Echo State Networks
        Transiently Stable Computation with Echo State Networks
        Understanding Transiently Stable Computation
    Freeman's K Sets for Dynamical Computation
        An Overview of the Freeman Model
        Dynamical Computation with the Freeman Model with a Readout
    Experiments
        Function approximation
        K0 set as the readout for function approximation
        Autonomous sequence generation with output feedback
        Binary parity check
        Multi-attractor learning

6 APPLICATIONS
    Complex Echo State Networks
        Channel Equalization
    Brain Machine Interfaces
        Experimental Setup
        Methods
    Forecasting Water Inflow Using the Echo State Network
        Water Inflow Problem
        Results
    Spike Detection with Echo State Networks
        Problems in Spike Detection
            Threshold detection
            Matched filter
        Spike Detection with ESN-MACE
        Robust signal detection in a digital communication channel

7 CONCLUSIONS
    Understanding Echo State Networks
    Designing Echo State Networks
    Computation at the Edge of Chaos
    Transiently Stable Computation
    Echo State Networks for Temporal Pattern Recognition
    Channel Equalization with Complex Echo State Networks
    Freeman Model
    Applications
        Brain Machine Interfaces
        Prediction of Water Inflow
        Spike Detection
        Matched Filtering
    Possible Future Directions

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

6-1 Averaged correlation coefficients between the actual hand trajectory and the model outputs computed over non-overlapping windows of 5 seconds
6-2 Performance comparison of the models for water inflow prediction
LIST OF FIGURES

1-1 Delay line memory of order N
1-2 Dynamic network with a short-term memory structure followed by a static mapper
1-3 Generic structure for universal myopic mapping theorem, bank of filters followed by a static network
1-4 Echo state network (ESN). ESN is composed of two parts: a fixed-weight (W) recurrent network and a linear readout
2-1 Performance of ESNs for different realizations of W with the same weight distribution. Results show that the spectral radius of W is not the unique parameter that determines the performance of an ESN
2-2 Pole tracks of the linearized ESN when the input goes through a cycle
2-3 Echo states and state entropy for different ESNs
2-4 The ASE values obtained from 50 realizations of ESNs with the same spectral radius
3-1 Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly generated ESNs with different sparseness
3-2 The k-delay STM capacity of each ESN for delays 1, ..., 40 computed using the test signal
3-3 Number of wrong decisions made by each ESN for m = 3, ..., 8 in the binary parity check problem
4-1 Interpretation of echo states as a 2-D image
4-2 Performance comparisons of ESNs for the classification of signals with unknown peak values
4-3 Time series representing the response pattern of the 32 electronic nose sensors exposed to rosemary
4-4 Comparison of the ESN-MACE filter output and MACE filter output trained on the input space for the rosemary class
4-5 Comparison of the correct classification rates of LSM-MACE and LSM with linear readout trained in the spike train classification problem as the parameter h varies
5-1 Demonstration of a typical response of ESN with a spectral radius of 0.9 and application to function approximation
5-2 Demonstration of a typical response of transiently stable ESN with a spectral radius of 1.1 and application to function approximation
5-3 Movement of the poles for ESN with spectral radius 0.9
5-4 Movement of the poles for ESN with spectral radius 1.1
5-6 A full KII network
5-7 Freeman Model as an associative memory
5-8 Nullcline graph of the reduced KII set
5-9 Function approximation with Freeman Model
5-10 Overlay plot of the desired signal and output signal of the Freeman Model with K0 readout
5-12 Binary parity check with Freeman Model
5-13 Sample input-output signals for multi-attractor learning
5-14 Desired and output signals during training for multi-attractor learning with Freeman Model
5-15 Desired and output signals during testing for multi-attractor learning with Freeman Model
6-1 Block diagram of a communication system with nonlinear dispersive channel
6-2 Constellation diagram of 16-QAM. A) Input symbols. B) Output symbols
6-3 SER vs. SNR plot for four networks
6-4 Constellation diagram of rectangular QPSK. A) Input symbols. B) Output symbols
6-5 SER vs. SNR plot for four networks
6-6 Comparison of three methods in brain machine interface modeling. Windowed correlation coefficients in all spectral bands
6-7 Training performance of ESN for the water inflow prediction at Furnas hydroelectric power plant
6-8 Testing performance of the ESN for the water inflow prediction at Furnas hydroelectric power plant
6-9 Spike shapes recorded with high SNR
6-10 Spike shapes in the rat dataset. Numerous distinct spike waveforms are present in this dataset
6-11 Segment of neural recording from a rat's brain
6-12 Comparison of ROC curves for the rat dataset
6-13 Comparison of ROC curves for ESN-MACE and the linear matched filter under Gaussian and impulsive noise conditions
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS
By
Mustafa Can Ozturk
May 2007
Chair: Jose C. Principe
Major: Electrical and Computer Engineering
Echo state networks (ESNs) were recently proposed as a new recurrent neural network
(RNN) paradigm. The ESN couples dynamics with computation in novel ways by conceptually
separating the RNN into two parts: a recurrent topology of nonlinear PEs that constitutes a
"reservoir of rich dynamics" and an instantaneous linear readout. The interesting property of the
ESN is that only the memoryless readout is trained, whereas the recurrent topology has fixed
connection weights. This reduces the complexity of RNN training to simple linear regression
while preserving a recurrent topology, but it obviously places important constraints on the
overall architecture that have not yet been fully studied.
The present design of the fixed ESN parameters relies on the selection of the maximum
eigenvalue of the system linearized around zero (the spectral radius). However, this procedure
does not quantify in a systematic manner the performance of the ESN in terms of approximation
error. In this study, we proposed a functional space approximation framework to better
understand the operation of ESNs, and proposed an information-theoretic metric (the average
entropy of the echo states) to assess the "richness" of the ESN dynamics. We also provided an
interpretation of the ESN dynamics, rooted in system theory, as families of coupled linearized
systems whose poles move according to the input signal dynamics. With this interpretation, we
put forward a design methodology for functional approximation where ESNs are designed with
uniform pole distributions covering the frequency spectrum to abide by the "richness" metric,
irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the
modeling error, configures the ESN spectral radius to the input-output joint space. Function
approximation examples compare the proposed design methodology with the conventional design.
On further investigating the use of ESNs for dynamical pattern recognition, we postulated
that ESNs are particularly well suited for dynamical pattern recognition, and we proposed a
linear associative memory (LAM) as a novel readout for ESNs. From the class of LAMs, we
adopted the minimum average correlation energy (MACE) filter because of its high rejection
characteristics, which allow its use as a detector in the automatic pattern recognition literature.
In the ESN application, the MACE interprets the states of the ESN as a two-dimensional image:
one dimension is time and the other is the processing element index (space). An optimal
template image for each class, which associates ESN states with the class label, can be
analytically computed using training data. During testing, the ESN states are correlated with
each template image, and the class label of the template with the highest correlation is assigned
to the input pattern. The ESN-MACE combination leads to a nonlinear template matcher with
robust noise performance, as needed in non-Gaussian, nonlinear digital communication channels.
We used a real-world chemical sensing experiment with an electronic nose to demonstrate the
power of this approach. The proposed readout can also be used with liquid state machines,
eliminating the need to convert spike trains into continuous signals by binning or low-pass
filtering.
We applied ESNs to interesting real-world problems such as brain machine interface
design, water inflow prediction, detection of action potentials in neural recordings, matched
filtering in digital communications, and equalization of a nonlinear channel, and compared their
performance to other standard techniques. We proposed ESNs for signal processing in the
complex domain. The use of ESNs in the complex domain is very convenient since system
training is equivalent to simple linear regression, which is trivial in the complex domain. The
derivatives of the nonlinear activation functions are never necessary since the recurrent part is
fixed a priori.
We showed that the Freeman model of the olfactory cortex can be considered in the same
framework as the ESN. This work provided two important contributions to Freeman networks.
First, it emphasized the need for optimal readouts, and showed how to derive them adaptively.
Second, it showed that the Freeman model is able to process continuous signals with temporal
structure.
We also investigated the dynamics of ESNs without the echo state condition. This
investigation led to a novel computational mode for nonlinear systems with sigmoidal
nonlinearity, which does not require global stability. In this mode, although the autonomous
system is unstable, the input signal forces the system dynamics to become "transiently stable".
Function approximation experiments showed that the transiently stable ESN without the echo
state condition is still capable of useful computation.
CHAPTER 1
INTRODUCTION
Literature Review of Dynamic Computation
Static networks are those whose response depends only on the present stimulus.
Multilayer perceptrons (MLPs), radial basis function (RBF) networks, and self-organizing maps
(SOMs) are examples of static neural network models that deal with static input patterns. These
models have been well studied in function spaces, for regression and classification, using
statistical principles (Haykin, 1998).
statistical principles (Haykin, 1998). On the other hand, time is an essential ingredient in many
real-world problems, from cognitive tasks (such as vision and speech processing) to engineering
problems (such as system identification, noise cancellation, and channel equalization). These
real-world problems require the extraction of information embedded in the temporal structure of
signals. To incorporate time into the operation of a network, memory must be built into it. To be
precise, memory here stands for short-term memory: the ability to remember the recent past,
which provides the contextual information from the history of the time series needed to
unambiguously identify the current stimulus.
Short-term Memory Structures
The general linear discrete-time memory structure of order N (Figure 1-1A) is also known
as the generalized tapped delay line memory: each block has the transfer function G(z), or
impulse response g(n) (Haykin, 2001). It is a single-input multiple-output system whose taps are
filtered versions of the input signal.
Basically, the memory reconstructs the state of the system that created the time series in a
sufficiently large state space where time is implicit, creating a trajectory that has a one-to-one
correspondence (a unique inverse) to the time series.
Figure 1-1. Delay line memory of order N. A) Generalized tapped delay line. B) Tapped delay
line.
The quality of a memory structure is measured by two quantities: depth (D) and resolution
(R) (Principe et al., 1993). Memory depth measures how long the memory can hold information
about the past of a time series. Memory resolution is the amount of detail in the information the
memory holds.
memory holds. Formally, memory depth is defined as
n=0
where gN(n) is the overall impulse response of the memory obtained from N successive
convolutions of g(n). Resolution is defined as the number of taps in the memory structure per
unit time. For a linear memory structure with a fixed number of taps, the product of memory
depth and resolution is a constant equal to the number of taps N. This creates a tradeoff between
memory depth and resolution, and the choice of G(z) determines the values of R and D within
this tradeoff (Principe et al., 1993).
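The first-moment definition of depth can be checked numerically. The sketch below is our own illustration, not from the dissertation; the function name `memory_depth` and the truncation lengths are arbitrary choices. A cascade of N pure delays yields depth N, while a cascade of N gamma sections with feedback parameter μ yields depth N/μ:

```python
import numpy as np

def memory_depth(g, N, length=500):
    """First moment of the overall impulse response gN(n),
    obtained from N successive convolutions of the section response g(n)."""
    gN = g
    for _ in range(N - 1):
        gN = np.convolve(gN, g)
    gN = gN[:length]
    n = np.arange(len(gN))
    return np.sum(n * gN)

# Tapped delay line section: g(n) = delta(n - 1)  ->  depth = N
g_delay = np.zeros(10)
g_delay[1] = 1.0
print(memory_depth(g_delay, N=5))        # 5.0

# Gamma section with mu = 0.5: g(n) = mu * (1 - mu)^(n-1), n >= 1  ->  depth = N / mu
mu = 0.5
g_gamma = np.zeros(60)
g_gamma[1:] = mu * (1 - mu) ** np.arange(59)
print(round(memory_depth(g_gamma, N=5, length=400), 3))   # 10.0
```

With the same number of taps (N = 5), lowering μ doubled the depth, illustrating the depth-resolution tradeoff discussed above.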
The time delay embedding (Takens, 1981), which can be implemented by a simple delay
line, is the most commonly used form of short-term memory (Figure 1-1B). The use of time
delays for memory is also biologically motivated, since signal delays are ubiquitous in the brain
and essential in biological information processing (Braitenberg, 1990). The memory depth and
resolution of the tapped delay line are N and 1, respectively.
This restriction is relaxed in the gamma filter, which makes use of local feedback loops to
create short-term memory (Principe et al., 1993; de Vries, 1991). Each section of the gamma
memory has the transfer function

    G(z) = μ z^(-1) / (1 - (1 - μ) z^(-1)),

where μ is the adjustable feedback parameter. Note that the gamma memory reduces to the
tapped delay line for μ = 1. With the introduction of the feedback loop, stability becomes an
issue. However, overall system stability is easily guaranteed when each local feedback loop is
stable (i.e., 0 < μ < 2). The memory depth of the gamma memory is N/μ and the resolution is μ.
This means that the memory depth can be improved by choosing the feedback parameter μ less
than unity; however, this also reduces the resolution. Hence, unlike the tapped delay line, the
gamma memory can improve its memory depth without increasing the number of taps, by
sacrificing resolution (Principe et al., 1993; de Vries, 1991).
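The gamma memory implied by G(z) can be run with a simple state recursion: each tap obeys x_k(n) = (1 - μ) x_k(n - 1) + μ x_{k-1}(n - 1), with x_0(n) = u(n). The sketch below is our own minimal illustration (the function name and bookkeeping are not from the source):

```python
import numpy as np

def gamma_memory(u, N, mu):
    """Run an N-tap gamma memory over input u.
    Tap update: x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1),
    with x_0(n) = u(n).  mu = 1 recovers the tapped delay line."""
    x = np.zeros(N + 1)                  # x[0] holds the current input
    taps = np.zeros((len(u), N))
    for n, sample in enumerate(u):
        x[1:] = (1 - mu) * x[1:] + mu * x[:-1]   # uses previous-step states
        x[0] = sample
        taps[n] = x[1:]
    return taps

# mu = 1: an impulse marches down the line one tap per step (pure delays)
print(gamma_memory(np.array([1.0, 0, 0, 0, 0]), N=3, mu=1.0))
```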
Although the tapped delay line and the gamma filter are the most common short-term
memory structures, other structures such as the leaky integrator memory (context nodes or
memory neurons) and the Laguerre filter are also available in the literature. The time alignment
filter, obtained from the gamma memory by allowing nonhomogeneous μ values in different
taps, extends the structure in Figure 1-1A by allowing tap-dependent modulation of memory
depth and resolution (de Vries, 1991).
Network Architectures for Dynamical Computation
There are basically two ways of creating a dynamic network with short-term-memory. The
first one is to stimulate a static network (e.g., linear mapper, MLP) via a memory structure at its
input such as a tapped delay line or gamma memory (Haykin, 1998). This type of dynamical
networks is commonly used in the literature. However, there are several practical drawbacks to
this approach, which uses a spatial metaphor for time, especially when the dynamics of the time
series are rich and time varying. The alternative approach for creating memory in an implicit
manner is through the use of feedback. Feedback can be local at the single neuronal level or can
be global covering the whole network. In neural networks literature, networks with global
feedback loops are referred to as recurrent neural networks (RNN). The use of feedback provides
with very powerful systems with rich dynamical behavior whereas it also brings in practical
problems such as stability and training complexity (Haykin, 1998).
Figure 1-2. Dynamic network with a short-term memory (tapped delay line, gamma memory,
etc.) structure followed by a static mapper (linear mapper, MLP, RBF etc.)
First, we consider the explicit representation of time, where a static network (linear or
nonlinear) is transformed into a dynamic network by a preprocessor that is a short-term memory
structure. In these focused structures, the memory is restricted to the input layer of the network
(Sandberg and Xu, 1997; Haykin, 1998). There is a clear separation of responsibilities in this
approach to building a dynamical system: the memory represents time and the static network
accounts for the mapping (Figure 1-2). Basically, the memory reconstructs the state of the
system that created the time series in a sufficiently large state space, and the static network maps
the reconstructed space to the desired signal space. Moreover, the well-developed techniques of
static network training, such as least mean squares (LMS) or static back-propagation, can be
applied to the dynamic network.
The simplest structure for temporal processing in the form of Figure 1-2 is the finite
impulse response (FIR) filter, where the memory is a tapped delay line and the static network is
a linear mapper. It has been widely used in adaptive signal processing applications due to its
simplicity and the existence of effective learning algorithms (Haykin, 2001). The gamma filter is
obtained when the memory structure in the FIR filter is replaced by the gamma memory
(Principe et al., 1993; de Vries, 1991). Chains of first-order integrators are interesting because
they effectively decrease the number of delays necessary to create embeddings. The extra degree
of freedom on the memory depth, gained from local feedback, has been fruitfully utilized in
applications like system identification, echo cancellation, and speech classification (de Vries, 1991).
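A minimal sketch of such a focused structure, assuming an FIR filter (tapped delay line plus linear mapper) adapted with LMS on a toy system identification task; the step size, filter length, and target channel below are arbitrary illustrative choices:

```python
import numpy as np

def lms_fir(u, d, N, step=0.02):
    """Adapt an N-tap FIR filter (delay line + linear mapper) with LMS."""
    w = np.zeros(N)
    x = np.zeros(N)                      # tapped delay line state
    for n in range(len(u)):
        x = np.roll(x, 1)
        x[0] = u[n]                      # x = [u(n), u(n-1), ..., u(n-N+1)]
        e = d[n] - w @ x                 # instantaneous error
        w += step * e * x                # LMS weight update
    return w

# Identify an unknown 3-tap channel h from input/output data
rng = np.random.default_rng(0)
h = np.array([0.8, -0.4, 0.2])
u = rng.standard_normal(5000)
d = np.convolve(u, h)[:len(u)]
w = lms_fir(u, d, N=3)
print(np.round(w, 2))                    # w should approach h
```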
The first successful demonstration of a neural network for temporal processing was
NETtalk, devised by Sejnowski and Rosenberg in 1987 (Sejnowski and Rosenberg, 1987). They
used an MLP-based system to convert English text to phonemes. A more popular network is the
time delay neural network (TDNN), which replaces the linear mapper in the FIR filter with a
more powerful nonlinear network, the MLP (Haykin, 1998). Lang and Hinton successfully
applied the TDNN to the recognition of four isolated words: "bee", "dee", "ee", and "vee" (Lang
and Hinton, 1988). The TDNN has become one of the most popular neural network architectures
for temporal processing and has been successfully applied to time series prediction, system
identification, control, and speech processing. The structure of the TDNN can be generalized as
shown in Figure 1-3: the first block is a bank of linear filters operating in parallel on the input
stimulus, whereas the second block implements a static nonlinear network. This structure is a
universal dynamic mapper (Sandberg and Xu, 1997, page 477) according to the universal myopic
mapping theorem:
Any shift-invariant myopic dynamic map can be uniformly approximated arbitrarily well
by a structure consisting of two functional blocks: a bank of linear filters feeding a static
network.
The TDNN is a special case of the structure described in this theorem and is also a
universal mapper. However, focused structures, which convert time series into spatial signals,
have several drawbacks (Elman, 1990). First, they require an interface to the external world that
buffers the input for further processing. Second, the selection of the buffer length is not trivial,
and once selected it imposes a firm limit on the memory depth. Moreover, buffering requires all
patterns to have the same length, which is especially problematic in areas like speech
processing, where speech signals of different lengths must be processed (Elman, 1990). Finally,
two instances of the same time series that are very similar in time can be very dissimilar
(spatially distant) in the reconstructed space (Elman, 1990).
Figure 1-3. Generic structure for universal myopic mapping theorem, bank of filters followed by
a static network
An alternative approach for creating memory is by means of recurrent connections within
the network. In this approach, the system itself is dynamic, and time, integrated into the system,
is represented by the effect it has on processing. Recurrent networks are particularly powerful
for dealing with time-varying patterns and are practical for problems where the dynamics of the
considered process are complex. Various RNN architectures are available in the literature. One
of the earlier architectures, proposed by Jordan, uses memory units fed from the system output
to provide the contextual information (Jordan, 1986). The Elman network is a modification of
Jordan's, where the context units keep a copy of the hidden layer activations for the next update
(Elman, 1990). The fully connected recurrent neural network contains several feedback loops
between the processing elements (PEs) in the hidden layer (Figure 1-4). RNNs can exhibit rich
dynamical responses such as oscillations and even chaos. They have been widely used in many
applications such as system identification and control of dynamical systems (Delgado et al.,
1995; Feldkamp et al., 1998; Kechriotis et al., 1994; Principe, 2001). The computational power
of fully connected RNNs has been mathematically proven by Siegelman and Sontag (1991): all
Turing machines may be simulated by fully connected recurrent networks built from neurons
with sigmoid activation functions. However, the main problem with RNNs is the difficulty of
adapting the system weights. Various algorithms, such as back-propagation through time
(Werbos, 1990) and real-time recurrent learning (Williams and Zipser, 1989), have been
proposed to train recurrent systems. However, these algorithms suffer from a variety of
problems: computational complexity resulting in slow training, complex performance surfaces,
the possibility of instability, and the decay of gradients through the topology and time (Haykin,
1998). The problem of decaying gradients has been addressed with special PEs (Hochreiter and
Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman
filtering (Singhal and Wu, 1989; Puskorius and Feldkamp, 1996) and the multi-streaming
training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled
practical applications.
Literature Review of Echo State Networks
Recently, a new recurrent network paradigm has been proposed by Jaeger under the name
of echo state networks (ESNs) (Jaeger, 2001; Jaeger, 2002; Jaeger and Haas, 2004). ESNs aim at
addressing the problems of RNN training by separating the RNN architecture into two parts: a
fixed recurrent topology, the dynamical reservoir, and an adaptive memoryless readout network.
The input signal is fed to the dynamical reservoir, which, when properly dimensioned, contains
information about the history of the input and/or output patterns (Figure 1-4). The outputs of the
internal PEs (the echo states) are fed to a memoryless but adaptive readout network (generally
linear) that reads the reservoir and produces the network output. The interesting property of the
ESN is that only the memoryless readout is trained, whereas the recurrent topology (W) has
fixed connection weights. This reduces the complexity of RNN training to simple linear
regression while preserving a recurrent topology, but it obviously places important constraints
on the overall architecture that have not yet been fully studied.
Figure 1-4. Diagram of an echo state network (ESN). An ESN is composed of two parts: a
fixed-weight (W) recurrent network and a linear readout. The recurrent network is a
reservoir of highly interconnected dynamical components whose states are called
echo states. The memoryless linear readout is trained to produce the output.
Figure 1-4 depicts an ESN with M input units, N internal PEs, and L output units. The value
of the input units at time n is u(n) = [u1(n), u2(n), ..., uM(n)]^T, of the internal units
x(n) = [x1(n), x2(n), ..., xN(n)]^T, and of the output units y(n) = [y1(n), y2(n), ..., yL(n)]^T.
The connection weights are given in an NxM weight matrix W^in = (w_ij^in) for connections
between the input and the internal states, in an NxN matrix W = (w_ij) for connections between
the internal PEs, in an LxN matrix W^out = (w_ij^out) for connections from the internal units
to the output units, and in an NxL matrix W^back = (w_ij^back) for the connections that
project back from the output to the internal units
(Jaeger, 2001). The activation of the internal PEs (echo states) is updated according to

x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n))     (1-1)

where f = (f1, f2, ..., fN) are the internal units' activation functions. Generally, all fi's are
hyperbolic tangent functions, f(x) = (e^x - e^-x)/(e^x + e^-x).
The readout is mostly a simple linear regressor network whose output is computed according to
y(n+1) = W^out x(n+1)     (1-2)
When direct connections from the input units to the output units are used, the state vector is
concatenated with the input vector to calculate the output.
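The update and readout equations above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the dimensions, the weight ranges, the sinusoidal probe input, and the seed are hypothetical choices, not values prescribed by the text, and the readout is left untrained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative, hypothetical dimensions: M inputs, N internal PEs, L outputs.
M, N, L = 1, 100, 1

W_in = rng.choice([-1.0, 1.0], size=(N, M))   # input-to-reservoir weights (fixed)
W = rng.normal(size=(N, N)) * 0.05            # recurrent weights (fixed)
W_back = np.zeros((N, L))                     # output feedback (zero here)
W_out = np.zeros((L, N))                      # readout: the only trained weights

def update(x, u_next, y):
    # Equation 1-1: x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)), with f = tanh
    return np.tanh(W_in @ u_next + W @ x + W_back @ y)

def readout(x_next):
    # Equation 1-2: y(n+1) = W^out x(n+1)
    return W_out @ x_next

x, y = np.zeros(N), np.zeros(L)
for n in range(200):
    u_next = np.array([np.sin(2 * np.pi * (n + 1) / 20)])
    x = update(x, u_next, y)
    y = readout(x)   # stays zero until W_out is trained
```

Only W_out would be adapted during training; W_in, W, and W_back stay fixed, which is the defining property of the ESN.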
A basic necessary property of the ESN reservoir is the echo state property, which states that
for the ESN learning principle to work, the reservoir must asymptotically forget the input history
(input forgetting property). It has been shown in (Jaeger, 2001) that the input forgetting property
is equivalent to state forgetting; that is, the reservoir must forget its initial state after a
sufficiently long time. The echo state condition can be linked to the spectral radius (the largest
among the absolute values of the eigenvalues of a matrix, denoted by ||.||) of the reservoir's
weight matrix, namely ||W|| < 1. In fact, this condition states that the dynamics of the ESN are
uniquely controlled by the input, and the effect of the initial states vanishes.
The design of the fixed connections of an ESN is based on the selection of the spectral radius.
The reservoir is randomly and sparsely connected with a spectral radius suitable for the given
problem. Sparse connectivity allows less coupling between the PEs, encouraging the development
of individual dynamics and increasing the diversity of the overall reservoir dynamics.
Thesis Goals
ESNs, although recently proposed, have proven very successful in applications such as
system identification, chaotic time series prediction, and channel equalization. However, the
state of the art is still immature. We see a few major shortcomings with the current ESN
approach. First, the properties of the reservoir are poorly characterized. It is a
mystery how a network with mostly random connections can be successful in solving difficult
problems. A better understanding of the operation of the dynamical reservoir is vital for improving
ESN theory. Second, imposing a constraint only on the spectral radius for ESN design is a weak
condition to properly set the parameters of the reservoir, as experiments show (different
randomizations with the same spectral radius perform differently for the same problem). Third,
there is a fundamental problem with the way ESNs operate. The use of fixed reservoir
parameters for function approximation means that the information about the desired response is
conveyed only to the output projection. This is not optimal, and strategies to select different
reservoirs for different applications have not been devised.
We aim to address these problems by proposing a framework, a metric and a design
principle for ESNs. We first deal with the analysis of ESNs, explaining the proposed framework
and metric. The framework is a signal processing interpretation of bases and projections in
functional spaces to describe and understand the ESN architecture. According to this
interpretation, the reservoir states implement a set of basis functionals (representation space)
constructed dynamically by the input, while the readout simply projects the desired response
onto this representation space. The metric to describe the richness of the ESN dynamics is an
information theoretic quantity, the average state entropy (ASE). Entropy measures the amount of
information contained in a given random variable (Shannon, 1948). Here, the random variable is
the instantaneous echo state from which the entropy for the overall state (vector) is estimated.
Due to the time dependency of the states, the state entropy averaged over time (ASE) will be
used as the quantifier to measure the richness of reservoir dynamics. We also interpret the ESN
dynamics as a combination of time varying linear systems obtained from the linearization of the
ESN nonlinear PE in a small local neighborhood of the current state.
We then propose a design methodology for the ESN reservoir in the light of the analysis
tools developed for ESNs. According to this design principle, one should independently consider
the correlation among the bases and the spectral radius. In the absence of any information about
the desired response, the ESN reservoir states should be designed with the highest ASE,
independent of the spectral radius. The poles of the linearized ESN reservoir should have
uniform pole distributions to generate echo states with the most diverse pole locations (which
correspond to the uniformity of time constants). Effectively this will create the least correlated
bases for a given spectral radius, which corresponds to the largest volume spanned by the basis
set. When the designer has no other information about the desired response to set the basis, this
principle distributes the system' s degrees of freedom uniformly in space. It approximates for
ESNs the well known property of orthogonal bases. The unresolved issue that ASE does not
quantify is how to set the spectral radius, which depends again upon the desired mapping. A
simple adaptive bias is added at the ESN input to control the spectral radius, integrating the
information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias
adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume
of the state manifold as required to approximate the desired response with a small error. We
show that ESNs designed with this strategy obtain systematically better results in a set of
experiments when compared with the conventional ESN design.
We discuss the readouts for ESNs for function approximation and pattern recognition. The
standard linear readout is used for function approximation tasks such as system identification,
time series prediction, and channel equalization. We thoroughly investigate the use of ESN for
temporal pattern recognition. The standard linear readout can be used for pattern recognition
with ESNs by minimizing the mean-square error (MSE) with respect to a label. Classification of
time series is different from the classification of static patterns since there is a structure between
the samples of the time series over time. The dynamical classification problem can be reduced to
a static one by treating each sample of the time series individually. In this case, the time series is
not treated as a whole and temporal information between samples is ignored. Moreover, one
class label is associated with each time instant; therefore, post classification such as majority
voting has to be applied. An alternative to this approach is embedding the time series to populate
a short-term history of the pattern. Again, a static mapper can be used to classify the embedded
pattern. Another major difficulty in dynamical pattern recognition is how to design the label,
which should also be a time series. One straightforward method is to use +1 or -1 as the desired
response throughout the application of the input pattern. However, the difficulty with this
method is that the desired response is forced to be constant independent of the input signal
dynamics. An alternative powerful technique is to create a one-step predictor for each class in
order to capture the dynamics of the class generating the time-series (Zahalka and Principe,
1993). Then, during testing, the test pattern is presented to all predictors and the label of the
predictor with the least prediction error is assigned as the class of the input pattern. We also
propose an alternative readout for ESNs to be used in classification tasks that does not require a
desired response. The goal is to design an ESN/LSM readout that will recognize a class of inputs
that differ by some quantity (e.g. amplitude or shape considered as a distortion parameter). The
proposed readout, called the minimum average correlation energy (MACE) filter, is adopted
from the optical pattern recognition literature, where it is used for the recognition of a given
object in 2-dimensional images in the presence of noise, geometric distortions, and pose changes
(Mahalanobis et al., 1987). Instead of using a desired response, the MACE filter creates a
template from echo state responses corresponding to the training patterns of a given class. A
MACE is trained for each class. During testing, the unknown time signal is fed to the ESN and
the states are correlated with each template image and the class label of the template with the
highest correlation is assigned to the input pattern.
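The test-time decision rule just described can be sketched as follows. This is a deliberately simplified stand-in: the true MACE filter synthesizes its template by minimizing average correlation energy in the frequency domain (Mahalanobis et al., 1987), whereas here the template is just a class average, so only the correlate-and-pick-the-maximum step is illustrated. All names and the toy data are ours.

```python
import numpy as np

def make_template(state_matrices):
    # Simplified template for one class: the average of the echo-state
    # responses (each a T x N matrix) to that class's training patterns.
    # (The true MACE template instead minimizes average correlation energy.)
    return np.mean(state_matrices, axis=0)

def classify(states, templates):
    # Correlate the unknown pattern's echo-state response with each class
    # template and assign the label of the template with the highest score.
    scores = {label: float(np.sum(states * tpl))
              for label, tpl in templates.items()}
    return max(scores, key=scores.get)

# Toy usage with two hypothetical class templates (T = 5 steps, N = 3 states).
templates = {"A": np.ones((5, 3)), "B": -np.ones((5, 3))}
print(classify(0.5 * np.ones((5, 3)), templates))  # -> A
```

In practice one template is built per class from the training responses, as the text describes, and the unknown pattern is scored against all of them.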
We investigate the implications of the ESN idea for other networks. First, we investigate the
echo state condition (||W|| < 1) in a system-theoretic framework to understand the effect of
system stability on the power of dynamical systems for computation. For this reason, we relax the
echo state condition by allowing it to be slightly larger than 1. This introduces a new dynamical
regime, called "transiently stable computation", where the system is autonomously unstable but
it is stabilized by an input signal of sufficient power. In this regime, function approximation
with a linear readout is still possible even though the system is not globally stable. Secondly, we
investigate the Freeman model (FM), a biologically plausible model of the olfactory cortex, and
propose a readout for the FM so that it can be used as a universal computer. Without the
proposed readout, the use of the FM is limited to simple digit recognition, where the patterns are
static and binary. We will demonstrate with experiments that FM coupled with a readout can
process continuous-valued signals. An interesting property of the FM is the nonlinear function
used in the model: unlike the sigmoidal nonlinearity used in ESNs, the FM nonlinearity does not
have its largest slope at the zero operating point.
Various applications of ESN for real-world problems are investigated. We first utilize
ESNs for signal processing in the complex domain and compare complex ESN with other models
for complex signal processing in a nonlinear channel equalization problem in a digital
communication channel. Secondly, we propose ESNs to model brain-machine interfaces and
compare them with the linear Wiener filter. Brain-machine interfaces aim at predicting the hand
position of a primate using brain signals. Thirdly, we apply ESNs to predict the water inflow
at a hydro power plant using the previous values of water inflow levels. Then, we tackle two
temporal pattern recognition problems using ESNs with the MACE filter readout. In the spike
detection problem, the goal is to detect action potentials in a noisy neural recording. The second
problem deals with the design of a detector similar to a matched filter in a communication
channel. We compare ESN-MACE filter to the matched filter in a baseband additive noise
channel under different noise distributions.
CHAPTER 2
ANALYSIS OF ECHO STATE NETWORKS
The ESN is a recurrent network paradigm consisting of a recurrently connected reservoir
network stimulated by an input signal, and an instantaneous, adaptive readout that combines
the reservoir states to form the desired signal. ESNs have been shown to be very powerful in
many applications such as chaotic time series prediction, channel equalization, and speech
recognition. However, it remains a mystery how a network with mostly random connections can
be so successful in solving difficult problems. This chapter aims to shed light on this basic
mystery by analyzing the operation of ESNs. The first subsection proposes a framework,
a signal processing interpretation of basis functions and projections in functional spaces,
to describe and understand the ESN architecture. In the second subsection, we interpret the ESN
dynamics as a combination of linear systems. According to this interpretation, when the system
operating point varies over time with the influence of the input signal, local ESN dynamics
change. The third subsection introduces an information-theoretic metric, called average state
entropy, to quantify the computational power of a dynamical network for function
approximation. The ASE metric combined with the idea of bases and projections will lead to a
design procedure for ESNs that will be discussed in the next chapter.
Echo States as Bases and Projections
Let us revisit the recursive update equation of an ESN. The activation of the internal PEs is
updated according to
x(n+1) = f(W^in u(n+1) + W x(n) + W^back y(n)),     (2-1)

where f = (f1, f2, ..., fN) are the internal units' activation functions. The output from the linear
readout network is computed according to

y(n+1) = W^out x(n+1).     (2-2)
ESNs resemble the RNN architecture proposed in (Puskorius and Feldkamp, 1994) and
also used by (Sanchez, 2004) in brain machine interfaces. The critical difference is the
dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We
submit that the ideas of approximation theory in functional spaces (bases and projections), so
useful in optimal signal processing (Principe, 2001), should be utilized to understand the ESN
architecture. Let h(u(t)) be a real-valued function of a real-valued vector
u(t) = [u1(t), u2(t), ..., uM(t)]^T.
In functional approximation, the goal is to estimate the behavior of h(u(t)) as a
combination of simpler functions phi_i(t), called the basis functionals, such that its approximant,
h^(u(t)), is given by

h^(u(t)) = sum_i a_i phi_i(t)
Here, the a_i's are the projections of h(u(t)) onto each basis functional. One of the central questions in
practical functional approximation is how to choose the set of bases to approximate a given
desired signal. In signal processing, the choice normally goes for a complete set of orthogonal
bases, independent of the input. When the basis set is complete and can be made as large as
required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic
idea is to derive the set of bases from the input signal through a multilayered architecture.
Consider a single hidden layer TDNN with N PEs and a linear output. The hidden layer PE
outputs can be considered a set of non-orthogonal basis functionals dependent upon the input
phi_i(u(t)) = g(sum_j b_ij u_j(t))
Here, the b_ij's are the input layer weights, and g is the PE nonlinearity. The approximation produced
by the TDNN is then
h^(u(t)) = sum_i a_i phi_i(u(t))     (2-3)

where the a_i's are the weights of the output layer. Notice that the b_ij's adapt the bases and the
a_i's adapt the projection in the projection space (readout). Here the goal is to restrict the number of bases
(number of hidden layer PEs) because their number is coupled with the number of parameters to
adapt, which impacts generalization, training set size, etc. Usually, since all of the parameters of
the network are adapted, the best bases in the joint (input and desired) space as well as the best
projection can be achieved, which represents the optimal solution. The output of the TDNN is a
linear combination of its internal representations, but to achieve a basis set (even if non-orthogonal),
linear independence among the phi_i(u(t))'s must be enforced. The effect of the nonlinear
transformation of the hidden PEs on the input signal has been investigated by many authors.
Oh has shown that the correlation among the weighted sums decreases after they pass through the
sigmoid nonlinear function, which can be approximated by piecewise linear functions (Oh and
Lee, 1994). Shah and Poon showed that there exist weight values for an MLP such that the outputs
of its hidden layer PEs are linearly independent and, hence, form a complete set of basis functions
(Shah and Poon, 1999).
The ESN (and the RNN) architecture can also be studied in this framework. The reservoir
states of Equation 2-1 correspond to the basis set which are recursively computed from the input,
output and previous states through Win, W and Wback. Notice however that none of these weight
matrices are adapted, that is, the functional bases in the ESN are uniquely defined by the input
and the initial selection of weights. In a sense, ESNs trade the adaptive connections in the
RNN hidden layer for a brute-force approach of creating fixed, diversified dynamics in the hidden
layer.
For an ESN with a linear readout network, the output equation (y(n+1) = W^out x(n+1))
has the same form as Equation 2-3, where the phi_i(u(t))'s and a_i's are replaced by the echo states
and the readout weights, respectively. These weights are adapted using the training data, which
means that the ESN is able to find the optimal projection in the projection space, just like the
RNN or the TDNN.
It is interesting that a similar perspective of basis and projections for information
processing in biological networks has been proposed by (Pouget and Sejnowski, 1997). They
explored the possibility that the response of single neurons in parietal cortex serve as basis
functions for the transformations from the sensory input to the motor responses. They proposed
that "the role of spatial representations is to code the sensory inputs and posture signals in a
format that simplifies subsequent computation, particularly in the generation of motor
commands" (Pouget and Sejnowski, 1997).
The central issue in ESN design is exactly the nonadaptive nature of the basis set.
Parameter sets in the reservoir that provide linearly independent states and possess a given
spectral radius may define drastically different projection spaces because the correlation among
the bases is not constrained. A simple experiment was designed to demonstrate that the selection
of the echo state parameters by constraining the spectral radius is not the most suitable for
function approximation. Consider a 100-unit echo state network where the input signal is
sin(2*pi*n/(10*pi)). Mimicking (Jaeger, 2001), the goal is to let the ESN generate the seventh power
of the input signal. Different realizations of a randomly connected 100-unit ESN were
constructed where the entries of W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025,
and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1
or -1 with equal probabilities, and W^back is set to zero. The input is applied for 300 time steps and
Figure 2-1. Performance of ESNs for different realizations of W with the same weight
distribution. The weight values are set to 0.4, -0.4, and 0 with probabilities of 0.025,
0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50
realizations, MSEs vary from 5.9x10^-9 to 8.9x10^-5. Results show that the spectral
radius of W is not the unique parameter that determines the performance of an ESN.
the echo states are calculated using Equation 2-1. The next step is to train the linear readout
when a desired training signal, d(n), is available. One method to determine the optimal output
weight matrix, W^out, in the mean square error (MSE) sense, where the MSE is defined by
E[(d(n) - y(n))^H (d(n) - y(n))], is to use the Wiener solution given by (Haykin, 2001)

W^out = E[x x^H]^-1 E[x d] ~ [(1/T) sum_n x(n)x(n)^H]^-1 [(1/T) sum_n x(n)d(n)]     (2-4)

Here E[.], T, and the superscript H denote the expected value operator, the length of the training
sequence, and the conjugate transpose of a complex vector, respectively.
the conjugate transpose of a complex vector. Figure 2-1 depicts the MSE values for 50 different
realizations of the ESNs. As observed, even though each ESN has the same sparseness and
spectral radius, the MSE values obtained vary greatly among different realizations. The
minimum MSE value obtained among the 50 realizations is 5.9x10^-9, whereas the maximum MSE
is 8.9x10^-5. This experiment demonstrates that a design strategy that is based solely on the
spectral radius is not sufficient to specify the system architecture for function approximation.
This shows that for each set of random weights that provide the same spectral radius, the
correlation or degree of redundancy among the bases will change and different performances are
encountered in practice.
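The experiment above can be reproduced in outline as follows. This is a hedged sketch: a least-squares solve stands in for the explicit inversion of Equation 2-4 (they yield the same W^out when the state correlation matrix is invertible), each realization is not rescaled to exactly the same spectral radius, and the seeds are arbitrary.

```python
import numpy as np

def run_realization(seed, N=100, T=300):
    rng = np.random.default_rng(seed)
    # Sparse reservoir: entries 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W_in = rng.choice([-1.0, 1.0], size=(N, 1))
    n = np.arange(T)
    u = np.sin(2 * np.pi * n / (10 * np.pi))
    d = u ** 7                                  # desired: seventh power of the input
    X = np.zeros((T, N))
    x = np.zeros(N)
    for t in range(T):
        x = np.tanh(W_in[:, 0] * u[t] + W @ x)  # Equation 2-1 without output feedback
        X[t] = x
    # Least-squares readout (equivalent to the Wiener solution of Equation 2-4)
    w_out = np.linalg.lstsq(X, d, rcond=None)[0]
    return float(np.mean((d - X @ w_out) ** 2))

mses = [run_realization(s) for s in range(10)]
print(min(mses), max(mses))  # the MSE varies across realizations of W
```

Running several seeds makes the point of Figure 2-1 directly: reservoirs drawn from the same weight distribution produce very different approximation errors.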
ESN Dynamics as a Combination of Linear Systems
It is well known that the dynamics of a nonlinear system can be approximated by that of a
linear system in a small neighborhood of an equilibrium point (Kuznetsov, 1998). Here, we will
perform the analysis with hyperbolic tangent PEs and approximate the ESN dynamics by the
dynamics of the linearized system in the neighborhood of the current system state. Hence, when
the system operating point varies over time, the linear system approximating the ESN dynamics
will change. We are particularly interested in the movement of the poles of the linearized ESN.
Consider the update equation for the ESN without output feedback given by
x(n+1) = f(W^in u(n+1) + W x(n))
Linearizing the system around the current state x(n), one obtains the Jacobian matrix,
J(n+1), whose (i,j)th entry is f'(net_i(n)) w_ij. Hence,

J(n+1) = F(n) W, where F(n) = diag(f'(net_1(n)), f'(net_2(n)), ..., f'(net_N(n)))     (2-5)

Here, net_i(n) is the ith entry of the vector (W^in u(n+1) + W x(n)), and w_ij denotes the (i,j)th entry
of W.
of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the
Jacobian matrix J(n+1). When the amplitude of each PE input changes, the local slope changes,
and so the poles of the linearized system are time varying, although the parameters of ESN are
fixed.
In order to visualize the movement of the poles, consider an ESN with 100 states. The
entries of the internal weight matrix are chosen to be 0, 0.4 and -0.4 with probabilities 0.9, 0.05
and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1
or -1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system and
the echo states are computed according to Equation 2-1. Then, the Jacobian matrix and the
eigenvalues are calculated using Equation 2-5. Figure 2-2 shows the pole tracks of the linearized
ESN for different input values. A single ESN with fixed parameters implements a combination
of many linear systems with varying pole locations, hence many different time constants that
modulate the richness of the reservoir dynamics as a function of the input amplitude. Higher
amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to
shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system
with a large stability margin. When the input is close to zero, the poles of the linearized ESN are
close to the maximal spectral radius chosen, decreasing the stability margin. When compared to
their linear counterpart, an ESN with the same number of states results in a more detailed
coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar
results can be obtained using signals of different shapes at the ESN input.
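The pole movement described above can be traced numerically: linearize at each step with F(n) = diag(f'(net_i(n))) and track the largest eigenvalue magnitude of J = F(n)W along one input cycle. The weight statistics follow the example in the text; the seed and the length of the probe are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.05, 0.05, 0.9])
W *= 0.95 / max(abs(np.linalg.eigvals(W)))       # rescale to spectral radius 0.95
W_in = rng.choice([-1.0, 1.0], size=(N, 1))

u = np.sin(2 * np.pi * np.arange(200) / 100)     # sinusoidal input, period 100
x = np.zeros(N)
radii = []
for t in range(1, 200):
    net = W_in[:, 0] * u[t] + W @ x              # PE pre-activations at time t
    x = np.tanh(net)                             # Equation 2-1, no output feedback
    F = np.diag(1.0 - np.tanh(net) ** 2)         # f'(net_i(n)) for f = tanh
    J = F @ W                                    # linearized system, Equation 2-5
    radii.append(float(max(abs(np.linalg.eigvals(J)))))

# The largest pole radius shrinks near the input peaks (saturation) and
# moves back out toward the design spectral radius near the zero crossings.
print(min(radii), max(radii))
```

Plotting the full eigenvalue sets instead of just their maximum magnitude reproduces the pole tracks of Figure 2-2.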
Figure 2-2. Pole tracks of the linearized ESN when the input goes through a cycle. A single ESN
with fixed parameters implements a combination of many linear systems with varying
pole locations. A) One cycle of sinusoidal signal with a period of 100. B)-E) show the
positions of poles of the linearized systems when the input values are at B, C, D, and
E in panel A. F) Cumulative pole locations show the movement of the poles as the
input changes.
Theorem: The eigenvalues of the linearized system have the largest radius when the
system state is zero.
Proof: Assume W has nondegenerate eigenvalues and corresponding linearly independent
eigenvectors. Then, consider the eigendecomposition of W, W = PDP^-1, where P is the
eigenvector matrix and D is the diagonal matrix of the eigenvalues (D_ii) of W. Since F(n) and D are
diagonal, J(n+1) = F(n)W = F(n)(PDP^-1) = P(F(n)D)P^-1 is the eigendecomposition of
J(n+1). Here, each entry of F(n)D, f'(net_i(n))D_ii, is an eigenvalue of J. Moreover,

|f'(net_i(n)) D_ii| <= |D_ii| since |f'(net_i(n))| <= |f'(0)| = 1
A key corollary of the above analysis is that the spectral radius of an ESN can be
adjusted using a constant bias signal at the ESN input without changing the recurrent connection
matrix, W. The application of a nonzero constant bias will move the operating point to regions of
the sigmoid function closer to saturation and always decrease the spectral radius due to the shape
of the nonlinearity. This property will be exploited in the design of ESNs which will be
discussed in the next chapter.
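The corollary can be checked directly. With input weights of +/-1 and the state at zero, a constant bias b drives every PE to the operating point +/-b, where the tanh slope is 1 - tanh(b)^2, so the linearized spectral radius shrinks as |b| grows. The seed and the bias values below are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.05, 0.05, 0.9])
W *= 0.9 / max(abs(np.linalg.eigvals(W)))        # design spectral radius 0.9
W_in = rng.choice([-1.0, 1.0], size=(N, 1))

def linearized_radius(bias):
    # Operating point set by a constant bias entering through the +/-1 input
    # weights; every PE sits at +/-bias, where the tanh slope is 1 - tanh(bias)^2.
    net = W_in[:, 0] * bias
    F = np.diag(1.0 - np.tanh(net) ** 2)
    return float(max(abs(np.linalg.eigvals(F @ W))))

for b in (0.0, 0.5, 1.0, 2.0):
    print(b, linearized_radius(b))   # the radius shrinks as |bias| grows
```

In this special case F reduces to (1 - tanh(b)^2) times the identity, so the linearized radius is exactly (1 - tanh(b)^2) times the design spectral radius; this is the mechanism the adaptive bias of the next chapter exploits, without touching W.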
Average State Entropy as a Measure of the Richness of ESN Reservoir
The concept of "rich dynamical reservoir" (Jaeger, 2001) has not been quantified with a
well-defined metric. Our framework of bases and projections leads to a new metric to quantify
the richness of ESN reservoir. Here, we propose the instantaneous state entropy to quantify the
distribution of instantaneous amplitudes across the ESN reservoir states. Entropy of the
instantaneous ESN states is appropriate to quantify performance in function approximation
because the ESN output is a mere weighted combination of the value of the ESN states. If the
echo state's instantaneous amplitudes are concentrated on only a few values across the ESN state
dynamic range, the ability to approximate an arbitrary desired response by weighting the states is
limited (and wasteful due to redundancy between the different states) and performance will
suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, then
it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states
appears as a good measure to quantify "richness" of dynamics with instantaneous mappers. Due
to the time structure of signals, the average state entropy (ASE), defined as the state entropy
averaged over time, will be the parameter used to quantify the diversity in the dynamical
reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the
volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the
echo state manifold spanned by the trajectories.
Renyi's quadratic entropy is employed here because it is a global measure of information,
and an efficient nonparametric estimator that avoids explicit pdf estimation has been developed
(Principe et al., 2000; Erdogmus et al., 2003; Erdogmus and Principe, 2002). Renyi's entropy
with parameter gamma for a random variable X with a pdf f_X(x) is given by (Principe et al., 2000)

H_gamma(X) = (1/(1-gamma)) log E[f_X^(gamma-1)(X)]
Renyi's quadratic entropy is obtained for gamma = 2 (for gamma -> 1, Shannon's entropy is
obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen
windowing approximates the underlying pdf by

f^_X(x) = (1/N) sum_{i=1}^{N} K_sigma(x - x_i)

where K_sigma is the kernel function with kernel size sigma. Then Renyi's quadratic entropy can be
estimated by (Principe et al., 2000; Erdogmus et al., 2003)

H_2(X) = -log[(1/N^2) sum_i sum_j K_{sigma*sqrt(2)}(x_i - x_j)]     (2-6)
The instantaneous state entropy is estimated using Equation 2-6, where the samples are the
entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T of an ESN with N internal PEs.
Results will be shown with a Gaussian kernel with kernel size chosen to be 0.3 of the standard
deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter
to quantify the approximation properties of ESNs by experimentally demonstrating that ESNs
with the same spectral radius display different ASEs.
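The estimator just described can be sketched directly from Equation 2-6: Parzen-estimate the pdf of the N state amplitudes at each time step with a Gaussian kernel whose size is 0.3 of their standard deviation, evaluate Renyi's quadratic entropy from the pairwise kernel sums, and average over time. The function names and the stand-in states are ours.

```python
import numpy as np

def state_entropy(x, kernel_frac=0.3):
    # Renyi's quadratic entropy of the entries of one state vector x(n),
    # estimated via Equation 2-6 with a Gaussian Parzen kernel whose size
    # is kernel_frac times the standard deviation of the entries.
    x = np.asarray(x, dtype=float)
    N = x.size
    sigma = kernel_frac * x.std()            # assumes non-constant entries
    diffs = x[:, None] - x[None, :]
    s2 = 2.0 * sigma ** 2                    # kernel size sigma*sqrt(2), squared
    K = np.exp(-diffs ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return float(-np.log(K.sum() / N ** 2))

def average_state_entropy(X):
    # ASE: the state entropy averaged over time; X holds x(n) in its rows.
    return float(np.mean([state_entropy(x) for x in X]))

# Stand-in echo states (hypothetical, just to exercise the estimator).
states = np.tanh(np.random.default_rng(3).normal(size=(200, 100)))
print(average_state_entropy(states))
```

The sigma*sqrt(2) kernel size in the double sum comes from convolving two Gaussian kernels of size sigma, which is what makes the estimator avoid explicit pdf evaluation.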
Let us consider the same 100-unit ESN that we used in the previous section, built with three
different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2*pi*n/20). Figure 2-3A depicts the
echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time
step using Equation 2-6 and plotted in Figure 2-3B. First, note that the instantaneous state
entropy changes over time with the distribution of the echo states as we would expect, since state
entropy is dependent upon the input signal that also changes in this case. Second, as the spectral
radius increases in the simulation, the diversity in the echo states increases. For the spectral
radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values,
which is wasteful due to the redundancy between different states. In practice, to quantify the overall
representation ability over time, we will use ASE, which takes values -0.735, -0.007 and 0.335,
for the spectral radii of 0.2, 0.5 and 0.8, respectively. Hence, ASE is affected by the spectral
radius of the W matrix as we would expect. Moreover, even for the same spectral radius of 0.5,
several ASEs are possible. Figure 2-4 shows ASEs from 50 different realizations of ESNs with
the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the
reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are
obtained for other inputs as long as the input dynamic range is properly selected.
Figure 2-3. Echo states and state entropy for different ESNs. A) Outputs of the echo states (100 PEs)
produced by systems with spectral radii of 0.2, 0.5, and 0.8, from top to bottom,
respectively. B) Instantaneous state entropy calculated using Equation 2-6 for the echo
states in panel A.
Figure 2-4. The ASE values obtained from 50 realizations of ESNs with the same spectral radius of 0.5.
Maximizing ASE means that the diversity of the states over time is the largest and should
provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a
guarantee that the ESN so designed will perform the best, because the basis set is independent of
the desired response and the application may require a small spectral radius. However, when the
desired response is not accessible for the design of the ESN bases or when the same reservoir is
to be used for a number of problems, the default strategy should be to maximize the ASE of the
state vector.
CHAPTER 3
DESIGN OF THE DYNAMICAL RESERVOIR
In ESNs, reservoir weights are selected randomly under the constraint of the echo state
condition (Jaeger, 2001). However, such a random selection scheme is not optimal, as
demonstrated in Chapter 2. In this chapter, we propose a scheme for the design of the
reservoir weights that will lead to small approximation errors for a variety of tasks. For optimal
approximation with a finite number of basis functionals, both the projection space (bases) and
the projector (readout) require knowledge of the input and desired responses. However, in
ESNs, the projection space is determined solely by the architecture of the ESN reservoir and the
input signal (since W is fixed), without any knowledge of the space spanned by the desired target
signal. The selection of basis functions with the knowledge of input signal only is an ill-posed
problem. However, the selection of the reservoir weights must still be done using some rule, and
here we hypothesize that a good design strategy is to let the ESN states cover with equal
resolution the projection space to anticipate any possible mapping requirement (dictated by the
unknown desired response).
Design of the Recurrent Connections
According to the interpretation of ESNs as coupled linear systems, the design of the
internal connection matrix, W, will be based on the distribution of the poles of the linearized
system around zero state. Our proposal is to design the ESN such that the linearized system has
uniform pole distribution inside the unit circle of the z-plane. With this design scenario, the
system dynamics will include uniform coverage of time constants arising from the uniform
distribution of the poles, which also decorrelates the basis functionals as much as possible. This
principle was chosen by analogy to the identification of linear systems using Kautz filters
(Kautz, 1954), which shows that the best approximation of a given transfer function by a
finite-order linear system is achieved when the poles are placed in the neighborhood of the
spectral resonances. When no information is available about the desired response, or when one
would like to use the same reservoir for a variety of tasks, we should spread the poles uniformly
to anticipate good approximation of arbitrary mappings.
We again use a maximum entropy principle to distribute the poles uniformly inside the unit
circle. The constraints, the unit circle as the boundary and the complex-conjugate pole locations,
are easy to include in the pole distribution (Thogula, 2003). The poles are first initialized at
random locations; the quadratic Renyi's entropy is calculated using Equation 2-6, and the poles
are moved such that the entropy of the new distribution increases over iterations (Erdogmus et
al., 2003). This method efficiently finds a uniform coverage of the unit circle for an arbitrary
number of poles; however, any other method that yields uniform coverage can be used. Notice
that this operation has to be done only once for a given number of poles, which is equal to the
number of hidden PEs.
The system with the uniform pole locations can be interpreted using linear system theory.
The poles that are close to the unit circle correspond to many sharp band pass filters specialized
in different frequency regions whereas the inner poles realize filters of larger frequency support.
Moreover, different orientations (angles) of the poles create filters of different center
frequencies.
Now, our problem is to construct a weight matrix from the pole locations. This is
equivalent to the problem of designing W when its eigenvalues are known. In principle we
would like to create a sparse matrix, so we started with the sparsest matrix (with an inverse)
which is the direct canonical structure given by (Kailath, 1980)
     [ -a_1  -a_2  ...  -a_(N-1)  -a_N ]
     [   1     0   ...      0       0  ]
W =  [   0     1   ...      0       0  ]                                (3-1)
     [  ...   ...  ...     ...    ...  ]
     [   0     0   ...      1       0  ]
The characteristic polynomial of W is
l(s) = det(sI - W) = s^N + a_1 s^(N-1) + a_2 s^(N-2) + ... + a_N = (s - p_1)(s - p_2)...(s - p_N) (3-2)

where the p_i's are the eigenvalues and the a_i's are the coefficients of the characteristic polynomial of W.
Here, we know the pole locations of the linear system obtained from the linearization of the
ESN, so using Equation 3-2, we can obtain the characteristic polynomial and construct the W
matrix in the canonical form using Equation 3-1. We will call the ESN constructed based on the
uniform pole principle the ASE-ESN. All other possible solutions with the same eigenvalues can
be obtained as Q^-1 W Q, where Q is any nonsingular matrix.
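The construction above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the dissertation's code: the poles are sampled at random inside a disk of radius 0.9 (a uniform-coverage placement would come from the entropy iteration described earlier), in complex-conjugate pairs so that the characteristic polynomial of Equation 3-2 has real coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30  # number of hidden-unit PEs (equal to the number of poles)

# Sample N/2 poles inside a disk of radius 0.9 and add their complex
# conjugates, so the characteristic polynomial has real coefficients.
half = []
while len(half) < N // 2:
    p = rng.uniform(-1, 1) + 1j * rng.uniform(-1, 1)
    if abs(p) < 0.9:
        half.append(p)
poles = np.array(half + [np.conj(p) for p in half])

# Coefficients a_1 ... a_N of the characteristic polynomial (Equation 3-2);
# np.poly returns [1, a_1, ..., a_N].
a = np.real(np.poly(poles))[1:]

# Direct canonical (companion) form of Equation 3-1: the first row holds
# the -a_i coefficients and the subdiagonal holds ones.
W = np.zeros((N, N))
W[0, :] = -a
W[1:, :-1] = np.eye(N - 1)

# The eigenvalues of W recover the chosen pole locations.
radius = np.max(np.abs(np.linalg.eigvals(W)))
```

Replacing the random sampling with the entropy-maximized pole locations yields the ASE-ESN weight matrix directly.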
To corroborate our hypothesis, we would like to show that the linearized ESN designed
with the internal connection weight matrix having the eigenvalues uniformly distributed inside
the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with
random internal connection weight matrices. We consider an ESN with 30 states and use
our procedure to create the uniform-pole linearized ASE-ESN matrix for spectral radii in
[0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices under different
sparseness constraints. This corresponds to a weight distribution taking the values 0, c and -c
with probabilities p1, (1 - p1)/2 and (1 - p1)/2, where p1 defines the sparseness of W and c is a
constant whose value depends on the desired spectral radius. We also created W matrices with
values uniformly distributed between -1 and 1 (U-ESN), and scaled them to obtain a given
spectral radius (Jaeger and Haas, 2004). Then, for each W matrix, we run the ESN with a
sinusoidal input of period 20 and calculate the ASE.
Figure 3-1. Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly
generated ESNs with different sparseness. As observed from the figure, the ASE-ESN
with uniform pole distribution generates higher ASE on the average for all spectral
radii compared to ESNs with random connections.
Figure 3-1 compares the ASE values obtained using different ESNs over 1000 realizations.
As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE
on the average for all spectral radii compared to ESNs with sparse and uniform random
connections. This approach is indeed conceptually similar to Jeffreys' maximum entropy prior
(Jeffreys, 1946): it will provide a consistently good response for the largest class of problems.
Concentrating the poles of the linearized system in certain regions of the space only provides
good performance if the desired response has energy in that part of the space, as is well known
from the theory of Kautz filters (Kautz, 1954).
Design of the Adaptive Bias
In conventional ESNs, only the output weights are trained, optimizing the projections of the
desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed,
the basis functions are only input dependent. However, since function approximation is a
problem in the joint space of the input and desired signals, a penalty in performance will be
incurred. Motivated by the linearization analysis, which shows the crucial importance of the
operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a
single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that,
according to the linearization analysis with the tanh nonlinearity, the bias can only reduce the
spectral radius. The information for the adaptation of the bias is the training MSE, which
modulates the spectral radius of the system with information derived from the approximation
error. With this simple mechanism, some information from the input-output joint space is
incorporated into the definition of the projection space of the ESN. The beauty of this method is
that the spectral radius can be adjusted by a single parameter that is external to the system,
without changing the reservoir weights.
The training of bias can be easily accomplished. Indeed, since the parameter space is only
one dimensional, a simple line search method can be efficiently employed to optimize the bias.
Among different line search algorithms, we will use a search that uses Fibonacci numbers in the
selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the
maximum number of evaluations needed to reduce the interval of uncertainty to within the
prescribed length. In our problem, a bias value is picked according to Fibonacci search. For each
value of bias, training data is applied to the ESN and the echo states are calculated. Then, the
corresponding optimal output weights and the objective function (MSE) are evaluated to pick the
next bias value.
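As a concrete illustration, a generic Fibonacci search over a scalar parameter can be sketched as follows. The objective `f` stands in for the training loop described above (run the ESN with bias b, solve for the output weights, return the MSE); here it is exercised with a toy unimodal function. This is a sketch of the standard algorithm, not code from the dissertation.

```python
def fibonacci_search(f, lo, hi, n=20):
    """Minimize a unimodal f on [lo, hi] using n + 1 evaluations.

    After each evaluation the interval of uncertainty shrinks by a ratio
    of consecutive Fibonacci numbers, which is optimal in the minimax sense.
    """
    F = [1, 1]
    while len(F) <= n + 1:
        F.append(F[-1] + F[-2])
    # interior points at Fibonacci ratios of the current interval
    x1 = lo + (F[n - 1] / F[n + 1]) * (hi - lo)
    x2 = lo + (F[n] / F[n + 1]) * (hi - lo)
    f1, f2 = f(x1), f(x2)
    for k in range(n - 1):
        if f1 <= f2:          # minimum lies in [lo, x2]
            hi = x2
            x2, f2 = x1, f1
            x1 = lo + (F[n - k - 2] / F[n - k]) * (hi - lo)
            f1 = f(x1)
        else:                 # minimum lies in [x1, hi]
            lo = x1
            x1, f1 = x2, f2
            x2 = lo + (F[n - k - 1] / F[n - k]) * (hi - lo)
            f2 = f(x2)
    return 0.5 * (lo + hi)

# Toy stand-in for the bias-to-MSE mapping of the ESN training loop
best_b = fibonacci_search(lambda b: (b - 2.0) ** 2, 0.0, 5.0, n=20)
```

In the bias-optimization setting, each call to `f` is expensive (a full pass over the training data plus a least-squares solve), which is exactly why a search minimizing the number of evaluations is attractive.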
Alternatively, gradient-based methods can be utilized to optimize the bias due to their
simplicity and low computational cost. The system update equation with an external bias signal,
b, is given by

x(n+1) = f(W^in u(n+1) + W^b b + W x(n))

The update equation for b follows from the gradient of the cost with respect to the bias,

dO(n+1)/db = -e W^out dx(n+1)/db

dx(n+1)/db = f'(W^in u(n+1) + W^b b + W x(n)) (W^b + W dx(n)/db)
Here, O is the MSE defined previously. This algorithm may suffer from problems similar to
those observed when training recurrent networks with gradient methods. However, we observed
that the performance surface is rather simple. Moreover, since the search parameter is one
dimensional, the gradient vector can only assume one of two directions. Hence, imprecision in
the gradient estimate may affect the speed of convergence but normally does not change the
gradient direction.
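A minimal sketch of this gradient update is given below, under two simplifying assumptions not stated in the text: the bias enters through a coupling vector W^b of all ones, and the readout W^out is held fixed (in practice it would be re-solved as the bias changes). The sensitivity dx/db is propagated forward alongside the state.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
W = 0.1 * rng.normal(size=(N, N))        # fixed reservoir weights
W_in = rng.choice([0.1, -0.1], size=N)   # input weights
W_b = np.ones(N)                         # bias coupling vector (assumed all ones)
W_out = 0.1 * rng.normal(size=N)         # readout, kept fixed here for simplicity

b, eta = 0.0, 0.01                       # adaptive bias and learning rate
x = np.zeros(N)
dx_db = np.zeros(N)                      # state sensitivity dx/db

for n in range(200):
    u = np.sin(2 * np.pi * n / 20)
    d = np.sin(2 * np.pi * (n + 1) / 20)             # toy desired signal
    net = W_in * u + W_b * b + W @ x
    x = np.tanh(net)
    # forward sensitivity recursion: dx/db = f'(net) * (W_b + W dx_prev/db)
    dx_db = (1.0 - x ** 2) * (W_b + W @ dx_db)
    e = d - W_out @ x                                # instantaneous error
    # gradient descent on the squared error: dO/db = -e W_out dx/db
    b += eta * e * (W_out @ dx_db)
```

Because tanh' = 1 - tanh^2, the derivative factor is available for free from the already-computed state.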
Experiments
This section presents a variety of experiments in order to test the validity of the ESN
design scheme proposed in the previous section.
Short-term Memory Structures
Modeling a dynamical system from input-output data requires access to a sufficiently long
input history, since the output of the dynamical system depends not only on the current value of
the input signal but also on its history. In signal processing, the common approach to overcome
this difficulty is to embed the input signal using a tapped delay line (as in the FIR filter or
TDNN). RNNs provide a different type of embedding through the recurrent connections between
the PEs. However, the length of the memory achieved is not as clear as in the tapped delay line.
This experiment compares the short-term memory (STM) capacity of ESNs with the same
spectral radius using the framework presented in (Jaeger, 2002). Consider an ESN with a single
input signal, u(n), optimally trained with the desired signal u(n-k), for a given delay k. Denoting
the optimal output signal yk(n), the k-delay STM capacity of a network, MCk, is defined as the
squared correlation coefficient between u(n-k) and yk(n) (Jaeger, 2002). The STM capacity, MC,
of the network is defined as the sum of the MCk over all delays k. STM capacity measures how
accurately the delayed versions of the input signal are recovered with optimally trained output
units. It has been shown in (Jaeger, 2002) that the memory capacity for recalling an independent
identically distributed (i.i.d.) input with an N-unit RNN with linear output units is bounded by N.
Consider an ESN with 20 PEs and a single input unit. The ESN is driven by an i.i.d. random
input signal, u(n), uniformly distributed over [-0.5, 0.5]. The goal is to train the ESN to
generate the delayed versions of the input, u(n-1), u(n-2), ..., u(n-40). We used four different
ESNs: R-ESN, U-ESN, ASE-ESN and BASE-ESN. R-ESN is the randomly connected ESN used
in (Jaeger, 2002), where the entries of the W matrix are set to 0, 0.47 and -0.47 with probabilities
0.8, 0.1 and 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral
radius of 0.9. For the U-ESN, the entries of W are uniformly distributed over [-1, 1] and scaled to
obtain a spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with
uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive
bias at its input. In each ESN, the input weights are set to 0.1 or -0.1 with equal probability, and
direct connections from the input to the output are allowed, whereas Wback is set to 0 (Jaeger,
2002). The echo states are calculated using Equation 2-1 for 200 samples of the input signal, and
the first 100 samples corresponding to the initial transient are eliminated. Then, the output
weight matrix is calculated using Equation 2-4. For the BASE-ESN, the bias is trained for each
task. All networks are run with a test input signal, and the corresponding output and MCk are
calculated.
Figure 3-2. The k-delay STM capacity of each ESN for delays 1, ..., 40 computed using the test
signal. The STM capacities of R-ESN, U-ESN, ASE-ESN and BASE-ESN are 13.09,
13.55, 16.70 and 16.90, respectively.
Figure 3-2 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for
delays 1, ..., 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN and BASE-
ESN are 13.09, 13.55, 16.70 and 16.90, respectively. First of all, the ESNs with uniform pole
distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly
generated ESN given in (Jaeger, 2002), in spite of all networks having the same spectral radius. In fact, the
STM capacity of ASE-ESN is close to the theoretical maximum value of N=20. A closer look at
the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9.
Indeed, for small k, large ASE degrades the performance because the tasks do not need long
memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN
which reduces the ASE to the appropriate level required for the task. Overall, the addition of the
bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-
ESN has slightly better STM compared to R-ESN with only 3 different weight values although
U-ESN has more distinct weight values than R-ESN. It is also significant to note that the MC
will be very poor for an ESN with smaller spectral radius even with an adaptive bias since the
problem requires large ASE and bias can only reduce ASE. This experiment demonstrates the
need for maximizing ASE in ESN design.
Binary Parity Check
The effect of the adaptive bias was marginal in the previous experiment since the nature of
the problem required large ASE values and a long short-term memory. However, there are tasks
whose optimal solutions require smaller ASE values and a smaller spectral radius. For those
tasks, the adaptive bias becomes a crucial parameter in our design methodology.
Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary
input signal, u(n), that assumes the values 0 or 1. The goal is to train the ESN to generate the m-
bit parity, where m is 3, ..., 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN
and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W
matrix are set to 0, 0.06 and -0.06 with probabilities 0.8, 0.1 and 0.1, respectively. This
corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and
BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 with
equal probability, and direct connections from the input to the output are allowed, whereas
Wback is set to 0. The echo states are calculated using Equation 2-1 for 1000 samples of the
input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then,
the output weight matrix is calculated using Equation 2-4. For the ESN with adaptive bias, the
bias is trained for each task. The binary decision is made by a threshold detector that compares
the output of the ESN to 0.5. Figure 3-3 shows the
number of wrong decisions (averaged over 100 different realizations) made by each ESN for
m=3,...,8.
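For reference, one way to generate the m-bit parity target assumed here (the parity of the current input and the previous m-1 inputs; the text does not spell out the exact windowing, so this is an assumption) together with the threshold decision:

```python
import numpy as np

def parity_target(u, m):
    """m-bit parity d(n) of u(n-m+1), ..., u(n); the first m-1 samples,
    where the window is incomplete, are set to 0."""
    d = np.zeros(len(u), dtype=int)
    for n in range(m - 1, len(u)):
        d[n] = int(np.sum(u[n - m + 1:n + 1]) % 2)
    return d

u = np.random.default_rng(0).integers(0, 2, size=1000)  # binary input stream
d = parity_target(u, m=3)
# the threshold detector on an ESN output y would then be: decisions = (y > 0.5)
```

Parity is a classic hard task for linear methods because it is not linearly separable in the input window, which is what makes the reservoir's nonlinear state expansion necessary.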
Figure 3-3. Number of wrong decisions made by each ESN for m = 3, ..., 8 in the binary parity
check problem. The total numbers of wrong decisions for m = 3, ..., 8 of R-ESN,
ASE-ESN and BASE-ESN are 722, 862 and 699, respectively.
The total numbers of wrong decisions for m = 3, ..., 8 of R-ESN, ASE-ESN and BASE-ESN
are 722, 862 and 699, respectively. ASE-ESN performs poorly since the nature of the problem
requires short time constants for a fast response, but ASE-ESN has a large spectral radius. For
5-bit parity, the R-ESN makes no wrong decisions whereas ASE-ESN makes 47. BASE-ESN
performs much better than ASE-ESN and slightly better than the R-ESN since the adaptive bias
effectively reduces the spectral radius. Note that for m = 7 and 8, the ASE-ESN performs
similarly to the R-ESN, since the task requires access to a longer input history, which offsets
the need for a fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors
(m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are
approximately 3.2, 2.8, 2.6 and 2.7 for m = 3, 4, 5 and 6, respectively. For m = 7 or 8, there is a
wide range of bias values (between 0 and 3) that result in similar MSE values. In summary, this
experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir
according to the mapping task.
System Identification
This section presents a function approximation task where the aim is to identify a nonlinear
dynamical system. The unknown system is defined by the difference equation
y(n+1) = 0.3y(n) + 0.6y(n-1) + f(u(n)), where f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu).
The input to the system is chosen to be sin(2πn/25).
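The unknown system is easy to simulate directly; the following sketch generates the input-desired pairs used to train the readout (the 500-sample length is an arbitrary choice for illustration):

```python
import numpy as np

def f(u):
    """The static nonlinearity of the unknown system."""
    return 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) \
           + 0.1 * np.sin(5 * np.pi * u)

T = 500
u = np.sin(2 * np.pi * np.arange(T) / 25)   # input signal, period 25
y = np.zeros(T)
for n in range(1, T - 1):
    # second-order linear recursion driven by the nonlinearly mapped input
    y[n + 1] = 0.3 * y[n] + 0.6 * y[n - 1] + f(u[n])
# the pairs (u(n), y(n+1)) now serve as ESN input and desired response
```

The linear part of the recursion has poles at roughly 0.94 and -0.64, so the system is stable and the generated trajectory remains bounded.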
We used an ESN with 30 internal units and a single input unit. Again, we used three
different ESNs: R-ESN, ASE-ESN and BASE-ESN. The W matrices of all ESNs are generated
as described in section 3.3.1, except that they are scaled to have a spectral radius of 0.95. In each
ESN, the input weights are set to 1 or -1 with equal probability, and direct connections from the
input to the output are allowed, whereas Wback is set to 0. The optimal output signal is calculated
for each ESN as described in section 3.3.1. The MSE values (averaged over 100 realizations) for
R-ESN and ASE-ESN are 1.23x10^-5 and 1.83x10^-6, respectively. The addition of the adaptive
bias to the ASE-ESN reduces the MSE from 1.83x10^-6 to 3.27x10^-9. This experiment illustrates
the superiority of the proposed design scheme for function approximation.
CHAPTER 4
DESIGN OF THE READOUT
In the previous chapter, we have discussed the design of the ESN reservoir and the bias.
This chapter deals with the design of the readouts for ESNs. The problem is studied for function
approximation and pattern recognition separately and different solutions are proposed for each
case.
Design of the Readout for Function Approximation
The usual readout is the linear regression network, and throughout this dissertation we will
use the linear readout. The equation for the linear readout is given by

y(n+1) = W^out x(n+1)

Alternative readouts are nonlinear, such as the multilayer perceptron or local nonlinear
modeling; these are outside the scope of this dissertation.
Design of the Readout for Dynamical Pattern Recognition
The applications of ESNs to problems such as system identification, control of dynamical
systems, and time-series prediction have been extensively studied both in this work and in the
literature (Jaeger, 2001; Jaeger and Haas, 2004; Ozturk et al., 2007; Prokhorov, 2005). However,
the use of ESNs for dynamical pattern recognition has been limited. In the literature, RNNs and
other neural architectures such as the time-delay neural network have been utilized for dynamical
pattern recognition tasks by minimizing the mean-square error (MSE) with respect to a label
(Haykin, 1998; Bishop, 1995; Iso and Watanabe, 1991; Tebelskis, 1995). If the pattern to be
classified is a time signal, the classification process is different from the classification of a static
pattern. The dynamical classification problem can be reduced to a static one by treating each
sample of the time series individually. In this case, the time series is not treated as a whole and
temporal information between samples is ignored. Moreover, one class label is associated with
each time instant; therefore, post-classification such as majority voting has to be applied. An
alternative to this approach is embedding the time series to populate a short-term history of the
pattern. Again, a static mapper can be used to classify the embedded pattern. In this approach,
classification becomes extremely hard, if the dimensionality of the input space becomes too
large. Alternatively, temporal processing of the time series can be done inside the classifier. In
this approach, the classifier, which has a built-in memory, can accept single elements of time
series. Although the classifier does not require external temporal processing, the design of the
classifier may be complicated. Another major difficulty in dynamical pattern recognition is how
to design the label, which should also be a time series. The obvious method is to enforce a class
label (+1 or -1) as the desired response throughout the presentation of the input pattern.
However, the difficulty with this method is that the desired response is forced to be constant
independent of the input signal dynamics. An alternative powerful technique is to create a one-
step predictor for each class in order to capture the dynamics of the class generating the time-
series (Zahalka and Principe, 1993). Then, during testing, the test pattern is presented to all
predictors and the label of the predictor with the least prediction error is assigned as the class of
the input pattern. This method is a smart way of converting the pattern recognition problem into
a regression problem where many techniques are available. This method can be easily applied to
ESNs with an adaptive linear readout and used for classification of temporal patterns, but it is
still dependent upon input normalization and it is sensitive to pattern duration changes.
In this chapter, we propose an alternative readout for ESNs for temporal pattern
recognition and compare it with the standard linear readout of ESNs and other conventional
techniques used in the literature. The goal is to design an ESN/LSM readout that will recognize a
class of inputs that differ by some quantity (e.g. amplitude or shape considered as a distortion
parameter). The proposed readout is based on a biologically plausible linear associative memory
(LAM) that implements a correlator specially trained for high specificity. The readout, called the
minimum average correlation energy (MACE) filter, is adopted from the optical pattern recognition
literature, where it is used for recognition of a given object in 2-D images in the presence of
noise, geometric distortions and pose changes (Mahalanobis et al., 1987). The MACE filter is a
correlator whose weights are determined in closed form by solving a constrained minimization
problem, and it has been shown that the method is equivalent to a cascade of a pre-whitening
filter for the image class followed by a LAM (Fisher, 1997). The recognition of a time pattern in
a single time series is achieved by feeding the time series to an ESN/LSM, and utilizing a time
window of states which can be interpreted as a 2-D image, one dimension being time and the
other processing element number (space). Several MACEs are trained, one for each class. During
testing, the unknown time signal is fed to the ESN and the states are correlated with each
template image and the class label of the template with the highest correlation is assigned to the
input pattern. An interesting implication of the MACE is that it can also be used for LSMs with
proper regularization of the MACE solution. In LSMs, the proposed readout operates directly in
the spike domain without converting the liquid state spike train outputs into continuous signals
by binning or low-pass filtering.
In this chapter, we first present a brief background on linear associative memories followed
by the detailed explanation of the MACE filter. Then, the use of MACE filter as the readout for
ESN/LSMs is discussed. Finally, the experiments are presented to compare the proposed readout
with the conventional methods.
Linear Associative Memories
A linear associative memory is a multiple input multiple output linear system that
implements an association (memory) between input and desired outputs (Hinton and Anderson,
1981). The output is automatically created by the presentation of the proper input, which is
referred to as content addressability. LAMs are quite different from the concept of memory used in
digital computers, which uses a memory location whose contents are accessible by its address.
Unlike digital memory, LAMs store information in a more global way by having several PEs
instead of local bit storage. Therefore, they are also robust to noise (Haykin, 1998; Hinton and
Anderson, 1981). LAMs recall by association, hence the pattern that is closest to the input is
recalled. LAMs are biologically plausible just like ESN/LSM and they stand as the most likely
model for human memory (Principe et al., 2000).
The forced Hebbian rule can be used to train a LAM that associates a P-dimensional input p
with a Q-dimensional desired output q. The output of the LAM is given by q = W^LAM p, where
the QxP matrix W^LAM can be computed using the outer-product rule, W^LAM = q p^T (Principe
et al., 2000; Haykin, 1998). Multiple input-output vector pairs (p_k, q_k), for k = 1, ..., K, can
also be stored in the LAM by repeated presentation of each input. Using the superposition
principle for the linear network, the final weight matrix is given by

W^LAM = sum_{k=1}^{K} q_k p_k^T

For an input vector p_l, the output of the LAM is given by

y_l = q_l (p_l^T p_l) + sum_{k=1, k≠l}^{K} q_k (p_k^T p_l)
The first term in the LAM output is the desired output and the second term, which is called
crosstalk, is the interference of the other stored patterns with the true pattern. If the crosstalk term
is small, the LAM can retrieve the pattern corresponding to the input. Notice that if the stored
patterns are orthogonal to each other, the crosstalk term will be zero resulting in a perfect recall.
Therefore, the number of patterns that can be stored perfectly in a LAM is limited by the size of
input space, which is one of the limitations of LAMs for information processing (Principe et al.,
2000; Haykin, 1998).
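The outer-product storage and the zero-crosstalk property for orthonormal patterns can be checked directly; the dimensions below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, K = 16, 3, 4                      # input dim, output dim, pattern count

# K orthonormal input patterns p_k (columns of the Q factor of a QR decomposition)
Qmat, _ = np.linalg.qr(rng.normal(size=(P, K)))
p = Qmat.T                              # rows are the patterns p_k
q = rng.normal(size=(K, Q))             # desired outputs q_k

# Forced Hebbian (outer-product) storage: W = sum_k q_k p_k^T
W_lam = sum(np.outer(q[k], p[k]) for k in range(K))

# With orthonormal patterns the crosstalk term vanishes: W p_l = q_l exactly
recall = np.array([W_lam @ p[k] for k in range(K)])
```

With non-orthogonal patterns the recall would carry the crosstalk term of the equation above, which is what the LMS-trained OLAM described next reduces.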
The output patterns qk for the pairs (pk, qk), k = 1, ..., K, can be interpreted as the desired
response for the LAM. The existence of a desired response allows the application of supervised
learning to train LAMs using unconstrained optimization techniques such as least squares. For
instance, the weights can be modified using the least-mean-squares (LMS) algorithm by an
amount given by

ΔW^LAM = η e_k p_k^T = η (q_k - y_k) p_k^T

This equation is a combination of the desired forced Hebbian term (η q_k p_k^T) and an
anti-Hebbian term (-η y_k p_k^T) that decorrelates the present output y_k from the input p_k.
The anti-Hebbian term reduces the crosstalk at each iteration, improving the performance of
LAMs. A LAM trained with LMS is called an optimal linear associative memory (OLAM)
(Principe et al., 2000).
One question follows naturally: what is the difference between a LAM and a linear
regressor? The linear regressor fits a single hyperplane through all the desired samples, so we
normally want more patterns than the size of the input dimension to generalize, but it loses
information about the covariance of the data space (Principe et al., 2000). In the LAM, on the
other hand, the output is as close as possible to each of the training samples. This different
behavior is controlled by the number of data points used for training, which must be less than
the input dimensionality to minimize crosstalk. The MACE filter that we will discuss next is a
cascade of a preprocessor (a whitening filter over the class of training data) followed by a LAM
(Fisher, 1997).
The MACE Filter
The technique of matched spatial filters (MSF), or correlators, for optical pattern recognition
has been well studied and used for the recognition of 2-D objects (Vanderlugt, 1964; Kumar,
1986). The goal is to design an optimal correlator filter that represents an object class under
noise and/or distortions using a set of training images. The reason a single template can be
matched to a class of objects is the huge number of weights (degrees of freedom) that the
correlator has in 2-D (N^2 weights for an NxN input image). Moreover, the MACE filter can
incorporate the covariance information of the class in its weights. The MACE filter is the most
widely used MSF due to its superior discrimination properties (Mahalanobis, 1987; Casasent and
Ravichandran, 1992).
The formulation of the MACE will be presented in the frequency domain for convenience
and ease of implementation (Mahalanobis, 1987). Consider a set of training images for a single
object class. The ith training image (i = 1, ..., K) is described as a 1-D discrete sequence
(obtained by lexicographically ordering the columns of the image) denoted by
x_i = [x_i(1), x_i(2), ..., x_i(d)]^T, where d and K are the number of pixels in the image and the
number of training images for the class, respectively. The discrete Fourier transform of x_i is
denoted by X_i. Define the overall dxK training matrix by X = [X_1, X_2, ..., X_K]. The dx1
column vector h denotes the coefficients of a filter in the space domain and H its Fourier
transform. The correlation function of the ith image with the filter h is given by g_i = h ⊗ x_i.
The filter output g defines a 3-dimensional space where the x and y axes are the indices (lags) of
the 2-dimensional correlation function and the z-axis is the value of the correlation at the
corresponding indices. We will call it the correlation space. The energy of the ith correlation
plane in the frequency domain is E_i = H^+ D_i H, where D_i is a dxd diagonal matrix whose
elements are the magnitude squares of the associated elements of X_i. The average correlation
plane energy over the training images is defined as

E_av = (1/K) sum_i E_i = H^+ [(1/K) sum_i D_i] H = H^+ D H

where D = (1/K) sum_i D_i (Mahalanobis, 1987).
The MACE filter solution designs a correlation filter that minimizes the output energy
while providing predetermined values (c = [c_1, c_2, ..., c_K]^T) at the origin of the correlation
plane. Each c_i is the ith peak correlation value obtained by correlating the ith image with the
filter H at zero lag. More specifically, the problem is to solve the constrained optimization
problem (Mahalanobis, 1987):

min_H H^+ D H subject to X^+ H = c

The solution to this problem can be found analytically using the method of Lagrange
multipliers (Mahalanobis, 1987) and is given by

H = D^-1 X (X^+ D^-1 X)^-1 c (4-1)
The selection of the exact value of c is somewhat arbitrary since it simply scales the peak
correlation amplitude. Usually, the entries of c are chosen to be 1 for in-class training images and
a smaller value (around 0.1) for out-of-class images, if any are used (Mahalanobis, 1987). The
MACE filter solution achieves superior discrimination properties when compared to other
existing correlation filters such as synthetic discriminant function filters (Mahalanobis, 1987;
Casasent and Ravichandran, 1992).
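Equation 4-1 is straightforward to evaluate numerically. The sketch below uses random vectors in place of lexicographically ordered training images and keeps the diagonal of D as a vector; it only verifies that the solution meets the correlation-peak constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 64, 3                            # pixels per image, training images
x = rng.normal(size=(d, K))             # columns: lexicographically ordered images
X = np.fft.fft(x, axis=0)               # columns X_i: DFTs of the images

# D = (1/K) sum_i D_i is diagonal, so its diagonal is stored as a vector
D = np.mean(np.abs(X) ** 2, axis=1)

c = np.ones(K)                          # unit correlation peaks for in-class images
Dinv_X = X / D[:, None]                 # D^-1 X
A = X.conj().T @ Dinv_X                 # X^+ D^-1 X (a K x K matrix)
H = Dinv_X @ np.linalg.solve(A, c)      # Equation 4-1

peaks = X.conj().T @ H                  # the constraint X^+ H = c at zero lag
```

Because D is diagonal, the dominant cost is the two FFT-domain matrix products; only a KxK system has to be solved, which is what makes the closed-form MACE solution cheap even for large images.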
The ESN/LSM MACE for Dynamical Pattern Recognition
Pattern recognition is normally formulated as the design of a nonlinear static system
trained discriminantly with the information of a set of labels, with data that are vectors in Rn. An
alternative to this approach is obtained by creating models for each class which are only trained
with the class data (nondiscriminant training). When the number of classes to be classified is
unknown at training time (open set classification), the latter provides robust results. The
correlator, or matched filter, is a well-known example of a system that recognizes data from a
single class (Helstrom, 1995; Proakis, 2001), and it can be easily extended to multiple dimensions.
Dynamical pattern recognition can also be formulated in a similar manner by using
dynamical systems as classifiers and training them discriminantly with an appropriate desired
signal. Instead, in this study, we propose to use the MACE filter as a readout for ESN/LSMs for
dynamical pattern recognition tasks, nondiscriminantly trained for each class. In ESN/LSM, the
proposed associative memory considers a time window of states which can be interpreted as a 2-
D image, one dimension being time and the other the space of the states (see Figure 4-1). Let us
explain the operation of ESN-MACE in mathematical terms.
Assume that we have P different classes of temporal patterns. The goal is to compute one
MACE filter, h_p, for each class. Assume also that for each class, the training set consists of K
input patterns, each with a particular temporal shape of length T. The procedure is to compute
each h_p for p=1,...,P as follows.
1. The ith training input pattern for the pth class, u_i = [u_i(1), u_i(2),..., u_i(T)], of dimension
MxT, is used to calculate the echo states using Equation 1-1.
2. The resulting NxT matrix of echo states forms the equivalent of a 2-D image in section
2.2 (Figure 4-1). The echo states are then lexicographically ordered by the columns to get the 1-
D column vector x_i = [x_i(1)^T x_i(2)^T ... x_i(T)^T]^T with N·T elements. Here each x_i(n) is an Nx1
vector with the value of the echo state at time n for the ith training sequence.
3. The discrete Fourier transform of x_i is denoted by X_i. Define the overall dxK training
matrix in the frequency domain by X = [X_1, X_2,..., X_K], where d = N·T.
4. The optimal coefficients of the LAM for the class are computed using Equation 4-1 in
the frequency domain and the corresponding MACE filter weights h_p are obtained by inverse
discrete Fourier transform. The output of the MACE for the ith training input pattern for the pth
class can be obtained by x_i^T h_p.
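Equation 4-1 is referenced but not reproduced in this section; the sketch below assumes the standard frequency-domain MACE solution h = D^-1 X (X^+ D^-1 X)^-1 c of Mahalanobis et al. (1987), where D is the diagonal average power spectrum of the training images. The function names and the small diagonal term eps are illustrative assumptions, and the echo-state computation itself is omitted.

```python
import numpy as np

def train_mace(state_images, c=None, eps=1e-8):
    """Train a MACE filter from K echo-state 'images' (each N x T).

    Frequency-domain solution h = D^-1 X (X^+ D^-1 X)^-1 c, where X holds
    the DFTs of the lexicographically ordered states, D is the diagonal
    average power spectrum, and c holds the desired correlation peaks.
    """
    vecs = [img.flatten(order='F') for img in state_images]  # column-wise (lexicographic)
    X = np.stack([np.fft.fft(v) for v in vecs], axis=1)      # d x K DFT matrix
    d, K = X.shape
    if c is None:
        c = np.ones(K)                                       # unit training peaks
    Dinv = 1.0 / (np.mean(np.abs(X) ** 2, axis=1) + eps)     # D^-1 (diagonal, as a vector)
    A = X.conj().T @ (Dinv[:, None] * X)                     # X^+ D^-1 X  (K x K)
    H = d * (Dinv[:, None] * X) @ np.linalg.solve(A, c)      # scaled so image-plane peaks equal c
    h = np.real(np.fft.ifft(H))                              # back to the image plane
    return h.reshape(state_images[0].shape, order='F')

def mace_output(h, state_image):
    """Zero-lag correlation of the filter with an echo-state image."""
    return float(np.sum(h * state_image))
```

For the training images themselves, the zero-lag correlation x_i^T h_p recovers the prescribed peak values c_i.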
Figure 4-1. Interpretation of echo states as a 2-D image. A) The triangular input signal. B) The
echo states of a 100-unit ESN. C) The echo state image, where a point in state index
and time is interpreted similarly to a pixel in an image.
The same procedure is repeated for the training sequences of other classes to obtain an optimal
filter for each class.
During testing, the input signal can be either a continuous stream (asynchronous operation)
or a series of frames (synchronous operation). Depending on the mode of operation, the choice of
the initial conditions for the ESN in step 1 varies. In synchronous operation, the timing of the signal
and the frames of interest are already known both in training and testing. Therefore, ESN is
initialized with the same zero initial condition both in training and testing. The zero initial
condition is chosen such that the system dynamics is not biased and the system states are
controlled by the input. During testing, the MACE output is calculated for each frame by
correlating the filter coefficients with the echo states generated by the frame at zero lag and with
a zero initial condition.
When the signal timing is unknown (asynchronous), the MACE filter generates an output
at each time instant n by correlating the filter coefficients with the echo states between time
instants n-T+1 and n, and the maximum in time above a certain pre-specified threshold is
picked to represent the occurrence of the pattern. In the case of a continuous stream, the initial
conditions for each frame are dictated by the history of the input signal during testing; therefore
they are not necessarily zero. In order to mimic the test conditions while training the MACE, the
initial conditions of the ESN are not set to zero but determined by the history of the input. In
other words, the whole input signal is used to create echo states and then the resulting NxT
dimensional echo state images are extracted and used to train the MACE filter using the above
algorithm.
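The asynchronous mode can be sketched as a sliding zero-lag correlation over the stream of echo states. The function below is a minimal illustration; the function name and the thresholding convention are our assumptions, and the generation of the echo states from the input stream is omitted.

```python
import numpy as np

def async_detect(states, h, threshold):
    """Slide the MACE template over a continuous stream of echo states.

    states : N x L array of echo states driven by the whole input stream
    h      : N x T MACE filter template
    Returns (peak time index, peak value) if the zero-lag correlation
    exceeds the threshold anywhere, else None.
    """
    N, T = h.shape
    L = states.shape[1]
    # Output at each instant n: correlation with the window [n-T+1, n].
    out = np.array([np.sum(h * states[:, n - T + 1:n + 1])
                    for n in range(T - 1, L)])
    peak = int(np.argmax(out))
    if out[peak] > threshold:
        return peak + T - 1, float(out[peak])   # index back in stream time
    return None
```

A template embedded anywhere in the stream then produces its correlation peak exactly at the instant its last sample arrives.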
The Design of ESNs for Dynamical Pattern Recognition
One of the drawbacks of LAMs is the storage capacity that is limited by the input space
dimension (Haykin, 1998). However, the ESN/LSM with a LAM readout becomes very powerful
since echo/liquid states provide the LAM with a user defined high dimensional input space
increasing the number of patterns that can be stored in the LAM with minimal cross talk. In
ESNs, the reservoir states can be interpreted as a set of basis functionals (representation space)
constructed dynamically and nonlinearly from the input (Ozturk et al., 2007). Due to the
nonlinearity, it is easy to get states that are linearly independent of each other (Ito, 1996; Ozturk
et al., 2007). Therefore, the matrices in the MACE fi1ter solution are typically full-rank, allowing
the computation of inverses. Although linearly independent, the states are not orthogonal to each
other since they are connected through predefined weight values. An information-theoretic
metric, called average state entropy (ASE), has recently been proposed to describe the richness
of ESN dynamics (Ozturk et al., 2007). ASE quantifies the distribution of state values across the
dynamic range. When ASE is maximized, the states of the system are evenly distributed across
the dynamic range instead of being populated only on a few values. The ASE measure is related
to the system design by uniformly distributing the poles of the linearized system around the
origin (eigenvalues of W) (Ozturk et al., 2007). We propose to use the same design methodology
to construct the representation layer of ESNs with the LAM readout. This approximates for
ESNs the well-known property of an orthogonal basis and eases the LAM training. In the
experiments performed for this study, we have observed that the inverted matrices are
well-conditioned (i.e., have small condition numbers) for ESNs.
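The exact construction used in Ozturk et al. (2007) is not reproduced in this section. One way to realize a prescribed uniform eigenvalue distribution, sketched below, places conjugate-pair 2x2 blocks on the diagonal and mixes them with a random orthogonal similarity transform; this construction, along with the function name, is an assumption for illustration rather than the paper's exact recipe.

```python
import numpy as np

def uniform_pole_reservoir(N, radius=0.9, seed=0):
    """Internal weight matrix W whose eigenvalues are distributed
    uniformly inside a disk of the given radius (N must be even)."""
    assert N % 2 == 0
    rng = np.random.default_rng(seed)
    # Uniform samples in the disk: sqrt trick gives uniform area density.
    r = radius * np.sqrt(rng.uniform(size=N // 2))
    theta = rng.uniform(0.0, np.pi, size=N // 2)
    W = np.zeros((N, N))
    for i, (ri, th) in enumerate(zip(r, theta)):
        a, b = ri * np.cos(th), ri * np.sin(th)
        # 2x2 block with eigenvalues a +/- ib (a conjugate pair).
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[a, -b], [b, a]]
    # Random orthogonal similarity mixes the blocks without moving the poles.
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
    return Q @ W @ Q.T
```

The orthogonal similarity leaves the eigenvalues untouched, so the resulting spectral radius stays below the chosen radius by construction.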
Likewise for LSMs, the value of each liquid state at each time instant is interpreted by the
MACE as 1 or 0 depending on the existence or absence of a spike. This creates a very sparse
state matrix, which typically leads to singular matrices in the MACE filter computation. Therefore,
the matrices have to be regularized in order to compute appropriate MACE filter weights with
the outlined formulation. The regularization can be done by adding zero mean Gaussian noise
with a small variance into the states before computing the MACE solution.
Experiments
This section presents a variety of experiments to compare the MACE filter readout
proposed for ESN/LSM with the conventional techniques for dynamical pattern recognition.
Classification of temporal signals with unknown peak values
In this experiment, the aim is to classify a temporal signal into one of two possible classes.
The two classes are represented by a triangular and a step function of 20 samples. The interesting
property of the signals in this experiment is that the peak values are unknown and can assume
any value between 0.5 and 1. In other words, each class is not a signal but a family of signals.
The signals are also corrupted with additive, zero mean, white Gaussian noise.
We compare the performance of three different ESNs, denoted by R-ESN1-MACE, R-
ESN2-MACE, ASE-ESN-MACE, with MACE filter readout. All ESNs have 30 PEs and are
constructed using different techniques. R-ESN1-MACE is a randomly connected ESN where the
entries of W matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1, 0.1, respectively. This
corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. R-ESN2-MACE is a
randomly connected ESN where the entries of the W matrix are chosen from a Gaussian
distribution with zero mean. The weights are scaled such that the spectral radius is set to 0.9. The
internal weight matrix, W, of the ASE-ESN-MACE is designed such that the eigenvalues of W
(poles of the linearized system around origin) are uniformly distributed inside the unit circle.
This principle is introduced in (Ozturk et al., 2007) and it distributes the states of the system
evenly across the dynamic range instead of populating them only on a few locations. The entries
of W^in are set to 1 or -1 with equal probability for all three ESNs.
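The three reservoir constructions above can be sketched as follows. Note that for the ternary R-ESN1 weights the spectral radius of about 0.9 is a statistical consequence of the chosen values rather than enforced explicitly; the helper names and the explicit rescaling in r_esn2 are our own.

```python
import numpy as np

def r_esn1(N=30, w=0.47, seed=0):
    """Sparse ternary reservoir: entries 0, +w, -w with prob. 0.8/0.1/0.1
    (about 20% connectivity; spectral radius near 0.9 only on average)."""
    rng = np.random.default_rng(seed)
    return rng.choice([0.0, w, -w], size=(N, N), p=[0.8, 0.1, 0.1])

def r_esn2(N=30, rho=0.9, seed=0):
    """Dense Gaussian reservoir rescaled to a prescribed spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, N))
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

def w_in(N=30, M=1, scale=1.0, seed=0):
    """Input weights: +scale or -scale with equal probability."""
    rng = np.random.default_rng(seed)
    return scale * rng.choice([1.0, -1.0], size=(N, M))
```

The ASE-ESN reservoir would instead be built with the uniform eigenvalue distribution described above.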
Figure 4-2. Performance comparisons of ESNs for the classification of signals with unknown
peak values. ESNs with the MACE filter readout perform better for all noise levels
compared to discriminative and predictive ESNs. The classification accuracy of ASE-
ESN-MACE is almost perfect up to a noise standard deviation of 0.3. ASE-ESN-
MACE performs better than randomly connected ESNs with MACE readout since it
distributes the echo states uniformly and increases separation between states
corresponding to different classes.
To train the MACE filter readout for the ESNs, 100 different realizations for each class of
signals are generated and the echo states are calculated using Equation 2-2 with zero initial
conditions. This creates an image of dimension 30x20, where 30 is the number of PEs and 20 is
the signal length. One MACE filter is trained for each class using only data from the
corresponding class. Output correlation peak amplitudes are assigned to be 1.0 for the training
data. The two MACE filters are synthesized in the frequency domain using Equation 4-1 and the
corresponding image plane filters are obtained by inverse discrete Fourier transform. For a given
test signal, each filter output is calculated by correlating the echo states of dimension 30x20 with
the filter coefficients at zero lag. Figure 4-2 shows the correct classification rate of the 3 methods
as the standard deviation of the noise signal varies. The results are averaged over 100 realizations
of each ESN.
We also compare the performance of ESNs with the MACE filter readout to the ASE-ESN
utilizing the conventional linear readout. When the conventional readout is used for dynamical
pattern recognition, the problem is the design of the desired signal. One solution is to train a
discriminative readout with constant target output signal of amplitude 1 for one of the two
classes and a constant signal of amplitude -1 for the other class. The duration of the target signals
is the same as the duration of the input signal. During testing, the echo states, the output signal, and
the corresponding error signals, one for each label, are computed for a given input signal. Then,
the two MSE values, integrated over the duration of the input pattern, are compared to assign the
class of the input pattern to the lowest MSE. In this solution, the readout has to generate a
constant signal independent of the input signal dynamics. An alternative also tested is to train
two independent predictive readouts, one for each class to predict the next sample of the input
signal. During testing, the echo states, the output signal and the error signal are computed for
each readout. Then, the label of the readout with the lower MSE value, integrated over the
duration of the input pattern, is assigned as the class of the input signal. The predictive readout
seems a more natural choice since the desired response is not independent of the input dynamics.
The same training data used for the MACE filter is employed to calculate the readout weights for
the discriminative and predictive ESNs.
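The two conventional decision rules described above reduce to comparing integrated squared errors. A minimal sketch follows; the readout and echo-state computations are omitted, and the function names are ours.

```python
import numpy as np

def classify_discriminative(y_out):
    """y_out: readout output over the pattern duration.
    Compare the integrated squared error against the constant
    targets +1 and -1; the lower-MSE label wins (0 -> +1, 1 -> -1)."""
    mse_pos = np.mean((y_out - 1.0) ** 2)
    mse_neg = np.mean((y_out + 1.0) ** 2)
    return 0 if mse_pos < mse_neg else 1

def classify_predictive(u, y_pred_class0, y_pred_class1):
    """Two one-step predictors, one per class; pick the class whose
    predictor tracks the next sample of the input u best."""
    e0 = np.mean((u[1:] - y_pred_class0[:-1]) ** 2)
    e1 = np.mean((u[1:] - y_pred_class1[:-1]) ** 2)
    return 0 if e0 < e1 else 1
```

In both cases the MSE is integrated over the duration of the input pattern before the comparison, as in the text.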
Figure 4-2 shows the correct classification rate as the standard deviation of the noise signal
varies. The results are averaged over 100 trials. As observed from the figure, ESNs with the
MACE filter readout perform better for all noise levels compared to the ESNs with conventional
or predictive readouts. The classification accuracy of ASE-ESN-MACE is almost perfect up to a
noise standard deviation of 0.3. The correct classification rate of the discriminative ESN for the
no-noise case is only 0.92, while that of the predictive ESN is 0.95, slightly better than the
discriminative ESN. Since all three systems have the same reservoir, the better performance of
the MACE readout is attributed to the fact that the problem requires modeling a class of
functions all at once in a single template, and the MACE filter is able to build such a template
containing covariance information about the full training set. The conventional and predictive
readouts train for the best compromise of the input class peak amplitude over the training set.
Being a nonlinear system, ESN dynamics are coupled to the input signal. Therefore, the response
of the system to input signals with varying amplitudes can be quite different making the task of
projecting the states to a constant value very challenging. The wave shape is also challenging,
since for the pulse (a sequence of zero values followed by a sequence of ones) the linear readout
has to transform the highly irregular states generated by this pattern to a constant value.
Similarly, for the predictive ESN the variation in the input amplitudes affects the ESN dynamics,
and the performance drops fast as the noise standard deviation is increased. Moreover, the MACE
filter readout is expected to be more robust with respect to the selection of W and W^in in the ESN design
compared to the conventional readout design methodology where the states have to be carefully
selected in order to estimate the desired signal with a linear readout. In contrast, the LAM
readout does not try to estimate a desired response but rather generates a template from the
states.
Among the ESNs with MACE readout, the ASE-ESN performs the best compared to
randomly generated ESNs. This was an expected result, since ASE-ESN distributes the echo
states uniformly and increases separation between states corresponding to different classes.
Therefore, ASE-ESN should be the preferred design for ESNs with MACE filter readout.
Robust signal detection in a digital communication channel
We applied ESN-MACE to detection of a known waveform transmitted through a digital
communication channel. Please refer to Chapter 6 for details of this study.
Classification of odors with an electronic nose
In this experiment, the goal is to classify odors using signals collected with a Cyranose 320
electronic nose. The Cyranose 320 has an array of 32 sensors which change their resistance in
response to odors and produce a unique response pattern for different odors (Figure 4-3). For our
experiment, the spices, basil, cinnamon, ginger, and rosemary form the 4 classes of odors to be
classified. Each of these spices is mixed with air and exposed to the electronic nose in 4 different
concentrations; 25%, 50%, 75% and 100%. For each concentration of each spice, data
corresponding to 10 different trials is collected. There are a total of 160 trials, 40 for each class
of spice. We present two approaches to solve the problem. The first approach is based on
obtaining static features that are the steady-state values of all 32 sensors as given by the
Cyranose. We obtain a 32x160 matrix containing 32 features for all 160 trials of the 4 classes.
From the feature matrix, we extract the part corresponding to 40 trials (10 trials for each class
with representatives from 4 different concentrations) as the training set to train a linear classifier
with 4 outputs, each representing one of the 4 classes. The linear classifier is used due to the
availability of an analytical solution. Each output of the linear classifier is trained to output 1
when the input feature is from the representative class and 0 otherwise. The 4x32 weight matrix
of the linear network is computed using least squares. During testing, the 32x1 feature vector is
Figure 4-3. Time-series representing the response pattern of the 32 electronic nose sensors
exposed to rosemary.
used to compute the 4 outputs and the label of the output with the maximum value is assigned as
the class of the input vector. The classification accuracy of the linear classifier evaluated on the
160 feature vectors is found to be 78.7% and the confusion matrix is found to be
40 0 0 0
4 22 14 0
0 0 40 0
0 0 16 24
The entry a_ij in row i and column j of the confusion matrix is the number of odors that
should have been classified in class i but were classified as j. The correct decisions are given on
the main diagonal of the confusion matrix. The 4 classes are presented in the order basil,
cinnamon, ginger, and rosemary.
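The static-feature baseline above is an ordinary least-squares map from the 32 features to one-hot class targets. A sketch on synthetic data follows; the feature values below are illustrative placeholders, not the Cyranose measurements, and the function names are ours.

```python
import numpy as np

def train_linear_classifier(F, labels, n_classes=4):
    """F: M x K feature matrix (one column per trial); labels: length-K ints.
    Solves W F ~= Y in the least-squares sense, with Y one-hot
    (1 for the representative class, 0 otherwise)."""
    Y = np.eye(n_classes)[:, labels]      # n_classes x K one-hot targets
    return Y @ np.linalg.pinv(F)          # n_classes x M weight matrix

def classify(W, f):
    """Label of the classifier output with the maximum value."""
    return int(np.argmax(W @ f))
```

With 40 training trials and 32-dimensional features this reproduces the 4x32 weight matrix described in the text.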
The second approach is based on the proposed ESN-MACE. ESN exploits the dynamical
response of the sensors which also contain information about the odor class. We create a
template MACE filter for each class from the states of the ESN which represents the variations
among different spice concentrations. Again, 40 trials (10 trials for each class with
representatives from 4 different concentrations) are used as the training set. Unlike the static
case, each training trial consists of a 32 dimensional time series representing the response of the
electronic nose. We use an ESN with 100 internal units and 32 inputs. The internal weight
matrix, W, of the ESN is designed with uniform eigenvalue distribution and a spectral radius of
0.9. The entries of W^in are set to 5 or -5 with equal probability. In the previous two experiments,
the selection of the length of the patterns was dictated by the known signal lengths. However, in
this experiment, we do not have pre-defined signals making the selection of the pattern length, T,
a design parameter (Figure 4-3). Here, we choose T to be 80 so that the template pattern contains
the rising and falling edges of the sensor response. This selection uses the transient sensor
dynamics of the particular odor in the design of the MACE readout. For each input pattern, the
echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image
of dimension 100x80, where 100 is the number of PEs and 80 is the signal length. 10 in-class
patterns are used to train each one of the MACE filters corresponding to 4 classes. Output
correlation peaks are assigned to be 1.0. During testing, echo states are calculated for each
pattern and 4 MACE filter outputs are calculated. The label of the MACE filter with the highest
value is assigned as the class of the input pattern. The classification accuracy of the ESN-MACE
evaluated on the 160 input patterns is found to be 63.1% and the confusion matrix is found to be
21 18 1 0
5 35 0 0
7 16 17 0
0 1 11 28
The results are, indeed, worse than those of the linear classifier trained on static features. When
looked at closely, the MACE filter gives correlation values around 1.0 for the in-class data.
However, the response of the MACE filter corresponding to the 2nd class (cinnamon) for out-of-class
data is bigger than 1 for most of the patterns, which results in the false classifications. The MACE
filter trained nondiscriminantly does not constrain the response of the MACE for out-of-class data.
One solution to this problem is to use out-of-class data in the MACE filter training as is done in
(Mahalanobis et al., 1987). Different from the previous experiments, both in-class and out-of-
class training data are used to train each one of the 4 MACE filters corresponding to 4 classes.
For each MACE filter, 10 in-class and 30 out-of-class (10 for each class) echo state images are
used for discriminative training. Output correlation peak amplitudes are assigned to be 1.0 for the
in-class and 0.1 for the out-of-class training data.
Using the training data, four MACE filters are synthesized in the frequency domain using
Equation 4-1 and the corresponding image plane filters are obtained by inverse discrete Fourier
transform. For a given test signal, the output of each MACE filter is calculated by correlating
the echo states of dimension 100x80 with the filter coefficients. The label of the filter with the
largest correlation is assigned as the class of the input pattern. Out of the 160 trials, the correct
classification rate is found to be 93.1%. Note that this is much better than the performance of the
classifier based on static features. Similarly, the confusion matrix is found to be
38 2 0 0
0 37 3 0
0 5 35 0
0 0 1 39
The addition of out-of-class data for training of the MACE improves the rejection of the
out-of class patterns and hence decreases the false alarm rate. However, it also converts the
problem to a classification problem (closed set classification) where the number of classes has to
be known a priori. It is desirable, whenever possible, to avoid the use of out-of-class data in
training for signal detection problems, where the out-of-class data is either not available a priori
or can be any possible signal shape other than the in-class one (the entire signal space
excluding the in-class signal space).
Alternatively, we trained MACE filters in the input space instead of the state space of the
ESN. In the input space, the images are of dimension 32x80. The procedure of training MACE
filters for each class is repeated in the 32 dimensional input space. Out of 160 trials, the correct
classification rate is found to be 85.6%. The confusion matrix of the MACE filter is
34 5 1 0
3 32 5 0
1 7 32 0
0 0 1 39
In order to understand the increase in classification rate achieved by the ESNs, let us compare the
outputs of the ESN-MACE filter with those of the MACE filter trained directly in the input space
for the rosemary class (Figure 4-4).
As observed from the figures, the ESN-MACE filter output corresponding to the rosemary
class displays values around one consistently for the in-class data except for the 37th testing
sample.
Figure 4-4. Comparison of the ESN-MACE filter output and the MACE filter output trained on the
input space for the rosemary class. A) Outputs of the 4 MACE filters over the test
data index. B) Outputs of the 4 ESN-MACE filters over the test data index.
The ESN-MACE filter outputs corresponding to the other classes are very close to 0.1 as
specified by the training procedure. However, the MACE filter outputs trained directly in the
input space have very inconsistent outputs for the out of class samples. The margin between the
in-class and out-of-class outputs of the MACE filter is increased by the ESN reservoir. There are
mainly two reasons for the increase in the distance between the images with the addition of the ESN.
The first reason is the projection of the signals to a higher dimension by the ESN, which increases
the separation between the images. Secondly, the ESN functions as a nonlinear filter that helps
separate the responses to different input classes.
We also compared the results to a TDNN trained in discriminative mode. Four
TDNNs, one for each class, are used where the desired target signal is +1 for in-class data and 0
otherwise. Each TDNN has a delay line of 81 samples (equal to the signal duration) for each one
of the 32 sensors, 3 hidden layer PEs and a single output unit. The same data (10 in-class and 30
out-of-class) for discriminative ESN-MACE is used to train the weights of TDNN using the
Levenberg-Marquardt algorithm with a step size of 0.01. Training is stopped after 100 epochs. A
cross-validation set cannot be used due to limited number of training data. During testing, an
input pattern is fed to each one of the 4 TDNNs and the label of the TDNN with the largest
output is assigned as the class of the input pattern. Out of 160 trials, the correct classification rate
is found to be 65.7%. The confusion matrix of the filter is
25 13 2 0
3 29 7 1
3 10 21 6
6 2 2 30
The performance of the TDNN is very poor compared to discriminative ESN-MACE and
even the static classifier. The main reason for the poor performance is the lack of sufficient data
for training the TDNN. The number of parameters of each TDNN is 32*81*3+3=7779 whereas
only 40 training patterns are available for training.
Spike train classification with LSM
The associative memory readout presented for ESN can in fact also be utilized as readout
for a LSM. Similar to the ESN case, the liquid states are interpreted as 2-D images, one
dimension being time and the other space. An optimal MACE filter, which associates LSM states with
class labels, can be analytically computed for each class using training data. The value of the
liquid state is either 1 or 0 depending on the existence or absence of a spike. Note that the MACE
filter readout does not require low-pass filtering the spike trains in order to get continuous
amplitude signals for further processing.
We use a spike train classification experiment, where the goal is to classify a spike train
presented to the LSM as one of two classes (Maass et al., 2002). The two classes are represented
by randomly generated Poisson spike trains over a 100 ms time interval. Other members
of each class are generated by moving each spike of the representative spike train by an amount
drawn from a zero mean Gaussian distribution with a variance of 4 ms.
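Generating the two classes can be sketched as follows. The stated "variance of 4 ms" is read here as a jitter standard deviation of 2 ms, and the clipping of jittered spikes to the 100 ms window is our assumption; both, along with the function names, are illustrative choices.

```python
import numpy as np

def poisson_template(rate_hz, duration_s, rng):
    """Class representative: homogeneous Poisson spike times on [0, duration)."""
    n = rng.poisson(rate_hz * duration_s)
    return np.sort(rng.uniform(0.0, duration_s, size=n))

def jittered_member(template, sigma_s, duration_s, rng):
    """Class member: each spike of the representative is moved by
    zero-mean Gaussian jitter, then clipped to the interval."""
    t = template + rng.normal(0.0, sigma_s, size=template.size)
    return np.sort(np.clip(t, 0.0, duration_s))
```

Fifty such members per class then serve as the training realizations described below.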
A randomly connected LSM with 50 integrate-and-fire neurons with 20% inhibitory (I)
and 80% excitatory (E) connections is used. The membrane time constant, reset voltage,
threshold voltage, background current and input resistance of the neurons are chosen to be 30ms,
13.5 mV, 15 mV, 13.5 nA, and 1 MΩ, respectively. Absolute refractory periods for excitatory and
inhibitory neurons are 3ms and 2ms, respectively (Maass et al., 2002). The probability of a
synaptic connection between neurons a and b is given by C·exp(-(D(a,b)/λ)^2), where D(a,b)
denotes the distance between the two neurons and λ is a parameter which controls both the
average number of connections and the average distance between neurons that are synaptically
connected. The neurons are placed on a 10x5x1 grid. C is set to 0.3 (EE), 0.2 (EI), 0.4 (IE),
0.1 (II). Dynamic
synapses are used based on the model proposed in (Markram et al., 1998), with the synaptic
parameters U (use), D (depression time constant), F (facilitation time constant) randomly chosen
from Gaussian distributions. Mean values of U, D, F are set to .5, 1.1, .05 (EE); .05, .125, 1.2
(EI); .25, .7, .02 (IE); .32, .144, .06 (II), and the SD of each parameter was chosen to be 50% of its
mean. The mean of the scaling parameter W (in nA) is chosen from a Gamma distribution with
mean values 30 nA (EE), 60 nA (EI), -19 nA (IE), -19 nA (II) (Maass et al., 2002). For input
synapses, the parameter W has a value of 18 nA or 9.0 nA for excitatory and inhibitory
destination neurons, respectively. The SD of W is 100% of its mean. The postsynaptic current was
modeled as an exponential decay exp(-t/τ_s), where τ_s is 3 ms and 6 ms for excitatory and inhibitory
synapses, respectively. The transmission delays between neurons are 1.5 ms (EE) and 0.8 ms for the
other connections. The initial membrane voltage of each neuron at the start of each simulation is
drawn from a uniform distribution between 13.5 mV and 15.0 mV (Maass et al., 2002).
Fifty different realizations of spike trains for each class are generated and the liquid states
are calculated for each realization. One readout MACE filter is trained for each class using only
data from the corresponding class. Output correlation peak amplitude is assigned to be 1.0 for the
training data. Notice that the state matrix is very sparse resulting in singular matrices in the
MACE filter computation. Therefore, the matrices have to be regularized in order to compute
inverses. Here the regularization is done by adding zero mean Gaussian noise with a variance of
0.02 to the states before computing the MACE solution. The two MACE filters are synthesized
in the frequency domain using Equation 4-1 and the corresponding image plane filters are obtained by
an inverse discrete Fourier transform. For a given test signal, the output of each filter is
calculated by correlating the liquid states with the filter coefficients.
Spike train classification with LSMs is conventionally done by first low-pass filtering the
spike output of the liquid and using a linear regressor network (Maass et al., 2002). Here, we
used a low-pass filter with a time constant of 30ms. A linear readout is trained with the constant
target signal of amplitude +1 for one of classes and the constant signal of amplitude -1 for the
other class using the same data as MACE filter training. During testing, the echo states, the
output signal and two error signals one between the output and the constant signal of amplitude
+1 and another between the output and the constant signal of amplitude -1 are computed for a
given input signal. Then, two MSE values, integrated over the duration of the input pattern, are
compared to assign the class of the input pattern.
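The conventional pipeline above first converts spike trains to continuous traces with an exponential low-pass (30 ms time constant) before the linear readout. A minimal sketch of the filtering step follows, with a 1 ms simulation step assumed; the function name and discretization are ours.

```python
import numpy as np

def lowpass_spikes(spikes, dt=1e-3, tau=0.03):
    """Exponential low-pass of a binary spike matrix (neurons x time steps).

    Each output trace decays by exp(-dt/tau) per step and jumps by 1
    whenever the corresponding neuron fires.
    """
    out = np.zeros(spikes.shape, dtype=float)
    decay = np.exp(-dt / tau)
    out[:, 0] = spikes[:, 0]
    for n in range(1, spikes.shape[1]):
        out[:, n] = decay * out[:, n - 1] + spikes[:, n]
    return out
```

The filtered traces then drive the linear readout, whose output is compared against the +1 and -1 targets via the two integrated MSE values.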
Figure 4-5 shows the correct classification rate for both methods as the parameter λ, which
controls the synaptic connections in the LSM, varies from 0.2 to 4. The results are averaged over
100 trials. As observed from the figure, the MACE filter readout gives slightly better accuracy
compared to the linear readout for all λ values. This is due to the fact that the linear readout tries
to approximate a constant signal independent of the liquid state values. This creates difficulties
for the simple linear readout especially when the input spike train is sparse. In fact, under closer
analysis, the output of the linear readout fluctuates. Therefore, the decision made by the linear
readout has to be carefully averaged over time in order to reach a final decision about the class of
the input spike train. For example, if the decision is based solely on the output of the readout at
the final time instant, the classification rate is very close to the theoretical minimum of 0.5. On the
other hand, the MACE filter readout provides the ability to operate in the spike domain without
the need to low-pass filter the liquid state spike output. However, we note that the MACE filter
computation in Equation 4-1 requires extra caution since the liquid state has many zero values
mostly leading to singular matrices.
Figure 4-5. Comparison of the correct classification rates of the LSM-MACE and the LSM with
linear readout in the spike train classification problem as the parameter λ varies. The
MACE filter readout gives slightly better accuracy compared to the linear readout for
all λ values. This is due to the fact that the linear readout tries to approximate a
constant signal independent of the liquid state values. Moreover, the MACE readout
operates in the spike domain, eliminating the need to convert spike trains to
continuous valued signals by low-pass filtering.
This experiment demonstrates the use of the MACE filter as a readout for LSMs. A more
detailed study has to be carried out to explore the advantages and disadvantages of the MACE
readout compared to the standard techniques.
CHAPTER 5
IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS
Echo state networks introduced a new type of computation that has been proven to be very
powerful. In this chapter, we study how the ESN idea can be utilized for the design of other
networks. First, we investigate the echo state condition (|W|<1) and approach the echo state
condition in terms of the effect of system stability on the power of dynamical system for
computation. In particular, we relax the echo state condition and allow some of the poles to be
larger than 1. This introduces a new dynamical regime, called "transiently stable computation",
where function approximation with a readout is still possible even though the system is not
globally stable. Second, we investigate a biologically plausible model of the olfactory cortex,
the Freeman Model (FM), and propose a readout for the FM so that it can be used for useful
computation. Without the proposed readout, the use of the FM is limited to simple digit recognition.
An interesting property of the FM is the nonlinear function used in the model. Unlike the sigmoidal
nonlinearity used in ESNs, the FM nonlinearity does not have its largest slope at the zero operating point.
We will demonstrate with experiments that FM coupled with a readout can process continuous
valued signals.
Transiently Stable Computation
Linear filters are the simplest dynamical systems employed to process signals with
temporal structure as discussed in chapter 1. They can be classified according to the impulse
response, which fully characterizes the filter, as finite impulse response (FIR) and infinite
impulse response (IIR) filters (Haykin, 2001). For the IIR case, stability becomes an essential
constraint in the design. Without stability, the system response will diverge independent of the
input signal. Global asymptotic stability of linear filters can simply be guaranteed by selecting
the poles of the system in the proper region of the frequency domain, open left half plane for the
continuous-time systems and inside the unit circle for the discrete-time systems (Kailath, 1980).
TDNN and RNN can be considered as the nonlinear counterparts of the FIR and IIR filters,
respectively. Similar to the linear case, stability becomes an issue for RNNs, although the
definition of stability is no longer BIBO (bounded input bounded output) stability. In the ESNs,
stability issue is addressed with the echo state condition (Jaeger, 2001) and has been investigated
in detail in chapter 2. Echo state condition is derived from the stability of the linearization of the
system around zero equilibrium and guarantees that the states are strongly coupled with the input
signal allowing the system to compute an input-output map. In this chapter, we introduce a new
computational mode that emerges when we relax the echo state condition and allow the spectral
radius to be slightly greater than 1. First, let us consider an ESN that obeys the echo state
condition.
Conventional Echo State Networks
A simple demonstration of the typical response of an ESN with echo state condition is
given in Figure 5-1. A randomly connected 100-unit ESN without output feedback connections is
constructed. The entries of the internal connection weight matrix W are set to 0.4, -0.4 and 0
with probabilities of 0.025, 0.025 and 0.95, resulting in a spectral radius of 0.9. The input weight
matrix, W^in, has values +1 or -1 with equal probabilities. The input to the system (Figure 5-1 A)
has 3 different regions: a zero signal for the first 100 steps, a ramp signal for the following 100
steps, and again a zero signal of length 100. The system is initialized randomly and run with the
given input.
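The construction just described is straightforward to reproduce. The NumPy sketch below is our own hypothetical re-implementation (variable names are ours): it builds the sparse W with the stated entries and probabilities, rescales it to the stated spectral radius, and runs the network on the three-region input.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100  # number of internal processing elements

# Internal weights: 0.4, -0.4 or 0 with probabilities 0.025, 0.025, 0.95
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()  # rescale to spectral radius 0.9

# Input weights: +1 or -1 with equal probability
W_in = rng.choice([1.0, -1.0], size=N)

# Input: 100 zero steps, a 100-step ramp, 100 more zero steps
u = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.zeros(100)])

# Run the network from a random initial state
x = rng.uniform(-0.5, 0.5, N)
states = np.empty((len(u), N))
for t, ut in enumerate(u):
    x = np.tanh(W @ x + W_in * ut)
    states[t] = x
```

Since the spectral radius is below 1, the random initial transient dies out and the states track the input, as described next.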
Figure 5-1. Demonstration of a typical response of ESN with a spectral radius of 0.9 and
application to function approximation. A) The input signal. B) 100 echo states of
ESN. C) The overlay plot of the ESN output and the desired signal.
Figure 5-1 B depicts all the 100 echo states on top of each other over the given time
interval. As seen from the figure, during the first 100 steps, the system state converged to zero
(after an initial transient of length about 30 steps caused by the random initial condition) since
the input to the system is zero and the spectral radius is less than 1. Then, the echo states are
driven by the ramp signal between time steps 100 and 200. The states again converge
to zero, due to the echo state condition, when the input is removed. The resulting echo states can
be used to generate any desired response related to the input signal. Here, the desired signal is
chosen to be the seventh power of the input signal, and the readout weights are calculated using
Equation 2-4. Figure 5-1 C displays the overlay plot of the optimal ESN output and the desired
signal. As observed from the figure, the system output very accurately estimates the desired
signal.
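The batch readout computation can be sketched end-to-end. Equation 2-4 is the Wiener solution for the readout weights; the hypothetical re-implementation below (our own variable names) collects the states and solves the equivalent least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
# Sparse internal weights rescaled to spectral radius 0.9, as in the text
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()
W_in = rng.choice([1.0, -1.0], size=N)

u = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.zeros(100)])
x = np.zeros(N)
X = np.empty((len(u), N))          # state matrix: one row per time step
for t, ut in enumerate(u):
    x = np.tanh(W @ x + W_in * ut)
    X[t] = x

d = u ** 7                          # desired signal: seventh power of the input
W_out, *_ = np.linalg.lstsq(X, d, rcond=None)  # batch least-squares readout
y = X @ W_out                       # optimal linear readout output
mse = np.mean((y - d) ** 2)
```

The least-squares solution is what the Wiener filter reduces to in this batch setting; the mean-squared error between y and d is small, mirroring the close overlay in Figure 5-1 C.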
Transiently Stable Computation with Echo State Networks
Engineering systems are mostly designed with a global stability restriction. We have seen a
similar constraint, echo state condition, in the design of echo state networks. We have also
demonstrated with a function approximation experiment that ESNs with the echo state condition
can do useful computation. On the other hand, biological computation may not possess a similar
stability constraint (Freeman, 1975; Freeman, 1993; Yao and Freeman, 1990). Indeed, biological
systems do not have fixed point dynamics but rather wide variety of collective behavior in the
form of oscillations and even chaos (Freeman, 1975; Yao and Freeman, 1990). Inspired by the
principles of biological computation, we would like to explore the network's response without
the restriction on the spectral radius of W. Hence, we will remove the echo state condition and
let the spectral radius be slightly greater than 1. Obviously, this will introduce instability for
the autonomous system, whereas the response of the system with an applied input is not yet
obvious.
Figure 5-2. Demonstration of a typical response of transiently stable ESN with a spectral radius
of 1.1 and application to function approximation. A) 100 echo states of the transiently
stable ESN. B) The overlay plot of the ESN output and the desired signal.
In order to observe the ESN response without the echo state condition, we use an
experimental setup similar to that of the previous section. We take the same W matrix and
scale it to obtain a spectral radius of 1.1. The same input signal (Figure 5-1 A) is fed to the ESN
that is initialized to a random initial condition. We plot the resulting echo states in Figure 5-2 A.
There are a few observations to be made at this point. First of all, we see that the system does not
converge to zero during the first 100 steps although the input is zero. Instead, the echo states
exhibit a nonconvergent dynamical response that differs from the ESN with the echo state
condition. In fact, this was the expected response, since the system is designed to be unstable
(spectral radius is greater than one) around zero equilibrium. A similar response can also be
observed during the last 100 time steps when the input is again zero. The second and more
important observation is the response of the system between time steps 100 and 200 where the
ramp signal is applied as the input. The echo states during this interval become more regular
(after some initial transient between time steps 100 and 130) compared to the states when there is
no input. In fact, after the transient, the echo states look similar to the ones in Figure 5-1 B where
the ESN satisfies the echo state condition. Thirdly, there is a transition period between time steps
100 and 130 where the states switch from a disordered mode to a more regular mode. In
summary, the system responds according to its own dynamics when there is no input. As the
input amplitude gradually increases, the system response is determined by a competition between
the system dynamics and the input amplitude. When the input amplitude is sufficiently large, the
system dynamics becomes "transiently stable" and is determined by the input signal. We would
like to find out if we can utilize the transiently stable ESN for the same function approximation
problem we had in the previous section. The weights of the linear readout network are again
computed using Equation 2-4 and the corresponding output is generated in Figure 5-2 B. As seen
from the figure, we got a good match between the output and the desired signal. Similar results
can be obtained even if the system is initialized to a different initial condition or the time instant
at which the ramp signal is applied (this will change the state value when the ramp signal is
applied) changes. This experiment clearly demonstrates that "transiently stable" ESN can do
useful computation.
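Rescaling a random sparse W to a spectral radius of 1.1 and driving it with the same three-region input reproduces this qualitative behavior in a few lines (a hypothetical sketch with our own variable names): with zero input the states do not settle at zero, since the null state is unstable.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
W *= 1.1 / np.abs(np.linalg.eigvals(W)).max()  # violate the echo state condition
W_in = rng.choice([1.0, -1.0], size=N)

u = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.zeros(100)])
x = rng.uniform(-0.5, 0.5, N)
X = np.empty((len(u), N))
for t, ut in enumerate(u):
    x = np.tanh(W @ x + W_in * ut)
    X[t] = x

# With zero input the autonomous dynamics keep the states away from zero,
# because the origin is unstable when the spectral radius exceeds 1.
idle_amplitude = np.abs(X[50:100]).max()
```

The tanh saturation bounds the states, so the instability produces bounded self-sustained activity rather than divergence; a readout can then be fit to X exactly as in the stable case.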
Understanding Transiently Stable Computation
We will utilize the linearization analysis proposed in chapter 2.2 in order to quantify the
local dynamics of the system. The pole movement of ESN with spectral radius of 0.9 is
illustrated in Figure 5-3. According to the figure, poles always stay inside the unit circle even
though the input signal changes. As the input signal strengthens, the poles move towards the
origin of the z-plane (decreasing the spectral radius), which results in a more stable system. This
is because higher input signals saturate the nonlinear function, reducing the slope at which the
system is operating. Similar observations have already been made in chapter 2.2.
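This linearization is easy to compute explicitly. For tanh processing elements, the local Jacobian at an operating point x̄ is diag(1 - x̄^2)W (the form behind Equation 2-5), and its eigenvalues are the poles plotted in Figures 5-3 and 5-4. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def local_poles(W, x_bar):
    """Poles of the ESN linearized around the operating point x_bar.

    For tanh processing elements the local Jacobian is diag(1 - x_bar**2) @ W,
    so its eigenvalues are the instantaneous poles of the network.
    """
    return np.linalg.eigvals(np.diag(1.0 - x_bar ** 2) @ W)

rng = np.random.default_rng(3)
N = 50
W = rng.standard_normal((N, N))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()  # spectral radius 0.9

# At the zero state the tanh slope is 1, so the poles equal the eigenvalues
# of W itself; a saturated operating point shrinks them toward the origin.
poles_origin = local_poles(W, np.zeros(N))
poles_saturated = local_poles(W, 0.9 * np.ones(N))
</antml>```

Evaluating `local_poles` along a trajectory of states reproduces the pole movement shown in the figures.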
Figure 5-3. Movement of the poles for ESN with spectral radius 0.9 are plotted in figures A, B,
C, D, when the input is at points labeled by A, B, C, D in Figure 5-1 A, respectively.
Now, let us examine the movement of the poles of the transiently stable ESN of the previous
section. Figure 5-4 shows the movement of the poles for the same input signal. First of all, the
poles of the system move in and out of the unit circle during the first and last 100 time steps
(hence the zero state is not asymptotically stable). This explains the complex echo state signal when
there is no input to the system (Figure 5-2 A). The poles of the ESN with spectral radius 0.9 stay
inside the unit circle during the same time interval. Secondly, when the ramp input is
Figure 5-4. Movement of the poles for ESN with spectral radius 1.1 are plotted in figures A, B,
C, D, when the input is at points labeled by A, B, C, D in Figure 5-1 A, respectively.
applied, the poles start to shrink towards the unit circle (after some transient) as is observed in
the stable ESN. This phenomenon is again due to the sigmoid nature of the nonlinearity, which
saturates with large input signals, reducing the slope of the operating point in Equation 2-5. With
the movement of the poles towards the origin, the system stabilizes transiently, allowing the
system state to be controlled by the input. This is critical for the use of the system for
computation (function approximation). Finally, we observe a transient time, when the ramp
signal is applied, before the poles start shrinking. In this period, there is a competition between
the system dynamics and the input signal. A few extra comments about the selection of
the spectral radius are in order at this point. First of all, the transient response time after the ramp
is applied is determined by two factors: the spectral radius and the slope of the ramp signal. The
transient time will increase if the spectral radius of W is increased or the slope of the ramp signal
is reduced, because as the spectral radius is increased, the force required to bring the poles inside
the unit circle is increased, hence a larger signal is required. Therefore, there is a balance
between the system dynamics determined by the spectral radius and the input signal required to
stabilize the system. Secondly, it is important to quantify the unstable behavior of the signals
present in the transiently stable ESN when the input is absent. We have observed that when the
spectral radius is slightly greater than 1, the signals are periodic oscillations whereas for larger
values of the spectral radius, the signals are more irregular and even chaotic.
Freeman's K Sets for Dynamical Computation
Understanding information processing in brains remains a challenge. Many different
hypotheses on how the brain might process the massive bombardment of information brought by
sensory systems have been advanced, ranging from artificial intelligence, neural networks and
more recently machine learning and statistics (Freeman, 1975; Yao and Freeman, 1990; Chua
and Yang, 1988; Wang, 1999). Walter Freeman developed a biologically realistic mathematical
model for the olfactory cortex that captures at a cell assembly level (mesoscopic level), some of
the physiological properties of the olfactory system. Freeman model (FM) is based on nonlinear
recurrent dynamical interactions representing the response of thousands of cells, the neural
assemblies, instead of single neuron responses. In order to derive his model, Freeman used
neurophysiological recordings, neuroscience concepts, and the K set hierarchy (Freeman,
1975) to mimic the behavior of olfaction. This model is a
drastic departure from other neural network models such as Hopfield's, because the system
response is produced by interactions of second-order nonlinear sets.
Freeman's major drive is to understand how the olfactory system works; here, instead, we aim at
using FM as a computational framework for the interaction of animats with the real world. Real
worlds are complex dynamical environments that unfold unpredictably, with much redundancy
but also many unrelated cues. In this setup, the fundamental operations are the recognition
of prior situations that will induce prototypical responses dependent upon the animat's goals. The
constraints of the experimental setting are also rather important. The operation is intrinsically
real-time in the sense that only the present value of the continuous stream of information from
the sensors is available and must be processed at each time step, while the full understanding of
the situation is most often related to the input history. So we favor signal processing
methodologies instead of statistics where the behavior in time is always rather constrained. The
animat may also need to learn how to recognize new situations when they become important for
its function, therefore it must have internal representational structures that expect outcomes and
change behaviors. Brains process information for the sake of the animal's survival, while
animats have external goals preprogrammed by humans, and as such they must possess an extra
level of functionality that translates internal representations back and forth to the outside, i.e. it is
necessary to define input and outputs (in the animal, brain's only output is the motor system).
Therefore we have to emphasize this translation between distributed dynamical representations
and outputs in the form of design principles for optimal readouts.
The framework we are pursuing is based on distributed dynamics instead of machine
learning. Since Hopfield, the role of dynamics for associative memories has been well understood
(dynamical systems with fixed point attractors). In fact, FM has also been utilized as a dynamical
auto-associative memory (Tavares, 2001; Ozturk et al., 2004; Xu et al., 2004). FM
weights are trained with Hebbian learning while two binary patterns to be stored are presented to
the system successively. During testing, a noise corrupted version of one of the two stored
patterns is presented to the system. Then, the energy of the oscillatory system response is
computed and compared to a threshold to choose the correct stored pattern. We see two major
shortcomings with the energy-based readout. First, utilizing just the energy is wasteful since it
ignores the information embedded in the dynamics of the signals. Second, it does not exploit the
global information in the states since it is a local method computing the energy of individual
processing elements.
In this chapter, we demonstrate that FM can be considered in the same framework of
ESN/LSM. In fact, the KI and KII networks of FM are conceptually similar to the
reservoir/liquid in an ESN/LSM with a proper selection of the parameters. The big difference is
that the KII layer is highly structured with a coupled set of nonlinear oscillators. We will derive
conditions on the system parameters for the FM similar to the echo state property of ESNs. With
the proposed framework, it becomes evident that the OB layer alone lacks a readout in order to
be used as a universal computing machine for time series. Therefore, we propose to use an
adaptive linear network to implement the readout from the OB layer. We present experimental
results to show the power of this framework for the FM of the olfactory cortex. In particular, we
show that it is not necessary to drive Freeman's model with 0s and 1s, as is presently done.
An Overview of Freeman Model
Freeman Model is a biologically plausible mathematical model of the olfactory cortex. The
dynamical behavior of the FM mimics the physiological signals obtained from the cortex. It is a
hierarchical model of different levels (KO, KI, KII and KIII), where simpler structures at the
bottom of the hierarchy combine to form the more complex structures at the top of the hierarchy.
The basic building block of the model is the KO set, which is depicted in Figure 5-5 A.
Every KO unit can accept several spatial inputs that are weighted and summed, and then
convolved with a linear time invariant system defined by the second order dynamics whose
transfer function, H(s), is given by
    H(s) = ab / ((s + a)(s + b))                                        (5-1)
where 1/a and 1/b are real time constants determined experimentally (Freeman, 1975). The
output of the linear dynamics is then shaped by the nonlinear sigmoid function, which is
experimentally determined to be
    Q(x, Q_m) = Q_m (1 - exp(-(e^x - 1)/Q_m))   if x > x_0              (5-2)
    Q(x, Q_m) = -1                              if x < x_0
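A minimal implementation of this asymmetric sigmoid makes its key property explicit: unlike tanh, its slope peaks at x = ln Q_m rather than at zero. This is our own sketch; the value Q_m = 5 is chosen for illustration, and the cutoff x_0 is derived by setting the smooth branch equal to -1.

```python
import numpy as np

def freeman_sigmoid(x, q_m=5.0):
    """Freeman's asymmetric sigmoid (Equation 5-2), with illustrative q_m.

    Q(x) = q_m * (1 - exp(-(e**x - 1)/q_m)) above the cutoff x0, and -1
    below it; x0 is where the smooth branch reaches -1.
    """
    x = np.asarray(x, dtype=float)
    x0 = np.log(1.0 - q_m * np.log(1.0 + 1.0 / q_m))  # Q(x0) = -1
    q = q_m * (1.0 - np.exp(-(np.exp(x) - 1.0) / q_m))
    return np.where(x > x0, q, -1.0)

# The derivative Q'(x) = exp(x) * exp(-(e**x - 1)/q_m) peaks at x = ln(q_m),
# i.e. at a positive operating point, not at zero as for tanh.
xs = np.linspace(-2, 3, 1001)
slopes = np.gradient(freeman_sigmoid(xs), xs)
x_peak = xs[np.argmax(slopes)]
```

This displaced maximum slope is what the text refers to: small positive activity is amplified more than activity around zero, which shapes the oscillatory behavior of the higher K sets.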
The sign of the connection strength from a KO set will define the type of the KO set
(excitatory for positive weights and inhibitory for negative weights). The second level in the
hierarchy is the KI network. KO sets with a common sign (either excitatory or inhibitory) are
connected through forward lateral feedback to construct a KI network (Figure 5-5 B). No auto-
feedback is allowed in the network. KI network is generally used as an input layer for the higher
levels of Freeman model. It preprocesses the input to transform it into a space which is
compatible with OB dynamics. In this study, we will use KI network as an echo state network
containing a representation of the input history.
Figure 5-5. The building blocks of Freeman Model. A) The KO set. B) KI network. C) KII
Set. D) Reduced KII set. E) The asymmetric nonlinear function.
The third level in the hierarchy, the KII set (Figure 5-5 C), is the most interesting and
important building block of the olfactory system, since it is an oscillator controlled by the input.
The response of the KII set to an impulse input is a damped oscillation, whereas with a sustained
input, the output oscillation is maintained as long as the input remains. The architecture of the KII
set is shown in Figure 5-5 C, where the circles denote KO sets and the sign indicates the type of
connection. Figure 5-5 D shows a reduced KII set, where the two KO units are denoted by M (for
mitral cell) and G (for granular cell). In this model, the mitral cell takes the external input P(t)
and the coupling strengths between M and G are controlled by the two weights Kmg > 0
(excitatory) and Kgm < 0 (inhibitory). If the coupling weights are selected properly, the RKII set
is, similar to the KII set, an oscillator controlled by the input (Freeman, 1975).
A KII network can be formed by interconnecting a number of KII sets with excitatory
connections between the excitatory cells (top circles in Figure 5-6) and inhibitory connections
between inhibitory cells (bottom circles in Figure 5-6). An RKII network is formed similarly to the
KII network by connecting a number of RKII sets. We will use the RKII network in our simulations
instead of KII networks, since various analytic tools are available for RKII and it has been
demonstrated that RKII is functionally very similar to the KII (Freeman, 1975; Xu et al., 2004).
This interconnected structure represents a key stage of learning and memory in the olfactory
system. Input patterns through M cells are mapped into spatially distributed outputs. Excitatory
and inhibitory interconnections enable cooperative and competitive behaviors, respectively, in
this network. The RKII network functions as an associative memory (Tavares, 2001; Xu et al.,
2004; Principe et al., 2001; Ozturk et al., 2004).
Figure 5-6. A full KII network.
The final Katchalsky level is the KIII network and represents the olfactory cortex. In a KIII
network, two/three KII sets and a KII network are tightly coupled through dispersive
connections, mimicking the different lengths and thicknesses of axons. Since the intrinsic
oscillating frequencies of the KII sets in the different layers are incommensurate among
themselves, this network of coupled oscillators will present chaotic behavior. For a detailed
description of the KIII network, we refer the reader to (Freeman, 1975; Xu et al., 2004).
Now, we will formulate a standard form for the state (outputs of KO sets) equations of the
Freeman model, valid for all hierarchical sets of FM. Consider a Freeman network with M input
units, N KO sets and L output units. State equations for this Freeman model can be expressed as

    x''(t) + (a + b) x'(t) + ab x(t) = ab(W Q(x(t)) + W^in u(t))         (5-3)

Here, x(t) = (x_1(t), x_2(t),..., x_N(t)) is the state vector, where each entry is the output of a KO
set at time t, and u(t) = (u_1(t), u_2(t),..., u_M(t)) is the input vector. W is an N*N matrix defining
the interconnection weight values between the KO sets and W^in is an N*M matrix defining the
interconnection weights between the inputs and the KO sets.
For example, consider the RKII set depicted in Figure 5-5 D. The governing equations for
the RKII set are given by
    m''(t) + (a + b) m'(t) + ab m(t) = ab(K_gm Q(g(t)) + p(t))
                                                                         (5-4)
    g''(t) + (a + b) g'(t) + ab g(t) = ab K_mg Q(m(t))
In an RKII set, the number of KO sets is two and the number of inputs is one. Hence, in the
standard form, M is one and N is two. With this information, Equation 5-4 can be restated in the
standard form of Equation 5-3 with the state vector x(t) = [m(t), g(t)]^T, input vector u(t) = [p(t)],
the input weight vector W^in = [1, 0]^T and the weight matrix

    W = [ 0      K_gm
          K_mg   0    ]
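Under these definitions, Equation 5-4 can be integrated directly. The sketch below uses forward Euler; the parameter values (a, b, the couplings and Q_m) are illustrative placeholders of our own choosing, not Freeman's experimentally fitted constants.

```python
import numpy as np

def simulate_rkii(p, dt=0.01, a=0.22, b=0.72, k_mg=0.5, k_gm=-0.3, q_m=5.0):
    """Forward-Euler integration of the reduced KII set (Equation 5-4).

    p is the input sequence applied to the mitral cell M; returns the
    mitral output m(t). All parameter values are placeholders.
    """
    def Q(x):
        # Freeman's asymmetric sigmoid (Equation 5-2)
        x0 = np.log(1.0 - q_m * np.log(1.0 + 1.0 / q_m))
        return q_m * (1.0 - np.exp(-(np.exp(x) - 1.0) / q_m)) if x > x0 else -1.0

    m = m_dot = g = g_dot = 0.0
    out = np.empty(len(p))
    for t, pt in enumerate(p):
        m_ddot = -(a + b) * m_dot - a * b * m + a * b * (k_gm * Q(g) + pt)
        g_ddot = -(a + b) * g_dot - a * b * g + a * b * k_mg * Q(m)
        m, m_dot = m + dt * m_dot, m_dot + dt * m_ddot
        g, g_dot = g + dt * g_dot, g_dot + dt * g_ddot
        out[t] = m
    return out

# Step input on for the first half, off for the second half
response = simulate_rkii(np.concatenate([np.ones(20000), np.zeros(20000)]))
```

With suitably chosen couplings the driven response is an input-controlled oscillation, while the small couplings used here keep the zero state stable once the input is removed, so the response decays back toward zero.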
Dynamical Computation with Freeman Model with a Readout
In order to be able to compute with a dynamical system, one has to be aware of how the
information is represented in the system states. Once the form of representation is known,
desired information can be extracted from the system states with an appropriate readout
mechanism. For instance, in Deliang Wang's network (Wang, 1999), the phase difference
between the coupled oscillators is the source of information. For FM, it has been argued that the
information is encoded as amplitude modulated signals, and hence the energy of the KO output has
been used as the relevant information from the FM in practical applications. For example, the RKII
network has been proposed as an auto-associative memory that can store static binary patterns
(Tavares, 2001; Xu et al., 2004; Ozturk et al., 2004). In (Xu et al., 2004) an RKII network of 64
reduced KII sets is constructed with fixed K_mg, K_gm and K_ii values (determined experimentally
from biological data). The excitatory weights of the RKII network are trained with Oja's unsupervised
learning rule (a stable version of the Hebbian learning rule) while two binary patterns to be
stored are presented to the system successively. During testing, a noise corrupted version of one
of the two stored patterns is presented to the system. Each RKII set creates limit cycle
oscillations of low or high energy depending on the stored and the input pattern applied. Then,
the energy (averaged over a time interval) of the mitral KO set of each RKII set is computed and
compared to a threshold to decide the binary system output. We see two major shortcomings
with the energy-based readout for FM. First, utilizing just the energy is wasteful since it ignores
the information embedded in the dynamics of the system state. Second, it does not exploit the
global interdependencies among the RKII sets since it is a local method computing the energy of
individual processing elements. Hence, the energy-based readout cannot be optimal and cannot
be used for continuous dynamic patterns.
It is clear that a more principled approach is necessary in order to use FM to process
signals that are not necessarily binary or static. With the analogy to the LSM/ESN framework,
we propose to use a readout which is a linear or nonlinear projection of all the RKII states.
role of the readout is to extract specified desired information from the FM states.
Figure 5-7. Freeman Model as an associative memory. A) Stored patterns. B) Recall of the
corrupted patterns: The first, second and third columns are the input to the KII
network, the output before the threshold detector, and the output after the threshold
detector, respectively.
We will also demonstrate that Freeman's K sets provide a complete system; that is to say,
the readout can also be chosen from the Freeman hierarchy, namely a KO set or a KI network,
depending on the output layer dimension. When a linear readout is used, the adaptation of the
overall system reduces to a linear regression problem, which can be solved either by the Wiener
filter equation offline or by a gradient-based learning rule online. We will use the KI network and KII
network architectures analogous to liquid in LSM or the reservoir in ESN. The use of KIII
network as the dynamical reservoir is outside the scope of this study. In summary, with the
adoption of the LSM/ESN framework, the states of the FM are dynamical basis functionals
(representation space) created from nonlinear combinations of the input, and the readout simply
finds the best projection of the desired response in this representation space.
We will provide two theorems for the echo state property of the Freeman RKII network.
The first theorem states sufficient conditions on the parameter values of an RKII set resulting
in echo states. In the second theorem, we give a sufficient condition for the nonexistence of
echo states for all Freeman networks. Unfortunately, we do not have a general theoretical result
for the sufficient conditions resulting in echo states for all Freeman networks in standard form
given by Equation 5-3.
Theorem 4.1: Consider a reduced KII set with the governing equations given by Equation
5-4. Let the parameter values K_mg and K_gm satisfy

    |K_mg K_gm| < ab / ((a + b)^2 (Q'_max)^2),

where Q'_max denotes the maximum value of the derivative of the nonlinear function Q, and a and b
are the filter coefficients. Then the network has echo states for all inputs u and for all states x. The
proof follows from the linear stability analysis of the RKII set, which has been extensively
studied in (Xu and Principe, 2005).
Proof: Consider the governing equations for the RKII set given in Equation 5-4. An
equilibrium point of the nonlinear differential Equation 5-4 satisfies

    m_0 = K_gm Q(g_0) + P
                                                                         (5-5)
    g_0 = K_mg Q(m_0)

Using the nullclines defined by Equation 5-5, we can see that for a given set of parameters
and the input, the system has a unique equilibrium point (Figure 5-8).
Now consider the state space representation of Equation 5-4:

    m'_1 = m_2
    m'_2 = -(a + b) m_2 - ab m_1 + ab(K_gm Q(g_1) + p(t))
                                                                         (5-6)
    g'_1 = g_2
    g'_2 = -(a + b) g_2 - ab g_1 + ab K_mg Q(m_1)
Figure 5-8. Nullcline graph of the reduced KII set.
Then, the linearization of the system around the unique equilibrium point leads to the
Jacobian matrix of the form

    [ 0                 1          0                 0        ]
    [ -ab               -(a + b)   ab K_gm Q'(g_0)   0        ]          (5-7)
    [ 0                 0          0                 1        ]
    [ ab K_mg Q'(m_0)   0          -ab               -(a + b) ]
Checking the eigenvalues of Equation 5-7, we see that the equilibrium is stable if

    |K_mg K_gm| < ab / ((a + b)^2 Q'(m_0) Q'(g_0)),

and the equilibrium is unstable if

    |K_mg K_gm| > ab / ((a + b)^2 Q'(m_0) Q'(g_0)).

The system has a bifurcation point if

    |K_mg K_gm| = ab / ((a + b)^2 Q'(m_0) Q'(g_0))

(Xu and Principe, 2005).
Now, assume that the parameters K_mg and K_gm satisfy |K_mg K_gm| < ab / ((a + b)^2 (Q'_max)^2).
Then the unique equilibrium point of the system will be stable. This means that, independent of the
initial conditions, the system will approach the same trajectory for a given set of parameter
values and the input sequence. Moreover, the states will converge to zero when there is no input.
Theorem 4.2: Consider a Freeman network containing N KO sets with the standard form
given in Equation 5-3. Define the effective weight matrix

    W~ = [ 0            I         ]
         [ -ab(I - W)   -(a + b)I ]                                      (5-8)

where I denotes the N*N identity matrix and 0 denotes the N*N zero matrix. Let the effective weight
matrix W~ satisfy real(lambda_max) > 0, where lambda_max is the eigenvalue of W~ with the largest
real part. Then the network has an asymptotically unstable null state. This implies that it has no
echo states for any input set containing 0.
Proof: Consider a Freeman network with governing equations in the standard form given by
Equation 5-3. In state-space form, Equation 5-3 can be expressed as

    x'_1 = x_2
                                                                         (5-9)
    x'_2 = -(a + b) x_2 - ab x_1 + ab(W Q(x_1) + W^in u)

where x_1 and x_2 are both N dimensional vectors. For zero input, Equation 5-9 has the trivial null
solution. Now, consider the Jacobian matrix of Equation 5-9 from linearization around the origin,