
Dynamical Computation with Echo State Networks



DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS

By

MUSTAFA CAN OZTURK

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007


© 2007 Mustafa Can Ozturk


To my parents, Necmi Ozturk and Saliha Ozturk, who made everything I have achieved possible, for their enduring love, support, and generosity.


ACKNOWLEDGMENTS

First and foremost, I would like to thank my supervisory committee chair, Dr. Jose Principe, for his inspiring ideas, constant support, and guidance. His enthusiasm for life and imaginative way of understanding concepts inspired me to learn different aspects of science and engineering. Without him, this dissertation would not have been possible.

Secondly, I would like to thank Dr. John Harris for serving on my supervisory committee and offering wise insights into academia and other aspects of life. I also wish to thank the other members of my supervisory committee, Dr. Arunava Banerjee and Dr. Thomas Demarse, for their insightful comments and suggestions.

During my studies, I gained much from the interactive environment at the Computational NeuroEngineering Laboratory. All of my colleagues helped me by sharing their knowledge and inspirational ideas. I especially thank Dongming Xu for his help and long hours of productive discussions. Much of this dissertation was the result of collaboration with Aysegul Gunduz, Nicholas Dedual, Rodrigo Sacchi, Sohan Seth, Johan Nyqvist, Kyu-Hwa Jeong, Ismail Uysal, Harsha Sathyendra, and Mark Skowronski. I would like to thank each one of them for their expertise and insightful comments.

I would like to specifically thank my wonderful friends Jianwu Xu, Rui Yan, and Anant Hegde. Our deep philosophical conversations and sleepless nights working on projects are unforgettable. They were certainly more than friends to me, and I will always remember them with great respect.

I would like to thank my big family for their constant support and trust in me. Everything I have achieved was possible only with their love, encouragement, and guidance.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION
    Literature Review of Dynamic Computation
        Short-term Memory Structures
        Network Architectures for Dynamical Computation
    Literature Review of Echo State Networks
    Thesis Goals

2 ANALYSIS OF ECHO STATE NETWORKS
    Echo States as Bases and Projections
    ESN Dynamics as a Combination of Linear Systems
    Average State Entropy as a Measure of the Richness of the ESN Reservoir

3 DESIGN OF THE DYNAMICAL RESERVOIR
    Design of the Recurrent Connections
    Design of the Adaptive Bias
    Experiments
        Short-term Memory Structures
        Binary Parity Check
        System Identification

4 DESIGN OF THE READOUT
    Design of the Readout for Function Approximation
    Design of the Readout for Dynamical Pattern Recognition
        Linear Associative Memories
        The MACE Filter
        The ESN/LSM MACE for Dynamical Pattern Recognition
        The Design of ESNs for Dynamical Pattern Recognition
    Experiments
        Classification of temporal signals with unknown peak values
        Robust signal detection in a digital communication channel
        Classification of odors with an electronic nose
        Spike train classification with LSM

5 IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS
    Transiently Stable Computation
        Conventional Echo State Networks
        Transiently Stable Computation with Echo State Networks
        Understanding Transiently Stable Computation
    Freeman's K Sets for Dynamical Computation
        An Overview of the Freeman Model
        Dynamical Computation with the Freeman Model with a Readout
        Experiments
            Function approximation
            K0 set as the readout for function approximation
            Autonomous sequence generation with output feedback
            Binary parity check
            Multi-attractor learning

6 APPLICATIONS
    Complex Echo State Networks
        Channel Equalization
    Brain Machine Interfaces
        Experimental Setup
        Methods
    Forecasting Water Inflow Using the Echo State Network
        Water Inflow Problem
        Results
    Spike Detection with Echo State Networks
        Problems in Spike Detection
            Threshold detection
            Matched filter
        Spike Detection with ESN-MACE
    Robust Signal Detection in a Digital Communication Channel

7 CONCLUSIONS
    Understanding Echo State Networks
    Designing Echo State Networks
    Computation at the Edge of Chaos
    Transiently Stable Computation
    Echo State Networks for Temporal Pattern Recognition
    Channel Equalization with Complex Echo State Networks
    Freeman Model
    Applications
        Brain Machine Interfaces
        Prediction of Water Inflow
        Spike Detection
        Matched Filtering
    Possible Future Directions

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

6-1 Averaged correlation coefficients between the actual hand trajectory and the model outputs, computed over non-overlapping windows of 5 seconds
6-2 Performance comparison of the models for water inflow prediction


LIST OF FIGURES

1-1 Delay line memory of order N
1-2 Dynamic network with a short-term memory structure followed by a static mapper
1-3 Generic structure for the universal myopic mapping theorem: a bank of filters followed by a static network
1-4 Echo state network (ESN). The ESN is composed of two parts: a fixed-weight (W) recurrent network and a linear readout
2-1 Performance of ESNs for different realizations of W with the same weight distribution. Results show that the spectral radius of W is not the unique parameter that determines the performance of an ESN
2-2 Pole tracks of the linearized ESN when the input goes through a cycle
2-3 Echo states and state entropy for different ESNs
2-4 The ASE values obtained from 50 realizations of ESNs with the same spectral radius
3-1 Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly generated ESNs with different sparseness
3-2 The k-delay STM capacity of each ESN for delays 1, ..., 40 computed using the test signal
3-3 Number of wrong decisions made by each ESN for m = 3, ..., 8 in the binary parity check problem
4-1 Interpretation of echo states as a 2-D image
4-2 Performance comparisons of ESNs for the classification of signals with unknown peak values
4-3 Time series representing the response pattern of the 32 electronic nose sensors exposed to rosemary
4-4 Comparison of the ESN-MACE filter output and the MACE filter output trained on the input space for the rosemary class
4-5 Comparison of the correct classification rates of LSM-MACE and LSM with linear readout trained in the spike train classification problem as the parameter varies
5-1 Demonstration of a typical response of an ESN with a spectral radius of 0.9 and application to function approximation


5-2 Demonstration of a typical response of a transiently stable ESN with a spectral radius of 1.1 and application to function approximation
5-3 Movement of the poles for an ESN with spectral radius 0.9
5-4 Movement of the poles for an ESN with spectral radius 1.1
5-6 A full KII network
5-7 Freeman model as an associative memory
5-8 Nullcline graph of the reduced KII set
5-9 Function approximation with the Freeman model
5-10 Overlay plot of the desired signal and output signal of the Freeman model with K0 readout
5-12 Binary parity check with the Freeman model
5-13 Sample input-output signals for multi-attractor learning
5-14 Desired and output signals during training for multi-attractor learning with the Freeman model
5-15 Desired and output signals during testing for multi-attractor learning with the Freeman model
6-1 Block diagram of a communication system with a nonlinear dispersive channel
6-2 Constellation diagram of 16-QAM. A) Input symbols. B) Output symbols
6-3 SER vs. SNR plot for four networks
6-4 Constellation diagram of rectangular QPSK. A) Input symbols. B) Output symbols
6-5 SER vs. SNR plot for four networks
6-6 Comparison of three methods in brain machine interface modeling. Windowed correlation coefficients in all spectral bands
6-7 Training performance of the ESN for the water inflow prediction at the Furnas hydroelectric power plant
6-8 Testing performance of the ESN for the water inflow prediction at the Furnas hydroelectric power plant
6-9 Spike shapes recorded with high SNR


6-10 Spike shapes in the rat dataset. Numerous distinct spike waveforms are present in this dataset
6-11 Segment of a neural recording from a rat's brain
6-12 Comparison of ROC curves for the rat dataset
6-13 Comparison of ROC curves for ESN-MACE and the linear matched filter under Gaussian and impulsive noise conditions


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS

By

Mustafa Can Ozturk

May 2007

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

Echo state networks (ESNs) were recently proposed as a new recurrent neural network (RNN) paradigm. The ESN couples dynamics with computation in a novel way by conceptually separating the RNN into two parts: a recurrent topology of nonlinear processing elements (PEs) that constitutes a reservoir of rich dynamics, and an instantaneous linear readout. The interesting property of the ESN is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but it obviously places important constraints on the overall architecture that have not yet been fully studied.

The present design of the fixed parameters of the ESN relies on the selection of the maximum eigenvalue of the linearized system around zero (the spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. In this study, we proposed a functional space approximation framework to better understand the operation of ESNs, and proposed an information-theoretic metric (the average entropy of echo states) to assess the richness of the ESN dynamics. We also provided an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, we put forward a design


methodology for functional approximation where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology with the conventional design.

On further investigating the use of ESNs for dynamical pattern recognition, we postulated that ESNs are particularly well suited for dynamical pattern recognition, and we proposed a linear associative memory (LAM) as a novel readout for ESNs. From the class of LAMs, we adopted the minimum average correlation energy (MACE) filter because of its high rejection characteristics, which allow its use as a detector in the automatic pattern recognition literature. In the ESN application, the MACE interprets the states of the ESN as a two-dimensional image: one dimension is time and the other dimension is the processing element index (space). An optimal template image for each class, which associates ESN states with the class label, can be analytically computed using training data. During testing, ESN states were correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern. The ESN-MACE combination leads to a nonlinear template matcher with robust noise performance, as needed in non-Gaussian, nonlinear digital communication channels. We used a real-world data experiment for chemical sensing with an electronic nose to demonstrate the power of this approach. The proposed readout can also be used with liquid state machines, eliminating the need to convert spike trains into continuous signals by binning or low-pass filtering.

We applied the ESN to interesting real-world problems such as brain machine interface design, water inflow prediction, detection of action potentials in neural recordings, matched


filtering in digital communications, and channel equalization of a nonlinear channel, and compared its performance to other standard techniques. We proposed ESNs for signal processing in the complex domain. The use of ESNs in the complex domain is very convenient since system training is equivalent to simple linear regression, which is trivial in the complex domain. The derivatives of the nonlinear activation functions are never necessary since the recurrent part is fixed a priori.

We showed that the Freeman model of the olfactory cortex can be considered in the same framework as the ESN. This work provided two important contributions to Freeman networks. First, it emphasized the need for optimal readouts, and showed how to adaptively derive them. Second, it showed that the Freeman model is able to process continuous signals with temporal structure.

We also investigated the dynamics of ESNs without the echo state condition. This investigation led to a novel computational mode for nonlinear systems with sigmoidal nonlinearity, which does not require global stability. In this mode, although the autonomous system is unstable, the input signal forces the system dynamics to become transiently stable. Function approximation experiments showed that the transiently stable ESN without the echo state condition is still capable of useful computation.


CHAPTER 1
INTRODUCTION

Literature Review of Dynamic Computation

Static networks are those whose response depends only on the present stimulus. Multilayer perceptrons (MLPs), radial basis function (RBF) networks, and self-organizing map (SOM) networks are examples of static neural network models that deal with static input patterns. These models have been well studied in function spaces, for regression and classification, using statistical principles (Haykin, 1998). On the other hand, time is an essential ingredient in many real-world problems, from cognitive tasks (such as vision and speech processing) to engineering problems (such as system identification, noise cancellation, and channel equalization). These real-world problems require the extraction of information embedded in the temporal structure of signals. To incorporate time into the operation of a network, memory must be built into it. To be precise, memory here stands for short-term memory: the ability to remember the recent past, which provides the contextual information from the history of the time series needed to unambiguously identify the current stimulus.

Short-term Memory Structures

The general linear discrete-time memory structure of order N (Figure 1-1 A) is also known as the generalized tapped delay line memory: each block has the transfer function G(z), or impulse response g(n) (Haykin, 2001). This is a single-input multiple-output system, and the taps of the system are filtered versions of the input signal. Basically, the memory reconstructs the state of the system that created the time series in a sufficiently large state space where time is implicit, creating a trajectory that has a one-to-one correspondence (with a unique inverse) to the time series.


Figure 1-1. Delay line memory of order N. A) Generalized tapped delay line. B) Tapped delay line.

The quality of a memory structure is measured by two quantities: depth (D) and resolution (R) (Principe et al., 1993). Memory depth measures how long the memory can hold information about the past of a time series. Memory resolution is the amount of detail in the information the memory holds. Formally, memory depth is defined as

\[ D = \sum_{n=0}^{\infty} n \, g_N(n) \]

where g_N(n) is the overall impulse response of the memory, obtained from N successive convolutions of g(n). Resolution is defined as the number of taps in the memory structure per unit time. For a linear memory structure with a fixed number of taps, the product of memory depth and resolution is


a constant equal to the number of taps N. This creates a tradeoff between memory depth and resolution, and the choice of G(z) determines the values of R and D in this tradeoff (Principe et al., 1993).

The time delay embedding (Takens, 1981), which can be implemented by a simple delay line, is the most commonly used form of short-term memory (Figure 1-1 B). The use of time delays for memory is also biologically motivated, since signal delays are ubiquitous in the brain and essential in biological information processing (Braitenberg, 1990). The memory depth and resolution of the tapped delay line are N and 1, respectively.

This restriction is relaxed in the gamma filter, which makes use of local feedback loops to create short-term memory (Principe et al., 1993; de Vries, 1991). Each section of the gamma memory has the transfer function

\[ G(z) = \frac{\mu z^{-1}}{1 - (1-\mu) z^{-1}} \]

where μ is the adjustable feedback parameter. Note that the gamma memory reduces to the tapped delay line for μ = 1. With the introduction of the feedback loop, stability becomes an issue. However, overall system stability is easily guaranteed when each local feedback is stable (i.e., 0 < μ < 2). The memory depth of the gamma memory is N/μ and the resolution is μ. This means that the memory depth can be improved by choosing the feedback parameter less than unity. However, this also reduces the resolution. Hence, unlike the tapped delay line, the gamma memory can improve memory depth without increasing the number of taps, by sacrificing resolution (Principe et al., 1993; de Vries, 1991).

Although the tapped delay line and the gamma filter are the most common forms of short-term memory structures, other structures such as the leaky integrator memory (context nodes or memory neurons) and the Laguerre filter are also available in the literature. The time alignment filter, obtained from the gamma memory by allowing nonhomogeneous μ values in different taps, extends the structure in Figure 1-1 A by allowing tap-dependent modulation of memory depth and resolution (de Vries, 1991).
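To make the depth and resolution formulas above concrete, the following Python sketch implements an N-tap gamma memory and verifies the D = N/μ depth of its last tap numerically. The signal length, tap count, and μ values are arbitrary assumptions for illustration, not settings from this dissertation.

```python
import numpy as np

def gamma_memory(u, num_taps, mu):
    """Run an input sequence through a gamma memory.

    Each tap is a first-order feedback stage,
        x_k(n) = (1 - mu) * x_k(n-1) + mu * x_{k-1}(n-1),
    with x_0 the input itself; mu = 1 recovers the tapped delay line.
    Returns an array of shape (len(u), num_taps) whose columns are the taps.
    """
    x = np.zeros((len(u), num_taps + 1))
    x[:, 0] = u
    for n in range(1, len(u)):
        for k in range(1, num_taps + 1):
            x[n, k] = (1 - mu) * x[n - 1, k] + mu * x[n - 1, k - 1]
    return x[:, 1:]

# Feed a unit impulse to obtain each tap's impulse response g_k(n).
impulse = np.zeros(200)
impulse[0] = 1.0
for mu in (1.0, 0.5):                    # delay line vs. gamma memory
    taps = gamma_memory(impulse, num_taps=5, mu=mu)
    depth = np.sum(np.arange(200) * taps[:, -1])   # D = sum_n n g_N(n)
    print(f"mu={mu}: depth of last tap = {depth:.1f} (theory: N/mu = {5/mu:.1f})")
```

Running the check shows the tradeoff directly: halving μ doubles the depth of the last tap while halving the resolution.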


Network Architectures for Dynamical Computation

There are basically two ways of creating a dynamic network with short-term memory. The first is to stimulate a static network (e.g., a linear mapper or an MLP) via a memory structure at its input, such as a tapped delay line or a gamma memory (Haykin, 1998). This type of dynamical network is commonly used in the literature. However, there are several practical drawbacks to this approach, which uses a spatial metaphor for time, especially when the dynamics of the time series are rich and time varying. The alternative approach, creating memory in an implicit manner, is through the use of feedback. Feedback can be local, at the single-neuron level, or global, covering the whole network. In the neural networks literature, networks with global feedback loops are referred to as recurrent neural networks (RNNs). The use of feedback provides very powerful systems with rich dynamical behavior, but it also brings in practical problems such as stability and training complexity (Haykin, 1998).

Figure 1-2. Dynamic network with a short-term memory structure (tapped delay line, gamma memory, etc.) followed by a static mapper (linear mapper, MLP, RBF, etc.)


First we will consider the explicit representation of time, where a static network (linear or nonlinear) is transformed into a dynamic network with a preprocessor, which is a short-term memory structure. In these focused structures, the memory is restricted to the input layer of the network (Sandberg and Xu, 1997; Haykin, 1998). There is a clear separation of responsibilities in this approach to building a dynamical system: memory represents time, and the static network accounts for the mapping (Figure 1-2). Basically, the memory reconstructs the state of the system that created the time series in a sufficiently large state space, and the static network maps the reconstructed space to the desired signal space. Moreover, the well-developed techniques of static network training, such as least mean squares (LMS) or the static back-propagation algorithm, can be applied to the dynamic network.

The simplest structure for temporal processing in the form of Figure 1-2 is the finite impulse response (FIR) filter, where the memory is a tapped delay line and the static network is a linear mapper. It has been widely used in adaptive signal processing applications due to its simplicity and the existence of effective learning algorithms (Haykin, 2001). The gamma filter is obtained when the memory structure in the FIR filter is replaced by the gamma memory (Principe et al., 1993; de Vries, 1991). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create embeddings. The extra degree of freedom on the memory depth, gained from local feedback, has been fruitfully utilized in applications like system identification, echo cancellation, and speech classification (de Vries, 1991).

The first successful demonstration of a neural network for temporal processing was NETtalk, devised by Sejnowski and Rosenberg in 1987 (Sejnowski and Rosenberg, 1987). They used a system based on the MLP to convert English speech to phonemes. A more popular network is the time delay neural network (TDNN), which replaces the linear mapper in the FIR filter with


a more powerful nonlinear network, the MLP (Haykin, 1998). Lang and Hinton successfully applied the TDNN to the recognition of four isolated words: bee, dee, ee, and vee (Lang and Hinton, 1988). The TDNN has become one of the most popular neural network architectures for temporal processing and has been successfully applied to time series prediction, system identification, control, and speech processing.

The structure of the TDNN can be generalized as shown in Figure 1-3. The first block is a bank of linear filters operating in parallel on the input stimulus, whereas the second block implements a static nonlinear network. This structure is a universal dynamic mapper according to the universal myopic mapping theorem (Sandberg and Xu, 1997, page 477): any shift-invariant myopic dynamic map can be uniformly approximated arbitrarily well by a structure consisting of two functional blocks, a bank of linear filters feeding a static network. The TDNN is a special case of the structure described in this theorem and is also a universal mapper.

However, there are several drawbacks to focused structures, which convert time series into spatial signals (Elman, 1990). First of all, this approach requires an interface to the external world which buffers the input for further processing. Secondly, the selection of the length of the buffer is not trivial, and once selected it imposes a firm limit on the memory depth. Moreover, buffering requires all the patterns to have the same length, which is especially problematic in areas like speech processing, where speech signals of different lengths have to be processed (Elman, 1990). Finally, two instances of the same time series that are very similar can be very dissimilar in the reconstructed space (spatially distant) (Elman, 1990).
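For illustration, the following sketch implements such a focused architecture: a tapped delay line feeding a static tanh mapper with a linear output. The layer sizes, the random (untrained) weights, and the example input are arbitrary assumptions; no training is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def delay_embed(u, num_taps):
    """Stack [u(n), u(n-1), ..., u(n-num_taps+1)] into rows (zero-padded past)."""
    padded = np.concatenate([np.zeros(num_taps - 1), u])
    return np.stack([padded[n:n + num_taps][::-1] for n in range(len(u))])

def focused_tdnn(u, W_hidden, w_out):
    """Focused TDNN: delay-line memory -> static tanh hidden layer -> linear output."""
    X = delay_embed(u, W_hidden.shape[1])   # (T, num_taps) buffered input
    H = np.tanh(X @ W_hidden.T)             # static nonlinear mapper
    return H @ w_out                        # linear output layer

# Toy usage: 10 hidden PEs operating on a 5-tap window of a sine wave.
u = np.sin(2 * np.pi * np.arange(300) / 25)
W_hidden = rng.normal(scale=0.5, size=(10, 5))
w_out = rng.normal(size=10)
y = focused_tdnn(u, W_hidden, w_out)
print(y.shape)  # (300,)
```

The fixed buffer length (here 5 taps) is exactly the limitation discussed above: it imposes a hard bound on memory depth regardless of the signal.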


Figure 1-3. Generic structure for the universal myopic mapping theorem: a bank of filters followed by a static network.

An alternative approach for creating memory is by means of recurrent connections within the network. In this approach, the system itself is dynamic; time is integrated into the system and represented by the effect it has on processing. Recurrent networks are particularly powerful for dealing with time-varying patterns and are practical for problems where the dynamics of the considered process are complex. Various RNN architectures are available in the literature. One of the earlier architectures, proposed by Jordan, uses memory units that are fed from the system output to provide the contextual information (Jordan, 1986). The Elman network is a modification of Jordan's, where the context units keep a copy of the hidden layer activations for the next update (Elman, 1990). A fully connected recurrent neural network contains several feedback loops between the processing elements (PEs) in the hidden layer (Figure 1-4). RNNs can exhibit rich dynamical responses such as oscillations and even chaos. They have been widely used in many applications such as system identification and control of dynamical systems (Delgado et al., 1995; Feldkamp et al., 1998; Kechriotis et al., 1994; Principe, 2001). The computational power of fully connected RNNs has been mathematically proven by Siegelman and Sontag (1991): all Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions. However, the main problem with RNNs is the difficulty of adapting the system weights. Various algorithms, such as back-propagation through time (Werbos,


1990) and real-time recurrent learning (Williams and Zipser, 1989), have been proposed to train recurrent systems. However, these algorithms suffer from a variety of problems: computational complexity resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special PEs (Hochreiter and Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal and Wu, 1989; Puskorius and Feldkamp, 1996) and the multi-streaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications.

Literature Review of Echo State Networks

Recently, a new recurrent network paradigm was proposed by Jaeger under the name of echo state networks (ESNs) (Jaeger, 2001; Jaeger, 2002; Jaeger and Haas, 2004). ESNs aim at addressing the problems with RNN training by separating the RNN architecture into two parts: a fixed recurrent topology, the dynamical reservoir, and an adaptive memoryless readout network. The input signal is fed to the dynamical reservoir, which, when properly dimensioned, contains information about the history of the input and/or output patterns (Figure 1-4). The outputs of the internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that reads the reservoir and produces the network output. The interesting property of the ESN is that only the memoryless readout is trained, whereas the recurrent topology (W) has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but it obviously places important constraints on the overall architecture that have not yet been fully studied.


Figure 1-4. Diagram of an echo state network (ESN). The ESN is composed of two parts: a fixed-weight (W) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, the states of which are called echo states. The memoryless linear readout is trained to produce the output.

Figure 1-4 depicts an ESN with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u_1(n), u_2(n), ..., u_M(n)]^T, that of the internal units is x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T, and that of the output units is y(n) = [y_1(n), y_2(n), ..., y_L(n)]^T. The connection weights are given in an N x M weight matrix W^{in} = (w_{ij}^{in}) for connections between the input and the internal states, in an N x N matrix W = (w_{ij}) for connections between the internal PEs, in an L x N matrix W^{out} = (w_{ij}^{out}) for connections from the internal units to the output units, and in an N x L matrix W^{back} = (w_{ij}^{back}) for the connections that project back from the output to the internal units (Jaeger, 2001). The activation of the internal PEs (echo states) is updated according to

\[ x(n+1) = f(W^{in} u(n+1) + W x(n) + W^{back} y(n)) \] (1-1)

where f = (f_1, f_2, ..., f_N) are the activation functions of the internal units. Generally, all f_i's are hyperbolic tangent functions, f(x) = (e^x - e^{-x})/(e^x + e^{-x}).


The readout is most often a simple linear regressor network whose output is computed according to

\[ y(n+1) = W^{out} x(n+1) \] (1-2)

When direct connections from the input units to the output units are used, the state vector is concatenated with the input vector to calculate the output.

A basic necessary property of the ESN reservoir is the echo state property, which states that, for the ESN learning principle to work, the reservoir must asymptotically forget the input history (input forgetting property). It has been shown in (Jaeger, 2001) that the input forgetting property is equivalent to state forgetting; that is, the reservoir must forget its initial state after a sufficiently long time. The echo state condition can be linked to the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ||.||) of the reservoir's weight matrix: ||W|| < 1. In fact, this condition states that the dynamics of the ESN are uniquely controlled by the input, and the effect of the initial states vanishes.

The design of the fixed connections of the ESN is based on the selection of the spectral radius. The reservoir is randomly and sparsely connected, with a spectral radius suitable for the given problem. Sparse connectivity allows less coupling between the PEs, encouraging the development of individual dynamics and increasing diversity in the overall reservoir dynamics.
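The update and readout equations translate directly into a few lines of code. The following Python sketch simulates a reservoir with a single input and no output feedback (W^{back} = 0); the size, sparsity, scaling, and input signal are illustrative assumptions, not settings prescribed by this dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_reservoir(N, sparsity=0.95, spectral_radius=0.9):
    """Random sparse reservoir matrix W, rescaled to a chosen spectral radius."""
    W = rng.normal(size=(N, N))
    W[rng.random((N, N)) < sparsity] = 0.0               # sparse connectivity
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W

def run_reservoir(u, W, W_in):
    """Echo states via Equation 1-1 with W_back = 0 and x(0) = 0."""
    T, N = len(u), W.shape[0]
    X, x = np.zeros((T, N)), np.zeros(N)
    for n in range(T):
        x = np.tanh(W_in * u[n] + W @ x)
        X[n] = x
    return X

N = 100
W = make_reservoir(N)
W_in = np.where(rng.random(N) < 0.5, 1.0, -1.0)          # +1/-1 input weights
u = np.sin(np.arange(500) / 5)                           # example input
X = run_reservoir(u, W, W_in)                            # (500, 100) echo states
y = X @ rng.normal(size=N)      # Equation 1-2 with an arbitrary, untrained readout
```

Only the final line involves adaptable parameters; training the readout reduces to the linear regression discussed in Chapter 2.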


Thesis Goals

ESNs, although recently proposed, have proven very successful in applications such as system identification, chaotic time series prediction, and channel equalization. However, the state of the art is still immature. We see a few major shortcomings with the current ESN approach. First, the characterization of the reservoir properties is poorly understood. It is a mystery how a network with mostly random connections can be successful in solving difficult problems. A better understanding of the operation of the dynamical reservoir is vital for improving ESN theory. Second, imposing a constraint only on the spectral radius for ESN design is a weak condition for properly setting the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently on the same problem). Third, there is a fundamental problem with the way ESNs operate. The impact of fixed reservoir parameters for function approximation means that the information about the desired response is only conveyed to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised.

We aim to address these problems by proposing a framework, a metric, and a design principle for ESNs. We first deal with the analysis of ESNs and explain the framework and the metric proposed. The framework is a signal processing interpretation of bases and projections in functional spaces, used to describe and understand the ESN architecture. According to this interpretation, the reservoir states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state, from which the entropy of the overall state (vector) is estimated. Due to the time dependency of the states, the state entropy averaged over time (ASE) will be used as the quantifier to measure the richness of the reservoir dynamics. We also interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PEs in a small local neighborhood of the current state.
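As an illustration of how such a metric can be computed, the sketch below estimates the entropy of the N state values at each time step and averages over time. The particular estimator (Renyi's quadratic entropy with a Gaussian Parzen window) and the kernel size are assumptions of this sketch; the text above specifies only that an instantaneous state entropy is averaged over time.

```python
import numpy as np

def renyi_quadratic_entropy(samples, sigma=0.1):
    """H2 = -log((1/N^2) sum_ij G(x_i - x_j)), Parzen estimate with Gaussian
    kernels of variance 2*sigma^2 (the convolution of two sigma-kernels)."""
    d = samples[:, None] - samples[None, :]
    pairwise = np.exp(-d**2 / (4 * sigma**2)) / np.sqrt(4 * np.pi * sigma**2)
    return -np.log(np.mean(pairwise))

def average_state_entropy(X, sigma=0.1):
    """ASE: entropy of the N echo-state values at each step, averaged over time.
    X has shape (T, N); row n holds x_1(n), ..., x_N(n)."""
    return np.mean([renyi_quadratic_entropy(x_n, sigma) for x_n in X])

# Usage with the echo-state matrix X from the previous sketch:
# print(average_state_entropy(X))
```

A reservoir whose PEs all saturate together produces tightly clustered state values and a low ASE; diversified dynamics spread the values and raise it.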


We then propose a design methodology for the ESN reservoir in light of the analysis tools developed for ESNs. According to this design principle, one should consider the correlation among the bases and the spectral radius independently. In the absence of any information about the desired response, the ESN reservoir states should be designed with the highest ASE, independent of the spectral radius. The poles of the linearized ESN reservoir should have uniform pole distributions, to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the bases, this principle distributes the system's degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal bases. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again upon the desired mapping. A simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. We show that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.
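One way to realize such a uniform pole placement is to sample poles uniformly over a disk of the desired radius and construct a real W with exactly those eigenvalues, using conjugate-pair 2x2 blocks followed by a random orthogonal change of basis. The construction below is one possible realization assumed for illustration, not necessarily the procedure developed in Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_pole_reservoir(N, radius=0.9):
    """Real N x N matrix whose eigenvalues are spread uniformly in a disk of
    the given radius; complex-conjugate pairs become 2x2 rotation-like blocks."""
    assert N % 2 == 0
    r = radius * np.sqrt(rng.random(N // 2))   # sqrt => uniform density in area
    theta = np.pi * rng.random(N // 2)         # upper half-plane; conjugates implied
    W = np.zeros((N, N))
    for k in range(N // 2):
        a, b = r[k] * np.cos(theta[k]), r[k] * np.sin(theta[k])
        W[2*k:2*k+2, 2*k:2*k+2] = [[a, -b], [b, a]]   # eigenvalues a +/- jb
    Q, _ = np.linalg.qr(rng.normal(size=(N, N)))      # random orthogonal basis
    return Q @ W @ Q.T                                # similarity keeps the poles

W = uniform_pole_reservoir(100)
print(np.max(np.abs(np.linalg.eigvals(W))))           # <= 0.9 by construction
```

The orthogonal similarity transform mixes the modes across PEs without moving the poles, so the designed eigenvalue spread (and hence the diversity of time constants) is preserved exactly.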


We discuss the readouts for ESNs for function approximation and pattern recognition. The standard linear readout is used for function approximation tasks such as system identification, time series prediction, and channel equalization. We thoroughly investigate the use of the ESN for temporal pattern recognition. The standard linear readout can be used for pattern recognition with ESNs by minimizing the mean-square error (MSE) with respect to a label. Classification of time series is different from the classification of static patterns, since there is structure between the samples of the time series over time. The dynamical classification problem can be reduced to a static one by treating each sample of the time series individually. In this case, the time series is not treated as a whole, and temporal information between samples is ignored. Moreover, one class label is associated with each time instant; therefore, post-classification such as majority voting has to be applied. An alternative to this approach is embedding the time series to populate a short-term history of the pattern. Again, a static mapper can be used to classify the embedded pattern. Another major difficulty in dynamical pattern recognition is how to design the label, which should also be a time series. One straightforward method is to use +1 or -1 as the desired response throughout the application of the input pattern. However, the difficulty with this method is that the desired response is forced to be constant, independent of the input signal dynamics. An alternative, powerful technique is to create a one-step predictor for each class in order to capture the dynamics of the class generating the time series (Zahalka and Principe, 1993). Then, during testing, the test pattern is presented to all predictors, and the label of the predictor with the least prediction error is assigned as the class of the input pattern.

We also propose an alternative readout for ESNs to be used in classification tasks that does not require a desired response. The goal is to design an ESN/LSM readout that will recognize a class of inputs that differ by some quantity (e.g., amplitude or shape, considered as a distortion parameter). The proposed readout, called the minimum average correlation energy (MACE) filter, is adopted from the optical pattern recognition literature, where it is used for recognition of a given object in 2-dimensional images in the presence of noise, geometric distortions, and pose changes (Mahalanobis et al., 1987). Instead of using a desired response, the MACE filter creates a template from the echo state responses corresponding to the training patterns of a given class. A MACE is trained for each class. During testing, the unknown time signal is fed to the ESN, the states are correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern.
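A minimal version of this readout can be written directly from the standard MACE formulation (Mahalanobis et al., 1987): in the frequency domain the template is h = D^{-1} X (X^+ D^{-1} X)^{-1} c, where the columns of X are the vectorized 2-D Fourier transforms of the training state images, D is the diagonal average power spectrum, X^+ is the conjugate transpose, and c is the vector of desired correlation peaks (here all ones). The sketch below assumes all state images share the same size; it illustrates the formula rather than reproducing the exact implementation used in the experiments.

```python
import numpy as np

def mace_template(state_images):
    """MACE filter from training 'images' (echo states: time x PE index).
    Returns the template in the space domain."""
    K = len(state_images)
    Xf = np.stack([np.fft.fft2(img).ravel() for img in state_images], axis=1)
    D = np.mean(np.abs(Xf)**2, axis=1) + 1e-12       # average power spectrum
    Dinv_X = Xf / D[:, None]                         # D^{-1} X
    c = np.ones(K)                                   # unit correlation peaks
    A = Xf.conj().T @ Dinv_X                         # X^+ D^{-1} X  (K x K)
    h = Dinv_X @ np.linalg.solve(A, c)               # D^{-1} X (X^+ D^{-1} X)^{-1} c
    shape = state_images[0].shape
    return np.real(np.fft.ifft2(h.reshape(shape)))

def classify(state_image, templates):
    """Assign the label of the template with the highest correlation peak."""
    scores = [np.max(np.abs(np.fft.ifft2(np.fft.fft2(state_image) *
              np.conj(np.fft.fft2(t))))) for t in templates.values()]
    return list(templates)[int(np.argmax(scores))]
```

Because the template is computed analytically from the training state images, no desired time-series label is ever constructed, which is exactly the advantage argued above.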


We investigate the implications of the ESN idea for other networks. First, we investigate the echo state condition (||W|| < 1) in a system-theoretic framework to understand the effect of system stability on the power of dynamical systems for computation. For this reason, we relax the echo state condition by allowing the spectral radius to be slightly larger than 1. This introduces a new dynamical regime, called transiently stable computation, where the system is autonomously unstable but is stabilized by an input signal of sufficient power. In this regime, function approximation with a linear readout is still possible even though the system is not globally stable. Secondly, we investigate a biologically plausible model of the olfactory cortex, the Freeman model (FM), and propose to use a readout for the FM in order to use it as a universal computer. Without the proposed readout, the use of the FM is limited to simple digit recognition, where the patterns are static and binary. We will demonstrate with experiments that the FM coupled with a readout can process continuous-valued signals. An interesting property of the FM is the nonlinear function used in the model. The FM nonlinearity does not have its largest slope at the zero operating point, unlike the sigmoidal nonlinearity used in ESNs.

Various applications of the ESN to real-world problems are investigated. We first utilize ESNs for signal processing in the complex domain and compare the complex ESN with other models for complex signal processing in a nonlinear channel equalization problem in a digital communication channel. Secondly, we propose ESNs to model brain machine interfaces and compare them with the linear Wiener filter. Brain machine interfaces aim at predicting the hand position of a primate using brain signals. Thirdly, we apply ESNs to predict the water inflow at a hydroelectric power plant using the previous values of the water inflow levels. Then, we tackle two


temporal pattern recognition problems using ESNs with the MACE filter readout. In the spike detection problem, the goal is to detect action potentials in a noisy neural recording. The second problem deals with the design of a detector similar to a matched filter in a communication channel. We compare the ESN-MACE filter to the matched filter in a baseband additive noise channel under different noise distributions.


CHAPTER 2
ANALYSIS OF ECHO STATE NETWORKS

The ESN is a recurrent network paradigm described by a recurrently connected reservoir network stimulated by an input signal, and an instantaneous, adaptive readout that combines the reservoir states to form the desired signal. ESNs have been shown to be very powerful in many applications such as chaotic time series prediction, channel equalization, and speech recognition. However, it remains a mystery how a network with mostly random connections can be so successful in solving difficult problems. This chapter aims at elucidating this basic mystery by analyzing the operation of ESNs. The first subsection proposes a framework, a signal processing interpretation of basis functions and projections in functional spaces, to describe and understand the ESN architecture. In the second subsection, we interpret the ESN dynamics as a combination of linear systems. According to this interpretation, as the system operating point varies over time under the influence of the input signal, the local ESN dynamics change. The third subsection introduces an information-theoretic metric, called average state entropy, to quantify the computational power of a dynamical network for function approximation. The ASE metric, combined with the idea of bases and projections, will lead to a design procedure for ESNs that is discussed in the next chapter.

Echo States as Bases and Projections

Let us revisit the recursive update equation of an ESN. The activation of the internal PEs is updated according to

\[ x(n+1) = f(W^{in} u(n+1) + W x(n) + W^{back} y(n)) \] (2-1)

where f = (f_1, f_2, ..., f_N) are the activation functions of the internal units. The output from the linear readout network is computed according to

\[ y(n+1) = W^{out} x(n+1) \] (2-2)


ESNs resemble the RNN architecture proposed in (Puskorius and Feldkamp, 1994), also used by (Sanchez, 2004) in brain machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in optimal signal processing (Principe, 2001), should be utilized to understand the ESN architecture.

Let h(u(t)) be a real-valued function of a real-valued vector u(t) = [u_1(t), u_2(t), ..., u_M(t)]^T. In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions φ_i(t), called the basis functionals, such that its approximant ĥ(u(t)) is given by

\[ \hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(t) \]

Here, the a_i's are the projections of h(u(t)) onto each basis functional. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes to a complete set of orthogonal bases, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. Consider a single-hidden-layer TDNN with N PEs and a linear output. The hidden layer PE outputs can be considered a set of non-orthogonal basis functionals dependent upon the input,

\[ \varphi_i(u(t)) = g\Big(\sum_j b_{ij} u_j(t)\Big) \]

Here, the b_{ij}'s are the input layer weights, and g is the PE nonlinearity. The approximation produced by the TDNN is then


\[ \hat{h}(u(t)) = \sum_{i=1}^{N} a_i \varphi_i(u(t)) \] (2-3)

where the a_i's are the weights of the output layer. Notice that the b_{ij} adapt the bases and the a_i adapt the projection in the projection space (readout). Here the goal is to restrict the number of bases (number of hidden layer PEs), because their number is coupled with the number of parameters to adapt, which impacts generalization, training set size, etc. Usually, since all of the parameters of the network are adapted, the best bases in the joint (input and desired) space as well as the best projection can be achieved, and this represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if non-orthogonal), linear independence among the φ_i(u(t))'s must be enforced. The effect of the nonlinear transformation of the hidden PEs on the input signal has been investigated by many authors in R^N. Oh has shown that correlation among the weighted sums decreases after they pass through the sigmoid nonlinear function, which can be approximated by piecewise linear functions (Oh and Lee, 1994). Shah and Poon showed that there exist weight values for the MLP such that the outputs of its hidden layer PEs are linearly independent and, hence, form a complete set of basis functions (Shah and Poon, 1999).

The ESN (and the RNN) architecture can also be studied in this framework. The reservoir states of Equation 2-1 correspond to the basis set, which is recursively computed from the input, output, and previous states through W^{in}, W, and W^{back}. Notice, however, that none of these weight matrices are adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs trade the adaptive connections in the RNN hidden layer for a brute-force approach of creating fixed, diversified dynamics in the hidden layer.
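Linear independence of the bases can be checked numerically: stacking the echo states x(n) as rows of a matrix and examining its numerical rank and singular-value spread shows whether the N bases span the full space and how correlated they are. This diagnostic is our illustration, not a procedure from the text.

```python
import numpy as np

def basis_diagnostics(X):
    """X: (T, N) matrix of echo states; column i samples the basis phi_i."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    rank = int(np.sum(s > s[0] * 1e-10))    # numerical rank of the basis set
    condition = s[0] / s[rank - 1]          # large spread => highly correlated bases
    return rank, condition

# rank == N with a moderate condition number indicates linearly independent,
# weakly correlated bases; a huge condition number signals near-redundancy.
```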


For an ESN with a linear readout network, the output equation (y(n+1) = W^{out} x(n+1)) has the same form as Equation 2-3, where the φ_i(u(t))'s and a_i's are replaced by the echo states and the readout weights, respectively. These weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

It is interesting that a similar perspective of bases and projections for information processing in biological networks has been proposed by (Pouget and Sejnowski, 1997). They explored the possibility that the responses of single neurons in parietal cortex serve as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands" (Pouget and Sejnowski, 1997).

The central issue in ESN design is exactly the non-adaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces, because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that selecting the echo state parameters by constraining the spectral radius is not the most suitable approach for function approximation. Consider an echo state network of 100 units where the input signal is sin(2πn/(10π)). Mimicking (Jaeger, 2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed, where the entries of W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or -1 with equal probabilities, and W^{back} is set to zero. The input is applied for 300 time steps, and the echo states are calculated using Equation 2-1.


Figure 2-1. Performance of ESNs for different realizations of $\mathbf{W}$ with the same weight distribution. The weight values are set to 0.4, -0.4 and 0 with probabilities of 0.025, 0.025 and 0.95. All realizations have the same spectral radius of 0.88. Over the 50 realizations, the MSEs vary from $5.9\times10^{-9}$ to $8.9\times10^{-5}$. The results show that the spectral radius of $\mathbf{W}$ is not the unique parameter that determines the performance of an ESN.

The next step is to train the linear readout when a desired training signal, $\mathbf{d}(n)$, is available. One method to determine the output weight matrix $\mathbf{W}^{out}$ that is optimal in the mean-square-error (MSE) sense, where $\mathrm{MSE} = \frac{1}{2}E\left[(\mathbf{d}(n)-\mathbf{y}(n))^{*}(\mathbf{d}(n)-\mathbf{y}(n))\right]$, is to use the Wiener solution given by (Haykin, 2001)

$$\mathbf{W}^{out} = E[\mathbf{x}\mathbf{x}^{*}]^{-1}E[\mathbf{x}\mathbf{d}^{*}] \approx \left(\frac{1}{T}\sum_{n=0}^{T}\mathbf{x}(n)\mathbf{x}(n)^{*}\right)^{-1}\frac{1}{T}\sum_{n=0}^{T}\mathbf{x}(n)\mathbf{d}(n)^{*} \qquad (2\text{-}4)$$

Here $E[\cdot]$, $T$ and $*$ denote the expected-value operator, the length of the training sequence, and the conjugate transpose of a complex vector, respectively. Figure 2-1 depicts the MSE values for 50 different realizations of the ESN. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among the different realizations. The minimum MSE value obtained among the 50 realizations is $5.9\times10^{-9}$, whereas the maximum MSE is $8.9\times10^{-5}$.
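This experiment is compact enough to reproduce in a few lines. The following NumPy sketch builds one realization of the reservoir described above, computes the echo states, and solves Equation 2-4 by least squares (the sample version of the Wiener solution); the random seed and the use of `lstsq` are implementation choices, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300                      # reservoir size, training length

# Reservoir entries 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
w_in = rng.choice([1.0, -1.0], size=N)    # input weights +/-1, no output feedback

n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))  # input signal
d = u ** 7                                # desired: seventh power of the input

# Echo states via Equation 2-1 without feedback: x(n+1) = tanh(W_in u(n+1) + W x(n))
X = np.zeros((T, N))
x = np.zeros(N)
for t in range(T):
    x = np.tanh(w_in * u[t] + W @ x)
    X[t] = x

# Readout of Equation 2-4: least squares solves the same normal equations
w_out, *_ = np.linalg.lstsq(X, d, rcond=None)
mse = np.mean((d - X @ w_out) ** 2)
print(f"training MSE: {mse:.3e}")
```

Rerunning the sketch with different seeds reproduces the spread of MSE values reported in Figure 2-1.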


This experiment demonstrates that a design strategy based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. It shows that for each set of random weights with the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

ESN Dynamics as a Combination of Linear Systems

It is well known that the dynamics of a nonlinear system can be approximated by those of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, 1998). Here, we will perform the analysis with hyperbolic tangent PEs and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics will change. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback,

$$\mathbf{x}(n+1) = f(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n)).$$

Linearizing the system around the current state $\mathbf{x}(n)$, one obtains the Jacobian matrix $\mathbf{J}(n+1)$, defined by

$$\mathbf{J}(n+1) = \begin{bmatrix} \dot{f}(net_1(n))w_{11} & \cdots & \dot{f}(net_1(n))w_{1N} \\ \vdots & \ddots & \vdots \\ \dot{f}(net_N(n))w_{N1} & \cdots & \dot{f}(net_N(n))w_{NN} \end{bmatrix} = \mathrm{diag}\big(\dot{f}(net_1(n)), \ldots, \dot{f}(net_N(n))\big)\,\mathbf{W} = \mathbf{F}(n)\mathbf{W} \qquad (2\text{-}5)$$


Here, $net_i(n)$ is the $i$th entry of the vector $\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n)$, and $w_{ij}$ denotes the $(i,j)$th entry of $\mathbf{W}$. The poles of the linearized system at time $n+1$ are given by the eigenvalues of the Jacobian matrix $\mathbf{J}(n+1)$. As the amplitude of each PE input changes, the local slope changes, and so the poles of the linearized system are time-varying, although the parameters of the ESN are fixed.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4 and -0.4 with probabilities 0.9, 0.05 and 0.05. $\mathbf{W}$ is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or -1 with equal probability. A sinusoidal signal with a period of 100 is fed to the system and the echo states are computed according to Equation 2-1. Then the Jacobian matrix and its eigenvalues are calculated using Equation 2-5. Figure 2-2 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants, which modulates the richness of the reservoir of dynamics as a function of the input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the effective spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states provides a much more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input.
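The pole movement follows directly from Equation 2-5. The sketch below is a minimal version of the Figure 2-2 experiment: at each time step it forms the Jacobian $\mathbf{J} = \mathbf{F}(n)\mathbf{W}$ for tanh PEs and stores its eigenvalues (the poles of the linearized ESN); plotting is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.05, 0.05, 0.9])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale to spectral radius 0.95
w_in = rng.choice([1.0, -1.0], size=N)

u = np.sin(2 * np.pi * np.arange(200) / 100)       # sinusoid with period 100

x = np.zeros(N)
pole_tracks = []
for t in range(len(u)):
    net = w_in * u[t] + W @ x                       # PE inputs before the nonlinearity
    x = np.tanh(net)
    # Jacobian of Equation 2-5: J = diag(f'(net_i)) W, with f'(a) = 1 - tanh(a)^2
    J = (1.0 - np.tanh(net) ** 2)[:, None] * W
    pole_tracks.append(np.linalg.eigvals(J))        # poles of the linearized ESN at time t
```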


Figure 2-2. Pole tracks of the linearized ESN when the input goes through a cycle. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations. A) One cycle of a sinusoidal signal with a period of 100. B)-E) The positions of the poles of the linearized systems when the input value is at B, C, D, and E in panel A. F) Cumulative pole locations show the movement of the poles as the input changes.

Theorem: The eigenvalues of the linearized system have the largest radius when the system state is zero.


Proof: Assume $\mathbf{W}$ has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of $\mathbf{W}$, $\mathbf{W} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$, where $\mathbf{P}$ is the eigenvector matrix and $\mathbf{D}$ is the diagonal matrix of eigenvalues ($D_{ii}$) of $\mathbf{W}$. Since $\mathbf{F}(n)$ and $\mathbf{D}$ are diagonal,

$$\mathbf{J}(n+1) = \mathbf{F}(n)\mathbf{W} = \mathbf{F}(n)(\mathbf{P}\mathbf{D}\mathbf{P}^{-1}) = \mathbf{P}(\mathbf{F}(n)\mathbf{D})\mathbf{P}^{-1}$$

is the eigendecomposition of $\mathbf{J}(n+1)$. Here, each entry of $\mathbf{F}(n)\mathbf{D}$, $\dot{f}(net_i(n))D_{ii}$, is an eigenvalue of $\mathbf{J}$, and $|\dot{f}(net_i(n))D_{ii}| \leq |D_{ii}|$ since $\dot{f}(net_i(n)) \leq \dot{f}(0)$.

A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input, without changing the recurrent connection matrix $\mathbf{W}$. The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and will always decrease the spectral radius, due to the shape of the nonlinearity. This property will be exploited in the design of ESNs, discussed in the next chapter.

Average State Entropy as a Measure of the Richness of ESN Reservoir

The concept of a rich dynamical reservoir (Jaeger, 2001) has not been quantified with a well-defined metric. Our framework of bases and projections leads to a new metric to quantify the richness of the ESN reservoir. Here, we propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN reservoir states. Entropy of the instantaneous ESN states is appropriate for quantifying performance in function approximation because the ESN output is a mere weighted combination of the values of the ESN states. If the instantaneous amplitudes of the echo states are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states) and performance will suffer.


On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears to be a good measure of the "richness" of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by the trajectories.

Rényi's quadratic entropy is employed here because it is a global measure of information, and an efficient nonparametric estimator that avoids explicit pdf estimation has been developed for it (Principe et al., 2000; Erdogmus et al., 2003; Erdogmus and Principe, 2002). Rényi's entropy with parameter $\alpha$ for a random variable $X$ with pdf $f_X(x)$ is given by (Principe et al., 2000)

$$H_\alpha(X) = \frac{1}{1-\alpha}\log E\left[f_X^{\alpha-1}(X)\right].$$

Rényi's quadratic entropy is obtained for $\alpha = 2$ (for $\alpha \to 1$, Shannon's entropy is obtained). Given $N$ samples $\{x_1, x_2, \ldots, x_N\}$ drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

$$\hat{f}_X(x) = \frac{1}{N}\sum_{i=1}^{N} K_\sigma(x - x_i)$$

where $K_\sigma$ is the kernel function with kernel size $\sigma$. Then Rényi's quadratic entropy can be estimated by (Principe et al., 2000; Erdogmus et al., 2003)

$$\hat{H}_2(X) = -\log\left(\frac{1}{N^2}\sum_{j}\sum_{i} K_\sigma(x_j - x_i)\right) \qquad (2\text{-}6)$$
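A direct implementation of the estimator in Equation 2-6 with a Gaussian kernel, together with the time average that defines ASE, might look as follows. The 0.3 kernel-size fraction follows the text below; the function names are ours.

```python
import numpy as np

def quadratic_renyi_entropy(x, sigma):
    """Estimate Renyi's quadratic entropy (Equation 2-6) of the samples in x
    with a Gaussian Parzen kernel of size sigma."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]                      # all pairwise differences
    k = np.exp(-diff**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(np.mean(k))

def average_state_entropy(X, kernel_fraction=0.3):
    """ASE: the instantaneous state entropy averaged over time.
    X is a (T, N) matrix of echo states; the kernel size is a fraction of the
    standard deviation of the state entries, as in the text."""
    sigma = kernel_fraction * np.std(X)
    return np.mean([quadratic_renyi_entropy(row, sigma) for row in X])
```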


The instantaneous state entropy is estimated using Equation 2-6, where the samples are the entries of the state vector $\mathbf{x}(n) = [x_1(n), x_2(n), \ldots, x_N(n)]^T$ of an ESN with $N$ internal PEs. Results will be shown with a Gaussian kernel whose kernel size is chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter for quantifying the approximation properties of ESNs, by experimentally demonstrating that ESNs with the same spectral radius display different ASEs.

Let us consider the same 100-unit ESN used in the previous section, built with three different spectral radii, 0.2, 0.5 and 0.8, with an input signal of $\sin(2\pi n/20)$. Figure 2-3A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using Equation 2-6 and plotted in Figure 2-3B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since the state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For a spectral radius of 0.2, the instantaneous amplitudes of the echo states are concentrated on only a few values, which is wasteful due to redundancy between the different states. In practice, to quantify the overall representation ability over time, we will use ASE, which takes the values -0.735, -0.007 and 0.335 for the spectral radii of 0.2, 0.5 and 0.8, respectively. Hence, ASE is affected by the spectral radius of the $\mathbf{W}$ matrix, as we would expect. Moreover, even for the same spectral radius of 0.5, several ASEs are possible. Figure 2-4 shows the ASEs of 50 different realizations of ESNs with the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs, as long as the input dynamic range is properly selected.


Figure 2-3. Echo states and state entropy for different ESNs. A) Outputs of the echo states (100 PEs) produced by systems with spectral radii of 0.2, 0.5 and 0.8, from top to bottom. B) Instantaneous state entropy calculated using Equation 2-6 for the echo states in panel A.


Figure 2-4. The ASE values obtained from 50 realizations of ESNs with the same spectral radius.

Maximizing ASE means that the diversity of the states over time is the largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform the best, because the basis set is independent of the desired response and the application may require a small spectral radius. However, when the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector.


CHAPTER 3
DESIGN OF THE DYNAMICAL RESERVOIR

In ESNs, the reservoir weights are selected randomly under the constraint of the echo state condition (Jaeger, 2001). However, such a random selection scheme is not optimal, as demonstrated in Chapter 2. In this chapter, we propose a scheme for the design of the reservoir weights that leads to small approximation errors for a variety of tasks. For optimal approximation with a finite number of basis functionals, both the projection space (bases) and the projector (readout) require knowledge from the input and desired responses. However, in ESNs the projection space is determined solely by the architecture of the ESN reservoir and the input signal (since $\mathbf{W}$ is fixed), without any knowledge of the space spanned by the desired target signal. The selection of basis functions with knowledge of the input signal alone is an ill-posed problem. However, the selection of the reservoir weights must still be done using some rule, and here we hypothesize that a good design strategy is to let the ESN states cover the projection space with equal resolution, in order to anticipate any possible mapping requirement (dictated by the unknown desired response).

Design of the Recurrent Connections

According to the interpretation of ESNs as coupled linear systems, the design of the internal connection matrix $\mathbf{W}$ will be based on the distribution of the poles of the system linearized around the zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design, the system dynamics will include a uniform coverage of time constants arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system of finite order is achieved when the poles are placed in the neighborhood of the spectral resonances.


When no information is available about the desired response, or when one would like to use the same reservoir for a variety of tasks, we should spread the poles uniformly to anticipate good approximation of arbitrary mappings.

We again use a maximum entropy principle to distribute the poles uniformly inside the unit circle. The constraints of the circle as a boundary condition for discrete linear systems and of complex-conjugate pole locations are easy to include in the pole distribution (Thogula, 2003). The poles are first initialized at random locations; the quadratic Rényi entropy is calculated using Equation 2-6, and the poles are moved such that the entropy of the new distribution increases over iterations (Erdogmus et al., 2003). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles. However, any other method can be used to find pole locations with uniform coverage. Notice that this operation has to be done only once for a given number of poles, which is equal to the number of hidden-unit PEs.

The system with uniform pole locations can be interpreted using linear system theory. The poles that are close to the unit circle correspond to many sharp band-pass filters specialized in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters with different center frequencies.

Now, our problem is to construct a weight matrix from the pole locations. This is equivalent to the problem of designing $\mathbf{W}$ when its eigenvalues are known. In principle we would like to create a sparse matrix, so we started with the sparsest matrix (that has an inverse), which is the direct canonical structure given by (Kailath, 1980)


$$\mathbf{W} = \begin{bmatrix} -a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} \qquad (3\text{-}1)$$

The characteristic polynomial of $\mathbf{W}$ is

$$l(s) = \det(s\mathbf{I} - \mathbf{W}) = s^{N} + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2)\cdots(s - p_N) \qquad (3\text{-}2)$$

where the $p_i$ are the eigenvalues and the $a_i$ are the coefficients of the characteristic polynomial of $\mathbf{W}$. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using Equation 3-2 we can obtain the characteristic polynomial and construct the $\mathbf{W}$ matrix in the canonical form of Equation 3-1. We will name the ESN constructed on the uniform-pole principle the ASE-ESN. All other possible solutions with the same eigenvalues can be obtained as $\mathbf{Q}^{-1}\mathbf{W}\mathbf{Q}$, where $\mathbf{Q}$ is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that a linearized ESN whose internal connection weight matrix has eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius than other ESNs with random internal connection weight matrices. We consider an ESN with 30 states and use our procedure to create the uniformly distributed linearized ASE-ESN matrix for different spectral radii between [0.1, 0.95]. Similarly, we constructed ESNs with sparse random $\mathbf{W}$ matrices under different sparseness constraints. This corresponds to a weight distribution taking the values 0, $c$ and $-c$ with probabilities $p_1$, $(1-p_1)/2$ and $(1-p_1)/2$, where $p_1$ defines the sparseness of $\mathbf{W}$ and $c$ is a constant that takes a specific value depending on the spectral radius. We also created $\mathbf{W}$ matrices with values uniformly distributed between -1 and 1 (U-ESN), scaled to obtain a given spectral radius (Jaeger and Haas, 2004). Then, for different $\mathbf{W}^{in}$ matrices, we ran the ASE-ESNs with a sinusoidal input of period 20 and calculated the ASE.
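A minimal sketch of this construction follows: draw conjugate-symmetric poles that cover the disc uniformly (here by simple area-uniform sampling, a stand-in for the entropy-maximization iteration described above), convert them to the characteristic polynomial of Equation 3-2 with `np.poly`, and place the coefficients in the canonical form of Equation 3-1.

```python
import numpy as np

def uniform_poles(N, radius, rng):
    """Draw N poles roughly uniform over a disc of the given radius, in
    complex-conjugate pairs so the polynomial coefficients are real.
    (A stand-in for the entropy-maximization placement in the text.)"""
    n_pairs, n_real = divmod(N, 2)
    r = radius * np.sqrt(rng.uniform(size=n_pairs))   # sqrt gives uniform area density
    theta = rng.uniform(0, np.pi, size=n_pairs)
    p = r * np.exp(1j * theta)
    real_poles = radius * (2 * rng.uniform(size=n_real) - 1)
    return np.concatenate([p, p.conj(), real_poles])

def canonical_W(poles):
    """Direct canonical (companion) form of Equation 3-1 whose eigenvalues
    are the given poles: first row holds -a_1..-a_N, subdiagonal holds ones."""
    a = np.poly(poles).real          # characteristic polynomial [1, a_1, ..., a_N]
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -a[1:]
    W[1:, :-1] = np.eye(N - 1)
    return W

rng = np.random.default_rng(2)
W = canonical_W(uniform_poles(30, 0.9, rng))
# Spectral radius equals the largest pole magnitude by construction:
print(np.max(np.abs(np.linalg.eigvals(W))))
```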


Figure 3-1. Comparison of the ASE values obtained for the ASE-ESN, the U-ESN, and randomly generated ESNs with different sparseness. The ASE-ESN with uniform pole distribution generates, on average, a higher ASE for all spectral radii than the ESNs with random connections.

Figure 3-1 compares the ASE values obtained using the different ESNs over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates, on average, a higher ASE for all spectral radii than the ESNs with sparse and uniform random connections. This approach is conceptually similar to Jeffreys' maximum entropy prior (Jeffreys, 1946): it provides a consistently good response for the largest class of problems. Concentrating the poles of the linearized system in certain regions of the space only provides good performance if the desired response has energy in that part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).


Design of the Adaptive Bias

In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions depend on the input only. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. Motivated by the linearization analysis, which shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that, according to the linearization analysis with the tanh nonlinearity, the bias can only reduce the spectral radius. The information for the adaptation of the bias is the training MSE, which modulates the spectral radius of the system with information derived from the approximation error. With this simple mechanism, some information from the joint input-output space is incorporated into the definition of the projection space of the ESN. The beauty of this method is that the spectral radius can be adjusted by a single parameter that is external to the system, without changing the reservoir weights.

The training of the bias can be easily accomplished. Indeed, since the parameter space is only one dimensional, a simple line search method can be efficiently employed to optimize the bias. Among the different line search algorithms, we will use a search that uses Fibonacci numbers in the selection of the points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within a prescribed length. In our problem, a bias value is picked according to the Fibonacci search. For each value of the bias, the training data is applied to the ESN and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value.
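As a sketch of the one-dimensional search, the code below uses a golden-section search, a close cousin of the Fibonacci search cited above (both shrink the uncertainty interval by fixed ratios). `run_esn` is an assumed helper that returns the state matrix for a given constant bias; the interval and tolerance are illustrative.

```python
import numpy as np

def train_mse_for_bias(b, run_esn, u_train, d_train):
    """Objective: run the reservoir with constant bias b, fit the linear
    readout by least squares, and return the training MSE."""
    X = run_esn(u_train, b)                      # (T, N) echo states, assumed helper
    w_out, *_ = np.linalg.lstsq(X, d_train, rcond=None)
    return np.mean((d_train - X @ w_out) ** 2)

def golden_section_bias(f, lo, hi, tol=1e-2):
    """Minimize the scalar objective f over the bias interval [lo, hi]."""
    phi = (np.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                              # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = f(c)
        else:                                    # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = f(d)
    return (a + b) / 2
```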


Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal $b$ is given by

$$\mathbf{x}(n+1) = f(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}^{in}b + \mathbf{W}\mathbf{x}(n)).$$

The update equation for $b$ follows from the chain rule applied to the cost $O$:

$$b(n+1) = b(n) + \eta\,\mathbf{e}^{T}\mathbf{W}^{out}\frac{\partial\mathbf{x}(n+1)}{\partial b}, \qquad \frac{\partial\mathbf{x}(n+1)}{\partial b} = \mathbf{F}(n+1)\left(\mathbf{W}\frac{\partial\mathbf{x}(n)}{\partial b} + \mathbf{W}^{in}\right),$$

where $O$ is the MSE defined previously, $\mathbf{e} = \mathbf{d} - \mathbf{y}$ is the output error, and $\eta$ is a step size. This algorithm may suffer from problems similar to those observed in gradient-based training of recurrent networks. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one dimensional, the gradient can only assume one of two directions. Hence, imprecision in the gradient estimate should affect the speed of convergence, but normally not change the correct gradient direction.

Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

Short-term Memory Structures

Modeling a dynamical system from input-output data requires access to a sufficiently long input history, since the output of the dynamical system depends not only on the current value of the input signal but also on its history. In signal processing, the common approach to overcome this difficulty is to embed the input signal using a tapped delay line (as in the FIR filter or the TDNN). RNNs provide a different type of embedding through the recurrent connections between the PEs.


However, the length of the memory achieved is not as explicit as in a tapped delay line. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius, using the framework presented in (Jaeger, 2002). Consider an ESN with a single input signal, $u(n)$, optimally trained with the desired signal $u(n-k)$ for a given delay $k$. Denoting the optimal output signal by $y_k(n)$, the $k$-delay STM capacity of a network, $MC_k$, is defined as the squared correlation coefficient between $u(n-k)$ and $y_k(n)$ (Jaeger, 2002). The STM capacity, $MC$, of the network is defined as $\sum_{k=1}^{\infty} MC_k$. The STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. It has been shown in (Jaeger, 2002) that the memory capacity for recalling an independent, identically distributed (i.i.d.) input with an $N$-unit RNN with linear output units is bounded by $N$.

Consider an ESN with 20 PEs and a single input unit. The ESN is driven by an i.i.d. random input signal, $u(n)$, uniformly distributed over [-0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, $u(n-1), u(n-2), \ldots, u(n-40)$. We used four different ESNs: R-ESN, U-ESN, ASE-ESN and BASE-ESN. The R-ESN is the randomly connected ESN used in (Jaeger, 2002), where the entries of the $\mathbf{W}$ matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of the U-ESN's weight matrix are uniformly distributed over [-1, 1] and scaled to obtain a spectral radius of 0.9. The ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. The BASE-ESN has the same recurrent weight matrix as the ASE-ESN plus an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or -0.1 with equal probability, and direct connections from the input to the output are allowed, whereas $\mathbf{W}^{back}$ is set to 0 (Jaeger, 2002). The echo states are calculated using Equation 2-1 for 200 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated.
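The $MC_k$ computation can be sketched as below. For simplicity, this version omits the direct input-to-output connections used in the experiment, so it slightly understates the capacities; `washout` plays the role of the eliminated initial transient.

```python
import numpy as np

def stm_capacity(u, X, max_delay=40, washout=100):
    """Sum of k-delay memory capacities MC_k: the squared correlation
    coefficient between u(n-k) and the optimal linear readout y_k(n).
    u is the input sequence, X the (T, N) echo-state matrix for u."""
    mc = []
    for k in range(1, max_delay + 1):
        d = u[washout - k:-k]                        # target u(n-k) after the washout
        Xk = X[washout:]
        w, *_ = np.linalg.lstsq(Xk, d, rcond=None)   # optimally trained readout
        y = Xk @ w
        mc.append(np.corrcoef(y, d)[0, 1] ** 2)      # MC_k
    return np.sum(mc), np.array(mc)
```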


Then the output weight matrix is calculated using Equation 2-4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding outputs and $MC_k$ are calculated.

Figure 3-2. The $k$-delay STM capacity of each ESN for delays 1, ..., 40, computed using the test signal. The STM capacities of the R-ESN, U-ESN, ASE-ESN and BASE-ESN are 13.09, 13.55, 16.70 and 16.90, respectively.

Figure 3-2 shows the $k$-delay STM capacity (averaged over 100 trials) of each ESN for delays 1 to 40 on the test signal. The STM capacities of the R-ESN, U-ESN, ASE-ESN and BASE-ESN are 13.09, 13.55, 16.70 and 16.90, respectively. First of all, the ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly generated ESN given in (Jaeger, 2002), in spite of all having the same spectral radius. In fact, the STM capacity of the ASE-ESN is close to the theoretical maximum value of $N = 20$. A closer look at the figure shows that the R-ESN performs slightly better than the ASE-ESN for delays less than 9.


Indeed, for small $k$, a large ASE degrades the performance because these tasks do not need a long memory depth. However, the drawback of a high ASE for small $k$ is recovered by the BASE-ESN, which reduces the ASE to the level required by the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, the U-ESN has a slightly better STM than the R-ESN, even though the R-ESN has only 3 distinct weight values whereas the U-ESN has many more. It is also significant to note that the MC will be very poor for an ESN with a smaller spectral radius, even with an adaptive bias, since this problem requires a large ASE and the bias can only reduce the ASE. This experiment demonstrates the need for maximizing ASE in ESN design.

Binary Parity Check

The effect of the adaptive bias was marginal in the previous experiment, since the nature of the problem required large ASE values and a long short-term memory. However, there are tasks for which the optimal solutions require smaller ASE values and a smaller spectral radius. For those tasks, the adaptive bias becomes a crucial design parameter in our methodology.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, $u(n)$, that assumes the values 0 or 1. The goal is to train the ESN to generate the $m$-bit parity, where $m$ is 3, ..., 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN and BASE-ESN topologies. The R-ESN is a randomly connected ESN where the entries of the $\mathbf{W}$ matrix are set to 0, 0.06, -0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. The ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 with equal probability, and direct connections from the input to the output are allowed, whereas $\mathbf{W}^{back}$ is set to 0. The echo states are calculated using Equation 2-1 for 1000 samples of the input signal, and the first 100 samples, corresponding to the initial transient, are eliminated. Then the output weight matrix is calculated using Equation 2-4.


For the ESN with the adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 3-3 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for $m$ = 3, ..., 8.

Figure 3-3. Number of wrong decisions made by each ESN for $m$ = 3, ..., 8 in the binary parity check problem. The total numbers of wrong decisions for $m$ = 3, ..., 8 of the R-ESN, ASE-ESN and BASE-ESN are 722, 862 and 699, respectively.

The total numbers of wrong decisions for $m$ = 3, ..., 8 of the R-ESN, ASE-ESN and BASE-ESN are 722, 862 and 699, respectively. The ASE-ESN performs poorly since the nature of the problem requires a short time constant for a fast response, but the ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions whereas the ASE-ESN has 47 wrong decisions. The BASE-ESN performs much better than the ASE-ESN and slightly better than the R-ESN, since the adaptive bias reduces the spectral radius effectively. Note that for $m$ = 7 and 8, the ASE-ESN performs similar to the R-ESN, since the task requires access to a longer input history, which competes with the need for a fast response.


Indeed, the bias in the BASE-ESN takes effect when there are errors ($m > 4$) and when the task benefits from a smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6 and 2.7 for $m$ = 3, 4, 5, and 6, respectively. For $m$ = 7 or 8, there is a wide range of bias values (between 0 and 3) that result in similar MSE values. In summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

System Identification

This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

$$y(n+1) = 0.3\,y(n) + 0.6\,y(n-1) + f(u(n)),$$

where

$$f(u) = 0.6\sin(\pi u) + 0.3\sin(3\pi u) + 0.1\sin(5\pi u).$$

The input to the system is chosen to be $\sin(2\pi n/25)$. We used an ESN with 30 internal units and a single input unit, and again compared three different ESNs: R-ESN, ASE-ESN and BASE-ESN. The $\mathbf{W}$ matrices of all ESNs are generated as described in the short-term memory experiment, except that they are scaled to have a spectral radius of 0.95. In each ESN, the input weights are set to 1 or -1 with equal probability, and direct connections from the input to the output are allowed, whereas $\mathbf{W}^{back}$ is set to 0. The optimal output signal is calculated for each ESN as described in the short-term memory experiment. The MSE values (averaged over 100 realizations) for the R-ESN and ASE-ESN are $1.23\times10^{-5}$ and $1.83\times10^{-6}$, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE value from $1.83\times10^{-6}$ to $3.27\times10^{-9}$. This experiment illustrates the superiority of the proposed design scheme for function approximation.
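For reference, the plant above can be simulated with the short sketch below; the $\pi$ factors follow the standard form of this benchmark, and the sequence length is arbitrary.

```python
import numpy as np

def f(u):
    # Static nonlinearity of the unknown plant
    return (0.6 * np.sin(np.pi * u)
            + 0.3 * np.sin(3 * np.pi * u)
            + 0.1 * np.sin(5 * np.pi * u))

def generate_plant_data(T=1000):
    """Simulate y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n)) driven by sin(2*pi*n/25)."""
    n = np.arange(T)
    u = np.sin(2 * np.pi * n / 25)
    y = np.zeros(T)
    for t in range(1, T - 1):
        y[t + 1] = 0.3 * y[t] + 0.6 * y[t - 1] + f(u[t])
    return u, y
```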


CHAPTER 4
DESIGN OF THE READOUT

In the previous chapter, we discussed the design of the ESN reservoir and the bias. This chapter deals with the design of the readouts for ESNs. The problem is studied separately for function approximation and for dynamical pattern recognition, and different solutions are proposed for each case.

Design of the Readout for Function Approximation

The usual readout is the linear regression network, and throughout this dissertation we will use the linear readout. The equation for the linear readout is

$$\mathbf{y}(n+1) = \mathbf{W}^{out}\mathbf{x}(n+1).$$

Alternative readouts are nonlinear readouts, such as the multilayer perceptron or local nonlinear modelling; this issue is outside the scope of this dissertation.

Design of the Readout for Dynamical Pattern Recognition

The applications of ESNs to problems such as system identification, control of dynamical systems, and time-series prediction have been extensively studied, both in this work and in the literature (Jaeger, 2001; Jaeger and Haas, 2004; Ozturk et al., 2007; Prokhorov, 2005). However, the use of ESNs for dynamical pattern recognition has been limited. In the literature, RNNs and other neural architectures, such as the time-delay neural network, have been utilized for dynamical pattern recognition tasks by minimizing the mean-square error (MSE) with respect to a label (Haykin, 1998; Bishop, 1995; Iso and Watanabe, 1991; Tebelskis, 1995). If the pattern to be classified is a time signal, the classification process is different from the classification of a static pattern. The dynamical classification problem can be reduced to a static one by treating each sample of the time series individually. In this case, the time series is not treated as a whole and the temporal information between samples is ignored. Moreover, one class label is associated with each time instant; therefore, post-classification such as majority voting has to be applied.


An alternative to this approach is embedding the time series to populate a short-term history of the pattern. Again, a static mapper can be used to classify the embedded pattern. In this approach, classification becomes extremely hard if the dimensionality of the input space becomes too large. Alternatively, the temporal processing of the time series can be done inside the classifier. In this approach, the classifier, which has a built-in memory, can accept single elements of the time series. Although the classifier does not require external temporal processing, its design may be complicated. Another major difficulty in dynamical pattern recognition is how to design the label, which should also be a time series. The obvious method is to enforce a class label (+1 or -1) as the desired response throughout the presentation of the input pattern. However, the difficulty with this method is that the desired response is forced to be constant, independent of the input signal dynamics. An alternative powerful technique is to create a one-step predictor for each class in order to capture the dynamics of the class generating the time series (Zahalka and Principe, 1993). Then, during testing, the test pattern is presented to all predictors, and the label of the predictor with the least prediction error is assigned as the class of the input pattern. This method is a smart way of converting the pattern recognition problem into a regression problem, for which many techniques are available. It can easily be applied to ESNs with an adaptive linear readout and used for the classification of temporal patterns, but it is still dependent on input normalization and is sensitive to changes in pattern duration.

In this chapter, we propose an alternative readout for ESNs for temporal pattern recognition and compare it with the standard linear readout of ESNs and with other conventional techniques used in the literature. The goal is to design an ESN/LSM readout that will recognize a class of inputs that differ by some quantity (e.g., amplitude or shape, considered as a distortion parameter).


The proposed readout is based on a biologically plausible linear associative memory (LAM) that implements a correlator specially trained for high specificity. The readout, called the minimum average correlation energy (MACE) filter, is adopted from the optical pattern recognition literature, where it is used for the recognition of a given object in 2-D images in the presence of noise, geometric distortions and pose changes (Mahalanobis et al., 1987). The MACE filter is a correlator whose weights are determined in closed form by solving a minimization problem with a constraint, and it has been shown that the method is equivalent to a cascade of a pre-whitening filter for the image class followed by a LAM (Fisher, 1997). The recognition of a time pattern in a single time series is achieved by feeding the time series to an ESN/LSM and utilizing a time window of states, which can be interpreted as a 2-D image, one dimension being time and the other the processing-element number (space). Several MACEs are trained, one for each class. During testing, the unknown time signal is fed to the ESN, the states are correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern. An interesting implication of the MACE is that it can also be used for LSMs, with proper regularization of the MACE solution. In LSMs, the proposed readout operates directly in the spike domain, without converting the liquid-state spike train outputs into continuous signals by binning or low-pass filtering.

In this chapter, we first present a brief background on linear associative memories, followed by a detailed explanation of the MACE filter. Then the use of the MACE filter as the readout for ESN/LSMs is discussed. Finally, experiments are presented to compare the proposed readout with the conventional methods.

Linear Associative Memories

A linear associative memory is a multiple-input multiple-output linear system that implements an association (memory) between inputs and desired outputs (Hinton and Anderson, 1981).


The output is automatically created by the presentation of the proper input, which is referred to as content addressability. LAMs are quite different from the concept of memory used in digital computers, where the contents of a memory location are accessed by its address. Unlike digital memory, LAMs store information in a more global way, across several PEs instead of in local bit storage. Therefore, they are also robust to noise (Haykin, 1998; Hinton and Anderson, 1981). LAMs recall by association; hence the stored pattern that is closest to the input is recalled. LAMs are biologically plausible, just like the ESN/LSM, and they stand as the most likely model for human memory (Principe et al., 2000).

The forced Hebbian rule can be used to train a LAM that associates a $P$-dimensional input $\mathbf{p}$ with a $Q$-dimensional desired output $\mathbf{q}$. The output of the LAM is given by $\mathbf{y} = \mathbf{W}_{LAM}\mathbf{p}$, where the $Q \times P$ matrix $\mathbf{W}_{LAM}$ can be computed using the outer-product rule, $\mathbf{W}_{LAM} = \mathbf{q}\mathbf{p}^T$ (Principe et al., 2000; Haykin, 1998). Multiple input-output vector pairs $(\mathbf{p}_k, \mathbf{q}_k)$, for $k = 1, \ldots, K$, can also be stored in the LAM by repeated presentation of each input. Using the superposition principle for the linear network, the final weight matrix is given by

$$\mathbf{W}_{LAM} = \sum_{k=1}^{K}\mathbf{q}_k\mathbf{p}_k^T.$$

For an input vector $\mathbf{p}_l$, the output of the LAM is given by

$$\mathbf{y} = \mathbf{W}_{LAM}\mathbf{p}_l = \mathbf{q}_l(\mathbf{p}_l^T\mathbf{p}_l) + \sum_{k\neq l}\mathbf{q}_k(\mathbf{p}_k^T\mathbf{p}_l).$$

The first term in the LAM output is the desired output, and the second term, called crosstalk, is the interference of the other stored patterns with the true pattern. If the crosstalk term is small, the LAM can retrieve the pattern corresponding to the input. Notice that if the stored patterns are orthogonal to each other, the crosstalk term is zero, resulting in perfect recall. Therefore, the number of patterns that can be stored perfectly in a LAM is limited by the size of the input space, which is one of the limitations of LAMs for information processing (Principe et al., 2000; Haykin, 1998).
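A toy demonstration of outer-product storage and crosstalk: with a few random unit-norm keys, recall returns the stored output plus a crosstalk term that shrinks as the keys become more nearly orthogonal. The dimensions and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
P, Q, K = 64, 8, 5                         # input dim, output dim, stored pairs

p = rng.standard_normal((K, P))
p /= np.linalg.norm(p, axis=1, keepdims=True)    # unit-norm keys
q = rng.standard_normal((K, Q))

# Outer-product (forced Hebbian) storage: W = sum_k q_k p_k^T
W_lam = sum(np.outer(q[k], p[k]) for k in range(K))

# Recall pattern 0: the output is q_0 (p_0^T p_0) plus crosstalk from the other pairs
y = W_lam @ p[0]
crosstalk = y - q[0] * (p[0] @ p[0])
print(np.linalg.norm(crosstalk))   # small when the stored keys are nearly orthogonal
```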


The output patterns $\mathbf{q}_k$ of the pairs $(\mathbf{p}_k, \mathbf{q}_k)$, $k = 1, \ldots, K$, can be interpreted as the desired response for the LAM. The existence of a desired response allows the application of supervised learning to train LAMs using unconstrained optimization techniques such as least squares. For instance, the weights can be modified using the least-mean-squares (LMS) algorithm by an amount

$$\Delta\mathbf{W}_{LAM} = \eta(\mathbf{q}_k - \mathbf{y}_k)\mathbf{p}_k^T = \eta\,\mathbf{q}_k\mathbf{p}_k^T - \eta\,\mathbf{y}_k\mathbf{p}_k^T = \eta\,\mathbf{e}_k\mathbf{p}_k^T.$$

This equation is a combination of the desired forced Hebbian term ($\mathbf{q}_k\mathbf{p}_k^T$) and an anti-Hebbian term ($-\mathbf{y}_k\mathbf{p}_k^T$) that decorrelates the present output $\mathbf{y}_k$ from the input $\mathbf{p}_k$. The anti-Hebbian term reduces the crosstalk at each iteration, improving the performance of LAMs in terms of crosstalk. A LAM trained with LMS is called an optimal linear associative memory (OLAM) (Principe et al., 2000).

One question follows naturally: what is the difference between a LAM and a linear regressor? The linear regressor passes a single hyperplane through all the desired samples, so we normally want more patterns than the size of the input dimension in order to generalize, but it loses information about the covariance of the data space (Principe et al., 2000). In the LAM, on the other hand, the output is as close as possible to each of the training samples. This different behavior is controlled by the number of data points used for training, which must be less than the input dimensionality to minimize crosstalk. The MACE filter that we discuss next is a cascade of a preprocessor (a whitening filter over the class of training data) followed by a LAM (Fisher, 1997).


The MACE Filter

The technique of matched spatial filters (MSF), or correlators, for optical pattern recognition has been well studied and used for the recognition of 2-D objects (Vanderlugt, 1964; Kumar, 1986). The goal is to design an optimal correlator filter that represents an object class under noise and/or distortions, using a set of training images. The reason it is possible to match a single template to a class of objects is the huge number of weights (degrees of freedom) that the correlator has in 2-D ($N^2$ weights for an $N \times N$ input image). Moreover, the MACE filter can incorporate the covariance information of the class in its weights. The MACE filter is the most widely used MSF due to its superior discrimination properties (Mahalanobis, 1987; Casasent and Ravichandran, 1992).

The formulation of the MACE will be presented in the frequency domain for convenience and ease of implementation (Mahalanobis, 1987). Consider a set of training images for a single object class. The $i$th training image ($i = 1, \ldots, K$) is described as a 1-D discrete sequence (obtained by lexicographically ordering the columns of the image), denoted by $\mathbf{x}_i = [x_i(1), x_i(2), \ldots, x_i(d)]^T$, where $d$ and $K$ are the number of pixels in the image and the number of training images for the class, respectively. The discrete Fourier transform of $\mathbf{x}_i$ is denoted by $\mathbf{X}_i$. Define the overall $d \times K$ training matrix by $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_K]$. The $d \times 1$ column vector $\mathbf{h}$ denotes the coefficients of a filter in the space domain, and $\mathbf{H}$ its Fourier transform. The correlation function of the $i$th image with the filter $\mathbf{h}$ is given by $\mathbf{g}_i = \mathbf{x}_i \otimes \mathbf{h}$. The filter output $\mathbf{g}$ defines a 3-dimensional space, where the x and y axes are the indices (lags) of the 2-dimensional correlation function and the z axis is the value of the correlation at the corresponding indices. We will call it the correlation space. The energy of the $i$th correlation plane in the frequency domain is $E_i = \mathbf{H}^{\dagger}\mathbf{D}_i\mathbf{H}$, where $\mathbf{D}_i$ is a $d \times d$ diagonal matrix whose elements are the magnitude squares of the associated elements of $\mathbf{X}_i$.


The average correlation plane energy over the training images is defined as

$$E_{av} = \frac{1}{K}\sum_{i=1}^{K} E_i = \mathbf{H}^{\dagger}\left(\frac{1}{K}\sum_{i=1}^{K}\mathbf{D}_i\right)\mathbf{H} = \mathbf{H}^{\dagger}\mathbf{D}\mathbf{H}$$

(Mahalanobis, 1987). The MACE filter solution designs a correlation filter that minimizes the output energy while providing predetermined values, $\mathbf{c} = [c_1, c_2, \ldots, c_K]^T$, at the origin of the correlation plane. Each $c_i$ is the peak correlation value obtained by correlating the $i$th image with the filter $\mathbf{H}$ at zero lag. More specifically, the problem is to solve the constrained optimization problem (Mahalanobis, 1987)

$$\min_{\mathbf{H}} \ \mathbf{H}^{\dagger}\mathbf{D}\mathbf{H} \qquad \text{subject to} \qquad \mathbf{X}^{\dagger}\mathbf{H} = \mathbf{c}.$$

The solution to this problem can be found analytically using the method of Lagrange multipliers (Mahalanobis, 1987) and is given by

$$\mathbf{H} = \mathbf{D}^{-1}\mathbf{X}\left(\mathbf{X}^{\dagger}\mathbf{D}^{-1}\mathbf{X}\right)^{-1}\mathbf{c} \qquad (4\text{-}1)$$

The selection of the exact value of $\mathbf{c}$ is somewhat arbitrary, since it simply scales the peak correlation amplitude. Usually, the entries of $\mathbf{c}$ are chosen to be 1 for in-class training images and a smaller value (around 0.1) for out-of-class images, if these are used at all (Mahalanobis, 1987). The MACE filter solution achieves superior discrimination properties when compared to other existing correlation filters, such as synthetic discriminant function filters (Mahalanobis, 1987; Casasent and Ravichandran, 1992).
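Equation 4-1 translates almost directly into code. The sketch below synthesizes a MACE filter in the frequency domain from a set of lexicographically ordered training patterns; it exploits the fact that $\mathbf{D}$ is diagonal, and it assumes unit correlation peaks for the in-class images.

```python
import numpy as np

def mace_filter(patterns):
    """Closed-form MACE solution of Equation 4-1 in the frequency domain.
    patterns: (K, d) array, one lexicographically ordered training image per row.
    Returns the space-domain filter h (length d)."""
    X = np.fft.fft(patterns, axis=1).T             # d x K matrix of DFTs
    D = np.mean(np.abs(X) ** 2, axis=1)            # diagonal of D: average power spectrum
    c = np.ones(X.shape[1])                        # unit peaks for in-class images
    Dinv_X = X / D[:, None]                        # D^{-1} X without forming the full matrix
    H = Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, c)
    return np.real(np.fft.ifft(H))                 # back to the space domain

# Correlating a test pattern with h at zero lag gives the filter output:
# score = h @ x_test
```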


The ESN/LSM MACE for Dynamical Pattern Recognition

Pattern recognition is normally formulated as the design of a nonlinear static system trained discriminantly with the information of a set of labels, with data that are vectors in $R^n$. An alternative to this approach is obtained by creating models for each class that are trained only with the class data (nondiscriminant training). When the number of classes to be classified is unknown at training time (open-set classification), the latter provides robust results. The correlator, or matched filter, is a very well known example of a system that recognizes single-class data (Helstrom, 1995; Proakis, 2001), and it can easily be extended to multiple dimensions. Dynamical pattern recognition can also be formulated in a similar manner, by using dynamical systems as classifiers and training them discriminantly with an appropriate desired signal. Instead, in this study, we propose to use the MACE filter as a readout for ESN/LSMs in dynamical pattern recognition tasks, trained nondiscriminantly for each class. In the ESN/LSM, the proposed associative memory considers a time window of states, which can be interpreted as a 2-D image, one dimension being time and the other the space of the states (see Figure 4-1).

Let us explain the operation of the ESN-MACE in mathematical terms. Assume that we have $P$ different classes of temporal patterns. The goal is to compute one MACE filter, $\mathbf{h}_p$, for each class. Assume also that for each class the training set consists of $K$ input patterns, each with a particular temporal shape of length $T$. The procedure to compute each $\mathbf{h}_p$, for $p = 1, \ldots, P$, is as follows.

1. The $i$th training input pattern for the $p$th class, $\mathbf{u}_i = [\mathbf{u}_i(1), \mathbf{u}_i(2), \ldots, \mathbf{u}_i(T)]$, of dimension $M \times T$, is used to calculate the echo states using Equation 2-2.

2. The resulting $N \times T$ matrix of echo states forms the equivalent of the 2-D image discussed in the previous section (Figure 4-1). The echo states are then lexicographically ordered by columns to get the 1-D column vector $\mathbf{x}_i = [\mathbf{x}_i(1)^T, \mathbf{x}_i(2)^T, \ldots, \mathbf{x}_i(T)^T]^T$ with $N \cdot T$ elements. Here each $\mathbf{x}_i(n)$ is an $N \times 1$ vector with the values of the echo states at time $n$ for the $i$th training sequence.


3. The discrete Fourier transform of $\mathbf{x}_i$ is denoted by $\mathbf{X}_i$. Define the overall $d \times K$ training matrix in the frequency domain by $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_K]$, with $d = N \cdot T$.

4. The optimal coefficients of the LAM for the class are computed using Equation 4-1 in the frequency domain, and the corresponding MACE filter weights $\mathbf{h}_p$ are obtained by the inverse discrete Fourier transform. The output of the MACE for the $i$th training input pattern of the $p$th class is then $\mathbf{h}_p^T\mathbf{x}_i$.

Figure 4-1. Interpretation of echo states as a 2-D image. A) The triangular input signal. B) An overlay plot of the echo states of a 100-unit ESN. C) The echo state image, where a point in state index and time is interpreted similarly to a pixel in an image.

The same procedure is repeated for the training sequences of the other classes to obtain an optimal filter for each class.
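Putting steps 1-4 together, a sketch of the full ESN-MACE training and classification loop might look as follows; `run_esn` is an assumed helper that returns the $N \times T$ state matrix for an input pattern, and `mace_filter` is the sketch given earlier.

```python
import numpy as np

def train_esn_mace(class_patterns, run_esn):
    """Train one MACE template per class from echo-state images (steps 1-4).
    class_patterns: list over classes; each entry is a list of (M, T) patterns."""
    templates = []
    for patterns in class_patterns:
        # Steps 1-2: echo-state image, lexicographically ordered by columns
        vecs = [run_esn(u).flatten(order="F") for u in patterns]
        # Steps 3-4: closed-form MACE solution in the frequency domain
        templates.append(mace_filter(np.array(vecs)))
    return templates

def classify(u, templates, run_esn):
    """Assign the label of the template with the highest zero-lag correlation."""
    x = run_esn(u).flatten(order="F")
    return int(np.argmax([h @ x for h in templates]))
```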


During testing, the input signal can be either a continuous stream (asynchronous operation) or a series of frames (synchronous operation). The choice of the initial conditions for the ESN in step 1 depends on the mode of operation. In synchronous operation, the timing of the signal and the frames of interest are known in both training and testing. Therefore, the ESN is initialized with the same zero initial condition in training and testing. The zero initial condition is chosen so that the system dynamics are not biased and the system states are controlled by the input. During testing, the MACE output is calculated for each frame by correlating the filter coefficients, at zero lag, with the echo states generated by the frame from a zero initial condition.

When the signal timing is unknown (asynchronous operation), the MACE filter generates an output at each time instant $n$ by correlating the filter coefficients with the echo states between time instants $n - T + 1$ and $n$, and the maximum over time above a prespecified threshold is picked to represent the occurrence of the pattern. In the case of a continuous stream, the initial conditions for each frame are dictated by the history of the input signal during testing; therefore, they are not necessarily zero. In order to mimic the test conditions while training the MACE, the initial conditions of the ESN are not set to zero but are determined by the history of the input. In other words, the whole input signal is used to create echo states, and the resulting $N \times T$ echo state images are then extracted and used to train the MACE filter using the above algorithm.

The Design of ESNs for Dynamical Pattern Recognition

One of the drawbacks of LAMs is their storage capacity, which is limited by the input space dimension (Haykin, 1998). However, the ESN/LSM with a LAM readout becomes very powerful, since the echo/liquid states provide the LAM with a user-defined high-dimensional input space, increasing the number of patterns that can be stored in the LAM with minimal crosstalk.


In ESNs, the reservoir states can be interpreted as a set of basis functionals (a representation space) constructed dynamically and nonlinearly from the input (Ozturk et al., 2007). Due to the nonlinearity, it is easy to obtain states that are linearly independent of each other (Ito, 1996; Ozturk et al., 2007). Therefore, the matrices in the MACE filter solution are typically full rank, allowing the computation of the inverses. Although linearly independent, the states are not orthogonal to each other, since they are coupled through the predefined weight values. An information-theoretic metric, called the average state entropy (ASE), has recently been proposed to describe the richness of ESN dynamics (Ozturk et al., 2007). ASE quantifies the distribution of state values across the dynamic range. When ASE is maximized, the states of the system are evenly distributed across the dynamic range instead of being concentrated on only a few values. The ASE measure is related to the system design by uniformly distributing the poles of the linearized system around the origin (the eigenvalues of $\mathbf{W}$) (Ozturk et al., 2007). We propose to use the same design methodology to construct the representation layer of ESNs with the LAM readout. This approximates for ESNs the well-known property of an orthogonal basis and eases the LAM training. In the experiments performed for this study, we observed that the numerical rank (condition number) of the inverted matrices is well conditioned for ESNs.

For LSMs, on the other hand, the value of each liquid state at each time instant is interpreted by the MACE as 1 or 0, depending on the existence or absence of a spike. This creates a very sparse state matrix, which likely leads to singular matrices in the MACE filter computation. Therefore, the matrices have to be regularized in order to compute appropriate MACE filter weights with the outlined formulation. The regularization can be done by adding zero-mean Gaussian noise with a small variance to the states before computing the MACE solution.


Experiments

This section presents a variety of experiments to compare the MACE filter readout proposed for ESN/LSMs with the conventional techniques for dynamical pattern recognition.

Classification of temporal signals with unknown peak values

In this experiment, the aim is to classify a temporal signal into one of two possible classes. The two classes are represented by a triangular and a step function of 20 samples. The interesting property of the signals in this experiment is that the peak values are unknown and can assume any value between 0.5 and 1. In other words, each class is not a single signal but a family of signals. The signals are also corrupted with additive, zero-mean, white Gaussian noise.

We compare the performance of three different ESNs with the MACE filter readout, denoted R-ESN1-MACE, R-ESN2-MACE and ASE-ESN-MACE. All ESNs have 30 PEs and are constructed using different techniques. R-ESN1-MACE is a randomly connected ESN where the entries of the $\mathbf{W}$ matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. R-ESN2-MACE is a randomly connected ESN where the entries of the $\mathbf{W}$ matrix are drawn from a zero-mean Gaussian distribution. The weights are scaled such that the spectral radius is 0.9. The internal weight matrix, $\mathbf{W}$, of the ASE-ESN-MACE is designed such that the eigenvalues of $\mathbf{W}$ (the poles of the system linearized around the origin) are uniformly distributed inside the unit circle. This principle was introduced in (Ozturk et al., 2007); it distributes the states of the system evenly across the dynamic range instead of concentrating them on a few locations. The entries of $\mathbf{W}^{in}$ are set to 1 or -1 with equal probability for all three ESNs.


Figure 4-2. Performance comparison of ESNs for the classification of signals with unknown peak values. The ESNs with the MACE filter readout perform better at all noise levels than the discriminative and predictive ESNs. The classification accuracy of the ASE-ESN-MACE is almost perfect up to a noise standard deviation of 0.3. The ASE-ESN-MACE performs better than the randomly connected ESNs with the MACE readout because it distributes the echo states uniformly and increases the separation between states corresponding to different classes.

To train the MACE filter readout for the ESNs, 100 different realizations of each class of signals are generated and the echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image of dimension 30x20, where 30 is the number of PEs and 20 is the signal length. One MACE filter is trained for each class, using only data from the corresponding class. The output correlation peak amplitudes are assigned to be 1.0 for the training data. The two MACE filters are synthesized in the frequency domain using Equation 4-1, and the corresponding image-plane filters are obtained by the inverse discrete Fourier transform.


For a given test signal, each filter output is calculated by correlating the echo states of dimension 30x20 with the filter coefficients at zero lag. Figure 4-2 shows the correct classification rates of the three methods as the standard deviation of the noise signal varies. The results are averaged over 100 realizations of each ESN.

We also compare the performance of the ESNs with the MACE filter readout to the ASE-ESN utilizing the conventional linear readout. When the conventional readout is used for dynamical pattern recognition, the problem is the design of the desired signal. One solution is to train a discriminative readout with a constant target output signal of amplitude 1 for one of the two classes and a constant signal of amplitude -1 for the other class. The duration of the target signals is the same as the duration of the input signal. During testing, the echo states, the output signal, and the corresponding error signals, one for each label, are computed for a given input signal. Then the two MSE values, integrated over the duration of the input pattern, are compared, and the class with the lowest MSE is assigned to the input pattern. In this solution, the readout has to generate a constant signal independent of the input signal dynamics. An alternative, also tested, is to train two independent predictive readouts, one for each class, to predict the next sample of the input signal. During testing, the echo states, the output signal, and the error signal are computed for each readout. Then the label of the readout with the lower MSE value, integrated over the duration of the input pattern, is assigned as the class of the input signal. The predictive readout seems the more natural choice, since its desired response is not independent of the input dynamics. The same training data used for the MACE filter is employed to calculate the readout weights for the discriminative and predictive ESNs.


Figure 4-2 shows the correct classification rate as the standard deviation of the noise signal varies, averaged over 100 trials. As observed from the figure, the ESNs with the MACE filter readout perform better at all noise levels than the ESNs with the conventional or predictive readouts. The classification accuracy of the ASE-ESN-MACE is almost perfect up to a noise standard deviation of 0.3. The correct classification rate of the discriminative ESN in the noiseless case is only 0.92, while that of the predictive ESN is 0.95, slightly better than the discriminative ESN. Since all three systems have the same reservoir, the better performance of the MACE readout is attributed to the fact that the problem requires modeling a class of functions all at once in a single template, and the MACE filter is able to build such a template containing covariance information about the full training set. The conventional and predictive readouts train for the best compromise over the input-class peak amplitudes of the training set. Being a nonlinear system, the ESN has dynamics that are coupled to the input signal. Therefore, the response of the system to input signals with varying amplitudes can be quite different, making the task of projecting the states onto a constant value very challenging. The wave shape is also challenging, since for the pulse (a sequence of zero values followed by a sequence of ones) the linear readout has to transform the highly irregular states generated by this pattern into a constant value. Similarly, for the predictive ESN, the variation in the input amplitudes affects the ESN dynamics, and the performance drops quickly as the noise standard deviation is increased. Moreover, the MACE filter readout is expected to be more robust with respect to the selection of $\mathbf{W}$ and $\mathbf{W}^{in}$ than the conventional readout design methodology, where the states have to be carefully selected in order to estimate the desired signal with a linear readout. In contrast, the LAM readout does not try to estimate a desired response but rather generates a template from the states.


Among the ESNs with MACE readout, the ASE-ESN performs the best compared to randomly generated ESNs. This was an expected result, since the ASE-ESN distributes the echo states uniformly and increases the separation between states corresponding to different classes. Therefore, the ASE-ESN should be the preferred design for ESNs with the MACE filter readout.

Robust signal detection in a digital communication channel

We applied ESN-MACE to the detection of a known waveform transmitted through a digital communication channel. Please refer to Chapter 6 for details of this study.

Classification of odors with an electronic nose

In this experiment, the goal is to classify odors using signals collected with a Cyranose 320 electronic nose. The Cyranose 320 has an array of 32 sensors which change their resistance in response to odors and produce a unique response pattern for different odors (Figure 4-3). For our experiment, the spices basil, cinnamon, ginger, and rosemary form the 4 classes of odors to be classified. Each of these spices is mixed with air and exposed to the electronic nose in 4 different concentrations: 25%, 50%, 75%, and 100%. For each concentration of each spice, data corresponding to 10 different trials is collected. There are a total of 160 trials, 40 for each class of spice. We present two approaches to solve the problem. The first approach is based on static features, namely the steady-state values of all 32 sensors as given by the Cyranose. We obtain a 32x160 matrix containing 32 features for all 160 trials of the 4 classes. From the feature matrix, we extract the part corresponding to 40 trials (10 trials for each class with representatives from the 4 different concentrations) as the training set to train a linear classifier with 4 outputs, each representing one of the 4 classes. The linear classifier is used due to the availability of an analytical solution. Each output of the linear classifier is trained to output 1 when the input feature is from the representative class and 0 otherwise. The 4x32 weight matrix of the linear network is computed using least squares.
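A minimal sketch of this static-feature baseline, assuming plain least squares via the pseudoinverse (names illustrative):

    import numpy as np

    def train_linear_classifier(F_train, T_train):
        # F_train: 32x40 steady-state features; T_train: 4x40 targets holding
        # 1 for the true class and 0 otherwise.  Returns the 4x32 weight
        # matrix as the least-squares solution W = T F^+.
        return T_train @ np.linalg.pinv(F_train)

    def classify(W, f):
        # f: 32-dimensional feature vector; the label is the output with
        # the maximum value.
        return int(np.argmax(W @ f))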


Figure 4-3. Time series representing the response pattern of the 32 electronic nose sensors exposed to rosemary.

During testing, the 32x1 feature vector is used to compute the 4 outputs, and the label of the output with the maximum value is assigned as the class of the input vector. The classification accuracy of the linear classifier evaluated on the 160 feature vectors is found to be 78.7%, and the confusion matrix is

    24 16  0  0
     0 40  0  0
     0 14 22  4
     0  0  0 40

The entry a_ij in row i and column j of the confusion matrix is the number of odors that should have been classified in class i but were classified as class j.


The correct decisions are given on the main diagonal of the confusion matrix. The 4 classes are presented in the order basil, cinnamon, ginger, and rosemary.

The second approach is based on the proposed ESN-MACE. The ESN exploits the dynamical response of the sensors, which also contains information about the odor class. We create a template MACE filter for each class from the states of the ESN, which represents the variations among the different spice concentrations. Again, 40 trials (10 trials for each class with representatives from the 4 different concentrations) are used as the training set. Unlike the static case, each training trial consists of a 32-dimensional time series representing the response of the electronic nose. We use an ESN with 100 internal units and 32 inputs. The internal weight matrix, W, of the ESN is designed with uniform eigenvalue distribution and a spectral radius of 0.9. The entries of Win are set to 5 or -5 with equal probability. In the previous two experiments, the selection of the length of the patterns was dictated by the known signal lengths. However, in this experiment we do not have pre-defined signals, making the selection of the pattern length T a design parameter (Figure 4-3). Here we choose T to be 80, so that the template pattern contains the rising and falling edges of the sensor response. This selection uses the transient sensor dynamics of the particular odor in the design of the MACE readout. For each input pattern, the echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image of dimension 100x80, where 100 is the number of PEs and 80 is the signal length. 10 in-class patterns are used to train each one of the MACE filters corresponding to the 4 classes. Output correlation peaks are assigned to be 1.0. During testing, echo states are calculated for each pattern and the 4 MACE filter outputs are computed. The label of the MACE filter with the highest value is assigned as the class of the input pattern. The classification accuracy of the ESN-MACE evaluated on the 160 input patterns is found to be 63.1%.


The confusion matrix is

    28 11  1  0
     0 17 16  7
     0  0 35  5
     0  1 18 21

The results are, indeed, worse than the linear classifier trained on static features. When looked at closely, the MACE filter gives correlation values around 1.0 for the in-class data. However, the response of the MACE filter corresponding to the 2nd class (cinnamon) for out-of-class data is bigger than 1 for most of the patterns, which results in the false classifications. The MACE filter trained nondiscriminatively does not constrain the response of the MACE for out-of-class data. One solution to this problem is to use out-of-class data in the MACE filter training, as is done in (Mahalanobis et al., 1987). Different from the previous experiments, both in-class and out-of-class training data are used to train each one of the 4 MACE filters corresponding to the 4 classes. For each MACE filter, 10 in-class and 30 out-of-class (10 for each class) echo state images are used for discriminative training. Output correlation peak amplitudes are assigned to be 1.0 for the in-class and 0.1 for the out-of-class training data.

Using the training data, four MACE filters are synthesized in the frequency domain using Equation 4-1, and the corresponding image plane filters are obtained by inverse discrete Fourier transform. For a given test signal, the output of each MACE filter is calculated by correlating the echo states of dimension 100x80 with the filter coefficients. The label of the filter with the largest correlation is assigned as the class of the input pattern. Out of the 160 trials, the correct classification rate is found to be 93.1%, much better than the performance of the classifier based on static features.
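A sketch of this synthesis step, assuming Equation 4-1 is the classical MACE solution h = D^-1 X (X^H D^-1 X)^-1 u of Mahalanobis et al. (1987), where the columns of X hold the 2-D DFTs of the training images and D is the diagonal average power spectrum (names illustrative):

    import numpy as np

    def mace_filter(images, peaks):
        # images: list of 2-D echo-state images (here 100x80); peaks: the
        # constrained correlation amplitudes (1.0 in-class, 0.1 out-of-class).
        X = np.stack([np.fft.fft2(im).ravel() for im in images], axis=1)
        d = np.mean(np.abs(X) ** 2, axis=1)   # average power spectrum (diagonal of D)
        DX = X / d[:, None]                   # D^-1 X
        u = np.asarray(peaks, dtype=complex)
        h = DX @ np.linalg.solve(X.conj().T @ DX, u)
        return np.real(np.fft.ifft2(h.reshape(images[0].shape)))

    def filter_output(h_img, state_image):
        # correlation at zero lag between a state image and the filter
        return float(np.sum(h_img * state_image))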


Similarly, the confusion matrix is found to be

    39  1  0  0
     0 35  5  0
     0  3 37  0
     0  0  2 38

The addition of out-of-class data for training of the MACE improves the rejection of the out-of-class patterns and hence decreases the false alarm rate. However, it also converts the problem to a classification problem (closed-set classification), where the number of classes has to be known a priori. It is desirable, whenever possible, to avoid the use of out-of-class data in the training data for the detection of signals where the out-of-class data is either not available a priori or can be any possible signal shape excluding the in-class signal shape (the entire signal space excluding the in-class signal space).

Alternatively, we trained MACE filters in the input space instead of the state space of the ESN. In the input space, the images are of dimension 32x80. The procedure of training MACE filters for each class is repeated in the 32-dimensional input space. Out of 160 trials, the correct classification rate is found to be 85.6%. The confusion matrix of the MACE filter is

    39  1  0  0
     0 32  7  1
     0  5 32  3
     0  1  5 34

In order to understand the increase in classification rate by the ESNs, let us compare the outputs of the ESN-MACE filter with the MACE filter trained directly in the input space for the rosemary class (Figure 4-4). As observed from the figures, the ESN-MACE filter output corresponding to the rosemary class displays values around one consistently for the in-class data, except for the 37th testing sample.


Figure 4-4. Comparison of the ESN-MACE filter output and the MACE filter output trained on the input space for the rosemary class. A) Outputs of the 4 MACE filters over the test data index. B) Outputs of the 4 ESN-MACE filters over the test data index.


The ESN-MACE filter outputs corresponding to the other classes are very close to 0.1, as specified by the training procedure. However, the MACE filter trained directly in the input space has very inconsistent outputs for the out-of-class samples. The margin between the in-class and out-of-class outputs of the MACE filter is increased by the ESN reservoir. There are mainly two reasons for the increase in the distance between the images with the addition of the ESN. The first reason is the projection of the signals to a higher dimension by the ESN, which increases the separation between the images. Secondly, the ESN functions as a nonlinear filter that helps separate the responses to different input classes.

We also compared the results obtained to a TDNN trained in discriminative mode. 4 TDNNs, one for each class, are used, where the desired target signal is +1 for in-class data and 0 otherwise. Each TDNN has a delay line of 81 samples (equal to the signal duration) for each one of the 32 sensors, 3 hidden layer PEs, and a single output unit. The same data (10 in-class and 30 out-of-class) used for the discriminative ESN-MACE is used to train the weights of each TDNN using the Levenberg-Marquardt algorithm with a step size of 0.01. Training is stopped after 100 epochs. A cross-validation set cannot be used due to the limited number of training data. During testing, an input pattern is fed to each one of the 4 TDNNs, and the label of the TDNN with the largest output is assigned as the class of the input pattern. Out of 160 trials, the correct classification rate is found to be 65.7%. The confusion matrix of the filter is

    30  2  2  6
     6 21 10  3
     1  7 29  3
     0  2 13 25

The performance of the TDNN is very poor compared to the discriminative ESN-MACE and even the static classifier. The main reason for the poor performance is the lack of sufficient data


for training the TDNN. The number of parameters of each TDNN is 32*81*3+3=7779, whereas only 40 training patterns are available for training.

Spike train classification with LSM

The associative memory readout presented for the ESN can in fact also be utilized as a readout for an LSM. Similar to the ESN case, the liquid states are interpreted as 2-D images, one dimension time and the other space. An optimal MACE filter, which associates LSM states with class labels, can be analytically computed for each class using training data. The value of the liquid state is either 1 or 0, depending on the existence or absence of a spike. Note that the MACE filter readout does not require low-pass filtering the spike trains in order to obtain continuous amplitude signals for further processing.

We use a spike train classification experiment, where the goal is to classify a spike train presented to the LSM as one of two classes (Maass et al., 2002). The two classes are represented by randomly generated Poisson spike trains over a 100 ms time interval. Other members of each class are generated by moving each spike of the representative spike train by an amount drawn from a zero-mean Gaussian distribution with a variance of 4 ms. A randomly connected LSM with 50 integrate-and-fire neurons with 20% inhibitory (I) and 80% excitatory (E) connections is used. The membrane time constant, reset voltage, threshold voltage, background current, and input resistance of the neurons are chosen to be 30 ms, 13.5 mV, 15 mV, 13.5 nA, and 1 MΩ, respectively. Absolute refractory periods for excitatory and inhibitory neurons are 3 ms and 2 ms, respectively (Maass et al., 2002). The probability of a synaptic connection between neurons a and b is given by C*exp(-(D(a,b)/λ)²), where D(a,b) denotes the distance between the two neurons and λ is a parameter which controls both the average number of connections and the average distance between neurons that are synaptically connected.
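A minimal sketch of this wiring rule, assuming the 10x5x1 grid described next (the C and λ values passed in are only examples):

    import numpy as np

    # Neuron positions on the 10x5x1 grid; D[a, b] is the Euclidean distance.
    pos = np.array([(x, y, 0.0) for x in range(10) for y in range(5)])
    D = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)

    def connect(C, lam, rng=np.random.default_rng(0)):
        # Connection probability C * exp(-(D(a,b)/lambda)^2) for each pair.
        p = C * np.exp(-(D / lam) ** 2)
        np.fill_diagonal(p, 0.0)             # no self-connections
        return rng.random(p.shape) < p       # boolean adjacency matrix

    adj_EE = connect(C=0.3, lam=2.0)         # e.g. E->E pairs; lam only an example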


The neurons are placed on a 10x5x1 grid. C is set to 0.3 (EE), 0.2 (EI), 0.4 (IE), and 0.1 (II). Dynamic synapses are used based on the model proposed in (Markram et al., 1998), with the synaptic parameters U (use), D (depression time constant), and F (facilitation time constant) randomly chosen from Gaussian distributions. The mean values of U, D, F are set to 0.5, 1.1, 0.05 (EE); 0.05, 0.125, 1.2 (EI); 0.25, 0.7, 0.02 (IE); and 0.32, 0.144, 0.06 (II), and the SD of each parameter was chosen to be 50% of its mean. The mean of the scaling parameter W (in nA) is chosen from a Gamma distribution with mean values 30 nA (EE), 60 nA (EI), -19 nA (IE), and -19 nA (II) (Maass et al., 2002). For input synapses, the parameter W has a value of 18 nA or 9.0 nA for excitatory and inhibitory destination neurons, respectively. The SD of W is 100% of its mean. The postsynaptic current was modeled as an exponential decay exp(-t/τ_s), where τ_s is 3 ms and 6 ms for excitatory and inhibitory synapses, respectively. The transmission delays between neurons are 1.5 ms (EE) and 0.8 ms for the other connections. The initial membrane voltage of each neuron at the start of each simulation is drawn from a uniform distribution between 13.5 mV and 15.0 mV (Maass et al., 2002).

Fifty different realizations of spike trains for each class are generated, and the liquid states are calculated for each realization. One readout MACE filter is trained for each class using only data from the corresponding class. The output correlation peak amplitude is assigned to be 1.0 for the training data. Notice that the state matrix is very sparse, resulting in singular matrices in the MACE filter computation. Therefore, the matrices have to be regularized in order to compute inverses. Here the regularization is done by adding zero-mean Gaussian noise with a variance of 0.02 to the states before computing the MACE solution. The two MACE filters are synthesized in the frequency domain using Equation 4-1, and the corresponding image plane filters are obtained by an inverse discrete Fourier transform. For a given test signal, the output of each filter is calculated by correlating the liquid states with the filter coefficients.


Spike train classification with LSMs is conventionally done by first low-pass filtering the spike output of the liquid and using a linear regressor network (Maass et al., 2002). Here, we used a low-pass filter with a time constant of 30 ms. A linear readout is trained with the constant target signal of amplitude +1 for one of the classes and the constant signal of amplitude -1 for the other class, using the same data as for the MACE filter training. During testing, the echo states, the output signal, and two error signals, one between the output and the constant signal of amplitude +1 and another between the output and the constant signal of amplitude -1, are computed for a given input signal. Then the two MSE values, integrated over the duration of the input pattern, are compared to assign the class of the input pattern.

Figure 4-5 shows the correct classification rate for both methods as the parameter λ, which controls the synaptic connections in the LSM, varies from 0.2 to 4. The results are averaged over 100 trials. As observed from the figure, the MACE filter readout gives slightly better accuracy compared to the linear readout for all λ values. This is due to the fact that the linear readout tries to approximate a constant signal independent of the liquid state values. This creates difficulties for the simple linear readout, especially when the input spike train is sparse. In fact, under closer analysis, the output of the linear readout fluctuates. Therefore, the decision made by the linear readout has to be carefully averaged over time in order to reach a final decision about the class of the input spike train. For example, if the decision is based solely on the output of the readout at the final time instant, the classification rate is very close to the theoretical minimum of 0.5. On the other hand, the MACE filter readout provides the ability to operate in the spike domain without the need to low-pass filter the liquid state spike output. However, we note that the MACE filter computation in Equation 4-1 requires extra caution, since the liquid state has many zero values, mostly leading to singular matrices.
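A sketch of the low-pass stage, assuming a simple leaky-integrator (first-order exponential) filter with the 30 ms time constant (names illustrative):

    import numpy as np

    def lowpass_spikes(spike_times, T=100.0, dt=1.0, tau=30.0):
        # Bin the spike train (times in ms) and leaky-integrate it so the
        # linear readout sees a continuous-amplitude trace:
        # r[n] = exp(-dt/tau) * r[n-1] + s[n].
        trace = np.zeros(int(T / dt))
        for ts in spike_times:
            k = int(ts / dt)
            if k < len(trace):
                trace[k] += 1.0
        decay = np.exp(-dt / tau)
        for n in range(1, len(trace)):
            trace[n] += decay * trace[n - 1]
        return trace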


Figure 4-5. Comparison of the correct classification rates of the LSM-MACE and the LSM with linear readout in the spike train classification problem as the parameter λ varies.

The MACE filter readout gives slightly better accuracy compared to the linear readout for all λ values. This is due to the fact that the linear readout tries to approximate a constant signal independent of the liquid state values. Moreover, the MACE readout operates in the spike domain, eliminating the need to convert spike trains to continuous-valued signals by low-pass filtering. This experiment demonstrates the use of the MACE filter as a readout for LSMs. A more detailed study has to be carried out to explore the advantages and disadvantages of the MACE readout compared to the standard techniques.


CHAPTER 5
IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS

Echo state networks introduced a new type of computation that has been proven to be very powerful. In this chapter, we study how the ESN idea can be utilized for the design of other networks. First, we investigate the echo state condition (spectral radius of W less than 1) in terms of the effect of system stability on the power of a dynamical system for computation. In particular, we relax the echo state condition and allow some of the poles to be larger than 1 in magnitude. This introduces a new dynamical regime, called transiently stable computation, where function approximation with a readout is still possible even though the system is not globally stable. Second, we investigate the biologically plausible model of the olfactory cortex, the Freeman Model (FM), and propose to use a readout for the FM to be able to use it for useful computation. Without the proposed readout, the use of the FM is limited to simple digit recognition. An interesting property of the FM is the nonlinear function used in the model. The FM nonlinearity does not have its largest slope at the zero operating point, unlike the sigmoidal nonlinearity used in ESNs. We will demonstrate with experiments that the FM coupled with a readout can process continuous-valued signals.

Transiently Stable Computation

Linear filters are the simplest dynamical systems employed to process signals with temporal structure, as discussed in Chapter 1. They can be classified according to the impulse response, which fully characterizes the filter, as finite impulse response (FIR) and infinite impulse response (IIR) filters (Haykin, 2001). For the IIR case, stability becomes an essential constraint in the design. Without stability, the system response will diverge independent of the input signal. Global asymptotic stability of linear filters can simply be guaranteed by selecting the poles of the system in the proper region of the frequency domain: the open left half plane for


continuous-time systems and the inside of the unit circle for discrete-time systems (Kailath, 1980). The TDNN and the RNN can be considered as the nonlinear counterparts of the FIR and IIR filters, respectively. Similar to the linear case, stability becomes an issue for RNNs, although the definition of stability is no longer BIBO (bounded-input bounded-output) stability. In ESNs, the stability issue is addressed with the echo state condition (Jaeger, 2001) and has been investigated in detail in Chapter 2. The echo state condition is derived from the stability of the linearization of the system around the zero equilibrium and guarantees that the states are strongly coupled with the input signal, allowing the system to compute an input-output map. In this chapter, we introduce a new computational mode that emerges when we relax the echo state condition and allow the spectral radius to be slightly greater than 1. First, let us consider an ESN that obeys the echo state condition.

Conventional Echo State Networks

A simple demonstration of the typical response of an ESN satisfying the echo state condition is given in Figure 5-1. A randomly connected 100-unit ESN without output feedback connections is constructed. The entries of the internal connection weight matrix W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, resulting in a spectral radius of 0.9. The input weight matrix Win has values +1 or -1 with equal probability. The input to the system (Figure 5-1 A) has 3 different regions: a zero signal for the first 100 steps, a ramp signal for the following 100 steps, and again a zero signal of length 100. The system is initialized randomly and run with the given input.
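A minimal sketch of this demonstration, assuming the usual tanh state update x(n+1) = tanh(W x(n) + Win u(n+1)) and that Equation 2-4 is the ordinary least-squares readout solution; the explicit rescaling of W to a spectral radius of 0.9 is a convenience of the sketch, and the seventh-power target is the one introduced below:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100
    # Internal weights: 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95,
    # then scaled so the spectral radius is exactly 0.9.
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    W_in = rng.choice([1.0, -1.0], size=N)       # input weights, +1 or -1

    # Input: zero (100 steps), ramp (100 steps), zero (100 steps)
    u = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.zeros(100)])

    x = rng.uniform(-1, 1, N)                    # random initial state
    X = np.empty((len(u), N))                    # collected echo states
    for n, un in enumerate(u):
        x = np.tanh(W @ x + W_in * un)
        X[n] = x

    # Least-squares readout for the desired signal d(n) = u(n)^7
    w_out, *_ = np.linalg.lstsq(X, u ** 7, rcond=None)
    y = X @ w_out                                # cf. Figure 5-1 C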


Figure 5-1. Demonstration of a typical response of an ESN with a spectral radius of 0.9 and application to function approximation. A) The input signal. B) 100 echo states of the ESN. C) The overlay plot of the ESN output and the desired signal.


Figure 5-1 B depicts all 100 echo states on top of each other over the given time interval. As seen from the figure, during the first 100 steps the system state converges to zero (after an initial transient of length about 30 steps caused by the random initial condition), since the input to the system is zero and the spectral radius is less than 1. Then the echo states are constructed by the ramp signal in the time interval between 100 and 200. The states again converge to zero, due to the echo state condition, when the input is removed. The resulting echo states can be used to generate any desired response related to the input signal. Here, the desired signal is chosen to be the seventh power of the input signal, and the readout weights are calculated using Equation 2-4. Figure 5-1 C displays the overlay plot of the optimal ESN output and the desired signal. As observed from the figure, the system output very accurately estimates the desired signal.

Transiently Stable Computation with Echo State Networks

Engineering systems are mostly designed with a global stability restriction. We have seen a similar constraint, the echo state condition, in the design of echo state networks. We have also demonstrated with a function approximation experiment that ESNs satisfying the echo state condition can do useful computation. On the other hand, biological computation may not possess a similar stability constraint (Freeman, 1975; Freeman, 1993; Yao and Freeman, 1990). Indeed, biological systems do not have fixed-point dynamics but rather a wide variety of collective behaviors in the form of oscillations and even chaos (Freeman, 1975; Yao and Freeman, 1990). Inspired by principles of biological computation, we would like to explore the network's response without the restriction on the spectral radius of W. Hence, we will remove the echo state condition and let the spectral radius be slightly greater than 1. Obviously, this will introduce instability for the autonomous system, whereas the response of the system with an applied input is not yet obvious.


Figure 5-2. Demonstration of a typical response of a transiently stable ESN with a spectral radius of 1.1 and application to function approximation. A) 100 echo states of the transiently stable ESN. B) The overlay plot of the ESN output and the desired signal.

In order to observe the ESN response without the echo state condition, we use an experimental setup similar to the one in the previous section. We use the same W matrix and scale it to obtain a spectral radius of 1.1. The same input signal (Figure 5-1 A) is fed to the ESN


that is initialized to a random initial condition. We plot the resulting echo states in Figure 5-2 A. There are a few observations to be made at this point. First of all, we see that the system does not converge to zero during the first 100 steps although the input is zero. Instead, the echo states exhibit a nonconvergent dynamical response that differs from the ESN with the echo state condition. In fact, this was the expected response, since the system is designed to be unstable (spectral radius greater than one) around the zero equilibrium. A similar response can also be observed during the last 100 time steps when the input is again zero. The second and more important observation is the response of the system between time steps 100 and 200, where the ramp signal is applied as the input. The echo states during this interval become more regular (after some initial transient between time steps 100 and 130) compared to the states when there is no input. In fact, after the transient, the echo states look similar to the ones in Figure 5-1 B, where the ESN satisfies the echo state condition. Thirdly, there is a transition period between time steps 100 and 130 where the states switch from a disordered mode to a more regular mode. In summary, the system responds according to its own dynamics when there is no input. As the input amplitude gradually increases, the system response is determined by a competition between the system dynamics and the input amplitude. When the input amplitude is sufficiently large, the system dynamics becomes transiently stable and is determined by the input signal. We would like to find out if we can utilize the transiently stable ESN for the same function approximation problem we had in the previous section. The weights of the linear readout network are again computed using Equation 2-4, and the corresponding output is shown in Figure 5-2 B. As seen from the figure, we obtain a good match between the output and the desired signal. Similar results can be obtained even if the system is initialized to a different initial condition or the time instant at which the ramp signal is applied (this will change the state value when the ramp signal is


applied) changes. This experiment clearly demonstrates that a transiently stable ESN can do useful computation.

Understanding Transiently Stable Computation

We will utilize the linearization analysis proposed in Chapter 2 in order to quantify the local dynamics of the system. The pole movement of the ESN with a spectral radius of 0.9 is illustrated in Figure 5-3. According to the figure, the poles always stay inside the unit circle even as the input signal changes. As the input signal strengthens, the poles move towards the origin of the z-plane (decreasing the spectral radius), which results in a more stable system. This is due to the fact that larger input signals saturate the nonlinear function, reducing the slope at which the system is operating. Similar observations have already been made in Chapter 2.

Figure 5-3. Movement of the poles for the ESN with spectral radius 0.9, plotted in panels A, B, C, D when the input is at the points labeled A, B, C, D in Figure 5-1 A, respectively.
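A sketch of how such pole plots can be produced from the states X and weights W of the earlier sketch, assuming tanh units so that the local Jacobian is diag(1 - x^2) W (the exact form of Equation 2-5 may differ in detail):

    def local_poles(W, x):
        # Eigenvalues of the local Jacobian diag(1 - x^2) W: the
        # instantaneous poles of the ESN linearized at state x
        # (1 - x^2 is the tanh slope at the operating point).
        return np.linalg.eigvals((1.0 - x ** 2)[:, None] * W)

    poles_zero = local_poles(W, X[50])    # during the zero-input segment
    poles_ramp = local_poles(W, X[150])   # during the ramp segment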


Now, let us examine the movement of the poles of the transiently stable ESN of the previous section. Figure 5-4 shows the movement of the poles for the same input signal. First of all, the poles of the system move in and out of the unit circle during the first and last 100 time steps (hence the zero state is not asymptotically stable). This explains the complex echo state signals when there is no input to the system (Figure 5-2 A). The poles of the ESN with spectral radius 0.9 stay inside the unit circle during the same time interval.

Figure 5-4. Movement of the poles for the ESN with spectral radius 1.1, plotted in panels A, B, C, D when the input is at the points labeled A, B, C, D in Figure 5-1 A, respectively.

Secondly, when the ramp input is applied, the poles start to shrink towards the unit circle (after some transient), as is observed in the stable ESN. This phenomenon is again due to the sigmoid nature of the nonlinearity, which saturates with large input signals, reducing the slope of the operating point in Equation 2-5. With the movement of the poles towards the origin, the system stabilizes transiently, allowing the system state to be controlled by the input. This is critical for the use of the system for


computation (function approximation). Finally, we observe a transient time, when the ramp signal is applied, before the poles start shrinking. In this period, there is a competition between the system dynamics and the input signal. There are a few extra comments to be made at this point about the selection of the spectral radius. First of all, the transient response time after the ramp is applied is determined by two factors: the spectral radius and the slope of the ramp signal. The transient time will increase if the spectral radius of W is increased or the slope of the ramp signal is reduced, because as the spectral radius is increased, the force required to bring the poles inside the unit circle is increased, and hence a larger signal is required. Therefore, there is a balance between the system dynamics determined by the spectral radius and the input signal required to stabilize the system. Secondly, it is important to quantify the unstable behavior of the signals present in the transiently stable ESN when the input is absent. We have observed that when the spectral radius is slightly greater than 1, the signals are periodic oscillations, whereas for larger values of the spectral radius, the signals are more irregular and even chaotic.

Freeman's K Sets for Dynamical Computation

Understanding information processing in brains remains a challenge. Many different hypotheses on how the brain might process the massive bombardment of information brought by sensory systems have been advanced, ranging from artificial intelligence and neural networks to, more recently, machine learning and statistics (Freeman, 1975; Yao and Freeman, 1990; Chua and Yang, 1988; Wang, 1999). Walter Freeman developed a biologically realistic mathematical model of the olfactory cortex that captures, at a cell assembly level (the mesoscopic level), some of the physiological properties of the olfactory system. The Freeman model (FM) is based on nonlinear recurrent dynamical interactions representing the response of thousands of cells, the neural assemblies, instead of single neuron responses. In order to derive his model, Freeman used neurophysiological recordings, neuroscience concepts, and the K set hierarchy (Freeman,


1975) to build a model that mimics the behavior of olfaction (Freeman, 1975). This model is a drastic departure from other neural network models, such as Hopfield's, because the system response is produced by interactions of second-order nonlinear sets.

Freeman's major drive is to understand how the olfactory system works; here, we aim at using the FM as a computational framework for the interaction of animats with the real world. Real worlds are complex dynamical environments that unfold unpredictably, with lots of redundancy but also with many unrelated cues. In this setup, the fundamental operations are the recognition of prior situations that will induce prototypical responses dependent upon the animat's goals. The constraints of the experimental setting are also rather important. The operation is intrinsically real-time, in the sense that only the present value of the continuous stream of information from the sensors is available and must be processed at each time step, while the full understanding of the situation is most often related to the input history. So we favor signal processing methodologies instead of statistics, where the behavior in time is always rather constrained. The animat may also need to learn how to recognize new situations when they become important for its function; therefore, it must have internal representational structures that expect outcomes and change behaviors. Brains process information for the sake of the animal's survival, while animats have external goals preprogrammed by humans, and as such they must possess an extra level of functionality that translates internal representations back and forth to the outside; i.e., it is necessary to define inputs and outputs (in the animal, the brain's only output is the motor system). Therefore, we have to emphasize this translation between distributed dynamical representations and outputs in the form of design principles for optimal readouts.

The framework we are pursuing is based on distributed dynamics instead of machine learning. Since Hopfield, the role of dynamics for associative memories is well understood


(dynamical systems with fixed-point attractors). In fact, the FM has also been utilized as a dynamical auto-associative memory (Tavares, 2001; Ozturk et al., 2004; Xu et al., 2004). The FM weights are trained with Hebbian learning while the two binary patterns to be stored are presented to the system successively. During testing, a noise-corrupted version of one of the two stored patterns is presented to the system. Then the energy of the oscillatory system response is computed and compared to a threshold to choose the correct stored pattern. We see two major shortcomings with the energy-based readout. First, utilizing just the energy is wasteful, since it ignores the information embedded in the dynamics of the signals. Second, it does not exploit the global information in the states, since it is a local method computing the energy of individual processing elements.

In this chapter, we demonstrate that the FM can be considered in the same framework as the ESN/LSM. In fact, the KI and KII networks of the FM are conceptually similar to the reservoir/liquid in an ESN/LSM with a proper selection of the parameters. The big difference is that the KII layer is highly structured, with a coupled set of nonlinear oscillators. We will derive conditions on the system parameters of the FM similar to the echo state property of ESNs. With the proposed framework, it becomes evident that the OB layer alone lacks a readout in order to be used as a universal computing machine for time series. Therefore, we propose to use an adaptive linear network to implement the readout from the OB layer. We present experimental results to show the power of this framework for the FM of the olfactory cortex. In particular, we show that it is not necessary to drive Freeman's model with 0s and 1s, as is presently done.

An Overview of the Freeman Model

The Freeman Model is a biologically plausible mathematical model of the olfactory cortex. The dynamical behavior of the FM mimics the physiological signals obtained from the cortex. It is a


hierarchical model with different levels (K0, KI, KII, and KIII), where simpler structures at the bottom of the hierarchy combine to form the more complex structures at the top of the hierarchy. The basic building block of the model is the K0 set, which is depicted in Figure 5-5 A. Every K0 unit can accept several spatial inputs that are weighted and summed, and then convolved with a linear time-invariant system defined by second-order dynamics whose transfer function, H(s), is given by

    H(s) = ab / ((s + a)(s + b))     (5-1)

where 1/a and 1/b are real time constants determined experimentally (Freeman, 1975). The output of the linear dynamics is then shaped by the nonlinear sigmoid function, which is experimentally determined to be

    Q(x) = Qm (1 - exp(-(e^x - 1)/Qm))   if x > x0
         = -1                            if x <= x0     (5-2)

where x0 is the value of x at which the upper expression reaches -1. The sign of the connection strength from a K0 set defines the type of the K0 set (excitatory for positive weights and inhibitory for negative weights). The second level in the hierarchy is the KI network. K0 sets with a common sign (either excitatory or inhibitory) are connected through forward lateral feedback to construct a KI network (Figure 5-5 B). No autofeedback is allowed in the network. The KI network is generally used as an input layer for the higher levels of the Freeman model. It preprocesses the input to transform it into a space which is compatible with the OB dynamics. In this study, we will use the KI network as an echo state network containing a representation of the input history.
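A minimal sketch of a single K0 set under these equations, using a plain Euler discretization for clarity (the experiments below use the impulse invariance transformation instead) and assuming Freeman's commonly cited rate constants a = 220/s, b = 720/s and Qm = 5:

    import numpy as np

    def Q(x, Qm=5.0):
        # Freeman's asymmetric sigmoid (Equation 5-2); the lower
        # branch saturates at -1.
        return np.maximum(Qm * (1.0 - np.exp(-(np.exp(x) - 1.0) / Qm)), -1.0)

    def k0_step(l, ldot, drive, a=220.0, b=720.0, dt=1e-4):
        # Euler step of l'' + (a+b) l' + ab l = ab * drive, the time-domain
        # form of H(s) = ab/((s+a)(s+b)); the K0 output is Q(l).
        lddot = a * b * (drive - l) - (a + b) * ldot
        return l + dt * ldot, ldot + dt * lddot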


Figure 5-5. The building blocks of the Freeman Model. A) The K0 set. B) The KI network. C) The KII set. D) The reduced KII set. E) The asymmetric nonlinear function.


The third level in the hierarchy, the KII set (Figure 5-5 C), is the most interesting and important building block of the olfactory system, since it is an oscillator controlled by the input. The response of the KII set to an impulse input is a damped oscillation, whereas with a sustained input, the output oscillation is maintained as long as the input remains. The architecture of the KII set is shown in Figure 5-5 C, where the circles denote K0 sets and the sign indicates the type of connection. Figure 5-5 D shows a reduced KII (RKII) set, where the two K0 units are denoted by M (for the mitral cell) and G (for the granular cell). In this model, the mitral cell takes the external input P(t), and the coupling strengths between M and G are controlled by the two weights Kmg > 0 (excitatory) and Kgm < 0 (inhibitory). If the coupling weights are selected properly, the RKII set is, similar to the KII set, an oscillator controlled by the input (Freeman, 1975).

A KII network can be formed by interconnecting a number of KII sets with excitatory connections between the excitatory cells (top circles in Figure 5-6) and inhibitory connections between inhibitory cells (bottom circles in Figure 5-6). An RKII network is formed similarly by connecting a number of RKII sets. We will use the RKII network in our simulations instead of KII networks, since various analytic tools are available for the RKII and it has been demonstrated that the RKII is functionally very similar to the KII (Freeman, 1975; Xu et al., 2004). This interconnected structure represents a key stage of learning and memory in the olfactory system. Input patterns arriving through the M cells are mapped into spatially distributed outputs. Excitatory and inhibitory interconnections enable cooperative and competitive behaviors, respectively, in this network. The RKII network functions as an associative memory (Tavares, 2001; Xu et al., 2004; Principe et al., 2001; Ozturk et al., 2004).


Figure 5-6. A full KII network.

The final Katchalsky level is the KIII network, which represents the olfactory cortex. In a KIII network, two or three KII sets and a KII network are tightly coupled through dispersive connections, mimicking the different lengths and thicknesses of axons. Since the intrinsic oscillating frequencies of the KII sets in the different layers are incommensurate among themselves, this network of coupled oscillators will present chaotic behavior. For a detailed description of the KIII network, we refer the reader to (Freeman, 1975; Xu et al., 2004).

Now we will formulate a standard form for the state equations (the outputs of the K0 sets) of the Freeman model, valid for all hierarchical sets of the FM. Consider a Freeman network with M input units, N K0 sets, and L output units. The state equations for this Freeman model can be expressed as

    x''(t) + (a + b) x'(t) + ab x(t) = ab (W Q(x(t)) + Win u(t))     (5-3)

Here, x(t) = (x1(t), x2(t), ..., xN(t))^T is the state vector, where each entry is the output of a K0 set at time t, and u(t) = (u1(t), u2(t), ..., uM(t))^T is the input vector. W is an NxN matrix defining the interconnection weight values between the K0 sets, and Win is an NxM matrix defining the interconnection weights between the inputs and the K0 sets. For example, consider the RKII set depicted in Figure 5-5 D. The governing equations for the RKII set are given by


    m''(t) + (a + b) m'(t) + ab m(t) = ab (Kgm Q(g(t)) + p(t))
    g''(t) + (a + b) g'(t) + ab g(t) = ab Kmg Q(m(t))     (5-4)

In an RKII set, the number of K0 sets is two and the number of inputs is one. Hence, in the standard form, M is one and N is two. With this information, Equation 5-4 can be restated in the standard form of Equation 5-3 with the state vector x(t) = [m(t) g(t)]^T, input vector u(t) = [p(t)], input weight vector Win = [1, 0]^T, and weight matrix

    W = [ 0    Kgm ]
        [ Kmg  0   ]

Dynamical Computation with the Freeman Model with a Readout

In order to be able to compute with a dynamical system, one has to be aware of how the information is represented in the system states. Once the form of representation is known, the desired information can be extracted from the system states with an appropriate readout mechanism. For instance, in DeLiang Wang's network (Wang, 1999), the phase difference between the coupled oscillators is the source of information. For the FM, it has been argued that the information is encoded as amplitude-modulated signals, and hence the energy of the K0 output was used as the relevant information from the FM in practical applications. For example, the RKII network has been proposed as an auto-associative memory that can store static binary patterns (Tavares, 2001; Xu et al., 2004; Ozturk et al., 2004). In (Xu et al., 2004), an RKII network of 64 reduced KII sets is constructed with fixed Kmg, Kgm, and Kii values (determined experimentally from biological data). The excitatory weights of the RKII network are trained with Oja's unsupervised learning rule (a stable version of the Hebbian learning rule) while the two binary patterns to be stored are presented to the system successively. During testing, a noise-corrupted version of one of the two stored patterns is presented to the system. Each RKII set creates limit cycle oscillations of low or high energy depending on the stored patterns and the input pattern applied. Then,


the energy (averaged over a time interval) of the mitral K0 set of each RKII set is computed and compared to a threshold to decide the binary system output. We see two major shortcomings with the energy-based readout for the FM. First, utilizing just the energy is wasteful, since it ignores the information embedded in the dynamics of the system state. Second, it does not exploit the global interdependencies among the RKII sets, since it is a local method computing the energy of individual processing elements. Hence, the energy-based readout cannot be optimal and cannot be used for continuous dynamic patterns.

It is clear that a more principled approach is necessary in order to use the FM to process signals that are not necessarily binary or static. By analogy with the LSM/ESN framework, we propose to use a readout which is a linear or nonlinear projection of all the RKII states. The role of the readout is to extract the specified desired information from the FM states.


Figure 5-7. The Freeman Model as an associative memory. A) Stored patterns. B) Recall of the corrupted patterns: the first, second, and third columns are the input to the KII network, the output before the threshold detector, and the output after the threshold detector, respectively.

We will also demonstrate that Freeman's K sets provide a complete system; that is to say, the readout can also be chosen from the Freeman hierarchy, namely a K0 set or a KI network, depending on the output layer dimension. When a linear readout is used, the adaptation of the overall system reduces to a linear regression problem, which can be solved either offline by the Wiener filter equation or online by a gradient-based learning rule. We will use the KI network and KII network architectures analogous to the liquid in an LSM or the reservoir in an ESN. The use of the KIII network as the dynamical reservoir is outside the scope of this study. In summary, with the


adoption of the LSM/ESN framework, the states of the FM are dynamical basis functionals (a representation space) created from nonlinear combinations of the input, and the readout simply finds the best projection of the desired response in this representation space.

We will provide two theorems for the echo state property of the Freeman RKII network. The first theorem states sufficient conditions on the parameter values of an RKII set resulting in echo states. In the second theorem, we give a sufficient condition for the nonexistence of echo states for all Freeman networks. Unfortunately, we do not have a general theoretical result for the sufficient conditions resulting in echo states for all Freeman networks in the standard form given by Equation 5-3.

Theorem 4.1: Consider a reduced KII set with the governing equations given by Equation 5-4. Let the parameter values Kmg and Kgm satisfy

    |Kmg Kgm| <= (a + b)^2 / (ab (Q'max)^2)

where Q'max denotes the maximum value of the derivative of the nonlinear function Q, and a and b are the filter coefficients. Then the network has echo states for all inputs u and for all states x.

The proof follows from the linear stability analysis of the RKII set, which has been extensively studied in (Xu and Principe, 2005).

Proof: Consider the governing equations for the RKII set given in Equation 5-4. An equilibrium point of the nonlinear differential Equation 5-4 satisfies

    m0 = Kgm Q(g0) + p
    g0 = Kmg Q(m0)     (5-5)

Using the nullclines defined by Equation 5-5, we can see that for a given set of parameters and the input, the system has a unique equilibrium point (Figure 5-8). Now consider the state-space representation given by Equation 5-4.


In state variables m1 = m, m2 = m', g1 = g, g2 = g', Equation 5-4 becomes

    m1' = m2
    m2' = -(a + b) m2 - ab m1 + ab (Kgm Q(g1) + p(t))
    g1' = g2
    g2' = -(a + b) g2 - ab g1 + ab Kmg Q(m1)     (5-6)

Figure 5-8. Nullcline graph of the reduced KII set: the curves m0 = Kgm Q(g0) + P(t) and g0 = Kmg Q(m0) intersect at the unique equilibrium.

Then the linearization of the system around the unique equilibrium point leads to a Jacobian matrix of the form

    J = [  0              1        0              0      ]
        [ -ab            -(a+b)    ab Kgm Q'(g0)  0      ]
        [  0              0        0              1      ]     (5-7)
        [  ab Kmg Q'(m0)  0       -ab            -(a+b)  ]

Checking the eigenvalues of Equation 5-7, we see that the equilibrium is stable if

    |Kmg Kgm| Q'(m0) Q'(g0) < (a + b)^2 / ab

and the equilibrium is unstable if


    |Kmg Kgm| Q'(m0) Q'(g0) > (a + b)^2 / ab

The system has a bifurcation point if

    |Kmg Kgm| Q'(m0) Q'(g0) = (a + b)^2 / ab

(Xu and Principe, 2005). Now assume that the parameters Kmg and Kgm satisfy

    |Kmg Kgm| <= (a + b)^2 / (ab (Q'max)^2)

Then the unique equilibrium point of the system will be stable. This means that, independent of the initial conditions, the system will approach the same trajectory for a given set of parameter values and input sequence. Moreover, the states will converge to zero when there is no input.

Theorem 4.2: Consider a Freeman network containing N K0 sets with the standard form given in Equation 5-3. Define the effective weight matrix

    W~ = [ 0           I          ]
         [ ab (W - I)  -(a+b) I   ]     (5-8)

where I denotes the NxN identity matrix and 0 denotes the NxN zero matrix. Let the effective weight matrix W~ satisfy real(λmax) > 0, where λmax is the eigenvalue of W~ with the largest real part. Then the network has an asymptotically unstable null state. This implies that it has no echo states for any input set containing 0.

Proof: Consider a Freeman network with governing equations in the standard form given by Equation 5-3. In state-space form, Equation 5-3 can be expressed as

    x1' = x2
    x2' = -(a + b) x2 - ab x1 + ab (W Q(x1) + Win u)     (5-9)

where x1 and x2 are both N-dimensional vectors. For zero input, Equation 5-9 has the trivial null solution. Now consider the Jacobian matrix of Equation 5-9 from linearization around the origin,


    W~ = [ 0           I          ]
         [ ab (W - I)  -(a+b) I   ]     (5-10)

If the matrix W~ has an eigenvalue with real part greater than zero, the linearized system is not asymptotically stable. This implies the existence of another solution of Equation 5-9 besides the null solution, so the network has no echo states for any input set containing 0.

This theorem, which is valid for all Freeman networks in standard form, provides a heuristic way to obtain a useful weight matrix W in practical applications. Whenever we create an interconnection weight matrix W, we will check the eigenvalues of the effective weight matrix W~. If the real parts of all eigenvalues of W~ are less than zero, we will assume the states are controlled by the input.
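A sketch of this heuristic check (the a and b values are assumptions):

    import numpy as np

    def null_state_unstable(W, a=220.0, b=720.0):
        # Build the effective weight matrix of Equation 5-8,
        # W~ = [[0, I], [ab(W - I), -(a+b) I]], and test whether any
        # eigenvalue has a positive real part.
        N = W.shape[0]
        I = np.eye(N)
        W_eff = np.block([[np.zeros((N, N)), I],
                          [a * b * (W - I), -(a + b) * I]])
        return np.linalg.eigvals(W_eff).real.max() > 0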


Experiments

This section presents a variety of experiments in order to demonstrate the proposed framework for Freeman's K sets. In all the experiments, we use the discrete-time Freeman model developed in (Tavares, 2001; Ozturk et al., 2003). The discretization exploits the fact that the K0 set is composed of a separate linear dynamical part and a static nonlinear part. The continuous-time linear filter of the K0 is discretized by the impulse invariance transformation, resulting in an equivalent discrete-time filter.

Function approximation

It has been demonstrated that the FM can be used as an auto-associative memory (Xu et al., 2004; Tavares, 2001). However, the stored patterns are binary (0s and 1s) and static. We would first like to show, with a function approximation experiment, that the FM can handle continuous-valued signals within the proposed framework.

Consider an RKII network of 50 RKII sets and a single input channel connected to all excitatory cells with unity weights. The RKII network is driven by the input signal sin(2πt/1000), and the target signal is 0.5 sin^7(2πn/1000). The weight values Kmg and Kgm of all RKII sets and all inhibitory weights Kii are set to 4, -1, and -0.002, respectively. These values are experimentally determined from biological data, and the same values are used in (Xu et al., 2004). The excitatory weight values are chosen from a Gaussian distribution with zero mean and unit variance.

Figure 5-9. Function approximation with the Freeman Model. A) States of the Freeman model. B) The overlay plot of the output and desired signals.


The network is initialized to a random initial condition, and the states of the RKII network (the K0 outputs) are calculated. At this point, it is interesting to observe the echo states depicted in Figure 5-9 A. This plot shows the signals observed at the outputs of some of the K0 sets. As seen from the figure, due to the nonhomogeneous nature of the system, there is a significant diversity in the signals at the nodes of the KII network, creating the rich dynamics required in the echo states. The next step is to construct the desired signal from the echo states using a memoryless readout network. For this example, we do not have direct connections from input to output. We will use a simple linear readout whose optimal weights can be calculated using Equation 2-4. The output of the optimal linear network is depicted in Figure 5-9 B along with the desired response. As seen from the figure, we get an almost perfect match between the desired response and the output of the Freeman model with the linear readout.

K0 set as the readout for function approximation

An alternative to the linear network as the FM readout is to use one of Freeman's K sets. For systems with only one output, we will use the K0 set as the readout. In this section, we derive the gradient-based update equations for the K0 readout weights using the MSE as the cost function. This allows us to use one of the components of the Freeman network itself, indicating that the Freeman K sets contain all the ingredients necessary for computation in the echo state framework. In the K0 set, the linear filter and the static nonlinear function are fixed, and the input weights are adaptable. Note that we do not have an analytic solution for the optimal weight vector, since the K0 comprises a linear dynamical part, which is an infinite impulse response filter, and also a static nonlinearity. Consider the K0 set depicted in Figure 5-5 A. Let us first fix the notation. The input to the K0, the input weight vector, the signal at the input of the


linear part, the signal at the output of the linear part, and the output of the K0 will be denoted by x(n), wout(n), z(n), l(n), and y(n), respectively. Then the governing equations are given by

    z(n) = wout(n)^T x(n)
    l(n) = c1 l(n-1) + c2 l(n-2) + c3 z(n)
    y(n) = Q(l(n))
    e(n) = d(n) - y(n)

The constants c1, c2, and c3 are the coefficients of the discrete-time filter obtained from the impulse invariance transformation of the original continuous-time filter. The details of the derivation can be found in (Ozturk et al., 2003). The instantaneous cost function is defined as e^2(n) = (d(n) - y(n))^2. Then the weights are updated according to

    wout(n+1) = wout(n) - η ∂e^2(n)/∂wout

In order to calculate the sensitivity of the cost with respect to the weights, we use backpropagation over the topology, i.e.,

    ∂e^2/∂wout = (∂e^2/∂y)(∂y/∂l)(∂l/∂z)(∂z/∂wout) = -2 e(n) Q'(l(n)) c3 x(n)

Note that, although the K0 includes a recurrent filter, the sensitivity of the cost with respect to the weights can be calculated instantaneously, because we do not need to update the filter coefficients; we only need the sensitivity of the filter output with respect to the filter input, which is just the constant c3. With this derivation, the K0 can be employed as the readout for single-output systems. For multi-output systems, we will use the KI network, where the connection weights between the K0 sets of the KI network are fixed and the weights between the echo states and the KI network are adapted as above. The extension of the gradient update rule to the multi-output case is trivial.
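A sketch of this online update, assuming Qm = 5 in Equation 5-2 (names illustrative):

    import numpy as np

    def k0_readout_step(w_out, x, l1, l2, d, eta, c1, c2, c3, Qm=5.0):
        # One online gradient step for the K0 readout.  x: echo-state
        # vector; l1, l2: previous filter outputs l(n-1), l(n-2);
        # d: desired sample; eta: step size.
        z = w_out @ x                      # weighted sum of the states
        l = c1 * l1 + c2 * l2 + c3 * z     # discrete-time K0 filter
        y = max(Qm * (1 - np.exp(-(np.exp(l) - 1) / Qm)), -1.0)
        e = d - y
        # Q'(l) on the active branch of the sigmoid, 0 where it saturates
        dQ = np.exp(l - (np.exp(l) - 1) / Qm) if y > -1.0 else 0.0
        w_out = w_out + eta * 2 * e * dQ * c3 * x   # descend on e^2(n)
        return w_out, l, y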


Figure 5-10. Overlay plot of the desired signal and the output signal of the Freeman Model with the K0 readout.

To demonstrate the idea of using the K0 set as the readout network, we repeat the same regression experiment with the same parameters as in the previous section, now using the K0 set as the readout instead of the linear network and adapting the K0 as described above. Figure 5-10 shows the overlay plot of the desired signal and the output of the K0 set. As seen from the plot, we get a good match between the desired signal and the output of the Freeman network. However, the results are not as good as with the simple linear readout, since we do not have an optimal solution for the K0 readout.

Autonomous sequence generation with output feedback

This experiment demonstrates the ability of the system to autonomously generate a sequence of numbers. We will use a reduced KII network without any input units but with feedback from the output of the readout. The desired signal to be autonomously generated by the


system is 0.5 sin^3(2πn/1000). An RKII network of 50 K0 sets without input units but with output feedback connections is constructed. The weight values Kmg and Kgm of all RKII sets and all inhibitory weights Kii are the same as in the previous experiment. The excitatory weight values are chosen from a Gaussian distribution with zero mean and unit variance. The output feedback connections from the linear readout to the KII network are chosen to be 0.1 and -0.1 with equal probability. During training, the states of the RKII sets are driven by the desired response via output feedback. The network is initialized to a random initial condition, and the states of the RKII network are calculated for 2000 samples of the desired response. The first 1000 samples, corresponding to the initial transient, are eliminated, and the optimal output weights are calculated using Equation 2-4.

Figure 5-11. Autonomous signal generation with the Freeman Model.


The MSE value obtained during training is 1.4350x10^-7. During testing, the system is first run for 1000 steps with the correct desired signal. After 1000 steps, the system is let run freely. Figure 5-11 shows the desired and output signals for the first 3000 steps of testing. As observed from the plot, there is a transient time interval for which the system tries to lock into the desired signal. But once it is locked, the system continues generating the signal autonomously. The MSE value for the last 1000 steps of testing is found to be 7.934x10^-7.

Binary parity check

This experiment investigates the use of the Freeman model for binary logic computation. In particular, a parity check bit calculation is carried out using a reduced KII set of 100 K0 sets. We attach one input unit and a linear readout with one output unit to the reduced KII network. The input is a sequence of binary values, 0s and 1s, and the desired signal is the parity bit corresponding to the last 3 binary input values. More precisely, the desired signal is the sum of the last 3 bit values modulo 2. In our experiment, we represent each bit by a sequence of 100 samples; in other words, a 0 is represented by a sequence of a hundred 0s, and similarly for a 1. We use a reduced KII network of 100 K0 sets. The weight values Kmg and Kgm of all RKII sets and all inhibitory weights, Kii, are set to 0.5, -0.4 and -0.002, respectively. Kee values are set to 0.1, 0.2 or 0 with probabilities 0.05, 0.05 and 0.90, respectively. We generate a sequence of 300 random binary numbers as the input and the corresponding parity bit sequence as the training data. This corresponds to a training sequence of 30000 data points since each bit comprises 100 samples. Again, the weights of the linear readout are calculated using the Wiener solution after discarding the first 1000 samples. In order to test the system, a random sequence of 300 binary numbers is generated and fed to the Freeman model. Figure 5-12 shows a segment of the desired and output signals during testing. As we observe from the plots, the system output converges to


the desired response for all input bits. If the linear readout output is compared to a threshold value of 0.5 to make binary decisions, the system converges to the true binary value for the whole testing data without an error.

Figure 5-12. Binary parity check with Freeman Model.

Multi-attractor learning

In this experiment, we train the readout to switch between multiple point attractors depending on the input (Jaeger, 2001). The network has 20 input units and 20 output units attached. The system has 20 point attractors that are controlled by the input signal. The input signals are in the form of rare pulses (with values 0 or 1; see Figure 5-13). At any given time, only one of the input channels can have a nonzero value (at most one pulse at a time). Each output unit is associated with an input unit; when a pulse arrives at an input unit, the corresponding output unit value becomes 0.5 whereas the other output units take the value -0.5. The values of the output units remain the same until a new pulse arrives at another input unit. At


the arrival of this new pulse at an input unit, the output unit corresponding to that input unit activates to 0.5 and the remaining units to -0.5. Figure 5-13 shows a sample input signal and the corresponding output signal for a two-input, two-output system.

Figure 5-13. Sample input-output signals for multi-attractor learning.

We use a KI network of 100 K0 sets with 20 input units. The 100 K0 sets are randomly connected with weight values of 0.2, 0.1 and 0 with probabilities of 0.025, 0.025 and 0.95, respectively. This corresponds to a sparse connectivity of 5%. The input units are connected to the K0 sets with connection strengths of 5 or -5 with equal probability. The output units are connected to the states via output feedback with weight values of 1, -1 and 0 with probabilities of 0.1, 0.1 and 0.8.

We generated a sequence of 8100 input samples and the corresponding output sequence for training. Starting from time step one, a pulse is assigned to input channels 1 to 20 in order, one every 400 steps. At time step 8001, a pulse exists in input channel 1. The pulse width is


chosen to be 100 samples. The output sequence is assigned to 0.5 or -0.5 according to the rule specified above. The system is run with the generated input data while the output is teacher-forced by the generated output signal. The states are calculated for 8100 time steps using the discretization. Then, the output weights are calculated using Equation 2-4 after discarding the first 1000 samples corresponding to the transient signal.

Figure 5-14 shows the output and desired response for 4 channels during training. The system is tested with an input-output sequence of 80000 samples. The system is initialized with the first 500 samples of the desired signal forced as the output. Figure 5-15 shows the output and desired response for four channels during testing. The Freeman model is able to model all the switches without any errors during testing.

Figure 5-14. Desired and output signals during training for multi-attractor learning with Freeman Model.


Figure 5-15. Desired and output signals during testing for multi-attractor learning with Freeman Model.
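The input pulses and step-like desired outputs used in this experiment are easy to generate; the sketch below reproduces the stated conventions (pulse width 100 samples, one pulse every 400 steps, channels visited in order). The function name and signature are illustrative.

```python
import numpy as np

def multi_attractor_data(n_ch=20, period=400, width=100, T=8100):
    """Pulse inputs and switching desired outputs for multi-attractor learning."""
    u = np.zeros((T, n_ch))
    d = np.full((T, n_ch), -0.5)
    for k, t0 in enumerate(range(0, T - width + 1, period)):
        ch = k % n_ch                 # pulses visit channels 1..20 in order
        u[t0:t0 + width, ch] = 1.0    # at most one pulse at a time
        d[t0:, :] = -0.5              # all outputs switch at the pulse onset...
        d[t0:, ch] = 0.5              # ...and hold until the next pulse arrives
    return u, d
```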


CHAPTER 6
APPLICATIONS

This chapter explores the use of ESNs for various real-world applications.

Complex Echo State Networks

Most signal processing tools have been developed to operate on signals with real values. However, there exist areas where signal processing in the complex domain is required. For example, in digital communication, the symbols sent through a communication channel are usually implemented as points in the complex domain. In phase shift keying (PSK), binary information is conveyed by changes in the phase of the reference signal (Proakis, 2001). In quadrature PSK, four different phase values are used to encode two bits of data. In quadrature amplitude modulation (QAM), both the amplitude and the phase of the reference signal are utilized to encode information more efficiently with an increased data transfer rate. The most commonly used QAM schemes are 32QAM, 64QAM, 128QAM and 256QAM, where 5 to 8 bits of data are transferred per symbol. Phase plays an important role in these systems and the symbols are represented by complex numbers. The straightforward method to deal with complex numbers is to process the real and imaginary parts of the signal separately. However, this is suboptimal because it neglects the interdependency between the real and imaginary parts of the signal (Kim and Adali, 2002). Alternatively, the existing tools of signal processing, which were originally designed to deal with real signals, have been effectively modified to operate in the complex domain.

In the neural networks literature, the popular backpropagation algorithm (BP) for the multilayer perceptron (MLP) was originally developed in the real domain, and in the past two decades it has been effectively used in many diverse areas (Widrow and Lehr, 1990; Haykin, 1998). Kim and Adali have rigorously developed the complex backpropagation (CBP) algorithm for the complex MLP (CMLP) (Kim and Adali, 2002). The CMLP uses fully complex activation functions instead of


split complex functions that treat the real and imaginary parts of the signal separately, and achieves consistently better performance. However, the difficulty with MLP training in the complex domain arises mostly from the selection of activation functions that can operate in the complex domain. The nonlinear functions have to be both analytic and bounded, since the backpropagation algorithm requires derivatives of the nonlinearities to calculate the weight updates. Moreover, MLPs are trained in an iterative manner, which is usually slow and computationally intensive. The model parameters, such as the input dimension and the number of hidden units, must be selected through exhaustive testing, as they play a crucial role in fast convergence and minimal training error.

In this section, we propose ESNs for complex signal processing. The use of ESNs in the complex domain is very convenient since system training is equivalent to simple linear regression, which is trivial in the complex domain. The derivatives of the nonlinear activation functions are never necessary since the recurrent part is fixed a priori. The nonlinear activation functions have to be modified similarly to (Kim and Adali, 2002) to ensure bounded output. However, being analytic is no longer required since the ESN does not require the computation of derivatives for training. We compare CESNs with the CMLP and a linear network trained with the complex least mean squares (CLMS) algorithm on a complex channel equalization experiment. We demonstrate that the CESN can achieve lower symbol error rates with simple and fast learning.

Channel Equalization

The performance of the CESN is tested on two benchmark nonlinear channel equalization problems. Figure 6-1 depicts the block diagram of an additive-noise nonlinear communication channel. The modulated signal corrupted by additive noise is passed through a nonlinear channel with memory. An equalizer is used to remove the effects of the channel and noise before the signal is demodulated. The equalizer ideally behaves like an inverse system of the nonlinear channel. The


noise is considered to be additive white Gaussian. The goal is to implement the equalizer using adaptive signal processing techniques. Here, both the signals and the channel are complex valued.

Figure 6-1. Block diagram of a communication system with nonlinear dispersive channel.

The first problem is from the satellite communication domain and is also used by Kim and Adali (2002). The nonlinear channel is modeled by a Volterra series expansion given by

x_k = Σ_{n1} H_{n1}^{(1)} a_{k-n1} + Σ_{n1,n2,n3} H_{n1 n2 n3}^{(3)} a_{k-n1} a_{k-n2} a*_{k-n3} + Σ_{n1,...,n5} H_{n1...n5}^{(5)} a_{k-n1} a_{k-n2} a_{k-n3} a*_{k-n4} a*_{k-n5} + ν_k

where ν_k is the down-link noise, Gaussian in nature, a_n is the nth information symbol, and H is the Volterra series coefficient that describes the effect of the channel nonlinearity. We used a channel described by the following Volterra series coefficients (Benedetto and Biglieri, 1982; Kim and Adali, 2002).

Linear coefficients: H_0^(1) = 1.22 + i0.646, H_1^(1) = 0.063 - i0.001, H_2^(1) = -0.024 - i0.014, H_3^(1) = 0.036 + i0.031
3rd order coefficients: H_002^(3) = 0.039 - i0.022, H_330^(3) = 0.018 - i0.018, H_001^(3) = 0.035 - i0.035, H_003^(3) = -0.04 - i0.009, H_110^(3) = -0.01 - i0.017
5th order coefficient: H_00011^(5) = 0.039 - i0.022

We use a rectangular 16QAM modulation scheme where the 16 symbols are equally spaced in the complex domain and the farthest point from the origin has a magnitude of 1. The


amplitude and phase of the complex noise are independently generated from Gaussian and uniform distributions, respectively.

Figure 6-2. Constellation diagram of 16QAM. A) Input symbols. B) Output symbols.
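For reference, the 16QAM source and the noise level implied by a given SNR can be generated as in the sketch below, which derives the grid spacing d from the unit-magnitude corner point and follows the energy definitions stated in the text (Es = 2.5d^2, Eb = Es/4, SNR = Eb/(2 sigma^2)). Names are illustrative.

```python
import numpy as np

def qam16_symbols(n):
    """Rectangular 16QAM; the farthest point from the origin has magnitude 1."""
    d = 2.0 / (3.0 * np.sqrt(2.0))    # |(3 + 3i) d/2| = 1 fixes the spacing d
    levels = np.array([-3.0, -1.0, 1.0, 3.0]) * d / 2.0
    sym = np.random.choice(levels, n) + 1j * np.random.choice(levels, n)
    return sym, d

def noise_var_for_snr(snr_db, d):
    """Noise variance from Es = 2.5 d^2, Eb = Es/4, SNR = Eb / (2 sigma^2)."""
    eb = 2.5 * d**2 / 4.0
    return eb / (2.0 * 10.0**(snr_db / 10.0))
```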


Figures 6-2 A and 6-2 B show the constellation diagrams of the input symbols (under 12dB signal-to-noise ratio [SNR]) and the output symbols of the nonlinear channel, respectively. The symbol energy of the 16QAM system is given by 2.5d^2, where d is the horizontal or vertical distance between two adjacent points. The bit energy is calculated by dividing the symbol energy by 4. The SNR is calculated by dividing the bit energy by twice the noise variance.

We compared the performance of three different networks: a linear filter, a CMLP and a CESN. All networks are trained with 20000 data points generated from the 16QAM configuration with equal probability. The optimal weights of the 10-tap linear filter are computed using Equation 2-4. We used a CMLP with 10 input units, 5 hidden units and a single output unit, with arcsinh as the activation function; the derivative of arcsinh(z) is given by (1+z^2)^(-1/2). The weights of the CMLP are trained using the complex backpropagation algorithm derived in (Kim and Adali, 2002). We used a 30-unit CESN. The input weights are selected from the values -0.1 and +0.1 with equal probability. The internal connection weight matrix is designed with the uniform pole distribution principle (Ozturk et al., 2007). The ESN designed with this principle is denoted by ASE-CESN, while the ESN initialized with random weights is denoted by CESN. The activation function of the internal units is the complex sigmoidal function tanh(x). For the CESN and CMLP, the network is trained 10 times for every SNR value and the best performance is recorded. All three networks are tested with a 10^6-point input sequence where each point is selected from the 16 symbols with equal probability. We trained the CESN with only 1000 data samples.

Figure 6-3 shows the variation of the symbol error rate (SER) for different SNR values for the four networks. As observed from the figure, the ASE-CESN shows the best performance compared to the other networks. The main reason is that the ESN avoids backpropagation through the complex nonlinear functions. The CLMS performs very similarly to the CMLP since the CLMS also uses the


optimal solution and the problem is only mildly nonlinear. The SER of the ASE-CESN drops to zero after 12dB SNR, while the SER for the CLMS and CMLP still decreases slowly. The CESN, which used many fewer training points than the CLMS and CMLP, can still perform as well due to the simple training.

Figure 6-3. SER vs SNR plot for four networks.

The second problem is another channel equalization problem from the digital communication domain and is also used in (Cha and Kassam, 1995). Here the nonlinear channel is described by the following equations:

x_n = (0.34 - i0.27) u_n + (0.87 + i0.43) u_{n-1} + (0.34 - i0.21) u_{n-2}
y_n = x_n + 0.1 x_n^2 + 0.05 x_n^3 + v_n

where v_n is independent, identically distributed white Gaussian noise. We use a rectangular QPSK modulation scheme where the 4 symbols are equally spaced in the complex domain and the farthest point from the origin has a magnitude of 1. Figures 6-4 A and 6-4 B show the constellation diagrams of the input symbols (under 12dB signal-to-noise ratio [SNR]) and the output symbols of the nonlinear channel, respectively. The symbol energy of the QPSK system is given by 0.5d^2, where d is the horizontal or vertical distance between two


adjacent points. The bit energy is calculated by dividing the symbol energy by 2. The SNR is calculated by dividing the bit energy by twice the noise variance.

Figure 6-4. Constellation diagram of rectangular QPSK. A) Input symbols. B) Output symbols.
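A sketch of this second channel, as reconstructed above, is given below. The circularly symmetric complex Gaussian noise term is an assumption; the text specifies only white Gaussian noise.

```python
import numpy as np

def cha_kassam_channel(u, sigma2):
    """Linear dispersive filter, static nonlinearity, then additive noise."""
    h = np.array([0.34 - 0.27j, 0.87 + 0.43j, 0.34 - 0.21j])
    x = np.convolve(u, h)[:len(u)]    # x_n from u_n, u_{n-1}, u_{n-2}
    v = np.sqrt(sigma2 / 2.0) * (np.random.randn(len(u))
                                 + 1j * np.random.randn(len(u)))
    return x + 0.1 * x**2 + 0.05 * x**3 + v    # y_n
```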


We compared the performance of three different networks: a linear filter, a CMLP and a CESN. All networks are trained with 10000 data points generated from the QPSK configuration with equal probability. The optimal weights of the 10-tap linear filter are computed using Equation 2-4. We used a CMLP with 3 input units, 10 hidden units and a single output unit. The same ESN models used in the previous example are utilized. Here, the CESN uses 500 training samples. Figure 6-5 shows the variation of the symbol error rate (SER) with SNR for the different networks. Similar to the previous experiment, the ASE-CESN shows the best performance. The CESN still performs very close to the CLMS and CMLP even though the CESN utilizes a small number of training points.

Figure 6-5. SER vs SNR plot for four networks.

Brain Machine Interfaces

This section of the dissertation is a result of a joint study with Aysegul Gunduz (Gunduz et al., 2007). We provided the expertise in ESNs while they provided the data and expertise in brain machine interfaces (BMI). The results of the study are accepted for publication (Gunduz et al., 2007).


In BMIs, the goal is to model the relationship between a kinematic variable, such as the hand position of a primate, and the electrophysiological data collected from its brain. BMIs provide new rehabilitation options to patients who have lost their motor functions due to damage to the nervous system. Therefore, much effort has been put into research on neuroprosthetics for motor control and BMIs.

In order to model the relationship between brain signals and hand position in BMIs, simultaneous recordings have been collected from the brain and the primate's hand position. The brain signals can be acquired from microarray recordings (action potentials and local field potentials), on the cortex via subdural grids (electrocorticogram [ECoG]), and on the scalp (electroencephalogram [EEG]). The levels of abstraction of these signals are referred to as microscopic, mesoscopic, and macroscopic, respectively.

In the BMI literature, most of the studies focus on modeling the hand position of a primate from the activities of individual neurons collected from various sites in the sensorimotor cortex (Chapin et al., 1999; Serruya et al., 2002). However, the main difficulty in the practical use of these systems is the crude invasiveness of microelectrodes, which prevents their use in human subjects. On the other hand, non-invasive scalp EEG contains a summation of dendritic and axonal currents of millions of neurons over the cortical impedance (Sanchez et al., 2007). The macroscopic EEG lacks the spatial resolution of microscopic recordings and has a very low signal-to-noise ratio. In this chapter, our focus will be on the mesoscopic-level recording called the electrocorticogram (ECoG). ECoG, when compared with EEG, has the great advantage of being more spatially specific (smaller electrodes) and much less contaminated by noise, and it can display better frequency resolution because of its closer proximity to the neural tissue (Sanchez et al., 2007).


Current BMIs based on ECoG recordings either require movement-related averages of trials for extraction of features (Mehring et al., 2004; Rickert et al., 2005) or imagery association of another task with the kinematic behavior (e.g., protruding the tongue) (Leuthardt et al., 2004). Movement-related potentials, though a valuable analysis tool, are not suitable for building real-time BMIs. In our experimental paradigms, we undertake the prediction of the kinematic behavior of the patient without employing any other task.

The parameters that can be extracted from ECoG to predict hand trajectories are still under investigation (Gunduz et al., 2007). Still, since amplitude modulation plays a key role in both neuronal activation and rate coding, this study seeks to explore preprocessing modalities that emphasize amplitude modulation in the ECoG above the level of noise and background fluctuations in order to derive the commands for complex motor control tasks (reaching and grasping) (Sanchez et al., 2006).

Mapping amplitude-modulated control features to motor behavior requires models of low order that are easy to train. Simple linear filters, such as Wiener filters, are very easy to train but have limited accuracy. Time delayed neural networks (TDNNs) require a very large number of parameters, especially when the system is a multiple-input multiple-output system, as in BMIs. Recurrent neural networks (RNNs), which have high accuracy in modeling, are very hard to train. In this chapter, we study and compare mappings from amplitude-modulated ECoG features to hand movements involving reaching and pointing tasks using linear filters, echo state networks and leaky echo state networks.

The data for this study was obtained from a patient undergoing extraoperative subdural grid electroencephalographic monitoring for the treatment of intractable complex partial epilepsy at Shands Hospital, University of Florida. The experimental paradigms were approved by the


University of Florida Institutional Review Boards. The patient, a right-handed 15-year-old female, underwent a presurgical work-up that included scalp EEG, formal neuropsychological testing, and MRI. The patient's IQ and motor functions were verified to be nonfocal by the absence of motor or sensory deficits on neurological examination (Sanchez et al., 2007). The patient was implanted with subdural grid electrodes according to established protocols (Lesser et al., 1990). The grids consisted of a 1.5mm thick silastic sheet embedded with platinum-iridium electrodes (4mm diameter with 2.3mm diameter exposed surface) spaced at 1cm center-to-center distances. The anatomical location of the grids was based upon the medical team's recommendation for epilepsy evaluation. The primary motor cortex was determined by evoked potentials and direct electrical stimulation of the subdural grids (Jasper and Penfield, 1954) and was found to be far from the seizure focus. The 32-channel electrode grid covered the premotor (PMA), primary motor (M1), and somatosensory (S1) cortices based on the patient's cytoarchitecture (Sanchez et al., 2007).

Experimental Setup

Extraction of control features from ECoG recordings for neuroprosthesis is facilitated by continuously time-synchronizing neuronal modulation with the well-defined behavioral paradigm (Sanchez et al., 2006). The behavioral tasks used in this neuroprosthetic design focus on arm reaching and pointing toward a visual input (Sanchez et al., 2007). The patient was cued to follow with her index finger a predefined cursor trajectory presented on an LCD screen with an active area of 20 x 30cm while neuronal modulations from the implanted ECoG electrodes were simultaneously being recorded (Sanchez et al., 2007). The trajectory consisted of a widely performed center-out task (Georgopoulos et al., 1982) and a target selection task (Desmurget et al., 1998). This behavior mimics a computer user's movement to select an icon on the screen. In


a single session, the patient was required to repeat the entire task six times, and this was repeated for each trial (Sanchez et al., 2006).

Spectrally preprocessed cortical recordings have been shown to correlate with a variety of visual, auditory, and motor tasks in frequency bands comprised of slow potentials (1-60Hz) (Pfurtscheller et al., 2003), the gamma band (60-100Hz) (Crone et al., 1998), the fast gamma band (100-300Hz) (Sinai et al., 2005) and ensemble depolarization (300Hz-6kHz) (Engel et al., 2005). Therefore, we define the band-specific amplitude-modulated control features u(t) as the integrated power of the ECoG voltage V(t) in non-overlapping bins of 100 msec, given by

u(t_n) = Σ_{t_i = t_n}^{t_n + 100ms} V^2(t_i)    (6-1)

where t_{n+1} = t_n + 100 ms.

Methods

The simplest method to map the neuronal activity to hand movement is the linear Wiener filter topology. There are 32 input and 2 output channels. For each input channel, a tap delay line with 25 taps is used. The filter order was optimized by scanning tap delays from 5 to 30. At each time sample, the filter computes the outputs using 2.5 seconds of the past inputs. The model was trained on 3 minutes of recordings and tested on a data segment of 1.5 minutes.

We use an ESN with 500 internal units. The 32x500 input weight matrix W_in is fully connected with values of +1 or -1 with equal probability. A spectral radius of 0.9 is chosen.

In the BMI problem, the neuronal activity that is input to the system is a fast-changing signal, whereas the hand position changes at a relatively slow pace. Therefore, we alternatively used a leaky ESN network with a built-in low-pass filter to further increase the memory depth of the system and slow down the input signal. The update equation of the leaky ESN is given by

x(n+1) = (1 - Ca) x(n) + C f(W_in u(n+1) + W x(n))
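In code, one step of this leaky update is a single line; the sketch below assumes the reconstructed form of the equation, with C and a the leakage parameters defined in the next paragraph.

```python
import numpy as np

def leaky_esn_step(x, u, W, W_in, C, a, f=np.tanh):
    """x(n+1) = (1 - C a) x(n) + C f(W_in u(n+1) + W x(n))."""
    return (1.0 - C * a) * x + C * f(W_in @ u + W @ x)
```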


Here C and a are the leakage parameters, the time constant and the decay factor, respectively. They define the level of low-pass filtering the system introduces. The number of processing elements for the leaky ESN is experimentally determined to be 200. The spectral radius was kept at 0.9, and the leakage parameters C and a were experimentally determined for each band. In both architectures, one second of data was discarded as transients.

Averaged correlation coefficients (CC) between the actual hand trajectory and the filter outputs, computed over non-overlapping windows of 5 seconds, are presented in Table 6-1. Figure 6-6 shows the variation of the CC over time. The performance of each model drops when the patient was switching between tasks. The performance along the y-direction is evidently better for all topologies and all frequency bands compared to the x-direction. Across spectral bands, the most accurate reconstruction was observed in the ensemble activity (300Hz-6kHz). Overall, the leaky ESN performed better than the basic ESN and the Wiener filter, increasing the CC performance of the Wiener filter by 15% in the y-direction in the 300Hz-6kHz band. The improvement in results with ESNs compared to the linear filter decreased as the input spectral activity slowed down. This may be an indication that the slower frequency bands contribute less to the modulation of hand kinematics.

Table 6-1. Averaged correlation coefficients between the actual hand trajectory and the model outputs computed over non-overlapping windows of 5 seconds




Figure 6-6. Comparison of three methods in brain machine interface modeling. Windowed correlation coefficients in all spectral bands.


Forecasting Water Inflow Using the Echo State Network

This section of the dissertation is a result of a joint study with Rodrigo Sacchi (Sacchi et al., 2007). We provided the expertise in ESNs while they provided the data and expertise in water inflow forecasting. The results of the study are accepted for publication (Sacchi et al., 2007).

We use the ESN for the prediction of hydropower plant reservoir water inflow. Being able to predict the water inflow is fundamental for planning the operation of the hydrothermal power system. In this study, we used a database of average monthly water inflows of the Furnas plant, one of the Brazilian hydropower plants. The goal is to predict the value of the water inflow for the next month using previous data. The performance of the ESN is compared with the SONARX network, the RBF network and the ANFIS model.

Water Inflow Problem

The greatest difficulties in forecasting water inflow arise from the nonstationary, non-Gaussian nature of the time series, caused by changing weather conditions (wet and dry seasons) in different periods of the year across years.

Forecasting models based on the Box & Jenkins methodology have been widely used in streamflow forecasting problems (Salas et al., 1982; McLeod, 1994). However, those parametric auto-regressive models assume a linear relationship (Box et al., 1994). This is a major limitation since the nature of the water inflow problem requires nonlinear modeling. Alternatively, ANNs have been proposed for time series analysis due to their ability to deal with nonlinear input-output relationships (Haykin, 1998). Multilayer perceptrons (MLPs) with the backpropagation algorithm, the most popular of them, have been applied to forecast streamflow with promising results (Atiya, 1999). Examples of successful applications of ANN architectures to water inflow prediction are (Coulibaly and Anctil, 1999; Valenca et al., 2005; Sacchi et al., 2004).


The main problem with ANNs is the number of parameters, which is usually large due to the difficulty of the problem. Moreover, a tapped delay line has to be used as a preprocessor to provide short-term memory for static architectures such as the MLP or RBF. The selection of the optimal embedding, which defines the most important previously measured samples to be used as input data for the predictor, is not trivial.

ESNs provide nonlinearity with training reduced to simple linear regression. Moreover, since the system has internal built-in memory resulting from the feedback connections, it is not necessary to embed the input signal before further processing. The training of the linear readout weights can be done analytically, and hence very fast. In this section, we use the ESN as a one-step-ahead predictor of the monthly average hydropower plant reservoir water inflow. The performance of the ESN is evaluated and compared to the Self-Organizing Nonlinear Auto-Regressive model with eXogenous input (SONARX) (Barreto and Araujo, 2001), its Radial Basis Function version (SONARX-RBF) (Sacchi et al., 2004) and the Adaptive Neuro-Fuzzy Inference System (ANFIS) model (Jang, 2003).

Results

In this case study, the data collected from the Furnas hydroelectric power plant (HPP) was used since it is without operative influence due to its relative position on the cascade. Monthly water inflow data from 1931 to 1994 is available. A five-year period from 1972 to 1976 is selected as the testing data. The remaining data is used to train the models.

Since this is a dynamical modeling problem, most of the models require embedding a short-term history of the time series by using a tapped delay line. The states are formed by delayed versions of the input signal, i.e., x1(t) = u(t), x2(t) = u(t-1), x3(t) = u(t-2), where t denotes the current month, u(t) is the value of the water inflow sample for month t, and x1(t), x2(t), x3(t) are the states which are the inputs of the predictor model. Alternatively, we can use two


short periods, one from the immediate past and the other from the previous year, i.e., x1(t) = u(t), x2(t) = u(t-1), x3(t) = u(t-2), x4(t) = u(t-10), x5(t) = u(t-11), x6(t) = u(t-12). This allows the use of not only the monthly information but also seasonal information from the previous year. The desired response at month t is defined by u(t+1), which is the value of the water inflow for the next month. The performances of the proposed methods are compared with respect to 4 different metrics: MSE, root MSE (RMSE), mean absolute deviation (MAD), and mean percentage error (MPE). MAD and MPE are defined as

MAD = (1/T) Σ_{t=1}^{T} |y(t) - d(t)|

MPE = (100/T) Σ_{t=1}^{T} |y(t) - d(t)| / d(t)

where T is the total number of samples and y(t) and d(t) are the model output and desired signals, respectively. Table 6-2 shows the performances of the models (ESN, SONARX, SONARX-RBF and ANFIS) with respect to the error criteria specified above.

Table 6-2. Performance comparison of the models for water inflow prediction

Model         Embedding                    MSE (x10^4)  RMSE     MAD     MPE (%)
ESN           built-in                     5.97         244.45   177.41  21.04
SONARX        t,t-1,t-2                    21.78        466.73   320.13  37.94
SONARX        t,t-1,t-2,t-3,t-4,t-5        12.20        349.30   253.18  28.78
SONARX        t,t-1,t-2,t-10,t-11,t-12     13.54        367.90   261.04  28.32
SONARX-RBF    t,t-1,t-2                    9.24         303.94   233.24  29.35
SONARX-RBF    t,t-1,t-2,t-3,t-4,t-5        6.35         252.02   187.53  22.41
SONARX-RBF    t,t-1,t-2,t-10,t-11,t-12     7.13         266.94   202.56  24.35
ANFIS         t,t-1,t-2                    9.45         307.47   235.67  29.89
ANFIS         t,t-1,t-2,t-3,t-4,t-5        7.07         265.97   198.52  24.27
ANFIS         t,t-1,t-2,t-10,t-11,t-12     6.72         259.25   205.99  25.02
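The MAD and MPE metrics above are straightforward to compute; a minimal sketch:

```python
import numpy as np

def mad(y, d):
    """Mean absolute deviation: (1/T) * sum over t of |y(t) - d(t)|."""
    return np.mean(np.abs(y - d))

def mpe(y, d):
    """Mean percentage error: (100/T) * sum over t of |y(t) - d(t)| / d(t)."""
    return 100.0 * np.mean(np.abs(y - d) / d)
```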


We can see that the ESN, SONARX-RBF and ANFIS performed significantly better than the SONARX model, which has an intrinsic error since it is based on the Vector-Quantized Temporal Associative Memory (VQTAM). The ESN performs slightly better than all the other models, even though it presents a significantly simpler and faster training algorithm. SONARX-RBF with 5 taps performs very close to the ESN, whereas with the other selections of taps the ESN performs significantly better. This shows the importance of the selection of the number of taps for performance.

The prediction results of the water inflow during training are shown in Figure 6-7. The blue line refers to the desired time series, and the red one refers to the forecasted water inflow values. Based on the figure, the predictor model was able to learn and capture most of the behavioral variability present in this time series. Figure 6-8 shows the water inflow forecast during the testing period of 1972-1976. The red line, representing the water inflow forecast, shows behavior very close to the blue one, the desired time series.

Figure 6-7. Training performance of the ESN for the water inflow prediction at the Furnas hydroelectric power plant.


Figure 6-8. Testing performance of the ESN for the water inflow prediction at the Furnas hydroelectric power plant.

Spike Detection with Echo State Networks

In this section, we propose to use the ESN-MACE for detecting neural action potentials in neural recordings. Detection of action potentials is a very challenging task due to the great variability in spike shapes (both in amplitude and width), as well as the intrinsic noise, which is nonstationary and most likely non-Gaussian, collected in neural recordings. The sorting of spikes from single neurons is classically defined as a two-step process. First, the spikes must be detected using either a threshold detector or power analysis (Obeid and Wolf, 2004). Once the spikes are detected, a variety of algorithms, such as template matching with a linear matched filter or Principal Components Analysis (PCA), can be used to isolate the waveforms of individual distinct neurons (Lewicki, 1998; Obeid and Wolf, 2004). However, the performance of this serial process is greatly dependent upon the accuracy of the first step. In threshold detection, the background noise, which can instantaneously reach a high value, is confused with spikes, leading to false alarms. In


addition, spikes with low power are hard to detect. The main difficulty with the matched filtering method is the absence of a well-defined template signal that can represent the spike class, due to the great variability in signal shapes. Furthermore, the noise statistics in neural recordings are non-Gaussian and nonstationary, impairing the optimality of the matched filter.

The proposed ESN-MACE aims at addressing the difficulties posed by the spike detection problem by utilizing the high-dimensional states of a nonlinear dynamical system to train an associative memory that can represent a class under noise and distortions with a single filter. The advantage of the LAM over the conventional linear readout is that it explicitly utilizes the covariance information of the input training set, unlike linear readouts (Ozturk and Principe, 2007). It also simplifies the training of the ESN readout since there is no need for a desired response for the LAM. The ESN-MACE is particularly powerful when the problem requires modeling a class of functions, since the ESN-MACE filter is able to build a single template containing covariance information about the full training set. Such a system will improve the ability to correctly identify neural action potentials because it is able to explicitly utilize the variability seen during the training phase.

Various experiments using in vivo neural recordings were used to compare the performance of the ESN-MACE with threshold and matched filter detectors. The performance metric is the receiver operating characteristic (ROC) curve, which is a standard measure for evaluating the performance of a detector by demonstrating the trade-off between the probability of detection and the probability of false alarms (Helstrom, 1995). The ROC curve for each filter is obtained by varying the threshold value and calculating the detection rate and false alarm rate for each value of the threshold. We demonstrate that the ESN-MACE is able to detect action potentials with lower false alarm rates compared to established methods of spike detection.


Problems in Spike Detection

A neural recording cannot be characterized as a deterministic signal, not only because of the intrinsic noise present in the recording instruments, but also because of the many sources of uncertainty that make a neuron fire in an in vivo experimental condition (Sornmo and Laguna, 2005). The extracellular spikes recorded from a microelectrode array have amplitudes ranging from a few microvolts up to a few hundred microvolts. The amplitudes of the spikes, as well as their varying shapes, make the process of distinguishing the spikes from instrumentation noise difficult. As such, accurate identification of spikes has been problematic (Lewicki, 1998; Sornmo and Laguna, 2005). However, the wave shape that is registered in a conventional extracellular neural recording is basically determined by the relative distance and position of the cell body with respect to the microelectrode. We briefly review several spike detection algorithms below.

Threshold detection

The most commonly used technique to detect the presence of spike activity is the use of a threshold. This system decides that a spike exists whenever the amplitude of the neural recording exceeds a predetermined threshold (Lewicki, 1998). The beauty of this method is that it provides a quick and inexpensive way to identify and discern neural activity. However, threshold detection is ideally suited only for spikes which are easily discernible from the background noise (Lewicki, 1998). Threshold detection is also prone to missing spikes (false negatives) that are present in a dataset but do not exceed the established threshold. While it is possible to lower the threshold in order to capture all spikes present, doing so increases the probability of falsely misidentifying noise as an action potential (false positives), thus reducing the overall accuracy of the system.


Matched filter

Another approach to spike detection is the use of filters, in particular the matched filter (MF) (Lewicki, 1998). Matched filters are widely used at the receiving end of digital communication systems to detect binary signals. The matched filter provides optimum filtering in terms of minimum bit error rate for an additive white Gaussian noise channel (Helstrom, 1995). The fundamental problem with the matched filter for spike detection is that it assumes there is a template that represents the spike class. Matched filters were originally developed for radar and communication applications, where the signals of interest were designed by humans and the difficulty was simply one of discovering them in high-noise backgrounds. In fact, in these applications the template is known a priori, so it is trivial to design the matched filter. Moreover, it is known that the MF is optimum only under very restrictive conditions: white noise backgrounds, uncorrelated with the signal, where the noise can be well characterized by Gaussian statistics (Proakis, 2001).

In neural spike detection, the use of a matched filter is very different, due to the fact that the spike amplitude and shape are variable, the noise is not stationary, and its statistics are most likely not Gaussian. But the determining shortcoming is really the variability in spike shapes. Figure 6-9 shows a very regular spike recording from a single neuron under high signal-to-noise ratio (SNR) conditions. Figure 6-10 shows a more natural recording of activity collected by a single electrode in the rat motor cortex where several neurons are being recorded. The waveforms are aligned by their maximum slope and show the normal variability of neural recordings.

The first step in the design of a matched filter for spike detection is the design of the spike template. Each dataset was initially spike sorted using a program designed by Cambridge


Electronic Design called Spike2 (version 5) in order to generate markers for the instants of time when the spikes are first detected, commonly known as spike times. Using these spike times, each spike was then extracted and sorted according to its variance from the original dataset. Sorting each spike by its variance allowed us to determine how much variability was present within all spikes, and thus to select a number of spikes that encompassed the variability present in the dataset. Numerous experiments with different datasets showed that we were able to accurately describe the variability of all spikes present by randomly selecting a minimum of 50 spikes. This is the training set for all the experiments presented in this section.

Once the necessary number of spikes has been extracted, an average of all spikes present is generated. In order to ensure uniformity within the template, all spikes were aligned according to their initialization time, as recorded in the dataset's spike times. By doing this, a template is generated that characterizes the spike waveform that is most predominant in the dataset. This approach is best when the detected neural spikes are easily discernible from the background noise and each spike varies insignificantly from the others. However, when the dataset possesses spikes with high variability, the average proves to be inaccurate, reducing the overall accuracy of the system.

Spike Detection with ESN-MACE

The high-dimensional internal states of the ESN architecture coupled with the MACE filter create an excellent medium for dynamical pattern recognition. It has been demonstrated in (Ozturk and Principe, 2007) that the ESN-MACE is particularly powerful when the problem requires modeling a class of functions, since the ESN-MACE filter is able to build a single template containing covariance information about the full training set. In this chapter, we propose to use the ESN-MACE for the spike detection problem, where the spikes can be considered as a family of signals due to the great variability in spike amplitude and width (see Figure 6-10).


An ESN with 28 internal units and a single input unit is used for the simulations in this work. The W matrix is designed according to the ASE metric presented in Chapter 3 and scaled so that its spectral radius is 0.9. The entries of W_in are set to -1 or 1 with equal probability.

Two distinct neural recordings were used in the experiments. The first one was acquired from a single neuron with high SNR. The second dataset was collected from a rat cortical neuron while the animal performed a lever-pressing task in order to achieve a water reward. Both datasets were sampled at 24414.1Hz and were pre-filtered between 300Hz. Both datasets were normalized and spike sorted in order to generate their respective spike-times files.

Fifty representative spike signals were selected as the training data for each one of the neural recordings. Figure 6-9 depicts the selected spike signals for the high-SNR neural recording. It is evident from the figure that the spike shapes have limited variability since they are all generated by a single neuron. On the other hand, the differences in amplitude and width of the depolarization/repolarization phases of the spike shapes in Figure 6-10 clearly show that the rat dataset has more than one type of neuron present.

For each dataset, the algorithm described in Chapter 4 is used to train the ESN-MACE filter. Fifty representative spike signals (each with length 21) are fed to the ESN and the echo states are calculated using Equation 1-1 with zero initial conditions. This creates an image of dimension 28x21, where 28 is the number of PEs and 21 is the signal length. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The MACE filter is synthesized in the frequency domain using Equation 4-1 and the corresponding image-plane filters are obtained by the inverse discrete Fourier transform. For a given test signal, the MACE filter output is calculated by correlating the echo states of dimension 28x21 with the filter coefficients at zero lag. The MACE


filter output is then compared to a threshold to decide the absence or presence of a spike. Different threshold values are used to obtain the ROC figure.

Figure 6-9. Spike shapes recorded with high SNR.

Figure 6-10. Spike shapes in the rat dataset. Numerous distinct spike waveforms are present in this dataset.

Figure 6-11 shows a section of the rat dataset where we used the ESN-MACE filter to identify several spikes. As the ESN-MACE filter is designed to provide a maximum response after a spike has been convolved with


the filter, we shifted the system response by the width of the filter in order to use the spike times as a means of identifying whether a spike was properly detected, as seen in Figure 6-11.

The results obtained with the ESN-MACE are compared to threshold and matched filter detectors. The matched filter is obtained from the same dataset of fifty spike signals according to the method explained in the previous section. In the first dataset, the ESN-MACE filter, the threshold detector and the matched filter correctly identify the 300 spikes present in 100,000 samples with no missed identifications, demonstrating that the ESN-MACE filter performs as well as both the matched filter and the threshold detector. This was an expected result since the class of spikes present in this dataset is easily discernible from the background noise and comes from a single class of spikes. In fact, this is a very simple dataset created by a single neuron with very low noise levels.

On the other hand, the results for the more interesting and more challenging rat dataset are not as perfect, since the variability in the spike signals is tremendous. Figure 6-12 depicts the ROC curves for the ESN-MACE, the threshold detector and the matched filter. As observed from the figure, the ESN-MACE gives the best results of the three methods presented. The ESN-MACE filter identifies 273 spikes within 100,000 data samples of the rat dataset, corresponding to approximately 4 seconds of recorded data. In order to correctly identify all 273 spikes, the ESN-MACE filter has a cost of 42 falsely identified spikes, whereas the matched filter and the threshold detector have costs of 262 and 210 falsely identified spikes, respectively, to achieve this level of identification. Obviously, the variability in the rat dataset makes detection much more challenging compared to the first dataset. The ESN-MACE filter handles the variability in the spike shapes better than the simple threshold detector and the matched filter detector.
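All ROC curves in this section come from the same threshold sweep over the detector output. A minimal sketch, where scores holds the detector outputs at candidate spike locations and labels marks which of them are true spikes (names are illustrative):

```python
import numpy as np

def roc_curve(scores, labels, n_thr=200):
    """Detection rate vs. false alarm rate over a sweep of threshold values."""
    thresholds = np.linspace(scores.min(), scores.max(), n_thr)
    pd = np.array([(scores[labels == 1] > t).mean() for t in thresholds])
    pfa = np.array([(scores[labels == 0] > t).mean() for t in thresholds])
    return pfa, pd
```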


Figure 6-11. Segment of a neural recording from the rat's brain.

Figure 6-12. Comparison of ROC curves for the rat dataset. Each grid division corresponds to a division of 50 spikes. The leftmost dotted curve corresponds to our ESN-MACE filter system, whereas the solid line corresponds to a matched filter system and the dashed curve corresponds to a threshold detector system.


Robust signal detection in a digital communication channel

The detection of a known waveform transmitted through a channel in a noisy environment is a critical problem in signal processing, with applications in areas such as digital communication, radar, pattern recognition and biomedical engineering (Helstrom, 1995; Proakis, 2001). Linear matched filtering is the popular solution to this problem since it is optimal in terms of maximizing the signal-to-noise ratio (SNR) at its output under a linear additive white Gaussian noise (AWGN) channel assumption (Proakis, 2001). However, matched filters lose their optimality if the noise is non-Gaussian or if the channel introduces nonlinear distortions on the transmitted signal. For instance, in the case of impulsive noise (with infinite variance) in a linear channel, the noise power at the output of the matched filter is still infinite, which creates erratic performance due to the finite sample size produced by the filter impulse response.

In mathematical terms, the problem of detecting a known deterministic signal s(n) corrupted by zero-mean additive white noise v(n) is equivalent to the binary hypothesis testing problem (Helstrom, 1995):

H0: u(n) = s(n) + v(n), n = 1, ..., T
H1: u(n) = v(n), n = 1, ..., T

The matched filter is defined by the impulse response s(T - n) and the filter output is obtained by convolving the received signal u(n) with the filter impulse response. The SNR is maximized if the filter output is sampled at lag T, since there is maximal correlation between s(n) and the filter impulse response at this lag. The SNR is defined as the ratio of the total signal energy to the noise variance (Helstrom, 1995):

SNR = Σ_{n=1}^{T} s^2(n) / var(v(n))
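The matched filter itself is just a time-reversed copy of the template; in code:

```python
import numpy as np

def matched_filter_output(u, s):
    """Convolve u(n) with h(n) = s(T - n); the SNR peak is expected at lag T."""
    return np.convolve(u, s[::-1])
```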


The output of the matched filter is sampled at the optimal time instant T and compared to a threshold to detect the presence or absence of the signal s(n).

The performance of the matched filter is compared with that of the ESN-MACE and the discriminative TDNN for a linear channel under Gaussian and impulsive noise distributions. The performance metric is the receiver operating characteristic (ROC) curve, which is a standard measure for evaluating the performance of a detector by demonstrating the trade-off between the probability of detection and the probability of false alarms (Helstrom, 1995; Proakis, 2001). The ROC curve for each filter is obtained by varying the threshold value and calculating the detection rate and false alarm rate for each value of the threshold.

The transmitted signal is chosen to be a Gaussian pulse s(n) = exp(-n^2), where n is varied between -5 and 5 with a time step of 0.25 units. The noise distribution is chosen to be either Gaussian or impulsive. Impulsive noise is generated using a Gaussian mixture model with probability density function

p(v(n)) = (1 - ε) N(0, σ1^2) + ε N(0, σ2^2)

In our simulations, we used ε = 0.15, σ2^2 = 50σ1^2, and σ1^2 = 0.07. The ASE-ESN with 30 internal units and a single input unit from Section 3.1 is used. The TDNN uses a window of 41 samples (equal to the signal duration), 5 hidden-layer PEs and a single output unit. In a communication system, one of two types of operation is possible. We consider these two cases separately.

a. Synchronous operation

In synchronous operation, the timing of the received signal and the optimal sampling time are known during testing. Therefore, the frames are already available for further processing.
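The transmitted pulse and the impulsive noise used in both operating modes can be generated as in this sketch, using the reconstructed parameter values ε = 0.15, σ1^2 = 0.07 and σ2^2 = 50σ1^2:

```python
import numpy as np

def gaussian_pulse():
    """s(n) = exp(-n^2) for n = -5, -4.75, ..., 5 (41 samples)."""
    n = np.arange(-5.0, 5.25, 0.25)
    return np.exp(-n**2)

def impulsive_noise(T, eps=0.15, var1=0.07):
    """Gaussian mixture noise: (1 - eps) N(0, var1) + eps N(0, 50 var1)."""
    var2 = 50.0 * var1
    std = np.where(np.random.rand(T) < eps, np.sqrt(var2), np.sqrt(var1))
    return std * np.random.randn(T)
```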


In order to calculate the MACE filter weights for synchronous operation, 100 realizations of the received signal (s(n) + v(n)) are generated and fed to the ESN one sample at a time. For each realization, the echo states are calculated using Equation 2-1, where the initial state is set to zero. The echo states create an image of dimension 30x41, where 30 is the number of PEs and 41 is the signal length. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The MACE filter is computed in the frequency domain using Equation 4-1 and the corresponding correlator weights are obtained by the inverse discrete Fourier transform.

During testing, since the optimal decision time is known, the received signal u(n) corresponding to a transmitted symbol can be processed in a batch. The received signal of length 41 is fed to the ESN with zero initial conditions (as in the training), the echo states are computed, and the decision is made once, at sample 41. Here, the echo states create an image of dimension 30x41. The output of the MACE filter is computed by 2-D correlation of the echo states with the filter coefficients.

The TDNN is used in discriminative mode, where the desired target signal is +1 when the received signal is (s(n) + v(n)) and 0 when the received signal is v(n). 500 realizations of the received signal (250 from each class) are generated to train the weights of the TDNN using the Levenberg-Marquardt algorithm with a step size of 0.01. A cross-validation set of 200 realizations is used to stop the training. Notice that we use more data and an extra cross-validation set for TDNN training compared to ESN-MACE training. During testing, a received signal of length 41 is fed to the TDNN and the output obtained from the TDNN is compared to different threshold values to plot the ROC curve. We used the neural networks toolbox of Matlab to simulate and train the TDNN.
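For reference, the frequency-domain MACE synthesis invoked here as Equation 4-1 follows the standard closed form from the correlation filter literature, h = D^-1 X (X^+ D^-1 X)^-1 c, where the columns of X are the 2-D DFTs of the training state images and D holds their average power spectrum. The sketch below is a hedged reconstruction of that computation, not a copy of our code.

```python
import numpy as np

def mace_filter(state_images, peaks):
    """Synthesize a MACE filter from echo-state images (sketch).

    state_images : list of N x L echo-state images, one per training realization
    peaks        : desired correlation peak amplitudes (all 1.0 in this section)
    """
    X = np.stack([np.fft.fft2(im).ravel() for im in state_images], axis=1)
    D = np.mean(np.abs(X)**2, axis=1)      # average power spectrum (diagonal of D)
    Xd = X / D[:, None]                    # D^-1 X
    A = X.conj().T @ Xd                    # X^+ D^-1 X
    h = Xd @ np.linalg.solve(A, np.asarray(peaks, dtype=complex))
    return np.fft.ifft2(h.reshape(state_images[0].shape))   # image-plane filter
```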


For each noise distribution and a 0.5 probability of signal transmission, 10,000 Monte Carlo simulations are performed to compute the ROC curves of the matched filter, the ESN-MACE and the TDNN. Figure 6-13 A shows the ROC curves under Gaussian and impulsive noise with the SNR value set to 10dB for synchronous operation. Under Gaussian noise, the matched filter is the optimal detection filter and hence results in the optimal ROC. However, the ROC curves of the ESN-MACE and the TDNN are still close to that of the matched filter. The performance of the matched filter degrades tremendously under impulsive noise, whereas the ESN-MACE performance is robust to impulsive noise. The TDNN performance under impulsive noise is better than the matched filter but worse than the ESN-MACE, even though the TDNN requires more data samples and a cross-validation dataset for training. If the training of the TDNN is done without a cross-validation set and stopping is done after 50 epochs, the generalization performance suffers tremendously for both noise distributions. We have also tested a TDNN trained in predictive mode, where the desired target signal is the next value of the input. The results are very poor, mostly due to the noise level involved in the experiment.

b. Asynchronous operation

In asynchronous operation, the timing of the received signal (and the optimal sampling time) is unknown during testing. Therefore, in asynchronous operation, the frames should be constructed for time n by extracting the echo states from time n-40 to n. Notice that the echo states from time n-40 to n will also be affected by the value of the state at time n-40 (the initial state). In synchronous operation, we did not have this problem since the timing of the received signal is known and the frames can be processed separately using a zero initial state.


Figure 6-13. Comparison of ROC curves for the ESN-MACE and the linear matched filter under Gaussian and impulsive noise conditions. A) ROC curves for synchronous operation. B) ROC curves for asynchronous operation.


To be able to mimic the testing conditions of asynchronous operation while training the MACE filter, the initial conditions of the ESN are not set to zero but determined by the history of the input. 100 different realizations of the received signal are generated and inserted into a noise signal of 10000 samples at random time instants. The resulting signal is fed to the ESN and the echo states are calculated. The echo states during the application of the received signal are extracted. The echo states create an image of dimension 30x41, where 30 is the number of PEs and 41 is the signal length. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The MACE filter is computed in the frequency domain using Equation 4-1 and the corresponding correlator weights are obtained by the inverse discrete Fourier transform.

During testing, the received signal is fed to the ESN and the echo states are computed. Then, at every time instant n, the echo states from time n-40 to time n are fed to the MACE filter and their 2-D correlation with the MACE is computed. The correlation result is compared to various threshold values to obtain the ROC curve.

We also trained a TDNN with the specifications from Section 3.1.a. We created a dataset of 500 realizations of the received signal (250 for each class). The generated data is used to train the weights of the TDNN using the Levenberg-Marquardt algorithm with a step size of 0.01. A cross-validation set of 200 realizations is used to stop the training. The TDNN is trained in the discriminative mode with the same target signal from Section 3.2.a. During testing, the received signal is fed to the TDNN. At every time instant n, the input signal from time n-40 to time n is used to produce the TDNN output, which is then compared to various threshold values to obtain the ROC curve.

Figure 6-13 B shows the ROC curves for the matched filter, the ESN-MACE and the TDNN under Gaussian and impulsive noise with the SNR value set to 10dB. Similar to the


Figure 6-13 B shows the ROC curves for the matched filter, the ESN-MACE, and the TDNN under Gaussian and impulsive noise with the SNR set to 10 dB. Similar to the synchronous case, the matched filter yields the superior ROC curve for Gaussian noise, whereas the ESN-MACE and the TDNN remain robust to impulsive noise. The ESN-MACE performs best under impulsive noise, as in synchronous operation. The performance of all filters degrades slightly in asynchronous operation compared to synchronous operation, since neither the timing of the received signal nor the initial condition is available in asynchronous operation.


CHAPTER 7
CONCLUSIONS

The great appeal of echo state networks (ESNs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structure without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed; only a linear output layer needs to be trained. In a way, the difficulty associated with RNN training is avoided while the well-known power of the recurrent topology is preserved. However, fixing the recurrent connections imposes important constraints on the overall architecture that have not yet been fully studied. In this dissertation, we addressed some of the issues with the state of the art by proposing a signal processing framework to understand the operation of ESNs, a system-theoretic approach to quantify ESN dynamics, and a metric to quantify the richness of the reservoir for computation. The proposed framework led to a systematic design procedure for the fixed reservoir weights of the ESN. We further investigated the use of ESNs for temporal pattern recognition and compared several approaches to the problem. A novel associative memory readout adopted from the image processing literature, the MACE filter, was proposed for ESNs in pattern recognition problems. ESNs with the linear readout were utilized to solve various real-world problems such as brain machine interface design, prediction of water inflow, and channel equalization. The ESN combined with the MACE readout led to a detector for action potentials in neural recordings that outperforms standard techniques such as the threshold detector and the matched filter. A nonlinear matched filter that can detect signals in a digital communication channel under different noise conditions was also proposed.

The ESN idea has implications for the understanding and design of other architectures. We investigated the use of readouts for the Freeman model of the olfactory cortex. With the proposed readouts, we showed that the Freeman model can process continuous-valued, time-varying signals.
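Since the recurrent weights stay fixed and only the readout is adapted, the whole training procedure reduces to one simulation pass followed by linear regression. The following Python sketch shows the skeleton; the reservoir size, spectral radius, ridge term, and the toy delayed-input target are illustrative choices rather than the settings of any experiment reported here.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 100, 2000                      # reservoir size, sequence length

    # Fixed reservoir: random weights rescaled to a chosen spectral radius.
    W = rng.standard_normal((N, N))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    w_in = rng.uniform(-1, 1, N)

    u = rng.uniform(-1, 1, T)             # input signal
    d = np.roll(u, 2)                     # toy target: the input delayed by 2

    # Run the reservoir once; W is never adapted, only the states are kept.
    x = np.zeros(N)
    X = np.empty((T, N))
    for n in range(T):
        x = np.tanh(W @ x + w_in * u[n])
        X[n] = x

    # The readout is plain (ridge-regularized) linear regression.
    lam = 1e-6
    w_out = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ d)
    y = X @ w_out                         # network output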


In addition, we investigated the effect of the echo state condition on computation. By allowing the spectral radius to be slightly greater than 1, we showed that a system without a global stability constraint can still perform useful computation. The detailed conclusions for each of the topics are given below.

Understanding Echo State Networks

In this dissertation, we first proposed a framework for ESNs, a signal processing interpretation of bases and projections in functional spaces, to describe and understand the ESN architecture. According to this interpretation, the echo states implement a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduced an information-theoretic criterion, the average state entropy (ASE), for better understanding and evaluating the capability of a given ESN to construct such a representation layer. The entropy of the distribution of the echo states at each time step is calculated and averaged over time to obtain the ASE measure. ASE quantifies the volume of the state manifold; this volume should be as large as possible to achieve the smallest correlation among the bases and to cope with arbitrary mappings.
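As a sketch of how ASE can be computed, the snippet below uses Renyi's quadratic entropy with a Gaussian Parzen window over the N PE outputs at each time step, one common estimator in information-theoretic learning; the kernel size and the random stand-in states are assumptions for illustration.

    import numpy as np

    def renyi_quadratic_entropy(samples, sigma=0.1):
        # Parzen estimate of Renyi's quadratic entropy of scalar samples:
        # H2 = -log( (1/M^2) sum_ij G_{sigma*sqrt(2)}(x_i - x_j) )
        diff = samples[:, None] - samples[None, :]
        s2 = 2.0 * sigma ** 2
        ip = np.mean(np.exp(-diff ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2))
        return -np.log(ip)

    def average_state_entropy(states, sigma=0.1):
        # states: (T, N); the entropy of the N PE outputs at each time step
        # is averaged over the T time steps.
        return np.mean([renyi_quadratic_entropy(s, sigma) for s in states])

    rng = np.random.default_rng(0)
    states = np.tanh(rng.standard_normal((500, 100)))  # stand-in echo states
    print(average_state_entropy(states))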


We also interpreted the ESN dynamics as coupled linear systems obtained from the linearization of the ESN nonlinear PEs in a small local neighborhood of the current state. This system-theoretic approach allowed the visualization of the movement of the system dynamics in response to the input signal. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of the input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states provides a more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Moreover, the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix W. The application of a nonzero constant bias moves the operating point to regions of the sigmoid function closer to saturation and always decreases the spectral radius due to the shape of the nonlinearity.

Designing Echo State Networks

We proposed a systematic design procedure for ESNs. When the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method, when applied to ESNs built from sigmoidal nonlinearities, is that it allows the fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing the internal system parameters. Moreover, the bias can easily be trained with either a line search method or a gradient-based method, since it is one dimensional. We illustrated experimentally that designing the ESN by maximizing the ASE, with the adaptation of the spectral radius by the bias, provided consistently better performance across tasks that require different memory depths. This two-parameter design methodology is therefore preferable to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.
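One concrete way to realize the uniform-pole design is sketched below: since tanh'(0) = 1, the poles of the system linearized around the origin are simply the eigenvalues of W, so placing conjugate pole pairs in 2 x 2 rotation blocks and applying a random orthogonal similarity transform yields a mixed reservoir with the prescribed pole distribution. The construction is illustrative, not the only possibility.

    import numpy as np

    def uniform_pole_reservoir(N, radius=0.9, seed=0):
        # Eigenvalues of each 2x2 block [[a, b], [-b, a]] are a +/- jb, so
        # sampling (a, b) uniformly over the half-disk of the given radius
        # yields conjugate pole pairs uniformly distributed over the disk.
        # N is assumed even.
        rng = np.random.default_rng(seed)
        W = np.zeros((N, N))
        for k in range(N // 2):
            rho = radius * np.sqrt(rng.uniform())  # uniform density over the disk
            th = rng.uniform(0.0, np.pi)
            a, b = rho * np.cos(th), rho * np.sin(th)
            W[2*k:2*k+2, 2*k:2*k+2] = [[a, b], [-b, a]]
        # A random orthogonal similarity transform mixes the states
        # without moving the poles.
        Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
        return Q @ W @ Q.T

    W = uniform_pole_reservoir(100, radius=0.95)

The adaptive bias then only has to rescale the effective radius of this fixed pole pattern, which is why a one-dimensional line search suffices.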


Computation at the Edge of Chaos

Experiments demonstrated that the ASE of an ESN with uniformly distributed linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation to the computational properties found in dynamical systems at the edge of chaos (Packard, 1988; Langton, 1990; Mitchell et al., 1993; Bertschinger and Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with critical parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for liquid state machines (LSMs) (Bertschinger and Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provided a system-theoretic view and explained the computational behavior through the diversity of dynamics achieved with linearizations that have poles close to the unit circle, exactly at the edge of chaos. According to our results, the spectral radius of the optimal ESN for function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius, either by the output or by internal biasing, may allow a system at the edge of chaos to solve various problems requiring different spectral radii.

Transiently Stable Computation

We introduced a new computational mode, transiently stable computation, that can be observed in nonlinear systems. Linear system theory shows clearly that stability is necessary to obtain useful responses from linear systems. However, this argument does not necessarily apply to nonlinear systems, as observed in transiently stable computation. A nonlinear system can be unstable and still have dynamics that are controlled by the input when it is applied, as the sketch below illustrates.
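A small illustrative experiment of this regime (network size, spectral radius, and signals below are arbitrary choices): with the spectral radius pushed above one, the autonomous network does not relax to the origin, yet applying an input drives the operating points toward saturation and the response stays bounded and input-controlled.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 100, 300
    W = rng.standard_normal((N, N))
    W *= 1.1 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius > 1
    w_in = rng.uniform(-1, 1, N)

    def state_norms(u):
        x = 0.01 * rng.standard_normal(N)            # small random initial state
        norms = []
        for n in range(len(u)):
            x = np.tanh(W @ x + w_in * u[n])
            norms.append(np.linalg.norm(x))
        return np.array(norms)

    free = state_norms(np.zeros(T))                  # autonomous: activity grows
                                                     # and persists (no echo state)
    driven = state_norms(np.sin(0.2 * np.arange(T))) # the input pushes the PEs
    # toward saturation, shrinking the instantaneous poles, so the response
    # stays bounded and is dictated by the input while it is applied.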


We showed that transiently stable systems can still compute input-output mappings. From a computational point of view, this regime is interesting because it toggles between internal dynamics and external dynamics. This makes transiently stable computation biologically more realistic than conventional echo states, since biological systems do not have fixed-point dynamics but rather a wide variety of collective behavior in the form of oscillations and even chaos.

Echo State Networks for Temporal Pattern Recognition

We argued that the high-dimensional internal states of the ESN/LSM architecture create an excellent medium for dynamical pattern recognition. We approached the problem not with a desired label, as is conventionally done, but by designing a readout for each class based on a specially trained architecture, the linear associative memory (LAM). Unlike computer memory, LAMs store information globally in the system weights and access memory by its content. From the family of LAMs, we employed the minimum average correlation energy (MACE) filter, which is widely used for the recognition of objects in 2-D images due to its much higher rejection of outliers compared with conventional LAM training. The MACE interprets the states of the ESN/LSM as a two-dimensional image, one dimension being time and the other space. An optimal template image for each class can be analytically computed using training data. During testing, the ESN states are correlated with each template image and the class label of the template with the highest correlation is assigned to the input pattern. A single MACE readout is able to represent a class of temporal signals because it explicitly incorporates the covariance of the training data for that class. If the conventional linear readout is utilized, it can only represent the mean of the posterior density of the class given the data, as is well known both in classification (Bishop, 1995) and in regression/prediction (Haykin, 2001).
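Operationally, the test stage reduces to template matching on the echo-state image. A minimal sketch follows; in practice the templates come from the MACE synthesis, while here random arrays stand in for them.

    import numpy as np

    def classify(state_image, templates):
        # Assign the label of the template whose 2-D (circular) correlation
        # with the echo-state image has the highest peak.
        def peak_corr(img, tpl):
            F = np.fft.fft2(img) * np.conj(np.fft.fft2(tpl))
            return np.max(np.real(np.fft.ifft2(F)))
        scores = [peak_corr(state_image, t) for t in templates]
        return int(np.argmax(scores))

    rng = np.random.default_rng(2)
    templates = [rng.standard_normal((30, 41)) for _ in range(2)]  # stand-ins
    label = classify(rng.standard_normal((30, 41)), templates)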


This was clearly demonstrated in the classification of triangular versus square waves of different amplitudes (i.e., a class of signals).

The ESN-MACE combination is a nonlinear template matcher that is poised to replace the linear matched filter in digital communication systems, with more robust performance under different noise distributions. The detection performance of the linear matched filter, which is optimal under Gaussian noise, degrades tremendously under impulsive noise, whereas the ESN-MACE filter provided consistent results under both Gaussian and impulsive noise. We applied the ESN-MACE filter to the classification of odors from an electronic nose and compared it to the MACE filter trained on the input space. The ESN increases the margin between the in-class and out-of-class outputs of the MACE filter due to the user-defined high dimensionality provided by the ESN reservoir.

We also provided a preliminary experiment that extends the proposed MACE filter readout to LSMs. The MACE readout does not require the conversion of the liquid state outputs (spike trains) into continuous-valued signals by filtering. Instead, the value of the liquid state at each time instant is interpreted by the MACE as 1 or 0 depending on the presence or absence of a spike. The liquid state is usually very sparse, leading to singular matrices in the MACE filter computation. Therefore, proper regularization of the singular matrices should be applied before computing the LSM/MACE filter weights in order to avoid numerical errors. Detailed studies have to be performed to better understand the advantages and disadvantages of a MACE readout for LSMs.

Channel Equalization with Complex Echo State Networks

We proposed the ESN for signal processing with complex numbers. ESNs are very suitable for complex-domain signal processing since training is linear regression, which is very simple in both the real and complex domains.
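Indeed, the complex-domain training collapses to a single regularized least-squares solve in which the only change from the real case is the Hermitian (conjugate) transpose. A minimal sketch, with illustrative matrix sizes, ridge term, and QPSK-like targets:

    import numpy as np

    def complex_readout(S, d, lam=1e-6):
        # S: complex echo states (T x N); d: complex desired symbols (T,).
        # The only change from the real case is the Hermitian transpose.
        A = S.conj().T @ S + lam * np.eye(S.shape[1])
        return np.linalg.solve(A, S.conj().T @ d)

    rng = np.random.default_rng(0)
    S = rng.standard_normal((1000, 50)) + 1j * rng.standard_normal((1000, 50))
    d = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], 1000)  # QPSK-like targets
    w = complex_readout(S, d)
    y = S @ w   # equalizer output, to be sliced to the nearest symbol

No derivative of the nonlinearity ever appears, which is precisely why the fixed reservoir sidesteps the complex-gradient problem discussed next.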


ESNs do not require backpropagation of weights through complex nonlinear functions, which is problematic in the complex domain since it requires the estimation of complex gradients. We compared the complex ESN with the complex least mean squares algorithm and the complex multilayer perceptron in a complex-domain channel equalization problem and achieved superior results in terms of bit error rate.

Freeman Model

We demonstrated that the FM can be considered in the same framework as the ESN/LSM. In fact, the KI and KII networks of the FM are conceptually similar to the reservoir/liquid of an ESN/LSM with a proper selection of the parameters. The big difference is that the KII layer is highly structured, with a coupled set of nonlinear oscillators. We derived conditions on the system parameters of the FM similar to the echo state property of ESNs. With the proposed framework, it became evident that the OB layer alone lacks a readout in order to be used as a universal computing machine for time series. Therefore, we proposed an adaptive linear network to implement the readout from the OB layer. We presented experimental results to show the power of this framework for the FM of the olfactory cortex. With the proposed framework, we showed that it is not necessary to drive Freeman's model with 0s and 1s, as is conventionally done.

Applications

We have also utilized ESNs for various real-world engineering problems.

Brain Machine Interfaces

We used ESNs to model brain machine interfaces. In BMIs, the goal is to predict a kinematic variable, such as hand position, from electrophysiological recordings acquired through microarrays from the brain. Mesoscopic-level ECoG signals and the simultaneous hand position of a human patient were used to build the models. We compared the performances of a Wiener filter, an ESN, and a leaky ESN in terms of modeling accuracy using the correlation coefficient metric. Both the ESN and the leaky ESN achieved superior performance compared to the Wiener solution, since the task requires nonlinear modeling. The leaky ESN achieved the best performance since the input signals change rapidly whereas the hand position changes at a much slower rate. In problems where the desired signal is at a slower rate than the input signal, the built-in low-pass filter of the leaky ESN is very useful to match the signal rates.
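A sketch of one common form of the leaky-integrator update follows; the exact placement of the leak in our leaky ESN may differ slightly, and the sizes and leak rate below are illustrative.

    import numpy as np

    def leaky_esn_states(u, W, w_in, a=0.1):
        # Leaky-integrator update:
        #   x(n+1) = (1 - a) x(n) + a tanh(W x(n) + w_in u(n))
        # A small leak rate a slows the state dynamics, low-pass filtering
        # the fast input toward the rate of a slowly varying target.
        N = W.shape[0]
        x = np.zeros(N)
        X = np.empty((len(u), N))
        for n in range(len(u)):
            x = (1.0 - a) * x + a * np.tanh(W @ x + w_in * u[n])
            X[n] = x
        return X

    rng = np.random.default_rng(0)
    W = rng.standard_normal((50, 50))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
    X = leaky_esn_states(rng.standard_normal(500), W, rng.uniform(-1, 1, 50), a=0.05)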


Prediction of Water Inflow

We utilized ESNs to forecast the water inflow of a hydropower plant reservoir. Water inflow information is fundamental to planning the operation of a hydrothermal power system. A database of average monthly water inflows at the Furnas plant, one of the Brazilian hydropower plants, was used as the source of training and test data. The results of the ESN forecasting were compared with other nonlinear models. We have shown that the ESN performed significantly better than the other nonlinear models, with the advantage of a significantly simpler and faster training algorithm.

Spike Detection

Spike detection is a challenging task due to the variability of neural recordings and the nonstationary, non-Gaussian noise embedded in the signal. Conventional techniques such as the threshold detector or the matched filter fail especially when the neural signal comprises recordings from multiple neurons spaced at different distances relative to the microelectrode. We proposed the ESN-MACE filter to tackle the variability in the spike detection problem. We showed on challenging neural data recorded from a rat's brain that the ESN-MACE combination results in lower false alarm rates for perfect detection. The better results of the ESN-MACE are attributed to the fact that the MACE filter can model a class of signals by incorporating the covariance information of the class, whereas the ESN provides a rich representation space as the input to the MACE.


Matched Filtering

The ESN-MACE combination is a nonlinear template matcher that is poised to replace the linear matched filter in digital communication systems, with more robust performance under different noise distributions. The detection performance of the linear matched filter, which is optimal under Gaussian noise, degrades tremendously under impulsive noise, whereas the ESN-MACE filter provided consistent results under both Gaussian and impulsive noise.

Possible Future Directions

Recurrent neural networks train all the weights in the network using algorithms such as backpropagation through time or real-time recurrent learning, whereas echo state networks train only the readout weights. A possibility in the recurrent network learning paradigm is to find an intermediate step where the internal connection weights can be cleverly initialized with a simple algorithm. This initialization algorithm would find a good choice of weights for the internal connection weight matrix using both the input and the desired response, and would optimally compute the output matrix. This intermediate network would address the big drawback in the state of the art of ESN design, which ignores the desired response in the design of the internal weight matrix.

A better understanding of the effect of dynamics on computational power has to be established. We have shown that a nonlinear system without a global stability constraint can still do useful computation in terms of function approximation. Similarly, the KII network of the Freeman model can be considered in the same framework. However, we have not yet observed a concrete advantage of this regime compared to the conventional fully stable regime. A thorough study that links stability with computational power could be very interesting from both an engineering and a biological systems perspective.

The use of multiple linear readouts, in the spirit of local linear modeling, could considerably improve the performance of ESNs.


In our opinion, the main difficulty associated with this approach would be dividing the input space, since the signals in question are not static but time series. A practical solution to this problem, either by embedding the input time series or, better yet, by using the instantaneous value of the echo states, has to be developed.


LIST OF REFERENCES

Amari, S. I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Atiya, A. F., El-Shoura, S. M., Shaheen, S. I., & El-Sherif, M. S. (1999). A comparison between neural-network forecasting techniques: Case study: River flow forecasting. IEEE Transactions on Neural Networks, 10, 402.
Barreto, G. A., & Araújo, A. F. R. (2001). A self-organizing NARX network and its application to prediction of chaotic time series. In Proc. of International Joint Conference on Neural Networks (pp. 2144-2149). Washington, D.C., USA.
Benedetto, S., & Biglieri, E. (1983). Nonlinear equalization of digital satellite channels. IEEE Journal on Selected Areas in Communications, 1, 57-62.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Box, G. E. P., Jenkins, G., & Reinsel, G. C. (1994). Time series analysis, forecasting, and control. California: Prentice-Hall.
Braitenberg, V. (1990). Reading the structures of brains. Network: Computation in Neural Systems, 1(1), 1-12.
Casasent, D., & Ravichandran, G. (1992). Advanced distortion-invariant minimum average correlation energy (MACE) filters. Applied Optics, 31(8), 1109-1116.
Cha, I., & Kassam, S. A. (1995). Channel equalization using adaptive complex radial basis function networks. IEEE Journal on Selected Areas in Communications, 13(1), 122-131.
Chapin, J. K., Moxon, K. A., Markowitz, R. S., & Nicolelis, M. A. L. (1999). Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience, 2(7), 664-670.
Chua, L., & Yang, L. (1988). Cellular neural networks: Theory. IEEE Transactions on Circuits and Systems, 35, 1257.
Coulibaly, P., & Anctil, F. (1999). Real-time short-term natural water inflows forecasting using recurrent neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 3802-3805). Washington, D.C., USA.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1.


Crone, N. E., Miglioretti, D. L., Gordon, B., & Lesser, R. P. (1998). Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band. Brain, 121(12), 2301.
de Vries, B. (1991). Temporal processing with neural networks: The development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307.
Desmurget, M., Pelisson, D., Rosetti, Y., & Prablanc, C. (1998). From eye to hand: Planning goal-directed movements. Neuroscience and Biobehavioral Reviews, 22, 761-788.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179.
Engel, A. K., Moll, C. K. E., Fried, I., & Ojemann, G. A. (2005). Invasive recordings from the human brain: Clinical insights and beyond. Nature Reviews Neuroscience, 6(1), 35-47.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. IEEE Signal Processing Letters, 10(8), 242.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035.
Fancourt, C., & Principe, J. (2000). On the use of neural networks in the generalized likelihood ratio test for detecting abrupt changes in signals. In Proc. of International Joint Conference on Neural Networks (pp. 243-249). Como, Italy.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multi-stream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29). Dordrecht, Netherlands: Kluwer.
Fisher, J. W. (1997). Nonlinear extensions to the minimum average correlation energy filter. Unpublished doctoral dissertation, University of Florida.
Freeman, W. (1975). Mass action in the nervous system. New York: Academic.
Freeman, W., & Jakubith, S. (1993). Bifurcation analysis of continuous time dynamics of oscillatory neural networks. In A. Aertsen & W. von Seelen (Eds.), Brain theory: Spatio-temporal aspects of brain function (pp. 183-208). Amsterdam, Netherlands: Elsevier.
Georgopoulos, A., Kalaska, J., Caminiti, R., & Massey, J. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2(1), 1527-1537.


Gori, M., & Scarselli, F. (1998). Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1121-1132.
Gunduz, A., Ozturk, M. C., Sanchez, J. C., & Principe, J. C. (2007). Echo state networks for motor control of human ECoG neuroprosthetics. In Proc. of IEEE EMBS Conference on Neural Engineering. Manuscript accepted for publication.
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). New Jersey: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). New Jersey: Prentice Hall.
Helstrom, C. W. (1995). Elements of signal detection and estimation. New Jersey: Prentice Hall.
Hinton, G. E., & Anderson, J. A. (Eds.). (1981). Parallel models of associative memory. New Jersey: Lawrence Erlbaum Associates.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735.
Iso, K., & Watanabe, T. (1991). Large vocabulary speech recognition using neural prediction model. In Proc. of International Conference on Acoustics, Speech and Signal Processing (pp. 57-60). Toronto, CA.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78.
Jang, J.-S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference systems. IEEE Transactions on Systems, Man and Cybernetics, 23(3), 665-685.
Jasper, H., & Penfield, W. (1954). Epilepsy and the functional anatomy of the human brain. Boston: Little, Brown and Co.


Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A(196), 453.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proc. of the Eighth Annual Conference of the Cognitive Science Society (pp. 531-546). Hillsdale, NJ.
Kailath, T. (1980). Linear systems. New Jersey: Prentice Hall.
Kautz, W. (1954). Transient synthesis in the time domain. IRE Transactions on Circuit Theory, 1(3), 29.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267.
Kim, T., & Adali, T. (2002). Fully complex multi-layer perceptron network for nonlinear signal processing. Journal of VLSI Signal Processing, 32(1), 29-43.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000.
Kumar, B. (1986). Minimum variance synthetic discriminant functions. Journal of the Optical Society of America A, 3(10), 1579-1584.
Kung, S. Y. (1993). Digital neural networks. New Jersey: Prentice Hall.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Lang, K. J., & Hinton, G. E. (1998). The development of time-delay neural network architectures for speech recognition (Tech. Rep. No. CMU-CS-88-152). Pittsburgh: Carnegie-Mellon University.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42(1), 12.
Lesser, R. P., Gordon, B., Fisher, R. S., Vining, E., & Uematsu, S. (1990). Cortical stimulation using subdural electrodes. Journal of Epilepsy, 3(1), 103-106.
Leuthardt, E. C., Schalk, G., Wolpaw, J. R., Ojemann, J. G., & Moran, D. W. (2004). A brain-computer interface using electrocorticographic signals in humans. Journal of Neural Engineering, 1(1), 63-71.
Lewicki, M. S. (1998). A review of methods for spike sorting: The detection and classification of neural action potentials. Network: Computation in Neural Systems, 9(4), 53-78.


Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531.
Mahalanobis, A., Vijaya Kumar, B. V. K., & Casasent, D. (1987). Minimum average correlation energy filters. Applied Optics, 26(17), 3633-3640.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proceedings of the National Academy of Sciences, 95(1), 5323-5328.
McLeod, A. I. (1995). Diagnostic checking of periodic autoregression with application. Journal of Time Series Analysis, 15(2), 221.
Mehring, C., Nawrot, M. P., Cardoso de Oliveira, S., Vaadia, E., Schulze-Bonhage, A., Aertsen, A., & Ball, T. (2004). Comparing information about arm movement direction in single channels of local and epicortical field potentials from monkey and human motor cortex. Journal of Neurophysiology, 98(1), 498-506.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7(1), 89.
Obeid, I., & Wolf, P. (2004). Evaluation of spike-detection algorithms for a brain-machine interface application. IEEE Transactions on Biomedical Engineering, 51(6), 905-911.
Oh, S.-H., & Lee, Y. (1994). Effect of nonlinear transformations on correlation between weighted sums in multilayer perceptrons. IEEE Transactions on Neural Networks, 5(3), 508.
Ozturk, M. C., & Principe, J. C. (2007). An associative memory readout for echo state networks. Neural Networks. Manuscript accepted for publication.
Ozturk, M. C., Principe, J. C., Davis, B., & Erdogmus, D. (2003). Simulation of the Freeman model of the olfactory cortex: A quantitative performance analysis for the DSP approach. In Proc. of International Joint Conference on Neural Networks (pp. 332-336). Portland.
Ozturk, M. C., Xu, D., & Principe, J. C. (2004). Modified Freeman model: A stability analysis and application to pattern recognition. In Proc. of International Joint Conference on Neural Networks (pp. 3207-3212). Budapest, Hungary.
Ozturk, M. C., Xu, D., & Principe, J. C. (2007). Analysis and design of echo state networks. Neural Computation, 19(1), 111-138.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293). Singapore: World Scientific.


Pfurtscheller, G., Graimann, B., Huggins, J. E., Levine, S. P., & Schuh, L. A. (2003). Spatiotemporal patterns of beta desynchronization and gamma synchronization in corticographic data during self-paced movement. Clinical Neurophysiology, 114, 1226-1236.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (pp. 6). Boca Raton: CRC Press.
Principe, J. C., de Vries, B., & de Oliveira, P. G. (1993). The gamma filter: A new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649.
Principe, J. C., Euliano, N. R., & Lefebvre, W. C. (1999). Neural and adaptive systems: Fundamentals through simulation. New York: John Wiley and Sons.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265). New Jersey: Wiley.
Proakis, J. G. (2001). Digital communications (4th ed.). New York: McGraw-Hill.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 233-236). Philadelphia.
Rényi, A. (1970). Probability theory. New York: Elsevier.
Rickert, J., Cardoso de Oliveira, S., Vaadia, E., Aertsen, A., Rotter, S., & Mehring, C. (2005). Encoding of movement direction in different frequency ranges of motor cortical local field potentials. Journal of Neuroscience, 25, 8815-8824.


Sacchi, R., Carneiro, A. A. F. M., & Araújo, A. F. R. (2004). An RBF network trained by the SONARX model and applied to obtain the operation policies of the hydropower systems. In 8th Brazilian Symposium on Artificial Neural Networks. São Luís, Brazil.
Sacchi, R., Ozturk, M. C., Principe, J. C., Carneiro, A. A. F. M., & da Silva, I. N. (2007). Water inflow forecasting using the echo state network: A Brazilian case study. In Proc. of IEEE International Joint Conference on Neural Networks. Manuscript accepted for publication.
Salas, J. D., Boes, D. C., & Smith, R. A. (1982). Estimation for ARMA models with seasonal parameters. Water Resources Research, 18(4), 1006.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Sanchez, J. C., Carney, P. R., & Principe, J. C. (2006). Analysis of amplitude modulated control features for ECoG neuroprosthetics. In Proceedings of the 28th IEEE EMBS Annual International Conference (pp. 5468-5471). New York.
Sanchez, J. C., Gunduz, A., Principe, J. C., & Carney, P. R. (2007). Extraction and localization of mesoscopic motor control signals for human ECoG neuroprosthetics. Journal of Neuroscience Methods. Manuscript submitted for publication.
Sandberg, I. W., & Xu, L. (1997). Uniform approximation of multidimensional myopic maps. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 44(6), 477.
Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1(1), 145-168.
Serruya, M. D., Hatsopoulos, N. G., Paninski, L., Fellows, M. R., & Donoghue, J. P. (2002). Instant neural control of a movement signal. Nature, 416, 141-142.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77.
Sinai, A., Bowers, C. W., Crainiceanu, C. M., Boatman, D., Gordon, B., Lesser, R. P., Lenz, F. A., & Crone, N. E. (2005). Electrocorticographic high gamma activity versus electrical cortical stimulation mapping of naming. Brain, 128, 1556-1570.


Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (pp. 133). California: Morgan Kaufmann.
Sörnmo, L., & Laguna, P. (2005). Bioelectrical signal processing in cardiac and neurological applications. London: Elsevier Academic Press.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L. S. Young (Eds.), Dynamical systems and turbulence (pp. 366). Berlin: Springer.
Tavares, V. (2001). Design and implementation of a biologically realistic olfactory cortex model. Unpublished doctoral dissertation, University of Florida.
Tebelskis, J. (1995). Speech recognition using neural networks. Unpublished doctoral dissertation, Carnegie Mellon University.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Valença, M., Ludermir, T., & Valença, A. (2005). River flow forecasting for reservoir management through neural networks. In Proc. of the Fifth International Conference on Hybrid Intelligent Systems (pp. 545-547). Rio de Janeiro, Brazil.
VanderLugt, A. B. (1964). Signal detection by complex matched spatial filtering. IEEE Transactions on Information Theory, 2, 139-145.
Wang, D. (1999). Relaxation oscillators and networks. In J. G. Webster (Ed.), Wiley encyclopedia of electrical and electronics engineering (pp. 396-405). New York: Wiley.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65). New York: Van Nostrand Reinhold.
Widrow, B., & Lehr, M. A. (1990). 30 years of adaptive neural networks: Perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415-1442.
Wilde, D. J. (1964). Optimum seeking methods. New Jersey: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270.
Xu, D., & Principe, J. C. (2004). Dynamical analysis of neural oscillators in an olfactory cortex model. IEEE Transactions on Neural Networks, 15, 1053.


Xu, D., Davis, B., Ozturk, M. C., Deng, L., Skowronski, M., Harris, J. G., Freeman, W., & Principe, J. C. (2004). Computational neural networks. In B. He (Ed.), Neural engineering (chapter 16). New York: Kluwer.
Zahalka, A., & Principe, J. (1993). Transient detection with neural networks: The search for the desired signal. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 688-695). California: Morgan Kaufmann.


BIOGRAPHICAL SKETCH

Mustafa Can Ozturk was born in Antalya, Turkey, on November 23, 1980. He received his B.S. in electrical and electronics engineering and his B.S. in mathematics, both from Middle East Technical University, Ankara, Turkey, in 2002. He has been pursuing his graduate studies in the electrical and computer engineering department at the University of Florida under the supervision of Dr. Jose Principe. He received his master's degree in May 2004 and his Ph.D. degree in May 2007. His research interests broadly include dynamical neural networks, adaptive systems, and signal processing.


Permanent Link: http://ufdc.ufl.edu/UFE0019604/00001

Material Information

Title: Dynamical Computation with Echo State Networks
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0019604:00001

Permanent Link: http://ufdc.ufl.edu/UFE0019604/00001

Material Information

Title: Dynamical Computation with Echo State Networks
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0019604:00001


This item has the following downloads:


Full Text





DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS


By

MUSTAFA CAN OZTURK


















A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007









































O 2007 Mustafa Can Ozturk


































To my parents, Necmi Ozturk and Saliha Ozturk, who made everything I have achieved possible,
for their enduring love, support and generosity.









ACKNOWLEDGEMENTS

First and foremost, I would like to thank my supervisory committee chair, Dr. Jose

Principe, for his inspiring ideas, constant support, and guidance. His enthusiasm for life and

imaginative way of understanding concepts inspired me to learn different aspects of science and

engineering. Without him, this dissertation would not have been possible.

Secondly, I would like to thank Dr. John Harris, for serving on my supervisory committee

and offering wise insights into academia and other aspects of life. I also wish to thank the other

members of my supervisory committee, Dr. Arunava Benerj ee and Dr. Thomas Demarse for their

insightful comments and suggestions.

During my studies, I gained much from the interactive environment at the Computational

NeuroEngineering Laboratory. All of my colleagues helped me by sharing their knowledge and

inspirational ideas. I especially thank Dongming Xu for his help and long hours of productive

discussions. Most of this dissertation was a result of collaboration with Aysegul Gunduz,

Nicholas Dedual, Rodrigo Sachhi, Sohan Seth, Johan Nyqvist, Kyu-Hwa Jeong, Ismail Uysal,

Harsha Sathyendra, and Mark Skowronski. I like to thank each one of them for their expertise

and insightful comments.

I would like to specifically thank my wonderful friends, Jianwu Xu, Rui Yan, and Anant

Hegde. Deep philosophical conversations, sleepless nights for proj ects are unforgettable. They

were certainly more than friends to me. I will always remember them with great respect.

I would like to thank my big family for their constant support and trust in me. Everything I

have achieved is possible only with their love, encouragement, and guidance.












TABLE OF CONTENTS



page

ACKNOWLEDGMENTS .............. ...............4.....


LIST OF TABLES ............ _...... ._ ...............8....


LIST OF FIGURES .............. ...............9.....


AB S TRAC T ............._. .......... ..............._ 12...


CHAPTER


1 INTRODUCTION ................. ...............15.......... ......


Literature Review of Dynamic Computation ................. ...............15........... ...
Short-term Memory Structure s ................ ............ ...............15. ....
Network Architectures for Dynamical Computation .............. ...............18....
Literature Review of Echo State Networks ................. ...............22...............
Thesis Goals............... ...............24.


2 ANALYSIS OF ECHO STATE NETWORKS ................. ...............30...............


Echo States as Bases and Proj sections .................. ...............30........... ..
ESN Dynamics as a Combination of Linear Systems .................. .......... ................ ...3 5
Average State Entropy as a Measure of the Richness of ESN Reservoir ............... .... ...........3 8

3 DESIGN OF THE DYNAMICAL RESERVOIR ................. ...............43...............


Design of the Recurrent Connections ................. ...............43........... ...
Design of the Adaptive Bias ................. ...............47........... ...
Experiments .............. ......... ..............4
Short-term Memory Structure s ................ ...............48................
Binary Parity Check .............. ...............51....
System Identification............... .............5

4 DESIGN OF THE READOUT .............. ...............54....


Design of the Readout for Function Approximation ......__ ......... __. ........._. .....54
Design of the Readout for Dynamical Pattern Recognition .............. ....__. .............54
Linear Associative Memories............... ...............56
The M ACE Filter............... ........ ..... .. .............5
The ESN/LSM MACE for Dynamical Pattern Recognition............._. .........._._....60
The Design of ESNs for Dynamical Pattern Recognition.........._.._.. ........_.._.........63
Experiments ............__... .....__ ......__ ... ... ..........6
Classification of temporal signals with unknown peak values .............. ..............65












Robust signal detection in a digital communication channel ................ ................69
Classification of odors with an electronic nose ........._..... ...._... ........_.......69

Spike train classification with LSM ...._.._.._ ........__. ...._.._ ...........7

5 IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS............._.._.. ........80


Transiently Stable Computation .............. ...............80....
Conventional Echo State Networks ........._.._......... ....._.._.._ ....._.._ .............
Transiently Stable Computation with Echo State Networks .............. ....................8
Understanding Transiently Stable Computation .............. ...............86....
Freeman's K Sets for Dynamical Computation............... ..............8
An Overview of Freeman Model ........._.._......... ..... .._.. .. .. ...............90.

Dynamical Computation with Freeman Model with a Readout..........._.._.. .........._.._...95
Experiments ........._...... ......_.._ ...............101.....
Function approximation .............. .... .... .............10
KO set as the readout for function approximation ............_ ..... ..._.............103
Autonomous sequence generation with output feedback .............. .....................0
Binary parity check .............. ...............107....
M ulti-attractor learning .............. ...............108....

6 APPLICATIONS ................. ...............112................


Complex Echo State Networks ................. ...............112____ .....
Channel Equalization. ............ ........... ...............113....
Brain M machine Interfaces ................. ...............119...... ......

Experimental Setup .............. ...............122....
M ethods .............. ... .... .....__ ...... .......... ......... ..........2
Forecasting Water Inflow Using the Echo State Network ................. ................. ......127
Water Inflow Problem ............_...... .__ ...............127...
R results .............. ...... ........ ......... .........12

Spike Detection with Echo State Networks ....__ ......_____ .......___ ..........13
Problems in Spike Detection .............. ...............133....
Threshold detection ............ ..... .._ ...............133...
M watched filter ................ ........._ ...............134....

Spike Detection with ESN-MACE ............... .. ...___ .......___ ........ 13
Robust signal detection in a digital communication channel .............. .....................4

7 CONCLUSIONS .............. ...............147....


Understanding Echo State Networks ............ .....___ ...............148..
Designing Echo State Networks ............_...... ._ ...............149..
Computation at the Edge of Chaos ............_...... .__ ...............150.
Transiently Stable Computation .............. ... ....__ ..... __ ............15
Echo State Networks for Temporal Pattern Recognition .............. ...............151....
Channel Equalization with Complex Echo State Networks .............. .....................5
Freeman M odel ............ ..... .._ ...............153...

Applications ............ ..... .._ ...............153...












Brain Machine Interfaces .............. ...............153....
Prediction of Water Inflow ................. ...............154...............

Spike Detection .............. ...............154....
M watched Filtering .............. ...............155....
Possible Future Directions ................. ...............155................


LIST OF REFERENCES ................. ...............157................


BIOGRAPHICAL SKETCH ................ ............. ............ 166...










LIST OF TABLES


Table page

6-1 Averaged correlation coefficients between the actual hand traj ectory and the model
outputs computed over non-overlapping windows of 5 seconds ................ .................1 24

6-2 Performance comparison of the models for water inflow prediction ................. .................129










LIST OF FIGURES


Figure page

1-1 Delay line memory of order N. ........... ..... .___ ...............16.

1-2 Dynamic network with a short-term memory structure followed by a static mapper ............. 18

1-3 Generic structure for universal myopic mapping theorem, bank of filters followed by a
static network ................. ...............21........ ......

1-4 Echo state network (ESN). ESN is composed of two parts: a fixed-weight (W) recurrent
network and a linear readout............... ...............23

2-1 Performance of ESNs for different realizations of W with the same weight distribution.
Results show that spectral radius of w is not the unique parameter that determines
the performance of an ESN ................. ...............34........... ...

2-2 Pole tracks of the linearized ESN when the input goes through a cycle.. ............ ..... ........._..37

2-3 Echo states and state entropy for different ESNs. ............. ...............41.....

2-4 The ASE values obtained from 50 realizations of ESNs with the same spectral radius .........42

3-1 Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly generated
ESNs with different sparseness............... ...............4

3-2 The k-delay STM capacity of each ESN for delays 1, .. ,40 computed using the test
si gnal .. ............... ...............5.. 0..............

3-3 Number of wrong decisions made by each ESN for m = 3, .. ,8 in the binary parity
check problem .. ............. ...............52.....

4-1 Interpretation of echo states as a 2-d image. ............. ...............62.....

4-2 Performance comparisons of ESNs for the classification of signals with unknown peak
values.. ............ ...............66.....

4-3 Time-series representing the response pattern of the 32 electronic nose sensors exposed
to rosemary............... ...............70

4-4 Comparison of the ESN-MACE fi1ter output and MACE fi1ter output trained on the
input space for the rosemary class.. ............ ...............74.....

4-5 Comparison of the correct classification rates of LSM-MACE and LSM with linear
readout trained in the spike train classification problem as the parameter h varies...........79

5-1. Demonstration of a typical response of ESN with a spectral radius of 0.9 and
application to function approximation.. ............. ...............82.....











5-2 Demonstration of a typical response of transiently stable ESN with a spectral radius of
1.1 and application to function approximation.. ............ ...............84.....

5-3 Movement of the poles for ESN with spectral radius 0.9. ................ ....___ ................86

5-4 Movement of the poles for ESN with spectral radius 1.1 ................ ............... 87...........

5-6 A full KII network. .............. ...............94....

5-7 Freeman Model as an associative memory. ...._.. ...._._._.. .......__... ..........9

5-8 Nullcline graph of the reduced KII set. ........._._._ ....__ ...............99.

5-9. Function approximation with Freeman Model. ............. ...............102....

5-10 Overlay plot of the desired signal and output signal of the Freeman Model with KO
readout. ........._.__...... ._ __ ...............105....

5-12 Binary parity check with Freeman Model. ............. ...............108....

5-13 Sample input-output signals for multi-attractor learning. ............. ....................10

5-14 Desired and output signals during training for multi-attractor learning with Freeman
M odel. ........._._ ....._... ...............110....

5-15 Desired and output signals during testing for multi-attractor learning with Freeman
M odel. ........._._ ....._... ...............111....

6-1 Block diagram of a communication system with nonlinear dispersive channel ........._._. ......114

6-2 Constellation diagram of 16 QAM. A) Input symbols. B) Output symbols. ................... ......115

6-3 SER vs SNR plot for four networks. ...........__..... ...............117

6-4 Constellation diagram rectangular QPSK. A) Input symbols. B) Output symbols. ..............118

6-5 SER vs SNR plot for four networks. ...........__..... ...............119

6-6 Comparison of three methods in brain machine interface modeling. Windowed
correlation coefficients in all spectral bands. .....__.....___ .......... .............2

6-7 Training performance of ESN for the water inflow prediction at Fumas Hydroelectric
power plant ................. ...............130................

6-8 Testing performance of the ESN for the water inflow prediction at Fumas Hydroelectric
power pl ant ................. ...............13. 1...............

6-9 Spike shapes recorded with high SNR. ............. ...............137....











6-10 Spike shapes in rat dataset. Numerous distinct spike waveforms are present in this
dataset. ............. ...............137....

6-11 Segment of neural recording from rat' s brain. ................ ....._.._......... .......3

6-12 Comparison of ROC curves for the rat dataset. ...._.._.._ .... .._._. ...._.._.. ......13

6-13 Comparison of ROC curves for ESN-MACE and the linear matched filter under
Gaussian and impulsive noise conditions.. ............ ...............144.....









Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS

By

Mustafa Can Ozturk

May 2007

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

Echo state networks (ESN) were recently proposed as a new recurrent neural network

(RNN) paradigm. ESN couples dynamics with computation in novel ways by conceptually

separating RNN into two parts: a recurrent topology of nonlinear PEs that constitutes a "reservoir

of rich dynamics" and an instantaneous linear readout. The interesting property of ESN is that

only the memoryless readout is trained, whereas the recurrent topology has Eixed connection

weights. This reduces the complexity of RNN training to simple linear regression while

preserving a recurrent topology, but obviously places important constraints in the overall

architecture that have not yet been fully studied.

Present design of Eixed parameters of ESN relies on the selection of the maximum

eigenvalue of the linearized system around zero (spectral radius). However, this procedure does

not quantify in a systematic manner the performance of the ESN in terms of approximation error.

In this study, we proposed a functional space approximation framework to better understand the

operation of ESNs, and proposed an information-theoretic metric (the average entropy of echo

states) to assess the "richness" of the ESN dynamics. We also provided an interpretation of the

ESN dynamics rooted in system theory as families of coupled linearized systems whose poles

move according to the input signal dynamics. With this interpretation, we put forward a design









methodology for functional approximation where ESNs are designed with uniform pole

distributions covering the frequency spectrum to abide to the "richness" metric, irrespective of

the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error,

configures the ESN spectral radius to the input-output j oint space. Function approximation

examples compare the proposed design methodology versus the conventional design.

On further investigating the use of ESNs for dynamical pattern recognition, we postulated

that ESNs are particularly well suited for dynamical pattern recognition and we proposed a linear

associative memory (LAM) as a novel readout for ESNs. From the class of LAMs, we adopted

the minimum average correlation energy (MACE) filter because of its high rej section

characteristics that allow its use as a detector in the automatic pattern recognition literature. In

the ESN application, the MACE interprets the states of the ESN as a two-dimensional image: one

dimension is time and the other dimension is the processing element index (space). An optimal

template image for each class, which associates ESN states with the class label, can be

analytically computed using training data. During testing, ESN states were correlated with each

template image, and the class label of the template with the highest correlation is assigned to the

input pattern. The ESN-MACE combination leads to a nonlinear template matcher with robust

noise performance, as needed in non-Gaussian, nonlinear digital communication channels. We

used a real-world data experiment for chemical sensing with an electronic nose to demonstrate

the power of this approach. The proposed readout can also be used with liquid state machines

eliminating the need to convert spike trains into continuous signals by binning or low-pass

filtering.

We applied ESN on interesting real-world problems such as brain machine interface

design, water inflow prediction, detection of action potentials in neural recordings, matched









filtering in digital communications, channel equalization of a nonlinear channel and compared its

performance to other standard techniques. We proposed ESNs for signal processing in the

complex domain. The use of ESNs in the complex domain is very convenient since system training

is equivalent to simple linear regression, which is trivial in the complex domain. The derivatives

of the nonlinear activation functions are never necessary since the recurrent part is fixed a priori.

We showed that the Freeman model of the olfactory cortex can be considered in the same

framework of ESN. This work provided two important contributions to Freeman networks. First,

it emphasized the need for optimal readout, and showed how to adaptively derive them. Second,

it showed that the Freeman model is able to process continuous signals with temporal structure.

We also investigated the dynamics of ESNs without the echo state condition. This

investigation led to a novel computational mode for nonlinear systems with sigmoidal

nonlinearity, which does not require global stability. In this mode, although the autonomous

system is unstable, the input signal forces the system dynamics to become "transiently stable".

Function approximation experiments showed that the transiently stable ESN without the echo

state condition is still capable of useful computation.









CHAPTER 1
INTRODUCTION

Literature Review of Dynamic Computation

Static networks are those whose response depends only on the present stimulus.

Multilayer perceptrons (MLPs), radial basis function (RBF) networks, and self-organizing map (SOM)

networks are examples of static neural network models that deal with static input patterns. These

models have been well studied in function spaces, for regression and classification, using

statistical principles (Haykin, 1998). On the other hand, time is an essential ingredient in many

real-world problems, from cognitive tasks (such as vision and speech processing) to engineering

problems (such as system identification, noise cancellation, and channel equalization). These

real-world problems require the extraction of information embedded in the temporal structure of

signals. To incorporate time into the operation of a network, memory must be built into it. To be

precise, memory here stands for short-term memory, the ability to remember the recent past

which provides the contextual information from the history of time series to unambiguously

identify the current stimulus.

Short-term Memory Structures

The general linear discrete-time memory structure of order N (Figure 1-1 A) is also known as

the generalized tapped delay line memory: each block has the transfer function G(z), or impulse

response g(n) (Haykin, 2001). This is a single-input multiple-output system and taps of the

system are filtered versions of the input signal.

Basically, the memory reconstructs the state of the system that created the time series in a

sufficiently large state space where time is implicit, creating a trajectory that has a one-to-one

correspondence (with a unique inverse) to the time series.









Figure 1-1. Delay line memory of order N. A) Generalized tapped delay line. B) Tapped delay
line.

The quality of a memory structure is measured by two quantities: depth (D) and resolution (R)

(Principe et al., 1993). Memory depth measures how long the memory can hold information
about the past of a time series. Memory resolution is the amount of detail in the information

memory holds. Formally, memory depth is defined as


n=0

where gN(n) is the overall impulse response of the memory obtained from N successive

convolutions of g(n). Resolution is defined as the number of taps in memory structure per unit

time. For a linear memory structure of fixed taps, the product of memory depth and resolution is









a constant equal to the number of taps N. This creates a tradeoff between the memory depth and

resolution and the choice of G(z) determines the values R and D in this tradeoff (Principe et al.,

1993).

The time delay embedding (Takens, 1981), which can be implemented by a simple delay

line, is the most commonly used form of short-term memory (Figure 1-1 B). The use of time

delays for memory is also biologically motivated since signal delays are ubiquitous in the brain

and essential in biological information processing (Braitenberg, 1990). The memory depth and

resolution of the tapped delay line are N and 1, respectively.

This restriction is relaxed in the gamma filter, which makes use of local feedback loops to

create short-term memory (Principe et al., 1993; de Vries, 1991). Each section of the gamma

memory has the transfer function

G(z) = \frac{\mu z^{-1}}{1 - (1-\mu) z^{-1}},

where μ is the adjustable feedback parameter. Note that the gamma memory reduces to the tapped

delay line for μ = 1. With the introduction of the feedback loop, stability becomes an issue.

However, the overall system stability is easily guaranteed when each local feedback is stable

(i.e., 0 < μ < 2). The memory depth for the gamma memory is N/μ and the resolution is μ. This

means that the memory depth can be improved by choosing the feedback parameter μ less than

unity. However, this also reduces the resolution. Hence, in the gamma memory, unlike the tapped

delay line, the memory depth can be improved without increasing the number of taps, at the cost

of resolution (Principe et al., 1993; de Vries, 1991).
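
The tradeoff can be checked numerically. The sketch below is a minimal Python illustration (the function name and parameter choices are ours, not part of the original design): it simulates the gamma memory recursion implied by G(z) and verifies that the measured depth of an order-5 memory with μ = 0.5 is close to N/μ = 10.

    import numpy as np

    def gamma_memory(u, num_taps, mu):
        # Order-N gamma memory: x_k(n) = (1-mu)*x_k(n-1) + mu*x_{k-1}(n-1),
        # with x_0(n) = u(n); mu = 1 recovers the ordinary tapped delay line.
        x = np.zeros((len(u), num_taps + 1))
        x[0, 0] = u[0]
        for n in range(1, len(u)):
            x[n, 0] = u[n]
            x[n, 1:] = (1 - mu) * x[n - 1, 1:] + mu * x[n - 1, :-1]
        return x[:, 1:]  # taps x_1 ... x_N

    # Impulse response of the last tap; its center of mass approximates the
    # memory depth D = sum_n n*g_N(n), which should be near N/mu.
    u = np.zeros(200)
    u[0] = 1.0
    g = gamma_memory(u, num_taps=5, mu=0.5)[:, -1]
    print(np.sum(np.arange(200) * g))  # approximately 5 / 0.5 = 10

Halving μ roughly doubles the measured depth without adding taps, at the cost of resolution, which matches the tradeoff described above.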

Although the tapped delay line and the gamma filter are the most common forms of short-term

memory structures, other structures such as the leaky integrator memory (context nodes or memory

neurons) and the Laguerre filter are also available in the literature. The time alignment filter, which is









obtained from the gamma memory by allowing nonhomogeneous μ values in different taps, extends

the structure in Figure 1-1 A by allowing tap dependent modulations of memory depth and

resolution (de Vries, 1991).

Network Architectures for Dynamical Computation

There are basically two ways of creating a dynamic network with short-term memory. The

first one is to stimulate a static network (e.g., linear mapper, MLP) via a memory structure at its

input such as a tapped delay line or gamma memory (Haykin, 1998). This type of dynamical

network is commonly used in the literature. However, there are several practical drawbacks to

this approach, which uses a spatial metaphor for time, especially when the dynamics of the time

series are rich and time varying. The alternative approach for creating memory in an implicit

manner is through the use of feedback. Feedback can be local at the single neuronal level or can

be global covering the whole network. In neural networks literature, networks with global

feedback loops are referred to as recurrent neural networks (RNNs). The use of feedback provides

very powerful systems with rich dynamical behavior, but it also brings in practical

problems such as stability and training complexity (Haykin, 1998).





Figure 1-2. Dynamic network with a short-term memory (tapped delay line, gamma memory,
etc.) structure followed by a static mapper (linear mapper, MLP, RBF etc.)









First we will consider the explicit representation of time where a static network (linear or

nonlinear) is transformed into a dynamic network with a preprocessor which is a short-term

memory structure. In these focused structures, the memory is restricted in the input layer of the

network (Sandberg and Xu, 1997; Haykin, 1998). There is a clear separation of responsibilities

in this approach of building a dynamical system, where memory represents time and static

network accounts for mapping (Figure 1-2). Basically, memory reconstructs the state of the

system that created the time series in a sufficiently large state space and static network maps the

reconstructed space to the desired signal space. Moreover, the well-developed techniques of static

network training such as least mean squares (LMS) or static back-propagation algorithm can be

applied to the dynamic network.

The simplest structure for temporal processing in the form of Figure 1-2 is the finite

impulse response filter (FIR) where the memory is a tapped delay line and the static network is a

linear mapper. It has been widely used in adaptive signal processing applications due to its

simplicity and the existence of effective learning algorithms (Haykin, 2001). The gamma filter is

obtained when the memory structure in the FIR filter is replaced by the gamma memory (Principe et al., 1993;

de Vries, 1991). Chains of first-order integrators are interesting because they effectively decrease

the number of delays necessary to create embeddings. The extra degree of freedom on the

memory depth, gained from local feedback, has been fruitfully utilized in applications like

system identification, echo cancellation, and speech classification (de Vries, 1991).

The first successful demonstration of a neural network for temporal processing was

NETtalk, devised by Sejnowski and Rosenberg in 1987 (Sejnowski and Rosenberg, 1987). They

used a system based on an MLP to convert English text to phonemes. A more popular

network is the time delay neural network (TDNN) that replaces the linear mapper in the FIR with









a more powerful nonlinear network, the MLP (Haykin, 1998). Lang and Hinton successfully

applied the TDNN to the recognition of four isolated words: "bee", "dee", "ee", and "vee" (Lang and

Hinton, 1988). The TDNN has become one of the most popular neural network architectures for

temporal processing and has been successfully applied to time series prediction, system

identification, control, and speech processing. The structure of TDNN can be generalized as

shown in Figure 1-3. The first block is a bank of linear filters operating in parallel on the input

stimulus whereas second block implements a static nonlinear network. This structure is a

universal dynamic mapper (Sandberg and Xu, 1997, page 477) according to the universal myopic

mapping theorem:

Any shift-invariant myopic dynamic map can be uniformly approximated arbitrarily well

by a structure consisting of two functional blocks: a bank of linear filters feeding a static

network.

TDNN is a special case of the structure described in this theorem and is also a universal

mapper. However, focused structures, which convert time series to spatial signals, have several

drawbacks (Elman, 1990). First of all, they require an interface to the external world that

buffers the input for further processing. Secondly, the selection of the buffer length is not

trivial, and once selected it imposes a firm limit on the memory depth. Moreover, buffering

requires all patterns to have the same length, which is especially problematic in areas like

speech processing, where speech signals of different lengths have to be processed (Elman, 1990).

Finally, two instances of the same time series that are very similar can be very dissimilar

(spatially distant) in the reconstructed space (Elman, 1990).












Figure 1-3. Generic structure for the universal myopic mapping theorem: a bank of filters
followed by a static network.

An alternative approach for creating memory is by means of recurrent connections within

the network. In this approach, the system itself is dynamic and time, integrated into the system, is

represented by the effect it has on processing. They are particularly powerful to deal with the

time varying patterns and are practical for the problems where the dynamics of the considered

process is complex. Various RNN architectures are available in the literature. One of the earlier

architectures proposed by Jordan uses memory units that are fed from the system output to

provide the contextual information (Jordan, 1986). Elman network is a modification of Jordan' s,

where the context units keep a copy of the hidden layer activations for the next update (Elman,

1990). A fully connected recurrent neural network contains several feedback loops between the

processing elements (PE) in the hidden layer (Figure 1-4). RNNs can exhibit rich dynamical

responses such as oscillations and even chaos. They have been widely used in many applications

such as system identification and control of dynamical systems (Delgado et al., 1995; Feldkamp

et al., 1998; Kechriotis et al., 1994; Principe, 2001). The computational power of fully connected

RNNs (Siegelman and Sontag, 1991) has been mathematically proven by Siegelman and Sontag:

All Turing machines may be simulated by fully connected recurrent networks built on neurons

with sigmoid activation functions. However, the main problem with RNNs is the difficulty of

adapting the system weights. Various algorithms, such as backpropagation through time (Werbos,










1990) and real time recurrent learning (Williams and Zipser, 1989) have been proposed to train

recurrent systems. However, these algorithms suffer from a variety of problems: computational

complexity resulting in slow training, complex performance surfaces, the possibility of

instability, and the decay of gradients through the topology and time (Haykin, 1998). The

problem of decaying gradients has been addressed with special PEs (Hochreiter and

Schmidhuber, 1997). Alternative second order training methods based on extended Kalman

filtering (Singhal and Wu, 1989; Puskorius and Feldkamp, 1996) and the multi-streaming

training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled

practical applications.

Literature Review of Echo State Networks

Recently, a new recurrent network paradigm has been proposed by Jaeger under the name

of echo state networks (ESN) (Jaeger, 2001; Jaeger, 2002; Jaeger and Haas, 2004). ESNs aim at

addressing the problems with RNN training by separating the RNN architecture into two parts: a

fixed recurrent topology called the dynamical reservoir and an adaptive memoryless readout network.

The input signal is fed to the dynamical reservoir which contains information about the history of

input and/or output patterns when properly dimensioned (Figure 1-4). The outputs of the internal

PEs (echo state) are fed to a memoryless but adaptive readout network (generally linear) that

reads the reservoir and produces the network output. The interesting property of ESN is that only

the memoryless readout is trained, whereas the recurrent topology (W) has fixed connection

weights. This reduces the complexity of RNN training to simple linear regression while

preserving a recurrent topology, but obviously places important constraints in the overall

architecture that have not yet been fully studied.










Figure 1-4. Diagram of an echo state network (ESN). ESN is composed of two parts: a fixed-
weight (W) recurrent network and a linear readout. The recurrent network is a
reservoir of highly interconnected dynamical components, states of which are called
echo states. The memoryless linear readout is trained to produce the output.

Figure 1-4 depicts an ESN with M input units, N internal PEs, and L output units. The value

of the input units at time n is u(n) = [u_1(n), u_2(n), ..., u_M(n)]^T, that of the internal units is

x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T, and that of the output units is y(n) = [y_1(n), y_2(n), ..., y_L(n)]^T.

The connection weights are given in an N x M weight matrix W^{in} = (w_{ij}^{in}) for connections

between the input and the internal states, in an N x N matrix W = (w_{ij}) for connections between

the internal PEs, in an L x N matrix W^{out} = (w_{ij}^{out}) for connections from the internal units to

the output units, and in an N x L matrix W^{back} = (w_{ij}^{back}) for the connections that project back

from the output to the internal units (Jaeger, 2001). The activation of the internal PEs (echo

states) is updated according to

x(n+1) = f(W^{in} u(n+1) + W x(n) + W^{back} y(n)),    (1-1)

where f = (f_1, f_2, ..., f_N) are the internal units' activation functions. Generally, all f_i's are

hyperbolic tangent functions, f_i(x) = (e^x - e^{-x}) / (e^x + e^{-x}).









The readout is mostly a simple linear regressor network whose output is computed according to

y(n+1) = W^{out} x(n+1).    (1-2)

When direct connections from the input units to the output units are used, the state vector is

concatenated with the input vector to calculate the output.

A basic necessary property of the ESN reservoir is the echo state property, which states that

for the ESN learning principle to work, the reservoir must asymptotically forget the input history

(input forgetting property). It has been shown in (Jaeger, 2001) that the input forgetting property is

equivalent to state forgetting; that is, the reservoir must forget its initial state after a sufficiently

long time. The echo state condition can be linked to the spectral radius (the largest among the

absolute values of the eigenvalues of a matrix, denoted by ||.||) of the reservoir's weight matrix,

||W|| < 1. In effect, this condition states that the dynamics of the ESN are uniquely controlled by

the input and the effect of the initial states vanishes.

The design of fixed connections of ESN is based on the selection of the spectral radius.

The reservoir is randomly and sparsely connected with a spectral radius suitable for the given

problem. Sparse connectivity allows less coupling between the PEs, encouraging the development

of individual dynamics and increasing the diversity of the overall reservoir dynamics.
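
To make the update rule concrete, the following minimal Python sketch computes echo states according to Equation 1-1 for a sparse random reservoir; the sizes, sparsity, and scaling below are illustrative assumptions rather than prescriptions from the text.

    import numpy as np

    rng = np.random.default_rng(0)
    M, N, L = 1, 100, 1  # input, reservoir, and output dimensions (assumed)

    # Sparse random reservoir, rescaled so the spectral radius is below 1,
    # which is the usual way of enforcing the echo state condition.
    W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.95, 0.025, 0.025])
    W *= 0.9 / max(abs(np.linalg.eigvals(W)))
    W_in = rng.choice([-1.0, 1.0], size=(N, M))
    W_back = np.zeros((N, L))  # no output feedback in this sketch

    def esn_states(u, y=None):
        # Equation 1-1: x(n+1) = tanh(W_in u(n+1) + W x(n) + W_back y(n));
        # u is a T x M input sequence, y an optional T x L output sequence.
        T = len(u)
        x = np.zeros((T, N))
        for n in range(1, T):
            fb = W_back @ (y[n - 1] if y is not None else np.zeros(L))
            x[n] = np.tanh(W_in @ u[n] + W @ x[n - 1] + fb)
        return x

Only the readout of Equation 1-2 is subsequently trained, typically by linear regression on the computed states, which is what makes ESN training so cheap.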

Thesis Goals

ESNs, although recently proposed, have proven very successful in applications such as

system identification, chaotic time series prediction, and channel equalization. However, the

state of the art is still immature. We see a few major shortcomings with the current ESN

approach. First, the characterization of the reservoir properties is poorly understood. It is a

mystery how a network with mostly random connections can be successful in solving difficult

problems. A better understanding of the operation of the dynamical reservoir is vital for improving









ESN theory. Second, imposing a constraint only on the spectral radius for ESN design is a weak

condition to properly set the parameters of the reservoir, as experiments show (different

randomizations with the same spectral radius perform differently for the same problem). Third,

there is a fundamental problem with the way ESNs operate. The impact of fixed reservoir

parameters for function approximation means that the information about the desired response is

only conveyed to the output projection. This is not optimal, and strategies to select different

reservoirs for different applications have not been devised.

We aim to address these problems by proposing a framework, a metric and a design

principle for ESNs. We first deal with the analysis of ESNs and explain the framework and the

metric proposed. The framework is a signal processing interpretation of bases and projections in

functional spaces to describe and understand the ESN architecture. According to this

interpretation, the reservoir states implement a set of basis functionals (representation space)

constructed dynamically by the input, while the readout simply projects the desired response

onto this representation space. The metric to describe the richness of the ESN dynamics is an

information theoretic quantity, the average state entropy (ASE). Entropy measures the amount of

information contained in a given random variable (Shannon, 1948). Here, the random variable is

the instantaneous echo state from which the entropy for the overall state (vector) is estimated.

Due to the time dependency of the states, the state entropy averaged over time (ASE) will be

used as the quantifier to measure the richness of reservoir dynamics. We also interpret the ESN

dynamics as a combination of time varying linear systems obtained from the linearization of the

ESN nonlinear PE in a small local neighborhood of the current state.

We then propose a design methodology for the ESN reservoir in the light of the analysis

tools developed for ESNs. According to this design principle, one should consider independently









the correlation among the bases and the spectral radius. In the absence of any information about

the desired response, the ESN reservoir states should be designed with the highest ASE,

independent of the spectral radius. The poles of the linearized ESN reservoir should have

a uniform distribution to generate echo states with the most diverse pole locations (which

correspond to the uniformity of time constants). Effectively this will create the least correlated

bases for a given spectral radius, which corresponds to the largest volume spanned by the basis

set. When the designer has no other information about the desired response to set the basis, this

principle distributes the system's degrees of freedom uniformly in space. It approximates for

ESNs the well-known property of orthogonal bases. The unresolved issue that ASE does not

quantify is how to set the spectral radius, which depends again upon the desired mapping. A

simple adaptive bias is added at the ESN input to control the spectral radius, integrating the

information from the input-output joint space in the ESN bases. For sigmoidal PEs, the bias

adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume

of the state manifold as required to approximate the desired response with a small error. We

show that ESNs designed with this strategy obtain systematically better results in a set of

experiments when compared with the conventional ESN design.

We discuss the readouts for ESNs for function approximation and pattern recognition. The

standard linear readout is used for function approximation tasks such as system identification,

time series prediction, and channel equalization. We thoroughly investigate the use of ESN for

temporal pattern recognition. The standard linear readout can be used for pattern recognition

with ESNs by minimizing the mean-square error (MSE) with respect to a label. Classification of

time series is different from the classification of static patterns since there is a structure between

the samples of the time series over time. The dynamical classification problem can be reduced to









a static one by treating each sample of the time series individually. In this case, the time series is

not treated as a whole and temporal information between samples is ignored. Moreover, one

class label is associated with each time instant; therefore, post-classification such as majority

voting has to be applied. An alternative to this approach is embedding the time series to populate

a short-term history of the pattern. Again, a static mapper can be used to classify the embedded

pattern. Another major difficulty in dynamical pattern recognition is how to design the label,

which should also be a time series. One straightforward method is to use +1 or -1 as the desired

response throughout the application of the input pattern. However, the difficulty with this

method is that the desired response is forced to be constant independent of the input signal

dynamics. An alternative powerful technique is to create a one-step predictor for each class in

order to capture the dynamics of the class generating the time-series (Zahalka and Principe,

1993). Then, during testing, the test pattern is presented to all predictors and the label of the

predictor with the least prediction error is assigned as the class of the input pattern. We also

propose an alternative readout for ESNs to be used in classification tasks that does not require a

desired response. The goal is to design an ESN/LSM readout that will recognize a class of inputs

that differ by some quantity (e.g. amplitude or shape considered as a distortion parameter). The

proposed readout, called the minimum average correlation energy (MACE) filter, is adopted

from the optical pattern recognition literature, where it is used for the recognition of a given object in 2-

dimensional images in the presence of noise, geometric distortions and pose changes

(Mahalanobis et al., 1987). Instead of using a desired response, the MACE filter creates a

template from echo state responses corresponding to the training patterns of a given class. A

MACE is trained for each class. During testing, the unknown time signal is fed to the ESN and









the states are correlated with each template image and the class label of the template with the

highest correlation is assigned to the input pattern.
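
A minimal sketch of this procedure, assuming the standard frequency-domain MACE synthesis of Mahalanobis et al. (1987) and treating each echo state matrix as a two-dimensional image, could look as follows (the function names are illustrative, not the implementation used later in this work):

    import numpy as np

    def mace_template(images):
        # MACE synthesis in the frequency domain (Mahalanobis et al., 1987):
        # h = D^{-1} X (X^H D^{-1} X)^{-1} c with c a vector of ones, where the
        # columns of X are the vectorized 2-D FFTs of the training images and
        # D holds their average power spectrum on its diagonal. Here each
        # "image" is an echo state matrix (PE index by time).
        X = np.stack([np.fft.fft2(im).ravel() for im in images], axis=1)
        d = np.mean(np.abs(X) ** 2, axis=1)            # diagonal of D
        Xd = X / d[:, None]                            # D^{-1} X
        c = np.ones(X.shape[1])
        h = Xd @ np.linalg.solve(X.conj().T @ Xd, c)
        return h.reshape(images[0].shape)              # frequency-domain template

    def correlation_peak(h_freq, image):
        # Peak of the correlation between a test state image and the template.
        out = np.fft.ifft2(np.fft.fft2(image) * np.conj(h_freq))
        return float(np.max(np.abs(out)))

A template is synthesized per class, and the test pattern receives the label of the template with the largest correlation peak.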

We investigate the implications of the ESN idea for other networks. First, we investigate the

echo state condition (||W|| < 1) in a system-theoretic framework to understand the effect of system

stability on the power of dynamical systems for computation. For this reason, we relax the

echo state condition by allowing it to be slightly larger than 1. This introduces a new dynamical

regime, called "transiently stable computation", where the system is autonomously unstable but

it is stabilized by the input signal of sufficient power. In this regime, function approximation

with a linear readout is still possible even though the system is not globally stable. Secondly, we

investigate the biologically plausible model of the olfactory cortex, the Freeman model (FM), and

propose to use a readout for FM to be able to use it as a universal computer. Without the

proposed readout, the use of FM is limited to simple digit recognition, where the patterns are

static and binary. We will demonstrate with experiments that FM coupled with a readout can

process continuous valued signals. An interesting property of FM is the nonlinear function used

in the model. Unlike the sigmoidal nonlinearity used in ESNs, the FM nonlinearity does not have

its largest slope at the zero operating point.

Various applications of ESN for real-world problems are investigated. We first utilize

ESNs for signal processing in the complex domain and compare complex ESN with other models

for complex signal processing in a nonlinear channel equalization problem in a digital

communication channel. Secondly, we propose ESNs to model brain machine interfaces and

compare them with the linear Wiener filter. Brain machine interfaces aim at predicting the hand

position of a primate using the brain signals. Thirdly, we apply ESNs to predict the water inflow

at a hydro power plant using the previous values of water inflow levels. Then, we tackle two










temporal pattern recognition problems using ESNs with the MACE filter readout. In the spike

detection problem, the goal is to detect action potentials in a noisy neural recording. The second

problem deals with the design of a detector similar to a matched filter in a communication

channel. We compare ESN-MACE filter to the matched filter in a baseband additive noise

channel under different noise distributions.









CHAPTER 2
ANALYSIS OF ECHO STATE NETWORKS

The ESN is a recurrent network paradigm consisting of a recurrently connected reservoir

network stimulated by an input signal and an instantaneous, adaptive readout, which combines

the reservoir states to form the desired signal. ESNs have been shown to be very powerful in

many applications such as chaotic time series prediction, channel equalization, and speech

recognition. However, it remains a mystery how a network with mostly random connections can

be so successful in solving difficult problems. This chapter aims to shed light on this basic

mystery by analyzing the operation of ESNs. The first subsection proposes a framework

which is a signal processing interpretation of basis functions and projections in functional spaces

to describe and understand the ESN architecture. In the second subsection, we interpret the ESN

dynamics as a combination of linear systems. According to this interpretation, when the system

operating point varies over time with the influence of the input signal, local ESN dynamics

change. The third subsection introduces an information-theoretic metric, called average state

entropy, to quantify the computational power of a dynamical network for function

approximation. The ASE metric combined with the idea of bases and projections will lead to a

design procedure for ESNs that will be discussed in the next chapter.

Echo States as Bases and Projections

Let us revisit the recursive update equation of an ESN. The activation of the internal PEs is

updated according to

x(n+1) = f(W^{in} u(n+1) + W x(n) + W^{back} y(n)),    (2-1)

where f = (f_1, f_2, ..., f_N) are the internal units' activation functions. The output from the linear

readout network is computed according to

y(n+1) = W^{out} x(n+1).    (2-2)









ESNs resemble the RNN architecture proposed in (Puskorius and Feldkamp, 1994) and

also used by (Sanchez, 2004) in brain machine interfaces. The critical difference is the

dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We

submit that the ideas of approximation theory in functional spaces (bases and projections), so

useful in optimal signal processing (Principe, 2001), should be utilized to understand the ESN

architecture. Let h(u(t)) be a real-valued function of a real-valued vector

u(t) = [u_1(t), u_2(t), ..., u_M(t)]^T.

In functional approximation, the goal is to estimate the behavior of h(u(t)) as a

combination of simpler functions \varphi_i(t), called the basis functionals, such that its approximant,

\hat{h}(u(t)), is given by

\hat{h}(u(t)) = \sum_i a_i \varphi_i(t).

Here, the a_i's are the projections of h(u(t)) onto each basis functional. One of the central questions in

practical functional approximation is how to choose the set of bases to approximate a given

desired signal. In signal processing, the choice normally goes for a complete set of orthogonal

bases, independent of the input. When the basis set is complete and can be made as large as

required, fixed bases work wonders (e.g. Fourier decompositions). In neural computing, the basic

idea is to derive the set of bases from the input signal through a multilayered architecture.

Consider a single hidden layer TDNN with N PEs and a linear output. The hidden layer PE

outputs can be considered a set of non-orthogonal basis functionals dependent upon the input,

\varphi_i(u(t)) = g\Big( \sum_j b_{ij} u_j(t) \Big).

Here, the b_{ij}'s are the input layer weights, and g is the PE nonlinearity. The approximation produced
Here, b,,'s are the input layer weights, and g is the PE nonlinearity. The approximation produced

by the TDNN is then










\hat{h}(u(t)) = \sum_i a_i \varphi_i(u(t)),    (2-3)

where the a_i's are the weights of the output layer. Notice that the b_{ij}'s adapt the bases and the a_i's adapt

the projection in the projection space (readout). Here the goal is to restrict the number of bases

(number of hidden layer PEs) because their number is coupled with the number of parameters to

adapt which impacts generalization, training set size, etc. Usually, since all of the parameters of

the network are adapted, the best basis in the joint (input and desired) space as well as the best

projection can be achieved, and this represents the optimal solution. The output of the TDNN is a

linear combination of its internal representations, but to achieve a basis set (even if non-

orthogonal), linear independence among the \varphi_i(u(t))'s must be enforced. The effect of nonlinear

transformation of the hidden PEs on the input signal has been investigated by many authors in

neural networks. Oh has shown that the correlation among the weighted sums decreases after they pass through the

sigmoid nonlinear function, which can be approximated by piecewise linear functions (Oh and

Lee, 1994). Shah and Poon showed that there exist weight values for MLP such that the outputs

of its hidden layer PEs are linearly independent and hence, form a complete set of basis functions

(Shah and Poon, 1999).

The ESN (and the RNN) architecture can also be studied in this framework. The reservoir

states of Equation 2-1 correspond to the basis set which are recursively computed from the input,

output and previous states through Win, W and Wback. Notice however that none of these weight

matrices are adapted, that is, the functional bases in the ESN are uniquely defined by the input

and the initial selection of weights. In a sense, ESNs trade the adaptive connections in the

RNN hidden layer for a brute-force approach of creating fixed, diversified dynamics in the hidden

layer.










For an ESN with a linear readout network, the output equation, y(n+1) = W^{out} x(n+1),

has the same form as Equation 2-3, where the \varphi_i(u(t))'s and a_i's are replaced by the echo states

and the readout weights, respectively. These weights are adapted on the training data, which

means that the ESN is able to find the optimal projection in the projection space, just like the

RNN or the TDNN.

It is interesting that a similar perspective of basis and projections for information

processing in biological networks has been proposed by (Pouget and Sejnowski, 1997). They

explored the possibility that the responses of single neurons in parietal cortex serve as basis

functions for the transformations from the sensory input to the motor responses. They proposed

that "the role of spatial representations is to code the sensory inputs and posture signals in a

format that simplifies subsequent computation, particularly in the generation of motor

commands" (Pouget and Sejnowski, 1997).

The central issue in ESN design is exactly the non-adaptive nature of the basis set.

Parameter sets in the reservoir that provide linearly independent states and possess a given

spectral radius may define drastically different projection spaces because the correlation among

the bases is not constrained. A simple experiment was designed to demonstrate that the selection

of the echo state parameters by constraining the spectral radius is not the most suitable for

function approximation. Consider an echo state network of 100 units where the input signal is

sin(2πn/(10π)). Mimicking (Jaeger, 2001), the goal is to let the ESN generate the seventh power

of the input signal. Different realizations of a randomly connected 100 unit ESN were

constructed where the entries of W are set to 0.4, -0.4 and 0 with probabilities of 0.025, 0.025

and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1

or -1 with equal probabilities and W^{back} is set to zero. The input is applied for 300 time steps and






























Figure 2-1. Performance of ESNs for different realizations of W with the same weight
distribution. The weight values are set to 0.4, -0.4 and 0 with probabilities of 0.025,
0.025 and 0.95. All realizations have the same spectral radius of 0.88. In the 50
realizations, MSEs vary from 5.9x10-9 to 8.9x10-. Results show that the spectral radius
of W is not the unique parameter that determines the performance of an ESN.

the echo states are calculated using Equation 2-1. The next step is to train the linear readout

when a desired training signal, d(n), is available. One method to determine the optimal output

weight matrix, W^{out}, in the mean square error (MSE) sense, where the MSE is defined by

E[(d(n) - y(n))^T (d(n) - y(n))], is to use the Wiener solution given by (Haykin, 2001)

W^{out} = E[x x^H]^{-1} E[x d] \approx \Big( \frac{1}{T} \sum_n x(n) x(n)^H \Big)^{-1} \frac{1}{T} \sum_n x(n) d(n).    (2-4)

Here E[.], T and (.)^H denote the expected value operator, the length of the training sequence, and

the conjugate transpose of a complex vector, respectively. Figure 2-1 depicts the MSE values for 50 different

realizations of the ESNs. As observed, even though each ESN has the same sparseness and

spectral radius, the MSE values obtained vary greatly among different realizations. The









minimum MSE value obtained among the 50 realizations is 5.9x10-9 whereas the maximum MSE

is 8.9x10-. This experiment demonstrates that a design strategy that is based solely on the

spectral radius is not sufficient to specify the system architecture for function approximation.

This shows that for each set of random weights that provide the same spectral radius, the

correlation or degree of redundancy among the bases will change and different performances are

encountered in practice.
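
The experiment can be reproduced in spirit with the short sketch below (our own illustrative code; the transient-removal length is an assumption):

    import numpy as np

    def run_trial(seed, N=100, T=300, washout=100):
        # One realization: the fixed weight distribution gives every trial
        # roughly the same spectral radius (about 0.88), but a fresh random W.
        rng = np.random.default_rng(seed)
        W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.95, 0.025, 0.025])
        w_in = rng.choice([1.0, -1.0], size=N)
        n = np.arange(T)
        u = np.sin(2 * np.pi * n / (10 * np.pi))    # input signal
        d = u ** 7                                  # desired: its seventh power
        x = np.zeros((T, N))
        for t in range(1, T):
            x[t] = np.tanh(w_in * u[t] + W @ x[t - 1])
        X, D = x[washout:], d[washout:]             # drop the initial transient
        w_out = np.linalg.lstsq(X, D, rcond=None)[0]  # least-squares readout
        return np.mean((D - X @ w_out) ** 2)

    mses = [run_trial(s) for s in range(50)]
    print(min(mses), max(mses))   # MSEs spread widely at a fixed spectral radius

The wide spread of MSEs across seeds is exactly the behavior summarized in Figure 2-1.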

ESN Dynamics as a Combination of Linear Systems

It is well known that the dynamics of a nonlinear system can be approximated by that of a

linear system in a small neighborhood of an equilibrium point (Kuznetsov, 1998). Here, we will

perform the analysis with hyperbolic tangent PEs and approximate the ESN dynamics by the

dynamics of the linearized system in the neighborhood of the current system state. Hence, when

the system operating point varies over time, the linear system approximating the ESN dynamics

will change. We are particularly interested in the movement of the poles of the linearized ESN.

Consider the update equation for the ESN without output feedback given by

x(n+1) = f(W^{in} u(n+1) + W x(n)).

Linearizing the system around the current state x(n), one obtains the Jacobian matrix,

J(n+1), defined by

J(n+1) =
\begin{bmatrix}
f'(net_1(n)) w_{11} & f'(net_1(n)) w_{12} & \cdots & f'(net_1(n)) w_{1N} \\
\vdots & \vdots & \ddots & \vdots \\
f'(net_N(n)) w_{N1} & f'(net_N(n)) w_{N2} & \cdots & f'(net_N(n)) w_{NN}
\end{bmatrix}
=
\begin{bmatrix}
f'(net_1(n)) & 0 & \cdots & 0 \\
0 & f'(net_2(n)) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & f'(net_N(n))
\end{bmatrix} W
= F(n) W.    (2-5)










Here, net_i(n) is the i-th entry of the vector (W^{in} u(n+1) + W x(n)), and w_{ij} denotes the (i, j)-th entry

of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the

Jacobian matrix J(n+1). When the amplitude of each PE input changes, the local slope changes,

and so the poles of the linearized system are time varying, although the parameters of ESN are

fixed.

In order to visualize the movement of the poles, consider an ESN with 100 states. The

entries of the internal weight matrix are chosen to be 0, 0.4 and -0.4 with probabilities 0.9, 0.05

and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1

or -1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system and

the echo states are computed according to Equation 2-1. Then, the Jacobian matrix and the

eigenvalues are calculated using Equation 2-5. Figure 2-2 shows the pole tracks of the linearized

ESN for different input values. A single ESN with fixed parameters implements a combination

of many linear systems with varying pole locations, hence many different time constants that

modulate the richness of the reservoir of dynamics as a function of the input amplitude. Higher

amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to

shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system

with large stability margin. When the input is close to zero, the poles of the linearized ESN are

close to the maximal spectral radius chosen, decreasing the stability margin. When compared to

their linear counterpart, an ESN with the same number of states results in a much more detailed

coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results

can be obtained using signals of different shapes at the ESN input.
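
The pole movement is easy to visualize numerically. The sketch below (illustrative code following the parameters above) computes the eigenvalues of J(n+1) = F(n)W along the sinusoidal input:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 100
    W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
    W *= 0.95 / max(abs(np.linalg.eigvals(W)))      # spectral radius 0.95
    w_in = rng.choice([1.0, -1.0], size=N)

    u = np.sin(2 * np.pi * np.arange(200) / 100)    # sinusoid with period 100
    x = np.zeros(N)
    radii = []
    for n in range(len(u)):
        net = w_in * u[n] + W @ x
        x = np.tanh(net)
        F = np.diag(1.0 - x ** 2)                   # f'(net) for tanh PEs
        radii.append(max(abs(np.linalg.eigvals(F @ W))))  # poles of J = F(n) W

    # The poles approach the design radius 0.95 near the input's zero
    # crossings and shrink toward the origin where the PEs saturate.
    print(min(radii), max(radii))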


















































Figure 2-2. Pole tracks of the linearized ESN when the input goes through a cycle. A single ESN
with fixed parameters implements a combination of many linear systems with varying
pole locations. A) One cycle of sinusoidal signal with a period of 100. B)-E) show the
positions of poles of the linearized systems when the input values are at B, C, D, and
E in panel A. F) Cumulative pole locations show the movement of the poles as the
input changes.

Theorem: The eigenvalues of the linearized system have the largest radius when the

system state is zero.









Proof: Assume W has nondegenerate eigenvalues and corresponding linearly independent

eigenvectors. Then, consider the eigendecomposition of W, W = PDP^{-1}, where P is the

eigenvector matrix and D is the diagonal matrix of eigenvalues (D_{ii}) of W. Since F(n) and D are

diagonal, J(n+1) = F(n)W = F(n)(PDP^{-1}) = P(F(n)D)P^{-1} is the eigendecomposition of

J(n+1). Here, each entry of F(n)D, f'(net_i(n)) D_{ii}, is an eigenvalue of J. Since, for the hyperbolic

tangent, f'(net_i(n)) \le f'(0) = 1, it follows that |f'(net_i(n)) D_{ii}| \le |D_{ii}|.


A key corollary of the above analysis is that the spectral radius of an ESN can be

adjusted using a constant bias signal at the ESN input without changing the recurrent connection

matrix, W. The application of a nonzero constant bias will move the operating point to regions of

the sigmoid function closer to saturation and always decrease the spectral radius due to the shape

of the nonlinearity. This property will be exploited in the design of ESNs which will be

discussed in the next chapter.
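
This corollary can be verified numerically with a small sketch (illustrative code with an assumed dense random reservoir): a constant bias is applied, the network is iterated to its operating point, and the spectral radius of the linearization is measured.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 50
    W = rng.standard_normal((N, N)) / np.sqrt(N)
    W *= 0.9 / max(abs(np.linalg.eigvals(W)))      # design spectral radius 0.9

    def linearized_radius(bias, steps=100):
        # Drive the reservoir with a constant bias and return the spectral
        # radius of J = F W at the resulting operating point.
        x = np.zeros(N)
        for _ in range(steps):
            x = np.tanh(W @ x + bias)
        F = np.diag(1.0 - x ** 2)                  # tanh'(net) = 1 - tanh(net)^2
        return max(abs(np.linalg.eigvals(F @ W)))

    for b in [0.0, 0.5, 1.0]:
        print(b, linearized_radius(b))             # radius shrinks as b grows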

Average State Entropy as a Measure of the Richness of ESN Reservoir

The concept of "rich dynamical reservoir" (Jaeger, 2001) has not been quantified with a

well-defined metric. Our framework of bases and projections leads to a new metric to quantify

the richness of ESN reservoir. Here, we propose the instantaneous state entropy to quantify the

distribution of instantaneous amplitudes across the ESN reservoir states. Entropy of the

instantaneous ESN states is appropriate to quantify performance in function approximation

because the ESN output is a mere weighted combination of the value of the ESN states. If the

echo state's instantaneous amplitudes are concentrated on only a few values across the ESN state

dynamic range, the ability to approximate an arbitrary desired response by weighting the states is

limited (and wasteful due to redundancy between the different states) and performance will









suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, then

it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states

appears as a good measure to quantify "richness" of dynamics with instantaneous mappers. Due

to the time structure of signals, the average state entropy (ASE), defined as the state entropy

averaged over time, will be the parameter used to quantify the diversity in the dynamical

reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the

volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the

echo state manifold spanned by the trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information

and an efficient nonparametric estimator that avoids explicit pdf estimation has been developed

(Principe et al., 2000;Erdogmus et al., 2003; Erdogmus and Principe, 2002). Renyi's entropy

with parameter γ for a random variable X with pdf f_X(x) is given by (Principe et al., 2000)

H_\gamma(X) = \frac{1}{1-\gamma} \log E[ f_X^{\gamma-1}(X) ].

Renyi's quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon's entropy is

obtained). Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen

windowing approximates the underlying pdf by

\hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} K_\sigma(x - x_i),

where K_\sigma is the kernel function with kernel size σ. Then Renyi's quadratic entropy can be

estimated by (Principe et al., 2000; Erdogmus et al., 2003)

\hat{H}_2(X) = -\log \Big( \frac{1}{N^2} \sum_{j=1}^{N} \sum_{i=1}^{N} K_\sigma(x_j - x_i) \Big).    (2-6)









The instantaneous state entropy is estimated using Equation 2-6 where the samples are the


entries of the state vector x(n) = [x_1(n), x_2(n), ..., x_N(n)]^T of an ESN with N internal PEs.

Results will be shown with a Gaussian kernel with kernel size chosen to be 0.3 of the standard

deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter

to quantify the approximation properties of ESNs by experimentally demonstrating that ESNs

with the same spectral radius display different ASEs.
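
A direct implementation of this estimator is straightforward; the sketch below follows Equation 2-6 literally, with the kernel size set to 0.3 of the sample standard deviation as stated above (the function names are ours):

    import numpy as np

    def quadratic_renyi_entropy(samples):
        # Equation 2-6 with a Gaussian kernel whose size is 0.3 of the
        # standard deviation of the samples (the convention used above).
        sigma = 0.3 * np.std(samples)
        diffs = samples[:, None] - samples[None, :]
        K = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
        return -np.log(np.mean(K))

    def average_state_entropy(states):
        # states is a T x N matrix of echo states; the instantaneous entropy
        # is computed across the N PE amplitudes and then averaged over time.
        return float(np.mean([quadratic_renyi_entropy(x_n) for x_n in states]))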

Let us consider the same 100 unit ESN that we used in the previous section built with three

different spectral radii, 0.2, 0.5 and 0.8, with an input signal of sin(2πn/20). Figure 2-3 A depicts the

echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time

step using Equation 2-6 and plotted in figure 2-3 B. First, note that the instantaneous state

entropy changes over time with the distribution of the echo states as we would expect, since state

entropy is dependent upon the input signal that also changes in this case. Second, as the spectral

radius increases in the simulation, the diversity in the echo states increases. For the spectral

radius of 0.2, the echo states' instantaneous amplitudes are concentrated on only a few values, which

is wasteful due to redundancy between different states. In practice, to quantify the overall

representation ability over time, we will use ASE, which takes values -0.735, -0.007 and 0.335,

for the spectral radii of 0.2, 0.5 and 0.8, respectively. Hence, ASE is affected by the spectral

radius of the W matrix as we would expect. Moreover, even for the same spectral radius of 0.5,

several ASEs are possible. Figure 2-4 shows ASEs from 50 different realizations of ESNs with

the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the

reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are

obtained for other inputs as long as the input dynamic range is properly selected.











Figure 2-3. Echo states and state entropy for different ESNs. A) Outputs of echo states (100 PEs)
produced by systems with spectral radii of 0.2, 0.5 and 0.8, from top to bottom,
respectively. B) Instantaneous state entropy is calculated using Equation 2-6 for echo
states in panel A.













Figure 2-4. The ASE values obtained from 50 realizations of ESNs with the same spectral radius of 0.5.

Maximizing ASE means that the diversity of the states over time is the largest and should

provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a

guarantee that the ESN so designed will perform the best, because the basis set is independent of

the desired response and the application may require a small spectral radius. However, when the

desired response is not accessible for the design of the ESN bases or when the same reservoir is

to be used for a number of problems, the default strategy should be to maximize the ASE of the

state vector.









CHAPTER 3
DESIGN OF THE DYNAMICAL RESERVOIR

In ESNs, reservoir weights are selected randomly with the constraint of the echo state

condition (Jaeger, 2001). However, such a random selection scheme is not optimal, as was

demonstrated in Chapter 2. In this chapter, we propose a scheme for the design of

reservoir weights that will lead to small approximation errors for a variety of tasks. For optimal

approximation with a finite number of basis functionals, both the projection space (bases) and

the projector (readout) require knowledge of the input and desired responses. However, in

ESNs, the projection space is determined solely by the architecture of the ESN reservoir and the

input signal (since W is fixed), without any knowledge of the space spanned by the desired target

signal. The selection of basis functions with the knowledge of input signal only is an ill-posed

problem. However, the selection of the reservoir weights must still be done using some rule, and

here we hypothesize that a good design strategy is to let the ESN states cover with equal

resolution the projection space to anticipate any possible mapping requirement (dictated by the

unknown desired response).

Design of the Recurrent Connections

According to the interpretation of ESNs as coupled linear systems, the design of the

internal connection matrix, W, will be based on the distribution of the poles of the linearized

system around zero state. Our proposal is to design the ESN such that the linearized system has

uniform pole distribution inside the unit circle of the z-plane. With this design scenario, the

system dynamics will include uniform coverage of time constants arising from the uniform

distribution of the poles, which also decorrelates as much as possible the bases functionals. This

principle was chosen by analogy to the identification of linear systems using Kautz filters

(Kautz, 1954) which shows that the best approximation of a given transfer function by a linear










system with finite order is achieved when poles are placed in the neighborhood of the spectral

resonances. When no information is available about the desired response, or when we would like to

use the same reservoir for a variety of tasks, we should uniformly spread the poles to anticipate

good approximation to arbitrary mappings.

We again use a maximum entropy principle to distribute the poles inside the unit circle

uniformly. The constraints of a circle as boundary conditions for discrete linear systems and

complex conjugate locations are easy to include for the pole distribution (Thogula, 2003). The

poles are first initialized at random locations; the quadratic Renyi's entropy is calculated using

Equation 2-6 and poles are moved such that the entropy of the new distribution is increased over

iterations (Erdogmus et al., 2003). This method efficiently finds a uniform coverage of the unit

circle with an arbitrary number of poles. However, any other method can be used to find the

location of poles with uniform coverage. Notice that this operation has to be done only once for a

given number of poles, which is equal to the number of hidden unit PEs.

The system with the uniform pole locations can be interpreted using linear system theory.

The poles that are close to the unit circle correspond to many sharp band pass filters specialized

in different frequency regions whereas the inner poles realize filters of larger frequency support.

Moreover, different orientations (angles) of the poles create filters of different center

frequencies.

Now, our problem is to construct a weight matrix from the pole locations. This is

equivalent to the problem of designing W when its eigenvalues are known. In principle we

would like to create a sparse matrix, so we started with the sparsest matrix (with an inverse)

which is the direct canonical structure given by (Kailath, 1980)










W = \begin{bmatrix}
-a_1 & -a_2 & \cdots & -a_{N-1} & -a_N \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}    (3-1)

The characteristic polynomial of W is

\phi(s) = \det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + \cdots + a_N = (s - p_1)(s - p_2) \cdots (s - p_N),    (3-2)

where the p_i's are the eigenvalues and the a_i's are the coefficients of the characteristic polynomial of W.

Here, we know the pole locations of the linear system obtained from the linearization of the

ESN, so using Equation 3-2, we can obtain the characteristic polynomial and construct W matrix

in the canonical form using Equation 3-1. We will name the ESN constructed based on the

uniform pole principle ASE-ESN. All other possible solutions with the same eigenvalues can be

obtained by Q^{-1}WQ, where Q is any nonsingular matrix.
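
The construction is mechanical once the pole locations are available. The sketch below recovers the coefficients of Equation 3-2 with numpy.poly and fills the canonical form of Equation 3-1; the example pole set is a simple stand-in for the entropy-maximized distribution, chosen only to preserve conjugate symmetry.

    import numpy as np

    def reservoir_from_poles(poles):
        # Direct canonical (companion) form of Equation 3-1 whose eigenvalues
        # are the prescribed poles; complex poles must come in conjugate
        # pairs so that the coefficients a_i of Equation 3-2 are real.
        a = np.real(np.poly(poles))      # [1, a_1, ..., a_N]
        N = len(poles)
        W = np.zeros((N, N))
        W[0, :] = -a[1:]                 # first row: -a_1 ... -a_N
        W[1:, :-1] = np.eye(N - 1)       # subdiagonal of ones
        return W

    # Eight poles spread evenly on a circle of radius 0.9, a crude stand-in
    # for the entropy-maximized uniform distribution described in the text.
    angles = (np.arange(4) + 0.5) * np.pi / 4
    poles = np.concatenate([0.9 * np.exp(1j * angles), 0.9 * np.exp(-1j * angles)])
    W = reservoir_from_poles(poles)
    print(np.sort(np.abs(np.linalg.eigvals(W))))   # all close to 0.9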

To corroborate our hypothesis, we would like to show that the linearized ESN designed

with the internal connection weight matrix having the eigenvalues uniformly distributed inside

the unit circle creates higher ASE values for a given spectral radius compared to other ESNs with

random internal connection weight matrices. We will consider an ESN with 30 states, and use

our procedure to create the uniformly distributed linearized ASE-ESN matrix for different

spectral radii in [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W

matrices with different sparseness constraints. This corresponds to a weight distribution having

the values 0, c and -c with probabilities p_1, (1-p_1)/2 and (1-p_1)/2, where p_1 defines the

sparseness of W and c is a constant that takes a specific value depending on the spectral radius.

We also created W matrices with values uniformly distributed between -1 and 1 (U-ESN), and

scaled to obtain a given spectral radius (Jaeger and Haas, 2004). Then, for different W^{in} matrices,

we run the ASE-ESNs with the sinusoidal input of period 20 and calculate ASE.













[Figure 3-1 plots ASE versus spectral radius for the ASE-ESN, the U-ESN, and randomly connected ESNs with sparseness 0.2, 0.1, and 0.07.]


Figure 3-1. Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly
generated ESNs with different sparseness. As observed from the figure, the ASE-ESN
with uniform pole distribution generates higher ASE on the average for all spectral
radii compared to ESNs with random connections.

Figure 3-1 compares the ASE values obtained using different ESNs over 1000 realizations.

As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE

on the average for all spectral radii compared to ESNs with sparse and uniform random

connections. This approach is indeed conceptually similar to Jeffreys' maximum entropy prior

(Jeffreys, 1946): it will provide a consistently good response for the largest class of problems.

Concentrating the poles of the linearized system in certain regions of the space only provides

good performance if the desired response has energy in this part of the space, as is well known

from the theory of Kautz filters (Kautz, 1954).










Design of the Adaptive Bias

In conventional ESNs, only the output weights are trained, optimizing the projections of the

desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed,

the basis functions are only input dependent. However, since function approximation is a

problem in the joint space of the input and desired signals, a penalty in performance will be

incurred. From the linearization analysis that shows the crucial importance of the operating point

of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external

adaptive bias to adjust the effective spectral radius of an ESN. Notice that, according to

linearization analysis with tanh nonlinearity, bias can only reduce spectral radius. The

information for adaptation of bias is the MSE in training, which modulates the spectral radius of

the system with the information derived from the approximation error. With this simple

mechanism, some information from the input-output joint space is incorporated in the definition

of the projection space of the ESN. The beauty of this method is that the spectral radius can be

adjusted by a single parameter that is external to the system without changing reservoir weights.

The training of the bias can be easily accomplished. Indeed, since the parameter space is only

one dimensional, a simple line search method can be efficiently employed to optimize the bias.

Among different line search algorithms, we will use a search that uses Fibonacci numbers in the

selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the

maximum number of evaluations needed to reduce the interval of uncertainty to within the

prescribed length. In our problem, a bias value is picked according to Fibonacci search. For each

value of bias, training data is applied to the ESN and the echo states are calculated. Then, the

corresponding optimal output weights and the objective function (MSE) are evaluated to pick the

next bias value.
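
The sketch below illustrates the procedure; it substitutes a golden-section search, a close relative of the Fibonacci search, and all weight and function names are illustrative:

    import numpy as np

    def mse_for_bias(b, u, d, W, w_in, w_b, washout=100):
        # Run the reservoir with constant bias b, fit the linear readout in
        # closed form, and return the training MSE that guides the search.
        T, N = len(u), W.shape[0]
        x = np.zeros((T, N))
        for n in range(1, T):
            x[n] = np.tanh(w_in * u[n] + w_b * b + W @ x[n - 1])
        X, D = x[washout:], d[washout:]
        w_out = np.linalg.lstsq(X, D, rcond=None)[0]
        return np.mean((D - X @ w_out) ** 2)

    def line_search_bias(f, lo=0.0, hi=2.0, tol=1e-2):
        # Golden-section search over the one-dimensional bias; like the
        # Fibonacci search, it shrinks the interval of uncertainty at each
        # step with a small number of function evaluations.
        g = (np.sqrt(5) - 1) / 2
        a, b = lo, hi
        while b - a > tol:
            c, e = b - g * (b - a), a + g * (b - a)
            if f(c) < f(e):
                b = e
            else:
                a = c
        return (a + b) / 2

    # Usage (u, d, W, w_in, w_b defined elsewhere):
    # best_b = line_search_bias(lambda b: mse_for_bias(b, u, d, W, w_in, w_b))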









Alternatively, gradient-based methods can be utilized to optimize the bias, due to their

simplicity and low computational cost. The system update equation with an external bias signal, b, is

given by

x(n+1) = f(W^{in} u(n+1) + W^{in} b + W x(n)).

The update equation for b is given by

\frac{\partial O(n+1)}{\partial b} = -e \, W^{out} \frac{\partial x(n+1)}{\partial b}, \qquad
\frac{\partial x(n+1)}{\partial b} = F(n+1) \Big( W^{in} + W \frac{\partial x(n)}{\partial b} \Big).


Here, O is the MSE defined previously. This algorithm may suffer from similar problems

observed in gradient based methods in recurrent networks training. However, we observed that

the performance surface is rather simple. Moreover, since the search parameter is one

dimensional, the gradient vector can only assume one of the two directions. Hence, imprecision

in the gradient estimation should affect the speed of convergence, but normally not change the

correct gradient direction.
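
A minimal sketch of this gradient-based adaptation is shown below, assuming a tanh reservoir, a scalar input and output, the instantaneous cost O = e^2/2, and illustrative variable names; it propagates the sensitivity of the state to the bias alongside the state itself, following the recursion above.

import numpy as np

# Sketch of the gradient update for the external bias b (illustrative names).
# dx(n+1)/db = f'(net) * (W_bias + W dx(n)/db) with f = tanh, f'(net) = 1 - x^2.
def adapt_bias(W, W_in, W_bias, W_out, u, d, b=0.0, eta=0.01):
    N = W.shape[0]
    x = np.zeros(N)          # reservoir state
    dx_db = np.zeros(N)      # sensitivity of the state to the bias
    for n in range(len(u) - 1):
        net = W_in * u[n + 1] + W_bias * b + W @ x
        x_new = np.tanh(net)
        dx_db = (1.0 - x_new ** 2) * (W_bias + W @ dx_db)
        e = d[n + 1] - W_out @ x_new       # instantaneous error
        grad = -e * (W_out @ dx_db)        # gradient of the MSE w.r.t. b
        b -= eta * grad                    # stochastic gradient descent step
        x = x_new
    return b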

Experiments

This section presents a variety of experiments in order to test the validity of the ESN

design scheme proposed in the previous section.

Short-term Memory Structures

Modeling a dynamical system with input-output data requires access to a sufficiently long

input history, as the output of the dynamical system depends not only on the current value of the input

signal but also on its history. In signal processing, the common approach to overcome this

difficulty is to embed the input signal using a tapped delay line (as in FIR filters or TDNNs). RNNs









provide a different type of embedding through the recurrent connections between the PEs.

However, the length of the memory achieved is not as clear as in a tapped delay line.

This experiment compares the short term memory (STM) capacity of ESNs with the same

spectral radius using the framework presented in (Jaeger, 2002). Consider an ESN with a single

input signal, u(n), optimally trained with the desired signal u(n-k), for a given delay k. Denoting

the optimal output signal y_k(n), the k-delay STM capacity of a network, MC_k, is defined as the

squared correlation coefficient between u(n-k) and y_k(n) (Jaeger, 2002). The STM capacity, MC,


of the network is defined as MC = Σ_{k=1}^{∞} MC_k. STM capacity measures how accurately the delayed

version of the input signal is recovered with optimally trained output units. It has been shown in

(Jaeger, 2002) that the memory capacity for recalling an independent identically distributed (i.i.d)

input with an N-unit RNN with linear output units is bounded by N.
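
As an illustration, the sketch below computes MC_k and MC from a matrix of collected echo states, assuming that the readout is solved by least squares, that X holds the states for the last T input samples, and that the input is long enough (len(u) >= T + max_delay); the names and alignment convention are assumptions.

import numpy as np

# Sketch of the k-delay memory capacity MC_k (illustrative).
# X: T x N echo states after the transient; u: the full input signal.
def memory_capacity(X, u, max_delay=40):
    T = X.shape[0]
    t0 = len(u) - T          # states assumed to come from the last T input samples
    mc = np.zeros(max_delay)
    for k in range(1, max_delay + 1):
        d = u[t0 - k : t0 + T - k]                # u(n-k) aligned with the states
        w, *_ = np.linalg.lstsq(X, d, rcond=None) # optimal linear readout
        y = X @ w
        r = np.corrcoef(d, y)[0, 1]               # correlation coefficient
        mc[k - 1] = r ** 2                        # squared correlation = MC_k
    return mc.sum(), mc                           # MC and the per-delay profile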

Consider an ESN with 20 PEs and a single input unit. The ESN is driven by an i.i.d. random

input signal, u(n), that is uniformly distributed over [-0.5, 0.5]. The goal is to train the ESN to

generate the delayed versions of the input, u(n-1), u(n-2), ..., u(n-40). We used four different

ESNs, R-ESN, U-ESN, ASE-ESN and BASE-ESN. R-ESN is a randomly connected ESN used

in (Jaeger, 2002) where the entries of W matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1,

0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9.

The entries of W for the U-ESN are uniformly distributed over [-1, 1] and scaled to obtain a spectral radius of 0.9.

ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has

the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the

input weights are set to 0.1 or -0.1 with equal probability and direct connections from the input

to the output are allowed whereas W^back is set to 0 (Jaeger, 2002). The echo states are calculated

using Equation 2-1 for 200 samples of the input signal and the first 100 samples corresponding to










initial transient are eliminated. Then, the output weight matrix is calculated using Equation 2-4.

For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal

and the corresponding output and MC_k are calculated.



[Figure 3-2 plot: k-delay STM capacity versus delay (1 to 40) for the R-ESN, U-ESN, ASE-ESN and BASE-ESN.]


Figure 3-2. The k-delay STM capacity of each ESN for delays 1, ..., 40 computed using the test
signal. The STM capacities of R-ESN, U-ESN, ASE-ESN and BASE-ESN are 13.09,
13.55, 16.70 and 16.90, respectively.

Figure 3-2 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for

delays 1, ..., 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN and BASE-

ESN are 13.09, 13.55, 16.70 and 16.90, respectively. First of all, ESNs with uniform pole

distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly

generated ESN given in (Jaeger, 2002) in spite of all having the same spectral radius. In fact, the

STM capacity of ASE-ESN is close to the theoretical maximum value of N=20. A closer look at

the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9.









Indeed, for small k, large ASE degrades the performance because the tasks do not need long

memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN

which reduces the ASE to the appropriate level required for the task. Overall, the addition of the

bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-

ESN has slightly better STM than R-ESN, which uses only 3 distinct weight values, whereas

U-ESN has many more distinct weight values. It is also significant to note that the MC

will be very poor for an ESN with smaller spectral radius even with an adaptive bias since the

problem requires large ASE and bias can only reduce ASE. This experiment demonstrates the

need for maximizing ASE in ESN design.

Binary Parity Check

The effect of the adaptive bias was marginal in the previous experiment since the nature of

problem required large ASE values and long short-term memory. However, there are tasks in

which the optimal solutions require smaller ASE values and smaller spectral radius. Those are

the tasks where the adaptive bias becomes a crucial design parameter in our design methodology.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary

input signal, u(n), that assumes the values 0 or 1. The goal is to train an ESN to generate the m-

bit parity, where m = 3, ..., 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN

and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of W matrix

are set to 0, 0.06, -0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse

connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a

spectral radius of 0.9. The input weights are set to 1 or -1 with equal probability and direct

connections from the input to the output are allowed whereas W^back is set to 0. The echo states

are calculated using Equation 2-1 for 1000 samples of the input signal and the first 100 samples

corresponding to initial transient are eliminated. Then, the output weight matrix is calculated










using Equation 2-4. For ESN with adaptive bias, the bias is trained for each task. Binary decision

is made by a threshold detector that compares the output of the ESN to 0.5. Figure 3-3 shows the

number of wrong decisions (averaged over 100 different realizations) made by each ESN for

m=3,...,8.


[Figure 3-3 plot: number of wrong decisions versus m for the R-ESN, ASE-ESN and BASE-ESN.]


Figure 3-3. Number of wrong decisions made by each ESN for m = 3, ..., 8 in the binary parity
check problem. The total number of wrong decisions for m = 3, ..., 8 of R-ESN,
ASE-ESN and BASE-ESN are 722, 862 and 699, respectively.

The total number of wrong decisions for m=3,...,8 of R-ESN, ASE-ESN and BASE-ESN

are 722, 862 and 699, respectively. ASE-ESN performs poorly since the nature of the problem

requires a short time constant for fast response but the ASE-ESN has a large spectral radius. For 5-bit

parity, the R-ESN has no wrong decisions whereas ASE-ESN has 47 wrong decisions. BASE-

ESN performs a lot better than ASE-ESN and slightly better than the R-ESN since the adaptive

bias reduces the spectral radius effectively. Note that for m = 7 and 8, the ASE-ESN performs









similar to the R-ESN, since the task requires access to a longer input history, which compromises

the need for fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors

(m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are

approximately 3.2, 2.8, 2.6 and 2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a

wide range of bias values that result in similar MSE values (between 0 and 3). In summary, this

experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir

according to the mapping task.

System Identification

This section presents a function approximation task where the aim is to identify a nonlinear

dynamical system. The unknown system is defined by the difference equation

y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n)), where f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu).

The input to the system is chosen to be u(n) = sin(2πn/25).
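
For reference, the sketch below generates input-output data for this plant directly from the stated difference equation; the signal length is an illustrative choice.

import numpy as np

# Sketch: generate training data for the unknown plant defined above.
def generate_plant_data(T=1000):
    n = np.arange(T)
    u = np.sin(2 * np.pi * n / 25)                        # system input
    f = lambda v: (0.6 * np.sin(np.pi * v)
                   + 0.3 * np.sin(3 * np.pi * v)
                   + 0.1 * np.sin(5 * np.pi * v))
    y = np.zeros(T)
    for t in range(1, T - 1):
        y[t + 1] = 0.3 * y[t] + 0.6 * y[t - 1] + f(u[t])  # the difference equation
    return u, y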

We used an ESN with 30 internal units and a single input unit. Again, we used three

different ESNs, R-ESN, ASE-ESN and BASE-ESN. The W matrices of all ESNs are generated

as described in section 3.3.1, except that they are scaled to have a spectral radius of 0.95. In each

ESN, the input weights are set to 1 or -1 with equal probability and direct connections from the

input to the output are allowed whereas W^back is set to 0. The optimal output signal is calculated

for each ESN as described in section 3.3.1. The MSE values (averaged over 100 realizations) for

R-ESN and ASE-ESN are 1.23x10^-5 and 1.83x10^-6, respectively. The addition of the adaptive bias to

the ASE-ESN reduces the MSE value from 1.83x10^-6 to 3.27x10^-9. This experiment illustrates the

superiority of the proposed design scheme for function approximation.









CHAPTER 4
DESIGN OF THE READOUT

In the previous chapter, we have discussed the design of the ESN reservoir and the bias.

This chapter deals with the design of the readouts for ESNs. The problem is studied for function

approximation and pattern recognition separately and different solutions are proposed for each

case.

Design of the Readout for Function Approximation

The usual readout is the linear regression network. Throughout this dissertation, we will

use the linear readout. The equation for the linear readout is given by

y(n+1) = W^out x(n+1)

Alternative readouts are nonlinear readouts such as the multilayer perceptron or local

nonlinear modelling. This issue is outside the scope of this dissertation.
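
For concreteness, a minimal sketch of the linear readout solved by least squares over the collected echo states, in the spirit of Equation 2-4, is given below; the variable names are illustrative and the transient samples are assumed to have been discarded already.

import numpy as np

# Sketch: least-squares training of the linear readout (illustrative names).
# X is a T x N matrix of echo states, one state vector per row, with the
# initial transient removed; d is the desired signal aligned with the rows.
def train_readout(X, d):
    W_out, *_ = np.linalg.lstsq(X, d, rcond=None)  # minimizes ||X W_out - d||^2
    return W_out

# At run time the readout is simply y(n+1) = W^out x(n+1):
# y = x_new @ W_out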

Design of the Readout for Dynamical Pattern Recognition

The applications of ESNs to problems such as system identification, control of dynamical

systems, and time-series prediction have been extensively studied both in this work and in the

literature (Jaeger, 2001; Jaeger and Hass, 2004; Ozturk et al., 2007; Prokhorov, 2005). However,

the use of ESNs for dynamical pattern recognition has been limited. In the literature, RNNs and

other neural architectures such as the time-delay neural network have been utilized for dynamical

pattern recognition tasks by minimizing the mean-square error (MSE) with respect to a label

(Haykin, 1998; Bishop, 1995; Iso and Watanabe, 1991; Tebelskis, 1995). If the pattern to be

classified is a time signal, the classification process is different from the classification of a static

pattern. The dynamical classification problem can be reduced to a static one by treating each

sample of the time series individually. In this case, the time series is not treated as a whole and

temporal information between samples is ignored. Moreover, one class label is associated with









each time instant; therefore, post-classification such as majority voting has to be applied. An

alternative to this approach is embedding the time series to populate a short-term history of the

pattern. Again, a static mapper can be used to classify the embedded pattern. In this approach,

classification becomes extremely hard, if the dimensionality of the input space becomes too

large. Alternatively, temporal processing of the time series can be done inside the classifier. In

this approach, the classifier, which has a built-in memory, can accept single elements of time

series. Although the classifier does not require external temporal processing, the design of the

classifier may be complicated. Another major difficulty in dynamical pattern recognition is how

to design the label, which should also be a time series. The obvious method is to enforce a class

label (+1 or -1) as the desired response throughout the presentation of the input pattern.

However, the difficulty with this method is that the desired response is forced to be constant

independent of the input signal dynamics. An alternative powerful technique is to create a one-

step predictor for each class in order to capture the dynamics of the class generating the time-

series (Zahalka and Principe, 1993). Then, during testing, the test pattern is presented to all

predictors and the label of the predictor with the least prediction error is assigned as the class of

the input pattern. This method is a smart way of converting the pattern recognition problem into

a regression problem where many techniques are available. This method can be easily applied to

ESNs with an adaptive linear readout and used for classification of temporal patterns, but it is

still dependent upon input normalization and it is sensitive to pattern duration changes.

In this chapter, we propose an alternative readout for ESNs for temporal pattern

recognition and compare it with the standard linear readout of ESNs and other conventional

techniques used in the literature. The goal is to design an ESN/LSM readout that will recognize a

class of inputs that differ by some quantity (e.g. amplitude or shape considered as a distortion










parameter). The proposed readout is based on a biologically plausible linear associative memory

(LAM) that implements a correlator specially trained for high specificity. The readout, called the

minimum average correlation energy (MACE) filter, is adopted from the optical pattern recognition

literature, where it is used for recognition of a given object in 2-D images in the presence of

noise, geometric distortions and pose changes (Mahalanobis et al., 1987). The MACE filter is a

correlator whose weights are determined in closed form by solving a minimization problem with a

constraint, and it has been shown that the method is equivalent to a cascade of a pre-whitening

filter for the image class followed by a LAM (Fisher, 1997). The recognition of a time pattern in

a single time series is achieved by feeding the time series to an ESN/LSM, and utilizing a time

window of states which can be interpreted as a 2-D image, one dimension being time and the

other processing element number (space). Several MACEs are trained, one for each class. During

testing, the unknown time signal is fed to the ESN and the states are correlated with each

template image and the class label of the template with the highest correlation is assigned to the

input pattern. An interesting implication of the MACE is that it can also be used for LSMs with

proper regularization of the MACE solution. In LSMs, the proposed readout operates directly in

the spike domain without converting the liquid state spike train outputs into continuous signals

by binning or low-pass filtering.

In this chapter, we first present a brief background on linear associative memories followed

by the detailed explanation of the MACE filter. Then, the use of MACE filter as the readout for

ESN/LSMs is discussed. Finally, the experiments are presented to compare the proposed readout

with the conventional methods.

Linear Associative Memories

A linear associative memory is a multiple input multiple output linear system that

implements an association (memory) between input and desired outputs (Hinton and Anderson,









1981). The output is automatically created by the presentation of the proper input, which is

referred to as content addressability. LAMs are quite different from the concept of memory used in

digital computers, which uses a memory location whose contents are accessible by its address.

Unlike digital memory, LAMs store information in a more global way by having several PEs

instead of local bit storage. Therefore, they are also robust to noise (Haykin, 1998; Hinton and

Anderson, 1981). LAMs recall by association, hence the pattern that is closest to the input is

recalled. LAMs are biologically plausible just like ESN/LSM and they stand as the most likely

model for human memory (Principe et al., 2000).

Forced Hebbian rule can be used to train a LAM that associates a P-dimensional input p to

a Q-dimensional desired output q. The output of the LAM is given by q = W^LAM p, where the QxP

matrix W^LAM can be computed using the outer product rule W^LAM = q p^T (Principe et al., 2000;

Haykin, 1998). Multiple input-output vector pairs (p_k, q_k), for k = 1, ..., K, can also be stored in the

LAM by repeated presentation of each input. Using the superposition principle for the linear

network, the final weight matrix is given by


W^LAM = Σ_{k=1}^{K} q_k p_k^T


For an input vector p_l, the output of the LAM is given by

y_l = W^LAM p_l = q_l (p_l^T p_l) + Σ_{k=1, k≠l}^{K} q_k (p_k^T p_l)

The first term in the LAM output is the desired output and the second term, which is called

crosstalk, is the interference of the other stored patterns to the true pattern. If the crosstalk term

is small, the LAM can retrieve the pattern corresponding to the input. Notice that if the stored

patterns are orthogonal to each other, the crosstalk term will be zero resulting in a perfect recall.

Therefore, the number of patterns that can be stored perfectly in a LAM is limited by the size of










input space, which is one of the limitations of LAMs for information processing (Principe et al.,

2000; Haykin, 1998).
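
A small numeric sketch of forced Hebbian storage and recall, including the crosstalk term, is given below; the pattern dimensions and the random patterns are illustrative.

import numpy as np

# Sketch: forced Hebbian storage in a LAM and recall with crosstalk.
rng = np.random.default_rng(0)
P, Q, K = 50, 10, 5                             # input dim, output dim, pattern count
p = rng.standard_normal((K, P))
p /= np.linalg.norm(p, axis=1, keepdims=True)   # normalize the input patterns
q = rng.standard_normal((K, Q))

W_lam = sum(np.outer(q[k], p[k]) for k in range(K))   # W^LAM = sum_k q_k p_k^T

y = W_lam @ p[0]                  # recall with the first stored input
# y = q_0 + crosstalk; the residual below is the crosstalk magnitude, which
# vanishes when the stored input patterns are mutually orthogonal.
print(np.linalg.norm(y - q[0]))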

The output patterns q_k for the pairs (p_k, q_k), k = 1, ..., K, can be interpreted as the desired

response for the LAM. The existence of a desired response allows the application of supervised

learning to train LAMs using unconstrained optimization techniques such as least squares. For

instance, the weights can be modified using the least-mean-squares algorithm by an amount

given by

ΔW^LAM = η e_k p_k^T = η (q_k - y_k) p_k^T

This equation is a combination of the desired forced Hebbian term (η q_k p_k^T) and an anti-Hebbian

term (-η y_k p_k^T) that decorrelates the present output y_k from the input p_k. The anti-Hebbian term

reduces the crosstalk at each iteration improving the performance of LAMs in terms of crosstalk.

A LAM trained with LMS is called an optimal linear associative memory (OLAM) (Principe et

al., 2000).

One question follows naturally: what is the difference between a LAM and a linear

regressor? The linear regressor fits a single hyperplane to all the desired samples, so we

normally want more patterns than the size of the input dimension to generalize, but it loses

information about the covariance of the data space (Principe et al., 2000). On the other hand, in

the LAM, the output is as close as possible to each of the training samples. This different

behavior is controlled by the number of data points used for training, which must be less

than the input dimensionality to minimize crosstalk. The MACE filter that we will discuss

next is a cascade of a preprocessor (a whitening filter over the class of training data) followed by

a LAM (Fisher, 1997).









The MACE Filter

The technique of matched spatial filters (MSF) or correlators for optical pattern recognition

has been well studied and used for the recognition of 2-D objects (Vanderlugt, 1964; Kumar,

1986). The goal is to design an optimal correlator filter that represents an object class under

noise and/or distortions using a set of training images. The reason it is possible to match a single

template to a class of objects is because of the huge number of weights (degrees of freedom) that

the correlator has in 2-D (N^2 weights for an NxN input image). Moreover, the MACE filter can

incorporate the covariance information of the class in its weights. The MACE filter is the most

widely used MSF due to its superior discrimination properties (Mahalanobis, 1987; Casasent and

Ravichandran, 1992).

The formulation of the MACE will be presented in the frequency domain for convenience

and ease of implementation (Mahalanobis, 1987). Consider a set of training images for a single

object class. The ith training image (i = 1, ..., K) is described as a 1-D discrete sequence (obtained by

lexicographically ordering the columns of the image) denoted by x_i = [x_i(1), x_i(2), ..., x_i(d)]^T, where

d and K are the number of pixels in the image and the number of training images for the class,

respectively. The discrete Fourier transform of x_i is denoted by X_i. Define the overall dxK

training matrix by X = [X_1, X_2, ..., X_K]. The dx1 column vector h denotes the coefficients of a

filter in the space domain and H its Fourier transform. The correlation function of the ith image with

the filter h is given by g_i = h ⊗ x_i. The filter output g_i defines a 3-dimensional space where the x

and y axes are the indices (lags) of the 2-dimensional correlation function and the z-axis is the value

of the correlation for the corresponding indices. We will call it the correlation space. The energy

of the ith correlation plane in the frequency domain is E_i = H^† D_i H, where D_i is a dxd diagonal









matrix whose elements are the magnitude squares of the associated elements of X_i. The average

correlation plane energy over the training images is defined as


E_avg = (1/K) Σ_{i=1}^{K} E_i = H^† [ (1/K) Σ_{i=1}^{K} D_i ] H = H^† D H (Mahalanobis, 1987).

The MACE filter solution designs a correlation filter that minimizes the output energy

while providing predetermined values c = [c_1, c_2, ..., c_K]^T at the origin of the correlation plane.

Each ci is the ith peak correlation value obtained by correlating the ith image with the filter H at

zero lag. More specifically, the problem is to solve the constraint optimization problem

(Mahalanobis, 1987):

min_H H^† D H subject to X^† H = c

The solution to this problem can be found analytically using the method of Lagrange

multipliers (Mahalanobis, 1987) and is given by

H = D^{-1} X (X^† D^{-1} X)^{-1} c (4-1)

The selection of the exact value of c is somewhat arbitrary since it simply scales the peak

correlation amplitude. Usually, the entries of c are chosen to be 1 for in-class training images and

a smaller value (around 0.1) for out-of-class images, if used at all (Mahalanobis, 1987). The

MACE filter solution achieves superior discrimination properties when compared to other

existing correlation filters such as synthetic discriminant function filters (Mahalanobis, 1987;

Casasent and Ravichandran, 1992).
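
Since D is diagonal, Equation 4-1 reduces to elementwise operations in the frequency domain. The sketch below illustrates this, assuming the training images arrive as equally sized 2-D arrays; the function name and the small floor added to D are assumptions.

import numpy as np

# Sketch of the MACE filter solution (Equation 4-1): H = D^{-1} X (X^† D^{-1} X)^{-1} c.
# `images` is a list of K equally sized 2-D training images, c the K constraints.
def mace_filter(images, c):
    # Lexicographic (column-wise) ordering, then DFT: one column of X per image.
    X = np.stack([np.fft.fft(img.ravel(order='F')) for img in images], axis=1)
    D = np.mean(np.abs(X) ** 2, axis=1)       # diagonal of D: average power spectrum
    D = np.maximum(D, 1e-12)                  # guard against zero frequency bins
    Dinv_X = X / D[:, None]                   # D^{-1} X, exploiting the diagonal D
    A = X.conj().T @ Dinv_X                   # X^† D^{-1} X, a small K x K matrix
    return Dinv_X @ np.linalg.solve(A, np.asarray(c, dtype=complex))

# The space-domain template is np.fft.ifft(H).real reshaped to the image size;
# a test image is then scored by correlating it with the template at zero lag.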

The ESN/LSM MACE for Dynamical Pattern Recognition

Pattern recognition is normally formulated as the design of a nonlinear static system

trained discriminantly with the information of a set of labels, with data that are vectors in R^n. An

alternative to this approach is obtained by creating models for each class which are only trained









with the class data (nondiscriminant training). When the number of classes to be classified is

unknown at the training time (open set classification), the latter provides robust results. The

correlator or matched filter is a very well known example of a system that recognizes a single

class data (Helstrom, 1995; Proakis, 2001), and can be easily extended to multiple dimensions.

Dynamical pattern recognition can also be formulated in a similar manner by using

dynamical systems as classifiers and training them discriminantly with an appropriate desired

signal. Instead, in this study, we propose to use the MACE filter as a readout for ESN/LSMs for

dynamical pattern recognition tasks, nondiscriminantly trained for each class. In ESN/LSM, the

proposed associative memory considers a time window of states which can be interpreted as a 2-

D image, one dimension being time and the other the space of the states (see Figure 4-1). Let us

explain the operation of ESN-MACE in mathematical terms.

Assume that we have P different classes of temporal patterns. The goal is to compute one

MACE filter, h^p, for each class. Assume also that for each class, the training set consists of K

input patterns each with a particular temporal shape of length T. The procedure is to compute

each h^p for p = 1, ..., P as follows.

1. The ith training input pattern for the pth class, u_i = [u_i(1), u_i(2), ..., u_i(T)], of dimension

MxT is used to calculate the echo states using Equation 1-1.

2. The resulting NxT matrix of echo states forms the equivalent of a 2-D image in section

2.2 (Figure 4-1). The echo states are then lexicographically ordered by the columns to get the 1-

D column vector x_i = [x_i(1)^T, x_i(2)^T, ..., x_i(T)^T]^T with NxT elements. Here each x_i(n) is an Nx1

vector with the value of the echo state at time n for the ith training sequence.









3. The discrete Fourier transform of x_i is denoted by X_i. Define the overall dxK training

matrix in the frequency domain by X = [X_1, X_2, ..., X_K].

4. The optimal coefficients of the LAM for the class are computed using Equation 4-1 in

the frequency domain and the corresponding MACE filter weights h^p are obtained by inverse

discrete Fourier transform. The output of the MACE for the ith training input pattern for the pth

class can be obtained by x_i^T h^p.
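
A sketch of steps 1 through 3 for a single class is shown below; run_esn is a hypothetical helper that drives the reservoir with an MxT input pattern and returns the NxT matrix of echo states with zero initial conditions.

import numpy as np

# Sketch of steps 1-3 for one class (illustrative names).
def class_training_matrix(run_esn, patterns):
    cols = []
    for u in patterns:                 # the K training patterns of the class
        S = run_esn(u)                 # step 1: N x T echo-state "image"
        x = S.ravel(order='F')         # step 2: lexicographic ordering by columns
        cols.append(np.fft.fft(x))     # step 3: DFT of the N*T state vector
    return np.stack(cols, axis=1)      # the d x K matrix X used in Equation 4-1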



[Figure 4-1 image: the triangular input signal, the overlay plot of echo states, and the echo state image over time.]


Figure 4-1. Interpretation of echo states as a 2-D image. A) The triangular input signal. B) The
echo states of a 100-unit ESN. C) The echo state image where a point in state index
and time is interpreted similarly to a pixel in an image.

The same procedure is repeated for the training sequences of other classes to obtain an optimal

filter for each class.









During testing, the input signal can be either a continuous stream (asynchronous operation)

or a series of frames (synchronous operation). Depending on the mode of operation, the choice of

the initial conditions for ESN in step 1 varies. In synchronous operation, the timing of the signal

and the frames of interest are already known both in training and testing. Therefore, ESN is

initialized with the same zero initial condition both in training and testing. The zero initial

condition is chosen such that the system dynamics is not biased and the system states are

controlled by the input. During testing, the MACE output is calculated for each frame by

correlating the filter coefficients with the echo states generated by the frame at zero lag and with

a zero initial condition.

When the signal timing is unknown (asynchronous), the MACE filter generates an output

at each time instant n by correlating the filter coefficients with the echo states between time

instants n-T+1 and n, and the maximum in time above a certain pre-specified threshold will be

picked to represent the occurrence of the pattern. In case of a continuous stream, the initial

conditions for each frame are dictated by the history of the input signal during testing; therefore

they are not necessarily zero. In order to mimic the test conditions while training the MACE, the

initial conditions of the ESN are not set to zero but determined by the history of the input. In

other words, the whole input signal is used to create echo states and then the resulting NxT

dimensional echo state images are extracted and used to train the MACE filter using the above

algorithm.
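
The sketch below illustrates this asynchronous scoring, assuming the MACE template has been brought back to the space domain as an NxT array h; the detection threshold is left as a user choice.

import numpy as np

# Sketch of asynchronous testing: at each time n, correlate the MACE template
# (an N x T space-domain image h) with the last T columns of the running echo
# states at zero lag; peaks above a pre-specified threshold mark detections.
def sliding_mace_scores(states, h):
    N, L = states.shape
    T = h.shape[1]
    scores = np.full(L, -np.inf)
    for n in range(T - 1, L):
        window = states[:, n - T + 1 : n + 1]   # echo states between n-T+1 and n
        scores[n] = np.sum(window * h)          # zero-lag correlation
    return scores

# detections = np.where(scores > threshold)[0]  # threshold chosen by the user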

The Design of ESNs for Dynamical Pattern Recognition

One of the drawbacks of LAMs is the storage capacity that is limited by the input space

dimension (Haykin, 1998). However, the ESN/LSM with a LAM readout becomes very powerful

since echo/liquid states provide the LAM with a user-defined high-dimensional input space,

increasing the number of patterns that can be stored in the LAM with minimal crosstalk. In










ESNs, the reservoir states can be interpreted as a set of basis functionals (a representation space)

constructed dynamically and nonlinearly from the input (Ozturk et al., 2007). Due to the

nonlinearity, it is easy to get states that are linearly independent of each other (Ito, 1996; Ozturk

et al., 2007). Therefore, the matrices in the MACE filter solution are typically full-rank, allowing

the computation of inverses. Although linearly independent, the states are not orthogonal to each

other since they are connected through predefined weight values. An information-theoretic

metric, called average state entropy (ASE), has been recently proposed to describe the richness

of ESN dynamics (Ozturk et al., 2007). ASE quantifies the distribution of state values across the

dynamic range. When ASE is maximized, the states of the system are evenly distributed across

the dynamic range instead of being populated only on a few values. The ASE measure is related

to the system design by uniformly distributing the poles of the linearized system around the

origin (eigenvalues of W) (Ozturk et al., 2007). We propose to use the same design methodology

to construct the representation layer of ESNs with the LAM readout. This approximates for

ESNs the well-known property of orthogonal basis and eases the LAM training. In the

experiments performed for this study, we have observed that the inverted matrices are well

conditioned (as measured by the condition number) for ESNs.

Likewise for LSMs, the value of each liquid state at each time instant is interpreted by the

MACE as 1 or 0 depending on the existence or absence of a spike. This creates a very sparse

state matrix, which likely leads to singular matrices in the MACE filter computation. Therefore,

the matrices have to be regularized in order to compute appropriate MACE filter weights with

the outlined formulation. The regularization can be done by adding zero mean Gaussian noise

with a small variance into the states before computing the MACE solution.
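
A minimal sketch of this regularization step is shown below; the variance of 0.02 matches the value used in the spike train experiment later in this chapter, and the seed is illustrative.

import numpy as np

# Sketch: regularize a sparse 0/1 liquid-state image by adding zero mean
# Gaussian noise with a small variance before the MACE computation.
def regularize_states(S, var=0.02, seed=0):
    rng = np.random.default_rng(seed)
    return S + rng.normal(0.0, np.sqrt(var), size=S.shape)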










Experiments

This section presents a variety of experiments to compare the MACE filter readout

proposed for ESN/LSM with the conventional techniques for dynamical pattern recognition.

Classification of temporal signals with unknown peak values

In this experiment, the aim is to classify a temporal signal into one of two possible classes.

The two classes are represented by a triangular and a step function of 20 samples. The interesting

property of the signals in this experiment is that the peak values are unknown and can assume

any value between 0.5 and 1. In other words, each class is not a signal but a family of signals.

The signals are also corrupted with additive, zero mean, white Gaussian noise.

We compare the performance of three different ESNs, denoted by R-ESN1-MACE, R-

ESN2-MACE, and ASE-ESN-MACE, each with the MACE filter readout. All ESNs have 30 PEs and are

constructed using different techniques. R-ESN1-MACE is a randomly connected ESN where the

entries of W matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1, 0.1, respectively. This

corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. R-ESN2-MACE is a

randomly connected ESN where the entries of the W matrix are chosen from a Gaussian

distribution with zero mean. The weights are scaled such that the spectral radius is set to 0.9. The

internal weight matrix, W, of the ASE-ESN-MACE is designed such that the eigenvalues of W

(poles of the linearized system around origin) are uniformly distributed inside the unit circle.

This principle is introduced in (Ozturk et al., 2007) and it distributes the states of the system

evenly across the dynamic range instead of populating them only on a few locations. The entries

of W^in are set to 1 or -1 with equal probability for all three ESNs.













[Figure 4-2 plot: correct classification rate versus noise standard deviation for ASE-ESN-MACE, R-ESN1-MACE, R-ESN2-MACE, the predictive ESN, and the discriminative ESN.]

Figure 4-2. Performance comparisons of ESNs for the classification of signals with unknown
peak values. ESNs with the MACE filter readout perform better for all noise levels
compared to discriminative and predictive ESNs. The classification accuracy of ASE-
ESN-MACE is almost perfect up to a noise standard deviation of 0.3. ASE-ESN-
MACE performs better than randomly connected ESNs with MACE readout since it
distributes the echo states uniformly and increases separation between states
corresponding to different classes.

To train the MACE filter readout for the ESNs, 100 different realizations for each class of

signals are generated and the echo states are calculated using Equation 2-2 with zero initial

conditions. This creates an image of dimension 30x20, where 30 is the number of PEs and 20 is

the signal length. One MACE filter is trained for each class using only data from the

corresponding class. Output correlation peak amplitudes are assigned to be 1.0 for the training

data. The two MACE filters are synthesized in the frequency domain using Equation 4-1 and the









corresponding image plane filters are obtained by inverse discrete Fourier transform. For a given

test signal, each filter output is calculated by correlating the echo states of dimension 30x20 with

the filter coefficients at zero lag. Figure 4-2 shows the correct classification rate of the 3 methods

as the standard deviation of the noise signal varies. The results are averaged over 100 realizations

of each ESN.

We also compare the performance of ESNs with the MACE filter readout to the ASE-ESN

utilizing the conventional linear readout. When the conventional readout is used for dynamical

pattern recognition, the problem is the design of the desired signal. One solution is to train a

discriminative readout with constant target output signal of amplitude 1 for one of the two

classes and a constant signal of amplitude -1 for the other class. The duration of the target signals

is the same as the duration of the input signal. During testing, echo states, the output signal and

the corresponding error signals, one for each label, are computed for a given input signal. Then,

the two MSE values, integrated over the duration of the input pattern, are compared to assign the

class of the input pattern according to the lowest MSE. In this solution, the readout has to generate a

constant signal independent of the input signal dynamics. An alternative also tested is to train

two independent predictive readouts, one for each class to predict the next sample of the input

signal. During testing, the echo states, the output signal and the error signal are computed for

each readout. Then, the label of the readout with the lower MSE value, integrated over the

duration of the input pattern, is assigned as the class of the input signal. The predictive readout

seems a more natural choice since the desired response is not independent of the input dynamics.

The same training data used for the MACE filter is employed to calculate the readout weights for

the discriminative and predictive ESNs.










Figure 4-2 shows the correct classification rate as the standard deviation of the noise signal

varies. The results are averaged over 100 trials. As observed from the figure, ESNs with the

MACE filter readout perform better for all noise levels compared to the ESNs with conventional

or predictive readouts. The classification accuracy of ASE-ESN-MACE is almost perfect up to a

noise standard deviation of 0.3. The correct classification rate of the discriminative ESN for the

no-noise case is only 0.92, while that of the predictive ESN is 0.95, slightly better than the

discriminative ESN. Since all three systems have the same reservoir, the better performance of

the MACE readout is attributed to the fact that the problem requires modeling a class of

functions all at once in a single template, and the MACE filter is able to build such a template

containing covariance information about the full training set. The conventional and predictive

readouts train for the best compromise of the input class peak amplitude of the training set.

Being a nonlinear system, ESN dynamics are coupled to the input signal. Therefore, the response

of the system to input signals with varying amplitudes can be quite different making the task of

projecting the states to a constant value very challenging. The wave shape is also challenging,

since for the pulse (a sequence of zero values followed by a sequence of ones) the linear readout

has to transform the highly irregular states generated by this pattern to a constant value.

Similarly, for the predictive ESN the variation in the input amplitudes affects the ESN dynamics,

and the performance drops fast as the noise standard deviation is increased. Moreover, the MACE

filter readout is expected to be more robust with respect to the selection of W and W^in in the ESN design

compared to the conventional readout design methodology where the states have to be carefully

selected in order to estimate the desired signal with a linear readout. In contrast, the LAM

readout does not try to estimate a desired response but rather generates a template from the

states.










Among the ESNs with MACE readout, the ASE-ESN performs the best compared to

randomly generated ESNs. This was an expected result, since ASE-ESN distributes the echo

states uniformly and increases separation between states corresponding to different classes.

Therefore, ASE-ESN should be the preferred design for ESNs with MACE filter readout.

Robust signal detection in a digital communication channel

We applied the ESN-MACE to the detection of a known waveform transmitted through a digital

communication channel. Please refer to Chapter 6 for details of this study.

Classification of odors with an electronic nose

In this experiment, the goal is to classify odors using signals collected with a Cyranose 320

electronic nose. The Cyranose 320 has an array of 32 sensors which change their resistance in

response to odors and produce a unique response pattern for different odors (Figure 4-3). For our

experiment, the spices basil, cinnamon, ginger, and rosemary form the 4 classes of odors to be

classified. Each of these spices is mixed with air and exposed to the electronic nose in 4 different

concentrations: 25%, 50%, 75% and 100%. For each concentration of each spice, data

corresponding to 10 different trials is collected. There are a total of 160 trials, 40 for each class

of spice. We pursue two approaches to solve the problem. The first approach is based on

obtaining static features that are the steady-state values of all 32 sensors as given by the

Cyranose. We obtain a 32x160 matrix containing 32 features for all 160 trials of the 4 classes.

From the feature matrix, we extract the part corresponding to 40 trials (10 trials for each class

with representatives from 4 different concentrations) as the training set to train a linear classifier

with 4 outputs, each representing one of the 4 classes. The linear classifier is used due to the

availability of an analytical solution. Each output of the linear classifier is trained to output 1

when the input feature is from the representative class and 0 otherwise. The 4x32 weight matrix

of the linear network is computed using least squares. During testing, the 32x1 feature vector is











[Figure 4-3 image: response of the 32 electronic nose sensors over time.]


Figure 4-3. Time-series representing the response pattern of the 32 electronic nose sensors
exposed to rosemary.

used to compute the 4 outputs and the label of the output with the maximum value is assigned as

the class of the input vector. The classification accuracy of the linear classifier evaluated on the

160 feature vectors is found to be 78.7% and the confusion matrix is found to be

40 0 0 0
4 22 14 0
0 0 40 0
0 0 16 24


The entry a_ij in row i and column j of the confusion matrix is the number of odors that

should have been classified as class i but were classified as class j. The correct decisions are given on









the main diagonal of the confusion matrix. The 4 classes are presented in the order basil,

cinnamon, ginger, and rosemary.

The second approach is based on the proposed ESN-MACE. The ESN exploits the dynamical

response of the sensors which also contain information about the odor class. We create a

template MACE filter for each class from the states of the ESN which represents the variations

among different spice concentrations. Again, 40 trials (10 trials for each class with

representatives from 4 different concentrations) are used as the training set. Unlike the static

case, each training trial consists of a 32 dimensional time series representing the response of the

electronic nose. We use an ESN with 100 internal units and 32 inputs. The internal weight

matrix, W, of the ESN is designed with uniform eigenvalue distribution and a spectral radius of

0.9. The entries of W^in are set to 5 or -5 with equal probability. In the previous two experiments,

the selection of the length of the patterns was dictated by the known signal lengths. However, in

this experiment, we do not have pre-defined signals, making the selection of the pattern length, T,

a design parameter (Figure 4-3). Here, we choose T to be 80 so that the template pattern contains

the rising and falling edges of the sensor response. This selection uses the transient sensor

dynamics of the particular odor in the design of the MACE readout. For each input pattern, the

echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image

of dimension 100x80, where 100 is the number of PEs and 80 is the signal length. 10 in-class

patterns are used to train each one of the MACE filters corresponding to 4 classes. Output

correlation peaks are assigned to be 1.0. During testing, echo states are calculated for each

pattern and 4 MACE filter outputs are calculated. The label of the MACE filter with the highest

value is assigned as the class of the input pattern. The classification accuracy of the ESN-MACE

evaluated on the 160 input patterns is found to be 63.1% and the confusion matrix is found to be









21 18 1 0
5 35 0 0
7 16 17 0
0 1 11 28

The results are, indeed, worse than those of the linear classifier trained on static features. When

examined closely, the MACE filter gives correlation values around 1.0 for the in-class data.

However, the response of the MACE filter corresponding to the 2nd class (cinnamon) for out-of-class

data is bigger than 1 for most of the patterns, which results in the false classifications. The MACE

filter trained nondiscriminantly does not constrain the response of the MACE for out-of-class data.

One solution to this problem is to use out-of-class data in the MACE filter training as is done in

(Mahalanobis et al., 1987). Different from the previous experiments, both in-class and out-of-

class training data are used to train each one of the 4 MACE filters corresponding to the 4 classes.

For each MACE filter, 10 in-class and 30 out-of-class (10 for each of the other classes) echo state images are

used for discriminative training. Output correlation peak amplitudes are assigned to be 1.0 for the

in-class and 0.1 for the out-of-class training data.

Using the training data, four MACE filters are synthesized in the frequency domain using

Equation 4-1 and the corresponding image plane filters are obtained by inverse discrete Fourier

transform. For a given test signal, the output of each MACE filter is calculated by correlating

the echo states of dimension 100x80 with the filter coefficients. The label of the filter with the

largest correlation is assigned as the class of the input pattern. Out of the 160 trials, the correct

classification rate is found to be 93.1%. Note that this is much better than the performance of the

classifier based on static features. Similarly, the confusion matrix is found to be









38 2 0 0
0 37 3 0
0 5 35 0
0 0 1 39


The addition of out-of-class data for training of the MACE improves the rejection of the

out-of-class patterns and hence decreases the false alarm rate. However, it also converts the

problem to a classification problem (closed set classification) where the number of classes has to

be known a priori. It is desirable, whenever possible, to avoid the use of out-of-class data in the

training data for detection of signals where the out-of-class is either not available a priori or can

be any possible signal shape excluding the in-class signal shape (the entire signal space

excluding the in-class signal space).

Alternatively, we trained MACE filters in the input space instead of the state space of the

ESN. In the input space, the images are of dimension 32x80. The procedure of training MACE

filters for each class is repeated in the 32 dimensional input space. Out of 160 trials, the correct

classification rate is found to be 85.6%. The confusion matrix of the MACE filter is

34 5 1 0
3 32 5 0
1 7 32 0
0 0 1 39

In order to understand the increase in classification rate by the ESNs, let us compare the

outputs of the ESN-MACE filter with those of the MACE filter trained directly in the input space for the

rosemary class (Figure 4-4).

As observed from the figures, the ESN-MACE filter output corresponding to the rosemary

class displays values around one consistently for the in-class data except for the 37th testing

sample.
















[Figure 4-4 plots: outputs of the 4 MACE filters (panel A) and the 4 ESN-MACE filters (panel B) over the test data index.]


Figure 4-4. Comparison of the ESN-MACE filter output and the MACE filter output trained on the

input space for the rosemary class. A) Outputs of the 4 MACE filters over the test
data index. B) Outputs of the 4 ESN-MACE filters over the test data index.









The ESN-MACE filter outputs corresponding to the other classes are very close to 0.1 as

specified by the training procedure. However, the MACE filter outputs trained directly in the

input space have very inconsistent outputs for the out of class samples. The margin between the

in-class and out-of-class outputs of the MACE filter is increased by the ESN reservoir. There are

mainly two reasons for the increase in the distance between the images with the addition of ESN.

The first reason is the projection of the signals to a higher dimension by the ESN, which increases

the separation between the images. Secondly, ESN functions as a nonlinear filter that helps

separate the responses to different input classes.

We also compared the results obtained to a TDNN trained in discriminative mode. Four

TDNNs, one for each class, are used, where the desired target signal is +1 for in-class data and 0

otherwise. Each TDNN has a delay line of 81 samples (equal to the signal duration) for each one

of the 32 sensors, 3 hidden layer PEs and a single output unit. The same data (10 in-class and 30

out-of-class) used for the discriminative ESN-MACE is used to train the weights of the TDNN using the

Levenberg-Marquardt algorithm with a step size of 0.01. Training is stopped after 100 epochs. A

cross-validation set cannot be used due to the limited number of training samples. During testing, an

input pattern is fed to each one of the 4 TDNNs and the label of the TDNN with the largest

output is assigned as the class of the input pattern. Out of 160 trials, the correct classification rate

is found to be 65.7%. The confusion matrix of the filter is

25 13 2 0
3 29 7 1
3 10 21 6
6 2 2 30


The performance of the TDNN is very poor compared to discriminative ESN-MACE and

even the static classifier. The main reason for the poor performance is the lack of sufficient data










for training the TDNN. The number of parameters of each TDNN is 32*81*3+3=7779 whereas

only 40 training patterns are available for training.

Spike train classification with LSM

The associative memory readout presented for the ESN can in fact also be utilized as a readout

for a LSM. Similar to the ESN case, the liquid states are interpreted as 2-D images, one

dimension time and the other space. An optimal MACE filter, which associates LSM states with

class labels, can be analytically computed for each class using training data. The value of the

liquid state is either 1 or 0 depending on the existence or absence of a spike. Note that the MACE

filter readout does not require low-pass filtering the spike trains in order to get continuous

amplitude signals for further processing.

We use a spike train classification experiment, where the goal is to classify a spike train

presented to the LSM as one of the two classes (Maass et al., 2002). Two classes are represented

by randomly generated Poisson distributed spike trains over a 100 ms time interval. Other members

of each class are generated by moving each spike of the representative spike train by an amount

drawn from a zero mean Gaussian distribution with a variance of 4ms.

A randomly connected LSM with 50 integrate-and-fire neurons with 20% inhibitory (I)

and 80% excitatory (E) connections is used. The membrane time constant, reset voltage,

threshold voltage, background current and input resistance of the neurons are chosen to be 30ms,

13.5mV, 15mV, 13.5nA, and 1 MΩ, respectively. Absolute refractory periods for excitatory and

inhibitory neurons are 3ms and 2ms, respectively (Maass et al., 2002). The probability of a

synaptic connection between neurons a and b is given by C · exp(-(D(a,b)/λ)^2), where D(a,b) denotes the

distance between two neurons and λ is a parameter which controls both the average number of

connections and the average distance between neurons that are synaptically connected. The









neurons are placed on a 10x5x1 grid. C is set to 0.3 (EE), 0.2 (EI), 0.4 (IE), and 0.1 (II). Dynamic

synapses are used based on the model proposed in (Markram et al., 1998), with the synaptic

parameters U (use), D (depression time constant), F (facilitation time constant) randomly chosen

from Gaussian distributions. Mean values of U, D, F are set to .5, 1.1, .05 (EE), .05, .125, 1.2

(EI), .25, .7, .02 (IE), .32, .144, .06 (II), and the SD of each parameter was chosen to be 50% of its

mean. The mean of the scaling parameter W (in nA) is chosen from a Gamma distribution with

mean values 30 nA (EE), 60 nA (EI), -19 nA (IE), -19 nA (II) (Maass et al., 2002). For input

synapses, the parameter W has a value of 18 nA or 9.0 nA for excitatory and inhibitory

destination neurons, respectively. The SD of W is 100% of its mean. The postsynaptic current was

modeled as an exponential decay exp(-t/ts) where ts is 3ms and 6ms for excitatory and inhibitory

synapses, respectively. The transmission delays between neurons are 1.5 ms (EE) and 0.8 ms for the

other connections. The initial membrane voltage of each neuron at the start of each simulation is

drawn from a uniform distribution between 13.5mV and 15.0mV (Maass et al., 2002).
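
As an illustration of the distance-dependent connection rule above, the sketch below draws a Boolean connection mask for the 10x5x1 grid; using a single C and a fixed λ is an illustrative simplification, since C actually varies with the synapse type as listed.

import numpy as np

# Sketch: draw synaptic connections with probability C * exp(-(D(a,b)/lambda)^2)
# for 50 neurons on a 10x5x1 grid (illustrative single C value).
def connection_mask(C=0.3, lam=2.0, seed=0):
    rng = np.random.default_rng(seed)
    coords = np.array([(x, y, 0) for x in range(10) for y in range(5)], dtype=float)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    prob = C * np.exp(-(D / lam) ** 2)
    return rng.random(prob.shape) < prob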

Fifty different realizations of spike trains for each class are generated and the liquid states

are calculated for each realization. One readout MACE filter is trained for each class using only

data from the corresponding class. Output correlation peak amplitude is assigned to be 1.0 for the

training data. Notice that the state matrix is very sparse resulting in singular matrices in the

MACE filter computation. Therefore, the matrices have to be regularized in order to compute

inverses. Here the regularization is done by adding zero mean Gaussian noise with a variance of

0.02 to the states before computing the MACE solution. The two MACE filters are synthesized

in the frequency domain using Equation 4-1 and the corresponding image plane filters are obtained by

an inverse discrete Fourier transform. For a given test signal, the output of each filter is

calculated by correlating the liquid states with the filter coefficients.










Spike train classification with LSMs is conventionally done by first low-pass filtering the

spike output of the liquid and using a linear regressor network (Maass et al., 2002). Here, we

used a low-pass filter with a time constant of 30ms. A linear readout is trained with the constant

target signal of amplitude +1 for one of classes and the constant signal of amplitude -1 for the

other class using the same data as MACE filter training. During testing, the echo states, the

output signal and two error signals one between the output and the constant signal of amplitude

+1 and another between the output and the constant signal of amplitude -1 are computed for a

given input signal. Then, two MSE values, integrated over the duration of the input pattern, are

compared to assign the class of the input pattern.

Figure 4-5 shows the correct classification rate for both methods as the parameter λ, which

controls the synaptic connections in the LSM, varies from 0.2 to 4. The results are averaged over

100 trials. As observed from the figure, the MACE filter readout gives slightly better accuracy

compared to the linear readout for all λ values. This is due to the fact that the linear readout tries

to approximate a constant signal independent of the liquid state values. This creates difficulties

for the simple linear readout especially when the input spike train is sparse. In fact, under closer

analysis, the output of the linear readout fluctuates. Therefore, the decision made by the linear

readout has to be carefully averaged over time in order to reach a final decision about the class of

the input spike train. For example, if the decision is based solely on the output of the readout at

the final time instant, the classification rate is very close to the theoretical minimum of 0.5. On the

other hand, the MACE filter readout provides the ability to operate in the spike domain without

the need to low-pass filter the liquid state spike output. However, we note that the MACE filter

computation in Equation 4-1 requires extra caution since the liquid state has many zero values

mostly leading to singular matrices.











[Figure 4-5 plot: correct classification rate versus λ for the linear readout and the MACE filter readout.]


Figure 4-5. Comparison of the correct classification rates of the LSM-MACE and the LSM with linear
readout trained in the spike train classification problem as the parameter λ varies. The
MACE filter readout gives slightly better accuracy compared to the linear readout for
all λ values. This is due to the fact that the linear readout tries to approximate a
constant signal independent of the liquid state values. Moreover, the MACE readout
operates in the spike domain, eliminating the need to convert spike trains to
continuous valued signals by low-pass filtering.

This experiment demonstrates the use of the MACE filter as a readout for LSMs. A more

detailed study has to be carried out to explore the advantages and disadvantages of MACE

readout compared to the standard techniques.









CHAPTER 5
IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS

Echo state networks introduced a new type of computation that has been proven to be very

powerful. In this chapter, we study how the ESN idea can be utilized for the design of other

networks. First, we investigate the echo state condition (||W|| < 1) and approach the echo state

condition in terms of the effect of system stability on the power of a dynamical system for

computation. In particular, we relax the echo state condition and allow some of the poles to be

larger than 1. This introduces a new dynamical regime, called "transiently stable computation",

where function approximation with a readout is still possible even though the system is not

globally stable. Second, we investigate the biologically plausible model of the olfactory cortex, the

Freeman Model (FM), and propose a readout for the FM so that it can be used for useful

computation. Without the proposed readout, the use of FM is limited to simple digit recognition.

An interesting property of FM is the nonlinear function used in the model. FM nonlinearity does

not have its largest slope at the zero operating point, unlike the sigmoidal nonlinearity used in ESNs.

We will demonstrate with experiments that the FM coupled with a readout can process continuous-

valued signals.

Transiently Stable Computation

Linear filters are the simplest dynamical systems employed to process signals with temporal structure, as discussed in chapter 1. They can be classified according to the impulse response, which fully characterizes the filter, as finite impulse response (FIR) and infinite impulse response (IIR) filters (Haykin, 2001). For the IIR case, stability becomes an essential constraint in the design. Without stability, the system response will diverge independently of the input signal. Global asymptotic stability of linear filters can simply be guaranteed by selecting the poles of the system in the proper region of the complex plane, the open left half plane for continuous-time systems and inside the unit circle for discrete-time systems (Kailath, 1980).

TDNNs and RNNs can be considered the nonlinear counterparts of FIR and IIR filters, respectively. Similar to the linear case, stability becomes an issue for RNNs, although the definition of stability is no longer BIBO (bounded input bounded output) stability. In ESNs, the stability issue is addressed with the echo state condition (Jaeger, 2001), which has been investigated in detail in chapter 2. The echo state condition is derived from the stability of the linearization of the system around the zero equilibrium and guarantees that the states are strongly coupled with the input signal, allowing the system to compute an input-output map. In this chapter, we introduce a new computational mode that emerges when we relax the echo state condition and allow the spectral radius to be slightly greater than 1. First, let us consider an ESN that obeys the echo state condition.

Conventional Echo State Networks

A simple demonstration of the typical response of an ESN satisfying the echo state condition is given in Figure 5-1. A randomly connected 100-unit ESN without output feedback connections is constructed. The entries of the internal connection weight matrix W are set to 0.4, -0.4 and 0 with probabilities of 0.025, 0.025 and 0.95, resulting in a spectral radius of 0.9. The input weight matrix, W^in, has values +1 or -1 with equal probabilities. The input to the system (Figure 5-1 A) has 3 different regions: a zero signal for the first 100 steps, a ramp signal for the following 100 steps, and again a zero signal of length 100. The system is initialized randomly and run with the given input.
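The construction just described can be summarized in a few lines of code. The following sketch (in Python, with illustrative variable names) assumes tanh processing elements and a unit-amplitude ramp, and rescales the random W so that its spectral radius is exactly 0.9.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100
    # Entries of W: 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95.
    W = rng.choice([0.4, -0.4, 0.0], size=(N, N), p=[0.025, 0.025, 0.95])
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9
    W_in = rng.choice([1.0, -1.0], size=N)            # input weights +1 or -1
    # Input: 100 zeros, a 100-step ramp, 100 zeros.
    u = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.zeros(100)])
    x = rng.uniform(-1, 1, N)                         # random initial state
    X = np.empty((len(u), N))                         # echo states, one row per step
    for n in range(len(u)):
        x = np.tanh(W @ x + W_in * u[n])
        X[n] = x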









[Figure 5-1 appears here with three panels: the input signal, the states of the 100-unit ESN, and the overlay plot of the ESN output and desired signal, each over 350 time steps.]

Figure 5-1. Demonstration of a typical response of ESN with a spectral radius of 0.9 and
application to function approximation. A) The input signal. B) 100 echo states of
ESN. C) The overlay plot of the ESN output and the desired signal.



Figure 5-1 B depicts all 100 echo states on top of each other over the given time interval. As seen from the figure, during the first 100 steps the system state converged to zero (after an initial transient of about 30 steps caused by the random initial condition), since the input to the system is zero and the spectral radius is less than 1. Then, the echo states are driven by the ramp signal during the interval between time steps 100 and 200. The states again converge to zero, due to the echo state condition, when the input is removed. The resulting echo states can be used to generate any desired response related to the input signal. Here, the desired signal is chosen to be the seventh power of the input signal, and the readout weights are calculated using Equation 2-4. Figure 5-1 C displays the overlay plot of the optimal ESN output and the desired signal. As observed from the figure, the system output very accurately estimates the desired signal.
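Continuing the sketch above, the readout can be computed in one line, assuming that Equation 2-4 is the least-squares (Wiener) solution over the collected echo states:

    d = u ** 7                                    # desired: seventh power of the input
    w_out = np.linalg.lstsq(X, d, rcond=None)[0]  # optimal linear readout weights
    y = X @ w_out                                 # ESN output tracking d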

Transiently Stable Computation with Echo State Networks

Engineering systems are mostly designed with a global stability restriction. We have seen a similar constraint, the echo state condition, in the design of echo state networks. We have also demonstrated with a function approximation experiment that ESNs satisfying the echo state condition can do useful computation. On the other hand, biological computation may not possess a similar stability constraint (Freeman, 1975; Freeman, 1993; Yao and Freeman, 1990). Indeed, biological systems do not have fixed point dynamics but rather exhibit a wide variety of collective behavior in the form of oscillations and even chaos (Freeman, 1975; Yao and Freeman, 1990). Inspired by principles of biological computation, we would like to explore the network's response without the restriction on the spectral radius of W. Hence, we will remove the echo state condition and let the spectral radius be slightly greater than 1. Obviously, this will introduce instability for the autonomous system, whereas the response of the system with an applied input is not yet obvious.


































[Figure 5-2 appears here with two panels: the states of the 100-unit transiently stable ESN, and the overlay plot of the ESN output and desired signal, each over 350 time steps.]





Figure 5-2. Demonstration of a typical response of transiently stable ESN with a spectral radius
of 1.1 and application to function approximation. A) 100 echo states of the transiently
stable ESN. B) The overlay plot of the ESN output and the desired signal.


In order to observe the ESN response without the echo state condition, we will use an experimental setup similar to that of the previous section. We use the same W matrix, scaled to obtain a spectral radius of 1.1. The same input signal (Figure 5-1 A) is fed to the ESN, which is initialized to a random initial condition. We plot the resulting echo states in Figure 5-2 A.

There are a few observations to be made at this point. First of all, we see that the system does not converge to zero during the first 100 steps even though the input is zero. Instead, the echo states exhibit a nonconvergent dynamical response that differs from the ESN with the echo state condition. In fact, this is the expected response, since the system is designed to be unstable (spectral radius greater than one) around the zero equilibrium. A similar response can also be observed during the last 100 time steps, when the input is again zero. The second and more important observation is the response of the system between time steps 100 and 200, where the ramp signal is applied as the input. The echo states during this interval become more regular (after some initial transient between time steps 100 and 130) compared to the states when there is no input. In fact, after the transient, the echo states look similar to the ones in Figure 5-1 B, where the ESN satisfies the echo state condition. Thirdly, there is a transition period between time steps 100 and 130 where the states switch from a disordered mode to a more regular mode. In summary, the system responds according to its own dynamics when there is no input. As the input amplitude gradually increases, the system response is determined by a competition between the system dynamics and the input amplitude. When the input amplitude is sufficiently large, the system dynamics becomes "transiently stable" and is determined by the input signal. We would like to find out if we can utilize the transiently stable ESN for the same function approximation problem of the previous section. The weights of the linear readout network are again computed using Equation 2-4, and the corresponding output is shown in Figure 5-2 B. As seen from the figure, we obtain a good match between the output and the desired signal. Similar results can be obtained even if the system is initialized to a different initial condition, or if the time instant at which the ramp signal is applied (and hence the state value at that instant) changes. This experiment clearly demonstrates that a "transiently stable" ESN can do useful computation.
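Reproducing this regime from the earlier sketch only requires rescaling W past the echo state condition before rerunning the state-update loop:

    W_ts = W * (1.1 / np.max(np.abs(np.linalg.eigvals(W))))   # spectral radius 1.1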

Understanding Transiently Stable Computation

We will utilize the linearization analysis proposed in chapter 2.2 in order to quantify the local dynamics of the system. The pole movement of the ESN with spectral radius 0.9 is illustrated in Figure 5-3. According to the figure, the poles always stay inside the unit circle even as the input signal changes. As the input signal strengthens, the poles move towards the origin of the z-plane (decreasing the spectral radius), which results in a more stable system. This is due to the fact that larger input signals saturate the nonlinear function, reducing the slope at which the system is operating. Similar observations have already been made in chapter 2.2.











[Figure 5-3 appears here: four pole plots in the complex plane (real vs. imaginary axes), panels A-D.]

Figure 5-3. Movement of the poles for the ESN with spectral radius 0.9 are plotted in figures A, B, C, D, when the input is at the points labeled by A, B, C, D in Figure 5-1 A, respectively.










Now, let us examine the movement of the poles of the transiently stable ESN of the previous section. Figure 5-4 shows the movement of the poles for the same input signal. First of all, the poles of the system move in and out of the unit circle during the first and last 100 time steps (hence the zero state is not asymptotically stable). This explains the complex echo state signals when there is no input to the system (Figure 5-2 A). In contrast, the poles of the ESN with spectral radius 0.9 stay inside the unit circle during the same time interval. Secondly, when the ramp input is










[Figure 5-4 appears here: four pole plots in the complex plane (real vs. imaginary axes), panels A-D.]

Figure 5-4. Movement of the poles for ESN with spectral radius 1.1 are plotted in figures A, B,
C, D, when the input is at points labeled by A, B, C, D in Figure 5-1 A, respectively.

applied, the poles move back toward the inside of the unit circle (after some transient), as is observed in the stable ESN. This phenomenon is again due to the sigmoidal nature of the nonlinearity, which saturates with large input signals, reducing the slope of the operating point in Equation 2-5. With the movement of the poles towards the origin, the system stabilizes transiently, allowing the system state to be controlled by the input. This is critical for the use of the system for computation (function approximation). Finally, we observe a transient period, when the ramp signal is applied, before the poles start moving inward. In this period, there is a competition between the system dynamics and the input signal. There are a few extra comments to be made at this point about the selection of the spectral radius. First of all, the transient response time after the ramp is applied is determined by two factors, the spectral radius and the slope of the ramp signal. The transient time will increase if the spectral radius of W is increased or the slope of the ramp signal is reduced. This is because, as the spectral radius is increased, the force required to bring the poles inside the unit circle is increased; hence a larger signal is required. Therefore, there is a balance between the system dynamics determined by the spectral radius and the input signal required to stabilize the system. Secondly, it is important to characterize the unstable behavior of the signals present in the transiently stable ESN when the input is absent. We have observed that when the spectral radius is slightly greater than 1, the signals are periodic oscillations, whereas for larger values of the spectral radius, the signals are more irregular and even chaotic.

Freeman's K Sets for Dynamical Computation

Understanding information processing in brains remains a challenge. Many different hypotheses on how the brain might process the massive bombardment of information brought by the sensory systems have been advanced, ranging from artificial intelligence and neural networks to, more recently, machine learning and statistics (Freeman, 1975; Yao and Freeman, 1990; Chua and Yang, 1988; Wang, 1999). Walter Freeman developed a biologically realistic mathematical model of the olfactory cortex that captures, at the cell assembly (mesoscopic) level, some of the physiological properties of the olfactory system. The Freeman model (FM) is based on nonlinear recurrent dynamical interactions representing the response of thousands of cells, the neural assemblies, instead of single neuron responses. In order to derive his model, Freeman used neurophysiological recordings, neuroscience concepts, and the K set hierarchy (Freeman, 1975) to build a model that mimics the behavior of olfaction (Freeman, 1975). This model is a drastic departure from other neural network models such as Hopfield's, because the system response is produced by interactions of second order nonlinear sets.

Freeman's major drive is to understand how the olfactory system works, while here we aim at using the FM as a computational framework for the interaction of animats with the real world. Real worlds are complex dynamical environments that unfold unpredictably, with lots of redundancy but also with many unrelated cues. In this setup, the fundamental operations are the recognition of prior situations that will induce prototypical responses dependent upon the animat's goals. The constraints of the experimental setting are also rather important. The operation is intrinsically real-time in the sense that only the present value of the continuous stream of information from the sensors is available and must be processed at each time step, while the full understanding of the situation is most often related to the input history. So we favor signal processing methodologies instead of statistics, where the behavior in time is always rather constrained. The animat may also need to learn how to recognize new situations when they become important for its function; therefore it must have internal representational structures that expect outcomes and change behaviors. Brains process information for the sake of the animal's survival, while animats have external goals preprogrammed by humans, and as such they must possess an extra level of functionality that translates internal representations back and forth to the outside, i.e., it is necessary to define inputs and outputs (in the animal, the brain's only output is the motor system). Therefore we have to emphasize this translation between distributed dynamical representations and outputs in the form of design principles for optimal readouts.

The framework we are pursuing is based on distributed dynamics instead of machine learning. Since Hopfield, the role of dynamics in associative memories (dynamical systems with fixed point attractors) is well understood. In fact, the FM has also been utilized as a dynamical auto-associative memory (Tavares, 2001; Ozturk et al., 2004; Xu et al., 2004). The FM weights are trained with Hebbian learning while two binary patterns to be stored are presented to the system successively. During testing, a noise corrupted version of one of the two stored patterns is presented to the system. Then, the energy of the oscillatory system response is computed and compared to a threshold to choose the correct stored pattern. We see two major shortcomings with the energy based readout. First, utilizing just the energy is wasteful, since it ignores the information embedded in the dynamics of the signals. Second, it does not exploit the global information in the states, since it is a local method computing the energy of individual processing elements.

In this chapter, we demonstrate that the FM can be considered in the same framework as the ESN/LSM. In fact, the KI and KII networks of the FM are conceptually similar to the reservoir/liquid in an ESN/LSM with a proper selection of the parameters. The big difference is that the KII layer is highly structured, with a coupled set of nonlinear oscillators. We will derive conditions on the system parameters of the FM similar to the echo state property of ESNs. With the proposed framework, it becomes evident that the OB layer alone lacks a readout in order to be used as a universal computing machine for time series. Therefore, we propose to use an adaptive linear network to implement the readout from the OB layer. We present experimental results to show the power of this framework for the FM of the olfactory cortex. In particular, we show that it is not necessary to drive Freeman's model with 0s and 1s, as is presently done.

An Overview of Freeman Model

The Freeman Model is a biologically plausible mathematical model of the olfactory cortex. The dynamical behavior of the FM mimics the physiological signals obtained from the cortex. It is a hierarchical model of different levels (K0, KI, KII and KIII), where simpler structures at the bottom of the hierarchy combine to form the more complex structures at the top of the hierarchy.

The basic building block of the model is the K0 set, which is depicted in Figure 5-5 A. Every K0 unit can accept several spatial inputs that are weighted and summed, and then convolved with a linear time invariant system defined by the second order dynamics whose transfer function, H(s), is given by

$$H(s) = \frac{ab}{(s+a)(s+b)} \qquad (5\text{-}1)$$

where 1/a and 1/b are real time constants determined experimentally (Freeman, 1975). The output of the linear dynamics is then shaped by the nonlinear sigmoid function, which is experimentally determined to be

$$Q(x, Q_m) = \begin{cases} Q_m \left( 1 - e^{-(e^x - 1)/Q_m} \right) & \text{if } x > x_0 \\ -1 & \text{if } x < x_0 \end{cases} \qquad (5\text{-}2)$$
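A small sketch of this nonlinearity, assuming the usual cutoff $x_0 = \ln(1 - Q_m \ln(1 + 1/Q_m))$ (the point where the two branches of Equation 5-2 meet) and an illustrative $Q_m = 5$:

    import numpy as np

    def Q(x, Qm=5.0):
        # Asymmetric sigmoid of Equation 5-2: saturates at -1 below x0 and at Qm
        # for large x, with its steepest slope at x = ln(Qm), away from x = 0.
        x0 = np.log(1.0 - Qm * np.log(1.0 + 1.0 / Qm))
        return np.where(x > x0,
                        Qm * (1.0 - np.exp(-(np.exp(x) - 1.0) / Qm)),
                        -1.0)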

The sign of the connection strength from a K0 set defines the type of the K0 set (excitatory for positive weights and inhibitory for negative weights). The second level in the hierarchy is the KI network. K0 sets with a common sign (either excitatory or inhibitory) are connected through forward lateral feedback to construct a KI network (Figure 5-5 B). No auto-feedback is allowed in the network. The KI network is generally used as an input layer for the higher levels of the Freeman model. It preprocesses the input, transforming it into a space which is compatible with the OB dynamics. In this study, we will use the KI network as an echo state network containing a representation of the input history.













[Figure 5-5 appears here: block diagrams of the K0 set (weighted inputs, second order dynamics H(s), asymmetric sigmoid Q(.)), the excitatory and inhibitory KI sets, the KII and reduced KII sets, and a plot of the asymmetric nonlinear function.]

Figure 5-5. The building blocks of the Freeman Model. A) The K0 set. B) KI network. C) KII set. D) Reduced KII set. E) The asymmetric nonlinear function.









The third level in the hierarchy, the KII set (Figure 5-5 C), is the most interesting and important building block of the olfactory system, since it is an oscillator controlled by the input. The response of the KII set to an impulse input is a damped oscillation, whereas with a sustained input, the output oscillation is maintained as long as the input remains. The architecture of the KII set is shown in Figure 5-5 C, where the circles denote K0 sets and the sign indicates the type of connection. Figure 5-5 D shows a reduced KII (RKII) set, where the two K0 units are denoted by M (for the mitral cell) and G (for the granular cell). In this model, the mitral cell takes the external input P(t), and the coupling strengths between M and G are controlled by the two weights Kmg > 0 (excitatory) and Kgm < 0 (inhibitory). If the coupling weights are selected properly, the RKII set is, like the full KII set, an oscillator controlled by the input (Freeman, 1975).

A KII network can be formed by interconnecting a number of KII sets with excitatory connections between the excitatory cells (top circles in Figure 5-6) and inhibitory connections between the inhibitory cells (bottom circles in Figure 5-6). An RKII network is formed similarly by connecting a number of RKII sets. We will use RKII networks in our simulations instead of KII networks, since various analytic tools are available for the RKII and it has been demonstrated that the RKII is functionally very similar to the KII (Freeman, 1975; Xu et al., 2004). This interconnected structure represents a key stage of learning and memory in the olfactory system. Input patterns applied through the M cells are mapped into spatially distributed outputs. Excitatory and inhibitory interconnections enable cooperative and competitive behaviors, respectively, in this network. The RKII network functions as an associative memory (Tavares, 2001; Xu et al., 2004; Principe et al., 2001; Ozturk et al., 2004).











Figure 5-6. A full KII network.

The final Katchalsky level is the KIII network, which represents the olfactory cortex. In a KIII network, two or three KII sets and a KII network are tightly coupled through dispersive connections, mimicking the different lengths and thicknesses of axons. Since the intrinsic oscillating frequencies of the KII sets in the different layers are incommensurate among themselves, this network of coupled oscillators will present chaotic behavior. For a detailed description of the KIII network, we refer the reader to (Freeman, 1975; Xu et al., 2004).

Now, we will formulate a standard form for the state equations (the outputs of the K0 sets) of the Freeman model, valid for all hierarchical sets of the FM. Consider a Freeman network with M input units, N K0 sets and L output units. The state equations for this Freeman model can be expressed as

$$\ddot{x}(t) + (a+b)\,\dot{x}(t) + ab\,x(t) = ab\left(W Q(x(t)) + W^{in} u(t)\right) \qquad (5\text{-}3)$$

Here, $x(t) = (x_1(t), x_2(t), \ldots, x_N(t))$ is the state vector, where each entry is the output of a K0 set at time t, and $u(t) = (u_1(t), u_2(t), \ldots, u_M(t))$ is the input vector. W is an N×N matrix defining the interconnection weight values between the K0 sets, and $W^{in}$ is an N×M matrix defining the interconnection weights between the inputs and the K0 sets.

For example, consider the RKII set depicted in Figure 5-5 D. The governing equations for the RKII set are given by

$$\ddot{m}(t) + (a+b)\,\dot{m}(t) + ab\,m(t) = ab\left(K_{gm} Q(g(t)) + p(t)\right)$$
$$\ddot{g}(t) + (a+b)\,\dot{g}(t) + ab\,g(t) = ab\,K_{mg} Q(m(t)) \qquad (5\text{-}4)$$

In an RKII set, the number of K0 sets is two and the number of inputs is one. Hence, in the standard form, M is one and N is two. With this information, Equation 5-4 can be restated in the standard form of Equation 5-3 with the state vector $x(t) = [m(t), g(t)]^T$, input vector $u(t) = [p(t)]$, input weight vector $W^{in} = [1, 0]^T$ and weight matrix

$$W = \begin{bmatrix} 0 & K_{gm} \\ K_{mg} & 0 \end{bmatrix}$$
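As a sanity check of these equations, the following Euler-integration sketch simulates a single RKII set, reusing the Q and numpy import of the earlier sketch; the step size, couplings, and time constants are illustrative values, not Freeman's experimentally determined ones. With a sustained input, m(t) settles into a maintained oscillation, as described above.

    a, b = 0.22, 0.72                 # 1/a, 1/b play the role of the time constants
    Kmg, Kgm = 1.5, -1.5              # excitatory / inhibitory couplings (illustrative)
    dt, T = 0.01, 20000
    m = np.zeros(2); g = np.zeros(2)  # [value, derivative] of each K0 set
    trace = np.empty(T)
    for n in range(T):
        p = 1.0 if n >= 2000 else 0.0                   # input switched on at n = 2000
        m_dd = -(a + b) * m[1] - a * b * m[0] + a * b * (Kgm * Q(g[0]) + p)
        g_dd = -(a + b) * g[1] - a * b * g[0] + a * b * Kmg * Q(m[0])
        m = m + dt * np.array([m[1], m_dd])             # Euler step for Equation 5-4
        g = g + dt * np.array([g[1], g_dd])
        trace[n] = m[0]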


Dynamical Computation with Freeman Model with a Readout

In order to be able to compute with a dynamical system, one has to be aware of how the information is represented in the system states. Once the form of representation is known, the desired information can be extracted from the system states with an appropriate readout mechanism. For instance, in DeLiang Wang's network (Wang, 1999), the phase difference between the coupled oscillators is the source of information. For the FM, it has been argued that the information is encoded as amplitude modulated signals, and hence the energy of the K0 output was used as the relevant information from the FM in practical applications. For example, the RKII network has been proposed as an auto-associative memory that can store static binary patterns (Tavares, 2001; Xu et al., 2004; Ozturk et al., 2004). In (Xu et al., 2004), an RKII network of 64 reduced KII sets is constructed with fixed Kmg, Kgm and Kii values (determined experimentally from biological data). The excitatory weights of the RKII network are trained with Oja's unsupervised learning rule (a stable version of the Hebbian learning rule) while two binary patterns to be stored are presented to the system successively. During testing, a noise corrupted version of one of the two stored patterns is presented to the system. Each RKII set creates limit cycle oscillations of low or high energy depending on the stored patterns and the input pattern applied. Then, the energy (averaged over a time interval) of the mitral K0 set of each RKII set is computed and compared to a threshold to decide the binary system output. We see two major shortcomings with the energy based readout for the FM. First, utilizing just the energy is wasteful, since it ignores the information embedded in the dynamics of the system state. Second, it does not exploit the global interdependencies among the RKII sets, since it is a local method computing the energy of individual processing elements. Hence, the energy based readout cannot be optimal and cannot be used for continuous dynamic patterns.

It is clear that a more principled approach is necessary in order to use the FM to process signals that are not necessarily binary or static. By analogy with the LSM/ESN framework, we propose to use a readout which is a linear or nonlinear projection of all the RKII states. The role of the readout is to extract the specified desired information from the FM states.






















Figure 5-7. Freeman Model as an associative memory. A) Stored patterns. B) Recall of the corrupted patterns: the first, second and third columns are the input to the KII network, the output before the threshold detector, and the output after the threshold detector, respectively.

We will also demonstrate that Freeman's K sets provide a complete system; that is to say, the readout can also be chosen from the Freeman hierarchy, namely a K0 set or a KI network, depending on the output layer dimension. When a linear readout is used, the adaptation of the overall system reduces to a linear regression problem, which can be solved either offline with the Wiener filter equation or online with a gradient-based learning rule. We will use the KI network and KII network architectures analogously to the liquid in the LSM or the reservoir in the ESN. The use of the KIII network as the dynamical reservoir is outside the scope of this study. In summary, with the adoption of the LSM/ESN framework, the states of the FM are dynamical basis functionals (a representation space) created from nonlinear combinations of the input, and the readout simply finds the best projection of the desired response in this representation space.

We will provide two theorems on the echo state property of the Freeman RKII network. The first theorem states sufficient conditions on the parameter values of an RKII set resulting in echo states. In the second theorem, we give a sufficient condition for the nonexistence of echo states for all Freeman networks. Unfortunately, we do not have a general theoretical result for the sufficient conditions resulting in echo states for all Freeman networks in the standard form given by Equation 5-3.

Theorem 4.1: Consider a reduced KII set with the governing equations given by Equation 5-4. Let the parameter values $K_{mg}$ and $K_{gm}$ satisfy $|K_{mg}K_{gm}| < \frac{ab}{(a+b)^2 (Q'_{max})^2}$, where $Q'_{max}$ denotes the maximum value of the derivative of the nonlinear function Q, and a and b are the filter coefficients. Then the network has echo states for all inputs u and for all states x. The proof follows from the linear stability analysis of the RKII set, which has been extensively studied in (Xu and Principe, 2005).

Proof: Consider the governing equations for the RKII set given in Equation 5-4. An equilibrium point of the nonlinear differential Equation 5-4 satisfies

$$m_0 = K_{gm} Q(g_0) + P$$
$$g_0 = K_{mg} Q(m_0) \qquad (5\text{-}5)$$

Using the nullclines defined by Equation 5-5, we can see that for a given set of parameters and a given input, the system has a unique equilibrium point (Figure 5-8).

Now consider the state space representation of Equation 5-4,










$$\dot{m}_1 = m_2$$
$$\dot{m}_2 = -(a+b)\,m_2 - ab\,m_1 + ab\left(K_{gm} Q(g_1) + p(t)\right)$$
$$\dot{g}_1 = g_2$$
$$\dot{g}_2 = -(a+b)\,g_2 - ab\,g_1 + ab\,K_{mg} Q(m_1) \qquad (5\text{-}6)$$

where $m_1 = m$, $m_2 = \dot{m}$, $g_1 = g$, and $g_2 = \dot{g}$.







[Figure 5-8 appears here: the nullclines $m_0 = K_{gm} Q(g_0) + P$ and $g_0 = K_{mg} Q(m_0)$ plotted in the (g, m) plane, intersecting at the unique equilibrium point.]
Figure 5-8. Nullcline graph of the reduced KII set.

Then, the linearization of the system around the unique equilibrium point leads to the Jacobian matrix

$$J = \begin{bmatrix} 0 & 1 & 0 & 0 \\ -ab & -(a+b) & ab\,K_{gm}\,Q'(g_0) & 0 \\ 0 & 0 & 0 & 1 \\ ab\,K_{mg}\,Q'(m_0) & 0 & -ab & -(a+b) \end{bmatrix} \qquad (5\text{-}7)$$

Checking the eigenvalues of Equation 5-7, we see that the equilibrium is stable if $|K_{mg}K_{gm}| < \frac{ab}{(a+b)^2\,Q'(m_0)\,Q'(g_0)}$, and the equilibrium is unstable if $|K_{mg}K_{gm}| > \frac{ab}{(a+b)^2\,Q'(m_0)\,Q'(g_0)}$. The system has a bifurcation point if $|K_{mg}K_{gm}| = \frac{ab}{(a+b)^2\,Q'(m_0)\,Q'(g_0)}$ (Xu and Principe, 2005).


Now, assume that the parameters $K_{mg}$ and $K_{gm}$ satisfy $|K_{mg}K_{gm}| < \frac{ab}{(a+b)^2 (Q'_{max})^2}$. Then the unique equilibrium point of the system will be stable. This means that, independent of the initial conditions, the system will approach the same trajectory for a given set of parameter values and input sequence. Moreover, the states will converge to zero when there is no input.
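Numerically, the condition of Theorem 4.1 is easy to check; for the Q(x, Qm) sketched earlier, the maximum slope occurs at x = ln(Qm):

    def has_echo_states(Kmg, Kgm, a, b, Qm=5.0):
        # Sufficient condition of Theorem 4.1 for an RKII set.
        Qp_max = Qm * np.exp(-(Qm - 1.0) / Qm)   # max derivative of Q(x, Qm)
        return abs(Kmg * Kgm) < a * b / ((a + b) ** 2 * Qp_max ** 2)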

Theorem 4.2: Consider a Freeman network containing N K0 sets with the standard form given in Equation 5-3. Define the effective weight matrix

$$\tilde{W} = \begin{bmatrix} 0 & I \\ -ab\,(I - W) & -(a+b)\,I \end{bmatrix} \qquad (5\text{-}8)$$

where I denotes the N×N identity matrix and 0 denotes the N×N zero matrix. Let the effective weight matrix $\tilde{W}$ satisfy $\mathrm{real}(\lambda_{max}) > 0$, where $\lambda_{max}$ is the eigenvalue of $\tilde{W}$ with the largest real part. Then the network has an asymptotically unstable null state. This implies that it has no echo states for any input set containing 0.

Proof: Consider a Freeman network with governing equations in the standard form given by Equation 5-3. In state-space form, Equation 5-3 can be expressed as

$$\dot{x}_1 = x_2$$
$$\dot{x}_2 = -(a+b)\,x_2 - ab\,x_1 + ab\left(W Q(x_1) + W^{in} u\right) \qquad (5\text{-}9)$$

where $x_1$ and $x_2$ are both N dimensional vectors. For zero input, Equation 5-9 has the trivial null solution. Now, consider the Jacobian matrix of Equation 5-9 from linearization around the origin,