AN AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE FILTERING:
THEORY, ALGORITHMS AND APPLICATIONS
YADUNANDANA NAGARAJA RAO
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
YADUNANDANA NAGARAJA RAO
This dissertation is dedicated to my family, teachers and friends for their enduring love,
support and friendship.
First of all, I would like to thank Dr. Jose Principe, for his constant guidance,
encouragement, patience and continuous support over the past five years. His enthusiasm
for research and quest for excellence have left an everlasting impression in my mind. To
me, he has been more than an advisor, and this research would not have been possible
without him. Secondly, I would like to thank Dr. John Harris for being on my committee
and offering me guidance not only in research but in many other aspects of life. I would
also like to thank Dr. Michael Nechyba and Dr. Mark Yang for being on my committee.
I would like to thank Dr. Deniz Erdogmus, my friend and colleague at CNEL,
whose contributions in this research have been tremendous. I deeply benefited from all
the long hours of fruitful discussions with him on a multitude of topics. His drive for
research and enormous ability to motivate others have been quite inspirational. I also
wish to extend my acknowledgements to all the members of CNEL who have been
primarily responsible for my fruitful stay in the lab. I would like to extend my gratitude
to the always cheerful Ellie Goodwin for her golden words of wisdom. Her ability to get
things done was truly remarkable. I would also like to acknowledge Linda Kahila for the
extensive support and assistance she provided during my stay at UFL.
I would like to thank my family and friends for their constant love and
encouragement. They have allowed me to pursue whatever I wanted in life. Without their
guidance and affection, it would have been impossible for me to advance my education.
Lastly, I would like to thank my life partner Geetha for making my life beautiful
and for being on my side whenever I needed her. Her everlasting love has made me a
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

1 MEAN SQUARED ERROR BASED ADAPTIVE SIGNAL PROCESSING SYSTEMS: A BRIEF REVIEW
    Introduction
    Why Do We Need Adaptive Systems?
    Design of Adaptive Systems
    Least Mean Squares (LMS) Algorithm
    Recursive Least Squares (RLS) Algorithm
    Other Algorithms
    Limitations of MSE Criterion Based Linear Adaptive Systems
    Total Least Squares (TLS) and Other Methods
    Limitations of TLS
    Extended TLS for Correlated Noise
    Other Methods
    Summary

2 AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE SYSTEMS
    Introduction
    Error Whitening Criterion (EWC)
    Motivation for Error Whitening Criterion
    Analysis of the Autocorrelation of the Error Signal
    Augmented Error Criterion (AEC)
    Properties of Augmented Error Criterion
    Shape of the Performance Surface
    Analysis of the Noise-free Input Case
    Analysis of the Noisy Input Case
    Orthogonality of Error to Input
    Relationship to Error Entropy Maximization
    Note on Model-Order Selection
    The Effect of β on the Weight Error Vector
    Numerical Case Studies of AEC with the Theoretical Solution
    Summary

3 FAST RECURSIVE NEWTON TYPE ALGORITHMS FOR AEC
    Introduction
    Derivation of the Newton Type Recursive Error Whitening Algorithm
    Extension of the REW Algorithm for Multiple Lags
    Relationship to the Recursive Instrumental Variables Method
    Recursive EWC Algorithm Based on Minor Components Analysis
    Experimental Results
    Estimation of System Parameters in White Noise Using REW
    Effect of β and Weight Tracks of REW Algorithm
    Performance Comparisons between REW, EWC-TLS and IV Methods
    Summary

4 STOCHASTIC GRADIENT ALGORITHMS FOR AEC
    Introduction
    Derivation of the Stochastic Gradient AEC-LMS Algorithm
    Convergence Analysis of AEC-LMS Algorithm
    Proof of AEC-LMS Convergence for β > 0
    Proof of AEC-LMS Convergence for β < 0
    On-line Implementations of AEC-LMS for β < 0
    Excess Error Correlation Bound for EWC-LMS
    Other Variants of the AEC-LMS Algorithms
    AEC-LMS Algorithm with Multiple Lags
    Simulation Results
    Estimation of System Parameters in White Noise
    Weight Tracks and Convergence
    Inverse Modeling and Controller Design Using EWC
    Summary

5 LINEAR PARAMETER ESTIMATION IN CORRELATED NOISE
    Introduction
    Existing Solutions
    Criterion for Estimating the Parameters in Correlated Noise
    Stochastic Gradient Algorithm and Analysis
    Simulation Results
    System Identification with the Analytical Solution
    System Identification with Stochastic Gradient Algorithm
    Verification of the Local Stability of the Gradient Algorithm
    Extensions to Correlated Noise in the Desired Data
    Experimental Results
    System Identification
    Stochastic Algorithm Performance
    Summary

6 ON UNDERMODELING AND OVERESTIMATION ISSUES IN LINEAR SYSTEM ADAPTATION
    Introduction
    Undermodeling Effects
    Overestimation Effects
    Experimental Results
    Summary

7 CONCLUSIONS AND FUTURE DIRECTIONS
    Conclusions
    Future Research Directions

A FAST PRINCIPAL COMPONENTS ANALYSIS (PCA) ALGORITHMS
    Introduction
    Brief Review of Existing Methods
    Derivation of the Fixed-Point PCA Algorithm
    Mathematical Analysis of the Fixed-Point PCA Algorithm
    Self-Stabilizing Fixed-Point PCA Algorithm
    Mathematical Analysis of the Self-Stabilizing Fixed-Point PCA Algorithm
    Minor Components Extraction: Self-Stabilizing Fixed-Point PCA Algorithm

B FAST TOTAL LEAST-SQUARES ALGORITHM USING MINOR COMPONENTS ANALYSIS
    Introduction
    Fast TLS Algorithms
    Simulation Results with TLS
    Simulation 1: Noise Free FIR Filter Modeling
    Simulation 2: FIR Filter Modeling with Noise

C ALGORITHMS FOR GENERALIZED EIGENDECOMPOSITION
    Introduction
    Review of Existing Learning Algorithms
    Fixed-Point Learning Algorithm for GED
    Mathematical Analysis

D SOME DERIVATIONS FOR THE NOISY INPUT CASE

E ORTHOGONALITY OF ERROR TO INPUT

F AEC AND ERROR ENTROPY MAXIMIZATION

G PROOF OF CONVERGENCE OF ERROR VECTOR NORM IN AEC-LMS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH
LIST OF TABLES
1-1. Outline of the RLS Algorithm.
3-1. Outline of the REW Algorithm.
LIST OF FIGURES
1-1. Block diagram of an Adaptive System.
1-2. Parameter estimates using RLS algorithm with noisy data.
2-1. Schematic diagram of EWC adaptation.
2-2. The MSE performance surfaces, the AEC contour plot, and the AEC performance surface for three different training data sets and 2-tap adaptive FIR filters.
2-3. Demonstration scheme with coloring filter h, true mapping filter w, and the uncorrelated white signals.
2-4. The average squared error-norm of the optimal weight vector as a function of autocorrelation lag L for various β values and SNR levels.
2-5. The average squared error-norm of the optimal weight vector as a function of filter length m for various β values and SNR levels.
2-6. Histograms of the weight error norms (dB) obtained in 50 Monte Carlo simulations using 10000 samples of noisy data using MSE (empty bars) and EWC with β = -0.5 (filled bars). The subfigures in each row use filters with 4, 8, and 12 taps, respectively. The subfigures in each column use noisy samples at -10, 0, and 10 dB SNR, respectively.
2-7. Error autocorrelation function for MSE (dotted) and EWC (solid) solutions.
3-1. Histogram plots showing the error vector norm for EWC-LMS, LMS algorithms and the numerical TLS solution.
3-2. Performance of REW algorithm with (a) SNR = 0 dB and (b) SNR = -10 dB over various β values.
3-3. Weight tracks for REW and RLS algorithms.
3-4. Histogram plots showing the error vector norms for all the methods.
3-5. Convergence of the minor eigenvector of G with (a) noisy data and (b) clean data.
4-1. Histogram plots showing the error vector norm for EWC-LMS, LMS algorithms and the numerical TLS solution.
4-2. Comparison of stochastic versus recursive algorithms.
4-3. Contour plots with the weight tracks showing convergence to saddle point.
4-4. Weight tracks for the stochastic algorithm.
4-5. Contour plot with weight tracks for different initial values for the weights.
4-6. Contour plot with weight tracks for EWC-LMS algorithm with sign information (left) and without sign information (right).
4-7. EWC performance surface (left) and weight tracks for the noise-free case with and without sign information (right).
4-8. Block diagram for model reference inverse control.
4-9. Block diagram for inverse modeling.
4-10. Plot of tracking results and error histograms.
4-11. Magnitude and phase responses of the reference model and designed model-controller pairs.
5-1. System identification block diagram showing data signals and noise.
5-2. Histogram plots showing the error vector norm in dB for the proposed and MSE criteria.
5-3. Weight tracks for LMS and the stochastic gradient algorithm in the system identification example.
5-4. Weight tracks for LMS and the stochastic gradient algorithm showing stability around the optimal solution.
5-5. Histogram plots of the error norms for the proposed method and MSE.
5-6. Weight tracks showing the convergence of the stochastic gradient algorithm.
6-1. Undermodeling effects with input SNR = 0 dB (left) and input SNR = 5 dB (right).
6-2. Crosscorrelation plots for EWC and MSE for undermodeling.
6-3. Crosscorrelation plots for EWC and MSE for overestimation.
6-4. Power normalized error crosscorrelation for EWC and MSE with overestimation.
6-5. Weight tracks for LMS and the stochastic gradient algorithm in the case of undermodeling.
A-1. Representative network architecture showing lateral connections.
B-1. Estimation of minor eigenvector.
B-2. Minimum eigenvalue estimation.
B-3. Comparison between the estimated and true filter coefficients using TLS.
B-4. Comparison between the estimated and true filter coefficients using RLS.
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
AN AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE FILTERING:
THEORY, ALGORITHMS AND APPLICATIONS
Yadunandana Nagaraja Rao
Chair: Jose C. Principe
Cochair: John G. Harris
Major Department: Electrical and Computer Engineering
Ever since its conception, the mean-squared error (MSE) criterion has been the
workhorse of optimal linear adaptive filtering. However, it is a well-known fact that the
MSE criterion is no longer optimal in situations where the data are corrupted by noise.
Noise, being omnipresent in most engineering applications, can result in severe
errors in the solutions produced by the MSE criterion. In this dissertation, we propose
novel error criteria and the associated learning algorithms followed by a detailed
mathematical analysis of these algorithms. Specifically, these criteria are designed to
solve the problem of optimal filtering with noisy data. Firstly, we discuss a new criterion
called the augmented error criterion (AEC) that can provide unbiased parameter estimates
even in the presence of additive white noise. Then, we derive novel, online sample-by-
sample learning algorithms with varying degrees of complexity and performance that are
tailored for real-world applications. Rigorous mathematical analysis of the new
algorithms is presented.
In the second half of this dissertation, we extend the AEC to handle correlated
noise in the data. The modifications introduced will enable us to obtain optimal, unbiased
parameter estimates of a linear system when the data are corrupted by correlated noise.
Further, we achieve this without explicitly assuming any prior information about the
noise statistics. The analytical solution is derived and an iterative stochastic algorithm is
presented to estimate this optimal solution.
The proposed criteria and the learning algorithms can be applied in many
engineering problems. System identification and controller design problems are obvious
areas where the proposed criteria can be efficiently used. Other applications include
model-order estimation in the presence of noise and design of multiple local linear filters
to characterize complicated nonlinear systems.
MEAN SQUARED ERROR BASED ADAPTIVE SIGNAL PROCESSING SYSTEMS:
A BRIEF REVIEW
Conventional signal processing techniques can be typically formulated as linear or
non-linear operations on the input data. For example, a finite impulse response (FIR) filter is a linear combination of time-delayed versions of the input signal. We know that a linear combiner is nothing but a linear projector in the input space. Mathematically speaking, a projection can be defined as a linear transformation between two vector spaces. These linear transformations can be vectors or matrices spanning the appropriate real vector spaces. For vector transformations the projections are given by inner products, and in the case of matrix transformations the projections become rotations. Often, most of the design tasks in signal processing involve finding appropriate projections that perform the desired operation on the input. For instance, the filtering task is basically finding the projection that preserves only a specified part of the input information. Another
example is data compression, wherein we estimate an optimal projection matrix or
rotation matrix that preserves most of the information in the input space. The first step in
finding these projections is to understand the specifications of the problem. Then, the
specifications are translated into mathematical criteria and equations that can be solved
using various mathematical and statistical tools. The solutions thus obtained are often
optimal with respect to the criterion used.
Why Do We Need Adaptive Systems?
Depending on the problem at hand, estimating the optimal projections can be a
daunting task. Complexities can arise due to the non-availability of a closed form solution
or even the non-existence of a feasible analytical solution. In the latter case, we may have
to be content with sub-optimal solutions. On the other hand, scenarios exist where we
have to synthesize projections that are not based on user specifications. For instance,
suppose we are given two signals, an input and a desired signal, and the goal is to find the
optimal projection (filter) that generates the desired signal from the input. Thus the
specifications do not convey any explicit information regarding the type of filter we have
to design. The conventional filter synthesis cookbook does not contain any recipes for
these types of problems. Such problems can be solved by learning mechanisms that
intelligently deduce the optimal projections using only the input and desired signals or at
times using the input signal alone. These learning mechanisms form the foundation of
adaptive systems and neural networks. All learning mechanisms have at least two major
pieces associated with them. The first is the criterion and the second is the search
algorithm. The search algorithm finds the best possible solution in the space of the inputs
under some constraints. Optimization theory has provided us with a variety of search
techniques possessing different degrees of complexity and robustness. These learning-based adaptive systems provide us with a powerful methodology that can go beyond conventional signal processing. The projections derived by these adaptive systems are called optimal adaptive projections. Another very desirable feature of adaptive systems is
their innate ability to automatically adjust and track according to the changing statistical
properties of signals. This can be vital in many engineering applications, viz., wireless
data transmission, biomedical monitoring and control, echo cancellation over wired
telephone lines etc., wherein the underlying physical sources that generate the
information change over time. In the next section, we will briefly review the theory
behind the design of linear adaptive systems.
Design of Adaptive Systems
A block diagram of an adaptive system is shown in Figure 1-1. Assume that we are given a zero-mean input signal x_n and a zero-mean desired signal d_n. Further, these signals are assumed to be corrupted by noise terms v_n and u_n, respectively. Let the parameters of the adaptive system be denoted by the weight vector w. Note that we have not put any constraints on the topology of the adaptive filter. For convenience, we will assume an FIR topology in this chapter. The goal then is to generate an output y_n that best approximates the desired signal. In order to achieve that, a criterion (often referred to as the cost J(w)) is devised, which is typically a function of the error e_n, defined as the difference between the desired signal and the output, i.e., e_n = d_n - y_n. The most widely used criterion in the literature is the Mean-Squared Error (MSE), which is defined as

J(w) = E[e_n^2]    (1.1)
The MSE cost function has some nice properties, namely,
* Physical relevance to energy
* The performance surface (shape of J(w)) is smooth and has continuous derivatives
* The performance surface is a convex paraboloid with a single global minimum
* The weight vector w_* corresponding to the global minimum is the best linear unbiased estimate in the absence of noise
* If the desired signal is a future sample of the input, i.e., d_n = x_{n+1}, then the filter with coefficients w_* is guaranteed to be minimum phase
Figure 1-1. Block diagram of an Adaptive System.
Once the criterion is fixed, the next step is to design an algorithm to optimize the cost
function. This forms another important element in an adaptive system. Optimization is a
well researched topic and there is a plethora of search methods for convex cost functions.
Specifically, we minimize the MSE cost function, and since the performance surface is quadratic with a single global minimum, an analytical closed-form optimal solution w_* can be easily determined. The optimal solution is called the Wiener solution for MSE (Wiener filter), which is given by

w_* = R^{-1} P    (1.2)

In equation (1.2), R denotes the covariance matrix of the input, defined as R = E[x_k x_k^T], and the vector P denotes the cross-correlation between the desired signal and the lagged input, defined as P = E[x_k d_k]. Computing the Wiener solution requires inverting the matrix R, which requires O(N^3) operations. However, due to the time-delay embedding of the input, the matrix R can easily be shown to be symmetric and Toeplitz, which facilitates a computationally efficient inverse operation with complexity O(N^2). From the point of view of an adaptive system, the Wiener solution is still not elegant because one requires the knowledge of all data samples to compute equation (1.2). A sample-by-sample (iterative) algorithm is more desirable as it suits the framework of an adaptive system. The most commonly used algorithms to iteratively estimate the optimal Wiener solution w_* are the stochastic gradient based Least Mean Squares (LMS) and the
fixed-point type Recursive Least Squares (RLS).
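As a concrete illustration of equation (1.2), the short sketch below (an added example, not taken from the dissertation; the signal, filter length, and helper names are hypothetical) estimates R and P from data and exploits the symmetric Toeplitz structure of R when solving for the Wiener weights.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def wiener_solution(x, d, N):
    """Estimate the length-N Wiener filter w* = R^{-1} P from zero-mean data.

    R is approximated by the sample autocorrelation of x (symmetric Toeplitz)
    and P by the sample cross-correlation between the delayed input and d.
    """
    K = len(x)
    # First column of the Toeplitz matrix R: r[l] = E[x(k) x(k - l)]
    r = np.array([np.dot(x[: K - l], x[l:]) / (K - l) for l in range(N)])
    # Cross-correlation vector: p[l] = E[x(k - l) d(k)]
    p = np.array([np.dot(x[: K - l], d[l:]) / (K - l) for l in range(N)])
    # O(N^2) Levinson-type solve that uses the Toeplitz structure of R
    return solve_toeplitz(r, p)

# Usage: identify a short FIR system from clean data
rng = np.random.default_rng(0)
w_true = np.array([1.0, -0.5, 0.25])
x = rng.standard_normal(5000)
d = np.convolve(x, w_true)[: len(x)]
print(wiener_solution(x, d, 3))   # approximately [1.0, -0.5, 0.25]
```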
Least Mean Squares (LMS) Algorithm
The gradient of the cost function in (1.1) is given by

∂J(w)/∂w = -2E[e_k x_k]    (1.3)

Notice that the output of the adaptive filter y_k is simply the inner product between the weight vector w and the vector x_k, which is comprised of the delayed versions of the input signal x_n. Instead of computing the exact gradient, Widrow and fellow researchers [9,10] proposed the instantaneous gradient, which only considers the most recent data samples (both input and desired). This led to the development of the stochastic gradient algorithm for MSE minimization that is popularly known as the Least Mean Squares (LMS) algorithm. The stochastic gradient is given by

∂J(w)/∂w ≈ -2 e_k x_k    (1.4)

Once the instantaneous gradient is known, the search should be in the direction opposite to the gradient, which gives us the stochastic LMS algorithm in (1.5).

w(k+1) = w(k) + η(k) e_k x_k    (1.5)

The term η(k) denotes a time-varying step-size that is typically chosen from a set of
small positive numbers. Under mild conditions, it is possible to show that the LMS
algorithm converges in the mean to the Wiener solution [10-14]. The stochastic LMS
algorithm is linear in complexity, i.e., O(N), and allows on-line, local computations.
These nice features facilitate efficient hardware implementation for real-world adaptive
systems. Being a stochastic gradient algorithm, LMS suffers from problems related to
slow convergence and excessive misadjustment in the presence of noise [14, 15]. Higher
order methods have been proposed to mitigate these effects and mainly they are variants
of Quasi-Newton, Levenberg-Marquardt (LM) and Conjugate-Gradient (CG) methods
popular in optimization [16-17]. Alternatively, we can derive a recursive fixed-point
algorithm to iteratively estimate the optimal Wiener solution. This is the well-known
Recursive Least Squares (RLS) algorithm [18, 19].
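A minimal NumPy sketch of the LMS update in (1.5) is given below (added for illustration only; the step size eta and the zero-padded start-up are arbitrary choices, not part of the dissertation).

```python
import numpy as np

def lms(x, d, num_taps, eta=0.01):
    """Adapt an FIR filter with the LMS rule w(k+1) = w(k) + eta * e_k * x_k."""
    w = np.zeros(num_taps)
    for k in range(len(x)):
        # Tap-delay-line input vector x_k (zero-padded before the first sample)
        x_k = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(num_taps)])
        e_k = d[k] - w @ x_k          # instantaneous error
        w += eta * e_k * x_k          # stochastic gradient step, equation (1.5)
    return w
```

The single inner product and single weight update per sample are what give LMS its O(N) complexity.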
Recursive Least Squares (RLS) Algorithm
The derivation of the RLS algorithm utilizes the fact that the input covariance
matrix R can be iteratively estimated from its past values using the recursive relation,
R(k) = R(k-1) + x_k x_k^T    (1.6)
The above equation can also be viewed as a rank-1 update on the input covariance matrix
R. Further, the cross correlation vector P satisfies the following recursion.
P(k) = P(k-1) + x_k d_k    (1.7)
We know that the optimal Wiener solution at the time instant k is simply
w_*(k) = R^{-1}(k) P(k)    (1.8)
Recall the matrix inversion lemma [7,8] at this point which allows us to recursively
update the inverse of a matrix.
R^{-1}(k) = R^{-1}(k-1) - [R^{-1}(k-1) x_k x_k^T R^{-1}(k-1)] / [1 + x_k^T R^{-1}(k-1) x_k]    (1.9)
It is important to note that the inversion lemma is useful only when the matrix itself can
be expressed using reduced rank updates as in equation (1.6). By plugging equation (1.9)
into the Wiener solution in (1.8) and using the recursive update for P(k) in (1.7), we can
derive the RLS algorithm outlined in Table 1-1 below.
Table 1-1. Outline of the RLS Algorithm.
Initialize R^{-1}(0) = cI, where c is a large positive constant
w(0) = 0 (initialize the weight vector to an all-zero vector)
At every iteration, compute
    K(k) = R^{-1}(k-1) x_k / (1 + x_k^T R^{-1}(k-1) x_k)
    e(k) = d_k - w^T(k-1) x_k
    w(k) = w(k-1) + K(k) e(k)
    R^{-1}(k) = R^{-1}(k-1) - K(k) x_k^T R^{-1}(k-1)
The RLS algorithm is a truly fixed-point method as it tracks the exact Wiener solution at
every iteration. Also, observe that the complexity of the algorithm is O(N^2) as compared to
the linear complexity of the LMS algorithm. This additional increase in complexity is
compensated by the fast convergence and zero misadjustment of the RLS algorithm.
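Table 1-1 translates almost line by line into code. The sketch below is an added illustration (the initialization constant c and the zero-padded start-up are arbitrary), not the dissertation's implementation.

```python
import numpy as np

def rls(x, d, num_taps, c=1e4):
    """Recursive least squares following Table 1-1; tracks w(k) = R^{-1}(k) P(k)."""
    R_inv = c * np.eye(num_taps)          # R^{-1}(0) = cI, c a large positive constant
    w = np.zeros(num_taps)                # w(0) = 0
    for k in range(len(x)):
        x_k = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(num_taps)])
        Rx = R_inv @ x_k
        K = Rx / (1.0 + x_k @ Rx)         # gain vector K(k)
        e = d[k] - w @ x_k                # a priori error e(k)
        w = w + K * e                     # weight update
        R_inv -= np.outer(K, x_k) @ R_inv # inverse update from the lemma (1.9)
    return w
```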
Although LMS and RLS form the core of adaptive signal processing algorithms,
researchers have proposed many other variants possessing varying degrees of complexity
and performance levels. Important amongst them are the sign LMS algorithms that were
introduced for reduced complexity hardware implementations [20, 21]. Historically, the
sign-error algorithm has been utilized in the design of channel equalizers  and also in
the 32kbps ADPCM digital coding scheme . In terms of improving the speed of
convergence with minimum misadjustment, variable step-size LMS and normalized LMS
algorithms have been proposed [23-27]. Leaky LMS algorithms  have been explored
to mitigate the finite word length effects at the expense of introducing some bias in the
optimal solution. Several extensions to the RLS algorithm have also been studied. Some
of these algorithms show improved robustness against round-off errors and superior
numerical stability [29,30]. The conventional RLS algorithm works well when the data
statistics do not change over time (stationarity assumption). The tracking abilities of RLS in non-stationary conditions have been analyzed by Eleftheriou and Falconer, and many solutions have been proposed.
Limitations of MSE Criterion Based Linear Adaptive Systems
Although MSE based adaptive systems have been very popular, the criterion may
not be the optimal choice for many engineering applications. For instance, consider the
problem of system identification  which is stated as follows: Given a set of input and
output noisy measurements where the outputs are the responses of an unknown system,
obtain a parametric model estimate of the unknown system. If the unknown system is
nonlinear, then it is obvious that MSE minimization would not result in the best possible
representation of the system (plant). Criteria that utilize higher order statistics like the
error entropy, for instance, can potentially provide a better model [33,34].
Let us restrict ourselves to the class of linear parametric models. Although the
Wiener solution is optimal in the least-squares sense, the biased input covariance matrix R in the presence of additive white input noise yields a bias in the optimal solution compared to what would have been obtained with noise-free data. (The Wiener solution with noise-free data gives unbiased estimates; we refer to this mismatch between the estimates obtained with and without noise as the bias introduced by noise.) This is a major drawback, since noise is omnipresent in practical scenarios. In order to illustrate the degradation in the quality of the parameter estimate, we created a random input time series with arbitrary coloring and passed it through an FIR filter with 50 taps. The filtered data were used as the desired signal. Uncorrelated white noise was added to the colored input signal and the input signal-to-noise ratio (SNR) was fixed at 0 dB. The RLS algorithm was then used to estimate the weight vector. Ideally, if the SNR were infinite, RLS would have resulted in a weight vector exactly matching the FIR filter. However, because of the noisy input, the RLS estimates were biased, as can be seen in Figure 1-2.

Figure 1-2. Parameter estimates using RLS algorithm with noisy data.
This is a very serious drawback of the MSE criterion, which is further accentuated by the fact that the optimal Wiener MSE solution varies with changing noise power. Researchers
have dwelt on this problem for many years and several modifications have been proposed
to mitigate the effect of noise on the estimate. Total least-squares (TLS) is one method
which is quite powerful in eliminating the bias due to noise [35-42]. The instrumental
variables (IV) method proposed as an extension to the Least-Squares (LS) has been
previously applied for parameter estimation in white noise . This method requires
choosing a set of instruments that are uncorrelated with the noise in the input [32,43]. Yet
another classical approach is subspace Wiener filtering [14,44]. This approach tries to
suppress the bias by performing an optimal subspace projection (Principal Component
Space) and then training a filter in the reduced input space. In the next few sections, we
will briefly cover some of these methods and discuss their benefits and the limitations.
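The bias illustrated in Figure 1-2 is easy to reproduce with a short simulation along the following lines (an added sketch reusing the hypothetical rls() routine above; the 8-tap filter, coloring filter and 0 dB SNR are arbitrary stand-ins for the 50-tap experiment described in the text).

```python
import numpy as np

rng = np.random.default_rng(1)
num_taps = 8
w_true = rng.standard_normal(num_taps)

# Colored input: white noise passed through a short smoothing filter
x_clean = np.convolve(rng.standard_normal(20000), np.ones(4) / 4, mode="same")
d = np.convolve(x_clean, w_true)[: len(x_clean)]       # noise-free desired signal

# Add uncorrelated white noise to the input at 0 dB SNR
x_noisy = x_clean + np.std(x_clean) * rng.standard_normal(len(x_clean))

w_clean = rls(x_clean, d, num_taps)   # essentially matches w_true
w_noisy = rls(x_noisy, d, num_taps)   # visibly biased estimate
print(np.linalg.norm(w_clean - w_true), np.linalg.norm(w_noisy - w_true))
```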
Total Least Squares (TLS) and Other Methods
Mathematically speaking, TLS solves an over-determined set of linear equations of the form Ax = b, where A ∈ R^{m×n} is the data matrix, b ∈ R^m is the desired vector, x ∈ R^n is the parameter vector, and m denotes the number of different observation vectors, each of dimension n. Alternatively, the linear equations can be written in the form [A; b][x^T; -1]^T = 0, where [A; b] denotes the augmented data matrix. Let the SVD of the augmented data matrix [A; b] be UΣV^T, where U^T U = I, V^T V = I and Σ = [diag(σ_1, σ_2, ..., σ_{n+1}); 0] with all singular values σ_k ≥ 0. If [A; b][x^T; -1]^T = 0, the smallest singular value must be zero. This is possible only if [x^T; -1]^T is a singular vector of [A; b] (corresponding to the zero singular value), normalized such that its (n+1)-th element is -1. When [A; b] is a symmetric square matrix, the solution reduces to finding the eigenvector corresponding to the smallest eigenvalue of [A; b]. The TLS solution in this special case is then

[x^T; -1]^T = -v_{n+1} / v_{n+1,(n+1)}    (1.10)

where v_{n+1,(n+1)} is the last element of the minor eigenvector v_{n+1}. The Total Least-Squares
technique can be easily applied to estimate the optimal solution using minor components
estimation algorithms [45-51]. The computation of the TLS solution requires efficient
algorithms for extracting the principal components  or the eigenvectors of the data
covariance matrix. Eigendecomposition is a well studied problem and many algorithms
have been proposed for online estimation of eigenvectors and eigenvalues directly from
data samples [53-77]. We have proposed robust, sample efficient algorithms for solving
Principal Components Analysis (PCA) that have outperformed most of the available
methods. A brief review of PCA theory and the proposed algorithms are outlined in
appendix A. Brief mathematical analyses of the proposed algorithms according to the
principles of stochastic approximation theory [78-85] are also included. A fast minor
components analysis (MCA) based TLS algorithm  is discussed in appendix B.
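For reference, the batch TLS solution of (1.10) can be computed directly from the minor right singular vector of the augmented data matrix, as in the sketch below (an added illustration; it is not the fast MCA-based algorithm of appendix B, and the test data are arbitrary).

```python
import numpy as np

def tls(A, b):
    """Total least-squares solution of Ax ~ b from the minor right singular
    vector of the augmented matrix [A b], cf. equation (1.10)."""
    Ab = np.hstack([A, b.reshape(-1, 1)])
    _, _, Vt = np.linalg.svd(Ab)
    v = Vt[-1]                 # right singular vector of the smallest singular value
    return -v[:-1] / v[-1]     # scale so that the last element equals -1

# With equal-variance i.i.d. noise on A and b, TLS stays unbiased while LS shrinks
rng = np.random.default_rng(2)
x_true = np.array([2.0, -1.0])
A = rng.standard_normal((10000, 2))
b = A @ x_true
An = A + 0.5 * rng.standard_normal(A.shape)
bn = b + 0.5 * rng.standard_normal(b.shape)
print(tls(An, bn))                                # close to [2, -1]
print(np.linalg.lstsq(An, bn, rcond=None)[0])     # biased towards zero
```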
Limitations of TLS
Total least squares gives unbiased estimates only when both the noise in the input
and the desired data are independent and identically distributed (i.i.d.) and have the same
variance. Further, when the noise is truly i.i.d. Gaussian-distributed, the TLS solution is
also the maximum likelihood solution. However, the assumption of equal noise variances
is very restrictive, as measurement noises seldom have similar variances. The
Generalized TLS (GTLS) problem  specifically deals with cases where the noise (still
assumed to be i.i.d.) variances are different. However, the caveat is that the ratio of noise
variances is assumed to be known which is, once again, not a practical assumption.
Extended TLS for Correlated Noise
In order to overcome the i.i.d. assumption, Mathews and Cichocki have proposed
the Extended TLS (ETLS)  that allows the noise to have non-zero correlations. We
will briefly describe the approach they adopted. Let the augmented input matrix [A; b] be represented as H = [A; b]. Then, the square matrix H^T H can be written as a combination of the clean data matrix H̃^T H̃ and the noise covariance matrix R_N:

H^T H = H̃^T H̃ + R_N    (1.11)

The above equation is true when the noise is uncorrelated with the clean data. This assumption is reasonable, as the noise processes in general are unrelated (hence independent) to the physical sources that produced the data. Assume that there exists a matrix transformation H̄ such that

H̄ = H R_N^{-1/2}    (1.12)

The transformed data correlation matrix of H̄ is simply

H̄^T H̄ = R_N^{-1/2} H̃^T H̃ R_N^{-1/2} + I    (1.13)

Equation (1.13) basically tells us that the transformed data are now corrupted by an i.i.d. noise process. Hence, we can now find the regular TLS solution with the transformed data by estimating the minor eigenvector of the matrix H̄^T H̄. In other words, the optimal ETLS solution for correlated noise signals is given by estimating the generalized eigenvector corresponding to the smallest generalized eigenvalue of the matrix pencil (H^T H, R_N). Solving the generalized eigenvalue problem is a non-trivial task and
there are only a handful of algorithms that can provide online solutions. Our research in
the area of PCA provided us the tools to develop a novel generalized eigenvalue
decomposition (GED) algorithm. A short summary of the GED problem, existing
learning algorithms and the proposed algorithm are listed in appendix C.
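In batch form, the ETLS computation described above reduces to one generalized symmetric eigenvalue problem, which SciPy can solve directly; the sketch below is an added illustration (it assumes the noise covariance R_N of the augmented data is known, which is precisely the drawback discussed next, and it is not the online GED algorithm of appendix C).

```python
import numpy as np
from scipy.linalg import eigh

def etls(A, b, R_N):
    """Extended TLS: minor generalized eigenvector of the pencil (H^T H, R_N),
    where H = [A b] and R_N is the covariance of the noise on H."""
    H = np.hstack([A, b.reshape(-1, 1)])
    # eigh(M, B) solves M v = lambda B v with eigenvalues in ascending order,
    # so the first column of V is the minor generalized eigenvector.
    _, V = eigh(H.T @ H, R_N)
    v = V[:, 0]
    return -v[:-1] / v[-1]     # normalize the last element to -1, as in (1.10)
```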
Although the ETLS seems to solve the general problem of linear parameter
estimation, there is an inherent drawback. The ETLS requires the full knowledge of the
correlation matrix of the noise (R_N). This assumption potentially leaves the problem of
linear parameter estimation with noisy data wide open.
Infinite Impulse Response (IIR) system identification methods [89-92] deal with
the problem of measurement noise in the output (desired) data. The Instrumental
Variables (IV) method for IIR system identification, on the other hand, does not guarantee stability. It has been known for quite a while that the unit-norm constraint for equation-error (EE) based system identification is much better than the conventional monic constraint [90-92]. However, imposing the unit-norm constraint appears too restrictive and hence limits the applicability.
In this chapter, we started by describing criteria for linear adaptive systems and their associated algorithms. Most often, adaptive solutions are derived using the MSE
criterion. We showed that the MSE criterion produces biased solutions in the presence of
additive noise. The optimal Wiener MSE solution varies with changing noise variances
which is highly undesirable. Alternative approaches to combat the effect of noise in the
parameter estimation have been explored. The most popular approaches are based on the
Total Least-Squares principles. Generalized TLS and Extended TLS improve upon the
ability of the TLS to provide bias free estimates in the presence of additive noise.
However, these methods rely on assumptions that can be very restrictive for real-world
applications. Further, they require SVD and Generalized SVD computation [94-105]
which increases the complexity. Another method called subspace Wiener filtering relies
on the accurate estimation of the signal subspace from the noisy data. This technique
reduces the effect of the bias when the signals are distinguishable from noise (high SNR
scenario). Otherwise, it fails since noise and signal subspaces cannot be separated.
Thus, it would not be fallacious to say that the problem of linear parameter
estimation with noisy data is a hard problem that does not yet have a satisfactory solution
in the existing literature. One of the major contributions of this dissertation is the
development of an elegant solution to this problem without making any unreasonable
assumptions about the noise statistics. Towards this end, we will present a new criterion
based on the error signal and derive new learning algorithms.
AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE SYSTEMS
In the previous chapter, we discussed the Mean-Squared Error (MSE) criterion
which has been the workhorse of linear optimization theory due to the simple and
analytically tractable structure of linear least squares. In adaptive filter theory, the classical Wiener-Hopf equations [6,10] are more commonly used owing to the extension of least squares to functional spaces (Hilbert spaces) proposed by Wiener. However, for finite impulse response (FIR) filters (vector spaces), the two solutions
coincide. There are also a number of important properties that help us understand the
statistical properties of the Wiener solution, namely the orthogonality of the error signal
to the input vector space as well as the whiteness of the predictor error signal for
stationary inputs, provided the filter is long enough [5,14]. However, in a number of
applications of practical importance, the error sequence produced by the Wiener filter is
not white. One of the most important is the case of inputs corrupted by white noise,
where the Wiener solution is biased by the noise variance as we saw before in Chapter 1.
In this chapter, we will develop a new criterion which augments the MSE criterion.
In fact, MSE becomes a special case of this new criterion which we call the Augmented
Error Criterion (AEC). Further, we will show that, under some conditions, this new
criterion can produce a partially white error sequence at the output of an adaptive system
even with noisy data. This special case of the AEC is called the Error Whitening
Criterion (EWC). Our approach in this chapter will be as follows. We will first focus on
the problem of parameter estimation with noisy data and motivate the derivation of the
error whitening criterion. Then, we will deduce the more generic augmented error criterion.
Error Whitening Criterion (EWC)
Consider the problem of parameter estimation with noisy data. Instead of
minimizing the MSE, we will tackle the problem by introducing a new adaptation
criterion that enforces zero autocorrelation of the error signal beyond a certain lag; hence
the name error whitening criterion (EWC). Since we want to preserve the on-line
properties of the adaptation algorithms, we propose to expand the error autocorrelation
around a lag larger than the filter length using Taylor series. Thus, instead of an error
signal, we will end up with an error vector, containing as many components as the terms
kept in the Taylor series expansion. A schematic diagram of the proposed adaptation
structure is depicted in Figure 2-1. The properties of this solution are very interesting, and
it contains the Wiener solution as a special case. Additionally, for the case of two error
terms, the same analytical tools developed for the Wiener filter can be applied with minor
modifications. Moreover, when the input signal is contaminated with additive white noise, EWC produces the same optimal solution that would be obtained with the noise-free data, with the same computational complexity as the Wiener solution.

Figure 2-1. Schematic diagram of EWC adaptation.
Motivation for Error Whitening Criterion
The classical Wiener solution yields a biased estimate of the reference filter weight
vector in the presence of input noise. This problem arises due to the contamination of the
input signal autocorrelation matrix with that of the additive noise. If a signal is
contaminated with additive white noise, only the zero-lag autocorrelation is biased by the
amount of the noise power. Autocorrelation values at all other lags still remain at their
original values. This observation rules out MSE as a good optimization criterion for this
case. In fact, since the error power is the value of the error autocorrelation function at
zero lag, the optimal weights will be biased because they depend on the input
autocorrelation values at zero-lag. The fact that the autocorrelation values at non-zero
lags are unaffected by the presence of noise will prove useful in determining an
unbiased estimate of the filter weights.
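This observation is simple to verify numerically; the added snippet below (arbitrary colored signal and noise power, not an experiment from the dissertation) shows that white noise perturbs the sample autocorrelation only at lag zero, roughly by the noise power.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200000
s = np.convolve(rng.standard_normal(n), np.ones(8) / 8, mode="same")   # clean colored signal
x = s + 0.7 * rng.standard_normal(n)                                   # plus white noise (power 0.49)

def autocorr(u, lag):
    return np.dot(u[: len(u) - lag], u[lag:]) / (len(u) - lag)

for lag in (0, 1, 5):
    # Only the lag-0 value differs between the clean and noisy signals
    print(lag, round(autocorr(s, lag), 3), round(autocorr(x, lag), 3))
```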
Analysis of the Autocorrelation of the Error Signal
The question that arises is what lag should be used to obtain the true weight vector
in the presence of white input noise. Let us consider the autocorrelation of the training
error at non-zero lags. Suppose noisy training data of the form (x̃(t), d̃(t)) are provided, where x̃(t) = x(t) + v(t) and d̃(t) = d(t) + u(t), with x(t) being the sample of the noise-free input vector at time t (time is assumed to be continuous), v(t) being the additive white noise vector on the input vector, d(t) being the noise-free desired output, and u(t) being the additive white noise on the desired output. Suppose that the true weight vector of the reference filter that generated the data is w_T (moving average model). Then the error at time t is e(t) = (d(t) + u(t)) - (x(t) + v(t))^T w, where w is the estimated weight vector. Equivalently, when the desired response belongs to the subspace of the input, i.e., d(t) = x^T(t) w_T, the error can be written as

e(t) = (x^T(t) w_T + u(t)) - (x(t) + v(t))^T w = x^T(t)(w_T - w) + u(t) - v^T(t) w    (2.1)
Given this noisy training data, the MSE-based Wiener solution will not yield a residual
training error that has zero autocorrelation for a number of consecutive lags, even when
the contaminating noise signals are white. From (2.1) it is easy to see that the error will
have a zero autocorrelation function if and only if
* the weight vector is equal to the true weights of the reference model,
* the lag is beyond the Wiener filter length.
During adaptation, the issue is that the filter weights are not set at w_T, so the error
autocorrelation function will be generally nonzero. Therefore a criterion to determine the
true weight vector when the data is contaminated with white noise should be to force the
long lags (beyond the filter length) of the error autocorrelation function to zero by using
an appropriate criterion. This is exactly what the error-whitening criterion (EWC) that
we propose here will do. There are two interesting situations that we should consider:
* What happens when the selected autocorrelation lag is smaller than the filter length?
* What happens when the selected autocorrelation lag is larger than the lag at which
the autocorrelation function of the input signal vanishes?
The answer to the first question is simply that the solution will still be biased, since it will be obtained by inverting a biased input autocorrelation matrix. If the selected lag is smaller than the filter length, the zero-lag autocorrelation of the input signal (which carries the noise power) shows up on a sub-diagonal of the lagged autocorrelation matrix. In the special case of MSE, the selected lag is zero and the zeroth sub-diagonal becomes the main diagonal; thus the solution is biased by the noise power.
The answer to the second question is equally important. The MSE solution is quite
stable because it is determined by the inverse of a diagonally dominant Toeplitz matrix.
The diagonal dominance is guaranteed by the fact that the autocorrelation function of a
real-valued function has a peak at zero-lag. If other lags are used in the criterion, it is
important that the lag is selected such that the corresponding autocorrelation matrix
(which will be inverted) is not ill conditioned. If the selected lag is larger than the length
of the input autocorrelation function, then the autocorrelation matrix becomes singular
and a solution cannot be obtained. Therefore, lags beyond the input signal correlation
time should also be avoided in practice.
The observation that constraining the higher lags of the error autocorrelation
function to zero yields unbiased weight solutions is quite significant. Moreover, the
algorithmic structure of this new solution and the lag-zero MSE solution are still very
similar. The noise-free case helps us understand why this similarity occurs. Suppose the
desired signal is generated by the following equation: d(t) = x^T(t) w_T, where w_T is the true weight vector. Now multiply both sides by x(t - Δ) from the left and then take the expected value of both sides to yield E[x(t - Δ) d(t)] = E[x(t - Δ) x^T(t)] w_T. Similarly, we can obtain E[x(t) d(t - Δ)] = E[x(t) x^T(t - Δ)] w_T. Adding the corresponding sides of these two equations yields

E[x(t) d(t - Δ) + x(t - Δ) d(t)] = E[x(t) x^T(t - Δ) + x(t - Δ) x^T(t)] w_T    (2.2)

This equation is similar to the standard Wiener-Hopf equation [9,10], E[x(t) d(t)] = E[x(t) x^T(t)] w_T. Yet, it is different due to the correlations being evaluated
at a lag other than zero, which means that the weight vector can be determined by
constraining higher order lags in the error autocorrelation. Now that we have described
the structure of the solution, let us address the issue of training linear systems using error
correlations. Adaptation exploits the sensitivity of the error autocorrelation with respect
to the weight vector of the adaptive filter. We will formulate the solution in continuous
time first, for the sake of simplicity. If the support of the impulse response of the adaptive
filter is of length m, we evaluate the derivative of the error autocorrelation function with
respect to the weight vector at a lag Δ ≥ m, where Δ and m are both real numbers. Assuming that the noises in the input and desired are uncorrelated with each other and with the input signal, we get

∂ρ_e(Δ)/∂w = ∂E[e(t) e(t - Δ)]/∂w
           = ∂/∂w E[((w_T - w)^T x(t) + u(t) - v^T(t) w)((w_T - w)^T x(t - Δ) + u(t - Δ) - v^T(t - Δ) w)]
           = -2E[x(t) x^T(t - Δ)](w_T - w)    (2.3)
The identity in equation (2.3) immediately tells us that the sensitivity of the error
autocorrelation with respect to the weight vector becomes zero, i.e., ∂ρ_e(Δ)/∂w = 0, if (w_T - w) = 0. This observation emphasizes the following important conclusion: when
given training data that is generated by a linear filter, but contaminated with white noise,
it is possible to derive simple adaptive algorithms that could determine the underlying
filter weights without bias. Furthermore, if (w_T - w) is not in the null space of E[x(t) x^T(t - Δ)], then only (w_T - w) = 0 makes ρ_e(Δ) = 0 and ∂ρ_e(Δ)/∂w = 0. But looking at (2.3), we conclude that a proper delay depends on the autocorrelation of the input signal, which is, in general, unknown. Therefore, the selection of the delay Δ is important. One possibility is to evaluate the error autocorrelation function at different lags Δ > m and check for a non-zero input autocorrelation function for that delay, which
will be very time consuming and inappropriate for on-line algorithms.
Instead of searching for a good lag Δ, consider the Taylor series approximation of the autocorrelation function around a fixed lag L, where L > m,

ρ_e(Δ) = ρ_e(L) + ρ'_e(L)(Δ - L) + (1/2) ρ''_e(L)(Δ - L)^2 + ...
       = E[e(t)e(t - L)] + E[ė(t)e(t - L)](Δ - L) + (1/2) E[ë(t)e(t - L)](Δ - L)^2 + ...    (2.4)
In (2.4), ė(t) and ë(t) (see Figure 2-1) represent the derivatives of the error signal
respect to the time index. Notice that we do not take the Taylor series expansion around
zero-lag for the reasons indicated above. Moreover, L should be less than the correlation
time of the input, such that the Taylor expansion has a chance of being accurate. But
since we bring more lags in the expansion, the choice of the lag becomes less critical than
in (2.3). In principle, the more terms we keep in the Taylor expansion the more
constraints we are imposing on the autocorrelation of the error in adaptation. Therefore,
instead of finding the weight vector that makes the actual gradient in (2.3) zero, we find
the weight vector that makes the derivative of the approximation in (2.4) with respect to
the weight vector zero.
If the adaptive filter is operating in discrete time instead of continuous time, the
differentiation with respect to time can be replaced by a first-order forward difference,
ė(n) = e(n) - e(n - L). Higher-order derivatives can also be approximated by their corresponding forward difference estimates, e.g., ë(n) = e(n) - 2e(n - L) + e(n - 2L), etc.
Although the forward difference normally uses two consecutive samples, for reasons that
will become clear in the following sections of the chapter, we will utilize two samples
separated by L samples in time. The first-order truncated Taylor series expansion for the
error autocorrelation function for lag Δ evaluated at L becomes

ρ_e(Δ) = E[e(n)e(n - L)] - E[e(n)(e(n) - e(n - L))](Δ - L)
       = -(Δ - L)E[e^2(n)] + (1 + Δ - L)E[e(n)e(n - L)]    (2.5)
Analyzing (2.5), we note another advantage of the Taylor series expansion: the familiar MSE is part of the expansion. Notice also that as one forces Δ → L, the MSE term will disappear and only the lag-L error autocorrelation will remain. On the other hand, as Δ → L - 1 only the MSE term will prevail in the autocorrelation function
approximation. Introducing more terms in the Taylor expansion will bring in error
autocorrelation constraints from lags iL.
Augmented Error Criterion (AEC)
We are now in a position to formulate the augmented error criterion (AEC). To the
regular MSE term, we add another function, E[ė^2(n)], to result in the augmented error criterion as shown in equation (2.6):

J(w) = E[e^2(n)] + βE[ė^2(n)]    (2.6)

where β is a real scalar parameter. Equivalently, (2.6) can also be written as

J(w) = (1 + 2β)E[e^2(n)] - 2βE[e(n)e(n - L)]    (2.7)

which has the same form as in (2.5). Notice that when β = 0 we recover the MSE in (2.6) and (2.7). Similarly, we would have to select Δ = L in order to make the first-order expansion identical to the exact value of the error autocorrelation function. Substituting the identity (1 + 2β) = -(Δ - L), and using Δ = L, we observe that β = -1/2 eliminates the MSE term from the criterion. Interestingly, this value will appear in a later discussion, when we optimize β in order to reduce the bias in the solution introduced by input noise. If β is positive, then minimizing the cost function J(w) is equivalent to minimizing the MSE but with a constraint that the error signal must be smooth. Thus, the weight vector corresponding to the minimum of J(w) will result in a higher MSE than the Wiener solution.
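The two equivalent forms (2.6) and (2.7) are easy to check numerically from error samples; the sketch below is an added illustration (the lag L and the β values are arbitrary), and the two estimates agree up to the slightly different sample ranges used by the averages.

```python
import numpy as np

def aec_cost(e, L, beta):
    """AEC form (2.6): E[e^2(n)] + beta * E[e_dot^2(n)], e_dot(n) = e(n) - e(n-L)."""
    e_dot = e[L:] - e[:-L]
    return np.mean(e ** 2) + beta * np.mean(e_dot ** 2)

def aec_cost_lag(e, L, beta):
    """AEC form (2.7): (1 + 2*beta) E[e^2(n)] - 2*beta E[e(n) e(n-L)]."""
    return (1 + 2 * beta) * np.mean(e ** 2) - 2 * beta * np.mean(e[L:] * e[:-L])

rng = np.random.default_rng(4)
e = rng.standard_normal(100000)
for beta in (0.0, -0.5, 1.0):
    print(beta, round(aec_cost(e, 5, beta), 4), round(aec_cost_lag(e, 5, beta), 4))
```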
The same criterion can also be obtained by considering performance functions of the form

J(w) = E[e^2(n)] + βE[ė^2(n)] + γE[ë^2(n)] + ...    (2.8)

where the coefficients β, γ, etc. are assumed to be positive. Notice that (2.8) is the L2 norm of a vector of different objective functions. The components of this vector consist of e(n), ė(n), ë(n), etc. Due to the equivalence provided by the difference approximations for the derivatives, these terms constrain the error autocorrelation at lags iL as well as the error power, as seen in (2.8).
In summary, the AEC defined by equation (2.6) can take many forms and hence
results in different optimal solutions.
* If β is 0, then AEC exactly becomes the MSE criterion
* If β is -0.5, then AEC becomes the EWC, which will result in an unbiased estimate of the parameters even in the presence of noise
* If β is positive and not equal to 0, then the cost function minimizes a combination of MSE with a smoothness constraint
In the following sections, we will further elaborate on the properties of AEC.
Properties of Augmented Error Criterion
Shape of the Performance Surface
Suppose that noise-free training data of the form (x(n), d(n)), generated by a linear system with weight vector w_T through d(n) = x^T(n) w_T, is provided. Assume without loss of generality that the adaptive filter and the reference filter are of the same length. This is possible since w_T can be padded with zeros if it is shorter than the adaptive filter. Therefore, the input vector x(n) ∈ R^m, the weight vector w_T ∈ R^m, and the desired output d(n) ∈ R. Equation (2.6) has a quadratic form and has a unique stationary point. If β > 0, then this stationary point is a minimum. Otherwise, the Hessian of (2.6) might have mixed-sign eigenvalues. We demonstrate this fact with sample performance surfaces obtained for 2-tap FIR filters using β = -1/2.
For three differently colored training data, we obtain the AEC performance
surfaces shown in Figure 2-2. In each row, the MSE performance surface, the AEC cost
contour plot, and the AEC performance surface are shown for the corresponding training
data. The eigenvalue pairs of the Hessian matrix of (2.6) are (2.35,20.30), (-6.13,5.21),
and (-4.08,-4.14), for these representative cases in Figure 2-2. Clearly, it is possible for
(2.6) to have a stationary point that is a minimum, a saddle point, or a maximum and we
start to see the differences brought about by the AEC.
The performance surface is a weighted sum of paraboloids, which will complicate
gradient-based adaptation, but will not affect search algorithms utilizing curvature
information. We will discuss more on the search techniques later in this Chapter and also
in Chapter 4.
Figure 2-2. The MSE performance surfaces, the AEC contour plot, and the AEC performance surface for three different training data sets and 2-tap adaptive FIR filters.
Analysis of the Noise-free Input Case
Theorem 2.1: The stationary point of the quadratic form in (2.6) is given by
w* = (R + βS)^{-1}(P + βQ)    (2.9)
where we defined R = E[x(n)x^T(n)], S = E[ẋ(n)ẋ^T(n)], P = E[x(n)d(n)], and Q = E[ẋ(n)ḋ(n)],
with ẋ(n) = x(n) − x(n − L) and ḋ(n) = d(n) − d(n − L).
Proof: Substituting the proper variables in (2.6), we obtain the following explicit
expression for J(w):
J(w) = E[d²(n)] + βE[ḋ²(n)] + w^T(R + βS)w − 2(P + βQ)^T w    (2.10)
Taking the gradient with respect to w and equating it to zero yields
∂J/∂w = 2(R + βS)w − 2(P + βQ) = 0
⟹ w* = (R + βS)^{-1}(P + βQ)
Notice that selecting β = 0 in (2.6) reduces the criterion to MSE, and the optimal
solution, given in (2.9), reduces to the Wiener solution. Thus, the Wiener filter is a
special case of the AEC solution (though not optimal for noisy inputs, as we will show later).
Corollary 1. An equivalent expression for the stationary point of (2.6) is given by
w* = [(1 + 2β)R − βR_L]^{-1} [(1 + 2β)P − βP_L]    (2.12)
where we defined the matrix R_L = E[x(n − L)x^T(n) + x(n)x^T(n − L)] and the vector
P_L = E[x(n − L)d(n) + x(n)d(n − L)]. Notice that the interesting choice β = -1/2 yields
w* = R_L^{-1} P_L.
Proof: Substituting the definitions of R, S, P, Q, and then recollecting terms to obtain
R_L and P_L yields the desired result:
w* = (R + βS)^{-1}(P + βQ)
   = {E[x(n)x^T(n)] + βE[(x(n) − x(n − L))(x(n) − x(n − L))^T]}^{-1} {E[x(n)d(n)] + βE[(x(n) − x(n − L))(d(n) − d(n − L))]}
   = [(1 + 2β)R − βR_L]^{-1} [(1 + 2β)P − βP_L]    (2.13)
From these results we deduce two extremely interesting conclusions.
Lemma 1. (Generalized Wiener-Hopf Equations) In the noise-free case, the true weight
vector is given by R_L w_T = P_L. (This result is also true for noisy data.)
Proof: This result follows immediately from the substitution of d(n) = x^T(n)w_T and
d(n − L) = x^T(n − L)w_T in the definitions of R_L and P_L.
Lemma 2. In the noise-free case, regardless of the specific value of β, the optimal
solution is equal to the true weight vector, i.e., w* = w_T.
Proof: This result follows immediately from the substitution of the result in Lemma 1 into
the optimal solution expression given in (2.9).
The result in Lemma 1 is especially significant, since it provides a generalization of
the Wiener-Hopf equations to autocorrelation and cross-correlation matrices evaluated at
different lags of the signals. In these equations, L represents the specific correlation lag
selected, and the choice L = 0 corresponds to the traditional Wiener-Hopf equations. The
generalized Wiener-Hopf equations essentially state that the true weight vector can
be determined by exploiting correlations evaluated at different lags of the signals; we
are not restricted to the zero-lag correlations as in the Wiener solution.
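As a quick plausibility check of the generalized Wiener-Hopf equations, the following sketch estimates R_L and P_L from noise-free synthetic data and verifies that solving R_L w = P_L recovers the true weight vector. The signal model, coloring filter, and all names are illustrative assumptions, not the dissertation's experimental setup.

```python
import numpy as np

# Minimal numerical check of R_L w_T = P_L on noise-free data.
rng = np.random.default_rng(1)
m, L, N = 4, 4, 100_000
h = 0.9 ** np.arange(30)                                   # long coloring filter keeps R_L full rank
s = np.convolve(rng.standard_normal(N + m), h, mode="same")
X = np.lib.stride_tricks.sliding_window_view(s, m)[:N]     # input vectors x(n)
w_true = rng.standard_normal(m)
d = X @ w_true                                             # noise-free desired signal

Xl, Xc = X[:-L], X[L:]                                     # x(n-L) paired with x(n)
R_L = (Xl.T @ Xc + Xc.T @ Xl) / len(Xc)                    # E[x(n-L)x^T(n) + x(n)x^T(n-L)]
P_L = (Xl.T @ d[L:] + Xc.T @ d[:-L]) / len(Xc)             # E[x(n-L)d(n) + x(n)d(n-L)]

w_hat = np.linalg.solve(R_L, P_L)
print(np.allclose(w_hat, w_true))                          # True (exact in the noise-free case)
```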
Analysis of the Noisy Input Case
Now, suppose that we are given noisy training data (x̃(n), d̃(n)), where
x̃(n) = x(n) + v(n) and d̃(n) = d(n) + u(n). The additive noise on both signals is zero-
mean and uncorrelated with each other and with the input and desired signals. Assume
that the additive noise u(n) on the desired signal is white (in time), and let the autocorrelation
matrices of v(n) be V = E[v(n)v^T(n)] and V_L = E[v(n − L)v^T(n) + v(n)v^T(n − L)].
Under these circumstances, we have to estimate the necessary matrices to evaluate (2.9)
using noisy data. These matrices, evaluated using noisy data and denoted R̃, S̃, P̃, and Q̃,
become (see Appendix D for details)
R̃ = E[x̃(n)x̃^T(n)] = R + V
S̃ = E[(x̃(n) − x̃(n − L))(x̃(n) − x̃(n − L))^T] = 2(R + V) − R_L − V_L
P̃ = E[x̃(n)d̃(n)] = P
Q̃ = E[(x̃(n) − x̃(n − L))(d̃(n) − d̃(n − L))] = 2P − P_L
Finally, the optimal solution estimate of AEC, when presented with noisy input and
desired output data, will be
w̃* = (R̃ + βS̃)^{-1}(P̃ + βQ̃)
   = {(R + V) + β[2(R + V) − R_L − V_L]}^{-1} {P + β(2P − P_L)}    (2.15)
   = [(1 + 2β)(R + V) − β(R_L + V_L)]^{-1} [(1 + 2β)P − βP_L]
Theorem 2.2: (EWC Noise-Rejection Theorem) In the noisy-input data case, the optimal
solution obtained using AEC will be identically equal to the true weight vector if and
only if β = -1/2, R_L ≠ 0, and V_L = 0. There are two situations to consider:
* When the adaptive linear system is an FIR filter, the input noise vector v(n) consists
of delayed versions of a single-dimensional noise process. In that case, V_L = 0 if
and only if L ≥ m, where m is the filter length, and the single-dimensional noise
process is white.
* When the adaptive linear system is an ADALINE, the input noise is a vector
process. In that case, V_L = 0 if and only if the input noise vector process is white
(in time) and L ≥ 1. The input noise vector may be spatially correlated.
Proof: Sufficiency of the first statement is immediately observed by substituting the
provided values of β and V_L. Necessity is obtained by equating (2.15) to w_T and
substituting the generalized Wiener-Hopf equations provided in Lemma 1. Clearly, if
R_L = 0, then there is no equation to solve, thus the weights cannot be uniquely
determined using this value of L. The statement regarding the FIR filter case is easily
proved by noticing that the temporal correlations in the noise vector diminish once the
autocorrelation lag becomes greater than or equal to the filter length. The statement
regarding the ADALINE structure is immediately obtained from the definition of a
temporally white vector process.
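The noise-rejection property can be illustrated numerically. The sketch below (all synthetic data, noise levels, and names are assumptions of this illustration) compares the Wiener solution with the EWC solution w = R_L^{-1} P_L computed from input data corrupted by additive white noise, with L equal to the filter length as required by Theorem 2.2.

```python
import numpy as np

# Hedged illustration: with beta = -1/2 and L >= filter length, the EWC estimate
# R_L^{-1} P_L is (asymptotically) unbiased under additive white input noise,
# while the Wiener solution is not.
rng = np.random.default_rng(2)
m, L, N = 4, 4, 400_000
h = 0.9 ** np.arange(30)
s = np.convolve(rng.standard_normal(N + m), h, mode="same")     # colored clean input
v = 1.5 * rng.standard_normal(N + m)                            # additive white input noise
X  = np.lib.stride_tricks.sliding_window_view(s,     m)[:N]     # clean input vectors
Xn = np.lib.stride_tricks.sliding_window_view(s + v, m)[:N]     # noisy input vectors
w_true = rng.standard_normal(m)
d = X @ w_true                                                  # clean desired signal

# Wiener (MSE) solution from the noisy input: biased by the noise covariance V.
w_mse = np.linalg.solve(Xn.T @ Xn / N, Xn.T @ d / N)

# EWC solution (beta = -1/2): w = R_L^{-1} P_L evaluated on the same noisy data.
Xl, Xc = Xn[:-L], Xn[L:]
R_L = (Xl.T @ Xc + Xc.T @ Xl) / len(Xc)
P_L = (Xl.T @ d[L:] + Xc.T @ d[:-L]) / len(Xc)
w_ewc = np.linalg.solve(R_L, P_L)

print("MSE error norm:", np.linalg.norm(w_mse - w_true))   # biased away from w_true
print("EWC error norm:", np.linalg.norm(w_ewc - w_true))   # typically much smaller
```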
Orthogonality of Error to Input
An important question regarding the behavior of the optimal solution obtained
using the AEC is the relationship between the residual error signal and the input vector.
In the case of MSE, we know that the Wiener solution results in an error that is
orthogonal to the input signal, i.e., E[e(n)x(n)] = 0 [10,14,15]. However, this result is
true only when there is no noise and when the estimated filter length is greater than or equal to
the length of the actual system impulse response. Similarly, we can determine the corresponding relationship that the AEC solution enforces.
Lemma 3: At the optimal solution of AEC, the error and the input random processes
satisfy βE[e(n)x(n − L) + e(n − L)x(n)] = (1 + 2β)E[e(n)x(n)], for all L > 0.
Proof: We know that the optimal solution of AEC for any L > 0 is obtained when the
gradient of the cost function with respect to the weights is zero. Therefore,
∂J/∂w = −2E[e(n)x(n)] − 2βE[(e(n) − e(n − L))(x(n) − x(n − L))] = 0
⟹ (1 + 2β)E[e(n)x(n)] − βE[e(n)x(n − L) + e(n − L)x(n)] = 0    (2.16)
It is interesting to note that if β = -1/2, then we obtain
E[e(n)x(n − L) + e(n − L)x(n)] = 0 for all L. On the other hand, since the criterion
reduces to MSE for β = 0, we then obtain E[e(n)x(n)] = 0. The result shown in (2.16),
if interpreted in terms of Newtonian physics, reveals an interesting insight as to the
behavior of the EWC criterion (β = -1/2) at its optimal solution (regardless of the
length of the reference filter that created the desired signal). In a simplistic manner, this
behavior could be summarized by the following statement: the optimal solution of EWC
tries to decorrelate the residual error from the estimated future value of the input vector
(see Appendix E for details).
The case where β = -1/2 is especially interesting, because it results in complete
noise rejection. Notice that, in this case, since the optimal solution is equal to the true
weight vector, the residual error is given by e(n) = u(n) − v^T(n)w_T, which is composed
purely of the noise in the training data. Certainly, this is the only way that the adaptive
filter can achieve E[e(n)x(n − L) + e(n − L)x(n)] = 0 for all L values, since
E[e(n)x(n − L)] = E[e(n − L)x(n)] = 0 for this error signal. Thus, EWC not only
orthogonalizes the instantaneous error and input signals, but it also orthogonalizes all lags of
the error from the input.
Relationship to Error Entropy Maximization
Another interesting property that the AEC solution exhibits is its relationship with
entropy. Notice that when β < 0, the optimization rule tries to minimize the MSE, yet
it simultaneously tries to maximize the separation between samples of the error. We could
regard the sample separation as an estimate of the error entropy. In fact, the entropy
estimation literature is full of methods based on sample separations [108-113].
Specifically, the EWC case with β = -1/2 finds the perfect balance between entropy
and MSE that allows us to eliminate the effect of noise on the solution. Recall that the
Gaussian density displays maximum entropy among distributions of fixed variance [114].
In the light of this fact, the aim of EWC could be understood as finding the minimum
error variance solution while keeping the error close to Gaussian. Notice that, due to the
central limit theorem, the error signal will be closely approximated by a Gaussian
density when there are a large number of taps. A brief description of the relationship
between entropy (using estimators) [115-117] and sample differences is provided in the appendix.
Note on Model-Order Selection
Model order selection is another important issue in adaptive filter theory. The
actual desired behavior from an adaptive filter is to find the right balance between
approximating the training data as accurately as possible and generalizing to unseen data
with precision. One major cause of poor generalization is known to be excessive
model complexity. Under these circumstances, the designer's aim is to determine
the least complex adaptive system (which translates to a smaller number of weights in the
case of linear systems) that minimizes the approximation error. Akaike's information
criterion (AIC) and Rissanen's minimum description length (MDL) are two
important theoretical results regarding model order selection. Such methods require the
designer to evaluate an objective function, which is a combination of MSE and the filter
length or the filter weights, using different lengths of adaptive filters.
Consider the case of overmodeling in the problem of linear FIR filter (assume N
taps) estimation. If we use the MSE criterion, and assume that there is no noise in the
data, then, the estimated Wiener solution will have exactly N non-zero elements that
exactly match with the true FIR filter. This is a very nice property of the MSE criterion.
However, when there is noise in the data, this property of MSE no longer holds.
Increasing the length of the adaptive filter will then only introduce more parameter
bias into the Wiener solution. On the other hand, EWC successfully determines the length
of the true filter, even in the presence of additive noise. In the overmodeling case, the
additional taps will decay to zero indicating that a smaller filter is sufficient to model the
data. This is exactly what we would like an automated regularization algorithm to
achieve: determining the proper length of the filter without requiring external discrete
modifications on this parameter. Therefore, EWC extends the regularization capability of
MSE to the case of noisy training data. Alternatively, EWC could be used as a criterion
for determining the model order in a fashion similar to standard model order selection
methods. Given a set of training samples, one could start solving for the optimal EWC
solution for various lengths of the adaptive filter. As the length of the adaptive filter is
increased past the length of the true filter, the error power with the EWC solution will
become constant. Observing this point of transition from variable to constant error power,
we can determine the exact model order of the original filter.
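The following sketch illustrates this order-selection procedure under assumed synthetic conditions: the EWC solution (β = -1/2, L = m) is computed for increasing filter lengths m, and the resulting error power is printed; it should flatten once m reaches the true order (5 in this hypothetical example). The data generation and names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300_000
h = 0.9 ** np.arange(30)                                       # long coloring filter
s = np.convolve(rng.standard_normal(N + 16), h, mode="same")   # clean colored input
v = 0.3 * rng.standard_normal(N + 16)                          # white input noise
w_true = rng.standard_normal(5)                                # true order is 5
X5 = np.lib.stride_tricks.sliding_window_view(s, 5)[:N]
d = X5 @ w_true                                                # clean desired signal

for m in range(2, 9):
    Xm = np.lib.stride_tricks.sliding_window_view(s + v, m)[:N]   # noisy input, m taps
    L = m
    Xl, Xc = Xm[:-L], Xm[L:]
    R_L = (Xl.T @ Xc + Xc.T @ Xl) / len(Xc)
    P_L = (Xl.T @ d[L:] + Xc.T @ d[:-L]) / len(Xc)
    w = np.linalg.solve(R_L, P_L)                                 # EWC (beta = -1/2) solution
    mse = np.mean((d - Xm @ w) ** 2)
    print(f"m = {m}:  error power = {mse:.4f}")                   # should flatten for m >= 5
```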
The Effect of β on the Weight Error Vector
The effect of the cost function free parameter β on the accuracy of the solution
(compared to the true weight vector that generated the training data) is another crucial
issue. In fact, it is possible to determine the dynamics of the weight error as a function of
β. This result is provided in the following lemma.
Lemma 4: (The effect of β on the AEC solution) In the noisy training data case, the
derivative of the error vector between the optimal AEC solution and the true weight
vector, i.e., ε̃ = w̃* − w_T, with respect to β is given by
dε̃/dβ = −[(1 + 2β)(R + V) − βR_L]^{-1} [(2R − R_L)ε̃ + 2Vw̃*]    (2.17)
Proof: Recall from (2.15) that in the noisy data case, the optimal AEC solution is given
by w̃* = [(1 + 2β)(R + V) − βR_L]^{-1} [(1 + 2β)P − βP_L]. Using the chain rule for the
derivative and the fact that, for any nonsingular matrix A(β),
dA^{-1}/dβ = −A^{-1}(dA/dβ)A^{-1}, the result in (2.17) follows from a straightforward
derivation. In order to get the derivative as β → -1/2, we substitute this value into (2.17).
The significance of Lemma 4 is that it shows that no finite β value will make this
error derivative zero. The inverted matrix term, on the other hand, approaches zero as
β grows without bound. In addition, the result could be used to determine the derivative of the
Euclidean error norm, i.e., ∂‖ε̃‖/∂β.
Numerical Case Studies of AEC with the Theoretical Solution
In the preceding sections, we have built the theory of the augmented error criterion
and its special case, the error whitening criterion, for linear adaptive filter optimization.
We have investigated the behavior of the optimal solution as a function of the cost
function parameters as well as determining the optimal value of this parameter in the
noisy training data case. This section is designed to demonstrate these theoretical results
in numerical case studies with Monte Carlo simulations. In these simulations, the
following scheme will be used to generate the required autocorrelation and cross-correlation matrices.
Given the scheme depicted in Figure 2-3, it is possible to determine the true
analytic auto/cross-correlations of all signals of interest, in terms of the filter coefficients
and the noise powers. Suppose s, v, and u are zero-mean white noise signals with
powers σ_s², σ_v², and σ_u², respectively. Suppose that the coloring filter h and the
mapping filter w are unit-norm.
Figure 2-3. Demonstration scheme with coloring filter h, true mapping filter w, and the
uncorrelated white signals.
Under these conditions, we obtain
E[(x(n) + v(n))(x(n − Δ) + v(n − Δ))] = E[x(n)x(n − Δ)] + σ_v² δ(Δ)    (2.19)
E[(x(n) + v(n))d(n − Δ)] = Σ_l w_l E[x(n)x(n − Δ − l)]    (2.20)
For each combination of SNR from (-10 dB, 0 dB, 10 dB), β from (-0.5, -0.3, 0, 0.1), m from
(2, ..., 10), and L from (m, ..., 20), we have performed 100 Monte Carlo simulations using
randomly selected 30-tap FIR coloring and m-tap mapping filters. The length of the
mapping filters and that of the adaptive filters were selected to be equal in every case. In
all simulations, we used an input signal power of σ_s² = 1, and the noise powers σ_v² = σ_u²
were determined from the given SNR using SNR = 10 log₁₀(σ_s²/σ_v²). The matrices R, S,
P, and Q, which are necessary to evaluate the optimal solution given by (2.15), are then
evaluated using (2.18), (2.19), and (2.20), analytically. The results obtained are
summarized in Figure 2-4 and Figure 2-5, where, for the three SNR levels selected, the
average squared error norm of the optimal solutions (with reference to the true weights) is
given as a function of L and m for different β values. In Figure 2-4, we present the
average normalized weight vector error norm obtained using AEC at different SNR levels
and using different β values as a function of the correlation lag L that is used in the
criterion. The filter length was fixed to 10 in these simulations.
Figure 2-4. The average squared error-norm of the optimal weight vector as a function of
autocorrelation lag L for various β values and SNR levels.
Figure 2-5. The average squared error-norm of the optimal weight vector as a function of
filter length m for various β values and SNR levels.
From the theoretical analysis, we know that if the input autocorrelation matrix is
invertible, then the solution accuracy should be independent of the autocorrelation lag L.
The results of the Monte Carlo simulations presented in Figure 2-4 conform to this fact.
As expected, the optimal choice of β = -1/2 determined the correct filter weights
exactly. Another set of results, presented in Figure 2-5, shows the effect of filter length
on the accuracy of the solutions provided by the AEC. The optimal value of β = -1/2
always yields the perfect solution, whereas the accuracy of the optimal weights degrades
as this parameter is increased towards zero (i.e., as the weights approach the Wiener
solution). An interesting observation from Figure 2-5 is that for SNR levels below zero,
the accuracy of the solutions using sub-optimal β values increases, whereas for SNR
levels above zero, the accuracy decreases when the filter length is increased. For zero
SNR, on the other hand, the accuracy seems to be roughly unaffected by the filter length.
The Monte Carlo simulations performed in the preceding examples utilized the
exact coloring filter and the true filter coefficients to obtain the analytical solutions. In
our final case study, we demonstrate the performance of the batch solution of the AEC
criterion obtained from sample estimates of all the relevant auto- and cross-correlation
matrices. In these Monte Carlo simulations, we utilize 10,000 samples corrupted with
white noise at various SNR levels. The results of these Monte Carlo simulations are
summarized in the histograms shown in Figure 2-6. Each subplot of Figure 2-6
corresponds to experiments performed using SNR levels of -10 dB, 0 dB, and 10 dB for
each column and adaptive filter lengths of 4 taps, 8 taps, and 12 taps for each row,
respectively. For each combination of SNR and filter length, we have performed 50
Monte Carlo simulations using the MSE (β = 0) and EWC (β = -1/2) criteria. The
correlation lag is selected to be equal to the filter length in all simulations, due to
Theorem 2.2. Clearly, Figure 2-6 demonstrates the superiority of the AEC in rejecting
noise that is present in the training data. Notice that in all subplots (for all combinations
of filter length and SNR), AEC achieves a smaller average error norm than MSE.
Figure 2-6. Histograms of the weight error norms (dB) obtained in 50 Monte Carlo
simulations using 10000 samples of noisy data using MSE (empty bars) and
EWC with β = -0.5 (filled bars). The subfigures in each row use filters with 4,
8, and 12 taps respectively. The subfigures in each column use noisy samples
at -10, 0, and 10 dB SNR, respectively.
The discrepancy between the performances of the two solutions intensifies with
increasing filter length. Next, we will demonstrate the error-whitening property of the
EWC solution. From equation (2.1) we can expect that the error autocorrelation function
will vanish at lags greater than or equal to the length of the reference filter, if the weight
vector is identical to the true weight vector. For any other value of the weight vector, the
error autocorrelation fluctuates at non-zero values. A 4-tap reference filter is identified
with a 4-tap adaptive filter using noisy training data (hypothetical) at an SNR level of
0 dB. The autocorrelation functions of the error signals corresponding to the MSE
solution and the EWC solution are shown in Figure 2-7. Clearly, the EWC criterion
determines a solution that forces the error autocorrelation function to zero at lags greater
than or equal to the filter length (partial whitening of the error).
Figure 2-7. Error autocorrelation function for MSE (dotted) and EWC (solid) solutions.
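A hedged numerical sketch of this partial error-whitening behavior is given below; the synthetic data and names are assumptions, and the intent is only to show that the error autocorrelation of the EWC solution is near zero at lags at or beyond the filter length, while that of the MSE solution is not.

```python
import numpy as np

rng = np.random.default_rng(4)
m, L, N = 4, 4, 400_000
h = 0.9 ** np.arange(30)
s = np.convolve(rng.standard_normal(N + m), h, mode="same")
v = 1.0 * rng.standard_normal(N + m)                           # white input noise
X  = np.lib.stride_tricks.sliding_window_view(s,     m)[:N]
Xn = np.lib.stride_tricks.sliding_window_view(s + v, m)[:N]
w_true = rng.standard_normal(m)
d = X @ w_true                                                 # clean desired signal

# MSE (Wiener) and EWC (beta = -1/2, L = m) solutions on the noisy input.
w_mse = np.linalg.solve(Xn.T @ Xn, Xn.T @ d)
Xl, Xc = Xn[:-L], Xn[L:]
w_ewc = np.linalg.solve((Xl.T @ Xc + Xc.T @ Xl) / len(Xc),
                        (Xl.T @ d[L:] + Xc.T @ d[:-L]) / len(Xc))

def err_autocorr(w, lags):
    e = d - Xn @ w
    return [np.mean(e[k:] * e[:-k]) for k in lags]

lags = range(m, m + 6)                                         # lags >= filter length
print("MSE :", np.round(err_autocorr(w_mse, lags), 4))         # stays away from zero
print("EWC :", np.round(err_autocorr(w_ewc, lags), 4))         # approximately zero
```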
Finally, we will address the order selection capability and demonstrate how the
AEC (specifically EWC) can be used as a tool for determining the correct filter order,
even with noisy data, provided that the given input-desired output pair is a moving
average process. For this purpose, we determine the theoretical Wiener and EWC (with
β = -1/2 and L = m, where m is the length of the adaptive filter) solutions for a
randomly selected pair of coloring filter, h, and mapping filter w, at different adaptive
filter lengths. The noise level is selected to be 20 dB, and the length of the true mapping
filter is 5. We know from our theoretical analysis that if the adaptive filter is longer than
the reference filter, the EWC will yield the true weight vector padded with zeros. This
will not change the MSE of the solution. Thus, if we plot the MSE of the EWC versus the
length of the adaptive filter, starting from the length of the actual filter, the MSE curve
will remain flat, whereas the Wiener solution will keep decreasing the MSE,
contaminating the solution by learning the noise in the data. Figure 2-8(a) shows the
MSE obtained with the Wiener solution as well as the EWC solution for different lengths
of the adaptive filter using the same training data described above. Notice (in the
zoomed-in portion) that the MSE with EWC remains constant starting from 5, which is
the filter order that generated the data. On the other hand, if we were to decide on the
filter order looking at the MSE of the Wiener solution, we would select a model order of
4, since the gain in MSE is insignificantly small compared to the previous steps from this
point on. Figure 2-8(b) shows the norm of the weight vector error for the solutions
obtained using the EWC and MSE criteria, which confirms that the true weight vector is
indeed attained with the EWC criterion once the proper model order is reached.
This section aimed at experimentally demonstrating the theoretical concepts set
forth in the preceding sections of the chapter. We have demonstrated with numerous
Monte Carlo simulations that the analytical solution of the EWC criterion eliminates the
effect of noise completely if the proper value is used for β. We have also demonstrated
that the batch solution of EWC (estimated from a finite number of samples) outperforms
MSE in the presence of noise, provided that a sufficient number of samples are given so
that the noise autocorrelation matrices diminish as required by the theory.
In this chapter, we derived the augmented error criterion (AEC) and discussed a
special case of AEC called the error whitening criterion (EWC). The proposed AEC
includes MSE as a special case. We discussed some of the interesting properties of the
AEC cost function and worked out the analytical optimal solution. Further, we discussed
the reasoning behind naming the special case of AEC with the parameter β = -0.5 as
EWC. The intuitive reasoning is that this criterion partially whitens the error signal even
in the presence of noise, which cannot be achieved by the MSE criterion. Thus the error
whitening criterion is very useful for estimating the parameters of a linear unknown
system in the presence of additive white noise. AEC with other values of β can be used as
a constrained MSE criterion where the constraint is the smoothness of the error signal.
Most of the material presented in this chapter can be found in our related publications.
Although we have presented a complete theoretical investigation of the proposed
criterion and its analytical solution, in practice, on-line algorithms that operate on a
sample-by-sample basis to determine the desired solution are equally valuable. Therefore,
in the following chapters, we will focus on designing computationally efficient on-line
algorithms to solve for the optimal AEC solution in a fashion similar to the well-known
RLS and LMS algorithms. In fact, we aim to come up with algorithms that have the same
computational complexity as these two widely used algorithms.
FAST RECURSIVE NEWTON TYPE ALGORITHMS FOR AEC
In Chapter 2, we derived the analytical solution for AEC. We also showed
simulation results using block methods. In this Chapter, the focus will be on deriving
online, sample-by-sample Newton type algorithms to estimate the optimal AEC solution.
First, we will derive a Newton type algorithm that has a structure similar to the well-
known RLS algorithm that estimates the optimal Wiener solution for MSE criterion. The
complexity of the proposed algorithm is O(N²), which is again comparable with that of the
RLS algorithm. Then, we will propose another Newton type algorithm derived from the
principles of TLS using minor components analysis. This algorithm in its current form
estimates the optimal EWC solution, which is a special case of AEC with β = -0.5.
Derivation of the Newton Type Recursive Error Whitening Algorithm
Given the estimate of the filter tap weights at time instant (n − 1), the goal is to
determine the best set of tap weights at the next iteration n that would track the optimal
solution. We call this algorithm the Recursive Error Whitening (REW) algorithm, although
the error whitening property is applicable only when the parameter β is set to -0.5;
the algorithm can be applied with any value of β. Recall that the RLS algorithm belongs
to the class of fixed-point algorithms in the sense that it tracks the optimal Wiener
solution at every time step. The REW algorithm falls in the same category and it tracks
the optimal AEC solution at every iteration. The noteworthy feature of the fixed-point
algorithms is their exponential convergence rate as they utilize higher order information
like curvature of the performance surface. Although the complexity of the fixed-point
Newton type algorithms is higher when compared to the conventional gradient methods,
the superior convergence and robustness to the eigenspread of the data can be vital gains
in many applications.
For convenience purposes, we will drop the tilde convention that we used in the
previous chapter to differentiate between noise-corrupted and noise free matrices and
vectors. Recall that the optimal AEC solution is given by
w* = (R + βS)^{-1}(P + βQ)    (3.1)
Letting T(n) = R(n) + βS(n) and V(n) = P(n) + βQ(n), we obtain the following recursion:
T(n) = T(n − 1) + (1 + 2β)x(n)x^T(n) − βx(n − L)x^T(n) − βx(n)x^T(n − L)
     = T(n − 1) + (2βx(n) − βx(n − L))x^T(n) + x(n)(x(n) − βx(n − L))^T    (3.2)
Realize that equation (3.2) basically tells us that the matrix T(n) can be obtained
recursively using a rank-2 update. In comparison (see Chapter 1), the RLS algorithm
utilizes a rank-1 update for updating the covariance matrix. At this point, we invoke the
matrix inversion lemma (Sherman-Morrison-Woodbury identity) [7,8] given by
(A + BCD^T)^{-1} = A^{-1} − A^{-1}B(C^{-1} + D^T A^{-1}B)^{-1}D^T A^{-1}    (3.3)
Substituting A = T(n − 1), B = [(2βx(n) − βx(n − L))  x(n)], C = I_{2×2} (a 2×2 identity
matrix), and D = [x(n)  (x(n) − βx(n − L))], we see that equation (3.2) has the same form as
the left-hand side of equation (3.3). Therefore, the recursion for the inverse of T(n) becomes
(Note that the matrix inversion lemma simplifies the computation of the matrix inverse only when the
original matrix can be written using reduced-rank updates.)
T^{-1}(n) = T^{-1}(n − 1) − T^{-1}(n − 1)B(I_{2×2} + D^T T^{-1}(n − 1)B)^{-1}D^T T^{-1}(n − 1)    (3.4)
Note that the computation of the above inverse is different from that in the conventional RLS
algorithm. It requires the inversion of a 2×2 matrix (I_{2×2} + D^T T^{-1}(n − 1)B) owing to the
rank-2 update of T(n). The recursive estimator for V(n) is a simple correlation estimator:
V(n) = V(n − 1) + (1 + 2β)d(n)x(n) − βd(n)x(n − L) − βd(n − L)x(n)    (3.5)
Using T^{-1}(n) and V(n), an estimate of the filter weight vector at iteration index n is
w(n) = T^{-1}(n)V(n)    (3.6)
We will define a gain matrix analogous to the gain vector in the RLS case as
K(n) = T^{-1}(n − 1)B(I_{2×2} + D^T T^{-1}(n − 1)B)^{-1}    (3.7)
Using the above definition, the recursive estimate for the inverse of T(n) becomes
T^{-1}(n) = T^{-1}(n − 1) − K(n)D^T T^{-1}(n − 1)    (3.8)
Once again, the above equation is analogous to the Riccati equation for the RLS
algorithm. Multiplying (3.7) from the right by (I_{2×2} + D^T T^{-1}(n − 1)B), we obtain
K(n)(I_{2×2} + D^T T^{-1}(n − 1)B) = T^{-1}(n − 1)B
K(n) = T^{-1}(n − 1)B − K(n)D^T T^{-1}(n − 1)B    (3.9)
K(n) = T^{-1}(n)B
In order to derive an update equation for the filter weights, we substitute the recursive
estimate for V(n) in (3.6):
w(n) = T^{-1}(n)V(n − 1) + T^{-1}(n)[(1 + 2β)d(n)x(n) − βd(n)x(n − L) − βd(n − L)x(n)]    (3.10)
Using (3.8) and recognizing the fact that w(n − 1) = T^{-1}(n − 1)V(n − 1), the above
equation can be reduced to
w(n) = w(n − 1) − K(n)D^T w(n − 1) + T^{-1}(n)[(1 + 2β)d(n)x(n) − βd(n)x(n − L) − βd(n − L)x(n)]    (3.11)
Using the definition for B = [(2βx(n) − βx(n − L))  x(n)], we can easily see that
(1 + 2β)d(n)x(n) − βd(n)x(n − L) − βd(n − L)x(n) = B[d(n)  d(n) − βd(n − L)]^T    (3.12)
From (3.9) and (3.12), the weight update equation simplifies to
w(n) = w(n − 1) − K(n)D^T w(n − 1) + K(n)[d(n)  d(n) − βd(n − L)]^T    (3.13)
Note that the product D^T w(n − 1) is nothing but the vector of filter outputs
[y(n)  y(n) − βy(n − L)]^T, where y(n) = x^T(n)w(n − 1) and y(n − L) = x^T(n − L)w(n − 1).
The a priori error vector is defined as
e(n) = [d(n) − y(n)   (d(n) − y(n)) − β(d(n − L) − y(n − L))]^T = [e(n)   e(n) − βe(n − L)]^T    (3.14)
Using all the above definitions, we will formally state the weight update equation for the
REW algorithm as
w(n) = w(n − 1) + K(n)e(n)    (3.15)
The overall complexity of (3.15) is O(N²), which is comparable to the complexity of the
RLS algorithm (this was achieved by using the matrix inversion lemma). Unlike the
stochastic gradient algorithms that are easily affected by the eigenspread of the input data
and the type of the stationary point (minimum, maximum or saddle), the REW
algorithm is immune to these problems, because it inherently makes use of more
information about the performance surface by computing the inverse of the Hessian
matrix R + βS. A summary of the REW algorithm is given below in Table 3-1.
Table 3-1. Outline of the REW algorithm.
Initialize T^{-1}(0) = cI, where c is a large positive constant, and w(0) = 0.
At every iteration, compute
B = [(2βx(n) − βx(n − L))  x(n)]  and  D = [x(n)  (x(n) − βx(n − L))]
K(n) = T^{-1}(n − 1)B(I_{2×2} + D^T T^{-1}(n − 1)B)^{-1}
y(n) = x^T(n)w(n − 1)  and  y(n − L) = x^T(n − L)w(n − 1)
e(n) = [d(n) − y(n)   (d(n) − y(n)) − β(d(n − L) − y(n − L))]^T
w(n) = w(n − 1) + K(n)e(n)
T^{-1}(n) = T^{-1}(n − 1) − K(n)D^T T^{-1}(n − 1)
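A compact Python sketch of the recursion in Table 3-1 is given below. It follows the listed steps directly; the tap-ordering convention, the initialization constant, and the data handling are assumptions of this sketch rather than a definitive implementation.

```python
import numpy as np

def rew(x, d, m, L, beta=-0.5, c=1e4):
    """Sketch of the Recursive Error Whitening (REW) update (Table 3-1).

    x, d : 1-D input / desired sequences; m : filter length; L : correlation lag.
    Returns the final weight estimate (illustrative implementation only).
    """
    Tinv = c * np.eye(m)                          # T^{-1}(0) = cI
    w = np.zeros(m)
    for n in range(L + m - 1, len(x)):
        xn  = x[n - m + 1 : n + 1][::-1]          # x(n) = [x(n), ..., x(n-m+1)]
        xnL = x[n - L - m + 1 : n - L + 1][::-1]  # x(n-L)
        B = np.column_stack([2 * beta * xn - beta * xnL, xn])
        D = np.column_stack([xn, xn - beta * xnL])
        K = Tinv @ B @ np.linalg.inv(np.eye(2) + D.T @ Tinv @ B)
        y, yL = xn @ w, xnL @ w
        e, eL = d[n] - y, d[n - L] - yL
        eps = np.array([e, e - beta * eL])        # a priori error vector
        w = w + K @ eps
        Tinv = Tinv - K @ D.T @ Tinv
    return w
```

With β = -0.5 and L set to the filter length, the returned weights are expected to approach the true filter even when the input is corrupted by additive white noise, consistent with the analysis above.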
The above derivation assumes stationary signals. For non-stationary signals, a
forgetting factor is required for tracking. Inclusion of this factor in the derivation is trivial
and is left out in this chapter. Also, note that the REW algorithm can be applied for any
value of β. When β = -0.5, AEC reduces to EWC and hence the REW
algorithm can be used for estimating the parameters in the presence of input white noise.
Extension of the REW Algorithm for Multiple Lags
In Chapter 2, we briefly mentioned the fact that the AEC can be extended by including
multiple lags in the cost function. It is easy to see that the extended AEC is given by
J(w) = E[e²(n)] + β Σ_{L=1}^{Lmax} E[(e(n) − e(n − L))²]    (3.16)
where Lmax denotes the maximum number of lags utilized in the AEC cost function. It is
not mandatory to use the same constant β for all the error lag terms; however, for the
sake of simplicity, we assume a single β value. The gradient of (3.16) with respect to the
weight vector w is
∂J/∂w = −2E{ e(n)x(n) + β Σ_{L=1}^{Lmax} [e(n) − e(n − L)][x(n) − x(n − L)] }    (3.17)
Recall the following matrix definitions (restated here for clarity):
R = E[x(n)x^T(n)]
S_L = E[(x(n) − x(n − L))(x(n) − x(n − L))^T] = 2R − R_L
R_L = E[x(n)x^T(n − L) + x(n − L)x^T(n)]    (3.18)
P = E[x(n)d(n)]
P_L = E[x(n)d(n − L) + x(n − L)d(n)]
Q_L = E[(x(n) − x(n − L))(d(n) − d(n − L))] = 2P − P_L
Using the above definitions in (3.17) and equating the gradient to zero, we get the
optimal extended AEC solution as shown below.
w* = (R + β Σ_{L=1}^{Lmax} S_L)^{-1} (P + β Σ_{L=1}^{Lmax} Q_L)    (3.19)
At first glance, the computational complexity of (3.19) seems to be O(N³). But the
symmetric structure of the matrices involved can be exploited to lower the complexity.
Once again, we resort to the matrix inversion lemma as before and deduce a lower, O(N²)
complexity algorithm. Realize that the optimal extended AEC solution at any time instant
n will be
w(n) = T^{-1}(n)V(n)    (3.20)
where T(n) = R(n) + β Σ_{L=1}^{Lmax} S_L(n) and V(n) = P(n) + β Σ_{L=1}^{Lmax} Q_L(n), as before. The estimator
for the vector V(n) will be a simple recursive correlator:
V(n) = V(n − 1) + (1 + 2βLmax)d(n)x(n) − β Σ_{L=1}^{Lmax} [d(n)x(n − L) + d(n − L)x(n)]    (3.21)
The matrix T(n) can be estimated recursively as follows:
T(n) = T(n − 1) + (1 + 2βLmax)x(n)x^T(n) − β Σ_{L=1}^{Lmax} [x(n − L)x^T(n) + x(n)x^T(n − L)]
     = T(n − 1) + (2βLmax x(n) − β Σ_{L=1}^{Lmax} x(n − L))x^T(n) + x(n)(x(n) − β Σ_{L=1}^{Lmax} x(n − L))^T    (3.22)
Now, the matrices A, B, C and D used in the inversion lemma in equation (3.3) are
defined as follows:
A = T(n − 1)
B = [(2βLmax x(n) − β Σ_{L=1}^{Lmax} x(n − L))  x(n)],  C = I_{2×2},  D = [x(n)  (x(n) − β Σ_{L=1}^{Lmax} x(n − L))]
The only differences from the previous definitions lie in the expressions for the B and D
matrices that now require an inner loop running up to Lmax. The rest of the procedure
remains the same as before. Once again, by the proper application of the matrix inversion
lemma, we were able to reduce the complexity of the matrix inversion to O(N2) by
recursively computing the inverse in a way that we only require an inversion of a simple
2x2 matrix. This measure of complexity does not include the computations involved in
building the B and D matrices. However, typically, the maximum number of lags will be
smaller than the length of the adaptive filter. Therefore, the additional overhead incurred
in the estimation of the B and D matrices will not result in a significant change in the overall
complexity.
Relationship to the Recursive Instrumental Variables Method
The previously derived REW algorithm for the single lag case has a structure
similar to the Instrumental Variables (IV) method. The IV method has its origins in
statistics and was apparently first proposed by Reiersøl. Over a period of time, it has
been adapted to model dynamical systems in control engineering. A lot of work on the
applications of IV to control engineering problems has been done by Wong, Polak
and Young [124-126]. Recent advances in IV methods for system identification and
control have been mainly due to Söderström and Stoica [32,93]. It is beyond the scope of
this dissertation to summarize the applications and impacts of IV in various engineering
problems; for more details, refer to the literature cited above.
Basically, IV can be viewed as an extension to the standard Least Squares
regression and can be used to estimate the parameters in white noise once the model
order is known. The fundamental principle is to choose delayed regression vectors known
as instruments that are uncorrelated with the additive white noise. IV can also be
extended to handle colored noise situations. This will be exclusively handled in Chapter
5. For now, the discussion will be strictly limited to the white noise scenario.
Mathematically speaking, the IV method computes the solution
w_IV = E[x(n)x^T(n − Δ)]^{-1} E[x(n)d(n − Δ)]    (3.23)
where the lag Δ is chosen such that the outer product of the regression vector x(n) with
the lagged regression vector x(n − Δ) results in a matrix that is independent of the additive
white noise components v(n). In comparison, the REW solution is given by w* = R_L^{-1}P_L.
Notice that in the REW solution, the matrix R_L is symmetric and Toeplitz, which is very
much desirable, and we exploit this fact to derive an elegant minor components based
algorithm in the next section. Thus, in effect, the IV method can be considered as a special
case of the REW algorithm obtained by removing the symmetric terms in R_L and P_L.
We will compare the performances of REW and IV methods later in this chapter.
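For reference, a minimal batch version of the IV estimate in (3.23) might look as follows; the function name and data layout are assumptions, and the sketch ignores the recursive (sample-by-sample) formulation used in practice.

```python
import numpy as np

def iv_solution(X, d, delta):
    """Batch instrumental-variables estimate (3.23):
    w_IV = E[x(n) x^T(n-delta)]^{-1} E[x(n) d(n-delta)].
    X : (N, m) regression vectors, d : (N,) desired samples."""
    Xc, Xl, dl = X[delta:], X[:-delta], d[:-delta]
    C = Xc.T @ Xl / len(Xc)          # E[x(n) x^T(n-delta)]  (not symmetrized)
    p = Xc.T @ dl / len(Xc)          # E[x(n) d(n-delta)]
    return np.linalg.solve(C, p)
```

Unlike R_L in the REW/EWC solution, the matrix C above is not symmetrized, which is the structural difference highlighted in the text.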
Recursive EWC Algorithm Based on Minor Components Analysis
Until now, we focused on a Newton type algorithm to compute the optimal AEC
solution. Although the algorithm is fast converging, its convergence can
be sensitive to the ill-conditioning of the Hessian matrix R(n) + βS(n), which can happen
during the first few iterations. Alternatively, we can explore the idea of using minor
components analysis (MCA) to derive a recursive algorithm similar to the TLS algorithm
for MSE. We call this algorithm the EWC-TLS algorithm. As the name suggests, this
algorithm can be used only for the case β = -0.5, which reduces the augmented error
criterion to the error whitening criterion.
Recall that the TLS problem, in general, solves an
over-determined set of linear equations of the form Ax = b, where A ∈ ℝ^{m×n} is the data
matrix, b ∈ ℝ^m is the desired vector, x ∈ ℝ^n is the parameter vector, and m denotes
the number of different observation vectors, each of dimension n. Alternatively, the
linear equations can be written in the form [A b][x^T  −1]^T = 0, where [A b] denotes an
augmented data matrix.
When [A b] is a symmetric square matrix, it can be shown that
the TLS solution is simply given by
[x^T  −1]^T = −v_{n+1} / v_{n+1,n+1}    (3.24)
where v_{n+1,n+1} is the last element of the minor eigenvector v_{n+1}. In the case of EWC, it is
easy to show that the augmented data matrix [127,128] (analogous to [A b]) is
G = [ R_L   P_L ; P_L^T   p_d(L) ]    (3.25)
The term p_d(L) in (3.25) denotes the autocorrelation of the desired signal at lag L. It is
important to note that the matrix in (3.25) is square and symmetric due to the symmetry of R_L.
Hence, the eigenvectors of G are all real which is highly desirable. Also, it is important to
note the fact that (3.25) still holds even with noisy data as the entries of G are unaffected
by the noise terms. In the infinite-sample case, the matrix G is not full rank and we can
immediately see that one of the eigenvalues of (3.25) is zero. In the finite-sample case,
the goal would be to find the eigenvector corresponding to the minimum absolute
eigenvalue (finite samples also imply that G^{-1} exists). Since the eigenvalues of G can be
both positive and negative, typical iterative gradient or even some fixed-point type
algorithms tend to become unstable. A workaround would be to use the matrix G² instead
of G. This would obviate the problem of having mixed eigenvalues while still preserving
the eigenvectors. However, the squaring operation is good only if the eigenvalues of G
are well separated; otherwise, the smaller eigenvalues blend together, making the
estimation of the minor component of G² more difficult. Also, the squaring operation
creates additional overhead, thereby negating any computational benefits offered by the
fixed-point type PCA solutions discussed in Appendix A.
So, we propose to use the inverse iteration method for estimating the minor
eigenvector of G. If w(n) denotes the estimate of the minor eigenvector
corresponding to the smallest absolute eigenvalue at time instant n, then the estimate at
the (n+1)th instant is given by
w̃(n + 1) = G(n + 1)^{-1} w(n)
w(n + 1) = w̃(n + 1) / ‖w̃(n + 1)‖    (3.26)
The term G(n + 1) denotes the estimate of the augmented data matrix G (equation (3.25))
at the (n+1)th instant. It is easy to see that G(n) can be recursively estimated as
G(n) = G(n − 1) + ψ(n)ψ^T(n − L) + ψ(n − L)ψ^T(n), where ψ(n) = [x(n); d(n)] is the
concatenated vector of the input and desired response. Now, we can invoke the inversion
lemma as before and obtain a recursive O(N²) estimate for the matrix inversion in (3.26). The
details of this derivation are trivial and omitted here. Once the minor component estimate
converges, i.e., w(n) → v_{n+1}, the EWC-TLS solution is simply given by equation (3.24).
Thus, the overall complexity of the EWC-TLS algorithm is still O(N²), which is the same
as the REW algorithm. However, we have observed through simulations that the EWC-
TLS method converges faster than EWC-REW while preserving the accuracy of the solution.
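The batch sketch below illustrates the EWC-TLS idea: build the augmented lag-L correlation matrix G of (3.25) from samples, run inverse iteration (3.26) to find the minor eigenvector, and read off the weights via (3.24). For clarity it inverts G once instead of using the recursive inversion-lemma update; the data layout and names are assumptions of this sketch.

```python
import numpy as np

def ewc_tls(X, d, L, iters=200):
    """Sketch of the batch EWC-TLS solution.

    X : (N, m) input vectors, d : (N,) desired samples, L : correlation lag.
    """
    Z = np.column_stack([X, d])                    # psi(n) = [x(n); d(n)]
    Zl, Zc = Z[:-L], Z[L:]
    G = (Zc.T @ Zl + Zl.T @ Zc) / len(Zc)          # augmented lag-L correlation matrix (3.25)
    Ginv = np.linalg.inv(G)
    v = np.ones(G.shape[0]) / np.sqrt(G.shape[0])  # initial eigenvector guess
    for _ in range(iters):                         # inverse iteration (3.26)
        v = Ginv @ v
        v = v / np.linalg.norm(v)
    return -v[:-1] / v[-1]                         # [w; -1] = -v / v_last, per (3.24)
```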
We will now show the simulation results with the Newton type algorithms for
AEC. Specifically, our objective is to show the superior performance of the proposed
criterion and the associated algorithms in the problem of system identification with noisy data.
Estimation of System Parameters in White Noise Using REW
The REW algorithm can be used effectively to solve the system identification
problem in noisy environments. As we have seen before, setting the value of β = -0.5 provides
noise immunity for parameter estimation. We generated a purely white
Gaussian random noise sequence of 50,000 samples and added it to a colored input signal.
The white noise signal is uncorrelated with the input signal. The noise free, colored, input
signal was filtered by the unknown reference filter, and this formed the desired signal for
the adaptive filter. Since the noise in the desired signal would be averaged out for both
RLS and REW algorithms, we decided to use the clean desired signal itself. This will
bring out only the effects of input noise on the filter estimates. Also, the noise added to
the clean input is uncorrelated with the desired signal. In the experiment, we varied the
Signal-to-Noise Ratio (SNR) in the range -10 dB to +10 dB. The number of desired filter
coefficients was also varied from 4 to 12. We then performed 100 Monte Carlo runs and
computed the normalized error vector norm given by
error = 20 log₁₀( ‖w_est − w_T‖ / ‖w_T‖ )    (3.27)
where w_est is the weight vector estimated by the REW algorithm with β = -0.5 after
50,000 iterations, or one complete presentation of the input data, and w_T is the true weight
vector. In order to show the effectiveness of the REW algorithm, we performed Monte
vector. In order to show the effectiveness of the REW algorithm, we performed Monte
Carlo runs using the RLS algorithm on the same data to estimate the filter coefficients.
Further, we also evaluated the analytical TLS solution for each case. Figure 3-1 shows a
histogram plot of the normalized error vector norm given in (3.27) for all the three
methods. It is clear that the REW algorithm was able to perform better than the RLS at
various SNR and tap length settings. In the high SNR cases, there is not much of a
difference between RLS and REW results. However, under noisy circumstances, the
reduction in the parameter estimation error with REW is orders of magnitude more when
compared with RLS. Also, the RLS algorithm results in a rather useless zero weight
vector, i.e., w = 0 when the SNR is lower than -10dB. On the other hand, TLS performs
well only in the cases when the noise variances in the input and desired signals are the
same. This is in conformance with the well-known theoretical limitations of the TLS method.
Figure 3-1. Histogram plots showing the error vector norms for the REW and RLS algorithms
and the numerical TLS solution.
Figure 3-2 shows the average error vector norm values obtained with the REW algorithm
over various β values at two SNR levels.
Figure 3-2. Performance of the REW algorithm at (a) SNR = 0 dB and (b) SNR = -10 dB over
various beta values.
At β = -0.5, this clearly gives us the minimum estimation error, which corresponds to the EWC
solution. For β = 0 (indicated by a "o" in the figure), the REW algorithm reduces to the
regular RLS, giving a fairly significant estimation error.
Next, the parameter β is set to -0.5 and the SNR to 0 dB, and the weight tracks are
estimated for the REW and the RLS algorithms. Figure 3-3 shows the averaged weight
tracks for both REW and RLS algorithms averaged over 50 Monte Carlo trials. Asterisks
on the plots indicate the true parameters. The tracks for the RLS algorithm are smoother,
but they converge to wrong values, which we have observed quite consistently. The
weight tracks for the REW algorithm are noisier compared to those of the RLS, but they
eventually converge to values very close to the true weights. This brings us to an
important issue for estimators, viz., bias and variance. The RLS algorithm has a
reduced variance because of the positive definiteness of the covariance matrix R(n).
However, the RLS solution remains asymptotically biased in the presence of noisy input.
Figure 3-3. Weight tracks for REW and RLS algorithms.
On the other hand, the REW algorithm produces zero bias, but the variance can be high
owing to the conditioning of the Hessian matrix. However, this variance diminishes with
increasing number of samples.
The noisy initial weight tracks of the REW algorithm may be attributed to the ill-
conditioning that is mainly caused by the smallest eigenvalue of the estimated Hessian
matrix, which is R(n) + βS(n). The same holds true for the RLS algorithm, where the
minimum eigenvalue of R(n) affects the sensitivity. The instability issues of the
RLS algorithm during the initial stages of adaptation have been well studied in the literature,
the effects of round-off error have been analyzed, and many solutions have been proposed
to make the RLS algorithm robust to such effects. Similar analysis on the REW
algorithm is yet to be done and this would be addressed in future work on the topic.
Performance Comparisons between REW, EWC-TLS and IV methods
In this example, we will contrast the performances of the REW, EWC-TLS and the
Instrumental Variables (IV) method in a 4-tap system identification problem with noisy
data. The input signal is colored and corrupted with white noise (input SNR was set at
5dB) and the desired signal SNR is 10dB. For the IV method, we chose the delayed input
vector x(n − Δ) as the instrument, and the lag Δ was chosen to be four, which is the filter
length. For each method, we computed the normalized error vector norm in dB given by
equation (3.27).
Figure 3-4. Histogram plots showing the error vector norms for all the methods
(SNR = 5 dB, 4 taps).
Figure 3-4 shows the error histograms for
REW, EWC-TLS, IV and the optimal Wiener solutions. EWC-TLS and REW algorithms
outperform the Wiener MSE solution. The IV method also produces better results than
the Wiener solution. Amongst the EWC solutions, we obtained better results with the
EWC-TLS algorithm (equations 3.24 and 3.26) than REW. However, both EWC-TLS
and REW outperformed IV method. This may be partially attributed to the conditioning
of the matrices involved in the estimation of the REW and IV methods. Further
theoretical analysis is required to quantify the effects of conditioning and symmetric
Toeplitz structure of R_L. In Figure 3-5, we show the angle between the estimated minor
eigenvector and the true eigenvector of the augmented data matrix G for a random single
trial in scenarios with and without noise. Notice that the rates of convergence are very
much different. It is well known that the rate of convergence for the inverse iteration method
is given by the ratio |λ₂/λ₁|, where λ₁ is the largest eigenvalue of G^{-1} and λ₂ is the
second largest eigenvalue of G^{-1}. Faster convergence can be seen in the noiseless case
owing to the huge |λ₁/λ₂| ratio.
Figure 3-5. Convergence of the minor eigenvector of G with (a) noisy data and (b) clean data.
In this chapter, we derived recursive Newton type algorithms to estimate the optimal
AEC solution. First, the Recursive Error Whitening (REW) algorithm was derived using
the analytical AEC solution and the matrix inversion lemma. The well-known RLS
algorithm for MSE becomes a special case of the REW algorithm. Further, a Total Least-
Squares based EWC algorithm called EWC-TLS was proposed. This algorithm works
with β = -0.5 and can be easily applied to estimate parameters in the presence of white
noise. A fixed-point minor components extraction algorithm was developed using the
inverse iteration method. Other fixed-point or gradient-based methods cannot be used
because of the indefiniteness (matrix with mixed eigenvalues make the algorithms locally
unstable) of the matrix involved in the EWC-TLS formulation. The computational
complexity of the above mentioned algorithms is O(N²). We briefly explored an
extension of the Newton type algorithm for the extended AEC with multiple lags.
Effective usage of the matrix inversion lemma can cut the complexity of the extended
REW algorithm to O(N²).
In the latter half of the chapter, we discussed the performance of the algorithms in
the problem of system identification in the presence of additive white noise. The
proposed recursive algorithms outperform the RLS and the analytical MSE TLS
solutions. We also showed the simulation results with the EWC-TLS algorithm and
quantitatively compared the performance with the well-known IV method.
Although the recursive EWC algorithms presented in this chapter are fast
converging and sample efficient, the complexity of O(N²) can be high for many
applications involving low power designs. Additionally, the recursive algorithms can
exhibit limited performance in non-stationary conditions if the forgetting factors are not
chosen properly. This motivates us to explore stochastic gradient algorithms (and their variants)
for estimating the optimal AEC solution. The next chapter will describe these
algorithms and also highlight other benefits of the stochastic algorithms over their
Newton type counterparts.
STOCHASTIC GRADIENT ALGORITHMS FOR AEC
Stochastic gradient algorithms have been at the forefront in optimizing quadratic
cost functions like the MSE. Owing to the presence of a global minimum in quadratic
performance surfaces, gradient algorithms can elegantly accomplish the task of reaching
the optimal solution at minimal computational cost. In this chapter, we will derive the
stochastic gradient algorithms for the AEC. Since the AEC performance surface is a
weighted sum of quadratics, we can expect that difficulties will arise. However, we will
show that using some simple optimization tricks, we can overcome these difficulties in an elegant manner.
Derivation of the Stochastic Gradient AEC-LMS Algorithm
Assume that we have a noisy training data set of the form (x(n), d(n)), where
x(n) ∈ ℝ^m is the input and d(n) ∈ ℝ is the output of a linear system with coefficient
vector w_T. The goal is to estimate the parameter vector w_T using the augmented error
criterion. We know that the AEC cost function is given by
J(w) = E[e²(n)] + βE[ė²(n)]    (4.1)
where ė(n) = e(n) − e(n − L), w is the estimate of the parameter vector, and L ≥ m, the
size of the input vector. For convenience, we will restate the following definitions:
ẋ(n) = x(n) − x(n − L), ḋ(n) = d(n) − d(n − L), R = E[x(n)x^T(n)], S = E[ẋ(n)ẋ^T(n)],
P = E[x(n)d(n)] and Q = E[ẋ(n)ḋ(n)]. Using these definitions, we can rewrite the cost
function in (4.1) as
J(w) = E[d²(n)] + βE[ḋ²(n)] + w^T(R + βS)w − 2(P + βQ)^T w    (4.2)
It is easy to see that both E[e²(n)] and E[ė²(n)] have parabolic performance surfaces, as
their Hessians have positive eigenvalues. However, the value of β can invert the
performance surface of E[ė²(n)]. For β > 0, the stationary point is always a global
minimum and the gradient of (4.2) can be written as the sum of the
individual gradients as shown below:
∂J/∂w = 2(R + βS)w − 2(P + βQ) = 2(Rw − P) + 2β(Sw − Q)    (4.3)
The above gradient can be approximated by the stochastic instantaneous gradient by
removing the expectation operators:
∂J/∂w ≈ −2[e(n)x(n) + βė(n)ẋ(n)]    (4.4)
The goal is to minimize the cost function, and hence, using steepest descent, we can write
the weight update for the stochastic AEC-LMS algorithm for β > 0 as
w(n + 1) = w(n) + η(n)(e(n)x(n) + βė(n)ẋ(n))    (4.5)
where η(n) > 0 is a finite step-size parameter that controls convergence.
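A minimal sketch of the AEC-LMS update (4.5) is shown below, assuming a tapped-delay-line input and a small fixed step-size; the function name, data handling, and stopping point are illustrative assumptions.

```python
import numpy as np

def aec_lms(x, d, m, L, beta, eta):
    """Sketch of the stochastic AEC-LMS update (4.5), intended for beta > 0.

    x, d : 1-D input / desired sequences; m : filter length; L : lag; eta : small step-size.
    """
    w = np.zeros(m)
    for n in range(L + m - 1, len(x)):
        xn  = x[n - m + 1 : n + 1][::-1]              # x(n)
        xnL = x[n - L - m + 1 : n - L + 1][::-1]      # x(n-L)
        e  = d[n] - xn @ w                            # e(n)
        eL = d[n - L] - xnL @ w                       # e(n-L)
        x_dot, e_dot = xn - xnL, e - eL               # lag-L differences
        w += eta * (e * xn + beta * e_dot * x_dot)    # O(m) gradient-descent step
    return w
```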
For β < 0, the stationary point is still unique, but it can be a saddle point, a global maximum, or a global
minimum depending on the value of β. Evaluating the gradient as before and using the
instantaneous gradient, we get the AEC-LMS algorithm for β < 0:
w(n + 1) = w(n) + η(n)(e(n)x(n) − |β|ė(n)ẋ(n))    (4.6)
where η(n) is again a small step-size. However, there is no guarantee that the above
update rules will be stable for all choices of step-sizes. Although equations (4.5) and
(4.6) are identical in form, we will use |β| in the update equation (4.6) to analyze the
convergence of the algorithm specifically for β < 0. The reason for the separate analysis
is that the convergence characteristics of (4.5) and (4.6) are very different.
Convergence Analysis of AEC-LMS Algorithm
Theorem 4.1. The stochastic AEC algorithms asymptotically converge in the mean to the
optimal solution given by
w* = (R + βS)^{-1}(P + βQ),   β > 0
w* = (R − |β|S)^{-1}(P − |β|Q),   β < 0
We will make the following mild assumptions typically applied to stochastic
approximation algorithms [79-81,84] that can be easily satisfied.
1. The input vectors x(n) are derived from at least a wide sense stationary (WSS)
colored random signal with positive definite autocorrelation matrix R = E[x(n)x^T(n)].
2. The matrix R_L = E[x(n)x^T(n − L) + x(n − L)x^T(n)] exists and has full rank.
3. The sequence of weight vectors w(n) is bounded with probability 1.
4. The update functions h_1(w(n)) = e(n)x(n) + βė(n)ẋ(n) for β > 0 and
h_2(w(n)) = e(n)x(n) − |β|ė(n)ẋ(n) for β < 0 exist and are continuously
differentiable with respect to w(n), and their derivatives are bounded in time.
5. Even if h(w(n)) has some discontinuities, a mean update vector
h̄(w(n)) = lim_{n→∞} E[h(w(n))] exists.
Assumption A.1 is easily satisfied. A.2 requires that the input signal have sufficient
correlation with itself for at least L lags.
Proof of AEC-LMS Convergence for β > 0
We will first consider the update equation in (4.5), which is the stochastic AEC-
LMS algorithm for β > 0. Without loss of generality, we will assume that the input
vectors x(n) and their corresponding desired responses d(n) are noise-free. The mean
update vector h̄(w(n)), which defines the associated ODE dw/dt = h̄(w), is given by
h̄(w(n)) = E[e(n)x(n) + βė(n)ẋ(n)] = −[(R + βS)w(n) − (P + βQ)]    (4.8)
The stationary point of the ordinary differential equation (ODE) in (4.8) is given by
w* = (R + βS)^{-1}(P + βQ)    (4.9)
We will define the error vector at time instant n as ε(n) = w* − w(n). Therefore,
ε(n + 1) = ε(n) − η(n)[e(n)x(n) + βė(n)ẋ(n)]    (4.10)
and the norm of the error vector at time n+1 is simply
‖ε(n + 1)‖² = ‖ε(n)‖² − 2η(n)[ε^T(n)e(n)x(n) + βε^T(n)ė(n)ẋ(n)] + η²(n)‖e(n)x(n) + βė(n)ẋ(n)‖²    (4.11)
Imposing the condition that ‖ε(n + 1)‖² < ‖ε(n)‖² for all n, we get an upper bound on the
time-varying step-size parameter η(n), which is given by
η(n) < 2[ε^T(n)e(n)x(n) + βε^T(n)ė(n)ẋ(n)] / ‖e(n)x(n) + βė(n)ẋ(n)‖²    (4.12)
Simplifying the above equation using the fact that ε^T(n)x(n) = e(n) and
ε^T(n)ẋ(n) = ė(n), we get
η(n) < 2[e²(n) + βė²(n)] / ‖e(n)x(n) + βė(n)ẋ(n)‖²    (4.13)
which is a more practical upper bound on the step-size, as it can be directly estimated
from the input and desired data. As an observation, notice that if β = 0, then the bound
in (4.13) reduces to
η(n) < 2/‖x(n)‖²    (4.14)
which, when included in the update equation, reduces to a variant of the Normalized
LMS (NLMS) algorithm. In general, if the step-size parameter is chosen according
to the bound given by (4.13), then the norm of the error vector ε(n) is a monotonically
decreasing sequence converging asymptotically to zero, i.e., lim_{n→∞} ‖ε(n)‖² → 0, which
implies that lim_{n→∞} w(n) → w* (see Appendix G for details). In addition, the upper bound on
the step-size ensures that the weights are always bounded with probability one, satisfying
the assumption A.3 we made before. Thus the weight vector w(n) converges
asymptotically to w*, which is the only stable stationary point of the ODE in (4.8). Note
that (4.5) is an O(m) algorithm.
Proof of AEC-LMS Convergence for β < 0
We analyze the convergence of the stochastic gradient algorithm for β < 0 in the
presence of white noise because this is the relevant case (β = -0.5 eliminates the bias
due to noise added to the input). From (4.6), the mean update vector h̄(w(n)) is given by
h̄(w(n)) = E[e(n)x(n) − |β|ė(n)ẋ(n)] = −[(R − |β|S)w(n) − (P − |β|Q)]    (4.15)
As before, the stationary point of this ODE is
w* = (R − |β|S)^{-1}(P − |β|Q)    (4.16)
The eigenvalues of R − |β|S decide the nature of the stationary point. If they are all
positive, then we have a global minimum, and if they are all negative, we have a global
maximum. In these two cases, the stochastic gradient algorithm in (4.6) with a proper fixed-
sign step-size would converge to the stationary point, which would be stable. However,
we know that the eigenvalues of R − |β|S can also take both positive and negative values
resulting in a saddle stationary point. Thus, the underlying dynamical system would have
both stable and unstable modes making it impossible for the algorithm in (4.6) with fixed
sign step-size to converge. This is well known in the literature [3,14]. However, as will
be shown next, this difficulty can be removed for our case by appropriately utilizing the
sign of the update equation (remember that this saddle point is the only stationary point
of the quadratic performance surface). The general idea is to use a vector step-size (one
step-size per weight) having both positive and negative values. One unrealistic way (for
an on-line algorithm) to achieve this goal is to estimate the eigenvalues of R − |β|S.
Alternatively, we can derive the conditions on the step-size for guaranteed convergence.
As before, we will define the error vector at time instant n as ε(n) = w* − w(n). The
norm of the error vector at time instant n+1 is given by
‖ε(n + 1)‖² = ‖ε(n)‖² − 2η(n)[ε^T(n)e(n)x(n) − |β|ε^T(n)ė(n)ẋ(n)] + η²(n)‖e(n)x(n) − |β|ė(n)ẋ(n)‖²    (4.17)
Taking the expectations on both sides, we get
E‖ε(n + 1)‖² = E‖ε(n)‖² − 2η(n)E[ε^T(n)e(n)x(n) − |β|ε^T(n)ė(n)ẋ(n)] + η²(n)E‖e(n)x(n) − |β|ė(n)ẋ(n)‖²    (4.18)
The mean of the error vector norm will monotonically decay to zero over time, i.e.,
E‖ε(n + 1)‖² < E‖ε(n)‖², if and only if the step-size satisfies the following inequality:
|η(n)| < 2E[ε^T(n)e(n)x(n) − |β|ε^T(n)ė(n)ẋ(n)] / E‖e(n)x(n) − |β|ė(n)ẋ(n)‖²    (4.19)
Let x(n) = x̄(n) + v(n) and d(n) = d̄(n) + u(n), where x̄(n) and d̄(n) are the clean
input and desired data, respectively. We will further assume that the input noise vector
v(n) and the noise component u(n) in the desired signal are uncorrelated. Also, the
noise signals are assumed to be independent of the clean input and desired signals.
Furthermore, the lag L is chosen to be more than m, the length of the filter under
consideration. Since the noise is assumed to be purely white,
E[v(n)v^T(n − L)] = E[v(n − L)v^T(n)] = 0 and E[v(n)v^T(n)] = V. We have
\varepsilon^{T}(n)e(n)x(n) = (w_{*} - w(n))^{T}\left[\bar{d}(n) + u(n) - w^{T}(n)(\bar{x}(n) + v(n))\right](\bar{x}(n) + v(n))    (4.20)
Simplifying this further and taking the expectations, we get
E\left[\varepsilon^{T}(n)e(n)x(n)\right] = \mathrm{var}(\bar{d}(n)) - 2\bar{P}^{T}w(n) + w^{T}(n)\bar{R}\,w(n) + w^{T}(n)V w(n) - w_{*}^{T}V w(n) = J_{e} - w_{*}^{T}V w(n)    (4.21)
where R̄ = E[x̄(n)x̄^T(n)], P̄ = E[x̄(n)d̄(n)], and
J_{e} = w^{T}(n)(\bar{R} + V)w(n) + \mathrm{var}(\bar{d}(n)) - 2\bar{P}^{T}w(n)    (4.22)
Similarly, we have
\varepsilon^{T}(n)\hat{e}(n)\hat{x}(n) = (w_{*} - w(n))^{T}\left[\bar{d}(n) + u(n) - w^{T}(n)(\bar{x}(n) + v(n)) - \bar{d}(n-L) - u(n-L) + w^{T}(n)(\bar{x}(n-L) + v(n-L))\right]\left[\bar{x}(n) + v(n) - \bar{x}(n-L) - v(n-L)\right]    (4.23)
Evaluating the expectations on both sides of (4.23) and simplifying, we obtain
E\left[\varepsilon^{T}(n)\hat{e}(n)\hat{x}(n)\right] = \mathrm{var}(\bar{d}(n) - \bar{d}(n-L)) - 2\bar{Q}^{T}w(n) + w^{T}(n)\bar{S}\,w(n) + 2w^{T}(n)V w(n) - 2w_{*}^{T}V w(n) = J_{\hat{e}} - 2w_{*}^{T}V w(n)    (4.24)
where we have used the definitions S̄ = E[(x̄(n) − x̄(n−L))(x̄(n) − x̄(n−L))^T],
Q̄ = E[(x̄(n) − x̄(n−L))(d̄(n) − d̄(n−L))], and
J_{\hat{e}} = w^{T}(n)(\bar{S} + 2V)w(n) + \mathrm{var}(\bar{d}(n) - \bar{d}(n-L)) - 2\bar{Q}^{T}w(n)    (4.25)
Using (4.21) and (4.24) in equation (4.19), we get an expression for the upper bound on
the step-size as
\eta(n) < \frac{2\left[J_{e} + \beta J_{\hat{e}} - (1 + 2\beta)\,w_{*}^{T}V w(n)\right]}{E\left\|e(n)x(n) + \beta\,\hat{e}(n)\hat{x}(n)\right\|^{2}}    (4.26)
This expression is not usable in practice as an upper bound because it depends on the
optimal weight vector. However, for β = -0.5, the upper bound on the step-size reduces to
|\eta(n)| < \frac{2\left|J_{e} - 0.5\,J_{\hat{e}}\right|}{E\left\|e(n)x(n) - 0.5\,\hat{e}(n)\hat{x}(n)\right\|^{2}}    (4.27)
From (4.22) and (4.25), we know that J_e and J_ê are positive quantities. However,
J_e − 0.5J_ê can be negative. Also note that this upper bound is computed by
evaluating the right-hand side of (4.27) with the current weight vector w(n). Thus, as
expected, it is very clear that the step-size at the nth iteration can take either positive or
negative values based on J_e − 0.5J_ê; therefore, sgn(η(n)) must be the same as
sgn(J_e − 0.5J_ê) evaluated at w(n). Intuitively speaking, the term J_e − 0.5J_ê is
the EWC cost computed with the current weights w(n) and β = -0.5, which tells us
where we are on the performance surface, and its sign tells which way to go to reach the
stationary point. It also means that the lower bound on the step-size is not positive as in
traditional gradient algorithms. In general, if the step-size we choose satisfies (4.27),
then the mean error vector norm decreases asymptotically, i.e., E‖ε(n+1)‖² < E‖ε(n)‖²,
and eventually becomes zero, which implies that lim_{n→∞} E[w(n)] → w_*. Thus the weight
vector E[w(n)] converges asymptotically to w_*, which is the only stationary point of the
ODE in (4.15). We conclude that the knowledge of the eigenvalues is not needed to
implement gradient descent on the EWC performance surface, but (4.27) is still not
appropriate for a simple LMS-type algorithm.
On-line Implementations of AEC-LMS for β < 0
As mentioned before, computing J_e − 0.5J_ê at the current weight vector would
require reusing the entire past data at every iteration. As an alternative, we can extract the
curvature at the operating point and include that information in the gradient algorithm. By
doing so, we obtain the following stochastic algorithm:
w(n+1) = w(n) + \eta\,\mathrm{sgn}\!\left(w^{T}(n)\left[R(n) + \beta S(n)\right]w(n)\right)\left[e(n)x(n) + \beta\,\hat{e}(n)\hat{x}(n)\right]    (4.28)
where R(n) and S(n) are the estimates of R and S, respectively, at the nth time instant.
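As a rough illustration of how (4.28), as reconstructed above, might be realized, the sketch below (my own names; the exponential forgetting factor lam is my own choice) maintains running estimates of R and S to supply the curvature sign, while the gradient term itself stays stochastic.

```python
# One iteration of the curvature-sign update (4.28), as reconstructed above (a sketch).
import numpy as np

def aec_curvature_step(w, R_est, S_est, x_n, x_nL, d_n, d_nL, beta, eta, lam=0.99):
    e, eL = d_n - w @ x_n, d_nL - w @ x_nL
    e_hat, x_hat = e - eL, x_n - x_nL                      # lag-differenced error and input
    R_est = lam * R_est + (1 - lam) * np.outer(x_n, x_n)   # O(m^2) running estimates
    S_est = lam * S_est + (1 - lam) * np.outer(x_hat, x_hat)
    sign = np.sign(w @ (R_est + beta * S_est) @ w)         # curvature sign at w(n)
    w = w + eta * sign * (e * x_n + beta * e_hat * x_hat)
    return w, R_est, S_est
```

In practice the weights should be initialized away from exactly zero so that the curvature term w^T(n)[R(n) + βS(n)]w(n) is informative.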
Corollary: Given any quadratic surface J(w), the following gradient algorithm
converges to its stationary point:
w(n+1) = w(n) - \eta\,\mathrm{sgn}\!\left(w^{T}(n)Hw(n)\right)Hw(n)    (4.29)
Proof: Without loss of generality, suppose that we are given a quadratic surface of the
form J(w) = w^T H w, where H ∈ ℝ^{m×m} and w ∈ ℝ^{m×1}. H is restricted to be symmetric;
therefore, it is the Hessian matrix of this quadratic surface. The gradient of the
performance surface with respect to the weights, evaluated at the point w_0, is ∇J(w_0) = 2Hw_0,
and the stationary point of J(w) is the origin. Since the performance surface is quadratic,
any cross-section passing through the stationary point is a parabola. Consider the cross-
section of J(w) along the line defined by the local gradient that passes through the point
wo. In general, the Hessian matrix of this surface can be positive or negative definite; it
might as well have mixed eigenvalues. The unique stationary point of J(w), which
makes its gradient zero, can be reached by moving along the direction of the local
gradient. The important issue is the selection of the sign, i.e., whether to move along or
against the gradient direction to reach the stationary point. The decision can be made by
observing the local curvature of the cross-section of J(w) along the gradient direction.
The performance surface cross-section along the gradient direction at w_0 is
J(w_{0} + 2\eta H w_{0}) = w_{0}^{T}(I + 2\eta H)^{T} H (I + 2\eta H) w_{0} = w_{0}^{T}\left(H + 4\eta H^{2} + 4\eta^{2} H^{3}\right) w_{0}    (4.30)
From this, we deduce that the local curvature of the parabolic cross-section at w_0 is
4w_0^T H^3 w_0. If the performance surface is locally convex, then this curvature is positive.
If the performance surface is locally concave, this curvature is negative. Also, note that
sgn(4w_0^T H^3 w_0) = sgn(w_0^T H w_0). Thus, the update equation with the curvature
information in (4.30) converges to the stationary point of the quadratic cost function
J(w) irrespective of the nature of the stationary point.
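A toy numerical check of the corollary, using the update form reconstructed in (4.29) (the matrix H and all settings below are my own choices), shows the sign-corrected step drifting toward the saddle point of an indefinite quadratic, whereas a fixed-sign step would diverge along the negative-curvature direction.

```python
# Sign-corrected gradient steps on J(w) = w^T H w with an indefinite H (a toy check).
import numpy as np

H = np.diag([2.0, -1.0])                 # indefinite Hessian: saddle point at w = 0
eta, w = 0.1, np.array([1.0, 1.0])
for _ in range(2000):
    w = w - eta * np.sign(w @ H @ w) * (H @ w)
print("final w:", w)                     # near [0, 0]; plain descent would diverge in w[1]
```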
From the above corollary and utilizing the fact that the matrix R + βS is
symmetric, we can conclude that the update equation in (4.28) asymptotically converges
to the stationary point w_* = (R + βS)^{-1}(P + βQ). On the down side, however, the update
equation in (4.28) requires O(m²) computations, which makes the algorithm unwieldy
for real-world applications. Also, we can use the REW algorithm instead.
For an O(m) algorithm, we have to go back to the update rule in (4.6). We will discuss
only the simple case of β = -0.5, which turns out to be also the most useful. We propose
to use an instantaneous estimate of the sign with the current weights, given by
w(n+1) = w(n) + \eta(n)\,\mathrm{sgn}\!\left(e^{2}(n) - 0.5\,\hat{e}^{2}(n)\right)\left[e(n)x(n) - 0.5\,\hat{e}(n)\hat{x}(n)\right]    (4.31)
where η(n) > 0 is bounded by (4.27). It is possible to make mistakes in the sign
estimation when (4.31) is utilized, which will not affect the convergence in the mean, but
will increase the misadjustment.
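The sketch below (a synthetic two-tap example of my own construction, loosely mirroring the experiments reported later in this chapter) implements the update (4.31) with β = −0.5; the step-size magnitude is taken as half of (4.27) with the expectation operators dropped, and the cap on the step-size is a practical safeguard of my own rather than part of the algorithm as stated.

```python
# EWC-LMS sketch per (4.31) with beta = -0.5 on a synthetic two-tap problem (my own setup).
import numpy as np

rng = np.random.default_rng(1)
m, L, N = 2, 2, 50000
w_true = np.array([-0.2, 0.5])

clean = np.zeros(N + m)
white = rng.standard_normal(N + m)
for n in range(2, N + m):                          # oscillatory AR(2) clean input
    clean[n] = 0.9 * clean[n - 1] - 0.81 * clean[n - 2] + white[n]
x = clean + 0.3 * rng.standard_normal(N + m)       # white noise added to the input
d = np.convolve(clean, w_true, mode="full")[: N + m] + 0.3 * rng.standard_normal(N + m)

w = np.zeros(m)
for n in range(m + L, N):
    x_n  = x[n - m + 1 : n + 1][::-1]
    x_nL = x[n - L - m + 1 : n - L + 1][::-1]
    e, eL = d[n] - w @ x_n, d[n - L] - w @ x_nL
    e_hat, x_hat = e - eL, x_n - x_nL              # lag-differenced error and input
    grad = e * x_n - 0.5 * e_hat * x_hat           # e(n)x(n) + beta*e_hat(n)*x_hat(n)
    cost = e**2 - 0.5 * e_hat**2                   # instantaneous EWC cost at w(n)
    eta = min(0.1, abs(cost) / (grad @ grad + 1e-8))   # half of (4.27), no E[.], capped
    w = w + eta * np.sign(cost) * grad
print("EWC-LMS estimate:", np.round(w, 3), " true:", w_true)
```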
Excess Error Correlation Bound for EWC-LMS
In the next theorem, we will show that the asymptotic excess error correlation at
lags L ≥ m is always bounded from above and can be arbitrarily reduced by controlling the step-size.
Theorem 4.2: With β = -1/2, the steady-state excess error autocorrelation at lag L ≥ m,
i.e., |ρ_e(L)|, is always bounded as
|\rho_{e}(L)| \le \frac{\eta}{2}\,\mathrm{Tr}(R + V)\left[E\!\left(e_{x}^{2}(k)\right) + \sigma_{u}^{2} + \|w_{k}\|^{2}\sigma_{v}^{2}\right]    (4.32)
where R = E[x_k x_k^T] is the correlation matrix of the clean input, V is the input-noise
covariance, and Tr(·) denotes the matrix trace. The term E(e_x²(k)) denotes the
excess MSE, which is (w_k − w_T)^T R (w_k − w_T). The noise variances in the input and
desired signals are represented by σ_v² and σ_u², respectively. Note that the term ‖w_k‖ is
always bounded because of the step-size bound.
Proof: For convenience, we will adopt the subscript k to denote the time or iteration
index. With this convention, the weight vector at the kth iteration is denoted by w_k.
Further, the error vector (the difference between the true vector w_T and the adaptive estimate
at time k) is denoted by ε_k = w_T − w_k. Throughout the rest of the proof, we will use the
following notation: the noisy input vector x̃_k, the noise-free input vector x_k, and the
input noise vector v_k obey x̃_k = x_k + v_k; the noisy desired signal d̃_k, the noise-free
desired signal d_k and the noise u_k are related by d̃_k = d_k + u_k. We will start from the
equation describing the dynamics of the error vector norm given below.
\|\varepsilon_{k+1}\|^{2} = \|\varepsilon_{k}\|^{2} - 2\eta\,\mathrm{sgn}\!\left(e_{k}^{2} - 0.5\,\hat{e}_{k}^{2}\right)\varepsilon_{k}^{T}\!\left[e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right] + \eta^{2}\left\|e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right\|^{2}    (4.33)
where ê_k = e_k − e_{k−L}.
In (4.33), we have assumed a constant step-size which satisfies the upper bound in (4.27).
Letting E‖ε_{k+1}‖² = E‖ε_k‖² as k → ∞, we see that
2\,E\!\left[\mathrm{sgn}\!\left(e_{k}^{2} - 0.5\,\hat{e}_{k}^{2}\right)\varepsilon_{k}^{T}\!\left(e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right)\right] = \eta\,E\left\|e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right\|^{2}    (4.34)
We now invoke Jensen's inequality for convex functions to reduce (4.34) to
\frac{\eta}{2}\,E\left\|e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right\|^{2} \ge \left|E\!\left[\varepsilon_{k}^{T}\!\left(e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right)\right]\right|    (4.35)
The noisy error term is given by e_k = e_x(k) + u_k − w_k^T v_k, where the excess error is
e_x(k) = ε_k^T x_k. Using the expressions E[ε_k^T e_k x̃_k] = E(e_x²(k)) − w_T^T V w_k + w_k^T V w_k,
E[ε_k^T ê_k(x̃_k − x̃_{k−L})] = E[(e_x(k) − e_x(k−L))²] − 2w_T^T V w_k + 2w_k^T V w_k, and β = -0.5, we can
immediately recognize that the RHS of (4.35) is simply the steady-state excess error autocorrelation
at lag L ≥ m, i.e., |ρ_e(L)|. In order to evaluate the LHS of (4.35), we will assume that the terms
e_k² and ‖x̃_k‖² are uncorrelated in the steady state. Using this assumption, we can write
E\left\|e_{k}\tilde{x}_{k} - 0.5\,\hat{e}_{k}(\tilde{x}_{k} - \tilde{x}_{k-L})\right\|^{2} = E\!\left(e_{k}^{2}\right)\mathrm{Tr}(R + V)    (4.36)
where E(e_k²) = E(e_x²(k)) + σ_u² + ‖w_k‖²σ_v². Using (4.36) in equation (4.35), we get the
inequality in (4.32).
This assumption (more relaxed than the independence assumptions [11,14]) is used
in computing the steady-state excess MSE for the stochastic LMS algorithm [131,132] and
becomes more realistic for long filters. In the estimation of the excess MSE for the LMS
algorithm, Price's theorem  for Gaussian random variables can be invoked to derive
closed form expressions. However, even the Gaussianity assumption is questionable as
discussed by Eweda, who proposed additional reasonable constraints on the noise
pdf to overcome the Gaussianity and independence assumptions, leading to a more
generic treatment of the sign-LMS algorithm. It is important to realize at this point that
in the analysis presented here, no explicit Gaussianity assumptions have been made.
As a special case, consider L = 0 and a noise-free input. Then (4.32) holds with
equality, and |ρ_e(L)| will be the same as E(e_x²(k)), which is nothing but the
excess MSE (as k → ∞) of the LMS algorithm. In other words, (4.32) reduces to
E\!\left(e_{x}^{2}(k)\right) = \frac{\eta}{2}\,\mathrm{Tr}(R)\left[E\!\left(e_{x}^{2}(k)\right) + \sigma_{u}^{2}\right]    (4.37)
From (4.37), the excess MSE for the LMS algorithm can be deduced as
E\!\left(e_{x}^{2}(k)\right) = \frac{\eta\,\sigma_{u}^{2}\,\mathrm{Tr}(R)}{2 - \eta\,\mathrm{Tr}(R)}
which becomes η σ_u² Tr(R)/2 for very small step-sizes. If the adaptive filter is long
enough, the excess error e_x(k) will be Gaussian, and we can easily show that the excess
MSE is bounded by η Tr(R) E[e_o²]/4, where e_o denotes the error due to the initial conditions.
Other Variants of the AEC-LMS Algorithms
It is easy to see that for convergence in the mean, the condition is
|1 − η λ_k(R + βS)| < 1 for all k, where λ_k(R + βS) denotes the kth eigenvalue of the matrix
(R + βS). This gives an upper bound on the step-size of η < 2/|λ_max(R + βS)|. From the
triangle inequality, ‖R + βS‖ ≤ λ_max(R) + |β| λ_max(S), where ‖·‖ denotes the
matrix norm. Since both R and S are positive-definite matrices, λ_max(R) ≤ Tr(R) and
λ_max(S) ≤ Tr(S), so we can write
\eta < \frac{2}{\mathrm{Tr}(R) + |\beta|\,\mathrm{Tr}(S)}
In a stochastic framework, we can include this in the AEC-LMS update equation to result
in a step-size normalized EWC-LMS update rule given by
w(n+1) = w(n) + \frac{\eta\,\mathrm{sgn}\!\left(e^{2}(n) + \beta\,\hat{e}^{2}(n)\right)\left[e(n)x(n) + \beta\,\hat{e}(n)\hat{x}(n)\right]}{\|x(n)\|^{2} + |\beta|\,\|\hat{x}(n)\|^{2}}    (4.40)
Note that when β = 0, (4.40) reduces to the well-known normalized LMS (NLMS)
algorithm. Alternatively, we can normalize by the squared norm of the gradient, and
this gives the following modified update rule:
w(n+1) = w(n) + \frac{\eta\,\mathrm{sgn}\!\left(e^{2}(n) + \beta\,\hat{e}^{2}(n)\right)\left[e(n)x(n) + \beta\,\hat{e}(n)\hat{x}(n)\right]}{\left\|e(n)x(n) + \beta\,\hat{e}(n)\hat{x}(n)\right\|^{2} + \delta}    (4.41)
The term δ, a small positive constant, compensates for numerical instabilities when the
signal has zero power or when the error goes to zero, which can happen in the noiseless
case even with a finite number of samples. Once again, we would like to state that with
β = 0, (4.41) defaults to the NLMS algorithm. However, the caveat is that both (4.40) and
(4.41) do not satisfy the principle of minimum disturbance, unlike NLMS.³ Nevertheless,
the algorithms in (4.40) and (4.41) can be used to provide faster convergence at the
expense of increased misadjustment (in the error correlation sense) in the final solution.
³ The NLMS algorithm is also called the minimum norm update algorithm. It can be formulated as a
constrained minimization problem wherein the actual cost function is the norm of the update, viz.,
‖w(n) − w(n−1)‖², and the constraint is that the error e(n) with the weights w(n) must be zero.
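For concreteness, one step of the gradient-normalized update (4.41) can be written as the sketch below (function and argument names are my own); with β = 0 it collapses to the familiar NLMS-style normalized step.

```python
# One gradient-normalized AEC-LMS step per (4.41) (a sketch with my own naming).
import numpy as np

def aec_lms_normalized_step(w, x_n, x_nL, d_n, d_nL, beta, eta, delta=1e-6):
    e, eL = d_n - w @ x_n, d_nL - w @ x_nL
    e_hat, x_hat = e - eL, x_n - x_nL
    grad = e * x_n + beta * e_hat * x_hat
    sign = np.sign(e**2 + beta * e_hat**2)          # sign of the instantaneous AEC cost
    return w + eta * sign * grad / (grad @ grad + delta)   # delta guards against zero power
```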
AEC-LMS Algorithm with Multiple Lags
In the previous chapter, we discussed a recursive Newton type algorithm that
included more than one lag in the cost function. With decreasing SNR at the input, the
Hessian matrix H = R + βS is mostly determined by the noise covariance matrix. This
can degrade the performance and we might be forced to use very small step-sizes (slow
convergence) to achieve good results. One way of alleviating this problem is to
incorporate multiple lags in the AEC cost function. The stochastic gradient AEC-LMS
algorithm for the multiple-lag case is simply given by
w(n+1) = w(n) + \sum_{L}\eta_{L}\,\mathrm{sgn}\!\left(e^{2}(n) + \beta_{L}\,\hat{e}_{L}^{2}(n)\right)\left[e(n)x(n) + \beta_{L}\,\hat{e}_{L}(n)\hat{x}_{L}(n)\right]    (4.42)
where ê_L(n) = e(n) − e(n−L), x̂_L(n) = x(n) − x(n−L), the sum runs over the chosen lags, and
L_max is the total number of lags (constraints) used in the AEC cost function. The
additional robustness of using multiple lags comes at an increase in the computational
cost, and when the number of lags becomes equal to the length of the adaptive
filter, the complexity approaches that of the recursive Newton-type algorithms.
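A rough sketch of one multiple-lag update, following the reading of (4.42) given above (the buffer layout, names and per-lag parameters below are my own choices):

```python
# One multiple-lag AEC-LMS step per the reading of (4.42) above (a sketch).
# x_buf[k] / d_buf[k]: input vector and desired sample k steps in the past (k = 0 is current).
import numpy as np

def multilag_aec_lms_step(w, x_buf, d_buf, lags, betas, etas):
    e0 = d_buf[0] - w @ x_buf[0]
    update = np.zeros_like(w)
    for L, beta, eta in zip(lags, betas, etas):
        eL = d_buf[L] - w @ x_buf[L]
        e_hat, x_hat = e0 - eL, x_buf[0] - x_buf[L]
        grad = e0 * x_buf[0] + beta * e_hat * x_hat
        update += eta * np.sign(e0**2 + beta * e_hat**2) * grad
    return w + update
```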
The stochastic AEC algorithms have linear complexity, in comparison with the
O(N²) complexity of the recursive Newton-type algorithms discussed in the previous
chapter. At the same time, since the algorithms are all based on the instantaneous
gradients, these algorithms have better tracking abilities when compared with their
Newton counterparts. Hence these algorithms can be expected to perform better in non-stationary environments.
Estimation of System Parameters in White Noise
The experimental setup is the same as the one used to test the REW algorithm. We
varied the Signal-to-Noise Ratio (SNR) from -10dB to +10dB and changed the
number of filter parameters from 4 to 12. We set β = -0.5 and used the update equation
in (4.31) for the EWC-LMS algorithm. A time varying step-size magnitude was chosen in
accordance with the upper bound given by (4.27) without the expectation operators. This
greatly reduces the computational burden but makes the algorithm noisier. However,
since we are using 50,000 samples for estimating the parameters, we can expect the errors
to average out over iterations. For the LMS algorithm, we chose the step-size that gave
the least error in each trial. A total of 100 Monte Carlo trials were performed and
histograms of normalized error vector norms were plotted. It is possible to use other
statistical measures instead of the error norm, but this is sufficient to demonstrate the bias
removal ability of EWC-LMS. For comparison purposes, we computed the solutions with
LMS as well as the numerical TLS (regular TLS) methods. Figure 4-1 shows the error
histograms for all the three methods. The inset plots in Figure 4-1 show the summary of
the histograms for each method. EWC-LMS performs significantly better than LMS at
low SNR values (-10dB and 0dB), while performing equally well for 10dB SNR. The
input noise variances for -10dB, 0dB, and 10dB SNR values are 10, 1, and 0.1,
respectively. Thus, we expect (and observe) TLS results to be worst for -10dB and best
for 10dB. As per theory, we observe that TLS performance drops when the noise
variances are not the same in the input and desired signals.
Figure 4-1. Histograms of the error vector norm (in dB) for the EWC-LMS and LMS
algorithms and the numerical TLS solution, for SNR values of -10dB, 0dB, and 10dB and
filter lengths of 4, 8, and 12 taps.
Figure 4-2 shows a sample comparison between the stochastic and the recursive
algorithms for 0dB SNR and 4 filter taps. Interestingly, the performance of the EWC-
LMS algorithm is better than the REW algorithm in the presence of noise. Similarly, the
LMS algorithm is much better than the RLS algorithm. This tells us that the stochastic
algorithms presumably reject more noise than the fixed-point algorithms. Researchers
have made this observation before, although no concrete arguments exist to account for
the smartness of the adaptive algorithms. Similar conclusions can be drawn in our
case for EWC-LMS and REW.
Figure 4-2. Comparison of the error-norm histograms (in dB) of the stochastic (EWC-LMS,
LMS) and recursive (REW, RLS) algorithms for 0dB SNR and 4 filter taps.
Weight Tracks and Convergence
The steady state performance of a stochastic gradient algorithm is a matter of great
importance. We will now experimentally verify the steady state behavior of the EWC-
LMS algorithm. The SNR of the input signal is set to 10dB and the number of filter taps
is fixed to two for display convenience. Figure 4-3 shows the contour plot of the EWC
cost function with noisy input data. Clearly, the Hessian of this performance surface has
both positive and negative eigenvalues thus making the stationary point an undesirable
saddle point. On the same plot, we have shown the weight tracks of the EWC-LMS
algorithm with β = -0.5. Also, we used a fixed value of 0.001 for the step-size. From
the figure, it is clear that the EWC-LMS algorithm converges stably to the saddle point
solution, which is theoretically unstable when a single sign step-size is used. Notice that
due to the constant step-size, there is misadjustment in the final solution. In Figure 4-4,
we show the individual weight tracks for the EWC-LMS algorithm. The weights
converge to the vicinity of the true filter parameters, which are -0.2 and 0.5 respectively
within 1000 samples.
Figure 4-3. Contour plots with the weight tracks showing convergence to saddle point.
Figure 4-4. Weight tracks for the stochastic algorithm.
In order to see if the algorithm converges to the saddle point solution in a robust
manner, we ran the same experiment using different initial conditions on the contours.
Figure 4-5 shows a few plots of the weight tracks originating from different initial values
over the contours of the performance surface. In every case, the algorithm converged to
the saddle point in a stable manner. Note that the misadjustment in each case is almost
the same.
Figure 4-5. Contour plots with weight tracks for different initial values of the weights.
Finally, to quantify the effect of reducing the SNR, we repeated the experiment with 0dB
SNR. Figure 4-6 (left) shows the weight tracks over the contours, and we can see that the
misadjustment has increased owing to the decrease in SNR. This is a typical
phenomenon observed with most of the stochastic gradient algorithms. However, the
misadjustment is proportional to the step-size. Therefore, by using smaller step-sizes, the
misadjustment can be controlled to be within acceptable values. The drawback is slow
convergence to the optimal solution. Figure 4-6 (right) shows the weight tracks when the
algorithm is used without the sign information for the step-size. Note that convergence is
not achieved in this case, which substantiates our previous argument that a fixed-sign
step-size will never converge to a saddle point.
Figure 4-6. Contour plots with weight tracks for the EWC-LMS algorithm with sign
information (left) and without sign information (right).
To further substantiate this fact, we removed the noise from the input and ran the
EWC-LMS algorithm with and without the sign term. Figure 4-7 (left) shows the
noise-free EWC performance surface and Figure 4-7 (right) shows the weight tracks with
and without the sign information on the contours. Clearly, without the sign information,
the weights do not converge to the desired saddle point even in the absence of noise. On
the other hand, using the sign information leads the weights to the saddle point in a stable
manner. Since this is the noise-free case, the final misadjustment becomes zero.
Figure 4-7. EWC performance surface (left) and weight tracks for the noise-free case
with and without sign information (right).
Inverse Modeling and Controller Design Using EWC
System identification is the first step in the design of an inverse controller.
Specifically, we wish to design a system that controls the plant to produce a predefined
output. Figure 4-8 shows a block diagram of model reference inverse control. In
this case, the adaptive controller is designed so that the controller-plant pair would track
the response generated by the reference model for any given input (command). Clearly,
we require the plant parameters (which are typically unknown) to devise the controller.
Once we have a model for the plant, the controller can be easily designed using
conventional MSE minimization techniques. In this example, we will assume that the
plant is an all-pole system with transfer function P(z) = 1/(1 + 0.8z^{-1} − 0.5z^{-2} + 0.3z^{-3}).
The reference model is chosen to be an FIR filter with 5 taps. The block diagram for the
plant identification is shown in Figure 4-9. Notice that the output of the plant is corrupted
with additive white noise due to measurement errors. The SNR at the plant output was set
to 0dB. We then ran the EWC-LMS and LMS algorithms to estimate the model
parameters given noisy input and desired signals. The model parameters thus obtained are
used to derive the controller (see Figure 4-8) using standard backpropagation of error. We
then tested the adaptive controller-plant pair for trajectory tracking by feeding the
controller-plant pair with a non-linear time series and observing the responses. Ideally,
the controller-plant pair must follow the trajectory generated by the reference model.
Figure 4-10 (top) shows the tracking results for both controller-plant pairs along with the
reference output. Figure 4-10 (bottom) shows a histogram of the tracking errors. Note
that the errors with the EWC-LMS controller are all concentrated around zero, which is
desirable. In contrast, the errors produced with the MSE based controller are significant,
and this can be worse if the SNR levels drop further.
Figure 4-8. Block diagram for model reference inverse control.
Figure 4-9. Block diagram for inverse modeling.
Figure 4-10. Plot of tracking results and error histograms.
Figure 4-11 shows the magnitude and phase responses of the reference model along with
the generated controller-model pairs. Note that the EWC controller-model pair matches
very closely with the desired transfer function, whereas the MSE controller-model pair
produces a significantly different transfer function. This clearly demonstrates the
advantages offered by EWC.
Figure 4-11. Magnitude and phase responses of the reference model and the designed
model-controller pairs.
More details on the applications of EWC-LMS in system identification and
controller design problems can be found in [137-139].
In this chapter, we proposed online sample-by-sample stochastic gradient
algorithms for estimating the optimal AEC solution. The detailed derivations of the
update rules were presented and the convergence was proved rigorously using stochastic
approximation theory. We also derived the step-size upper bounds for convergence with
probability one. Further, the theoretical upper bound on the excess error correlation in the
case of EWC-LMS was derived. The AEC stochastic algorithms include the LMS
algorithm for MSE as a special case. Owing to the complexities of the EWC performance
surface (see Chapter 2), additional information like the sign of the instantaneous cost is
required for guaranteed convergence to the unique optimal AEC solution. In this context,
the AEC optimization problem can be pursued as a root-finding problem and the popular
Robbins-Monro method can be adopted to solve for the optimal solution. We have
not explored this method yet for the AEC criterion.
We also presented several variants of the AEC-LMS algorithm. As a special case,
the normalized AEC-LMS algorithm in equation (4.40) reduces to the well-known NLMS
algorithm for MSE. The gradient normalized AEC-LMS algorithm in equation (4.41) has
shown better performance over the simple AEC-LMS algorithm in our simulation studies.
We then presented simulation results to show the noise rejection capability of the
EWC-LMS algorithm. Experiments were also conducted to verify some of the properties
of the proposed gradient algorithms. In particular, we observed the weight tracks and
verified that the algorithm converges in a stable manner even to saddle stationary points.
This is achieved mainly by utilizing the sign information in the gradient update. We also
showed that the amount of misadjustment can be controlled by the step-size parameter. This is
in conformance with the general theory behind stochastic gradient algorithms.
Lastly, we demonstrated the application of EWC in the design of a model-reference
inverse controller. We compared the performance of the EWC controller with the MSE
derived controller and verified the superiority of the former.
LINEAR PARAMETER ESTIMATION IN CORRELATED NOISE
In the previous chapters we discussed a new criterion titled augmented error
criterion (AEC) that can potentially replace the popular MSE criterion. In fact we showed
that a special case of the AEC called the error whitening criterion (EWC) can solve the
problem of estimating the parameters of a linear system in the presence of input noise.
We showed extensive simulation results with different EWC adaptation algorithms that
proved beyond doubt, the usefulness of this criterion in solving system identification and
controller design problems.
Two crucial assumptions were made in the theory behind the error whitening
criterion. Firstly, we assumed that the input noise is uncorrelated with itself or is white.
Although in most problems the noise is assumed to be white, this assumption can
certainly be restrictive in many applications. From the theory we discussed in the previous
chapters, it is easy to conclude that EWC fails to remove the bias in the parameter
estimates when the noise is correlated or colored.
Secondly, we assumed full knowledge of the model order of the unknown system.
This limitation is not unique to the proposed method, as most of the competing methods,
including Total Least-Squares (TLS), assume exact knowledge of the model order. To the best of our
knowledge, there is no existing solution to the problem of system identification in the
presence of input noise in cases when the model order is unknown. However, till this
point, we have not dealt with the implications of using the proposed EWC when the