AN AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE FILTERING: THEORY, ALGORITHMS AND APPLICATIONS By YADUNANDANA NAGARAJA RAO A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2004 Copyright 2004 by YADUNANDANA NAGARAJA RAO This dissertation is dedicated to my family, teachers and friends for their enduring love, support and friendship. ACKNOWLEDGMENTS First of all, I would like to thank Dr. Jose Principe, for his constant guidance, encouragement, patience and continuous support over the past five years. His enthusiasm for research and quest for excellence have left an everlasting impression in my mind. To me, he has been more than an advisor, and this research would not have been possible without him. Secondly, I would like to thank Dr. John Harris for being on my committee and offering me guidance not only in research but in many other aspects of life. I would also like to thank Dr. Michael Nechyba and Dr. Mark Yang for being on my committee. I would like to thank Dr. Deniz Erdogmus, my friend and colleague at CNEL, whose contributions in this research have been tremendous. I deeply benefited from all the long hours of fruitful discussions with him on a multitude of topics. His drive for research and enormous ability to motivate others have been quite inspirational. I also wish to extend my acknowledgements to all the members of CNEL who have been primarily responsible for my fruitful stay in the lab. I would like to extend my gratitude to the always cheerful Ellie Goodwin for her golden words of wisdom. Her ability to get things done was truly remarkable. I would also like to acknowledge Linda Kahila for the extensive support and assistance she provided during my stay at UFL. I would like to thank my family and friends for their constant love and encouragement. They have allowed me to pursue whatever I wanted in life. 
Without their guidance and affection, it would have been impossible for me to advance my education. Lastly, I would like to thank my life partner Geetha for making my life beautiful and for being by my side whenever I needed her. Her everlasting love has made me a better individual.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 MEAN SQUARED ERROR BASED ADAPTIVE SIGNAL PROCESSING SYSTEMS: A BRIEF REVIEW
  Introduction
  Why Do We Need Adaptive Systems?
  Design of Adaptive Systems
  Least Mean Squares (LMS) Algorithm
  Recursive Least Squares (RLS) Algorithm
  Other Algorithms
  Limitations of MSE Criterion Based Linear Adaptive Systems
  Total Least Squares (TLS) and Other Methods
  Limitations of TLS
  Extended TLS for Correlated Noise
  Other Methods
  Summary

2 AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE SYSTEMS
  Introduction
  Error Whitening Criterion (EWC)
  Motivation for Error Whitening Criterion
  Analysis of the Autocorrelation of the Error Signal
  Augmented Error Criterion (AEC)
  Properties of Augmented Error Criterion
  Shape of the Performance Surface
  Analysis of the Noise-free Input Case
  Analysis of the Noisy Input Case
  Orthogonality of Error to Input
  Relationship to Error Entropy Maximization
  Note on Model-Order Selection
  The Effect of β on the Weight Error Vector
  Numerical Case Studies of AEC with the Theoretical Solution
  Summary

3 FAST RECURSIVE NEWTON TYPE ALGORITHMS FOR AEC
  Introduction
  Derivation of the Newton Type Recursive Error Whitening Algorithm
  Extension of the REW Algorithm for Multiple Lags
  Relationship to the Recursive Instrumental Variables Method
  Recursive EWC Algorithm Based on Minor Components Analysis
  Experimental Results
  Estimation of System Parameters in White Noise Using REW
  Effect of β and Weight Tracks of REW Algorithm
  Performance Comparisons between REW, EWC-TLS and IV Methods
  Summary

4 STOCHASTIC GRADIENT ALGORITHMS FOR AEC
  Introduction
  Derivation of the Stochastic Gradient AEC-LMS Algorithm
  Convergence Analysis of AEC-LMS Algorithm
  Proof of AEC-LMS Convergence for β > 0
  Proof of AEC-LMS Convergence for β < 0
  Online Implementations of AEC-LMS for β < 0
  Excess Error Correlation Bound for EWC-LMS
  Other Variants of the AEC-LMS Algorithms
  AEC-LMS Algorithm with Multiple Lags
  Simulation Results
  Estimation of System Parameters in White Noise
  Weight Tracks and Convergence
  Inverse Modeling and Controller Design Using EWC
  Summary

5 LINEAR PARAMETER ESTIMATION IN CORRELATED NOISE
  Introduction
  Existing Solutions
  Criterion for Estimating the Parameters in Correlated Noise
  Stochastic Gradient Algorithm and Analysis
  Simulation Results
  System Identification with the Analytical Solution
  System Identification with Stochastic Gradient Algorithm
  Verification of the Local Stability of the Gradient Algorithm
  Extensions to Correlated Noise in the Desired Data
  Experimental Results
  System Identification
  Stochastic Algorithm Performance
  Summary

6 ON UNDERMODELING AND OVERESTIMATION ISSUES IN LINEAR SYSTEM ADAPTATION
  Introduction
  Undermodeling Effects
  Overestimation Effects
  Experimental Results
  Summary

7 CONCLUSIONS AND FUTURE DIRECTIONS
  Conclusions
  Future Research Directions

APPENDIX

A FAST PRINCIPAL COMPONENTS ANALYSIS (PCA) ALGORITHMS
  Introduction
  Brief Review of Existing Methods
  Derivation of the Fixed-Point PCA Algorithm
  Mathematical Analysis of the Fixed-Point PCA Algorithm
  Self-Stabilizing Fixed-Point PCA Algorithm
  Mathematical Analysis of the Self-Stabilizing Fixed-Point PCA Algorithm
  Minor Components Extraction: Self-Stabilizing Fixed-Point PCA Algorithm

B FAST TOTAL LEAST-SQUARES ALGORITHM USING MINOR COMPONENTS ANALYSIS
  Introduction
  Fast TLS Algorithms
  Simulation Results with TLS
  Simulation 1: Noise-Free FIR Filter Modeling
  Simulation 2: FIR Filter Modeling with Noise
C ALGORITHMS FOR GENERALIZED EIGENDECOMPOSITION
  Introduction
  Review of Existing Learning Algorithms
  Fixed-Point Learning Algorithm for GED
  Mathematical Analysis

D SOME DERIVATIONS FOR THE NOISY INPUT CASE

E ORTHOGONALITY OF ERROR TO INPUT

F AEC AND ERROR ENTROPY MAXIMIZATION

G PROOF OF CONVERGENCE OF ERROR VECTOR NORM IN AEC-LMS

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1. Outline of the RLS Algorithm.
3-1. Outline of the REW Algorithm.

LIST OF FIGURES

1-1. Block diagram of an adaptive system.
1-2. Parameter estimates using RLS algorithm with noisy data.
2-1. Schematic diagram of EWC adaptation.
2-2. The MSE performance surfaces, the AEC contour plot, and the AEC performance surface for three different training data sets and 2-tap adaptive FIR filters.
2-3. Demonstration scheme with coloring filter h, true mapping filter w, and the uncorrelated white signals.
2-4. The average squared error-norm of the optimal weight vector as a function of autocorrelation lag L for various β values and SNR levels.
2-5. The average squared error-norm of the optimal weight vector as a function of filter length m for various β values and SNR levels.
2-6.
Histograms of the weight error norms (dB) obtained in 50 Monte Carlo simulations using 10000 samples of noisy data using MSE (empty bars) and EWC with β = -0.5 (filled bars). The subfigures in each row use filters with 4, 8, and 12 taps, respectively. The subfigures in each column use noisy samples at -10, 0, and 10 dB SNR, respectively.
2-7. Error autocorrelation function for MSE (dotted) and EWC (solid) solutions.
3-1. Histogram plots showing the error vector norm for EWC-LMS, LMS algorithms and the numerical TLS solution.
3-2. Performance of REW algorithm (a) SNR = 0 dB and (b) SNR = 10 dB over various β values.
3-3. Weight tracks for REW and RLS algorithms.
3-4. Histogram plots showing the error vector norms for all the methods.
3-5. Convergence of the minor eigenvector of G with (a) noisy data and (b) clean data.
4-1. Histogram plots showing the error vector norm for EWC-LMS, LMS algorithms and the numerical TLS solution.
4-2. Comparison of stochastic versus recursive algorithms.
4-3. Contour plots with the weight tracks showing convergence to saddle point.
4-4. Weight tracks for the stochastic algorithm.
4-5. Contour plot with weight tracks for different initial values for the weights.
4-6. Contour plot with weight tracks for EWC-LMS algorithm with sign information (left) and without sign information (right).
4-7. EWC performance surface (left) and weight tracks for the noise-free case with and without sign information (right).
4-8. Block diagram for model reference inverse control.
4-9. Block diagram for inverse modeling.
4-10. Plot of tracking results and error histograms.
4-11. Magnitude and phase responses of the reference model and designed model controller pairs.
5-1. System identification block diagram showing data signals and noise.
5-2. Histogram plots showing the error vector norm in dB for the proposed and MSE criteria.
5-3. Weight tracks for LMS and the stochastic gradient algorithm in the system identification example.
5-4. Weight tracks for LMS and the stochastic gradient algorithm showing stability around the optimal solution.
5-5. Histogram plots of the error norms for the proposed method and MSE.
5-6. Weight tracks showing the convergence of the stochastic gradient algorithm.
6-1. Undermodeling effects with input SNR = 0 dB (left) and input SNR = 5 dB (right).
6-2. Cross-correlation plots for EWC and MSE for undermodeling.
6-3. Cross-correlation plots for EWC and MSE for overestimation.
6-4. Power normalized error cross-correlation for EWC and MSE with overestimation.
6-5. Weight tracks for LMS and the stochastic gradient algorithm in the case of undermodeling.
A-1. Representative network architecture showing lateral connections.
B-1. Estimation of minor eigenvector.
B-2. Minimum eigenvalue estimation.
B-3. Comparison between the estimated and true filter coefficients using TLS.
B-4.
Comparison between the estimated and true filter coefficients using RLS.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AN AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE FILTERING: THEORY, ALGORITHMS AND APPLICATIONS

By Yadunandana Nagaraja Rao

May 2004

Chair: Jose C. Principe
Cochair: John G. Harris
Major Department: Electrical and Computer Engineering

Ever since its conception, the mean-squared error (MSE) criterion has been the workhorse of optimal linear adaptive filtering. However, it is a well-known fact that the MSE criterion is no longer optimal in situations where the data are corrupted by noise. Noise, being omnipresent in most engineering applications, can result in severe errors in the solutions produced by the MSE criterion. In this dissertation, we propose novel error criteria and the associated learning algorithms, followed by a detailed mathematical analysis of these algorithms. Specifically, these criteria are designed to solve the problem of optimal filtering with noisy data. First, we discuss a new criterion called the augmented error criterion (AEC) that can provide unbiased parameter estimates even in the presence of additive white noise. Then, we derive novel, online sample-by-sample learning algorithms with varying degrees of complexity and performance that are tailored for real-world applications. Rigorous mathematical analysis of the new algorithms is presented. In the second half of this dissertation, we extend the AEC to handle correlated noise in the data. The modifications introduced will enable us to obtain optimal, unbiased parameter estimates of a linear system when the data are corrupted by correlated noise. Further, we achieve this without explicitly assuming any prior information about the noise statistics.
The analytical solution is derived and an iterative stochastic algorithm is presented to estimate this optimal solution. The proposed criteria and learning algorithms can be applied in many engineering problems. System identification and controller design problems are obvious areas where the proposed criteria can be efficiently used. Other applications include model-order estimation in the presence of noise and the design of multiple local linear filters to characterize complicated nonlinear systems.

CHAPTER 1
MEAN SQUARED ERROR BASED ADAPTIVE SIGNAL PROCESSING SYSTEMS: A BRIEF REVIEW

Introduction

Conventional signal processing techniques can typically be formulated as linear or nonlinear operations on the input data. For example, a finite impulse response (FIR) filter is a linear combination of time-delayed versions of the input signal. We know that a linear combiner is nothing but a linear projector in the input space. Mathematically speaking, a projection can be defined as a linear transformation between two vector spaces [1]. These linear transformations can be vectors spanning R^n or matrices spanning R^(n x n). For vector transformations the projections are given by inner products, and in the case of matrix transformations the projections become rotations. Often, most design tasks in signal processing involve finding appropriate projections that perform the desired operation on the input. For instance, the filtering task is basically finding the projection that preserves only a specified part of the input information [2]. Another example is data compression, wherein we estimate an optimal projection matrix or rotation matrix that preserves most of the information in the input space. The first step in finding these projections is to understand the specifications of the problem. Then, the specifications are translated into mathematical criteria and equations that can be solved using various mathematical and statistical tools.
The solutions thus obtained are often optimal with respect to the criterion used.

Why Do We Need Adaptive Systems?

Depending on the problem at hand, estimating the optimal projections can be a daunting task. Complexities can arise due to the non-availability of a closed-form solution or even the non-existence of a feasible analytical solution. In the latter case, we may have to be content with suboptimal solutions. On the other hand, scenarios exist where we have to synthesize projections that are not based on user specifications. For instance, suppose we are given two signals, an input and a desired signal, and the goal is to find the optimal projection (filter) that generates the desired signal from the input. Thus the specifications do not convey any explicit information regarding the type of filter we have to design. The conventional filter synthesis cookbook does not contain any recipes for these types of problems. Such problems can be solved by learning mechanisms that intelligently deduce the optimal projections using only the input and desired signals, or at times using the input signal alone. These learning mechanisms form the foundation of adaptive systems and neural networks. All learning mechanisms have at least two major pieces associated with them: the first is the criterion and the second is the search algorithm. The search algorithm finds the best possible solution in the space of the inputs under some constraints. Optimization theory has provided us with a variety of search techniques possessing different degrees of complexity and robustness [3]. These learning-based adaptive systems provide us with a powerful methodology that can go beyond conventional signal processing. The projections derived by these adaptive systems are called optimal adaptive projections. Another very desirable feature of adaptive systems is their innate ability to automatically adjust and track the changing statistical properties of signals.
This can be vital in many engineering applications, e.g., wireless data transmission, biomedical monitoring and control, and echo cancellation over wired telephone lines, wherein the underlying physical sources that generate the information change over time. In the next section, we will briefly review the theory behind the design of linear adaptive systems.

Design of Adaptive Systems

A block diagram of an adaptive system is shown in Figure 1-1. Assume that we are given a zero-mean input signal x_n and a zero-mean desired signal d_n. Further, these signals are assumed to be corrupted by noise terms v_n and u_n, respectively. Let the parameters of the adaptive system be denoted by the weight vector w. Note that we have not put any constraints on the topology of the adaptive filter. For convenience, we will assume an FIR topology in this chapter. The goal then is to generate an output y_n that best approximates the desired signal. In order to achieve that, a criterion (often referred to as the cost J(w)) is devised, which is typically a function of the error e_n, defined as the difference between the desired signal and the output, i.e., e_n = d_n - y_n. The most widely used criterion in the literature is the mean-squared error (MSE), which is defined as

J(w) = E(e_n^2)    (1.1)

The MSE cost function has some nice properties, namely:

* Physical relevance to energy.
* The performance surface (shape of J(w)) is smooth and has continuous derivatives.
* The performance surface is a convex paraboloid with a single global minimum.
* The weight vector w* corresponding to the global minimum is the best linear unbiased estimate in the absence of noise [4].
* If the desired signal is a future sample of the input, i.e., d_n = x_(n+1), then the filter with coefficients w* is guaranteed to be minimum phase [5].

Figure 1-1. Block diagram of an adaptive system.

Once the criterion is fixed, the next step is to design an algorithm to optimize the cost function.
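Before turning to search algorithms, note that the cost in (1.1) can be estimated directly from a finite data record by replacing the expectation with a sample average. A minimal sketch (the signals and the 2-tap system below are our own illustrative inventions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a mildly colored input and a desired signal produced
# by a hypothetical 2-tap FIR system w_true.
n = 5000
x = np.convolve(rng.standard_normal(n), [1.0, 0.5], mode="same")
w_true = np.array([0.7, -0.3])

# Time-delay embedding: each row of X is [x_k, x_(k-1)]
X = np.column_stack([x[1:], x[:-1]])
d = X @ w_true

def mse_cost(w, X, d):
    """Sample estimate of J(w) = E[e^2] with e = d - w^T x."""
    e = d - X @ w
    return np.mean(e ** 2)

# J(w) vanishes at the true weights and grows as w moves away from them,
# reflecting the convex paraboloid shape of the performance surface.
print(mse_cost(w_true, X, d))
print(mse_cost(w_true + 0.1, X, d))
```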
This forms another important element in an adaptive system. Optimization is a well-researched topic and there is a plethora of search methods for convex cost functions. Specifically, we minimize the MSE cost function, and since the performance surface is quadratic with a single global minimum, an analytical closed-form optimal solution w* can easily be determined. The optimal solution is called the Wiener solution for MSE [6] (Wiener filter), which is given by

w* = R^(-1) P    (1.2)

In equation (1.2), R denotes the covariance matrix of the input, defined as R = E(x_k x_k^T), and the vector P denotes the cross-correlation between the desired signal and the lagged input, defined as P = E(x_k d_k). Computing the Wiener solution requires inverting the matrix R, which requires O(N^3) operations [7]. However, due to the time-delay embedding of the input, the matrix R can easily be shown to be symmetric and Toeplitz, which facilitates a computationally efficient inverse operation with complexity O(N^2) [8]. From the point of view of an adaptive system, the Wiener solution is still not elegant because one requires knowledge of all the data samples to compute equation (1.2). A sample-by-sample (iterative) algorithm is more desirable, as it suits the framework of an adaptive system. The most commonly used algorithms to iteratively estimate the optimal Wiener solution w* are the stochastic gradient based Least Mean Squares (LMS) and the fixed-point type Recursive Least Squares (RLS).

Least Mean Squares (LMS) Algorithm

The gradient of the cost function in (1.1) is given by

dJ(w)/dw = -2 E(e_k x_k)    (1.3)

Notice that the output of the adaptive filter y_k is simply the inner product between the weight vector w and the vector x_k, which is comprised of the delayed versions of the input signal x_n. Instead of computing the exact gradient, Widrow and fellow researchers [9,10] proposed the instantaneous gradient, which considers only the most recent data samples (both input and desired).
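As a reference point for the iterative algorithms that follow, the closed-form Wiener solution (1.2) can be checked numerically by replacing the expectations R and P with sample averages. The filter length and coefficients below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: the desired signal is the input passed through a
# known 4-tap FIR filter, so the Wiener solution should recover its taps.
n, m = 20000, 4
x = rng.standard_normal(n)
w_true = np.array([0.9, -0.4, 0.2, 0.1])
X = np.column_stack([np.roll(x, i) for i in range(m)])[m:]  # delay embedding
d = X @ w_true

R = X.T @ X / len(X)            # sample estimate of E[x_k x_k^T]
P = X.T @ d / len(X)            # sample estimate of E[x_k d_k]
w_star = np.linalg.solve(R, P)  # w* = R^(-1) P, solved without forming R^(-1)

print(np.round(w_star, 3))
```

For a Toeplitz R, `scipy.linalg.solve_toeplitz` offers a Levinson-type solver matching the O(N^2) inversion cost mentioned above.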
This led to the development of the stochastic gradient algorithm for MSE minimization that is popularly known as the Least Mean Squares (LMS) algorithm. The stochastic gradient is given by

dJ(w)/dw = -2 e_k x_k    (1.4)

Once the instantaneous gradient is known, the search should proceed in the direction opposite to the gradient, which gives us the stochastic LMS algorithm in (1.5):

w(k+1) = w(k) + eta(k) e_k x_k    (1.5)

The term eta(k) denotes a time-varying step-size that is typically chosen from a set of small positive numbers. Under mild conditions, it is possible to show that the LMS algorithm converges in the mean to the Wiener solution [10-14]. The stochastic LMS algorithm is linear in complexity, i.e., O(N), and allows online, local computations. These features facilitate efficient hardware implementation for real-world adaptive systems. Being a stochastic gradient algorithm, LMS suffers from slow convergence and excessive misadjustment in the presence of noise [14,15]. Higher-order methods have been proposed to mitigate these effects, mainly variants of the Quasi-Newton, Levenberg-Marquardt (LM) and Conjugate-Gradient (CG) methods popular in optimization [16,17]. Alternatively, we can derive a recursive fixed-point algorithm to iteratively estimate the optimal Wiener solution. This is the well-known Recursive Least Squares (RLS) algorithm [18,19].

Recursive Least Squares (RLS) Algorithm

The derivation of the RLS algorithm utilizes the fact that the input covariance matrix R can be iteratively estimated from its past values using the recursive relation

R(k) = R(k-1) + x_k x_k^T    (1.6)

The above equation can also be viewed as a rank-1 update on the input covariance matrix R. Further, the cross-correlation vector P satisfies the following recursion.
P(k) = P(k-1) + x_k d_k    (1.7)

We know that the optimal Wiener solution at time instant k is simply

w*(k) = R^(-1)(k) P(k)    (1.8)

Recall the matrix inversion lemma [7,8] at this point, which allows us to recursively update the inverse of a matrix:

R^(-1)(k) = R^(-1)(k-1) - [R^(-1)(k-1) x_k x_k^T R^(-1)(k-1)] / [1 + x_k^T R^(-1)(k-1) x_k]    (1.9)

It is important to note that the inversion lemma is useful only when the matrix itself can be expressed using reduced-rank updates as in equation (1.6). By plugging equation (1.9) into the Wiener solution in (1.8) and using the recursive update for P(k) in (1.7), we can derive the RLS algorithm outlined in Table 1-1 below.

Table 1-1. Outline of the RLS Algorithm.
  Initialize R^(-1)(0) = cI, where c is a large positive constant
  w(0) = 0, i.e., initialize the weight vector to an all-zero vector
  At every iteration, compute
    K(k) = R^(-1)(k-1) x_k / (1 + x_k^T R^(-1)(k-1) x_k)
    e(k) = d_k - w^T(k-1) x_k
    w(k) = w(k-1) + K(k) e(k)
    R^(-1)(k) = R^(-1)(k-1) - K(k) x_k^T R^(-1)(k-1)

The RLS algorithm is a truly fixed-point method, as it tracks the exact Wiener solution at every iteration. Also, observe that the complexity of the algorithm is O(N^2), compared to the linear complexity of the LMS algorithm. This additional increase in complexity is compensated by the fast convergence and zero misadjustment of the RLS algorithm.

Other Algorithms

Although LMS and RLS form the core of adaptive signal processing algorithms, researchers have proposed many other variants possessing varying degrees of complexity and performance. Important amongst them are the sign LMS algorithms that were introduced for reduced-complexity hardware implementations [20,21]. Historically, the sign-error algorithm has been utilized in the design of channel equalizers [20] and also in the 32-kbps ADPCM digital coding scheme [22]. In terms of improving the speed of convergence with minimum misadjustment, variable step-size LMS and normalized LMS algorithms have been proposed [23-27].
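The recursion of Table 1-1 is compact enough to transcribe directly. The following sketch (all variable and function names are ours) runs it under the noise-free conditions where RLS tracks the exact Wiener solution:

```python
import numpy as np

rng = np.random.default_rng(2)

def rls(X, d, c=1e4):
    """One pass of the RLS recursion from Table 1-1 over the rows of X."""
    m = X.shape[1]
    Rinv = c * np.eye(m)                       # R^(-1)(0) = cI, c large
    w = np.zeros(m)                            # w(0) = 0
    for x_k, d_k in zip(X, d):
        K = (Rinv @ x_k) / (1.0 + x_k @ Rinv @ x_k)  # gain vector K(k)
        e = d_k - w @ x_k                            # a priori error e(k)
        w = w + K * e                                # weight update
        Rinv = Rinv - np.outer(K, x_k @ Rinv)        # inversion-lemma update
    return w

# Noise-free toy problem: RLS should recover the 3-tap filter exactly.
x = rng.standard_normal(3000)
w_true = np.array([0.5, -0.25, 0.1])
X = np.column_stack([np.roll(x, i) for i in range(3)])[3:]
d = X @ w_true
print(np.round(rls(X, d), 3))
```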
Leaky LMS algorithms [28] have been explored to mitigate finite word-length effects at the expense of introducing some bias in the optimal solution. Several extensions to the RLS algorithm have also been studied. Some of these algorithms show improved robustness against round-off errors and superior numerical stability [29,30]. The conventional RLS algorithm works well when the data statistics do not change over time (stationarity assumption). The tracking abilities of RLS in nonstationary conditions have been analyzed by Eleftheriou and Falconer [31], and many solutions have been proposed [14].

Limitations of MSE Criterion Based Linear Adaptive Systems

Although MSE based adaptive systems have been very popular, the criterion may not be the optimal choice for many engineering applications. For instance, consider the problem of system identification [32], which is stated as follows: given a set of noisy input and output measurements, where the outputs are the responses of an unknown system, obtain a parametric model estimate of the unknown system. If the unknown system is nonlinear, then it is obvious that MSE minimization would not result in the best possible representation of the system (plant). Criteria that utilize higher-order statistics, such as the error entropy, can potentially provide a better model [33,34]. Let us restrict ourselves to the class of linear parametric models. Although the Wiener solution is optimal in the least-squares sense, the biased input covariance matrix R in the presence of additive white input noise yields a bias(1) in the optimal solution compared to what would have been obtained with noise-free data. This is a major drawback, since noise is omnipresent in practical scenarios. In order to illustrate the degradation in the quality of the parameter estimate, we created a random input time

(1) The Wiener solution with noise-free data gives unbiased estimates.
We refer to this mismatch in the estimates obtained with and without noise as the bias introduced by noise.

Figure 1-2. Parameter estimates using RLS algorithm with noisy data (filter coefficients estimated using RLS vs. true values).

series with arbitrary coloring and passed it through an FIR filter with 50 taps. The filtered data were used as the desired signal. Uncorrelated white noise was added to the colored input signal, and the input signal-to-noise ratio (SNR) was fixed at 0 dB. The RLS algorithm was then used to estimate the weight vector. Ideally, if the SNR were infinite, RLS would have produced a weight vector exactly matching the FIR filter. However, because of the noisy input, the RLS estimates were biased, as can be seen in Figure 1-2. This is a very serious drawback of the MSE criterion, which is further accentuated by the fact that the optimal Wiener MSE solution varies with changing noise power. Researchers have dwelt on this problem for many years, and several modifications have been proposed to mitigate the effect of noise on the estimate. Total least-squares (TLS) is one method which is quite powerful in eliminating the bias due to noise [35-42]. The instrumental variables (IV) method, proposed as an extension to least squares (LS), has previously been applied for parameter estimation in white noise [32]. This method requires choosing a set of instruments that are uncorrelated with the noise in the input [32,43]. Yet another classical approach is subspace Wiener filtering [14,44]. This approach tries to suppress the bias by performing an optimal subspace projection (principal component space) and then training a filter in the reduced input space. In the next few sections, we will briefly cover some of these methods and discuss their benefits and limitations.
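The bias illustrated in Figure 1-2 is easy to reproduce at small scale. In the toy setup below (our own, much smaller than the 50-tap experiment), white input noise at 0 dB SNR shrinks the least-squares estimate toward zero by roughly the factor sigma_x^2 / (sigma_x^2 + sigma_v^2) = 0.5:

```python
import numpy as np

rng = np.random.default_rng(3)

# White input through a known 3-tap filter; only the input gets noisy.
n, m = 200_000, 3
x = rng.standard_normal(n)                 # clean white input, unit power
w_true = np.array([1.0, 0.5, -0.5])
X = np.column_stack([np.roll(x, i) for i in range(m)])[m:]
d = X @ w_true                             # clean desired signal

xn = x + rng.standard_normal(n)            # input corrupted at 0 dB SNR
Xn = np.column_stack([np.roll(xn, i) for i in range(m)])[m:]

R = Xn.T @ Xn / len(Xn)                    # biased: R_clean + sigma_v^2 I
P = Xn.T @ d / len(Xn)                     # noise uncorrelated with d
w_biased = np.linalg.solve(R, P)

# Approximately 0.5 * w_true for this white-input, white-noise case.
print(np.round(w_biased, 2))
```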
Total Least Squares (TLS) and Other Methods

Mathematically speaking, TLS solves an overdetermined set of linear equations of the form Ax = b, where A ∈ ℝ^(m×n) is the data matrix, b ∈ ℝ^m is the desired vector, x ∈ ℝ^n is the parameter vector, and m denotes the number of different observation vectors, each of dimension n [41]. Alternatively, the linear equations can be written in the form [A; b][x^T; −1]^T = 0, where [A; b] denotes the augmented data matrix. Let the SVD [8] of the augmented data matrix be [A; b] = UΣV^T, where U^T U = I_m, V^T V = I_(n+1) and Σ = [diag(σ₁, σ₂, …, σ_(n+1)); 0_((m−n−1)×(n+1))], with the singular values ordered σ₁ ≥ σ₂ ≥ … ≥ σ_(n+1) ≥ 0. If [A; b][x^T; −1]^T = 0, the smallest singular value must be zero. This is possible only if [x^T; −1]^T is a singular vector of [A; b] (corresponding to the zero singular value), normalized such that its (n+1)th element is −1. When [A; b] is a symmetric square matrix, the solution reduces to finding the eigenvector corresponding to the smallest eigenvalue of [A; b]. The TLS solution in this special case is then

[x^T; −1]^T = −v_(n+1) / v_(n+1,n+1)    (1.10)

where v_(n+1,n+1) is the last element of the minor eigenvector v_(n+1). The Total Least-Squares technique can be easily applied to estimate the optimal solution using minor components estimation algorithms [45-51]. The computation of the TLS solution requires efficient algorithms for extracting the principal components [52] or the eigenvectors of the data covariance matrix. Eigendecomposition is a well studied problem, and many algorithms have been proposed for online estimation of eigenvectors and eigenvalues directly from data samples [53-77]. We have proposed robust, sample-efficient algorithms for Principal Components Analysis (PCA) that have outperformed most of the available methods. A brief review of PCA theory and the proposed algorithms is given in appendix A. Brief mathematical analyses of the proposed algorithms according to the principles of stochastic approximation theory [78-85] are also included.
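As a concrete sketch of the procedure just described, the TLS solution can be computed from the SVD of the augmented matrix [A; b] with NumPy. The dimensions, noise level and seed below are arbitrary illustrative choices; the i.i.d. equal-variance noise on A and b is the TLS noise model from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 2000, 4
x_true = np.array([1.0, -2.0, 0.5, 3.0])
A_clean = rng.standard_normal((m, n))
b_clean = A_clean @ x_true

# Perturb both A and b with i.i.d. noise of equal variance.
sigma = 0.5
A = A_clean + sigma * rng.standard_normal((m, n))
b = b_clean + sigma * rng.standard_normal(m)

# TLS: right singular vector of [A; b] for the smallest singular value,
# rescaled so that its last ((n+1)th) component equals -1.
C = np.column_stack([A, b])
_, _, Vt = np.linalg.svd(C)
v = Vt[-1]
x_tls = -v[:n] / v[n]

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # ordinary LS, biased toward zero
print(np.linalg.norm(x_tls - x_true), np.linalg.norm(x_ls - x_true))
```

Under this noise model the TLS estimate stays close to the true parameters, while the ordinary least-squares estimate carries the noise-induced bias.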
A fast minor components analysis (MCA) based TLS algorithm [86] is discussed in appendix B.

Limitations of TLS

Total least squares gives unbiased estimates only when the noise in the input and the desired data are independent and identically distributed (i.i.d.) and have the same variance. Further, when the noise is truly i.i.d. Gaussian-distributed, the TLS solution is also the maximum likelihood solution. However, the assumption of equal noise variances is very restrictive, as measurement noises seldom have similar variances. The Generalized TLS (GTLS) problem [87] specifically deals with cases where the noise (still assumed to be i.i.d.) variances are different. However, the caveat is that the ratio of the noise variances is assumed to be known, which is, once again, not a practical assumption.

Extended TLS for Correlated Noise

In order to overcome the i.i.d. assumption, Mathews and Cichocki have proposed the Extended TLS (ETLS) [88], which allows the noise to have nonzero correlations. We will briefly describe the approach they adopted. Let the noisy augmented input matrix [A; b] be represented as H̃ = [A; b], and let H denote its noise-free counterpart. Then the square matrix H̃^T H̃ can be written as a combination of the clean data matrix H^T H and the noise covariance matrix R_v:

H̃^T H̃ = H^T H + R_v    (1.11)

The above equation is true when the noise is uncorrelated with the clean data. This assumption is reasonable, as the noise processes in general are unrelated (hence independent) to the physical sources that produced the data. Assume that there exists a matrix transformation such that

H̄ = H̃ R_v^(−1/2)    (1.12)

The transformed data correlation matrix of H̄ is simply

H̄^T H̄ = R_v^(−T/2) H^T H R_v^(−1/2) + I    (1.13)

Equation (1.13) basically tells us that the transformed data are now corrupted by an i.i.d. noise process. Hence, we can now find the regular TLS solution with the transformed data by estimating the minor eigenvector of the matrix H̄^T H̄.
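The whitening transformation (1.12)-(1.13) can be sketched as follows, assuming the noise covariance R_v is known. All sizes, the noise covariance construction, and the function name etls are illustrative assumptions, not part of the original method's notation.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20000, 3
x_true = np.array([0.5, -1.0, 2.0])
A_clean = rng.standard_normal((m, n))
H_clean = np.column_stack([A_clean, A_clean @ x_true])   # clean [A; b]

# Correlated (non-i.i.d.) noise on the augmented data, with known covariance Rv.
M = 0.3 * rng.standard_normal((n + 1, n + 1))
Rv = M @ M.T + 0.09 * np.eye(n + 1)
H_noisy = H_clean + rng.standard_normal((m, n + 1)) @ np.linalg.cholesky(Rv).T

def etls(H, Rv, n):
    """ETLS sketch: whiten with Rv^(-1/2) as in (1.12)-(1.13), take the minor
    eigenvector of the transformed correlation matrix, and map it back."""
    evals, E = np.linalg.eigh(Rv)
    Rv_inv_sqrt = E @ np.diag(evals ** -0.5) @ E.T      # symmetric inverse root
    C = Rv_inv_sqrt @ (H.T @ H) @ Rv_inv_sqrt           # transformed correlation
    _, W = np.linalg.eigh(C)                            # ascending eigenvalues
    z = Rv_inv_sqrt @ W[:, 0]    # minor eigenvector mapped back, so that H z ~ 0
    return -z[:n] / z[n]

print(etls(H_noisy, Rv, n))      # close to x_true despite correlated noise
```

Note that scaling R_v by a constant does not change the minor eigenvector, so the per-sample noise covariance can be passed directly even though (1.11) accumulates it over all rows.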
In other words, the optimal ETLS solution for correlated noise signals is given by estimating the generalized eigenvector corresponding to the smallest generalized eigenvalue of the matrix pencil (H̃^T H̃, R_v). Solving the generalized eigenvalue problem [8] is a nontrivial task, and there are only a handful of algorithms that can provide online solutions. Our research in the area of PCA provided us the tools to develop a novel generalized eigenvalue decomposition (GED) algorithm. A short summary of the GED problem, existing learning algorithms and the proposed algorithm is given in appendix C. Although the ETLS seems to solve the general problem of linear parameter estimation, there is an inherent drawback: the ETLS requires full knowledge of the correlation matrix R_v of the noise. This assumption potentially leaves the problem of linear parameter estimation with noisy data wide open.

Other Methods

Infinite Impulse Response (IIR) system identification methods [89-92] deal with the problem of measurement noise in the output (desired) data. The Instrumental Variables (IV) method [93] for IIR system identification, on the other hand, does not guarantee stability. It has been known for quite a while that the unit-norm constraint for equation-error (EE) based system identification is much better than the conventional monic constraint [90-92]. However, imposing the unit-norm constraint appears too restrictive and hence limits the applicability.

Summary

In this chapter, we started by describing criteria for linear adaptive systems and their associated algorithms. Most often, adaptive solutions are derived using the MSE criterion. We showed that the MSE criterion produces biased solutions in the presence of additive noise. The optimal Wiener MSE solution varies with changing noise variances, which is highly undesirable. Alternative approaches to combat the effect of noise in the parameter estimation have been explored.
The most popular approaches are based on the Total Least-Squares principles. Generalized TLS and Extended TLS improve upon the ability of the TLS to provide bias-free estimates in the presence of additive noise. However, these methods rely on assumptions that can be very restrictive for real-world applications. Further, they require SVD and Generalized SVD computation [94-105], which increases the complexity. Another method, subspace Wiener filtering, relies on the accurate estimation of the signal subspace from the noisy data. This technique reduces the effect of the bias when the signals are distinguishable from noise (high SNR scenario). Otherwise, it fails, since the noise and signal subspaces cannot be separated. Thus, it would not be fallacious to say that the problem of linear parameter estimation with noisy data is a hard problem that does not yet have a satisfactory solution in the existing literature. One of the major contributions of this dissertation is the development of an elegant solution to this problem without making any unreasonable assumptions about the noise statistics. Towards this end, we will present a new criterion based on the error signal and derive new learning algorithms.

CHAPTER 2
AUGMENTED ERROR CRITERION FOR LINEAR ADAPTIVE SYSTEMS

Introduction

In the previous chapter, we discussed the Mean-Squared Error (MSE) criterion, which has been the workhorse of linear optimization theory due to the simple and analytically tractable structure of linear least squares. In adaptive filter theory, the classical Wiener-Hopf equations [6,10] are more commonly used, owing to the extension of least squares to functional spaces (Hilbert spaces [106]) proposed by Wiener [6]. However, for finite impulse response (FIR) filters (vector spaces), the two solutions coincide.
There are also a number of important properties that help us understand the statistical behavior of the Wiener solution, namely the orthogonality of the error signal to the input vector space, as well as the whiteness of the predictor error signal for stationary inputs, provided the filter is long enough [5,14]. However, in a number of applications of practical importance, the error sequence produced by the Wiener filter is not white. One of the most important is the case of inputs corrupted by white noise, where the Wiener solution is biased by the noise variance, as we saw before in Chapter 1. In this chapter, we will develop a new criterion that augments the MSE criterion. In fact, MSE becomes a special case of this new criterion, which we call the Augmented Error Criterion (AEC). Further, we will show that, under some conditions, this new criterion can produce a partially white error sequence at the output of an adaptive system even with noisy data. This special case of the AEC is called the Error Whitening Criterion (EWC). Our approach in this chapter will be as follows. We will first focus on the problem of parameter estimation with noisy data and motivate the derivation of the error whitening criterion. Then, we will deduce the more generic augmented error criterion.

Error Whitening Criterion (EWC)

Consider the problem of parameter estimation with noisy data. Instead of minimizing the MSE, we will tackle the problem by introducing a new adaptation criterion that enforces zero autocorrelation of the error signal beyond a certain lag; hence the name error whitening criterion (EWC). Since we want to preserve the online properties of the adaptation algorithms, we propose to expand the error autocorrelation around a lag larger than the filter length using a Taylor series. Thus, instead of an error signal, we will end up with an error vector, containing as many components as the terms kept in the Taylor series expansion.
A schematic diagram of the proposed adaptation structure is depicted in Figure 2-1.

Figure 2-1. Schematic diagram of EWC adaptation.

The properties of this solution are very interesting, and it contains the Wiener solution as a special case. Additionally, for the case of two error terms, the same analytical tools developed for the Wiener filter can be applied with minor modifications. Moreover, when the input signal is contaminated with additive white noise, EWC produces the same optimal solution that would be obtained with the noise-free data, with the same computational complexity as the Wiener solution.

Motivation for Error Whitening Criterion

The classical Wiener solution yields a biased estimate of the reference filter weight vector in the presence of input noise. This problem arises due to the contamination of the input signal autocorrelation matrix with that of the additive noise. If a signal is contaminated with additive white noise, only the zero-lag autocorrelation is biased, by the amount of the noise power; autocorrelation values at all other lags remain at their original values. This observation rules out MSE as a good optimization criterion for this case. In fact, since the error power is the value of the error autocorrelation function at zero lag, the optimal weights will be biased, because they depend on the input autocorrelation values at zero lag. The fact that the autocorrelation values at nonzero lags are unaffected by the presence of noise will prove useful in determining an unbiased estimate of the filter weights.

Analysis of the Autocorrelation of the Error Signal

The question that arises is what lag should be used to obtain the true weight vector in the presence of white input noise. Let us consider the autocorrelation of the training error at nonzero lags.
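Before doing so, the premise of this section, that additive white noise inflates only the zero-lag autocorrelation, is easy to verify numerically. A minimal sketch (the coloring filter, sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200000
clean = np.convolve(rng.standard_normal(n), [1.0, 0.8, 0.4], mode="same")
noisy = clean + rng.standard_normal(n)          # additive white noise, unit power

def autocorr(s, lag):
    """Sample autocorrelation E[s(k) s(k-lag)]."""
    return float(np.mean(s[lag:] * s[:len(s) - lag]))

for lag in range(4):
    print(lag, autocorr(clean, lag), autocorr(noisy, lag))
# Only the lag-0 value is inflated (by the unit noise power); the
# nonzero-lag autocorrelations of the two signals agree.
```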
Suppose noisy training data of the form (x̃(t), d̃(t)) are provided, where x̃(t) = x(t) + v(t) and d̃(t) = d(t) + u(t), with x(t) being the sample of the noise-free input vector at time t (time is assumed to be continuous), v(t) being the additive white noise vector on the input vector, d(t) being the noise-free desired output, and u(t) being the additive white noise on the desired output. Suppose that the true weight vector of the reference filter that generated the data is w_T (moving average model). Then the error at time t is e(t) = (d(t) + u(t)) − (x(t) + v(t))^T w, where w is the estimated weight vector. Equivalently, when the desired response belongs to the subspace of the input, i.e., d(t) = x^T(t)w_T, the error can be written as

e(t) = (x^T(t)w_T + u(t)) − (x(t) + v(t))^T w = x^T(t)(w_T − w) + u(t) − v^T(t)w    (2.1)

Given this noisy training data, the MSE-based Wiener solution will not yield a residual training error that has zero autocorrelation for a number of consecutive lags, even when the contaminating noise signals are white. From (2.1), it is easy to see that the error will have a zero autocorrelation function if and only if
* the weight vector is equal to the true weights of the reference model, and
* the lag is beyond the Wiener filter length.
During adaptation, the issue is that the filter weights are not set at w_T, so the error autocorrelation function will generally be nonzero. Therefore, a criterion to determine the true weight vector when the data is contaminated with white noise should force the long lags (beyond the filter length) of the error autocorrelation function to zero. This is exactly what the error-whitening criterion (EWC) that we propose here will do. There are two interesting situations that we should consider:
* What happens when the selected autocorrelation lag is smaller than the filter length?
* What happens when the selected autocorrelation lag is larger than the lag at which the autocorrelation function of the input signal vanishes?
The answer to the first question is simply that the solution will still be biased, since it will be obtained by inverting a biased input autocorrelation matrix. If the selected lag is L < m, the noise power appears on the Lth subdiagonal of the lagged input autocorrelation matrix, which is where the zero-lag autocorrelation of the input signal shows up. In the special case of MSE, the selected lag is zero and the zeroth subdiagonal becomes the main diagonal; thus the solution is biased by the noise power. The answer to the second question is equally important. The MSE solution is quite stable because it is determined by the inverse of a diagonally dominant Toeplitz matrix. The diagonal dominance is guaranteed by the fact that the autocorrelation function of a real-valued signal has a peak at zero lag. If other lags are used in the criterion, it is important that the lag is selected such that the corresponding autocorrelation matrix (which will be inverted) is not ill conditioned. If the selected lag is larger than the length of the input autocorrelation function, then the autocorrelation matrix becomes singular and a solution cannot be obtained. Therefore, lags beyond the input signal correlation time should also be avoided in practice. The observation that constraining the higher lags of the error autocorrelation function to zero yields unbiased weight solutions is quite significant. Moreover, the algorithmic structure of this new solution and that of the lag-zero MSE solution are still very similar. The noise-free case helps us understand why this similarity occurs. Suppose the desired signal is generated by the following equation: d(t) = x^T(t)w_T, where w_T is the true weight vector. Now multiply both sides by x(t − Δ) from the left and then take the expected value of both sides to yield E[x(t − Δ)d(t)] = E[x(t − Δ)x^T(t)]w_T. Similarly, we can obtain E[x(t)d(t − Δ)] = E[x(t)x^T(t − Δ)]w_T.
Adding the corresponding sides of these two equations yields

E[x(t)d(t − Δ) + x(t − Δ)d(t)] = E[x(t)x^T(t − Δ) + x(t − Δ)x^T(t)]w_T    (2.2)

This equation is similar to the standard Wiener-Hopf equation [9,10], E[x(t)d(t)] = E[x(t)x^T(t)]w_T. Yet, it is different due to the correlations being evaluated at a lag other than zero, which means that the weight vector can be determined by constraining higher order lags of the error autocorrelation. Now that we have described the structure of the solution, let us address the issue of training linear systems using error correlations. Adaptation exploits the sensitivity of the error autocorrelation with respect to the weight vector of the adaptive filter. We will formulate the solution in continuous time first, for the sake of simplicity. If the support of the impulse response of the adaptive filter is of length m, we evaluate the derivative of the error autocorrelation function with respect to the weights at a lag Δ ≥ m (both Δ and m are real numbers). Assuming that the noises in the input and desired are uncorrelated with each other and with the input signal, we get

∂ρ_e(Δ)/∂w = ∂E[e(t)e(t − Δ)]/∂w
= (∂/∂w) E[((w_T − w)^T x(t) + u(t) − v^T(t)w)((w_T − w)^T x(t − Δ) + u(t − Δ) − v^T(t − Δ)w)]    (2.3)
= −2E[x(t)x^T(t − Δ)](w_T − w)

The identity in equation (2.3) immediately tells us that the sensitivity of the error autocorrelation with respect to the weight vector becomes zero, i.e., ∂ρ_e(Δ)/∂w = 0, if (w_T − w) = 0. This observation emphasizes the following important conclusion: when given training data that are generated by a linear filter but contaminated with white noise, it is possible to derive simple adaptive algorithms that could determine the underlying filter weights without bias. Furthermore, if (w_T − w) is not in the null space of E[x(t)x^T(t − Δ)], then only (w_T − w) = 0 makes ρ_e(Δ) = 0 and ∂ρ_e(Δ)/∂w = 0. But looking at (2.3), we conclude that a proper delay depends on the autocorrelation of the input signal, which is, in general, unknown. Therefore, the selection of the delay Δ is important.
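The importance of the lag choice can be illustrated numerically: for an input whose autocorrelation vanishes beyond a short lag, the lagged correlation matrix R_L = E[x(k − L)x^T(k) + x(k)x^T(k − L)] (defined formally later in this chapter) becomes numerically singular when L exceeds the input correlation time, exactly as warned above. A minimal sketch with an MA-colored input (all sizes and coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 200000, 3
# MA(2)-colored input: its autocorrelation vanishes for lags >= 3.
x = np.convolve(rng.standard_normal(n), [1.0, 0.8, 0.4], mode="same")
X = np.array([x[i - m + 1:i + 1][::-1] for i in range(m - 1, n)])

def lagged_R(X, L):
    """Estimate R_L = E[x(k-L) x(k)^T + x(k) x(k-L)^T] from the data matrix."""
    X0, Xl = X[L:], X[:-L]
    return (Xl.T @ X0 + X0.T @ Xl) / len(X0)

for L in (1, 10):
    s = np.linalg.svd(lagged_R(X, L), compute_uv=False)
    print(L, s.min())
# At L = 1 the matrix is well conditioned; at L = 10, beyond the input
# correlation time, it is numerically singular and cannot be inverted.
```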
One possibility is to evaluate the error autocorrelation function at different lags Δ ≥ m and check for a nonzero input autocorrelation function at that delay, which would be very time consuming and inappropriate for online algorithms. Instead of searching for a good lag Δ, consider the Taylor series approximation of the autocorrelation function around a fixed lag L, where L ≥ m:

ρ_e(Δ) = ρ_e(L) + ρ̇_e(L)(Δ − L) + (1/2)ρ̈_e(L)(Δ − L)² + …
= E[e(t)e(t − L)] − E[e(t)ė(t − L)](Δ − L) + (1/2)E[e(t)ë(t − L)](Δ − L)² + …    (2.4)

In (2.4), ė(t) and ë(t) (see Figure 2-1) represent the derivatives of the error signal with respect to the time index. Notice that we do not take the Taylor series expansion around zero lag, for the reasons indicated above. Moreover, L should be less than the correlation time of the input, so that the Taylor expansion has a chance of being accurate. But since we bring more lags into the expansion, the choice of the lag becomes less critical than in (2.3). In principle, the more terms we keep in the Taylor expansion, the more constraints we are imposing on the autocorrelation of the error in adaptation. Therefore, instead of finding the weight vector that makes the actual gradient in (2.3) zero, we find the weight vector that makes the derivative of the approximation in (2.4) with respect to the weight vector zero. If the adaptive filter is operating in discrete time instead of continuous time, the differentiation with respect to time can be replaced by a first-order forward difference, ė(n) = e(n) − e(n − L). Higher order derivatives can also be approximated by their corresponding forward difference estimates, e.g., ë(n) = e(n) − 2e(n − L) + e(n − 2L), etc. Although the forward difference normally uses two consecutive samples, for reasons that will become clear in the following sections of this chapter, we will utilize two samples separated by L samples in time.
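The lag-L forward difference just introduced links the derivative terms to error autocorrelations through the identity E[ė²(n)] = 2E[e²(n)] − 2E[e(n)e(n − L)], which underlies the algebra in the next section; with the weight −1/2 on E[ė²(n)], the error power cancels and only the lag-L error autocorrelation survives. A quick numerical check (the error sequence, lag and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
# A correlated stand-in for an error sequence.
e = np.convolve(rng.standard_normal(100000), [1.0, -0.6, 0.3], mode="same")
L = 5
edot = e[L:] - e[:-L]     # first-order forward difference with spacing L

lhs = np.mean(edot ** 2)
rhs = 2 * np.mean(e[L:] ** 2) - 2 * np.mean(e[L:] * e[:-L])
print(lhs, rhs)   # E[(e(n)-e(n-L))^2] = 2E[e^2(n)] - 2E[e(n)e(n-L)], up to end effects

beta = -0.5
J_aug = np.mean(e[L:] ** 2) + beta * lhs      # E[e^2] + beta E[edot^2]
rho_L = np.mean(e[L:] * e[:-L])               # lag-L error autocorrelation
print(J_aug, rho_L)   # for beta = -1/2 the criterion reduces to rho_L
```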
The first-order truncated Taylor series expansion of the error autocorrelation function at lag Δ, evaluated around L, becomes

ρ_e(Δ) = E[e(n)e(n − L)] − E[e(n)(e(n) − e(n − L))](Δ − L)
= −(Δ − L)E[e²(n)] + (1 + Δ − L)E[e(n)e(n − L)]    (2.5)

Analyzing (2.5), we remark another advantage of the Taylor series expansion: the familiar MSE is part of the expansion. Notice also that as one forces Δ → L, the MSE term will disappear and only the lag-L error autocorrelation will remain. On the other hand, as Δ → L − 1, only the MSE term will prevail in the autocorrelation function approximation. Introducing more terms in the Taylor expansion will bring in error autocorrelation constraints from lags iL (multiples of L).

Augmented Error Criterion (AEC)

We are now in a position to formulate the augmented error criterion (AEC). To the regular MSE term, we add another term βE[ė²(n)] to obtain the augmented error criterion shown in equation (2.6):

J(w) = E[e²(n)] + βE[ė²(n)]    (2.6)

where β is a real scalar parameter. Equivalently, (2.6) can also be written as

J(w) = (1 + 2β)E[e²(n)] − 2βE[e(n)e(n − L)]    (2.7)

which has the same form as (2.5). Notice that when β = 0 we recover the MSE in (2.6) and (2.7). Similarly, we would have to select Δ = L in order to make the first-order expansion identical to the exact value of the error autocorrelation function. Substituting the identity (1 + 2β) = (L − Δ), and using Δ = L, we observe that β = −1/2 eliminates the MSE term from the criterion. Interestingly, this value will appear in a later discussion, when we optimize β in order to reduce the bias in the solution introduced by input noise. If β is positive, then minimizing the cost function J(w) is equivalent to minimizing the MSE with a constraint that the error signal must be smooth. Thus, the weight vector corresponding to the minimum of J(w) will result in a higher MSE than the Wiener solution. The same criterion can also be obtained by considering performance functions of the form

J(w) = E[e²(n)] + βE[ė²(n)] + γE[ë²(n)] + …    (2.8)
where the coefficients β, γ, etc. are assumed to be positive. Notice that (2.8) is the L2 norm of a vector of different objective functions. The components of this vector consist of e(n), ė(n), ë(n), etc. Due to the equivalence provided by the difference approximations for the derivatives, these terms constrain the error autocorrelation at lags iL as well as the error power, as seen in (2.8). In summary, the AEC defined by equation (2.6) can take many forms and hence results in different optimal solutions.
* If β is 0, then AEC exactly becomes the MSE criterion.
* If β is −0.5, then AEC becomes the EWC, which will result in an unbiased estimate of the parameters even in the presence of noise.
* If β is positive, then the cost function minimizes a combination of the MSE and a smoothness constraint.
In the following sections, we will further elaborate on the properties of the AEC.

Properties of Augmented Error Criterion

Shape of the Performance Surface

Suppose that noise-free training data of the form (x(n), d(n)), generated by a linear system with weight vector w_T through d(n) = x^T(n)w_T, are provided. Assume without loss of generality that the adaptive filter and the reference filter are of the same length; this is possible since w_T can be padded with zeros if it is shorter than the adaptive filter. Therefore, the input vector x(n) ∈ ℝ^m, the weight vector w_T ∈ ℝ^m and the desired output d(n) ∈ ℝ. Equation (2.6) has a quadratic form and has a unique stationary point. If β > 0, then this stationary point is a minimum. Otherwise, the Hessian of (2.6) might have mixed-sign eigenvalues. We demonstrate this fact with sample performance surfaces obtained for 2-tap FIR filters using β = −1/2. For three differently colored training data sets, we obtain the AEC performance surfaces shown in Figure 2-2. In each row, the MSE performance surface, the AEC cost contour plot, and the AEC performance surface are shown for the corresponding training data.
The eigenvalue pairs of the Hessian matrix of (2.6) are (2.35, 20.30), (6.13, −5.21), and (−4.08, −4.14) for these representative cases in Figure 2-2. Clearly, it is possible for (2.6) to have a stationary point that is a minimum, a saddle point, or a maximum, and we start to see the differences brought about by the AEC. The performance surface is a weighted sum of paraboloids, which will complicate gradient-based adaptation, but will not affect search algorithms utilizing curvature information. We will discuss more on the search techniques later in this chapter and also in Chapter 4.

Figure 2-2. The MSE performance surfaces, the AEC contour plots, and the AEC performance surfaces for three different training data sets and 2-tap adaptive FIR filters.

Analysis of the Noise-free Input Case

Theorem 2.1: The stationary point of the quadratic form in (2.6) is given by

w* = (R + βS)⁻¹(P + βQ)    (2.9)

where we define R = E[x(n)x^T(n)], S = E[ẋ(n)ẋ^T(n)], P = E[x(n)d(n)], Q = E[ẋ(n)ḋ(n)], with the differences ẋ(n) = x(n) − x(n − L) and ḋ(n) = d(n) − d(n − L).

Proof: Substituting the proper variables in (2.6), we obtain the following explicit expression for J(w):

J(w) = E[d²(n)] + βE[ḋ²(n)] + w^T(R + βS)w − 2(P + βQ)^T w    (2.10)

Taking the gradient with respect to w and equating it to zero yields

∂J(w)/∂w = 2(R + βS)w − 2(P + βQ) = 0
⇒ w* = (R + βS)⁻¹(P + βQ)    (2.11)

Notice that selecting β = 0 in (2.6) reduces the criterion to MSE, and the optimal solution, given in (2.9), reduces to the Wiener solution. Thus, the Wiener filter is a special case of the AEC solution (though not optimal for noisy inputs, as we will show later).

Corollary 1.
An equivalent expression for the stationary point of (2.6) is given by

w* = [(1 + 2β)R − βR_L]⁻¹[(1 + 2β)P − βP_L]    (2.12)

where we define the matrix R_L = E[x(n − L)x^T(n) + x(n)x^T(n − L)] and the vector P_L = E[x(n − L)d(n) + x(n)d(n − L)]. Notice that the interesting choice β = −1/2 yields w* = R_L⁻¹P_L.

Proof: Substituting the definitions of R, S, P, Q, and then recollecting terms to obtain R_L and P_L, yields the desired result:

w* = (R + βS)⁻¹(P + βQ)
= {E[x(n)x^T(n)] + βE[(x(n) − x(n − L))(x(n) − x(n − L))^T]}⁻¹ {E[x(n)d(n)] + βE[(x(n) − x(n − L))(d(n) − d(n − L))]}
= [(1 + 2β)R − βR_L]⁻¹[(1 + 2β)P − βP_L]    (2.13)

From these results we deduce two extremely interesting conclusions.

Lemma 1 (Generalized Wiener-Hopf Equations): In the noise-free case, the true weight vector satisfies R_L w_T = P_L. (This result is also true for noisy data.)

Proof: This result follows immediately from the substitution of d(n) = x^T(n)w_T and d(n − L) = x^T(n − L)w_T in the definitions of R_L and P_L.

Lemma 2: In the noise-free case, regardless of the specific value of β, the optimal solution is equal to the true weight vector, i.e., w* = w_T.

Proof: This result follows immediately from the substitution of the result in Lemma 1 into the optimal solution expression given in (2.9).

The result in Lemma 1 is especially significant, since it provides a generalization of the Wiener-Hopf equations to autocorrelation and cross-correlation matrices evaluated at different lags of the signals. In these equations, L represents the specific correlation lag selected, and the choice L = 0 corresponds to the traditional Wiener-Hopf equations. The generalized Wiener-Hopf equations essentially state that the true weight vector can be determined by exploiting correlations evaluated at different lags of the signals; we are not restricted to the zero-lag correlations as in the Wiener solution.
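The generalized Wiener-Hopf equations of Lemma 1 can be checked numerically in the noise-free case. A minimal sketch: the coloring filter is a random length-40 FIR, chosen (arbitrarily) so that the input correlation time is long and R_L at lag L = m is nonsingular.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 100000, 5
w_true = rng.standard_normal(m)
# Input with long correlation time: white noise through a random length-40 filter.
h = rng.standard_normal(40) / np.sqrt(40.0)
x = np.convolve(rng.standard_normal(n), h, mode="same")
X = np.array([x[i - m + 1:i + 1][::-1] for i in range(m - 1, n)])
d = X @ w_true                      # noise-free desired signal

L = m                               # lag at least the filter length
X0, Xl, D0, Dl = X[L:], X[:-L], d[L:], d[:-L]
R_L = (Xl.T @ X0 + X0.T @ Xl) / len(D0)   # E[x(n-L)x(n)^T + x(n)x(n-L)^T]
P_L = (Xl.T @ D0 + X0.T @ Dl) / len(D0)   # E[x(n-L)d(n) + x(n)d(n-L)]

w_hat = np.linalg.solve(R_L, P_L)   # generalized Wiener-Hopf: R_L w_T = P_L
print(np.linalg.norm(w_hat - w_true))
```

Because the desired signal lies exactly in the input subspace here, the sample estimates satisfy the generalized Wiener-Hopf equations exactly, and the recovered weights match w_true to machine precision.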
Analysis of the Noisy Input Case

Now suppose that we are given noisy training data (x̃(n), d̃(n)), where x̃(n) = x(n) + v(n) and d̃(n) = d(n) + u(n). The additive noises on both signals are zero mean and uncorrelated with each other and with the input and desired signals. Assume that the additive noise u(n) on the desired signal is white (in time), and let the autocorrelation matrices of v(n) be V = E[v(n)v^T(n)] and V_L = E[v(n − L)v^T(n) + v(n)v^T(n − L)]. Under these circumstances, we have to estimate the necessary matrices to evaluate (2.9) using noisy data. These matrices evaluated using noisy data, R̃, S̃, P̃, and Q̃, become (see appendix D for details)

R̃ = E[x̃(n)x̃^T(n)] = R + V
S̃ = E[(x̃(n) − x̃(n − L))(x̃(n) − x̃(n − L))^T] = 2(R + V) − R_L − V_L    (2.14)
P̃ = E[x̃(n)d̃(n)] = P
Q̃ = E[(x̃(n) − x̃(n − L))(d̃(n) − d̃(n − L))] = 2P − P_L

Finally, the optimal solution estimate of AEC, when presented with noisy input and desired output data, will be

w̃* = (R̃ + βS̃)⁻¹(P̃ + βQ̃)
= [(R + V) + β(2(R + V) − R_L − V_L)]⁻¹[P + β(2P − P_L)]    (2.15)
= [(1 + 2β)(R + V) − β(R_L + V_L)]⁻¹[(1 + 2β)P − βP_L]

Theorem 2.2 (EWC Noise-Rejection Theorem): In the noisy-input data case, the optimal solution obtained using AEC will be identically equal to the true weight vector if and only if β = −1/2, R_L ≠ 0, and V_L = 0. There are two situations to consider:
* When the adaptive linear system is an FIR filter, the input noise vector v(n) consists of delayed versions of a single-dimensional noise process. In that case, V_L = 0 if and only if L ≥ m, where m is the filter length, and the single-dimensional noise process is white.
* When the adaptive linear system is an ADALINE, the input noise is a vector process. In that case, V_L = 0 if and only if the input noise vector process is white (in time) and L ≥ 1. The input noise vector may be spatially correlated.
Proof: Sufficiency of the first statement is immediately observed by substituting the provided values of β and V_L. Necessity is obtained by equating (2.15) to w_T and substituting the generalized Wiener-Hopf equations provided in Lemma 1. Clearly, if R_L = 0, then there is no equation to solve, thus the weights cannot be uniquely determined using this value of L. The statement regarding the FIR filter case is easily proved by noticing that the temporal correlations in the noise vector diminish once the autocorrelation lag becomes greater than or equal to the filter length. The statement regarding the ADALINE structure is immediately obtained from the definition of a temporally white vector process.

Orthogonality of Error to Input

An important question regarding the behavior of the optimal solution obtained using the AEC is the relationship between the residual error signal and the input vector. In the case of MSE, we know that the Wiener solution results in an error that is orthogonal to the input signal, i.e., E[e(n)x(n)] = 0 [10,14,15]. However, this result is true only when there is no noise and when the estimated filter length is greater than the actual system impulse response. Similarly, we can determine what the AEC will achieve.

Lemma 3: At the optimal solution of AEC, the error and the input random processes satisfy βE[e(n)x(n − L) + e(n − L)x(n)] = (1 + 2β)E[e(n)x(n)], for all L > 0.

Proof: We know that the optimal solution of AEC for any L > 0 is obtained when the gradient of the cost function with respect to the weights is zero. Therefore,

∂J/∂w = −2E[e(n)x(n)] − 2βE[(e(n) − e(n − L))(x(n) − x(n − L))]
= −2{(1 + 2β)E[e(n)x(n)] − βE[e(n)x(n − L) + e(n − L)x(n)]} = 0    (2.16)

It is interesting to note that if β = −1/2, then we obtain E[e(n)x(n − L) + e(n − L)x(n)] = 0 for all L. On the other hand, since the criterion reduces to MSE for β = 0, we obtain E[e(n)x(n)] = 0. The result shown in (2.16), if interpreted in terms of Newtonian physics, reveals an interesting insight into the behavior of the EWC criterion (β = −1/2) at its optimal solution (regardless of the length of the reference filter that created the desired signal). In a simplistic manner, this behavior could be summarized by the following statement: the optimal solution of EWC tries to decorrelate the residual error from the estimated future value of the input vector (see appendix E for details). The case β = −1/2 is especially interesting, because it results in complete noise rejection. Notice that, in this case, since the optimal solution is equal to the true weight vector, the residual error is given by e(n) = u(n) − v^T(n)w_T, which is composed purely of the noise in the training data. Certainly, this is the only way that the adaptive filter can achieve E[e(n)x(n − L) + e(n − L)x(n)] = 0 for all L values, since E[e(n)x(n − L)] = E[e(n − L)x(n)] = 0 for this error signal. Thus, EWC not only orthogonalizes the instantaneous error and input signals, it orthogonalizes all lags of the error from the input.

Relationship to Error Entropy Maximization

Another interesting property that the AEC solution exhibits is its relationship with entropy [107]. Notice that when β < 0, the optimization rule tries to minimize the MSE, yet it simultaneously tries to maximize the separation between samples of errors. We could regard the sample separation as an estimate of the error entropy. In fact, the entropy estimation literature is full of methods based on sample separations [108-113]. Specifically, the EWC case with β = −1/2 finds the perfect balance between entropy and MSE that allows us to eliminate the effect of noise on the solution. Recall that the Gaussian density displays maximum entropy among distributions of fixed variance [114]. In the light of this fact, the aim of EWC could be understood as finding the minimum error variance solution while keeping the error close to Gaussian.
Notice that, due to the central limit theorem [114], the error signal will be closely approximated by a Gaussian density when there is a large number of taps. A brief description of the relationship between entropy (using estimators) [115-117] and sample differences is provided in Appendix F.

Note on Model-Order Selection

Model-order selection is another important issue in adaptive filter theory. The desired behavior of an adaptive filter is to find the right balance between approximating the training data as accurately as possible and generalizing to unseen data with precision [118]. One major cause of poor generalization is known to be excessive model complexity [118]. Under these circumstances, the designer's aim is to determine the least complex adaptive system (which translates to a smaller number of weights in the case of linear systems) that minimizes the approximation error. Akaike's information criterion (AIC) [119] and Rissanen's minimum description length (MDL) [120] are two important theoretical results regarding model-order selection. Such methods require the designer to evaluate an objective function, which is a combination of the MSE and the filter length or the filter weights, using different lengths of adaptive filters.

Consider the case of overmodeling in the problem of linear FIR filter (assume N taps) estimation. If we use the MSE criterion and assume that there is no noise in the data, then the estimated Wiener solution will have exactly N nonzero elements that exactly match the true FIR filter. This is a very nice property of the MSE criterion. However, when there is noise in the data, this property of MSE no longer holds; increasing the length of the adaptive filter will only result in more parameter bias in the Wiener solution. On the other hand, EWC successfully determines the length of the true filter, even in the presence of additive noise.
In the overmodeling case, the additional taps will decay to zero, indicating that a smaller filter is sufficient to model the data. This is exactly what we would like an automated regularization algorithm to achieve: determining the proper length of the filter without requiring external discrete modifications of this parameter. Therefore, EWC extends the regularization capability of MSE to the case of noisy training data. Alternatively, EWC could be used as a criterion for determining the model order in a fashion similar to standard model-order selection methods. Given a set of training samples, one could solve for the optimal EWC solution for various lengths of the adaptive filter. As the length of the adaptive filter is increased past the length of the true filter, the error power of the EWC solution becomes constant. Observing this point of transition from variable to constant error power, we can determine the exact model order of the original filter.

The Effect of β on the Weight Error Vector

The effect of the cost function's free parameter β on the accuracy of the solution (compared to the true weight vector that generated the training data) is another crucial issue. In fact, it is possible to determine the dynamics of the weight error as a function of β. This result is provided in the following lemma.

Lemma 4: (The effect of β on the AEC solution) In the noisy training data case, the derivative of the error vector between the optimal AEC solution and the true weight vector, i.e., ε = ŵ - w_T, with respect to β is given by

∂ε/∂β = -[(1 + 2β)(R + V) - βR_L]^{-1}[(2(R + V) - R_L)ε + 2Vw_T]    (2.17)

Proof: Recall from (2.15) that in the noisy data case, the optimal AEC solution is given by ŵ = [(1 + 2β)(R + V) - βR_L]^{-1}[(1 + 2β)P - βP_L]. Using the chain rule for the derivative and the fact that for any nonsingular matrix A(β), ∂A^{-1}/∂β = -A^{-1}(∂A/∂β)A^{-1}, the result in (2.17) follows from straightforward derivation.
In order to evaluate the derivative as β → -1/2, we simply substitute this value in (2.17). The significance of Lemma 4 is that it shows that no finite β value will make this error derivative zero. The inverted matrix, on the other hand, approaches zero as β grows unboundedly. In addition, (2.17) could be used to determine the derivative of the Euclidean error norm, ∂‖ε‖²/∂β.

Numerical Case Studies of AEC with the Theoretical Solution

In the preceding sections, we have built the theory of the augmented error criterion and of its special case, the error whitening criterion, for linear adaptive filter optimization. We have investigated the behavior of the optimal solution as a function of the cost function parameters, as well as determining the optimal value of this parameter in the noisy training data case. This section is designed to demonstrate these theoretical results in numerical case studies with Monte Carlo simulations. In these simulations, the following scheme is used to generate the required autocorrelation and cross-correlation matrices. Given the scheme depicted in Figure 2-3, it is possible to determine the true analytic auto- and cross-correlations of all signals of interest in terms of the filter coefficients and the noise powers. Suppose s, v, and u are zero-mean white noise signals with powers σ_s², σ_v², and σ_u², respectively, and suppose that the coloring filter h and the mapping filter w are unit-norm.

Figure 2-3. Demonstration scheme with coloring filter h, true mapping filter w, and the uncorrelated white signals.

Under these conditions, since the noise signals are uncorrelated with each other and with the clean signals, we obtain

E[(x(n) + v(n))(x(n-Δ) + v(n-Δ))^T] = E[x(n)x^T(n-Δ)] + σ_v² δ(Δ) I    (2.19)
E[(x(n) + v(n))d(n-Δ)] = E[x(n)x^T(n-Δ)] w    (2.20)

For each combination of SNR from (-10 dB, 0 dB, 10 dB), β from (-0.5, -0.3, 0, 0.1), m from (2,...,10), and L from (m,...,20), we performed 100 Monte Carlo simulations using randomly selected 30-tap FIR coloring and m-tap mapping filters.
The length of the mapping filters and that of the adaptive filters were selected to be equal in every case. In all simulations, we used an input signal power of σ_x² = 1, and the noise powers σ_v² = σ_u² were determined from the given SNR using SNR = 10 log10(σ_x²/σ_v²). The matrices R, S, P, and Q, which are necessary to evaluate the optimal solution given by (2.15), were then evaluated analytically using (2.18), (2.19), and (2.20). The results obtained are summarized in Figure 2-4 and Figure 2-5, where, for the three SNR levels selected, the average squared error norm of the optimal solutions (in reference to the true weights) is given as a function of L and m for different β values. In Figure 2-4, we present the average normalized weight vector error norm obtained using AEC at different SNR levels and using different β values as a function of the correlation lag L that is used in the criterion. The filter length was fixed to 10 in these simulations.

Figure 2-4. The average squared error norm of the optimal weight vector as a function of autocorrelation lag L for various β values and SNR levels.

Figure 2-5. The average squared error norm of the optimal weight vector as a function of filter length m for various β values and SNR levels.

From the theoretical analysis, we know that if the input autocorrelation matrix is invertible, then the solution accuracy should be independent of the autocorrelation lag L. The results of the Monte Carlo simulations presented in Figure 2-4 conform to this fact. As expected, the optimal choice of β = -1/2 determined the correct filter weights exactly.
Another set of results, presented in Figure 2-5, shows the effect of filter length on the accuracy of the solutions provided by the AEC. The optimal value β = -1/2 always yields the perfect solution, whereas the accuracy of the optimal weights degrades as this parameter is increased towards zero (i.e., as the weights approach the Wiener solution). An interesting observation from Figure 2-5 is that for SNR levels below zero, the accuracy of the solutions using suboptimal β values increases when the filter length is increased, whereas for SNR levels above zero, the accuracy decreases. For zero SNR, on the other hand, the accuracy seems to be roughly unaffected by the filter length.

The Monte Carlo simulations performed in the preceding examples utilized the exact coloring filter and the true filter coefficients to obtain the analytical solutions. In our final case study, we demonstrate the performance of the batch solution of the AEC criterion obtained from sample estimates of all the relevant auto- and cross-correlation matrices. In these Monte Carlo simulations, we utilize 10,000 samples corrupted with white noise at various SNR levels. The results are summarized in the histograms shown in Figure 2-6. Each subplot of Figure 2-6 corresponds to experiments performed using SNR levels of -10 dB, 0 dB, and 10 dB for each column and adaptive filter lengths of 4, 8, and 12 taps for each row, respectively. For each combination of SNR and filter length, we performed 50 Monte Carlo simulations using the MSE (β = 0) and EWC (β = -1/2) criteria. The correlation lag is selected to be equal to the filter length in all simulations, due to Theorem 2.2. Clearly, Figure 2-6 demonstrates the superiority of the AEC in rejecting noise that is present in the training data. Notice that in all subplots (for all combinations of filter length and SNR), AEC achieves a smaller average error norm than MSE.
Figure 2-6. Histograms of the weight error norms (dB) obtained in 50 Monte Carlo simulations using 10,000 samples of noisy data with MSE (empty bars) and EWC with β = -0.5 (filled bars). The subfigures in each row use filters with 4, 8, and 12 taps, respectively. The subfigures in each column use noisy samples at -10, 0, and 10 dB SNR, respectively.

The discrepancy between the performances of the two solutions intensifies with increasing filter length. Next, we will demonstrate the error-whitening property of the EWC solution. From equation (2.1) we can expect that the error autocorrelation function will vanish at lags greater than or equal to the length of the reference filter if the weight vector is identical to the true weight vector. For any other value of the weight vector, the error autocorrelation fluctuates at nonzero values. A 4-tap reference filter is identified with a 4-tap adaptive filter using noisy training data (hypothetical) at an SNR level of 0 dB. The autocorrelation functions of the error signals corresponding to the MSE solution and the EWC solution are shown in Figure 2-7. Clearly, the EWC criterion determines a solution that forces the error autocorrelation function to zero at lags greater than or equal to the filter length (partial whitening of the error).

Figure 2-7. Error autocorrelation function for MSE (dotted) and EWC (solid) solutions.

Finally, we will address the order-selection capability and demonstrate how the AEC (specifically EWC) can be used as a tool for determining the correct filter order, even with noisy data, provided that the given input-desired output pair is a moving average process.
For this purpose, we determine the theoretical Wiener and EWC (with β = -1/2 and L = m, where m is the length of the adaptive filter) solutions for a randomly selected pair of coloring filter h and mapping filter w at different adaptive filter lengths. The noise level is selected to be 20 dB, and the length of the true mapping filter is 5. We know from our theoretical analysis that if the adaptive filter is longer than the reference filter, the EWC will yield the true weight vector padded with zeros. This will not change the MSE of the solution. Thus, if we plot the MSE of the EWC solution versus the length of the adaptive filter, starting from the length of the actual filter, the MSE curve will remain flat, whereas the Wiener solution will keep decreasing the MSE, contaminating the solution by learning the noise in the data. Figure 2-8(a) shows the MSE obtained with the Wiener solution as well as the EWC solution for different lengths of the adaptive filter using the same training data described above. Notice (in the zoomed-in portion) that the MSE with EWC remains constant starting from 5, which is the filter order that generated the data. On the other hand, if we were to decide on the filter order by looking at the MSE of the Wiener solution, we would select a model order of 4, since from this point on the gain in MSE is insignificantly small compared to the previous steps. Figure 2-8(b) shows the norm of the weight vector error for the solutions obtained using the EWC and MSE criteria, which confirms that the true weight vector is indeed attained with the EWC criterion once the proper model order is reached.

This section aimed at experimentally demonstrating the theoretical concepts set forth in the preceding sections of the chapter. We have demonstrated with numerous Monte Carlo simulations that the analytical solution of the EWC criterion eliminates the effect of noise completely if the proper value is used for β.
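The order-selection procedure just described can be sketched with analytic correlations (the coloring filter, the true 5-tap mapping filter, and the noise power below are illustrative, not the values used in the experiment). The MSE of the EWC solution flattens once the adaptive filter length reaches the true order:

```python
import numpy as np

h = np.array([1.0, -0.8, 0.6, -0.4, 0.3, 0.2, -0.15, 0.1, -0.05])  # coloring (hypothetical)
w0 = np.array([0.4, -0.9, 0.3, 0.6, -0.5])    # true 5-tap mapping filter
sig_v2 = 0.25                                  # input white-noise power

def rho(k):                                    # clean-input autocorrelation
    k = abs(k)
    return float(h[: len(h) - k] @ h[k:]) if k < len(h) else 0.0

m0 = len(w0)
sig_d2 = sum(w0[i] * w0[j] * rho(i - j) for i in range(m0) for j in range(m0))

mse_ewc = {}
for m in range(2, 9):                          # candidate adaptive filter lengths
    L = m                                      # lag equal to the filter length
    R  = np.array([[rho(j - i) for j in range(m)] for i in range(m)])
    RL = np.array([[rho(L + j - i) + rho(L + i - j) for j in range(m)] for i in range(m)])
    P  = np.array([sum(w0[j] * rho(j - i) for j in range(m0)) for i in range(m)])
    PL = np.array([sum(w0[j] * (rho(L + j - i) + rho(L + i - j)) for j in range(m0))
                   for i in range(m)])
    w_ewc = np.linalg.solve(RL, PL)            # EWC solution, beta = -1/2
    # MSE measured on the noisy input: E[d^2] - 2 w'P + w'(R + sig_v2 I)w
    mse_ewc[m] = sig_d2 - 2 * w_ewc @ P + w_ewc @ (R + sig_v2 * np.eye(m)) @ w_ewc
    print(f"m = {m}: MSE(EWC) = {mse_ewc[m]:.4f}")
```

From m = 5 onward the printed MSE is constant and equal to sig_v2 * ||w0||^2, because the EWC solution is the true filter padded with zeros; below the true order the MSE is length dependent.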
We have also demonstrated that the batch solution of EWC (estimated from a finite number of samples) outperforms MSE in the presence of noise, provided that a sufficient number of samples is given so that the noise autocorrelation matrices diminish as required by the theory.

Summary

In this chapter, we derived the augmented error criterion (AEC) and discussed a special case of AEC called the error whitening criterion (EWC). The proposed AEC includes MSE as a special case. We discussed some of the interesting properties of the AEC cost function and worked out the analytical optimal solution. Further, we discussed the reasoning behind naming the special case of AEC with the parameter β = -0.5 as EWC. The intuitive reasoning is that this criterion partially whitens the error signal even in the presence of noise, which cannot be achieved by the MSE criterion. Thus the error whitening criterion is very useful for estimating the parameters of a linear unknown system in the presence of additive white noise. AEC with other values of β can be used as a constrained MSE criterion where the constraint is the smoothness of the error signal. Most of the material presented in this chapter can be found in [121].

Although we have presented a complete theoretical investigation of the proposed criterion and its analytical solution, in practice, online algorithms that operate on a sample-by-sample basis to determine the desired solution are equally valuable. Therefore, in the following chapters, we will focus on designing computationally efficient online algorithms that solve for the optimal AEC solution in a fashion similar to the well-known RLS and LMS algorithms. In fact, we aim to come up with algorithms that have the same computational complexity as these two widely used algorithms.

CHAPTER 3
FAST RECURSIVE NEWTON TYPE ALGORITHMS FOR AEC

Introduction

In Chapter 2, we derived the analytical solution for AEC. We also showed simulation results using block methods.
In this chapter, the focus will be on deriving online, sample-by-sample Newton type algorithms to estimate the optimal AEC solution. First, we will derive a Newton type algorithm that has a structure similar to the well-known RLS algorithm, which estimates the optimal Wiener solution for the MSE criterion. The complexity of the proposed algorithm is O(N²), which is comparable with that of the RLS algorithm. Then, we will propose another Newton type algorithm derived from the principles of TLS using minor components analysis. This algorithm in its current form estimates the optimal EWC solution, which is a special case of AEC with β = -0.5.

Derivation of the Newton Type Recursive Error Whitening Algorithm

Given the estimate of the filter tap weights at time instant (n-1), the goal is to determine the best set of tap weights at the next iteration n that would track the optimal solution. We call this the Recursive Error Whitening (REW) algorithm, although the error whitening property applies only when the parameter β is set to -0.5; the algorithm itself can be applied with any value of β. Recall that the RLS algorithm belongs to the class of fixed-point algorithms in the sense that it tracks the optimal Wiener solution at every time step. The REW algorithm falls in the same category: it tracks the optimal AEC solution at every iteration. The noteworthy feature of fixed-point algorithms is their exponential convergence rate, as they utilize higher-order information like the curvature of the performance surface. Although the complexity of fixed-point Newton type algorithms is higher when compared with conventional gradient methods, the superior convergence and robustness to the eigenspread of the data can be vital gains in many applications. For convenience, we will drop the tilde convention that we used in the previous chapter to differentiate between noise-corrupted and noise-free matrices and vectors.
Recall that the optimal AEC solution is given by

w* = (R + βS)^{-1}(P + βQ)    (3.1)

Letting T(n) = R(n) + βS(n) and V(n) = P(n) + βQ(n), we obtain the following recursion:

T(n) = T(n-1) + (1 + 2β)x(n)x^T(n) - βx(n-L)x^T(n) - βx(n)x^T(n-L)
     = T(n-1) + (2βx(n) - βx(n-L))x^T(n) + x(n)(x(n) - βx(n-L))^T    (3.2)

Equation (3.2) tells us that the matrix T(n) can be obtained recursively using a rank-2 update. In comparison (see Chapter 1), the RLS algorithm utilizes a rank-1 update for the covariance matrix. At this point, we invoke the matrix inversion lemma (Sherman-Morrison-Woodbury identity) [7,8], given by

(A + BCD^T)^{-1} = A^{-1} - A^{-1}B(C^{-1} + D^T A^{-1}B)^{-1}D^T A^{-1}    (3.3)

Substituting A = T(n-1), B = [(2βx(n) - βx(n-L))  x(n)], C = I_{2x2} (the 2x2 identity matrix), and D = [x(n)  (x(n) - βx(n-L))], equation (3.2) takes the same form as the LHS of (3.3). Therefore, the recursion for the inverse of T(n) becomes

T^{-1}(n) = T^{-1}(n-1) - T^{-1}(n-1)B(I_{2x2} + D^T T^{-1}(n-1)B)^{-1}D^T T^{-1}(n-1)    (3.4)

(Notice that the matrix inversion lemma simplifies the computation of the matrix inverse only when the original matrix can be written using reduced-rank updates.) Note that the computation of the above inverse differs from the conventional RLS algorithm: it requires the inversion of a 2x2 matrix (I_{2x2} + D^T T^{-1}(n-1)B), owing to the rank-2 update of T(n). The recursive estimator for V(n) is a simple correlation estimator given by

V(n) = V(n-1) + (1 + 2β)d(n)x(n) - βd(n)x(n-L) - βd(n-L)x(n)    (3.5)

Using T^{-1}(n) and V(n), an estimate of the filter weight vector at iteration index n is

w(n) = T^{-1}(n)V(n)    (3.6)

We will define a gain matrix, analogous to the gain vector in the RLS case [14], as

K(n) = T^{-1}(n-1)B(I_{2x2} + D^T T^{-1}(n-1)B)^{-1}    (3.7)

Using the above definition, the recursive estimate for the inverse of T(n) becomes

T^{-1}(n) = T^{-1}(n-1) - K(n)D^T T^{-1}(n-1)    (3.8)

Once again, the above equation is analogous to the Riccati equation for the RLS algorithm.
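As a quick sanity check of (3.2)-(3.4), the rank-2 factorization and the corresponding inverse update can be verified on random data (the sizes and vectors below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 6, -0.5
M = rng.standard_normal((N, N))
T_prev = M @ M.T + N * np.eye(N)              # a well-conditioned T(n-1)
x, xL = rng.standard_normal(N), rng.standard_normal(N)

# Rank-2 factorization of the update in (3.2): T(n) = T(n-1) + B D^T.
B = np.column_stack([2 * beta * x - beta * xL, x])
D = np.column_stack([x, x - beta * xL])
T_new = (T_prev + (1 + 2 * beta) * np.outer(x, x)
         - beta * np.outer(xL, x) - beta * np.outer(x, xL))
assert np.allclose(T_new, T_prev + B @ D.T)   # factorization is exact

# Inverse update (3.4)/(3.8) via the matrix inversion lemma vs. a direct inverse.
Tinv = np.linalg.inv(T_prev)
K = Tinv @ B @ np.linalg.inv(np.eye(2) + D.T @ Tinv @ B)   # gain (3.7)
Tinv_new = Tinv - K @ D.T @ Tinv
assert np.allclose(Tinv_new, np.linalg.inv(T_new))
print("rank-2 update and inverse recursion verified")
```

Only a 2x2 matrix is inverted per step, which is what keeps the recursion at O(N²).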
Multiplying (3.7) from the right by (I_{2x2} + D^T T^{-1}(n-1)B), we obtain

K(n)(I_{2x2} + D^T T^{-1}(n-1)B) = T^{-1}(n-1)B
K(n) = T^{-1}(n-1)B - K(n)D^T T^{-1}(n-1)B    (3.9)
K(n) = T^{-1}(n)B

In order to derive an update equation for the filter weights, we substitute the recursive estimate for V(n) in (3.6):

w(n) = T^{-1}(n)V(n-1) + T^{-1}(n)[(1 + 2β)d(n)x(n) - βd(n)x(n-L) - βd(n-L)x(n)]    (3.10)

Using (3.8) and recognizing that w(n-1) = T^{-1}(n-1)V(n-1), the above equation reduces to

w(n) = w(n-1) - K(n)D^T w(n-1) + T^{-1}(n)[(1 + 2β)d(n)x(n) - βd(n)x(n-L) - βd(n-L)x(n)]    (3.11)

Using the definition B = [(2βx(n) - βx(n-L))  x(n)], we can easily see that

(1 + 2β)d(n)x(n) - βd(n)x(n-L) - βd(n-L)x(n) = B [d(n)  d(n) - βd(n-L)]^T    (3.12)

From (3.9) and (3.12), the weight update equation simplifies to

w(n) = w(n-1) - K(n)D^T w(n-1) + K(n)[d(n)  d(n) - βd(n-L)]^T    (3.13)

Note that the product D^T w(n-1) is nothing but the vector of outputs [y(n)  y(n) - βy(n-L)]^T, where y(n) = x^T(n)w(n-1) and y(n-L) = x^T(n-L)w(n-1). The apriori error vector is defined as

ε(n) = [d(n) - y(n)  (d(n) - y(n)) - β(d(n-L) - y(n-L))]^T = [e(n)  e(n) - βe(n-L)]^T    (3.14)

Using all the above definitions, we formally state the weight update equation for the REW algorithm as

w(n) = w(n-1) + K(n)ε(n)    (3.15)

The overall complexity of (3.15) is O(N²), which is comparable to the complexity of the RLS algorithm (this was achieved by using the matrix inversion lemma). Unlike stochastic gradient algorithms, which are easily affected by the eigenspread of the input data and by the type of the stationary point solution (minimum, maximum or saddle), the REW algorithm is immune to these problems, because it inherently makes use of more information about the performance surface by computing the inverse of the Hessian matrix R + βS. A summary of the REW algorithm is given below in Table 3-1.

Table 3-1. Outline of the REW algorithm.
Initialize T^{-1}(0) = cI, where c is a large positive constant, and w(0) = 0. At every iteration, compute

B = [(2βx(n) - βx(n-L))  x(n)] and D = [x(n)  (x(n) - βx(n-L))]
K(n) = T^{-1}(n-1)B(I_{2x2} + D^T T^{-1}(n-1)B)^{-1}
y(n) = x^T(n)w(n-1) and y(n-L) = x^T(n-L)w(n-1)
ε(n) = [d(n) - y(n)  (d(n) - y(n)) - β(d(n-L) - y(n-L))]^T
w(n) = w(n-1) + K(n)ε(n)
T^{-1}(n) = T^{-1}(n-1) - K(n)D^T T^{-1}(n-1)

The above derivation assumes stationary signals. For nonstationary signals, a forgetting factor is required for tracking; including this factor in the derivation is trivial and is left out in this chapter. Also, note that the REW algorithm can be applied for any value of β. When β = -0.5, AEC reduces to EWC, and hence the REW algorithm can be used for estimating the parameters in the presence of input white noise.

Extension of the REW Algorithm for Multiple Lags

In Chapter 2, we briefly mentioned that the AEC can be extended by including multiple lags in the cost function. The extended AEC is given by

J(w) = E[e²(n)] + β Σ_{L=1}^{Lmax} E[(e(n) - e(n-L))²]    (3.16)

where Lmax denotes the maximum number of lags utilized in the AEC cost function. It is not mandatory to use the same constant β for all the error lag terms; however, for the sake of simplicity, we assume a single β value. The gradient of (3.16) with respect to the weight vector w is

∂J(w)/∂w = -2E[e(n)x(n)] - 2β Σ_{L=1}^{Lmax} E[(e(n) - e(n-L))(x(n) - x(n-L))]    (3.17)

Recall the following matrix definitions (restated here for clarity):

R = E[x(n)x^T(n)]
R_L = E[x(n)x^T(n-L) + x(n-L)x^T(n)]
S_L = E[(x(n) - x(n-L))(x(n) - x(n-L))^T] = 2R - R_L    (3.18)
P = E[x(n)d(n)]
P_L = E[x(n)d(n-L) + x(n-L)d(n)]
Q_L = E[(x(n) - x(n-L))(d(n) - d(n-L))] = 2P - P_L

Using the above definitions in (3.17) and equating the gradient to zero, we get the optimal extended AEC solution:

w* = (R + β Σ_{L=1}^{Lmax} S_L)^{-1} (P + β Σ_{L=1}^{Lmax} Q_L)    (3.19)

At first glance, the computational complexity of (3.19) seems to be O(N³).
But the symmetric structure of the matrices involved can be exploited to lower the complexity. Once again, we resort to the matrix inversion lemma and deduce a lower, O(N²) complexity algorithm. The optimal extended AEC solution at any time instant n is

w(n) = T^{-1}(n)V(n)    (3.20)

where T(n) = R(n) + β Σ_{L=1}^{Lmax} S_L(n) and V(n) = P(n) + β Σ_{L=1}^{Lmax} Q_L(n) as before. The estimator for the vector V(n) is a simple recursive correlator:

V(n) = V(n-1) + (1 + 2βLmax)d(n)x(n) - β Σ_{L=1}^{Lmax} [d(n)x(n-L) + d(n-L)x(n)]    (3.21)

The matrix T(n) can be estimated recursively as

T(n) = T(n-1) + (1 + 2βLmax)x(n)x^T(n) - β Σ_{L=1}^{Lmax} [x(n-L)x^T(n) + x(n)x^T(n-L)]
     = T(n-1) + (2βLmax x(n) - β Σ_{L=1}^{Lmax} x(n-L))x^T(n) + x(n)(x(n) - β Σ_{L=1}^{Lmax} x(n-L))^T

Now, the matrices A, B, C and D used in the inversion lemma in equation (3.3) are defined as

A = T(n-1)
B = [(2βLmax x(n) - β Σ_{L=1}^{Lmax} x(n-L))  x(n)]    (3.22)
C = I_{2x2}
D = [x(n)  (x(n) - β Σ_{L=1}^{Lmax} x(n-L))]

The only differences from the previous definitions lie in the expressions for the B and D matrices, which now require an inner loop running up to Lmax. The rest of the procedure remains the same as before. Once again, by proper application of the matrix inversion lemma, we were able to reduce the complexity of the matrix inversion to O(N²) by recursively computing the inverse in a way that only requires the inversion of a simple 2x2 matrix. This measure of complexity does not include the computations involved in building the B and D matrices. However, typically, the maximum number of lags will be smaller than the length of the adaptive filter, so the additional overhead incurred in the estimation of the B and D matrices will not result in a significant change in the overall complexity.
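A compact NumPy sketch of the single-lag REW recursion of Table 3-1 follows. The identification setup (coloring filter, true weights, noise level, sample count) is illustrative; with clean data the recursion reproduces the true filter essentially exactly, and with input noise it remains close to it:

```python
import numpy as np

def rew(x, d, m, L, beta=-0.5, c=1e4):
    """Single-lag REW recursion of Table 3-1 (a sketch)."""
    Tinv = c * np.eye(m)                       # T^{-1}(0) = cI
    w = np.zeros(m)
    xv = lambda n: x[n - m + 1: n + 1][::-1]   # input vector x(n)
    for n in range(m - 1 + L, len(x)):
        xn, xl = xv(n), xv(n - L)
        B = np.column_stack([2 * beta * xn - beta * xl, xn])
        D = np.column_stack([xn, xn - beta * xl])
        K = Tinv @ B @ np.linalg.inv(np.eye(2) + D.T @ Tinv @ B)          # gain (3.7)
        y, yl = xn @ w, xl @ w                                            # a priori outputs
        eps = np.array([d[n] - y, (d[n] - y) - beta * (d[n - L] - yl)])   # error (3.14)
        w = w + K @ eps                                                   # update (3.15)
        Tinv = Tinv - K @ D.T @ Tinv                                      # inverse (3.8)
    return w

rng = np.random.default_rng(2)
N, m = 30_000, 4
xc = np.convolve(rng.standard_normal(N), [1.0, -0.8, 0.6, -0.4, 0.3])[:N]
w_true = np.array([0.5, -1.0, 0.25, 0.75])
d = np.zeros(N)
for n in range(m - 1, N):                      # clean desired: d(n) = w_true^T x(n)
    d[n] = xc[n - m + 1: n + 1][::-1] @ w_true
x_noisy = xc + 1.0 * rng.standard_normal(N)    # white noise on the input only

w_clean = rew(xc, d, m, L=m)
w_noisy = rew(x_noisy, d, m, L=m)
print("clean-input error:", np.linalg.norm(w_clean - w_true))
print("noisy-input error:", np.linalg.norm(w_noisy - w_true))
```

The choice L = m follows the text's convention of setting the lag equal to the filter length so that the input-noise autocorrelation terms vanish.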
Relationship to the Recursive Instrumental Variables Method

The previously derived REW algorithm for the single-lag case has a structure similar to the Instrumental Variables (IV) method. The IV method has its origins in statistics and was apparently proposed by Reiersøl [122]. Over a period of time, it has been adapted to model dynamical systems in control engineering. Much of the work on applications of IV to control engineering problems has been done by Wong and Polak [123] and by Young [124-126]. Recent advances in IV methods for system identification and control have been mainly due to Söderström and Stoica [32,93]. It is beyond the scope of this dissertation to summarize the applications and impact of IV in various engineering problems; for more details, refer to [32]. Basically, IV can be viewed as an extension of standard Least Squares regression and can be used to estimate the parameters in white noise once the model order is known. The fundamental principle is to choose delayed regression vectors, known as instruments, that are uncorrelated with the additive white noise. IV can also be extended to handle colored noise situations; this will be handled exclusively in Chapter 5. For now, the discussion will be strictly limited to the white noise scenario. Mathematically speaking, the IV method computes the solution

w_IV = E[x(n)x^T(n-Δ)]^{-1} E[x(n)d(n-Δ)]    (3.23)

where the lag Δ is chosen such that the outer product of the regression vector x(n) with the lagged regression vector x(n-Δ) results in a matrix that is independent of the additive white noise components v(n). In comparison, the REW solution for β = -1/2 is given by w* = R_L^{-1}P_L. Notice that in the REW solution, the matrix R_L is symmetric and Toeplitz [8], which is very much desirable, and we exploit this fact to derive an elegant minor-components based algorithm in the next section.
Thus, in effect, the IV method can be considered a special case of the REW algorithm obtained by removing the symmetric terms in R_L and P_L. We will compare the performances of the REW and IV methods later in this chapter.

Recursive EWC Algorithm Based on Minor Components Analysis

Until now, we focused on a Newton type algorithm to compute the optimal AEC solution. Although the algorithm is fast converging, its convergence can be sensitive to ill-conditioning of the Hessian matrix R(n) + βS(n), which can happen in the first few iterations. Alternatively, we can explore the idea of using minor components analysis (MCA) to derive a recursive algorithm similar to the TLS algorithm for MSE. We call this the EWC-TLS algorithm. As the name suggests, this algorithm can be used only for the case β = -0.5, which reduces the augmented error criterion to the error whitening criterion. Recall that the TLS problem, in general, solves an overdetermined set of linear equations of the form Ax ≈ b, where A ∈ R^{m×n} is the data matrix, b ∈ R^m is the desired vector, x ∈ R^n is the parameter vector, and m denotes the number of different observation vectors, each of dimension n [41]. Alternatively, the linear equations can be written in the form [A b][x^T  -1]^T = 0, where [A b] denotes an augmented data matrix. When [A b] is a symmetric square matrix, it can be shown that the TLS solution is simply given by

[x^T  -1]^T = v_{n+1}/v_{n+1,n+1}    (3.24)

where v_{n+1,n+1} is the last element of the minor eigenvector v_{n+1}. In the case of EWC, it is easy to show that the augmented data matrix [127,128] (analogous to [A b]) is

G = [R_L  P_L; P_L^T  ρ_d(L)]    (3.25)

where ρ_d(L) denotes the autocorrelation of the desired signal at lag L, taken in the same symmetrized convention as R_L and P_L. It is important to note that the matrix in (3.25) is square and symmetric, owing to the symmetry of R_L; hence the eigenvectors of G are all real, which is highly desirable.
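The structure of (3.24)-(3.25) can be verified with the analytic correlations of a toy problem. Assuming the symmetrized convention for ρ_d(L) stated above, the vector [w_true^T  -1]^T lies in the null space of G (the filters below are illustrative):

```python
import numpy as np

h = np.array([1.0, -0.8, 0.6, -0.4, 0.3])     # illustrative coloring filter
m = L = 4
w_true = np.array([0.5, -1.0, 0.25, 0.75])

def rho(k):
    """Autocorrelation of the clean colored input at lag k."""
    k = abs(k)
    return float(h[: len(h) - k] @ h[k:]) if k < len(h) else 0.0

C  = np.array([[rho(L + j - i) for j in range(m)] for i in range(m)])  # E[x(n)x^T(n-L)]
RL = C + C.T                                   # symmetric R_L
PL = RL @ w_true                               # generalized Wiener-Hopf relation
rd = 2.0 * float(w_true @ C @ w_true)          # symmetrized rho_d(L) (assumed convention)
G  = np.block([[RL, PL[:, None]], [PL[None, :], np.array([[rd]])]])

residual = G @ np.concatenate([w_true, [-1.0]])
print("||G [w; -1]||:", np.linalg.norm(residual))   # ~0: the minor-component direction
```

This null direction is exactly what the minor-component extraction of (3.24) recovers, and since L >= m, none of the entries of G are perturbed by the white noise terms.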
Also, it is important to note that (3.25) still holds even with noisy data, as the entries of G are unaffected by the noise terms. In the infinite-sample case, the matrix G is not full rank, and we can immediately see that one of the eigenvalues of (3.25) is zero. In the finite-sample case, the goal is to find the eigenvector corresponding to the minimum absolute eigenvalue (finite samples also imply that G^{-1} exists). Since the eigenvalues of G can be both positive and negative, typical iterative gradient and even some fixed-point type algorithms tend to become unstable. A workaround would be to use the matrix G² instead of G. This obviates the problem of mixed eigenvalues while preserving the eigenvectors. However, the squaring operation is useful only if the eigenvalues of G are well separated; otherwise, the smaller eigenvalues blend together, making the estimation of the minor component of G² more difficult. Also, the squaring operation creates additional overhead, thereby negating any computational benefits offered by the fixed-point type PCA solutions discussed in Appendix A. We therefore propose to use the inverse iteration method for estimating the minor eigenvector of G [8]. If w(n) denotes the estimate of the minor eigenvector corresponding to the smallest absolute eigenvalue at time instant n, then the estimate at the (n+1)th instant is given by

w̃(n+1) = G^{-1}(n+1)w(n)
w(n+1) = w̃(n+1)/‖w̃(n+1)‖    (3.26)

The term G(n+1) denotes the estimate of the augmented data matrix G of equation (3.25) at the (n+1)th instant. It is easy to see that G(n) can be recursively estimated as G(n) = G(n-1) + γ(n)γ^T(n-L) + γ(n-L)γ^T(n), where γ(n) = [x(n); d(n)] is the concatenated vector of the input and desired response. Now, we can invoke the inversion lemma as before and obtain a recursive O(n²) estimate for the matrix inversion in (3.26). The details of this derivation are trivial and omitted here.
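The inverse-iteration step in (3.26) is easy to check on a synthetic symmetric matrix with eigenvalues of mixed sign (the constructed spectrum below is arbitrary, standing in for a finite-sample G). The iteration converges to the eigenvector of the smallest absolute eigenvalue without any squaring of G:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
d = np.array([3.0, -2.0, 1.5, -0.7, 0.2])      # mixed-sign spectrum; |0.2| smallest
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = Q @ np.diag(d) @ Q.T                        # symmetric stand-in for (3.25)

w = rng.standard_normal(n)                      # random initial estimate
for _ in range(50):
    w = np.linalg.solve(G, w)                   # w <- G^{-1} w
    w /= np.linalg.norm(w)                      # normalize, as in (3.26)

print("alignment with minor eigenvector:", abs(w @ Q[:, 4]))  # ~1.0
```

Each pass amplifies the component along the eigenvalue of largest 1/|λ|, i.e., the minor component, which is why no eigendecomposition of G is ever needed online.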
Once the minor component estimate converges, i.e., w(n) → v_{n+1}, the EWC-TLS solution is simply given by equation (3.24). Thus, the overall complexity of the EWC-TLS algorithm is still O(n²), the same as that of the REW algorithm. However, we have observed through simulations that the EWC-TLS method converges faster than REW while preserving the accuracy of the parameter estimates.

Experimental Results

We will now show simulation results with the Newton type algorithms for AEC. Specifically, our objective is to show the superior performance of the proposed criterion and the associated algorithms in the problem of system identification with noisy input data.

Estimation of System Parameters in White Noise Using REW

The REW algorithm can be used effectively to solve the system identification problem in noisy environments. As we have seen before, setting the value β = -0.5 provides noise immunity for parameter estimation. We generated a purely white Gaussian random noise sequence of 50,000 samples and added it to a colored input signal; this white noise is uncorrelated with the input signal. The noise-free, colored input signal was filtered by the unknown reference filter, and this formed the desired signal for the adaptive filter. Since noise in the desired signal would be averaged out by both the RLS and REW algorithms, we decided to use the clean desired signal itself; this brings out only the effects of input noise on the filter estimates. Also, the noise added to the clean input is uncorrelated with the desired signal. In the experiment, we varied the Signal-to-Noise Ratio (SNR) in the range -10 dB to +10 dB. The number of desired filter coefficients was also varied from 4 to 12.
We then performed 100 Monte Carlo runs and computed the normalized error vector norm given by

error = 20 log10( ||w_T - w_e|| / ||w_T|| )        (3.27)

where w_e is the weight vector estimated by the REW algorithm with β = -0.5 after 50,000 iterations (one complete presentation of the input data) and w_T is the true weight vector. In order to show the effectiveness of the REW algorithm, we performed Monte Carlo runs using the RLS algorithm on the same data to estimate the filter coefficients. Further, we also evaluated the analytical TLS solution for each case. Figure 3-1 shows a histogram plot of the normalized error vector norm given in (3.27) for all three methods. It is clear that the REW algorithm was able to perform better than RLS at various SNR and tap length settings. In the high SNR cases, there is not much of a difference between the RLS and REW results. However, under noisy circumstances, the reduction in the parameter estimation error with REW is orders of magnitude larger when compared with RLS. Also, the RLS algorithm results in a rather useless zero weight vector, i.e., w = 0, when the SNR drops to -10dB. On the other hand, TLS performs well only in the cases when the noise variances in the input and desired signals are the same. This is in conformance with the well-known theoretical limitations of the TLS algorithm.

Figure 3-1. Histogram plots showing the error vector norm for the REW, RLS algorithms and the numerical TLS solution.

Figure 3-2. Performance of REW algorithm (a) SNR = 0dB and (b) SNR = -10dB over various beta values.
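The performance metric (3.27) is a one-line computation; a small sketch (our helper name):

```python
import numpy as np

def error_norm_db(w_true, w_est):
    """Normalized error vector norm of eq. (3.27), in dB."""
    return 20 * np.log10(np.linalg.norm(w_true - w_est)
                         / np.linalg.norm(w_true))

w_true = np.array([1.0, 0.0])
w_est = np.array([1.1, 0.0])          # 10% error on one coefficient
print(error_norm_db(w_true, w_est))   # -20 dB
```

More negative values indicate better estimates; a perfect estimate drives the metric to minus infinity, which is why histograms rather than raw averages are reported.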
this clearly gives us the minimum estimation error. This corresponds to the EWC solution. For β = 0 (indicated by an "o" in the figure), the REW algorithm reduces to the regular RLS, giving a fairly significant estimation error. Next, the parameter β is set to -0.5 and the SNR to 0dB, and the weight tracks are estimated for the REW and RLS algorithms. Figure 3-3 shows the weight tracks for both the REW and RLS algorithms averaged over 50 Monte Carlo trials. Asterisks on the plots indicate the true parameters. The tracks for the RLS algorithm are smoother, but they converge to wrong values, which we have observed quite consistently. The weight tracks for the REW algorithm are noisier than those of RLS, but they eventually converge to values very close to the true weights. This brings us to an important pair of estimator properties, viz., bias and variance. The RLS algorithm has a reduced variance because of the positive-definiteness of the covariance matrix R(n). However, the RLS solution remains asymptotically biased in the presence of noisy input.

Figure 3-3. Weight tracks for REW and RLS algorithms (asterisks denote the true weights).

On the other hand, the REW algorithm produces zero bias, but the variance can be high owing to the conditioning of the Hessian matrix. However, this variance diminishes with an increasing number of samples. The noisy initial weight tracks of the REW algorithm may be attributed to ill-conditioning, mainly caused by the smallest eigenvalue of the estimated Hessian matrix R(n) + βS(n). The same holds true for the RLS algorithm, where the minimum eigenvalue of R(n) affects the sensitivity [14].
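The bias contrast can be seen in closed form. With white input noise of variance σ², the noisy correlation matrices become R + σ²I and S + 2σ²I (for lag L ≥ m), while the cross-correlations P and Q are unchanged; the Wiener/RLS solution is therefore shrunk, whereas the β = -0.5 combination cancels the noise terms exactly. A numerical sketch with assumed population matrices (AR(1)-like values chosen for illustration, not taken from the dissertation):

```python
import numpy as np

w_true = np.array([0.2, -0.5])
rho = 0.5                                    # assumed AR(1) correlation
r = lambda k: rho ** abs(k) / (1 - rho**2)   # autocorrelation r(k)
R = np.array([[r(0), r(1)], [r(1), r(0)]])
L = 2
R_L = np.array([[r(L), (r(L - 1) + r(L + 1)) / 2],
                [(r(L - 1) + r(L + 1)) / 2, r(L)]])  # symmetrized lag-L part
S = 2 * (R - R_L)                            # S = E[x~ x~^T] for this process
P, Q = R @ w_true, S @ w_true                # clean cross-correlations
sigma2 = 1.0                                 # input-noise variance

Rn, Sn = R + sigma2 * np.eye(2), S + 2 * sigma2 * np.eye(2)  # noisy statistics
w_rls = np.linalg.solve(Rn, P)               # asymptotic RLS: biased (shrunk)
w_ewc = np.linalg.solve(Rn - 0.5 * Sn, P - 0.5 * Q)  # beta = -0.5: unbiased
```

Because Rn - 0.5 Sn = R - 0.5 S, the noise contribution drops out of the normal equations and w_ewc recovers w_true exactly, while w_rls is pulled toward zero.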
The instability issues of the RLS algorithm during the initial stages of adaptation have been well studied in the literature; the effects of round-off error have been analyzed, and many solutions have been proposed to make the RLS algorithm robust to such effects [129]. A similar analysis of the REW algorithm is yet to be done and will be addressed in future work on the topic.

Performance Comparisons between REW, EWC-TLS and IV Methods

In this example, we will contrast the performance of the REW, EWC-TLS and Instrumental Variables (IV) methods in a 4-tap system identification problem with noisy data. The input signal is colored and corrupted with white noise (input SNR was set at 5dB), and the desired signal SNR is 10dB. For the IV method, we chose the delayed input as the instrument.

Figure 3-4. Histogram plots showing the error vector norms, given by equation (3.27), for all the methods (SNR = 5dB, 4 taps).

Figure 3-4 shows the error histograms for the REW, EWC-TLS, IV and optimal Wiener solutions. The EWC-TLS and REW algorithms outperform the Wiener MSE solution. The IV method also produces better results than the Wiener solution. Amongst the EWC solutions, we obtained better results with the EWC-TLS algorithm (equations 3.24 and 3.26) than with REW. However, both EWC-TLS and REW outperformed the IV method. This may be partially attributed to the conditioning of the matrices involved in the estimation for the REW and IV methods. Further theoretical analysis is required to quantify the effects of conditioning and of the symmetric Toeplitz structure of R. In Figure 3-5, we show the angle between the estimated minor eigenvector and the true eigenvector of the augmented data matrix G for a random single trial in scenarios with and without noise. Notice that the rates of convergence are very different.
It is well known that the rate of convergence of the inverse iteration method is governed by the ratio |λ1/λ2|, where λ1 is the largest eigenvalue of G^-1 and λ2 is the second largest eigenvalue of G^-1 [8]. Faster convergence is seen in the noiseless case owing to the huge |λ1/λ2| ratio.

Figure 3-5. Convergence of the minor eigenvector of G with (a) noisy data and (b) clean data.

Summary

In this chapter, we derived recursive Newton type algorithms to estimate the optimal AEC solution. First, the Recursive Error Whitening (REW) algorithm was derived using the analytical AEC solution and the matrix inversion lemma. The well-known RLS algorithm for MSE becomes a special case of the REW algorithm. Further, a Total Least Squares based EWC algorithm called EWC-TLS was proposed. This algorithm works with β = -0.5 and can be easily applied to estimate parameters in the presence of white noise. A fixed-point minor components extraction algorithm was developed using the inverse iteration method. Other fixed-point or gradient-based methods cannot be used because of the indefiniteness of the matrix involved in the EWC-TLS formulation (a matrix with mixed eigenvalues makes those algorithms locally unstable). The computational complexity of the above-mentioned algorithms is O(N^2). We briefly explored an extension of the Newton type algorithm for the extended AEC with multiple lags. Effective usage of the matrix inversion lemma can keep the complexity of the extended REW algorithm at O(N^2). In the latter half of the chapter, we discussed the performance of the algorithms in the problem of system identification in the presence of additive white noise. The proposed recursive algorithms outperform the RLS and the analytical MSE and TLS solutions.
We also showed simulation results with the EWC-TLS algorithm and quantitatively compared its performance with the well-known IV method. Although the recursive EWC algorithms presented in this chapter are fast converging and sample efficient, the complexity of O(N^2) can be high for many applications involving low power designs. Additionally, the recursive algorithms can exhibit limited performance in non-stationary conditions if the forgetting factors are not chosen properly. This motivates us to explore stochastic gradient algorithms (and their variants) for estimating the optimal AEC solution. The next chapter describes these algorithms and also highlights other benefits of the stochastic algorithms over their Newton type counterparts.

CHAPTER 4
STOCHASTIC GRADIENT ALGORITHMS FOR AEC

Introduction

Stochastic gradient algorithms have been at the forefront in optimizing quadratic cost functions like the MSE. Owing to the presence of a global minimum in quadratic performance surfaces, gradient algorithms can elegantly accomplish the task of reaching the optimal solution at minimal computational cost. In this chapter, we will derive stochastic gradient algorithms for the AEC. Since the AEC performance surface is a weighted sum of quadratics, we can expect that difficulties will arise. However, we will show that, using some simple optimization tricks, we can overcome these difficulties in an elegant manner.

Derivation of the Stochastic Gradient AEC-LMS Algorithm

Assume that we have a noisy training data set of the form (x(n), d(n)), where x(n) ∈ R^m is the input and d(n) ∈ R is the output of a linear system with coefficient vector w_T. The goal is to estimate the parameter vector w_T using the augmented error criterion. We know that the AEC cost function is given by

J(w) = E[e^2(n)] + β E[ė^2(n)]        (4.1)

where ė(n) = e(n) - e(n-L), w is the estimate of the parameter vector, and L ≥ m, the size of the input vector.
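The cost (4.1) can be estimated from a finite record by averaging e^2(n) and ė^2(n). The sketch below (helper and variable names are ours) also checks numerically the property that motivates β = -0.5: white noise on the input and desired signal adds a fixed amount to E[e^2] and exactly twice that amount to E[ė^2] when L ≥ m, so the AEC cost at β = -0.5 is essentially unchanged by the noise:

```python
import numpy as np

def aec_cost(w, X, d, L, beta):
    """Sample estimate of J(w) = E[e^2(n)] + beta * E[(e(n) - e(n-L))^2]."""
    e = d - X @ w
    e_dot = e[L:] - e[:-L]
    return np.mean(e[L:] ** 2) + beta * np.mean(e_dot ** 2)

rng = np.random.default_rng(0)
m, L, N = 4, 4, 400_000
w_true = np.array([0.2, -0.5, 0.3, 0.1])
s = np.convolve(rng.standard_normal(N + m), [1.0, 0.5])[:N + m - 1]
X = np.lib.stride_tricks.sliding_window_view(s, m)[:, ::-1]   # rows: x(n)
d = X @ w_true
Xn = X + rng.standard_normal(X.shape)        # white input noise, variance 1
dn = d + 0.5 * rng.standard_normal(N)        # desired-signal noise

w = np.array([0.1, -0.3, 0.2, 0.0])          # arbitrary evaluation point
J_clean = aec_cost(w, X, d, L, -0.5)
J_noisy = aec_cost(w, Xn, dn, L, -0.5)       # close to J_clean despite noise
```

With β = 0 the same function returns the ordinary sample MSE, so the sketch covers both ends of the criterion.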
For convenience, we will restate the following definitions: x~(n) = x(n) - x(n-L), d~(n) = d(n) - d(n-L), R = E[x(n)x^T(n)], S = E[x~(n)x~^T(n)], P = E[x(n)d(n)] and Q = E[x~(n)d~(n)]. Using these definitions, we can rewrite the cost function in (4.1) as

J(w) = E[d^2(n)] + β E[d~^2(n)] + w^T (R + βS) w - 2 (P + βQ)^T w        (4.2)

It is easy to see that both E[e^2(n)] and E[ė^2(n)] have parabolic performance surfaces, as their Hessians have positive eigenvalues. However, the value of β can invert the performance surface of E[ė^2(n)]. For β > 0, the stationary point is always a global minimum, and the gradient of (4.2) can be written as the sum of the individual gradients, as shown below.

∂J(w)/∂w = 2(R + βS)w - 2(P + βQ) = 2(Rw - P) + 2β(Sw - Q)        (4.3)

The above gradient can be approximated by the stochastic instantaneous gradient by removing the expectation operators:

∂J(w(n))/∂w(n) ≈ -2[ e(n)x(n) + β ė(n)x~(n) ]        (4.4)

The goal is to minimize the cost function, and hence, using steepest descent, we can write the weight update for the stochastic AEC-LMS algorithm for β > 0 as

w(n+1) = w(n) + η(n)( e(n)x(n) + β ė(n)x~(n) )        (4.5)

where η(n) > 0 is a finite step-size parameter that controls convergence. For β < 0, the stationary point is still unique, but it can be a saddle point, a global maximum or a global minimum depending on the value of β. Evaluating the gradient as before and using the instantaneous gradient, we get the AEC-LMS algorithm for β < 0:

w(n+1) = w(n) + η(n)( e(n)x(n) - |β| ė(n)x~(n) )        (4.6)

where η(n) is again a small step-size. However, there is no guarantee that the above update rules will be stable for all choices of step-sizes. Although equations (4.5) and (4.6) are identical, we will use |β| in the update equation (4.6) to analyze the convergence of the algorithm specifically for β < 0. The reason for the separate analysis is that the convergence characteristics of (4.5) and (4.6) are very different.

Convergence Analysis of the AEC-LMS Algorithm

Theorem 4.1.
The stochastic AEC algorithms asymptotically converge in the mean to the optimal solution given by

w* = (R + βS)^-1 (P + βQ),   β > 0
w* = (R - |β|S)^-1 (P - |β|Q),   β < 0        (4.7)

We will make the following mild assumptions, typically applied to stochastic approximation algorithms [79-81,84], that can be easily satisfied.

1. The input vectors x(n) are derived from at least a wide sense stationary (WSS) colored random signal with positive definite autocorrelation matrix R = E[x(n)x^T(n)].
2. The matrix R_L = E[x(n)x^T(n-L) + x(n-L)x^T(n)] exists and has full rank.
3. The sequence of weight vectors w(n) is bounded with probability 1.
4. The update functions h(w(n)) = e(n)x(n) + β ė(n)x~(n) for β > 0 and h(w(n)) = e(n)x(n) - |β| ė(n)x~(n) for β < 0 exist and are continuously differentiable with respect to w(n), and their derivatives are bounded in time.
5. Even if h(w(n)) has some discontinuities, a mean update vector h-bar(w(n)) = lim E[h(w(n))] exists.

Assumption A.1 is easily satisfied. A.2 requires that the input signal have sufficient correlation with itself for at least L lags.

Proof of AEC-LMS Convergence for β > 0

We will first consider the update equation in (4.5), which is the stochastic AEC-LMS algorithm for β > 0. Without loss of generality, we will assume that the input vectors x(n) and their corresponding desired responses d(n) are noise-free. The mean update vector h-bar(w(n)) is given by

h-bar(w(n)) = E[ e(n)x(n) + β ė(n)x~(n) ] = -(Rw(n) - P) - β(Sw(n) - Q)        (4.8)

The stationary point of the ordinary differential equation (ODE) in (4.8) is given by

w* = (R + βS)^-1 (P + βQ)        (4.9)

We will define the error vector at time instant n as ε(n) = w* - w(n). Therefore

ε(n+1) = ε(n) - η(n)[ e(n)x(n) + β ė(n)x~(n) ]        (4.10)

and the norm of the error vector at time n+1 is simply
||ε(n+1)||^2 = ||ε(n)||^2 - 2η(n)[ ε^T(n)e(n)x(n) + β ε^T(n)ė(n)x~(n) ] + η^2(n)|| e(n)x(n) + β ė(n)x~(n) ||^2        (4.11)

Imposing the condition that ||ε(n+1)|| < ||ε(n)|| for all n, we get an upper bound on the time varying step-size parameter η(n), given by

η(n) < 2[ ε^T(n)e(n)x(n) + β ε^T(n)ė(n)x~(n) ] / || e(n)x(n) + β ė(n)x~(n) ||^2        (4.12)

Simplifying the above equation using the facts that ε^T(n)x(n) = e(n) and ε^T(n)x~(n) = ė(n) for noise-free data, we get

η(n) < 2[ e^2(n) + β ė^2(n) ] / || e(n)x(n) + β ė(n)x~(n) ||^2        (4.13)

which is a more practical upper bound on the step-size, as it can be directly estimated from the input and desired data. As an observation, notice that if β = 0, the bound in (4.13) reduces to

η(n) < 2 / ||x(n)||^2        (4.14)

which, when included in the update equation, reduces it to a variant of the Normalized LMS (NLMS) algorithm [14]. In general, if the step-size parameter is chosen according to the bound given by (4.13), then the norm of the error vector ε(n) is a monotonically decreasing sequence converging asymptotically to zero, i.e., lim ||ε(n)||^2 → 0, which implies that lim w(n) → w* (see Appendix G for details). In addition, the upper bound on the step-size ensures that the weights are always bounded with probability one, satisfying assumption A.3 made before. Thus the weight vector w(n) converges asymptotically to w*, which is the only stable stationary point of the ODE in (4.8). Note that (4.5) is an O(m) algorithm.

Proof of AEC-LMS Convergence for β < 0

We analyze the convergence of the stochastic gradient algorithm for β < 0 in the presence of white noise, because this is the relevant case (β = -0.5 eliminates the bias due to noise added to the input). From (4.6), the mean update vector h-bar(w(n)) is given by

h-bar(w(n)) = E[ e(n)x(n) - |β| ė(n)x~(n) ] = -(Rw(n) - P) + |β|(Sw(n) - Q)        (4.15)

As before, the stationary point of this ODE is

w* = (R - |β|S)^-1 (P - |β|Q)        (4.16)

The eigenvalues of R - |β|S decide the nature of the stationary point.
If they are all positive, we have a global minimum, and if they are all negative, we have a global maximum. In these two cases, the stochastic gradient algorithm in (4.6) with a proper fixed-sign step-size would converge to the stationary point, which would be stable. However, we know that the eigenvalues of R - |β|S can also take both positive and negative values, resulting in a saddle stationary point. The underlying dynamical system would then have both stable and unstable modes, making it impossible for the algorithm in (4.6) with a fixed-sign step-size to converge. This is well known in the literature [3,14]. However, as will be shown next, this difficulty can be removed in our case by appropriately utilizing the sign of the update equation (remember that this saddle point is the only stationary point of the quadratic performance surface). The general idea is to use a vector step-size (one step-size per weight) having both positive and negative values. One unrealistic way (for an online algorithm) to achieve this goal is to estimate the eigenvalues of R - |β|S. Alternatively, we can derive conditions on the step-size for guaranteed convergence. As before, we will define the error vector at time instant n as ε(n) = w* - w(n). The norm of the error vector at time instant n+1 is given by

||ε(n+1)||^2 = ||ε(n)||^2 - 2η(n)[ ε^T(n)e(n)x(n) - |β| ε^T(n)ė(n)x~(n) ] + η^2(n)|| e(n)x(n) - |β| ė(n)x~(n) ||^2        (4.17)

Taking expectations on both sides, we get

E||ε(n+1)||^2 = E||ε(n)||^2 - 2η(n) E[ ε^T(n)e(n)x(n) - |β| ε^T(n)ė(n)x~(n) ] + η^2(n) E|| e(n)x(n) - |β| ė(n)x~(n) ||^2        (4.18)

The mean of the error vector norm will monotonically decay to zero over time, i.e., E||ε(n+1)||^2 < E||ε(n)||^2, if and only if the step-size satisfies the following inequality.

η(n) < 2 E[ ε^T(n)e(n)x(n) - |β| ε^T(n)ė(n)x~(n) ] / E|| e(n)x(n) - |β| ė(n)x~(n) ||^2        (4.19)

Let the noisy data be x^(n) = x(n) + v(n) and d^(n) = d(n) + u(n), where x(n) and d(n) are the clean input and desired data respectively; in (4.17)-(4.19) and below, the errors and difference vectors are computed from the noisy data.
We will further assume that the input noise vector v(n) and the noise component u(n) in the desired signal are uncorrelated. Also, the noise signals are assumed to be independent of the clean input and desired signals. Furthermore, the lag L is chosen to be at least m, the length of the filter under consideration. Since the noise is assumed to be purely white, E[v(n)v^T(n-L)] = E[v(n-L)v^T(n)] = 0 and E[v(n)v^T(n)] = V. We have

ε^T(n)e(n)x^(n) = (w* - w(n))^T [ d(n) + u(n) - w^T(n)(x(n) + v(n)) ] (x(n) + v(n))        (4.20)

Simplifying this further and taking expectations, we get

E[ε^T(n)e(n)x^(n)] = var(d(n)) - 2P^T w(n) + w^T(n)Rw(n) + w^T(n)Vw(n) - w*^T Vw(n)
                   = J_MSE - w*^T Vw(n)        (4.21)

where R = E[x(n)x^T(n)], P = E[x(n)d(n)] and

J_MSE = w^T(n)(R + V)w(n) + var(d(n)) - 2P^T w(n)        (4.22)

Similarly, with v~(n) = v(n) - v(n-L), we have

ε^T(n)ė(n)(x~(n) + v~(n)) = (w* - w(n))^T [ d(n) + u(n) - w^T(n)(x(n) + v(n)) - d(n-L) - u(n-L) + w^T(n)(x(n-L) + v(n-L)) ] (x~(n) + v~(n))        (4.23)

Evaluating the expectations on both sides of (4.23) and simplifying, we obtain

E[ε^T(n)ė(n)(x~(n) + v~(n))] = var(d(n) - d(n-L)) - 2Q^T w(n) + w^T(n)Sw(n) + 2w^T(n)Vw(n) - 2w*^T Vw(n)
                             = J_D - 2w*^T Vw(n)        (4.24)

where we have used the definitions S = E[(x(n) - x(n-L))(x(n) - x(n-L))^T], Q = E[(x(n) - x(n-L))(d(n) - d(n-L))] and

J_D = w^T(n)(S + 2V)w(n) + var(d(n) - d(n-L)) - 2Q^T w(n)        (4.25)

Using (4.21) and (4.24) in equation (4.19), we get an expression for the upper bound on the step-size as

η(n) < 2[ J_MSE - |β| J_D - (1 - 2|β|) w*^T Vw(n) ] / E|| e(n)x^(n) - |β| ė(n)(x~(n) + v~(n)) ||^2        (4.26)

This expression is not usable in practice as an upper bound because it depends on the optimal weight vector. However, for β = -0.5, the upper bound on the step-size reduces to

|η(n)| < 2| J_MSE - 0.5 J_D | / E|| e(n)x^(n) - 0.5 ė(n)(x~(n) + v~(n)) ||^2        (4.27)

From (4.22) and (4.25), we know that J_MSE and J_D are positive quantities. However, J_MSE - 0.5 J_D can be negative.
Also note that this upper bound is computed by evaluating the right hand side of (4.27) with the current weight vector w(n). Thus, as expected, it is very clear that the step-size at the nth iteration can take either positive or negative values based on J_MSE - 0.5 J_D; therefore, sgn(η(n)) must be the same as sgn(J_MSE - 0.5 J_D) evaluated at w(n). Intuitively speaking, the term J_MSE - 0.5 J_D is the EWC cost computed with the current weights w(n) and β = -0.5, which tells us where we are on the performance surface, and its sign tells us which way to go to reach the stationary point. It also means that the lower bound on the step-size is not positive, as it is in traditional gradient algorithms. In general, if the step-size we choose satisfies (4.27), then the mean error vector norm decreases asymptotically, i.e., E||ε(n+1)||^2 < E||ε(n)||^2, and eventually becomes zero, which implies that lim E[w(n)] → w*. Thus the weight vector E[w(n)] converges asymptotically to w*, which is the only stationary point of the ODE in (4.15). We conclude that knowledge of the eigenvalues is not needed to implement gradient descent on the EWC performance surface, but (4.27) is still not appropriate for a simple LMS type algorithm.

Online Implementations of AEC-LMS for β < 0

As mentioned before, computing J_MSE - 0.5 J_D at the current weight vector would require reusing the entire past data at every iteration. As an alternative, we can extract the curvature at the operating point and include that information in the gradient algorithm. By doing so, we obtain the following stochastic algorithm.

w(n+1) = w(n) + η sgn( w^T(n)[R(n) - |β|S(n)]w(n) ) ( e(n)x(n) - |β| ė(n)x~(n) )        (4.28)

where R(n) and S(n) are the estimates of R and S, respectively, at the nth time instant.

Corollary: Given any quadratic surface J(w), the following gradient algorithm converges to its stationary point.
w(n+1) = w(n) - η sgn( w^T(n)Hw(n) ) ∂J(w)/∂w |_{w = w(n)}        (4.29)

Proof: Without loss of generality, suppose that we are given a quadratic surface of the form J(w) = w^T Hw, where H ∈ R^(m x m) and w ∈ R^m. H is restricted to be symmetric; therefore, it is the Hessian matrix of this quadratic surface. The gradient of the performance surface with respect to the weights, evaluated at a point w_0, is 2Hw_0, and the stationary point of J(w) is the origin. Since the performance surface is quadratic, any cross-section passing through the stationary point is a parabola. Consider the cross-section of J(w) along the line defined by the local gradient that passes through the point w_0. In general, the Hessian matrix of this surface can be positive or negative definite; it might as well have mixed eigenvalues. The unique stationary point of J(w), which makes its gradient zero, can be reached by moving along the direction of the local gradient. The important issue is the selection of the sign, i.e., whether to move along or against the gradient direction to reach the stationary point. The decision can be made by observing the local curvature of the cross-section of J(w) along the gradient direction. The performance surface cross-section along the gradient direction at w_0 is

J(w_0 + 2ηHw_0) = w_0^T (I + 2ηH)^T H (I + 2ηH) w_0 = w_0^T (H + 4ηH^2 + 4η^2 H^3) w_0        (4.30)

From this, we deduce that the local curvature of the parabolic cross-section at w_0 is 4 w_0^T H^3 w_0. If the performance surface is locally convex, this curvature is positive; if the performance surface is locally concave, this curvature is negative. Also, note that sgn(4 w_0^T H^3 w_0) = sgn(w_0^T Hw_0). Thus, the update equation with the curvature information in (4.29) converges to the stationary point of the quadratic cost function J(w), irrespective of the nature of the stationary point.
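The corollary can be checked on a toy saddle surface. In the sketch below (our construction), H = diag(1, -1) makes the origin a saddle point; plain gradient descent would diverge along the negative-eigenvalue direction, while the sign-corrected update (4.29) still drives w to the origin:

```python
import numpy as np

H = np.diag([1.0, -1.0])     # indefinite Hessian: J(w) = w^T H w has a saddle
w = np.array([1.0, 0.5])
eta = 0.05
for _ in range(5000):
    grad = 2 * H @ w                          # gradient of the quadratic
    w = w - eta * np.sign(w @ H @ w) * grad   # eq. (4.29): curvature-signed step
print(np.linalg.norm(w))                      # shrinks toward the saddle at 0
```

Near the cone w^T Hw = 0 the sign alternates between consecutive steps, and the pairwise product (I - 2ηH)(I + 2ηH) = I - 4η²H² is a contraction, which is why the iterate keeps shrinking instead of orbiting.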
From the above corollary, and utilizing the fact that the matrix R - |β|S is symmetric, we can conclude that the update equation in (4.28) asymptotically converges to the stationary point w* = (R - |β|S)^-1 (P - |β|Q). On the down side, however, the update equation in (4.28) requires O(m^2) computations, which makes the algorithm unwieldy for real-world applications; we could simply use the REW algorithm instead, which has a similar complexity. For an O(m) algorithm, we have to go back to the update rule in (4.6). We will discuss only the simple case of β = -0.5, which also turns out to be the most useful. We propose to use an instantaneous estimate of the sign with the current weights, given by

w(n+1) = w(n) + η(n) sgn( e^2(n) - 0.5 ė^2(n) ) [ e(n)x(n) - 0.5 ė(n)x~(n) ]        (4.31)

where η(n) > 0 is bounded by (4.27). It is possible to make mistakes in the sign estimation when (4.31) is utilized, which will not affect the convergence in the mean, but will penalize the misadjustment.

Excess Error Correlation Bound for EWC-LMS

In the next theorem, we will show that the asymptotic excess error correlation at lag L ≥ m is always bounded from above and can be arbitrarily reduced by controlling the step-size.

Theorem 4.2: With β = -1/2, the steady state excess error autocorrelation at lag L ≥ m, i.e., |ρ_e(L)|, is always bounded by

|ρ_e(L)| ≤ (η/2) [ Tr(R^) + 1 ] [ E(e_s^2(k)) + σ_u^2 + ||w_k||^2 σ_v^2 ]        (4.32)

where R^ = E[x^_k x^_k^T] and Tr(·) denotes the matrix trace. The term E(e_s^2(k)) denotes the excess MSE, which is (w_T - w_k)^T R (w_T - w_k). The noise variances in the input and desired signals are represented by σ_v^2 and σ_u^2 respectively. Note that the term ||w_k|| is always bounded because of the step-size bound.

Proof: For convenience, we will adopt the subscript k to denote the time or iteration index. With this convention, the weight vector at the kth iteration is denoted by w_k. Further, the error vector (the difference between the true vector w_T and the adaptive estimate at time k) is denoted by ε_k = w_T - w_k.
Throughout the rest of the proof, we will use the following notation: the noisy input vector x^_k, the noise-free input vector x_k and the input noise vector v_k obey x^_k = x_k + v_k; the noisy desired signal d^_k, the noise-free desired signal d_k and the noise u_k are related by d^_k = d_k + u_k. All errors below are computed from the noisy data, with e_k the error, ė_k = e_k - e_{k-L} and x~_k = x^_k - x^_{k-L}. We will start from the equation describing the dynamics of the error vector norm, given below.

||ε_{k+1}||^2 = ||ε_k||^2 - 2η sgn(e_k^2 - 0.5 ė_k^2)( ε_k^T e_k x^_k - 0.5 ε_k^T ė_k x~_k ) + η^2 || e_k x^_k - 0.5 ė_k x~_k ||^2        (4.33)

In (4.33), we have assumed a constant step-size that satisfies the upper bound in (4.27). Letting E||ε_{k+1}||^2 = E||ε_k||^2 as k → ∞, we see that

2 E[ sgn(e_k^2 - 0.5 ė_k^2)( ε_k^T e_k x^_k - 0.5 ε_k^T ė_k x~_k ) ] = η E|| e_k x^_k - 0.5 ė_k x~_k ||^2        (4.34)

We now invoke Jensen's inequality for convex functions [130] to reduce (4.34) further, yielding

| E[ ε_k^T e_k x^_k - 0.5 ε_k^T ė_k x~_k ] | ≤ (η/2) E|| e_k x^_k - 0.5 ė_k x~_k ||^2        (4.35)

The noisy error term is given by e_k = e_s(k) + u_k - w_k^T v_k, where the excess error is e_s(k) = ε_k^T x_k. Using the expressions E[ε_k^T e_k x^_k] = E(e_s^2(k)) - w_T^T Vw_k + w_k^T Vw_k and E[ε_k^T ė_k x~_k] = E(ė_s^2(k)) - 2 w_T^T Vw_k + 2 w_k^T Vw_k, with β = -0.5 we can immediately recognize that the left-hand side of (4.35) is simply the steady state excess error autocorrelation at lag L ≥ m, i.e., |ρ_e(L)|, since E(e_s^2(k)) - 0.5 E(ė_s^2(k)) = E[e_s(k)e_s(k-L)] under stationarity. In order to evaluate the right-hand side of (4.35), we will assume that ||x^_k||^2 and e_k^2 are uncorrelated in the steady state. Using this assumption, we can write

E|| e_k x^_k - 0.5 ė_k x~_k ||^2 ≤ E(e_k^2)[ Tr(R^) + 1 ]        (4.36)

where E(e_k^2) = E(e_s^2(k)) + σ_u^2 + ||w_k||^2 σ_v^2. Using (4.36) in equation (4.35), we get the inequality in (4.32). This assumption (more relaxed than the independence assumptions [11,14]) is used in computing the steady state excess MSE for the stochastic LMS algorithm [131,132] and becomes more realistic for long filters. In the estimation of the excess MSE for the LMS algorithm, Price's theorem [133] for Gaussian random variables can be invoked to derive closed form expressions.
However, even the Gaussianity assumption is questionable, as discussed by Eweda [134], who proposed additional reasonable constraints on the noise pdf to overcome the Gaussianity and independence assumptions, leading to a more generic treatment of the sign-LMS algorithm. It is important to realize at this point that, in the analysis presented here, no explicit Gaussianity assumptions have been made. As a special case, consider L = 0 and noise-free input. Then (4.32) holds with the equality sign, and |ρ_e(L)| is the same as E(e_s^2(k)), which is nothing but the excess MSE (as k → ∞) of the LMS algorithm. In other words, (4.32) reduces to

E(e_s^2(k)) = (η/2) Tr(R)[ E(e_s^2(k)) + σ_u^2 ]        (4.37)

From (4.37), the excess MSE of the LMS algorithm [14] can be deduced as

E(e_s^2(k)) = η σ_u^2 Tr(R) / (2 - η Tr(R))        (4.38)

which reduces to η σ_u^2 Tr(R)/2 for very small step-sizes. If the adaptive filter is long enough, the excess error e_s(k) will be Gaussian, and we can easily show that the excess MSE is bounded by η Tr(R) E[ε_0^2]/4, where ε_0 denotes the error due to the initial condition [131].

Other Variants of the AEC-LMS Algorithms

It is easy to see that, for convergence of the mean, the condition is |1 - η λ_k(R + βS)| < 1 for all k, where λ_k(R + βS) denotes the kth eigenvalue of the matrix (R + βS). This gives an upper bound on the step-size of η < 2/λ_max(R + βS). From the triangle inequality [8], λ_max(R + βS) ≤ λ_max(R) + |β| λ_max(S). Since both R and S are positive-definite matrices, the maximal eigenvalues are in turn bounded by the traces, which can be estimated online from the instantaneous signal powers ||x(n)||^2 and ||x~(n)||^2. In a stochastic framework, we can include this in the AEC-LMS update equation to obtain a step-size normalized EWC-LMS update rule given by

w(n+1) = w(n) + η sgn( e^2(n) + β ė^2(n) ) ( e(n)x(n) + β ė(n)x~(n) ) / ( ||x(n)||^2 + |β| ||x~(n)||^2 )        (4.40)

Note that when β = 0, (4.40) reduces to the well-known normalized LMS (NLMS) algorithm [14]. Alternatively, we can normalize by the norm squared of the gradient, and this gives the following modified update rule.
w(n+1) = w(n) + η sgn( e^2(n) + β ė^2(n) ) ( e(n)x(n) + β ė(n)x~(n) ) / ( || e(n)x(n) + β ė(n)x~(n) ||^2 + δ )        (4.41)

The term δ, a small positive constant, compensates for numerical instabilities when the signal has zero power or when the error goes to zero, which can happen in the noiseless case even with a finite number of samples. Once again, we would like to state that, with β = 0, (4.41) defaults to the NLMS algorithm. However, the caveat is that both (4.40) and (4.41) do not satisfy the principle of minimum disturbance, unlike the NLMS³. Nevertheless, the algorithms in (4.40) and (4.41) can be used to provide faster convergence at the expense of increased misadjustment (in the error correlation sense) in the final solution.

AEC-LMS Algorithm with Multiple Lags

In the previous chapter, we discussed a recursive Newton type algorithm that included more than one lag in the cost function. With decreasing SNR at the input, the Hessian matrix H = R + βS is mostly determined by the noise covariance matrix. This can degrade the performance, and we might be forced to use very small step-sizes (slow convergence) to achieve good results. One way of alleviating this problem is to incorporate multiple lags in the AEC cost function. The stochastic gradient AEC-LMS algorithm for the multiple lag case is simply given by

w(n+1) = w(n) + Σ_{L=1}^{Lmax} η_L sgn( e^2(n) + β ė_L^2(n) ) ( e(n)x(n) + β ė_L(n)x~_L(n) )        (4.42)

where Lmax is the total number of lags (constraints) used in the AEC cost function. The additional robustness of using multiple lags comes at an increased computational cost; in the case when the number of lags becomes equal to the length of the adaptive filter, the complexity approaches that of the recursive Newton type algorithms. The stochastic AEC algorithms have linear complexity, in comparison with the O(N^2) complexity of the recursive Newton type algorithms discussed in the previous chapter.
At the same time, since the algorithms are all based on instantaneous gradients, they have better tracking abilities when compared with their Newton counterparts. Hence, these algorithms can be expected to perform better in non-stationary conditions.

³ The NLMS algorithm is also called the minimum norm update algorithm. It can be formulated as a constrained minimization problem wherein the actual cost function is the norm of the update, viz., ||w(n+1) - w(n)||, and the constraint is that the error computed with the weights w(n+1) must be zero.

Simulation Results

Estimation of System Parameters in White Noise

The experimental setup is the same as the one used to test the REW algorithm. We varied the Signal-to-Noise Ratio (SNR) between -10dB and +10dB and changed the number of filter parameters from 4 to 12. We set β = -0.5 and used the update equation in (4.31) for the EWC-LMS algorithm. A time varying step-size magnitude was chosen in accordance with the upper bound given by (4.27), without the expectation operators. This greatly reduces the computational burden but makes the algorithm noisier. However, since we are using 50,000 samples for estimating the parameters, we can expect the errors to average out over iterations. For the LMS algorithm, we chose the step-size that gave the least error in each trial. In total, 100 Monte Carlo trials were performed and histograms of the normalized error vector norms were plotted. It is possible to use other statistical measures instead of the error norm, but this is sufficient to demonstrate the bias removal ability of EWC-LMS. For comparison purposes, we computed the solutions with LMS as well as the numerical TLS (regular TLS) method. Figure 4-1 shows the error histograms for all three methods. The inset plots in Figure 4-1 show the summary of the histograms for each method. EWC-LMS performs significantly better than LMS at low SNR values (-10dB and 0dB), while performing equally well for 10dB SNR.
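The update (4.31) is only a few lines of code. Below is an illustrative transcription (variable names are ours, and the per-sample power normalization of the step is our own safeguard rather than the exact bound (4.27)), run on a small noisy system-identification problem of the kind described above:

```python
import numpy as np

def ewc_lms(X, d, L, eta, beta_mag=0.5):
    """Sign-corrected stochastic EWC-LMS, eq. (4.31), for beta = -|beta_mag|.
    X: (N, m) matrix of input vectors, d: (N,) desired signal."""
    N, m = X.shape
    w = np.zeros(m)
    for n in range(L, N):
        e = d[n] - w @ X[n]                        # a priori error e(n)
        e_dot = e - (d[n - L] - w @ X[n - L])      # ė(n) = e(n) - e(n-L)
        x_t = X[n] - X[n - L]                      # x~(n) = x(n) - x(n-L)
        s = np.sign(e * e - beta_mag * e_dot * e_dot)  # instantaneous EWC sign
        w += eta * s * (e * X[n] - beta_mag * e_dot * x_t) \
             / (1.0 + X[n] @ X[n] + x_t @ x_t)
    return w

rng = np.random.default_rng(4)
m, L, N = 4, 4, 50_000
w_true = np.array([0.2, -0.5, 0.3, 0.1])
s_in = np.convolve(rng.standard_normal(N + m), [1.0, 0.5])[:N + m - 1]
X = np.lib.stride_tricks.sliding_window_view(s_in, m)[:, ::-1]
d = X @ w_true
Xn = X + rng.standard_normal(X.shape) * np.std(s_in)   # roughly 0 dB input SNR
w_hat = ewc_lms(Xn, d, L, eta=0.05)
```

Setting beta_mag = 0 degenerates the rule into a normalized-LMS variant, which is a convenient sanity check on clean data.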
The input noise variances for the −10 dB, 0 dB, and +10 dB SNR values are 10, 1, and 0.1, respectively. Thus, we expect (and observe) the TLS results to be worst for −10 dB and best for +10 dB. As the theory predicts, TLS performance drops when the noise variances are not the same in the input and desired signals.

Figure 4-1. Histogram plots showing the error vector norm for the EWC-LMS and LMS algorithms and the numerical TLS solution (panels: SNR = −10, 0, +10 dB; #taps = 4, 8, 12; horizontal axes show the error norm in dB).

Figure 4-2 shows a sample comparison between the stochastic and the recursive algorithms for 0 dB SNR and 4 filter taps. Interestingly, the performance of the EWC-LMS algorithm is better than that of the REW algorithm in the presence of noise. Similarly, the LMS algorithm is much better than the RLS algorithm. This tells us that the stochastic algorithms presumably reject more noise than the fixed-point algorithms. Researchers have made this observation before, although no concrete arguments exist to account for the smartness of the adaptive algorithms [135]. Similar conclusions can be drawn in our case for EWC-LMS and REW.

Figure 4-2. Comparison of stochastic versus recursive algorithms (RLS, REW, EWC-LMS and LMS with SNR = 0 dB, #taps = 4; horizontal axis shows the error norm in dB).

Weight Tracks and Convergence

The steady-state performance of a stochastic gradient algorithm is a matter of great importance. We will now experimentally verify the steady-state behavior of the EWC-LMS algorithm. The SNR of the input signal is set to 10 dB and the number of filter taps is fixed at two for display convenience. Figure 4-3 shows the contour plot of the EWC cost function with noisy input data. Clearly, the Hessian of this performance surface has both positive and negative eigenvalues, which makes the stationary point an undesirable saddle point. On the same plot, we have shown the weight tracks of the EWC-LMS algorithm with β = −0.5. We also used a fixed stepsize of 0.001. From the figure, it is clear that the EWC-LMS algorithm converges stably to the saddle-point solution, which is theoretically unstable when a single-sign stepsize is used. Notice that, due to the constant stepsize, there is misadjustment in the final solution. In Figure 4-4, we show the individual weight tracks for the EWC-LMS algorithm. The weights converge to the vicinity of the true filter parameters, which are 0.2 and 0.5 respectively, within 1000 samples.

Figure 4-3. Contour plots with the weight tracks showing convergence to the saddle point.

Figure 4-4. Weight tracks for the stochastic algorithm.

In order to see whether the algorithm converges to the saddle-point solution in a robust manner, we ran the same experiment using different initial conditions on the contours. Figure 4-5 shows a few plots of the weight tracks originating from different initial values over the contours of the performance surface. In every case, the algorithm converged to the saddle point in a stable manner. Note that the misadjustment in each case is almost the same.

Figure 4-5. Contour plots with weight tracks for different initial values of the weights.

Finally, to quantify the effect of reducing the SNR, we repeated the experiment with 0 dB SNR. Figure 4-6 (left) shows the weight tracks over the contour, and we can see that the misadjustment has increased owing to the decrease in SNR. This is a typical phenomenon observed with most stochastic gradient algorithms. However, the misadjustment is proportional to the stepsize. Therefore, by using smaller stepsizes, the misadjustment can be kept within acceptable limits. The drawback is slower convergence to the optimal solution. Figure 4-6 (right) shows the weight tracks when the algorithm is used without the sign information for the stepsize. Note that convergence is not achieved in this case, which substantiates our previous argument that a fixed-sign stepsize will never converge to a saddle point. To further substantiate this fact, we removed the noise from the input and ran the EWC-LMS algorithm with and without the sign term. Figure 4-7 (left) shows the noise-free EWC performance surface, and Figure 4-7 (right) shows the weight tracks with and without the sign information on the contours. Clearly, without the sign term the weights do not converge to the desired saddle point even in the absence of noise. On the other hand, using the sign information leads the weights to the saddle point in a stable manner. Since this is the noise-free case, the final misadjustment becomes zero.

Figure 4-6. Contour plots with weight tracks for the EWC-LMS algorithm with sign information (left) and without sign information (right).

Figure 4-7. EWC performance surface (left) and weight tracks for the noise-free case with and without sign information (right).

Inverse Modeling and Controller Design Using EWC

System identification is the first step in the design of an inverse controller. Specifically, we wish to design a system that controls the plant to produce a predefined output. Figure 4-8 shows a block diagram of model-reference inverse control [136].
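Returning briefly to the weight-track experiments above: the saddle-point geometry of the EWC surface can be checked numerically. For a white unit-power input and β = −0.5, the Hessian H = R + βS has eigenvalues of both signs, so the stationary point is a saddle rather than a minimum. A small sketch under these assumptions for the two-tap case (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, taps, beta = 100_000, 2, -0.5
x = rng.standard_normal(n)                                       # white input
X = np.lib.stride_tricks.sliding_window_view(x, taps)[:, ::-1]   # x(n) vectors
Xd = X[1:] - X[:-1]                                              # x(n) - x(n-1)
R = X.T @ X / len(X)       # E[x x^T], close to I for white input
S = Xd.T @ Xd / len(Xd)    # E[xdot xdot^T], close to [[2, -1], [-1, 2]]
H = R + beta * S           # Hessian of the EWC performance surface
eig = np.linalg.eigvalsh(H)
# eigenvalues are close to -0.5 and +0.5: one of each sign, hence a saddle
```

With one negative and one positive eigenvalue, a plain gradient step with a fixed-sign stepsize is repelled along the negative-curvature direction, which is exactly why the sign-of-instantaneous-cost term in the update is needed for stable convergence to this point.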
In this case, the adaptive controller is designed so that the controller-plant pair tracks the response generated by the reference model for any given input (command). Clearly, we require the plant parameters (which are typically unknown) to devise the controller. Once we have a model for the plant, the controller can be easily designed using conventional MSE minimization techniques. In this example, we assume that the plant is an all-pole system with transfer function P(z) = 1/(1 + 0.8z⁻¹ − 0.5z⁻² − 0.3z⁻³). The reference model is chosen to be an FIR filter with 5 taps. The block diagram for the plant identification is shown in Figure 4-9. Notice that the output of the plant is corrupted with additive white noise due to measurement errors. The SNR at the plant output was set to 0 dB. We then ran the EWC-LMS and LMS algorithms to estimate the model parameters given the noisy input and desired signals. The model parameters thus obtained were used to derive the controller (see Figure 4-8) using standard backpropagation of error. We then tested the adaptive controller-plant pair for trajectory tracking by feeding it a nonlinear time series and observing the responses. Ideally, the controller-plant pair must follow the trajectory generated by the reference model.

Figure 4-8. Block diagram for model-reference inverse control.

Figure 4-9. Block diagram for inverse modeling.

Figure 4-10 (top) shows the tracking results for both controller-plant pairs along with the reference output. Figure 4-10 (bottom) shows a histogram of the tracking errors. Note that the errors with the EWC-LMS controller are all concentrated around zero, which is desirable. In contrast, the errors produced with the MSE-based controller are significant, and this can become worse if the SNR drops further.

Figure 4-10. Plot of tracking results (trajectory tracking responses) and error histograms.

Figure 4-11 shows the magnitude and phase responses of the reference model along with those of the generated controller-model pairs. Note that the EWC controller-model pair matches the desired transfer function very closely, whereas the MSE controller-model pair produces a significantly different transfer function. This clearly demonstrates the advantages offered by EWC. More details on the applications of EWC-LMS in system identification and controller design problems can be found in [137-139].

Figure 4-11. Magnitude and phase responses of the reference model and the designed model-controller pairs.

Summary

In this chapter, we proposed online, sample-by-sample stochastic gradient algorithms for estimating the optimal AEC solution. Detailed derivations of the update rules were presented, and convergence was proved rigorously using stochastic approximation theory. We also derived stepsize upper bounds for convergence with probability one. Further, a theoretical upper bound on the excess error correlation for EWC-LMS was derived. The AEC stochastic algorithms include the LMS algorithm for MSE as a special case. Owing to the complexities of the EWC performance surface (see Chapter 2), additional information, namely the sign of the instantaneous cost, is required for guaranteed convergence to the unique optimal AEC solution. In this context, the AEC optimization problem can be posed as a root-finding problem, and the popular Robbins-Monro method [140] can be adopted to solve for the optimal solution. We have not yet explored this method for the AEC criterion.
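For readers unfamiliar with the Robbins-Monro method mentioned here, the following toy sketch illustrates the idea in its generic scalar form, finding a root of f(θ) = E[g(θ, ξ)] from noisy evaluations only. This is a standalone illustration, not an application to the AEC cost; the decaying gains aₙ = 1/(n+2) are one standard choice satisfying Σaₙ = ∞ and Σaₙ² < ∞.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                 # initial guess
root = 2.0                  # root of the underlying f(theta) = theta - 2
for n in range(20_000):
    noisy_f = (theta - root) + rng.standard_normal()   # only noisy evaluations of f
    theta -= noisy_f / (n + 2)                         # Robbins-Monro gains a_n = 1/(n+2)
# theta approaches the root despite unit-variance observation noise
```

The decaying gain sequence averages out the observation noise while still moving the iterate far enough to reach the root, which is the same mechanism that underlies the convergence proofs for the stochastic gradient algorithms in this chapter.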
We also presented several variants of the AEC-LMS algorithm. As a special case, the normalized AEC-LMS algorithm in equation (4.40) reduces to the well-known NLMS algorithm for MSE. The gradient-normalized AEC-LMS algorithm in equation (4.41) showed better performance than the simple AEC-LMS algorithm in our simulation studies. We then presented simulation results demonstrating the noise-rejection capability of the EWC-LMS algorithm. Experiments were also conducted to verify some of the properties of the proposed gradient algorithms. In particular, we observed the weight tracks and verified that the algorithm converges in a stable manner even to saddle stationary points. This is achieved mainly by utilizing the sign information in the gradient update. We also showed that the amount of misadjustment can be controlled by the stepsize parameter, in conformance with the general theory of stochastic gradient algorithms. Lastly, we demonstrated the application of EWC in the design of a model-reference inverse controller. We compared the performance of the EWC controller with the MSE-derived controller and verified the superiority of the former.

CHAPTER 5
LINEAR PARAMETER ESTIMATION IN CORRELATED NOISE

Introduction

In the previous chapters we discussed a new criterion, the augmented error criterion (AEC), that can potentially replace the popular MSE criterion. In fact, we showed that a special case of the AEC, called the error whitening criterion (EWC), can solve the problem of estimating the parameters of a linear system in the presence of input noise. We showed extensive simulation results with different EWC adaptation algorithms that proved, beyond doubt, the usefulness of this criterion in solving system identification and controller design problems. Two crucial assumptions were made in the theory behind the error whitening criterion. Firstly, we assumed that the input noise is uncorrelated with itself, i.e., white.
Although in most problems we assume that the noise is white, this assumption can certainly be restrictive in many applications. From the theory discussed in the previous chapters, it is easy to conclude that EWC fails to remove the bias in the parameter estimates when the noise is correlated (colored). Secondly, we assumed full knowledge of the model order of the unknown system. This limitation is not unique to the proposed method, as most of the competing methods, including Total Least-Squares (TLS), assume exact model order. To the best of our knowledge, there is no existing solution to the problem of system identification in the presence of input noise when the model order is unknown. However, until this point, we have not dealt with the implications of using the proposed EWC when the