IMPLEMENTATION CONSIDERATIONS FOR FPGA-BASED ADAPTIVE TRANSVERSAL FILTER DESIGNS

By ANDREW Y. LIN

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING

UNIVERSITY OF FLORIDA, 2003

Copyright 2003 by Andrew Y. Lin

ACKNOWLEDGMENTS

I would like to thank my advisory committee members, Dr. Jose Principe, Dr. Karl Gugel and Dr. John Harris, for their guidance, advice, and encouragement toward successful completion of this project. I also thank my fellow Applied Digital Design Laboratory members, Scott Morrison, Jeremy Parks, Shalom Darmanjian and Joel Fuster, for their unconditional help with my research in every way they could. My special thanks go to my parents, who have been supportive and caring throughout every step of my life, including my graduate years at the University of Florida. Altera Corp. provided software and hardware in support of this thesis.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF FIGURES
ABSTRACT

1 INTRODUCTION
1.1 Problem Statement
1.2 Tradeoffs in Choosing Fixed-point Representation
1.3 Motivation and Outline of the Thesis

2 THEORETICAL BACKGROUND ON LINEAR ADAPTIVE ALGORITHMS
2.1 Discrete Stochastic Processes
2.1.1 Autocorrelation Function
2.1.2 Correlation Matrix
2.1.3 Yule-Walker Equation
2.1.4 Wiener Filters
2.2 Method of Steepest Descent
2.2.1 Steepest Descent Algorithm
2.2.2 Wiener Filters with Steepest Descent Algorithm
2.3 Least Mean Square Algorithm
2.3.1 Overview
2.3.2 The Algorithm
2.3.3 Applications
2.3.3.1 Adaptive noise cancellation
2.3.3.2 Adaptive line enhancement

3 FINITE PRECISION EFFECTS ON ADAPTIVE ALGORITHMS
3.1 Quantization Effects
3.1.1 Rounding
3.1.2 Truncation
3.1.3 Rounding vs. Truncation
3.2 Input Quantization Effects
3.3 Arithmetic Rounding Effects
3.3.1 Product Rounding Effects
3.3.2 Coefficient Rounding Effects
3.3.3 Slowdown and Stalling
3.3.4 Saturation
3.3.5 Solutions for Arithmetic Quantization Effects
3.4 Simulation Result
3.4.1 Rounding vs. Truncation
3.4.2 Effects of Product Rounding at the Convolution Stage
3.4.3 Effects of Product Rounding at the Adaptation Stage
3.4.4 Clamping Technique
3.4.5 Sign Algorithm
3.5 Remarks

4 SOFTWARE SIMULATION OF A FIXED-POINT-BASED POWER-OF-TWO ADAPTIVE NOISE CANCELLER
4.1 Modular Overview
4.2 Data Quantization
4.3 Simulation Results

5 HARDWARE IMPLEMENTATION OF AN INTEGER-BASED POWER-OF-TWO ADAPTIVE NOISE CANCELLER IN STRATIX DEVICES
5.1 Stratix Devices
5.1.1 Device Architecture
5.1.2 Embedded DSP Blocks
5.2 Design Specifications
5.2.1 Structural Overview
5.2.2 The Power-of-Two Scheme
5.2.3 Data Flow and Quantization
5.3 Dynamic Component Instantiation in VHDL
5.4 Simulation and Implementation Results
5.5 Performance Comparison of Stratix and Traditional FPGAs
5.5.1 Speed
5.5.2 Area
5.6 Pipelining
5.6.1 Optimal Multiplier Pipeline Stages
5.6.2 Optimal Adder-chain Pipeline Stages
5.6.3 Tradeoffs in Introducing Latency into Adaptive Systems
5.6.4 Performance of the Pipelined Adaptive System
5.7 Performance Comparison of FPGAs and DSP Processors
5.7.1 Speed
5.7.2 Power Consumption

6 CONCLUSION AND FUTURE WORK
6.1 Conclusion
6.2 Future Work

APPENDIX
A MATLAB SCRIPTS
B VHDL CODES

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF FIGURES

1-1. Conventional Adaptive Filter Configuration
1-2. Two Options of Quantization
2-1. Block Diagram of a Statistical Filtering Problem
2-2. Block Diagram of an Adaptive FIR Filter
2-3. Adaptive Noise Cancellation Block Diagram
2-4. Adaptive Line Enhancer Block Diagram
3-1. Rounding Effects
3-2. Truncation Effects
3-3. MAC Unit Block Diagram
3-4. System Identification Block Diagram
3-5. Experimental Setup for Rounding vs. Truncation
3-6. Simulation Result for Rounding vs. Truncation
3-7. Additional Quantizers at the Convolution Stage
3-8. Effects of Product Quantization at the Convolution Stage
3-9. Additional Quantizers at the Adaptation Stage
3-10. Effects of Product Quantization at the Convolution and Adaptation Stages
3-11. Tap Weight Track for Clamping Technique
3-12. Misadjustment Plot for Clamping Technique
3-13. Misadjustment for Sign Algorithm vs. LMS
4-1. Adaptive Noise Canceller Block Diagram
4-2. Internal Structure of the Noise Canceller with Quantizers
4-3. Weight Tracks for Fixed-point Systems
4-4. Misadjustment Plots of Fixed-point Systems and a Floating-point System
5-1. Stratix Device Block Diagram
5-2. Embedded DSP Block Diagram
5-3. Adaptive Transversal Filter Block Diagram
5-4. Waveform Simulation Result of the Adaptive Noise Canceller
5-5. Logic State Analyzer Result of the Adaptive Noise Canceller
5-6. Plot of Filter Order vs. Speed
5-7. Plot of Filter Order vs. Area
5-8. Pipelined Multiplier Test Module
5-9. Maximum Data Rate of Three Multipliers with Various Pipeline Stages
5-10. Adder-chain Test Module
5-11. Adder-chain Data Rate with Respect to Number of Adders
5-12. Pipelined and Buffered Adaptive System Block Diagram
5-13. Time-aligned Adaptive System Block Diagram
5-14. Pipelined Adaptive System Performance
5-15. Power Consumption Plot for Various Devices

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Engineering

IMPLEMENTATION CONSIDERATIONS FOR FPGA-BASED ADAPTIVE TRANSVERSAL FILTER DESIGNS

By Andrew Y. Lin
August 2003
Chair: Jose C. Principe
Major Department: Electrical and Computer Engineering

Adaptive filters have become vastly popular in the area of digital signal processing. However, adaptive filtering algorithms assume infinite precision, whereas in reality digital hardware is of finite precision. The effects of finite precision on adaptive algorithms are studied in this thesis, and techniques for mitigating these effects are presented. Simulation results are also presented to verify the techniques, targeting specifically the Least Mean Square (LMS) algorithm. Finally, a fixed-point-based adaptive transversal filter is simulated in a new family of FPGA devices with embedded DSP blocks. The cost-benefit and tradeoff of pipelining are studied. The performance of this new family of FPGA devices is compared against DSP processors, as well as traditional FPGA devices that do not have embedded DSP blocks.

CHAPTER 1
INTRODUCTION

1.1 Problem Statement

Significant contributions have been made in the past thirty years in the signal processing field. In particular, digital signal processing (DSP) systems have become attractive due to the advances in digital circuit design and the systems' reliability, accuracy and flexibility. One DSP application is called filtering, where the digital system's objective is to process a signal in order to manipulate the information contained in the input signal. As described in DiCarlo [7], a filter is a device that maps its input signal to another output signal, facilitating the extraction of the desired information contained in the input signal. For a time-invariant filter, the internal parameters and the structure of the filter are fixed. Once specifications are given, the filter's transfer function and the structure defining the algorithm are fixed.
An adaptive filter is time-varying, since its parameters are continually changing in order to meet a certain performance requirement. Usually the definition of the performance criterion requires the existence of a reference signal, which is absent in time-invariant filters. The general setup of an adaptive filtering environment is illustrated in Figure 1-1, where n is the iteration index, x(n) denotes the input signal, y(n) is the adaptive filter's output signal, and d(n) defines the reference or desired signal. The error signal e(n) is the difference between the desired d(n) and the filter output y(n). The error signal is used as feedback to the adaptation algorithm in order to determine the appropriate updating of the filter's coefficients, or tap weights. The minimization objective is for the adaptive filter's output signal to match the desired signal in some sense.

Figure 1-1. Conventional Adaptive Filter Configuration

The minimization objective can be viewed as a function of the input, desired, and output signals, or consequently as a function of the error signal. One of the most commonly used objectives is to minimize the mean square error; that is, the objective function is defined as

F[e(n)] = E[e^2(n)]    (1.1)

Adaptive filters can be implemented either in Finite Impulse Response (FIR) form or in Infinite Impulse Response (IIR) form. FIR filters are usually implemented in non-recursive structures, whereas IIR filters employ recursive realizations. In the case of FIR realizations, the most widely used adaptive filter structure is the transversal filter, also known as the tapped delay line structure.

As will be derived in Chapter 2, all adaptive algorithms, including the Least Mean Square (LMS) algorithm for example, assume infinite precision. In other words, there is infinite storage for the information needed to perform adaptation. However, this is not the case in reality, where the computers or digital hardware that implement adaptive algorithms all contain limited storage for information; that is, numbers are stored in finite precision. Due to finite precision in digital hardware, quantization must be performed in some or all of the following areas:

* Input and reference signals;
* Product quantization in the convolution stage;
* Coefficient quantization in the adaptation stage.

Quantization noise is introduced in all of the above areas. The effects of quantization are discussed in this thesis.

DSP applications, including adaptive systems, have traditionally been implemented with either fixed-point or floating-point microprocessors. However, with growing die sizes as well as the incorporation of embedded DSP blocks, FPGA devices have become a serious contender in the signal processing market. Although it is not yet feasible to use floating-point arithmetic in modern FPGAs, it is sufficient to use fixed-point arithmetic and still achieve tap-weight convergence for adaptive filters. This thesis also investigates the performance of FPGAs and DSP processors in terms of speed and power consumption.

1.2 Tradeoffs in Choosing Fixed-point Representation

Since infinite precision is not available in the real world, tradeoffs must be made in implementing adaptive systems in finite precision. By increasing the wordlength, a system can increase the precision of the data it can represent. However, the amount of hardware also increases, and that leads to larger circuitry and slower system speed.
If the wordlength is insufficient, saturation or stalling may occur due to the inadequacy of data storage, even though a smaller wordlength reduces the amount of hardware. Therefore, the system engineer must deal with the tradeoff between the overall feasibility of the implementation and the functionality of the system.

Quantization may create effects such as saturation and stalling. These effects, if not dealt with carefully, may render the adaptive filter useless. Let us take multiplication as an example for illustration: when two N-bit numbers are multiplied, the result is 2N bits, and the product is usually quantized into a number that is M bits long, where M < 2N. Referring to Figure 1-2, there are two options for quantization: a) the upper significant bits are quantized (discarded), resulting in the loss of a large amount of information; b) the lower significant bits are quantized, resulting in the loss of data precision.

Figure 1-2. Two Options of Quantization: a) quantize upper significant bits; b) quantize lower significant bits

By choosing option a), one is exposed to the danger of saturation, where the filter becomes useless due to the loss of a large amount of information. Saturation may be avoided by increasing the wordlength, or by the clamping technique. Alternatively, if option b) is chosen, the stalling phenomenon may occur when tap weight update parameters become smaller than the least significant bit of the binary representation and consequently are quantized into zeros. When stalling occurs, the adaptation process is terminated prematurely due to lack of update information. We will show that stalling may be avoided by incrementing the step size parameter, using the sign algorithm, or by dithering. Slowdown may also occur in finite precision environments, in which the tap weight convergence is slower than in infinite precision environments. We will show that the wordlength of the tap weights plays a significant part in causing slowdown, and that by allocating more bits to represent the coefficients, slowdown can be avoided.
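To make the two quantization options of Figure 1-2 concrete, the following sketch (added here for illustration and not part of the original text; the operand values and wordlengths are hypothetical) quantizes a 2N-bit product down to M bits both ways:

```python
N = 8                      # operand wordlength (hypothetical)
M = 12                     # quantized product wordlength, M < 2N

a, b = 113, 97             # two unsigned N-bit operands (hypothetical)
product = a * b            # full 2N-bit result

# Option a) quantize (discard) the upper significant bits, keep the M
# lower bits: large values lose their top bits -> danger of saturation.
option_a = product & ((1 << M) - 1)

# Option b) quantize (discard) the lower significant bits, keep the M
# upper bits: small values quantize to zero -> danger of stalling.
option_b = product >> (2 * N - M)

print(f"product={product:#06x}  a) lower {M} bits={option_a:#05x}"
      f"  b) upper {M} bits={option_b:#05x}")
```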
1.3 Motivation and Outline of the Thesis

As stated earlier, adaptive filters have attracted growing interest in the DSP field. Most adaptive algorithms that run inside adaptive filters have been derived under the assumption of infinite precision. However, since finite precision is what is available in the real world, it is advantageous to study what effects finite precision can impose on adaptive filters, and furthermore what techniques may be employed to mitigate, if not eliminate, these effects. Once the effects are studied thoroughly, a finite-precision-based adaptive filter is implemented by first experimenting in a software environment to establish feasibility, and then turning the software experiment into a digital hardware realization.

Chapter 2 presents the theoretical background on adaptive algorithms, and the LMS algorithm is derived. Chapter 3 focuses on the effects created by a finite precision environment, as well as techniques to reduce such effects. Chapter 4 demonstrates a software implementation of a finite-precision-based adaptive filter, and in Chapter 5, based on the feasibility analysis from Chapter 4, details of a transversal adaptive filter implemented in an FPGA device are given. In order to boost data rates, pipelining is implemented; tradeoffs in introducing pipelining are also studied. A comparison is also presented for choosing hardware for adaptive DSP application implementations. Finally, conclusions and future work are presented in Chapter 6.

CHAPTER 2
THEORETICAL BACKGROUND ON LINEAR ADAPTIVE ALGORITHMS

2.1 Discrete Stochastic Processes

In most discussions of signals and systems, the signals are defined by analytical expressions, difference equations or even arbitrary graphs. However, most signals in the real world are random, or contain random components due to factors such as additive noise or quantization errors. Such signals therefore require the use of statistical methods rather than analytical expressions for their description. Haykin [16] defines the term stochastic process as a term to describe the time evolution of a statistical phenomenon according to probabilistic laws. The time evolution implies that the stochastic process is a set of functions of time. According to probabilistic laws implies that the outcomes of the stochastic process cannot be determined before conducting experiments. A stochastic process is not a single function of time. Rather, it represents an infinite number of different realizations of the process [16]. One example of these realizations is a discrete-time series, in which the process is sampled at each sampling period. For example, the sequence [u(n), u(n-1), ..., u(n-M)] represents a partial discrete-time observation consisting of samples of the present value and M past values of the process.

2.1.1 Autocorrelation Function

Consider a discrete-time series representation of a stochastic process [u(n), u(n-1), ..., u(n-M)]. The autocorrelation function is defined as follows:

r(n, n-k) = E[u(n)u*(n-k)],  k = 0, ±1, ±2, ...    (2.1)

where E[·] denotes the expectation operator and * denotes the complex conjugate. This second-order characterization of the process offers two important advantages: first, it lends itself to practical measurements, and second, it is well suited for linear operations on stochastic processes [16]. Note that if only real-valued signals are considered, the conjugate is omitted and the autocorrelation at lag zero is simply the mean square of the signal. This consideration holds for the rest of the thesis. The autocorrelation function described in Eq. (2.1) depends only on the difference between the observation times n and n-k, or the lag k. Therefore,

r(n, n-k) = r(k)    (2.2)

2.1.2 Correlation Matrix

Let the M-by-1 observation vector u(n) represent the discrete-time series u(n), u(n-1), ..., u(n-M+1). The composition of the vector can then be written as

u(n) = [u(n), u(n-1), ..., u(n-M+1)]^T    (2.3)

where T denotes transposition. The correlation matrix of a discrete-time stochastic process can be defined as the expectation of the outer product of the observation vector u(n) with itself. The dimension of the correlation matrix is M-by-M, and it is denoted R:

R = E[u(n)u^T(n)]    (2.4)

By substituting Eq. (2.3) into Eq. (2.4) and using the property defined in Eq. (2.1) (with r(-k) = r(k) for a real-valued process), the expanded form of the correlation matrix can be expressed as follows:

R = [ r(0)    r(1)    ...  r(M-1)
      r(1)    r(0)    ...  r(M-2)
      ...     ...     ...  ...
      r(M-1)  r(M-2)  ...  r(0)   ]    (2.5)
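As a numerical illustration (added here, not from the thesis), r(k) can be estimated by time averaging over one realization, assuming the process is ergodic, and the correlation matrix of Eq. (2.5) assembled from the estimates; the signal and dimensions below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(5000)   # one realization of a real-valued process
M = 4                           # observation vector length

# Time-average estimate of r(k) = E[u(n)u(n-k)] for a real,
# (assumed) ergodic wide-sense stationary process.
r = np.array([np.mean(u[k:] * u[:len(u) - k]) for k in range(M)])

# Eq. (2.5): R is symmetric Toeplitz for real signals, R[i][j] = r(|i-j|).
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
print(np.round(R, 3))           # near the identity for this white input
```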
2.1.3 Yule-Walker Equation

An autoregressive (AR) process of order M is defined by the difference equation

u(n) + a1·u(n-1) + a2·u(n-2) + ... + aM·u(n-M) = v(n)    (2.6)

where a1, a2, ..., aM are constants and v(n) is white noise. Eq. (2.6) can be rewritten in the form

u(n) = w1·u(n-1) + w2·u(n-2) + ... + wM·u(n-M) + v(n)    (2.7)

where wk = -ak. Eq. (2.7) states that the present value of the process, u(n), is a finite linear combination of past values, u(n-1), u(n-2), ..., u(n-M), plus an error term v(n). By multiplying both sides of Eq. (2.6) by u(n-l), where l > 0, and then applying the expectation operator, we obtain the following equation (with a0 = 1):

E[ Σ(k=0..M) ak·u(n-k)·u(n-l) ] = E[v(n)·u(n-l)]    (2.8)

Since the expectation E[u(n-k)u(n-l)] equals the autocorrelation function of the AR process with lag l-k, and E[v(n)u(n-l)] is zero for l > 0, Eq. (2.8) can be simplified to

Σ(k=0..M) ak·r(l-k) = 0,  l > 0    (2.9)

The autocorrelation function of the AR process thus satisfies the difference equation

r(l) = w1·r(l-1) + w2·r(l-2) + ... + wM·r(l-M),  l > 0    (2.10)

By expanding Eq. (2.10) for all l = 1, 2, ..., M, a set of M simultaneous equations is formed, with the values of the autocorrelation function as known quantities and the AR parameters as unknowns. The set of equations may appear in matrix form:

[ r(0)    r(1)    ...  r(M-1) ] [ w1 ]   [ r(1) ]
[ r(1)    r(0)    ...  r(M-2) ] [ w2 ] = [ r(2) ]
[ ...     ...     ...  ...    ] [ .. ]   [ ...  ]
[ r(M-1)  r(M-2)  ...  r(0)   ] [ wM ]   [ r(M) ]    (2.11)

The set of equations in (2.11) is called the Yule-Walker equations. By using the expression introduced in Eq. (2.5), the Yule-Walker equations may be written in compact matrix form:

R w = r    (2.12)

Assuming that R^-1 exists, the solution for the AR parameters can be obtained by

w = R^-1 r    (2.13)

2.1.4 Wiener Filters

Consider the Finite Impulse Response (FIR) filtering problem described in Figure 2-1. The input of the filter consists of the time series u(0), u(1), u(2), ..., and the filter has an impulse response, or tap weights, w0, w1, ..., w(M-1), where M is the length of the filter. The impulse response is selected so that the filter output matches as closely as possible a desired signal denoted by d(n). The estimation error e(n) is defined as the difference between d(n) and the filter output y(n). Statistical optimization may be applied to minimize e(n). One such optimization is to minimize the mean square value of e(n). According to the principle of orthogonality, if the FIR filter depicted in Figure 2-1 operates under the optimum condition, the filter output y(n) best estimates the desired signal d(n). The Wiener-Hopf equation is derived from the same principle to solve for the optimum condition.

Figure 2-1. Block Diagram of a Statistical Filtering Problem

Let R be the M-by-M correlation matrix of the filter inputs u(n), where u(n) = [u(n), u(n-1), ..., u(n-M+1)]^T. According to Eqs. (2.3) to (2.5), the correlation matrix is of the form

R = [ r(0)    r(1)    ...  r(M-1)
      r(1)    r(0)    ...  r(M-2)
      ...     ...     ...  ...
      r(M-1)  r(M-2)  ...  r(0)   ]    (2.14)

Also let p denote the M-by-1 cross-correlation vector between the filter inputs and the desired response:

p = E[u(n)d(n)]    (2.15)

or in expanded vector form:

p = [p(0), p(-1), ..., p(1-M)]^T    (2.16)

The Wiener-Hopf equation is thus defined as follows:

R wo = p    (2.17)

where wo is the M-by-1 optimum tap weight vector of the FIR filter described in Figure 2-1. To solve the Wiener-Hopf equation for wo, we assume that R^-1 exists and multiply both sides of Eq. (2.17) by it to obtain:

wo = R^-1 p    (2.18)

Note that in order to calculate the optimum tap weight vector wo with Eq. (2.18), both the autocorrelation matrix of the filter input and the cross-correlation vector between the input and the desired response have to be known a priori; that is, the statistical information of the entire tap input vector and the desired response is known before wo is calculated. Eq. (2.18) is also computationally expensive: an inverse operation on an M-by-M matrix is performed, followed by a matrix-vector multiplication.
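In practice, Eqs. (2.13) and (2.18) are ordinary linear solves. The sketch below (an added illustration; the AR(2) parameters are hypothetical) synthesizes an AR process, estimates its autocorrelations, and solves the Yule-Walker equations; solving R w = r directly is preferable to forming R^-1 explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthesize an AR(2) process u(n) = w1*u(n-1) + w2*u(n-2) + v(n)
# with hypothetical parameters.
w_true = [0.5, -0.3]
v = rng.standard_normal(20000)
u = np.zeros_like(v)
for n in range(2, len(v)):
    u[n] = w_true[0] * u[n - 1] + w_true[1] * u[n - 2] + v[n]

# Estimate r(0), r(1), r(2) by time averaging, then solve the
# Yule-Walker system R w = r of Eqs. (2.11)-(2.13).
r = np.array([np.mean(u[k:] * u[:len(u) - k]) for k in range(3)])
R = np.array([[r[0], r[1]],
              [r[1], r[0]]])
w_hat = np.linalg.solve(R, r[1:])   # avoids computing R**-1
print(w_hat)                        # close to [0.5, -0.3]
```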
2.2 Method of Steepest Descent

As described in Section 2.1.4, the Wiener filter employs the minimization of the mean square of its error signal e(n) to optimally match the filter output signal y(n) with the desired signal d(n). Furthermore, this particular Wiener filter has fixed tap weights for all filter inputs, and the tap weights are calculated a priori using the Wiener-Hopf equation. The method of steepest descent instead involves updating the tap weights of the filter at each time step in a feedback system. It does not require the entire statistics of the filter inputs; it provides an algorithmic solution that allows for the tracking of time variations in the signal's statistics without having to use the Wiener-Hopf equation.

2.2.1 Steepest Descent Algorithm

Let us define J(w) to be the cost function of some unknown weight vector w, and let J(w) be continuously differentiable with respect to w. The optimum weight vector wo thus satisfies the following condition:

J(wo) ≤ J(w) for all w    (2.19)

Eq. (2.19) may be approached by local iterative descent: an initial guess for w is made, and at each time interval a new w is generated so that

J(w(n+1)) ≤ J(w(n))    (2.20)

where w(n) is the previous tap weight vector and w(n+1) is the updated version. One particular method of local iterative descent is the method of steepest descent. At each iteration, the tap weight vector is adjusted in the direction opposite to the gradient vector of the cost function J(w). The gradient vector is defined as

g = ∂J(w)/∂w    (2.21)

Therefore the steepest descent algorithm is defined as

w(n+1) = w(n) - μ g(n)    (2.22)

The term μ is the step size. Details of the step size are given later. Justification that Eq. (2.22) satisfies the criterion defined in Eq. (2.20) can be found in [16].

2.2.2 Wiener Filters with Steepest Descent Algorithm

Figure 2-1 depicts a Wiener filter with fixed tap weights, where the tap weights are optimal and are calculated using the Wiener-Hopf equation. There is no adjustment to the weights. By incorporating the method of steepest descent, a new structure of the Wiener filter with weight adjustment is shown in Figure 2-2.

Figure 2-2. Block Diagram of an Adaptive FIR Filter

If the cost function J(w) is the mean square error, the gradient g(n) may be expressed in terms of the autocorrelation matrix of the filter inputs and the cross-correlation vector between the filter input and the desired response [16]. Eq. (2.22) can then be rewritten as

w(n+1) = w(n) + μ[p - R w(n)]    (2.23)

where p denotes the cross-correlation vector, R denotes the autocorrelation matrix and μ denotes the step size. In order to guarantee convergence of the steepest descent algorithm, two conditions must be satisfied:

* The process is wide-sense stationary.
* 0 < μ < 1/λmax, where λmax is the largest eigenvalue of R.
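A minimal sketch of the idealized iteration in Eq. (2.23), added here for illustration and assuming R and p are known a priori (the two-tap values are hypothetical); the fixed point of the iteration is the Wiener solution wo = R^-1 p:

```python
import numpy as np

def steepest_descent(R, p, mu, n_iter=200):
    """Iterate w(n+1) = w(n) + mu*(p - R w(n))  (Eq. 2.23)."""
    w = np.zeros_like(p)              # common all-zero initial guess
    for _ in range(n_iter):
        w = w + mu * (p - R @ w)
    return w

R = np.array([[1.0, 0.5],             # hypothetical autocorrelation matrix
              [0.5, 1.0]])
p = np.array([0.7, 0.3])              # hypothetical cross-correlation vector

# lambda_max = 1.5 here, so mu = 0.5 respects the bound of Section 2.2.2.
print(steepest_descent(R, p, mu=0.5))
print(np.linalg.solve(R, p))          # Wiener solution, for comparison
```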
2.3 Least Mean Square Algorithm

The most widely used adaptive algorithm is the Least Mean Square (LMS) algorithm. The key feature of the LMS algorithm is its simplicity. It requires neither any measurement of the correlation functions, nor any matrix inversion or multiplication.

2.3.1 Overview

The LMS adaptive filter bears the same structure as the one shown in Figure 2-1. The filter output y(n) should be made to resemble the desired signal d(n); the difference of d(n) and y(n) is the error signal e(n). As described in Section 2.2, a linear adaptive filter consists of two basic processes. The first process involves performing the convolution sum of the filter taps with the tap weights. The other process involves performing the adaptation process on the tap weights. In the case of the LMS algorithm, the weight adjustment requires the current error signal e(n) along with the filter taps to produce the updated tap weight vector. Details of the algorithm are given in the next section.

2.3.2 The Algorithm

The steepest descent method has progressed from a fixed tap-weight structure to a step-by-step adaptive structure. However, when applying the steepest descent method to the Wiener filter, we still require prior knowledge of the autocorrelation matrix R and the cross-correlation vector p. In order to avoid measurement of any correlation function, avoid any matrix computations, and establish a truly adaptive system, estimates of R and p are calculated using only the available data. The simplest estimation may use only the currently available taps and the current desired response to estimate the autocorrelation matrix and cross-correlation vector. The new equation to adapt tap weights using the instantaneous taps and desired response, according to Eq. (2.23), is therefore given as follows:

w(n+1) = w(n) + μ u(n)[d(n) - u^T(n)w(n)]    (2.24)

Since the filter output is the convolution sum of the taps and tap weights, or

y(n) = u^T(n)w(n)    (2.25)

and the estimated error signal e(n) is defined as the difference between the desired response and the filter response, or

e(n) = d(n) - y(n)    (2.26)

Eq. (2.24) can be rewritten in terms of the error signal and the taps:

w(n+1) = w(n) + μ u(n)e(n)    (2.27)

Eq. (2.27) is the formula for the LMS algorithm. As illustrated in the equation, each tap weight adaptation at each time interval requires merely the knowledge of the current taps and the current error signal, which is produced with the knowledge of the desired response. The algorithm does not require any prior knowledge of the entire autocorrelation matrix or the cross-correlation vector, nor does it require matrix computations. The algorithm requires an initial "guess" of the tap weight vector. In general, if no prior knowledge of the environment is available, the tap weight vector is initialized to all zeros.

The step size parameter, μ, plays an important role in determining the LMS algorithm's speed of convergence and misadjustment (the difference between the true minimum cost value Jmin and the minimum cost value produced by the LMS algorithm). Unfortunately, there is no clear mathematical analysis to derive these quantities; only through experiments may we obtain a feasible solution. Several authors, including the authors of [1], have proposed modified LMS algorithms in which the step size parameter is a part of the adaptation along with the tap weights. In general, μ should obey the following inequality:

0 < μ < 2/(M·Smax)    (2.28)

where M is the filter length and Smax is the maximum value of the power spectral density of the tap inputs [16].
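A minimal floating-point sketch of the complete recursion, Eqs. (2.25)-(2.27), is shown below (added for illustration; the finite precision variants are the subject of Chapter 3):

```python
import numpy as np

def lms(x, d, M, mu):
    """LMS transversal filter: returns output y, error e, final weights w."""
    w = np.zeros(M)                   # all-zero initial guess
    u = np.zeros(M)                   # tap-delay line [u(n), ..., u(n-M+1)]
    y = np.zeros(len(x))
    e = np.zeros(len(x))
    for n in range(len(x)):
        u[1:] = u[:-1]                # shift the delay line
        u[0] = x[n]
        y[n] = w @ u                  # convolution stage, Eq. (2.25)
        e[n] = d[n] - y[n]            # error, Eq. (2.26)
        w = w + mu * u * e[n]         # adaptation stage, Eq. (2.27)
    return y, e, w
```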
2.3.3 Applications

The LMS algorithm is considered the most widely used adaptive algorithm for many signals and systems applications. Here we present two applications as examples.

2.3.3.1 Adaptive noise cancellation

Figure 2-3 describes a simple structure for interference noise cancelling, where the desired response is composed of a signal s(n) and a noise component v(n), which is uncorrelated with s(n). The filter input is a sequence of noise, v'(n), which is correlated with the noise component in the desired signal. By using the LMS algorithm inside the adaptive filter, the error term e(n) produced by this system is the original signal s(n) with the noise signal v(n) cancelled.

Figure 2-3. Adaptive Noise Cancellation Block Diagram

2.3.3.2 Adaptive line enhancement

A sinusoidal waveform, denoted by s(n), is transmitted through a medium and is corrupted by noise, denoted by v(n). A delayed version of this corrupted signal serves as the input of the LMS adaptive filter, and the original corrupted signal serves as the desired signal. The adaptive filter's output y(n) becomes an enhanced version of the original sinusoid. The block diagram for the line enhancer is shown in Figure 2-4.

Figure 2-4. Adaptive Line Enhancer Block Diagram
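As a usage example (added here; it reuses the lms sketch from Section 2.3.2, and all signals are synthetic stand-ins chosen for illustration), the noise-cancelling configuration of Figure 2-3 can be simulated directly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = np.arange(20000)
s = np.sin(2 * np.pi * 0.01 * n)               # signal of interest s(n)
v = 0.8 * rng.standard_normal(len(n))          # corrupting noise v(n)
v_ref = 0.9 * v - 0.4 * np.roll(v, 1)          # correlated reference v'(n)

d = s + v                                      # desired input of Fig. 2-3
y, e, w = lms(v_ref, d, M=8, mu=0.01)          # filter input: v'(n)

# After convergence, e(n) = d(n) - y(n) approximates s(n).
print("residual noise power:", np.mean((e[5000:] - s[5000:]) ** 2))
```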
CHAPTER 3
FINITE PRECISION EFFECTS ON ADAPTIVE ALGORITHMS

Theories of adaptive algorithms such as the LMS algorithm presented in Chapter 2 assume the systems to be models with real values; that is, the systems retain infinite precision for the input signal, the internal calculations, and the result of the system. But in reality, the computers or digital hardware that implement adaptive algorithms all involve finite precision architectures. The analog input signals have to first be converted to digital form before they are fed into the system; the arithmetic results have to be quantized or even scaled to prevent overflow of the registers. If not dealt with carefully, these factors can cause a disastrous outcome for the adaptive system.

There are two ways to represent a value in finite precision: fixed-point and floating-point. In fixed-point representation, the radix point is fixed by specifying the number of bits for the integer part and the number of bits for the fractional part. Although it has a restricted dynamic range of representable numbers, the fixed-point representation's resolution is fixed. In floating-point representation, the total number of bits is fixed but the radix point can "float," resulting in a wider dynamic range of representable numbers. However, since the radix point floats, the resolution is not fixed, and therefore quantization is required after both additions and multiplications, which creates more quantization noise. Conversely, quantization is required only after multiplications in fixed-point arithmetic. Since this chapter deals with minimizing the effects of finite precision, it is desirable to choose fixed-point representation for analysis. Additionally, since the radix point is fixed in fixed-point representation, adders and multipliers have much simpler logic equations than in floating-point representation. This leads to simpler circuit design and better circuit performance in terms of speed. For hardware implementations of DSP applications, it is advantageous to choose fixed-point-based architectures. This chapter presents some of the common effects, as well as some well-known techniques against these effects, in dealing with finite precision adaptive systems.

3.1 Quantization Effects

Due to the finite precision architectures of most digital hardware, the analog input signal, as well as each register that holds any intermediate or final arithmetic result, has to be quantized to a certain wordlength. Quantization can be done in two ways: rounding and truncation. These two techniques are discussed in detail in this section. The quantizing step is defined as the weight of the least significant bit of the binary representation and is denoted by q. It will be shown that the errors created by quantization are directly related to the quantizing step.

3.1.1 Rounding

Quantization by rounding maps an infinite precision value to the finite precision code whose value is closest to the actual value [8]. If q is the quantizing step, sampled values lying between (n - 1/2)q and (n + 1/2)q are all rounded to nq. Mathematically, rounding can be expressed as follows:

fr(nT) = nq,  (n - 1/2)q ≤ f(nT) < (n + 1/2)q    (3.1)

Figure 3-1 shows the result of rounding an arbitrary continuous sinusoid to the nearest integer values, i.e., q = 1.

Figure 3-1. Rounding Effects

Let x be the error caused by rounding; x can then be assumed to be a uniformly distributed random variable between -q/2 and q/2. The probability density function of the rounding error, according to the definitions given in [22], is shown in Eq. (3.2):

pr(x) = 1/q for -q/2 ≤ x ≤ q/2, and 0 otherwise    (3.2)

Since the probability density function of the rounding error is uniform between -q/2 and q/2, the expectation of the rounding error, denoted by Er(x), is given by

Er(x) = ∫ x p(x) dx = (1/q) ∫[-q/2, q/2] x dx = 0    (3.3)

The variance, or power spectral density, of the rounding error, denoted by σr², follows from its definition and is equal to

σr² = Er(x²) - [Er(x)]² = (1/q) ∫[-q/2, q/2] x² dx = q²/12    (3.4)

3.1.2 Truncation

Quantization by truncation maps an infinite precision value to the finite precision result that is closest to, but always less than, the value [8]. Again, if q is the quantizing step, values lying between nq and (n+1)q are truncated to nq. Truncation is expressed in the following equation:

ft(nT) = nq,  nq ≤ f(nT) < (n+1)q    (3.5)

Figure 3-2 shows the truncated result of the same continuous signal used in Figure 3-1, truncated to the nearest integer values with sampling period T = 0.1.

Figure 3-2. Truncation Effects

Let x be the error caused by truncation; x can again be assumed uniformly distributed, this time between -q and 0. The probability density function of the truncation error is therefore

pt(x) = 1/q for -q ≤ x ≤ 0, and 0 otherwise    (3.6)

Again assuming the probability density function of the truncation error is uniform between -q and 0, the expectation of the truncation error, denoted by Et(x), is given by

Et(x) = ∫ x p(x) dx = (1/q) ∫[-q, 0] x dx = -q/2    (3.7)

The power spectral density of the truncation error, denoted by σt², is equal to

σt² = Et(x²) - [Et(x)]² = (1/q) ∫[-q, 0] x² dx - q²/4 = q²/12    (3.8)

3.1.3 Rounding vs. Truncation

From the above derivations of both the mean and the variance (power) of the two quantization techniques, we can see that although they produce the same error power, rounding a number results in zero mean error, while truncation results in a mean error of -q/2. The errors associated with a nonzero mean, although small, tend to propagate through the filter [8]. This is especially true in adaptive filters: the filter is not only a linear system, in that any error term is processed by the filter just as an input and thus contaminates the output of the filter, but also a feedback system, in that error produced at the output circulates back into the filter to create even more errors. Therefore, rounding is more attractive than truncation when it comes to signal quantization. Simulation results in Section 3.4.1 will verify this finding.
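The two quantizers and the statistics of Eqs. (3.3)-(3.8) are easy to check numerically; the brief sketch below is an added illustration with hypothetical values:

```python
import numpy as np

def q_round(x, q):                 # rounding, Eq. (3.1)
    return q * np.round(x / q)

def q_trunc(x, q):                 # truncation, Eq. (3.5)
    return q * np.floor(x / q)

rng = np.random.default_rng(3)
x = rng.uniform(-4, 4, 1_000_000)
q = 2.0 ** -4

for name, quantize in (("rounding", q_round), ("truncation", q_trunc)):
    err = quantize(x, q) - x
    print(f"{name:10s} mean={err.mean():+.2e}  variance={err.var():.2e}")

print("-q/2 =", -q / 2, "  q^2/12 =", q * q / 12)  # theoretical values
```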
3.2 Input Quantization Effects

Before an analog signal may be accepted for processing by a digital system, such as a computer or microprocessor, it must be converted into digital form. The first step in the digitization process is to take samples of the signal at regular time intervals, converting a continuous signal with time variable t into instances with sample variable n. Next, the instances are quantized; that is, the amplitudes of the instances are converted into discrete levels, which are assigned as quantization levels. Finally, the quantized instances are encoded into a sequence of binary codes according to each instance's quantization level. This process of sampling, quantization and encoding is usually called analog-to-digital (A/D) conversion. The difference between the actual analog input sample and the corresponding binary-coded quantized value is called quantization noise and is the first source of degradation [3]. As shown in Section 3.1, if rounding is used, the mean error is zero and the power spectral density is q²/12. After quantization, the input to the filter becomes

fq(nT) = f(nT) + ε(nT)    (3.9)

where f(nT) is the original sampled signal and ε(nT) is the quantization noise. Since the filter is a linear system, the noise signal is also filtered by the filter's transfer function. We now show how the newly introduced noise term affects the filter's output. Let l be the number of bits used to represent the quantized signal; then the signal's maximum allowable amplitude is

Amax = q·2^(l-1)    (3.10)

Further, the signal's peak power, denoted by Pc, is defined as the power which the quantized signal can pass without clipping. Thus, Pc is given by

Pc = (1/2)·Amax² = (1/2)·q²·2^(2l-2) = q²·2^(2l-3)    (3.11)

Under the assumption that the quantization noise has zero mean and variance q²/12, that is, rounding is used instead of truncation, the ratio of the peak power to the input quantization noise, denoted by Ri, is therefore

Ri = Pc/σr² = 3·2^(2l-1)    (3.12)

or

SNRi = 6.02·l + 1.76 dB    (3.13)

For example, a 16-bit input quantizer's signal to noise ratio is, ideally, according to Eq. (3.13), approximately 100 dB. The calculation is done without considering any other noise source. In practice, in order to obtain the desired signal to noise ratio, one more bit is added to ensure the filter's ideal SNR performance.
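For reference, Eq. (3.13) in code (an added one-liner confirming the 16-bit figure quoted above):

```python
def input_snr_db(l):        # Eq. (3.13): ideal peak SNR of an l-bit quantizer
    return 6.02 * l + 1.76

print(input_snr_db(16))     # 98.08 dB, i.e. roughly 100 dB
```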
3.3 Arithmetic Rounding Effects

Digital implementations of filters, including adaptive filters, rely heavily upon arithmetic operations. There are two processes involved in an adaptive system: the convolution of the tap weights with the taps, and the adaptation process to update the coefficients. The multiply-and-accumulate (MAC) operation is central to performing these two processes. Specifically, for an adaptive FIR filter using the LMS algorithm, (M + 1) multiply-and-accumulate operations are needed for calculating the convolution, where M is the filter order. On top of that, referring to the LMS equation given in Eq. (2.27), each tap weight update requires a MAC operation. Therefore, 2 × (M + 1) MAC operations are needed for an adaptive FIR filter with the LMS algorithm. Note that Eq. (2.27) involves two multiplications before a tap weight is updated, but if the power-of-two scheme is used, the step-size multiplication becomes a bitwise shift right operation. Details of this scheme are discussed in Chapter 5. As stated earlier, if fixed-point representation is used, quantization only needs to be performed after multiplications, not after additions. Therefore, the source of quantization noise is the multiplications at both the convolution stage and the adaptation stage. The effects of product quantization are discussed below.

3.3.1 Product Rounding Effects

Consider the fixed-point MAC unit shown in Figure 3-3, where two N-bit numbers are multiplied, the 2N-bit product is rounded to an N-bit result, and the result is then accumulated with another N-bit number to give an N-bit MAC result.

Figure 3-3. MAC Unit Block Diagram

Assuming the quantization is done by rounding, the same statistical results hold for product quantization: the error created by rounding has power spectral density q²/12. Since the adaptive LMS filter contains 2 × (M + 1) MAC operations, and again assuming the absence of any other noise source, the total error power spectrum produced by product quantization is

σp² = 2(M + 1)·q²/12 = (M + 1)·q²/6    (3.14)

Given the peak power Pc defined in Eq. (3.11), the ratio of the peak power to the product quantization noise, denoted by Rp, is therefore

Rp = Pc/σp² = q²·2^(2l-3) / [(M + 1)·q²/6] = (3/4)·2^(2l)/(M + 1)    (3.15)

or

SNRp = 6.02·l - 10·log(M + 1) - 1.25 dB    (3.16)

For example, a 9th-order LMS FIR adaptive filter with 16-bit wordlength has a signal to noise ratio of about 85 dB due to product quantization. Again, the calculation assumes no other noise sources.

3.3.2 Coefficient Rounding Effects

In this section, we wish to analyze how product quantization noise is created due to coefficient rounding in the tap weight adaptation. The LMS algorithm updates the filter's coefficients, or tap weights, according to Eq. (2.27), which is replicated here:

w(n+1) = w(n) + μ u(n)e(n)    (3.17)

As shown in the above equation, the update parameter, namely μ u(n)e(n), must be quantized to a wordlength less than or equal to that of w(n) in order to produce the proper result for the updates. Again, the update parameter involves only one set of multiplications if the step size parameter is a power of two. The quantization of the update parameter results in the quantization noise described in the previous section; that is, for an Mth-order FIR filter, the tap weight updates result in a noise power of (M + 1)·q²/12.

Since coefficient quantization is performed on the tap weights, i.e., before the convolution stage, the quantization noise associated with coefficient quantization is also processed by the convolution stage. Therefore, adaptive systems are more sensitive to coefficient quantization. Coefficient quantization may result in the slowdown or stalling phenomena, in which the rate of convergence is slower, or, after convergence, the tap weights fail to match the weights that would be obtained if infinite precision were used. The slowdown and stalling phenomena are studied in the next section. Furthermore, noise produced by coefficient quantization can be potentially hazardous if an IIR filter structure is used, since the coefficients directly affect the stability of an IIR filter: any noise introduced into the coefficients may shift the poles outside of the unit circle and cause the IIR filter's output to diverge.
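The sketch below (an added illustration with a hypothetical wordlength) mimics the MAC unit of Figure 3-3 in floating point by rounding every product to the grid q before accumulation, and evaluates Eq. (3.16) for the example just given:

```python
import numpy as np

def q_round(x, q):                        # quantizer from Section 3.1
    return q * np.round(x / q)

def mac_quantized(w, u, q):
    """Fixed-point-style MAC: each product is rounded to the grid q
    before accumulation; the sums themselves need no quantization."""
    acc = 0.0
    for wi, ui in zip(w, u):
        acc += q_round(wi * ui, q)        # each rounding adds ~q^2/12 noise
    return acc

def product_snr_db(l, M):                 # Eq. (3.16)
    return 6.02 * l - 10 * np.log10(M + 1) - 1.25

print(product_snr_db(16, 9))              # about 85 dB, as quoted above
```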
3.3.3 Slowdown and Stalling

The LMS algorithm may stop adapting due to the finite precision implementation of the digital hardware. If the result of the update parameter, namely μ e(n)u(n), is less than the least significant bit of the binary representation after quantization, that is, if

Q(μ e(n)u(n)) < q    (3.18)

where q is the quantizing step, the adaptation fails to update, because an update parameter smaller than q is quantized to zero. The step size parameter μ plays an essential role in LMS algorithm stalling. It is shown in [7] that by incorporating a lower bound for μ, the stalling phenomenon can be avoided. The lower bound is described below:

μ > q / [4(σe² + σn²)]    (3.19)

where σe² and σn² denote the variance of the error signal and the variance of the quantization noise, respectively. By combining Eq. (3.19) with Eq. (2.28), the range of μ is restricted to the following:

q / [4(σe² + σn²)] < μ < 2/(M·Smax)    (3.20)

Also, according to [23], with fixed-point arithmetic it can be advantageous to leave μ at a higher value when possible.

The sign algorithm is another way of preventing stalling and is presented in [19]. Instead of calculating the update parameter by multiplying the tap and the error term, the sign algorithm only takes the sign of the error term into consideration. That is, the update parameter is calculated as follows:

w(n+1) = w(n) + μ u(n)·sign[e(n)]    (3.21)

The sign algorithm decreases the chance of stalling and simplifies the hardware requirements. Since no multipliers are needed to update the tap weights, the sign algorithm also decreases the noise created by product quantization. Although the sign algorithm introduces nonlinearity into the adaptation process, it does not prevent the algorithm from converging. However, the sign algorithm will always converge more slowly than the LMS algorithm [5].

Another method, involving dithering, is proposed by [16] to prevent stalling. Here dithers are inserted at the inputs of the quantizers of the update parameters, where a dither consists of a random sequence that, when added to the input, guarantees the input to be greater than the quantization step. The effect of the additive dither can be eliminated by shaping the power spectrum of the dither so that it is rejected by the algorithm.

The LMS algorithm running under finite precision may also encounter the slowdown phenomenon, in which the effect of quantization causes the rate of convergence to be slower than in its infinite precision counterpart. In this case, the tap weights may achieve the intended values, only at a slower rate. The slowdown phenomenon can be eliminated by a proper choice of data and coefficient wordlengths. It is shown in [15] that for most practical cases, more bits should be allocated to the coefficients than to the input data to prevent slowdown.
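A small numerical illustration (added here; all values are hypothetical) of stalling, Eq. (3.18), and of the sign algorithm's resistance to it, Eq. (3.21):

```python
import numpy as np

def q_round(x, q):
    return q * np.round(x / q)

q = 2.0 ** -8                              # quantizing step
mu = 0.05
u_n = np.array([0.05, -0.04])              # current taps (hypothetical)
e_n = 0.02                                 # current error (hypothetical)

# LMS update: every term is below q/2 in magnitude and quantizes to zero.
print(q_round(mu * e_n * u_n, q))          # -> [ 0.  0.]  (stalling)

# Sign algorithm: dropping the magnitude of e(n) leaves updates large
# enough to survive quantization, so adaptation continues.
print(q_round(mu * u_n * np.sign(e_n), q)) # -> nonzero updates
```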
3.3.4 Saturation

A filter's internal registers that hold arithmetic results are of fixed size. It is possible for an arithmetic result to overflow during addition or multiplication; that is, the number of bits used to represent the integer part of a summation does not store all the necessary information. Such a phenomenon is called saturation. For example, refer to Figure 3-3, which shows a MAC operation on two N-bit numbers. Saturation may occur when two N-bit numbers are added to produce an N-bit sum, since (N + 1) bits are needed to represent a full addition without risking saturation. Similarly, saturation can also occur when two N-bit numbers are multiplied and the product is quantized to M bits, where M < 2N. Saturation can introduce major distortions into a system's output, since a large amount of information vanishes with the loss of the upper significant bits of the addition or multiplication result. Saturation can render a filter useless. Therefore, it is essential for the filter designer to study the nature of the input data to eliminate the effects of saturation.

One of the most common solutions for saturation is to scale the input signals [8]. By scaling down the input signals, the probability of any internal arithmetic overflow is decreased. However, as suggested in [25], input scaling also decreases the precision of the data and may result in rough filter outputs or even stalling. This is of particular interest for the LMS adaptive filter, since the criterion for the performance of such a filter is the misadjustment of the error signal. Misadjustment, as defined in Chapter 2, is the difference between the optimum Wiener solution and the solution produced by the LMS adaptive filter. Therefore, a tradeoff exists as to the amount of scaling applied to the input signal: it must avoid saturation while at the same time minimizing the misadjustment caused by the scaling itself. The only way to achieve this goal is to carefully study the nature of the input data and calculate an upper bound on the magnitude of the input signals.

Besides scaling the input signals, increasing the wordlength, that is, increasing the number of bits in each register, can also reduce the effect of saturation. However, this technique may not be available for some digital implementations. For example, common DSP processors have fixed wordlengths that cannot be modified. Also, wordlength increases introduce more hardware and reduce the speed of the digital hardware considerably.

Another way to minimize the effects of saturation, proposed by [25], is called clamping. Clamping will, upon detecting an overflow, clamp the adder's output to the most positive or most negative value. That is, the output of an N-bit adder is defined as follows:

result = 2^(N-1) - 1,  if sum ≥ 2^(N-1)
result = sum,  if -2^(N-1) ≤ sum < 2^(N-1)
result = -2^(N-1),  if sum < -2^(N-1)    (3.22)

Note that Eq. (3.22) assumes 2's complement form for arithmetic operations.
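Eq. (3.22) translates directly into a saturating adder; a minimal sketch (added for illustration) for N-bit two's-complement values:

```python
def clamp_add(a, b, n_bits):
    """N-bit two's-complement addition with clamping (Eq. 3.22): on
    overflow the output sticks at the most positive/negative value."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    s = a + b
    return max(lo, min(hi, s))

print(clamp_add(100, 50, 8))     # 127, instead of wrapping around to -106
print(clamp_add(-100, -50, 8))   # -128
```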
3.3.5 Solutions for Arithmetic Quantization Effects

Eweda in [10] proposes an algorithm in which the tap weight updates are repeatedly frozen for a certain period of time and then updated on the basis of the average innovation during the freezing period. During each innovation period, the adaptation parameter, i.e., u(n)e(n), is accumulated, and the update is only performed at the end of the innovation period. This innovation period accumulation can smooth out the quantization errors and therefore increase the output SNR. It is also shown in [11] that the quantization noise can be reduced exponentially by increasing the wordlength of the registers. For the same reason stated earlier, this technique may not be available. If wordlength increases are in fact available, commercial software exists for wordlength optimization in DSP applications; such software usually includes the synthesis tool presented in [18].

3.4 Simulation Result

Throughout this section, one particular application of the LMS algorithm, namely the system identification application, is used. Consider the module depicted in Figure 3-4, where the LMS adaptive filter is to model an unknown system by using the unknown system's output as the desired signal of the adaptive filter. The adaptive filter's task is to adapt its tap weights such that its output matches the unknown system's output.

Figure 3-4. System Identification Block Diagram

3.4.1 Rounding vs. Truncation

An experiment is set up to verify the conclusion drawn in Section 3.1, that is, that for signal quantization, rounding creates less quantization noise than truncation. Referring to Figure 3-4, both the input and desired signals are quantized before being fed into the adaptive filter. Arithmetic quantization is not considered at this stage; in other words, the results of the convolution sum and of the adaptation process are not quantized. Since the LMS algorithm uses minimum mean square error as its criterion, we can safely opt for rounding over truncation if rounding produces less mean square error than truncation.

Figure 3-5. Experimental Setup for Rounding vs. Truncation

The two quantization techniques are tested in the two quantizers shown in Figure 3-5. The adaptive filter length is fixed at four, and the input sequence consists of 5000 normally distributed random samples. Additionally, the quantizing step q is chosen to take the following values: [2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6]. At each value of q, the misadjustment produced by the adaptive system is captured for both rounding and truncation, and the result is shown in Figure 3-6. As shown in Figure 3-6, rounding clearly produces less noise than truncation at each value of q, and only as the quantization step decreases do the effects of truncation become comparable to those of rounding.

Figure 3-6. Simulation Result for Rounding vs. Truncation (misadjustment vs. fractional wordlength)

3.4.2 Effects of Product Rounding at the Convolution Stage

In this section, we wish to experiment further with the effects of quantization. In addition to the quantizers shown in Figure 3-5, rounding is also performed at each multiplication in the convolution stage. Referring to Figure 3-7, for the same 4th-order adaptive filter used in the previous section, four more quantizers are added.

Figure 3-7. Additional Quantizers at the Convolution Stage

We again examine the effects of product quantization over the set of q values [2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6]. For each value of q, the adaptive filter's misadjustment is captured and plotted. The simulation result is shown in Figure 3-8: as the quantization step decreases, so does the quantization noise caused by the multipliers.

Figure 3-8. Effects of Product Quantization at the Convolution Stage

The figure also verifies the conclusion drawn in Eq. (3.14), which shows that the error power spectrum decreases exponentially as the quantization step decreases.

3.4.3 Effects of Product Rounding at the Adaptation Stage

Coefficient rounding contributes the greater share of the product quantization noise. In this section, the update parameters are also quantized. The same structure is used as in the previous sections, and the same set of normally distributed data is applied. Referring to Figure 3-9, quantization is also performed at the adaptation stage.

Figure 3-9. Additional Quantizers at the Adaptation Stage

The simulation result for this experiment is plotted in Figure 3-10. Note that two sets of misadjustments are plotted: the red bars correspond to misadjustment due to product quantization at the convolution stage, whereas the blue bars correspond to misadjustment due to quantization at the adaptation stage. Clearly, quantization at the adaptation stage creates significantly larger noise than at the convolution stage, for the reason stated earlier. It is apparent that an adaptive filter's performance is more sensitive to coefficient quantization noise. Thus, as suggested in Section 3.3.3, more bits should be allocated for coefficient representation.

Figure 3-10. Effects of Product Quantization at the Convolution and Adaptation Stages
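The experiments of Sections 3.4.1-3.4.3 follow the same pattern; below is a condensed, hypothetical harness added for illustration (the thesis's actual MATLAB scripts are in Appendix A) that measures steady-state error with quantization at the convolution and adaptation stages:

```python
import numpy as np

def q_round(x, q):
    return q * np.round(x / q)

def system_id_mse(q, quantize_adapt, n=5000, mu=0.05, seed=4):
    """System identification as in Fig. 3-4 with a quantized LMS filter;
    the 4-tap 'unknown system' below is hypothetical."""
    rng = np.random.default_rng(seed)
    h = np.array([0.4, -0.2, 0.1, 0.05])
    x = rng.standard_normal(n)
    d = np.convolve(x, h)[:n]               # unknown system's output
    w, u = np.zeros(4), np.zeros(4)
    err = np.zeros(n)
    for i in range(n):
        u[1:] = u[:-1]
        u[0] = x[i]
        y = np.sum(q_round(w * u, q))       # product rounding (Sec. 3.4.2)
        err[i] = d[i] - y
        upd = mu * err[i] * u
        if quantize_adapt:                  # coefficient rounding (Sec. 3.4.3)
            upd = q_round(upd, q)
        w = w + upd
    return np.mean(err[n // 2:] ** 2)       # steady-state MSE proxy

for q in (2.0 ** -k for k in range(1, 7)):
    print(q, system_id_mse(q, False), system_id_mse(q, True))
```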
3.4.4 Clamping Technique

An experiment is set up to simulate the saturation phenomenon in an adaptive LMS filter. The system identification setup described in Figure 3-4 is again used, where tap weight adaptation is performed so that the adaptive filter's output matches the unknown system's output. For simplicity, all inputs are positive. An upper bound is set on the wordlength of results from both multiplications and additions. If the wordlength of a result exceeds this upper bound, two scenarios are tested: one is to do nothing, that is, the uppermost significant bits are lost due to saturation; the other uses clamping, in which, upon detection of saturation, the result is clamped to the most positive number that the upper bound can represent. A set of normally distributed data is tested in this experiment, where the adaptive filter's ideal tap weights after convergence are [4 5 1]. The results of this experiment are shown in Figure 3-11 and Figure 3-12, where both the misadjustment curve and the tap weights are plotted.

Figure 3-11. Tap Weight Track for Clamping Technique

In Figure 3-11, the blue lines track the tap weights if no clamping is used, whereas the red lines track the tap weights if clamping is used. The black lines represent the tap weights of a 64-bit floating-point system, which is considered ideal. It is apparent that the tap weights simply diverge if clamping is not used, and the divergence of the tap weights indicates that the adaptive filter has become ineffective. Figure 3-12 shows the misadjustment plot of the experiment. The mean square error of each system is captured every 30 samples. As can be seen, the mean square error of the non-clamping result is never reduced, due to tap weight divergence, whereas in the clamping case the misadjustment is very close to the ideal result.

Figure 3-12. Misadjustment Plot for Clamping Technique

3.4.5 Sign Algorithm

The sign algorithm presented in the previous section is a way of preventing stalling when the update parameter is less than the quantizing step. System identification is again used in this simulation. A set of small-scale input and desired signals is used, and various quantizing step values are tried. It was determined that for quantizing steps larger than 2^-4, the tap weights simply diverge. Therefore, quantizing steps q = [2^-3, 2^-4, 2^-5] are used for this experiment. The effectiveness of the sign algorithm relative to the LMS algorithm is studied over these q values. Figure 3-13 shows the misadjustment plot for the adaptive filter with the same input sets and the same filter order, for the various q values. Misadjustment is again captured every 30 samples. The step size for the sign algorithm is chosen slightly larger than for the LMS algorithm in order for it to converge, for the reason stated in [7]. As shown in Figure 3-13, the tap weights diverge when q = 2^-3 due to insufficient fractional bits. In the case of q = 2^-4, due to limited precision, the LMS algorithm stalls and produces larger misadjustment than the sign algorithm; that is, the sign algorithm obtains a better convergence result than the LMS algorithm. Only by decreasing q is the LMS algorithm able to outperform the sign algorithm, as can be seen in the case of q = 2^-5.
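The sign algorithm itself is a one-line change to the LMS update (compare sign_LMS.m in Appendix A): the quantized error is replaced by its sign, so the update magnitude is set by the step size and the input alone and does not shrink below the quantizing step as the error becomes small:

% LMS update:            w = w + mu*e*X;
% Sign-algorithm update: w = w + mu*sign(e)*X;

This is also why the sign algorithm needs the slightly larger step size noted above: the update no longer scales down with the error as convergence is approached.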
Figure 3-13. Misadjustment for Sign Algorithm vs. LMS

3.5 Remarks

The effects of finite precision on adaptive systems have been presented in this chapter. Due to quantization at various stages of the system, quantization noise is introduced, and this noise propagates through the system just as an input signal does. Because of quantization noise, the saturation and stalling phenomena may occur and severely diminish the adaptive filter's performance. Some techniques that help reduce these effects were presented. However, quantization noise cannot be eliminated, and the system engineer must therefore study and make tradeoffs between the performance and the practicality of the system.

CHAPTER 4
SOFTWARE SIMULATION OF A FIXED-POINT-BASED POWER-OF-TWO ADAPTIVE NOISE CANCELLER

The effects of finite precision were elaborated in Chapter 3. In this chapter, we wish to translate theory into practice by comparing a floating-point-based system with a fixed-point-based system. As stated in Chapter 3, a floating-point-based system can represent a larger dynamic range of data at the cost of losing resolution and introducing more quantization noise, whereas a fixed-point-based system's dynamic range is limited with respect to its quantizing step but holds the advantage of simpler circuit design, since additions and multiplications are composed of simpler logic equations. Therefore, for the implementation of a finite-precision adaptive system, a fixed-point architecture is preferred over floating-point. It is the goal of this chapter to establish the feasibility of implementing a fixed-point-based adaptive system, given its simplicity.

As described in Chapter 2, the LMS algorithm is the most widely used adaptive algorithm and has many applications. Two examples were explored in Chapter 2, namely the noise canceller and the line enhancer. In this chapter, a software simulation of a noise canceller is implemented, with the LMS algorithm fixed-point based. The step size parameter follows the power-of-two scheme, that is, μ can only take on values of 2^-n, where n is a positive integer.

Consider a scenario in which a speaker is giving a speech while the housekeeper insists on vacuuming the floor at the same time. The vacuuming noise obscures the speech to the extent that it is not audible. The contaminated speech, i.e., the original speech plus noise, and the noise itself are recorded. An experiment is set up to use the adaptive noise canceling technique to retrieve the original speech. The noise signal itself serves as the primary filter input, and the contaminated signal is the reference input, or the desired signal, to the system. We wish to investigate the effect of finite wordlength on this particular application. Specifically, can the speech be recovered by this integer-based system? And how much does this fixed-point-based system differ from a floating-point-based counterpart? If the fixed-point-based system makes no striking difference in the outcome of the noise canceller, i.e., the original speech can still be recovered and understood by a human listener, then a hardware implementation based on this software experiment becomes feasible, since a fixed-point-based adaptive system is ideal in terms of simplicity and practicality.
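Since the power-of-two scheme is central to everything that follows, a minimal numeric sketch may help; the values here are made up for illustration. With μ = 2^-n and integer data, the step-size multiplication reduces to an arithmetic shift right by n bits, and floor division by 2^n gives exactly the same result (it rounds toward negative infinity while preserving the sign):

% Power-of-two step size: multiply-by-mu becomes an arithmetic shift right
n = 7;                    % mu = 2^-7
e = 24;  x = -3;          % illustrative integer error and input samples
p = e*x;                  % update product, p = -72
dw = floor(p / 2^n);      % ASR by n bits: floor(-72/128) = -1
% in hardware this is just rewiring; no divider or floating-point multiplier

This equivalence is what allows the hardware in Chapter 5 to avoid division circuitry altogether.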
4.1 Modular Overview

The adaptive noise canceller block diagram was presented in Figure 2-3 in Chapter 2 and is replicated below in Figure 4-1.

Figure 4-1. Adaptive Noise Canceller Block Diagram

The sampled desired discrete signal, composed of both the speaker's speech and the vacuum noise, serves as the noise canceller's reference signal; another vacuum noise recording, also sampled, serves as the filter's primary input signal. Upon processing, the vacuum noise is reduced through the adaptation of the filter tap weights, and the error signal produced by the adaptive system closely resembles the original speech. Figure 4-2 shows the internal structure of the adaptive filter, including the quantizers that quantize all inputs and tap weights to fixed wordlengths. The filter uses a tap delay line architecture; thus, for an Mth-order filter, M + 1 multiplications are needed at the convolution stage and M + 1 more at the adaptation stage.

Figure 4-2. Internal Structure of the Noise Canceller with Quantizers

4.2 Data Quantization

As seen in Figure 4-2, quantization takes place at four points: at the primary input signal, at the reference signal, and in both the convolution and the adaptation. Rounding is used for quantization. Since the quantization of the primary and reference signals is unavoidable due to A/D conversion, the only source of error the designer can control is the product quantization noise at the convolution and adaptation stages. The quantizing step determines how many fractional bits remain after quantization, and, as established in Chapter 3, the product quantization noise power decreases exponentially as the number of fractional bits increases.

4.3 Simulation Results

The primary and reference signals are assumed to be properly sampled. By experimentation, the filter length is chosen to be four and the step size μ is chosen to be 2^-7. A set of quantizing steps, q = [2^-5, 2^-6, 2^-7, 2^-8], is used to show the misadjustment due to product quantization error. For simplicity, the number of bits representing the integer parts of products is assumed sufficient; that is, saturation is not considered in this experiment. Figures 4-3 and 4-4 show the weight tracks and the misadjustment curves for the various values of q, respectively. The performance of the four fixed-point systems is compared against a 64-bit floating-point system. As can be seen in the figures, when q = 2^-8, the fixed-point system performs just as well as the floating-point system. More importantly, although the speech filtered by the fixed-point-based system is noisier, largely due to quantization noise, the recovered speech remains intact and coherent.

Figure 4-3. Weight Tracks for Fixed-point Systems

Figure 4-4. Misadjustment Plots of Fixed-point Systems and a Floating-point System

The success of this software experiment shows that for adaptive applications such as noise cancellation, the system is not especially sensitive to input A/D conversion and data quantization. As the simulation demonstrates, fixed-point systems with a sufficiently small quantizing step perform just as well as a 64-bit floating-point system.
Without the enormous amount of hardware a floating-point system would require, a hardware implementation of a fixed-point system therefore becomes very appealing and feasible. In fact, Chapter 5 presents a VLSI-based noise canceller that is fixed-point based and takes advantage of the power-of-two scheme.

CHAPTER 5
HARDWARE IMPLEMENTATION OF AN INTEGER-BASED POWER-OF-TWO ADAPTIVE NOISE CANCELLER IN STRATIX DEVICES

Chapter 4 presented a software simulation of an adaptive noise canceller based on a fixed-point approach. From the experiments with the fixed-point-based system, it is believed that noise cancellers are one of the adaptive applications that are practical for a fixed-point-based hardware implementation. DSP applications, including adaptive algorithms, rely heavily on arithmetic operations such as multiplication and addition. By using fixed-point arithmetic only, the adders and multipliers that are essential to DSP applications require far fewer logic elements than they would in a floating-point implementation. In a VLSI circuit design this feature is of particular interest, since VLSI devices have limited logic elements and simpler circuits generally translate into faster performance.

The newest FPGA families, for example Altera's Stratix device family, incorporate embedded DSP blocks within the FPGA chip: dedicated circuitry that performs common DSP operations, including multiply and accumulate. This family of FPGA devices is compared with another family of FPGA devices that does not include embedded DSP blocks. The performance comparison covers two areas: the number of logic elements occupied and the maximum frequency allowed. The power-of-two scheme is used to avoid implementing area-consuming division circuitry. The Quartus II software package is used to produce a waveform simulation, which is presented along with a waveform captured by a logic state analyzer to verify the hardware functionality.

DSP applications, including adaptive systems, have traditionally been implemented using general-purpose DSP processors due to their ability to perform fast arithmetic operations. Advancements in FPGA devices, including the embedded DSP blocks, have made FPGA devices serious contenders in the DSP market. It is therefore worthwhile to examine the performance of the adaptive filter implemented in Stratix devices against both a fixed-point DSP processor and a floating-point DSP processor. Two criteria, system speed and power consumption, are examined, and the results are shown in this chapter.

5.1 Stratix Devices

5.1.1 Device Architecture

The Stratix family is the newest family of programmable logic devices from Altera. The Stratix devices have three times the memory-block capacity of traditional FPGAs. The Stratix devices also contain embedded DSP blocks, which have dedicated pipelined multiplier and accumulator circuits. With the embedded DSP blocks, the Stratix devices can perform high-speed multiply-and-accumulate operations. Stratix devices contain a two-dimensional row- and column-based architecture to implement custom logic. A network of row and column interconnects of varying length and speed provides signal interconnections between Logic Array Blocks (LABs), memory blocks, and embedded DSP blocks. Each LAB consists of 10 Logic Elements (LEs). LABs are grouped into rows and columns across the device. The memory blocks are RAM based and provide dedicated simple dual-port or single-port memory up to 36 bits wide, with access speeds up to 291 MHz.
The DSP blocks can implement multiplications of various bit lengths with add or subtract features. The blocks also contain 18-bit input shift registers for applications such as Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) filters. Figure 5-1 shows the block diagram of a typical Stratix device [2].

Figure 5-1. Stratix Device Block Diagram

5.1.2 Embedded DSP Blocks

The most commonly used DSP functions include multiplication, addition, and accumulation. The Stratix devices provide DSP blocks to meet the arithmetic requirements of these functions. Each Stratix device has two columns of DSP blocks that implement DSP functions more efficiently than LE-based implementations. Each DSP block can be configured to support one of the following:

* Eight 9 x 9 bit multipliers
* Four 18 x 18 bit multipliers
* One 36 x 36 bit multiplier

DSP block multipliers can optionally feed an adder/subtractor or accumulator within the block. This feature saves LE routing resources and increases performance, since all interconnections remain within the DSP block. The DSP block input registers can also be configured as shift registers for FIR filter applications. Figure 5-2 is a block diagram of a typical component inside the DSP block.

Figure 5-2. Embedded DSP Block Diagram

5.2 Design Specifications

5.2.1 Structural Overview

The noise canceller implementation assumes an FIR filter structure. The design shown in Figure 5-3 depicts a structural view of such an FIR filter. As shown in the figure, the main components of the filter are m Unit Delay Registers and m + 1 Weight Updates. The Unit Delay Registers are simply D flip-flops. Each Weight Update component updates its filter coefficient according to the LMS equation presented in Chapter 2, Eq. (2.27). The adaptive filter's input is the primary input, which is the vacuum noise. The filter output is subtracted from the desired signal, in this case the original speech plus noise, to produce an error signal. The error signal, i.e., the recovered speech, is buffered and fed back to the Weight Update components to produce the next set of filter coefficients.

Figure 5-3. Adaptive Transversal Filter Block Diagram

5.2.2 The Power-of-Two Scheme

The Weight Updates perform their logic according to Eq. (2.27). The arithmetic operations needed include two multiplications and one subtraction. However, the step size parameter μ is a fractional number that is always less than 1, and multiplying by a fractional number is equivalent to dividing by its reciprocal. Therefore, in order to avoid implementing complicated and area-consuming division circuitry, or floating-point multiplication, an Arithmetic Shift Right (ASR) operation is used instead to simplify the design and boost its run-time frequency.
The ASR operates on a 2's complement integer by shifting the number n bits to the right (toward the least significant bit) while preserving the sign bit (the most significant bit). Shifting a number n bits to the right is equivalent to multiplying it by 2^-n. Therefore, in the interest of simplicity and feasibility, this design restricts the value of μ to μ = 2^-n, where n is a positive integer. This is the so-called power-of-two scheme.

5.2.3 Data Flow and Quantization

As depicted in Figure 5-3, there are two inputs to the system: the primary filter input and the reference, or desired, signal. The adaptive filter's output is subtracted from the desired signal to produce a buffered error signal. This error signal is in turn fed back to all the weight update components for the LMS tap weight updates. In order to preserve the simplicity of the design, all input and output signals share the same wordlength. That is, the primary and reference inputs, the intermediate signals, and the error term all have a wordlength of n bits, including the sign bit. Under this constraint, quantization takes place in the weight update component. According to the weight update equation

w(n+1) = w(n) + μe(n)x(n)     (5.1)

if e(n) and x(n) are both n bits, the product of these two terms has 2n bits. After shifting the product to the right, as described in the power-of-two scheme, the 2n-bit term is quantized into n bits by keeping the least significant (n-1) bits and retaining the sign bit. This n-bit update parameter is then added to the n-bit current tap weight to produce the updated n-bit tap weight. The same quantization technique is applied in all weight update components. In addition to quantization, saturation is another potential hazard, since each addition, in either adaptation or convolution, could saturate. In our adaptive filter design, the nature of the experimental data is first studied to obtain a suitable wordlength, thereby avoiding saturation.

5.3 Dynamic Component Instantiation in VHDL

Referring to the structural diagram shown in Figure 5-3, if the filter length is to be incremented by one, an additional weight update, unit delay, multiplier, and adder all need to be instantiated. Both the length of the adaptive filter and the wordlength of the data bus should be easy to change without spending much time at the architectural level. Since this adaptive filter is written in VHDL, we now show how to dynamically instantiate a component in VHDL. In a separate "header" file, a package is created to include not only the component declarations but also constants such as the filter length and bus width. A portion of the "header" file is shown below:

package header is
  -- fl indicates filter length, or filter order
  -- bussize indicates the size of the input data bus.
  constant fl      : integer := 10;
  constant bussize : integer := 16;

  component generator port(
    clk     : in std_logic;
    reset_L : in std_logic;
    xx      : in std_logic_vector(bussize-1 downto 0);
    ee      : in std_logic_vector(bussize-1 downto 0);
    ww      : buffer std_logic_vector(bussize-1 downto 0));
  end component;
end header;

This header file is included in the project, and upon compiling, the package information is used in the structural port map statements at the top of the hierarchy to determine the number of components to be instantiated. Therefore, by changing the numbers in the package, the designer can dynamically instantiate however many components are needed for a specific design.
For additional helpful VHDL tutorials, please refer to [26].

5.4 Simulation and Implementation Results

It can be argued that since input signals have to be converted from analog to digital, and A/D conversion produces 2's-complement binary values, adaptive systems are naturally suited to integer-based implementation. The sampled primary and reference signals are scaled and rounded to integers before they are fed into the system. Altera's Quartus II software package is used to compile the VHDL-based design, and a vector waveform simulation is produced. The primary and reference signals are stored in the device's internal memory with equal depth. The update parameter remains the same throughout the process, while the address line that controls the internal memory is incremented on every clock cycle. A snapshot of the waveform simulation is shown in Figure 5-4. Upon convergence, the tap weights become [0001, FFFA, FFFF, 0002, FFFD]. Converting these hexadecimal numbers to decimal, the weights are [1, -6, -1, 2, -3].

Figure 5-4. Waveform Simulation Result of the Adaptive Noise Canceller

The project is implemented on Altera's DSP development board, and the lower 5 bits of each weight are captured using a logic state analyzer. The analyzer's result is shown in Figure 5-5 below.

Figure 5-5. Logic State Analyzer Result of the Adaptive Noise Canceller

The implementation result shows that the lower 5 bits of the weights are [00001, 11010, 11111, 00010, 11101]. Their 2's complement values are indeed [1, -6, -1, 2, -3], which match the waveform simulation shown in Figure 5-4.

5.5 Performance Comparison of Stratix and Traditional FPGAs

Area and speed are the two main measurements in evaluating the FPGA performance of this filter. Since the Stratix devices have embedded DSP blocks built in, they should occupy fewer LEs and allow a faster maximum clock frequency. Area and speed were studied with a Stratix device and an FPGA device without embedded DSP blocks, namely an APEX device, also from Altera. Figures 5-6 and 5-7 show the filter order vs. speed and filter order vs. area plots, respectively, for both the Stratix and APEX devices. Area is measured by the number of LEs occupied, whereas speed is measured by the longest register-to-register delay.

5.5.1 Speed

Referring to Figure 5-3, each additional tap lengthens the longest register-to-register path, causing the allowable clock frequency to drop. Figure 5-6 shows that as the number of taps increases, the allowable speed of the adaptive filter, i.e., the clock frequency, decreases. Timing for the Stratix device is obtained from the Quartus simulation result, since a Stratix device was not readily available. For the APEX device, timing is obtained by using a function generator to supply the system's clock signal. Clearly, if the function generator's clock period is shorter than the longest register-to-register delay, erroneous computational results occur, since the logic elements need the time period specified by the longest register-to-register delay to perform correct computation.
Therefore, the maximum frequency is obtained as the fastest frequency at which the adaptive system can run while still achieving the intended tap weight convergence.

Figure 5-6. Plot of Filter Order vs. Speed

5.5.2 Area

For each additional tap, a separate weight update, multiplier, and adder also have to be instantiated. These components all occupy LEs. Therefore, as the number of taps increases, so does the number of occupied LEs. Figure 5-7 shows this relationship. Note that for the Stratix device, at a filter length of 20 all embedded DSP blocks are already occupied by multipliers and adders (the DSP block elements do not count as logic elements). Each additional multiplier and adder required by a further increase in filter length is implemented in regular LEs, which produces the sharp growth from filter length 10 to filter length 25.

Figure 5-7. Plot of Filter Order vs. Area

From the above two graphs, we can easily see that the Stratix device is overwhelmingly favored over traditional FPGA devices. When it comes to DSP applications implemented in FPGA devices, the Stratix device not only occupies fewer LEs, thanks to the dedicated circuitry within the DSP blocks, but also allows a faster clock frequency.

5.6 Pipelining

Although the design depicted in Figure 5-3 fully utilizes the parallelism advantage of FPGA devices, its speed decays substantially as the filter order increases, since the longest register-to-register path stretches from the first weight update component on the left to the subtractor on the right. Two methods can be incorporated into the existing design to reduce the longest register-to-register delay. The first method is to introduce pipelined multipliers. Multipliers occupy a large amount of logic; by partitioning the multiplier logic into smaller elements and inserting pipeline registers in between, the register-to-register delay can be decreased, resulting in an increase of the maximum system clock frequency. The other method involves inserting buffers into the chain of adders at the convolution stage. The number of sequential adders increases linearly with filter order, and so does the number of LEs implementing them, resulting in an overwhelming decrease in system speed. If buffers are added to the adder-chain, the system's maximum data rate can be increased. The two methods can be combined to obtain an adaptive system with optimal performance in terms of data rate.

Latency is introduced by both of the above methods. Latency in the multipliers or in the adder-chain effectively creates phase shifts at the convolution stage, since the full result of a multiplication is delayed by the number of pipeline levels. Consequently, this phase shift also affects the error output signal, because the error output is delayed as well. If the phase shift created by the latency becomes sufficiently large, it can remove the correlation between the reference signal and the primary signal and force the adaptive system to diverge; after all, the error produced by the adaptive system is a function of the primary and reference signals, and the error signal is also a feedback signal to the weight updates. We will, in this section, investigate techniques to cope with latency effects in adaptation.
Synthesis tools that partition the multiplier logic can be used to obtain the optimal number of pipeline stages, defined as the smallest number of stages beyond which a further increase does not enhance the multiplier's speed. The maximum speed of the pipelined multiplier then serves as a guideline for how many buffers to insert into the adder-chain. We wish to insert the minimal number of buffers into the adder-chain, both to minimize latency and to minimize the longest register-to-register path. Procedures for obtaining the optimal pipeline stages are now discussed.

5.6.1 Optimal Multiplier Pipeline Stages

To investigate the synthesis tool provided by the Quartus software, a multiplier block is instantiated according to Figure 5-8. Without pipelining the multiplier, the longest register-to-register delay runs from the input register to the output register. If pipeline stages are introduced within the multiplier, the longest register-to-register delay is reduced.

Figure 5-8. Pipelined Multiplier Test Module

The speed improvement for various numbers of pipeline stages and different multiplier sizes is studied using the Quartus synthesis tool. Figure 5-9 shows that for an 8-bit multiplier, the optimal number of pipeline stages is 1, since incrementing the number of pipeline stages further does not yield better multiplier performance. Similarly, the optimal pipeline stages for the 16-bit and 32-bit multipliers are 2 and 3, respectively.

Figure 5-9. Maximum Data Rate of Three Multipliers with Various Pipeline Stages

5.6.2 Optimal Adder-chain Pipeline Stages

Referring to the structural diagram in Figure 5-3, the adders used in convolution may become a burden on system speed, because the adder-chain occupies more logic elements as the filter order increases. As discussed in the previous section, multipliers can be pipelined with the optimal number of stages for their input bus size. In this section, we investigate further improvement of the adaptive system's speed by inserting buffers into the adder-chain. The goal is to minimize the number of buffers while not increasing the longest register-to-register delay. The constraint is that the combined delay of the adders between buffers must not exceed the delay of the pipelined multiplier. According to the results of the previous section, 8-bit, 16-bit, and 32-bit multipliers can be pipelined to optimal speeds of 335 MHz, 278 MHz, and 278 MHz, respectively. An adder-chain component, described in Figure 5-10, is instantiated to observe the number of adders that can be included while staying within the multiplier's speed range.

Figure 5-10. Adder-chain Test Module

Results for 8-bit, 16-bit, and 32-bit adders are shown in Figure 5-11. For 8-bit adders, it is found that in order to satisfy the speed constraint set by the multipliers, one buffer can be added for every two adders in the adder-chain to optimize system performance; three adders between buffers already exceed the propagation delay of an 8-bit pipelined multiplier. Similarly, for 16-bit and 32-bit adders, the maximum number of adders that can be placed between two buffers is also two.

Figure 5-11. Adder-chain Data Rate with Respect to Number of Adders

Incorporating pipelined multipliers and buffering the adders in the adder-chain can thus reduce the longest register-to-register delay.
As an example, the structural view of a 4th-order adaptive system is shown in Figure 5-12 below, where the multipliers are pipelined with two stages and buffers are added for every two adders in the adder-chain.

Figure 5-12. Pipelined and Buffered Adaptive System Block Diagram

Note that since a buffer is added after the second adder in the adder-chain, buffers are also added to the fourth and fifth multiplier outputs in order to compensate for the latency introduced by the adder-chain buffer.

5.6.3 Tradeoffs in Introducing Latency into Adaptive Systems

As described earlier, an adaptive system consists of both a convolution stage and an adaptation stage. These two stages are expressed mathematically in Eq. (2.25) through Eq. (2.27). With pipelining and buffers, the adaptive system can be expressed by the following two equations, representing the error signal computation and the adaptation:

e_D(n) = d(n) - W^T(n-D) U(n)                    (5.2)
W(n+1) = W(n) + μ e_D(n) U(n)                    (5.3)

where D represents the levels of latency and e_D represents the delayed error signal. As described earlier, if the latency is large, the adaptive system can diverge due to the phase shift caused by the latency. Recall the criterion for the step size parameter μ derived in Chapter 2, namely that μ must satisfy the inequality

0 < μ < 2/λ_max                                  (5.4)

where λ_max is the largest eigenvalue of R. It is shown in [17] that in order to guarantee convergence of an adaptive system with latencies, μ must be restricted to an even smaller range:

0 < μ < (2/λ_max) sin(π/(2(2D+1)))               (5.5)

Note that Eq. (5.5) also shows that as the number of pipeline stages increases, the range of appropriate μ decreases. It is also shown in [17] that a pipelined LMS system always converges more slowly than an unpipelined LMS system. Several authors have investigated improving the convergence rate of pipelined LMS systems. In [9], a correction term c(n) is incorporated into the error signal computation:

e_D(n) = d(n) - W^T(n-D) U(n) - c(n)             (5.6)
c(n) = R^T(n) E^(D)(n-1)                         (5.7)

where R(n) is the D-dimensional input correlation vector and E^(D)(n-1) is a vector of past errors. It was shown that this modified error computation achieves performance equal to the unpipelined LMS system. However, it also introduces more computation and thus essentially nullifies the purpose of pipelining. The convergence rate can also be improved by updating the weights according to the LMS algorithm while modifying the step size, following the update equation proposed in [28]:

W(n+1) = W(n) + μ U(n-D) e(n-D) / (U^T(n-D) U(n-D))        (5.8)

Again, this method introduces more computational overhead and is thus not desirable. In addition, the software tool MMAlpha is used in [13] to automatically derive a VHDL description of a pipelined LMS architecture, optimizing speed at the cost of a 50% increase in area.

Based on the evidence presented above, introducing pipelines into the adaptive system increases the system's speed at the expense of either a slower convergence rate or more computation. However, by time-aligning the terms in Eqs. (5.2) and (5.3), we can reduce the effects of the phase shifts caused by pipelining. Referring to the structural diagram depicted in Figure 5-12: if the multipliers are pipelined and buffers are added to the adder-chain, the latency propagates into the error signal calculation, and the delayed error signal is fed back into the weight update components to perform adaptation.
Buffers can be added to the system's reference signal to align the error signal calculation. Furthermore, the weight updates can also be aligned by using delayed filter taps. This time-alignment scheme can be expressed by the following three equations:

y_D(n) = W^T(n) U(n)                             (5.9)
e_D(n) = d(n-D) - y_D(n)                         (5.10)
W(n+1) = W(n) + μ e_D(n) U(n-D)                  (5.11)

With this scheme, the weight update at sample n is done with the input and desired signals at sample n-D. For signals that do not change much between sampling points, this scheme provides a close fit to the unpipelined architecture. Oversampling is therefore suggested when using the time-alignment scheme; otherwise there is a penalty in convergence rate. The new architecture, applied to the structure depicted in Figure 5-12, is shown in Figure 5-13 below.

Figure 5-13. Time-aligned Adaptive System Block Diagram

Compared with the previously published solutions mentioned above, this time-alignment scheme does not introduce more computation. It does, however, introduce more hardware in the form of buffers. The convergence rate of this pipelined system is still slower than that of an unpipelined system.

5.6.4 Performance of the Pipelined Adaptive System

The speed of the unpipelined design was illustrated in Figure 5-6. In this section, pipeline stages are added to the multipliers of the design shown in Figure 5-3, buffers are further added to the adder-chain, and the pipelined adaptive system is compared against the unpipelined system. The Stratix device is used for the implementation. The multiplier bus width is set at 16; thus, according to Figure 5-9, the optimal number of pipeline stages is 2. Buffers are inserted for every two adders in the adder-chain. By varying the filter order, the maximum data rates of three scenarios are plotted in Figure 5-14: an unpipelined system, a pipelined system, and a system with both pipelined multipliers and buffers.

Figure 5-14. Pipelined Adaptive System Performance

Note that although a pipelined-and-buffered adaptive system can sustain a maximum data rate of up to 60 MHz regardless of filter order, it also has the most stages of latency. To summarize, an adaptive system's speed can be increased significantly by pipelining the multipliers, adding buffers to the adder-chain, or both. The latency thereby introduced may cause the tap weight adaptation to diverge, because the delayed error signal is also a feedback signal to the weight updates. Buffering the desired signal can time-align the error signal computation with the tap weight update computation. The time-aligned scheme requires neither the correction term described in Eq. (5.6) nor the step-size modification described in Eq. (5.8). Experiments have shown that the time-alignment scheme reduces the effects of latency. However, since the latency cannot be completely compensated, the convergence rate of the time-aligned adaptive system is still slower than that of an unpipelined adaptive system. In real-time applications where high data rates are required, a slower convergence rate can be an acceptable tradeoff [20].
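To make the time-alignment scheme concrete, the following MATLAB fragment is a literal transcription of Eqs. (5.9) through (5.11), with the pipeline modeled simply as a D-sample delay; it is an illustrative sketch in the style of the Appendix A scripts (the signals below are placeholders), not the hardware itself:

% Time-aligned delayed-LMS update, Eqs. (5.9)-(5.11); D = pipeline latency
D = 2; order = 4; mu = 2^-7;                % illustrative parameters
x = randn(1000,1); desired = randn(1000,1); % placeholder signals
xx = [zeros(order-1,1); x];                 % zero padding, as in LMS.m
w = zeros(order,1);
for k = D+1:length(x)
    U  = xx(k+order-1:-1:k);        % current input vector U(n)
    Ud = xx(k-D+order-1:-1:k-D);    % delayed input vector U(n-D)
    yD = w'*U;                      % Eq. (5.9)
    eD = desired(k-D) - yD;         % Eq. (5.10): delayed desired signal
    w  = w + mu*eD*Ud;              % Eq. (5.11): update with U(n-D)
end

For signals that vary slowly relative to D, i.e., under the oversampling condition noted above, this update behaves close to the unpipelined one.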
5.7 Performance Comparison of FPGAs and DSP Processors

DSP applications have traditionally been implemented with DSP processors. Given the recent advancements in FPGA devices, it is valuable to compare the performance of the adaptive system on FPGA devices and on DSP processors in terms of speed and power consumption. FPGAs maintain the advantages of custom functionality while avoiding the high development costs and the inability to make design modifications after production [14]. Compared with DSP processors, FPGAs also hold the advantage of parallelism, in that multiple operations can be performed at one time instant, whereas DSP processors perform only one instruction per time instant. According to Figure 5-3, by instantiating multiple adders and multipliers, the system is able to perform convolution and adaptation on the fly; if the design were implemented on a DSP processor, only one instruction would be performed at a time. However, as the filter order increases, so does the register-to-register delay in the FPGA design, which eventually overcomes the parallelism advantage. Therefore, speed is investigated using two devices, namely the Stratix FPGA device and Texas Instruments' TMS320VC33 floating-point DSP processor.

Power consumption is also a main concern in choosing between devices. Power consumption is assumed fixed for DSP processors, since their internal structure is fixed. An FPGA device's power consumption varies with the number of LEs programmed, the number of clock-driven registers, and the DSP block utilization. Power consumption is investigated in this section using the Stratix device, a floating-point processor, and a fixed-point processor.

5.7.1 Speed

The pipelined adaptive system presented in Section 5.6 is used for comparison with a floating-point DSP processor. The processor of choice is Texas Instruments' TMS320VC33 floating-point DSP processor, which has a maximum speed of 150 Million Floating-Point Operations per Second (MFLOPS) at 60 MHz. Speed is measured by the amount of time it takes to update a set of weights for an adaptive system of a given filter order. Based on benchmark data obtained from Mr. Scott Morrison of the Computational NeuroEngineering Laboratory, University of Florida, for a single-channel LMS adaptive filter, the C33 processor updates tap weights on the order of microseconds, whereas the FPGA LMS adaptive filter can perform tap weight updates on the order of nanoseconds. For example, it takes the APEX device implementation 67 ns to update all tap weights for an adaptive filter of order 10, whereas it takes the DSP processor 2.3 μs to do so. Parallelism works to full advantage over DSP processors in this LMS adaptive application. A shortcoming of the FPGA implementation, however, is that the number of LEs in a given device is limited, which restricts the filter order that can fit in a particular FPGA. There is no such problem for DSP processors, since they rely on internal or external memory to store information and perform computations sequentially. Furthermore, floating-point implementation is not yet feasible in FPGA devices, because the devices have limited LEs. For applications that require a large dynamic range, DSP processors remain the devices of choice.

5.7.2 Power Consumption

Power consumption for DSP processors is generally fixed.
The worst-case power consumption of the TMS320VC33 floating-point DSP processor is found to be 500 mW [26]. For the DSP56309 fixed-point processor, benchmark information obtained from [6] indicates that the LMS algorithm can be performed at 1.5 mA/MHz. With a 100 MHz oscillator applied to the processor and a core voltage of 3.3 V, the estimated power consumption for running the adaptive system on this fixed-point processor is therefore 514 mW. FPGA devices' power consumption, on the other hand, varies with the size of the design. For our adaptive system, the number of component instances increases with the filter order, requiring more logic to fit into the FPGA; therefore, as the filter order increases, so does the power consumed by the device. Using the Stratix power calculator provided by Altera, Inc., the estimated power consumption is obtained for various filter orders. Figure 5-15 illustrates the relationship between filter order and power consumption for the FPGA, together with the comparison between the three devices of choice.

Figure 5-15. Power Consumption Plot for Various Devices

As seen in Figure 5-15, if energy conservation is desired, the FPGA implementation should be considered over the two DSP processors for adaptive filters of order less than 25. For filter orders over 25, the Stratix device consumes more power than the DSP processors and therefore becomes unattractive.

CHAPTER 6
CONCLUSION AND FUTURE WORK

6.1 Conclusion

Finite precision effects on adaptive algorithms have been studied in this thesis. Several common effects were examined, and solutions were provided to mitigate them. An adaptive noise canceller was first simulated in software to gauge its effectiveness as an integer-based system; following its success in software simulation, the noise canceller was implemented in VLSI-based hardware.

One commonly used adaptive algorithm, the LMS algorithm, was derived in Chapter 2. The LMS algorithm uses the minimum mean square error as its criterion, and an adaptive filter using the LMS algorithm assumes an FIR filter structure. During adaptation, the adaptive filter updates its tap weights to bring the filter output as close as possible to the reference input of the system, thereby minimizing the difference between the reference input and the filter output, i.e., the error term.

The mathematical expressions for adaptive algorithms presented in Chapter 2 assume infinite precision; i.e., they do not consider the wordlength of the calculation. In reality, however, the digital hardware used to implement an adaptive algorithm has limited wordlength. Because of this, finite precision effects on adaptive algorithms, specifically the LMS algorithm, should be studied.

Finite precision effects can be grouped into three categories. First, in order to maintain the wordlength, input signals and intermediate arithmetic results must be quantized. Quantization is performed via either rounding or truncation. It was found that rounding is preferred over truncation, since rounding produces a zero-mean error signal. Secondly, filter applications rely heavily on arithmetic operations, and these results must also be rounded due to finite precision. It was found that for an Mth-order FIR adaptive filter, the error power created by arithmetic quantization is E(n) = (M+1)q^2/6, where q is the quantization step and M is the filter length.
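As a quick numeric illustration of this expression (the values here are chosen for illustration only, not taken from the thesis experiments):

% Error power due to arithmetic quantization, E(n) = (M+1)*q^2/6
M = 4;            % filter order
q = 2^-8;         % quantization step
E = (M+1)*q^2/6   % approx. 1.3e-5

Since the error power scales with q^2, each additional fractional bit halves q and reduces the error power by a factor of four.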
By increasing the wordlength or by using a periodic update scheme, the effects resulting from arithmetic rounding can be reduced. Thirdly, saturation and stalling can arise due to finite precision constraints. Saturation can be dealt with either by scaling the input signals so that saturation becomes less probable, or by using the clamping technique, in which, upon detecting saturation, the result is "clamped" to the most positive or most negative number, depending on the sign bit. The step size parameter μ may cause the algorithm to stall, that is, the tap weights fail to update because the update parameter is smaller than the quantization step. Stalling can be avoided by incorporating a lower bound for μ. Alternatively, the sign algorithm is another way to reduce or avoid stalling.

A fixed-point-based adaptive noise canceller was simulated in software. It was found that a fixed-point-based system with a sufficient number of bits shows no striking difference from a floating-point-based system. The simulation result suggests that a low-cost hardware realization of this noise canceller is possible, since a fixed-point-based adaptive filter requires significantly less circuitry than a floating-point-based one.

The adaptive noise canceller was implemented in an FPGA device with embedded DSP blocks, e.g., a Stratix device. The DSP blocks are dedicated circuitry that performs common DSP operations, including multiply-and-add. Thanks to the embedded DSP blocks, the Stratix device outperforms traditional FPGAs when implementing the same adaptive filters: it allows a faster clock frequency and utilizes fewer logic elements. Since the design is written in VHDL, dynamic component instantiation allows filter designers to quickly modify the filter length and/or the wordlength.

Pipelining was also introduced into the adaptive system design. By introducing pipelines into the design, the maximum data rate of the adaptive system can be increased compared to an unpipelined system. Pipelining also introduces latency, which slows down convergence; but in real-time, high-speed applications, a slower convergence rate can be an acceptable tradeoff.

The performance of the FPGA-based adaptive system in terms of speed and power consumption was also compared against traditional DSP processors. It was found that FPGAs fully exploit their parallelism advantage, resulting in much faster filter performance. However, as the filter order increases, the FPGA implementation becomes less attractive due to the limited number of logic elements within an FPGA and higher power consumption compared with DSP processors. For lower-order adaptive filter implementations, FPGAs should be seriously considered; for higher-order filters, DSP processors should be used.

6.2 Future Work

Finite precision effects were investigated in fixed-point-based systems only, in which the signals are quantized. This is due to the current limitations of FPGA devices. In the future, as the number of logic elements becomes sufficiently abundant, FPGA-based floating-point adaptive filters may become feasible to implement.

Multichannel adaptive systems are useful in that multiple channels can be trained using the same adaptive filter by multiplexing the channels. Internal memory within the FPGA may be used to read/write each channel's taps and tap weights.
The multichannel system requires a few more components, including multiplexers for the primary and reference signals at the system input, and a RAM arbiter to control the memory I/O of each channel's taps and tap weights.

A pseudo-floating-point scheme was proposed in [24] and was shown to outperform the ordinary fixed-point scheme in adaptive LMS systems. This scheme can easily be implemented within the architecture shown in this thesis with minor modifications. The scheme can then be compared against our fixed-point architecture in terms of speed, area, and rate of convergence.

APPENDIX A
MATLAB SCRIPTS

%% Author   : Andy Lin
%% File Name: LMS.m
%% Date     : 02/12/02
%%
%% The LMS function uses the LMS algorithm to produce updated
%% weights for the filter.
%%
%% Usage : [J, W, error] = LMS(xx, desired, order, mu, winit);
%%
%% order   : the order of the filter, or the dimension of Rx and Px,y
%% desired : desired signal; the filter output is subtracted from it
%%           to get the error
%% xx      : input to the adaptive filter
%% mu      : step size
%% winit   : initial weights
%% J       : learning curve
%% W       : weight track matrix with dimension
%%           (order of filter x # of samples)
%% error   : desired minus filter output

function [J, W, error] = LMS(xx, desired, order, mu, winit);

Lx = length(xx);
[m,n] = size(xx);
if n>m, xx = xx.'; end;

% add zero padding to initial states
xx = [zeros(order-1,1); xx];

% initialization steps
l = 1;
sumMSE = 0;              % sum of mean square error
error = desired;
w = winit;
W = zeros(order, Lx);

for k = 1:Lx,            % update every sampling period
    X = xx(k+order-1:-1:k);
    y = w'*X;
    error(k) = desired(k) - y;
    sumMSE = sumMSE + error(k)*error(k);
    w = w + mu*error(k)*X;
    W(:, k) = w;
    if (mod(k, 30) == 0)
        J(l) = sumMSE / k;
        l = l + 1;
    end;
end;

%% Author   : Andy Lin
%% File Name: clamping_LMS.m
%% Date     : 03/12/03
%%
%% The LMS function uses the LMS algorithm to produce updated weights
%% for the filter. Clamping is applied with respect to the wordlength.
%%
%% Usage : [J, W, error] = clamping_LMS(xx, desired, order, mu, winit, wordlength);
%%
%% order      : the order of the filter, or the dimension of Rx and Px,y
%% desired    : desired signal; the filter output is subtracted from it
%%              to get the error
%% xx         : input to the adaptive filter
%% mu         : step size
%% winit      : initial weights
%% wordlength : MSB position
%% J          : learning curve
%% W          : weight track matrix with dimension
%%              (order of filter x # of samples)
%% error      : desired minus filter output

function [J, W, error] = clamping_LMS(xx, desired, order, mu, winit, wordlength);

Lx = length(xx);
[m,n] = size(xx);
if n>m, xx = xx.'; end;

% calculate the clamping value, which is the maximum
% value the wordlength can represent
max = 0;
for i = 0:wordlength-1,
    max = max + 2^i;
end;

% add zero padding to initial states
xx = [zeros(order-1,1); xx];

% initialization steps
l = 1;
sumMSE = 0;              % sum of mean square error
error = desired;
w = winit;
W = zeros(order, Lx);

for k = 1:Lx,            % update every sampling period
    X = xx(k+order-1:-1:k);
    y = w'*X;
    % simulate the saturation effect (inputs are assumed positive,
    % cf. Section 3.4.4, so dec2bin is valid here)
    tmpy = dec2bin(y);
    % if saturation occurs, clamp to the largest number the
    % wordlength can represent
    if (length(tmpy) > wordlength)
        y = max;
    end;
    error(k) = desired(k) - y;
    sumMSE = sumMSE + error(k)*error(k);
    w = w + mu*error(k)*X;
    W(:, k) = w;
    if (mod(k, 30) == 0)
        J(l) = sumMSE / k;
        l = l + 1;
    end;
end;

%% Author   : Andy Lin
%% File Name: sign_LMS.m
%% Date     : 03/12/03
%%
%% The sign algorithm is used to produce the weight update.
%%
%% Usage : [J, W, error] = sign_LMS(xx, desired, order, mu, winit, q);
%%
%% order   : the order of the filter, or the dimension of Rx and Px,y
%% desired : desired signal; the filter output is subtracted from it
%%           to get the error
%% xx      : input to the adaptive filter
%% mu      : step size
%% winit   : initial weights
%% q       : quantization step
%% J       : learning curve
%% W       : weight track matrix with dimension
%%           (order of filter x # of samples)
%% error   : desired minus filter output

function [J, W, error] = sign_LMS(xx, desired, order, mu, winit, q);

Lx = length(xx);
[m,n] = size(xx);
if n>m, xx = xx.'; end;

% add zero padding to initial states
xx = [zeros(order-1,1); xx];

% initialization steps
l = 1;
sumMSE = 0;              % sum of mean square error
error = desired;
w = winit;
W = zeros(order, Lx);

for k = 1:Lx,            % update every sampling period
    X = xx(k+order-1:-1:k);
    % quantization at the convolution stage (quantize to step q)
    y = round((w'*X)/q)*q;
    error(k) = desired(k) - y;
    sumMSE = sumMSE + error(k)*error(k);
    % quantization at the adaptation stage, using sign(e) only
    w = w + round((mu*sign(error(k)).*X)/q)*q;
    W(:, k) = w;
    if (mod(k, 30) == 0)
        J(l) = sumMSE / k;
        l = l + 1;
    end;
end;

%% Author   : Andy Lin
%% File Name: LMS_with_q.m
%% Date     : 03/12/03
%%
%% Quantizes every computation with respect to q.
%%
%% Usage : [J, W, error] = LMS(xx, desired, order, mu, winit, q);
%%
%% order   : the order of the filter, or the dimension of Rx and Px,y
%% desired : desired signal; the filter output is subtracted from it
%%           to get the error
%% xx      : input to the adaptive filter
%% mu      : step size
%% winit   : initial weights
%% q       : quantization step
%% J       : learning curve
%% W       : weight track matrix with dimension
%%           (order of filter x # of samples)
%% error   : desired minus filter output

function [J, W, error] = LMS(xx, desired, order, mu, winit, q);

Lx = length(xx);
[m,n] = size(xx);
if n>m, xx = xx.'; end;

% add zero padding to initial states
xx = [zeros(order-1,1); xx];

% initialization steps
l = 1;
sumMSE = 0;              % sum of mean square error
error = desired;
w = winit;
W = zeros(order, Lx);

for k = 1:Lx,            % update every sampling period
    X = xx(k+order-1:-1:k);
    % rounding at the convolution stage (quantize to step q)
    y = round((w'*X)/q)*q;
    error(k) = desired(k) - y;
    sumMSE = sumMSE + error(k)*error(k);
    % rounding at the adaptation stage
    w = w + round((mu*error(k)*X)/q)*q;
    W(:, k) = w;
    if (mod(k, 10) == 0)
        J(l) = sumMSE / k;
        l = l + 1;
    end;
end;

APPENDIX B
VHDL CODES

-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : header.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

package header is

  -- fl indicates filter length, or filter order
  -- bussize indicates the size of the input data bus.
  constant fl      : integer := 4;
  constant bussize : integer := 16;
  constant depth   : integer := 12;

  subtype buss is std_logic_vector(bussize-1 downto 0);
  type pbus is array (fl downto 0) of buss;
  type qbus is array (fl-1 downto 0) of buss;

  component xadder port(
    a : in std_logic_vector(bussize-1 downto 0);
    b : in std_logic_vector(bussize-1 downto 0);
    y : out std_logic_vector(bussize-1 downto 0));
  end component;

  component subtractor port(
    clk : in std_logic;
    a   : in std_logic_vector(bussize-1 downto 0);
    b   : in std_logic_vector(bussize-1 downto 0);
    y   : buffer std_logic_vector(bussize-1 downto 0));
  end component;

  component multiplier port(
    a : in std_logic_vector(bussize-1 downto 0);
    b : in std_logic_vector(bussize-1 downto 0);
    y : out std_logic_vector(bussize-1 downto 0));
  end component;

  component wgenerator port(
    clk   : in std_logic;
    reset : in std_logic;
    mu    : in std_logic_vector(3 downto 0);
    xx    : in std_logic_vector(bussize-1 downto 0);
    ee    : in std_logic_vector(bussize-1 downto 0);
    ww    : buffer std_logic_vector(bussize-1 downto 0));
  end component;

  component UnitDelay port(
    clk   : in std_logic;
    reset : in std_logic;
    inp   : in std_logic_vector(bussize-1 downto 0);
    outp  : buffer std_logic_vector(bussize-1 downto 0));
  end component;

  component LMSMaster port(
    clk   : in std_logic;
    reset : in std_logic;
    mu    : in std_logic_vector(3 downto 0);
    x     : in std_logic_vector(bussize-1 downto 0);
    d     : in std_logic_vector(bussize-1 downto 0);
    w     : buffer pbus;
    err   : buffer std_logic_vector(bussize-1 downto 0));
  end component;

end header;

-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : Multiplier.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.header.all;

LIBRARY lpm;
USE lpm.lpm_components.ALL;

entity multiplier is port(
  a : in std_logic_vector(bussize-1 downto 0);
  b : in std_logic_vector(bussize-1 downto 0);
  y : out std_logic_vector(bussize-1 downto 0));
end multiplier;

architecture behave of multiplier is
  signal product : std_logic_vector(2*bussize-1 downto 0);
begin
  Mult: lpm_mult  -- product = a*b
    GENERIC MAP (
      LPM_WIDTHA => bussize,
      LPM_WIDTHB => bussize,
      LPM_REPRESENTATION => "SIGNED",
      LPM_WIDTHP => 2*bussize,
      LPM_WIDTHS => 2*bussize)
    PORT MAP (
      dataa  => a,
      datab  => b,
      result => product);

  -- take the sign bit and concatenate it with the lower bits
  y <= product(2*bussize-1) & product(bussize-2 downto 0);
end behave;

-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : Subtractor.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.header.all;

LIBRARY lpm;
USE lpm.lpm_components.ALL;

entity subtractor is port(
  clk : in std_logic;
  a   : in std_logic_vector(bussize-1 downto 0);
  b   : in std_logic_vector(bussize-1 downto 0);
  y   : buffer std_logic_vector(bussize-1 downto 0));
end subtractor;

architecture behave of subtractor is
  signal yy : std_logic_vector(bussize-1 downto 0);
begin
  sub: lpm_add_sub  -- y = a - b
    GENERIC MAP (
      LPM_WIDTH => bussize,
      LPM_REPRESENTATION => "SIGNED",
      LPM_DIRECTION => "SUB")
    PORT MAP (
      dataa  => a,
      datab  => b,
      result => yy);

  -- latch the subtraction result on the falling edge of clk
  process (clk)
  begin
    if (clk'event and clk = '0') then
      y <= yy;
    end if;
  end process;
end behave;
-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : xadder.vhd

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_arith.ALL;
USE ieee.std_logic_signed.ALL;
use work.header.all;

LIBRARY lpm;
USE lpm.lpm_components.ALL;

entity xadder is port(
    a : in  std_logic_vector(bussize-1 downto 0);
    b : in  std_logic_vector(bussize-1 downto 0);
    y : out std_logic_vector(bussize-1 downto 0));
end xadder;

architecture behave of xadder is
begin

add: lpm_add_sub    -- y = a + b
    GENERIC MAP (
        LPM_WIDTH => bussize,
        LPM_REPRESENTATION => "SIGNED",
        LPM_DIRECTION => "ADD")
    PORT MAP (
        dataa  => a,
        datab  => b,
        result => y);

end behave;

-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : UnitDelay.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.header.all;

entity UnitDelay is port(
    clk   : in std_logic;
    reset : in std_logic;
    inp   : in std_logic_vector(bussize-1 downto 0);
    outp  : buffer std_logic_vector(bussize-1 downto 0));
end UnitDelay;

architecture behave of UnitDelay is
begin

process (clk)
begin
    if (rising_edge(clk)) then
        if (reset = '1') then
            outp <= (others => '0');
        else
            outp <= inp;
        end if;
    end if;
end process;

end behave;

-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : WGenerator.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.header.all;

LIBRARY lpm;
USE lpm.lpm_components.ALL;

entity WGenerator is port(
    clk   : in std_logic;
    reset : in std_logic;
    mu    : in std_logic_vector(3 downto 0);
    xx    : in std_logic_vector(bussize-1 downto 0);
    ee    : in std_logic_vector(bussize-1 downto 0);
    ww    : buffer std_logic_vector(bussize-1 downto 0));
end WGenerator;

architecture behave of WGenerator is

signal ee_mult_xx        : std_logic_vector(2*bussize-1 downto 0);
signal ee_mult_xx_div_mu : std_logic_vector(bussize-1 downto 0);
signal ww_updated        : std_logic_vector(bussize-1 downto 0);

-- this function divides its input by 2**len by shifting the input
-- "len" bits to the right with sign extension
function div (a   : std_logic_vector(2*bussize-1 downto 0);
              len : std_logic_vector(3 downto 0))
              return std_logic_vector is
variable temp : std_logic_vector(2*bussize-1 downto 0);
begin
    temp := a;
    -- if the input is positive, shift in zeros
    if (temp(2*bussize-1) = '0') then
        case len is
            when "0001" => temp := '0' & temp(2*bussize-1 downto 1);
            when "0010" => temp := "00" & temp(2*bussize-1 downto 2);
            when "0011" => temp := "000" & temp(2*bussize-1 downto 3);
            when "0100" => temp := "0000" & temp(2*bussize-1 downto 4);
            when "0101" => temp := "00000" & temp(2*bussize-1 downto 5);
            when "0110" => temp := "000000" & temp(2*bussize-1 downto 6);
            when "0111" => temp := "0000000" & temp(2*bussize-1 downto 7);
            when "1000" => temp := "00000000" & temp(2*bussize-1 downto 8);
            when "1001" => temp := "000000000" & temp(2*bussize-1 downto 9);
            when "1010" => temp := "0000000000" & temp(2*bussize-1 downto 10);
            when "1011" => temp := "00000000000" & temp(2*bussize-1 downto 11);
            when "1100" => temp := "000000000000" & temp(2*bussize-1 downto 12);
            when "1101" => temp := "0000000000000" & temp(2*bussize-1 downto 13);
            when "1110" => temp := "00000000000000" & temp(2*bussize-1 downto 14);
            when "1111" => temp := "000000000000000" & temp(2*bussize-1 downto 15);
            when others => null;
        end case;
    -- if the input is negative, shift in ones
    else
        case len is
            when "0001" => temp := '1' & temp(2*bussize-1 downto 1);
            when "0010" => temp := "11" & temp(2*bussize-1 downto 2);
            when "0011" => temp := "111" & temp(2*bussize-1 downto 3);
            when "0100" => temp := "1111" & temp(2*bussize-1 downto 4);
            when "0101" => temp := "11111" & temp(2*bussize-1 downto 5);
            when "0110" => temp := "111111" & temp(2*bussize-1 downto 6);
            when "0111" => temp := "1111111" & temp(2*bussize-1 downto 7);
            when "1000" => temp := "11111111" & temp(2*bussize-1 downto 8);
            when "1001" => temp := "111111111" & temp(2*bussize-1 downto 9);
            when "1010" => temp := "1111111111" & temp(2*bussize-1 downto 10);
            when "1011" => temp := "11111111111" & temp(2*bussize-1 downto 11);
            when "1100" => temp := "111111111111" & temp(2*bussize-1 downto 12);
            when "1101" => temp := "1111111111111" & temp(2*bussize-1 downto 13);
            when "1110" => temp := "11111111111111" & temp(2*bussize-1 downto 14);
            when "1111" => temp := "111111111111111" & temp(2*bussize-1 downto 15);
            when others => null;
        end case;
    end if;
    -- take only the sign bit and the least significant bits
    return temp(2*bussize-1) & temp(bussize-2 downto 0);
end;  -- of function "div"

begin  -- of architecture

-- concurrent statement
ee_mult_xx_div_mu <= div(ee_mult_xx, mu);

process (clk)
begin
    if (rising_edge(clk)) then
        if (reset = '1') then
            ww <= (others => '0');
        else
            ww <= ww_updated;
        end if;
    end if;
end process;

Mult: lpm_mult    -- ee*xx
    GENERIC MAP (
        LPM_WIDTHA => bussize,
        LPM_WIDTHB => bussize,
        LPM_REPRESENTATION => "SIGNED",
        LPM_WIDTHP => 2*bussize,
        LPM_WIDTHS => 2*bussize)
    PORT MAP (
        dataa  => xx,
        datab  => ee,
        result => ee_mult_xx);

sub: lpm_add_sub    -- ww = ww + (ee*xx)/2**mu
    GENERIC MAP (
        LPM_WIDTH => bussize,
        LPM_REPRESENTATION => "SIGNED",
        LPM_DIRECTION => "ADD")
    PORT MAP (
        dataa  => ww,
        datab  => ee_mult_xx_div_mu,
        result => ww_updated);

end behave;
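The div function above realizes division by 2^len as a sign-extended right shift, which in MATLAB corresponds to floor division; a quick check with arbitrary example values (a sketch, not part of the original scripts):

% Sketch: a sign-extended right shift by len bits equals floor(a / 2^len)
% for positive and negative operands alike.
len = 3;
for a = [100, -100]
    shifted = floor(a / 2^len);    % 100 -> 12, -100 -> -13
    fprintf('%5d shifted right by %d = %d\n', a, len, shifted);
end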
-- Author : Andrew Y. Lin
-- Date   : 04/03/02
-- File   : LMSMaster.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use work.header.all;

entity LMSMaster is port(
    clk   : in std_logic;
    reset : in std_logic;
    mu    : in std_logic_vector(3 downto 0);
    x     : in std_logic_vector(bussize-1 downto 0);
    d     : in std_logic_vector(bussize-1 downto 0);
    w     : buffer pbus;
    err   : buffer std_logic_vector(bussize-1 downto 0));
end LMSMaster;

architecture struct of LMSMaster is
signal qx : qbus;    -- delay-line outputs
signal pm : pbus;    -- per-tap products
signal qy : qbus;    -- partial sums of the adder chain
begin

-- component instantiations
UDMi : for i in fl-1 downto 0 generate
    F1: if i = (fl-1) generate
        UDM: UnitDelay port map (clk => clk, reset => reset,
                                 inp => x, outp => qx(i));
    end generate;
    F2: if i /= (fl-1) generate
        UDi: UnitDelay port map (clk => clk, reset => reset,
                                 inp => qx(i+1), outp => qx(i));
    end generate;
end generate;

WGMi : for i in fl downto 0 generate
    F3: if i = fl generate
        WGM: WGenerator port map (clk => clk, reset => reset, mu => mu,
                                  xx => x, ee => err, ww => w(i));
    end generate;
    F4: if i /= fl generate
        WGA: WGenerator port map (clk => clk, reset => reset, mu => mu,
                                  xx => qx(i), ee => err, ww => w(i));
    end generate;
end generate;

MULMi : for i in fl downto 0 generate
    F5: if i = fl generate
        MULM: multiplier port map (a => x, b => w(i), y => pm(i));
    end generate;
    F6: if i /= fl generate
        MUL: multiplier port map (a => qx(i), b => w(i), y => pm(i));
    end generate;
end generate;

ADDMi : for i in fl-1 downto 0 generate
    F7: if i = fl-1 generate
        ADDM: xadder port map (a => pm(i+1), b => pm(i), y => qy(i));
    end generate;
    F8: if i /= fl-1 generate
        ADD: xadder port map (a => pm(i), b => qy(i+1), y => qy(i));
    end generate;
end generate;

SUB: subtractor port map (clk => clk, a => d, b => qy(0), y => err);

end struct;
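Behaviorally, the structure LMSMaster builds (a unit-delay chain, one WGenerator per tap, a multiplier per tap, an adder chain, and a final subtractor) corresponds to the MATLAB sketch below; the signals and the plant are illustrative stand-ins, pipeline delays and finite-precision effects are ignored, and the power-of-two step size mirrors the shift-based div in WGenerator.

% Hypothetical behavioral model of LMSMaster.vhd.
fl = 4;                            % filter order, as in header.vhd
mu_shift = 4;                      % mu port value: divide e*x by 2^mu_shift
N = 1000;
x = randn(N, 1);                   % stand-in input
d = filter(ones(fl+1,1)/(fl+1), 1, x);   % stand-in desired signal
w = zeros(fl+1, 1);                % one weight per tap (WGenerator outputs)
taps = zeros(fl+1, 1);             % delay-line contents: x plus qx(fl-1..0)
for n = 1:N
    taps = [x(n); taps(1:fl)];     % UnitDelay chain shifts one sample
    y = w' * taps;                 % per-tap multipliers and xadder chain
    e = d(n) - y;                  % subtractor output (err)
    w = w + (e .* taps) / 2^mu_shift;   % per-tap weight update
end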
-- Author : Andrew Y. Lin
-- Date   : 01/12/03
-- File   : Overall.vhd

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
use work.header.all;

LIBRARY lpm;
USE lpm.lpm_components.ALL;

entity Overall is port(
    clk     : in std_logic;
    reset   : in std_logic;
    mu      : in std_logic_vector(3 downto 0);
    addr    : buffer std_logic_vector(9 downto 0);
    weights : buffer pbus;
    q       : out std_logic_vector(bussize-1 downto 0);
    err     : buffer std_logic_vector(bussize-1 downto 0));
end Overall;

architecture struct of Overall is
signal desired, x_in : std_logic_vector(bussize-1 downto 0);
begin

-- This ROM contains the desired signal
Desired_ROM: lpm_rom
    GENERIC MAP (
        lpm_widthad => 10,
        lpm_width => bussize,
        lpm_address_control => "REGISTERED",
        lpm_outdata => "UNREGISTERED",
        lpm_file => "c:\andy_lin\testdata\LMSDesired.mif")
    PORT MAP (
        inclock => clk,
        address => addr,
        q => desired);

-- This ROM contains the input signal
input_ROM: lpm_rom
    GENERIC MAP (
        lpm_widthad => 10,
        lpm_width => bussize,
        lpm_address_control => "REGISTERED",
        lpm_outdata => "UNREGISTERED",
        lpm_file => "c:\andy_lin\testdata\LMSinput.mif")
    PORT MAP (
        inclock => clk,
        address => addr,
        q => x_in);

-- This RAM stores the error signal
err_RAM: lpm_ram_dq
    GENERIC MAP (
        LPM_WIDTH => bussize,
        LPM_WIDTHAD => 10,
        LPM_INDATA => "REGISTERED",
        LPM_OUTDATA => "UNREGISTERED",
        LPM_ADDRESS_CONTROL => "UNREGISTERED")
    PORT MAP (
        address => addr,
        inclock => clk,
        we => '1',
        data => err,
        q => q);

-- LMS FIR instantiation
FIR: LMSMaster PORT MAP (
    clk => clk,
    reset => reset,
    mu => mu,
    x => x_in,
    d => desired,
    w => weights,
    err => err);

process (clk)
begin
    if (clk'event and clk = '1') then
        if (reset = '1') then
            addr <= (others => '0');
        else
            addr <= addr + '1';
        end if;
    end if;
end process;

end struct;
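The two ROMs above load their contents from Memory Initialization Files (.mif). One way to generate such files from MATLAB test vectors is sketched below; the helper name, the scaling to a 16-bit fractional format, and the saturation step are assumptions for illustration.

% Hypothetical helper that writes a signal vector to an Altera .mif file.
function write_mif(fname, data, width, depth)
    v = round(data(:) * 2^(width-1));                 % scale to integers
    v = max(min(v, 2^(width-1)-1), -2^(width-1));     % saturate to range
    v = mod(v, 2^width);                              % two's-complement pattern
    fid = fopen(fname, 'w');
    fprintf(fid, 'WIDTH=%d;\nDEPTH=%d;\n', width, depth);
    fprintf(fid, 'ADDRESS_RADIX=UNS;\nDATA_RADIX=UNS;\n\nCONTENT BEGIN\n');
    for k = 1:min(numel(v), depth)
        fprintf(fid, '\t%d : %d;\n', k-1, v(k));      % address : value;
    end
    fprintf(fid, 'END;\n');
    fclose(fid);
end

% Example: write_mif('LMSinput.mif', xx, 16, 1024);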