Blind Separation of Convolutive Mixtures Using Renyi's Divergence

Material Information

Blind Separation of Convolutive Mixtures Using Renyi's Divergence
HILD, KENNETH E. ( Author, Primary )
Copyright Date:


Subjects / Keywords:
Approximation ( jstor )
Datasets ( jstor )
Dimensionality ( jstor )
Entropy ( jstor )
Estimators ( jstor )
Kurtosis ( jstor )
Matrices ( jstor )
Permutations ( jstor )
Signal processing ( jstor )
Signals ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Kenneth E. Hild. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
Resource Identifier:
436097517 ( OCLC )


This item is only available as the following downloads:

Full Text




Copyright 2003 by Kenneth E. Hild II


iiiACKNOWLEDGEMENTSFirst and foremost I would like to thank my advisor, Dr. Principe, for all his help, including the countless hours he spent carefully reviewing my work and for the many useful suggestions that he proffered during the c ourse of my studies. Dr. Principe has helped me to become a more well-rounded researcher by imbuing in me a desire to learn about every facet of electrical engineering, as well as about topics outside of engineering, such as philosophy and the history of science. I wo uld like to thank my committee members for always being available at a momentÂ’s noti ce and for patiently enduring all of my questions. I would like to thank my many colleagues from the past four years that have made the Computational NeuroEngineering Laborator y a very enjoyable experience and have created an atmosphere that is highly c onducive to thought, reason, and the pursuit of knowledge. I am also thankful for the teachi ng scholarship that I received from the Department of Electric al and Computer Engineering, as well as the funding that I received from NSF ECS #0300340. Finally, I would like to thank my immedi ate family members and my extended family from the University City Church of Christ. Without friends such as these, this lifetime would be bleak and uninteresting.


iv TABLE OF CONTENTS Page ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 INSTANTANEOUS BLIND SOURCE SEPARATION . . . . . . . . . . . . . . . . . . . . . . . . . 3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 MRMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Stucture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Modification for Sub-Gaussian Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 RenyiÂ’s Entropy Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 MRMI-SIG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1 Relationship Between RenyiÂ’s Entropy Estimator and Kurtosis . . . . . . . . . . . . . 25 The Five Competing BSS Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Cardoso and SouloumiacÂ’s JADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 HyvarinenÂ’s FastICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 ComonÂ’s MI method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Bell and SejnowskiÂ’s InfoMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 MSMI, Minimization of ShannonÂ’s Mutual Information . . . . . . . . . . . . . . . . . 31 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3 CONVOLUTIVE BLIND SOURCE SEPARATION . . . . . . . . . . . . . . . . . . . . . . . . . 42 Introduction to Convolutive BSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Difficulties of Convolutive BSS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 Number of Adaptable Parameters and Time-Varying Nature of Mixing . . . . . . 47 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Down-Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


v Categorization of Published Cr iteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Similarities of Identification and Inverse Systems . . . . . . . . . . . . . . . . . . . . . . . . 55 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Stability and causality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Sound quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Second-order statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Frequency-domain approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Representative Set of Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Proposed Method, MRMI-SIG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Experimental (Approximate) Upper B ound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 Synthetic Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 Real Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 4 CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 PDF Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Entropy Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 Other Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 8 APPENDIX A SUPERVISED FEATURE EXTRACTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 21 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Projection Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 Information-Theoretic Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Feature Reduction Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 0 B APPROXIMATING SHANNONÂ’S MUTUAL INFORMATION . . . . . . . . . . . . . . 153 C EXTRACTING SPECTRAL DIVERSITY USING MRMI-SIG . . . . . . . . . . . . . . . 158 LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189


vi LIST OF TABLES Table Page 2-1. Speed of adaptation, given in relative orders of magnitude . . . . . . . . . . . . . . . . . . . . 40 3-1. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3-2. Characteristics of Different Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3-3. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 3-4. Representative Set of Convolutive Separation Algorithms . . . . . . . . . . . . . . . . . . . . . 89 3-5. Type I Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 3-6. Type II Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3-7. Type III Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3-8. Type II, IV, V Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 6 A-1. Description of the Feature Reduction Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 133 A-2. Description of the Data Sets Used in the Comparison . . . . . . . . . . . . . . . . . . . . . . . 138


vii LIST OF FIGURES Figure Page 2-1. Block diagram of mixing and demixi ng. For instantaneous mixing and demixing H and W are matrices of constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2-2. Signal-to-interference plots for ten sources/observations. Notice that the x-axis is logarithmic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2-3. Entropy as a function of ge neralized Gaussian parameter, for both RenyiÂ’s quadratic and ShannonÂ’s (marginal) entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2-4. Entropy versus rotation angle for four values of . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2-5. RenyiÂ’s entropy versus for Lb = 1000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2-6. Expanded view of RenyiÂ’s entropy versus for Lb = 1000 . . . . . . . . . . . . . . . . . . . . 14 2-7. Statistical variation of severa l entropy and moment estimators for = 1, 2, and infinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2-8. Normalized standard deviation of RenyiÂ’s entropy estimator as a function of for = 0.5, Lb = 1000, where statistics were computed using 100 Monte Carlo trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2-9. Summation of normalized standard deviation for = 1, 2, and infinity . . . . . . . . . . 20 2-10. RenyiÂ’s entropy estimator using SIG for four kernel sizes and the theoretical values of RenyiÂ’s quadratic entropy as a function of . . . . . . . . . . . . . . . . . . . . . . . 23 2-11. Estimate of Gaussian kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2-12. Saturation of exponential as a function of . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2-13. SIR for Ns = 5 as a function of and Lb = 100, 200, 500, 1000, 2000, 5000, 10,000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2-14. SIR versus Lb for the competing BSS methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35


viii 2-15. Mean SIR as a function of , averaged over Lb = 100, 200, 500, 1000, 2000, 5000 and 10,000, for i.i.d. data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2-16. Mean SIR as a function of Lb, averaged over = 1, 1.2, 1.7, 2.7, 5, and 10, for i.i.d. data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2-17. SIR as a function of Lb data samples, for Ns = 5 audio sources . . . . . . . . . . . . . . . . 38 2-18. SIR as a function of Ns audio sources, for Lb = 10,000 data samples . . . . . . . . . . . 39 3-1. Block diagram of convolutive mixing and demixing systems . . . . . . . . . . . . . . . . . 43 3-2. Categorization of convolutive blind source separation methods . . . . . . . . . . . . . . . 54 3-3. Likelihood that a demixing system having L = 0 has a stable inverse . . . . . . . . . . . 61 3-4. Percent relative error as a function of window length for 4 values of Lh (each data point is an average of 10 trials) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 5 3-5. SIR for Lh = 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3-6. SIR as a function of 2 and , for 1 = 0.25 and Lt = 3 . . . . . . . . . . . . . . . . . . . . . . . 78 3-7. Permuted and non-permuted soluti ons for the supervised approximate upper bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 3-8. Experiments #1-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3-9. Experiments #5-6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 3-10. Top view of the recording setup for the da ta collected in the sound proof room . . . 94 3-11. Experiments #7-8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3-12. Experiment #9, SIR as a function of azimuth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 3-13. SIR of the approximate upper bound as a function of the acausal parameter, L, for both permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4-1. Plot of N versus dimensionality for constant SIR (entropy estimation) and constant SER (pdf estimation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 4-2. Estimation of a two-dimensional pdf using Parzen Windows with Gaussian kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


ix A-1. Block diagram of feature extr action for classification . . . . . . . . . . . . . . . . . . . . . . . 121 A-2. Classification performance versus output dimension, NO, for Pima data set . . . . . 142 A-3. Classification performance versus output dimension, NO, for Landsat data set . . 143 A-4. Classification performanc e versus output dimension, NO, for Letter Recognition data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 A-5. General classification performance fo r all 3 data sets. The number above each bar represents the mean classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 B-1. Joint entropy of Y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B-2. Conditional entropy of Y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 B-3. Summation of the joint and conditional entropies of Y . . . . . . . . . . . . . . . . . . . . . . 157


x Abstract of Dissertation Pres ented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy BLIND SEPARATION OF CONVOLUTIVE MIXTURES USING RENYIÂ’S DIVERGENCE By Kenneth E. Hild II December 2003 Chair: Jose C. Principe Major Department: Electrical and Computer Engineering A new information-theoretic (IT) criterion is introduced, which is based on the nonparametric approximation of Re nyi's quadratic entropy. Using this criterion, convolutive source separation is achieved by minimizing the mutual information between segments of processes. Three reasons are given why Renyi Â’s entropy is preferred to ShannonÂ’s entropy. In addition, experimental evidence is provided that shows that the proposed method is very data efficient and that it performs well for noiseless synthetic mixtures. Perhaps more importantly, a supervised training method was found that estimates the demixing coefficients directly when the sources are recorded one at a time. This method provides an experimental approximate upper bound to the performance for real recordings. A comparison of several different methods using both real and synthetic mixtures indicates that feedforward, time-domain demixing structures are preferred, current BSS methods are


xi performing well below their potential, and the greatest limitations in performance are additive noise and/or nonlinearities.


1CHAPTER 1 INTRODUCTION The past 50 years has seen a progression in signal processing techniques whose goal is to separate a source signal from one or mo re interfering signals, in which the only available information is that obtained from one or more sensors. The research in this field started with the optimal filtering work of Wi ener [1], Bode and Shannon [2] in the late 1940Â’s and early 1950Â’s. The original work required only one sensor and makes use of the spectral diversity of the sources. Following this, signal separation research changed to include multiple sensors. These techniques, known collectively as interference cancellation methods, can be divided into two groups, vi z., those that use spatial filtering and those that use temporal filtering. The former of th ese two methods, referred to as beamforming, oftentimes assumes that the desired source is in the far field and that it has a single, distinct direction of arrival [3]. Early examples include the IF sidelobe canceller of Howells from the late 1950Â’s [4] and the work of Applebaum from the mid 1960's [5]. The method that makes use of temporal filtering, on the other hand, commonly assumes that one of the sensor signals is composed entirely of a filt ered version of the interfering signal [6]. The most familiar embodiment of this method is referred to as adaptive noise cancellation, which has been aptly described in a paper from the mid 1970's by Widrow et al . [7]. In more recent years, a new signal sepa rating technique has emerged that can be applied in cases where the preceding techni ques cannot be used. The seminal papers in this area of Blind Source Se paration (BSS), which were written in the 1980Â’s, are by


2Bar-Ness et al . [8] and Ans et al . [9]. BSS is able to take advantage of either spectral or spatial diversity (unlike the classical methods ), it does not require that the desired source is in the far field (unlike beamforming incide ntally, the far field for acoustic applications begins at approximately 2m of separation between sensor and s ource), and it does not require that one of the sensors contains onl y the interfering source (unlike adaptive noise cancellation). BSS methods have been applied to severa l different mixing models, including those that produce linear and non-linear mixtures. For the case of linear mixtures, the mixing models are often dichotomized into those th at are instantaneous (memory-less) and those that are convolutive. Since the focus of the paper is on audio separation, with application to either hearing aids or speech recognition tasks, the convolutive mixing model is more appropriate (nonlinear models are not consider ed here). The approach taken within this paper is to use an a pproximation of Renyi’s -divergence to separate the signals. The paper will begin by using this criterion for the more simple task of instantaneous source separation and then proceed to the more diff icult convolutive source separation. Blind separation of instantaneous mixtures is covere d in Chapter 2, which is based on a published paper1 and a paper submitted to Signal Processing . The main topic, BSS for convolutive mixtures, is covered in Chapter 3 a nd is based on papers submitted to IEEE Transactions on Speech and Audio Processing and IEEE Signal Processing Letters . An additional application, supervised feature extraction, is also included since it may be addressed using the same criterion as used for source separation. This is covered in A ppendix A, and is based on a paper submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence . 1.© 2001 IEEE. Reprinted, with permission, from IEEE Signal Proc . Letters , Vol. 8, No. 6, pp. 174-176, June 2001.


3CHAPTER 2 INSTANTANEOUS BLIND SOURCE SEPARATION This chapter will examine the case of linear, instantaneous mixtures of sources. The more computationally demanding MRMI method, having complexity O(Lb 2), will be derived first (where Lb is the block length). This is followed by a generalization of the criterion that allows for any value of RenyiÂ’s parameter, , to be used. Several different arguments are given why = 2 is the preferred value. It is even preferred over = 1, which corresponds to ShannonÂ’s entropy. By dropping one of the summations of the MRMI method, the O(Lb) MRMI-SIG criterion is produced. Several aspects of this method are discussed, including its ability to make use of spectral diversity. Both the MRMI and MRMI-SIG methods are based on minimizing an approximation of RenyiÂ’s mutual information, which is a particular type of diverg ence measure, by means of non-parametric pdf (probability density function) estimation. A twostage process is used, which consists of spatial whitening and a series of Givens rotations. This allows the removal of the joint entropy so that the cost function consists only of marginal entropies. This formulation avoids the inaccuracy due to truncation of a series expansion and does not require the estimation of high-dimensional joint entropies. Comparisons are included for both the MRMI and MRMI-SIG methods, which show that both perform well and are very data efficient. Introduction Minimization of the mutual information ( MMI) between outputs is considered the ideal information theoretic criteria for blind source separation (BSS) [10-12]. The


4(Shannon) mutual information ca n be written as the sum of the (Shannon) marginal entropies minus the joint entropy. One of the difficulties in using MMI is the estimation of the marginal entropies. In order to estimate the marginal entropy, Comon and others approximate the output marginal pdf’s with truncat ed polynomial expansions [11-13], a process that inherently introduces error in the es timation procedure. A nother MMI method proposed by Xu et al . [14] avoids the polynomial expansion by approximating the KullbackLeibler divergence with the Cauchy-Schwartz inequality and estimating Renyi’s entropy non-parametrically using Par zen windows [15]. The second method requires estimation of the Ns-dimensional joint entropy. We propose a new algorithm based on an approximation of Renyi’s mutual information that avoids both polynomial expansions and estimation of the joint entropy. MRMI Renyi’s mutual information is defined as [16] The sum of Renyi’s marginal entropies minus the joint entropy is which differs from Equation 2-1. Equation 2-1 is non-negative and evaluates to zero when and only when the joint pdf can be written as a product of the marginals [16]. For superGaussian sources, there is experimental ev idence that Equation 2-2 is also non-negative IY () 1 1 – -----------fYy ()fYiyi() 1 – i 1 = Ns---------------------------------y d – log =(21) HY () HY () –i 1 = Ns1 1 – -----------fYy ()y d – fYiy () i 1 = Ns yid – --------------------------------------------log =(22)


5and evaluates to zero only when the joint pdf can be written as a product of the marginals. Therefore, minimizing Renyi’s MI can be acc omplished, just like with Shannon’s entropy, by minimizing the sum of (Renyi’s) marginal entropies minus the joint entropy. Notice that Equation 2-2 is not a reformulation of MI, but is preferred since Renyi’s entropy can be estimated non-para metrically as proposed in a paper by Principe et al . [17]. In order to produce an algorithm that scales favorably as the number of sources increases, the estimati on of the joint entropy is avoided by the two-step parameterization used by Comon [13]. The first stage performs spatial whitening and the second stage performs a rotation in Ns-dimensions. The rotation, denoted by matrix R, is adapted to minimize the cost function given by Equation 22. In fact, since Renyi’s joint entropy is invariant to rotations the joint entropy may be discarded from the adaptation process, leaving only the marginal entropies. Hence, the cost function becomes simply which mimics the cost function of Yang and Am ari [12] and Comon [13], but with Renyi’s entropy substituted for Shannon’s entropy. Usi ng the non-parametric Renyi’s entropy estimator proposed by Principe et al . [17], a new algorithm for blind source separation is produced [18].HY () 1 1 – -----------fYy ()y d – log = 1 1 – -----------fXx ()det R () det R () ----------------------------------x d – log Hx () == (23) JHYi()i 1 = Ns=(24)


6Structure The overall BSS block diagram for two inputs/observations is given in Figure 2-1. The equations for a system having Ns sources/observations are given by the (Ns x 1) observation vector, x = WTHTs, and the (Ns x 1) output vector, y = RTx, where W = Š1/2, is the matrix of eigenvector s of the autocorrelation of HTs, and is the corresponding eigenvalue matrix. Notice that E[xxT] = I, the (Ns x Ns) identity matrix, due to the spatial whitening [19]. The rotation matrix, R, is cons tructed from the product of Ns(Ns-1)/2 Givens rotation matrices, Rij, where Rij equals I with elements I(i,i), I(i,j), I(j,i) and I(j,j) modified to cos ij, sin ij, -sin ij, and cos ij, respectively, where I(i,j) is the element of I located at the ith row and the jth column. There is one rotation angle for each Rij, which is denoted as ij, whose purpose is to perform a rotation in the output space in the plane specified by the ith and jth (orthogonal) basis vectors. Th e gradient of R with respect to ij is also needed for the algorithm development and it is denoted as Rij, which is of size (Ns x Ns). The gradient matrix is the product of the same Ns(Ns-1)/2 Givens rotation matrices used in R, with the exception that d Rij/ d ij is used in place of Rij. xN (n) yN (n) y1(n): . : . Demixing Matrix W x1(n) : .s1(n) sN (n) s2(n) Mixing Matrix H Figure 2-1. Block diagram of mixing and de mixing. For instantaneous mixing and demi xin H and W are matrices of constants. x2(n) s s s


7Stochastic Gradient Descent When Parzen windowing is used with a Gaussian kernel, the estimate for Renyi’s quadratic marginal entropy simplifies to where G(x,2 2) is a Gaussian pdf evaluated at x and having a variance of 2 2, and yi(j) is the jth sample of output yi. Notice that the infinite-limit integral disappears and that there are no approximations involved (aside from th e implicit pdf estimation using Parzen windows). Substituting Equation 2-5 into Equation 2-4 and taking the derivative with respect to ij produces where ( ijR)i is the ith column of ijR and x(m) is the (Ns x 1) vector of x at time m. The overall update equation for stochast ic gradient descent is then (n+1) = (n) ( n ) , where (n) and (n) are vectors of angles, and is the step size. Comparisons The proposed method is compared to FastICA [20], Bell and Sejnowski’s Infomax [21] and Comon’s MMI method [13] using an instantaneous mixture of ten audio sources consisting of one music source and speech from five male and four female speakers. Spatial pre-whitening is used for each method, the va lues of the mixing coefficients are H2Yi() 1 Lb 2------Gyij () yik () 2 2, – ()k 1 = Lbj 1 = Lblog – =(25) ijd d ij--------H2Yk()k 1 = Ns= Gykm () ykn () –2 2, () ykm () ykn () – ()ijR ()i Txm () xn () – ()n 1 = Lbm 1 = LbGykm () ykn () –2 2, ()n 1 = Lbm 1 = Lb---------------------------------------------------------------------------------------------------------------------------------------------------------------------------k 1 = Ns– =(26)


8chosen uniformly in [-1,1] and the performance criterion is the signalto-interference ratio, defined as where q = RTWTHT, qi is the ith row of q, and (max qi) equals the maximum of qi. This criterion effectively measures the distance of q from the product of a permutation matrix with a diagonal matrix. Figure 2-2 shows the SIR for each method as a function of Lb, which in this case equals the data length, N. FastICA uses the symmetric approach and the cubic nonlinearity, Infomax uses Amari’s natural gradient [22], Comon’s method uses a fourth-order pdf expansion (requiring estimates of third and fourth-order cumulants) and the proposed SIR 1 Ns----10log10max qi()2qiqi Tmax qi()2– -----------------------------------------i 1 = Ns=(27) F igure 2-2. Signal-to-interference plots for ten sources/observations. Notice that the x-axis is logarithmic and that no results are shown for MRMI when the data length exceeds 1000 due to the computational complexity. 102 103 104 -5 0 5 10 15 20 25 Data Length (samples)SIR (dB)MRMI COMON FASTICA INFOMAX (NATURAL GRADIENT)


9algorithm, MRMI, uses a kernel size, 2, of 0.5 (Lb < 100) or 0.25 (Lb > 100). In addition, small step sizes were used to maximize SIR. As can be seen from the figure, the MRMI method requires much fewer data points to yi eld a given performance level (good separation is achieved at an SIR of 20 dB). This is imperative in a number of applications such as channel equalization and in nonstationary envi ronments. In hearing aid applications, for example, Brandstein explains that, “Any system which attempts to . . . apply some means of inverse filtering would have to be adaptable on almost a frame-by-frame basis to be effective” [23, pg. 3614]. The results for MRMI are not shown beyond Lb = 1000 due to the computational complexity, which is O(Lb 2) as compared to O(Lb) for the other three methods. 100 101 1 1.05 1.1 1.15 1.2 1.25 1.3 1.35 1.4 1.45 Generalized Gaussian Parameter, EntropyShannon's and Renyi's Entropy of uniform ( = inf) Renyi's Quadratic Entropy Shannon's Entropy F igure 2-3. Entropy as a function of generalized Gaussi an parameter, for both Renyi’s quadratic and Shannon’s (marginal) entropy.


10Modification for Sub-Gaussian Sources The section above describes how to use MRMI for super-Gaussian sources by using Renyi’s quadratic entropy ( = 2). Here, the criterion is generalized so that it can also separate sub-Gaussian sources. The Central Limi t Theorem states that the pdf of a summation of independent random variables tends toward the Gaussian distribution. Therefore, the goal of BSS is to force each output pdf, fYi( yi) , to be as far from Gaussian as possible. Shannon’s entropy has a nice feature in that th e pdf for which it is maximized, for a fixed variance, is the Gaussian. This is ideal for BS S applications since, if a sphering/rotation is used to control the variance, separation can be achieved by minimizing the sum of Shannon’s marginal entropies of the outputs. The proof that this yields a global minimum when separation is achieved can be derived from Equation 2-1 by noting that Shannon’s joint entropy is also invariant to rotations (as wa s Renyi’s). Figure 2-3 shows a plot of both Renyi’s quadratic and Shannon’s (margi nal) entropies as a function of , where is a parameter of a generalized Gaussian pdf which is a family of zero-mean, unimodal, and symmetric pdf’s of varying kurtosis. In Equation 2-8, Cm controls the variance and C is the normalization factor that ensures that the pdf integrates to 1. Since it is assumed th at the sphering/rotation architecture is used, the value of Cm for each point in the plot is chosen such that the pdf produces random variables having a variance of 1. The values in the plot are numerical estimates of the theoretical entropies (as opposed to using an entropy estimator, which produces values based on a finite number of samples of the random variable). The asterisks indicate the true fYmym() CeCmym –= (28)


11value for RenyiÂ’s and ShannonÂ’s entropies for = 1, 2, and infinity, corresponding to a Laplacian, Gaussian, and uniform pdf, respectivel y. Notice that the values of RenyiÂ’s and ShannonÂ’s entropy for a uniform random variable are identical, as can be easily proven. This plot shows that, for the generalized Gaussian family having 1 < < 10, ShannonÂ’s entropy is maximized for = 2, as it should, while RenyiÂ’s is maximized for approximately equal to 4. Figure 2-3 indicates that minimizing Sh annonÂ’s entropy always moves the output pdf away from the Gaussian. As an example, in Figure 2-4(A) the entropy of a mixture of 2 Laplacian sources is plotted as a function of rotation angle for = 1, where 0 + k /2 radians corresponds to separation for k any integer. The results for both RenyiÂ’s and 0 1 2 3 0 0.2 0.4 0.6 0.8 1 Rotation Angle (rad)Entropy (arbitrary units)Renyi's, Shannon's 0 1 2 3 0 0.2 0.4 0.6 0.8 1 Rotation Angle (rad)Entropy (arbitrary units)Renyi's Shannon's 0 1 2 3 0 0.2 0.4 0.6 0.8 1 Rotation Angle (rad)Entropy (arbitrary units)Renyi's Shannon's 0 1 2 3 0 0.2 0.4 0.6 0.8 1 Rotation Angle (rad)Entropy (arbitrary units)Renyi's Shannon's F igure 2-4. Entropy versus rotation angle for four values of . In this example, 0 + k /2 radians is a solution for k any integer. A) = 1, B) = 2.7, C) = 5, and D) = 10. A B C D


12ShannonÂ’s entropies in Figure 2-4 were scaled and (vertically) shifted, for visualization purposes, so that the minimum value is 0 and the maximum value is 1. Notice that the two results are virtually identical for Laplacian sources, and that both have minima in the proper locations. In fact, although it is not s hown, there was very li ttle difference between RenyiÂ’s and ShannonÂ’s entropies for all super-Gaussian pdfÂ’s (of the generalized Gaussian family). Likewise, Figure 2-3 indicates that RenyiÂ’s quadratic entropy requires minimization for the separation of super-Gaussian sour ces (any mixture of super-Gaussian sources is also super-Gaussian), but may require ma ximization followed by minimization, in order to separate a mixture of sub-Gaussian source s. This is also shown in Figure 2-4 by noting that a local minimum always occurs at the so lution (there is a minimum at the solution in Figure 2-4(B) although it is diffic ult to see in this plot); however, the value of the entropy at the separating solution is not the global minima until increases beyond approximately 8. This indicates that the sum of marginal entropies minus the joint, for RenyiÂ’s, can be negative. Whether or not the solution occurs at the global minimum for RenyiÂ’s entropy, it is clearly seen that a local minimum occurs fo r sub-Gaussian sources at the point of maximum mixing, /4 + k /2 radians, and so the minimiz ation of Equation 2-4 may produce erroneous results when there are sub-Gaussian sources in the mixtures. The preceding discussion was limited to theore tical quantities. If the entropy estimator of Equation 2-5 is used, the plot of entropy versus is represented by Figures 2-5 and 2-6. The use of the entropy estimator introduces the kernel shape and kernel size, , as additional user-defined parameters. Here we will limit the discussion to Gaussian kernels; therefore, the figures are shown only for different choices of and . Upon close observation of Figures 2-5 and 2-6, it can be seen that increasing causes the entropy estimates


13for large to increase more than for smaller . Two important conclusions can be drawn from this fact. First, for = 1, the plot no longer peaks at = 2 for > 0.75. Second, for = 2, the entropy plot becomes monotonic for > 1. This implies that ShannonÂ’s entropy can be used without the need for estimating the sign of the kurtosis only when < 0.75. Likewise, RenyiÂ’s (sample-base d) quadratic entropy can be used for BSS as long as the estimate of the sign of the (normalized) kurto sis for each output is included in the criterion, where the kurtosis of sub-Gaussian and super-Gaussian sources is negative and positive, respectively. Incidentally, it was original ly thought that having to estimate the sign of the kurtosis would be detrimental to the perf ormance, as compared to a method that does 100 101 0.8 1 1.2 1.4 1.6 1.8 = 4 = 0.3 = 2 = 1 = 0.25 100 101 0.8 1 1.2 1.4 1.6 1.8 = 4 = 0.3 = 2 = 1 = 0.5 100 101 0.8 1 1.2 1.4 1.6 1.8 Renyi's Entropy Estimator = 4 = 0.3 = 2 = 1 = 1 100 101 0.8 1 1.2 1.4 1.6 1.8 = 1 = 2 = 4 = 0.3 F igure 2-5. RenyiÂ’s entropy versus for Lb = 1000. A) = 0.25, B) = 0.5, C) = 1, and D) the results for RenyiÂ’s theoretical entropies, for sake of comparison. A B C D


14not need such estimations. However, this is not necessarily the case since methods that do not explicitly estimate the sign of the kurtosis must still do so implicitly . The decision of super-Gaussian or sub-Gaussian for impli cit methods (i.e., those that use ShannonÂ’s entropy) is completely determined by the sign of the gradient of the cost function. Since the sign of the gradient must be estimated us ing a finite amount of data, implicit methods are also subject to local minima.RenyiÂ’s entropy orderIt is well known that RenyiÂ’s entropy approaches ShannonÂ’s in the limit as approaches 1 [24]. As a result, the family of criteria presented in this section includes 100 101 1.35 1.4 1.45 1.5 Generalized Gaussian Parameter,Renyi's Entropy= 1.0 = 1.1 = 0.9 100 101 1.48 1.5 1.52 1.54 Generalized Gaussian Parameter,Renyi's Entropy = 1.0 = 1.1 = 0.9 F igure 2-6. Expanded view of RenyiÂ’s entropy versus for Lb = 1000. A) = 0.5 and B) = 1. A B


15Shannon’s MI as a special case. Recall that th e formulation of MI in terms of Shannon’s entropies is given by where H1( ym) is one of the Ns marginal entropies, H1( y ) is the joint entropy, and the subscript 1 is used to denote Shannon’s definiti on of entropy. Substituting Renyi’s entropy for Shannon’s produces the MRMI criterion given previously in Equation 2-2. The gene ralized non-parametric entropy estimator for Renyi’s entropy is given by which is Equation 2-5, except that it is expressed as a function of and a unique kernel size is assigned for each output. Equation 2-11 can now be plugged into Equation 2-4 or Equation 2-10 using any desired value of > 0 (except = 1), where Shannon’s MI is produced in the limit as approaches 1 [25] and MRMI is produced for = 2. The above formulation cannot be used to find the equation for = 1 since it results in the indeterminant value 0/0. There are, however, two ways in which an analogous expression can be found for an estimator of Shannon’s entropy, i.e., = 1. In the first method, the gradient of Equation 2-11, which is needed for gradient -based adaptation anyway, is found and then I1Y () H1Ym() H1Y () –m 1 = Ns=(29) HYm() HY () –m 1 = Ns(2-1 0) HYm() 1 1 – -----------1 Lb----1 Lb----Gymn () ymk () –2 m 2, ()k 1 = Lb 1 – n 1 = Lblog =(2-1 1)


16is set to 1 in the gradient expression. The second (equivalent) method is derived by noticing that Shannon’s entropy can be written in terms of an expectation as follows Approximating the expectation with the samp le mean and using Parzen window estimation for the pdf produces the following estimate for Shannon’s entropy which is similar to that used previously by Viola et al . for processing magnetic resonance images [26]. The difference is that Viola partitioned the data into two sets (one to estimate the pdf and the other to estimate the entropy). It is easy to show that the gradient of Equation 2-13 is identically the grad ient of Equation 2-11 with = 1. Nothing has been said hitherto concerning the optimum choice of , which is related to the norm used to measure the entropy [27]. The overall separation performance is the most important factor in deter mining an appropriate value for . However, the separation performance depends on se veral factors and therefore may not be the most useful measure for making valid inferences. For a criterion based on Equation 2-10, se veral of these factors include the statistical pr operties of the entropy estimation (including the bias and variance), the effect of summing together multiple entropies, and the ability to ascertain if the outputs are sub-Gaussian or super-Gaussian. Th e first item in this list, which determines the degree to which the criterion is able to di stinguish the Gaussianity of the outputs, is enlisted to select an appropriate range of values for . The relationship between the bias of the entropy estimation and the final separation performance is complex. For example, suppose that the expected value of the estimator H1Ym() EfYmYm() () log [] – =(2-1 2) H1Ym() 1 Lb----1 Lb----Gymn () ymk () – 2, () k 1 = Lblogn 1 = Lb– (2-1 3)


17equals the true entropy plus some constant, wh ere the constant (which is identically the bias) is independent of the demixing parameter. In this case, minimization or maximization of the corresponding criterion is completely unaffected by the bias of the entropy estimation, regardless of its magnitude. On the other hand, examples could also be given where the bias plays a pivotal role. Since the expe cted effect of estimation bias is not clear, let us consider the variance of the entropy estimation. Figure 2-7(A) shows the standard deviation of the entropy es timate as a function of , for the two values of discussed thus far. Also shown, for sake of perspective, is the standard deviation of several moment estimators, which are constr ucted using the sample mean of th e data raised to the appropriate power. This plot indicates that the variance increases as the distribution becomes increasingly super-Gaussian and that the estimator for = 1 has the smallest variance. This is misleading, however, since the ability to accurately distinguish between, e.g., a Laplacian and a Gaussian random variab le depends on how the vari ance of the estimator at = 1 and = 2 compares to the difference in the mean values (at the same two values of ). To address these shortcomings informati on on the bias and the variance of the entropy estimation is combined, the results of which are shown in Figure 2-7(B). This figure shows the same results as Figure 2-7(A), wi th the exception that the standard deviation is normalized by the difference in the mean values of the estimation for = 1 and = 2. Keep in mind that, with this definition, inferences for < 2 are more accurate than for those made for > 2. From the normalized results it is clear that the consistency of the RenyiÂ’s quadratic entropy estimator exceed s that of the estimator for fourth-order moments, which is commonly used as a criterion for BSS. Also, some methods are not very general purpose in that they have a low (normalized) variance for some value of ,


18but a large variance for a different value of . For example, the 8th-order moment estimator has the lowest variance (of the methods c onsidered) for Gaussian distributed data, but it also has the highest variance for Laplacian distributed data. In fact, Figure 2-7(B) indicates that, when the data distribution is Laplacian, the result obtained using the moment estimator may easily lead one to believe that the distribution is actually Gaussian. This is 1 2 -0.2 0 0.2 0.4 0.6 0.8 1 Normalized Standard DeviationInfinity Shannon's Entropy ( = 1) Renyi's Quadratic Entropy 4th moment 6th moment 8th moment ( = 2) 1 2 10-4 10-2 100 102 104 106 Standard DeviationInfinity Shannon's Entropy ( = 1) Renyi's Quadratic Entropy 4th moment 6th moment 8th moment ( = 2)Figure 2-7. Statistical variation of seve ral entropy and moment estimators for = 1, 2, and infinity. A) St andard deviation and B) Norm alized standard deviation. B A


19predicated on the fact that the expected value of the 8th-order moment estimator for Laplacian data is within (approximately) one standard deviation of the expected result for Gaussian data. Also notice that ShannonÂ’s entropy es timator is the least desirable for uniformly distributed data. The previous figure shows results for only two values of . A more complete picture, in terms of , is given in Figure 2-8. Once again, the normalized standa rd deviation is shown for values of that correspond to Laplacian ( = 1), Gaussian ( = 2), and uniformly ( = infinity) distributed data. This plot indicates that should not be less than 1 for either Laplacian or Gaussian distributed da ta. This is not surprising, at least for Laplacian distributed data, because of the following two facts. First, it is well-known that kurtosis-based methods need a lot of data to separate super-Gaussian sources (which have F igure 2-8. Normalized standard deviation of RenyiÂ’s entropy estimator as a function of = 0.5, Lb = 1000, where statistics were computed using 100 Monte Carlo t ria 0 1 2 3 4 5 6 7 8 9 10 0 0.05 0.1 0.15 0.2 0.25 Normalized Standard Deviation = 1 infinity = 2 =


20heavy tails) because they emphasize the tails of the pdf [11]. Second, for RenyiÂ’s entropy, the tails are increasingly emphasized as approaches zero [28]. While small values of perform poorly for super-Gaussian data, they perform well for sub-Gaussian data (for < 0.5). However, = 4 produces the most consistent estimator when the distribution is uniform. Interestingly, the plot for uniform da ta takes a maximum value at approximately = 0.75, for which the normalized standard deviation is approximately 2.5. This is a result of the lack of monotonicity between the entropy estimator and for the super-Gaussian region ( < 2). Also notice that = 1 is optimal, in regards to normalized variance, for Gaussian data. The range of that produces the most consistent estimators when simultaneously considering = 1, 2, and infinity, is roughly > 1, as shown in Figure 2-9. According to Figure 2-9 = 4 is the optimal choice, although it provides only a marginal improvement over 2 < < 10. An obvious choice is to use = 1 since it does not require explicit estimation of the sign of the kurtosis (for small m). However, there are three good arguments for selecting = 2, viz., (1) = 2 has nearly minimal normalized F igure 2-9. Summation of normalized standard deviation for = 1, 2, and infinity. 0 1 2 3 4 5 6 7 8 9 10 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Normalized Standard Deviation


21standard deviation, (2) unlike any other value of , there exists an entropy estimator for = 2 that allows the criterion to make use of spectral diversity, and (3) unlike any other value of , this new entropy estimator reduces th e complexity of the estimation from O(Lb 2) to O(Lb).MRMI-SIGThe MRMI-SIG algorithm is simply the MRMI algorithm with the exception that one of the summations in Equation 2-15 is removed [29] and different values of may be used for each output. Dropping the first summation, and then plugging the result into Equation 2-4, yields the following O(Lb) BSS criterion where the difference in time of the outputs, p , can be arbitrarily chosen, and the sign of the kurtosis has been explicitly added. A constant value of p = 1 is used for Equation 2-14, as well as in previous publications, such that th e difference occurs between consecutive outputs. The choice of p = 1 was chosen at least in part because it minimizes the number of requisite memory elements for an on-line implementation. Notice that the filtering operation corresponding to yi( j ) yi( j -1) is a high pass filter, which is not desirable in the presence of wide-band noise. A triv ial modification that provides a partial remedy consists of using yi( j ) yi( j -2) instead, which corresponds to a band-pass filter. The O(Lb) criterion of Equation 2-14 approximates stochastically the O(Lb 2) criterion and can be made arbitrarily close to it by averaging multiple estima tes, supposing that the time indices of the data are randomized for each es timate [28]. It turns out that randomization (and, therefore, JsignKYm() () 1 Lb----Gymn () ymnp – () –2 m 2, ()n 1 = Lblogm 1 = Ns– =(2-1 4)


22multiple estimates) is not necessa rily needed for the case of i.i.d. signals. To verify this, the two methods were used to compute a pl ot of entropy for many different values of using Lb = 1000 temporally independent data samp les. For this example, the MRMI-SIG algorithm was constrained to use only a single estimate, i.e., one pass of the data. The O(Lb 2) MRMI algorithm required three orders of magnitude more flops than the O(Lb) MRMI-SIG algorithm, although the results of the two (as a function of ) were almost indistinguishable. The number of passes of th e data required for MRMI-SIG to closely approximate MRMI is in versely proportional to Lb. Between 5 and 10 pa sses of the data are required for Lb = 100 and, as already stated, only a single pass is required for Lb > 1000. The worst case scenario is that, for Lb = 2, the two methods have an equal computational complexity. It is interesting to note that if the same trick of dropping one of the summations is used with either Equation 2-11 (valid for > 0, except = 1) or Equation 2-13 (valid for = 1), the resulting entropy estimator still estimates RenyiÂ’s quadratic entropy for all > 1 (which is the more interesting range of ). For example, if the inner summation of Equation 2-11 is dropped, the resulting equation is precisely MRMI-SIG except that it has a different positive multiplicative constant fo r the criterion and for the kernel size, . Multiplying the kernel size or the criterion by a positive constant is not significant since they serve only to modify the kernel size and the step size, which are user-defined parameters that can always be adjusted to precisely compensate for the multiplicative constants. Likewise, dropping the outer summation allows the -1 exponent to be moved outside of the log, which cancels the 1/(1-) term (except for the minus sign). Once again, this produces the MRMI-SIG criterion. The reas on MRMI-SIG approx imates MRMI with = 2 is


23that the inner summation term is linear, or raised to the power of 1, only for this value of . A similar problem occurs with Equation 2-13. Removal of the inner summation causes the entropy estimator to be a function only of te mporal second-order statistics. In this case, the criterion is no longer information-theoretic ; consequently, it is no longer of any interest. Conversely, removal of the outer summ ation immediately yields RenyiÂ’s quadratic entropy estimator. Figure 2-10 shows RenyiÂ’s quadratic entropy using SIG (for four values of kernel size) as a function of the generalized Gaussian parameter, . Also shown in the figure is the curve associated with the theoretical va lues of RenyiÂ’s quadratic entropy. The measurements were made using Lb = 1000 data points; therefore, identical curves are expected if the O(Lb 2) RenyiÂ’s quadratic entropy estimator of Equation 2-5 was used. This figure F igure 2-10. RenyiÂ’s entropy estimator using SIG for four kernel sizes and the theoretica l values of RenyiÂ’s quadratic en tropy as a function of . 100 101 0.5 1 1.5 2 Renyi's Entropy Using SIG = 0.1 = 0.25 = 0.5 Renyi's Quadratic Entropy ( = 2) = 1


24shows that the O(Lb), and hence the O(Lb 2), entropy estimator produces a good approximation of the theoretical value of RenyiÂ’s qua dratic entropy when the amount of data is sufficient and the corresponding is small (small relative to the variance of the data, which in this case is one). This is the expe cted result since the only approximation in the derivation of the entropy estimator concerns the pdf, combined with the fact that the error of the Parzen window pdf estimation become s arbitrarily small as the kernel size approaches zero and the length of the da ta approaches infinity. Notice that, as is increased from 0.1 to 0.5, a bias is intr oduced that is largely independent of . As the kernel size increases further, to = 1, the bias is no longer independent of and the shape no longer resembles the theoretical curve for = 2. Keep in mind, though, that approximating the theoretical value is not necessarily required for good performance when minimizing or maximizing an entropy-based criterion as will be shown. There is one other interesting item about MRMI-SIG. If there is some spectral diversity among the sources, it is possible to exploit it with MRMI-SIG with the proper choice of p . For example, suppose that p = 1 and that each source has a distinct value for the autocorrelation at lag 1. Since correlati on always implies dependence, the ( ym( n ) ym( n -1))2 term that appears in MRMI-SIG is able to provide information to separate the sources even if the sources are jointly Gaussian. The proof that MRMI-SIG can separate signals based on spectral diversity is given in Appendi x C. Spectral information has been used in numerous different methods that are based on second-order statisti cs and has been discussed for use in information-theoretic met hods [30]. However, to the knowledge of the authors, an information-theoretic method that uses spectral diversity has not been implemented hitherto. MRMI-SIG is able to extract spectral diversity only if the time indices


25are not randomized during adaptation, since random ization destroys the temporal structure of the data. This gives an important advant age of MRMI-SIG over MRMI (for any value of ) since the computation of MRMI involve s an average taken over time and cannot, therefore, be used in a similar fashion to se parate signals based on spectral diversity. It was experimentally de termined that the time indices should not be randomized when the absolute value of the normalized correlation coefficient at lag p is greater than 0.4 and they should be randomized otherwise. Notice that, if the time indices are not randomized, the algorithm is capable of se parating multiple i.i.d. sources, as well as multiple temporally-correlated Gaussian sources. The benefit of exploiting spectral diversity for source separation will be shown in the comparison section.Relationship Between Renyi’s Entropy Estimator and KurtosisSeveral definitions of kurtosi s have been discussed in the literature. The expression that will be used here is This definition is normalized so that, with re gards to generalized Gaussian pdf’s, it has a maximum value of 3 for Laplacian, a value of 0 for Gaussian, and a minimum value of -6/ 5 for uniformly distributed data. When sphering is used, Equation 2-15 simplifies to E[ y4] 3, which is estimated using Using the first 3 terms of a Taylor series expansion of ex allows the Renyi’s O(Lb) entropy estimator to be approximated by Equation 2-17. It is assumed, for the moment, that each source is temporally independent. Also, recall that sphering is used and that the outputs areKYm() EYm 4[] 3 EYm 2[]2– =(2-1 5) KYm() 1 Lb----ymn ()43 –n 1 = Lb(2-1 6)


26zero-mean. As a result, several of the terms in Equation 2-17 either vanish or become 1. Additionally, since additive constants and positive multiplicative constants have no effect on relative measurements and since the log is a monotonically increasing function, minimizing Equation 2-17 is equivalent to maximizing which is equivalent to maxi mizing Equation 2-16. It should be noted that the assumption of temporal independence is only needed for the O(Lb) entropy estimator, and only if it is used without randomizing the time indices. This assumption is not needed for either the O(Lb 2) quadratic entropy estimator, or if the da ta samples are randomized each epoch and the results are averaged over a sufficient number of epochs. Since, under certain conditions, minimizing Renyi’s entropy amounts to ma ximizing kurtosis, it may be considered as a measure of negative kurtosis (in the sense that maximization becomes minimization and vice versa). Although this rela tionship may prove useful, it is wise to discuss its veracity first. In the derivation, the only approximati on was that of the truncation of the series expansion. This truncation is valid as long as the exponent, -Cm[ y ( k )y ( k -1)]2, is close to 0. It turns out that the user-defined kernel size, m, appears in the denominator of Cm. For example, for Laplacian, Gaussia n, and uniform distributions, Cm is given by (respectively)H2Ym() C Lb----1 Cmymn () ymn 1 – () – ()2– Cm 22 ------ymn () ymn 1 – () – ()4+n 1 = Lblog – (2-1 7) 1 Lb----ymn ()4 n 1 = Lb(2-1 8) 2 2 m-------------1 2 m 2-----------1 3 m[]---------------------(2-1 9)


27Therefore, making m arbitrarily large is sufficient to ensure that the exponent approaches 0. In fact, for uniform distributions, the exponent is precisely 0 for any value of m (within the domain of the random variable, which is all that is pertinent). Figure 2-11 shows the true value of the Gaussian kernel and the approximation as a function of the number of terms kept in the series expansion. As shown in the plot for m = 2, using the terms up to and including the 2nd order (the first three terms) provides a good approximation. It should also be noted that m cannot be increased indefinitely sinc e it will eventually cause saturation so that the data no longe r have an effect on the entropy estimate. Figure 2-12 shows that m must be less than 5 for = 10 and less than 25 for = 1 in order to avoid saturation. The upper limit in both cases is larger th an that needed for RenyiÂ’s entropy to 5 10 15 -0.2 0 0.2 0.4 0.6 Order of last term in sumEstimate of C1exp(C2(yi-yj)2)4% error at order = 10 1 2 3 4 5 0.15 0.155 0.16 0.165 0.17 0.175 Order of last term in sumEstimate of C1exp(C2(yi-yj)2)2% error at order = 2 0 20 40 60 -1 -0.5 0 0.5 1 x 107 Order of last term in sumEstimate of C1exp(C2(yi-yj)2) 2 4 6 8 10 0.15 0.16 0.17 0.18 Order of last term in sumEstimate of C1exp(C2(yi-yj)2)5% error at order = 2 F igure 2-11. Estimate of Gaussian kernel. A) m = 1, = 10, B) m = 2, = 10, C) m = 1 = 1, and D) m = 2, = 1. A B C D


28approximate (negative) kurtosis. In fact, ex perimental results of separation performance indicate that the values of m less than 0.1 or larger than 4 need not be considered for unitvariance signals, regardless of the value of . The other important consideration in the choice of m is Lb. In order to decrease the bias while keeping the variance constant, the optimal m should decrease as Lb increases. In summary, using a small value of m for MRMI-SIG will produce a different solution than kurtosis, which is desirable when the sources are super-Gaussian. Likewise, a large value of m may be expected to produce a solution very similar to kurtosis (due to the relationship between RenyiÂ’s quadratic entropy estimator and kurtosis), which is desi rable when the sources are sub-Gaussian. Since the sign of the kurtosis is already required for the parameter updates, it is trivial to set m, for m = 1, 2, . . ., Ns, to one of two values (one fo r sub-Gaussian and one for superGaussian data) based on the sign of the kurtosis measured at the corresponding output. 5 10 15 20 25 30 0 0.5 1 1.5 Sigmaexp(C2(xi-xj)2) 5 10 15 20 25 30 0 5 10 15 20 25 Sigmaexp(C2(xi-xj)2) F igure 2-12. Saturation of exponential as a function of . A) = 10 and B) = 1. A B


29The Five Competing BSS Methods Now that the details of the MRMI-SIG method have been given, the five algorithms with which it is compared are now briefly described. They include the JADE algorithm [31], FastICA [20], an MI method by Com on [13], Infomax [21], and MSMI, a method based on the combination of ShannonÂ’s en tropy and Parzen windows. One additional method, an MI method by Yang and Amari [12], was also tried. However, the preliminary results were disappointing so no results are included here. Notice that all these methods use higher-order statistics. Methods that use only second-order statistics were not considered since most of the separation tasks included in the comparison are for stationary and temporally independent sources, for which second-order statistics is insufficient.Cardoso & SouloumiacÂ’s JADEThis method jointly diagonalizes cumulant ma trices of order 2 and order 4. This is not a gradient-based method so there is no st epsize, and the code (downloaded from http:/ / uses a stopping criterion so that there are no parameters to tune. This greatly simplifies th e use of this code. It was also the fastest method. One possible drawback is that the memory requirements increase exponentially as the number of sources increases.HyvarinenÂ’s FastICAFastICA is a fixed point algorithm so it does not require a step size. The code, which was downloaded from the internet (http://www, did not include a stopping criterion so the only user-def ined parameter required was the number of iterations to adapt. This value was individua lly determined for each separation task. Like the JADE algorithm, this method is very fast.


30ComonÂ’s MI MethodComonÂ’s MI method is found by formulating ShannonÂ’s MI in terms of negentropy. Since it restricts the architecture to a s phering/rotation, the join t negentropy may be removed leaving a sum of marginal negentr opies. The Edgeworth pdf expansion is then employed, only keeping terms up to the 4th order, in order to a pproximate the marginal negentropies as a weighted sum of 3rd and 4th order cumulants. It is a gradient-based method and therefore requires a step size.Bell & SejnowskiÂ’s InfomaxBell and SejnowskiÂ’s version of Infomax typically employs a sigmoid nonlinearity at each output. The goal of the algorithm is to maximize the entropy at the output of the nonlinearities. It has been shown that this method is a maximum-likelihood method for the case that the non-linearities are equal to the cumulative distribution functions (cdfÂ’s) of the sources [32]. However, the cdfÂ’s are not known in BSS. Theref ore, the nonlinearities need to be adapted in some fashion to ensure that Infomax is robust to varying distributions. In order to avoid the difficulty of ad apting these nonlinearities, a sort of best-case scenario has been adopted. For all synthetically created data , the non-linearities are chosen to be the exact cdf of the sources. Some results are also included for speech data, for which the sigmoid non-linearity is a decent approximation of the cdf. Therefore, the results for speech separation use the sigmoi d non-linearity, the code of which may be found at It should be noted that, in all cases, the tap weight update employed the natural gradient [22], which is also known as the relative gradient [33]. This form of the update equa tion reduces the sensitivity of adaptation to


31nearly singular mixing matrices. Infomax is a gradient-based method, so a step size is required. MSMI, Minimization of ShannonÂ’s Mutual InformationThe authors would be remiss to not include the method based on combining ShannonÂ’s entropy with Parzen window pdf estimation since it is quite similar to the proposed method, MRMI-SIG, but has the nice feature th at it does not need e xplicit estimation of the sign of the kurtosis. This method shall be re ferred to, for lack of a better name, MSMI, Minimization of ShannonÂ’s Mutual Informati on. The derivation of the MSMI algorithm consists of plugging the entropy estimator for ShannonÂ’s entropy, given previously in Equation 2-13, into Equation 2-9. The architecture is restricted to a sphering/rotation combination; therefore, the joint entropy may be dropped from the criterion. Comparisons An off-line implementation is used for the comparisons; therefore, it is assumed that the data are always available and the time limit to find a separating solution is restricted only by the patience of the user. This implies that the data may be re-used for any number of epochs. For the gradient-based adaptive al gorithms, which include 4 of the 6 algorithms used here, there is an inherent trade-of f between variance of the solution upon convergence and time of adaptation, and any comparison should aim to keep one of these two constant. Since an off-line implementation is assumed, it seems reasonable to train these methods using Lb equal to the length of the data and to use a small step size. In this case, it is possible for the parameters of the gradient-b ased methods to exhibit virtually zero variance upon convergence. An exception occurs for the MRMI-SIG algorithm, which randomizes the data between each epoch when the sources are temporally independent.


32Unlike the other methods, each possible or dering of the data within a batch window changes the gradient estimate for MRMI-SIG, resulting in a non-zero variance upon convergence. In order to completely remove th e variance, the value of the step size would have to approach zero, which causes an unac ceptable increase in training time. The value of the step size for MRMI-SIG was therefor e chosen so that convergence takes approximately the same amount of time required by the other gradient-based methods. The performance measure will once again be SIR, whic h was previously given in Equation 2-7. A total of 10 Monte Carlo runs were performed for each separation task. Each Monte Carlo run used a different mixing matrix, whose en tries were chosen uniformly in [-1,+1]. JADE and FastICA have no user-definable parameters, ComonÂ’s method and Infomax have a step size, and MRMI-SIG and MSMI have both a step size and a kernel size. The value of 0.5 was chosen for the kernel size for MSMI in order that the maximum entropy pdf remains the Gaussian. Also, due to the O(Lb 2) complexity, MSMI uses only 500 randomly selected data points when the le ngth of the data exceeds 500. In order to take advantage of any spectral diversity, ra ndomization is not used for MRMI-SIG whenever the mean of the absolute values of th e normalized correlation coefficients of the outputs exceeds 0.4. The kernel sizes for MRMI-SIG are chosen to be 0.25 and 1 for positively and negatively kurtotic outputs, respec tively (the sign of the kurtosis is checked periodically during training). While it is possible to fine tune the kernel sizes in order to avoid local minima [34], this was not done. In fact, the only fine tuning that occurred was for the Infomax algorithm. It was noticed that, for > 1.7, the Infomax algorithm was continually getting stuck in a poor-performing local minima. By carefully adjusting the step size, it was possible to escape these local min ima. However, it required constant


33supervision of the training process. On th e other hand, Infomax worked quite well for super-Gaussian data. Supervision for the Info max algorithm is justif ied since the desire was for the results of this method to represent a best-case scenario. The first separation task is to separate Ns = 5 sources for different combinations of and Lb. Six different exponentially increasing values of are used. These values are 1, 1.2, 1.7, 2.7, 5, and 10 (for each test, all 5 sources have the same ). An exponential increase was used since = 2 is the logical choice for the midpoint. In addition, seven different values were used for the data length. These values are 100, 200, 500, 1000, 2000, 5000, and 10,000. This made a total of 42 different combinations for each method. The samples for this task are drawn in an i.i.d. fa shion so that the data are stationary and temporally independent. Figure 2-13 shows the results for each method, averaged over 10 Monte Carlo trials, in a separate subplot. The performance for all of the methods when = 2 is expected to be very poor since i.i.d. Gaussian-distributed sources cannot be separated. As a result, the lines in Figure 2-13 that extend between the values of = 1.7 and = 2.7 are not meant to reflect the expected performance between these two values of . As expected, the performance drops as approaches a Gaussian distributi on. Likewise, the performance increases as Lb, hence the data length, increases. Figure 2-14 shows a different view of the sa me results. In this figure, each subplot represents a different value of . Aside from some initial differences in data efficiency (i.e. for small values of Lb), notice that the different methods perform almost identically as the distribution of the sources become increasing ly uniform. The similarity of MRMI-SIG and the methods based on 4th-order cumulants, in this case, is not surprising. Likewise, the


34similarity between MRMI-SIG and MSMI is not surprising since (1) the uniform distribution is the only distribution for which the theo retical value of RenyiÂ’s entropy is independent of , (2) Parzen windows produces a good appr oximation of the pdf for uniformly distributed data when Lb > 500 (with an appropriate kern el size), and (3) the entropy estimator produces a good approximation of the theo retical entropy when the pdf is approximated well. 100 101 0 10 20 30 40 Lb = 100 Lb = 10,000 SIR (dB) 100 101 0 10 20 30 40 SIR (dB)Lb = 100 Lb = 10,000 100 1 01 0 10 20 30 40 SIR (dB)Lb=100 Lb = 10,000 100 101 0 10 20 30 40 SIR (dB) Lb = 10,000 Lb = 100 100 101 0 10 20 30 40 SIR (dB)Lb = 10,000 Lb = 100 100 1 01 0 10 20 30 40 SIR (dB)Lb = 100 Lb = 500 F igure 2-13. SIR for Ns = 5 as a function of and Lb = 100, 200, 500, 1000, 2000, 5000, 10,000. A) MRMI-SIG, B) JADE, C) FastICA, D) ComonÂ’s MI, E) Infomax , and F) MSMI. A BC D E F


35Figures 2-15 and 2-16 show the results averaged over Lb and , respectively. Notice that MRMI-SIG is more data efficient than the other methods, even the O(Lb 2) MSMI algorithm. The performance of the MSMI al gorithm, as shown in Figure 2-15, is handicapped by the fact that it is limited to 500 data points due to the O(Lb 2) complexity. From Figure 2-16 it is apparent that, even though it has exponentially greater computational 102 103 104 0 10 20 30 40 SIR (dB) MRMI-SIG JADE Fast ICA Comon Infomax MSMI A 102 103 1 04 0 10 20 30 40 SIR (dB)MRMI-SIG JADE Fast ICA Comon Infomax MSMI 102 103 104 0 5 10 15 20 SIR (dB) MRMI-SIG JADE Fast ICA Comon Infomax MSMI 102 103 1 04 0 10 20 30 SIR (dB)MRMI-SIG JADE Fast ICA Comon Infomax MSMI 102 103 104 0 10 20 30 40 SIR (dB) MRMI-SIG JADE Fast ICA Comon Infomax MSMILb 102 103 1 04 0 10 20 30 40 SIR (dB)MRMI-SIG JADE Fast ICA Comon Infomax MSMILb C B D E F F igure 2-14. SIR versus Lb for the competing BSS methods. A) = 1, B) = 1.2, C) = 1.7, D) = 2.7, E) = 5, and F) = 10.


36complexity and the determination of super-Gaussianity and sub-Gaussianity is implicit in the criterion, it still performs worse than both MRMI-SIG and JADE at all values of Lb. Likewise, Figure 2-16 shows that Infomax performs worse than both MRMI-SIG and JADE at all values of Lb, even though knowledge of the true source cdf was used to construct the nonlinearities and the adaptation wa s constantly supervised. FastICA performs the worst when the data length is small and ComonÂ’s MI method performs the worst for Lb > 1000 (neglecting MSMI). The next two separation tasks use artificial ly mixed audio. There were a total of 50 sources, of which 24 were speech (approxi mately Laplacian distributed) and 26 were music (most of which were slightly super-Gaussi an). Each Monte Carlo trial used a F igure 2-15. Mean SIR as a function of , averaged over Lb = 100, 200, 500, 1000, 2000 , 5000, and 10,000, for i.i.d. data. 100 101 0 5 10 15 20 25 30 Generalized Gaussian Parameter, SIR (dB)MRMI-SIG JADE Infomax Comon MSMI Fast ICA


37randomly selected Ns of these sources. One task was to separate Ns = 5 sources as a function of Lb, while the other varied Ns and kept Lb at a constant value of 10,000. These results are shown in Figures 2-17 and 2-18. Noti ce that the performance of all the methods are reduced from that of the i.i.d. sources due to the reduction of available statistical information caused by the time-correlation of the audio sources. However, MRMI-SIG is reduced much less than the others. Unlike before, MSMI is able to improve as Lb increases above 500, as shown in Figure 2-17, since the 500 randomly selected points become less likely to be temporally correlated as Lb increases. Notice that Infomax performs quite well in this case even though a sigmoid nonlinearity is used, which is not perfectly tuned to the cdf of the sources. In fact, it surpasses the performance of JADE, which had outperformed it for i.i.d. data . Figure 2-17 shows that MSMI is better than all methods F igure 2-16. Mean SIR as a function of Lb, averaged over = 1, 1.2, 1.7, 2.7, 5, and 10, for i.i.d. data. 102 103 104 0 5 10 15 20 25 30 35 L b SIR (dB)MRMI-SIG JADE Infomax Fast ICA MSMI Comon


38except MRMI-SIG for Lb < 500. This is at least partially because of the improved statistical properties of estimating ShannonÂ’s entropy ve rsus kurtosis for super-Gaussian sources, as shown previously in Figure 2-7(B). This advantage is not seen in Figure 2-18 because of the restriction on the amount of data used by MSMI. Recall that MSMI is limited to 500 data points, while all other methods use more than a magnitude of or der more data. Also, it appears that the performance for all the me thods, except MRMI-SIG, is flat up to and including Lb = 2000, with the slope across all methods basically identical for Lb > 2000. Hence, the better performance of MRMI-SIG for Lb < 2000 is attributed to the extraction of temporal dependencies. In Figure 2-18, the data point corresponding to Ns = 20 for ComonÂ’s MI method was unavailable because the time of adaptation for large Ns, an order of magnitude longer than the other gradie nt-based methods, became unbearably long. F igure 2-17. SIR as a function of Lb data samples, for Ns = 5 audio sources. 102 103 104 0 5 10 15 20 25 30 Number of data points, L b SIR (dB)JADE Comon Infomax MSMI MRMI-SIG Fast ICA


39This comparison was for an off-line BSS implementation, therefore the time of adaptation is considered unimportant (except in extreme cases). However, the amount of time required for each are listed in Table 2-1. Keep in mind that, had the comparison fixed the adaptation time, the gradient-based algorithms would have traded performance for time. Nevertheless, it is quite impressive that J ADE and FastICA perform as well as they do and yet require very little training time as compared to the other algorithms. Conclusions A new O(Lb 2) BSS algorithm, MRMI, has been de veloped that is based on the minimization of an approximation of RenyiÂ’s mutual information. Unlike the Comon approach that uses ShannonÂ’s entropy, it does not require truncation of a pdf series expansion. Also, unlike the approach by Xu et al ., the proposed method does not need to approximate the F igure 2-18. SIR as a function of Ns audio sources, for Lb = 10,000 data samples. 2 4 6 8 10 12 14 16 18 20 0 5 10 15 20 25 30 35 40 Number of Sources, NsSIR (dB)Fast ICA Comon MSMI MRMI-SIG JADE Infomax


40high-dimensional joint entropy. While this algorithm is computationally more complex than its predecessors, it has been shown to be much more efficient in terms of the amount of data samples required for good separation. Th is is a paramount item for tracking a rapidly changing environment [35], [36]. By dropping one of the summations in Equation 2-5, an O(Lb) estimator for RenyiÂ’s quadratic entropy was created, which is the basis of the MRMI-SIG criterion. Although MRMI-SIG was originally de signed with an on-line impleme ntation in mind, it performs quite well for off-line implementations. For example, when the data are i.i.d., randomizing the data between iterations allows the MRMI-S IG criterion to effectively approximate the performance of the O(Lb 2) MRMI method. This differs from an on-line implementation, for which randomization is not as feasible. Like wise, if the data are temporally correlated, MRMI-SIG can extract information contained in time by not randomizing the data. While the performance improvement over the other me thods is marginal when the sources are not temporally correlated, the improvement is quite significant when they are correlated. The MRMI-SIG method is the only method in the comparison that is able to make use of Table 2-1. Speed of adaptation, given in relative orders of magnitude. Ns = 5, Lb = 100Ns = 5, Lb = 1000Ns = 20, Lb = 10,000 MRMI-SIG236 JADE001 Infomax235 ComonÂ’s MI237 FastICA002 MSMI34 (using 500 data points) 5 (using 500 data points)


41spectral diversity, and, in fact, is the only known information-theoretic method with this capability (incidentally, the SIR results of separating two colored Gaussian sources averaged over 10 Monte Carlo trials for MRMI-S IG, Infomax, FastICA, and Jade are 37, 18, 11, and 7 dB, respectively, where the lengt h of the data is 35,000). Parzen windows may be applied to create a non-parametric estima tor for any value of , including = 1, which corresponds to ShannonÂ’s entropy. However, three important reasons are given why = 2, which corresponds to RenyiÂ’s quadratic entropy estimator, is preferred. They are as follows: (1) nearly optimal statistical properties when measured in terms of normalized standard deviation, (2) the ability to make use of spectral diversity , and (3) exponentially reduced computational complexity. Also s hown was the relationship between MRMI-SIG and kurtosis for large values of , and the modification required for MRMI-SIG to separate sub-Gaussian sources (by means of the sign of the kurtosis).


42CHAPTER 3 CONVOLUTIVE BLIND SOURCE SEPARATION This chapter will focus on the applicati on of convolutive Blind Source Separation (BSS) methods for the specific case of audi o separation. The difficulties associated with this task are listed. Several of these difficulties, which preclude the existence of a perfect separating solution, are well-know n. Herein, one additional re ason is examined. Namely, the process of sampling may degrade separa tion performance. Numerous convolutive BSS methods are then categorized. A total of se ven methods, including one or two representatives from each of the five categories, are chosen from this list in order to be compared in a later section. One additional approach, ba sed on the same methodology as presented in Chapter 2, is also presented. An experime ntal (approximate) upper bound for the performance of any convolutive BSS method is then introduced, which establishes an attainable standard of performance. By collecting real data one source at a time, it is possible to apply a supervised training method to estimate the demixing coefficients. It is shown that this approach, which is free from local minima, minimizes the power of the interfering signal in each output. While the data collec tion procedure is itsel f unrealistic, it provides insight as to how good separation can be for a particular (real) data set. It also can be used to make more confident assertions, than othe rwise possible, concerning the identity of the limiting factors of performance. The seven di fferent BSS methods chosen are then compared using both synthetically mixed sources and sources recorded in a sound proof room.


43Introduction to Convolutive BSS The block diagram of Figure 3-1 shows a convolutive BSS system. For BSS, the mixing matrix, H(z), and the different sources, si(n), are shown in the figure, although they are not available at the receiver. The equation that describes the mixing process for convolutive mixing is given by the following feedforward model where the ijth entry of H(z) is the Z-transform of the impulse response, hij(n), and the sizes of these signals and all other signals are give n in Table 3-1. The individual elements of the observation vector can be written in terms of a sum of convolutions as follows It will be assumed throughout that the num ber of observations equals the number of sources and that each hij(n) is an FIR filter of length Lh. This FIR feedforward model is able to describe any linear mi xture, as long as we allow the possibility of infinite-length FIR filters. In contrast, several different demi xing structures are considered since different structures have significantly different properties associated with adaptation.xn () Hz () sn () = (31) xin () hijn () sjn ()j 1 = Ls=(32) xN (n) yN (n) y1(n): . : . Demixing Matrix W(z) x1(n)S : .s1(n) sN (n) s2(n)S Mixing Matrix H(z) F igure 3-1. Block diagram of convolutive mixing and demixing systems. x2(n)S


44 Variable(s)DescriptionSize s (n), x (n), y (n)source, observation, and output vector, respectively Ns x 1 sj(n), xj(n), yj(n) jth element of source, observation, and output, respectively scalar H(z), Wf(z), Wb(z)mixing, feedforward demixing, and feedback demixing matrix, respectively Ns x NsHij(z), Wf,ij(z), Wb,ij(z) ijth element of the corresponding matrix polynomial hij(n)individual (local) mixing filterLh x 1 wf,ij(n), wb,ij(n)individual (local) demixing filter for feedforward and feedback systems, respectively Lw x 1 Nsnumber of sourcesscalar Lh, Lw, Lw*length of hij(n) and wij(n) (for both FF and FB), and memory depth, respectively scalar L, Lflength of paraunitary fi lter and length of FFT, respectively scalar L, Lmacausal filter parametersscalar Ltoverall temporal span of outputs used in MRMISIG scalar Lslength of consecutive temporal dependence for each source scalar Lddecimation factor for MRMI-SIG criterion (Ld = 1 corresponds to no decimation) scalar Nnumber of data pointsscalar Lbblock lengthscalar NM, NIparameters for ParraÂ’s JBD methodscalar ddecimation factor (d = 1 corresponds to no decimation) scalar T able 3-1. Terminology.


45In the time-domain, demixing can be done by either the feedforward (FF) architecture or the feedback (FB) architecture where the individual demixing filters in th e above two equations are described by an impulse response, the implementation of which can have one of several different (local) structures. Another option is to use a frequency-domain structure. Simply stated, the goal of BSS is to adjust the parameters of each wf,ij(n) or wb,ij(n) (for time-domain structures), such that the output signals, yi(n), approximate the sources, sj(n). Note that the acronyms FF and FB are used here in a global sense, and that the individual filters of each are free to be locally feedforward or feedback. The list of (local) filters includes, but is not limited to, the autoregressive (AR) [6], moving average (MA, also known as FIR) [6], autoregressive-moving average (ARMA) [6], and Gamma f ilters [37]. The combination of a FF or FB system with one of the filters just listed wi ll be denoted using the global structure first, such as FF/FIR or FB/ARMA. The simplest and by far the most common of these local filters is the FIR. The AR and ARMA filters have arbitrarily located poles and are, therefore, IIR. The Gamma filter is also IIR, but the lo cations of the poles is heavily restricted. Consider for a moment that there are Ns = 2 sources and observations. In this case, the two separating solutions (one for each globa l permutation) for the restricted FF yin () wfij ,n () xjn ()j 1 = Ls=(33) yin () xin () wbij ,n () yjn ()j 1 = Ls+ =(34)


46architecture (restricted to have z-L terms on the main diagonal) are given by Equations 3-5 and 3-6 for the non-permuted and the permuted solutions, respectively. In these equations, the variable L is a user-defined acausality parameter and the local structure can be any one of the four liste d above. Likewise, the two solutions for the restricted FB architecture (restricted to have zeros on the main diagonal and to have L = 0) are given by Equations 3-7 and 3-8. Since Y(z) = Wf(z)X(z) = Wf(z)H(z)S(z) (the equality used here implies that an infinite length window is used), Equations 3-5 and 3-6 can be verified by noticing that the product of Wf(z)H(z) at the separating solution must be either a diagonal or anti-diagonal matrix Wfz () zL –H –12z () zL –H22z () --------------------------H –21z () zL –H11z () --------------------------zL – (non-permuted solution) = (35) Wfz () zL –H –11z () zL –H21z () --------------------------H –22z () zL –H12z () --------------------------zL – (permuted solution) = (36) Wbz () 0 H –12z () H22z () ------------------H –21z () H11z () ------------------0 (non-permuted solution) = (37) Wbz () 0 H –11z () H21z () ------------------H –22z () H12z () ------------------0 (permuted solution) = (38)


47(for Ns = 2). Likewise, Y(z) = (I Wb(z))-1X(z) = (I Wb(z))-1H(z)S(z) for the FB architecture. Difficulties Of Convolutive BSS Convolutive audio separation methods that a im to demix the sources by inverting the filtering imposed on them, have a number of diff iculties that must be overcome in order to successfully demix the signals. Several of th ese are well-known. Additive noise and non-linearities are problematic and they are not accounted for in the majority of mixing models. Also, the requisite number of adjustable parameters is often large, the mixing is time-varying, and permutation ambiguities exist. In additi on, the process of down-sampling may, in and of itself, be sufficient to preclude the ex istence of a perfect separating solution. In this section, several of these topics will be discussed in more detail. Number of Adaptable Parameters an d Time-Varying Nature of Mixing The length of the significant portion of the impulse response of the mixing filters for audio applications may be t housands of taps long [38]. In all likelihood, the impulse responses of the demixing filters for separa tion is also very long. Consequently, timedomain FIR architectures will require a large number of adaptable parameters and a large number of data points in order to train them. This might not have been such a grave problem if it weren't for the fact that the mixing environment is rapidly time-varying. In fact, “a talker turning his head or motion as little as a few centimeters is frequently sufficient to compromise the optimality of these [array design] schemes” [23, pg. 3614]. As a result, the demixing system will have to find the approximate separating solution given only seconds or fractions of a second of data.


48Permutations Local permutations (as opposed to global permutations, which represent a basic indeterminacy for all BSS methods and which do not inhibit the ability of the methods to separate the sources) and gain ambigui ties are a well known problem for frequencydomain convolutive separation. A number of pape rs have been published that attempt to address this issue [39-49]. Moreover, recent work by Parra and Alvino has shown that time-domain methods that rely on se cond-order statistics (SOS) are also subject to poorperforming local minima as a result of local permutations [41]. Down-sampling The last item to be discussed in this s ection concerns down-sampling. For audio separation, the observations are necessarily con tinuous-time, but are oftentimes sampled so that digital computers may be used for the demixing. The sampling theorem states that signals that are sampled above th e Nyquist rate may be perfectly reconstructed if a suitable interpolating filter is used [50]. While it ha s become second-nature for any engineer to use an anti-aliasing filter prior to sampling, the Nyquist rate has li ttle or nothing to do with the requirements for separation. The difference is th at the goal of separation is to cancel signals, not to reconstruct them. In order to perf ectly cancel signals that have been sampled above the Nyquist, the demixed si gnals would have to be interpolated just prior to their addition, which is never done in practice. In addition, the demixing filter cannot itself act as the interpolator without the use of an infinitely long fractionally-spaced equalizer. Down-sampling also provides a benefit, when used with an anti-alias filter, since it reduces the bandwidth of the signals and it re duces the length of the optimal demixing filter by a factor equal to the decimation factor. In fact, in the limit as the bandwidth of the


49signals becomes zero, the requisite demixing process reduces to that of instantaneous demixing. Downsampling makes the demixing proc ess easier and may in fact more than offset the loss in performance caused by down-sampling. An example of how decimation affects the best performing solution is given ne xt, but first it should be mentioned that in every example examined thus far, the benefit provided by shortening the demixing filter has more than offset any loss due to down-sampling. Suppose there are Ns = 2 sources that are sampled at 16 kHz, and that the mixing and demixing matrices are given by and respectively. Then let the observations be decimated by a factor of 2, producing x (n). The first output, y1(n), as a function of the original sources (for one of the two possible sampling phases) is as follows which can be rewritten asHz () 1 h120 () h121 () z1 –+ h210 () h211 () z1 –+1 = (39) Wfz () 1 wf 12 ,0 () wf 12 ,1 () z1 –+ wf 21 ,0 () wf 21 ,1 () z1 –+1 = (3-1 0) y1n () s12 n () h120 () s22 n () h121 () s22 n 1 – () wf 12 ,0 () s22 n () h210 () s12 n () h211 () s12 n 1 – () ++ () +++ = + wf 12 ,1 () s22 n 2 – () h210 () s12 n 2 – () h211 () s12 n 3 – () ++ ()(3-1 1) y1n () 1 wf 12 ,0 () h210 () + () s12 n () wf 12 ,0 () h211 () s12 n 1 – () wf 12 ,1 () h210 () s12 n 2 – () +++ = wf 12 ,1 () h211 () s12 n 3 – () + h120 () wf 12 ,0 () + () s22 n () h121 () s22 n 1 – () wf 12 ,1 () s22 n 2 – () ++(3-1 2)


50Whether the permuted solution or the non-permuted solution is desired, and whether this sampling phase or the other sampling ph ase is chosen, there are no values of wf,12(0) and wf,12(1) such that y1(n) contains only delayed versions of either s1(n) or s2(n). The reason is that downsampling produces a situation where there are more unknowns than there are equations. Furthermore, this probl em is not guaranteed to be ameliorated by increasing the length of the demixing filter, unlike the case of no down-sampling. If the separation performance is defined as the ratio of the desired signal power to the interfering signal power, the best possible performanc e for the above example (without down-sampling) is infinity. Supposing that the source s are zero-mean, unit variance, and temporally uncorrelated for lags 1 to 3, then the separation performance with downsampling (here it is assumed that s1(n) is the desired source and s2(n) and its delayed versions are the interference) is given by The maximum is attained with the appropriate choice of wf,12(0) and wf,12(1). Notice that this equation is necessarily finite due to the h12(1)2 term in the denominator. Similar results are obtained if the power of s1(n) is minimized in y1(n), which corresponds to the other permutation. The astute reader will notice that, for this particular example, perfect separation is possible by removing the constraint that the wf,ii(n) = (n). This solution, however, no longer applies when the decimation fa ctor is increased to a value greater than or equal to three.1 wf 12 ,0 () h210 () + ()2wf 12 ,0 () h211 () ()2wf 12 ,1 () h210 () ()2wf 12 ,1 () h211 () ()2+++ h120 () wf 12 ,0 () + ()2h121 ()2wf 12 ,1 ()2++ -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------(3-1 3)


51The assumption used above that the sources are temporally uncorrelated is not necessary, and is used only to simplify the example. This should be obvious since there are three delayed versions of s2(n) in y1(n), and only two adaptable parameters (for each parameter added, an additional unknown is also added). In order to demonstrate that the Nyquist criterion has little to do with the ab ility to separate a downsampled signal, consider the previous example once again, except now let the length of the demixing filter equal Lw. To perfectly cancel the entire contribution of s2 in y1(n), the term s2(2n-1), which is the only term not multiplied by a demixing parameter, must be a linear combination of s2(2n), s2(2n-2), s2(2n-4), . . ., s2(2n-2(Lw-1)) for some value of Lw. Moreover, the coefficients of this linear combination must be constant for all time. This has the effect of reducing the number of unknowns. Suppose that this is only approximately true, such that the relationship can be given by a linear comb ination plus some noise, v(2n-1). This can be described by the following where the i are chosen to minimize the power of v(n). As with any linear prediction problem trained using mean square error, the solution that minimizes the power of the error causes the input to be orthogonal to the error over the span of the filter. In this case, the error signal is v(n) and the input is s2(n). Plugging this equation back into the equation for y1(n) reduces the number of unknowns by 1, such that the selection of wij(n) that minimizes the power of the interference will lead to the following form for the performance measure defined aboves22 n 1 – ()is22 n 2 i – () v 2 n 1 – () +i 0 = Lw1 –=(3-1 4) P11h121 ()2rv0 () -------------------------------(3-1 5)


52where P11 is the total power of the delayed versions of s1(n) terms in y1(n), and rv(0) is the power of v(n). This will result in a value of infinity only if rv(0) = 0 or h12(1) = 0 (for finite-valued parameters, wij(n), and as long as the numerator does not vanish), neither of which are under the control of the demixer. Keep in mind, the solution that minimizes the power of the interference is not guaranteed to be the one that maximizes the ratio of signal power to interfering power in all cases, but it does as rv(0) approaches 0, and the value of the performance measure is always finite unless rv(0) = 0. This example is true regardless of the frequency content of either s1(n) or s2(n), which implies that it is independent of the Nyquist criterion. Furthermore, the situation in which rv(0) = 0 implies that s2(n) is deterministic. Hence, sampling at a frequency above the Nyquist rate is not sufficient to allow perfect separation of stochastic signals. The results are now extended to allow for an arbitrary decimation factor. Assume the mixing and demixing architectures are restrict ed FF/FIR and that the sources are temporally independent. In this case, the exact so lution is possible when there is no decimation (i.e., d = 1) when Lw = Lh (corresponding to no permutation). Decimation factors larger than 1 cause the optimum length of the demixing filter, Lw, to be reduced by a factor approximately equal to d. The ra tio of the {number of terms of the interfering signal that can be exactly cancelled by proper selection of wij(n)} to the {total number of terms of the interfering signal in a given output} can be easily shown to be which is expressed in terms of the greatest integer function. If the temporal independence assumption is relaxed, the exact solution requires that the weighted sum of all the terms Lh1 – d ‡ 1 + Lh------------------------------------(3-1 6)


53that are not multiplied by a demixing parameter are a linear combination of all the terms that are multiplied by a demixing parameter, just as before. As can be seen from this discussion, the degradation in performance due to downsampling can be mitigated to some extent by sour ces that have a large degree of temporal correlation. In addition, the temporal correla tion can be increased by using anti-alias filters at the cost of possibly removing signal en ergy. Interestingly, the optimal solution for non-WSS signals, such as speech, changes at the rate at which the statistics of the source change, which is on the order of 10-30 ms [51]. This is unlike the case where no downsampling occurs, where the optimal solution depends only on the mixing coefficients. Finally, there is at least one similarity between the requirements for signal cancellation and signal reconstruction. As th e sampling frequency increases to infinity for continuoustime signals (or as the decimation factor decr eases to 1 for discrete-time signals), the perfect separating solution exists as long as no ot her factors prevent its existence. A simple proof of this is known, but it is not included here since this is the expected result. Categorization of Published Criteria In order to assist with the comparison of convolutive BSS methods, it seems worthwhile to select a set of these algorithms that are representative of the many different types of algorithms that exist. In this section, the numerous types of methods are categorized using a simple tree diagram. While this coul d have been done in any of a number of ways, the grouping presented here incorporates most of the algorithms currently in existence and it captures some of the more pertinent ch aracteristics of these convolutive BSS methods. The categorization, which is shown in Figure 3-2, incorporates most of the published convolutive BSS methods. Two other types, however, did not fit well into this scheme.


54They include subspace methods [52-55] and linear prediction methods [56-59], which will be referred to as Type VI and Type VII, respectively. The different types of methods, which come from more than 80 papers aside from the 8 just listed, are given in Tables 3-5 to 3-8, which may be found at the end of this chapter. The name associated with each is that given in the paper in which it was f ound, if one was given. If one was not given, keywords from the paper were used to generate a name. If a particular paper discusses numerous different methods, only the two or three that appear to be the most interesting are included in the tables. Keep in mind that some of the methods listed have not been implemented. It should also be mentioned that ther e were some difficultie s in determining how to categorize some of the methods, because they do not always neatly fall into one of the SOS HOS NP ML IT Frequency Domain Time Domain SOS HOS NP ML IT SOS HOS NP ML IT Frequency Domain Time Domain SOS HOS NP ML IT Frequency Domain SOS HOS NP ML IT Criterion FB FF Time Domain Frequency Domain Inverse System Identification Separation Convolutive Structure FB = Feedback system FF = Feedforward system IT = Information-theoretic SOS = Second-order statistics HOS = Higher-order statistics ML = Maximum likelihood NP = Non-parametric non-IT non-IT non-IT non-IT non-IT Type I Type II Type III Type IV Type V { { { { { (not shown) F igure 3-2. Categorization of convolu tive blind source separation methods.


55categories of Figure 3-2 and because the aut hors do not always provide enough detail to make the distinction. A more complete tree diagram coul d have been found, but in the opinion of the author, this would only serve to obfuscate matters. One of the difficulties in labeling methods stems from the subtle difference that exists between Types II, IV, and V. For this reason, they are grouped together in Table 3-8, except for several methods which are necessari ly of one or the other type. The subtlety arises from the fact that each of them us e a frequency domain criterion. For Types II and IV, the optimized parameters in the frequency-domain must be converted to time-domain filters. However, the conversion from frequenc y-domain parameters to either FF or FB time-domain filters is very straightforward. Likewise, a Type V method is one in which there is no conversion to a time-domain filter since the outputs are found using the inverse FFT. It could be asked why these three different categories were used in Figure 3-2. Even though the distinction between them is incons picuous, there are very important differences in the characteristics of each, as will be discussed below. Similarities of Identification and Inverse Systems The first division of the categorization of Fi gure 3-2 is based on whether or not the adaptable parameters are trained in order to produce a set of output signals that approximate the original sources, which is referred to as an inverse system. Equations 3-3 and 3-4 are of this type. The alternative is to train th e adaptable parameters su ch that they estimate the mixing filters directly, which is referred to as identification. An identification method requires two steps to separate signals. The first involves estimation of the mixing filters, and the second involves determining the demix ing filters by inverting the estimated mixing filters. In some cases, this last task is non-trivial. The possible advantage of an


56identification method, relative to an inverse sy stem, is that it may require the training of much fewer parameters for equa l performance. However, it is also possible that it will require more parameters. As a result, if no pr ior knowledge is available, neither one is to be preferred over the other. As can be seen from the figure, the portion of the tree devoted to the identification methods is not expanded. The reason for this is simple. In two commonly occurring situations there is no significant difference between identification and inverse systems. The two situations are (1) when Ns = 2 and both Hii(z) transfer functions are assumed to be unity and (2) when a paraunita ry filter is applied prior to identification. There is also a third scenario where it coul d be argued that the two types are equivalent. For the first case, suppose that there are two sources and two observations and that H11(z) and H22(z) are assumed to be unity. The mixing matrix is then a (2 x 2) matrix of filters. This matrix can be inverted by exch anging the two filters on the diagonal, negating the filters on the off-diagonal, and dividing by the determinant. With the determinant excluded, each individual demixing filter of the separating solution has the exact form as one of the estimated mixing filters. This is true for any filter type. For example, when using FIR filters for identification, the FIR se parating solution (for the given mixing estimate) consists of the negated estimated mixing filters. Since the determinant is common to every element of the demixing matrix, it can be excluded due to the indeterminacies of BSS. Hence, identification and inverse methods may be considered equivalent in this case. For Ns > 2, each element of the separating solution has, in general, both numerator and denominator terms made up of functions of the estimated mixing filters. Consequently, ARMA filters must be used for the demixing in order to have a one-to-one correspondence between the estimated mixing filters and the separating solution. Even though


57there is a one-to-one correspondence in this case, the mixing estimate and the separating solution are not identical. As a result, it is not clear if an identification and an inverse system should be considered distinct for ARMA demixing. For the second case, suppose that a pre-whitening filter, P(z), is applied to the observations prior to identification. The combination of any mixing filter with the filter that prewhitens it, denoted as H(z)P(z), is known to be a parunitary filter, (z) [60-61]. The identification process only requires an estima tion of the paraunitary system. Paraunitary filters can be formed from a product of rotation matrices, where each rotation matrix is an identity matrix with the elements at locations (i,i), (i,j), (j,i), and (j,j) replaced with cos(ijk)z, sin(ijk), -sin(ijk), and cos(ijk)z-1, respectively, for i, j = 1, 2, . . ., Ns (i not equal to j), and k = 0, 1, . . ., L 1, where L is the length of the paraunitary impulse response. Paraunitary systems have the property that (z)T(z-1) = I, where I is the (Ns x Ns) identity matrix. Hence, once the paraunitary system is estimated, the inverse can be found by simply transposing the mixing estimate and replacing z with z-1 (the mixing estimate and the demixing solution can be made ca usal with a sufficient number of delays). Structure Any adaptive filter can be considered as being composed of three parts, which are the structure, criterion, and optimization met hod [62]. The remainder of the categorization of Figure 3-2 focuses on the first two of these items. In terms of the structure, a timedomain system can be of the feedforward (FF) or the feedback (FB) type. Examples of these two time-domain structures were given previously in Equations 3-3 and 3-4. A third option is the frequency-domain method. The st ructure defines how the outputs are produced given the observations. The criterion, on the other hand, is used to adjust the


58parameters of the structure and does not necess arily need to be applied to the outputs of the structure. For example, the criterion of a Type II or Type IV convolutive BSS system is applied to the FFT of the observations (or a function of the FFT of the observations), which uses an entirely sepa rate signal path than that used to produce the estimated sources. Several important characteristics of the different types of structures are discussed in this section. This includes the (1) stability, (2) ability to implement acausal filters, (3) sound quality of the estimated sources, and (4) complexity. Following this is a section that concerns several aspects of the choice of criterion. Stability and causality There are two conditions for which the stability must be considered. They are the stability during adaptation and the stability of the separating solu tion. Causality requirements are also included here due to the relationship between stability and causality. The stability during adaptation is covered first. For adaptation, there are two types of stabili ty, i.e., tap weight explosion and output explosion [6]. The former can be remedied by c hoosing a small step size, and is the type of stability with which any demixing structure is concerned. The latter type of stability problem only occurs for systems with either loca l or global feedback, since systems that are entirely feedforward are inherently BI BO (bounded-input, bounded-output) stable. The determination of whether or not an update to the demixing parameters yields a stable system (in terms of output explosion) is difficult when the local structure is AR or ARMA, or for any local structure when the global system is FB. The Gamma filter is like the AR and ARMA filters in that it is has local feedback. Unlike the AR and ARMA, it has a multiplicity of poles at a single location, which is determined by a user-defined parameter, µ. In


59contrast to the two other IIR filters, the dete rmination of the stability for a FF/Gamma system is trivial. For the FB system, on the ot her hand, experience with the FB/FIR system indicates that there is a fairly large likelihood that a random initialization is such that the attractor lies in an unstable region of parameter space. To increase the probability that output explosion is avoided during adaptation, a restriction is placed on all FB systems in this paper. Namely, the Wb,ii(z) are set to 0, for i = 1, 2, . . ., Ns. This system, as previously mentioned, is denoted as the restricted FB system. Although this reduces the ability of the structure to implement arbitrary demixing filter s, the improvement in stability appears to be a good trade-off. By far the most common local filter to use with a FB system, or for a FF system for that matter, is the FIR. The restricted FB/FIR system is always locally stable in terms of output explosion. For Ns = 2, it is also globally stable when the determinant of (I Wb(z)) does not have any zeros outside the uni t circle (since the determinant appears in the denominator). Given any LTI mixing matrix, the restricted FF/FIR is capable of producing the separating solution, although it may require infinite -length time-domain filters and an L of infinity. The z-L term implements what is referred to as an acausal filter. For the restricted FB structure, L is fixed at 0. The reason for ch oosing L = 0 for the restricted FB system is that the delay, z-L, appears inside the global feedback loop, which is not desired. The consequence is that the restricted FB/FIR arch itecture can produce the exact separating solution only when all Hij(z) meet certain causality constraints, and once again, may require that Lw equal infinity. Namely, for Ns = 2 and the case of no permutation, H12(z)/H22(z) and H21(z)/H11(z) must both be causal. This can be easily determined from Equations 3-5


60and 3-7. Likewise, for the case of permuted outputs, H11(z)/H21(z) and H22(z)/H12(z) must be causal, which can be determined from Equations 3-6 and 3-8. For more on causality requirements, an excellent resource is provide d by Charkani and Deville [63]. As before, to avoid output explosion for a restricted FB/FIR system, the determinant of (I Wb(z)) must not have any zeros outside the unit ci rcle. For the separating solution, the determinant of (I Wb(z)) can be expressed in te rms of the mixing filters as for the non-permuted solution, and for the permuted solution. Sometimes the separating solutions for the restricted FF system are given by Equations 3-5 and 3-6, with the ex ception that the entire matrix is divided by the appropriate determinant (for a restricted FF system, the determinants for both permutations are those above multiplied by z-2L). The separating solutions with and without the determinant produce different filtered versions of the original sources. However, it has been reported that this post filtering has l ittle effect on speech recognition [64]. More importantly, due to a basic indeterminacy of BSS, these solutions may be considered as identical for purposes of separation. This brings up an important question for the case that the demixing is constrained to be causal (i.e. L = 0), “What is the like lihood that the separating solution is stable?” Assuming the mixing filters are modeled as causal FIR filters, the locations of the poles of the separating solutions for Wf,ij(z) and Wb,ij(z) depend only on the locations of the zeros 1 H12z () H21z () H11z () H22z () --------------------------------–(3-1 7) 1 H11z () H22z () H12z () H21z () --------------------------------–(3-1 8)


61of the appropriate Hij(z). Figure 3-3 shows the likelihood that randomly chosen impulse responses, hij(n), have at least one stable, causal inverse (for either of the two possible permutations) for the case that Ns = 2. Experimental evidence thus far seems to indicate that these results are largely independent of the di stribution of or the temporal correlation of the randomly chosen impulse responses. This test was performed for different lengths of the impulse responses, Lh. The subplot (A) in this figure corresponds to the general situation, where all four hij(n) are randomly chosen FIR filters . The bottom plot corresponds to the assumption that h11(n) and h22(n) equal (n), the Kronnecker delta function. Notice that, for the more general case, the probability of the separating solution having a stable and causal inverse approaches 0 by the time Lh approaches the value of 10. Also notice that arbitrarily setting h11(n) and h22(n) to be (n), as done by many authors, carries with it an implicit assumption. The assumption is that the transfer functions associated with h11(n) and h22(n) are such that the stability/causality requirements of the separating 1 2 3 4 5 6 7 8 9 10 0 50 100 LhSolution Likelihood (%)Restricted FF (L = 0) Restricted FB 1 2 3 4 5 6 7 8 9 10 0 50 100 LhSolution Likelihood (%)Restricted FF (L = 0) Restricted FB F igure 3-3. Likelihood that a demixing system having L = 0 has a stable inverse. A) gen era case, B) assuming Hii(z) = 1. B A


62solution are more likely to be met (for L = 0), as witnessed by the figure. Keep in mind that these results mainly apply to the restri cted FB system, since the value of L for the restricted FF system can be changed arbitraril y. A summary of the results of the stability and the existence of an acausal solution is given in Table 3-2 for several of the more interesting structures, where C(z) is used to represent an arbitrary monic polynomial. Sound quality The sound quality of the final estimated source s depends on several factors, such as whether or not the algorithm converged and whethe r the structure is sufficient to invert the mixing matrix. One additional item that has a large impact on speech quality, which is independent of the quality of the separation, is whether or not a frequency-domain structure is used. For Type V systems, which are characterized by having a frequency-domain structure, the recovered speech sounds artificial. The reason is that the frequency-domain solution does not have a “time-domain” constraint [6]. The addition of this constraint would guarantee that the inverse FFT of the demixing filters of length Lf ends in at least Lf /2 zeros (assuming 50% overlap). In this case, the demixing of the observations corresponds to the linear convolution with a time-domain filter of length Lf /2. Without this Structure Global Stability Non-Causal SolutionFF/FIRAlways Wii(z) = z-LC(z) Restricted FF/FIRAlways Wii(z) = z-LRestricted FF/GammaTrivial Wii(z) = z-LRestricted FF/IIRDifficult Wii(z) = z-LRestricted FB/FIRDifficult(none) T able 3-2. Characteristics of Different Structures.


63constraint, what is actually implemented is circular convolution. If, on the other hand, the time-domain constraint is added, the method w ould be categorized as Type II and not Type V. Complexity The structure determines to some degree the complexity of the algorithm, which is defined here as the requisite number of parame ters to achieve a certain level of performance. If the transfer function of a given filter needs to be long, then an FIR filter will need a large number of adaptable paramete rs. One feature of the frequency-domain method is that the number of parameters does not depend on the length of the impulse response, but on the frequency resolution needed for a given mixture. When the mixture requires a low frequency resolution, the number of paramete rs is less than that needed by an FIR. Besides frequency resolution, there ar e two additional constraints placed on the choice of block size for Type V methods. One of these is that the block size should be 10 times longer than the length of the mixing filte r in order for the FFT of the observations, X(k), to provide a good approxima tion to H(k)S(k). The second of these is that the block size must not be too large, because the am ount of data available for the instantaneous demixing at each frequency is given by the tota l length of the data in the buffer (which should be limited to a value congruous to the rate at which the mixing filters change) divided by the block size. Needless to say, ther e are many practical situations where it is not possible to simultaneously satisfy these thre e requirements. This particular scenario is evaluated in the papers by Araki et al . [65-66]. For time-domain systems, the Gamma filter can be thought of as providing an analogous a dvantage over the FIR. It is similar in that the number of parameters is reduced from that required for an FIR if the desired impulse


64response has a low temporal resolution, where resolution is given in terms of taps per sample (an FIR filter has a resolution of 1, and a Gamma filter has a resolution of µ) [67]. The Gamma filter is quite simila r to the Laguerre filter [68] . The only published paper known by the authors that uses either the Gamma or the Laguerre for convolutive BSS is a paper by Stanacevic et al . [69]. Criterion Returning to the categorization of Figure 32, the criteria are divided according to whether the cost function is applied to frequenc y-domain or time-domain signals. It is then divided into those that make use of informa tion-theoretic quantities and those that do not. Non information-theoretic criteria include meth ods based on second-order statistics (SOS) and methods that use higher-order statistics (HOS). Information-theoretic methods, which are based on either entropy or divergence meas ures, are divided according to whether the source pdf's are known (or assumed to be of a known parametric family), or they are estimated directly from the data. The former of these will be referred to as maximum likelihood (ML) techniques, even though this may not be strictly correct in some cases. The latter will be referred to as non-parametric (N P) techniques. Second-order statistics The type of criteria that is the most restricted in its ability to separate sources is the SOS-based method. Second-order st atistics have been determin ed to be able to provide separation in certain cases, i.e., when the mi xture is non-static (i.e. not instantaneous), Hij(z) are causal FIR filters for i not equal to j, Hii(z) are unitary (which carries with it certain assumptions concerning causality, as discussed previously), and Lw > Lh [70-71]. However, these assumptions may not hold, and even if they do hold, the required


65algorithm is fairly complex. As evidence of this fact, the algorithm has yet to be implemented. As a result, some additional informat ion is needed in order for practical SOS methods to have any hope of providing separa tion. One item that will allow SOS to separate convolutive mixtures is if the sources ha ve time-varying statistic s. This is often the case for audio mixtures since speech is non-st ationary. The vast ma jority of SOS convolutive BSS methods are based on this assumption. Frequency-domain approximations The criteria of Types II, IV, and V depe nd on a frequency domain approximation that may or may not be valid. It is well-known th at the FFT transforms convolution into multiplication [50]. As a result, fre quency-domain methods apply an instantaneous demixing matrix of complex numbers at each individual frequency. Although it is true that the FFT converts convolution into multiplication, it only doe s so asymptotically. To wit, it is precisely true in the limit as the length of the window, Lf, approaches infinity. Any real system, however, requires a finite window size. The overlap/add and overlap/save methods 100 101 102 103 104 0 50 100 150 200 250 300 350 400 450 500 FFT Window Length, LRelative Error (%)Lh = 10 Lh = 100 Lh = 10,000 Lh = 1000 f F igure 3-4. Percent relative error as a f unction of window length for 4 values of Lh (each data point is an average of 10 trials).


66[72] were designed to circumvent this problem. However, these methods are not capable of circumventing the need for infinite-lengt h windows for BSS. To illustrate, consider a linear system that has an impulse response, h(n), of length Lh, a single input, s(n), and a single output, x(n). The overlap/save method (w ith 50% overlap) takes the FFT of a window of s(n) of length Lf = 2Lh and multiplies it with the 2Lh length FFT of h(n) (by appending Lh zeros). This produces a complex signal of length 2Lh, which will be denoted as S(k)H(k), where k = 1, 2, . . ., 2Lh. The inverse FFT (IFFT) is then applied and the first 50% of the data, or Lh samples, are thrown away. This pr ocess is repeated after shifting the window on the input data by Lh samples. This approach yields identical results with that found in the time-domain by convolving s(n) and h(n), thus validating the appropriateness of using the overlap/save method in place of time-domain convolution. The signal that is needed for BSS, however, is precisely the complex-valued S(k)H(k), found prior to applying the IFFT and throwing away the first Lh samples. It is assumed for BSS that the signals s(n) and h( n) are unknown; therefore, only x(n) may be used to determine this signal. Given only x(n) , hence X(k), it is not possible to work backwards to construct S(k)H(k) because it is not possible to determine the Lh samples that would have been thrown away (a similar pr oblem exists for the overlap/add method). Fortunately, X(k) becomes a good a pproximation of S(k)H(k) as the length of the window increases (relative to the length of the filter) . Figure 3-4 shows the mean of the relative error magnitude, |X(k)-S(k)H(k)|/|S(k)H(k)|, as a function of both window length and filter length. In the literature, it has been suggested that the block size for the FFT should be 6 [73], 8 [74-75], or 10 [76], [77] times the le ngth of the mixing filter. Figure 3-4 seems to indicate that 10 is a good choice. It is interesti ng to note that out of 49 papers surveyed that


67include the equation, X(k) = S(k)H(k), only 11 imply that this equation is an approximation when using a finite window size, and onl y 2 make any attempt to explain the phenomenon [48], [75]. Representative Set of Methods Now that the methods have been categoriz ed and their important characteristics listed, a representative set is chosen. The fo llowing seven methods are used as a cross-section of the different types of methods shown in Figure 3-2. Type I-ML Nonholonomic Density Matching [78] Type I-NP MRMI-SIG Type II-SOS JBD [74] Type III-SOS SAD [79] Type III-ML InfoMax [80] Type IV-HOS Fixed Point [81] Type V-HOS JADE [31] The fixed point method is described by Bi ngham and Hyvarinen [81], although they did not use the restricted FB/FIR structure. Fixed point was chosen for its speed, which is imperative since as many as 8000 (complex) in stantaneous BSS solutions are required for each separation task. For JADE, knowledge of the separating solution was used to correct for (local) permutations at each frequency. Th erefore, the results for JADE represent a best case scenario for Type V methods. The second method in the list, MRMI-SIG, will be briefly introduced in the following section.


68Proposed Method, MRMI-SIG A method of convolutive BSS, MRMI-SIG, is presented in this section. It retains the same designation as used for instantaneous BS S (and feature reduction) due to the similarities of the criterion used for all three tasks. The application of convolutive source separation does, however, present some unique diffic ulties. Consequently, the extension of the criterion from instantaneous to convolutiv e demixing is not completely obvious. The structure used for MRMI-SIG is the restrict ed FF/Gamma, the criterion involves the minimization of the (approximation of) mutual information between the outputs, and the optimization method is gradient descent. Some a dditional specifics are given in the following sections. Structure The Gamma filter is a tapped delay line with each one-unit delay, z-1, replaced by a generalized delay function, µz-1/(1 (1-µ)z-1), where µ is a user-defined parameter. A Gamma filter can sacrifice tem poral resolution in order to reduce the number of adaptable parameters required by an FIR, where temporal resolution is expressed in taps per sample and equals 1 for an FIR and µ for a Gamma filter. For MRMI-SIG one minor change is made to the restricted FF/Gamma structure. In order to increase the resolution at the leading edge of the demixing impulse response, th e first few taps use th e one-unit delay, and the remainder use the Gamma delay. The resu lting input-output relationship from observation j to output i is given byyin () wfij ,k () xjnk – () wfij ,k () xjk ,nk – ()kLg= Lw1 –+k 0 = Lg1 –= xjk ,n ()µ xjk 1 – ,n 1 – () 1 µ – () xjk ,n 1 – () + = kLgLg1 + ƒ Lw1 – ,,, = xjLg,n () xjnLg– () =(3-1 9) (3-2 0) (3-2 1)


69where Lg is a user-defined parameter. In order to reduce the number of parameters that must be optimized, it is suggested to fix the value of Lg to be 0.1Lw. Criterion The length of the impulse response for the Gamma filter will be approximated using the concept of memory depth. For the Gamma filter, the memory depth is defined as Lw* = Lw/µ [67]. With this approximation, the length of the impulse response of the MRMI-SIG structure is Lw* = Lg + (Lw-Lg)/µ, where Lw is the number of weights adapted for each Gamma filter. By means of comparison, an FIR having Lw adaptable parameters has a memory depth of Lw* = Lw. Since the impulse response of the combination of the mixing and demixing filters has an approximate length of Lh+Lw*-1 (as long as Lw* is larger than or equal to L), the sample of any source, sj(n), can then appear at only the following outputs, yi(n), yi(n+1), . . ., yi(n+Lh+Lw*-2), for i, j = 1, 2, . . ., Ns. This suggests the minimization of the following criterion where y i(n) = [yi(n), yi(n-Ld), yi(n-2Ld), . . ., yi(n-Lh-Lw*+2)]T, y (n) = [y 1(n)T, y 2(n)T, . . ., y Ns(n)T]T, Ld = 1, and equality holds only for = 1. This criterion, referred to by Pham as the mutual information between segments of processes, is a contrast for = 1 and jointly stationary processes, although it is not known if it is discriminating [83]. That is to say that the minimum (theoretical) value of 0 is obtai ned when the outputs are separated, but it is possible that there are other ways in which the criterion can produce a value of 0. Pham Iyn () () Iy1n () y2n ()ƒ yNsn () ; ; ; () Hyin () ()i 1 = NsHyn () () – = (3-2 2)


70does not consider the case that the length of each y i(n) is connected to the span of the finite-length mixing/demixing combination, as done here. As a consequence, if (1) both Lw and L are sufficiently large, (2) the source s are jointly stationary, and (3) some additional mild constraints hold, then it is believe d that this criterion is also discriminating. However, no proof is known. For time-domain criteria, SOS methods t ypically use pairwise independence and multiple lags at which to es timate correlation between the outputs, while IT methods typically use Ns-wise independence and a single lag (the zero lag) to estimate an entropy or divergence measure [30]. This method is belie ved to be the first IT method implemented that (1) uses multiple lags (Pham discusses this idea in several papers, but they are not implemented in these papers [30], [83]) and (2 ) does not require the sources to be either i.i.d. or linear processes [84] . Criteria derived assuming the sources are modeled as i.i.d. or, e.g., AR processes, attempt to produce estimated sources that are temporally white, which is not ideal in the case of speech signals. Furthermore, these criteria are hampered by the fact that, even though speech is oftentimes modeled as an AR process [85], there is no single set of AR parameters that is valid fo r continuous speech. In fact, the AR parameters of speech change rather frequently (a pproximately every 20 ms), which causes the optimal demixing coefficients for this type of method to change fre quently (in addition to the changes caused by moving sour ces and/or sensors). As will be shown for real data, it is difficult to find a good solution using several seconds of data. Conse quently, 20 ms of data is certainly insufficient. Ohata and Matsuoka also argue that these methods, which assume the sources are either i.i.d. or linear proces ses, do not work well for nonlinear processes [84]. It should be mentioned; however, that the proposed method has a similar drawback,


71at least theoretically speaking. Since the de rivation of the method includes Parzen windows [15], theoretical justification of the me thod requires that the samples of the outputs are i.i.d., which they are certainly not. This is analogous to the requirement that a signal must be WSS and asymptotically uncorrelated in order to ju stify the use of the sample mean for estimation of the mean [6]. Since speech is not WSS, it could be argued from a theoretical basis that not even the mean can be properly estimated. Despite this, as is wellknown, the estimate of the mean of non-WSS data is still useful. Although it may not be expected, the proposed method is also shown to be useful even though the outputs are not i.i.d. Just like before, the mutual information is approximated by usi ng a sum of Renyi’s marginal entropies minus Renyi’s joint entropy. Unlike before, (1) the dimensionalities of the underlying pdf’s are vastly increased and (2) the joint entropy may not be discarded. Recall that the joint entropy was not used in instantaneous demixing since it is invariant to rotations. Also, for feature extraction, an altern ative formulation is used in which the conditional entropy replaces the joint entropy. The alternative formulatio n is appropriate for feature extraction since the random variab le on which the entropy was conditioned, the class, is discrete. For convolutive BSS, the al ternative formulation is inappropriate. Both of these differences admit additional complica tions that must be addressed before proceeding. The first complication, the increased dimensionality, is important because of expectations concerning increased computational comp lexity and the “curse of dimensionality”. To exemplify the dimensionalities involved, suppos e that a FF/FIR system is used with Lw = 250, a modest length filter for audio mixtures. In this case, the pdf implicit in the joint


72entropy of Equation 3-22 has a dimensi onality of 2(2*250 1) = 998 (assuming Lh = Lw). An approximation of Equation 3-22 is introduced that allows the reduction of the dimensionality by summing multiple mutual information terms. This is then used to find the preferred dimensionality. The second co mplication, the requirement that the joint entropy is used, is problematic for reasons el ucidated by Morejon [86]. He points out that the RenyiÂ’s entropy estimator used in Chapter 2 does not always produce values that have an absolute significance, although it may always be used as a relative measure. As such, it is normally appropriate for minimizing or ma ximizing [86]. However, in this case, the approximation of mutual inform ation consists of a sum of multiple entropy terms whose constituent pdfÂ’s differ in dimensionality. As such, it may not give the expected results. This is not the case for feature extraction (Appendix A) since the pdfÂ’s in the marginal and conditional entropies have identical dimensiona lities. In order to guarantee that the results of the entropy estimator correspond to the theoretical values (having an absolute significance), the kernel size must be small relative to the variance of the data and there must exist enough data so that the pdf can be a pproximated well. These conditions are increasingly unlikely to be met as the dimensionality is increased. It is still possible to obtain good separation when these conditions are not met, as will be shown shortly, but it does require a few minor changes in the criterion. Tw o different types of modifications are used to address the shortcoming associated with the use of pdfÂ’s having different dimensionalities. Although only one of these is used in th e final criterion, the description of both is included for the sake of exposition. The following three changes were incorporat ed in order to decrease the dimensionalities of the underlying pdfÂ’s.


73 Pairwise independence is used instead of Ns-wise independence The number of elements in one y i(n) term is reduced to 1 The (temporal) elements in the second y i(n) term are decimated by a factor, Ld > 1 With these changes, the criterion becomes Equation 3-23 is still a contrast, but in order for it to be disc riminating, at the very least, it will require more constraints than needed for Equation 3-22. The first item in the list above serves mainly to reduce th e dimensionality of the pdf in the joint entropy. With this change, the pdf in the joint entropy still has tw ice the dimensionality of the pdf’s in the marginal entropies. The inclusion of the seco nd item is reasonable, although it will reduce the robustness, as long as the following cond ition is met (this cond ition will be described for the case of no permutations, and may be extended to include any particular permutation in a straightforward fashion). The requirement is that si(n-k), for any value of k, must always be present in y i(n) during adaptation. Otherwise, a dditional (persistent) local minima may be inserted into the performance surface so that the (theoretical) mutual information can be zero even though the sources are not separated. This constraint can be relaxed if the consecutive temporal dependence of each source, Ls, is larger than 0 (where Ls = 0 corresponds to temporal independence). The third item in this list is also reasonable if Ls is greater than or equal to (Ld-1)/2 consecutive lags. As can be seen, the resulting criterion Iyin () yjn () ; ()j 1 = ji Nsi 1 = NsHyin () () Hyjn () () Hyin () yjn () , () – + []j 1 = ji Nsi 1 = Ns= (3-2 3)


74has more terms than Equation 3-22. More specifically, Equation 3-22 has Ns terms while Equation 3-23 has Ns(Ns-1) terms. However, the dimensionality of the pdf of each joint entropy has been reduced from Ns(Lh+Lw*-1) to which is expressed in terms of the greatest integer function. Keep in mind that Equations 3-22 through 3-24 are written in terms of Lh, which is unknown. In practice, the value of Lh must be approximated. It is assumed throughout that Lh = Lw*, which is the most obvious choice, although there is certainly no guarantee that it is the correct choice. In order to implement Equation 3-23, each entropy is es timated using the same non-parametric quadratic entropy estimator that has been used in the previous chapter. For convenience, the definition is repeated here. It is given by where G(y,2) is the multi-variate Gaussian function evaluated at y and having variance 2, and Lb is the block size. Figure 3-5 shows the sensitivity of the MRMI-SIG method to several different parameters when using Equation 3-23 as a criterion. These four plots show the performance of different parameters as a function of SIR (signal-to-interference ratio). The definition of SIR used here is the obvious extens ion (to convolutive mixtures) of that defined previously for instantaneous mixtures in Equation 2-7. More details on the SIR for convolutive mixtures are given in the following sect ion. These four plots pertain to a synthetic mixture where Hii(z) = 1, Hij(z), for i not equal to j, correspond to sparse FIR filters of 2 LhL +w *1 – Ld-----------------------------(3-2 4) H2yn () () 1 N --Gyn () yn 1 – () – ()2, ()n 1 = Lblog – =(3-2 5)


75length 25, and Lw* also equals 25. Except where noted, the simulations represented in Figure 3-5 use µ = 1, Ld = 1, Lb = 500, and Lt = 49, where Lt is the overall temporal span of the outputs used in the criterion (the correct value is Lt = Lh+Lw*-1, which is approximated by 2Lw*-1). As can be seen in Figure 3-5(A), the optimal value of µ is 1, which not surprisingly indicates that for short, sparse filters, the high resolution of the FIR is required (values of µ between 1 and 2 are also valid but are not considered here since the resulting generalized delay function corresponds to a high pass filter). The performance is also best when the criterion uses the en tire output vector, which corresponds to Ld = 1. In simulations where the sources and sensors are physically stationary, the performance tends F igure 3-5. SIR for Lh = 25. A) SIR as a function of µ, B) SIR as a function of Ld, C) SI R as a function of Lb, D) SIR as a function of Lt. 0.2 0.4 0.6 0.8 0 5 10 15 20 25 µSIR (dB) 2 4 6 8 0 5 10 15 20 25 LdSIR (dB) 0 5000 10000 0 10 20 30 L b SIR (dB) 20 40 60 80 0 10 20 30 L t SIR (dB)AB C D


76to improve by increasing the block size, Lb. However, large values of Lb cannot be used in realistic scenarios since the sources and sensors (and any nearby objects) are not physically stationary. Also notice that the criterion appears to be sensitive to the estimation of Lt, which ideally equals 49 for Figure 3-5. David Pinto performed some additional simulations for Lh = 2 and Lh = 25, which are not shown here. In these simulations, two additional parameters were varied, which he refe rred to as the anchor and slide. The new parameters are such that, when they are properly selected, either Equation 3-22 or 3-23 may be produced. Interestingly, the results of this comparison show that Equation 3-22 is preferred out of all possible combinations of the anchor and slide, even though this corresponds to the set of parameters that yield the maximum pdf dimensionality. This is also consistent with the results shown in Figure 3-5(B). Theoretically speaking, this result is not surprising. However, in practice, it is somewhat counterintuitive since the pdf of the joint entropy in this case has a dimensionality of 98 (for Lh = Lw = 25). There is certainly no reason to think that the corresponding 98-dimensional pdf estimation is very accurate when only N = 30,000 data points are used. Based on these results, the analyses that follow will be based on modi fications of Equation 3-22. The second complication of using MRMI-SIG for convolutive separation is that the pdfÂ’s in the marginal and joint entropies have different dimensionalities. As mentioned previously, two different methods are used to alleviate this problem. The first method, which does not change the dimensionality of th e constituent pdfÂ’s, attempts to compensate by applying a user-defined gain to the joint entropy and by using different kernel sizes for


77the marginal and joint entropy estimation. The resulting criterion, motivated by several insights about the entropy estimator provided by Morejon [86], becomes where the two different kernel sizes are e xplicitly shown. The second method reduces the dimensionality of the pdf in the joint entropy by a factor of Ns so that the dimensionalities of the pdf’s in the marginal and joint entropies are equal. This is done by temporal decimation of the criterion in the same fashion as done before using the decimation factor, except that a decimation of Ns is used for the pdf of the joint entropy and a decimation of 1 (no decimation) is used for the pdf of the margin al entropies. The resulting criterion using the second approach is where Ld = 1, y j(n) = [yj(n), yj(n-Ns), yj(n-2Ns), . . ., yj(n-Lh-Lw*+2)]T, and y (n) = [y 1(n)T, y 2(n)T, . . ., y Ns(n)T]T. This is similar to the approach suggested by Morejon. The difference is that Morejon recommends repeating the random variables in the marginal pdf’s Ns times to increase the dimensionality to that of the joint pdf [86]. The two approaches represented by Equations 3-26 a nd 3-27 are evaluated next. A noticeable pattern developed when optimiz ing the three parameters of Equation 326, namely, 1, 2, and . This is shown in Figure 3-6, where 1 = 0.25, Lw = Lh = 2, and Lt = 3. Numerous plots (not shown) were generated that had the same characteristic as that seen in Figure 3-6. Namely, as is changed, the corresponding optimal 2 is such that the I2yn () () H2yin () ;1 2()i 1 = Ns H2yn ()2 2; () – (3-2 6) I2yn () () H2yin () ()i 1 = NsH2yn () () – (3-2 7)


78ratio, /2, is constant (within the precision of the experiments). For Figure 3-6, the optimal ratio, /2, equals roughly 4. In fact, the value of the optimal ratio remains 4 for Lt = 9 and Lt = 19. When Lt is increased to 49, however, the optimal ratio becomes 2. Returning to Lt = 3, doubling the value of 1 (i.e. 1 = 0.5) causes the optimal ratio to decrease by a factor of one-half, from 4 to 2. Whil e this suggests a linear relationship between 1 and the optimal ratio, /2, it is not true for smalle r or larger values of 1. As is apparent from the discussion, the optimal choices for the three parameters, 1, 2, and , depends on Lt. In order to determine the underlying relati onship of all four variables, including Lt, four equations are required. The linear relationship between and 2 provides one of the equations, but three others are still needed. In addition, the results seem to indicate that the optimal ratio, /2, depends on Lt in a nonlinear fashion. If good results were available for large Lt, then a simple nonlinearity could be fit to the data and interpolated as needed. F igure 3-6. SIR as a function of 2 and , for 1 = 0.25 and Lt = 3. 0 0.5 1 1.5 2 0 5 10 15 20 25 30 35 40 2SIR (dB) = 0.2 = 0.5 = 1 = 2 = 5


79However, the plots begin to lose their characteristic shape as Lt is increased. Since no data are available for large Lt and since it is well-known that extrapolation of nonlinear functions is dangerous, no equations are derived for the relationship between the four variables. It is believed that better results would be obtained for large Lt if the block size, Lb, were increased substantially. Not only is it intuitive that larger values of Lt would require more data, but the initial results for Lt = 49 were poor until Lb was increased. For illustrative purposes, Lt = 3 requires Lb = 500 data points, and Lt = 49 requires Lb = 5000. For Lt = 199, on the other hand, values of Lb up to 140,000 were tried without obtaining a plot that resembles Figure 3-6 (although, the sepa ration performance of this method for Lh = 100 and Lt = 199 was comparable to the methods listed later in this chapter). There is one other important characterist ic hinted at in Figure 3-6. Namely, as 2 becomes large or small, the performance is independent of . This can be explained mathematically by considering the expression for th e gradient of the crite rion. The gradient of Equation 3-26 has a term correspondi ng to each marginal entropy that is multiplied by 1/1 2 and has a term corresponding to the joint entropy that is multiplied by /2 2. In the limit as the ratio 2/1 becomes infinite (assuming is finite and non-zero), the joint entropy ceases to make any contribution so th at the criterion degenerates to the sum of marginal entropies. This effect can be observe d in Figure 3-6 by noting that the SIR results for all values of converge as 2 approaches 2. In the limit as 2/1 becomes 0, the criterion degenerates to the joint entropy (as can be seen in Figure 3-6 as 2 approaches 0). In the former case, the criterion is independent of since the effect of the joint entropy is removed by the factor, 1/2 2. In the latter case, since the marginal entropies make no


80contribution, the term multiplies the entire effec tive criterion. Consequently, the only effect of is to change the step size. As indi cated by Figure 3-6, the optimal ratio of 2/1 is such that both the marginal entropies and the joint entropy contribute to the criterion. A detailed analysis of the criterion expre ssed in Equation 3-27 is not given since (1) it has only one non-adaptable parameter, , and (2) a fixed value of produces acceptable results (as long as the outputs are constrained to have a fixed variance). Comparisons of Equations 3-26 and 3-27 indicate that, for small Lt (or, perhaps, when Lb is sufficiently large), Equation 3-26 produces better separation. For large Lt, on the other hand, the opposite is true. In the end, Equation 3-27 was chosen as the final criterion. First, large values of Lt are expected for real audio mixtures. Second, the optimization of this criterion is noticeably simpler. In conclusion, the criterion given by Equation 3-27, which is used in the remainder of this chapter, does not attempt to temporally whiten the estimated sources, such as often occurs for many convolutive BSS methods. It is, however, noticeably more computationally complex than the other met hods. Another limitation of this approach is that it only works for super-Gaussian sources, although this does not preclude its use for speech separation. The use of sub-Gaussian sources requires a simple modification for instantaneous BSS, but this sa me trick is not valid for convolutive BSS since the kurtosis of multi-dimensional signals is not defined. Using the suggestions given previously, three parameters remain, excluding the step size a nd block size, which must be selected for MRMI-SIG. They include the following three items. Number of adaptable parameters in each of the demixing filters, Lw Kernel size, Gamma parameter, µ


81Experimental (Approximate) Upper Bound The use of synthetic mixtures for testing convolutive BSS methods has the advantage that the optimum separa tion performance is oftentime s known. This section introduces a supervised learning method that ge nerates an experimental (approximate) upper bound on separation performance that is applicable for real recordings. Prior to this, the performance measure of interest is defined. Numerous methods have been used in the published literature to indicate separation performance. They include the root MSE [ 87], error rate [87-88], Frobenius distance between the overall mixing matrix (combina tion of mixing and demixing) and a diagonal matrix [88], multichannel row and multichanne l column ISI [89], sum of Frobenius norms [84], plot of overall mixing filter responses [69], [73], SNR [44], [90-92], mean square error [46], ISR [93], SIR [93], “one at a time” SIR [94], ISR [49], ISI [95], bias and standard deviation [96], plot of spectra [93], hand-segmented SIR [97-98], automatic speech recognition rate [31], and the mean opinion sc ore [99]. Several of these are not ideal because they are either subjective, such as th e plots and the mean opinion score, or require knowledge of the mixing filters. The latter of these two ma kes the performance measure inapplicable for real mixtures. Several others are not desirable for speech because they give preference to methods th at temporally whiten the estimated sources, such as the ISIbased measures, or are more suitable for digital communications. The automatic speech recognition is an excellent measure, but it requires a large amount of training data and training time. The two methods that remain include the “one at a time” SIR and the handsegmented SIR. The latter method records all sources at the same time, but requires the outputs to be segmented into regions where only one source is active at a time. While this


82method does not suffer from having differen t background noises for each recording and does not require that the sources are physically stationary, such as the “one at a time” SIR measure, the sources must not overl ap significantly in the time domain. The figure of merit used for the comparisons is the former of the two SIR measures. For a given permutation, it is defined as where ni, for i = 1, 2, . . ., Ns, is an element of {1, 2, . . ., Ns}, ni not equal to nj for i not equal to j, Pni is the power of source ni in output i, and Pi is the total power of output i. The ni terms are used to define the particular (g lobal) permutation. Since permutations are an indeterminacy for BSS, the one that provides the maximum SIR is the one of interest. For Ns = 2 sources and observations, the equa tion above simplifies to the following where Pij is the power of source j in output i. This method was introduced by Schobben et al . [94]. For real recordings, Schobben suggests th at the sources be recorded one at a time. Since sound waves are additive, they can be summed together to produce proper mixtures, and then presented to any BSS algorithm. The ad vantage of this approach is that the SIR can be easily estimated by passi ng each source, one at a t ime, through the trained demixing filters and measuring the power of that s ource in each output. This measure is equally applicable to both synthetic and real mixtures.SIR 1 Ns----10log10PniPiPni– -----------------i 1 = Ns=(3-2 8) SIRmax 0.510log10P12P11-------10log10P21P22-------+ 0.510log10P11P12-------10log10P22P21-------+ , = (3-2 9)


83Having defined the figure of merit, th e approximate upper bound will now be derived for the case of Ns = 2. Given the general form of the mixing matrix the signals that are available when recording one source at a time are H11(z)s1(n), H21(z)s1(n), H12(z)s2(n), and H22(z)s2(n). The two solutions for the restricted FF/FIR, given once again for convenience, are as follows and Suppose that we have an adaptive filter constructed of a tapped delay line. Suppose further that H22(z)s2(n) is used as the input and H12(z)s2(n) is delayed by L and used as the desired signal for this adaptive filter. This s ituation is identical to a system identification problem, where the desired transfer function is H12(z)z-L/H22(z). The important point is that, with the exception of a minus sign, the desired transfer function is identically the optimal solution for Wf,12(z). In addition, the structure of Wf,12(z) and the structure of the FIR are the same. By selecting the proper two re corded signals, it is possible to generate a system identification problem where the desi red transfer function is the optimal solution Hz () H11z () H12z () H21z () H22z () = (3-3 0) Wfz () zL –H –12z () zL –H22z () --------------------------H –21z () zL –H11z () --------------------------zL – (non-permuted solution) = (3-3 1) Wfz () zL –H –11z () zL –H21z () --------------------------H –22z () zL –H12z () --------------------------zL – (permuted solution) = (3-3 2)


84for either Wf,12(z) or Wf,21(z), for the case of either permutation. This is shown in Figure 3-7. The result is that a supervised training methodology can be used to estimate all of the optimal filters that make up Wf(z). Furthermore, as long as mean square error is used as the criterion, and since the adaptive filter uses an FIR structure, the performance surface is guaranteed to be free of local minima [6]. It should be apparent that this method requires that the sources be recorded one at a time. Ther efore, this method cannot be used in realistic separation scenarios, but it can be used as a standard of performance with which BSS algorithms can be compared. Also, unlike many theoretical upper bounds, this one determines the parameters of a restricted FF/FIR structure that represent the estimated separating solution. Therefore, it represents an attainable level of performance. Since every BSS approach is necessarily unsupervised and is oftentimes subject to local minima, it would not be surprising if the solution found using the supervised method performs better. While the supervised me thod does not maximize SIR, it does minimize the power of the interfering signal, which is th e denominator of the SIR. This is easily z -L z -LH (z)s (n)22 11 12 21 2 22H (z)s (n) H (z)s (n) H (z)s (n)2 12 2 f,12W (z) W (z)11 1 21 1 f,12 f,21 z -LH (z)s (n) H (z)s (n)W (z) z -LH (z)s (n) H (z)s (n)W (z)1 1 f,21 2Permuted Non-permuted F igure 3-7. Permuted and non-permuted soluti ons for the supervised approximate upper bound. A) Non-permuted solution and B) permuted solution. A B


85verified by noting that the error signal of the system identification is precisely the appropriate interfering signal. For example, when finding Wf,12(z) for the case of no permutation, the error signal is (Wf,12(z) H12(z)z-L/H22(z))s2(n), which is identically the interfering signal for the BSS application. It is also true that, in the limit as the power of the interfering signal goes to zero, the supe rvised method does maximize SIR (as long as the power of the desired source does not vanish). For this reason, it will be referred to as an approximate upper bound for the performance of any convolutive BSS algorithm. This method can be used for any value of L, and can be used for other local filters, such as the restricted FF/Gamma structure (if it is decided to adapt the Gamma parameter, µ, then local minima will be introduced into the performanc e surface). However, it is limited in that it can only be used for 2 sources/observations. While it is theoretically possible to extend this to 3 or more observations, it is not practical. For more than 2 sources, each audio source must be recorded from the desired physical location of each source, and it requires that the inputs and desired signals are found by multiple convolutions of the recorded signals with each other. It also re quires a delay parameter be optimized for each Wf,ij(z) in order to align the intermediate signals, and the determination of each Wf,ij(z) has to be done in several steps with the fi nal solution being the summation of the results from each individual step. The second weakness to this approach is that it is not able to take into account differences in the frequency c ontent of the sources, which will cause it to produce a smaller value of SIR than may be otherwise possible. This should be obvious since, at any given time, only one source is used during the supervised learning. In its defense, Ns = 2 covers over 92% of published results (82 out of 89 papers) and, if there is


86very little spectral overlap, then a BSS method is not needed since a classical separation method would suffice. To give an indication of the accuracy of the supervised method, an upper bound was measured on artificial mixtures for several different values of Lh. For each mixing matrix, H11(z) and H22(z) are set to unity and Lw is set equal to Lh. In this case, for artificial mixtures, the best performing solution has an SI R of infinity. The resulting SIR of the supervised method for a randomly chosen mixing filter having Lh = 5 was in fact infinity (indicating that the error was smaller than th e precision of the machine), and the result when the mixing filter was set equal to an estimated room impulse response of length Lh = 4000 (which corresponds to 8000 adaptable parameters), was 53.7 dB. To summarize, the approximate upper bound is found as follows. Use the two configurations in Figure 3-7 for determining Wf,12(z) and Wf,21(z) for no permutation and for a given structure (restricted FF/FIR or restri cted FF/Gamma, with a specified value of L and µ if applicable). Using these filters, determine the SIR. Compute the same two filters again for the case of permuted outputs, and onc e again measure the SIR. The larger of the two SIR's is the estimate of the upper bound for that particular structure. As a reminder, the scenario in which it is most likely to provide a poor estimate of the upper bound is when the source spectra do not extensively overlap since a good BSS algorithm should be capable of taking advantage of any differences in source spectra. Comparisons Both synthetic and real mixtures are used to compare the seven previously chosen methods. The sources in all cases consist of male and female speech, which was downloaded from a webpage hosted by Eindhoven University of Technology and maintained by


87Daniel Schobben, Kari Torkkoloa, and Pari s Smaragdis ( realworld.html). The separated male and female speech was used as the original sources for both synthetic and real mixtures. A summ ary of the 9 experiments conducted, 6 using synthetic mixtures and 3 using real mixture s, is shown in Table 3-3. The same male speaker is used as both sources (different s poken sentences) for all experiments except # 4, 7-9, where both male and female speech were used. Exp. No. LhLwN NsMixing Transfer Functions (i not equal to j) SNR (dB)12, 5, 10, 25, 50, 100 where hij(n) are truncated to length LhLh140k2Hii(z) = 1 H12(z) = 0.2z-1-0.2z-4+0.2z-9+0.4z-24 +0.1z-30-0.2z-49+0.1z-70 +0.1z-99H21(z) =0.3z-1-0.3z-4+0.4z-9+0.3z-24 +0.4z-33-0.1z-49+0.3z-79 -0.2z-992255, 15, 25, 35, 45 140k2Hii(z), Hij(z) as in Experiment #1 325Lh35k, 70k, 105k, 140k 2Hii(z), Hij(z) as in Experiment #1 425Lh140k2, 3, 4, 5, 6, 7, 8 Hii(z) = 1 Hij(z) = 4 randomly selected powers of z were given random values chosen uniformly in {-0.5, 0.5} 525Lh140k2 Hii(z) = z-L L = 1, 3, 5, 7 Hij(z) as in Experiment #1 625Lh140k2Hii(z), Hij(z) as in Experiment #11.6, 3.0, 5.2, 10.0 7unknown250, 500, 1000, 2000 30k2 unknown (azimuth = 45o, fs = 11.025k) 8unknown1000, 1450, 2000, 2900, 4000 30k, 44k, 60k, 87k, 120k 2 unknown (azimuth = 450, fs = 11.025k, 16k, 22.05k, 32k, 44.1k) 9unknown100030k2 unknown (azimuth = -90o, -45o, 0o, 45o, 90o, fs = 11.025k) Table 3-3. Experiments.


88Synthetic Mixtures For the synthetic mixtures the sampling frequency, fs, is 16 kHz. Also, the mixing filter parameters were chosen such that the separating solution is stable and causal for all structures considered, the requirements of whic h are completely dictated by the restricted FB systems. This required the use of hand-selected sparse mixing filters, due to the problems of finding randomly selected mixing filt ers possessing causal inverses, as alluded to earlier. Sometimes the demixing parameters we re initialized to the separating solution. This was done primarily for the FB system s since a random initialization may have an unstable attractor. It should be noted that, with the exception of helping to prevent instability, the results of either initialization made no noticeable difference in the final value of the SIR for all seven methods. The fact that the so lutions of both initializations (random and separating solution) produced the same SIR is consistent with the idea that the results presented are not due to local minima (assuming both in itializations converged to stable solutions). The user-defined parameters for each BSS method are shown in Table 3-4 under the columns of “Default Parameters” and “Optimized Parameters,” where Lb is the block length, LM is the number of cross-power matrices to diagonalize, LI is the number of intervals used to estimate each cross-power matrix, and is the step size. As can be seen from the table, one parameter is optimized for a ll methods except for JBD, for which two parameters are optimized. The same random initial conditions were used for each method (this was not possible for Experiment #4). An adaptable gain, which produces a unit variance signal, is applied to each output for MRMI-S IG prior to its inclusion in the criterion. This is done, as previously stated, so that a constant value of may be used. Also notice

PAGE 100

89that the MRMI-SIG algorithm uses µ = 1 for the synthetic mixtures. For this value of µ, the Gamma filter reduces to an FIR filter, so that Lw* = Lw. The JADE algorithm, which is the only Type V method, uses an explicit met hod to correct for (local) permutations. For all the experiments using synthetic mixtures, the minimum Frobenius norm between the separating solution and the estimated solution was used to correct for the permutations at each frequency. Obviously this method is not applicable to realistic demixing scenarios. Compared to the results without permutation correction, the SIR improved by an average of almost 5 dB. A variety of experiments were conducted by changing one item at a time. MethodTypeStructureCriterion Default Parameters Optimized Parameters CommentsDouglas’ Nonholonomic Density Matching I-MLFF/FIRMin. Mutual Information Lb = 500 Slow, erratic convergence Hild’s MRMISIG I-NPRestricted FF/ Gamma Min. Mutual Information Lb = 500, µ = 1, Ld = 1, Lg = 0.1Lw , = 0.1 Slow, performs poorly in high noise Parra’s JBDII-SOSRestricted FF/ FIR Min. Cross Power Spectra LM = 10, LI = 7Lf , Fast, sources must be nonstationary van Gerven’s SAD III-SOSRestricted FB/ FIR Min. CrossCorrelation Lb = 500 Slow, unstable Bell’s InfoMaxIII-MLRestricted FB/ FIR Max. Entropy Lb = 500 Slow, nonlinearities should match cdf of sources Bingham’s Fixed Point IV-HOSRestricted FB/ FIR (Bingham did not use this structure) Max. sum of HOS (none)LfFast, erratic convergence for long mixing filters Cardoso’s JADE V-HOSFrequencyDomain Jointly diagonalize cumulant matrices (none)LfFast, artificial sounding Table 3-4. Representative Set of Convolutive Separation Algorithms.

PAGE 101

90The items changed included the length of the mi xing filters, the length of the demixing filters, the length of the data, the number of sour ces, the acausality parameter, and the power of the additive white Gaussian noise (expresse d as SNR, or signal-to-noise ratio). The results for these 6 experiments are shown in Figures 3-8 and 3-9. 40 60 80 100 120 140 0 5 10 15 20 25 30 35 40 45 N (in thousands of samples) SIR (dB) JADE Fixed Point InfoMax MRMI-SIG JBD 2 3 4 5 6 7 8 0 5 10 15 20 25 30 35 NSIR (dB)JADE Fixed Point InfoMax MRMI-SIG JBD s 101 102 0 5 10 15 20 25 30 35 40 45 L SIR (dB) MRMI-SIG JBD InfoMax Nonholonomic Fixed Point JADE SAD h 5 10 15 20 25 30 35 40 4 5 0 5 10 15 20 25 30 35 40 45 LSIR (dB)JADE Fixed Point Nonholonomic InfoMax MRMI-SIG JBD w F igure 3-8. Experiments #1-4. A) Experiment #1, SIR as a func tion of mixing filter length , Lh, B) Experiment #2, SIR as a func tion of demixing filter length, Lw, C) Experiment #3,SIR as a function of data lengt h, N, and D) Experiment #4, SIR as a function of the number of sources, Ns. A C D B

PAGE 102

91For Experiment #1, performance is shown as a function of Lh (for Lw = Lh, which is the minimum value of Lw required for perfect separation). SAD diverged at all values of Lh, except for Lh = 2, for the random initialization. In a ddition, it did not fair much better when initialized to the separating solution. In this case, it diverged for all values of Lh > 5. Since all the remaining experiments use a value of Lh = 25, no additional results are reported for the SAD algorithm. For the Fixed Point method with random initialization, 63% of the values of Lf resulted in unstable solutions for Lh = 50 and Lh = 100. Initialization at the separating solution improved this situation slightly. For all values of Lf tried, all of the solutions for Lh = 50 and one-half of the solutions for Lh = 100 were stable. As mentioned previously, the FFT of the obser vations, X(k), is only an approximation of H(k)S(k). To see how much effect this approximation has on the SIR, JADE was used by passing to it the true values of H(k)S(k). W ith this change, the results for JADE were almost perfectly flat as a function of Lh. At Lh = 2, the resulting SIR was 15.8 dB (versus 15.7 dB) and at Lh = 100, the result was 16.0 dB (versus 11.4 dB). The improvement increased as Lh increased. The difference in performance is expected to become even more exaggerated as the ratio N/Lh decreases (for a fixed Lf). For Experiment #2, a perfect separating solution requires Lw > 25. This experiment shows how sensitive each method is to ove r and under-estimating the required length of the demixing filter. The performance of JADE is shown as a straight line, since the method does not have an Lw parameter. As can be seen from the figure, four of the methods produced almost identical results for Lw = 5, which corresponds to under-estimation by a factor of 5. The figure also shows that the MRMI-SIG, InfoMax, and JBD algorithms are

PAGE 103

92fairly insensitive to over-estimation (by a factor of 2). The Nonholonomic method performed very poorly once again. For this reas on, no further results will be shown for this method. For Experiment #3, the results indicate, as expected, that a reduction in data length will cause all methods to perform worse. Base d on the sampling frequency, the abscissa of Figure 3-6(C) corresponds to a range of 2.2 to 8.8 seconds of data. For Experiment #4, results are shown as a function of the number of sources. Not only does it become increasingly difficult to determine stability/causality as Ns increases, but the likelihood of finding a mixing matrix that has a stable, causal inverse is expected to decrease substantially. Therefore, this wa s the one (synthetic) experiment for which no attempt was made to choose a mixing matrix th at has a stable, causal inverse. The plot indicates that the two FB systems (that remain) had a difficult time finding stable solutions. In particular, InfoMax only produced valid results for up to 3 sources, and the Fixed Point method only produced valid results for up to 5 sources. Also notice how quickly the performance drops for all the methods. For Experiment #5, performance is shown as a function of the acausality of the mixing matrix, Lm, when using the optimal value of L (L = Lm). Restricted FF systems are (ideally) insensitive to Lm since they can implement an arbitrary acausal solution. For restricted FB systems, however, L = 0. Due to the choice of the mixing filters, it is expected that the performance of the FB methods will fall as Lm transitions from 0 to 1, and as it transitions from 3 to 4. Figure 3-9(A) indicates that InfoMax does indeed fall for the transition between 0 and 1, but the second transition is not very noticeable. Neither

PAGE 104

93transition is noticeable for the Fixed Point method, but then the performance for this method is not very good either. For Experiment #6, the SNR is defined as 10log10(Ps/Pn), where Ps is the power of each source (which is the same for all sources) and Pn is the power of the additive white Gaussian noise added to each observation (bandlimited to 8 kHz). The performance of all but the Fixed Point and JADE methods droppe d significantly as noised was added. Also notice that, in this case, InfoMax performed noticeably better than all the other methods. Real Mixtures The next three experiments were from bina ural recordings made in a 2.5 m wide by 3.2 m long by 2.4 m tall double-walled sound proof room. The reverberation time of the sound proof room is approximately 0.1 s, where reverberation time is defined as the time for the mean-squared sound pressure of a 1 kHz signal to drop 60 dB. The speech from a male and female speaker were played over a single loudspeaker, one at a time, and 0 1 2 3 4 5 6 7 0 5 10 15 20 25 30 35 40 45 L SIR (dB) JADE Fixed Point InfoMax MRMI-SIG JBD 2 3 4 5 6 7 8 9 1 0 0 5 10 15 20 25 30 SNR (dB)SIR (dB)JADE Fixed Point InfoMax MRMI-SIG JBD F igure 3-9. Experiments #5-6. A) Experiment #5, SIR as a function of acausal parameter, L, B) Experiment #6, SIR as a function of SNR. A B

PAGE 105

94collected by a pair of Audio Technica AT 853 miniature condenser microphones, having a cardiod pickup pattern, and amplified by a pa ir of Midiman Audio Buddy amplifiers. The microphones were placed inside two syntheti c ears made by Knowles Electronics (part #DB-090 and DB-091), which were placed on eith er side of a dummy head fashioned out of a small cardboard box. The data from the microphones were recorded on a Sony DAT deck and then later downloaded to a comput er. The sampling frequency of the signal is 44.1 kHz and the length of the data is N = 120k samples, which corresponds to 2.7 seconds of data. While better results can be expect ed if the length of the data were increased, this is an optimistic value when considering the usual length of data available before the mixing matrix changes dur ing normal conversation. The data were collected by rotating the stand on which the dummy head was attached to different angles, as shown in Figu re 3-10. At each angle, male speech, female speech, a segment of music, and a segment of white noise were recorded one at a time. 0 45 90 45 -90 o o o o o 0.14m 2.0m dummy head s(n)1 s(n)2 F igure 3-10. Top view of the recording set up for the data collected in the sound proof room.

PAGE 106

95This was done for 0o to 360o azimuth, in increments of 10o. Angles of 45o, 135o, 225o, and 315o azimuth were also included. After downloading the data to a computer, the data were down-sampled to the par ticular sampling frequency of interest. A ninth-ordered Butterworth anti-alias filter was used for this purpose. Mixtures are then created by summing the appropriate observations. For example, if the desired auditory event includes a female at 0o azimuth and a male at 45o azimuth, the first observation is created by summing the male and female speech captur ed by the microphone associated with the left ear. Similarly, the second observation is create d by summing the male and female speech associated with the recordings from the right ear. Only afte r the summations are the signals presented to each BSS method. The original 44.1 kHz data set for all azimuths is available for downloading at by selecting the “Data” link. The parameters for these last three exper iments are somewhat different than before. The block length had to be increased for each method to allow for the longer demixing filters. Also, the Fixed Point and JBD algorithms, due to the limited length of data, was much more restricted in terms of the choices of the FFT length, Lf. The limited length of data also required, for JBD, that LM and LI be changed for each value of Lf. For MRMISIG, the values of Lw and µ were chosen to limit the numbe r of adaptable parameters and so that the resulting memory depth of the Gamma filter (Lw* = (Lw-Lg)/µ + Lg) equals that of the FIR (Lw* = Lw). The former was needed to ensure that the training time was reasonable. However, large memory depths result in a small value for µ, which is known to cause a drastic increase in the eigenvalue spread [ 68]. The final parameters for MRMI-SIG were µ = 0.33, 0.20, 0.15, 0.05 whenever a memory depth of 250, 500, 1000, and 2000,

PAGE 107

96respectively, is required. The results for thes e 3 (real) experiments are shown in Figures 311 and 3-12. For Experiment #7, shown in Figure 3-11(A), s1(n) is located at 0o azimuth and s2(n) is located at 45o azimuth. Keep in mind that these are only virtual locations, since it was the dummy head that was rotated and not the speaker. Notice that the Fixed Point method becomes unstable for all the values of Lf attempted when Lw exceeds 500. Also, no results are shown for InfoMax. The InfoMax method ta kes by far the longest of the seven methods to converge when the number of adaptabl e parameters becomes large (mainly because it uses a FB system, which takes longer than the FF systems in computer simulations). In addition, because of the stability problems associ ated with using a restricted FB structure, it often (consistent with other findings [100-101]) results in an unstable solution. For these reasons, no more results are shown for InfoMax. The JADE algorithm is shown as a line in Figure 3-11 because, just as before, the results depend on Lf, not on Lw. For experiments 200 400 600 800 1000 1200 1400 1600 1800 2000 0 5 10 15 20 25 Approx. Upper Bound Fixed Point MRMI-SIG JBD JADE Memory Depth (L w for an FIR) SIR (dB) F igure 3-11. Experiments #7-8. A) Experiment #7, SIR as a f unction of mixing filter lengt h, Lw, B) Experiment #8, SIR as a f unction of sampling frequency, fs. 10 15 20 25 30 35 40 4 5 0 5 10 15 20 25 f s (in kHz)SIR (dB)JBD Approx. Upper Bound JADE A B

PAGE 108

97using real mixtures, JADE can no longer use the separating solution in order to correct for local permutations. The plots shown in Figur es 3-11 and 3-12, however, do use the results of the approximate upper bound for determining local permutations. This improved the results for JADE by an average of almost 2 dB. For Experiment #8, the sampli ng frequency is varied to see the effect of down-sampling with s1(n) once again located at 0o azimuth and s2(n) located at 45o azimuth. An anti-alias filter was used and the memory depth was changed in accordance with fs as listed next. Values for the memory depth of 1000, 1450, 2000, 2900, and 4000 were used for fs = 11.025, 16.0, 22.05, 32.0, and 44.1 kHz, respectively. The results indicate that the performance of convolu tive separation can be improved by low pass filtering and decimating. This experiment uses very long demixing filters, which caused the convergence time of MRMI-SIG to exceed a reasonable limit. Also notice that no results are shown for the Fixed Point method, since every single soluti on was unstable. Due to the stability problems of this method on the last two experime nts, no more results are shown for the Fixed Point method. For Experiment #9, results are shown as th e location of the second source takes values of -90o, -45o, 0o, 45o, and 90o azimuth, while the first source is fixed at 0o azimuth. When s2(n) is at 0o, the two sources are in the same location so that the mixing matrix is not full rank. However, since there is a noticeable difference in the spectra of the two sources, it is still possible to produce some degree of separation. The approximate upper bound behaves as expected, in that the result at 0o performs the worst. Interestingly, the JBD method not only performs best at 0o, but it outperforms the approximate bound at that

PAGE 109

98angle as well. It was hypothesized that this resu lt occurs because JBD is making use of differences in spectra that the upper bound was not able to use. To test this hypothesis, the cross filters, Wij(z), of JBD were examined. In dee d, one corresponded to a high pass filter (for the female speech) and the other corresponded to a low pass filter (for the male speech), the passbands of which did not overlap. A possible implication is that the JBD algorithm is better at distinguishing spectral di versity than it is at distinguishing spatial diversity, at least when no other information is present. Theoretically, Figure 3-12 should be perfectly symmetric with respect to 0o azimuth. It is believed that the lack of perfect symmetry is caused by imperfectly positioned microphones, although other possible causes exist. For example, the left and ri ght microphones and/or am plifiers might have slightly different transfer functions. Conclusions Several conclusions are listed here, keep in mind that they are based only on the results of this set of exper iments and are not meant to be conclusive. The InfoMax method -80 -60 -40 -20 0 20 40 60 80 0 5 10 15 20 25 Azimuth (degrees)SIR (dB)Approx. Upper Bound JADE MRMI-SIG JBD F igure 3-12. Experiment #9, SIR as a function of azimuth.

PAGE 110

99used a restricted FB structure because it ha d been determined by several researchers that when this criterion is used with a FF system, the criterion preferred to temporally whiten the sources rather than separate them [80], [102]. Although it performed well for synthetic mixtures, it is handicapped by the limitations of the restricted FB system. The Type V method, JADE, was by far the fastest of all the methods. However, its performance was rather disappointing, especially considering th at the local permutations were adjusted using knowledge not available in a real de mixing scenario. The Fixed Point method converged smoothly when the length of the mixing filters was small compared to the length of the data. This was not the case when the mixing filters became larger. Keep in mind that this problem cannot be remedied by reducing th e step size since fixed point methods are characterized by the lack of a step size. It does have a nice feature, relative to the other FB systems. Even though the outputs are produced usi ng a restricted FB structure, the signals used in the criterion are frequency-domain si gnals, which come from an entirely separate signal path. Since the latter si gnal path does not involve fe edback, there are no stability problems for the criterion. The solution must, of course, be converted to the restricted FB structure before generating the output signals . However, suppose at time n that the criterion (in conjunction with the optimization method) chooses values for the parameters of the time-domain structure that result in out put explosion. Since th e criterion at time n+1 chooses new values for the parameters of th e structure based on the frequency-domain signal path and not the time-domain signal path, it is quite possible that the system is able to recover. For SAD and Infomax, on the other hand, it is unlikely the system will recover once the adaptable parameters are such that th e system is unstable. For real mixtures, JBD is clearly the best performing method of the seven tested. However, it may be more adept

PAGE 111

100at spectral separation than spatial separation, where the latter is more desirable for a BSS algorithm. The proposed convolutive BSS me thod, MRMI-SIG, performed well for synthetic data except when the SNR was poor. Overa ll, the results of this research indicate that Type I or Type II systems, corresponding to FF time-domain demixing systems, are preferred for separation performance (as opposed to speed of convergence). The situation is not as clear for determining the preferred criterion. The two SOS methods posted the best and worst performances, while the pe rformance of the HOS and IT methods were between the two extremes. For real mixtures, the approximate upper bound performed better than the best convolutive BSS method, JBD, by an average of 8.0 dB (averaged over Experiments #7-9) suggesting that there is still a lot of room fo r improvement of BSS methods. This is especially true considering that the upper bound may produce a smaller SIR value than otherwise possible for sources that do not have significant overlap in the frequency domain. It is also noteworthy that none of the methods performed very well for data collected in a F igure 3-13. SIR of the approximate upper bound as a function of the acausal parameter , L, for both permutations. 0 50 100 150 200 0 5 10 15 20 25 Permuted solution Non-permuted solution L (acausal parameter)SIR (dB)

PAGE 112

101very controlled environment, i.e., in a doubl e-walled sound proof room . In an attempt to determine if the acausal parameter was limiting the performance, the SIR was computed for the experimental upper bound. This plot is shown in Figure 3-13, which shows that a slight improvement is possible for the non-permuted solution by increasing L to 200. For this last figure, s1(n) and s2(n) are at 0o and 45o azimuth, respectively, fs = 11.025 kHz, and Lw = 1000. The following list of items contains several of the most important items in limiting performance of a convolutive BSS met hod for the case of real mixtures. Additive noise Non-linearities Demixing filters too short/long Too few data Resolution too low (time or frequency) Time-varying mixing matrix Local minima due to any causes including local permutations Frequency domain approximation, X(k) = H(k)S(k) Acausal parameter not properly set More sources than sensors The approximate upper bound uses a resolution of 1 and is not effected by local minima or the frequency domain approximation and, in all experiments conducted, the number of sources never exceeds the number of sensors. Fu rthermore, due to the controlled nature of the experiments, the mixing matrix can be assu med to be time-invariant, i.e., physically stationary. Results of the upper bound in Figure 311(A) appear to indicate that the length

PAGE 113

102of the demixing is of the correct order and Figure 3-11(B) clearly shows that the combination of an anti-alias filter with down-sampling is not causing degradation of performance. Based on Figure 3-13, it does not appear that much can be gained by increasing L. By process of elimination, it appears that the limita tion of performance of the restricted FF/FIR structure (for this data set) is mainly due to the limited amount of data, additive noise, and/ or channel non-linearities. Since the amount of da ta in a real mixing scenario is likely to be much less than that used in the experime nt and since not much else can be done to limit the effect of additive noise (beyond the use of a low pass filter, which in this case is realized by the anti-alias filter), it appears that the only hope of improving the best obtainable results is to use a nonlinear demixing structure. An additional experiment wa s conducted to determine if the transfer function from the sources to the sensors is linear or nonlinea r. The idea is to use system identification [6] with an FIR filter, where the input is the orig inal male speech and th e desired signal is the recorded male speech (for this experiment, the recorded speech from the male was kept separate from that of the female). An FIR of length 1000 is then trained using the MSE criterion. The trained system represents the line ar portion of the overall transfer function, which consists of the speaker, acoustic cha nnel, microphone, amplifier, analog-to-digital converter, and anti-alias filter. A scalar measure that quantifies the amount of nonlinearities in the overall channel can then be given by the power of the error signal normalized by the power of the desired signal. If the channel is linear, and supposing it is represented well using an FIR of length 1000 taps, the performance measure should result in a value near 0%. The result indicates that the error pow er is 81% and 83% that of the power of the

PAGE 114

103desired signal for the left and right channels, re spectively. This suggests that the channel is highly nonlinear. It would not be surprising if the channel is noticeably nonlinear since the amplifiers had to be turned to approxima tely 80% of their maximum value due to the 2 m separation between the speakers and the microphones (incid entally, the vast majority of the real recordings use a speaker to microphone distance of less than 0.25 m, which corresponds to a fairly intimate convers ation). However, the results from the linearity test are worse than expected considering that the experimental upper bound, which also uses a (linear) FIR filter, achieved fairly respectable separation results. There are at least two reasons that might explain this discrepancy. First, the channel nonlinearities are primarily due to the speaker. This is a possible explanation since both signals used to train the experimental upper bound involve the same speaker. Consequen tly, the speaker transfer function is cancelled in the overall transfer function associated with the experimental upper bound. This is not true for the linearity test. The other possibility is that there is a slight difference in the value for the clock frequency of the anal og-to-digital converter and the sample frequency assumed for the original data. Once again, the experimental upper bound is immune from this since both signals used to train the experimental upper bound are sampled using the same clock. To the degree that the linearity test results can be trusted, this experiment confirms that nonlinear demixing structures are needed to improve performance beyond that given by the experimental upper bound.

PAGE 115

104 SOSDecorrelation of Non-stationary Sources [103] SAD (Symmetric Adaptive Decorrelation) [79] Min. Squared Cross-Correlation [92] Joint Diag. of Cov. Matrices [104] Min. Squared Cross-Correlation [105] Min. Squared Cross-Correlation [70] Decorrelation of Non-Stationary Sources [106-107] Extension of Molgedey and Schuster Method [64] Simult. Diag. of Non-Positive Definite Cov. Matrices [108] On-Line Time Domain Decorrelation [109]HOSCancellation of Cross-Cumulants [96] Max. Sum of Temporal Squared 4th-Order Cumulants [110] CMA (Constant Modulus Algorithm) with Lateral Inhibition [111] Partial Approx. Joint Diag. of Cross Cumulants [88] Squared Correlation of NL Functions of Outputs [69] Extended Super Exponential Deflation [95] Dynamic ICA [87] Max. Sum of Squared Kurtoses [110] Max. 4th-Order Cumulants of Non-IID Sources [112] Two-Stage Super-Exponential [113] Max. Higher Order Cumulants [114] Cumulant Matching [114] Cumulant Matching Augmented With Correlations [114] Sum of Function of Absolute Value of Cumulants [115] Hierarchical Constant Modulus Algorithm [116] Paraunitary Filter Bank Deconvolution [60] Higher Order Cumulant Maximization [117] Sum of Absolute Value of Higher Order Cumulants [118] Sum of Squared Higher Order Cumulants [118]MLMin. MI of Markovian Processes [83] Min. MI of Linear Processes [26], [83] Min. MI of Linear Causal Min. Phase Processes [83], [119] Min. MI of Linear Processes [84] Min. MI of IID Sources Having a Known PDF [84] DCA-CT (Dynamic Components Analysis Convolutive, Time Domain) [73], [120] Contextual ICA [91] Density Matching [93] Extended InfoMax [121] Nonholonomic Density Matching [78] EDICA (Extended Dynamic ICA) [122] InfoMax [123] Multichannel Deconvolution Using Natural Gradient [124] InfoMax for Convolutive Mixtures Using FF System [80] Multichannel Deconvolution Using Non-Causal Filters Trained With Information Backpropagation [125]NPMRMI-SIG (presented in this chapter) Table 3-5. Type I Systems.

PAGE 116

105 SOSSDIF (Simult. Diag. of Cov. Matrices in Freq. Domain) [76] Min. Squared Coherence [97] MRFD (Multi-Resolution Freq. Domain) [126] Joint Approx. Diag. of Cross-Spectral Density Matrices [127] Mult. Decorrelation of Non-Stationary Sources [75] SDIF (Simult. Diag. of Cov. Matrices in Freq . Domain Using Hadamard Â’s Inequality) [76], [77] Diag. of Cross Power Spectra of Non-stationary Sources [74] MAD (Multiple Adaptive Decorrelation) [109] (see Table 3-8 for more options)HOS(see Table 3-8)ML(see Table 3-8)NP(none)SOSSAD (Symmetric Adaptive Decorrelation) [79]HOSExtended H-J Network [128] Cancellation of Cross-Cumulants [98] Cancellation of NL Functions of Outputs [98] Cancellation of 2nd and 4th Order Cross-Cumulants [98] Min. Asymptotic Estimator Error Variance [63], [129] Self-Optimized Min. Asymptotic Estimator Error Variance [130-131] Min. Asymptotic Estimator Error Variance for Colored Sources [131]MLSAMLE (Symmetric Adaptive ML Estimation) [90] InfoMax for FB Architecture [102] InfoMax for Separation of Delayed Sources [132] InfoMax for Convolutive Mixtures Using FB System [80] InfoMax for FB System Including Static Weights [100]NP(none) Table 3-6. Type II Systems. Table 3-7. Type III Systems.

PAGE 117

106 SOSJoint Diag. of Cross Spectral Density Matrix [49] Frequency Assignment [99] FDODF (Output Decorrelation Filtering in Freq. Domain) [133-134] AMCor (Amplitude Modulation Correlation) [45] Natural Gradient Method for Non-stationary Sources [135] GSS-I1 (Geometric Source Separation) [41] GSS-C2 (Geometric Source Separation) [41] Simult. Diag. of Covariance Matrices [136] Joint Diag. of Cross Power Spectral Density Matrices [137]HOSCancellation of Cross-Polyspectra [138] Freq. Domain JADE (Joint Approx. Diag. of Eigenmatrices) [44] SFDA using JADE (Simplified Freq. Domain Approach) [46] Max. Absolute Value of Auto and Min. Absolute Value of Cross 4th-order Cumulants [43] Extended JADE [139] Freq. Domain Fixed Point (Second Solution) [40] Fixed-Point for Complex Signals [81] Cancellation of Dissymmetrical Cross-Cumulants [140] Cancellation of Symmetrical Cross-Cumulants [140]MLInfoMax with Beamformer [43] DCA-CF (Dynamic Component Analysis Convolutive, Freq. Domain) [73] DCA-CS (Dynamic Component Analysis Convolutive, Semi-Blind) [73] Contextual Density Matching [101] Freq. Domain Extended InfoMax [141] FDMI (Information Maximization in Freq. Domain) [133] MI Natural Gradient Alternative [142] Freq. Domain InfoMax Resulting in CM with Penalty [42] Freq. Domain Fixed Point (First Solution) [40] Freq. Domain InfoMax [143] Freq. Domain InfoMax [39] Min. Kullback-Leibler Divergence [144-145] Freq. Domain InfoMax with Reduction of Room Reflections [47] BLMS (Direct Bussgang Algorithm Cost Form 1) [146] DBAC3 (Direct Bussgang Algorithm Cost Form 3) [146]NP(none) Table 3-8. Type II, IV, V Systems.

PAGE 118

107CHAPTER 4 CONCLUDING REMARKS Three different types of applications have been addressed in the previous 2 chapters and in the first appendix. They include instan taneous blind source separation, convolutive blind source separation, and feature reduction. In all 3 cases, the underlying criterion was based on an estimator of RenyiÂ’s entropy, whic h is itself based on non-parametric pdf estimation. Several other options exist for both th e pdf and the entropy estimators, that have not been discussed hitherto. These are listed , along with a brief summary of the reasons for selecting the pdf and entropy estimators that were incorporated into the criterion. Following this is a list of several other applicat ions for which the proposed criterion has been used and a list of suggested items for future research. PDF Estimation Several options for non-parametric pdf es timation, the foundation of which is attributed to the seminal work of Fix and Hodge s [82], include histograms, nearest neighbor estimation [147], polynomial expansion [148], a dditive kernel estimation such as Parzen Windows [15], [149], and product kernel estimation [150]. Othe r closely-related options include, but are not limited to, tree-based me thods and rectangular partitioning [150]. The histogram divides the space into equal size regions and counts the number of occurrences within each region. Polynomial expansions cons ist of a weighted sum of functions, where each function differs from all other functions in the expansion. The kernel estimation techniques place a single (identical) kernel at each data sample, and then sum the results from

PAGE 119

108each data point over time. The difference between additive and mu ltiplicative methods only becomes apparent for mul ti-dimensional estimation. In th is case, the additive methods use a kernel that is multi-dimensional, whereas the multiplicative methods form a multi-variate function from a product of singlevariate functions, one for each dimension. Finally, the nearest neighbor method is similar to the kernel-based methods when the selected kernel is the uniform distribution. Th e difference is that th e kernel-based methods have a constant kernel volume and a varying number of data points that fall under the kernel, while the nearest neighbor methods have a constant number of da ta points and a varying volume. The histogram, nearest neighbor method, tr ee-based method, and rectangular partitioning method produce highly discontinuous pdf estimates, which are problematic for gradient-based optimization methods. Anothe r pdf estimation method not listed above was also briefly considered, the estimation of which is formed by a multiplicative combination of kernels at each time instant. This method, however, was quickly discarded since it did not perform nearly as well as the two previously mentioned kernel-based methods. This leaves only the polynomial expansion and th e two kernel-based methods. Only one polynomial expansion method was tried, the estimate of which consisted of the weighted sum of Legendre polynomials [148]. In order for the Legendre polynomial method to compare similarly to the kernel-based method for BS S (in terms of SIR), the number of polynomials had to be very large. For this reason, it was not considered further. This left only the kernel-based methods. Kernel-based methods estimate pdfÂ’s ba sed on the data samples and require the selection of a kernel shape and a kernel width. The initial choice for the kernel was the

PAGE 120

109Gaussian distribution, for which the covariance matrix was the identity matrix. In this particular case there is no distinction between additive and multiplicative methods since a product of single-dimensional Ga ussian functions equals a single multi-variate Gaussian. Several other kernels of the generalized Gau ssian family, such as the Laplacian and uniform distributions, were also tried for the multiplicative kernel-based method. The optimal choice of kernel shape, defined in terms of integrated square pdf estimation error, depends on the amount of available data in the follo wing fashion. Suppose that a large number of i.i.d. data samples are used to estimate the pdf (large relative to the amount needed for good pdf approximation) and that the kernel width is sufficiently small. In this case, the choice of kernel shape is not critical, as any kernel for which the total mass is concentrated around zero will produce an asymptotically unbi ased estimate of the pdf [149]. At the other extreme, suppose that there is only a single data point available. In this case, it is possible to perfectly estimate any distributi on with a single precisely located data point when the kernel shape matches the specified di stribution and the kernel width is correctly chosen. This might lead one to consider, e.g., using uniform kernels when the data distribution is uniform. However, the distribution being estimated changes as the adaptable parameters are updated for all three applications considered here. Therefore, even if the data points just so happened to be in precisely the correct location and the kernel width is perfectly chosen, the kernel shape will be inco rrect since the optimal kernel shape changes in an unknown fashion as the parameters adapt. Moreover, the choice of a uniform kernel in the above example (when the sources ar e uniformly distributed) is appropriate only at the separating solution and it, like the histogram, produces a highly discontinuous pdf estimate. The result is that matching the kernel shape to the source distribution is not

PAGE 121

110necessarily a good approach. In fact, it is not even possible in the truly blind case, since the source distributions are unknown. In order to determine the preferred choice of kernel shape, kernels having the generalized Gaussian distribution for = 1, 2, and 10 are compared. The performance is measured both in terms of integrated square of the pdf estimation error (for many different target distributions) and in terms of its ability to blindly separate s ources (for many different source distributions). Also, the comparis ons made use of the optimal kernel size, which was experimentally determined for each task. The performances of the different kernels varied significantly as the target/sour ce distributions were changed and as the number of data points changed. The overall be st performer was unequivocally the Gaussian kernel. In fact, almost without exception, the Gaussian kernel performed as well or better than the other kernels independent of th e target/source distribution. Interestingly, a (non-conclusive) argument can be made for the us e of Gaussian kernels. Recall that in all cases the variance of the data for which the entropy was estimated (hence, for which the pdf was estimated) was held constant. When th is is true, it is well-known that the maximum entropy pdf is Gaussian. The use of Gaussian kernels for pdf estimation is, therefore, maximally noncommittal. This is a pleasing resu lt since, for the three applications considered, no a priori information of the true densities is available. Entropy Estimation Several issues regarding the entropy estimator used in the criterion are discussed next. Specifically, other possible definitions of entropy are listed, notes concerning the correction required for sub-Gaussian sources are given, and a discussion is included concerning the curse of dimensionality as it applies to entropy estimation.

PAGE 122

111RenyiÂ’s entropy, as previously mentioned, is a family of entropy measures that depend on the user-defined entropy parameter, . The definition includes ShannonÂ’s definition of entropy in the limit as approaches one. Several othe r definitions of entropy not considered include the Havrda-Charvat, Sharma and Mittal, Sharma-Taneja, and KapurÂ’s 1st and 2nd entropy definitions [151-152]. With regards to RenyiÂ’s definition, three reasons are given why choosing to be two is a good choice (as previously reported in Chapter 2). First, the variance of the entropy estima tion (for equal bias) of a wide variety of generalized Gaussian distributions is le ss than the variance of estimating ShannonÂ’s entropy and is nearly the minimum variance produced for any value of . In essence, the use of RenyiÂ’s quadratic entropy does for I TL what HyvarinenÂ’s research did for HOS [153]. Namely, it improves the statistica l characteristics of the estimator. Second, = 2 is the only value of for which an entropy estimator exists that has O(Lb) computational complexity, as opposed to O(Lb 2) complexity. Moreover, this reduced complexity estimator provides a good approximation to the O(Lb 2) estimator from which it was derived by randomizing the data a sufficient number of times. Good approximation is also achieved without randomizing the data as long as the data are temporally i.i.d. and the block size is sufficiently large. Third, the O(Lb) entropy estimator is capable of outperforming the O(Lb 2) estimator (from which it was derived) wh en there is spectral diversity between the sources. In this case, it is required that the data are not randomized since randomization destroys the temporal structure of the data. For instantaneous BSS, the sign of the kurtosis was needed in the criterion in order for the adaptation to proceed in the correct direction when separating sub-Gaussian

PAGE 123

112sources. Appendix B shows several plots that indi cate that this correction is not needed for the case of feature reduction. The reason the sign of the kurtosis is not needed in this example is because of a fundamental difference between feature reduction and BSS. Namely, feature reduction is a supervised me thod, which is able to obtain information on the correct direction of adaptation from the class labels. Convolutive BSS, similar to instantaneous BSS, is unsupervised and is therefore expected to require extra information when separating sub-Gaussian sources. Unlike for instantaneous BSS, convolutive BSS involves multi-dimensional marginal entropies a nd a joint entropy. Whereas it is simple to compute the sign of the kurtosis for single-dime nsional distributions, the usual definitions of kurtosis are not directly applicable to multi-dimensional distributions. It should be noted that the kurtosis cannot be measured for each dimension separately since this creates a conflict unless the estimated sign of the kurtosis is identical for every dimension of the given multi-dimensional distribution. Moreover, as indicated by Figure B-1(B), the performance surface for the joint entropy in the bivariat e case is flat when both of the associated single-dimensional distributions are sub-Gaussian. The result is that the proposed convolutive BSS method, introduced in Chapter 3, only works for super-Gaussian sources. The reason it works for super-Gaussian data, as a ttested to by the SIR results in Chapter 3, is believed to be due to the si milarity between the performance surface for Shannon’s and Renyi’s quadratic entropy estimators for supe r-Gaussian data, such as given by Figure 24(A) and implied by Figu re B-1(B) (for 1 < 1 < 2, 1 < 2 < 2). The curse of dimensionality is undeniably an important factor when estimating probability distributions since the space that mu st be “covered” by th e data samples grows exponentially as the dimensionality is increased linearly. One question that arose

PAGE 124

113repeatedly in the course of this research is wh ether or not the curse of dimensionality is an important factor for the estimation of entropy. While it is of no consequence for instantaneous BSS (recall that when the sphering/ro tation architecture is used, the criterion reduces to a sum of entropies of single -dimensional distributions, regardless of the number of sources), it directly concerns feature extr action and convolutive BSS. In the latter two applications, the dimensionality of the underl ying distributions increases with the number of factors retained and the number of sources/lags, consecutively. The most obvious answer is that the estimation of entropy is affected by the cu rse of dimensionality because entropy is a function of the distribution. Keep in mind, however, that the mean value is also a function of the distribution, for exampleEX [] xfXx () x d–=(41) Figure 4-1. Plot of N versus dimensionality fo r constant SIR (entr opy estimation) and co nstant SER (pdf estimation). The linear reference line shows the slope require d for a linear relationship between N and the dimensionality. 100 101 102 103 10-1 100 101 102 Number of data points (in thousands)SIR = 20 dB Dimensionality SIR = 25 dB SER = 16 dB Linear reference

PAGE 125

114One of the (two different) derivations of the estimator for Renyi’s quadratic entropy may be found by writing the entropy in terms of an expectation and then replacing the expectation operator with the sample mean. The difference between the estimates of the mean and the entropy are that the former requires integration (summation) over the realizations of the random vari able and the latter requires integration over the estimated distribution of the random variab le. Since integration tends to reduce variance, it seems plausible that the curse of dimensionality is not as critical for entropy estimation as it is for distribution estimation. In fact, similar to an argument given in the previous chapter, recall Figure 3-8(A) which shows the separation performance of the proposed method for the case that Lh = 100. For this value of Lh, the distribution of the joint entropy has a dimensionality of 200, which must be “estimated” using only 140k data samples. Despite this, the method produced an SIR of over 20 dB. A simple experiment was conduc ted in order to address this question in a more systematic fashion. The SIR was measured as a function of dimensionality, which was increased by increasing the lengths of the mix ing and demixing filters. For this experiment, Hii(z) = 1, Lh = Lw, and Lb = N, the length of the data. Due to the use of a different block size, these results cannot be compared to those from Chapter 3. For a given dimensionality, the value of N was varied to achi eve 20 dB and 25 dB SIR for a convolutive BSS application, where 20 dB corresponds to inaudibl e interference. The plot showing the final set of results is given in Figure 4-1. Recall that for a log-log plot, the slope of the line indicates the power of the exponent (the d.c. offset is due to a gain factor, which should be EfXx () [] log – fXx ()2x d–log – =(42)

PAGE 126

115ignored). As a result, a line having a slope of one represents a linear function. Notice that for entropy estimation, a linear increase in dime nsionality requires a line ar increase in the amount of data. Also shown are the result s for pdf estimation of a multi-dimensional -2 0 2 -2 0 2 0 0.1 0.2 0.3 0.4 y1 y2 fY(y1,y2) -2 0 2 -2 0 2 0 0.1 0.2 0.3 0.4 y1 y2 fY(y1,y2) F igure 4-2. Estimation of a tw o-dimensional pdf using Pa rzen Windows with Gaussian kernels. A) True distribution and B) estimated distribution with = 0.1, N = 10k, and SER = 19 dB. B A

PAGE 127

116Laplacian distribution. The performance measure used for the pdf estimation is the SER, or signal to error rati o, which is defined by The numerator is the “signal power” and the de nominator is the power of the squared error between the true and the estimated distributions. An example pdf estimation of a twodimensional distribution using Parzen Windows with Gaussian kernels is shown in Figure 4-2, where = 0.1, N = 10k, and the SER = 19 dB. The integration is approximated by a Riemann sum, which consists of discretizing the space into equal size bins. The bin size was set to the maximum value for which the integration is approximated well. For this experimentally determined bin size and for N = 100k, the time to compute the SER for distributions having a dimensionality of 4 a nd 10 was estimated to be 11 days and 1.5 billion years, respectively. The reason for the sharp increase in time is because an estimated 61 bins (minimum) are needed in each dimensi on, which brings the total number of bins at which the pdf is approximated to 61d, where d is defined here as the dimensionality of the distribution. Consequently, results are only s hown for d = 1, 2, and 3. Interestingly, the computational requirement of estimating entr opy, O(N), is approxima tely equal to that needed for the estimation of the distribution at a single bin. More powerful methods could be used to approximate the in tegration, but this would only provide a marginal increase in the dimensionality that could be tested. The end result is that, as the dimensionality increases, the amount of data needed for e qually good blind source separation is not dictated by the exponential require ments of pdf approximation.SER 10 log10fYy ()2x d–fYy () fYˆ y () – ()2x d–-------------------------------------------------=(43)

PAGE 128

117Other Applications The MRMI-SIG criterion has also been us ed for several other applications. For example, Hui Liu has preliminary results wh ere MRMI-SIG was used to preprocess EEG data (electro-encephalogram). Specifically, she used the algorithm to separate artifacts associated with the EEG from the other brain signals. The hope is that the artifact removal will facilitate epileptic seiz ure detection and prediction. A nother researcher, Dorothee Marossero, is using MRMI-SIG to separate the fetal EKG signals (electro-cardiogram) from that of the motherÂ’s EKG signals [154] . Successful separation of the EKG signals will allow doctors to better assess the hea lth of the fetus. Hemanth Peddaneni used MRMI-SIG for feature reduction on a data set where the subjects from whom the data were collected were attempting to control th e vertical position of a cursor on a screen by changing their cortical potentials. In a manner similar to that described in Appendix A, he maximized the mutual information between the cortical potentials and the cursor location. The difference between the data from his resear ch and the data used in Appendix A, is that each feature in the cortical da ta set represented an entire time series, not just a single point. Although the approach previously used could be applied directly to the cortical data set, the number of parameters that would need to be trained would be excessively large. Instead, his approach replaced the static classifier with a (dynamic) multilayer perceptron, obviating the need for the feature reducer to extract the information from the entire time series and represent it using a single scalar . With this approach, he was able to obtain a 79% accuracy on the disjoint test set.

PAGE 129

118Future Research In this section, six topics that need further investigation are listed. First, extend the criterion for convolutive separation so that s ub-Gaussian sources may be separated. Second, instead of using the difference of consec utive outputs in the criterion for convolutive source separation, allow the first term in the difference to remain fixed (e.g., fixed to the leading edge of the window). Th is will cause each term in th e joint entropy to be a function of lags from 1 to Lb, instead of a function of only lag 1. It is believed that this will prove to be a more direct method of measur ing temporal dependenc ies; however, several previous attempts to do this have failed. One possible reason for the lack of success may be related to the fact that the kernel size de pends on the signal power. Perhaps gain factors, one for each lag, need to be incorporated so that the covariance at each lag has an identical value. One drawback of this approach is that the computational complexity increases noticeably. Third, change the criterion for convolutive BSS from the mutual information between vectors of processes to the mutual information between Ns scalars, where the scalar associated with the jth output, j = 1, 2, . . ., Ns, is formed from the temporal summation of the L most recent values of output j. This approach is not as direct as that proposed, but it has the benefit of significantly reducing th e dimensionality. Initial attempts to do this have failed, but the attempts were also far from comprehensive. Fourth, extend the experimental upper bound for convolutive separation to in clude other architectures, such as the FB/FIR and frequency-domain structures. Anothe r closely related item is to change the experimental upper bound to maximize SIR instea d of minimize the power of the interference. The mathematics suggest that this is possible; although, the performance surface for the new formulation will no longer be guaranteed to have a single minimum. Fifth, modify

PAGE 130

119the algorithm to improve separation performan ce for sources that have a multi-modal distribution. Initial results indicate that all the methods listed in Chapter 2, including the proposed method, have a very di fficult time separating bi-m odal, instantaneously mixed sources. Sixth, and finally, modify the algorit hm to improve the separation of instantaneous mixtures of colored Gaussian sources. Recall that very encouraging results were reported in Chapter 2 for this scenario, wh ere MRMI-SIG outperformed the next best method by 19 dB on average. However, it may be possible with the simple addition of hysteresis to further improve the results. The pr oblem is that the MRMI-SIG criterion uses the sign of the kurtosis and that the Gaussian di stribution is precisely the boundary of positive and negative kurtoses. In 1 of the 10 Monte Ca rlo trials from Chapter 2, it was found that the SIR did not converge, but rather oscillated between a very good solution (49 dB) and a very poor one (0 dB). The oscillatory behavior was caused by the fact that the sign of the kurtosis was itself oscillating. Although it is simple to test whether or not hysteresis is a good solution for this particular problem, it ma y have negative side effects for separating non-Gaussian sources. Consequently, further an alysis is needed. Incidentally, for the computation of the average result reported in Chapter 2, the value of 0 dB was used for the trial mentioned above.

PAGE 131

120APPENDIX A SUPERVISED FEATURE EXTRACTION A classification system typically includes both a feature extractor and a classifier. The two components can be trained either sequentially or simultaneously. The former option has an implementation advantage since th e extractor is trained independently of the classifier, but it is hindered by the suboptimal ity of feature select ion. Simultaneous training has the advantage of minimizing classifica tion error, but it has implementation difficulties. Certain criteria, such as Minimum Classification Error, are better suited for simultaneous training, while othe r criteria, such as Mutual Information, are amenable for training the extractor either sequentially or simultaneously. Herein, an information-theoretic criterion is introduced and is evaluated fo r sequential training, in order to ascertain its ability to find relevant features for classification. The proposed method uses non-parametric estimation of RenyiÂ’s entropy to train the extractor by maximizing an approximation of the mutual information between the class labels and the output of the extractor. The proposed method is compared against seven ot her feature reduction methods and, when combined with a simple classifier, agains t the Support Vector Machine and Optimal Hyperplane. Interestingly, the evaluations show that the proposed method, when used in a sequential manner, performs at least as well as the best simultaneous feature reduction methods.

PAGE 132

121Introduction Feature extraction is commonly employed as a pre-processor for applications including visualization, classification, detection, and verification. Herein, feature reduction, which in the linear case is also known as subspace projection, is investigated as it applies to classification. Figure A-1 shows a block di agram of the major components used in a classification system. In this figure, sj( k ), xj( k ) and yj( k ) are the size (NI x 1) input features, (NO x 1) output (transformed) features and the (NC x 1) outputs of the classifier at time k and having class j ( j = 1, 2, . . ., NC), respectively. For classification, optimality can be considered as the condition for which the probability of correct classification, 1 P[ e ( k ) = 1], is maximized, where e ( k ) = 1 if an error occurred at time k , and is 0 otherwise. An empirical estimation of the error probabil ity will be used as the figure of merit. Feature reduction methods attempt to impr ove generalization by reducing the variance in classification performance [155]. For certain classifiers, the improved generalization may occur since a reduction in the number of features causes a reduction in the number of free parameters (relative to the amount of data) required for classification. In 1(k) Sum of Squared ErrorsN (k)C ^: .j ,N(k)^ Compare c(k) c(k) e( k) . y yj ,1(k): .C : . Discriminant Function . MAX DEMUX e(k) xj ,1(k) xj ,N (k) xj ,2(k)O : .sj ,1(k) sj ,N sj ,2(k) sj ,3(k)I Feature Extraction R . F igure A-1. Block diagram of feature extraction for classification.

PAGE 133

122other classifiers, the improved generalization may occur as a result of reducing the dimensionality of the required probability density function (pdf) estimation. However, classification performance does not impr ove indefinitely as the numbe r of features is reduced, due to classifier bias [155]. At some point, the loss of information (about the class) inherent in reducing the number of features overw helms any benefit gained from reducing, for example, the number of adaptable parameters. Obviously, this bias-variance dilemma is highly dependent upon the specific algorithms used for both the feature extractor and the classifier, and it is the trade-off between the two that provides the motivation for the present study. Another method, Vapnik’s Structural Risk Minimization (and its embodiment, the Support Vector Machine (SVM) [156-1 64]), has recently been determined to provide more explicit control of generali zation through regularization. As such, it has become the obvious methodology with which to compare the performance of any feature reduction method. Feature reduction methods ma y be categorized based on whether the projector and the classifier are trained sequentially or s imultaneously. Sequential methods adapt the projector based on optimizing a cr iterion at the output of the projector . On the other hand, simultaneous methods adapt the projector base d on optimizing a criterion at the output of the classifier. The former is independent of the cl assifier, while the latter is trained “through” the classifier. This gives sequentia l methods an implementation advantage. Not only does it take longer to train simultaneous ly due to the increased computational complexity, but also the cost function landscapes may become more difficult to search. Moreover, a new set of update equations must be derived and the projector must be re-trained if it is desired to evaluate a new classifier. On the other hand, simultaneous methods have the

PAGE 134

123obvious (theoretically speaking) performance advantage in that they optimize the projector and the classifier together, that is they t une the projector to the classifier discriminant functions. It is expected that, every thing else remaining constant , simultaneous training will produce superior results compared to tr aining sequentially. Consequently, the onus is on sequential training methods to demonstrate that there exists an easily implementable and general-purpose feature extraction algorithm that provides, with the combination of a suitable classifier, commensurate classification error. Another difference between sequential and simultaneous systems is the choice of possible criteria for use in training the extracto r. Criteria such as Minimum Classification Error (MCE) [165-167] and Mean Square Erro r (MSE) [168], for example, are well suited for training the extractor simultaneously. Howeve r, they are not appropriate for sequential training since both of these criteria are based on an error signal. In order to use the MSE criterion for sequential training, a set of NO-dimensional targets (one for each class) must be defined in the output feature space, yj( k ). There is no principled method known to the authors for selecting these targets in the feat ure space, combined with the expectation that the classification performance will vary considerably depending on the choice of targets used. Similarly, the MCE criterion is based on the assumption that the relative values of the output of the (in this case) projector, are related to the a posteriori probabilities of each class, to wit, it is based on the assumption that the values are the output of a classifier. Other criteria, on the other hand, are well su ited for either sequential or simultaneous training. For example, an information-theoretic method that makes use of mutual information (MI) could be used. In this case, the ex tractor can be trained either sequentially or simultaneously by maximizing the MI between the class labels and the outputs of the

PAGE 135

124extractor or the classifier, respectively. Here we are particularly interested in measuring the performance of IT methods with respect to the preservation of discriminative information; hence, the information-theoretic methods considered in this paper are limited to sequentially-trained systems. Since the theoretical performance advantage of a simultaneously-trained system, as compared to a sequentially-trained system, can not be guaranteed if the two use different criteria, this gives rise to an interesting question: How does the performance of information-th eoretic, sequential feature reduction methods compare to error-based, simultaneous methods? Furthermore, in order to put the performa nce results of the above-mentioned comparison in proper perspective, it may be asked: How does the performance of the best feat ure reduction method considered here, when combined with a simple classifier, compare to the SVM and the closely related Optimal Hyperplane? In order to answer the first question, a co mmon platform needs to be defined. The two classifiers used to compare the feature re duction algorithms are described, and then the details are given as to the constraints placed on the structure of the subspace projector. This is followed by an introduction to the proposed information-theoretic algorithm, MeRMaId-SIG (or MRMI-SIG). A brief descript ion of all the competing feature reduction algorithms is then given. These eight algorithms are then compared in order to address the first question. The results of the SVM and the Optimal Hyperplane (OH) are then given in order to address the second question.

PAGE 136

125Classifiers The two Bayes (maximum likelihood) classifiers that are used for the comparisons are the Bayes-G and the Bayes-NP classifiers, bot h of which generate nonlinear decision surfaces. Notice that these classifiers are used to compare the feature reduction methods, and in a later section, are combined with the proposed feature reduction method in order to compare its performance with the SVM and the OH. The Bayes-G classifier is a parametric clas sifier that uses only second-order statistics of the output features. It has a single output for each class, which gives an estimate of the (a posteriori) probability of the associated class given the data (more specifically, it gives an estimate of the like lihood function multiplied by the prior, which is proportional to the a posteriori) [168]. The likelihoods ar e estimated by assuming that, for every class j , the set of output features, xj, is multi-variate Gaussian distributed. The estimated class for each output is determined as the one that maximizes the weighted likelihood where ( j = 1, 2, . . ., NC), NC is the number of classes, x ( k ) is the (NO x 1) data point at time k that is to be classified, µj is the (NO x 1) mean vector of class j , Cj is the (NO x NO) covariance matrix of class j , and P( j ) is the prior probability of class j , i.e. Nj / NT, where Nj is the number of data points containe d in the training set belonging to class j , and NT is the total size of the training set. The Bayes-NP is a non-parametric classifier that uses Parzen Windows [15] to estimate each of the a posteriori distributions (once again, it actually estimates the likelihood multiplied by the associated prior probability, which, for cla ssification, is tantamount to yjk () Pj () 2 ()NOCj 12‡------------------------------------e1 – 2 ----xk()µj–()TCj 1 –xk()µj–()= (A1)

PAGE 137

126determining the a posteriori). The NO-dimensional likelihood for class j is estimated by placing a Gaussian kernel at each point (one for each data point in the training set belonging to class j ) in the input space (input to the classifi er, which is the output feature space), summing these together, multiplying by the prior and then normalizing by Nj. Once the likelihood functions are determined, a new point is classified by evaluating the weighted likelihood of each class at the location of the data point in question. The class that produces the maximum value is then determined to be the correct class, where the weighted likelihood of each class is given by xj( i ) is the ith data point of the training set of class j ( j = 1, 2, . . ., NC), is a user-defined kernel size, and G( x ,2) is A nice feature of the Bayes-NP classifier is that there are no implicit assumptions on the distribution of the output features. Therefore, this method is able to take into account higher-order statistics of the output features, including multiple-modality, unlike the Bayes-G classifier. It does, however, have a user-defined parameter, , whose value must be determined. Other classifiers could also have been us ed, for example, the SVM, k-NN [169] or one of several different static artificial ne ural networks (ANN’s), viz., the Multi-Layer Perceptron (MLP) [37], Radial Basis Function (RBF) [37], or Polynomial Network [170]. The MLP classifier, the most popular of the ANN’s just listed, was considered briefly yjk () Pj () Nj---------Gxk () xji () –2 2, ()i 1 = Nj=(A2) Gx 2, () 1 2 ()NO ------------------------e1 – 22-------xTx = (A3)

PAGE 138

127because it can be considered to encompass bot h the projector and the classifier in one functional unit. However, results from several papers indicate that the MLP classifier performs quite poorly for two of the three data sets used in the upcoming comparison [171172]. More importantly, the two classifiers chosen have the nice feature that either the requisite training is trivial (i.e., estimation of second-orde r statistics for the Bayes-G classifier) or non-existent (for the Bayes-NP, wh ich is a memory-based classifier). This is to be compared to an MLP classifier, the performance of which is subject to local minima and other convergence issues. As such, its use in comparing multiple feature extractors would reduce the level of certainty that the me asured performance difference is due to the type of feature extractor and not due to imperfect training of the classifier. Projection Architecture To simplify the exposition, the projection ar chitecture is limited to the set of linear transformations. The equation for a comple tely general linear transformation is x = Rs + b , where R is a (NO x NI) matrix of coefficients and b is a (NO x 1) vector of coefficients. Notice that the block diagram in Figure A-1 does not include the b vector since the two classifiers are invariant under a change in the mean. In addition, it is trivial to show that both classifiers are invariant under an invertible , linear transform. Therefore it is assumed, without loss of generality, that the original fe atures have been shifted, rotated and scaled so that the resulting (NI x 1) input features, s ( k ), are zero-mean, (spatially) uncorrelated and have unit variance. The invariance of the cl assifiers to invertible transforms can also be used to reduce the number of free pa rameters. This is done by constraining the R matrix to be a pure rotation. In this case, the R matrix is formed using the first NO rows of the product of (NI x NI) GivenÂ’s rotation matrices (one for each pair of outputs) [173]. This

PAGE 139

128reduces the number of parameters from NONI to NO(NI-NO) without unnecessarily restricting the possible set of decision surfaces that can be produced by a linear projection. Notice that rotations between retained output features have no effect on classification, nor do rotations between rejected outputs. Only rota tions between a feature that is retained and a feature that is rejected has any effect. Due to its generality, unless stated otherwise, the subspace projector of each feature reduction al gorithm will be constrained to be a rotation matrix. Information-Theoretic Extraction Whereas methods that use second-order statistics compare the linear relationship of (commonly) two random variables, e.g. one output fe ature and the class label, information-theoretic methods compare the nonlinear relationships of multiple random variables, i.e. a vector of features and the class label. This is accomplished by maximizing I( X ; C ), the MI between the output features and the class labels. In words, I( X ; C ) may be described as the amount of information th e random (input feature) vector X carries about the class, C (where, realizations of X and C are given by the (NO x 1) vector x ( k ) and the scalar c ( k ), respectively). A number of different definitions of MI exist, but the one having (arguably) the greatest theoretical importance is the one named in honor of Claude Shannon, which is given by [25] where H1( X ) is Shannon’s (differential) entropyI1XC ; () H1X () H1XC () – = (A4) H1X () fXx () fXx () log x d–– =(A5)

PAGE 140

129and fX( x ) is the pdf of X . First, maximizing I( X ; C ) minimizes the amount of information (about the class) lost due to the projection, where an intuitive definition of the amount of information loss is I( S ; C ) I( X ; C ), i.e., the amount of informat ion the input features have about the class minus the information that the projected features have about the class. A projection of s ( k ) cannot add information so that I( X ; C ) is always less than or equal to I( S ; C ) and the difference is zero only if x ( k ) is an invertible (linear or nonlinear) function of s ( k ). Since I( S ; C ) is constant with regards to the projector, the loss is minimized by maximizing I( X ; C ). A second, more principled, argument for using Shannon’s MI comes from the consideration of the upper and lowe r bounds it places on the Bayes error rate. A lower bound for the Bayes rate is given by the Fano inequality [143], and an upper bound is given by Hellman and Raviv [174] (also see Erdogmus and Principe [175]). Both of these bounds are minimized by maximizing I( X ; C ), or equivalently, by minimizing H( C|X ). Despite the arguments for using Shannon’s MI, it is not in common use due to the difficulty of estimating Equation A-5. Part of the difficulty stems from the integral, which may be avoided by discretization of the va riables [176-180]. However, discretization requires the bin sizes for each region of the multi-dimensional output feature space to be appropriately selected. Too small and a zero proba bility occurs due to having a finite data set, too large and details of the shape are lo st. In addition, the determination of the appropriate bin sizes is time consuming and methods that rely on discretization require a large amount of training data for satis factory results [177] due to th e “curse of dimensionality” [181], which pertains to the exponential incr ease in the amount of data needed to adequately represent the histogram as the dimensionality increases linearly.

PAGE 141

130On the other hand, Alfred Renyi introduced a definition of entropy, namely Renyi’s (quadratic) entropy [27], that ca n be used as an alternativ e to Shannon’s definition. This alternative entropy measure is given by When the pdf is estimated using Parzen window s with Gaussian kernels, there is no need for discretization. This is due to the fact that the integral of the produc t of two Gaussians is a Gaussian evaluated at a single point [182] . As a result, there are no truncations or approximations required, outside of the implicit pdf estima tion using Parzen windows. This motivates an approximation for mutual information, MI, using Renyi’s entropy, namely which is similar to what has been referred to as mutual -entropy difference by Hero et al . [183] (Hero uses a direct estimation of Renyi’s entropy by means of minimal spanning trees rather than the “plug-in” density estima tor [184] used here). This approximation of MI is invariant to a change in scale, is mu ch simpler than the one suggested by Renyi [27] and is similar in form to Shannon’s MI, except that each of the entropies is substituted with the non-parame tric estimator for Renyi’s qua dratic entropy given by [35] where G( x ,2) was previously defined in Equation A-3. The combination of Equations A7 and A-8 results in the proposed informati on-theoretic criterion for sequential feature extraction, which is referred to as Minimum/Maximum Renyi’s Mutual Information using H2X () fXx ()2x d–log – =(A6) I2XC ; () H2X () H2XC () – (A7) H2X () 1 N --Gxk () xk 1 – () –2 2, ()k 1 = Nlog – (A8)

PAGE 142

131the Stochastic Information Gradient, or Me RMaId-SIG [29]. The overall cost function is given by where the influence of R is from the relation, x ( k ) = Rs ( k ). The SIG modification is an approximation that is used to reduce the complexity from O(NT 2) of the original algorithm to O(NT) of the present algorithm. It turns out th at, if the training data are presented multiple times and if the time indices are randomized for each presentation, the SIG algorithm converges in the limit to the original algorith m [29]. Several other methods may also be used to reduce the complexity, including importance sampling [ 185], random sampling [186], clustering [186], sliding window [ 187], and Gaussian Mixture Models (GMM) [186]. If gradient ascent is used as the op timization method, the ta p weight update equation becomes where the second term is the gradient of I( X ; C ) with respect to R and is the step size. Notice that the class labels are used only to determine which set of training data is included in the second term of Equation A-9, so that there is no need to convert the (nominal) class labels into ratio values (where all values may be categorized as either nominal, ordinal, interval, or ratio [188]). There are two weaknesses of this approach. First, there is no guarantee that maximizing Equation A-7 using Renyi’s definition, is equivalent to maximizing Equation A-4 using Shannon’s definition (see Appendix B for a class of examples where maximization maxR1 NT-----log – Gxk () xk 1 – () 2 2, – () NjNT-----1 Nj----Gxjk () xjk 1 – () 2 2, – ()k 1 = Njlog j 1 = NC+k 1 = NTarg(A9) Rn 1 + () Rn ()RIXC ; () () + =(A-1 0)

PAGE 143

132of Equation A-4 using either RenyiÂ’s or Sh annonÂ’s entropy is expected to produce similar results). Second, due to the difficulty of estim ating pdfÂ’s in high dimensional spaces, it is preferred to keep the number of dimensions small (accurate pdf estimation is required for generating proofs, but is not required for the algorithm to perform well as shown in Chapters 3 and 4). Notice, however, the dimensionality of the pdf estimation is not determined by the number of input features, but by th e (smaller) number of output features. MeRMaId-SIG was used in Chapters 2 and 3 in an unsupervised fashion, in an attempt to minimize the mutual information betw een a set of outputs for the application of blind source separation (BSS) [18]. There are several differences between the formulation above and that required for BSS. The criterion for subspace projection involves maximization instead of minimization, only NO of the outputs of the rotation matrix are kept, mutual information is measured between the output feature set and the class label (instead of between the outputs) and the conditional (or joint) entropy does not disappear (for convolutive BSS). Additional details on the entropy estimator given in Equation A-8, including proof of convergence when it is used for error entropy minimization, is given by Erdogmus et al . [29]. Feature Reduction Algorithms The previous section described the proposed information-theoretic algorithm, which uses the MeRMaId-SIG criterion. This secti on will briefly describe a second sequential method that makes use of an information-th eoretic criterion, two benchmark methods, and four simultaneous feature reduction methods. All eight of these feature reduction algorithms are listed in Table A-1. Note that wh en no name is available for a given feature

PAGE 144

133reduction method, in order to prevent the inclus ion of more terminology than is necessary, the algorithm will be referred to by the name of the criterion which it uses. The second information-theoretic sequential method, ED-QMI-SIG, is a slight variation of a criterion published by Principe et al . [143] and used by Torkkola and Campbell for subspace projections [189-191]. It is si milar to MeRMaId-SIG in that it uses Parzen windows with Gaussian kernels for the pdf estimation. The difference between the two is the choice of the distance measure between the two relevant pdfÂ’s, which for MI are (1) the joint pdf of X and C , and (2) the product of the two marginal pdfÂ’s. The distance measure for MeRMaId-SIG is loosely based on an approximation of Kullback-Leibler Table A-1. Description of the Feature Reduc tion Algorithms (excluding the classifier).AlgorithmType Extraction Architecture Extraction Criterion Extraction OptimizationMeRMaId-SIG (Minimum/Maximum RenyiÂ’s Mutual Information Stochastic Information Gradient)SequentialLinear Constrained (rotation) Supervised Sequential Information theoretic Nominal classes Iterative Gradient ascentED-QMI-SIG (Quadrature Mutual Information using Euclidean Distance Stochastic Information Gradient)SequentialLinear Constrained (rotation) Supervised Sequential Information theoretic Nominal classes Iterative Gradient ascentPCA (Principal Components Analysis)BenchmarkLinear Constrained (rotation) Unsupervised Sequential Second-order statistics N/A Non-iterative (analytic)LDA (Linear Discriminant Analysis)BenchmarkLinear Constrained (rotation) Supervised Sequential Second-order statistics Nominal classes Non-iterative (analytic)FR (Feature Ranking)SimultaneousLinear Constrained (0,1; a single 1 per row/column) Supervised Simultaneous N/A Nominal classes Iterative Brute force (by rank)FS (Feature Selection)SimultaneousLinear Constrained (0,1; a single 1 per row/column) Supervised Simultaneous N/A Nominal classes Iterative Brute force (exhaustive search)MSE (Mean Square Error)SimultaneousLinear Constrained (rotation) Supervised Simultaneous Second-order statistics Assigns targets Iterative Gradient descentMCE (Minimum Classification Error)SimultaneousLinear Constrained (rotation) Supervised Simultaneous Higher-order statistics Nominal classes Iterative Gradient descent

PAGE 145

134divergence [25], while ED-QMI-SIG uses Euclidean distance. The criterion for this method is given in Equation A-11. The two benchmark methods are the wellknown Principal Components Analysis (PCA) [37], and Linear Discriminant Analysis (LDA, which is an extension of Fisher’s Discriminant Ratio [181]). These methods use only se cond-order statistics, and the transformation matrix R for both methods consists of a set of NO eigenvectors of the appropriately defined matrix. As a result, the solution may be obtained analytically (i.e. non-iteratively). The main difference between these two methods is that PCA is unsupervised and LDA is supervised. In addition, the number of output feat ures for LDA is restricted to be less than or equal to NC, the number of classes. There are a total of four simultaneous fe ature reduction algorithms considered. The first two of these are Feature Ranking (FR) a nd Feature Selection (FS). The difference is that FR evaluates each feature independently of the others, while FS considers all possible combinations of the NI inputs, taken NO at a time. Consequently, FS is expected to produce superior classification performance; howev er, the computational complexity suffers from combinatorial explosion. These methods produce at each output of the projection, a single, unweighted input feature. This diff ers from feature extraction, where each output feature is a weighted sum of all input features . This difference is completely characterized by the architecture, which for FR and FS is c onstrained to (1) have only elements that are 0 or 1, and (2) have only one non-zero element in each row and in each column. Notice max arg R 1 NT-----Gxjk () xjk 1 – () 2 2, – () 1 NT-----NjNT-----2 j 1 = NC Gxk () xk 1 – () 2 2, – () +k 1 = NT+k 1 = Njj 1 = NC1 NT-----NjNT------j 1 = NCGxjk () xk 1 – () 2 2, – ()k 1 = NT(A-1 1)

PAGE 146

135that, when this architecture is combined with either classifier, it is not possible to span the full space of linear projections. As a result, the performance may be needlessly reduced. In fact, it is trivial to construct a data set th at will cause very poor performance for either method. The criterion for both is to minim ize the (empirical) classification error. The third simultaneous feature reduction method considered is the Mean Square Error (MSE) method, which trains the extracto r “through” the classifier by using the MSE criterion at the output of the classifier. Notic e that the outputs of both the Bayes-G and the Bayes-NP classifiers are ratios, unlike the cl asses which are necessarily nominal values, e.g. “person has diabetes” and “person does not have diabetes”. Therefore, to compare the classifier outputs with the class labels, either the ratios (classifier outputs) need to be converted to nominal values (class labels) or vi ce versa. The former case requires training through the MAX operator of Figure A-1 and involves a non-arbitrary process. This was the method used for FR and FS and is emulated by the MCE method, to be described next. The latter case, which applies to MSE, requires an arbitrary assignment. More specifically, if the training does not take place th rough the MAX operator, then each class label must be converted to a rather arbitrarily chosen (NC x 1) vector of ratios. This process is indicated in Figure A-1 by the DEMUX (de-multiplexer) operator. The resulting criterion for this method may be expressed as a function of these targets as where j( k ) is the user-defined target for the jth output of the classifier at time k and yj( k ) is given by either Equation A-1 or A-2. The cl assification performance for this method depends on how well the subspace projector/classifier combination can approach the minR1 NCNT------------yjk ()jk () – ()2 k 1 = NTj 1 = NCarg(A-1 2)

PAGE 147

136targets, which is itself dependent on how well the targets are chosen. Unfortunately, it is not clear how the targets should be chosen to optimize performance of the overall system (although it is much less difficult than for defi ning targets in the feature space). The 1 of NO scheme is commonly used, which is defined as setting the target associated with the correct class to 1, while all other targets are 0. If the MSE criterion were used to train the classifier, then this particular choice of ta rgets produces the optimal (in the mean square sense) approximation of the a posteriori proba bilities [168], [192], [193]. Note that the classifiers used here are not trained using MSE. Instead, their outputs are determined using either Equation A-1 or A-2. In this par ticular case, the concern is how to train the extractor. Since it also appears to be a reas onable choice of targets for training the extractor, the 1 of NO scheme is used. Minimum Classification Error (MCE) is the final feature reduction method considered. Notice that the term “minimum classifica tion error”, as before, is used for both the name of the method and for the name of the criterion which it employs. In addition, the expression is used to characterize several criteri a, such as FR and FS, since they attempt to minimize the empirical classification error. The MCE criterion has also been referred to as Minimum Classification Error using Generalized Probabilistic Descent (MCE/GPD) [165], [167], Minimum Error Rate (MER) [ 194], and Discriminative Feature Extraction (DFE) [166], [172], [195]. It is also related to the Bayesian Back Propagation algorithm of Nedeljkovic [196]. The motivation behind this criterion is to appr oximate the decision process (the maximization in Figure A-1) usi ng a function that is continuously differentiable. This allows a straightforward applicati on of the stochastic gradient-based optimization method. This criterion can be described in the following manner. It modifies the

PAGE 148

137parameters in an attempt to make the classi fier output associated with the correct class have as much margin as possible with resp ect to the single classifier output having the largest value of all the outputs associated with the incorrect class. Mathematically, the Minimum Classification Error criterion, wh ich includes the user-defined parameters and , is given by where has a value of 1 when the correct class for xj( k ) is j and is 0 otherwise. There are two userdefinable parameters, and , for this criterion ( was previously defined in Chapter 2, the new definition holds for this appendix) . As both approach infinity, the criterion becomes precisely the classification error. There are many other options for feature reduction that are not included in the following comparison. For example, there is an extension to LDA that removes the restriction on the number of possible output features [197]. Projection pursuit could be used to find interesting projections [198], as coul d unsupervised clustering methods [181]. The maximization of the mutual information could be performed in a simultaneous manner, i.e. through the classifier. Either an SVM or MI coul d also be used to rank or select features [176-179], [199], [200]. For featur e selection, greedy selection can be used to reduce the computational burden of an exhaustive sear ch, producing a computational complexity minR1 NT-----1 xjk () ck () () 1 eyjk()– 1 NC1 – --------------yik() i 1 ij = N C 1 --+ –+ 1 – j 1 = NCk 1 = NTarg(A-1 3) 1 xjk () ck () ()(A-1 4)

PAGE 149

138between that of FR and FS . Greedy algorithms, which are suboptimal, come in three forms, viz., forward selection, backward elimination, and stepwi se regression [176]. Another option, which is not always applic able, is a method known as “branch and bound”. This approach is able to reduce the size of the exhaustive search space for FS without loss of optimality [169]. Other tech niques include the construction of an MI matrix on which eigenanalysi s is applied [178], the use frequency domain transforms [181], or the use of an unsupervised maximum entropy method [201]. Comparisons There are three data sets used for the comparisons, the important characteristics of which are listed in Table A-2. These are the Pima, Landsat, and Letter Recognition data sets. All three of these may be found at the UCI Machine Learning Repository (http:// Th e data were first pre-processed. This included centering the data, sphering the data , and, for two of the data sets (Pima and Landsat), reducing the original dimension (to NI). The dimension was reduced using PCA by removing the eigendirections that had associ ated eigenvalues smaller than 0.5% that of the maximum eigenvalue. This same pre-proc essing was used for all the results included in this paper. One of the differences between the data sets is that the Pima data set was determined to have outliers, which for purposes of this paper is defined as missing data Table A-2. Description of the Data Sets Used in the Comparison (after pre-processing).Data Set Input Dimension (NI) Classes Training Size (NT) Test SizeOutlier %Pima Indians Diabetes4250026849% Landsat Satellite Image (Statlog)86443520000% Letter Recognition162616,00040000%

PAGE 150

139points, e.g., when a feature has a value of 0 even though a value of 0 is not meaningful or physically possible. These outliers are points in feature space that are statistically distant from the mean calculated using the remaining data (with the points in question removed). The Pima data set will be used with all invalid points included, which has the benefit of helping to identify which feature reduc tion methods are sensitive to outliers. The Bayes-G classifier was determined to provide the best classification performance for the Pima and Landsat data sets, wh ile the Bayes-NP classifier was found to produce the best results for the Letter Recognition data set. Therefore, the results shown are restricted to these combinations of data sets and classifiers. It shoul d be noted that, for the other combinations of data sets and classifiers, there is virtually no change in the order of the performance of the eight algorithms. When th e Bayes-NP classifier is used, the kernel size, , is always set to 0.25, whic h was experimentally deter mined (using resubstitution) to be at or near the optimal value. The training was performed using the first NT samples of the overall data set, and then tested on th e remaining data. For the gradient-based methods, the number of Monte Carlo runs is set to 10. The initial conditions for the projector for each Monte Carlo repetiti on were initialized randomly (zer o for the first run and, thereafter, normally distributed with a variance of 1). Randomizing the initial conditions was done instead of randomly selecting the training set in hopes that it will facilitate other researcherÂ’s attempts to duplicate the results and/or compare these results with their own work. This has the added benefit of simplif ying the task of discerning which algorithms are susceptible to local minima. The step size for each method was chosen small enough so that, after convergence, the variance of th e parameter values ha ve no, or very little, effect on classification performance (convergen ce of every adaptation was verified

PAGE 151

140manually). As a result, the variation of clas sification performance for each feature reduction method can be expected to be caused by local maxima/minima. The following user-defined parameters were determined using resubstitution. The kernel size, , for MeRMaId-SIG is set to 0.25, 0.35, and 0.5 when NO is 1, 2-4, 5-8, respectively. Notice that the optimal value of increases as the number of output dimensions increases, partially offsetting the spars ity of data in high dimensions. Also notice that a fixed value of is used during training, which is in contrast to approaches by other researchers [143], [190], [202]. For ED-QMI-S IG, the resubstitution results suggest using a kernel size of 0.5, independent of NO. For MCE, the parameter was somewhat arbitrarily set to 10 and the optimal smoothing parameter, , was determined to be 2. It should be mentioned that seve ral combinations of and NO were plagued by local minima. In particular, out of the 20 Monte Carlo runs associated with NO = 1, = 1 and NO = 1, = 4, the final classification performance (using resubstitution) was between 25-28% for 8 runs and between 64-65% for the remaining 12. Incidentally, local minima of the MCE algorithm can (if performed properly) be a voided by scheduling the value of the smoothing parameter, . This can also be accomplished with the proposed information-theoretic criteria by annealing the kernel size, [34]. Since it is not known precisely how the scheduling should be done, and in order to simplify th e experimental procedur e, this is not done for either MCE or MeRMaId-SIG. For the OH, the free parameter (which is referred to as the parameter by Mangasarian and Musicant [159]) was set to 0.1. For th e SVM, pairwise training (as opposed to one vs. all) is used to combine the binary decisions for multi-class data. Unlike the other methods, the user-defined parameters for the SVM are not determined using resubstitution,

PAGE 152

141since the resulting generalization performance wa s very poor. This is particularly true for the Pima data set, where the resubstitution results are very negatively correlated with the generalization results. This is possibly due to the known weakness of the SVM for the case that there is missing data [161-162]. The usual solution is to use cross-validation [181] in place of resubstitution. However, since the SVM is used as a benchmark, and to ensure that it cannot be claimed that the training of the SVM is insufficient, the parameters are chosen using the best generalization performance . This resulted in the following parameters: penalty = 0.8, = 8, for the Pima data set, and penalty = 0.5, = 2 for Landsat (no results are given for the Letter Recognition data set due to its size). The selection of the parameters by means of the generalization error gives the SVM a definite performance advantage over the other methods. Not only is the criterion for each method opt imized, but it is also “de-optimized” by maximizing instead of minimizi ng, or vice versa. This is done to see how well the criterion can manipulate the error rate in both directions and to give an indication of the worst performance possible, for sake of perspective. Random proj ections were also used, for which the coefficients were chosen uniformly in [-1,1]. These are always indicated by a dashed line in the following plots. Ideally, the results of the optimized criterion should be well above the results for random projections, and the results of th e de-optimized criterion (signified in the plots by placing an asterisk after the name of the algorithm) well below the results for random projecti ons. For PCA and LDA, the de-optimized results are found by selecting the eigenvectors associated with the minimum eigenvalues. For MCE, this was accomplished by maximizing the classification error rate and by changing from 10 to -10. The latter was done so that the margin for the output pertaining to the correct class

PAGE 153

142was minimized with respect to the smallest output (of all incorrect classes). Figures A-2 to A-4 show the plots of correct classification pe rcentage, one for each data set, for both the optimized and de-optimized versions of each criterion. In each case, the left subplot (A) shows the results for the PCA, LDA, ED-QMI-SIG and MeRMaId-SIG algorithms, as well as for the random projection. The right subplot (B) shows the results for the FR, FS, MSE and MCE algorithms. Also shown in subplot (B), for sake of convenience, are the results for MeRMaId-SIG and the random projection. In all cases, when NO = NI, the results for all algorithms are the same. This is because both classifi ers are invariant under full-rank linear transforms. The first plot, given in Figure A-2, pertains to the Pima data set. There are several important items to notice in this figure. For example, it is noteworthy that the performance improves by decreasing the dimensionality. Since the projections to a given dimension F igure A-2. Classification performance versus output dimension, NO, for Pima data set. A) for PCA, LDA, MeRMaId-SIG, ED-QMI-SIG, and Random, B) for MSE, MCE, FR, FS, MeRMaId-SIG, and Random. 1 1.5 2 2.5 3 3.5 4 60 65 70 75 80 85 Output Dimension (No)Correct Classificaton PercentageMeRMaId-SIG MeRMaId-SIG* ED-QMI-SIG ED-QMI-SIG* Random PCA LDA LDA* PCA* 1 1.5 2 2.5 3 3.5 4 60 65 70 75 80 85 Output Dimension (No)Correct Classificaton PercentageMeRMaId-SIG* MeRMaId-SIG, FR, FS Random MSE* MSE FR* FS* MCE MCE* A B

PAGE 154

143include as a subset all projections to lower dimensions, the cause of this must be due to inaccurate estimation of the parameters, either of the projector or the classifier. This anomaly only occurred for the Pima data set, therefore it seems likely that the underlying cause is the existence of the outliers. Anothe r possible explanation concerns the generalization of the results for the Pima data set sinc e it is the smallest of the three; however, the ratio of free parameters to amount of training data lies between that of the other two data sets (where the number of free parameters is based on the Bayes-G classifier and NO = NI/2). Another unexpected result was that ED-QMI-SIG* (the “de-optimized” result) outperforms ED-QMI-SIG for NO = 2 and NO = 3. It could be that ED-QMI-SIG has local maxima whose performance is worse than that associated with one or more local minima, or that the “convergence” of several of the Monte Carlo runs was to a saddle point. In either case, the fact that this only occurred for the Pima data set, suggests the possibility Figure A-3. Classification performance versus output dimension, NO, for Landsat data set. A) for PCA, LDA, MeRMaId-SIG, ED-QMI-SIG, and Random, B) for MSE, MCE, FR, FS, MeRMaId-SIG, and Random. 1 2 3 4 5 6 7 8 0 10 20 30 40 50 60 70 80 90 100 Output Dimension (No)Correct Classificaton PercentageMeRMaId-SIG MeRMaId-SIG* ED-QMI-SIG* ED-QMI-SIG PCA LDA Random LDA* PCA* 1 2 3 4 5 6 7 8 0 10 20 30 40 50 60 70 80 90 100 Random MeRMaId-SIG* MeRMaId-SIG MSE MSE* FR* FS* FS FR Output Dimension (No)Correct Classificaton Percentage MCE MCE* AB

PAGE 155

144that the ED-QMI-SIG algorithm is sensitive to outliers (an additional test, not included here, was performed that seems to verify this claim). Also notice that the result for both LDA and LDA* is only a single point, due to the limitation of the criterion as mentioned previously. PCA* outperforms PCA for NO = 1, which is not too peculiar considering that PCA is an unsupervised method. The Landsat data set is shown in Figure A-3. Since the number of classes for this data set is 6, LDA can be calculated for NO < 5. The results from these two plots indicate that the dimension can be reduced from 8 to 2 without loss of performance. Results for th e Letter Recognition data set are shown in Figure A-4, for which a slight improvement is provided by reducing the dimensionality by one-half. Due to the size of this data set, the projections were trained only for NO = 1, 2, 4, and 8. Notice 2 4 6 8 10 12 14 16 0 10 20 30 40 50 60 70 80 90 100 Output Dimension (No)Correct Classificaton PercentageRandom MeRMaId-SIG MeRMaId-SIG* ED-QMI-SIG* ED-QMI-SIG PCA LDA PCA* LDA* 2 4 6 8 10 12 14 16 0 10 20 30 40 50 60 70 80 90 100 Output Dimension (No)Correct Classification PercentageFS FS* FR* FR Random MeRMaId-SIG MeRMaId-SIG* MSE* MSE MCE MCE* Figure A-4. Classification performance versus output dimension, NO, for Letter Recognition data set. A) for PCA, LDA, MeRMaId-SIG, ED-QMI-SIG, and Random, B) for MSE, MCE, FR , FS, MeRMaId-SIG, and Random. A B

PAGE 156

145that no results are shown for the FS algorithm for NO > 4. This is due to the combinatorial explosion associated with this algorithm. For all three data sets, as NO approaches NI, it becomes less and less important how the projection is chosen. This is shown in the figures in two different ways. First, the result for the random projection approaches the best result as NO approaches NI. Second, there is a tendency for the difference between the hi ghest and lowest performance (for optimizing and de-optimizing, respectively ) to approach zero as NO approaches NI. Consequently, the projection to NO = 1 dimension is likely the most im portant data point in determining which algorithm performs the best. For generali zation purposes, it is also important that an algorithm performs well for all output dimensions. Keep in mind, though, that when results are given that average over all possibl e output dimensions, the performance difference between the different algorithms is n ecessarily deflated for the reason given above. Consequently, the overall results are shown in two different ways. Figure A-5(A) shows the overall results for NO = 1, averaged over all three data sets, where the mean value is shown in parentheses (results for de-optimiz ation are not shown). Figure A-5(B), on the other hand, shows the results for each algorithm averaged over all data sets and all output dimensions. In the former bar graph, FR and FS can be seen to have identical results. This is because the algorithms become identical for NO = 1. In the latter bar graph, no results are shown for LDA or FS since they have missing data points (LDA because of the limitation on the number of outputs and FS because of the training time involved), indicating that the two algorithms are not generally applicable. For each of the gradient-based algorithms, Figure A-5 shows the minimum value, the mean value, and the maximum value of the 10 Monte Carlo runs. The performance for

PAGE 157

146each of the gradient-based algorithms is not Gaussian distributed, otherwise + 3 sigma values would be given. The reason that they are not Gaussian distributed is due to the existence of local minima/maxima. The larges t difference between the minimum and maximum results is for the random projection, as expected. The algorithms having the next largest difference is the MSE and MCE algorithms, while the two MI methods are fairly consistent. These results indicate that the performance of the MS E and MCE algorithms F igure A-5. General classification performanc e for all 3 data sets. The number above each bar represents the mean classification. A) shows the result for NO = 1, B) shows the results averaged over all values of NO. 30 35 40 45 50 55 60MeRMaIdSIG ED-QMI-SIG PCA LDA FR FS MSE MCE RandomCorrect Classification Percentage Maximum Mean Minimum 55.8 41.6 51.2 47.4 52.252.2 48.6 47.8 40.3 40 45 50 55 60 65 70 75 80MeRMaIdSIG ED-QMI-SIG PCA LDA FR FS MSE MCE RandomCorrect Classification Percentage Maximum Mean Minimum 69.4 65.2 72.9 70.968.4 68.0 56.7A B

PAGE 158

147are more sensitive to local optima than the MI methods. For the case that NO = 1, the MeRMaId-SIG algorithm outpe rformed all other feature reduction methods by 4.6% to 14.2% (a relative increase of 9% to 34%). For the case that the results are averaged over all data sets and all output dimensions, MeRMaId-SIG outperf ormed all others by 2.0% to 7.7% (a relative increase of 3% to 12%). It is interesting that the proposed sequen tial method has a mean classification error less than that produced by any of the simult aneous feature reduction methods, three of which use a criterion that specifically minimizes the classification error. This is, however, possible since the minimum classification erro r methods minimize the error on the finite training set, not on the disjoint finite test set. In order to prove that the error-based methods minimize the probability of error on the disjoi nt test set, “infinite training” is required, as acknowledged by Watanabe et al . [166] and Katagiri et al . [167]. This is essentially equivalent to knowing the underlying distribu tions, which would allow any of a number of methods to produce the optimum (Bayes) soluti on. While it seems perfectly reasonable to use this as a criterion when the data are fini te, it is no longer guaranteed to be optimal. In fact, as suggested by the results presented here, it may very well be suboptimal because these methods focus only on training data that are near the decision boundary. This means that much of the data set (which is oftentimes limited in size) is ignored, so that the results can be expected to be sensitive to outliers. In order to see how the proposed feature reduction method compares with the SVM and the OH, these results are now given. Mixed in with these results are the best classification performances recorded in eac h of several differ ent articles in th e published literature (results shown without an associated referenc e are those obtained by the authors). For

PAGE 159

148feature reduction methods, the value of NO is also included. Keep in mind that the results for the MeRMaId-SIG algorithm are found using ve ry simple classifiers, especially for the Pima and Landsat data, while the results from the literature are often the result of using sophisticated classifiers and/or training tec hniques, such as Boosting (which uses a committee of 3 learning machines) [203]. Pima : MeRMaId-SIG result is 79.1% using a Bayes-G classifier with NO = 1 ~76% using LVQ with NO = 1 [204] 76.1% using regularized AdaBoost [205] 76.6% using Adaptive Margin SVM [205] 78.7% using LVQ with NO = 3 [189] 78.7% using OH 79.5% using SVM ~81% using LVQ with NO = 6 [204] Landsat : MeRMaId-SIG result is 85.3% using a Bayes-G classifier with NO = 6 ~78% using LVQ with NO = 1 [204] 80.4% using an MLP with NO = 3 [178] 83.8% using OH 88.8% using AdaBoost [162] 89.5% using LVQ with NO = 15 [186] 90.5% using SVM 93.3% using Bayesian (200 hidden units) [206]

PAGE 160

149Letter Recognition : MeRMaId-SIG result is 92.7% using a Bayes-NP classifier with NO = 8 79.3% using an MLP (7 nodes in hidden layer) [171] ~80% using Holland-style (1190 rules) [207] 80.3% using an MLP with NO = 15 [178] 80.5% using a Bayesian network [208] 83.2% using OH 88.6% using LVQ with NO = 8 [186] 89.9% using k-NN [171] 92.9% using AdaBoost [162] Besides classification performance, there are implementation issues that are relevant in algorithm selection. Two such issues include algorithmic complexity and length of time required for training. Complexity, in terms of computational requirements to update the parameters of the projection, is addressed first. In this regard, the PCA, LDA, FR, and FS methods have trivial complexity. The first two due to the existence of an analytical solution, the latter two because all the possible solutions are enumerated a priori (however, the number of solutions for FS may be very larg e). A good indication of the complexity of the gradient-based systems, on the other hand, is obtained by examining the criteria given previously in Equations A-9, A-11, A-12, and A-13. Be mindful, the criteria for MSE and MCE are deceptively simple looking compared to those of the two MI methods. This is not so once one of the classifiers of Equation A-1 or A-2 is inserted into Equations A-12 and A-13. Also keep in mind that MCE and MSE ar e not as generally applicable, in that it is very difficult to use them with certain cl assifiers, such as k-NN. The time required for

PAGE 161

150training is another important implementation issu e, which is difficult to address since the experiment was not designed to minimize ad aptation time for a given performance level. Nevertheless, the results of adaptation speed ar e given for the interested reader. These give a first-order estimate of the relative speed of the different methods. In all cases, the training (for a single Monte Carlo run) took on the order of minutes or seconds. For the Letter Recognition data set, however, several of the algorithms took significantly longer. For example, when transforming to NO = 8 output features, MeRMaId-SIG and ED-QMI-SIG took on the order of 2 hours, while the MS E and MCE methods took on the order of 24 hours. In addition, it was estimated that FS would take 3,075 hours, or 128 days, to complete. Conclusions The MeRMaId-SIG method has very intere sting implementation characteristics. It has small relative complexity/high adaptati on speed (compared to MCE, MSE and sometimes FS); it allows the use of different dime nsionality in the class labels and network outputs (unlike MSE); it is not as sensitive to loca l optima relative to the other gradient-based methods (unlike MCE and MSE); it is independent of the classifier (unlike FR, FS, MSE, MCE); and, it appears to be robust to outlie rs (compared to ED-Q MI-SIG, MCE, MSE, SVM, OH). In addition, it can be used to train nonlinear projectors [204], or used to train a classification system in a simultaneous manne r. Also, the proposed method is non-parametric, does not require discretization of the da ta, and is computed directly from the samples. On the other hand, there are several shortc omings. It was mentioned earlier that there is no guarantee that it will produce the same solution as ShannonÂ’s MI. However, results seem to indicate that this is not an issue (notic e that it is not claimed that the results

PAGE 162

151indicate that the proposed method produces the same solution as Shannon’s, only that it produces a useful solution). It takes longer to tr ain (for large data sets) than PCA, LDA, or FR. Although the sensitivity to local minima is small, it is larger than for the SVM and all the methods that are not gradient-based (as expected). Also, as previously mentioned, there is a potential loss of performance as NO increases due to the required high-dimensional density estimation (but only when NO << NI, otherwise any method is sufficient). The goal of this article was to answer two questions. The answer to the first question, “ How does the performance of information-th eoretic, sequential feature reduction methods compare to error-based, simultaneous methods?” is “ favorably ”. This conclusion is based on the classification perfor mance and indicates that, for real (finite) data sets, samples not near the boundary carry information vital to the proper placement of the decision boundaries. Consequently, since simultane ous methods have a theoretical performance advantage, it would be interesting to use an MI criterion in a simultaneouslytrained system (although this may prove to be prohibitively complex for most classifiers). The answer to the second question, “ How does the performance of the best feature reduction method considered here, when combined wi th a simple classifier, compare to the SVM and the closely related Optimal Hyperplane? ” is “ reasonably well ”, especially considering that the parameters of the SVM were chosen using the best generalization performance. In fact, the results here indicate th at it would be interesting to combine an SVM with a feature reducer since th e former appears to be robust to outliers and may help alleviate the lack of robustness of an SVM (see Weston et al . [161]). Much more theoretical work is required to validate the mutual information approach for feature extraction in classification. The funda mental difficulty is related to the implicit

PAGE 163

152link between mutual information and classifi cation error, where the only known results are expressed in the form of upper and lower bounds on the Bayes classification error. However, the tightness of the bounds remains unknown. Probably a more productive approach with IT learning is to reverse the question of tuning the classifier topology to the feature extractor, and seek classifiers that will meet the minimization of the Bayes error with MIderived features. Another important line of research, where some encouraging results using the Information Bottleneck method have been published by Slonim and Tishby [209], is the optimization of the dimensionality of the feature space.

PAGE 164

153APPENDIX B APPROXIMATING SHANNON’S MUTUAL INFORMATION The expressions for mutual information given in Equations A-4 and A-7 are expressed as a joint entropy minus a conditional entropy , where the conditioning is performed over a discrete random variable. In general, both the joint and c onditional entropies may be a function of multi-dimensional pdf’s. In this appendix, a two-dimensional scenario is considered in order to assist with visualization of the performance surface. Consider two random variables, Y and C, wh ere C is the discrete class label and the two marginal densities of the two-dimensi onal continuous random variable Y are given by a generalized Gaussian pdf. More specifica lly, the marginal distributions of Y are expressed in terms of 1 and 2, the generalized Gaussian parameters, by the following where the constants C1, C2, C3 and C4 have the appropriate values so that each pdf integrates to one and each has unit variance. Recall that = 1 corresponds to a Laplacian distribution, = 2 corresponds to a Gaussian, and = 10 is approximately uniform ( = infinity is uniform). Further assume that th e relationship between C and Y is given by, C = sign(Y1 + Y2). It is well own that the mutual information between Y and C can be written as the following sum of Shannon’s entropies, I1(Y;C) = H1(Y) H1(Y|C), where H1(Y) is fY1y1() C1eC2y1 1–= (B1) fY2y2() C3eC4y2 2–= (B2)

PAGE 165

154used to denote ShannonÂ’s entropy. In this dissertation, ShannonÂ’s entropy is replaced with RenyiÂ’s quadratic entropy, so that th e approximation of mutual information, I2(Y;C), becomes H2(Y) H2(Y|C), where the subscript of 2 is used to denote RenyiÂ’s quadratic entropy. Figure B-1 shows the (analytical) joint en tropy of Y for both ShannonÂ’s and RenyiÂ’s definition. Likewise, Figure B-2 shows the conditional entropy for both definitions of entropy. Notice that there is a substantial difference between ShannonÂ’s and RenyiÂ’s definition when both 1 > 2 and 2 > 2. This is consistent with the findings of Chapter 2, where it was determined that the use of RenyiÂ’s entropy in place of ShannonÂ’s entropy is not guaranteed to produce the same solution, es pecially for sub-Gaussian data. Figure B-3, however, shows that the shape of the mutual information when using ShannonÂ’s and RenyiÂ’s entropy, which is the summation of the surfaces of Figures B-1 and B-2, have nearly the same shape. The differences that existed in the joint and conditional entropies are effectively cancelled. What is important about this observation is that Equation A-7, which uses RenyiÂ’s entropy, may be used in place of Equation A-4 independent of the Gaussianity of Y. Note that this analysis does not cover the case that the elements of Y are dependent, or for the case that the elements of Y are other than generalized Gaussian distributed. It is expected that the criteri on will still produce good resu lts for these cases.

PAGE 166

155 0 2 4 6 8 10 0 2 4 6 8 10 1 1.1 1.2 1.3 1.4 2 1 H2(Y) 0 2 4 6 8 10 0 2 4 6 8 10 1.34 1.38 1.42 2 1 H1(Y) F igure B-1. Joint entropy of Y. A) Usi ng ShannonÂ’s entropy and B) using RenyiÂ’s entropy. A B

PAGE 167

156 0 2 4 6 8 10 0 2 4 6 8 10 1.34 1.38 1.42 2 1 H1(Y|C) 0 2 4 6 8 10 0 2 4 6 8 10 1 1.1 1.2 1.3 1.4 2 1 H2(Y|C) F igure B-2. Conditional entropy of Y. A) Using ShannonÂ’s entropy and B) using RenyiÂ’s entropy. A B

PAGE 168

157 0 2 4 6 8 10 0 2 4 6 8 10 0 0.02 0.04 0.06 0.08 2 1 I1(Y;C) 0 2 4 6 8 10 0 2 4 6 8 10 0 0.02 0.04 0.06 0.08 2 1 I2(Y;C) F igure B-3. Summation of the joint and cond itional entropies of Y. A) Using ShannonÂ’s entropy and B) using RenyiÂ’s entropy. A B

PAGE 169

158APPENDIX C EXTRACTING SPECTRAL DIVERSITY USING MRMI-SIG In Chapter 2 it was claimed that MRMI-SIG can separate instantaneous mixtures of mutually independent colored Gaussian source s. The proof and the conditions required for the proof are presented here. Theorem: MRMI-SIG can (asymptotically) se parate an instantaneous mixture of colored Gaussian sources from a set of strictly stationary realizations of the processes. Proof: The sources are assumed to be colored Gaussian random processes that are generated by means of passing a zero-mean, i.i.d. Gaussian process through a linear system. Without loss of generality, it is also assumed that the gain of the linear system is such that the outputs (the sources) each have unit-va riance. The observations are sphered using PCA and then are rotated using a combination of Given’s rotations. Th e proof will concern the optimization of the rotation matrix. The MRMI-SIG criterion is given by where is a small constant greater than 0. Since the kernel size depends only on the sign of the kurtosis, which is constant, the ke rnel size is also constant. The variable is needed to simplify the proof but is not used elsewhere in the dissertation. In practice, interestingly, it has been determined that the results ar e sometimes noticeably improved when the sign of the kurtosis is negative for one or more, but not all, of the outputs. The causes of this are not understood at the present.I ˆ Y () signKYm() + () 1 N --Gymn () ymnp – () –2 m 2, ()n 1 = Nlogm 1 = M– =(C1)

PAGE 170

159Letting N approach infinity, the pdf estimating using Parzen windows with a Gaussian kernel produces an estimate th at equals the true distribution convolved with the kernel. Since (1) the true distribution and the kernel are both Gaussian and (2) the convolution of two Gaussian distributions returns another Gaussian, the estimated pdf is Gaussian. The kurtosis of a Gaussian distribution is 0, so that the sign of the kurtosis plus is always 1. Using this result and invoking the ergodic theorem for strictly stationary random processes, Equation (C-1) may be written as a summation of expectations as follows This may in turn be written as an integral where fYm(n),Ym(n-p)(ym(n), ym(n-p)) is the joint pdf of the random variables Ym(n) and Ym(n-p). Before proceeding it is worth menti oning that the equivalent expression of Equation (C-3) for the O(N2) entropy estimator is a function of the square of the marginal pdf of the output and not a function of the two-dimensional joint pdf of a single output at two different times. Hence, it is not capable of extracting spectral info rmation. Since any combination of Gaussian random variables has a Gaussian distribution, it is known that Ym(n) and Ym(n-p) are Gaussian-distributed. In additi on, due to a property of Gaussian random variables, it is known that the two are jointly Gaussian, the distribution of which is completely characterized by a mean vector and an auto-correlation matrix. The mean vector is zero since all the sources are assumed to be zero-mean and all transformations are linear. I ˆ Y () EGymn () ymnp – () –2 2, () [] logm 1 = M– =(C2) I ˆ Y () Gymn () ymnp – () –2 2, () fYmn()Ymnp –() ,ymn () ymnp – () , () ymn () dymnp – () d––logm 1 = M– =(C3)

PAGE 171

160The auto-correlation matrix, on the other hand , is a function of the both the mixing and demixing matrices. It is known that the diagona l elements of the covariance matrix are 1 since the sphered observations are operated on by a rotation matrix. The off-diagonal elements of the (2 x 2) symmetric auto -correlation matrix are both, by definition, E[ym(n)ym(n-p)]. Signifying the (M x M) mixing, sphering, and demixing (rotation) matrices by H, W, and R , respectively, the (M x 1) vector of outputs at time n is given by y(n) = R WHs(n). The overall mixing and demixing matrix, due to the assumption that the sources are unit-variance, may be represente d by a single rotation matrix, R. Using this notation, the off-diagonal elements of the auto-correlation matrix are represented by E[Rms(n)Rms(n-p)], where s(n) is the (M x 1) column vector of sources at time n, Rm is the mth row of matrix R, and (as will be needed shortly) Rm,k is the element of matrix R at the mth row and kth column. This in turn may be expressed as a function of the auto-correlation of the sources (for lag p) due to the assumption of mutual independenc e between the sources. For sake of convenience, the scalar value in Equation (C-4) is represented by m. The joint pdf can now be written as where T is the transpose operator and m is the autocorrelation matrix given by mERmsn () Rmsnp – () [] Rmk,2 Eskn () sknp – () []k 1 = M==(C4) fYmn()Ymnp –() ,ymn () ymnp – () , () 1 2 12‡---------------------e1 2 -ymn() ymnp –() []12 ‡ –ymn() ymnp –() []T–= (C5) 1 mm1 (C-6 )

PAGE 172

161Before proceeding, notice that the integrand consists of a product of two Gaussians. The first is a one-dimensional Gaussian evaluated at the difference of the two random variables and the second is a two-dimensional Gau ssian evaluated at the vector whose components are the random variables themselves. C onsequently, Equation (C-3) may be written as a product of two exponentials which may then be simplified to a singl e exponential by summing the exponents causing the approximation of I(Y) to become where m is expressed in terms of m. This in turn may be expr essed as a function of a single two-dimensional Gaussian distribution where m is given by the following (2 x 2) matrixI ˆ Y () 1 2 2 ---------------------e1 – 42-------ymn()ymnp –()–()21 2 m 12‡------------------------e1 – 2 ----ymn() ymnp –() []m 1 –ymn()ymnp –()ymn () dymnp – () d––logm 1 = M– = (C7) 1 2 ()32‡2 1 m 2– --------------------------------------------------e1 – 2 ----1 22--------ymn()ymnp –()–()2ymn()22mymn()ymnp –()– ymnp –()2+ 1m 2– -----------------------------------------------------------------------------------------------------+ymn () dymnp – () d––logm 1 = M– (C8) I ˆ Y () 1 2 ()32‡2 1 m 2– --------------------------------------------------m 12‡m 12‡---------------------e1 – 2 ----ymn() ymnp –() []m 1 –ymn()ymnp –()ymn () dymnp – () d––logm 1 = M– = (C9) m1 1 2 2-------1 1 m 2– --------------+ 21 2 2-------m1 m 2– --------------+ 2– ------------------------------------------------------------------------------------1 2 2--------1 1 m 2– --------------+ 1 2 2--------m1 m 2– --------------+ 1 2 2--------m1 m 2– --------------+ 1 2 2--------1 1 m 2– --------------+ = (C-1 0)

PAGE 173

162Since the infinite-limit integr ation of a probability distribu tion function produces a value of 1, Equation (C-9) simplifies to Plugging in the square root of the determinant of m produces By completing the squares, Equation (C-12) may be written as which may be further simplified to Plugging in the full expression for m and removing the square root by placing a 1/2 outside of the log produces The location of the minimum of Equation (C15) is unaffected by the two terms, 1/2 and 4. As a result, minimizing Equation (C-15) is equivalent to minimizing the followingI ˆ Y () m 12‡2 2 1 m 2– ------------------------------------------logm 1 = M– = (C-1 1) I ˆ Y () 1 2 2 1 m 2– 1 2 2-------1 1 m 2– --------------+ 21 2 2--------m1 m 2– --------------+ 2– ----------------------------------------------------------------------------------------------------------------------------------logm 1 = M– = (C-1 2) I ˆ Y () 1 2 1 m 2– 1 m– 21 m 2– () -------------------------1 m 2– 1 m 2– ()2---------------------+ ----------------------------------------------------------------------------------------------logm 1 = M– = (C-1 3) I ˆ Y () 2 1 2m– + () logm 1 = M= (C-1 4) I ˆ Y () 1 2 -4 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + logm 1 = M= (C-1 5) 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + logm 1 = M(C-1 6)

PAGE 174

163The derivative of Equation (C-16) with respect to ij is where the R matrix is represented by a produc t of individual Given’s rotation matrices, Rqr, each of which equals an identity matrix ex cept that elements at (q,q), (q,r), (r,q), and (r,r) are cosqr, sinqr, -sinqr, and cosqr, respectively (two subscripts are used without a comma to indicate a particular Given’s rotati on matrix and is not to be confused with Rm,k which represents the element of matrix R in the mth row and kth column). Each individual rotation matrix, Rqr, represents a rotation in the plane specified by the qth and rth basis vectors and each is a functi on of a single rotation angle, qr. A necessary condition for selecting values of the vector that minimize Equation (C-16) is for the derivative of Equation (C-16) with respect to each of th e M(M-1)/2 rotation angles that make up the vector must equal zero. It is known that the se parating solutions for matrix R are such that each element of the vector that specifies R equals kqr/2, where kqr is any integer and the subscripts are used on k to denote that each element of the vector may use a different integer. The next objective is to plug the gene ral form of the separating solution into Equation (C-17) and verify that the separating solution represents a minimum, maximum, or saddle point. The second derivative will then be used to verify that the separating solution is a minimum of Equation (C-16). Notice that (1) each individual rotation ma trix has cosine terms on the diagonal and sine terms on the off-diagonal, (2) each individual rotation matrix, Rqr, is found by d d ij--------Rqr rq 1 + = Mq 1 = M 1 – mk,2 Eskn () sknp – () []k 1 = M– 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + -------------------------------------------------------------------------------------------------------------------------m 1 = M(C-1 7)

PAGE 175

164modifying the four elements of an identity ma trix that correspond to the intersection of the qth and rth rows and the qth and rth columns, and (3) at the separating solution, either the cosine terms or the sine terms are zero. Moreover, if the cosine terms are zero, then the sine terms are -1 or 1. Likewise, if the sine te rms are zero, then the cosine terms are -1 or 1. Consequently, each individual rotation matr ix may be represented by the product of a permutation matrix and a dia gonal matrix, where the permutation matrix consists of only zeros and ones and has only one non-zero element in each row and in each column, while the diagonal matrix has diagonal elements that ar e either -1 or 1. This particular type of matrix will be referred to as a modified pe rmutation matrix. The important property of the modified permutation matrix, as defined, is that the product of multiple modified permutation matrices produces another m odified permutation matrix. This is a direct result of the properties of permutation matr ices, diagonal matrices, and the fact that the only non-zero value of R is the multiplicative identity. This is useful since the overall Given’s rotation matrix is found from the multiplication of mu ltiple individual rotati on matrices, each of which (at the separating solution) is a modified permutation matrix. Equation (C-17) may be written in the following form which may be further simplified to2 Rmk,d d ij--------R12R13ƒ Rijƒ RM 1 M –()mk, Eskn () sknp – () []k 1 = M– 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + ---------------------------------------------------------------------------------------------------------------------------------------------------------m 1 = M(C-1 8) 2 R12R13ƒ Rijƒ RM 1 M –()mk,R12R13ƒ d d ij--------Rijƒ RM 1 M – mk, Eskn () sknp – () []k 1 = M– 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------m 1 = M(C-1 9)

PAGE 176

165The derivative of Rij with respect to ij is a zero matrix with the exception that elements at (i,i), (i,j), (j,i), and (j,j) are once again sines and cosines. This time, however, the sine terms are on the diagonal and the cosine term s are on the off-diagonal. At the separating solution, the sines and cosines once agai n become -1, 0, or 1. Consequently, Rij and d Rij/ dij differ in that only of them has non-zero elements on the diagonal at (i,i) and (j,j), which produces a permutation difference between the two, and the latter matrix, unlike the former, has zeros in all rows and columns other than the ith and jth. The only difference in the two parenthetical entries in the numerator of Equation (C-19) is the term corresponding to the Rij matrix. Since the first parenthetical entry in the numerator is a modified permutation matrix, only one valu e of k produces a non-zero valu e. The second parenthetical term in the numerator has at most one non-zero value, and due to the permutation caused by taking the derivative of Rij, it is guaranteed that whenever the first parenthetical entry is non-zero that the second parenthetical entry is precisely zero. This is true for every value of m. In a similar fashion, the summation in the denominator has exactly one k that produces a non-zero value. Also, since (1) the term of the R matrix that appears in the summation is squared and (2) the elements of Rm,k are -1, 0, or 1, the only non-zero value in each row of R2 m,k is always 1 and the denominator simplifies to 1 + 2 E[so(n)so(np)], where the value of o defines the permutation. Since the outputs are constrained to have unit-variance and since the magnitude of the auto-correlation is always less than or equal to the covariance, the denominator is guaranteed to be positive when > 0. This ensures that the indeterminant value of 0/ 0 is avoided. Summarizing the results to the

PAGE 177

166present, the derivative of Equation (C-16) with respect to each parameter of the combined mixing and demixing matrix is 0 when evaluated at any separating solution. Taking the second derivative of E quation (C-16) with respect to ij, which is also the first derivative of Equation (C17) written as a function of Rm,k, produces It is well-known that the derivative of a ratio is the denominator multiplied by the derivative of the numerator minus the numerator mult iplied by the derivative of the denominator all over the denominator square d. Equation (C-20) takes advantage of the fact previously determined in Equation (C-19) that the derivative of the denominator, when evaluated at any separating solution, is zero. The first parent hetical entry of the numerator is cancelled by one of the squared terms of the denominator . Using this result and the chain rule for differentiation, Equation (C-20) may be written as The second derivative of an indi vidual rotation matrix, as can easily be verified, is simply the negative of the original matrix. Hence, the second derivative of Rm,k is simply -Rm,k. This allows the previous equation to be written as21 2Rmk,2 Eskn () sknp – () []k 1 = M– + d d ij--------Rmk,d d ij--------Rmk, Eskn () sknp – () []k 1 = M– 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + 2---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------m 1 = M(C-2 0) 2 Rmk,d2d ij 2--------Rmk,d d ij--------Rmk, 2+ Eskn () sknp – () []k 1 = M– 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + ---------------------------------------------------------------------------------------------------------------------------------------m 1 = M(C-2 1) 2 R –mk,2d d ij--------Rmk, 2+ Eskn () sknp – () []k 1 = M– 1 2Rmk,2 Eskn () sknp – () []k 1 = M– + ---------------------------------------------------------------------------------------------------------------------m 1 = M(C-2 2)

PAGE 178

167As mentioned earlier, the d Rij/ dij term has zeros in all columns and all rows other than the ith and jth and it induces a different permutation than Rij when evaluated at a separating solution. Consequently, the Rm,k and d Rm,k/ dij terms have different permutations (at a separating solution). Since the two terms are summed in Equation (C-22), the numerator is non-zero when either is non-zero. Furthermore, for each value of m, only one value of k produces a non-zero contribution for Rm,k, for a total of M contributions. Likewise, the d Rm,k/ dij term has a total of 2 contributions. Re gardless of the permutations imposed by Rm,k and d Rm,k/ dij, Equation (C-22) therefore takes the following form where k = {1, 2, . . ., M}, n = {1, 2, . . ., M}, k does not equal n, i doe s not equal k, j does not equal n, i does not equal j, and the variables i and j in Equation (C-23) are not necessarily equal to the variables i and j in Equati on (C-22). The values of i, j, k, and n depend on the particular permutation of the separati ng solution. In order for Equation (C-16) to represent a minimum, Equation (C-23) must be greater than 0 for all M(M-1)/2 possible combinations of i and j. Since the equation for each combination of i and j is different, it is possible that the separating solution corresponds to a saddle point. It should also be mentioned that the requirements for each permutation are not identical. In particular, there are permutations where several of the M(M-1)/2 equations are identical so that the overall requirements for the minimum of Equation (C-16) to represent a separating solution are simplified. Also notice that, as 2 becomes large, the requirement simplifies to the 2 Esmn () smnp – () [] 1 2Esmn () smnp – () [] – + ------------------------------------------------------------------2 Esin () sinp – () [] Eskn () sknp – () [] – () 1 2Esin () sinp – () [] – + --------------------------------------------------------------------------------------------------++m 1 = mij, M2 Esjn () sjnp – () [] Esnn () snnp – () [] – () 1 2Esjn () sjnp – () [] – + ----------------------------------------------------------------------------------------------------(C-2 3)

PAGE 179

168following. If the summation of the M-2 smallest au to-correlations at lag p (that is to say, the auto-correlations of the M-2 sources that have the smallest auto-correlations) is positive, the set of rotation angles that minimi ze Equation (C-16) always corresponds to a separating solution. This is the case when all the auto-correlations are positive, which commonly occurs with natural signals for small lags. For the case that there are only two sources , on the other hand, Equation (C-23) for the single combination of i and j yields the following which may be written as Equation (C-25) is positive as long as E[s1(n)s1(n-p)] does not equal to E[s2(n)s2(n-p)]. This is precisely the requirement needed for many methods based on second-order statistics. It is also simple to show that diversity in the autocorrelation at lag p of the sources must be present when the number of sources is 3 or larger. To summarize, it is possible to separate Gaussian sources by minimizing the MRMISIG criterion if the sources have diversity in their autocorrelation functions at lag p and if the auto-correlations are such that the second derivative of Equation (C-16) with respect to each of the M(M-1)/2 rotation angles, the form of which is given by Equation (C-23), is positive. It should also be mentioned that there may be other local minima of the cost function that do not represent a separating solu tion. No attempt was made to prove that the separating solution is the only minimum or even the global minimum, but several years of 2 Es1n () s1np – () [] Es2n () s2np – () [] – () 1 2Es1n () s1np – () [] – + -----------------------------------------------------------------------------------------------------2 Es2n () s2np – () [] Es1n () s1np – () [] – () 1 2Es2n () s2np – () [] – + -----------------------------------------------------------------------------------------------------+(C-2 4) 2 Es1n () s1np – () [] Es2n () s2np – () [] – ()2(C-2 5)

PAGE 180

169experience with the criterion suggests that th ere are no other minima, at least for symmetric unimodal source distributions. If the sour ces are colored but non-Gaussian, the results of this appendix suggest that MRMI-SIG is ab le to take into account both types of information, i.e., spectral and spatial diversity. The fact that MRMI-SIG can extract spectral diversity also suggests that it can separate signals that have time-varying statistics (at lag p).

PAGE 181

170LIST OF REFERENCES [1] N. Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time Series, With Engineering Applications , Wiley, NY, New York, 1949. [2] H. Bode and C. Shannon, “A simplified de rivation of linear least squares smoothing and prediction theory,” Proc . IRE , Vol. 38, pp. 417-425, Apr. 1950. [3] Bernard Widrow, Samuel D. Stearns, Adaptive Signal Processing , Pearson, Harlow, UK, 1985. [4] P.W. Howells, “Intermediate frequenc y side-lobe canceler,” U.S. Patent 3,202,990, Aug. 24, 1965, (filed May 4, 1959). [5] S.P. Applebaum, “Adaptive arrays,” Syracuse University Research Corp ., Report SPL TR 66-1. [6] Simon Haykin, Adaptive Filter Theory , 4th ed., Prentice-Hall, Englewood Cliffs, NJ, 2001. [7] B. Widrow, J. Glover, Jr., J.M. McCool, J. Kaunitz, C.S. Williams, R.H. Hearn, J.R. Zeidler, E. Dong, M.D., and R.C. Goodlin, M.D., “Adaptive noise cancelling: Principles and applications,” Proc . of IEEE , Vol. 63, pp. 1692-1716, Dec. 1975. [8] Y. Bar-Ness, J.W. Carlin, M.L. Steinberg, “Bootstrapping adaptive cross pol cancelers for satellite communication,” The Intl . Conf . on Comm. , Paper No. 4F.5, Philadelphia, PA, June, 1982. [9] B. Ans, J. Herault, and C. Jutten, “Adap tive neural architectures: Detection of primitives,” Proc . of COGNITIVA ‘85 , pp. 593-597, Paris, France, June 1985. [10]J. Cardoso, “Blind signal separa tion: Statistical principles,” Proc . of the IEEE , Vol. 86, No. 10, pp. 2009-2025, Oct. 1998. [11]A. Hyvarinen, “Survey on inde pendent component analysis,” Neural Computing Surveys 2 , pp. 94-128, 1999. [12]H. Yang and S. Amari, “Adaptive online learning algorithms for blind separation: maximum entropy and minimum mutual information,” Neural Computation , Vol. 9, No.7, pp. 1457-1482, Oct. 1997.

PAGE 182

171[13]P. Comon, “Independent compone nt analysis, a new concept?” Signal Proc ., Vol. 36, No. 3, pp. 287-314, Apr. 1994. [14]D. Xu, J. Principe, J. Fisher, and H. Wu, “A novel measure for independent component analysis (ICA),” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '98 ), Seattle, WA, Vol. 2, pp. 1161-1164, May 1998. [15] Emanuel Parzen, “On estimation of a probability density function and mode,” Annals of Mathematical Statistics , Vol. 33, No. 3, pp. 1065-1076, Sept. 1962. [16]A. Rényi, Probability Theory , North-Holland Publishing Company, Amsterdam, Netherlands, 1970. [17]J. Principe, D. Xu, J. Fisher, “Information Theoretic Learning”, in Unsupervised Adaptive Filtering , S. Haykin Editor, pp. 265-319, John Wiley & Son, New York, NY, 2000. [18] Kenneth E. Hild II, Deniz Erdogmus a nd Jose C. Principe, “Blind source separation using Renyi's mutual information,” IEEE Signal Proc . Letters , Vol. 8, No. 6, pp. 174176, June 2001. [19] Simon Haykin, Neural Networks , 2nd ed., Prentice-Hall, In c., Upper Saddle River, NJ, 1999. [20] A. Hyvarinen, “Fast and robust fixedpoint algorithms for independent component analysis,” IEEE Transactions on Neural Networks , Vol. 10, No. 3, pp. 626-634, 1999. [21] A. Bell and T. Sejnowski, “An informa tion-maximization approach to blind separation and blind deconvolution,” Neural Computation , Vol. 7, No. 6, pp. 1129-1159, Nov. 1995. [22] S. Amari, “Neural learning in structured parameter spaces Natural Riemannian gradient,” Advances in Neural Information Proc . Systems ( NIPS '96 ), MIT Press, Cambridge, MA, pp. 127-133, Dec. 1996 [23] Michael S. Brandstein, “On the use of explicit speech modeling in microphone array applications,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '98 ), Seattle, WA, Vol. 6, pp. 3613-1616, May 1998. [24] J. Kapur, Maximum entropy models in science and engineering , Wiley Eastern, New Delhi, India, 1989. [25] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory , John Wiley & Sons, Inc., New York, NY, 1991. [26] Paul Viola, Nicol N. Schraudolph, and Terrence J. Sejnowski, “Empirical entropy manipulation for real-world problems,” Advances in Neural Information Proc . Systems ( NIPS '95 ), MIT Press, Cambri dge, MA, pp. 851-857, Nov. 1995.

PAGE 183

172[27] A. Renyi, Probability Theory , North-Holland Publishing Company, Amsterdam, Netherlands, 1970. [28] D. Erdogmus, K. Hild II and J.C. Prin cipe, “Blind source separation using Renyi's alpha-marginal entropies,” Neurocomputing , Special Issue on Blind Source Separation and Independent Component Analysis, Vol. 49, No. 1, pp. 25-38, Dec. 2002. [29] Deniz Erdogmus, Kenneth E. Hild II, a nd Jose C. Principe, “On-line entropy manipulation: Stochastic Information Gradient,” IEEE Signal Processing Letters , Vol. 10, No. 8, pp. 242-245, Aug. 2003. [30] Dinh Tuan Pham, “Contrast functions for blind separation and deconvolution of sources,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 37-42, Dec. 2001. [31] J.F. Cardoso and A. Souloumiac, “B lind beamforming for non-Gaussian signals,” Radar and Signal Processing, IEE Proceedings F , Vol. 140, No. 6, pp. 362 -370, Dec. 1993. [32] J.F. Cardoso. (1997). In fomax and maximum likelihood fo r blind source separation. IEEE Signal Proc . Letters , 4(4):112-114. [33] J.F. Cardoso. and B. Laheld, “Equivariant adaptive source separation,” IEEE Trans. on Signal Proc ., Vol. 44, No. 12, pp. 3017-3030, Dec. 1996. [34] Deniz Erdogmus and Jose C. Principe, “Generalized information potential criterion for adaptive system training,” IEEE Trans . on Neural Networks , Vol. 13, No.5, pp. 1035-1044, Sept. 2002. [35] Kenneth E. Hild II, Deniz Erdogmus, and Jose C. Principe, “On-line minimum mutual information method for time -varying blind s ource separation,” Intl . Workshop on Independent Component Analysis and Signal Separation, ( ICA ‘01 ), San Diego, CA, pp. 126-131, Dec. 2001. [36] Kenneth E. Hild II, Deniz Erdogmus, a nd Jose C. Principe, “Blind source separation of time-varying, instantaneous mixtu res using an on-line algorithm,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '02 ), Orlando, FL, Vol. 1, pp. 993-996, May 2002. [37] Jose C. Principe, Neil R. Euliano, and W. Curt Lefabvre, Neural and Adaptive Systems , John Wiley & Sons, Inc., New York, NY, 1999. [38] K. Torkkola, “Blind separation fo r audio signals Are we there yet?” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 239-244, Jan. 1999.

PAGE 184

173[39] Paris Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Neurocomputing , Vol. 22, No. 1-3, pp. 21-34, Nov. 1998. [40] N. Mitianoudis and M. Davies, “New fixed-point algorithms for convolved mixtures,” Intl. Workshop on Independent Compone nt Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 633-638, Dec. 2001. [41] Lucas C. Parra and Christopher V. Alvi no, “Geometric source separation: Merging convolutive source separation with geometric beamforming,” IEEE Trans. on Speech and Audio Proc ., Vol. 10, No. 6, pp. 352-362, Sept. 2002. [42] Cristina Mejuto, Adriana Dapena, and Luis Castedo, “Frequency-domain Infomax for blind separation of convolutive mixtures,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '00 ), Helsinki, Finland, pp. 315-320, June 2000. [43] Adriana Dapena, Monica F. Bugallo, a nd Luis Castedo, “Separation of convolutive mixtures of temporally-white signa ls: A novel frequency-domain approach,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 179-184, Dec. 2001. [44] Wolf Baumann, Bert-Uwe Kohler, Dorothea Kolossa, and Reinhold Orglmeister, “Real time separation of convolutive mixtures,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 65-69, Dec. 2001. [45] Jorn Anamuler and Birger Kollmeier, “Amplitude modulation decorrelation for convolutive blind source separation,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '00 ), Helsinki, Finland, pp. 215-220, June 2000. [46] Adriana Dapena and Christine Serviere , “A simplified freque ncy-domain approach for blind separation of convolutive mixtures,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 569-574, Dec. 2001. [47] Futoshi Asano, Shiro Ikeda, Michiaki Ogawa, Hideki Asoh, and Nobuhiko Kitawaki, “A combined approach of array processi ng and independent component analysis for blind separaion of acoustic signals,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 2729-2732, May 2001. [48] V. Capdevielle, Ch. Serv iere, and J. L. Lacoume, “Blind separation of wide-band sources in the frequency domain,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '95 ), Detroit, MI, Vol. 3, pp. 2080-2083, May 1995.

PAGE 185

174[49] Kamran Rahbar and Jame s P. Reilly, “Blind source separation algorithm for MIMO convolutive mixtures,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 242-247, Dec. 2001. [50] Leland B. Jackson, Signals and Systems , Addison-Wesley, Boston, MA, 1991. [51] Daniel W.E. Schobben and Piet C.W. Sommen, “On the indeterminacies of convolutive blind signal separation based on second order statistics,” Intl . Symposium on Signal Proc. and its Applications ( ISSPA '99 ), Brisbane, Australia, pp. 215-218, Aug. 1999. [52] Ali Mansour, Christian Ju tten, and Philippe Loubaton, “Adaptive subspace algorithm for blind separation of independent sources in convolutive mixture,” IEEE Trans . on Signal Proc ., Vol. 48, No. 2, pp. 583-586, Feb. 2000. [53] W. Hachem, F. Desbouvries, and Ph. Loubaton, “On the identification of certain noisy FIR convolutive mixtures,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 401-405, Jan. 1999. [54] H. Bousbia-Salah, A. Belouchrani, a nd K. Abed-Meriam, “Jacobi-like algorithm for blind signal separation of convolutive mixtures,” Electronic Letters , Vol. 37, No. 16, pp. 1049-1050, Aug. 2001. [55] Hicham Bousbia-Salah, Adel Belouchran i, and Karim Abed-Meraim, “Blind separation of convolutive mixtures usi ng joint block diagonalization,” Intl . Symposium on Signal Proc . and its Applications ( ISSPA '01 ), Kuala Lampur, Malaysia, pp. 13-16, Aug. 2001. [56] C. Simon, G. d'Urso, C. Vignat, Philippe Loubaton, and Christian Jutten, “On the convolutive mixture source separati on by the decorrelation approach,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '98 ), Seattle, WA, Vol. 4, pp. 2109-2112, May 1998. [57] Nathalie Delfosse and Philippe Loubaton, “Adaptive blind separation of convolutive mixtures,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '96 ), Atlanta, GA, Vol. 5, pp. 2940-2943, May 1996. [58] Nathalie Delfosse and Philippe Loubaton, “Adaptive blind separation of convolutive mixtures,” Asilomar Conf . on Signals, Systems, and Computers ( Asilomar '95 ), Pacific Grove, CA, Vol. 1, pp. 341-345, Oct. 1995. [59] S. Icart and R. Gautie r, “Blind separation of convolutive mixtures using second and fourth order moments,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '96 ), Atlanta, GA, Vol. 5, pp. 3018-3021, May 1996.

PAGE 186

175[60] X. Sun and S.C. Douglas, “Adaptive para unitary filter banks for contrast-based multichannel blind deconvolution,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 2753-2756, May 2001. [61] Ruey-wen Liu, Yujiro Inouye, and Hui Luo, “A system-theoretic foundation for blind signal separation of MIMO-FIR convolutive mixtures A review,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '00 ), Helsinki, Finland, pp. 205-210, June 2000. [62] Michael G. Larimore, John R. Treichler, and C. Richard Johnson, Theory and Design of Adaptive Filters , Pearson Education, Harlow, UK, 2001. [63] Nabil Charkani and Ya nnick Deville, “Self-adaptive separation of convolutively mixed signals with a recursive structure. Part I: Stability analysis and optimization of asymptotic behaviour,” Signal Processing , Vol. 73, No. 3, pp. 225-254, Jan. 1999. [64] F. Ehlers, and H.G. Schuster, “Blind se paration of convolutive mixtures and an application in automatic speech recognition in a noisy environment,” IEEE Trans . on Signal Proc ., Vol. 45, No. 10, pp. 2608-2612, Oct. 1997. [65] Shoko Araki, Shoji Makino, Ryo Mukai, Tsuyoki Nishikawa, and Hiroshi Saruwatari, “Fundamental limitation of fr equency domain blind source separation for convolved mixture of speech,” Intl . Workshop on Independent Co mponent Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 132-137, Dec. 2001. [66] Shoko Araki, Shoji Makino, Tsuyoki Nishikawa, and Hiroshi Saruwatari, “Fundamental limitation of frequency domain bli nd source separation for convolutive mixture of speech,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 2737-2740, May 2001. [67] Jose C. Principe, Bert deVries, and Pedro Guedes de Oliveira, “The Gamma filter: a new class of adaptive IIR filter s with restricted feedback,” IEEE Trans. on Signal Proc ., Vol. 355, No. 9, pp. 161-163, Jan. 1992. [68] T. Oliveira e Silva, “On the equiva lence between Gamma and Laguerre filters,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '94 ), Adelaide, Australia, Vol. 4, pp. 385-388, Apr. 1994. [69] Milutin Stanacevic, Marc Cohen, and Ge rt Cauwenberghs, “Blind separation of linear convolutive mixtures usi ng orthogonal filter banks,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 260-265, Dec. 2001. [70] Ulf A. Lindgren, Holger Broman, “Source separation using a criterion based on second-order statistics,” IEEE Trans. on Signal Proc ., Vol. 46, No. 7, pp. 1837-1850, July 1998.

PAGE 187

176[71] C. Simon, G. d'Urso, C. Vignat, Philippe Loubaton, and Christian Jutten, “On the convolutive mixture source separation by the decorrelation approach,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '98 ), Seattle, WA, Vol. 4, pp. 2109-2112, May 1998. [72] Fred J. Taylor and Jon Mellott, Signal Processing: A Hands-On Workshop , McGrawHill Companies, New York, NY, 1998. [73] H. Attias and C. E. Schreiner, “Bli nd source separation and deconvolution: The dynamic component analysis algorithm,” Neural Computation , Vol. 10, No. 6, pp. 1373-1424, Aug. 1998. [74] Lucas Parra and Clay Spence, “Convol utive blind separation of non-stationary sources,” IEEE Trans . on Speech and Audio Proc ., Vol. 8, No. 3, pp. 320-327, May 2000. [75] Lucas Parra, Clay Spence, and Bert De Vries, “Convolutive blind source separation based on multiple decorrelation,” Neural Networks for Signal Proc . ( NNSP '98 ), Cambridge, England, pp. 23-32, Aug. 1998. [76] Hsiao-Chun Wu and Jose C. Principe, “Simultaneous diagonalization in the frequency domain (SDIF) for source separation,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 245-250, Jan. 1999. [77] Jose C. Principe and Hsiao-Chun W u, “Blind separation of convolutive mixtures,” Intl . Joint Conf . on Neural Networks ( IJCNN '99 ), Washington, DC, Vol. 2, pp. 10541058, July 1999. [78] Scott C. Douglas and Xiaoan Sun, “Bli nd separation of acoustical mixtures without time-domain deconvolution or decorrelation,” Neural Networks for Signal Proc . ( NNSP '01 ), Falmouth, MA, pp. 323-332, Sept. 2001. [79] Stefaan Van Gerven and Dirk Van Co mpernolle, “Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness,” IEEE Trans . on Signal Proc ., Vol. 43, No. 7, pp. 1602-1612, July 1995. [80] Kari Torkkola, “Blind separation of convolved source s based on information maximization,” Neural Networks for Signal Proc . ( NNSP '96 ), Kyoto, Japan, pp. 1-10, Sept. 1996. [81] Ella Bingham and Aapo H yvarinen, “A fast fixed-point algorithm for independent component analysis of complex valued signals,” Intl . Journal of Neural Systems , Vol. 10, No. 1, pp. 1-8, Feb. 2000.

PAGE 188

177[82] Dana H. Brooks and Chrysostomos L. Nikias, “Multichannel adaptive blind deconvolution using the complex cepstrum of higher order cross-spectra,” IEEE Trans . on Signal Proc ., Vol. 41, No. 9, pp. 2928-2934, Sept. 1993. [83] Dinh Tuan Pham, “Mutual information a pproach to blind separation of stationary sources,” IEEE Trans . on Information Theory , Vol. 48, No. 7, pp. 1935-1946, July 2002. [84] Masashi Ohata and Kiyotoshi Matsuoka, “Stability analyses of information-theoretic blind separation algorithms in the case wh ere the sources are nonlinear processes,” IEEE Trans . on Signal Proc ., Vol. 50, No. 1, pp. 69-77, Jan. 2002. [85] Lawrence R. Rabiner and Ronald W. Schafer, Digital Processing of Speech Signals , Prentice-Hall, Inc., Upper Saddle River, NJ, 1978. [86] Rodney A. Morejon, “An information-theo retic approach to sonar automatic target recognition,” Ph.D. dissertation, University of Florida, 2003. [87] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle, “An algebraic approach to blind MIMO identification,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '00 ), Helsinki, Finland, pp. 211-214, June 2000. [88] Pierre Comon, Eric Moreau, and Ludwig Rota, “Blind separation of convolutive mixtures, a contrast-based joint diagonlization approach,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 686-691, Dec. 2001. [89] Russell H. Lambert, “Difficulty measures and figures of merit for source separation,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 133-138, Jan. 1999. [90] M. Girolami, “Symmetric adaptive ma ximum likelihood estimation for noise cancellation and signal separation,” Electronic Letters , Vol. 33, No. 17, pp. 1437-1438, Aug. 14, 1997. [91] Barak A. Pearlmutter and Lucas C. Parra, “Maximum likelihood blind source separation: A context-sensitive generalization of ICA,” Advances in Neural Information Proc . Systems ( NIPS '96 ), MIT Press, Cambridge, MA, pp. 613-619, Dec. 1996. [92] Henrik Sahlin and Holger Broma n, “Separation of real-world signals,” Signal Processing , Vol. 64, No. 1, pp. 103-113, Jan. 1998. [93] Xiaoan Sun and Scott C. Douglas, “A na tural gradient convolutive blind source separation algorithm for speech mixtures,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 59-64, Dec. 2001.

PAGE 189

178[94] Daniel Schobben, Kari Torkkola and Pari s Smaragdis, “Evaluation of blind signal separation methods,” Intl . Workshop on Independent Co mponent Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 261-266, Jan. 1999. [95] M. Kawamoto, Y. Inouye, A. Mansour , and R.-W. Liu, “Blind deconvolution algorithms for MIMO-FIR systems driven by fourth-order colored signals,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '01 ), San Diego, CA, pp. 692-697, Dec. 2001. [96] Daniel Yellin and Ehud Weinstein, “C riteria for multichanne l signal separation,” IEEE Trans . on Signal Proc ., Vol. 42, No. 8, pp. 2158-2167, Aug. 1994. [97] Craig L. Fancourt and Lucas Parra, “The coherence function in vlind source separation of convolutive mixtures of non-stationary signals,” Neural Networks for Signal Proc . ( NNSP '01 ), Falmouth, MA, pp. 303-312, Sept. 2001. [98] Hoang-Lan Nguyen Thi and Christian Jutten, “Blind separation for convolutive mixtures,” Signal Processing , Vol. 45, No. 2, pp. 209-229, Aug. 1995. [99] Masaki Handa, Takayuki Nagaim and Akira Kurematsu, “Frequency domain multichannel speech separation and its applications,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 2761-2764, May 2001. [100] Te-Won Lee, Anthony J. Bell, and Russel l H. Lambert, “Blind separation of delayed and convolved sources,” Advances in Neural Information Proc . Systems ( NIPS '96 ), MIT Press, Cambridge , MA, pp. 758-764, Dec. 1996. [101] Te-Won Lee and Reinhold Orglmeister, “A contextual blind se paration of delayed and convolved sources,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '97 ), Munich, Germany, Vol. 2, pp. 1199-1202, Apr. 1997. [102] Nikos A. Kanlis, Jonathan Z. Simon, and Shihab A. Shamma, “Complete training analysis of feedback architecture networks that perform blind source separation and deconvolution,” Intl . Workshop on Independent Co mponent Analysis and Signal Separation ( ICA '00 ), Helsinki, Finland, pp. 139-144, June 2000. [103] Ehud Weinstein, Meir Feder, and Alan Oppenheim, “Multi-ch annel signal separation by decorrelation,” IEEE Trans . on Speech and Audio Proc ., Vol. 1, No. 4, pp. 405-413, Oct. 1993. [104] Robert Aichner, Shoko Araki, Shoji Makino, Tsuyoki Nishikawa, and Hiroshi Saruwatari, “Time domain blind source separa tion of non-stationary convolved signals by utilizing geometric beamforming,” Neural Networks for Signal Proc . ( NNSP '02 ), Martigny, Switzerland, pp. 445-454, Sept. 2002.

PAGE 190

179[105] Brian S. Krongold and Douglas L. Jones, “Blind source separation of nonstationary convolutively mixed signals,” IEEE Work . on Stat. Signal and Array Proc . ( SSAP '00 ), Pocono Manor, PA, pp. 53-57, Aug. 2000. [106] M. Kawamoto, A.K. Barros, A. Mans our, K. Matsuoka, and N. Ohnishi, “Real world blind separation of convol ved non-stationary signals,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 347-352, Jan. 1999. [107] M. Kawamoto, K. Matsuoka, and N. Ohni shi, “Real world blind separation of convolved speech signals,” Intl . Joint Conf. on Neural Networks ( IJCNN '99 ), Washington, DC, Vol. 2, pp. 993-997, July 1999. [108] Hsiao-Chun Wu and Jose C. Principe, “A unifying criterion for blind source separation and decorrelation: simultaneous di agonalization of correlation matrices,” Neural Networks for Signal Proc . ( NNSP '97 ), Amelia Island, FL, pp. 496-505, Sept. 1997. [109] Lucas Parra and Clay Spence, “On-line convolutive blind source separation of nonstationary signals,” Journal of VLSI Signal Proc . Systems for Signal Image and Video Technology , Vol. 26, No. 1/2, pp. 39-46, Aug. 2000. [110] Ruey-wen Liu and Yujiro Inouye, “B lind equalization of MIMO-FIR channels driven by white but higher order colored source signals,” IEEE Trans . on Information Theory , Vol. 48, No. 5, pp. 1206-1214, May 2002. [111] Inbar Fijalkow and Philippe Gaussier , “Self-organizing blind MIMO deconvolution using lateral-inhibition,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 221-226, Jan. 1999. [112] C. Simon, Ph. Loubaton, C. Vignat, C. Jutten, and G. d'Urso, “Blind source separation of convolutive mixtures by maximizati on of fourth-order cumulants: The nonIID case,” Asilomar Conf . on Signals, Systems, and Computers ( Asilomar '98 ), Pacific Grove, CA, Vol. 2, pp. 1584-1588, Nov. 1998. [113] Yujiro Inouye and Kazuaki Tanebe, “Super-exponential algorit hms for multichannel blind deconvolution,” IEEE Trans . on Signal Proc ., Vol. 48, No. 3, pp. 881-888, Mar. 2000. [114] Jitendra K. Tugnait, “On blind separati on of convolutive mixtures of independent linear signals in unknown additive noise,” IEEE Trans . on Signal Proc ., Vol. 46, No. 11, pp. 3117-3123, Nov. 1998. [115] Eric Moreau and Jean-Christophe Pes quet, “Generalized c ontrasts for multichannel blind deconvolution of linear systems,” IEEE Signal Proc . Letters , Vol. 4, No. 6, June 1997.

PAGE 191

180[116] Azzedine Touzni, Inbar Fijalkow, Michae l G. Larimore, John R. Treichler, “A globally convergent approach for blind MIMO adaptive deconvolution,” IEEE Trans . on Signal Proc ., Vol. 49, No. 6, pp. 1166-1178, June 2001. [117] Jitendra K. Tugnait, “Adaptive blind se paration of convolutive mixtures of independent linear signals,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '98 ), Seattle, WA, Vol. 4, pp. 2097-2100, May 1998. [118] Pierre Comon, “Contrasts for multichannel blind deconvolution,” IEEE Signal Proc . Letters , Vol. 3, No. 7, pp. 209-211, July 1996. [119] Dinh-Tuan Pham, “Mutual information a pproach to blind sepa ration of stationary sources,” Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 215-220, Jan. 1999. [120] H. Attias and C.E. Schreiner, “B lind source separation and deconvolution by dynamic component analysis,” Neural Networks for Signal Proc . ( NNSP '97 ), Amelia Island, FL, pp. 456-465, Sept. 1997. [121] James P. Reilly and Lino Coria Mendo za, “Blind signal separation for convolutive mixing environments using sp atial-temporal processing,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '99 ), Phoenix, AZ, Vol. 3, pp. 1437-1440, Mar. 1999. [122] Ireneusz Sabala, Andrzej Cichoki, and Shun-Ichi Amari, “Relationships between instantaneous blind source separation and multichannel blind deconvolution,” Intl . Joint Conf . on Neural Networks ( IJCNN '98 ), Anchorage, AK, Vol. 1, pp. 39-44, May 1998. [123] Te-Won Lee, Anthony J. Bell, and Reinhold Orglmeister, “Blind source separation of real world signals,” Intl . Conf . on Neural Networks ( ICNN '97 ), Houston, TX, Vol. 4, pp. 2129-2134, June 1997. [124] Shun-ichi Amari, Scott C. Douglas, A ndrzej Cichocki, and Howard H. Yang, “Multichannel blind deconvolution and equali zation using the na tural gradient,” Workshop on Signal Proc . Advances in Wireless Commun . ( SPAWC '97 ), La Villette, France, pp. 101-104, April 1997. [125] L.Q. Zhang, Andrzej Cichocki, and S hun-ichi Amari, “Multichannel blind deconvolution of non-minimum phase systems using information backpropagation,” Intl . Conf . on Neural Information Proc . ( ICONIP '99 ), Dunedon/Queenstown, New Zealand, Vol. 1, pp. 210-216, Nov. 1999.

PAGE 192

181[126] Muhammad Z. Ikram and Dennis R. Morg an, “A multiresolution approach to blind separation of speech signals in a reverberant environment,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 27572760, May 2001. [127] Kamran Rahbar and James P. Reilly, “Blind source separation of convolved sources by joint approximate diagonalization of cross-spectral density matrices,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 2745-2748, May 2001. [128] John C. Platt and Federi co Faggin, “Networks for the separation of sources that are superimposed and delayed,” Advances in Neural Information Proc . Systems ( NIPS '91 ), MIT Press, Cambridge, MA, pp. 730-737, Dec. 1991. [129] Nabil Charkani, Yannick Deville, and Jeanny Herault, “Stability analysis and optimization of time-domain convolutive source separation algorithms,” Workshop on Signal Proc . Advances in Wireless Commun . ( SPAWC '97 ), La Villette, France, pp. 73-76, April 1997. [130] Nabil Charkani and Yannick Deville, “A convolutive source separation method with self-optimizing non-linearities,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '99 ), Phoenix, AZ, Vol. 5, pp. 2909-2912, Mar. 1999. [131] Nabil Charkani and Ya nnick Deville, “Self-adaptive separation of convolutively mixed signals with a recursive structure. Pa rt II: Theoretical extensions and application to synthetic and real signals,” Signal Processing , Vol. 75, No. 2, pp. 117-140, June 1999. [132] Kari Torkkola, “Blind separation of de layed sources based on information maximization,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '96 ), Atlanta, GA, Vol. 6, pp. 3509-3512, May 1996. [133] A. Koutras, E. Dermatas, and G. K okkinakis, “Continuous speech recognition in a multi-simultaneous-speaker environment using decorrelation filtering in the frequency domain,” Intl . Work . Speech and Computer ( SPECOM '98 ), St. Petersburg, Russia, pp. 253-256, Oct. 1998. [134] Athanasios Koutras, Evangelos Derm atas, and George Kokkinakis, “Blind signal separation and speech recognition in the frequency domain,” IEEE Intl . Conf . on Electronics, Circuits and Systems ( ICECS '99 ), Pafos, Cyprus, Vol. 1, pp. 427-430, Sept. 1999. [135] M. Kawamoto, A.K. Barros, K. Matsuoka, and N. Ohnishi, "A method of real-world separation implemented in frequency domain," Intl . Workshop on Independent Component Analysis and Signal Separation ( ICA '00 ), Helsinki, Finland, pp. 267272, June 2000.

PAGE 193

182[136] Cristina Mejuto, “A seco nd-order method for blind source separation of convolutive mixtures,” Intl . Workshop on Independent Compone nt Analysis and Signal Separation ( ICA '99 ), Aussois, France, pp. 395-400, Jan. 1999. [137] Sergio A. Cruces-Alvarez, Andrzej Cichoki, and Shun-Ichi Amari, “On a new blind signal extraction algorithm: Different criteria and stability analysis,” IEEE Signal Proc . Letters , Vol. 9, No. 8, pp. 233-236, Aug. 2002. [138] Daniel Yellin and Ehud Weinstein, “M ultichannel signal sepa ration: Methods and analysis,” IEEE Trans . on Signal Proc ., Vol. 44, No. 1, pp. 106-118, Jan. 1996. [139] Jean-Christophe Pesquet, Binning Ch en, and Athina P. Petropulu, “Frequencydomain contrast functions for sepa ration of convolutive mixtures,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '01 ), Salt Lake City, UT, Vol. 5, pp. 2765-2768, May 2001. [140] C. Serviere, “Blind source se paration of convolutive mixtures,” IEEE Work . on Stat . Signal and Array Proc . ( SSAP '96 ), Corfu, Greece, pp. 316-319, June 1996. [141] Paris Smaragdis, “Efficient bli nd separation of convolved sound mixtures,” IEEE ASSP Work . on Applications of Signal Proc . to Audio and Acoustics ( ASSP '97 ), New Paltz, NY, pp. 19-22, Oct. 1997. [142] Marcel Joho, Heinz Mathis, and George S. Moschytz, “An FFT-based algorithm for multichannel blind deconvolution,” IEEE Intl . Symp . on Circuits and Systems ( ISCAS '99 ), Orlando, FL, Vol. 3, pp. 203-206, July 1999. [143] Jose C. Principe, Dongxin Xu, Qun Zhao, and John W. Fisher III, “Learning from examples with information theoretic criteria,” Journal of VLSI Signal Proc . Systems , Vol. 26, No. 1/2, pp. 61-77, Aug. 2000. [144] Zdenek Fabian, “Inf ormation and entropy of continuous random variables,” IEEE Trans . on Information Theory , Vol. 43, No. 3, pp. 1080-1084, May 1997. [145] L. Lorne Campbell, “Minimum cross-en tropy estimation with inaccurate side information,” IEEE Trans . on Information Theory , Vol. 45, No. 7, pp. 2650-2652, Nov. 1999. [146] Russell H. Lambert and An thony J. Bell, “Blind separation of multiple speakers in a multipath environment,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '97 ), Munich, Germany, Vol. 1, pp. 423-426, Apr. 1997. [147] Keinosuke Fukunaga and Larry D. Hostetler, “Optimi zation of k-ne arest neighbor density estimates,” IEEE Trans . on Info . Theory , Vol. 19, No. 3, pp. 320-326, May 1973.

PAGE 194

183[148] Deniz Erdogmus, Kenneth E. Hild II, a nd Jose C. Principe, “Independent Components Analysis using Renyi’s mutual info rmation and Legendre density estimation,” Intl . Joint Conf. on Neural Networks ( IJCNN '01), Washington, DC, Vol. 4, pp. 2762-2767, July 2001. [149] Murray Rosenblatt, “Remarks on some nonparametric estimates of a density function,” Annals of Mathematical Statistics , Vol. 27, No. 3, pp. 832-837, Sept. 1956. [150] Luc Devroye and Adam Krzyzak, “New multivariate product density estimators,” Journal of Multivariate Analysis , Vol. 82, pp. 88-110, 2002. [151] J.N. Kapur and H.K. Kesavan, Entropy Optimization Principles with Applications , Academic Press, Inc., Boston, MA, 1992. [152] J.N. Kapur, Measures of Information and Their Applications , John Wiley & Sons, New York, NY, 1994. [153] A. Hyvarinen, “A family of fixed point algorithms for Independent Component Analysis and projection pursuit,” Technical report A47, Helsinki University of Technology, Laboratory of Computer and Information Science, 1997. [154] Dorothee E. Marossero, Deniz Erdogmus , Neil Euliano, Jose C. Principe, and Kenneth E. Hild II, “Independent Component s Analysis for fetal electrocardiogram extraction: A case for the data efficient Mermaid algorithm,” accepted to Neural Networks for Signal Proc . ( NNSP '03 ), Toulouse, France, Sept. 2003. [155] Brian D. Ripley, Pattern Recognition and Neural Networks , Cambridge University Press, Cambridge, UK, 1995. [156] Vladimir Naumovich Vapnik, The Nature of Statistical Learning Theory , 2nd ed., Springer-Verlag, New York, NY, 2000. [157] Thilo-Thomas Frieb, Nello Cristianini, and Colin Campbell, “The Kernel-Adatron algorithm: A fast and simple learning pr ocedure for support vector machines,” Intl . Conf . on Machine Learning ( ICML '98 ), Madison, WI, pp. 188-196, July 1998. [158] Davide Mattera, France sco Palmieri, and Simon Haykin, “An explicit algorithm for training support vector machines,” IEEE Signal Proc . Letters , Vol. 6, No. 9, pp. 243-245, Sept. 1999. [159] O.L. Mangasarian and David R. Musica nt, “Lagrangian support vector machines,” Journal of Machine Learning Research , Vol. 1, pp. 161-177, Mar. 2001. [160] Thorsten Joachims, “Making large-scale SVM learning practical,” Advances in Kernel Methods Support Vector Learning , B. Scholkopf, C. Burges, and A. Smola ed., MIT Press, 1999.

PAGE 195

184[161] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, “Feature selection for SVMs,” Advances in Neural Information Proc . Systems ( NIPS '00 ), MIT Press, Cambridge , MA, pp. 668-674, Nov. 2000. [162] Erin L. Allwein, Robert E. Schapire , and Yoram Singer, “R educing multiclass to binary: A unifying approach for margin classifiers,” Journal of Machine Learning Research , Vol. 1, pp. 113-141, Dec. 2000. [163] Colin Campbell, Thilo-Thomas Frieb, and Nello Cristianini, “Maximum margin classification using the KA algorithm,” Intelligent Data Engineering and Learning ( IDEAL '98 ), Hong Kong, pp. 355-362, Oct. 1998. [164] Corinna Cortes and Vladimir Vapnik, “Support-vector networks,” Machine Learning , Vol. 20, No. 3, pp. 273-297, Sept. 1995. [165] Biing-Hwang Juang and Shigeru Katagi ri, “Discriminative learning for minimum error classification,” IEEE Trans . on Signal Proc ., Vol. 40, No. 12, pp. 3043-3054, Dec. 1992. [166] Hideyuki Watanabe, Tsuyoshi Yamaguchi , and Shigeru Katagiri, “Discriminative metric design for robus t pattern recognition,” IEEE Trans . on Signal Proc ., Vol. 45, No. 11, pp. 2655-2662, Nov. 1997. [167] Shigeru Katagiri, Biing-Hwang Juang, and Chin-Hui Lee, “Pattern recognition using a family of design algorithms ba sed upon the generaliz ed probabilistic descent method,” Proc . IEEE , Vol. 86, No. 11, pp. 2345-2373, Nov. 1998. [168] Chris Bishop, Neural Networks for Pattern Recognition , Oxford University Press, Oxford, UK, 1995. [169] Keinosuke Fukunaga, Introduction to Statistical Pattern Recognition , 2nd ed., Academic Press, Inc., Boston, MA, 1990. [170] William M. Campbell, Kari Torkkola, and Sreeram V. Balakrishnan, “Dimension reduction techniques for trai ning polynomial networks,” Intl . Conf . on Machine Learning ( ICML '00 ), Stanford, CA, pp. 119-126, June 2000. [171] Shailesh Kumar, Joydeep Ghosh, and Melb a Crawford, “A Bayesian pairwise classifier for character recognition,” Cognitive and Neural Models for Word Recognition and Document Processing , Nabeel Mursheed (ed.), Worl d Scientific Press, River Edge, NJ, 2000. [172] Alain Biem, Shigeru Katagiri, and Biing-Hwang Juang, “Discriminative feature extraction for speech recognition,” Neural Networks for Signal Proc . ( NNSP '93 ), Linthicum, MD, pp. 392-401, Sept. 1993.

PAGE 196

185[173] Gene H. Golub and Charles F. Van Loan, Matrix Computations , 3rd ed., John Hopkins University Press, Baltimore, MD, 1996. [174] Martin E. Hellman and Josef Raviv, “P robability of error, equivocation, and the Chernoff Bound,” IEEE Trans . on Information Theory , Vol. IT-16, No. 4, pp. 368372, July 1970. [175] Deniz Erdogmus and Jose C. Principe , “Lower and upper bounds for misclassification probability based on Renyi's information,” to appear in Journal of VLSI Signal Processing Systems Special Issue on Wireless Communications and Blind Signal Processing , 2003. [176] Howard Hua Yang and John Moody, “Feature selection based on joint mutual information,” Advances in Intelligent Data Analysis ( AIDA '99 ), Computational Intelligence Methods and Applications ( CIMA ‘99 ), International Computer Science Conventions, Rochester, NY, June 1999. [177] Roberto Battiti, “Using mutu al information for selecting features in supervised neural net learning,” IEEE Trans. on Neural Networks , Vol. 5, No. 4, pp. 537-550, July 1994. [178] Kurt D. Bollacker and Joydeep Ghosh, “Mutual information feature extractors for neural classifiers,” Intl . Conf . on Neural Networks ( ICNN '96 ), Washington DC, pp. 1528-1533, June 1996. [179] Nojun Kwak and Chong-Ho Choi, “Improve d mutual information feature selector for neural networks in supervised learning,” Intl . Joint Conf. on Neural Networks ( IJCNN '99 ), Washington, DC, Vol. 2, pp. 1313-1318, July 1999. [180] R. Rajagopal, K. Anoop Ku mar, and P. Ramakrishna Ra o, “An integrated approach to passive target classification,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '94 ), Adelaide, Australia, Vol. 2, pp. 313-316, Apr. 1994. [181] Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition , Academic Press, Sa n Diego, CA, 1999. [182] Dongxin Xu, Jose C. Principe, John Fi sher III, Hsiao-Chun Wu, “A novel measure for independent component analysis (ICA),” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '98 ), Seattle, WA, Vol. 2, pp. 1161-1164, May 1998. [183] Alfred O. Hero, Bing Ma, Olivier Michel and John Gorman, “Alpha-divergence for classification, indexing and retrieval (revi sion 2),” Communications and Signal Processing Laboratory, The University of Michigan, Technical Report CSPL-328, pp. 1-25, June 2002.

PAGE 197

186[184] J. Beirlant, E. J. Dudewica, L. Gyof i, and E. van der Meulen, “Non-parametric entropy estimation: An overview,” Intl . Journal of Math . Stat . Sci ., Vol. 6, No. 1, pp. 17-39, 1997. [185] Keinosuke Fukunaga and Raymond R. Haye s, “The reduced parzen classifier,” IEEE Trans . on Pattern Analysis and Machine Intelligence , Vol. 11, No. 4, pp. 423425, Apr. 1989. [186] Kari Torkkola, “Learning discriminative feature transforms to low dimensions in low dimensions,” Advances in Neural Information Proc . Systems ( NIPS '01 ), MIT Press, Cambridge, MA, Dec. 2001. [187] Deniz Erdogmus, Kenneth E. Hild II, a nd Jose C. Principe, “Blind deconvolution of linear channels by minimizing or maximizing Renyi's entropy,” submitted to IEEE Trans . on Signal Proc ., Jun 2002. [188] Jens Blauert, Spatial Hearing, The Psychophysics of Human Sound Localization , MIT Press, Cambridge, MA, 1983. [189] Kari Torkkola and William M. Campbell, “Mutual information in learning feature transformations,” Intl . Conf . on Machine Learning ( ICML '00 ), Stanford, CA, pp. 1015-1022, June 2000. [190] Kari Torkkola, “Visualizing class stru cture in data using mutual information,” Neural Networks for Signal Proc . ( NNSP '00 ), Sydney, Australia, pp. 376-385, Dec. 2000. [191] Kari Torkkola, “On feature extrac tion by mutual information maximization,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '02 ), Orlando, FL, pp. 821-825, May 2002. [192] Michael D. Richard and Richard P. Lippmann, “Neural network classifiers estimate Bayesian a posteriori probabilities,” Neural Computation , Vol. 3, No. 4, pp. 461483, Winter 1991. [193] Chanchal Chatterjee and Vwani Roychowdhur y, “Statistical risk analysis for classification and feature extraction by multilayer perceptrons,” Intl . Conf . on Neural Networks ( ICNN '96 ), Washington, DC, Vol. 3, pp. 1610-1615, June 1996. [194] Qi Li and Biing-Hwang Juang, “A new algorithm for fa st discriminative training,” Intl . Conf . on Acoustics, Speech, and Signal Processing ( ICASSP '02 ), Orlando, FL, Vol. 1, pp. 97-100, May 2002. [195] Alain Biem, Shigeru Katagiri, and B iing-Hwang Juang, “Pattern recognition using discriminative feature extraction,” IEEE Trans. on Signal Proc ., Vol. 45, No. 2, pp. 500-504, Feb. 1997.

PAGE 198

187[196] Vladimir Nedeljkovic, “A novel multilaye r neural networks training algorithm that minimizes the probability of classification error,” IEEE Trans. on Neural Networks , Vol. 4, No. 4, pp. 650-659, July 1993. [197] T. Okadada and S. Tomita, “An optimal orthonormal system for discriminant analysis,” Pattern Recognition , Vol. 18, No. 2, pp. 139-144, 1985. [198] Jerome H. Friedman, “E xploratory projection pursuit,” Journal of the American Statistical Association , Vol. 82, No. 397, pp. 249-266, March 1987. [199] Isabelle Guyon, Jason We ston, Stephen Barnhill, and Vl adimir Vapnik, “Gene selection for cancer classification us ing Support Vector Machines,” Machine Learning , Vol. 46, No. 1-3, pp. 389-422, Jan.-Mar. 2002. [200] Dongming Xu and Jose C. Principe, “F eature evaluation using quadratic mutual information,” Intl . Joint Conf . on Neural Networks ( IJCNN '01 ), Washington, DC, Vol. 1, pp. 459-463, July 2001. [201] John W. Fisher III and Jose C. Prin cipe, “A methodology for information theoretic feature extraction,” Intl . Joint Conf . on Neural Networks ( IJCNN '98 ), Anchorage, AK, Vol. 3, pp. 1712-1716, May 1998. [202] Qun Zhao and Jose C. Principe, “Support vector machines for SAR automatic target recognition,” IEEE Trans . on Aerospace and Electronic Systems , Vol. 37, No. 2, pp. 643-654, Apr. 2001. [203] Robert E. Schapire, Yoav Freund, Pete r Bartlett, and Wee Sun Lee, “Boosting the margin: A new explanation for the effectiveness of voting methods,” The Annals of Statistics , Vol. 26, No. 5, pp. 1651-1686, May 1998. [204] Kari Torkkola, “Nonlinear feature tran sforms using maximum mutual information,” Intl . Joint Conf . on Neural Networks ( IJCNN '01 ), Washington, DC, pp. 2756-2761, July 2001. [205] Ralf Herbrich and Jason Weston, “Ada ptive margin support vector machines for classification,” Intl . Conf . on Artificial Neural Networks , Edinburgh, UK, pp. 880885, Sept. 1999. [206] Nello Cristianini, John Shawe-Taylor, a nd Peter Sykacek, “Bayesian classifiers are large margin hyperplanes in a Hilbert space,” Intl . Conf . on Machine Learning ( ICML '98 ), Madison, WI, pp. 109-117, July 1998. [207] P. W. Frey, and D. J. Slate, “Letter recognition using Holland-style adaptive classifiers,” Machine Learning , Vol. 6, No. 2, 161–182, March 1991.

PAGE 199

188[208] Moninder Singh and Marco Valtorta, “Cons truction of Bayesian Belief Networks from data: A brief survey and an efficient algorithm”, International Journal of Approximate Reasoning , Vol. 12, No. 2, Feb. 1995, pp. 111-131. [209] Noam Slonim and Naftali Tishby, “Agglomerative Information Bottleneck,” Advances in Neural Information Proc . Systems ( NIPS '99 ), MIT Press, Cambridge, MA, pp. 617-623, Nov. 1999.

PAGE 200

189BIOGRAPHICAL SKETCH Kenneth E. Hild II received the B.S. degree in electrical engineering in 1992 and the M.S. degree in electrical engineer ing in 1996, both from the University of Oklahoma. From 1995 to 1999 he was employed full-time with Seagate Technologies, Inc., where he served as an advisory development engineer in the Advanced Concepts group. During his Ph.D. program at the University of Florida, he studied information theoretic learning, blind source separation, and feature extraction. Kenneth is a member of IEEE, Tau Beta Pi, and Eta Kappa Nu.