Citation |

- Permanent Link:
- https://ufdc.ufl.edu/UF00100781/00001
## Material Information- Title:
- Principal component analysis with multiresolution
- Creator:
- Brennan, Victor L., 1956- (
*Dissertant*) Principe, Jose (*Thesis advisor*) - Place of Publication:
- Gainesville, Fla.
- Publisher:
- University of Florida
- Publication Date:
- 2001
- Copyright Date:
- 2001
- Language:
- English
## Subjects- Subjects / Keywords:
- Databases ( jstor )
Eigenvalues ( jstor ) Eigenvectors ( jstor ) Error rates ( jstor ) Image classification ( jstor ) Mathematical vectors ( jstor ) Matrices ( jstor ) Pixels ( jstor ) Principal components analysis ( jstor ) Signals ( jstor ) Dissertations, Academic -- Electrical and Computer Engineering -- UF ( lcsh ) Electrical and Computer Engineering thesis, Ph. D ( lcsh ) Human face recognition (Computer science) ( lcsh ) Principal components analysis ( lcsh ) Signal processing -- Digital techniques ( lcsh ) Synthetic aperture radar ( lcsh ) - Genre:
- bibliography ( marcgt )
theses ( marcgt ) government publication (state, provincial, terriorial, dependent) ( marcgt ) non-fiction ( marcgt )
## Notes- Abstract:
- Eigenvalue decomposition and multiresolution are widely used techniques for signal representation. Both techniques divide a signal into an ordered set of components. The first component can be considered an approximation of the input signal; subsequent components improve the approximation. Principal component analysis selects components at the source resolution that are optimal for minimizing mean square error in reconstructing the original input. For classification, where discriminability among classes puts an added constraint on representations, PCA is no longer optimal. Features utilizing multiresolution have been demonstrated to preserve discriminability better than a single scale representation. Multiresolution chooses components to provide good representations of the input signal at several resolutions. The full set of components provides an exact reconstruction of the original signal. Principal component analysis with multiresolution combines the best properties of each technique: 1. PCA provides an adaptive basis for multiresolution. 2. Multiresolution provides localization to PCA. The first PCA-M component is a low-resolution approximation of the signal. Additional PCA-M components improve the signal approximation in a manner that optimizes the reconstruction of the original signal at full resolution. PCA-M can provide a complete or overcomplete basis to represent the original signal, and as such has advantages for classification because some of the multiresolution projections preserve discriminability better than full resolution representations. PCA-M can be conceptualized as PCA with localization, or as multiresolution with an adaptive basis. PCA-M retains many of the advantages, mathematical characteristics, algorithms and networks of PCA. PCA-M is tested using two approaches. The first approach is consistent with a widely-known eigenface decomposition. The second approach assumes ergodicity. PCA-M is applied to two image classification applications: face classification and synthetic aperture radar (SAR) detection. For face classification, PCA-M had an average error of under 2.5%, which compares favorably with other approaches. For synthetic aperture radar (SAR), direct comparisons were not available, but PCA-M performed better than the matched filter approach. ( , )
- Thesis:
- Thesis (Ph. D.)--University of Florida, 2001.
- Bibliography:
- Includes bibliographical references (p. 120-123).
- System Details:
- System requirements: World Wide Web browser and PDF reader.
- System Details:
- Mode of access: World Wide Web.
- General Note:
- Title from first page of PDF file.
- General Note:
- Document formatted into pages; contains xi, 124 p.; also contains graphics.
- General Note:
- Vita.
- Statement of Responsibility:
- by Victor L. Brennan.
## Record Information- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- All applicable rights reserved by the source institution and holding location.
- Resource Identifier:
- 50743663 ( OCLC )
002729315 ( AlephBibNum ) ANK7079 ( NOTIS )
## UFDC Membership |

Downloads |

## This item has the following downloads: |

Full Text |

PRINCIPAL COMPONENT ANALYSIS WITH MULTIRESOLUTION By VICTOR L. BRENNAN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2001 ACKNOWLEDGMENTS I am grateful for the support I have received from family, colleagues, and faculty at UF. It is difficult to single out a few people to thank when many have been supportive and encouraging. I wish to thank my advisor, Dr. Jos6 Principe, not only for sharing his technical expertise and insight, but especially for his patience and encouragement. I want to thank Leonard and Carolina Brennan, who have been loving parents and inspirational role models. I am most grateful to my wife, Karen, for her love and for her support in every endeavor in our lives. TABLE OF CONTENTS page ACKNOW LEDGMENTS ............................. ii LIST OF TABLES ...................... ......... vi LIST OF FIGURES ..................... ......... vii ABSTRACT ... .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. x CHAPTERS 1 INTRODUCTION ................... ....... 1 1.1 Classification .................. ....... 2 1.2 Principal Component Analysis (PCA) ............. 4 1.3 Multiresolution .......... ....... ....... 5 1.4 PCA-M ............ .......... ..... .. 6 1.5 Image Classification Experiments ...... ...... ... 6 1.6 MSTAR Experiment ........... ........... 7 2 PCA ....... ............ ........... .... 9 2.1 The Eigenvalue Problem .............. ... . 9 2.2 An Example .................. ........ .. 11 2.3 PCA .................. ....... .... . 15 2.4 Deflation Techniques ................ . . 17 2.5 Generalized Hebbian Algorithm ............... .. 19 2.6 Eigenfilters .................. ......... .. 22 2.6.1 Low-Pass Test Signal ................ 23 2.6.2 High Pass Test Signal ................. .. 25 2.6.3 Mixed Mode Test Signal ................ .. 26 2.7 Properties .................. ......... .. 27 3 MULTIRESOLUTION .................. ..... .. 29 3.1 Two Notes Example ........ . . . . . .... 30 3.2 Quadrature Filter and Iterated Filter Bank . . . ... 32 3.3 Discrete Wavelets .................. ... .. 34 3.4 Haar Wavelet .................. ....... .. 35 3.5 A Multiresolution Application: Compression . . . ... 35 4 PCA-M 4.1 Definition of PCA-M .................. ..... 38 4.1.1 Localization of PCA .................. .. 38 4.1.2 A Structure of Localized Outputs . . . ... 41 4.2 The Classification Problem .................. .. 43 4.3 Complete Representations .................. .. 44 4.3.1 Eigenfaces .................. .. .. 45 4.3.2 Identity Map .................. .. 50 4.3.3 Iterated Filter Banks ................. .. 52 4.3.4 Dual Implementation of PCA . . . . ..... 56 4.4 Overcomplete Representations ................ .. 59 4.5 Local Feature Analysis ..... ........... .... 60 4.5.1 Output Vector .................. ..... 61 4.5.2 Residual Correlation .................. .. 62 4.5.3 K ernel . . . . . . . . . . . .... . 63 4.5.4 LFA on ORL Faces .................. .. 63 4.5.5 Localization for LFA and PCA-M . . . ... 66 4.5.6 Feature Space for LFA, PCA, and PCA-M ...... 67 4.6 Summary ....... ....... ......... .... 67 5 FACE RECOGNITION EXPERIMENT .............. .. 69 5.1 ORL face Database .................. .. 70 5.2 Eigenfaces .................. ......... .. 71 5.2.1 Description of Experiment .............. .. 72 5.2.2 Results ........... . . . . . 73 5.3 Face Recognition using HMM's ................ .. 73 5.3.1 Markov Models .................. ..... 74 5.3.2 Description of Experiment .............. .. 75 5.3.3 Results . . . . . . . . . . . .... . 76 5.4 Convolutional Neural Networks ................ .. 77 5.4.1 Self-Organizing Map ..... . . . 77 5.4.2 Convolutional Network ................ .. 78 5.4.3 Description of Experiment ..... . . . . 79 5.4.4 Results . . . . . . . . . . 80 5.5 Face Classification with PCA-M ... . . . 80 5.5.1 Classifier Architecture .... . . . . 81 5.5.2 Data Preparation ............. .. .. ..82 5.5.3 Fixed Resolution PCA Results . . . . 82 5.5.4 Haar Multiresolution. ................ . 84 5.5.5 PCA-M ... .. .. .. .. .. .. .. .. ... .. . 85 6 MSTAR EXPERIMENT ............. 6.1 SAR Image Database ......... .............. 89 6.2 Classification Experiment ......... ........... 90 6.3 Basis Arrays for PCA-M ............. ... .. .. 93 6.3.1 Level 3 Components .................. .. 94 6.3.2 Level 2 Components .................. .. 94 6.3.3 Level 1 Components .................. .. 94 6.3.4 Decorrelation between Levels . . . . ..... 95 6.4 A Component Classifier .................. .. 96 6.5 Classifications using Several Components . . . .... 98 6.6 A Simple Discriminator ............... . . 102 6.7 False-Positive and False-Negative Errors . . . . .... 104 6.8 Observations .................. ........ 106 7 CONCLUSIONS AND FURTHER WORK . . . . 108 7.1 Conclusions . . . . . . . . . . . ... .. 108 7.2 Future W ork ....... ......... ...... 111 7.2.1 Segmentation of the Input . . . . . 111 7.2.2 Component Selection . . . . . . 111 7.2.3 Conditioned Data and Non-Linear Classifier ...... 112 APPENDIX A ABBREVIATIONS .................. ........ .. 113 B OLIVETTI RESEARCH LABORATORY FACE DATABASE .. 115 C MSTAR IMAGES ....... .................... 116 REFERENCES ....................... . . . ....... 120 BIOGRAPHICAL SKETCH .................. ......... 124 LIST OF TABLES Table page 4.1 Normalized Eigenvalues .................. ....... .. 49 4.2 Energy Distribution of Exemplars ................ 49 5.1 Error Rates of Several Algorithms ................ . 71 5.2 Face Classification CN Architecture .................. 80 5.3 Fixed Resolution PCA Error Rates over 10 Runs . . . .... 82 5.4 Error Rates for PCA-M with Magnitude of FFT ........... ..86 5.5 Component Misclassifications (200 Test Images) ........... ..87 6.1 Input Data .................. .............. .. 90 6.2 Classification using First Component .................. 97 6.3 Misclassifications with Individual PCA-M Components . . . ... 99 6.4 Error Rate (5/68 7.,!'-) using 3 Components ............ .100 6.5 Error Rate (2/68 = 3.0'.,) using 10 Components . . . ..... 100 6.6 Overall Unconditional Pc with Template Matching . . . . ... 102 6.7 Overall Unconditional Pc with PCA-M ................ ..102 6.8 Determining an Threshold for Detection ................ ..103 6.9 Ten Components without Rejection ................. 104 6.10 Ten Components with Rejection .................. .. 104 6.11 Detector Threshold .................. ....... .. 105 6.12 Performance at 91' Pd ............... ..... . 106 LIST OF FIGURES Figure 1.1 1.2 2.1 2.2 2.3 2.4 2.5 2.6 2.7 3.1 3.2 3.3 3.4 3.5 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 Conceptual Steps in a Classifier . . ....... PCA-M Classifier . . ............... Sam ple D ata . . . . . . . . . . . . Original (left) and Scaled (right) Data . . ... GHA Linear Network . . . ............ Low Pass Test Data . . . ............. Low Pass Data PCA . . ............. High Pass Data PCA . . . ............ Test High and Low Frequency Data . . . .... T wo N otes . . . . . . . . . . . . Quadrature Filter . . . .............. Discrete Wavelet Transform with 2 Levels . . . . Equivalent 2"' Filter Bank DWT Implementation . Three Levels of Decomposition on the Approximation PCA-M for Classification . . . .......... PCA and PCA-M in Feature Space . . ..... Raw Images from ORL Database . . ...... Residual Images for GHA Input . . ....... All-to-one and One-to-one Networks . . . .... Eigenfaces from GHA Weights . . . ....... Three Level Dyadic Banks . . .......... First Four Eigenimages . . . ........... page 3 7 12 15 20 23 25 26 27 31 33 34 35 36 40 44 46 46 47 48 52 54 4.9 Three Level Decomposition of a Face ................. .. 55 4.10 Output of Quadratic Filter Bank ............. .... .. .. 55 4.11 Localization of a Global Output .................. .. 58 4.12 Local Feature Analysis .................. ....... .. 60 4.13 PCA Reconstruction MSE .................. .... .. 64 4.14 PCA Reconstructions .................. ...... .. .. 64 4.15 LFA Outputs (Compare to PCA Reconstruction) . . . ... 65 4.16 LFA Kernel and Residual Correlation (Look for Localization) . . 66 5.1 Varying Conditions in ORL Pictures ................. .. 71 5.2 Parsing an Image into a Sequence of Observations . . . .... 74 5.3 Markov Model .................. ............ .. 75 5.4 Top-down Constrained State Transitions ............... .. 75 5.5 SOM-CN Face Classifier .................. ... .. .. 77 5.6 Initial Classifier Structure .................. ... .. 81 5.7 Training and Test Data at Different Scales .... . . . . 83 5.8 PCA-M Decomposition of One Picture ............ . .. 84 5.9 Selected Resolutions . ............... ...... .. 85 5.10 Final Classifier Structure ................ ... ... .. 86 6.1 Aspect and Depression Angles ................ ...... 89 6.2 Experiment Overview . ............... ...... .. 91 6.3 Three Levels of Decomposition on the Approximation . . . . 93 6.4 PCA-M Decomposition of a BMP2 Input ... . . . . 95 6.5 The Templates for Three Classes for PCA-M Component 1 . . 96 6.6 First Component of SAR Images Projected to 3-Space . . . ... 98 6.7 Class Templates for other PCA-M Components . . . . 99 6.8 Clustering in 3-Space using All PCA-M Components . . . ... 101 6.9 Probability of Detection versus False Alarm Rate . . . ... 106 B.1 Olivetti Research Laboratory Face Database .. . . . . 115 C.1 BMP2 Training and Test Data ............. .... . 116 C.2 T72 Training and Test Data ................ .... .. 117 C.3 BTR70 Training and Test Data ............... . .. 118 C.4 Confuser Data .................. ............ .. 119 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy PRINCIPAL COMPONENT ANALYSIS WITH MULTIRESOLUTION By Victor L. Brennan T; ,- 2001 Chair: Jos6 Principe Major Department: Electrical and Computer Engineering Eigenvalue decomposition and multiresolution are widely used techniques for signal representation. Both techniques divide a signal into an ordered set of components. The first component can be considered an approximation of the input signal; subsequent components improve the approximation. Principal component analysis selects components at the source resolution that are optimal for minimiz- ing mean square error in reconstructing the original input. For classification, where discriminability among classes puts an added constraint on representations, PCA is no longer optimal. Features utilizing multiresolution have been demonstrated to preserve discriminability better than a single scale representation. Multiresolution chooses components to provide good representations of the input signal at several resolutions. The full set of components provides an exact reconstruction of the original signal. Principal component analysis with multiresolution combines the best proper- ties of each technique: 1. PCA provides an adaptive basis for multiresolution. 2. Multiresolution provides localization to PCA. The first PCA-M component is a low-resolution approximation of the signal. Additional PCA-M components improve the signal approximation in a manner that optimizes the reconstruction of the original signal at full resolution. PCA-M can provide a complete or overcomplete basis to represent the original signal, and as such has advantages for classification because some of the multiresolution projections preserve discriminability better than full resolution representations. PCA-M can be conceptualized as PCA with localization, or as multiresolution with an adaptive basis. PCA-M retains many of the advantages, mathematical characteristics, algorithms and networks of PCA. PCA-M is tested using two approaches. The first approach is consistent with a widely-known eigenface decomposition. The second approach assumes ergodicity. PCA-M is applied to two image classification applications: face classification and synthetic aperture radar (SAR) detection. For face classification, PCA-M had an average error of under 2.5' which compares favorably with other approaches. For synthetic aperture radar (SAR), direct comparisons were not available, but PCA-M performed better than the matched filter approach. CHAPTER 1 INTRODUCTION Principal component analysis with multiresolution (PCA-M) combines and enhances two well-established signal processing techniques for signal representa- tion. This dissertation presents the motivation, the mathematical basis, and an efficient implementation for combining principal component analysis (PCA) with multiresolution. This dissertation also presents the results of using PCA-M as a front-end for two applications: face classification, and target discrimination of synthetic aperture radar (SAR) images. More detailed discussions of PCA, multiresolution (differential pyramids), and PCA-M are presented in subsequent chapters. This introduction is intended as an overview to the presentation of PCA-M. PCA-M was originally developed as an on-line signal representation technique. The intent was to perform real-time segmentation of (time) signals based on variations in local principal components. Tests with simple artificially generated signals were promising, and good results have been reported in applying PCA-M to biological signals (Alonso-Betzanos et al., 1999). The decision to concentrate on images was made when several researchers (Giles et al., 1997; Samaria, 1994; Turk and Pentland, 1991a) applied various techniques against a common database generated by the Olivetti Research Lab (ORL). Each of the researchers also cited an approach which decomposed a set of facial images into component eigenfaces. It became possible to compare the performance of PCA-M to the results of other researchers using fixed resolution PCA technique and more computationally intensive non-PCA image classification techniques. We will start by providing a brief overview of the fundamental concepts required to understand PCA-M. 1.1 Classification Classification is the assignment of an input signal x = [xI, x2, Xd]T to one of K classes (Bishop, 1995, pp. 1-10). x --> Ck, 1 Each input x is assigned a label y E {1, 2, .. K}. The value of the label y corresponds to the assigned class. The classification problem can be formulated in terms of a set of discriminant functions yk with parameters, w, Yk y (X; w). (1.1) An input x is assigned to class Ck if k max {yj(x; w)}. (1.2) Each class has a corresponding discriminant function. A signal x is input to each discriminant function. The function with the highest output assigns a label to the input (eq. 1.2). While difficult problems can be addressed by more complex (e.g., nonlinear, muiltil:v-r) discriminants, an alternative approach is to attempt to simplify the problem by some transformation K of the raw data, Uk = max {yj (K(x),w)}. (1.3) I The output of the transformations or projections is called a feature and the output space is called the feature space (Duda and Hart, 1973). The size of the feature space can be larger or smaller than the original space (Fukunaga, 1990; Vapnik, 1998). Traditionally, in statistical pattern recognition the feature space is smaller than the input space. One of the unsolved problems is how to determine the feature space and its size to improve classification accuracy. A feature is a projection that preserves discriminability. A fortuitous choice for the transformation extracts features that differ between classes but are similar within a class. Undesirable features differ within-class or are similar between-class. Heuristics have been the most utilized method of selecting good features. The problem is the following. Optimal classifications in high dimensional spaces require prohibitive amounts of data to be trained accurately (Fukunaga, 1990; Duda and Hart, 1973). Hence the reduction of the input space dimensionality improves accuracy of the estimated classifier parameters and improves classifier performance. On the other hand, projections to a feature subspace may decrease discriminability, so there is a trade-off that is difficult to formulate and solve (Fuku- naga, 1990). x N1)x) Yk Data Transformation Classifier Figure 1.1: Conceptual Steps in a Classifier Experience has shown that local features tend to preserve discriminability better than global features. Hence the widespread use of wavelets and other multiresolution techniques as feature extractors for classification (Bischof, 1995). More recently there has been work proposing feature spaces of higher dimen- sionality than that of the original input space (Vapnik, 1998). High dimension spaces increase the separability between classes, enabling the use of linear dis- criminators that have fewer parameters to estimate than the optimal (B i,-i ,i:) classifiers. When analyzed from the feature extraction point of view, projection to high dimensional spaces also enhances the chance of obtaining "b, I 1 i features; that is, where the projections of different classes concentrated more along certain directions. These are called overcomplete representations and 1h :,- have been studied in the wavelet literature (Vetterli and Kovacevic, 1995; Strang and Nguyen, 1996). The big issue is still how to choose the overcomplete dictionary and how to select the best features. 1.2 Principal Component Analysis (PCA) Principal component analysis (PCA) is based on eigenvalue decomposi- tion (Hotelling, 1933). Eigenvalue decomposition has been applied to problems across 1ii i'v disciplines. There is a rich mathematical background and a variety of implementations (Oja, 1982). Given a set of data, a scatter matrix S is calculated to estimate the autocorrelation matrix of the data. N S T= Ti x (1.4) n=l The eigenvector and corresponding eigenvalue pairs (wk, Ak) of S are found by solving Sw = Aw. (1.5) Both the data x and the scatter matrix S can be expanded in terms of the eigen- vectors, S = E WkW Ak, (1.6) x = k WkW X = k WkQk. Analytic and deflation-based iterative approaches are available to solve the eigen- value problem that automatically order the eigenvectors such that A> > > N > . PCA components are uncorrelated and maximal in 12 energy. PCA is one possible transformation for equation 1.3. It has been shown that PCA is optimal for signal representation, but it is sub-optimal for feature extraction (Fukunaga, 1990). Chapter 2 has an expanded discussion of PCA. Although other sets of basis functions are available that are similar to the PCA basis, only PCA can select a reduced set of components that are optimal for reconstruction MSE. 1.3 Multiresolution Multiresolution has been broadly defined as the simultaneous presentation of a signal at several resolutions. An intuitive argument for using multiresolution is available from common experience. Consider watching someone approach from a distance. As the person comes closer, more details are resolvable to allow an observer to make successively refined categorizations. A possible sequence is to first identify a moving object, then a person, the gender of the person, the identity of the person, and finally the facial expressions of the person. Another familiar application of multiresolution is transferring images across low bandwidth channels (internet). People tend to leave a web page if the page takes too long to load. For commercial web sites this translates into a loss of potential customers. On the other hand, many sites feel that customers will not return to a site that does not have a lot of graphics. Some image intensive web pages (e.g., zoo, museum, or auction sites) usually present small images (initially transfer small files). A larger, more detailed version of the image is loaded only if the viewer clicks on the small image. While it is possible to completely reload the larger image, it is more efficient to use information available on the (already loaded) small version and just add the details needed to produce the larger picture. For classification, it is hoped that within-class differences are high-resolution features, and that sufficient desirable features are resolvable at coarse resolution. By using a coarser representation (lower-resolution), it is hoped that undesirable features are sharply attenuated with minor impact on the desirable features. Multiresolution is discussed in chapter 3. 1.4 PCA-M Both PCA and multiresolution have been successfully applied to similar problems. It seems reasonable that an application that has benefited from each individual approach should further benefit from a combined approach. PCA-M is simply multiresolution with an adaptive basis (PCA). I show that a linear network for online, adaptive, multiresolution feature extraction is easily adapted from the networks used for standard PCA. principal component analysis with multiresolution (PCA-M) is implemented with a partially connected, single-li v-r linear network. The same network can be used for both training and normal operation. The training algorithm is a modification of the generalized hebbian algorithm (GHA) (Sanger, 1989). I treat PCA-M in chapter 4. 1.5 Image Classification Experiments Olivetti Research Lab (ORL) has a public face database that serves as a benchmark for comparing different face classification algorithms. Both mul- tiresolution and PCA (Turk and Pentland, 1991a) had been successfully applied against the database. The PCA-M components were used with an almost linear network. The network is linear except for selecting the maximum discriminant (i\ AXNET) (Kung, 1993, p. 48)). --., FFT r H PERL .NE Ll.: S' "- : FFT r L1 HYPERPLANE MAJORITY: r FFT :.~.~ .... ........... SFT r I P r FEATVUE CLASSIFY RAW IMAGE EXTRACTION 10 PCA-M OM PON ENTS Figure 1.2: PCA-M Classifier The ORL database was used to compare PCA-M to several standard fixed resolution transforms (discrete Fourier transform, discrete cosine transform, PCA), and to multiresolution using a Haar basis. PCA-M outperformed PCA at all tested resolutions. PCA-M outperformed the Haar basis if a reduced set of components were used. Results were comparable if the full set of multiresolution components were used. In chapter 5, PCA-M results are compared to classifiers using a fixed res- olution PCA (Turk and Pentland, 1991a, Eigenfaces), a hidden Markov model (HMM) (Samaria, 1994) and a convolutional neural network (Giles et al., 1997). PCA-M had the lowest error rate. 1.6 MSTAR Experiment The 9/95 MSTAR Public Release Data (Veda Inc., www.mbvlab.wpafb.mil). contains synthetic aperture radar (SAR) images of vehicles at various poses (aspect and depression angles). The estimated aspect angle (Xu et al., 1998) of each vehicle was used to assign each vehicle to one of the twelve non-overlapping 300 sectors. Within each sector, multiresolution templates were derived for each class. Chapter 6 shows that PCA-M worked very well in some sectors, but poorly in other sectors. The overall error rate ( 1(' .) was comparable to other template matching 8 procedures (Velten et al., 1998), but poorer than information theoretic and vector support methods. We conclude the dissertation with some comments and future directions for further research. CHAPTER 2 PRINCIPAL COMPONENT ANALYSIS Principal component analysis (PCA) is a technique for representing an image (or a signal) using basis functions that are derived from eigenvalue decomposi- tion of the data autocorrelation matrix. This chapter is an introduction to the eigenvalue problem. A thorough presentation is not possible, but this chapter should contain the information on principal component analysis that is required by subsequent discussion of PCA-M. 2.1 The Eigenvalue Problem Consider a square matrix A of full rank N. A vector w is said to be an eigenvector of A with a corresponding (scalar) eigenvalue A if Aw = Aw. (2.1) The eigenvalue problem (equation 2.1) can be solved analytically by subtracting Aw from both sides, (A IA)w 0. (2.2) Taking the determinant of both sides yields an Nth order polynomial in A called the characteristic polynomial of A, det(A IA|) 0. (2.3) The N roots are the eigenvalues, and each eigenvalue Ak has a corresponding eigen- vector wk. Each solution to the eigenvector problem is a paired eigenvalue and corresponding eigenvector (Ak, Wk). From equation 2.1, it should be clear that if w is an eigenvector and K is an arbitrary scalar, then Kw is also an eigenvector with the same eigenvalue. Given a non-repeating eigenvector A, the corresponding eigen- vector is unique except for scale factor K. Without loss of generality, eigenvectors are usually scaled such that |w | w= w = 1. If the eigenvalues Ak are unique (non-repeated), then a unique eigenvector exists for each eigenvalue. The (normalized) eigenvectors wk are orthonormal. SWfk -jk- (2.4) Define the modal matrix W as the matrix whose columns are the normalized eigenvectors of A, W = [w1 W2 ... WN] (2.5) In general, there are N! permutations of eigenvectors. Without loss of generality, order the eigenvectors such that the eigenvalues are non-increasing, A, > A2 > ... > AN. Define the diagonal matrix A as the matrix with a main diagonal consisting of the (ordered) eigenvalues of A. A1 0 0 SA2 ... 0 A 0 0 ... AN. The eigenvalue problem can be restated in matrix notation, A = WAWT (2.6) If equation 2.6 is satisfied, the matrix A is said to be diagonalizable. For this study, the matrices of interest are real Toeplitz matrices that are alv-- diagonalizable. The orthonormality condition of equation 2.4 can also be restated, WWT = WTW = I. (2.7) A matrix satisfying equation 2.7 is said to be unitary. The modal matrix W is unitary and is said to diagonalize the matrix A. 2.2 An Example The properties of PCA will later be discussed more rigorously, but a quick example should provide an intuitive grasp of some of the properties of PCA. Consider L = 20 vectors of dimension N = 3 arranged into the data matrix X, X = [Xi X2 ... X20o The autocorrelation of the data is estimated by the scatter matrix, 0.1296 0.1372 0.1296 S, = E{xx'} XX' = 0.1372 0.1613 0.1372 L 0.1296 0.1372 0.1296 Eigenvalue decomposition yields 0.5578 -0.4346 -0.7071 0.4104 0 0 W 0.6146 0.7888 0.0000 A 0 0.0102 0 0.5578 -0.4346 0.7071 0 0 0 Each input vector can be interpreted as a set of coordinates. The standard basis functions for the input space are normalized vectors in each of the input coordinates, 1 0 0 e= 0 2 1 63 0 0 0 1 Figure 2.1 (left) plots the data in the input space. The input was constructed to lie near the diagonal of the input space. The first element of each input vector was randomly selected in the interval (-1, +1). The second element was the first element plus Gaussian noise. The third element was set equal to the first component. By construction, all three elements are equal except for the additive Gaussian noise in the second element. The dimension of the signal (part of x excluding the noise) is one. The noise adds a second dimension. Although x is nominally three-dimensional, the data set can be embedded in a two-dimensional space. INPUT SPACE EIGENVECTORS (wl,w2) PLANE ... ........ ..... MA .... Figure 2.1: Sample Data Figure 2.1 (middle) shows the eigenvectors in the input space. Note that the first eigenvector is the line that best fits the data in a mean square error sense. Figure 2.1 (right) shows that the eigenvectors can be used as basis functions for the data. The input coordinates x are rotated to the eigenspace coordinates y by multiplication with the modal matrix W, y = WTx or Y = WTX. The input vectors were drawn from a zero mean distribution, but the sample mean was x E{x}= [0.1476, 0.1649, 0.1476]T If the data is zero-mean, then the scatter matrix is also an estimate of the auto- covariance. The shift, z = x x would produce data with a (sample) mean of zero. It is obvious in this example that the sample mean is a poor estimate of the true mean. The true mean of the output is zero, but the sample mean is y [0.2660, 0.0018, 0.0000], y = E{y} WTx. It is perhaps too obvious to mention that small sample sizes lead to poor char- acterization (e.g., statistical parameters) of a distribution. However, many real applications have a limited amount of data available. Insufficient data will degrade the performance of any algorithm. The scatter matrix of the rotated vectors is Sy = A. Since A is diagonal, the components of y are uncorrelated. 11 Sy = LYYT WTXXTW = WTSW = A. L L The trace of the scatter matrices is invariant under rotation, S1 = Syy = 0.4206 The trace is a measure of the total variation of the data. A linear transformation does not affect total autocorrelation. However, a linear transformation can change the variance of individual components and the cross-correlation between compo- nents. To see the contribution of each component of the input and output to the total variance, divide both scatter matrices by the trace, 0.3082 0.3262 0.3082 0.9758 0 0 S' = 0.3262 0.3836 0.3262 0 0.0242 0 0.3082 0.3262 0.3082 0 0 0 The trace of each scatter matrix in equation 2.2 is one, and the elements along the main diagonal can be interpreted as percentages of total variation. By construction the variation in the input data is distributed almost equally among all three components. The normalized output scatter matrix (equation 2.2) shows that the first component captures 97.1.'. of the variation of the data. The zero eigenvalue in the third column of A indicates that the underlying dimension of the data is two. The input data can be reconstructed from the output data, x =Wy or X = WY. The input data can be perfectly reconstructed from y even if the third component is discarded (3 :'. lossless compression). The input data can be reconstructed with 2.5' mean square error from just the first component of y (i7' compression). The transform did not completely separate the data from the noise. However, the input reconstructed from just the first output component has an enhanced signal-to-noise ratio (SNR). The rotation from the input space X to the eigenspace Y is only one possible rotation. Although it is more obvious to directly examine other rotations of the 3-dimensional input space, it is simpler to examine rotations of the 2-dimensional eigenspace. Consider a set of coordinates z derived from rotating the (non-zero) coordinates in eigenspace through an arbitrary angle a, zi cos(a) sin(a) y] Z 2 sin(a) cos(a)) Y The variance of {zi} is cOS2( Ca)ly +sin2(a)729. X x x X ... v ... X Figure 2.2: Original (left) and Scaled (right) Data Figure 2.2 (left) shows the standard deviations of the two non-zero components of the output. The ellipse in figure 2.2 (left) shows the standard deviations of the data projected along arbitrary unit vectors. Among all possible sets of projections, the variance of an individual component is maximized and minimized when the input is projected against the two eigenvectors wi and w2, respectively. If the second component is scaled so that the variances of the two components are equal (Figure 2.2, right), it is not possible to change the component variances by rotation. 2.3 Principal Component Analysis The Karhunen-Lotve Transform (KLT) uses eigenvalue analysis to decompose a continuous random process, x(t), instead of the random variable discussed in the preceding sections of this chapter. The discrete equivalent developed by Hotelling is called Principal component analysis (PCA), but is also often referred to as Karhunen-Loeve Transformation. A nice discussion is found in Jain (1989, pp. 163-175). Let x be a discrete zero-mean, wide-sense stationary process. Let XN(n) denote a block of length N, XN(n) = [x(n), x(n 1), .. x(n N + 1)]. The (N x N) autocorrelation matrix Rxx is positive-definite and Toeplitz (doubly symmetric and constant along the diagonals) (Kailath, 1980). The eigenvalue decomposition of Rxx is Rxx = E{xN(n)xN(n)T} W A W-1 W A WT Not all matrices can be diagonalized, but symmetry is a sufficient condition. Since Rxx is symmetric, it has N orthogonal eigenvectors even if the eigenvalues are not distinct. PCA is an expansion of XN(n) using the eigenvectors of Rxx. Any N-length block of x(n) can be represented by N N XN(nf) Zx(n k+ 1t^, Z y, Yk(n)Wk. k=1 k=1 The PCA expansion can be more compactly expressed by XN() = WywN(), with inverse, yN(n) = WTxN(), (2.8) where YN(n) = [yi(n), y2 (n), y(n)l. Equation 2.8 can also be interpreted as linear mappings from an N-dimensional space spanned by the standard basis vectors, ek, to an N-dimensional space spanned by the eigenvectors, wk. The autocorrelation matrix of yN(n) is RyY = E{yN(n)yN(n)T} = E{WTXN(n)XN(n)W} = WTE{XN(U)XN(n)}W SWTRxxW = A A represents the correlation matrix between the components yk. The components are uncorrelated and the variance of each component is simply Ak. Interpreting variance as signal energy, trace-invariance under similarity transformations equates to conservation of energy. The original signal can be perfectly reconstructed by the inverse transformation, x(n) yi(n) x(n -1) 2(n) (2.9) x(n N + 1) YN(n) Decorrelation is desirable for analysis since redundant information between components is minimized. Reconstruction is often performed with only the first M < N components for two main reasons, 1. Compression Using only the first M components achieves a M/N compres- sion ratio with minimum 12 reconstruction error. 2. Noise Reduction For signals with additive noise, A is interpreted as a signal to noise ratio (SNR). Reconstruction of x(n) using the high SNR components retains the signal components and excludes the noisy low energy components. 2.4 Deflation Techniques All the eigenvalues of a matrix of rank N can be found analytically by solving a polynomial of order N, det(A IA) = 0. If only the first few eigenvectors are of interest, then one can either use an analyti- cal approach such as Singular Value Decomposition (SVD) (Haykin, 1996), or find an approximation by using a deflation technique. Since the one of the eigenvectors maximizes variance, a vector is found by choosing an arbitrary vector and itera- tively modifying the vector to increase the variance. Once found, the component corresponding to the eigenvector is removed and the input is said to be deflated. The next eigenvector is found by repeating the process on the deflated data. From the basic eigenvalue statement (equation 2.1), Ak W Awk (2.10) Consider an arbitrary vector v and an associated scalar K, S= vTAv (2.11) The eigenvector corresponding to the largest eigenvalue maximizes A1 max(t) max {vTAv} (2.12) Vv Equation 2.11 associates a scalar with each of the vectors in the span of A. The vector associated with the maximal scalar is an eigenvector of A. More specifically, 1. Set A1 = A. 2. Use a gradient based iteration (or power method) on w to maximize A. 3. Set wi = Wpt and A1 = A(wpt). 4. Remove the projection of wi from A1. A2 = A1 wiAiwT 5. In the subspace spanned by A2 the optimal solution to equation 2.12 is now {A2, w2}. Repeat procedure until the desired number of solutions (eigenvectors) is obtained. 2.5 Generalized Hebbian Algorithm Eigendecompositions can be analytically computed by many algorithms (Golub and Loan, 1989). But here we seek sample-by-sample estimators of PCA conducive to on-line implementation in personal computers. There is rich literature on linear networks to evaluate PCA using gradient descent learning rules (Oja, 1982; Haykin, 1994). Being adaptive, the networks take time to converge and exhibit rattling; that is, network values fluctuate around the "true" values. Hence, these networks should not be taken as substitutes for the analytic methods when the goal is to compute eigenvectors and eigenvalues. However, in signal processing applications where we have to deal with nonstationary signals and we are interested in feature vectors for real-time assessment, the "r i i- PCA is very often adequate and saves enormous computation. In fact, the algorithms about to be described are of O(N) (size of the space), instead of 0(N2). Haykin (1994, pp. 391) states that PCA neural network algorithms can be grouped into two classes: 1. reestimation algorithms only feedforward connections, 2. decorrelating algorithms both feedforward and feedback connections. Reestimation algorithms use deflation. The generalized hebbian algorithm (GHA) is a reestimation algorithm that uses a single (computational) l--r linear network to perform PCA on a process XN(n). A nice presentation of GHA can be found in Haykin (1994, pp. 365-394). x(n) xl Figure 2.3: GHA Linear Network Let W denote the (M x N) matrix of network weights and let wk denote a column of W. Figure 2.3 shows the network that extracts the first M < N principal components of the random vector XN(n). The equations for figure 2.3 are T wl T w yM (n) WTXN(n) 2 x(n), M < N. T M To adapt the weights, GHA performs three operations, 1. adaptation of each column wi to maximize variance (energy), 2. adaptation between columns wi to remove the projection from previous components, 3. self-normalizing to keep weights at unit norm. The equations for adapting the weights are p-1 yj(n) = tr,..()xi(n), calculate output (2.13) i=0 t-O Au',,(n) = yj(n) Xidn) i, (n)yk(n) ,update weights (2.14) k-0 The update (equation 2.14) has two terms, r(y j()xi(n) yj(n)r,, ,(n)yj(n)), maximize variance and normalize (2.15) j-1 lyj(n) "I (n)yk(n), remove projections (2.16) k-0 Equation 2.15 has two terms. The first term is the classic Hebbian formulation and has been called the activity product rule (Haykin, 1994, p. 51). The problem with the classic formulation is that the magnitude of the weights increases. Still, the classic Hebbian algorithm is elegant in its simplicity and power. The second term of equation 2.15 is a self correcting adaptation (Sanger, 1989). Equation 2.15 by itself reestimates and normalizes the weights for variance maximization. Equa- tion 2.16 subtracts the projections of previous (higher energy) components. This term is introduced to keep the weights to different output nodes from converging to the same eigenvector. Equation 2.16 also shows that convergence of a principal component is dependent on the convergence of higher energy components. While there is no inherent ordering to the eigenvectors, the implementation effectively creates a sequence of dependencies in the convergence of eigenvectors. The procedure is adaptive and thus suitable for locally stationary data. Even if the process is not stationary, the mapping WN will give perfect reconstruction (WNXN is invertible). PCA is optimal for minimum 12 reconstruction using fixed length filters on a stationary signal. For optimal 12 compression and reconstruction, the first M (out of N) components of yN(n) are used. Using M < N components yields a compression of (N M)/N. Further com- pression is usually obtained by using fewer bits to encode lower energy components. Y (k) [yo(k), y(k), yM-i(k)] = WNxMXN(n). (2.17) and the reconstruction (denoted by subscript c)of the original signal is ,(k) = [(WNxMWNxM) 1WNxM] y(k). (2.18) 2.6 Eigenfilters The direct interpretation of equation 2.8 is that each of the yk(n) is a projec- tion of XN(n) on an eigenvectors ii-, Each projection is found by taking the inner product, yk(n) = ("; xN(n)) = W XN(n) (2.19) The underlying time structure of XN(n) allows a filter interpretation of PCA, XN() = [x(n),x(n- 1),.- ,x(n N + 1)T. Rewriting equation 2.19, N-1 yk(n) = Equation 2.20 is a convolution sum of x with a filter impulse response i,', (n). FIR filters whose coefficients (impulse response) are derived from eigenvalue analysis are called eigenfilters (Vetterli and Kovacevic, 1995). The collection of filters, {"' }i can be interpreted as an analysis bank. If the principal components, y(n) [yi(n), y2(n), YN(n)T. are then processed to reconstruct the original input XN(n), the reconstruction filters form the synthesis bank. Eigenfilters have several key properties: 1. Both the analysis and synthesis banks of a linear network use finite impulse response (FIR) filters, 2. Since the autocorrelation filter is Toeplitz, the eigenvectors are all either symmetric or antisymmetric, 3. Since the modal matrix is unitary, the synthesis filters can be implemented easily (transpose of the analysis bank), 4. Since the components are uncorrelated, the reconstruction from each compo- nent is independent of other components. The remainder of this section illustrates the decomposition of simple test signals. 2.6.1 Low-Pass Test Signal slhOrder M Filer OMIpu 0 0.5 1 1.5 2 25 3 35 4 Tine 03 0.6 0.2 0.4 0 -- 0 - knpls Response Dekby Figure 2.4: Low Pass Test Data A test signal (Figure 2.4) was generated using a 5th-order moving average (V A) filter driven by white Gaussian noise. 5 5 x (n) = 2 2-u (n k) # X(z) (2z)-k U (Z k=0 k=0 The transfer function is x(z) U( (z) 1i K 2 (I 3 (4 2i1] 1 + (+ + + + 2 2z 2z 2z 2z2 The 5 zeros are evenly spaced, 1 e1k7\) z- exp j k [1,2,3,4,5] 23 The first six autocorrelation coefficients are T Rxx(k, 1) 0.3333 0.1665 0.0830 0.0410 0.0195 0.0078 1 Consider the autocorrelation matrix formed by the first six autocorrelation coeffi- cients. The eigenvalues are Ak 0.7952 0.4811 0.2811 0.1859 0.1388 0.1174 The corresponding eigenvectors (columns) are 0.3034 0.4195 0.4816 0.4816 0.4195 0.3034 -0.4940 -0.4687 -0.1905 0.1905 0.4687 0.4940 -0.5350 -0.1244 0.4453 0.4453 -0.1244 -0.5350 -0.4721 0.3313 0.4091 -0.4091 -0.3313 0.4721 -0.3489 0.5554 -0.2640 -0.2640 0.5554 -0.3489 0.1819 -0.4130 0.5444 -0.5444 0.4130 -0.1819 (2.21) Figure 2.5 shows the eigenfilters generated from the eigenvectors shown in equa- tion 2.21. Notice the filter bank structure as we described above, that appears in a self-organizing manner; that is, no one programmed the filters. It was the data and the constraints placed on the topology and adaptation rule that led to a unique set of filter weights. The bandwidth of the filters is dictated by the size of the input delay line (I/NT). This illustrates why PCA defaults to a Fourier transform when the observation window size approaches infinity. Tme -0 2 4 6 8 05e;~;~ FMaj n 2 0_n.4 4L0-, A 49 n o 0.z 0.4 FPhe m 02 n n4 - 0 0.2- -- OA- Figure 2.5: Low Pass Data PCA The 1/f energy distribution can be observed by normalizing the eigenvalues by the trace. Thu sum of the diagonal elements is invariant and can be interpreted as total energy. An eigenvalue divided by the trace can be interpreted as the percentage of the total signal energy belonging to the outputs of the corresponding eigenfilters. Figure 2.5 and equation 2.6.1 show that the eigenfilters are ordered by passband center frequency and output energy. Ak = 0.3977 0.2406 0.1406 0.0930 0.0694 0.0587 Equation 2.6.1 provides some upper bounds for compression. These eigenfilters are expected to be optimal for any signal generated by the filter in equation 2.6.1. 2.6.2 High Pass Test Signal The high pass signal was simply a low to high pass conversion of the low pass signal. The zero at w = r was moved to w = 0. Figure 2.6 shows that PCA-M adapted to order the basis functions by energy. The time-frequency resolution trade-off can be observed by looking across a row and seeing that the shorter filter results in a wider frequency passband. Tme 0.5 0.5 2 4 -0.5 -------- ..r~IPII. 3 J. & Sq FM a 4 2 2 0 01 A 0 0.2 0.4 FPIse 5 0 -10 0 0.2 A Figure 2.6: High Pass Data PCA 2.6.3 Mixed Mode Test Signal The high pass signal and low pass signal were added together for the mixed mode signal. Figure 2.7 shows that PCA-M adapted to order the basis functions by energy. Tme FMoa FPhme 05 FP 5 1, nad S210 S 4 0 0 2 0 .4 0 02 0a Figure 2.7: Test High and Low Frequency Data 2.7 Key Properties of PCA Eigendecomposition has a strong mathematical foundation and is a tool used across several disciplines. Eigenvalue decomposition is an optimal representation in o ,i:v l--, Key properties of PCA include, 1. The elements of A (eigenvalues) are positive and real, and the elements of W (eigenvectors) are real. 2. Aside from scaling and transposing columns, W is the unique matrix that both decorrelates the xN(n) and maximizes variance for components, 3. Since W is unitary, W-1 = WT and reconstruction is easy. The mapping is norm preserving and reconstruction error is easily measured. PCA has several criticisms: 1. The mapping is linear. The underlying structure for some applications may be nonlinear. However, a nonlinear problem can be made into a linear problem by projection to a higher dimension. 2. The mapping is global. Each output component is dependent on all the input components. If important features are dependent on some subset of 28 input components, it would be desirable to have output components that are localized to the appropriate input components. 3. PCA components resemble each other. Approaching the transform as an eigenfilter bank provides some insight. FIR filters have large sidelobes. Orthogonality is obtained by constructive and destructive combinations of sidelobes. It seems typical that the low frequency component is so large that the sidelobes do not provide sufficient attenuation. CHAPTER 3 MULTIRESOLUTION A discussion on multiresolution should start with time signals and the classic time-frequency resolution trade-off. Assume that a recording session produces some (real) analog signal x(t). The analog signal x(t) is sampled at some uniform interval Ts to produce a discrete time signal x(nTs). For convenience, normalize Ts to unity so that x(n) x(nTs). The session x(n) is usually divided into smaller observation windows of duration N (= NTs). The choice for N fixes the resolution of the analysis. Denote a block of data of length N by xN(n). Consider the Discrete Fourier Transform (DFT) of xN(n), XNf(n) )- XN(k) The DFT transforms a vector xN(n) with N real components to a vector XN(k) with N complex components. XN(n) and XN(k) are the time-domain repre- sentation and frequency-domain representation, respectively, of the signal. Each component of XN(k) is a linear combination of the elements of XN(n). That is, each component of XN(k) is feature of the entire input xN(n) and can be lo- calized in time to NTs. The frequency resolution of the output is 1/NTs. The input XN(n) has high-resolution in time (Ts), but no resolution in frequency. As N increases, the output XN(k) loses resolution in time and gains resolution in frequency. The time and frequency resolution of the output are fixed by the single parameter N. Ideally, there might be an optimal choice for N, the observation window length. For example, N would be matched to the duration of key features in the signal. Sometimes, however, it can be difficult to make a judicious choice for N if, 1. key features are not known, 2. the optimal length is different among the key features. Under such situations, multiresolution is an alternative to fixed resolution repre- sentations. The DFT is a fixed resolution representation since each component of XN(k) has the same resolution. In the context of the above discussion, a mul- tiresolution representation would be a representation XN(k) whose elements have varying resolution. More generally, multiresolution is the representation of a signal across several resolutions. 3.1 Two Notes Example The two notes example is now found in many standard texts on time-frequency techniques; this section is an abbreviated version of Kaiser (1994). Consider a signal composed of "notes" of single frequency, and the problem of detecting the number of notes that occur in a time interval. Figure 3.1 Kaiser (1994) shows a signal consisting of two single frequencies that occur at different times. Figure 3.1: Two Notes Theoretically, the two notes can be separated by using either frequency or time information. However, a frequency representation has no time resolution and limited (by the observation window length) frequency resolution. Unless the notes are sufficiently separated in frequency, a standard Fourier transform of the signal will not resolve the two notes. Certainly, in the extreme case where the two notes are at the same frequency, time domain information is necessary to isolate the notes. A Fourier transform cannot take advantage of the time information to help resolve the individual notes. Similarly, the time representation has no frequency resolution and limited (by the sampling interval) time resolution. If the two notes are not well separated in time, but well separated in frequency, time-domain analysis cannot separate the notes. The corresponding extreme case is that if the two notes overlap in time, frequency domain analysis is necessary to resolve the two notes. Clearly, it is desirable to use both time and frequency domain information. One of the first (combination) time-frequency techniques is the Short Term Fourier Transform (STFT) or windowed Fourier Transforms (Porat, 1994, 335- 337). The signal x(n) is divided into subintervals of some fixed length. The essential approach is that instead of using a single transform X(k) over the entire time interval, a Fourier transform is taken over each subinterval. The results are displ i' ,I in a waterfall plot with time and frequency forming two axes and the magnitude of the frequency on the third axis. The waterfall plot provides information on how the frequency content of a signal changes over time. Discussion of implementation details, such as window functions and overlapping windows, can be found in Strang and Nguyen (1996); Vetterli and Kovacevic (1995). The relevance to this work is that a signal can be represented using both time and frequency using a fixed-resolution (constant block length N) technique such as the STFT. Again, an important consideration for a fixed-resolution analysis is choosing a ;ood" window length. If the window length is too long, time resolution is lost. If the window length is too short, frequency resolution is sacrificed. Figure 3.1 shows that either time or frequency resolution (or both) may be critical for a given application. The transition from fixed resolution to multiresolution can be performed with iterated filter banks that will be further discussed in chapter 4. 3.2 Quadrature Filter and Iterated Filter Bank An iterated filter bank uses variable length windows to provide high frequency resolution at low frequencies and high time resolution at high frequencies. A dyadic filter bank uses a pair of filters to divide a signal into two components. The two filters are designated Ho(z) and Hi(z) (Figure 3.2). SYNTHESIS Figure 3.2: Quadrature Filter The filters must be chosen to divide a signal into orthogonal components that can later be used to perfectly reconstruct the original signal. Familiar choices for dyadic filters include, 1. simple odd-even decomposition, 2. quadrature modulation filters in communications (sin and cos components), 3. quadrature mirror filter (Hi(z) = -Ho(z)) (Strang and Nguyen, 1996, 109). The quadrature mirror filters Ho(z) and Hi(z) are constructed as low- pass and high-pass filters, respectively. A dyadic iterated filter banks is formed by passing the output of Hi(z) or Ho(z) into another identical filter bank (Figure 3.4). A series of cascaded low-order filters is equivalent to a single high- order filter. Time resolution decreases and frequency resolution increases with the number of low-order filters in the cascade. An intuitively appealing approach is to iterate the low frequency component; this approach will be discussed in more detail in the next section. The rationale is that low frequency components do not require high time resolution since low frequency implies slow changes. The quadrature mirror filter was an early implementation; a more recent approach is the use of wavelets (Strang and Nguyen, 1996). ANALYSIS 3.3 Wavelets in Discrete Time Mathematically, passing a signal through a filter and downsampling can be presented as projection against basis functions. The design of filters is equivalent to finding appropriate basis functions. For wavelet analysis, a function is chosen as the mother wavelet. The basis functions at each level correspond to some dilation (scaling) of the mother wavelet. Within a level, all the basis functions are non-overlapping time-shifted versions of the same function. The scaling and shifting allow time resolution over intervals less than NTs. It is desirable that basis functions from different levels are orthogonal, but linear independence is sufficient. A standard approach to multiresolution is to use a cascade of 2-bank (high- pass H1 and low-pass Ho) filters (Figure 3.3). The outputs of the analysis filters are downsampled by a factor of 2, then the low-pass output is cascaded into another analysis bank of 2 filters. The process is repeated for the desired number of levels. The reverse operation takes place at the synthesis bank {Gi}. r( H,-HIGH ^ 2J 21 G, i LJ LOW (s) ""vD,, i a21)2 G, ANALYSIS SYNTHESIS Figure 3.3: Discrete Wavelet Transform with 2 Levels Again, the rationale for choosing this sequence of operations is that high frequency components can change quickly, hence the highest frequency component should be sampled most often. The lowest frequency component changes the least frequently and can be downsampled several times. The iterated tree structure can be implemented as a parallel structure (Figure 3.4). ANALYSIS SYNTHESIS Figure 3.4: Equivalent 2" Filter Bank DWT Implementation 3.4 Haar Wavelet The Haar wavelet uses the simplest set of basis functions. The low pass filter is ho (k) = [1, 1] and the high pass filter is hi (k) [1, -1]. The matrix W for a two level decomposition is shown in equation 3.1. For the 2-level Haar example, the input is divide into segments of length N = 4 and, 0 0 1 1 1 1( ( 1 ( V4(k) WN. N) ()3t S1 11 1 The matrix W is invertible so perfect reconstruction is possible. Since W is or- thonormal; that is, W-1 = WT, no further calculations are needed for constructing the synthesis filters. The Haar wavelet has the worst frequency resolution; other basis functions since Morlet) may be more appropriate depending on the desired trade-off between resolution in time and frequency. 3.5 A Multiresolution Application: Compression A standard multiresolution application is representing a signal (image) at different scales. Figure 3.5 shows an example of a Haar decomposition (not the Haar transform) that is dyadic along each dimension. Several Iterations of Haar Decomposition Original Iteration 1 Iteration 2 Iteration 3 Pyramid Figure 3.5: Three Levels of Decomposition on the Approximation The Haar basis vectors are el = -[11] and 62 = 21 1]. For 2-dimensions, the basis vectors are 1el1, 6162, 6261, 6262 (the 2-D bases are separable and identi- cal). An image is partitioned into non-overlapping (2 x 2) blocks and each block is projected against the basis vectors. Using non-overlapping blocks is similar to a polyphase filter and is more computationally efficient than downsampling the projections. The first projection is simply an average of each (2 x 2) block and gives a good compressed approximation of the original image. The other three detail images have the information needed (in addition to the approximation) to perfectly reconstruct the original image. That is the detail signals have the information needed to correct the reconstruction from the approximation. This is slightly different than the pyramidal approach that provides a single correction at the lower scale. The procedure can be repeated on the approximation to provide an approximation at the next level of compression. All the information for creating the first (level 1) approximation is contained in the original (level 0) image. Simi- larly, an approximation at any level only uses information from the approximation at the previous level. Lower level approximations have more information (spatial resolution) and less compression than high-level approximations. Clearly, there is less data to process if classifications can be performed with compressed images (smaller matrices). Our main interest in multiresolution is in deriving multiscale localized features for classification. Inputs presented at different scales lead to extraction of features 37 at different scales. The next chapter continues the discussion of multiresolution with more focus on deriving and using multiresolution features in PCA-M. CHAPTER 4 PRINCIPAL COMPONENT ANALYSIS WITH MULTIRESOLUTION 4.1 Definition of PCA-M In this section, we formally treat localization with PCA. PCA is briefly dis- cussed in a context of representation and feature extraction. PCA-M is presented as PCA with localized outputs. The localized outputs of PCA-M are structured to provide a multiscale representation. The section ends with a formal definition of principal component analysis with multiresolution. 4.1.1 Localization of PCA Consider a set of K training images, STRAIN = {01(n), - ", (n>),* OK(n)}. The pixels of each image ', (n), are indexed by n E S. Define, K Xk (n) i. (n) Y (n). k=1 If the training images already have zero mean, then xk(n) = /, (n). Principal component analysis has two stages, training and testing (verification). The first stage derives a set of eigenvectors and eigenvalues, (,'. (n), Am), for the set of training images. Denote the set of eigenvectors by ', 'I = {Qd(n),...,,, (n),... ,,' ,(n)}. As discussed earlier, the number of non-zero eigenvectors ,M, is the minimum of the number of exemplars, K, or the number of components of each exemplar, N. The training stage of PCA finds a mapping from a set of training images to a set of eigenvectors, @TRAIN (4.1) Equation 4.1 emphasizes that the eigenvectors and eigenvalues are characteristics of a set of input images. Since the input images and the eigenvectors have the same spatial index, the eigenvectors are also called eigenimages. Once trained, PCA uses the eigenimages to decompose each new input onto a set of components. The second stage of PCA is a mapping from an input image to a set of M output scalar components, Xk(( ) > { 1, ... Y ... M}k. (4.2) Equation 4.2 shows that each input has a unique set of outputs (components). Since the association from input to output is usually implicit, notation can be simplified. Rewrite equation 4.2, x(n) {(i,...,y m,-... ,M}- (4.3) Each component is global since its value is calculated using all the pixels of the input image, rn < x(n),, (n) > x(n)', )n). (4.4) nES The dependency of a global output on a specific input pixel is seen by differentiat- ing equation 4.4, V (n S): (y) = (n). (4.5) a x(n) Output localization implies that an output is dependent only on a local set of pixels. Consider a subregion of the pixels, A C S. A local output could be specified by a (YLOCAL) = (n), (4.6) Sx ((4.) SX( 0, otherwise. Localization could arise naturally if some of the eigenvector components were zero, ((n) = 0. PCA-M forces localization by explicitly manipulating equation 4.6. Definition 4.1.1 Consider a set of N-dimensional inputs, x(n). The components of each input are indexed by n E S, where S = [1,..., N]. Let A be a subset of S such that A corresponds to a localized time interval ifx(n) is a time .:,il or to a local region ifx(n) is an image. Denote the subregion of an input by xA(n). Let wA(n) be the corresponding eigenvector (eigenimage). A localized PCA-M output is YA =< xA(n),wA((n) > xA(n)wA(n). (4.7) nEA Q (x) O( 6 (x)) PCA-M FEATURE CLASSIFIER k DATA EXTRACTOR ASS Figure 4.1: PCA-M for Classification Before defining a localized eigenvector, we want to restate our goals for PCA-M so that the design choices will be understood. First, both PCA and PCA-M provide representations, but this does not mean that '1! i, can be automatically used for feature extraction. In fact, PCA-M components are not directly constructed for optimal discrimination. So PCA-M should be understood as a preprocessor for classification that constrains the scale and locality of subsequent features (Figure 4.1). Second, given that PCA-M is not the ideal feature extractor, care should be taken that no information is lost. Since PCA-M cannot identify whether some information is needed for discrimination, all information should be propagated to the classifier. PCA-M should (and can) provide at least a complete representation of the input in the space of the training set. That is, some information from non-training exemplars is ahb--,v- lost since the eigenspace is a subspace of all possible inputs. If the application is classification, an overcomplete representation (redundancy) is not only acceptable, but may be essential. Nonetheless, we design PCA-M with a minimum amount of redundancy. It is usually much easier to add redundancy than reduce redundancy. Finally, in allowing each output to have inputs of varying geometry (scale and shape), localized eigenvectors are not guaranteed to be orthogonal. Since I'. :,- are not orthogonal, it is an abuse of nomenclature to continue to refer to the localized eigenvectors as i;, i ,. I tors". Since the PCA-M network weights converge to the localized eigenvectors, we will henceforth call them PCA-M weights. Orthogonality has a direct impact on designing PCA-M since many implementations of PCA involve deflation in some form. While it is not ahl--,v- apparent from the network architecture, there is an inherent sequencing of calculations. As the weights for the first output are calculated, the weights and output are used to reconstruct an estimated input. The input is deflated by the estimate, and the deflated inputs are used as "effective inpl for calculating the weights of subsequent outputs. 4.1.2 A Structure of Localized Outputs Definition 4.1.1 is not an implementable definition for a localized PCA-M output since the corresponding localized eigenvector wA(n) is not yet defined. If all outputs are supported by the same region, the eigenvectors are found using standard PCA with the localized input, xA(n), as the new input. If each output is supported by a separate non-overlapping region, the localized eigenvector is found by treating each region as a separate standard PCA problems. When the outputs are localized to overlapping regions of varying geometry (size and shape), the meaning of orthogonality becomes unclear. That is, eigenvectors that derived from the same subregion of the training images are orthogonal. Also, eigenvectors that are each derived from non-overlapping subregions are orthogonal. However, eigenvectors that are each derived from partially overlapping subregions are not generally orthogonal Definition 4.1.2 Consider a set of N-dimensional inputs, x(n), with an associated set of M eigenvectors, b(n). For both the inputs and eigenvectors, components are indexed by n E S, where S = [1,..., N]. PCA-M is an iterative procedure: 1. For the first eigenvector, partition S into R(1) subregions such that U = s, (4.8) rE[1.../R(1)] 2. treating each subregion as a separate .:,u ,;. .,l;., problem, calculate the first eigenvector for each region, 3. 1. fl,/.': each region of the input, and use the 1. fIrl ,l input as the effective input for subsequent calculation. The geometry of the partitions can hI,,.- for each iteration. The number of iterations to span the input space will not exceed M. A global scalar output ym is replaced by an array of R(m) localized outputs, Ym = [Y1 ... Yr(m) ... YR,(m)], (4.9) where Yr(m) = Xk(n)wm(n), where, A = ,Sr(m). (4.10) nEA Each array of outputs, ym, is a compressed version of the input image. A fine partitioning (R(m) large) corresponds to a fine resolution for ym. If the partitions are identical for each array of outputs ym, the representation has fixed resolution. Analogous to equation 4.2, the mapping for PCA-M is Xk(f) ) {yn(r(m))}k. (4.11) The PCA network trains M sets of weights corresponding to full eigenimages. The PCA-M network has ( m1 R(m)) sets of weights corresponding to partitioned eigenimages. The PCA-M network replaces each scalar global output, ym, with an array of localized outputs, ym. Constraints on the weights of the GHA network allow control of the partitioning (number and composition) for each output. Control of the structure of the partitions sets the localization and scale for the PCA-M network. 4.2 The Classification Problem The applications presented in the next chapter use PCA-M for classification. This section restates the basic classification problem so that PCA-M can be discussed in the context of feature extraction. Given a choice of several classes Ck and some data x,, a basic classifier assigns the input to one of the classes. X, Ck.- Equivalently, the class index k is a function of the input x, k =g(xn). (4.12) For example, a classifier could identify that photograph x, belongs to person k. Mathematically, designing a good classifier is finding a good mapping function g. Each class of data contains features; that is, characteristics that are useful for classification. Ideally those features, 0(x), could be separated from the I -.- less" characteristics in the raw data. Presented with only the pertinent data for classification, the task of the classifier becomes easier. k f(O(x)) g(x). (4.13) Worsening the resolution of the inputs is intended to remove details that are not needed for classification, while retaining coarse features that are needed. In general, too fine a resolution retains unneeded details, while too coarse a resolution discards critical information. A multiresolution approach provides a structure for control of the detailed information. PCA-M can provide multiscale representations of an exemplar to allow extraction of features at different scales (section 4.3.3). PCA-M can also be used to directly localize a global eigenimage (section 4.3.4). A search for features in the universal space may not be feasible, but the eigenspace may be too restrictive for feature extraction. PCA-M provides a space richer than eigenspace (figure 4.2), but still keeps the dimensionality under control. PCA-M Features PCA Features All Features Figure 4.2: PCA and PCA-M in Feature Space 4.3 Complete Representations The section on eigenfaces is a classical application of PCA to images. PCA can be conceptualized as PCA-M with the coarsest partitioning (equation 4.8), Vm : R(m) =1. The identity map is presented for contrast as PCA-M with the finest partitioning, Vm : R(m)= N. The iterated filter bank and dual decomposition have milder restrictions on the partition sizes but implement constraints based on stationarity. The identity map can be considered as PCA with fully localized outputs. The next section is an example of PCA with global outputs. The subsequent section on iterated filter banks describe a structure with outputs of varying localization. 4.3.1 Eigenfaces The theoretical side of standard PCA has been discussed earlier. The eigen- faces section presents an implementation of standard PCA using the generalized hebbian algorithm (GHA) presented in section 2.5. Since the theory and struc- ture has been discussed earlier, this section presents experimental results. The PCA decompositions presented in this section are used for comparison to PCA-M decompositions (in the next section) of the same set of data. The GHA network is a simple, flexible and efficient way to implement PCA. Minor structural modifications lead to multiresolution. This section presents the GHA Network used for standard PCA. Subsequent sections then step through several modifications. With each modification, we present: 1. the network structure, 2. the changes in the representation that arise from the modifications, 3. an example of the representation using one of the faces drawn from the ORL database. Figure 4.3 shows K = 10 pictures from the ORL database. These ten pictures { '}(k 1...10) are all of the same person and cropped to R = 112 rows and C = 92 columns. 9v~"S: Le Figure 4.3: Raw Images from ORL Database Each input ', = (n) is described by two indices. The index k identifies the specific exemplar, and the index n specified the specific component of -, For time signals, n is an one-dimensional index and the components are time samples. For images, n specifies the row and column of the image's pixels; n can be either a two-dimensional vector index or an one-dimensional index to a rasterized version of the image. The class average o4 is formed by averaging the ten faces. 1 K k-1 After subtracting the average from each face, the residual images {xk}(k= 1..10) are shown in figure 4.4. Figure 4.4: Residual Images for GHA Input The residual images are each presented at the input -1-. r of a network similar to figure 4.5(left). r: IN OUT O IN OUT 0 O *0--- X is a coln of IN OUT IN OUT Figure 4.5: Two single la r networks with: each output driven by all inputs (left), and each output driven by single input(right) The number of input nodes for the GHA network is determined by the dimensions of the exemplars, 10304 (112 x 92). The number of non-zero output nodes can be no more than the number of linearly independent inputs; for this example, there are ten output nodes. The network has a single computational 4.16.r with every input connect to each output; each output has 10304 associated weights. For convenience, construct the input matrix X X(k, n) such that each input image Xk is a column of X, As each exemplar is presented at the input I-u-, r, the output is calculated and the weights are updated, using equations 4.15 and 4.16. p-1 yj(n) = t,.. ()xi(n), calculate output (4.15) i=0 i-O Au',.(n) = ryj(n) Xi(n) i (n)yk(n) update weights (4.16) Equation 4.16 shows that each output is affected by prior outputs. This depen- dency is shown in figure 4.5 (left) by dashed lateral connections between output nodes. Since the training set had ten linearly independent images, the network converges to ten sets of weights, {wk}k=- ..10. The (10304 x 10) transformation matrix WA is constructed by setting each wk as a column of WA, WA [w w21 ... wiol. (4.17) The weights are eigenvectors that will be described by w = wk W k(n) depending on the context. The index k identifies the corresponding output nodes of the GHA network. The ordering for GHA output nodes orders the eigenvectors such that the corresponding eigenvalues are in decreasing order. Each wk has 10304 components that can be arranged in an (112 x 92) array corresponding to the positions of the associated input components. When the eigenvectors are arranged as a two- dimensional array, the eigenvectors are also called eigenimages. The eigenimages resemble the input faces. Because of this resemblance, the eigenvectors are also called eigenfaces. Figure 4.6: Eigenfaces from GHA Weights Possibly the widest application for PCA is in signal representation and reconstruc- tion. The training inputs can be perfectly reconstructed as a linear combination of eigenfaces. The quality for reconstruction of other inputs depends on the degree that the exemplars are representative of other inputs. The eigenvalues are an indication of the reduction in reconstruction MSE. Table 4.1 shows that the inputs can be reconstructed with about 10' reconstruction error using just the first four eigenfaces and that the contribution from the last eigenface is negligible. Table 4.1: Normalized The eigenface expansion provides reconstruction for the network inputs. The inputs are the residual images (exemplars minus the average image). The reconstruction of the original exemplars requires adding back the class average. Because the PCA analysis characterizes residual images, it is expected that the average image is poorly reconstructed by eigenvalue expansion. This is especially interesting since most of the energy is in the average image (table 4.2). In a classification problem, there is an average image for each class as well as a single average for all images across all classes. The high energy in the individual class averages ,I-.-:- -1- that each class is characterized by the variation of its class average from the overall average, not by the variations in the individual images. Table 4.2: Energy Distribution of Exemplars Energy in Energy in Percent in Exemplar Average Component Component X1 0.2970 0.0150 4.81 X2 0.3614 0.0212 6.,', X3 0.3130 0.0166 5.3-2 X4 0.3454 0.0154 4.95', x5 0.3492 0.0114 3.' ,.,' X6 0.3419 0.0168 5..!,' x7 0.3155 0.0099 3.19', Xs 0.3034 0.0122 3.92' X9 0.3243 0.0153 4.9 :' xio 0.3126 0.0144 4.1 ,'. Avg 0.3264 0.0148 4.7 .'. AI 23.9i'. A6 5.8 :' , A2 21.7 !' A7 4.0"'- A3 19.-.' As 3.6 !' , A4 o10.,, A9 3.0 :' , A5 7.92'- A10 0.011' , Eigenvalues 4.3.2 Identity Map Some properties of the standard GHA network are more evident when con- trasted to another network. This section provides a subjective discussion of the following modifications to the standard GHA network: 1. the number of output nodes, 2. the dependencies (lateral connections) between output nodes, 3. the number and selection of inputs to an output node. Since the key modification of PCA-M involves controlling the inputs scale to an output node, it is instructive to examine the extreme case of a single output for each input. This structure can be realized with a network that has all inputs connected to each output, but with the constraint that only one weight is non-zero. The network is shown in figure 4.5 (right) showing only the non-zero connections. The inputs (figure 4.4) have not changed, so there are still 10304 input nodes corresponding to the dimensions of the input exemplars. By design, there are also 10304 output nodes to provide an output for each input. Each output node has the same spatial localization as the corresponding input node. This architecture is actually 10304 independent GHA networks operating independently so the number of outputs does not exceed the number of linearly independent exemplars. Without loss of generality, stipulate that the final weights are normalized. Construct the (10304 x 10304) transformation matrix Wi, W1 = [W W2 ... 10304] (4.18) It should be evident that the transformation matrix W1 is a (10304 x 10304) identity matrix. The transformation matrices from equations 4.17 and 4.18 provide insight to several key consequences of partial connections. 1. Span of the output space: Standard PCA has an output space that is a small subset of the image space. The output space of the identity transformation is the entire space of (112 x 92) images. A feature that is desirable for classification might lie outside a restrictive face space. 2. Compression: An image Q described using standard PCA requires a system that memorizes 11 images (the average image and the 10 eigenfaces), but describes each (112 x 92) input with at most 10 coefficients. The il,. il il transformation requires 10304 coefficients for each input image. 3. Resolution: The outputs of PCA are global. Each output is dependent on all the input pixels. Each output of the identity transformation is dependent of a single input pixel and thus has the same resolution as the input image. Spatial resolution of an output can be controlled by simply limiting the number of inputs. 4. Orthogonalization of Eigenvectors: The orthogonalization of standard GHA arises by deflation (virtual lateral connections). The orthogonalization of the id.. -il il v transformation arises from non-overlapping inputs. 5. Decorrelation of Outputs: The outputs of standard GHA are decorrelated and a repeated application of PCA decomposition changes nothing. The outputs of the identity transformation are correlated. These outputs can be decorrelated by adding another 1-,-.r (a GHA 1~,-V-r) to the network. 6. Class Features: For the GHA network, class information is in the weights, and the individual exemplar information is in the outputs. For the identity transformation, there is no information in the weights. Class information must be extracted from the outputs. In a network structured between the two extremes of standard GHA and an identity transformation, several tradeoffs can be considered. We feel that PCA-M enhances control of the span of the output space and of localization. 4.3.3 Iterated Filter Banks The eigenface decomposition and the identity map may be considered as two extreme cases of fixed resolution PCA. The iterated filter bank structure is the first multiresolution network presented in this chapter. The iterated filter bank is also of interest since it is a way of implementing PCA-M for a complete represen- tation. The concepts follow from prior sections and the focus is in describing the architecture and constraints. Allowing partial connections makes it possible to arbitrarily assign inputs to outputs and control orthogonality between outputs. The number of possible networks increases dramatically. For example, if each output is connected to two inputs there are C(n, r) = C(10304, 2) = 53081056 unique combinations of inputs. For each pair of inputs there are two orthogonal outputs, so it is possible to construct a network with over 108 orthogonal outputs. -0 O--mN1(2) HH1 2 t 1 -N-4 O-- O -4 2) 2N2(2) 1 N5 -6 -- O A'3(2) H 1 r 3 (2 1 7 ~- O Q~N ^^ 04(2) H2 2 -NN -- o T;5(4) 3 2 1 --0 O N6(4) -- 0 0 7(8) rHH 2 Nl-4 H--*0l NO8(8) H(2 HH2) 4 5 6 [HL(2 T (HH2(2 T HL3))) K AN7 IN OUT N8 IN OUT HL1(2T (HH2(2T HH3))) ---N Figure 4.7: Three Level Dyadic Banks One possible structure mimics the structure of a dyadic filter bank. The structure of a three level dyadic bank is shown on the top left of figure 4.7 with the equivalent polyphase construction on the bottom right. The filters used in this example are constrained to be two tap FIR filters. The filters are derived from the eigenvectors of the (2 x 2) scatter matrices of the data at each stage. Four sequential outputs of the first filter correspond to the outputs at the first four nodes (N1 4) of the network. Two sequential outputs of the second filter correspond to the outputs at the fifth (N5) and sixth (6) nodes of the network. The GHA network iterates on the lowest energy (variance) component. For inputs that have 1/f energy distributions, the lowpass or highpass components can be selected by using Hebbian or anti-Hebbian learning for the output node. The network (figure 4.7 left) has four nodes (N1 4) that are each connected to two non-overlapping contiguous inputs. The numbers of connected inputs are in parenthesis after the node labels. Output nodes N5 6 are each connected to four contiguous non-overlapping inputs, and the last two nodes N7 8 are fully connected. The weights of N1 4 are constrained to be equal since i!, i- correspond to a single filter in the filter bank. For the same reason, the weights for outputs N5 6 are constrained to be equal. GHA orthogonalizes weights by deflating the inputs to subsequent output nodes. Deflation is ineffective for non-overlapping inputs. The first four nodes have no orthogonalization constraints from other nodes. Node N5 is directly affected only by nodes N1 2 since the inputs of those two nodes are partitions of the input to N5. Node N6 is directly affected only by nodes N3 4 (indirectly influenced by N1 2 because of the equality constraint between N5 and N6). The weights of the fully connected outputs N7 8 are constrained by all earlier nodes. The three-stage dyadic (twofold) bank produces eight outputs for each set of eight inputs. For the ORL images, a quadratic (fourfold) filter was used. At each stage the inputs were partitioned into non-overlapping (2 x 2) blocks. To parallel the dyadic filter bank, all regions are constrained to have the same weights. The scatter matrix of the first stage inputs is 0.99 0.95 0.91 0.90 0.95 1.00 0.89 0.93 S= (4.19) 0.91 0.89 1.00 0.95 0.90 0.93 0.95 1.01 Due to the local nonstationarity of images, the autocorrelations differ by mode (horizontal, vertical, diagonal) as well as by lag. The scatter matrix is doubly symmetric but not (in general) Toeplitz. The first four eigenvectors are shown in figure 4.8 and show that the filter is essentially separable along the horizontal and vertical modes. U,,, Figure 4.8: The first four (2 x 2) eigenimages are separable odd-even decomposi- tions. From left to right: even horizontal and vertical (A1 = 0.9415), odd horizontal and even vertical (A1 = 0.0346), even horizontal and odd vertical (A1 = 0.0183), odd horizontal and vertical (A1 = 0.0056) The weights (features) are separable even (low-pass) and odd (high-pass) decom- positions. Figure 4.7( bottom-right) shows that the iterated filter bank develops longer filters by cascading shorter filters. Short eigenfilter coefficients are driven by PCA symmetry constraints and cannot adapt to data statistics. Figure 4.9 (top) shows the outputs of three stages. For display purposes, each image was normalized so that pixel intensities lie in the range (0, 1). The first stage of the filter bank produces four outputs shown in the four top left panels of figure 4.9. The four images are downsampled and arranged as a (2 x 2) array of compressed images as shown in the top right panel. There is no implied ordering or spatial relationship in the arrangement of the compressed images. We place the compressed image to be iterated in the top-left of the (2 x 2) array. The low-pass component (top-left) is passed to another iteration. The downsampled outputs of the second stage are shown in the bottom, far left panel. The panel is di-p'l 't, ,I at double scale. The third stage outputs are shown in the bottom, middle left panel the figure 4.9. The (2 x 2) array of outputs from the third stage output is displ ,i'- . at four times the actual scale. The outputs of all three stages are combined in the bottom middle image. For comparison, the residual (original minus average) and the original image are shown on the bottom right. Figure 4.9: Three Level Decomposition of an Exemplar Face The first stage outputs of all ten inputs are shown in figure 4.10. Figure 4.10: Output of the First Stage of the Quadratic Filter Bank for the Ten Training Exemplars A large disadvantage of the iterated filter bank is limited adaptability: 1. Short Filter Length Constraint The weights cannot adapt to class statistics (no global feature extraction). 2. Equal Weight Constraint The weights are constrained to be the same for all subimages (no localization of features). The chief advantage of the iterated filter bank is that an orthogonal basis is used at each stage. The overall linear transformation is orthogonal and guarantees perfect reconstruction of the inputs. The iterated filter bank structure produces a multiresolution representation at the output. If the compressed representations are difficult to classify, then feature extraction must be implemented by another stage. 4.3.4 Dual Implementation of PCA The filter bank structure first extracts small highly localized features, then builds up to global features. Another alternative is to find global features before local features. In general, it cannot be guaranteed that the resulting localized eigenvectors will form a minimal spanning set. That is, while we can still guarantee a set of vectors to span the space of training exemplars, PCA-M might not produce a minimal spanning set. As previously mentioned, a minimal spanning set is not required and perhaps not desired for classification. If the process is wide-sense stationary (WSS), then the network weights form a pair-wise linearly independent set of vectors. The discussion of orthogonal bases will be deferred to the next section. The behavior of PCA-M in going from long filters to short filters is clearer when discussed in the context of the dual PCA decomposition. PCA can be done using the transpose of the data matrix X ( 4.14). The main consideration is that the number of exemplars is usually much smaller than the dimension (number of components) of an input. The dual scatter matrix is SD X= X' (4.20) Continuing with the ORL example, the original scatter matrix is a doubly sym- metric (10304 x 10304) matrix. The dual scatter matrix is a (10 x 10) matrix. An analytic solution to PCA has operations in the order of 0(N2), so there is a significant computational advantage to using the dual PCA. Standard S=XX' XX' WAW' XX'W WA(W'W) XX'W =WA X'XX'W=X'WA X'X(X'W) (X'W)A V X'W Dual SD =X' X'X VAV' X'XV VA(V'V) X'XV=VA1 XX'XV=XVA XX'(XV) (XV)A W=xv Equation 4.21 contrasts the standard decomposition S = WAW' to the dual decomposition SD = VAV'. The dual eigenvalues are equal. The eigenvectors W can be calculated from the dual eigenvector V, W =XV. (4.22) The columns of the matrix W are orthogonal but not orthonormal. Normalizing W results in W. The dual formulation can significantly reduce computations when the number of exemplars is smaller than the spatial dimension of the exemplars. The formu- lation also provides an alternative interpretation to PCA-M. Consider a single exemplar xi, and a single eigenface wl (equation 4.17). Partition each array, Xl,(1,1) Xl,(1,2) Xl,(2,1) Xl,(2,2) (4.21) Xl (4.23) Wl,(1,1) Wl,(1,2) W1 Wl,(2,1) W1,(2,2) The projection of x1 against w1 is a single global scalar, i =< xi, Wi >- < xl,(r,c), Wl,(r,c) >= 4.12. (4.24) r C Equation 4.3.4 shows that the global output yi can be considered as the sum of localized terms. By partitioning xl and wl into smaller subarrays in a manner similar to equation 4.23, the global output yi can be replaced by an array of localized outputs. yl, localized 2.77 -4.21 14.17 -8.61 The localization of output yi can be extended to full resolution. Figure 4.11 shows continued yi localized into blocks of (8 x 8), (4 x 4),(2 x 2), and finally (1 x 1) (full localization). The array of localized outputs can be considered a compressed representation of the input. 8x8 4x4 2x2 1 1 Figure 4.11: Localization of a Global Output It is interesting to note that as the eigenface w1 is partitioned, each segment has the same dual eigenvectors. In principle this is similar to the iterated filter bank in that the dual (rather than the primal) eigenvectors are preserved globally. (4.25) The meaning of the variations across exemplars was not explored since the vari- ations seem to arise from misalignment during data collection rather than from an inherent feature of the class. Although multiresolution can be easily extended to standard PCA by simply partitioning the standard eigenvectors, the repre- sentations that result are ahl--,v- overcomplete. An overcomplete representation contains redundancy that is a disadvantage if the application is to compactly transmit information. For classification, an overcomplete representation may be advantageous. 4.4 Overcomplete Representations The preceding section showed two v--v- that PCA-M could be constrained to produce complete or overcomplete representations. For the iterated bank, the weights for each partition were constrained to be equal. In the dual approach, the statistics over the entire image were assumed stationary. For classification, features are important and overcomplete representations are satisfactory. In this section the flexibility of the single ?1v. r GHA network is discussed. The single computational 1?v, r GHA network outputs are determined by the inputs. There are three main mechanisms for controlling the input, 1. control deflation from other outputs, 2. mask all inputs outside a selected region, 3. place explicit constraints in the training. The original algorithm is to provide deflated inputs to successive outputs. For fixed resolution PCA, the deflation is needed to prevent different outputs from having weights converge to the same values. For multiresolution, the deflated inputs are only required if the input nodes are identical, otherwise deflation is optional. By selectively constraining weights to zero, an output can be localized to subregions of arbitrary size and shape. The subregions need not be convex or connected (e.g., a region for both eyes but excluding the nose). Each subregion of the image can be of a different size and shape. Overlapping regions are allowed to reduce edge effects. Shifted outputs can be introduced to facilitate shifts (translations) in the image. Each partition is allowed to have different statistics and allowed to converge to a local subeigenimage. Relaxing all the constraints produce a richer set of characteristics. Relaxing constraints also complicates the implementation. The specific choices for a PCA-M network are discussed in the experiments. The overall approach, however, was to relax a single constraint at a time until the network's classification performance was adequate. 4.5 Local Feature Analysis Penev and Atick (1996) report great success in face classification using a technique called Local Feature Analysis(LFA). The improvement in performance is attributed to localized feature extraction. Atick has also implemented a commercial automated face classification program, (Facelt, http://venezia. rockefeller. edu/group/papers/full/AdvImaging/index.html) for workstations using LFA. r-------------------------------------- F Ensemble Eigenvectors, Y(n,m) K(n,m) of Inputs P /V Eigenvalues, A P(n,m) PCA/SVD] e- N ---------------------- lf--------------- LFA Single MAPPER Single Input, (k(n) Output, Ok(n) Figure 4.12: Local Feature Analysis Figure 4.12 shows that LFA is based on PCA. The top cascade of operations calculates class properties. The bottom row is the LFA mapper proper. Assume a set of K inputs, ', (n), which are exemplars of a single class. Each input has a spatial dimension n that can be rasterized such that 1 < n < N. For the ORL faces N is equal to the number of pixels in each input, N (R x C) = 10304. PCA is the eigenvalue decomposition of inputs' scatter matrix S(NxN). In general, eigenvalue analysis yields a square modal matrix T(NxN)(n n2) and a single diagonal matrix of eigenvalues A. The eigenvalue decomposition produces K eigenvectors, but only the first M eigenvectors are retained (M < K < N). The truncated expansion uses eigenvectors ,'. (n) with corresponding eigenvalues Am. The modal matrix is not square since the number of linearly independent inputs K is less than the dimension N. Since M < K < N, the modal matrix T(n1, n2) is (N x M). Using the eigenvectors as a basis, each input can be reconstructed, K ,(n) -, ,', (n), m=1 M (n) A, (a). m=l LFA introduces some new quantities that will be discussed in more detail in separate subsections, O(n) A YrM A Output vectors, K(ni,) A 3CM1 ''* (Inl) 'i (I2), LFA kernel, P(ni, 2) A M= '. i. 2), Residual Correlation. The LFA kernel K(na, n2) is a topographic (Penev and Atick, 1996, p. 5) analog for the modal matrix. The residual correlation P(n n2) is comparable to the matrix of eigenvalues A. The LFA output is similar to the reconstructed input in PCA. 4.5.1 Output Vector Using the LFA kernel, an output O(n) is computed for every input 0(n), M O(nl) K(ni, a2)(2) AA, ,'. (4.26) The output O(n) is of the same dimension as the input 0(n). The LFA output is the PCA reconstruction except that each eigenvector is normalized (scaled to unit norm), <, ',. >= Am(m,l) < 1%'' > J(mI) (4.27) For convenience, the expressions for input and output are repeated here, M (A) ,Am (4.28) m=l M 0(n) M A- (4.29) The main difference (between a PCA reconstruction and an LFA output) seems to be that normalizing the eigenvectors the scale factor de-emphasizes terms with large eigenvalues in equation 4.28. In PCA, these are the terms that are the most important for reconstructions with minimum mean squared error (\!SlK;). On the other hand, it has also been sr--.-- -i. 1 that eliminating the first principal components (that are the low frequency components in iI Ii ,!" (1/f) images) can compensate for differences in illumination level. The objection to discarding the first few principal components is that essential discriminatory information may be lost. LFA's approach of de-emphasis rather than outright elimination of the first eigenvectors may provide features that are robust with respect to illumination. 4.5.2 Residual Correlation The residual correlation matrix definition can be rewritten in matrix form, P(ni, n2) A T 1 '' ( (2 = E' l L. K 2 -i'M+1 '(. /, (/2), (4.30) 1 EM+1 (ni> (n2). If the full set of K eigenvectors is used for LFA output expansion, then the residual correlation of the output is the identity matrix. If only a subset M < K of the eigenvectors is used, there is a residual correlation as shown in equation 4.30. Atick and Penev note that the LFA output correlations pplen to be" localized. 4.5.3 Kernel The scatter matrix of the input data can be rewritten, 4)' = AP'V (4.31) (,x,,. (x)' Writing the expressions for the scatter matrix and the inverse kernel together, (4.32) K- = Z M= A '/ ''nl' (ni2) The inverse kernel is comparable to the original scatter matrix except for the number of terms (M < N), and the scale factor. The key difference in LFA seems to be the scaling by A/X, otherwise, the analysis is similar to a partial PCA reconstruction. 4.5.4 LFA on ORL Faces The ORL database has ten exemplars of each person. K = 9 exemplars were used for training and one retained for evaluating the expansion. Figure 4.13 shows normalized (by input image power) reconstruction MSE as a function of number of components. The starting MSE (at x = 0) is from just using the average image. Each of the lines is for a different input . Normalized Fondrtrudbn MSE 2 3 4 5 6 Nurrber ol Eiganvealoi 7 8 9 10 Figure 4.13: PCA Reconstruction MSE Figure 4.14 shows the reconstruction 0r(x) using only the M = 8 eigenvectors corresponding to the eight largest eigenvalues. Note that the poses that are not fully frontal have artifacts, the reconstructions using M = 9 are indistinguishable from the inputs. The error from adding the tenth component is an artifact incurred by implementation limitations on numerical accuracy. Figure 4.14: PCA Reconstructions Figure 4.15 shows the corresponding LFA outputs, LFAOulpuls, W9 Figure 4.15: LFA Outputs (Compare to PCA Reconstruction) The kernel and residual correlation matrices each have 106,172, 416 (10, 304 x 10, 304) t 108 elements. Each row of the kernel is a sum of scaled PCA eigenfaces. Penev and Atick (1996) shows that local features can be found from an appropriate linear combination of global features. What conditions are needed so that an arbitrary image (e.g., a local feature) can be reconstructed using eigenimages (global features)? Clearly, the local feature must be contained in the span of the eigenspace. The span of the eigenspace is dependent on the number of independent training exemplars. That is, as the number of independent training exemplars increases, the span of the eigenspace increases. Figure 4.16 shows the first five rows, each row reshaped to a (112 x 92) image of the LFA Kernel (top row) and Residual Correlation (bottom row), LFA lMarices, N9 Kernel, Ki.y) RFihaualC orreaibn, P(x,y) Figure 4.16: LFA Kernel and Residual Correlation (Look for Localization) 4.5.5 Localization for LFA and PCA-M PCA-M parses the input into spatial subregions to obtain localized feature. It is assumed that pixels that are close (spatially) are more likely to be related that pixels that are widely separated. Similarly for time signals, events that occur in a close interval of time tend to be better correlated as opposed events that are separated by large intervals of time. LFA based classification is significantly more involved than finding localized features. LFA is a PCA based technique that is designed to obtain groups of pixels that are highly correlated. Coincidentally, highly correlated pixels were found to be spatially localized. A further coincidence is that the localized regions corresponded to local ph:,--i. I1 features. LFA (and PCA-M) could have produced local regions of pixels that do not correspond to any ph,!:--i, 1 features. The approach seems very elegant and some of the techniques might be applied to PCA-M in the future. In particular, LFA provides a statistically based approach to parsing an image into a minimal set of arbitrarily shaped and highly correlated subregions. That is, LFA provides a framework for grouping pixels into localized regions based on correlation rather than simple .,li i:'ency. Further, LFA provides a nice verification that, for faces, spatially localized features are well correlated. 4.5.6 Feature Space for LFA, PCA, and PCA-M For a given set of input exemplars, LFA and PCA have the same feature space. To derive local features, both PCA and LFA rely on the eigenspace having a sufficiently large span so that local features are included. In both PCA and LFA, the eigenspace can only be increased by using more (linearly independent) training exemplars. That is, LFA localized features are a linear combination of a large number of training exemplars. Kirby and Sirovich (1987) estimates that a dimensionality of at least 400 is needed for adequate representation of tightly cropped faces with PCA. Penev (1999) states that a dimensionality of 200 (at least 200 exemplars) is needed for adequate representation of faces with LFA. In the ORL example, with only nine exemplars of (112 x 92) images, any linear combination will be I ...--like" and not localized. PCA-M is a multiresolution technique that encompasses classical PCA as an extreme case within the PCA-M definition. PCA-M directly manipulates local- ization by partitioning the exemplar images. PCA-M then adapts to the second order statistics in each localized region. Localization of features is independent of the number of training exemplars. Further, PCA-M facilitates construction of multi-scale features. That is, PCA-M can be utilized with global features as well as (local) features of varying scale. 4.6 Summary PCA-M can be used to directly derive features for classification. Local features can be found by explicitly selecting regions that correspond to local pl,--i. I1 structures. Unfortunately, there is often no a priori way to know that the best mathematical features correspond to given ]l,!:1-i' 1 features. If a priori information is available, the GHA network can structure the PCA-M network in a very flexible manner. Unlike LFA which looks for local features by exhaustive linear combinations of global features, PCA-M can explicitly selecting regions that correspond to local pl,!:v -i 1 structures. PCA-M seems to be sufficiently useful in providing localized inputs to an- other classifier such as a neural network. The neural network can then choose to construct features that are global or local. The classifier can create features that are combinations of PCA-M outputs at a single scale, or combine PCA-M outputs of several scales. The subsequent experiments showed that a single l1-,. r PCA-M network followed by a single 1*.--r classifier performed comparably or better than more complicated structures. CHAPTER 5 FACE RECOGNITION EXPERIMENT In recent years, automated face recognition has received increased interest while simultaneously becoming more feasible. Surveillance and medical diagnostics are two broad classes of applications that have driven the demand for image recognition technology. Hardware for image recognition has shown a trend towards higher performance, increased accessibility, and lowered costs. Numerous advances have been made in face recognition algorithms (Chellappa et al., 1995, pp. 705 - 706). Automated face recognition and classification has many practical applica- tions (C'!i I!! ppa et al., 1995, p. 707). In general, automated face recognition is a complex problem that requires detecting and isolating a face under unknown lighting conditions, backgrounds, orientations, and distances. However, there are several applications where the lighting, scale, and background can be expected to be well controlled: personal identification (credit cards, passports, driver's license), mug shot matching, automated store/bank access control. In these applications, detection and isolation of the faces is not necessary, Non- linear distortions in the images (due to lighting, background, centering, scaling, or rotation) can be controlled during data collection (and assumed to be negligible). C'!i I lppa et al. (1995) presents a nice survey that includes background material on psychology and neuroscience studies, face recognition based on moving video, and face recognition using profile faces. The scope of this dissertation is limited to automated face recognition based on frontal, still photos. Given a database of exemplar images, the basic face recognition problem is to identify individuals in subsequent images. The performance expectations for automated face recognition are high since most people can recognize faces despite fairly adverse conditions. For a machine, the task involves detecting faces from a cluttered background, isolating each face, extracting features from each face, and finally classification of the face. This chapter includes an extended presentation of three specific classifiers: the original eigenfaces experiment, a Hidden Markov Model, and a convolutional network. All three techniques have been applied to the same (Olivetti Research Lab) face database under similar conditions. Finally, the PCA-M classifier is presented against the same ORL database. 5.1 ORL face Database Olivetti Research Lab (ORL) has a public face database reproduced in appendix B. The database has 400 pictures made up from 10 pictures of 40 people. The images are (112 x 92) = 10304 pixel, 8-bit .i .-i'" i,! images. The images in the ORL database present a non-trivial classification problem. The pictures show variation in background lighting, scale, orientation, and facial expression (figure 5.1). The tolerance in scale is about 2n' and the tolerance for tilting is about 200 (Giles et al., 1997). Individuals who used eyeglasses were allowed to pose both with and without eyeglasses. Some people looked very similar to each other (figure 5.1, far right). scale Variation in ORL Face Database expression lighting similarity Figure 5.1: Varying Conditions in ORL Pictures Several other techniques have been applied to the ORL database under the same testing conditions (40 people, 5 test + 5 verification pictures for each person). Control of the conditions is important since reducing the number of classes (not using all 40 people) implies an easier classification problem. C'! i'5iiig the ratio of training exemplars to verification exemplars also alters classifier performance. This section discusses the three experiments that are used for comparison to PCA-M. PCA-M gave better average performance than the other techniques. Table 5.1: Error Rates of Several Algorithms Algorithm Performance Eigenfaces (Turk and Pentland, 1991a) 10(' HMM (Samaria, 1994) 5.5', SOM-CN (Giles et al., 1997) 5.7.' (: -.) PCA-M (Brennan and Principe, 1998) 2.5'. 5.2 Eigenfaces The decomposition of a training set of face images into eigenfaces has been previously discussed. This section briefly presents the face recognition experiment using eigenface. 5.2.1 Description of Experiment The ORL database has 200 training images, 5 images for each of the 40 people. All the training images {jk}l 200 1 2 k (5.1) Oo20200L k=l The ensemble average is removed from each image, Xk k o. (5.2) Eigenfaces T,' are found for the training ensemble and M eigenfaces with signif- icant eigenvalues are retained. The eigenfaces define the axes in eigenspace, and both training and test images can be mapped to a set of coordinates in eigenspace, Xk [C, Qa2,... ,I Io k where, a, =< x, > (5.3) Denote the eigenspace coordinate vector by ak = [acl,... lr]k. An immediate advantage of eigenfaces is a steep reduction in dimension. For the ORL database and using all the eigenfaces (M = 200), each image is described by 200 coordinates rather than (92 x 112) = 10304 pixels. The training images can be compressed by a factor of 50 without loss. The training images from each class map to separate regions in eigenspace. The eigenspace coordinates can be treated as raw input to any classifier. For example, let the coordinates of the 5 training images from class n be denoted by a'. The distance of a test image's coordinates from the training image coordinates for a class can be used to determine the probability of the test image belonging to the class, Prob(xtest E n)= f(atest, a,..., a5). (5.4) The simplest method is to average the coordinates of all the training exemplars for a class, and to calculate the distance of a test image from the average coor- dinates (Turk and Pentland, 1991b; Giles et al., 1997). Samaria (1994) used a nearest-neighbor classifier. 5.2.2 Results Pentland et al. (1994) reports under 05'. error rate on 200 faces using a large (unspecified) database. Samaria (1994) reported a 10'. error rate when using 175 to 199 eigenfaces. The improvement after 10 eigenfaces is gradual, but the error rate rapidly becomes worse when less than 10 faces are used. Samaria's results also showed that error rate was not monotonically non-increasing as the classifier used more eigenfaces. Giles et al. (1997) reports a 10.5' using 40 to 100 eigenfaces. Error rates aside, the eigenface approach demonstrates that PCA is a useful preprocessor for classification. 1. PCA coordinates in eigenspace are good features for classification. 2. PCA reduces the dimensionality of the classifier inputs that in turn reduces computations. 3. Eigenvalues are potentially a good indicator for classification features. 5.3 Face Recognition using HMM's Samaria (1994) on face recognition using a Hidden Markov Model (HMM) is often cited as seminal work in applying statistical signal processing to image classification. Hidden Markov Models are widely applied to continuous speech recognition (Haykin, 1994, p. 227). Samaria passed an observation window over an image from left-to-right, down, right-to-left, down, left-to-right, and so on (Figure 5.2, left). ___ win ow general traversal top-down traversal Figure 5.2: Parsing an Image into a Sequence of Observations That is, an observation window traverses a one-dimensional path through each image. For each image, Samaria thus obtained a corresponding observation array, 0 [01,...,OT]. 5.3.1 Markov Models A Markov model is a statistical model for a sequence of observations based on an underlying sequence of states (a Markov process). The probability of a state at some time in a sequence is dependent only on the immediately preceding state (Therrien, 1992, pp. 99 118). Each transition between states generates an output that is dependent only on the state being entered. If the states can be only one of N countable discrete values, the process is called a Markov Chain (Therrien, 1992, pp. 99 118). The Markov process is described by four parameters (Samaria, 1994, p. 28), 1. the number of states N, 2. the one-step state transition matrix, A = {aj : 1 < i,j < N}, 3. the output probability function, B = {bj(.) : 1 < j < N}, 4. the initial state probability distribution, II {-rj : 1 < j < N}. When only the outputs are observable (the states are hidden), the model is said to be a Hidden Markov Model Output, oj State i State j Figure 5.3: Markov Model 5.3.2 Description of Experiment A full description of Samaria's work and a detailed description of HMM's is outside the scope of this dissertation. HMM's are described in various books on statistical signal processing (Haykin, 1996; Therrien, 1992). Samaria cites Rabiner (1989). This section attempts to cover points in Samaria's research that would be salient to reconstructing his experiments on 1-dimensional HMM's (1D-HMM). 1 forehead 2 eyes 3 nose 4 mouth 5 chin Figure 5.4: Top-down Constrained State Transitions Samaria obtained his best results using a top-down sequence of five states. Each state corresponds to a region of the face. The allowed state transitions correspond to a top-down traversal of a face (Figure 5.4). The observation window was constructed from several complete rows of the image (Figure 5.2, right). Each window was eight rows high and overlapped .idi i,:ent windows by seven rows. Samaria described his model using a shorthand notation (Samaria, 1994, p. 42) of Fi = (N (states) L (observation rows) M (overlap rows) ) = (5, 8, 7) Samaria used the HTK software package described in Young (1993). For each of the 40 classes in the ORL database, five training images are each transformed into a sequence of observations using a top-down traversal. The five training sequences used as inputs to the HTK software with a design specification for five states (five face regions). The HTK software derived optimal parameters for each HMM using the Baum-Welch re-estimation algorithm (Baum, 1972). The optimization includes parsing the training images into five regions, and deriving both the state transition matrix A and output probability function B for the HMM. At the end of training, an HMM has been derived for each class. To classify a test image, select the class whose HMM maximizes the likelihood of the test image. 5.3.3 Results Samaria's dissertation exhaustively explored variations in the number of mod- els, and observation window parameters. The dissertation included experiments using frequency domain representations and reduced (spatial) resolution images. Samaria reports that 1D-HMM outperformed the Eigenfaces approach about 1I' . of the time. The 1D-HMM had an average error rate of 10'-. After all the detailed analysis, Samaria modestly concluded that the improvements (over eigenfaces) in face recognition using 1D-HMM were probably not statistically significant. The dissertation also explored a more complicated model, the P2D-HMM (pseudo 2D model). The P2D-HMM outperformed the eigenfaces approach about ,91'` of the time with an average error rate of 5'. 5.4 Convolutional Neural Networks Giles et al. (1997) used a Self-Organizing Map (SOM) in conjunction with a Convolutional Neural Network for face classification. The self-organizing map (SOM) is used for dimensionality reduction of the exemplars, and the convolutional network (CN) provides partial translation and deformation invariance (Giles et al., 1997, p. 67). Raw Compressed Image Representation Class -Parse/SOM CN - Figure 5.5: SOM-CN Face Classifier 5.4.1 Self-Organizing Map Giles et al. (1997) states that Kohonen's self-organizing map (SOM) or Self-Organizing Feature Map (SOFM) (Kohonen, 1995) is a topology preserving, unsupervised learning process. This section presents an overview of SOFM that follows the presentation of Kohonen's SOFM found in Haykin (1994, pp. 402 - 414). A SOFM maps an input of arbitrary dimension into a discrete map of reduced dimension. The theory is based on vector quantization theory in which a large set of input vectors is mapped to a reduced set of prototypes (the weights of the winning output node). The network for a SOFM has only an input and output 1 -.i r. Each output is fully connected to all the inputs. The nodes of the output 1?,-i.r are arranged in a one, two, or three-dimensional lattice. When presented with an input x, one of the SOFM's output nodes is the best-matching or winning node according to some distance criteria, (5.5) i(x) argmin x- xjl, j = 1, 2,..., N. i In equation 5.5, i(x) is the index of the winning output in response to input x. The SOFM is topologically ordered in the sense that nodes that are .,Ii i'ent in the output lattice tend to have similar weights. The SOFM is topologically preserving in the sense that a small distance between two inputs (in the input space) implies a small distance between the corresponding winning outputs (in the output space). 5.4.2 Convolutional Network A convolutional network (Le Cun and Bengio, 1995) is a specific structure for a muiltilv-, r perception network that has been successfully applied to optical character recognition (OCR) (Haykin, 1994, p. 226). Giles et al. (1997) uses a similar network for face classification. The CN has five computational l~--.-rs: four hidden -1-. rs and the output 1l--v. 1. The first hidden 1l-vr will be discussed in some detail since it exhibits the properties of feature maps, weight sharing, local receptive fields, and nonlinear convolution. Consider a (20 x 20) OCR image that is parsed into (5 x 5) subregions. There are (16 x 16) = 256 subregions that can be constructed by shifting the (5 x 5) window over the OCR image. A neuron in the first hidden l~.-v-r is said to have a local receptive field if the inputs to the neuron correspond to a local region of the input. The neurons can be organized into a (16 x 16) feature map such that .,1i ,ient neurons have local receptive fields that are shifted by one pixel. A further constraint on a feature map is that all the neurons in the feature map have the same weights (weight sharing). The construction of the feature map can be perceived as a convolution of the input image against the fixed weights. Since the output of the neurons is passed through a nonlinear function, the first hidden l-1-.-r is characterized as a nonlinear convolutional l1~-,-r. In the OCR application, the first hidden l-~v. r consists of four feature maps. 2. The second hidden l-i. v is a downsampling If -r Downsampling provides a tolerance to distortions due to translation, rotation, and scaling. In the OCR example, the second hidden l*- v-r has four feature maps that are respectively reduced (spatial) resolution representations of the four feature maps from the first hidden l~Vr. 3. The third hidden 1- v-r is another convolutional 1lr- r. A feature map in the third hidden l-1.- r may use local receptive fields from two feature maps in the second 1-r. v. The OCR application has twelve feature maps in the third hidden 1lir.. 4. The fourth hidden 1--. r is another averaging and downsampling 1-c-r. identical in structure to the second hidden l *'vr. 5. The output 1l,--r is fully connected to the fourth hidden 1 --,r. In the OCR application, there are ten neurons corresponding to the ten digits [0,1,..., 9]. The output l-1,--r classifies the input, further, the difference between the most active and second most active outputs can be used to generate a measure of confidence in the classification. Haykin (1994, p. 226) states that a multilrv,-r perception that uses alternating convolutional and downsampling l-r,-i is a convolutional network. 5.4.3 Description of Experiment In contrast to the OCR application, Giles et al. (1997) chose to preprocess the raw (face) images with a SOM. The (92 x 112) images are parsed into (5 x 5) subimages. Each subimage overlaps .,.i i'went subimages by one pixel. All the subimages from the training data are collected and used to train a SOM with a (5 x 5 x 5) three-dimensional output lattice. The trained SOM is used to transform the raw images into three (23 x 28) maps. The three maps are passed to the convolutional network. Giles' CN has five l~-,. rs, the architecture is described in table 5.2. Table 5.2: Face Classification CN Architecture Number of Feature Map Receptive Field L Type Feature Maps Dimensions Dimensions 1 convolutional 20 (21 x 26) (3 x 3) 2 downsampling 20 (9 x 11) (2 x 2) 3 convolutional 25 (9 x 11) (3 x 3) 4 downsampling 25 (5 x 6) (2 x 2) 5 full 40 (1 x 1) (5 x 6) 5.4.4 Results The CN is a muiltilv,--r perception network that requires significant training. Once trained, the network operates quickly. The best results were reported as 3.5' error against the ORL database. 5.5 Face Classification with PCA-M PCA is known to be optimal for representation, but suboptimal for classifica- tion. Belhumeur et al. (1997) points out that PCA does not differentiate in-class scatter from between-class scatter. Bartlett et al. (1998) shows that classification can be improved by using independent component analysis to incorporate higher- order statistics. On the other hand, the experiments using eigenfaces (Turk and Pentland, 1991a) showed that coordinates in eigenspace are useful for classification. Several other experiments indicate that PCA-based feature extraction could be improved by adding localization and multiresolution. Pentland et al. (1994) states that localization can enhance eigenfaces. Pent- land trained eigenfeatures corresponding to pl1,:-i. I1 facial features. Classification based on localized eigenfeatures was comparable to the performance of eigenfaces. The combination of localized eigenfeatures and global eigenfaces performed almost perfectly. Brunelli and Poggio (1993) states that localized features may be more important than global features. Brunelli stated that when a classifier can use only a single facial feature, then local templates based on eyes, nose, and mouth contribute more to recognition than global facial templates. Giles et al. (1997) Si.---- -1.. 1 that reducing the spatial resolution of the ORL images might improve classification. The use of the SOM front end to reduce dimensionality while retain- ing good classification supports Giles' observation. Turk and Pentland (1991b) used a six-level Gaussian pyramid to view the inputs at several spatial resolutions. The theory for PCA-M, dyadic filter banks, and the GHA network were presented in chapter 4. This remainder of this section presents and discusses experimental results for test runs using PCA-M, a fixed-basis multiresolution (Haar), and PCA at several fixed resolutions. 5.5.1 Classifier Architecture Figure 5.6 shows the initial architecture for our classifier (Brennan and Principe, 2000). r----------------------------------- CLASSIFIER Hyperplane Majority SPCA-M Hyperplane Vote Hyperplane - -- - - - -- - - - I Image Multiresolution Class Features Figure 5.6: Initial Classifier Structure The structure was originally intended to isolate each feature space. We wanted to observe both the individual feature performance and degradation due to decoupling features. Adding more eigenfeatures does not monotonically increase classifier performance (Samaria, 1994), and we plan on finding a way to select or weight the predictions from eigenfeatures in future research. Each feature classifier uses a template obtained by averaging the training exemplars. The final classification was done by weighted vote among the component classifiers. The in1i' 'i, ily vote mechanism is the simplest way to combine the results of the feature classifiers. Hu Table 5.3: Fixed Resolution PCA Error Rates over 10 Runs WINDOW MAX MEAN MIN Raw Data 19,0 14.4 10.5 (2 x 2) 22.5 17.5 14.0 (4 x 4) 27.5 23.6 20.5 et al. (1997) il-.-. -I; that more elaborate committee structures don't necessarily significantly outperform a ii ii Pi ily vote mechanism. The overall structure can be compared to a 2-1h'v, r hierarchical One-Class-One-Network (OCON) Decision-Based Neural Network (DBNN) (Kung, 1993, pp. 118-120). 5.5.2 Data Preparation The most straightforward way to vary the resolution is to reduce the scale by half. We investigated reductions in scale of 2, 4, 8, and 16. The scaling was performed by passing a (2k x 2k) window through the image. For example the 1/16 scaling takes a (24 x 24) subimage and produces 16 scalar outputs; the collection of scalar outputs from all the windowed subimages form 16 scaled images with reduced spatial resolution. We used non-overlapping observation windows because we wanted to observe if blocking would severely deteriorate classification. To facilitate scaling with non-overlapping windows, image dimensions were cropped to (112 x 80) so that the numbers of pixels along each dimension are a factor of 24 = 16. Six columns of pixels were cropped from each side of the input. 5.5.3 Fixed Resolution PCA Results The fixed resolution PCA was investigated for windows of (2 x 2) and (4 x 4). Eigenfaces would have transformed each (112 x 80) = 8960 pixel input to 8960 coordinates in eigenspace; each coordinate corresponds to the projection of an input to an eigenface. For the ORL database there would have been only 200 non- zero coordinates. We step through the procedure for a (2 x 2) and note that the procedure for the (4 x 4) PCA windows are analogous. A (2 x 2) has 4 eigenimages. As the non-overlapping window is passed through the image, we are effectively condoling 4 sets of coefficients against the input image. Since the blocks are non- overlapping, downsampling is accomplished in the same step. As an aside, linear convolution and downsampling are being performed by a single computational 1-,-. r partially connected network with 8960 inputs and 8960 outputs. Each output is locally supported by to 4 inputs (a (2 x 2) subregion). Only 4 sets of weights are used. The output can be organized into 4 feature spaces that are 4 half-scale images. In each run, five training exemplars were randomly selected from the ten exemplars available for each person in the ORL database. The results show great sensitivity to selection of the training set. The sensitivity is not surprising. For example, figure 5.7 shows a class that had the entire training set at one scale, and the entire test set at another scale. Training and Test data at different Scales TRAIN VERIFY Figure 5.7: Training and Test Data at Different Scales Giles et al. (1997) points out that a random selection among 40 classes would be expected to be correct 1/40 = 2.5'. of the time. We feel that a more realistic base- line for error rates is the performance of a template classifier with the raw data. Since PCA is just a rotation, the performance would be the same as the perfor- mance using all 200 eigenfaces. Samaria (1994) reported 10.5'. for the ORL data, but we found that the error rate was also sensitive to training set and averaged around 14.!' (first line of table 5.3). Some of the increased misclassification could be due to the clipped data, but it is more likely that the decoupling of data due to our classifier structure is responsible for the deterioration. The results seem to sup- port that data organized into 4 independent feature spaces. The (2 x 2) window is worse than data taken as a whole (raw data), but better than 16 decoupled feature spaces. Note that if the feature spaces are linearly combined before classification, we would expect an error rate similar to the raw images. All individual feature classifiers and the iii ii i ly vote mechanism have a nonlinear operation when the maximum output is selected. 5.5.4 Haar Multiresolution A fixed Haar basis was used to crate a four level differential image pyramid. A sample decomposition is shown along with the original image. PCA-M Decomposition I U. ORIGINAL RAW NORMALIZED Figure 5.8: PCA-M Decomposition of One Picture The autocorrelation matrix of the observation windows, as expected, shows that the pixels in a natural image are 1/f. Classification using a Haar basis was not significantly different from PCA-M since the Haar basis is well suited for 1/f signals. Moreover, for small observation windows, the choice of multiresolution basis is not very important. Multiplication by any fixed basis is a rotation; if all the features are used, then input distances are preserved. Classification (with a linear classifier) will be no better than using the raw inputs. 5.5.5 PCA-M PCA-M was used to decompose images into multiresolution feature spaces (components). Four feature spaces are 1/16 scale images, three are 1/8 scale images, three are 1/4 scale images, and three are 1/2 scale images (figure 5.9). Figure 5.9: Selected Resolutions The decomposition was chosen to facilitate comparison to the Haar decomposition. Referring to figure 5.9, components 1 to 4 have the longest eigenvectors; that is, (16 x 16) eigenimages and the least spatial resolution. Components 11 to 13 have the shortest eigenvectors and the highest spatial resolution. The classifier was modified (figure 5.10). CLASSIFIER SFFT Hyperplane Majority -- PCA-M FFT Hyperplane rite I- -FFT Hyperplane Image Multiresolution Class Features Figure 5.10: Final Classifier Structure The modification resulted from earlier experiments evaluating PCA-M for repre- sentation. Since we had the ORL database in a variety of representations, we fed them into the classifier. The magnitude of the FFT of the raw data had an average error rate of 1(''. The combination of PCA-M with magnitude FFT gave the best performance. We assume that using the FFT magnitude makes the classifier more robust to translations. There is still a high sensitivity to training set selection. The Table 5.4: Error Rates for PCA-M with Magnitude of FFT Multiresolution Levels MAX AVG MIN 2 6.5(0' 2.95' 0.(11' " 3 5.011' 2.45' 0.(11' 4 6.5( 3.4(1' 1.i . main diagonal of table 5.5 shows the performance of the individual feature classi- fiers. The table shows the number of misclassifications out of 200 test images. The nondiagonal elements show the number of misclassifications using a pair of feature classifiers. The performance seemed to be independent of eigenvalue or resolution. None of the component classifier had more than 10 misclassifications (out of 200) in the training set (200 images). Performance on the test set does not seem to be predictable from performance on the training set. When a component classifier's best guess was incorrect, the second best guess was correct half the time. Of some interest is the poor performance of the first four components since Belhumeur et al. (1997) states that the first few eigenvectors are sensitive to illumination. Belhumeur stated that removal of the first three or four eigenvectors could provide some robustness to illumination levels. Table 5.5: Component Misclassifications (200 Test Images) 1 2 3 4 5 6 7 8 9 10 11 12 13 1 61 48 45 33 19 28 19 5 19 7 25 3 4 2 136 100 73 21 55 46 7 45 9 54 6 8 3 147 75 20 52 45 6 42 7 48 6 8 4 99 22 43 31 6 31 9 39 5 8 5 27 17 6 6 11 8 9 4 3 6 68 29 6 27 6 24 5 4 7 54 4 16 4 15 4 4 8 7 5 6 3 4 1 9 53 5 30 5 4 10 9 3 3 2 11 67 4 6 12 6 3 13 9 CHAPTER 6 MSTAR EXPERIMENT This chapter describes a classification and a simple discrimination experiment using synthetic aperture radar (SAR) images of armored vehicles. SAR imagery is obtained by combining radar returns over the path of a moving platform (an airplane or satellite). The path is effectively a large antenna aperture leading to high-resolution imagery. The basic scenario is that given a training set of several I ,i;, I vehicles, the discriminator will assign subsequent input images to the correct class of threat vehicles, or identify that the new image belongs to a new class of i,,,i,-I ,i ,_t vehicles. Classes of vehicles that have only test (no training) exemplars are called confuser classes. Discriminators can make three types of mistakes. 1. A false-negative error occurs when a target vehicle is not identified. Presum- ably, failing to respond to a target vehicle incurs the most severe penalty. 2. A false-positive mistake occurs when a non-target vehicle is identified as a target vehicle. This mistake causes resources to be wasted in an unnecessary response. 3. The third error occurs when a target vehicle is correctly identified, but is incorrectly labeled. By modifying the decision boundary, a trade-off is possible between the three errors. Two sets of results are presented: a set for classification (no rejection of confuser classes), and a set for discrimination. The goal of the classifier is to minimize unconditional probability of correct classification P,,. The goal for our discriminator is to maximize (conditional) Pc when the probability of detecting targets, Pd, is 91'. 6.1 SAR Image Database The raw SAR inputs are (128 x 128) pixel images from a subset of the 9/95 MSTAR Public Release Data obtained from Veda Inc., (www.mbvlab.wpafb.mil). The web site also includes a paper with a detailed description of the data and the results of a baseline template-matching classifier (Velten et al., 1998). The data consists of X-band SAR images with a 1-foot by 1-foot resolution (Velten et al., 1998). Table 6.1 lists the vehicle classes, bumper tags, and the quantity of corresponding images. The SAR image of a vehicle is dependent on the pose of the target vehicle relative to the collection platform (satellite). The two pertinent pose parameters are aspect angle and depression angle (figure 6.1)). Figure 6.1: Aspect and Depression Angles Changing the pose of a vehicle results in nonlinear changes in the SAR image since the radar cross-section is a projection of a 3-dimensional object onto a 2-dimensional surface. Thus, the features that are available for discrimination are a function of pose. Because of this dependence, it is desirable to ensure that the exemplars for a given vehicle have the same pose. Ideally, a large number of exemplars would be available for each aspect angle. More practically, we would collect a large number of training exemplars for a narrow range of aspect angles. Realistically, we have to accept a trade-off between having a large number of |