PRINCIPAL COMPONENT ANALYSIS
WITH MULTIRESOLUTION
By
VICTOR L. BRENNAN
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2001
ACKNOWLEDGMENTS
I am grateful for the support I have received from family, colleagues, and
faculty at UF. It is difficult to single out a few people to thank when many have
been supportive and encouraging.
I wish to thank my advisor, Dr. José Principe, not only for sharing his
technical expertise and insight, but especially for his patience and encouragement.
I want to thank Leonard and Carolina Brennan, who have been loving parents and
inspirational role models. I am most grateful to my wife, Karen, for her love and
for her support in every endeavor in our lives.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Classification
  1.2 Principal Component Analysis (PCA)
  1.3 Multiresolution
  1.4 PCAM
  1.5 Image Classification Experiments
  1.6 MSTAR Experiment

2 PCA
  2.1 The Eigenvalue Problem
  2.2 An Example
  2.3 PCA
  2.4 Deflation Techniques
  2.5 Generalized Hebbian Algorithm
  2.6 Eigenfilters
    2.6.1 Low Pass Test Signal
    2.6.2 High Pass Test Signal
    2.6.3 Mixed Mode Test Signal
  2.7 Properties

3 MULTIRESOLUTION
  3.1 Two Notes Example
  3.2 Quadrature Filter and Iterated Filter Bank
  3.3 Discrete Wavelets
  3.4 Haar Wavelet
  3.5 A Multiresolution Application: Compression

4 PCAM
  4.1 Definition of PCAM
    4.1.1 Localization of PCA
    4.1.2 A Structure of Localized Outputs
  4.2 The Classification Problem
  4.3 Complete Representations
    4.3.1 Eigenfaces
    4.3.2 Identity Map
    4.3.3 Iterated Filter Banks
    4.3.4 Dual Implementation of PCA
  4.4 Overcomplete Representations
  4.5 Local Feature Analysis
    4.5.1 Output Vector
    4.5.2 Residual Correlation
    4.5.3 Kernel
    4.5.4 LFA on ORL Faces
    4.5.5 Localization for LFA and PCAM
    4.5.6 Feature Space for LFA, PCA, and PCAM
  4.6 Summary

5 FACE RECOGNITION EXPERIMENT
  5.1 ORL Face Database
  5.2 Eigenfaces
    5.2.1 Description of Experiment
    5.2.2 Results
  5.3 Face Recognition using HMMs
    5.3.1 Markov Models
    5.3.2 Description of Experiment
    5.3.3 Results
  5.4 Convolutional Neural Networks
    5.4.1 Self-Organizing Map
    5.4.2 Convolutional Network
    5.4.3 Description of Experiment
    5.4.4 Results
  5.5 Face Classification with PCAM
    5.5.1 Classifier Architecture
    5.5.2 Data Preparation
    5.5.3 Fixed Resolution PCA Results
    5.5.4 Haar Multiresolution
    5.5.5 PCAM

6 MSTAR EXPERIMENT
  6.1 SAR Image Database
  6.2 Classification Experiment
  6.3 Basis Arrays for PCAM
    6.3.1 Level 3 Components
    6.3.2 Level 2 Components
    6.3.3 Level 1 Components
    6.3.4 Decorrelation between Levels
  6.4 A Component Classifier
  6.5 Classifications using Several Components
  6.6 A Simple Discriminator
  6.7 False-Positive and False-Negative Errors
  6.8 Observations

7 CONCLUSIONS AND FURTHER WORK
  7.1 Conclusions
  7.2 Future Work
    7.2.1 Segmentation of the Input
    7.2.2 Component Selection
    7.2.3 Conditioned Data and Non-Linear Classifier

APPENDIX

A ABBREVIATIONS
B OLIVETTI RESEARCH LABORATORY FACE DATABASE
C MSTAR IMAGES

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

4.1 Normalized Eigenvalues
4.2 Energy Distribution of Exemplars
5.1 Error Rates of Several Algorithms
5.2 Face Classification CN Architecture
5.3 Fixed Resolution PCA Error Rates over 10 Runs
5.4 Error Rates for PCAM with Magnitude of FFT
5.5 Component Misclassifications (200 Test Images)
6.1 Input Data
6.2 Classification using First Component
6.3 Misclassifications with Individual PCAM Components
6.4 Error Rate (5/68 = 7.4%) using 3 Components
6.5 Error Rate (2/68 = 3.0%) using 10 Components
6.6 Overall Unconditional Pc with Template Matching
6.7 Overall Unconditional Pc with PCAM
6.8 Determining a Threshold for Detection
6.9 Ten Components without Rejection
6.10 Ten Components with Rejection
6.11 Detector Threshold
6.12 Performance at 91% Pd
LIST OF FIGURES

1.1 Conceptual Steps in a Classifier
1.2 PCAM Classifier
2.1 Sample Data
2.2 Original (left) and Scaled (right) Data
2.3 GHA Linear Network
2.4 Low Pass Test Data
2.5 Low Pass Data PCA
2.6 High Pass Data PCA
2.7 Test High and Low Frequency Data
3.1 Two Notes
3.2 Quadrature Filter
3.3 Discrete Wavelet Transform with 2 Levels
3.4 Equivalent 2^N Filter Bank DWT Implementation
3.5 Three Levels of Decomposition on the Approximation
4.1 PCAM for Classification
4.2 PCA and PCAM in Feature Space
4.3 Raw Images from ORL Database
4.4 Residual Images for GHA Input
4.5 All-to-one and One-to-one Networks
4.6 Eigenfaces from GHA Weights
4.7 Three Level Dyadic Banks
4.8 First Four Eigenimages
4.9 Three Level Decomposition of a Face
4.10 Output of Quadratic Filter Bank
4.11 Localization of a Global Output
4.12 Local Feature Analysis
4.13 PCA Reconstruction MSE
4.14 PCA Reconstructions
4.15 LFA Outputs (Compare to PCA Reconstruction)
4.16 LFA Kernel and Residual Correlation (Look for Localization)
5.1 Varying Conditions in ORL Pictures
5.2 Parsing an Image into a Sequence of Observations
5.3 Markov Model
5.4 Top-down Constrained State Transitions
5.5 SOM-CN Face Classifier
5.6 Initial Classifier Structure
5.7 Training and Test Data at Different Scales
5.8 PCAM Decomposition of One Picture
5.9 Selected Resolutions
5.10 Final Classifier Structure
6.1 Aspect and Depression Angles
6.2 Experiment Overview
6.3 Three Levels of Decomposition on the Approximation
6.4 PCAM Decomposition of a BMP2 Input
6.5 The Templates for Three Classes for PCAM Component 1
6.6 First Component of SAR Images Projected to 3-Space
6.7 Class Templates for other PCAM Components
6.8 Clustering in 3-Space using All PCAM Components
6.9 Probability of Detection versus False Alarm Rate
B.1 Olivetti Research Laboratory Face Database
C.1 BMP2 Training and Test Data
C.2 T72 Training and Test Data
C.3 BTR70 Training and Test Data
C.4 Confuser Data
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
PRINCIPAL COMPONENT ANALYSIS
WITH MULTIRESOLUTION
By
Victor L. Brennan
2001
Chair: José Principe
Major Department: Electrical and Computer Engineering
Eigenvalue decomposition and multiresolution are widely used techniques
for signal representation. Both techniques divide a signal into an ordered set of
components. The first component can be considered an approximation of the input
signal; subsequent components improve the approximation. Principal component
analysis selects components at the source resolution that are optimal for minimizing
mean square error in reconstructing the original input. For classification, where
discriminability among classes puts an added constraint on representations, PCA
is no longer optimal. Features utilizing multiresolution have been demonstrated to
preserve discriminability better than a single scale representation. Multiresolution
chooses components to provide good representations of the input signal at several
resolutions. The full set of components provides an exact reconstruction of the
original signal.
Principal component analysis with multiresolution combines the best properties
of each technique:
1. PCA provides an adaptive basis for multiresolution.
2. Multiresolution provides localization to PCA.
The first PCAM component is a low-resolution approximation of the signal.
Additional PCAM components improve the signal approximation in a manner
that optimizes the reconstruction of the original signal at full resolution. PCAM
can provide a complete or overcomplete basis to represent the original signal,
and as such has advantages for classification because some of the multiresolution
projections preserve discriminability better than full resolution representations.
PCAM can be conceptualized as PCA with localization, or as multiresolution
with an adaptive basis. PCAM retains many of the advantages, mathematical
characteristics, algorithms and networks of PCA. PCAM is tested using two
approaches. The first approach is consistent with a widely-known eigenface
decomposition. The second approach assumes ergodicity. PCAM is applied to two
image classification applications: face classification and synthetic aperture radar
(SAR) detection. For face classification, PCAM had an average error of under
2.5%, which compares favorably with other approaches. For synthetic aperture
radar (SAR), direct comparisons were not available, but PCAM performed better
than the matched filter approach.
CHAPTER 1
INTRODUCTION
Principal component analysis with multiresolution (PCAM) combines and
enhances two well-established signal processing techniques for signal representation.
This dissertation presents the motivation, the mathematical basis, and an
efficient implementation for combining principal component analysis (PCA) with
multiresolution.
This dissertation also presents the results of using PCAM as a front-end
for two applications: face classification, and target discrimination of synthetic
aperture radar (SAR) images. More detailed discussions of PCA, multiresolution
(differential pyramids), and PCAM are presented in subsequent chapters. This
introduction is intended as an overview to the presentation of PCAM.
PCAM was originally developed as an online signal representation technique.
The intent was to perform realtime segmentation of (time) signals based on
variations in local principal components. Tests with simple artificially generated
signals were promising, and good results have been reported in applying PCAM
to biological signals (Alonso-Betanzos et al., 1999). The decision to concentrate
on images was made when several researchers (Giles et al., 1997; Samaria, 1994;
Turk and Pentland, 1991a) applied various techniques against a common database
generated by the Olivetti Research Lab (ORL). Each of the researchers also cited
an approach which decomposed a set of facial images into component eigenfaces.
It became possible to compare the performance of PCAM to the results of other
researchers using fixed resolution PCA technique and more computationally
intensive nonPCA image classification techniques. We will start by providing a
brief overview of the fundamental concepts required to understand PCAM.
1.1 Classification
Classification is the assignment of an input signal x = [x_1, x_2, ..., x_d]^T to one
of K classes (Bishop, 1995, pp. 110),

x → C_k,   1 ≤ k ≤ K.

Each input x is assigned a label y ∈ {1, 2, ..., K}. The value of the label y
corresponds to the assigned class. The classification problem can be formulated in
terms of a set of discriminant functions y_k with parameters w,

y_k = y_k(x; w).   (1.1)

An input x is assigned to class C_k if

y_k = max_j {y_j(x; w)}.   (1.2)
Each class has a corresponding discriminant function. A signal x is input to each
discriminant function. The function with the highest output assigns a label to
the input (eq. 1.2). While difficult problems can be addressed by more complex
(e.g., nonlinear, multilayer) discriminants, an alternative approach is to attempt to
simplify the problem by some transformation K of the raw data,

y_k = max_j {y_j(K(x), w)}.   (1.3)
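As a minimal illustration of equations 1.1 through 1.3, the discriminant-based assignment can be sketched in a few lines of Python; the feature transformation phi, the weights W, and the bias b below are arbitrary placeholders rather than quantities defined in this work.

    import numpy as np

    def classify(x, W, b, phi=lambda v: v):
        """Assign x to the class whose linear discriminant is largest.

        phi : optional transformation applied to the raw input (eq. 1.3)
        W, b: one weight vector and bias per class (rows of W)
        """
        f = phi(x)                      # transform raw data to feature space
        scores = W @ f + b              # y_k = w_k^T phi(x) + b_k
        return int(np.argmax(scores))   # label of the maximal discriminant (eq. 1.2)

    # toy usage with 3 classes and a 4-dimensional feature space
    rng = np.random.default_rng(0)
    W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
    print(classify(rng.standard_normal(4), W, b))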
The output of the transformations or projections is called a feature and the
output space is called the feature space (Duda and Hart, 1973). The size of
the feature space can be larger or smaller than the original space (Fukunaga,
1990; Vapnik, 1998). Traditionally, in statistical pattern recognition the feature
space is smaller than the input space. One of the unsolved problems is how to
determine the feature space and its size to improve classification accuracy. A
feature is a projection that preserves discriminability. A fortuitous choice for
the transformation extracts features that differ between classes but are similar
within a class. Undesirable features differ within-class or are similar between-class.
Heuristics have been the most utilized method of selecting good features.
The problem is the following. Optimal classifiers in high dimensional
spaces require prohibitive amounts of data to be trained accurately (Fukunaga,
1990; Duda and Hart, 1973). Hence the reduction of the input space dimensionality
improves accuracy of the estimated classifier parameters and improves classifier
performance. On the other hand, projections to a feature subspace may decrease
discriminability, so there is a tradeoff that is difficult to formulate and solve
(Fukunaga, 1990).
Figure 1.1: Conceptual Steps in a Classifier (data → transformation → classifier)
Experience has shown that local features tend to preserve discriminability
better than global features. Hence the widespread use of wavelets and other
multiresolution techniques as feature extractors for classification (Bischof, 1995).
More recently there has been work proposing feature spaces of higher dimensionality
than that of the original input space (Vapnik, 1998). High dimensional
spaces increase the separability between classes, enabling the use of linear
discriminators that have fewer parameters to estimate than the optimal (Bayesian)
classifiers.
When analyzed from the feature extraction point of view, projection to high
dimensional spaces also enhances the chance of obtaining "better" features; that
is, directions along which the projections of different classes are more concentrated.
These are called overcomplete representations, and they have been
studied in the wavelet literature (Vetterli and Kovacevic, 1995; Strang and Nguyen,
1996). The big issue is still how to choose the overcomplete dictionary and how to
select the best features.
1.2 Principal Component Analysis (PCA)
Principal component analysis (PCA) is based on eigenvalue decomposition
(Hotelling, 1933). Eigenvalue decomposition has been applied to problems
across many disciplines. There is a rich mathematical background and a variety of
implementations (Oja, 1982). Given a set of data, a scatter matrix S is calculated
to estimate the autocorrelation matrix of the data.
S = Σ_{n=1}^{N} x_n x_n^T.   (1.4)
The eigenvector and corresponding eigenvalue pairs (w_k, λ_k) of S are found by
solving

S w = λ w.   (1.5)

Both the data x and the scatter matrix S can be expanded in terms of the
eigenvectors,

S = Σ_k λ_k w_k w_k^T,
x = Σ_k w_k w_k^T x = Σ_k a_k w_k.   (1.6)

Analytic and deflation-based iterative approaches are available to solve the
eigenvalue problem that automatically order the eigenvectors such that

λ_1 ≥ λ_2 ≥ ... ≥ λ_N ≥ 0.
PCA components are uncorrelated and maximal in L2 energy. PCA is one possible
transformation for equation 1.3. It has been shown that PCA is optimal for signal
representation, but it is suboptimal for feature extraction (Fukunaga, 1990).
Chapter 2 has an expanded discussion of PCA. Although other sets of basis
functions are available that are similar to the PCA basis, only PCA can select a
reduced set of components that are optimal for reconstruction MSE.
1.3 Multiresolution
Multiresolution has been broadly defined as the simultaneous presentation
of a signal at several resolutions. An intuitive argument for using multiresolution
is available from common experience. Consider watching someone approach from
a distance. As the person comes closer, more details are resolvable to allow an
observer to make successively refined categorizations. A possible sequence is to first
identify a moving object, then a person, the gender of the person, the identity of
the person, and finally the facial expressions of the person.
Another familiar application of multiresolution is transferring images across
low bandwidth channels (internet). People tend to leave a web page if the page
takes too long to load. For commercial web sites this translates into a loss of
potential customers. On the other hand, many sites feel that customers will not
return to a site that does not have a lot of graphics. Some image intensive web
pages (e.g., zoo, museum, or auction sites) usually present small images (initially
transfer small files). A larger, more detailed version of the image is loaded only
if the viewer clicks on the small image. While it is possible to completely reload
the larger image, it is more efficient to use information available on the (already
loaded) small version and just add the details needed to produce the larger picture.
For classification, it is hoped that within-class differences are high-resolution
features, and that sufficient desirable features are resolvable at coarse resolution.
By using a coarser representation (lower resolution), it is hoped that undesirable
features are sharply attenuated with minor impact on the desirable features.
Multiresolution is discussed in chapter 3.
1.4 PCAM
Both PCA and multiresolution have been successfully applied to similar
problems. It seems reasonable that an application that has benefited from each
individual approach should further benefit from a combined approach. PCAM is
simply multiresolution with an adaptive basis (PCA).
I show that a linear network for online, adaptive, multiresolution feature
extraction is easily adapted from the networks used for standard PCA. Principal
component analysis with multiresolution (PCAM) is implemented with a partially
connected, single-layer linear network. The same network can be used for both
training and normal operation. The training algorithm is a modification of the
generalized Hebbian algorithm (GHA) (Sanger, 1989). I treat PCAM in chapter 4.
1.5 Image Classification Experiments
Olivetti Research Lab (ORL) has a public face database that serves as a
benchmark for comparing different face classification algorithms. Both
multiresolution and PCA (Turk and Pentland, 1991a) had been successfully applied
against the database. The PCAM components were used with an almost linear
network. The network is linear except for selecting the maximum discriminant
(MAXNET) (Kung, 1993, p. 48).
Figure 1.2: PCAM Classifier (raw image → FFT → feature extraction with 10 PCAM components → hyperplane discriminants → majority vote)
The ORL database was used to compare PCAM to several standard fixed
resolution transforms (discrete Fourier transform, discrete cosine transform, PCA),
and to multiresolution using a Haar basis. PCAM outperformed PCA at all tested
resolutions. PCAM outperformed the Haar basis if a reduced set of components
were used. Results were comparable if the full set of multiresolution components
were used.
In chapter 5, PCAM results are compared to classifiers using a fixed
resolution PCA (Turk and Pentland, 1991a, Eigenfaces), a hidden Markov model
(HMM) (Samaria, 1994) and a convolutional neural network (Giles et al., 1997).
PCAM had the lowest error rate.
1.6 MSTAR Experiment
The 9/95 MSTAR Public Release Data (Veda Inc., www.mbvlab.wpafb.mil)
contains synthetic aperture radar (SAR) images of vehicles at various poses (aspect
and depression angles). The estimated aspect angle (Xu et al., 1998) of each
vehicle was used to assign each vehicle to one of the twelve nonoverlapping 30°
sectors. Within each sector, multiresolution templates were derived for each class.
Chapter 6 shows that PCAM worked very well in some sectors, but poorly in other
sectors. The overall error rate was comparable to other template matching
procedures (Velten et al., 1998), but poorer than information theoretic and support
vector methods.
We conclude the dissertation with some comments and future directions for
further research.
CHAPTER 2
PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a technique for representing an image
(or a signal) using basis functions that are derived from eigenvalue decomposition
of the data autocorrelation matrix. This chapter is an introduction to the
eigenvalue problem. A thorough presentation is not possible, but this chapter
should contain the information on principal component analysis that is required by
subsequent discussion of PCAM.
2.1 The Eigenvalue Problem
Consider a square matrix A of full rank N. A vector w is said to be an
eigenvector of A with a corresponding (scalar) eigenvalue A if
A w = λ w.   (2.1)
The eigenvalue problem (equation 2.1) can be solved analytically by subtracting
λw from both sides,

(A - λI) w = 0.   (2.2)

Taking the determinant of both sides yields an Nth order polynomial in λ called the
characteristic polynomial of A,

det(A - λI) = 0.   (2.3)

The N roots are the eigenvalues, and each eigenvalue λ_k has a corresponding
eigenvector w_k. Each solution to the eigenvector problem is a paired eigenvalue and
corresponding eigenvector (λ_k, w_k). From equation 2.1, it should be clear that if w
is an eigenvector and K is an arbitrary scalar, then Kw is also an eigenvector with
the same eigenvalue. Given a nonrepeated eigenvalue λ, the corresponding eigenvector
is unique except for the scale factor K. Without loss of generality, eigenvectors
are usually scaled such that

w^T w = ||w||^2 = 1.

If the eigenvalues λ_k are unique (nonrepeated), then a unique eigenvector exists for
each eigenvalue. The (normalized) eigenvectors w_k are orthonormal,

w_j^T w_k = δ_jk.   (2.4)

Define the modal matrix W as the matrix whose columns are the normalized
eigenvectors of A,

W = [w_1 w_2 ... w_N].   (2.5)
In general, there are N! permutations of eigenvectors. Without loss of generality,
order the eigenvectors such that the eigenvalues are nonincreasing,

λ_1 ≥ λ_2 ≥ ... ≥ λ_N.

Define the diagonal matrix Λ as the matrix with a main diagonal consisting of the
(ordered) eigenvalues of A,

Λ = diag(λ_1, λ_2, ..., λ_N).

The eigenvalue problem can be restated in matrix notation,

A = W Λ W^T.   (2.6)
If equation 2.6 is satisfied, the matrix A is said to be diagonalizable. For this study,
the matrices of interest are real Toeplitz matrices that are always diagonalizable.
The orthonormality condition of equation 2.4 can also be restated,

W W^T = W^T W = I.   (2.7)
A matrix satisfying equation 2.7 is said to be unitary. The modal matrix W is
unitary and is said to diagonalize the matrix A.
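The statements of equations 2.1 through 2.7 are easy to verify numerically. The following is only a sketch, assuming a small real symmetric matrix A; numpy returns the eigenvalues in ascending order, so they are reversed to match the nonincreasing convention used here.

    import numpy as np

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])          # real, symmetric (Toeplitz)

    lam, W = np.linalg.eigh(A)               # eigenvalues and eigenvectors
    lam, W = lam[::-1], W[:, ::-1]           # reorder: lambda_1 >= ... >= lambda_N

    Lam = np.diag(lam)
    print(np.allclose(A, W @ Lam @ W.T))     # A = W Lambda W^T   (eq. 2.6)
    print(np.allclose(W.T @ W, np.eye(3)))   # W^T W = I          (eq. 2.7)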
2.2 An Example
The properties of PCA will later be discussed more rigorously, but a quick
example should provide an intuitive grasp of some of the properties of PCA.
Consider L = 20 vectors of dimension N = 3 arranged into the data matrix X,
X = [x_1 x_2 ... x_20].
The autocorrelation of the data is estimated by the scatter matrix,
S_x = E{x x^T} ≈ (1/L) X X^T =

[ 0.1296  0.1372  0.1296 ]
[ 0.1372  0.1613  0.1372 ]
[ 0.1296  0.1372  0.1296 ]

Eigenvalue decomposition yields

    [ 0.5578   0.4346   0.7071 ]
W = [ 0.6146  -0.7888   0.0000 ],   Λ = diag(0.4104, 0.0102, 0).
    [ 0.5578   0.4346  -0.7071 ]
Each input vector can be interpreted as a set of coordinates. The standard
basis functions for the input space are normalized vectors in each of the input
coordinates,
e_1 = [1, 0, 0]^T,   e_2 = [0, 1, 0]^T,   e_3 = [0, 0, 1]^T.
Figure 2.1 (left) plots the data in the input space. The input was constructed to
lie near the diagonal of the input space. The first element of each input vector
was randomly selected in the interval (-1, +1). The second element was the
first element plus Gaussian noise. The third element was set equal to the first
component. By construction, all three elements are equal except for the additive
Gaussian noise in the second element. The dimension of the signal (part of x
excluding the noise) is one. The noise adds a second dimension. Although x is
nominally three-dimensional, the data set can be embedded in a two-dimensional
space.
Figure 2.1: Sample Data (left: input space; middle: eigenvectors; right: (w_1, w_2) plane)
Figure 2.1 (middle) shows the eigenvectors in the input space. Note that the
first eigenvector is the line that best fits the data in a mean square error sense.
Figure 2.1 (right) shows that the eigenvectors can be used as basis functions for
the data. The input coordinates x are rotated to the eigenspace coordinates y by
multiplication with the modal matrix W,
y = W^T x   or   Y = W^T X.
The input vectors were drawn from a zero mean distribution, but the sample mean
was
x̄ = Ê{x} = [0.1476, 0.1649, 0.1476]^T.

If the data is zero-mean, then the scatter matrix is also an estimate of the
autocovariance. The shift z = x - x̄ would produce data with a (sample) mean
of zero. It is obvious in this example that the sample mean is a poor estimate
of the true mean. The true mean of the output is zero, but the sample mean is
ȳ = Ê{y} = W^T x̄ = [0.2660, 0.0018, 0.0000]^T.

It is perhaps too obvious to mention that small sample sizes lead to poor
characterization (e.g., statistical parameters) of a distribution. However, many real
applications have a limited amount of data available. Insufficient data will degrade
the performance of any algorithm. The scatter matrix of the rotated vectors is
S_y = Λ. Since Λ is diagonal, the components of y are uncorrelated,

S_y = (1/L) Y Y^T = (1/L) W^T X X^T W = W^T S_x W = Λ.
The trace of the scatter matrices is invariant under rotation,
tr(S_x) = tr(S_y) = 0.4206.
The trace is a measure of the total variation of the data. A linear transformation
does not affect total autocorrelation. However, a linear transformation can change
the variance of individual components and the cross-correlation between
components. To see the contribution of each component of the input and output to the
total variance, divide both scatter matrices by the trace,

S_x' =
[ 0.3082  0.3262  0.3082 ]
[ 0.3262  0.3836  0.3262 ]
[ 0.3082  0.3262  0.3082 ]

S_y' = diag(0.9758, 0.0242, 0).

The trace of each normalized scatter matrix is one, and the elements along the
main diagonal can be interpreted as percentages of total variation. By construction
the variation in the input data is distributed almost equally among all three
components. The normalized output scatter matrix shows that the
first component captures 97.6% of the variation of the data. The zero eigenvalue
in the third column of Λ indicates that the underlying dimension of the data is two.
The input data can be reconstructed from the output data,
x = W y   or   X = W Y.

The input data can be perfectly reconstructed from y even if the third component
is discarded (33% lossless compression). The input data can be reconstructed with
2.5% mean square error from just the first component of y (67% compression).
The transform did not completely separate the data from the noise. However,
the input reconstructed from just the first output component has an enhanced
signal-to-noise ratio (SNR).
The rotation from the input space X to the eigenspace Y is only one possible
rotation. Although it is more obvious to directly examine other rotations of the
3-dimensional input space, it is simpler to examine rotations of the 2-dimensional
eigenspace. Consider a set of coordinates z derived from rotating the (nonzero)
coordinates in eigenspace through an arbitrary angle α,

[ z_1 ]   [  cos(α)  sin(α) ] [ y_1 ]
[ z_2 ] = [ -sin(α)  cos(α) ] [ y_2 ].

The variance of z_1 is

var(z_1) = cos^2(α) λ_1 + sin^2(α) λ_2.
Figure 2.2: Original (left) and Scaled (right) Data
Figure 2.2 (left) shows the standard deviations of the two nonzero components
of the output. The ellipse in figure 2.2 (left) shows the standard deviations of the
data projected along arbitrary unit vectors. Among all possible sets of projections,
the variance of an individual component is maximized and minimized when the
input is projected against the two eigenvectors w_1 and w_2, respectively. If
the second component is scaled so that the variances of the two components are
equal (Figure 2.2, right), it is not possible to change the component variances by
rotation.
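The toy data set of this section can be regenerated along the following lines. This is only a sketch: the noise level and the random draws are assumptions, so the sample statistics will not match the printed values exactly.

    import numpy as np

    rng = np.random.default_rng(1)
    L = 20
    x1 = rng.uniform(-1.0, 1.0, L)            # first element, uniform on (-1, +1)
    x2 = x1 + 0.1 * rng.standard_normal(L)    # second = first + Gaussian noise (assumed std)
    x3 = x1.copy()                            # third = first exactly
    X = np.vstack([x1, x2, x3])               # 3 x L data matrix

    S = X @ X.T / L                           # scatter matrix, estimates E{x x^T}
    lam, W = np.linalg.eigh(S)
    lam, W = lam[::-1], W[:, ::-1]            # nonincreasing eigenvalues

    print(lam / lam.sum())                    # fraction of total variation per component
    Y = W.T @ X                               # rotate the data into the eigenspace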
2.3 Principal Component Analysis
The Karhunen-Loève Transform (KLT) uses eigenvalue analysis to decompose
a continuous random process, x(t), instead of the random variable discussed in the
preceding sections of this chapter. The discrete equivalent developed by Hotelling
is called Principal component analysis (PCA), but is also often referred to as
Karhunen-Loève Transformation. A nice discussion is found in Jain (1989, pp.
163-175).
Let x be a discrete zero-mean, wide-sense stationary process. Let x_N(n)
denote a block of length N,

x_N(n) = [x(n), x(n-1), ..., x(n-N+1)]^T.

The (N x N) autocorrelation matrix R_xx is positive-definite and Toeplitz (doubly
symmetric and constant along the diagonals) (Kailath, 1980). The eigenvalue
decomposition of R_xx is

R_xx = E{x_N(n) x_N(n)^T} = W Λ W^{-1} = W Λ W^T.
Not all matrices can be diagonalized, but symmetry is a sufficient condition. Since
R_xx is symmetric, it has N orthogonal eigenvectors even if the eigenvalues are
not distinct. PCA is an expansion of x_N(n) using the eigenvectors of R_xx. Any
N-length block of x(n) can be represented by

x_N(n) = Σ_{k=1}^{N} x(n-k+1) e_k = Σ_{k=1}^{N} y_k(n) w_k.

The PCA expansion can be more compactly expressed by

x_N(n) = W y_N(n),   with inverse,   y_N(n) = W^T x_N(n),   (2.8)

where

y_N(n) = [y_1(n), y_2(n), ..., y_N(n)]^T.

Equation 2.8 can also be interpreted as a linear mapping from an N-dimensional
space spanned by the standard basis vectors, e_k, to an N-dimensional space
spanned by the eigenvectors, w_k. The autocorrelation matrix of y_N(n) is

R_yy = E{y_N(n) y_N(n)^T} = E{W^T x_N(n) x_N(n)^T W}
     = W^T E{x_N(n) x_N(n)^T} W
     = W^T R_xx W = Λ.

Λ represents the correlation matrix between the components y_k. The components
are uncorrelated and the variance of each component is simply λ_k. Interpreting
variance as signal energy, trace invariance under similarity transformations equates
to conservation of energy. The original signal can be perfectly reconstructed by the
inverse transformation,

[ x(n), x(n-1), ..., x(n-N+1) ]^T = W [ y_1(n), y_2(n), ..., y_N(n) ]^T.   (2.9)
Decorrelation is desirable for analysis since redundant information between
components is minimized. Reconstruction is often performed with only the first
M < N components for two main reasons,
1. Compression. Using only the first M components achieves an M/N compression
ratio with minimum L2 reconstruction error (see the sketch after this list).
2. Noise Reduction. For signals with additive noise, λ_k is interpreted as a
signal-to-noise ratio (SNR). Reconstruction of x(n) using the high SNR components
retains the signal components and excludes the noisy low energy components.
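The rank-M analysis and synthesis of equations 2.8 and 2.9, restricted to the first M components, can be sketched as follows; the modal matrix W is assumed to come from an eigendecomposition such as the one above, with columns ordered by decreasing eigenvalue.

    import numpy as np

    def reconstruct(x, W, M):
        """Project x onto the first M eigenvectors and map back.

        x : length-N signal block
        W : N x N modal matrix, columns ordered by decreasing eigenvalue
        """
        WM = W[:, :M]            # keep the M highest-energy eigenvectors
        y = WM.T @ x             # analysis:  y_M = W_M^T x          (eq. 2.8)
        return WM @ y            # synthesis: x_hat = W_M y_M        (eq. 2.9)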
2.4 Deflation Techniques
All the eigenvalues of a matrix of rank N can be found analytically by solving
a polynomial of order N,
det(A - λI) = 0.

If only the first few eigenvectors are of interest, then one can either use an
analytical approach such as Singular Value Decomposition (SVD) (Haykin, 1996), or find
an approximation by using a deflation technique. Since one of the eigenvectors
maximizes variance, a vector is found by choosing an arbitrary vector and
iteratively modifying the vector to increase the variance. Once found, the component
corresponding to the eigenvector is removed and the input is said to be deflated.
The next eigenvector is found by repeating the process on the deflated data. From
the basic eigenvalue statement (equation 2.1),
λ_k = w_k^T A w_k.   (2.10)

Consider an arbitrary (unit-norm) vector v and an associated scalar κ,

κ = v^T A v.   (2.11)

The eigenvector corresponding to the largest eigenvalue maximizes κ,

λ_1 = max_v (κ) = max_v {v^T A v}.   (2.12)

Equation 2.11 associates a scalar with each of the vectors in the span of A. The
vector associated with the maximal scalar is an eigenvector of A. More specifically
(a code sketch follows these steps),
1. Set A_1 = A.
2. Use a gradient based iteration (or power method) on w to maximize κ.
3. Set w_1 = w_opt and λ_1 = κ(w_opt).
4. Remove the projection of w_1 from A_1,
   A_2 = A_1 - λ_1 w_1 w_1^T.
5. In the subspace spanned by A_2 the optimal solution to equation 2.12 is
now {λ_2, w_2}. Repeat the procedure until the desired number of solutions
(eigenvectors) is obtained.
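A sketch of the procedure, using the power method for step 2, is given below. The iteration count is an arbitrary choice and no convergence test is included.

    import numpy as np

    def deflation_eig(A, n_components, n_iter=500):
        """Estimate the leading eigenpairs of a symmetric matrix by deflation."""
        rng = np.random.default_rng(0)
        A_k = A.copy()
        pairs = []
        for _ in range(n_components):
            w = rng.standard_normal(A.shape[0])
            for _ in range(n_iter):           # power iteration maximizes w^T A_k w
                w = A_k @ w
                w /= np.linalg.norm(w)
            lam = w @ A_k @ w                 # Rayleigh quotient (eq. 2.11)
            pairs.append((lam, w))
            A_k = A_k - lam * np.outer(w, w)  # deflate: remove the found component
        return pairs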
2.5 Generalized Hebbian Algorithm
Eigendecompositions can be analytically computed by many algorithms (Golub
and Loan, 1989). But here we seek sample-by-sample estimators of PCA conducive
to online implementation in personal computers. There is rich literature on linear
networks to evaluate PCA using gradient descent learning rules (Oja, 1982; Haykin,
1994). Being adaptive, the networks take time to converge and exhibit rattling;
that is, network values fluctuate around the "true" values. Hence, these networks
should not be taken as substitutes for the analytic methods when the goal is to
compute eigenvectors and eigenvalues. However, in signal processing applications
where we have to deal with nonstationary signals and we are interested in feature
vectors for real-time assessment, the "rattling" PCA is very often adequate and saves
enormous computation. In fact, the algorithms about to be described are of O(N)
(the size of the space), instead of O(N^2).
Haykin (1994, pp. 391) states that PCA neural network algorithms can be
grouped into two classes:
1. reestimation algorithms: only feedforward connections,
2. decorrelating algorithms: both feedforward and feedback connections.
Reestimation algorithms use deflation. The generalized Hebbian algorithm (GHA)
is a reestimation algorithm that uses a single (computational) layer linear network
to perform PCA on a process x_N(n). A nice presentation of GHA can be found in
Haykin (1994, pp. 365-394).
Figure 2.3: GHA Linear Network
Let W denote the (M x N) matrix of network weights and let w_k denote
a column of W. Figure 2.3 shows the network that extracts the first M < N
principal components of the random vector x_N(n). The equations for figure 2.3 are

y_M(n) = W^T x_N(n) = [ w_1, w_2, ..., w_M ]^T x_N(n),   M < N.
To adapt the weights, GHA performs three operations,
1. adaptation of each column w_i to maximize variance (energy),
2. adaptation between columns w_i to remove the projection from previous
components,
3. self-normalizing to keep weights at unit norm.
The equations for adapting the weights are

y_j(n) = Σ_{i=0}^{N-1} w_ji(n) x_i(n),   calculate output   (2.13)

Δw_ji(n) = η y_j(n) [ x_i(n) - Σ_{k=0}^{j} w_ki(n) y_k(n) ],   update weights   (2.14)

The update (equation 2.14) has two terms,

η ( y_j(n) x_i(n) - y_j(n) w_ji(n) y_j(n) ),   maximize variance and normalize   (2.15)

-η y_j(n) Σ_{k=0}^{j-1} w_ki(n) y_k(n),   remove projections   (2.16)
Equation 2.15 has two terms. The first term is the classic Hebbian formulation
and has been called the activity product rule (Haykin, 1994, p. 51). The problem
with the classic formulation is that the magnitude of the weights increases. Still,
the classic Hebbian algorithm is elegant in its simplicity and power. The second
term of equation 2.15 is a self correcting adaptation (Sanger, 1989). Equation 2.15
by itself reestimates and normalizes the weights for variance maximization.
Equation 2.16 subtracts the projections of previous (higher energy) components. This
term is introduced to keep the weights to different output nodes from converging
to the same eigenvector. Equation 2.16 also shows that convergence of a principal
component is dependent on the convergence of higher energy components. While
there is no inherent ordering to the eigenvectors, the implementation effectively
creates a sequence of dependencies in the convergence of eigenvectors.
The procedure is adaptive and thus suitable for locally stationary data. Even
if the process is not stationary, the mapping W_N will give perfect reconstruction
(W_{NxN} is invertible). PCA is optimal for minimum L2 reconstruction using fixed
length filters on a stationary signal. For optimal L2 compression and reconstruction,
the first M (out of N) components of y_N(n) are used.
Using M < N components yields a compression of (N - M)/N. Further
compression is usually obtained by using fewer bits to encode lower energy components.

y_M(k) = [y_0(k), y_1(k), ..., y_{M-1}(k)]^T = W_{NxM}^T x_N(k),   (2.17)

and the reconstruction (denoted by the subscript c) of the original signal is

x_c(k) = [ W_{NxM} (W_{NxM}^T W_{NxM})^{-1} ] y_M(k).   (2.18)
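A sample-by-sample sketch of GHA following equations 2.13 and 2.14 is given below. The learning rate is an assumed constant (in practice it is usually decayed over time), and the low-pass data stream is only a stand-in for a real signal.

    import numpy as np

    def gha_update(W, x, eta=1e-3):
        """One GHA step.  W is M x N (one eigenvector estimate per row)."""
        y = W @ x                                  # outputs, eq. 2.13
        # the lower-triangular part implements the deflation sum over k <= j (eq. 2.14)
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
        return W

    # usage: stream samples of a zero-mean process through the network
    rng = np.random.default_rng(0)
    N, M = 8, 3
    W = 0.01 * rng.standard_normal((M, N))
    for _ in range(20000):
        x = np.convolve(rng.standard_normal(N + 4), np.ones(5) / 5, "valid")  # low-pass data block
        W = gha_update(W, x)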
2.6 Eigenfilters
The direct interpretation of equation 2.8 is that each of the y_k(n) is a projection
of x_N(n) on an eigenvector w_k. Each projection is found by taking the inner
product,

y_k(n) = ⟨w_k, x_N(n)⟩ = w_k^T x_N(n).   (2.19)

The underlying time structure of x_N(n) allows a filter interpretation of PCA,

x_N(n) = [x(n), x(n-1), ..., x(n-N+1)]^T.

Rewriting equation 2.19,

y_k(n) = Σ_{i=0}^{N-1} w_k(i) x(n-i).   (2.20)

Equation 2.20 is a convolution sum of x with a filter impulse response w_k(n). FIR
filters whose coefficients (impulse response) are derived from eigenvalue analysis are
called eigenfilters (Vetterli and Kovacevic, 1995). The collection of filters, {w_k}, can
be interpreted as an analysis bank. If the principal components,

y(n) = [y_1(n), y_2(n), ..., y_N(n)]^T,
are then processed to reconstruct the original input XN(n), the reconstruction
filters form the synthesis bank. Eigenfilters have several key properties:
1. Both the analysis and synthesis banks of a linear network use finite impulse
response (FIR) filters,
2. Since the autocorrelation matrix is Toeplitz, the eigenvectors are all either
symmetric or antisymmetric,
3. Since the modal matrix is unitary, the synthesis filters can be implemented
easily (transpose of the analysis bank),
4. Since the components are uncorrelated, the reconstruction from each
component is independent of other components.
The remainder of this section illustrates the decomposition of simple test signals.
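The eigenfilter interpretation can be reproduced numerically: form the Toeplitz autocorrelation matrix from the first N autocorrelation lags, eigendecompose it, and treat each eigenvector as an FIR impulse response. The sketch below uses the lag values quoted in the next subsection; the FFT length is an arbitrary choice.

    import numpy as np
    from scipy.linalg import toeplitz

    r = [0.3333, 0.1665, 0.0830, 0.0410, 0.0195, 0.0078]   # autocorrelation lags 0..5
    Rxx = toeplitz(r)                                      # symmetric Toeplitz matrix

    lam, W = np.linalg.eigh(Rxx)
    lam, W = lam[::-1], W[:, ::-1]                         # order by decreasing output energy

    print(lam)                                             # eigenvalues (filter output energies)
    H = np.fft.rfft(W, n=256, axis=0)                      # frequency response of each eigenfilter
    print(np.abs(H).argmax(axis=0))                        # bin index of each filter's passband peak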
2.6.1 Low Pass Test Signal
Figure 2.4: Low Pass Test Data (5th-order MA filter output and impulse response)
A test signal (Figure 2.4) was generated using a 5th-order moving average
(MA) filter driven by white Gaussian noise,
x(n) = Σ_{k=0}^{5} (1/2)^{k+1} u(n-k)   ⟺   X(z) = (1/2) Σ_{k=0}^{5} (2z)^{-k} U(z).

The transfer function is

X(z)/U(z) = (1/2) [ 1 + 1/(2z) + 1/(2z)^2 + 1/(2z)^3 + 1/(2z)^4 + 1/(2z)^5 ].

The 5 zeros are evenly spaced,

z_k = (1/2) exp(j k π / 3),   k ∈ [1, 2, 3, 4, 5].
The first six autocorrelation coefficients are
R_xx(k) = [ 0.3333  0.1665  0.0830  0.0410  0.0195  0.0078 ]^T.

Consider the autocorrelation matrix formed by the first six autocorrelation
coefficients. The eigenvalues are

λ_k = [ 0.7952  0.4811  0.2811  0.1859  0.1388  0.1174 ].
The corresponding eigenvectors (columns) are

    [ 0.3034  0.4940  0.5350  0.4721  0.3489  0.1819 ]
    [ 0.4195  0.4687  0.1244  0.3313  0.5554  0.4130 ]
W = [ 0.4816  0.1905  0.4453  0.4091  0.2640  0.5444 ]   (2.21)
    [ 0.4816  0.1905  0.4453  0.4091  0.2640  0.5444 ]
    [ 0.4195  0.4687  0.1244  0.3313  0.5554  0.4130 ]
    [ 0.3034  0.4940  0.5350  0.4721  0.3489  0.1819 ]
Figure 2.5 shows the eigenfilters generated from the eigenvectors shown in
equation 2.21. Notice the filter bank structure, as described above, that appears in a
self-organizing manner; that is, no one programmed the filters. It was the data and
the constraints placed on the topology and adaptation rule that led to a unique set
of filter weights.
The bandwidth of the filters is dictated by the size of the input delay line
(1/NT_s). This illustrates why PCA defaults to a Fourier transform when the
observation window size approaches infinity.
Figure 2.5: Low Pass Data PCA (eigenfilter impulse responses, frequency magnitude, and frequency phase)
The 1/f energy distribution can be observed by normalizing the eigenvalues
by the trace. The sum of the diagonal elements is invariant and can be interpreted
as total energy. An eigenvalue divided by the trace can be interpreted as the
percentage of the total signal energy belonging to the outputs of the corresponding
eigenfilters. Figure 2.5 and the normalized eigenvalues below show that the eigenfilters
are ordered by passband center frequency and output energy,

λ_k / tr(R_xx) = [ 0.3977  0.2406  0.1406  0.0930  0.0694  0.0587 ].

The normalized eigenvalues provide some upper bounds for compression. These
eigenfilters are expected to be optimal for any signal generated by the MA filter above.
2.6.2 High Pass Test Signal
The high pass signal was simply a low to high pass conversion of the low pass
signal. The zero at ω = π was moved to ω = 0. Figure 2.6 shows that PCAM
adapted to order the basis functions by energy. The time-frequency resolution
tradeoff can be observed by looking across a row and seeing that the shorter filter
results in a wider frequency passband.
Figure 2.6: High Pass Data PCA (eigenfilter impulse responses, frequency magnitude, and frequency phase)
2.6.3 Mixed Mode Test Signal
The high pass signal and low pass signal were added together for the mixed
mode signal. Figure 2.7 shows that PCAM adapted to order the basis functions
by energy.
Figure 2.7: Test High and Low Frequency Data (eigenfilter impulse responses, frequency magnitude, and frequency phase)
2.7 Key Properties of PCA
Eigendecomposition has a strong mathematical foundation and is a tool used
across several disciplines. Eigenvalue decomposition is an optimal representation in
many ways. Key properties of PCA include,
1. The elements of Λ (eigenvalues) are positive and real, and the elements of W
(eigenvectors) are real.
2. Aside from scaling and transposing columns, W is the unique matrix that
both decorrelates x_N(n) and maximizes variance for components,
3. Since W is unitary, W^{-1} = W^T and reconstruction is easy. The mapping is
norm preserving and reconstruction error is easily measured.
PCA has several criticisms:
1. The mapping is linear. The underlying structure for some applications
may be nonlinear. However, a nonlinear problem can be made into a linear
problem by projection to a higher dimension.
2. The mapping is global. Each output component is dependent on all the
input components. If important features are dependent on some subset of
input components, it would be desirable to have output components that are
localized to the appropriate input components.
3. PCA components resemble each other. Approaching the transform as an
eigenfilter bank provides some insight. FIR filters have large sidelobes.
Orthogonality is obtained by constructive and destructive combinations of
sidelobes. It seems typical that the low frequency component is so large that
the sidelobes do not provide sufficient attenuation.
CHAPTER 3
MULTIRESOLUTION
A discussion on multiresolution should start with time signals and the classic
time-frequency resolution tradeoff. Assume that a recording session produces
some (real) analog signal x(t). The analog signal x(t) is sampled at some uniform
interval Ts to produce a discrete time signal x(nTs). For convenience, normalize
Ts to unity so that x(n) ≡ x(nTs). The session x(n) is usually divided into
smaller observation windows of duration N (= NTs). The choice for N fixes the
resolution of the analysis. Denote a block of data of length N by x_N(n). Consider
the Discrete Fourier Transform (DFT) of x_N(n),

x_N(n) ⟷ X_N(k).

The DFT transforms a vector x_N(n) with N real components to a vector X_N(k)
with N complex components. x_N(n) and X_N(k) are the time-domain representation
and frequency-domain representation, respectively, of the signal. Each
component of X_N(k) is a linear combination of the elements of x_N(n). That is,
each component of X_N(k) is a feature of the entire input x_N(n) and can only be
localized in time to NTs. The frequency resolution of the output is 1/NTs. The
input x_N(n) has high resolution in time (Ts), but no resolution in frequency. As
N increases, the output X_N(k) loses resolution in time and gains resolution in
frequency. The time and frequency resolution of the output are fixed by the single
parameter N. Ideally, there might be an optimal choice for N, the observation
window length. For example, N would be matched to the duration of key features
in the signal. Sometimes, however, it can be difficult to make a judicious choice for
N if,
1. key features are not known,
2. the optimal length is different among the key features.
Under such situations, multiresolution is an alternative to fixed resolution
representations. The DFT is a fixed resolution representation since each component
of X_N(k) has the same resolution. In the context of the above discussion, a
multiresolution representation would be a representation X_N(k) whose elements have
varying resolution. More generally, multiresolution is the representation of a signal
across several resolutions.
3.1 Two Notes Example
The two notes example is now found in many standard texts on time-frequency
techniques; this section is an abbreviated version of Kaiser (1994). Consider a
signal composed of "notes" of single frequency, and the problem of detecting the
number of notes that occur in a time interval. Figure 3.1 (Kaiser, 1994) shows a
signal consisting of two single frequencies that occur at different times.
Figure 3.1: Two Notes
Theoretically, the two notes can be separated by using either frequency or
time information. However, a frequency representation has no time resolution and
limited (by the observation window length) frequency resolution. Unless the notes
are sufficiently separated in frequency, a standard Fourier transform of the signal
will not resolve the two notes. Certainly, in the extreme case where the two notes
are at the same frequency, time domain information is necessary to isolate the
notes. A Fourier transform cannot take advantage of the time information to help
resolve the individual notes.
Similarly, the time representation has no frequency resolution and limited (by
the sampling interval) time resolution. If the two notes are not well separated in
time, but well separated in frequency, timedomain analysis cannot separate the
notes. The corresponding extreme case is that if the two notes overlap in time,
frequency domain analysis is necessary to resolve the two notes. Clearly, it is
desirable to use both time and frequency domain information.
One of the first (combination) time-frequency techniques is the Short Term
Fourier Transform (STFT) or windowed Fourier Transform (Porat, 1994, pp. 335-337).
The signal x(n) is divided into subintervals of some fixed length. The
essential approach is that instead of using a single transform X(k) over the entire
time interval, a Fourier transform is taken over each subinterval. The results
are displayed in a waterfall plot with time and frequency forming two axes and
the magnitude of the frequency on the third axis. The waterfall plot provides
information on how the frequency content of a signal changes over time. Discussion
of implementation details, such as window functions and overlapping windows,
can be found in Strang and Nguyen (1996); Vetterli and Kovacevic (1995). The
relevance to this work is that a signal can be represented using both time and
frequency using a fixedresolution (constant block length N) technique such as the
STFT.
Again, an important consideration for a fixed-resolution analysis is choosing
a "good" window length. If the window length is too long, time resolution is lost.
If the window length is too short, frequency resolution is sacrificed. Figure 3.1
shows that either time or frequency resolution (or both) may be critical for a
given application. The transition from fixed resolution to multiresolution can be
performed with iterated filter banks that will be further discussed in chapter 4.
3.2 Quadrature Filter and Iterated Filter Bank
An iterated filter bank uses variable length windows to provide high frequency
resolution at low frequencies and high time resolution at high frequencies. A dyadic
filter bank uses a pair of filters to divide a signal into two components. The two
filters are designated H_0(z) and H_1(z) (Figure 3.2).
Figure 3.2: Quadrature Filter
The filters must be chosen to divide a signal into orthogonal components that
can later be used to perfectly reconstruct the original signal. Familiar choices for
dyadic filters include,
1. simple odd-even decomposition,
2. quadrature modulation filters in communications (sin and cos components),
3. quadrature mirror filter (H_1(z) = H_0(-z)) (Strang and Nguyen, 1996, p. 109).
The quadrature mirror filters H_0(z) and H_1(z) are constructed as low-pass
and high-pass filters, respectively. A dyadic iterated filter bank is formed
by passing the output of H_1(z) or H_0(z) into another identical filter bank
(Figure 3.4). A series of cascaded low-order filters is equivalent to a single
high-order filter. Time resolution decreases and frequency resolution increases with the
number of low-order filters in the cascade. An intuitively appealing approach is to
iterate the low frequency component; this approach will be discussed in more detail
in the next section. The rationale is that low frequency components do not require
high time resolution since low frequency implies slow changes. The quadrature
mirror filter was an early implementation; a more recent approach is the use of
wavelets (Strang and Nguyen, 1996).
3.3 Wavelets in Discrete Time
Mathematically, passing a signal through a filter and downsampling can be
presented as projection against basis functions. The design of filters is equivalent
to finding appropriate basis functions. For wavelet analysis, a function is chosen
as the mother wavelet. The basis functions at each level correspond to some
dilation (scaling) of the mother wavelet. Within a level, all the basis functions
are nonoverlapping, time-shifted versions of the same function. The scaling and
shifting allow time resolution over intervals less than NTs. It is desirable that basis
functions from different levels are orthogonal, but linear independence is sufficient.
A standard approach to multiresolution is to use a cascade of 2-bank (high-pass
H_1 and low-pass H_0) filters (Figure 3.3). The outputs of the analysis filters
are downsampled by a factor of 2, then the low-pass output is cascaded into
another analysis bank of 2 filters. The process is repeated for the desired number of
levels. The reverse operation takes place at the synthesis bank {Gi}.
Figure 3.3: Discrete Wavelet Transform with 2 Levels
Again, the rationale for choosing this sequence of operations is that high
frequency components can change quickly, hence the highest frequency component
should be sampled most often. The lowest frequency component changes the least
frequently and can be downsampled several times. The iterated tree structure can
be implemented as a parallel structure (Figure 3.4).
Figure 3.4: Equivalent 2^N Filter Bank DWT Implementation
3.4 Haar Wavelet
The Haar wavelet uses the simplest set of basis functions. The low pass filter is
h_0(k) = [1, 1] and the high pass filter is h_1(k) = [1, -1]. The matrix W for
a two level decomposition is shown in equation 3.1. For the 2-level Haar example,
the input is divided into segments of length N = 4 and

y_4(k) = W^T x_4(k),   where

              [ 1    1    1    1  ]
W^T = (1/2)   [ 1    1   -1   -1  ]   (3.1)
              [ √2  -√2   0    0  ]
              [ 0    0    √2  -√2 ]

The matrix W is invertible so perfect reconstruction is possible. Since W is
orthonormal, that is, W^{-1} = W^T, no further calculations are needed for constructing
the synthesis filters. The Haar wavelet has the worst frequency resolution; other
basis functions (such as the Morlet wavelet) may be more appropriate depending on the desired
tradeoff between resolution in time and frequency.
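A sketch of the two-level Haar analysis of equation 3.1 on blocks of length four follows; the 1/2 and √2/2 normalization is the assumption that makes W orthonormal.

    import numpy as np

    s = np.sqrt(2.0)
    W_T = np.array([[1,  1,  1,  1],     # level-2 approximation (scaling function)
                    [1,  1, -1, -1],     # level-2 detail
                    [s, -s,  0,  0],     # level-1 details
                    [0,  0,  s, -s]]) / 2.0

    x = np.array([4.0, 2.0, 5.0, 5.0])
    y = W_T @ x                                   # analysis:  y = W^T x
    print(np.allclose(x, W_T.T @ y))              # synthesis: x = W y (perfect reconstruction)
    print(np.allclose(W_T @ W_T.T, np.eye(4)))    # W is orthonormal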
3.5 A Multiresolution Application: Compression
A standard multiresolution application is representing a signal (image) at
different scales. Figure 3.5 shows an example of a Haar decomposition (not the
Haar transform) that is dyadic along each dimension.
Figure 3.5: Three Levels of Decomposition on the Approximation (panels: original, iteration 1, iteration 2, iteration 3, pyramid)
The Haar basis vectors are e_1 = [1, 1] and e_2 = [1, -1]. For 2 dimensions,
the basis vectors are e_1e_1, e_1e_2, e_2e_1, e_2e_2 (the 2-D bases are separable and
identical). An image is partitioned into nonoverlapping (2 x 2) blocks and each block
is projected against the basis vectors. Using nonoverlapping blocks is similar to
a polyphase filter and is more computationally efficient than downsampling the
projections. The first projection is simply an average of each (2 x 2) block and
gives a good compressed approximation of the original image. The other three
detail images have the information needed (in addition to the approximation)
to perfectly reconstruct the original image. That is the detail signals have the
information needed to correct the reconstruction from the approximation. This is
slightly different than the pyramidal approach that provides a single correction at
the lower scale. The procedure can be repeated on the approximation to provide
an approximation at the next level of compression. All the information for creating
the first (level 1) approximation is contained in the original (level 0) image. Simi
larly, an approximation at any level only uses information from the approximation
at the previous level. Lower level approximations have more information (spatial
resolution) and less compression than high-level approximations. Clearly, there
is less data to process if classifications can be performed with compressed images
(smaller matrices).
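One level of the block-based 2-D Haar decomposition described above can be sketched as follows. Array slicing replaces the explicit 2 x 2 block projections, and the 1/4 scaling of the detail images is a choice made here so that the four outputs add back to the original pixels; other normalizations are common.

    import numpy as np

    def haar2d_level(img):
        """Split an image (even dimensions) into an approximation and three details."""
        a = img[0::2, 0::2]; b = img[0::2, 1::2]     # the four pixels of each 2x2 block
        c = img[1::2, 0::2]; d = img[1::2, 1::2]
        approx = (a + b + c + d) / 4.0               # e1 e1: block average
        horiz  = (a - b + c - d) / 4.0               # e1 e2: horizontal detail
        vert   = (a + b - c - d) / 4.0               # e2 e1: vertical detail
        diag   = (a - b - c + d) / 4.0               # e2 e2: diagonal detail
        return approx, horiz, vert, diag

    # repeat on the approximation for further levels, as in Figure 3.5
    img = np.random.default_rng(0).random((64, 64))
    approx, *details = haar2d_level(img)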
Our main interest in multiresolution is in deriving multiscale localized features
for classification. Inputs presented at different scales lead to extraction of features
at different scales. The next chapter continues the discussion of multiresolution
with more focus on deriving and using multiresolution features in PCAM.
CHAPTER 4
PRINCIPAL COMPONENT ANALYSIS WITH MULTIRESOLUTION
4.1 Definition of PCAM
In this section, we formally treat localization with PCA. PCA is briefly
discussed in a context of representation and feature extraction. PCAM is presented
as PCA with localized outputs. The localized outputs of PCAM are structured to
provide a multiscale representation. The section ends with a formal definition of
principal component analysis with multiresolution.
4.1.1 Localization of PCA
Consider a set of K training images,
STRAIN = {01(n),  ", (n>),* OK(n)}.
The pixels of each image ', (n), are indexed by n E S. Define,
K
Xk (n) i. (n) Y (n).
k=1
If the training images already have zero mean, then x_k(n) = φ_k(n). Principal
component analysis has two stages, training and testing (verification). The first
stage derives a set of eigenvectors and eigenvalues, (ψ_m(n), λ_m), for the set of
training images. Denote the set of eigenvectors by Ψ,
Ψ = {ψ_1(n), ..., ψ_m(n), ..., ψ_M(n)}.
As discussed earlier, the number of nonzero eigenvectors, M, is the minimum of
the number of exemplars, K, or the number of components of each exemplar, N.
The training stage of PCA finds a mapping from a set of training images to a set of
eigenvectors,
Φ_TRAIN → Ψ. (4.1)
Equation 4.1 emphasizes that the eigenvectors and eigenvalues are characteristics of
a set of input images. Since the input images and the eigenvectors have the same
spatial index, the eigenvectors are also called eigenimages. Once trained, PCA
uses the eigenimages to decompose each new input onto a set of components. The
second stage of PCA is a mapping from an input image to a set of M output scalar
components,
x_k(n) → {y_1, ..., y_m, ..., y_M}_k. (4.2)
Equation 4.2 shows that each input has a unique set of outputs (components).
Since the association from input to output is usually implicit, notation can be
simplified. Rewrite equation 4.2,
x(n) → {y_1, ..., y_m, ..., y_M}. (4.3)
Each component is global since its value is calculated using all the pixels of the
input image,
y_m = < x(n), ψ_m(n) > = Σ_{n∈S} x(n) ψ_m(n). (4.4)
The dependency of a global output on a specific input pixel is seen by differentiat
ing equation 4.4,
∀ (n ∈ S): ∂y_m / ∂x(n) = ψ_m(n). (4.5)
Output localization implies that an output is dependent only on a local set of
pixels. Consider a subregion of the pixels, A C S. A local output could be specified
by
∂(y_LOCAL) / ∂x(n) = ψ_m(n) for n ∈ A, and 0 otherwise. (4.6)
Localization could arise naturally if some of the eigenvector components were zero,
ψ_m(n) = 0. PCAM forces localization by explicitly manipulating equation 4.6.
Definition 4.1.1 Consider a set of N-dimensional inputs, x(n). The components
of each input are indexed by n ∈ S, where S = [1, ..., N]. Let A be a subset of
S such that A corresponds to a localized time interval if x(n) is a time signal, or
to a local region if x(n) is an image. Denote the subregion of an input by x_A(n).
Let w_A(n) be the corresponding eigenvector (eigenimage). A localized PCAM
output is
y_A = < x_A(n), w_A(n) > = Σ_{n∈A} x_A(n) w_A(n). (4.7)
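A minimal sketch of equation 4.7 (the mask, weights, and data below are illustrative): the localized output depends only on the pixels inside A.

```python
# Localized PCAM output: inner product of an input with a local eigenvector,
# restricted to the pixel subset A.
import numpy as np

def localized_output(x, w_A, A):
    """x: full input image; w_A: local weights defined on A; A: boolean mask."""
    return float(np.sum(x[A] * w_A))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # toy input image
A = np.zeros((8, 8), dtype=bool)
A[0:4, 0:4] = True                       # subregion: top-left (4 x 4) block
w_A = rng.standard_normal(A.sum())       # stand-in for the local eigenvector
w_A /= np.linalg.norm(w_A)               # unit norm
y_A = localized_output(x, w_A, A)        # depends only on pixels in A
```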
Figure 4.1: PCAM for Classification (data → PCAM → feature extractor → classifier → class k)
Before defining a localized eigenvector, we want to restate our goals for PCAM so
that the design choices will be understood. First, both PCA and PCAM provide
representations, but this does not mean that they can be automatically used
for feature extraction. In fact, PCAM components are not directly constructed
for optimal discrimination. So PCAM should be understood as a preprocessor
for classification that constrains the scale and locality of subsequent features
(Figure 4.1).
Second, given that PCAM is not the ideal feature extractor, care should be
taken that no information is lost. Since PCAM cannot identify whether some
information is needed for discrimination, all information should be propagated to
the classifier. PCAM should (and can) provide at least a complete representation
of the input in the space of the training set. That is, some information from
nontraining exemplars is always lost since the eigenspace is a subspace of all
possible inputs. If the application is classification, an overcomplete representation
(redundancy) is not only acceptable, but may be essential. Nonetheless, we design
PCAM with a minimum amount of redundancy. It is usually much easier to add
redundancy than reduce redundancy.
Finally, in allowing each output to have inputs of varying geometry (scale and
shape), localized eigenvectors are not guaranteed to be orthogonal. Since they are
not orthogonal, it is an abuse of nomenclature to continue to refer to the localized
eigenvectors as "eigenvectors". Since the PCAM network weights converge to the
localized eigenvectors, we will henceforth call them PCAM weights. Orthogonality
has a direct impact on designing PCAM since many implementations of PCA
involve deflation in some form. While it is not always apparent from the network
architecture, there is an inherent sequencing of calculations. As the weights for
the first output are calculated, the weights and output are used to reconstruct an
estimated input. The input is deflated by the estimate, and the deflated inputs are
used as "effective inpl for calculating the weights of subsequent outputs.
4.1.2 A Structure of Localized Outputs
Definition 4.1.1 is not an implementable definition for a localized PCAM
output since the corresponding localized eigenvector wA(n) is not yet defined. If
all outputs are supported by the same region, the eigenvectors are found using
standard PCA with the localized input, xA(n), as the new input. If each output
is supported by a separate nonoverlapping region, the localized eigenvector is
found by treating each region as a separate standard PCA problem. When the
outputs are localized to overlapping regions of varying geometry (size and shape),
the meaning of orthogonality becomes unclear. That is, eigenvectors that are derived
from the same subregion of the training images are orthogonal. Also, eigenvectors
that are each derived from nonoverlapping subregions are orthogonal. However,
eigenvectors that are each derived from partially overlapping subregions are not
generally orthogonal.
Definition 4.1.2 Consider a set of Ndimensional inputs, x(n), with an associated
set of M eigenvectors, ψ(n). For both the inputs and eigenvectors, components are
indexed by n ∈ S, where S = [1, ..., N]. PCAM is an iterative procedure:
1. For the first eigenvector, partition S into R(1) subregions such that
∪_{r∈[1,...,R(1)]} S_r = S, (4.8)
2. treating each subregion as a separate eigenvalue problem, calculate the first
eigenvector for each region,
3. deflate each region of the input, and use the deflated input as the effective
input for subsequent calculations (see the sketch at the end of this section).
The geometry of the partitions can change for each iteration. The number of
iterations to span the input space will not exceed M.
A global scalar output ym is replaced by an array of R(m) localized outputs,
y_m = [y_1 ... y_r(m) ... y_R(m)], (4.9)
where
y_r(m) = Σ_{n∈A} x_k(n) w_m(n), where A = S_r(m). (4.10)
Each array of outputs, ym, is a compressed version of the input image. A fine
partitioning (R(m) large) corresponds to a fine resolution for ym. If the partitions
are identical for each array of outputs ym, the representation has fixed resolution.
Analogous to equation 4.2, the mapping for PCAM is
x_k(n) → {y_m(r(m))}_k. (4.11)
The PCA network trains M sets of weights corresponding to full eigenimages. The
PCAM network has (Σ_{m=1}^{M} R(m)) sets of weights corresponding to partitioned
eigenimages. The PCAM network replaces each scalar global output, ym, with an
array of localized outputs, ym. Constraints on the weights of the GHA network
allow control of the partitioning (number and composition) for each output.
Control of the structure of the partitions sets the localization and scale for the
PCAM network.
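As a concrete illustration of the iterative procedure in Definition 4.1.2 (an assumed sketch, not the dissertation's implementation), the following numpy code runs one PCAM iteration with square, non-overlapping blocks: the first eigenvector of each block's scatter matrix is computed over the training set, localized outputs are formed, and each block is deflated.

```python
# One PCAM iteration for square, non-overlapping subregions (toy data).
import numpy as np

def pcam_iteration(X, block):
    """X: (K, R, C) training images; block: side length of square subregions.
    Returns per-block first eigenvectors, localized outputs, deflated images."""
    K, R, C = X.shape
    W = {}                                            # (row, col) block -> eigenvector
    Y = np.zeros((K, R // block, C // block))
    X_def = X.copy()
    for bi in range(R // block):
        for bj in range(C // block):
            rows = slice(bi * block, (bi + 1) * block)
            cols = slice(bj * block, (bj + 1) * block)
            P = X[:, rows, cols].reshape(K, -1)       # K patches as rows
            S = P.T @ P                               # block scatter matrix
            vals, vecs = np.linalg.eigh(S)
            w = vecs[:, -1]                           # first eigenvector of the block
            W[(bi, bj)] = w
            y = P @ w                                 # localized outputs
            Y[:, bi, bj] = y
            P_def = P - np.outer(y, w)                # deflate the block
            X_def[:, rows, cols] = P_def.reshape(K, block, block)
    return W, Y, X_def

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 16, 16))                 # 10 toy training images
W1, Y1, X1 = pcam_iteration(X, block=4)               # first iteration, 4 x 4 regions
```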
4.2 The Classification Problem
The applications presented in the next chapter use PCAM for classification.
This section restates the basic classification problem so that PCAM can be
discussed in the context of feature extraction.
Given a choice of several classes Ck and some data x,, a basic classifier assigns
the input to one of the classes.
x_n → C_k.
Equivalently, the class index k is a function of the input x,
k =g(xn). (4.12)
For example, a classifier could identify that photograph x, belongs to person k.
Mathematically, designing a good classifier is finding a good mapping function
g. Each class of data contains features; that is, characteristics that are useful
for classification. Ideally those features, θ(x), could be separated from the "use-
less" characteristics in the raw data. Presented with only the pertinent data for
classification, the task of the classifier becomes easier.
k = f(θ(x)) = g(x). (4.13)
Worsening the resolution of the inputs is intended to remove details that are not
needed for classification, while retaining coarse features that are needed. In general,
too fine a resolution retains unneeded details, while too coarse a resolution discards
critical information. A multiresolution approach provides a structure for control
of the detailed information. PCAM can provide multiscale representations of an
exemplar to allow extraction of features at different scales (section 4.3.3). PCAM
can also be used to directly localize a global eigenimage (section 4.3.4). A search
for features in the universal space may not be feasible, but the eigenspace may
be too restrictive for feature extraction. PCAM provides a space richer than
eigenspace (figure 4.2), but still keeps the dimensionality under control.
Figure 4.2: PCA and PCAM in Feature Space (PCA features ⊂ PCAM features ⊂ all features)
4.3 Complete Representations
The section on eigenfaces is a classical application of PCA to images. PCA can
be conceptualized as PCAM with the coarsest partitioning (equation 4.8),
Vm : R(m) =1.
The identity map is presented for contrast as PCAM with the finest partitioning,
Vm : R(m)= N.
The iterated filter bank and dual decomposition have milder restrictions on the
partition sizes but implement constraints based on stationarity. The identity map
can be considered as PCA with fully localized outputs. The next section is an
example of PCA with global outputs. The subsequent section on iterated filter
banks describes a structure with outputs of varying localization.
4.3.1 Eigenfaces
The theoretical side of standard PCA has been discussed earlier. The eigen
faces section presents an implementation of standard PCA using the generalized
hebbian algorithm (GHA) presented in section 2.5. Since the theory and struc
ture has been discussed earlier, this section presents experimental results. The
PCA decompositions presented in this section are used for comparison to PCAM
decompositions (in the next section) of the same set of data.
The GHA network is a simple, flexible and efficient way to implement PCA.
Minor structural modifications lead to multiresolution. This section presents the
GHA Network used for standard PCA. Subsequent sections then step through
several modifications. With each modification, we present:
1. the network structure,
2. the changes in the representation that arise from the modifications,
3. an example of the representation using one of the faces drawn from the ORL
database.
Figure 4.3 shows K = 10 pictures from the ORL database. These ten pictures
{φ_k}_{k=1...10} are all of the same person and cropped to R = 112 rows and C = 92
columns.
9v~"S: Le
Figure 4.3: Raw Images from ORL Database
Each input φ_k(n) is described by two indices. The index k identifies the
specific exemplar, and the index n specifies the specific component of φ_k. For time
signals, n is a one-dimensional index and the components are time samples. For
images, n specifies the row and column of the image's pixels; n can be either a
two-dimensional vector index or a one-dimensional index to a rasterized version of
the image. The class average φ_0 is formed by averaging the ten faces,
φ_0(n) = (1/K) Σ_{k=1}^{K} φ_k(n).
After subtracting the average from each face, the residual images {xk}(k= 1..10) are
shown in figure 4.4.
Figure 4.4: Residual Images for GHA Input
The residual images are each presented at the input layer of a network similar to
figure 4.5 (left).
Figure 4.5: Two single layer networks with: each output driven by all inputs (left), and each output driven by a single input (right)
The number of input nodes for the GHA network is determined by the dimensions
of the exemplars, 10304 (112 x 92). The number of nonzero output nodes
can be no more than the number of linearly independent inputs; for this example,
there are ten output nodes. The network has a single computational layer with
every input connected to each output; each output has 10304 associated weights. For
convenience, construct the input matrix X such that each input image x_k is a
column of X,
X = [x_1 x_2 ... x_10]. (4.14)
As each exemplar is presented at the input layer, the output is calculated and the
weights are updated, using equations 4.15 and 4.16.
y_j(n) = Σ_{i=0}^{p-1} w_ji(n) x_i(n), calculate output (4.15)
Δw_ji(n) = η y_j(n) [ x_i(n) - Σ_{k≤j} w_ki(n) y_k(n) ], update weights (4.16)
Equation 4.16 shows that each output is affected by prior outputs. This depen
dency is shown in figure 4.5 (left) by dashed lateral connections between output
nodes. Since the training set had ten linearly independent images, the network
converges to ten sets of weights, {wk}k= ..10. The (10304 x 10) transformation
matrix WA is constructed by setting each wk as a column of WA,
W_A = [w_1 w_2 ... w_10]. (4.17)
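A minimal sketch of the training loop behind equations 4.15 and 4.16 (Sanger's generalized Hebbian rule); the learning rate, epoch count, and the name residual_images in the usage comment are assumptions, not the dissertation's settings.

```python
# GHA (Sanger's rule) training sketch; learning rate and epochs are illustrative.
import numpy as np

def gha_train(X, n_outputs, eta=1e-3, epochs=200, seed=0):
    """X: (K, N) zero-mean exemplars as rows. Returns (N, n_outputs) weights."""
    K, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_outputs, N)) * 0.01       # rows are w_j
    for _ in range(epochs):
        for x in X:
            y = W @ x                                     # eq. 4.15: outputs
            # eq. 4.16: dW_ji = eta * y_j * (x_i - sum_{k<=j} w_ki y_k)
            LT = np.tril(np.outer(y, y))                  # lower-triangular y_j y_k
            W += eta * (np.outer(y, x) - LT @ W)
    return W.T                                            # columns approach eigenvectors

# Usage (hypothetical names): W_A = gha_train(residual_images.reshape(10, -1), 10)
```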
The weights are eigenvectors that will be described by w, w_k, or w_k(n) depending
on the context. The index k identifies the corresponding output nodes of the GHA
network. The ordering for GHA output nodes orders the eigenvectors such that the
corresponding eigenvalues are in decreasing order. Each wk has 10304 components
that can be arranged in an (112 x 92) array corresponding to the positions of
the associated input components. When the eigenvectors are arranged as a two
dimensional array, the eigenvectors are also called eigenimages. The eigenimages
resemble the input faces. Because of this resemblance, the eigenvectors are also
called eigenfaces.
Figure 4.6: Eigenfaces from GHA Weights
Possibly the widest application for PCA is in signal representation and reconstruc
tion. The training inputs can be perfectly reconstructed as a linear combination of
eigenfaces. The quality for reconstruction of other inputs depends on the degree
that the exemplars are representative of other inputs. The eigenvalues are an
indication of the reduction in reconstruction MSE. Table 4.1 shows that the inputs
can be reconstructed with about 10% reconstruction error using just the first four
eigenfaces and that the contribution from the last eigenface is negligible.
Table 4.1: Normalized Eigenvalues
λ1   23.9%      λ6    5.8%
λ2   21.7%      λ7    4.0%
λ3   ≈19%       λ8    3.6%
λ4   ≈10%       λ9    3.0%
λ5   7.92%      λ10   0.01%
The eigenface expansion provides reconstruction for the network inputs. The inputs
are the residual images (exemplars minus the average image). The reconstruction
of the original exemplars requires adding back the class average. Because the PCA
analysis characterizes residual images, it is expected that the average image is
poorly reconstructed by the eigenface expansion. This is especially interesting since
most of the energy is in the average image (table 4.2). In a classification problem,
there is an average image for each class as well as a single average for all images
across all classes. The high energy in the individual class averages suggests that
each class is characterized by the variation of its class average from the overall
average, not by the variations in the individual images.
Table 4.2: Energy Distribution of Exemplars
Exemplar   Energy in Average   Energy in Component   Percent in Component
x1         0.2970              0.0150                4.81%
x2         0.3614              0.0212                ≈6%
x3         0.3130              0.0166                5.32%
x4         0.3454              0.0154                4.95%
x5         0.3492              0.0114                ≈3%
x6         0.3419              0.0168                ≈5%
x7         0.3155              0.0099                3.19%
x8         0.3034              0.0122                3.92%
x9         0.3243              0.0153                4.9%
x10        0.3126              0.0144                4.1%
Avg        0.3264              0.0148                4.7%
4.3.2 Identity Map
Some properties of the standard GHA network are more evident when con
trasted to another network. This section provides a subjective discussion of the
following modifications to the standard GHA network:
1. the number of output nodes,
2. the dependencies (lateral connections) between output nodes,
3. the number and selection of inputs to an output node.
Since the key modification of PCAM involves controlling the input scale to
an output node, it is instructive to examine the extreme case of a single output
for each input. This structure can be realized with a network that has all inputs
connected to each output, but with the constraint that only one weight is nonzero.
The network is shown in figure 4.5 (right) showing only the nonzero connections.
The inputs (figure 4.4) have not changed, so there are still 10304 input nodes
corresponding to the dimensions of the input exemplars. By design, there are also
10304 output nodes to provide an output for each input. Each output node has
the same spatial localization as the corresponding input node. This architecture is
actually 10304 independent GHA networks operating independently so the number
of outputs does not exceed the number of linearly independent exemplars.
Without loss of generality, stipulate that the final weights are normalized.
Construct the (10304 x 10304) transformation matrix Wi,
W_1 = [w_1 w_2 ... w_10304]. (4.18)
It should be evident that the transformation matrix W1 is a (10304 x 10304)
identity matrix.
The transformation matrices from equations 4.17 and 4.18 provide insight to
several key consequences of partial connections.
1. Span of the output space: Standard PCA has an output space that
is a small subset of the image space. The output space of the identity
transformation is the entire space of (112 x 92) images. A feature that is
desirable for classification might lie outside a restrictive face space.
2. Compression: An image described using standard PCA requires a system
that memorizes 11 images (the average image and the 10 eigenfaces), but
describes each (112 x 92) input with at most 10 coefficients. The identity
transformation requires 10304 coefficients for each input image.
3. Resolution: The outputs of PCA are global. Each output is dependent on
all the input pixels. Each output of the identity transformation is dependent
on a single input pixel and thus has the same resolution as the input image.
Spatial resolution of an output can be controlled by simply limiting the
number of inputs.
4. Orthogonalization of Eigenvectors: The orthogonalization of standard
GHA arises by deflation (virtual lateral connections). The orthogonalization
of the identity transformation arises from nonoverlapping inputs.
5. Decorrelation of Outputs: The outputs of standard GHA are decorrelated
and a repeated application of PCA decomposition changes nothing. The
outputs of the identity transformation are correlated. These outputs can be
decorrelated by adding another layer (a GHA layer) to the network.
6. Class Features: For the GHA network, class information is in the weights,
and the individual exemplar information is in the outputs. For the identity
transformation, there is no information in the weights. Class information
must be extracted from the outputs.
In a network structured between the two extremes of standard GHA and an
identity transformation, several tradeoffs can be considered. We feel that PCAM
enhances control of the span of the output space and of localization.
4.3.3 Iterated Filter Banks
The eigenface decomposition and the identity map may be considered as two
extreme cases of fixed resolution PCA. The iterated filter bank structure is the
first multiresolution network presented in this chapter. The iterated filter bank is
also of interest since it is a way of implementing PCAM for a complete represen
tation. The concepts follow from prior sections and the focus is in describing the
architecture and constraints.
Allowing partial connections makes it possible to arbitrarily assign inputs
to outputs and control orthogonality between outputs. The number of possible
networks increases dramatically. For example, if each output is connected to two
inputs there are C(n, r) = C(10304, 2) = 53,081,056 unique combinations of
inputs. For each pair of inputs there are two orthogonal outputs, so it is possible to
construct a network with over 10^8 orthogonal outputs.
Figure 4.7: Three Level Dyadic Banks (top left: network with output nodes N1-N8, the number of connected inputs in parentheses; bottom right: the equivalent polyphase iterated filter bank construction)
One possible structure mimics the structure of a dyadic filter bank. The
structure of a three level dyadic bank is shown on the top left of figure 4.7 with
the equivalent polyphase construction on the bottom right. The filters used in
this example are constrained to be two tap FIR filters. The filters are derived
from the eigenvectors of the (2 x 2) scatter matrices of the data at each stage.
Four sequential outputs of the first filter correspond to the outputs at the first
four nodes (N1 4) of the network. Two sequential outputs of the second filter
correspond to the outputs at the fifth (N5) and sixth (N6) nodes of the network.
The GHA network iterates on the lowest energy (variance) component. For inputs
that have 1/f energy distributions, the lowpass or highpass components can be
selected by using Hebbian or antiHebbian learning for the output node.
The network (figure 4.7 left) has four nodes (N1 4) that are each connected
to two nonoverlapping contiguous inputs. The numbers of connected inputs are
in parenthesis after the node labels. Output nodes N5 6 are each connected to
four contiguous nonoverlapping inputs, and the last two nodes N7 8 are fully
connected. The weights of N1-4 are constrained to be equal since they correspond
to a single filter in the filter bank. For the same reason, the weights for outputs
N5 6 are constrained to be equal.
GHA orthogonalizes weights by deflating the inputs to subsequent output
nodes. Deflation is ineffective for nonoverlapping inputs. The first four nodes have
no orthogonalization constraints from other nodes. Node N5 is directly affected
only by nodes N1 2 since the inputs of those two nodes are partitions of the input
to N5. Node N6 is directly affected only by nodes N3 4 (indirectly influenced by
N1 2 because of the equality constraint between N5 and N6). The weights of the
fully connected outputs N7 8 are constrained by all earlier nodes. The threestage
dyadic (twofold) bank produces eight outputs for each set of eight inputs.
For the ORL images, a quadratic (fourfold) filter was used. At each stage the
inputs were partitioned into nonoverlapping (2 x 2) blocks. To parallel the dyadic
filter bank, all regions are constrained to have the same weights. The scatter
matrix of the first stage inputs is
S = ( 0.99  0.95  0.91  0.90
      0.95  1.00  0.89  0.93
      0.91  0.89  1.00  0.95
      0.90  0.93  0.95  1.01 ).        (4.19)
Due to the local nonstationarity of images, the autocorrelations differ by mode
(horizontal, vertical, diagonal) as well as by lag. The scatter matrix is doubly
symmetric but not (in general) Toeplitz. The first four eigenvectors are shown in
figure 4.8 and show that the filter is essentially separable along the horizontal and
vertical modes.
Figure 4.8: The first four (2 x 2) eigenimages are separable odd-even decompositions. From left to right: even horizontal and vertical (λ1 = 0.9415), odd horizontal and even vertical (λ2 = 0.0346), even horizontal and odd vertical (λ3 = 0.0183), odd horizontal and vertical (λ4 = 0.0056)
The weights (features) are separable even (lowpass) and odd (highpass) decom
positions. Figure 4.7( bottomright) shows that the iterated filter bank develops
longer filters by cascading shorter filters. Short eigenfilter coefficients are driven by
PCA symmetry constraints and cannot adapt to data statistics.
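The following sketch (illustrative; random images stand in for the ORL residuals, so the numbers will not match equation 4.19) shows how the (2 x 2) block scatter matrix and the four eigenimages of figure 4.8 could be computed.

```python
# Scatter matrix of the (2 x 2) blocks of the training images and its
# eigenvectors, as used for the first stage of the quadratic filter bank.
import numpy as np

def block_scatter(X, block=2):
    """X: (K, R, C) zero-mean images. Returns scatter of vectorized blocks."""
    K, R, C = X.shape
    blocks = (X.reshape(K, R // block, block, C // block, block)
                .transpose(0, 1, 3, 2, 4)
                .reshape(-1, block * block))           # every block as a row vector
    S = blocks.T @ blocks
    return S / np.trace(S) * (block * block)           # scale so the diagonal is ~1

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 112, 92))                 # stand-in for ORL residuals
S = block_scatter(X)                                   # (4 x 4), cf. eq. 4.19
vals, vecs = np.linalg.eigh(S)
eigenimages = [vecs[:, i].reshape(2, 2) for i in range(3, -1, -1)]  # largest first
```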
Figure 4.9 (top) shows the outputs of three stages. For display purposes, each
image was normalized so that pixel intensities lie in the range (0, 1). The first
stage of the filter bank produces four outputs shown in the four top left panels of
figure 4.9. The four images are downsampled and arranged as a (2 x 2) array of
compressed images as shown in the top right panel. There is no implied ordering
or spatial relationship in the arrangement of the compressed images. We place the
compressed image to be iterated in the topleft of the (2 x 2) array. The lowpass
component (topleft) is passed to another iteration. The downsampled outputs of
the second stage are shown in the bottom, far left panel. The panel is displayed at
double scale. The third stage outputs are shown in the bottom, middle left panel of
figure 4.9. The (2 x 2) array of outputs from the third stage is displayed
at four times the actual scale. The outputs of all three stages are combined in the
bottom middle image. For comparison, the residual (original minus average) and
the original image are shown on the bottom right.
Figure 4.9: Three Level Decomposition of an Exemplar Face
The first stage outputs of all ten inputs are shown in figure 4.10.
Figure 4.10: Output of the First Stage of the Quadratic Filter Bank for the Ten
Training Exemplars
A large disadvantage of the iterated filter bank is limited adaptability:
1. Short Filter Length Constraint The weights cannot adapt to class
statistics (no global feature extraction).
2. Equal Weight Constraint The weights are constrained to be the same for
all subimages (no localization of features).
The chief advantage of the iterated filter bank is that an orthogonal basis is used
at each stage. The overall linear transformation is orthogonal and guarantees
perfect reconstruction of the inputs. The iterated filter bank structure produces a
multiresolution representation at the output. If the compressed representations are
difficult to classify, then feature extraction must be implemented by another stage.
4.3.4 Dual Implementation of PCA
The filter bank structure first extracts small highly localized features, then
builds up to global features. Another alternative is to find global features before
local features. In general, it cannot be guaranteed that the resulting localized
eigenvectors will form a minimal spanning set. That is, while we can still guarantee
a set of vectors to span the space of training exemplars, PCAM might not produce
a minimal spanning set. As previously mentioned, a minimal spanning set is not
required and perhaps not desired for classification. If the process is widesense
stationary (WSS), then the network weights form a pairwise linearly independent
set of vectors. The discussion of orthogonal bases will be deferred to the next
section.
The behavior of PCAM in going from long filters to short filters is clearer
when discussed in the context of the dual PCA decomposition. PCA can be done
using the transpose of the data matrix X (equation 4.14). The main consideration is that
the number of exemplars is usually much smaller than the dimension (number of
components) of an input. The dual scatter matrix is
S_D = X'X. (4.20)
Continuing with the ORL example, the original scatter matrix is a doubly sym
metric (10304 x 10304) matrix. The dual scatter matrix is a (10 x 10) matrix.
An analytic solution to PCA has operations on the order of O(N^2), so there is a
significant computational advantage to using the dual PCA.
Standard:                         Dual:
S = XX'                           S_D = X'X
XX' = WΛW'                        X'X = VΛV'
XX'W = WΛ(W'W)                    X'XV = VΛ(V'V)
XX'W = WΛ                         X'XV = VΛ
X'XX'W = X'WΛ                     XX'XV = XVΛ
X'X(X'W) = (X'W)Λ                 XX'(XV) = (XV)Λ
V = X'W                           W = XV
                                                         (4.21)
Equation 4.21 contrasts the standard decomposition S = WΛW' to the dual
decomposition S_D = VΛV'. The dual eigenvalues equal the standard eigenvalues. The eigenvectors W
can be calculated from the dual eigenvector V,
W = XV. (4.22)
The columns of the matrix W are orthogonal but not orthonormal. Normalizing the
columns of W results in an orthonormal Ŵ.
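A short numpy sketch of the dual route (equations 4.20 to 4.22); the toy data are illustrative, and the check at the end confirms that the columns of the normalized W diagonalize XX' without the (N x N) scatter ever being formed.

```python
# Dual PCA: eigendecompose the small (K x K) matrix X'X instead of the
# (N x N) scatter.
import numpy as np

rng = np.random.default_rng(0)
N, K = 10304, 10
X = rng.standard_normal((N, K))           # stand-in for zero-mean exemplars as columns

S_dual = X.T @ X                          # (K x K) dual scatter matrix, eq. 4.20
lam, V = np.linalg.eigh(S_dual)           # dual eigenvalues equal the primal ones
W = X @ V                                 # eq. 4.22: primal eigenvectors (unnormalized)
W_hat = W / np.linalg.norm(W, axis=0)     # orthogonal columns -> orthonormal columns

# The columns of W_hat diagonalize the (N x N) scatter XX' without forming it.
assert np.allclose(X @ (X.T @ W_hat), W_hat * lam)
```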
The dual formulation can significantly reduce computations when the number
of exemplars is smaller than the spatial dimension of the exemplars. The formu
lation also provides an alternative interpretation to PCAM. Consider a single
exemplar xi, and a single eigenface wl (equation 4.17). Partition each array,
x_1 = [ x_1,(1,1)  x_1,(1,2) ; x_1,(2,1)  x_1,(2,2) ],
w_1 = [ w_1,(1,1)  w_1,(1,2) ; w_1,(2,1)  w_1,(2,2) ].        (4.23)
The projection of x_1 against w_1 is a single global scalar,
y_1 = < x_1, w_1 > = Σ_r Σ_c < x_1,(r,c), w_1,(r,c) > = 4.12. (4.24)
Equation 4.24 shows that the global output y_1 can be considered as the sum of
localized terms. By partitioning x_1 and w_1 into smaller subarrays in a manner
similar to equation 4.23, the global output y_1 can be replaced by an array of
localized outputs,
y_1,localized = (  2.77   -4.21
                  14.17   -8.61 ).        (4.25)
The localization of output y_1 can be extended to full resolution. Figure 4.11 shows
y_1 further localized into blocks of (8 x 8), (4 x 4), (2 x 2), and finally (1 x 1)
(full localization). The array of localized outputs can be considered a compressed
representation of the input.
Figure 4.11: Localization of a Global Output (blocks of 8 x 8, 4 x 4, 2 x 2, and 1 x 1)
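The following sketch (assumed, with random data) verifies the point numerically: block-wise inner products between an input and an eigenimage sum to the single global output.

```python
# A global PCA output equals the sum of its localized block outputs,
# cf. equations 4.24-4.25.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal((112, 92))
w1 = rng.standard_normal((112, 92))
w1 /= np.linalg.norm(w1)

y_global = float(np.sum(x1 * w1))                    # single global scalar

def localize(x, w, br, bc):
    """Block-wise inner products: an array of localized outputs."""
    R, C = x.shape
    out = np.zeros((R // br, C // bc))
    for i in range(R // br):
        for j in range(C // bc):
            xs = x[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            ws = w[i * br:(i + 1) * br, j * bc:(j + 1) * bc]
            out[i, j] = np.sum(xs * ws)
    return out

y_local = localize(x1, w1, 56, 46)                   # (2 x 2) array of localized outputs
assert np.isclose(y_local.sum(), y_global)           # localized terms sum to the global output
```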
It is interesting to note that as the eigenface w1 is partitioned, each segment
has the same dual eigenvectors. In principle this is similar to the iterated filter
bank in that the dual (rather than the primal) eigenvectors are preserved globally.
The meaning of the variations across exemplars was not explored since the vari
ations seem to arise from misalignment during data collection rather than from
an inherent feature of the class. Although multiresolution can be easily extended
to standard PCA by simply partitioning the standard eigenvectors, the repre-
sentations that result are always overcomplete. An overcomplete representation
contains redundancy that is a disadvantage if the application is to compactly
transmit information. For classification, an overcomplete representation may be
advantageous.
4.4 Overcomplete Representations
The preceding section showed two ways that PCAM could be constrained
to produce complete or overcomplete representations. For the iterated bank, the
weights for each partition were constrained to be equal. In the dual approach, the
statistics over the entire image were assumed stationary. For classification, features
are important and overcomplete representations are satisfactory. In this section the
flexibility of the single-layer GHA network is discussed.
The single computational layer GHA network outputs are determined by the
inputs. There are three main mechanisms for controlling the input,
1. control deflation from other outputs,
2. mask all inputs outside a selected region,
3. place explicit constraints in the training.
The original algorithm is to provide deflated inputs to successive outputs. For fixed
resolution PCA, the deflation is needed to prevent different outputs from having
weights converge to the same values. For multiresolution, the deflated inputs are
only required if the input nodes are identical, otherwise deflation is optional. By
selectively constraining weights to zero, an output can be localized to subregions of
arbitrary size and shape. The subregions need not be convex or connected (e.g., a
region for both eyes but excluding the nose). Each subregion of the image can be of
a different size and shape. Overlapping regions are allowed to reduce edge effects.
Shifted outputs can be introduced to facilitate shifts (translations) in the image.
Each partition is allowed to have different statistics and allowed to converge to a
local subeigenimage.
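A minimal sketch of the masking mechanism (mechanism 2 above), assuming an Oja-style single-output update rather than the full GHA network: weights outside a chosen boolean mask are held at zero, so the output converges to a local sub-eigenimage, and the mask may be disconnected or may overlap the regions of other outputs.

```python
# Masked Hebbian update: out-of-region weights are pinned to zero, so the
# output is localized to an arbitrarily shaped subregion.
import numpy as np

def masked_hebbian(X, mask, eta=1e-3, epochs=100, seed=0):
    """X: (K, R, C) zero-mean images; mask: (R, C) boolean subregion."""
    K, R, C = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((R, C)) * mask        # weights zero outside the region
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X:
            y = np.sum(w * x)                     # localized output
            w += eta * y * (x - y * w)            # Oja-style Hebbian update
            w *= mask                             # keep out-of-region weights at zero
            w /= np.linalg.norm(w)                # keep the weights bounded
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 112, 92))            # stand-in for residual images
mask = np.zeros((112, 92), dtype=bool)
mask[20:45, 10:40] = True                         # e.g. a region around one eye
mask[20:45, 52:82] = True                         # plus the other eye (disconnected)
w_local = masked_hebbian(X, mask)                 # local sub-eigenimage
```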
Relaxing all the constraints produces a richer set of characteristics. Relaxing
constraints also complicates the implementation. The specific choices for a PCAM
network are discussed in the experiments. The overall approach, however, was to
relax a single constraint at a time until the network's classification performance was
adequate.
4.5 Local Feature Analysis
Penev and Atick (1996) report great success in face classification using a
technique called Local Feature Analysis(LFA). The improvement in performance is
attributed to localized feature extraction. Atick has also implemented a commercial
automated face classification program for workstations using LFA (FaceIt,
http://venezia.rockefeller.edu/group/papers/full/AdvImaging/index.html).
Figure 4.12: Local Feature Analysis (top: ensemble of inputs → PCA/SVD → eigenvectors Ψ(n, m) and eigenvalues Λ → kernel K(n, m) and residual correlation P(n, m); bottom: single input φ_k(n) → LFA mapper → single output O_k(n))
Figure 4.12 shows that LFA is based on PCA. The top cascade of operations
calculates class properties. The bottom row is the LFA mapper proper. Assume
a set of K inputs, φ_k(n), which are exemplars of a single class. Each input has a
spatial dimension n that can be rasterized such that 1 ≤ n ≤ N. For the ORL faces
N is equal to the number of pixels in each input, N = (R x C) = 10304. PCA is the
eigenvalue decomposition of the inputs' scatter matrix S_(NxN). In general, eigenvalue
analysis yields a square modal matrix Ψ_(NxN)(n1, n2) and a single diagonal matrix
of eigenvalues Λ. The eigenvalue decomposition produces K eigenvectors, but only
the first M eigenvectors are retained (M < K < N). The truncated expansion
uses eigenvectors ψ_m(n) with corresponding eigenvalues λ_m. The modal matrix
is not square since the number of linearly independent inputs K is less than the
dimension N. Since M < K < N, the modal matrix Ψ(n1, n2) is (N x M). Using
the eigenvectors as a basis, each input can be reconstructed,
φ(n) = Σ_{m=1}^{K} A_m ψ_m(n),        φ_rec(n) = Σ_{m=1}^{M} A_m ψ_m(n),
where A_m = < φ, ψ_m > is the m-th PCA coefficient.
LFA introduces some new quantities that will be discussed in more detail in
separate subsections,
O(n) ≜ Σ_{m=1}^{M} (A_m/√λ_m) ψ_m(n), output vectors,
K(n1, n2) ≜ Σ_{m=1}^{M} (1/√λ_m) ψ_m(n1) ψ_m(n2), LFA kernel,
P(n1, n2) ≜ Σ_{m=1}^{M} ψ_m(n1) ψ_m(n2), residual correlation.
The LFA kernel K(n1, n2) is a topographic (Penev and Atick, 1996, p. 5) analog
for the modal matrix. The residual correlation P(n1, n2) is comparable to the
matrix of eigenvalues Λ. The LFA output is similar to the reconstructed input in
PCA.
4.5.1 Output Vector
Using the LFA kernel, an output O(n) is computed for every input φ(n),
O(n1) = Σ_{n2} K(n1, n2) φ(n2) = Σ_{m=1}^{M} (A_m/√λ_m) ψ_m(n1). (4.26)
The output O(n) is of the same dimension as the input φ(n). The LFA output is
the PCA reconstruction except that each eigenvector is normalized (scaled to unit
norm),
< ψ_m, ψ_l > = λ_m δ(m, l),        < ψ_m/√λ_m, ψ_l/√λ_l > = δ(m, l). (4.27)
For convenience, the expressions for input and output are repeated here,
φ(n) = Σ_{m=1}^{M} A_m ψ_m(n), (4.28)
O(n) = Σ_{m=1}^{M} (A_m/√λ_m) ψ_m(n). (4.29)
The main difference (between a PCA reconstruction and an LFA output) seems
to be that normalizing the eigenvectors by the 1/√λ_m scale factor de-emphasizes terms
with large eigenvalues in equation 4.28. In PCA, these are the terms that are the
most important for reconstructions with minimum mean squared error (MSE).
On the other hand, it has also been suggested that eliminating the first principal
components (that are the low frequency components in natural (1/f) images) can
compensate for differences in illumination level. The objection to discarding the
first few principal components is that essential discriminatory information may be
lost. LFA's approach of de-emphasis rather than outright elimination of the first
eigenvectors may provide features that are robust with respect to illumination.
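The following sketch implements the LFA mapper of equations 4.26 to 4.29 under the assumption of orthonormal eigenimages (columns of Psi) with coefficients A_m = < x, ψ_m >; the function names are illustrative, not Penev and Atick's code.

```python
# LFA mapper sketch: kernel, output, and the plain PCA reconstruction for
# comparison.
import numpy as np

def lfa_output(x, Psi, lam):
    """x: (N,) input; Psi: (N, M) eigenvectors; lam: (M,) eigenvalues."""
    A = Psi.T @ x                            # PCA coefficients A_m = <x, psi_m>
    return Psi @ (A / np.sqrt(lam))          # eq. 4.29: de-emphasize large-lambda terms

def lfa_kernel(Psi, lam):
    """K(n1, n2) = sum_m psi_m(n1) psi_m(n2) / sqrt(lam_m); (N x N), large for N = 10304."""
    return (Psi / np.sqrt(lam)) @ Psi.T

def pca_reconstruction(x, Psi):
    """Eq. 4.28 style reconstruction, for comparison with the LFA output."""
    return Psi @ (Psi.T @ x)
```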
4.5.2 Residual Correlation
The residual correlation matrix definition can be rewritten in matrix form,
P(n1, n2) ≜ Σ_{m=1}^{M} ψ_m(n1) ψ_m(n2)
          = Σ_{m=1}^{K} ψ_m(n1) ψ_m(n2) - Σ_{m=M+1}^{K} ψ_m(n1) ψ_m(n2)
          = δ(n1, n2) - Σ_{m=M+1}^{K} ψ_m(n1) ψ_m(n2). (4.30)
If the full set of K eigenvectors is used for LFA output expansion, then the residual
correlation of the output is the identity matrix. If only a subset M < K of the
eigenvectors is used, there is a residual correlation as shown in equation 4.30. Atick
and Penev note that the LFA output correlations "happen to be" localized.
4.5.3 Kernel
The scatter matrix of the input data can be rewritten,
S = ΦΦ' = ΨΛΨ' = Σ_{m=1}^{K} λ_m ψ_m(n1) ψ_m(n2). (4.31)
Writing the expressions for the scatter matrix and the inverse kernel together,
S(n1, n2) = Σ_{m=1}^{K} λ_m ψ_m(n1) ψ_m(n2),
K^{-1}(n1, n2) = Σ_{m=1}^{M} √λ_m ψ_m(n1) ψ_m(n2). (4.32)
The inverse kernel is comparable to the original scatter matrix except for the
number of terms (M < N) and the scale factor. The key difference in LFA seems
to be the scaling by √λ_m rather than λ_m; otherwise, the analysis is similar to a partial PCA
reconstruction.
4.5.4 LFA on ORL Faces
The ORL database has ten exemplars of each person. K = 9 exemplars were
used for training and one retained for evaluating the expansion. Figure 4.13 shows
normalized (by input image power) reconstruction MSE as a function of number of
components. The starting MSE (at x = 0) is from just using the average image.
Each of the lines is for a different input exemplar.
Figure 4.13: PCA Reconstruction MSE (normalized reconstruction MSE versus the number of eigenvectors)
Figure 4.14 shows the reconstructions using only the M = 8 eigenvectors
corresponding to the eight largest eigenvalues. Note that the poses that are not
fully frontal have artifacts; the reconstructions using M = 9 are indistinguishable
from the inputs. The error from adding the tenth component is an artifact incurred
by implementation limitations on numerical accuracy.
Figure 4.14: PCA Reconstructions
Figure 4.15 shows the corresponding LFA outputs.
Figure 4.15: LFA Outputs (Compare to PCA Reconstruction)
The kernel and residual correlation matrices each have 106,172,416
(10,304 x 10,304) ≈ 10^8 elements. Each row of the kernel is a sum of scaled
PCA eigenfaces. Penev and Atick (1996) show that local features can be found
from an appropriate linear combination of global features. What conditions are
needed so that an arbitrary image (e.g., a local feature) can be reconstructed using
eigenimages (global features)? Clearly, the local feature must be contained in the
span of the eigenspace. The span of the eigenspace is dependent on the number of
independent training exemplars. That is, as the number of independent training
exemplars increases, the span of the eigenspace increases. Figure 4.16 shows the
first five rows, each row reshaped to a (112 x 92) image of the LFA Kernel (top
row) and Residual Correlation (bottom row),
Figure 4.16: LFA Kernel and Residual Correlation (Look for Localization)
4.5.5 Localization for LFA and PCAM
PCAM parses the input into spatial subregions to obtain localized features.
It is assumed that pixels that are close (spatially) are more likely to be related
than pixels that are widely separated. Similarly, for time signals, events that occur
in a close interval of time tend to be better correlated than events that are
separated by large intervals of time.
LFA based classification is significantly more involved than finding localized
features. LFA is a PCA based technique that is designed to obtain groups of pixels
that are highly correlated. Coincidentally, highly correlated pixels were found to be
spatially localized. A further coincidence is that the localized regions corresponded
to local physical features. LFA (and PCAM) could have produced local regions of
pixels that do not correspond to any physical features. The approach seems very
elegant and some of the techniques might be applied to PCAM in the future. In
particular, LFA provides a statistically based approach to parsing an image into a
minimal set of arbitrarily shaped and highly correlated subregions. That is, LFA
provides a framework for grouping pixels into localized regions based on correlation
rather than simple adjacency. Further, LFA provides a nice verification that, for
faces, spatially localized features are well correlated.
4.5.6 Feature Space for LFA, PCA, and PCAM
For a given set of input exemplars, LFA and PCA have the same feature
space. To derive local features, both PCA and LFA rely on the eigenspace having
a sufficiently large span so that local features are included. In both PCA and
LFA, the eigenspace can only be increased by using more (linearly independent)
training exemplars. That is, LFA localized features are a linear combination of
a large number of training exemplars. Kirby and Sirovich (1987) estimates that
a dimensionality of at least 400 is needed for adequate representation of tightly
cropped faces with PCA. Penev (1999) states that a dimensionality of 200 (at
least 200 exemplars) is needed for adequate representation of faces with LFA.
In the ORL example, with only nine exemplars of (112 x 92) images, any linear
combination will be "face-like" and not localized.
PCAM is a multiresolution technique that encompasses classical PCA as an
extreme case within the PCAM definition. PCAM directly manipulates local
ization by partitioning the exemplar images. PCAM then adapts to the second
order statistics in each localized region. Localization of features is independent
of the number of training exemplars. Further, PCAM facilitates construction of
multiscale features. That is, PCAM can be utilized with global features as well as
(local) features of varying scale.
4.6 Summary
PCAM can be used to directly derive features for classification. Local
features can be found by explicitly selecting regions that correspond to local
physical structures. Unfortunately, there is often no a priori way to know that
the best mathematical features correspond to given physical features. If a priori
information is available, the GHA network can structure the PCAM network in
a very flexible manner. Unlike LFA, which looks for local features by exhaustive
linear combinations of global features, PCAM can explicitly select regions that
correspond to local physical structures.
PCAM seems to be sufficiently useful in providing localized inputs to an
other classifier such as a neural network. The neural network can then choose to
construct features that are global or local. The classifier can create features that
are combinations of PCAM outputs at a single scale, or combine PCAM outputs
of several scales. The subsequent experiments showed that a single-layer PCAM
network followed by a single-layer classifier performed comparably to or better than
more complicated structures.
CHAPTER 5
FACE RECOGNITION EXPERIMENT
In recent years, automated face recognition has received increased interest
while simultaneously becoming more feasible. Surveillance and medical diagnostics
are two broad classes of applications that have driven the demand for image
recognition technology. Hardware for image recognition has shown a trend towards
higher performance, increased accessibility, and lowered costs. Numerous advances
have been made in face recognition algorithms (Chellappa et al., 1995, pp. 705 
706).
Automated face recognition and classification has many practical applica
tions (Chellappa et al., 1995, p. 707). In general, automated face recognition is
a complex problem that requires detecting and isolating a face under unknown
lighting conditions, backgrounds, orientations, and distances. However, there are
several applications where the lighting, scale, and background can be expected to
be well controlled:
personal identification (credit cards, passports, driver's license),
mug shot matching,
automated store/bank access control.
In these applications, detection and isolation of the faces is not necessary. Non-
linear distortions in the images (due to lighting, background, centering, scaling, or
rotation) can be controlled during data collection (and assumed to be negligible).
Chellappa et al. (1995) presents a nice survey that includes background
material on psychology and neuroscience studies, face recognition based on moving
video, and face recognition using profile faces. The scope of this dissertation
is limited to automated face recognition based on frontal, still photos. Given a
database of exemplar images, the basic face recognition problem is to identify
individuals in subsequent images. The performance expectations for automated
face recognition are high since most people can recognize faces despite fairly
adverse conditions. For a machine, the task involves detecting faces from a
cluttered background, isolating each face, extracting features from each face,
and finally classification of the face.
This chapter includes an extended presentation of three specific classifiers:
the original eigenfaces experiment, a Hidden Markov Model, and a convolutional
network. All three techniques have been applied to the same (Olivetti Research
Lab) face database under similar conditions. Finally, the PCAM classifier is
presented against the same ORL database.
5.1 ORL face Database
Olivetti Research Lab (ORL) has a public face database reproduced in
appendix B. The database has 400 pictures made up from 10 pictures of 40 people.
The images are (112 x 92) = 10304 pixel, 8-bit grayscale images. The images
in the ORL database present a nontrivial classification problem. The pictures
show variation in background lighting, scale, orientation, and facial expression
(figure 5.1). The tolerance in scale is about 20% and the tolerance for tilting is
about 20° (Giles et al., 1997). Individuals who used eyeglasses were allowed to pose
both with and without eyeglasses. Some people looked very similar to each other
(figure 5.1, far right).
Figure 5.1: Varying Conditions in ORL Pictures (scale, expression, lighting, similarity)
Several other techniques have been applied to the ORL database under the
same testing conditions (40 people, 5 test + 5 verification pictures for each person).
Control of the conditions is important since reducing the number of classes (not
using all 40 people) implies an easier classification problem. Changing the ratio of
training exemplars to verification exemplars also alters classifier performance. This
section discusses the three experiments that are used for comparison to PCAM.
PCAM gave better average performance than the other techniques.
Table 5.1: Error Rates of Several Algorithms
Algorithm                                   Error Rate
Eigenfaces (Turk and Pentland, 1991a)       10%
HMM (Samaria, 1994)                         5.5%
SOM+CN (Giles et al., 1997)                 5.7%
PCAM (Brennan and Principe, 1998)           2.5%
5.2 Eigenfaces
The decomposition of a training set of face images into eigenfaces has been
previously discussed. This section briefly presents the face recognition experiment
using eigenface.
5.2.1 Description of Experiment
The ORL database has 200 training images, 5 images for each of the 40 people.
All the training images {φ_k}_{k=1...200} are averaged,
φ_0 = (1/200) Σ_{k=1}^{200} φ_k. (5.1)
The ensemble average is removed from each image,
x_k = φ_k - φ_0. (5.2)
Eigenfaces ψ_m are found for the training ensemble and the M eigenfaces with signif-
icant eigenvalues are retained. The eigenfaces define the axes in eigenspace, and
both training and test images can be mapped to a set of coordinates in eigenspace,
x_k → [a_1, a_2, ..., a_M]_k, where a_m = < x_k, ψ_m >. (5.3)
Denote the eigenspace coordinate vector by a_k = [a_1, ..., a_M]_k. An immediate
advantage of eigenfaces is a steep reduction in dimension. For the ORL database
and using all the eigenfaces (M = 200), each image is described by 200 coordinates
rather than (92 x 112) = 10304 pixels. The training images can be compressed by
a factor of 50 without loss. The training images from each class map to separate
regions in eigenspace.
The eigenspace coordinates can be treated as raw input to any classifier. For
example, let the coordinates of the 5 training images from class n be denoted by
a_n^1, ..., a_n^5. The distance of a test image's coordinates from the training image coordinates
for a class can be used to determine the probability of the test image belonging to
the class,
Prob(x_test ∈ C_n) = f(a_test, a_n^1, ..., a_n^5). (5.4)
The simplest method is to average the coordinates of all the training exemplars
for a class, and to calculate the distance of a test image from the average coor
dinates (Turk and Pentland, 1991b; Giles et al., 1997). Samaria (1994) used a
nearestneighbor classifier.
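A sketch of the simplest classifier described above (the names and array shapes are assumptions): project a test image onto the eigenfaces and assign it to the class whose average training coordinates are closest.

```python
# Eigenface classification by distance to per-class average coordinates.
import numpy as np

def eigenface_classify(x_test, Psi, phi0, class_means):
    """Psi: (N, M) eigenfaces; phi0: (N,) ensemble average;
    class_means: (n_classes, M) average training coordinates per class."""
    a_test = Psi.T @ (x_test - phi0)                   # coordinates, eq. 5.3
    d = np.linalg.norm(class_means - a_test, axis=1)   # distance to each class average
    return int(np.argmin(d))                           # index of the nearest class
```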
5.2.2 Results
Pentland et al. (1994) reports under 5% error rate on 200 faces using a large
(unspecified) database. Samaria (1994) reported a 10% error rate when using 175
to 199 eigenfaces. The improvement after 10 eigenfaces is gradual, but the error
rate rapidly becomes worse when fewer than 10 eigenfaces are used. Samaria's results also
showed that the error rate was not monotonically nonincreasing as the classifier used
more eigenfaces. Giles et al. (1997) reports a 10.5% error rate using 40 to 100 eigenfaces.
Error rates aside, the eigenface approach demonstrates that PCA is a useful
preprocessor for classification.
1. PCA coordinates in eigenspace are good features for classification.
2. PCA reduces the dimensionality of the classifier inputs that in turn reduces
computations.
3. Eigenvalues are potentially a good indicator for classification features.
5.3 Face Recognition using HMM's
Samaria (1994) on face recognition using a Hidden Markov Model (HMM)
is often cited as seminal work in applying statistical signal processing to image
classification. Hidden Markov Models are widely applied to continuous speech
recognition (Haykin, 1994, p. 227). Samaria passed an observation window over
an image from lefttoright, down, righttoleft, down, lefttoright, and so on
(Figure 5.2, left).
Figure 5.2: Parsing an Image into a Sequence of Observations (left: general traversal; right: top-down traversal)
That is, an observation window traverses a onedimensional path through each
image. For each image, Samaria thus obtained a corresponding observation array,
O = [o_1, ..., o_T].
5.3.1 Markov Models
A Markov model is a statistical model for a sequence of observations based
on an underlying sequence of states (a Markov process). The probability of a
state at some time in a sequence is dependent only on the immediately preceding
state (Therrien, 1992, pp. 99 118). Each transition between states generates an
output that is dependent only on the state being entered. If the states can be only
one of N countable discrete values, the process is called a Markov Chain (Therrien,
1992, pp. 99 118). The Markov process is described by four parameters (Samaria,
1994, p. 28),
1. the number of states N,
2. the one-step state transition matrix, A = {a_ij : 1 ≤ i, j ≤ N},
3. the output probability function, B = {b_j(·) : 1 ≤ j ≤ N},
4. the initial state probability distribution, Π = {π_j : 1 ≤ j ≤ N}.
When only the outputs are observable (the states are hidden), the model is said to
be a Hidden Markov Model.
Figure 5.3: Markov Model (a transition from state i to state j generates output o_j)
5.3.2 Description of Experiment
A full description of Samaria's work and a detailed description of HMM's is
outside the scope of this dissertation. HMM's are described in various books on
statistical signal processing (Haykin, 1996; Therrien, 1992). Samaria cites Rabiner
(1989). This section attempts to cover points in Samaria's research that would be
salient to reconstructing his experiments on 1dimensional HMM's (1DHMM).
1 forehead
2 eyes
3 nose
4 mouth
5 chin
Figure 5.4: Topdown Constrained State Transitions
Samaria obtained his best results using a topdown sequence of five states. Each
state corresponds to a region of the face. The allowed state transitions correspond
to a topdown traversal of a face (Figure 5.4). The observation window was
constructed from several complete rows of the image (Figure 5.2, right). Each
window was eight rows high and overlapped adjacent windows by seven rows.
Samaria described his model using a shorthand notation (Samaria, 1994, p. 42) of
F = (N states, L observation rows, M overlap rows) = (5, 8, 7).
Samaria used the HTK software package described in Young (1993). For each of
the 40 classes in the ORL database, five training images are each transformed into
a sequence of observations using a top-down traversal. The five training sequences
are used as inputs to the HTK software with a design specification for five states (five
face regions). The HTK software derived optimal parameters for each HMM using
the BaumWelch reestimation algorithm (Baum, 1972). The optimization includes
parsing the training images into five regions, and deriving both the state transition
matrix A and output probability function B for the HMM. At the end of training,
an HMM has been derived for each class. To classify a test image, select the class
whose HMM maximizes the likelihood of the test image.
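The observation extraction itself is easy to reproduce; the sketch below is an assumed reconstruction of the F = (5, 8, 7) windowing (not Samaria's code): an eight-row window slides down the image with a seven-row overlap, producing one observation per step.

```python
# Top-down observation extraction for F = (N, L, M) = (5, 8, 7).
import numpy as np

def topdown_observations(img, rows_per_window=8, overlap=7):
    """img: (R, C) image. Returns (T, rows_per_window * C) observation array."""
    R, C = img.shape
    step = rows_per_window - overlap                   # one row per step here
    obs = [img[r:r + rows_per_window].ravel()
           for r in range(0, R - rows_per_window + 1, step)]
    return np.stack(obs)                               # O = [o_1, ..., o_T]

# For a (112 x 92) ORL image this yields T = 105 observations of length 736.
```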
5.3.3 Results
Samaria's dissertation exhaustively explored variations in the number of mod
els, and observation window parameters. The dissertation included experiments
using frequency domain representations and reduced (spatial) resolution images.
Samaria reports that the 1D-HMM outperformed the eigenfaces approach in only part
of the trials, with an average error rate of 10%. After all the detailed
analysis, Samaria modestly concluded that the improvements (over eigenfaces) in
face recognition using 1DHMM were probably not statistically significant. The
dissertation also explored a more complicated model, the P2DHMM (pseudo 2D
model). The P2D-HMM outperformed the eigenfaces approach about 91% of the
time with an average error rate of 5%.
5.4 Convolutional Neural Networks
Giles et al. (1997) used a SelfOrganizing Map (SOM) in conjunction with
a Convolutional Neural Network for face classification. The selforganizing map
(SOM) is used for dimensionality reduction of the exemplars, and the convolutional
network (CN) provides partial translation and deformation invariance (Giles et al.,
1997, p. 67).
Figure 5.5: SOM+CN Face Classifier (raw image → parse/SOM → compressed representation → CN → class)
5.4.1 SelfOrganizing Map
Giles et al. (1997) states that Kohonen's selforganizing map (SOM) or
SelfOrganizing Feature Map (SOFM) (Kohonen, 1995) is a topology preserving,
unsupervised learning process. This section presents an overview of SOFM that
follows the presentation of Kohonen's SOFM found in Haykin (1994, pp. 402 
414).
A SOFM maps an input of arbitrary dimension into a discrete map of reduced
dimension. The theory is based on vector quantization theory in which a large
set of input vectors is mapped to a reduced set of prototypes (the weights of the
winning output node). The network for a SOFM has only an input and an output
layer. Each output is fully connected to all the inputs. The nodes of the output
layer are arranged in a one-, two-, or three-dimensional lattice. When presented with
an input x, one of the SOFM's output nodes is the best-matching or winning node
according to some distance criterion,
i(x) = argmin_j ||x - w_j||, j = 1, 2, ..., N. (5.5)
In equation 5.5, i(x) is the index of the winning output in response to input x.
The SOFM is topologically ordered in the sense that nodes that are adjacent in the
output lattice tend to have similar weights. The SOFM is topologically preserving
in the sense that a small distance between two inputs (in the input space) implies a
small distance between the corresponding winning outputs (in the output space).
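A minimal sketch of the winner selection in equation 5.5 (the weight array below is hypothetical); during training, the winner and its lattice neighbors would then be pulled toward the input, which produces the topological ordering described above.

```python
# Winner selection for a self-organizing feature map.
import numpy as np

def winning_node(x, weights):
    """x: (d,) input; weights: (n_nodes, d), one weight vector per output node."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))
```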
5.4.2 Convolutional Network
A convolutional network (Le Cun and Bengio, 1995) is a specific structure
for a multilayer perceptron network that has been successfully applied to optical
character recognition (OCR) (Haykin, 1994, p. 226). Giles et al. (1997) uses a
similar network for face classification. The CN has five computational layers: four
hidden layers and the output layer.
1. The first hidden layer will be discussed in some detail since it exhibits
the properties of feature maps, weight sharing, local receptive fields, and
nonlinear convolution. Consider a (20 x 20) OCR image that is parsed
into (5 x 5) subregions. There are (16 x 16) = 256 subregions that can be
constructed by shifting the (5 x 5) window over the OCR image. A neuron
in the first hidden layer is said to have a local receptive field if the inputs
to the neuron correspond to a local region of the input. The neurons can
be organized into a (16 x 16) feature map such that adjacent neurons have
local receptive fields that are shifted by one pixel. A further constraint on a
feature map is that all the neurons in the feature map have the same weights
(weight sharing). The construction of the feature map can be perceived as a
convolution of the input image against the fixed weights. Since the output of
the neurons is passed through a nonlinear function, the first hidden layer is
characterized as a nonlinear convolutional layer. In the OCR application, the
first hidden layer consists of four feature maps.
2. The second hidden layer is a downsampling layer. Downsampling provides a
tolerance to distortions due to translation, rotation, and scaling. In the OCR
example, the second hidden layer has four feature maps that are respectively
reduced (spatial) resolution representations of the four feature maps from the
first hidden layer.
3. The third hidden layer is another convolutional layer. A feature map in the
third hidden layer may use local receptive fields from two feature maps in
the second layer. The OCR application has twelve feature maps in the third
hidden layer.
4. The fourth hidden layer is another averaging and downsampling layer,
identical in structure to the second hidden layer.
5. The output layer is fully connected to the fourth hidden layer. In the OCR
application, there are ten neurons corresponding to the ten digits [0, 1, ..., 9].
The output layer classifies the input; further, the difference between the most
active and second most active outputs can be used to generate a measure of
confidence in the classification.
Haykin (1994, p. 226) states that a multilayer perceptron that uses alternating
convolutional and downsampling layers is a convolutional network. A minimal sketch
of one such convolution-and-downsampling stage is given below.
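The sketch below shows the mechanics of one weight-shared feature map followed by
an averaging downsampling layer on a (20 x 20) OCR-style image, as in items 1 and
2 of the list above. It is illustrative only: the random kernel values, the tanh
nonlinearity, and the function names (feature_map, downsample) are assumptions,
not the network of Le Cun and Bengio (1995).

    import numpy as np
    from scipy.signal import convolve2d

    def feature_map(image, shared_weights, bias=0.0):
        """One nonlinear convolutional feature map: every neuron applies the
        same (weight-shared) kernel to its local receptive field, followed
        by a squashing nonlinearity."""
        activation = convolve2d(image, shared_weights, mode="valid") + bias
        return np.tanh(activation)

    def downsample(fmap, block=2):
        """Averaging downsampling layer: each output is the mean of a
        (block x block) region, giving tolerance to small distortions."""
        h, w = fmap.shape
        h, w = h - h % block, w - w % block      # trim so blocks tile exactly
        f = fmap[:h, :w].reshape(h // block, block, w // block, block)
        return f.mean(axis=(1, 3))

    # A (20 x 20) image with a (5 x 5) kernel gives a (16 x 16) feature map,
    # which the downsampling layer reduces to (8 x 8).
    img = np.random.rand(20, 20)
    fmap = feature_map(img, np.random.randn(5, 5) * 0.1)
    print(fmap.shape, downsample(fmap).shape)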
5.4.3 Description of Experiment
In contrast to the OCR application, Giles et al. (1997) chose to preprocess
the raw (face) images with a SOM. The (92 x 112) images are parsed into (5 x 5)
subimages. Each subimage overlaps adjacent subimages by one pixel. All the
subimages from the training data are collected and used to train a SOM with a
(5 x 5 x 5) three-dimensional output lattice. The trained SOM is used to transform
the raw images into three (23 x 28) maps. The three maps are passed to the
convolutional network. Giles' CN has five layers; the architecture is described in
table 5.2.
Table 5.2: Face Classification CN Architecture
Layer  Type  Number of Feature Maps  Feature Map Dimensions  Receptive Field Dimensions
1 convolutional 20 (21 x 26) (3 x 3)
2 downsampling 20 (9 x 11) (2 x 2)
3 convolutional 25 (9 x 11) (3 x 3)
4 downsampling 25 (5 x 6) (2 x 2)
5 full 40 (1 x 1) (5 x 6)
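A rough sketch of the SOM preprocessing stage described above follows: each
(5 x 5) subimage is replaced by the three-dimensional lattice coordinates of its
winning SOM node, producing three reduced-resolution maps. The stride of 4 pixels
(one-pixel overlap) and the border handling are assumptions made here; the exact
map dimensions in Giles et al. (1997) depend on how the image borders are treated.

    import numpy as np

    def som_preprocess(image, som_weights, som_coords, patch=5, stride=4):
        """Map each (patch x patch) subimage to the 3-D lattice coordinates
        of its winning SOM node, giving three reduced-resolution maps (one
        per lattice dimension). som_weights is (N, patch*patch) and
        som_coords is (N, 3); both come from an already-trained SOM."""
        rows = range(0, image.shape[0] - patch + 1, stride)
        cols = range(0, image.shape[1] - patch + 1, stride)
        maps = np.zeros((3, len(rows), len(cols)))
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                p = image[r:r + patch, c:c + patch].ravel()
                winner = np.argmin(np.linalg.norm(som_weights - p, axis=1))
                maps[:, i, j] = som_coords[winner]
        return maps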
5.4.4 Results
The CN is a multilayer perceptron network that requires significant training.
Once trained, the network operates quickly. The best results were reported as 3.5%
error against the ORL database.
5.5 Face Classification with PCAM
PCA is known to be optimal for representation, but suboptimal for classification.
Belhumeur et al. (1997) points out that PCA does not differentiate in-class
scatter from between-class scatter. Bartlett et al. (1998) shows that classification
can be improved by using independent component analysis to incorporate higher-order
statistics. On the other hand, the experiments using eigenfaces (Turk and
Pentland, 1991a) showed that coordinates in eigenspace are useful for classification.
Several other experiments indicate that PCA-based feature extraction could be
improved by adding localization and multiresolution.
Pentland et al. (1994) states that localization can enhance eigenfaces. Pentland
trained eigenfeatures corresponding to physical facial features. Classification
based on localized eigenfeatures was comparable to the performance of eigenfaces.
The combination of localized eigenfeatures and global eigenfaces performed almost
perfectly. Brunelli and Poggio (1993) state that localized features may be more
important than global features. Brunelli stated that when a classifier can use
only a single facial feature, local templates based on the eyes, nose, and mouth
contribute more to recognition than global facial templates. Giles et al. (1997)
suggested that reducing the spatial resolution of the ORL images might improve
classification. The use of the SOM front end to reduce dimensionality while retaining
good classification supports Giles' observation. Turk and Pentland (1991b) used
a six-level Gaussian pyramid to view the inputs at several spatial resolutions.
The theory for PCAM, dyadic filter banks, and the GHA network was
presented in chapter 4. The remainder of this section presents and discusses
experimental results for test runs using PCAM, a fixed-basis multiresolution
(Haar), and PCA at several fixed resolutions.
5.5.1 Classifier Architecture
Figure 5.6 shows the initial architecture for our classifier (Brennan and
Principe, 2000).
Figure 5.6: Initial Classifier Structure (the input image is decomposed by PCAM
into multiresolution features; each feature drives its own hyperplane classifier,
and a majority vote over the classifiers produces the class)
The structure was originally intended to isolate each feature space. We wanted to
observe both the individual feature performance and the degradation due to decoupling
the features. Adding more eigenfeatures does not monotonically increase classifier
performance (Samaria, 1994), and we plan on finding a way to select or weight
the predictions from eigenfeatures in future research. Each feature classifier uses
a template obtained by averaging the training exemplars. The final classification
was done by weighted vote among the component classifiers. The majority vote
mechanism is the simplest way to combine the results of the feature classifiers.
Table 5.3: Fixed Resolution PCA Error Rates (%) over 10 Runs
WINDOW      MAX    MEAN   MIN
Raw Data    19.0   14.4   10.5
(2 x 2)     22.5   17.5   14.0
(4 x 4)     27.5   23.6   20.5
Hu et al. (1997) suggest that more elaborate committee structures do not necessarily
outperform a majority vote mechanism significantly. The overall structure can be
compared to a two-layer hierarchical One-Class-One-Network (OCON) Decision-Based
Neural Network (DBNN) (Kung, 1993, pp. 118-120).
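A minimal sketch of the component classifiers and the vote mechanism follows. It
assumes one class template per feature space (the average of the training
exemplars) and an unweighted majority vote; the function names and the
nearest-template distance rule are illustrative choices, not a transcription of
the classifier in figure 5.6.

    import numpy as np

    def template_classify(feature, templates):
        """Nearest class-template decision for one feature space.
        templates is (n_classes, feature_dim): per-class training means."""
        d = np.linalg.norm(templates - feature, axis=1)
        return int(np.argmin(d))

    def majority_vote(feature_vectors, template_list):
        """Combine the decoupled feature classifiers by a simple vote."""
        votes = [template_classify(f, t)
                 for f, t in zip(feature_vectors, template_list)]
        return int(np.bincount(votes).argmax())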
5.5.2 Data Preparation
The most straightforward way to vary the resolution is to reduce the scale
by half. We investigated reductions in scale of 2, 4, 8, and 16. The scaling was
performed by passing a (2^k x 2^k) window through the image. For example, the 1/16
scaling takes a (2^4 x 2^4) subimage and produces 16 scalar outputs; the collection
of scalar outputs from all the windowed subimages forms 16 scaled images with
reduced spatial resolution. We used non-overlapping observation windows because
we wanted to observe whether blocking would severely degrade classification. To
facilitate scaling with non-overlapping windows, the image dimensions were cropped
to (112 x 80) so that the number of pixels along each dimension is a multiple of
2^4 = 16. Six columns of pixels were cropped from each side of the input.
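One simple way to implement the cropping and the non-overlapping windowing
described above is sketched here. It assumes the ORL image is stored as a
(112 x 92) array of rows by columns; the six-column crop then yields the
(112 x 80) dimensions given in the text.

    import numpy as np

    def prepare_blocks(image, k):
        """Crop an ORL image to (112 x 80) and collect the non-overlapping
        (2**k x 2**k) observation windows as rows of a matrix. The cropped
        dimensions are multiples of 2**4 = 16, so every window size from
        (2 x 2) up to (16 x 16) tiles the image exactly."""
        img = np.asarray(image, dtype=float)[:, 6:-6]  # drop 6 columns per side
        b = 2 ** k
        h, w = img.shape
        return (img.reshape(h // b, b, w // b, b)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, b * b))                # one row per window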
5.5.3 Fixed Resolution PCA Results
The fixed resolution PCA was investigated for windows of (2 x 2) and (4 x 4).
Eigenfaces would have transformed each (112 x 80) = 8960 pixel input to 8960
coordinates in eigenspace; each coordinate corresponds to the projection of an
input to an eigenface. For the ORL database there would have been only 200 non
zero coordinates. We step through the procedure for a (2 x 2) and note that the
procedure for the (4 x 4) PCA windows are analogous. A (2 x 2) has 4 eigenimages.
As the nonoverlapping window is passed through the image, we are effectively
condoling 4 sets of coefficients against the input image. Since the blocks are non
overlapping, downsampling is accomplished in the same step. As an aside, linear
convolution and downsampling are being performed by a single computational
1,. r partially connected network with 8960 inputs and 8960 outputs. Each output
is locally supported by to 4 inputs (a (2 x 2) subregion). Only 4 sets of weights
are used. The output can be organized into 4 feature spaces that are 4 halfscale
images.
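The (2 x 2) procedure just described can be sketched as follows: estimate the 4
eigenimages from the autocorrelation matrix of all training blocks, then project
every non-overlapping block of an image onto them, giving 4 half-scale feature
images. The direct eigen-decomposition with numpy.linalg.eigh is an illustrative
shortcut; the dissertation trains these directions with the GHA network instead.

    import numpy as np

    def blocks(img, b):
        """Non-overlapping (b x b) windows of img, one per row."""
        h, w = img.shape
        return (img.reshape(h // b, b, w // b, b)
                   .transpose(0, 2, 1, 3).reshape(-1, b * b))

    def block_pca(train_images, image, b=2):
        """Project every (b x b) block of image onto the b*b eigenimages of
        the block autocorrelation matrix, returning b*b feature images at
        1/b the linear scale."""
        train = np.concatenate([blocks(t, b) for t in train_images])
        _, vecs = np.linalg.eigh(train.T @ train)  # eigenvectors, ascending
        eig = vecs[:, ::-1]                        # largest eigenvalue first
        proj = blocks(image, b) @ eig              # coordinates of each block
        h, w = image.shape
        return proj.T.reshape(b * b, h // b, w // b)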
In each run, five training exemplars were randomly selected from the ten
exemplars available for each person in the ORL database. The results show great
sensitivity to selection of the training set. The sensitivity is not surprising. For
example, figure 5.7 shows a class that had the entire training set at one scale, and
the entire test set at another scale.
Figure 5.7: Training and Test Data at Different Scales (the TRAIN exemplars are
at one scale and the VERIFY exemplars at another)
Giles et al. (1997) points out that a random selection among 40 classes would be
expected to be correct 1/40 = 2.5% of the time. We feel that a more realistic
baseline for error rates is the performance of a template classifier with the raw
data. Since PCA is just a rotation, the performance would be the same as the
performance using all 200 eigenfaces. Samaria (1994) reported 10.5% for the ORL
data, but we found that the error rate was also sensitive to the training set and
averaged around 14.4% (first line of table 5.3). Some of the increased
misclassification could be due to the cropped data, but it is more likely that the
decoupling of the data by our classifier structure is responsible for the
deterioration. The results suggest that the data organized into 4 independent
feature spaces (the (2 x 2) window) performs worse than the data taken as a whole
(raw data), but better than 16 decoupled feature spaces. Note that if the feature
spaces were linearly combined before classification, we would expect an error rate
similar to the raw images. All the individual feature classifiers and the majority
vote mechanism have a nonlinear operation when the maximum output is selected.
5.5.4 Haar Multiresolution
A fixed Haar basis was used to create a four-level differential image pyramid. A
sample decomposition is shown along with the original image.
Figure 5.8: PCAM Decomposition of One Picture (panels labeled ORIGINAL, RAW,
and NORMALIZED)
The autocorrelation matrix of the observation windows shows, as expected, that
the pixels in a natural image follow an approximately 1/f spectrum. Classification
using a Haar basis was not significantly different from PCAM, since the Haar basis
is well suited to 1/f signals. Moreover, for small observation windows, the choice
of multiresolution basis is not very important. Multiplication by any fixed
orthonormal basis is a rotation; if all the features are used, then input distances
are preserved. Classification (with a linear classifier) will be no better than
using the raw inputs.
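For reference, one level of the fixed Haar decomposition on non-overlapping
(2 x 2) blocks, iterated four times on the averages to form the differential
pyramid, might look like the sketch below. The 1/4 (averaging) normalization is a
convenience chosen for this sketch rather than the orthonormal Haar scaling, and
the image dimensions are assumed divisible by 2 at every level.

    import numpy as np

    def haar_level(img):
        """One level of a 2-D Haar decomposition on non-overlapping 2x2
        blocks: an approximation (average) plus three detail images."""
        a = img[0::2, 0::2]; b = img[0::2, 1::2]
        c = img[1::2, 0::2]; d = img[1::2, 1::2]
        approx      = (a + b + c + d) / 4.0
        col_detail  = (a - b + c - d) / 4.0   # differences between columns
        row_detail  = (a + b - c - d) / 4.0   # differences between rows
        diag_detail = (a - b - c + d) / 4.0
        return approx, (col_detail, row_detail, diag_detail)

    def haar_pyramid(img, levels=4):
        """Iterate on the approximation to build the differential pyramid."""
        details = []
        for _ in range(levels):
            img, det = haar_level(img)
            details.append(det)
        return img, details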
5.5.5 PCAM
PCAM was used to decompose images into multiresolution feature spaces
(components). Four feature spaces are 1/16 scale images, three are 1/8 scale
images, three are 1/4 scale images, and three are 1/2 scale images (figure 5.9).
Figure 5.9: Selected Resolutions
The decomposition was chosen to facilitate comparison to the Haar decomposition.
Referring to figure 5.9, components 1 to 4 have the longest eigenvectors; that is,
(16 x 16) eigenimages and the least spatial resolution. Components 11 to 13 have
the shortest eigenvectors and the highest spatial resolution. The classifier was
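A simplified way to produce the 13 feature spaces of figure 5.9 is sketched below:
at each window size, the image blocks are projected onto the leading eigenimages
estimated from training data, keeping 4 components at the coarsest (1/16) scale
and 3 at each finer scale. This is only an approximation of PCAM as defined in
chapter 4, which builds the decomposition through dyadic filter banks and the GHA
network; the per-scale block PCA used here is an assumption made to keep the
example short.

    import numpy as np

    def blocks(img, s):
        """Non-overlapping (s x s) windows of img, one per row."""
        h, w = img.shape
        return (img.reshape(h // s, s, w // s, s)
                   .transpose(0, 2, 1, 3).reshape(-1, s * s))

    def pcam_decompose(train_images, img, scales=(16, 8, 4, 2), keep=(4, 3, 3, 3)):
        """PCAM-style multiresolution features: 4 components at 1/16 scale
        and 3 at each of the 1/8, 1/4, and 1/2 scales (13 in total)."""
        features = []
        h, w = img.shape
        for s, k in zip(scales, keep):
            train = np.concatenate([blocks(t, s) for t in train_images])
            _, vecs = np.linalg.eigh(train.T @ train)  # ascending order
            eig = vecs[:, ::-1][:, :k]                 # k leading eigenimages
            proj = blocks(img, s) @ eig                # (n_blocks, k)
            features.extend(proj.T.reshape(k, h // s, w // s))
        return features                                # list of 13 feature images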
modified (figure 5.10).
Figure 5.10: Final Classifier Structure (each PCAM multiresolution feature is
passed through a magnitude FFT and a hyperplane classifier; a majority vote
produces the class)
The modification resulted from earlier experiments evaluating PCAM for
representation. Since we had the ORL database in a variety of representations, we
fed them into the classifier. The magnitude of the FFT of the raw data had an
average error rate of 10%. The combination of PCAM with the magnitude FFT gave
the best performance. We assume that using the FFT magnitude makes the classifier
more robust to translations. There is still a high sensitivity to training set
selection.
Table 5.4: Error Rates (%) for PCAM with Magnitude of FFT
Multiresolution Levels   MAX    AVG    MIN
2                        6.50   2.95   0.00
3                        5.00   2.45   0.00
4                        6.50   3.40   1.i .
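The magnitude-FFT stage added in figure 5.10 amounts to the following; discarding
the phase makes each feature image insensitive to circular shifts, which is the
translation robustness assumed above. The function name is illustrative.

    import numpy as np

    def fft_magnitude(feature_image):
        """Magnitude of the 2-D FFT of one multiresolution feature image.
        The phase, which carries the position information, is discarded."""
        return np.abs(np.fft.fft2(feature_image)).ravel()

In the modified classifier, each of the 13 feature images would be passed through
this transform before its hyperplane (template) classifier is applied.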
The main diagonal of table 5.5 shows the performance of the individual feature classi
fiers. The table shows the number of misclassifications out of 200 test images. The
nondiagonal elements show the number of misclassifications using a pair of feature
classifiers. The performance seemed to be independent of eigenvalue or resolution.
None of the component classifiers had more than 10 misclassifications (out of 200)
in the training set (200 images). Performance on the test set does not seem to be
predictable from performance on the training set. When a component classifier's
best guess was incorrect, the second best guess was correct half the time. Of some
interest is the poor performance of the first four components since Belhumeur
et al. (1997) states that the first few eigenvectors are sensitive to illumination.
Belhumeur stated that removal of the first three or four eigenvectors could provide
some robustness to illumination levels.
Table 5.5: Component Misclassifications (200 Test Images)
         1    2    3    4    5    6    7    8    9   10   11   12   13
  1     61   48   45   33   19   28   19    5   19    7   25    3    4
  2         136  100   73   21   55   46    7   45    9   54    6    8
  3              147   75   20   52   45    6   42    7   48    6    8
  4                    99   22   43   31    6   31    9   39    5    8
  5                         27   17    6    6   11    8    9    4    3
  6                              68   29    6   27    6   24    5    4
  7                                   54    4   16    4   15    4    4
  8                                         7    5    6    3    4    1
  9                                             53    5   30    5    4
 10                                                   9    3    3    2
 11                                                        67    4    6
 12                                                              6    3
 13                                                                   9
CHAPTER 6
MSTAR EXPERIMENT
This chapter describes a classification and a simple discrimination experiment
using synthetic aperture radar (SAR) images of armored vehicles. SAR imagery
is obtained by combining radar returns over the path of a moving platform (an
airplane or satellite). The path is effectively a large antenna aperture, leading to
high-resolution imagery. The basic scenario is that, given a training set of several
target vehicles, the discriminator will assign subsequent input images to the
correct class of threat vehicles, or identify that the new image belongs to a new
class of nontarget vehicles. Classes of vehicles that have only test (no training)
exemplars are called confuser classes. Discriminators can make three types of
mistakes.
1. A false-negative error occurs when a target vehicle is not identified. Presum
ably, failing to respond to a target vehicle incurs the most severe penalty.
2. A false-positive mistake occurs when a nontarget vehicle is identified as a
target vehicle. This mistake causes resources to be wasted in an unnecessary
response.
3. The third error occurs when a target vehicle is correctly identified, but is
incorrectly labeled.
By modifying the decision boundary, a tradeoff is possible among the three
errors. Two sets of results are presented: a set for classification (no rejection
of confuser classes), and a set for discrimination. The goal of the classifier is to
maximize the unconditional probability of correct classification, Pcc. The goal for
our discriminator is to maximize the (conditional) Pcc when the probability of
detecting targets, Pd, is 91%.
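A minimal sketch of the discrimination rule described above follows: the
classifier's best-match confidence is thresholded so that the chosen fraction of
true target exemplars is still detected, and anything below the threshold is
declared a confuser. The confidence score and the quantile-based threshold
selection are assumptions made for this example, not the dissertation's
discriminator.

    import numpy as np

    def choose_threshold(target_scores, target_pd=0.91):
        """Pick the confidence threshold that keeps target_pd of the true
        target exemplars above threshold (i.e. detected)."""
        return np.quantile(target_scores, 1.0 - target_pd)

    def discriminate(score, label, threshold):
        """Reject as a confuser if the best-match confidence is below the
        threshold; otherwise report the class label of the best match."""
        return label if score >= threshold else "confuser"

Raising the threshold trades false positives for false negatives, which is the
decision-boundary tradeoff among the three error types described above.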
6.1 SAR Image Database
The raw SAR inputs are (128 x 128) pixel images from a subset of the 9/95
MSTAR Public Release Data obtained from Veda Inc. (www.mbvlab.wpafb.mil).
The web site also includes a paper with a detailed description of the data and
the results of a baseline templatematching classifier (Velten et al., 1998). The
data consists of X-band SAR images with a 1-foot by 1-foot resolution (Velten
et al., 1998). Table 6.1 lists the vehicle classes, bumper tags, and the quantity of
corresponding images. The SAR image of a vehicle is dependent on the pose of the
target vehicle relative to the collection platform (satellite). The two pertinent pose
parameters are aspect angle and depression angle (figure 6.1).
Figure 6.1: Aspect and Depression Angles
Changing the pose of a vehicle results in nonlinear changes in the SAR image,
since the radar cross-section is a projection of a 3-dimensional object onto a
2-dimensional surface. Thus, the features that are available for discrimination
are a function of pose. Because of this dependence, it is desirable to ensure that
the exemplars for a given vehicle have the same pose. Ideally, a large number of
exemplars would be available for each aspect angle. More practically, we would
collect a large number of training exemplars for a narrow range of aspect angles.
Realistically, we have to accept a tradeoff between having a large number of
