Principal component analysis with multiresolution

Material Information

Principal component analysis with multiresolution
Brennan, Victor L., 1956- ( Dissertant )
Principe, Jose ( Thesis advisor )
Place of Publication:
Gainesville, Fla.
University of Florida
Publication Date:
Copyright Date:


Subjects / Keywords:
Databases ( jstor )
Eigenvalues ( jstor )
Eigenvectors ( jstor )
Error rates ( jstor )
Image classification ( jstor )
Mathematical vectors ( jstor )
Matrices ( jstor )
Pixels ( jstor )
Principal components analysis ( jstor )
Signals ( jstor )
Dissertations, Academic -- Electrical and Computer Engineering -- UF ( lcsh )
Electrical and Computer Engineering thesis, Ph. D ( lcsh )
Human face recognition (Computer science) ( lcsh )
Principal components analysis ( lcsh )
Signal processing -- Digital techniques ( lcsh )
Synthetic aperture radar ( lcsh )
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, terriorial, dependent) ( marcgt )
non-fiction ( marcgt )


Eigenvalue decomposition and multiresolution are widely used techniques for signal representation. Both techniques divide a signal into an ordered set of components. The first component can be considered an approximation of the input signal; subsequent components improve the approximation. Principal component analysis selects components at the source resolution that are optimal for minimizing mean square error in reconstructing the original input. For classification, where discriminability among classes puts an added constraint on representations, PCA is no longer optimal. Features utilizing multiresolution have been demonstrated to preserve discriminability better than a single scale representation. Multiresolution chooses components to provide good representations of the input signal at several resolutions. The full set of components provides an exact reconstruction of the original signal. Principal component analysis with multiresolution combines the best properties of each technique: 1. PCA provides an adaptive basis for multiresolution. 2. Multiresolution provides localization to PCA. The first PCA-M component is a low-resolution approximation of the signal. Additional PCA-M components improve the signal approximation in a manner that optimizes the reconstruction of the original signal at full resolution. PCA-M can provide a complete or overcomplete basis to represent the original signal, and as such has advantages for classification because some of the multiresolution projections preserve discriminability better than full resolution representations. PCA-M can be conceptualized as PCA with localization, or as multiresolution with an adaptive basis. PCA-M retains many of the advantages, mathematical characteristics, algorithms and networks of PCA. PCA-M is tested using two approaches. The first approach is consistent with a widely-known eigenface decomposition. The second approach assumes ergodicity. PCA-M is applied to two image classification applications: face classification and synthetic aperture radar (SAR) detection. For face classification, PCA-M had an average error of under 2.5%, which compares favorably with other approaches. For synthetic aperture radar (SAR), direct comparisons were not available, but PCA-M performed better than the matched filter approach. ( , )
Thesis (Ph. D.)--University of Florida, 2001.
Includes bibliographical references (p. 120-123).
System Details:
System requirements: World Wide Web browser and PDF reader.
System Details:
Mode of access: World Wide Web.
General Note:
Title from first page of PDF file.
General Note:
Document formatted into pages; contains xi, 124 p.; also contains graphics.
General Note:
Statement of Responsibility:
by Victor L. Brennan.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
50743663 ( OCLC )
002729315 ( AlephBibNum )
ANK7079 ( NOTIS )


This item has the following downloads:

Full Text








I am grateful for the support I have received from family, colleagues, and

faculty at UF. It is difficult to single out a few people to thank when many have

been supportive and encouraging.

I wish to thank my advisor, Dr. Jos6 Principe, not only for sharing his

technical expertise and insight, but especially for his patience and encouragement.

I want to thank Leonard and Carolina Brennan, who have been loving parents and

inspirational role models. I am most grateful to my wife, Karen, for her love and

for her support in every endeavor in our lives.



ACKNOW LEDGMENTS ............................. ii

LIST OF TABLES ...................... ......... vi

LIST OF FIGURES ..................... ......... vii

ABSTRACT ... .. .. .. .. .. .. ... .. .. .. .. .. .. .. .. .. x


1 INTRODUCTION ................... ....... 1

1.1 Classification .................. ....... 2
1.2 Principal Component Analysis (PCA) ............. 4
1.3 Multiresolution .......... ....... ....... 5
1.4 PCA-M ............ .......... ..... .. 6
1.5 Image Classification Experiments ...... ...... ... 6
1.6 MSTAR Experiment ........... ........... 7

2 PCA ....... ............ ........... .... 9

2.1 The Eigenvalue Problem .............. ... . 9
2.2 An Example .................. ........ .. 11
2.3 PCA .................. ....... .... . 15
2.4 Deflation Techniques ................ . . 17
2.5 Generalized Hebbian Algorithm ............... .. 19
2.6 Eigenfilters .................. ......... .. 22
2.6.1 Low-Pass Test Signal ................ 23
2.6.2 High Pass Test Signal ................. .. 25
2.6.3 Mixed Mode Test Signal ................ .. 26
2.7 Properties .................. ......... .. 27

3 MULTIRESOLUTION .................. ..... .. 29

3.1 Two Notes Example ........ . . . . . .... 30
3.2 Quadrature Filter and Iterated Filter Bank . . . ... 32
3.3 Discrete Wavelets .................. ... .. 34
3.4 Haar Wavelet .................. ....... .. 35
3.5 A Multiresolution Application: Compression . . . ... 35


4.1 Definition of PCA-M .................. ..... 38
4.1.1 Localization of PCA .................. .. 38
4.1.2 A Structure of Localized Outputs . . . ... 41
4.2 The Classification Problem .................. .. 43
4.3 Complete Representations .................. .. 44
4.3.1 Eigenfaces .................. .. .. 45
4.3.2 Identity Map .................. .. 50
4.3.3 Iterated Filter Banks ................. .. 52
4.3.4 Dual Implementation of PCA . . . . ..... 56
4.4 Overcomplete Representations ................ .. 59
4.5 Local Feature Analysis ..... ........... .... 60
4.5.1 Output Vector .................. ..... 61
4.5.2 Residual Correlation .................. .. 62
4.5.3 K ernel . . . . . . . . . . . .... . 63
4.5.4 LFA on ORL Faces .................. .. 63
4.5.5 Localization for LFA and PCA-M . . . ... 66
4.5.6 Feature Space for LFA, PCA, and PCA-M ...... 67
4.6 Summary ....... ....... ......... .... 67

5 FACE RECOGNITION EXPERIMENT .............. .. 69

5.1 ORL face Database .................. .. 70
5.2 Eigenfaces .................. ......... .. 71
5.2.1 Description of Experiment .............. .. 72
5.2.2 Results ........... . . . . . 73
5.3 Face Recognition using HMM's ................ .. 73
5.3.1 Markov Models .................. ..... 74
5.3.2 Description of Experiment .............. .. 75
5.3.3 Results . . . . . . . . . . . .... . 76
5.4 Convolutional Neural Networks ................ .. 77
5.4.1 Self-Organizing Map ..... . . . 77
5.4.2 Convolutional Network ................ .. 78
5.4.3 Description of Experiment ..... . . . . 79
5.4.4 Results . . . . . . . . . . 80
5.5 Face Classification with PCA-M ... . . . 80
5.5.1 Classifier Architecture .... . . . . 81
5.5.2 Data Preparation ............. .. .. ..82
5.5.3 Fixed Resolution PCA Results . . . . 82
5.5.4 Haar Multiresolution. ................ . 84
5.5.5 PCA-M ... .. .. .. .. .. .. .. .. ... .. . 85

6 MSTAR EXPERIMENT .............

6.1 SAR Image Database ......... .............. 89
6.2 Classification Experiment ......... ........... 90
6.3 Basis Arrays for PCA-M ............. ... .. .. 93
6.3.1 Level 3 Components .................. .. 94
6.3.2 Level 2 Components .................. .. 94
6.3.3 Level 1 Components .................. .. 94
6.3.4 Decorrelation between Levels . . . . ..... 95
6.4 A Component Classifier .................. .. 96
6.5 Classifications using Several Components . . . .... 98
6.6 A Simple Discriminator ............... . . 102
6.7 False-Positive and False-Negative Errors . . . . .... 104
6.8 Observations .................. ........ 106


7.1 Conclusions . . . . . . . . . . . ... .. 108
7.2 Future W ork ....... ......... ...... 111
7.2.1 Segmentation of the Input . . . . . 111
7.2.2 Component Selection . . . . . . 111
7.2.3 Conditioned Data and Non-Linear Classifier ...... 112


A ABBREVIATIONS .................. ........ .. 113


C MSTAR IMAGES ....... .................... 116

REFERENCES ....................... . . . ....... 120

BIOGRAPHICAL SKETCH .................. ......... 124


Table page

4.1 Normalized Eigenvalues .................. ....... .. 49

4.2 Energy Distribution of Exemplars ................ 49

5.1 Error Rates of Several Algorithms ................ . 71

5.2 Face Classification CN Architecture .................. 80

5.3 Fixed Resolution PCA Error Rates over 10 Runs . . . .... 82

5.4 Error Rates for PCA-M with Magnitude of FFT ........... ..86

5.5 Component Misclassifications (200 Test Images) ........... ..87

6.1 Input Data .................. .............. .. 90

6.2 Classification using First Component .................. 97

6.3 Misclassifications with Individual PCA-M Components . . . ... 99

6.4 Error Rate (5/68 7.,!'-) using 3 Components ............ .100

6.5 Error Rate (2/68 = 3.0'.,) using 10 Components . . . ..... 100

6.6 Overall Unconditional Pc with Template Matching . . . . ... 102

6.7 Overall Unconditional Pc with PCA-M ................ ..102

6.8 Determining an Threshold for Detection ................ ..103

6.9 Ten Components without Rejection ................. 104

6.10 Ten Components with Rejection .................. .. 104

6.11 Detector Threshold .................. ....... .. 105

6.12 Performance at 91' Pd ............... ..... . 106

























Conceptual Steps in a Classifier . . .......

PCA-M Classifier . . ...............

Sam ple D ata . . . . . . . . . . . .

Original (left) and Scaled (right) Data . . ...

GHA Linear Network . . . ............

Low Pass Test Data . . . .............

Low Pass Data PCA . . .............

High Pass Data PCA . . . ............

Test High and Low Frequency Data . . . ....

T wo N otes . . . . . . . . . . . .

Quadrature Filter . . . ..............

Discrete Wavelet Transform with 2 Levels . . . .

Equivalent 2"' Filter Bank DWT Implementation .

Three Levels of Decomposition on the Approximation

PCA-M for Classification . . . ..........

PCA and PCA-M in Feature Space . . .....

Raw Images from ORL Database . . ......

Residual Images for GHA Input . . .......

All-to-one and One-to-one Networks . . . ....

Eigenfaces from GHA Weights . . . .......

Three Level Dyadic Banks . . ..........

First Four Eigenimages . . . ...........
























4.9 Three Level Decomposition of a Face ................. .. 55

4.10 Output of Quadratic Filter Bank ............. .... .. .. 55

4.11 Localization of a Global Output .................. .. 58

4.12 Local Feature Analysis .................. ....... .. 60

4.13 PCA Reconstruction MSE .................. .... .. 64

4.14 PCA Reconstructions .................. ...... .. .. 64

4.15 LFA Outputs (Compare to PCA Reconstruction) . . . ... 65

4.16 LFA Kernel and Residual Correlation (Look for Localization) . . 66

5.1 Varying Conditions in ORL Pictures ................. .. 71

5.2 Parsing an Image into a Sequence of Observations . . . .... 74

5.3 Markov Model .................. ............ .. 75

5.4 Top-down Constrained State Transitions ............... .. 75

5.5 SOM-CN Face Classifier .................. ... .. .. 77

5.6 Initial Classifier Structure .................. ... .. 81

5.7 Training and Test Data at Different Scales .... . . . . 83

5.8 PCA-M Decomposition of One Picture ............ . .. 84

5.9 Selected Resolutions . ............... ...... .. 85

5.10 Final Classifier Structure ................ ... ... .. 86

6.1 Aspect and Depression Angles ................ ...... 89

6.2 Experiment Overview . ............... ...... .. 91

6.3 Three Levels of Decomposition on the Approximation . . . . 93

6.4 PCA-M Decomposition of a BMP2 Input ... . . . . 95

6.5 The Templates for Three Classes for PCA-M Component 1 . . 96

6.6 First Component of SAR Images Projected to 3-Space . . . ... 98

6.7 Class Templates for other PCA-M Components . . . . 99

6.8 Clustering in 3-Space using All PCA-M Components . . . ... 101

6.9 Probability of Detection versus False Alarm Rate . . . ... 106

B.1 Olivetti Research Laboratory Face Database .. . . . . 115

C.1 BMP2 Training and Test Data ............. .... . 116

C.2 T72 Training and Test Data ................ .... .. 117

C.3 BTR70 Training and Test Data ............... . .. 118

C.4 Confuser Data .................. ............ .. 119

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy



Victor L. Brennan

T; ,- 2001

Chair: Jos6 Principe
Major Department: Electrical and Computer Engineering

Eigenvalue decomposition and multiresolution are widely used techniques

for signal representation. Both techniques divide a signal into an ordered set of

components. The first component can be considered an approximation of the input

signal; subsequent components improve the approximation. Principal component

analysis selects components at the source resolution that are optimal for minimiz-

ing mean square error in reconstructing the original input. For classification, where

discriminability among classes puts an added constraint on representations, PCA

is no longer optimal. Features utilizing multiresolution have been demonstrated to

preserve discriminability better than a single scale representation. Multiresolution

chooses components to provide good representations of the input signal at several

resolutions. The full set of components provides an exact reconstruction of the

original signal.

Principal component analysis with multiresolution combines the best proper-

ties of each technique:

1. PCA provides an adaptive basis for multiresolution.

2. Multiresolution provides localization to PCA.

The first PCA-M component is a low-resolution approximation of the signal.

Additional PCA-M components improve the signal approximation in a manner

that optimizes the reconstruction of the original signal at full resolution. PCA-M

can provide a complete or overcomplete basis to represent the original signal,

and as such has advantages for classification because some of the multiresolution

projections preserve discriminability better than full resolution representations.

PCA-M can be conceptualized as PCA with localization, or as multiresolution

with an adaptive basis. PCA-M retains many of the advantages, mathematical

characteristics, algorithms and networks of PCA. PCA-M is tested using two

approaches. The first approach is consistent with a widely-known eigenface

decomposition. The second approach assumes ergodicity. PCA-M is applied to two

image classification applications: face classification and synthetic aperture radar

(SAR) detection. For face classification, PCA-M had an average error of under

2.5' which compares favorably with other approaches. For synthetic aperture

radar (SAR), direct comparisons were not available, but PCA-M performed better

than the matched filter approach.


Principal component analysis with multiresolution (PCA-M) combines and

enhances two well-established signal processing techniques for signal representa-

tion. This dissertation presents the motivation, the mathematical basis, and an

efficient implementation for combining principal component analysis (PCA) with


This dissertation also presents the results of using PCA-M as a front-end

for two applications: face classification, and target discrimination of synthetic

aperture radar (SAR) images. More detailed discussions of PCA, multiresolution

(differential pyramids), and PCA-M are presented in subsequent chapters. This

introduction is intended as an overview to the presentation of PCA-M.

PCA-M was originally developed as an on-line signal representation technique.

The intent was to perform real-time segmentation of (time) signals based on

variations in local principal components. Tests with simple artificially generated

signals were promising, and good results have been reported in applying PCA-M

to biological signals (Alonso-Betzanos et al., 1999). The decision to concentrate

on images was made when several researchers (Giles et al., 1997; Samaria, 1994;

Turk and Pentland, 1991a) applied various techniques against a common database

generated by the Olivetti Research Lab (ORL). Each of the researchers also cited

an approach which decomposed a set of facial images into component eigenfaces.

It became possible to compare the performance of PCA-M to the results of other

researchers using fixed resolution PCA technique and more computationally

intensive non-PCA image classification techniques. We will start by providing a

brief overview of the fundamental concepts required to understand PCA-M.

1.1 Classification

Classification is the assignment of an input signal x = [xI, x2, Xd]T to one

of K classes (Bishop, 1995, pp. 1-10).

x --> Ck, 1

Each input x is assigned a label y E {1, 2, .. K}. The value of the label y

corresponds to the assigned class. The classification problem can be formulated in

terms of a set of discriminant functions yk with parameters, w,

Yk y (X; w). (1.1)

An input x is assigned to class Ck if

k max {yj(x; w)}. (1.2)

Each class has a corresponding discriminant function. A signal x is input to each

discriminant function. The function with the highest output assigns a label to

the input (eq. 1.2). While difficult problems can be addressed by more complex

(e.g., nonlinear, muiltil:v-r) discriminants, an alternative approach is to attempt to
simplify the problem by some transformation K of the raw data,

Uk = max {yj (K(x),w)}. (1.3)
The output of the transformations or projections is called a feature and the

output space is called the feature space (Duda and Hart, 1973). The size of

the feature space can be larger or smaller than the original space (Fukunaga,

1990; Vapnik, 1998). Traditionally, in statistical pattern recognition the feature

space is smaller than the input space. One of the unsolved problems is how to

determine the feature space and its size to improve classification accuracy. A

feature is a projection that preserves discriminability. A fortuitous choice for

the transformation extracts features that differ between classes but are similar

within a class. Undesirable features differ within-class or are similar between-class.

Heuristics have been the most utilized method of selecting good features.

The problem is the following. Optimal classifications in high dimensional

spaces require prohibitive amounts of data to be trained accurately (Fukunaga,

1990; Duda and Hart, 1973). Hence the reduction of the input space dimensionality

improves accuracy of the estimated classifier parameters and improves classifier

performance. On the other hand, projections to a feature subspace may decrease

discriminability, so there is a trade-off that is difficult to formulate and solve (Fuku-

naga, 1990).

x N1)x) Yk

Data Transformation Classifier

Figure 1.1: Conceptual Steps in a Classifier

Experience has shown that local features tend to preserve discriminability

better than global features. Hence the widespread use of wavelets and other

multiresolution techniques as feature extractors for classification (Bischof, 1995).

More recently there has been work proposing feature spaces of higher dimen-

sionality than that of the original input space (Vapnik, 1998). High dimension

spaces increase the separability between classes, enabling the use of linear dis-

criminators that have fewer parameters to estimate than the optimal (B i,-i ,i:)


When analyzed from the feature extraction point of view, projection to high

dimensional spaces also enhances the chance of obtaining "b, I 1 i features; that

is, where the projections of different classes concentrated more along certain

directions. These are called overcomplete representations and 1h :,- have been

studied in the wavelet literature (Vetterli and Kovacevic, 1995; Strang and Nguyen,

1996). The big issue is still how to choose the overcomplete dictionary and how to

select the best features.

1.2 Principal Component Analysis (PCA)

Principal component analysis (PCA) is based on eigenvalue decomposi-

tion (Hotelling, 1933). Eigenvalue decomposition has been applied to problems

across 1ii i'v disciplines. There is a rich mathematical background and a variety of

implementations (Oja, 1982). Given a set of data, a scatter matrix S is calculated

to estimate the autocorrelation matrix of the data.
S T= Ti x (1.4)

The eigenvector and corresponding eigenvalue pairs (wk, Ak) of S are found by


Sw = Aw. (1.5)

Both the data x and the scatter matrix S can be expanded in terms of the eigen-


S = E WkW Ak,
x = k WkW X = k WkQk.

Analytic and deflation-based iterative approaches are available to solve the eigen-

value problem that automatically order the eigenvectors such that

A> > > N > .

PCA components are uncorrelated and maximal in 12 energy. PCA is one possible

transformation for equation 1.3. It has been shown that PCA is optimal for signal

representation, but it is sub-optimal for feature extraction (Fukunaga, 1990).

Chapter 2 has an expanded discussion of PCA. Although other sets of basis

functions are available that are similar to the PCA basis, only PCA can select a

reduced set of components that are optimal for reconstruction MSE.

1.3 Multiresolution

Multiresolution has been broadly defined as the simultaneous presentation

of a signal at several resolutions. An intuitive argument for using multiresolution

is available from common experience. Consider watching someone approach from

a distance. As the person comes closer, more details are resolvable to allow an

observer to make successively refined categorizations. A possible sequence is to first

identify a moving object, then a person, the gender of the person, the identity of

the person, and finally the facial expressions of the person.

Another familiar application of multiresolution is transferring images across

low bandwidth channels (internet). People tend to leave a web page if the page

takes too long to load. For commercial web sites this translates into a loss of

potential customers. On the other hand, many sites feel that customers will not

return to a site that does not have a lot of graphics. Some image intensive web

pages (e.g., zoo, museum, or auction sites) usually present small images (initially

transfer small files). A larger, more detailed version of the image is loaded only

if the viewer clicks on the small image. While it is possible to completely reload

the larger image, it is more efficient to use information available on the (already

loaded) small version and just add the details needed to produce the larger picture.

For classification, it is hoped that within-class differences are high-resolution

features, and that sufficient desirable features are resolvable at coarse resolution.

By using a coarser representation (lower-resolution), it is hoped that undesirable

features are sharply attenuated with minor impact on the desirable features.

Multiresolution is discussed in chapter 3.

1.4 PCA-M

Both PCA and multiresolution have been successfully applied to similar

problems. It seems reasonable that an application that has benefited from each

individual approach should further benefit from a combined approach. PCA-M is

simply multiresolution with an adaptive basis (PCA).

I show that a linear network for online, adaptive, multiresolution feature

extraction is easily adapted from the networks used for standard PCA. principal

component analysis with multiresolution (PCA-M) is implemented with a partially

connected, single-li v-r linear network. The same network can be used for both

training and normal operation. The training algorithm is a modification of the

generalized hebbian algorithm (GHA) (Sanger, 1989). I treat PCA-M in chapter 4.

1.5 Image Classification Experiments

Olivetti Research Lab (ORL) has a public face database that serves as a

benchmark for comparing different face classification algorithms. Both mul-

tiresolution and PCA (Turk and Pentland, 1991a) had been successfully applied

against the database. The PCA-M components were used with an almost linear

network. The network is linear except for selecting the maximum discriminant

(i\ AXNET) (Kung, 1993, p. 48)).

--., FFT r H PERL .NE Ll.:
r FFT :.~.~ .... ...........

SFT r I P r

Figure 1.2: PCA-M Classifier

The ORL database was used to compare PCA-M to several standard fixed

resolution transforms (discrete Fourier transform, discrete cosine transform, PCA),

and to multiresolution using a Haar basis. PCA-M outperformed PCA at all tested

resolutions. PCA-M outperformed the Haar basis if a reduced set of components

were used. Results were comparable if the full set of multiresolution components

were used.

In chapter 5, PCA-M results are compared to classifiers using a fixed res-

olution PCA (Turk and Pentland, 1991a, Eigenfaces), a hidden Markov model

(HMM) (Samaria, 1994) and a convolutional neural network (Giles et al., 1997).

PCA-M had the lowest error rate.

1.6 MSTAR Experiment

The 9/95 MSTAR Public Release Data (Veda Inc.,

contains synthetic aperture radar (SAR) images of vehicles at various poses (aspect

and depression angles). The estimated aspect angle (Xu et al., 1998) of each

vehicle was used to assign each vehicle to one of the twelve non-overlapping 300

sectors. Within each sector, multiresolution templates were derived for each class.

Chapter 6 shows that PCA-M worked very well in some sectors, but poorly in other

sectors. The overall error rate ( 1(' .) was comparable to other template matching


procedures (Velten et al., 1998), but poorer than information theoretic and vector

support methods.

We conclude the dissertation with some comments and future directions for

further research.


Principal component analysis (PCA) is a technique for representing an image

(or a signal) using basis functions that are derived from eigenvalue decomposi-

tion of the data autocorrelation matrix. This chapter is an introduction to the

eigenvalue problem. A thorough presentation is not possible, but this chapter

should contain the information on principal component analysis that is required by

subsequent discussion of PCA-M.

2.1 The Eigenvalue Problem

Consider a square matrix A of full rank N. A vector w is said to be an

eigenvector of A with a corresponding (scalar) eigenvalue A if

Aw = Aw. (2.1)

The eigenvalue problem (equation 2.1) can be solved analytically by subtracting

Aw from both sides,

(A IA)w 0. (2.2)

Taking the determinant of both sides yields an Nth order polynomial in A called the

characteristic polynomial of A,

det(A IA|) 0. (2.3)

The N roots are the eigenvalues, and each eigenvalue Ak has a corresponding eigen-

vector wk. Each solution to the eigenvector problem is a paired eigenvalue and

corresponding eigenvector (Ak, Wk). From equation 2.1, it should be clear that if w

is an eigenvector and K is an arbitrary scalar, then Kw is also an eigenvector with

the same eigenvalue. Given a non-repeating eigenvector A, the corresponding eigen-

vector is unique except for scale factor K. Without loss of generality, eigenvectors

are usually scaled such that

|w | w= w = 1.

If the eigenvalues Ak are unique (non-repeated), then a unique eigenvector exists for

each eigenvalue. The (normalized) eigenvectors wk are orthonormal.

SWfk -jk- (2.4)

Define the modal matrix W as the matrix whose columns are the normalized

eigenvectors of A,

W = [w1 W2 ... WN] (2.5)

In general, there are N! permutations of eigenvectors. Without loss of generality,

order the eigenvectors such that the eigenvalues are non-increasing,

A, > A2 > ... > AN.

Define the diagonal matrix A as the matrix with a main diagonal consisting of the

(ordered) eigenvalues of A.

A1 0 0

SA2 ... 0

0 0 ... AN.

The eigenvalue problem can be restated in matrix notation,



If equation 2.6 is satisfied, the matrix A is said to be diagonalizable. For this study,

the matrices of interest are real Toeplitz matrices that are alv-- diagonalizable.

The orthonormality condition of equation 2.4 can also be restated,

WWT = WTW = I. (2.7)

A matrix satisfying equation 2.7 is said to be unitary. The modal matrix W is

unitary and is said to diagonalize the matrix A.

2.2 An Example

The properties of PCA will later be discussed more rigorously, but a quick

example should provide an intuitive grasp of some of the properties of PCA.

Consider L = 20 vectors of dimension N = 3 arranged into the data matrix X,

X = [Xi X2 ... X20o

The autocorrelation of the data is estimated by the scatter matrix,

0.1296 0.1372 0.1296

S, = E{xx'} XX' = 0.1372 0.1613 0.1372
0.1296 0.1372 0.1296

Eigenvalue decomposition yields

0.5578 -0.4346 -0.7071 0.4104 0 0

W 0.6146 0.7888 0.0000 A 0 0.0102 0

0.5578 -0.4346 0.7071 0 0 0

Each input vector can be interpreted as a set of coordinates. The standard

basis functions for the input space are normalized vectors in each of the input


1 0 0

e= 0 2 1 63 0

0 0 1

Figure 2.1 (left) plots the data in the input space. The input was constructed to

lie near the diagonal of the input space. The first element of each input vector

was randomly selected in the interval (-1, +1). The second element was the

first element plus Gaussian noise. The third element was set equal to the first

component. By construction, all three elements are equal except for the additive

Gaussian noise in the second element. The dimension of the signal (part of x

excluding the noise) is one. The noise adds a second dimension. Although x is

nominally three-dimensional, the data set can be embedded in a two-dimensional



... ........ ..... MA ....

Figure 2.1: Sample Data

Figure 2.1 (middle) shows the eigenvectors in the input space. Note that the

first eigenvector is the line that best fits the data in a mean square error sense.

Figure 2.1 (right) shows that the eigenvectors can be used as basis functions for

the data. The input coordinates x are rotated to the eigenspace coordinates y by

multiplication with the modal matrix W,

y = WTx or Y = WTX.

The input vectors were drawn from a zero mean distribution, but the sample mean


x E{x}= [0.1476, 0.1649, 0.1476]T

If the data is zero-mean, then the scatter matrix is also an estimate of the auto-

covariance. The shift, z = x x would produce data with a (sample) mean

of zero. It is obvious in this example that the sample mean is a poor estimate

of the true mean. The true mean of the output is zero, but the sample mean is

y [0.2660, 0.0018, 0.0000],

y = E{y} WTx.

It is perhaps too obvious to mention that small sample sizes lead to poor char-

acterization (e.g., statistical parameters) of a distribution. However, many real

applications have a limited amount of data available. Insufficient data will degrade

the performance of any algorithm. The scatter matrix of the rotated vectors is

Sy = A. Since A is diagonal, the components of y are uncorrelated.


The trace of the scatter matrices is invariant under rotation,

S1 = Syy = 0.4206

The trace is a measure of the total variation of the data. A linear transformation

does not affect total autocorrelation. However, a linear transformation can change

the variance of individual components and the cross-correlation between compo-

nents. To see the contribution of each component of the input and output to the

total variance, divide both scatter matrices by the trace,

0.3082 0.3262 0.3082 0.9758 0 0

S' = 0.3262 0.3836 0.3262 0 0.0242 0

0.3082 0.3262 0.3082 0 0 0

The trace of each scatter matrix in equation 2.2 is one, and the elements along the

main diagonal can be interpreted as percentages of total variation. By construction

the variation in the input data is distributed almost equally among all three

components. The normalized output scatter matrix (equation 2.2) shows that the

first component captures 97.1.'. of the variation of the data. The zero eigenvalue

in the third column of A indicates that the underlying dimension of the data is two.

The input data can be reconstructed from the output data,

x =Wy or X = WY.

The input data can be perfectly reconstructed from y even if the third component

is discarded (3 :'. lossless compression). The input data can be reconstructed with

2.5' mean square error from just the first component of y (i7' compression).

The transform did not completely separate the data from the noise. However,

the input reconstructed from just the first output component has an enhanced

signal-to-noise ratio (SNR).

The rotation from the input space X to the eigenspace Y is only one possible

rotation. Although it is more obvious to directly examine other rotations of the

3-dimensional input space, it is simpler to examine rotations of the 2-dimensional

eigenspace. Consider a set of coordinates z derived from rotating the (non-zero)

coordinates in eigenspace through an arbitrary angle a,

zi cos(a) sin(a) y]
Z 2 sin(a) cos(a)) Y

The variance of {zi} is

cOS2( Ca)ly +sin2(a)729.


x x X
... v ...


Figure 2.2: Original (left) and Scaled (right) Data

Figure 2.2 (left) shows the standard deviations of the two non-zero components

of the output. The ellipse in figure 2.2 (left) shows the standard deviations of the

data projected along arbitrary unit vectors. Among all possible sets of projections,

the variance of an individual component is maximized and minimized when the

input is projected against the two eigenvectors wi and w2, respectively. If

the second component is scaled so that the variances of the two components are

equal (Figure 2.2, right), it is not possible to change the component variances by


2.3 Principal Component Analysis

The Karhunen-Lotve Transform (KLT) uses eigenvalue analysis to decompose

a continuous random process, x(t), instead of the random variable discussed in the

preceding sections of this chapter. The discrete equivalent developed by Hotelling

is called Principal component analysis (PCA), but is also often referred to as

Karhunen-Loeve Transformation. A nice discussion is found in Jain (1989, pp.


Let x be a discrete zero-mean, wide-sense stationary process. Let XN(n)

denote a block of length N,

XN(n) = [x(n), x(n 1), .. x(n N + 1)].

The (N x N) autocorrelation matrix Rxx is positive-definite and Toeplitz (doubly

symmetric and constant along the diagonals) (Kailath, 1980). The eigenvalue

decomposition of Rxx is

Rxx = E{xN(n)xN(n)T} W A W-1 W A WT

Not all matrices can be diagonalized, but symmetry is a sufficient condition. Since

Rxx is symmetric, it has N orthogonal eigenvectors even if the eigenvalues are

not distinct. PCA is an expansion of XN(n) using the eigenvectors of Rxx. Any

N-length block of x(n) can be represented by

XN(nf) Zx(n k+ 1t^, Z y, Yk(n)Wk.
k=1 k=1
The PCA expansion can be more compactly expressed by

XN() = WywN(), with inverse, yN(n) = WTxN(), (2.8)


YN(n) = [yi(n), y2 (n), y(n)l.

Equation 2.8 can also be interpreted as linear mappings from an N-dimensional

space spanned by the standard basis vectors, ek, to an N-dimensional space

spanned by the eigenvectors, wk. The autocorrelation matrix of yN(n) is

RyY = E{yN(n)yN(n)T} = E{WTXN(n)XN(n)W}

= WTE{XN(U)XN(n)}W


A represents the correlation matrix between the components yk. The components

are uncorrelated and the variance of each component is simply Ak. Interpreting

variance as signal energy, trace-invariance under similarity transformations equates

to conservation of energy. The original signal can be perfectly reconstructed by the

inverse transformation,

x(n) yi(n)

x(n -1) 2(n) (2.9)

x(n N + 1) YN(n)

Decorrelation is desirable for analysis since redundant information between

components is minimized. Reconstruction is often performed with only the first

M < N components for two main reasons,

1. Compression Using only the first M components achieves a M/N compres-

sion ratio with minimum 12 reconstruction error.

2. Noise Reduction For signals with additive noise, A is interpreted as a signal

to noise ratio (SNR). Reconstruction of x(n) using the high SNR components

retains the signal components and excludes the noisy low energy components.

2.4 Deflation Techniques

All the eigenvalues of a matrix of rank N can be found analytically by solving

a polynomial of order N,

det(A IA) = 0.

If only the first few eigenvectors are of interest, then one can either use an analyti-

cal approach such as Singular Value Decomposition (SVD) (Haykin, 1996), or find

an approximation by using a deflation technique. Since the one of the eigenvectors

maximizes variance, a vector is found by choosing an arbitrary vector and itera-

tively modifying the vector to increase the variance. Once found, the component

corresponding to the eigenvector is removed and the input is said to be deflated.

The next eigenvector is found by repeating the process on the deflated data. From

the basic eigenvalue statement (equation 2.1),

Ak W Awk (2.10)

Consider an arbitrary vector v and an associated scalar K,

S= vTAv (2.11)

The eigenvector corresponding to the largest eigenvalue maximizes

A1 max(t) max {vTAv} (2.12)

Equation 2.11 associates a scalar with each of the vectors in the span of A. The

vector associated with the maximal scalar is an eigenvector of A. More specifically,

1. Set A1 = A.

2. Use a gradient based iteration (or power method) on w to maximize A.

3. Set wi = Wpt and A1 = A(wpt).

4. Remove the projection of wi from A1.

A2 = A1 wiAiwT

5. In the subspace spanned by A2 the optimal solution to equation 2.12 is

now {A2, w2}. Repeat procedure until the desired number of solutions

(eigenvectors) is obtained.

2.5 Generalized Hebbian Algorithm

Eigendecompositions can be analytically computed by many algorithms (Golub

and Loan, 1989). But here we seek sample-by-sample estimators of PCA conducive

to on-line implementation in personal computers. There is rich literature on linear

networks to evaluate PCA using gradient descent learning rules (Oja, 1982; Haykin,

1994). Being adaptive, the networks take time to converge and exhibit rattling;

that is, network values fluctuate around the "true" values. Hence, these networks

should not be taken as substitutes for the analytic methods when the goal is to

compute eigenvectors and eigenvalues. However, in signal processing applications

where we have to deal with nonstationary signals and we are interested in feature

vectors for real-time assessment, the "r i i- PCA is very often adequate and saves

enormous computation. In fact, the algorithms about to be described are of O(N)

(size of the space), instead of 0(N2).

Haykin (1994, pp. 391) states that PCA neural network algorithms can be

grouped into two classes:

1. reestimation algorithms only feedforward connections,

2. decorrelating algorithms both feedforward and feedback connections.

Reestimation algorithms use deflation. The generalized hebbian algorithm (GHA)

is a reestimation algorithm that uses a single (computational) l--r linear network

to perform PCA on a process XN(n). A nice presentation of GHA can be found in

Haykin (1994, pp. 365-394).

x(n) xl

Figure 2.3: GHA Linear Network

Let W denote the (M x N) matrix of network weights and let wk denote

a column of W. Figure 2.3 shows the network that extracts the first M < N

principal components of the random vector XN(n). The equations for figure 2.3 are


yM (n) WTXN(n) 2 x(n), M < N.


To adapt the weights, GHA performs three operations,

1. adaptation of each column wi to maximize variance (energy),

2. adaptation between columns wi to remove the projection from previous


3. self-normalizing to keep weights at unit norm.

The equations for adapting the weights are

yj(n) = tr,..()xi(n), calculate output (2.13)

Au',,(n) = yj(n) Xidn) i, (n)yk(n) ,update weights (2.14)

The update (equation 2.14) has two terms,

r(y j()xi(n) yj(n)r,, ,(n)yj(n)), maximize variance and normalize (2.15)
lyj(n) "I (n)yk(n), remove projections (2.16)

Equation 2.15 has two terms. The first term is the classic Hebbian formulation

and has been called the activity product rule (Haykin, 1994, p. 51). The problem

with the classic formulation is that the magnitude of the weights increases. Still,

the classic Hebbian algorithm is elegant in its simplicity and power. The second

term of equation 2.15 is a self correcting adaptation (Sanger, 1989). Equation 2.15

by itself reestimates and normalizes the weights for variance maximization. Equa-

tion 2.16 subtracts the projections of previous (higher energy) components. This

term is introduced to keep the weights to different output nodes from converging

to the same eigenvector. Equation 2.16 also shows that convergence of a principal

component is dependent on the convergence of higher energy components. While

there is no inherent ordering to the eigenvectors, the implementation effectively

creates a sequence of dependencies in the convergence of eigenvectors.

The procedure is adaptive and thus suitable for locally stationary data. Even

if the process is not stationary, the mapping WN will give perfect reconstruction

(WNXN is invertible). PCA is optimal for minimum 12 reconstruction using fixed

length filters on a stationary signal. For optimal 12 compression and reconstruction,

the first M (out of N) components of yN(n) are used.

Using M < N components yields a compression of (N M)/N. Further com-

pression is usually obtained by using fewer bits to encode lower energy components.

Y (k) [yo(k), y(k), yM-i(k)] = WNxMXN(n). (2.17)

and the reconstruction (denoted by subscript c)of the original signal is

,(k) = [(WNxMWNxM) 1WNxM] y(k). (2.18)

2.6 Eigenfilters

The direct interpretation of equation 2.8 is that each of the yk(n) is a projec-

tion of XN(n) on an eigenvectors ii-, Each projection is found by taking the inner


yk(n) = ("; xN(n)) = W XN(n) (2.19)

The underlying time structure of XN(n) allows a filter interpretation of PCA,

XN() = [x(n),x(n- 1),.- ,x(n N + 1)T.

Rewriting equation 2.19,
yk(n) = a=0

Equation 2.20 is a convolution sum of x with a filter impulse response i,', (n). FIR

filters whose coefficients (impulse response) are derived from eigenvalue analysis are

called eigenfilters (Vetterli and Kovacevic, 1995). The collection of filters, {"' }i can

be interpreted as an analysis bank. If the principal components,

y(n) [yi(n), y2(n), YN(n)T.

are then processed to reconstruct the original input XN(n), the reconstruction

filters form the synthesis bank. Eigenfilters have several key properties:

1. Both the analysis and synthesis banks of a linear network use finite impulse

response (FIR) filters,

2. Since the autocorrelation filter is Toeplitz, the eigenvectors are all either

symmetric or antisymmetric,

3. Since the modal matrix is unitary, the synthesis filters can be implemented

easily (transpose of the analysis bank),

4. Since the components are uncorrelated, the reconstruction from each compo-

nent is independent of other components.

The remainder of this section illustrates the decomposition of simple test signals.

2.6.1 Low-Pass Test Signal

slhOrder M Filer OMIpu

0 0.5 1 1.5 2 25 3 35 4

03 0.6
0.2 0.4

0 -- 0 -
knpls Response Dekby

Figure 2.4: Low Pass Test Data

A test signal (Figure 2.4) was generated using a 5th-order moving average

(V A) filter driven by white Gaussian noise.

5 5
x (n) = 2 2-u (n k) # X(z) (2z)-k U (Z
k=0 k=0

The transfer function is

U( (z)

1i K 2 (I 3 (4 2i1]
1 + (+ + + +
2 2z 2z 2z 2z2

The 5 zeros are evenly spaced,

1 e1k7\)
z- exp j k [1,2,3,4,5]

The first six autocorrelation coefficients are
Rxx(k, 1) 0.3333 0.1665 0.0830 0.0410 0.0195 0.0078 1

Consider the autocorrelation matrix formed by the first six autocorrelation coeffi-

cients. The eigenvalues are

Ak 0.7952 0.4811 0.2811 0.1859 0.1388 0.1174

The corresponding eigenvectors (columns) are






































Figure 2.5 shows the eigenfilters generated from the eigenvectors shown in equa-

tion 2.21. Notice the filter bank structure as we described above, that appears in a

self-organizing manner; that is, no one programmed the filters. It was the data and

the constraints placed on the topology and adaptation rule that led to a unique set

of filter weights.

The bandwidth of the filters is dictated by the size of the input delay line

(I/NT). This illustrates why PCA defaults to a Fourier transform when the

observation window size approaches infinity.


2 4 6 8


n 2 0_n.4

4L0-, A

49 n

o 0.z 0.4



n n4
- 0 0.2- -- OA-

Figure 2.5: Low Pass Data PCA

The 1/f energy distribution can be observed by normalizing the eigenvalues

by the trace. Thu sum of the diagonal elements is invariant and can be interpreted

as total energy. An eigenvalue divided by the trace can be interpreted as the

percentage of the total signal energy belonging to the outputs of the corresponding

eigenfilters. Figure 2.5 and equation 2.6.1 show that the eigenfilters are ordered by

passband center frequency and output energy.

Ak = 0.3977 0.2406 0.1406 0.0930 0.0694 0.0587

Equation 2.6.1 provides some upper bounds for compression. These eigenfilters are

expected to be optimal for any signal generated by the filter in equation 2.6.1.

2.6.2 High Pass Test Signal

The high pass signal was simply a low to high pass conversion of the low pass

signal. The zero at w = r was moved to w = 0. Figure 2.6 shows that PCA-M

adapted to order the basis functions by energy. The time-frequency resolution

trade-off can be observed by looking across a row and seeing that the shorter filter

results in a wider frequency passband.



0.5 2 4

-0.5 --------
3 J. & Sq

FM a



01 A

0 0.2 0.4



0 0.2 A

Figure 2.6: High Pass Data PCA

2.6.3 Mixed Mode Test Signal

The high pass signal and low pass signal were added together for the mixed

mode signal. Figure 2.7 shows that PCA-M adapted to order the basis functions

by energy.

Tme FMoa FPhme

05 FP 5 1, nad


S 4 0 0 2 0 .4 0 02 0a

Figure 2.7: Test High and Low Frequency Data

2.7 Key Properties of PCA

Eigendecomposition has a strong mathematical foundation and is a tool used

across several disciplines. Eigenvalue decomposition is an optimal representation in

o ,i:v l--, Key properties of PCA include,

1. The elements of A (eigenvalues) are positive and real, and the elements of W

(eigenvectors) are real.

2. Aside from scaling and transposing columns, W is the unique matrix that

both decorrelates the xN(n) and maximizes variance for components,

3. Since W is unitary, W-1 = WT and reconstruction is easy. The mapping is
norm preserving and reconstruction error is easily measured.

PCA has several criticisms:
1. The mapping is linear. The underlying structure for some applications

may be nonlinear. However, a nonlinear problem can be made into a linear

problem by projection to a higher dimension.
2. The mapping is global. Each output component is dependent on all the

input components. If important features are dependent on some subset of


input components, it would be desirable to have output components that are

localized to the appropriate input components.

3. PCA components resemble each other. Approaching the transform as an

eigenfilter bank provides some insight. FIR filters have large sidelobes.

Orthogonality is obtained by constructive and destructive combinations of

sidelobes. It seems typical that the low frequency component is so large that

the sidelobes do not provide sufficient attenuation.


A discussion on multiresolution should start with time signals and the classic

time-frequency resolution trade-off. Assume that a recording session produces

some (real) analog signal x(t). The analog signal x(t) is sampled at some uniform

interval Ts to produce a discrete time signal x(nTs). For convenience, normalize

Ts to unity so that x(n) x(nTs). The session x(n) is usually divided into

smaller observation windows of duration N (= NTs). The choice for N fixes the

resolution of the analysis. Denote a block of data of length N by xN(n). Consider

the Discrete Fourier Transform (DFT) of xN(n),

XNf(n) )- XN(k)

The DFT transforms a vector xN(n) with N real components to a vector XN(k)

with N complex components. XN(n) and XN(k) are the time-domain repre-

sentation and frequency-domain representation, respectively, of the signal. Each

component of XN(k) is a linear combination of the elements of XN(n). That is,

each component of XN(k) is feature of the entire input xN(n) and can be lo-

calized in time to NTs. The frequency resolution of the output is 1/NTs. The

input XN(n) has high-resolution in time (Ts), but no resolution in frequency. As
N increases, the output XN(k) loses resolution in time and gains resolution in

frequency. The time and frequency resolution of the output are fixed by the single

parameter N. Ideally, there might be an optimal choice for N, the observation

window length. For example, N would be matched to the duration of key features

in the signal. Sometimes, however, it can be difficult to make a judicious choice for

N if,

1. key features are not known,

2. the optimal length is different among the key features.

Under such situations, multiresolution is an alternative to fixed resolution repre-

sentations. The DFT is a fixed resolution representation since each component

of XN(k) has the same resolution. In the context of the above discussion, a mul-

tiresolution representation would be a representation XN(k) whose elements have

varying resolution. More generally, multiresolution is the representation of a signal

across several resolutions.

3.1 Two Notes Example

The two notes example is now found in many standard texts on time-frequency

techniques; this section is an abbreviated version of Kaiser (1994). Consider a

signal composed of "notes" of single frequency, and the problem of detecting the

number of notes that occur in a time interval. Figure 3.1 Kaiser (1994) shows a

signal consisting of two single frequencies that occur at different times.

Figure 3.1: Two Notes

Theoretically, the two notes can be separated by using either frequency or

time information. However, a frequency representation has no time resolution and

limited (by the observation window length) frequency resolution. Unless the notes

are sufficiently separated in frequency, a standard Fourier transform of the signal

will not resolve the two notes. Certainly, in the extreme case where the two notes

are at the same frequency, time domain information is necessary to isolate the

notes. A Fourier transform cannot take advantage of the time information to help

resolve the individual notes.

Similarly, the time representation has no frequency resolution and limited (by

the sampling interval) time resolution. If the two notes are not well separated in

time, but well separated in frequency, time-domain analysis cannot separate the

notes. The corresponding extreme case is that if the two notes overlap in time,

frequency domain analysis is necessary to resolve the two notes. Clearly, it is

desirable to use both time and frequency domain information.

One of the first (combination) time-frequency techniques is the Short Term

Fourier Transform (STFT) or windowed Fourier Transforms (Porat, 1994, 335-

337). The signal x(n) is divided into subintervals of some fixed length. The

essential approach is that instead of using a single transform X(k) over the entire

time interval, a Fourier transform is taken over each subinterval. The results

are displ i' ,I in a waterfall plot with time and frequency forming two axes and

the magnitude of the frequency on the third axis. The waterfall plot provides

information on how the frequency content of a signal changes over time. Discussion

of implementation details, such as window functions and overlapping windows,

can be found in Strang and Nguyen (1996); Vetterli and Kovacevic (1995). The

relevance to this work is that a signal can be represented using both time and

frequency using a fixed-resolution (constant block length N) technique such as the


Again, an important consideration for a fixed-resolution analysis is choosing

a ;ood" window length. If the window length is too long, time resolution is lost.

If the window length is too short, frequency resolution is sacrificed. Figure 3.1

shows that either time or frequency resolution (or both) may be critical for a

given application. The transition from fixed resolution to multiresolution can be

performed with iterated filter banks that will be further discussed in chapter 4.

3.2 Quadrature Filter and Iterated Filter Bank

An iterated filter bank uses variable length windows to provide high frequency

resolution at low frequencies and high time resolution at high frequencies. A dyadic

filter bank uses a pair of filters to divide a signal into two components. The two

filters are designated Ho(z) and Hi(z) (Figure 3.2).


Figure 3.2: Quadrature Filter

The filters must be chosen to divide a signal into orthogonal components that

can later be used to perfectly reconstruct the original signal. Familiar choices for

dyadic filters include,

1. simple odd-even decomposition,

2. quadrature modulation filters in communications (sin and cos components),

3. quadrature mirror filter (Hi(z) = -Ho(z)) (Strang and Nguyen, 1996, 109).

The quadrature mirror filters Ho(z) and Hi(z) are constructed as low-

pass and high-pass filters, respectively. A dyadic iterated filter banks is formed

by passing the output of Hi(z) or Ho(z) into another identical filter bank

(Figure 3.4). A series of cascaded low-order filters is equivalent to a single high-

order filter. Time resolution decreases and frequency resolution increases with the

number of low-order filters in the cascade. An intuitively appealing approach is to

iterate the low frequency component; this approach will be discussed in more detail

in the next section. The rationale is that low frequency components do not require

high time resolution since low frequency implies slow changes. The quadrature

mirror filter was an early implementation; a more recent approach is the use of

wavelets (Strang and Nguyen, 1996).


3.3 Wavelets in Discrete Time

Mathematically, passing a signal through a filter and downsampling can be

presented as projection against basis functions. The design of filters is equivalent

to finding appropriate basis functions. For wavelet analysis, a function is chosen

as the mother wavelet. The basis functions at each level correspond to some

dilation (scaling) of the mother wavelet. Within a level, all the basis functions

are non-overlapping time-shifted versions of the same function. The scaling and

shifting allow time resolution over intervals less than NTs. It is desirable that basis

functions from different levels are orthogonal, but linear independence is sufficient.

A standard approach to multiresolution is to use a cascade of 2-bank (high-

pass H1 and low-pass Ho) filters (Figure 3.3). The outputs of the analysis filters

are downsampled by a factor of 2, then the low-pass output is cascaded into

another analysis bank of 2 filters. The process is repeated for the desired number of

levels. The reverse operation takes place at the synthesis bank {Gi}.

r( H,-HIGH ^ 2J 21 G, i
LJ LOW (s) ""vD,, i a21)2 G,


Figure 3.3: Discrete Wavelet Transform with 2 Levels

Again, the rationale for choosing this sequence of operations is that high

frequency components can change quickly, hence the highest frequency component

should be sampled most often. The lowest frequency component changes the least

frequently and can be downsampled several times. The iterated tree structure can

be implemented as a parallel structure (Figure 3.4).


Figure 3.4: Equivalent 2" Filter Bank DWT Implementation

3.4 Haar Wavelet

The Haar wavelet uses the simplest set of basis functions. The low pass filter is

ho (k) = [1, 1] and the high pass filter is hi (k) [1, -1]. The matrix W for

a two level decomposition is shown in equation 3.1. For the 2-level Haar example,

the input is divide into segments of length N = 4 and,

0 0 1
1 1 1( ( 1 (
V4(k) WN. N) ()3t

S1 11 1

The matrix W is invertible so perfect reconstruction is possible. Since W is or-

thonormal; that is, W-1 = WT, no further calculations are needed for constructing

the synthesis filters. The Haar wavelet has the worst frequency resolution; other

basis functions since Morlet) may be more appropriate depending on the desired

trade-off between resolution in time and frequency.

3.5 A Multiresolution Application: Compression

A standard multiresolution application is representing a signal (image) at

different scales. Figure 3.5 shows an example of a Haar decomposition (not the

Haar transform) that is dyadic along each dimension.

Several Iterations of Haar Decomposition

Original Iteration 1 Iteration 2 Iteration 3 Pyramid

Figure 3.5: Three Levels of Decomposition on the Approximation

The Haar basis vectors are el = -[11] and 62 = 21 1]. For 2-dimensions,

the basis vectors are 1el1, 6162, 6261, 6262 (the 2-D bases are separable and identi-

cal). An image is partitioned into non-overlapping (2 x 2) blocks and each block

is projected against the basis vectors. Using non-overlapping blocks is similar to

a polyphase filter and is more computationally efficient than downsampling the

projections. The first projection is simply an average of each (2 x 2) block and

gives a good compressed approximation of the original image. The other three

detail images have the information needed (in addition to the approximation)

to perfectly reconstruct the original image. That is the detail signals have the

information needed to correct the reconstruction from the approximation. This is

slightly different than the pyramidal approach that provides a single correction at

the lower scale. The procedure can be repeated on the approximation to provide

an approximation at the next level of compression. All the information for creating

the first (level 1) approximation is contained in the original (level 0) image. Simi-

larly, an approximation at any level only uses information from the approximation

at the previous level. Lower level approximations have more information (spatial

resolution) and less compression than high-level approximations. Clearly, there

is less data to process if classifications can be performed with compressed images

(smaller matrices).

Our main interest in multiresolution is in deriving multiscale localized features

for classification. Inputs presented at different scales lead to extraction of features


at different scales. The next chapter continues the discussion of multiresolution

with more focus on deriving and using multiresolution features in PCA-M.


4.1 Definition of PCA-M

In this section, we formally treat localization with PCA. PCA is briefly dis-

cussed in a context of representation and feature extraction. PCA-M is presented

as PCA with localized outputs. The localized outputs of PCA-M are structured to

provide a multiscale representation. The section ends with a formal definition of

principal component analysis with multiresolution.

4.1.1 Localization of PCA

Consider a set of K training images,

STRAIN = {01(n), - ", (n>),* OK(n)}.

The pixels of each image ', (n), are indexed by n E S. Define,

Xk (n) i. (n) Y (n).

If the training images already have zero mean, then xk(n) = /, (n). Principal

component analysis has two stages, training and testing (verification). The first

stage derives a set of eigenvectors and eigenvalues, (,'. (n), Am), for the set of

training images. Denote the set of eigenvectors by ',

'I = {Qd(n),...,,, (n),... ,,' ,(n)}.

As discussed earlier, the number of non-zero eigenvectors ,M, is the minimum of

the number of exemplars, K, or the number of components of each exemplar, N.

The training stage of PCA finds a mapping from a set of training images to a set of


@TRAIN (4.1)

Equation 4.1 emphasizes that the eigenvectors and eigenvalues are characteristics of

a set of input images. Since the input images and the eigenvectors have the same

spatial index, the eigenvectors are also called eigenimages. Once trained, PCA

uses the eigenimages to decompose each new input onto a set of components. The

second stage of PCA is a mapping from an input image to a set of M output scalar


Xk(( ) > { 1, ... Y ... M}k. (4.2)

Equation 4.2 shows that each input has a unique set of outputs (components).

Since the association from input to output is usually implicit, notation can be

simplified. Rewrite equation 4.2,

x(n) {(i,...,y m,-... ,M}- (4.3)

Each component is global since its value is calculated using all the pixels of the

input image,

rn < x(n),, (n) > x(n)', )n). (4.4)

The dependency of a global output on a specific input pixel is seen by differentiat-

ing equation 4.4,

V (n S): (y) = (n). (4.5)
a x(n)

Output localization implies that an output is dependent only on a local set of

pixels. Consider a subregion of the pixels, A C S. A local output could be specified


a (YLOCAL) = (n), (4.6)
Sx ((4.)
SX( 0, otherwise.

Localization could arise naturally if some of the eigenvector components were zero,

((n) = 0. PCA-M forces localization by explicitly manipulating equation 4.6.
Definition 4.1.1 Consider a set of N-dimensional inputs, x(n). The components

of each input are indexed by n E S, where S = [1,..., N]. Let A be a subset of

S such that A corresponds to a localized time interval ifx(n) is a time .:,il or

to a local region ifx(n) is an image. Denote the subregion of an input by xA(n).

Let wA(n) be the corresponding eigenvector (eigenimage). A localized PCA-M

output is

YA =< xA(n),wA((n) > xA(n)wA(n). (4.7)

Q (x) O( 6 (x))

Figure 4.1: PCA-M for Classification

Before defining a localized eigenvector, we want to restate our goals for PCA-M so

that the design choices will be understood. First, both PCA and PCA-M provide

representations, but this does not mean that '1! i, can be automatically used

for feature extraction. In fact, PCA-M components are not directly constructed

for optimal discrimination. So PCA-M should be understood as a preprocessor

for classification that constrains the scale and locality of subsequent features

(Figure 4.1).

Second, given that PCA-M is not the ideal feature extractor, care should be

taken that no information is lost. Since PCA-M cannot identify whether some

information is needed for discrimination, all information should be propagated to

the classifier. PCA-M should (and can) provide at least a complete representation

of the input in the space of the training set. That is, some information from

non-training exemplars is ahb--,v- lost since the eigenspace is a subspace of all

possible inputs. If the application is classification, an overcomplete representation

(redundancy) is not only acceptable, but may be essential. Nonetheless, we design

PCA-M with a minimum amount of redundancy. It is usually much easier to add

redundancy than reduce redundancy.

Finally, in allowing each output to have inputs of varying geometry (scale and

shape), localized eigenvectors are not guaranteed to be orthogonal. Since I'. :,- are

not orthogonal, it is an abuse of nomenclature to continue to refer to the localized

eigenvectors as i;, i ,. I tors". Since the PCA-M network weights converge to the

localized eigenvectors, we will henceforth call them PCA-M weights. Orthogonality

has a direct impact on designing PCA-M since many implementations of PCA

involve deflation in some form. While it is not ahl--,v- apparent from the network

architecture, there is an inherent sequencing of calculations. As the weights for

the first output are calculated, the weights and output are used to reconstruct an

estimated input. The input is deflated by the estimate, and the deflated inputs are

used as "effective inpl for calculating the weights of subsequent outputs.

4.1.2 A Structure of Localized Outputs

Definition 4.1.1 is not an implementable definition for a localized PCA-M

output since the corresponding localized eigenvector wA(n) is not yet defined. If

all outputs are supported by the same region, the eigenvectors are found using

standard PCA with the localized input, xA(n), as the new input. If each output

is supported by a separate non-overlapping region, the localized eigenvector is

found by treating each region as a separate standard PCA problems. When the

outputs are localized to overlapping regions of varying geometry (size and shape),

the meaning of orthogonality becomes unclear. That is, eigenvectors that derived

from the same subregion of the training images are orthogonal. Also, eigenvectors

that are each derived from non-overlapping subregions are orthogonal. However,

eigenvectors that are each derived from partially overlapping subregions are not

generally orthogonal

Definition 4.1.2 Consider a set of N-dimensional inputs, x(n), with an associated

set of M eigenvectors, b(n). For both the inputs and eigenvectors, components are

indexed by n E S, where S = [1,..., N]. PCA-M is an iterative procedure:

1. For the first eigenvector, partition S into R(1) subregions such that

U = s, (4.8)

2. treating each subregion as a separate .:,u ,;. .,l;., problem, calculate the first

eigenvector for each region,

3. 1. fl,/.': each region of the input, and use the 1. fIrl ,l input as the effective

input for subsequent calculation.

The geometry of the partitions can hI,,.- for each iteration. The number of

iterations to span the input space will not exceed M.

A global scalar output ym is replaced by an array of R(m) localized outputs,

Ym = [Y1 ... Yr(m) ... YR,(m)], (4.9)


Yr(m) = Xk(n)wm(n), where, A = ,Sr(m). (4.10)

Each array of outputs, ym, is a compressed version of the input image. A fine

partitioning (R(m) large) corresponds to a fine resolution for ym. If the partitions

are identical for each array of outputs ym, the representation has fixed resolution.

Analogous to equation 4.2, the mapping for PCA-M is

Xk(f) ) {yn(r(m))}k. (4.11)

The PCA network trains M sets of weights corresponding to full eigenimages. The

PCA-M network has ( m1 R(m)) sets of weights corresponding to partitioned

eigenimages. The PCA-M network replaces each scalar global output, ym, with an

array of localized outputs, ym. Constraints on the weights of the GHA network

allow control of the partitioning (number and composition) for each output.

Control of the structure of the partitions sets the localization and scale for the

PCA-M network.

4.2 The Classification Problem

The applications presented in the next chapter use PCA-M for classification.

This section restates the basic classification problem so that PCA-M can be

discussed in the context of feature extraction.

Given a choice of several classes Ck and some data x,, a basic classifier assigns

the input to one of the classes.

X, Ck.-

Equivalently, the class index k is a function of the input x,

k =g(xn). (4.12)

For example, a classifier could identify that photograph x, belongs to person k.

Mathematically, designing a good classifier is finding a good mapping function

g. Each class of data contains features; that is, characteristics that are useful

for classification. Ideally those features, 0(x), could be separated from the I -.-

less" characteristics in the raw data. Presented with only the pertinent data for

classification, the task of the classifier becomes easier.

k f(O(x)) g(x). (4.13)

Worsening the resolution of the inputs is intended to remove details that are not

needed for classification, while retaining coarse features that are needed. In general,

too fine a resolution retains unneeded details, while too coarse a resolution discards

critical information. A multiresolution approach provides a structure for control

of the detailed information. PCA-M can provide multiscale representations of an

exemplar to allow extraction of features at different scales (section 4.3.3). PCA-M

can also be used to directly localize a global eigenimage (section 4.3.4). A search

for features in the universal space may not be feasible, but the eigenspace may

be too restrictive for feature extraction. PCA-M provides a space richer than

eigenspace (figure 4.2), but still keeps the dimensionality under control.



All Features

Figure 4.2: PCA and PCA-M in Feature Space

4.3 Complete Representations

The section on eigenfaces is a classical application of PCA to images. PCA can

be conceptualized as PCA-M with the coarsest partitioning (equation 4.8),

Vm : R(m) =1.

The identity map is presented for contrast as PCA-M with the finest partitioning,

Vm : R(m)= N.

The iterated filter bank and dual decomposition have milder restrictions on the

partition sizes but implement constraints based on stationarity. The identity map

can be considered as PCA with fully localized outputs. The next section is an

example of PCA with global outputs. The subsequent section on iterated filter

banks describe a structure with outputs of varying localization.

4.3.1 Eigenfaces

The theoretical side of standard PCA has been discussed earlier. The eigen-

faces section presents an implementation of standard PCA using the generalized

hebbian algorithm (GHA) presented in section 2.5. Since the theory and struc-

ture has been discussed earlier, this section presents experimental results. The

PCA decompositions presented in this section are used for comparison to PCA-M

decompositions (in the next section) of the same set of data.

The GHA network is a simple, flexible and efficient way to implement PCA.

Minor structural modifications lead to multiresolution. This section presents the

GHA Network used for standard PCA. Subsequent sections then step through

several modifications. With each modification, we present:

1. the network structure,

2. the changes in the representation that arise from the modifications,

3. an example of the representation using one of the faces drawn from the ORL


Figure 4.3 shows K = 10 pictures from the ORL database. These ten pictures

{ '}(k 1...10) are all of the same person and cropped to R = 112 rows and C = 92

9v~"S: Le

Figure 4.3: Raw Images from ORL Database

Each input ', = (n) is described by two indices. The index k identifies the

specific exemplar, and the index n specified the specific component of -, For time

signals, n is an one-dimensional index and the components are time samples. For

images, n specifies the row and column of the image's pixels; n can be either a

two-dimensional vector index or an one-dimensional index to a rasterized version of

the image. The class average o4 is formed by averaging the ten faces.

1 K


After subtracting the average from each face, the residual images {xk}(k= 1..10) are

shown in figure 4.4.

Figure 4.4: Residual Images for GHA Input

The residual images are each presented at the input -1-. r of a network similar to

figure 4.5(left).


0 O *0---

X is a coln of


Figure 4.5: Two single la r networks with: each output driven by all inputs (left),

and each output driven by single input(right)

The number of input nodes for the GHA network is determined by the dimensions

of the exemplars, 10304 (112 x 92). The number of non-zero output nodes

can be no more than the number of linearly independent inputs; for this example,

there are ten output nodes. The network has a single computational 4.16.r with

every input connect to each output; each output has 10304 associated weights. For

convenience, construct the input matrix X X(k, n) such that each input image
Xk is a column of X,

As each exemplar is presented at the input I-u-, r, the output is calculated and the

weights are updated, using equations 4.15 and 4.16.

yj(n) = t,.. ()xi(n), calculate output (4.15)

Au',.(n) = ryj(n) Xi(n) i (n)yk(n) update weights (4.16)

Equation 4.16 shows that each output is affected by prior outputs. This depen-

dency is shown in figure 4.5 (left) by dashed lateral connections between output

nodes. Since the training set had ten linearly independent images, the network

converges to ten sets of weights, {wk}k=- ..10. The (10304 x 10) transformation

matrix WA is constructed by setting each wk as a column of WA,

WA [w w21 ... wiol. (4.17)

The weights are eigenvectors that will be described by w = wk W k(n) depending

on the context. The index k identifies the corresponding output nodes of the GHA

network. The ordering for GHA output nodes orders the eigenvectors such that the

corresponding eigenvalues are in decreasing order. Each wk has 10304 components

that can be arranged in an (112 x 92) array corresponding to the positions of

the associated input components. When the eigenvectors are arranged as a two-

dimensional array, the eigenvectors are also called eigenimages. The eigenimages

resemble the input faces. Because of this resemblance, the eigenvectors are also

called eigenfaces.

Figure 4.6: Eigenfaces from GHA Weights

Possibly the widest application for PCA is in signal representation and reconstruc-

tion. The training inputs can be perfectly reconstructed as a linear combination of

eigenfaces. The quality for reconstruction of other inputs depends on the degree

that the exemplars are representative of other inputs. The eigenvalues are an

indication of the reduction in reconstruction MSE. Table 4.1 shows that the inputs

can be reconstructed with about 10' reconstruction error using just the first four

eigenfaces and that the contribution from the last eigenface is negligible.

Table 4.1: Normalized

The eigenface expansion provides reconstruction for the network inputs. The inputs

are the residual images (exemplars minus the average image). The reconstruction

of the original exemplars requires adding back the class average. Because the PCA

analysis characterizes residual images, it is expected that the average image is

poorly reconstructed by eigenvalue expansion. This is especially interesting since

most of the energy is in the average image (table 4.2). In a classification problem,

there is an average image for each class as well as a single average for all images

across all classes. The high energy in the individual class averages ,I-.-:- -1- that

each class is characterized by the variation of its class average from the overall

average, not by the variations in the individual images.

Table 4.2: Energy Distribution of Exemplars

Energy in Energy in Percent in
Exemplar Average Component Component
X1 0.2970 0.0150 4.81
X2 0.3614 0.0212 6.,',
X3 0.3130 0.0166 5.3-2
X4 0.3454 0.0154 4.95',
x5 0.3492 0.0114 3.' ,.,'
X6 0.3419 0.0168 5..!,'
x7 0.3155 0.0099 3.19',
Xs 0.3034 0.0122 3.92'
X9 0.3243 0.0153 4.9 :'
xio 0.3126 0.0144 4.1 ,'.
Avg 0.3264 0.0148 4.7 .'.

AI 23.9i'. A6 5.8 :' ,

A2 21.7 !' A7 4.0"'-

A3 19.-.' As 3.6 !' ,

A4 o10.,, A9 3.0 :' ,

A5 7.92'- A10 0.011' ,


4.3.2 Identity Map

Some properties of the standard GHA network are more evident when con-

trasted to another network. This section provides a subjective discussion of the

following modifications to the standard GHA network:

1. the number of output nodes,

2. the dependencies (lateral connections) between output nodes,

3. the number and selection of inputs to an output node.

Since the key modification of PCA-M involves controlling the inputs scale to

an output node, it is instructive to examine the extreme case of a single output

for each input. This structure can be realized with a network that has all inputs

connected to each output, but with the constraint that only one weight is non-zero.

The network is shown in figure 4.5 (right) showing only the non-zero connections.

The inputs (figure 4.4) have not changed, so there are still 10304 input nodes

corresponding to the dimensions of the input exemplars. By design, there are also

10304 output nodes to provide an output for each input. Each output node has

the same spatial localization as the corresponding input node. This architecture is

actually 10304 independent GHA networks operating independently so the number

of outputs does not exceed the number of linearly independent exemplars.

Without loss of generality, stipulate that the final weights are normalized.

Construct the (10304 x 10304) transformation matrix Wi,

W1 = [W W2 ... 10304] (4.18)

It should be evident that the transformation matrix W1 is a (10304 x 10304)

identity matrix.

The transformation matrices from equations 4.17 and 4.18 provide insight to

several key consequences of partial connections.

1. Span of the output space: Standard PCA has an output space that

is a small subset of the image space. The output space of the identity

transformation is the entire space of (112 x 92) images. A feature that is

desirable for classification might lie outside a restrictive face space.

2. Compression: An image Q described using standard PCA requires a system

that memorizes 11 images (the average image and the 10 eigenfaces), but

describes each (112 x 92) input with at most 10 coefficients. The il,. il il

transformation requires 10304 coefficients for each input image.

3. Resolution: The outputs of PCA are global. Each output is dependent on

all the input pixels. Each output of the identity transformation is dependent

of a single input pixel and thus has the same resolution as the input image.

Spatial resolution of an output can be controlled by simply limiting the

number of inputs.

4. Orthogonalization of Eigenvectors: The orthogonalization of standard

GHA arises by deflation (virtual lateral connections). The orthogonalization

of the id.. -il il v transformation arises from non-overlapping inputs.

5. Decorrelation of Outputs: The outputs of standard GHA are decorrelated

and a repeated application of PCA decomposition changes nothing. The

outputs of the identity transformation are correlated. These outputs can be

decorrelated by adding another 1-,-.r (a GHA 1~,-V-r) to the network.

6. Class Features: For the GHA network, class information is in the weights,

and the individual exemplar information is in the outputs. For the identity

transformation, there is no information in the weights. Class information

must be extracted from the outputs.

In a network structured between the two extremes of standard GHA and an

identity transformation, several tradeoffs can be considered. We feel that PCA-M

enhances control of the span of the output space and of localization.

4.3.3 Iterated Filter Banks

The eigenface decomposition and the identity map may be considered as two

extreme cases of fixed resolution PCA. The iterated filter bank structure is the

first multiresolution network presented in this chapter. The iterated filter bank is

also of interest since it is a way of implementing PCA-M for a complete represen-

tation. The concepts follow from prior sections and the focus is in describing the

architecture and constraints.

Allowing partial connections makes it possible to arbitrarily assign inputs

to outputs and control orthogonality between outputs. The number of possible

networks increases dramatically. For example, if each output is connected to two

inputs there are C(n, r) = C(10304, 2) = 53081056 unique combinations of

inputs. For each pair of inputs there are two orthogonal outputs, so it is possible to

construct a network with over 108 orthogonal outputs.

-0 O--mN1(2) HH1 2 t 1 -N-4
O-- O -4 2) 2N2(2) 1 N5 -6
-- O A'3(2) H 1 r 3 (2 1 7
~- O Q~N ^^ 04(2) H2 2 -NN
-- o T;5(4) 3 2 1
--0 O N6(4)
-- 0 0 7(8) rHH 2 Nl-4
H--*0l NO8(8) H(2 HH2) 4 5 6

[HL(2 T (HH2(2 T HL3))) K AN7
IN OUT HL1(2T (HH2(2T HH3))) ---N

Figure 4.7: Three Level Dyadic Banks

One possible structure mimics the structure of a dyadic filter bank. The

structure of a three level dyadic bank is shown on the top left of figure 4.7 with

the equivalent polyphase construction on the bottom right. The filters used in

this example are constrained to be two tap FIR filters. The filters are derived

from the eigenvectors of the (2 x 2) scatter matrices of the data at each stage.

Four sequential outputs of the first filter correspond to the outputs at the first

four nodes (N1 4) of the network. Two sequential outputs of the second filter

correspond to the outputs at the fifth (N5) and sixth (6) nodes of the network.

The GHA network iterates on the lowest energy (variance) component. For inputs

that have 1/f energy distributions, the lowpass or highpass components can be

selected by using Hebbian or anti-Hebbian learning for the output node.

The network (figure 4.7 left) has four nodes (N1 4) that are each connected

to two non-overlapping contiguous inputs. The numbers of connected inputs are

in parenthesis after the node labels. Output nodes N5 6 are each connected to

four contiguous non-overlapping inputs, and the last two nodes N7 8 are fully

connected. The weights of N1 4 are constrained to be equal since i!, i- correspond

to a single filter in the filter bank. For the same reason, the weights for outputs

N5 6 are constrained to be equal.

GHA orthogonalizes weights by deflating the inputs to subsequent output

nodes. Deflation is ineffective for non-overlapping inputs. The first four nodes have

no orthogonalization constraints from other nodes. Node N5 is directly affected

only by nodes N1 2 since the inputs of those two nodes are partitions of the input

to N5. Node N6 is directly affected only by nodes N3 4 (indirectly influenced by

N1 2 because of the equality constraint between N5 and N6). The weights of the

fully connected outputs N7 8 are constrained by all earlier nodes. The three-stage

dyadic (twofold) bank produces eight outputs for each set of eight inputs.

For the ORL images, a quadratic (fourfold) filter was used. At each stage the

inputs were partitioned into non-overlapping (2 x 2) blocks. To parallel the dyadic

filter bank, all regions are constrained to have the same weights. The scatter

matrix of the first stage inputs is

0.99 0.95 0.91 0.90
0.95 1.00 0.89 0.93
S= (4.19)
0.91 0.89 1.00 0.95
0.90 0.93 0.95 1.01

Due to the local nonstationarity of images, the autocorrelations differ by mode
(horizontal, vertical, diagonal) as well as by lag. The scatter matrix is doubly
symmetric but not (in general) Toeplitz. The first four eigenvectors are shown in
figure 4.8 and show that the filter is essentially separable along the horizontal and
vertical modes.


Figure 4.8: The first four (2 x 2) eigenimages are separable odd-even decomposi-
tions. From left to right: even horizontal and vertical (A1 = 0.9415), odd horizontal
and even vertical (A1 = 0.0346), even horizontal and odd vertical (A1 = 0.0183),
odd horizontal and vertical (A1 = 0.0056)

The weights (features) are separable even (low-pass) and odd (high-pass) decom-
positions. Figure 4.7( bottom-right) shows that the iterated filter bank develops
longer filters by cascading shorter filters. Short eigenfilter coefficients are driven by
PCA symmetry constraints and cannot adapt to data statistics.
Figure 4.9 (top) shows the outputs of three stages. For display purposes, each
image was normalized so that pixel intensities lie in the range (0, 1). The first
stage of the filter bank produces four outputs shown in the four top left panels of
figure 4.9. The four images are downsampled and arranged as a (2 x 2) array of
compressed images as shown in the top right panel. There is no implied ordering

or spatial relationship in the arrangement of the compressed images. We place the

compressed image to be iterated in the top-left of the (2 x 2) array. The low-pass

component (top-left) is passed to another iteration. The downsampled outputs of

the second stage are shown in the bottom, far left panel. The panel is di-p'l 't, ,I at

double scale. The third stage outputs are shown in the bottom, middle left panel

the figure 4.9. The (2 x 2) array of outputs from the third stage output is displ ,i'- .

at four times the actual scale. The outputs of all three stages are combined in the

bottom middle image. For comparison, the residual (original minus average) and

the original image are shown on the bottom right.

Figure 4.9: Three Level Decomposition of an Exemplar Face

The first stage outputs of all ten inputs are shown in figure 4.10.

Figure 4.10: Output of the First Stage of the Quadratic Filter Bank for the Ten

Training Exemplars

A large disadvantage of the iterated filter bank is limited adaptability:

1. Short Filter Length Constraint The weights cannot adapt to class

statistics (no global feature extraction).

2. Equal Weight Constraint The weights are constrained to be the same for

all subimages (no localization of features).

The chief advantage of the iterated filter bank is that an orthogonal basis is used

at each stage. The overall linear transformation is orthogonal and guarantees

perfect reconstruction of the inputs. The iterated filter bank structure produces a

multiresolution representation at the output. If the compressed representations are

difficult to classify, then feature extraction must be implemented by another stage.

4.3.4 Dual Implementation of PCA

The filter bank structure first extracts small highly localized features, then

builds up to global features. Another alternative is to find global features before

local features. In general, it cannot be guaranteed that the resulting localized

eigenvectors will form a minimal spanning set. That is, while we can still guarantee

a set of vectors to span the space of training exemplars, PCA-M might not produce

a minimal spanning set. As previously mentioned, a minimal spanning set is not

required and perhaps not desired for classification. If the process is wide-sense

stationary (WSS), then the network weights form a pair-wise linearly independent

set of vectors. The discussion of orthogonal bases will be deferred to the next


The behavior of PCA-M in going from long filters to short filters is clearer

when discussed in the context of the dual PCA decomposition. PCA can be done

using the transpose of the data matrix X ( 4.14). The main consideration is that

the number of exemplars is usually much smaller than the dimension (number of

components) of an input. The dual scatter matrix is

SD X= X'


Continuing with the ORL example, the original scatter matrix is a doubly sym-

metric (10304 x 10304) matrix. The dual scatter matrix is a (10 x 10) matrix.

An analytic solution to PCA has operations in the order of 0(N2), so there is a

significant computational advantage to using the dual PCA.







X'X(X'W) (X'W)A



SD =X'







Equation 4.21 contrasts the standard decomposition S = WAW' to the dual

decomposition SD = VAV'. The dual eigenvalues are equal. The eigenvectors W

can be calculated from the dual eigenvector V,

W =XV.


The columns of the matrix W are orthogonal but not orthonormal. Normalizing W

results in W.

The dual formulation can significantly reduce computations when the number

of exemplars is smaller than the spatial dimension of the exemplars. The formu-

lation also provides an alternative interpretation to PCA-M. Consider a single

exemplar xi, and a single eigenface wl (equation 4.17). Partition each array,

Xl,(1,1) Xl,(1,2)

Xl,(2,1) Xl,(2,2)




Wl,(1,1) Wl,(1,2)
Wl,(2,1) W1,(2,2)

The projection of x1 against w1 is a single global scalar,

i =< xi, Wi >- < xl,(r,c), Wl,(r,c) >= 4.12. (4.24)
r C

Equation 4.3.4 shows that the global output yi can be considered as the sum of

localized terms. By partitioning xl and wl into smaller subarrays in a manner

similar to equation 4.23, the global output yi can be replaced by an array of

localized outputs.

yl, localized

2.77 -4.21

14.17 -8.61

The localization of output yi can be extended to full resolution. Figure 4.11 shows

continued yi localized into blocks of (8 x 8), (4 x 4),(2 x 2), and finally (1 x 1)

(full localization). The array of localized outputs can be considered a compressed

representation of the input.




1 1

Figure 4.11: Localization of a Global Output

It is interesting to note that as the eigenface w1 is partitioned, each segment

has the same dual eigenvectors. In principle this is similar to the iterated filter

bank in that the dual (rather than the primal) eigenvectors are preserved globally.


The meaning of the variations across exemplars was not explored since the vari-

ations seem to arise from misalignment during data collection rather than from

an inherent feature of the class. Although multiresolution can be easily extended

to standard PCA by simply partitioning the standard eigenvectors, the repre-

sentations that result are ahl--,v- overcomplete. An overcomplete representation

contains redundancy that is a disadvantage if the application is to compactly

transmit information. For classification, an overcomplete representation may be


4.4 Overcomplete Representations

The preceding section showed two v--v- that PCA-M could be constrained

to produce complete or overcomplete representations. For the iterated bank, the

weights for each partition were constrained to be equal. In the dual approach, the

statistics over the entire image were assumed stationary. For classification, features

are important and overcomplete representations are satisfactory. In this section the

flexibility of the single ?1v. r GHA network is discussed.

The single computational 1?v, r GHA network outputs are determined by the

inputs. There are three main mechanisms for controlling the input,

1. control deflation from other outputs,

2. mask all inputs outside a selected region,

3. place explicit constraints in the training.

The original algorithm is to provide deflated inputs to successive outputs. For fixed

resolution PCA, the deflation is needed to prevent different outputs from having

weights converge to the same values. For multiresolution, the deflated inputs are

only required if the input nodes are identical, otherwise deflation is optional. By

selectively constraining weights to zero, an output can be localized to subregions of

arbitrary size and shape. The subregions need not be convex or connected (e.g., a

region for both eyes but excluding the nose). Each subregion of the image can be of

a different size and shape. Overlapping regions are allowed to reduce edge effects.

Shifted outputs can be introduced to facilitate shifts (translations) in the image.

Each partition is allowed to have different statistics and allowed to converge to a

local subeigenimage.

Relaxing all the constraints produce a richer set of characteristics. Relaxing

constraints also complicates the implementation. The specific choices for a PCA-M

network are discussed in the experiments. The overall approach, however, was to

relax a single constraint at a time until the network's classification performance was


4.5 Local Feature Analysis

Penev and Atick (1996) report great success in face classification using a

technique called Local Feature Analysis(LFA). The improvement in performance is

attributed to localized feature extraction. Atick has also implemented a commercial

automated face classification program, (Facelt, http://venezia. rockefeller.

edu/group/papers/full/AdvImaging/index.html) for workstations using LFA.

Ensemble Eigenvectors, Y(n,m) K(n,m)
of Inputs P /V Eigenvalues, A P(n,m)
---------------------- lf---------------
Single MAPPER Single
Input, (k(n) Output, Ok(n)

Figure 4.12: Local Feature Analysis

Figure 4.12 shows that LFA is based on PCA. The top cascade of operations

calculates class properties. The bottom row is the LFA mapper proper. Assume

a set of K inputs, ', (n), which are exemplars of a single class. Each input has a

spatial dimension n that can be rasterized such that 1 < n < N. For the ORL faces

N is equal to the number of pixels in each input, N (R x C) = 10304. PCA is the

eigenvalue decomposition of inputs' scatter matrix S(NxN). In general, eigenvalue

analysis yields a square modal matrix T(NxN)(n n2) and a single diagonal matrix

of eigenvalues A. The eigenvalue decomposition produces K eigenvectors, but only

the first M eigenvectors are retained (M < K < N). The truncated expansion

uses eigenvectors ,'. (n) with corresponding eigenvalues Am. The modal matrix

is not square since the number of linearly independent inputs K is less than the

dimension N. Since M < K < N, the modal matrix T(n1, n2) is (N x M). Using

the eigenvectors as a basis, each input can be reconstructed,
,(n) -, ,', (n),
(n) A, (a).

LFA introduces some new quantities that will be discussed in more detail in

separate subsections,
O(n) A YrM A Output vectors,

K(ni,) A 3CM1 ''* (Inl) 'i (I2), LFA kernel,

P(ni, 2) A M= '. i. 2), Residual Correlation.
The LFA kernel K(na, n2) is a topographic (Penev and Atick, 1996, p. 5) analog

for the modal matrix. The residual correlation P(n n2) is comparable to the

matrix of eigenvalues A. The LFA output is similar to the reconstructed input in


4.5.1 Output Vector

Using the LFA kernel, an output O(n) is computed for every input 0(n),

O(nl) K(ni, a2)(2) AA, ,'. (4.26)

The output O(n) is of the same dimension as the input 0(n). The LFA output is

the PCA reconstruction except that each eigenvector is normalized (scaled to unit


<, ',. >= Am(m,l) < 1%'' > J(mI) (4.27)

For convenience, the expressions for input and output are repeated here,
(A) ,Am (4.28)
0(n) M A- (4.29)

The main difference (between a PCA reconstruction and an LFA output) seems

to be that normalizing the eigenvectors the scale factor de-emphasizes terms

with large eigenvalues in equation 4.28. In PCA, these are the terms that are the

most important for reconstructions with minimum mean squared error (\!SlK;).

On the other hand, it has also been sr--.-- -i. 1 that eliminating the first principal

components (that are the low frequency components in iI Ii ,!" (1/f) images) can

compensate for differences in illumination level. The objection to discarding the

first few principal components is that essential discriminatory information may be

lost. LFA's approach of de-emphasis rather than outright elimination of the first

eigenvectors may provide features that are robust with respect to illumination.

4.5.2 Residual Correlation

The residual correlation matrix definition can be rewritten in matrix form,

P(ni, n2) A T 1 '' ( (2

= E' l L. K 2 -i'M+1 '(. /, (/2),

1 EM+1 (ni> (n2).

If the full set of K eigenvectors is used for LFA output expansion, then the residual

correlation of the output is the identity matrix. If only a subset M < K of the

eigenvectors is used, there is a residual correlation as shown in equation 4.30. Atick

and Penev note that the LFA output correlations pplen to be" localized.

4.5.3 Kernel

The scatter matrix of the input data can be rewritten,

4)' = AP'V
(,x,,. (x)'

Writing the expressions for the scatter matrix and the inverse kernel together,

K- = Z M= A '/ ''nl' (ni2)

The inverse kernel is comparable to the original scatter matrix except for the

number of terms (M < N), and the scale factor. The key difference in LFA seems

to be the scaling by A/X, otherwise, the analysis is similar to a partial PCA


4.5.4 LFA on ORL Faces

The ORL database has ten exemplars of each person. K = 9 exemplars were

used for training and one retained for evaluating the expansion. Figure 4.13 shows

normalized (by input image power) reconstruction MSE as a function of number of

components. The starting MSE (at x = 0) is from just using the average image.

Each of the lines is for a different input .

Normalized Fondrtrudbn MSE

2 3 4 5 6
Nurrber ol Eiganvealoi

7 8 9 10

Figure 4.13: PCA Reconstruction MSE

Figure 4.14 shows the reconstruction 0r(x) using only the M = 8 eigenvectors

corresponding to the eight largest eigenvalues. Note that the poses that are not

fully frontal have artifacts, the reconstructions using M = 9 are indistinguishable

from the inputs. The error from adding the tenth component is an artifact incurred

by implementation limitations on numerical accuracy.

Figure 4.14: PCA Reconstructions

Figure 4.15 shows the corresponding LFA outputs,

LFAOulpuls, W9

Figure 4.15: LFA Outputs (Compare to PCA Reconstruction)

The kernel and residual correlation matrices each have 106,172, 416

(10, 304 x 10, 304) t 108 elements. Each row of the kernel is a sum of scaled

PCA eigenfaces. Penev and Atick (1996) shows that local features can be found

from an appropriate linear combination of global features. What conditions are

needed so that an arbitrary image (e.g., a local feature) can be reconstructed using

eigenimages (global features)? Clearly, the local feature must be contained in the

span of the eigenspace. The span of the eigenspace is dependent on the number of

independent training exemplars. That is, as the number of independent training

exemplars increases, the span of the eigenspace increases. Figure 4.16 shows the

first five rows, each row reshaped to a (112 x 92) image of the LFA Kernel (top

row) and Residual Correlation (bottom row),

LFA lMarices, N9

Kernel, Ki.y)

RFihaualC orreaibn, P(x,y)

Figure 4.16: LFA Kernel and Residual Correlation (Look for Localization)

4.5.5 Localization for LFA and PCA-M

PCA-M parses the input into spatial subregions to obtain localized feature.

It is assumed that pixels that are close (spatially) are more likely to be related

that pixels that are widely separated. Similarly for time signals, events that occur

in a close interval of time tend to be better correlated as opposed events that are

separated by large intervals of time.

LFA based classification is significantly more involved than finding localized

features. LFA is a PCA based technique that is designed to obtain groups of pixels

that are highly correlated. Coincidentally, highly correlated pixels were found to be

spatially localized. A further coincidence is that the localized regions corresponded

to local ph:,--i. I1 features. LFA (and PCA-M) could have produced local regions of

pixels that do not correspond to any ph,!:--i, 1 features. The approach seems very

elegant and some of the techniques might be applied to PCA-M in the future. In

particular, LFA provides a statistically based approach to parsing an image into a

minimal set of arbitrarily shaped and highly correlated subregions. That is, LFA

provides a framework for grouping pixels into localized regions based on correlation

rather than simple .,li i:'ency. Further, LFA provides a nice verification that, for

faces, spatially localized features are well correlated.

4.5.6 Feature Space for LFA, PCA, and PCA-M

For a given set of input exemplars, LFA and PCA have the same feature

space. To derive local features, both PCA and LFA rely on the eigenspace having

a sufficiently large span so that local features are included. In both PCA and

LFA, the eigenspace can only be increased by using more (linearly independent)

training exemplars. That is, LFA localized features are a linear combination of

a large number of training exemplars. Kirby and Sirovich (1987) estimates that

a dimensionality of at least 400 is needed for adequate representation of tightly

cropped faces with PCA. Penev (1999) states that a dimensionality of 200 (at

least 200 exemplars) is needed for adequate representation of faces with LFA.

In the ORL example, with only nine exemplars of (112 x 92) images, any linear

combination will be I ...--like" and not localized.

PCA-M is a multiresolution technique that encompasses classical PCA as an

extreme case within the PCA-M definition. PCA-M directly manipulates local-

ization by partitioning the exemplar images. PCA-M then adapts to the second

order statistics in each localized region. Localization of features is independent

of the number of training exemplars. Further, PCA-M facilitates construction of

multi-scale features. That is, PCA-M can be utilized with global features as well as

(local) features of varying scale.

4.6 Summary

PCA-M can be used to directly derive features for classification. Local

features can be found by explicitly selecting regions that correspond to local

pl,--i. I1 structures. Unfortunately, there is often no a priori way to know that

the best mathematical features correspond to given ]l,!:1-i' 1 features. If a priori

information is available, the GHA network can structure the PCA-M network in

a very flexible manner. Unlike LFA which looks for local features by exhaustive

linear combinations of global features, PCA-M can explicitly selecting regions that

correspond to local pl,!:v -i 1 structures.

PCA-M seems to be sufficiently useful in providing localized inputs to an-

other classifier such as a neural network. The neural network can then choose to

construct features that are global or local. The classifier can create features that

are combinations of PCA-M outputs at a single scale, or combine PCA-M outputs

of several scales. The subsequent experiments showed that a single l1-,. r PCA-M

network followed by a single 1*.--r classifier performed comparably or better than

more complicated structures.


In recent years, automated face recognition has received increased interest

while simultaneously becoming more feasible. Surveillance and medical diagnostics

are two broad classes of applications that have driven the demand for image

recognition technology. Hardware for image recognition has shown a trend towards

higher performance, increased accessibility, and lowered costs. Numerous advances

have been made in face recognition algorithms (Chellappa et al., 1995, pp. 705 -


Automated face recognition and classification has many practical applica-

tions (C'!i I!! ppa et al., 1995, p. 707). In general, automated face recognition is

a complex problem that requires detecting and isolating a face under unknown

lighting conditions, backgrounds, orientations, and distances. However, there are

several applications where the lighting, scale, and background can be expected to

be well controlled:

personal identification (credit cards, passports, driver's license),

mug shot matching,

automated store/bank access control.

In these applications, detection and isolation of the faces is not necessary, Non-

linear distortions in the images (due to lighting, background, centering, scaling, or

rotation) can be controlled during data collection (and assumed to be negligible).

C'!i I lppa et al. (1995) presents a nice survey that includes background

material on psychology and neuroscience studies, face recognition based on moving

video, and face recognition using profile faces. The scope of this dissertation

is limited to automated face recognition based on frontal, still photos. Given a

database of exemplar images, the basic face recognition problem is to identify

individuals in subsequent images. The performance expectations for automated

face recognition are high since most people can recognize faces despite fairly

adverse conditions. For a machine, the task involves detecting faces from a

cluttered background, isolating each face, extracting features from each face,

and finally classification of the face.

This chapter includes an extended presentation of three specific classifiers:

the original eigenfaces experiment, a Hidden Markov Model, and a convolutional

network. All three techniques have been applied to the same (Olivetti Research

Lab) face database under similar conditions. Finally, the PCA-M classifier is

presented against the same ORL database.

5.1 ORL face Database

Olivetti Research Lab (ORL) has a public face database reproduced in

appendix B. The database has 400 pictures made up from 10 pictures of 40 people.

The images are (112 x 92) = 10304 pixel, 8-bit .i .-i'" i,! images. The images

in the ORL database present a non-trivial classification problem. The pictures

show variation in background lighting, scale, orientation, and facial expression

(figure 5.1). The tolerance in scale is about 2n' and the tolerance for tilting is

about 200 (Giles et al., 1997). Individuals who used eyeglasses were allowed to pose

both with and without eyeglasses. Some people looked very similar to each other

(figure 5.1, far right).


Variation in ORL Face Database

expression lighting similarity

Figure 5.1: Varying Conditions in ORL Pictures

Several other techniques have been applied to the ORL database under the

same testing conditions (40 people, 5 test + 5 verification pictures for each person).

Control of the conditions is important since reducing the number of classes (not

using all 40 people) implies an easier classification problem. C'! i'5iiig the ratio of

training exemplars to verification exemplars also alters classifier performance. This

section discusses the three experiments that are used for comparison to PCA-M.

PCA-M gave better average performance than the other techniques.

Table 5.1: Error Rates of Several Algorithms

Algorithm Performance
Eigenfaces (Turk and Pentland, 1991a) 10('
HMM (Samaria, 1994) 5.5',
SOM-CN (Giles et al., 1997) 5.7.' (: -.)
PCA-M (Brennan and Principe, 1998) 2.5'.

5.2 Eigenfaces

The decomposition of a training set of face images into eigenfaces has been

previously discussed. This section briefly presents the face recognition experiment

using eigenface.

5.2.1 Description of Experiment

The ORL database has 200 training images, 5 images for each of the 40 people.

All the training images {jk}l
1 2 k (5.1)

The ensemble average is removed from each image,

Xk k o. (5.2)

Eigenfaces T,' are found for the training ensemble and M eigenfaces with signif-

icant eigenvalues are retained. The eigenfaces define the axes in eigenspace, and

both training and test images can be mapped to a set of coordinates in eigenspace,

Xk [C, Qa2,... ,I Io k where, a, =< x, > (5.3)

Denote the eigenspace coordinate vector by ak = [acl,... lr]k. An immediate

advantage of eigenfaces is a steep reduction in dimension. For the ORL database

and using all the eigenfaces (M = 200), each image is described by 200 coordinates

rather than (92 x 112) = 10304 pixels. The training images can be compressed by

a factor of 50 without loss. The training images from each class map to separate

regions in eigenspace.

The eigenspace coordinates can be treated as raw input to any classifier. For

example, let the coordinates of the 5 training images from class n be denoted by

a'. The distance of a test image's coordinates from the training image coordinates

for a class can be used to determine the probability of the test image belonging to

the class,

Prob(xtest E n)= f(atest, a,..., a5).


The simplest method is to average the coordinates of all the training exemplars

for a class, and to calculate the distance of a test image from the average coor-

dinates (Turk and Pentland, 1991b; Giles et al., 1997). Samaria (1994) used a

nearest-neighbor classifier.

5.2.2 Results

Pentland et al. (1994) reports under 05'. error rate on 200 faces using a large

(unspecified) database. Samaria (1994) reported a 10'. error rate when using 175

to 199 eigenfaces. The improvement after 10 eigenfaces is gradual, but the error

rate rapidly becomes worse when less than 10 faces are used. Samaria's results also

showed that error rate was not monotonically non-increasing as the classifier used

more eigenfaces. Giles et al. (1997) reports a 10.5' using 40 to 100 eigenfaces.

Error rates aside, the eigenface approach demonstrates that PCA is a useful

preprocessor for classification.

1. PCA coordinates in eigenspace are good features for classification.

2. PCA reduces the dimensionality of the classifier inputs that in turn reduces


3. Eigenvalues are potentially a good indicator for classification features.

5.3 Face Recognition using HMM's

Samaria (1994) on face recognition using a Hidden Markov Model (HMM)

is often cited as seminal work in applying statistical signal processing to image

classification. Hidden Markov Models are widely applied to continuous speech

recognition (Haykin, 1994, p. 227). Samaria passed an observation window over

an image from left-to-right, down, right-to-left, down, left-to-right, and so on

(Figure 5.2, left).

___ win ow

general traversal top-down traversal

Figure 5.2: Parsing an Image into a Sequence of Observations

That is, an observation window traverses a one-dimensional path through each

image. For each image, Samaria thus obtained a corresponding observation array,

0 [01,...,OT].

5.3.1 Markov Models

A Markov model is a statistical model for a sequence of observations based

on an underlying sequence of states (a Markov process). The probability of a

state at some time in a sequence is dependent only on the immediately preceding

state (Therrien, 1992, pp. 99 118). Each transition between states generates an

output that is dependent only on the state being entered. If the states can be only

one of N countable discrete values, the process is called a Markov Chain (Therrien,

1992, pp. 99 118). The Markov process is described by four parameters (Samaria,

1994, p. 28),

1. the number of states N,

2. the one-step state transition matrix, A = {aj : 1 < i,j < N},

3. the output probability function, B = {bj(.) : 1 < j < N},

4. the initial state probability distribution, II {-rj : 1 < j < N}.

When only the outputs are observable (the states are hidden), the model is said to

be a Hidden Markov Model

Output, oj

State i State j

Figure 5.3: Markov Model

5.3.2 Description of Experiment

A full description of Samaria's work and a detailed description of HMM's is

outside the scope of this dissertation. HMM's are described in various books on

statistical signal processing (Haykin, 1996; Therrien, 1992). Samaria cites Rabiner

(1989). This section attempts to cover points in Samaria's research that would be

salient to reconstructing his experiments on 1-dimensional HMM's (1D-HMM).

1 forehead

2 eyes

3 nose

4 mouth

5 chin

Figure 5.4: Top-down Constrained State Transitions

Samaria obtained his best results using a top-down sequence of five states. Each

state corresponds to a region of the face. The allowed state transitions correspond

to a top-down traversal of a face (Figure 5.4). The observation window was

constructed from several complete rows of the image (Figure 5.2, right). Each

window was eight rows high and overlapped .idi i,:ent windows by seven rows.

Samaria described his model using a shorthand notation (Samaria, 1994, p. 42) of

Fi = (N (states) L (observation rows) M (overlap rows) ) = (5, 8, 7)

Samaria used the HTK software package described in Young (1993). For each of

the 40 classes in the ORL database, five training images are each transformed into

a sequence of observations using a top-down traversal. The five training sequences

used as inputs to the HTK software with a design specification for five states (five

face regions). The HTK software derived optimal parameters for each HMM using

the Baum-Welch re-estimation algorithm (Baum, 1972). The optimization includes

parsing the training images into five regions, and deriving both the state transition

matrix A and output probability function B for the HMM. At the end of training,

an HMM has been derived for each class. To classify a test image, select the class

whose HMM maximizes the likelihood of the test image.

5.3.3 Results

Samaria's dissertation exhaustively explored variations in the number of mod-

els, and observation window parameters. The dissertation included experiments

using frequency domain representations and reduced (spatial) resolution images.

Samaria reports that 1D-HMM outperformed the Eigenfaces approach about 1I' .

of the time. The 1D-HMM had an average error rate of 10'-. After all the detailed

analysis, Samaria modestly concluded that the improvements (over eigenfaces) in

face recognition using 1D-HMM were probably not statistically significant. The

dissertation also explored a more complicated model, the P2D-HMM (pseudo 2D

model). The P2D-HMM outperformed the eigenfaces approach about ,91'` of the

time with an average error rate of 5'.

5.4 Convolutional Neural Networks

Giles et al. (1997) used a Self-Organizing Map (SOM) in conjunction with

a Convolutional Neural Network for face classification. The self-organizing map

(SOM) is used for dimensionality reduction of the exemplars, and the convolutional

network (CN) provides partial translation and deformation invariance (Giles et al.,

1997, p. 67).

Raw Compressed
Image Representation Class
-Parse/SOM CN -

Figure 5.5: SOM-CN Face Classifier

5.4.1 Self-Organizing Map

Giles et al. (1997) states that Kohonen's self-organizing map (SOM) or

Self-Organizing Feature Map (SOFM) (Kohonen, 1995) is a topology preserving,

unsupervised learning process. This section presents an overview of SOFM that

follows the presentation of Kohonen's SOFM found in Haykin (1994, pp. 402 -


A SOFM maps an input of arbitrary dimension into a discrete map of reduced

dimension. The theory is based on vector quantization theory in which a large

set of input vectors is mapped to a reduced set of prototypes (the weights of the

winning output node). The network for a SOFM has only an input and output

1 -.i r. Each output is fully connected to all the inputs. The nodes of the output

1?,-i.r are arranged in a one, two, or three-dimensional lattice. When presented with

an input x, one of the SOFM's output nodes is the best-matching or winning node

according to some distance criteria,


i(x) argmin x- xjl, j = 1, 2,..., N.

In equation 5.5, i(x) is the index of the winning output in response to input x.

The SOFM is topologically ordered in the sense that nodes that are .,Ii i'ent in the

output lattice tend to have similar weights. The SOFM is topologically preserving

in the sense that a small distance between two inputs (in the input space) implies a

small distance between the corresponding winning outputs (in the output space).

5.4.2 Convolutional Network

A convolutional network (Le Cun and Bengio, 1995) is a specific structure

for a muiltilv-, r perception network that has been successfully applied to optical

character recognition (OCR) (Haykin, 1994, p. 226). Giles et al. (1997) uses a

similar network for face classification. The CN has five computational l~--.-rs: four

hidden -1-. rs and the output 1l--v.

1. The first hidden 1l-vr will be discussed in some detail since it exhibits

the properties of feature maps, weight sharing, local receptive fields, and

nonlinear convolution. Consider a (20 x 20) OCR image that is parsed

into (5 x 5) subregions. There are (16 x 16) = 256 subregions that can be

constructed by shifting the (5 x 5) window over the OCR image. A neuron

in the first hidden l~.-v-r is said to have a local receptive field if the inputs

to the neuron correspond to a local region of the input. The neurons can

be organized into a (16 x 16) feature map such that .,1i ,ient neurons have

local receptive fields that are shifted by one pixel. A further constraint on a

feature map is that all the neurons in the feature map have the same weights

(weight sharing). The construction of the feature map can be perceived as a

convolution of the input image against the fixed weights. Since the output of

the neurons is passed through a nonlinear function, the first hidden l-1-.-r is

characterized as a nonlinear convolutional l1~-,-r. In the OCR application, the

first hidden l-~v. r consists of four feature maps.

2. The second hidden l-i. v is a downsampling If -r Downsampling provides a

tolerance to distortions due to translation, rotation, and scaling. In the OCR

example, the second hidden l*- v-r has four feature maps that are respectively

reduced (spatial) resolution representations of the four feature maps from the

first hidden l~Vr.

3. The third hidden 1- v-r is another convolutional 1lr- r. A feature map in the

third hidden l-1.- r may use local receptive fields from two feature maps in

the second 1-r. v. The OCR application has twelve feature maps in the third

hidden 1lir..

4. The fourth hidden 1--. r is another averaging and downsampling 1-c-r.

identical in structure to the second hidden l *'vr.

5. The output 1l,--r is fully connected to the fourth hidden 1 --,r. In the OCR

application, there are ten neurons corresponding to the ten digits [0,1,..., 9].

The output l-1,--r classifies the input, further, the difference between the most

active and second most active outputs can be used to generate a measure of

confidence in the classification.

Haykin (1994, p. 226) states that a multilrv,-r perception that uses alternating

convolutional and downsampling l-r,-i is a convolutional network.

5.4.3 Description of Experiment

In contrast to the OCR application, Giles et al. (1997) chose to preprocess

the raw (face) images with a SOM. The (92 x 112) images are parsed into (5 x 5)

subimages. Each subimage overlaps .,.i i'went subimages by one pixel. All the

subimages from the training data are collected and used to train a SOM with a

(5 x 5 x 5) three-dimensional output lattice. The trained SOM is used to transform

the raw images into three (23 x 28) maps. The three maps are passed to the

convolutional network. Giles' CN has five l~-,. rs, the architecture is described in

table 5.2.

Table 5.2: Face Classification CN Architecture
Number of Feature Map Receptive Field
L Type Feature Maps Dimensions Dimensions
1 convolutional 20 (21 x 26) (3 x 3)
2 downsampling 20 (9 x 11) (2 x 2)
3 convolutional 25 (9 x 11) (3 x 3)
4 downsampling 25 (5 x 6) (2 x 2)
5 full 40 (1 x 1) (5 x 6)

5.4.4 Results

The CN is a muiltilv,--r perception network that requires significant training.

Once trained, the network operates quickly. The best results were reported as 3.5'

error against the ORL database.

5.5 Face Classification with PCA-M

PCA is known to be optimal for representation, but suboptimal for classifica-

tion. Belhumeur et al. (1997) points out that PCA does not differentiate in-class

scatter from between-class scatter. Bartlett et al. (1998) shows that classification

can be improved by using independent component analysis to incorporate higher-

order statistics. On the other hand, the experiments using eigenfaces (Turk and

Pentland, 1991a) showed that coordinates in eigenspace are useful for classification.

Several other experiments indicate that PCA-based feature extraction could be

improved by adding localization and multiresolution.

Pentland et al. (1994) states that localization can enhance eigenfaces. Pent-

land trained eigenfeatures corresponding to pl1,:-i. I1 facial features. Classification

based on localized eigenfeatures was comparable to the performance of eigenfaces.

The combination of localized eigenfeatures and global eigenfaces performed almost

perfectly. Brunelli and Poggio (1993) states that localized features may be more

important than global features. Brunelli stated that when a classifier can use

only a single facial feature, then local templates based on eyes, nose, and mouth

contribute more to recognition than global facial templates. Giles et al. (1997)

Si.---- -1.. 1 that reducing the spatial resolution of the ORL images might improve

classification. The use of the SOM front end to reduce dimensionality while retain-

ing good classification supports Giles' observation. Turk and Pentland (1991b) used

a six-level Gaussian pyramid to view the inputs at several spatial resolutions.

The theory for PCA-M, dyadic filter banks, and the GHA network were

presented in chapter 4. This remainder of this section presents and discusses

experimental results for test runs using PCA-M, a fixed-basis multiresolution

(Haar), and PCA at several fixed resolutions.

5.5.1 Classifier Architecture

Figure 5.6 shows the initial architecture for our classifier (Brennan and

Principe, 2000).


SPCA-M Hyperplane Vote
Hyperplane -
-- - - - -- - - - I
Image Multiresolution Class

Figure 5.6: Initial Classifier Structure

The structure was originally intended to isolate each feature space. We wanted to

observe both the individual feature performance and degradation due to decoupling

features. Adding more eigenfeatures does not monotonically increase classifier

performance (Samaria, 1994), and we plan on finding a way to select or weight

the predictions from eigenfeatures in future research. Each feature classifier uses

a template obtained by averaging the training exemplars. The final classification

was done by weighted vote among the component classifiers. The in1i' 'i, ily vote

mechanism is the simplest way to combine the results of the feature classifiers. Hu

Table 5.3: Fixed Resolution PCA Error Rates over 10 Runs
Raw Data 19,0 14.4 10.5
(2 x 2) 22.5 17.5 14.0
(4 x 4) 27.5 23.6 20.5

et al. (1997) il-.-. -I; that more elaborate committee structures don't necessarily

significantly outperform a ii ii Pi ily vote mechanism. The overall structure can be

compared to a 2-1h'v, r hierarchical One-Class-One-Network (OCON) Decision-Based

Neural Network (DBNN) (Kung, 1993, pp. 118-120).

5.5.2 Data Preparation

The most straightforward way to vary the resolution is to reduce the scale

by half. We investigated reductions in scale of 2, 4, 8, and 16. The scaling was

performed by passing a (2k x 2k) window through the image. For example the 1/16

scaling takes a (24 x 24) subimage and produces 16 scalar outputs; the collection

of scalar outputs from all the windowed subimages form 16 scaled images with

reduced spatial resolution. We used non-overlapping observation windows because

we wanted to observe if blocking would severely deteriorate classification. To

facilitate scaling with non-overlapping windows, image dimensions were cropped

to (112 x 80) so that the numbers of pixels along each dimension are a factor of

24 = 16. Six columns of pixels were cropped from each side of the input.

5.5.3 Fixed Resolution PCA Results

The fixed resolution PCA was investigated for windows of (2 x 2) and (4 x 4).

Eigenfaces would have transformed each (112 x 80) = 8960 pixel input to 8960

coordinates in eigenspace; each coordinate corresponds to the projection of an

input to an eigenface. For the ORL database there would have been only 200 non-

zero coordinates. We step through the procedure for a (2 x 2) and note that the

procedure for the (4 x 4) PCA windows are analogous. A (2 x 2) has 4 eigenimages.

As the non-overlapping window is passed through the image, we are effectively

condoling 4 sets of coefficients against the input image. Since the blocks are non-

overlapping, downsampling is accomplished in the same step. As an aside, linear

convolution and downsampling are being performed by a single computational

1-,-. r partially connected network with 8960 inputs and 8960 outputs. Each output

is locally supported by to 4 inputs (a (2 x 2) subregion). Only 4 sets of weights

are used. The output can be organized into 4 feature spaces that are 4 half-scale


In each run, five training exemplars were randomly selected from the ten

exemplars available for each person in the ORL database. The results show great

sensitivity to selection of the training set. The sensitivity is not surprising. For

example, figure 5.7 shows a class that had the entire training set at one scale, and

the entire test set at another scale.

Training and Test data at different Scales



Figure 5.7: Training and Test Data at Different Scales

Giles et al. (1997) points out that a random selection among 40 classes would be

expected to be correct 1/40 = 2.5'. of the time. We feel that a more realistic base-

line for error rates is the performance of a template classifier with the raw data.

Since PCA is just a rotation, the performance would be the same as the perfor-
mance using all 200 eigenfaces. Samaria (1994) reported 10.5'. for the ORL data,
but we found that the error rate was also sensitive to training set and averaged
around 14.!' (first line of table 5.3). Some of the increased misclassification could
be due to the clipped data, but it is more likely that the decoupling of data due to
our classifier structure is responsible for the deterioration. The results seem to sup-
port that data organized into 4 independent feature spaces. The (2 x 2) window is
worse than data taken as a whole (raw data), but better than 16 decoupled feature
spaces. Note that if the feature spaces are linearly combined before classification,
we would expect an error rate similar to the raw images. All individual feature
classifiers and the iii ii i ly vote mechanism have a nonlinear operation when the
maximum output is selected.

5.5.4 Haar Multiresolution
A fixed Haar basis was used to crate a four level differential image pyramid. A
sample decomposition is shown along with the original image.
PCA-M Decomposition




Figure 5.8: PCA-M Decomposition of One Picture

The autocorrelation matrix of the observation windows, as expected, shows that
the pixels in a natural image are 1/f. Classification using a Haar basis was not
significantly different from PCA-M since the Haar basis is well suited for 1/f
signals. Moreover, for small observation windows, the choice of multiresolution
basis is not very important. Multiplication by any fixed basis is a rotation; if all

the features are used, then input distances are preserved. Classification (with a

linear classifier) will be no better than using the raw inputs.

5.5.5 PCA-M

PCA-M was used to decompose images into multiresolution feature spaces

(components). Four feature spaces are 1/16 scale images, three are 1/8 scale

images, three are 1/4 scale images, and three are 1/2 scale images (figure 5.9).

Figure 5.9: Selected Resolutions

The decomposition was chosen to facilitate comparison to the Haar decomposition.

Referring to figure 5.9, components 1 to 4 have the longest eigenvectors; that is,

(16 x 16) eigenimages and the least spatial resolution. Components 11 to 13 have

the shortest eigenvectors and the highest spatial resolution. The classifier was

modified (figure 5.10).


SFFT Hyperplane
-- PCA-M FFT Hyperplane rite

I- -FFT Hyperplane

Image Multiresolution Class

Figure 5.10: Final Classifier Structure

The modification resulted from earlier experiments evaluating PCA-M for repre-

sentation. Since we had the ORL database in a variety of representations, we fed

them into the classifier. The magnitude of the FFT of the raw data had an average

error rate of 1(''. The combination of PCA-M with magnitude FFT gave the best

performance. We assume that using the FFT magnitude makes the classifier more

robust to translations. There is still a high sensitivity to training set selection. The

Table 5.4: Error Rates for PCA-M with Magnitude of FFT

Multiresolution Levels MAX AVG MIN
2 6.5(0' 2.95' 0.(11' "
3 5.011' 2.45' 0.(11'
4 6.5( 3.4(1' 1.i .

main diagonal of table 5.5 shows the performance of the individual feature classi-

fiers. The table shows the number of misclassifications out of 200 test images. The

nondiagonal elements show the number of misclassifications using a pair of feature

classifiers. The performance seemed to be independent of eigenvalue or resolution.

None of the component classifier had more than 10 misclassifications (out of 200)

in the training set (200 images). Performance on the test set does not seem to be

predictable from performance on the training set. When a component classifier's

best guess was incorrect, the second best guess was correct half the time. Of some

interest is the poor performance of the first four components since Belhumeur

et al. (1997) states that the first few eigenvectors are sensitive to illumination.

Belhumeur stated that removal of the first three or four eigenvectors could provide

some robustness to illumination levels.

Table 5.5: Component Misclassifications (200 Test Images)

1 2 3 4 5 6 7 8 9 10 11 12 13
1 61 48 45 33 19 28 19 5 19 7 25 3 4
2 136 100 73 21 55 46 7 45 9 54 6 8
3 147 75 20 52 45 6 42 7 48 6 8
4 99 22 43 31 6 31 9 39 5 8
5 27 17 6 6 11 8 9 4 3
6 68 29 6 27 6 24 5 4
7 54 4 16 4 15 4 4
8 7 5 6 3 4 1
9 53 5 30 5 4
10 9 3 3 2
11 67 4 6
12 6 3
13 9


This chapter describes a classification and a simple discrimination experiment

using synthetic aperture radar (SAR) images of armored vehicles. SAR imagery

is obtained by combining radar returns over the path of a moving platform (an

airplane or satellite). The path is effectively a large antenna aperture leading to

high-resolution imagery. The basic scenario is that given a training set of several

I ,i;, I vehicles, the discriminator will assign subsequent input images to the

correct class of threat vehicles, or identify that the new image belongs to a new

class of i,,,i,-I ,i ,_t vehicles. Classes of vehicles that have only test (no training)

exemplars are called confuser classes. Discriminators can make three types of


1. A false-negative error occurs when a target vehicle is not identified. Presum-

ably, failing to respond to a target vehicle incurs the most severe penalty.

2. A false-positive mistake occurs when a non-target vehicle is identified as a

target vehicle. This mistake causes resources to be wasted in an unnecessary


3. The third error occurs when a target vehicle is correctly identified, but is

incorrectly labeled.

By modifying the decision boundary, a trade-off is possible between the three

errors. Two sets of results are presented: a set for classification (no rejection

of confuser classes), and a set for discrimination. The goal of the classifier is to

minimize unconditional probability of correct classification P,,. The goal for our

discriminator is to maximize (conditional) Pc when the probability of detecting

targets, Pd, is 91'.

6.1 SAR Image Database

The raw SAR inputs are (128 x 128) pixel images from a subset of the 9/95

MSTAR Public Release Data obtained from Veda Inc., (

The web site also includes a paper with a detailed description of the data and

the results of a baseline template-matching classifier (Velten et al., 1998). The

data consists of X-band SAR images with a 1-foot by 1-foot resolution (Velten

et al., 1998). Table 6.1 lists the vehicle classes, bumper tags, and the quantity of

corresponding images. The SAR image of a vehicle is dependent on the pose of the

target vehicle relative to the collection platform (satellite). The two pertinent pose

parameters are aspect angle and depression angle (figure 6.1)).

Figure 6.1: Aspect and Depression Angles

Changing the pose of a vehicle results in nonlinear changes in the SAR image

since the radar cross-section is a projection of a 3-dimensional object onto a

2-dimensional surface. Thus, the features that are available for discrimination

are a function of pose. Because of this dependence, it is desirable to ensure that

the exemplars for a given vehicle have the same pose. Ideally, a large number of

exemplars would be available for each aspect angle. More practically, we would

collect a large number of training exemplars for a narrow range of aspect angles.

Realistically, we have to accept a trade-off between having a large number of