Nonlinear extensions to the minimum average correlation energy filter


Material Information

Title:
Nonlinear extensions to the minimum average correlation energy filter
Physical Description:
x, 173 leaves : ill. ; 29 cm.
Language:
English
Creator:
Fisher, John W., 1965-
Publication Date:

Subjects

Subjects / Keywords:
Electrical and Computer Engineering thesis, Ph. D   ( lcsh )
Dissertations, Academic -- Electrical and Computer Engineering -- UF   ( lcsh )
Genre:
bibliography   ( marcgt )
non-fiction   ( marcgt )

Notes

Thesis:
Thesis (Ph. D.)--University of Florida, 1997.
Bibliography:
Includes bibliographical references (leaves 168-172).
Statement of Responsibility:
by John W. Fisher III.
General Note:
Typescript.
General Note:
Vita.

Record Information

Source Institution:
University of Florida
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
aleph - 027596855
oclc - 37163201
System ID:
AA00014281:00001




Full Text









NONLINEAR EXTENSIONS TO THE
MINIMUM AVERAGE CORRELATION ENERGY FILTER















By

JOHN W. FISHER III


A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF
PHILOSOPHY

UNIVERSITY OF FLORIDA


1997







ACKNOWLEDGEMENTS


There are many people I would like to acknowledge for their help in the genesis of this
manuscript. I would begin with my family for their constant encouragement and support.

I am grateful to the Electronic Communications Laboratory and the Army Research
Laboratory for their support of the research at the ECL. I was fortunate to work with very
talented people, Marion Bartlett, Jim Bevington, and Jim Kurtz, in the areas of ATR and
coherent radar systems. In particular, I cannot overstate the influence that Marion Bartlett
has had on my perspective of engineering problems. I would also like to thank Jeff Sichina
of the Army Research Laboratory for providing many interesting problems, perhaps too
interesting, in the field of radar and ATR. A large part of who I am technically has been
shaped by these people.

I would, of course, like to acknowledge my advisor, Dr. Jose Principe, for providing me
with an invaluable environment for the study of nonlinear systems and excellent guidance
throughout the development of this thesis. His influence will leave a lasting impression on
me. I would also like to thank DARPA; funding from this institution enabled a great deal of
the research that went into this thesis. I would also like to thank Drs. David Casasent and
Paul Viola for taking an interest in my work and offering helpful advice.

I would also like to thank the students, past and present, of the Computational Neu-
roEngineering Laboratory. The list includes, but is not limited to, Chuan Wang for useful
discussions on information theory, Neil Euliano for providing much needed recreational
opportunities and intramural championship t-shirts, and Andy Mitchell for being a good friend
to go to lunch with, for suffering long inane technical discussions, and for now being a
better climber than I am. There are certainly others, and I am grateful to all.

Finally I would like to thank my wife, Anita, for enduring a seemingly endless ordeal,
for allowing me to use every ounce of her patience, and for sacrificing some of her best
years so that I could finish this Ph. D. I hope it has been worth it.








TABLE OF CONTENTS


                                                                          Page

ACKNOWLEDGEMENTS ........................................................... ii
LIST OF FIGURES ............................................................. v
LIST OF TABLES ............................................................ viii
ABSTRACT .................................................................... ix

CHAPTERS

1 INTRODUCTION .............................................................. 1
  1.1 Motivation ............................................................. 1

2 BACKGROUND ................................................................ 6
  2.1 Discussion of Distortion Invariant Filters ............................. 6
    2.1.1 Synthetic Discriminant Function ................................... 12
    2.1.2 Minimum Variance Synthetic Discriminant Function .................. 15
    2.1.3 Minimum Average Correlation Energy Filter ......................... 18
    2.1.4 Optimal Trade-off Synthetic Discriminant Function ................. 20
  2.2 Pre-processor/SDF Decomposition ....................................... 24

3 THE MACE FILTER AS AN ASSOCIATIVE MEMORY ................................. 27
  3.1 Linear Systems as Classifiers ......................................... 27
  3.2 MSE Criterion as a Proxy for Classification Performance ............... 29
    3.2.1 Unrestricted Functional Mappings .................................. 30
    3.2.2 Parameterized Functional Mappings ................................. 32
    3.2.3 Finite Data Sets .................................................. 34
  3.3 Derivation of the MACE Filter ......................................... 35
    3.3.1 Pre-processor/SDF Decomposition ................................... 38
  3.4 Associative Memory Perspective ........................................ 39
  3.5 Comments .............................................................. 49

4 STOCHASTIC APPROACH TO TRAINING NONLINEAR SYNTHETIC DISCRIMINANT
  FUNCTIONS ................................................................ 52
  4.1 Nonlinear Iterative Approach .......................................... 52
  4.2 A Proposed Nonlinear Architecture ..................................... 53
    4.2.1 Shift Invariance of the Proposed Nonlinear Architecture ........... 55
  4.3 Classifier Performance and Measures of Generalization ................. 57
  4.4 Statistical Characterization of the Rejection Class ................... 67
    4.4.1 The Linear Solution as a Special Case ............................. 69
    4.4.2 Nonlinear Mappings ................................................ 70
  4.5 Efficient Representation of the Rejection Class ....................... 72
  4.6 Experimental Results .................................................. 74
    4.6.1 Experiment I: noise training ...................................... 75
    4.6.2 Experiment II: noise training with an orthogonalization constraint  81
    4.6.3 Experiment III: subspace noise training ........................... 84
    4.6.4 Experiment IV: convex hull approach ............................... 89

5 INFORMATION-THEORETIC FEATURE EXTRACTION ................................. 96
  5.1 Introduction .......................................................... 96
  5.2 Motivation for Feature Extraction ..................................... 97
  5.3 Information Theoretic Background ..................................... 101
    5.3.1 Mutual Information as a Self-Organizing Principle ................ 101
    5.3.2 Mutual Information as a Criterion for Feature Extraction ......... 104
    5.3.3 Prior Work in Information Theoretic Neural Processing ............ 106
    5.3.4 Nonparametric PDF Estimation ..................................... 108
  5.4 Derivation Of The Learning Algorithm ................................. 110
  5.5 Gaussian Kernels ..................................................... 115
  5.6 Maximum Entropy/PCA: An Empirical Comparison ......................... 118
  5.7 Maximum Entropy: ISAR Experiment ..................................... 124
    5.7.1 Maximum Entropy: Single Vehicle Class ............................ 125
    5.7.2 Maximum Entropy: Two Vehicle Classes ............................. 127
  5.8 Computational Simplification of the Algorithm ........................ 127
  5.9 Conversion of Implicit Error Direction to an Explicit Error .......... 136
    5.9.1 Entropy Minimization as Attraction to a Point .................... 136
    5.9.2 Entropy Maximization as Diffusion ................................ 139
    5.9.3 Stopping Criterion ............................................... 141
  5.10 Observations ........................................................ 143
  5.11 Mutual Information Applied to the Nonlinear MACE Filters ............ 144

6 CONCLUSIONS ............................................................. 151

APPENDIX

A DERIVATIONS ............................................................. 155

REFERENCES ................................................................ 168

BIOGRAPHICAL SKETCH ....................................................... 173








LIST OF FIGURES


Figure                                                                    Page
1  ISAR images of two vehicle types ............................................ 9
2  MSF peak output response of training vehicle 1a over all aspect angles ..... 10
3  MSF peak output response of testing vehicles 1b and 2a over all aspect angles. 11
4  MSF output image plane response ............................................ 12
5  SDF peak output response of training vehicle 1a over all aspect angles ..... 15
6  SDF peak output response of testing vehicles 1b and 2a over all aspect angles. 16
7  SDF output image plane response ............................................ 17
8  MACE filter output image plane response .................................... 20
9  MACE peak output response of vehicles 1a, 1b and 2a over all aspect angles .. 21
10 Example of a typical OTSDF performance plot ................................ 23
11 OTSDF filter output image plane response ................................... 24
12 OTSDF peak output response of vehicle 1a over all aspect angles ............ 25
13 OTSDF peak output response of vehicles 1b and 2a over all aspect angles .... 26
14 Decomposition of distortion invariant filter in space domain............... 26
15 Adaline architecture ......... ... ............................. 28
16 Decomposition of MACE filter as a preprocessor (i.e. a pre-whitening filter over
the average power spectrum of the exemplars) followed by a synthetic discrimi-
nant function ................................................ 39
17 Decomposition of MACE filter as a preprocessor (i.e. a pre-whitening filter over
the average power spectrum of the exemplars) followed by a linear associative
memory. ............................................. ........ 43
18 Peak output response over all aspects of vehicle 1a when the data matrix is
not full rank .............................................................. 47
19 Output correlation surface for LMS computed filter from non full rank data... 48
20 Learning curve for LMS approach............... ...................... 49
21 NMSE between closed form solution and iterative solution................ 50
22 Decomposition of optimized correlator as a pre-processor followed by SDF/LAM
(top). Nonlinear variation shown with MLP replacing SDF in signal flow (middle),
detail of the MLP (bottom). The linear transformation represents the space domain
equivalent of the spectral pre-processor ............................... 54
23 ISAR images of two vehicle types shown at aspect angles of 5, 45, and 85 degrees
respectively. .............. ......... ........ ................ 59










24 Generalization as measured by the minimum peak response .............. 62
25 Generalization as measured by the peak response mean square error......... 63
26 Comparison of ROC curves ................ ....... ................ 64
27 ROC performance measures versus ................... .............. 66
28 Peak output response of linear and nonlinear filters over the training set...... 77
29 Output response of linear filter (top) and nonlinear filter (bottom)........... 78
30 ROC curves for linear filter (solid line) versus nonlinear filter (dashed line)... 79
31 Experiment I: Resulting feature space from simple noise training ........... 80
32 Experiment II: Resulting feature space when orthogonality is imposed on the input
layer of the MLP. ................................................ 83
33 Experiment II: Resulting ROC curve with orthogonality constraint.......... 84
34 Experiment II: Output response to an image from the recognition class training
set......... ..................... .................. 85
35 Experiment III: Resulting feature space when the subspace noise is used for train-
ing ................... ........ ............................... 88
36 Experiment III: Resulting ROC curve for subspace noise training ............ 89
37 Experiment III: Output response to an image from the recognition class training
set .................................. ....... ...... ........... 90
38 Learning curves for three methods. ............................ .. 90
39 Experiment IV: resulting feature space from convex hull training ........... 94
40 Experiment IV: Resulting ROC curve with convex hull approach ........... 95
41 Classical pattern classification decomposition. ................. ...... 100
42 Decomposition of NL-MACE as a cascade of feature extraction followed by dis-
crimination .................................................... 100
43 Mutual information approach to feature extraction ...................... 106
44 Mapping as feature extraction. Information content is measured in the low dimen-
sional space of the observed output.......... .......................... 108
45 A signal flow diagram of the learning algorithm. .................. ..... 114
46 Gradient of two-dimensional gaussian kernel. The kernels act as attractors to low
points in the observed PDF on the data when entropy maximization is desired. 117
47 Mixture of gaussians example. ............... ..................... 118
48 Mixture of gaussians example, entropy minimization and maximization...... 119
49 PCA vs. Entropy gaussian case...................................... 120
50 PCA vs. Entropy non-gaussian case. ............................ 122
51 PCA vs. Entropy non-gaussian case. ............................ 123










52 Example ISAR images from two vehicles used for experiments. ........... 124
53 Single vehicle experiment, 100 iterations. .......................... 125
54 Single vehicle experiment, 200 iterations. ............................. 126
55 Single vehicle experiment, 300 iterations. ............................. 126
56 Two vehicle experiment. ......................................... 128
57 Two dimensional attractor functions. ................................. 133
58 Two dimensional regulating function. .............................. 134
59 Magnitude of the regulating function. ................................ 134
60 Approximation of the regulating function ............................. 135
61 Feedback functions for implicit error term ........................... 138
62 Entropy minimization as local attraction. ............................. 140
63 Entropy maximization as diffusion. ................................. 142
64 Stopping criterion. ............................................... 143
65 Mutual information feature space. ................................. 146
66 ROC curves for mutual information feature extraction (dotted line) versus linear
M ACE filter (solid line)............................................ 148
67 Mutual information feature space resulting from convex hull exemplars...... 149
68 ROC curves for mutual information feature extraction (dotted line) versus linear
MACE filter (solid line)..... .................................. 150








LIST OF TABLES


Page

Table

1 Classifier performance measures when the filter is determined by either of the
common measures of generalization as compared to best classifier performance for
two values of.................................... ............. 61
2 Correlation of generalization measures to classifier performance. In both cases (
equal to 0.5 or 0.95) the classifier performance as measured by the area of the ROC
curve or Pfa at Pd equal 0.8, has an opposite correlation as to what would be
expected of a useful measure for predicting performance ................ 64
3 Comparison of ROC classifier performance for two values of Pd. Results are shown
for the linear filter versus four different types of nonlinear training. N: white noise
training, G-S: Gram-Schmidt orthogonalization, subN: PCA subspace noise, C-H:
convex hull rejection class ....................................... 81

4 Comparison of ROC classifier performance for two values of Pd. Results are shown
for the linear filter versus experiments III and IV from section 4.6 and mutual
information feature extraction. The symbols indicate the type of rejection class
exemplars used. N: white noise training, G-S: Gram-Schmidt orthogonalization,
subN: PCA subspace noise, C-H: convex hull rejection class.............. 145












Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy


NONLINEAR EXTENSIONS TO THE MINIMUM AVERAGE
CORRELATION ENERGY FILTER


By

John W. Fisher III

May 1997

Chairman: Dr. Jose C. Principe
Major Department: Electrical and Computer Engineering


The major goal of this research is to develop efficient methods by which the family of

distortion invariant filters, specifically the minimum average correlation energy (MACE)

filter, can be extended to a general nonlinear signal processing framework. The primary

application of MACE filters has been to pattern classification of images. Two desirable

qualities of MACE-type correlators are ease of implementation via correlation and ana-

lytic computation of the filter coefficients.

Our motivation for exploring nonlinear extensions to these filters is due to the well-

known limitations of the linear systems approach to classification. Among these limita-

tions is the attempt to solve the classification problem in a signal representation space,

whereas the classification problem is more properly solved in a decision or probability

space. An additional limitation of the MACE filter is that it can only be used to realize a

linear decision surface regardless of the means by which it is computed. These limitations

lead to suboptimal classification and discrimination performance.








Extension to nonlinear signal processing is not without cost. Solutions must in general

be computed iteratively. Our approach was motivated by the early proof that the MACE

filter is equivalent to the linear associative memory (LAM). The associative memory per-

spective is more properly associated with the classification problem and has been devel-

oped extensively in an iterative framework.

In this thesis we demonstrate a method emphasizing a statistical perspective of the

MACE filter optimization criterion. Through the statistical perspective efficient methods

of representing the rejection and recognition classes are derived. This, in turn, enables a

machine learning approach and the synthesis of more powerful nonlinear discriminant

functions which maintain the desirable properties of the linear MACE filter, namely, local-

ized detection and shift invariance.

We also present a new information theoretic approach to training in a self-organized or

supervised manner. Information theoretic signal processing looks beyond the second order

statistical characterization inherent in the linear systems approach. The information theo-

retic framework probes the probability space of the signal under analysis. This technique

has wide application beyond nonlinear MACE filter techniques and represents a powerful

new advance to the area of information theoretic signal processing.

Empirical results, comparing the classical linear methodology to the nonlinear exten-

sions, are presented using inverse synthetic aperture radar (ISAR) imagery. The results

demonstrate the superior classification performance of the nonlinear MACE filter.













CHAPTER 1

INTRODUCTION

1.1 Motivation

Automatic target detection and recognition (ATD/R) is a field of pattern recognition.

The goal of an ATD/R system is to quickly and automatically detect and classify objects

which may be present within large amounts of data (typically imagery) with a minimum of

human intervention. In an ATD/R system, it is not only desirable to recognize various tar-

gets, but also to locate them with some degree of accuracy. The minimum average correlation

energy (MACE) filter [Mahalanobis et al., 1987] is of interest to the ATD/R problem due

to its localization and discrimination properties. The MACE filter is a member of a family

of correlation filters derived from the synthetic discriminant function (SDF) [Hester and

Casasent, 1980]. The SDF and its variants have been widely applied to the ATD/R prob-

lem. We will describe synthetic discriminant functions in more detail in chapter 2. Other

generalizations of the SDF include the minimum variance synthetic discriminant function

(MVSDF) [Kumar, 1986], the MACE filter, and more recently the gaussian minimum

average correlation energy (G-MACE) [Casasent et al., 1991] and the minimum noise and

correlation energy (MINACE) [Ravichandran and Casasent, 1992] filters.

This area of filter design is commonly referred to as distortion-invariant filtering. It is a

generalization of matched spatial filtering for the detection of a single object to the detec-

tion of a class of objects, usually in the image domain. Typically the object class is repre-

sented by a set of exemplars. The exemplar images represent the image class through a







range of "distortions" such as a variation in viewing aspect of a single object. The goal is

to design a single filter which will recognize an object class through the entire range of

distortion. Under the design criterion the filter is equally matched to the entire range of

distortion as opposed to a single viewpoint as in a matched filter. Hence the nomenclature

distortion-invariant filtering [Kumar, 1992].

The bulk of the research using these types of filters has focused on optical and infra-

red (IR) imagery and overcoming recognition problems in the presence of distortions asso-

ciated with 3-D to 2-D mappings, e.g. scale and rotation (in-plane and out-of-plane).

Recently, however, this technique has been applied to radar imagery [Novak et al., 1994;

Fisher and Principe, 1995a; Chiang et al., 1995]. In contrast to optical or infra-red imag-

ery, the scale of each pixel within a radar image is usually constant and known. Conse-

quently, radar imagery does not suffer from scale distortions of objects.

In the family of distortion invariant filters, the MACE filter has been shown to possess

superior discrimination properties [Mahalanobis et al., 1987, Casasent and Ravichandran,

1992]. It is for this reason that this work emphasizes nonlinear extensions to the MACE

filter. The MACE filter and its variants are designed to produce a narrow, constrained-

amplitude peak response when the filter mask is centered on a target in the recognition

class while minimizing the energy in the rest of the output plane. This property provides

desirable localization for detection. Another property of the MACE filter is that it is less

susceptible to out-of-class false alarms [Mahalanobis et al., 1987]. While the focus of this

work will be on the MACE filter criterion, it should be stated that all of the results pre-

sented here are equally applicable to any of the distortion invariant filters mentioned above

with appropriate changes to the respective optimization criteria.







Although the MACE filter does have superior false alarm properties, it also has some

fundamental limitations. Since it is a linear filter, it can only be used to realize linear deci-

sion surfaces. It has also been shown to be limited in its ability to generalize to exemplars

that are in the recognition class (but not in the training set), while simultaneously rejecting

out-of-class inputs [Casasent and Ravichandran, 1992; Casasent et al., 1991]. The number

of design exemplars can be increased in order to overcome generalization problems; how-

ever, the calculation of the filter coefficients becomes computationally prohibitive and

numerically unstable as the number of design exemplars is increased [Kumar, 1992]. The

MINACE and G-MACE variations have improved generalization properties with a slight

degradation in the average output plane variance [Ravichandran and Casasent, 1992] and

sharpness of the central peak [Casasent et al., 1991], respectively.

This research presents a basis by which the MACE filter, and by extension all linear

distortion invariant filters, can be extended to a more general nonlinear signal processing

framework. In the development it is shown that the performance of the linear MACE filter

can be improved upon in terms of generalization while maintaining its desirable proper-

ties, i.e. sharp, constrained peak at the center of the output plane.

A more detailed description of the developmental progression of distortion invariant

filtering is given in chapter 2. In this chapter a qualitative comparison of the various distor-

tion invariant filters is presented using inverse synthetic aperture radar (ISAR) imagery.

The application of pattern recognition techniques to high-resolution radar imagery has

become a topic of great interest recently with the advent of widely available instrumenta-

tion grade imaging radars. High resolution radar imagery poses a special challenge to dis-

tortion invariant filtering in that sources of distortion such as rotation in aspect of an







object do not manifest themselves as rotations within the radar image (as opposed to opti-

cal imagery). In this case the distortion is not purely geometric, but more abstract.

Chapter 3 presents a derivation of the MACE filter as a special case of Kohonen's lin-

ear associative memory [1988]. This relationship is important in that the associative mem-

ory perspective is the starting point for developing nonlinear extensions to the MACE

filter.

In chapter 4 the basis upon which the MACE filter can be extended to nonlinear adaptive

systems is developed. In this chapter a nonlinear architecture is proposed for the extension

of the MACE filter. A statistical perspective of the MACE filter is discussed which leads

naturally into a class representational viewpoint of the optimization criterion of distortion

invariant filters. Commonly used measures of generalization for distortion invariant filter-

ing are also discussed. The results of the experiments presented show that the measures

are not appropriate for the task of classification. It is interesting to note that the analysis

indicates the appropriateness of the measures is independent of whether the mapping is

linear or nonlinear. The analysis also discusses the merit of the MACE filter optimization

criterion in the context of classification and with regards to measures of generalization.

The chapter concludes with a series of experiments further refining the techniques by

which nonlinear MACE filters are computed.

Chapter 5 presents a new information theoretic method for feature extraction. An

information theoretic approach is motivated by the observation that the optimization crite-

rion of the MACE filter only considers the second-order statistics of the rejection class.

The information theoretic approach, however, operates in probability space, exploiting

properties of the underlying probability density function. The method enables the extrac-






tion of statistically independent features. The method has wide application beyond nonlin-

ear extensions to MACE filters and as such represents a powerful new technique for

information theoretic signal processing. A review of information theoretic approaches to

signal processing is presented in this chapter. This is followed by the derivation of the

new technique as well as some general experimental results which are not specifically

related to nonlinear MACE filters, but which serve to illustrate the potential of this

method. Finally the logical placement of this method within nonlinear MACE filters is

presented along with experimental results.

In chapter 6 we review the significant results and contributions of this dissertation. We

also discuss possible lines of research resulting from the base established here.













CHAPTER 2

BACKGROUND

2.1 Discussion of Distortion Invariant Filters

As stated, distortion invariant filtering is a generalization of matched spatial filtering.

It is well known that the matched filter maximizes the peak-signal-to-average-noise power

ratio as measured at the filter output at a specific sample location when the input signal is

corrupted by additive white noise.

In the discrete signal case the design of a matched filter is equivalent to the following

vector optimization problem [Kumar, 1986]:

$$\min_h \; h^\dagger h \quad \text{s.t.} \quad x^\dagger h = d, \qquad h, x \in \mathbb{C}^{N \times 1}$$

where the column vector x contains the N coefficients of the signal we wish to detect, h

contains the coefficients of the filter († indicates the Hermitian transpose operator), and d

is a positive scalar. This notation is also suitable for N-dimensional signal processing as

long as the signal and filter have finite support and are re-ordered in the same lexico-

graphic manner (e.g. by row or column in the two-dimensional case) into column vectors.

The optimal solution to this problem is


$$h = x (x^\dagger x)^{-1} d.$$








Given this solution we can calculate the peak output signal power as


$$\left(x^\dagger h\right)^2 = \left(x^\dagger x (x^\dagger x)^{-1} d\right)^2 = d^2$$

and the average output noise power due to an additive white noise input


$$\sigma_o^2 = E\{h^\dagger n n^\dagger h\} = h^\dagger \Sigma_n h = \sigma_n^2 h^\dagger h = \sigma_n^2 d^2 (x^\dagger x)^{-1}$$

where σn² is the input noise variance, resulting in a peak-signal-to-average-noise

output power ratio of

$$\frac{d^2}{\sigma_o^2} = \frac{d^2}{\sigma_n^2 d^2 (x^\dagger x)^{-1}} = \frac{x^\dagger x}{\sigma_n^2}.$$

As we can see, the result is independent of the choice of scalar, d. If d is set to unity,

the result is a normalized matched spatial filter [Vander Lugt, 1964].
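
The closed-form result above is simple enough to verify numerically. The following Python/NumPy sketch (an illustration added here, not part of the original text; the random signal and noise level are stand-ins for an actual image vector) computes the matched filter for a single exemplar and checks that the measured peak-signal-to-average-noise power ratio equals x†x/σn².

import numpy as np

rng = np.random.default_rng(0)

N = 64 * 64                        # length of the lexicographically re-ordered image vector
x = rng.standard_normal(N)         # stand-in for the signal (image) we wish to detect
d = 1.0                            # desired output at the constraint point
sigma_n = 0.5                      # standard deviation of the additive white noise

# Matched filter for the white-noise case: h = x (x'x)^{-1} d
h = x * d / (x @ x)

peak_power = (x @ h) ** 2                    # (x'h)^2 = d^2
noise_power = sigma_n ** 2 * (h @ h)         # sigma_n^2 d^2 (x'x)^{-1}

print(peak_power / noise_power)              # measured output SNR
print((x @ x) / sigma_n ** 2)                # predicted SNR: x'x / sigma_n^2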

In order to further motivate the concept of distortion invariant filtering, a typical ATR

example problem will be used for illustration. This experiment will also help to illustrate

the genesis of the various types of distortion invariant filtering approaches beginning with

the matched spatial filter (MSF).

Inverse synthetic aperture radar (ISAR) imagery will be used for all of the experiments

presented herein. Distortion invariant filtering, however, is not limited to ISAR imag-

ery and in fact can be extended to much more abstract data types. ISAR images are shown








in figure 1. In the figure, three vehicles are displayed, each at three different radar viewing

aspect angles (5, 45, and 85 degrees), where the aspect angle is the direction of the front of

the vehicle relative to the radar antenna. The image dimensions are 64 x 64 pixels. Radar

systems measure a quantity called radar cross section (RCS). When a radar transmits an

electromagnetic pulse, some of the incident energy on an object is reflected back to the

radar. RCS is a measure of the reflected energy detected by the radar's receiving antenna.

ISAR imagery is the result of a radar signal processing technique which uses multiple

detected radar returns measured over a range of relative object aspect angles. Each pixel in

an ISAR image is a measure of the aggregate radar cross section at regularly sampled

points in space.

Two types of vehicles are shown. Vehicle type 1 will represent a recognition class,

while vehicle type 2 will represent a confusion class. The goal is to compute a filter which

will recognize vehicle type 1 without being confused by vehicle 2. Images of vehicle 1a

will be used to compute the filter coefficients. Vehicles 1b and 2a represent an independent

testing class.

ISAR images of all three vehicles were formed in the aspect range of 5 to 85 degrees at

1 degree increments. As the MSF is derived from a single vehicle image, an image of vehi-

cle 1a at 45 degrees (the midpoint of the aspect range) is used.

The peak output response to an image represents the maximum of the cross-correlation

function of the image with the MSF template. The peak output response over the entire

aspect range of vehicle 1a is shown in figure 2. As can be seen in the figure, the filter

matches at 45 degrees very well; however, as the aspect moves away from 45 degrees, the




















vehicle 1a (training)

vehicle 1b (testing)

vehicle 2a (testing)
Figure 1. ISAR images of two vehicle types. Vehicles are shown at aspect angles
of 5, 45, and 85 degrees respectively. Two different vehicles of type 1 (a
and b) are shown, while one vehicle of type 2 (a) is shown. Vehicle 1a
is used as a training vehicle, while vehicle 1b is used as the testing
vehicle for the recognition class. Vehicle 2a represents a confusion
vehicle.

peak output response begins to degrade. Depending on the type of imagery as well as the


vehicle, this degradation can become very severe.









Figure 2. MSF peak output response of training vehicle 1a over all aspect angles.
Peak response degrades as aspect difference increases.

The peak output responses of both vehicles in the testing set are shown in figure 3

overlain on the training image response. In one sense the filter exhibits good generaliza-

tion, that is, the peak response to vehicle 1b is much the same as a function of aspect as the

peak response to vehicle 1a. However, the filter also "generalizes" equally well to vehi-

cle 2a, which is undesirable. As a vehicle discrimination test (vehicle 1 from vehicle 2) the

MSF fails.











Figure 3. MSF peak output response of testing vehicles 1b and 2a over all aspect
angles. Responses are overlaid on training vehicle response. Filter
responses to vehicles 1b (dashed line) and 2a (dashed-dot) do not differ
significantly.








The output image plane response to a single image of vehicle 1a is shown in figure 4.

Refinements to the distortion invariant filter approach, namely the MACE filter, will show

that the localization of this output response, as measured by the sharpness of the peak, can

be improved significantly.

Figure 4. MSF output image plane response.


2.1.1 Synthetic Discriminant Function

The degradation evidenced in figures 2 and 3 was the primary motivation for the syn-

thetic discriminant function (SDF)[Hester and Casasent, 1980]. A shortcoming of the

MSF, from the standpoint of distortion invariant filtering, is that it is only optimum for a

single image. One approach would be to design a bank of MSFs operating in parallel

which were matched to the distortion range. The typical ATR system, however, must rec-

ognize/discriminate multiple vehicle types, and so from an implementation standpoint

alone parallel MSFs are an impractical choice. Hester and Casasent set out to design a sin-






gle filter which could be matched to multiple images using the idea of superposition. This

approach was possible due to the large number of coefficients (degrees of freedom) that

typically constitute 2-D image templates. For historical reasons, specifically that the filters

in question were synthesized optically using holographic techniques [Vander Lugt, 1964],

it was hypothesized that such a filter could be synthesized from linear combinations of a

set of exemplar images.

The filter synthesis procedure consists of projecting the exemplar images onto an

ortho-normal basis (originally Gram-Schmidt orthogonalization was used to generate the

basis). The next step is to determine the coefficients with which to linearly combine the

basis vectors such that a desired response for each original image exemplar is obtained

[Hester and Casasent, 1980].

The proposed synthesis procedure is a bit convoluted. It turns out that the choice of

ortho-normal basis is irrelevant. As long as the basis spans the space of the original exem-

plar images the result is always the same. The development of Kumar [1986] is more use-

ful for depicting the SDF as a generalization of the matched filter (for the white noise

case) to multiple signals. The SDF can be cast as the solution to the following optimiza-

tion problem


$$\min_h \; h^\dagger h \quad \text{s.t.} \quad X^\dagger h = d, \qquad h \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; d \in \mathbb{C}^{N_t \times 1}$$

where X is now a matrix whose Nt columns comprise a set of training images¹ we wish

to detect, d is a column vector of desired outputs (one for each of the training exemplars)


1. Since these filters have been applied primarily to 2D images, signals will be referred to
as images or exemplars from this point on. In the vector notation, all N1 x N2 images are
re-ordered (by row or column) into N x 1 column vectors, where N = N1N2.






14

and is typically set to all unity values for the recognition class. The images of the data

matrix X comprise the range of distortion that the implemented filter is expected to

encounter. It is assumed that Nt < N and so the problem formulation is a quadratic optimi-

zation subject to an under-determined system of linear constraints. The optimal solution is


$$h = X (X^\dagger X)^{-1} d.$$

When there is only one training exemplar (Nt = 1) and d is unity the SDF defaults to

the normalized matched filter. Similar to the matched filter (white noise case), the SDF is

the linear filter which minimizes the white noise response while satisfying the set of linear

constraints over the training exemplars.
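
A compact numerical sketch of the SDF solution is given below (Python/NumPy; the random exemplar matrix is a stand-in for the re-ordered training images, and the function name sdf is introduced here purely for illustration).

import numpy as np

def sdf(X, d):
    """Synthetic discriminant function: minimize h'h subject to X'h = d.

    X : (N, Nt) matrix whose columns are re-ordered training exemplars.
    d : (Nt,) vector of desired outputs (all ones for the recognition class).
    Closed-form solution: h = X (X'X)^{-1} d.
    """
    return X @ np.linalg.solve(X.T @ X, d)

rng = np.random.default_rng(0)
N, Nt = 64 * 64, 21                     # image size and number of training exemplars
X = rng.standard_normal((N, Nt))        # stand-in exemplar matrix
d = np.ones(Nt)

h = sdf(X, d)
print(np.allclose(X.T @ h, d))          # the equality constraints hold exactly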

By way of example, the SDF technique is tested against the ISAR data as in the MSF

case. Exemplar images from vehicle 1a were selected at every 4 degrees in aspect from 5 to

85 degrees for a total of 21 exemplar images (i.e. Nt = 21). Figure 5 shows the peak out-

put response over all aspects of the training vehicle (1a). As seen in the figure, the degra-

dation as the aspect changes is removed. The MSF response has been overlaid to highlight

the differences.

The peak output response over all exemplars in the testing set is shown in figure 6.

From the perspective of peak response, the filter generalizes fairly well. However, as in the

MSF, the usefulness of the filter as a discriminant between vehicles 1 and 2 is clearly lim-

ited.

Figure 7 shows the resulting output plane response when the SDF filter is correlated

with a single image of vehicle 1a. The localization of the peak is similar to the MSF case.









Figure 5. SDF peak output response of training vehicle 1a over all aspect angles.
The MSF response is also shown (dashed line). The degradation in the
peak response has been corrected.

2.1.2 Minimum Variance Synthetic Discriminant Function

The SDF approach seemingly solved the problem of generalizing a matched filter to

multiple images. However, the SDF has no built-in noise tolerance by design (except for

the white noise case). Furthermore, in practice, it would turn out that occasionally the

noise response would be higher than the peak object response depending on the type of

imagery. As a result, detection by means of searching for correlation peaks was shown to

be unreliable for some types of imagery, specifically imagery which contains recognition

class images embedded in non-white noise[Kumar, 1992]. Kumar [1986] proposed a

method by which noise tolerance could be built in to the filter design. This technique was

termed the minimum variance synthetic discriminant function (MVSDF). The MVSDF is








Figure 6. SDF peak output response of testing vehicles 1b and 2a over all aspect
angles. The dashed line is vehicle 1b while the dashed-dot line is
vehicle 2a.

the correlation filter which minimizes the output variance due to zero-mean input noise

while satisfying the same linear constraints as the SDF. The output noise variance can be

shown to be h†Σnh, where h is the vector of filter coefficients and Σn is the covariance

matrix of the noise [Kumar, 1986].

Mathematically the problem formulation is


$$\min_h \; h^\dagger \Sigma_n h \quad \text{s.t.} \quad X^\dagger h = d, \qquad h \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; \Sigma_n \in \mathbb{C}^{N \times N},\; d \in \mathbb{C}^{N_t \times 1}$$




























Figure 7. SDF output image plane response.

with the optimal solution


$$h = \Sigma_n^{-1} X (X^\dagger \Sigma_n^{-1} X)^{-1} d.$$

In the case of white noise, the MVSDF is equivalent to the SDF. This technique has a

significant numerical complexity issue, which is that the solution requires the inversion of

an N x N matrix (Σn), which for moderate image sizes (N = N1N2) can be quite large

and computationally prohibitive, unless simplifying assumptions can be made about its

form (e.g. a diagonal matrix, Toeplitz, etc.).

The MVSDF can be seen as a more general extension of the matched filter to multiple

vector detection as most signal processing definitions of the matched filter incorporate a

noise power spectrum and do not assume the white noise case only. It is mentioned here

because it is the first distortion invariant filtering technique to recognize the need to char-

acterize a rejection class.
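
For completeness, the MVSDF solution can be sketched in the same style (Python/NumPy with stand-in data; the helper name mvsdf is illustrative only). For white noise, Σn is the identity and the result reduces to the SDF, as stated above.

import numpy as np

def mvsdf(X, Sigma_n, d):
    """Minimum variance SDF: minimize h' Sigma_n h subject to X'h = d.

    Closed-form solution: h = Sigma_n^{-1} X (X' Sigma_n^{-1} X)^{-1} d.
    Forming and inverting the full N x N covariance is only practical for
    modest N or when Sigma_n has exploitable structure (diagonal, Toeplitz, ...).
    """
    Sinv_X = np.linalg.solve(Sigma_n, X)
    return Sinv_X @ np.linalg.solve(X.T @ Sinv_X, d)

rng = np.random.default_rng(0)
N, Nt = 256, 8
X = rng.standard_normal((N, Nt))
d = np.ones(Nt)

h_white = mvsdf(X, np.eye(N), d)              # white-noise covariance
h_sdf = X @ np.linalg.solve(X.T @ X, d)       # plain SDF
print(np.allclose(h_white, h_sdf))            # identical, as expected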






2.1.3 Minimum Average Correlation Energy Filter

The MVSDF (and the SDF) control the output of the filter at a single point in the out-

put plane of the filter. In practice large sidelobes may be exhibited in the output plane

making detection difficult. These difficulties led Mahalanobis et al [1987] to propose the

minimum average correlation energy (MACE) filter. This development in distortion invari-

ant filtering attempts as its design goal to control not only the output point when the image

is centered on the filter, but the response of the entire output plane as well. Specifically it

minimizes the average correlation energy of the output over the training exemplars subject

to the same linear constraints as the MVSDF and SDF filters.

The problem is formulated in the frequency domain using Parseval relationships. In

the frequency domain, the formulation is


$$\min_H \; H^\dagger D H \quad \text{s.t.} \quad X^\dagger H = d, \qquad H \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; D \in \mathbb{C}^{N \times N},\; d \in \mathbb{C}^{N_t \times 1}$$

where D is a diagonal matrix whose diagonal elements are the coefficients of the average

2-D power spectrum of the training exemplars. The form of the quadratic criterion is

derived using Parseval's relationship. A derivation is given in section A.1 of the appendix.

The other terms, H and X, contain the 2-D DFT coefficients of the filter and training

exemplars, respectively. The vector d is the same as in the MVSDF and SDF cases. The

optimal solution, in the frequency domain, is



$$H = D^{-1} X (X^\dagger D^{-1} X)^{-1} d. \qquad (1)$$

As in the MVSDF, the solution requires the inversion of an N x N matrix, but in this

case the matrix D is diagonal and so its inversion is trivial. When the noise covariance






matrix is estimated from observations of noise sequences (assuming wide-sense stationar-

ity and ergodicity) the MVSDF can also be formulated in the frequency domain,

and the complex matrix inversion is avoided. A derivation of this is given in appendix

A; examination of equations (95), (96), and (97) shows that under the assumption that the

noise class can be modeled as a stationary, ergodic random noise process the solution of

the MVSDF can be found in the spectral domain using the estimated power spectrum of

the noise process and equation (1).
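
A frequency-domain sketch of equation (1) follows (Python/NumPy; the random images are stand-ins for the ISAR exemplars, and the helper name mace is illustrative). The diagonal matrix D is never formed explicitly; its inverse is applied element-wise.

import numpy as np

def mace(images, d):
    """MACE filter, equation (1): H = D^{-1} X (X' D^{-1} X)^{-1} d.

    images : (Nt, N1, N2) training exemplars.
    d      : (Nt,) desired values at the constraint point.
    Returns the (N1*N2,) vector of 2-D DFT coefficients of the filter.
    """
    Nt = images.shape[0]
    X = np.fft.fft2(images).reshape(Nt, -1).T        # (N, Nt) DFT coefficients, one column per exemplar
    Dvec = np.mean(np.abs(X) ** 2, axis=1)           # diagonal of D: average 2-D power spectrum
    Dinv_X = X / Dvec[:, None]                       # D^{-1} X applied element-wise (D is diagonal)
    return Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)

rng = np.random.default_rng(0)
imgs = rng.random((21, 64, 64))                      # stand-in for the 21 training exemplars
H = mace(imgs, np.ones(21))

X = np.fft.fft2(imgs).reshape(21, -1).T
print(np.allclose(X.conj().T @ H, np.ones(21)))      # the point constraints are satisfied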

In practice, the MACE filter performs better than the MVSDF with respect to rejecting

out-of-class input images. The MACE filter, however, has been shown to have poor gener-

alization properties, that is, images in the recognition class but not in the training exemplar

set are not recognized.

A MACE filter was computed using the same exemplar images as in the SDF example.

Figure 8 shows the resulting output image plane response for one image. As can be seen in

the figure, the peak in the center is now highly localized. In fact it can be shown [Mahal-

anobis et al., 1987] that over the training exemplars (those used to compute the filter) the

output peak will always be at the constraint location.

Generalization to between-aspect images, as mentioned, is a problem for the MACE

filter. Figure 9 shows the peak output response over all aspect angles. As can be seen in the

figure, the peak response degrades severely for aspects between the exemplars used to

compute the filter. Furthermore, from a peak output response viewpoint, generalization to

vehicle lb is also worse. However, unlike the previous techniques, we now begin to see

some separation between the two vehicle types as represented by their peak response.





























Figure 8. MACE filter output image plane response.

2.1.4 Optimal Trade-off Synthetic Discriminant Function

The final distortion invariant filtering technique which will be discussed here is the

method proposed by Réfrégier and Figue [1991], known as the optimal trade-off syn-

thetic discriminant function (OTSDF). Suppose that the designer wishes to optimize over

multiple quadratic optimization criteria (e.g. average correlation energy and output noise

variance) subject to the same set of equality constraints as in the previous distortion invari-

ant filters. We can represent the individual optimization criterion by


$$J_i = h^\dagger Q_i h,$$

where Qi is an N x N symmetric, positive-definite matrix (e.g. Qi = Σn for the MVSDF

optimization criterion).

The OTSDF is a method by which a set of quadratic optimization criteria may be

optimally traded off against each other; that is, one criterion can be minimized with mini-









Figure 9. MACE peak output response of vehicles 1a, 1b and 2a over all aspect
angles. Degradation to between-aspect exemplars is evident.
Generalization to the testing vehicles as measured by peak output
response is also poorer. Vehicle 1a is the solid line, 1b is the dashed line
and 2a is the dashed-dot line.

mum penalty to the rest. The solution to all such filters can be characterized by the equa-

tion



$$h = Q^{-1} X (X^\dagger Q^{-1} X)^{-1} d, \qquad (2)$$

where, assuming M different criteria,


$$Q = \sum_{i=1}^{M} \lambda_i Q_i, \qquad \sum_{i=1}^{M} \lambda_i = 1, \qquad 0 \le \lambda_i \le 1.$$








The possible solutions, parameterized by λi, define a performance bound which can-

not be exceeded by any linear system with respect to the optimization criteria and the

equality constraints. All such linear filters which optimally trade-off a set of quadratic cri-

teria are referred to as optimal trade-off synthetic discriminant functions.

We may, for example, wish to trade off the MACE filter criterion against the MVSDF

filter criterion. This presents the added difficulty that one criterion is specified in the space

domain and the other in the spectral domain. If the noise is represented as zero-mean, sta-

tionary, and ergodic (if the covariance is to be estimated from samples) we can, as men-

tioned, transform the MVSDF criterion to the spectral domain. In this case the optimal

filter has the frequency domain solution,


$$H = \left[\lambda D_n + (1-\lambda) D_x\right]^{-1} X \left[ X^\dagger \left[\lambda D_n + (1-\lambda) D_x\right]^{-1} X \right]^{-1} d = D_\lambda^{-1} X \left[ X^\dagger D_\lambda^{-1} X \right]^{-1} d$$

where Dλ = λDn + (1 - λ)Dx, 0 ≤ λ ≤ 1, and Dn, Dx are diagonal matrices whose

diagonal elements contain the estimated power spectrum coefficients of the noise class and

the recognition class, respectively. The performance bound of such a filter would resemble

figure 10, where all linear filters would fall in the darkened region and all optimal trade-off

filters would lie somewhere on the boundary.

By way of example we again use the data from the MACE and SDF examples. In this

case we will construct an OTSDF which trades off the MACE filter criterion for the SDF

criterion. In order to transform the SDF to the spectral domain, we will assume that the

noise class is zero-mean, stationary, white noise. The power spectrum is therefore flat. One

of the issues for constructing an OTSDF is how to set the value of λ, which represents the











[Plot: the realizable region of performance in the plane of noise variance versus
average correlation energy, bounded by the MACE and MVSDF solutions.]


Figure 10. Example of a typical OTSDF performance plot. This plot shows the
trade-off, hypothetically, between the ACE criterion and a noise
variance criterion. The curved arrow on the performance bound indicates
the direction of increasing λ for the two-criterion case. The curve is
bounded by the MACE and MVSDF results.

degree by which one criterion is emphasized over another. We will not address that issue

here, but simply set the value to λ = 0.95, indicating more emphasis on the MACE filter

criterion.
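
The frequency-domain OTSDF can be sketched analogously (Python/NumPy with stand-in data; the helper name otsdf is illustrative only). Following the reconstructed equation above, Dλ is formed as λDn + (1 − λ)Dx, with a flat Dn for the white-noise assumption used in this example.

import numpy as np

def otsdf(images, d, lam, Dn=None):
    """Optimal trade-off SDF in the frequency domain.

    D_lambda = lam * Dn + (1 - lam) * Dx, where Dx is the average power
    spectrum of the training exemplars and Dn is the noise power spectrum
    (flat, i.e. all ones, for the white-noise case).
    H = D_lambda^{-1} X (X' D_lambda^{-1} X)^{-1} d.
    """
    Nt = images.shape[0]
    X = np.fft.fft2(images).reshape(Nt, -1).T
    Dx = np.mean(np.abs(X) ** 2, axis=1)
    if Dn is None:
        Dn = np.ones_like(Dx)
    Dlam = lam * Dn + (1.0 - lam) * Dx
    Dinv_X = X / Dlam[:, None]
    return Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)

rng = np.random.default_rng(0)
imgs = rng.random((21, 64, 64))                 # stand-in for the training exemplars
H = otsdf(imgs, np.ones(21), lam=0.95)          # weighting as in the example above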

The output plane response of the OTSDF is shown in figure 11. As compared to the

MACE filter response, the output peak is not nearly as sharp, but still more localized than

the SDF case.

The peak output response over the training vehicle for the OTSDF is compared to the

MACE filter in figure 12. The degradation to between-aspect exemplars is less severe than for

the MACE filter. The peak output response of vehicles 1b and 2a is shown in figure 13.































Figure 11. OTSDF filter output image plane response.

As compared to the MACE filter the peak response is improved over the testing set. Sepa-

ration between the two vehicle types appears to be maintained.


2.2 Pre-processor/SDF Decomposition

In the sample domain, the SDF family of correlation filters is equivalent to a cascade

of a linear pre-processor followed by a linear correlator [Mahalanobis et al., 1987; Kumar,

1992]. This is illustrated in figure 14 with vector operations. The pre-processor, in the case

of the MACE filter, is a pre-whitening filter computed on the basis of the average power

spectrum of the recognition class training exemplars. In the case of the MVSDF the pre-

processor is a pre-whitening filter computed on the basis of the covariance matrix of the

noise. The net result is that after pre-processing, the second processor is an SDF computed

over the pre-processed exemplars.
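
This equivalence can be checked numerically in a few lines (Python/NumPy sketch with stand-in exemplars; not from the dissertation). The MACE filter computed directly coincides with a pre-whitening step D^{-1/2} followed by an SDF computed over the whitened exemplars, with the pre-whitener folded back into the overall filter.

import numpy as np

rng = np.random.default_rng(0)
imgs = rng.random((21, 32, 32))                       # stand-in exemplars
Nt = imgs.shape[0]
X = np.fft.fft2(imgs).reshape(Nt, -1).T               # spectral exemplar matrix (N, Nt)
Dvec = np.mean(np.abs(X) ** 2, axis=1)                # average power spectrum (diagonal of D)
d = np.ones(Nt)

# Direct MACE solution: H = D^{-1} X (X' D^{-1} X)^{-1} d
Dinv_X = X / Dvec[:, None]
H_mace = Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)

# Cascade view: pre-whiten the exemplars with D^{-1/2}, then compute an SDF on them
Y = X / np.sqrt(Dvec)[:, None]                        # pre-processed (whitened) exemplars
H_sdf = Y @ np.linalg.solve(Y.conj().T @ Y, d)        # SDF over the pre-processed exemplars
H_cascade = H_sdf / np.sqrt(Dvec)                     # pre-whitener folded back into the filter

print(np.allclose(H_mace, H_cascade))                 # the two constructions coincide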








Figure 12. OTSDF peak output response of vehicle 1a over all aspect angles.
Degradation to between-aspect exemplars is less than in the MACE
filter (shown as dashed line).

The primary contribution of this research will be to extend the ideas of MACE filtering

to a general nonlinear signal processing architecture and accompanying classification

framework. These extensions will focus on processing structures which improve the gen-

eralization and discrimination properties while maintaining the shift-invariance and local-

ization detection properties of the linear MACE filter.




































Figure 13. OTSDF peak output response of vehicles 1b and 2a over all aspect
angles. Generalization is better than in the MACE filter. Vehicle 1b is
shown in dashed line, vehicle 2a is shown in dashed-dot line.


[Block diagram: input image x → pre-processor, y = Ax → SDF, h = y(y†y)⁻¹d → scalar output]

Figure 14. Decomposition of distortion invariant filter in space domain. The
notation used assumes that the image and filter coefficients have been
re-ordered into vectors. The input image vector, x, is pre-processed
by the linear transformation, y = Ax. The resulting vector is
processed by a synthetic discriminant function, y_out = y†h.















CHAPTER 3

THE MACE FILTER AS AN ASSOCIATIVE MEMORY

3.1 Linear Systems as Classifiers

In this chapter we present the MACE filter from the perspective of associative memo-

ries. This perspective is important because it leads to a machine-learning and classification

framework and consequently a means by which to determine the parameters of a nonlinear

mapping via gradient search techniques. We shall refer, herein, to the machine learning/

gradient search methods as an iterative framework. The techniques are iterative in the

sense that adaptations to the mapping parameters are computed sequentially and repeatedly

over a set of exemplars. We shall show that the iterative and classification framework com-

bined with a nonlinear system architecture have distinct advantages over the linear frame-

work of distortion invariant filters.

As we have stated, distortion invariant filters can only realize linear discriminant func-

tions. We begin, therefore, by considering linear systems used as classifiers. The adaline

architecture [Widrow and Hoff, 1960], depicted in figure 15, is an example of a linear sys-

tem used for pattern classification. A pattern, represented by the coefficients xi, is applied

to a linear combiner, represented by the weight coefficients wi; the resulting output y is








then applied to a hard limiter which assigns a class to the input pattern. Mathematically

this can be represented by


$$c = \mathrm{sgn}(y - \rho) = \mathrm{sgn}(w^T x - \rho)$$

where sgn( ) is the signum function, ρ is a threshold, and w, x ∈ ℝ^{N×1} are column

vectors containing the coefficients of the pattern and combiner weights, respectively. In

the context of classification, this architecture is trained iteratively using the least mean

square (LMS) algorithm [Widrow and Hoff, 1960]. For a two-class problem the desired

output, d in the figure, is set to ±1 depending on the class of the input pattern; the LMS

algorithm then minimizes the mean square error (MSE) between the classification output

c and the desired output. Since the error function, εc, can only take on the three values ±2

and 0, minimization of the MSE is equivalent to minimizing the average number of actual

errors.
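
A minimal sketch of an adaline trained with the LMS rule follows (Python/NumPy; the two Gaussian pattern classes, the step size, and the ±1 targets are stand-in choices for illustration only).

import numpy as np

rng = np.random.default_rng(0)

# Two linearly separable stand-in pattern classes, one pattern per row
N, n_per_class = 16, 200
X = np.vstack([rng.standard_normal((n_per_class, N)) - 1.0,
               rng.standard_normal((n_per_class, N)) + 1.0])
d = np.hstack([-np.ones(n_per_class), np.ones(n_per_class)])   # desired outputs +/- 1

w = np.zeros(N)          # combiner weights
rho = 0.0                # threshold
mu = 0.01                # LMS step size

for epoch in range(50):
    for i in rng.permutation(len(d)):
        y = w @ X[i] - rho              # linear combiner output (before the hard limiter)
        e = d[i] - y                    # error measured at the combiner output, not the classifier output
        w += mu * e * X[i]              # LMS update
        rho -= mu * e

c = np.sign(X @ w - rho)                # classification after the signum nonlinearity
print(np.mean(c == d))                  # fraction of training patterns correctly classified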







There are several observations to be made about the adaline/LMS approach to classifi-

cation. One observation is that the adaptation process described uses the error, ε, as mea-

sured at the output of the linear combiner to drive the adaptation process and not the actual

classification error, εc. Another observation is that this approach presupposes that the pat-

tern classes can be linearly separated. A final point, on which we will have more to say, is

that the method uses the MSE criterion as a proxy for classification.


3.2 MSE Criterion as a Proxy for Classification Performance

As we have pointed out, the adaline/LMS approach to classification uses the MSE cri-

terion to drive the adaptation process. It is the probability of misclassification (also called

the Bayes criterion), however, with which we are truly concerned. We now discuss the

consequence of using the MSE criterion as a proxy for classification performance.

It is well known that the discriminant function that minimizes misclassification is

monotonically related to the posterior probability distribution of the class, c, given the

observation x [Fukunaga, 1990]. That is, for the two-class problem, if the discriminant

function is



$$f(x) = p(C_2 | x), \qquad (3)$$

where p(C2|x) is the conditional probability of class 2 given x (by Bayes' rule,

p(C2|x) = P2 p(x|C2)/p(x), with P2 the prior probability of class 2), then the probability of

misclassification will be minimized if the following decision rule is used



$$f(x) < 0.5 \;\Rightarrow\; \text{choose class 1}, \qquad f(x) > 0.5 \;\Rightarrow\; \text{choose class 2} \qquad (4)$$








For the case of f(x) = 0.5, both classes are equally likely, so a guess must be made.


3.2.1 Unrestricted Functional Mappings

With regards to the adaline/LMS approach we now ask, what is the consequence of

using the MSE criterion for computing discriminant functions? In the two class case, the

source distributions are p(x|C1) or p(x|C2) depending on whether the observation, x, is

drawn from class 1 or class 2, respectively. If we assign a desired output of zero to class 1

and unity to class 2 then the MSE criterion is equivalent to the following



$$J(f) = \frac{P_1}{2} E\{f(x)^2 \,|\, C_1\} + \frac{P_2}{2} E\{(1 - f(x))^2 \,|\, C_2\}, \qquad (5)$$


where the 1/2 scale factors are for convenience, E{ } is the expectation operator, and

Ci indicates class i.

For now we will place no constraints on the functional form of f(x). In so doing, we

can solve for the optimal solution using the calculus of variations approach. In this case,

we would like to find a stationary point of the criterion J(f) due to small perturbations in

the function f(x) indicated by



$$\delta J = J(f + \delta f) - J(f) = 0 \qquad (6)$$







The first term of equation 6 can be computed as


$$J(f + \delta f) = \frac{P_1}{2} E\{(f + \delta f)^2 \,|\, C_1\} + \frac{P_2}{2} E\{(1 - f - \delta f)^2 \,|\, C_2\}$$
$$= \frac{P_1}{2} E\{(f^2 + 2 f \delta f) \,|\, C_1\} + \frac{P_2}{2} E\{((1-f)^2 - 2(1-f)\delta f) \,|\, C_2\} + O(\delta f^2) \qquad (7)$$
$$= J(f) + P_1 E\{f \delta f \,|\, C_1\} - P_2 E\{(1-f)\delta f \,|\, C_2\} + O(\delta f^2)$$
= J(f)+ E{(2ff) Cl} E{(2(1- f)8f)C2}

which can be substituted into equation 6 to yield


$$\delta J = P_1 E\{f \delta f \,|\, C_1\} - P_2 E\{(1-f)\delta f \,|\, C_2\}$$
$$= P_1 \int f(x)\,\delta f\, p(x|C_1)\,dx - P_2 \int (1 - f(x))\,\delta f\, p(x|C_2)\,dx \qquad (8)$$
$$= \int \left[ f(x)\left(P_1 p(x|C_1) + P_2 p(x|C_2)\right) - P_2 p(x|C_2) \right] \delta f\, dx$$
$$= \int \left[ f(x)\, p_x(x) - P_2 p(x|C_2) \right] \delta f\, dx$$

where px(x) = P1 p(x|C1) + P2 p(x|C2) is the unconditional probability distribution of

the random variable X. In order for f(x) to be a stationary point of J(f), equation 8

must be zero over all x for any arbitrary perturbation δf(x). Consequently

must be zero over all x for any arbitrary perturbation 6f(x). Consequently


$$f(x)\, p_x(x) - P_2 p(x|C_2) = 0 \qquad (9)$$








or


$$f(x) = \frac{P_2 p(x|C_2)}{p_x(x)} = \frac{P_2 p(x|C_2)}{P_1 p(x|C_1) + P_2 p(x|C_2)} = p(C_2 | x) \qquad (10)$$

which is the posterior probability that the observation is drawn from class 2. If we had reversed the desired outputs, the result would have been the posterior probability that the observation was drawn from class 1. This result, predicated by our choice of desired outputs, shows that for arbitrary f(x) the MSE criterion is equivalent to the probability-of-misclassification criterion. In fact, it has been shown by Richard and Lippmann [1991] (using other means) for the multi-class case that if the desired outputs are encoded as vectors, e_i \in \Re^{N\times 1}, where the ith element is unity and the others are zero, then for an N-class problem the MSE criterion is equivalent to optimizing the Bayes criterion for classification.
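As a concrete illustration of this result (a hypothetical numerical sketch, not part of the experiments in this thesis), consider two one-dimensional Gaussian class-conditional densities with equal priors. An effectively unrestricted mapping can be approximated by averaging the 0/1 desired outputs within small bins of x; that bin-wise mean is the minimizer of the empirical MSE and should approach the analytic posterior p(C_2|x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-class problem: class 1 ~ N(-1, 1), class 2 ~ N(+1, 1), priors P1 = P2 = 0.5.
n = 50000
x1 = rng.normal(-1.0, 1.0, n)            # observations from class 1, desired output 0
x2 = rng.normal(+1.0, 1.0, n)            # observations from class 2, desired output 1
x = np.concatenate([x1, x2])
d = np.concatenate([np.zeros(n), np.ones(n)])

# "Unrestricted" mapping: within each small bin of x, the value minimizing the squared
# error to the 0/1 targets is simply their mean -- an estimate of p(C2|x).
bins = np.linspace(-4.0, 4.0, 81)
idx = np.digitize(x, bins)
centers, f_hat = [], []
for k in range(1, len(bins)):
    mask = idx == k
    if mask.any():
        centers.append(0.5 * (bins[k - 1] + bins[k]))
        f_hat.append(d[mask].mean())
centers, f_hat = np.array(centers), np.array(f_hat)

# Analytic posterior p(C2|x) = P2 p(x|C2) / (P1 p(x|C1) + P2 p(x|C2)).
def gauss(z, m):
    return np.exp(-0.5 * (z - m) ** 2) / np.sqrt(2.0 * np.pi)

post = 0.5 * gauss(centers, 1.0) / (0.5 * gauss(centers, -1.0) + 0.5 * gauss(centers, 1.0))
print("max |f_hat - p(C2|x)| over populated bins:", np.abs(f_hat - post).max())
```

The discrepancy shrinks as the number of observations grows, anticipating the finite-data argument of the next sections.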

3.2.2 Parameterized Functional Mappings

Suppose, however, that the function is not arbitrary, but is also a function of a parameter set, \alpha, as in f(x, \alpha). The MSE criterion of equation 5 can be rewritten


J(f) = \frac{P_1}{2}E\{f(x,\alpha)^2\,|\,C_1\} + \frac{P_2}{2}E\{(1-f(x,\alpha))^2\,|\,C_2\}.    (11)

The gradient of the criterion with respect to the parameters becomes



\frac{\partial J}{\partial\alpha} = P_1E\Big\{f(x,\alpha)\frac{\partial}{\partial\alpha}f(x,\alpha)\,\Big|\,C_1\Big\} - P_2E\Big\{(1-f(x,\alpha))\frac{\partial}{\partial\alpha}f(x,\alpha)\,\Big|\,C_2\Big\},    (12)







and consequently



\frac{\partial J}{\partial\alpha} = P_1\int f(x,\alpha)\frac{\partial}{\partial\alpha}f(x,\alpha)\,p(x|C_1)\,dx - P_2\int \big(1-f(x,\alpha)\big)\frac{\partial}{\partial\alpha}f(x,\alpha)\,p(x|C_2)\,dx
    = \int\Big(f(x,\alpha)\big(P_1 p(x|C_1) + P_2 p(x|C_2)\big) - P_2 p(x|C_2)\Big)\frac{\partial}{\partial\alpha}f(x,\alpha)\,dx    (13)
    = \int\Big(f(x,\alpha)\,p_x(x) - P_2 p(x|C_2)\Big)\frac{\partial}{\partial\alpha}f(x,\alpha)\,dx.

Examination of equation 13 allows for two possibilities for a stationary point of the crite-
rion. The first, as before, is that


f(x, \alpha) = \frac{P_2 p(x|C_2)}{p_x(x)} = p(C_2|x),    (14)

while the second is that we are near a local minimum with respect to \alpha. In other words, if the parameterized function can realize the Bayes discriminant function via an appropriate choice of its parameters, then this function represents a global minimum, but this does not discount the fact that there may be local minima. Furthermore, if the parameterized function is not capable of representing the Bayes discriminant function, there is no guarantee that the global (or local) minimum will result in robust classification.








3.2.3 Finite Data Sets

The previous development does not take into account that in an iterative framework we

are working with observations of a random variable. Therefore, we rewrite the criterion of

equation 5 as finite summations. That is, the criterion becomes



J(f(x,\alpha)) = \frac{\hat{P}_1}{2}\sum_{x_i\in C_1} f(x_i,\alpha)^2 + \frac{\hat{P}_2}{2}\sum_{x_i\in C_2}\big(1-f(x_i,\alpha)\big)^2,    (15)

where x_i \in C_i denotes the set of observations taken from class C_i and \hat{P}_1, \hat{P}_2 are weights applied to the two sums. Taking the derivative of this criterion with respect to the parameters, \alpha, yields



\frac{\partial J}{\partial\alpha} = \hat{P}_1\sum_{x_i\in C_1} f(x_i,\alpha)\frac{\partial}{\partial\alpha}f(x_i,\alpha) - \hat{P}_2\sum_{x_i\in C_2}\big(1-f(x_i,\alpha)\big)\frac{\partial}{\partial\alpha}f(x_i,\alpha).    (16)

It is assumed that the set of observations from class C_1 (x_i \in C_1) are independent and identically distributed (i.i.d.), as are the set of observations from class C_2 (x_i \in C_2), although with a different distribution than class C_1. Since the summation terms are broken up by class, we can assume that the arguments of the summations (functions of distinct i.i.d. random variables) are themselves i.i.d. random variables [Papoulis, 1991]. If we set \hat{P}_1 N_1 = P_1 and \hat{P}_2 N_2 = P_2, where P_1 and P_2 are the prior probabilities of classes C_1 and C_2, respectively, and N_1 and N_2 are the number of samples drawn from







each of the classes, we can use the law of large numbers to say that the summations of equation 16 approach their expected values. In other words, in the limit as N_1, N_2 \rightarrow \infty,




\frac{\partial J}{\partial\alpha} = P_1E\Big\{f(x,\alpha)\frac{\partial}{\partial\alpha}f(x,\alpha)\,\Big|\,C_1\Big\} - P_2E\Big\{(1-f(x,\alpha))\frac{\partial}{\partial\alpha}f(x,\alpha)\,\Big|\,C_2\Big\},    (17)


which is identical to equation 12 and so yields the same solution for the mapping as


f(x,\alpha) = \frac{P_2 p(x|C_2)}{p_x(x)}.    (18)

The conclusion is that if we have a sufficient number of observations to characterize

the underlying distributions then the MSE criterion is again equivalent to the Bayes crite-

rion.


3.3 Derivation of the MACE Filter

We have already introduced the MACE filter in a previous section. We present a deri-

vation of the MACE filter here. The development is similar to the derivations given in

Mahalanobis [1987] and Kumar [1992]. Our purpose in this presentation of the derivation

is that it serves to illustrate the associative memory perspective of optimized correlators; a

perspective which will be used to motivate the development of the nonlinear extensions

presented in later sections.







In the original development, SDF type filters were formulated using correlation opera-

tions, a convention which will be maintained here. The output, g(n1, n2), of a correlation

filter is determined by


g(n_1,n_2) = \sum_{m_1=0}^{N_1-1}\sum_{m_2=0}^{N_2-1} x^*(n_1+m_1,\,n_2+m_2)\,h(m_1,m_2) = x^*(n_1,n_2) ** h(n_1,n_2),


where x^*(n_1,n_2) is the complex conjugate of an input image with N_1 \times N_2 region of support, h(n_1,n_2) represents the filter coefficients, and ** represents the two-dimensional circular correlation operation [Oppenheim and Shafer, 1989].

The MACE filter formulation is as follows [Mahalanobis et al., 1987]. Given a set of image exemplars, \{x_i \in \Re^{N_1\times N_2};\ i = 1\ldots N_t\}, we wish to find filter coefficients, h \in \Re^{N_1\times N_2}, such that the average correlation energy at the output of the filter, defined as



E_{av} = \frac{1}{N_t}\sum_{i=1}^{N_t}\sum_{n_1=0}^{N_1-1}\sum_{n_2=0}^{N_2-1}\big|g_i(n_1,n_2)\big|^2,    (19)

is minimized subject to the constraints


g_i(0,0) = \sum_{m_1=0}^{N_1-1}\sum_{m_2=0}^{N_2-1} x_i^*(m_1,m_2)\,h(m_1,m_2) = d_i;\qquad i = 1\ldots N_t.    (20)

Mahalanobis [1987] reformulates this as a vector optimization in the spectral domain

using Parseval's theorem. In the spectral domain we wish to find the elements of

H \in C^{N_1N_2\times 1}, a column vector whose elements are the 2-D DFT coefficients of the space domain filter h reordered lexicographically. Let the columns of the data matrix X \in C^{N_1N_2\times N_t} contain the 2-D DFT coefficients of the exemplars \{x_1, \ldots, x_{N_t}\}, also reordered into column vectors. The diagonal matrix D_i \in \Re^{N_1N_2\times N_1N_2} contains the magnitude squared of the 2-D DFT coefficients of the ith exemplar. These matrices are averaged to form the diagonal matrix D as



D = \frac{1}{N_t}\sum_{i=1}^{N_t} D_i,    (21)


which then contains the average power spectrum of the training exemplars. Minimizing

equation (19) subject to the constraints of equation (20) is equivalent to minimizing


H^t D H,    (22)

subject to the linear constraints


X^t H = d,    (23)

where the elements of d \in \Re^{N_t\times 1} are the desired outputs corresponding to the exemplars.

The solution to this optimization problem can be found using the method of Lagrange

multipliers. In the spectral domain, the filter that satisfies the constraints of equation (20)

and minimizes the criterion of equation (19) [Mahalanobis et al., 1987;Kumar, 1992] is



H = D^{-1}X\big(X^t D^{-1}X\big)^{-1}d,    (24)


where H \in C^{N_1N_2\times 1} contains the 2-D DFT coefficients of the filter, assuming a unitary 2-D DFT.^1
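A minimal numerical sketch of equation (24) follows. The arrays below are random placeholders standing in for the exemplar images and the variable names are ours, not from the text; the computation itself is the spectral-domain solution above using unitary 2-D DFTs.

```python
import numpy as np

rng = np.random.default_rng(1)
N1, N2, Nt = 16, 16, 5                      # image size and number of exemplars (placeholders)
imgs = rng.standard_normal((Nt, N1, N2))    # stand-ins for the training exemplars

# Columns of X are the unitary 2-D DFT coefficients of each exemplar, lexicographically ordered.
X = np.stack([np.fft.fft2(im, norm="ortho").ravel() for im in imgs], axis=1)

# D: diagonal matrix (kept as a vector) holding the average power spectrum over the exemplars.
Dvec = np.mean(np.abs(X) ** 2, axis=1)

d = np.ones(Nt)                              # desired peak values at the origin
Dinv_X = X / Dvec[:, None]                   # D^{-1} X without forming the full diagonal matrix
H = Dinv_X @ np.linalg.solve(X.conj().T @ Dinv_X, d)   # equation (24)

# Sanity check: the constrained outputs g_i(0,0) = X^t H equal the desired values d.
print(np.allclose(X.conj().T @ H, d))
```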








3.3.1 Pre-processor/SDF Decomposition

As observed by Mahalanobis [1987], the MACE filter can be decomposed as a syn-

thetic discriminant function preceded by a pre-whitening filter. Let the matrix

B = D^{-1/2}, where B is diagonal with diagonal elements equal to the inverse of the square root of the diagonal elements of D. We implicitly assume that the diagonal elements of D are non-zero; consequently B^t B = D^{-1} and B^t = B. Equation (24) can then be rewritten as



H = B(BX)\big((BX)^t(BX)\big)^{-1}d.    (25)

Substituting Y = BX, representing the original exemplars preprocessed in the spec-

tral domain by the matrix B, equation (25) can be written



H = BY\big(Y^t Y\big)^{-1}d.    (26)

The term H' = Y(Y^t Y)^{-1}d is recognized as the SDF computed from the preprocessed

exemplars Y. The MACE filter solution can therefore be written as a cascade of a pre-

whitener (over the average power spectrum of the exemplars) followed by a synthetic dis-

criminant function, depicted in figure 16, as



H = BH'. (27)


1. If the DFT were as defined in [Oppenheim and Shafer, 1989] then a scale factor of
NIN2 would be necessary.













[Figure 16 appears here: block diagram, x \rightarrow pre-processor \rightarrow y \rightarrow SDF \rightarrow y_0.]

Figure 16. Decomposition of MACE filter as a preprocessor (i.e. a pre-whitening filter over the average power spectrum of the exemplars) followed by a synthetic discriminant function.
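Continuing the sketch above (same placeholder arrays), the decomposition can be verified numerically: pre-whitening with B = D^{-1/2} and computing an SDF over the whitened exemplars reproduces the filter of equation (24).

```python
# Continuing the previous sketch: B = D^{-1/2} applied as a diagonal pre-whitener.
Bvec = 1.0 / np.sqrt(Dvec)
Y = X * Bvec[:, None]                                   # Y = BX, the whitened exemplars
H_sdf = Y @ np.linalg.solve(Y.conj().T @ Y, d)          # H' = Y (Y^t Y)^{-1} d, an SDF
H_decomposed = Bvec * H_sdf                             # H = B H'  (equation 27)
print(np.allclose(H_decomposed, H))                     # matches the direct MACE solution
```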

3.4 Associative Memory Perspective

Having presented the derivation of the MACE filter and the pre-processor/SDF decom-

position, we now show that with a modification (addition of a linear pre-processor), the

MACE filter is a special case of Kohonen's linear associative memory [1988].

Associative memories [Kohonen, 1988] are general structures by which pattern vec-

tors can be related to one another, typically in an input/output pair-wise fashion. An input

stimulus vector is presented to the associative memory structure resulting in an output

response vector. The input/output pairs establish the desired response to a given input. In

the case of an auto-associative memory, the desired response is the stimulus vector,

whereas, in a hetero-associative memory the desired response is arbitrary. From a signal

processing perspective, associative memories are viewed as projections [Kung, 1992], lin-

ear and nonlinear. The input patterns exist in a vector space and the associative memory

projects them onto a new space. The linear associative memory of Kohonen [1988] is for-

mulated exactly in this way.

A simple form of the linear hetero-associative memory maps vectors to scalars. It is

formulated as follows. Given the set of input/output vector/scalar pairs








\{x_i \in \Re^{N\times 1},\ d_i \in \Re,\ i = 1\ldots N_t\}, which are placed into an input data matrix, x = [x_1 \ldots x_{N_t}], and desired output vector, d = [d_1 \ldots d_{N_t}]^T, find the vector h \in \Re^{N\times 1} such that



x^t h = d.    (28)

If the system of equations described by (28) is under-determined the inner product


h^t h    (29)

is minimized using (28) as a constraint. If the system of equations is over-determined,

\big(x^t h - d\big)^t\big(x^t h - d\big)

is minimized.

Here, we are interested in the under-determined case. The optimal solution for the under-determined case, using the pseudo-inverse of x, is [Kohonen, 1988]

h = x\big(x^t x\big)^{-1}d.    (30)

As was shown in [Fisher and Principe, 1994], we can modify the linear associative

memory model slightly by adding a pre-processing linear transformation matrix, A, and

find h such that the under-determined system of equations


(Ax)^t h = d    (31)








is satisfied while hth is minimized. As in the MACE filter, this optimization can be

solved using the method of Lagrange multipliers. We adjoin the system of constraints to

the optimization criterion as


J = h^t h + \lambda^t\big((Ax)^t h - d\big),    (32)


where \lambda \in \Re^{N_t\times 1} is a column vector of Lagrange multipliers, one for each constraint

(desired response). Taking the gradient of equation (32) with respect to the vector h yields


\frac{\partial J}{\partial h} = 2h + Ax\lambda.    (33)

Setting the gradient to zero and solving for the vector h yields



h = -\frac{1}{2}Ax\lambda.    (34)


Substituting this result into the constraint equations of (31) and solving for the Lagrange

multipliers yields



\lambda = -2\big((Ax)^t Ax\big)^{-1}d.    (35)

Substituting this result back into equation (34) yields the final solution to the optimization

as



h = Ax\big(x^t A^t A x\big)^{-1}d.    (36)

If the pre-processing transformation, A, is the space-domain equivalent of the MACE

filter's spectral pre-whitener and the columns of the data matrix x contain the re-ordered

elements of the images from the MACE filter problem then equation (36) combined with








the pre-processing transformation yields exactly the space domain coefficients of the

MACE filter. This can be shown using a unitary discrete Fourier transformation (DFT)

matrix.


If U \in C^{N_1\times N_2} is the DFT of the image u \in \Re^{N_1\times N_2}, we can reorder both U and u into column vectors, U \in C^{N_1N_2\times 1} and u \in \Re^{N_1N_2\times 1}, respectively. We can then implement the 2-D DFT as a unitary transformation matrix, \Phi, such that

U = \Phi u, \qquad u = \Phi^t U, \qquad \Phi\Phi^t = I.

In order for the transformation A to be the space domain equivalent of the spectral pre-whitener of the MACE filter, the relationship

Ax = \Phi^t Y = \Phi^t B X = \Phi^t B \Phi x,

where B is the same matrix as in equation 27, must be true, which, by inspection, means that

A = \Phi^t B \Phi.    (37)


Substituting equation (37) into equation (36) and using the property B^t B = BB = D^{-1}

yields



h = Ax\big(x^t A^t A x\big)^{-1}d
  = \Phi^t B\Phi x\big(x^t(\Phi^t B\Phi)^t\Phi^t B\Phi x\big)^{-1}d    (38)
  = \Phi^t B\Phi x\big(x^t\Phi^t B^t B\Phi x\big)^{-1}d
  = \Phi^t B X\big(X^t D^{-1}X\big)^{-1}d,








combining this solution for h with the pre-processor in equation (31) for the equivalent

linear system, hsys, yields


h_{sys} = Ah
        = \Phi^t B\Phi\,\Phi^t B X\big(X^t D^{-1}X\big)^{-1}d
        = \Phi^t B B X\big(X^t D^{-1}X\big)^{-1}d
        = \Phi^t D^{-1}X\big(X^t D^{-1}X\big)^{-1}d.

Substituting the MACE filter solution, equation (24), gives the result


h_{sys} = \Phi^t H_{MACE},    (39)


and so hsys is the inverse DFT pair of the spectral domain MACE filter. This result estab-

lishes the relationship between the MACE filter and the linear associative memory. The

decomposition of the MACE filter of figure 16 can also be considered as a cascade of a lin-

ear pre-processor followed by a linear associative memory (LAM) as in figure 17.



[Figure 17 appears here: block diagram, x \rightarrow pre-processor (y = Ax, A = \Phi^t D^{-1/2}\Phi) \rightarrow LAM (y_0 = y^t h, h = y(y^t y)^{-1}d) \rightarrow y_0.]

Figure 17. Decomposition of MACE filter as a preprocessor (i.e. a pre-whitening filter over the average power spectrum of the exemplars) followed by a linear associative memory.
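The algebra of equations (37) through (39) can also be checked numerically, again with the placeholder arrays from the sketches above, by constructing the unitary 2-D DFT matrix \Phi explicitly and forming the space-domain pre-processor A = \Phi^t B \Phi.

```python
# Unitary 1-D DFT matrices; their Kronecker product gives the 2-D DFT matrix Phi, with an
# ordering that matches .ravel() on an N1 x N2 image (and hence fft2(...).ravel()).
F1 = np.fft.fft(np.eye(N1), norm="ortho")
F2 = np.fft.fft(np.eye(N2), norm="ortho")
Phi = np.kron(F1, F2)

x_space = np.stack([im.ravel() for im in imgs], axis=1)       # space-domain data matrix
A = Phi.conj().T @ (Bvec[:, None] * Phi)                      # A = Phi^t B Phi  (equation 37)

Ax = A @ x_space
h = Ax @ np.linalg.solve(Ax.conj().T @ Ax, d)                 # equation (36)
h_sys = A @ h                                                 # the equivalent linear system
print(np.allclose(h_sys, Phi.conj().T @ H))                   # equation (39): h_sys = Phi^t H_MACE
```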


Since the two are equivalent, why make the distinction between the two perspectives? There are several reasons. The development of distortion invariant filtering and asso-

ciative memories has proceeded in parallel. Distortion invariant filtering has been






concerned with finding projections which will essentially detect a set of images. Towards

this goal the techniques have emphasized analytic solutions resulting in linear discrimi-

nant functions. Advances have been concerned with better descriptions of the second order

statistics of the causes of false detections. The approach, however, is still a data driven

approach. The desired recognition class is represented through exemplars. In the distortion

invariant filtering approach, the task has been confined to fitting a hyper-plane to the rec-

ognition exemplars subject to various quadratic optimization criteria.

The development of associative memories has proceeded along a different track. It is

also data driven, but the emphasis has been on iterative machine learning methods. Many

of the methods are biologically motivated, including the perceptron learning rule [Rosenblatt, 1958] and Hebbian learning [Hebb, 1949]. Other methods, including the least-mean-

square (LMS) algorithm [Widrow and Hoff, 1960] (which we have described) and the

backpropagation algorithm [Rumelhart et al., 1986; Werbos 1974], are gradient descent

based methods.

From the classification standpoint, of which the ATR problem is a subset, iterative

methods have certain advantages. This can be illustrated with a simple example. Suppose

the data matrix


x = [x_1, x_2, \ldots, x_{N_t}] \in \Re^{N_1N_2\times N_t}

were not full rank. In other words, the exemplars representing the recognition class could be represented without error in a subspace of dimension less than N_t. From an ATR per-

spective this would be a desirable property. The implicit assumption in any data driven

method is that information about the recognition class is transmitted through exemplars.

This is as true for distortion invariant filters, which have analytic solutions, as it is for iter-








ative methods. The smaller the dimension of the subspace in which the recognition class

lies, the better we can discriminate images considered to be out of the class. One limitation

of the analytic solutions of distortion invariant filters is that they require the inverse of a

matrix of the form


x^t Q x,    (40)

where Q is a positive definite matrix representing a quadratic optimization criterion. If the

matrix, x, is not full column rank there is no inverse for the matrix of (40) and conse-

quently no analytic solution for any of the distortion invariant filters. The LMS algorithm,

however, will still find a best fit to the design goal, which is to minimize the criterion while

satisfying the linear constraints.

We can illustrate this by modifying the data from the experiments in section 2.1. It is

well known that the data matrix x can be decomposed using the singular value decompo-

sition (SVD) as


x = U\Lambda V^t,

where the columns of U \in \Re^{N_1N_2\times N_t} form an ortho-normal basis (the principal components of the vectors x_i, in fact), the diagonal matrix \Lambda \in \Re^{N_t\times N_t} contains the singular values of the data matrix, and V \in \Re^{N_t\times N_t} is unitary. The columns of the data matrix can be projected onto a subspace by setting one of the diagonal elements of \Lambda to zero. The importance of any of the basis vectors in U is directly proportional to its singular value. In this case N_t = 21, so we can choose one of the smaller singular values to set to zero without








changing the basic structure of the data. For this example we choose the twelfth largest singular value. A data matrix x_{sub} is generated by

x_{sub} = U \begin{bmatrix} \Lambda_{1\text{-}11} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \Lambda_{13\text{-}21} \end{bmatrix} V^T,

where \Lambda_{i\text{-}j} is a diagonal matrix containing the i through j singular values of the original data matrix x.

This data matrix is not full rank, so there is no analytical solution for the MACE filter; however, we can use the LMS approach and derive a linear associative memory. The columns of x_{sub} are pre-processed with a pre-whitening filter computed over the average

power spectrum. The LMS algorithm can then be used to iteratively compute the transfor-

mation that best fits

x_{sub}^T h = d,

in a least squares sense; that is, we can find the h that minimizes

\big(x_{sub}^T h - d\big)^T\big(x_{sub}^T h - d\big),

where d is a column vector of desired responses (set to all unity in this case).
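The following sketch illustrates the point with stand-in data (random vectors playing the role of the pre-whitened exemplars; the dimensions and learning rate are placeholders, not the values used in the experiment). After one singular value is zeroed, the normal-equations route breaks down, while sample-by-sample LMS still converges to a least-squares fit of the constraints.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, Nt = 256, 21                       # whitened image dimension and number of exemplars (placeholders)
y = rng.standard_normal((dim, Nt))      # stand-in for the pre-whitened exemplars (columns)

# Zero one singular value (the twelfth) to force the data into a 20-dimensional subspace.
U, s, Vt = np.linalg.svd(y, full_matrices=False)
s[11] = 0.0
y_sub = U @ np.diag(s) @ Vt

d = np.ones(Nt)                         # desired peak responses

# The analytic route needs (y_sub^T y_sub)^{-1}, which is singular here:
print("rank of Gram matrix:", np.linalg.matrix_rank(y_sub.T @ y_sub))   # 20 < 21

# LMS: sample-by-sample gradient descent on (y_i^T h - d_i)^2 still finds a least-squares fit.
h = np.zeros(dim)
eta = 0.001
for epoch in range(500):
    for i in rng.permutation(Nt):
        err = d[i] - y_sub[:, i] @ h
        h += eta * err * y_sub[:, i]

resid = y_sub.T @ h - d
print("NMSE of constraints:", (resid @ resid) / (d @ d))    # approaches a nonzero floor
```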

The peak output response for this filter was computed over all of the aspect views of vehicle 1a and is shown in figure 18. The exemplars used to compute the filter are plotted

with diamond symbols. The desired response cannot be met exactly so a least squares fit is

achieved. Figure 19 shows the correlation output surface for one of the training exemplars.









[Figure 18 appears here: plot titled "MACE filter (LMS)", peak output response versus aspect angle.]

Figure 18. Peak output response over all aspects of vehicle 1a when the data matrix is not full rank. The LMS algorithm was used to compute the filter coefficients.

As can be seen in the image, the qualities of low variance and localized peak are still

maintained using the iterative method.

The learning curve, which measures the normalized mean square error (NMSE)

between the filter output and the desired output, is shown as a function of the learning

epoch (an epoch is one pass through the data) in figure 20. When the data matrix is full rank, as shown with a solid line, we see that there is an exact solution and the error approaches zero. When x_{sub} is used the NMSE approaches a limit because there is no


exact solution and so a least squares solution is found.





























Figure 19. Output correlation surface for LMS computed filter from non full rank
data. The filter output is not substantially different from the analytic
solution with full rank data.

Since the system of constraint equations is generally under-determined, there are infinitely many filters which will satisfy the constraints. There is only one, however, that minimizes the norm of the filter (the optimization criterion after pre-processing) [Kohonen, 1988]. Figure 21 shows the NMSE between the analytic solution for the filter coefficients as compared to the iterative^1 method. When the data matrix is full rank the iterative method

approaches the optimal analytic solution, as shown by the solid line in the figure. When

the data matrix is not full rank, as shown by the dashed line in the figure, the error in the

iterative solution approaches a limit.

These qualities of iterative learning methods are important from the ATR perspective.

We see from the example that when the data possesses a quality that would seemingly be

1. In this case "iterative" refers to the LMS algorithm; within this text it generally refers to a gradient search algorithm.









[Figure 20 appears here: plot titled "LMS learning curves", NMSE versus epoch on a logarithmic scale.]

Figure 20. Learning curve for the LMS approach. The learning curve for the LMS algorithm when the data matrix is full rank is shown with a solid line; the non-full-rank case is shown with a dashed line.

useful to the ATR problem, namely that the class can be described by a sub-space, the ana-

lytic solution fails when the number of exemplars exceeds the dimensionality of the sub-

space. The iterative method, however, finds a reasonable solution. Furthermore, if the data

matrix is full rank, the iterative method approaches the optimal analytic solution.


3.5 Comments

There are further motivations for the associative memory perspective and by extension

the use of iterative methods. It is well known that non-linear associative memory struc-

tures can outperform their linear counterparts on the basis of generalization and dynamic

range [Kohonen, 1988;Hinton and Anderson, 1981]. In general, they are more difficult to

design as their parameters cannot be computed analytically. The parameters for a large









[Figure 21 appears here: plot titled "filter error", NMSE between the closed-form and iterative filter coefficients versus epoch.]

Figure 21. NMSE between closed form solution and iterative solution. The curve for the LMS algorithm when the data matrix is full rank is shown with a solid line; the non-full-rank case is shown with a dashed line.

class of nonlinear associative memories can, however, be determined by gradient search

techniques. The methods of distortion invariant filters are limited to linear or piece-wise

linear discriminant functions. It is unlikely that these solutions are optimal for the ATR

problem.

In this chapter we have made the connection between distortion invariant filtering and

linear associative memories. Furthermore we have motivated an iterative approach. Recall

figure 15, which shows the adaline architecture. In this architecture we can use the linear

error term in order to train our system as a classifier. This is a consequence of the assumption that a linear discriminant function is desirable. If a linear discriminant function is sub-







optimal, which will almost always be the case for any high-dimensional classification

problem, then we must work directly with the classification error.

We have also shown that the MSE criterion is a sufficient proxy for classification error

(with certain restrictions), however, it requires that we work with the true output error of

the mapping as well as a mapping with sufficient flexibility (i.e. can closely approximate a

wide range of functions which are not necessarily linear). The linear systems approach,

however, does not allow for either of these requirements. Consequently, we must adopt a

nonlinear systems approach if we hope to achieve improved performance. The next chap-

ter will show that the MACE filter can be extended to nonlinear systems such that the

desirable properties of shift invariance and localized detection peak are maintained while

achieving superior classification performance.














CHAPTER 4

STOCHASTIC APPROACH TO TRAINING NONLINEAR
SYNTHETIC DISCRIMINANT FUNCTIONS

4.1 Nonlinear Iterative Approach

The MACE filter is the best linear system that minimizes the energy in the output cor-

relation plane subject to a peak constraint at the origin. An advantage of linear systems is

that we have the mathematical tools to use them in optimal operating conditions from the

standpoint of second order statistics. Such optimality conditions, however, should not be

confused with the best possible classification performance.

Our goal is to extend the optimality condition of MACE filters to adaptive nonlinear

systems and classification performance. The optimality condition of the MACE filter con-

siders the entire output plane, not just the response when the image is centered. With

regards to general nonlinear filter architectures which can be trained iteratively, a brute

force approach would be to train a neural network with a desired output of unity for the

centered images and zero for all shifted images. This would indeed emulate the optimality

of the MACE filter, however, the result is a training algorithm of order N_1N_2N_t for N_t training images of size N_1 \times N_2 pixels. This is clearly impractical.

In this section we propose a nonlinear architecture for extending the MACE filter. We

discuss some of its properties. Appropriate measures of generalization are discussed. We also

present a statistical viewpoint of distortion invariant filters from which such nonlinear

extensions fit naturally into an iterative framework. From this iterative framework we







present experimental results which exhibit improved discrimination and generalization

performance with respect to the MACE filter while maintaining the properties of localized

detection peak and low variance in the output plane.


4.2 A Proposed Nonlinear Architecture

As we have stated, the MACE filter can be decomposed as a pre-whitening filter fol-

lowed by a synthetic discriminant function (SDF), which can also be viewed as a special

case of Kohonen's linear associative memory (LAM) [Hester and Casasent, 1980; Fisher

and Principe, 1994]. This decomposition is shown at the top of figure 22. The nonlinear

filter architecture we are proposing is shown in the middle of figure 22. In this architecture we replace the LAM with a nonlinear associative memory, specifically a feed-forward multi-layer perceptron (MLP), shown in more detail at the bottom of figure 22.

We will refer to this structure as the nonlinear MACE filter (NL-MACE) for brevity.

Another reason for choosing the multi-layer perceptron (MLP) is that it is capable of achieving a much wider range of discriminant functions. It is well known that an MLP

with a single hidden layer can approximate any discriminant function to any arbitrary

degree of precision [Funahashi, 1989]. One of the shortcomings of distortion invariant

approaches such as the MACE filter is that it attempts to fit a hyper-plane to our training

exemplars as the discriminant function. Using an MLP in place of the LAM relaxes this

constraint. MLPs do not, in general, allow for analytic solutions. We can, however, deter-

mine their parameters iteratively using gradient search.












[Figure 22 appears here: block diagrams of the linear and nonlinear architectures and a detail of the MLP.]

Figure 22. Decomposition of optimized correlator as a pre-processor followed by SDF/LAM (top). Nonlinear variation shown with MLP replacing the SDF in the signal flow (middle); detail of the MLP (bottom). The linear transformation A represents the space domain equivalent of the spectral pre-processor (\alpha P_x + (1-\alpha)P_\omega)^{-1/2}.







4.2.1 Shift Invariance of the Proposed Nonlinear Architecture

One of the properties of the MACE filter is shift invariance. We wish to maintain that

property in our nonlinear extensions. A transformation, T[ ], of a two-dimensional func-

tion is shift invariant if it can be shown that


g(n_1,n_2) = T[y(n_1,n_2)]
g(n_1+n_1',\,n_2+n_2') = T[y(n_1+n_1',\,n_2+n_2')],

where n_1, n_1', n_2, n_2' are integers. In other words, a shift of the input signal is reflected as a corresponding shift of the output signal [Oppenheim and Shafer, 1989].

We show here that this property is maintained for our proposed nonlinear architecture.

The pre-processor of the nonlinear architecture at the bottom of figure 22 is the same as

the pre-processor of the linear filter shown at the top. The pre-processor is implemented as

a linear shift invariant (LSI) filter. Cascading shift invariant operations maintains shift

invariance of the entire system [Oppenheim and Shafer, 1989]. In order to show that the

system as a whole is shift invariant, it is sufficient to show that the MLP is shift invariant.

The mapping function of the MLP in figure 22 can be written


g(\omega, y) = \sigma\big(W_3\,\sigma(W_2\,\sigma(W_1 y) + \varphi)\big),
\omega = \{W_1, W_2, W_3, \varphi\}.    (41)


In the nonlinear architecture, the matrix W_i represents the connectivities from the processing elements (PEs) of layer (i-1) to the input of the PEs of layer i; that is, the matrix W_i is applied as a linear transformation to the vector output of layer (i-1). When i = 1 the transformation is applied to the input vector, y. The number of PEs in layer i is denoted by N_i. In equation 41, \varphi is a constant bias vector added to the vector W_2\sigma(W_1 y) \in \Re^{N_2\times 1}. It is also assumed that if the argument to the nonlinear function \sigma(\,) is a matrix or vector then the nonlinearity is applied to each element of the matrix or vector.

The input to the MLP is denoted as a vector, y \in \Re^{N_1N_2\times 1}. The elements of the vector are samples of a two-dimensional pre-whitened input signal, y(n_1, n_2). We can write the ith element of the vector as a function of the two-dimensional signal as follows

y_i(n_1, n_2) = y\big(n_1 + (i, N_1),\ n_2 + \lfloor i, N_1\rfloor\big),\qquad i = 0, \ldots, N_1N_2 - 1,

where (i, N_1) indicates a modulo operation (the remainder of i divided by N_1) and \lfloor i, N_1\rfloor indicates integer division of i by N_1. Written this way, the elements of the vector y sample a rectangular region of support of size N_1 \times N_2 beginning at sample (n_1, n_2) in the pre-whitened signal, y(n_1, n_2). The vector argument of equation 41 and the resulting

output signal can now be written as an explicit function of the beginning sample point of

the template within the pre-whitened image


g_\omega(n_1,n_2) = g(\omega, y(n_1,n_2)) = \sigma\big(W_3\,\sigma(W_2\,\sigma(W_1 y(n_1,n_2)) + \varphi)\big).    (42)

The output of the mapping as written in equation 42 is now an explicit function of (n_1, n_2) and the constant parameter set, \omega (which does not vary with (n_1, n_2)). We can also write the output response as a function of the shifted version of the image, y(n_1, n_2), as

g_\omega(n_1+n_1',\,n_2+n_2') = g\big(\omega,\, y(n_1+n_1',\,n_2+n_2')\big).    (43)









Since the parameters, \omega, are constant, equations 42 and 43 are sufficient to show the

mapping of the MLP is shift invariant and consequently, the system as a whole (including

the shift invariant pre-processor) is also shift invariant.
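This argument can be checked numerically with the short sketch below. Random weights and a random input stand in for trained parameters and a pre-whitened image (the pre-processing stage is omitted since it is shift invariant by construction, and tanh stands in for the generic sigmoid): a circular shift of the input produces the same circular shift of the MLP output plane.

```python
import numpy as np

rng = np.random.default_rng(3)
N1, N2 = 8, 8                      # template (filter) size, placeholders
M1, M2 = 32, 32                    # size of the (already pre-whitened) input image

# Random MLP parameters playing the role of omega = {W1, W2, W3, phi}.
W1 = rng.standard_normal((2, N1 * N2))
W2 = rng.standard_normal((3, 2))
phi = rng.standard_normal((3, 1))
W3 = rng.standard_normal((1, 3))

def g_map(y_vec):
    """Equation 41 applied to one re-ordered N1 x N2 window."""
    return np.tanh(W3 @ np.tanh(W2 @ np.tanh(W1 @ y_vec) + phi)).item()

def output_plane(y_img):
    """Equation 42: evaluate the MLP on every (circular) N1 x N2 window of y_img."""
    out = np.zeros((M1, M2))
    for n1 in range(M1):
        for n2 in range(M2):
            win = np.take(np.take(y_img, np.arange(n1, n1 + N1), axis=0, mode="wrap"),
                          np.arange(n2, n2 + N2), axis=1, mode="wrap")
            out[n1, n2] = g_map(win.reshape(-1, 1))
    return out

y = rng.standard_normal((M1, M2))
g0 = output_plane(y)
g_shift = output_plane(np.roll(y, shift=(5, 7), axis=(0, 1)))
# A circular shift of the input yields the identically shifted output plane.
print(np.allclose(np.roll(g0, shift=(5, 7), axis=(0, 1)), g_shift))
```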


4.3 Classifier Performance and Measures of Generalization

One of the issues for any iterative method which relies on exemplars is the number of

training exemplars to use in the computation of the discriminant function. In addition, for

iterative methods, there is the issue of when to stop the adaptation process. In the case of

distortion invariant filters, such as the MACE filter, some common heuristics are used to

determine the number of training exemplars. Typically samples are drawn from the train-

ing set and used to compute the filter from equation 23 until the minimum peak response

over the remaining samples exceeds some threshold [Casasent and Ravichandran, 1992].

A similar heuristic is to continue to draw samples from the training set until the mean

square error of the peak response over the remaining samples drops below some preset

threshold. These measures are then used as indicators of how well the filter generalizes to

between aspect exemplars from the training set which have not been used for the computa-

tion of the filter coefficients.

The ultimate goal, however, is classification. Generalization in the context of classifi-

cation must be related to the ability to classify a previously unseen input [Bishop, 1995].

We show by example that the measures of generalization mentioned above may be mis-

leading as predictors of classifier performance for even the linear filters. In fact the result

of the experiments will show that the way in which the data is pre-processed is more indic-

ative of classifier performance than these other indirect measures.









We illustrate this point with an example using ISAR image data. A data set, larger than

in the previous experiments, will be used. Two more vehicles, one from each vehicle type, will be used for the testing set, and all vehicles will be sampled at higher aspect resolution.

Figure 23 shows ISAR images of size 64 x 64 taken from five different vehicles and two

different vehicle types. The images are all taken with the same radar. Data taken from

vehicles in the same class vary in the vehicle configuration and radar depression angle (15

or 20 degrees depression). Images have been formed from each vehicle at aspect varia-

tions of 0.125 degrees from 5 to 85 degrees aspect for a total of 641 images for each vehi-

cle. Figure 23 shows each of the vehicles at 5, 45, and 85 degrees aspect.

We will use vehicle type 1 as the recognition class and vehicle type 2 as a confusion

vehicle. Images of vehicle 1a will be used as the set from which to draw training exemplars. Classification performance will then be measured as the ability to recognize vehicles 1b and 1c while rejecting vehicles 2a and 2b. The filter we will use is a form of the OTSDF [Réfrégier and Figue, 1991] which is computed in the spectral domain as


H = \big[\alpha P_x + (1-\alpha)P_\omega\big]^{-1} X \Big[X^t\big[\alpha P_x + (1-\alpha)P_\omega\big]^{-1} X\Big]^{-1} d,    (44)


where the columns of the data matrix X \in C^{N_1N_2\times N_t} are the Fourier coefficients of N_t exemplar images of dimension N_1 \times N_2 of vehicle 1a reordered into column vectors. The diagonal matrix P_x \in \Re^{N_1N_2\times N_1N_2} contains the coefficients of the average power spectrum measured over the N_t exemplars of vehicle 1a, while P_\omega \in \Re^{N_1N_2\times N_1N_2} is the identity matrix scaled by the average of the diagonal terms of P_x. Finally, d \in \Re^{N_t\times 1} is a column vector of desired outputs, one for each exemplar. The elements of d are typically









[Figure 23 appears here: ISAR images of vehicles 1a, 1b, 1c, 2a, and 2b at three aspect angles.]

Figure 23. ISAR images of two vehicle types shown at aspect angles of 5, 45, and 85 degrees respectively. Three different vehicles of type 1 (a, b, and c) are shown, while two different vehicles of type 2 (a and b) are shown. Vehicle 1a is used as a training vehicle, while vehicles 1b and 1c are used as the testing vehicles for the recognition class. Vehicles 2a and 2b are used as confusion vehicles.

set to unity. When \alpha is set to unity, equation 44 yields exactly the MACE filter; when it is set to zero the result is the SDF. The filter we are using is therefore trading off the MACE filter criterion with the SDF criterion. The SDF criterion can also be viewed as the

MVSDF [Kumar, 1986] criterion when the noise class is represented by a white noise ran-

dom process. This filter can also be decomposed as in figure 22.
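A minimal sketch of equation (44) follows (the data are random placeholders, and only the diagonals of P_x and P_\omega are stored). Setting \alpha = 1 recovers the MACE filter of equation (24) and \alpha = 0 an SDF.

```python
import numpy as np

def otsdf(X, d, alpha):
    """Equation 44: H = [a*Px + (1-a)*Pw]^{-1} X [X^t [a*Px + (1-a)*Pw]^{-1} X]^{-1} d.

    X : complex array (N1*N2, Nt) of unitary 2-D DFT coefficients of the exemplars.
    d : desired peak outputs (Nt,).  alpha = 1 gives the MACE filter, alpha = 0 an SDF.
    """
    Px = np.mean(np.abs(X) ** 2, axis=1)          # average power spectrum (diagonal of Px)
    Pw = np.full_like(Px, Px.mean())              # identity scaled by the mean of Px
    Q = alpha * Px + (1.0 - alpha) * Pw           # diagonal of the trade-off matrix
    QinvX = X / Q[:, None]
    return QinvX @ np.linalg.solve(X.conj().T @ QinvX, d)

# Hypothetical usage with stand-in data:
rng = np.random.default_rng(4)
X = np.fft.fft2(rng.standard_normal((5, 16, 16)), norm="ortho").reshape(5, -1).T
H = otsdf(X, np.ones(5), alpha=0.95)
print(np.allclose(X.conj().T @ H, np.ones(5)))    # the peak constraints are met
```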







These experiments examine the relationship between the two commonly used mea-

sures of generalization and two measures of classification performance. We can draw con-

clusions from the results about the appropriateness of the generalization measures with

regards to classification. The first generalization measure is the minimum peak response,

denoted y_{min}, taken over the aspect range of the images of the training vehicle (excluding the aspects used for computing the filter). The second generalization measure is the mean square error, denoted y_{mse}, between the desired output of unity and the peak response over the aspect range of the images of the training vehicle (again excluding the aspects used for computing the filter). The classification measures are taken from the receiver operating characteristic (ROC) curve measuring the probability of detection, P_d, for a testing vehicle in the recognition class (vehicles 1b and 1c) versus the probability of false alarm, P_{fa}, on a testing vehicle in the confusion class (vehicles 2a and 2b), based on peak detection. The first classification measure is the area under the ROC curve, a general measure of the test being used, while the second measure is the probability of false alarm when the probability of detection equals 80%, which measures a single point of interest on the ROC curve.
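For reference, the sketch below shows one way these four quantities might be computed from arrays of peak filter outputs (the array names are hypothetical; the ROC is swept over thresholds applied to the peak output).

```python
import numpy as np

def generalization_measures(peaks_train_between):
    """y_min and y_mse over the between-aspect (held-out) views of the training vehicle."""
    y_min = peaks_train_between.min()
    y_mse = np.mean((1.0 - peaks_train_between) ** 2)   # desired peak output is unity
    return y_min, y_mse

def roc_measures(peaks_recognition, peaks_confusion):
    """Approximate ROC area and Pfa at Pd = 0.8 from peak outputs of the two testing classes."""
    thresholds = np.sort(np.concatenate([peaks_recognition, peaks_confusion]))[::-1]
    pd = np.array([(peaks_recognition >= t).mean() for t in thresholds])
    pfa = np.array([(peaks_confusion >= t).mean() for t in thresholds])
    # Trapezoidal area under the ROC curve (pd as a function of pfa).
    area = np.sum(np.diff(pfa) * 0.5 * (pd[1:] + pd[:-1]))
    # Pd is nondecreasing along the threshold sweep, so interpolate Pfa at Pd = 0.8.
    pfa_at_pd80 = np.interp(0.8, pd, pfa)
    return area, pfa_at_pd80
```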

Two filters are used, one with \alpha = 0.5 and the other with \alpha = 0.95; that is, one in which both criteria are weighted equally and one which is close to the MACE filter criterion. The number of exemplars drawn from the training vehicle (1a) is varied from 21 to 81, sampled uniformly in aspect (1 to 4 degrees aspect separation between exemplars).

Examination of figures 24 and 25 shows that for both cases (\alpha equal to 0.5 and 0.95)

no clear relationship emerges in which the generalization measures are indicators of good

classification performance. Table 1 compares the classifier performance when the general-








ization measures as described above are used to choose the filter versus the best ROC per-

formance achieved throughout the range of aspect separation. In one regard, the

generalization measures were consistent in that the same aspect separation was predicted

by both measures for both settings of a. In figure 26 we compare the ROC curves for two

cases, first where the filter chosen using the generalization measures and second the best

achieved ROC curve, for both settings of a. We would expect that for each a the filter

using the generalization measure would be near the best ROC performance. As can be

seen in the figure this is not the case.

Table 1. Classifier performance measures when the filter is determined by either of the common measures of generalization as compared to best classifier performance for two values of \alpha.

                                    Generalization Measure
                                    y_min      y_mse      Best
\alpha = 0.50    Pfa@Pd=0.8         0.24       0.24       0.16
                 ROC area           0.83       0.83       0.90
\alpha = 0.95    Pfa@Pd=0.8         0.16       0.16       0.07
                 ROC area           0.94       0.94       0.95

It is obvious from figures 24 and 25 that the generalization measures are not signifi-

cantly correlated with the ROC performance. In fact, as summarized in table 2, the gener-

alization measures are negatively, albeit weakly, correlated with ROC performance. One

feature of figures 24 and 25 is that although ROC performance varies independent of










[Figure 24 appears here: two panels, y_min vs. ROC area and y_min vs. P_fa(@P_d=0.8), with points for \alpha = 0.50 and \alpha = 0.95.]

Figure 24. Generalization as measured by the minimum peak response. The plot compares y_min versus classification performance measures (ROC area and Pfa@Pd=0.8).










[Figure 25 appears here: two panels, y_mse vs. ROC area and y_mse vs. P_fa(@P_d=0.8), with points for \alpha = 0.50 and \alpha = 0.95.]

Figure 25. Generalization as measured by the peak response mean square error. The plot compares y_mse versus classification performance measures (ROC area and Pfa@Pd=0.8).









[Figure 26 appears here: ROC curves for \alpha = 0.50 (best generalization and best ROC) and \alpha = 0.95 (best generalization and best ROC).]

Figure 26. Comparison of ROC curves. The ROC curves for the number of training exemplars yielding the best generalization measure versus the number yielding the best ROC performance for values of \alpha equal to 0.5 and 0.95 are shown.

either the minimum peak response or the MSE, there does appear to be a dependency on \alpha.

This leads to a second experiment.

Table 2. Correlation of generalization measures to classifier performance. In both cases (\alpha equal to 0.5 or 0.95) the classifier performance, as measured by the area of the ROC curve or P_fa at P_d equal to 0.8, has an opposite correlation to what would be expected of a useful measure for predicting performance.

                                          Performance Measures
                               \alpha = 0.50                 \alpha = 0.95
                         ROC area    Pfa(@Pd=0.8)      ROC area    Pfa(@Pd=0.8)
Generalization   y_min    -0.39          0.21           -0.40          0.41
Measures         y_mse     0.32         -0.11            0.31         -0.35


In the second experiment we examine the relationship between the parameter \alpha and the ROC performance. The aspect separation between training exemplars is set to 2, 4, and 8 degrees. The value of \alpha, the emphasis on the MACE criterion, is varied in the range

zero to unity. Figure 27 shows the relationship between ROC performance and the value









of \alpha. It is clear from the plots that there is a positive relationship between the emphasis on the MACE criterion and the ROC performance. However, the peak in ROC performance is not achieved at \alpha equal to unity. In all three cases, the ROC performance peaks just prior to unity, with the performance drop-off at \alpha equal to unity increasing with aspect separation.

The difference between the SDF and MACE filter is the pre-processor. What is shown

by this analysis is that, in general, the pre-processor from the MACE filter criterion leads

to better classification, but too much emphasis on the MACE filter criterion, as measured

by \alpha equal to unity, leads to a filter which is too specific to the training samples. The

problems described above are well known. Alterations to the MACE criterion have been

the subject of many researchers [Casasent et al., 1991; Casasent and Ravichandran, 1992;

Ravichandran and Casasent, 1992; Mahalanobis et al., 1994a]. There is still, as yet, no

principled method found in the literature by which to set the parameter \alpha.

There are two conclusions from this analysis that are pertinent to the nonlinear exten-

sion we are using. First the results show that pre-whitening over the recognition class

leads to better classification performance. For this reason we choose to use the pre-proces-

sor of the MACE filter in our nonlinear filter architecture. The issue of extending the

MACE filter to nonlinear systems can in this way be formulated as a search for a more

robust nonlinear discriminant function in the pre-whitened image space.

The second conclusion is that comparisons of the nonlinear filter to its linear counter-

part must be made in terms of classification performance only. There are simple nonlinear

systems, such as a soft threshold at the output of a linear system for example, that will out-










[Figure 27 appears here: two panels, ROC area vs. \alpha and P_fa(@P_d=0.8) vs. \alpha, for aspect separations of 2, 4, and 8 degrees.]

Figure 27. ROC performance measures versus \alpha. Results are shown for training aspect separations of 2, 4, and 8 degrees. These plots indicate that ROC performance is positively related to \alpha.

perform the MACE filter or its variations in terms of maximizing the minimum peak

response over the training vehicle or reducing the variance in the output image plane.











These measures are not, however, sufficient to describe classification performance. We

have also used these measures in the past but feel that they are not the most appropriate for

classification [Fisher and Principe, 1995b].


4.4 Statistical Characterization of the Rejection Class

We now present a statistical viewpoint of distortion invariant filters from which such

nonlinear extensions fit naturally into an iterative framework. This treatment results in an

efficient way to capture the optimality condition of the MACE filter using a training algo-

rithm which is approximately of order N_t and which leads to better classification perfor-

mance than the linear MACE.

A possible approach to design a nonlinear extension to the MACE filter and improve

on the generalization properties is to simply substitute the linear processing elements of

the LAM with nonlinear elements. Since such a system can be trained with error back-

propagation [Rumelhart et al., 1986], the issue would be simply to report on performance

comparisons with the MACE. Such methodology does not, however, lead to understand-

ing of the role of the nonlinearity, and does not elucidate the trade-offs in the design and in

training.

Here we approach the problem from a different perspective. We seek to extend the

optimality condition of the MACE to a nonlinear system, i.e. the energy in the output

space is minimized while maintaining the peak constraint at the origin. Hence we will

impose these constraints directly in the formulation, even knowing a priori that an analyti-

cal solution is very difficult or impossible to obtain. We reformulate the MACE filter from







a statistical viewpoint and generalize it to arbitrary mapping functions, linear and nonlin-

ear.

Consider images of dimension N_1 \times N_2 re-ordered by column or row into vectors. Let the rejection class be characterized by the random vector X_1 \in \Re^{N_1N_2\times 1}. We know the second-order statistics of this class as represented by the average power spectrum (or equivalently the autocorrelation function). Let the recognition class be characterized by the columns of a data matrix x_2 \in \Re^{N_1N_2\times N_t}, which are observations of the random vector X_2 \in \Re^{N_1N_2\times 1}, similarly re-ordered. We wish to find the parameters, \omega, of a mapping, g(\omega, X): \Re^{N_1N_2\times 1} \rightarrow \Re^1, such that we may discriminate the recognition class from the rejection class. Here, it is the mapping function, g, which defines the discriminator topology.

Towards this goal, we wish to minimize the objective function


J = E\big(g(\omega, X_1)^2\big)

over the mapping parameters, \omega, subject to the system of constraints

g(\omega, x_2) = d,    (45)

where d \in \Re^{N_t\times 1} is a column vector of desired outputs. It is assumed that the mapping function is applied to each column of x_2, and E(\,) is the expected value function.







Using the method of Lagrange multipliers, we can augment the objective function as



J = E\big(g(\omega, X_1)^2\big) + \big(g(\omega, x_2) - d^T\big)\lambda,    (46)

where \lambda \in \Re^{N_t\times 1} is a vector whose elements are the Lagrange multipliers, one for each constraint. Computing the gradient with respect to the mapping parameters yields

\frac{\partial J}{\partial\omega} = 2E\Big(g(\omega, X_1)\frac{\partial g(\omega, X_1)}{\partial\omega}\Big) + \frac{\partial g(\omega, x_2)}{\partial\omega}\lambda.    (47)


Equation 47 along with the constraints of equation 45 can be used to solve for the optimal parameters, \omega, assuming our constraints form a consistent set of equations. This is, of course, dependent on the mapping topology.


4.4.1 The Linear Solution as a Special Case

It is interesting to verify that this formulation yields the MACE filter as a special case.

If, for example, we choose the mapping to be a linear projection of the input image, that is


g(\omega, x) = \omega^T x, \qquad \omega = [h_1 \ldots h_{N_1N_2}]^T \in \Re^{N_1N_2\times 1},

equation 46 becomes, after simplification,

J = \omega^T E(X_1X_1^T)\omega + \big(\omega^T x_2 - d^T\big)\lambda.    (48)

In order to solve for the mapping parameters, \omega, we are still left with the task of computing the term E(X_1X_1^T) which, in general, we can only estimate from observations of the random vector X_1, or assume a specific form. Assuming that we have a suitable estima-









tor, the well known solution to the minimum of equation 48 over the mapping parameters

subject to the constraints of equation 45 is



\omega = R_{x_1}^{-1}x_2\big[x_2^T R_{x_1}^{-1}x_2\big]^{-1}d,    (49)

where

R_{x_1} = \mathrm{estimate}\{E(X_1X_1^T)\}.    (50)


Depending on the characterization of X1, equation 49 describes various SDF-type fil-

ters (i.e. MACE, MVSDF, etc.). In the case of the MACE filter, the rejection class is char-

acterized by all 2D circular shifts of target class images away from the origin. Solving for

the MACE filter coefficients is therefore equivalent to using the average circular autocor-

relation sequence (or equivalently the average power spectrum in the frequency domain)


over images in the target class as estimators of the elements of the matrix E(X_1X_1^T).

Sudharsanan et al [1991] suggest a very similar methodology for improving the perfor-

mance of the MACE filter. In that case the average linear autocorrelation sequence is esti-

mated over the target class and this estimator of E(X_1X_1^T) is used to solve for linear

projection coefficients in the space domain. The resulting filter is referred to as the

SMACE (space-domain MACE) filter.


4.4.2 Nonlinear Mappings

For arbitrary nonlinear mappings it will, in general, be very difficult to solve for glo-

bally optimal parameters analytically. Our purpose is instead to develop iterative training

algorithms which are practical and yield improved performance over the linear mappings.











It is through the implicit description of the rejection class by its second-order statistics

that we have developed an efficient method extending the MACE filter and other

related correlators to nonlinear topologies such as neural networks.

As stated, our goal is to find mappings, defined by a topology and a parameter set,

which improve upon the performance of the MACE filter in terms of generalization while

maintaining a sharp constrained peak in the center of the output plane for images in the

recognition class. One approach, which leads to an iterative algorithm, is to approximate

the original objective function of equation 46 with the modified objective function



J = (1-\beta)E\big(g(\omega, X_1)^2\big) + \beta\big[g(\omega, x_2) - d^T\big]\big[g(\omega, x_2) - d^T\big]^T.    (51)

The principal advantage gained by using equation 51 over equation 46 is that we can

solve iteratively for the parameters of the mapping function (assuming it is differentiable)

using gradient search. The constraint equations, however, are no longer satisfied with

equality over the training set. It has been recognized that the choice of constraint values

has direct impact on the performance of optimized linear correlators. Sudharsanan et al

[1990] have explored techniques for optimally assigning these values within the con-

straints of a linear topology. Other methods have been suggested [Mahalanobis et al.,

1994a, 1994b; Kumar and Mahalanobis, 1995] to improve the performance of distortion

invariant filters by relaxing the equality constraints. Mahalanobis [1994a] extends this

idea to unconstrained linear correlation filters. The OTSDF objective function of

Réfrégier [1991] appears similar to the modified objective function and indeed, for a lin-

ear topology this can be solved analytically as an optimal trade-off problem.










Our primary purpose for modifying the objective function is to allow for an iterative

method within the NL-MACE architecture. We have already shown in the previous chap-

ter that this choice of criterion is suitable for classification. We will show that the primary

qualities of the MACE filter are still maintained when we relax the equality constraints in

our formulation. Varying \beta in the range [0, 1] controls the degree to which the average

response to the rejection class is emphasized versus the variance about the desired output

over the recognition class.

As in the linear case, we can only estimate the expected variance of the output due to

the random vector input and its associated gradient. If, as in the MACE (or SMACE) filter

formulation, X1 is characterized by all 2-D circular (or linear) shifts of the recognition

class away from the origin then this term can be estimated with a sampled average over

the exemplars, x2, for all such shifts. From an iterative standpoint this still leads to the

impractical approach of training exhaustively over the entire output plane. It is desirable,

then, to find other equivalent characterizations of the rejection class which may alleviate

the computational load without significantly impacting performance.


4.5 Efficient Representation of the Rejection Class

Training becomes an issue once the associative memory structure takes a nonlinear

form. The output variance of the linear MACE filter is minimized for the entire output

plane over the training exemplars. Even when the coefficients of the MACE filter are

computed iteratively we need only consider the output point at the designated peak loca-

tion (constraint) for each pre-whitened training exemplar [Fisher and Principe, 1994]. This

is due to the fact that for the under-determined case, the linear projection which satisfies








the system of constraints with equality and has minimum norm is also the linear projection

which minimizes the response to images with a flat power spectrum. This solution is

arrived at naturally via a gradient search which only considers the response at the con-

straint location.

This is no longer the case when the mapping is nonlinear. Adapting the parameters via

gradient search (such as error backpropagation) on recognition class exemplars only at the

constraint location will not, in general, minimize the variance over the entire output image

plane. In order to minimize the variance over the entire output plane we must consider the

response of the filter to each location in the input image, not just the constraint location.

The MACE filter optimization criterion minimizes, in the average, the response to all

images with the same second order statistics as the rejection class. At the output of the pre-

whitener (prior to the MLP) any white sequence will have the same second order statistics

as the rejection class. This condition can be exploited to make the training of the MLP

more efficient.

From an implementation standpoint, the pre-whitening stage and the input layer

weights can be combined into a single equivalent linear transformation, however, pre-

whitening separately allows the rejection class to be represented by white sequences at the

input to the MLP during the training phase.

This result is due to the statistical formulation of the optimization criterion. Minimiz-

ing the response to white sequences, in the average, minimizes the response to shifts of the

exemplar images since they have the same second-order statistics (after pre-whitening).

Consequently, we do not have to train over the entire output plane exhaustively, thereby

reducing training times proportionally by the input image size, N_1N_2. Instead, we use a









small number of randomly generated white sequences to efficiently represent the rejection

class. The result is an algorithm which is of order N_t + N_s (where N_s is the number of white noise rejection class exemplars) as compared to exhaustive training.
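The sketch below illustrates this training procedure under equation 51. It uses a minimal single-hidden-layer MLP with tanh nonlinearities rather than the two-hidden-layer network used in the experiments, and the dimensions, \beta, learning rate, and the placeholder array Y2 of pre-whitened exemplars are assumptions for illustration only; the essential point is that freshly drawn white sequences stand in for the rejection class at every epoch.

```python
import numpy as np

rng = np.random.default_rng(5)

def forward(W1, b1, w2, b2, Y):
    """One-hidden-layer MLP applied to each column of Y (pre-whitened vectors)."""
    H = np.tanh(W1 @ Y + b1)                 # hidden activations
    g = np.tanh(w2 @ H + b2)                 # scalar output per column
    return g.ravel(), H

def grads(W1, b1, w2, b2, Y2, d, Ynoise, beta):
    """Gradients of equation 51: J = (1-beta)*mean[g(noise)^2] + beta*mean[(g(x2)-d)^2]."""
    gW1, gb1, gw2, gb2 = np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(w2), 0.0
    for Y, target, weight in ((Ynoise, np.zeros(Ynoise.shape[1]), 1.0 - beta),
                              (Y2, d, beta)):
        g, H = forward(W1, b1, w2, b2, Y)
        dout = 2.0 * weight * (g - target) / Y.shape[1]   # dJ/dg for this term
        dout = dout * (1.0 - g ** 2)                      # back through the output tanh
        gw2 += dout @ H.T
        gb2 += dout.sum()
        dH = (w2.T @ dout[None, :]) * (1.0 - H ** 2)      # back through the hidden tanh
        gW1 += dH @ Y.T
        gb1 += dH.sum(axis=1, keepdims=True)
    return gW1, gb1, gw2, gb2

# Hypothetical setup: Y2 holds the pre-whitened recognition exemplars as columns.
dim, Nt, Ns, Nh = 256, 41, 64, 3
Y2 = rng.standard_normal((dim, Nt))          # placeholder for the whitened training images
d = np.ones(Nt)                              # desired outputs at the constraint location
W1 = 0.1 * rng.standard_normal((Nh, dim)); b1 = np.zeros((Nh, 1))
w2 = 0.1 * rng.standard_normal((1, Nh));   b2 = 0.0
beta, lr = 0.5, 0.05

for epoch in range(500):
    # Fresh white sequences each epoch represent the (pre-whitened) rejection class.
    Ynoise = rng.standard_normal((dim, Ns))
    gW1, gb1, gw2, gb2 = grads(W1, b1, w2, b2, Y2, d, Ynoise, beta)
    W1 -= lr * gW1; b1 -= lr * gb1; w2 -= lr * gw2; b2 -= lr * gb2
```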


4 6 Experimental Results

We now present experimental results which illustrate the technique and potential pit-

falls. There are four significant outcomes in the experiments presented in this section. The

first is that when using the white sequences to characterize the rejection class, the linear

solution is a strong attractor. The second outcome is that imposing orthogonality on the

input layer to the MLP tends to lead to a nonlinear solution with improved performance.

The third result, in which we restrict the rejection class to a subspace, yields a significant

decrease in the convergence time. The fourth result, in which we borrow from the idea of

using the interior of the convex hull to represent the rejection class [Kumar et al., 1994],

yields significantly better classification performance.

In these experiments we use the data depicted in figure 23. As in the previous experiments, images from vehicle 1a will be used as the training set. Vehicles 1b and 1c will be used as the recognition class while vehicles 2a and 2b will be used as a rejection/confusion class for testing purposes. In each case comparisons will be made to a baseline linear filter. Specifically, in all cases the value of \alpha for the linear filter is set to 0.99. The aspect separation between training images is 2.0 degrees. This results in 41 training exemplars from vehicle 1a. These settings of \alpha and aspect separation were found to give the best classifier performance for the linear filter with this data set. We continue to refer to this as

a MACE filter since the MACE criterion is so heavily emphasized. Technically it is an









OTSDF filter, but such nomenclature does not convey the type of pre-processing that is

being performed. We choose the value of \alpha so as to compare to the best possible MACE

filter for this data set.

The nonlinear filter will use the same pre-processor as the linear filter (i.e. \alpha = 0.99). The MLP structure is shown at the bottom of figure 22. It accepts an N_1N_2-element input vector (a preprocessed image reordered into a column vector), followed by two hidden layers (with two and three hidden PE nodes, respectively), and a single output node. The parameters of the MLP,

W_1 \in \Re^{2\times N_1N_2}, \quad W_2 \in \Re^{3\times 2}, \quad W_3 \in \Re^{1\times 3}, \quad \varphi \in \Re^{3\times 1},

are to be determined through gradient search. The gradient search technique used in all cases will be the error backpropagation algorithm.


4.6.1 Experiment I: noise training

As stated, using the statistical approach, the rejection class is characterized by white

noise sequences at the input to the MLP. The recognition class is characterized by the

exemplars. It is from these white noise sequences that the MLP, through the backpropaga-

tion learning algorithm, captures information about the rejection class. So it would seem a

simple matter, during the training stage, to present random white noise sequences as the

rejection class exemplars. This is exactly the training method used for this experiment.

We observed empirically that with this method of training the linear solution is a strong attractor. The results of the first experiment demonstrate this behavior.








Figure 28 shows the peak output response taken over all images of vehicle 1a for both the linear (top) and nonlinear (bottom) filters. In the figure we see that for the linear filter the peak constraint (unity) is met exactly for the training exemplars, with degradation for the between-aspect exemplars. As mentioned previously, if the pure MACE filter criterion were used (\alpha equal to unity), the peak in the output plane is guaranteed to be at the constraint location [Mahalanobis et al., 1987]. It turns out that for this data set the peak output also occurs at the constraint location for the training images; however, with \alpha = 0.99 it was not guaranteed. Examination of the peak output response for the NL-MACE filter shows
not guaranteed. Examination of the peak output response for the NL-MACE filter shows

that the constraints are met very closely (but not exactly) for the training exemplars also

with degradation in the peak output response at between aspect locations. The degradation

for the nonlinear filter is noticeably less than in the linear case and so in this regard it has

outperformed the linear filter.

Figure 29 shows the output plane response for a single image of vehicle 1a (not one
used for computing the filter coefficients) for the linear filter (top) and the nonlinear filter
(bottom). Again in this figure we see that both filters result in a noticeable peak when the
image is centered on the filter and a reduced response when the image is shifted. The
reduction in response to the shifted image is again noticeably better in the nonlinear filter
than in the linear filter. This was found to be true for all images of vehicle 1a, and so
in this regard we can again say that the nonlinear filter has outperformed the linear filter.
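
For the linear filter, the full output plane (and hence the peak response and the response to shifted versions of an image) can be computed by circular correlation in the frequency domain. The sketch below assumes a filter already given as a space-domain template; the image and filter used here are random placeholders rather than actual data.

```python
import numpy as np

def output_plane(image, h):
    """Circular correlation of an image with a filter template via the 2-D FFT.

    image, h : (N1, N2) arrays; returns the (N1, N2) correlation surface whose
    value at (0, 0) is the response to the centered (unshifted) image.
    """
    G = np.fft.fft2(image) * np.conj(np.fft.fft2(h))
    return np.real(np.fft.ifft2(G))

rng = np.random.default_rng(2)
img = rng.standard_normal((64, 64))        # placeholder image
h = rng.standard_normal((64, 64))          # placeholder filter template
plane = output_plane(img, h)
center_response = plane[0, 0]              # response with no shift
peak_response = plane.max()                # peak over all circular shifts
```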

However, as we have already illustrated for the linear case, these measures are not suf-

ficient to predict classifier performance alone and are certainly not sufficient to compare

linear systems to nonlinear systems. This point is made clear in table 3 which summarizes

the classifier performance at two probabilities of detection for all of the experiments










[Figure 28 plots: peak output response (0.60 to 1.10) versus aspect (0 to 80 degrees) for the linear filter (top) and nonlinear filter (bottom).]
Figure 28. Peak output response of linear and nonlinear filters over the training
set. The nonlinear filter clearly outperforms the linear filter by this
metric alone.

reported here when vehicles 1b and 1c are used as the recognition class and vehicles 2a
and 2b are used for the rejection class. At this point we are only interested in the results
pertaining to the linear filter (our baseline) and the nonlinear filter of experiment I.
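
The entries in table 3 are single operating points read off the ROC curve. A sketch of how such a point can be computed from peak-output scores is shown below; the Gaussian scores are synthetic stand-ins for the actual filter outputs, and the quantile-based threshold choice is an assumption made for illustration.

```python
import numpy as np

def pfa_at_pd(recog_scores, reject_scores, pd_target):
    """False alarm rate when the detection threshold is chosen to achieve a
    given probability of detection on the recognition-class scores."""
    thresh = np.quantile(recog_scores, 1.0 - pd_target)   # Pd fraction of scores exceed this
    return np.mean(reject_scores >= thresh)

rng = np.random.default_rng(3)
recog = rng.normal(0.9, 0.1, 1000)    # synthetic recognition-class peak outputs
reject = rng.normal(0.4, 0.2, 1000)   # synthetic rejection-class peak outputs
print(pfa_at_pd(recog, reject, 0.80), pfa_at_pd(recog, reject, 0.99))
```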













































Figure 29. Output response of linear filter (top) and nonlinear filter (bottom).
The response is for a single image from the training set, but not one
used to compute the filter.

This table shows that the classifier performance of the linear and nonlinear filters
is nominally the same, despite what may be perceived as better performance of the
nonlinear filter with regard to peak response over the training vehicle and reduced output
plane response to shifts of the image. Furthermore, if we examine figure 30, which shows








the ROC curve for both filters we see that they overlay each other. From a classification

standpoint the two filters are equivalent.

[Figure 30 plot: ROC curve, Pd versus Pfa, both axes 0.0 to 1.0.]
Figure 30. ROC curves for linear filter (solid line) versus nonlinear filter (dashed
line). Despite improved performance of the nonlinear filter as
measured by peak output response and reduced variance over the
training set, the filters are equivalent with regards to classification
over the testing set.


This result is best explained by figure 31. Recall the points u1 and
u2 labeled in figure 22.


We can view these outputs as a feature space; that is, the MLP discriminant function
can be superimposed on the projection of the input image onto this space. In this case the
feature space is a representation of the input vector internal to the MLP structure. The
designation of these points as features is due to the fact that they represent some abstract
quality of the data, and the decision surface can be computed as a function of the features.






















Figure 31. Experiment I: Resulting feature space from simple noise training.
Note that all points are projected onto a single curve in the feature
space. In the top panel (u1 versus u2, both axes from -1.0 to 1.0), squares are the
recognition class training exemplars, triangles are white noise rejection class
exemplars, and plus signs are the images of vehicle 1a not used for training. In the
bottom panel, squares are the peak responses from vehicles 1b and 1c, and
triangles are the peak responses from vehicles 2a and 2b.

Mathematically this can be written




$W_1^T x = u, \qquad y_o = \sigma\left(W_3^T\,\sigma\left(W_2^T\,\sigma(u) + \varphi\right)\right).$   (52)


Recall that the matrix $W_i$ represents the connectivities from the output of layer $(i-1)$ to
the inputs of the PEs of layer $i$, $\varphi$ is a constant bias term, and $\sigma(\cdot)$ is a sigmoidal
nonlinearity (the hyperbolic tangent function in this case).
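
The contour lines shown in figure 31 correspond to evaluating the portion of the network above u on a grid of (u1, u2) values. A minimal sketch of this evaluation is given below; the weight values, the single bias vector, and the grid range are illustrative placeholders consistent with equation (52), not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
W2 = rng.standard_normal((2, 3))    # placeholder upper-layer weights
W3 = rng.standard_normal((3, 1))
phi = np.zeros(3)                   # constant bias term, as in equation (52)

def discriminant(u):
    """Network output as a function of the feature-space point u = W1^T x."""
    return np.tanh(W3.T @ np.tanh(W2.T @ np.tanh(u) + phi)).item()

# Evaluate the discriminant on a grid over [-1, 1] x [-1, 1], as plotted in figure 31.
u1, u2 = np.meshgrid(np.linspace(-1, 1, 51), np.linspace(-1, 1, 51))
surface = np.array([[discriminant(np.array([a, b]))
                     for a, b in zip(row1, row2)]
                    for row1, row2 in zip(u1, u2)])
```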

Figure 31 shows this projection for the training set (top) and the testing set (bottom).

What is significant in the figure is that although the discriminant as a function of the
vector u is nonlinear, the projections of the images lie on a single curve in this feature space.
Topologically this filter can be put into one-to-one correspondence with a linear projection.
This is not to say that the linear solution is undesirable, but under this optimization
criterion it can already be computed in closed form. Furthermore, in a space as rich as the ISAR image
space it is unlikely that the linear solution will give the best classification performance.

Table 3. Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear
filter versus four different types of nonlinear training. N: white noise training, G-S: Gram-Schmidt
orthogonalization, subN: PCA subspace noise, C-H: convex hull rejection class.

                                     Pfa (%)
  Pd (%)   linear filter   nonlinear filter, experiments I-IV
                           I (N)    II (N, G-S)   III (subN, G-S)   IV (subN, G-S, C-H)
  80           4.37         4.37       3.74            2.81               2.45
  99          42.43        41.87      27.15           26.52              15.33


4.6.2 Experiment II: noise training with an orthogonalization constraint

As a means of avoiding the linear solution, a modification was made to the training
algorithm. The modification was to impose orthogonality on the columns of $W_1$ through a
Gram-Schmidt process. The motivation for doing this stems from the fact that we are
working in a pre-whitened image space. In a pre-whitened image space this condition is
sufficient to ensure that the outputs in the feature space, as measured at u1 and u2, will be
uncorrelated over the rejection class. Mathematically this can be shown as


$$
E\{u u^T\} = E\{W_1^T x_1 x_1^T W_1\} = W_1^T E\{x_1 x_1^T\} W_1
= \begin{bmatrix} w_1^T E\{x_1 x_1^T\} w_1 & w_1^T E\{x_1 x_1^T\} w_2 \\ w_2^T E\{x_1 x_1^T\} w_1 & w_2^T E\{x_1 x_1^T\} w_2 \end{bmatrix}
= \sigma^2 \begin{bmatrix} \|w_1\|^2 & 0 \\ 0 & \|w_2\|^2 \end{bmatrix},
$$

where $w_1, w_2 \in \mathbb{R}^{N_1 N_2 \times 1}$ are the columns of $W_1$. This result is true for any number of
nodes in the first layer of the MLP.
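
A small numerical check of this property is sketched below. Here np.linalg.qr stands in for the Gram-Schmidt step (it additionally normalizes the columns, a stronger condition than required), and the image size and sample count are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 32 * 32                                  # placeholder image size N1*N2
W1 = rng.standard_normal((N, 2))

# Orthogonalize the columns of W1 (QR also makes them unit norm, which is
# stronger than plain Gram-Schmidt but sufficient for the decorrelation argument).
W1, _ = np.linalg.qr(W1)

# Monte Carlo check: for white inputs, E{u u^T} should be (nearly) diagonal.
X = rng.standard_normal((5000, N))           # white rejection-class exemplars, one per row
U = X @ W1                                   # rows are u^T = x^T W1
print(np.round(U.T @ U / len(X), 3))         # off-diagonal terms should be near zero
```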

The results of the training with this modification are shown in figure 32, which is the
resulting feature space as measured at u1 and u2. From this figure we can see that the
discriminant function, represented by the contour lines, is a nonlinear function of u1 and u2.
Furthermore, because the projections of the vehicles into the feature space do not lie on a
single curve (as in the previous experiment), the features represent different discrimination
information with regard to both the rejection and recognition classes. The bottom of
the figure, showing the projection of a random sampling of the test vehicles (all 1282
would be too dense for plotting), shows that both features are useful for separating vehicle 1
from vehicle 2. Examination of table 3 (column II in the nonlinear results) shows that at
the two detection probabilities of interest improved false alarm performance has been









obtained. Figure 33 shows the ROC curve for the resulting filter. It is evident that the non-

linear filter is a uniformly better test for classification.


[Figure 32 plots: feature space (u1 versus u2), training data (top) and testing data (bottom), both axes from -1.0 to 1.0.]

Figure 32. Experiment II: Resulting feature space when orthogonality is imposed
on the input layer of the MLP. In the top panel squares indicate the
recognition class training exemplars, triangles indicate white noise
rejection class exemplars, and plus signs are the images of vehicle 1a
not used for training. In the bottom panel, squares are the peak
responses from vehicles 1b and 1c, triangles are the peak responses
from vehicles 2a and 2b.









[Figure 33 plot: ROC curve, Pd versus Pfa, both axes 0.0 to 1.0.]
Figure 33. Experiment II: Resulting ROC curve with orthogonality constraint.

Convinced that the filter represents a better test for classification than the linear filter,

we now examine the result for the other features of interest. Figure 34 shows the output

response for this filter for one of the images. As seen in the figure, a noticeable peak at the

center of the output plane has been achieved. This shows that the filter maintains the local-

ization properties of the linear filter.

In this way, the characterization of the rejection class by its second-order statistics, the
addition of the orthogonality constraint at the input layer of the MLP, and the use of a
nonlinear topology have resulted in a superior classification test.


4.6.3 Experiment III: subspace noise training

The next experiment describes an additional modification to this technique. One of
the issues in training nonlinear systems is the convergence time. Training methods which
require overly long training times are not of much practical use. We have already shown



























Figure 34. Experiment II: Output response to an image from the recognition class
training set.

how to reduce the training complexity by recognizing that we can sufficiently describe the

rejection class with white noise sequences. We now show a more compact description of

the rejection class which leads to shorter convergence times, as demonstrated empirically.

This description relies on the well known singular value decomposition (SVD).

We view the random white sequences as stochastic probes of the performance surface

in the whitened image space. The classifier discriminant function is, of course, not deter-

mined by the rejection class alone. It is also affected by the recognition class. We have

shown previously that the white noise sequences enable us to probe the input space more

efficiently than examining all shifts of the recognition exemplars. However, we are still

searching a space of dimension equal to the image size, $N_1 N_2$.

One of the underlying premises to a data driven approach is that the information about

a class is conveyed through exemplars. In this case the recognition class is represented by








$N_t < N_1 N_2$ exemplars placed in the data matrix $x_2 \in \mathbb{R}^{N_1 N_2 \times N_t}$. It is well known that
$x_2$, if it is full rank, can be decomposed with the SVD as



$x_2 = U \Lambda V^T,$   (53)


where the columns of $U \in \mathbb{R}^{N_1 N_2 \times N_t}$ form an orthonormal basis that spans the column space
of the data matrix, $\Lambda$ is the diagonal matrix of singular values, and $V$ is an orthogonal matrix. This
decomposition has many well known properties, including compactness of representation for the
columns of the data matrix [Gerbrands, 1981]. Indeed, as has been noted by Gheen [1990],
the SDF can be written as a function of the SVD of the data matrix:



$h_{SDF} = U \Lambda^{-1} V^T d.$   (54)
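
A sketch of equations (53) and (54) with placeholder data is given below; the economy-size SVD is used so that U has Nt columns, and the result is cross-checked against the conventional SDF solution $h = x_2 (x_2^T x_2)^{-1} d$. The data matrix and constraint vector here are random stand-ins, not the ISAR data.

```python
import numpy as np

rng = np.random.default_rng(6)
N, Nt = 32 * 32, 41                      # placeholder image size and exemplar count
X2 = rng.standard_normal((N, Nt))        # data matrix, one vectorized exemplar per column
d = np.ones(Nt)                          # desired peak values at the constraint location

U, s, Vt = np.linalg.svd(X2, full_matrices=False)   # X2 = U diag(s) V^T
h_sdf = U @ np.diag(1.0 / s) @ Vt @ d               # equation (54): h = U Lambda^-1 V^T d

# Cross-check against the conventional SDF solution h = X2 (X2^T X2)^(-1) d.
h_ref = X2 @ np.linalg.solve(X2.T @ X2, d)
print(np.allclose(h_sdf, h_ref))
```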

We will use this recognition class representation to further refine our description of the
rejection class for training. As we stated, the underlying assumption in a data driven
method is that the data matrix $x_2$ conveys information about the recognition class; any
information about the recognition class outside the space of the data matrix is not attainable
from this perspective. The information certainly exists, but there is no mechanism by
which to include it in the determination of the discriminant function within this framework.
This does, however, lead to a more efficient description of the rejection class. We can
modify our optimization criterion to reduce the response to white sequences as they are
projected into the $N_t$-dimensional subspace of the data matrix. Effectively this reduces the
search for a discriminant function in an $N_1 N_2$-dimensional space to an $N_t$-dimensional
subspace.







The adaptation scheme of backpropagation allows a simple mechanism to implement
this constraint. The adaptation of the matrix $W_1$ at iteration $k$ can be written as

$W_1(k+1) = W_1(k) + x_i(k)\,\varepsilon_i^T(k),$   (55)


where $\varepsilon_i$ is a column vector derived from the backpropagated error and $x_i(k)$ is the
current input exemplar from either class presented to the network, which, by design, lies in the
subspace spanned by the columns of $U$. From equation (55), if the rejection class noise
exemplars are restricted to lie in the data space of $x_2$, which can be achieved by projecting
random vectors of size $N_t$ onto the matrix $U$ above, and $W_1$ is initialized to be a random
projection from this space, we will be assured that the columns of $W_1$ only extract
information from the data space of $x_2$. This is because the columns of $W_1$ will only be
constructed from vectors which lie in the column space of $U$, and so will remain orthogonal to
any vector component that lies outside that space.
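
A sketch of this subspace restriction is shown below. The data matrix, the step size, and the stand-in error vector are placeholders; the point is only that noise exemplars of the form Un and a W1 initialized within the span of U keep updates of the form of equation (55) inside the column space of U.

```python
import numpy as np

rng = np.random.default_rng(7)
N, Nt = 32 * 32, 41
X2 = rng.standard_normal((N, Nt))                   # placeholder recognition-class data matrix
U, _, _ = np.linalg.svd(X2, full_matrices=False)    # basis for the column space of x2

def subspace_noise():
    """Rejection-class noise confined to the column space of U: x_rej = U n."""
    return U @ rng.standard_normal(Nt)

# Initialize W1 as random projections drawn from the same subspace.
W1 = U @ rng.standard_normal((Nt, 2))

# One update of the form of equation (55): the outer product x * eps^T stays in
# the column space of U whenever x does, so W1 never leaves that subspace.
eta = 0.01                                  # assumed step size (not specified in the text)
x = subspace_noise()
eps = rng.standard_normal(2)                # stands in for the backpropagated error vector
W1 = W1 + eta * np.outer(x, eps)
```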

The search for a discriminant function is now reduced from within an $N_1 N_2$-dimensional
space to a search from within an $N_t$-dimensional space. Due to the dimensionality
reduction achieved, we would expect the convergence time to be reduced.

This is the method that was used for the third experiment. Rejection class noise
exemplars were generated by projecting a random vector $n \in \mathbb{R}^{N_t \times 1}$ onto the basis $U$ by
$x_{rej} = Un$. In figure 35 the resulting discriminant function is shown as in the previous







experiments and the result is similar to experiment II. The classifier performance as mea-

sured in table 3 and the ROC curve of figure 36 are also nominally the same.


Figure 35. Experiment III: Resulting feature space when the subspace noise is
used for training. Symbols represent the same data as in the previous
case.


There are, however, two notable differences. Examination of figure 37 shows that the
output response to shifted images is even lower, allowing for better localization. This
condition was found to be the case throughout the data set.









[Figure 36 plot: ROC curve, Pd versus Pfa, both axes 0.0 to 1.0.]

Figure 36. Experiment III: Resulting ROC curve for subspace noise training.

Of more significance is the result
shown in figure 38, in which we compare the learning curves of all of the experiments
presented here. In this figure the dashed and dash-dot lines are the learning curves for
experiments II and III, respectively. In this case the convergence time was reduced nominally
by a factor of three, from roughly 100 epochs to approximately 30 epochs. Here an epoch
represents one pass through all of the training data.


4.6.4 Experiment IV: convex hull approach

In this experiment we present a technique which borrows from the ideas of Kumar et
al. [1994]. That approach designed an SDF which rejects images that are away from the



























Figure 37. Experiment III: Output response to an image from the recognition
class training set


[Figure 38 plot: learning curves, error (log scale) versus training epoch.]


Figure 38. Learning curves for three methods. Experiment II: White noise
training (dashed line). Experiment III: subspace noise (dashed-dot
line). Experiment IV: subspace noise plus convex hull exemplars
(solid line).



