NONLINEAR EXTENSIONS TO THE
MINIMUM AVERAGE CORRELATION ENERGY FILTER
By
JOHN W. FISHER III
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF
THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF
PHILOSOPHY
UNIVERSITY OF FLORIDA
1997
ACKNOWLEDGEMENTS
There are many people I would like to acknowledge for their help in the genesis of this
manuscript. I would begin with my family for their constant encouragement and support.
I am grateful to the Electronic Communications Laboratory and the Army Research
Laboratory for their support of the research at the ECL. I was fortunate to work with very
talented people, Marion Bartlett, Jim Bevington, and Jim Kurtz, in the areas of ATR and
coherent radar systems. In particular, I cannot overstate the influence that Marion Bartlett
has had on my perspective of engineering problems. I would also like to thank Jeff Sichina
of the Army Research Laboratory for providing many interesting problems, perhaps too
interesting, in the field of radar and ATR. A large part of who I am technically has been
shaped by these people.
I would, of course, like to acknowledge my advisor, Dr. Jose Principe, for providing me
with an invaluable environment for the study of nonlinear systems and excellent guidance
throughout the development of this thesis. His influence will leave a lasting impression on
me. I would also like to thank DARPA, whose funding enabled a great deal of the research
that went into this thesis. I would also like to thank Drs. David Casasent and
Paul Viola for taking an interest in my work and offering helpful advice.
I would also like to thank the students, past and present, of the Computational
NeuroEngineering Laboratory. The list includes, but is not limited to, Chuan Wang for useful
discussions on information theory, Neil Euliano for providing much needed recreational
opportunities and intramural championship T-shirts, and Andy Mitchell for being a good friend
to go to lunch with, for suffering through long inane technical discussions, and for now
being a better climber than I am. There are certainly others, and I am grateful to all.
Finally I would like to thank my wife, Anita, for enduring a seemingly endless ordeal,
for allowing me to use every ounce of her patience, and for sacrificing some of her best
years so that I could finish this Ph.D. I hope it has been worth it.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTERS

1 INTRODUCTION
1.1 Motivation

2 BACKGROUND
2.1 Discussion of Distortion Invariant Filters
2.1.1 Synthetic Discriminant Function
2.1.2 Minimum Variance Synthetic Discriminant Function
2.1.3 Minimum Average Correlation Energy Filter
2.1.4 Optimal Tradeoff Synthetic Discriminant Function
2.2 Preprocessor/SDF Decomposition

3 THE MACE FILTER AS AN ASSOCIATIVE MEMORY
3.1 Linear Systems as Classifiers
3.2 MSE Criterion as a Proxy for Classification Performance
3.2.1 Unrestricted Functional Mappings
3.2.2 Parameterized Functional Mappings
3.2.3 Finite Data Sets
3.3 Derivation of the MACE Filter
3.3.1 Preprocessor/SDF Decomposition
3.4 Associative Memory Perspective
3.5 Comments

4 STOCHASTIC APPROACH TO TRAINING NONLINEAR SYNTHETIC DISCRIMINANT FUNCTIONS
4.1 Nonlinear Iterative Approach
4.2 A Proposed Nonlinear Architecture
4.2.1 Shift Invariance of the Proposed Nonlinear Architecture
4.3 Classifier Performance and Measures of Generalization
4.4 Statistical Characterization of the Rejection Class
4.4.1 The Linear Solution as a Special Case
4.4.2 Nonlinear Mappings
4.5 Efficient Representation of the Rejection Class
4.6 Experimental Results
4.6.1 Experiment I: noise training
4.6.2 Experiment II: noise training with an orthogonalization constraint
4.6.3 Experiment III: subspace noise training
4.6.4 Experiment IV: convex hull approach

5 INFORMATION-THEORETIC FEATURE EXTRACTION
5.1 Introduction
5.2 Motivation for Feature Extraction
5.3 Information Theoretic Background
5.3.1 Mutual Information as a Self-Organizing Principle
5.3.2 Mutual Information as a Criterion for Feature Extraction
5.3.3 Prior Work in Information Theoretic Neural Processing
5.3.4 Nonparametric PDF Estimation
5.4 Derivation of the Learning Algorithm
5.5 Gaussian Kernels
5.6 Maximum Entropy/PCA: An Empirical Comparison
5.7 Maximum Entropy: ISAR Experiment
5.7.1 Maximum Entropy: Single Vehicle Class
5.7.2 Maximum Entropy: Two Vehicle Classes
5.8 Computational Simplification of the Algorithm
5.9 Conversion of Implicit Error Direction to an Explicit Error
5.9.1 Entropy Minimization as Attraction to a Point
5.9.2 Entropy Maximization as Diffusion
5.9.3 Stopping Criterion
5.10 Observations
5.11 Mutual Information Applied to the Nonlinear MACE Filters

6 CONCLUSIONS

APPENDIX
A DERIVATIONS

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES

1 ISAR images of two vehicle types.
2 MSF peak output response of training vehicle 1a over all aspect angles.
3 MSF peak output response of testing vehicles 1b and 2a over all aspect angles.
4 MSF output image plane response.
5 SDF peak output response of training vehicle 1a over all aspect angles.
6 SDF peak output response of testing vehicles 1b and 2a over all aspect angles.
7 SDF output image plane response.
8 MACE filter output image plane response.
9 MACE peak output response of vehicles 1a, 1b, and 2a over all aspect angles.
10 Example of a typical OTSDF performance plot.
11 OTSDF filter output image plane response.
12 OTSDF peak output response of vehicle 1a over all aspect angles.
13 OTSDF peak output response of vehicles 1b and 2a over all aspect angles.
14 Decomposition of distortion invariant filter in space domain.
15 Adaline architecture.
16 Decomposition of MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars) followed by a synthetic discriminant function.
17 Decomposition of MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars) followed by a linear associative memory.
18 Peak output response over all aspects of vehicle 1a when the data matrix is not full rank.
19 Output correlation surface for LMS computed filter from non full rank data.
20 Learning curve for LMS approach.
21 NMSE between closed form solution and iterative solution.
22 Decomposition of optimized correlator as a preprocessor followed by SDF/LAM (top); nonlinear variation shown with MLP replacing SDF in signal flow (middle); detail of the MLP (bottom). The linear transformation represents the space domain equivalent of the spectral preprocessor.
23 ISAR images of two vehicle types shown at aspect angles of 5, 45, and 85 degrees respectively.
24 Generalization as measured by the minimum peak response.
25 Generalization as measured by the peak response mean square error.
26 Comparison of ROC curves.
27 ROC performance measures versus ...
28 Peak output response of linear and nonlinear filters over the training set.
29 Output response of linear filter (top) and nonlinear filter (bottom).
30 ROC curves for linear filter (solid line) versus nonlinear filter (dashed line).
31 Experiment I: Resulting feature space from simple noise training.
32 Experiment II: Resulting feature space when orthogonality is imposed on the input layer of the MLP.
33 Experiment II: Resulting ROC curve with orthogonality constraint.
34 Experiment II: Output response to an image from the recognition class training set.
35 Experiment III: Resulting feature space when subspace noise is used for training.
36 Experiment III: Resulting ROC curve for subspace noise training.
37 Experiment III: Output response to an image from the recognition class training set.
38 Learning curves for three methods.
39 Experiment IV: Resulting feature space from convex hull training.
40 Experiment IV: Resulting ROC curve with convex hull approach.
41 Classical pattern classification decomposition.
42 Decomposition of NLMACE as a cascade of feature extraction followed by discrimination.
43 Mutual information approach to feature extraction.
44 Mapping as feature extraction. Information content is measured in the low dimensional space of the observed output.
45 A signal flow diagram of the learning algorithm.
46 Gradient of two-dimensional Gaussian kernel. The kernels act as attractors to low points in the observed PDF on the data when entropy maximization is desired.
47 Mixture of Gaussians example.
48 Mixture of Gaussians example, entropy minimization and maximization.
49 PCA vs. entropy, Gaussian case.
50 PCA vs. entropy, non-Gaussian case.
51 PCA vs. entropy, non-Gaussian case.
52 Example ISAR images from two vehicles used for experiments.
53 Single vehicle experiment, 100 iterations.
54 Single vehicle experiment, 200 iterations.
55 Single vehicle experiment, 300 iterations.
56 Two vehicle experiment.
57 Two dimensional attractor functions.
58 Two dimensional regulating function.
59 Magnitude of the regulating function.
60 Approximation of the regulating function.
61 Feedback functions for implicit error term.
62 Entropy minimization as local attraction.
63 Entropy maximization as diffusion.
64 Stopping criterion.
65 Mutual information feature space.
66 ROC curves for mutual information feature extraction (dotted line) versus linear MACE filter (solid line).
67 Mutual information feature space resulting from convex hull exemplars.
68 ROC curves for mutual information feature extraction (dotted line) versus linear MACE filter (solid line).
LIST OF TABLES

1 Classifier performance measures when the filter is determined by either of the common measures of generalization, as compared to best classifier performance, for two values of ...
2 Correlation of generalization measures to classifier performance. In both cases (... equal to 0.5 or 0.95) the classifier performance, as measured by the area of the ROC curve or Pfa at Pd equal to 0.8, has the opposite correlation to what would be expected of a useful measure for predicting performance.
3 Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear filter versus four different types of nonlinear training. N: white noise training, GS: Gram-Schmidt orthogonalization, subN: PCA subspace noise, CH: convex hull rejection class.
4 Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear filter versus experiments III and IV from section 4.6 and mutual information feature extraction. The symbols indicate the type of rejection class exemplars used. N: white noise training, GS: Gram-Schmidt orthogonalization, subN: PCA subspace noise, CH: convex hull rejection class.
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
NONLINEAR EXTENSIONS TO THE MINIMUM AVERAGE
CORRELATION ENERGY FILTER
By
John W. Fisher III
May 1997
Chairman: Dr. Jose C. Principe
Major Department: Electrical and Computer Engineering
The major goal of this research is to develop efficient methods by which the family of
distortion invariant filters, specifically the minimum average correlation energy (MACE)
filter, can be extended to a general nonlinear signal processing framework. The primary
application of MACE filters has been to pattern classification of images. Two desirable
qualities of MACE-type correlators are ease of implementation via correlation and
analytic computation of the filter coefficients.
Our motivation for exploring nonlinear extensions to these filters stems from the
well-known limitations of the linear systems approach to classification. Among these
limitations is the attempt to solve the classification problem in a signal representation
space, whereas the classification problem is more properly solved in a decision or
probability space. An additional limitation of the MACE filter is that it can only be used to realize a
linear decision surface regardless of the means by which it is computed. These limitations
lead to suboptimal classification and discrimination performance.
Extension to nonlinear signal processing is not without cost. Solutions must in general
be computed iteratively. Our approach was motivated by the early proof that the MACE
filter is equivalent to the linear associative memory (LAM). The associative memory
perspective is more properly associated with the classification problem and has been
developed extensively in an iterative framework.
In this thesis we demonstrate a method emphasizing a statistical perspective of the
MACE filter optimization criterion. Through the statistical perspective efficient methods
of representing the rejection and recognition classes are derived. This, in turn, enables a
machine learning approach and the synthesis of more powerful nonlinear discriminant
functions which maintain the desirable properties of the linear MACE filter, namely,
localized detection and shift invariance.
We also present a new information theoretic approach to training in a self-organized or
supervised manner. Information theoretic signal processing looks beyond the second-order
statistical characterization inherent in the linear systems approach. The information
theoretic framework probes the probability space of the signal under analysis. This technique
has wide application beyond nonlinear MACE filter techniques and represents a powerful
new advance to the area of information theoretic signal processing.
Empirical results, comparing the classical linear methodology to the nonlinear
extensions, are presented using inverse synthetic aperture radar (ISAR) imagery. The results
demonstrate the superior classification performance of the nonlinear MACE filter.
CHAPTER 1
INTRODUCTION
1.1 Motivation
Automatic target detection and recognition (ATD/R) is a field of pattern recognition.
The goal of an ATD/R system is to quickly and automatically detect and classify objects
which may be present within large amounts of data (typically imagery) with a minimum of
human intervention. In an ATD/R system, it is not only desirable to recognize various
targets, but also to locate them with some degree of accuracy. The minimum average correlation
energy (MACE) filter [Mahalanobis et al., 1987] is of interest to the ATD/R problem due
to its localization and discrimination properties. The MACE filter is a member of a family
of correlation filters derived from the synthetic discriminant function (SDF) [Hester and
Casasent, 1980]. The SDF and its variants have been widely applied to the ATD/R
problem. We will describe synthetic discriminant functions in more detail in chapter 2. Other
generalizations of the SDF include the minimum variance synthetic discriminant function
(MVSDF) [Kumar, 1986], the MACE filter, and more recently the Gaussian minimum
average correlation energy (GMACE) [Casasent et al., 1991] and the minimum noise and
correlation energy (MINACE) [Ravichandran and Casasent, 1992] filters.
This area of filter design is commonly referred to as distortion-invariant filtering. It is a
generalization of matched spatial filtering for the detection of a single object to the
detection of a class of objects, usually in the image domain. Typically the object class is
represented by a set of exemplars. The exemplar images represent the image class through a
range of "distortions" such as a variation in viewing aspect of a single object. The goal is
to design a single filter which will recognize an object class through the entire range of
distortion. Under the design criterion the filter is equally matched to the entire range of
distortion as opposed to a single viewpoint as in a matched filter. Hence the nomenclature
distortion-invariant filtering [Kumar, 1992].
The bulk of the research using these types of filters has focused on optical and
infrared (IR) imagery and overcoming recognition problems in the presence of distortions
associated with 3D to 2D mappings, e.g. scale and rotation (in-plane and out-of-plane).
Recently, however, this technique has been applied to radar imagery [Novak et al., 1994;
Fisher and Principe, 1995a; Chiang et al., 1995]. In contrast to optical or infrared
imagery, the scale of each pixel within a radar image is usually constant and known.
Consequently, radar imagery does not suffer from scale distortions of objects.
In the family of distortion invariant filters, the MACE filter has been shown to possess
superior discrimination properties [Mahalanobis et al., 1987; Casasent and Ravichandran,
1992]. It is for this reason that this work emphasizes nonlinear extensions to the MACE
filter. The MACE filter and its variants are designed to produce a narrow, constrained
amplitude peak response when the filter mask is centered on a target in the recognition
class while minimizing the energy in the rest of the output plane. This property provides
desirable localization for detection. Another property of the MACE filter is that it is less
susceptible to out-of-class false alarms [Mahalanobis et al., 1987]. While the focus of this
work will be on the MACE filter criterion, it should be stated that all of the results
presented here are equally applicable to any of the distortion invariant filters mentioned above
with appropriate changes to the respective optimization criteria.
Although the MACE filter does have superior false alarm properties, it also has some
fundamental limitations. Since it is a linear filter, it can only be used to realize linear
decision surfaces. It has also been shown to be limited in its ability to generalize to exemplars
that are in the recognition class (but not in the training set), while simultaneously rejecting
out-of-class inputs [Casasent and Ravichandran, 1992; Casasent et al., 1991]. The number
of design exemplars can be increased in order to overcome generalization problems;
however, the calculation of the filter coefficients becomes computationally prohibitive and
numerically unstable as the number of design exemplars is increased [Kumar, 1992]. The
MINACE and GMACE variations have improved generalization properties with a slight
degradation in the average output plane variance [Ravichandran and Casasent, 1992] and
sharpness of the central peak [Casasent et al., 1991], respectively.
This research presents a basis by which the MACE filter, and by extension all linear
distortion invariant filters, can be extended to a more general nonlinear signal processing
framework. In the development it is shown that the performance of the linear MACE filter
can be improved upon in terms of generalization while maintaining its desirable
properties, i.e. a sharp, constrained peak at the center of the output plane.
A more detailed description of the developmental progression of distortion invariant
filtering is given in chapter 2. In this chapter a qualitative comparison of the various
distortion invariant filters is presented using inverse synthetic aperture radar (ISAR) imagery.
The application of pattern recognition techniques to high-resolution radar imagery has
recently become a topic of great interest with the advent of widely available
instrumentation-grade imaging radars. High-resolution radar imagery poses a special
challenge to distortion invariant filtering in that distortions such as a rotation in aspect
of an object do not manifest themselves as rotations within the radar image (as opposed to
optical imagery). In this case the distortion is not purely geometric, but more abstract.
Chapter 3 presents a derivation of the MACE filter as a special case of Kohonen's
linear associative memory [1988]. This relationship is important in that the associative
memory perspective is the starting point for developing nonlinear extensions to the MACE
filter.
In chapter 4 the basis upon which the MACE filter can be extended to nonlinear adaptive
systems is developed. In this chapter a nonlinear architecture is proposed for the extension
of the MACE filter. A statistical perspective of the MACE filter is discussed which leads
naturally into a class representational viewpoint of the optimization criterion of distortion
invariant filters. Commonly used measures of generalization for distortion invariant
filtering are also discussed. The results of the experiments presented show that the measures
are not appropriate for the task of classification. It is interesting to note that the analysis
indicates the appropriateness of the measures is independent of whether the mapping is
linear or nonlinear. The analysis also addresses the merit of the MACE filter optimization
criterion in the context of classification and with regard to measures of generalization.
The chapter concludes with a series of experiments further refining the techniques by
which nonlinear MACE filters are computed.
Chapter 5 presents a new information theoretic method for feature extraction. An
information theoretic approach is motivated by the observation that the optimization
criterion of the MACE filter considers only the second-order statistics of the rejection class.
The information theoretic approach, however, operates in probability space, exploiting
properties of the underlying probability density function. The method enables the
extraction of statistically independent features. The method has wide application beyond
nonlinear extensions to MACE filters and as such represents a powerful new technique for
information theoretic signal processing. A review of information theoretic approaches to
signal processing is presented in this chapter. This is followed by the derivation of the
new technique as well as some general experimental results which are not specifically
related to nonlinear MACE filters, but which serve to illustrate the potential of this
method. Finally the logical placement of this method within nonlinear MACE filters is
presented along with experimental results.
In chapter 6 we review the significant results and contributions of this dissertation. We
also discuss possible lines of research resulting from the base established here.
CHAPTER 2
BACKGROUND
2.1 Discussion of Distortion Invariant Filters
As stated, distortion invariant filtering is a generalization of matched spatial filtering.
It is well known that the matched filter maximizes the peak-signal-to-average-noise power
ratio as measured at the filter output at a specific sample location when the input signal is
corrupted by additive white noise.
In the discrete signal case the design of a matched filter is equivalent to the following
vector optimization problem [Kumar, 1986]:

$$\min_{h} h^{\dagger}h \quad \text{s.t.} \quad x^{\dagger}h = d, \qquad h, x \in \mathbb{C}^{N \times 1}$$
where the column vector x contains the N coefficients of the signal we wish to detect, h
contains the coefficients of the filter ($\dagger$ indicates the Hermitian transpose
operator), and d is a positive scalar. This notation is also suitable for N-dimensional
signal processing as long as the signal and filter have finite support and are reordered in
the same lexicographic manner (e.g. by row or column in the two-dimensional case) into
column vectors.
The optimal solution to this problem is

$$h = x\,(x^{\dagger}x)^{-1}d.$$
Given this solution we can calculate the peak output signal power as

$$(x^{\dagger}h)^2 = \left(x^{\dagger}x\,(x^{\dagger}x)^{-1}d\right)^2 = d^2,$$
and the average output noise power due to an additive white noise input as

$$\sigma_o^2 = E\{h^{\dagger}nn^{\dagger}h\} = h^{\dagger}\Sigma_n h = \sigma_n^2\,h^{\dagger}h = \sigma_n^2 d^2 (x^{\dagger}x)^{-1},$$
where $\sigma_n^2$ is the input noise variance. This results in a
peak-signal-to-average-noise output power ratio of

$$\frac{d^2}{\sigma_o^2} = \frac{d^2}{\sigma_n^2 d^2 (x^{\dagger}x)^{-1}} = \frac{x^{\dagger}x}{\sigma_n^2}.$$
As we can see, the result is independent of the choice of scalar, d. If d is set to unity,
the result is a normalized matched spatial filter [Vander Lugt, 1964].
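The matched-filter algebra above is easy to verify numerically. The following sketch, with a random vector standing in for a lexicographically reordered image template (all variable names are illustrative, not from the original text), computes h, the peak response, and the output SNR, confirming that the peak equals d and the SNR is independent of d:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64 * 64                       # a 64x64 image, lexicographically reordered
x = rng.standard_normal(N)        # stand-in for the target template
d = 1.0                           # desired peak output (unity -> normalized MSF)

# Matched filter: h = x (x^t x)^{-1} d
h = x * d / (x @ x)

# Peak output when the filter is centered on the target: x^t h = d
peak = x @ h

# Average output noise power for white input noise of variance sigma_n^2:
# sigma_n^2 h^t h = sigma_n^2 d^2 (x^t x)^{-1}
sigma_n2 = 0.25
noise_power = sigma_n2 * (h @ h)

# Peak-signal-to-average-noise ratio: equals (x^t x) / sigma_n^2
snr = peak ** 2 / noise_power
print(peak, snr)
```

Changing d rescales both the peak and the noise power, leaving snr unchanged, which mirrors the conclusion drawn in the text.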
In order to further motivate the concept of distortion invariant filtering, a typical ATR
example problem will be used for illustration. This experiment will also help to illustrate
the genesis of the various types of distortion invariant filtering approaches beginning with
the matched spatial filter (MSF).
Inverse synthetic aperture radar (ISAR) imagery will be used for all of the experiments
presented herein. Distortion invariant filtering, however, is not limited to ISAR
imagery and in fact can be extended to much more abstract data types. ISAR images are shown
in figure 1. In the figure, three vehicles are displayed, each at three different radar viewing
aspect angles (5, 45, and 85 degrees), where the aspect angle is the direction of the front of
the vehicle relative to the radar antenna. The image dimensions are 64 x 64 pixels. Radar
systems measure a quantity called radar cross section (RCS). When a radar transmits an
electromagnetic pulse, some of the incident energy on an object is reflected back to the
radar. RCS is a measure of the reflected energy detected by the radar's receiving antenna.
ISAR imagery is the result of a radar signal processing technique which uses multiple
detected radar returns measured over a range of relative object aspect angles. Each pixel in
an ISAR image is a measure of the aggregate radar cross section at regularly sampled
points in space.
Two types of vehicles are shown. Vehicle type 1 will represent a recognition class,
while vehicle type 2 will represent a confusion class. The goal is to compute a filter which
will recognize vehicle type 1 without being confused by vehicle 2. Images of vehicle 1a
will be used to compute the filter coefficients. Vehicles 1b and 2a represent an independent
testing class.
ISAR images of all three vehicles were formed in the aspect range of 5 to 85 degrees at
1 degree increments. As the MSF is derived from a single vehicle image, an image of
vehicle 1a at 45 degrees (the midpoint of the aspect range) is used.
The peak output response to an image represents the maximum of the cross-correlation
function of the image with the MSF template. The peak output response over the entire
aspect range of vehicle 1a is shown in figure 2. As can be seen in the figure, the filter
matches at 45 degrees very well; however, as the aspect moves away from 45 degrees, the
Figure 1. ISAR images of two vehicle types: vehicle 1a (training), vehicle 1b
(testing), and vehicle 2a (testing), shown at aspect angles of 5, 45, and 85
degrees respectively. Two different vehicles of type 1 (a and b) are shown,
while one vehicle of type 2 (a) is shown. Vehicle 1a is used as a training
vehicle, while vehicle 1b is used as the testing vehicle for the recognition
class. Vehicle 2a represents a confusion vehicle.
peak output response begins to degrade. Depending on the type of imagery as well as the
vehicle, this degradation can become very severe.
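A peak-response curve like the one in figure 2 is obtained by correlating the MSF template against each test image and recording the maximum of the output plane. A minimal sketch of that computation, using synthetic arrays in place of the ISAR data (which is not reproduced here; the function name and shapes are illustrative), is:

```python
import numpy as np

def peak_response(template, image):
    """Maximum of the circular 2-D cross-correlation of image with template."""
    T = np.fft.fft2(template, s=image.shape)
    I = np.fft.fft2(image)
    corr = np.fft.ifft2(I * np.conj(T)).real
    return corr.max()

rng = np.random.default_rng(1)
template = rng.random((64, 64))   # stand-in for the 45-degree template

exact = peak_response(template, template)                    # perfect match
shifted = peak_response(template, np.roll(template, 7, 1))   # shifted copy
unrelated = peak_response(template, rng.random((64, 64)))    # unrelated image

# Circular correlation is shift invariant, so exact == shifted (to rounding),
# while the unrelated image produces a noticeably lower peak.
print(exact, shifted, unrelated)
```

Sweeping such a template over images at every aspect angle, and plotting the maxima, would reproduce the shape of the curves in figures 2 and 3.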
[Plot: MSF peak output response (normalized, 0.0 to 1.2) versus aspect angle (0 to 100 degrees).]
Figure 2. MSF peak output response of training vehicle 1a over all aspect angles.
Peak response degrades as aspect difference increases.
The peak output responses of both vehicles in the testing set are shown in figure 3
overlain on the training image response. In one sense the filter exhibits good
generalization; that is, the peak response to vehicle 1b is much the same, as a function of
aspect, as the peak response to vehicle 1a. However, the filter also "generalizes" equally
well to vehicle 2a, which is undesirable. As a vehicle discrimination test (vehicle 1 from
vehicle 2) the MSF fails.
[Plot: matched spatial filter peak output response versus aspect angle (0 to 100 degrees).]
Figure 3. MSF peak output response of testing vehicles 1b and 2a over all aspect
angles. Responses are overlaid on the training vehicle response. Filter
responses to vehicles 1b (dashed line) and 2a (dash-dot) do not differ
significantly.
The output image plane response to a single image of vehicle 1a is shown in figure 4.
Refinements to the distortion invariant filter approach, namely the MACE filter, will show
that the localization of this output response, as measured by the sharpness of the peak, can
be improved significantly.
Figure 4. MSF output image plane response.
2.1.1 Synthetic Discriminant Function
The degradation evidenced in figures 2 and 3 was the primary motivation for the
synthetic discriminant function (SDF) [Hester and Casasent, 1980]. A shortcoming of the
MSF, from the standpoint of distortion invariant filtering, is that it is only optimum for a
single image. One approach would be to design a bank of MSFs operating in parallel
which were matched to the distortion range. The typical ATR system; however, must rec
ognize/discriminate multiple vehicle types and so from an implementation standpoint
alone parallel MSFs is an impractical choice. Hester and Casasent set out to design a sin
13
gle filter which could be matched to multiple images using the idea of superposition. This
approach was possible due to the large number of coefficients (degrees of freedom) that
typically constitute 2D image templates. For historical reasons, specifically that the filters
in question were synthesized optically using holographic techniques [Vander Lugt, 1964],
it was hypothesized that such a filter could be synthesized from linear combinations of a
set of exemplar images.
The filter synthesis procedure consists of projecting the exemplar images onto an orthonormal basis (originally Gram-Schmidt orthogonalization was used to generate the basis). The next step is to determine the coefficients with which to linearly combine the basis vectors such that a desired response for each original image exemplar is obtained [Hester and Casasent, 1980].
The proposed synthesis procedure is a bit convoluted. It turns out that the choice of orthonormal basis is irrelevant: as long as the basis spans the space of the original exemplar images, the result is always the same. The development of Kumar [1986] is more useful for depicting the SDF as a generalization of the matched filter (for the white noise case) to multiple signals. The SDF can be cast as the solution to the following optimization problem

$$\min_h h^\dagger h \quad \text{s.t.} \quad X^\dagger h = d; \qquad h \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; d \in \mathbb{C}^{N_t \times 1},$$
where $X$ is now a matrix whose $N_t$ columns comprise a set of training images¹ we wish to detect, $d$ is a column vector of desired outputs (one for each of the training exemplars)
1. Since these filters have been applied primarily to 2-D images, signals will be referred to as images or exemplars from this point on. In the vector notation, all $N_1 \times N_2$ images are reordered (by row or column) into $N \times 1$ column vectors, where $N = N_1 N_2$.
and is typically set to all unity values for the recognition class. The images of the data matrix $X$ comprise the range of distortion that the implemented filter is expected to encounter. It is assumed that $N_t < N$, so the problem formulation is a quadratic optimization subject to an underdetermined system of linear constraints. The optimal solution is

$$h = X(X^\dagger X)^{-1} d.$$
When there is only one training exemplar ($N_t = 1$) and $d$ is unity, the SDF defaults to the normalized matched filter. Like the matched filter (white noise case), the SDF is the linear filter which minimizes the white noise response while satisfying the set of linear constraints over the training exemplars.
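The closed-form solution above is simple to compute. The following sketch (NumPy, with random data standing in for vectorized image exemplars; the function and variable names are our own) illustrates both the constrained solution and its reduction to the normalized matched filter for a single exemplar:

```python
import numpy as np

def sdf_filter(X, d):
    """SDF: minimize h^H h subject to X^H h = d.

    X : (N, Nt) matrix whose columns are vectorized training exemplars;
    d : (Nt,) desired peak outputs.
    Returns h = X (X^H X)^{-1} d, the minimum-norm constrained filter.
    """
    G = X.conj().T @ X                  # (Nt, Nt) Gram matrix of exemplars
    return X @ np.linalg.solve(G, d)    # solve instead of explicit inversion

# Toy data standing in for vectorized images: N = 64, Nt = 3,
# with all desired outputs set to unity for the recognition class.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
d = np.ones(3)
h = sdf_filter(X, d)
print(np.allclose(X.T @ h, d))   # the linear constraints are met exactly

# With a single exemplar and d = 1 the SDF reduces to the
# normalized matched filter x / (x^H x).
x0 = X[:, :1]
assert np.allclose(sdf_filter(x0, np.ones(1)), x0[:, 0] / (x0[:, 0] @ x0[:, 0]))
```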
By way of example, the SDF technique is tested against the ISAR data as in the MSF case. Exemplar images from vehicle 1a were selected every 4 degrees of aspect from 5 to 85 degrees, for a total of 21 exemplar images (i.e. $N_t = 21$). Figure 5 shows the peak output response over all aspects of the training vehicle (1a). As seen in the figure, the degradation as the aspect changes is removed. The MSF response has been overlaid to highlight the differences.
The peak output response over all exemplars in the testing set is shown in figure 6. From the perspective of peak response, the filter generalizes fairly well. However, as with the MSF, the usefulness of the filter as a discriminant between vehicles 1 and 2 is clearly limited.
Figure 7 shows the resulting output plane response when the SDF filter is correlated with a single image of vehicle 1a. The localization of the peak is similar to the MSF case.
Figure 5. SDF peak output response of training vehicle 1a over all aspect angles. The MSF response is also shown (dashed line). The degradation in the peak response has been corrected.
2.1.2 Minimum Variance Synthetic Discriminant Function
The SDF approach seemingly solved the problem of generalizing a matched filter to multiple images. However, the SDF has no built-in noise tolerance by design (except for the white noise case). Furthermore, in practice, it turned out that occasionally the noise response would be higher than the peak object response, depending on the type of imagery. As a result, detection by means of searching for correlation peaks was shown to be unreliable for some types of imagery, specifically imagery which contains recognition class images embedded in nonwhite noise [Kumar, 1992]. Kumar [1986] proposed a method by which noise tolerance could be built into the filter design. This technique was termed the minimum variance synthetic discriminant function (MVSDF). The MVSDF is
Figure 6. SDF peak output response of testing vehicles 1b and 2a over all aspect angles. The dashed line is vehicle 1b while the dash-dot line is vehicle 2a.
the correlation filter which minimizes the output variance due to zero-mean input noise while satisfying the same linear constraints as the SDF. The output noise variance can be shown to be $h^\dagger \Sigma_n h$, where $h$ is the vector of filter coefficients and $\Sigma_n$ is the covariance matrix of the noise [Kumar, 1986].
Mathematically the problem formulation is

$$\min_h h^\dagger \Sigma_n h \quad \text{s.t.} \quad X^\dagger h = d; \qquad h \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; \Sigma_n \in \mathbb{C}^{N \times N},\; d \in \mathbb{C}^{N_t \times 1},$$
Figure 7. SDF output image plane response.
with the optimal solution

$$h = \Sigma_n^{-1} X (X^\dagger \Sigma_n^{-1} X)^{-1} d.$$
In the case of white noise, the MVSDF is equivalent to the SDF. This technique has a significant numerical complexity issue: the solution requires the inversion of an $N \times N$ matrix ($\Sigma_n$), which for moderate image sizes ($N = N_1 N_2$) can be quite large and computationally prohibitive, unless simplifying assumptions can be made about its form (e.g. a diagonal or Toeplitz matrix).
The MVSDF can be seen as a more general extension of the matched filter to multiple vector detection, as most signal processing definitions of the matched filter incorporate a noise power spectrum and do not assume only the white noise case. It is mentioned here because it is the first distortion invariant filtering technique to recognize the need to characterize a rejection class.
2.1.3 Minimum Average Correlation Energy Filter
The MVSDF (and the SDF) control the output of the filter at a single point in the output plane. In practice, large sidelobes may be exhibited in the output plane, making detection difficult. These difficulties led Mahalanobis et al. [1987] to propose the minimum average correlation energy (MACE) filter. This development in distortion invariant filtering takes as its design goal to control not only the output at the point where the image is centered on the filter, but the response of the entire output plane as well. Specifically, it minimizes the average correlation energy of the output over the training exemplars subject to the same linear constraints as the MVSDF and SDF filters.
The problem is formulated in the frequency domain using Parseval relationships. In the frequency domain, the formulation is

$$\min_H H^\dagger D H \quad \text{s.t.} \quad X^\dagger H = d; \qquad H \in \mathbb{C}^{N \times 1},\; X \in \mathbb{C}^{N \times N_t},\; D \in \mathbb{C}^{N \times N},\; d \in \mathbb{C}^{N_t \times 1},$$
where $D$ is a diagonal matrix whose diagonal elements are the coefficients of the average 2-D power spectrum of the training exemplars. The form of the quadratic criterion is derived using Parseval's relationship; a derivation is given in section A.1 of the appendix. The other terms, $H$ and $X$, contain the 2-D DFT coefficients of the filter and training exemplars, respectively. The vector $d$ is the same as in the MVSDF and SDF cases. The
optimal solution, in the frequency domain, is

$$H = D^{-1} X (X^\dagger D^{-1} X)^{-1} d. \qquad (1)$$
As in the MVSDF, the solution requires the inversion of an $N \times N$ matrix, but in this case the matrix $D$ is diagonal and so its inversion is trivial. When the noise covariance matrix is estimated from observations of noise sequences (assuming wide-sense stationarity and ergodicity), the MVSDF can also be formulated in the frequency domain, and the complex matrix inversion is avoided. A derivation of this is given in appendix A; examination of equations (95), (96), and (97) shows that, under the assumption that the noise class can be modeled as a stationary, ergodic random noise process, the solution of the MVSDF can be found in the spectral domain using the estimated power spectrum of the noise process and equation (1).
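Because $D$ is diagonal, applying $D^{-1}$ is an elementwise division and only an $N_t \times N_t$ system must be solved. A minimal sketch of MACE synthesis via equation (1) (NumPy, random toy images in place of ISAR data; the helper names are ours):

```python
import numpy as np

def mace_filter(exemplars, d):
    """MACE filter, equation (1): H = D^{-1} X (X^H D^{-1} X)^{-1} d.

    exemplars : sequence of (N1, N2) real images; d : (Nt,) desired
    peak outputs. D is the diagonal average power spectrum, so the
    only matrix actually inverted is the small Nt x Nt system.
    """
    X = np.stack([np.fft.fft2(im).ravel() for im in exemplars], axis=1)
    D = np.mean(np.abs(X) ** 2, axis=1)      # diagonal of D
    DinvX = X / D[:, None]                   # D^{-1} X, elementwise
    A = X.conj().T @ DinvX                   # X^H D^{-1} X, (Nt, Nt)
    return DinvX @ np.linalg.solve(A, d)     # spectral-domain filter H

# Toy example: four random 8x8 "exemplars" constrained to unit output.
rng = np.random.default_rng(1)
imgs = [rng.standard_normal((8, 8)) for _ in range(4)]
H = mace_filter(imgs, np.ones(4))

# At the training exemplars the peak constraint X^H H = d holds exactly.
X = np.stack([np.fft.fft2(im).ravel() for im in imgs], axis=1)
print(np.allclose(X.conj().T @ H, np.ones(4)))
```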
In practice, the MACE filter performs better than the MVSDF with respect to rejecting out-of-class input images. The MACE filter, however, has been shown to have poor generalization properties; that is, images in the recognition class but not in the training exemplar set are not recognized.
A MACE filter was computed using the same exemplar images as in the SDF example. Figure 8 shows the resulting output image plane response for one image. As can be seen in the figure, the peak in the center is now highly localized. In fact, it can be shown [Mahalanobis et al., 1987] that over the training exemplars (those used to compute the filter) the output peak will always be at the constraint location.
Generalization to between-aspect images, as mentioned, is a problem for the MACE filter. Figure 9 shows the peak output response over all aspect angles. As can be seen in the figure, the peak response degrades severely for aspects between the exemplars used to compute the filter. Furthermore, from a peak output response viewpoint, generalization to vehicle 1b is also worse. However, unlike the previous techniques, we now begin to see some separation between the two vehicle types as represented by their peak response.
Figure 8. MACE filter output image plane response.
2.1.4 Optimal Tradeoff Synthetic Discriminant Function
The final distortion invariant filtering technique which will be discussed here is the method proposed by Réfrégier and Figue [1991], known as the optimal trade-off synthetic discriminant function (OTSDF). Suppose that the designer wishes to optimize over multiple quadratic optimization criteria (e.g. average correlation energy and output noise variance) subject to the same set of equality constraints as in the previous distortion invariant filters. We can represent the individual optimization criteria by

$$J_i = h^\dagger Q_i h,$$

where $Q_i$ is an $N \times N$ symmetric, positive-definite matrix (e.g. $Q_i = \Sigma_n$ for the MVSDF optimization criterion).
Figure 9. MACE peak output response of vehicles 1a, 1b, and 2a over all aspect angles. Degradation at between-aspect exemplars is evident. Generalization to the testing vehicles as measured by peak output response is also poorer. Vehicle 1a is the solid line, 1b is the dashed line, and 2a is the dash-dot line.

The OTSDF is a method by which a set of quadratic optimization criteria may be optimally traded off against one another; that is, one criterion can be minimized with minimum penalty to the rest. The solution to all such filters can be characterized by the equation
$$h = Q^{-1} X (X^\dagger Q^{-1} X)^{-1} d, \qquad (2)$$

where, assuming $M$ different criteria,

$$Q = \sum_{i=1}^{M} \lambda_i Q_i, \qquad \sum_{i=1}^{M} \lambda_i = 1, \qquad 0 \le \lambda_i.$$
The possible solutions, parameterized by $\lambda_i$, define a performance bound which cannot be exceeded by any linear system with respect to the optimization criteria and the equality constraints. All such linear filters which optimally trade off a set of quadratic criteria are referred to as optimal trade-off synthetic discriminant functions.
We may, for example, wish to trade off the MACE filter criterion versus the MVSDF filter criterion. This presents the added difficulty that one criterion is specified in the space domain and the other in the spectral domain. If the noise is represented as zero-mean, stationary, and ergodic (if the covariance is to be estimated from samples), we can, as mentioned, transform the MVSDF criterion to the spectral domain. In this case the optimal filter has the frequency domain solution

$$H = \big[\lambda D_n + (1-\lambda) D_x\big]^{-1} X \Big[ X^\dagger \big[\lambda D_n + (1-\lambda) D_x\big]^{-1} X \Big]^{-1} d = D_\lambda^{-1} X \big[ X^\dagger D_\lambda^{-1} X \big]^{-1} d,$$

where $D_\lambda = \lambda D_n + (1-\lambda) D_x$, $0 \le \lambda \le 1$, and $D_n$, $D_x$ are diagonal matrices whose diagonal elements contain the estimated power spectrum coefficients of the noise class and the recognition class, respectively. The performance bound of such a filter would resemble figure 10, where all linear filters would fall in the darkened region and all optimal trade-off filters would lie somewhere on the boundary.
By way of example we again use the data from the MACE and SDF examples. In this case we will construct an OTSDF which trades off the MACE filter criterion for the SDF criterion. In order to transform the SDF to the spectral domain, we will assume that the noise class is zero-mean, stationary, white noise; the power spectrum is therefore flat. One of the issues in constructing an OTSDF is how to set the value of $\lambda$, which represents the degree by which one criterion is emphasized over another. We will not address that issue here, but simply set the value to $\lambda = 0.95$, indicating more emphasis on the MACE filter criterion.

Figure 10. Example of a typical OTSDF performance plot. This plot shows the trade-off, hypothetically, between the ACE criterion and a noise variance criterion. The curved arrow on the performance bound indicates the direction of increasing $\lambda$ for the two-criterion case. The curve is bounded by the MACE and MVSDF results.
The output plane response of the OTSDF is shown in figure 11. As compared to the MACE filter response, the output peak is not nearly as sharp, but it is still more localized than in the SDF case.
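Under the white-noise assumption above, $D_n$ reduces to the identity and the trade-off filter can be sketched in a few lines. The following illustration (NumPy, toy data; the helper names and the normalization of $D_x$ are our own choices, and the $\lambda$ convention follows the equation $D_\lambda = \lambda D_n + (1-\lambda)D_x$ above):

```python
import numpy as np

def otsdf_filter(exemplars, d, lam):
    """OTSDF for the white-noise case:
    D_lam = lam * I + (1 - lam) * D_x,
    H = D_lam^{-1} X (X^H D_lam^{-1} X)^{-1} d.

    D_x is the average power spectrum of the exemplars (the MACE term);
    the identity stands in for the flat white-noise spectrum D_n.
    """
    X = np.stack([np.fft.fft2(im).ravel() for im in exemplars], axis=1)
    Dx = np.mean(np.abs(X) ** 2, axis=1)
    Dx = Dx / Dx.max()                       # scale so the two terms are comparable
    Dlam = lam + (1.0 - lam) * Dx            # diagonal of D_lambda
    DinvX = X / Dlam[:, None]
    return DinvX @ np.linalg.solve(X.conj().T @ DinvX, d)

rng = np.random.default_rng(2)
imgs = [rng.standard_normal((8, 8)) for _ in range(4)]
d = np.ones(4)
H = otsdf_filter(imgs, d, lam=0.5)

# The equality constraints hold for every value of lam; only the
# quadratic criterion being minimized moves along the trade-off curve.
Xf = np.stack([np.fft.fft2(im).ravel() for im in imgs], axis=1)
print(np.allclose(Xf.conj().T @ H, d))
```

Sweeping `lam` between 0 and 1 traces out the performance bound between the MACE-like and SDF-like extremes.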
The peak output response over the training vehicle for the OTSDF is compared to the MACE filter in figure 12. The degradation at between-aspect exemplars is less severe than for the MACE filter. The peak output responses of vehicles 1b and 2a are shown in figure 13.
Figure 11. OTSDF filter output image plane response.
As compared to the MACE filter, the peak response is improved over the testing set. Separation between the two vehicle types appears to be maintained.
2.2 Preprocessor/SDF Decomposition
In the sample domain, the SDF family of correlation filters is equivalent to a cascade of a linear preprocessor followed by a linear correlator [Mahalanobis et al., 1987; Kumar, 1992]. This is illustrated in figure 14 with vector operations. The preprocessor, in the case of the MACE filter, is a prewhitening filter computed on the basis of the average power spectrum of the recognition class training exemplars. In the case of the MVSDF, the preprocessor is a prewhitening filter computed on the basis of the covariance matrix of the noise. The net result is that after preprocessing, the second processor is an SDF computed over the preprocessed exemplars.
Figure 12. OTSDF peak output response of vehicle 1a over all aspect angles. Degradation at between-aspect exemplars is less than in the MACE filter, shown in dashed line.
The primary contribution of this research will be to extend the ideas of MACE filtering to a general nonlinear signal processing architecture and accompanying classification framework. These extensions will focus on processing structures which improve the generalization and discrimination properties while maintaining the shift-invariance and localization detection properties of the linear MACE filter.
Figure 13. OTSDF peak output response of vehicles 1b and 2a over all aspect angles. Generalization is better than in the MACE filter. Vehicle 1b is shown in dashed line, vehicle 2a in dash-dot line.
Figure 14. Decomposition of a distortion invariant filter in the space domain. The notation assumes that the image and filter coefficients have been reordered into vectors. The input image vector, $x$, is preprocessed by the linear transformation $y = Ax$; the resulting vector is processed by a synthetic discriminant function, $y_{out} = y^\dagger h$, with $h = y(y^\dagger y)^{-1} d$, to produce the scalar output.
CHAPTER 3
THE MACE FILTER AS AN ASSOCIATIVE MEMORY
3.1 Linear Systems as Classifiers
In this chapter we present the MACE filter from the perspective of associative memories. This perspective is important because it leads to a machine-learning and classification framework and, consequently, a means by which to determine the parameters of a nonlinear mapping via gradient search techniques. We shall refer, herein, to the machine learning/gradient search methods as an iterative framework. The techniques are iterative in the sense that adaptations to the mapping parameters are computed sequentially and repeatedly over a set of exemplars. We shall show that the iterative and classification framework combined with a nonlinear system architecture have distinct advantages over the linear framework of distortion invariant filters.
As we have stated, distortion invariant filters can only realize linear discriminant functions. We begin, therefore, by considering linear systems used as classifiers. The adaline architecture [Widrow and Hoff, 1960], depicted in figure 15, is an example of a linear system used for pattern classification. A pattern, represented by the coefficients $x_i$, is applied to a linear combiner, represented by the weight coefficients $w_i$; the resulting output $y$ is then applied to a hard limiter which assigns a class to the input pattern. Mathematically this can be represented by

$$c = \mathrm{sgn}(y - \tau) = \mathrm{sgn}(w^T x - \tau),$$

where $\mathrm{sgn}(\cdot)$ is the signum function, $\tau$ is a threshold, and $w, x \in \Re^{N \times 1}$ are column vectors containing the combiner weights and the pattern coefficients, respectively. In the context of classification, this architecture is trained iteratively using the least mean square (LMS) algorithm [Widrow and Hoff, 1960]. For a two-class problem the desired output, $d$ in the figure, is set to $\pm 1$ depending on the class of the input pattern; the LMS algorithm then minimizes the mean square error (MSE) between the classification output $c$ and the desired output. Since the classification error, $\epsilon_c$, can only take on the three values $\pm 2$ and $0$, minimization of the MSE is equivalent to minimizing the average number of actual errors.
There are several observations to be made about the adaline/LMS approach to classification. One is that the adaptation process described uses the error, $\epsilon$, as measured at the output of the linear combiner to drive the adaptation, and not the actual classification error, $\epsilon_c$. Another is that this approach presupposes that the pattern classes can be linearly separated. A final point, on which we will have more to say, is that the method uses the MSE criterion as a proxy for classification.
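The adaline/LMS procedure described above can be sketched as follows. The data, step size, and helper names below are our own; note that, as observed in the text, the update is driven by the combiner-output error $d - y$, not the classification error:

```python
import numpy as np

def adaline_lms(patterns, labels, eta=0.01, epochs=50, seed=0):
    """Train an adaline with the LMS rule.

    patterns : (M, N) array of input patterns; labels : (M,) in {-1, +1}.
    The threshold is absorbed into a bias weight on a constant input.
    The error driving adaptation is measured at the linear combiner
    output y = w^T x, not at the signum output.
    """
    rng = np.random.default_rng(seed)
    X = np.hstack([patterns, np.ones((len(patterns), 1))])  # bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            y = X[i] @ w
            w += eta * (labels[i] - y) * X[i]   # LMS weight update
    return w

# Two linearly separable Gaussian classes in the plane.
rng = np.random.default_rng(1)
pats = np.vstack([rng.standard_normal((50, 2)) + 2.0,
                  rng.standard_normal((50, 2)) - 2.0])
labels = np.r_[np.ones(50), -np.ones(50)]
w = adaline_lms(pats, labels)
c = np.sign(np.hstack([pats, np.ones((100, 1))]) @ w)  # hard limiter
print(np.mean(c == labels))   # fraction correctly classified
```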
3.2 MSE Criterion as a Proxy for Classification Performance
As we have pointed out, the adaline/LMS approach to classification uses the MSE criterion to drive the adaptation process. It is the probability of misclassification (also called the Bayes criterion), however, with which we are truly concerned. We now discuss the consequence of using the MSE criterion as a proxy for classification performance.
It is well known that the discriminant function that minimizes the probability of misclassification is monotonically related to the posterior probability of the class, $c$, given the observation $x$ [Fukunaga, 1990]. That is, for the two-class problem, if the discriminant function is

$$f(x) = p(C_2 \mid x), \qquad (3)$$

where $p(C_2 \mid x)$ is the posterior probability of class 2 given $x$, then the probability of misclassification will be minimized if the following decision rule is used:

$$f(x) < 0.5 \;\Rightarrow\; \text{choose class 1}, \qquad f(x) > 0.5 \;\Rightarrow\; \text{choose class 2}. \qquad (4)$$

For the case $f(x) = 0.5$, both classes are equally likely, so a guess must be made.
3.2.1 Unrestricted Functional Mappings
With regard to the adaline/LMS approach we now ask: what is the consequence of using the MSE criterion for computing discriminant functions? In the two-class case, the source distributions are $p(x \mid C_1)$ or $p(x \mid C_2)$ depending on whether the observation, $x$, is drawn from class 1 or class 2, respectively. If we assign a desired output of zero to class 1 and unity to class 2, then the MSE criterion is equivalent to the following:

$$J(f) = \frac{P_1}{2} E\{f(x)^2 \mid C_1\} + \frac{P_2}{2} E\{(1 - f(x))^2 \mid C_2\}, \qquad (5)$$

where the 1/2 scale factors are for convenience, $E\{\cdot\}$ is the expectation operator, $P_i$ is the prior probability of class $i$, and $C_i$ indicates class $i$.
For now we will place no constraints on the functional form of $f(x)$. In so doing, we can solve for the optimal solution using the calculus of variations. In this case, we would like to find a stationary point of the criterion $J(f)$ with respect to small perturbations in the function $f(x)$, indicated by

$$\delta J = J(f + \delta f) - J(f) = 0. \qquad (6)$$
The first term of 6 can be computed as
P P
J(f +sf) = P2E{(f+Sf)2IC}+~E{(lfif)2IC2}
= PE{(f2+2f2f) C1}
(7)
P
+E{((1 f)2 2(1 f)8f)IC21+O( 0(8f2)
2
P P2
= J(f)+ E{(2ff) Cl} E{(2(1 f)8f)C2}
which can be substituted into (6) to yield

$$\begin{aligned}
\delta J &= P_1 E\{f\,\delta f \mid C_1\} - P_2 E\{(1-f)\,\delta f \mid C_2\} \\
&= P_1 \int f(x)\,\delta f\; p(x \mid C_1)\,dx - P_2 \int \big(1 - f(x)\big)\,\delta f\; p(x \mid C_2)\,dx \\
&= \int \Big[ f(x)\big(P_1 p(x \mid C_1) + P_2 p(x \mid C_2)\big) - P_2 p(x \mid C_2) \Big]\,\delta f\,dx \\
&= \int \big[ f(x)\,p_x(x) - P_2 p(x \mid C_2) \big]\,\delta f\,dx,
\end{aligned} \qquad (8)$$

where $p_x(x) = P_1 p(x \mid C_1) + P_2 p(x \mid C_2)$ is the unconditional probability distribution of
the random variable $x$. In order for $f(x)$ to be a stationary point of $J(f)$, equation (8) must be zero over all $x$ for any arbitrary perturbation $\delta f(x)$. Consequently

$$f(x)\,p_x(x) - P_2 p(x \mid C_2) = 0, \qquad (9)$$

or

$$f(x) = \frac{P_2 p(x \mid C_2)}{p_x(x)} = \frac{P_2 p(x \mid C_2)}{P_1 p(x \mid C_1) + P_2 p(x \mid C_2)} = p(C_2 \mid x), \qquad (10)$$
which is the likelihood that the observation is drawn from class 2. If we had reversed the desired outputs, the result would have been the likelihood that the observation was drawn from class 1. This result, predicated on our choice of desired outputs, shows that for arbitrary $f(x)$, the MSE criterion is equivalent to the probability of misclassification criterion. In fact, it has been shown by Richard and Lippmann [1991] (using other means) for the multiclass case that if the desired outputs are encoded as vectors, $e_i \in \Re^{N \times 1}$, where the $i$-th element is unity and the others are zero, then for an $N$-class problem the MSE criterion is equivalent to optimizing the Bayes criterion for classification.
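This result admits a small numerical check: with a discrete observation and an unconstrained mapping (one free value of $f$ per observation value), the MSE minimizer is the per-bin mean of the targets, which is the sample estimate of $p(C_2 \mid x)$. The distributions below are invented purely for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
P2 = 0.3                                   # prior probability of class 2
p_x_c1 = np.array([0.6, 0.3, 0.1])         # p(x | C1) over x in {0, 1, 2}
p_x_c2 = np.array([0.1, 0.3, 0.6])         # p(x | C2)

n = 200_000
cls2 = rng.random(n) < P2                  # class label; target is 0 or 1
x = np.where(cls2,
             rng.choice(3, size=n, p=p_x_c2),
             rng.choice(3, size=n, p=p_x_c1))
t = cls2.astype(float)                     # desired outputs: 0 for C1, 1 for C2

# Unconstrained MSE minimizer: f(v) = mean of the targets at x = v,
# the sample estimate of E{t | x} = p(C2 | x).
f = np.array([t[x == v].mean() for v in range(3)])

posterior = P2 * p_x_c2 / ((1 - P2) * p_x_c1 + P2 * p_x_c2)
print(np.abs(f - posterior).max() < 0.01)  # f matches the posterior closely
```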
3.2.2 Parameterized Functional Mappings
Suppose, however, that the function is not arbitrary, but is also a function of a parameter set, $\alpha$, as in $f(x, \alpha)$. The MSE criterion of (5) can be rewritten

$$J(f) = \frac{P_1}{2} E\{f(x,\alpha)^2 \mid C_1\} + \frac{P_2}{2} E\{(1 - f(x,\alpha))^2 \mid C_2\}. \qquad (11)$$

The gradient of the criterion with respect to the parameters becomes

$$\frac{\partial J}{\partial \alpha} = P_1 E\Big\{ f(x,\alpha)\, \frac{\partial}{\partial \alpha} f(x,\alpha) \,\Big|\, C_1 \Big\} - P_2 E\Big\{ \big(1 - f(x,\alpha)\big) \frac{\partial}{\partial \alpha} f(x,\alpha) \,\Big|\, C_2 \Big\} \qquad (12)$$
and consequently

$$\begin{aligned}
\frac{\partial J}{\partial \alpha} &= P_1 \int f(x,\alpha)\, \frac{\partial}{\partial \alpha} f(x,\alpha)\, p(x \mid C_1)\,dx - P_2 \int \big(1 - f(x,\alpha)\big) \frac{\partial}{\partial \alpha} f(x,\alpha)\, p(x \mid C_2)\,dx \\
&= \int \Big( f(x,\alpha)\big(P_1 p(x \mid C_1) + P_2 p(x \mid C_2)\big) - P_2 p(x \mid C_2) \Big) \frac{\partial}{\partial \alpha} f(x,\alpha)\,dx \\
&= \int \Big( f(x,\alpha)\,p_x(x) - P_2 p(x \mid C_2) \Big) \frac{\partial}{\partial \alpha} f(x,\alpha)\,dx.
\end{aligned} \qquad (13)$$
Examination of equation (13) allows for two possibilities for a stationary point of the criterion. The first, as before, is that

$$f(x,\alpha) = \frac{P_2 p(x \mid C_2)}{p_x(x)} = p(C_2 \mid x), \qquad (14)$$

while the second is that we are near a local minimum with respect to $\alpha$. In other words, if the parameterized function can realize the Bayes discriminant function via an appropriate choice of its parameters, then this function represents a global minimum, but this does not discount the fact that there may be local minima. Furthermore, if the parameterized function is not capable of representing the Bayes discriminant function, there is no guarantee that the global (or local) minima will result in robust classification.
3.2.3 Finite Data Sets
The previous development does not take into account that in an iterative framework we are working with observations of a random variable. Therefore, we rewrite the criterion of equation (5) as finite summations. That is, the criterion becomes

$$J(f(x,\alpha)) = \frac{\hat{P}_1}{2} \sum_{x_i \in C_1} f(x_i,\alpha)^2 + \frac{\hat{P}_2}{2} \sum_{x_i \in C_2} \big(1 - f(x_i,\alpha)\big)^2, \qquad (15)$$

where $x_i \in C_i$ denotes the set of observations taken from class $C_i$ and $\hat{P}_1$, $\hat{P}_2$ are scale factors defined below. Taking the derivative of this criterion with respect to the parameters, $\alpha$, yields

$$\frac{\partial J}{\partial \alpha} = \hat{P}_1 \sum_{x_i \in C_1} f(x_i,\alpha)\, \frac{\partial}{\partial \alpha} f(x_i,\alpha) - \hat{P}_2 \sum_{x_i \in C_2} \big(1 - f(x_i,\alpha)\big) \frac{\partial}{\partial \alpha} f(x_i,\alpha). \qquad (16)$$
It is assumed that the set of observations from class $C_1$ ($x_i \in C_1$) are independent and identically distributed (i.i.d.), as are the set of observations from class $C_2$ ($x_i \in C_2$), although with a different distribution than class $C_1$. Since the summation terms are broken up by class, we can assume that the arguments of the summations (functions of distinct i.i.d. random variables) are themselves i.i.d. random variables [Papoulis, 1991]. If we set $\hat{P}_1 N_1 = P_1$ and $\hat{P}_2 N_2 = P_2$, where $P_1$ and $P_2$ are the prior probabilities of classes $C_1$ and $C_2$, respectively, and $N_1$ and $N_2$ are the numbers of samples drawn from each of the classes, we can use the law of large numbers to say that the summations of equation (16) approach their expected values. In other words, in the limit as $N_1, N_2 \to \infty$,

$$\frac{\partial J}{\partial \alpha} = P_1 E\Big\{ f(x,\alpha)\, \frac{\partial}{\partial \alpha} f(x,\alpha) \,\Big|\, C_1 \Big\} - P_2 E\Big\{ \big(1 - f(x,\alpha)\big) \frac{\partial}{\partial \alpha} f(x,\alpha) \,\Big|\, C_2 \Big\}, \qquad (17)$$

which is identical to equation (12) and so yields the same solution for the mapping:

$$f(x,\alpha) = \frac{P_2 p(x \mid C_2)}{p_x(x)}. \qquad (18)$$

The conclusion is that if we have a sufficient number of observations to characterize the underlying distributions, then the MSE criterion is again equivalent to the Bayes criterion.
3.3 Derivation of the MACE Filter
We have already introduced the MACE filter in a previous section; we present a derivation of it here. The development is similar to the derivations given in Mahalanobis [1987] and Kumar [1992]. Our purpose in presenting the derivation is that it serves to illustrate the associative memory perspective of optimized correlators, a perspective which will be used to motivate the development of the nonlinear extensions presented in later sections.
In the original development, SDF-type filters were formulated using correlation operations, a convention which will be maintained here. The output, $g(n_1, n_2)$, of a correlation filter is determined by

$$g(n_1,n_2) = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x^*(n_1+m_1,\, n_2+m_2)\, h(m_1,m_2) = x^*(n_1,n_2) \star\star\; h(n_1,n_2),$$

where $x^*(n_1,n_2)$ is the complex conjugate of an input image with $N_1 \times N_2$ region of support, $h(n_1,n_2)$ represents the filter coefficients, and $\star\star$ represents the two-dimensional circular correlation operation [Oppenheim and Schafer, 1989].
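For real-valued imagery the output plane defined above can be evaluated efficiently with the DFT correlation theorem. A minimal sketch (NumPy; the helper name is ours, and the identity $G = X \cdot H^*$ assumes real images so that $H(-k) = H^*(k)$):

```python
import numpy as np

def circ_correlate2d(x, h):
    """Circular cross-correlation of real images,
    g(n1, n2) = sum_m x(n1 + m1, n2 + m2) h(m1, m2),
    computed via the DFT identity G = X . conj(H)."""
    G = np.fft.fft2(x) * np.conj(np.fft.fft2(h))
    return np.real(np.fft.ifft2(G))

rng = np.random.default_rng(3)
x = rng.standard_normal((4, 4))
h = rng.standard_normal((4, 4))
g = circ_correlate2d(x, h)

# At zero lag the correlation is the plain inner product of the arrays.
print(np.isclose(g[0, 0], np.sum(x * h)))
```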
The MACE filter formulation is as follows [Mahalanobis et al., 1987]. Given a set of image exemplars, $\{x_i \in \Re^{N_1 \times N_2};\; i = 1 \ldots N_t\}$, we wish to find filter coefficients, $h \in \Re^{N_1 \times N_2}$, such that the average correlation energy at the output of the filter, defined as

$$E_{\mathrm{av}} = \frac{1}{N_t} \sum_{i=1}^{N_t} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} \left| g_i(n_1,n_2) \right|^2, \qquad (19)$$

is minimized subject to the constraints

$$g_i(0,0) = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x_i^*(m_1,m_2)\, h(m_1,m_2) = d_i; \qquad i = 1 \ldots N_t. \qquad (20)$$
Mahalanobis [1987] reformulates this as a vector optimization in the spectral domain using Parseval's theorem. In the spectral domain we wish to find the elements of $H \in \mathbb{C}^{N_1 N_2 \times 1}$, a column vector whose elements are the 2-D DFT coefficients of the space-domain filter $h$ reordered lexicographically. Let the columns of the data matrix $X \in \mathbb{C}^{N_1 N_2 \times N_t}$ contain the 2-D DFT coefficients of the exemplars $\{x_1, \ldots, x_{N_t}\}$, also reordered into column vectors. The diagonal matrix $D_i \in \Re^{N_1 N_2 \times N_1 N_2}$ contains the magnitude squared of the 2-D DFT coefficients of the $i$-th exemplar. These matrices are averaged to form the diagonal matrix $D$ as

$$D = \frac{1}{N_t} \sum_{i=1}^{N_t} D_i, \qquad (21)$$
which then contains the average power spectrum of the training exemplars. Minimizing equation (19) subject to the constraints of equation (20) is equivalent to minimizing

$$H^\dagger D H, \qquad (22)$$

subject to the linear constraints

$$X^\dagger H = d, \qquad (23)$$

where the elements of $d \in \Re^{N_t \times 1}$ are the desired outputs corresponding to the exemplars.
The solution to this optimization problem can be found using the method of Lagrange multipliers. In the spectral domain, the filter that satisfies the constraints of equation (20) and minimizes the criterion of equation (19) [Mahalanobis et al., 1987; Kumar, 1992] is

$$H = D^{-1} X (X^\dagger D^{-1} X)^{-1} d, \qquad (24)$$

where $H \in \mathbb{C}^{N_1 N_2 \times 1}$ contains the 2-D DFT coefficients of the filter, assuming a unitary 2-D DFT.¹
3.3.1 Preprocessor/SDF Decomposition
As observed by Mahalanobis [1987], the MACE filter can be decomposed as a synthetic discriminant function preceded by a prewhitening filter. Let the matrix $B = D^{-1/2}$, where $B$ is diagonal with diagonal elements equal to the inverse of the square root of the diagonal elements of $D$. We implicitly assume that the diagonal elements of $D$ are nonzero; consequently $B^\dagger B = D^{-1}$ and $B^\dagger = B$. Equation (24) can then be rewritten as

$$H = B(BX)\big((BX)^\dagger (BX)\big)^{-1} d. \qquad (25)$$

Substituting $Y = BX$, representing the original exemplars preprocessed in the spectral domain by the matrix $B$, equation (25) can be written

$$H = BY(Y^\dagger Y)^{-1} d. \qquad (26)$$

The term $H' = Y(Y^\dagger Y)^{-1} d$ is recognized as the SDF computed from the preprocessed exemplars $Y$. The MACE filter solution can therefore be written as a cascade of a prewhitener (over the average power spectrum of the exemplars) followed by a synthetic discriminant function, depicted in figure 16, as

$$H = BH'. \qquad (27)$$
1. If the DFT were as defined in [Oppenheim and Schafer, 1989], then a scale factor of $N_1 N_2$ would be necessary.
Figure 16. Decomposition of the MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars) followed by a synthetic discriminant function.
3.4 Associative Memory Perspective
Having presented the derivation of the MACE filter and the preprocessor/SDF decomposition, we now show that, with a modification (the addition of a linear preprocessor), the MACE filter is a special case of Kohonen's linear associative memory [1988].
Associative memories [Kohonen, 1988] are general structures by which pattern vectors can be related to one another, typically in an input/output pairwise fashion. An input stimulus vector is presented to the associative memory structure, resulting in an output response vector. The input/output pairs establish the desired response to a given input. In the case of an autoassociative memory, the desired response is the stimulus vector, whereas in a heteroassociative memory the desired response is arbitrary. From a signal processing perspective, associative memories are viewed as projections [Kung, 1992], linear and nonlinear. The input patterns exist in a vector space and the associative memory projects them onto a new space. The linear associative memory of Kohonen [1988] is formulated exactly in this way.
A simple form of the linear heteroassociative memory maps vectors to scalars. It is formulated as follows. Given the set of input/output vector/scalar pairs $\{x_i \in \Re^{N \times 1},\, d_i \in \Re;\; i = 1 \ldots N_t\}$, which are placed into an input data matrix, $x = [x_1 \ldots x_{N_t}]$, and a desired output vector, $d = [d_1 \ldots d_{N_t}]^T$, find the vector $h \in \Re^{N \times 1}$ such that

$$x^T h = d. \qquad (28)$$

If the system of equations described by (28) is underdetermined, the inner product

$$h^T h \qquad (29)$$

is minimized using (28) as a constraint. If the system of equations is overdetermined,

$$(x^T h - d)^T (x^T h - d)$$

is minimized.

Here we are interested in the underdetermined case. The optimal solution for the underdetermined case, using the pseudoinverse of $x$, is [Kohonen, 1988]

$$h = x(x^T x)^{-1} d. \qquad (30)$$
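Equation (30) is the Moore-Penrose pseudoinverse solution, which is easy to check numerically. A sketch with invented data (the helper name is ours):

```python
import numpy as np

def lam_recall(x, d):
    """Minimum-norm linear associative memory for the underdetermined
    case: minimize h^T h subject to x^T h = d,
    giving h = x (x^T x)^{-1} d."""
    return x @ np.linalg.solve(x.T @ x, d)

rng = np.random.default_rng(4)
x = rng.standard_normal((16, 4))   # four stored patterns of dimension 16
d = rng.standard_normal(4)         # arbitrary scalar responses (hetero-association)
h = lam_recall(x, d)

print(np.allclose(x.T @ h, d))                  # every stored pair is recalled
print(np.allclose(h, np.linalg.pinv(x.T) @ d))  # identical to the pseudoinverse
```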
As was shown in [Fisher and Principe, 1994], we can modify the linear associative memory model slightly by adding a preprocessing linear transformation matrix, $A$, and find $h$ such that the underdetermined system of equations

$$(Ax)^T h = d \qquad (31)$$

is satisfied while $h^T h$ is minimized. As in the MACE filter, this optimization can be solved using the method of Lagrange multipliers. We adjoin the system of constraints to the optimization criterion as

$$J = h^T h + \lambda^T \big( (Ax)^T h - d \big), \qquad (32)$$
where X 9E Nx 1 is a column vector of Lagrange multipliers, one for each constraint
(desired response). Taking the gradient of equation (32) with respect to the vector h yields
aJ
S= 2h +AxX. (33)
oh
Setting the gradient to zero and solving for the vector h yields
h= Ax%. (34)
Substituting this result into the constraint equations of (31) and solving for the Lagrange
multipliers yields
S= 2((Ax)tAx)ld. (35)
Substituting this result back into equation (34) yields the final solution to the optimization
as

    $h = Ax (x^t A^t A x)^{-1} d$ .    (36)
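Equation (36) can be verified numerically as the solution of the Lagrange-multiplier optimization. The sketch below (NumPy, not part of the original text) uses a random illustrative preprocessing matrix $A$ and checks that the closed form satisfies the preprocessed constraints of equation (31).

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nt = 16, 4
x = rng.standard_normal((N, Nt))   # data matrix, one column per exemplar
d = np.ones(Nt)                    # desired responses
A = rng.standard_normal((N, N))    # illustrative preprocessing transformation

# Equation (36): h = Ax (x^t A^t A x)^{-1} d
Ax = A @ x
h = Ax @ np.linalg.solve(Ax.T @ Ax, d)

# h satisfies the constraint system (Ax)^t h = d of equation (31)
assert np.allclose(Ax.T @ h, d)
```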
If the preprocessing transformation, $A$, is the space-domain equivalent of the MACE
filter's spectral prewhitener, and the columns of the data matrix $x$ contain the reordered
elements of the images from the MACE filter problem, then equation (36) combined with
the preprocessing transformation yields exactly the space-domain coefficients of the
MACE filter. This can be shown using a unitary discrete Fourier transform (DFT)
matrix.
If $U \in C^{N_1 \times N_2}$ is the DFT of the image $u \in \Re^{N_1 \times N_2}$, we can reorder both $U$ and $u$
into column vectors, $U \in C^{N_1 N_2}$ and $u \in \Re^{N_1 N_2}$, respectively. We can then implement
the 2-D DFT as a unitary transformation matrix, $\Phi$, such that

    $U = \Phi u$ ,  $u = \Phi^\dagger U$ ,  $\Phi \Phi^\dagger = I$ .
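A unitary 2-D DFT matrix of this kind can be built explicitly as a Kronecker product of two orthonormally scaled 1-D DFT matrices. The sketch below (NumPy, not part of the original text) constructs $\Phi$ for a small illustrative image size and checks the unitarity and inversion properties stated above.

```python
import numpy as np

N1, N2 = 4, 4
# Unitary 1-D DFT matrices (orthonormal scaling, so F F^H = I)
F1 = np.fft.fft(np.eye(N1), norm="ortho")
F2 = np.fft.fft(np.eye(N2), norm="ortho")

# For a column-reordered image u, the 2-D DFT is the Kronecker product matrix Phi
Phi = np.kron(F2, F1)

u = np.random.default_rng(2).standard_normal((N1, N2))
U = Phi @ u.flatten(order="F")      # U = Phi u

# Phi is unitary: Phi Phi^H = I, and Phi^H recovers u
assert np.allclose(Phi @ Phi.conj().T, np.eye(N1 * N2))
assert np.allclose(Phi.conj().T @ U, u.flatten(order="F"))
# Agrees with the direct 2-D FFT of the image
assert np.allclose(U.reshape((N1, N2), order="F"), np.fft.fft2(u, norm="ortho"))
```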
In order for the transformation $A$ to be the space-domain equivalent of the spectral
prewhitener of the MACE filter, the relationship

    $Ax = \Phi^\dagger Y = \Phi^\dagger B X = \Phi^\dagger B \Phi x$ ,

where $B$ is the same matrix as in equation 27, must be true, which, by inspection, means
that

    $A = \Phi^\dagger B \Phi$ .    (37)
Substituting equation (37) into equation (36) and using the property $B^\dagger B = BB = D^{-1}$
yields

    $h = Ax (x^t A^t A x)^{-1} d$
      $= \Phi^\dagger B \Phi x (x^t (\Phi^\dagger B \Phi)^\dagger (\Phi^\dagger B \Phi) x)^{-1} d$
      $= \Phi^\dagger B \Phi x (x^t \Phi^\dagger B^\dagger B \Phi x)^{-1} d$    (38)
      $= \Phi^\dagger B X (X^\dagger D^{-1} X)^{-1} d$ ,
Combining this solution for $h$ with the preprocessor in equation (31) for the equivalent
linear system, $h_{sys}$, yields

    $h_{sys} = A h$
           $= A \Phi^\dagger B X (X^\dagger D^{-1} X)^{-1} d$
           $= \Phi^\dagger B \Phi \Phi^\dagger B X (X^\dagger D^{-1} X)^{-1} d$
           $= \Phi^\dagger D^{-1} X (X^\dagger D^{-1} X)^{-1} d$ .

Substituting the MACE filter solution, equation (24), gives the result

    $h_{sys} = \Phi^\dagger H_{MACE}$    (39)
and so $h_{sys}$ is the inverse DFT pair of the spectral-domain MACE filter. This result
establishes the relationship between the MACE filter and the linear associative memory. The
decomposition of the MACE filter of figure 16 can also be considered as a cascade of a
linear preprocessor followed by a linear associative memory (LAM) as in figure 17.
[Figure: block diagram of the cascade. The preprocessor computes $y = Ax$ with $A = \Phi^\dagger D^{-1/2} \Phi$; the LAM computes $h = y (y^\dagger y)^{-1} d$ and the output $y_o = y^t h$.]

Figure 17. Decomposition of MACE filter as a preprocessor (i.e. a pre-
whitening filter over the average power spectrum of the exemplars)
followed by a linear associative memory.
Since the two are equivalent, why make the distinction between the two perspectives?
There are several reasons. The development of distortion invariant filtering and associative
memories has proceeded in parallel. Distortion invariant filtering has been
concerned with finding projections which will essentially detect a set of images. Toward
this goal the techniques have emphasized analytic solutions resulting in linear discriminant
functions. Advances have been concerned with better descriptions of the second-order
statistics of the causes of false detections. The approach, however, is still a data-driven
approach. The desired recognition class is represented through exemplars. In the distortion
invariant filtering approach, the task has been confined to fitting a hyperplane to the
recognition exemplars subject to various quadratic optimization criteria.
The development of associative memories has proceeded along a different track. It is
also data driven, but the emphasis has been on iterative machine learning methods. Many
of the methods are biologically motivated, including the perceptron learning rule
[Rosenblatt, 1958] and Hebbian learning [Hebb, 1949]. Other methods, including the
least-mean-square (LMS) algorithm [Widrow and Hoff, 1960] (which we have described) and the
backpropagation algorithm [Rumelhart et al., 1986; Werbos, 1974], are gradient-descent-
based methods.
From the classification standpoint, of which the ATR problem is a subset, iterative
methods have certain advantages. This can be illustrated with a simple example. Suppose
the data matrix

    $x = [x_1, x_2, \ldots, x_{N_t}] \in \Re^{N_1 N_2 \times N_t}$
were not full rank. In other words, the exemplars representing the recognition class could
be represented without error in a subspace of dimension less than $N_t$. From an ATR
perspective this would be a desirable property. The implicit assumption in any data-driven
method is that information about the recognition class is transmitted through exemplars.
This is as true for distortion invariant filters, which have analytic solutions, as it is for
iterative methods. The smaller the dimension of the subspace in which the recognition class
lies, the better we can discriminate images considered to be out of the class. One limitation
of the analytic solutions of distortion invariant filters is that they require the inverse of a
matrix of the form

    $x^t Q x$ ,    (40)
where $Q$ is a positive definite matrix representing a quadratic optimization criterion. If the
matrix, $x$, is not full column rank, there is no inverse for the matrix of (40) and consequently
no analytic solution for any of the distortion invariant filters. The LMS algorithm,
however, will still find a best fit to the design goal, which is to minimize the criterion while
satisfying the linear constraints.
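This behavior of the LMS algorithm on rank-deficient data can be demonstrated with a small numerical sketch. The code below (NumPy, not part of the original text) uses random stand-in data confined to a low-dimensional subspace; the LMS iteration approaches the least-squares fit that the analytic inverse of (40) cannot deliver.

```python
import numpy as np

rng = np.random.default_rng(3)
N, Nt = 16, 8

# Exemplars confined to a 5-dimensional subspace: the data matrix is not full rank
basis = rng.standard_normal((N, 5))
x = basis @ rng.standard_normal((5, Nt))
x /= np.linalg.norm(x, axis=0)            # unit-norm columns (for LMS stability)
d = np.ones(Nt)

# x^t Q x (with Q = I here) is singular, so the analytic inverse of (40) fails
assert np.linalg.matrix_rank(x.T @ x) < Nt

# The LMS rule h <- h + mu * e_i * x_i nevertheless converges to a least-squares fit
h = np.zeros(N)
mu = 0.02
for epoch in range(5000):
    for i in range(Nt):
        e = d[i] - x[:, i] @ h            # instantaneous error for exemplar i
        h += mu * e * x[:, i]

# Compare against the optimal least-squares fit given by the pseudoinverse
h_ls = np.linalg.pinv(x.T) @ d
mse_lms = np.mean((x.T @ h - d) ** 2)
mse_opt = np.mean((x.T @ h_ls - d) ** 2)
assert mse_lms <= mse_opt + 1e-2
```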
We can illustrate this by modifying the data from the experiments in section 2.1. It is
well known that the data matrix $x$ can be decomposed using the singular value decomposition
(SVD) as

    $x = U \Lambda V^t$ ,

where the columns of $U \in \Re^{N_1 N_2 \times N_t}$ form an orthonormal basis (the principal components
of the vectors $x_i$, in fact), the diagonal matrix $\Lambda \in \Re^{N_t \times N_t}$ contains the singular values of
the data matrix, and $V \in \Re^{N_t \times N_t}$ is unitary. The columns of the data matrix can be
projected onto a subspace by setting one of the diagonal elements of $\Lambda$ to zero. The
importance of any of the basis vectors in $U$ is directly proportional to the singular value. In this
case $N_t = 21$, so we can choose one of the smaller singular values to set to zero without
changing the basic structure of the data. For this example we choose the twelfth largest
singular value. A data matrix $x_{sub}$ is generated by

    $x_{sub} = U \begin{bmatrix} \Lambda_{1-11} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & \Lambda_{13-21} \end{bmatrix} V^t$ ,
where $\Lambda_{i-j}$ is a diagonal matrix containing the $i$ through $j$ singular values of the original
data matrix $x$.
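The rank-reduction step described above can be sketched directly with NumPy's SVD routine (the code and dimensions are illustrative, not part of the original text): zeroing the twelfth-largest singular value drops the rank of the data matrix by one, which is exactly what makes the analytic inversion fail.

```python
import numpy as np

rng = np.random.default_rng(4)
N, Nt = 64, 21                      # stand-in for N1*N2 pixels and 21 exemplars
x = rng.standard_normal((N, Nt))

# Thin SVD: x = U diag(s) V^t
U, s, Vt = np.linalg.svd(x, full_matrices=False)

# Zero the twelfth-largest singular value, forcing the exemplars into
# a 20-dimensional subspace as in the text
s_sub = s.copy()
s_sub[11] = 0.0
x_sub = U @ np.diag(s_sub) @ Vt

assert np.linalg.matrix_rank(x_sub) == Nt - 1
# x_sub^t Q x_sub (Q = I) is now singular, so no analytic MACE solution exists
assert np.linalg.matrix_rank(x_sub.T @ x_sub) < Nt
```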
This data matrix is not full rank, so there is no analytic solution for the MACE filter;
however, we can use the LMS approach and derive a linear associative memory. The
columns of $x_{sub}$ are preprocessed with a prewhitening filter computed over the average
power spectrum. The LMS algorithm can then be used to iteratively compute the
transformation that best fits

    $x_{sub}^t h = d$ ,

in a least squares sense; that is, we can find the $h$ that minimizes

    $(x_{sub}^t h - d)^t (x_{sub}^t h - d)$ ,

where $d$ is a column vector of desired responses (set to all unity in this case).
The peak output response for this filter was computed over all of the aspect views of
vehicle 1a and is shown in figure 18. The exemplars used to compute the filter are plotted
with diamond symbols. The desired response cannot be met exactly so a least squares fit is
achieved. Figure 19 shows the correlation output surface for one of the training exemplars.
[Figure: "MACE filter (LMS)" — plot of peak output response versus aspect angle, 0 to 100 degrees.]

Figure 18. Peak output response over all aspects of vehicle 1a when the data
matrix is not full rank. The LMS algorithm was used to compute
the filter coefficients.
As can be seen in the image, the qualities of low variance and localized peak are still
maintained using the iterative method.
The learning curve, which measures the normalized mean square error (NMSE)
between the filter output and the desired output, is shown as a function of the learning
epoch (an epoch is one pass through the data) in figure 20. When the data matrix is full
rank, as shown with a solid line, there is an exact solution and the error
approaches zero. When $x_{sub}$ is used, the NMSE approaches a limit because there is no
exact solution and so a least squares solution is found.
Figure 19. Output correlation surface for the LMS-computed filter from non-full-rank
data. The filter output is not substantially different from the analytic
solution with full rank data.
Since the system of constraint equations is generally underdetermined, there are infinitely
many filters which will satisfy the constraints. There is only one, however, that minimizes
the norm of the filter (the optimization criterion after preprocessing) [Kohonen, 1988].
Figure 21 shows the NMSE between the analytic solution for the filter coefficients as
compared to the iterative¹ method. When the data matrix is full rank, the iterative method
approaches the optimal analytic solution, as shown by the solid line in the figure. When
the data matrix is not full rank, as shown by the dashed line in the figure, the error in the
iterative solution approaches a limit.
These qualities of iterative learning methods are important from the ATR perspective.
We see from the example that when the data possesses a quality that would seemingly be
1. In this case "iterative" refers to the LMS algorithm; within this text it generally refers to
a gradient search algorithm.
[Figure: "LMS learning curves" — log-scale plot of NMSE versus epoch, 0 to 100.]

Figure 20. Learning curve for LMS approach. The learning curve for the LMS
algorithm when the data matrix is full rank is shown with a solid line;
the non-full-rank case is shown with a dashed line.
useful to the ATR problem, namely that the class can be described by a subspace, the
analytic solution fails when the number of exemplars exceeds the dimensionality of the
subspace. The iterative method, however, finds a reasonable solution. Furthermore, if the data
matrix is full rank, the iterative method approaches the optimal analytic solution.
3.5 Comments
There are further motivations for the associative memory perspective and, by extension,
the use of iterative methods. It is well known that nonlinear associative memory structures
can outperform their linear counterparts on the basis of generalization and dynamic
range [Kohonen, 1988; Hinton and Anderson, 1981]. In general, they are more difficult to
design as their parameters cannot be computed analytically. The parameters for a large
[Figure: plot of filter error versus epoch, 0 to 20.]

Figure 21. NMSE between closed-form solution and iterative solution. The
learning curve for the LMS algorithm when the data matrix is full rank
is shown with a solid line; the non-full-rank case is shown with a
dashed line.
class of nonlinear associative memories can, however, be determined by gradient search
techniques. The methods of distortion invariant filters are limited to linear or piecewise
linear discriminant functions. It is unlikely that these solutions are optimal for the ATR
problem.
In this chapter we have made the connection between distortion invariant filtering and
linear associative memories. Furthermore, we have motivated an iterative approach. Recall
figure 15, which shows the Adaline architecture. In this architecture we can use the linear
error term in order to train our system as a classifier. This is a consequence of the assumption
that a linear discriminant function is desirable. If a linear discriminant function is
suboptimal, which will almost always be the case for any high-dimensional classification
problem, then we must work directly with the classification error.
We have also shown that the MSE criterion is a sufficient proxy for classification error
(with certain restrictions); however, it requires that we work with the true output error of
the mapping, as well as a mapping with sufficient flexibility (i.e. one that can closely
approximate a wide range of functions which are not necessarily linear). The linear systems
approach, however, does not allow for either of these requirements. Consequently, we must
adopt a nonlinear systems approach if we hope to achieve improved performance. The next
chapter will show that the MACE filter can be extended to nonlinear systems such that the
desirable properties of shift invariance and localized detection peak are maintained while
achieving superior classification performance.
CHAPTER 4
STOCHASTIC APPROACH TO TRAINING NONLINEAR
SYNTHETIC DISCRIMINANT FUNCTIONS
4.1 Nonlinear Iterative Approach
The MACE filter is the best linear system that minimizes the energy in the output
correlation plane subject to a peak constraint at the origin. An advantage of linear systems is
that we have the mathematical tools to use them in optimal operating conditions from the
standpoint of second-order statistics. Such optimality conditions, however, should not be
confused with the best possible classification performance.
Our goal is to extend the optimality condition of MACE filters to adaptive nonlinear
systems and classification performance. The optimality condition of the MACE filter
considers the entire output plane, not just the response when the image is centered. With
regard to general nonlinear filter architectures which can be trained iteratively, a brute
force approach would be to train a neural network with a desired output of unity for the
centered images and zero for all shifted images. This would indeed emulate the optimality
of the MACE filter; however, the result is a training algorithm of order $N_1 N_2 N_t$ for $N_t$
training images of size $N_1 \times N_2$ pixels. This is clearly impractical.
In this section we propose a nonlinear architecture for extending the MACE filter. We
discuss some of its properties. Appropriate measures of generalization are discussed. We also
present a statistical viewpoint of distortion invariant filters from which such nonlinear
extensions fit naturally into an iterative framework. From this iterative framework we
present experimental results which exhibit improved discrimination and generalization
performance with respect to the MACE filter while maintaining the properties of localized
detection peak and low variance in the output plane.
4.2 A Proposed Nonlinear Architecture
As we have stated, the MACE filter can be decomposed as a prewhitening filter followed
by a synthetic discriminant function (SDF), which can also be viewed as a special
case of Kohonen's linear associative memory (LAM) [Hester and Casasent, 1980; Fisher
and Principe, 1994]. This decomposition is shown at the top of figure 22. The nonlinear
filter architecture that we are proposing is shown in the middle of figure 22. In this
architecture we replace the LAM with a nonlinear associative memory, specifically a
feedforward multilayer perceptron (MLP), shown in more detail at the bottom of figure 22.
We will refer to this structure as the nonlinear MACE filter (NL-MACE) for brevity.
Another reason for choosing the multilayer perceptron (MLP) is that it is capable of
achieving a much wider range of discriminant functions. It is well known that an MLP
with a single hidden layer can approximate any discriminant function to any arbitrary
degree of precision [Funahashi, 1989]. One of the shortcomings of distortion invariant
approaches such as the MACE filter is that they attempt to fit a hyperplane to the training
exemplars as the discriminant function. Using an MLP in place of the LAM relaxes this
constraint. MLPs do not, in general, allow for analytic solutions. We can, however,
determine their parameters iteratively using gradient search.
[Figure: signal-flow diagrams of the linear and nonlinear architectures.]

Figure 22. Decomposition of optimized correlator as a preprocessor followed by
SDF/LAM (top). Nonlinear variation shown with MLP replacing SDF
in signal flow (middle); detail of the MLP (bottom). The linear
transformation $A$ represents the space-domain equivalent of the
spectral preprocessor $(\alpha P_x + (1 - \alpha) P_n)^{-1/2}$.
4.2.1 Shift Invariance of the Proposed Nonlinear Architecture
One of the properties of the MACE filter is shift invariance. We wish to maintain that
property in our nonlinear extensions. A transformation, $T[\,\cdot\,]$, of a two-dimensional
function is shift invariant if it can be shown that

    $g(n_1, n_2) = T[y(n_1, n_2)]$
    $g(n_1 + n_1', n_2 + n_2') = T[y(n_1 + n_1', n_2 + n_2')]$ ,

where $n_1$, $n_1'$, $n_2$, $n_2'$ are integers. In other words, a shift of the input signal is reflected as
a corresponding shift of the output signal [Oppenheim and Schafer, 1989].
We show here that this property is maintained for our proposed nonlinear architecture.
The preprocessor of the nonlinear architecture at the bottom of figure 22 is the same as
the preprocessor of the linear filter shown at the top. The preprocessor is implemented as
a linear shift-invariant (LSI) filter. Cascading shift-invariant operations maintains shift
invariance of the entire system [Oppenheim and Schafer, 1989]. In order to show that the
system as a whole is shift invariant, it is sufficient to show that the MLP is shift invariant.
The mapping function of the MLP in figure 22 can be written

    $g(\omega, y) = \sigma(W_3 \sigma(W_2 \sigma(W_1 y) + \varphi))$
    $\omega = \{W_1, W_2, W_3, \varphi\}$ .    (41)

In the nonlinear architecture, the matrix $W_i$ represents the connectivities from the
processing elements (PEs) of layer $(i-1)$ to the input of the PEs of layer $i$; that is, the matrix
$W_i$ is applied as a linear transformation to the vector output of layer $(i-1)$. When $i = 1$
the transformation is applied to the input vector, $y$. The number of PEs in layer $i$ is
denoted by $N_i$. In equation 41, $\varphi$ is a constant bias vector added to each element of the
vector $W_2 \sigma(W_1 y)$. It is also assumed that if the argument to the nonlinear
function $\sigma(\,\cdot\,)$ is a matrix or vector then the nonlinearity is applied to each element of the
matrix or vector.
The input to the MLP is denoted as a vector, $y \in \Re^{N_1 N_2 \times 1}$. The elements of the vector
are samples of a two-dimensional prewhitened input signal, $y(n_1, n_2)$. We can write the
$i$th element of the vector as a function of the two-dimensional signal as follows:

    $y_i(n_1, n_2) = y(n_1 + (i, N_1),\, n_2 + \lfloor i, N_1 \rfloor)$ ,  $i = 0, \ldots, N_1 N_2 - 1$ ,

where $(i, N_1)$ indicates a modulo operation (the remainder of $i$ divided by $N_1$) and
$\lfloor i, N_1 \rfloor$ indicates integer division of $i$ by $N_1$. Written this way, the elements of the vector
$y$ sample a rectangular region of support of size $N_1 \times N_2$ beginning at sample $(n_1, n_2)$ in
the prewhitened signal, $y(n_1, n_2)$. The vector argument of equation 41 and the resulting
output signal can now be written as an explicit function of the beginning sample point of
the template within the prewhitened image:

    $g_\omega(n_1, n_2) = g(\omega, y(n_1, n_2)) = \sigma(W_3 \sigma(W_2 \sigma(W_1 y(n_1, n_2)) + \varphi))$ .    (42)
The output of the mapping as written in equation 42 is now an explicit function of
$(n_1, n_2)$ and the constant parameter set, $\omega$ (which does not vary with $(n_1, n_2)$). We can also
write the output response as a function of the shifted version of the image, $y(n_1, n_2)$, as

    $g_\omega(n_1 + n_1', n_2 + n_2') = g(\omega, y(n_1 + n_1', n_2 + n_2'))$ .    (43)

Since the parameters, $\omega$, are constant, equations 42 and 43 are sufficient to show the
mapping of the MLP is shift invariant and, consequently, the system as a whole (including
the shift-invariant preprocessor) is also shift invariant.
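The argument above can also be checked numerically with a small sliding-window implementation of equation 42. The sketch below (NumPy, not part of the original text) uses illustrative random weights and a circular shift of the input; on the portion of the output plane whose windows do not wrap around the image border, the output is the correspondingly shifted version of the original output.

```python
import numpy as np

rng = np.random.default_rng(5)

# Small MLP applied to every 4x4 window of a prewhitened image (equation 42)
N1 = N2 = 4
W1 = rng.standard_normal((8, N1 * N2))
W2 = rng.standard_normal((4, 8))
W3 = rng.standard_normal((1, 4))
phi = rng.standard_normal((4, 1))
sigma = np.tanh

def g(window):
    # g(omega, y) = sigma(W3 sigma(W2 sigma(W1 y) + phi))
    y = window.flatten(order="F").reshape(-1, 1)
    return sigma(W3 @ sigma(W2 @ sigma(W1 @ y) + phi))[0, 0]

def output_plane(img):
    # Slide the window over every valid position (n1, n2)
    rows, cols = img.shape[0] - N1 + 1, img.shape[1] - N2 + 1
    return np.array([[g(img[i:i + N1, j:j + N2]) for j in range(cols)]
                     for i in range(rows)])

img = rng.standard_normal((12, 12))
shifted = np.roll(img, (2, 3), axis=(0, 1))

out = output_plane(img)
out_shifted = output_plane(shifted)

# A shift of the input produces the same shift of the output plane
# (compared away from the circular wrap-around region)
assert np.allclose(out_shifted[2:, 3:], out[:-2, :-3])
```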
4.3 Classifier Performance and Measures of Generalization
One of the issues for any iterative method which relies on exemplars is the number of
training exemplars to use in the computation of the discriminant function. In addition, for
iterative methods, there is the issue of when to stop the adaptation process. In the case of
distortion invariant filters, such as the MACE filter, some common heuristics are used to
determine the number of training exemplars. Typically, samples are drawn from the
training set and used to compute the filter from equation 23 until the minimum peak response
over the remaining samples exceeds some threshold [Casasent and Ravichandran, 1992].
A similar heuristic is to continue to draw samples from the training set until the mean
square error of the peak response over the remaining samples drops below some preset
threshold. These measures are then used as indicators of how well the filter generalizes to
between-aspect exemplars from the training set which have not been used for the
computation of the filter coefficients.
The ultimate goal, however, is classification. Generalization in the context of
classification must be related to the ability to classify a previously unseen input [Bishop, 1995].
We show by example that the measures of generalization mentioned above may be
misleading as predictors of classifier performance for even the linear filters. In fact, the
results of the experiments will show that the way in which the data is preprocessed is more
indicative of classifier performance than these other indirect measures.
We illustrate this point with an example using ISAR image data. A data set, larger than
in the previous experiments, will be used. Two more vehicles, one from each vehicle type,
will be used for the testing set, and all vehicles will be sampled at higher aspect resolution.
Figure 23 shows ISAR images of size 64 x 64 taken from five different vehicles and two
different vehicle types. The images are all taken with the same radar. Data taken from
vehicles in the same class vary in the vehicle configuration and radar depression angle (15
or 20 degrees depression). Images have been formed from each vehicle at aspect
variations of 0.125 degrees from 5 to 85 degrees aspect, for a total of 641 images for each
vehicle. Figure 23 shows each of the vehicles at 5, 45, and 85 degrees aspect.
We will use vehicle type 1 as the recognition class and vehicle type 2 as a confusion
vehicle. Images of vehicle 1a will be used as the set from which to draw training
exemplars. Classification performance will then be measured as the ability to recognize
vehicles 1b and 1c while rejecting vehicles 2a and 2b. The filter we will use is a form of the
OTSDF [Réfrégier and Figue, 1991] which is computed in the spectral domain as

    $H = [\alpha P_x + (1-\alpha) P_n]^{-1} X [X^\dagger [\alpha P_x + (1-\alpha) P_n]^{-1} X]^{-1} d$ ,    (44)

where the columns of the data matrix $X \in C^{N_1 N_2 \times N_t}$ are the Fourier coefficients of $N_t$
exemplar images of dimension $N_1 \times N_2$ of vehicle 1a reordered into column vectors. The
diagonal matrix $P_x \in \Re^{N_1 N_2 \times N_1 N_2}$ contains the coefficients of the average power
spectrum measured over the $N_t$ exemplars of vehicle 1a, while $P_n \in \Re^{N_1 N_2 \times N_1 N_2}$ is the
identity matrix scaled by the average of the diagonal terms of $P_x$. Finally, $d \in \Re^{N_t}$ is a
column vector of desired outputs, one for each exemplar. The elements of $d$ are typically
[Figure: ISAR images of vehicles 1a, 1b, 1c and 2a, 2b at three aspect angles.]

Figure 23. ISAR images of two vehicle types shown at aspect angles of 5, 45, and
85 degrees, respectively. Three different vehicles of type 1 (a, b, and c)
are shown, while two different vehicles of type 2 (a and b) are shown.
Vehicle 1a is used as a training vehicle, while vehicles 1b and 1c are
used as the testing vehicles for the recognition class. Vehicles 2a and
2b are used as confusion vehicles.
set to unity. When $\alpha$ is set to unity, equation 44 yields exactly the MACE filter; when it is
set to zero, the result is the SDF. The filter we are using is therefore trading off the MACE
filter criterion against the SDF criterion. The SDF criterion can also be viewed as the
MVSDF [Kumar, 1986] criterion when the noise class is represented by a white noise
random process. This filter can also be decomposed as in figure 22.
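Because the bracketed matrix in equation 44 is diagonal, the filter can be computed efficiently with elementwise operations on the spectra. The sketch below (NumPy, not part of the original text) uses small random stand-in images in place of the ISAR exemplars and verifies that the resulting filter meets the peak constraints.

```python
import numpy as np

rng = np.random.default_rng(6)
N1, N2, Nt = 8, 8, 5
alpha = 0.95

# Illustrative random stand-ins for the recognition-class exemplar images
imgs = rng.standard_normal((Nt, N1, N2))

# Columns of X are the reordered 2-D DFT coefficients of the exemplars
X = np.stack([np.fft.fft2(im, norm="ortho").flatten(order="F") for im in imgs],
             axis=1)

# Diagonal of Px: average power spectrum over the exemplars
px = np.mean(np.abs(X) ** 2, axis=1)
# Pn: identity scaled by the mean of the diagonal of Px
pn = np.full_like(px, px.mean())

# Equation (44), exploiting that [alpha Px + (1-alpha) Pn] is diagonal
t = 1.0 / (alpha * px + (1 - alpha) * pn)
d = np.ones(Nt)
TX = t[:, None] * X                            # [alpha Px + (1-alpha) Pn]^{-1} X
H = TX @ np.linalg.solve(X.conj().T @ TX, d)

# The filter satisfies the peak constraints X^dagger H = d
assert np.allclose(X.conj().T @ H, d)
```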
These experiments examine the relationship between the two commonly used
measures of generalization and two measures of classification performance. We can draw
conclusions from the results about the appropriateness of the generalization measures with
regard to classification. The first generalization measure is the minimum peak response,
denoted $y_{min}$, taken over the aspect range of the images of the training vehicle (excluding
the aspects used for computing the filter). The second generalization measure is the mean
square error, denoted $y_{mse}$, between the desired output of unity and the peak response over
the aspect range of the images of the training vehicle (excluding the aspects used for
computing the filter). The classification measures are taken from the receiver operating
characteristic (ROC) curve measuring the probability of detection, $P_d$, of a testing vehicle in the
recognition class (vehicles 1b and 1c) versus the probability of false alarm, $P_{fa}$, on a
testing vehicle in the confusion class (vehicles 2a and 2b), based on peak detection. The first
specific measure is the area under the ROC curve, a general measure of the test being used,
while the second measure is the probability of false alarm when the probability of detection
equals 80%, which measures a single point of interest on the ROC curve.
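The two ROC-based measures just described can be computed from peak-response scores by sweeping a detection threshold. The sketch below (NumPy, not part of the original text) uses synthetic stand-in scores for the two classes; the actual values would come from the filter outputs on the testing vehicles.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative stand-ins for peak filter responses on the two vehicle classes
scores_rec = rng.normal(1.0, 0.3, 500)   # recognition class (e.g. vehicles 1b, 1c)
scores_con = rng.normal(0.5, 0.3, 500)   # confusion class (e.g. vehicles 2a, 2b)

# Sweep a detection threshold from high to low to trace out the ROC curve
thresholds = np.sort(np.concatenate([scores_rec, scores_con]))[::-1]
pd = np.array([(scores_rec >= t).mean() for t in thresholds])
pfa = np.array([(scores_con >= t).mean() for t in thresholds])

# First measure: area under the ROC curve (trapezoidal rule)
roc_area = np.sum(np.diff(pfa) * (pd[1:] + pd[:-1]) / 2)

# Second measure: Pfa at the operating point where Pd first reaches 80%
pfa_at_pd80 = pfa[np.argmax(pd >= 0.8)]

assert 0.5 < roc_area <= 1.0
assert 0.0 <= pfa_at_pd80 < 0.5
```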
Two filters are used, one with $\alpha = 0.5$ and the other with $\alpha = 0.95$; that is, one in which
both criteria are weighted equally and one which is close to the MACE filter criterion.
The number of exemplars drawn from the training vehicle (1a) is varied from 21 to 81,
sampled uniformly in aspect (1 to 4 degrees aspect separation between exemplars).
Examination of figures 24 and 25 shows that for both cases ($\alpha$ equal to 0.5 and 0.95)
no clear relationship emerges in which the generalization measures are indicators of good
classification performance. Table 1 compares the classifier performance when the
generalization measures as described above are used to choose the filter versus the best ROC
performance achieved throughout the range of aspect separation. In one regard, the
generalization measures were consistent in that the same aspect separation was predicted
by both measures for both settings of $\alpha$. In figure 26 we compare the ROC curves for two
cases, first where the filter is chosen using the generalization measures and second the best
achieved ROC curve, for both settings of $\alpha$. We would expect that for each $\alpha$ the filter
chosen using the generalization measure would be near the best ROC performance. As can
be seen in the figure, this is not the case.
Table 1. Classifier performance measures when the filter is determined by either of the
common measures of generalization as compared to best classifier performance for two
values of $\alpha$.

                                     Generalization Measure
                                   $y_{min}$   $y_{mse}$   Best
    $\alpha$ = 0.50  Pfa@Pd=0.8      0.24       0.24      0.16
                     ROC area        0.83       0.83      0.90
    $\alpha$ = 0.95  Pfa@Pd=0.8      0.16       0.16      0.07
                     ROC area        0.94       0.94      0.95
It is obvious from figures 24 and 25 that the generalization measures are not
significantly correlated with the ROC performance. In fact, as summarized in table 2, the
generalization measures are negatively, albeit weakly, correlated with ROC performance. One
feature of figures 24 and 25 is that although ROC performance varies independently of
[Figure: scatter plots of $y_{min}$ versus ROC area and versus $P_{fa}$ (at $P_d$ = 0.8) for $\alpha$ = 0.50 and $\alpha$ = 0.95.]

Figure 24. Generalization as measured by the minimum peak response. The plot
compares $y_{min}$ versus classification performance measures (ROC area
and Pfa@Pd=0.8).
[Figure: scatter plots of $y_{mse}$ versus ROC area and versus $P_{fa}$ (at $P_d$ = 0.8) for $\alpha$ = 0.50 and $\alpha$ = 0.95.]

Figure 25. Generalization as measured by the peak response mean square error.
The plot compares $y_{mse}$ versus classification performance measures
(ROC area and Pfa@Pd=0.8).
[Figure: ROC curves ($P_d$ versus $P_{fa}$) for the best-generalization and best-ROC filters at $\alpha$ = 0.50 and $\alpha$ = 0.95.]

Figure 26. Comparison of ROC curves. The ROC curves for the number of
training exemplars yielding the best generalization measure versus the
number yielding the best ROC performance for values of $\alpha$ equal to
0.5 and 0.95 are shown.
either the minimum peak response or the MSE, there does appear to be a dependency on $\alpha$.
This leads to a second experiment.
Table 2. Correlation of generalization measures to classifier performance. In both cases ($\alpha$ equal to 0.5
or 0.95) the classifier performance, as measured by the area of the ROC curve or $P_{fa}$ at $P_d$ equal to 0.8, has
the opposite correlation to what would be expected of a useful measure for predicting performance.

                                       Performance Measures
                              $\alpha$ = 0.50              $\alpha$ = 0.95
                         ROC area  Pfa(@Pd=0.8)    ROC area  Pfa(@Pd=0.8)
    Generalization  $y_{min}$   0.39      0.21          0.40      0.41
    Measures        $y_{mse}$   0.32      0.11          0.31      0.35
In the second experiment we examine the relationship between the parameter $\alpha$ and
the ROC performance. The aspect separation between training exemplars is set to 2, 4, and
8 degrees. The value of $\alpha$, the emphasis on the MACE criterion, is varied in the range
zero to unity. Figure 27 shows the relationship between ROC performance and the value
of $\alpha$. It is clear from the plots that there is a positive relationship between the emphasis on
the MACE criterion and the ROC performance. However, the peak in ROC performance is
not achieved at $\alpha$ equal to unity. In all three cases, the ROC performance peaks just prior
to unity, with the performance drop-off increasing with aspect separation at $\alpha$ equal to
unity.
The difference between the SDF and the MACE filter is the preprocessor. What is shown
by this analysis is that, in general, the preprocessor from the MACE filter criterion leads
to better classification, but too much emphasis on the MACE filter criterion, with $\alpha$ at
unity, leads to a filter which is too specific to the training samples. The
problems described above are well known. Alterations to the MACE criterion have been
the subject of study by many researchers [Casasent et al., 1991; Casasent and Ravichandran, 1992;
Ravichandran and Casasent, 1992; Mahalanobis et al., 1994a]. There is still, as yet, no
principled method found in the literature by which to set the parameter $\alpha$.
There are two conclusions from this analysis that are pertinent to the nonlinear
extension we are using. First, the results show that prewhitening over the recognition class
leads to better classification performance. For this reason we choose to use the
preprocessor of the MACE filter in our nonlinear filter architecture. The issue of extending the
MACE filter to nonlinear systems can in this way be formulated as a search for a more
robust nonlinear discriminant function in the prewhitened image space.
The second conclusion is that comparisons of the nonlinear filter to its linear
counterpart must be made in terms of classification performance only. There are simple nonlinear
systems, such as a soft threshold at the output of a linear system for example, that will
outperform the MACE filter or its variations in terms of maximizing the minimum peak
response over the training vehicle or reducing the variance in the output image plane.

[Figure: plots of ROC area and $P_{fa}$ (at $P_d$ = 0.8) versus $\alpha$ for aspect separations of 2, 4, and 8 degrees.]

Figure 27. ROC performance measures versus $\alpha$. Results are shown for training
aspect separations of 2, 4, and 8 degrees. These plots indicate that
ROC performance is positively related to $\alpha$.
These measures are not, however, sufficient to describe classification performance. We
have also used these measures in the past but feel that they are not the most appropriate for
classification [Fisher and Principe, 1995b].
4.4 Statistical Characterization of the Rejection Class
We now present a statistical viewpoint of distortion invariant filters from which such
nonlinear extensions fit naturally into an iterative framework. This treatment results in an
efficient way to capture the optimality condition of the MACE filter using a training
algorithm which is approximately of order $N_t$ and which leads to better classification
performance than the linear MACE.
A possible approach to design a nonlinear extension to the MACE filter and improve
on the generalization properties is to simply substitute the linear processing elements of
the LAM with nonlinear elements. Since such a system can be trained with error
backpropagation [Rumelhart et al., 1986], the issue would be simply to report on performance
comparisons with the MACE. Such a methodology does not, however, lead to an
understanding of the role of the nonlinearity, and does not elucidate the tradeoffs in the design
and in training.
Here we approach the problem from a different perspective. We seek to extend the
optimality condition of the MACE to a nonlinear system, i.e. the energy in the output
space is minimized while maintaining the peak constraint at the origin. Hence we will
impose these constraints directly in the formulation, even knowing a priori that an
analytical solution is very difficult or impossible to obtain. We reformulate the MACE filter from
a statistical viewpoint and generalize it to arbitrary mapping functions, linear and nonlin
ear.
Consider images of dimension $N_1 \times N_2$ reordered by column or row into vectors. Let
the rejection class be characterized by the random vector $X_1 \in \Re^{N_1N_2 \times 1}$. We know the
second-order statistics of this class as represented by the average power spectrum (or
equivalently the autocorrelation function). Let the recognition class be characterized by
the columns of a data matrix $x_2 \in \Re^{N_1N_2 \times N_t}$, which are observations of the random vector
$X_2 \in \Re^{N_1N_2 \times 1}$, similarly reordered. We wish to find the parameters, $\omega$, of a mapping
$g(\omega, \cdot): \Re^{N_1N_2} \rightarrow \Re^1$ such that we may discriminate the recognition class from the
rejection class. Here, it is the mapping function, $g$, which defines the discriminator topology.
Towards this goal, we wish to minimize the objective function
$$J = E(g(\omega, X_1)^2)$$
over the mapping parameters, $\omega$, subject to the system of constraints
$$g(\omega, x_2) = d, \qquad (45)$$
where $d \in \Re^{N_t \times 1}$ is a column vector of desired outputs. It is assumed that the mapping
function is applied to each column of $x_2$, and $E(\cdot)$ is the expected value operator.
Using the method of Lagrange multipliers, we can augment the objective function as
$$J = E(g(\omega, X_1)^2) + (g(\omega, x_2) - d^T)\lambda, \qquad (46)$$
where $\lambda \in \Re^{N_t \times 1}$ is a vector whose elements are the Lagrange multipliers, one for each
constraint. Computing the gradient with respect to the mapping parameters yields
$$\frac{\partial J}{\partial \omega} = 2E\!\left(g(\omega, X_1)\frac{\partial g(\omega, X_1)}{\partial \omega}\right) + \frac{\partial g(\omega, x_2)}{\partial \omega}\lambda. \qquad (47)$$
Equation 47, along with the constraints of equation 45, can be used to solve for the optimal
parameters, $\omega$, assuming our constraints form a consistent set of equations. This is,
of course, dependent on the mapping topology.
4.4.1 The Linear Solution as a Special Case
It is interesting to verify that this formulation yields the MACE filter as a special case.
If, for example, we choose the mapping to be a linear projection of the input image, that is,
$$g(\omega, x) = \omega^T x, \qquad \omega = [h_1 \ldots h_{N_1N_2}]^T \in \Re^{N_1N_2 \times 1},$$
equation 46 becomes, after simplification,
$$J = \omega^T E(X_1X_1^T)\omega + (\omega^T x_2 - d^T)\lambda. \qquad (48)$$
In order to solve for the mapping parameters, $\omega$, we are still left with the task of
computing the term $E(X_1X_1^T)$, which, in general, we can only estimate from observations of the
random vector, $X_1$, or assume a specific form. Assuming that we have a suitable estimator,
the well-known solution to the minimum of equation 48 over the mapping parameters
subject to the constraints of equation 45 is
$$\omega = \hat{R}_{x_1}^{-1} x_2 \left[x_2^T \hat{R}_{x_1}^{-1} x_2\right]^{-1} d, \qquad (49)$$
where
$$\hat{R}_{x_1} = \mathrm{estimate}\{E(X_1X_1^T)\}. \qquad (50)$$
Depending on the characterization of $X_1$, equation 49 describes various SDF-type
filters (i.e. MACE, MVSDF, etc.). In the case of the MACE filter, the rejection class is
characterized by all 2-D circular shifts of target class images away from the origin. Solving for
the MACE filter coefficients is therefore equivalent to using the average circular autocorrelation
sequence (or equivalently the average power spectrum in the frequency domain)
over images in the target class as an estimator of the elements of the matrix $E(X_1X_1^T)$.
Sudharsanan et al. [1991] suggest a very similar methodology for improving the
performance of the MACE filter. In that case the average linear autocorrelation sequence is
estimated over the target class, and this estimator of $E(X_1X_1^T)$ is used to solve for linear
projection coefficients in the space domain. The resulting filter is referred to as the
SMACE (space-domain MACE) filter.
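As a concrete illustration of equations 49 and 50, the following Python/NumPy sketch (array sizes and data are purely illustrative) estimates the rejection-class correlation matrix from observations and computes the constrained linear solution:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 64    # vectorized image size N1*N2 (illustrative; real images are larger)
Nt = 5    # number of recognition-class exemplars

# Observations of the rejection-class random vector X1, used to form
# the estimate R_hat of E(X1 X1^T) in equation 50.
X1_obs = rng.standard_normal((N, 1000))
R_hat = X1_obs @ X1_obs.T / X1_obs.shape[1]

# Recognition-class data matrix x2 and desired outputs d (equation 45).
x2 = rng.standard_normal((N, Nt))
d = np.ones((Nt, 1))

# Equation 49: w = R^-1 x2 [x2^T R^-1 x2]^-1 d.
Rinv_x2 = np.linalg.solve(R_hat, x2)
w = Rinv_x2 @ np.linalg.solve(x2.T @ Rinv_x2, d)

# The constraints hold with equality: w^T x2 = d^T.
print(np.allclose(w.T @ x2, d.T))  # True
```

With a white (identity) correlation estimate this reduces to the classical SDF; the choice of estimator is what distinguishes the MACE, MVSDF, and related filters.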
4.4.2 Nonlinear Mappings
For arbitrary nonlinear mappings it will, in general, be very difficult to solve for glo
bally optimal parameters analytically. Our purpose is instead to develop iterative training
algorithms which are practical and yield improved performance over the linear mappings.
It is through the implicit description of the rejection class by its second-order statistics
that we have developed an efficient method for extending the MACE filter and other
related correlators to nonlinear topologies such as neural networks.
As stated, our goal is to find mappings, defined by a topology and a parameter set,
which improve upon the performance of the MACE filter in terms of generalization while
maintaining a sharp constrained peak in the center of the output plane for images in the
recognition class. One approach, which leads to an iterative algorithm, is to approximate
the original objective function of equation 46 with the modified objective function
$$J = (1-\beta)E(g(\omega, X_1)^2) + \beta\,[g(\omega, x_2) - d]^T[g(\omega, x_2) - d]. \qquad (51)$$
The principal advantage gained by using equation 51 over equation 46 is that we can
solve iteratively for the parameters of the mapping function (assuming it is differentiable)
using gradient search. The constraint equations, however, are no longer satisfied with
equality over the training set. It has been recognized that the choice of constraint values
has direct impact on the performance of optimized linear correlators. Sudharsanan et al
[1990] have explored techniques for optimally assigning these values within the con
straints of a linear topology. Other methods have been suggested [Mahalanobis et al.,
1994a, 1994b; Kumar and Mahalanobis, 1995] to improve the performance of distortion
invariant filters by relaxing the equality constraints. Mahalanobis [1994a] extends this
idea to unconstrained linear correlation filters. The OTSDF objective function of
Réfrégier [1991] appears similar to the modified objective function and indeed, for a linear
topology, this can be solved analytically as an optimal tradeoff problem.
Our primary purpose for modifying the objective function is to allow for an iterative
method within the NL-MACE architecture. We have already shown in the previous chapter
that this choice of criterion is suitable for classification. We will show that the primary
qualities of the MACE filter are still maintained when we relax the equality constraints in
our formulation. Varying β in the range [0, 1] controls the degree to which the average
response to the rejection class is emphasized versus the variance about the desired output
over the recognition class.
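To make the role of β concrete, here is a minimal gradient-descent sketch of equation 51 using a linear mapping $g(\omega, x) = \omega^T x$ so that each gradient term stays one line; the sizes, step size, and iteration count are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nt, Ns = 32, 4, 200
beta, lr = 0.9, 0.005               # tradeoff and step size (illustrative)

x2 = rng.standard_normal((N, Nt))   # recognition-class exemplars
d = np.ones(Nt)                     # desired outputs
X1 = rng.standard_normal((N, Ns))   # rejection-class observations
w = 0.01 * rng.standard_normal(N)

for _ in range(2000):
    # J = (1-beta) E(g(w, X1)^2) + beta ||g(w, x2) - d||^2  (equation 51)
    g1 = w @ X1                     # responses to the rejection class
    e2 = w @ x2 - d                 # constraint errors
    grad = 2 * (1 - beta) * X1 @ g1 / Ns + 2 * beta * x2 @ e2
    w -= lr * grad

# The equality constraints are now only approximately satisfied.
print(np.max(np.abs(w @ x2 - d)))
```

As β approaches 1 the residual constraint error shrinks toward zero; as β approaches 0 the rejection-class energy term dominates and the constraints are ignored.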
As in the linear case, we can only estimate the expected variance of the output due to
the random vector input and its associated gradient. If, as in the MACE (or SMACE) filter
formulation, X1 is characterized by all 2D circular (or linear) shifts of the recognition
class away from the origin then this term can be estimated with a sampled average over
the exemplars, $x_2$, for all such shifts. From an iterative standpoint this still leads to the
impractical approach of training exhaustively over the entire output plane. It is desirable,
then, to find other equivalent characterizations of the rejection class which may alleviate
the computational load without significantly impacting performance.
4.5 Efficient Representation of the Rejection Class
Training becomes an issue once the associative memory structure takes a nonlinear
form. The output variance of the linear MACE filter is minimized for the entire output
plane over the training exemplars. Even when the coefficients of the MACE filter are
computed iteratively we need only consider the output point at the designated peak
location (constraint) for each prewhitened training exemplar [Fisher and Principe, 1994]. This
is due to the fact that for the underdetermined case, the linear projection which satisfies
the system of constraints with equality and has minimum norm is also the linear projection
which minimizes the response to images with a flat power spectrum. This solution is
arrived at naturally via a gradient search which only considers the response at the
constraint location.
This is no longer the case when the mapping is nonlinear. Adapting the parameters via
gradient search (such as error backpropagation) on recognition class exemplars only at the
constraint location will not, in general, minimize the variance over the entire output image
plane. In order to minimize the variance over the entire output plane we must consider the
response of the filter to each location in the input image, not just the constraint location.
The MACE filter optimization criterion minimizes, in the average, the response to all
images with the same second-order statistics as the rejection class. At the output of the
prewhitener (prior to the MLP) any white sequence will have the same second-order statistics
as the rejection class. This condition can be exploited to make the training of the MLP
more efficient.
From an implementation standpoint, the prewhitening stage and the input layer
weights can be combined into a single equivalent linear transformation; however,
prewhitening separately allows the rejection class to be represented by white sequences at the
input to the MLP during the training phase.
This result is due to the statistical formulation of the optimization criterion. Minimizing
the response to white sequences, in the average, minimizes the response to shifts of the
exemplar images, since they have the same second-order statistics (after prewhitening).
Consequently, we do not have to train over the entire output plane exhaustively, thereby
reducing training times proportionally by the input image size, $N_1N_2$. Instead, we use a
small number of randomly generated white sequences to efficiently represent the rejection
class. The result is an algorithm which is of order $N_t + N_s$ (where $N_s$ is the number of
white-noise rejection class exemplars), as compared to exhaustive training.
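A sketch of the resulting training-set construction; apart from the $N_t + N_s$ pattern count, every concrete value (image size, desired outputs) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

N1, N2 = 16, 16        # image dimensions (illustrative)
Nt, Ns = 41, 60        # recognition exemplars and white-noise exemplars

# Prewhitened recognition-class exemplars, one column per image,
# each paired with a desired peak output of 1.
x2 = rng.standard_normal((N1 * N2, Nt))
d2 = np.ones(Nt)

# White sequences stand in for the rejection class: after prewhitening,
# any white sequence has the same second-order statistics as the shifted
# exemplar images. Desired output is 0.
x1 = rng.standard_normal((N1 * N2, Ns))
d1 = np.zeros(Ns)

# One epoch now visits Nt + Ns patterns rather than on the order of
# Nt * N1 * N2 shifted images, which is the source of the speedup.
X = np.concatenate([x2, x1], axis=1)
D = np.concatenate([d2, d1])
print(X.shape, D.shape)  # (256, 101) (101,)
```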
4.6 Experimental Results
We now present experimental results which illustrate the technique and potential
pitfalls. There are four significant outcomes in the experiments presented in this section. The
first is that when using the white sequences to characterize the rejection class, the linear
solution is a strong attractor. The second outcome is that imposing orthogonality on the
input layer to the MLP tends to lead to a nonlinear solution with improved performance.
The third result, in which we restrict the rejection class to a subspace, yields a significant
decrease in the convergence time. The fourth result, in which we borrow from the idea of
using the interior of the convex hull to represent the rejection class [Kumar et al., 1994],
yields significantly better classification performance.
In these experiments we use the data depicted in figure 23. As in the previous
experiments, images from vehicle 1a will be used as the training set. Vehicles 1b and 1c will be
used as the recognition class, while vehicles 2a and 2b will be used as a rejection/confusion
class for testing purposes. In each case comparisons will be made to a baseline linear filter.
Specifically, in all cases the value of α for the linear filter is set to 0.99. The aspect
separation between training images is 2.0 degrees. This results in 41 training exemplars
from vehicle 1a. These settings of α and aspect separation were found to give the best
classifier performance for the linear filter with this data set. We continue to refer to this as
a MACE filter since the MACE criterion is so heavily emphasized. Technically it is an
OTSDF filter, but such nomenclature does not convey the type of preprocessing that is
being performed. We choose the value of α so as to compare to the best possible MACE
filter for this data set.
The nonlinear filter will use the same preprocessor as the linear filter (i.e. α = 0.99).
The MLP structure is shown at the bottom of figure 22. It accepts an $N_1N_2$ input vector (a
preprocessed image reordered into a column vector), followed by two hidden layers (with
two and three hidden PE nodes, respectively), and a single output node. The parameters of
the MLP,
$$W_1 \in \Re^{N_1N_2 \times 2}, \quad W_2 \in \Re^{2 \times 3}, \quad W_3 \in \Re^{3 \times 1},$$
are to be determined through gradient search. The gradient search technique used in all
cases will be the error backpropagation algorithm.
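A forward-pass sketch of this topology (the weight names follow the text, while the random initialization and image size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

N1, N2 = 16, 16          # image dimensions (illustrative)
N = N1 * N2

# MLP parameters: N1N2 inputs -> 2 PEs -> 3 PEs -> 1 output node.
W1 = 0.1 * rng.standard_normal((N, 2))
W2 = 0.1 * rng.standard_normal((2, 3))
W3 = 0.1 * rng.standard_normal((3, 1))
b2 = np.zeros(3)
b3 = np.zeros(1)

def nl_mace(x):
    """Forward pass of the nonlinear stage on a prewhitened image vector."""
    u = W1.T @ x                         # feature-space outputs (u1, u2)
    h = np.tanh(W2.T @ u + b2)           # second hidden layer
    return np.tanh(W3.T @ h + b3)[0]     # scalar output

x = rng.standard_normal(N)               # a prewhitened image, reordered
y = nl_mace(x)
print(-1.0 < y < 1.0)  # True: tanh output is bounded
```

The first layer is a linear projection, so its two outputs form the feature space $(u_1, u_2)$ examined later in this section.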
4.6.1 Experiment I: noise training
As stated, using the statistical approach, the rejection class is characterized by white
noise sequences at the input to the MLP. The recognition class is characterized by the
exemplars. It is from these white noise sequences that the MLP, through the
backpropagation learning algorithm, captures information about the rejection class. So it would seem a
simple matter, during the training stage, to present random white noise sequences as the
rejection class exemplars. This is exactly the training method used for this experiment.
We observed empirically that with this method of training the linear
solution is a strong attractor. The results of the first experiment demonstrate this behavior.
Figure 28 shows the peak output response taken over all images of vehicle 1a for both
the linear (top) and nonlinear (bottom) filters. In the figure we see that for the linear filter
the peak constraint (unity) is met exactly for the training exemplars, with degradation for
the between-aspect exemplars. As mentioned previously, if the pure MACE filter criterion
were used (α equal to unity), the peak in the output plane is guaranteed to be at the
constraint location [Mahalanobis et al., 1987]. It turns out that for this data set the peak output
also occurs at the constraint location for the training images; however, with α = 0.99 it was
not guaranteed. Examination of the peak output response for the NLMACE filter shows
that the constraints are met very closely (but not exactly) for the training exemplars also
with degradation in the peak output response at between aspect locations. The degradation
for the nonlinear filter is noticeably less than in the linear case and so in this regard it has
outperformed the linear filter.
Figure 29 shows the output plane response for a single image of vehicle 1a (not one
used for computing the filter coefficients) for the linear filter (top) and the nonlinear filter
(bottom). Again in this figure we see that both filters result in a noticeable peak when the
image is centered on the filter and a reduced response when the image is shifted. The
reduction in response to the shifted image is again noticeably better in the nonlinear filter
than in the linear filter. This was found to be true for all images of vehicle 1a, and so
in this regard we can again say that the nonlinear filter outperformed the linear filter.
However, as we have already illustrated for the linear case, these measures alone are not
sufficient to predict classifier performance and are certainly not sufficient to compare
linear systems to nonlinear systems. This point is made clear in table 3, which summarizes
the classifier performance at two probabilities of detection for all of the experiments
reported here, when vehicles 1b and 1c are used as the recognition class and vehicles 2a
and 2b are used for the rejection class. At this point we are only interested in the results
pertaining to the linear filter (our baseline) and the nonlinear filter results for experiment I.

Figure 28. Peak output response of linear and nonlinear filters over the training
set. The nonlinear filter clearly outperforms the linear filter by this
metric alone.
Figure 29. Output response of linear filter (top) and nonlinear filter (bottom).
The response is for a single image from the training set, but not one
used to compute the filter.
This table shows that the classifier performance for the linear and nonlinear filters
is nominally the same, despite what may be perceived to be better performance of the
nonlinear filter with regard to peak response over the training vehicle and reduced output
plane response to shifts of the image. Furthermore, if we examine figure 30, which shows
the ROC curve for both filters we see that they overlay each other. From a classification
standpoint the two filters are equivalent.
Figure 30. ROC curves for linear filter (solid line) versus nonlinear filter (dashed
line). Despite improved performance of the nonlinear filter as
measured by peak output response and reduced variance over the
training set, the filters are equivalent with regards to classification
over the testing set.
This result is best explained by figure 31. Recall the points $u_1$ and
$u_2$ labeled in figure 22.
We can view these outputs as a feature space, that is, the MLP discriminant function
can be superimposed on the projection of the input image onto this space. In this case the
feature space is a representation of the input vector internal to the MLP structure. The
designation of these points as features is due to the fact that they represent some abstract
quality of the data, and the decision surface can be computed as a function of the features.

Figure 31. Experiment I: Resulting feature space from simple noise training.
Note that all points are projected onto a single curve in the feature
space. In the top figure, squares are the recognition class training
exemplars, triangles are white noise rejection class exemplars, and
plus signs are the images of vehicle 1a not used for training. In the
bottom figure, squares are the peak responses from vehicles 1b and 1c,
triangles are the peak responses from vehicles 2a and 2b.
Mathematically this can be written
$$W_1^T x = u, \qquad y_o = \sigma(W_3^T\,\sigma(W_2^T u + \varphi_2) + \varphi_3). \qquad (52)$$
Recall that the matrix $W_i$ represents the connectivities from the output of layer $(i-1)$ to
the inputs of the PEs of layer $i$, $\varphi_i$ is a constant bias term, and $\sigma(\cdot)$ is a sigmoidal
nonlinearity (hyperbolic tangent function in this case).
Figure 31 shows this projection for the training set (top) and the testing set (bottom).
What is significant in the figure is that although the discriminant as a function of the
vector $u$ is nonlinear, the projections of the images lie on a single curve in this feature space.
Topologically, this filter can be put into one-to-one correspondence with a linear projection.
This is not to say that the linear solution is undesirable, but under the optimization
criterion it can be computed in closed form. Furthermore, in a space as rich as the ISAR image
space it is unlikely that the linear solution will give the best classification performance.
Table 3. Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear
filter versus four different types of nonlinear training. N: white noise training, GS: Gram-Schmidt
orthogonalization, subN: PCA subspace noise, CH: convex hull rejection class.

                    Pfa (%)
Pd (%)   linear filter   I (N)    II (N, GS)   III (subN, GS)   IV (subN, GS, CH)
80       4.37            4.37     3.74         2.81             2.45
99       42.43           41.87    27.15        26.52            15.33
4.6.2 Experiment II: noise training with an orthogonalization constraint
As a means of avoiding the linear solution, a modification was made to the training
algorithm. The modification was to impose orthogonality on the columns of $W_1$ through a
Gram-Schmidt process. The motivation for doing this stems from the fact that we are
working in a prewhitened image space. In a prewhitened image space this condition is
sufficient to assure that the outputs in the feature space, as measured at $u_1$ and $u_2$, will be
uncorrelated over the rejection class. Mathematically this can be shown as
$$E\{uu^T\} = E\{W_1^T X_1 X_1^T W_1\} = W_1^T E\{X_1X_1^T\} W_1 = \sigma^2 W_1^T W_1 = \sigma^2 \begin{bmatrix} \|w_1\|^2 & 0 \\ 0 & \|w_2\|^2 \end{bmatrix},$$
where $w_1, w_2 \in \Re^{N_1N_2 \times 1}$ are the columns of $W_1$ and, after prewhitening,
$E\{X_1X_1^T\} = \sigma^2 I$. This result is true for any number of
nodes in the first layer of the MLP.
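This claim is easy to check numerically: orthogonalize the two columns of $W_1$ by Gram-Schmidt and verify that the sample estimate of $E\{uu^T\}$ is (nearly) diagonal for white input. Sizes here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N, Ns = 64, 50000      # input size and number of white samples (illustrative)

# Random input-layer weights, then Gram-Schmidt on the two columns.
W1 = rng.standard_normal((N, 2))
w1 = W1[:, 0]
w2 = W1[:, 1] - (W1[:, 1] @ w1) / (w1 @ w1) * w1   # remove the w1 component
W1 = np.column_stack([w1, w2])

# White rejection-class input: E{X1 X1^T} = sigma^2 I with sigma = 1.
X1 = rng.standard_normal((N, Ns))
U = W1.T @ X1
C = U @ U.T / Ns       # sample estimate of E{u u^T}

# The off-diagonal term vanishes up to sampling error; the diagonal
# terms approach ||w1||^2 and ||w2||^2.
print(abs(C[0, 1]) / np.sqrt(C[0, 0] * C[1, 1]))
```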
The results of the training with this modification are shown in figure 32, which is the
resulting feature space as measured at $u_1$ and $u_2$. From this figure we can see that the
discriminant function, represented by the contour lines, is a nonlinear function of $u_1$ and $u_2$.
Furthermore, because the projections of the vehicles into the feature space do not lie on a
single curve (as in the previous experiment), the features represent different discrimination
information with regard to both the rejection and recognition classes. The bottom of
the figure, showing the projection of a random sampling of the test vehicles (all 1282
would be too dense for plotting), shows that both features are useful for separating vehicle 1
from vehicle 2. Examination of table 3 (column II in the nonlinear results) shows that at
the two detection probabilities of interest improved false alarm performance has been
obtained. Figure 33 shows the ROC curve for the resulting filter. It is evident that the
nonlinear filter is a uniformly better test for classification.
Figure 32. Experiment II: Resulting feature space when orthogonality is imposed
on the input layer of the MLP. In the top figure, squares indicate the
recognition class training exemplars, triangles indicate white noise
rejection class exemplars, and plus signs are the images of vehicle 1a
not used for training. In the bottom figure, squares are the peak
responses from vehicles 1b and 1c, triangles are the peak responses
from vehicles 2a and 2b.
Figure 33. Experiment II: Resulting ROC curve with orthogonality constraint.
Convinced that the filter represents a better test for classification than the linear filter,
we now examine the results for the other features of interest. Figure 34 shows the output
response of this filter for one of the images. As seen in the figure, a noticeable peak at the
center of the output plane has been achieved. This shows that the filter maintains the
localization properties of the linear filter.
In this way, the characterization of the rejection class by its second-order statistics, the
addition of the orthogonality constraint at the input layer to the MLP, and the use of a
nonlinear topology have resulted in a superior classification test.
4.6.3 Experiment III: subspace noise training
The next experiment describes an additional modification to this technique. One of
the issues in training nonlinear systems is the convergence time. Training methods which
require overly long training times are not of much practical use.

Figure 34. Experiment II: Output response to an image from the recognition class
training set.

We have already shown how to reduce the training complexity by recognizing that we can
sufficiently describe the rejection class with white noise sequences. We now show a more
compact description of the rejection class which leads to shorter convergence times, as
demonstrated empirically.
This description relies on the well known singular value decomposition (SVD).
We view the random white sequences as stochastic probes of the performance surface
in the whitened image space. The classifier discriminant function is, of course, not
determined by the rejection class alone. It is also affected by the recognition class. We have
shown previously that the white noise sequences enable us to probe the input space more
efficiently than examining all shifts of the recognition exemplars. However, we are still
searching a space of dimension equal to the image size, $N_1N_2$.
One of the underlying premises of a data-driven approach is that the information about
a class is conveyed through exemplars. In this case the recognition class is represented by
$N_t < N_1N_2$ exemplars placed in the data matrix $x_2 \in \Re^{N_1N_2 \times N_t}$. It is well known that
$x_2$, if it is full rank, can be decomposed with the SVD as
$$x_2 = U\Lambda V^T, \qquad (53)$$
where the columns of $U \in \Re^{N_1N_2 \times N_t}$ are an orthonormal basis that spans the column space
of the data matrix, $\Lambda$ is the diagonal matrix of singular values, and $V$ is an orthogonal
matrix. This decomposition has many well-known properties, including compactness of
representation for the columns of the data matrix [Gerbrands, 1981]. Indeed, as has been
noted by Gheen [1990], the SDF can be written as a function of the SVD of the data matrix:
$$h_{SDF} = U\Lambda^{-1}V^T d. \qquad (54)$$
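Equation 54 can be verified numerically against the classical SDF solution $h = x_2(x_2^T x_2)^{-1}d$; the sketch below uses illustrative sizes and random data:

```python
import numpy as np

rng = np.random.default_rng(5)
N, Nt = 100, 6                        # illustrative sizes

x2 = rng.standard_normal((N, Nt))     # full-rank data matrix
d = np.ones(Nt)

# Thin SVD: x2 = U diag(s) V^T  (equation 53).
U, s, Vt = np.linalg.svd(x2, full_matrices=False)

# Equation 54: h_SDF = U Lambda^-1 V^T d.
h_svd = U @ np.diag(1.0 / s) @ Vt @ d

# The classical SDF solution h = x2 (x2^T x2)^-1 d gives the same filter.
h_direct = x2 @ np.linalg.solve(x2.T @ x2, d)
print(np.allclose(h_svd, h_direct))  # True
```

The equivalence follows by substituting $x_2 = U\Lambda V^T$ into $x_2(x_2^Tx_2)^{-1}d$ and cancelling $V^TV = I$.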
We will use this recognition class representation to further refine our description of the
rejection class for training. As we stated, the underlying assumption in a data-driven
method is that the data matrix $x_2$ conveys information about the recognition class; any
information about the recognition class outside the space of the data matrix is not attainable
from this perspective. The information certainly exists, but there is no mechanism by
which to include it in the determination of the discriminant function within this framework.
This does, however, lead to a more efficient description of the rejection class. We can
modify our optimization criterion to reduce the response to white sequences as they are
projected into the $N_t$-dimensional subspace of the data matrix. Effectively this reduces the
search for a discriminant function in an $N_1N_2$-dimensional space to an $N_t$-dimensional
subspace.
The adaptation scheme of backpropagation allows a simple mechanism to implement
this constraint. The adaptation of matrix $W_1$ at iteration $k$ can be written as
$$W_1(k+1) = W_1(k) + x_i(k)\varepsilon_1^T(k), \qquad (55)$$
where $\varepsilon_1$ is a column vector derived from the backpropagated error and $x_i(k)$ is the
current input exemplar from either class presented to the network which, by design, lies in the
subspace spanned by the columns of $U$. From equation 55, if the rejection class noise
exemplars are restricted to lie in the data space of $x_2$, which can be achieved by projecting
random vectors of size $N_t$ onto the matrix $U$ above, and $W_1$ is initialized to be a random
projection from this space, then we are assured that the columns of $W_1$ only extract
information from the data space of $x_2$. This is because the columns of $W_1$ will only be
constructed from vectors which lie in the column space of $U$, and so will be orthogonal to
any vector component outside that space.
The search for a discriminant function is now reduced from an $N_1N_2$-dimensional
space to a search within an $N_t$-dimensional subspace. Due to the dimensionality
reduction achieved, we would expect the convergence time to be reduced.
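A sketch of the subspace restriction (sizes illustrative): rejection-class noise is drawn in the $N_t$-dimensional coefficient space and mapped through the basis $U$, so every noise exemplar lies in the column space of the data matrix:

```python
import numpy as np

rng = np.random.default_rng(6)
N, Nt, Ns = 100, 6, 50                # illustrative sizes

x2 = rng.standard_normal((N, Nt))     # recognition-class data matrix
U, _, _ = np.linalg.svd(x2, full_matrices=False)   # basis for its column space

# Subspace noise: draw n in R^{Nt} and map it through the basis U.
n = rng.standard_normal((Nt, Ns))
x_rej = U @ n

# Every noise exemplar lies in the column space of x2: projecting onto
# that space leaves it unchanged.
proj = U @ (U.T @ x_rej)
print(np.allclose(proj, x_rej))  # True
```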
This is the method that was used for the third experiment. Rejection class noise
exemplars were generated by projecting a random vector, $n \in \Re^{N_t \times 1}$, onto the basis $U$ by
$x_{rej} = Un$. In figure 35 the resulting discriminant function is shown as in the previous
experiments, and the result is similar to experiment II. The classifier performance as
measured in table 3 and the ROC curve of figure 36 are also nominally the same.
Figure 35. Experiment III: Resulting feature space when the subspace noise is
used for training. Symbols represent the same data as in the previous
case.
There are, however, two notable differences. Examination of figure 37 shows that the
output response to shifted images is even lower, allowing for better localization. This
condition was found to be the case throughout the data set.

Figure 36. Experiment III: Resulting ROC curve for subspace noise training.

Of more significance is the result
shown in figure 38, in which we compare the learning curves of all of the experiments
presented here. In this figure the dashed and dashed-dot lines are the learning curves for
experiments II and III, respectively. In this case the convergence rate was increased
nominally by a factor of three, from 100 epochs to approximately 30 epochs. Here an epoch
represents one pass through all of the training data.
4.6.4 Experiment IV: convex hull approach
In this experiment we present a technique which borrows from the ideas of Kumar et
al. [1994]. That approach designed an SDF which rejects images which are away from the
Figure 37. Experiment III: Output response to an image from the recognition
class training set.
Figure 38. Learning curves for three methods. Experiment II: white noise
training (dashed line). Experiment III: subspace noise (dashed-dot
line). Experiment IV: subspace noise plus convex hull exemplars
(solid line).