NONLINEAR EXTENSIONS TO THE MINIMUM AVERAGE CORRELATION ENERGY FILTER

By

JOHN W. FISHER III

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
1997

ACKNOWLEDGEMENTS

There are many people I would like to acknowledge for their help in the genesis of this manuscript. I would begin with my family for their constant encouragement and support. I am grateful to the Electronic Communications Laboratory and the Army Research Laboratory for their support of the research at the ECL. I was fortunate to work with very talented people, Marion Bartlett, Jim Bevington, and Jim Kurtz, in the areas of ATR and coherent radar systems. In particular, I cannot overstate the influence that Marion Bartlett has had on my perspective of engineering problems. I would also like to thank Jeff Sichina of the Army Research Laboratory for providing many interesting problems, perhaps too interesting, in the field of radar and ATR. A large part of who I am technically has been shaped by these people. I would, of course, like to acknowledge my advisor, Dr. Jose Principe, for providing me with an invaluable environment for the study of nonlinear systems and excellent guidance throughout the development of this thesis. His influence will leave a lasting impression on me. I would also like to thank DARPA; funding from this institution enabled a great deal of the research that went into this thesis. I would also like to thank Drs. David Casasent and Paul Viola for taking an interest in my work and offering helpful advice. I would also like to thank the students, past and present, of the Computational NeuroEngineering Laboratory.
The list includes, but is not limited to, Chuan Wang for useful discussions on information theory, Neil Euliano for providing much needed recreational opportunities and intramural championship t-shirts, and Andy Mitchell for being a good friend to go to lunch with, who suffered long inane technical discussions, and who is now a better climber than me. There are certainly others, and I am grateful to all. Finally, I would like to thank my wife, Anita, for enduring a seemingly endless ordeal, for allowing me to use every ounce of her patience, and for sacrificing some of her best years so that I could finish this Ph.D. I hope it has been worth it.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Motivation

2 BACKGROUND
  2.1 Discussion of Distortion Invariant Filters
    2.1.1 Synthetic Discriminant Function
    2.1.2 Minimum Variance Synthetic Discriminant Function
    2.1.3 Minimum Average Correlation Energy Filter
    2.1.4 Optimal Tradeoff Synthetic Discriminant Function
  2.2 Preprocessor/SDF Decomposition

3 THE MACE FILTER AS AN ASSOCIATIVE MEMORY
  3.1 Linear Systems as Classifiers
  3.2 MSE Criterion as a Proxy for Classification Performance
    3.2.1 Unrestricted Functional Mappings
    3.2.2 Parameterized Functional Mappings
    3.2.3 Finite Data Sets
  3.3 Derivation of the MACE Filter
    3.3.1 Preprocessor/SDF Decomposition
  3.4 Associative Memory Perspective
  3.5 Comments

4 STOCHASTIC APPROACH TO TRAINING NONLINEAR SYNTHETIC DISCRIMINANT FUNCTIONS
  4.1 Nonlinear Iterative Approach
  4.2 A Proposed Nonlinear Architecture
    4.2.1 Shift Invariance of the Proposed Nonlinear Architecture
  4.3 Classifier Performance and Measures of Generalization
  4.4 Statistical Characterization of the Rejection Class
    4.4.1 The Linear Solution as a Special Case
    4.4.2 Nonlinear Mappings
  4.5 Efficient Representation of the Rejection Class
  4.6 Experimental Results
    4.6.1 Experiment I: noise training
    4.6.2 Experiment II: noise training with an orthogonalization constraint
    4.6.3 Experiment III: subspace noise training
    4.6.4 Experiment IV: convex hull approach

5 INFORMATION-THEORETIC FEATURE EXTRACTION
  5.1 Introduction
  5.2 Motivation for Feature Extraction
  5.3 Information Theoretic Background
    5.3.1 Mutual Information as a Self-Organizing Principle
    5.3.2 Mutual Information as a Criterion for Feature Extraction
    5.3.3 Prior Work in Information Theoretic Neural Processing
    5.3.4 Nonparametric PDF Estimation
  5.4 Derivation of the Learning Algorithm
  5.5 Gaussian Kernels
  5.6 Maximum Entropy/PCA: An Empirical Comparison
  5.7 Maximum Entropy: ISAR Experiment
    5.7.1 Maximum Entropy: Single Vehicle Class
    5.7.2 Maximum Entropy: Two Vehicle Classes
  5.8 Computational Simplification of the Algorithm
  5.9 Conversion of Implicit Error Direction to an Explicit Error
    5.9.1 Entropy Minimization as Attraction to a Point
    5.9.2 Entropy Maximization as Diffusion
    5.9.3 Stopping Criterion
  5.10 Observations
  5.11 Mutual Information Applied to the Nonlinear MACE Filters

6 CONCLUSIONS

APPENDIX A: DERIVATIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF FIGURES

1 ISAR images of two vehicle types.
2 MSF peak output response of training vehicle 1a over all aspect angles.
3 MSF peak output response of testing vehicles 1b and 2a over all aspect angles.
4 MSF output image plane response.
5 SDF peak output response of training vehicle 1a over all aspect angles.
6 SDF peak output response of testing vehicles 1b and 2a over all aspect angles.
7 SDF output image plane response.
8 MACE filter output image plane response.
9 MACE peak output response of vehicles 1a, 1b and 2a over all aspect angles.
10 Example of a typical OTSDF performance plot.
11 OTSDF filter output image plane response.
12 OTSDF peak output response of vehicle 1a over all aspect angles.
13 OTSDF peak output response of vehicles 1b and 2a over all aspect angles.
14 Decomposition of distortion invariant filter in space domain.
15 Adaline architecture.
16 Decomposition of MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars) followed by a synthetic discriminant function.
17 Decomposition of MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars) followed by a linear associative memory.
18 Peak output response over all aspects of vehicle 1a when the data matrix is not full rank.
19 Output correlation surface for LMS computed filter from non full rank data.
20 Learning curve for LMS approach.
21 NMSE between closed form solution and iterative solution.
22 Decomposition of optimized correlator as a preprocessor followed by SDF/LAM (top). Nonlinear variation shown with MLP replacing SDF in signal flow (middle), detail of the MLP (bottom). The linear transformation represents the space domain equivalent of the spectral preprocessor.
23 ISAR images of two vehicle types shown at aspect angles of 5, 45, and 85 degrees respectively.
24 Generalization as measured by the minimum peak response.
25 Generalization as measured by the peak response mean square error.
26 Comparison of ROC curves.
27 ROC performance measures versus ...
28 Peak output response of linear and nonlinear filters over the training set.
29 Output response of linear filter (top) and nonlinear filter (bottom).
30 ROC curves for linear filter (solid line) versus nonlinear filter (dashed line).
31 Experiment I: Resulting feature space from simple noise training.
32 Experiment II: Resulting feature space when orthogonality is imposed on the input layer of the MLP.
33 Experiment II: Resulting ROC curve with orthogonality constraint.
34 Experiment II: Output response to an image from the recognition class training set.
35 Experiment III: Resulting feature space when subspace noise is used for training.
36 Experiment III: Resulting ROC curve for subspace noise training.
37 Experiment III: Output response to an image from the recognition class training set.
38 Learning curves for three methods.
39 Experiment IV: Resulting feature space from convex hull training.
40 Experiment IV: Resulting ROC curve with convex hull approach.
41 Classical pattern classification decomposition.
42 Decomposition of NL-MACE as a cascade of feature extraction followed by discrimination.
43 Mutual information approach to feature extraction.
44 Mapping as feature extraction. Information content is measured in the low dimensional space of the observed output.
45 A signal flow diagram of the learning algorithm.
46 Gradient of two-dimensional gaussian kernel. The kernels act as attractors to low points in the observed PDF on the data when entropy maximization is desired.
47 Mixture of gaussians example.
48 Mixture of gaussians example, entropy minimization and maximization.
49 PCA vs. entropy, gaussian case.
50 PCA vs. entropy, non-gaussian case.
51 PCA vs. entropy, non-gaussian case.
52 Example ISAR images from two vehicles used for experiments.
53 Single vehicle experiment, 100 iterations.
54 Single vehicle experiment, 200 iterations.
55 Single vehicle experiment, 300 iterations.
56 Two vehicle experiment.
57 Two dimensional attractor functions.
58 Two dimensional regulating function.
59 Magnitude of the regulating function.
60 Approximation of the regulating function.
61 Feedback functions for implicit error term.
62 Entropy minimization as local attraction.
63 Entropy maximization as diffusion.
64 Stopping criterion.
65 Mutual information feature space.
66 ROC curves for mutual information feature extraction (dotted line) versus linear MACE filter (solid line).
67 Mutual information feature space resulting from convex hull exemplars.
68 ROC curves for mutual information feature extraction (dotted line) versus linear MACE filter (solid line).

LIST OF TABLES

1 Classifier performance measures when the filter is determined by either of the common measures of generalization, as compared to best classifier performance, for two values of ...
2 Correlation of generalization measures to classifier performance. In both cases (... equal to 0.5 or 0.95) the classifier performance, as measured by the area of the ROC curve or Pfa at Pd equal to 0.8, has the opposite correlation to what would be expected of a useful measure for predicting performance.
3 Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear filter versus four different types of nonlinear training. N: white noise training, GS: Gram-Schmidt orthogonalization, subN: PCA subspace noise, CH: convex hull rejection class.
4 Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear filter versus experiments III and IV from section 4.6 and mutual information feature extraction. The symbols indicate the type of rejection class exemplars used. N: white noise training, GS: Gram-Schmidt orthogonalization, subN: PCA subspace noise, CH: convex hull rejection class.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

NONLINEAR EXTENSIONS TO THE MINIMUM AVERAGE CORRELATION ENERGY FILTER

By John W. Fisher III
May 1997

Chairman: Dr. Jose C. Principe
Major Department: Electrical and Computer Engineering

The major goal of this research is to develop efficient methods by which the family of distortion invariant filters, specifically the minimum average correlation energy (MACE) filter, can be extended to a general nonlinear signal processing framework. The primary application of MACE filters has been to pattern classification of images.
Two desirable qualities of MACE-type correlators are ease of implementation via correlation and analytic computation of the filter coefficients. Our motivation for exploring nonlinear extensions to these filters is the well known limitations of the linear systems approach to classification. Among these limitations is the attempt to solve the classification problem in a signal representation space, whereas the classification problem is more properly solved in a decision or probability space. An additional limitation of the MACE filter is that it can only be used to realize a linear decision surface, regardless of the means by which it is computed. These limitations lead to suboptimal classification and discrimination performance. Extension to nonlinear signal processing is not without cost. Solutions must in general be computed iteratively. Our approach was motivated by the early proof that the MACE filter is equivalent to the linear associative memory (LAM). The associative memory perspective is more properly associated with the classification problem and has been developed extensively in an iterative framework. In this thesis we demonstrate a method emphasizing a statistical perspective of the MACE filter optimization criterion. Through the statistical perspective, efficient methods of representing the rejection and recognition classes are derived. This, in turn, enables a machine learning approach and the synthesis of more powerful nonlinear discriminant functions which maintain the desirable properties of the linear MACE filter, namely, localized detection and shift invariance. We also present a new information theoretic approach to training in a self-organized or supervised manner. Information theoretic signal processing looks beyond the second order statistical characterization inherent in the linear systems approach. The information theoretic framework probes the probability space of the signal under analysis.
This technique has wide application beyond nonlinear MACE filter techniques and represents a powerful new advance in the area of information theoretic signal processing. Empirical results, comparing the classical linear methodology to the nonlinear extensions, are presented using inverse synthetic aperture radar (ISAR) imagery. The results demonstrate the superior classification performance of the nonlinear MACE filter.

CHAPTER 1
INTRODUCTION

1.1 Motivation

Automatic target detection and recognition (ATD/R) is a field of pattern recognition. The goal of an ATD/R system is to quickly and automatically detect and classify objects which may be present within large amounts of data (typically imagery) with a minimum of human intervention. In an ATD/R system, it is not only desirable to recognize various targets, but to locate them with some degree of accuracy. The minimum average correlation energy (MACE) filter [Mahalanobis et al., 1987] is of interest to the ATD/R problem due to its localization and discrimination properties. The MACE filter is a member of a family of correlation filters derived from the synthetic discriminant function (SDF) [Hester and Casasent, 1980]. The SDF and its variants have been widely applied to the ATD/R problem. We will describe synthetic discriminant functions in more detail in chapter 2. Other generalizations of the SDF include the minimum variance synthetic discriminant function (MVSDF) [Kumar, 1986], the MACE filter, and more recently the gaussian minimum average correlation energy (GMACE) [Casasent et al., 1991] and the minimum noise and correlation energy (MINACE) [Ravichandran and Casasent, 1992] filters. This area of filter design is commonly referred to as distortion-invariant filtering. It is a generalization of matched spatial filtering for the detection of a single object to the detection of a class of objects, usually in the image domain. Typically the object class is represented by a set of exemplars.
The exemplar images represent the image class through a range of "distortions" such as a variation in viewing aspect of a single object. The goal is to design a single filter which will recognize an object class through the entire range of distortion. Under the design criterion the filter is equally matched to the entire range of distortion, as opposed to a single viewpoint as in a matched filter. Hence the nomenclature distortion-invariant filtering [Kumar, 1992]. The bulk of the research using these types of filters has focused on optical and infrared (IR) imagery and overcoming recognition problems in the presence of distortions associated with 3-D to 2-D mappings, e.g. scale and rotation (in-plane and out-of-plane). Recently, however, this technique has been applied to radar imagery [Novak et al., 1994; Fisher and Principe, 1995a; Chiang et al., 1995]. In contrast to optical or infrared imagery, the scale of each pixel within a radar image is usually constant and known. Consequently, radar imagery does not suffer from scale distortions of objects. In the family of distortion invariant filters, the MACE filter has been shown to possess superior discrimination properties [Mahalanobis et al., 1987; Casasent and Ravichandran, 1992]. It is for this reason that this work emphasizes nonlinear extensions to the MACE filter. The MACE filter and its variants are designed to produce a narrow, constrained amplitude peak response when the filter mask is centered on a target in the recognition class, while minimizing the energy in the rest of the output plane. This property provides desirable localization for detection. Another property of the MACE filter is that it is less susceptible to out-of-class false alarms [Mahalanobis et al., 1987].
While the focus of this work will be on the MACE filter criterion, it should be stated that all of the results presented here are equally applicable to any of the distortion invariant filters mentioned above, with appropriate changes to the respective optimization criteria. Although the MACE filter does have superior false alarm properties, it also has some fundamental limitations. Since it is a linear filter, it can only be used to realize linear decision surfaces. It has also been shown to be limited in its ability to generalize to exemplars that are in the recognition class (but not in the training set) while simultaneously rejecting out-of-class inputs [Casasent and Ravichandran, 1992; Casasent et al., 1991]. The number of design exemplars can be increased in order to overcome generalization problems; however, the calculation of the filter coefficients becomes computationally prohibitive and numerically unstable as the number of design exemplars is increased [Kumar, 1992]. The MINACE and GMACE variations have improved generalization properties with a slight degradation in the average output plane variance [Ravichandran and Casasent, 1992] and sharpness of the central peak [Casasent et al., 1991], respectively. This research presents a basis by which the MACE filter, and by extension all linear distortion invariant filters, can be extended to a more general nonlinear signal processing framework. In the development it is shown that the performance of the linear MACE filter can be improved upon in terms of generalization while maintaining its desirable properties, i.e. a sharp, constrained peak at the center of the output plane. A more detailed description of the developmental progression of distortion invariant filtering is given in chapter 2. In that chapter a qualitative comparison of the various distortion invariant filters is presented using inverse synthetic aperture radar (ISAR) imagery.
The application of pattern recognition techniques to high-resolution radar imagery has become a topic of great interest recently with the advent of widely available instrumentation grade imaging radars. High resolution radar imagery poses a special challenge to distortion invariant filtering in that distortions such as a rotation in aspect of an object do not manifest themselves as rotations within the radar image (as opposed to optical imagery). In this case the distortion is not purely geometric, but more abstract. Chapter 3 presents a derivation of the MACE filter as a special case of Kohonen's linear associative memory [1988]. This relationship is important in that the associative memory perspective is the starting point for developing nonlinear extensions to the MACE filter. In chapter 4 the basis upon which the MACE filter can be extended to nonlinear adaptive systems is developed. In this chapter a nonlinear architecture is proposed for the extension of the MACE filter. A statistical perspective of the MACE filter is discussed which leads naturally into a class representational viewpoint of the optimization criterion of distortion invariant filters. Commonly used measures of generalization for distortion invariant filtering are also discussed. The results of the experiments presented show that the measures are not appropriate for the task of classification. It is interesting to note that the analysis indicates the appropriateness of the measures is independent of whether the mapping is linear or nonlinear. The analysis also discusses the merit of the MACE filter optimization criterion in the context of classification and with regard to measures of generalization. The chapter concludes with a series of experiments further refining the techniques by which nonlinear MACE filters are computed. Chapter 5 presents a new information theoretic method for feature extraction.
An information theoretic approach is motivated by the observation that the optimization criterion of the MACE filter only considers the second-order statistics of the rejection class. The information theoretic approach, however, operates in probability space, exploiting properties of the underlying probability density function. The method enables the extraction of statistically independent features. The method has wide application beyond nonlinear extensions to MACE filters and as such represents a powerful new technique for information theoretic signal processing. A review of information theoretic approaches to signal processing is presented in this chapter. This is followed by the derivation of the new technique, as well as some general experimental results which are not specifically related to nonlinear MACE filters but which serve to illustrate the potential of this method. Finally, the logical placement of this method within nonlinear MACE filters is presented along with experimental results. In chapter 6 we review the significant results and contributions of this dissertation. We also discuss possible lines of research resulting from the base established here.

CHAPTER 2
BACKGROUND

2.1 Discussion of Distortion Invariant Filters

As stated, distortion invariant filtering is a generalization of matched spatial filtering. It is well known that the matched filter maximizes the peak-signal-to-average-noise power ratio, as measured at the filter output at a specific sample location, when the input signal is corrupted by additive white noise. In the discrete signal case the design of a matched filter is equivalent to the following vector optimization problem [Kumar, 1986]:
This notation is also suitable for Ndimensional signal processing as long as the signal and filter have finite support and are reordered in the same lexico graphic manner (e.g. by row or column in the twodimensional case) into column vectors. The optimal solution to this problem is h = x(xtx) d. Given this solution we can calculate the peak output signal power as = (xth)2 = (xtx(xtx)ld)2 = d2 and the average output noise power due to an additive white noise input o = E{htnnth} = htEnh = o2hth = aYd2(xt,)1 where is an ao2 is the input noise variance. Resulting in a peaksignaltoaveragenoise output power ratio of ( )9 d2 oF a,2d2(xtx)I (xtx) 2 As we can see, the result is independent of the choice of scalar, d. If d is set to unity, the result is a normalized matched spatial filter.[Vander Lugt, 1964] In order to further motivate the concept of distortion invariant filtering, a typical ATR example problem will be used for illustration. This experiment will also help to illustrate the genesis of the various types of distortion invariant filtering approaches beginning with the matched spatial filter (MSF). Inverse synthetic aperture radar (ISAR) imagery will be used for all of the experiments presented herein. The distortion invariant filtering; however, is not limited to ISAR imag ery and in fact can be extended to much more abstract data types. ISAR images are shown in figure 1. In the figure, three vehicles are displayed, each at three different radar viewing aspect angles (5, 45, and 85 degrees), where the aspect angle is the direction of the front of the vehicle relative to the radar antenna. The image dimensions are 64 x 64 pixels. Radar systems measure a quantity called radar cross section (RCS). When a radar transmits an electromagnetic pulse, some of the incident energy on an object is reflected back to the radar. RCS is a measure of the reflected energy detected by the radar's receiving antenna. 
ISAR imagery is the result of a radar signal processing technique which uses multiple detected radar returns measured over a range of relative object aspect angles. Each pixel in an ISAR image is a measure of the aggregate radar cross section at regularly sampled points in space. Two types of vehicles are shown. Vehicle type 1 will represent a recognition class, while vehicle type 2 will represent a confusion class. The goal is to compute a filter which will recognize vehicle type 1 without being confused by vehicle 2. Images of vehicle 1a will be used to compute the filter coefficients. Vehicles 1b and 2a represent an independent testing class. ISAR images of all three vehicles were formed in the aspect range of 5 to 85 degrees at 1 degree increments. As the MSF is derived from a single vehicle image, an image of vehicle 1a at 45 degrees (the midpoint of the aspect range) is used. The peak output response to an image is the maximum of the cross correlation function of the image with the MSF template. The peak output response over the entire aspect range of vehicle 1a is shown in figure 2. As can be seen in the figure, the filter matches at 45 degrees very well; however, as the aspect moves away from 45 degrees, the peak output response begins to degrade. Depending on the type of imagery as well as the vehicle, this degradation can become very severe.

Figure 1. ISAR images of two vehicle types. Vehicles are shown at aspect angles of 5, 45, and 85 degrees respectively. Two different vehicles of type 1 (a and b) are shown, while one vehicle of type 2 (a) is shown. Vehicle 1a is used as a training vehicle, while vehicle 1b is used as the testing vehicle for the recognition class. Vehicle 2a represents a confusion vehicle.

Figure 2. MSF peak output response of training vehicle 1a over all aspect angles.
Peak response degrades as aspect difference increases. The peak output responses of both vehicles in the testing set are shown in figure 3, overlaid on the training image response. In one sense the filter exhibits good generalization; that is, the peak response to vehicle 1b is much the same as a function of aspect as the peak response to vehicle 1a. However, the filter also "generalizes" equally well to vehicle 2a, which is undesirable. As a vehicle discrimination test (vehicle 1 from vehicle 2) the MSF fails.

Figure 3. MSF peak output response of testing vehicles 1b and 2a over all aspect angles. Responses are overlaid on training vehicle response. Filter responses to vehicles 1b (dashed line) and 2a (dash-dot) do not differ significantly.

The output image plane response to a single image of vehicle 1a is shown in figure 4. Refinements to the distortion invariant filter approach, namely the MACE filter, will show that the localization of this output response, as measured by the sharpness of the peak, can be improved significantly.

Figure 4. MSF output image plane response.

2.1.1 Synthetic Discriminant Function

The degradation evidenced in figures 2 and 3 was the primary motivation for the synthetic discriminant function (SDF) [Hester and Casasent, 1980]. A shortcoming of the MSF, from the standpoint of distortion invariant filtering, is that it is only optimum for a single image. One approach would be to design a bank of MSFs operating in parallel, matched to the distortion range. The typical ATR system, however, must recognize/discriminate multiple vehicle types, and so from an implementation standpoint alone parallel MSFs are an impractical choice. Hester and Casasent set out to design a single filter which could be matched to multiple images using the idea of superposition.
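The peak output response plotted in figures 2 and 3 is simply the maximum of a 2-D cross correlation between an image and the filter template. A sketch with synthetic images standing in for the ISAR data follows (numpy assumed; the correlation is computed circularly via the FFT, which is adequate for centered targets):

```python
import numpy as np

def peak_response(image, template):
    """Maximum of the 2-D cross correlation of an image with a filter
    template, computed via the FFT (circular correlation)."""
    corr = np.fft.ifft2(np.fft.fft2(image) * np.conj(np.fft.fft2(template))).real
    return corr.max()

rng = np.random.default_rng(1)
target = rng.standard_normal((64, 64))  # stand-in for the training vehicle image
other = rng.standard_normal((64, 64))   # stand-in for a different vehicle

# Normalized matched spatial filter built from the target image
msf = target / np.sqrt((target ** 2).sum())

# The peak response to the matching image is far larger than to the other image
assert peak_response(target, msf) > peak_response(other, msf)
```

With real imagery the two responses would be closer, which is exactly the discrimination failure of the MSF described above.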
This approach was possible due to the large number of coefficients (degrees of freedom) that typically constitute 2-D image templates. For historical reasons, specifically that the filters in question were synthesized optically using holographic techniques [Vander Lugt, 1964], it was hypothesized that such a filter could be synthesized from linear combinations of a set of exemplar images. The filter synthesis procedure consists of projecting the exemplar images onto an orthonormal basis (originally Gram-Schmidt orthogonalization was used to generate the basis). The next step is to determine the coefficients with which to linearly combine the basis vectors such that a desired response is obtained for each original image exemplar [Hester and Casasent, 1980].

The proposed synthesis procedure is a bit convoluted. It turns out that the choice of orthonormal basis is irrelevant: as long as the basis spans the space of the original exemplar images the result is always the same. The development of Kumar [1986] is more useful for depicting the SDF as a generalization of the matched filter (for the white noise case) to multiple signals. The SDF can be cast as the solution to the following optimization problem

    min h^t h  subject to  X^t h = d,    h in C^{N x 1}, X in C^{N x Nt}, d in C^{Nt x 1},

where X is a matrix whose Nt columns comprise a set of training images(1) we wish to detect, and d is a column vector of desired outputs (one for each of the training exemplars), typically set to all unity values for the recognition class. The images of the data matrix X comprise the range of distortion that the implemented filter is expected to encounter.

1. Since these filters have been applied primarily to 2-D images, signals will be referred to as images or exemplars from this point on. In the vector notation, all N1 x N2 images are reordered (by row or column) into N x 1 column vectors, where N = N1 N2.
It is assumed that Nt < N, so the problem formulation is a quadratic optimization subject to an underdetermined system of linear constraints. The optimal solution is

    h = X (X^t X)^{-1} d.

When there is only one training exemplar (Nt = 1) and d is unity, the SDF defaults to the normalized matched filter. Similar to the matched filter (white noise case), the SDF is the linear filter which minimizes the white noise response while satisfying the set of linear constraints over the training exemplars.

By way of example, the SDF technique is tested against the ISAR data as in the MSF case. Exemplar images from vehicle 1a were selected every 4 degrees of aspect from 5 to 85 degrees for a total of 21 exemplar images (i.e. Nt = 21). Figure 5 shows the peak output response over all aspects of the training vehicle (1a). As seen in the figure, the degradation as the aspect changes is removed. The MSF response has been overlaid to highlight the differences.

The peak output response over all exemplars in the testing set is shown in figure 6. From the perspective of peak response, the filter generalizes fairly well. However, as in the MSF case, the usefulness of the filter as a discriminant between vehicles 1 and 2 is clearly limited. Figure 7 shows the resulting output plane response when the SDF filter is correlated with a single image of vehicle 1a. The localization of the peak is similar to the MSF case.

Figure 5. SDF peak output response of training vehicle 1a over all aspect angles. The MSF response is also shown (dashed line). The degradation in the peak response has been corrected.

2.1.2 Minimum Variance Synthetic Discriminant Function

The SDF approach seemingly solved the problem of generalizing a matched filter to multiple images. However, the SDF has no built-in noise tolerance by design (except for the white noise case).
Furthermore, in practice, it would turn out that occasionally the noise response would be higher than the peak object response, depending on the type of imagery. As a result, detection by means of searching for correlation peaks was shown to be unreliable for some types of imagery, specifically imagery which contains recognition class images embedded in nonwhite noise [Kumar, 1992]. Kumar [1986] proposed a method by which noise tolerance could be built in to the filter design. This technique was termed the minimum variance synthetic discriminant function (MVSDF).

Figure 6. SDF peak output response of testing vehicles 1b and 2a over all aspect angles. The dashed line is vehicle 1b while the dash-dot line is vehicle 2a.

The MVSDF is the correlation filter which minimizes the output variance due to zero-mean input noise while satisfying the same linear constraints as the SDF. The output noise variance can be shown to be h^t Sigma_n h, where h is the vector of filter coefficients and Sigma_n is the covariance matrix of the noise [Kumar, 1986]. Mathematically the problem formulation is

    min h^t Sigma_n h  subject to  X^t h = d,    h in C^{N x 1}, X in C^{N x Nt}, Sigma_n in C^{N x N}, d in C^{Nt x 1},

Figure 7. SDF output image plane response.

with the optimal solution

    h = Sigma_n^{-1} X (X^t Sigma_n^{-1} X)^{-1} d.

In the case of white noise, the MVSDF is equivalent to the SDF. This technique has a significant numerical complexity issue, which is that the solution requires the inversion of an N x N matrix (Sigma_n) which for moderate image sizes (N = N1 N2) can be quite large and computationally prohibitive, unless simplifying assumptions can be made about its form (e.g. a diagonal or Toeplitz matrix). The MVSDF can be seen as a more general extension of the matched filter to multiple vector detection, as most signal processing definitions of the matched filter incorporate a noise power spectrum and do not assume the white noise case only.
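As a concrete sketch of the two solutions above, the following minimal NumPy example computes an SDF and an MVSDF and verifies their defining properties. The exemplars, dimensions, and the AR(1)-style Toeplitz noise covariance are hypothetical stand-ins, not the ISAR data used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nt = 64, 5                                 # N = N1*N2 pixels, Nt exemplars
X = rng.standard_normal((N, Nt))              # columns: vectorized training images
d = np.ones(Nt)                               # unity desired outputs

# SDF: minimum-norm filter satisfying the constraints X^t h = d.
h_sdf = X @ np.linalg.solve(X.T @ X, d)

# MVSDF: h = S^{-1} X (X^t S^{-1} X)^{-1} d for a noise covariance S
# (here an AR(1)-style Toeplitz matrix, purely for illustration).
S = 0.9 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
SiX = np.linalg.solve(S, X)                   # S^{-1} X without forming S^{-1}
h_mvsdf = SiX @ np.linalg.solve(X.T @ SiX, d)

# Both satisfy the linear constraints exactly over the training exemplars.
assert np.allclose(X.T @ h_sdf, d)
assert np.allclose(X.T @ h_mvsdf, d)

# With white noise (S = I) the MVSDF reduces to the SDF.
h_white = X @ np.linalg.solve(X.T @ X, d)
assert np.allclose(h_white, h_sdf)
```

Note that `np.linalg.solve` is used in place of explicit matrix inversion; for a general Sigma_n this is exactly the N x N solve whose cost the text identifies as the MVSDF's practical drawback.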
It is mentioned here because it is the first distortion invariant filtering technique to recognize the need to characterize a rejection class.

2.1.3 Minimum Average Correlation Energy Filter

The MVSDF (and the SDF) control the output of the filter at a single point in the output plane of the filter. In practice large sidelobes may be exhibited in the output plane, making detection difficult. These difficulties led Mahalanobis et al. [1987] to propose the minimum average correlation energy (MACE) filter. This development in distortion invariant filtering takes as its design goal control not only of the output point when the image is centered on the filter, but of the response of the entire output plane as well. Specifically it minimizes the average correlation energy of the output over the training exemplars subject to the same linear constraints as the MVSDF and SDF filters. The problem is formulated in the frequency domain using Parseval relationships. In the frequency domain, the formulation is

    min H^t D H  subject to  X^t H = d,    H in C^{N x 1}, X in C^{N x Nt}, D in C^{N x N}, d in C^{Nt x 1},

where D is a diagonal matrix whose diagonal elements are the coefficients of the average 2-D power spectrum of the training exemplars. The form of the quadratic criterion is derived using Parseval's relationship; a derivation is given in section A.1 of the appendix. The other terms, H and X, contain the 2-D DFT coefficients of the filter and training exemplars, respectively. The vector d is the same as in the MVSDF and SDF cases. The optimal solution, in the frequency domain, is

    H = D^{-1} X (X^t D^{-1} X)^{-1} d.    (1)

As in the MVSDF, the solution requires the inversion of an N x N matrix, but in this case the matrix D is diagonal and so its inversion is trivial. When the noise covariance matrix is estimated from observations of noise sequences (assuming wide-sense stationarity and ergodicity), the MVSDF can also be formulated in the frequency domain, and the complex matrix inversion is avoided.
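The frequency-domain solution of equation (1) can be sketched directly; a minimal NumPy illustration with hypothetical 8x8 exemplars (not the ISAR data), using the unitary 2-D DFT assumed in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
N1 = N2 = 8
Nt = 3
imgs = rng.standard_normal((Nt, N1, N2))     # hypothetical exemplar images
d = np.ones(Nt)

# Columns of X: unitary 2-D DFT coefficients of each exemplar, vectorized.
X = np.stack([np.fft.fft2(im, norm="ortho").ravel() for im in imgs], axis=1)

# Diagonal of D: the average power spectrum of the training exemplars.
Ddiag = np.mean(np.abs(X) ** 2, axis=1)

# H = D^{-1} X (X^t D^{-1} X)^{-1} d, with t the conjugate transpose;
# D is diagonal, so its "inversion" is elementwise division.
DiX = X / Ddiag[:, None]
H = DiX @ np.linalg.solve(X.conj().T @ DiX, d)

# The linear constraints hold exactly over the training exemplars.
assert np.allclose(X.conj().T @ H, d)
```

The elementwise division by `Ddiag` is precisely the trivial diagonal inversion noted above, in contrast to the full N x N solve required by a general MVSDF covariance.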
A derivation of this is given in appendix A; examination of equations (95), (96), and (97) shows that under the assumption that the noise class can be modeled as a stationary, ergodic random noise process, the solution of the MVSDF can be found in the spectral domain using the estimated power spectrum of the noise process and equation (1). In practice, the MACE filter performs better than the MVSDF with respect to rejecting out-of-class input images. The MACE filter, however, has been shown to have poor generalization properties; that is, images in the recognition class but not in the training exemplar set are not recognized.

A MACE filter was computed using the same exemplar images as in the SDF example. Figure 8 shows the resulting output image plane response for one image. As can be seen in the figure, the peak in the center is now highly localized. In fact it can be shown [Mahalanobis et al., 1987] that over the training exemplars (those used to compute the filter) the output peak will always be at the constraint location.

Generalization to between-aspect images, as mentioned, is a problem for the MACE filter. Figure 9 shows the peak output response over all aspect angles. As can be seen in the figure, the peak response degrades severely for aspects between the exemplars used to compute the filter. Furthermore, from a peak output response viewpoint, generalization to vehicle 1b is also worse. However, unlike the previous techniques, we now begin to see some separation between the two vehicle types as represented by their peak response.

Figure 8. MACE filter output image plane response.

2.1.4 Optimal Tradeoff Synthetic Discriminant Function

The final distortion invariant filtering technique which will be discussed here is the method proposed by Réfrégier and Figue [1991], known as the optimal tradeoff synthetic discriminant function (OTSDF). Suppose that the designer wishes to optimize over multiple quadratic optimization criteria (e.g.
average correlation energy and output noise variance) subject to the same set of equality constraints as in the previous distortion invariant filters. We can represent the individual optimization criteria by J_i = h^t Q_i h, where Q_i is an N x N symmetric, positive-definite matrix (e.g. Q_i = Sigma_n for the MVSDF optimization criterion). The OTSDF is a method by which a set of quadratic optimization criteria may be optimally traded off against each other; that is, one criterion can be minimized with minimum penalty to the rest.

Figure 9. MACE peak output response of vehicles 1a, 1b and 2a over all aspect angles. Degradation at between-aspect exemplars is evident. Generalization to the testing vehicles as measured by peak output response is also poorer. Vehicle 1a is the solid line, 1b is the dashed line and 2a is the dash-dot line.

The solution to all such filters can be characterized by the equation

    h = Q^{-1} X (X^t Q^{-1} X)^{-1} d,    (2)

where, assuming M different criteria,

    Q = sum_{i=1}^{M} lambda_i Q_i,    0 <= lambda_i,    sum_{i=1}^{M} lambda_i = 1.

The possible solutions, parameterized by lambda_i, define a performance bound which cannot be exceeded by any linear system with respect to the optimization criteria and the equality constraints. All such linear filters which optimally trade off a set of quadratic criteria are referred to as optimal tradeoff synthetic discriminant functions.

We may, for example, wish to trade off the MACE filter criterion against the MVSDF filter criterion. This presents the added difficulty that one criterion is specified in the space domain and the other in the spectral domain. If the noise is represented as zero-mean, stationary, and ergodic (if the covariance is to be estimated from samples) we can, as mentioned, transform the MVSDF criterion to the spectral domain. In this case the optimal filter has the frequency domain solution

    H = [lambda D_n + (1 - lambda) D_x]^{-1} X [X^t [lambda D_n + (1 - lambda) D_x]^{-1} X]^{-1} d
      = D_lambda^{-1} X [X^t D_lambda^{-1} X]^{-1} d,

where D_lambda = lambda D_n + (1 - lambda) D_x, 0 <= lambda <= 1, and D_n, D_x are diagonal matrices whose diagonal elements contain the estimated power spectrum coefficients of the noise class and the recognition class, respectively. The performance bound of such a filter would resemble figure 10, where all linear filters would fall in the darkened region and all optimal tradeoff filters would lie somewhere on the boundary.

Figure 10. Example of a typical OTSDF performance plot. This plot shows the tradeoff, hypothetically, between the ACE criterion and a noise variance criterion. The curved arrow on the performance bound indicates the direction of increasing lambda for the two-criterion case. The curve is bounded by the MACE and MVSDF results.

By way of example we again use the data from the MACE and SDF examples. In this case we will construct an OTSDF which trades off the MACE filter criterion against the SDF criterion. In order to transform the SDF to the spectral domain, we will assume that the noise class is zero-mean, stationary, white noise; the power spectrum is therefore flat. One of the issues in constructing an OTSDF is how to set the value of lambda, which represents the degree by which one criterion is emphasized over another. We will not address that issue here, but simply set the value to lambda = 0.95, indicating more emphasis on the MACE filter criterion.

The output plane response of the OTSDF is shown in figure 11. As compared to the MACE filter response, the output peak is not nearly as sharp, but still more localized than in the SDF case. The peak output response over the training vehicle for the OTSDF is compared to the MACE filter in figure 12. The degradation at between-aspect exemplars is less severe than for the MACE filter. The peak output response of vehicles 1b and 2a is shown in figure 13.

Figure 11. OTSDF filter output image plane response.
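The frequency-domain OTSDF solution above can be sketched in a few lines; a minimal NumPy illustration with hypothetical spectral-domain exemplars and a flat (white) noise spectrum, as in the example in the text:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nt = 64, 4
X = rng.standard_normal((N, Nt)) + 1j * rng.standard_normal((N, Nt))
d = np.ones(Nt)
Dx = np.mean(np.abs(X) ** 2, axis=1)     # average exemplar power spectrum
Dn = np.ones(N)                          # flat (white) noise power spectrum

def otsdf(lam):
    """H = D_lam^{-1} X (X^t D_lam^{-1} X)^{-1} d, D_lam = lam*Dn + (1-lam)*Dx."""
    Dlam = lam * Dn + (1 - lam) * Dx
    DiX = X / Dlam[:, None]              # diagonal inversion is elementwise
    return DiX @ np.linalg.solve(X.conj().T @ DiX, d)

H = otsdf(0.95)                          # the lambda value used in the text
assert np.allclose(X.conj().T @ H, d)    # constraints hold for any lambda
```

Sweeping `lam` over (0, 1) traces out the performance bound of figure 10: every value yields a filter that meets the equality constraints while weighting the two quadratic criteria differently.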
As compared to the MACE filter, the peak response is improved over the testing set. Separation between the two vehicle types appears to be maintained.

Figure 12. OTSDF peak output response of vehicle 1a over all aspect angles. Degradation at between-aspect exemplars is less than in the MACE filter, shown in dashed line.

Figure 13. OTSDF peak output response of vehicles 1b and 2a over all aspect angles. Generalization is better than in the MACE filter. Vehicle 1b is shown in dashed line, vehicle 2a is shown in dash-dot line.

2.2 Preprocessor/SDF Decomposition

In the sample domain, the SDF family of correlation filters is equivalent to a cascade of a linear preprocessor followed by a linear correlator [Mahalanobis et al., 1987; Kumar, 1992]. This is illustrated in figure 14 with vector operations. The preprocessor, in the case of the MACE filter, is a prewhitening filter computed on the basis of the average power spectrum of the recognition class training exemplars. In the case of the MVSDF the preprocessor is a prewhitening filter computed on the basis of the covariance matrix of the noise. The net result is that after preprocessing, the second processor is an SDF computed over the preprocessed exemplars.

The primary contribution of this research will be to extend the ideas of MACE filtering to a general nonlinear signal processing architecture and accompanying classification framework. These extensions will focus on processing structures which improve the generalization and discrimination properties while maintaining the shift-invariance and localization detection properties of the linear MACE filter.

Figure 14. Decomposition of a distortion invariant filter in the space domain: the preprocessor y = Ax followed by the SDF h = y(y^t y)^{-1} d. The notation used assumes that the image and filter coefficients have been reordered into vectors.
The input image vector, x, is preprocessed by the linear transformation y = Ax. The resulting vector is processed by a synthetic discriminant function, y_out = y^t h.

CHAPTER 3
THE MACE FILTER AS AN ASSOCIATIVE MEMORY

3.1 Linear Systems as Classifiers

In this chapter we present the MACE filter from the perspective of associative memories. This perspective is important because it leads to a machine-learning and classification framework and consequently a means by which to determine the parameters of a nonlinear mapping via gradient search techniques. We shall refer, herein, to the machine learning/gradient search methods as an iterative framework. The techniques are iterative in the sense that adaptations to the mapping parameters are computed sequentially and repeatedly over a set of exemplars. We shall show that the iterative and classification framework combined with a nonlinear system architecture has distinct advantages over the linear framework of distortion invariant filters.

As we have stated, distortion invariant filters can only realize linear discriminant functions. We begin, therefore, by considering linear systems used as classifiers. The adaline architecture [Widrow and Hoff, 1960], depicted in figure 15, is an example of a linear system used for pattern classification. A pattern, represented by the coefficients x_i, is applied to a linear combiner, represented by the weight coefficients w_i; the resulting output y is then applied to a hard limiter which assigns a class to the input pattern. Mathematically this can be represented by

    c = sgn(y - t_p) = sgn(w^T x - t_p),

where sgn( ) is the signum function, t_p is a threshold, and w, x in R^{N x 1} are column vectors containing the combiner weights and the pattern coefficients, respectively. In the context of classification, this architecture is trained iteratively using the least mean square (LMS) algorithm [Widrow and Hoff, 1960].
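The adaline/LMS training loop described above can be sketched on a toy problem; a minimal NumPy illustration with hypothetical, linearly separable 2-D data (the clusters, step size, and epoch count are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
X1 = rng.standard_normal((n, 2)) + np.array([2.0, 2.0])    # class +1 patterns
X2 = rng.standard_normal((n, 2)) + np.array([-2.0, -2.0])  # class -1 patterns
X = np.vstack([X1, X2])
dvals = np.hstack([np.ones(n), -np.ones(n)])               # desired outputs +/-1

w = np.zeros(2)
b = 0.0
mu = 0.01                                   # LMS step size
for epoch in range(20):
    for x, dv in zip(X, dvals):
        e = dv - (w @ x + b)                # error at the linear combiner output,
        w += mu * e * x                     # not the classification error
        b += mu * e

c = np.sign(X @ w + b)                      # hard limiter assigns the class
accuracy = np.mean(c == dvals)
assert accuracy > 0.95                      # separable data: near-perfect
```

Note that, exactly as observed in the text, the update is driven by the error at the combiner output, while the hard limiter is applied only afterward to produce the class decision.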
For a two-class problem the desired output, d in the figure, is set to +/-1 depending on the class of the input pattern; the LMS algorithm then minimizes the mean square error (MSE) between the classification output c and the desired output. Since the error function, e_c, can only take on the three values +/-2 and 0, minimization of the MSE is equivalent to minimizing the average number of actual classification errors.

There are several observations to be made about the adaline/LMS approach to classification. One observation is that the adaptation process described uses the error, e, as measured at the output of the linear combiner to drive the adaptation, and not the actual classification error, e_c. Another observation is that this approach presupposes that the pattern classes can be linearly separated. A final point, on which we will have more to say, is that the method uses the MSE criterion as a proxy for classification.

3.2 MSE Criterion as a Proxy for Classification Performance

As we have pointed out, the adaline/LMS approach to classification uses the MSE criterion to drive the adaptation process. It is the probability of misclassification (also called the Bayes criterion), however, with which we are truly concerned. We now discuss the consequence of using the MSE criterion as a proxy for classification performance. It is well known that the discriminant function that minimizes misclassification is monotonically related to the posterior probability distribution of the class, c, given the observation x [Fukunaga, 1990]. That is, for the two-class problem, if the discriminant function is

    f(x) = P2 p(x|C2) / p_x(x) = p(C2|x),    (3)

where P2 is the prior probability of class 2, p(x|C2) is the conditional probability distribution of x given class 2, and p_x(x) is the unconditional distribution of x, then the probability of misclassification will be minimized if the following decision rule is used:

    f(x) < 0.5  choose class 1
    f(x) > 0.5  choose class 2.    (4)

For the case of f(x) = 0.5, both classes are equally likely, so a guess must be made.
3.2.1 Unrestricted Functional Mappings

With regard to the adaline/LMS approach we now ask: what is the consequence of using the MSE criterion for computing discriminant functions? In the two-class case, the source distributions are p(x|C1) or p(x|C2) depending on whether the observation, x, is drawn from class 1 or class 2, respectively. If we assign a desired output of zero to class 1 and unity to class 2, then the MSE criterion is equivalent to the following:

    J(f) = (P1/2) E{f(x)^2 | C1} + (P2/2) E{(1 - f(x))^2 | C2},    (5)

where the 1/2 scale factors are for convenience, E{ } is the expectation operator, and Ci indicates class i. For now we place no constraints on the functional form of f(x). In so doing, we can solve for the optimal solution using the calculus of variations. In this case, we would like to find a stationary point of the criterion J(f) under small perturbations of the function f(x), indicated by

    delta J = J(f + delta f) - J(f) = 0.    (6)

The first term of (6) can be computed as

    J(f + delta f) = (P1/2) E{(f + delta f)^2 | C1} + (P2/2) E{(1 - f - delta f)^2 | C2}
                   = (P1/2) E{(f^2 + 2 f delta f) | C1} + (P2/2) E{((1 - f)^2 - 2 (1 - f) delta f) | C2} + O(delta f^2)    (7)
                   = J(f) + P1 E{f delta f | C1} - P2 E{(1 - f) delta f | C2} + O(delta f^2),

which can be substituted into (6) to yield

    delta J = P1 E{f delta f | C1} - P2 E{(1 - f) delta f | C2}
            = P1 ∫ f(x) delta f p(x|C1) dx - P2 ∫ (1 - f(x)) delta f p(x|C2) dx    (8)
            = ∫ [f(x) (P1 p(x|C1) + P2 p(x|C2)) - P2 p(x|C2)] delta f dx
            = ∫ [f(x) p_x(x) - P2 p(x|C2)] delta f dx,

where p_x(x) = P1 p(x|C1) + P2 p(x|C2) is the unconditional probability distribution of the random variable X. In order for f(x) to be a stationary point of J(f), equation (8) must be zero for any arbitrary perturbation delta f(x). Consequently

    f(x) p_x(x) - P2 p(x|C2) = 0    (9)

or

    f(x) = P2 p(x|C2) / p_x(x)
         = P2 p(x|C2) / (P1 p(x|C1) + P2 p(x|C2))    (10)
         = p(C2|x),

which is the posterior probability that the observation is drawn from class 2. If we had reversed the desired outputs, the result would have been the posterior probability that the observation was drawn from class 1.
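Equation (10) can be checked numerically. With a quantized input, the unrestricted MSE minimizer is simply the per-bin mean of the 0/1 targets, which should approach p(C2|x). A sketch assuming NumPy, with hypothetical unit-variance Gaussian class-conditional densities at -1 and +1 and equal priors (for which the posterior has the closed form 1/(1 + exp(-2x))):

```python
import numpy as np

rng = np.random.default_rng(5)
n1 = n2 = 50000                            # equal priors P1 = P2 = 0.5
x1 = rng.normal(-1.0, 1.0, n1)             # draws from p(x|C1)
x2 = rng.normal(+1.0, 1.0, n2)             # draws from p(x|C2)
x = np.concatenate([x1, x2])
t = np.concatenate([np.zeros(n1), np.ones(n2)])   # targets: 0 for C1, 1 for C2

bins = np.linspace(-4, 4, 41)
idx = np.digitize(x, bins)
# Unrestricted per-bin MSE minimizer: conditional mean of the 0/1 target.
f = np.array([t[idx == k].mean() for k in range(1, len(bins))])

# True posterior p(C2|x) at the bin centers for these two Gaussians.
centers = 0.5 * (bins[:-1] + bins[1:])
post = 1.0 / (1.0 + np.exp(-2.0 * centers))

max_err = np.max(np.abs(f - post))
assert max_err < 0.05                      # close agreement with equation (10)
```

This also previews the finite-data argument of section 3.2.3: with enough samples per bin, the empirical sums converge to the expectations that define the posterior.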
This result, predicated on our choice of desired outputs, shows that for arbitrary f(x) the MSE criterion is equivalent to the probability-of-misclassification criterion. In fact, it has been shown by Richard and Lippmann [1991] (using other means) for the multiclass case that if the desired outputs are encoded as vectors, e_i in R^{N x 1}, where the ith element is unity and the others are zero, then for an N-class problem the MSE criterion is equivalent to optimizing the Bayes criterion for classification.

3.2.2 Parameterized Functional Mappings

Suppose, however, that the function is not arbitrary, but is a function of a parameter set, alpha, as in f(x, alpha). The MSE criterion of (5) can be rewritten

    J(f) = (P1/2) E{f(x, alpha)^2 | C1} + (P2/2) E{(1 - f(x, alpha))^2 | C2}.    (11)

The gradient of the criterion with respect to the parameters becomes

    dJ/d(alpha) = P1 E{ f(x, alpha) (d/d alpha) f(x, alpha) | C1 } - P2 E{ (1 - f(x, alpha)) (d/d alpha) f(x, alpha) | C2 }    (12)

and consequently

    dJ/d(alpha) = P1 ∫ f(x, alpha) (d/d alpha) f(x, alpha) p(x|C1) dx - P2 ∫ (1 - f(x, alpha)) (d/d alpha) f(x, alpha) p(x|C2) dx
                = ∫ ( f(x, alpha) (P1 p(x|C1) + P2 p(x|C2)) - P2 p(x|C2) ) (d/d alpha) f(x, alpha) dx    (13)
                = ∫ ( f(x, alpha) p_x(x) - P2 p(x|C2) ) (d/d alpha) f(x, alpha) dx.

Examination of equation (13) allows two possibilities for a stationary point of the criterion. The first, as before, is that

    f(x, alpha) = P2 p(x|C2) / p_x(x) = p(C2|x),    (14)

while the second is that we are at a local minimum with respect to alpha. In other words, if the parameterized function can realize the Bayes discriminant function via an appropriate choice of its parameters, then this function represents a global minimum, but this does not discount the fact that there may be local minima. Furthermore, if the parameterized function is not capable of representing the Bayes discriminant function, there is no guarantee that the global (or local) minimum will result in robust classification.

3.2.3 Finite Data Sets

The previous development does not take into account that in an iterative framework we are working with observations of a random variable. Therefore, we rewrite the criterion of equation (5) as finite summations.
That is, the criterion becomes

    J(f(x, alpha)) = (P̂1/2) sum_{x_i in C1} f(x_i, alpha)^2 + (P̂2/2) sum_{x_i in C2} (1 - f(x_i, alpha))^2,    (15)

where x_i in Ci denotes the set of observations taken from class Ci. Taking the derivative of this criterion with respect to the parameters, alpha, yields

    dJ/d(alpha) = P̂1 sum_{x_i in C1} f(x_i, alpha) (d/d alpha) f(x_i, alpha) - P̂2 sum_{x_i in C2} (1 - f(x_i, alpha)) (d/d alpha) f(x_i, alpha).    (16)

It is assumed that the set of observations from class C1 (x_i in C1) are independent and identically distributed (i.i.d.), as are the set of observations from class C2 (x_i in C2), although with a different distribution than class C1. Since the summation terms are broken up by class, we can assume that the arguments of the summations (functions of distinct i.i.d. random variables) are themselves i.i.d. random variables [Papoulis, 1991]. If we set P̂1 N1 = P1 and P̂2 N2 = P2, where P1 and P2 are the prior probabilities of classes C1 and C2, respectively, and N1 and N2 are the number of samples drawn from each of the classes, we can use the law of large numbers to say that the summations of equation (16) approach their expected values. In other words, in the limit as N1, N2 -> infinity,

    dJ/d(alpha) = P1 E{ f(x, alpha) (d/d alpha) f(x, alpha) | C1 } - P2 E{ (1 - f(x, alpha)) (d/d alpha) f(x, alpha) | C2 },    (17)

which is identical to equation (12) and so yields the same solution for the mapping:

    f(x, alpha) = P2 p(x|C2) / p_x(x).    (18)

The conclusion is that if we have a sufficient number of observations to characterize the underlying distributions, then the MSE criterion is again equivalent to the Bayes criterion.

3.3 Derivation of the MACE Filter

We have already introduced the MACE filter in a previous section; here we present a derivation. The development is similar to the derivations given in Mahalanobis [1987] and Kumar [1992]. Our purpose in presenting the derivation is that it serves to illustrate the associative memory perspective of optimized correlators, a perspective which will be used to motivate the development of the nonlinear extensions presented in later sections.
In the original development, SDF-type filters were formulated using correlation operations, a convention which will be maintained here. The output, g(n1, n2), of a correlation filter is determined by

    g(n1, n2) = sum_{m1=0}^{N1-1} sum_{m2=0}^{N2-1} x*(n1 + m1, n2 + m2) h(m1, m2)
              = x*(n1, n2) ** h(n1, n2),

where x*(n1, n2) is the complex conjugate of an input image with N1 x N2 region of support, h(n1, n2) represents the filter coefficients, and ** represents the two-dimensional circular correlation operation [Oppenheim and Schafer, 1989].

The MACE filter formulation is as follows [Mahalanobis et al., 1987]. Given a set of image exemplars, {x_i in R^{N1 x N2}; i = 1...Nt}, we wish to find filter coefficients, h in R^{N1 x N2}, such that the average correlation energy at the output of the filter, defined as

    E_avg = (1/Nt) sum_{i=1}^{Nt} sum_{n1=0}^{N1-1} sum_{n2=0}^{N2-1} |g_i(n1, n2)|^2,    (19)

is minimized subject to the constraints

    g_i(0, 0) = sum_{m1=0}^{N1-1} sum_{m2=0}^{N2-1} x_i*(m1, m2) h(m1, m2) = d_i,  i = 1...Nt.    (20)

Mahalanobis [1987] reformulates this as a vector optimization in the spectral domain using Parseval's theorem. In the spectral domain we wish to find the elements of H in C^{N1 N2 x 1}, a column vector whose elements are the 2-D DFT coefficients of the space domain filter h reordered lexicographically. Let the columns of the data matrix X in C^{N1 N2 x Nt} contain the 2-D DFT coefficients of the exemplars {x1, ..., x_Nt}, also reordered into column vectors. The diagonal matrix D_i in R^{N1 N2 x N1 N2} contains the magnitude squared of the 2-D DFT coefficients of the ith exemplar. These matrices are averaged to form the diagonal matrix D as

    D = (1/Nt) sum_{i=1}^{Nt} D_i,    (21)

which then contains the average power spectrum of the training exemplars. Minimizing equation (19) subject to the constraints of equation (20) is equivalent to minimizing

    H^t D H,    (22)

subject to the linear constraints

    X^t H = d,    (23)

where the elements of d in R^{Nt x 1} are the desired outputs corresponding to the exemplars. The solution to this optimization problem can be found using the method of Lagrange multipliers.
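The Parseval step relating equations (19) and (22) can be checked numerically. A sketch assuming NumPy with hypothetical 8x8 exemplars; since NumPy's FFT pair is non-unitary, an explicit 1/(N1 N2) factor appears, in line with the footnote on DFT normalization below.

```python
import numpy as np

rng = np.random.default_rng(2)
N1 = N2 = 8
Nt = 3
imgs = rng.standard_normal((Nt, N1, N2))   # hypothetical exemplars
h = rng.standard_normal((N1, N2))          # arbitrary filter coefficients

# Space domain: circular correlation of each exemplar with h via the DFT
# correlation theorem, then the average output-plane energy of equation (19).
Hf = np.fft.fft2(h)
E_space = 0.0
for x in imgs:
    g = np.fft.ifft2(np.conj(np.fft.fft2(x)) * Hf)   # full output plane g_i
    E_space += np.sum(np.abs(g) ** 2) / Nt

# Frequency domain: the quadratic form H^t D H of equation (22), with D the
# average power spectrum and 1/(N1*N2) accounting for the FFT normalization.
Davg = np.mean(np.abs(np.fft.fft2(imgs, axes=(1, 2))) ** 2, axis=0)
E_freq = np.sum(np.abs(Hf) ** 2 * Davg) / (N1 * N2)

assert np.allclose(E_space, E_freq)        # average correlation energy matches
```

The agreement holds for any filter h, which is why minimizing the diagonal quadratic form in the frequency domain is equivalent to minimizing the average correlation energy in the space domain.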
In the spectral domain, the filter that satisfies the constraints of equation (20) and minimizes the criterion of equation (19) [Mahalanobis et al., 1987; Kumar, 1992] is

    H = D^{-1} X (X^t D^{-1} X)^{-1} d,    (24)

where H in C^{N1 N2 x 1} contains the 2-D DFT coefficients of the filter, assuming a unitary 2-D DFT.(1)

3.3.1 Preprocessor/SDF Decomposition

As observed by Mahalanobis [1987], the MACE filter can be decomposed as a synthetic discriminant function preceded by a prewhitening filter. Let the matrix B = D^{-1/2}, where B is diagonal with diagonal elements equal to the inverse of the square root of the diagonal elements of D. We implicitly assume that the diagonal elements of D are nonzero; consequently B^t B = D^{-1} and B^t = B. Equation (24) can then be rewritten as

    H = B (BX) ((BX)^t (BX))^{-1} d.    (25)

Substituting Y = BX, representing the original exemplars preprocessed in the spectral domain by the matrix B, equation (25) can be written

    H = B Y (Y^t Y)^{-1} d.    (26)

The term H' = Y (Y^t Y)^{-1} d is recognized as the SDF computed from the preprocessed exemplars Y. The MACE filter solution can therefore be written as a cascade of a prewhitener (over the average power spectrum of the exemplars) followed by a synthetic discriminant function, depicted in figure 16, as

    H = B H'.    (27)

1. If the DFT were defined as in [Oppenheim and Schafer, 1989] then a scale factor of N1 N2 would be necessary.

Figure 16. Decomposition of the MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars) followed by a synthetic discriminant function.

3.4 Associative Memory Perspective

Having presented the derivation of the MACE filter and the preprocessor/SDF decomposition, we now show that with a modification (the addition of a linear preprocessor), the MACE filter is a special case of Kohonen's linear associative memory [1988].
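The equivalence of equations (24) and (27) is easy to verify numerically; a minimal NumPy sketch with hypothetical spectral-domain exemplars (the data are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(7)
N, Nt = 64, 4
X = rng.standard_normal((N, Nt)) + 1j * rng.standard_normal((N, Nt))
d = np.ones(Nt)
Ddiag = np.mean(np.abs(X) ** 2, axis=1)    # diagonal of D

# Direct MACE solution, equation (24): H = D^{-1} X (X^t D^{-1} X)^{-1} d.
DiX = X / Ddiag[:, None]
H_direct = DiX @ np.linalg.solve(X.conj().T @ DiX, d)

# Cascade of equations (25)-(27): prewhiten with B = D^{-1/2}, then an SDF
# over the whitened exemplars Y = BX, and finally H = B H'.
B = 1.0 / np.sqrt(Ddiag)                   # diagonal of B
Y = B[:, None] * X
H_sdf = Y @ np.linalg.solve(Y.conj().T @ Y, d)   # H' = Y (Y^t Y)^{-1} d
H_cascade = B * H_sdf                            # H = B H'

assert np.allclose(H_direct, H_cascade)
```

The cascade form makes the structural claim of figure 16 concrete: all of the MACE-specific information enters through the diagonal prewhitener B, after which the synthesis step is an ordinary SDF.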
Associative memories [Kohonen, 1988] are general structures by which pattern vectors can be related to one another, typically in an input/output pairwise fashion. An input stimulus vector is presented to the associative memory structure, resulting in an output response vector. The input/output pairs establish the desired response to a given input. In the case of an autoassociative memory, the desired response is the stimulus vector, whereas in a heteroassociative memory the desired response is arbitrary. From a signal processing perspective, associative memories are viewed as projections [Kung, 1992], linear and nonlinear: the input patterns exist in a vector space and the associative memory projects them onto a new space. The linear associative memory of Kohonen [1988] is formulated exactly in this way.

A simple form of the linear heteroassociative memory maps vectors to scalars. It is formulated as follows. Given the set of input/output vector/scalar pairs

    {x_i in R^{N x 1}, d_i in R; i = 1...Nt},

which are placed into an input data matrix, x = [x1 ... x_Nt], and desired output vector, d = [d1 ... d_Nt]^t, find the vector h in R^{N x 1} such that

    x^t h = d.    (28)

If the system of equations described by (28) is underdetermined, the inner product

    h^t h    (29)

is minimized using (28) as a constraint. If the system of equations is overdetermined, (x^t h - d)^t (x^t h - d) is minimized. Here, we are interested in the underdetermined case, for which the optimal solution, using the pseudoinverse of x, is [Kohonen, 1988]

    h = x (x^t x)^{-1} d.    (30)

As was shown in [Fisher and Principe, 1994], we can modify the linear associative memory model slightly by adding a preprocessing linear transformation matrix, A, and find h such that the underdetermined system of equations

    (Ax)^t h = d    (31)

is satisfied while h^t h is minimized. As in the MACE filter, this optimization can be solved using the method of Lagrange multipliers.
We adjoin the system of constraints to the optimization criterion as

    J = h^t h + lambda^t ((Ax)^t h - d),    (32)

where lambda in R^{Nt x 1} is a column vector of Lagrange multipliers, one for each constraint (desired response). Taking the gradient of equation (32) with respect to the vector h yields

    dJ/dh = 2h + Ax lambda.    (33)

Setting the gradient to zero and solving for the vector h yields

    h = -(1/2) Ax lambda.    (34)

Substituting this result into the constraint equations of (31) and solving for the Lagrange multipliers yields

    lambda = -2 ((Ax)^t Ax)^{-1} d.    (35)

Substituting this result back into equation (34) yields the final solution to the optimization as

    h = Ax (x^t A^t A x)^{-1} d.    (36)

If the preprocessing transformation, A, is the space-domain equivalent of the MACE filter's spectral prewhitener, and the columns of the data matrix x contain the reordered elements of the images from the MACE filter problem, then equation (36) combined with the preprocessing transformation yields exactly the space domain coefficients of the MACE filter. This can be shown using a unitary discrete Fourier transform (DFT) matrix. If U in C^{N1 x N2} is the DFT of the image u in R^{N1 x N2}, we can reorder both U and u into column vectors, U in C^{N1 N2 x 1} and u in R^{N1 N2 x 1}, respectively. We can then implement the 2-D DFT as a unitary transformation matrix, Phi, such that

    U = Phi u,    u = Phi^t U,    Phi Phi^t = I.

In order for the transformation A to be the space domain equivalent of the spectral prewhitener of the MACE filter, the relationship

    Ax = Phi^t Y = Phi^t B X = Phi^t B Phi x,

where B is the same matrix as in equation (27), must be true, which, by inspection, means that

    A = Phi^t B Phi.    (37)
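Equation (36) can be verified numerically as the minimum-norm solution of the preprocessed constraint set; a small NumPy sketch with hypothetical data (random x, A, and d, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
N, Nt = 32, 4
x = rng.standard_normal((N, Nt))          # exemplar data matrix
A = rng.standard_normal((N, N))           # arbitrary preprocessing transform
d = rng.standard_normal(Nt)               # desired responses

Ax = A @ x
h = Ax @ np.linalg.solve(Ax.T @ Ax, d)    # equation (36)

# It satisfies the underdetermined constraints (Ax)^t h = d ...
assert np.allclose(Ax.T @ h, d)

# ... and coincides with the minimum-norm (pseudoinverse) solution, i.e. the
# Lagrange-multiplier result is exactly Kohonen's pseudoinverse memory on Ax.
h_pinv = np.linalg.pinv(Ax.T) @ d
assert np.allclose(h, h_pinv)
```

With A chosen as the space-domain prewhitener of equation (37), this same construction produces the MACE filter coefficients, as shown next.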
(37)

Substituting equation (37) into equation (36) and using the property B†B = BB = D^{-1} yields

h = Ax(x^T A^T Ax)^{-1} d
  = Φ†BΦx(x^T Φ†B†BΦx)^{-1} d (38)
  = Φ†BX(X†D^{-1}X)^{-1} d.

Combining this solution for h with the preprocessor in equation (31) for the equivalent linear system, h_sys, yields

h_sys = Ah = Φ†BΦΦ†BX(X†D^{-1}X)^{-1} d = Φ†B†BX(X†D^{-1}X)^{-1} d = Φ†D^{-1}X(X†D^{-1}X)^{-1} d.

Substituting the MACE filter solution, equation (24), gives the result

h_sys = Φ†H_MACE (39)

and so h_sys is the inverse DFT pair of the spectral-domain MACE filter. This result establishes the relationship between the MACE filter and the linear associative memory. The decomposition of the MACE filter of figure 16 can also be considered as a cascade of a linear preprocessor followed by a linear associative memory (LAM) as in figure 17.

Figure 17. Decomposition of MACE filter as a preprocessor (i.e. a prewhitening filter over the average power spectrum of the exemplars, y = Ax with A = Φ†D^{-1/2}Φ) followed by a linear associative memory (y_o = y^T h with h = y(y†y)^{-1}d).

Since the two are equivalent, why make the distinction between the two perspectives? There are several reasons. The development of distortion invariant filtering and associative memories has proceeded in parallel. Distortion invariant filtering has been concerned with finding projections which will essentially detect a set of images. Towards this goal the techniques have emphasized analytic solutions resulting in linear discriminant functions. Advances have been concerned with better descriptions of the second-order statistics of the causes of false detections. The approach, however, is still a data driven approach. The desired recognition class is represented through exemplars. In the distortion invariant filtering approach, the task has been confined to fitting a hyperplane to the recognition exemplars subject to various quadratic optimization criteria.
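The equivalence in equation (39) can be verified numerically. The following sketch uses a 1-D analogue (a unitary DFT matrix in place of the 2-D transform Φ) with synthetic exemplars; the matrix names mirror the derivation, but the data are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nt = 16, 3                      # signal length, number of exemplars (1-D analogue)

# Unitary DFT matrix Phi: U = Phi u, u = Phi^H U, Phi Phi^H = I
n = np.arange(N)
Phi = np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

x = rng.standard_normal((N, Nt))   # space-domain exemplars (columns)
X = Phi @ x                        # spectral-domain data matrix
d = np.ones(Nt)

# D: diagonal average power spectrum over the exemplars
D = np.diag((np.abs(X) ** 2).mean(axis=1))
Dinv = np.linalg.inv(D)

# MACE filter in the spectral domain (equation 24 of the text)
H_mace = Dinv @ X @ np.linalg.inv(X.conj().T @ Dinv @ X) @ d

# Space-domain route: prewhitener B = D^{-1/2}, A = Phi^H B Phi (eq. 37),
# then the linear associative memory h = Ax (x^T A^H A x)^{-1} d (eq. 36)
B = np.diag(np.diag(D) ** -0.5)
A = Phi.conj().T @ B @ Phi
h = A @ x @ np.linalg.inv(x.T @ A.conj().T @ A @ x) @ d

# Equivalent linear system: preprocessor followed by the LAM (eq. 39)
h_sys = A @ h
assert np.allclose(h_sys, Phi.conj().T @ H_mace)
```

The final assertion checks h_sys = Φ†H_MACE, i.e. the cascade of prewhitener and LAM reproduces the inverse DFT of the spectral-domain MACE filter.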
The development of associative memories has proceeded along a different track. It is also data driven, but the emphasis has been on iterative machine learning methods. Many of the methods are biologically motivated, including the perceptron learning rule [Rosenblatt, 1958] and Hebbian learning [Hebb, 1949]. Other methods, including the least-mean-square (LMS) algorithm [Widrow and Hoff, 1960] (which we have described) and the backpropagation algorithm [Rumelhart et al., 1986; Werbos, 1974], are gradient descent based methods. From the classification standpoint, of which the ATR problem is a subset, iterative methods have certain advantages. This can be illustrated with a simple example. Suppose the data matrix

x = [x_1, x_2, …, x_{N_t}] ∈ ℝ^{N_1N_2×N_t}

were not full rank. In other words, the exemplars representing the recognition class could be represented without error in a subspace of dimension less than N_t. From an ATR perspective this would be a desirable property. The implicit assumption in any data driven method is that information about the recognition class is transmitted through exemplars. This is as true for distortion invariant filters, which have analytic solutions, as it is for iterative methods. The smaller the dimension of the subspace in which the recognition class lies, the better we can discriminate images considered to be out of the class. One limitation of the analytic solutions of distortion invariant filters is that they require the inverse of a matrix of the form

x^T Qx, (40)

where Q is a positive definite matrix representing a quadratic optimization criterion. If the matrix x is not full column rank, there is no inverse for the matrix of (40) and consequently no analytic solution for any of the distortion invariant filters. The LMS algorithm, however, will still find a best fit to the design goal, which is to minimize the criterion while satisfying the linear constraints.
We can illustrate this by modifying the data from the experiments in section 2.1. It is well known that the data matrix x can be decomposed using the singular value decomposition (SVD) as

x = UΛV^T,

where the columns of U ∈ ℝ^{N_1N_2×N_t} form an orthonormal basis (the principal components of the vectors x_i, in fact), the diagonal matrix Λ ∈ ℝ^{N_t×N_t} contains the singular values of the data matrix, and V ∈ ℝ^{N_t×N_t} is unitary. The columns of the data matrix can be projected onto a subspace by setting one of the diagonal elements of Λ to zero. The importance of any of the basis vectors in U is directly proportional to the singular value. In this case N_t = 21, so we can choose one of the smaller singular values to set to zero without changing the basic structure of the data. For this example we choose the twelfth largest singular value. A data matrix x_sub is generated by

x_sub = U [Λ_{1-11}  0  0 ;  0  0  0 ;  0  0  Λ_{13-21}] V^T,

where Λ_{i-j} is a diagonal matrix containing the i through j singular values of the original data matrix x. This data matrix is not full rank, so there is no analytical solution for the MACE filter; however, we can use the LMS approach and derive a linear associative memory. The columns of x_sub are preprocessed with a prewhitening filter computed over the average power spectrum. The LMS algorithm can then be used to iteratively compute the transformation that best fits

x_sub^T h = d

in a least squares sense; that is, we can find the h that minimizes (x_sub^T h − d)^T(x_sub^T h − d), where d is a column vector of desired responses (set to all unity in this case). The peak output response for this filter was computed over all of the aspect views of vehicle 1a and is shown in figure 18. The exemplars used to compute the filter are plotted with diamond symbols. The desired response cannot be met exactly, so a least squares fit is achieved. Figure 19 shows the correlation output surface for one of the training exemplars.
Figure 18. Peak output response over all aspects of vehicle 1a when the data matrix is not full rank. The LMS algorithm was used to compute the filter coefficients.

As can be seen in the image, the qualities of low variance and localized peak are still maintained using the iterative method. The learning curve, which measures the normalized mean square error (NMSE) between the filter output and the desired output, is shown as a function of the learning epoch (an epoch is one pass through the data) in figure 20. When the data matrix is full rank, as shown with a solid line, there is an exact solution and the error approaches zero. When x_sub is used, the NMSE approaches a limit because there is no exact solution and so a least squares solution is found.

Figure 19. Output correlation surface for LMS computed filter from non full rank data. The filter output is not substantially different from the analytic solution with full rank data.

Since the system of constraint equations is generally underdetermined, there are infinitely many filters which will satisfy the constraints. There is only one, however, that minimizes the norm of the filter (the optimization criterion after preprocessing) [Kohonen, 1988]. Figure 21 shows the NMSE between the analytic solution for the filter coefficients and the iterative(1) method. When the data matrix is full rank, the iterative method approaches the optimal analytic solution, as shown by the solid line in the figure. When the data matrix is not full rank, as shown by the dashed line in the figure, the error in the iterative solution approaches a limit. These qualities of iterative learning methods are important from the ATR perspective. We see from the example that when the data possesses a quality that would seemingly be

1. In this case "iterative" refers to the LMS algorithm; within this text it generally refers to a gradient search algorithm.
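A small numerical sketch of the rank-deficient experiment: a synthetic data matrix is built from its SVD with one singular value zeroed (the fifth here rather than the twelfth, since this example is smaller than the N_t = 21 case in the text), and a batch gradient search stands in for the LMS algorithm. The analytic inverse does not exist, yet the iterative solution converges to the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(2)
N, Nt = 32, 8
U, _ = np.linalg.qr(rng.standard_normal((N, Nt)))
V, _ = np.linalg.qr(rng.standard_normal((Nt, Nt)))
s = np.sort(rng.uniform(0.5, 2.0, Nt))[::-1]
s[4] = 0.0                           # zero one singular value: x is no longer full rank
x = U @ np.diag(s) @ V.T             # x^T Q x is now singular, so no analytic filter
d = np.ones(Nt)

# Batch gradient descent on (x^T h - d)^T (x^T h - d), standing in for LMS
h = np.zeros(N)
mu = 0.05
for _ in range(4000):
    h += mu * x @ (d - x.T @ h)

# With no exact solution, the iterate converges to the least squares fit,
# which (starting from h = 0) is the minimum-norm pseudoinverse solution
h_ls = np.linalg.pinv(x.T) @ d
assert np.allclose(h, h_ls, atol=1e-6)
```

Starting from zero keeps h in the column space of x, so the gradient iteration settles on the same minimum-norm least squares solution the pseudoinverse gives, mirroring the behavior of the learning curves in figures 20 and 21.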
Figure 20. Learning curve for LMS approach. The learning curve for the LMS algorithm when the data matrix is full rank is shown with a solid line; the non full rank case is shown with a dashed line.

useful to the ATR problem, namely that the class can be described by a subspace, the analytic solution fails when the number of exemplars exceeds the dimensionality of the subspace. The iterative method, however, finds a reasonable solution. Furthermore, if the data matrix is full rank, the iterative method approaches the optimal analytic solution.

3.5 Comments

There are further motivations for the associative memory perspective and, by extension, the use of iterative methods. It is well known that nonlinear associative memory structures can outperform their linear counterparts on the basis of generalization and dynamic range [Kohonen, 1988; Hinton and Anderson, 1981]. In general, they are more difficult to design as their parameters cannot be computed analytically. The parameters for a large

Figure 21. NMSE between closed form solution and iterative solution. The error for the LMS algorithm when the data matrix is full rank is shown with a solid line; the non full rank case is shown with a dashed line.

class of nonlinear associative memories can, however, be determined by gradient search techniques. The methods of distortion invariant filters are limited to linear or piecewise linear discriminant functions. It is unlikely that these solutions are optimal for the ATR problem. In this chapter we have made the connection between distortion invariant filtering and linear associative memories. Furthermore, we have motivated an iterative approach. Recall figure 15, which shows the Adaline architecture. In this architecture we can use the linear error term in order to train our system as a classifier.
This is a consequence of the assumption that a linear discriminant function is desirable. If a linear discriminant function is suboptimal, which will almost always be the case for any high-dimensional classification problem, then we must work directly with the classification error. We have also shown that the MSE criterion is a sufficient proxy for classification error (with certain restrictions); however, it requires that we work with the true output error of the mapping as well as a mapping with sufficient flexibility (i.e. one that can closely approximate a wide range of functions which are not necessarily linear). The linear systems approach does not allow for either of these requirements. Consequently, we must adopt a nonlinear systems approach if we hope to achieve improved performance. The next chapter will show that the MACE filter can be extended to nonlinear systems such that the desirable properties of shift invariance and localized detection peak are maintained while achieving superior classification performance.

CHAPTER 4
STOCHASTIC APPROACH TO TRAINING NONLINEAR SYNTHETIC DISCRIMINANT FUNCTIONS

4.1 Nonlinear Iterative Approach

The MACE filter is the best linear system that minimizes the energy in the output correlation plane subject to a peak constraint at the origin. An advantage of linear systems is that we have the mathematical tools to use them in optimal operating conditions from the standpoint of second-order statistics. Such optimality conditions, however, should not be confused with the best possible classification performance. Our goal is to extend the optimality condition of MACE filters to adaptive nonlinear systems and classification performance. The optimality condition of the MACE filter considers the entire output plane, not just the response when the image is centered.
With regard to general nonlinear filter architectures which can be trained iteratively, a brute force approach would be to train a neural network with a desired output of unity for the centered images and zero for all shifted images. This would indeed emulate the optimality of the MACE filter; however, the result is a training algorithm of order N_1N_2N_t for N_t training images of size N_1 × N_2 pixels. This is clearly impractical. In this section we propose a nonlinear architecture for extending the MACE filter. We discuss some of its properties. Appropriate measures of generalization are discussed. We also present a statistical viewpoint of distortion invariant filters from which such nonlinear extensions fit naturally into an iterative framework. From this iterative framework we present experimental results which exhibit improved discrimination and generalization performance with respect to the MACE filter while maintaining the properties of localized detection peak and low variance in the output plane.

4.2 A Proposed Nonlinear Architecture

As we have stated, the MACE filter can be decomposed as a prewhitening filter followed by a synthetic discriminant function (SDF), which can also be viewed as a special case of Kohonen's linear associative memory (LAM) [Hester and Casasent, 1980; Fisher and Principe, 1994]. This decomposition is shown at the top of figure 22. The nonlinear filter architecture which we are proposing is shown in the middle of figure 22. In this architecture we replace the LAM with a nonlinear associative memory, specifically a feedforward multilayer perceptron (MLP), shown in more detail at the bottom of figure 22. We will refer to this structure as the nonlinear MACE filter (NL-MACE) for brevity. Another reason for choosing the multilayer perceptron is that it is capable of achieving a much wider range of discriminant functions.
It is well known that an MLP with a single hidden layer can approximate any discriminant function to any arbitrary degree of precision [Funahashi, 1989]. One of the shortcomings of distortion invariant approaches such as the MACE filter is that they attempt to fit a hyperplane to the training exemplars as the discriminant function. Using an MLP in place of the LAM relaxes this constraint. MLPs do not, in general, allow for analytic solutions. We can, however, determine their parameters iteratively using gradient search.

Figure 22. Decomposition of optimized correlator as a preprocessor followed by SDF/LAM (top). Nonlinear variation shown with MLP replacing SDF in signal flow (middle); detail of the MLP (bottom). The linear transformation A represents the space-domain equivalent of the spectral preprocessor (αP_x + (1 − α)P_n)^{-1/2}.

4.2.1 Shift Invariance of the Proposed Nonlinear Architecture

One of the properties of the MACE filter is shift invariance. We wish to maintain that property in our nonlinear extensions. A transformation, T[·], of a two-dimensional function is shift invariant if it can be shown that

g(n_1, n_2) = T[y(n_1, n_2)] implies g(n_1 + n_1', n_2 + n_2') = T[y(n_1 + n_1', n_2 + n_2')],

where n_1, n_1', n_2, n_2' are integers. In other words, a shift of the input signal is reflected as a corresponding shift of the output signal [Oppenheim and Schafer, 1989]. We show here that this property is maintained for our proposed nonlinear architecture. The preprocessor of the nonlinear architecture at the bottom of figure 22 is the same as the preprocessor of the linear filter shown at the top. The preprocessor is implemented as a linear shift invariant (LSI) filter. Cascading shift invariant operations maintains shift invariance of the entire system [Oppenheim and Schafer, 1989].
In order to show that the system as a whole is shift invariant, it is sufficient to show that the MLP is shift invariant. The mapping function of the MLP in figure 22 can be written

g(ω, y) = σ(W_3 σ(W_2 σ(W_1 y) + φ)), ω = {W_1, W_2, W_3, φ}, (41)

where W_1 ∈ ℝ^{N_1^h × N_1N_2}, W_2 ∈ ℝ^{N_2^h × N_1^h}, and W_3 ∈ ℝ^{1 × N_2^h}. In the nonlinear architecture, the matrix W_i represents the connectivities from the processing elements (PEs) of layer (i − 1) to the input of the PEs of layer i; that is, the matrix W_i is applied as a linear transformation to the vector output of layer (i − 1). When i = 1 the transformation is applied to the input vector, y. The number of PEs in layer i is denoted by N_i^h. In equation 41, φ is a constant bias vector added to each element of the vector W_2σ(W_1y) ∈ ℝ^{N_2^h × 1}. It is also assumed that if the argument to the nonlinear function σ(·) is a matrix or vector, then the nonlinearity is applied to each element of the matrix or vector. The input to the MLP is denoted as a vector, y ∈ ℝ^{N_1N_2 × 1}. The elements of the vector are samples of a two-dimensional prewhitened input signal, y(n_1, n_2). We can write the ith element of the vector as a function of the two-dimensional signal as follows:

y_i(n_1, n_2) = y(n_1 + ⟨i, N_1⟩, n_2 + ⌊i/N_1⌋), i = 0, …, N_1N_2 − 1,

where ⟨i, N_1⟩ indicates a modulo operation (the remainder of i divided by N_1) and ⌊i/N_1⌋ indicates integer division of i by N_1. Written this way, the elements of the vector y sample a rectangular region of support of size N_1 × N_2 beginning at sample (n_1, n_2) in the prewhitened signal, y(n_1, n_2). The vector argument of equation 41 and the resulting output signal can now be written as an explicit function of the beginning sample point of the template within the prewhitened image:

g_ω(n_1, n_2) = g(ω, y(n_1, n_2)) = σ(W_3 σ(W_2 σ(W_1 y(n_1, n_2)) + φ)). (42)

The output of the mapping as written in equation 42 is now an explicit function of (n_1, n_2) and the constant parameter set, ω (which does not vary with (n_1, n_2)).
We can also write the output response as a function of the shifted version of the image, y(n_1, n_2), as

g_ω(n_1 + n_1', n_2 + n_2') = g(ω, y(n_1 + n_1', n_2 + n_2')). (43)

Since the parameters, ω, are constant, equations 42 and 43 are sufficient to show that the mapping of the MLP is shift invariant and, consequently, that the system as a whole (including the shift invariant preprocessor) is also shift invariant.

4.3 Classifier Performance and Measures of Generalization

One of the issues for any iterative method which relies on exemplars is the number of training exemplars to use in the computation of the discriminant function. In addition, for iterative methods, there is the issue of when to stop the adaptation process. In the case of distortion invariant filters, such as the MACE filter, some common heuristics are used to determine the number of training exemplars. Typically samples are drawn from the training set and used to compute the filter from equation 23 until the minimum peak response over the remaining samples exceeds some threshold [Casasent and Ravichandran, 1992]. A similar heuristic is to continue to draw samples from the training set until the mean square error of the peak response over the remaining samples drops below some preset threshold. These measures are then used as indicators of how well the filter generalizes to between-aspect exemplars from the training set which have not been used for the computation of the filter coefficients. The ultimate goal, however, is classification. Generalization in the context of classification must be related to the ability to classify a previously unseen input [Bishop, 1995]. We show by example that the measures of generalization mentioned above may be misleading as predictors of classifier performance for even the linear filters. In fact, the results of the experiments will show that the way in which the data is preprocessed is more indicative of classifier performance than these other indirect measures.
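The shift-invariance argument of equations 41 through 43 can be illustrated directly: a small MLP with arbitrary random weights, applied to every N_1 × N_2 window of an image (circular shifts, matching the MACE formulation), produces an output surface that shifts along with the input. The layer sizes below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

def mlp(y, W1, W2, W3, phi):
    """g(w, y) = sigma(W3 sigma(W2 sigma(W1 y) + phi)) -- equation 41."""
    s = np.tanh
    return s(W3 @ s(W2 @ s(W1 @ y) + phi))

N1 = N2 = 4                                  # template size
W1 = rng.standard_normal((6, N1 * N2))       # layer sizes are arbitrary here
W2 = rng.standard_normal((5, 6))
W3 = rng.standard_normal((1, 5))
phi = rng.standard_normal(5)

def output_surface(img):
    """Slide the N1 x N2 template over img (circularly) and apply the MLP."""
    M1, M2 = img.shape
    out = np.zeros((M1, M2))
    for n1 in range(M1):
        for n2 in range(M2):
            win = img[np.arange(n1, n1 + N1) % M1][:, np.arange(n2, n2 + N2) % M2]
            # column-major ravel matches the y_i indexing of the text
            out[n1, n2] = mlp(win.ravel(order='F'), W1, W2, W3, phi)[0]
    return out

img = rng.standard_normal((8, 8))
g0 = output_surface(img)
g_shift = output_surface(np.roll(img, (2, 3), axis=(0, 1)))

# A (circular) shift of the input produces the same shift of the output surface
assert np.allclose(np.roll(g0, (2, 3), axis=(0, 1)), g_shift)
```

Because the parameters ω are the same at every window position, the output at each point depends only on the local window content, which is exactly the condition equations 42 and 43 express.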
We illustrate this point with an example using ISAR image data. A data set larger than in the previous experiments will be used. Two more vehicles, one from each vehicle type, will be used for the testing set, and all vehicles will be sampled at higher aspect resolution. Figure 23 shows ISAR images of size 64 × 64 taken from five different vehicles and two different vehicle types. The images are all taken with the same radar. Data taken from vehicles in the same class vary in the vehicle configuration and radar depression angle (15 or 20 degrees depression). Images have been formed from each vehicle at aspect variations of 0.125 degrees from 5 to 85 degrees aspect for a total of 641 images for each vehicle. Figure 23 shows each of the vehicles at 5, 45, and 85 degrees aspect. We will use vehicle type 1 as the recognition class and vehicle type 2 as a confusion vehicle. Images of vehicle 1a will be used as the set from which to draw training exemplars. Classification performance will then be measured as the ability to recognize vehicles 1b and 1c while rejecting vehicles 2a and 2b. The filter we will use is a form of the OTSDF [Réfrégier and Figue, 1991], which is computed in the spectral domain as

H = [αP_x + (1 − α)P_n]^{-1} X [X†[αP_x + (1 − α)P_n]^{-1} X]^{-1} d, (44)

where the columns of the data matrix X ∈ C^{N_1N_2 × N_t} are the Fourier coefficients of N_t exemplar images of dimension N_1 × N_2 of vehicle 1a reordered into column vectors. The diagonal matrix P_x ∈ ℝ^{N_1N_2 × N_1N_2} contains the coefficients of the average power spectrum measured over the N_t exemplars of vehicle 1a, while P_n ∈ ℝ^{N_1N_2 × N_1N_2} is the identity matrix scaled by the average of the diagonal terms of P_x. Finally, d ∈ ℝ^{N_t} is a column vector of desired outputs, one for each exemplar. The elements of d are typically set to unity.

Figure 23. ISAR images of two vehicle types shown at aspect angles of 5, 45, and 85 degrees respectively. Three different vehicles of type 1 (a, b, and c) are shown, while two different vehicles of type 2 (a and b) are shown. Vehicle 1a is used as a training vehicle, while vehicles 1b and 1c are used as the testing vehicles for the recognition class. Vehicles 2a and 2b are used as confusion vehicles.

When α is set to unity, equation 44 yields exactly the MACE filter; when it is set to zero, the result is the SDF. The filter we are using is therefore trading off the MACE filter criterion with the SDF criterion. The SDF criterion can also be viewed as the MVSDF [Kumar, 1986] criterion when the noise class is represented by a white noise random process. This filter can also be decomposed as in figure 22. These experiments examine the relationship between the two commonly used measures of generalization and two measures of classification performance. We can draw conclusions from the results about the appropriateness of the generalization measures with regard to classification. The first generalization measure is the minimum peak response, denoted y_min, taken over the aspect range of the images of the training vehicle (excluding the aspects used for computing the filter). The second generalization measure is the mean square error, denoted y_mse, between the desired output of unity and the peak response over the aspect range of the images of the training vehicle (excluding the aspects used for computing the filter). The classification measures are taken from the receiver operating characteristic (ROC) curve measuring the probability of detection, P_d, of a testing vehicle in the recognition class (vehicles 1b and 1c) versus the probability of false alarm, P_fa, on a testing vehicle in the confusion class (vehicles 2a and 2b) based on peak detection.
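A sketch of equation 44 in a 1-D analogue (NumPy, with synthetic data standing in for the reordered ISAR spectra). It exercises the two limits of the family, α = 1 (MACE) and α = 0 (SDF/white-noise MVSDF), and confirms that every member meets the peak constraints.

```python
import numpy as np

rng = np.random.default_rng(4)
N, Nt = 16, 4                            # 1-D analogue of the N1 x N2 image case
x = rng.standard_normal((N, Nt))
X = np.fft.fft(x, axis=0) / np.sqrt(N)   # unitary DFT of each exemplar
d = np.ones(Nt)

Px = (np.abs(X) ** 2).mean(axis=1)       # average power spectrum (diagonal of P_x)
Pn = np.full(N, Px.mean())               # identity scaled by the mean of diag(P_x)

def otsdf(alpha):
    """H = [aPx+(1-a)Pn]^{-1} X [X^H [aPx+(1-a)Pn]^{-1} X]^{-1} d -- equation 44."""
    Tinv = np.diag(1.0 / (alpha * Px + (1 - alpha) * Pn))
    return Tinv @ X @ np.linalg.inv(X.conj().T @ Tinv @ X) @ d

H_mace = otsdf(1.0)                      # alpha = 1 recovers the MACE filter
H_sdf = otsdf(0.0)                       # alpha = 0 recovers the SDF solution

# Every filter in the family meets the peak constraints X^H H = d
assert np.allclose(X.conj().T @ H_mace, d)
assert np.allclose(X.conj().T @ H_sdf, d)
assert np.allclose(X.conj().T @ otsdf(0.5), d)
```

The constraints X†H = d hold for any α because the bracketed matrix inverse in equation 44 exactly cancels the prewhitened projection; only the quadratic criterion being minimized changes with α.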
The specific measures are the area under the ROC curve, a general measure of the test being used, and the probability of false alarm when the probability of detection equals 80%, which measures a single point of interest on the ROC curve. Two filters are used, one with α = 0.5 and the other with α = 0.95; that is, one in which both criteria are weighted equally and one which is close to the MACE filter criterion. The number of exemplars drawn from the training vehicle (1a) is varied from 21 to 81, sampled uniformly in aspect (1 to 4 degrees aspect separation between exemplars). Examination of figures 24 and 25 shows that for both cases (α equal to 0.5 and 0.95) no clear relationship emerges in which the generalization measures are indicators of good classification performance. Table 1 compares the classifier performance when the generalization measures as described above are used to choose the filter versus the best ROC performance achieved throughout the range of aspect separation. In one regard, the generalization measures were consistent in that the same aspect separation was predicted by both measures for both settings of α. In figure 26 we compare the ROC curves for two cases, first where the filter is chosen using the generalization measures and second the best achieved ROC curve, for both settings of α. We would expect that for each α the filter chosen using the generalization measure would be near the best ROC performance. As can be seen in the figure, this is not the case.

Table 1. Classifier performance measures when the filter is determined by either of the common measures of generalization as compared to best classifier performance for two values of α.

                                 y_min    y_mse    Best
  α = 0.50   P_fa@P_d=0.8        0.24     0.24     0.16
             ROC area            0.83     0.83     0.90
  α = 0.95   P_fa@P_d=0.8        0.16     0.16     0.07
             ROC area            0.94     0.94     0.95

It is obvious from figures 24 and 25 that the generalization measures are not significantly correlated with the ROC performance.
In fact, as summarized in table 2, the generalization measures are negatively, albeit weakly, correlated with ROC performance. One feature of figures 24 and 25 is that although ROC performance varies independently of

Figure 24. Generalization as measured by the minimum peak response. The plot compares y_min versus classification performance measures (ROC area and P_fa@P_d=0.8).

Figure 25. Generalization as measured by the peak response mean square error. The plot compares y_mse versus classification performance measures (ROC area and P_fa@P_d=0.8).

Figure 26. Comparison of ROC curves. The ROC curves for the number of training exemplars yielding the best generalization measure versus the number yielding the best ROC performance for values of α equal to 0.5 and 0.95 are shown.

either the minimum peak response or the MSE, there does appear to be a dependency on α. This leads to a second experiment.

Table 2. Correlation of generalization measures to classifier performance.
In both cases (α equal to 0.5 or 0.95) the classifier performance, as measured by the area of the ROC curve or P_fa at P_d equal to 0.8, has the opposite correlation to what would be expected of a useful measure for predicting performance.

                      α = 0.50                   α = 0.95
                ROC area   P_fa@P_d=0.8    ROC area   P_fa@P_d=0.8
  y_min          -0.39        0.21          -0.40        0.41
  y_mse          -0.32        0.11          -0.31        0.35

In the second experiment we examine the relationship between the parameter α and the ROC performance. The aspect separation between training exemplars is set to 2, 4, and 8 degrees. The value of α, the emphasis on the MACE criterion, is varied in the range zero to unity. Figure 27 shows the relationship between ROC performance and the value of α. It is clear from the plots that there is a positive relationship between the emphasis on the MACE criterion and the ROC performance. However, the peak in ROC performance is not achieved at α equal to unity. In all three cases, the ROC performance peaks just prior to unity, with the performance drop-off at α equal to unity increasing with aspect separation. The difference between the SDF and the MACE filter is the preprocessor. What is shown by this analysis is that, in general, the preprocessor from the MACE filter criterion leads to better classification, but too much emphasis on the MACE filter criterion, as measured by α equal to unity, leads to a filter which is too specific to the training samples. The problems described above are well known. Alterations to the MACE criterion have been the subject of many researchers [Casasent et al., 1991; Casasent and Ravichandran, 1992; Ravichandran and Casasent, 1992; Mahalanobis et al., 1994a]. There is still, as yet, no principled method found in the literature by which to set the parameter α. There are two conclusions from this analysis that are pertinent to the nonlinear extension we are using.
First, the results show that prewhitening over the recognition class leads to better classification performance. For this reason we choose to use the preprocessor of the MACE filter in our nonlinear filter architecture. The issue of extending the MACE filter to nonlinear systems can in this way be formulated as a search for a more robust nonlinear discriminant function in the prewhitened image space. The second conclusion is that comparisons of the nonlinear filter to its linear counterpart must be made in terms of classification performance only. There are simple nonlinear systems, such as a soft threshold at the output of a linear system for example, that will outperform the MACE filter or its variations in terms of maximizing the minimum peak response over the training vehicle or reducing the variance in the output image plane.

Figure 27. ROC performance measures versus α. Results are shown for training aspect separations of 2, 4, and 8 degrees. These plots indicate that ROC performance is positively related to α.

These measures are not, however, sufficient to describe classification performance. We have also used these measures in the past but feel that they are not the most appropriate for classification [Fisher and Principe, 1995b].

4.4 Statistical Characterization of the Rejection Class

We now present a statistical viewpoint of distortion invariant filters from which such nonlinear extensions fit naturally into an iterative framework. This treatment results in an efficient way to capture the optimality condition of the MACE filter using a training algorithm which is approximately of order N_t and which leads to better classification performance than the linear MACE.
A possible approach to designing a nonlinear extension to the MACE filter and improving on the generalization properties is to simply substitute the linear processing elements of the LAM with nonlinear elements. Since such a system can be trained with error backpropagation [Rumelhart et al., 1986], the issue would be simply to report on performance comparisons with the MACE. Such a methodology does not, however, lead to an understanding of the role of the nonlinearity, and does not elucidate the tradeoffs in the design and in training. Here we approach the problem from a different perspective. We seek to extend the optimality condition of the MACE to a nonlinear system, i.e. the energy in the output space is minimized while maintaining the peak constraint at the origin. Hence we will impose these constraints directly in the formulation, even knowing a priori that an analytical solution is very difficult or impossible to obtain. We reformulate the MACE filter from a statistical viewpoint and generalize it to arbitrary mapping functions, linear and nonlinear. Consider images of dimension N_1 × N_2 reordered by column or row into vectors. Let the rejection class be characterized by the random vector X_1 ∈ ℝ^{N_1N_2 × 1}. We know the second-order statistics of this class as represented by the average power spectrum (or equivalently the autocorrelation function). Let the recognition class be characterized by the columns of a data matrix x_2 ∈ ℝ^{N_1N_2 × N_t}, which are observations of the random vector X_2 ∈ ℝ^{N_1N_2 × 1}, similarly reordered. We wish to find the parameters, ω, of a mapping, g(ω, X): ℝ^{N_1N_2} → ℝ, such that we may discriminate the recognition class from the rejection class. Here it is the mapping function, g, which defines the discriminator topology. Towards this goal, we wish to minimize the objective function J = E(g(ω, X_1)^2) over the mapping parameters, ω, subject to the system of constraints

g(ω, x_2) = d^T, (45)

where d ∈ ℝ^{N_t} is a column vector of desired outputs.
It is assumed that the mapping function is applied to each column of x2, and E(·) is the expected value operator.

Using the method of Lagrange multipliers, we can augment the objective function as

    J = E(g(ω, X1)^2) + (g(ω, x2) − d^T)λ,    (46)

where λ ∈ ℝ^(Nt × 1) is a vector whose elements are the Lagrange multipliers, one for each constraint. Computing the gradient with respect to the mapping parameters yields

    ∂J/∂ω = 2 E(g(ω, X1) ∂g(ω, X1)/∂ω) + (∂g(ω, x2)/∂ω) λ.    (47)

Equation 47 along with the constraints of equation 45 can be used to solve for the optimal parameters, ω, assuming our constraints form a consistent set of equations. This is, of course, dependent on the mapping topology.

4.4.1 The Linear Solution as a Special Case

It is interesting to verify that this formulation yields the MACE filter as a special case. If, for example, we choose the mapping to be a linear projection of the input image, that is

    g(ω, x) = ω^T x,    ω = [h1 ... hN1N2]^T ∈ ℝ^(N1N2 × 1),

equation 46 becomes, after simplification,

    J = ω^T E(X1 X1^T) ω + (ω^T x2 − d^T)λ.    (48)

In order to solve for the mapping parameters, ω, we are still left with the task of computing the term E(X1 X1^T), which, in general, we can only estimate from observations of the random vector X1, or assume a specific form for. Assuming that we have a suitable estimator, the well known solution to the minimum of equation 48 over the mapping parameters subject to the constraints of equation 45 is

    ω = R̂^(−1) x2 [x2^T R̂^(−1) x2]^(−1) d,    (49)

where

    R̂ = estimate{E(X1 X1^T)}.    (50)

Depending on the characterization of X1, equation 49 describes various SDF-type filters (i.e. MACE, MVSDF, etc.). In the case of the MACE filter, the rejection class is characterized by all 2D circular shifts of target class images away from the origin.
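As a numerical sanity check, equation 49 can be computed directly with standard linear algebra routines. The sketch below uses a synthetic positive-definite estimate R̂ and random exemplars (all names and dimensions are illustrative stand-ins, not the thesis data) and verifies that the constraints of equation 45 are met with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, nt = 64, 5                      # image vector length N1*N2, number of exemplars

# Hypothetical stand-ins: a positive-definite estimate of E(X1 X1^T)
# and a recognition-class data matrix x2 with desired outputs d.
A = rng.normal(size=(n, n))
R = A @ A.T + n * np.eye(n)        # well-conditioned estimate of E(X1 X1^T)
x2 = rng.normal(size=(n, nt))
d = np.ones((nt, 1))               # unit peak constraint for each exemplar

# Equation 49: w = R^-1 x2 (x2^T R^-1 x2)^-1 d
Rinv_x2 = np.linalg.solve(R, x2)
w = Rinv_x2 @ np.linalg.solve(x2.T @ Rinv_x2, d)

# The constraints of equation 45 are met with equality: w^T x2 = d^T
print(np.allclose(w.T @ x2, d.T))  # True
```

In practice the MACE filter forms R̂ in the frequency domain, where the average power spectrum makes R̂ diagonal; the dense space-domain solve above is only for illustration.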
Solving for the MACE filter coefficients is therefore equivalent to using the average circular autocorrelation sequence (or equivalently the average power spectrum in the frequency domain) over images in the target class as the estimator of the elements of the matrix E(X1 X1^T). Sudharsanan et al [1991] suggest a very similar methodology for improving the performance of the MACE filter. In that case the average linear autocorrelation sequence is estimated over the target class and this estimator of E(X1 X1^T) is used to solve for linear projection coefficients in the space domain. The resulting filter is referred to as the SMACE (space-domain MACE) filter.

4.4.2 Nonlinear Mappings

For arbitrary nonlinear mappings it will, in general, be very difficult to solve for globally optimal parameters analytically. Our purpose is instead to develop iterative training algorithms which are practical and yield improved performance over the linear mappings. It is through the implicit description of the rejection class by its second-order statistics that we have developed an efficient method for extending the MACE filter and other related correlators to nonlinear topologies such as neural networks.

As stated, our goal is to find mappings, defined by a topology and a parameter set, which improve upon the performance of the MACE filter in terms of generalization while maintaining a sharp constrained peak in the center of the output plane for images in the recognition class. One approach, which leads to an iterative algorithm, is to approximate the original objective function of equation 46 with the modified objective function

    J = (1 − β) E(g(ω, X1)^2) + β [g(ω, x2) − d^T][g(ω, x2) − d^T]^T.    (51)

The principal advantage gained by using equation 51 over equation 46 is that we can solve iteratively for the parameters of the mapping function (assuming it is differentiable) using gradient search. The constraint equations, however, are no longer satisfied with equality over the training set.
It has been recognized that the choice of constraint values has a direct impact on the performance of optimized linear correlators. Sudharsanan et al [1990] have explored techniques for optimally assigning these values within the constraints of a linear topology. Other methods have been suggested [Mahalanobis et al., 1994a, 1994b; Kumar and Mahalanobis, 1995] to improve the performance of distortion invariant filters by relaxing the equality constraints. Mahalanobis [1994a] extends this idea to unconstrained linear correlation filters. The OTSDF objective function of Réfrégier [1991] appears similar to the modified objective function and indeed, for a linear topology, this can be solved analytically as an optimal tradeoff problem.

Our primary purpose for modifying the objective function is to allow for an iterative method within the NLMACE architecture. We have already shown in the previous chapter that this choice of criterion is suitable for classification. We will show that the primary qualities of the MACE filter are still maintained when we relax the equality constraints in our formulation. Varying β in the range [0, 1] controls the degree to which the average response to the rejection class is emphasized versus the variance about the desired output over the recognition class.

As in the linear case, we can only estimate the expected variance of the output due to the random vector input and its associated gradient. If, as in the MACE (or SMACE) filter formulation, X1 is characterized by all 2D circular (or linear) shifts of the recognition class away from the origin, then this term can be estimated with a sample average over the exemplars, x2, for all such shifts. From an iterative standpoint this still leads to the impractical approach of training exhaustively over the entire output plane.
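For the linear special case g(ω, x) = ω^T x, equation 51 reduces to a quadratic in ω, and a plain gradient search illustrates the iterative approach. In the sketch below the matrices are synthetic stand-ins, and β, the step size, and the iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, nt, beta = 32, 4, 0.9

A = rng.normal(size=(n, n))
R = A @ A.T / n + np.eye(n)        # stand-in for E(X1 X1^T)
x2 = rng.normal(size=(n, nt))      # stand-in recognition-class exemplars
d = np.ones(nt)                    # desired outputs

def J(w):
    # Equation 51 for a linear mapping g(w, x) = w^T x
    err = x2.T @ w - d
    return (1 - beta) * (w @ R @ w) + beta * (err @ err)

w = rng.normal(size=n) * 0.01      # small random initialization
lr = 0.005
losses = [J(w)]
for _ in range(300):
    # Gradient of equation 51 with respect to w
    grad = 2 * (1 - beta) * (R @ w) + 2 * beta * (x2 @ (x2.T @ w - d))
    w -= lr * grad
    losses.append(J(w))

print(losses[-1] < losses[0])      # gradient search reduces the tradeoff objective
```

Note that the final w satisfies the constraints only approximately, as the text observes: the equality constraints of equation 45 are traded off against the rejection-class energy through β.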
It is desirable, then, to find other equivalent characterizations of the rejection class which may alleviate the computational load without significantly impacting performance.

4.5 Efficient Representation of the Rejection Class

Training becomes an issue once the associative memory structure takes a nonlinear form. The output variance of the linear MACE filter is minimized for the entire output plane over the training exemplars. Even when the coefficients of the MACE filter are computed iteratively we need only consider the output point at the designated peak location (constraint) for each prewhitened training exemplar [Fisher and Principe, 1994]. This is due to the fact that for the underdetermined case, the linear projection which satisfies the system of constraints with equality and has minimum norm is also the linear projection which minimizes the response to images with a flat power spectrum. This solution is arrived at naturally via a gradient search which only considers the response at the constraint location.

This is no longer the case when the mapping is nonlinear. Adapting the parameters via gradient search (such as error backpropagation) on recognition class exemplars only at the constraint location will not, in general, minimize the variance over the entire output image plane. In order to minimize the variance over the entire output plane we must consider the response of the filter to each location in the input image, not just the constraint location.

The MACE filter optimization criterion minimizes, in the average, the response to all images with the same second-order statistics as the rejection class. At the output of the prewhitener (prior to the MLP) any white sequence will have the same second-order statistics as the rejection class. This condition can be exploited to make the training of the MLP more efficient.
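For a linear discriminant in the prewhitened space this equivalence is easy to check numerically: the expected squared response to white sequences is σ^2 ‖w‖^2, so a modest batch of random white vectors estimates the output-plane energy without enumerating shifted exemplars. The sketch below uses synthetic dimensions and variable names, which are illustrative rather than taken from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 256                            # prewhitened image vector length N1*N2
w = rng.normal(size=n)             # an arbitrary linear discriminant in whitened space

# After prewhitening, the rejection class is represented by white sequences:
# E[(w^T X1)^2] = sigma^2 * ||w||^2, so a batch of random white vectors
# estimates the output-plane energy without shifting every exemplar.
Ns = 2000
noise = rng.normal(size=(n, Ns))   # Ns white rejection-class exemplars
empirical = np.mean((w @ noise) ** 2)
analytic = w @ w                   # sigma^2 = 1 for unit-variance noise

print(abs(empirical - analytic) / analytic < 0.2)  # sample average matches
```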
From an implementation standpoint, the prewhitening stage and the input layer weights can be combined into a single equivalent linear transformation; however, prewhitening separately allows the rejection class to be represented by white sequences at the input to the MLP during the training phase.

This result is due to the statistical formulation of the optimization criterion. Minimizing the response to white sequences, in the average, minimizes the response to shifts of the exemplar images since they have the same second-order statistics (after prewhitening). Consequently, we do not have to train over the entire output plane exhaustively, thereby reducing training times by a factor proportional to the input image size, N1N2. Instead, we use a small number of randomly generated white sequences to efficiently represent the rejection class. The result is an algorithm which is of order Nt + Ns (where Ns is the number of white noise rejection class exemplars) as compared to exhaustive training.

4.6 Experimental Results

We now present experimental results which illustrate the technique and potential pitfalls. There are four significant outcomes in the experiments presented in this section. The first is that when using the white sequences to characterize the rejection class, the linear solution is a strong attractor. The second outcome is that imposing orthogonality on the input layer of the MLP tends to lead to a nonlinear solution with improved performance. The third result, in which we restrict the rejection class to a subspace, yields a significant decrease in the convergence time. The fourth result, in which we borrow from the idea of using the interior of the convex hull to represent the rejection class [Kumar et al., 1994], yields significantly better classification performance.

In these experiments we use the data depicted in figure 23. As in the previous experiments, images from vehicle 1a will be used as the training set.
Vehicles 1b and 1c will be used as the recognition class while vehicles 2a and 2b will be used as a rejection/confusion class for testing purposes. In each case comparisons will be made to a baseline linear filter. Specifically, in all cases the value of α for the linear filter is set to 0.99. The aspect separation between training images is 2.0 degrees. This results in 41 training exemplars from vehicle 1a. These settings of α and aspect separation were found to give the best classifier performance for the linear filter with this data set. We continue to refer to this as a MACE filter since the MACE criterion is so heavily emphasized. Technically it is an OTSDF filter, but such nomenclature does not convey the type of preprocessing that is being performed. We choose this value of α so as to compare to the best possible MACE filter for this data set.

The nonlinear filter will use the same preprocessor as the linear filter (i.e. α = 0.99). The MLP structure is shown at the bottom of figure 22. It accepts an N1N2 input vector (a preprocessed image reordered into a column vector), followed by two hidden layers (with two and three hidden PE nodes, respectively), and a single output node. The parameters of the MLP,

    W1 ∈ ℝ^(N1N2 × 2),  W2 ∈ ℝ^(2 × 3),  W3 ∈ ℝ^(3 × 1),

are to be determined through gradient search. The gradient search technique used in all cases will be the error backpropagation algorithm.

4.6.1 Experiment I: noise training

As stated, using the statistical approach, the rejection class is characterized by white noise sequences at the input to the MLP. The recognition class is characterized by the exemplars. It is from these white noise sequences that the MLP, through the backpropagation learning algorithm, captures information about the rejection class. So it would seem a simple matter, during the training stage, to present random white noise sequences as the rejection class exemplars. This is exactly the training method used for this experiment.
From our empirical observations, with this method of training the linear solution is a strong attractor. The results of the first experiment demonstrate this behavior. Figure 28 shows the peak output response taken over all images of vehicle 1a for both the linear (top) and nonlinear (bottom) filters. In the figure we see that for the linear filter the peak constraint (unity) is met exactly for the training exemplars with degradation for the between-aspect exemplars. As mentioned previously, if the pure MACE filter criterion were used (α equal to unity), the peak in the output plane is guaranteed to be at the constraint location [Mahalanobis et al., 1987]. It turns out that for this data set the peak output also occurs at the constraint location for the training images; however, with α = 0.99 this was not guaranteed. Examination of the peak output response for the NLMACE filter shows that the constraints are met very closely (but not exactly) for the training exemplars, also with degradation in the peak output response at between-aspect locations. The degradation for the nonlinear filter is noticeably less than in the linear case, and so in this regard it has outperformed the linear filter. Figure 29 shows the output plane response for a single image of vehicle 1a (not one used for computing the filter coefficients) for the linear filter (top) and the nonlinear filter (bottom). Again in this figure we see that both filters result in a noticeable peak when the image is centered on the filter and a reduced response when the image is shifted. The reduction in response to the shifted image is again noticeably better in the nonlinear filter than in the linear filter. The same would be found to be true for all images of vehicle 1a, and so in this regard we can again say that the nonlinear filter outperformed the linear filter.
However, as we have already illustrated for the linear case, these measures alone are not sufficient to predict classifier performance and are certainly not sufficient to compare linear systems to nonlinear systems. This point is made clear in table 3, which summarizes the classifier performance at two probabilities of detection for all of the experiments reported here when vehicles 1b and 1c are used as the recognition class and vehicles 2a and 2b are used for the rejection class. At this point we are only interested in the results pertaining to the linear filter (our baseline) and the nonlinear filter results for experiment I.

Figure 28. Peak output response of linear and nonlinear filters over the training set. The nonlinear filter clearly outperforms the linear filter by this metric alone.

Figure 29. Output response of linear filter (top) and nonlinear filter (bottom). The response is for a single image from the training set, but not one used to compute the filter.

This table shows that the classifier performance for the linear and nonlinear filters is nominally the same, despite what may be perceived to be better performance of the nonlinear filter with regard to peak response over the training vehicle and reduced output plane response to shifts of the image. Furthermore, if we examine figure 30, which shows the ROC curve for both filters, we see that they overlay each other. From a classification standpoint the two filters are equivalent.

Figure 30. ROC curves for linear filter (solid line) versus nonlinear filter (dashed line). Despite improved performance of the nonlinear filter as measured by peak output response and reduced variance over the training set, the filters are equivalent with regards to classification over the testing set.
This result is best explained by figure 31. Recall the points u1 and u2 labeled in figure 22. We can view these outputs as a feature space; that is, the MLP discriminant function can be superimposed on the projection of the input image onto this space. In this case the feature space is a representation of the input vector internal to the MLP structure. The designation of these points as features is due to the fact that they represent some abstract quality of the data, and the decision surface can be computed as a function of the features. Mathematically this can be written

    u = W1^T x,    y_o = σ(W3^T σ(W2^T σ(u) + φ)).    (52)

Recall that the matrix Wi represents the connectivities from the output of layer (i − 1) to the inputs of the PEs of layer i, φ is a constant bias term, and σ(·) is a sigmoidal nonlinearity (the hyperbolic tangent function in this case).

Figure 31. Experiment I: Resulting feature space from simple noise training. Note that all points are projected onto a single curve in the feature space. In the top figure squares are the recognition class training exemplars, triangles are white noise rejection class exemplars, and plus signs are the images of vehicle 1a not used for training. In the bottom figure, squares are the peak responses from vehicles 1b and 1c, triangles are the peak responses from vehicles 2a and 2b.

Figure 31 shows this projection for the training set (top) and the testing set (bottom). What is significant in the figure is that although the discriminant as a function of the vector u is nonlinear, the projections of the images lie on a single curve in this feature space. Topologically this filter can be put into one-to-one correspondence with a linear projection.
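The forward pass of equation 52 can be sketched directly. In the sketch below the weight shapes follow the topology described above; the random initialization and the placement of per-layer bias vectors are illustrative assumptions, not values from the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64                                      # N1*N2 for a small synthetic image

# Hypothetical weights matching the MLP topology of figure 22:
W1 = rng.normal(size=(n, 2)) * 0.1          # input layer -> 2-D feature space u
W2 = rng.normal(size=(2, 3))                # first hidden layer -> second
W3 = rng.normal(size=(3, 1))                # second hidden layer -> output
phi2, phi3 = rng.normal(size=3), rng.normal(size=1)

def forward(x):
    u = W1.T @ x                            # the feature space of figure 31
    h = np.tanh(W2.T @ np.tanh(u) + phi2)   # equation 52, sigma = tanh
    return np.tanh(W3.T @ h + phi3)

x = rng.normal(size=n)                      # stand-in prewhitened image vector
y = forward(x)
print(y.shape == (1,) and -1.0 < y[0] < 1.0)   # bounded scalar output
```

Because u is two-dimensional, the discriminant can be visualized over the (u1, u2) plane exactly as in figure 31, regardless of the input image size.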
This is not to say that the linear solution is undesirable, but under the optimization criterion it can be computed in closed form. Furthermore, in a space as rich as the ISAR image space it is unlikely that the linear solution will give the best classification performance.

Table 3. Comparison of ROC classifier performance for two values of Pd. Results are shown for the linear filter versus four different types of nonlinear training. N: white noise training, GS: Gram-Schmidt orthogonalization, subN: PCA subspace noise, CH: convex hull rejection class. Entries are Pfa (%).

    Pd (%) | linear filter | I (N)  | II (N, GS) | III (subN, GS) | IV (subN, GS, CH)
    80     | 4.37          | 4.37   | 3.74       | 2.81           | 2.45
    99     | 42.43         | 41.87  | 27.15      | 26.52          | 15.33

4.6.2 Experiment II: noise training with an orthogonalization constraint

As a means of avoiding the linear solution, a modification was made to the training algorithm. The modification was to impose orthogonality on the columns of W1 through a Gram-Schmidt process. The motivation for doing this stems from the fact that we are working in a prewhitened image space. In a prewhitened image space this condition is sufficient to ensure that the outputs in the feature space, as measured at u1 and u2, will be uncorrelated over the rejection class. Mathematically this can be shown as

    E{u u^T} = E{W1^T X1 X1^T W1} = W1^T E{X1 X1^T} W1
             = σ^2 [ w1^T w1   w1^T w2
                     w2^T w1   w2^T w2 ]
             = σ^2 [ ‖w1‖^2    0
                     0         ‖w2‖^2 ],

where w1, w2 ∈ ℝ^(N1N2 × 1) are the columns of W1 and E{X1 X1^T} = σ^2 I after prewhitening. This result is true for any number of nodes in the first layer of the MLP.

The results of training with this modification are shown in figure 32, which is the resulting feature space as measured at u1 and u2. From this figure we can see that the discriminant function, represented by the contour lines, is a nonlinear function of u1 and u2.
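The orthogonalization step and its decorrelating effect can be checked numerically. Here a QR factorization plays the role of the Gram-Schmidt process, and the dimensions and sample counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 128, 2                      # input dimension, nodes in the first layer

W1 = rng.normal(size=(n, m))
Q, _ = np.linalg.qr(W1)            # Gram-Schmidt equivalent: orthonormal columns
W1 = Q

# For white X1 (E{X1 X1^T} = I), E{u u^T} = W1^T W1 is diagonal,
# so the features u1, u2 are uncorrelated over the rejection class.
Ns = 50000
X = rng.normal(size=(n, Ns))       # white rejection-class samples
U = W1.T @ X                       # feature-space responses
C = (U @ U.T) / Ns                 # sample estimate of E{u u^T}

print(abs(C[0, 1]) < 0.05 and abs(C[0, 0] - 1) < 0.05)  # near-diagonal, unit variance
```

Note that QR normalizes the columns as well as orthogonalizing them; classical Gram-Schmidt without normalization gives a diagonal (rather than identity) covariance, which is all the uncorrelatedness argument requires.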
Furthermore, because the projections of the vehicles into the feature space do not lie on a single curve (as in the previous experiment), the features represent different discrimination information with regard to both the rejection and recognition classes. The bottom of the figure, showing the projection of a random sampling of the test vehicles (all 1282 would be too dense for plotting), shows that both features are useful for separating vehicle 1 from vehicle 2. Examination of table 3 (column II in the nonlinear results) shows that at the two detection probabilities of interest improved false alarm performance has been obtained. Figure 33 shows the ROC curve for the resulting filter. It is evident that the nonlinear filter is a uniformly better test for classification.

Figure 32. Experiment II: Resulting feature space when orthogonality is imposed on the input layer of the MLP. In the top figure squares indicate the recognition class training exemplars, triangles indicate white noise rejection class exemplars, and plus signs are the images of vehicle 1a not used for training. In the bottom figure, squares are the peak responses from vehicles 1b and 1c, triangles are the peak responses from vehicles 2a and 2b.

Figure 33. Experiment II: Resulting ROC curve with orthogonality constraint.

Convinced that the filter represents a better test for classification than the linear filter, we now examine the results for the other features of interest. Figure 34 shows the output response of this filter for one of the images. As seen in the figure, a noticeable peak at the center of the output plane has been achieved. This shows that the filter maintains the localization properties of the linear filter.
In this way the characterization of the rejection class by its second-order statistics, the addition of the orthogonality constraint at the input layer of the MLP, and the use of a nonlinear topology have resulted in a superior classification test.

4.6.3 Experiment III: subspace noise training

The next experiment describes an additional modification to this technique. One of the issues in training nonlinear systems is the convergence time. Training methods which require overly long training times are not of much practical use. We have already shown how to reduce the training complexity by recognizing that we can sufficiently describe the rejection class with white noise sequences. We now show a more compact description of the rejection class which leads to shorter convergence times, as demonstrated empirically. This description relies on the well known singular value decomposition (SVD).

Figure 34. Experiment II: Output response to an image from the recognition class training set.

We view the random white sequences as stochastic probes of the performance surface in the whitened image space. The classifier discriminant function is, of course, not determined by the rejection class alone. It is also affected by the recognition class. We have shown previously that the white noise sequences enable us to probe the input space more efficiently than examining all shifts of the recognition exemplars. However, we are still searching a space of dimension equal to the image size, N1N2. One of the underlying premises of a data driven approach is that the information about a class is conveyed through exemplars. In this case the recognition class is represented by Nt < N1N2 exemplars placed in the data matrix x2 ∈ ℝ^(N1N2 × Nt). It is well known that x2, if it is full rank, can be decomposed with the SVD as

    x2 = U Λ V^T,    (53)

where the columns of U ∈ ℝ^(N1N2 × Nt) are an orthonormal basis that spans the column space of the data matrix, Λ is the diagonal matrix of singular values, and V is an orthogonal matrix.
This decomposition has many well known properties, including compactness of representation for the columns of the data matrix [Gerbrands, 1981]. Indeed, as has been noted by Gheen [1990], the SDF can be written as a function of the SVD of the data matrix:

    h_SDF = U Λ^(−1) V^T d.    (54)

We will use this recognition class representation to further refine our description of the rejection class for training. As we stated, the underlying assumption in a data driven method is that the data matrix x2 conveys information about the recognition class; any information about the recognition class outside the space of the data matrix is not attainable from this perspective. The information certainly exists, but there is no mechanism by which to include it in the determination of the discriminant function within this framework. This does, however, lead to a more efficient description of the rejection class. We can modify our optimization criterion to reduce the response to white sequences as they are projected into the Nt-dimensional subspace of the data matrix. Effectively this reduces the search for a discriminant function in an N1N2-dimensional space to an Nt-dimensional subspace.

The adaptation scheme of backpropagation allows a simple mechanism to implement this constraint. The adaptation of matrix W1 at iteration k can be written as

    W1(k + 1) = W1(k) + x_i(k) ε1^T(k),    (55)

where ε1 is a column vector derived from the backpropagated error and x_i(k) is the current input exemplar from either class presented to the network, which, by design, lies in the subspace spanned by the columns of U. From equation 55, if the rejection class noise exemplars are restricted to lie in the data space of x2, which can be achieved by projecting random vectors of size Nt onto the matrix U above, and W1 is initialized to be a random projection from this space, we will be assured that the columns of W1 only extract information from the data space of x2.
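Both the SDF expression of equation 54 and the subspace restriction can be checked with a small numerical sketch (the dimensions and data are synthetic stand-ins for the prewhitened data matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n, nt = 128, 6                     # N1*N2 and number of training exemplars

x2 = rng.normal(size=(n, nt))      # stand-in for the prewhitened data matrix
d = np.ones(nt)

U, s, Vt = np.linalg.svd(x2, full_matrices=False)   # equation 53: x2 = U diag(s) V^T

# Equation 54: h_SDF = U Lambda^-1 V^T d satisfies the constraints x2^T h = d.
h_sdf = U @ np.diag(1.0 / s) @ Vt @ d
print(np.allclose(x2.T @ h_sdf, d))            # constraints met with equality

# Subspace noise: project an Nt-dimensional random vector onto the basis U,
# so every rejection-class exemplar lies in the column space of x2.
nvec = rng.normal(size=nt)
x_rej = U @ nvec
residual = x_rej - U @ (U.T @ x_rej)           # component outside the subspace
print(np.linalg.norm(residual) < 1e-10)        # numerically zero
```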
This is because the columns of W1 will only be constructed from vectors which lie in the column space of U, and so will be orthogonal to any vector component outside that subspace. The search for a discriminant function is thus reduced from an N1N2-dimensional space to an Nt-dimensional subspace. Due to the dimensionality reduction achieved, we would expect the convergence time to be reduced.

This is the method that was used for the third experiment. Rejection class noise exemplars were generated by projecting a random vector, n ∈ ℝ^(Nt × 1), onto the basis U by x_rej = U n. In figure 35 the resulting discriminant function is shown as in the previous experiments, and the result is similar to experiment II. The classifier performance as measured in table 3 and the ROC curve of figure 36 are also nominally the same.

Figure 35. Experiment III: Resulting feature space when subspace noise is used for training. Symbols represent the same data as in the previous case.

Figure 36. Experiment III: Resulting ROC curve for subspace noise training.

There are, however, two notable differences. Examination of figure 37 shows that the output response to shifted images is even lower, allowing for better localization. This condition was found to be the case throughout the data set. Of more significance is the result shown in figure 38, in which we compare the learning curves of all of the experiments presented here. In this figure the dashed and dash-dot lines are the learning curves for experiments II and III, respectively. In this case the convergence rate was improved nominally by a factor of three, from 100 epochs to approximately 30 epochs. Here an epoch represents one pass through all of the training data.

4.6.4 Experiment IV: convex hull approach

In this experiment we present a technique which borrows from the ideas of Kumar et al [1994].
This approach designed an SDF which rejects images which are away from the

Figure 37. Experiment III: Output response to an image from the recognition class training set.

Figure 38. Learning curves for three methods. Experiment II: white noise training (dashed line). Experiment III: subspace noise (dash-dot line). Experiment IV: subspace noise plus convex hull exemplars (solid line).