AN INFORMATION-THEORETIC APPROACH TO SONAR AUTOMATIC TARGET RECOGNITION

By

RODNEY ALBERTO MOREJON

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2003

Copyright 2003 by Rodney Alberto Morejon

To the Memory of My Father

ACKNOWLEDGMENTS

The quest of pursuing a doctoral degree is often a long and arduous journey requiring enduring patience, perseverance, and personal sacrifice. It is not only the pursuer of the degree who requires this dedication and sacrifice, but also those who are close to him. I would like to express my extreme gratitude to those who have helped me throughout this extended journey. First of all, I would like to thank my advisor, Dr. Jose Principe, for his patience, support, and guidance over my long course of study, which has extended well beyond what I expected when I began nearly 10 years ago. I am also grateful to all the members of my advisory committee. I am especially thankful to Dr. Gerry Dobeck, who has served as a second mentor through the course of my research. Additionally, I cannot express my level of gratitude to Mr. Kirk Dye, without whom the opportunity to pursue this degree not just once, but twice, would never have been possible. I am also grateful to the Coastal Systems Station and Dr. Dave Skinner, whose support in providing both financial and temporal resources has been invaluable. Lastly, I am eternally grateful to my family, especially my parents and grandparents, without whom I would never have been exposed to the many opportunities I have enjoyed throughout my life. And finally, I would like to thank my daughter, Maria, who has become a welcome inspiration in my life. Her patience, even at age two, in allowing Daddy to finish his schoolwork despite the temptation of dozens of playtime activities has been remarkably cool. Now is the time to play.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Background
    Motivation
    Overview of Dissertation

2 SONAR AUTOMATIC TARGET RECOGNITION (ATR)
    Introduction
    Underwater Sonar Image Characterization
    General Multistage ATR
    Sonar Imagery ATR Methods
        Preprocessing/Image Normalization Approaches
        Detection Approaches
        Feature Extraction Approaches
        Classification Approaches
        Fusion Approaches
    Feature Extraction Overview
    Feature Selection
        Principal Component Analysis (PCA)
        Independent Component Analysis (ICA)
        Linear Discriminant Analysis (LDA)
        Other Common Methods
    Sonar Datasets
        Sonar Set 10 (SON10)
        Sonar Set 5 and 12 (SON512)
        Sonar Baseline Extracted Feature Sets

3 INFORMATION-THEORETIC LEARNING (ITL)
    Introduction
    Information Theory
    General Learning System Architecture
    Information-Theoretic Optimization Criterion
    Information-Theoretic Learning Approach

4 ALGORITHM OPTIMIZATION
    Introduction
    Baseline Entropy Algorithm
    Entropy Algorithm Factorization
    Exploitation of Symmetry
    Entropy Vectorization
    Results for the QMI Criterion
    Conclusions

5 ADVANCED PARAMETER SEARCH TECHNIQUES
    Introduction
    Baseline Technique: Gradient Descent (GD)
    Advanced Search Techniques
        Improved GD Algorithms
        Resilient Backpropagation (RP)
        Conjugate Gradient Algorithms
        Newton's Method
        Levenberg-Marquardt (LM) Algorithm
        Performance Divergence Measures
        Line Search Routines
    Difficulties and Remedies for ITL
        Adaptive Kernel Size
        Positive and Negative Performance Values
        Relative Error versus Absolute Squared-Error
    Summary of Advanced Parameter Search Algorithms for ITL
    Sample Problems
        Example 1: Maximum Entropy Feature Extraction
        Example 2: Frequency Doubler
        Example 3: Simplified Frequency Doubler
    Conclusions

6 BATCH TRAINING APPROACH
    Introduction
    Batch Training Approach
        Overview
        Batch Training Parameters
        Batch Selection Methods
        Other Batch Training Considerations
    Stochastic Information Gradient (SIG) Method
    Batch Training Comparison Results
        Performance Comparison Metric
        Variations of the SIG
        Batch Training with Advanced Parameter Adaptation
        Batch Selection Methods
        Batch Training Comparison with Additive Noise
        Batch Training Comparison with Oversampling
    Discussion on Batch Parameter Selection
    Conclusions

7 PROPERTIES OF THE INFORMATION ESTIMATORS
    Introduction
    Entropy Estimator
        Families of Entropy Curves
        Kernel Size Ranges
        Effects of Different Scale and a Scale-Invariant Entropy
        Sample Size Variations, Relative Entropy and Batch Size Estimation
        Entropy-Based Intrinsic Dimensionality Estimators
            Relative entropy intrinsic dimensionality estimator
            Information potential dimensionality estimator
            Dimensionality estimation examples
    Mutual Information Estimators
        Families of Mutual Information Curves
        Effects of Sample Size Variations
        Effects of Different Scale
        Effects of Dimensional Differences
    Conclusions

8 SONAR FEATURE EXTRACTION
    Introduction
    Informative Feature Mapping
        Approach
        Results
    Feature Subset Ranking and Selection
        Approach
        Results
    Discriminative Feature Mapping
        Approach
        Results
    Summary of Results
    Conclusions

9 CONCLUSIONS

APPENDIX: IMPLEMENTATION DETAILS FOR TRAINING ALGORITHMS
    Gradient Descent for ITL
    Levenberg-Marquardt with Line-Search for ITL

REFERENCE LIST
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Distribution of extracted feature exemplars for SON10
2-2 Distribution of extracted feature exemplars for SON512
2-3 The CSS baseline extracted feature set
4-1 Algorithm optimization: Factorization
4-2 Algorithm optimization: Symmetry with factorization
4-3 Algorithm optimization: Vectorized with symmetry and factorization
4-4 Algorithm optimization: EDQMI baseline and optimized results
5-1 Advanced parameter search algorithm modification summary for ITL
6-1 Batch training parameter settings
7-1 Comparison of dimensionality estimators
8-1 Exhaustive search feature sets and times for SON10 and SON512
8-2 Forward/backward search feature sets for SON10 and SON512
8-3 Summary of detection performance for SON10
8-4 Summary of detection performance for SON512

LIST OF FIGURES

2-1 Representative sonar image
2-2 Multistaged ATR approach
2-3 Normalized sonar image
2-4 Typical SON10 sonar image
2-5 Typical SON512 sonar image
3-1 General learning system architecture
4-1 Entropy estimator optimization results
5-1 Performance surface variation with kernel size
5-2 Performance divergence measure
5-3 Levenberg-Marquardt ITL algorithms: Weight tracks
5-4 Levenberg-Marquardt ITL algorithms: Training efficiency
5-5 Maximum entropy feature extraction
5-6 Feature extraction: Parameter search comparison in epochs
5-7 Feature extraction: Parameter search comparison in FLOPs
5-8 Frequency doubler: Input and desired signals
5-9 Frequency doubler: Network topology
5-10 Frequency doubler: Local minima outputs
5-11 Frequency doubler: Parameter search comparison
5-12 Desired and EDQMI optimal signals
5-13 Fixed kernel size weight tracks
5-14 Fixed and adaptive kernel size weight tracks for IC #3
5-15 Kernel size and training performance for RP and IC #3
6-1 Batch training comparison for GD
6-2 Batch training comparison for SCG
6-3 Batch training comparison for RP
6-4 Batch training comparison for LM
6-5 Batch training selection method comparison with GD
6-6 Batch training selection method comparison with SCG
6-7 Original and noise-corrupted signals
6-8 Batch training comparison: GD with noise
6-9 Batch training selection method comparison: GD and noise
6-10 Batch training comparison: GD with oversampling
6-11 Batch training selection method comparison: GD with oversampling
7-1 Family of entropy curves: Variable scale
7-2 Family of entropy curves: Variable sample size
7-3 Family of entropy curves: Variable dimensionality
7-4 Scale translation examples
7-5 Family of entropy curves: Kernel size equivalents
7-6 Relative entropy comparison across variable sample sizes
7-7 Relative entropy comparison for sample problems
7-8 Relative entropy comparison for SON512
7-9 Relative entropy ID estimation for SON512 training set
7-10 The PCA residual errors for SON512 training set
7-11 Dimension-normalized entropy curves
7-12 Grassberger-Procaccia and IP dimension estimators
7-13 Dimension estimation dependency on sample size
7-14 Scatter plot for mutual information experimentation data
7-15 Family of CSQMI curves: Variable sample size
7-16 Family of EDQMI curves: Variable sample size
7-17 Family of QMI information force curves: Variable sample size
7-18 Family of CSQMI curves: Variable scale ratios
7-19 Family of EDQMI curves: Variable scale ratios
7-20 Family of CSQMI curves: Variable equal dimensionality
7-21 Family of EDQMI curves: Variable equal dimensionality
7-22 Family of CSQMI curves: Variable different dimensionality
7-23 Family of EDQMI curves: Variable different dimensionality
7-24 Family of CSQMI curves: Variable dimensionality compensated
7-25 Family of EDQMI curves: Variable dimensionality compensated
8-1 Maximum entropy informative feature extraction for dimension reduction
8-2 Classification performance of MaxEnt and PCA features for SON10
8-3 Classification performance of MaxEnt and PCA features for SON512
8-4 Comparison of relative entropy of MaxEnt and PCA features
8-5 Maximum QMI feature ranker
8-6 Individual feature CSQMI for SON10 and SON512 training sets
8-7 Mutual information estimates with and without dimension compensation
8-8 Classification performance of QMI ranked features with forward selection for SON10 training set
8-9 Classification performance of QMI ranked features with forward selection for SON10 test set
8-10 Classification performance of QMI ranked features with backward elimination for SON10 training set
8-11 Classification performance of QMI ranked features with backward elimination for SON10 test set
8-12 Classification performance of QMI ranked features with forward selection for SON512 training set
8-13 Classification performance of QMI ranked features with forward selection for SON512 test set
8-14 Maximum QMI discriminative feature extraction
8-15 Maximum Linear QMI feature extraction for SON10 training set
8-16 Maximum Linear QMI feature extraction for SON10 test set
8-17 Maximum Linear QMI feature extraction for SON512 training set
8-18 Maximum Linear QMI feature extraction for SON512 test set
8-19 Class feature space in 2D for SON10 prior to training
8-20 Class feature space in 2D for SON10 after NLQMIC training
8-21 Comparison of 1D PDF for SON10 maximum QMI feature extraction
8-22 Comparison of 1D PDF for SON512 maximum QMI feature extraction
8-23 Minimum error bound comparison of LDA and QMI feature extraction
8-24 Comparison of CSS algorithms and QMI feature extraction for SON10
8-25 Comparison of CSS algorithms and QMI feature extraction for SON512

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

AN INFORMATION-THEORETIC APPROACH TO SONAR AUTOMATIC TARGET RECOGNITION

By

Rodney Alberto Morejon

May 2003

Chair: Dr. Jose C. Principe
Major Department: Electrical and Computer Engineering

Multistage automatic target recognition (ATR) algorithms provide an effective means for locating items of interest in sonar imagery by successively reducing the volume of data at each stage in order to produce a concise list of possible target locations. Traditionally, the stages in these algorithms are designed using statistical methods that are often based on a priori assumptions about the input data distributions. By extending recent advances in information-theoretic learning (ITL), this dissertation applies concepts of information theory to the feature extraction stage of the multistage sonar ATR paradigm. Because it incorporates information-theoretic performance metrics, the approach does not depend on a priori statistical assumptions about the data distributions. The widespread application of ITL has likely been hindered by two phenomena: computational complexity and difficulty of use. This dissertation addresses both of these issues in order to increase the applicability of ITL. First, the computation of the ITL criteria is optimized for more efficient evaluation.
Next, training efficiency is improved by tailoring some of the most popular advanced training algorithms for compatibility with ITL systems. To further improve training efficiency, a batch training approach to ITL is presented that reduces the quadratic complexity of the criterion. Finally, various properties of the information-theoretic criteria are investigated in order to achieve a better understanding of the ITL process in support of easier implementation. These advances in ITL efficiency and understanding are then applied to the problem of sonar feature extraction for ATR. This problem represents a difficult real-world challenge for the application of ITL and is representative of the class of problems that stand to benefit from a more widespread implementation of ITL principles. The information-theoretic feature extraction methods presented are shown to be more effective and robust than the most popular statistical methods and some of the currently fielded sonar ATR algorithms. The ITL methods presented provide an alternative approach for designing the subspace mapping functions that are prevalent throughout ATR systems.

CHAPTER 1
INTRODUCTION

The use of automatic target recognition (ATR) algorithms for the detection of items of interest in sonar data is a formidable problem that has received much attention from the defense, medical, and academic communities. Typically, a multistage approach is implemented that systematically reduces the volume of data from stage to stage by applying increasingly complex algorithms at each stage. Many of the traditional methods for designing these algorithms have been based on statistical methods that make a priori assumptions about the input data distributions. Since the information-theoretic learning (ITL) paradigm was developed specifically to avoid these assumptions [Pri00], an information-theoretic approach to sonar ATR promises the potential for more robust detection performance.
In the preliminary phases of this research, the initial goal was to develop novel approaches, based on information-theoretic criteria, for addressing the functionality of several stages in the multistage sonar ATR paradigm. Originally, the motivation was to highlight the versatility of ITL across a wide range of mapping functions in the multistage ATR process. However, initial experimentation with real sonar datasets highlighted some of the practical challenges of applying ITL techniques to larger-scale problems. The most pronounced challenge was the computational complexity: many minutes were needed to compute a single information-theoretic metric. The lack of implementation guidelines for selecting the parameters associated with ITL also contributed to the original difficulty of use. These challenges reshaped the focus of this research, and the ultimate objectives evolved into the following:

* Reducing the computational complexity of the ITL process.
* Increasing the understanding of ITL in order to facilitate the implementation process.
* Applying ITL to one stage of the sonar ATR paradigm, namely, the feature extraction stage.

The remainder of this chapter presents some basic background information on sonar ATR and ITL, along with the motivation, from both application and academic perspectives, for the work presented herein. An outline of the remaining chapters of the dissertation concludes this chapter.

Background

The problem of identifying items of interest in sonar images is of prime concern to many groups, ranging from the medical professional to the military. Specially trained human operators are called upon to perform the tedious task of examining sonar images to locate and identify targets or items of interest.
In certain applications, particularly in defense-related areas, the need for continuous monitoring of a real-time data stream makes this a very demanding and stressful task, requiring frequent shift changes in order to maintain fresh eyes. To alleviate the stress on human operators, ATR algorithms have been developed to assist, supplement, and in some cases completely replace the man-in-the-loop. These algorithms do not get tired and always maintain a constant vigil. The overall utility of an ATR algorithm is typically measured using metrics of target detection performance, runtime computational efficiency, and training efficiency. The detection performance measures how well the algorithm can separate items of interest from the background or clutter. Detection performance is usually characterized as a function of detection probability and false-alarm rate using a receiver operating characteristic (ROC) curve. In many applications, sonar target analysis requires that an ATR algorithm rapidly or continuously process large volumes of data. In these cases, the algorithm's utility is measured not only by detection performance, but also by its ability to complete its processing as quickly, or efficiently, as possible so as to keep up with the input data stream. For these cases, computational efficiency is also of great importance in ATR design. Lastly, the training efficiency, in terms of the volume of input data and the time required for training, can also be a limiting factor in the utility of a given ATR design. Many sonar ATR algorithms have been designed using a variety of theoretical methods including detection theory, statistical pattern recognition, matched filtering, neural networks, wavelet transforms, and model-based methods [Ari98, Dob97, Sac99, Syz99].
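As a concrete illustration of the ROC characterization described above, the sketch below sweeps a decision threshold over hypothetical detector scores and records the detection probability and false-alarm rate at each setting. The function name and the toy scores are illustrative assumptions, not part of any fielded ATR system.

```python
def roc_points(target_scores, clutter_scores):
    """Return (false-alarm rate, detection probability) pairs for each
    threshold, sweeping from the most to the least conservative setting."""
    thresholds = sorted(set(target_scores) | set(clutter_scores), reverse=True)
    points = []
    for t in thresholds:
        # A sample is declared a detection when its score meets the threshold.
        pd = sum(s >= t for s in target_scores) / len(target_scores)
        pfa = sum(s >= t for s in clutter_scores) / len(clutter_scores)
        points.append((pfa, pd))
    return points

# Toy detector outputs: targets mostly score high, clutter mostly low.
curve = roc_points([0.9, 0.8, 0.4], [0.6, 0.2, 0.1])
# The sweep always ends at (1.0, 1.0), where everything is declared a target.
```

Plotting detection probability against false-alarm rate over such a sweep yields the ROC curve used to compare detector designs.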
In general, most of these algorithms take a multistaged approach consisting of the following or similar stages: a preprocessing or image normalization stage, a coarse detection or clutter rejection stage, a feature extraction stage, a classification stage, and a fusion stage if multiple sensor sources or algorithms are available. From one perspective, each of these stages can be viewed as an information filter or mapping function that seeks to filter out the information that is not relevant to the next stage of ATR, thus reducing the dimension and/or volume of the data at each successive stage. Interestingly, little work has applied the concepts of information theory to the design of ATR systems. Historically, this was mostly because of the increased computational complexity that is a consequence of most information-theoretic implementations. However, recent developments in information-theoretic learning systems have led to a more computationally tractable approach to applying information-theoretic criteria to trainable systems [Pri00]. While these developments have significantly advanced the feasibility of ITL systems, two difficulties still hamper their more widespread applicability. First, the computational complexity of the criterion function, although significantly reduced from previous implementations, is proportional to the square of the number of samples. While this complexity is acceptable for applications with small datasets, it can quickly become computationally prohibitive with some of the larger datasets that are commonly encountered in real-world applications. The second problem most likely stems from the relative newness of the ITL method. Although the method has been applied to solve a wide range of problem classes, these solutions have been developed almost exclusively by experts in the field.
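As a rough illustration of where the quadratic cost mentioned above comes from, the sketch below estimates Renyi's quadratic entropy through the Parzen-window information potential commonly used in ITL: a Gaussian kernel evaluated over every pair of samples, hence N*N kernel evaluations per criterion evaluation. The one-dimensional form and the default kernel size are simplifying assumptions for illustration only.

```python
import math

def quadratic_entropy(samples, sigma=1.0):
    """Estimate Renyi's quadratic entropy H2 = -log(IP), where the
    information potential IP is a double sum of Gaussian kernels over
    all sample pairs -- the O(N^2) bottleneck discussed in the text."""
    n = len(samples)
    pair_var = 2.0 * sigma * sigma                  # variance of G_sigma convolved with itself
    norm = 1.0 / math.sqrt(2.0 * math.pi * pair_var)
    ip = 0.0
    for xi in samples:                              # every pair is visited: N*N terms
        for xj in samples:
            d = xi - xj
            ip += norm * math.exp(-d * d / (2.0 * pair_var))
    return -math.log(ip / (n * n))

# Spread-out samples yield a smaller information potential and
# therefore a higher entropy estimate than tightly clustered ones.
assert quadratic_entropy([0.0, 1.0, 2.0, 3.0]) > quadratic_entropy([0.0, 0.1, 0.2, 0.3])
```

Doubling the number of samples quadruples the work in the double loop, which is why batch strategies that evaluate the criterion on subsets of the data are attractive for large datasets.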
Little work has been published to establish general guidelines for ITL implementations or to provide a better understanding of the information estimators so that the concepts of ITL can be used by a more mainstream group of users. The primary objective of this dissertation is to advance the science of ITL by improving the computational efficiency and understanding of the ITL process. These advancements in efficiency and understanding serve as tools that allow information-theoretic concepts to be applied to the sonar ATR problem. Within the context of sonar ATR, the efforts of this research focus on the feature extraction stage, with the objective of developing novel methods for feature extraction that provide improved detection performance while optimizing the computational and training efficiencies of the ITL process. The utility of these methods is measured through comparisons with some existing and common approaches to the feature extraction problem. Motivation Existing Navy sonar ATR algorithms provide detection performance on a level that is comparable to that of a human operator [Dob97]. In many cases, however, the computational requirements of these algorithms are very high. The Navy has identified the need for improving the performance of existing ATR algorithms and has funded various research and development efforts at the Coastal Systems Station to investigate approaches for improving the detection performance and efficiency of ATR algorithms. Improving detection performance brings significant and obvious benefits to the Naval fleet, such as increased safety, reduced losses, and faster fleet response times. Improved computational efficiency can allow for real-time vs. off-line data processing, can reduce cost and complexity, and can make simulation-based analysis feasible. Better training efficiency helps to speed up the design process and can potentially allow for the use of more complex detection algorithms.
By working closely with recognized experts in Naval sonar ATR and ITL, the results of this dissertation can potentially provide advancements to the field of sonar ATR in these areas. Many of the functions that take place within the various stages of an ATR system share a common goal: to create a mapping from the input space to a synthetic space that optimizes some figure of merit: * Normalization creates a full-rank projection that produces some desired background statistics. * Feature extraction creates a subspace projection that possesses informative or discriminative characteristics of the input data. * Classification creates a subspace that optimizes some measure of class membership. These similarities motivate the investigation of a general approach to address this common functionality. Further motivation is provided by the fact that, in many cases, the traditional approaches used to address these functions are based on statistical assumptions that are not valid in general. Information theory provides an attractive alternative. First, ITL methods make no assumptions about the probability density functions of the data, eliminating the errors associated with incorrect model assumptions. Second, ITL criteria are very flexible and can be used in various supervised and unsupervised ways. This flexibility makes the approach readily applicable to a wide array of mapping-function problems, including feature extraction, data fusion, and classification. The last, and perhaps most attractive, feature of ITL is due to the very nature of information theory. Information theory concerns itself with the information content of messages, optimum information compression, and information equivalence. These concepts are richer and more robust for information processing than simpler measures such as mean-square error (MSE).
Since each stage of an ATR system is effectively an information filter, designing each stage based on information-theoretic criteria is a natural conclusion. For the purpose of maintaining focus and limiting the scope of the work presented, the results in this dissertation are constrained to the feature extraction stage of ATR. However, the approaches and methods described herein can be readily applied and extended to other stages of the ATR problem. In fact, ongoing research is currently investigating the use of ITL concepts for the data fusion and classification stages of the sonar ATR problem. Overview of Dissertation Chapter 2 includes an overview of the general multi-stage ATR approach and describes some recent implementations that have been used specifically for sonar ATR, with special emphasis placed on the feature extraction stage. Chapter 2 also includes an overview of the characteristics of sonar imagery along with a description of the sonar datasets that have been made available to support this work. Chapter 3 explains the basic concepts of information theory and the application of information-based criteria to general learning systems, and summarizes the ITL approach of Principe et al. Chapter 4 presents an optimized implementation of the entropy and mutual information criterion estimators for ITL. In Chapter 5, training efficiency is addressed by adapting several popular advanced parameter search techniques for use with ITL systems. Chapter 6 introduces a batch approach to training ITL systems that reduces computational complexity by partitioning the dataset into smaller batches. Chapter 7 examines some of the properties of the ITL criterion estimators, explores the relationships between the estimator parameters, and helps to increase overall understanding of the ITL process. Chapter 8 presents several information-theoretic techniques to address the feature extraction problem using real-world sonar data.
Chapter 9 summarizes the contributions of this research and presents avenues for future research. CHAPTER 2 SONAR AUTOMATIC TARGET RECOGNITION (ATR) Introduction The development of automated techniques for target recognition has been the subject of much research for a wide range of sensor systems. In the context of underwater sonar imagery, private industry and the military have supported the development of many algorithms that attempt to reduce the need for, or improve the performance of, a man-in-the-loop. Throughout this body of research, the multi-stage paradigm has emerged as the virtually universal underlying architecture for most implementations [Ari98, Cha99, Dob97, Guo99, Hyl95, Lau98]. This chapter presents a summary of the multi-stage sonar ATR process with special emphasis on the feature extraction stage. First, an overview of the general characteristics of sonar imagery is presented. Next, the typical multi-stage ATR process is described for general sensor systems. This is followed by a description of the common methods that have been developed to address the various stages of ATR for sonar imagery. The next section focuses more closely on the feature extraction problem and reviews some of the most common methods and terminology. The chapter concludes with a description of the real-world sonar datasets that have been made available to support this research. Underwater Sonar Image Characterization Successful target recognition in two-dimensional imagery is highly dependent on the characteristics of the type of sensor used to produce the image, in addition to the environmental conditions encountered during collection. Sonar imagery has unique characteristics that distinguish it from data of other sources such as radar, infrared, and visible light. Underwater sonar images are collected from a vehicle moving through the water with a sonar transceiver.
The transmitter emits a periodic signal of acoustic energy, or ping, while the receiver listens for the return signal. The image is formed sequentially by row, or ping-by-ping, as the vehicle makes headway through the water. The following features typically characterize most sonar imagery: 1. Attenuation in signal with increasing range. 2. Banding caused by automatic gain correction. 3. Motion effects caused by nonlinear vehicle movement. Figure 2-1. Representative sonar image. Figure 2-1 shows a representative starboard (right) side-scan sonar image with the vehicle track moving up the left side of the image. The figure shows the three general features identified above, marked with their respective numbers. In sonar ATR, the objective is to locate the objects of interest, or targets, while rejecting the background, or acoustic clutter. Environmental conditions that contribute to acoustic clutter include physical bottom clutter, reverberation, and multipath returns [Nel96]. Targets in sonar imagery are typically identified by a bright highlight followed by a dark shadow. Figure 2-1 contains five such targets, designated by the Tn markings; the targets are immediately to the right of the labels. As is evident from the figure, sonar images typically contain non-target highlights and shadows from acoustic clutter that often result in false alarms, or false target indications, when the image is processed. General Multi-stage ATR One of the most prevalent solutions to the ATR problem is the multi-stage approach. This approach uses multiple algorithms that attempt to reduce the data volume from stage to stage. A typical example of this approach is shown in Figure 2-2. Generally, the complexity of the algorithm increases from left to right as the data volume decreases.
For example, the detector will usually process the entire image with a simple algorithm, while the classifier will focus a more complex algorithm on a small fraction of the image or on a set of features extracted from the image. The five most common components include: the preprocessor, the detector, the feature extractor, the classifier, and the fusion stage. Brief descriptions of the general functionality of these stages are listed below. Preprocessor. The preprocessor stage of an ATR is typically used to apply some type of normalization technique over the entire image. Even though normalization is normally not a highly complex task, the fact that it is usually applied to every pixel can make it a significant contributor to the overall computation total. Detector. The detector, sometimes referred to as a prescreener or focus-of-attention stage, is typically responsible for the greatest reduction of data volume in the ATR process. Since it is applied over the entire image, algorithm efficiency is critical in keeping the total computations in check. The detector is usually set to a very high detection probability (90-100%) in order to allow most of the targets through to the more sophisticated stages of processing. Figure 2-2. Multi-stage ATR approach (raw images flow through the preprocessor, detector, feature extractor, and classifier, successively reducing the data to processed images, sub-image chips, feature vectors, and target locations, with a fusion stage combining multiple inputs). Feature extractor. In the feature extractor, data reduction is achieved through information abstraction or compression rather than by data screening (as in the detector). This stage is used to convert the pixel representation of the screened sub-images into a more concise feature set that represents the salient characteristics of the sub-image. Some of the features might relate to physical characteristics of the target, such as size and orientation.
Other features might represent statistical characteristics of the sub-image. Classifier. The classifier stage is typically the final stage of an ATR. With the list of feature vectors as input, it is responsible for determining which feature sets correspond to targets and which represent false alarms. Classifiers can be as simple as a threshold operation or as complicated as nonlinear decision surfaces. The greater the number of features in the feature vector, the more complex a classifier design can be. Fusion stage. In cases where more than one image, sensor, or algorithm is present, a fusion stage is often used to combine the results. Typically, the results are combined after the classifier using some statistical means, particularly if the classifier can provide confidence levels for each target nomination. Other fusion techniques use logical operations such as the AND function or fuzzy logic techniques. Although fusion is described here as a post-classification stage, in practice it can be applied at any stage of the ATR process, as will be discussed in the sections below. Sonar Imagery ATR Methods Preprocessing/Image Normalization Approaches Upon inspection of Figure 2-1, it is reasonable to conclude that image normalization would help compensate for the range attenuation and bring the darker targets out of the shadows. In general, this is the first stage of most ATR approaches reviewed in the literature [Ari98, Cha99, Dob97, Guo99]. Most normalization techniques attempt to bring the background to a constant level throughout the image so that shadow and highlight levels are compared to a consistent background in subsequent stages of ATR. The results of a sequential ping-by-ping approach shown in Figure 2-3 are typical. Clearly, the targets are more visible to the human eye in the normalized image.
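As an illustration of the ping-by-ping idea, the sketch below normalizes each row (ping) of an image by a moving-average estimate of its own background level. The window length and the choice of a simple boxcar average are illustrative assumptions, not the algorithm used to produce the normalized image in the figure.

```python
import numpy as np

def normalize_pings(img, win=31, eps=1e-6):
    """Ping-by-ping normalization: estimate the background along each row
    (ping) with a moving average, then divide it out so that the background
    level is brought to roughly 1 across the whole image."""
    img = np.asarray(img, dtype=float)
    kernel = np.ones(win) / win
    out = np.empty_like(img)
    for r, ping in enumerate(img):
        bg = np.convolve(ping, kernel, mode="same")  # local background level
        out[r] = ping / (bg + eps)                   # eps avoids divide-by-zero
    return out

# A constant-background image and one with a linear range-attenuation trend:
flat = np.full((4, 100), 10.0)
trend = np.tile(np.linspace(10.0, 20.0, 200), (2, 1))
```

After normalization, both images sit near a background level of 1 away from the edges, so a highlight a fixed factor above the background produces the same normalized response regardless of range, which is exactly what the subsequent threshold-based detection stage needs.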
Other normalization techniques applied to sonar image analysis include range-based normalization, forward and backward filtering [Dob99], logarithms [Lan00], and low-pass median filtering [Dob97]. In addition to normalization, other techniques are sometimes used as part of the preprocessing stage of sonar ATR. The most common of these is image decimation through pixel combination [Ari97, Jes99, Mor97]. This step sacrifices image resolution for noise reduction and computational complexity reduction. The most common decimation technique is pixel addition; however, Aridgides uses the 'largest-of', or OR-ing, function for the pixels in each decimation block. The effect of these decimation approaches is similar to that of the wavelet transform methods discussed below. Figure 2-3. Normalized sonar image. Detection Approaches The detection stage usually results in the greatest reduction in data volume. Since it is applied to the entire image, it is desirable to keep the computational complexity of this stage to a minimum. The detector typically applies a simple statistical threshold operation to a mask that scans the image. Since the threshold operation may be highly dependent on any preprocessing or normalization, detection is often closely tied to the preprocessing stage. A variety of approaches has been applied to the detection stage of sonar ATR. Common algorithms combine highlight and shadow detection schemes that look for pixel counts and intensities that are a threshold above or below the levels of the normalized background. Dobeck et al. applied a range-dependent nonlinear matched filter as part of the Advanced Mine Detection and Classification (AMDAC) algorithm that attempts to match pre-target, highlight, dead-zone, and shadow areas across the image [Dob97]. Bello proposed a Markov random field anomaly detector [Bel95b].
Lau and Chao combined a morphological filter and simplified Holmes double-gated filter with image sieves (based on wavelet transforms) that pass sub-images of the appropriate size. The Holmes double-gated filter, also referred to as the two-parameter constant false alarm rate (CFAR) detector, is commonly used as a detection algorithm in ATR systems for its simplicity and performance in variably cluttered imagery. The typical CFAR stencil includes a center window for the area under test along with a surrounding band to compute local statistics. The detector uses a simple threshold to accept candidate target areas based on (2-1), which essentially compares the average intensity of the area under test with the average intensity of the local area, normalized by the standard deviation of the local area. (x̄_test − x̄_local) / σ_local > threshold (2-1) Feature Extraction Approaches Typically, in an ATR system, the detector and classifier achieve data reduction by eliminating candidate target areas from further contention as possible targets. The feature extractor also performs data reduction, but in a different way. The feature extractor converts an image or sub-image from a pixel representation to a feature list that is designed to represent the key characteristics, or features, of a potential target. Like detection algorithms, feature extraction algorithms are often tied to the previous stages of their particular ATR processing string. It is common, for example, to use pixel statistics or coefficients that were determined as a consequence of the preprocessing or detection stages as part of the feature set that is passed on to the classification stage. Common features of this type include highlight and shadow pixel counts, intensities from both raw and filtered images, and wavelet coefficients. Other features, describing the physical characteristics of the detected object, are often extracted from the raw or filtered image.
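Returning to the detection stage for a moment, the two-parameter CFAR test above (test-window mean minus local-band mean, divided by the local-band standard deviation) can be sketched as follows. The window geometry, the test image, and the half-width parameters are illustrative choices, not the stencil of any fielded detector.

```python
import numpy as np

def cfar_statistic(image, r, c, t_half=1, g_half=2, b_half=4):
    """Two-parameter CFAR statistic at pixel (r, c):
    (mean of test window - mean of local band) / std of local band,
    where the local band is the outer window with a guard region cut out
    so the target itself does not contaminate the background statistics."""
    test = image[r - t_half:r + t_half + 1, c - t_half:c + t_half + 1].astype(float)
    outer = image[r - b_half:r + b_half + 1, c - b_half:c + b_half + 1].astype(float)
    mask = np.ones(outer.shape, dtype=bool)
    ctr = b_half                                   # center of the outer window
    mask[ctr - g_half:ctr + g_half + 1, ctr - g_half:ctr + g_half + 1] = False
    local = outer[mask]                            # surrounding band only
    return (test.mean() - local.mean()) / local.std()

# A 9x9 checkerboard background (mean 10, std 1) with a bright 3x3 target:
yy, xx = np.mgrid[0:9, 0:9]
img = np.where((yy + xx) % 2 == 0, 9.0, 11.0)
img[3:6, 3:6] = 30.0
stat = cfar_statistic(img, 4, 4)   # large positive value -> candidate target
```

On the bright target the statistic is large (here exactly 20, since the local band has mean 10 and standard deviation 1), while on pure background it stays near zero, so a fixed threshold yields a roughly constant false alarm rate regardless of the local clutter level.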
These features include items such as object length, width, and orientation. Additionally, features are often extracted based on some statistical or mathematical properties of the detected objects. For example, Lau and Chao extract the Hermite coefficients from each detected sub-image as their primary feature set [Lau98]. Kronquist et al. compute a large base of wavelet coefficients and select the most discriminant basis [Kro99]. Chandran combines pixel statistics with higher-order spectral features including the bispectrum and trispectrum [Cha99]. In many cases, all the extracted features are passed on to the classification stage. However, in some instances, feature reduction is performed to reduce the dimensionality of the feature set and thus avoid Bellman's curse of dimensionality [Bel61]. This is an important issue in sonar ATR since the available datasets are limited in size. Using too many features leads to poor generalization, while too few features may not convey enough information to discriminate between classes. Dobeck et al. explore selecting the five most discriminating features incrementally, in addition to a combined forward and backward search [Dob97, Hyl95]. Chandran combines his feature set using a variation of principal component analysis with three classes [Cha99]. Aridgides uses a feature orthogonalization procedure to derive independent features from the original set [Ari98]. The next major section describes the feature mapping process in detail and describes some of the most common approaches to feature extraction and selection. Classification Approaches In general, the classifier stage is designed to take the selected feature space and partition it into target and non-target subspaces based on the available training vectors. By far, the most common implementation for the classifier is the multilayer perceptron (MLP) neural network trained with the MSE optimization criterion and backpropagation learning rule [Guo99, Lau98, Sac99, Szy99].
One novel approach is Dobeck's k-nearest neighbor (KNN) neural network containing attractor and confidence layers [Hyl95]. This network uses the training set to determine the location, number, and size of the attractors, which are based on radial basis functions. The confidence layer computes the likelihood that a feature set belongs to a given class. The KNN classifier is quickly trained by non-iterative methods and is computationally efficient during runtime as well. Dobeck subsequently combined the KNN classifier with an optimal discriminatory filter classifier (ODFC), based on linear discrimination theory, by ANDing the two outputs [Dob97]. Chandran proposed a sequential combination of thresholding, minimum-distance, and KNN classifiers. Guo proposed a wavelet channel clustered MLP topology that limits the full connectivity of certain neurons to specific wavelet coefficients [Guo98]. Aridgides uses an implementation of the log-likelihood ratio test combined with the precomputed histograms of orthogonalized features derived from the training set. Fusion Approaches Fusion can take place at various stages and in several ways within an ATR system. The two major categories are multi-sensor fusion and multiple-algorithm fusion. Multi-sensor fusion combines different data sources (which can be processed with the same or different algorithms), while multiple-algorithm fusion combines the application of different algorithms to the same data. Combination of low-frequency and high-frequency sonar imagery is multi-sensor fusion, while Dobeck's approach of ANDing two classifiers that are trained with the same data is an example of algorithm fusion. Multi-sensor fusion can take place at any stage in the ATR from preprocessing to post-classification. Aridgides proposed a pre-detection 3-D adaptive clutter filter that combines high- and low-frequency sonar imagery for subsequent detection stages [Ari97].
Although this approach realized improvements over individual-sensor ATR, he subsequently achieved better results with post-classification fusion using simple ANDing [Ari98]. Dobeck used a post-classification fuzzy ANDing based on class confidence measures that also produced significant improvements over individual sensors. Sacramone uses a post-classifier neural network that combines multiple sensors by operating on the combined feature space of both sensors [Sac99]. The benefits of algorithm fusion in reducing false alarm rates were demonstrated separately by Dobeck, Huang, and Aridgides when they combined full ATR processing strings from the Coastal Systems Station (CSS), Raytheon, and Lockheed Martin [Ari00, Dob00, Hua00]. Guo and Szymczak also demonstrated improved performance by combining multiple ATR processing strings on the same input data [Guo99]. Feature Extraction Overview The problem of extracting features from datasets, also referred to as feature mapping, is prevalent across a wide range of fields from control systems to signal processing to pattern recognition. The goal is to find a mapping function, g: R^K -> R^M, from one space to another, described by the equation Y = g(X), where X ∈ R^K and Y ∈ R^M, that optimizes a given criterion. The mapping function can be linear (e.g., a matrix) or nonlinear (e.g., an MLP). The mapping can be volume preserving when M = K (e.g., digital filtering and normalization), volume reducing when M < K (e.g., feature extraction, feature selection, and compression), or volume expanding when M > K (e.g., decompression and reconstruction). The goal of the mapping criteria can also take on a variety of forms including signal representation, class discrimination, and information fusion. For this discussion, the primary focus is on volume-reducing and volume-preserving mappings for feature extraction. The feature extraction process can be data-driven, based on prior knowledge, or some combination of these methods.
Data-driven feature extraction relies exclusively on the data, the form of the mapping function, and the mapping criteria to arrive at the set of new features. In other cases, when relationships are known to be important from prior knowledge, it is often useful to generate features based on this information. For example, the ratio between the height and width of an object might be a significant factor in discriminating between object classes. It is often much more efficient to compute these features directly than to depend on the mapping process to extract some representation of this phenomenon, as in the data-driven method. Feature extraction for dimension reduction can be done by discarding a subset of the original data, known as feature selection, or through linear or nonlinear combinations of the original data. One example in image processing is pixel averaging. In most situations, dimension reduction results in loss of information. Therefore, one of the main goals in designing a good feature extraction system is to ensure that as much of the relevant information as possible is retained [Bis95]. In feature extraction for pattern recognition, particularly when the input is a high-dimensional space relative to the amount of training data, one of the primary goals is to reduce the effects of the curse of dimensionality [Bel61]. As described by Bellman, this curse refers to the exponential growth in the need for training data as the dimension of the feature space increases. In the typical case where limited training data exists, this is addressed with a volume-reducing mapping function where the minimum number of features, M, is sought that can still adequately represent the information to the subsequent process. If the subsequent process is to design a classifier that is expected to generalize to unseen datasets, care must be taken not to impose class separability as the mapping criterion until the dimensionality of the features has been adequately reduced.
Failure to do so often results in poor generalization performance. Fukunaga distinguishes between feature mappings for signal representation and mappings for classification [Fuk90]. Rubinstein and Hastie refer to these as informative and discriminative classification systems, respectively [Rub97]. In the context of feature selection, Koller and Sahami refer to these as filter and wrapper methods, respectively [Kol96]. An informative, filter, or signal representation mapping does not take into account class information during feature extraction and is concerned with finding the principal descriptive components or class feature densities of the data. Discriminative mappings focus on finding the class boundaries with no consideration for signal representation. The sequential combination of informative and then discriminative mappings can be used to capture the strengths of both methods. However, care must be taken not to remove the discriminative features of the data during the volume reduction of the informative stage. In practice, this is difficult to ascertain. The sections that follow describe several of the most commonly used methods for both informative and discriminative feature extraction. Feature Selection One of the most basic implementations of feature extraction is feature selection. Feature selection is a straightforward linear volume-reducing method that simply selects a subset of the original features as the new feature set, with the goal that they best represent or discriminate the original signals. Selection of the best M out of N features is a combinatorial problem that quickly becomes prohibitive with increasing dimension. Common search techniques for avoiding exhaustive methods for feature selection include forward selection and backward elimination. Forward selection adds the most relevant of the remaining features to the new feature space.
Conversely, backward elimination drops the least relevant feature from further consideration. However, neither method guarantees the optimal set. Branch and bound methods are guaranteed optimal for monotonic criteria. Independent features allow for the selection of the M individually best features. Feature selection is useful if some inputs carry little relevant information or if there are strong correlations between sets of inputs so that the data is redundant. The feature selection criterion for classification problems is optimally the probability of misclassification, or total risk, but in practice this may be too difficult to compute or estimate. Simplified criteria include measures of class separability based on covariance matrices. Some of these criteria are monotonic with respect to dimension, so they are not useful for comparing feature sets of different dimensionality. Cross-validation techniques can be used in these cases [Bis95]. Koller and Sahami proposed an information-based approach to filter-style feature selection where the goal is to minimize the information loss between the full feature set and the subset. The information loss is estimated by the cross-entropy between individual features. Features are rejected using a backward elimination strategy that drops the feature with minimum cross-entropy until a predetermined number of features remains [Kol96]. Principal Component Analysis (PCA) Principal component analysis, also called the Karhunen-Loeve transform, is a linear, informative, unsupervised method for creating a feature mapping for signal representation. Typically, when used for feature extraction, the PCA mapping reduces the dimensionality of the feature space.
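Before turning to PCA, the greedy selection strategies just described are easy to sketch. The code below implements forward selection; the Fisher-style separability score is a hypothetical stand-in for whichever criterion a real system would use, and, as the text notes, the greedy result carries no optimality guarantee.

```python
import numpy as np

def forward_select(X, y, n_keep, score):
    """Greedy forward feature selection: repeatedly add the single feature
    that most improves score(X_subset, y).  X is (n_samples, n_features)."""
    chosen = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_keep):
        best = max(remaining, key=lambda f: score(X[:, chosen + [f]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def fisher_score(Xs, y):
    """Illustrative two-class separability criterion: squared mean difference
    over pooled within-class variance, summed across the selected features."""
    a, b = Xs[y == 0], Xs[y == 1]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0) + 1e-12   # guard against zero variance
    return float((num / den).sum())

# Feature 0 separates the classes; features 1 and 2 are noise or constant.
X = np.array([[0., 0., 5.], [0., 1., 5.], [10., 0., 5.], [10., 1., 5.]])
y = np.array([0, 0, 1, 1])
```

Backward elimination is the mirror image: start from the full set and repeatedly remove the feature whose deletion hurts the score least.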
The objective is to represent the input signal as closely as possible with a linear combination of a reduced set of orthonormal vectors as follows: X = Σ_{i=1}^{N} y_i u_i ≈ Σ_{i=1}^{M} y_i u_i + Σ_{i=M+1}^{N} b_i u_i (2-2) where the u_i form an orthonormal basis to represent X, the y_i are the feature values, and the b_i are constants. In the approximation, the first M basis vectors, or principal components, are selected to minimize the sum of the squared error in signal representation. It turns out that these principal components are the eigenvectors that correspond to the M largest eigenvalues of the covariance matrix of X. An interesting property of PCA is that the effectiveness of each feature for signal representation is determined by its corresponding eigenvalue. In addition, the output features are uncorrelated, and in the case where X is normally distributed, the features are independent [Fuk90]. Also of note is that, when the distribution of X is normal, PCA is the solution that arises when the goal is to minimize information loss, maximize output entropy, or maximize mutual information between the inputs and outputs [Dec96]. Independent Component Analysis (ICA) Independent component analysis is a generalization of PCA that seeks to extract features that are as independent as possible. The problem is to find a mapping, g(X), that maximizes some measure of the statistical independence of the output features. Deco and Obradovic propose two criteria for evaluating statistical independence: low-order cumulant expansion and mutual information between the output components. The cumulant method is based on satisfying statistical independence properties in the Fourier space. Under Gaussian assumptions, the mutual information approach reduces to minimization of the sum of the output variances. Without Gaussian assumptions, an Edgeworth expansion is proposed for estimation of the output entropies.
It should be noted, however, that both methods are based on the restrictive assumption that g is an invertible, volume-preserving mapping [Dec96]. Bell and Sejnowski propose a nonlinear approach to ICA for source separation problems. They develop a learning rule for adapting neural network weights that is based on maximizing the output entropy [Bel95a]. Linear Discriminant Analysis (LDA) Linear discriminant analysis is a linear, discriminative method for performing a dimension-reducing feature mapping that attempts to preserve as much class-discriminatory information as possible. The goal is to find the optimum linear transformation matrix, A, that produces a new M-dimensional space, Y, according to the relationship Y_{M×1} = A^T_{K×M} X_{K×1} (2-3) where M+1 is the number of classes. For the case of a two-class problem, the transformation matrix is simply a vector. The method looks for the linear transformation that minimizes the within-class scatter, Sw, while maximizing the between-class scatter, Sb. It turns out that the LDA features are the eigenvectors that correspond to the largest eigenvalues of the product of the inverse of Sw with Sb [Fuk90]. Other Common Methods One common approach to feature extraction is to make use of attributes derived from applying basic statistical or mathematical functions to the data. Some of these features include means, medians, variances, moments, and other higher-order statistics. In some cases, histogram bin values, ratios, or geometries are used, particularly where prior knowledge dictates the importance of the feature. Another approach is to use the coefficients of an alternate or transformed version of the input. For example, features such as wavelet transform, DCT, or FFT coefficients are commonplace. These transformations of the input data are often desirable because of the inherent properties of the transformation space.
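Both the PCA and LDA solutions described above reduce to eigendecompositions, which makes them easy to sketch side by side. The snippet below is a minimal illustration, not an ATR-specific implementation; it assumes the within-class scatter matrix is invertible and relies only on generic numpy linear algebra.

```python
import numpy as np

def pca_features(X, M):
    """Project onto the M eigenvectors of the sample covariance with the
    largest eigenvalues (the principal components).  X: (n_samples, n_dims).
    Returns the feature values and the basis vectors."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending order
    U = vecs[:, np.argsort(vals)[::-1][:M]]                # keep top-M
    return Xc @ U, U

def lda_directions(X, y, M=1):
    """Eigenvectors of inv(Sw) @ Sb with the largest eigenvalues, where Sw
    and Sb are the within- and between-class scatter matrices."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    return vecs[:, np.argsort(vals.real)[::-1][:M]].real
```

The contrast between the two is exactly the informative/discriminative distinction drawn earlier: PCA never sees the labels and keeps the directions of largest variance, while LDA uses the labels to keep the directions along which the classes separate.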
In practice, these features are often readily available because they have already been computed to support other processing stages.

Sonar Datasets

Sonar Set 10 (SON10)

The SON10 dataset contains a mix of 404 starboard-side and port-side images collected using a typical side-scan sonar system. Each image contains 712 range cells and 1000 cross-range cells. Images are 8-bit (0-255) gray scale. The images contain 69 targets; some images have none, some have more than one. Target-like non-targets appear throughout the images. A sample image is shown in Figure 2-4. This dataset is available in its raw format, as described above, or in a preprocessed stage where the normalization, detection, and baseline feature extraction stages from the CSS-Navy ATR algorithm have been executed to generate a list of potential targets and feature sets. The preprocessed format provides a baseline for a fair comparison of some of the downstream stages of the multistage ATR process, such as the feature selection and classification stages. In this preprocessed format, SON10 contains 67 targets in 1047 exemplars in the combined training and test data files. Table 2-1 shows how these exemplars are distributed.

Figure 2-4. Typical SON10 sonar image.

Table 2-1. Distribution of extracted feature exemplars for SON10.
Set     No. Targets   No. Non-targets   Both
Train   24            457               481
Test    43            523               566
Total   67            980               1047

Sonar Set 5 and 12 (SON512)

The SON512 dataset contains a mix of 150 port-side and starboard-side images collected using a typical side-scan sonar system. Each image contains 640 range cells and either 900 or 1024 cross-range cells. Images are 8-bit (0-255) gray scale. The images contain 94 targets; some images have none, some have more than one. Target-like non-targets appear throughout the images. The near-range and far-range edge artifacts in these images are typically ignored. A sample image is shown in Figure 2-5.
This dataset is available in its raw format, as described above, or in a preprocessed stage where the normalization, detection, and feature extraction stages from the CSS ATR algorithm have been executed to generate a list of potential targets and feature sets. The preprocessed format provides a baseline for fair comparisons of some of the downstream stages of the multistage ATR process, such as the feature selection and classification stages. In this preprocessed format, SON512 contains 93 targets in 1182 exemplars in the combined training and test data files. Table 2-2 shows how these exemplars are distributed.

Table 2-2. Distribution of extracted feature exemplars for SON512.
Set     No. Targets   No. Non-targets   Both
Train   49            489               538
Test    44            600               644
Total   93            1089              1182

Figure 2-5. Typical SON512 sonar image.

Sonar Baseline Extracted Feature Sets

For both the SON10 and SON512 datasets, the CSS ATR algorithm has been used to establish a baseline feature set that has been made available to the research community for experimentation in the feature mapping, selection, and classification stages. This feature set combines some of the statistical measures and count metrics extracted during the detection stage with some features specifically designed to extract important information based on prior knowledge. Table 2-3 summarizes the set of sixty-one features. Feature 15 should be ignored. The baseline extracted feature sets for SON10 and SON512, as described in Table 2-3, serve as the input application data for this research. After the ITL approach is investigated and expanded throughout the next several chapters, Chapter 8 presents the application of ITL concepts to the feature extraction problem using the datasets described above.

Table 2-3. The CSS baseline extracted feature set.
No.     Name             Description
1       Npix_nor         Number of normalized-image pixels in the window that exceed a threshold
2       Npix_mf          Number of matched-filter pixels in the window that exceed a threshold
3       Maxpix_nor       Maximum normalized-image pixel intensity in the window
4       Maxpix_mf        Maximum matched-filter pixel intensity in the window
5       HiLite_str_nor   Average highlight strength computed from the normalized image
6       HiLite_str_mf    Average highlight strength computed from the matched-filter image
7       Max_eig_nor      Length of major axis of an ellipse fit to the highlight region of the normalized image
8       Min_eig_nor      Width of minor axis of an ellipse fit to the highlight region of the normalized image
9       Max_eig_mf       Length of major axis of an ellipse fit to the bright region of the matched-filter image
10      Min_eig_mf       Width of minor axis of an ellipse fit to the bright region of the matched-filter image
11      Shadow_len       Shadow length
12      Shadow_str       Shadow strength
13      Maxpix_clu_mf    Maximum matched-filter intensity over the pixels in the detection cluster
14      Npix_clu         Number of pixels in the detection cluster
15      Nclusters        Number of detected clusters in the image (measure of clutter density)
16-25   N_nor(i)         Number of normalized pixels above threshold(i) in the window
26-35   N_mf(i)          Number of matched-filter pixels above threshold(i) in the window
36-45   N_nor_diff(i)    Number of normalized pixels above threshold(i) in the window minus the number of normalized pixels above threshold(i) in the region that locally surrounds the window
46-55   N_mf_diff(i)     Number of matched-filter pixels above threshold(i) in the window minus the number of matched-filter pixels above threshold(i) in the region that locally surrounds the window
        Avg_nor(i)       Average pixel intensity of the normalized image over the ith window
        Avg_mf(i)        Average pixel intensity of the matched-filter image over the ith window

CHAPTER 3 INFORMATION-THEORETIC LEARNING (ITL)

Introduction

The concept of optimization using performance criteria derived from information theory has been around since Jaynes introduced the
MaxEnt principle in 1957 [Jay57]. Since then, several metrics and architectures have been developed for the training of adaptive systems based on information-theoretic measures [Bel95a, Dec96, Kul68, Lin89]. This chapter presents an overview of the information-theoretic learning paradigm along with a detailed description of the ITL approach that serves as the basis of this research. First, a brief review of some of the key concepts of information theory is presented. This section is followed by a description of a general architecture for learning systems that describes how performance criteria in general can be applied to their training. Next, a discussion of some of the most commonly used information-theoretic criteria is presented along with their typical applications. The chapter concludes with a detailed description of the ITL approach developed by Principe et al.

Information Theory

Information theory provides a mathematical approach for analyzing the information content of messages and the effects of communication channel characteristics on message transmission. Concepts from information theory provide the answers to two key questions in communication theory:

* What is the minimum representation for a random variable?
* What is the maximum data rate achievable for a transmission channel?

These two questions are answered by the entropy of the random variable and the maximum of the mutual information between the input and output of the channel, respectively [Cov91]. Entropy is a measure of the uncertainty, or information content, of a random variable. The larger the entropy, the greater the uncertainty about the actual value of the random variable. Greater uncertainty about the outcome means that more information is conveyed by the actual value of the random variable. It follows that random variables with larger entropy require more symbols (e.g., bits) for their representation.
Shannon defined the entropy of a discrete random variable, X, with probabilities, p(x), to be:

H_S(X) = -\sum_x p(x) \log p(x)    (3-1)

and

H_S(X) = -\int_X p(x) \log p(x) \, dx    (3-2)

for the continuous case, where p(x) is the probability density function (PDF). Renyi [Ren76] proposed a generalized definition of entropy that includes Shannon's entropy as a special case. For a discrete random variable the form is:

H_{R\alpha}(X) = \frac{1}{1-\alpha} \log \sum_x p(x)^\alpha    (3-3)

and for continuous random variables,

H_{R\alpha}(X) = \frac{1}{1-\alpha} \log \int_X p(x)^\alpha \, dx    (3-4)

where the relationship between Shannon's and Renyi's entropy is given by,

\lim_{\alpha \to 1} H_{R\alpha} = H_S    (3-5)

H_{R\alpha} \geq H_S \geq H_{R\beta} \quad \text{if} \quad \alpha < 1 < \beta    (3-6)

Of particular interest to the information-theoretic learning approach by Principe et al. is the case where \alpha = 2, or Renyi's quadratic entropy. Mutual information is a measure of the dependence between two random variables, X and Y, and is defined by Shannon for the discrete case as:

I_S(X;Y) = H_S(X) - H_S(X|Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}    (3-7)

and,

I_S(X;Y) = \int_X \int_Y p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, dy \, dx    (3-8)

for the continuous case [Dec96]. This is equivalent to the Kullback-Leibler distance between the joint PDF and the product of the factorized (marginal) PDFs. From (3-8), it can be seen that the mutual information is maximized when the conditional entropy of X given Y is minimized, which occurs as knowledge of Y removes the uncertainty about X. Thus maximizing mutual information, subject to the constraints of a particular communication channel, provides the maximum data rate achievable through that channel, or channel capacity. Renyi's dependence measure takes the form described by (3-9) for the continuous case.

I_{R\alpha}(X;Y) = \frac{1}{\alpha-1} \log \int_X \int_Y \frac{p(x,y)^\alpha}{(p(x)p(y))^{\alpha-1}} \, dx \, dy    (3-9)

The fundamental objective of information-theoretic learning is to apply performance criteria based on the information-theoretic measures described above to the adaptation of learning systems.
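The limit (3-5) and the ordering (3-6) are easy to check numerically for a discrete distribution. The short Python sketch below (an illustration added here, not part of the original text) evaluates Shannon's entropy and Renyi's entropy for several values of alpha.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (3-1), in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):
    """Renyi entropy (3-3), in nats, for alpha != 1."""
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = np.array([0.5, 0.25, 0.125, 0.125])
h_s = shannon_entropy(p)
h_r_near_1 = renyi_entropy(p, 1.0001)   # (3-5): approaches Shannon as alpha -> 1
h_r_half = renyi_entropy(p, 0.5)        # alpha < 1
h_r_2 = renyi_entropy(p, 2.0)           # alpha = 2: Renyi's quadratic entropy
```

Note that the quadratic entropy (alpha = 2) lower-bounds Shannon's entropy, which is part of what makes it a usable surrogate criterion.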
General Learning System Architecture

Learning, or trainable, systems can be viewed as a mapping function, g: R^K -> R^M, from one space to another described by the equation, Y = g(X, W), where X \in R^K represents the system input, Y \in R^M represents the system output, and W represents the set of trainable parameters. Common examples of learning systems include neural networks and adaptive filters. The goal in training such a system is to select the parameters of the mapping, W, that optimize some performance criterion in either a supervised or unsupervised manner. The most common supervised learning criterion is minimization of the MSE between the output and some desired signal. Other examples include the Minkowski-R error (or L_R norm) and relative or cross entropy. Unsupervised learning criteria include Hebbian learning, Oja's rule, and competitive learning [Her91]. Figure 3-1 shows a block diagram of a general learning system architecture, with an optional external desired signal, D, for supervised learning and an adaptation/error signal, E.

Figure 3-1. General learning system architecture.

One of the most prevalent implementations of Figure 3-1 uses MSE as the optimization criterion with a desired signal, a neural network as the mapping function, and backpropagation as the parameter adaptation algorithm. However, Figure 3-1 has many more applications than this. From the perspective of information theory, the architecture in Figure 3-1 can be viewed in a number of ways. First, the mapping function can represent an encoding function seeking to represent the input data in a more compact format. This is analogous to feature extraction or data compression. Then, in an unsupervised manner, the learning system can be trained to maximize the output entropy as its optimization criterion. This maximization process yields a representation of the input that conveys the maximum information content subject to the constraints of the mapping network.
As a second example, the mapping function can be viewed as a communication channel through which we seek to transmit a signal. In this case, the learning system can be trained in a supervised manner to maximize the mutual information between the input and output signals. This produces the mapping parameters that yield the greatest channel capacity, again subject to the constraints of the mapping network topology. Various information-theoretic criteria can be readily applied to the architecture in Figure 3-1. These are described in more detail in the following section.

Information-Theoretic Optimization Criteria

Although the MSE criterion has well-known computational advantages and is theoretically and practically appropriate for the solution of many problems, the assumptions inherent in using MSE can produce suboptimal results in many instances. The MSE criterion is derived from the principle of maximum likelihood by assuming that the distribution of the output data can be described by a Gaussian function with an input-dependent mean and a single global variance parameter. Thus, the MSE criterion cannot distinguish between the true distribution and a Gaussian distribution having the same input-dependent mean and average variance. This limitation leads to difficulties in inverse problems for multivalued target data and has been shown to be suboptimal for classification problems [Bis95]. These shortcomings motivate the use of alternative criteria for training learning systems. By optimizing learning systems based on various information-based measures, the resulting information content is effectively optimized.
The most common information-based measures and their practical applications include:

* Maximization of Output Entropy (MaxEnt): MaxEnt is an unsupervised method that seeks to spread the output signal throughout the available output space, thus lending itself to dimensionality reduction, feature extraction, data compression, and/or preprocessing of the input signal.

* Minimization of Cross Entropy or Mutual Information between Output Signals (MinXEnt): MinXEnt tries to reduce the dependence between signals, seeking to make each output independent of the others, making it suitable for independent component analysis and blind source separation.

* Maximization of Mutual Information between Output and Input (InfoMax): This is a supervised approach that seeks to convey the maximum information from the input through the mapping function to the output. It is similar to MaxEnt in applications.

* Maximization of Mutual Information between Output and Desired Signal (Information Filtering): This method is similar to InfoMax, but adapts the mapping function to make the output statistically similar to any desired signal. This approach can be used in lieu of MSE for many supervised learning problems, including function approximation, parameter estimation, and pattern recognition [Pri00].

* Maximization of Mutual Information between Output and Class Labels (ClassInfoFilt): This is a special case of information filtering described by Torkkola that seeks to group the outputs according to their class labels defined by the desired signal. This is most applicable to classification, or class clustering to support classification [Tor00].

* Minimization of Error Entropy (MinErrEnt): This is a supervised approach that minimizes the entropy of the error between the output and a desired signal. This approach is similar to information filtering and is applicable to most supervised learning problems [Erd00].
The principal drawback to using any of these information-theoretic criteria is the computational demand of the implementation. Evaluation of these criteria requires the computation of the difference between two PDFs. The computational complexity of a straightforward implementation is proportional to O(NM^2), where N is the number of samples in the dataset and M is the dimensionality of the random variable. By contrast, the computational complexity of the MSE is proportional to O(N). As will be shown in the sections that follow, recent developments in ITL have resulted in a computationally tractable approach that reduces the complexity of the algorithm to be proportional to O(N^2) [Pri00]. However, for large datasets, this still represents somewhat of a challenge.

Information-Theoretic Learning Approach

Principe, Fisher, and Xu have shown that by using Renyi's quadratic entropy (RQE) in combination with Parzen's nonparametric PDF estimation, significant computational savings can be achieved when computing entropy-based optimization criteria. They realized similar computational savings for mutual-information-based criteria by introducing two novel measures for PDF divergence based on the Cauchy-Schwartz inequality and the Euclidean distance [Pri00]. The Parzen PDF estimator for a random variable, Y \in R^M, is a kernel-based method defined as,

\hat{p}_Y(y) = \frac{1}{N} \sum_{i=1}^{N} \kappa(y - y_i)    (3-10)

where N is the number of samples of the random variable, y_i are the observations of the random variable, and \kappa is a kernel function. Although other choices exist, the basic ITL approach uses the symmetric Gaussian kernel described by (3-11).
\kappa(z) = G(z, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{M/2}} e^{-z^T z / (2\sigma^2)}    (3-11)

Combining the kernel function from (3-11) with the Parzen estimator in (3-10) and substituting into (3-4) with \alpha = 2 results in the following estimator of Renyi's quadratic entropy for Y with observations {y},

H_{R2}(Y) = -\log V(\{y\})
V(\{y\}) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(y_i - y_j, 2\sigma^2 I)    (3-12)

where V is defined as the information potential (IP). Note that maximizing the entropy estimate, H_{R2}, in (3-12) is equivalent to minimizing the IP. For adaptation of the parameters in a mapping function, the sensitivities with respect to the weights are computed, yielding,

\frac{\partial V}{\partial w} = \sum_{i=1}^{N} \frac{\partial V}{\partial y_i} \frac{\partial y_i}{\partial w} = \sum_{i=1}^{N} F_i^T \frac{\partial}{\partial w} g(w, x_i)

where,    (3-13)

F_i = \frac{\partial V}{\partial y_i} = \frac{1}{N^2} \sum_{j=1}^{N} G(y_i - y_j, 2\sigma^2 I) \frac{(y_j - y_i)}{2\sigma^2}

which lends itself to an adaptive training algorithm such as backpropagation. Thus (3-12) shows that the combination of Parzen PDF estimation and Renyi's quadratic entropy yields an entropy estimate that is based on the summation of interactions between each sample pair through the kernel function. As is evident, the computational complexity is proportional to O(N^2). The major computational savings are due to the combination of the quadratic form of the PDFs in Renyi's quadratic entropy and the form of the Gaussian kernel selected for Parzen estimation. For mutual-information-based criteria, examination of (3-8) and (3-9) reveals that they are not quadratic in the PDFs. This led Principe et al. to propose novel distance measures between two PDFs based on quadratic distance measurement. The first method, based on the Euclidean difference of vectors inequality, yields the following measure for quadratic mutual information (EDQMI).

I_{ED}(X,Y) = \int\!\!\int p(x,y)^2 \, dx \, dy + \int\!\!\int p(x)^2 p(y)^2 \, dx \, dy - 2 \int\!\!\int p(x,y) p(x) p(y) \, dx \, dy    (3-14)

The second method, based on the Cauchy-Schwartz inequality, produces (3-15) as a measure for mutual information (CSQMI).
I_{CS}(X,Y) = \log \frac{\int\!\!\int p(x,y)^2 \, dx \, dy \cdot \int\!\!\int p(x)^2 p(y)^2 \, dx \, dy}{\left( \int\!\!\int p(x,y) p(x) p(y) \, dx \, dy \right)^2}    (3-15)

Both measures are nonnegative and reach zero when p(x,y) = p(x)p(y) (i.e., when X and Y are independent), making their use appropriate for minimizing mutual information. For maximizing mutual information, Principe et al. show through examples that both measures can be used; however, EDQMI is preferable because of the convex nature of the function. By utilizing the criteria in (3-14) or (3-15) in combination with the Parzen PDF estimation from (3-10), the estimates described by (3-16) for the mutual information between two random variables, Y_1 and Y_2, with observations, y_i = [y_{i1}, y_{i2}], i = 1,...,N, are obtained:

I_{CS}((Y_1,Y_2) \,|\, y) = \log \frac{V(y) \, V^1(\{y_1\}) \, V^2(\{y_2\})}{V_c(y)^2}
I_{ED}((Y_1,Y_2) \,|\, y) = V(y) + V^1(\{y_1\}) V^2(\{y_2\}) - 2 V_c(y)    (3-16)

where,

V(y) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \prod_{l=1}^{2} G(y_{il} - y_{jl}, 2\sigma^2)
V^l(y_i, \{y_l\}) = \frac{1}{N} \sum_{j=1}^{N} G(y_{il} - y_{jl}, 2\sigma^2), \quad l = 1, 2    (3-17)
V^l(\{y_l\}) = \frac{1}{N} \sum_{i=1}^{N} V^l(y_i, \{y_l\})
V_c(y) = \frac{1}{N} \sum_{i=1}^{N} \prod_{l=1}^{2} V^l(y_i, \{y_l\})

In (3-17), V^l(y_i, \{y_l\}) represents the partial marginal information potential for the sample y_i in its corresponding random variable (indexed by l). In (3-16) and (3-17), V^l(\{y_l\}) is the lth marginal information potential because it averages all the partial marginal information potentials for one index, l [Pri00]. The equations in (3-16) and (3-17) can support parameter adaptation by again taking the sensitivities with respect to the parameters. Principe et al. have successfully applied this approach to obtain practical solutions to problems including independent component analysis, blind source separation, system identification, and function approximation. Despite these successes, several issues still hinder the widespread application of this approach. First, the information-theoretic optimization criteria are highly dependent on the selection of the kernel size in the Parzen estimation of the PDF. Experimentation suggests that an adaptive kernel size, which assures sufficient interaction between the observations, produces favorable results.
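The information potentials of (3-17) translate directly into code. The NumPy sketch below (an illustration for two scalar variables, with hypothetical function names, not the dissertation's MATLAB implementation) computes the sample EDQMI of (3-16); because the estimate equals the integrated squared difference between the Parzen joint estimate and the product of the Parzen marginals, it is nonnegative and near zero for independent data.

```python
import numpy as np

def gauss(d2, var):
    """1-D Gaussian kernel G(., var) evaluated at squared distances d2."""
    return np.exp(-d2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def edqmi(y1, y2, sigma):
    """Sample EDQMI (3-16) built from the information potentials of (3-17)."""
    n = len(y1)
    var = 2.0 * sigma ** 2
    k1 = gauss((y1[:, None] - y1[None, :]) ** 2, var)    # pairwise kernels, Y1
    k2 = gauss((y2[:, None] - y2[None, :]) ** 2, var)    # pairwise kernels, Y2
    v_joint = np.sum(k1 * k2) / n ** 2                   # joint IP, V(y)
    v1, v2 = np.sum(k1) / n ** 2, np.sum(k2) / n ** 2    # marginal IPs
    v_c = np.sum(k1.mean(axis=1) * k2.mean(axis=1)) / n  # cross IP, V_c(y)
    return v_joint + v1 * v2 - 2.0 * v_c

rng = np.random.default_rng(2)
a, b = rng.normal(size=500), rng.normal(size=500)
qmi_indep = edqmi(a, b, sigma=0.5)           # independent pair: near zero
qmi_dep = edqmi(a, a + 0.1 * b, sigma=0.5)   # strongly dependent pair: larger
```

The row means in `v_c` are the partial marginal information potentials V^l(y_i, {y_l}), so the three lines computing `v_joint`, `v1`/`v2`, and `v_c` correspond term by term to (3-17).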
Second, although the computational complexity has been greatly reduced, even a complexity proportional to O(N^2) becomes prohibitive for very large datasets. Chapters 4, 5, and 6 address the reduction of computational complexity in using the ITL process for adaptive system training. These chapters focus on the information-theoretic criteria based on RQE, CSQMI, and EDQMI. They address improving the computational efficiency in three ways:

1. Reduce the complexity of each criterion evaluation through algorithm optimization.
2. Reduce the number of training iterations required for convergence by applying advanced parameter search techniques.
3. Reduce the overall N^2 complexity by introducing a batch technique to partition the data.

CHAPTER 4 ALGORITHM OPTIMIZATION

Introduction

One of the principal drawbacks of using information-theoretic criteria for the training of learning systems is the extremely slow training due to the computational complexity of the criterion functions. In this chapter, the equations for entropy and mutual information estimation are examined from the perspective of computational efficiency. The optimization techniques of factorization, symmetry exploitation, and vectorization are used to develop implementations of the ITL criteria with improved efficiency in terms of the number of floating-point operations (FLOPs) and computational time.

Baseline Entropy Algorithm

To establish a basis for comparison, a baseline algorithm implementation is used. This baseline algorithm results from a straightforward implementation of Renyi's quadratic entropy estimator, described in (4-1), without any attempts to simplify or optimize for efficiency. The following is obtained by combining (3-11) and (3-12),

H_{R2} = -\log \left[ \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{1}{(4\pi\sigma^2)^{M/2}} \exp\left( \frac{-d_{ij}^T d_{ij}}{4\sigma^2} \right) \right]    (4-1)

where d_{ij} = (y_i - y_j). This straightforward, or brute-force, implementation yields a computational complexity of roughly N^2[4M + 10] + NM + 9.
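A literal transcription of (4-1) makes the N^2 pairwise structure explicit. The Python sketch below is illustrative only (the dissertation's implementation is in MATLAB) and evaluates the brute-force estimator for two sample sets of different spread.

```python
import numpy as np

def entropy_baseline(y, sigma):
    """Brute-force Renyi quadratic entropy estimate (4-1).

    y: (N, M) array of output samples; every one of the N^2 sample pairs
    is pushed through the Gaussian kernel of variance 2*sigma^2.
    """
    n, m = y.shape
    norm = (4.0 * np.pi * sigma ** 2) ** (m / 2.0)
    v = 0.0
    for i in range(n):
        for j in range(n):
            d = y[i] - y[j]                            # d_ij = y_i - y_j
            v += np.exp(-(d @ d) / (4.0 * sigma ** 2))
    return -np.log(v / (n ** 2 * norm))                # H_R2 = -log(IP)

rng = np.random.default_rng(0)
h_wide = entropy_baseline(rng.normal(scale=2.0, size=(100, 2)), sigma=0.5)
h_narrow = entropy_baseline(rng.normal(scale=0.5, size=(100, 2)), sigma=0.5)
```

As expected, the more spread-out sample produces a smaller information potential and therefore a larger entropy estimate.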
Clearly, the dominant term is the factor of the quadratic term corresponding to N^2. Using randomly generated data with various values for N and M, computational complexity metrics for the baseline implementation are tabulated in Tables 4-1, 4-2, and 4-3 under the columns labeled 'Baseline'. These metrics were collected using MATLAB version 5.3 on a personal computer with a 1 GHz Athlon processor. The baseline metrics provide a basis for comparison of the various optimized algorithm implementations. It should be noted that for M > 1, an additional MN^2 FLOPs is required because of MATLAB's specific implementation of the SUM command, where the sum of K numbers requires K FLOPs versus the expected K-1. This issue, however, does not detract from the overall relevance of the discussion, since it applies consistently to all compared cases where M is greater than one.

Entropy Algorithm Factorization

Factorization, or hoisting, is one of the most elementary techniques for algorithm optimization. The approach looks for repetitive operations that can be eliminated or reduced by pulling, or factoring, them out of major loops. Equation (4-1) is well suited for this type of optimization. Note that each evaluation of the kernel function contains the same divisor operand, (4\pi\sigma^2)^{M/2}, a six-FLOP operation assuming the \sigma^2 term has been previously computed and stored. Furthermore, the scaling factor, 4\sigma^2, of the exponential is constant throughout the entire entropy computation and therefore need only be evaluated once. Since both of these terms contribute to the N^2 component of the complexity, their factorization can provide significant computational savings overall. These factorizations result in an entropy estimator described by (4-2).

H_{R2} = -\log \left[ \frac{1}{N^2 (4\pi\sigma^2)^{M/2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left( \frac{-d_{ij}^T d_{ij}}{C} \right) \right]; \quad C = 4\sigma^2    (4-2)

Another point to notice is that the division operation with the scaling factor, C, of the exponential is conducted a total of N^2 times.
Since the original data, Y, is of dimension MxN, computations can be saved in the cases where M < N by factoring the scaling of the exponential out of the double summation. This results in the following entropy estimator,

H_{R2} = -\log \left[ \frac{1}{N^2 (4\pi\sigma^2)^{M/2}} \sum_{i=1}^{N} \sum_{j=1}^{N} \exp\left( -d_{ij}^{*T} d_{ij}^* \right) \right]    (4-3)

where d*_{ij} = (y*_i - y*_j), and Y* is computed by scaling the original input, Y, by the square root of the scaling factor C, or 2\sigma. This factorization can provide additional computational savings, particularly for large datasets where the number of samples, N, is much larger than the dimensionality, M.

Table 4-1. Algorithm optimization: Factorization.
N     M   Baseline FLOPs   Factorized FLOPs   FRR   Baseline (sec.)   Factorized (sec.)   TRR
50    1   35059            15115              2.3   0.13              0.07                1.8
50    2   50109            30215              1.7   0.17              0.09                1.8
50    3   62659            42815              1.5   0.18              0.09                1.9
100   1   140109           60215              2.3   0.51              0.28                1.8
100   2   200209           120415             1.7   0.66              0.37                1.8
100   3   250309           170615             1.5   0.68              0.37                1.8
200   1   560209           240415             2.3   2.06              1.13                1.8
200   2   800409           480815             1.7   2.61              1.50                1.7
200   3   1000609          681215             1.5   2.72              1.48                1.8
500   1   3500509          1501015            2.3   12.80             7.12                1.8
500   2   5001009          3002015            1.7   16.42             9.30                1.8
500   3   6251509          4253015            1.5   17.03             9.32                1.8
1000  1   14001009         6002015            2.3   51.19             28.45               1.8
1000  2   20002009         12004015           1.7   65.91             37.13               1.8
1000  3   25003009         17006015           1.5   68.27             37.46               1.8
2000  1   56002009         24004015           2.3   208.01            112.93              1.8
2000  2   80004009         48008015           1.7   265.73            147.91              1.8
2000  3   100006009        68012015           1.5   271.06            149.62              1.8

Table 4-1 compares the results of this implementation with the baseline approach. The table shows the average computation times and number of FLOPs required to complete one evaluation of the entropy and the associated information forces for both implementations. It also shows the FLOP-count reduction ratio (FRR) and the computational-time reduction ratio (TRR) for the factorized implementation versus the baseline.
The resulting computational complexity of the factorized implementation is roughly N^2[4M + 2] + N[M + 1] + 15, thus reducing the original complexity by roughly 8N^2 operations. For M = 1, this is more than a 50 percent reduction in the number of FLOPs.

Exploitation of Symmetry

Another method for improving the computational efficiency of an algorithm is to look for symmetries or simplifications in the algorithm structure to eliminate redundant or similar calculations. Again, the ITL entropy criterion lends itself to this type of optimization. By noting the fact that V_{ij} = V_{ji}, roughly half of the kernel function evaluations can be omitted. In addition, when the kernel function is factorized as above, its evaluation in the cases where i = j results in V_{ii} = 1 and F_{ii} = 0. Therefore, the computation of the diagonal terms can also be greatly simplified. Equation (4-4) describes the symmetric implementation with factorization.

H_{R2} = -\log \left[ \frac{1}{N^2 (4\pi\sigma^2)^{M/2}} \left( N + 2 \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \exp\left( -d_{ij}^{*T} d_{ij}^* \right) \right) \right]    (4-4)

This symmetric, factorized implementation results in a computational complexity of roughly N^2[2.5M + 1] - N[1.5M - 1] + 21. Table 4-2 shows the computational reductions achieved by this method for various values of N and M. Note that the exploitation of symmetry yields even greater reductions in both the complexity and computational time metrics than factorization alone.

Table 4-2. Algorithm optimization: Symmetry with factorization.
N     M   Baseline FLOPs   Symmetric FLOPs   FRR   Baseline (sec.)   Symmetric (sec.)   TRR
50    1   35059            8746              4.0   0.13              0.05               2.8
50    2   50109            17421             2.9   0.17              0.06               2.9
50    3   62659            24871             2.5   0.18              0.06               3.1
100   1   140109           34971             4.0   0.51              0.18               2.9
100   2   200209           69821             2.9   0.66              0.23               2.9
100   3   250309           99721             2.5   0.68              0.23               2.9
200   1   560209           139921            4.0   2.06              0.70               2.9
200   2   800409           279621            2.9   2.61              0.92               2.8
200   3   1000609          399421            2.5   2.72              0.92               2.9
500   1   3500509          874771            4.0   12.80             4.43               2.9
500   2   5001009          1749021           2.9   16.42             5.77               2.8
500   3   6251509          2498521           2.5   17.03             5.79               2.9
1000  1   14001009         3499521           4.0   51.19             17.74              2.9
1000  2   20002009         6998021           2.9   65.91             22.96              2.9
1000  3   25003009         9997021           2.5   68.27             23.35              2.9
2000  1   56002009         13999021          4.0   208.01            71.07              2.9
2000  2   80004009         27996021          2.9   265.73            92.00              2.9
2000  3   100006009        39994021          2.5   271.06            92.66              2.9

Entropy Vectorization

The methods of factorization and symmetry exploitation achieve improvements in computational efficiency by reducing the overall number of operations required to compute a given metric. Vectorization, on the other hand, improves efficiency by taking advantage of modern software and processor design features. Most processors today are optimized to perform single-instruction multiple-data (SIMD) and vector operations very efficiently. Many software implementation environments and programming languages are optimized to make use of these processor characteristics. These features can be exploited to greatly reduce computation times by developing algorithm implementations that operate on vector and matrix operands. Once again, the ITL criteria are well suited for optimization in this manner. The second indexed summation in (4-4) can be replaced using standard vectorization techniques to remove the need for the j indexing. By describing the basic operand for this summation as the vector in (4-5), appropriate vectorized exponential and summation functions (such as those resident in MATLAB) can be applied to speed up the computation.

D_i = \left[ d_{i,i+1}^{*T} d_{i,i+1}^*, \; d_{i,i+2}^{*T} d_{i,i+2}^*, \; \ldots, \; d_{i,N}^{*T} d_{i,N}^* \right]    (4-5)

Table 4-3.
Algorithm optimization: Vectorized with symmetry and factorization.
N     M   Baseline FLOPs   Vector FLOPs   FRR   Baseline (sec.)   Vector (sec.)   TRR
50    1   35059            8838           4.0   0.13              0.007           19.1
50    2   50109            17561          2.9   0.17              0.008           21.6
50    3   62659            25059          2.5   0.18              0.008           22.3
100   1   140109           35163          4.0   0.51              0.016           32.3
100   2   200209           70111          2.9   0.66              0.018           37.7
100   3   250309           100109         2.5   0.68              0.019           36.2
200   1   560209           140313         4.0   2.06              0.041           50.7
200   2   800409           280211         2.9   2.61              0.049           52.8
200   3   1000609          400209         2.5   2.72              0.070           38.7
500   1   3500509          875763         4.0   12.80             0.195           65.6
500   2   5001009          1750511        2.9   16.42             0.272           60.4
500   3   6251509          2500509        2.5   17.03             0.366           46.6
1000  1   14001009         3501513        4.0   51.19             0.830           61.7
1000  2   20002009         7001011        2.9   65.91             1.175           56.1
1000  3   25003009         10001009       2.5   68.27             1.511           45.2
2000  1   56002009         14003013       4.0   208.01            3.658           56.9
2000  2   80004009         28002011       2.9   265.73            5.022           52.9
2000  3   100006009        40002009       2.5   271.06            6.272           43.2

The computational complexity of the vectorized implementation is approximately N^2[2.5M + 1] - N[0.5M - 2] - 2M + 15. This complexity measure is roughly the same as in the symmetric, factorized case. However, the simple replacement of the indexed exponentials and summations with their vector equivalents yields drastic improvements in computation times, as shown in Table 4-3. Note that for large sample sizes, the computational time has been reduced by factors of 60 or more in some cases. Figure 4-1 shows graphically the computational efficiencies gained via optimization for the entropy estimator in the case of M = 2.

Figure 4-1. Entropy estimator optimization results (computational complexity and algorithm speed versus N for the baseline, factorized, symmetric, and vectorized implementations, M = 2).
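The three techniques combine naturally. The NumPy sketch below (illustrative; the dissertation's implementation is in MATLAB) applies the factorization of (4-3), the diagonal/upper-triangle symmetry of (4-4), and vectorization over the inner index, and checks the result against a direct dense evaluation of (4-1).

```python
import numpy as np

def entropy_optimized(y, sigma):
    """Factorized, symmetric, vectorized RQE estimate per (4-4)."""
    n, m = y.shape
    norm = (4.0 * np.pi * sigma ** 2) ** (m / 2.0)  # hoisted out of the loops
    ys = y / (2.0 * sigma)                          # Y* = Y / sqrt(C), C = 4*sigma^2
    v = float(n)                                    # i == j diagonal terms: exp(0) = 1
    for i in range(n - 1):
        d = ys[i + 1:] - ys[i]                      # vectorized over all j > i
        v += 2.0 * np.sum(np.exp(-np.sum(d * d, axis=1)))
    return -np.log(v / (n ** 2 * norm))

def entropy_direct(y, sigma):
    """Direct dense evaluation of (4-1) for cross-checking."""
    n, m = y.shape
    norm = (4.0 * np.pi * sigma ** 2) ** (m / 2.0)
    d2 = np.sum((y[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return -np.log(np.sum(np.exp(-d2 / (4.0 * sigma ** 2))) / (n ** 2 * norm))

rng = np.random.default_rng(3)
y = rng.normal(size=(200, 2))
```

The optimized routine evaluates only the N(N-1)/2 upper-triangle pairs and replaces the per-pair division by a one-time rescaling of the data, mirroring the complexity reductions tabulated above.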
Results for the QMI Criterion

The same techniques for algorithm optimization were applied to (3-16) and (3-17) for EDQMI and CSQMI, respectively. Table 4-4 compares the resulting computational metrics for a baseline EDQMI implementation with one that has been optimized using factorization, symmetry, and vectorization. In these cases, the QMI is evaluated between an M-dimensional, N-sample dataset and a one-dimensional, N-sample dataset. The results obtained for CSQMI are nearly identical to those for EDQMI and are not shown for brevity. Equation (4-6) summarizes the factorized and symmetric form applied to each information potential in (3-17), where d*_{ij} = (y*_i - y*_j). The operands in this representation were also vectorized, as in the entropy case, to yield significant savings in computation time.

V = \frac{1}{N^2 (4\pi\sigma^2)^{M/2}} \left( N + 2 \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \exp\left( -d_{ij}^{*T} d_{ij}^* \right) \right)    (4-6)

Table 4-4. Algorithm optimization: EDQMI baseline and optimized results.
N     M   Baseline FLOPs   Vector FLOPs   FRR   Baseline (sec.)   Vector (sec.)   TRR
50    1   88071            34345          2.6   0.48              0.028           17.3
50    2   136179           56125          2.4   0.59              0.038           15.6
50    3   184338           77956          2.4   0.71              0.039           18.3
100   1   351121           136170         2.6   1.94              0.054           35.9
100   2   542329           222225         2.4   2.49              0.076           32.8
100   3   733638           308381         2.4   2.84              0.100           28.4
200   1   1402221          542320         2.6   7.63              0.170           44.9
200   2   2164629          884425         2.4   9.94              0.220           45.2
200   3   2927238          1226731       2.4   11.37             0.330           34.5
500   1   8755521          3380770        2.6   48.28             0.830           58.2
500   2   13511529         5511025        2.5   60.26             1.420           42.4
500   3   18268038         7641781        2.4   71.51             2.030           35.2
1000  1   35011021         13511520       2.6   191.20            4.120           46.4
1000  2   54023029         22022025       2.5   241.73            5.980           40.4
1000  3   73036038         30533531       2.4   289.07            8.020           36.0
2000  1   140022021        54023020       2.6   772.26            15.870          48.7
2000  2   216046029        88044025       2.5   966.42            24.060          40.2
2000  3   292072038        122067031      2.4   1153.88           33.940          34.0

Conclusions

Straightforward implementations of ITL criteria have produced computationally expensive and time-consuming results.
By applying some basic principles of algorithm optimization, significant savings in both measures have been demonstrated. The combination of factorization, symmetry exploitation, and vectorization has reduced the algorithm complexity, in terms of number of operations, by as much as 75%. Similarly, the required computation time has been reduced by as much as 98%. A case in point: the 2000-sample quadratic entropy evaluation that originally took over four minutes now completes in roughly five seconds. Although the complexity of the algorithm remains O(N²), the coefficient of the quadratic term has been greatly reduced, thereby increasing the practical utility of ITL methods.

CHAPTER 5
ADVANCED PARAMETER SEARCH TECHNIQUES

Introduction

Parameter optimization for learning systems generally requires the use of iterative techniques, since analytical solutions exist only for special cases of both topology and criteria [Bis95]. These iterative techniques, which search the parameter space of the mapping function in an attempt to achieve a desired or optimum performance level, often require many epochs to converge to an acceptable level of performance. When these techniques are applied using information-based criteria to determine system performance, a large number of iterations can quickly render their usage computationally prohibitive. This dilemma motivates the investigation of advanced parameter search algorithms for ITL in order to arrive at a desired performance level more efficiently. This chapter addresses the adaptation and application of some familiar advanced parameter search algorithms to the training of systems using ITL criteria. Several examples are used to show the benefits of applying advanced training techniques to ITL problems.

Baseline Technique: Gradient Descent (GD)

The most prevalent and basic parameter search technique for learning system adaptation is the gradient descent algorithm.
The GD algorithm essentially follows the slope, or gradient, of the performance surface towards a minimum. Although simple and effective, GD is often slow to converge to a desired performance level and frequently requires hundreds or even thousands of iterations. Despite, and perhaps because of, its simplicity, the training of ITL systems has been done almost exclusively with GD learning. GD uses the following update rule,

Δw = −η∇P    (5-1)

where Δw represents the change to the learning system parameter vector, or weights, ∇P represents the gradient of the performance surface, and η is the step size or learning rate. The convergence of the GD algorithm depends strongly on the appropriate selection of the learning rate. Too large a learning rate can cause oscillation or divergence, while too small a learning rate leads to slow adaptation and the need for many epochs. The GD method will serve as the baseline for comparison with other, more advanced parameter search techniques.

Advanced Search Techniques

The general problem of parameter optimization has been the focus of much research and has motivated the development of many advanced techniques for parameter search [Bis95, Dem00, Hag96]. In the paragraphs below, several of these advanced search methods are described in anticipation of their application to the training of ITL systems.

Improved GD Algorithms

Many improvements have been developed to increase the efficiency of the basic GD approach. Two significant improvements have been the incorporation of an adaptive learning rate and the addition of a momentum term to the basic GD algorithm. The adaptive learning rate gradient descent (GDA) algorithm monitors the current performance of the learning system and adapts the learning rate as follows: if performance improves, the learning rate is increased slightly; if performance worsens, the learning rate is decreased moderately and the latest update to the system parameters is discarded.
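As a concrete reference for this baseline, the update in (5-1) can be sketched in a few lines, assuming a user-supplied gradient function (the function and parameter names are illustrative):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, epochs=100):
    """Basic gradient descent per Eq. (5-1): w <- w - eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        w = w - eta * grad(w)  # fixed learning rate eta
    return w
```

On a simple quadratic bowl P(w) = wᵀw, for which ∇P = 2w, the iterate contracts geometrically toward the minimum at the origin, illustrating both the simplicity of the rule and its sensitivity to the choice of η.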
This allows the learning rate to adjust based on the local complexity of the performance surface. The learning rate adaptation can be summarized by (5-2),

η_{k+1} = { a·η_k  if P_k ≤ P_{k−1}
          { b·η_k  if P_k > P_{k−1}      where a > 1, b < 1    (5-2)

The GDA algorithm thus reduces the dependency of training performance on the sometimes ad hoc selection of the learning rate. The addition of a momentum term to GD serves as a low-pass filter for parameter adaptation. This allows parameter adaptation to avoid some levels of noise and shallow local minima in the performance surface. Adaptation is taken in the direction of a weighted sum of the current GD direction and the last step taken. The following summarizes the GD with momentum (GDM) update rule,

Δw_k = (1 − α)(−η∇P) + α·Δw_{k−1}    (5-3)

where α is the momentum constant. Clearly, the momentum constant determines how much credence is given to the historical trend versus the current estimate. Another useful method that combines GD with adaptive learning rate and momentum (GDX) will be studied as well. Note that both GDA and GDX assume a static performance surface that allows epoch-to-epoch comparison of performance in order to determine the learning rate adaptation.

Resilient Backpropagation (RP)

Resilient backpropagation was introduced to help eliminate some of the problems encountered when training neural networks containing saturating nonlinearities using GD methods [Dem00, Rie93]. RP uses the sign of the gradient, rather than the actual gradient, to determine the direction of the weight change. This helps to improve the slow learning caused by the near-zero gradients in the saturated regions of the nonlinear processing elements. The magnitudes of the weight changes are adapted according to the direction of the most recent epochs as follows: if a weight change is in the same direction for the last two epochs, the magnitude of the change is increased; if the update direction is different, the magnitude of change is decreased [Rie94].
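A sketch of how the adaptive learning rate of (5-2) and the momentum filter of (5-3) combine into a GDX-style loop, assuming the performance can be re-evaluated after each tentative step. The increase/decrease factors and default constants are illustrative choices, not values from the dissertation:

```python
import numpy as np

def gdx(perf, grad, w0, eta=0.5, alpha=0.9, inc=1.05, dec=0.7, epochs=200):
    """GD with momentum and adaptive learning rate (GDX-style sketch).
    A worsening step is discarded and eta is reduced (Eq. 5-2, b < 1);
    an improving step is kept, eta grows slightly (a > 1), and the
    momentum-filtered update of Eq. (5-3) carries forward."""
    w = np.asarray(w0, dtype=float)
    dw = np.zeros_like(w)
    p_prev = perf(w)
    for _ in range(epochs):
        step = (1 - alpha) * (-eta * grad(w)) + alpha * dw
        w_new = w + step
        p_new = perf(w_new)
        if p_new > p_prev:            # divergent: discard update, shrink eta
            eta *= dec
            dw = np.zeros_like(w)     # reset the momentum history
        else:                         # improving: keep update, grow eta
            w, p_prev, dw = w_new, p_new, step
            eta *= inc
    return w
```

Note the epoch-to-epoch performance comparison in the loop; this is exactly the static-surface assumption flagged later in this chapter as problematic for adaptive kernel sizes.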
The update rule for RP is given as,

Δw_k^(ij) = −sign(∇P_k^(ij))·δ_k^(ij)    (5-4)

where the magnitude, δ_k^(ij), is adapted as,

δ_k^(ij) = { a·δ_{k−1}^(ij)  if ∇P_k^(ij)·∇P_{k−1}^(ij) > 0, where a > 1
           { b·δ_{k−1}^(ij)  if ∇P_k^(ij)·∇P_{k−1}^(ij) < 0, where b < 1    (5-5)

RP has proven to be a robust algorithm since its adaptation is governed more by the ongoing adaptive behavior of the weights than by the shape of the performance surface.

Conjugate Gradient Algorithms

Conjugate gradient algorithms were developed to select a more efficient search direction than the standard GD approach. These algorithms begin with a steepest descent step and then choose conjugate directions for adaptation rather than steepest descent. For quadratic performance surfaces, conjugate directions form a complete basis in the parameter space and generally provide much faster convergence than GD directions [Bis95]. Most conjugate gradient methods determine the change magnitude using a line search technique. The basic update for conjugate gradient methods is,

Δw_k = α_k·d_k    (5-6)
d_k = −∇P_k + β_k·d_{k−1}    (5-7)

where α_k is determined using a line search, d_k represents the conjugate search direction, and β_k determines the method by which the next conjugate direction is chosen. The Fletcher-Reeves (CGF) method chooses the direction with,

β_k = (∇P_kᵀ·∇P_k) / (∇P_{k−1}ᵀ·∇P_{k−1})    (5-8)

while the Polak-Ribiere (CGP) method updates the direction with,

β_k = ((∇P_k − ∇P_{k−1})ᵀ·∇P_k) / (∇P_{k−1}ᵀ·∇P_{k−1})    (5-9)

For all conjugate gradient algorithms, the search direction is periodically reset to the negative of the gradient. Typically, this is done when the number of epochs equals the number of parameters to be optimized. The Powell-Beale (CGB) method uses a reset condition based on a measure of orthogonality [Pow77]. The CGB method also computes the search direction from a linear combination of the negative gradient, the previous search direction, and the last search direction before the previous reset [Dem00].
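The RP adaptation of (5-4) and (5-5) can be sketched as follows; the initial step size and the factors a = 1.2, b = 0.5 (the commonly cited Rprop defaults) are assumptions, as are the clamping bounds:

```python
import numpy as np

def rprop(grad, w0, delta0=0.1, inc=1.2, dec=0.5, epochs=100,
          delta_min=1e-6, delta_max=50.0):
    """Resilient backpropagation sketch: per-weight step sizes adapted
    from the sign agreement of successive gradients (Eqs. 5-4, 5-5)."""
    w = np.asarray(w0, dtype=float)
    delta = np.full_like(w, delta0)   # per-weight step magnitudes
    g_prev = np.zeros_like(w)
    for _ in range(epochs):
        g = grad(w)
        same = g * g_prev > 0         # same direction as last epoch: grow
        flip = g * g_prev < 0         # direction changed: shrink
        delta[same] = np.minimum(delta[same] * inc, delta_max)
        delta[flip] = np.maximum(delta[flip] * dec, delta_min)
        w = w - np.sign(g) * delta    # only the gradient's sign is used
        g_prev = g
    return w
```

Because only the sign of each gradient component enters the update, the step sizes are insensitive to the near-zero gradient magnitudes produced by saturated nonlinearities.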
The Scaled Conjugate Gradient (SCG) method was developed to eliminate the computationally expensive line search from the conjugate gradient approach [Dem00, Mol93]. The method takes advantage of computational savings when the product of the Hessian and a vector is computed. However, the SCG is a model-trust-region method and requires the addition of a scaling coefficient, λ, to govern the trust region. The basic update takes the form of (5-6) and (5-7) with,

α_k = −(d_kᵀ·∇P_k) / (d_kᵀ·H_k·d_k + λ_k·|d_k|²)    (5-10)

β_k = (∇P_kᵀ·∇P_k − ∇P_kᵀ·∇P_{k−1}) / (d_{k−1}ᵀ·∇P_{k−1})    (5-11)

where λ is adjusted to ensure the validity of the model. All the conjugate gradient methods assume a static performance surface to properly execute the line search or, in the case of SCG, to determine whether a performance improvement can be realized.

Newton's Method

Newton's method makes use of second-derivative information, via the Hessian matrix, to arrive at the performance minimum in fewer steps. The basic update for Newton's method is given by,

Δw = −H⁻¹∇P    (5-12)

However, since the direct computation of the Hessian is computationally expensive, quasi-Newton methods have been developed that iteratively estimate the inverse of the Hessian at each epoch. One of the most successful approaches is the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) quasi-Newton algorithm [Dem00]. Quasi-Newton approaches determine the change direction using the inverse Hessian estimate and the change magnitude using a line search as follows,

Δw = −α·H̃⁻¹·∇P    (5-13)

where α is determined using a line search and H̃⁻¹ is the inverse Hessian estimate. The BFGS method requires storage of the inverse Hessian approximation, which may consume prohibitive amounts of memory for large networks. The One-Step Secant (OSS) algorithm is a simplified quasi-Newton approach that was developed to avoid the storage requirements for the Hessian estimate [Bat92, Dem00]. The method assumes that for each epoch the previous Hessian estimate was the identity matrix and uses the same update rule as (5-13).
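For reference, a single Newton update per (5-12), sketched for the case where the Hessian is available explicitly (the function names are mine). On a quadratic performance surface, one such step lands exactly on the minimum, which is the motivation for the quasi-Newton approximations above:

```python
import numpy as np

def newton_step(grad, hess, w):
    """One Newton update per Eq. (5-12): w <- w - H^{-1} * grad(w).
    Solves the linear system instead of forming H^{-1} explicitly."""
    return w - np.linalg.solve(hess(w), grad(w))
```

For P(w) = (1/2)·wᵀAw − bᵀw, the gradient is Aw − b and the Hessian is A, so a single step from any starting point returns the minimizer A⁻¹b.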
Each of these methods assumes a static performance surface for proper execution of its line searches.

Levenberg-Marquardt (LM) Algorithm

The Levenberg-Marquardt algorithm uses a model-trust-region technique specifically designed to minimize a sum of squares error (SSE) function. The model assumes a locally linear network, which produces a parabolic error surface [Bis95, Dem00]. Based on these assumptions, the algorithm computes estimates of the Hessian matrix and gradient as,

H = JᵀJ    (5-14)
∇P = Jᵀe    (5-15)

where J is the Jacobian matrix and e is the vector of network errors computed as the difference between the desired and current outputs. The method adds a variable diagonal element, μ, to the Hessian approximation to compensate when the model-trust assumptions are not valid. The value of μ is decreased with each successful step and increased with each unsuccessful step. As μ approaches zero, the weight adaptation direction becomes that of Newton's method. For large μ, the direction becomes parallel to steepest descent. The LM update rule is summarized by,

Δw = [JᵀJ + μI]⁻¹Jᵀe    (5-16)

The LM algorithm assumes a static performance surface for divergence measures, and its efficiency is highly dependent upon the assumptions regarding the error criterion.

Performance Divergence Measures

Some of the algorithms above use a performance divergence measure to decide whether the new weight update should be kept or whether it should be discarded while readjusting the parameters of the search routine. The simplest test compares the current performance value to the previous performance value and is summarized by,

P_{k+1} > P_k    (5-17)

If the test in (5-17) is false, the step is considered positive for learning; if true, it is considered divergent. However, to increase robustness to noise and shallow local minima, a small amount of performance degradation is often allowed. A common implementation of this measure is the specification of a small maximum change percentage in the performance.
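A sketch of the LM loop of (5-14) through (5-16) with the usual μ schedule (halve on success, double on failure; the factors and starting value are illustrative, not the dissertation's). Here e is written as the residual (current minus desired output) and J as its Jacobian, a sign convention equivalent to the desired-minus-current form in the text:

```python
import numpy as np

def levenberg_marquardt(residual, jac, w0, mu=1e-2, epochs=50):
    """LM sketch: dw = -(J^T J + mu I)^{-1} J^T e  (cf. Eq. 5-16).
    mu shrinks after a successful step (toward Newton behavior) and
    grows after a rejected step (toward small steepest-descent steps)."""
    w = np.asarray(w0, dtype=float)
    cost = lambda w: float(residual(w) @ residual(w))  # SSE
    c = cost(w)
    for _ in range(epochs):
        J = jac(w)
        e = residual(w)
        dw = -np.linalg.solve(J.T @ J + mu * np.eye(w.size), J.T @ e)
        if cost(w + dw) < c:          # successful step: accept, shrink mu
            w, c, mu = w + dw, cost(w + dw), mu * 0.5
        else:                         # unsuccessful: reject, grow mu
            mu *= 2.0
    return w
```

On a linear residual e(w) = Aw − b the JᵀJ model is exact, so the iteration converges to the least-squares solution in very few accepted steps.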
For the MSE and SSE criteria, this is typically implemented by evaluating (5-18), shown here for the case of a five percent degradation tolerance,

P_{k+1} / P_k > 1.05    (5-18)

If the test in (5-18) evaluates to true, the performance is considered to be diverging beyond the tolerable limit, the current update is discarded, and the search parameters are adjusted to stabilize learning accordingly. If it evaluates to false, adaptation is continued using either the current search parameters or parameters adjusted to speed up learning.

Line Search Routines

Several of the algorithms use line search routines to determine the adaptation step size in a particular direction for each epoch. A line search routine essentially looks for the minimum performance value along a line in a given direction in parameter space. Many line search routines exist, including bisection, quadratic interpolation, cubic interpolation, golden section search, and backtracking search. In addition, several hybrid line search routines are popular, including Brent's search, hybrid bisection-cubic search, and Charalambous' search. Regardless of the search method, each assumes a static performance surface as performance values are evaluated along the search line and used to determine the next estimate towards the minimum.

Difficulties and Remedies for ITL

In general, straightforward application of most of the advanced parameter search techniques described above to ITL systems will not yield acceptable results. Essentially, the difficulties are due to assumptions in these algorithms that do not necessarily hold true for ITL systems in general. The paragraphs below examine some of the characteristics of ITL systems that necessitate the careful tailoring of most of these routines to accommodate ITL criteria. In addition, proposed remedies are suggested for each case. This section concludes with a summary of how the various advanced training algorithms have been adapted for use with ITL systems.
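Of the line searches listed, the golden section search is the simplest to sketch. This minimal version assumes a unimodal performance profile along the search line and bracketing endpoints supplied by the caller; it is illustrative only and omits the re-use of function evaluations that a production implementation would include:

```python
import math

def golden_section(f, a, b, tol=1e-6):
    """Golden-section search for the minimum of a unimodal f on [a, b].
    Each iteration shrinks the bracket by the factor 1/phi ~ 0.618."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - invphi * (b - a)   # interior probe points
    d = a + invphi * (b - a)
    while b - a > tol:
        if f(c) < f(d):        # minimum lies in [a, d]
            b, d = d, c
            c = b - invphi * (b - a)
        else:                  # minimum lies in [c, b]
            a, c = c, d
            d = a + invphi * (b - a)
    return (a + b) / 2.0
```

In the training context, f(α) would be the ITL performance evaluated at w + α·d for the current search direction d, which is why each line search costs several full criterion evaluations per epoch.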
The Appendix describes the tailoring process for one of the simplest (GD) and one of the most complex (LMLS) training algorithms in more detail.

Adaptive Kernel Size

A key component of PDF estimation is the appropriate selection of the kernel size. For ITL adaptation, the kernel size is also the impetus for learning, since adaptation is caused by interactions between samples, which are directly governed by their spacing and the kernel size. Research has demonstrated that allowing the kernel size to adapt during training can lead to improved results [Erd02]. However, this characteristic of ITL systems can present problems for many of the advanced parameter search techniques. The cause of the problems is that the kernel size not only governs the level of interaction between samples but also, and consequently, controls the shape of the performance surface. Figure 5-1 compares the ED-QMI performance surface for three different values of kernel size (σ² = 0.001, 0.01, 0.1) for Example 3 described below. The figures show that although the surfaces generally have similar shapes and similar locations of extrema within the weight space, there are two significant differences. First, as intuition suggests, the relative smoothness of the performance surface increases with larger kernel size. Notice the shallow local minima in the case when σ² = 0.001. Second, the value of the performance function varies with the kernel size, as shown by the different scales for the performance surface plots on the right.

Figure 5-1. Performance surface variation with kernel size. [Performance surfaces and contours for σ² = 0.001, 0.010, and 0.100.]
These differences need to be carefully considered when training ITL systems and especially when comparing the training performance of these systems. The impact of a dynamic performance surface on a search algorithm that assumes a static surface can be disastrous, quickly leading to divergence and instability. Some of the key methods that are impacted include performance divergence measures, line searches, and conjugate direction selection. In order to accommodate a dynamic performance surface while taking advantage of the improved efficiency of advanced training techniques, some limitations are imposed on the adaptation of the kernel size. The kernel size is frozen at its current value during the execution of any sequence of search algorithm steps that requires a performance comparison or a line search execution. Once these steps have run to completion and the resulting set of weights has been determined, kernel size adaptation is allowed to take place. For many algorithms, this requires re-evaluation of the performance so that it can be used in the next epoch as the basis for comparison. Although this results in an increased computational burden for each epoch in a given algorithm, the advantage of adapting the kernel size based on the new distribution of samples is realized.

Positive and Negative Performance Values

Use of the MSE and SSE criteria produces a non-negative performance surface where a zero value corresponds to zero error, or a perfect fit. The nature of Renyi's quadratic entropy estimator from (3-12) allows the entropy to take on any real value. This characteristic can be troublesome to some of the performance divergence tracking measures described above. Equation (5-18) no longer works if both performance values are not positive. A simple solution is to adjust (5-18) to accommodate negative numbers, as described by (5-19),

P_{k+1} / |P_k| > 1.05    (5-19)

While (5-19) works in practice, it is not as robust as might be expected.
Notice that this solution, like (5-18), allows a smaller absolute change as the performance approaches zero, but in this case, from both the positive and negative sides. This is shown in Figure 5-2. While this type of behavior is desirable for the SSE and MSE criteria, since zero is the ideal terminal performance, the zero value has no such significance when evaluating the entropy criterion. This behavior would likely render searches less robust to noise as the performance crosses this boundary.

Figure 5-2. Performance divergence measure. [Allowed percent deviation versus previous performance for the MSE/SSE measure and for the entropy measure.]

This shortcoming motivates the search for a performance divergence metric that is tuned to the specific characteristics of entropy estimation. To arrive at such a metric, first note that for a given set of entropy estimation parameters (N, M, σ), where N is the number of samples, M is the dimensionality of the sample space, and σ is the kernel size for the PDF estimation, the maximum and minimum values of entropy are readily obtained using,

H_min = (M/2)·log(4π) + (M/2)·log(σ²)    (5-20)

H_max = (M/2)·log(4π) + (M/2)·log(σ²) + log(N)    (5-21)

This suggests that at any point during training, the current entropy estimate can be compared relative to the maximum range of entropy using the metric in (5-22),

H_relative = (H_current − H_min) / (H_max − H_min)    (5-22)

Equation (5-22) is defined as the relative entropy and has a range of [0,1]. The expressions in (5-20), (5-21), and (5-22) are discussed in further detail in Chapter 7. At this point, it is sufficient to note that the relative entropy provides a basis for a performance divergence metric that is indeed tuned to the characteristics of our entropy estimator. Consider the test described by (5-23).
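The bounds and metric of (5-20) through (5-22) translate directly into code. This sketch assumes the (M/2)·log(4πσ²) form of the bounds as written above, and packages the divergence test of (5-23) as a helper; the function names are mine:

```python
import math

def relative_entropy(h, n, m, sigma2):
    """Map a quadratic-entropy estimate into [0, 1] using the bounds
    H_min = (M/2) log(4 pi sigma^2) and H_max = H_min + log N
    (cf. Eqs. 5-20 through 5-22)."""
    h_min = 0.5 * m * math.log(4.0 * math.pi * sigma2)
    h_max = h_min + math.log(n)
    return (h - h_min) / (h_max - h_min)

def diverging(h_prev, h_curr, n, m, sigma2, tol=0.05):
    """Divergence test in the spirit of Eq. (5-23) for the MaxEnt
    criterion: flag a drop in relative entropy beyond the tolerance."""
    return (relative_entropy(h_prev, n, m, sigma2)
            - relative_entropy(h_curr, n, m, sigma2)) > tol
```

Because the test operates on the normalized [0, 1] scale, the tolerated absolute change no longer shrinks artificially as the raw entropy crosses zero, which is the weakness of (5-19).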
H_relative(k) − H_relative(k+1) > 0.05    (5-23)

Equation (5-23) provides the desired metric for the MaxEnt criterion.

Relative Error versus Absolute Squared Error

While the MSE and SSE criteria provide an absolute error measure, ITL criteria provide only a relative measure of error. The LM algorithm, designed specifically for the SSE criterion, uses the assumption of a quadratic performance surface along with the absolute SSE to determine both the direction and magnitude of the weight change. This presents two problems. First, the computation of the SSE criterion is typically built into LM algorithm implementations and requires a set of desired target outputs to compute the absolute SSE. For ITL training, a desired target set is not always available, and even if one is available, it need not have the same dimensionality as the output space. This difficulty can be overcome relatively easily by substituting the ITL information forces from (3-13) for the error term, e, in the gradient computation of the LM algorithm shown in (5-15). Although this approach now provides the correct search direction according to the ITL criterion, it introduces a second problem. LM uses the absolute error assumption to determine the magnitude of the step size; however, in ITL systems the information forces only provide a relative measure of the error for each output sample. This mismatch in assumptions can lead to improper step sizes during LM parameter adaptation, as shown in Figure 5-3. The figure shows the contours of an ITL performance surface for a simple two-parameter system identification problem of an infinite impulse response (IIR) filter with N=40 samples. In addition, the weight tracks for three implementations of LM-ITL algorithms are also shown. The first, labeled 'LM', is the weight track for the implementation described above.
Although the algorithm seems to be pointing in the appropriate direction at each step, many epochs are required for convergence compared to typical LM behavior because of an inappropriately small step size. To help alleviate this problem, two methods are proposed. The first method, shown in the figure as 'LMSC', simply scales the step size by scaling the ITL information forces in proportion to the number of samples, N, in the ITL problem. This approach assumes that the ITL information forces provide an average error that needs to be scaled in order to yield a larger magnitude for larger numbers of samples. An ad hoc value for the scale factor of N/9 was determined empirically to be fairly robust. As shown in the figure, this approach takes larger steps than LM and arrives at the minimum in fewer epochs. The second approach combines LM with a line search algorithm (LMLS) to determine the step magnitude. Although the line search adds computational complexity per epoch, convergence happens in very few steps. The update rule for this method is given by,

Δw = α·[JᵀJ + μI]⁻¹Jᵀe    (5-24)

where α is determined using a line search.

Figure 5-3. Levenberg-Marquardt ITL algorithms: Weight tracks. [Weight tracks for the LM, LMSC (scaled), and LMLS (line search) implementations over the ITL performance surface contours.]

The computational complexity of these three methods is compared in Figure 5-4. Notice how both of the proposed methods converge more efficiently than the basic LM method. This improved efficiency becomes more pronounced for larger values of N.

Figure 5-4. Levenberg-Marquardt ITL algorithms: Training efficiency. [Performance versus FLOPs for the LM, LMSC, and LMLS methods.]
Summary of Advanced Parameter Search Algorithms for ITL

The sections above describe some of the difficulties encountered when trying to apply standard advanced parameter search techniques to ITL problems. For each problem encountered, a viable solution or workaround has been identified, allowing ITL systems to be trained with more advanced algorithms. While most of the algorithms described required some tailoring to be compatible with ITL systems, two methods, GD and RP, were sufficiently robust to be applied to ITL with no modification. RP is of particular interest because it is also very computationally efficient. Table 5-1 summarizes the algorithms that have been applied to ITL along with the associated modifications that were required to arrive at a suitable implementation. The sections that follow will present comparisons of these methods on sample problems. Because of the complexity involved with each computation of the ITL criteria, the traditional approach of comparing algorithms by number of epochs is better performed using a lower-level metric such as the number of FLOPs.

Table 5-1. Advanced parameter search algorithm modification summary for ITL.

                                              Modifications Required
Acronym  Description                     Adaptive     Performance  Relative Error
                                         Kernel Size  Tracking     Adjustment
GD       Gradient Descent
GDA      GD w/ Adaptive Learning Rate        X            X
GDX      GDA w/ Momentum                     X            X
RP       Resilient Backpropagation
LM       Levenberg-Marquardt                                            X
LMSC     LM Scaled Step Size                 X                          X
LMLS     LM Line Search                      X                          X
SCG      Scaled Conjugate Gradient           X
CGB      CG Powell-Beale                     X
CGF      CG Fletcher-Reeves                  X
CGP      CG Polak-Ribiere                    X
OSS      One Step Secant                     X
BFGS     BFGS Quasi-Newton                   X

Sample Problems

To provide comparisons, several sample problems are presented that use different ITL criteria in different ways. The first example is an unsupervised feature extraction problem that maximizes the output entropy of a feature extraction network to generate entropy-based features.
The second example is a supervised training problem that maximizes ED-QMI to solve the frequency-doubling problem. Both of these problems use a fixed kernel size. The last example attempts to solve the frequency-doubling problem with a simplified network and explores the differences between fixed and adaptive kernel sizes.

Example 1: Maximum Entropy Feature Extraction

This example is based on the maximum entropy and PCA feature extraction experiment [Pri00]. In this experiment, the input is 50 samples from a pair of two-dimensional Gaussian distributions with mean vectors and covariance matrices,

Σ₁ = [1 0; 0 0.1], m₁ = [1 1],  Σ₂ = [0.1 0; 0 1], m₂ = [1 1]    (5-25)

A two-input, four-hidden-node, single-output multilayer perceptron (2-4-1 MLP) with hyperbolic tangent nonlinearities is used to extract one feature from the two-dimensional data using the entropy maximization criterion. The kernel size is fixed for this example. Figure 5-5 shows typical examples of the input sample distribution with contours, the initial output distribution, and the final output distribution. Notice how the final trained feature contours have adapted to equally partition the input space, as opposed to the untrained features. This is further illustrated by the relative uniformity of the final output distribution versus the initial distribution. In order to compare the various parameter search algorithms, 30 simulations were conducted with each algorithm using different initial conditions and data samples. To maintain a fair comparison, the same 30 initial-condition and data-sample pairs were used for all of the algorithms. The learning curve for each run was saved and averaged for each algorithm. Figure 5-6 shows the average learning curves for the various training algorithms. Note that all of the advanced training techniques converge in fewer epochs than the GD methods.
Figure 5-5. Maximum entropy feature extraction. [Input distribution with initial and final feature contours, initial output distribution, and final output distribution.]

Figure 5-6. Feature extraction: Parameter search comparison in epochs. [Average learning curves versus epochs for GD, GDA, GDX, RP, SCG, CGB, CGF, CGP, BFG, OSS, LM, LM-scaled, and LM-w/LineSearch.]

Figure 5-7 provides a more telling measure of computational efficiency. This figure shows the average learning curve versus the number of FLOPs required to achieve the given performance level for each parameter search algorithm. In this figure, however, the computational complexity of the LMLS method makes it less efficient than GD. This is likely due to the computational expense of adding the line search to LMLS, which requires multiple ITL criterion evaluations. The GDX, LMSC, LM, and SCG algorithms are comparable in efficiency to GD, while the conjugate gradient, OSS, and BFG methods provide superior convergence in significantly fewer FLOPs. The RP algorithm converges quickly initially but does not achieve the terminal accuracy of some of the other advanced methods.

Figure 5-7. Feature extraction: Parameter search comparison in FLOPs. [Average learning curves versus FLOPs for the same set of algorithms.]

Example 2: Frequency Doubler

The second example is based on the frequency-doubling problem in [Pri00]. The input signal is a sampled sine wave with a period of 40 samples. The desired output is a sine wave with twice the frequency. Five delay elements are used to create a six-input time delay neural network (TDNN). The network topology consists of two hidden units and one output unit, all with hyperbolic tangent nonlinearities. The resulting network contains a total of 17 connection weights and biases. Figures 5-8 and 5-9 show the input and desired signals and the network topology, respectively.
Figure 5-8. Frequency doubler: Input and desired signals.

The system is trained by maximizing the ED-QMI of the network output and the desired signal with a fixed kernel size. Training was conducted using the same set of 30 different initial conditions for each of the parameter search techniques. Performance over the 30 trials is averaged for each of the advanced training techniques. At least two local minima of the performance surface were identified during the training of this system, one at 0.03 and one at 0.08, with the global minimum corresponding to a performance value of roughly 0.12. These local minima correspond to a flat signal with a single hump of the desired sine wave and a staircase signal with level changes at each zero crossing of the desired signal, as shown in Figure 5-10. Figure 5-11 plots the average of the 30 learning curves against the number of FLOPs required for the various parameter search algorithms.

Figure 5-9. Frequency doubler: Network topology.

In this example, GD is clearly the least efficient algorithm. Some computational efficiency is achieved by each of the advanced algorithms; however, the most dramatic improvement is realized by RP and the various Levenberg-Marquardt methods. Notice that the gradient descent, conjugate gradient, and quasi-Newton methods tend to get trapped in the first local minimum most frequently, as shown by their high average terminal values of around 0.035 to 0.04. After this group, the LMSC, RP, LM, and LMLS methods demonstrate progressively better terminal values. Of particular interest is the efficiency of the RP method, especially in the early stages of training. While exhibiting a very good terminal average, RP has the steepest initial phase of learning. In terms of final performance, however, the LMLS method performs the best overall.
Figure 5-10. Frequency doubler: Local minima outputs. [Desired signal and the two local-minimum outputs over one 40-sample period.]

Figure 5-11. Frequency doubler: Parameter search comparison. [Average learning curves versus FLOPs for the full set of training algorithms.]

Example 3: Simplified Frequency Doubler

This final example, also based on the frequency-doubling problem, uses a simpler network topology to allow visualization of the effects and advantages of allowing an adaptive kernel size. Again, the input signal is a sampled sine wave with a period of 40 samples and the desired output is a sine wave with twice the frequency. In this case, a single delay element is used to create a two-input TDNN. The network topology consists of a single output unit with no bias connections and a hyperbolic tangent nonlinearity. The resulting network contains only two weights. Figure 5-8 shows the input and desired signals; however, since the network only has two weights, the network does not have the capacity to fully obtain the desired signal. The actual performance surface for this two-parameter example is shown in Figure 5-1 for various kernel sizes. A closer examination of the performance surfaces in this figure reveals that the ED-QMI is maximized along the two lines in the weight space described by (5-26),

w₂ = 0.7724·w₁  and  w₂ = 1.1013·w₁    (5-26)

Interestingly, these lines correspond to the linear combinations of the input sine waves that result in net phase shifts of 5, 15, 25, and 35 samples. These phase shifts have the effect of aligning the peaks and zero-crossings of the network output with the peaks of the desired signal. The optimum outputs for this network using ED-QMI based on the weight space equations above are given by the equations in (5-27), of which the first two are shown graphically in Figure 5-12 for C₁ = C₂ = 1.
Note that if C1 and/or C2 are negative, the 5- and 15-sample phase shifts become 25 and 35 samples, respectively. Thus, the four global minima for this performance surface correspond to the blue and cyan signals in Figure 5-12 and their respective negatives, subject to scaling.

y1 = C1 sin 2πf(n − 5)
y2 = C2 sin 2πf(n − 15)
y3 = C3 sin 2πf(n − 25)
y4 = C4 sin 2πf(n − 35)   (5-27)

Figure 5-12. Desired and EDQMI optimal signals.

The nonlinearity in the network topology limits the magnitude of the constants Ci to avoid distortion caused by saturation of the outputs. The four global minima for the kernel size of σ² = 0.01 are approximately (±2.2, ∓2.85) and (±4.35, ∓3.95), which are indeed collinear with the optima described above. Performance also deteriorates near the origin, as might be expected. In contrast, when using the MSE criterion with this reduced-size network, optimization drives the two weight values to zero for a net zero output. Note that even in this simple example, the ITL method, unlike MSE, tries to pass as much information as possible through the limited topology. Experiments were conducted using three sets of initial conditions: IC#1 (0,1) near the first minimum, IC#2 (4,3) near the second minimum, and IC#3 (4,0) near a plateau. As a baseline, the network is trained using a fixed kernel size of σ² = 0.01. Results show, as expected, that convergence is dependent on the initial conditions (IC). Figure 5-13 shows the weight tracks for the RP, GDA, and LMSC algorithms for the three different initial conditions; the circles indicate the final values. All three ICs yield different convergence. It is evident from the weight tracks that, with a fixed kernel size, the LMSC algorithm performs very well in reaching the local minimum closest to the IC.
Also, note that for IC#3 all three algorithms fail to find a global minimum; rather, they get stuck in a shallow minimum in the plateau.

Figure 5-13. Fixed kernel size weight tracks.

Next, the same experiment was conducted using an adaptive kernel size. As noted previously, the selection of an appropriate kernel size for ITL is crucial for proper learning to occur. Because of the difficulty of selecting a single fixed value of the kernel size, an adaptive scheme that determines the kernel size from the distribution of the data is desirable. Some schemes compute the kernel size to ensure that a given number of interactions (from 1 to N) occur between outputs. In this particular example, a 1-N mean scheme is used, which averages the kernel size for one interaction with that for N interactions as described in (5-28), although other schemes could be used as well.

σ = [min(nearest-neighbor-dist(Y)) + max(nearest-neighbor-dist(Y))] / 2   (5-28)

Figure 5-14. Fixed and adaptive kernel size weight tracks for IC#3.

The weight track results for this 1-N mean adaptive kernel size method are shown in Figure 5-14 for IC#3; the stars indicate the final values for the adaptive case. Note that with an adaptive kernel size, the algorithms were able to converge towards one of the global minima in all three cases. The methods were able to move past the plateau that hindered the algorithms in the fixed kernel size case, confirming that there is indeed a benefit to using an adaptive kernel size. Figure 5-15 plots both the kernel size and the network performance for the RP method. Note that the kernel size remains relatively large until the training performance begins to improve as it moves past the plateau.
Then, both the kernel size and the performance decrease sharply.

Figure 5-15. Kernel size and training performance for RP and IC#3.

Conclusions

Experimental results have demonstrated that the application of properly tailored advanced parameter search algorithms can provide significant computational savings over standard gradient descent for the training of ITL systems. In the examples provided, computational savings of greater than two orders of magnitude were realized in some cases. As with the training of traditional systems, the training of different ITL systems is not always accomplished most efficiently by a single algorithm. Like their standard counterparts, the ITL advanced search methods have similar advantages and disadvantages in terms of storage, complexity, and robustness when applied to problems with different numbers of samples, weights, and criteria. In the examples presented, the LM methods and the RP method were overall the most efficient. However, the results suggest that, in general, applying several algorithms might be necessary to avoid the pitfalls of relying on a single algorithm for a specific problem. The results have also demonstrated the importance of the kernel size to the training of ITL systems. The coupling of an inappropriate kernel size with certain initial conditions was shown to render the search algorithms susceptible to shallow local minima in the performance surface that were merely artifacts of the impact of too small a kernel size on PDF estimation. A 1-N mean kernel size adaptation algorithm was described that ensures some level of interaction between the system outputs. The benefits of using an adaptive kernel size during training, and the need to tailor the parameter search algorithms to accommodate this adaptation, were also shown.
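The 1-N mean kernel-size rule of (5-28) is straightforward to implement. The sketch below is a minimal illustration, not code from this work: the function name and the use of Euclidean nearest-neighbor distances over the output samples are assumptions.

```python
import numpy as np

def one_n_mean_kernel_size(y):
    """1-N mean rule of (5-28): average the smallest and the largest
    nearest-neighbor distance among the system outputs Y."""
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    d = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    nn = d.min(axis=1)                   # nearest-neighbor distance per sample
    return 0.5 * (nn.min() + nn.max())

# smallest nearest-neighbor distance is 1 (between 0 and 1), largest is 4 (for 9)
print(one_n_mean_kernel_size([0.0, 1.0, 5.0, 9.0]))  # -> 2.5
```

Averaging the two extremes guarantees at least one pairwise interaction while keeping the kernel wide enough that no sample is isolated, which is what lets training escape the plateau in the experiment above.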
CHAPTER 6
BATCH TRAINING APPROACH

Introduction

The algorithmic optimization of ITL criteria computation and the application of advanced parameter search techniques for ITL training have resulted in significant reductions in the computational demands of training ITL systems. Despite these improvements, however, the computational complexity of a given ITL problem remains proportional to the square of the number of samples, N². Because of this, the application of ITL to problems with very large data sets or continuously increasing data length, such as continuous time series data, continues to present a computational challenge for practical implementations. In this chapter, a batch training method for ITL systems is presented that segments the original data into batches in order to reduce the N² computational dependency. The stochastic information gradient (SIG) method of Erdogmus [Erd01], which was developed concurrently with and independently of the batch method presented herein, is shown to be a special case of the batch training approach. Several examples are used to show the benefits of batch training for ITL applications.

Batch Training Approach Overview

The investigation of batch training for ITL is motivated by the increasingly slow training that results from the combination of large data sets with the N² computational complexity of ITL criteria. Fundamentally, the batch approach uses subsets of the entire data set to train the learning system incrementally with smaller batches of appropriately sampled data. Theoretically, the justification for conducting ITL training using batch techniques depends on the validity of at least one of two assumptions. The first is that learning can be accomplished incrementally with subsets of the full data set. The second is that there is some level of redundancy in the data.
It has been empirically demonstrated for many problems that the first assumption holds true and that computationally efficient training can be accomplished using the SIG method [Erd01]. The utility of the SIG method shows that sufficient information is contained in sequential samples of a properly sampled time series to incrementally learn the characteristics of a signal. Consider, however, the conditions that might cause the SIG method difficulties. Suppose that a noisy time series is sampled at a very high rate relative to its bandwidth. In this case, the differences between adjacent noise-corrupted samples might not be sufficient for the SIG to be used efficiently compared to a full ITL implementation. However, since the system has been oversampled, the full data set necessarily contains redundant information, and simply resampling the full data set at a lower rate and applying ITL to the reduced-size set could reduce the computational complexity considerably. In general, if a signal space is oversampled, discarding redundant information can provide computational savings. If a signal space is appropriately sampled, the results herein, along with those of the SIG research, suggest that incremental learning with computational savings can be achieved. Since it is often unclear whether the density of the sampled data is too sparse or too dense, the primary difficulty lies in the selection of the most efficient batch training implementation. Some guidelines for the selection of batch training parameters are presented later in this chapter.
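The oversampling argument above can be made concrete with a small sketch; the signal, sample rates, and decimation factor here are illustrative assumptions. Because the ITL cost grows as N², keeping every tenth sample of a tenfold-oversampled series cuts the cost by a factor of one hundred.

```python
import numpy as np

fs, f0 = 1000.0, 5.0               # a 5 Hz sine sampled at 1 kHz (illustrative values)
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * f0 * t)

x_dec = x[::10]                    # resample at 100 Hz, discarding redundant samples
print(len(x), len(x_dec))          # -> 1000 100
print((len(x) / len(x_dec)) ** 2)  # N^2 cost ratio -> 100.0
```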
The description and selection of these parameters are discussed in detail below, along with other considerations that are important to a proper batch ITL implementation. The batch sample size parameter, Bsize, represents the size of the batch, that is, the number of samples to be used for the batch ITL computation. The relationship between the batch size and the original number of samples, N, governs the maximum computational reduction achievable with a batch implementation. The number of epochs per batch, Neb, is the number of iterations given to a particular parameter search algorithm to train with each batch. The number of batches, Nb, represents the number of times a new batch is generated and used for training. A simple example is presented below to provide a clear illustration of these new parameters and the potential advantage of a batch training technique. This example is based on a time-series prediction problem, the second example in [Erd01]. In this problem, the input data set consists of N=32 samples of a composite sinusoidal signal generated by

x(t) = sin 20t + 2 sin 40t + 3 sin 60t   (6-1)

and sampled at 100 Hz. Each of these 32 samples is a two-dimensional vector consisting of sequential signal samples to be used to predict the next sample. Thus, the full input to the ITL system can be described by (6-2):

X[2×32] = [x(1)[2×1] … x(32)[2×1]] = [ x(1) … x(32) ; x(0) … x(31) ]   (6-2)

For illustration purposes, suppose that when using the full input set for ITL training, satisfactory performance is obtained after Ne=100 epochs. Note that this corresponds to a batch implementation where Bsize=N, Nb=1, and Neb=Ne, so the single batch is the entire input set. The computational complexity of this approach is proportional to the product of the square of the number of samples and the number of epochs (i.e., N²×Ne = 102,400), or equivalently, Bsize²×Nb×Neb.
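The setup in (6-1) and (6-2) can be sketched as follows; the variable names and array layout are illustrative, not the original implementation:

```python
import numpy as np

n = np.arange(33)          # one extra sample so that x(0) fills the bottom row
t = n / 100.0              # 100 Hz sampling
x = np.sin(20 * t) + 2 * np.sin(40 * t) + 3 * np.sin(60 * t)   # the signal of (6-1)

# 2x32 input matrix of (6-2): column k is the vector [x(k), x(k-1)]^T
X = np.stack([x[1:33], x[0:32]])
print(X.shape)             # -> (2, 32)

# complexity of full-set training, proportional to N^2 * Ne
print(32 ** 2 * 100)       # -> 102400
```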
Now consider a batch implementation where the batch size is one quarter of the entire data set (i.e., Bsize = N/4 = 8). The input subset batches, based on a sequential non-overlapping selection method, would be as follows:

XB1[2×8] = [x(1)[2×1] … x(8)[2×1]]
XB2[2×8] = [x(9)[2×1] … x(16)[2×1]]
XB3[2×8] = [x(17)[2×1] … x(24)[2×1]]
XB4[2×8] = [x(25)[2×1] … x(32)[2×1]]   (6-3)

thus producing four batches, or Nb=4. Using each of these four batches, the learning system is trained for Neb epochs, with the weights carried over from batch to batch. For comparison purposes, suppose that the case where Nb=4 and Neb=100 produces performance similar to the full ITL training case. The computational complexity is then proportional to Bsize²×Nb×Neb = 25,600, a reduction by a factor of four. The relationship between the computational demands of batch training and full ITL training is summarized by (6-4):

(Bsize² × Nb × Neb) / (N² × Ne)   (6-4)

From (6-4), it is clear that linear reductions in Bsize accompanied by less-than-quadratic increases in Nb×Neb result in more computationally efficient training. This potential for computational savings was the motivating factor behind the development of a batch training approach for ITL.

Batch Selection Methods

Various approaches exist for selecting the batch subsets from the entire data set. The SIG method simply selects sets of sequential data samples. Random sampling is also a viable approach. Another approach is to select a representative subset of the entire data set based on the distribution of the original data; although a true implementation of this method can be computationally expensive, simplified alternatives exist. The following three batch selection methods have been implemented for ITL batch training to accommodate any Bsize from two to N:

1. Sequential: Selects sets of sequential samples based on the index value.
2. Random: Selects a random subset of unique samples with uniform likelihood over the sample index.
3.
Representative: Sorts the original sample set based on sample vector magnitude and then selects uniformly over the sorted distribution.

Comparisons of the performance of these methods are presented in the sections below.

Other Batch Training Considerations

Other critical decisions in the design of a batch training approach include the selection of an adaptation method along with its associated training parameters. Experimental results have demonstrated that the ideal set of batch training parameters for one training method is not necessarily optimal for another; for example, the ideal batch size for GD learning might differ from that for LM learning. The results presented in the sections below try to capture some of the characteristics of, and trade-offs between, the various adaptation methods and batch training parameters.

Stochastic Information Gradient (SIG) Method

The SIG method was developed to provide an online adaptation algorithm for ITL system training with the MinErrEnt criterion. The method is derived from an approximation to the gradient of the error entropy cost function that yields a stochastic, instantaneous estimate of the gradient [Erd01]. The approach estimates the gradient based on a window of the most recent Nsig samples of a time series; the special case where Nsig=2 represents the standard SIG implementation. Interestingly, this approach is a special case of the batch training approach described above, with Bsize=Nsig=2, Neb=1, sequential batch selection, gradient descent adaptation, and the error entropy criterion.

Batch Training Comparison Results

Performance Comparison Metric

The ideal method for comparing different batch training implementations plots the system performance, as measured by the value of the information-based criterion for the problem at hand, versus the computational complexity of the implementation.
The optimal implementation would be the one that achieves the desired performance level with the least computational complexity. During the actual batch training process, however, the learning system adapts its parameters based on an estimate of the system performance computed from a subset of the input space. As a result, the run-time performance estimates are noisy and not a good metric for a fair comparison of the different implementations. To account for this, the learning system weights are saved at each epoch during batch training. After training, these stored weights are used to compute the actual performance on the full input data set for each epoch. The number of FLOPs is used to measure the computational complexity of the algorithms for each epoch at run time.

Variations of the SIG

Figure 6-1 compares several variations of the standard SIG method on the time-series prediction problem described above. In these comparisons, GD is the parameter adaptation method with a step size of 0.2. Recall that the standard SIG method corresponds to Bsize=2, Neb=1, and sequential batch selection. In Figure 6-1, all combinations of Bsize=[2,3,4] and Neb=[1,2,3] are shown. Note that early in training, increasing the batch size alone has an adverse effect on training efficiency. In the later stages of training, however, the batch sizes of three and four converge more rapidly than the batch size of two, as denoted by the steeper slopes of their curves. This suggests that more efficient training might be achieved by gradually increasing the batch size as training progresses. Note also that increasing the number of epochs per batch tends to have a positive effect on efficiency. Clearly, several implementations converge faster than the standard two-sample SIG method. Also of interest is that the full N-sample ITL implementation requires approximately 12 MFLOPs and is still far from convergence on this scale.
This clearly shows the computational savings that can be realized using batch methods.

Batch Training with Advanced Parameter Adaptation

This section compares several variations of the batch training method when combined with some of the advanced parameter search techniques from Chapter 5. Again, the problem is the time-series prediction problem described previously. In these comparisons, five sets of batch parameters are used, as described in Table 6-1. Note that the first case corresponds to a full ITL implementation.

Figure 6-1. Batch training comparison for GD.

Table 6-1. Batch training parameter settings.

Setting #   Bsize   Neb   Nb
1           32      100   1
2           16      50    4
3           8       25    16
4           4       12    64
5           2       6     256

Each setting is applied, and the system is trained using the SCG, RP, and LM methods for ITL training. Figures 6-2, 6-3, and 6-4 show the training performance comparisons for SCG, RP, and LM, respectively, using sequential batch selection. For the SCG case in Figure 6-2, note that the convergence becomes gradually smoother as the batch size is increased; this is the general trend for all adaptation methods. However, the computational efficiency reaches an optimum around Bsize=4 to Bsize=8 for both SCG and RP. Notice that in each case, the optimum performance is achieved neither with the full input set (Bsize=32) nor with Setting #5 (Bsize=2). For comparison, the performance of the standard SIG method is shown in each of the figures. Notice that computational savings of greater than an order of magnitude are achieved in many cases by combining advanced parameter search with batch methods.
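The savings quoted in these comparisons can be checked against the cost ratio of (6-4); a minimal helper (the function name is illustrative):

```python
def batch_cost_ratio(bsize, nb, neb, n, ne):
    """Relative cost of batch training versus full ITL training, per (6-4)."""
    return (bsize ** 2 * nb * neb) / (n ** 2 * ne)

# the chapter's worked example: Bsize=8, Nb=4, Neb=100 against N=32, Ne=100
print(batch_cost_ratio(8, 4, 100, 32, 100))    # -> 0.25
# Setting #1 is the full ITL case, so the ratio is exactly 1
print(batch_cost_ratio(32, 1, 100, 32, 100))   # -> 1.0
```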
Figure 6-2. Batch training comparison for SCG.

Batch Selection Methods

Figure 6-5 compares the three proposed batch selection methods on the time-series prediction problem described previously. For each case, Bsize=3 and Neb=3 with GD parameter adaptation. Notice that in these cases, sequential selection performs best while representative selection performs worst. This is understandable, since there is no noise in the system and the representative method incurs a significant amount of overhead in computing the vector magnitudes.
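The three selection methods can be sketched as follows; the function name, signature, and column-per-sample layout are illustrative assumptions rather than the original implementation:

```python
import numpy as np

def select_batch(X, bsize, k, method, rng=None):
    """Return batch k of size bsize from the columns (samples) of X."""
    n = X.shape[1]
    if method == "sequential":        # consecutive, non-overlapping indices
        idx = np.arange(k * bsize, (k + 1) * bsize) % n
    elif method == "random":          # unique samples, uniform over the index
        rng = rng or np.random.default_rng()
        idx = rng.choice(n, size=bsize, replace=False)
    elif method == "representative":  # uniform picks over the magnitude-sorted set
        order = np.argsort(np.linalg.norm(X, axis=0))
        idx = order[np.linspace(0, n - 1, bsize).astype(int)]
    else:
        raise ValueError(method)
    return X[:, idx]

# toy 2x8 input whose column magnitudes are simply 0..7
X = np.vstack([np.arange(8.0), np.zeros(8)])
print(select_batch(X, 4, 1, "sequential")[0])      # -> [4. 5. 6. 7.]
print(select_batch(X, 4, 0, "representative")[0])  # -> [0. 2. 4. 7.]
```

The per-sample norm in the representative branch is the overhead noted above: it touches every sample once per batch, which is why that method loses to sequential selection on this noiseless problem.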