
- Permanent Link:
- https://ufdc.ufl.edu/UFE0022867/00001
## Material Information
- Title:
- Random Set Framework for Context-Based Classification
- Creator:
- Bolton, Jeremy
- Place of Publication:
- [Gainesville, Fla.]
- Publisher:
- University of Florida
- Publication Date:
- 2008
- Language:
- english
- Physical Description:
- 1 online resource (127 p.)
## Thesis/Dissertation Information
- Degree:
- Doctorate ( Ph.D.)
- Degree Grantor:
- University of Florida
- Degree Disciplines:
- Computer Engineering
- Computer and Information Science and Engineering
- Committee Chair:
- Gader, Paul D.
- Committee Members:
- Banerjee, Arunava
- Wilson, Joseph N.
- Ritter, Gerhard
- Slatton, Kenneth C.
- Graduation Date:
- 12/19/2008
## Subjects
- Subjects / Keywords:
- Datasets ( jstor )
- Inference ( jstor )
- Learning ( jstor )
- Logical givens ( jstor )
- Machine learning ( jstor )
- Population estimates ( jstor )
- Probabilities ( jstor )
- Random variables ( jstor )
- Statistical models ( jstor )
- Topology ( jstor )
- Computer and Information Science and Engineering -- Dissertations, Academic -- UF
- based, classification, concept, context, drift, hyperspectral, learning, machine, multitemporal, pattern, random, recognition, set
- Genre:
- Electronic Thesis or Dissertation
- born-digital ( sobekcm )
- Computer Engineering thesis, Ph.D.
## Notes
- Abstract:
- Pattern classification is a fundamental problem in intelligent systems design. Many different probabilistic, evidential, graphical, spatial-partitioning and heuristic models have been developed to automate classification. In some applications, there are unknown, overlooked, and disregarded factors that contribute to the data distribution, such as environmental conditions, which hinder classification. Most approaches do not account for these conditions, or factors, that may be correlated with sets of data samples. However, unknown or ignored factors may severely change the data distribution, making it difficult to use standard classification techniques. Even if these variable factors are known, there may be a large number of them. Enumerating these variable factors as parameters in clustering or classification models can lead to the 'curse of high dimensionality' or sparse random variable densities. Some Bayesian approaches that integrate out unknown parameters can be extremely time consuming, may require a priori information, and are not suited for the problem at hand. Better methods for incorporating the uncertainty due to these factors are needed. We propose a novel context-based approach for classification within a random set framework. The proposed model estimates the posterior probability of a class and context given both a sample and a set of samples, as opposed to the standard method of estimating the posterior given a sample. This conditioned posterior is then expressed in terms of priors, likelihood functions and probabilities involving both a sample and a set of samples. Particular attention is focused on the problem of estimating the likelihood of a set of samples given a context. This estimation problem is framed in a novel way using random sets. Three methods are proposed for performing the estimation: possibilistic, evidential, and probabilistic. These methods are compared and contrasted with each other and with existing approaches on both synthetic data and extensive hyperspectral data sets used for minefield detection algorithm development. Results on synthetic data sets identify the pros and cons of the possibilistic, evidential and probabilistic approaches and existing approaches. Results on hyperspectral data sets indicate that the proposed context-based classifiers perform better than some state-of-the-art, context-based and statistical approaches. ( en )
- General Note:
- In the series University of Florida Digital Collections.
- General Note:
- Includes vita.
- Bibliography:
- Includes bibliographical references.
- Source of Description:
- Description based on online resource; title from PDF title page.
- Source of Description:
- This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
- Thesis:
- Thesis (Ph.D.)--University of Florida, 2008.
- Local:
- Adviser: Gader, Paul D.
- Electronic Access:
- RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2009-06-30
- Statement of Responsibility:
- by Jeremy Bolton.
## Record Information
- Source Institution:
- University of Florida
- Holding Location:
- University of Florida
- Rights Management:
- Copyright Bolton, Jeremy. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
- Embargo Date:
- 6/30/2009
- Resource Identifier:
- 430115860 ( OCLC )
- Classification:
- LD1780 2008 ( lcc )

## Full Text

RANDOM SET FRAMEWORK FOR CONTEXT-BASED CLASSIFICATION

By JEREMY BOLTON

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2008

© 2008 Jeremy Bolton

ACKNOWLEDGMENTS

I thank my mother, Lois Bolton, father, Wade Bolton, and sister, Chelsea Bolton, for their relentless love and support. I thank my advisor, Paul Gader, for his guidance and encouragement throughout my tenure at the University of Florida. I thank my committee, Paul Gader, Joseph Wilson, Gerhard Ritter, Arunava Banerjee, and Clint Slatton, for their insight and guidance, which has steered my research and bettered the resulting contributions. I thank my lab mates for their support and am thankful for their ability to endure my shenanigans. I thank Alina Zare, Nathan VanderKraats, Nicholas Fisher, Xuping Zhang, Raazia Mazhar, Wen-Hsiung Lee and Seniha Esen Yuksel for their encouragement, suggestions and aid in my research. I thank colleagues Jim Keller, Hishem Frigui, and Dominic Ho for their collaboration on a variety of research projects. I thank William Clark of Army Research Office (ARO), Russell Harmon of ARO, Miranda Schatten of Night Vision and Electronic Sensors Directorate, and Michael Cathcart of Georgia Tech for their support of my research.
TABLE OF CONTENTS

- 1 INTRODUCTION
  - Problem Statement and Motivation
  - Proposed Solution
- 2 LITERATURE REVIEW
  - Concept Drift
  - The Problem of Concept Drift
  - Concept Drift Solutions
  - Instance selection
  - Instance weighting
  - Ensemble learning
  - Applications to Hyperspectral Imagery
  - Probability Introduction
  - Topology
  - Probability Space
  - Measure
  - Standard Random Variables
  - Standard Statistical Approaches for Context Estimation
  - Random Sets
  - General Case: Random Closed Set
  - Random Set Discussion
  - Theory of Evidence
  - Point Process
  - Random Measures
  - Variational Methods
  - Set Similarity Measures
  - Random Set Applications
  - Point Process Applications
  - En Masse Context-Based Methods
- 3 TECHNICAL APPROACH
  - Mathematical Basis of the Random Set Framework
  - Possibilistic Approach
  - Development
  - Dependent Optimization
  - Independent Optimization
  - Evidential Model
  - Development
  - Optimization
  - Probabilistic Model
  - Development
  - Optimization
  - Discussion
- 4 EXPERIMENTAL RESULTS
  - KL Estimation Experiment
  - Experimental Design
  - Results
  - Synthetic Data Experiment
  - Experimental Design
  - Results
  - Hyperspectral Data Experiment
  - Experimental Design
  - Results
  - Upper and Lower Bounding Experiment
  - Experimental Design
  - Results

LIST OF TABLES

- 4-1. Average inference error for each dataset using 15 test and 15 train samples.
- 4-2. Average classification error of the listed context-based classifiers on four data sets used in the Synthetic Data Experiments.
- 4-3. How classification varies with respect to the number of germ and grain pairs for data set 3 (with no outlying samples) in the Synthetic Data Experiment.

LIST OF FIGURES

- 1-1. Spectral samples exhibiting contextual transformations.
- 1-2. Illustration of contextual transformations in a feature space.
- 3-1. Samples of Gaussian distributions drawn using randomly selected means and variances which were drawn uniformly from a specified interval.
- 3-2. Learning the representative function using update Equations.
- 3-3. Similarities and distinctions between the proposed method and standard methods.
- 4-1. Illustration of data sets one, two, and three.
- 4-2. Error analysis of the Riemann and uniform approximation methods with respect to time and number of observation samples.
- 4-3. Trials using data sets 1, 2, 3 and 4 in the Synthetic Data Experiment.
- 4-4. ROC curve for The Hyperspectral Data Experiment. Note the dashed plot is the results from the probabilistic context-based classifier using the analytical solution for KL estimation as discussed in Equation 3-40.
- 4-5. Hyperspectral Experiment ROC curve of PD versus PFA for the possibilistic, evidential, probabilistic, set-based kNN, and whiten / dewhiten approaches.
- 4-6. Example of a false alarm POI from The Hyperspectral Data Experiment.
- 4-7. Example of a target alarm POI from The Hyperspectral Data Experiment.
- 4-8. Example of a target alarm POI from The Hyperspectral Data Experiment.
- 4-9. Example of a false alarm POI from The Hyperspectral Data Experiment.
- 4-10. Example of a false alarm POI from The Hyperspectral Data Experiment.
- 4-11. Detection results for the possibilistic RSF classifier and results for standard Gaussian mixture classifiers equipped with variable numbers of mixture components.
- 4-12. Non-crossvalidation detection results for the possibilistic RSF classifier and the oracle classifier.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

RANDOM SET FRAMEWORK FOR CONTEXT-BASED CLASSIFICATION

By Jeremy Bolton

December 2008

Chair: Paul Gader
Major: Computer Engineering

Pattern classification is a fundamental problem in intelligent systems design. Many different probabilistic, evidential, graphical, spatial-partitioning and heuristic models have been developed to automate classification. In some applications, there are unknown, overlooked, and disregarded factors that contribute to the data distribution, such as environmental conditions, which hinder classification. Most approaches do not account for these conditions, or factors, that may be correlated with sets of data samples. However, unknown or ignored factors may severely change the data distribution, making it difficult to use standard classification techniques. Even if these variable factors are known, there may be a large number of them. Enumerating these variable factors as parameters in clustering or classification models can lead to the 'curse of high dimensionality' or sparse random variable densities. Some Bayesian approaches that integrate out unknown parameters can be extremely time consuming, may require a priori information, and are not suited for the problem at hand. Better methods for incorporating the uncertainty due to these factors are needed. We propose a novel context-based approach for classification within a random set framework.
The proposed model estimates the posterior probability of a class and context given both a sample and a set of samples, as opposed to the standard method of estimating the posterior given a sample. This conditioned posterior is then expressed in terms of priors, likelihood functions and probabilities involving both a sample and a set of samples. Particular attention is focused on the problem of estimating the likelihood of a set of samples given a context. This estimation problem is framed in a novel way using random sets. Three methods are proposed for performing the estimation: possibilistic, evidential, and probabilistic. These methods are compared and contrasted with each other and with existing approaches on both synthetic data and extensive hyperspectral data sets used for minefield detection algorithm development. Results on synthetic data sets identify the pros and cons of the possibilistic, evidential and probabilistic approaches and existing approaches. Results on hyperspectral data sets indicate that the proposed context-based classifiers perform better than some state-of-the-art, context-based and statistical approaches.

CHAPTER 1 INTRODUCTION

Problem Statement and Motivation

When collecting data, many known and unknown factors transform the observed data distribution. In many applications, such as remote sensing, sets of samples are collected at a given time. In remotely sensed imagery, images are taken from a remote location such as a plane. These images are essentially sets of pixels, or samples, that are collected at the same time. In this instance, many of the unknown or unspecified factors may influence all of the samples in the image, or some subset thereof, similarly. That is, all of the samples in an image subset may undergo the same transformation induced by these factors. Optical character recognition (OCR) is another application where factors may influence the results of classification.
In OCR, if a classifier could identify the font or font size of a document, the problem of character recognition may be simplified. In this problem, the font or font size is a factor, or context, which may change the appearance of the sample, or the character.

Before we fully characterize the problem at hand, we state some assumptions and define a few terms which are necessary for the problem statement. We assume that similar samples collected in similar conditions or situations will undergo similar transformations. We define a population as a set of samples collected under the same conditions or situation. We define the idea of context as the surrounding conditions or situations in which data are collected. We define contextual factors as the unknown or unspecified factors that transform the data's appearance. Given these definitions, we can define a contextual transformation as a transformation that acts on sets of samples on a context-by-context basis. We attempt to estimate a population's context using the observed population's distribution.

In a probabilistic approach, context can be viewed as hidden random variables that are correlated with the observed samples. This view implies that the observed samples are dependent on these hidden variables. In many standard models, classification accuracy suffers due to contextual factors. If these variables are ignored, many classification methods will suffer since the sample values may be severely altered by contextual transformations. On the other hand, if their values are specified and corresponding parameters are enumerated in a model, problems such as the curse of dimensionality or sparse probability distributions may hinder classification results.

Example 1.1 Contextual transformations: In this example, we illustrate that contextual factors are present in remotely sensed hyperspectral imagery (HSI) collected by airborne hyperspectral imager (AHI).
In this data, each pixel in an image has a corresponding spectral vector, or spectral signature, with intensity values in the long wave infrared (LWIR), 7.8 µm to 11.02 µm. Each spectral signature is usually viewed as a plot of wavelength vs. intensity. Figure 1-1A illustrates multiple spectral signatures, or spectra, from a target class and a non-target class, indicated by a solid line and a dashed line, respectively.

Two consequences of contextual transformations can hinder classification. The first problem is the obvious change in sample appearance in varying contexts, which we refer to as a non-disguising transformation. An algorithm must know the appearance of a target sample for identification; therefore, if a target can potentially take on multiple appearances, then a classifier must be aware of all potential appearances. The second problem occurs when samples from one class, in some context, are transformed to appear as samples from another class in another context, which we refer to as disguising transformations. We characterize these problems separately since their solutions require different approaches.

Solutions to non-disguising transformations require knowledge of the various target class appearances. An algorithm developer could simply add model constructs or parameters to account for varying appearances. For example, a developer could add densities to a mixture model to account for multiple appearances due to multiple transformations. However, this solution will not resolve the problem of disguising transformations, since samples from different classes have the same appearance. In this situation, context estimation is used to identify relevant models, those constructed for contexts similar to the one in which our test population was observed, thereby disregarding models or parameters constructed for irrelevant contexts.

Assume we want to classify the bolded spectral signature shown in Figure 1-1A.
Classification is difficult since this spectral vector has the same appearance as some target and non-target spectra from various contexts. However, if we disregard the spectra collected in a different context, classification becomes less complicated, as illustrated in Figure 1-1B.

Example 1.2 Feature space transformation: Suppose we have images of scenes containing pixels with values in $\mathbb{R}^n$. For the sake of illustration, we assume $n = 2$ and each image $X$ has a continuum of pixels. Each of these pixels corresponds to a measurement of some object in the real world. We would assume that the pixel's value would depend on the object it represents in the real world, but there are contextual factors that will influence the pixels' values. In this example, there are five images containing pixels that represent two objects in the real world, x and o. Some of these images were taken in different contexts; thus each is affected by different influencing factors. These contextual factors transform the data collected in distinct contexts differently. These transformations may cause sets of samples to have different spatial distributions, or shapes, in a feature or sample space, as shown in Figure 1-2A.

Assume the goal is to label some samples in $X_1$, denoted by *, using some labeled samples from the other images, illustrated in Figure 1-2B. If we ignore the population information, the classification problem becomes more difficult, as shown in Figure 1-2C. Instead, if we emphasize, to an algorithm, datasets which appear to have been collected in a similar context, the job of classification may be simplified, as shown in Figure 1-2D. A similar spatial distribution of sets may indicate that a similar transformation has acted on the populations and that they have therefore been collected in similar conditions. We propose that if this contextual information is gathered and utilized correctly, classification results should improve.
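The idea in Example 1.2 can be sketched in a few lines of code. The following is only a minimal illustration under stated assumptions, not the method developed in this dissertation: it assumes a context acts roughly as a shift of the feature space, and it uses the distance between population means as a stand-in for "similar spatial distribution". The data values and the names `population_mean`, `most_similar_population`, and `classify_1nn` are hypothetical.

```python
import math

def population_mean(points):
    """Return the mean point of a population of 2-D samples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def most_similar_population(test_points, training_populations):
    """Pick the training population whose mean is closest to the test mean.

    Assumption: distance between means is a crude proxy for a similar
    spatial distribution; the source text does not prescribe this measure.
    """
    test_mean = population_mean(test_points)

    def gap(population):
        points = [point for point, _ in population]
        return math.dist(test_mean, population_mean(points))

    return min(training_populations, key=gap)

def classify_1nn(point, labeled_samples):
    """Label a point with the label of its nearest labeled sample."""
    return min(labeled_samples, key=lambda s: math.dist(point, s[0]))[1]

# Two hypothetical training images (populations) of labeled samples.
# Context B is roughly context A shifted by about +4 along the first
# feature, so an 'o' in context B lands where an 'x' sits in context A.
image_a = [((0.0, 0.0), "o"), ((0.5, 0.2), "o"),
           ((4.2, 0.0), "x"), ((4.4, 0.2), "x")]
image_b = [((4.0, 0.0), "o"), ((4.6, 0.2), "o"),
           ((8.0, 0.0), "x"), ((8.5, 0.2), "x")]

# Unlabeled test population, collected in a context like B.
test_image = [(4.2, 0.1), (4.4, 0.0), (8.2, 0.1), (8.4, 0.3)]

# Ignore population information: pool every labeled sample.
pooled = image_a + image_b
labels_pooled = [classify_1nn(p, pooled) for p in test_image]

# Use population information: keep only the most similar training image.
matched = most_similar_population(test_image, [image_a, image_b])
labels_matched = [classify_1nn(p, matched) for p in test_image]
```

With these toy values, pooling all samples labels the first two test samples as x, because context A's x samples sit where context B's o samples fall, while restricting to the matched population labels them o. Any set similarity measure could replace the mean distance used here; it was chosen only to keep the sketch short.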
Proposed Solution

The problem of variable contextual factors is similar to some existing problems such as concept drift, where the idea of a target class and/or its governing distribution may change with respect to time or some hidden context. In Example 1.2, a solution would need to include a method for determining a similar distribution, or shape, relationship between populations. A more general solution would provide a method for modeling the shape of populations from a particular context.

Standard context-based classifiers suffer from a number of limitations. Most notably, they lack the ability to solve the problem of disguising transformations, as mentioned in Example 1.2. Many classifiers attempt to estimate context, which we propose is best identified by analysis of an entire population, by inspecting a single sample. Many existing models also suffer from restrictions, inappropriate assumptions, and the lack of ability to handle all forms of concept drift. Most standard statistical methods make the independent and identically distributed (i.i.d.) assumption that limits their ability to capture any information found through the analysis of the set of samples.

The proposed solution uses a random set [1]-[7] model for population context estimation. A population's context is then considered when each sample of the population is classified. This model has the ability to estimate context by inspecting the distribution of a set of samples. Populations, after undergoing contextual transformations induced by contextual factors, are compared to contextual models (modeled using random sets) in an attempt to identify the context in which they were collected.

Specifically, the creation of the proposed context-based classifier consists of factors for context estimation and class estimation. The classification factor will estimate the class of each sample using class models, one for each context.
The context estimation factor will identify the relevance of each model based on the estimated context of the test population and subsequently weight each model's contribution by contextual relevance. The identification of context allows for more informed class estimation, emphasizing models relevant to the test population's context and ignoring the irrelevant models. Note that the proposed model implicitly acquires the context of a sample set without explicitly performing any estimation of the contextual factors. A subsequent benefit of this approach is that it avoids the curse of high dimensionality and sparse densities, which are potential pitfalls of methods that would directly account for these contextual factors.

The proposed random set model allows for evidential, probabilistic, and possibilistic approaches due to the inherent versatility of the random set. Furthermore, it also has the ability to avoid the aforementioned limitations and to handle all forms of concept drift. Existing standard and state-of-the-art methods are surveyed, analyzed, and compared to the proposed approach. Results from experiments indicate that the proposed random set model improves classification results from existing methods in the face of hidden contexts.

Figure 1-1. Spectral samples exhibiting contextual transformations. A) Spectra from target and non-target classes collected by AHI in multiple contexts. The target class is indicated by a solid line and a non-target class is indicated by a dashed line. B) An unlabeled sample shown in bold along with two labeled samples collected in the same context.

Figure 1-2. Illustration of contextual transformations in a feature space. A) Five images in some feature space that is a subset of $\mathbb{R}^2$. B) Labeled samples from each training image and unlabeled samples from the test image. C) All samples without contextual information.
D) Using a similarly distributed training image to label the samples in the test image.

CHAPTER 2 LITERATURE REVIEW

The following is a review of current literature pertinent to problems and solutions arising from contextual factors. First, the problem of concept drift is detailed along with standard and state-of-the-art solutions [12]-[58]. Next, a brief review of context-based approaches with applications to hyperspectral imagery is given [59]-[67]. Next, a brief mathematical and statistical review is given to assist in the development of the proposed random set framework [1]-[11]. Standard statistical methods are reviewed and their potential uses for context estimation are developed. Through the development we indicate that alternative methods may model the idea of context better than standard approaches. Next, the random set is defined and introduced as a method better suited for context estimation [1]-[7]. This is followed by a few examples of set similarity measures, which are reviewed to assist in set analysis [69]-[72]. Next, we review some existing formulations and applications of random sets. Finally, we review some state-of-the-art, en masse, context-based approaches, which treat sets as unitary elements for context estimation.

Concept Drift

The idea that samples of a class may change with respect to time is an area of recent research. We begin our discussion with a benchmark solution to this problem. One of the first algorithms developed to analyze and contend with this occurrence is STAGGER, which was developed by Schlimmer and Granger and is based on a psychological and mathematical foundation [24]. STAGGER has 4 major steps: initialization, projection, evaluation, and refinement.
In initialization, the description of a concept or class is constructed using a set of pairs consisting of logical statements, or characterizations, used to describe a class and corresponding weights used to weight the importance of each description. In this step, the concept is specified. In projection, a Bayesian scheme is implemented to estimate the frequency of occurrences of the characterizations in subsequent samples. These probabilities are updated after the class of a new sample is determined. In this step, new samples are inspected to determine if the frequency or weighting of each characterization is representative of the data. In evaluation, the effectiveness of each characterization is determined based on the number of correct and incorrect predictions for each characterization. In this step, the concept characterizations are evaluated to determine if there should be a change in these concept characterizations. In refinement, the characterizations and corresponding weights are modified based on their evaluations to improve their effectiveness as predictors.

The Problem of Concept Drift

STAGGER is one approach that contends with the change of concepts with respect to time or some hidden context. One of the more popular formulations of this problem, concept drift, has recently become an area of much research [18]-[57]. In concept drift, a concept may depend on some hidden context which is not given explicitly. Changes in the hidden context then induce changes in our target concept. This principle has been adopted by researchers in the machine learning community and has many applications in scientific research. Solutions to the problem should be able to adjust for concept drift, distinguish noise from concept drift, and recognize and adjust for repeat concepts [18].

Concept drift can be divided into two categories: real and virtual. In real concept drift, the concept or idea of a target class may change.
In virtual concept drift, the data distribution for a target class may change. The former is truly a concept shift (a change in concept), whereas the latter is simply a sampling shift (a change of data distribution due to some unknown context or variables). The idea of virtual concept drift is similar to our problem of hidden, population-correlated variables, since these may lead to a change in data distribution due to some hidden context. Concept drift can also be categorized as sudden or gradual. In sudden concept drift, the drift may be abrupt and substantial, whereas in gradual concept drift, the drift may be gradual and minimal. The problem at hand can be described as abrupt or sudden concept drift. The developed model allows data to be collected at variable times, not necessarily as a continuous flow with respect to time; in fact, the drift may be fairly substantial.

Concept Drift Solutions

There are three major approaches that are used to account for concept drift: instance selection, instance weighting, and ensemble learning. In instance selection, the goal is to select relevant samples from some training set for use in classifying test samples. A simple example of this approach is windowing, using sliding windows or k nearest neighbors (kNN) [22], [23], [25]-[30]. Instance weighting involves weighting instances of a training set based on their relevance. Usually in instance weighting a learning algorithm, such as boosting, is trained to weight these instances appropriately [31]-[33], [39], [40]. In ensemble learning, a set of concept descriptions is maintained and some combination of these descriptions is used to predict current descriptions, as in STAGGER.
This general approach could also be interpreted as a form of model selection, where the concept descriptions are in fact models or algorithms whose results are combined based on each concept description's relevance to a certain population [21], [24], [34]-[58].

In existing concept drift solutions, there are a number of restrictions, assumptions, and limitations that induce models that cannot account for all contextual transformations. Furthermore, almost all existing context-based solutions cannot solve the problem of disguising transformations as defined in Example 1.2. This drawback is due to the fact that context estimation is performed by inspecting one sample, rather than the entire population. There are five major limitations or pitfalls exhibited by existing concept drift algorithms.

1. Estimates context based on a single sample (C.1)
2. Recognizes only some forms of concept drift (C.2)
3. Identifies context arbitrarily or with major assumptions (C.3)
4. Admits solutions that are not robust to outliers (C.4)
5. Assumes semi-supervised environment (C.5)

We emphasize property C.1 since this is a conceptual flaw built into many concept drift algorithms. It presumes that the situation discussed in Example 1.2, disguising transformations, will not occur.

Next, we survey standard and state-of-the-art approaches to concept drift. In the following, we parenthetically indicate where properties C.1-C.5 are observed in the surveyed approaches. In almost all existing approaches, C.1 is present except when the approach is highly supervised and makes major assumptions for context identification.

Instance selection

In full memory approaches all training samples are kept, but a subset is selected to classify a given test sample. The process by which these samples are selected is the crux of instance selection approaches.
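Windowing, the simplest form of instance selection, can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the nearest-class-mean classifier, the scalar features, the toy stream, and all function names are hypothetical and are not taken from any of the surveyed algorithms. It keeps the full history, trains on the last h samples for several candidate window sizes, and retains the window whose classifier makes the fewest errors on a few recently labeled samples.

```python
def fit_class_means(samples):
    """Return {label: mean feature value} for (x, y) pairs with scalar x."""
    totals, counts = {}, {}
    for x, y in samples:
        totals[y] = totals.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: totals[y] / counts[y] for y in totals}


def predict(means, x):
    """Assign x to the class whose mean is closest."""
    return min(means, key=lambda y: abs(x - means[y]))


def window_error(history, h, recent_labeled):
    """Error rate on recent labeled samples of a classifier trained on the last h samples."""
    means = fit_class_means(history[-h:])
    wrong = sum(1 for x, y in recent_labeled if predict(means, x) != y)
    return wrong / len(recent_labeled)


def select_window(history, candidate_sizes, recent_labeled):
    """Pick the candidate window size with the smallest estimated error."""
    return min(candidate_sizes, key=lambda h: window_error(history, h, recent_labeled))


# Hypothetical stream: the old concept labels x < 0 as +1; after the drift the labels flip.
old_concept = [(-1.0, 1), (1.0, -1)] * 10
new_concept = [(-1.0, -1), (1.0, 1)] * 3
history = old_concept + new_concept
recent_labeled = [(-1.0, -1), (1.0, 1)]

best = select_window(history, [6, 26], recent_labeled)
print(best)
```

In this toy stream the short window covers only post-drift samples, so it is preferred over the window that also contains the outdated concept. Note that the sketch needs recent labels, which corresponds to the semi-supervised assumption listed as C.5.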
Widmer proposed a dynamic window size that is chosen based on time and classifier performance [30]. If the classifier is performing well, it is assumed that the concept has been constant for some time and a large window of samples is retained (C.2 and C.5). However, if performance decreases, it is assumed the concept is changing or has changed and the window size is shrunk (C.3 and C.4).

Klinkenberg et al. proposed an instance selection approach where a variable sized window is kept over the m most recent training samples, assuming that the last m samples will be reflective of new test samples (C.2) [33], [34]. The selected window size minimizes the error of a support vector machine (SVM) that is trained using the last h training samples. After the SVM is trained using the last h samples, an upper bound on the error can be directly estimated from the SVM parameters [28], [7]. After the candidate SVMs have each been trained on their last h samples, the training set with the least error is selected. The window size m is set to the minimizing h, as in Equation 2-1, and the corresponding training samples are used to classify the next test set.

$m = \arg\min_{h} \mathrm{Err}(h)$ (2-1)

Here the SVM is used for an upper bound error estimate, and when its estimate increases, a change in context is assumed (C.3 and C.4).

Salganicoff proposed Darling, which retains a selected sample until new samples are presented that occupy a similar subspace of the sample space [22]. This approach assumes that context changes are directly related to the sequence of observance and that context is selected based on a single sample (C.1 and C.3).

Maloof et al. proposed an instance selection approach which is similar in ideology to instance weighting methods [26], [27]. In partial-memory approaches each classification decision is made using some current characterization of a class and some subset of previously observed samples.
The term partial-memory refers to the fact that only a subset of previously observed samples is retained to assist in classification and concept updating. Specifically in this method, the concept descriptions are updated using selected samples and misclassified samples [26]. Given a classifier C, a data set D, and a partial memory P, the update procedure consists of six major steps.

1. P = {}
2. Classify D with C
3. Add misclassified samples to P
4. Retrain C using P
5. Select appropriate P
6. Repeat from step 2 when presented with new data D

Note that the classifier focuses on samples that it is misclassifying, which is assumed to be due to concept drift (C.5 and C.4). One way to select an appropriate set P is to retain particular samples if they help form the decision boundary. One selection technique, AQ-PM, which assumes a convex data set, identifies extreme points such as the points forming a covering hyperrectangle, thus enclosing, or bounding, particular samples.

Instance weighting

Instance weighting approaches weight certain samples differently for the purposes of classification. A popular instance weighting scheme is boosting. A popular boosting algorithm is Adaptive Boosting, or AdaBoost, where misclassified samples are emphasized during the parameter learning stage in a statistical manner [30]-[33]. The error term is calculated as follows:

$\epsilon_t = \sum_{i=1}^{n} D_t(i)\,[y_i \neq C_t(x_i)]$. (2-2)

In Equation 2-2, t is the learning iteration, $x_i$ is sample i, $y_i \in \{-1, 1\}$ is the class for $x_i$, $C_t$ is the classifier at iteration t, $D_t(i)$ is the weight for sample $x_i$ at iteration t, and $\epsilon_t$ is the average misclassification at iteration t. If the classifier misclassifies some samples, presumably due to concept drift (C.3 and C.5), the misclassified samples are emphasized (C.4) in the error term using the weight update formula

$D_{t+1}(i) = \frac{D_t(i)\exp(-\alpha_t\, y_i\, C_t(x_i))}{Z_t}$, (2-3)

where $\alpha_t$ is the weight assigned to classifier $C_t$ and $Z_t$ is a normalizing constant. This update increases the weights of misclassified samples to coerce the learning of the new concept in later iterations.
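A single pass of the error computation and weight update in Equations 2-2 and 2-3 can be sketched as follows. This is a minimal illustration, not a full boosting implementation: the classifier weight is assumed to follow the usual AdaBoost choice alpha_t = 0.5 ln((1 - eps_t) / eps_t), which the text above does not state, and the function names and the four-sample example are hypothetical.

```python
import math


def weighted_error(weights, labels, predictions):
    """Equation 2-2: total weight of the misclassified samples."""
    return sum(w for w, y, c in zip(weights, labels, predictions) if y != c)


def reweight(weights, labels, predictions):
    """Equation 2-3: emphasize misclassified samples and renormalize."""
    eps = weighted_error(weights, labels, predictions)
    if not 0.0 < eps < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    # Assumption: the standard AdaBoost classifier weight.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    unnormalized = [
        w * math.exp(-alpha * y * c)
        for w, y, c in zip(weights, labels, predictions)
    ]
    z = sum(unnormalized)  # normalizer, so the new weights sum to one
    return [u / z for u in unnormalized], eps, alpha


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # only the last sample is misclassified
new_weights, eps, alpha = reweight(weights, labels, predictions)
print(new_weights)
```

With this choice of alpha, the single misclassified sample ends up carrying half of the total weight after the update, which is how the next learning iteration is pushed toward it.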
Note that the weight update in Equation 2-3 is similar to increasing the prior of $x_i$ in the statistical sense. Note also that if the boosting is done offline, just during training, this approach no longer exhibits property C.5, and perhaps not C.4; however, it will exhibit property C.1.

Dura, Lui, Zhang, and Carin proposed neighborhood-based classifiers where a test sample's neighborhood is used for classification [35]-[38]. This approach uses an active learning framework which attempts to extract information from some dataset and extend it to another sample under test (C.1). Classification is performed as shown in Equation 2-4.

$p(y_i \mid x_i, N(x_i), \theta) = \sum_{j=1}^{n} b_{ij}\, p(y_i \mid x_j, \theta), \qquad p(y_i \mid x_j, \theta) = \frac{1}{1 + \exp(-y_i\, \theta^T x_j)}$ (2-4)

In Equation 2-4, $y_i \in \{-1, 1\}$ is a class label, $x_i$ is a test sample, the $x_j$ are retained samples that are in the neighborhood $N(x_i)$, the $b_{ij}$ are the weights for each neighbor, and $\theta$ is a parameter vector. The construction of the weight $b_{ij}$ and the neighborhood $N(x_i)$ is the crux of this algorithm. A few suggestions are shown in Equations 2-5 and 2-6.

$N_t(x_i) = \{(x_j, b_{ij}) : b_{ij} > 0,\ x_j \in X\}$ (2-5)

where

$b_{ij}^{t} = \frac{\exp(-0.5\,\|x_i - x_j\|^2 / \sigma_i^2)}{\sum_{k=1}^{n} \exp(-0.5\,\|x_i - x_k\|^2 / \sigma_i^2)}$ (2-6)

In Equations 2-5 and 2-6, $b_{ij}$ is the transition probability from $x_i$ to $x_j$ in less than t steps in Markov random walks, and $\sigma_i$ is a kernel width [36], [37]. In some of their other proposed methods, an information theoretic approach is taken to construct $N(x_i)$ based on maximizing the determinant of the Fisher information matrix [35], [38]. Note that this approach also exhibits property C.1, since each sample is classified using itself and training data, not its population. Since the parameter $\theta$ does not vary, we assume there is only one concept descriptor, which is why we consider this an instance weighting approach. We note that this method could also be implemented using an ensemble learning approach.

Ensemble learning

In ensemble learning, an ensemble of concept descriptions, such as classifiers, is maintained and used in harmony for classification.
A popular approach, ensemble integration, employs a weighted scheme to determine the relevance of each classifier's output given a sample [41].

$C(x_i) = \sum_{j=1}^{J} w_{ij}\, C_j(x_i)$ (2-7)

Here the weight $w_{ij}$ is constructed to emphasize classifiers of greater contextual relevance. Equation 2-7 can be implemented in many ways, such as static voting/weighting or dynamic voting/weighting [39]-[58]. In ensemble approaches, the crux of the problem is deciding how to weight each context-based model.

The popular bagging approach constructs N classifiers, each trained on its own training set [43]. The training sets are constructed by randomly sampling the entire training set with replacement. Each of the sampled training sets contains m samples, where m is less than the total number of training samples. The classifiers, which act on individual samples, are then combined using voting and averaging techniques (C.1 and C.3).

The random forest model is a new approach using dynamic classifier integration [44], [45], [47]. This model attempts to minimize correlation between the individual classifiers while maintaining accuracy [43], [44]. Random subspaces and/or subsets of samples are chosen and a classifier, or tree, is trained using the corresponding samples (C.3). This is repeated N times to create a forest of N trees. Most of the time the classifiers are simply partitionings of the space, resulting in Boolean classification. Given a test sample, classification is determined by weighting each tree's confidence using the confidences of neighboring samples in the feature space (C.1, or C.3 and C.5, depending on implementation). The weight $w_i$ for tree i is assigned using Equation 2-8.
$w_i(x) = \frac{\sum_{j=1}^{k} 1_{OOB_i}(x_j)\, \sigma(x, x_j)\, mr_i(x_j)}{\sum_{j=1}^{k} 1_{OOB_i}(x_j)\, \sigma(x, x_j)}$ (2-8)

In Equation 2-8, $mr_i(x_j) \in \{-1, 1\}$ indicates whether classifier i has correctly classified sample j, $\sigma$ is a weighting function based on distance, k is the size of the neighborhood, and $1_{OOB_i}$ is the indicator function, which indicates whether its argument is an out-of-bag (OOB) sample, that is, a sample not used to train classifier i. The use of OOB samples allows for unbiased estimates. We note that, given some assumptions, the random forest approach is shown to perform at least as well as boosting and bagging [44].

Tsymbal et al. proposed an ensemble approach that maintains a set of models optimized over different time periods to handle local concept drift (C.2) [21], [39]. The models' predictions are then combined, in a sense integrating over classifiers. The selection of classifier predictions is based on a local classification error estimate performed after initial training. During testing, the k nearest neighbors of each test sample are used to predict the local classification errors of each classifier (C.1). Using these estimated errors, each classifier's predictions are weighted and the total prediction is calculated using integration.

Kuncheva and Santana et al. developed an ensemble approach where contexts, or training sets, are constructed by clustering the training data [48], [49]. Then, for each cluster, N classifiers are ranked such that each has a ranking in each cluster (set of samples). The weights for combination are proportional to the classifier's correct classification. A test sample is then classified using the k best classifiers from the sample subspace in which it resides (C.1).

Frigui et al. used fuzzy clustering methods to partition a feature space into assumed contexts [52].
During classification, the models representing a context in which a test sample lies are used for classification, where the classifiers are weighted by the corresponding fuzzy memberships of the test sample to the fuzzy cluster (C.1).

Harries et al. proposed an algorithm to learn hidden contexts called Splice [57], [58]. In this algorithm, a continuous dataset is partitioned, heuristically, into time intervals which supposedly represent partial contexts. Classifiers are then trained and ranked on each interval. The intervals, and classifiers, are then clustered in a manner similar to agglomerative clustering. If a classifier performs well on multiple contexts, the corresponding contexts and classifiers are merged and the classifiers are re-ranked based on classification results. The weights are then selected similarly to the approaches proposed by Kuncheva and Santana et al. (C.1) [48], [49].

Santos et al. proposed a subsetting algorithm that randomly creates subsets of the training data (C.3) [50]. A classifier is trained on each subset, which is assumed to be indicative of a context, and a genetic algorithm selection scheme is used to select the best fit classifiers, where fitness is based on error rate, cardinality, and diversity. Context models are then weighted based on the subset in which a test sample resides (C.1).

Qi and Picard proposed a context-sensitive Bayesian learning algorithm that models each training set as a component in a mixture of Gaussians [55]. In this model each training set, or context, has a corresponding linear classifier.

$p(y \mid x, D) = \sum_{i=1}^{I} p(y \mid x, D_i)\, p(D_i \mid x, D)$ (2-9)

In Equation 2-9, y is the class label for sample x using training dataset $D = \{D_1, \ldots, D_I\}$. The term $p(y \mid x, D_i)$ is estimated using the expectation propagation method [56]. Note that the data set weights are chosen based solely on the sample x and not on the sample and its population (C.1). Also, note that each $D_i$ is a training set and not necessarily the population of sample x.
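The structure of Equation 2-9 can be illustrated with a small sketch. Everything specific in it is an assumption made for illustration: features are scalar, each context is summarized by a single Gaussian with a prior, each per-context linear classifier is a fixed logistic function rather than one estimated by expectation propagation, and all names and parameter values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def context_weights(x, contexts):
    """p(D_i | x): responsibility of each Gaussian context for the sample x."""
    scores = [c["prior"] * gaussian_pdf(x, c["mean"], c["std"]) for c in contexts]
    total = sum(scores)
    return [s / total for s in scores]


def class_posterior(y, x, contexts):
    """Equation 2-9: sum over contexts of p(y | x, D_i) p(D_i | x)."""
    weights = context_weights(x, contexts)
    posterior = 0.0
    for w, c in zip(weights, contexts):
        margin = y * (c["slope"] * x + c["offset"])
        posterior += w / (1.0 + math.exp(-margin))  # logistic linear classifier
    return posterior


# Two hypothetical contexts with different linear classifiers.
contexts = [
    {"prior": 0.5, "mean": -2.0, "std": 1.0, "slope": 1.0, "offset": 3.0},
    {"prior": 0.5, "mean": 2.0, "std": 1.0, "slope": -1.0, "offset": 3.0},
]

x = -2.0
print(context_weights(x, contexts))
print(class_posterior(1, x, contexts), class_posterior(-1, x, contexts))
```

The context weights in this sketch depend on the single sample x only, which is exactly the behavior labeled C.1 above: no information from the rest of the sample's population enters the weighting.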
In the proposed random set model for context-based classification, test sets are used to estimate context, which alleviates property C.1 and furthermore does not induce properties C.2-C.5.

Applications to Hyperspectral Imagery

In the experiments, the proposed methods are tested using a hyperspectral dataset with apparent contextual factors. For this reason, we briefly discuss current, state-of-the-art methods used to contend with contextual factors in hyperspectral imagery. We note that some methods take different approaches or assume a different testing environment.

There are two major approaches to handling contextual transformations in hyperspectral data classification. The first approach relies on physical modeling using environmental information. The other uses statistical and/or mathematical methods to identify or mitigate the effects of contextual transformations. Next, we list some popular existing approaches which have been shown to be successful in some testing situations.

There has been much research that uses physical modeling of the effects of environmental factors on measured data. Here, classifiers may use the output of physical models, for example MODTRAN, which generate the appearance of target spectra in certain environments [59], [60]. For example, the hybrid detectors developed by Broadwater use target spectra that are estimated using MODTRAN, which is given environmental information about the scene [61]. This approach, and many like it, are shown to be very successful when environmental conditions are available.

Healy et al. proposed to use MODTRAN to produce spectra of various materials in various environmental conditions [62]. A vector subspace for each material is then defined by selecting an orthonormal basis for the material subspace. Confidence is then assigned to test spectra based on their distance to this subspace.
This approach provides a robust and intuitive solution; however, this classification method will suffer in the presence of disguising transformations.

Kuan et al. proposed a projection matrix, rooted in a physics-based linear reflectance model, which in effect normalizes environmental conditions between two images [63]. This approach has been shown to be successful at identifying regions of images and detecting change in coregistered imagery. It can learn a transformation of a set of samples; however, it requires that a fairly large number of test sample labels be known for the construction of the transformation matrix.

Fuehrer et al. proposed the use of atmospheric sampling, where a sample of some material is projected into some feature space based on the atmospheric conditions in which it was observed [64]. Samples in this feature space may then be used to assist, using locality analysis, in identifying material and atmosphere when presented with a test image. This method has been shown to be successful at classification and modeling; however, it cannot account for disguising transformations.

In these approaches, environmental conditions of a scene are assumed to be known a priori, or some ground truth is assumed to be known a priori, which may not be the case. In these other cases, different approaches must be taken.

The other tactic of existing methods uses various statistical and mathematical approaches to account for contextual transformations. Some selection, ensemble, and context-based methods attempt to identify models relevant to a test sample through context estimation. Some active learning approaches attempt to transfer knowledge to test samples.

Mayer et al. propose the whitening/dewhitening transformation. In this approach, transformation matrices are constructed to whiten and dewhiten spectra from an image [65].
In this approach, the whitening and dewhitening matrices are constructed to whiten the effects of environmental conditions. However, this approach requires a semi-supervised testing environment to construct the projection matrix. It also assumes that whitening of spectra will reduce or eliminate the effects of contextual factors. This assumption implies that the contextual transformation is simply a linear transformation based on a population's statistical properties, such as the mean and covariance. Mayer proposes the matched filter described in Equation 2-10.

$MF(x_{t,k}) = x_{t,k}^{T}\, R_{tt}^{-1}\, (L_t^{Transform} - \bar{x}_t)$ (2-10)

where

$L_t^{Transform} = \bar{x}_t + R_{tt}^{1/2} R_{11}^{-1/2} (L_1 - \bar{x}_1)$.

In Equation 2-10, $x_{t,k}$ is a test sample, $\bar{x}_1$ is the mean of clutter samples from labeled image 1, $\bar{x}_t$ is the mean clutter estimate from the test image, $L_1$ is the target estimate for labeled image 1, and $R_{11}$ and $R_{tt}$ are the clutter covariance matrices for image 1 and the test image, respectively.

Rajan et al. propose an active learning approach where a classifier, or learner, attempts to acquire knowledge from a teacher about new data points that may be from an unknown distribution [66]. In this so-called KL-max approach, the new data points and corresponding labels are chosen to maximize the KL divergence between the learned distributions and the learned distributions including the new data point and corresponding label. The labels, which are distributions, are then updated using the new data point and label. This approach could be used for context estimation, where various labels from existing classifiers are chosen based on the KL divergence; however, it estimates these labels sample-by-sample.

Many of the aforementioned existing methods either operate under different testing conditions, such as semi-supervised classification or environmental conditions that are known a priori, or cannot account for disguising transformations.
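The target transformation that accompanies Equation 2-10 can be illustrated with a deliberately simplified sketch. The main assumption, which is not made in the source, is that the clutter covariance matrices are diagonal, so the matrix square roots reduce to per-band standard deviations; the function names and the two-band spectra are hypothetical. A target signature from the labeled image is whitened with that image's clutter statistics and then dewhitened with the clutter statistics of the test image.

```python
import math


def band_statistics(spectra):
    """Per-band mean and variance of a list of equal-length spectra."""
    n = len(spectra)
    bands = range(len(spectra[0]))
    means = [sum(s[b] for s in spectra) / n for b in bands]
    variances = [sum((s[b] - means[b]) ** 2 for s in spectra) / n for b in bands]
    return means, variances


def transform_target(target_1, clutter_1, clutter_t):
    """Whiten the target with image-1 clutter statistics, dewhiten with test-image ones.

    Assumes diagonal covariances with nonzero variance in every band.
    """
    mean_1, var_1 = band_statistics(clutter_1)
    mean_t, var_t = band_statistics(clutter_t)
    return [
        mean_t[b] + math.sqrt(var_t[b] / var_1[b]) * (target_1[b] - mean_1[b])
        for b in range(len(target_1))
    ]


# Hypothetical two-band clutter samples from the labeled image and the test image.
clutter_1 = [[0.0, 0.0], [2.0, 2.0]]   # per-band mean 1, variance 1
clutter_t = [[1.0, 4.0], [5.0, 8.0]]   # per-band means 3 and 6, variance 4
target_1 = [3.0, 1.0]

print(transform_target(target_1, clutter_1, clutter_t))
```

If both images share the same clutter statistics, the transformed target equals the original target, which matches the reading that the transformation only accounts for differences in population mean and covariance.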
Probability Introduction

We now provide a brief mathematical and probabilistic review of the concepts that will be used in the proposed model. Due to the complex formulation of random sets, our review starts with the building blocks of probability and measure theory. The main purpose of the following review is the introduction of notation. For a rigorous mathematical development, see the literature [1]-[7].

Informally, a random variable is a mapping from a probability space to a measurable space. The probability space consists of a domain, a family of subsets of the domain, and a governing probability distribution. To formally define random variables, we need to introduce concepts from topology and measure theory.

Topology

Definition 2.1 Topology: A topology T on a set X is a collection of subsets of X that satisfies
1. $\emptyset, X \in T$
2. T is closed under arbitrary unions and finite intersections.
Such a pair, (X, T), is referred to as a topological space [10]. The set X is subsequently referred to as a topological space.

Topologies are generally described by construction. Usually a topology is said to be generated from some basis or subbasis $\mathcal{B}$.

Definition 2.2 Basis for a topology: A basis for a topology T on X is a collection $\mathcal{B}$ of subsets of X such that
1. For all $x \in X$ there exists a $B \in \mathcal{B}$ such that $x \in B$.
2. If $B_1, B_2 \in \mathcal{B}$ and $x \in B_1 \cap B_2$, then there exists a $B_3 \in \mathcal{B}$ such that $x \in B_3$ and $B_3 \subseteq B_1 \cap B_2$ [10].

Definition 2.3 Subbasis for a topology: A subbasis for a topology on X is a collection of subsets of X whose union is X. The topology generated by a subbasis S is the collection T of all unions of finite intersections of the elements of S [10].

The constituent sets of a topology are the focus of this review. Therefore, we fully detail them and the idea of measurability.

Definition 2.4 Open set: Given a topological space (X, T), all sets $G \in T$ are called open sets [10].

Definition 2.5 Closed set: The complement of an open set is a closed set [10].
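For a finite set, these definitions can be checked mechanically. The sketch below is only an illustration with hypothetical function names: it tests the topology axioms for a finite collection (where checking pairwise unions and intersections is enough), generates a topology from a subbasis by taking finite intersections and then unions, and reports whether a given subset is open, closed, both, or neither.

```python
from itertools import combinations


def is_topology(X, T):
    """Check the topology axioms for a finite collection T of subsets of X."""
    X = frozenset(X)
    T = {frozenset(s) for s in T}
    if frozenset() not in T or X not in T:
        return False
    return all((a | b) in T and (a & b) in T for a in T for b in T)


def topology_from_subbasis(X, S):
    """Generate a topology: finite intersections of S first, then all unions."""
    X = frozenset(X)
    S = [frozenset(s) for s in S]
    basis = {X}  # the empty intersection is taken to be X
    for r in range(1, len(S) + 1):
        for combo in combinations(S, r):
            basis.add(frozenset.intersection(*combo))
    basis = list(basis)
    topology = {frozenset()}  # the empty union is the empty set
    for r in range(1, len(basis) + 1):
        for combo in combinations(basis, r):
            topology.add(frozenset.union(*combo))
    return topology


def classify(X, T, A):
    """Return 'open', 'closed', 'both', or 'neither' for a subset A of X."""
    X, A = frozenset(X), frozenset(A)
    T = {frozenset(s) for s in T}
    is_open = A in T
    is_closed = (X - A) in T
    if is_open and is_closed:
        return "both"
    if is_open:
        return "open"
    if is_closed:
        return "closed"
    return "neither"


X = {1, 2, 3}
print(is_topology(X, [set(), {1}, {1, 2}, X]))
print(is_topology(X, [set(), {1}, {2}, X]))
print(sorted(sorted(s) for s in topology_from_subbasis(X, [{1, 2}, {2, 3}])))
print(classify(X, [set(), {1}, X], {2}))
```

The second collection fails because the union {1, 2} of two of its members is missing.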
A major misconception is that sets are either closed or open; however, this is not the case. In fact, sets in a topological space can be open, closed, neither, or both. For instance, in the standard topology on R, the interval [0,1) is neither open nor closed [10]. We emphasize that this is greatly dependent on how the topology is generated. There are topologies that do not share the intuitive characteristics of the standard topology on R.

We next define some attributes of a topological space, which help characterize important concepts. Many of these attributes, such as compactness, are assumed when dealing with sets, but in the following they are formally defined for clarity.

Definition 2.6 Cover: A collection of subsets of a space X is said to cover X if the union of its elements is X. Furthermore, an open cover of X is a cover whose elements are open sets [10].

Definition 2.7 Connectedness: A topological space (X, T) is connected if there does not exist a pair of disjoint, non-empty, open subsets U and V of X whose union is X [10].

Definition 2.8 Compactness: A space X is compact if every open covering of X contains a finite subcollection that also covers X [10].

Probability Space

Next, we define the necessary constructs for a probability space. We then define a standard random variable, which will aid in the development of the random set.

Definition 2.9 σ-Algebra: If X is a set, then a σ-algebra $\sigma(X)$ on X is a collection of subsets of X that satisfies
1. $X \in \sigma(X)$
2. $A \in \sigma(X) \Rightarrow A^C \in \sigma(X)$
3. If $\{A_n\}_{n=1}^{\infty}$ is a sequence of elements of $\sigma(X)$, then $\bigcup_{n=1}^{\infty} A_n$ is in $\sigma(X)$.
Furthermore, $\sigma(X)$ is closed under countable intersections [9].

Note that if $\{A_n\}$ is a finite or countably infinite collection of elements of $\sigma(X)$, then $\bigcap_n A_n = \left(\bigcup_n A_n^C\right)^C \in \sigma(X)$; thus a σ-algebra is also closed under countable intersections. Hence, σ-algebras are topologies, since the requirements for σ-algebras subsume the requirements of topologies.
Note that σ-algebras also require closure under complementation, which is not a requirement of a topology. This closure under complementation allows for an intuitive application to probabilistic analysis. A σ-algebra is a type of topology useful in the field of probability and measure theory. In fact, most probability spaces are defined using Borel σ-algebras.

Definition 2.10 Borel σ-algebra: The Borel σ-algebra on a topological space X, written B(X), is the smallest σ-algebra that contains the family of all open sets in X. Elements of a Borel σ-algebra are called Borel sets.

Measure

Before we introduce random variables, we explain the idea of measurability. Although the general idea of measure is fairly complex, we give a simple overview.

Definition 2.11 Measure: A measure $\mu$ on $\sigma(X)$ is a function $\mu : \sigma(X) \to [0, \infty)$ satisfying
1. $\mu(\emptyset) = 0$
2. $\mu(A \cup B) = \mu(A) + \mu(B)$ for $A, B \in \sigma(X)$ with $A \cap B = \emptyset$ if finite, or $\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \mu(A_n)$ for $A_n \in \sigma(X)$ with $A_j \cap A_k = \emptyset$, $j \neq k$, if infinite [9].
The elements of $\sigma(X)$ are called measurable sets [9].

Some measures have added constraints, such as the probability measure.

Definition 2.12 Probability measure: A probability measure is a measure $P : \sigma(X) \to [0, 1]$ with the added constraint $P(X) = 1$.

We have now properly defined the probability measure, which is one of three elements necessary for a probability space. The other two elements are the domain and a corresponding σ-algebra.

Definition 2.13 Measure space: A measure space is a triple $(X, \sigma(X), \mu)$, where the pair $(X, \sigma(X))$ is referred to as the measurable space, X is a topological space, $\sigma(X)$ is a σ-algebra on X, and $\mu$ is a measure on $\sigma(X)$ [9].

Definition 2.14 Probability space: A probability space is a triple $(\Omega, \sigma(\Omega), P)$, where $\Omega$ is a topological space, $\sigma(\Omega)$ is a σ-algebra on $\Omega$, and P is a probability measure on $\sigma(\Omega)$ [9].

Definition 2.15 Measurable function: A function $f : X \to \mathcal{A}$ is measurable if for any interval $A \subseteq \mathcal{A}$, $f^{-1}(A) \in \sigma(X)$ [9].

A random variable is a measurable mapping from some probability space into a measurable space.
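On a finite domain, the definitions of a σ-algebra, a probability measure, and a measurable function can be verified directly. The following sketch is only an illustration under that finiteness assumption, where countable unions reduce to pairwise unions; the function names, the four-point domain, and the probability values are hypothetical.

```python
from itertools import combinations


def powerset(X):
    """All subsets of a finite set X, as frozensets."""
    items = list(X)
    return {frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)}


def is_sigma_algebra(X, F):
    """Check Definition 2.9 for a finite collection F of subsets of X."""
    X = frozenset(X)
    F = {frozenset(a) for a in F}
    if X not in F:
        return False
    if any((X - a) not in F for a in F):
        return False
    return all((a | b) in F for a in F for b in F)


def is_probability_measure(X, F, P):
    """Check P(empty) = 0, P(X) = 1, and additivity on disjoint members of F."""
    X = frozenset(X)
    if abs(P[frozenset()]) > 1e-12 or abs(P[X] - 1.0) > 1e-12:
        return False
    for a in F:
        for b in F:
            if not (a & b) and abs(P[a | b] - P[a] - P[b]) > 1e-12:
                return False
    return True


def is_measurable(f, domain, F_domain, F_codomain):
    """f is measurable if the preimage of every measurable set is measurable."""
    F_domain = {frozenset(a) for a in F_domain}
    for target in F_codomain:
        preimage = frozenset(x for x in domain if f(x) in target)
        if preimage not in F_domain:
            return False
    return True


omega = frozenset({1, 2, 3, 4})
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
P = {frozenset(): 0.0, frozenset({1, 2}): 0.3, frozenset({3, 4}): 0.7, omega: 1.0}

coarse = lambda w: 0 if w <= 2 else 1   # constant on {1, 2} and on {3, 4}
fine = lambda w: 0 if w == 1 else 1     # separates 1 from 2

print(is_sigma_algebra(omega, F), is_probability_measure(omega, F, P))
print(is_measurable(coarse, omega, F, powerset({0, 1})))
print(is_measurable(fine, omega, F, powerset({0, 1})))
```

The mapping named fine is not measurable with respect to F because the preimage of {0} is {1}, which is not a measurable set; it therefore could not serve as a random variable on this probability space.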
Standard Random Variables

Random variables are the basis of statistical modeling and analysis. The use of statistical modeling and analysis is abundant in the pattern recognition and machine learning community. These tools, along with others, allow researchers to model systems and automate intelligent decision making.

Now that we have defined all the necessary structures, we are able to define the random variable.

Definition 2.16 Random variable: Given a probability space $(\Omega, \sigma(\Omega), P)$ and some measurable space $(X, \sigma(X))$, $X \subseteq R^d$ for some positive integer d, a random variable, R, is a measurable mapping from the probability space to the measurable space such that $R^{-1}(A) \in \sigma(\Omega)$ for all $A \in \sigma(X)$, if the random variable is defined on the entire space [9].

We note here that in applications, many ignore this initial mapping from the probability space to the measurable space. This mapping is necessary for formal definitions; however, it is not necessary for most applications and the cumbersome notation is disregarded. Hereafter, we may disregard this initial mapping unless its recognition is required.

Standard Statistical Approaches for Context Estimation

There are a few issues that will arise if standard statistical techniques are used for context estimation. Next, we detail some of these potential pitfalls.

In standard approaches, the probability or likelihood of multiple occurrences is calculated using a joint distribution

$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid C)$ (2-11)

where $x_1, x_2, \ldots, x_n$ are n observations and C is some context. A few issues that arise from this approach are as follows:

1. Estimation of the joint likelihood function may be complicated by sparsity (J.1)
2. Estimation requires the matching of observations to random variables (J.2)
3. Likelihood calculation is highly dependent on the number of observations (J.3)

Issue J.1 will occur when there is a large number of random variables compared to the number of observations.
Issue J.2 occurs since there is a distinction made between the observations. If $X_i$ is different from $X_j$, then each observation will have to be paired with a random variable. This presents a problem of matching each observation to a random variable, which also results in issue J.3.

Standard random variables are used to model the outcomes of single events or trials. In some approaches, a set of observations is modeled using a standard random variable, where the set of observations is interpreted as a sequence of trials from the same experiment. This approach is similar to a common assumption for simplified joint estimation, the i.i.d. assumption.

$P(x_1, x_2, \ldots, x_n \mid C) = P(x_1 \mid C)\, P(x_2 \mid C) \cdots P(x_n \mid C)$ (2-12)

This assumption presumes that the observations x can be fully described by one random variable. However, this simplification results in two additional issues:

1. The estimate of the joint likelihood is not robust to outliers due to the product of sample likelihoods (J.4)
2. Contextual information concerning the joint observation is reduced to a product of sample likelihoods (J.5)

Note that even with the i.i.d. assumption, issue J.3 is still present. For example, as the number of observations grows, the likelihood of some context must decrease, which is an unintuitive result for modeling context. This result is intuitive only if we are modeling a sequence of experiments. Issue J.4 occurs since we have turned joint estimation into a product of singleton likelihoods.

Random Sets

One type of random variable, the random set, has not been researched as extensively as the standard random variable in the intelligent systems community. We consider only random subsets of $R^d$ in the following. First, the formal definition of the random set and some associated constructs are presented. Next, a brief inspection and discussion of the random set is presented, including its relationship to belief and possibility theory.
Finally, the shortcomings of standard point process models for context estimation are discussed, which provides motivation for the proposed implementations.

General Case: Random Closed Set

Assume that $E = R^d$ is a topological space. We denote the family of closed subsets of E as $\mathbb{F}$. We can define a measurable space $(\mathbb{F}, \sigma(\mathbb{F}))$ associated with some probability space $(\Omega, \sigma(\Omega), P)$, where all $\mathbb{F}$-valued elements will be referred to as closed sets. Informally, a random set is a measurable mapping from the aforementioned probability space to the measurable space.

Note that the construction of an intuitive σ-algebra for closed set values is not as clear as the construction for real number values. For example, a measurable interval for a random variable may be [-1, 4]. This interval, or set, is constructed by accumulating all the numbers greater than or equal to -1 and less than or equal to 4. However, relationships such as greater than or less than do not linearly order sets. One σ-algebra that is used with random sets is constructed from the Hit-Miss, or Fell, topology, in which any observed set X either intersects (hits) or does not intersect (misses) some $K \in \mathcal{K}$, where $\mathcal{K}$ is the family of compact sets. The families of sets that are used to generate the Fell topology are $\mathcal{F}_K = \{F \in \mathbb{F} : F \cap K \neq \emptyset\}$, $K \in \mathcal{K}$, and $\mathcal{F}_G = \{F \in \mathbb{F} : F \cap G \neq \emptyset\}$, $G \in \mathcal{G}$, where $\mathcal{G}$ is the family of open sets. The Fell topology is a standard topology on $\mathbb{F}$.

Definition 2.17 Fell topology: The Fell topology is a topology $(\mathbb{F}, T)$ where T has a subbasis which consists of the families $\mathcal{F}_G$ (closed sets that hit an open set G) and the complements of the families $\mathcal{F}_K$ (closed sets that miss a compact set K).

Note that the Borel σ-algebra generated by the Fell topology on $\mathbb{F}$ coincides with the σ-algebra generated by $\mathcal{F}_K$ [1]. We can now formally define the random closed set.

Definition 2.18 Random closed set measurable with respect to the Fell topology: Let $\mathbb{F}$ be the collection of all closed sets from a topological space and let $B(\mathbb{F})$ denote the σ-algebra generated by $\mathcal{F}_K$.
Given a measurable space $(\mathcal{F}, \mathcal{B}(\mathcal{F}))$ associated with some probability space $(\Omega, \sigma(\Omega), P)$, a measurable mapping $\Gamma : \Omega \to \mathcal{F}$ is called a random closed set measurable with respect to the Fell topology if $\{\omega : \Gamma(\omega) \in \mathcal{X}\} \in \sigma(\Omega)$ for each $\mathcal{X} \in \mathcal{B}(\mathcal{F})$ [1].

Random Set Discussion

The random set is governed by its distribution $P(\mathcal{X}) = P\{\Gamma \in \mathcal{X}\}$, $\mathcal{X} \in \mathcal{B}(\mathcal{F})$. Since $\mathcal{B}(\mathcal{F})$ is generated by $\mathcal{F}_K$, it seems reasonable to determine the measure, or probability, of some set $K$ using $\mathcal{F}_K$, where $P(\mathcal{F}_K) = P\{\Gamma \cap K \neq \emptyset\}$ is a well-defined measure. In fact, since the sets $\mathcal{F}_K$ for each $K \in \mathcal{K}$ generate our Borel $\sigma$-algebra, our probability distribution is defined on these sets, with the corresponding values being the probability that an observed $\Gamma$ will intersect $K$.

Note that the sets in $\mathcal{F}_K$ only have to have a non-empty intersection with some set value $K$. In effect, the calculation of the likelihood of a random set value $K$ can be viewed as calculating the measure of the sets that share at least one component with the set $K$.

Definition 2.19 Capacity functional: The real-valued function $T$ associated with $\Gamma$, $T(K) = P(\mathcal{F}_K) = P\{\Gamma \cap K \neq \emptyset\}$, $K \in \mathcal{K}$, is called the capacity functional if the following requirements are satisfied [1]:

1. $T(\emptyset) = 0$
2. $0 \le T(K) \le 1$, $K \in \mathcal{K}$
3. $K_n \downarrow K \Rightarrow T(K_n) \downarrow T(K)$ (upper semi-continuous)
4. $\Delta_{K_n} \cdots \Delta_{K_1} T(K) \le 0$, $n \ge 1$, $K, K_1, \ldots, K_n \in \mathcal{K}$ (completely alternating)

where $\Delta_{K_n} \cdots \Delta_{K_1} T(K) = -P\{\Gamma \cap K = \emptyset,\ \Gamma \cap K_i \neq \emptyset,\ i = 1, \ldots, n\}$. For an extensive explanation, the reader is directed to the literature [1]-[6].

The capacity functional can be viewed as an optimistic estimate of the probability of a random set. In fact, it can be shown that this measure is an upper bound of the family of probability measures $\mathbf{P}$ associated with the random set $\Gamma$, that is $T(K) = \sup\{P(K) : P \in \mathbf{P}\}$ [1]. This also means that the capacity functional is an upper probability. It can be shown that $T(K)$ dominates $P(K)$, $P \in \mathbf{P}$, which means $T(K) \ge P(K)$, $\forall K \in \mathcal{K}$, $\forall P \in \mathbf{P}$ [1].

To uncover other functionals associated with the random set, we dissect the set $\mathcal{F}_K$ into three disjoint sets.
$$\mathcal{F}_K = \{F : K \subseteq F\} \cup \{F : F \subseteq K\} \cup \{F : F \cap K \neq \emptyset,\ F \not\subseteq K,\ K \not\subseteq F\} \qquad (2\text{-}13)$$

Since the constituent sets in Equation 2-13 are disjoint, we can divide the capacity functional into the following terms:

$$T(K) = P\{\mathcal{F}_K\} = P\{\mathcal{F}_K^{*}\} + P\{\mathcal{F}_{*K}\} + P\{\mathcal{F}_K^{*,*}\} = I(K) + C(K) + H(K). \qquad (2\text{-}14)$$

Note that $T$ is not additive with respect to $K$ but rather with respect to partitions of $\mathcal{F}_K$. For example, if $K = K_1 \cup K_2$ with $K_1 \cap K_2 = \emptyset$, then $P\{\mathcal{F}_K\} \neq P\{\mathcal{F}_{K_1}\} + P\{\mathcal{F}_{K_2}\}$ may be possible. This is true since there may exist a $K_3$ such that $K_3 \in \mathcal{F}_{K_1} \cap \mathcal{F}_{K_2}$; by definition, $K_1 \cap K_2 = \emptyset$ does not imply $\mathcal{F}_{K_1} \cap \mathcal{F}_{K_2} = \emptyset$. In fact, $T$ is a subadditive fuzzy measure on $\mathcal{K}$:

$$P\{\mathcal{F}_K\} \le P\{\mathcal{F}_{K_1}\} + P\{\mathcal{F}_{K_2}\}. \qquad (2\text{-}15)$$

We now define the functionals introduced in Equation 2-14.

Definition 2.20 Inclusion functional: The inclusion functional calculates the measure of the sets in which $K$ is included, that is, all the sets which have $K$ as a subset.

$$I(K) = P(\mathcal{F}_K^{*}) = P\{K \subseteq \Gamma\}, \quad \text{where } \mathcal{F}_K^{*} = \{F \in \mathcal{F} : K \subseteq F\} \qquad (2\text{-}16)$$

The inclusion functional can be used to describe a random set; however, it does not generally determine the distribution of a random set uniquely, due to some pathological cases. An alternative interpretation is its relation to the capacity functional of $\Gamma^C$ [1]:

$$I(K) = P\{\Gamma^C \cap K = \emptyset\} = 1 - T_{\Gamma^C}(K) \qquad (2\text{-}17)$$

Definition 2.21 Containment functional: The containment functional calculates the measure of the sets which are contained in $K$.

$$C(K) = P(\mathcal{F}_{*K}) = P\{\Gamma \subseteq K\}, \quad \text{where } \mathcal{F}_{*K} = \{F \in \mathcal{F} : F \subseteq K\} \qquad (2\text{-}18)$$

It can be shown that the containment functional is completely intersection monotone, making it the dual of the capacity functional [1]. It can be shown that the following relationship exists between the capacity and containment functionals:

$$C(K) = P\{\Gamma \subseteq K\} = 1 - T(K^C) \qquad (2\text{-}19)$$

This relation also gives an intuitive explanation as to why the containment functional, if defined on the open sets, also determines the distribution of a random set. The dual relationship shared between the capacity and containment functionals is similar to the relationship shared between belief and plausibility functions. Belief functions are used extensively in evidential reasoning and are discussed in the Theory of Evidence section [8].
For the purposes of the random set, the containment functional, which is superadditive, can be viewed as a pessimistic estimate of a random set value. The containment functional uses a containment requirement for the probabilistic frame of reference, meaning it uses sets that are contained in $K$ to calculate probability. In other words, this value is the probability that only elements of $K$ will be generated, whereas the capacity functional requires only the existence of one shared element. In fact, it can be shown that the containment functional is a lower probability [1]:

$$C(K) = \inf\{P(K) : P \in \mathbf{P}\}. \qquad (2\text{-}20)$$

This implies that $C(K)$ is dominated by $P(K)$, $\forall P \in \mathbf{P}$, $\forall K \in \mathcal{K}$. All probability measures on a random set are wedged in between these bounds, that is

$$C(K) \le P(K) \le I(K) + C(K) + H(K) = T(K), \quad \forall P \in \mathbf{P},\ K \in \mathcal{K}. \qquad (2\text{-}21)$$

This is intuitive since the capacity functional is the probability that the random set will hit a given set, whereas the containment functional is the probability that the random set is fully contained within the given set.

Definition 2.22 Hit and miss functional: The hit and miss functional calculates the measure of the sets that intersect the set $K$ but have no inclusion or containment relationship with it.

$$H(K) = P(\mathcal{F}_K^{*,*}) = P\{\Gamma \cap K \neq \emptyset,\ \Gamma \not\subseteq K,\ K \not\subseteq \Gamma\}, \qquad (2\text{-}22)$$

where $\mathcal{F}_K^{*,*} = \{F \in \mathcal{F} : F \cap K \neq \emptyset,\ F \not\subseteq K,\ K \not\subseteq F\}$.

The hit and miss functional is not used in the literature. It simply identifies sets that have a non-empty intersection with a set $K$, a non-containment relationship with $K$, and a non-inclusion relationship with $K$. Its use alone for the purposes of probability assignment would not be intuitive.

The inclusion and containment functionals identify the sets above or below $K$ in the lattice of subsets of $E$; that is, these functionals identify the sets that can be linearly ordered with respect to $K$ by inclusion and containment. On the other hand, the hit and miss functional considers sets at the same level as $K$ on the lattice, which are not comparable to $K$ using inclusion and containment.
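As a small numerical illustration of the capacity, containment, and inclusion functionals, the following sketch builds a random set on a finite space and evaluates each functional by direct enumeration. The three-point space, the outcome probabilities, and all function names are assumptions chosen for illustration; on a finite space every subset is closed and compact, so the topological conditions hold trivially.

```python
from fractions import Fraction

# Hypothetical finite space E and random set distribution (illustrative values).
E = frozenset({"a", "b", "c"})
DIST = {
    frozenset({"a"}): Fraction(1, 2),
    frozenset({"a", "b"}): Fraction(1, 4),
    frozenset({"b", "c"}): Fraction(1, 4),
}

def capacity(K, dist=DIST):
    """T(K): probability that the random set hits K (non-empty intersection)."""
    return sum(p for F, p in dist.items() if F & K)

def containment(K, dist=DIST):
    """C(K): probability that the random set is contained in K."""
    return sum(p for F, p in dist.items() if F <= K)

def inclusion(K, dist=DIST):
    """I(K): probability that the random set includes K as a subset."""
    return sum(p for F, p in dist.items() if K <= F)

K = frozenset({"a", "b"})
print(capacity(K), containment(K), inclusion(K))

# Duality between containment and capacity: C(K) = 1 - T(K^C).
print(containment(K) == 1 - capacity(E - K))
```

For $K = \{a, b\}$ this gives $T(K) = 1$, $C(K) = 3/4$ and $I(K) = 1/4$, so the containment value sits below the capacity value, matching the pessimistic and optimistic readings of the two functionals. Exact fractions are used so that the duality check is not affected by floating-point rounding.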
Theory of Evidence

We briefly discuss the relationship between random sets and the Theory of Evidence as developed by Dempster and Shafer.

Definition 2.23 Belief function: A function $BEL : 2^X \to [0, 1]$ is a belief function on some space $X$ if the following constraints are satisfied:

1. $BEL(\emptyset) = 0$
2. $BEL(X) = 1$
3. $BEL$ is completely monotone [1], [8].

Definition 2.24 Plausibility function: The dual of the belief function, the plausibility function, has the expected dual form

$$PL(A) = 1 - BEL(A^C) \qquad (2\text{-}23)$$

Just as the capacity functional is an optimistic estimate of the probability of a set outcome, the plausibility function is an optimistic estimate of the probability of an occurrence of an element in $A$. Belief functions are completely determined by their mass functions.

Definition 2.25 Mass function: A function $m : 2^X \to [0, 1]$ is a mass function if $m(\emptyset) = 0$ and $\sum_{A \subseteq X} m(A) = 1$.

Note that the containment functional of a random closed set is a belief function, which can also be described by its corresponding mass function:

$$BEL(A) = \sum_{B \subseteq A} m(B) = P\{\Gamma \subseteq A\} = C(A), \qquad (2\text{-}24)$$

whereas a general belief function is a containment functional only if some continuity conditions are met [1].

Note that $(2^X, \sigma(2^X), m)$ forms a probability space, where $m$ is a probability on sets $A \in 2^X$. Furthermore, the corresponding belief function resembles a cumulative distribution function on $2^X$, using the containment relationship to accumulate measure. The purpose of distributing mass $m$ to subsets of outcomes, rather than simply to the outcomes themselves, is to model uncertainty in evidential reasoning. Rather than merely having the ability to state the probability of each outcome, the mass function can assign a probability to an outcome occurring in a set without explicitly expressing the probability of its constituents [8].

Point Process

General random set models are seldom used in the machine learning community. This is interesting since random variables and statistical models are ubiquitous in the same community.
One reason for this is that the general random set has no simple or even established parametric form, nor simple methods for estimation. Specific types of random sets, such as point processes, do have simple parametric forms which allow for optimization and estimation; however, as will be discussed, they are rarely used to model sets of occurrences.

Next, we define some popular parametric forms of the point process and discuss their pros and cons. We conclude that most parametric forms of the point process are restricted to behave as standard random variables. They do not take advantage of the information attained from the co-occurrence, or observation, of a set of samples, but rather treat these samples as independent occurrences.

Definition 2.26 Counting measure: Assume $E = \mathbb{R}^d$ is a topological space. A measure $\varphi$ on a family of Borel sets $\mathcal{B}(E)$ is called a counting measure if it takes only non-negative integer values, that is $\varphi : \mathcal{B}(E) \to \{0, 1, 2, \ldots\}$ [4].

A counting measure is locally finite if the measure is finite on bounded subsets of $E$. Therefore, a locally finite counting measure has a finite number of points in its support in any compact set [4].

Definition 2.27 Point process: A point process $\Phi : \Omega \to N$ is a random closed set with associated probability space $(\Omega, \sigma(\Omega), P)$ and measurable space $(N, \mathcal{B}(N))$, where $N$ is the family of all sets of points $\varphi$ in $E$, if $\varphi$ is locally finite (each bounded subset of $E$ must contain only a finite number of points of $\varphi$) [4]. Less formally, a point process is a random choice of $\varphi \in N$ governed by $P$.

In practice, point processes are considered either as random sets of discrete points or as random measures which count the number of points within bounded regions. Random measures are further discussed in the Random Measures section. Since a point process is a random set, the same principles and theorems that apply to random sets apply to point processes.
Since point processes are locally finite, their capacity functionals are expressed as follows:

$$T_\Phi(K) = P(\Phi \cap K \neq \emptyset) = P(|\Phi \cap K| > 0) = P(\Phi(K) > 0), \qquad (2\text{-}25)$$

where $\Phi(K) = |\Phi \cap K|$. Since we know the intersections will have a finite number of elements, we can model these probabilities as counting probabilities [4].

Definition 2.28 Intensity measure: The intensity measure $\Lambda$ of $\Phi$ is the mean value of $\Phi(K)$, defined as $\Lambda(K) = E[\Phi(K)]$, where $\Phi(K)$ is simply a random variable taking values in the measurable space $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Simply, $\Lambda(K)$ is the mean number of points of a realization of $\Phi$ in $K$ [1], [4].

In many applications, point processes are modeled in terms of intensity measures to provide a simpler functional model. The intensity measure provides an intuitive idea of intensity and allows for a simple parametric form. The following are examples of a few popular parameterizations: the random point, the binomial point process, the Poisson point process, and the Gibbs point process.

Definition 2.29 Random point: A random point $\xi$ is a point process with singleton outcomes. The capacity functional of this random point can be estimated as [4]

$$P(\{\xi\} \cap K \neq \emptyset) = P(|\{\xi\} \cap K| > 0) = P(\xi \in K). \qquad (2\text{-}26)$$

Assume that our random point $\xi$ is uniformly distributed in some compact set $K \subset E$. Let $\mu$ be the Lebesgue measure on $E$, which corresponds to length, area, or volume, depending on the dimension of $E$. Note that this measure represents the uniform distribution on the space $E$. For each subset $A$ of $K$ we can then define the point process distribution corresponding to the random point $\xi$ as follows:

$$P(\xi \in A) = \frac{\mu(A)}{\mu(K)} \qquad (2\text{-}27)$$

This is essentially a standard random variable, which should be clear from Equation 2-27. The formulation is simply the ratio of the measure of $A$ to the total measure, the measure of $K$. It seems reasonable for the probability that a uniformly distributed random point falls in volume $A$ to take this value.
Definition 2.30 Binomial point process: A binomial point process with $n$ points is $n$ independent, uniformly distributed random points $\xi_1, \xi_2, \ldots, \xi_n$ which are distributed over the same compact set $K \subset E$. This binomial point process, written $W^{(n)}$, is governed by the following joint distribution

$$P(\xi_1 \in A_1, \xi_2 \in A_2, \ldots, \xi_n \in A_n) = \prod_{i=1}^{n} P(\xi_i \in A_i) = \frac{\prod_{i=1}^{n} \mu(A_i)}{\mu(K)^n} \qquad (2\text{-}28)$$

for subsets $A_1, \ldots, A_n$ of $K$. Since $\mu$ is a Lebesgue measure, there are three inherent properties of the binomial point process [4]:

1. $W^{(n)}(\emptyset) = 0$
2. $W^{(n)}(K) = n$
3. $W^{(n)}(A_1 \cup A_2) = W^{(n)}(A_1) + W^{(n)}(A_2)$, $A_1 \cap A_2 = \emptyset$.

The above formulation of random points is indicative of the i.i.d. assumption: it treats the elements of a random set as being independent of each other. This assumption limits the random set's ability to maintain co-occurrence information about the samples and, furthermore, makes it behave similarly to the standard random variable with the i.i.d. assumption.

For the aptly named binomial point process, the number of points in $A$, $W^{(n)}(A)$, follows a binomial distribution with parameters $n$ and $p = P(\xi \in A)$ [4]. The mean of a binomial distribution is simply the product of its parameters $n$ and $p$, yielding

$$E[W^{(n)}(A)] = np = \frac{n\,\mu(A)}{\mu(K)} \qquad (2\text{-}29)$$

This means that the intensity, the mean number of points per unit volume, is given by

$$\lambda = \frac{E[W^{(n)}(A)]}{\mu(A)} = \frac{n\,\mu(A)}{\mu(K)} \cdot \frac{1}{\mu(A)} = \frac{n}{\mu(K)} \qquad (2\text{-}30)$$

Although each of the points in a binomial point process is distributed uniformly about the sample space, the numbers of points contained in subsets of $K$ are not independent, since this distribution is defined for a fixed number of points $n$. If we were to construct $W^{(n)}$ in terms of the number of points per subset, as in [4], the distribution would be more descriptive:

$$P\big(W^{(n)}(A_1) = n_1, \ldots, W^{(n)}(A_k) = n_k\big) \qquad (2\text{-}31)$$

where $n_1 + n_2 + \cdots + n_k = n$ and $k = 1, 2, \ldots$

Example 2.1 Dependence on number of samples: It is clear that the numbers of points contained in subsets of $K$ are dependent due to the fact that $n_1 + n_2 + \cdots + n_k = n$. If we know that $W^{(n)}(A_1) = n_1$, then we also know that $W^{(n)}(K \setminus A_1) = n - n_1$ [4].
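The mean count in Equation 2-29 can be checked by simulation. The sketch below is illustrative only: the unit square as the compact set $K$, the quarter-square region $A$, the number of trials, and all function names are assumptions rather than anything specified in the text.

```python
import random

def sample_binomial_process(n, rng):
    """Draw n independent uniform points on the unit square K = [0, 1)^2."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def count_in_box(points, x_max, y_max):
    """Count the points falling in A = [0, x_max) x [0, y_max)."""
    return sum(1 for (x, y) in points if x < x_max and y < y_max)

def mean_count(n, x_max, y_max, trials, seed=0):
    """Monte Carlo estimate of E[W(A)] over repeated realizations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += count_in_box(sample_binomial_process(n, rng), x_max, y_max)
    return total / trials

n = 20
area_ratio = 0.5 * 0.5        # mu(A) / mu(K) for A = [0, 0.5)^2 (assumed region)
expected = n * area_ratio     # Equation 2-29: n * mu(A) / mu(K)
estimate = mean_count(n, 0.5, 0.5, trials=5000)
print(expected, round(estimate, 2))
```

Because every realization contains exactly $n$ points, the count in $K \setminus A$ is always $n$ minus the count in $A$, which is the dependence described in Example 2.1.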
We reiterate that the binomial point process treats its outcomes as the product of standard random variables with the i.i.d. assumption, and that it is highly dependent on the number of points in a given area $A$.

Definition 2.31 Poisson point process: Let $\Lambda$ be a locally finite measure on a topological space $(E, \mathcal{B}(E))$. The Poisson point process $\Phi$ with intensity measure $\Lambda$ is a random subset of $E$ that satisfies the following constraints [4]:

1. For each bounded subset $K$ of $E$, the random variable $|\Phi \cap K|$ has a Poisson distribution with mean $\Lambda(K)$.
2. The random variables $|\Phi \cap K|$ are independent for disjoint $K$.

The corresponding capacity functional takes the form [1], [4]

$$T(K) = P\{\Phi \cap K \neq \emptyset\} = 1 - \exp(-\Lambda(K)). \qquad (2\text{-}32)$$

The first constraint suggests that $\Lambda(K)$ is parameterized by the parameter of the Poisson distribution. This parameterization is usually of the form $\Lambda(K) = \lambda\,\mu(K)$, where $\mu$ is a measure, usually Lebesgue, of the set value $K$ for all $K \in \mathcal{K}$. The second constraint imposes independent scattering: the numbers of points in disjoint Borel sets are independent. Note that this second constraint implies that there is no interaction between points in a pattern, that is, between elements in a set [4]. This parameterization would therefore be limiting for context estimation.

The last point process model that is discussed is the Gibbs point process, which has roots in statistical physics. Gibbs point processes are motivated by Gibbs distributions, which describe equilibrium states of closed physical systems. In Gibbs theory, likelihoods of configurations are modeled assuming that the higher the probability of a system of objects, the lower the potential energy of the system [4]. This ideology is reflected in their definition.

Definition 2.32 Gibbs point process: A point process is a Gibbs point process with exactly $n$ points if its capacity functional is governed by the probability density function defined in Equation 2-33.

$$f(K) = \frac{\exp(-U(K))}{Z} \qquad (2\text{-}33)$$

Hence the distribution is calculated in the standard fashion.
$$P(K) = \int_K f(x_1, \ldots, x_n)\, dx_1 \cdots dx_n \qquad (2\text{-}34)$$

In Equation 2-33, the function $U : \mathbb{R}^{nd} \to \mathbb{R}$ is the energy function and $Z$ is the partition function. Note that in Equation 2-34 the order of integration is irrelevant since $K = \{x_1, \ldots, x_n\}$ [4]. In practice, the energy function is chosen to be a sum of interaction potentials

$$U(K) = \sum_{A \subseteq K} V(A) \qquad (2\text{-}35)$$

Frequently, $V$ is assumed to have small values for large subsets of $K$. This assumption leads to the use of a pair potential function

$$U(K) = \sum_{i=1}^{n} \sum_{j=1}^{n} V(\{x_i, x_j\}). \qquad (2\text{-}36)$$

The Gibbs point process can also be formulated for varying numbers of points $n$. This is called the grand canonical ensemble and assumes $n$ is random [4]. Let $\mathcal{K}_n$ be the family of sets with $n$ points. Then we can define $\mathcal{K} = \bigcup_{n=0}^{\infty} \mathcal{K}_n$ [4]. We can now define a density on $\mathcal{K}$:

$$f(K) = c\,a^n \exp(-U(K)) \qquad (2\text{-}37)$$

where $c$ and $a$ are the appropriate normalization factors [4].

Random Measures

Random measures associated with random sets are generalizations of counting measures. Just as a random counting measure is a function on a point process, a random measure associated with random sets is a function on a random set.

Definition 2.33 Random measure: Assume $\mu : \mathcal{F} \to [0, \infty)$ is a fixed measure and $\Gamma$ is a random closed set with respect to the Fell topology. Then $M_{\mu,\Gamma}(F) = \mu(F \cap \Gamma)$ is a random measure which maps from some probability space $(\Omega, \sigma(\Omega), P)$ to a measurable space $(\mathbf{M}, \mathcal{B}(\mathbf{M}))$, where $\mathbf{M}$ is the family of all locally finite measures on $\mathcal{F}$ and $\mathcal{B}(\mathbf{M})$ is generated by $\{M \in \mathbf{M} : M(F) > t\}$ for every $F \in \mathcal{F}$ and $t > 0$ [1].

For each instance $X$ of $\Gamma$ we have a corresponding instance $M_{\mu,X}$ of the random measure $M_{\mu,\Gamma}$, specifically a measure taking on a non-negative value for each set $F$. Note that throughout the literature, the measure $\mu$ is assumed to be additive, and thus it has all the corresponding characteristics. If we restrict $\mu : \mathcal{F} \to [0, 1]$, it can define a probability measure on $\mathbb{R}^d$, namely $P_{\mu,\Gamma}(F) = M_{\mu,\Gamma}(F)$. Therefore each instance of a random set has a corresponding measure $P_{\mu,X}$ [1].
$$P_{\mu,X}(F) = \frac{\mu(F \cap X)}{\mu(X)}, \quad F \in \mathcal{F} \qquad (2\text{-}38)$$

To avoid cumbersome notation, we may omit $\mu$ and refer to $P_{\mu,X}$ as $P_X$ when there is no ambiguity. This construction can be generalized by taking a measurable random function $\zeta(x)$, $x \in E$. We can then define a random measure as in Equation 2-39.

$$M_{\zeta,\Gamma}(F) = \int_{F \cap \Gamma} \zeta(x)\, dx \qquad (2\text{-}39)$$

Then we can construct a measure $P_X$ as in Equation 2-40 [1].

$$P_X(F) = \frac{\int_{F \cap X} \zeta_X(x)\, dx}{\int_X \zeta_X(x)\, dx}, \quad F \in \mathcal{F} \qquad (2\text{-}40)$$

We have therefore defined a mapping from $X$ to $P_X$. Note that in this construction we assume a dependence of $\zeta$ on $\Gamma$, denoted by $\zeta_X$. Note also that we have defined a family of measures $\mathbf{P}$ associated with the random set $\Gamma$. The random measure can be viewed as a distribution on distributions, or a measure on measures, which is related to variational approaches for approximate inference.

Variational Methods

The use of variational methods for approximate inference has become a popular classification method in the machine learning community. We give a brief description in order to identify its relationship to random sets or, more specifically, random measures.

The goal of variational approaches is to determine the posterior $P(Z \mid X)$ of latent variables $Z$ given observed data $X$, where $Z$ typically comprises class labels and parameters of distributions for the elements of $X$. This approach is typically preferred over standard methods when the latent variable space is large, the expectations with respect to the posterior are intractable, or the required integrations are intractable or have no closed-form representation [97].

Variational inference approximation balances the pros and cons of typical estimation approaches such as EM and of more computationally intensive methods such as stochastic techniques [97]. EM approaches suffer from the aforementioned problems, whereas stochastic methods such as Markov Chain Monte Carlo (MCMC) can generate exact results, but not in finite time [97].
In standard approaches such as EM, parameters are estimated by inspecting a small portion of the parameter space, which may make the estimate more likely to settle in a local optimum rather than the global optimum. MCMC methods attempt to construct the true distribution over all the possible values of the parameters using sampling methods. This approach allows for a globally optimal choice of parameter values or allows for integration over all possible values. However, these approaches are only guaranteed as the sampling tends to infinity, although they may be useful when the sample space allows for a tractable solution [97].

In variational methods for approximate inference, function learning is the objective, and typically hyperparameters, prior distributions on a function's parameters, are used to model a family of function values. It can be shown that the log likelihood of the set of observations $X$ can be separated into two terms:

$$\ln p(X) = L(q) + KL(q \,\|\, p) \qquad (2\text{-}41)$$

where

$$L(q) = \int q(Z) \ln\left\{\frac{p(X, Z)}{q(Z)}\right\} dZ \quad \text{and} \quad KL(q \,\|\, p) = -\int q(Z) \ln\left\{\frac{p(Z \mid X)}{q(Z)}\right\} dZ.$$

It can also be shown that we can maximize the lower bound $L(q)$ by minimizing the KL divergence between $q(Z)$ and $p(Z \mid X)$. Therefore, this approach is a variational method, as $p(Z \mid X)$ is estimated by optimizing the log likelihood with respect to the function $q$. Given the use of hyperparameters, the optimization with respect to $q$ is called a free form estimate; that is, $q$ is only restricted by the parameterization of the hyperparameters. Therefore this expression can be seen as the optimization of a functional with respect to a function,

$$H[q] = \ln p(X) \qquad (2\text{-}42)$$

The parameter distributions are typically formulated for simple integration, such that the parameters can be integrated out for the purposes of inference, usually classification. That is, the parameters are never estimated explicitly.
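The decomposition in Equation 2-41 can be verified numerically for a discrete latent variable, where the integrals become sums. The following is a minimal sketch under stated assumptions: the three joint values $p(X = x_{obs}, Z = z)$, the trial distribution $q$, and the function names are all made up for illustration.

```python
import math

# Hypothetical joint values p(X = x_obs, Z = z) for a three-state latent Z.
joint = [0.10, 0.05, 0.25]

def log_evidence(joint):
    """ln p(X), obtained by summing the joint over the latent states."""
    return math.log(sum(joint))

def posterior(joint):
    """p(Z | X) obtained by normalizing the joint."""
    total = sum(joint)
    return [p / total for p in joint]

def lower_bound(q, joint):
    """L(q) = sum_z q(z) ln( p(X, z) / q(z) )."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl_divergence(q, joint):
    """KL(q || p) = -sum_z q(z) ln( p(z | X) / q(z) )."""
    post = posterior(joint)
    return -sum(qz * math.log(pz / qz) for qz, pz in zip(q, post) if qz > 0)

q = [0.5, 0.3, 0.2]  # an arbitrary trial distribution over Z
print(log_evidence(joint), lower_bound(q, joint) + kl_divergence(q, joint))
```

The two printed values agree up to floating-point rounding for any valid $q$. Since the KL term is non-negative, $L(q)$ never exceeds $\ln p(X)$, and the bound is tight when $q$ equals the posterior.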
In summary, variational learning estimates a function through the use of observed data and parameter distributions governed by hyperparameters. These parameter distributions, which are distributions on distributions, are similar to the idea of random measures. However, as discussed in the Technical Approach, the purpose of the random measure within the random set framework is different from the use of hyperparameters in variational inference.

Before we discuss random set applications, it is necessary to review some measures, metrics, and divergences defined on sets or measures.

Set Similarity Measures

In data sample analysis, it is necessary to have some sort of similarity measure for the purposes of comparing and contrasting the samples. If we are performing contextual analysis, it seems appropriate to have a similarity measure to compare and contrast sets. The following is a brief review of standard and modern set similarity measures.

One way to analyze the similarity of measures is to use a distribution similarity measure or divergence. Popular examples are the Kullback-Leibler (KL) divergence, which was informally introduced in the previous section, and the Chernoff divergence. The well-known KL divergence between distributions $P_0$ and $P_1$ is computed as follows:

$$KL(P_1 \,\|\, P_0) = \int p_1(x) \log \frac{p_1(x)}{p_0(x)}\, dx \qquad (2\text{-}43)$$

The Chernoff divergence is computed as follows:

$$C(P_0, P_1) = \max_{0 \le t \le 1} \big[-\log \mu(t)\big] \qquad (2\text{-}44)$$

where $\mu(t) = \int [p_0(x)]^{1-t} [p_1(x)]^{t}\, dx$. Upon inspection, both of these divergences seem to quantify the idea of similarity of measures based on the underlying distribution of mass.

Another common approach is the use of compressed distribution similarity measures. Common histogram measures are the $L_1$ and weighted $L_2$ measures.

$$d_{L_1}(H, K) = \sum_i |h_i - k_i| \qquad (2\text{-}45)$$

$$d^2_{L_2}(H, K) = (\mathbf{h} - \mathbf{k})^t A\, (\mathbf{h} - \mathbf{k}) \qquad (2\text{-}46)$$

In Equations 2-45 and 2-46, $A$ is a weight matrix; $H$ and $K$ represent histograms, weighted clusters, or feature subsets of two discrete sets.
Although popular, these similarity measures give rise to problems in robustness. For example, when computing the differences in histogram bins, Equations 2-45 and 2-46 do not account for neighboring bins.

A common similarity measure used in topological spaces is the Hausdorff metric. This metric computes the difference between two sets by finding the maximum of the minimum point-wise differences.

$$d_H(X, Y) = \max\left\{ \sup_{x_i \in X} \inf_{y_i \in Y} \|x_i - y_i\|,\ \sup_{y_i \in Y} \inf_{x_i \in X} \|x_i - y_i\| \right\}. \qquad (2\text{-}47)$$

Although this similarity measure is indeed a metric, it seems to lack robustness. For example, two point sets having all constituents the same, less one outlier, would still be assigned a high difference value.

Another recently researched approach is the earth mover distance (EMD) [70], [71]. The idea behind the EMD is to calculate the minimum work needed to transform a discrete set $X$ into a discrete set $Y$ given some constraints. This minimization is done using linear programming. In fact, this distance calculation is a reformulation of the well-known transportation problem. In this framework, one of the sets is considered a supplier and the other a consumer, where each supplier has a supply quantity $x_i$ and each consumer has a demand quantity $y_j$. Given a shipping cost $c_{ij}$ for each supplier/consumer pair, the goal is to find the optimal flow of goods, $f^*_{ij}$, such that the cost is minimal. Using the optimal flow, EMD is calculated as follows:

$$EMD(X, Y) = \frac{\sum_{i \in I} \sum_{j \in J} c_{ij} f^*_{ij}}{\sum_{i \in I} \sum_{j \in J} f^*_{ij}} = \frac{\sum_{i \in I} \sum_{j \in J} c_{ij} f^*_{ij}}{\sum_{j \in J} y_j} \qquad (2\text{-}48)$$

where $\mathbf{f}^* = \arg\min_{\mathbf{f}} \sum_{i \in I} \sum_{j \in J} c_{ij} f_{ij}$ subject to

$$f_{ij} \ge 0,\ i \in I,\ j \in J; \qquad \sum_{i \in I} f_{ij} = y_j,\ j \in J; \qquad \sum_{j \in J} f_{ij} \le x_i,\ i \in I.$$

Note that the above formulation requires that each consumer be completely satisfied. For the purposes of set similarity measures, the idea of flow is simply the matching of similar points in the sets. The difference between these points is then computed using the cost which, if formulated accordingly, can be a difference measure of these points.
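The outlier sensitivity of the Hausdorff metric, and the contrasting behavior of the EMD, can be seen on a toy example. The sketch below is illustrative and rests on several assumptions: both sets have the same number of points, every supply and demand equals one, the cost is Euclidean distance, and the function names are hypothetical. Under unit weights an optimal flow can be taken to be a one-to-one matching, so a brute-force search over permutations stands in for the linear program (a substitution that is feasible only for very small sets).

```python
import itertools
import math

def hausdorff(X, Y):
    """Hausdorff metric between two finite point sets (Equation 2-47)."""
    d_xy = max(min(math.dist(x, y) for y in Y) for x in X)
    d_yx = max(min(math.dist(x, y) for x in X) for y in Y)
    return max(d_xy, d_yx)

def emd_equal_size(X, Y):
    """EMD for equal-size sets with unit supplies and demands (assumption).

    Searches all one-to-one matchings instead of solving a linear program.
    """
    n = len(X)
    if n != len(Y):
        raise ValueError("this sketch only handles sets of equal size")
    best = min(
        sum(math.dist(X[i], Y[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return best / n  # total cost divided by total flow

X = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]  # same as X except for one outlier
print(hausdorff(X, Y), emd_equal_size(X, Y))
```

Here the Hausdorff metric is 8, driven entirely by the single outlier, while the EMD is 8/3 because the cost of the one poorly matched pair is averaged over the total flow.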
Also note that if the numbers of points in the sets $X$ and $Y$ are different, then we can assign fractional values to the supplies and demands to allow for fractional point matching. Houissa et al. proposed an algorithm that uses EMD as a metric for the comparison of images for image retrieval from a database [72]. This is a novel approach that uses a set metric to analyze the similarity of two sets. In fact, the use of the aforementioned set metrics and divergences is fairly common in the machine learning community.

Random Set Applications

Next, we review current uses of random sets and en masse approaches in the machine learning and pattern recognition communities. The most widely used formulation of the random set is by far the point process [74]-[96].

Point Process Applications

Popular applications of point processes in the machine learning and pattern analysis arenas include, but are not limited to, the following: event prediction [89], [90], [92], object recognition/tracking [74], [79]-[83], and particle modeling [4], [85], [93], [94]. Although we do not detail particle modeling, we explicitly mention it since many forms of the point process have deep roots in statistical physics, and therefore many point process models relate to physics-based concepts. In many fields of physics, one studies the interaction between groups of particles. In machine learning, these groups of particles are treated as sets of samples distributed by a point process.

One of the more popular applications of point processes is event prediction. In this application the point process domain is the real line, typically time, and the particles are events. Other applications include sample clustering. In most applications, the point process is used similarly to standard random variables with standard probabilistic techniques.
To the authors' knowledge, there are no applications of point processes that include the comparison of sets of samples, which is odd since they are random sets. We review some past and current research involving the use of point processes in a manner relevant to context estimation.

Linnett et al. have used Poisson point processes to model segments of images for texture-based classification [84]. In this approach, samples from the same class are considered to share the same context. Each image is discretized, and pixels with similar gray values are binned and grouped into similar point processes. A Bayesian posterior is then calculated, estimating the class of each segment. Note that in this approach, the point process is used as a standard clustering algorithm, grouping samples from the same class together.

Stoica et al. proposed the Candy model, which models road segments in remotely sensed imagery as a marked Poisson point process for roadway network extraction [74]. Each line segment is considered a point, or center, with marks such as width, length, and orientation. The interaction of the segments is governed by a Gibbs point process whose energy function contains a data term and a line segment interaction term. The segment interaction term penalizes short line segments. Segments are then merged based on an MCMC sampling method which adds points to segments, deletes points from segments, and merges segments. In later work, they incorporated Gibbs point processes within this model [80].

Descombes et al. used a point process to model segments of images within the Candy model framework [81]. They improved their model by adding a prior density on the line segments. The prior is modeled as a point process, referred to as the Potts model, where the energy function is calculated based on the number of points in a clique in a segment, such that smaller segments are penalized.
Other work, such as extensions of the Candy model, continues their research of the point process for image analysis [82]. They improved their object process, which is used to model the target line networks in remotely sensed images, by adding an additional term in its governing density to account for interactions with other object processes.

The point process is used by Savery and Cloutier to model clusters of red blood cells and correlate their orientation with other attributes of the blood [85]. In this paper, the point process is used to model different red blood cell configurations in the presence of backscattering noise. An energy function is used to assign a value to each configuration of blood cells; this function is placed inside an exponential function to estimate the likelihood of each configuration. An MCMC method is then employed to estimate the true configuration of the red blood cells.

En Masse Context-Based Methods

We refer to methods that treat a set of samples as a singleton unit as en masse approaches. These approaches use the same ideology as the random set and attempt to perform inference or analysis using the set.

Dougherty et al. proposed a set-based kNN algorithm to contend with data sets that may be distributed differently with respect to time [12]. In this approach, the idea of context is maintained by using each training set as a set prototype. The algorithm is able to contend with contextual factors and even disguising transformations. In this approach, the k nearest neighbors of the test set, that is, its neighboring training sets, are identified. Here context is identified through a similarity measure, specifically the Hausdorff metric, between the test set and a prototype set. Classification of the individual samples is performed using the labels of the k nearest samples from the k nearest sets.
Although this approach is an improvement over other context-based methods and solutions to concept drift, it suffers from a lack of robustness due to the use of the Hausdorff metric.

Bolton and Gader applied set-based kNN to remotely sensed data for target classification [15]. Contextual factors were apparent in this data set. The application of set-based kNN improved classification results by correctly identifying the contexts using sets of samples; however, the resiliency of the Hausdorff metric was questionable.

Dougherty et al. motivated a statistical approach, an extension of set-based kNN, to identify population-correlated factors for improved classification [12], [13], [14]. Dougherty et al. provided a very theoretical approach which was suggestive of Poisson point processes [12]. We extend Dougherty's theoretical approach and provide a general random set framework for context-based classification which permits possibilistic, probabilistic, and evidential implementations.

CHAPTER 3
TECHNICAL APPROACH

We propose a context-based approach for classification posed within a random set framework. The incorporation of random sets equips a classification algorithm with the ability to contend with hidden context changes. The goal of the proposed algorithm is, given an input sample set, or population, to identify the population's context and classify the individual input samples.

We propose two models for context estimation and provide analogous inference and optimization strategies. The first model is similar to the germ and grain model, which is commonly used in point process simulation [4]. We develop possibilistic and evidential approaches within this model and detail some optimization strategies. The second model utilizes random measures. We propose an unnormalized likelihood function which provides a probabilistic estimate of context within this model.
Finally, we provide a discussion to identify the similarities and differences of the proposed random measure model and standard statistical methods.

Mathematical Basis of the Random Set Framework

Assume a topological space $E \subseteq \mathbb{R}^d$ with samples $x \in E$. Let $\{\Gamma_1, \ldots, \Gamma_I\}$ be random sets with respect to the Fell topology. Each $\Gamma_i$ is used to model a distinct context $i$, where we assume $\{\Gamma_1, \ldots, \Gamma_I\}$ to be exhaustive. Assume a sample set $X$, test or train, containing a finite number of observations $X = \{x_1, x_2, \ldots, x_n\}$ from some random set. Let $Y: E \to \mathbb{Z}^{+}$ be a label function that maps each $x$ to a given label $y \in \{1, 2, \ldots, l\} \subset \mathbb{Z}^{+}$, where $\mathbb{Z}^{+}$ denotes the positive integers.

Standard techniques estimate $P(y \mid x)$ for classification. If we believe that $x$ was measured or observed in the presence of contextual factors, we can assume that our label function depends on the context. If $Y$ is not independent of the context in which $x$ was observed, the posterior estimate can be formulated as follows:

$$P(y \mid x, X) = \sum_{i=1}^{I} P(y, \Gamma_i \mid x, X). \tag{3-1}$$

Each term of Equation 3-1 is interpreted as the probability that sample $x$ has class label $y$ and was generated in context $i$; the posterior is obtained by marginalizing over each potential context $i$. For reasons developed throughout Chapters 1 and 2, context identification is performed by identifying contextual transformations; therefore, the observed population $X$ is used for context estimation. Using Bayes' rule and making some independence assumptions, we arrive at Equation 3-2.

$$P(y \mid x, X) = \sum_{i=1}^{I} \frac{P(x, X \mid y, \Gamma_i)\, P(y, \Gamma_i)}{P(x, X)} \propto \sum_{i=1}^{I} P(x \mid y, \Gamma_i)\, P(X \mid \Gamma_i)\, P(y \mid \Gamma_i)\, P(\Gamma_i) \tag{3-2}$$

In Equation 3-2, we assume $x$ is independent of $X$ given its context and label. We also assume $X$ is independent of $y$ given the context. Equation 3-2 provides a random set framework for context-based classification. The factors in Equation 3-2 have intuitive meanings.
The factor $P(x \mid y, \Gamma_i)$ can be interpreted as the probability or likelihood that $x$ was collected in context $i$ and is of class $y$. A suitable implementation would be $I$ classifiers, each of which, when presented with a sample $x$, could identify it as having class label $y$ in its corresponding context $i$. The result of classification within a particular context $i$, $P(x \mid y, \Gamma_i)$, is weighted by the term $P(X \mid \Gamma_i)$, which can be interpreted as the probability of observing $X$ in context $i$. The result is an intuitive weighting scheme that weights each classifier's output based on contextual relevance to the test population.

The $P(y, \Gamma_i)$ factor is interpreted as a prior likelihood of observing some class and context. Depending on the implementation, this term may be better estimated using $P(y, \Gamma_i) = P(y \mid \Gamma_i)\, P(\Gamma_i)$, where $P(y \mid \Gamma_i)$ is the probability of class $y$ given context $i$ and $P(\Gamma_i)$ is the prior probability of context $i$. Note that $P(x \mid y, \Gamma_i)$ and $P(X \mid \Gamma_i)$ are terms of great interest as they embody the context-based approach and will be further discussed and analyzed.

Estimating $P(x \mid y)$ has been researched for years using various models and estimation techniques. The estimation of $P(X \mid \Gamma)$ and $P(x \mid y, \Gamma)$ has not been researched quite as thoroughly, especially $P(X \mid \Gamma)$. It seems proper that the values $P(X \mid \Gamma)$ should be estimated using determining functionals of $\Gamma$. The random set model provides for considerable flexibility since these probabilities can be estimated using evidential, probabilistic, or possibilistic techniques.

The proposed generalized, context-based framework may have different interpretations and a potential myriad of implementations. We develop two models for the estimation of $P(X \mid \Gamma)$ within the proposed framework.
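As a minimal sketch of the weighting scheme in Equation 3-2, the following Python function combines per-context classifier outputs with the context likelihood of the population. The dictionary keys (`sample_likelihood`, `set_likelihood`, `class_prior`, `prior`) and the function name are hypothetical choices made for this illustration; how each factor is actually estimated is left open.

```python
def context_weighted_posterior(x, X, contexts):
    """Posterior over class labels following the structure of Equation 3-2.

    contexts is a list of dicts, one per context i, with hypothetical keys:
      'sample_likelihood': callable (x, y) -> P(x | y, Gamma_i)
      'set_likelihood':    callable (X)    -> P(X | Gamma_i)
      'class_prior':       dict {y: P(y | Gamma_i)}
      'prior':             float P(Gamma_i)
    """
    scores = {}
    for ctx in contexts:
        # Contextual relevance of this context to the observed population X.
        weight = ctx["set_likelihood"](X) * ctx["prior"]
        for y, p_y in ctx["class_prior"].items():
            term = ctx["sample_likelihood"](x, y) * p_y * weight
            scores[y] = scores.get(y, 0.0) + term
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("all contexts assign zero probability to the observation")
    # Normalizing plays the role of dividing by P(x, X).
    return {y: s / total for y, s in scores.items()}
```

A context whose set likelihood is near zero contributes almost nothing, so its classifier is effectively ignored for that population.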
A germ and grain model is specified and accompanied by possibilistic and evidential approaches for the estimation of $P(X \mid \Gamma)$. Then a random measure model is specified and a probabilistic approach is developed for the estimation of $P(X \mid \Gamma)$.

Possibilistic Approach

In this possibilistic approach, $P(X \mid \Gamma)$ is estimated using the capacity functional.

$$P(X \mid \Gamma) = P(\Gamma \cap X \neq \emptyset) = T_{\Gamma}(X) \tag{3-3}$$

For the initial development of this model we will let $Y$ be a random set. Classification of the samples from $X$ can be defined as partitioning the set such that subsets of $X$ are assigned some class label $y$. This first model can be considered a preliminary or intermediate model. The classifier in each context is modeled using the constructs which are modeling the context; that is, $Y$ is a random subset of each $\Gamma$. This possibilistic implementation provides for a simple and efficient parametric model which allows for direct analysis of the driving terms in Equation 3-2 and concurrent optimization of the classifier and contextual parameters. Optimization techniques for classifiers that do not share parameters with the germ and grain model are also provided.

Development

Note that in this initial model we use $P(\{x\} \mid Y, \Gamma)$ instead of $P(x \mid Y, \Gamma)$. This slight modification is due to the fact that the classifier in this initial implementation is modeled by random set constructs. Therefore the samples must be formally defined as singleton sets. However, this is not always the case, and the notation $P(x \mid Y, \Gamma)$ should be used when a standard statistical classifier is used.

For the purposes of analysis, we focus on the terms $P(X \mid \Gamma)$ and $P(\{x\} \mid Y, \Gamma)$. These terms drive the context-based classifier, so their isolation will aid in analysis. We assume the prior probabilities of all contexts $P(\Gamma_i)$ are equal and that the probabilities of the class given the context $P(Y \mid \Gamma_i)$ are equal.
Given this we have

$$P(Y \mid \{x\}, X) \propto \sum_{i=1}^{I} P(\{x\} \mid Y, \Gamma_i)\, P(X \mid \Gamma_i). \tag{3-4}$$

We develop a model similar to the germ and grain model [4], [5], [16]; that is, the random set is modeled as a union of random hyperspheres. This model provides a simple yet versatile parametric model to allow for the estimation of the terms in Equation 3-4. The germs are the random hypersphere centers and the grains refer to the size or volume of the hypersphere, which is directly related to the radii. If random set $\Gamma_i$ follows a germ and grain model, it is defined by Equation 3-5, where $\mu_{ij}$ are the germs and $\Gamma_{ij}$ are the grains.

$$\Gamma_i = \bigcup_{j=1}^{n_i} \{\Gamma_{ij}(\mu_{ij})\} \tag{3-5}$$

In Equation 3-5, $n_i$ is the number of grains used to model context $i$. In our model we assume each grain is governed by a random radius $r_{ij}$ that is exponentially distributed.

$$p(r_{ij}) = \lambda_{ij} \exp(-\lambda_{ij} r_{ij}) \tag{3-6}$$

This implies that the probability that $\{x\}$ hits a grain, $P(\{x\} \mid \Gamma_{ij})$, can be estimated as follows:

$$P(\{x\} \mid \Gamma_{ij}) = T_{\Gamma_{ij}}(\{x\}) = P(r_{ij} \geq \lVert x - \mu_{ij} \rVert). \tag{3-7}$$

Substituting the probability density in Equation 3-6 into Equation 3-7 yields

$$P(\{x\} \mid \Gamma_{ij}) = 1 - P(r_{ij} < \lVert x - \mu_{ij} \rVert) = \exp(-\lambda_{ij} \lVert x - \mu_{ij} \rVert). \tag{3-8}$$

Equation 3-8 is used to model the constituent grains $\Gamma_{ij}$ and subsequently used to model $\Gamma_i$ and $Y$. The capacity functional of $\Gamma_{ij}$, $P(\{x\} \mid \Gamma_{ij})$, is subsequently used to estimate the capacity functional of $\Gamma_i$.

$$P(X \mid \Gamma_i) = P(X \cap \Gamma_i \neq \emptyset) = T_{\Gamma_i}(X) \tag{3-9}$$

In this model, the calculation of $P(X \mid \Gamma_i)$ follows from the calculation of the capacity functional of the constituent grains.

$$P(X \mid \Gamma_i) = 1 - \prod_{j=1}^{n_i} \left(1 - T_{\Gamma_{ij}}(X)\right) \tag{3-10}$$

Equation 3-10 states that the probability that $X$ hits $\Gamma_i$ is the same as the probability that $X$ does not miss all $\Gamma_{ij}$, $j = 1, \ldots, n_i$. Given our model, we can calculate $T_{\Gamma_{ij}}(X)$ using Equation 3-11.

$$T_{\Gamma_{ij}}(X) = \max_{x \in X} T_{\Gamma_{ij}}(\{x\}) \tag{3-11}$$

The proof is discussed in Lemma 3-1.

Lemma 3-1. Let $\Gamma$ be a random set taking on set values in $\mathcal{F}$ and having a probability distribution $P$ on $\mathcal{B}(\mathcal{F})$ and corresponding capacity functional $T$.
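If we restrict the elements of $\Gamma$ to be a random disc or hypersphere, then $T(X) = \max_{x \in X} T(\{x\})$ if $X$ is finite, or $T(X) = \sup_{x \in X} T(\{x\})$ if $X$ is infinite.

Proof. We show that if $T_{\Gamma_{ij}}(\{x_1\}) \geq T_{\Gamma_{ij}}(\{x_2\})$ then $P(\Gamma_{ij} \cap \{x_1\} \neq \emptyset) = P(\Gamma_{ij} \cap \{x_1, x_2\} \neq \emptyset)$, which we can inductively show implies $T(X) = \max_{x \in X} T(\{x\})$.

Base Case: First assume without loss of generality (WLOG) that $T_{\Gamma_{ij}}(\{x_1\}) \geq T_{\Gamma_{ij}}(\{x_2\})$. If the random hypersphere is determined by a random radius, then $P(r \geq d(x_1, c)) \geq P(r \geq d(x_2, c))$, where $d$ is some metric, $r$ is the radius of the hypersphere and $c$ is the hypersphere center. This implies that $d(x_1, c) \leq d(x_2, c)$ if $r$ is governed by a distribution that is monotonic with respect to distance, such as the exponential distribution. This is due to the fact that the probability of intersection is a function of distance only. It follows that each hypersphere that hits $\{x_2\}$ must also hit $\{x_1\}$. So in this model we can assume

$$T_{\Gamma_{ij}}(\{x_1\}) \geq T_{\Gamma_{ij}}(\{x_2\}) \;\Rightarrow\; P(\mathcal{F}_{x_1}) \geq P(\mathcal{F}_{x_2}), \quad \forall K:\; K \in \mathcal{F}_{x_2} \Rightarrow K \in \mathcal{F}_{x_1}, \tag{3-12}$$

where $\mathcal{F}_{x}$ denotes the family of closed sets that hit $\{x\}$. Equation 3-12 implies that $P(\Gamma_{ij} \cap \{x_1\} \neq \emptyset) = P(\Gamma_{ij} \cap \{x_1, x_2\} \neq \emptyset)$.

Induction Step: Now assume $T(K) = \max_{x \in K} T(\{x\})$. We show that $P(\Gamma_{ij} \cap (K \cup \{x_1\}) \neq \emptyset) = \max\{\max_{x \in K} T(\{x\}),\, T(\{x_1\})\}$. We know that there exists some $x^{*} = \arg\max_{x \in K} T(\{x\})$ and therefore $x^{*} = \arg\min_{x \in K} d(x, c)$, where ties are arbitrarily broken. There are two cases. First assume $d(x_1, c) \leq d(x^{*}, c)$, which implies that $T_{\Gamma_{ij}}(\{x_1\}) \geq T_{\Gamma_{ij}}(K)$. By the same argument as in the Base Case, every hypersphere that hits $K$ must hit $\{x_1\}$. In the other case, if $T(\{x_1\}) \leq T(K)$, then by the same logic every hypersphere that hits $\{x_1\}$ must hit $K$. Therefore $T(K \cup \{x_1\}) = \max_{x \in K \cup \{x_1\}} T(\{x\})$ and, given the Base Case, the result holds for all sets of countable size. Thus

$$T_{\Gamma_{ij}}(X) = T_{\Gamma_{ij}}\!\left(\bigcup_{x \in X} \{x\}\right) = \max_{x \in X} T_{\Gamma_{ij}}(\{x\}). \tag{3-13}$$

Q.E.D.

For classification purposes, assume that some subset of the grains represents some class $Y$; these grains are identified in some index set $C_y$.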
$$Y = \bigcup_{j:(i,j) \in C_y} \{\Gamma_{ij}(\mu_{ij})\} \tag{3-14}$$

If we assume that the measure of the random hypersphere overlap in each context, $P(\{x\} \mid \Gamma_{ij}, \Gamma_{ik})$, $j \neq k$, is negligible, then the term $P(\{x\} \mid Y, \Gamma_i)$ can be estimated as follows:

$$P(\{x\} \mid Y, \Gamma_i) = \sum_{j:(i,j) \in C_y} P(\{x\} \mid \Gamma_{ij}). \tag{3-15}$$

The assumption in Equation 3-15 admits simplified update equations during the optimization stage.

Dependent Optimization

In this development, we propose an optimization method that assumes parametric dependence of the classifying and context estimating factors. Optimization of the parameters $\lambda_{ij}$ is performed using a minimum classification error (MCE) objective [86], [87], [88]. The objective is to maximize the difference between correct and incorrect classification. Equation 3-16 is used as an MCE objective function. Each parameter is updated in an iterative fashion using gradient descent. For optimization purposes, let $X_i \in \mathbf{X} = \{X_1, \ldots, X_I\}$ be training sets that represent different contexts.

$$D(x, X, \lambda_{ij}) = \begin{cases} \displaystyle\sum_{j:(i,j) \in C_y} P(\{x\} \mid \Gamma_{ij})\, P(X \mid \Gamma_i) \;-\; \sum_{(m,k):(m,k) \notin C_y} P(\{x\} \mid \Gamma_{mk})\, P(X \mid \Gamma_m), & x \in y \\[2ex] \displaystyle -\sum_{j:(i,j) \in C_y} P(\{x\} \mid \Gamma_{ij})\, P(X \mid \Gamma_i) \;+\; \sum_{(m,k):(m,k) \notin C_y} P(\{x\} \mid \Gamma_{mk})\, P(X \mid \Gamma_m), & x \notin y \end{cases} \tag{3-16}$$

In Equation 3-16, the second terms sum over (context, grain) pairs that model a class other than $C_y$, where $C_y$ is the class modeled by parameter $\lambda_{ij}$. This objective can be interpreted as an optimization of $\lambda_{ij}$ with respect to observations from the context and class it represents, as long as it doesn't hinder the classification of observations from other classes in any context. For stability and quick convergence, a loss function is used.

$$l(x, X, \lambda_{ij}) = \frac{1}{1 + \exp\left(D(x, X, \lambda_{ij})\right)} \tag{3-17}$$

The total loss is then defined by Equation 3-18.

$$L(\lambda_{ij}) = \sum_{X \in \mathbf{X}} \sum_{x \in X} l(x, X, \lambda_{ij}) \tag{3-18}$$

We have the following gradient descent update formula, where $t$ represents the iteration number and $\alpha$ is the learning rate.
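$$\lambda_{ij}^{t+1} = \lambda_{ij}^{t} - \alpha \frac{dL}{d\lambda_{ij}} \tag{3-19}$$

where

$$\frac{dL}{d\lambda_{ij}} = -\sum_{X \in \mathbf{X}} \sum_{x \in X} l(x, X, \lambda_{ij}) \left(1 - l(x, X, \lambda_{ij})\right) \frac{dD}{d\lambda_{ij}}, \tag{3-20}$$

and, for a sample $x \in y$ (the signs reverse for $x \notin y$),

$$\begin{aligned} \frac{dD}{d\lambda_{ij}} ={}& -\lVert x - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert x - \mu_{ij} \rVert)\, P(X \mid \Gamma_i) \\ & - \Bigg[\sum_{m:(i,m) \in C_y} \exp(-\lambda_{im} \lVert x - \mu_{im} \rVert)\Bigg] \lVert x^{*}_{ij} - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert x^{*}_{ij} - \mu_{ij} \rVert) \prod_{m \neq j} \left(1 - \exp(-\lambda_{im} \lVert x^{*}_{im} - \mu_{im} \rVert)\right) \\ & + \Bigg[\sum_{m:(i,m) \notin C_y} \exp(-\lambda_{im} \lVert x - \mu_{im} \rVert)\Bigg] \lVert x^{*}_{ij} - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert x^{*}_{ij} - \mu_{ij} \rVert) \prod_{m \neq j} \left(1 - \exp(-\lambda_{im} \lVert x^{*}_{im} - \mu_{im} \rVert)\right) \end{aligned} \tag{3-21}$$

where $x^{*}_{ij} = \arg\max_{x \in X} P(\{x\} \mid \Gamma_{ij})$.

The germs are not optimized in the experiments. However, similar gradient descent methods could be employed. The proposed updates indicated by Equations 3-18, 3-19 and 3-20 have the added benefit of concurrently updating classification and contextual parameters since both are implemented as the same structures. Next, we provide a general optimization strategy using the germ and grain model with a possibilistic estimate. That is, we optimize the contextual parameters based on their ability to correctly estimate context.

Independent Optimization

We estimate the contextual parameters using the following MCE objective, where $\mathbf{X}_i$ denotes the training populations observed in context $i$.

$$D(\lambda_{ij}) = \sum_{X \in \mathbf{X}_i} P(X \mid \Gamma_i) - \sum_{X \notin \mathbf{X}_i} P(X \mid \Gamma_i) \tag{3-22}$$

The objective in Equation 3-22 is to maximize the difference between correct and incorrect context estimation. Using a similar gradient descent strategy, we arrive at Equation 3-23.

$$\begin{aligned} \frac{dD}{d\lambda_{ij}} ={}& -\sum_{X \in \mathbf{X}_i} \lVert x^{*}_{ij} - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert x^{*}_{ij} - \mu_{ij} \rVert) \prod_{m \neq j} \left(1 - \exp(-\lambda_{im} \lVert x^{*}_{im} - \mu_{im} \rVert)\right) \\ & + \sum_{X \notin \mathbf{X}_i} \lVert x^{*}_{ij} - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert x^{*}_{ij} - \mu_{ij} \rVert) \prod_{m \neq j} \left(1 - \exp(-\lambda_{im} \lVert x^{*}_{im} - \mu_{im} \rVert)\right) \end{aligned} \tag{3-23}$$

Equation 3-23 provides for efficient optimization of the contextual parameter $\lambda_{ij}$ based on maximizing the separation between correct and incorrect contextual identification.

Evidential Model

In the possibilistic approach, we estimate $P(X \mid \Gamma_i)$ using the capacity functional. In the evidential approach we use the inclusion functional to estimate the term $P(X \mid \Gamma_i)$. There are two major reasons why we have chosen the inclusion functional for evidential modeling rather than the containment functional. First, we have a continuous model with discrete observations.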
This means the probability of containment would be zero for essentially all possible discrete observations $X$. Second, the inclusion functional is more intuitive for set-valued random elements, whereas containment, similar to the idea of belief, is intuitive for modeling uncertainty with singleton random elements.

Development

We develop the evidential approach using the germ and grain model and assume the radii are exponentially distributed. Given these assumptions, we calculate the probability of inclusion given one random hypersphere as follows:

$$P\{X \subseteq \Gamma_{ij}\} = P(\{F : X \subseteq F\}) = \exp(-\lambda_{ij} \lVert \tilde{x}_{ij} - \mu_{ij} \rVert) \tag{3-24}$$

where $\tilde{x}_{ij} = \arg\min_{x \in X} P(\{x\} \mid \Gamma_{ij})$. For calculation of inclusion, note that we use $\tilde{x}_{ij}$ rather than $x^{*}_{ij}$. Whereas $x^{*}_{ij}$ is the $x \in X$ closest to germ $\mu_{ij}$ and determines a non-empty intersection relationship of $X$ and $\Gamma_{ij}$, $\tilde{x}_{ij}$ is the $x \in X$ furthest from germ $\mu_{ij}$ and determines an inclusion relationship of $X$ and $\Gamma_{ij}$. This probability can be accumulated across the constituent random hyperspheres using the same ideology taken during the calculation of the capacity functional in Equation 3-10. Therefore we calculate the probability of inclusion of random set $\Gamma_i$ across the constituent hyperspheres using Equation 3-25.

$$P(X \mid \Gamma_i) = P\{X \subseteq \Gamma_i\} = P(\{F : X \subseteq F\}) = 1 - \prod_{j=1}^{n_i} \left(1 - \exp(-\lambda_{ij} \lVert \tilde{x}_{ij} - \mu_{ij} \rVert)\right) \tag{3-25}$$

Equation 3-25 states that the probability that a random set $\Gamma_i$ includes a set $X$ is equal to the probability that the constituent random hyperspheres $\Gamma_{ij}$ do not all have a non-inclusion relationship with $X$.

Optimization

Using the objective defined in Equation 3-22, the parameters can be optimized using gradient descent as defined in Equation 3-19. For the optimization of $\lambda_{ij}$ we substitute Equation 3-26 into Equation 3-19.
$$\begin{aligned} \frac{dD}{d\lambda_{ij}} ={}& -\sum_{X \in \mathbf{X}_i} \lVert \tilde{x}_{ij} - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert \tilde{x}_{ij} - \mu_{ij} \rVert) \prod_{m=1,\, m \neq j}^{n_i} \left(1 - \exp(-\lambda_{im} \lVert \tilde{x}_{im} - \mu_{im} \rVert)\right) \\ & + \sum_{X \notin \mathbf{X}_i} \lVert \tilde{x}_{ij} - \mu_{ij} \rVert \exp(-\lambda_{ij} \lVert \tilde{x}_{ij} - \mu_{ij} \rVert) \prod_{m=1,\, m \neq j}^{n_i} \left(1 - \exp(-\lambda_{im} \lVert \tilde{x}_{im} - \mu_{im} \rVert)\right) \end{aligned} \tag{3-26}$$

Note we have performed optimization independent of the classifier, which is assumed to be independent of $\lambda_{ij}$. Depending on the classifier utilized, similar optimization techniques could be used for its parameters.

Probabilistic Model

In the probabilistic approach, we model context using a class of functions on random sets called random measures. That is, for each observed set we construct a corresponding measure. We perform analysis in this space of measures rather than in closed subsets of $E$, or $\mathcal{F}$, as in previous models, in hopes of extracting supplementary information to that found during analysis in $\mathcal{F}$.

Development

Recall that in Equation 2-33 a likelihood function was derived for a Gibbs point process using an energy function $U$ which was used to assign likelihood based on the configuration of points in some set $X$. We have noted that different forms of $U$ yield different issues and may imply certain constraints on a point process.

We now define an unnormalized likelihood function using an energy functional which calculates the energy of a particular configuration by analyzing an observed function or measure. The goal is to permit a tractable contextual estimate, as opposed to an energy function as in Equation 2-35. Furthermore, we desire the ability to analyze the shape of a function across $E$ rather than inspecting pairs of elements in $E$ as in Equation 2-36. Also, we define the likelihood function such that it can be parameterized to recognize different random measures, whereas Gibbs point processes are typically used to calculate probability using the energy of a closed system and not necessarily distinct random measure characterizations. Since we are analyzing functions, we use the KL divergence on functions. We note that other measures or divergences on functions may be used as well.
We define the energy functional for random measure $\mathcal{M}$ as

$$U(P_X) = KL(P_X \,\|\, Q). \tag{3-27}$$

We refer to $Q$ as the representative measure for random measure $\mathcal{M}$; it can be thought of as a parametric representation of $\mathcal{M}$. We can now define the unnormalized likelihood functional for random measure $\mathcal{M}$ as

$$p_{\mathcal{M}}(P_X) = \exp\{-KL(P_X \,\|\, Q)\}. \tag{3-28}$$

Note that this likelihood compares how measure is distributed between the function $P_X$ and $Q$. Hereafter, we denote $Q$ by $Q_{\mathcal{M}}$, or $Q_i$ for a particular context $i$. If the distribution of mass in $P_X$ becomes more similar to that in $Q$, a higher likelihood is assigned to $P_X$, using the KL divergence to assess similarity. Therefore, an intuitive value for $Q$ would be the measure that minimizes the sum over the KL divergences of observed samples $D = \{P_{X_1}, P_{X_2}, \ldots, P_{X_N}\}$ from $\mathcal{M}_i$,

$$Q_{\mathcal{M}} = \arg\inf_{R} \sum_{j=1}^{N} KL(P_{X_j} \,\|\, R). \tag{3-29}$$

Hereafter, we denote the densities corresponding to measures $Q$ and $P_X$ as $q$ and $\nu_X$, respectively, and assume they exist. The likelihood function defined in Equation 3-28 is used for contextual estimation given the random set framework for context-based classification. Specifically, we use the likelihood on random measures to calculate the contextual estimation term.

$$P(X \mid \Gamma_i) = \exp\{-KL(P_X \,\|\, Q_i)\} \tag{3-30}$$

In Equation 3-30, $Q_i$ is the representative measure for context $i$ and $P_X$ is the measure corresponding to observed set $X$. We use the KL divergence to compare distributions using their corresponding densities $\nu_X$ and $q_i$ to determine the likelihood of context $i$. Therefore, we can calculate or approximate Equation 3-30 using Equations 3-31 or 3-32, respectively.

$$P(X \mid \Gamma_i) = P(\nu_X \mid q_i) = \exp\left\{-\int \nu_X(x) \log \frac{\nu_X(x)}{q_i(x)}\, dx\right\} \tag{3-31}$$

$$P(X \mid \Gamma_i) = P(\nu_X \mid q_i) \approx \exp\left\{-\sum_{x \in A} \nu_X(x) \log \frac{\nu_X(x)}{q_i(x)}\, \Delta x\right\} \tag{3-32}$$

In Equation 3-32, $A \subset E$ is used to estimate the KL divergence. The choice of $A$ is further detailed in the Discussion section.
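The discrete approximation in Equation 3-32 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the densities are taken to be one-dimensional Gaussians, the set A is a regular grid with constant spacing, and the names `gaussian_pdf` and `context_likelihood` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def context_likelihood(nu_X, q_i, grid, dx):
    """Unnormalized likelihood exp(-KL(nu_X || q_i)), with the KL divergence
    approximated by a Riemann sum over the points of grid (the set A)."""
    kl = 0.0
    for x in grid:
        p, q = nu_X(x), q_i(x)
        if p > 0.0:
            if q == 0.0:
                return 0.0  # infinite divergence: q_i gives no mass where nu_X does
            kl += p * math.log(p / q) * dx
    return math.exp(-kl)


# Example: an observed density compared with two candidate representative densities.
grid = [-10.0 + 0.01 * k for k in range(2001)]
observed = lambda x: gaussian_pdf(x, 0.0, 1.0)
matching = context_likelihood(observed, observed, grid, 0.01)
shifted = context_likelihood(observed, lambda x: gaussian_pdf(x, 1.0, 1.0), grid, 0.01)
```

Here `matching` evaluates to 1 because the two densities coincide, while `shifted` is close to exp(-0.5), the value implied by the closed-form KL divergence between unit-variance Gaussians whose means differ by one.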
The choice between Equations 3-31 and 3-32 depends on the formulation of parameter $q_i$; specifically, whether an analytical representation of the KL divergence exists or whether it is convenient for parameter estimation given an assumed parametric form of the model. The density $q$ is the parameter for $P(\nu_X \mid q)$, which itself may be parameterized for convenience, for example, $q \sim N(\mu, \Sigma)$ or $q \sim \mathrm{Exp}(\lambda)$. We note that estimation may benefit if density $q$ is modeled using a more complex distribution such as a Gaussian mixture; however, this may lead to difficulty in computation and may complicate parameter learning [98].

In the probabilistic approach, we need to construct the $\nu_X$ given some observed set $X$. One possible construction would be to use a simple Lebesgue or uniform measure over the discrete points in $X$.

Example 3.1 Discrete measure: Assume $X = \{x_1, x_2, \ldots, x_n\}$. Then we could construct our measure $M_X$ using a cardinality-based measure $c$ such that

$$P_X(F) = \frac{c(F \cap X)}{c(X)} = \frac{|F \cap X|}{|X|}. \tag{3-33}$$

We note that this measure meets the requirements outlined in the definition of a random measure; however, it is discontinuous and not smooth, which may lead to optimization issues. Furthermore, as we will see during the construction, issues may arise if $M_X$ has a limited support. Therefore, it is beneficial to provide a parametric measure that is smooth and has a large support.

If we use the generalized development of the random measure, and therefore the general construction of an instance of a random measure as in Equation 2-40, we can develop a parametric measure that is continuous and has a large support, given some assumptions.

Example 3.2 Continuous parametric measure: Assume $X = \{x_1, x_2, \ldots, x_n\}$ are a finite number of observations from some infinite set $\mathcal{X} \subset \mathbb{R}^d$.
If we assume that elements in $X$ are distributed similarly to this continuous set in the space $\mathbb{R}^d$, we could estimate the measure on this set using parameters calculated from $X$ and define a measure $M_X: \mathcal{F} \to [0, 1]$ by

$$P_X(F) = \frac{\int_{F} \nu_X(x)\, dx}{\int_{\mathbb{R}^d} \nu_X(x)\, dx} = \frac{\int_{F} N(x \mid \mu_X, \Sigma_X)\, dx}{\int_{\mathbb{R}^d} N(x \mid \mu_X, \Sigma_X)\, dx} \approx \frac{\sum_{x \in F} N(x \mid \mu_X, \Sigma_X)\, \Delta x}{\sum_{x \in \mathbb{R}^d} N(x \mid \mu_X, \Sigma_X)\, \Delta x}. \tag{3-34}$$

We estimate the center of mass $\mu_X$ and covariance $\Sigma_X$ of the set $X$ using the set of observed finite samples in $X$ and use these estimates for the parameters of the Gaussian density. We have therefore constructed an example of a measure given an observed sample $X$ which is continuous, has a large support, and has a parametric form. Other parametric forms of $\nu_X$ could be developed through many existing methods. If we assume a complicated parametric form for $\nu_X$, some methods that might be used to estimate $\nu_X$, such as the standard EM algorithm, may be subject to initialization conditions and therefore will not strictly satisfy Equation 2-40.

Optimization

Next, we develop optimization strategies and example model implementations that would use Equation 3-31 or 3-32. The developed probabilistic model allows for closed form solutions for optimization given certain model assumptions and appropriate objective functions. Roughly speaking, the optimization of parameter $q$, using parametric representations of $\nu_X$, proceeds in two main steps. During the first step, parameters of the densities $\nu_X$ are estimated for each $X \in \mathbf{X}$ using standard methods such as EM or ML estimates. The result is a set of densities, and therefore measures, $\{\nu_{X_1}, \nu_{X_2}, \ldots, \nu_{X_n}\}$. In the second step, representative measure $q_i$ is estimated for each random set $\Gamma_i$ by maximizing a likelihood function that is a product of factors involving the context-dependent classification factor $p(x \mid y, \Gamma_i)$, the context estimation factor $P(X \mid \Gamma_i) = P(\nu_X \mid q_i)$, and the prior $P(\Gamma_i)$, with respect to function $q_i$. We focus on the maximization of $P(\nu_X \mid q_i)$ since the classification factors of each context can be estimated using standard techniques.
Note that factor $P(\nu_X \mid q_i)$ treats $\nu_X$ as the samples rather than $x$ as in standard methods. Specifically, in the first optimization example, we assume a form of $\nu_X$ and $q$ such that the integral in Equation 3-31 can be calculated analytically. We take an EM approach for optimization; specifically, we take an expectation over the contextual parameters given each $\nu_X$ constructed from observation set $X$. We assume $q \sim N(\mu_q, \Sigma_q)$ and $\nu_X \sim N(\mu_X, \Sigma_X)$. Initially each $\nu_X$ is constructed from the observed samples $x \in X$ of the corresponding set. Once each $\nu_X$ is constructed, the individual elements $x \in X$ of the sets are no longer referenced in the optimization process.

We begin by defining our objective and corresponding log likelihood function given our initial independence assumptions of the random set framework, arriving at

$$L(\Theta) = \log \prod_{X \in \mathbf{X}} \left[ P(X \mid \Gamma_i)\, P(\Gamma_i) \prod_{x \in X} p(x \mid y, \Gamma_i) \right]. \tag{3-35}$$

Next, we take an expectation over the contextual parameters given our observed populations,

$$E_{\Gamma \mid X}\left[L(\Theta)\right] = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ \log P(X \mid \Gamma_i) + \log\left( P(\Gamma_i) \prod_{x \in X} p(x \mid y, \Gamma_i) \right) \right] P(\Gamma_i \mid X). \tag{3-36}$$

We disregard the classification term for now, as this type of optimization is ubiquitous throughout the literature, and therefore we focus on the contextual terms.

$$R(\Theta) = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ \log P(X \mid \Gamma_i) + \log P(\Gamma_i) \right] P(\Gamma_i \mid X) \tag{3-37}$$

Using Equation 3-31, we get

$$R(\Theta) = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ \log \exp\{-KL(\nu_X \,\|\, q_i)\} + \log P(\Gamma_i) \right] P(\Gamma_i \mid X). \tag{3-38}$$

After some algebra we arrive at

$$R(\Theta) = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ -\int \nu_X(x) \log \frac{\nu_X(x)}{q_i(x)}\, dx + \log P(\Gamma_i) \right] P(\Gamma_i \mid X). \tag{3-39}$$

Analytically integrating and ignoring a constant [98], we arrive at

$$R(\Theta) = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ -0.5 \left( \log \frac{|\Sigma_{q_i}|}{|\Sigma_X|} + \mathrm{Tr}\left(\Sigma_{q_i}^{-1} \Sigma_X\right) + (\mu_X - \mu_{q_i})^{T} \Sigma_{q_i}^{-1} (\mu_X - \mu_{q_i}) \right) + \log P(\Gamma_i) \right] P(\Gamma_i \mid X). \tag{3-40}$$

We then perform the maximization step by differentiating Equation 3-40 with respect to the parameters. At this point we note that many closed form representations can be found for the KL divergence of distributions other than the Gaussian, such as the exponential distribution.
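For the Gaussian case, the closed form behind Equation 3-40 is easy to evaluate directly. The sketch below restricts attention to one-dimensional Gaussians so that it needs only the standard library; the function names are hypothetical, and unlike Equation 3-40 the additive constant is kept so that the divergence of a density from itself is exactly zero.

```python
import math


def kl_gaussian_1d(mean_x, var_x, mean_q, var_q):
    """Closed-form KL(N(mean_x, var_x) || N(mean_q, var_q)) for one-dimensional Gaussians."""
    return 0.5 * (math.log(var_q / var_x)
                  + var_x / var_q
                  + (mean_x - mean_q) ** 2 / var_q
                  - 1.0)


def context_likelihood_gaussian(mean_x, var_x, mean_q, var_q):
    """Unnormalized context likelihood exp(-KL), as in Equation 3-31, for Gaussian densities."""
    return math.exp(-kl_gaussian_1d(mean_x, var_x, mean_q, var_q))
```

The three terms inside the parentheses are the one-dimensional counterparts of the log-determinant ratio, the trace term, and the Mahalanobis term.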
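Setting the result of the differentiation of Equation 3-40 to zero and solving for parameters $\mu_{q_i}$, $\Sigma_{q_i}$, and $P(\Gamma_i)$ results in update Equations 3-41, 3-42, and 3-43, respectively.

$$\mu_{q_i} = \frac{\sum_{X \in \mathbf{X}} \mu_X\, P(\Gamma_i \mid X)}{\sum_{X \in \mathbf{X}} P(\Gamma_i \mid X)} \tag{3-41}$$

$$\Sigma_{q_i} = \frac{\sum_{X \in \mathbf{X}} \left[ \Sigma_X + (\mu_X - \mu_{q_i})(\mu_X - \mu_{q_i})^{T} \right] P(\Gamma_i \mid X)}{\sum_{X \in \mathbf{X}} P(\Gamma_i \mid X)} \tag{3-42}$$

$$P(\Gamma_i) = \frac{\sum_{X \in \mathbf{X}} P(\Gamma_i \mid X)}{\sum_{i=1}^{I} \sum_{X \in \mathbf{X}} P(\Gamma_i \mid X)} \tag{3-43}$$

Finally, we use Bayes' rule to solve for $P(\Gamma_i \mid X)$:

$$P(\Gamma_i \mid X) \propto P(X \mid \Gamma_i)\, P(\Gamma_i). \tag{3-44}$$

Recall that $P(X \mid \Gamma_i)$ is given by Equation 3-31.

However, as previously mentioned, if a more complex distribution is assumed for the model or the sample $\nu_X$, the KL divergence may not have a closed form representation. We now develop an optimization strategy for this case. Assume the representative measure is a Gaussian mixture, $q_i \sim \sum_{j=1}^{n_i} \pi_{ij} N(\mu_{ij}, \Sigma_{ij})$, which does not permit a closed form solution. We note there are numerical and statistical methods that can be used to help estimate the KL divergence [98]; however, the optimization of the parameters in $q_i$ would become an issue if those techniques were used. For development of this optimization technique, we skip to Equation 3-37 and substitute in Equation 3-32, arriving at

$$R(\Theta) = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ -\sum_{x \in A} \nu_X(x) \log \frac{\nu_X(x)}{q_i(x)}\, \Delta x + \log P(\Gamma_i) \right] P(\Gamma_i \mid X). \tag{3-45}$$

Upon inspection, we see that optimization with respect to $q_i$ is analogous to minimizing the KL divergences between each $\nu_X$ and $q_i$. If we assume $q_i$ is a Gaussian mixture, with some algebra, we arrive at

$$R(\Theta) = \sum_{X \in \mathbf{X}} \sum_{i=1}^{I} \left[ -\sum_{x \in A} \left( \nu_X(x) \log \nu_X(x) - \nu_X(x) \log \sum_{j=1}^{n_i} \pi_{ij} N(x \mid \mu_{ij}, \Sigma_{ij}) \right) \Delta x + \log P(\Gamma_i) \right] P(\Gamma_i \mid X). \tag{3-46}$$

After performing the maximization step for parameter $\mu_{ij}$, we can get a closed form solution assuming Equation 3-48 is independent of $\mu_{ij}$:

$$\mu_{ij} = \frac{\sum_{X \in \mathbf{X}} \sum_{x \in A} \left[\nu_X(x)\, \Delta x\right] x\, \gamma_{xij}\, P(\Gamma_i \mid X)}{\sum_{X \in \mathbf{X}} \sum_{x \in A} \nu_X(x)\, \Delta x\, \gamma_{xij}\, P(\Gamma_i \mid X)} \tag{3-47}$$

where

$$\gamma_{xij} = \frac{\pi_{ij} N(x \mid \mu_{ij}, \Sigma_{ij})}{\sum_{k=1}^{n_i} \pi_{ik} N(x \mid \mu_{ik}, \Sigma_{ik})} = p(\omega_{ij} \mid x) \tag{3-48}$$

is the posterior probability of component $j$ of $q_i$, denoted $\omega_{ij}$, given $x$. While updating the parameters, we assume $\gamma_{xij}$ is independent of the other parameters, which is a common assumption in machine learning [97].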
In fact, this result is similar to the result attained using a standard EM approach, taking the expectation over each component given the individual samples using $p(\omega_{ij} \mid x)$ [97]. The other parameters are solved similarly,

$$\Sigma_{ij} = \frac{\sum_{X \in \mathbf{X}} \sum_{x \in A} \left[\nu_X(x)\, \Delta x\right] (x - \mu_{ij})(x - \mu_{ij})^{T}\, \gamma_{xij}\, P(\Gamma_i \mid X)}{\sum_{X \in \mathbf{X}} \sum_{x \in A} \nu_X(x)\, \Delta x\, \gamma_{xij}\, P(\Gamma_i \mid X)} \tag{3-49}$$

and

$$\pi_{ij} = \frac{\sum_{X \in \mathbf{X}} \sum_{x \in A} \nu_X(x)\, \Delta x\, \gamma_{xij}\, P(\Gamma_i \mid X)}{\sum_{X \in \mathbf{X}} \sum_{x \in A} \nu_X(x)\, \Delta x\, P(\Gamma_i \mid X)}. \tag{3-50}$$

Optimization is again performed in sequence, with parameter $\gamma_{xij}$ being calculated last in each epoch. To properly calculate the factor $\Delta x$ in the update Equations 3-47, 3-49 and 3-50, we use the standard approximation of the Riemann integral. If $x$ is multidimensional, $x \in \mathbb{R}^d$, construction of $\Delta x$ involves creating incremental volumes $\Delta V$ by constructing a hypergrid or hyper-rectangles. Hereafter, we refer to $\Delta V$ as $\Delta x$. One intuitive method of constructing the set $A$ would be to construct samples by taking all $N^d$ combinations of the $N$ samples in $X$ in each dimension $d$. However, if samples $x$ are multidimensional, then construction of $\Delta x$ may be intractable. If a smaller $A$ were constructed, the Riemann approximation may decrease in accuracy. We propose an efficient estimation of the KL divergence that assumes $\Delta x$ is constant and that the samples that comprise $A$ are uniformly sampled from some hyperrectangle created from observations of the distributions $\nu_X$ and $q_i$. This approximation, which is similar to Markov Chain Monte Carlo (MCMC) integration, is intuitive since if the samples are, in fact, uniformly distributed, $\Delta x$ should be constant. In the experiments, we analyze the error using synthetic and real data sets.

Discussion

There are many interesting results of this derivation. For clarification, we first provide a few examples in order to flesh out some of these details. Next we discuss certain similarities and distinctions between the proposed method, standard methods and variational methods. In particular we compare optimization and inference results of the proposed method to standard statistical methods.
Lastly, we compare the proposed method to typical variational methods for approximate inference.

We noted earlier that in the construction of the proposed likelihood function we hoped to gain some versatility over standard probabilistic approaches that assume i.i.d. samples. However, some approaches that could be employed for the construction of $\nu_X$ may implicitly assume that the singleton elements of $X$ are i.i.d. We note these effects do not necessarily trickle up to inference at the measure level. After we introduce the optimization methods, which helps identify some characteristics of the proposed approach, we illustrate some of the similarities and differences between standard statistical approaches, which assume i.i.d. samples, and the proposed method.

Example 3.3 Construction of $\nu_X$: Equation 3-32 can be rewritten

$$P(\nu_X \mid q_i) = \prod_{x \in A} \exp\left\{ -\nu_X(x) \log \frac{\nu_X(x)}{q_i(x)}\, \Delta x \right\}. \tag{3-51}$$

Note that $\nu_X$ is a function of our observation set $X$, and therefore each term in Equation 3-51 is dependent on the set $X$. Note that the use of samples $x \in A$ is simply to estimate the KL divergence; that is, the only reason to use the underlying space is to sample the values of $\nu_X$ and $q_i$. In fact, the samples in $A$ do not even need to be elements of the observation set $X$. Since the likelihood function can be factorized as in Equation 3-51, we could interpret the resulting product as stating that each value $\nu_X(x)$ is distributed by a standard random variable $M_i(x)$, which is determined by random set $\Gamma_i$ and is represented by $q_i(x)$ given representative function $q_i$. Note that $\nu_X$ is a function of the set $X$ and that each corresponding value $\nu_X(x)$ is drawn from a distinct random variable $M_i(x)$ at each $x$ in the domain of $M_i$, as illustrated in Figure 3-1. So in effect, a random measure is a continuum of random variables on some subset of $E$, one for each element in the domain of $M$, namely $M(x)$.
As mentioned, each random variable $M(x)$ has a corresponding parameter $q(x)$. Note the parametric form of the representative function $q$ has allowed us to maintain a continuum of random variables in a concise manner, but at the cost of versatile forms of $q(x)$ and therefore $M(x)$. That is, $M(x_1)$ is a random variable that maps into $\mathbb{R}$ whose distribution is intrinsically governed by the distribution of the random set $\Gamma$. We shall refer to the value $q(x_1)$ as the representative value for random variable $M(x_1)$.

Example 3.4 Random variable $M(x)$: If we wanted to minimize the KL divergence of two probability measures, the two functions must coincide. Assume we wanted to minimize the sum of KL divergences between $q$ and samples $\nu_X$. At the optimum, each representative value $q(x)$ is the representative function's value at $x$ which minimizes this sum of the KL divergences, given the constraint that the representative function must be a probability measure. Assume we collect $N$ samples from $\Gamma$ and have $N$ corresponding measures. Note that, for each instance, $\nu_{X_j}(x)$, $j = 1, \ldots, N$, is an instance of random variable $M(x)$ for a fixed $x$. If we minimize the expression $\sum_j KL(P_{X_j} \,\|\, Q)$ with respect to each value $q(x)$ at a fixed $x$, using Equation 3-32 and subject to the constraint that $Q$ must be a probability measure, that is $\sum_{x \in A} q(x)\, \Delta x = 1$, we arrive at

$$q(x) = \frac{\sum_{j} \nu_{X_j}(x)}{\sum_{j} \sum_{x' \in A} \nu_{X_j}(x')\, \Delta x'} = \frac{\sum_{j} \nu_{X_j}(x)}{\sum_{j} 1} = \frac{1}{N} \sum_{j=1}^{N} \nu_{X_j}(x). \tag{3-52}$$

Note that $q(x)$ is the arithmetic mean of $\nu_{X_j}(x)$, $j = 1, \ldots, N$. This means the representative value is the mean value for $M(x)$ for each $x$, and therefore minimizes the squared Euclidean distance between samples $P_X(x)$ from random variable $M(x)$, as illustrated in Figure 3-1B. One result of using a parametric form for the representative function is that the representative value $q(x)$ may no longer be the exact mean of random variable $M(x)$ due to the particular constraints, for example if it is assumed Gaussian distributed.
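The claim of Example 3.4, that the pointwise arithmetic mean of the observed densities minimizes the sum of discretized KL divergences, can be checked numerically. The sketch below assumes densities tabulated on a shared grid with constant spacing; the names `kl_sum` and `pointwise_mean` and the toy values are hypothetical.

```python
import math


def kl_sum(densities, q, dx):
    """Sum over observed densities nu_j of the discretized KL(nu_j || q).

    Each density, and q, is a list of values on the same grid with spacing dx.
    """
    total = 0.0
    for nu in densities:
        for p, q_x in zip(nu, q):
            if p > 0.0:
                total += p * math.log(p / q_x) * dx
    return total


def pointwise_mean(densities):
    """Arithmetic mean of the observed densities at every grid point."""
    n = len(densities)
    return [sum(values) / n for values in zip(*densities)]


# Two toy densities on a three-point grid with unit spacing.
observed = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
q_mean = pointwise_mean(observed)
uniform = [1.0 / 3.0] * 3
```

On this toy data `kl_sum(observed, q_mean, 1.0)` is smaller than the value obtained with `uniform` or with either observed density used as the representative function.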
However, the assumption of a parametric model is important; otherwise, we would need a random variable for each point in the domain of $M$, which does not permit a tractable solution unless a very simple domain is assumed. As found throughout machine learning techniques, there is a tradeoff between data fidelity and tractability.

Example 3.5 Representative value: Given a set of observed Gaussian measures constructed by selecting the mean and covariance from a uniform interval, assume we wish to construct $Q$ using the update Equations 3-47, 3-49 and 3-50. Note this implies we are assuming that $q$ is Gaussian. The resulting representative values are not necessarily the arithmetic means of samples from $M(x)$, as illustrated in Figure 3-1B. Although the update equations are optimal assuming a Gaussian form, they are not necessarily optimal over all possible distributions due to this extra constraint.

The first optimization technique proposed, which uses Equation 3-31 to calculate the KL divergence, learns the parameters of $q$ using some parameters of our observed distribution $\nu_X$. However, in the second optimization technique proposed, the parameters of $q$ are learned using the underlying space, samples in $\mathbb{R}^d$. We note there is some similarity between these update equations and those that are developed in standard EM algorithms, such as Equation 3-53.

$$\mu_{ij} = \frac{\sum_{x \in A} \gamma_{xij}\, x}{\sum_{x \in A} \gamma_{xij}} \tag{3-53}$$

However, we note that in the proposed update equations, there is a discrete expectation over random measures, not simply an expectation over standard random variables. We also note that when samples are clustered, the set $A$ is typically the data. However, in the proposed approach, the samples in set $A$ are not directly important, as long as their use permits a good estimate of the KL divergence. The major difference in the update formulas is the factor $\nu_X(x)\, \Delta x$. Note that in the KL divergence we integrate with respect to our sample $\nu_X$, which is also a density.
In the discrete approximation, the factor $P_X(x)\,\Delta x$ is used instead. One interpretation is that we are taking the expected value of the difference between $P_X$ and $q_i$. This interpretation shows that, during optimization, we are trying to minimize the difference between the samples $P_X$ and the representative measure $q_i$. Another interpretation is that the representative function $q_i$ is being coerced into a form similar to the samples $P_X$ vicariously through its parameters, using samples $x$ in $A$ and weights $P_X(x)\,\Delta x$, as illustrated in Figure 3-2. This coercion is performed through the parameters, for example the means, which reside in the same space as $x$. For this reason, the samples $x$ are included in the update equations. However, in Equation 3-47, the factor $P_X(x)\,\Delta x$ weights each sample $x$ by its corresponding measure in the distribution $P_X$. In fact, the representative function is optimized such that $q(x)$ is similar to $P_X(x)$, not necessarily to maximize $q(x)$ with respect to the samples $x$, as is the case with standard statistical optimization.

However, there are similarities between standard statistical optimization and the proposed method. In standard statistical methods, the learned posteriors or likelihoods are optimized while assuming i.i.d. samples. In the proposed method, the representative function is optimized using observed measures, which may have been constructed using optimization techniques similar to those used in standard statistical methods. In the developed approach, the observed measures are essentially likelihood functions optimized with respect to each observed set, and therefore most likely assume i.i.d. samples during their construction. We illustrate situations in which the proposed methods result in similar and in different optimizations than standard methods.

Example 3.6 Optimization similarities and differences: Assume we have multiple observation sets $D = \{X_1, X_2, \ldots, X_N\}$ observed in the same context and we wish to optimize likelihood functions for context estimation.
We optimize a standard likelihood function, which assumes i.i.d. samples, using the EM algorithm while training on the dataset $X = \bigcup_{i=1}^{N} X_i$. We will also learn the proposed random measure likelihood function by optimizing the representative function given the observed measures $P_{X_1}, P_{X_2}, \ldots, P_{X_N}$ using the method in Equations 3-47, 3-49 and 3-50. We will construct the observed measures using the standard EM algorithm for Gaussian mixtures. Results are illustrated and further detailed in Figure 3-3. Note that EM optimization results in a likelihood learned from the set of all singleton samples, whereas the learned representative function is a measure learned from a set of measures. If the distribution of $X$ is similar to that of each $X_i$ with respect to the number of samples in the distribution, the learned representative measure will be similar to the likelihood learned using standard methods. This is because all information can be detailed without any set information; however, if the distributions are different with respect to the number of samples, the learned measures will be different. This result is illustrated in Figure 3-3. This distinction is a direct result of the proposed method's ability to treat each set as a unitary element.

We have identified some fundamental differences between the proposed method and standard techniques. Note that there are also similarities and differences when performing inference using standard techniques and the proposed technique. In many cases the calculation of likelihood is different during inference; however, in some cases the result of inference, that is, the determination of the most probable context, is similar. In fact, if the representative measure $q$ is the actual learned likelihood of the standard method, that is $p(x \mid y) = q(x)\,\Delta x$, then the result of inference will be the same.
This similarity between the two approaches again holds if the distribution of $X$ is similar to that in each $X_i$.

Example 3.7 Inference similarities and differences: The random measure approach assigns high likelihood to sets, or random measure instances, that have a similar distribution throughout the domain, whereas standard approaches assign high likelihood as long as each observed singleton sample appears in a place of high likelihood. This difference is illustrated in Figure 3-3C, which continues from Example 3.6. Note that although this is a fundamental difference, the result of context estimation may be similar using both approaches, depending on the construction of the observed measures and the results of optimization.

During the optimization of the proposed likelihood function, the representative function is learned. This is similar to variational methods, where functions are learned by optimizing objective functionals.

Example 3.8 Comparison with standard inference using variational methods: Given an observed set $X$, we want to determine if it was observed in context $i$. Using the proposed method on random measures, we would first construct $P_X$. Next, we could determine the unnormalized likelihood of some context using $p(M_X \mid i) \propto \exp\left(-KL(P_X \,\|\, Q_i)\right)$. In contrast, with a standard variational method, or most standard methods of inference given a joint observation set, the initial observation set is explicitly assumed i.i.d. during optimization and inference. For example, the standard initial assumption made in variational inference given a set $X$ is

$$p(X \mid Z) = \left(\frac{\tau}{2\pi}\right)^{N/2} \exp\left\{-\frac{\tau}{2} \sum_{n=1}^{N} (x_n - \mu)^2\right\} \tag{3-54}$$

where $Z = \{\mu, \tau\}$ [97]. Therefore, the estimate of the posterior $p(Z \mid X)$ is also based on the i.i.d. assumption.

In Example 3.8, we compared the proposed method to standard variational methods; however, we ignored the use of hyperparameters.
The hyperparameters would be better suited for contextual inference since they govern distributions on distributions, and inference could be performed on observed measures. We explore the viability of using these hyperparameters as a means of context estimation.

Example 3.9 Context estimation using variational methods: For the construction of the hyperparameters, assume Equation 3-54. We then model parameters $\mu$ and $\tau$ using a normal and a gamma distribution, respectively.

$$p(\mu \mid \tau) = N\left(\mu \mid \mu_i, (\lambda_i \tau)^{-1}\right), \qquad p(\tau) = \mathrm{Gamma}(\tau \mid a_i, b_i) \tag{3-55}$$

It can be shown that the parameter $\mu_i$ is updated using

$$\mu_i^{t+1} = \frac{\lambda_i \mu_i^{t} + N \bar{x}}{\lambda_i + N}. \tag{3-56}$$

We note this is similar to update Equation 3-41, save the expectation over sets used by the proposed method. Therefore it cannot treat set values as unitary elements and will differ from the proposed method in the same way that standard statistical methods do, as illustrated in Examples 3.6 and 3.7.

Again, note that Equation 3-56 is somewhat similar to the optimization of the random set, where the Gaussian is the measure resulting from the update Equations 3-41, 3-42 and 3-43. The difference here is that there is a prior distribution on the parameters of some family of distributions. This simplifies computation to some degree, as the random element is reduced to a standard random variable in $\mathbb{R}^d$. Note that this is an atypical use of the intermediate constructs of standard variational inference; however, this potential use fits the problem of developing a likelihood on functions given a simple model.
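For contrast with the hyperparameter-based estimate, the random measure likelihood of Example 3.8, which scores a context by $\exp(-KL(P_X \,\|\, Q_i))$, can be sketched for univariate Gaussian measures using the closed-form Gaussian KL divergence. Normalizing the scores over contexts with a uniform prior is an assumption made here for readability, and the measure parameters and function names are hypothetical.

```python
import numpy as np

def kl_gauss(m0, s0, m1, s1):
    """Closed-form KL( N(m0, s0^2) || N(m1, s1^2) ) for univariate Gaussians."""
    return np.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5

def context_posterior(observed, representatives, priors=None):
    """Normalised context scores built from exp(-KL(P_X || Q_i)).

    observed        : (mean, std) of the observed measure P_X
    representatives : list of (mean, std), one representative measure Q_i per context
    priors          : optional context priors (uniform if omitted; an assumption)
    """
    kls = np.array([kl_gauss(*observed, *rep) for rep in representatives])
    scores = np.exp(-kls)
    if priors is None:
        priors = np.full(len(representatives), 1.0 / len(representatives))
    post = scores * np.asarray(priors)
    return post / post.sum()

# Hypothetical representative measures for two contexts and one observed measure.
representatives = [(0.0, 1.0), (3.0, 1.0)]
observed_measure = (0.1, 1.0)

posterior = context_posterior(observed_measure, representatives)
print("context posterior:", np.round(posterior, 3))
```

The whole observed measure is compared with each representative measure, so the set is scored as a single element rather than as a product over its singleton samples.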
Example 3.10 Context estimation using a mixture of Gaussian hyperparameters: In Example 3.9, we constructed a hyperparameter given a single Gaussian measure constructed from an observation set $X$. We can similarly construct a mixture of Gaussians given an observation set $X$. Given a set of observed parameters $\mu, \Lambda$ developed from some observed set $X$, we can estimate the likelihood of some context $i$ given some trained parameters $m_i, \beta_i, W_i, V_i$ learned under an assumed Gaussian-Wishart prior governing the mean and precision of each Gaussian component

$$p(\mu, \Lambda \mid m_i, \beta_i, W_i, V_i) = \prod_{j=1}^{J} N\left(\mu_j \mid m_{ij}, (\beta_{ij} \Lambda_j)^{-1}\right) W\left(\Lambda_j \mid W_{ij}, V_{ij}\right) \tag{3-57}$$

This development by Bishop [89] has surfaced a few inherent issues that accompany this approach. First, there is the assumption that the distribution over the hyperparameters is factorizable, which was previously mentioned and may or may not be constraining depending on the application area. However, the fact that we are now performing inference in the parameter space, rather than in the space of measures, has led to other issues. Note that $\mu_j$ and $m_{ij}$ are both indexed by $j$ although they are elements of the sets $\mu$ and $m_i$, respectively. This implies that, in order to properly calculate the likelihood, there must be the same number of observed components as there are Gaussian-Wishart priors, and that the observations and distribution components must be matched. These issues are a direct result of the hyperparameters being intermediate constructs. These constructs have one purpose, which is to model one set of observations. In fact, they are not meant to be used directly, since their only use is to integrate out intermediate parameters. That is why standard variational learning should not be used for context estimation.

Figure 3-1. Samples of Gaussian distributions drawn using randomly selected means and variances, which were drawn uniformly from a specified interval. A) Fifty sample measures are plotted. The resulting value at each point x is a random variable.
For example, random variable $M(1)$ has corresponding samples that lie on the line x = 1. B) The arithmetic mean, optimal Gaussian and optimal distribution are shown given the 50 Gaussian samples. The corresponding KL divergence values are 88.5, 91.2 and 88.5, respectively. The arithmetic mean is the optimal distribution; they coincide.

Figure 3-2. Learning the representative function using update Equations 3-47, 3-49 and 3-50 given the set A = {-.5, 1, 1.3, 1.5, 1.9, 2.5}. These plots illustrate the fact that the proposed method learns the function $P_X$ and does not fit the learned parameters to the individual samples in A. A) The observed measure and the initialized learned measure q. In standard learning techniques, the mean parameter would be optimal when it equals the mean value of the samples in A, 1.28. However, the proposed objective is optimized when the correct function is learned. The parameter is coerced toward the point x = -.5, since $P_X(-.5)$ is large compared to the values at the other samples in A. B) After a couple of iterations, the parameter becomes -.33. It should be clear that optimization coincides with function matching rather than fitting the function to the samples in A. C) If we use a set A that is a uniform sampling of 61 points in the range [-3, 3], we get a better estimate of the KL divergence and the learned measure coincides with the observed measure.

Figure 3-3. Similarities and distinctions between the proposed method and standard methods. A) The resulting EM likelihood and representative measure when optimized with respect to 10 observed sets (observed in context 2), each with a distribution similar to that of their union. B) The learned EM likelihood and representative measure when presented with 10 observed sets (observed in context 1) where one set has a distinct distribution compared to the union.
The proposed method assumes each measure is a single sample and does not weight the one set with a different distribution any differently than the other measures. However, the standard method looks at the distribution of the singleton samples. We have constructed the set with a different distribution to have a (comparatively) large number of singleton samples, to emphasize this ideological difference. C) When presented with a test set, the contextual estimates vary greatly between the standard approach and the proposed approach. Using the standard approach, context 1 is the most probable (100% to 0%), whereas using the proposed random measure approach, context 2 is the most probable (83% to 17%). Using standard i.i.d. joint estimation, the samples lying under the observed measure will have a greater likelihood in context 1, since the likelihood estimate for context 1 has greater likelihood values (as opposed to the likelihood for context 2) in the corresponding domain (approximately [-1, 1]). However, when comparing the representative measures for each context to the observed measure, the representative for context 2 is more similar to the observed measure.

CHAPTER 4
EXPERIMENTAL RESULTS

The three methods for context estimation developed within the random set framework (RSF) were tested using synthetic and real datasets. Four major experiments were performed. In the first experiment, we analyzed the use of different KL divergence approximation methods for estimating context in the proposed random measure model. In the second experiment, each of the three methods was tested using synthetic datasets created to imitate data in the presence of contextual factors. We compared the synthetic data classification results of the proposed RSF approaches to those of set-based kNN [12], [15] and the whitening / dewhitening transform [65].
The main purpose of the experiments using synthetic data sets is to identify situational pros and cons of each of the approaches. Each method's ability to identify the correct context is evaluated through its classification results, since the ultimate goal is classification. Hence, we may refer to our results as context estimation results, but show the classification error on a per-sample basis. In the third experiment, the proposed methods are applied to an extensive hyperspectral data set collected by AHI for the purposes of landmine detection. This data set exhibits the effects of contextual factors. The purpose of the experiments using real data sets is to show the applicability of the proposed random set framework to real-world problems. We compared the hyperspectral data classification results to those of set-based kNN and the whitening / dewhitening transform. In the final experiment, the possibilistic approach is compared to a similar classifier that does not use contextual information, and it is also compared to an oracle classifier that always selects the correct context for the purposes of context-based classification. These comparisons relate the possibilistic approach to, informally, its lower and upper bounds.

KL Estimation Experiment

Experiment 1 demonstrates the efficacy of five different constructions of the set A for KL estimation in the proposed random measure model. Recall that if Equation 3-32 is used for KL estimation, the set A must be constructed such that it admits a tractable calculation, but not at the expense of correctness. Therefore, we varied both the construction and the size of A and analyzed how each affects the ability to estimate context.

Experimental Design

We compared the results of context estimation using three synthetic datasets. In the experiments, each training set is constructed randomly by sampling from a Gaussian mixture with two components. Three Gaussian mixtures are used to simulate three distinct contexts.
Fifteen samples are generated from each component in each context. This experimental design attempts to simulate a two-class problem within each of three contexts. Ten training populations are constructed from each of the three Gaussian mixtures to simulate sets of samples observed in three distinct contexts. Observed measures are then created using Equation 2-40, assuming each is a Gaussian distribution; training is performed using Equations 3-47, 3-49 and 3-50. A test population is then generated from one of the Gaussian mixtures, which is randomly selected, and its corresponding observed measure is created assuming it is Gaussian. The representative function in the random measure model is learned from the 10 training measures and used to estimate the correct context of the test measure.

Experiments were performed using three data sets, where each data set, from one to three, represents an increasingly difficult context estimation problem due to highly overlapping contexts. Because the data sets are random Gaussian sample sets, all experiments are repeated 50 times. Examples of each dataset are illustrated in Figure 4-1.

Each of the random measure models under test uses Equation 3-32 to estimate the KL divergence and performs contextual estimation using the random measure likelihood function as in Equation 3-32. The five methods used to construct the set A are as follows. In the Riemann test method, A is composed of $N^d$ samples constructed by taking all combinations of test sample values in all dimensions, and a Riemann integral is approximated. In the Riemann test and train method, A is constructed as in the Riemann test method, but the samples are constructed using both testing and training samples. In the naive test method, A contains the observed test samples and $\Delta x$ is assumed constant. In the naive test and train method, A contains the observed test samples and the observed training samples, and $\Delta x$ is assumed constant.
In the uniform MCMC method, A is the result of sampling a uniform distribution within the hyperrectangle covering the train and test samples, and $\Delta x$ is assumed constant. Note that the Riemann test and Riemann test and train methods are the same during the training phase, but differ during testing. The same is true for the naive test and naive test and train methods. We point this out since, during training, only the training samples are used by all of the methods.

The Riemann approaches approximate the Riemann integral, which is a fairly standard approach. However, it may be intractable for high dimensional data. Using the observed samples to partition the space into these grids would require an exponential number of elements in A with respect to the number of observed samples.

In the naive test approach, A is simply the observed test samples and $\Delta x$ is assumed constant. In the naive test and train approach, A is simply the union of the test samples and the training samples of the particular context which is to be inferred. We note these approaches are very tractable, but we hypothesize that they will not produce good estimates of the KL divergence.

The uniform MCMC method constructs a hyperrectangle that covers the testing and training sets using simple min and max operations. Then a fixed number of samples, in this experiment the same number of samples as in the observed set $X$, are uniformly sampled from the covering hyperrectangle, and $\Delta x$ is constant for the approximation. The intuition behind this approach is that if the samples are truly uniform, $\Delta x$ should be similar for each sample. The hypothesis is that this method will balance tractability and correctness.

Fifty experiments are run on each of the three data sets. For each method the representative measure is assumed to be Gaussian. The resulting contextual estimation results are compared to those attained by the random measure model using the analytical KL solution.
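The uniform sampling construction and its comparison with the analytical KL solution can be sketched in one dimension as follows. This is an illustrative sketch, not the experimental code: the two Gaussian measures, the sample sizes, the seed and all function names are assumptions, and the measures are taken as known rather than fitted from the samples.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_analytic(m0, s0, m1, s1):
    """Closed-form KL( N(m0, s0^2) || N(m1, s1^2) )."""
    return np.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5

def kl_uniform_estimate(p, q, lo, hi, n_points, rng):
    """Discrete KL estimate over a set A of uniformly drawn points in [lo, hi],
    with a constant cell size dx = (hi - lo) / |A|."""
    a = rng.uniform(lo, hi, size=n_points)
    dx = (hi - lo) / n_points
    return float(np.sum(p(a) * np.log(p(a) / q(a)) * dx))

rng = np.random.default_rng(1)

# Hypothetical training and test samples; only their range is used here.
train = rng.normal(1.0, 1.5, size=30)
test = rng.normal(0.0, 1.0, size=30)
lo = min(train.min(), test.min())   # covering interval via min / max
hi = max(train.max(), test.max())

def p(x):  # observed (test) measure, assumed known for this sketch
    return gaussian_pdf(x, 0.0, 1.0)

def q(x):  # representative measure, assumed known for this sketch
    return gaussian_pdf(x, 1.0, 1.5)

exact = kl_analytic(0.0, 1.0, 1.0, 1.5)
estimates = {n: kl_uniform_estimate(p, q, lo, hi, n, rng) for n in (30, 300, 30000)}
for n, est in estimates.items():
    print(f"|A| = {n:5d}: estimate = {est:.3f}, analytic = {exact:.3f}")
```

The cost of this estimate grows linearly with the size of A, and its accuracy is limited by both the sampling noise and the part of the domain left outside the covering interval.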
The error of the methods under test is the average difference between their estimates and the analytical solution, which is assumed to be the correct estimate. We also compare the contextual estimation error as a function of the number of observed samples. The hypothesis is that as the number of samples increases, the KL estimates will improve.

Results

The results of context estimation are shown in Table 4-1. The Riemann approaches have the least total error over all three data sets. Uniform MCMC had a low error and performed slightly better than the Riemann test and train method for data sets 2 and 3. The naive methods had the most error for each data set, and the naive test method had the maximum error, 8.7%, on data set 2.

Interestingly, Riemann test, which only uses the test samples for estimation purposes, performs better than Riemann test and train, which uses both test and train samples. This is due to our Riemann approximation. Due to the construction of A, Riemann test and train will have considerably more elements in the set A. Although more elements may mean higher granularity and potentially a better estimate of $\Delta x$, they have also exacerbated the error in estimation. We used the upper bound estimate to approximate the integral, which means the KL estimates are slightly high for each $\Delta x$. Therefore, if we have considerably more intervals $\Delta x$, we may have more error, even with the better granularity.

Given the error estimates, the uniform MCMC seems to perform similarly to the Riemann estimates. However, it takes much less time to calculate than the Riemann approaches. Figure 4-2A shows a plot of context estimation error versus the number of samples in the initial observation set. For the Riemann approaches, exponentially many points are added to correctly partition the space like a grid. On the other hand, the uniform MCMC approach performs uniform sampling and constructs A to have the same number of samples as the observation set.
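The difference in the size of A between the two constructions can be sketched as follows, assuming the Riemann grid is formed from all combinations of the observed coordinate values (giving $N^d$ points for $N$ samples in $d$ dimensions) and the uniform set is drawn from the covering hyperrectangle with as many points as the observed set. The function names and the random samples are illustrative assumptions.

```python
import itertools
import numpy as np

def riemann_grid(samples):
    """Grid set A: all combinations of the observed values along each dimension."""
    axes = [np.unique(samples[:, k]) for k in range(samples.shape[1])]
    return np.array(list(itertools.product(*axes)))

def uniform_set(samples, rng):
    """Uniform set A: as many points as observed samples, drawn uniformly from
    the hyperrectangle given by the per-dimension min and max."""
    lo, hi = samples.min(axis=0), samples.max(axis=0)
    return rng.uniform(lo, hi, size=samples.shape)

rng = np.random.default_rng(2)
samples = rng.normal(size=(6, 3))      # hypothetical: N = 6 samples, d = 3

grid_A = riemann_grid(samples)
uniform_A = uniform_set(samples, rng)

print("Riemann grid size:", grid_A.shape[0])    # N ** d points
print("uniform set size :", uniform_A.shape[0]) # N points
```

Even for this small hypothetical case the grid holds 216 points against 6 for the uniform set, which is consistent with the difference in computation time reported for the two methods.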
Figure 4-2B shows the computation time needed for the Riemann test and train and uniform MCMC methods versus the size of the observed set. Although the Riemann approaches perform slightly better at integral estimation, uniform MCMC does comparably well and needs only a very small amount of relative computation time. The runtime of the Riemann approach is exponential with respect to the number of observed samples, whereas the uniform approximation has a linear relationship, as shown in Figure 4-2C.

Synthetic Data Experiment

The classification ability of the methods is under test in this experiment. Again, synthetic data is created to simulate the effects of contextual factors. Four data sets are constructed such that each exposes a pro and/or con of each of the proposed methods. Each of the four data sets is illustrated in Figure 4-3, which helps to visualize the experimental setup and the purpose of each of the carefully constructed datasets. We also experiment with the whitening / dewhitening transform and set-based kNN to expose their pros and cons and for comparison purposes.

Experimental Design

Again, samples are randomly generated from a Gaussian mixture with two components, where samples from each component are assumed to be from the same class. Again, there are three contexts, which allows for clear, less cluttered analysis. Ten training populations are constructed from each of the three Gaussian mixtures to simulate sets of samples observed in three distinct contexts. The contextual parameters for the possibilistic and evidential models are optimized as described in Equations 3-23 and 3-26, respectively. In the probabilistic approach, using random measures, the observed measures are created using Equation 2-40, assuming they are Gaussian. The representative functions of the random measure likelihood functions were learned using the EM algorithm in Equations 3-47, 3-49 and 3-50, in a supervised manner.
That is, each model's representative function was optimized using only the samples from the corresponding context. We performed 50 trials on each data set; in each trial, a test set was generated randomly from one of the Gaussian mixtures associated with one of the contexts. For the random measure model, the corresponding measure was created using the standard EM algorithm assuming a Gaussian mixture of two components. The proposed evidential, probabilistic and possibilistic methods were equipped with Gaussian mixtures optimized separately using the standard EM algorithm. The contextual components were optimized separately, as previously discussed.

The set-based kNN algorithm assigned to each test sample the label of the closest training sample in the closest set, that is, k = 1. The whitening / dewhitening transform was calculated as described in Equation 2-10 for each training image. The resulting confidence value was simply averaged over the training sets, since this algorithm does not provide for context estimation or relevance weighting.

Data set 1 is a fairly simple data set which should allow for simple context estimation and, within each context, simple classification. An example of data generated under data set 1 is shown in Figure 4-3. There are some disguising transformations present; however, the hypothesis is that most of the classifiers will perform well, since context estimation is fairly simple in this data set.

In data set 2, we orient the Gaussians such that samples from class x are positioned in relatively the same way with respect to the samples from class o in each of the three contexts. This data set was constructed to highlight the fact that the whitening / dewhitening transform assumes a similar orientation of classes throughout each context. Therefore, the hypothesis is that the whitening / dewhitening transform will perform well on this data set.
Each of the other methods should perform well since there remains only a slight presence of disguising transformations, and context estimation is therefore simple.

In data set 3, we introduce the presence of an outlier in the test set. The hypothesis is that the possibilistic approach should remain a good classifier since it has been shown to be robust [16]. The evidential estimate will be affected by the outlier since it is a pessimistic approach. The probabilistic approach may be slightly affected if the observed measure is skewed toward the outlier. Set-based kNN will be affected by the outlier due to the use of the Hausdorff metric. The whitening / dewhitening transform may be affected since the outlier may drastically influence the whitening process.

In data set 4, we introduce multiple outliers which are placed relatively near to the observed samples. This data set is constructed to alter the observed measure, and therefore our hypothesis is that the probabilistic approach will be highly affected, along with the evidential approach and set-based kNN. The possibilistic approach should be unaffected by the outliers. Classification results from the whitening / dewhitening transform will be drastically changed if the outliers greatly skew the whitening process.

Lastly, we analyzed the classification results of the evidential and possibilistic approaches, on dataset 3, when the number of germ / grain pairs was varied.

Results

The average classification errors are presented for each classifier for each dataset in Table 4-2. In data set 1, each method performed with under 10% error, and the best method, the evidential model, achieved a 4.1% error. The whitening / dewhitening transform performed the worst, since it relies on each class being oriented in relatively the same manner throughout each context, which is (slightly) not the case in data set 1. The possibilistic approach performed the worst out of the proposed methods.
Upon inspection, it fails to correctly identify context when an observed sample falls near a germ of an incorrect context. This is illustrated in Figure 4-3A. In the trial illustrated in Figure 4-3A, context 3 is incorrectly estimated as the most probable due to the close proximity of one of the test samples to the germ for context 3. The evidential and probabilistic models performed similarly well. Set-based kNN performed slightly worse, which is attributable to the use of a nearest neighbor classifier as opposed to a Bayesian classifier.

In data set 2, the whitening / dewhitening transform results improved as expected. The evidential and probabilistic models performed equally well. Set-based kNN and the possibilistic model performed relatively similarly.

In data set 3, the presence of an outlier drastically affected the classification results of the evidential model and set-based kNN. This data set is illustrated in Figure 4-3C. Both of the metrics used by these methods are pessimistic and are therefore affected by outliers. The possibilistic and probabilistic approaches remained unaffected. Similarly, the whitening / dewhitening transform produced results similar to those found in data set 1.

In data set 4, the evidential model and set-based kNN remained highly affected by the presence of outliers. This data set is illustrated in Figure 4-3D. The incorporation of multiple outliers also affected the results of the probabilistic approach. The presence of multiple outliers was enough to greatly affect the observed measure and therefore tainted the context estimation. The possibilistic approach remained unaffected by the outliers. The whitening / dewhitening transform performed relatively well, although the samples from each class were not oriented in relatively the same way in each context. However, we note that in each context, samples from class o were to the right of samples from class x.
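The outlier sensitivity of set-based kNN reported above can be sketched with a small example. The code below assumes the symmetric Hausdorff distance under the Euclidean norm for choosing the closest training set and k = 1 within that set; the two-dimensional point sets, labels and function names are hypothetical and chosen only to show a single outlier changing the selected set.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (n, d) and b (m, d)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def set_based_1nn(test_set, train_sets, train_labels):
    """Select the closest training set under the Hausdorff distance, then label
    each test sample with the label of its nearest sample in that set."""
    dists = [hausdorff(test_set, s) for s in train_sets]
    c = int(np.argmin(dists))
    d = np.linalg.norm(test_set[:, None, :] - train_sets[c][None, :, :], axis=2)
    return c, train_labels[c][d.argmin(axis=1)]

# Hypothetical training sets for two contexts, each holding classes 'x' and 'o'.
train_sets = [
    np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    np.array([[3.0, 0.0], [4.0, 0.0], [3.0, 1.0], [4.0, 1.0]]),
]
train_labels = [np.array(["x", "o", "x", "o"]), np.array(["x", "o", "x", "o"])]

clean_test = np.array([[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]])
outlier_test = np.vstack([clean_test, [[8.0, 0.5]]])   # one added outlier

context_clean, labels_clean = set_based_1nn(clean_test, train_sets, train_labels)
context_outlier, labels_outlier = set_based_1nn(outlier_test, train_sets, train_labels)

print("selected set without outlier:", context_clean)
print("selected set with outlier   :", context_outlier)
```

Because the Hausdorff distance is a maximum over nearest-neighbor distances, one distant sample can dominate it and redirect every test sample to the wrong training set.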
Table 4-3 shows the classification error, on data set 4, of the evidential and possibilistic models as the number of germ / grain pairs varied. It also shows the classification error of the probabilistic approach for a baseline comparison. Overall, the classification error decreases as the number of germ and grain pairs increases. This result is expected since more germ / grain pairs should allow for more detailed shape characterization.

Conversely, in standard techniques, the optimization of a statistical classifier using probability density functions may be subject to overtraining, especially if the number of densities used is increased or the number of densities is large compared to the number of training samples. In fact, if a probability density function is optimized with respect to a small number of samples, the density will become focused on those few samples, thus closing some abstract decision boundary tightly around them and causing overtraining. The overtraining during optimization corresponds to increasing the likelihood of samples in the correct probability density.

In the germ and grain model, the probability of sets of intersecting random hyperspheres increases as the hyperspheres grow. Therefore overtraining, in the aforementioned sense, is not an issue. However, optimization in the germ and grain model may cause the random radius to diverge, which is seemingly the opposite of overtraining. Appropriate MCE optimization techniques, as developed here, must be implemented to prevent divergence.

However, classification error will increase with an increase in the number of germ / grain pairs if the increase in pairs induces one of the situations outlined in Figure 4-3A, for the possibilistic approach, or Figure 4-3C and Figure 4-3D, for the evidential approach.

Hyperspectral Data Experiment

The classifiers under test were applied to remotely sensed, hyperspectral imagery collected from AHI [101], [102].
AHI was flown over an arid site at various times in the years 2002, 2003 and 2005. Eight AHI images, which covered approximately 145,000 m^2, were collected at altitudes of 300 m and 600 m with spatial resolutions of 10 cm and 15 cm, respectively. Each image contains 20 spectral bands after trimming and binning, ranging over LWIR wavelengths from 7.88 um to 9.92 um. Ground truth was provided by Radzelovege et al. [100]. The maximum error was estimated to be less than one meter. The scenes consisted mainly of targets, soil, dirt lanes and senescent vegetation. There are 4 types of targets. Targets of type 1 are plastic mines buried 10.2 cm deep, targets of type 2 are metal mines buried 10.2 cm deep, targets of type 3 are metal mines flush with the ground, and targets of type 4 are circular areas of loosened soil, referred to as holes, with diameters of less than one meter.

Since the imager was flown over the course of 4 years at various times of day, it is reasonable to assume that environmental conditions were variable. In fact, the presence of contextual transformations, including disguising transformations, was confirmed, as shown in Figure 1-1.

Experimental Design

Labeled data sets were constructed from the imagery such that all samples from each data set were assumed to be observed in the same context. Training set construction was done manually since the ground truth error was large enough to prevent automation of this task. We note that the spectral signatures of all target types were similar enough to group into the same class given this data set. Each training set consisted of 10 samples from each of four classes: target, soil type 1, soil type 2, and vegetation. Therefore each training set, whose samples are assumed to be observed in the same context, consisted of 40 samples in total. There were eight training populations, one from each image, used to model the context of each image. Each context was modeled using four germ and grain pairs.
The contextual parameters, ij, were optimized using Equations 2-23 and 2-26 for the possibilistic and evidential approaches, respectively. Again, the probabilistic approach was trained using the EM algorithm in a supervised manner, that is, each model was optimized using only the samples from the corresponding context to be modeled. Gradient descent optimization for the evidential and possibilistic approaches was terminated after 200 iterations, or sooner if the change was minimal. The germs were set to the results of k-means clustering of the samples of each class for each context. The learning rate for gradient descent optimization was set to 0.1.

The three context-based classifiers within the random set framework were equipped with Bayesian classifiers implemented as a mixture of Gaussians containing two components. The classification parameters were learned using the well-known EM algorithm in a supervised manner. Specifically, optimization for the mixture components modeling a particular class was performed using only samples from the corresponding class in the corresponding context. Diagonal loading of the covariance matrices was done to mitigate the effects of low sample numbers and high dimensionality.

Set-based kNN was equipped with a simple classifier: the inverse distance from the test sample to the closest representative of the target class in the closest training set, with k = 1. This classifier permits gray level confidences, which allows for comparison to the other algorithms in the ROC curve.

The whitening/dewhitening transform was calculated as described in Equation 2-10 for each training image. The resulting confidence value was simply averaged over the training images, since this algorithm does not provide for context estimation or relevance weighting.

Test sets, or populations, were constructed from subsets of the imagery. The well-known RX algorithm [99] was run by Ed Winter from Technical Research Associates Inc.
on the imagery as a pre-screener, or anomaly detector, to collect points of interest (POIs). There are 4,591 POIs and 1,161 actual targets in the entire dataset. Sets of samples surrounding each POI in a 9x9 pixel window were collected to form test sets. This implies there is a total of 4,591 test sets, each consisting of 81 spectral signatures. Note that each test set is assumed to be a population, which means it is assumed that each sample in the set is observed in the same context. For this dataset, the 9x9 pixel window is large enough to encompass a target and background samples, but small enough to ensure that all samples have been observed in the same context.

Each sample in the test set is classified as target or non-target by each of the classifiers. The probability of target is calculated for each sample within a test set, and each POI is assigned a probability of target detection by taking the mean probability of target over the center samples within a 3x3 window, since this is the standard size of a target. We note that the prescreener was not able to identify all targets in the scene, and the maximum probability of detection (PD) for the classifiers is therefore 75%, or 867 of the 1,161 targets.

Cross validation is implemented at the image level, that is, spectra from a test image are not used for training purposes while said image is under test. Note that this testing procedure assumes that there exists a training population from an image other than the test image that contains samples observed in a context similar to those in the test image.
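The POI scoring and image-level cross validation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the function names `poi_confidence` and `leave_one_image_out` and the demo probability map are hypothetical, and the per-sample target probabilities are assumed to be already available as a 9x9 array.

```python
import numpy as np


def poi_confidence(prob_map, center=3):
    """Mean target probability over the central `center` x `center` pixels.

    prob_map is assumed to hold one target probability per pixel of the
    window around a POI (9x9 in the experiment described in the text).
    """
    prob_map = np.asarray(prob_map, dtype=float)
    rows, cols = prob_map.shape
    r0 = (rows - center) // 2
    c0 = (cols - center) // 2
    return float(prob_map[r0:r0 + center, c0:c0 + center].mean())


def leave_one_image_out(image_ids):
    """Yield (test_image, training_images) pairs for image-level cross validation."""
    for test_id in image_ids:
        yield test_id, [i for i in image_ids if i != test_id]


# Hypothetical 9x9 probability map: high in the centre, low elsewhere.
demo = np.full((9, 9), 0.1)
demo[3:6, 3:6] = 0.9
print("POI confidence:", poi_confidence(demo))

# Eight images, so seven training contexts remain for every test image.
for test_id, train_ids in leave_one_image_out(list(range(8))):
    print(test_id, train_ids)
```

With eight images, each held-out image leaves seven training contexts, which matches the seven candidate contexts shown in the figures. Both steps rely on the assumption stated above, that some training image shares a context with the held-out test image.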
We note that this assumption may not be valid, and its failure may make classification very difficult; however, this testing procedure mimics the testing conditions of real-world application, that is, the exact context and labels of some of the spectra from a test image may not be known a priori.

Classification results of all target types are presented in one receiver operating characteristic (ROC) curve, which plots PD versus false alarm rate (FAR). We note that previous research has indicated that a minefield can be minimally detected when the PD is greater than 50% and the FAR is less than 10⁻² FA/m², and is successfully detected when the PD is greater than 50% and the FAR is less than 10⁻³ FA/m² [100].

Results

ROC curves for each algorithm are shown in Figure 4-4. All methods performed well, achieving greater than 50% PD at relatively low FARs. Note the probabilistic RSF approach was run both using the uniform sampling technique for KL estimation and using the analytical integral under a Gaussian assumption. The analytical approach performed best, although it assumed a Gaussian, whereas the uniform sampling method used a Gaussian mixture with four components. Although the uniform sampling allows for a more versatile modeling scheme, the analytical calculation of the KL divergence seemed more important than versatility for correct context estimation. Due to the high dimensionality and sparsity of the data, the KL estimate using uniform sampling suffers.

ROC curves with error bars are shown in Figure 4-5. In Figure 4-5, each mine encounter is treated as a binomial experiment and the error bars illustrate a confidence window of 95%. Note the PDs are normalized to 100% for binomial estimation, and there is good separation of the possibilistic and evidential approaches with 95% confidence, indicating a non-random result. All context-based approaches performed better than the whiten / dewhiten transform, save the probabilistic approach using uniform sampling.
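As a rough illustration of how error bars of this kind can be obtained, the sketch below computes a 95% interval for a detection probability by treating each encounter as a binomial trial. The text does not state which interval construction was used, so the normal approximation here is an assumption, as are the function name `binomial_interval` and the example count of 600 detections. The denominator of 867 is the number of targets passed by the prescreener.

```python
import math


def binomial_interval(detections, encounters, z=1.96):
    """Normal-approximation confidence interval for a detection probability.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if encounters <= 0 or not 0 <= detections <= encounters:
        raise ValueError("need 0 <= detections <= encounters and encounters > 0")
    p = detections / encounters
    half_width = z * math.sqrt(p * (1.0 - p) / encounters)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical operating point: 600 of the 867 prescreened targets detected.
low, high = binomial_interval(600, 867)
print(f"PD interval: [{low:.3f}, {high:.3f}]")
```

Non-overlapping intervals at a given FAR are what support reading the ordering of the curves, including the comparison against the whiten / dewhiten transform, as more than chance.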
This result is expected since these approaches are able to identify relevant contexts and use this information to correctly classify samples that have undergone contextual transformations. However, the whiten / dewhiten transform performed relatively well, which indicates that some of the classification issues induced by contextual transformations can be mitigated by means of whitening the data. Figure 4-7 shows a correctly classified POI, where each context-based approach identifies a relevant context, context 3 or 4, and consequently classifies the POI correctly.

Each of the RSF classifiers performed better than the set-based kNN classifier. This is due to their ability to identify relevant contexts in a probabilistic manner rather than a nearest neighbor manner, as indicated in Figure 4-10 and Figure 4-9. It is also due to the nearest neighbor approach used in the set-based kNN classifier, as indicated in Figure 4-6 and Figure 4-9: nearest neighbor approaches do not directly incorporate the idea of probability, or weights, and therefore assign confidence based on some fixed number of samples, in this case k = 1. We note that in previous experiments, k = 1 provided the best results for set-based kNN [17].

In Figure 4-8, a POI is incorrectly classified by the possibilistic approach. In this case, the possibilistic approach was the only method to identify context 4 as a relevant context, and it consequently misclassified this POI. The situation shown in Figure 4-3A for the synthetic data experiment has occurred here, that is, a sample from the test set has come into close proximity of a germ / grain pair modeling context 4, which has caused the possibilistic approach to choose this context as most likely, as opposed to context 3. Although its optimism has caused the possibilistic approach to misclassify the POI in Figure 4-8, it also allows for resiliency in the face of outliers.
An instance where the possibilistic approach chose a context different from all other approaches is shown in Figure 4-10. This POI was correctly classified by the possibilistic approach, and the chance of FA was lessened by all of the probabilistic approaches as they were able to identify two relevant contexts, one of which provides correct classification.

The evidential approach performed best, achieving the highest PDs at almost all FARs. This result is similar to that found in the synthetic data experiment, save the situation illustrated in Figure 4-3D. The evidential approach provides a good contextual model, as the inclusion functional provides an intuitive model for shape characterization.

The probabilistic approach performed well in the synthetic data experiment, balancing shape characterization and robustness. However, its results in high dimensional data were inconsistent. Providing enough samples using the uniform sampling method in high dimensions was not practical, and using the analytical integral provided better results. However, the Gaussian assumption limited its shape characterization, which influenced its classification results.

Upper and Lower Bounding Experiment

In this experiment we compared the proposed possibilistic context-based classifier to a standard Bayesian classifier, a non-context-based classification method. We also compared the results of the possibilistic classifier to results from a context-based oracle classifier that always chooses the correct context. This comparison provides an idea of an upper bound and a lower bound for the proposed method, where the standard classifier is a lower bound since it makes no use of contextual information and the oracle classifier is the upper bound since it makes the best use of contextual information.

Experimental Design

The experimental setup was similar to that in the hyperspectral experiment.
Eight training sets were constructed, each representing a set of samples, both target and non-target, observed in some context. Each training set consists of samples collected from an image, where eight contexts are modeled using samples from the eight distinct images. Note that in this training set there are 20 samples from each class in each context. Also, in this experiment more spectral bands are used, that is, each image contains 40 spectral bands after trimming and binning, ranging over LWIR wavelengths 7.88 µm to 9.92 µm. Again, the possibilistic classifier is equipped with a mixture of Gaussians for sample classification. The oracle uses the same classifiers as the possibilistic approach; however, it always chooses the correct Gaussian mixture.

Classification results of the standard Bayesian classifier are compared to those of the possibilistic RSF classifier. The hypothesis is that both classifiers can account for non-disguising transformations; however, a standard Bayesian classifier cannot account for disguising transformations, whereas the possibilistic classifier can.

The number of mixture components used in the standard Bayesian classifier is varied to illustrate how its ability to classify in the presence of non-disguising transformations relates to the number of mixture components. The hypothesis is that as the number of components increases, the results should improve since it will be better equipped to handle non-disguising transformations. However, regardless of the number of mixture components, the standard Bayesian classifier cannot handle disguising transformations, and its results should not best those of the possibilistic classifier, assuming context estimation is performed correctly.

The possibilistic classifier was equipped with two mixture components per class per context, for a total of 56 components, since for each test set there were seven contexts available, each with four classes, each containing two mixture components.
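The component count above can be checked with one line of arithmetic, shown next to the count for the largest standard Bayesian classifier considered in this experiment (14 components per class). The symbols $C$ (contexts available per test set), $K$ (classes), $G$ (components per class per context) and $G'$ (components per class in the standard classifier) are introduced here only for illustration, and applying the same four classes to the standard classifier is an assumption.

```latex
\[
  \underbrace{C \cdot K \cdot G}_{\text{possibilistic RSF}} = 7 \cdot 4 \cdot 2 = 56,
  \qquad
  \underbrace{K \cdot G'}_{\text{standard Bayesian}} = 4 \cdot 14 = 56 .
\]
```

Under that assumption, the 14-component standard classifier and the possibilistic RSF classifier have the same overall number of mixture components, so differences between them are not explained by model size alone.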
We compared the results to those of a standard Bayesian classifier with three, seven and 14 mixture components per class.

For comparison to the upper bound, the testing procedure remained the same, except that the classifier trained on the test image was available to the classifiers during testing; therefore, cross validation was no longer performed. The results of the possibilistic classifier were compared to those of the oracle classifier. The oracle classifier is equipped with similar Gaussian mixtures as the possibilistic RSF classifier; however, it always uses the Gaussian mixture that was trained on the test image. The results of this classifier can be seen as an upper bound of the classification results within this framework. Therefore, it provides a means to assess the ability of the context estimation methodology used in the RSF classifier, namely the optimistic germ and grain model.

Results

The use of possibilistic context estimation within the RSF significantly improved classification results. Probability of detection is improved at all FARs, by as much as 10 percentage points. False alarm rates are decreased at all PDs and are reduced by as much as 50% at FARs of 4x10⁻³ FA/m² through 8x10⁻³ FA/m².

Classification results of the standard Bayesian classifier became better as the number of mixture components increased. The increase of mixture components equipped the standard classifier with the ability to account for non-disguising transformations. When the number of components was less than the number of contexts, the standard classifier performed poorly. This is expected as it could not account for all of the non-disguising transformations. However, its performance improved as the number of mixture components became greater than or equal to the number of contexts.
The results also indicate that the RSF classifier was able to account for disguising transformations, with an improvement in classification when compared to the standard classifier with the same number of overall mixture components.

The RSF Bayesian classifier performed similarly to the oracle RSF Bayesian classifier, indicating that using the random set framework is an excellent method for context estimation. In fact, the RSF Bayesian classifier using the germ and grain model weighted the context which was chosen by the oracle as the most likely context 66% of the time, and furthermore, weighted that context as one of the two most likely contexts 86% of the time. However, we note there is room for improvement, which can be noticed at low FARs.

Figure 4-1. Illustration of data sets one, two, and three. A) Samples from a distinct context are shown in distinct colors. Distinct class is shown using a distinct symbol. This is the easiest data set since each context is fairly separable. B) In data set 2, context 3 is overlapped highly by both context 1 and context 2. C) In data set 3, context 1 is completely overlapped by context 2 and context 3.

Table 4-1. Average inference error for each dataset using 15 test and 15 train samples.

KL Estimation             Data Set 1   Data Set 2   Data Set 3
Riemann Test              .0094        .0390        .0522
Riemann Test and Train    .0114        .0638        .0642
Naïve Test                .0750        .0870        .0722
Naïve Test and Train      .0128        .0639        .0683
Uniform MCMC              .0094        .0562        .0581

Figure 4-2. Error analysis of the Riemann and uniform approximation methods with respect to time and number of observation samples. A) Plot of context misclassification rate versus the number of samples in the observed set. B) Plot of runtime versus the number of samples in the observation set. C) Close-up of the plot of runtime versus number of observation samples for the uniform approximation method.

Figure 4-3.
Trials using data sets 1, 2, 3 and 4 in the synthetic data experiment. A) Illustration of a trial on data set 1 from the synthetic data experiment where the possibilistic model fails to correctly identify context. Here the germ from context 3 is indicated with a black *. Note there is a sample from the test set, indicated by a black x, which lies very near to the grain. This increases the probability of context 3. B) Trial example of data set 2 in the synthetic data experiment. Samples from either class are oriented relatively the same in each of the 3 contexts. C) Trial example of data set 3 in the synthetic data experiment. Each test set in each of the 50 trials has two outlying samples at [0, 5]. D) Trial example of data set 4 in the synthetic data experiment. Each test set has 6 outliers located near [5, 3.5].

Table 4-2. Average classification error of the listed context-based classifiers on four data sets used in the Synthetic Data Experiments.

Context Classifiers    Data Set 1   Data Set 2   Data Set 3   Data Set 4
Evidential Model       .0413        .0273        .2073        .2500
Probabilistic Model    .0427        .0280        .0667        .2562
Possibilistic Model    .0647        .0480        .0693        .0542
Set-Based kNN          .0560        .0373        .2647        .2520
Whiten/De-Whiten       .0993        .0220        .1033        .0792

Table 4-3. How classification varies with respect to the number of germ and grain pairs for data set 3 (with no outlying samples) in the Synthetic Data Experiment.

Context Classifiers    1 Pair/Context   2 Pair/Context   3 Pair/Context   4 Pair/Context
Evidential Model       .0447            .0487            .0367            .0373
Probabilistic Model    .0453            .0473            .0400            .0336
Possibilistic Model    .0553            .0460            .0453            .0460

Figure 4-4. ROC curve for The Hyperspectral Data Experiment. Note the dashed plot shows the results from the probabilistic context-based classifier using the analytical solution for KL estimation as discussed in Equation 3-40.

Figure 4-5.
Hyperspectral Experiment ROC curve of PD versus PFA for the possibilistic, evidential, probabilistic, set-based kNN, and whiten / dewhiten approaches. Error bars show the 95% confidence range assuming each encounter is a binomial experiment. For this reason, PDs are normalized to include only targets that were observed by the algorithms under test, and do not include targets missed by the prescreener.

Figure 4-6. Example of a false alarm POI from The Hyperspectral Data Experiment. A snippet of the original AHI image at wavelength 8.9 µm is shown in the upper left where the prescreener alarmed. The second row shows the confidence images of the set-based kNN, possibilistic, probabilistic, and evidential approaches, from left to right. Their contextual estimates of the potential seven contexts are shown in the bar chart in the upper right. Note there are seven potential contexts and not eight since we are performing cross validation at the image level. Under the confidence images, in the bottom row, are the spectral plots of the test population, shown in blue. Also shown in these plots are the spectra used to create the contextual models of the context that the corresponding approach selected as most probable. These training spectra are color coded by class, where red, green, and yellow correspond to target, soil types, and vegetation, respectively. Note in this example set-based kNN submits a marginal confidence, due to its use of a nearest neighbor based classifier and choice of context 3. The probabilistic and evidential approaches select context 3 as well, but their classifier makes use of covariance, which allows for correct classification. The possibilistic approach chose a context which correctly identifies the spectra as soil.

Figure 4-7. Example of a target alarm POI from The Hyperspectral Data Experiment. A snippet of the original AHI image at wavelength 8.9 µm is shown in the upper left where the prescreener alarmed.
The red circle indicates that this is a target. The second row shows the confidence images of the set-based kNN, possibilistic, probabilistic, and evidential approaches, from left to right. Their contextual estimates of the potential seven contexts are shown in the bar chart in the upper right. Under the confidence images, in the bottom row, are the spectral plots of the test population, shown in blue. Also shown in these plots are the spectra used to create the contextual models of the context that the corresponding approach selected as most probable. These training spectra are color coded by class, where red, green, and yellow correspond to target, soil types, and vegetation, respectively. Note in this example each algorithm correctly identifies this POI as a target. Note the possibilistic approach has selected context 4 as most probable, whereas the other three methods selected context 3. In this instance, the choice between context 3 and 4 does not change the classification results since the test spectra are similar to the target prototypes in both contexts.

Figure 4-8. Example of a target alarm POI from The Hyperspectral Data Experiment. A snippet of the original AHI image at wavelength 8.9 µm is shown in the upper left where the prescreener alarmed. The red circle indicates that this is a target. The second row shows the confidence images of the set-based kNN, possibilistic, probabilistic, and evidential approaches, from left to right. Their contextual estimates of the potential seven contexts are shown in the bar chart in the upper right. Note there are seven potential contexts and not eight since we are performing cross validation at the image level. Under the confidence images, in the bottom row, are the spectral plots of the test population, shown in blue. Also shown in these plots are the spectra used to create the contextual models of the context that the corresponding approach selected as most probable.
These training spectra are color coded by class, where red, green, and yellow correspond to target, soil types, and vegetation, respectively. Note in this example the possibilistic approach selects context 4, which results in incorrect classification. Also note that the evidential approach partially weights context 4, thus its confidence is not as high as that of set-based kNN and the probabilistic approach.

Figure 4-9. Example of a false alarm POI from The Hyperspectral Data Experiment. A snippet of the original AHI image at wavelength 8.9 µm is shown in the upper left where the prescreener alarmed. The red circle indicates that this is a target. The second row shows the confidence images of the set-based kNN, possibilistic, probabilistic, and evidential approaches, from left to right. Their contextual estimates of the potential seven contexts are shown in the bar chart in the upper right. Under the confidence images, in the bottom row, are the spectral plots of the test population, shown in blue. Also shown in these plots are the spectra used to create the contextual models of the context that the corresponding approach selected as most probable. These training spectra are color coded by class, where red, green, and yellow correspond to target, soil types, and vegetation, respectively. Note in this example set-based kNN submits a marginal confidence rather than a high confidence due to its selection of context 3. Note the population spectra for the set-based kNN selection fall in between the prototypes for the target and vegetation classes, providing for a marginal confidence. The other 3 classifiers selected context 6, which provides for correct classification. Note samples from the target class in context 6 are extremely similar to the test samples, indicating a correct selection.

Figure 4-10. Example of a false alarm POI from The Hyperspectral Data Experiment.
A snippet of the original AHI image at wavelength 8.9 µm is shown in the upper left where the prescreener alarmed. The second row shows the confidence images of the set-based kNN, possibilistic, probabilistic, and evidential approaches, from left to right. Their contextual estimates of the potential seven contexts are shown in the bar chart in the upper right. Under the confidence images, in the bottom row, are the spectral plots of the test population, shown in blue. Also shown in these plots are the spectra used to create the contextual models of the context that the corresponding approach selected as most probable. These training spectra are color coded by class, where red, green, and yellow correspond to target, soil types, and vegetation, respectively. Note in this example set-based kNN submits high confidence due to its selection of context 1. Note the probabilistic and evidential approaches submit marginal confidences as they selected context 1, but their confidence is only marginal since they only partially selected context 1. Note the possibilistic approach selected context 4, and was able to correctly classify this POI as a false alarm.

Figure 4-11. Detection results for the possibilistic RSF classifier and results for standard Gaussian mixture classifiers equipped with variable numbers of mixture components.

Figure 4-12. Non-cross-validation detection results for the possibilistic RSF classifier and the oracle classifier.

CHAPTER 5
CONCLUSIONS

We developed a generalized framework for context-based classification using the theory of random sets. The resulting context-based classifier estimates the posterior of a sample using the sample and a set, its population. Contextual transformations are identified by population analysis, and the resulting contextual estimate provides an appropriate weight of relevance to context-specific classifiers.
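The role of the contextual estimate as a relevance weight can be sketched as a weighted blend of context-specific class posteriors. This is a simplified illustration rather than the estimator derived in the earlier chapters: the function name `context_weighted_posterior` is hypothetical, the context weights are assumed to have been estimated already from the population, and the numbers in the demo are made up.

```python
import numpy as np


def context_weighted_posterior(class_post_by_context, context_weights):
    """Blend per-context class posteriors using context relevance weights.

    class_post_by_context: shape (C, K); row c holds P(class | sample, context c).
    context_weights: shape (C,); non-negative relevance of each context,
        estimated from the whole population and normalised here to sum to one.
    """
    post = np.asarray(class_post_by_context, dtype=float)
    w = np.asarray(context_weights, dtype=float)
    if post.ndim != 2 or w.shape != (post.shape[0],):
        raise ValueError("expected (C, K) posteriors and (C,) weights")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    w = w / w.sum()
    return w @ post


# Hypothetical disguising case: the same sample looks like a target in
# context 0 and like background in context 1 (columns: target, non-target).
per_context = [[0.9, 0.1],
               [0.2, 0.8]]
blended = context_weighted_posterior(per_context, [3.0, 1.0])
print(blended)
```

In the demo the population evidence favours context 0, so the blended posterior leans toward the target class even though the context 1 classifier disagrees. A classifier without a contextual estimate has no comparable way to resolve that disagreement.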
The random set framework provides the tools necessary to perform classification in the presence of contextual factors. Furthermore, it has the ability to contend with disguising transformations, which is not the case for standard classification procedures. Experimental results have shown the random set models' abilities to correctly identify context in various situations, and have shown applicability to real-world problems, improving classification results over state-of-the-art classifiers: set-based kNN and the whiten / dewhiten transform.

In the synthetic experiments, pros and cons of each approach were highlighted. The possibilistic approach was shown to be a robust classifier, resilient to outliers, but at the cost of optimism. The evidential approach has the ability to characterize shape, but at the cost of robustness. The probabilistic approach balanced these two pros and cons, allowing for some characterization of shape and some resilience. Each of these RSF classifiers was superior to set-based kNN, which is not resilient to outliers, but provides an intuitive, nearest neighbor, set comparison procedure. The whiten / dewhiten transform assumes a consistent orientation of target subspaces with respect to background subspaces and, given this assumption, provides a whitening solution. This approach can be considered a context-based method, but makes strict assumptions which the other methods do not. Therefore, the whiten / dewhiten transform performs well when said assumptions are true and performs poorly when they are not.

Each of the methods under test was able to minimally detect a minefield using an extensive hyperspectral dataset. The evidential and possibilistic methods performed best due to their resilience and shape characterization capabilities, reducing FARs by up to 25% over set-based kNN.
The probabilistic model suffers partially due to its attempt to construct a representative measure given a low number of samples and high dimensionality. Set-based kNN was bested by the RSF classifiers due to its lack of ability to assign gray-level weights of contextual relevance. The whiten / dewhiten transform performs worst, indicating that, although some of the contextual transformations can be mitigated through the use of whitening, all of them could not. In the final experiment, the possibilistic approach performs similarly to its upper bound, and outperformed a similar classifier that made no use of contextual information. This indicates that the possibilistic approach makes good use of contextual information, which translates to improved classification results.

Each algorithm has different computational complexity. Although set-based kNN does not require training, the set-based comparison provides for a testing computation time bounded by $O(p\,d\,T\,N^{2})$, where N is the bounding number of samples in a training or test set, d is the dimensionality of the samples, p is the number of testing populations, and T is the number of training sets. Note that for each population p we must calculate the pairwise distances between the test set and all T training sets. The RSF classifiers, in contrast, require a training period, but testing computation time is bounded by $O(p\,c\,d\,N + m\,d^{3})$, where c is determined by the fixed number of constructs, such as a germ and grain pair or a likelihood function, used to model C contexts and m is the number of constructs needed to model M classifiers. For each population, we must compare each sample to each contextual construct. Note that for each Bayesian classifier we must invert a covariance matrix that is $d \times d$; however, the use of a Gaussian classifier is not necessary within the RSF framework.
The whiten / dewhiten transform has a training period, and requires extensive testing computation time bounded by ) (3NmpmdO Note, for each population we must calculate and invert a covariance matrix.

Future work will include extensive investigation of the optimization and experimental methods used by the RSF classifiers. An example of research in optimization strategies would include the investigation of the use of EM for unsupervised learning of contexts within the hyperspectral data set. This could provide for interesting findings of sub-contexts, or subpopulations, within each image. Examples of future research in experimental methods would be performing experiments where the size of the populations varied. Larger populations may provide for a better estimate of context.

Extended research may include the development of a non-additive random measure. This development may provide the capability to characterize complex relationships between sets of samples, similar to a belief function. We also note that during the development of the representative function, it was determined that the point-wise average of the observed measures minimized the KL divergence between the representative function and the observed measures. This may provide for an interesting development of posterior estimation, and a relation to variational methods.

Future work should include the application of the RSF classifiers to unexploded ordnance (UXO) datasets. These data sets are subject to problems similar to those faced in remote sensing data, including contextual factors. The use of contextual estimation should improve classification, or target identification, for these applications as well.

LIST OF REFERENCES

[1] I. Molchanov, Probability and Its Applications: Theory of Random Sets, London: Springer-Verlag, 2005.
[2] J. Goutsias, R. Mahler, and H. Nguyen, Random Sets: Theory and Applications, New York: Springer-Verlag, 1997.
[3] T.
Norberg, Convergence and Existence of Random Set Distributions, Annals of Probability, Vol. 12, No. 3, pp. 726-732, 1983.
[4] D. Stoyan, W. Kendall, and J. Mecke, Stochastic Geometry and Its Applications: Second Edition, West Sussex: John Wiley & Sons, 1995.
[5] D. Stoyan, Random Sets: Models and Statistics, International Statistical Review, Vol. 66, No. 1, pp. 1-27, 1998.
[6] N. Cressie and G. Laslett, Random Set Theory and Problems of Modeling, SIAM Review, Vol. 29, No. 4, pp. 557-574, December 1987.
[7] D. Hug, G. Last and W. Weil, A Survey on Contact Distributions, Morphology of Condensed Matter: Physics and Geometry of Spatially Complex Systems, Springer, 2002.
[8] G. Shafer, A Mathematical Theory of Evidence, Princeton: Princeton University Press, 1976.
[9] M. Capinski and E. Kopp, Measure, Integral and Probability, New York: Springer, 1999.
[10] J. Munkres, Topology: Second Edition, Upper Saddle River: Prentice Hall, 2000.
[11] V. Vapnik, Statistical Learning Theory, New York: John Wiley & Sons, 1998.
[12] E. Dougherty and M. Brun, A Probabilistic Theory of Clustering, Pattern Recognition, Vol. 37, pp. 917-925, 2004.
[13] E. Dougherty, J. Barrera, M. Brun, S. Kim, R. Cesar, Y. Chen, M. Bittner, and J. Trent, Inference from Clustering with Application to Gene-Expression Microarrays, Journal of Computational Biology, Vol. 9, No. 1, pp. 105-126, 2002.
[14] M. Brun, C. Sima, J. Hua, J. Lowey, B. Carroll, E. Suh, and E. Dougherty, Model-Based Evaluation of Clustering Validation Measures, Pattern Recognition, Vol. 40, pp. 807-824, 2007.
[15] J. Bolton and P. Gader, Application of Set-Based Clustering to Landmine Detection with Hyperspectral Imagery, IEEE Proceedings Geoscience and Remote Sensing, Barcelona, July 2007.
[16] J. Bolton and P. Gader, Random Set Model for Context-Based Classification, IEEE World Congress on Computational Intelligence FUZZ, Hong Kong, June 2008 (Accepted).
[17] J. Bolton and P.
Gader, Applic ation of Context-Based Classifi er to Hyperspectral Imagery for Mine Detection, SPIE Defense and Security Orlando, March 2008 (Accepted). [18] Tsymbal A, The problem of concept drift: definitions and related work, Technical Report TCD-CS-2004-15, Department of Computer Science, Trinity College Dublin, Ireland, 2004. [19] M. Bentley, Environment and Context, The American Journal of Psychology Vol. 39, No. pp. 52-61, December 1927. [20] R. Rescoria, Probability of Shock in the Presence an d Absence of CS in Fear Conditioning, Journal of Comparative and Physiological Psychology 66, pp.1-5, 1968. [21] A. Tsymbal, M. Pechenizkiy, P. Cunningham, and S. Puuronen, Handling Local Concept Drift with Dynamic Integration of Classifi ers: Domain of Antibiotic Resistance in Nosocomial Infections, Proceedings of IEEE Symposiu m Computer-based Medical Systems, 2006. [22] M. Salganicoff, Density Adaptive Learning and Forgetting Technical Report No. IRCS93-50, University of Pennsylvania Institute for Research in Cognitive Science, 1993. [23] D. Widyantoro and J. Yen, Relevant Data Expansion for Learning Concept Drift from Sparsely Labeled Data, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 3, pp. 401-412, March 2005. [24] J. Schlimmer and R. Granger Jr., Inc remental Learning from Noisy Data, Machine Learning, Vol. 1, pp. 317-354, 1986. [25] G. Widmer, Learning in the Presence of Concept Drift and Hidden Contexts, Machine Learning. Vol. 23, pp. 69-101, 1996. [26] M. Maloof and R. Michalski, Learning Evolving Concepts Using Partial-Memory Approach, Working Note AAAI Fall Symposium on Active Learning, Boston, pp. 70-73, November 1995. [27] M. Maloof and R. Michalski, Selecting Examples for Partial Memory Learning, Machine Learning, Vol. 41, pp. 27-52, 2000. [28] R. Klinkenberg, Using Labeled and Unlabe led Data to Learn Drifting Concepts, Workshop Notes on Learning from Temporal and Spatial Data Menlo Park, pp. 16-24, 2001. [29] R. Klinkenberg and T. 
Joachims, Detec ting Concept Drift with Support Vector Machines, Proceedings of the 17th Intl. Conf. on Machine Learning PP. 487-494, July 2000. [30] R. Schapire, The Strength of Weak Learnability, Machine Learning, Vol. 5, pp. 197-227, 1990. PAGE 121 121 [31] R. Schapire, Y. Feund, P. Bartlett, and W. Lee, Boosting the Margin: A new Explanation for the Effectiveness of Voting Methods, Annals of Statistics Vol. 26, No. 5, pp. 16511686. [32] R. Schapire, The Boosting Approach to Machine Learning: An Overview, MSRI Workshop on Nonlinear Estimation and Classification, 2002. [33] Freund, Y. and R. Schapire, A Decision-Theoretic Genera lization of On-Line Learning and an Application to Boosting Proceedings of Computational Learning Theory: Second European Conference Barcelona, 2005. [34] E. Bauer and R. Kohavi, An Empirical Comp arison of Voting Classification Algorithms: Bagging, Boosting, and Variants, Machine Learning Vol. 36, pp. 105-139, 1999. [35] E. Dura, Y. Zhang, X. Liao, G. Dobeck, and L. Carin, Active Learning for Detection of Mine-Like Objects in Side-Scan Sonar Imagery, IEEE Journal of Oceanic Engineering Vol. 30, No. 2, April 2005. [36] Q. Liu, X. Liao, and L. Carin, Detecti on of Unexploded Ordnance via Efficient SemiSupervised and Active Learning, IEEE Transactions Geoscience and Remote Sensing (submitted). [37] Q. Lui, X. Liao, and L. Carin, Semi-Supervised Multi Task Learning, Neural Information and Processing Systems (NIPS), 2007. [38] Y. Zhang, X. Liao, and L. Carin, Detecti on of Buried Targets Vi a Active Selection of Labeled Data: Application to Sensing Subsurface UXO, IEEE Transactions Geoscience and Remote Sensing Vol. 42, No. 11, November 2004. [39] A. Tsymbal and S. Puuronen, Bagging an d Boosting with Dynamic Integration of Classifiers, Proc. Principles of Data Mining and Knowledge Discovery PKDD, 2000. [40] M. Skurichina and R. 
Duin, Bagging, Boosting and the Random Subspace Method for Linear Classifiers, Pattern Analysis and Applications Vol. 5, pp. 121-135, 2002. [41] N. Rooney, D. Patterson, S. Anand and A. Ty smbal, Dynamic Integration of Regression Models, Proceedings 5th Annual Multiple Classifier Systems Workshop Cagliari, Italy, 2004. [42] N. Rooney, D. Patterson, A. Tsymbal and S. Anand, Random Subspacing for Regression Ensembles, Proceedings 17th Intl. Florida Artificial Intelligence Research Society 2004. [43] L. Breiman, Bagging Predictors, Machine Learning Vol. 24, pp. 123-140, 1996. [44] A. Tsymbal, M. Pechenizkiy, and Padraig Cunningham, Dynamic Integration with Random Forests, Machine Learning : EMCL LNAI 4212, pp. 801-808, 2006. [45] R. Schapire, Random Forests, Machine Learning Vol. 45, pp. 5-32, 2001. PAGE 122 122 [46] T. Ho, The Random Subspace Method for Constructing Decision Forests IEEE Transactions PAMI Vol. 20, No. 8, pp. 832-844. [47] T. Ho, Random Decision Forest, Proceedings 3rd Intl. Conf. on Document Analysis and Recognition pp. 278-282, Montreal, August 1995. [48] L. Kuncheva, Switching Between Selecti on and Fusion in Combining Classifiers, IEEE Transactions Systems, Man, Cybernetics Vol. 32 No. 2, April 2002. [49] A. Santana, R. Soares, Anne Canuto, and M. Souto, A Dynamic Classifier Selection Method to Build Ensembles Usi ng Accuracy and Diversity, IEEE Proc. 9th Annual Symp. On Neural Networks Brazil, 2006. [50] E. Santos, R. Sabourin, and P. Maupin, Si ngle and Multi-Objective Genetic Algorithms for the Selection of Ense mble of Classifiers, Proceedings Intl. Joint Conf. on Neural Networks Vancouver, July 2006. [51] F. Destempes, J. Angers and M. Mignot te, Fusion of Hidden Markov Random Field Models and Its Bayesian Estimation, IEEE Transactions Image Processing, Vol. 15, No. 10, October 2006. [52] H. Frigui, L. Zhang, P. Gader, D. Ho, C ontext-Dependent Fusion for Landmine Detection with Ground-Penetrating Radar, Proceedings of SPIE Orlando, 2007. 
[53] R. Cossu, S. Chaudhuri and L. Bruzzone, A C ontext-Sensitive Bayesian Technique for the Partially Supervised Classification of Multitemporal Images, IEEE Transactions Geoscience and Remote Sensing Vol. 2, No. 3, July 2005. [54] I. Taha and J. Ghosh, Symbolic Interp retation of Artificial Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 3, May 1999. [55] Y. Qi and R. Picard, Context-Sensitive Baye sian Classifiers and Application to Mouse Pressure Pattern Classification Proceedings on IEEE Pattern Recognition Vol. 3, pp. 448451, August 2002. [56] T. Minka, A Family of Algorithms for Appr oximate Bayesian Inference, Dissertation Submitted to Department of Electrical Engineering and Computer Science MIT, 2001. [57] M. Harries and C. Sammut, E xtracting Hidden Context, Machine Learning Vol. 32, pp. 101-126, 1998. [58] M. Harries and K. Horn, Learning Stable C oncepts in Domains with Hidden Changes in Context, Proceedings 13th ICML, Workshop on Le arning in Context Sensitive Domains 1996. [59] A. Berk, L. Bernstein, and D. Robertson, MODTRAN: Moderate Resolution Model for LOWTRAN 7. Rep. AFGL-TR-83-0187, 261, [Available from Airforce Geophysical Laboratory, Hanscom Air Force Base, MA 01731], 1983. PAGE 123 123 [60] A. Berk et. al. MODTRAN4 Radiative Transfer Modeling for Atmosphere Correction, Proceedings of SPIE Optical Spectroscopic Techniques and Instrumentation for Atmospheric and Space Research Vol. 3756, July 1999. [61] J. Broadwater and R. Challappa, Hybr id Detectors for Subpixel Targets, IEEE Transactions Pattern Analysis and Machine Intelligence Vol. 29, No. 11, pp. 1891-1903, November 2007. [62] G. Healy and D. Slater, Models and Methods for Automated Materi al Identification in Hyperspectral Imagery Acquired Under Unknown Illumination and Atmospheric Conditions, IEEE Transactions Geoscience and Remote Sensing, Vol. 37, No. 6, November 1999. [63] C. Kuan and G. 
Healey, Modeling distribution changes for hyperspectral image analysis, Optical Engineering Vol. 46, No. 11, 117201, pp. 1-8, November 2007. [64] P. Fuehrer, G. Healey, B. Rauch and D. Sl ater, Atmospheric Radiance Interpolation for the Modeling of Hyperspectral Data, Proceedings. of SPIE Algorithms and Technologies ofr Multispectral, Hyperspectral, and Ultraspectral Imagery XIV Vol. 6966, No. 69661O1, pp. 1-12. [65] R. Mayer, F. Bucholtz and D. Sc ribner, Object Detection by Using Whitening/Dewhitening to Transform Target Signatures in Multitemporal Hyperspectral and Multispectral Imagery, IEEE Transactions. Geoscience and Remote Sensing Vol. 41, No. 5, pp. 1136-1142, May 2003. [66] S. Rajan, J. Ghosh and M. Crawford, An Acti ve Learning Approach to Hyperspectral Data Classification, IEEE Transactions Geoscience and Remote Sensing Vol. 46, No. 4, April 2008. [67] S. Rajan, J. Ghosh and M. Crawford, An Acti ve Learning Approach to Hyperspectral Data Classification, IEEE Transactions. Geoscience and Remote Sensing, Vol. 46, No. 4, pp. 1231-1242, April 2008. [68] R. Xu and D. Wunsch, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, No. 3, Vol. 16, pp. 645-678, May 2005. [69] H. Luo, F. Kong, K. Zhang, and L. He, A Clustering Algorithm Based on Mathematical Morphology, IEEE Proceedings Intelligent Control and Automation, Dalian, pp. 60646067, June 2006. [70] Y. Rubner, L. Guibas, and C. Tomasi, A Metr ic for Distributions with Applications to Image Databases, Proceedings International Conference on Computer Vision, pp. 59-66, Bombay, 1998. [71] Y. Rubner, L. Guibas, and C. Tomasi, The Earth Movers Distance, Multi-Dimensional Scaling, and Color-Based Image Retrieval, Proceedings ARPA Image Understanding Workshop, pp. 661-668, May 1997. PAGE 124 124 [72] H. Houissa, N. Boujemaa and H. Frigui, Adapt ive Visual Regions Category with Sets of Point of Interest, Lecture Notes in Computer Science Vol. 4261, pp. 485-493, 2006. [73] S. Theodoridis and K. 
Koutroumbas, Pattern Recognition: Second Edition, San Diego: Elsevier, 2003. [74] X. Descombes and J. Zerubia, Marked Point Process in Image Analysis, IEEE Signal Processing Magazine pp. 77-84, September 2002. [75] A. Baddeley and J. Moller, Nearest-Neighbor Markov Point Processes and Random Sets, International Statistical Review Vol. 57, No. 2, pp. 89-121, August 1989. [76] J. M. Billiot, J. F. Coeurjolly, and R Dr ouilet, Maximum Pseudo-Likelihood Estimator for Nearest-Neighbors Gibbs Point Processes, arXiv:math/0601065v1 January 2006. [77] J. Gubner and W. B. Chang, Nonparametric Estimation of Interac tion Functions for TwoType Pairwise Interaction Point Processes, Proceedings IEEE Acoustics, Speech, and Signal Processing Vol. 6, pp. 3981-3984, May 2001. [78] P. Fishman and D. Snyder, The Statistical Analysis of Space-Time Point Processes, IEEE Transactions Information Theory, Vol. 22, No. 3, May 1976. [79] R. Stoica, X. Descombes, and J. Zerubia, Road Extraction in Remote Sensed Images Using Stochastic Geometry Framework, Proceedings Intl. Workshop Bayesian Inference and Maximum Entropy Methods France, 2000. [80] R. Stoica, X. Descombes and J. Zerubia, A Gibbs Point Process for Road Extraction from Remotely Sensed Images, Intl. Journal of Computer Vision Vol. 57, No. 2, pp. 121-136, 2004. [81] X. Descombes and J. Zerubia, Marked Point Process in Image Analysis, IEEE Magazine Signal Processing pp. 77-84, September 2002. [82] C. Lacoste, X. Descombes and J. Zerubia, P oint Processes for Uns upervised Line Network Extraction in Remote Sensing, IEEE Transactions PAMI Vol. 27, No. 10, pp. 1568-1579, October 2005. [83] M. Ortner, X. Descombes and J. Zerbia, Point Processes of Segments and Rectangles for Building Extraction from Digital Elevation Models, Proceedings IEEE ICASSP 2006. [84] L. Linnett, D. Carmichael and S. Clarke, T exture Classification Using a Spatial Pont Process Model, Proceedings IEEE Vis. Image Signal Processing Vol. 142, No. 1, February 1995. 
[85] D. Savery and G. Cloutier, Monte Carlo Simulation of Ultrasound Backscattering by Aggregating Red Blood Cells, Proceedings IEEE Ultrasonics Symposium 2001. PAGE 125 125 [86] B.-H. Juang, W. Chou, and C.-H. Lee, Min imum classification error rate methods for speech recognition, IEEE Transactions Speech Audio Process. Vol. 5, pp. 257, May 1997. [87] S. Katagiri, B.-H. Juang, and C.-H. Lee, P attern recognition using a family of design algorithms based upon the generalized probabilistic descent method, Proceedings IEEE, Vol. 86, pp. 2345, November 1998. [88] M. G. Rahim, B.-H. Juang, and C.-H. Lee, Discriminative utterance verification for connected digit recognition, IEEE Transactions Speech Audio Process. Vol. 5, pp. 266 277, May 1997. [89] A. Ergun, R. Barbieri, U. Eden, M. Wilson and E. Brown, Construction of Point Processes Adaptive Filter Algorithms for Neural System s Using Sequential Monte Carlo Methods, IEEE Transactions Biomedical Engineering Vol. 54, Mo. 3, March 2007. [90] B. Picinbono, Time Intervals an d Counting in Point Processes IEEE Transactions Information Theory Vol. 50, No. 6, pp. 1336-1340, June 2004. [91] V. Solo, High Dimensional Point Process Sy stem Identification: PCA and Dynamic Index Models, Proceedings of 45th IEEE Conf. Decision and Control San Diego, 2006. [92] R. Sunaresan and S. Verdu, Capacity of Queues via Point Process Channels, IEEE Transactions Information Theory Vol. 52, No. 6, June 2006. [93] J. Gubner and W. Chang, Nonparametric Es timation of Interaction Functions for TwoType pairwise Interaction Point Processes, Proceedings IEEE ICASSP 2001. [94] P. Diggle, T. Fiksel, P. Grabarnik, Y. Ogata, D. Stoyan and M. Tanemura, On Parameter Estimation for Pairwise Interaction Point Processes, International Statistical Review, Vol. 62, No. 1, pp. 99-117, April 1994. [95] Y. Ogata and M. Tanemura, Likelihood Analysis of Spatial Point Processes, Journal of the Royal Statistical Society, Series B Vol. 46, No. 3, pp. 496-518, 1984. 
[96] V. Isham, Multiple Markov Point Processes: Some Approximations, Proceedings of the Royal Society of London, Series A Vol. 391, No.1800, pp. 39-53, January 1984. [97] C. Bishop, Pattern Recognition and Machine Learning : Springer. New York, 2006. [98] J. Chen, J. Hershey, P. Olsen and E. Ya shin, Accelerated Monte Carlo for KullbackLeibler Divergence between Gaussian Mixture Models, Proceedings IEEE Intl. Conference on Acoustics, Speech and Signal Processing pp. 4553 4556, 2008. [99] X. Yu, I. S. Reed, and A. D. Stocker. "C omparative Performance Analysis of Adaptive Multispectral Detectors," IEEE Transactions Signal Processing Vol. 41, No. 8, August 1993, pp. 2639-2656. PAGE 126 126 [100] W. Radzelovage and G. Maksymonko, Lesso ns Learned from a Multi-Mode Infrared Airborne Minefield Detection System, Proceedings of the Infrared Information Symposia 3rd NATO-IRIS Joint Symposium, Vol. 43 No.3, July 1999, pp. 343-364. [101] P. Lucey, T. Williams, M. Winter and E. Wint er, Two Years of Operations of AHI: an LWIR Hyperspectral Imagery, Proceedings SPIE Vol. 4030, pp. 31-40, 2003. [102] P. Lucey, T. Williams, M. Mignard, J. Jullian D. Kobubon, G. Allen, D. Hampton, W. Schaff, M. Schlangen, E. Wint er, W. Kendall, A. Stocker, K. Horton, A. Bowman, AHI: an Airborne Long-Wave Infrar ed Hyperspectral Imager, Proceedings SPIE Vol. 3431, pp. 36-43, 1998. PAGE 127 BIOGRAPHICAL SKETCH Jerem y Bolton received the Bachelor of Scie nce degree in computer engineering from the University of Florida, Gainesville, in May 2003. He received his Master of Engineering and Doctor of Philosophy from the Universi ty of Florida in December 2008. Currently, he is a research assistant in the Computational Science and Intelligence Lab in the Computer and Information Sciences and En gineering Department at the University of Florida. 
Research includes the development of algorithms, methodologies, and models with a solid mathematical and/or statistical base with applications to landmine detection. Current and previous research applies these methods to a variety of data including hyperspectral, multispectral, radar, and infrared. Jeremy Bolton is a member of the IEEE Computational Intelligence Society, the IEEE Geoscience and Remote Sensing Society, and the Society of Photographic Instrumentation Engineers. |