USING DOMAIN-SPECIFIC KNOWLEDGE IN SUPPORT VECTOR MACHINES

By

ENES ERYARSOY

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2005
Copyright 2005 by Enes Eryarsoy
to my beloved wife and both of our families…
ACKNOWLEDGMENTS

It is a rarity that an author can claim the entire work for a dissertation. I would like to thank several great people who made me feel special to be around them throughout the isolating experience of writing this dissertation. Without them this work would have been impossible. Inadvertent misinterpretations and mistakes are solely due to my negligence and possibly stubbornness.

I am most appreciative of the support of my supervisory committee chair Gary J. Koehler and cochair Haldun Aytug (a.k.a. the two amigos). They have been truly helpful and kind, especially whenever I barged into their offices. I extend my thanks and sincere appreciation to Gary. I have enjoyed his support, friendship, guidance, and understanding since the day I met him. He has been my hero in academia. I thank Haldun Aytug for his brilliant ideas, his eagerness to pursue them, and his patience and support. I also thank Dr. Selwyn Piramuthu, Dr. Richard E. Newman, Dr. Anurag Agarwal, Dr. Praveen Pathak, Dr. Selcuk Erenguc, and all other Decision and Information Sciences (D.I.S.) faculty for their collaboration and support during my Ph.D. study.

I would also like to thank my parents for their support. Most of all, I would like to thank my beloved wife Meziyet, for her love and support; and my parents, my brother and my sister for always being there for me. Last but not least, I thank my dearest friends Enes Calik, and Avni and Evrim Argun, for their invaluable friendship.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
   Background
   No Free Lunch (NFL) Theorems and Domain Knowledge
   Motivation and Summary

2 LEARNING THEORY AND SVM CLASSIFIERS
   Basics of Learning Theory
      Learning Theory
         Probably Approximately Correct (PAC) Learning
         Vapnik-Chervonenkis (VC) Theory
         Structural and Empirical Risk Minimization
      Generalization Theory for Linear Classifiers
      Effective VC-Dimension
      Covering Numbers
      Fat Shattering
      Luckiness Framework
      Luckiness Framework for Maximal Margin Hyperplanes
   Support Vector Machines Classification

3 DOMAIN SPECIFIC KNOWLEDGE WITH SUPPORT VECTOR MACHINES
   Review of Relevant SVM Literature with Domain Specific Knowledge
   An Alternative Approach to Incorporate Domain Specific Knowledge in Support Vector Machines
      Characterizing the Input Space
      Characterizing the Input Space with Box Constraints
      Characterizing the Input Space with a Polytope

4 ELLIPSOID METHOD
   Ellipsoid Method Introduction
   The Löwner-John Ellipsoid
   The Ellipsoid Method
      Optimizing Linear Functions over Ellipsoids
      Different Ellipsoid Methods
      Different Ellipsoid Methods' Formulation
      Shallow Cut Ellipsoid Method

5 COMPUTATIONAL ANALYSIS FOR DOMAIN SPECIFIC KNOWLEDGE WITH SUPPORT VECTOR MACHINES
   Overview
   Comparative Numerical Analysis for Box-Constraints and Polytopes
      Polytopes versus Hyper-rectangles
      Generating Polytopes
   Using the Ellipsoid Method to Upper Bound the Polytope Diameter
      Central Cut Ellipsoid Method
      Maximum Violated Constraint
      Shallow/Deep Cuts
      Proceeding After a Feasible Point is Found
   Fat Shattering Dimension and Luckiness Framework

6 SUMMARY AND CONCLUSIONS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

3-1 Generalization error bound performances under different settings. Numbers in bold show that the bound is too loose to provide any information.
5-1 Summary of random polytopes generated in 2, 3 and 5 dimensions.
5-2 The central cut ellipsoid method applied to 2- and 3-dimensional datasets.
5-3 The central cut ellipsoid method applied to 2- and 3-dimensional datasets according to maximum violated constraint selection.
5-4 The ellipsoid method with deep cuts for 2- and 3-dimensional datasets.
5-5 With deep cuts the method converges faster and the generated ellipsoids are of smaller size.
5-6 Proceeding after a feasible point is found by randomly choosing a violated constraint and assigning the maximum value that can be assigned.
5-7 Proceeding after a feasible point is found by choosing the most violated constraint and assigning the maximum value that can be assigned.
5-8 The performance comparison for a 3-dimensional input space with 5 constraints.
LIST OF FIGURES

2.1 Consistency of the learning process.
2.2 Risk functional for a structure.
2.3 Two-dimensional separating hyperplane (w, b), where w′ = (w₁, w₂).
3-1 Using box constraints for characterizing the input space.
4-1 The L-J ellipsoid for a polytope.
4-2 When maximizing a linear function c′x over an ellipsoid E(D, d), the center of the ellipsoid lies between z_min and z_max.
4-3 In every iteration the polytope is contained in a smaller ellipsoid.
5-1 The L-J approximation.
5-2 The central-cut ellipsoid method illustrated on a 3-dimensional polytope which has a diameter of 3.77.
5-3 Volume reduction does not necessarily reduce the diameter.
5-4 The ellipsoid generated with the cut has a diameter of 330 and its volume is about 56.9% of the final ellipsoid in A.
5-5 The shallow cut method can continue even after the feasible region is found.
5-6 The approximate L-J ellipsoid for the polytope.
LIST OF SYMBOLS AND ABBREVIATIONS

X  Input space
x ∈ X  Input vector
x⁺, x⁻  Input vector associated with a positive or negative label, respectively
Y  Output space
y  Output label, y ∈ {−1, 1}. Note that y_i denotes the output label of a specific input vector x_i
ℝ  Real numbers
ℝⁿ  n-dimensional Euclidean space, or real vector space
H  Hypothesis space or hypothesis class. Note that H_i denotes a specific hypothesis space
S  Training sample set
h ∈ H  Classification function that maps h : X → Y in general
h_s ∈ H  Selected hypothesis for the classification task
𝒟  Distribution
ε  Error probability
1 − δ  Confidence level
l  Sample size
‖·‖  2-norm
|A|  Cardinality of a set A, or absolute value of A if A ∈ ℝ
P_H  Cardinality of the largest set shattered by H
d  VC-dimension of H. Note that d_i denotes the VC-dimension of H_i
n  Dimension of the input space
B_H  Growth function
C(n, m)  The number of ways of taking n things m at a time
e  Constant, base of the natural logarithm
P, Q  Different events used in context
Pr  Probability function
p_i  Probability of event i
k  Number of errors made on training sample S
W  Parameter set for the classifier function (e.g., in support vector machines, W = ℝⁿ)
w ∈ W  Parameter of a classifier function
L(f)  Loss function
R(f)  Risk function
F  Distribution function
log  Logarithm to base 2
ln  Natural logarithm
𝓛  Class of all linear learning machines over ℝⁿ
F, B  Classes of linear functions
f ∈ F, g ∈ B  Linear classification functions
b  Bias, b ∈ ℝ
𝒩  Covering number
γ  Margin. Note that γ_i denotes the margin of a specific input vector x_i
m_S  Margin of a function on sample set S
R  Radius of a ball containing all input points
H₁ ⊆ H₂ ⊆ … ⊆ H_n  Hierarchically nested sequence of countable hypothesis classes
HI (Hilbert space)  A Hilbert space is a vector space with an inner product ⟨x, y⟩ such that the norm defined by ‖x‖ = √⟨x, x⟩ transforms HI into a metric space
A  An arbitrary set
α  A nonnegative number (used both in the luckiness framework and as a Lagrangian multiplier in the support vector machines formulation)
le  Level of a function
U  Unluckiness function
L  Luckiness function
L(α)  Lagrangian function
S, Ŝ  Two sets of vectors of the form S = (x′₁, …, x′_l) and Ŝ = (x′_{l+1}, …, x′_{2l}), respectively
φ, ψ  Two functions of the form φ(le, L, δ) and ψ(le, L, δ)
{f_R}  A class of functions f_R with VC-dimension of 1
sv  Set of support vectors
v  A non-negative integer
Pⁿ  An n-dimensional polytope
∅  Empty set
SD  Space diagonal
HR  Hyper-rectangle
B  An n-dimensional ball
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

USING DOMAIN-SPECIFIC KNOWLEDGE IN SUPPORT VECTOR MACHINES

By Enes Eryarsoy

August 2005

Chair: Gary J. Koehler
Cochair: Haldun Aytug
Major Department: Decision and Information Sciences

Our goal was to develop a methodology that incorporates domain-specific knowledge for pattern-recognition problems. As different classifier algorithms capture pertinent domain knowledge in different ways, we specifically examined methodology for incorporating it in Support Vector Machines, a novel learning algorithm for pattern-recognition.

We began with a detailed literature review on learning theory and Support Vector Machines (SVM), an efficient and mathematically elegant technique for machine learning. We incorporated prior knowledge to enhance generalization error bounds on learning with SVM. First, we considered prior knowledge about the domain by incorporating upper and lower bounds of attributes. Then, we considered a more general framework that allowed us to encode prior knowledge as linear constraints formed by attributes. Finally, by comprehensive comparative numerical analysis we compared the effectiveness of incorporating domain knowledge versus using pure SVM error bounds obtained without incorporating domain knowledge.
CHAPTER 1
INTRODUCTION

Background

"Intellect is the part of human soul which knows, as distinguished from the power to feel" (The American Heritage Dictionary, 2000). The roots of creating a human-made intelligence can be traced back to Greek mythology, where the gods made artificial persons to carry out certain tasks. In the middle of the 20th century a new stream of research emerged, aimed at comprehending human intellect. The new research paradigm, called artificial intelligence (AI), received much attention. Starting with Turing machines, many real-world problems have been addressed using functions not provided by the human brain. As the first game-playing program 'Checkers' led to the famous Deep Blue that won the chess re-match against world champion Kasparov¹, the field of artificial intelligence has evolved.

Looking for a mechanistic way to mimic the human brain's reasoning, AI research bifurcated into two main streams. The first stream (computational neuroscience, or neuroinformatics) focuses on unraveling the truly complex relationship between the structure and functions of the brain by trying to decipher a huge black box that starts from structures as little as molecules and ends with functions and behaviors that the human brain exhibits.

¹ The original Checkers program was written in 1956. Deep Blue won the re-match against Kasparov in 1997.
The second stream focused on mimicking the functions of the human brain and replicating them, rather than understanding how the brain operates. Machine learning, which is a sub-area of AI, studies computer algorithms that can accomplish tasks requiring intelligence. More specifically, it studies developing algorithms that can learn from experience to make inferences about future cases. A large variety of real-world problems (face detection, voice recognition, automatic target recognition, cheminformatics, bioinformatics, and 'pattern recognition' problems in general) fall into this category.

An algorithm that carries out a learning task is called a "learning machine". In machine learning, the experience is introduced to the learning machine in the form of a set of training examples. Typically, machine learning algorithms use this set to train and calibrate the learning machine so that it can infer future cases or unseen situations. For the pattern-recognition (a.k.a. classification) task, the learning ability of a learning machine corresponds to how well the algorithm is calibrated to classify unseen examples (i.e., examples not included in the training set). This ability is called the "generalization ability of the learning machine". An ideal learning machine would require less effort (i.e., few training points and a fast training process) and would produce results of good quality (i.e., good generalization ability).

No Free Lunch (NFL) Theorems and Domain Knowledge

Many classification algorithms such as neural networks (Hristev, 1998), decision trees (Wagacha, 2003), and support vector machines (Vapnik, 1998) have been used on a number of pattern-recognition problems. However, a classification algorithm that performs extremely well on one task may fail to perform well on another. In the literature, this phenomenon is captured by the "No free lunch" theorems (Wolpert and
MacReady, 1995). They reflect the idea that for every positive training situation there is a negative training situation; this ideology resembles David Hume's skepticism (Freddoso, 2005): 'our expectation that the future will be similar to the past does not have a reasonable basis but is based on belief only'. In skepticism, even from the greatest number of observations we cannot generate a rule, nor can we predict a consequence from any of the known attributes.

Using NFL theorems, Wolpert and MacReady (1995) analyzed different algorithms. The NFL theorems investigate the performance of various algorithms in areas such as optimization, search, and learning. Generally, the structure of the solution space is crucial in selecting a good methodology.

Wolpert (1996a, 2001) develops NFL theorems for learning. In these, he claims that a learning algorithm cannot guarantee a good generalization ability of the machine by merely relying on low misclassification rates on the training set, attributes of the learning machine, and a large training set. No algorithm, when averaged across all problem instances, can outperform random search.

From our viewpoint, the NFL theorems suggest that taking as much information as possible about the problem domain may enhance learning. Therefore, regardless of the classification methodology used, incorporating prior assumptions as well as domain knowledge in machine learning is of importance.

The idea of capturing domain knowledge complements both the NFL theorems and the AI community. The AI community attributes the popular phrase 'knowledge is power' to Dr. Edward A. Feigenbaum, a pioneer in AI (Augier and Vendele, 2002). The phrase
originated with Francis Bacon, an English philosopher who laid out a complex methodology for scientific inquiry known as the 'Baconian Method'.

'Bacon suggests that you draw up a list of all things in which the phenomena you are trying to explain occurs, as well as a list of things in which it does not occur. Then you rank your lists according to the degree in which the phenomena occurs in each one. Then you should be able to deduce what factors match the occurrence of the phenomena in one list and don't occur in the other list, and also what factors change in accordance with the way the data had been ranked. From this Bacon concludes you should be able to deduce by good inductive reasoning what is the form underlying the phenomena.' (Wikipedia, 2005)

The phrase "knowledge is power" in the AI context has a similar, but distinct, meaning. In our context, it means that capturing knowledge about the domain of the pattern-recognition problem and incorporating this into the learning process are the actual sources of power of the learning machines.

There have been many studies addressing the issue of incorporating domain knowledge in various methods, such as decision trees or neural networks. Our study focuses on a machine learning method called Support Vector Machines (SVM). The SVM method uses a novel and mathematically elegant technique that relies on the strong theoretical underpinnings of Statistical Learning Theory and Structural Risk Minimization:

Statistical Learning Theory (SLT): developed mainly by Vapnik and Chervonenkis (1971) and Vapnik (1982), SLT relates statistics to the learning process. The SVM method is mainly used for classification and regression. Given a classification problem, the SVM method uses SLT to calculate optimal classifiers by maximizing the margin between the classes.

Structural Risk Minimization (SRM): for pattern-recognition one chooses the best learning machine to perform the task.
Once the learning machine is chosen, its generalization performance can be computed based on the training set and SLT. We use SRM to choose the learning machine that best minimizes the generalization error.

For classification problems, to maximize the generalization ability, SVM minimizes a bound on the generalization error by solving a convex optimization problem that maximizes the margin between separated classes. Since the problem is a convex minimization problem, optimality is guaranteed; in other learning machines (such as neural nets and decision trees), known approaches do not guarantee optimality.

Our main objective is to develop a methodology that incorporates domain-specific knowledge for learning. We quantify its impact on learning accuracy and on the development of SVM. Different classifier algorithms capture pertinent domain knowledge in different ways. In the SVM literature, domain knowledge is usually captured (loosely speaking) via kernel functions. Kernel functions map all the instances into another space (called the feature space) in which the classification task is performed. In the feature space the task can be accomplished more easily.

However, most studies have used generic kernels such as Gaussian, Polynomial, or Radial Basis Function (RBF) kernels to incorporate domain knowledge, but lack any specific reason why they might perform well. The fact that these kernels seem to help supports the optimistic view that they incorporate some sort of domain knowledge. One problem with using domain knowledge is that kernel choice highly affects learning performance. The problem with using generic kernels is that no a priori knowledge is used during kernel selection (hence our use of 'optimistic' domain knowledge). Generating kernels tailored to the problem at hand often requires trial and error, and a priori knowledge for the problem domain may be lacking or not directly used.
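As a minimal sketch of the generic kernels mentioned above (this is standard textbook material, not the approach developed in this dissertation; the vectors and parameter values are arbitrary), the polynomial and Gaussian/RBF kernels compute feature-space inner products directly from input-space vectors:

```python
import math

def polynomial_kernel(x, z, degree=3, c=1.0):
    # K(x, z) = (<x, z> + c)^degree: inner product in an implicit
    # feature space of monomials up to the given degree.
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (dot + c) ** degree

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2): Gaussian/RBF kernel.
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [2.0, 0.0]
print(polynomial_kernel(x, z))  # (2 + 1)^3 = 27.0
print(rbf_kernel(x, z))         # exp(-0.5 * 5) ~ 0.082
```

An SVM never needs the feature-space coordinates themselves; only such kernel evaluations enter its optimization problem, which is why the kernel choice carries whatever domain knowledge is used.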
Another problem arises when trying to capture domain knowledge through sample points. The limited number of training points available for training may inhibit learning. In learning, an insufficient training set size poses serious problems. Niyogi et al. (1998) addressed the issue of generating virtual examples to increase the effective training set size. They show that in some cases creating virtual examples is mathematically equivalent to capturing prior knowledge. Fung et al. (2001, 2003) captured prior knowledge (in the form of polyhedral sets) and used it in the formulation of a linear SVM classifier. No studies have yet captured prior knowledge by characterizing the entire input space (i.e., all of the possible instances on which the classification task is performed).

An attribute represents a characteristic of an entity or an instance that we find of interest. In the knowledge discovery process, the basic classification task is to classify entities, or so-called points, according to their attributes' values. For each attribute there is a specific domain from which a value is assigned.

Motivation and Summary

In this study, we propose an alternative approach to incorporate prior knowledge about the input (or instance) space. We investigate the case where the input space is contained in a convex, closed and bounded set, rather than the classical SVM approach that assumes a ball of radius R contains all the training points. Initially we start with box constraints that contain the input space; then we extend our approach to an arbitrary polytope that contains all potential training points. In our setting, prior knowledge is derived from the convex, closed set that contains the input set, whose shape depends on the problem domain.

We believe that our approach makes two major contributions to the literature. Firstly, we incorporate prior knowledge directly from the input space. That is, knowledge
is derived immediately from the attributes of the input space. The contribution of this approach, as indicated above, is as follows: a sample set not equal to the whole input space is never complete, as there are always unseen cases which can potentially reveal more insights about the domain. However, working with the input space itself rather than a finite set of points drawn from it helps extract knowledge without worrying about unseen data.

Secondly, given a sample size and a level of confidence, by utilizing prior knowledge about the input space we potentially reduce the generalization error. Alternatively, this also means that by incorporating domain-specific knowledge, the sample size needed to accomplish the learning task with a pre-defined level of confidence may be reduced.

The remainder of this dissertation is organized as follows. In Chapter 2, we provide a comprehensive review on learning theory and the current literature on support vector machines. In Chapter 3, a literature review on domain knowledge in learning is provided and an alternative way to incorporate domain knowledge in support vector machines is proposed. Chapter 4 discusses the ellipsoid method and its applicability to the problem. Chapter 5 is dedicated to computational analysis of our approach to incorporate the domain knowledge.
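The box-constraint idea introduced above can be made concrete with a small numerical sketch (the attribute bounds below are hypothetical): if attribute i is known to lie in [lower_i, upper_i] for every possible instance, the input space sits inside a hyper-rectangle whose space diagonal bounds the distance between any two feasible points, taking over the role played by the radius-R ball in the classical SVM bounds.

```python
import math

# Hypothetical domain knowledge: attribute i is known to lie in
# [lower[i], upper[i]] for every possible instance, seen or unseen.
lower = [0.0, 18.0, 0.0]
upper = [1.0, 65.0, 10.0]

def space_diagonal(lower, upper):
    # Space diagonal of the hyper-rectangle: sqrt(sum_i (u_i - l_i)^2).
    # No two points of the input space can be farther apart than this.
    return math.sqrt(sum((u - l) ** 2 for l, u in zip(lower, upper)))

print(space_diagonal(lower, upper))  # sqrt(1 + 2209 + 100) ~ 48.06
```

Using this diameter in place of the diameter 2R of an enclosing ball is the intuition behind the bounds developed in Chapter 3.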
CHAPTER 2
LEARNING THEORY AND SVM CLASSIFIERS

Detecting regularities and not-so-obvious patterns in data that are beneficial from an organization's or an individual's point of view is called the Knowledge Discovery process. As the process is usually carried out on sizable databases, the "Knowledge Discovery" and "Knowledge Discovery in Databases" (KDD) concepts usually refer to the same activity. In the database literature, the KDD process is composed of several iteratively repeated subprocesses. The knowledge discovery process is defined as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al., 1996). In their study, they also include their version of the steps for KDD, which can be outlined as:

- Learning the application domain
- Creating a target dataset
- Data cleaning and pre-processing: noise in the data is reduced, irrelevant data are removed, and missing data are handled.
- Data reduction and projection: data representation is realized by carrying out dimensionality reduction via elimination of highly correlated variables.
- Choosing the function of data mining: deciding on the kind of operation to be performed on the data, such as classification, regression, summary, clustering, or time series analysis.
- Choosing the data mining algorithms: deciding which models and what parameters are appropriate for the KDD process.
- Data mining: looking for patterns in the data by using a data mining algorithm for a selected purpose.
- Interpretation: interpreting the results yielded by the data mining process, and revising and remedying previous processes if necessary.
- Using discovered knowledge: utilizing the knowledge derived through the previous steps in order to enhance system performance.

Of all the steps above, data mining is probably the most popular. In fact, the term "Data Mining" sometimes amounts to the whole process of Knowledge Discovery (Kohavi and Provost, 1998). However, Data Mining usually refers to quantitative and algorithmic approaches and is also known as Machine Learning and pattern-recognition. Kohavi and Provost (1998) define Machine Learning as "the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to learn". An algorithm that carries out a learning task is called a "learning machine". In this study we are especially interested in "support vector machines" as learning machines; as indicated in the introduction, SVM relies on strong theoretical underpinnings in learning theory and optimization theory. In the following section we briefly discuss learning theory and relate it to support vector machines.

Basics of Learning Theory

Valiant (1984, p. 1134) defines learning as follows:

"A program for performing a task has been acquired by learning if it has been acquired by any means other than explicit programming. Among human skills some clearly appear to have genetically preprogrammed elements while some others consist of executing an explicit sequence of instructions that has been memorized. There remains a large area of skill acquisition where no such explicit programming is identifiable. It is this area that we describe here as learning."

In statistics, Learning Theory refers to computational analysis of machine learning techniques (WordiQ Encyclopedia, 2005).
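The KDD steps above can be sketched end-to-end on a toy example (the dataset, and the 1-nearest-neighbor rule standing in for the data mining step, are hypothetical stand-ins):

```python
import math

# Hypothetical target dataset: rows of (attr_1, attr_2, label);
# None marks a missing value.
raw = [
    (0.9, 1.1, 1), (1.0, None, 1), (0.2, 0.1, -1),
    (0.1, 0.3, -1), (1.2, 0.8, 1),
]

# Data cleaning and pre-processing: handle missing data by dropping rows.
clean = [row for row in raw if None not in row]

# Data mining: classify a new point with a 1-nearest-neighbor rule.
def classify(point, data):
    nearest = min(data, key=lambda row: math.dist(point, row[:-1]))
    return nearest[-1]

print(len(clean))                   # 4 rows survive cleaning
print(classify((1.0, 1.0), clean))  # nearest row is (0.9, 1.1, 1): label 1
```

Interpretation and use of the discovered knowledge would then feed back into the earlier steps, e.g. revisiting the cleaning rule if the classifications look implausible.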
Different learning theories exist in the literature, such as Probably Approximately Correct (PAC) Learning (Valiant, 1984), Bayesian Learning, and Statistical Learning theories. These learning theories not only deal with analyzing existing machine learning algorithms, but also inspire the creation of new algorithms. For example, the Boosting algorithm (Schapire, 1990) and Support Vector Machines are machine learning algorithms based on PAC Learning and Statistical Learning Theory, respectively.

Learning Theory

In the first section of this chapter we provided a general introduction to Learning Theory and Machine Learning. As we noted, many machine learning algorithms have been developed and applied to different learning tasks such as the ones we mentioned earlier. However, in this study our focus is Support Vector Machines as a Machine Learning algorithm. Therefore, we will limit ourselves to the Support Vector Machines related aspects of Learning Theory.

Probably Approximately Correct (PAC) Learning

Valiant's paper "A Theory of the Learnable" (1984) laid the foundation of Computational Learning Theory (COLT), which focuses on the mathematical evaluation and design of algorithms by considering sample data drawn independently and identically distributed (i.i.d.) from an unknown but fixed distribution to study problems of prediction in learning theory.

Let X ⊆ ℝⁿ be the input space and also let every point in the input space have a binary output label, y, assigned. We assume a fixed but unknown sampling distribution 𝒟 defined over the input-output pairs. In PAC Learning, the basic classification task is to choose a function from a specified hypothesis space so that the learning is approximately correct, meaning that the probability of misclassifications made by the classification
function (or hypothesis) over the input space is bounded by a pre-defined confidence level.

Formally, let (x, y) ∈ X × {−1, 1} be an input-output pair from the input space X; (x, y) is called an example. Let h be a classification function (or hypothesis) from a hypothesis space H. Also let S = ((x₁, y₁), …, (x_l, y_l)) be a collection of l examples (i.e., the sample) drawn i.i.d. from an unknown but fixed distribution 𝒟. Then, the error of hypothesis h can be defined as:

err(h) = 𝒟{(x, y) : h(x) ≠ y}.

The error term measures the expected cost of an error and is usually referred to as the risk functional. Here the cost function is an indicator function. When studying problems of prediction in learning theory, PAC Learning considers bounding the error rate as one of the prime tasks. A second goal is to do so in polynomial time. There are several factors affecting the risk functional, such as the selected hypothesis for the classification task (h_s), the richness of the hypothesis space, the desired confidence level 1 − δ, and the number of training examples available (sample complexity). For any h_s ∈ H the pac bound can be expressed as ε(l, H, δ). It is clear that the error bound is a function of sample complexity, hypothesis space complexity, and the confidence level 1 − δ. A more formal way of writing a pac bound is as follows:

𝒟^l{S : err(h_s) > ε(l, H, δ)} < δ.

This states that given a sample size l, the probability that the risk functional is greater than the pac bound ε(l, H, δ) is very small (if δ is small). However, in order to go beyond a definition, a further analysis of the bound on the risk functional is needed. First of
all, the bound is a function of the sample complexity and the hypothesis space. Therefore sample size determination and the characteristics of the hypothesis space both have an impact on the bound. Secondly, a distribution-free bound is doomed to be more pessimistic and looser than a bound for a specific benign distribution. These issues must be addressed in order to bound the risk functional.

Let $|H|$ denote the cardinality of the hypothesis space ($|H| = \infty$ if the space is infinite). During the learning process a sample set is processed by the given learning algorithm to output an $h_s \in H$ such that $err(h_s) \le \varepsilon(l, H, \delta)$, if possible. Hypothesis $h$ is called consistent with the sample if it makes no errors on the sample set. The probability that a hypothesis $h$ correctly classifies all of $S$ but has error greater than $\varepsilon$ can be bounded as

$\mathcal{D}^l\{S : h \text{ consistent and } err(h) > \varepsilon(l, H, \delta)\} < (1 - \varepsilon)^l \le e^{-\varepsilon l}.$

With $|H|$ finite, if we have at least one consistent hypothesis $h_s$ from hypothesis class $H$ that correctly classifies sample set $S$, then by using the cardinality of the hypothesis space we can bound the probability over all hypotheses as

$\mathcal{D}^l\{S : \exists h_s \text{ consistent and } err(h_s) > \varepsilon(l, H, \delta)\} < |H| e^{-\varepsilon l}.$

In turn we can bound this by $\delta$, giving

$|H| e^{-\varepsilon l} \le \delta.$

[Footnote 2: From the expansion $e^x = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!} + \cdots$, for $x \le 0$ and any $l$ we have $\left(1 + \frac{x}{l}\right)^l \le e^x$; with $x = -\varepsilon l$ this gives $(1-\varepsilon)^l \le e^{-\varepsilon l}$.]
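Solving $|H|e^{-\varepsilon l} \le \delta$ for $\varepsilon$ turns the finite-class pac bound into a one-line calculation. A minimal sketch (the class size, sample size, and $\delta$ below are arbitrary illustrative values):

```python
import math

def pac_bound_finite(l, card_H, delta):
    """Solve |H| * exp(-eps * l) <= delta for eps:
    eps(l, H, delta) = (1/l) * ln(|H| / delta)."""
    return math.log(card_H / delta) / l

# With |H| = 2**20 hypotheses, l = 1000 examples, and delta = 0.05,
# a consistent hypothesis has true error below roughly 1.7%.
eps = pac_bound_finite(l=1000, card_H=2**20, delta=0.05)
```

As expected, the bound shrinks as the sample grows and loosens as the hypothesis class grows.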
In order to satisfy the bound, the error must satisfy $\varepsilon \ge \frac{1}{l}\ln\frac{|H|}{\delta}$, and therefore we set

$\varepsilon(l, H, \delta) = \frac{1}{l}\ln\frac{|H|}{\delta}.$

This expression indicates that the error bound is inversely related to the sample size and to the confidence parameter we pick, as expected. More importantly, it also indicates that for an overly complex hypothesis space (i.e., if $|H|$ is large), even though the hypothesis is consistent with the sample set, the error bound may be high, and therefore there is a higher overfitting risk. Moreover, the cardinality of the hypothesis space must be finite for the bound to hold. In general, however, the cardinality of the hypothesis space is neither necessarily small nor finite. For instance, Valiant's original paper (Valiant, 1984) considers Boolean function mappings, so the hypothesis space is finite. In many other cases, such as learning with linear classifiers with real weight vectors, the cardinality of the hypothesis class is infinite.

Vapnik-Chervonenkis (VC) Theory

To make the learning process more feasible and to extend pac theory over hypothesis classes with infinite cardinalities, the concept "cardinality of $H$" is extended to the "expressive power of $H$" by VC Theory, first introduced by Vapnik and Chervonenkis (1971). Before we formally state VC Theory, let us introduce the concepts of "shattering", "growth function", and "VC-dimension".

Let $H$ be the hypothesis space defined on instance space $X$, and let $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$ be a sample of size $l$. A hypothesis class $H$ is said to shatter a
set of points $S' = \{\mathbf{x}_1, \ldots, \mathbf{x}_l\}$ if and only if for every possible labeling of those points there exists a hypothesis that realizes it.

Definition 2.1 (Shattering): A set of instances $S'$ is shattered by hypothesis space $H$ if and only if for every binary classification $y \in \{-1, 1\}$ of the instances in $S'$ there exists some hypothesis in $H$ consistent with it. (Mitchell, 1997)

For real-valued functions, Gurvits (2001, p.82) defines shattering as follows. Let $H$ be a class of real-valued functions on domain $X$. We say that $H$ shatters a set $S' \subseteq X$ if there exists a function $h' : S' \to \mathbb{R}$ such that for every subset $E \subseteq S'$ there exists some function $h_E \in H$ satisfying

$h_E(\mathbf{x}) \ge h'(\mathbf{x})$ for every $\mathbf{x} \in E$, and
$h_E(\mathbf{x}) < h'(\mathbf{x})$ for every $\mathbf{x} \in S' \setminus E$.

The pseudo-dimension of $H$, denoted $P(H)$, is the cardinality of the largest set that is shattered by $H$ (Gurvits, 2001, p.82).

The VC-dimension of a hypothesis space $H$, denoted $d = VCdim(H)$, is the cardinality of the largest set that can be shattered by $H$. If the hypothesis space is linear classifiers over $\mathbb{R}^n$, then given any set $S'$ of $n+1$ points in general position (not lying in an $(n-1)$-dimensional affine subspace), there exists a function in $H$ that consistently labels $S'$, whatever the labeling of the training points in $S'$. Also, for any set of $l \ge n+2$ inputs there is at least one labeling that cannot be realized by any function in $H$. Thus, the VC-dimension for linear classifiers is $n + 1$.
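The shattering claim for linear classifiers can be checked by brute force in low dimension. A minimal sketch, assuming strict separation and searching a small integer grid of $(w_1, w_2, b)$ that happens to suffice for these particular points:

```python
from itertools import product

def shatters(points):
    """Check whether linear classifiers sign(<w, x> + b) realize every
    binary labeling of `points`, searching a small integer grid."""
    grid = range(-2, 3)
    for labels in product((-1, 1), repeat=len(points)):
        found = False
        for w1, w2, b in product(grid, grid, grid):
            # Strict separation: y * (<w, x> + b) > 0 for every example.
            if all(y * (w1 * x1 + w2 * x2 + b) > 0
                   for (x1, x2), y in zip(points, labels)):
                found = True
                break
        if not found:
            return False
    return True

# Three points in general position in the plane are shattered ...
print(shatters([(0, 0), (1, 0), (0, 1)]))          # True
# ... but these four are not (the XOR labeling is unrealizable).
print(shatters([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False
```

This is consistent with the text: the VC-dimension of linear classifiers in the plane is $3 = n + 1$.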
The complexity of the hypothesis space, or expressive power of $H$ over a sample of size $l$, can be measured by the "growth function" $B_H(l)$: the bigger $B_H(l)$ is, the more behaviors $H$ has on a set of $l$ points. Let $X^l$ be the set of all samples of size $l$; for $h(\mathbf{x}) \in \{-1,1\}$ the growth function is bounded as

$B_H(l) = \max_{S = (\mathbf{x}_1, \ldots, \mathbf{x}_l) \subseteq X} \left|\left\{\left(h(\mathbf{x}_1), h(\mathbf{x}_2), \ldots, h(\mathbf{x}_l)\right) : h \in H\right\}\right| \le 2^l.$

Then the VC-dimension in terms of the growth function is

$VCdim(H) = d = \max\{l : B_H(l) = 2^l\},$

and is infinite only when this set is unbounded. A tighter bound on the growth function would mean a smaller $d$, which means less power is needed to shatter $l$ points. Although the growth function above can grow exponentially in $l$, in VC Theory it can be bounded by a polynomial expression in $l$ of degree $d = VCdim(H)$:

Lemma 2.2 (Vapnik, 1971): Let $X$ be an (infinite) set of elements $\mathbf{x}$ and consider samples $S = (\mathbf{x}_1, \ldots, \mathbf{x}_l) \subseteq X$ of size $l$. Denote by $B_H(l)$ the number of distinct behaviors of $H$ on such samples. Then either

$B_H(l) = 2^l$ for all $l$,

or

$B_H(l) \le \sum_{i=0}^{d}\binom{l}{i} \le \left(\frac{el}{d}\right)^d,$

where $d$ is the last integer $l$ for which the equality $B_H(l) = 2^l$ holds, and $\binom{l}{i}$ is the number of ways of taking $i$ things from $l$ at a time.
Proof of Lemma 2.2: For $l \le d$, $B_H(l) = 2^l$, and for $l > d$, $B_H(l) \le \sum_{i=0}^{d}\binom{l}{i}$. In the case $l > d$ we have $0 < \frac{d}{l} < 1$, and knowing that $B_H(l) \le \sum_{i=0}^{d}\binom{l}{i}$ we may write

$\sum_{i=0}^{d}\binom{l}{i} \le \left(\frac{l}{d}\right)^d \sum_{i=0}^{d}\binom{l}{i}\left(\frac{d}{l}\right)^i \le \left(\frac{l}{d}\right)^d\left(1 + \frac{d}{l}\right)^l \le \left(\frac{l}{d}\right)^d e^d,$

which yields $B_H(l) \le \left(\frac{el}{d}\right)^d$.

Note that the structure in the Lemma above is also known as "Sauer's Lemma" (Sauer, 1972). Also note that Vapnik (1998) uses the term "growth function" for $\ln B_H(l)$, while we use the notation in Cristianini and Shawe-Taylor (2000).

By using the growth function, we can re-write the pac bound as

$\mathcal{D}^l\{S : \exists h_s \text{ consistent and } err(h_s) > \varepsilon(l, H, \delta)\} \le B_H(l)\, e^{-\varepsilon l}.$

In order to bound the risk functional based on our training error we use the "doubling trick", which applies Chernoff bounds in the learning context by introducing a ghost sample $\hat{S}$. It can be stated as follows:

Lemma 2.3 (Chernoff, 1952; Cristianini and Shawe-Taylor, 2000): Suppose we draw a set $S$ of $l$ random examples from $\mathcal{D}$. The probability of drawing a hypothesis that is consistent with $S$ but has error above $\varepsilon$ can be bounded by twice the probability of having zero error on the training examples but high error on a second random sample $\hat{S}$. That is, $\mathcal{D}^l\{S : \exists h \in H, err_S(h) = 0, err(h) > \varepsilon\}$ can be bounded by
$2\,\mathcal{D}^{2l}\left\{S\hat{S} : \exists h \in H,\ err_S(h) = 0,\ err_{\hat{S}}(h) > \frac{\varepsilon l}{2}\right\}$

given $l > \frac{2}{\varepsilon}$.

Proof of Lemma 2.3: Assume we draw a sample $S'$ of size $2l$ and randomly divide it into two equal sets $S$ and $\hat{S}$. Suppose some hypothesis $h$ with $err(h) > \varepsilon$ makes no errors on $S$. Since $\varepsilon l > 2$ (from $l > \frac{2}{\varepsilon}$), the number of errors $h$ makes on $\hat{S}$ exceeds $\frac{\varepsilon l}{2}$ with probability at least $\frac{1}{2}$. Let $P$ be the event that no errors are made on $S$ while $err(h) > \varepsilon$, and let $Q$ be the combined event that $P$ occurs and there are more than $\frac{\varepsilon l}{2}$ errors on $\hat{S}$. Then we know that $\Pr(Q\,|\,P) \ge \frac{1}{2}$ and $\Pr(Q) = \Pr(Q\,|\,P)\Pr(P)$. Therefore $\Pr(P) \le 2\Pr(Q)$. Hence, provided $l > \frac{2}{\varepsilon}$, we can write

$\mathcal{D}^l\{S : \exists h \in H,\ err_S(h) = 0,\ err(h) > \varepsilon\} \le 2\,\mathcal{D}^{2l}\left\{S\hat{S} : \exists h \in H,\ err_S(h) = 0,\ err_{\hat{S}}(h) > \frac{\varepsilon l}{2}\right\}.$

Given that $k \ge \frac{\varepsilon l}{2}$ errors are made on the sample $S'$ of size $2l$, we can bound the probability that, under a random split, all $k$ errors fall in $\hat{S}$:

$\prod_{i=0}^{k-1}\frac{l-i}{2l-i} \le 2^{-k} \le 2^{-\varepsilon l/2}.$

Therefore we can write
$\mathcal{D}^{2l}\left\{S\hat{S} : \exists h \in H,\ err_S(h) = 0,\ err_{\hat{S}}(h) > \frac{\varepsilon l}{2}\right\} \le B_H(2l)\, 2^{-\varepsilon l/2}.$

Combining the results so far yields the following expression:

$\mathcal{D}^l\{S : \exists h \text{ consistent and } err(h) > \varepsilon\} \le 2\,B_H(2l)\,2^{-\varepsilon l/2} \le 2\left(\frac{2el}{d}\right)^d 2^{-\varepsilon l/2}.$

Therefore a pac bound for any consistent hypothesis $h$ can be shown to be

$err(h) \le \varepsilon(l, H, \delta) = \frac{2}{l}\left(d\log_2\frac{2el}{d} + \log_2\frac{2}{\delta}\right).$

Theorem 2.4 (VC Theorem by Vapnik and Chervonenkis, 1998): Let $H$ be a hypothesis space having VC-dimension $d$. For any probability distribution $\mathcal{D}$ on $X \times \{-1,1\}$, with probability $1-\delta$ over $l$ random examples $S$, any hypothesis $h \in H$ that is consistent with $S$ has error no more than

$err(h) \le \varepsilon(l, H, \delta) = \frac{2}{l}\left(d\log_2\frac{2el}{d} + \log_2\frac{2}{\delta}\right)$

provided $d \le l$ and $l > \frac{2}{\varepsilon}$ (Cristianini and Shawe-Taylor, 2000, p.56).

Bounds on generalization error are derived by computing the ratio of misclassifications. Therefore a bound on the number of errors over a set of input points can be calculated by multiplying the generalization error by the cardinality of the input point set. This, in a sense, is similar to calculating the cost of misclassification when the loss due to misclassification is simply the count of misclassifications (in which case the generalization error is the risk functional). In the following subsection we discuss loss functions and mention two basic induction principles that relate selecting an appropriate hypothesis for classification to risk functionals.
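Before moving on, the growth-function bound of Lemma 2.2 and the resulting pac bound of Theorem 2.4 can both be evaluated numerically. A minimal sketch (the values of l, d, and delta are illustrative):

```python
import math

def growth_bound(l, d):
    """Sauer's lemma: for l >= d, B_H(l) <= sum_{i=0}^{d} C(l, i),
    which is itself at most (e*l/d)**d."""
    return sum(math.comb(l, i) for i in range(d + 1))

def vc_bound(l, d, delta):
    """Theorem 2.4: err(h) <= (2/l) * (d*log2(2el/d) + log2(2/delta))
    for a consistent hypothesis, provided d <= l and l > 2/eps."""
    return (2.0 / l) * (d * math.log2(2 * math.e * l / d)
                        + math.log2(2 / delta))

# For l = 10 points and d = 3: the binomial sum is 176, well below
# both 2**10 = 1024 and the polynomial bound (e*10/3)**3 (about 744).
# Doubling the sample size tightens the pac bound; doubling the
# VC-dimension loosens it.
```
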
Structural and Empirical Risk Minimization

We previously defined the risk functional as measuring the expected cost of misclassification. For a classification problem, the input vectors $\mathbf{x}$ from a subset $X$ of $\mathbb{R}^n$ are classified by a function of the form $h(\mathbf{x}, \mathbf{w}) : X \times W \to \{-1, 1\}$, where $\mathbf{w}$ is a parameter of the classifier function drawn from a parameter set $W$. For input-output pairs $(\mathbf{x}, y)$, a loss function that counts the number of errors can be defined as

$L(\mathbf{x}, y, h) = \begin{cases} 1, & h(\mathbf{x}, \mathbf{w}) \neq y \\ 0, & \text{otherwise.} \end{cases}$

For a distribution function $F(\mathbf{x}, y)$, the learning algorithm discovers the relationship between input and output pairs as the joint distribution $F(\mathbf{x}, y)$ by investigating the conditional probability distribution $F(y\,|\,\mathbf{x})$ that lays out their stochastic relationship. Hence, the risk functional to be minimized can be stated as

$R(\mathbf{w}) = \int L(\mathbf{x}, y, h)\, dF(\mathbf{x}, y).$

Estimating the risk functional depends on the sample set $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$, and the learning problem is to choose a value $\mathbf{w}^*$ for which $R(\mathbf{w}^*) = \inf_{\mathbf{w} \in W} R(\mathbf{w})$. However, as the probability distribution $F(\mathbf{x}, y)$ is unknown, the solution to the above problem cannot be computed explicitly. As such, an induction principle is usually invoked. Two common principles used to find, or approximate, the best classifier are the Empirical Risk Minimization (ERM) principle and the Structural Risk Minimization (SRM) principle. The Bayes classifier is the overall best classifier, not restricted to any hypothesis class, and is unknown (Devroye et al., 1996). The objective
of these two principles is to find a classifier that is as close as possible to, or the same as, the Bayes classifier.

Empirical Risk Minimization. Empirical risk minimization (ERM) is an induction principle that substitutes minimizing the empirical risk functional

$R_{emp}(\mathbf{w}) = \frac{1}{l}\sum_{i=1}^{l} L(\mathbf{x}_i, y_i, h)$

for minimizing $R(\mathbf{w}) = \int L(\mathbf{x}, y, h)\, dF(\mathbf{x}, y)$. In pattern recognition, one of the most common approaches to minimizing $R_{emp}(\mathbf{w})$ is to minimize the number of errors made on the training set. By the Glivenko-Cantelli Theorem we know that when a large number of observations is available ($l$ is large), the empirical distribution computed from the sample set $S$ converges to the actual distribution. That is, with probability one,

$\lim_{l \to \infty} \sup_{\mathbf{x}} |F_l(\mathbf{x}) - F(\mathbf{x})| = 0.$

Similarly, in ERM, for a given hypothesis class $H$ the bias in $R_{emp}(\mathbf{w})$ shrinks as $l$ increases; that is,

$\lim_{l \to \infty} |R_{emp}(\mathbf{w}) - R(\mathbf{w})| = 0,$

and hypothesis selection is done according to this criterion. It is hoped that a hypothesis found by the ERM principle has low true risk or true error. However, hypothesis selection chooses $\mathbf{w}_{emp}^l = \arg\min R_{emp}(\mathbf{w})$ so that the empirical risk is small, which is not necessarily close to the Bayes risk. That is, if we let $R_{Bayes}$ denote the Bayes error, then in the decomposition

$R(\mathbf{w}_{emp}^l) - R_{Bayes} = \left(R(\mathbf{w}_{emp}^l) - \inf_{\mathbf{w} \in W} R(\mathbf{w})\right) + \left(\inf_{\mathbf{w} \in W} R(\mathbf{w}) - R_{Bayes}\right)$

the second (approximation) term does not depend on the sample, so
$\lim_{l \to \infty} \left(R(\mathbf{w}_{emp}^l) - R_{Bayes}\right) = 0$

is not necessarily true (Devroye et al., 1996).

Definition 2.5 (Consistency for ERM): We say that the principle (method) of empirical risk minimization is consistent for the set of functions $L(\mathbf{x}, y, h(\mathbf{x}, \mathbf{w}))$, $\mathbf{w} \in W$, and for the probability distribution function $F(\mathbf{x}, y)$ if the following two sequences converge in probability to the same limit:

$R(\mathbf{w}_l) \xrightarrow{\;P\;} \inf_{\mathbf{w} \in W} R(\mathbf{w})$ and $R_{emp}(\mathbf{w}_l) \xrightarrow{\;P\;} \inf_{\mathbf{w} \in W} R(\mathbf{w}).$ (Vapnik, 1998)

The idea of this definition can be seen in Figure 2.1. Note that consistency of a classification rule is also known as asymptotic Bayes-risk efficiency (Devroye et al., 1996).

Figure 2.1: Consistency of the learning process: as $l$ grows, $R(\mathbf{w}_l)$ and $R_{emp}(\mathbf{w}_l)$ both converge to $\inf_{\mathbf{w}} R(\mathbf{w})$ (adapted from Vapnik, 1998, p.81)

Vapnik (1998) gives a proof of the consistency of ERM in "the key theorem of learning theory" and shows that when $l$ is large, $R_{emp}$ minimization is consistent. Moreover, Theorem 2.4 can be used to establish upper bounds on the generalization error. Usually, when the hypothesis class is not rich enough to capture the concept of interest, some errors may be made on the training data. Therefore, when a
hypothesis space may not be able to capture all the variations, and possibly some noise, in the data, the VC Theorem can be adapted to tolerate errors on the training set. The following theorem by Vapnik provides the generalization error for such cases (Cristianini and Shawe-Taylor, 2000, p.58).

Theorem 2.6: Let $H$ be a hypothesis space having VC-dimension $d$. For any probability distribution $\mathcal{D}$ on $X \times \{-1,1\}$, with probability $1-\delta$ over $l$ random examples $S$, any hypothesis $h \in H$ that makes $k$ errors on the training set $S$ has error no more than

$err(h) \le \frac{2k}{l} + \frac{4}{l}\left(d\log_2\frac{2el}{d} + \log_2\frac{4}{\delta}\right)$

provided $d \le l$.

Proof: Vapnik (1998, p.263)

Blumer et al. (1989) and Ehrenfeucht et al. (1989) also provided lower bounds on generalization error. For a consistent hypothesis the bound is tight, as the following theorem states.

Theorem 2.7: Let $H$ be a hypothesis space with finite VC-dimension $d \ge 1$. Then for any learning algorithm there exist distributions such that with probability at least $\delta$ over $l$ random examples, the error of the hypothesis $h$ returned by the algorithm is at least

$\max\left(\frac{d-1}{32l},\ \frac{1}{l}\ln\frac{1}{\delta}\right)$ (Cristianini and Shawe-Taylor, 2000)

Proof: Ehrenfeucht et al. (1989)

The bound in Theorem 2.6 is tight up to the log factors, as there are lower bounds of the same form (Bartlett and Shawe-Taylor, 1999). However, although the upper bounds hold for any distribution, the lower bounds are valid only for particular distributions (Bartlett and Shawe-Taylor, 1999).
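The empirical risk and the bound of Theorem 2.6 reduce to straightforward arithmetic. A minimal sketch (the sign classifier and the four-point sample below are illustrative):

```python
import math

def empirical_risk(h, sample):
    """R_emp: fraction of examples the hypothesis misclassifies
    under the 0-1 loss."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def vc_bound_with_errors(l, d, k, delta):
    """Theorem 2.6: err <= 2k/l + (4/l)*(d*log2(2el/d) + log2(4/delta)),
    provided d <= l."""
    return (2.0 * k / l
            + (4.0 / l) * (d * math.log2(2 * math.e * l / d)
                           + math.log2(4 / delta)))

# A sign classifier making one mistake on four labeled points:
h = lambda x: 1 if x >= 0 else -1
print(empirical_risk(h, [(1, 1), (2, 1), (-1, -1), (-2, 1)]))  # 0.25
```

Each additional training error $k$ adds $2/l$ to the bound, on top of the complexity term that already appeared in Theorem 2.4.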
Structural Risk Minimization. As we indicated earlier, when $l$ is large, $R_{emp}$ minimization is consistent. The generalization ability of the ERM principle on small sample sets therefore cannot benefit from the theorems above. Structural Risk Minimization (SRM) is an induction principle that looks for the optimal relationship between the available sample size $l$ and the quality of learning, by using the function $h_s$ chosen from a hypothesis class $H$ together with the characteristics of the hypothesis class $H$ (Vapnik, 1998).

The SRM principle requires a nested sequence of hypothesis classes; smaller classes may not be rich enough to capture the concept of interest. Let $H_1 \subseteq H_2 \subseteq \cdots \subseteq H_n$ be a hierarchically nested sequence of countable hypothesis classes. This is also called a "decomposable concept class" by Linial et al. (1991). Each hypothesis class $H_j$ in the hierarchy is composed of a set of functions $h_i(\mathbf{x}, \mathbf{w}_j)$, $i = 1, \ldots, |H_j|$, with the same finite VC-dimension $d_j$, where $d_1 \le \cdots \le d_j \le \cdots \le d_n$. The SRM principle observes that if $k_j$ is the minimum number of training errors over hypotheses $h_i(\mathbf{x}, \mathbf{w}_j) \in H_j$, then for a fixed training set $S$ the error counts satisfy $k_1 \ge \cdots \ge k_j \ge \cdots \ge k_n$.

Theorem 2.6 can potentially be utilized to select the hypothesis class that minimizes the bound on the risk functional. On the one hand, the number of training errors decreases as the selected hypothesis class of the structure gets more complex. However, according to Theorem 2.6, for overly complex structures the confidence-interval term of the bound grows, and the selected hypothesis class mimics the training set rather than the input space: overfitting occurs. On the other hand, if the selected hypothesis class is not adequately
complex to discover the input space, the hypothesis class underfits the input space. Namely, in SRM, when deciding which structure is best for the learning task, Occam's Razor corresponds to choosing the smallest number of features (the hypothesis class with the smallest VC-dimension) sufficient to explain the data. The tradeoff can be seen in Figure 2.2, adapted from Vapnik (1998, p.81).

Figure 2.2. Risk functional for a structure: the risk is the sum of the empirical risk, which decreases with model complexity $h$, and the confidence interval, which increases with it; the optimum $h^*$ lies between the underfitting region (near $h_1$) and the overfitting region (near $h_n$).

In SVM, the SRM principle is applied over a hierarchically nested sequence of countable, linear hypothesis classes. In order to select the optimal hypothesis class $H_i$ according to Occam's Razor, the tradeoff between empirical risk and confidence interval is measured for hypothesis classes of different complexity, and too much complexity is penalized. An alternative approach to this is the Minimum Description Length (MDL) induction principle (Rissanen, 1978). Basically, the MDL principle sees the learning problem
as a data compression problem in which a simpler description of the data than the data itself is needed for representation. As Grünwald (2004, p.15) states:

"According to Rissanen, the goal of inductive inference should be to 'squeeze out as much regularity as possible' from the given data. The main task for statistical inference is to distill the meaningful information present in the data, i.e. to separate structure (interpreted as the regularity, the 'meaningful information') from noise (interpreted as the 'accidental information')."

In a recent study, Aytug et al. (2003) suggested addressing the tradeoff by using a genetic algorithm and computing MDL for model selection accordingly. Grünwald (2004) offers a very comprehensive literature review on MML (Minimum Message Length) and MDL.

In Vapnik's (1998) framework, a data generator generates data i.i.d. according to a fixed but unknown distribution, and the data-generating machine can be infinitely complex. Despite SRM's algebraic strength, there is a crucial shortcoming of the original principle. Shawe-Taylor et al. (1998, p.1927) quote Vapnik: "According to the SRM principle the structure has to be defined a priori before the training data appear (Vapnik, 1995, p.48)". When using SRM it is assumed that there is a hypothesis class in the structure wherein the target function lies. This may not always be true. In our "Effective VC-Dimension" section, we will include the approaches that address this shortcoming.

Generalization Theory for Linear Classifiers

A linear classification function $f : X \subseteq \mathbb{R}^n \to \mathbb{R}$ maps an input vector $\mathbf{x}$ into a real value. A label is assigned as follows: if $f(\mathbf{x}) \ge 0$ then the vector $\mathbf{x}$ is assigned to the positive class (i.e., $+1$), and to the negative class otherwise. A linear classifier $f(\mathbf{x})$ can be written as:
$f(\mathbf{x}) = \langle\mathbf{w}\cdot\mathbf{x}\rangle + b = \sum_{i=1}^{n} w_i x_i + b$

where $(\mathbf{w}, b)$ are the parameters characterizing the linear classifier (Figure 2.3). The vector $\mathbf{w}$ is the weight vector and $b$ is called the bias term.

Figure 2.3. Two-dimensional separating hyperplane $(\mathbf{w}, b)$, where $\mathbf{w}' = (w_1, w_2)$.

The linear classifier shown in Figure 2.3 has been frequently used in the literature. Neural network researchers have referred to it as the "perceptron", while statisticians preferred the term "linear discriminant" (Cristianini and Shawe-Taylor, 2000). One of the first learning algorithms for linear classifiers was the perceptron algorithm. The details of the perceptron algorithm are not given here but can be found in Cristianini and Shawe-Taylor (2000) or Vapnik (1998). However, the following definitions and Theorem 2.9 are included due to their importance.
Definition 2.8 (Functional Margin): Let $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$ be a linearly separable set. The functional margin of an example $(\mathbf{x}_i, y_i)$ with respect to a hyperplane $(\mathbf{w}, b)$ is the quantity $\gamma_i = y_i(\langle\mathbf{w}\cdot\mathbf{x}_i\rangle + b)$. Moreover, the functional margin distribution of a hyperplane with respect to a training set is the distribution of the margins of the examples in $S$. The functional margin of a hyperplane is $\gamma = \min_{\mathbf{x}_i \in S} \gamma_i$.

Let $R$ be the radius of the $n$-dimensional ball that contains all elements of the non-trivial set $S$. Given a linearly separable set of points, the perceptron algorithm achieves the classification task within a pre-specified number of iterations, as the following theorem states.

Theorem 2.9 (Novikoff): Let $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$ be a non-trivial, linearly separable set and let $R = \max_{\mathbf{x}_i \in S}\|\mathbf{x}_i\|$ be the radius of a ball that contains the set. Suppose that there exists a vector $\mathbf{w}^*$ in canonical form such that $\|\mathbf{w}^*\| = 1$ and $y_i(\langle\mathbf{w}^*\cdot\mathbf{x}_i\rangle + b^*) \ge \gamma$. Then the number of mistakes made by the perceptron algorithm on $S$ is at most $\left(\frac{2R}{\gamma}\right)^2$.

Proof: Different proofs are available; Cristianini and Shawe-Taylor (2000, p.14) give one.

Theorem 2.9 states two important aspects of linear classification. One, scaling of the data does not change the ratio $R/\gamma$, so the algorithm is scale-insensitive. Two, the learning task is easier when the functional margin of the separating hyperplane is
large. The second aspect, due to Novikoff's theorem, is analogous to the results obtained in the following section, where we derive more effective estimates of the VC-dimension of linear hypotheses by taking the observed data into account.

Effective VC-Dimension

In the Vapnik-Chervonenkis (VC) Theory section, Theorem 2.6 showed that the generalization error can be upper bounded by using the VC-dimension as a complexity measure. In fact, Cristianini and Shawe-Taylor (2000) state that the bound is tight up to log factors. For higher VC-dimensions a learner needs a larger training set to achieve low generalization error, and the bound is independent of distribution and data. That is, in a distribution-dependent sense, for some benign distributions it is possible to derive tighter bounds by taking a distribution-dependent VC-dimension measure as domain knowledge into account. The following proposition by Cristianini and Shawe-Taylor (2000) characterizes the relationship between VC-dimension and linear learning machines.

Proposition 2.10 (Cristianini and Shawe-Taylor, 2000, p.57): A linear learning machine of dimension $n$, denoted by $\mathcal{L}_n$, has the following characteristics:

Given any set $S$ of $n+1$ training examples in general position (not lying in an $(n-1)$-dimensional affine subspace), there exists a function in $\mathcal{L}_n$ that consistently classifies $S$, whatever the labelling of the training points in $S$.

For any set of $l \ge n+2$ inputs there is at least one classification that cannot be realised by any function in $\mathcal{L}_n$.
The proposition states that, in a data-independent sense, linear learning machines cannot accomplish the learning task, since for higher dimensions there will be some sets that cannot be classified correctly. Therefore, domain knowledge derived from data may be used to enhance learning.

Covering Numbers

The error bound derivations depend on measuring the hypothesis class $H$ from which a hypothesis $h_s$ will be drawn. In the Probably Approximately Correct Learning section we mentioned that if the expressive power of a hypothesis class is large, even though the hypothesis is consistent with the sample set, the error bound may be high and there is a higher overfitting risk. If the cardinality of the hypothesis class is infinite, however, the cardinality-based bound cannot be applied at all.

In the literature there are different ways of measuring the "size" of a hypothesis class. For example, in earlier sections we noted that the original pac learning paper by Valiant (1984) considered Boolean function mappings, so the number of hypotheses in the hypothesis class was finite. With linear classifiers of real weight vectors, however, the cardinality of $H$ cannot be measured. In the VC Theory section we introduced the VC-dimension as an alternative way of measuring the "expressive power" or "size" of a hypothesis class.

So far, the functional margin on the training set $S$ has not had any effect on the generalization bound computation. However, one would expect that classification problems with large functional margins would be easier. In order to tighten the error bounds further, we need to measure the expressive power of a hypothesis class by taking the functional margin into account.
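Returning briefly to the previous section, the perceptron algorithm whose mistake count Theorem 2.9 bounds can be sketched in a few lines (the training set below is illustrative):

```python
def perceptron(samples, max_epochs=100):
    """Train a perceptron on a linearly separable sample.
    Returns (w, b, mistakes); on separable data Novikoff's theorem
    bounds the number of mistakes by (2R/gamma)**2."""
    n = len(samples[0][0])
    w, b, mistakes = [0.0] * n, 0.0, 0
    for _ in range(max_epochs):
        updated = False
        for x, y in samples:
            # Mistake: the functional margin y*(<w,x> + b) is not positive.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
                updated = True
        if not updated:  # a full pass with no mistakes: converged
            break
    return w, b, mistakes

data = [([2.0, 0.0], 1), ([0.0, 2.0], 1), ([-2.0, 0.0], -1), ([0.0, -2.0], -1)]
w, b, mistakes = perceptron(data)
# The returned hyperplane separates the sample.
```

The wide margin of this sample is what makes the mistake count so small, which is exactly the intuition the covering-number machinery below formalizes.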
To measure the expressive power of the hypothesis class, the notion of "covering numbers" for measuring the massiveness of sets in metric spaces is used. The notion dates back to Kolmogorov and Tihomirov's original work (1961) and is the backbone of the quest to calculate the effective VC-dimension of real-valued function classes by upper bounding their covering numbers.

A metric space is a set with a global distance function, the metric, with which a non-negative distance between every two elements in the set can be calculated. A covering number can be informally defined as the smallest number of sets of radius $\gamma$ with which a set in the metric space can be covered. The concept of covering numbers for real-valued functions was further improved by Bartlett et al. (1997). The formal definition of covering numbers for real-valued function classes is as follows:

Definition 2.11 (Covering Number): Let $F$ be a class of real-valued functions on a domain $X$. A $\gamma$-cover of $F$ with respect to a sequence of inputs $S = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_l)$ is a finite set of functions $B$ such that for all $f \in F$ there exists $g \in B$ such that

$\max_{1 \le i \le l} |f(\mathbf{x}_i) - g(\mathbf{x}_i)| < \gamma.$

The size of the smallest such cover is denoted by $\mathcal{N}(\gamma, F, S)$, while the covering numbers of $F$ are the values

$\mathcal{N}(\gamma, F, l) = \max_{S \subseteq X^l} \mathcal{N}(\gamma, F, S)$ (Cristianini and Shawe-Taylor, 2000, p.59)

If the covering number for a metric space is finite, then the metric space is totally bounded (Anthony and Bartlett, 1994). In a sense, covering numbers enable one to view a class of functions at a certain resolution $\gamma$. In the following section we will see how to
incorporate covering numbers in measuring the expressive power of classifier functions by relating them to the functional margin.

Fat Shattering

In this section we first explain, in Theorem 2.12, the derivation of the generalization error when the sample has a margin of $m_S(f) = \gamma$.

Theorem 2.12: Consider a real-valued function space $F$ and fix $\gamma > 0$. For any probability distribution $\mathcal{D}$ on $X \times \{-1,1\}$, with probability $1-\delta$ over $l$ random examples $S$, any hypothesis $f \in F$ that has margin $m_S(f) \ge \gamma$ on $S$ has error no more than

$err(f) \le \varepsilon(l, F, \delta, \gamma) = \frac{2}{l}\left(\log_2\mathcal{N}(\gamma/2, F, 2l) + \log_2\frac{2}{\delta}\right)$

provided $l > \frac{2}{\varepsilon}$ (Cristianini and Shawe-Taylor, 2000, p.60).

Proof: Previously we noted the doubling trick of Chernoff, introducing a ghost sample $\hat{S}$, according to which the pac bound that contains margin information can be written as follows:

$\mathcal{D}^l\left\{S : \exists f \in F,\ err_S(f) = 0,\ m_S(f) \ge \gamma,\ err(f) > \varepsilon\right\} \le 2\,\mathcal{D}^{2l}\left\{S\hat{S} : \exists f \in F,\ err_S(f) = 0,\ m_S(f) \ge \gamma,\ err_{\hat{S}}(f) > \frac{\varepsilon l}{2}\right\}.$

Now we include covering numbers in this formulation. For a $\gamma/2$-cover $B$ on the sample $S\hat{S}$ satisfying $\max_{\mathbf{x}_i \in S\hat{S}} |f(\mathbf{x}_i) - g(\mathbf{x}_i)| \le \gamma/2$, we know that the function $g$ has zero error on $S$ and a margin greater than $\gamma/2$ there. On the ghost sample $\hat{S}$, where $f$ can make errors, the margin bound $\gamma/2$ need not hold. Following Cristianini and Shawe-Taylor's (2000) notation, if $err^{\gamma/2}_{\hat{S}}(g)$ denotes the
number of errors $f$ made, or the number of points of $\hat{S}$ on which the margin of $g$ falls below $\gamma/2$, then we can write the pac bound as

$2\,\mathcal{D}^{2l}\left\{S\hat{S} : \exists f \in F,\ err_S(f) = 0,\ m_S(f) \ge \gamma,\ err_{\hat{S}}(f) > \frac{\varepsilon l}{2}\right\} \le 2\,\mathcal{D}^{2l}\left\{S\hat{S} : \exists g \in B,\ err^{\gamma/2}_S(g) = 0,\ err^{\gamma/2}_{\hat{S}}(g) > \frac{\varepsilon l}{2}\right\}.$

Previously we used the growth function to measure the complexity of the hypothesis space over the sample; similarly here,

$2\,\mathcal{D}^{2l}\left\{S\hat{S} : \exists g \in B,\ err^{\gamma/2}_S(g) = 0,\ err^{\gamma/2}_{\hat{S}}(g) > \frac{\varepsilon l}{2}\right\} \le 2\,|B|\,2^{-\varepsilon l/2}.$

By the definition of covering numbers, the number of behaviors the classifier can have satisfies $|B| \le \max_{S\hat{S} \subseteq X^{2l}} \mathcal{N}(\gamma/2, F, S\hat{S}) = \mathcal{N}(\gamma/2, F, 2l)$, and the result follows.

The elegance of the expression lies in including the covering numbers in the error bound. Similar to the growth function, the log of $\mathcal{N}(\gamma/2, F, 2l)$ is a measure of effective VC-dimension. The next step is to upper bound the quantity $\mathcal{N}(\gamma/2, F, 2l)$ by using the fat-shattering dimension. As noted earlier, Kolmogorov and Tikhomirov (1961) provided the initial framework for such a bound. First, let us give the definition of fat shattering.

Definition 2.13 (Fat Shattering): Let $F$ be a class of real-valued functions. We say that a set of points $X' \subseteq X$ is $\gamma$-shattered by $F$ if there are real numbers $r_i$, indexed by $\mathbf{x}_i \in X'$, such that for all binary vectors $y$ indexed by $X'$ there is a function $f_y \in F$ satisfying

$f_y(\mathbf{x}_i) \ge r_i + \gamma$ if $y_i = 1$, and $f_y(\mathbf{x}_i) \le r_i - \gamma$ if $y_i = -1.$
The fat-shattering dimension $fat_F(\gamma)$ of the set $F$ is a function from the positive real numbers to the integers which maps a value $\gamma$ to the size of the largest $\gamma$-shattered set, if this is finite, or to infinity otherwise (Shawe-Taylor et al., 1998, p.1929).

As the dimension depends on the value $\gamma$, it is also referred to as the scale-sensitive VC-dimension. Earlier, in Sauer's Lemma, we used the VC-dimension to bound the growth function. Here, similarly, we will show how the fat-shattering dimension is used to upper bound the covering numbers at precision $\gamma$. Different studies for different learning systems have addressed the issue of upper bounding covering numbers, such as Bartlett et al. (1997). Smola (1998) includes some of the ways to upper bound covering numbers by VC-dimension.

The following lemma by Shawe-Taylor et al. (1998) is originally due to Alon et al. (1997), and it provides a bound on the covering numbers in terms of the fat-shattering dimension. Note that the original paper assumes $o = 0$ and $c = 1$ and therefore is slightly different.

Lemma 2.14 (Shawe-Taylor et al., 1998, p.1929): Let $F$ be a class of functions from $X$ into $[o, c]$ and $\mathcal{D}$ a distribution over $X$. Choose $0 < \gamma < 1$ and let $d = fat_F(\gamma/4)$. Then the following holds:

$\log_2\mathcal{N}(\gamma, F, l) \le 1 + d\,\log_2\frac{2el(c-o)}{d\gamma}\,\log_2\frac{4l(c-o)^2}{\gamma^2}.$

If the bound on covering numbers is applied to Theorem 2.12, the generalization error for linear classifiers with large margin can be bounded as the following corollary states.
Corollary 2.15: Consider thresholding a real-valued function space $F$ with range $[0, 1]$ and fix $\gamma > 0$. For any probability distribution $\mathcal{D}$ on $X \times \{-1,1\}$, with probability $1-\delta$ over $l$ random examples $S$, any hypothesis $f \in F$ that has margin $m_S(f) \ge \gamma$ on $S$ has error no more than

$err(f) \le \varepsilon(l, F, \delta, \gamma) = \frac{2}{l}\left(d\log_2\frac{8el}{d}\log_2\frac{32l}{\gamma} + \log_2\frac{4}{\delta}\right)$

provided $l > \frac{2}{\varepsilon}$ and $d \le l$, where $d = fat_F(\gamma/8)$ (Cristianini and Shawe-Taylor, 2000).

The generalization error can therefore be bounded once the fat-shattering dimension is bounded. Here we constrain ourselves to linear learning machines. There have been several studies bounding the fat-shattering dimension for linear learning machines, such as Bartlett and Shawe-Taylor (1999), Gurvits (2001), and Hush and Scovel (2003). The following version is due to Bartlett and Shawe-Taylor (1999).

Theorem 2.16: Suppose that $X$ is a ball of radius $R$ in a Hilbert space $\mathbb{H}$, $X = \{\mathbf{x} \in \mathbb{H} : \|\mathbf{x}\| \le R\}$, and consider the set of functions

$F = \{\mathbf{x} \mapsto \langle\mathbf{w}\cdot\mathbf{x}\rangle : \|\mathbf{w}\| \le 1,\ \mathbf{x} \in X\}.$

Then the fat-shattering dimension can be upper bounded as

$fat_F(\gamma) \le \left(\frac{R}{\gamma}\right)^2.$

Proof: Bartlett and Shawe-Taylor (1999).

Above, note that the classifier function does not include the bias term (i.e., the classifier passes through the origin). The fat-shattering dimension function is from the positive real numbers to the integers, mapping the geometric margin into the size of the
largest shattered set. Finally, we can state an error bound for the real-valued function space $F$ defined in Theorem 2.16, by combining Theorem 2.16 and Corollary 2.15.

Corollary 2.17: Consider thresholding a real-valued function space $F$ with range $[0, 1]$ and fix $\gamma > 0$. For any probability distribution $\mathcal{D}$ on $X \times \{-1,1\}$, with probability $1-\delta$ over $l$ random examples $S$, if $R$ is the radius of the smallest ball containing them, any hypothesis $f \in F$ that has margin $m_S(f) \ge \gamma$ on $S$ has error no more than

$err(f) \le \varepsilon(l, F, \delta, \gamma) = \frac{2}{l}\left(\frac{64R^2}{\gamma^2}\log_2\frac{el\gamma^2}{8R^2}\log_2\frac{32l}{\gamma} + \log_2\frac{4}{\delta}\right)$

provided $l > \frac{2}{\varepsilon}$ and $l \ge \frac{64R^2}{\gamma^2}$.

The above bound has important implications. First of all, by incorporating the margin of the linear classifier, one can take advantage of a tighter generalization error bound for benign distributions with large margins. Secondly, in Theorem 2.4 the bound depended on the VC-dimension. Unlike a bound depending merely on the VC-dimension, this bound is dimension-free, and it suggests that for benign distributions, even in infinitely large dimensions, it may be possible to derive tight generalization bounds.

Luckiness Framework

According to the original SRM framework by Vapnik, a decomposable concept class structure is determined before the data appear, and the target function to be learned lies within the structure. The most suitable hypothesis class is selected according to the criterion defined in Theorem 2.6. Depending only on the VC-dimension and the number of errors made, that generalization bound fails, to an extent, to incorporate the large-margin phenomenon, in the sense that patterns that are further apart from each other would
intuitively be easier to recognize. Therefore, the error bound of the previous section, which depends on the fat-shattering dimension, potentially offers the advantage of incorporating the easiness of separability of a sample when a large margin is observed. A learner calculates the effective VC-dimension after observing the data, choosing between the VC-dimension and the fat-shattering dimension depending on the "easiness" of the sample. This tradeoff was originally given in Vapnik (1995) for a subset of the input space that is contained in a ball. Shawe-Taylor et al. (1998) extend this to bound the fat-shattering dimension of the class of hyperplanes. The following version is valid for linear classifier functions.

Lemma 2.18: Let $F$ be the set of linear functions with unit weight vectors, whose input space $X$ is contained in a ball of radius $R$ centered at the origin, and let the bias (or threshold) $b$ be bounded in absolute value by $R$:

$$F = \{\mathbf{x} \mapsto \langle\mathbf{w},\mathbf{x}\rangle + b : \|\mathbf{w}\| \le 1,\ \mathbf{x} \in X\}, \quad X = \{\mathbf{x} \in H : \|\mathbf{x}\| \le R\}, \quad |b| \le R.$$

Then the fat-shattering function can be bounded from above by

$$\mathrm{fat}_F(\gamma) \le \min\left(\frac{9R^2}{\gamma^2},\, n\right) + 1.$$

Proof: Shawe-Taylor et al. (1998, p. 1932).

The difference between the fat-shattering dimension bound given above and Theorem 2.16 is due to the bias term in the hyperplanes.

In a recent study by Hush and Scovel (2003), the result of Lemma 2.18 was improved, and the upper and lower bounds on the fat-shattering dimension were essentially tightened to equality.
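These bounds are straightforward to evaluate numerically. The following sketch (Python; the function names and the illustrative values of $R$, $\gamma$, $n$, $l$, and $\delta$ are our own choices, not from the text) computes the bias-free bound of Theorem 2.16 and the biased bound of Lemma 2.18, and plugs a fat-shattering estimate into the error bound of Corollary 2.15, with logarithms taken base 2 as in Cristianini and Shawe-Taylor (2000).

```python
import math

def fat_no_bias(R, gamma):
    # Theorem 2.16 (Bartlett and Shawe-Taylor, 1999): fat_F(gamma) <= (R/gamma)^2
    return (R / gamma) ** 2

def fat_with_bias(R, gamma, n):
    # Lemma 2.18 (Shawe-Taylor et al., 1998): fat_F(gamma) <= min(9 R^2/gamma^2, n) + 1
    return min(9.0 * R ** 2 / gamma ** 2, n) + 1

def error_bound(d, l, delta):
    # Corollary 2.15: err(f) <= (2/l)(d log(8el/d) log(32l) + log(4/delta))
    return (2.0 / l) * (d * math.log2(8 * math.e * l / d) * math.log2(32 * l)
                        + math.log2(4.0 / delta))

print(fat_no_bias(100, 50))         # (100/50)^2 = 4.0
print(fat_with_bias(100, 50, 20))   # min(36, 20) + 1 = 21
print(error_bound(4, 100000, 0.1))
```

With a large sample ($l = 100{,}000$) and a small fat-shattering dimension, the Corollary 2.15 bound is informative (well below 1); with small $l$ it quickly becomes vacuous, which is the phenomenon Table 3-1 later illustrates.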
Lemma 2.19: Let $H$ denote a pre-Hilbert space of dimension $n$, and let $F$ denote the class of functions on $X$ defined by $F = \{\mathbf{x} \mapsto \langle\mathbf{w},\mathbf{x}\rangle + b : \|\mathbf{w}\| \le 1,\ \mathbf{x} \in X\}$ with $|b| \le R$. Then

$$\max\left(\min\left(\frac{R^2}{\gamma^2},\, n\right),\, 1\right) \le \mathrm{fat}_F(\gamma) \le \min\left(\frac{5R^2}{4\gamma^2},\, n\right) + 1.$$

Proof: Hush and Scovel (2003).

Shawe-Taylor et al. (1998) give a generalization bound based on Lemma 2.18. Here we include a slightly different version of the bound by incorporating the results of Hush and Scovel (2003). The generalization bound can now be derived by combining Lemma 2.19 and Corollary 2.15.

Definition 2.20 (Set Closure): The closure of a set $A$ is the smallest closed set containing $A$.

Definition 2.21 (Support of a Distribution): The support of a distribution is the smallest closed set whose complement has probability zero.

Theorem 2.22: Consider thresholding linear functions with real-valued unit weight vectors on an input space $X$. Assume that all inputs $(\mathbf{x}, y) \in X \times \{-1,1\}$ are drawn identically and independently according to an unknown distribution $\mathcal{D}$ whose support is contained in a ball of radius $R$ in $\mathbb{R}^n$ centered at the origin. With probability $1-\delta$ over a training set $S$ with $l$ random examples, any hypothesis that has a margin $m_S(f)$ of $\gamma$ or bigger has error no more than

$$err(f) \le \epsilon(l, F, \delta, \gamma) = \frac{2}{l}\left( d\log\frac{8el}{d}\log(32l) + \log\frac{4}{\delta} \right)$$
provided $l \ge 2/\epsilon$ and $\dfrac{5\cdot 8^2 R^2}{4\gamma^2} \le l$, where $d = \dfrac{5\cdot 8^2 R^2}{4\gamma^2} = \dfrac{80R^2}{\gamma^2}$.

The theorem assumes that the entire input space $X$ is included in a ball, denoted by $B_a$, and the generalization error depends on the radius of $B_a$. On the one hand, the radius of the ball containing the input space may not be calculable, as we allow $\mathcal{D}$ to be any distribution function according to which inputs are independently and identically drawn. On the other hand, we do observe the radius of a ball that contains all of the points in the set $S$. One may expect that a more meaningful generalization error bound should rely on the latter ball, rather than on the ball that contains the entire input space.

This important shortcoming was first addressed in Shawe-Taylor et al. (1998). Intuitively, with $R$ fixed, a potential generalization error bound that depends on the assumption "sample points are contained in a ball of radius $R$ centered at the origin" would yield a somewhat looser bound than one depending on the assumption "the entire input space is contained in a ball of radius $R$ centered at the origin". They defined a luckiness framework to recover the results of Theorem 2.22 by bounding the ratio of unseen data points that may not be contained in the ball of radius $R$ that contains the sample set.

Luckiness Framework for Maximal Margin Hyperplanes

By using SRM's decomposable class structure above, we described how one can approximate the ideal target function by favoring functions that perform well on a sample set. Theorem 2.6, SRM's initial framework, and Lemma 2.14 by Alon et al. (1997) together allow us to quantify some characteristics of our observations by mapping them onto one measure, the generalization error bound, for a particular class in the decomposable
structure. The luckiness (or unluckiness) framework works by measuring the luckiness (or unluckiness) of a hypothesis based on the sample. A luckiness function $L : S \times H \to \mathbb{R}^+$ and an unluckiness function $U : S \times H \to \mathbb{R}^+$ provide a means to measure the performance of a specific hypothesis class through a quantity called the "level of a function", defined as follows.

Definition 2.23 (Level of a Function): The level of a linear function $f \in F$ relative to $L$ (or $U$) and $S$ is

$$\ell(S, f) = \left|\left\{ (g(\mathbf{x}_1), \ldots, g(\mathbf{x}_l)) \in \{-1,1\}^l : g \in F,\ L(S, g) \ge L(S, f) \right\}\right|$$

or, respectively,

$$\ell(S, f) = \left|\left\{ (g(\mathbf{x}_1), \ldots, g(\mathbf{x}_l)) \in \{-1,1\}^l : g \in F,\ U(S, g) \ge U(S, f) \right\}\right|.$$

In the following, we state definitions that will prove useful later when we attempt to derive error bounds in the luckiness framework.

Definition 2.24 ($\eta$-subsequence): An $\eta$-subsequence of a vector $\mathbf{x} \in \mathbb{R}^n$ is a vector $\mathbf{x}'$ that is obtained by deleting a fraction $\eta$ of the coordinates of $\mathbf{x}$. A partitioned vector $\mathbf{x}\mathbf{y}$ is of the form $(x_1, \ldots, x_n, y_1, \ldots, y_m)$, and an $\eta$-subsequence $\mathbf{x}'\mathbf{y}'$ of a partitioned vector, with $\mathbf{x}'$ obtained from $\mathbf{x}$ and $\mathbf{y}'$ from $\mathbf{y}$, corresponds to a deletion of coordinates. For instance, for the vectors $\mathbf{x}_i = (x_{i1}, \ldots, x_{in})$ and $S = (\mathbf{x}_1, \ldots, \mathbf{x}_l)$ defined for the sample set $S$, an $\eta$-subsequence of the set is obtained by deleting a fraction $\eta$ of the coordinates of a subset of $S$.

Definition 2.25 (Probable Smoothness): Let $S = (\mathbf{x}_1, \ldots, \mathbf{x}_l)$ and $\hat{S} = (\mathbf{x}_{l+1}, \ldots, \mathbf{x}_{2l})$. A luckiness (or unluckiness) function is probably smooth with
respect to two functions $\eta(l, L, \delta)$ and $\varphi(l, L, \delta)$ if for every distribution $\mathcal{D}$ the following holds:

$$\mathcal{D}^{2l}\left\{ S\hat{S} : \exists f \in F,\ err_S(f) = 0,\ \ell(S\hat{S}', f) > \varphi(l, L(S,f), \delta) \text{ for every } \eta(l, L(S,f), \delta)\text{-subsequence } \hat{S}' \text{ of } \hat{S} \right\} \le \delta,$$

and analogously with the unluckiness function $U$ in place of $L$.

The idea is to measure probable smoothness from the sample and estimate the luckiness or unluckiness from the first half of the double sample, in order to bound the growth function by $\varphi(l, L(S,f), \delta)$. In other words, if the function $f$ is probably smooth, then there are at most $\varphi(l, L(S,f), \delta)$ luckier classes with no errors.

Theorem 2.26 (Probable Smoothness Theorem): Suppose $p_i$, $i = 1, \ldots, 2l$, are positive numbers satisfying $\sum_{i=1}^{2l} p_i = 1$, $L$ is a luckiness function for a function class $H$ that is probably smooth with respect to functions $\eta$ and $\varphi$, and $0 < \delta < 1/2$. For any target function $t \in H$ and any distribution $\mathcal{D}$, with probability $1-\delta$ over $l$ independent examples $S$ chosen according to $\mathcal{D}$: if for some $i$ a learner finds a hypothesis $h$ in $H$ with $err_S(h) = 0$ and $\varphi(l, L(S,h), \delta p_i/4) \le 2^i$, then the generalization error of $h$ satisfies $err(h) \le \epsilon(l, \delta, i)$, where

$$\epsilon(l, \delta, i) = \frac{2}{l}\left( i + \log\frac{4}{\delta p_i} \right) + 4\,\eta\!\left(l, L(S,h), \frac{\delta p_i}{4}\right).$$

Proof: Shawe-Taylor et al. (1998, p. 1934).
The original study (Shawe-Taylor et al., 1998) illustrated the luckiness framework with four examples, including one that considers a VC-dimension unluckiness (a big growth function corresponding to being unlucky), and another that concerns maximal margin hyperplanes. We will focus only on the latter. First, we summarize their study in their context, and then we attempt to provide error bounds that depend on the new upper bounds on the fat-shattering dimension, as well as to correct two simple algebraic errors we were able to detect in the original study.

Definition 2.27 (Unluckiness for Maximal Margin Hyperplanes): If $f$ is a linear threshold function in canonical form with respect to a sample $S$, we define an unluckiness function as

$$U(S, f) = \frac{R^2}{\gamma^2}.$$

Proposition 2.27: Consider a class of linear separating hyperplanes $F$ defined on $X$ as $F = \{\mathbf{x} \mapsto \langle\mathbf{w},\mathbf{x}\rangle + b : \|\mathbf{w}\| \le 1,\ \mathbf{x} \in S\}$ with $|b| \le R$ and $R = \max_{1\le i\le l}\|\mathbf{x}_i\|$. The unluckiness function for maximal margin hyperplanes is probably smooth with

$$\eta(l, U, \delta) = \frac{9U}{l}\log\frac{2el}{9U}$$

and with

$$\log_2\varphi(l, U, \delta) = d\log\frac{8el}{d}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + 1, \quad \text{where } d = 1297U.$$

Proof: Shawe-Taylor et al. (1998, p. 1936).

Combining Theorem 2.26 with Proposition 2.27, Shawe-Taylor et al. (1998) obtain the generalization error in the luckiness framework.

Corollary 2.28: Suppose $p_i$, $i = 1, \ldots, 2l$, are positive numbers satisfying $\sum_{i=1}^{2l} p_i = 1$, and $0 < \delta < 1/2$. For any target function $t \in F$ and any distribution
$\mathcal{D}$, with probability $1-\delta$ over $l$ independent examples $S$ chosen according to $\mathcal{D}$, a probability distribution on $X$: if for some $i$ a learner finds a hypothesis $f$ in $F$ with $err_S(f) = 0$ and $\varphi(l, U(S,f), \delta p_i/4) \le 2^i$, then the generalization error of $f$ satisfies $err(f) \le \epsilon(l, \delta, i)$, with

$$\epsilon(l, \delta, i) = \frac{2}{l}\left( 1297\,U\log\frac{8el}{1297\,U}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + \log\frac{4}{\delta p_i} \right) + \frac{36U}{l}\log\frac{2el}{9U},$$

where $U = U(S,f)$ is the unluckiness function. (Shawe-Taylor et al., 1998, p. 1936)

Now, we attempt to state rather novel versions of Proposition 2.27 and Corollary 2.28 in our framework, as Proposition 2.29 and Corollary 2.30.

Proposition 2.29: Consider a class of linear separating hyperplanes $F$ defined on $X$ as $F = \{\mathbf{x} \mapsto \langle\mathbf{w},\mathbf{x}\rangle + b : \|\mathbf{w}\| \le 1,\ \mathbf{x} \in S\}$ with $|b| \le R$ and $R = \max_{1\le i\le l}\|\mathbf{x}_i\|$. The unluckiness function for maximal margin hyperplanes is probably smooth with

$$\eta(l, U, \delta) = \frac{45U}{4l}\log\frac{8el}{45U}$$

and with

$$\log_2\varphi(l, U, \delta) = d\log\frac{8el}{d}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + 1, \quad \text{where } d = 180U.$$

Proof: We follow Shawe-Taylor's proof with the new fat-shattering bounds. The proof is composed of two parts. In the first part, we calculate, with probability $1-\delta$, $\eta_1(l, U, \delta)$, the fraction of points in the double sample for which the ball-of-radius-$R$
constraint is violated. In the second part, we calculate $\eta_2(l, U, \delta)$, the fraction of points in the second half of the sample that are either not correctly classified or have margin less than $\gamma/3$. The result follows by summing $l\,\eta_1(l, U, \delta)$ and $l\,\eta_2(l, U, \delta)$.

First part: Consider the class $\{f_R : R \in \mathbb{R}^+\}$ with

$$f_R(\mathbf{x}) = \begin{cases} 1 & \text{if } \|\mathbf{x}\| \ge R, \\ 0 & \text{otherwise.} \end{cases}$$

The class has VC-dimension 1. The permutation argument requires, for $l \ge d$,

$$B(2l, d) = \sum_{i=0}^{d}\binom{2l}{i},$$

where $d$ is the VC-dimension; for $d = 1$, $B(2l, 1) = 2l + 1$. We write the pac statement with confidence level $1-\delta/2$ as

$$\mathcal{P}^l\left\{ S : \exists f_R \text{ consistent and } err(f_R) > \eta_1(l, U, \delta) \right\} \le 2\,B(2l, 1)\,e^{-\eta_1 l / 2} \le \frac{\delta}{2}.$$

By using the doubling trick of Chernoff, we can rewrite this as

$$\mathcal{P}^l\left\{ S : \exists f_R,\ err_S(f_R) = 0,\ err(f_R) > \eta_1 \right\} \le 2\,\mathcal{P}^{2l}\left\{ S\hat{S} : \exists f_R,\ err_S(f_R) = 0,\ err_{\hat{S}}(f_R) > \frac{\eta_1}{2} \right\};$$

hence we obtain

$$\eta_1 = \frac{1}{l}\left( 2\log\frac{4}{\delta} + \log(2l) + 1 \right).$$

Note that $\eta_1$ in Shawe-Taylor et al. (1998), given as

$$\eta_1 = \frac{1}{l}\left( 2\log\frac{2}{\delta} + \log(2l) + 1 \right),$$

seems to be due to an algebraic mistake.

Second part:
Now, for the points of the double sample that are contained inside the ball, we estimate how many are closer to the hyperplane than $\gamma/3$. Shawe-Taylor et al. (1998) use their equivalent version of Theorem 2.22 above to derive a bound with $\gamma/3$. First, we mention their results in their framework, and then we immediately indicate the bounds in our framework. Combining the fat-shattering dimension bound in Lemma 2.18 and Lemma 3.9 in Shawe-Taylor et al. (1998, p. 1932), with a confidence level of $1-\delta/2$ and for $\gamma/3$, the second part immediately follows:

$$\frac{2}{l}\left( d\log\frac{8el}{d}\log(32l) + \log\frac{4}{\delta} \right) \quad \text{for} \quad d = \frac{1297R^2}{\gamma^2} = 1297U,$$

and, combining the two parts,

$$\frac{2}{l}\left( d\log\frac{8el}{d}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) \right).$$

Alternatively, in our framework, we make use of Corollary 2.15, Lemma 2.19, and Theorem 2.22 as follows. With

$$\mathrm{fat}_F(\gamma) \le \frac{5R^2}{4\gamma^2},$$

for $\gamma/3$ and with confidence level $1-\delta/2$, in Theorem 2.22 we get

$$\eta_2 = \frac{2}{l}\left( d\log\frac{8el}{d}\log(32l) + \log\frac{8}{\delta} \right), \quad \text{where} \quad d = \frac{5\cdot 144\,R^2}{4\gamma^2} = \frac{180R^2}{\gamma^2}.$$

Combining the two parts, we have

$$\frac{2}{l}\left( d\log\frac{8el}{d}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + 1 \right).$$
Corollary 2.30: Suppose $p_d$, for $d = 1, \ldots, 2l$, are positive numbers satisfying $\sum_{d=1}^{2l} p_d = 1$. Suppose $0 < \delta < 1/2$, the target function $t \in F$, and $\mathcal{D}$ is a probability distribution on the input space $X$. With confidence level $1-\delta$ over the sample set $S$ with $l$ elements, if the sample set is consistent with a hypothesis $f$, that is, $err_S(f) = 0$, then the generalization error of $f$ is no more than

$$\epsilon(l, \delta, i) = \frac{2}{l}\left( 180\,U\log\frac{8el}{180\,U}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + 1 + \log\frac{4}{\delta p_i} \right) + \frac{45U}{l}\log\frac{8el}{45U}.$$

Corollary 2.30 gives the generalization error for the classification problem in which the support of the sample set lies in a ball centered at the origin with radius $R$. Theorem 2.22, in contrast, gives the generalization error for the problem in which the entire input space is included inside such a ball. It must be noted that the bound in Corollary 6.26 of Shawe-Taylor et al. (1998), which appears as

$$\epsilon(l, \delta, i) = \frac{2}{l}\left( 1297\,U\log\frac{8el}{1297\,U}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + \log\frac{4}{\delta p_i} \right) + \frac{36U}{l}\log\frac{2el}{9U},$$

is different from the one we give in Corollary 2.30. The error term in their framework has a "+1" missing due to an algebraic mistake, which makes the error bound

$$\epsilon(l, \delta, i) = \frac{2}{l}\left( 1297\,U\log\frac{8el}{1297\,U}\log(32l) + 2\log\frac{4}{\delta} + \log(2l) + 1 + \log\frac{4}{\delta p_i} \right) + \frac{36U}{l}\log\frac{2el}{9U}.$$
The difference between the two bounds is due to two main reasons. First, we use the generalization bound theorems based on $\{-1,1\}$ classification given in Cristianini and Shawe-Taylor (2000) as our basis; Shawe-Taylor et al. (1998), however, consider a $\{0,1\}$ scheme with thresholding and folding arguments integrated into their bounds. Second, we also make use of the tighter fat-shattering bounds due to Hush and Scovel (2003).

Support Vector Machines Classification

In the previous section we derived the necessary expressions to pick a hypothesis class based on generalization error. We also know how to estimate the generalization performance of a linear classifier, given the hypothesis class. We are now in a position to devise an efficient way of learning and computing those hyperplanes. In this section we briefly introduce Support Vector Machines for the hard margin case, where the training set $S$ is assumed to be linearly separable. For the soft margin case, regression, and other variations of SVM, we recommend Cristianini and Shawe-Taylor (2000).

Following our previously defined notation, let $\mathbf{w}$ be a weight vector, and let $\mathbf{x}^-$ and $\mathbf{x}^+$ be a negative and a positive point, respectively. Assume that a functional margin of 1 is realized. Therefore, one can write

$$\langle\mathbf{w}, \mathbf{x}^+\rangle + b = +1, \qquad \langle\mathbf{w}, \mathbf{x}^-\rangle + b = -1,$$

which implies a geometric margin of

$$\gamma = \frac{1}{2}\left\langle \frac{\mathbf{w}}{\|\mathbf{w}\|},\ \mathbf{x}^+ - \mathbf{x}^- \right\rangle = \frac{1}{\|\mathbf{w}\|}.$$

In order to find the maximum margin hyperplane, by using $\gamma = 1/\|\mathbf{w}\|$ one must solve the following "hard margin" problem.

Proposition 2.31: Given a linearly separable set $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$, the maximum margin hyperplane is the solution of the following problem
$$\min_{\mathbf{w},\,b}\ \langle\mathbf{w},\mathbf{w}\rangle \quad \text{s.t.} \quad y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) \ge 1 \ \text{ for } i = 1, \ldots, l.$$

Proof: Replacing $\max_{\mathbf{w},b}\gamma$ with $\min_{\mathbf{w},b}\|\mathbf{w}\|$, the result immediately follows.

It turns out that the corresponding dual of the above optimization problem has certain characteristics that make it appealing to work with, which we will mention after stating the dual problem. The following proposition gives the dual form of the optimal hyperplane formulation.

Proposition 2.32: Given a linearly separable set $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$, consider the following problem with solution $\boldsymbol{\alpha}^*$ realizing the maximum:

$$\begin{aligned} \max_{\boldsymbol{\alpha}} \quad & \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle\mathbf{x}_i, \mathbf{x}_j\rangle \\ \text{s.t.} \quad & \sum_{i=1}^{l} y_i\alpha_i = 0, \\ & \alpha_i \ge 0, \quad i = 1, \ldots, l. \end{aligned}$$

Then the weight vector $\mathbf{w}^* = \sum_{i=1}^{l} y_i\alpha_i^*\mathbf{x}_i$ solves the maximal margin hyperplane problem stated in Proposition 2.31.

Proof: The Lagrangian for the problem stated in Proposition 2.31 is

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle - \sum_{i=1}^{l}\alpha_i\left[ y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) - 1 \right]$$

with Lagrange multipliers $\alpha_i \ge 0$. Now, in order to compute the dual problem statement, we differentiate the Lagrangian with respect to $\mathbf{w}$ and $b$:

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{l} y_i\alpha_i\mathbf{x}_i = 0, \quad \text{therefore} \quad \mathbf{w} = \sum_{i=1}^{l} y_i\alpha_i\mathbf{x}_i;$$
$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = -\sum_{i=1}^{l} y_i\alpha_i = 0, \quad \text{therefore} \quad \sum_{i=1}^{l} y_i\alpha_i = 0.$$

Also note that the Karush-Kuhn-Tucker (KKT) conditions are

$$\alpha_i^*\left[ y_i(\langle\mathbf{w}^*,\mathbf{x}_i\rangle + b^*) - 1 \right] = 0, \quad i = 1, \ldots, l.$$

Plugging the expressions above into the original problem statement gives

$$\begin{aligned} L(\mathbf{w}, b, \boldsymbol{\alpha}) &= \frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle - \sum_{i=1}^{l}\alpha_i\left[ y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) - 1 \right] \\ &= \sum_{i=1}^{l}\alpha_i + \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle - \sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle \\ &= \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle\mathbf{x}_i,\mathbf{x}_j\rangle. \end{aligned}$$

Definition 2.33 (Support Vectors): The training points $\mathbf{x}_i \in S$ for which the constraint in Proposition 2.31 is satisfied with equality, that is, the $\mathbf{x}_i$ satisfying

$$y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) = 1,$$

are called support vectors. Denoted by $sv$, the set of support vectors consists of the points that are closest to the maximum margin hyperplane.

The bias does not explicitly appear in the dual formulation; it can, however, be calculated easily from the primal constraints:

$$b^* = -\frac{1}{2}\left( \max_{y_i = -1}\langle\mathbf{w}^*,\mathbf{x}_i\rangle + \min_{y_i = +1}\langle\mathbf{w}^*,\mathbf{x}_i\rangle \right).$$

Also, the maximal margin hyperplane can be expressed in dual representation as

$$f(\mathbf{x}, \boldsymbol{\alpha}^*, b^*) = \sum_{i=1}^{l} y_i\alpha_i^*\langle\mathbf{x}_i,\mathbf{x}\rangle + b^*.$$

The KKT conditions indicate that the $\alpha_i$ values are non-zero only for the input vectors that lie closest to the hyperplane, as all other points would
automatically have a functional margin greater than one. Therefore, there are at least two closest points, and the number of non-zero $\alpha_i$ values equals the number of support vectors.

Proposition 2.34: Given a linearly separable set $S = ((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_l, y_l))$, the maximum margin hyperplane problem

$$\min_{\mathbf{w},\,b}\ \langle\mathbf{w},\mathbf{w}\rangle \quad \text{s.t.} \quad y_i(\langle\mathbf{w},\mathbf{x}_i\rangle + b) \ge 1 \ \text{ for } i = 1, \ldots, l$$

has a solution with geometric margin

$$\gamma = \left(\sum_{i\in sv}\alpha_i^*\right)^{-1/2},$$

where $sv$ is the set of support vectors.

Proof: The result follows from the KKT conditions:

$$\langle\mathbf{w}^*,\mathbf{w}^*\rangle = \sum_{j\in sv} y_j\alpha_j^*\langle\mathbf{w}^*,\mathbf{x}_j\rangle = \sum_{j\in sv}\alpha_j^*\,y_j\left(\langle\mathbf{w}^*,\mathbf{x}_j\rangle + b^*\right) - b^*\sum_{j\in sv} y_j\alpha_j^* = \sum_{j\in sv}\alpha_j^*,$$

since $y_j(\langle\mathbf{w}^*,\mathbf{x}_j\rangle + b^*) = 1$ for $j \in sv$ and $\sum_{j\in sv} y_j\alpha_j^* = 0$.

Note that, according to Proposition 2.31, the maximal margin hyperplane then satisfies

$$\gamma = \frac{1}{\|\mathbf{w}^*\|} = \left(\sum_{i\in sv}\alpha_i^*\right)^{-1/2}.$$

In summary, in this chapter we reviewed generalization error bounds. The error bounds we reviewed were in pac-like form, and in a distribution-free sense. We started from a general hypothesis class and, towards the end of the chapter, focused on maximal
margin hyperplanes, as they are of greatest importance for support vector machines. We then included the support vector machine formulation at the end.
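To make the formulation concrete, the sketch below (Python with NumPy; the dataset, step size, and iteration count are our own illustrative choices) solves the hard-margin dual of Proposition 2.32 for a tiny separable set by plain gradient ascent with clipping at zero. The four points are placed symmetrically, so the equality constraint $\sum_i y_i\alpha_i = 0$ is preserved automatically by the updates; a general-purpose solver would have to enforce it explicitly.

```python
import numpy as np

# Symmetric, linearly separable toy set: positives at x1 = +1, negatives at x1 = -1.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = X @ X.T                          # Gram matrix <x_i, x_j>
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y_i y_j <x_i, x_j>

alpha = np.zeros(4)
eta = 0.1                            # step size
for _ in range(500):
    grad = 1.0 - Q @ alpha           # gradient of W(alpha) = sum(alpha) - 0.5 a'Qa
    alpha = np.maximum(alpha + eta * grad, 0.0)   # ascent step, clip at alpha >= 0

w = (alpha * y) @ X                  # w* = sum_i y_i alpha_i x_i  (Proposition 2.32)
b = -0.5 * (max(w @ x for x, t in zip(X, y) if t == -1)
            + min(w @ x for x, t in zip(X, y) if t == 1))
margin = 1.0 / np.linalg.norm(w)     # geometric margin = 1/||w*||
margin_dual = alpha.sum() ** -0.5    # Proposition 2.34: (sum of alphas)^(-1/2)

print(alpha, w, b, margin, margin_dual)
```

On this toy set all four points are support vectors, the separating hyperplane is $x_1 = 0$, and the two margin expressions ($1/\|\mathbf{w}^*\|$ and $(\sum_i\alpha_i^*)^{-1/2}$) agree, as Proposition 2.34 requires.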
CHAPTER 3
DOMAIN SPECIFIC KNOWLEDGE WITH SUPPORT VECTOR MACHINES

Wolpert and Macready (1995) and Wolpert (1996a, 1996b, 2001) analyze different algorithms from a skeptical point of view, similar to that of David Hume's "Problem of Induction". The resulting No Free Lunch theorems investigate the performance of various algorithms in the areas of optimization, search, and learning. In general, they indicate that the structure of the solution space is crucial in selecting a good methodology (Wolpert and Macready, 1995).

Wolpert (1996a and 2001) studies the "no free lunch" (NFL) theorems for learning and claims that a learning methodology that depends only on a low empirical misclassification rate, a small VC-dimension, and a large training set cannot guarantee a small generalization error. Wolpert (1996a) analyzes the "off-training-set" (OTS) generalization error of supervised learning mathematically. In his follow-up paper, Wolpert (1996b) claims that if no assumptions are made about the target, then the generalization performance has no assurance. In summary, Wolpert (2001) claims two things. One, for a learning algorithm there are as many situations in which it is superior to another algorithm as vice versa. Two, the generalization performance depends on how much the selected classifier, based on prior belief or the sample set, actually coincides with the posterior distribution, or target function. According to NFL, no learning algorithm performs better than random guessing over all possible learning situations. For a detailed survey of the NFL theorems, one can refer to Whitley and Watson (2004).
From our point of view, the NFL theorems indicate that taking as much information as possible about the input space into account may enhance learning, since some prior knowledge about the target may increase a learning algorithm's chances of constructing the target concept more accurately. Therefore, regardless of the classification methodology, incorporating prior knowledge in machine learning is of importance.

Review of Relevant SVM Literature with Domain Specific Knowledge

On the complementary side of the NFL theorems stands the artificial intelligence community's "Knowledge is Power" phrase, attributed to Dr. Edward A. Feigenbaum, a pioneer in AI (Augier and Vendele, 2002). From our point of view, in the machine learning context the phrase complements the basic idea of the NFL theorems.

In the general machine learning framework there are many studies on incorporating "prior knowledge". For example, Niyogi et al. (1998) study "prior information" from the standpoint of the sufficiency of the size of the training set. They show that in the absence of prior information, a larger training set might be needed to learn well.

In the support vector machines literature, however, there have been only a few attempts to incorporate domain knowledge. Support vector machine research is relatively young; it is not a surprise that incorporating domain knowledge in SVM is even younger and remains a fertile research area.

The first successful study on incorporating domain knowledge was due to Schölkopf et al. (1996). They extract support vectors from the examples, and generate artificial examples by applying invariance transformations to generate so-called virtual support vectors to increase accuracy.
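The virtual-support-vector idea can be sketched in a few lines. In the fragment below (Python with NumPy; the chosen invariance, a small translation along each axis, and all names are our own illustration, not Schölkopf et al.'s actual transformation set), each support vector is expanded into label-preserving transformed copies, and the enlarged set would then be used for retraining.

```python
import numpy as np

def virtual_support_vectors(sv_X, sv_y, shift=0.1):
    """Apply small translations (an assumed invariance of the task) to each
    support vector, producing label-preserving virtual examples."""
    n, d = sv_X.shape
    offsets = np.vstack([np.eye(d) * shift, -np.eye(d) * shift])  # +/- shift per axis
    vx, vy = [], []
    for x, label in zip(sv_X, sv_y):
        for o in offsets:
            vx.append(x + o)       # translated copy of the support vector
            vy.append(label)       # the label is assumed invariant under the shift
    return np.array(vx), np.array(vy)

sv_X = np.array([[1.0, 0.0], [-1.0, 0.0]])   # two support vectors
sv_y = np.array([1.0, -1.0])
VX, VY = virtual_support_vectors(sv_X, sv_y)
print(VX.shape)   # 2 support vectors x 4 offsets each = (8, 2)
```

The design choice here is to transform only the support vectors rather than the whole training set: since only points near the boundary can change the solution, this keeps the retraining problem small.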
According to our review of the SVM literature on incorporating domain knowledge, there are mainly two ways of doing so in SVM:

- incorporating domain knowledge with kernels, and
- incorporating domain knowledge in the SVM formulation.

One might see the "fat-shattering" dimension as a "domain-dependent" measure, and therefore as another way of incorporating domain knowledge. However, it is data-set dependent rather than exhaustively "domain-dependent".

In the support vector machine literature, kernels represent the main method of capturing domain knowledge. In theory, kernels have great potential to tailor SVM learning to the problem at hand. However, many SVM applications using kernels merely compare different generic kernels in terms of their relative performances.

In an earlier study, Schölkopf et al. (1998) incorporated domain knowledge to construct task-specific kernels for an image classification task. Several other studies focused on building task-specific kernels. For example, Joachims (1998, 2001) builds domain-specific kernels for text categorization. Another example of task-specific kernels in the SVM literature is the latent semantic kernels of Cristianini et al. (2002).

To the best of our knowledge, explicitly incorporating encoded domain knowledge into support vector machine formulations was first introduced by Fung et al. (2001). In their study, domain knowledge was explicitly introduced into the formulation of a linear support vector machine classifier in the form of polyhedral sets. Each of the polyhedral sets, called a knowledge set for a specific class, is incorporated into the SVM formulation. The SVM algorithm is slightly different since the constraint set now includes the knowledge sets. As the polyhedral sets do not have to include any training points,
in their application the resulting hyperplane is found to be significantly different from the linear SVM separation. Furthermore, the proposed method has the potential of replacing missing training data with prior knowledge in the form of knowledge sets. However, from their study it can be observed that the shape of the newly formed, knowledge-set-dependent hyperplane differs from the knowledge-set-free hyperplane only when at least one of the knowledge sets contains a point that could be a support vector, i.e., a point closer to the boundary than the other training examples.

In a follow-up study, Fung et al. (2003) studied incorporating prior knowledge into the formulation of nonlinear kernels. They illustrated the power of prior knowledge on a checkerboard example. With 16 points, each located at the center of one of 16 squares, they showed that by including prior knowledge about two of the sixteen squares and using a Gaussian kernel, the correctness level dramatically increased to a level that could otherwise be achieved only with as many as 39,000 training points and no prior knowledge. One must note that this remarkable improvement is due, and subject, to the availability of very particular knowledge of the domain.

An Alternative Approach to Incorporating Domain Specific Knowledge in Support Vector Machines

In the previous section we reviewed two pioneering studies utilizing prior knowledge by adding membership knowledge sets into the SVM formulation. However, in real life the knowledge sets might not be available, or might be recondite due to the problem's nature. For example, for complex input spaces, knowledge sets may be difficult to observe. Moreover, even when they are available, they might not affect or improve the results, as they may incorporate inessential knowledge (i.e., knowledge describing examples that lie further from the separating hyperplane than the training points). Intuitively, the efficiency
of such fine-drawn knowledge sets requires an abundance of prior information and its encodability. For those reasons, their applicability might be limited in many domains. At the very least, this is a very specific use of a priori knowledge. We offer a different approach requiring less knowledge, where rather general information about the input space is taken into account.

Characterizing the Input Space

In order to introduce our approach, in the following we give some definitions and then characterize the input space.

Definition 3.1 (Bounded Sets, Closed Sets, Compact Sets): A set $X$ is closed if it contains all of its limit points. A set $X$ in $\mathbb{R}^n$ is bounded if it is contained in some ball $x_1^2 + x_2^2 + \cdots + x_n^2 \le R^2$ of finite radius $R$. A closed and bounded set is called a compact set.

Definition 3.2 (Hyperrectangle, Orthotope): A hyperrectangle, also known as an orthotope, is a generalization of the rectangle and the rectangular parallelepiped.

Definition 3.3 (Diameter, Generalized Diameter): In a closed and bounded set, the greatest distance between two points on the boundary is called the diameter. In Euclidean space, the diameter of a set $S \subseteq \mathbb{R}^n$ is simply the distance between the two points of $S$ that are furthest apart.

Definition 3.4 (Space Diagonal): The space diagonal is the line segment connecting opposite polyhedral vertices (i.e., two polyhedral vertices that do not share a common face) in a parallelepiped or other similar solid.
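For a finite point set, such as the vertices of a polytope, the diameter of Definition 3.3 can be computed by brute force over all pairs. A minimal sketch (Python with NumPy; the function name and example points are our own):

```python
import numpy as np

def diameter(points):
    """Greatest pairwise Euclidean distance (Definition 3.3), O(n^2) brute force."""
    pts = np.asarray(points, dtype=float)
    # pairwise difference vectors via broadcasting: shape (n, n, d)
    diff = pts[:, None, :] - pts[None, :, :]
    return float(np.sqrt((diff ** 2).sum(-1)).max())

square = [[0, 0], [1, 0], [0, 1], [1, 1]]
print(diameter(square))   # diagonal of the unit square = sqrt(2)
```

This quadratic scan is fine for vertex lists; for a polytope given only by linear inequalities the same quantity becomes the hard convex maximization problem discussed later in this chapter.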
Definition 3.5 (Convex Hull): The convex hull of a set of points in $\mathbb{R}^n$ is the intersection of all convex sets containing those points. In other words, the convex hull of a set of points is the smallest convex set that includes those points.

Definition 3.6 (Polyhedron, Convex Polytope): A polyhedron is the intersection of a finite number of halfspaces. A convex polyhedron, or a polytope, can be defined as a bounded polyhedron, or as the convex hull of a finite set of points in space. We denote a polytope by $P \subseteq \mathbb{R}^n$.

Characterizing the input space $X$ in terms of the available domain knowledge is an important step in encoding prior information. As we noted earlier, Fung et al. (2001, 2003) incorporated some domain knowledge in terms of polyhedral sets without asserting any specific shape or boundedness on the input space. However, earlier in the luckiness framework we observed that the distance of the furthest point from the origin is one of the determinants of the fat-shattering dimension and the generalization bound. Because of this, generalization bounds tend to be very loose if the input space is not assumed to be bounded. Table 3-1 shows the performance of generalization error bounds under different scenarios. In Table 3-1, the "Bound based on d" column represents the bound obtained by using the data-independent VC-dimension d. The "Bound based on fat_F" column represents the generalization bound obtained when the input space is assumed to be inside the ball of radius R. The "Luckiness Framework" column represents the generalization error bound when the radius R merely represents the radius of the ball containing the training set. We see that fat-shattering-dimension-dependent bounds provide insights only when the margin between classes is large, and bounds from the luckiness framework are too loose.
Table 3-1. Generalization error bound performances under different settings. Numbers in bold show that the bound is too loose to provide any information.

d    R    gamma  l         delta  Bound based on d  Bound based on fat_F  Luckiness Framework
20   100  10     1000      0.10   0.332             7.4539                n/a
20   100  20     1000      0.10   0.332             1.7784                n/a
20   100  30     1000      0.10   0.332             0.7423                n/a
20   100  40     1000      0.10   0.332             0.3940                n/a
20   100  50     1000      0.10   0.332             0.2388                n/a
20   100  60     1000      0.10   0.332             0.1570                n/a
20   100  60     1000000   0.10   0.001             0.0017                4.535
20   100  60     10000000  0.10   0.000             0.0003                0.876

The dramatic difference between the rightmost two columns is simply due to the fact that the values in the rightmost column are derived without assuming that the input space is bounded. Motivated by this example, we conclude that characterizing the input space is likely to be crucial in computing generalization bounds. In the following sections we initially assume upper and lower bounds on the attributes of the input space. Later, we tighten these box constraints and assume that the input space is contained in a polytope, and finally in a general convex body. However, our main focus is on input spaces contained in a polytope. In order to follow the tradition of pac-like bound generation, we need to examine certain properties of polytopes. In the following section we start with box constraints, which are a special form of a polytope.

Characterizing the Input Space with Box Constraints

In the knowledge discovery process, the basic classification task is to classify entities (points) according to their attributes' values. For each attribute there is a specific domain from which a value is assigned. Often, this specific domain is not infinite and is
bounded. In practice, domain ranges for many attributes (like age, income level, etc.) can naturally be bounded from above as well as from below.

We illustrate the motivation for using box constraints in support vector machines in Figure 3-1.

Figure 3-1. Using box constraints for characterizing the input space. (A) the ball containing the training set; (B) the re-centered bounding box.

In Figure 3-1 A, the larger ball containing most of the points denotes the ball of radius $R$ centered at the origin that contains the training set. The three outlier points in the upper right part depict off-sample points that lie outside of the ball. The gap between the luckiness framework and the fat-shattering bounds is simply due to those points. If upper and lower bounds are not known, potentially there could be many points that are not contained in the ball. In fact, in a distribution-free sense, the reason the generalization error bounds generated in the luckiness framework are loose is that there could be many such points. A rectangle is formed by using the upper and lower
bounds on the two attributes. In Figure 3-1 B, using a linear transformation, the rectangle is re-centered at the origin and everything is included inside the box.

The benefit of taking upper and lower bounds for each attribute is potentially twofold:

- The luckiness framework is not needed to generate generalization bounds, as the input space is already contained inside a hyperrectangle.
- By using a simple linear transformation, the entire input space can be centered around the origin, and therefore the diameter may be smaller. As the fat-shattering dimension depends on the margin as well as on the diameter of the ball, the fat-shattering dimension, and hence the generalization bound, are potentially smaller.

Proposition 3.7: Let $X \subseteq \mathbb{R}^n$ be the input space, contained in a hyperrectangle $HR'$. Let $\mathbf{x}_i, \mathbf{x}_j \in HR'$ be two vectors furthest apart in $HR'$. The fat-shattering dimension at scale $\gamma$ computed from $HR'$ is always smaller than or equal to the fat-shattering dimension at scale $\gamma$ calculated by considering the smallest ball of radius $R$ containing the input space.

Proof: If the input space is contained in a hyperrectangle, then the space diagonal (SD) satisfies $SD = \|\mathbf{x}_i - \mathbf{x}_j\| \le 2R$, and the result follows. Note that the fat-shattering dimension can be improved further by using the transformation $HR = HR' - \frac{\mathbf{x}_i + \mathbf{x}_j}{2}$.

Figure 3-1, Part B, graphically illustrates the proposition. The space diagonal of a hyperrectangle can be computed easily as follows. Let $(ub_1, \ldots, ub_n)$ and $(lb_1, \ldots, lb_n)$ denote the upper and lower bounds for attributes 1 through $n$, respectively. Then

$$SD^2 = \sum_{i=1}^{n}(ub_i - lb_i)^2.$$

In this section we demonstrated that, just by knowing upper and lower bounds alone, one can potentially improve the generalization error depending on the fat-shattering
dimension. In the following section we try to incorporate more domain knowledge in the form of linear constraints, and analyze the case where the input space is contained in a polytope.

Characterizing the Input Space with a Polytope

In many pattern-recognition problems there are interdependencies among the attributes. Some of them are strong correlations among the attributes that can be observed statistically, while others can even be written in mathematical form.

In the machine learning literature, highly correlated attributes may mislead, or may not contribute much to, the learning process. For example, k-th nearest neighbor and decision trees are sensitive to correlated attributes (Madden et al., 2002). Neural networks are less sensitive to correlated attributes, although they also benefit from feature selection (Madden et al., 2002), which eliminates correlated attributes.

We investigate the case where we can express the relationships among the attributes in the form of linear constraints, and improve on the box constraints of the previous section. We assume that the input space is contained inside a polytope.

In Section 3.2.1 we computed the space diagonal for hyperrectangles. In this section we focus on upper bounding the fat-shattering dimension for the polytope case. In Chapter 2 we included results for the effective VC-dimension for linear classifiers. The effective VC-dimension, called the fat-shattering dimension, is bounded by

$$\mathrm{fat}_F(\gamma) \le \frac{5R^2}{4\gamma^2},$$

where $R$ is the radius of the $n$-dimensional ball centered at the origin. In our case, the radius of the $n$-dimensional ball corresponds to half of the metric diameter of the polytope.
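The space diagonal of the previous section and this fat-shattering bound combine directly: for a re-centered hyperrectangle one may take $R = SD/2$. The sketch below (Python; the attribute bounds and function names are our own illustrative choices) evaluates the resulting bound.

```python
import math

def space_diagonal(lb, ub):
    # SD^2 = sum_i (ub_i - lb_i)^2
    return math.sqrt(sum((u - l) ** 2 for l, u in zip(lb, ub)))

def fat_bound_box(lb, ub, gamma, n=None):
    # Hush and Scovel (2003) style bound: fat_F(gamma) <= min(5 R^2/(4 gamma^2), n) + 1,
    # with R taken as half the space diagonal of the re-centered box.
    R = space_diagonal(lb, ub) / 2.0
    d = 5.0 * R ** 2 / (4.0 * gamma ** 2)
    if n is not None:
        d = min(d, n)
    return d + 1

lb = [0.0, 0.0, 18.0]   # hypothetical lower bounds on three attributes
ub = [1.0, 1.0, 90.0]   # hypothetical upper bounds (e.g. age capped at 90)
print(space_diagonal(lb, ub))
print(fat_bound_box(lb, ub, gamma=5.0))
```

Whenever the box is slimmer than the smallest origin-centered ball containing the data, $SD/2 < R$ and the resulting fat-shattering bound, and hence the generalization bound, shrink accordingly.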
Lemma 3.8: Let $HI$ denote a pre-Hilbert space of dimension $n$ and let $F$ denote the class of functions on $P$ defined by $F = \{x \mapsto \langle w, x \rangle + b : \|w\| \le 1,\ x \in P\}$ with $b \in \mathbb{R}$. Let $x, y \in P$ be two points furthest apart in $P$. Then the fat-shattering dimension of $F$ at scale $\gamma$ satisfies $fat_F(\gamma) \le \min\left\{\frac{25 \|x-y\|^2}{16 \gamma^2},\ n\right\} + 1$.

Proof: The metric radius of the polytope is $\frac{\|x - y\|}{2}$. Using Lemma 5 of Hush and Scovel (2003) the lemma follows.

In order to compute the fat-shattering dimension, the metric diameter of the polytope must be calculated or upper bounded by some expression. The problem of finding the two points furthest apart can be stated as follows:

$\max \|x - y\|$ s.t. $Ax \le a$, $Ay \le a$.

Unfortunately, calculating the metric diameter of the polytope is a convex maximization problem, and is very difficult to solve. Therefore we narrow our focus to finding an upper bound $f(A, a)$ on the metric diameter, i.e.

$f(A, a) \ge \max \{\|x - y\| : Ax \le a,\ Ay \le a\}$.

In order to assert such an upper bound $f(A, a)$ on the metric diameter, different methodologies may be used. For example, we could find a vertex and then attempt to form a simplex containing the polytope (Zhou and Suri, 2000). However, for practical purposes that will become evident later, we prefer to follow a more general methodology and work with ellipsoids. The idea is to find an ellipsoid that tightly fits our polytope and use
the diameter of the ellipsoid to upper bound the diameter of the polytope. The diameter of an ellipsoid can be informally defined as the length of its longest axis. The facts that calculating the diameter of an ellipsoid is simple, and that it upper bounds the metric diameter of any polytope it contains, make this approach attractive.

A minimum-volume ellipsoid that contains the polytope is called a "Löwner-John ellipsoid" (John, 1948). There is no known way of explicitly computing the Löwner-John ellipsoid for polytopes. However, computational considerations provide compelling reasons to work with ellipsoids. Hence, we dedicate the following chapter to computing such ellipsoids.
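For small vertex sets the convex maximization above can be solved exactly by brute force, since the maximum of $\|x - y\|$ over a polytope is attained at a pair of vertices. A minimal sketch of our own, for intuition only:

```python
from itertools import combinations
from math import dist

def metric_diameter(vertices):
    # The maximum of the convex function ||x - y|| over P x P is attained
    # at vertices, so scanning all vertex pairs gives the exact diameter.
    return max(dist(x, y) for x, y in combinations(vertices, 2))

# unit square: the furthest pair is a main diagonal of length sqrt(2)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

This is exactly why the halfspace representation is harder: the vertex list must be enumerated first, and that enumeration is the expensive step discussed in the next sections.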
CHAPTER 4
ELLIPSOID METHOD

As discussed in Chapter 3, finding an upper bound on the metric diameter of a polytope is a difficult problem, but not an uncommon one. This type of problem also arises in machine vision, robotics, and computer game design. In machine vision and robotics, the distance between the surrounding environment and a robot must be known to avoid collision. However, object shapes can be complex, and therefore computing the distance between a robot and objects in the environment can be very difficult. To facilitate computation, minimum-volume ellipsoids for each complex shape are computed. So, instead of directly gauging distances between objects, distances between ellipsoids are computed as an approximation (Rimon and Boyd, 1992, 1997). The same approach is also used in game design to determine impacts and collisions among game objects.

Ellipsoid Method Introduction

In our context, we would like to upper bound the metric diameter of a given polytope. By definition, any distance within an ellipsoid is upper bounded by its diameter, i.e., the length of the major axis of the ellipsoid. Hence, one possible bound is the length of the longest axis of the minimal ellipsoid containing the polytope.

The Löwner-John Ellipsoid

Definition 4.1 (The Löwner and John Ellipsoid): For any bounded convex body with non-empty interior K, there exists a unique minimum-volume ellipsoid enclosing the body. This ellipsoid is called the Löwner ellipsoid of K. Also, every
compact convex body contains a unique ellipsoid of maximum volume, known as the John ellipsoid of K (Blekherman, 2002).

In general, the Löwner-John ellipsoid (sometimes spelled Loewner-John, or abbreviated as L-J ellipsoid) of a convex body denotes a special ellipsoid with the property that the convex body contains the ellipsoid obtained from the L-J ellipsoid by shrinking it n times, where n is the dimension of the convex body. This is illustrated in Figure 4-1.

Figure 4-1. The L-J ellipsoid for a polytope. A) L-J ellipse for a 2-dimensional polytope, and the ellipse obtained by shrinking the L-J ellipse two times. B) Illustration of the L-J ellipsoid for the polytope. C) For this 3-dimensional polytope, the ellipsoid that is formed by shrinking the L-J ellipsoid three times lies completely inside the polytope.
If the polytope is given in the form of a set of vertices, then there are known polynomial-time algorithms (Rimon and Boyd, 1992, 1997; Kumar and Yildirim, 2004) to tightly approximate the Löwner-John ellipsoid that contains all the vertices. For example, Kumar and Yildirim propose a $(1+\epsilon)$-approximation algorithm that runs in $O(nd^3/\epsilon)$ time, where $n$ is the number of points and $d$ is the dimension. Rimon and Boyd (1997) formulate the problem as a convex optimization problem. They minimize the volume of the minimum containing $(n+1)$-dimensional ellipsoid centered at the origin subject to containment of the vertices of the $n$-dimensional polytope. They show that the $n$-dimensional minimal ellipsoid can be obtained directly. The complexity of the problem is linear in the number of vertices.

These algorithms prove useful when the vertices are known. This is the typical case in robotics, where polytopes are defined as the convex hull of extreme points which can be easily calculated. Also in robotics, dimensionality is limited (all objects are three dimensional) and therefore the computation of vertices is somewhat less complex than in the general n-dimensional case. However, when a polytope is defined in terms of the intersection of a finite number of halfspaces (i.e., $P = \{x \in \mathbb{R}^n : Ax \le a\}$, where $A$ is an $m \times n$ matrix) rather than a set of vertices, the above-mentioned algorithms lose their attraction. This is due to two main reasons. First, there is no known algorithm that can compute the vertices of a polytope P in time polynomial in $m$, $n$ and the number of vertices (Avis and Devroye, 2001). Second, once the vertices are known, the computation of a minimum ellipsoid is no longer needed, as the metric diameter can directly be calculated.
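McMullen's (1970) upper bound on the number of vertices, discussed next, makes it easy to see how quickly vertex enumeration becomes impractical; evaluating the formula is a one-liner (the function name is ours):

```python
from math import comb, ceil, floor

def mcmullen_upper_bound(m, n):
    # Upper bound on the number of vertices of a polytope defined by
    # m inequality constraints in n dimensions (McMullen, 1970).
    return comb(m - ceil(n / 2), m - n) + comb(m - floor(n / 2) - 1, m - n)
```

For 40 constraints in 20 dimensions this evaluates to 40,060,020, the "exceeds 40 million" figure quoted in the text; for the small cases in Table 5-1 (e.g. 10 constraints in 3 dimensions) it reproduces the tabulated upper bounds exactly.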
There are known bounds on the number of vertices of a polytope in the literature, but finding the actual number of vertices is a complex task. For example, McMullen (1970) gives the following upper bound on the number of vertices of a polyhedron:

$UB(m, n) = \binom{m - \lceil n/2 \rceil}{m - n} + \binom{m - \lfloor n/2 \rfloor - 1}{m - n}$

where $m$ and $n$ are the dimensions of the matrix $A$ in $Ax \le a$, which has full column rank and no redundant inequalities. With 40 constraints in 20 dimensions, for example, the bound exceeds 40 million. Lavallee and Duong (1992) give an algorithm to find all the vertices of a polytope.

The Ellipsoid Method

The algorithm to approximate the L-J ellipsoid can be built on top of the normal ellipsoid method, a method developed in the 1970s, initially by Shor (1970) and Iudin and Nemirovskii (1976), to solve convex, not necessarily differentiable, optimization problems (Bland et al., 1981). The ellipsoid method was initially used to find a feasible solution to a system of linear inequalities. Later Khachiyan (1979) showed how the ellipsoid method can be implemented in polynomial time to solve linear programming problems.

The method attempts to enclose the feasible region of a linear program by an n-dimensional ellipsoid, then slice it up and form another ellipsoid while still including the feasible region P. The objective of the ellipsoid method is to determine whether the feasible region is nonempty. The ellipsoid method was built on top of Levin's method (1965) for minimization of convex functions. Levin (1965) used the center of gravity to approximately minimize a convex function over a polytope P. In his method, at each iteration the center
of gravity of the polytope is calculated and used to test a convergence criterion. If the centroid satisfies the convergence criterion, the method stops with an approximate solution. Otherwise, a hyperplane whose normal is the subgradient of the convex function at the centroid cuts the polytope into two, and the method continues to operate on the reduced polytope containing the optimal solution. The main disadvantage of the method is that calculating the centroid at each iteration is very time consuming (Liao and Todd, 1996).

Khachiyan (1979) showed that the ellipsoid method can be modified to check the feasibility of a system of linear inequalities in polynomial time. The so-called ellipsoid method was the first polynomial-time algorithm to solve linear optimization problems. The ellipsoid method starts with a large ellipsoid containing the polytope P, and then uses the center of the ellipsoid (rather than the centroid of the polytope, as in Levin's method) to reduce the search space.

The ellipsoid method has been widely studied. Over one thousand journal articles have been published, and many variations of the method have been developed. A more comprehensive review of ellipsoid methods can be found in Bland et al. (1981) and Grötschel et al. (1988).

Before we give the details of the ellipsoid method, we provide some definitions that will prove useful in studying it.

Definition 4.2 (Eigenvalues, Eigenvectors): For an $n \times n$ matrix $D$, if there is a non-zero vector $x \in \mathbb{R}^n$ such that $Dx = \lambda x$, then $\lambda$ is called an eigenvalue of $D$, and $x$ is an eigenvector corresponding to eigenvalue $\lambda$ of the $n \times n$ matrix $D$. An $n \times n$ symmetric matrix has $n$ eigenvalues, and they can be found by solving
$\det(D - \lambda I) = 0$. Moreover, an $n \times n$ square symmetric matrix has $n$ orthogonal eigenvectors, and the number of non-zero eigenvalues of an $n \times n$ matrix is exactly equal to the rank of the matrix.

Definition 4.3 (Positive Definite): An $n \times n$ matrix $D$ is called positive definite if for any non-zero vector $x \in \mathbb{R}^n$, $x'Dx > 0$ holds. A symmetric matrix is positive definite if and only if all of its eigenvalues are positive.

Definition 4.4 (Ellipsoid): An ellipsoid centered at $d \in \mathbb{R}^n$ can be defined by a set of vectors E in $\mathbb{R}^n$ of the form $E = E(D, d) = \{x \in \mathbb{R}^n : (x - d)' D^{-1} (x - d) \le 1\}$, where $D$ is an $n \times n$ positive definite symmetric matrix. Moreover, $D$ is called the characteristic matrix of the ellipsoid.

It turns out that eigenvalues and eigenvectors have a very useful geometric interpretation for ellipsoids. For the ith axis, if $\lambda_i$ is the eigenvalue of $D^{-1}$ corresponding to this axis, then the radius of the ellipsoid along this axis is $1/\sqrt{\lambda_i}$.

Definition 4.5 (Spectral Radius): For any $n \times n$ matrix $Z$, the spectral radius of $Z$, denoted by $\rho(Z)$, is $\max_{1 \le i \le n} |\lambda_i|$, where $|\lambda_i|$ is the modulus of $\lambda_i$.

Proposition 4.6: Let $P \subseteq \mathbb{R}^n$ be a polytope and let $E$ be the L-J ellipsoid that contains the polytope $P$. The metric diameter of the polytope is bounded by $2 \max_{1 \le i \le n} 1/\sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of $D^{-1}$.

Proof: The two points on ellipsoid $E$ that are furthest apart from each other lie on the major axis of $E$. The distance between any two points in $P$ is less than or equal to the length of the major axis of $E$, as $P$ is contained in $E$. The radius of the
ellipsoid along this axis is $\max_{1 \le i \le n} 1/\sqrt{\lambda_i}$ and the result follows.

Note that finding $\max_{1 \le i \le n} 1/\sqrt{\lambda_i}$ is equivalent to finding the minimal eigenvalue of $D^{-1}$; alternatively, one could find the spectral radius of $D$, since for a positive definite matrix $\rho(D) = 1/\lambda_{\min}(D^{-1})$, so the two computations amount to the same thing.

As we noted earlier, in order to upper bound the metric diameter of the polytope P, it is useful to find the diameter of the L-J ellipsoid for P. In the remainder of this section we review the ellipsoid method, a method that can be used to approximate the L-J ellipsoid.

An ellipsoid centered at $d$ with positive definite matrix $D$ is defined as $E = E(D, d) = \{x \in \mathbb{R}^n : (x - d)' D^{-1} (x - d) \le 1\}$. Note that, since the characteristic matrix $D$ is positive definite, so too is its inverse.

Definition 4.7 (Affine Transformation): An affine transformation is any transformation that preserves collinearity (i.e., all points lying on a line initially still lie on a line after the transformation) and ratios of distances (e.g., the midpoint of a line segment remains the midpoint after the transformation). For any nonsingular (i.e., invertible) $n \times n$ matrix $E$ and $e \in \mathbb{R}^n$, we say that $AT: \mathbb{R}^n \to \mathbb{R}^n$, $AT(x) = Ex + e$, is an affine transformation.

According to this notation, a unit ball around the origin can be denoted by $E(I, 0)$. By definition, we can say that ellipsoids are images of unit balls under affine transformations, and the transformation is bijective (Feige, 2004). For example, consider an affine transformation defined as $x = D^{1/2} y + e$. Then the reverse of the affine transformation is $y = D^{-1/2}(x - e)$. If the transformation is applied to the unit ball that is
defined as $E = E(I, 0) = \{y \in \mathbb{R}^n : y'y \le 1\}$, we can clearly see that the image is $\{x \in \mathbb{R}^n : \|D^{-1/2}(x - e)\|^2 \le 1\}$, which is equivalent to $E(D, e) = \{x \in \mathbb{R}^n : (x - e)' D^{-1} (x - e) \le 1\}$, and therefore $E(D, e) = AT(E(I, 0))$ with $AT(y) = D^{1/2} y + e$ is the affine transformation of the unit ball into an ellipsoid.

Definition 4.8 (Norm): A function $N: \mathbb{R}^n \to \mathbb{R}$ is called a norm if the following conditions are satisfied:

- $N(x) \ge 0$ for all $x \in \mathbb{R}^n$, and $N(x) = 0$ iff $x = 0$
- $N(cx) = |c| N(x)$ for all $x \in \mathbb{R}^n$, $c \in \mathbb{R}$
- $N(x + y) \le N(x) + N(y)$ for all $x, y \in \mathbb{R}^n$

For every norm $N$, a distance measure $d_N(x, y)$ between any $x, y \in \mathbb{R}^n$ is defined as $d_N(x, y) = N(x - y)$.

Definition 4.9 (Ellipsoidal Norm): For a positive definite matrix $D$, the function $\|x\|_D = \sqrt{x' D^{-1} x}$ is called the ellipsoidal norm.

The diameter of an ellipsoid is two times the square root of the largest eigenvalue of the matrix $D$. Similarly, the smallest axis of the ellipsoid can be calculated by finding two times the square root of the smallest eigenvalue of the matrix $D$. The volume of an ellipsoid is $Vol(E(D, d)) = Vol(E(I, 0)) \sqrt{\det D}$, where $Vol(E(I, 0))$ is the volume of the unit ball in that dimension. Hence, we conclude that the volume depends only on the determinant of the characteristic matrix and the number of dimensions. The volume of a unit ball in $\mathbb{R}^n$ is given in the literature as:
$V(E(I, 0)) = \frac{\pi^{n/2}}{\Gamma(n/2 + 1)} \sim \frac{1}{\sqrt{n\pi}} \left(\frac{2\pi e}{n}\right)^{n/2}$

From the above, for an ellipsoid E and an affine transformation function $AT(x) = Ex + e$, we can write $Vol(AT(E(D, d))) = |\det E| \sqrt{\det D}\, V(E(I, 0))$.

Optimizing Linear Functions over Ellipsoids

The idea behind covering the optimization of linear functions over ellipsoids here is that it will prove useful later in discussing the ellipsoid method. Maximizing a linear function $c'x$ over a unit ball $E(I, 0)$ is achieved by the vector $\frac{c}{\|c\|}$. For a unit ball that is not centered at the origin, $E(I, d)$, the maximizer is $d + \frac{c}{\|c\|}$, and for any $k > 0$ and the radius-$k$ ball $E(k^2 I, d)$ it is $d + k \frac{c}{\|c\|}$.

Recall that for any ellipsoid $E(D, d)$ we have $E(D, d) = AT(E(I, 0))$ with $AT(y) = D^{1/2} y + d$. Therefore the problem $\max\{c'x : x \in E(D, d)\}$ can be rewritten over the unit ball:

$\max\{c'x : x \in E(D, d)\} = \max\{c'(D^{1/2} y + d) : y \in E(I, 0)\} = c'd + \max\{(D^{1/2} c)' y : y \in E(I, 0)\}$
Using the unit-ball maximizer $\frac{D^{1/2} c}{\|D^{1/2} c\|}$ then gives

$\max\{c'x : x \in E(D, d)\} = c'd + \|D^{1/2} c\| = c'd + \sqrt{c'Dc}$.

By using the ellipsoidal norm definition $\|x\|_D = \sqrt{x' D^{-1} x}$, we can write $\max\{c'x : x \in E(D, d)\} = c'd + \|Dc\|_D$. The value that attains this is $z_{\max} = \arg\max\{c'x : x \in E(D, d)\} = d + b$, where $b = \frac{Dc}{\sqrt{c'Dc}}$. Notice that if the objective were to minimize, then $\min\{c'x : x \in E(D, d)\} = c'd - \sqrt{c'Dc}$, and $z_{\min} = \arg\min\{c'x : x \in E(D, d)\} = d - b$.

The geometric interpretation of this is seen in Figure 4-2. In the figure, $z_{\min}$ and $z_{\max}$ are the minimum and maximum for $c'x$ over the ellipsoid $E(D, d)$. Note that the center of the ellipsoid lies between $z_{\min}$ and $z_{\max}$. Also observe that the shaded ellipsoid is the minimal ellipsoid containing one half of the ellipsoid divided with respect to $c'x$ through the center. Since half of the ellipsoid is a convex body, say K, the shaded ellipsoid is the L-J ellipsoid of K.
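The closed forms for the maximum and for $z_{\max}$, $z_{\min}$ are easy to check numerically; a sketch of ours (the example ellipse is an assumption for illustration):

```python
import numpy as np

def linmax_over_ellipsoid(D, d, c):
    # max of c'x over E(D, d) is c'd + sqrt(c'Dc), attained at
    # z_max = d + b with b = Dc / sqrt(c'Dc); the minimum is at d - b.
    b = D @ c / np.sqrt(c @ D @ c)
    return c @ d + np.sqrt(c @ D @ c), d + b, d - b

# axis-aligned ellipse with semi-axes 2 and 1, centered at (1, 1)
D = np.diag([4.0, 1.0])
d = np.array([1.0, 1.0])
c = np.array([1.0, 0.0])
val, z_max, z_min = linmax_over_ellipsoid(D, d, c)
# z_max = (3, 1), the rightmost point of the ellipse; val = 3
```

Both extremes land exactly on the boundary of $E(D, d)$, i.e. they satisfy $(z - d)' D^{-1} (z - d) = 1$, which is a convenient sanity check.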
Figure 4-2. When maximizing a linear function $c'x$ over an ellipsoid $E(D, d)$, the center of the ellipsoid lies between $z_{\min}$ and $z_{\max}$.

So, K contains the ellipsoid obtained from its L-J ellipsoid by shrinking it n times, where n is the dimension of K. Hence, if $E(D, d)$ is the L-J ellipsoid in n dimensions, and the ellipsoid $E(n^{-2} D, d)$ is the ellipsoid formed by shrinking the axes by a factor of n, then the ellipsoid $E(n^{-2} D, d)$ is contained in K. The $\frac{1}{n^2}$ factor results from the fact that the eigenvalues are now $n^2$ times smaller.

The ellipsoid method works as follows. We start with an ellipsoid (it could be a ball) that contains P. The method generates a sequence of ellipsoids $E_1, \ldots, E_k$ with centers $d_1, \ldots, d_k$ iteratively. At each iteration, if the center of the ellipsoid generated at that iteration does not lie inside the polytope, then the center point violates one or more of the constraints defining the polytope. For example, at the ith iteration, let $E_i = E(D_i, d_i)$ be the
ellipsoid centered at $d_i$, and let the polytope be $P = \{x \in \mathbb{R}^n : Ax \le a\}$, where $A = (A_1', \ldots, A_m')'$. If $d_i$ is not contained in P, then $d_i$ violates at least one of the m constraints, say the kth, so $A_k d_i > a_k$ for some $1 \le k \le m$. Thus, we know that P is contained in the halfspace $\{x : A_k x \le A_k d_i\}$. In the next iteration, a new smaller-volume ellipsoid is generated that contains the intersection of this halfspace with the current ellipsoid.

At this point we should point out that there is a strong relationship between the ellipsoid method and the L-J ellipsoid. In each iteration the next ellipsoid is generated according to a hyperplane passing through the center of the current ellipsoid. Suppose that, at some iteration i, the violating constraint is $A_k x \le a_k$; writing $c = A_k'$, the cut is $c'x \le c'd_i$. Then we know that the center of the current ellipsoid does not lie inside the polytope (see Figure 4-3). Therefore, in the next iteration we consider only the half ellipsoid that contains the polytope, and for this purpose we use the L-J ellipsoid of that half ellipsoid, as it is a convex body. It also becomes clear why we covered maximizing linear functions over ellipsoids: the next ellipsoid generated passes between $z_{\min}$ and $z_{\max}$, depending on the location of the polytope.

Theorem 4.10: Let $E = E(D, d)$ be an ellipsoid in $\mathbb{R}^n$, and let $c = A_k'$ be a non-zero vector in $\mathbb{R}^n$. Consider the halfspace $H = \{x \in \mathbb{R}^n : c'x \le c'd\}$ and let $d_{i+1} = d - \frac{1}{n+1} b$. Then
$D_{i+1} = \frac{n^2}{n^2 - 1} \left(D - \frac{2}{n+1} b b'\right)$

where $b = \frac{Dc}{\sqrt{c'Dc}}$. The matrix $D_{i+1}$ is symmetric and positive definite, and thus $E_{i+1} = E(D_{i+1}, d_{i+1})$ is an ellipsoid. Moreover, $E \cap H \subseteq E_{i+1}$ and $Vol(E_{i+1}) \le e^{-\frac{1}{2(n+1)}} Vol(E)$ (Bertsimas and Tsitsiklis, 1997).

Proof: Refer to Bertsimas and Tsitsiklis, 1997, p. 367.

Note that the new center $d_{i+1} = d - \frac{1}{n+1} b$ lies on the hyperplane passing through $z_{\max}$ and $z_{\min}$. In fact, $z_{\min} = d - b$, and therefore $d_{i+1}$ is simply $d_{i+1} = d + \frac{z_{\min} - d}{n+1}$. Furthermore, the new ellipsoid is the L-J ellipsoid of the half ellipsoid from the previous iteration.

Different Ellipsoid Methods

In the literature, the general methodology we covered above is referred to as the "central cut" ellipsoid method. This is due to the fact that the cut goes right through the centroid of the ellipsoid, and exactly one half of the ellipsoid is discarded at each iteration. Depending on how big a portion of the ellipsoid is eliminated in each iteration, different cut mechanisms have been devised. Obviously, the central cut is not the only way of discarding a portion of the ellipsoid. As long as the cut reduces the volume and the next ellipsoid contains the polytope, the cut is permissible. So, on the one hand, if the generated ellipsoid makes a bigger cut, this is considered a somewhat greedy cut, and the methodology is referred to as a "deep cut" ellipsoid method. On the other hand, if the cut shrinks the ellipsoid but not as much as the central cut, then it is referred to as a
"shallow cut" ellipsoid method. Method selection, as we will discuss later, is controlled by a parameter.

As we mentioned, the ellipsoid method runs in polynomial time. It is intuitive, and correct, that deep-cut versions also run in polynomial time: with each cut, the volume of the generated ellipsoid decreases iteratively. The proof of complexity is rather involved and we will not include it here; Bland et al. (1981) and Grötschel et al. (1988) give a detailed complexity analysis of the method.

The algorithm is illustrated in Figure 4-3. Starting with a large ball containing the polytope, each generated ellipsoid fits the convex body tighter than the previous one.

Figure 4-3. In every iteration the polytope is contained in a smaller ellipsoid.

In the literature there are other versions of the ellipsoid method, such as the "parallel cut" method (Bland et al., 1981): when there is a parallel pair of constraints, two simultaneous cuts can be made. The cuts are centrally symmetric and parallel, discarding areas on both sides of the ellipsoid to generate a new and narrower one.
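One central-cut step (Theorem 4.10) is only a few lines; the following is a sketch of ours for illustration, not the implementation used in the experiments:

```python
import numpy as np

def central_cut_step(D, d, c):
    # Given E(D, d) and the cut c'x <= c'd, return (D1, d1) for the next
    # ellipsoid, the L-J ellipsoid of the remaining half ellipsoid:
    #   d1 = d - b/(n+1),  D1 = n^2/(n^2-1) (D - 2/(n+1) bb'),
    # with b = Dc / sqrt(c'Dc).
    n = len(d)
    b = D @ c / np.sqrt(c @ D @ c)
    d1 = d - b / (n + 1.0)
    D1 = (n * n / (n * n - 1.0)) * (D - (2.0 / (n + 1.0)) * np.outer(b, b))
    return D1, d1

# one step on the unit disk, cutting with c = (1, 0)
D1, d1 = central_cut_step(np.eye(2), np.zeros(2), np.array([1.0, 0.0]))
```

The volume ratio $\sqrt{\det D_1 / \det D}$ of the step comes out below the theoretical factor $e^{-1/(2(n+1))}$, as the theorem guarantees.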
Different Ellipsoid Methods' Formulation

Note that in the ellipsoid method, the hyperplane according to which we cut off the ellipsoid must pass through the ellipsoid. That is, the hyperplane $H = \{x \in \mathbb{R}^n : c'x = \beta\}$ must satisfy $c'z_{\min} \le \beta \le c'z_{\max}$, where $z_{\min} = d - b$ and $z_{\max} = d + b$. After some algebra we find that $|c'd - \beta| \le \sqrt{c'Dc}$ is necessary for a cut. If $\alpha = \frac{c'd - \beta}{\sqrt{c'Dc}}$ is a parameter quantifying the location of the cut, it is clear that $-1 \le \alpha \le 1$. We can think of $\alpha$ as a bias factor: if $\alpha = 0$ we make our cut right through the middle, and the newly formed ellipsoid contains the portion of the ellipsoid in the halfspace $H = \{x \in \mathbb{R}^n : c'x \le c'd\}$. A more comprehensive version of Theorem 4.10 that takes the cut location into account is given below.

Theorem 4.11: Let $E = E(D, d)$ be an ellipsoid in $\mathbb{R}^n$, and let $c = A_k'$ be a non-zero n-vector. Consider the halfspace $H = \{x \in \mathbb{R}^n : c'x \le \beta\}$ and let $E_{i+1} = E(D, d, c, \beta)$ denote the L-J ellipsoid for the next iteration. Then $E_{i+1} = E(D_i, d_i)$ if $-1 \le \alpha < -1/n$, and $E_{i+1} = E(D_{i+1}, d_{i+1})$ if $-1/n \le \alpha < 1$, where $D_{i+1}$ and $d_{i+1}$ are

$d_{i+1} = d_i - \frac{1 + n\alpha}{n + 1} b$

$D_{i+1} = \frac{n^2 (1 - \alpha^2)}{n^2 - 1} \left(D_i - \frac{2(1 + n\alpha)}{(n + 1)(1 + \alpha)} b b'\right)$
where $b = \frac{Dc}{\sqrt{c'Dc}}$ and $\alpha = \frac{c'd - \beta}{\sqrt{c'Dc}}$. The matrix $D_{i+1}$ is symmetric and positive definite, and thus $E_{i+1} = E(D_{i+1}, d_{i+1})$ is an ellipsoid. Moreover, $E \cap H \subseteq E_{i+1}$, and for $\beta = c'd$, $Vol(E_{i+1}) \le e^{-\frac{1}{2(n+1)}} Vol(E)$.

Proof: Refer to Grötschel et al., 1988, p. 71, and Bland et al., 1981, p. 1053.

Notice that, as in Theorem 4.10, for $\beta = c'd$ the cut is right through the center of the ellipsoid. In the ellipsoid method literature, the location of the cut determines the type of the cut as well as the modifications to the ellipsoid method. The type of the cut is determined as follows:

- For $\beta = c'd$ ($\alpha = 0$) the cut is through the center, and is called a "central cut".
- For $0 < \alpha < 1$ we can make a deeper cut, and this is called a "deep cut".
- For $-1/n \le \alpha < 0$ we include more than half of the ellipsoid in our cut, and this is called a "shallow cut".

Note that the ellipsoid method assumes that all the coefficients in the system of inequalities are integer values. This only has importance from a theoretical point of view, as it helps to prove that the algorithm runs in polynomial time. The method is used to solve linear optimization problems by solving the dual and the primal together. As a feasible point of the combined system is optimal, its existence automatically solves the problem.

Shallow Cut Ellipsoid Method

Although shallow cuts do not reduce the volume as fast as central or deep cuts, they still yield polynomial running times. The shallow-cut method is due to Yudin and Nemirovski (1976), and is a more sophisticated version of the central cut method.
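Theorem 4.11 generalizes the step with the cut-location parameter $\alpha$; a sketch of ours covering the central ($\alpha = 0$), deep ($\alpha > 0$) and shallow ($-1/n \le \alpha < 0$) cases:

```python
import numpy as np

def cut_step(D, d, c, beta):
    # Ellipsoid step for the cut c'x <= beta with location parameter
    # alpha = (c'd - beta) / sqrt(c'Dc) (Theorem 4.11).
    n = len(d)
    s = np.sqrt(c @ D @ c)
    alpha = (c @ d - beta) / s
    if alpha < -1.0 / n:
        return D, d          # cut removes too little: keep the ellipsoid
    b = D @ c / s
    d1 = d - (1.0 + n * alpha) / (n + 1.0) * b
    D1 = (n * n * (1.0 - alpha * alpha) / (n * n - 1.0)) * (
        D - 2.0 * (1.0 + n * alpha) / ((n + 1.0) * (1.0 + alpha)) * np.outer(b, b))
    return D1, d1

# on the unit disk with c = (1, 0): beta = 0 is a central cut,
# beta = 2 lies outside the disk, so the ellipsoid is unchanged
D1, d1 = cut_step(np.eye(2), np.zeros(2), np.array([1.0, 0.0]), 0.0)
D2, d2 = cut_step(np.eye(2), np.zeros(2), np.array([1.0, 0.0]), 2.0)
```

Note that at $\alpha = -1/n$ the update formulas reduce to $D_{i+1} = D_i$, $d_{i+1} = d_i$, so the two branches agree at the boundary; at $\alpha = 0$ the step coincides with the central-cut step of Theorem 4.10.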
The central cut method terminates once the center of the generated ellipsoid lies inside the convex body. The idea behind shallow cuts is that the method can proceed further even when the center is found to be inside the convex region. The method does not necessarily stop when the center of the ellipsoid lies in the polytope; it stops when the ellipsoid $E_M = E\left(\frac{1}{(n+1)^2} D, d\right)$ lies completely inside the polytope (Grötschel et al., 1988). As it can proceed further, it can be used to approximate the L-J ellipsoid.

For a given polytope P, the stopping criterion is membership of the ellipsoid $E_M = E\left(\frac{1}{(n+1)^2} D, d\right)$ in the polytope. If the ellipsoid $E_M$ lies completely inside P, the method terminates and the ellipsoid is declared to be tough (Grötschel et al., 1988). To determine this, for each constraint i we check whether $A_i d + \frac{1}{n+1} \sqrt{A_i D A_i'} > a_i$, i.e., whether constraint i is violated by $E_M$. If so, the method continues using constraint i; otherwise it declares $E_M$ tough and terminates.

In the next chapter we discuss the computational efficiency of approximating the L-J ellipsoid using these ellipsoid methods, as well as its implications for the fat-shattering dimension and the generalization error bound.
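The toughness test is a direct support-function check on the shrunken ellipsoid. A sketch of ours, assuming P is given row-wise as (A, a); the numeric tolerance is our own addition:

```python
import numpy as np

def is_tough(D, d, A, a):
    # E((n+1)^-2 D, d) lies inside P = {x : Ax <= a} iff, for every row i,
    # the maximum of A_i x over the shrunken ellipsoid does not exceed a_i:
    #   A_i d + sqrt(A_i D A_i') / (n + 1) <= a_i
    n = len(d)
    lhs = A @ d + np.sqrt(np.einsum('ij,jk,ik->i', A, D, A)) / (n + 1.0)
    return bool(np.all(lhs <= a + 1e-12))

# P = [-1, 1]^2: a ball of radius 2 shrinks (n = 2) to radius 2/3 and fits;
# a ball of radius 4 shrinks to radius 4/3 and does not.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
a = np.ones(4)
```

The left-hand side is simply the maximum of the linear function $A_i x$ over $E\left(\frac{1}{(n+1)^2} D, d\right)$, computed with the closed form from the previous chapter.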
CHAPTER 5
COMPUTATIONAL ANALYSIS FOR DOMAIN SPECIFIC KNOWLEDGE WITH SUPPORT VECTOR MACHINES

In this chapter we compute the fat-shattering dimensions of different polytopes using various methods. In the first section, we give an overview of our analysis, how we generated sample problems, and our testing methodology. In the second section, we perform numerical analysis for box-constraint and polytope bounding objects. For box constraints we compute the space diagonal and upper bounds on gamma. For polytopes we perform a series of analyses with various ellipsoid methods and combinations, and compute a bound on the metric diameter of the polytope. We compare our results to the approximate L-J ellipsoid. Finally, we make a brief comparison between the luckiness framework and our methodology.

Overview

All of our computational analyses were performed using Matlab 6.5 and run on a Pentium III 900 MHz PC under Windows XP. We randomly generated polytopes of different dimensions. Table 5-1 shows that computing all the vertices of a polytope is time consuming, as the number of vertices grows very rapidly in the number of dimensions and constraints. In Table 5-1 we show the number of vertices as well as an upper bound on the number of vertices due to the formula by McMullen (1970), as the number of constraints and/or the number of dimensions increases. Note that the minimum number of vertices in n dimensions is n+1, and the size of the vertex set depends on the number of constraints and on degeneracy (for example, if a vertex lies on more than n hyperplanes, then the
number of vertices is lower than when every vertex lies on exactly n hyperplanes).

Table 5-1. CPU time and number of vertices for polytopes
Number of Constraints | Dimensions | CPU Time | Vertices | Upper Bound on Number of Vertices
5 | 2 | <1 | 5 | 5
5 | 3 | <1 | 6 | 6
10 | 2 | <1 | 10 | 10
10 | 3 | <1 | 16 | 16
20 | 3 | 1.76 | 36 | 36
40 | 3 | 5.528 | 76 | 76
50 | 3 | 9.11 | 96 | 96
10 | 4 | 1.832 | 27 | 35
20 | 4 | 4.99 | 82 | 170
40 | 4 | 13.64 | 196 | 740
20 | 5 | 28.46 | 170 | 272
40 | 5 | 100 | 572 | 1332
80 | 5 | 355 | 1456 | 5852
20 | 6 | 344 | 376 | 800

Our random polytope generation process can be summarized as follows. First, using m random weight vectors $A_1, \ldots, A_m$, we randomly generate a set of m hyperplanes $h_i : A_i x = a_i$. The bias factor $a_i$ can be thought of as a scale, and therefore in our analyses it is fixed at 1 for convenience. These hyperplanes are used to generate the halfspace inequalities $h_i : A_i x \le a_i$. Such a generation results in polytopes that always contain the origin, as $x = 0$ is a trivial solution to the set of inequalities. In order to randomize the process further, we moved the polytopes away from the origin by using a random translation vector depending on the diameter of the polytope. The amount of shift is determined by multiplying a random number drawn from the interval [0, 5] by the diameter of the polytope.
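The generation procedure can be sketched as follows. This is our own illustrative sketch: the diameter used for the shift is replaced by a placeholder estimate (an assumption of ours), since computing the true diameter is the hard part, and the shift direction is drawn at random:

```python
import numpy as np

def random_polytope(m, n, rng, diam_estimate=2.0):
    # Rows A_i are random unit normals with a_i = 1, so {x : Ax <= a}
    # contains the origin; the polytope is then translated by a random
    # vector t, i.e. a becomes a + A t (since A(x - t) <= a).
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    a = np.ones(m)
    t = rng.uniform(0.0, 5.0) * diam_estimate * rng.standard_normal(n)
    return A, a + A @ t, t

rng = np.random.default_rng(0)
A, a, t = random_polytope(10, 3, rng)
```

Translating the polytope by adjusting the right-hand side, rather than moving points, keeps the data in the halfspace form $Ax \le a$ that the ellipsoid method consumes.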
Comparative Numerical Analysis for Box-Constraints and Polytopes

Box constraints can be established if there exist an upper and a lower bound for every attribute. In Proposition 3.7, we provided a simple formula for the space diagonal (SD): $SD^2 = \sum_{i=1}^{n} (ub_i - lb_i)^2$, where $lb_i$ and $ub_i$ are the lower and upper bounds corresponding to the ith attribute.

Polytopes versus Hyper-rectangles

For some attributes, upper and lower bounds may not naturally exist to form a hyperrectangle, but such attributes may be bounded via constraints involving combinations of other attributes. For example, in an intrusion detection problem there may be no upper bound on the connection duration attribute. But for some protocols there may exist a relationship enabling one to assert an upper bound on the connection duration. For example, if the protocol type is ICMP (Internet Control Message Protocol), then the connection is known to be momentary. Then for the two attributes connection duration and ICMP (binary) one could assert a constraint of the form $duration \le M(1 - ICMP)$ for a suitable constant M. If one can formulate all these to get a bounded polyhedron, then one can compute the diameter of the polytope instead of the SD of the hyperrectangle.

Furthermore, even if all we know is that the attributes are upper and lower bounded, allowing us to form a bounding hyperrectangle, any additional knowledge in the form of linear combinations of other attributes may potentially help refine the size of the input space, and possibly reduce the diameter of the polytope.

Generating Polytopes

If a polytope is given in the form of a set of vertices, there are known polynomial-time algorithms to approximate the L-J ellipsoid containing the polytope. In Chapter 4, we
mentioned two studies (Rimon and Boyd, 1992, 1997; Kumar and Yildirim, 2004) for such ellipsoid generation. For example, the complexity of the $\epsilon$-approximation algorithm in Kumar and Yildirim (2004) is $O(nd^3/\epsilon)$, where n is the number of vertices and d is the dimension. We use the $\epsilon$-approximate L-J ellipsoid as a benchmark for our results. Note that as the number of vertices grows, such algorithms become impractical, or their application must be limited to a very large $\epsilon$ for time considerations. For this reason, we computed $\epsilon$-approximate L-J ellipsoids only for a limited number of cases. The ellipsoids are computed using space dilation.

Table 5-1 summarizes the generated polytopes as well as their computed L-J ellipsoids. A total of 140 random polytopes were generated. We used GBT 7.0, which implements a method to compute the L-J ellipsoid with successive space dilation controlled by two parameters: iterations (set to 50,000) and accuracy (set to $10^{-8}$). Whenever one of the two bounds is reached, the $\epsilon$-approximate L-J ellipsoid is formed and the algorithm stops. Figure 5-1 shows the L-J approximation and the tightness of the ellipsoid's fit for four different parameter settings.
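When the vertex list is available, an approximate L-J ellipsoid can also be computed with Khachiyan-style first-order updates. The following is a compact sketch of ours of that well-known scheme, not the GBT space-dilation code used in the experiments:

```python
import numpy as np

def mvee(points, tol=1e-7, max_iter=10000):
    # Khachiyan-type approximation of the minimum-volume enclosing
    # ellipsoid of a point set; returns (M, center) with
    # E = {x : (x - center)' M (x - center) <= 1}.
    P = np.asarray(points, dtype=float)
    n, d = P.shape
    Q = np.column_stack([P, np.ones(n)])          # lifted points
    u = np.full(n, 1.0 / n)                       # weights on the points
    for _ in range(max_iter):
        X = Q.T @ (u[:, None] * Q)
        w = np.einsum('ij,ij->i', Q @ np.linalg.inv(X), Q)
        j = int(np.argmax(w))
        step = (w[j] - d - 1.0) / ((d + 1.0) * (w[j] - 1.0))
        if step <= tol:
            break
        u *= 1.0 - step
        u[j] += step
    center = P.T @ u
    S = P.T @ (u[:, None] * P) - np.outer(center, center)
    return np.linalg.inv(S) / d, center

def ellipsoid_diameter(M):
    # semi-axes are 1/sqrt(eigenvalues of M); the diameter is the longest axis
    return 2.0 / np.sqrt(np.linalg.eigvalsh(M)[0])
```

The polytope diameter is then upper bounded by `ellipsoid_diameter(M)`, which is the quantity compared against the true diameter in the "% Diff" column.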
Figure 5-1. The L-J approximation for A) $\epsilon = 10^{-4}$ and 500 iterations, B) $\epsilon = 10^{-4}$ and 1000 iterations, C) $\epsilon = 10^{-4}$ and 5000 iterations, D) $\epsilon = 10^{-4}$ and 50000 iterations.

In Table 5-1, the polytope diameter is computed by finding the maximum distance over all pairs of vertices. The column "L-J Diam" in the table is the diameter of the $\epsilon$-approximate L-J ellipsoid. Note that the diameter of the ellipsoid is not necessarily the same as the diameter of the polytope. Although one could construct such cases, during our analysis none of the random polytopes had the same diameter as its L-J ellipsoid. Finally, the last column "% Diff" indicates the percentage difference between the polytope's and the L-J ellipsoid's diameters. We observed that, for the same dimension, the difference decreased as the number of constraints increased, with the exception of the "5 constraints in 2 dimensions" set, which was greatly influenced by one outlier case.
Table 5-1. Summary of random polytopes generated in 2, 3 and 5 dimensions
Number of Constraints | Dims | Number of Datasets | Iters | Accuracy | Polytope Diameter | L-J Diam | % Diff
5 | 2 | 20 | 50000 | $10^{-8}$ | 12.48 | 15.61 | 25.08%
10 | 2 | 20 | 50000 | $10^{-8}$ | 3.10 | 3.47 | 12.27%
5 | 3 | 20 | 50000 | $10^{-8}$ | 24.25 | 34.25 | 41.24%
10 | 3 | 20 | 50000 | $10^{-8}$ | 5.07 | 6.57 | 29.44%
20 | 3 | 20 | 50000 | $10^{-8}$ | 3.15 | 3.53 | 11.93%
50 | 3 | 20 | 50000 | $10^{-8}$ | 2.47 | 2.69 | 8.57%
20 | 5 | 20 | 50000 | $10^{-8}$ | 9.07 | 12.10 | 33.48%

In the next section we use the ellipsoid method to approximate the polytope diameter and the L-J ellipsoid.

Using the Ellipsoid Method to Upper Bound the Polytope Diameter

In Chapter 4 we discussed the theory of the ellipsoid method. In this section we apply it to approximate the L-J ellipsoid. The method starts with a ball centered at the origin that is large enough to contain the polytope and then, with successive volume-reducing iterations, finds a feasible point. Finding such a ball, or being sure that a ball is guaranteed to contain the polytope, is not very straightforward. The enclosing ball is found by using the theory of encoding lengths of integer numbers and matrices with integer data. The details of encoding lengths can be found in Grötschel et al. (1988). We previously noted that all the coefficients of the polytope data matrix are integers (or can be assumed so, since computers can only handle rational numbers and rationals can be converted to integers using the greatest common denominator). In short, the volume of every full-dimensional such polytope can be bounded by a term that involves the description length of its coefficients. With the help of this theory it is shown that if the
data for the polytope $Ax \le a$ are all integers, a ball $E\left(n\, 2^{2(\langle A, a \rangle - n^2)} I, 0\right)$ centered at the origin, with radius $\sqrt{n}\, 2^{\langle A, a \rangle - n^2}$, contains the polytope, where $\langle A, a \rangle$ denotes the encoding length of the data.

In the following we start with the original central-cut ellipsoid method, which terminates once the feasible region is reached. In order to have a tighter ellipsoid at the time the feasible region is hit, we propose some modifications to this method, such as choosing the maximum violated constraint, or preferring cuts more orthogonal to the eigenvector corresponding to the largest eigenvalue of the ellipsoid's characteristic matrix. Then we continue by illustrating how the shallow-cut ellipsoid method can be used to approximate the L-J ellipsoid.

Central Cut Ellipsoid Method

In central cuts, we make our cuts through the centroid of the ellipsoid. Figure 5-2 illustrates the methodology on a randomly generated 3-dimensional polytope with 10 constraints. The method starts with a ball that contains the polytope. Iteratively the volume decreases. Note that the diameter does not necessarily decrease at every iteration even though the volume does. Also note that the improvement in terms of bounding the diameter is not very significant.

The overall performance of the central-cut ellipsoid method is given in Table 5-2. In the table, note that a large number of iterations (column "Iters") is not necessarily an indicator of a good approximation, as the diameter can get bigger. The best performance for the central cut was observed with 5x2 polytopes (5-constraint polytopes in 2 dimensions). We would also like to note that for some problems the smaller-volume ellipsoids obtained by the method actually had larger diameters than the starting ellipsoids. For one problem in the dataset the ending diameter was significantly larger
and it caused the average diameter reduction to be negative for that set (10x3). Perhaps the most important observation is that neither a large reduction in volume (Column: "Volume Reduct") nor a large number of iterations necessarily indicates a substantial diameter reduction (Column: "Diam Reduct"). Overall, the central-cut method did not perform very well.

Figure 5-2. The central-cut ellipsoid method illustrated on a 3-dimensional polytope with a diameter of 3.77. A) First iteration: the polytope is contained in a large ball centered at the origin with a diameter of 37.7. B) Second iteration: the new ellipsoid has a diameter of 39.98, and its volume is 84% of the initial ball's. C) Third iteration: the diameter is 42.41 and the volume is 71.1% of the initial ball's. D) Eighth and final iteration: the diameter is 34.67, and the volume is 30.4% of the initial ball's volume.
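The central-cut update described in this section can be sketched with the standard textbook formulas (as in, e.g., Bland, Goldfarb, and Todd, 1981); this is not the dissertation's own code. The ellipsoid is represented as E(D, d) = {x : (x - d)' D⁻¹ (x - d) ≤ 1}, with D its characteristic matrix:

```python
import numpy as np

def central_cut_step(D, d, a):
    """One central-cut ellipsoid update for a violated constraint a'x <= c.

    The ellipsoid is E(D, d) = {x : (x - d)' inv(D) (x - d) <= 1}.  The cut
    passes through the center d, keeping the half {x : a'x <= a'd}.
    """
    n = len(d)
    b = D @ a / np.sqrt(a @ D @ a)   # a's direction scaled by the ellipsoid metric
    d_new = d - b / (n + 1)          # center shifts into the kept half
    D_new = (n ** 2 / (n ** 2 - 1.0)) * (D - (2.0 / (n + 1)) * np.outer(b, b))
    return D_new, d_new
```

A single step on a 2-dimensional ball already reproduces the observation above: the determinant of D (hence the volume) shrinks by a fixed factor, while the longest semi-axis, and therefore the diameter, can grow.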
Table 5-2. The central-cut ellipsoid method applied to 2- and 3-dimensional datasets

Constraints  Dims  Iters  Polytope Diam  LJ Diameter  Ending Diam  Diam Reduct  Volume Reduct
 5           2      9.20   5.69           6.47          29.30      48.49%       77.88%
10           2      5.20   3.07           3.76          25.29      17.56%       52.99%
 5           3     31.00  24.63          34.71         177.89      27.77%       99.25%
10           3     16.40   6.06           7.33          60.77      -0.26%       87.13%
20           3     14.60   3.18           3.46          19.77      37.92%       87.20%
50           3     16.50   2.40           2.61          35.56      10.98%       89.87%
Average                                                            20.80%       82.39%

Figure 5-3. Volume reduction does not necessarily reduce the diameter.

The reason behind the poor performance of the central cut is rather intuitive. The naive implementation searches for violated constraints, picks the first violated one, and makes the cut according to that constraint. Neither a selection criterion nor a cut pattern is utilized. Figure 5-3 shows a sequence of ellipsoids for a problem of size 50x3. In the figure, one can observe that most of the cuts in the cut sequence are made with respect
to the same constraint, until that constraint is no longer violated. Even though such consecutive cuts result in volume reductions, they may also cause the diameter to increase. In the remainder of this section we introduce a couple of different approaches that improve these results.

Maximum Violated Constraint

Instead of picking an arbitrary violated constraint, we pick the constraint that is violated the most. The amount of violation can be found by computing the distance, in the ellipsoid's metric, between a violated constraint and the centroid of the ellipsoid:

(a_i'd - c_i) / √(a_i'D a_i)

where a_i'x ≤ c_i is the violated constraint, d is the center of the ellipsoid, and D is its characteristic matrix.

Table 5-3. The central-cut ellipsoid method applied to 2- and 3-dimensional datasets with maximum-violated-constraint selection

Constraints  Dims  Iters  Polytope Diam  LJ Diameter  Ending Diam  Diam Reduct  Volume Reduct
 5           2      8.00   5.69           6.47          32.60      42.69%       71.44%
10           2      3.40   3.07           3.76          30.46       0.70%       41.38%
 5           3     25.75  24.63          34.71         189.83      22.93%       96.96%
10           3     17.38   6.06           7.33          41.21      32.00%       87.38%
20           3     11.20   3.18           3.46          27.77      10.97%       79.58%
50           3     16.00   2.41           2.59          22.36      34.28%       94.03%
Average                                                            23.93%       78.46%

When the maximum violated constraint is taken into account, the diameter reductions for some of the problems improved. Although the improvement was not consistent across all of the problems at hand, Table 5-3 suggests that, despite smaller volume reductions, the maximum-violated-constraint approach achieved better diameter reductions than the central cut.
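The selection rule just described can be sketched as follows, assuming the same representation E(D, d) as before; the function name `most_violated` is ours, not the dissertation's:

```python
import numpy as np

def most_violated(A, c, D, d):
    """Return the index of the row of Ax <= c most violated at the center d,
    measuring each violation in the ellipsoid metric:
    (a_i'd - c_i) / sqrt(a_i' D a_i).  Returns None if d is feasible."""
    # einsum computes a_i' D a_i for every row a_i of A at once
    depths = (A @ d - c) / np.sqrt(np.einsum('ij,jk,ik->i', A, D, A))
    i = int(np.argmax(depths))
    return i if depths[i] > 0 else None
```

Normalizing by √(a_i'D a_i) matters: it compares violations in the geometry of the current ellipsoid rather than in raw constraint units.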
Shallow/Deep Cuts

During the ellipsoid method, for every violated constraint we calculate the amount of the violation. At each iteration, we form another ellipsoid that contains the feasible area, the polytope. With the deepest cut we find the L-J ellipsoid of the region bounded by the violated constraint and the current ellipsoid. The computational results for deep cuts are given in Table 5-4. Using deep cuts, smaller-volume ellipsoids are reached within the same or a smaller number of iterations. Therefore the computational effort, as well as the number of iterations required, is no more than for shallower cuts.

Table 5-4. The ellipsoid method with deep cuts for 2- and 3-dimensional datasets

Constraints  Dims  Iters  Polytope Diam  LJ Diameter  Ending Diam  Diam Reduct  Volume Reduct
 5           2      7.20   5.69           6.47          17.79      68.72%       95.21%
10           2      5.80   3.07           3.76          16.93      44.80%       88.04%
 5           3     18.00  24.63          34.71          81.01      67.11%       99.88%
10           3     12.00   6.06           7.33          22.19      63.39%       97.43%
20           3      4.40   3.18           3.46          17.28      45.72%       95.92%
50           3      7.00   2.41           2.59          14.73      65.29%       98.61%
Average                                                            59.17%       95.85%

In this document we refer to the deepest cut as a deep cut and everything else as shallow. In order to find a good value for shallowness numerically, we assign a parameter that controls the depth of every cut. In Chapter 4 we defined β to be such a parameter. However, coming up with a β value that will work well for all datasets is not possible, as a suitable value of β depends on the amount of violation. Since the maximum value β_max that can be assigned depends on the amount of violation, we cannot universally test for the best value. Instead, at each iteration we form a percentage grid over [0, β_max] for β, to observe the performance at various levels. The results are shown in Table 5-5. We set grid points at .05 intervals and test our
dataset at each value. In the table, since 100% corresponds to a deep cut, i.e., to greediness, the average of the best-performing percentages is named "average greediness" (Column: "Avg Greed"). Except for a few cases we found that being greedy is the best approach computationally. This also makes sense in that with deep cuts the same constraint is not violated twice in a row, as the ellipsoid formed is the L-J ellipsoid of the intersection of the previous ellipsoid with the constraint's halfspace, which contains the polytope.

At the beginning of this section we indicated that consecutive cuts result in volume reductions, but may also cause the diameter to increase. The main reason deep cuts do better, however, we believe, is that a deep cut penetrates further: when shallower cuts are made, a feasible point may be found, and the method terminated, before that magnitude of penetration is reached. An illustration is given in Figure 5-4.

Figure 5-4. A) A series of shallow cuts with respect to one constraint until it is no longer violated. The final ellipsoid generated in the figure has a diameter of 384. B) A deep cut with respect to the same constraint. The ellipsoid generated with the cut has a diameter of 330 and its volume is about 56.9% of the final ellipsoid in A.
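A depth-parameterized cut covering the whole central/deep/shallow range discussed above can be sketched with the standard update formulas (as in, e.g., Grötschel et al., 1988); the depth is written here as alpha = (a'd - c)/√(a'Da), which plays the role of the Chapter 4 shallowness parameter, and the code is our sketch rather than the dissertation's:

```python
import numpy as np

def cut_step(D, d, a, c):
    """Ellipsoid update for the cut a'x <= c at depth
    alpha = (a'd - c) / sqrt(a'Da): alpha > 0 is a deep cut, alpha = 0 a
    central cut, and -1/n < alpha < 0 a shallow cut."""
    n = len(d)
    g = np.sqrt(a @ D @ a)
    alpha = (a @ d - c) / g
    assert -1.0 / n < alpha < 1.0, "cut would keep or discard the whole ellipsoid"
    b = D @ a / g
    tau = (1 + n * alpha) / (n + 1)                      # center step length
    sigma = 2 * tau / (1 + alpha)                        # rank-one downdate weight
    delta = (n ** 2 / (n ** 2 - 1.0)) * (1 - alpha ** 2) # overall scaling
    d_new = d - tau * b
    D_new = delta * (D - sigma * np.outer(b, b))
    return D_new, d_new
```

With alpha = 0 this reduces exactly to the central-cut update; for the same violated constraint, a deep cut yields a strictly smaller volume in one step.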
Table 5-5. With deep cuts the method converges faster and the generated ellipsoids are of smaller size

Constrs  Dims  Iters  Polytope  LJ Diam  Ending Diam  Diam Reduct  Volume Reduct  Avg Greed
 5       2      7.40   5.69      6.47     17.72       68.85%       95.05%         0.99
10       2      6.75   3.07      3.76     11.89       61.23%       87.46%         0.93
 5       3     16.25  24.63     34.71     94.48       61.64%       97.26%         1.00
10       3     12.50   6.06      7.33     21.74       64.13%       97.39%         0.99
20       3      5.40   3.18      3.46     15.39       51.65%       95.47%         0.97
50       3      7.00   2.41      2.59     14.73       65.29%       98.61%         1.00
Average                                               62.13%       95.20%         0.98

Proceeding After a Feasible Point is Found

Numerically, the results suggest that deep cuts reduce the diameter more than shallow ones; this is also illustrated in Figure 5-4. Even in terms of serving the ellipsoid method's original purpose, that is, determining feasibility, taking shallow cuts does not make much sense, as one can reduce computational effort by simply eliminating larger pieces of the ellipsoid. Shallow cuts prove useful, however, when a feasible point has been reached and we would like to continue shrinking the ellipsoid further. In other words, we can use shallow cuts if we would like to find a point that is not only feasible but also deep in the polytope. That is, even after hitting the feasible region the method proceeds to find another L-J ellipsoid containing the feasible region. Grötschel et al. (1988) define the framework as finding the minimum-volume ellipsoid containing a point in the polytope that is deep enough to satisfy

c'x ≤ c'd - (1/(n+1)) √(c'Dc).

A point that satisfies this constraint is called a tough point; otherwise it is called a weak point.
Figure 5-5. The shallow-cut method can continue even after the feasible region is found. The largest ball is the starting ball centered at the origin. The next ball is the ball whose center is feasible. The smallest ball that contains the polytope is the one obtained by using the shallow-cut method.

From the above, having discovered that deep cuts help to penetrate further, we used deep cuts to enter the feasible region and then proceeded with shallow cuts to find a "tough" point in the polytope (Figure 5-5). Basically, for every constraint we determine whether it is tough, and if not we form a new ellipsoid according to that constraint. When the new ellipsoid is formed, we set β to its maximum possible value. Table 5-6 gives the results for this approach. Especially in lower dimensions the diameters that we found (Column: "Ending Diam") are good approximations to the L-J diameters.
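One plausible reading of the tough-point test described above is a per-constraint slack check at the ellipsoid center; the helper name `is_tough` and the default beta = 1/(n+1) follow our reading of Grötschel et al. (1988), not code from the dissertation:

```python
import numpy as np

def is_tough(A, c, D, d, beta=None):
    """Check whether the center d of E(D, d) is a 'tough' point: every
    constraint a_i'x <= c_i of the polytope Ax <= c holds at d with slack
    beta * sqrt(a_i' D a_i), with beta = 1/(n+1) by default."""
    n = len(d)
    if beta is None:
        beta = 1.0 / (n + 1)
    # per-constraint slack measured in the ellipsoid metric
    slack = beta * np.sqrt(np.einsum('ij,jk,ik->i', A, D, A))
    return bool(np.all(A @ d <= c - slack))
```

In the loop sketched above, any constraint failing this check would trigger one more (shallow) cut with respect to that constraint.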
Table 5-6. Proceeding after a feasible point is found by randomly choosing a violated constraint and assigning the maximum β value that can be assigned

Constraints  Dims  Iters    Polytope  LJ Diameter  Ending Diam  Diam Reduct  Volume Reduct
 5           2      214.40   5.69      6.47         7.62        86.60%       98.84%
10           2      270.00   3.07      3.76         4.43        85.56%       98.05%
 5           3      366.00  24.63     34.71        38.77        84.26%       99.98%
10           3      549.00   6.06      7.33         9.60        84.16%       99.74%
20           3      674.20   3.18      3.46         6.40        79.90%       99.30%
50           3      694.20   2.41      2.59         6.14        86.46%       99.80%
20           5     1043.00   8.49     11.45        10.59        91.18%       99.99%
Average                                                         85.45%       99.39%

We then modified the method for selecting the violated constraint by adopting one of the approaches tried above: we picked the constraint with the largest possible β value, in hopes of deeper penetration. As Table 5-7 shows, by cutting according to the constraint with the largest β value we are able to slightly improve the results.

Table 5-7. Proceeding after a feasible point is found by choosing the most violated constraint and assigning the maximum β value that can be assigned

Constraints  Dims  Iters    Polytope  LJ Diameter  Ending Diam  Diam Reduct  Volume Reduct
 5           2      214.40   5.69      6.47         7.62        86.60%       98.84%
10           2      270.00   3.07      3.76         4.43        85.56%       98.05%
 5           3      366.00  24.63     34.71        38.77        84.26%       99.98%
10           3      549.00   6.06      7.33         9.60        84.16%       99.74%
20           3      674.20   3.18      3.46         6.40        79.90%       99.30%
50           3      694.20   2.41      2.59         6.14        86.46%       99.80%
20           5     1043.00   8.49     11.45        19.10        82.55%      100.00%
Average                                                         85.45%       99.39%

Fat Shattering Dimension and Luckiness Framework

In this section we compare our approach with other methodologies for finding generalization error bounds for SVM learning. So far we have discussed the VC-dimension-based generalization errors due to Vapnik and Chervonenkis (1971). SVM learning
proposes using a maximum-margin hyperplane to separate the classes from each other. In the generalization error bound performance table, Table 3-1, the VC-dimension-based error bounds are tighter than the other bounds; however, they do not support the idea behind SVM, as they only require separability of the classes.

In the other two bounds, the distance of the furthest point from the origin is one of the determinants of the fat-shattering dimension as well as of the generalization bound. The first bound assumes that the entire input space X is contained in a ball of radius R. For such an assumption to be true, the input space must be bounded. In that case our approach can be utilized to center the input space at the origin so that its diameter is minimized and the radius is bounded from above. This reduces the fat-shattering bound fat_F as well as the generalization error bound.

Let us illustrate our results on one of our randomly generated 3-dimensional polytopes formed by intersecting 5 halfspaces (Figure 5-6). Assume that our input space is contained in such a polytope. The diameter of the polytope is 39.67, and the diameter of the minimum ellipsoid that we found is 54.57, which is 3.7% bigger than the ε-approximate L-J ellipsoid diameter of 50.59. The polytope has 6 vertices. The vertex-origin distances for the vertex furthest from the origin and for the vertex closest to the origin are 63.07 and 58.69, respectively.
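The dependence on R described above can be made explicit by recalling the fat-shattering bound for large-margin hyperplanes over a ball of radius R (Bartlett and Shawe-Taylor, 1999); up to the exact constant used in Chapter 2, with γ denoting the margin:

```latex
% Fat-shattering dimension of margin-\gamma hyperplanes on X \subseteq B(0, R):
\mathrm{fat}_{\mathcal{F}}(\gamma) \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```

So recentering the input space to halve R divides this bound by four at every margin γ, which is exactly what the L-J-ellipsoid-based shift to the origin buys.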
Figure 5-6. The ε-approximate L-J ellipsoid for the polytope

In this scenario, the smallest ball that contains any training set has a diameter of at least 58.69. Similarly, the smallest ball that contains any training set has a diameter of at most 63.07, and the largest achievable margin is therefore 39.67 / 2 = 19.84. The VC-dimension is 4, δ is set to 0.10, and the error bound based on the VC-dimension is 0.012 for a sample size of 10,000. Table 5-8 illustrates how incorporating domain knowledge may help in bounding the errors. The precision of the table is only sufficient to compare the effects of incorporating domain knowledge. We know that the smaller the fat-shattering dimension fat_F is, the larger the chance of achieving a tighter error bound.

In Table 5-8, part A illustrates the case where domain knowledge is not taken into account. In this case, the radius that contains all the sample points is at least 58.69 (rows
1 through 5), and the margin is at most 19.84. The radius of the ball containing all the points is 63.07 only if the maximum possible margin is achieved (rows 6 through 9). For a sample size of 10,000, the table suggests that the luckiness framework bounds are much looser.

Part B contains the results with domain-specific knowledge taken into account. R is 27.29 (half of 54.57, the diameter of the minimum ellipsoid found). In part B, the bounds based on the fat-shattering dimension and on X being contained in a ball of radius R are tighter than those in part A due to the shift to the origin. Also, the results for the luckiness framework in part B are tighter than those in part A. However, if the input space is already included in a polytope, one does not need the luckiness framework. Rows 7 through 9 and 16 through 18 are given to show numerically how tight the error bounds based on luckiness are.

Table 5-8. Performance comparison for a 3-dimensional input space with 5 constraints. Part A illustrates the case with domain knowledge not taken into account, and part B with domain-specific knowledge taken into account

Part  R      Margin  l         Bound based on fat_F and X in a ball of radius R  Luckiness Framework  Row
A     58.69   3      10000     N/A                                               N/A                   1
A     58.69   6      10000     N/A                                               N/A                   2
A     58.69   9      10000     0.95                                              N/A                   3
A     58.69  12      10000     0.53                                              N/A                   4
A     58.69  15      10000     0.34                                              N/A                   5
A     63.07  19.84   10000     0.22                                              N/A                   6
A     63.07  19.84   7838159   0.00                                              N/A                   7
A     63.07  19.84   46236744  0.00                                              0.99                  8
A     63.07  19.84   10000000  0.00                                              3.27                  9
B     27      3      10000     N/A                                               N/A                  10
B     27      6      10000     0.61                                              N/A                  11
B     27      9      10000     0.28                                              N/A                  12
B     27     12      10000     0.16                                              N/A                  13
B     27     15      10000     0.11                                              N/A                  14
B     27     19.84   10000     0.07                                              N/A                  15
B     27     19.84   7838159   0.00                                              0.99                 16
Table 5-8. Continued

Part  R   Margin  l         Bound based on fat_F and X in a ball of radius R  Luckiness Framework  Row
B     27  19.84   46236744  0.00                                              0.25                 17
B     27  19.84   50000000  0.00                                              0.23                 18

Our approach has two main potential benefits compared to the luckiness framework. The first one is indirect. When one or more of the attributes are not bounded, the input space cannot be contained in any ball of finite radius. However, the radius of the training set is finite. A possible approach is to find the minimum enclosing ball (ellipsoid) for the training points and linearly transform the input space so that the training set is centered at the origin. The luckiness framework due to Shawe-Taylor et al. (1998) does not suggest any linear transformation of the input space. This approach does not alter the framework, but the generalization bound would be tighter due to a smaller radius.

The second benefit emerges if the input space is contained in a polytope. In this case the luckiness framework is not needed. Assume that the radius of the n-dimensional polytope P is R and that the upper bound we obtained by approximating the L-J ellipsoid is R̄. We know that R ≥ R̄/(n+1), as the concentric ellipsoid formed by shrinking our L-J approximation lies completely inside P. Table 3-1 indicates that a bound depending on R̄ is numerically tighter than a bound obtained through the luckiness framework.
CHAPTER 6
SUMMARY AND CONCLUSIONS

In this research, after reviewing the literature on generalization error bounds and support vector machines, we proposed a novel method to incorporate domain-specific knowledge in support vector machines.

Generalization error bounds in SVM are in the form of probably approximately correct learning bounds. In Chapter 2, we started with the initial PAC-learning bound due to Valiant (1984) and continued with the further development of error bounds since then. We summarized these bounds using our notation and then developed new generalization error bounds for SVM learning. Towards the end of Chapter 2, we related the fat-shattering dimension and error bounds to SVM, and we briefly overviewed the SVM learning technique.

In Chapter 3, we noted that in order to enhance learning, domain knowledge may be taken into account; this is also motivated by the NFL theorems. We bounded the input space first with box constraints formed by upper and lower bounds on the attributes, and then we considered the more general case where the input space is bounded by a polytope. In each case we showed that the error bounds can be enhanced by simply finding the metric diameter of the convex bodies containing the input space. Although finding the space diagonal of a hyper-rectangle is simple, we observed that finding the metric diameter of a polytope is a rather difficult (convex maximization) problem. We proposed using the diameter of minimal containing ellipsoids (L-J ellipsoids) for polytopes as an upper bound on the metric diameter.
In Chapter 4 we reviewed the theory behind the ellipsoid method and its variants. The ellipsoid method can be modified to approximate L-J ellipsoids. In Chapter 5 we tested our approach on a variety of randomly generated problem sets. First we computed approximate L-J ellipsoids for each of the randomly generated polytopes using several alternative approaches. Then we computed hypothetical error bounds for those problem sets with and without our methodology. By using our methodology we were able to improve the error bounds significantly.

Briefly, we have laid out a framework to incorporate domain-specific knowledge in SVM. To the best of our knowledge, this is the first study that incorporates domain knowledge directly from the entire set of attributes regardless of the labeling. The domain knowledge tightens the error bound for learning. A tighter bound can be interpreted as increased confidence, and/or as the sufficiency of a smaller sample set for the learning task. We observed that the error bounds under the luckiness framework for SVM are often too loose to provide any useful insight about the sample and the learning. When domain knowledge is utilized, not only does the luckiness framework become unnecessary, but the error bounds depending on the fat-shattering dimension are also tighter.

Our study also has several limitations, and therefore offers several research opportunities. Perhaps the most important one is that if kernels are used, the attributes in the feature space change, and the bounding polytope may not remain a polytope after the mapping, negating the use of our methodology. These limitations can be addressed by considering the properties of the feature space in an attribute-independent manner. For example, since the polytope constraints may not carry over to the feature space, the bounding polytope cannot be formed there.
Instead, an analysis with the luckiness framework on the convex hull of the sample set in the feature space can be carried out.

Another limitation is that the interrelationships among the attributes may not necessarily be linear, and constructing a bounding polytope may not be trivial for some domains. Other ways to capture domain knowledge must be investigated. More importantly, our study addresses domain knowledge only in the form of boundary conditions. Doubtless, there is a need for studies investigating other ways of incorporating domain knowledge in SVM learning.

Another limitation is that sometimes not all of the attributes can be bounded, and the input space can only be included in an unbounded polyhedron. In that case one or more of the attributes remains unbounded. Therefore our methodology does not directly apply, and the need for the luckiness framework, despite its limited value, remains. Intuitively, if some of the attributes can be bounded, then error bounds tighter than those in the luckiness framework may be derived. An immediate extension of this work is to study input spaces that are unbounded in some variables by modifying the luckiness framework to tighten the error bounds using the existing constraints.

Perhaps the most striking but non-trivial extension of this work would be to use our methodology to enhance SVM learning itself. During our studies we discovered several potential application domains for which our methodology could prove useful. As a future research project we would like to apply our methodology to an application domain by tailoring SVM learning accordingly. The challenge is that the domain knowledge we consider is not as strong and demanding as the labeled knowledge sets in the study of Fung et al. (2001).
In conclusion, we believe more research should focus on incorporating domain knowledge in learning. Encoding prior knowledge about a domain into a learning problem will increase the confidence of the learner and, perhaps more importantly, the resulting accuracy of the induced concepts.
LIST OF REFERENCES

Alon, N., Ben-David, S., Cesa-Bianchi, N., Haussler, D., "Scale-sensitive Dimensions, Uniform Convergence, and Learnability", Journal of the ACM, Vol. 44, No. 4 (1997), p.615-631.

Anthony, M., Bartlett, P., "Function Learning from Interpolation", NeuroCOLT Technical Report Series, NC-TR-94-013, (1994). Retrieved June 20, 2005 from ftp://ftp.dcs.rhbnc.ac.uk/pub/neurocolt/tech_reports/1994/nc-tr-94-013.ps.Z

Augier, M. E., Vendele, M. T., "An Interview with Edward A. Feigenbaum", Department of Organization and Industrial Sociology, Working Paper, nr2002-16 (2002).

Avis, D., Devroye, L., "Estimating the Number of Vertices of a Polyhedron", Information Processing Letters, Vol. 73 (2001), p.137-143.

Aytug, H., He, L., Koehler, G. J., "Risk Minimization and Minimum Description for Linear Discriminant Functions", Working Paper, University of Florida, Gainesville, August 5, 2003. Retrieved June 20, 2005 from http://www.cba.ufl.edu/dis/research/03list.asp

Bartlett, P., Kulkarni, S. R., Posner, S. E., "Covering Numbers for Classes of Real-Valued Functions", IEEE Transactions on Information Theory, Vol. 43, No. 5 (1997), p.1721-1724.

Bartlett, P., Shawe-Taylor, J., "Generalization Performance of Support Vector Machines and Other Pattern Classifiers". In Schölkopf, B., Burges, C. J. C., Smola, A. J., editors, Advances in Kernel Methods: Support Vector Learning, MIT Press, (1999), p.43-54.

Bertsimas, D., Tsitsiklis, J. N., "Introduction to Linear Optimization", Athena Scientific, Massachusetts, (1997).

Bland, R. G., Goldfarb, D., Todd, M. J., "The Ellipsoid Method: A Survey", Operations Research, Vol. 29, No. 6 (1981), p.1039-1091.

Blekherman, G., "Convexity Properties of the Cone of Nonnegative Polynomials", to appear in Discrete and Computational Geometry, arXiv preprint math.CO/0211176 (2002).

Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M., "Learnability and the Vapnik-Chervonenkis Dimension", Journal of the ACM, Vol. 36, No.
4, (1989), p.929-965.
Cristianini, N., Shawe-Taylor, J., "An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods", Cambridge University Press, Cambridge, UK, (2000).

Cristianini, N., Shawe-Taylor, J., Lodhi, H., "Latent Semantic Kernels", Journal of Intelligent Information Systems, Vol. 18, Issue 2/3 (2002), p.127-152.

Devroye, L., Györfi, L., Lugosi, G., "A Probabilistic Theory of Pattern Recognition", Springer-Verlag, NY, (1996).

Ehrenfeucht, A., Haussler, D., Kearns, M., Valiant, L. G., "A General Lower Bound on the Number of Examples Needed for Learning", Information and Computation, Vol. 82, No. 3 (1989), p.247-261.

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., "The KDD Process for Extracting Useful Knowledge from Volumes of Data", Communications of the ACM, Vol. 39, No. 11 (1996), p.27-34.

Freddoso, A. J., Lecture Notes on Introduction to Philosophy. Retrieved June 20, 2005 from http://www.nd.edu/~afreddos/courses/intro/hume.htm

Fung, G., Mangasarian, O., Shavlik, J., "Knowledge-based Nonlinear Kernel Classifiers", Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, Technical Report 03-02, (2003). Retrieved June 20, 2005 from ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/03-02.pdf

Fung, G., Mangasarian, O., Shavlik, J., "Knowledge-based Support Vector Machine Classifiers", Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, Technical Report 01-09, (2001). Retrieved June 20, 2005 from ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-09.pdf

Grötschel, M., Lovász, L., Schrijver, A., "Geometric Algorithms and Combinatorial Optimization", Algorithms and Combinatorics: Study and Research Texts, Vol. 2, Springer-Verlag, Berlin, (1988).

Grünwald, P. G., "A Tutorial Introduction to Minimum Description Length: Theory and Applications". To appear in "Advances in Minimum Description Length: Theory and Applications" (edited by P. Grünwald, I. J. Myung, M.
Pitt), MIT Press (Forthcoming, Autumn 2004), (2004). Retrieved June 20, 2005 from http://homepages.cwi.nl/~pdg/

Gurvits, L., "A Note on a Scale-Sensitive Dimension of Linear Bounded Functionals in Banach Spaces", Theoretical Computer Science, Vol. 261 (2001), p.81-90.

Hristev, R. M., "The ANN Book", (1998). Retrieved June 20, 2005 from ftp://math.chtf.stuba.sk/pub/vlado/NN_books_texts/Hritsev_The_ANN_Book.pdf
Hush, D., Scovel, C., "Fat-Shattering of Affine Functions", Los Alamos National Laboratory Technical Report LA-UR-03-0937, (2003).

Iudin, D. B., Nemirovskii, A. S., "Informational Complexity and Efficient Methods for Solving Complex Extremal Problems", translated in Matekon: Translations of Russian and East European Mathematical Economics, Vol. 13 (1976), p.25-45.

Joachims, T., "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proceedings of the European Conference on Machine Learning (ECML), Springer, (1998).

Joachims, T., Cristianini, N., Shawe-Taylor, J., "Composite Kernels for Hypertext Categorisation", Proceedings of the International Conference on Machine Learning (ICML), (2001).

John, F., "Extreme Problems with Inequalities as Subsidiary Conditions", Studies and Essays Presented to R. Courant on his 60th Birthday, Wiley Interscience, New York, (1948).

Khachiyan, L. G., "A Polynomial Algorithm for Linear Programming", Soviet Mathematics Doklady, 20, (1979), p.191-194.

Kohavi, R., Provost, F., "Glossary of Terms", Machine Learning, Vol. 30, Issue 2/3 (1998), p.271-274.

Kolmogorov, A., Tikhomirov, V., "ε-Entropy and ε-Capacity of Sets in Function Spaces", Translations of the American Mathematical Society, Vol. 17 (1961), p.277-364.

Kumar, P., Yıldırım, E. A., "Minimum Volume Enclosing Ellipsoids and Core Sets", to appear in Journal of Optimization Theory and Applications, (2004).

Lavallee, I., Duong, C. P., "A Parallel Probabilistic Algorithm to Find 'All' Vertices of a Polytope", Technical Report, Institut National de Recherche en Informatique et en Automatique, Research Report No. 1813, (1992).

Levin, A. Y., "On an Algorithm for the Minimization of Convex Functions", Soviet Math. Dokl., 6, (1965), p.286-290.

Liao, A., Todd, M. J., "Solving LP Problems Via Weighted Centers", SIAM Journal on Optimization, Vol. 6, No. 4 (1996), p.933-960.

Linial, N., Mansour, Y., Rivest, R.
L., "Results on Learnability and the Vapnik-Chervonenkis Dimension", Information and Computation, Vol. 90 (1991), p.33-49.

Madden, M. G., Ryder, A. G., "Machine Learning Methods for Quantitative Analysis of Raman Spectroscopy Data", Proceedings of SPIE, the International Society for Optical Engineering, Vol. 4876, (2002), p.1130-1139.
Mangasarian, O., "Knowledge Based Linear Programming", Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, Technical Report 03-04, (2003). Retrieved June 20, 2005 from ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/03-04.pdf

McMullen, P., "The Maximum Number of Faces of a Convex Polytope", Mathematika, XVII, (1970), p.179-184.

Mitchell, T., "Machine Learning", McGraw Hill International, NY, (1997).

Nesterov, J. E., Nemirovskii, A., "Interior Point Polynomial Algorithms in Convex Programming", SIAM, Philadelphia, (1994).

Niyogi, P., Poggio, T., Girosi, F., "Incorporating Prior Information in Machine Learning by Creating Virtual Examples", IEEE Proceedings on Intelligent Signal Processing, Vol. 86 (1998), p.2196-2209.

Rimon, E., Boyd, S. P., "Efficient Distance Computation Using the Best Ellipsoid Fit", Technical Report, Stanford University, February 10, (1992).

Rimon, E., Boyd, S. P., "Obstacle Collision Detection Using Best Ellipsoid Fit", Journal of Intelligent and Robotic Systems, Vol. 18, (1997), p.105-126.

Rissanen, J., "Modeling by Shortest Data Description", Automatica, 14 (1978), p.465-471.

Sauer, N., "On the Density of Families of Sets", Journal of Combinatorial Theory, 13 (1972), p.145-147.

Schapire, R. E., "The Strength of Weak Learnability", Machine Learning, Vol. 5 (1990), p.197-227.

Schölkopf, B., Burges, C., Vapnik, V., "Incorporating Invariances in Support Vector Learning Machines", in C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks (ICANN'96), Springer Lecture Notes in Computer Science, Vol. 1112, Berlin, (1996), p.47-52.

Schölkopf, B., Simard, P., Smola, A., Vapnik, V., "Prior Knowledge in Support Vector Kernels", in Jordan, M., Kearns, M., Solla, S., editors, Advances in Neural Information Processing Systems 10, Cambridge, MA, (1998), p.640-646.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R.
C., Anthony, M., "Structural Risk Minimization Over Data-Dependent Hierarchies", IEEE Transactions on Information Theory, Vol. 44 (1998), p.1926-1940.

Shor, N. Z., "Convergence Rate of the Gradient Descent Method with Dilatation of the Space", translated in Cybernetics, Vol. 6, No. 2 (1970), p.102-108.
The American Heritage Dictionary of the English Language, Fourth Edition, Houghton Mifflin Company, Boston, (2000).

Valiant, L. G., "A Theory of the Learnable", Communications of the ACM, Vol. 27 (1984), p.1134-1142.

Vapnik, V. N., Chervonenkis, A. Y., "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities", Theory of Probability and Its Applications, Vol. 16, No. 2 (1971), p.264-280.

Vapnik, V. N., Estimation of Dependences Based on Empirical Data, Springer-Verlag, NY, (1982).

Vapnik, V. N., Chervonenkis, A. Y., "The Necessary and Sufficient Conditions for Consistency in the Empirical Risk Minimization Method", Pattern Recognition and Image Analysis, Vol. 1, No. 3 (1991), p.283-305.

Vapnik, V. N., The Nature of Statistical Learning Theory, Springer-Verlag, NY, (1995).

Vapnik, V. N., Statistical Learning Theory, John Wiley & Sons, Toronto, CA, (1998).

Wagacha, P. W., "Induction of Decision Trees", (2003). Retrieved June 20, 2005 from http://www.uonbi.ac.ke/acad_depts/ics/course_material/machine_learning/decisionTrees.pdf

Wikipedia, (2005). Retrieved June 20, 2005 from http://en.wikipedia.org/wiki/Baconian_method#Baconian_Method

Whitley, D., Watson, J., "A 'No Free Lunch' Tutorial", Department of Computer Science, Colorado State University, Fort Collins, Colorado, (2004).

Wolpert, D. H., "The Supervised Learning No-Free-Lunch Theorems", NASA Ames Research Center, MS 269-1 (2001).

Wolpert, D. H., "The Existence of a Priori Distinctions Between Learning Algorithms", Neural Computation, Vol. 8, Issue 7 (1996a), p.1391-1420.

Wolpert, D. H., "The Lack of a Priori Distinctions Between Learning Algorithms", Neural Computation, Vol. 8, Issue 7 (1996b), p.1341-1390.

Wolpert, D. H., Macready, W. G., "No Free Lunch Theorems for Search", Santa Fe Institute, Technical Report, SFI-TR-95-02-010 (1995).

WordiQ Encyclopedia, (2005). Retrieved June 27, 2005 from http://www.wordiq.com/definition/Learning_theory
Yudin, D. B., Nemirovskii, A. S., "Informational Complexity and Effective Methods for Convex Extremal Problems", Matekon: Translations of Russian and East European Math. Economics, Vol. 13, (1976), p.24-25.

Zhou, Y., Suri, S., "Algorithms for Minimum Volume Enclosing Simplex in R^3", Symposium on Discrete Algorithms, Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, (2000), p.500-509.
BIOGRAPHICAL SKETCH

Enes Eryarsoy received the BS degree in Industrial Engineering from Istanbul Technical University in Turkey in the spring of 2001. Since the fall of 2001, he has been a graduate assistant in the Department of Decision and Information Sciences at the University of Florida. His academic interests include data mining, machine learning, and IS economics.