Citation
FEATURE SELECTION IN SUPPORT VECTOR MACHINES

Material Information

Title:
FEATURE SELECTION IN SUPPORT VECTOR MACHINES
Copyright Date:
2008

Subjects

Subjects / Keywords:
Algorithms; Colorectal cancer; Datasets; Dimensionality reduction; Genes; Geometric planes; Lagrangian function; Linear programming; Maxims; Zero

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright the author. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
5/4/2003
Resource Identifier:
70786469 ( OCLC )

Full Text

FEATURE SELECTION IN SUPPORT VECTOR MACHINES

By

EUN SEOG YOUN

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2002

Copyright 2002 by Eun Seog Youn

This thesis is dedicated to my parents, Jaebyung Youn and Guisoon Yoo.

ACKNOWLEDGMENTS

I would like to thank the three professors who served on my thesis committee, Dr. Li Min Fu, Dr. Stanley Su, and Dr. Anand Rangarajan. Their comments and suggestions were invaluable. My special thanks go to Dr. Li Min Fu for being my supervisor and for introducing me to this area of research. He provided me with ideas, assistance, and active encouragement. I also wish to thank Steve Gunn, who wrote the Matlab code for Support Vector Machines and made its source code available on the Internet.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION
    1.1 Feature Selection Problem
    1.2 Notation
    1.3 Thesis Overview

2 SUPPORT VECTOR MACHINES
    2.1 Introduction
    2.2 Linearly Separable Case
    2.3 Linearly Inseparable Case
    2.4 Dual Problem
    2.5 Kernel Methods and Support Vector Classifiers

3 RELATED WORKS
    3.1 SVM Gradient Method
        3.1.1 Algorithm: Linear Kernel
        3.1.2 Algorithm: General Kernels
    3.2 SVM-RFE
        3.2.1 Algorithm: Linear Kernel
        3.2.2 Algorithm: General Kernels

4 NEW ALGORITHMS
    4.1 SVM Gradient-RFE
        4.1.1 Linear Kernel
        4.1.2 General Kernels
    4.2 SVM Projection-RFE

        4.2.1 Idea
        4.2.2 Linear Kernel
        4.2.3 General Kernels
        4.2.4 SVM Projection-RFE Algorithm
            4.2.4.1 Linear Kernel
            4.2.4.2 General Kernels

5 EXPERIMENTS
    5.1 Colon Cancer Data
    5.2 Experimental Design and Results

6 DISCUSSION
    6.1 Computational Consideration
    6.2 Comparison with Other Feature Selection Methods
        6.2.1 Linear Programming Methods
            6.2.1.1 Feature Selection via Concave Minimization
            6.2.1.2 SVM $\|\cdot\|_p$ Formulation
        6.2.2 Correlation Coefficient Methods
        6.2.3 Feature Scaling Methods
        6.2.4 Wrapper Methods
    6.3 Dimension Reduction Methods
        6.3.1 Principal Component Analysis
        6.3.2 Projection Pursuit

7 CONCLUSION
    7.1 Summary
    7.2 Future Work

APPENDIX: MATLAB IMPLEMENTATION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Common Kernel Functions
3.1 Common Kernel Functions and Their Derivatives
5.1 Cross-validation Accuracy (Colon Cancer)
5.2 Comparison Between Various Feature Selection Methods
5.3 Top-ranked Genes

LIST OF FIGURES

2.1 Linear discriminant planes
2.2 Geometric margin in the maximum margin hyperplane
2.3 Linearly inseparable data
2.4 A nonlinear kernel support vector machine
4.1 Motivation for the SVM Projection-RFE
4.2 Linear kernel case for Projection-RFE
4.3 Nonlinear kernel case for Projection-RFE
5.1 CV accuracy comparison of feature (gene) selection methods

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

FEATURE SELECTION IN SUPPORT VECTOR MACHINES

By

Eun Seog Youn

May 2002

Chairman: Dr. Li Min Fu
Major Department: Computer and Information Science and Engineering

Feature selection means finding a subset of features that are most important for classification. Support Vector Machines (SVMs) are a relatively new method of extracting information from a data set: a classification boundary is constructed so as to allow the largest possible margin between the two classes. Support Vector Machines have been applied to a number of problems, such as bioinformatics, face recognition, text categorization, and handwritten digit recognition. Researchers have recently started to use SVMs for feature selection as well. In this thesis, we present an overview of this research and propose two new algorithms. We applied the new algorithms to the colon cancer data set to rank all the genes according to their importance for the classification, and thereby found a subset of genes that gives the best accuracy. The experiments show that our algorithms are competitive with the existing methods in accuracy and are much faster. The colon cancer data set consists of 62 instances, each with 2,000 genes. Researchers believe that only a small number of genes are responsible for

colon cancer. In this thesis, we also show that the genes selected by our new feature selection methods are biologically relevant by comparing them with other researchers' findings, and therefore that the feature selection methods really work.

CHAPTER 1
INTRODUCTION

Feature selection is an important problem in machine learning [7, 11, 18]. In this chapter, we describe the nature of the feature selection problem and why it is needed. We also give a glossary of notation and an overview of this thesis.

1.1 Feature Selection Problem

Given an instance $x = (x_1, \ldots, x_n)$, consider a classification problem. Typically, only a small number of the features of $x$ carry sufficient information for the classification. The feature selection problem is to identify that small subset of features that is relevant to the target concept. A small subset of relevant features gives more discriminating power than using more features. This is counter-intuitive, since more features give more information and therefore should give more discriminating power. But if a feature is irrelevant, the feature does not affect the target concept, and if a feature is redundant, it does not add anything new to the target concept [21]. This resolves the apparent contradiction. The benefits of feature selection include shorter computation time, a better discriminating hyperplane, and a better understanding of the data. In our experiments, for example, we use a colon cancer data set in which each instance consists of 2,000 components (gene expression levels). Researchers strongly believe that only a small subset of the genes is responsible for colon cancer [14]. Our experimental results and other researchers' findings [14] support this. Many methods are available for solving the feature selection problem [11, 18]. This thesis concerns feature selection in Support Vector Machines (SVMs).

1.2 Notation

$x_i$ : training instance
$y_i$ : target corresponding to instance $x_i$, $y_i \in \{-1, 1\}$
$g(x)$ : decision hyperplane trained from the SVM
$w$ : weight vector when $g(x)$ is linear
$b$ : threshold in $g(x)$
$\alpha$ : Lagrange multipliers
$\phi$ : mapping function to feature space
$K(x, y)$ : kernel, $\langle \phi(x) \cdot \phi(y) \rangle$
$l$ : number of training instances
$n$ : dimension of the input space
$\langle x \cdot y \rangle$ : inner product between $x$ and $y$
$\|\cdot\|$ : 2-norm
$\|\cdot\|_p$ : $p$-norm
$\|\cdot\|'$ : dual norm of $\|\cdot\|_p$
$p(x, g(x))$ : projection of $x$ onto $g(x)$
$\xi$ : slack variable
$C$ : upper bound for $\alpha$
$L$ : primal Lagrangian function
$x'$ : transpose of vector $x$
$SV$ : support vectors
$|x|$ : component-wise absolute value of vector $x$
$|SV|$ : number of support vectors
$\angle(x, y)$ : angle between $x$ and $y$
$\mu_i$ : mean of feature $i$ over the training instances
$\sigma_i$ : standard deviation of feature $i$ over the training instances
$V(x)$ : variance of $x$

$E(x)$ : expectation of $x$
$\Sigma_x$ : covariance matrix of $x$

1.3 Thesis Overview

We begin this thesis by defining the feature selection problem and discussing why it is important, with emphasis on feature selection using SVMs. The next chapter is devoted to SVMs. In Chapter 3, we present related work in this domain; two related methods are reviewed along with their ideas and algorithms. In Chapter 4, we propose two new algorithms for this problem: SVM Gradient-RFE and SVM Projection-RFE. Chapter 5 presents experimental results: the existing methods and our proposed methods are applied to a real-world data set, and the results show that our proposed algorithms compare favorably. In Chapter 6, we discuss other feature selection methods and compare them with our algorithms; Chapter 6 also includes an overview of dimension reduction methods, a loosely related area. The thesis concludes by summarizing the knowledge we gained and directions for future research.

CHAPTER 2
SUPPORT VECTOR MACHINES

2.1 Introduction

Support Vector Machines (SVMs) are a classification method developed by Vapnik and his group at AT&T Bell Labs [29, 10]. SVMs have been applied to classification problems [15, 23] as alternatives to multilayer networks. Classification is achieved by a linear or nonlinear separating surface in the input space of the data set. The goal of the SVM is to minimize the expected error on unseen samples. Support Vector Machines map a given set of binary-labeled training data into a high-dimensional feature space and separate the two classes of data with a maximum-margin hyperplane. To understand the SVM approach, we need two key ideas: duality and kernels [3, 10, 27]. We examine these concepts for the simple case and then show how they can be extended to more complex tasks.

2.2 Linearly Separable Case

Consider a binary classification problem with training instance and class label pairs $(x_i, y_i)$, where $i = 1, \ldots, l$ and $y_i \in \{-1, 1\}$. The instance $x_i$ is called a positive instance if the corresponding label is $+1$; otherwise, it is a negative instance. Let $P$ denote the set of positive instances and $N$ the set of negative instances. Figure 2.1 shows some possible linear discriminant planes that separate $P$ from $N$. In Figure 2.1 [3], an infinite number of discriminating planes exists, and $P_1$ and $P_2$ are among them. $P_1$ is preferred, since $P_2$ is more likely to misclassify an instance if there are small perturbations in the instance [3]. A maximum-distance, or maximum-margin, plane is a discriminant plane that is furthest from both $P$ and $N$. This gives an intuitive explanation of why maximum-margin discriminant planes are better.

Figure 2.1: Linear discriminant planes

$P_3$ and $P_4$ are called supporting planes, and points lying on the supporting planes are called support vectors. The maximum margin is the distance between the two supporting planes, and the geometric margin is this distance normalized by the weight vector. For $x_i \in P$, suppose we want to find $w$ and $b$ such that $\langle w \cdot x_i \rangle + b \ge 0$. Suppose $k = \min_i |\langle w \cdot x_i \rangle + b|$. Then $|\langle w \cdot x_i \rangle + b| \ge k$ for all $x_i$. For the points in the other class, we require $\langle w \cdot x_i \rangle + b \le -k$. Note that $w$ and $b$ can be rescaled, so we can always set $k$ equal to 1. To find the plane furthest from both sets, we maximize the distance between the supporting planes of the two classes. The support vectors are shown inside the dotted circles in Figure 2.2. In Figure 2.2, the normalized margin, or geometric margin, between the supporting planes $\langle w \cdot x_1 \rangle + b = +1$ and $\langle w \cdot x_2 \rangle + b = -1$ is $r = 2/\|w\|$ [26, 4]. Since we want to maximize the margin, for the reason intuitively justified above, we can formulate the problem as an optimization problem [10]:

$$\max_{w, b} \; \frac{2}{\|w\|^2} \quad \text{s.t.} \quad \langle w \cdot x_i \rangle + b \ge +1 \text{ for } x_i \in P, \qquad \langle w \cdot x_i \rangle + b \le -1 \text{ for } x_i \in N. \qquad (2.1)$$

Figure 2.2: Geometric margin in the maximum margin hyperplane. For the supporting planes $\{x \mid \langle w \cdot x \rangle + b = +1\}$ and $\{x \mid \langle w \cdot x \rangle + b = -1\}$, subtracting $\langle w \cdot x_2 \rangle + b = -1$ from $\langle w \cdot x_1 \rangle + b = +1$ gives $\langle w \cdot (x_1 - x_2) \rangle = 2$, and hence $\langle w/\|w\| \cdot (x_1 - x_2) \rangle = 2/\|w\|$.

Since maximizing the margin is equivalent to minimizing $\|w\|^2/2$, and the constraints can be simplified to $y_i(\langle w \cdot x_i \rangle + b) \ge 1$ for all $i$, we can rewrite the problem as follows:

$$\min_{w, b} \; \frac{\|w\|^2}{2} \quad \text{s.t.} \quad y_i(\langle w \cdot x_i \rangle + b) \ge 1. \qquad (2.2)$$

In mathematical programming, a problem such as (2.2) is called a convex quadratic problem. Many robust algorithms exist for solving quadratic problems, and since the problem is convex, any local minimum found is always a global minimum [6].

2.3 Linearly Inseparable Case

Figure 2.3 [3] shows two intersecting convex hulls. In Figure 2.3, if the single bad square were removed, the formulation we already have would work. To handle such points, we need to relax the constraints and add a penalty to the objective function in (2.2); equivalently, we need to restrict the influence of any single point.

Figure 2.3: Linearly inseparable data

Any point falling on the wrong side of its supporting plane is considered to be an error. We want to maximize the margin and minimize the error. To this end, we make some changes to the formulation in (2.2): a nonnegative error variable, or slack variable $\xi_i$, is added to each constraint and added to the objective function as a weighted penalty [10]:

$$\min_{w, b, \xi} \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i(\langle w \cdot x_i \rangle + b) + \xi_i \ge 1, \quad \xi_i \ge 0, \quad i = 1, \ldots, l, \qquad (2.3)$$

where $C$ is a constant. The constant $C$ serves to limit the influence of any particular point, such as the bad square in Figure 2.3. More explanation of $C$ is given in the next section.
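As a concrete illustration, the soft-margin primal (2.3) can be handed directly to a generic quadratic programming solver. The following is a minimal Matlab sketch, assuming the Optimization Toolbox function quadprog is available; the data matrix X (one instance per row), the label vector y, and the penalty C are placeholders to be supplied by the reader, not values prescribed by the text.

% Sketch: solve the soft-margin primal (2.3) with quadprog.
% Variables are stacked as v = [w; b; xi], with w in R^n, b scalar, xi in R^l.
% Assumes X (l-by-n), y (l-by-1 with entries +1/-1) and C are already defined.
[l, n] = size(X);
H = blkdiag(eye(n), 0, zeros(l));         % quadratic term: (1/2) w'w only
f = [zeros(n+1,1); C*ones(l,1)];          % linear term: C * sum(xi)
% Constraint y_i (w'x_i + b) + xi_i >= 1  <=>  -diag(y)*[X e]*[w;b] - xi <= -1
A = [-(y*ones(1,n)).*X, -y, -eye(l)];
bvec = -ones(l,1);
lb = [-inf(n+1,1); zeros(l,1)];           % xi >= 0, w and b free
v = quadprog(H, f, A, bvec, [], [], lb, []);
w = v(1:n);  b = v(n+1);  xi = v(n+2:end);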

2.4 Dual Problem

In this section, we show how to derive a dual representation of the problem. The dual representation of the original problem is important because its objective function and constraints are expressed entirely in inner products. By replacing the inner products with a kernel function, we can therefore benefit from an enormous computational shortcut. The dual representation involves the so-called Lagrangian theory [6, 5], which characterizes the solution of an optimization problem. First we introduce the dual representation for the linearly separable problem (2.2).

The primal Lagrangian function for problem (2.2) is as follows [10]:

$$L(w, b, \alpha) = \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{l} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 \right], \qquad (2.4)$$

where $\alpha_i \ge 0$ are the Lagrange multipliers. At the extremum, the partial derivatives of the Lagrangian are zero, that is,

$$\frac{\partial L(w, b, \alpha)}{\partial b} = \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{l} y_i \alpha_i x_i = 0,$$

or

$$0 = \sum_{i=1}^{l} y_i \alpha_i, \qquad (2.5)$$
$$w = \sum_{i=1}^{l} y_i \alpha_i x_i. \qquad (2.6)$$

Substituting relations (2.5) and (2.6) into the primal Lagrangian (2.4) leads to

$$L(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle. \qquad (2.7)$$

We now have the dual problem corresponding to the primal problem (2.2):

$$\text{maximize} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \quad \text{s.t.} \; \sum_{i=1}^{l} y_i \alpha_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, l. \qquad (2.8)$$

The primal problem (2.2) and the corresponding dual problem (2.8) yield the same hyperplane, with normal vector $w = \sum_{i=1}^{l} y_i \alpha_i x_i$ and threshold $b$. We carry out a similar dual formulation for the linearly inseparable problem. The primal Lagrangian for problem (2.3) is as follows:

$$L(w, b, \xi, \alpha, r) = \frac{1}{2} \langle w \cdot w \rangle + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i(\langle w \cdot x_i \rangle + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} r_i \xi_i, \qquad (2.9)$$

where $\alpha_i \ge 0$ and $r_i \ge 0$. Setting the partial derivatives of the primal Lagrangian (2.9) to zero, we derive the dual problem:

$$\frac{\partial L(w, b, \xi, \alpha, r)}{\partial b} = \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad \frac{\partial L(w, b, \xi, \alpha, r)}{\partial \xi_i} = C - \alpha_i - r_i = 0, \qquad \frac{\partial L(w, b, \xi, \alpha, r)}{\partial w} = w - \sum_{i=1}^{l} y_i \alpha_i x_i = 0,$$

or

$$0 = \sum_{i=1}^{l} y_i \alpha_i, \qquad (2.10)$$
$$C = \alpha_i + r_i, \qquad (2.11)$$
$$w = \sum_{i=1}^{l} y_i \alpha_i x_i. \qquad (2.12)$$

Substituting relations (2.10), (2.11), and (2.12) into the primal Lagrangian (2.9) leads to

$$L(w, b, \xi, \alpha, r) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle. \qquad (2.13)$$

We now have the dual problem corresponding to the primal problem (2.3):

$$\text{maximize} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \quad \text{s.t.} \; \sum_{i=1}^{l} y_i \alpha_i = 0, \; C \ge \alpha_i \ge 0, \; i = 1, \ldots, l, \qquad (2.14)$$

where $C$ is the upper bound for the Lagrange multipliers; it caps the $\alpha_i$ and thereby limits the influence of any single point.

2.5 Kernel Methods and Support Vector Classifiers

The next topic is how to construct the decision hyperplane from the solution of problem (2.8). Let $\alpha^*$ solve the dual problem (2.8). Then, by relation (2.6), the weight vector and threshold are

$$w^* = \sum_{i=1}^{l} y_i \alpha_i^* x_i \qquad \text{and} \qquad b^* = -\frac{\max_{y_i = -1}(\langle w^* \cdot x_i \rangle) + \min_{y_i = 1}(\langle w^* \cdot x_i \rangle)}{2}.$$
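The box-constrained dual (2.14) is itself a convex quadratic program and can be solved with the same generic solver. The sketch below is a minimal Matlab example, again assuming quadprog from the Optimization Toolbox; X, y and C are placeholders, the small ridge added to H is only for numerical stability, and b is recovered from the margin condition on unbounded support vectors, which is one common alternative to the formula above.

% Sketch: solve the dual (2.14) for a linear kernel and recover w and b.
% Assumes X (l-by-n), y (l-by-1, +1/-1) and C are already defined.
[l, n] = size(X);
K = X * X';                                   % Gram matrix of inner products
H = (y * y') .* K + 1e-8 * eye(l);            % H_ij = y_i y_j <x_i, x_j>
f = -ones(l, 1);                              % maximize sum(alpha) <=> minimize -sum(alpha)
Aeq = y';  beq = 0;                           % sum_i y_i alpha_i = 0
lb = zeros(l, 1);  ub = C * ones(l, 1);       % 0 <= alpha_i <= C
alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);
w = X' * (alpha .* y);                        % w = sum_i y_i alpha_i x_i
sv = find(alpha > 1e-6 & alpha < C - 1e-6);   % unbounded support vectors
b = mean(y(sv) - X(sv,:) * w);                % from y_i(w'x_i + b) = 1 at these points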

Figure 2.4: A nonlinear kernel support vector machine (the map $\phi$ takes the input space to the feature space)

The decision function is therefore

$$g(x) = \sum_{i=1}^{l} \alpha_i y_i \langle x_i \cdot x \rangle + b \qquad (2.15)$$
$$= \sum_{i \in SV} \alpha_i y_i \langle x_i \cdot x \rangle + b. \qquad (2.16)$$

For the example on the left of Figure 2.4, no simple linear discriminant would work well; a quadratic function such as the circle pictured is needed. Figure 2.4 shows how we can make use of existing linear discriminant methods by mapping the original input data into a higher-dimensional space. When we map the original input space, say $X$, using $\phi(x)$, we create a new space $F = \{\phi(x) : x \in X\}$, called the feature space. We introduce the idea of the kernel with an example [25]. To produce a quadratic discriminant in a two-dimensional input space with attributes $x_1$ and $x_2$, map the two-dimensional input space $[x_1, x_2]$ to the three-dimensional feature space $[x_1^2, \sqrt{2} x_1 x_2, x_2^2]$ and construct a linear discriminant in that feature space. Specifically, define $\phi(x): R^2 \to R^3$. Then, for $x = [x_1, x_2]$, we have $\langle w \cdot x \rangle = w_1 x_1 + w_2 x_2$ in the input space, while $\phi(x) = [x_1^2, \sqrt{2} x_1 x_2, x_2^2]$ and $\langle w \cdot \phi(x) \rangle = w_1 x_1^2 + w_2 \sqrt{2} x_1 x_2 + w_3 x_2^2$. The resulting decision function,

$$g(x) = \langle w \cdot \phi(x) \rangle + b = w_1 x_1^2 + w_2 \sqrt{2} x_1 x_2 + w_3 x_2^2 + b,$$

is linear in the three-dimensional feature space but nonlinear in the original two-dimensional input space. This approach, however, may cause some problems.

If the data are noisy, the SVM would try to discriminate all the positive examples from the negative examples by increasing the dimensionality of the feature space, which leads to so-called overfitting. The growth of the dimensionality of the feature space can be exponential. The second concern is the cost of computing the separating hyperplane by explicitly carrying out the map into the feature space. SVMs get around both problems by using the so-called kernel. The formal definition of a kernel is as follows [10]:

Definition 2.5.1. A kernel is a function $K$ such that for all $x, z \in X$,
$$K(x, z) = \langle \phi(x) \cdot \phi(z) \rangle,$$
where $\phi$ is a mapping from $X$ to an (inner product) feature space $F$.

To change from a linear to a nonlinear classifier, we substitute only the inner product $\langle x \cdot y \rangle$ by a kernel function $K(x, y)$. In our example, $\langle \phi(x) \cdot \phi(y) \rangle = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(y_1^2, \sqrt{2} y_1 y_2, y_2^2)' = (\langle x \cdot y \rangle)^2 =: K(x, y)$. Note that the actual dot product in $F$ is computed in $R^2$, the input space. Table 2.1 shows common kernel functions. By changing the kernel we obtain different nonlinear classifiers, but no change to the algorithm is required. From a machine trained with an appropriately defined kernel, we can construct the decision function

$$g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b. \qquad (2.17)$$

Table 2.1: Common Kernel Functions

    Kernel        K(x, y)
    Linear        $\langle x \cdot y \rangle$
    Polynomial    $(1 + \langle x \cdot y \rangle)^d$
    RBF           $\exp(-\gamma \|x - y\|^2)$
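To make the kernel substitution concrete, the following Matlab fragment sketches the kernels of Table 2.1 and the resulting decision function (2.17). It is a minimal illustration only; the support vector matrix Xsv, the trained multipliers alpha, the labels ysv, the bias b, and the kernel parameters d and gamma are placeholders, not values prescribed by the text.

% Kernels of Table 2.1, written for a matrix X1 (m-by-n) against X2 (p-by-n).
k_lin  = @(X1, X2) X1 * X2';
k_poly = @(X1, X2, d) (1 + X1 * X2').^d;
k_rbf  = @(X1, X2, gamma) exp(-gamma * ...
          (sum(X1.^2,2) + sum(X2.^2,2)' - 2 * X1 * X2'));

% Decision function (2.17): g(x) = sum_i alpha_i y_i K(x_i, x) + b,
% evaluated at the rows of Xtest given the support vectors Xsv.
decision = @(Xtest, Xsv, alpha, ysv, b, gamma) ...
            k_rbf(Xtest, Xsv, gamma) * (alpha .* ysv) + b;
% Predicted labels are then sign(decision(...)).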

CHAPTER 3
RELATED WORKS

3.1 SVM Gradient Method

Training a support vector machine provides optimal values for the Lagrange multipliers $\alpha_i$. From the $\alpha_i$'s, one constructs the decision hyperplane $g(x)$,

$$g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b.$$

Researchers have proposed a feature selection technique for SVMs based on the gradient [16]. To rank the features of a given $x$ according to their importance to the classification decision, compute the angles between $\nabla g(x)$ and the unit vectors $e_j$, $j = 1, \ldots, n$, representing the indices of the features. If the $j$-th feature is not important at $x$, then $\nabla g(x)$ is almost orthogonal to $e_j$. One does these computations for all the support vectors (SVs), that is, one computes the angles for all the SVs, averages them, and sorts the averages in descending order. This gives a ranking of all the features. The SVM Gradient algorithm is given below, for a linear kernel and for general kernels separately.

3.1.1 Algorithm: Linear Kernel

(a) Train the SVM using all available data components to get $g(x)$.

(b) Compute the gradient $\nabla g(x)$ (or the weights $w$) for all $x \in SV$:
$$w = \nabla g(x) = \sum_{i \in SV} \alpha_i y_i x_i.$$
Computation of $\nabla g(x)$ is easy, since $g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) = \sum_{i \in SV} \alpha_i y_i \langle x_i \cdot x \rangle$ and $\nabla_x K(x_i, x) = x_i$.

(c) Sort $|w|$ in descending order. This gives the feature ranking.

Table 3.1: Common Kernel Functions and Their Derivatives

    Kernel        K(x, y)                        grad_y K(x, y)
    Linear        $x^T y$                        $x$
    Polynomial    $(1 + x^T y)^d$                $d(1 + x^T y)^{d-1} x$
    RBF           $\exp(-\gamma \|x - y\|^2)$    $2\gamma (x - y) \exp(-\gamma \|x - y\|^2)$

3.1.2 Algorithm: General Kernels

(a) Train the SVM using all available data components to get $g(x)$.

(b) Compute the gradient $\nabla g(x)$ for all $x \in SV$:
$$\nabla g(x) = \sum_{i \in SV} \alpha_i y_i \nabla_x K(x_i, x).$$
Computation of $\nabla g(x)$ is again easy: since $g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x)$ and only the term $K(x_i, x)$ involves the variable $x$, we take the derivative of the kernel function $K$ only. Table 3.1 summarizes common kernel functions and their derivatives.

(c) Compute the sum of angles between $\nabla g(x)$ and $e_j$, for $j = 1, \ldots, |s|$:
$$r_j = \sum_{x \in SV} \angle(\nabla g(x), e_j), \qquad \text{where} \quad \angle(\nabla g(x), e_j) = \min_{k \in \{0,1\}} \left( k\pi + (-1)^k \arccos \frac{\langle \nabla g(x) \cdot e_j \rangle}{\|\nabla g(x)\|} \right).$$

(d) Compute the averages of the summed angles:
$$c_j = 1 - \frac{2 r_j}{\pi \, |SV|}.$$

(e) Sort the $c_j$ in descending order. This gives the feature ranking.

The authors also proposed that, when the number of SVs is small compared to the number of training instances, one should include all the points within the $\epsilon$-region around the borders, that is, all points with $|g(x_i)| \le 1 + \epsilon$.
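The ranking criterion of steps (b)-(d) reduces to a few matrix operations. The sketch below is a minimal Matlab illustration for the RBF kernel, assuming alpha, ysv, the support vector matrix Xsv and the kernel width gamma come from an already trained machine; the scaling of c follows the reconstruction $c_j = 1 - 2 r_j / (\pi |SV|)$ given above.

% Gradient of g at each support vector for the RBF kernel (Table 3.1):
% grad g(x) = sum_i alpha_i y_i * 2*gamma*(x_i - x) * exp(-gamma*||x_i - x||^2).
nsv = size(Xsv, 1);  n = size(Xsv, 2);
G = zeros(nsv, n);                       % row k holds grad g(x_k)
for k = 1:nsv
    d  = Xsv - Xsv(k,:);                 % rows: x_i - x_k
    kv = exp(-gamma * sum(d.^2, 2));     % K(x_i, x_k)
    G(k,:) = sum((alpha .* ysv .* kv * 2*gamma) .* d, 1);
end
% Angle between grad g(x_k) and each axis e_j, folded into [0, pi/2].
cosang = abs(G) ./ vecnorm(G, 2, 2);     % |<grad, e_j>| / ||grad||
r = sum(acos(min(cosang, 1)), 1);        % summed angles r_j
c = 1 - 2*r / (pi*nsv);                  % averaged criterion c_j
[~, ranking] = sort(c, 'descend');       % most important feature first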

3.2 SVM-RFE

The SVM-RFE [14] is an application of recursive feature elimination based on a sensitivity analysis of an appropriately defined cost function.

3.2.1 Algorithm: Linear Kernel

In the linear kernel case, define the cost function $J = (1/2)\|w\|^2$. The least sensitive feature, the one with the minimum magnitude of weight, is eliminated first; this eliminated feature receives ranking $n$. The machine is retrained without the eliminated feature, and again the feature with the minimum magnitude of weight is removed; this feature receives ranking $n - 1$. By repeating this process until no feature is left, we rank all the features. The algorithm is as follows. Given training instances $X_{all} = [x_1, \ldots, x_l]'$ and class labels $y = [y_1, \ldots, y_l]'$, initialize the subset of features $s = [1, 2, \ldots, n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $w = \nabla g(x)$:
$$w = \sum_{i \in SV} \alpha_i y_i x_i.$$

(d) Find the feature $f$ with the smallest $|w_j|$, $j = 1, \ldots, |s|$: $f = \mathrm{argmin}(|w|)$.

(e) Update $r$ and eliminate the feature from $s$: $r = [s(f), r]$, $s = s \setminus \{s(f)\}$.

The last eliminated feature is the most important one.
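The recursive loop of steps (a)-(e) is independent of how the weight vector is obtained. The following Matlab sketch shows the bookkeeping only; svm_weights is a hypothetical helper, standing in for any linear-SVM trainer that returns $w$ (for instance the quadprog-based sketch from Chapter 2).

% Sketch of linear-kernel SVM-RFE (one feature removed per iteration).
% Assumes Xall (l-by-n), y (l-by-1) and a user-supplied function
% w = svm_weights(X, y) returning the weight vector of a trained linear SVM.
[~, n] = size(Xall);
s = 1:n;                                 % surviving (original) feature indices
r = [];                                  % ranked list, best feature ends up first
while ~isempty(s)
    X = Xall(:, s);                      % step (a)
    w = svm_weights(X, y);               % steps (b)-(c)
    [~, f] = min(abs(w));                % step (d): least important surviving feature
    r = [s(f), r];                       % step (e): prepend, so last survivor ranks first
    s(f) = [];
end
% After the loop, r(1) is the top-ranked feature and r(end) the least important.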

3.2.2 Algorithm: General Kernels

Define a cost function as follows:
$$J = (1/2)\,\alpha^T H \alpha - \alpha^T e,$$
where $H_{hk} = y_h y_k K(x_h, x_k)$, $K$ is a kernel function, $\alpha$ is the vector of Lagrange multipliers, and $e$ is an $l$-dimensional vector of ones ($l$ is the number of training instances). To compute the change in $J$ caused by removing feature $i$, one would have to retrain a classifier for every candidate feature to be eliminated. This difficulty is avoided by assuming no change in $\alpha$. Under this assumption, one recomputes $H$ as
$$H(-i)_{hk} = y_h y_k K(x_h(-i), x_k(-i)),$$
where $(-i)$ means that component $i$ has been removed. The sensitivity function is then defined as
$$DJ(i) = J - J(-i) = (1/2)\,\alpha^T H \alpha - (1/2)\,\alpha^T H(-i) \alpha.$$

Given training instances $X_{all} = [x_1, \ldots, x_l]'$ and class labels $y = [y_1, \ldots, y_l]'$, initialize the subset of features $s = [1, 2, \ldots, n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $\alpha$: $\alpha = \mathrm{SVM\_train}(X, y)$.

(c) Compute the ranking criterion for all $i$:
$$DJ(i) = (1/2)\,\alpha^T H \alpha - (1/2)\,\alpha^T H(-i) \alpha.$$

(d) Find the feature $k$ such that $k = \mathrm{argmin}_i \, DJ(i)$.

(e) Eliminate feature $k$.

Update $r$ and eliminate the feature from $s$: $r = [s(k), r]$, $s = s \setminus \{s(k)\}$.

In the linear kernel case, $K(x_h, x_k) = \langle x_h \cdot x_k \rangle$ and $\alpha^T H \alpha = \|w\|^2$. Therefore $DJ(i) = (1/2)(w_i)^2$, which matches the feature selection criterion of the linear kernel case.
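The sensitivity $DJ(i)$ only requires re-evaluating the kernel matrix with one column of the data blanked out. A minimal Matlab sketch for the RBF kernel follows; alpha, y, the training matrix X and gamma are assumed to come from a machine that has already been trained on the surviving features.

% Sketch: ranking criterion DJ(i) of SVM-RFE for an RBF kernel.
% Assumes X (l-by-n), y (l-by-1), alpha (l-by-1) and gamma are defined.
rbf = @(X1, X2) exp(-gamma * (sum(X1.^2,2) + sum(X2.^2,2)' - 2*X1*X2'));
ay  = alpha .* y;                        % combine alpha_i y_i once
J   = 0.5 * ay' * rbf(X, X) * ay;        % (1/2) alpha' H alpha
n   = size(X, 2);
DJ  = zeros(1, n);
for i = 1:n
    Xi    = X;  Xi(:, i) = [];           % remove component i
    DJ(i) = J - 0.5 * ay' * rbf(Xi, Xi) * ay;
end
[~, k] = min(DJ);                        % feature to eliminate this round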

CHAPTER 4
NEW ALGORITHMS

We propose two new algorithms in this chapter: SVM Gradient-RFE and SVM Projection-RFE. SVM Gradient-RFE combines the SVM Gradient method with SVM-RFE. In SVM Projection-RFE, the magnitude of the vector from each support vector to its projection onto the decision hyperplane is the feature selection criterion; this criterion is combined with recursive feature elimination to give the SVM Projection-RFE.

4.1 SVM Gradient-RFE

SVM Gradient-RFE combines two existing feature selection methods, SVM-RFE and SVM Gradient. When we judge an algorithm, two factors are of concern: prediction accuracy and computing time. The new method takes the merits of the two existing methods, so it is competitive with SVM-RFE in prediction accuracy while maintaining speedy computation. As in [16], this method uses the gradient as the feature selection criterion, but in order to give a ranking for all the features, the machine is first trained using all the features, the feature selection criterion is computed, and the feature with the minimum criterion value (the one whose axis is most nearly orthogonal to the gradients) is eliminated. The ranking of this eliminated feature is $n$. The machine is then trained without the eliminated feature, and again the feature with the minimum selection criterion is eliminated; this feature receives ranking $n - 1$. By recursively eliminating all the features, one ranks all the features. The following sections describe the algorithm and computational details.

4.1.1 Linear Kernel

Given training instances $X_{all} = [x_1, \ldots, x_l]'$ and class labels $y = [y_1, \ldots, y_l]'$, initialize the subset of features $s = [1, 2, \ldots, n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $w = \nabla g(x)$:
$$w = \sum_{i \in SV} \alpha_i y_i x_i.$$

(d) Find the feature $f$ with the smallest $|w_j|$, $j = 1, \ldots, |s|$: $f = \mathrm{argmin}(|w|)$.

(e) Update $r$ and eliminate the feature from $s$: $r = [s(f), r]$, $s = s \setminus \{s(f)\}$.

For the linear kernel, the decision hyperplane is linear, and hence its gradient at any support vector is a constant vector (the normal vector). The normal vector of $g(x)$ is the feature selection criterion.

4.1.2 General Kernels

Given training instances $X_{all} = [x_1, \ldots, x_l]'$ and class labels $y = [y_1, \ldots, y_l]'$, initialize the subset of features $s = [1, 2, \ldots, n]$ and $r =$ an empty array. Repeat (a)-(g) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $\nabla g(x)$ for all $x \in SV$:
$$\nabla g(x) = \sum_{i \in SV} \alpha_i y_i \nabla_x K(x_i, x).$$

(d) Compute the sum of angles between $\nabla g(x)$ and $e_j$, for $j = 1, \ldots, |s|$:
$$r_j = \sum_{x \in SV} \angle(\nabla g(x), e_j), \qquad \angle(\nabla g(x), e_j) = \min_{k \in \{0,1\}} \left( k\pi + (-1)^k \arccos \frac{\langle \nabla g(x) \cdot e_j \rangle}{\|\nabla g(x)\|} \right).$$

(e) Compute the averages of the summed angles:
$$c_j = 1 - \frac{2 r_j}{\pi \, |SV|}.$$

(f) Find the feature $f$ with the smallest $c_j$, $j = 1, \ldots, |s|$: $f = \mathrm{argmin}(c)$.

(g) Update $r$ and eliminate the feature from $s$: $r = [s(f), r]$, $s = s \setminus \{s(f)\}$.

As we saw in Section 3.1, the $\nabla g(x)$ computation can be done easily.

4.2 SVM Projection-RFE

4.2.1 Idea

In developing this method, we identified what characterizes an important feature in an SVM classifier. Consider the point $A = (x_1, x_2)$ in Figure 4.1. When point $A$ is projected onto $g(x) = 0$, let $P = (p_1, p_2)$ be the projected point. Then
$$|A - P| = |(x_1, x_2) - (p_1, p_2)| = |(\Delta x_1, \Delta x_2)|.$$
Note that the component with the larger magnitude is the more influential one for the decision plane. In this example, $|\Delta x_1| > |\Delta x_2|$, and hence $x_1$ is more influential to the decision than $x_2$.

Figure 4.1: Motivation for the SVM Projection-RFE (point $A$ is projected onto the decision plane $g(x) = 0$ at $P$; the displacement has components $\Delta x_1$ and $\Delta x_2$)

This is true because the decision hyperplane is almost parallel to the $x_2$ axis, so that whatever the $x_2$ value is, it contributes little to the decision plane $g(x)$. We now have to answer how to efficiently compute $|A - P|$, i.e., the $\Delta x_i$, $i = 1, \ldots, n$. The idea is that the distance between $P$ and $A$ is exactly the geometric margin, because $A$ is a support vector, and by the SVM property support vectors lie at functional margin 1 from the hyperplane, i.e., at geometric margin $1/\|w\|$. We can make use of this property to compute $|P - A|$ efficiently. Before introducing the algorithm, we state a proposition [10] giving the margin between the hyperplane and the SVs.

Proposition 4.2.1. Consider a linearly separable training sample $S = ((x_1, y_1), \ldots, (x_l, y_l))$, and suppose the parameters $\alpha^*$ and $b^*$ solve the following optimization problem:

$$\text{maximize} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \qquad (4.1)$$
$$\text{s.t.} \; \sum_{i=1}^{l} y_i \alpha_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, l. \qquad (4.2)$$

Then $w^* = \sum_{i=1}^{l} y_i \alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin
$$r = \frac{1}{\|w^*\|}.$$

Figure 4.2: Linear kernel case for Projection-RFE ($g(x) = \langle w \cdot x \rangle + b$; the projection of $A$ onto $g(x) = 0$ is $P = A + t w$)

4.2.2 Linear Kernel

Using the linear kernel means that we have a linear decision hyperplane $g(x) = \langle w \cdot x \rangle + b$. When $A$ is projected onto $g(x) = 0$, the projected point $P$ can be expressed in terms of $A$ and $w$, the weight vector of $g(x)$; that is, $P$ is the sum of $A$ and some constant times the weight vector $w$ (Figure 4.2):
$$P = A + t w,$$
where $t$ is a constant. By Proposition 4.2.1,
$$\|t w\| = \frac{1}{\|w\|},$$
and solving this for $t$ gives
$$t = \frac{1}{\|w\|^2}.$$
Hence,
$$|P - A| = \frac{|w|}{\|w\|^2}.$$
Let $p(x_i, g(x))$ denote the projection of $x_i$ onto $g(x)$. Since $w$ is $\nabla g(x_i)$, we have
$$|p(x_i, g(x)) - x_i| = \frac{|w|}{\|w\|^2} = \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}, \qquad \forall i \in SV.$$

Since $|p(x_i, g(x)) - x_i|$ is the same vector $|w|/\|w\|^2$ (built from the normal vector) for every $i \in SV$, we only need to compute
$$\frac{|w|}{\|w\|^2}.$$
The following Lemma 4.2.2 summarizes what we have done above.

Lemma 4.2.2. Consider a linearly separable training sample $S = ((x_1, y_1), \ldots, (x_l, y_l))$, and suppose the parameters $\alpha^*$ and $b^*$ solve the following optimization problem:
$$\text{maximize} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle \qquad (4.3)$$
$$\text{s.t.} \; \sum_{i=1}^{l} y_i \alpha_i = 0, \; \alpha_i \ge 0, \; i = 1, \ldots, l, \qquad (4.4)$$
and let $w = \sum_{i=1}^{l} y_i \alpha_i^* x_i$. Then for any support vector $x_i$,
$$|p(x_i, g(x)) - x_i| = \frac{|w|}{\|w\|^2}.$$

Proof: We can write $p(x_i, g(x))$ in terms of $x_i$ and the gradient vector of the decision hyperplane $g(x)$ at $x_i$ as follows:
$$p(x_i, g(x)) = x_i + t w,$$
where $t$ is some constant. By Proposition 4.2.1,
$$\|t w\| = \frac{1}{\|w\|}.$$
Solving the above for $t$ gives
$$t = \frac{1}{\|w\|^2}.$$
Hence,
$$|p(x_i, g(x)) - x_i| = \frac{|w|}{\|w\|^2}.$$

Figure 4.3: Nonlinear kernel case for Projection-RFE (the projection of $A$ onto $g(x) = 0$ is $P = A + t \nabla g(A)$)

4.2.3 General Kernels

Using a nonlinear kernel, we have a nonlinear decision surface in the input space. Consider Figure 4.3. Let $g(x)$ be the decision surface and $P$ the projected point of $A$. Observe that $\nabla g(A)$ is normal to $g(x)$ at the point $P$. $\nabla g(A)$ can be computed easily, as in the SVM Gradient method, since
$$g(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b \qquad \text{and} \qquad \nabla g(x) = \sum_{i \in SV} \alpha_i y_i \nabla_x K(x_i, x).$$
Hence, the projected point $P$ can be expressed in terms of $A$ and $\nabla g(A)$, the normal to $g(x)$ at $P$; that is, $P$ is the sum of $A$ and some constant $t$ times the normal vector $\nabla g(A)$ (Figure 4.3):
$$P = A + t \, \nabla g(A),$$
where $t$ is a constant. Here, however, we do not calculate $|P - A|$ exactly, since its computation is complicated and the exact value is not needed. Because $P - A$ is a constant times $\nabla g(A)$, $|P - A|$ is proportional to $|\nabla g(A)| / \|\nabla g(A)\|^2$, that is,
$$|P - A| \propto \frac{|\nabla g(A)|}{\|\nabla g(A)\|^2}.$$

For every $i \in SV$,
$$|p(x_i, g(x)) - x_i| \propto \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}.$$
Since $|p(x_i, g(x)) - x_i|$ is not a constant vector, unlike the linear case, we need to calculate
$$p_i = c \, |p(x_i, g(x)) - x_i| = \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}, \qquad \forall i \in SV,$$
where $c$ is a constant. Now we sum the $p_i$ over all $i \in SV$, component-wise. Let $d$ denote this component-wise summation of the $p_i$. The feature corresponding to the largest-magnitude component of $d$ is then the most important feature for the classification by $g(x)$. This feature selection criterion is now combined with recursive feature elimination.

4.2.4 SVM Projection-RFE Algorithm

We now describe how to rank all the features. Initially, the machine is trained with all the features and all the training data. Then the feature selection criterion, that is, $d$, is computed, and the feature with the minimum-magnitude component is eliminated; this eliminated feature receives ranking $n$. The machine is then trained without the eliminated feature, $d$ is recomputed, and the feature with the minimum selection criterion is eliminated; this feature receives ranking $n - 1$. By doing this recursively, we can give a ranking to all the features. The SVM Projection-RFE algorithm and computational details are as follows.

4.2.4.1 Linear Kernel

Given training instances $X_{all} = [x_1, \ldots, x_l]'$ and class labels $y = [y_1, \ldots, y_l]'$, initialize the subset of features $s = [1, 2, \ldots, n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $w = \nabla g(x)$:
$$w = \sum_{i \in SV} \alpha_i y_i x_i.$$

(d) Find the feature $f$ with the smallest $|w_j|$, $j = 1, \ldots, |s|$: $f = \mathrm{argmin}(|w|)$.

(e) Update $r$ and eliminate the feature from $s$: $r = [s(f), r]$, $s = s \setminus \{s(f)\}$.

For the linear kernel case, $\nabla g(x_i)$ is a constant vector (the normal vector of $g(x)$) for any $i \in SV$. Only a one-time computation of the normal vector of $g(x)$ is therefore needed to evaluate the feature selection criterion for one feature elimination. This algorithm is therefore exactly the same as SVM-RFE and SVM Gradient-RFE.

4.2.4.2 General Kernels

Given training instances $X_{all} = [x_1, \ldots, x_l]'$ and class labels $y = [y_1, \ldots, y_l]'$, initialize the subset of features $s = [1, 2, \ldots, n]$ and $r =$ an empty array. Repeat (a)-(g) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $\nabla g(x)$ for all $x \in SV$:
$$\nabla g(x) = \sum_{i \in SV} \alpha_i y_i \nabla_x K(x_i, x).$$

(d) Compute $p_i = c \, |p(x_i, g(x)) - x_i|$:
$$p_i = \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}, \qquad \forall i \in SV.$$

(e) Compute the sum of the $p_i$ over all $i \in SV$:
$$d = \sum_{i \in SV} p_i.$$

(f) Find the feature $f$ with the smallest $d_j$, $j = 1, \ldots, |s|$: $f = \mathrm{argmin}(d)$.

(g) Update $r$ and eliminate the feature from $s$: $r = [s(f), r]$, $s = s \setminus \{s(f)\}$.

Again, the $\nabla g(x)$ computation can be done easily, as we did in Section 3.1.
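Putting steps (c)-(e) together, the per-iteration criterion for the RBF kernel amounts to a few lines of Matlab. The sketch below is an illustration only; svm_train_rbf is a hypothetical trainer returning the multipliers alpha, labels ysv and support vectors Xsv of the current machine, gamma is the kernel width, and r and s are the ranking and surviving-feature arrays of the outer loop, as in the sketch of Section 3.2.1.

% One Projection-RFE iteration for the RBF kernel (steps (c)-(g)).
% Hypothetical trainer: [alpha, ysv, Xsv] = svm_train_rbf(X, y, gamma).
[alpha, ysv, Xsv] = svm_train_rbf(Xall(:, s), y, gamma);
nsv = size(Xsv, 1);
d = zeros(1, numel(s));
for k = 1:nsv
    diff = Xsv - Xsv(k,:);                                 % x_i - x_k
    kv   = exp(-gamma * sum(diff.^2, 2));                  % K(x_i, x_k)
    grad = sum((alpha .* ysv .* kv * 2*gamma) .* diff, 1); % grad g(x_k)
    d    = d + abs(grad) / norm(grad)^2;                   % p_k, accumulated into d
end
[~, f] = min(d);                 % least important surviving feature
r = [s(f), r];                   % update ranking (step (g))
s(f) = [];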

CHAPTER 5
EXPERIMENTS

5.1 Colon Cancer Data

The colon cancer data consist of 62 instances, each with 2,000 genes [1, 14, 28, 9]. The data set contains 22 normal colon tissue samples and 40 colon tumor samples, analyzed with an Affymetrix oligonucleotide array complementary to more than 6,500 human genes and expressed sequence tags. The 2,000 genes selected are those with the highest minimal intensity across the samples.

5.2 Experimental Design and Results

The colon cancer data form a 62 x 2,000 matrix; each entry of the matrix is a gene expression level. Before conducting the experiments, we preprocessed the data [1, 13, 14, 24]: we took the logarithm of all values and then normalized the sample vectors. The normalization of the sample vectors consists of subtracting the mean over all training values and dividing the result by the corresponding standard deviation. Since the data set does not come with a designated training or testing set, we randomly split it into two sets of 31 samples and performed two-fold cross-validation (CV): one of the two 31-sample sets is used for training the machine and the other for testing, and then vice versa. We ran our experiments with the linear kernel and the RBF kernel. With the linear kernel, all three methods are affordable, since none takes more than an hour to complete; we therefore conducted complete experiments, and for SVM-RFE and SVM Gradient-RFE we ranked the features (genes) by recursively eliminating them one by one. With the RBF kernel, however, one-by-one feature elimination is not affordable for SVM-RFE.

We therefore eliminated a chunk of genes at a time [14]. At the first iteration, we eliminated enough genes that the number remaining is the largest power of 2 below the original number; at each later iteration, half of the remaining genes are eliminated. For example, with our colon cancer data set, after the first run 2,000 - 1,024 = 976 genes are eliminated, since 1,024 is the largest power of 2 below 2,000. After the second run, 1,024 - 512 = 512 genes are eliminated.
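The preprocessing, the random two-fold split, and the chunked elimination schedule described above are straightforward to script. The following Matlab sketch is a minimal illustration under the stated assumptions (logarithm first, then per-sample standardization); the raw 62 x 2,000 matrix `data`, the label vector `labels`, and the fixed random seed are placeholders, not part of the original experiments.

% Preprocessing: log-transform, then standardize each sample vector.
X = log(data);                              % assumes strictly positive expression levels
X = (X - mean(X, 2)) ./ std(X, 0, 2);       % per-sample mean 0, std 1

% Random split into two 31-sample halves for two-fold cross-validation.
rng(0);                                     % fixed seed, for reproducibility only
perm  = randperm(size(X, 1));
fold1 = perm(1:31);   fold2 = perm(32:end);

% Chunked elimination schedule used with the RBF kernel:
% drop to the largest power of 2 below 2,000, then halve at each later iteration.
n = size(X, 2);
keep = 2^floor(log2(n));
while keep >= 1
    fprintf('keep %d genes\n', keep);
    keep = keep / 2;
end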

As we saw in Chapter 4, SVM Gradient-RFE and SVM Projection-RFE perform essentially the same computation and hence give the same ranking and the same accuracy, so the SVM Projection-RFE accuracy is not shown separately in Table 5.1. For the linear kernel, cross-validation accuracy is computed for the feature selection methods SVM-RFE, SVM Gradient, and SVM Gradient-RFE; for the RBF kernel, it is computed for SVM-RFE and SVM Gradient-RFE. Table 5.1 shows the cross-validation accuracy results. Using the RBF kernel gives better accuracy than using the linear kernel. With the linear kernel, the CV accuracy ranges from 61% to 87%; SVM-RFE and SVM Gradient-RFE give the same accuracy because they use the same feature selection criterion and the same recursive elimination idea, and both are better than the plain SVM Gradient method, which shows that the recursive feature elimination idea is useful. With the RBF kernel, the CV accuracy ranges from 74% to 90%. From Table 5.1, the best CV accuracy is obtained using the RBF kernel with 16 genes. SVM Gradient-RFE and SVM-RFE give almost the same CV accuracy, although they use different feature selection criteria: SVM-RFE uses the sensitivity of a cost function as the selection criterion, while SVM Gradient-RFE uses the angles between the gradients at the SVs and the feature axes. Although the CV accuracy is the same for the two methods, their computing times are very different. Table 5.2 shows the computing times of the various feature selection methods when using the RBF kernel.

Table 5.1: Cross-validation Accuracy (Colon Cancer)

    # genes   Linear: RFE   Linear: GRAD   Linear: GRAD-RFE   RBF: RFE   RBF: GRAD-RFE
    1         0.73          0.61           0.73               0.74       0.74
    2         0.77          0.63           0.77               0.74       0.74
    3         0.84          0.73           0.84               0.79       0.79
    4         0.82          0.74           0.82               0.84       0.84
    5         0.79          0.73           0.79               0.89       0.89
    6         0.85          0.74           0.85               0.89       0.89
    7         0.85          0.81           0.85               0.89       0.89
    8         0.84          0.79           0.84               0.89       0.89
    9         0.84          0.77           0.84               0.89       0.89
    10        0.82          0.81           0.82               0.87       0.87
    11        0.85          0.79           0.85               0.89       0.89
    12        0.84          0.81           0.84               0.87       0.87
    13        0.81          0.81           0.81               0.87       0.87
    14        0.81          0.81           0.81               0.89       0.89
    15        0.81          0.82           0.81               0.89       0.89
    16        0.81          0.82           0.81               0.90       0.90
    17        0.81          0.84           0.81               0.90       0.90
    18        0.82          0.85           0.82               0.90       0.90
    19        0.84          0.81           0.84               0.87       0.87
    20        0.82          0.81           0.82               0.87       0.87
    21        0.82          0.84           0.82               0.89       0.89
    22        0.84          0.81           0.84               0.89       0.89
    23        0.85          0.81           0.85               0.89       0.89
    24        0.85          0.81           0.85               0.89       0.89
    25        0.85          0.82           0.85               0.89       0.89
    26        0.87          0.82           0.87               0.90       0.90
    27        0.85          0.82           0.85               0.90       0.90
    28        0.85          0.82           0.85               0.90       0.90
    29        0.85          0.82           0.85               0.90       0.90
    30        0.85          0.82           0.85               0.90       0.90
    40        0.85          0.82           0.85               0.85       0.85
    50        0.84          0.84           0.84               0.85       0.85
    60        0.84          0.85           0.84               0.87       0.87
    70        0.84          0.82           0.84               0.87       0.87
    80        0.84          0.85           0.84               0.89       0.89
    90        0.86          0.85           0.85               0.89       0.89
    100       0.84          0.85           0.84               0.89       0.89

Table 5.2: Comparison Between Various Feature Selection Methods

    Comparison Criteria          SVM-RFE    Gradient-RFE    Projection-RFE
    Computing Time               3 hours    4.09 mins       3.86 mins
    Best Accuracy                90%        90%             90%
    # Genes at Best Accuracy     16         16              16

Computing times are not shown for the linear kernel, since SVM-RFE and SVM Gradient-RFE perform the same computation with the linear kernel, and hence such a comparison would not distinguish the methods. SVM-RFE takes significantly longer than the other two methods, SVM Gradient-RFE and SVM Projection-RFE; SVM Projection-RFE is slightly more efficient than SVM Gradient-RFE. The experiments were run on a Sun machine in CSE114. Table 5.2 also shows the best accuracy and the number of genes at the best accuracy for the various feature selection methods; these results are the same for all three methods. Figure 5.1 graphically compares the various feature (gene) selection methods. Note that SVM Projection-RFE gives the same cross-validation accuracy as SVM Gradient-RFE. SVM Gradient-RFE and SVM-RFE give the best CV accuracy overall, and with these two methods the best CV accuracy is achieved with 16 genes. This implies that a subset of 16 genes is enough for the classification, and is even much better than using all the genes; among the 2,000 genes, most are either irrelevant or redundant. In Figure 5.1, the dotted horizontal line is the accuracy for the linear kernel using all the genes, which is 85%. This supports our earlier conjecture. It is worth mentioning which genes are ranked at the top and whether they make sense biologically. Guyon et al. [14] conducted feature selection with SVMs on the same colon data set. Table 5.3 compares our top-ranked genes (from SVM Gradient-RFE using the linear kernel) with their results.

Figure 5.1: CV accuracy comparison of feature (gene) selection methods (cross-validation accuracy in % versus number of genes, for RFE, Grad, and Grad-RFE with the linear kernel and RFE and Grad-RFE with the RBF kernel)

Table 5.3: Top-ranked Genes

    Rank in Guyon   Rank in our ranking   GAN      Description
    1               1                     H64807   PLACENTAL FOLATE TRANSPORTER (Homo sapiens)
    2               2                     T62947   60S RIBOSOMAL PROTEIN L24 (Arabidopsis thaliana)
    3               107                   R88740   ATP synthase coupling factor 6, mitochondrial precursor (human)
    4               20                    H81558   Procyclic form specific polypeptide B1 alpha precursor (Trypanosoma brucei brucei)
    5               37                    T94579   Human chitotriosidase precursor mRNA, complete cds
    6               22                    M59040   Human cell adhesion molecule (CD44) mRNA, complete cds
    7               33                    H08393   Collagen alpha 2 (XI) chain (Homo sapiens)

The first column lists the seven top-ranked genes in Guyon's results [14], and the second column gives the position of each gene in our ranking. The two rankings are close: the top two genes match exactly, and the other genes in the first column are also highly ranked in our list. Guyon et al. verified that their top-ranked genes are biologically relevant to colon cancer [14]. We can therefore say that the feature selection methods work well.

CHAPTER 6
DISCUSSION

In this chapter, we discuss the speed of the feature selection computations, survey other feature selection methods, and present dimension reduction methods.

6.1 Computational Consideration

Using the linear kernel takes less time than using the RBF kernel. This difference is most obvious with the SVM-RFE method: with the linear kernel, the computing time is always much less than with the RBF kernel for the same feature selection method. The reason is that for the linear kernel we compute the weights only once per feature elimination, whereas for the RBF kernel a sensitivity is calculated for each candidate gene to be eliminated (SVM-RFE), or a gradient vector is calculated at every support vector (SVM Gradient-RFE). The difference in computing time between the RBF and linear kernels is at most about three hours (SVM-RFE). As we saw in the previous section, better accuracy is obtained using the RBF kernel than the linear kernel; if accuracy cannot be traded for speed, the RBF kernel is therefore preferable. Since SVM-RFE and SVM Gradient-RFE give almost the same accuracy with the RBF kernel, we consider only the RBF kernel, and speed becomes the deciding issue. SVM Gradient-RFE takes significantly less time than SVM-RFE (see Table 5.2), so SVM Gradient-RFE is preferable to SVM-RFE.

6.2 Comparison with Other Feature Selection Methods

The main issue is the comparison between the recursive feature elimination techniques (SVM-RFE, SVM Gradient-RFE, and SVM Projection-RFE) and non-recursive feature selection methods.

6.2.1 Linear Programming Methods

Linear programming (LP) is a mathematical formulation for solving optimization problems whose constraints and objective function are linear [7, 8]. LP formulations have been suggested for solving the feature selection problem [7, 8]. Two point sets $A$ and $B$ in $R^n$ are represented by the matrices $A \in R^{m \times n}$ and $B \in R^{k \times n}$. The separation problem is formulated as the following robust linear program (RLP) [8]:

$$\min_{w, r, y, z} \; \frac{e^T y}{m} + \frac{e^T z}{k} \qquad (6.1)$$
$$\text{s.t.} \; -A w + e r + e \le y, \quad B w - e r + e \le z, \quad y \ge 0, \quad z \ge 0. \qquad (6.2)$$

6.2.1.1 Feature Selection via Concave Minimization [8]

A subset of features is obtained by attempting to suppress as many components of the normal vector $w$ to the separating plane $P$ as possible, while remaining consistent with an acceptable separation between the sets $A$ and $B$. This can be achieved by introducing an extra term, weighted by a parameter $\lambda \in [0, 1)$, into the objective of the RLP, while weighting the original objective by $(1 - \lambda)$:

$$\min_{w, r, y, z, v} \; (1 - \lambda)\left( \frac{e^T y}{m} + \frac{e^T z}{k} \right) + \lambda \, e^T v_* \qquad (6.3)$$
$$\text{s.t.} \; -A w + e r + e \le y, \quad B w - e r + e \le z, \quad y \ge 0, \quad z \ge 0, \quad -v \le w \le v. \qquad (6.4)$$

Because of the discontinuity of $e^T v_*$, this term is approximated by a concave exponential on the nonnegative real line,
$$v_* \approx t(v, \alpha) = e - \varepsilon^{-\alpha v}, \qquad \alpha > 0.$$

This leads to the Feature Selection Concave minimization (FSV) problem:

$$\min_{w, r, y, z, v} \; (1 - \lambda)\left( \frac{e^T y}{m} + \frac{e^T z}{k} \right) + \lambda \, e^T (e - \varepsilon^{-\alpha v}) \qquad (6.5)$$
$$\text{s.t.} \; -A w + e r + e \le y, \quad B w - e r + e \le z, \quad y \ge 0, \quad z \ge 0, \quad -v \le w \le v. \qquad (6.6)$$

6.2.1.2 SVM $\|\cdot\|_p$ Formulation [7, 8]

The authors tried several LP formulations, which vary depending on the norm used to measure the distance between the two bounding planes. The distance, measured by some norm $\|\cdot\|$ on $R^n$, is $2/\|w\|'$, where $\|\cdot\|'$ is the dual norm. Adding the reciprocal of this term, $\|w\|'/2$, to the objective function of the RLP gives

$$\min_{w, r, y, z} \; (1 - \lambda)(e^T y + e^T z) + \frac{\lambda}{2} \|w\|' \qquad (6.7)$$
$$\text{s.t.} \; -A w + e r + e \le y, \quad B w - e r + e \le z, \quad y \ge 0, \quad z \ge 0. \qquad (6.8)$$

If the $\infty$-norm is used to measure the distance between the planes, then since the dual norm of the $\infty$-norm is the 1-norm, the LP formulation is called SVM $\|\cdot\|_1$. The SVM $\|\cdot\|_1$ formulation is as follows:

$$\min_{w, r, y, z, s} \; (1 - \lambda)(e^T y + e^T z) + \frac{\lambda}{2} e^T s \qquad (6.9)$$
$$\text{s.t.} \; -A w + e r + e \le y, \quad B w - e r + e \le z, \quad -s \le w \le s, \quad y \ge 0, \quad z \ge 0. \qquad (6.10)$$

In the SVM $\|\cdot\|_p$ formulations, the objective function balances the number of misclassified instances against the number of non-zero components of $w$, minimizing a weighted sum of the two.
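The SVM $\|\cdot\|_1$ problem (6.9)-(6.10) is a plain linear program and can be passed to any LP solver. The following Matlab sketch, assuming the Optimization Toolbox function linprog and the sign conventions of the reconstruction above, sets it up with the variable stacking [w; r; y; z; s]; the data matrices A and B and the weight lambda are placeholders.

% Sketch: SVM ||.||_1 feature selection LP (6.9)-(6.10) via linprog.
% Assumes A (m-by-n), B (k-by-n) and lambda in [0,1) are already defined.
[m, n] = size(A);  k = size(B, 1);
% Variable stacking: v = [w (n); r (1); y (m); z (k); s (n)].
f = [zeros(n+1,1); (1-lambda)*ones(m+k,1); (lambda/2)*ones(n,1)];
Aineq = [ -A,  ones(m,1), -eye(m),    zeros(m,k), zeros(m,n);   % -Aw + er + e <= y
           B, -ones(k,1),  zeros(k,m), -eye(k),   zeros(k,n);   %  Bw - er + e <= z
           eye(n), zeros(n,1), zeros(n,m), zeros(n,k), -eye(n); %  w <= s
          -eye(n), zeros(n,1), zeros(n,m), zeros(n,k), -eye(n)];% -w <= s
bineq = [-ones(m+k,1); zeros(2*n,1)];
lb = [-inf(n+1,1); zeros(m+k+n,1)];                             % y, z, s >= 0
sol = linprog(f, Aineq, bineq, [], [], lb, []);
w = sol(1:n);
selected = find(abs(w) > 1e-6);          % features with non-zero weight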

Similarly, if the 1-norm is used to measure the distance between the planes, then the dual norm is the $\infty$-norm, and the SVM $\|\cdot\|_\infty$ formulation is as follows:

$$\min_{w, r, y, z, \delta} \; (1 - \lambda)(e^T y + e^T z) + \frac{\lambda}{2} \delta \qquad (6.11)$$
$$\text{s.t.} \; -A w + e r + e \le y, \quad B w - e r + e \le z, \quad -\delta e \le w \le \delta e, \quad y \ge 0, \quad z \ge 0. \qquad (6.12)$$

The authors also attempted SVM $\|\cdot\|_2$. They reported that only FSV and SVM $\|\cdot\|_1$ gave small subsets of features; all the other formulations ended up performing no feature selection. Solving the linear programming feature selection problem gives a subset of features for which the separating plane is linear in input space, whereas with the SVM recursive feature elimination methods the separating surface can be nonlinear in input space as well as linear. Also, with the LP methods one has no control over the number of features selected, since the solution of SVM $\|\cdot\|_p$ simply yields a subset of features. The SVM recursive feature elimination methods give a ranking of all the features, so one can choose the top $k$ features if a subset of size $k$ is desired.

6.2.2 Correlation Coefficient Methods

Evaluating how well each individual feature contributes to the separation produces a simple feature ranking. Golub et al. [13] used
$$w_i = \frac{\mu_i(+) - \mu_i(-)}{\sigma_i(+) + \sigma_i(-)},$$
and Pavlidis [24] used
$$w_i = \frac{(\mu_i(+) - \mu_i(-))^2}{\sigma_i(+)^2 + \sigma_i(-)^2}.$$
Here $\mu_i$ and $\sigma_i$ are the mean and standard deviation of feature $i$ over the positive (+) or negative (-) class. A large positive value of $w_i$ indicates a strong correlation with the (+) class, and a large negative value a strong correlation with the (-) class. Each coefficient $w_i$ is computed from the information of a single feature and does not take mutual information between features into account. These methods implicitly make an orthogonality assumption [14].
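Both correlation scores are one-line computations per feature. The Matlab sketch below assumes the preprocessed matrix X (one sample per row) and a label vector y with entries +1 and -1; it is an illustration of the two formulas above, not code from the referenced papers.

% Golub and Pavlidis correlation scores for every feature (column of X).
pos = (y == +1);  neg = (y == -1);
mu_p = mean(X(pos,:), 1);   mu_n = mean(X(neg,:), 1);
sd_p = std(X(pos,:), 0, 1); sd_n = std(X(neg,:), 0, 1);

w_golub    = (mu_p - mu_n) ./ (sd_p + sd_n);
w_pavlidis = (mu_p - mu_n).^2 ./ (sd_p.^2 + sd_n.^2);

% Rank features by the absolute Golub score (largest first).
[~, ranking] = sort(abs(w_golub), 'descend');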


These methods implicitly make an orthogonality assumption [14]. What this means is the following. Suppose one wants to find the two features that give the best classifier error rate among all combinations of two features. In this case, the correlation coefficient method finds two features that are individually good, but those two features may not be the best two features cooperatively. The SVM recursive feature elimination methods do not have this problem.

6.2.3 Feature Scaling Methods

Some authors suggest using the leave-one-out (LOO) bounds for SVMs as feature selection criteria [30]. In principle, such a method searches all possible subsets of the $n$ features for the one that minimizes the LOO bound, where $n$ is the total number of features; finding this minimum requires solving a combinatorial problem, which is very expensive when $n$ is large. Instead, the method scales each feature by a real variable and computes this scaling via gradient descent on the LOO bound. One keeps the features with the largest scaling variables. The authors incorporate the scaling factors $\sigma$ into the kernel:

$K_\sigma(x, y) = K((\sigma * x), (\sigma * y)),$

where $\sigma * x$ is an element-wise multiplication. The algorithm is:
1. Solve the standard SVM to find the $\alpha$'s for a fixed $\sigma$.
2. Optimize $\sigma$ for fixed $\alpha$'s by gradient descent.
3. Remove the features corresponding to small elements of $\sigma$ and return to step 1.
This method may end up in a local minimum, depending on the choice of the gradient step size [14]. (A sketch of such a scaled kernel is given below, after the discussion of wrapper methods.)

6.2.4 Wrapper Methods

Given a subset of features, a classifier can be used to evaluate how good the subset of features is in terms of classification error rate [18, 11]. The classifier is called a "wrapper." To find the best subset, one tries all combinations of the $n$ features; the subset evaluation is performed by $k$-fold cross-validation or on a test set. Some authors propose using a genetic algorithm as the subset selection technique and an SVM as the classifier on a face recognition problem [22]. The wrapper method is a combinatorial search problem, while recursive feature elimination is a greedy algorithm, which means the wrapper method is computationally much more expensive.
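The scaled kernel of Section 6.2.3 can be sketched in a few lines. This is an assumption-level illustration, not code from [30] or from this thesis: x and y are row vectors, sigma is a row vector of scaling factors, and p1 is the RBF width used elsewhere in this thesis. The gradient descent on the LOO bound itself is not shown.

function kval = scaledRbf(x, y, sigma, p1)
  xs = sigma .* x;                                    % element-wise scaling of the inputs
  ys = sigma .* y;
  kval = exp(-(xs - ys) * (xs - ys)' / (2*p1*p1));    % RBF kernel on the scaled inputs
end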


6.3 Dimension Reduction Methods

In this section, we examine dimension reduction methods. Two representative methods are principal component analysis (PCA) [19, 2] and projection pursuit (PP) [12, 17, 18]. PCA aims to minimize the error incurred by transforming the data to another coordinate system, and PP aims to find interesting low-dimensional linear projections of high-dimensional data.

6.3.1 Principal Component Analysis

Let us start by showing how to calculate the PCs. Consider a sample $x = [x_1, x_2, \ldots, x_n]'$ and assume without loss of generality that $E[x_i] = 0$ for all $i = 1, \ldots, n$. The covariance matrix of $x$ is $\Sigma = E(xx')$. The first principal component (PC) is $Y_{(1)} = \alpha_1' x$, where $\alpha_1$ is found by maximizing the variance of $\alpha_1' x$ under the constraint $\alpha_1'\alpha_1 = 1$, i.e.,

$\max_{\alpha_1} V(\alpha_1' x)$  s.t.  $\alpha_1'\alpha_1 = 1.$    (6.13)

Since $V(\alpha_1' x) = E((\alpha_1' x)(\alpha_1' x)') = \alpha_1' E(xx') \alpha_1 = \alpha_1' \Sigma \alpha_1$, problem (6.13) can be rewritten as

$\max_{\alpha_1} \alpha_1' \Sigma \alpha_1$  s.t.  $\alpha_1'\alpha_1 = 1.$    (6.14)

Using the Lagrangian function, problem (6.14) becomes

$\max_{\alpha_1} \alpha_1' \Sigma \alpha_1 - \lambda(\alpha_1'\alpha_1 - 1),$    (6.15)

where $\lambda$ is a Lagrange multiplier. The stationarity condition states

$\Sigma\alpha_1 - \lambda\alpha_1 = (\Sigma - \lambda I)\alpha_1 = 0.$


Since $\alpha_1 \ne 0$, $\det(\Sigma - \lambda I) = 0$; hence $\lambda$ is an eigenvalue of $\Sigma$. Let $\alpha_1$ be an eigenvector associated with $\lambda$. Then $\max \alpha_1' \Sigma \alpha_1 = \max \alpha_1' \lambda \alpha_1 = \max \lambda \alpha_1' \alpha_1 = \max \lambda$. So $\lambda$ is the largest eigenvalue of $\Sigma$, and $\alpha_1$ is the eigenvector associated with it. By similar formulations one can find all the PCs and then discard the directions with small variance, thereby achieving dimension reduction. But this is different from feature selection. In feature selection we want to find a small number of features from the original input space, whereas dimension reduction is accomplished by a transformation using the largest PCs. To find the PCs, one must make use of all the features, so there is no subset consisting of a small number of features in the original input space.

6.3.2 Projection Pursuit

PP aims to discover interesting structural projections of a multivariate data set, especially for visualizing high-dimensional data. In basic PP, one tries to find a direction $w$ such that $w^T x$ has an interesting distribution. We do not go into computational details for PP, but merely mention that, like PCA, PP cannot be a feature selection method, for the same reason as PCA.
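To illustrate the point that PCA reduces dimension through a transformation rather than by selecting features, the sketch below (not part of the thesis code) computes the PCs by an eigendecomposition of the sample covariance matrix and projects the data onto the top d of them; every original feature still enters the projection through the loading matrix.

function [Z, V] = pcaReduce(X, d)
% X is instances-by-features; d is the number of principal components to keep.
  Xc = X - repmat(mean(X, 1), size(X, 1), 1);   % center each feature
  Sigma = cov(Xc);                              % sample covariance matrix
  [V, D] = eig(Sigma);                          % eigenvectors and eigenvalues
  [ev, order] = sort(diag(D));                  % ascending eigenvalues
  order = order(end:-1:1);                      % largest eigenvalues first
  V = V(:, order(1:d));                         % loadings: every original feature enters
  Z = Xc * V;                                   % projected, d-dimensional data
end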


CHAPTER 7
CONCLUSION

7.1 Summary

SVMs have been widely used in machine learning, and feature selection with SVMs is one of their many applications. We have presented feature selection methods based on SVMs. A general introduction to SVMs was given before the feature selection methods, since the SVM is the main machine learning tool. We presented some of the related work and proposed two new algorithms: SVM Gradient-RFE and SVM Projection-RFE. We then presented experimental results in which some existing methods and the proposed methods were applied to the colon cancer data set. The experimental results showed that the proposed algorithms are as accurate as the best existing methods and are much faster. The experimental results also support the claim that a subset of features is enough for training the machine and predicting the target. In our colon cancer experiment, only 16 genes give the best accuracy, which is about 90%, while it was about 84% using all the genes. Other researchers conducted experiments on the same colon cancer data set and ranked all 2,000 genes, and researchers in biology verified that the best-ranked genes do have significant roles in colon cancer. In our experimental results, the top-ranked genes also matched their best-ranked genes. This supports the claim that the genes chosen by feature selection methods give a better explanation of the data set.

7.2 Future Work

The methods we reviewed and the methods we proposed are feature selection methods that give a ranking for all the features.


With these methods, we do not know how many features are best without computing the accuracy for the top $k$ features for $k = 1, \ldots, n$; that is, we do not know the optimal number of features. Research toward this aim remains unexplored.

Research on SVMs themselves is also open, especially finding the SVM parameters analytically. In our experiments, we found the parameters based on the leave-one-out error, which we calculated repeatedly over a set of candidate parameters to find the best pair. This requires many preliminary experiments before doing feature selection.

As we have seen, the training data are preprocessed before SVM training. There is a significant difference in training speed between preprocessed and non-preprocessed data, and sometimes the preprocessing affects the accuracy. In our experiments, the preprocessing goes through several steps: taking the logarithm and then normalizing. The normalization consists of subtracting the mean and dividing the result by the standard deviation, where the mean and standard deviation are computed instance-wise. Other data sets may call for different preprocessing steps, for example, instance-wise normalization followed by feature-wise normalization, or vice versa. How the preprocessing should be done and how it affects training speed and accuracy remain to be explored.
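The preprocessing described above can be written in a few lines; the sketch below is an assumption about the intent, not the script actually used for the experiments. X is the raw expression matrix, instances-by-features.

function Xp = preprocessLogNormalize(X)
  Xl = log(X);                                   % logarithm of the raw values
  mu = mean(Xl, 2);                              % instance-wise mean
  sd = std(Xl, 0, 2);                            % instance-wise standard deviation
  Xp = (Xl - repmat(mu, 1, size(Xl, 2))) ./ repmat(sd, 1, size(Xl, 2));
end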


APPENDIX
MATLAB IMPLEMENTATION

function ranking = rankFeature(trnX, trnY, fsMethod, ker, C, rbfParam, rankFile)
% rankFeature gives a ranking for all the features
%
% Usage: ranking = rankFeature(trnX, trnY, fsMethod, ker, C, rbfParam, rankFile)
%
% Parameters: trnX      Training inputs
%             trnY      Training targets
%             fsMethod  feature selection method
%                       ('rfe', 'gradient', 'gradrfe', 'proj')
%             ker       kernel function ('linear', 'rbf')
%             C         upper bound
%             rbfParam  p1 in exp(-||u-v||*||u-v|| / (2*p1*p1))
%             rankFile  file name to contain the ranking
%
% Note 1.
% p1 is a width (rbf kernel). The variable is available as global.
%
% Note 2.
% Linear kernel: features are eliminated one-by-one.
% RBF kernel: chunks of features are eliminated at each run.

col = size(trnX, 2);
s = 1 : col;                      % indices of the surviving features
ranking = [];
it = 0;
global p1;
p1 = rbfParam;

switch lower(ker)
  case 'linear'

    switch lower(fsMethod)
      case 'rfe'
        % SVM-RFE method for the linear kernel
        while( size(s, 2) ~= 0 )
          it = it + 1;
          xT = trnX(:, s);


          [n, a, b, svi] = svc(xT, trnY, ker, C);
          sv = xT(svi, :);
          alpha = a(svi);
          svY = trnY(svi);
          wt = sv' * (alpha .* svY);
          [value, I] = min(abs(wt));

          ranking = [s(I), ranking];
          s = setdiff(s, s(I));
        end

      case 'g1'
        % g1 is the same as the gradient when using the linear kernel.
        [n, a, b, svi] = svc(trnX, trnY, ker, C);
        sv = trnX(svi, :);
        alpha = a(svi);
        svY = trnY(svi);
        wt = sv' * (alpha .* svY);
        [value, ranking] = sort(abs(wt));
        ranking = ranking';

      case 'gradient'
        [n, a, b, svi] = svc(trnX, trnY, ker, C);
        sv = trnX(svi, :);
        alpha = a(svi);
        svY = trnY(svi);
        [r, c] = size(sv);
        A = zeros(1, c);

        for j = 1 : r
          g(j, :) = zeros(1, c);
          for i = 1 : r
            temp = (alpha(i) * svY(i)) * sv(i, :);
            g(j, :) = g(j, :) + temp;
          end
          t1 = g(j, :) * eye(c);
          t2 = norm(g(j, :));
          temp1 = acos(t1 / t2);
          temp2 = min(temp1, pi - temp1);
          A = A + temp2;
        end
        B = 1 - (2/pi) * A/r;
        [Y, ranking] = sort(-B);

      case 'gradrfe'
        while( size(s, 2) ~= 0 )
          xT = trnX(:, s);
          [n, a, b, svi] = svc(xT, trnY, ker, C);
          sv = xT(svi, :);
          alpha = a(svi);
          svY = trnY(svi);
          [r, c] = size(sv);
          A = zeros(1, c);
          g = [];


          for j = 1 : r
            g(j, :) = zeros(1, c);
            for i = 1 : r
              temp = (alpha(i) * svY(i)) * sv(i, :);
              g(j, :) = g(j, :) + temp;
            end
            t1 = g(j, :) * eye(c);
            t2 = norm(g(j, :));
            temp1 = acos(t1 / t2);
            temp2 = min(temp1, pi - temp1);
            A = A + temp2;
          end
          B = 1 - (2/pi) * A/r;
          [cc, I] = min(abs(B));
          ranking = [s(I), ranking];
          s = setdiff(s, s(I));
        end

      otherwise
        disp('method name is wrong');
    end

  case 'rbf'
    switch lower(fsMethod)
      case 'gradrfe'
        while( size(s, 2) ~= 0 )
          xT = trnX(:, s);
          dim = length(s);
          [n, a, b, svi] = svc(xT, trnY, ker, C);
          sv = xT(svi, :);
          alpha = a(svi);
          svY = trnY(svi);
          [r, c] = size(sv);
          A = zeros(1, c);
          g = [];

          for j = 1 : r
            g(j, :) = zeros(1, c);
            for i = 1 : r
              temp = (alpha(i)*svY(i)) / (p1*p1) * exp(-(sv(i,:)-sv(j,:)) ...
                     * (sv(i,:)-sv(j,:))' / (2*p1*p1)) * (sv(i,:)-sv(j,:));
              g(j, :) = g(j, :) + temp;
            end
            t1 = g(j, :) * eye(c);
            t2 = norm(g(j, :));
            temp1 = acos(t1 / t2);
            temp2 = min(temp1, pi - temp1);
            A = A + temp2;
          end
          B = 1 - (2/pi) * A/r;
          [cc, I] = sort(abs(B));

          % chunks of features eliminated


          if( dim == 2000 )
            II = I(1025 : dim);
          else
            tempI = floor(dim/2);
            II = I(tempI + 1 : dim);
          end

          ranking = [s(II), ranking];
          s = setdiff(s, s(II));
        end

      case 'proj'
        while( size(s, 2) ~= 0 )
          xT = trnX(:, s);
          dim = length(s);
          [n, a, b, svi] = svc(xT, trnY, ker, C);
          sv = xT(svi, :);
          alpha = a(svi);
          svY = trnY(svi);
          [r, c] = size(sv);
          A = zeros(1, c);
          g = [];
          T = [];

          for j = 1 : r
            g(j, :) = zeros(1, c);
            for i = 1 : r
              temp = (alpha(i)*svY(i)) / (p1*p1) * exp(-(sv(i,:)-sv(j,:)) ...
                     * (sv(i,:)-sv(j,:))' / (2*p1*p1)) * (sv(i,:)-sv(j,:));
              g(j, :) = g(j, :) + temp;
            end
            T(j) = 1 / (g(j,:) * g(j,:)');
          end
          diff = abs(diag(T) * g);
          diffSum = ones(1, r) * diff;
          [cc, I] = sort(diffSum);

          % chunks of features eliminated
          if( dim == 2000 )
            II = I(1025 : dim);
          else
            tempI = floor(dim/2);
            II = I(tempI + 1 : dim);
          end

          ranking = [s(II), ranking];
          s = setdiff(s, s(II));
        end

      case 'gradient'
        % to be implemented

      case 'rfe'
        while( size(s, 2) ~= 0 )


          it = it + 1;
          K = [];
          AK = [];
          H = [];
          AH = [];
          xT = trnX(:, s);
          dim = length(s);
          [n, a, b, svi] = svc(xT, trnY, ker, C);
          sv = xT(svi, :);
          svY = trnY(svi);
          alpha = a(svi);
          [r, c] = size(sv);

          for i = 1 : r
            for j = 1 : r
              AH(i,j) = svY(i)*svY(j)*exp(-(sv(i,:)-sv(j,:)) ...
                        * (sv(i,:)-sv(j,:))' / (2*p1*p1));
            end
          end

          AK = alpha' * AH * alpha;
          for k = 1 : dim
            for i = 1 : r
              for j = 1 : r
                svi = [sv(i, 1:(k-1)) sv(i, (k+1):dim)];
                svj = [sv(j, 1:(k-1)) sv(j, (k+1):dim)];
                H(i,j) = svY(i)*svY(j)*exp(-(svi-svj)*(svi-svj)' / (2*p1*p1));
              end
            end
            K(k,1) = alpha' * H * alpha;
          end
          [cc, I] = sort(abs(AK - K));

          % chunks of features eliminated
          if( dim == 2000 )
            II = I(1025 : dim);
          else
            tempI = floor(dim/2);
            II = I(tempI + 1 : dim);
          end

          ranking = [s(II), ranking];
          s = setdiff(s, s(II));
        end

      otherwise
        disp('method name is wrong');

    end

  otherwise
    disp('this kernel is not available');


end

fid = fopen(rankFile, 'w');
% The format string on the next line was garbled in the source; '%d\n' is an
% assumed reconstruction that writes the ranked feature indices, one per line.
fprintf(fid, '%d\n', ranking);
fclose(fid);
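A hypothetical call to the function above (the parameter values and the file name are illustrative assumptions, not the settings used in the experiments):

% Rank all features with SVM Projection-RFE, an RBF kernel of width p1 = 2 and C = 100,
% then keep the 16 best-ranked features (the best-ranked feature comes first).
ranking  = rankFeature(trnX, trnY, 'proj', 'rbf', 100, 2, 'ranking.txt');
topGenes = ranking(1:16);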


REFERENCES

[1] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays, PNAS, Vol. 96, pp. 6745-6750, June 1999, Cell Biology.

[2] T. Anderson, An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York, NY, 1958.

[3] K. Bennet and C. Campbell, Support Vector Machines: Hype or Hallelujah?, ACM SIGKDD, 2(2):1-13, 2000.

[4] K. Bennet and E. Bredensteiner, Geometry in Learning, in Geometry at Work, C. Gorini, Ed., Mathematical Association of America, Washington, D.C., pp. 132-145, 2000.

[5] K. Bennet and E. Bredensteiner, Duality and Geometry in SVMs, in P. Langley, editor, Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, pp. 65-72, 2000.

[6] D. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1995.

[7] P. Bradley, O. Mangasarian, and W. Street, Feature Selection via Mathematical Programming, INFORMS Journal on Computing, 10(2):209-217, 1998.

[8] P. Bradley and O. Mangasarian, Feature Selection via Concave Minimization and Support Vector Machines, in Proceedings of the 13th International Conference on Machine Learning, pp. 82-90, San Francisco, CA, 1998.

[9] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing Multiple Parameters for Support Vector Machines, Machine Learning, Vol. 46, No. 1, pp. 131-159, January 2001.

[10] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Boston, MA, 1999.

[11] M. Dash and H. Liu, Feature Selection for Classification, Intelligent Data Analysis, Elsevier, Vol. 1, No. 3, pp. 131-156, 1997.

[12] J. Friedman, Exploratory Projection Pursuit, Journal of the American Statistical Association, Vol. 82, Issue 397, pp. 249-266, 1987.

[13] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander, Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science, Vol. 286, pp. 531-537, October 1999.


[14] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Gene Selection for Cancer Classification Using Support Vector Machines, Machine Learning, 46(1/3):389-422, January 2002.

[15] I. Guyon, SVM Application Survey, Web page on SVM applications: http://www.clopinet.com/SVM.applications.html, October 20, 2001.

[16] L. Hermes and J. Buhmann, Feature Selection for Support Vector Machines, Proceedings of the International Conference on Pattern Recognition (ICPR'00), Vol. 2, pp. 712-715, 2000.

[17] P. Huber, Projection Pursuit, Annals of Statistics, Vol. 13, Issue 2, pp. 435-475, 1985.

[18] G. John, R. Kohavi, and K. Pfleger, Irrelevant Features and the Subset Selection Problem, in W. Cohen and H. Hirsh (Eds.), Machine Learning: Proceedings of the 11th International Conference, pp. 121-129, San Mateo, CA, 1994.

[19] I. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, NY, 1986.

[20] M. Jones and R. Sibson, What Is Projection Pursuit?, Journal of the Royal Statistical Society, Series A (General), Vol. 150, Issue 1, pp. 1-36, 1987.

[21] D. Koller and M. Sahami, Toward Optimal Feature Selection, Proceedings of the 13th International Conference on Machine Learning (ML), Bari, Italy, July 1996, pp. 284-292, 1996.

[22] K. Lee, Y. Chung, and H. Byun, Face Recognition Using Support Vector Machines with the Feature Set Extracted by Genetic Algorithms, J. Bigun and F. Smeraldi (Eds.): AVBPA 2001, LNCS 2091, pp. 32-37, 2001.

[23] E. Osuna, R. Freund, and F. Girosi, Support Vector Machines: Training and Applications, Technical Report AIM-1602, MIT A.I. Lab., 1996.

[24] P. Pavlidis, J. Weston, J. Cai, and W. Grundy, Gene Functional Analysis from Heterogeneous Data, RECOMB, pp. 249-255, 2001.

[25] B. Scholkopf, Support Vector Learning, Ph.D. Dissertation, published by R. Oldenbourg Verlag, Munich, Germany, 1997.

[26] B. Scholkopf, Tutorial: Support Vector Learning, DAGM'99, September 1999, Bonn, Germany.

[27] B. Scholkopf, Statistical Learning and Kernel Methods, Microsoft Technical Report MSR-TR-2000-23, 2000.

[28] D. Slonim, P. Tamayo, J. Mesirov, T. Golub, and E. Lander, Class Prediction and Discovery Using Gene Expression Data, Proceedings of the 4th International Conference on Computational Molecular Biology (RECOMB 2000), Tokyo, Japan, April 8-11, 2000, pp. 263-272.

[29] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, NY, 1995.

[30] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, Feature Selection for SVMs, Advances in Neural Information Processing Systems, Vol. 13, pp. 668-674, 2000.


BIOGRAPHICAL SKETCH

Eun Seog Youn was born April 18, 1968, in Chunnam, Korea. He received his Bachelor of Science degree in industrial engineering from Hanyang University in Seoul, Korea, and his Master of Science degree in industrial engineering from the University of Wisconsin-Madison. His specialization was operations research, which included mathematical programming and probability modeling. He entered the graduate program in computer and information science and engineering at the University of Florida in Gainesville in August 1999. He did his master's thesis research under the guidance of Dr. Li Min Fu. His research focused on the application of support vector machines, especially feature selection in support vector machines.