Feature Selection and Discriminant Analysis in Data Mining

Permanent Link: http://ufdc.ufl.edu/UFE0004221/00001

Material Information

Title: Feature Selection and Discriminant Analysis in Data Mining
Physical Description: Mixed Material
Copyright Date: 2008

Record Information

Source Institution: University of Florida
Holding Location: University of Florida
Rights Management: All rights reserved by the source institution and holding location.
System ID: UFE0004221:00001

FEATURE SELECTION AND DISCRIMINANT ANALYSIS IN DATA MINING

By

EUN SEOG YOUN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2004

Copyright 2004 by Eun Seog Youn

This dissertation is dedicated to my parents, Jaebyung Youn and Guisoon Yoo.

ACKNOWLEDGMENTS

I would like to thank all five professors, Drs. Li Min Fu, Stanley Su, Su-Shing Chen, Anand Rangarajan, and Mark Yang, who served on my dissertation committee. Their comments and suggestions were invaluable. My special thanks go to Dr. Li Min Fu for being my supervisor and introducing me to this area of research. He provided ideas, financial support, assistance, and active encouragement over the years.

I would like to thank my parents, Jaebyung Youn and Guisoon Yoo, for their spiritual and financial support, and my brother-in-law, Eungkil, for his thoughts and prayers. I would like to thank my wife, Jongjin, and my son, Sungho, for their continuous encouragement and prayers. I apologize for not being able to spend more time with them while I was pursuing my graduate studies. They have always been a great joy for me.

Last, but not least, I thank God for allowing me the joy of writing this dissertation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
NOTATION
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Feature Selection Problem
  1.2 Dissertation Overview

2 SUPPORT VECTOR MACHINES
  2.1 Introduction
  2.2 Linearly Separable Case
  2.3 Linearly Inseparable Case
  2.4 Dual Problem
  2.5 Kernel Methods and Support Vector Classifiers
  2.6 Least Squares SVM
    2.6.1 LS-SVM Formulation
    2.6.2 Comparison with the L1-norm SVM

3 MODEL SELECTION IN SVM
  3.1 Introduction
  3.2 Generalization Error Bound
  3.3 Cross-Validation
  3.4 $\xi\alpha$ Bound
  3.5 GACV Bound
  3.6 Other Methods

4 RELATED WORKS
  4.1 SVM Gradient Method
    4.1.1 Algorithm: Linear Kernel
    4.1.2 Algorithm: General Kernels

  4.2 SVM-RFE
    4.2.1 Algorithm: Linear Kernel
    4.2.2 Algorithm: General Kernels
  4.3 SVM Gradient-RFE
    4.3.1 Linear Kernel
    4.3.2 General Kernels
  4.4 SVM Projection-RFE
    4.4.1 Idea
    4.4.2 Linear Kernel
    4.4.3 General Kernels
    4.4.4 SVM Projection-RFE Algorithm
  4.5 The M-fold SVM
    4.5.1 Binary-class M-fold-SVM
    4.5.2 Multi-class M-fold-SVM
  4.6 Linear Programming Methods
    4.6.1 Feature Selection via Concave Minimization
    4.6.2 SVM $\|\cdot\|_p$ Formulation
  4.7 Correlation Coefficient Methods
  4.8 Feature Scaling Methods
  4.9 Dimension Reduction Methods
    4.9.1 Principal Component Analysis
    4.9.2 Projection Pursuit

5 SUPPORT VECTOR FEATURE SELECTION
  5.1 Datasets
  5.2 Support Vector Feature Selection Using FLD
    5.2.1 Fisher's Linear Discriminant
    5.2.2 Support Vector Feature Selection with FLD
    5.2.3 Experiments
    5.2.4 Discussion
  5.3 SV-RFE SVM
    5.3.1 Experiments
    5.3.2 Discussion

6 NAIVE BAYES AND FEATURE SELECTION
  6.1 Naive Bayes Classifier
  6.2 Naive Bayes Classifier Is a Linear Classifier
  6.3 Improving the Naive Bayes Classifier
    6.3.1 Imbalanced Data
    6.3.2 Training with Only One Feature
    6.3.3 A Vector of Zeros as Test Input
  6.4 Text Datasets and Preprocessing
  6.5 Feature Selection Approaches in Text Datasets
    6.5.1 Information Gain

    6.5.2 t-test
    6.5.3 Correlation Coefficient
    6.5.4 Fisher Score
    6.5.5 Odds Ratio
    6.5.6 NB Score
    6.5.7 Note on Feature Selection
  6.6 Class Dependent Term Weighting Naive Bayes
    6.6.1 Motivation for Class Dependent Term Weighting
    6.6.2 CDTW-NB-RFE Algorithm
  6.7 Experimental Results
    6.7.1 Feature Ranking Works
    6.7.2 Experimental Results on the Newsgroup Datasets
  6.8 Naive Bayes for Continuous Features
  6.9 Discussion

7 FEATURE SELECTION AND LINEAR DISCRIMINANTS
  7.1 Feature Selection with Linear Discriminants
  7.2 Linear Discriminants and Their Decision Hyperplanes
  7.3 Visualization of Selected Features
  7.4 Compact Data Representation and Feature Selection
  7.5 Applications of Feature Selection

8 CONCLUSIONS

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1  Common Kernel Functions
4-1  Common Kernel Functions and Their Derivatives
5-1  Training/Test Data Splits for the Four Datasets
5-2  Simple All vs. Simple SV Performance on MNIST Data (FLD)
5-3  Simple All vs. Simple SV Performance on Sonar Data (FLD)
5-4  All-RFE vs. SV-RFE Performance on MNIST Data (FLD)
5-5  All-RFE vs. SV-RFE Performance on Sonar Data (FLD)
5-6  Guyon-RFE vs. SV-RFE Accuracy on Leukemia Data (SVM)
5-7  Guyon-RFE vs. SV-RFE Time on Leukemia Data (SVM)
5-8  Guyon-RFE vs. SV-RFE Performance on Colon Cancer Data (SVM)
5-9  Guyon-RFE vs. SV-RFE Performance on MNIST Data (SVM)
5-10 Guyon-RFE vs. SV-RFE Performance on Sonar Data (SVM)
6-1  Imbalanced Training Data Example
6-2  Probability Estimates
6-3  Keywords Used to Collect the Text Datasets
6-4  20 Newsgroup Dataset Distribution
6-5  Illustrative Data for the Class Dependent Term Weighting
6-6  Top 10 Words Selected from the Text AB Dataset
6-7  Top 10 Words Selected from the Text CD Dataset
6-8  Top 10 Words Selected from the Text CL Dataset
6-9  CDTW-NB-RFE vs. Original NB-RFE
6-10 Original NB vs. CDTW-NB
6-11 Accuracy Results for Ranking Schemes

6-12 Comparison between FLD, SVM, and Discretized NB
7-1  Performance of Various Ranking Schemes on Linear SVM (%)
7-2  Compact Data Representation on SV-RFE (SVM)

LIST OF FIGURES

2-1 Linear Discriminant Planes
2-2 Geometric Margin in the Maximum Margin Hyperplane
2-3 Linearly Inseparable Data
2-4 A Nonlinear Kernel Support Vector Machine
4-1 Motivation for the SVM Projection-RFE
4-2 Linear Kernel Case for Projection-RFE
4-3 Nonlinear Kernel Case for Projection-RFE
5-1 Images of Digits 2 and 3
5-2 Recursive FLD Idea
5-3 Effect of the Parameter C in SVM
6-1 Pictorial Illustration of the Bag-Of-Words Model
6-2 Test Accuracy with Best-ranked Features on 20 Newsgroup Datasets
7-1 Linear Discriminants: Separable, Balanced Data
7-2 Linear Discriminants: Separable, Imbalanced Data
7-3 Linear Discriminants: Inseparable, Balanced, Diff. Var. Data
7-4 Images with the Best-ranked Features
7-5 Leukemia Data Plot with the Best Two Features
7-6 Test Accuracy with Best-ranked Features on MNIST Data
7-7 Test Accuracy with Best-ranked Features on Three Datasets

NOTATION

$x_i$ : training instance
$y_i$ : target corresponding to instance $x_i$, $y_i \in \{-1, 1\}$
$g(x)$ : decision hyperplane trained from SVM
$w$ : weight vector (normal) when $g(x)$ is linear
$b$ : threshold in $g(x)$
$\alpha$ : Lagrange multipliers
$\phi$ : mapping function to feature space
$K(x, y)$ : kernel, $\langle \phi(x)\cdot\phi(y)\rangle$
$l$ : number of training instances
$n$ : dimension of input space
$\langle x \cdot y\rangle$ : inner product between $x$ and $y$
$\|\cdot\|$ : 2-norm
$\|\cdot\|_p$ : $p$-norm
$\|\cdot\|'$ : dual norm of $\|\cdot\|_p$
$p(x, g(x))$ : projection of $x$ onto $g(x)$
$\xi$ : slack variable
$C$ : upper bound for $\alpha$
$L$ : primal Lagrangian function
$x'$ : transpose of vector $x$
$SV$ : support vectors
$|x|$ : component-wise absolute value of vector $x$
$|SV|$ : number of support vectors
$\angle(x, y)$ : angle between $x$ and $y$

$\mu_i$ : mean for feature $i$ over training instances
$\sigma_i$ : standard deviation for feature $i$ over training instances
$V(x)$ : variance of $x$
$E(x)$ : expectation
$\Sigma$ : covariance matrix
$S_W$ : total within-class covariance matrix
$P(y\,|\,x)$ : the posterior probability of $y$ with respect to $x$
$P(x\,|\,y)$ : the likelihood of $y$ with respect to $x$

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

FEATURE SELECTION AND DISCRIMINANT ANALYSIS IN DATA MINING

By

Eun Seog Youn

May 2004

Chair: Li Min Fu
Major Department: Computer and Information Science and Engineering

The problem of feature selection is to find a subset of features for optimal classification. A critical part of feature selection is to rank features according to their importance for classification. This dissertation deals with feature ranking. The first method is based on support vectors (SVs). Support vectors refer to those sample vectors that lie around the boundary between two different classes. We show how we can do feature ranking with only support vectors. While SV-based feature ranking can be applied to any discriminant analysis, here we considered two linear discriminants, SVM and FLD. Feature ranking is based on the weight associated with each feature or determined by recursive feature elimination (RFE). Both schemes are shown to be effective in various domains. The second method of feature ranking is based on naive Bayes. The naive Bayes classifier has been extensively used in text categorization. Despite its successful application to this problem, naive Bayes has not been used for feature selection. We have developed a new feature scaling method, called class-dependent-term-weighting (CDTW). A new feature selection method, CDTW-NB-RFE, combines CDTW and RFE. Our experiments showed that CDTW-NB-RFE outperformed any of the five popular feature ranking schemes used on text datasets. This method has also been extended to continuous data domains.

CHAPTER 1
INTRODUCTION

Feature selection is an important problem in machine learning [12, 26, 56]. In this chapter, we present the nature of the feature selection problem and show why it is needed. We also provide a glossary of notation and an overview of this dissertation.

1.1 Feature Selection Problem

Feature selection is the problem of selecting the input variables (features) that are most predictive of a given output. In the classification problem setting, given a set of input-output pairs $\{(x_i, y_i): i = 1,\dots,l,\ x_i \in R^d,\ y_i \in \{-1, +1\}\}$, we attempt to construct a decision function $f$ which maps the input $x_i$ onto its label $y_i$. The goal is then to find an $f$ which minimizes the error ($f(x) \ne y$) for every possible $x$. Feature selection identifies a small subset of features (variables) so that the classifier constructed with the selected features minimizes the error and the selected features also better explain the data. The most important part of feature selection is to rank the features according to their importance for classification. While this dissertation deals with feature ranking, we will use the terms feature selection and feature ranking interchangeably; most of the time, feature ranking is what is meant. There are several motivations for feature selection [94].

First, feature selection identifies relevant features. For example, in a microarray dataset, researchers strongly believe that only a small subset of genes (features) is responsible for a disease. Our previous experimental results [99] and other researchers' findings [44] support this conclusion. When feature selection is applied to text datasets, the relevance of the selected features is obvious: they represent the keywords of the document collection. If it is applied to handwritten digit image datasets, the selected features are the subset of the most discriminating pixels in the image.

Second, feature selection gives a better generalization error. Typically, only a small number of the features of $x$ give sufficient information for the classification. The feature selection problem finds the small subset of features that are relevant to the target concept. A small subset of relevant features gives more discriminating power than using more features. This is counter-intuitive, since more features give more information and therefore should give more discriminating power. But if a feature is irrelevant, then that feature does not affect the target concept; if a feature is redundant, then that feature does not add anything new to the target concept [60]. This resolves the counter-intuition.

Third, feature selection reduces the dimension of the input space. Reducing the dimension has many implications. Processing (collecting, storing, computing, transferring, and so forth) several thousand feature values is much more expensive than processing just several dozen or several hundred feature values.

Benefits of feature selection therefore include cutting the processing time short, giving a better discriminating hyperplane, and a better understanding of the data.

As an easy attempt at a feature selection solution, we may consider all the possible subsets. This may be acceptable only when the number of features is small. Microarray data, however, have a number of features on the order of several thousand. Simply trying all the possible subsets of features is prohibitive in this case. In our experiments, we have a colon cancer microarray dataset and a leukemia microarray dataset. In the colon cancer dataset, each instance consists of 2,000 components (gene expression levels), and in the leukemia dataset, 7,129 components. For example, there are $2^{2000} - 1$ subsets for the colon cancer data. And high dimensionality is typical of the datasets we will consider, such as text datasets and image datasets.

Many methods are available for solving the feature selection problem for high-dimensional data [26, 56]. One part of this dissertation concerns feature selection using support vectors. Support vectors refer to those sample vectors that lie around the boundary between two different classes. Classifiers built on only support vectors have more discriminating power. This has been proven in Support Vector Machines (SVM) and Fisher's linear discriminant (FLD). We show how we can do feature ranking with only support vectors. While support vector feature ranking can be applied to any discriminant analysis, we particularly considered two linear discriminants, SVM and FLD. The main idea of support-vector-based feature ranking is to identify the support vectors and then apply feature ranking schemes to only those support vectors. The feature ranking scheme can be a simple weight-based ranking or recursive feature elimination (RFE). In this dissertation, both are shown to be effective for SV-based feature ranking on various data domains.

Another part of this dissertation concerns feature selection and categorization on text document datasets. The text categorization problem has been studied extensively [54, 55, 61, 64, 65, 66, 75]. One of the most popular text categorization methods is the naive Bayes (NB) method [54, 65, 66, 75]. The naive Bayes method is popular as a text data classification tool for its implementation and usage simplicity and the linear time complexity of the method. Despite its success as a classification tool, naive Bayes has not been utilized as a feature selection method. We have developed a new feature scaling method, called class-dependent-term-weighting (CDTW). A new feature selection method, CDTW-NB-RFE, combines CDTW and NB-RFE. For our experiments on text data, we have used three mini datasets and real-world datasets. The three mini datasets were collected from PubMed using some keywords. By applying the feature selection methods, we could identify the discriminating keywords between two different groups of text documents. In contrast to other data domains, the text data domain needs some preprocessing that is unique to text document representation.

We will discuss how to preprocess the text documents so that they are suitable for supervised learning. We have used the 20 Newsgroup datasets as the real-world datasets. The effectiveness of CDTW-NB-RFE has been shown by comparison with the five other most popular feature selection methods on text datasets. CDTW-NB-RFE outperformed these popular feature selection methods.

We apply the feature selection methods to linear discriminants. In fact, we make use of the weight vector (normal direction) of the linear discriminant as a feature ranking criterion. We show how the weight vector of any linear discriminant can serve as a ranking criterion. Conversely, a ranking score for a feature can be used as a weight of a linear discriminant. We address these issues, too.

1.2 Dissertation Overview

We begin this dissertation by defining the feature selection problem, and then we discuss why the feature selection problem is important, especially feature selection using linear discriminants such as SVM, FLD, and NB. The SVM is introduced as background knowledge in Chapter 2. Derivations for the linear SVM, for linearly separable data and linearly inseparable data, and for the nonlinear SVM are covered together. The least squares SVM (LS-SVM) is briefly introduced and compared to the regular SVMs. In Chapter 3, model selection in SVMs is reviewed. Model selection concerns finding appropriate tuning parameters for the SVM; therefore, model selection is important when we attempt to get the best possible decision function using SVM. In Chapter 4, we present some existing feature selection algorithms in the SVM domain and other domains. Several of the related methods are reviewed with their ideas and algorithms. In Chapter 5, we introduce a new feature selection algorithm, SV-based feature selection. Two linear discriminants, SVM and FLD, are considered with SV-based feature selection. The motivation of the SV-based feature selection is discussed. The effectiveness of the SV-based method is shown with experimental results on four different datasets.

In Chapter 6, we show a new feature selection method based on the naive Bayes method, CDTW-NB-RFE. Other issues related to NB, such as systematic problems arising when NB is used with a small number of features, are discussed along the way as the CDTW-NB-RFE idea is developed. We also show how we apply feature selection to discretized data using NB. In Chapter 7, we show the generalization of feature selection with linear discriminants. In this discussion, we show that the weight vector of any linear discriminant can serve as a feature ranking score. Conversely, some feature ranking scores can be used as the weights of a linear discriminant for classification. The effectiveness of the features selected by our methods is shown by visualizing the data with those selected features. Two possible feature selection applications, compact data representation and cluster labeling, are discussed. This dissertation concludes by stating the intended contribution of our research and possible future research.

CHAPTER 2
SUPPORT VECTOR MACHINES

2.1 Introduction

Support Vector Machines (SVMs) are a classification method developed by Vapnik and his group at AT&T Bell Labs [89]. SVMs have been applied to classification problems [45, 73] as alternatives to multilayer networks. Classification is achieved by a linear or nonlinear separating surface in the input space of the dataset. The goal of the SVM is to minimize the expected classification error on unseen samples. Support Vector Machines map a given set of binary-labeled training data to a high-dimensional feature space and separate the two classes of data with a maximum margin hyperplane. To understand the SVM approach, we need to know two key ideas: duality and kernels [5, 25, 84]. We examine these concepts for the simple case and then show how they can be extended to more complex tasks.

2.2 Linearly Separable Case

Consider a binary classification problem with training instance and class label pairs $(x_i, y_i)$, where $i = 1,\dots,l$ and $y_i \in \{-1, 1\}$. The instance $x_i$ is called a positive instance if the corresponding label is +1; otherwise, it is a negative instance. Let $P$ denote the set of positive instances and $N$ the set of negative instances. Figure 2-1 shows some possible linear discriminant planes which separate $P$ from $N$. In Figure 2-1 [5], an infinite number of discriminating planes exist, and $P1$ and $P2$ are among them. $P1$ is preferred, since $P2$ is more likely to misclassify an instance if there are small perturbations in the instance [5]. A maximum distance, or maximum margin, plane is a discriminant plane which is furthest from both $P$ and $N$. This gives an intuitive explanation of why maximum margin discriminant planes are better. $P3$ and $P4$ are called supporting planes, and points lying on the supporting planes are called support vectors.

[Figure 2-1: Linear Discriminant Planes]

The maximum margin is the distance between the two supporting planes, and the geometric margin is the distance between the supporting planes normalized by the weight.

For $x_i \in P$, suppose we want to find $w$ and $b$ such that $\langle w \cdot x_i\rangle + b \ge 0$. Suppose $k = \min_i |\langle w \cdot x_i\rangle + b|$. Then $|\langle w \cdot x_i\rangle + b| \ge k$ for all $x_i$. For the points in the other class, we require $\langle w \cdot x_i\rangle + b \le -k$. Note that $w$ and $b$ can be rescaled, so we can always set $k$ equal to 1. To find the plane furthest from both sets, we maximize the distance between the supporting planes for each class. The support vectors are shown inside the dotted circle in Figure 2-2.

In Figure 2-2, the normalized margin, or geometric margin, between the supporting planes $\langle w \cdot x_1\rangle + b = +1$ and $\langle w \cdot x_2\rangle + b = -1$ is $r = 2/\|w\|$ [6, 83]. Since we want to maximize the margin, for the reason we intuitively justified, we can formulate the problem as an optimization problem [25]:

$$\max_{w,\,b}\ \frac{2}{\|w\|_2} \quad \text{s.t.}\quad \langle w \cdot x_i\rangle + b \ge +1,\ x_i \in P; \qquad \langle w \cdot x_i\rangle + b \le -1,\ x_i \in N. \qquad (2.1)$$

[Figure 2-2: Geometric Margin in the Maximum Margin Hyperplane]

Since maximizing the margin is equivalent to minimizing $\|w\|^2/2$, and the constraints can be simplified to $y_i(\langle w \cdot x_i\rangle + b) \ge 1$ for all $i$, we can rewrite the problem as follows:

$$\min_{w,\,b}\ \frac{\|w\|^2}{2} \quad \text{s.t.}\quad y_i(\langle w \cdot x_i\rangle + b) \ge 1,\quad i = 1,\dots,l. \qquad (2.2)$$

In mathematical programming, a problem such as (2.2) is called a convex quadratic problem. Many robust algorithms exist for solving quadratic problems. Since the quadratic problem is convex, any local minimum found is always a global minimum [9].

2.3 Linearly Inseparable Case

Figure 2-3 [5] shows two intersecting convex hulls. In Figure 2-3, if the single bad square were removed, then the formulation we had would work. To this end, we need to relax the constraints and add a penalty to the objective function in (2.2).

[Figure 2-3: Linearly Inseparable Data]

Equivalently, we need to restrict the influence of any single point. Any point falling on the wrong side of its supporting plane is considered to be an error. We want to maximize the margin and minimize the error. To this end, we make some changes to the formulation in (2.2).

A nonnegative error variable, or slack variable $\xi_i$, is added to each constraint and added to the objective function as a weighted penalty [25]:

$$\min_{w,\,b,\,\xi}\ \frac{\|w\|^2}{2} + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.}\quad y_i(\langle w \cdot x_i\rangle + b) + \xi_i \ge 1,\quad \xi_i \ge 0,\quad i = 1,\dots,l, \qquad (2.3)$$

where $C$ is a constant. The constant $C$ serves to reduce the influence of any particular point, such as the bad square in Figure 2-3. More explanation of $C$ is given in the next section.

2.4 Dual Problem

In this section, we show how to derive a dual representation of the problem. The dual representation of the original problem is important since its objective function and constraints are expressed in inner products. By replacing them with a kernel function, we can therefore benefit from an enormous computational shortcut. The dual representation involves the so-called Lagrangian theory [4, 7, 9, 22, 69].

The Lagrangian theory characterizes the solution of an optimization problem. First we introduce the dual representation for the linearly separable problem (2.2).

The primal Lagrangian function for problem (2.2) is as follows [25]:

$$L(w, b, \alpha) = \frac{1}{2}\langle w \cdot w\rangle - \sum_{i=1}^{l}\alpha_i\,[\,y_i(\langle w \cdot x_i\rangle + b) - 1\,], \qquad (2.4)$$

where $\alpha_i \ge 0$ are the Lagrange multipliers. At the extremum, the partial derivatives of the Lagrangian are 0, that is,

$$\frac{\partial L(w, b, \alpha)}{\partial b} = \sum_{i=1}^{l} y_i\alpha_i = 0, \qquad \frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^{l} y_i\alpha_i x_i = 0,$$

or

$$0 = \sum_{i=1}^{l} y_i\alpha_i, \qquad (2.5)$$

$$w = \sum_{i=1}^{l} y_i\alpha_i x_i. \qquad (2.6)$$

Substituting the relations (2.5, 2.6) into the primal problem (2.4) leads to

$$L(w, b, \alpha) = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle x_i \cdot x_j\rangle. \qquad (2.7)$$

Now we have the dual problem corresponding to the primal problem (2.2):

$$\text{maximize}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle x_i \cdot x_j\rangle \quad \text{s.t.}\ \sum_{i=1}^{l} y_i\alpha_i = 0,\ \alpha_i \ge 0,\ i = 1,\dots,l. \qquad (2.8)$$

The primal problem (2.2) and the corresponding dual problem (2.8) yield the same normal vector for the hyperplane $\sum_{i=1}^{l} y_i\alpha_i\langle x_i \cdot x\rangle + b = 0$.

We do a similar dual formulation for the linearly nonseparable problem. The primal Lagrangian for problem (2.3) is as follows:

$$L(w, b, \xi, \alpha, r) = \frac{1}{2}\langle w \cdot w\rangle + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\alpha_i\,[\,y_i(\langle w \cdot x_i\rangle + b) - 1 + \xi_i\,] - \sum_{i=1}^{l} r_i\xi_i, \qquad (2.9)$$

where $\alpha_i \ge 0$ and $r_i \ge 0$. Setting the partial derivatives of the primal Lagrangian (2.9) to 0, we derive the dual problem:

$$\frac{\partial L(w, b, \xi, \alpha, r)}{\partial b} = \sum_{i=1}^{l} y_i\alpha_i = 0, \qquad \frac{\partial L(w, b, \xi, \alpha, r)}{\partial \xi_i} = C - \alpha_i - r_i = 0, \qquad \frac{\partial L(w, b, \xi, \alpha, r)}{\partial w} = w - \sum_{i=1}^{l} y_i\alpha_i x_i = 0,$$

or

$$0 = \sum_{i=1}^{l} y_i\alpha_i, \qquad (2.10)$$

$$C = \alpha_i + r_i, \qquad (2.11)$$

$$w = \sum_{i=1}^{l} y_i\alpha_i x_i. \qquad (2.12)$$

Substituting the relations (2.10, 2.11, 2.12) into the primal problem (2.9) leads to

$$L(w, b, \xi, \alpha, r) = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle x_i \cdot x_j\rangle. \qquad (2.13)$$

Now we have the dual problem corresponding to the primal problem (2.3):

$$\text{maximize}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle x_i \cdot x_j\rangle \quad \text{s.t.}\ \sum_{i=1}^{l} y_i\alpha_i = 0,\ C \ge \alpha_i \ge 0,\ i = 1,\dots,l, \qquad (2.14)$$

where $C$ is the upper bound for the Lagrange multipliers. That is, it gives the upper bound for the $\alpha$, so it limits the influence of any point.

[Figure 2-4: A Nonlinear Kernel Support Vector Machine]

2.5 Kernel Methods and Support Vector Classifiers

How to construct the decision hyperplane from the solution of the problem (2.8) is the next topic. Let $\alpha^*$ solve the dual problem (2.8). Then, by the relation (2.6), the weight $w$ is as follows:

$$w = \sum_{i=1}^{l} y_i\alpha_i^* x_i \quad\text{and}\quad b^* = -\frac{\max_{y_i=-1}(\langle w \cdot x_i\rangle) + \min_{y_i=+1}(\langle w \cdot x_i\rangle)}{2}.$$

Its decision function therefore is

$$g(x) = \sum_{i=1}^{l}\alpha_i^* y_i\langle x_i \cdot x\rangle + b^* \qquad (2.15)$$

$$\phantom{g(x)} = \sum_{i\in SV}\alpha_i^* y_i\langle x_i \cdot x\rangle + b^*. \qquad (2.16)$$

For the example on the left of Figure 2-4, no simpler algorithm would work well; a quadratic function such as the circle pictured is needed. Figure 2-4 shows how we can make use of existing linear discriminant methods by mapping the original input data into a higher-dimensional space. When we map the original input space, say $X$, using $\phi(x)$, we create a new space $F = \{\phi(x): x \in X\}$. Then $F$ is called the feature space. We introduce the idea of the kernel with an example [82].

To produce a quadratic discriminant in a two-dimensional input space with attributes $x_1$ and $x_2$, map the two-dimensional input space $[x_1, x_2]$ to the three-dimensional feature space $[x_1^2, \sqrt{2}x_1x_2, x_2^2]$ and construct a linear discriminant in the feature vector space. Specifically, define $\phi(x): R^2 \to R^3$. Then, with $x = [x_1, x_2]$ and $w \cdot x = w_1x_1 + w_2x_2$, we have $\phi(x) = [x_1^2, \sqrt{2}x_1x_2, x_2^2]$ and $w \cdot \phi(x) = w_1x_1^2 + w_2\sqrt{2}x_1x_2 + w_3x_2^2$. The resulting decision function,

$$g(x) = w \cdot \phi(x) + b = w_1x_1^2 + w_2\sqrt{2}x_1x_2 + w_3x_2^2 + b,$$

is linear in the three-dimensional space, but it is nonlinear in the original two-dimensional space. This approach, however, may cause some problems. If the data are noisy, then the SVM would try to discriminate all the positive examples from the negative examples by increasing the dimensionality of the feature space; this is the so-called overfitting. The growth of the dimensionality of the feature space would be exponential. The second concern is computing the separating hyperplane by carrying out the map into the feature space. SVMs get around these problems by using the so-called kernel. The formal definition of a kernel is as follows [25]:

Definition 2.5.1. A kernel is a function $K$ such that for all $x, z \in X$, $K(x, z) = \langle\phi(x)\cdot\phi(z)\rangle$, where $\phi$ is a mapping from $X$ to an (inner product) feature space $F$.

To change from a linear to a nonlinear classifier, we substitute only the inner product $\langle x \cdot y\rangle$ by a kernel function $K(x, y)$. In our example, $\phi(x)\cdot\phi(y) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)(y_1^2, \sqrt{2}y_1y_2, y_2^2)' = (x \cdot y)^2 =: K(x, y)$. Note that the actual dot product in $F$ is computed in $R^2$, the input space. Table 2-1 shows common kernel functions. By changing the kernel, we can also get different nonlinear classifiers, but no algorithm change is required. From a machine trained with an appropriately defined kernel, we can construct a decision function:

$$g(x) = \sum_{i\in SV}\alpha_i^* y_i K(x_i, x) + b^*. \qquad (2.17)$$

Table 2-1: Common Kernel Functions

  Kernel       K(x, y)
  Linear       $\langle x \cdot y\rangle$
  Polynomial   $(1 + \langle x \cdot y\rangle)^d$
  RBF          $\exp(-\gamma\|x - y\|^2)$

2.6 Least Squares SVM

The least squares SVM (LS-SVM) has been developed by Gestel et al. [41]. They modified the traditional L1-norm SVM to formulate it as a regression problem targeting $\pm 1$. In the following section, we briefly introduce the LS-SVM formulation, and in the next section we show the comparison with the L1-norm SVM.

2.6.1 LS-SVM Formulation

An error variable, or slack variable $e_i$, is added to each constraint, and its square is added to the objective function as a weighted penalty:

$$\min_{w,\,b,\,e}\ \frac{\|w\|^2}{2} + \gamma\,\frac{1}{2}\sum_{i=1}^{l} e_i^2 \quad \text{s.t.}\quad y_i(\langle w \cdot x_i\rangle + b) = 1 - e_i,\quad i = 1,\dots,l, \qquad (2.18)$$

where $\gamma$ is a constant that balances the data fit and the complexity of the resulting classifier function. Note the equality constraint. Because of this equality constraint, the solution of the optimization problem (2.18) can be obtained from a system of linear equations. From (2.18), we construct the Lagrangian function

$$L(w, b, e, \alpha) = \frac{1}{2}\langle w \cdot w\rangle + \gamma\,\frac{1}{2}\sum_{i=1}^{l} e_i^2 - \sum_{i=1}^{l}\alpha_i\,[\,y_i(\langle w \cdot x_i\rangle + b) - 1 + e_i\,], \qquad (2.19)$$

where the $\alpha_i$ are the Lagrange multipliers. Note that the Lagrange multipliers $\alpha_i$ can be either positive or negative because of the equality constraints in (2.18). We can write down the optimality conditions as follows:

$$\frac{\partial L(w, b, e, \alpha)}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{l} y_i\alpha_i x_i, \qquad \frac{\partial L(w, b, e, \alpha)}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{l} y_i\alpha_i = 0,$$

$$\frac{\partial L(w, b, e, \alpha)}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma e_i,\quad i = 1,\dots,l, \qquad \frac{\partial L(w, b, e, \alpha)}{\partial \alpha_i} = 0 \;\Rightarrow\; y_i(\langle w \cdot x_i\rangle + b) - 1 + e_i = 0,\quad i = 1,\dots,l,$$

or equivalently,

$$\begin{bmatrix} I & 0 & 0 & -Z^T \\ 0 & 0 & 0 & -Y^T \\ 0 & 0 & \gamma I & -I \\ Z & Y & I & 0 \end{bmatrix} \begin{bmatrix} w \\ b \\ e \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vec{1} \end{bmatrix},$$

where $Z = [x_1^T y_1; \dots; x_l^T y_l]$, $Y = [y_1; \dots; y_l]$, $\vec{1} = [1; \dots; 1]$, $e = [e_1; \dots; e_l]$, and $\alpha = [\alpha_1; \dots; \alpha_l]$. After eliminating $w$ and $e$, we have the following:

$$\begin{bmatrix} 0 & Y^T \\ Y & ZZ^T + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}.$$

Hence the classifier is found from the solution of a system of linear equations instead of quadratic programming.
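Since the LS-SVM classifier comes from this linear system rather than a QP, it can be obtained with a single call to a dense linear solver. The sketch below assembles the reduced system in $(b, \alpha)$ for a linear kernel, where $ZZ^T$ has entries $y_i y_j\langle x_i \cdot x_j\rangle$, and solves it with numpy. It is a minimal sketch under assumed toy data and an assumed value of gamma, not the LS-SVM implementation of [41].

import numpy as np

def lssvm_train(X, y, gamma=1.0):
    """Solve [[0, y'], [y, ZZ' + I/gamma]] [b; alpha] = [0; 1] for a linear kernel."""
    l = len(y)
    ZZt = (y[:, None] * X) @ (y[:, None] * X).T      # y_i y_j <x_i, x_j>
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = ZZt + np.eye(l) / gamma
    rhs = np.concatenate(([0.0], np.ones(l)))
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    w = ((alpha * y)[:, None] * X).sum(axis=0)       # w = sum_i y_i alpha_i x_i
    return w, b, alpha

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (5, 3)), rng.normal(+1, 1, (5, 3))])
y = np.array([-1.0] * 5 + [+1.0] * 5)
w, b, alpha = lssvm_train(X, y)
print(np.sign(X @ w + b))    # fitted training predictions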

2.6.2 Comparison with the L1-norm SVM

It is interesting to compare and contrast the LS-SVM with the L1-norm SVM. We start this discussion by recapping the idea of the L1-norm SVM. The L1-norm SVM is based on the structural risk minimization principle: the risk bound is minimized by minimizing both the data fit measure (the sum of the $\xi_i$) and the complexity of the resulting classifier function ($\|w\|^2/2$). The complexity of the function can be measured by the margin of the decision hyperplane, and the maximum margin decision hyperplane has a simpler complexity. The LS-SVM is introduced by modifying the L1-norm SVM to formulate the classification problem as a regression problem with binary targets. Both the L1-norm SVM and the LS-SVM have similar objective functions, except for the sum of nonnegative slack variables in the L1-norm SVM and the sum of squared slack variables in the LS-SVM. In the constraints, the L1-norm SVM has an inequality and a nonnegativity condition for the slack variables, while the LS-SVM has just equality constraints. A general way of solving the L1-norm SVM is to introduce a Lagrange multiplier for each constraint and convert the problem into its dual. In the dual problem, the objective function is a quadratic convex function (by the positive semi-definiteness of the kernel matrix) over a linear convex constraint set. This means the solution found is a global optimum, and solution-finding algorithms are well studied in the mathematical programming literature [9]. On the other hand, the LS-SVM solution can be found by solving a system of linear equations. The solution found here is also global, for the same reason as in the L1-norm SVM.

It is worth mentioning the meaning of $\alpha$, the Lagrange multiplier. In the L1-norm SVM, $\alpha_i$ measures the importance of a training example $x_i$ during the training process. The $\alpha_i$ of examples close to the decision hyperplane have non-zero values, and the corresponding examples are called support vectors (SVs). Only SVs are involved in constructing the decision hyperplane, and usually $|SV|/l$, where $l$ is the total number of examples, is small. This achieves a sparse representation of the dataset and of the decision hyperplane.

On the other hand, in the LS-SVM, since the problem formulation is a regression formulation with binary targets, almost all the examples have some adjustment $e_i$ to the target ($\pm 1$), and these $e_i$ are non-zero. Since $\alpha_i = \gamma e_i$, many of the $\alpha_i$ have non-zero values. Therefore, no sparseness is achieved in the LS-SVM. Moreover, examples both close to and far from the decision plane can get large $\alpha_i$, so $\alpha_i$ does not have the same meaning as in the L1-norm SVM.

CHAPTER 3
MODEL SELECTION IN SVM

3.1 Introduction

Model selection becomes more important when we consider SVM as a classification tool for a real-world problem, because parameters chosen with care give much better generalization performance. Despite the successful application of SVMs in various domains, actually using an SVM as a classification tool requires a non-trivial tuning of the SVM parameters. The parameters include the tradeoff parameter $C$ in the linear classifier case, and $C$ together with a kernel function parameter otherwise. For example, when the kernel function is a radial basis function, the kernel parameter is $\gamma$. In general, cross-validation is used to find good estimates of the SVM parameters, but it is computationally expensive. However, recent model selection research has made an important step by allowing the user to obtain reasonably good estimates for the parameters without the use of additional validation data. In this chapter, we survey some of the work in this spirit. The basic idea is to find a generalization error bound and try to minimize this bound over some combination of parameters.

3.2 Generalization Error Bound

The purpose of the generalization error bound is two-fold: 1) to estimate the true error of a decision hyperplane from a given training dataset; and 2) to tune the appropriate hyperparameters for given data. The hyperparameters are, for example, $C$, the tradeoff parameter between the training error and the margin, and $\gamma$ in the radial basis function kernel.

3.3 Cross-Validation

Cross-validation (CV) is performed as follows [10].

One divides the training set randomly into $k$ disjoint subsets. One then trains the classifier using only $k-1$ subsets and tests its performance by calculating the accuracy on the left-out subset. This process is repeated $k$ times; each time, a different subset is left out. The average of the $k$ accuracies is the estimate of the generalization accuracy. One can extend this idea to an extreme case by taking $k$ to be the number of examples in the training data; the process is then called the Leave-One-Out (LOO) method. The disadvantage of CV is that it requires $k$ repeated trainings and testings. This is quite expensive, especially for LOO. The advantage is that it gives a very accurate estimate of the generalization bound [30] and the process is easy to perform, which is why it is popular. The basic idea behind error estimation is that the test set must be independent of the training set, and the partition of the data into these two subsets should be unbiased. The respective data sizes should be adequate, and the estimated error rate refers to the test error rather than the training error rate [38].

3.4 $\xi\alpha$ Bound

Joachims establishes the following LOO error bound, which is called the $\xi\alpha$ estimator of the LOO error rate [53]. The LOO error is upper bounded by

$$|\{\,i : 2\alpha_i R^2 + \xi_i \ge 1\,\}|\,/\,l,$$

where $\alpha_i$ is the Lagrange multiplier of example $i$ in the dual problem, $\xi_i$ is the slack variable of example $i$ in the primal problem, $R^2$ is an upper bound on $K(x, x)$, and $K(x, x') \ge 0$. The idea is the relationship between the training examples for which the inequality $2\alpha_i R^2 + \xi_i \ge 1$ holds and those training examples that can produce an error in LOO testing. Specifically, if an example $(x_i, y_i)$ is classified incorrectly by LOO testing (that is, by an SVM trained on the examples without $(x_i, y_i)$), then that example must satisfy the inequality $2\alpha_i R^2 + \xi_i \ge 1$ for an SVM trained on all the examples. By this, the cardinality of $\{i : 2\alpha_i R^2 + \xi_i \ge 1\}$ is an upper bound on the number of LOO errors. Here, $\xi_i$ can be computed from the relation $y_i(\sum_j\alpha_j K(x_i, x_j) + b) \ge 1 - \xi_i$.
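The $\xi\alpha$ estimate is cheap to evaluate once a machine has been trained, because every quantity in it comes from that single training run. The following sketch computes it for an RBF-kernel SVM fitted with scikit-learn; the use of scikit-learn, the toy data, and the parameter values are assumptions made purely for illustration (for the RBF kernel, $K(x, x) = 1$, so $R^2 = 1$).

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 5)), rng.normal(+1, 1, (20, 5))])
y = np.array([-1] * 20 + [+1] * 20)

C, gamma = 1.0, 0.2
clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)

# Full alpha vector: SVC stores y_i * alpha_i for support vectors only.
alpha = np.zeros(len(y))
alpha[clf.support_] = np.abs(clf.dual_coef_.ravel())

# Slacks from the primal view: xi_i = max(0, 1 - y_i g(x_i)).
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

R2 = 1.0                                   # upper bound on K(x, x) for RBF
xi_alpha_bound = np.mean(2.0 * alpha * R2 + xi >= 1.0)
print("xi-alpha LOO error estimate:", xi_alpha_bound)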

To find a good choice of $C$ (or kernel parameter), one first chooses a $C$ and then trains the SVM to obtain the quantities needed in the formula. From these quantities, we can compute the LOO error estimate. Then $C$ (or the kernel parameter) is increased or decreased in the direction that improves the LOO error. The advantage of this approach is that a good choice of parameters is obtained without an additional validation set.

3.5 GACV Bound

Wahba et al. [92] used Generalized Approximate Cross-Validation (GACV) to computably approximate the Generalized Comparative Kullback-Liebler Distance (GCKL) for SVM. The GCKL for SVM is as follows:

$$GCKL(\lambda) = \frac{1}{l}\sum_{i=1}^{l}\{\,p_i(1 - f_i)_+ + (1 - p_i)(1 + f_i)_+\,\},$$

where $f(x) = w\cdot x + b$ is the decision function, $f_i = f(x_i)$, $p_i = p(y_i\,|\,x_i)$, and $(f)_+ = f$ if $f > 0$ and 0 otherwise. $\lambda$ denotes all the tunable parameters, and these parameters can be expressed inside the kernel function [21, 30]. The GACV is then defined as

$$GACV(\lambda) = \frac{1}{l}\Big\{\sum_{i=1}^{l}\xi_i + 2\sum_{y_i f_i < -1}\alpha_i K_{ii} + \sum_{y_i f_i \in [-1,1]}\alpha_i K_{ii}\Big\}.$$

3.6 Other Methods

Vapnik introduced the VC bound [90], and Vapnik et al. [91] introduced a new concept called the "span of support vectors." A computationally feasible span bound is called the "approximate span bound," and the so-called "radius-margin bound" concerns the L1-norm SVM formulation. All of them are used for estimating the generalization error and for parameter tuning.

CHAPTER 4
RELATED WORKS

4.1 SVM Gradient Method

Training support vector machines provides optimal values for the Lagrange multipliers $\alpha_i$. From the $\alpha_i$'s, one constructs the decision hyperplane $g(x)$,

$$g(x) = \sum_{i\in SV}\alpha_i y_i K(x_i, x) + b.$$

Researchers have proposed a feature selection technique for SVMs based on the gradient [49]. To rank the features of a given $x$ according to their importance to the classification decision, compute the angles between $\nabla g(x)$ and the unit vectors $e_j$, $j = 1,\dots,n$, representing the indices of each feature. If the $j$th feature is not important at $x$, $\nabla g(x)$ is almost orthogonal to $e_j$. One does these computations for all the support vectors (SVs); that is, one computes the angles for all the SVs, computes their averages, and sorts them in descending order. This gives a feature ranking for all the features. The SVM-Gradient algorithm is given below for a linear kernel and for general kernels separately.

4.1.1 Algorithm: Linear Kernel

(a) Train the SVM using all available data components to get $g(x)$.

(b) Compute the gradient $\nabla g(x)$ (or weights $w$), for all $x \in SV$:

$$w = \nabla g(x) = \sum_{i\in SV}\alpha_i y_i x_i.$$

Computation of $\nabla g(x)$ can be done easily, since $g(x) = \sum_{i\in SV}\alpha_i y_i K(x_i, x) = \sum_{i\in SV}\alpha_i y_i\langle x_i \cdot x\rangle$ and $\nabla_x K(x_i, x) = x_i$.

Table 4-1: Common Kernel Functions and Their Derivatives

  Kernel       K(x, y)                              $\nabla_y K(x, y)$
  Linear       $x^T y$                              $x$
  Polynomial   $(1 + x^T y)^d$                      $d\,(1 + x^T y)^{d-1}\,x$
  RBF          $\exp(-\gamma\|x - y\|^2)$           $2\gamma\,(x - y)\exp(-\gamma\|x - y\|^2)$

(c) Sort the $|w_j|$ in descending order. This gives the feature ranking.

4.1.2 Algorithm: General Kernels

(a) Train the SVM using all available data components to get $g(x)$.

(b) Compute the gradient $\nabla g(x)$ for all $x \in SV$:

$$\nabla g(x) = \sum_{i\in SV}\alpha_i y_i\nabla_x K(x_i, x).$$

Computation of $\nabla g(x)$ can again be done easily, since $g(x) = \sum_{i\in SV}\alpha_i y_i K(x_i, x)$ and only the term $K(x_i, x)$ involves the variable $x$; therefore, take the derivative of the kernel function $K$ only. Table 4-1 summarizes common kernel functions and their derivatives.

(c) Compute the sum of angles between $\nabla g(x)$ and $e_j$, for $j = 1,\dots,|s|$:

$$r_j = \sum_{x\in SV}\angle(\nabla g(x), e_j),$$

where

$$\angle(\nabla g(x), e_j) = \min_{\sigma\in\{0,1\}}\Big(\sigma\pi + (-1)^{\sigma}\arccos\frac{\langle\nabla g(x)\cdot e_j\rangle}{\|\nabla g(x)\|}\Big).$$

(d) Compute the averages of the sums of the angles:

$$c_j = 1 - \frac{2\,r_j}{\pi\,|SV|}.$$

(e) Sort the $c_j$ in descending order. This gives the feature ranking.

The authors also proposed that, when the number of SVs is small compared to the number of training instances, one should include all the points within the $\epsilon$-region around the borders, that is, $|g(x_i)| \le 1 + \epsilon$.

4.2 SVM-RFE

The SVM-RFE [44] is an application of recursive feature elimination based on sensitivity analysis for an appropriately defined cost function.

4.2.1 Algorithm: Linear Kernel

In the linear kernel case, define the cost function $J = (1/2)\|w\|^2$. The least sensitive feature, the one with the minimum magnitude of weight, is eliminated first. This eliminated feature receives ranking $n$. The machine is retrained without the eliminated feature, and the feature with the minimum magnitude of weight is removed next; this eliminated feature receives ranking $n-1$. By doing this process repeatedly until no feature is left, we can rank all the features. The algorithm is as follows.

Given training instances $X_{all} = [x_1,\dots,x_l]'$ and class labels $y = [y_1,\dots,y_l]'$, initialize the subset of features $s = [1, 2,\dots,n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $w = \nabla g(x)$:

$$w = \sum_{i\in SV}\alpha_i y_i x_i.$$

(d) Find the feature f with the smallest $|w_j|$, $j = 1,\dots,|s|$:

$$\mathrm{f} = \arg\min_j(|w_j|).$$

(e) Update $r$ and eliminate the feature from $s$:

$$r = [s(\mathrm{f}),\ r], \qquad s = s - \{s(\mathrm{f})\}.$$

The last eliminated feature is the most important one.

4.2.2 Algorithm: General Kernels

Define a cost function as follows:

$$J = (1/2)\,\alpha^T H\alpha - \alpha^T e,$$

where $H_{hk} = y_h y_k K(x_h, x_k)$, $K$ is a kernel function, $\alpha$ is the vector of Lagrange multipliers, and $e$ is an $l$-dimensional vector of ones ($l$ is the number of training instances). To compute the change in $J$ caused by removing feature $i$, one would have to retrain a classifier for every candidate feature to be eliminated. This difficulty is avoided by assuming no change in $\alpha$. Under this assumption, one recomputes $H$:

$$H(-i)_{hk} = y_h y_k K(x_h(-i), x_k(-i)),$$

where $(-i)$ means that component $i$ has been removed. The sensitivity function is then defined as follows:

$$DJ(i) = J - J(-i) = (1/2)\,\alpha^T H\alpha - (1/2)\,\alpha^T H(-i)\,\alpha.$$

Given training instances $X_{all} = [x_1,\dots,x_l]'$ and class labels $y = [y_1,\dots,y_l]'$, initialize the subset of features $s = [1, 2,\dots,n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $\alpha = \mathrm{SVM}_{train}$.

(c) Compute the ranking criterion for all $i$:

$$DJ(i) = (1/2)\,\alpha^T H\alpha - (1/2)\,\alpha^T H(-i)\,\alpha.$$

(d) Find the feature $k$ such that

$$k = \arg\min_i DJ(i).$$

(e) Eliminate the feature $k$:

$$r = [s(k),\ r], \qquad s = s - \{s(k)\}.$$

In the linear kernel case, $K(x_h, x_k) = \langle x_h \cdot x_k\rangle$ and $\alpha^T H\alpha = \|w\|^2$. Therefore

$$DJ(i) = (1/2)(w_i)^2.$$

This matches the feature selection criterion in the linear kernel case.

4.3 SVM Gradient-RFE

The SVM Gradient-RFE combines two existing feature selection methods: SVM-RFE and SVM Gradient. For any algorithm, two factors are of concern: prediction accuracy and computing time. The new method takes the merits of the two existing methods, so it is competitive with SVM-RFE in terms of prediction accuracy while maintaining speedy computation.

As in [49], this method uses the gradient as the feature selection criterion, but in order to give a ranking for all the features, the machine is trained using all the features, the feature selection criterion is computed, and the feature with the minimum angle is eliminated. The ranking of this eliminated feature then becomes $n$. The machine is then trained without the eliminated feature, and the feature with the minimum selection criterion is eliminated; this eliminated feature becomes ranking $n-1$. By recursively eliminating all the features, one ranks all the features. The following sections describe the algorithm and computational details.

4.3.1 Linear Kernel

Given training instances $X_{all} = [x_1,\dots,x_l]'$ and class labels $y = [y_1,\dots,y_l]'$, initialize the subset of features $s = [1, 2,\dots,n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $w = \nabla g(x)$:

$$w = \sum_{i\in SV}\alpha_i y_i x_i.$$

(d) Find the feature f with the smallest $|w_j|$, $j = 1,\dots,|s|$:

$$\mathrm{f} = \arg\min_j(|w_j|).$$

(e) Update $r$ and eliminate the feature from $s$:

$$r = [s(\mathrm{f}),\ r], \qquad s = s - \{s(\mathrm{f})\}.$$

For the linear kernel, the decision hyperplane is linear, and hence its gradient at any support vector is a constant vector (the normal vector). The normal vector of $g(x)$ is the feature selection criterion.

4.3.2 General Kernels

Given training instances $X_{all} = [x_1,\dots,x_l]'$ and class labels $y = [y_1,\dots,y_l]'$, initialize the subset of features $s = [1, 2,\dots,n]$ and $r =$ an empty array. Repeat (a)-(g) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $\nabla g(x)$, $\forall x\in SV$:

$$\nabla g(x) = \sum_{i\in SV}\alpha_i y_i\nabla_x K(x_i, x).$$

(d) Compute the sum of angles between $\nabla g(x)$ and $e_j$, for $j = 1,\dots,|s|$:

$$r_j = \sum_{x\in SV}\angle(\nabla g(x), e_j),$$

where

$$\angle(\nabla g(x), e_j) = \min_{\sigma\in\{0,1\}}\Big(\sigma\pi + (-1)^{\sigma}\arccos\frac{\langle\nabla g(x)\cdot e_j\rangle}{\|\nabla g(x)\|}\Big).$$

(e) Compute the averages of the sums of the angles:

$$c_j = 1 - \frac{2\,r_j}{\pi\,|SV|}.$$

(f) Find the feature f with the smallest $c_j$, $j = 1,\dots,|s|$:

$$\mathrm{f} = \arg\min_j c_j.$$

(g) Update $r$ and eliminate the feature from $s$:

$$r = [s(\mathrm{f}),\ r], \qquad s = s - \{s(\mathrm{f})\}.$$

As we saw in Section 4.1, the $\nabla g(x)$ computation can be done easily.

4.4 SVM Projection-RFE

In the SVM Projection-RFE, the magnitude of the distance between support vectors and their projections is the feature selection criterion. This criterion is combined with RFE to give the SVM Projection-RFE.

4.4.1 Idea

To motivate this method, we first identify the characteristics of features that are important to the SVM classifier.

Consider the point $A = (x_1, x_2)$ in Figure 4-1.

[Figure 4-1: Motivation for the SVM Projection-RFE]

When point $A$ is projected onto $g(x) = 0$, let $P = (p_1, p_2)$ be the projected point; then

$$|A - P| = |(x_1, x_2) - (p_1, p_2)| = |(\Delta x_1, \Delta x_2)|.$$

Note that the component of $\Delta x$ with larger magnitude is more influential to the decision plane. In this example, $|\Delta x_1| > |\Delta x_2|$, and hence $x_1$ is more influential to the decision than $x_2$. This is true since the decision hyperplane is almost parallel to the $x_2$ axis, so that whatever the $x_2$ value is, it contributes little to the decision plane $g(x)$. We now have to answer how to efficiently compute $|A - P|$, or the $\Delta x_i$, $i = 1,\dots,n$. The idea is that the normalized distance between $P$ and $A$ is 1, since $A$ is a support vector and the normalized distance between SVs and the hyperplane is 1 by the SVM property. We can make use of this property to compute $|P - A|$ efficiently. Before we introduce the algorithm, we state a proposition [25] saying that the normalized distance (margin) between the hyperplane and the SVs is 1.

Proposition 4.4.1. Consider a linearly separable training sample

$$S = ((x_1, y_1),\dots,(x_l, y_l)),$$

and suppose the parameters $\alpha^*$ and $b^*$ solve the following optimization problem:

$$\text{maximize}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle x_i \cdot x_j\rangle \qquad (4.1)$$

$$\text{s.t.}\ \sum_{i=1}^{l} y_i\alpha_i = 0,\quad \alpha_i \ge 0,\ i = 1,\dots,l. \qquad (4.2)$$

Then $w = \sum_{i=1}^{l} y_i\alpha_i^* x_i$ realizes the maximal margin hyperplane with geometric margin $r = \frac{1}{\|w\|}$.

[Figure 4-2: Linear Kernel Case for Projection-RFE]

4.4.2 Linear Kernel

Using the linear kernel means that we have a linear decision hyperplane $g(x) = \langle w \cdot x\rangle + b$. When $A$ is projected onto $g(x) = 0$, the projected point $P$ is expressed in terms of $A$ and $w$, the weight of $g(x)$; that is, point $P$ is the sum of point $A$ and some constant times the weight $w$. Figure 4-2 shows this:

$$P = A + t\,w,$$

where $t$ is a constant. By Proposition 4.4.1,

$$\|t\,w\| = \frac{1}{\|w\|},$$

and by solving this in terms of $t$, we have

$$t = \frac{1}{\|w\|^2}.$$

Hence,

$$|P - A| = \frac{|w|}{\|w\|^2}.$$

Let $p(x_i, g(x))$ denote the projection of $x_i$ onto $g(x)$. Since $w$ is $\nabla g(x_i)$, we have

$$|p(x_i, g(x)) - x_i| = \frac{|w|}{\|w\|^2} = \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}, \quad \forall i\in SV.$$

Since $|p(x_i, g(x)) - x_i|$ is a constant vector (the normal vector) for all $i\in SV$, we only need to compute

$$\frac{|w|}{\|w\|^2}.$$

The following Lemma 4.4.2 summarizes what we have done above.

Lemma 4.4.2. Consider a linearly separable training sample

$$S = ((x_1, y_1),\dots,(x_l, y_l)),$$

and suppose the parameters $\alpha^*$ and $b^*$ solve the following optimization problem:

$$\text{maximize}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l} y_i y_j\alpha_i\alpha_j\langle x_i \cdot x_j\rangle \qquad (4.3)$$

$$\text{s.t.}\ \sum_{i=1}^{l} y_i\alpha_i = 0,\quad \alpha_i \ge 0,\ i = 1,\dots,l. \qquad (4.4)$$

Then $w = \sum_{i=1}^{l} y_i\alpha_i^* x_i$. For any support vector $x_i$,

$$|p(x_i, g(x)) - x_i| = \frac{|w|}{\|w\|^2}.$$

Proof: We can write $p(x_i, g(x))$ in terms of $x_i$ and the gradient vector of the decision hyperplane $g(x)$ at $x_i$ as follows:

$$p(x_i, g(x)) = x_i + t\,w,$$

where $t$ is some constant. By Proposition 4.4.1,

$$\|t\,w\| = \frac{1}{\|w\|}.$$

Solving the above in terms of $t$ gives

$$t = \frac{1}{\|w\|^2}.$$

Hence,

$$|p(x_i, g(x)) - x_i| = \frac{|w|}{\|w\|^2}.$$

4.4.3 General Kernels

Using a nonlinear kernel, we have a nonlinear decision hyperplane in the input space. Consider Figure 4-3.

[Figure 4-3: Nonlinear Kernel Case for Projection-RFE]

Let $g(x)$ be the hyperplane and $P$ be the projected point of $A$. Observe that $\nabla g(A)$ is normal to $g(x)$ at the point $P$. $\nabla g(A)$ can be computed easily, as in the SVM Gradient method, since

$$g(x) = \sum_{i\in SV}\alpha_i y_i K(x_i, x) \quad\text{and}\quad \nabla g(x) = \sum_{i\in SV}\alpha_i y_i\nabla_x K(x_i, x).$$

Hence, the projected point $P$ is expressed in terms of $A$ and $\nabla g(A)$, the normal to $g(x)$ at $P$; that is, the point $P$ is the sum of point $A$ and some constant $t$ times the normal vector $\nabla g(A)$. Figure 4-3 shows this:

$$P = A + t\,\nabla g(A),$$

where $t$ is a constant. But here we do not calculate $|P - A|$ exactly, since its computation is complicated and the exact calculation is not needed. Since $|P - A|$ is a constant times $\nabla g(A)$, $|P - A|$ is proportional to $\frac{|\nabla g(A)|}{\|\nabla g(A)\|^2}$, that is,

$$|P - A| \propto \frac{|\nabla g(A)|}{\|\nabla g(A)\|^2}.$$

For all $i\in SV$,

$$|p(x_i, g(x)) - x_i| \propto \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}.$$

Since $|p(x_i, g(x)) - x_i|$ is not a constant vector, unlike the linear case, we need to calculate

$$p_i = c\,|p(x_i, g(x)) - x_i| = \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}, \quad \forall i\in SV,$$

where $c$ is a constant. Now we sum the $p_i$, $\forall i\in SV$, component-wise. Let $d$ denote the component-wise summation of the $p_i$, $\forall i\in SV$. The feature corresponding to the largest magnitude component of $d$ will then be the most important feature for the classification by $g(x)$. This feature selection criterion is now combined with recursive feature elimination.

4.4.4 SVM Projection-RFE Algorithm

We now describe how to rank all the features. Initially, the machine is trained with all the features and all the training data. Then the feature selection criterion, that is, $d$, is computed. The feature with the minimum magnitude component is eliminated; this eliminated feature becomes ranking $n$. Now the machine is trained without the eliminated feature, $d$ is recomputed, and the feature with the minimum selection criterion is eliminated; this feature becomes ranking $n-1$. Doing this process recursively, we can give a ranking for all the features. The SVM Projection-RFE algorithm and computational details are as follows.

Linear kernel. Given training instances $X_{all} = [x_1,\dots,x_l]'$ and class labels $y = [y_1,\dots,y_l]'$, initialize the subset of features $s = [1, 2,\dots,n]$ and $r =$ an empty array. Repeat (a)-(e) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $w = \nabla g(x)$:

$$w = \sum_{i\in SV}\alpha_i y_i x_i.$$

(d) Find the feature f with the smallest $|w_j|$, $j = 1,\dots,|s|$:

$$\mathrm{f} = \arg\min_j(|w_j|).$$

(e) Update $r$ and eliminate the feature from $s$:

$$r = [s(\mathrm{f}),\ r], \qquad s = s - \{s(\mathrm{f})\}.$$

For the linear kernel case, $\nabla g(x_i)$ is a constant vector, namely the normal vector of $g(x)$, for any $i\in SV$. Only a one-time computation of the normal vector of $g(x)$ is therefore needed to compute the feature selection criterion for each feature elimination. This algorithm is therefore exactly the same as the SVM-RFE and SVM Gradient-RFE.

General kernels. Given training instances $X_{all} = [x_1,\dots,x_l]'$ and class labels $y = [y_1,\dots,y_l]'$, initialize the subset of features $s = [1, 2,\dots,n]$ and $r =$ an empty array. Repeat (a)-(g) until $s$ becomes an empty array.

(a) Construct new training instances $X = X_{all}(:, s)$.

(b) Train SVM($X$, $y$) to get $g(x)$.

(c) Compute the gradient $\nabla g(x)$, $\forall x\in SV$:

$$\nabla g(x) = \sum_{i\in SV}\alpha_i y_i\nabla_x K(x_i, x).$$

(d) Compute $p_i = c\,|p(x_i, g(x)) - x_i|$:

$$p_i = \frac{|\nabla g(x_i)|}{\|\nabla g(x_i)\|^2}, \quad \forall i\in SV.$$

(e) Compute the sum of the $p_i$, $\forall i\in SV$:

$$d = \sum_{i\in SV} p_i.$$

(f) Find the feature f with the smallest $d_j$, $j = 1,\dots,|s|$:

$$\mathrm{f} = \arg\min_j(|d_j|).$$

(g) Update $r$ and eliminate the feature from $s$:

$$r = [s(\mathrm{f}),\ r], \qquad s = s - \{s(\mathrm{f})\}.$$

Again, the $\nabla g(x)$ computation can be done easily, as in Section 4.1.
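The sketch below computes the projection-based criterion $d$ of steps (c)-(e) for one elimination round with an RBF kernel: it evaluates $\nabla g(x_i)$ at every support vector, forms $p_i = |\nabla g(x_i)|/\|\nabla g(x_i)\|^2$, and sums the $p_i$ component-wise. The scikit-learn fit and the toy data are assumptions; inside a full Projection-RFE run, this computation would be repeated after each feature is removed.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (25, 5)), rng.normal(+1, 1, (25, 5))])
y = np.array([-1] * 25 + [+1] * 25)

gamma = 0.3
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
sv = X[clf.support_]
coef = clf.dual_coef_.ravel()                     # alpha_i * y_i for each SV

def grad_g(x):
    diff = sv - x                                 # RBF derivative from Table 4-1
    k = np.exp(-gamma * np.sum(diff ** 2, axis=1))
    return (coef[:, None] * 2.0 * gamma * diff * k[:, None]).sum(axis=0)

d = np.zeros(X.shape[1])
for x in sv:
    gr = grad_g(x)
    d += np.abs(gr) / (np.linalg.norm(gr) ** 2)   # p_i = |grad g| / ||grad g||^2

worst = int(np.argmin(np.abs(d)))                 # step (f): feature to eliminate
print("criterion d:", np.round(d, 3), "-> eliminate feature", worst + 1)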

4.5 The M-fold SVM

Fu et al. proposed a new feature selection algorithm, M-fold-SVM [39]. They particularly used M = 10 in their implementation. Among the characteristics of M-fold-SVM are that it improves the reliability of the selected features and that it uses an automatic threshold for the number of features to be selected. They have used it for gene selection in microarray datasets and keyword selection in text datasets.

4.5.1 Binary-class M-fold-SVM

The binary-class M-fold-SVM makes use of the SVM-RFE algorithm. First, it randomly divides the training dataset into 10 disjoint subsets. The $i$th subset is left out for testing the performance of the features selected and trained on the remaining nine subsets; the nine subsets are used to train and select features. This process repeats 10 times. Then all the features obtained from the 10 iterations are collected.

Binary-class M-fold-SVM Algorithm

Let $X_{all}$ be the given training examples, where $X_{all} = [x_1,\dots,x_l]'$.

input: $X_{all}$, $Y$: input samples and labels.
output: a subset of features.

Randomly divide $X_{all}$ into 10 disjoint subsets.
for $i = 1$ to 10 do
  $X_{tst}$ = $i$th subset; $X_{trn} = X_{all} - X_{tst}$; $s = [1,\dots,dim]$, $r$ = an empty array.
  while ($s$ is not empty) do
    Construct a new training set $X = X_{trn}(:, s)$.
    Train SVM($X$) to get $g(x)$.
    Compute the gradient $w = \nabla g(x) = \sum_{i\in SV}\alpha_i y_i x_i$.
    Find the feature f with the smallest $|w_j|$, $j = 1,\dots,|s|$: $\mathrm{f} = \arg\min_j(|w_j|)$.
    Update $r$ and eliminate the feature from $s$: $r = [s(\mathrm{f}),\ r]$, $s = s - \{s(\mathrm{f})\}$.

  end while
  Select topGenes from $r$ that give the best accuracy on the training examples $X_{trn}$ using SVM($X_{trn}$).
  Compute CV accuracy using SVM($X_{tst}$, topGenes).
end for
Collect all the features selected. Sort them by repeatability.

In practice, some modifications of the above algorithm are useful. In the while loop, instead of removing one feature at a time, chunks of features are eliminated at a time; especially at the early stages of feature elimination, more features are removed. One variation is to remove features so that the number of remaining features becomes the power of 2 closest to the original number of features; in each later elimination step, half of the remaining features are eliminated, and this step is repeated until one feature is left. Another variation is the same as before, but when the number of remaining features reaches 256, features start to be eliminated one at a time. This variation is useful since it does not take much time to eliminate one feature at a time from then on. A large number of features are eliminated during the first several elimination steps; the important features should supposedly be among the 256 features, and they are eliminated one by one and hence ranked more correctly.

4.5.2 Multi-class M-fold-SVM

When the instances are multi-class, we can assume that the class labels are $1, 2,\dots,k$. Basically, we run the binary-class M-fold-SVM $k$ times. First, class 1 instances are labeled as +1 and the rest are labeled as -1; then the binary-class M-fold-SVM is run to select the features with the same training instances and the new labels. Next, class 2 instances are labeled as +1 and the rest are labeled as -1; then the binary-class M-fold-SVM is run with the same training instances and the new labels. The algorithm repeats this procedure $k$ times. Finally, all the features selected in each binary-class M-fold-SVM run are collected.

Multi-class M-fold-SVM Algorithm

input: $X$, $Y$: input samples and labels.
output: a subset of features.

for $i = 1$ to $k$ do
  $Y_{new}(j) = 1$ if the $j$th instance's label is $i$;
  $Y_{new}(j) = -1$ if the $j$th instance's label is not $i$.
  Run the binary-class M-fold-SVM($X$, $Y_{new}$).
end for
Collect the features selected in each binary-class M-fold-SVM run.

4.6 Linear Programming Methods

Linear programming (LP) is a mathematical formulation for solving an optimization problem whose constraints and objective function are linear [4, 12, 13, 69]. LP formulations have been suggested to solve the feature selection problem [12, 13]. Two point sets $A$ and $B$ in $R^n$ are represented by the matrices $A \in R^{m\times n}$ and $B \in R^{k\times n}$. The problem is formulated as the following robust linear program (RLP) [13]:

$$\min_{w,\,\gamma,\,y,\,z}\ \frac{e^T y}{m} + \frac{e^T z}{k} \qquad (4.5)$$

$$\text{s.t.}\ -Aw + e\gamma + e \le y, \quad Bw - e\gamma + e \le z, \quad y \ge 0,\ z \ge 0. \qquad (4.6)$$

PAGE 52

38 theoriginalobjectiveby(1 ). min w ;r; y ; z ; v (1 )( e T y m + e T z k )+ e T v (4.7) s : t A w + e r + e y ; B w e r + e z ; y 0 ; z 0 ; v w v : (4.8) Becauseofthediscontinuityof e T v ,thistermisapproximatedbyaconcaveexponentialonthenonnegativerealline: v t ( v ; )= e v ;> 0 ThisleadstotheFeatureSelectionConcaveminimization(F SV): min w ;r; y ; z ; v (1 )( e T y m + e T z k )+ e T ( e v )(4.9) s : t A w + e r + e y ; B w e r + e z ; y 0 ; z 0 ; v w v : (4.10) 4.6.2 SVM kk p Formulation Bradleyetal.[12,13]havetriedseveralLPformulationswh ichvarydepending onwhatnormtomeasurethedistancebetweentwoboundingpla nes.Thedistance, measuredbysomenorm kk on R n ,is 2 k w k 0 .Addthereciprocalofthisterm, k w k 0 2 ,to


39 theobjectivefunctionofRLP. min w ;r; y ; z (1 )( e T y + e T z )+ 2 k w k 0 (4.11) s : t A w + e r + e y ; B w e r + e z ; y 0 ; z 0 : (4.12) Oneusesthe 1 -normtomeasurethedistancebetweentheplanes.Sincethe dualnormofthe 1 -normis1-norm,wecalltheLPformulationasSVM kk 1 .The SVM kk 1 formulationisasfollows: min w ;r; y ; z ; s (1 )( e T y + e T z )+ 2 e T s (4.13) s : t A w + e r + e y ; B w e r + e z ; s w s ; y 0 ; z 0 : (4.14) IntheSVM kk p formulation,theobjectivefunctionsattempttobalancebe tweenthe numberofmisclassiedinstancesandthenumberofnon-zero w 'swhileminimizing thesumofthesetwovalues. Similarly,ifoneusesthe1{normtomeasurethedistancebet weentheplanes, thenthedualtothenormisthe 1 {normandSVM kk 1 formulationasfollows: min w ;r; y ; z ; (1 )( e T y + e T z )+ 2 (4.15) s : t A w + e r + e y ; B w e r + e z ; e w e ; y 0 ; z 0 : (4.16)


The authors also attempted SVM‖·‖_2. They reported that only FSV and SVM‖·‖_∞ gave small subsets of features, and all other formulations ended up with no feature selection.

Solving the linear programming feature selection problem gives a subset of features with which the separating plane is linear in input space, whereas in the SVM recursive feature elimination methods the plane can be nonlinear in input space as well as linear. Also, one does not have a choice for the number of features one desires, since the solution of SVM‖·‖_p simply yields a subset of features. The SVM recursive feature elimination methods give a ranking for all the features, so one can choose the top k features if one wants a subset consisting of k features.

4.7 Correlation Coefficient Methods

Evaluating how well an individual feature contributes to the separation can produce a simple feature ranking. Golub et al. [42] used

    w_i = (μ_i(+) - μ_i(-)) / (σ_i(+) + σ_i(-)).

Pavlidis [74] used

    w_i = (μ_i(+) - μ_i(-))^2 / (σ_i(+)^2 + σ_i(-)^2).

The μ_i and σ_i above are the mean and standard deviation of feature i. A large positive value of w_i indicates a strong correlation with the (+) class, and a large negative value with the (-) class. Each coefficient w_i is computed with information about a single feature and does not consider mutual information between features. These methods implicitly make an orthogonality assumption [44]. What this means is the following: suppose one wants to find the two features which give the best classifier error rate among all combinations of two features. In this case, the correlation coefficient method finds two features which are individually good, but those two features may not be the best two features cooperatively. The SVM recursive feature elimination methods certainly do not have this problem.
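As an illustration of this kind of per-feature scoring, a minimal NumPy sketch of the Golub-style coefficient is shown below; the function name and the small guard constant are our own choices, not part of [42].

```python
import numpy as np

def golub_scores(X, y):
    """Per-feature scores w_i = (mu_i(+) - mu_i(-)) / (sigma_i(+) + sigma_i(-)).

    X: (n_samples, n_features) array, y: array of +1/-1 labels.
    """
    pos, neg = X[y == +1], X[y == -1]
    mu_p, mu_n = pos.mean(axis=0), neg.mean(axis=0)
    sd_p, sd_n = pos.std(axis=0), neg.std(axis=0)
    return (mu_p - mu_n) / (sd_p + sd_n + 1e-12)  # guard against zero spread

# ranking by |w_i|: features with the largest absolute score come first
# ranking = np.argsort(-np.abs(golub_scores(X, y)))
```

Because each score looks at one feature in isolation, re-running the scoring after removing features would not change the ranking, which is exactly the orthogonality limitation discussed above.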

41 4.8 Feature Scaling Methods Someauthorssuggestusingtheleave-one-out(LOO)boundsf orSVMsasfeature selectioncriteria[94].Themethodssearchallpossiblesu bsetsof n featureswhich minimizetheLOObound,where n isthetotalnumberoffeatures.Butthisminimum ndingrequiressolvingacombinatorialproblem,whichisv eryexpensivewhen n is large.Instead,itscaleseachfeaturebyarealvariableand computesthisscalingviaa gradientdescentontheLOObound.Onekeepsthefeatureswit hthelargestscaling variables.Theauthorsincorporatescalingfactorsintoth ekernel: K ( x ; y )= K (( x ) ; ( y )) where x isanelement-wisemultiplication.Thealgorithmis: 1.SolvethestandardSVMtond 'sforaxed 2.Optimize forxed 'sbyagradientdescent. 3.Removefeaturescorrespondingtosmallelementsin andreturntostep1. Thismethodmayendupwithalocalminimumdependingonthech oiceofthe gradientstepsize[44]. 4.9 Dimension Reduction Methods Inthissection,weexaminemethodsindimensionreduction. Tworepresentative methodsareprincipalcomponentanalysis(PCA)[57,2]andp rojectionpursuit(PP) [35,51,56,58].ThePCAaimstominimizetheerrorbydoingso metransformation toanothercoordinatesystem,andPPaimstondinteresting low-dimensionallinear projectionsofhigh-dimensionaldata. 4.9.1 Principal Component Analysis LetusstartbyintroducinghowtocalculatethePCs.Conside rasample x = [ x 1 ;x 2 ; ;x n ] 0 .Assumewithoutlossofgenerality E [ x i ]=0 ; 8 i =1 ; ;n .The covariancematrixof x is E ( xx 0 ).Therstprincipalcomponent(PC): Y (1) = 1 0 x 1 canbefoundbymaximizingthevarianceoftherstPC 1 0 x undertheconstraint


α_1'α_1 = 1. That is,

    max_{α_1}  V(α_1'x)   subject to   α_1'α_1 = 1.                          (4.17)

Since

    V(α_1'x) = E((α_1'x)(α_1'x)') = α_1' E(xx') α_1 = α_1' Σ α_1,

problem (4.17) can be rewritten as follows:

    max_{α_1}  α_1' Σ α_1   subject to   α_1'α_1 = 1.                         (4.18)

By using the Lagrangian function, problem (4.18) becomes

    max_{α_1}  α_1' Σ α_1 - λ(α_1'α_1 - 1),                                   (4.19)

where λ is a Lagrange multiplier. The stationary condition states

    Σα_1 - λα_1 = (Σ - λI)α_1 = 0.

Since α_1 ≠ 0, det(Σ - λI) = 0. Hence λ is an eigenvalue of Σ. Let α_1 be an eigenvector associated with λ. Then max α_1'Σα_1 = max α_1'λα_1 = max λα_1'α_1 = max λ. So λ is the largest eigenvalue of Σ, and α_1 is the eigenvector associated with it. By similar formulations, one can find all the PCs and can discard the variables with small variance so that one can achieve the dimension reduction. But this is different from feature selection. In feature selection, we want to find a small number of features from the original input space, whereas the dimension reduction is done by a transformation using the largest PCs. But to find the PCs, one has to make use of all the features.

So there cannot be a subset with a small number of features in the original input space.

4.9.2 Projection Pursuit

The PP aims to discover an interesting structure projection of a multivariate dataset, especially for the visualization of high-dimensional data. In basic PP, one tries to find the direction w so that the projection w'x has an interesting distribution. We do not go into computational details for PP, but just mention that, like PCA, PP cannot be a feature selection method for the same reason as PCA.
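To make the contrast between dimension reduction and feature selection concrete, the following small NumPy sketch (our own illustration) computes the leading principal component and shows that it is a combination of every original feature rather than a subset of them.

```python
import numpy as np

def first_pc(X):
    """Return the leading principal component direction of the data X."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric eigendecomposition, ascending order
    return eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

# Every entry of alpha1 is generally nonzero, so the projection alpha1 @ x
# still requires all original features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
alpha1 = first_pc(X)
print(np.count_nonzero(np.abs(alpha1) > 1e-12))  # typically prints 10
```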

CHAPTER5 SUPPORTVECTORFEATURESELECTION Supportvectorsarethesampleswhichareclosetothedecisi onboundaryinSVM. InSVM,onlythosesamplesareinvolvedinthedecisionhyper planecomputation. TheSVMproducesthesamedecisionhyperplanewithoutnon-s upportvectors.The supportvectorisnowoverloadedtomeanthosepointswhicha reclosetothedecision boundaryinanydiscriminantanalysis.Inthischapter,wes howhowtodofeature selectionwithonlysupportvectors.Whilesupportvectorf eatureselectioncanbe appliedtoanydiscriminantanalysis,inthisdissertation weconsideredtwoparticular lineardiscriminants,SVMandFisher'slineardiscriminan t. Wepresenttwoalgorithmsbasedonthesetwolineardiscrimi nants.Oneis termedSV-RFESVM,SupportVectorRecursiveFeatureElimin ationSVM.Asimilar namingcanbedoneonFLD.TheideaforSVfeatureranking/sel ectionistorst identifythesupportvectorsandtoapplytherankingscheme s,suchasrecursive featureeliminationorsimplenon-recursiveranking,onth esesupportvectorsand buildaclassierusingtheSVsandfeaturesfromtheranking 5.1 Datasets Wehaveusedseveraldatasets,obtainedfrompublicdomains .Theyarevaried intermsofthenumberofsamplesandthedimensionofthesamp le. Thecoloncancerdataconsistof62instances,eachinstance with2,000genes [1,21,44,86].Thedatasethas22normalcolontissueexampl esand40colontumor examples.ItisanalyzedwithanAymetrixoligonucleotide arraycomplementaryto morethan6,500humangenesandexpressedsequencetags.Ing eneral,thiskindof dataandanalysisiscalledamicroarraydataanalysis.Stat isticalmethods[3,32,52, 71]andSVM[16,40,68,98]havebeenextensivelyusedasthem icroarrayanalysis 44


45 Figure5{1:ImagesofDigits2and3. tool.Inthecoloncancerdataset,the2,000genesselectedh avethehighestminimal intensityacrosstheexamples.Thecoloncancerdataare62 2,000matrix.Each entryofthematrixisageneexpressionlevel.Sincethedata setisnotsetasidefor atrainingsetoratestingset,werandomlysplitthemintotw o31examples.Oneof thetwo31-examplesetsisusedfortrainingthemachine,and theother31-example setisusedforatestingset. Theleukemiadatasetconsistsofa72 7,129matrix[42].Thefeatureselection choosesthegenesfrom7,129genes.Sincethisdatasetisalr eadydividedintotraining andtestsetsbytheoriginalauthors,wefollowtheirsplit. Outof72examples,38of themwereusedastrainingexamplesandtherest(34examples )astesting. TheMNISTdatasetofhandwrittendigits( http://yann.lecun.com/exdb/mnist ) ispubliclyavailable.Thisdatasetcontains60,000traini ngsamples.Amongthem, digits2and3areextracted.Againamongthem,only300sampl esforeachareselectedinourexperimentstobecomeatrainingdataset.And1 50samplesforeach areselectedasatestdataset,thatis,600and300samplesas atrainingdataset andtestdataset,respectively.Eachsampleisa28x28image ,orequivalently,a784dimensionalrowvector.Eachpixelinanimagehasavaluein[ 0,255].Figure5{1 showssampleimagesofdigits2and3. ThesonardatasetwasusedbyGormanandSejnowskiintheirre searchonclassicationusingneuralnetwork[43].Theclassicationtas kistoassignanewsample asametalcylinder(-1)orarock(+1).Eachpatternisasetof 60numbersinthe range0.0to1.0.Thereare208samples,with60featuresinea chsample.Among


46 Table5{1:Training/TestDataSplitsFourDatasets Dimension #TrainingSamples #TestSamples Leukemia 7,129 38 34 ColonCancer 2,000 31 31 MNIST 784 600 300 Sonar 60 104 104 them,97belongto+1and111belongto-1.Inourexperiment,w erandomlysplit themevenlyintotrainingandtestdata.Also,dataareclass balanced,thatis,97+1 samplesarealmostevenlydividedintotrainingandtestdat a.Table5{1summarizes trainingandtestdatasplitsforfourdatasets. 5.2 Support Vector Feature Selection Using FLD Fisher'slineardiscriminanthaslongbeenusedasasimplel inearclassier[10, 31,48].Itsperformancewasoftenoutperformedbymanyothe rclassierssuchas SVM[44].Astudy,however,showstheequivalencebetweenli nearSVMandFisher's lineardiscriminantonthesupportvectorsfoundbySVM[81] .Thisresulthasbeenof theoreticalinterestbutnotutilizedordevelopedforprac ticalalgorithms.Recently Cooke[23]showedthatFisher'slineardiscriminantcandob etterclassicationby workingonlyonsupportvectors,andheshowedhowsupportve ctorscanbeidentiedbyapplyingFisher'slineardiscriminantrecursively. Chakrabartietal.applied multipleFisher'slineardiscriminantsfortextclassica tion[18].Inthesubsequent sections,weshowhowFisher'slineardiscriminantononlys upportvectorscangive abetterfeatureranking/selection. 5.2.1 Fisher's Linear Discriminant Fisher'slineardiscriminant[10,31]triestondone-dime nsionalprojection.Supposethedataarebinaryclass,andallthedatapointsaretra nsformedtoonedimensionusinganunknownvector w .Themethodattemptstondavectorwhichmaximizesthedierenceofthe between-class averagesfromthetransformeddatawhile minimizingthesumofthe within-class variancesfromthetransformeddata.Thisis


47 anunconstrainedoptimizationproblemandthe w canbeobtainedfrom w / S W 1 ( m 1 m +1 ) ; where m +1 ( m 1 )isthemeanvectorofclass+1( 1)and S W isthetotal within-class covariancematrixandgivenby S W = X i 2 +1 ( x i m +1 )( x i m +1 ) T + X i 2 1 ( x i m 1 )( x i m 1 ) T : If S W issingular,thepseudo-inverse[46]of S W iscomputed.AlthoughFisher'slinear discriminantisnotstrictlyadiscriminantfunction,byde visingappropriately,Fisher's criterioncanserveasalineardiscriminantfunction.Tode visealineardecision functionwestillneedabiasterm.Thebiastermcanbethetra nsformedpointofthe overallmeanofthedatapoints.Hence,wehaveaFisher'slin eardiscriminant g ( x )= w ( x m )= d X i =1 w i x i + d X i =1 w i (( m i (+)+ m i ( )) = 2) : Here,thelargerabsolutevalueof w i hasmoreimpactontheclassicationdecision. Weusethis w asarankingscorecriterion.Thisissuewillbereviewedind etailin Chapter7. 5.2.2 Support Vector Feature Selection with FLD Thesupportvector-basedfeatureselectionwashintedbyth esuccessofSVM [5,16,40,44,55,99],theproofofequivalencebetweenSVMa ndFLDonsupport vectors,andsuccessfulapplicationsofFLDonsupportvect orsbyCooke[23]and Chakrabartietal.[18].Theseimplythatdiscriminantscan dobetterclassication byconstructingtheclassierbasedonsupportvectors.The SVMhasbeenproven tobeoneofthemostpowerfuldiscriminants.Cookeshowedth atavariationof FLDalsocandocompetitiveclassicationasSVM.Sincedisc riminantsusingonly supportvectorshavemorediscriminatingpower,featurera nking/selectionobtained frommorediscriminatingclassiersshouldgiveabetterfe atureselection.Shashua


[81] showed the equivalence between SVM and FLD on support vectors by proving that for S, the set of support vectors, and w = Σ_i α_i y_i x_i in (2.8), w lies in the null space of S_W (S_W w = 0), where S_W is the total within-class covariance matrix of S. Cooke did not mention the relationship of his work to Shashua's proof. It, however, seems that the better performance of Cooke's recursive FLD is theoretically supported by Shashua's proof.

[Figure 5-2: Recursive FLD Idea. Upper panel: "Original Data" with classes y = -1 and y = +1; lower panel: "One-Dimensional Plot by FLD" (projected value versus class label) with Remove / Keep / Remove regions.]

Figure 5-2 shows the idea of recursive FLD suggested by Cooke [18]. In Figure 5-2, the upper figure shows the original data and the lower figure shows projections of the data points into one-dimensional space using FLD. There are three regions in the lower figure: left, middle, and right. Points in the left region are only from one class, and points in the right region are only from the other class. After removing these points, the points in the middle are run on FLD to find a one-dimensional projection. This repeats in a recursive way. We show the recursive FLD and SV-RFE FLD algorithms below; the recursive FLD is essentially based on Cooke's.

SV-RFE FLD Algorithm

1. Identify the support vectors using the recursive FLD
input: T: surviving threshold
       S: decreasing rate on the number of surviving samples
       X, Y: input samples and labels
output: svX, svY: support vectors and corresponding labels
initialization: svX := X, svY := Y
while the fraction of surviving samples (svX) > T do
    Find w by running FLD(svX, svY)
    Find one-dimensional points by projecting svX on w
    Keep S percent of the points from each class which are closest to the decision boundary
    Set svX, svY as the surviving samples and corresponding labels
end while

2. Rank the features by applying RFE on the support vectors found
input: svX, svY: input samples and labels
output: rank
initialization: rank = empty list, s = (1, 2, ..., j, ..., dim)
while s is not empty do
    Construct new samples, newX, whose features contain only s from svX
    Find w by applying FLD(newX, svY)
    Find the feature j such that w_j has the minimum value among |w_j|
    Update s by removing the feature j from s
    Update rank by adding the feature j on top of the rank
end while
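A compact NumPy sketch of the FLD weight vector used in both steps above (our own illustration; the pseudo-inverse stands in for the inverse when S_W is singular, as discussed in Section 5.2.1):

```python
import numpy as np

def fld_weights(X, y):
    """Fisher's linear discriminant direction w ~ S_W^+ (m_{-1} - m_{+1}).

    X: (n_samples, n_features), y: labels in {+1, -1}.
    """
    Xp, Xn = X[y == +1], X[y == -1]
    mp, mn = Xp.mean(axis=0), Xn.mean(axis=0)
    Sw = (Xp - mp).T @ (Xp - mp) + (Xn - mn).T @ (Xn - mn)  # total within-class scatter
    return np.linalg.pinv(Sw) @ (mn - mp)                   # pseudo-inverse handles singular S_W
```

Ranking then uses |w_j|, eliminating the feature with the smallest magnitude at each step, exactly as in step 2 above.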

50 Table5{2:SimpleAllvs.SimpleSVPerformanceonMNISTData (FLD) SimpleAll(1) SimpleSV(2) [(1)-(2)]/(1) Avg.Acc. 70.30% 90.51 % -28.76% #Samples 600 144 76.00% Time 216.8sec 541.8sec -149.9% Table5{3:SimpleAllvs.SimpleSVPerformanceonSonarData (FLD) SimpleAll(1) SimpleSV(2) [(1)-(2)]/(1) Avg.Acc. 64.97% 67.08 % -3.25% #Samples 104 22 78.85% Time 50.47sec 60.53sec -19.93% 5.2.3 Experiments Intherstexperiment,weappliedFLDandfoundrankingbase dontheabsolute valuesofweightofFisher'slineardiscriminant.Wecallth isasSimpleAll.Inour approach,werstfoundthesupportvectorsusingFLDandthe nappliedFLDon onlythosesupportvectors.Whenwefoundthesupportvector s,weused T =0 : 25 and S =0 : 9.Weusedthesameparametersforalltheexperiments.Wedid notapply anynormalizationontheMNISTdataset.ForSimpleSV,wers tfound144support vectors.ComparedtoSimpleAllintermsoftime,SimpleSVto okabout150%more thanSimpleAll.Here,timeincludesndingsupportvectors ,rankingcomputation andtestingtime.But,spendingthisextratimeisworthwhil e.Whenwelookatthe accuracycomparison,SimpleSVgaveabout30%moreaccuracy thanthatofSimple All.Thisisshowninthesecondrow(Avg.Acc.)inTable5{2. Thesameexperimentalsettingwasappliedtothesonardatas et.AsinMNIST, wedidnotapplyanynormalizationonthesonardataset.Simp leSVgave3.25% moreaccuracythanthatofSimpleAllattheexpenseofabout2 0%moretime.Most ofthisextratimewasinitiallyduetondingsupportvector s. Wethenappliedtherecursivefeatureeliminationtechniqu e.WeappliedSVRFEontheMNISTdataset.Thisresultedin144supportvector soutof600.In Table5{4,theTimerowincludesthetimetakentondthesupp ortvectors,recursive


51 Table5{4:All-RFEvs.SV-RFEPerformanceonMNISTData(FLD ) AllRFE(1) SV-RFE(2) [(1)-(2)]/(1) Avg.Acc. 82.41% 84.36 % -2.37% #Samples 600 144 76.00% Time 289.52sec 594.22sec -105.24% Table5{5:AllRFEvs.SV-RFEPerformanceonSonarData(FLD) All-RFE(1) SV-RFE(2) [(1)-(2)]/(1) Avg.Acc. 66.81% 68.39 % -2.37% #Samples 104 22 78.85% Time 84.51sec 85.39sec -1.04% ranking,andtrainingandtestingtime.Inthetable,SV-RFE tookamosttwicetime asmuchasAll-RFEbecauseidentifyingthesupportvectorsi stime-consumingdue toitsrepeatedinversecomputation.Attheexpenseofthise xtratimespending,the accuracygainonSV-RFEisabout2.37%. Thenextexperimentwasonthesonardataset.Werstfound22 supportvectors. Despiteaboutan80%trainingdatareduction,thetotaltime takenforSV-RFEwas slightlymorethanthatofAll-RFEbecausendingthesuppor tvectorstakestime. TheaccuracygainofSV-RFEis2.37%overAll-RFE. 5.2.4 Discussion Wehaveshownthecomparisonofperformancebetweenfeature selectionusing allthesamplesandonewithonlysupportvectors.Whetherit issimplerankingor recursiveranking,featureselectionononlysupportvecto rsgavemuchbetteraccuracy.Theonlydrawbackofthesupportvector-basedfeature rankingisjustalittle moretimespentbecauseweneededtodecidewhichsamplesare supportvectors. Theperformancedierencewassometimessignicantwhenwe appliedthesimple featurerankingontheMNISTdataset.Attheexpenseofalitt leextratimespent, weachievedalmosta30%accuracygain.WhenapplyingFLDona datatset,weneed tondtheinverseof S W ,thetotal within-class covariancematrix.Theorderofthe matrixisthedimensionofthesampleinthedataset.Amongou rdatasets,samples


in two microarray datasets are very high-dimensional. The order of the covariance matrix will be several thousand. Its inverse computation is very time-consuming and hence impractical. This is the reason why we did not apply FLD on the two microarray datasets. However, we are aware that there are linear-time (in the size of the input data) algorithms for computing the inverse for FLD. One of the successful algorithms is the hill-climbing one [18]. It would be necessary to use such linear-time algorithms for FLD to be practical for large data. But we did not explore this issue in this dissertation.

5.3 SV-RFE SVM

The recursive feature elimination technique has long been used as a feature subset selection method [48, 88]. Guyon et al. [44] successfully applied the RFE with SVM (Guyon RFE). In order to apply the SV feature selection, it is necessary to identify the support vectors for the given training dataset. The SVM, however, finds support vectors in the process of building a classifier. So our first job is done by SVM. Then we apply the RFE technique on these SVs. We consider only RFE on all samples and support vectors. Unlike FLD, we do not consider simple ranking on all samples and support vectors, since SVM always finds support vectors first; hence, simple rankings on all samples and on support vectors are exactly the same. The SV-RFE SVM algorithm is introduced as follows.

SV-RFE SVM Algorithm
input: X, Y: input samples and labels
output: rank
1. Identify the support vectors
    Identify support vectors by running SVM(X, Y)
    Set svX, svY as the support vectors and corresponding labels
2. Rank the features by applying RFE on the support vectors found
    initialization: rank = empty list, s = (1, 2, ..., j, ..., dim)
    while s is not empty do
        Construct new samples, newX, whose features contain only s from svX
        Find w by applying SVM(newX, svY)
        Find the feature j such that w_j has the minimum value among |w_j|
        Update s by removing the feature j from s
        Update rank by adding the feature j on top of the rank
    end while
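The following is a minimal Python sketch of this two-step procedure, assuming scikit-learn's linear-kernel SVC; the parameter defaults (C_init, C_main) are placeholders for illustration, not the values used in the experiments below.

```python
import numpy as np
from sklearn.svm import SVC

def sv_rfe_svm(X, y, C_init=1.0, C_main=100.0):
    """Step 1: find support vectors; Step 2: run RFE using only those samples."""
    # Step 1: a smaller C tends to yield more support vectors (see Section 5.3.2).
    svm = SVC(kernel="linear", C=C_init).fit(X, y)
    svX, svY = X[svm.support_], y[svm.support_]

    # Step 2: recursive feature elimination on the support vectors only.
    surviving = list(range(X.shape[1]))
    rank = []
    while surviving:
        clf = SVC(kernel="linear", C=C_main).fit(svX[:, surviving], svY)
        w = clf.coef_.ravel()
        j = int(np.argmin(np.abs(w)))
        rank.insert(0, surviving.pop(j))   # last-eliminated (most important) ends up first
    return rank
```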

5.3.1 Experiments

We applied the SV-RFE SVM on four different datasets and compared the performance with Guyon RFE. The performance measures are accuracy on the test datasets, training time (including feature ranking), and testing time.

Typically, features are eliminated one by one at each new training. In the leukemia dataset, each sample has 7,129 features; hence, one-by-one recursive feature elimination may not be feasible. In practice, some modifications are useful. Instead of eliminating one feature at a time, chunks of features are eliminated at one time. Especially at the early stage of the feature elimination, more features are eliminated. One variation is that we eliminate the features so that the number of remaining features becomes the closest power of two to the original number of features; in the later elimination steps, we eliminate just half of the remaining features, and this step is repeated until one feature is left. Another variation is the same as before, but when the number of remaining features becomes 256, we start to eliminate the features one at a time. This variation is practical since it does not take much time to eliminate one feature at a time from then on. A large number of features are eliminated during the first several elimination steps. Supposedly important features should be in the 256 features, and they are eliminated one by one, and therefore ranked more correctly.

54 Table5{6:GuyonRFEvs.SV-RFEAccuracyonLeukemiaData(SV M) #Features GuyonRFE(%) SV-RFE(%) 2 94.12 94.12 4 94.12 94.12 8 97.06 97.06 16 91.18 97.06 32 97.06 100.00 64 97.06 94.12 128 94.12 94.12 256 91.18 91.18 512 91.18 91.18 1024 94.12 94.12 2048 94.12 94.12 4096 94.12 94.12 7129 94.12 94.12 Acc.Avg 94.12 94.57 Table5{7:GuyonRFEvs.SV-RFETimeonLeukemiaData(SVM) GuyonRFE(1) SV-RFE(2) [(1)-(2)]/(1) #Samples 38 26 32.58% Time 5.86sec 4.00sec 31.74% Wechosethelatterontheleukemiadataset.Oncerankingisd one,accuraciesontest datasetarecomputedusingtop2,4,8,32, :::; 2 k ;::: ,7129featuresontheranking. Inthisexperiment,eachsampleisnormalizedsothattheres ultingsamplebecomes unitlength.Table5{6showstheperformancecomparisonbet weenGuyonSVMand SV-RFE.IntheexperimentshowninTable5{6,weusedlineark ernelSVMand C =100,where C istheupperboundfor intheSVMformulation.Whenweapplied SV-RFE,weinitiallyset C =1togetmoresupportvectors.Thisresultedin26SVs andwestartedSV-RFEwiththese26sampleswhile38SVsinGuy onRFE.These numbersareshowninthesecondrow(#Samples)inTable5{7.S owecanexpect thatthisdierence((38 26) = 38=32 : 58%)shouldbeproportionallyrerectedin thetrainingandtestingtime.TheSV-RFEis31.74%fasterth anGuyonRFEon theleukemiadataset.Table5{7showsalmostexactlythispr oportionalityinthelast


55 Table5{8:GuyonRFEvs.SV-RFEPerformanceonColonCancerD ata(SVM) GuyonRFE(1) SV-RFE(2) [(1)-(2)]/(1) Avg.Acc. 91.20% 91.38 % -0.2% #Samples 31 29 6.45% Time 5.55sec 5.15sec 7.21% row.ThisistheexactsavingsofSV-RFEoverGuyonRFEintrai ningandtesting time.ThereismorethantimesavingsofSV-RFE.Table5{6sho wsthatSV-RFEis betterthanGuyonRFEonaverageaccuracy. NowGuyonRFEandSV-RFEareappliedtothecoloncancerdatas et.Wehave doneourexperimentswiththelinearkernel.Beforeconduct ingtheexperiment,all theexamplesagainwentthroughaunit-lengthnormalizatio n,thatis,foranexample x ,aunit-lengthnormalizationis x = jj x jj .Wehavecomputedaccuraciesontop1,2, 3, ,49,50,100,200,300, ,1900,2000features.Thecompleteaccuracyresults arenotshown.Resultsontop-rankedfeatureswillbeshowni nChapter7.Table5{8 showstheaverageaccuracies,thenumberofsamples,andtra iningandtestingtime forbothmethods.Again,theproportionalityinthe#Sample sisrerectedalmost exactlyasthetrainingandtestingtimedierencebetweenb othmethods.TheSVRFEis7.2%fasterthanGuyonRFEwhiletheaverageaccuracyi sslightlybetter(by 0.2%).Inthiscoloncancerexperiment,weused C =30onbothmethods.When wegettheinitialsupportvectors,weagainused C =1.Thisresultedin29support vectorsbeforewestartedtoapplyRFE.Twentynineoutof31m eansthatalmost allthetrainingsamplesaresupportvectors,andthatiswhy SV-RFEgainedalittle trainingandtestingtimesavings.Comparedtotheleukemia dataset,thecoloncancer datasetismoredicultintermsofclassseparation.Thisdi cultywasrerectedin themuchhigherrateofsupportvectorsthantheoneintheleu kemiadataset. TheMNISTdataareimages.Eachimageconsistsof28by28pixe ls.This28by 28pixelmatrixisconcatenatedtobecomea784-dimensional rowvector.Aswiththe previoustwoexperiments,eachsampleisunitlengthnormal izedbeforeconducting


56 Table5{9:GuyonRFEvs.SV-RFEPerformanceonMNISTData(SV M) GuyonRFE(1) SV-RFE(2) [(1)-(2)]/(1) Avg.Acc. 96.73 % 96.69% +0.04% #Samples 600 87 85.50% Time 404.89sec 60.68sec 85.01% Table5{10:GuyonRFEvs.SV-RFEPerformanceonSonarData(S VM) GuyonRFE(1) SV-RFE(2) [(1)-(2)]/(1) Avg.Acc. 76.17% 76.20 % -0.04% #Samples 104 63 39.42% Time 8.85sec 5.41sec 38.88% experiments.Inthisexperiment,accuraciesfortop1,2,3, ,249,250,300,350, ,750,784werecomputed.Theaccuracyforthetop-rankedfea turesareshown inChapter7.FortheexperimentonMNISTdata,weused C =30.Unlikethe previoustwoexperiments,wealsoused C =30togetinitialsupportvectors.The numberofinitialsupportvectorswas87andshowninthe#Sam plesrowinTable5{9. Relativelyverysmallportionsofsamplesaresupportvecto rs.Fromthisfact,wemay guessthedataarenothardintermsoflinearseparability,a ndhighergeneralization accuracyisexpected.Thisjustiesourchoiceofthelater C values.Averysmall rateofsupportvectorsdirectlycontributedtoverylarget rainingandtestingtime savingsby85%whiletheiraverageaccuraciesarenodieren t(only0.04%).Two factorscontributetothelargesavingsontrainingandtest ingtime:relativelyeasier dataintermsoflinearseparabilityandlargetrainingdata .Easierseparabilitymeans thatonlyasmallnumberofsamplesliearoundthedecisionbo undaryandbecome supportvectors.Sotherateofsupportvectorsishigh,andt hiseectiscompounded withlargetrainingdatasize. Thelastexperimentisdoneonthesonardata.Unliketheprev iousexperiments, nonormalizationwasconductedonthisdataset.Weused C =4.ForSV-RFE,we used C =2togettheinitialsupportvectors.Sowestartedwith63su pportvectors toapplyRFE.Onsonardata,SV-RFEgaveacompetitivetestac curacy(actually


slightly better) while cutting the training and testing time by 40% over Guyon SVM.

5.3.2 Discussion

We have a couple of points to mention. First, we can reduce training and testing time by 7% to 85% by using SV-RFE, and this reduction rate is proportional to the number of support vectors. The larger the training data and the more linearly separable the data, the larger the reduction SV-RFE achieves in terms of ranking, training, and testing time.

Second, we had to choose the initial C, the tradeoff parameter between a good generalization error and a larger margin. For this purpose, we show in detail how C affects the number of support vectors.

[Figure 5-3: Effect of the Parameter C in SVM. Two panels on the same two-class data, for C = 0.3 and C = 100, each showing class one, class two, the support vectors, the class margins, and the decision boundary.]

Figure 5-3 shows the effect of C on support vectors. We remind the reader that C is the upper bound for the Lagrange multiplier α, and each sample has a corresponding Lagrange multiplier α in the SVM formulation.

Thus, a small C value limits the influence of each sample, so that we have a larger number of non-zero α's. This means we have more support vectors. If we use an extremely small C value, then all the samples have almost the same influence in building a decision boundary, regardless of their closeness to the decision boundary, so that almost all the samples become support vectors. In Figure 5-3, the upper figure is somewhat close to this case. The large number of support vectors is due to a small C. The created decision boundary misclassifies a couple of training points while the training data are apparently linearly separable, but the decision boundary looks more reflective of the training point distribution. If C is extremely small, then the resulting decision boundary might become an average-case decision hyperplane; no maximum-margin concept is observed then. Obviously, we want to avoid this situation.

On the other hand, if we use a large C, then some samples close to the boundary between the two classes can get large α's and other samples get small α's, so that the influence of those close points decides the decision boundary. This may result in a good training error, but may cause overfitting. In Figure 5-3, the lower figure is close to this case. It tries to get a separable hyperplane, and those three support vectors ('+') got a high α, and hence these support vectors decide the decision hyperplane. Although it got a separating hyperplane (a good training error), the hyperplane does not reflect the overall training points (poor generalization).

For this reason C is called a tradeoff parameter between good generalization error and good training error, or equivalently a tradeoff between good training error and a larger margin. From this discussion, we can say the initial C should be less than or equal to the C in the main training algorithm. In our experiments, we used the initial C = main C, or the initial C = 1 or 2, but not greater than the main C.
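A small sketch of this effect on synthetic two-class data, using scikit-learn's linear-kernel SVC (our own illustration; the constants 0.3 and 100 simply mirror the two panels of Figure 5-3):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for C in (0.3, 100.0):
    n_sv = SVC(kernel="linear", C=C).fit(X, y).n_support_.sum()
    print(f"C = {C:>5}: {n_sv} support vectors")
# A small C caps every alpha at a low value, so many samples stay active
# (more support vectors); a large C lets a few boundary points dominate.
```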

CHAPTER6 NAIVEBAYESANDFEATURESELECTION TheBayesianmethodisoneofthemostextensivelyusedmachi nelearningmethods[66,85].Amongthem,thenaiveBayesclassierispopula rintextdocument classication.Itssuccessfulapplicationstotextdocume ntdatasetshavebeenshown inmanyresearchpapers[54,65,66,75].Inthischapter,wep ointoutseveralsystemicproblemsrelatedtothenaiveBayes.Someoftheseprob lemsariseonlywhen thenaiveBayesisusedwithjustasmallsubsetoffeatures,f orexample,whenwe trytousethenaiveBayeswithjustoneortwofeatures.Thisp roblemwasnot observedbeforebecausetheydidnottrytoapplythenaiveBa yesasafeatureselectionmethod.Wesuggestedhowtoresolvetheseproblems. Althoughthenaive Bayesclassierwassuccessfulasasimpleandeasytoolforc lassication,therewas roomforimprovement.WeproposeanimprovednaiveBayescla ssierandshowits eectivenessbyexperimentalresultsontextdata. Despiteitssuccessfulapplicationstotextdocumentcateg orizationproblems,the featureselectionusingthenaiveBayesclassierhasnotbe engivenanyattention. Here,wehavedevelopedafeatureselectionmethodbasedont heimprovednaive Bayesapproach,andwehaveshoweditsapplicationtotextdo cumentdatasets. 6.1 Naive Bayes Classier Beforewestartthissection,weintroducetheterminologyw hichwewillusein thischapter.Wewilldenoteafeature(input)variablebyth esymbol X andthe i th componentof X iswrittenas X i .Aparticularobservedinstancevectorisdenoted as x andthe i th componentof x isdenotedas x i .Classlabelvariableisdenotedas Y andaparticularvalueof Y iswrittenas y .Wewilluse w torepresenttheword vectorcorrespondingtothefeaturevector.The i th componentof w isdenotedas w i 59


60 and w i isthewordcorrespondingtothe i th feature.Hence, X i isarandomvariable denotingthenumberofoccurrencesoftheword w i .Wewillsimplifythenotationas follows: P ( Y = y j X = x ) = P ( Y = y j ( X 1 ;:::;X d )=( x 1 ;:::;x d )) = P ( y j x 1 ;:::;x d ) = P ( y j x ) ; P ( X i =1 j y ) = P ( w i j y ) ; where P ( w i j y )istheprobabilitythatarandomlydrawnwordfromadocumen tinclass y willbetheword w i .NowletusbeginthissectionwiththeBayesformula,which isfundamentaltheoremunderlyingourBayesianfeaturesel ection.Bayestheorem: P ( x j y )= P ( x j y ) P ( y ) P ( x ) : WetakeanexampleofanapplicationoftheBayestheoremtoac lassicationproblem. Supposewehaveatwocategory( Y =+1or 1)classicationproblem.Wecando theclassicationusingtheBayestheorem.Givenatestinst ance x ,wecompute P (+1 j x )= P ( x j +1) P (+1) P ( x ) ; P ( 1 j x )= P ( x j 1) P ( 1) P ( x ) : Thenweassign x as+1classif P (+1 j x ) >P ( 1 j x );otherwise 1.Theprobability P ( y j x )iscalled aposteriori probability(orposterior)of y and P ( x j y )iscalledthe likelihoodof y withrespectto x .ThenaiveBayesclassiertriestoassignaclass labelwhichmaximizes aposteriori probability(MAP)foragiventestinstance.That iswhythisclassierisoftencalledtheMAPnaiveBayesclas sier.Weturntohowto


61 compute aposteriori probabilityof y ,givenaninstance x .UsingtheBayestheorem, P ( y j x )= P ( y j ( x 1 ;x 2 x d )) = P (( x 1 ;x 2 x d ) j y ) P ( y ) P ( x 1 ;x 2 x d ) (6.1) Sinceweonlywanttocomparetheposteriorprobabilitiesfo rdierent y 's,andthe denominatoriscommonfordierent y 's,wecansimplyignorethedenominatorand computeonlythenumeratorin(6.1).Inthenumerator, P ( y ),calledthepriorprobability,canbecomputedbysimplycountingthenumberofins tanceswhoseclass labelsare y ,andthefractionofthisnumberoverthetotalnumberoftrai ninginstancesis P ( y ).Butcomputing P (( x 1 ;x 2 x d ) j y )bythesamefashionisnotfeasible unlessthedataarebigenough[66].ThenaiveBayesapproach triestogetaround thisproblembyasimplifyingassumptionregardingtherela tionshipbetweenfeatures [66,87,47,85].ThenaiveBayesapproachthusintroducesth eclassconditional independenceassumptionbetweenthefeatures.Hence,then umeratorbecomes P (( x 1 ;x 2 ; ;x d ) j y ) P ( y )= Y di =1 P ( x i j y ) P ( y )(6.2) Insummary,thenaiveBayesapproachclassiesaninstance x as c where c = argmax y Q di =1 P ( x i j y ) P ( y ).Wewillexplainhowtoestimatetheclassconditional probabilitiesinsubsequentsections. 6.2 Naive Bayes Classier Is a Linear Classier Inthischapter,weapplythenaiveBayesapproachtotextdat a.Whenitis appliedparticularlytothetextdata,theprobability P ( x i j y ),describingtheprobabilitythattheword(feature) w i occurs x i times,providedthat x belongstoclass y isestimated[66]: P ( x i j y )= a p + N y i a + N y x i : (6.3)


62 where a istheequivalentsamplesizeparameter, p isthepriorprobability, N y isthe totalnumberofwordsinalldocumentsinclass y ,countingduplicatewordsmultiple times,and N y i isthenumberoftimestheword w i occursinalldocumentsinclass y .Wewillgivealittlemoreexplanationabouttheequivalent samplesizeparameter shortly.Nowweconsiderabinaryclassicationproblemusi ngthenaiveBayes.That is, Y iseither+1or 1.WerstshowthatthebinarynaiveBayesclassierisaline ar classier[18,15,72,75].ThenaiveBayesclassies x as+1if P (+1 j x ) >P ( 1 j x ), andotherwise-1.Thatis,thedecisionfunction g ( x )= P (+1 j x ) P ( 1 j x ). P (+1 j x ) P ( 1 j x ) () log P (+1 j x 1 ;x 2 x d ) log P ( 1 j x 1 ;x 2 x d ) () log P (+1 j x 1 ;x 2 x d ) P ( 1 j x 1 ;x 2 x d ) () log Q di =1 P ( x i j +1) P (+1) Q di =1 P ( x i j 1) P ( 1) () log Y di =1 P ( x i j +1) P ( x i j 1) +log P (+1) P ( 1) () d X i =1 log P ( x i j +1) P ( x i j 1) +log P (+1) P ( 1) () X di =1 x i log a p + N +1 i a + N +1 log a p + N 1 i a + N 1 +log P (+1) P ( 1) () X di =1 c i x i + b () c x + b Inourexperiment,wehavechosenthe a p =1where a =totalnumberoffeatures (uniquewords)inthetrainingsamples.Ifweestimatetheli kelihoodusingonlythe frequencycounts,insteadof(6.3),then P ( w i j y )= N y i N y : Here,aproblemisthatif theestimatedprobabilityforoneofthefeaturesiszero,th enthewholeposterior probabilityhavingthisfeature(word)wouldbezero.Tores olvethisproblem,a


63 modiedprobabilityestimateisusedasfollows[87,66,59] : P ( w i j y )= a p + N y i a + N y (6.4) where a;p;N y and N y i aredenedasin(6.3).Evenwhen N y i iszerobecauseof insucienttrainingsamples,thisestimatedprobabilityi snotzeroandhencethe posteriorprobabilitymaynotbezero.Wecangiveamoreintu itiveexplanation abouttheequation(6.4).Iftherearenotrainingsamples,t hentheterm a p + N y i a + N y becomes a p a = p .Thismakessensesincewedonothaveanytrainingdata,andt hen ouronlychoiceisthepriorprobability.Ontheotherhand,w ehavelotsoftraining samples,andthentheterm a p + N y i a + N y isdominatedby N y i N y .Thesemanysamplescould representthepopulationandhencethepriorprobabilitydo esnotmuchaectthe probabilityestimate. 6.3 Improving Naive Bayes Classier Someofdiculties,however,arisewhenweapplythenaiveBa yesclassier.We willconsiderthreesuchsituations.Thedicultiesarisew henthetrainingdataare notbalanced,NBistrainedwithonlyonefeature,andatesti nputisavectorof zeros.Inthissection,wedescribehoweachoftheseproblem soccursandwesuggest howtoresolvethem. 6.3.1 Imbalanced Data Oneoftheproblemsoriginateswhenthetrainingdataarenot balancedbetween theclasses.Thefollowingexampleshowssuchacase.Suppos ewearegivenve traininginputanditsclasslabelpairs,asshowninTable6{ 1,andatestinput x =(3 ; 2 ; 7).Obviouslythetestinputisclosesttothetraininginsta nce(3 ; 2 ; 7),and henceweexpectthenaiveBayeswouldclassifythetestinsta nceasclass 1.Wewill showtheintermediatestepsofthenaiveBayesclassier.Fi rst,wecompute P ( x i j y ). Table6{2showstheprobabilityestimateswhere,forexampl e,therstcolumnand rstrowmeans P ( w 1 j +1).Priorprobabilitiesfor+1and 1are P (+1)=4 = 5=0 : 8


64 Table6{1:ImbalancedTrainingDataExample traininginput label (7,2,5) +1 (6,2,6) +1 (7,2,5) +1 (7,2,5) +1 (3,2,7) -1 Table6{2:ProbabilityEstimates w 1 w 2 w 3 y =+1 1+27 3+56 =0 : 475 1+8 3+56 =0 : 153 1+21 3+56 =0 : 373 y = 1 1+3 3+12 =0 : 267 1+2 3+12 =0 : 200 1+7 3+12 =0 : 533 and P ( 1)=1 = 5=0 : 2,respectively.Thelikelihoodsof+1and 1withrespectto thetestinstance x =(3 ; 2 ; 7)are: P ((3 ; 2 ; 7) j +1)=(0 : 475) 3 (0 : 153) 2 (0 : 373) 7 =0 : 0000025 ; P ((3 ; 2 ; 7) j 1)=(0 : 267) 3 (0 : 200) 2 (0 : 533) 7 =0 : 0000093 : Notethatthelikelihoodsof Y = 1withrespectto(3 ; 2 ; 7),i.e., P ((3 ; 2 ; 7) j 1)is almostfourtimesgreaterthanthatof+1.But,theposterior probabilitiesof+1and 1are: P (+1 j (3 ; 2 ; 7)) / P (+1) P ((3 ; 2 ; 7) j +1)=0 : 0000020 ; P ( 1 j (3 ; 2 ; 7)) / P ( 1) P ((3 ; 2 ; 7) j 1)=0 : 0000019 : Thus,thenaiveBayesclassiesthetestinstance(3 ; 2 ; 7)asclass+1whilethetest instanceistheclosesttothetraininginstance(3 ; 2 ; 7)(infact,botharethesame) whichbelongstoclass 1.Thisproblemcomesfromtheimbalancedtrainingdata. Thepriorprobabilitiesarefarbiasedtotheclass+1.Despi teanalmostfourtimes largerlikelihood,itwasnotbigenoughtoovercomethebias edpriorprobability. Thisproblemwasalsoobservedbyotherresearchers'work[7 6].Thisproblemmay beresolvedbyassuminguniformpriorprobability,whichme answeusetheMaximum Likelihood(ML)Bayes.
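A small sketch reproducing this example (our own illustration): with the biased priors the MAP rule picks +1, while the maximum-likelihood variant with uniform priors picks -1, matching the discussion above.

```python
import numpy as np

X = np.array([[7, 2, 5], [6, 2, 6], [7, 2, 5], [7, 2, 5], [3, 2, 7]], dtype=float)
y = np.array([+1, +1, +1, +1, -1])
x_test = np.array([3, 2, 7], dtype=float)
a, a_p = X.shape[1], 1.0                     # equivalent sample size a = #features, a*p = 1

def log_likelihood(x, c):
    counts = X[y == c].sum(axis=0)
    p = (a_p + counts) / (a + counts.sum())  # smoothed P(w_i | c), as in Table 6-2
    return float(x @ np.log(p))

for name, log_prior in (("MAP", lambda c: np.log((y == c).mean())),
                        ("ML (uniform prior)", lambda c: 0.0)):
    scores = {c: log_likelihood(x_test, c) + log_prior(c) for c in (+1, -1)}
    print(name, "->", max(scores, key=scores.get))
```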


65 6.3.2 Training with Only One Feature ConsiderweapplythenaiveBayesclassiertopredictanewt estsampleusing onlyonefeatureinthetrainingdata.Thismayhappenwhenwe usethenaiveBayes classierwithfeatureselection.ThenthenaiveBayespred ictsthesampleinan unreasonableway.Wewillillustratethisproblemwithanex ample.Thefollowing exampleshowssuchacase.Supposewehavefoundthemostdisc riminatingfeature, w j .Withfeature w j ,thedataareperfectlyseparated.Thefeaturevalueandits class labelpairswithfeature(word) w j are(3 ; +1) ; (2 ; +1) ; (3 ; +1) ; (1 ; 1) ; (1 ; 1) ; (0 ; 1). Thedataareperfectlyseparatedusingfeature w j .Specically,iftheoccurrencesof feature w j ,i.e., x j inaninstancearegreaterthanorequalto2,thentheinstanc es belongto+1,andotherwise 1.Theproblemoccurswhenwetrytoclassifyanew instancebythenaiveBayeslearning.Hereweapplythenaive BayeswiththeLaplace correction[87].Then, P ( w j j +1)= 1+8 1+8 =1 P ( w j j 1)= 1+2 1+2 =1 Nowforatestinstance x =(1), P (+1 j (1)) /f P ( w j j +1) g 1 P (+1)=1 1 1 2 = 1 2 P ( 1 j (1)) /f P ( w j j 1) g 1 P ( 1)=1 1 1 2 = 1 2 Thisisnotanexpectedresultsincethetestinstance x =(1)isobviouslysimilarto thetraininginstancesinclass 1thanthoseinclass+1.Butthepredictionbased onnaiveBayesisnotwhatitshouldbe.Onewayofresolvingth isproblemisto computeanaverageofEuclideandistancesbetweenthetesti nstanceandtraining instancesinclass+1( 1).Thentheclassgivingtheminimumaveragedistanceis thewinner.
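A sketch of this minimum-average-distance fallback (our own illustration; the helper name is hypothetical):

```python
import numpy as np

def nearest_mean_distance_class(x, X, y):
    """Assign x to the class whose training instances are closest on average (Euclidean)."""
    avg = {c: np.linalg.norm(X[y == c] - x, axis=1).mean() for c in (+1, -1)}
    return min(avg, key=avg.get)

# Example from Section 6.3.2: a single feature w_j with values 3, 2, 3 (+1) and 1, 1, 0 (-1)
X = np.array([[3.0], [2.0], [3.0], [1.0], [1.0], [0.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
print(nearest_mean_distance_class(np.array([1.0]), X, y))  # -> -1, matching the intuition
```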


66 6.3.3 A Vector of Zeros Test Input Asimilardicultyariseswhenatestinstance x isavectorofzeros.Thatis, ( x =(0 ; 0 0)).WeconsiderthenaiveBayesclassier, g ( x )= P i c i x i + b g ( x )= X i c i x i + b = X i c i 0+ b = b: Weremindthat b =log P (+1) P ( 1) .Thismeansthepriorprobabilitiesdecidethetest instance.Specically,if P (+1) >P ( 1),weclassifyitasclass+1;otherwise 1. Thisclassicationdecisiondoesnotmakesensesincethepr obabilityofoccurrence ofavectorofzerosdoesnotdependonthepriorprobabilitie s.Wecanresolvethis probleminthesamewayaswedidintrainingwithonlyonefeat ure,thatis,weuse theminimumEuclideandistancetodecidethewinner. 6.4 Text Datasets and Preprocessing Inthissection,wewillshowhowthetextdocumentcanberepr esentedasa datasetsuitableforsupervisedlearning.Textdatamining [20]isaresearchdomain involvingmanyresearchareas,suchasnaturallanguagepro cessing,machinelearning, informationretrieval[79],anddatamining.Textcategori zationandfeatureselection aretwoofthemanytextdataminingproblems.Thetextdocume ntcategorization problemhasbeenstudiedbymanyresearchers[54,55,61,64, 65,66,75,77,79,95]. Inthisresearch,themainmachinelearningtoolsusedareth enaiveBayesmethod [67,76]andSVM[80].Nigametal.[70]haveproposedtheuseo fmaximumentropy classiersfortextclassicationproblem.Yangetal.show edcomparativeresearchon featureselectionintextclassication[95].Thetextclas sicationproblemintheSVM applicationhasbeenextensivelystudiedbyJoachims[53]. Forthedocumentdatato beavailableforasupervisedlearningsuchasthenaiveBaye sorSVM,thedocument isrstconvertedintoavectorspacenotation(alsocalled\ bagofwords").Foramore compactdatasetwithoutanyinformationloss,stop-wordsa reremoved.Suchstopwordsare\then,"\we,"\are,"andsoforth.Again,dierent formsofthesameword


67 Table6{3:KeywordsUsedtoCollecttheTextDatasets Dataset Categ. Keywords textAB 1 functionalgenomicsORgeneexpression -1 structuralgenomicsORproteomics textCD 1 HIVinfection -1 cancertumor textCL 1 acutechronicleukemia -1 coloncancer rootareprocessedbythestemmerprogram.Thestemmerprogr amconvertseach wordtoitsrootform.Weusedastop-wordslistwhichcontain s571suchwords,which isavailablefromtheInternet( http://www.unine.ch/Info/clef/ ).Forthestemmer program,weusedthePorter'sstemmingalgorithm.TheJavav ersionofthealgorithm ispubliclyavailable( http://www.tartarus.org/~martin/PorterStemmer ).Theresultingmatrixconvertedfromthedocumentsisveryhigh-di mensionalandsparse.We havecollectedthreedierenttextdocumentdatasets:text CL,textAB,andtextCD. AllthreedatasetsweregeneratedfromPubMed( http://www.pubmed.org ).Byenteringthekeywords\acutechronicleukemia,"wecollected 30abstractsandbyentering\coloncancer"30aswell.Thisdatasetiscalledtext CL.ForthetextAB, twosetsofkeywordsare\functionalgenomicsORgeneexpres sion"and\structural genomicsORproteomics."ForthetextCD,thekeywordsare\H IVinfection"and \cancertumor."Table6{3showsthekeywordsusedtocollect thetextCL,textAB, andtextCD. Aftercollectingthedocuments,eachdocumentwasconverte dintoaso-called \bag-of-words"representation.Theconversionstepsinto abag-of-wordsrepresentationareshownasfollows:Stepsforbag-of-wordsrepresentationofText 1.Identifyallthewordsinthe60documentsandmakealistof words 2.Remove stop-words suchas\a,"\the,"\and,"\then,"andsoforthfromthe list


68 3.Applythe stemming algorithmtoeachwordinthelist.The stemming leaves outtherootformofthewords.Forexample,\computer,"\com puting,"\computation,"and\computes"allhavethesamecomputroot 4.Sortthewordlist5.Buildamatrix M :eachcolumn( j )isawordinthelist,andeachrow( i )isa document. M ( i;j )=howmanytimesthe j th wordoccursinthe i th document 6.Prunethewordswhichappearlessthanthreetimesinthedo cuments.This resultsinasmaller M Step6inthebag-of-wordsrepresentationstepswasapplied tothe20Newsgroup datasetsonly,butitwasnotappliedtoothertextdatasets. Typically,afterconstructingthefrequencycountmatrix M ,wenormalizethematrixrow-wise(document-wise) sothateachrowof M becomesaunitlength.Therationaleforthisunit-lengthno rmalizationisthateachdocumentmayhaveadierentnumbero fwordsandthismay negativelyaectalearningalgorithm[76].Thisissuewill beaddressedlateragain. Figure6{1showsthepictorialillustrationofthebag-of-w ordsrepresentationmodel oftextdocumentdatasets.Afterconvertingthesetofabstr actsintoabag-of-words representation,wehavea60 1823matrix,a60 1976matrix,anda60 1844 matrixfromtextCL,textAB,andtextCD,respectively.Andt heclasslabelsare+1 fortherst30abstractsand-1forthelast30forallthreeda tasets.Nowwehave datasetsforsupervisedlearning. Also,wehaveusedanothersetoftextdatacalled20Newsgrou pdata[54].The http://www-2.cs.cmu.edu/~TextLearning/datasets.html istheURLforthe20 Newsgroupdatasets.Thisdatasetcontains19,974document sevenlydistributedinto 20newsgroups.Inourexperiments,wehaveselectedthreene wsgroupdocuments. Theyare rec.autos,sci.med, and politics.guns .Eachcategorycontains100 documents,withatotalof300documents.Werandomlybuteve nlydividedeach


category document into a 50-document training set and a 50-document test set, so that we have 150 training documents and 150 test documents. Table 6-4 shows the distributions of the datasets. In our experiments, we did a one-vs-all binary classification, that is, when we train with the training set, labels for all rec.autos samples become +1 and labels for all other training samples become -1. The same procedure is done on the test set. This is one set of binary classifications, and similarly for sci.med and politics.guns.

[Figure 6-1: Pictorial Illustration of the Bag-Of-Words Model. Class +1 and class -1 documents (example words include car, brake, auto, accident, bmw, patient, doctor, symptom, infect, pain) are filtered by removing stop-words, stemming, and removing infrequent words, then collected into per-class bags of words, from which simple probability estimates such as P('car' | +1) = (number of occurrences of 'car' in the +1 bag) / (size of the +1 bag) are computed.]

Table 6-4: 20 Newsgroup Dataset Distribution
    Category         Training Set    Test Set
    rec.autos        50              50
    sci.med          50              50
    politics.guns    50              50
    Total            150             150
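A minimal sketch of this preparation step, assuming scikit-learn's CountVectorizer; its built-in English stop list and min_df pruning only approximate the 571-word stop list, Porter stemming, and frequency pruning described above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def bag_of_words(train_docs, test_docs, min_df=3):
    """Build term-count matrices; min_df=3 roughly mimics pruning rare words."""
    vec = CountVectorizer(stop_words="english", min_df=min_df)
    X_train = vec.fit_transform(train_docs).toarray().astype(float)
    X_test = vec.transform(test_docs).toarray().astype(float)
    # unit-length normalization of each document (row)
    X_train /= np.maximum(np.linalg.norm(X_train, axis=1, keepdims=True), 1e-12)
    X_test /= np.maximum(np.linalg.norm(X_test, axis=1, keepdims=True), 1e-12)
    return X_train, X_test, vec.get_feature_names_out()

def one_vs_all_labels(categories, positive):
    """+1 for documents of the positive category, -1 for everything else."""
    return np.where(np.asarray(categories) == positive, 1, -1)
```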

70 6.5 Feature Selection Approaches in Text Datasets Fortextdatasets,manyfeatureselectionmethodshavebeen developed.We showsomeofthepopularrankingschemes:informationgain, oddsratio,Fisher score, t -test,andcorrelationcoecient. 6.5.1 Information Gain Oneofthepopularfeatureselectionapproachesinthetextd ocumentdatais usinginformationgain[54,65,75].Theideaisbasedonthei nformationtheory [24,66].Theinformationgainfortheword w k istheentropydierencebetweenthe classvariableandtheclassvariableconditionedonthepre senceorabsenceofthe word w k [75].Theinformationgainfortheword w k iscomputedasfollows: Info ( w k )= Entropy ( Y ) Entropy ( Y j w k ) = X y 2 Y P ( y )log( P ( y ))+ X w k 2f 0 ; 1 g P ( w k ) X y 2 Y P ( y j w k )log( P ( y j w k )) = X y 2 Y X w k 2f 0 ; 1 g P ( y;w k )log P ( y j w k ) P ( y ) ; where P (+1)isthefractionofthedocumentsinclass+1overthetota lnumberof documents, P ( w k )isthefractionofthedocumentscontainingtheword w k overthe totalnumberofdocuments,and P ( y;w k )isthefractionofdocumentsintheclass y thatcontaintheword w k overthetotalnumberofdocuments.Thehigherthe informationgainis,themorediscriminatingtheword(feat ure)is. 6.5.2 t -test Someofstatisticsareusefultocomputeadiscriminatingsc ore.Oneofthemis the t -test[11].Supposewehave n +1 ( n 1 )samplesinclass+1( 1).Thenthe t score iscomputedasfollows: tScore ( w k )= ( y 1 y +1 ) s p 1 =n +1 +1 =n 1


71 where y 1 ( y +1 )arethesamplemeanforthefeature w k inclass+1( 1)samplesand s isthepooledsamplestandarddeviation.Thepooledvarianc e s 2 canbecomputed asfollows: s 2 = S +1 + S 1 +1 + 1 where S +1 ( S 1 )arethesamplevariancesforthefeature w k in+1( 1)samplesand +1 ( 1 )isthedegreesoffreedomforclass+1( 1)samples.Thehigherthe t score, themorediscriminatingthewordis. 6.5.3 Correlation Coecient Golub[42]usedafeaturerankingbasedonthecorrelationco ecientscore w k whichcanbecomputedas: Coef ( w k )= k (+1) k ( 1) k (+1)+ k ( 1) ; (6.5) where k (+1)( k ( 1))and k (+1)( k ( 1))arethesampleaverageandsamplestandarddeviationforthefeature w k inclass+1samples.Largepositivevalueshavea strongcorrelationwiththeclass+1andlargenegativevalu eswiththeclass 1. Golubusedadecisionfunction, g ( x )basedonthecorrelationcoecients:[42] g ( x )= w ( x )= d X i =1 w i x i + d X i =1 w i ( i (+)+ i ( )) = 2) ; where w i isdenedasin(6.5).Computingthecorrelationcoecients corerequires asimplesamplemeanandstandarddeviation. 6.5.4 Fisher Score Givenatwocategorydataset,ratioofthe between-class to within-class scatter forafeaturegivesarankingcriterion.Thiskindofscoreis calledtheFisherscore (index).Inourexperiment,weusethefollowingrankingsco re[19]. Fisher ( w k )= ( m +1 m 1 ) 2 v +1 + v 1 ;


72 where m +1 ( m 1 )and v +1 ( v 1 )arethemeanandvarianceofthefeature w k inclass +1( 1)samples. 6.5.5 Odds Ratio Theoddsratioscoreofafeature w k canbecomputedasfollows:[14,15] Odds ( w k )=log P ( w k j +1)(1 P ( w k j 1)) (1 P ( w k j +1)) P ( w k j 1) ; where P ( w k j +1)( P ( w k j 1))isthefractionofthedocumentscontainingtheword w k inclass+1( 1)samples.Someofthecharacteristicsofthisscorearetha tthis scorefavorsfeaturesoccurringveryofteninclass+1.Oneo fthedisadvantagesof thisscoreisthatword w k getshigheroddsratioif w k occursrarelyinclass+1but neveroccursinclass 1[14].Wenotethattheoddsratiorankingschemedoesnot usetheabsolutevalueoftheoddsratiowhenrankingthescor e.AsintheForman's study[34],weaddonetoanyzerofrequencycountinthedenom inatorcomputation toavoiddivisionbyzero. 6.5.6 NB Score Fortextdatasets,wedenearankingcriteriononaword w k ,basedonprobability estimatesasfollows: NBScore ( w k )= j P ( w k j +1) P ( w k j 1) j ; (6.6) where P ( w k j +1)istheprobabilitythatarandomlychosenwordfromarand omly drawndocumentinclass+1willbetheword w k ,andsimilarlyfor P ( w k j 1).We ranktheword w i higherthantheword w j if NBScore ( w i ) >NBScore ( w j ). 6.5.7 Note on Feature Selection Foranylineardiscriminantclassier,wecandeviseafeatu rerankingscheme, thatis,theweightvectorintheclassierfunctionisthefe aturerankingcriterion. ThisissuewillbeaddressedindetailinChapter7.Forsomec lassiers,wecan stillgetabetterfeaturerankingbyapplyingarecursivefe atureselection.Butwe


73 cannotapplytherecursivefeatureeliminationuniversall ytoanylinearclassier.One suchcaseisthatapplyingrecursiveeliminationtechnique doesnotgiveanybetteror dierentfeatureranking.Forexample,acorrelationcoec ientclassier,information gain,oddsratio,Fisherscore,and t -testrankingimplicitlyassumetheorthogonality betweenfeatures.Hence,applyingRFEgivesthesamerankin gastheonewithout applyingRFE.ThisissuewillberevisitedinChapter7. 6.6 Class Dependent Term Weighting Naive Bayes ThenaiveBayesclassierhasbeensuccessfuldespiteitscr udeclassconditional independenceassumption.Obviously,mostrealdatasetsvi olatethisassumption. Duetothenaiveassumption,thenaiveBayesoftenleadstoth epoorposteriorprobability.Webbetal.[93]andBennett[8]studiedtogetthebe tterposteriorsto accommodatetheviolationoftheassumption.Thefeaturein dependenceassumptionrelatedtothenaiveBayeshasbeenstudiedinconjuncti onwiththenaiveBayes classicationperformance[28,78].Manyresearchers,how ever,triedtorelaxthis crudeassumptioninthehopeofgettingbetterclassicatio naccuracy.Butitseems thatthereisnogreateradvanceinthisdirection.Friedman etal.[37]compared theBayesianclassierwiththeBayesiannetworkwhichsupp osedlylessviolatesthe independenceassumption,andfoundthelatterdidnotgives ignicantimprovement. MoresurveyonthisissuecanbefoundinDomingosetel.'sstu dy[27]. DespitethenaiveassumptionofthenaiveBayes,itssuccess hasnotbeenwell explainedorunderstooduntilrecently.DomingosandPazza ni[27,28],andFriedman [36]haveinvestigatedthisissuerecently.Theirndingsa reessentiallythedistinction betweenprobabilityestimationandclassicationperform ance.Theirclaimsareeven strongertosaythatdetectingandrelaxingthisassumption isnotnecessarilythe bestwaytoimproveperformance.Keepingthisinmind,wehav edevelopedthe class-dependent-term-weightingNB(CDTW-NB)approach.


74 Table6{5:IllustrativeDatafortheClassDependentTermWe ighting Class Sample w j w k + D 1 01 + D 2 01 + D 3 01 + D 4 61 + D 5 01 + D 6 01 D 7 00 D 8 00 D 9 00 D 10 00 D 11 00 D 12 00 6.6.1 Motivation for Class Dependent Term Weighting AsFriedman[36]andDomingosetal.[27]haveobserved,toge tabetterclassicationperformance,theclassdependenttermweighting isnotintendedtogeta betterormoreaccurateprobabilityestimate,whichmaycom efrombyrelaxingthe independenceassumption.Rather,itisintendedtogetabet terclassicationaccuracybyscalingthedataclassdiscriminativethroughaword occurrenceprobability. BelowisanexampletoillustratingthemotivationfortheCD TW.FromTable6{5, wecanestimatetheprobabilitiesasfollows: P ( w j j +1)= 6+1 F + D P ( w k j +1)= 6+1 F + D P ( w j j 1)= 0+1 F + D P ( w k j 1)= 0+1 F + D where F =#featuresand D =6(#documentsin+1(or 1)).Theword w j appears onlyinonedocument( D 4 )inclass+1whiletheword w k appearsinallthedocuments intheclass+1.Butprobabilities, P ( w j j +1)and P ( w k j +1)arethesame.Weremind that P ( w j j +1)istheprobabilitythatarandomlychosenwordfromarand omlydrawn documentinclass+1willbetheword w j .Thisdoesnotmakesensesincetheword w k appearsinallthedocumentsinclass+1and w j appearsonlyinonedocument


75 inthesameclassandhencewecouldhavemorechancetoseethe word w k froma randomlydrawndocumentinclass+1thantoseetheword w j .Thisisaweakness ofthenaiveBayesclassier.Wetrytoxthisweaknessbygiv ingmoreweighton suchwordsas w k .FromTable6{5,itisreasonabletothinktheword w k couldbea morerepresentativewordfortheclass+1,whiletheword w j maynotbesuchacase. Givingmoretermweightingonwordsappearinginmoredocume ntsinthesameclass helpsthosewordsasclassdiscriminatingfeatures.Obviou sly,wewanttorankthe word w k higher(moreimportant)thantheword w j .Butthetermweightingisdone dierentlydependentontheclass.Foreachword w i ,andeachsample x 2 class+1, CDTWisdoneasfollows: x i = x i ( n +i + C N + ) N + ; (6.7) where x i isthefrequencycountoftheword w i in x N + isthenumberofdocuments inclass+1, n +i isthenumberofdocumentsinclass+1containingtheword w i and C> 0isascaleconstant.Wedothistermweightingfordocuments inclass 1in similarway.Letusseehowthisclassdependenttermweighti ngrankstheword w k higherthantheword w j .FromTable6{5,theword w k appearsinallthedocuments inclass+1whileitdoesnotin 1.Nowweapplythetermweightingasdescribed in(6.7).Fortheword w j ,scalingisdoneasfollows: x j = x j (6+ C 6) = 6= x j ( C +1) ; forclass+1samples ; x j = x j (0+ C 6) = 6= x j C; forclass 1samples : Fortheword w k ,scalingisdoneasfollows: x k = x k (1+ C 6) = 6 ; forclass+1samples ; x k = x k (0+ C 6) = 6= x k C; forclass 1samples : Letusnowseehowtherankingscoreschangedbeforeandafter thetermweighting. Beforescalingthefrequencycounts,wegetthefollowingpr obabilityestimatesfor


76 thedatashowninTable6{5: P ( w j j +1)= 6+1 V + + F P ( w j j 1)= 1 V + F P ( w k j +1)= 6+1 V + + F P ( w k j 1)= 1 V + F ; where V + ( V )isthetotalnumberofwordsinclass+1( 1).Thenwecomputethe NBscores forthewords w j ,and w k asdenedin(6.6)asfollows: NBScore ( w j )= j P ( w j j +1) P ( w j j 1) j = j 6+1 V + + F 1 V + F j ; NBScore ( w k )= j P ( w k j +1) P ( w k j 1) j = j 6+1 V + + F 1 V + F j : Thatis,both w j and w k havethesamescore.Afterscalingthefrequencycountsin Table6{5,wegetthefollowingprobabilityestimates: P ( w j j +1)= 6 C +2 U + + F P ( w j j 1)= 1 U + F P ( w k j +1)= 6 C +7 U + + F P ( w k j 1)= 1 U + F ; where U + ( U )isthescaledtotalnumberofwordsinclass+1( 1).Then NBscores forthewords w j ,and w k : NBScore ( w j )= j P ( w j j +1) P ( w j j 1) j = j 6 C +2 U + + F 1 U + F j = j (6 C +2)( U + F ) ( U + + F ) ( U + + F )( U + F ) j = j K N K D j ;


77 NBScore ( w k )= j P ( w k j +1) P ( w k j 1) j = j 6 C +7 U + + F 1 U + F j = j (6 C +7)( U + F ) ( U + + F ) ( U + + F )( U + F ) j = j 5( U + F )+ K N K D j ; where K N =(6 C +2)( U + F ) ( U + + F )and K D =( U + + F )( U + F ).Hence, j K N K D j < j 5( U + F )+ K N K D j (since 5( U + F ) K D > 0) ) NBScore ( w j )
PAGE 92

78 where V + ( V )isthetotalnumberofwordsinclass+1( 1)and F isthenumberof features.Bytheoriginalrankingscheme,rankingscoresfo r w j and w k : NBScore ( w j )= j P ( w j j +1) P ( w j j 1) j = j m +1 V + + F 1 V + F j ; NBScore ( w k )= j P ( w k j +1) P ( w k j 1) j = j m +1 V + + F 1 V + F j : Thisprovesthattheoriginalrankingschemegivesthesamer ankingforboth w j and w k .NowbytheCDTW,probabilityestimates: P ( w j j +1)= m +1 U + + F 1+ Cm m ; P ( w j j 1)= 1 U + F Cm m ; P ( w k j +1)= m +1 U + + F m + Cm m ; P ( w k j 1)= 1 U + F Cm m ; where U + ( U )isthescaledtotalnumberofwordsinclass+1( 1)and F isthe numberoffeatures.Rankingscoresfor w j ;w k : NBScore ( w j )= j P ( w j j +1) P ( w j j 1) j = j m +1 U + + F 1+ Cm m 1 U + + F Cm m j = j m +1 U + + F 1+ Cm m C U + F j = j ( m +1)(1+ Cm )( U + F ) Cm ( U + + F ) m ( U + + F )( U + F ) j = j ( m +1)( U + F )+ L N L D j ; NBScore ( w k )= j P ( w k j +1) P ( w k j 1) j = j m +1 U + + F m + Cm m 1 U + + F Cm m j = j m +1 U + + F m + Cm m C U + F j = j m ( m +1)( U + F )+ L N L D j ;

PAGE 93

79 where L N = Cm ( m +1)( U + F ) Cm ( U + + F )and L D = m ( U + + F )( U + F ). ( m +1)( U + F ) L D < m ( m +1)( U + F ) L D )j ( m +1)( U + F )+ L N L D j < j m ( m +1)( U + F )+ L N L D j ) NBScore ( w j )
PAGE 94

80 n c :numberofnon-zerocomponentsin t belongingtoClass c N c :numberofsamplesinclass c C :ascaleconstant 3. R ecursive F eature E limination whiles isnotempty do Constructnewsamples, newX whosefeaturescontainonly s from X Computetherankingscore c i from( newX,Y ),for i =1 j s j c i = NBScore ( w i ) Findthefeature j suchthat j =argmin i j c i j Update s byremovingthefeature j from s Update rank byaddingthefeature j ontopofthe rank endwhile Wewanttomakeacoupleofcommentsontheunitlengthnormali zation.The purposeoftheunitlengthnormalizationistomakeeachdocu mentequallength sincealldocumentsmayhavedierentlength.Twobenetsar ethebyproductofthe normalization.Oneisitimprovestheclassicationaccura cy.Otherresearchersalso observedthesimilarresultbydoingtheunitlengthnormali zationontextdata[76]. Theotheriswedoesnotneedtoscaleverysmallprobabilitye stimates.Whenwe applythenaiveBayeswithoutanydatanormalizationprobab ilityestimatescanbe extremelysmall.Weexponentiatethesmallnumbersbyoccur rencecountstomake itevensmallerandtheproductofthousandsofthesesmallnu mberseasilymakethe posteriorprobabilityzero.Hence,noclearclassication decisioncanbemade.This canbepreventedbyascaling.But,bydoingtheunitlengthno rmalization,wemay avoidthisscalingneeds.

PAGE 95

81 Table6{6:Top10WordsSelectedfromtheTextABDataset Ranking t -test SVM-RFE CDTWNB-RFE Info.Gain 1 express express promot proteom 2 proteom protein proteom gene 3 genom genom transcript express 4 transcript proteom induc transcript 5 recent gene genom promot 6 promot induc mrna genom 7 protein diseas diseas mediat 8 gene mass inhibit inhibit 9 mediat promot structur signicantli 10 induc transcript mass diseas 6.7 Experimental Results 6.7.1 Feature Ranking Works Asapreliminaryexperiment,wehaveappliedfourfeaturera nkingschemes, t test,SVM-RFE,naiveBayes-RFE,andinformationgain,tose ehowwelleachfeature rankingalgorithmworks.Wehaveusedall60instancesinthe textdocumentdatasets torankthefeatures.Table6{6showsthetop10selectedword sfromtheTextAB dataset.Aswecanseeinthistable,somehigh-rankedwordsa renotfromthe keywordsused,buttheycouldbemeaningfulwhenwelookatth edata.Theycould bemorediscriminatingwords.Wecannottellhowgoodaranki ngalgorithmisonly bylookingatwhetherthetop-rankedfeaturesarefromtheke ywordsornot.These tablesserveonlytoshowthatallthementionedrankingalgo rithmsworkcorrectly inasensebyrankingsomeimportantwordshigh.Butthesetab lesshouldnotserve asajudgingtoolforwhichalgorithmisbetter.Considering thekeywordswhichwere usedtoretrievethe60abstractdocuments,thefourranking schemesgaveasimilar topwordslist.Specically,\proteom"and\genom"arerank edthetopsixselected wordslistinallfourrankingalgorithms.Table6{7showsth etop10selectedwords fromtheTextCDdataset.Thekeywordsusedforretrievingth eTextCDdatasetwere \hivinfection"and\cancertumor."Threeofthefourrankin gschemesrankatleast

PAGE 96

82 Table6{7:Top10WordsSelectedfromtheTextCDDataset Ranking t -test SVM-RFE CDTWNB-RFE Info.Gain 1 hiv hiv cancer hiv 2 immunodeci infect tumor cancer 3 infect viru hiv infect 4 viru tat mutat immunodeci 5 cancer specif tumour viru 6 tumor gag viru tumor 7 antiretrovir cd4 allel antiretrovir 8 therapi immunodeci tat progress 9 replic region gastric replic 10 progress cancer immunodeci vaccin Table6{8:Top10WordsSelectedfromtheTextCLDataset Ranking t -test SVM-RFE CDTWNB-RFE Info.Gain 1 chronic cancer colon colon 2 colon colon leukemia chronic 3 leukemia leukemia chronic leukemia 4 cancer chronic cancer cancer 5 acut acut acut acut 6 marrow patient express bone 7 bone express transplant marrow 8 express hematolog colorect express 9 patient malign marrow patient 10 diseas bone bone colorect threekeywordsinthetop10list.Theword\hiv"isespeciall yrankedinthetop threefromallfourtop10lists. Table6{8showsthetop10selectedwordsfromtheTextCLdata set.Wordsused toretrievetheTextCLdatasetare\acutechronicleukemia" and\coloncancer."All thesevewordsareamongtopvewordsfromallfourrankings .Actually,theTextCL datasetistheeasiestoneamongthethreetextdocumentdata setsintermsoflinear separability.Thiseasinessisshownbytherankedlists.In thissense,theeasinessis inorderoftheTextCL,TextCD,andTextAB,thelatterthemor edicult.Allthree tablesshowthevalidityofeachrankingalgorithmsincetop -rankedwordssimilarly appearinthetop10wordslist.

PAGE 97

83 Table6{9:CDTW-NB-RFEvs.OriginalNB-RFE Dataset OriginalNB(1)% CDTW-NB(2)% [(1)-(2)]/(1)% rec.autos Max. 97.33 98.00 -0.69 Average 95.77 96.26 -0.51 sci.med Max. 95.33 97.33 -2.10 Average 92.60 94.67 -2.24 politics.guns Max. 98.67 98.00 0.68 Average 93.23 94.19 -1.03 AverageAccuracyGainofCDTW-NB-RFEoverOrig.NB-RFE 1.26 6.7.2 Experimental Results on the Newsgroup Datasets Wehaveappliedthealgorithmtothe20newsgroupdatasets.T able6{9shows theexperimentalresults.Intheexperiment,wedidaone-vs -all,thatis,whenwedid the rec.autos vs.all,welabeledthe rec.autos samplesas+1andtherestas 1, andsimilarlyforthe sci.med and politics.guns .Wehavecomputedtheaccuracies fortop1,2, ,1000,1100,1200, ,3000,3023features.Table6{9istheaverageof theaccuracies.FortheoriginalNB,eachsamplewasunitlen gthnormalized.Then weappliedtherecursivefeatureeliminationusingtheorig inalNB.Usingthisranking, theoriginalNBwasusedtogetthetestaccuraciesforthepre viouslymentionedbest features.ForCDTW-NB,eachsamplewasunitlengthnormaliz ed.ThenCDTWNBwasusedtogetrankingusingRFE.Thenwiththisranking,C DTW-NBwas usedtogettheaccuracieswiththeabovementionedbestfeat ures.Table6{9shows thatCDTW-NB-RFEisbetterthantheoriginalnaiveBayesbya bout1.26%on average.Wehavealsousedtheverankingschemes,informat iongain,oddsratio, Fisherscore, t -testandcorrelationcoecient,andthenappliedtheCDTWNB.We rankedthefeaturesusingtherankingscheme,informationg ain,andwecomputed theaccuraciesfortop1,2,...,1000,1100,1200,...,3000, 3023featuresusingthe originalNBandCDTW-NB,andsimilarlyforotherrankingsch emes.Table6{10 showsCDTW-NBgivesbetteraccuracyresultsforalmostallt herankingschemes thantheoriginalNB.Theaverageaccuracygainisabout0.64 4%overtheoriginal

PAGE 98

84 Table6{10:OriginalNBvs.CDTW-NB Dataset RankingScheme Orig.NB(1)% CDTW-NB(2)% (2)-(1) rec.autos InformationGain 95.45 95.47 0.02 OddsRatio 83.36 85.12 1.76 FisherScore 95.86 95.70 -0.16 t-test 94.58 94.96 0.38 Corr.Coef. 95.52 95.57 0.05 NB-RFE 95.77 96.26 0.49 sci.med InformationGain 94.03 93.86 -0.17 OddsRatio 79.11 81.23 2.12 FisherScore 93.56 93.99 0.43 t-test 92.38 92.56 0.18 Corr.Coef. 94.09 94.29 0.20 NB-RFE 92.60 94.67 2.07 politics.guns InformationGain 93.30 92.83 -0.47 OddsRatio 87.44 89.23 1.79 FisherScore 93.11 93.82 0.71 t-test 90.93 91.85 0.92 Corr.Coef. 93.20 93.42 0.22 NB-RFE 93.14 94.19 1.05 Average 0.644 NB.Table6{11isasubsetofTable6{10.Theaccuracyresults aretheaveragefor thebest1,2,3, ,1000,1100,1200, ,3000,3023featuresusingonlyCDTW-NB foralltherankingschemes.TheresultsshowthattheCDTW-N B-RFEisthebest forallthreedatasetsoveralltheverankingschemes.Sofa r,wehaveshownthat theCDTW-NB-RFEworksbetterthanotherfeaturerankingalg orithms.Wenow turntothefeatureselectionissueontextdata.Figure6{2s howstheaccuraciesfor Table6{11:AccuracyResultsforRankingSchemes RankingScheme rec.autos (%) sci.med (%) politics.guns (%) InformationGain 95.47 93.86 92.83 OddsRatio 85.12 81.23 89.23 FisherScore 95.70 93.99 93.82 t-test 94.96 92.56 91.85 Corr.Coef. 95.57 94.29 93.42 CDTW-NB-RFE 96.26 94.67 94.19

PAGE 99

85 0 10 20 30 40 50 60 70 80 90 100 70 80 90 100 Test Accuracy %Accuracy with Best Features (rec.autos) Accuracy with Best FeaturesAccuracy with Full Features 0 10 20 30 40 50 60 70 80 90 100 70 80 90 100 Test Accuracy %Accuracy with Best Features (sci.med) Accuracy with Best FeaturesAccuracy with Full Features 0 10 20 30 40 50 60 70 80 90 100 60 70 80 90 100 Test Accuracy %Best Features (x 10) Accuracy with Best Features (politics.guns) Accuracy with Best FeaturesAccuracy with Full Features Figure6{2:TestAccuracywithBest-rankedFeatureson20Ne wsgroupDatasets. thebest10,20,...,1000featuresinasolidlineandthatoff ullfeatures(3023)in adottedhorizontalline.FromFigure6{2,wecanseethatthe similaraccuracyas thatoffullfeatureswasachievedwiththebest300features on rec.autos ,400on sci.med ,and650on politics.guns datasets.Withthisnumberoffeatures,all haveachievedbetterthanorthesameaccuracyasusingthefu llfeatures.Thisshows thatthefeatureselectionworksontextdata.Thisndingsa reconsistentwithother researchers'results[19,20,67,60,65,95]. 6.8 Naive Bayes for Continuous Features Sofar,wehaveappliedthenaiveBayestotextdocumentdataw hosefeaturevaluesaretheoccurrencesofwordsandtheyarediscrete.Whenf eaturesarecontinuous variables,aneasywayofapplyingthenaiveBayesisbyassum ingGaussiandistributionforeachfeature.Realworlddatasets,however,maynot followthisassumption. Adierentapproachisadiscretizationofthecontinuousfe ature.Doughertyetal.

PAGE 100

86 [29]showedthatdiscretizationofacontinuousvariableca napproximatethedistributionforthevariableandthiscanhelptoovercometheGaus sianassumptionused forcontinuousvariableswiththenaiveBayes.Inthissecti on,weshowedasimple discretizationmethodforthenaiveBayesmethod,andinthe discretizeddata,we showedhowwecandofeatureranking. Aten-bindiscretizationworkswellcomparedtomorecompli cateddiscretization schemes,suchasfuzzylearningdiscretization,entropymi nimizationdiscretization, andlazydiscretization[29,50,96,97].Althoughmorecomp licateddiscretization schemesgivebetterperformancethanthesimpleten-bindis cretization,itsgainis minorwhileitscomputationismorecomplicated[29].Consi deringthis,weshowthe naiveBayeswithaten-bindiscretization. Inthe k -bindiscretizationscheme,eachfeatureisdividedinto k bins.Asimple buteective k -bindiscretizationtechniqueisequalwidthdiscretizati on(EWD).The EWDdividesthelinebetween v min and v max into k bins.Eachintervalthenhas width w =( v max v min ) =k ,and k 1boundarypointsareat v min + w;v min + 2 w; ;v min +( k 1) w .Thismethodappliestoeachfeatureseparately.Applying thediscretizationtotheoriginalinputfeaturesmakesthe featuresintonominalvalues between1 ; 2 ; ;k ThenaiveBayesclassiernowforthediscretizedfeaturesi sstraightforward.For adiscretizedtestinput x P ( y j x )= P ( x j y ) P ( y ) P ( x ) / P ( x 1 x d j y ) P ( y ) / P ( y ) d Y i P ( x i j y ) / P ( y ) d Y i N y i + mp N y + m ;

PAGE 101

87 where N y i isthenumberofsamplesinclass y satisfying X i = x i p is P ( X i = x i ), and m istheequalsamplesizeparameter.ThenthenaiveBayesclas sierassigns thetestinstance x as c where c =argmax y P ( x j y ) P ( y ).Wehaveappliedtheten-bin discretizationandthenaiveBayestotheMNISTandsonardat asets.Weremindthat theMNISTdataset,however,doesnothavecontinuousfeatur es,butdiscretevalues in[0,255].Thereasonforapplyingdiscretizationisthatt hesedatahavealready256 discretevalueswhichcouldresultintoomanybins(256),an dprobabilityestimates foreachbinmaybeunreliableforasmallnumberofcountsfor eachbin.Thetenbindiscretizationcangiveabetterprobabilityestimatef oreachoccurrencecount thanusingthegivendiscretevalues(256bins).Theexperim entalsettingisthesame asinChapter5,thatis,computetherankingrst,andcomput eaccuraciesfortop 1 ; 2 ; 3 ; ; 249 ; 250 ; 300 ; ; 750 ; 784.Fortherankingscheme,wedonotapplythe sameranking,aswedidintextdocumentdatasets,sincethed iscretizationmakesthe naiveBayesapiecewiselinearboundary.Instead,weapplya nentropy-basedranking schemehere.Anentropy-basedrankingschemeisexplainedb elow. Entropy-basedRankingAlgorithm input: X,Y :Inputsamplesandlabels output: rank 1.Computetheentropyon k th binand j th feature, E j k ,for j =1 dim;k =1 10 N + j :#samplesinclass+1satisfying x j = k N j :#samplesinclass 1satisfying x j = k E j k = N + j N + j + N j log N + j N + j + N j N j N + j + N j log N j N + j + N j 2.Computetheweightedaverageof E j k w j ,for j =1 dim w j = P 10k =1 E j k N + j + N j N N:totalnumberofsamplesin X 3. rank w j for j =1 dim suchthatthesmaller w j isrankedthehigher

PAGE 102

88 Table6{12:ComparisonbetweenFLD,SVM,andDiscretizedNB Classier MNIST(%) Sonar(%) FLD(Simple) 90.51 67.08 FLD(SV-RFE) 84.36 68.39 SVM(SV-RFE) 96.69 76.20 Disc.NaiveBayes 93.34 70.91 Thesonardatasethaspurelycontinuousfeaturevaluesin[0 .0,1.0].Inthiscase, ten-bindiscretizationisnatural.Anexperimentalsettin gisthesameasinChapter 5,Thatis,wecomputeaccuraciesfortop1 ; 2 ;:::; 60.Therankingschemeisbased onentropy.Theaccuraciesforbothdatasetswerealreadykn ownforSVMandFLD fromthepreviouschapter.Thiscangiveacomparativeideab etweenthoseandthe naiveBayes. Table6{12showstheresultsforapplyingaten-bindiscreti zationandthenaive BayestotheMNISTandsonardatasets.Itshowsthatthediscr etizednaiveBayes isnobetterthanSVM,butitisalwaysbetterthanFLD.Althou ghthenaiveBayes isbasedonthe\naive"assumption,andten-bindiscretizat iondoesnotelegantly approximatethedata,itsperformanceisstillcompetitive Whiletwomicroarraydatasetshavepurelycontinuousfeatu resandmaybea goodchoiceforapplyingdiscretization,wedidnotshowany experimentalresults heresincetheyhavetoosmallnumberofsamples(31and38sam ples),anditis unlikelytogetreliableprobabilityestimatesfromthem. 6.9 Discussion WehaveproposedanewrankingschemebasedonNB.Weappliedi ttotext datasetsandshowedthatthisschemeoutperformedotherpop ularrankingschemes. Foratextdataset,itseemsthatthenaiveBayesclassieris theonlychoicealthough SVMhadbeenappliedtotextdataanditwasreportedtobesucc essful[55].There areseveralreasons.Onereasonwouldbetimeandspacecompl exity.TheSVMis notscalable.TheNBisalineartimeandspacecomplexityalg orithmintermsof

PAGE 103

89 numberofinputsamples.Ontheotherhand,thebestknownlin earSVMhastime complexityintheorderof n 1 : 8 ,where n isthenumberofsamples[18]. Anotherissueissimplicityofimplementationandeasiness ofusage.TheSVM needstosolveaquadraticprogrammingproblem.Besidesthe implementationissue, SVMneedstotuneparameterstogetabestperformance.Tunin gSVMparameters isnotatrivialjobsincetheperformancecanwidelyvaryatd ierentsetofparameter values.TheNBisverysimpleandbasicallytherearenoparam eterstotune. TheNBisalinearclassierwhileSVMcandobothlinearandno nlinearclassication.ThismaynotbeabigproblemforNBintextdata.Aslon gasitisconcerned withtextdatasets,lineardiscriminantsaregood[18].The SVMmaygiveabetter performanceinalargeamountofdata.Butifittakesthatmuc htimeandspaceto getaresult,itmaynotbeachoiceforpractitioners.

PAGE 104

CHAPTER7 FEATURESELECTIONANDLINEARDISCRIMINANTS Inthischapter,weshowedageneralizationoffeatureselec tionusinglineardiscriminants.Theideaisthattheweightvectorofanylineard iscriminantcanserveas arankingcriterion.Thenweshowedhowdierentlineardisc riminantsdrawdierent decisionboundariesforvarioustwo-dimensionaldatasets .Thedatasetsvaryinterms ofcovariancesofsamplesandthenumberofsamplesineachcl ass.Wethenpresentedacomparativestudyonrankingschemes,byclassica tionaccuracyofranking schemesonSVMandonthenaiveBayesclassier.Sofar,wehav edeterminedhow goodarankingmethodisusingaclassicationaccuracybyap plyingtherankingon aclassier.Inthischapter,weshowthatthosetop-rankedf eaturesarereallymore discriminatingfeaturesthanthoserankedlowbyvisualizi ngthetop-rankedfeatures inseveraldierentways.Wealsotriedtogiveanideaofhowm anyfeaturesshouldbe selected.Finally,weshowedsomeofapplicationsofthefea tureselection,particularly, anapplicationofthefeatureselectionintextdocumentdat asets. 7.1 Feature Selection with Linear Discriminants WehaveshownthatthelinearSVMdecisionfunctionis g ( x )= P li =1 i y i h x i x i + b = wx + b .Then, wx + b =0isthedecisionhyperplane,where w isthe directionofthedecisionhyperplaneand b isthebiasforthehyperplane.Wemade useoftheweight(normal)vectorofthedecisionhyperplane asarankingcriterion whetherwedoasimplerankingschemeorRFE.Thisideaofusin gaweightvector ofthelineardecisionfunctionasarankingcriterioncanbe generalized.Foralinear classier g ( x )= w x + b ,where w isthedirectionoftheclassierand b isthebias, thecomponentofthe j w j isthemagnitudeofimportanceofthefeature(word).For SVM g ( x )= w x + b = P li =1 i y i x i + b .Constructingthedecisionhyperplaneand 90

PAGE 105

91 rankingthefeaturesbasedonitrequiressolvingaquadrati cprogrammingproblem. WehavealsoshownthatthenaiveBayesisalinearclassier. InthenaiveBayes, classicationisthesignoflog P (+1 j x ) =P ( 1 j x )and i th weightofthelinearclassier islog P ( w i j +1) =P ( w i j 1).Perceptrontrainsalinearclassier,too.Thentheline ar classierassigns sign ( w x )foratestinstance x .Again,forthesamereasonasSVM andthenaiveBayes, w istherankingcriterion.WealsoshowedthatFisher'slinea r discriminantcanbeusedtogivearankingcriterion. Conversely,arankingcriterionscorecanbeaweightofalin earclassierwith asuitablychosenbiasterm.Thecorrelationcoecientisan exampleofthat.The decisionfunctionbasedonthecorrelationcoecientsnowi s: g ( x )= w ( x )= d X i =1 w i x i + d X i =1 w i ( i (+)+ i ( )) = 2) ; where w i isdenedasin(6.5),and i (+)( i ( ))isthesampleaveragefor i th feature foralltheinstancesofclass+1( 1).Thelasttermisthebiasterminthedecision function. Insummary,foranylineardiscriminantclassiers,wecand eviseafeaturerankingapproach,thatis,theweightvectorintheclassierfun ctionisthefeatureranking criterion.Forsomeclassiers,wecanstillgetabetterfea turerankingbyapplyingthe recursivefeatureelimination.Butwecannotapplytherecu rsivefeatureelimination universallytoanylinearclassier.Onesuchcaseisthatap plyingtherecursiveeliminationtechniquedoesnotgiveanybetterordierentfeatu reranking.Forexample, thecorrelationcoecientclassier,Fisherscore,inform ationgain,oddsratio,and t -testrankingimplicitlyassumetheorthogonalitybetween features.Hence,applyingRFEgivesthesamerankingastheonewithoutapplyingRFE .Anothercasein whichwecannotapplytheRFEisthatapplyingRFEisnotcompu tationallyfeasible. Forexample,Fisher'slineardiscriminantrequiresamatri xinversecomputation,and computingamatrixinverseathousandtimescouldnotbeane cientmethod.Here

PAGE 106

92 theorderofthematrixisthedimensionofatrainingsample. Italsomaynotbe feasibletoapplyRFEonSVMwhentheorderofthedatasizeism orethanseveral thousandsandsoisthedimensionofeachsample. 7.2 Linear Discriminants and Their Decision Hyperplanes Itwouldbeinterestingtoseehowdierentlinearclassier sdrawaseparating hyperplanetosometwo-dimensionaldatasets.Sinceeachli nearclassierhasadifferentweightcomputation,itwouldbererectedinthesepar atingdecisionplane. HereweconsidertheSupportVectorMachines(SVM),naiveBa yesclassier(NBC), correlationcoecientclassier(CCC),andFisher'slinea rclassier(FLD).Thegiven dataareintwodimensionsforourvisualizationpurpose.We haveconsideredthree dierentkindsofdatasets.First,wehavegenerated200`*' pointsand200`+'points usingmultivariaterandomnumbergeneratoronMatlab.The` *'pointsweregeneratedfrom =(6 ; 12) ; = 0B@ 31 : 5 1 : 53 1CA ; where and aremeanvectorand covariancematrix,respectively.And`+'pointsweregener atedfrom =(12 ; 6) andthesamecovariancematrixastheoneusedin`*'points.F igure7{1showsdecisionhyperplanesforSVM,NBC,CCC,andFLD.Twoclassesha veequalprior probability.Pointsintwodierentclassesarelinearlyse parable.Threecircled pointsaresupportvectorsandSVMndsthemaximummarginhy perplanefrom thesepoints.Theotherthreediscriminantsshowsimilarhy perplanes.Intheseconddataset,wegenerated50`*'pointsusingmultivariaten ormaldistributionwith =(6 ; 12) ; = 0B@ 1 : 5 : 51 1CA ; and200`+'pointsusingmultivariatenormaldistributionwith =(12 ; 6) ; = 0B@ 31 : 5 1 : 53 1CA : Thisdatasetisimbalanced,thatis,twodifferentclasspointshavedierentpriorprobabilities( P ( 0 0 )=1 = 5and P ( 0 + 0 )=4 = 5). Figure7{2showsplotsfordierentclassiers.Thecircled threepointsaresupport vectors,andSVMndsthemaximummargindecisionhyperplan ebetweenthesethree

PAGE 107

93 2 4 6 8 10 12 14 16 18 0 2 4 6 8 10 12 14 16 18 XYLinear Discriminants Separable, Balanced Data naive BayesSVMCorr. CoefFLDSVs Figure7{1:LinearDiscriminants-Separable,BalancedDat a. pointsregardlessofthepriorprobabilitieswhiletheothe rthreehyperplanesshowa similarbehaviorandareleaningtowardtheclasswithhighe rpriorprobability.Figure7{2showsanobviousdierencebetweenSVMandtheothert hreeclassiers.The SVMconsidersonlythepointslyingontheborderlinewhilet heotherthreeconsider theaveragecasebetweenallthepointswhentheyndtheclas sier. Inthethirddataset,wegenerated200`*'pointsusingmulti variatenormaldistributionwith =(6 ; 12) ; = 0B@ 1 : 3 : 31 1CA ; and200`+'pointsusingmultivariate normaldistributionwith =(12 ; 6) ; = 0B@ 4224 1CA : Twodierentclasseshavevery dierentdistributions.The`*'pointsaredenselyconcent ratedwhile`+'pointsare widelyscattered.Figure7{3showsthedecisionlinesfordi erentclassiers.Again, asinFigure7{3,SVMndsthemaximummarginclassierbetwe enfoursupport vectorsregardlessofthedatadistribution.Buttheothert hreeclassiersaremuch closertothewidelyscatteredclass.Aswehaveseenintheth reedierentgures,

PAGE 108

94 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16 XYLinear Discriminants Separable, Imbalanced Data naive BayesSVMCorr. CoefFLDSVs Figure7{2:LinearDiscriminants-Separable,ImbalancedD ata. 2 4 6 8 10 12 14 16 18 20 0 5 10 15 XYLinear Discriminants Inseparable, Balanced, Diff var. Data naive BayesSVMCorr. CoefFLDSVs Figure7{3:LinearDiscriminants-Inseparable,Balanced, Di.Var.Data.

PAGE 109

95 alltheclassiersbutSVMndsomehowanaveragecaseclassi erwhileSVMnds aborderlineclassier.ThisisauniquecharacteristicofS VM. 7.3 Visualization of Selected Features Sofarwehavestatedhowwellarankingschemeworksbyshowin gaccuracyfor thosetop-rankedfeatures.Here,weinspectthetop-ranked featuresvisuallyandsee whethertheyhavemorediscriminatingpowerthanthoserank edlow.First,weuse theMNISTdataset.Figure7{4showsimageshavingonlytop-r ankedfeatures.The imagesinthegureweredonebykeepingonlytop-rankedfeat uresandsetzerofor allotherfeatures.Inthegure,thersttworowsaredigits 2and3imageswithonly best50features,thenexttwowith100,thenexttwowith200, thenexttwowith 300,andthelasttworowswithfull784features.Fromthegu re,itmaybedicult toclassifyitwithonlythebest50featureswithhumaneyes, butitisnotdicult toseethatthetop50featurescontainenoughfeaturesforco rrectclassication,and SVMalreadyachievesalmostthehighestaccuracywiththatn umberoffeatures.As wegodownthegure,itisobviousthathumaneyescandocorre ctclassicationwith 200or300features.Thebest200featureswouldbeenough,if 50isnot. Figure7{5showstheleukemiadatasetwiththebesttwofeatu reswhichwere rankedusingtrainingdatabySV-RFE(SVM).Itshowsthateve ntwofeaturesout of7129giveareasonablygoodseparatingboundary,althoug htheyarenotperfectly separable.Thex-axisisthebestfeatureandthey-axisisth esecondbestfeature. Figure7{6showsthetestaccuracyontheMNISTdatasetwithb est-ranked features.Itshowstheaccuracyofthebest80featureswitha solidlineandthatof full784featureswithadottedhorizontalline.Fromthisg ure,wecansaythatthe best50to80featuresareenoughforthehighestclassicati onaccuracy.Andthis gurepositivelysupportstheclaimswehaveshowninFigure 7{4. Fortheotherthreedatasets,asimilarconclusioncanbedra wn.Figure7{7 showsaccuraciesforthebest50featuresinasolidlineandt hatoffullfeaturesin

PAGE 110

96 Figure7{4:ImageswiththeBest-rankedFeatures. -0.02 0 0.02 0.04 0.06 0.08 0.1 0.12 -0.02 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 The Best Feature (1882)Second Best Feature (1779)Leukemia Data Plot with the Best Two Features Positive ClassNegative Class Figure7{5:LeukemiaDataPlotwiththeBestTwoFeatures.

PAGE 111

97 0 10 20 30 40 50 60 70 80 86 88 90 92 94 96 98 Best FeaturesTest Accuracy %Accuracy with Best 80 Features on MNIST Data (SVM) Accuracy with Best FeaturesAccuracy with Full Features Figure7{6:TestAccuracywithBest-rankedFeaturesonMNIS TData adottedhorizontalline.Fromthegure,wecanseethatthec oloncancerdataset achieveditsbestaccuracyataboutthebest35features,and soforth. Wehaveseenthatfromtheperformanceplot,thebestaccurac yhasbeen achievedataboutthebest70featuresonMNIST,30onleukemi a,35oncolonand 20onsonardatasets.Withthisnumberoffeatures,allhavea chievedbetterthanor thesameaccuracyasusingthefullfeatures.Theratioofthi snumberoffeaturesto thefullnumberoffeaturesareabout9%,0.4%,1.8%,and33%f orthefourdatasets. Thisonlysaysthatfeatureselectionworks,butitdoesnota nswerwhatnumberof featuresgivesthebestclassicationaccuracy,thatis,ho wmanyfeaturesweshould choose.Thisissueseemstobeongoingresearch.Manyofthef eatureselectionalgorithmsrstrankallthefeaturesandselectasubsetoftop -rankedfeaturesfrom thembyathreshold.Golubetal.[42]selected25top-ranked featuresfromhighly +1classcorrelatedfeaturesandanother25from 1classcorrelatedfeatures.Guyon etal.[44]selectedthenumberofgeneswhichcorrespondsto theminimumnumber ofsupportvectors.Leeetal.[63]selectedgeneswithfrequ encieshigherthan2.5%. The25or2.5seemstobeasubjectivejudgment.Recently,Fue tal.[39]suggested

PAGE 112

98 0 5 10 15 20 25 30 35 40 45 50 80 85 90 95 100 Test Accuracy %Accuracy with Best Features on Leukemia Data (SVM) Accuracy with Best FeaturesAccuracy with Full Features 0 5 10 15 20 25 30 35 40 45 50 80 85 90 95 100 Test Accuracy %Accuracy with Best Features on Colon Cancer Data (SVM) Accuracy with Best FeaturesAccuracy with Full Features 0 5 10 15 20 25 30 35 40 45 50 70 75 80 85 Best FeaturesTest Accuracy %Accuracy with Best Features on Sonar Data (SVM) Accuracy with Best FeaturesAccuracy with Full Features Figure7{7:TestAccuracywithBest-rankedFeaturesonThre eDatasets. the M -foldcross-validationfeatureselection.Theideaistora ndomlydividethe giventrainingdatainto M (10)disjointsubsets,andthenpickacertainnumberof featuresbyrunningalearningalgorithm(SVM)usingjuston esubset.Collectallthe selectedfeaturesfrom10runningsandthenpickgenesaccor dingtofrequencycounts. Comparedtootherfeatureselectionalgorithms,thisfeatu reselectionalgorithmuses amoreautomaticthreshold.Alltheaboveselectionmethods havebeenappliedto microarraydatasets. 7.4 Compact Data Representation and Feature Selection NowthelinearSVMisusedtogetaccuracyresultsforvedie rentranking schemes: t -test,correlationcoecient,Fisherscore,Guyon-RFE(SV M),andSVRFE(SVM).Datasetsareleukemia,coloncancer,MNIST,sona rdata.Alldatasets butsonarareunitlengthnormalizedbeforeconductinganye xperiments.ExperimentalsettingsareexactlythesameaswesawinChapter5.Ta ble7{1showsthat

PAGE 113

99 Table7{1:PerformanceofVariousRankingSchemesonLinear SVM(%) RankingSchemes Leukemia ColonCancer MNIST Sonar t -test 83.33 90.74 96.06 75.61 Corr.Coef. 88.46 90.18 96.03 75.45 FisherScore 85.75 88.97 96.06 75.59 GuyonRFE(SVM) 94.12 91.20 96.73 76.17 SV-RFE(SVM) 94.57 91.38 96.69 76.20 inallfourdatasetsbutMNIST,SV-RFE(SVM)givesthebestac curacy.Inthe MNISTdataset,Guyon-RFE(SVM)givesslightlybetteraccur acythanSV-RFE, buttheirdierencesarenegligiblyminor(0.04).TwoSVM-R FEbasedrankings, GuyonRFEandSV-RFEaresignicantlybetterthantherestth reerankingschemes intheleukemiadataset. Intheprevioussection,wehaveshownthatthebestclassic ationaccuracycan beachievedwithasmallfractionoffeaturesoftheoriginal features.InChapter5, wehaveshownthatthebetterclassicationaccuracycanbea chievedwithasmall fractionofsamples(SVs).TheSV-basedfeatureselectiong oesthroughthesetwo steps:identifyingthesupportvectors,selectingthedisc riminatingfeatures.Identifyingthesupportvectorsleadstothereductioninthenumbe rofthedatasamples, whileselectingthediscriminatingfeaturesleadstothere ductioninthesizeofeach sample.Hence,theSV-basedfeatureselectionallowsacomp actrepresentationof thedata.Table7{2showsthereductiononSVM-RFE(SVM)fort heclassication purpose.Inthetable,wehaveusedthenumberofsupportvect ordataonSVMfrom Chapter5andthenumberofselectedfeaturedatafromthepre vioussection.Inthe table,ReductionRatereferstothereductionrateovertheo riginaldatasizeandit wascomputedas[(1) (2) (3) (4)] = [(1) (2)]where(1)isthenumberofsamplesin theoriginaldataset,andsoforth.Thetableshowsthatweca nreducetheoriginal datasizeby98%inthreedatasetsandbyabout80%inonedatas et.Thisimplies thatabout2%oftheoriginaldatasizesucesforthebestcla ssication.Although

PAGE 114

100 Table7{2:CompactDataRepresentationonSV-RFE(SVM) Datasets OriginalData ReducedData Reduction #Samples #Features #Samples #Features Rate(%) (1) (2) (3) (4) Leukemia 38 7129 26 30 99.71 ColonCancer 31 2000 29 35 98.36 MNIST 600 784 87 70 98.71 Sonar 104 60 63 20 79.81 wedidnotshowanydirectusefulnessofthisreduction,such alargedatareduction as98%couldmeanasignicantimplicationandbecritically utilizedinsomeother researchdomainsaswellasmachinelearning. 7.5 Applications of Feature Selection Inthissection,weshowhowfeatureselectioncanbeeectiv elyused.Nowwe inspectedtop10wordsrankedbyCDTW-NB-RFEonthe20Newsgr oupdatasets. Thetop10rankedwordsare rec.autosvs.others : gun,car,doctor,studi,medic,radar,atf,auto,waco,brak e. sci.medvs.others : doctor,car,msg,patient,scienc,gun,infect,symptom,ye ast, pain. politics.gunsvs.others : gun,car,atf,waco,doctor,govern,effect,fbi,weapon,ba ck. Chakrabartietal.[19]mentionedapossibleusefulnessoft hetop-rankedfeatures. Onepossibleapplicationscenarioisthatthisfeaturesele ctionmethodcanprovide descriptorsforclustersfromanunsupervisedlearning.Su pposewehavefoundthree clustersfromunsupervisedlearningappliedtoasetofdocu ments.Weapplythis binaryfeatureselectionmethodbytreatingcluster1docum entsasoneclassandthe balanceofdocumentsastheothertogetarankedlistoffeatu res.Thisprocessrepeats forcluster2documents,andsoforth.Fromeachbinaryfeatu reselection,pickthetop

PAGE 115

101 k words(say10)anddecideifthesewordsarefromthiscluster byinspectingwhich clusterhasmorethesewords.Thencollectthosewordsfromt hiscluster.Thisword listcanbedescriptorsforthiscluster.Theresultsonthe2 0Newsgroupdatasetsare asfollows: cluster1: car,radar,auto,brake. cluster2: doctor,msg,patient,scienc,infect,symptom,yeast,pain cluster3: gun,atf,waco,govern,fbi,weapon. Now car,radar,auto and brake canbedescriptorsforcluster1,andsimilarly forclusters2and3.Thesewordsareclearlysalientkeyword swellrepresenting correspondingclusters.

PAGE 116

CHAPTER8 CONCLUSIONS WehavepresentedSV-basedfeatureselectionandCDTW-NB-R FE.WhenSVbasedfeatureselectionisappliedtoSVM,itcutsdownthefa mousSVM-RFEtraining timebytheratioofthenumberofsupportvectorswhilekeepi ngthesameclassicationaccuracyorsometimesbetter.Inthissense,ouralgo rithmmakesSVM-RFE morepractical.Ontheotherhand,rankinganddiscriminant functionsoughtbyour algorithmonFLDgiveabetterperformanceonFLDattheexpen seofalittleextra trainingtime.WehaveshowntheeectivenessoftheSV-base dfeatureselectionon threedierentdatadomains. ThenaiveBayesclassierhasbeenextensivelyusedontextc ategorizationproblems.WehaveshownCDTW,anewfeaturescalingschemeontext dataandNBRFE,anewfeatureselectionalgorithmusingthenaiveBayes .TheCDTWgavea betterperformanceonmanypopularfeaturerankingschemes whenitwasusedwith thenaiveBayesontextdata.BycombiningCDTWandNB-RFE,we haveshown CDTW-NB-RFEoutperformedthemostpopularfeatureranking schemesontext data.WealsohavepointedoutthatNBhassomesystemicprobl emsoftenarising whenfeatureselectionisappliedtoNB.Wehaveshownhowtor esolveeachofthese problems. WhilewehaveshowntheeectivenessofSV-basedfeaturesel ectionontwo lineardiscriminants,thisalgorithmhaspotentialtobebe tteronotherdiscriminants. OneissueiswhethertheSV-basedfeatureselectioncanbege neralizedtoanylinear discriminant,orevenfurtheranydiscriminantincludinga nonlinearone.Another issueishowmanysupportvectorsareappropriatebeforeapp lyingfeatureranking, 102

PAGE 117

103 suchasRFE,thatis,whentostopremovingthecorrectandwel l-classiedpointson FLDandwhat C weusetogetinitialsupportvectorsonSVM. WealsohaveshownthatSV-RFEleadstoasignicantdataredu ctionsuch as98%forclassicationpurpose.Itwouldbeexplorabletos eewhatimplication thisreductioncanhave,andhowthisreductioncouldbecrit icallyutilizedinother researchdomainsaswellasmachinelearning.

PAGE 118

REFERENCES [1]U.Alon,N.Barkai,D.Notterman,K.Gish,S.Ybarra,D.Ma ck,andA.Levine, BroadPatternsofGeneExpressionRevealedbyClusteringAn alysisofTumor andNormalColonTissuesProbedbyOligonucleotideArrays, PNAS,Vol.96, pp.6745{6750,June1999. [2]T.Anderson,AnIntroductiontoMultivariateStatistic alAnalysis,JohnWiley &Sons,NewYork,NY,1958. [3]P.BaldiandA.D.Long,ABayesianFrameworkfortheAnaly sisofMicroarrayExpressionData:Regularizedt-TestandStatisticalIn ferencesofGene Changes,Bioinformatics,Vol.17(6),pp.509{519,June,20 01. [4]M.Bazaraa,H.SheraliandC.Shetty,NonlinearProgramm ingTheoryand Algorithms,JohnWiley&Sons,Inc.,NewYork,NY,1992. [5]K.BennetandC.Campbell,SupportVectorMachines:Hype orHallelujah?, ACMSIGKDD,2(2):1{13,2000. [6]K.Bennet,andE.Bredensteiner,GeometryinLearningin GeometryatWork, C.GoriniEd.,MathematicalAssociationofAmerica,Washin gton,D.C.,pp. 132-145,2000. [7]K.Bennet,andE.Bredensteiner,DualityandGeometryin SVMs,InP.Langley editor,ProceedingsoftheSeventeenthInternationalConf erenceonMachine Learning,pp.65{72,MorganKaufmannPublishersInc.,SanF rancisco,CA, 2000. [8]P.Bennett,AssessingtheCalibrationofNaiveBayes'Po sterior,CMU-CS00-155,SchoolofComputerScience,CarnegieMellonUniver sity,September, 2000. [9]D.Bertsekas,NonlinearProgramming,AthenaScientifc ,Belmont,MA,1995. [10]C.Bishop,NeuralNetworksforPatternRecognition,Ox fordUniversityPress, NewYork,NY,1995. [11]G.Box,W.Hunter,andJ.Hunter,StatisticsforExperim enters,JohnWiley, NewYork,NY,1978. [12]P.Bradley,O.Mangasarian,andW.Street,FeatureSele ctionviaMathematical Programming,INFORMSJournalonComputing,10(2):209{217 ,1998. 104

PAGE 119

105 [13]P.Bradley,andO.Mangasarian,FeatureSelectionviaC oncaveMinimization andSupportVectorMachines,InProceedingsoftheThirteen thInternational ConferenceonMachineLearning,pp.82-90,SanFrancisco,C A,1998. [14]J.Brank,M.Grobelnik,N.Milic-FraylingandD.Mladen ic,FeatureSelection UsingLinearSupportVectorMachines,MSR-TR-2002-63,Mic rosoftResearch, MicrosoftCorporation,June,2002. [15]J.Brank,M.Grobelnik,N.Milic-FraylingandD.Mladen ic,Interactionof FeatureSelectionMethodsandLinearClassifcationModels ,Workshopon TextLearning(TextML-2002),Sydney,Australia,July2002 [16]M.Brown,W.Grundy,D.Lin,N.Cristianini,C.Sugnet,T .Furey,M.Ares, Jr.andD.Haussler,Knowledge-BasedAnalysisofMicroarra yGeneExpression DatabyusingSupportVectorMachines,PNAS,Vol.97,pp.262 {267,January, 2000. [17]C.Burges,ATutorialonSupportVectorMachinesforPat ternRecognition, DataMiningandKnowledgeDiscovery,Vol.2,No.2,pp.121{1 67,1998. [18]S.Chakrabarti,S.RoyandM.Soundalgekar,FastandAcc urateTextClassifcationviaMultipleLinearDiscriminantProjections,VLD BJournal,Vol.12, No.2,pp.170{185,2003. [19]S.Chakrabarti,B.Dom,R.Agrawal,R.Raghavan,Scalab leFeatureSelection, ClassifcationandSignatureGenerationforOrganizingLar geTextDatabases intoHierarchicalTaxonomiesVLDBJournal,Vol.7,No.3,pp .163{178,1998. [20]S.Chakrabarti,DataMiningforHypertext:ATutorialS urvey,SIGKDD ExplorationsVol.1,No.2,ACMSIGKDD,January,2000. [21]O.Chapelle,V.Vapnik,O.BousquetandS.Mukherjee,Ch oosingMultiple ParametersforSupportVectorMachines,MachineLearning, Vol.46,No.1, pp.131{159,January,2001. [22]V.Chvatal,LinearProgramming,W.H.FreemanCompany, NewYork/San Francisco,1980. [23]T.Cooke,TwoVariationsonFisher'sLinearDiscrimina ntforPatternRecognition,IEEETransactionsonPatternAnalysisandMachineI ntelligence,Vol. 24,No.2,pp268{273,2002. [24]T.CoverandJ.Thomas,ElementsofInformationTheory, JohnWiley,New York,NY,1991. [25]N.CristianiniandJ.Shawe-Taylor,AnIntroductionto SupportVectorMachines,CambridgeUniversityPress,Boston,MA,1999.

PAGE 120

106 [26]M.Dash,andH.Liu,FeatureSelectionforClassifcatio n,IntelligentData Analysis,Elsevier,Vol.1,No.3,pp.131{156,1997. [27]P.DomingosandM.Pazzani,OntheOptimalityoftheSimp leBayesianClassiferunderZero-OneLoss,MachineLearning,Vol.29,pp.10 3{130,1997. [28]P.DomingosandM.Pazzani,BeyondIndependence:Condi tionsfortheOptimalityoftheSimpleBayesianClassifer,Proceedingsofthe ThirteenthInternationalConferenceonMachineLearning(ICML),pp.105{11 2,1996. [29]J.Dougherty,R.KohaviandM.Sahami,SupervisedandUn supervisedDiscretizationofContinuousFeatures,MachineLearning:Pro ceedingsofthe TwelfthInternationalConference,MorganKaufmanPublish ersInc.,SanFrancisco,CA,1995. [30]K.Duan,S.Keerthi,andA.Poo,EvaluationofSimplePer formanceMeasuresforTuningSVMHyperparameters,TechnicalReportCD01-11,Control Division,Dept.ofMechanicalEngineering,NationalUnive rsityofSingapore, 2001. [31]R.Duda,P.Hart,andD.Stork,PatternClassifcation,J ohnWiley&Sons, Inc.,NewYork,2001. [32]S.Dudoit,Y.H.Yang,M.J.CallowandT.P.Speed,Statis ticalMethodsfor IdentifyingdierentiallyExpressedGenesinReplicatedc DNAMicroarrayExperiments,TR578,DepartmentofBiochemistry,StanfordUn iversity,2000. [33]L.Fausett,FundamentalsofNeuralNetworks,Prentice Hall,UpperSaddle River,NewJersey,1994. [34]G.Forman,AnExtensiveEmpiricalStudyofFeatureSele ctionMetricsfor TextClassifcation,TheJournalofMachineLearningResear ch,Vol.3,pp. 1289{1305,March,2003. [35]J.Friedman,ExploratoryProjectionPursuit,Journal oftheAmericanStatisticalAssociation,Vol.82,Issue397,pp.249{266,1987. [36]J.Friedman,OnBias,Variance,0/1-Loss,andtheCurse -of-Dimensionality, DataMiningandKnowledgeDiscovery,Vol.1,pp.55{77,1997 [37]N.Friedman,D.GeigerandM.Goldszmidt,BayesianNetw orkClassifers, MachineLearning,Vol.29,pp.131{163,1997. [38]L.Fu,NeuralNetworksinComputerIntelligence,McGra w-Hill,NewYork, NY,1994. [39]L.FuandE.Youn,ImprovingReliabilityofGeneSelecti onfromMicroarray FunctionalGenomicsData,IEEETransactionsonInformatio nTechnologyin Biomedicine,Vol.7,No.3,September,2003.

PAGE 121

107 [40]T.S.Furey,N.Duy,N.Cristianini,D.Bednarski,M.Sc hummer,andD. Haussler,SupportVectorMachineClassifcationandValida tionofCancerTissueSamplesUsingMicroarrayExpressionData,Bioinformat ics,Vol.16,pp. 906{914,2000. [41]T.V.Gestel,J.A.K.Suykens,G.Lanckriet,A.Lambrech ts,B.DeMoor,and J.Vanderwalle.Bayesianframeworkforleastsquaressuppo rtvectormachine classifers,GaussianprocessesandkernelFisherdiscrimi nantanalysisNeural Computation,Vol.15,No.5,pp.1115{1148,May,2002. [42]T.Golub,D.Slonim,P.Tamayo,C.Huard,M.Gaasenbeek, J.Mesirov,H. Coller,M.Loh,J.Downing,M.Caligiuri,C.Bloomfeld,andE .Lander,MolecularClassifcationofCancer:ClassDiscoveryandClassPre dictionbyGene ExpressionMonitoring,Science,Vol.286,pp.531{537,Oct ober,1999. [43]R.GormanandT.Sejnowski,AnalysisofHiddenUnitsina LayeredNetwork TrainedtoClassifySonarTargets,NeuralNetworks,Vol.1, pp.75{89,1988. [44]I.Guyon,J.Weston,S.Barnhill,andV.Vapnik,GeneSel ectionforCancer ClassifcationUsingSupportVectorMachines,MachineLear ning,Vol.46(1/3), pp.389{422,January,2002. [45]I.Guyon,SVMApplicationSurvey,WebpageonSVMApplic ations: http://www.clopinet.com/SVM.applications.html,Octob er20,2001. [46]W.Hager,AppliedNumericalLinearAlgebra,PrenticeH all,EnglewoodClis, NewJersey,1988. [47]J.HanandM.Kamber,DataMiningConceptsandTechnique s,Morgan KaufmannPublishersInc.,SanFrancisco,CA,2001. [48]T.Hastie,R.TibshiraniandJ.Friedman,TheElementso fStatisticalLearning, DataMining,Inference,andPrediction,Springer-Verlag, NewYork,NY,2001. [49]L.HermesandJ.Buhmann,FeatureSelectionforSupport VectorMachines, ProceedingsoftheInternationalConferenceonPatternRec ognition(ICPR'00), Vol.2,pp.712{715,2000. [50]C.Hsu,H.HuangandT.Wang,WhyDiscretizationWorksfo rNaiveBayes Classifers,IntheSeventeenthInternationalConferenceo nMachineLearning, pp.399{406,MorganKaufmannPublishersInc.,SanFrancisc o,CA,2000. [51]P.Huber,ProjectionPursuitAnnalsofStatistics,Vol .13,Issue2,pp.435{475, 1985. [52]T.Ideker,V.Thorsson,A.SiegelandL.Hood,Testingfo rDierentiallyExpressedGenesbyMaximum-LikelihoodAnalysisofMicroar rayData,Journal ofComputationalBiology,Vol.7,No.6,pp805{817,2000.

PAGE 122

108 [53]T.Joachims,EstimatingtheGeneralizationPerforman ceofanSVMEciently, InProceedingsoftheSeventeenthInternationalConferenc eonMachineLearning,pp.431{438,MorganKaufmannPublishersInc.,SanFran cisco,CA,2000. [54]T.Joachims,AProbabilisticAnalysisoftheRocchioAl gorithmwithTFIDF forTextCategorization,InProceedingsofFourteenthInte rnationalConference onMachineLearning,pp.143{151,MorganKaufmannPublishe rsInc.,San Francisco,CA,1997. [55]T.Joachims,TextCategorizationwithSupportVectorM achines:Learning withManyRelevantFeatures,InProceedingsoftheTenthEur opeanConferenceonMachineLearning,London,UK,1998. [56]G.John,R.Kohavi,andK.Preger,IrrelevantFeaturesa ndtheSubsetSelection Problem,W.CohenandH.Hirsh(Eds.):MachineLearning:Pro ceedingsof the11 th InternationalConference,pp.121{129,SanMateo,CA,1994 [57]I.Jollie,PrincipalComponentAnalysis,Springer-V erlag,NewYork,NY, 1986. [58]M.Jones,andR.Sibson,WhatIsProjectonPursuit?Jour naloftheRoyal StatisticalSociety.SeriesA(General),Vol.150,Issue1, pp.1{36,1987. [59]R.Kohavi,B.BeckerandD.Sommerfeld,ImprovingSimpl eBayes,InProceedingsoftheNinthEuropeanConferenceonMachineLearning,S Pringer-Verlag, NewYork,NY,1997. [60]D.Koller,andM.Sahami,TowardOptimalFeatureSelect ion,Proceedings oftheThirteenthInternationalConferenceonMachineLear ning(ML),Bari, Italy,July1996,pp.284{292,1996. [61]K.Lang,Newsweeder:LearningtoFilterNetnews,InPro ceedingsofthe TwelfthInternationalConferenceonMachineLearning,pp3 31{339,1995. [62]K.Lee,YChung,andH.Byun,FaceRecognitionUsingSupp ortVectorMachineswiththeFeatureSetExtractedbyGeneticAlgorithms ,J.BigunandF. Smeraldi(Eds.):AVBPA2001,LNCS2091,pp.32-37,2001. [63]K.E.Lee,N.Sha,E.R.Dougherty,M.Vannucci,andB.Mal lick,Gene Selection:ABayesianVariableSelectionApproach,Bioinf ormatics,Vol.19, pp.90{97,2003. [64]E.LeopoldandJ.Kindermann,TextCategorizationwith SupportVectorMachines,HowtoRepresentTextsinInputSpace?,MachineLear ning,Vol.46, pp.423{444,2002. [65]A.McCallumandKamalNigam,AComparisonofEventModel sforNaive BayesTextClassifcation,InProceedingsoftheAAAI-98wor kshoponLearning forTextCategorization,1998.

PAGE 123

109 [66]T.Mitchel,MachineLearning,McGraw-Hill,NewYork,N Y,1997. [67]D.Mladenic,FeatureSubsetSelectioninText-Learnin g,InProceedingsofthe TenthEuropeanConferenceonMachineLearning,pp.95{100, London,UK, 1998. [68]S.Mukherjee,P.Tamayo,J.P.Mesirov,D.Slonim,A.Ver ri,andT.Poggio, Supportvectormachineclassifcationofmicroarraydata,T echnicalReport182, AIMemo1676,CBCL,MIT,1999. [69]K.Murty,LinearProgramming,JohnWiley&Sons,Inc.,N ewYork,NY, 1976. [70]K.Nigam,J.LaertyandA.McCallum,UsingMaximumEntr opyforText Classifcation,InIJCAI'99WorkshoponInformationFilter ing,1999. [71]M.A.Newton,C.M.Kendziorski,C.S.Richmond,F.R.Bla ttnerandK.W.Tsui, OnDierentialVariabilityofExpressionRatios:Improvin gStatisticalInference aboutGeneExpressionChangesfromMicroarrayData,Journa lofComputationalBiology,Vol.8,No.1,pp.37{52,2001. [72]A.Ng,M.Jordan,OnDiscriminativevs.GenerativeClas sifers:AComparison ofLogisticRegressionandNaiveBayes,InNeuralInformati onProcessing Systems,pp.841{848,2001. [73]E.Osuna,R.FreundandF.Girosi,SupportVectorMachin es:Trainingand Applications,TechnicalReportAIM-1602,MITA.I.Lab.,19 96. [74]P.Pavlidis,J.Weston,J.Cai,andW.Grundy,GeneFunct ionalAnalysisfrom HeterogeneousData,RECOMB,pp.249{255,2001. [75]J.Rennie,ImprovingMulti-classTextClassifcationw ithNaiveBayes,AI TechnicalReport2001-004,MassachusettsInstituteofTec hnology,September, 2001. [76]J.Rennie,L.Shih,J.Teevan,D.Karger,TacklingthePo orAssumptions ofNaiveBayesTextClassifers,ProceedingsoftheTwentiet hInternational ConferenceonMachineLearning(ICML-2003),WashingtonDC ,2003. [77]R.M.Rifkin,EverythingOldIsNewAgain:AFreshLookat HistoricalApproachesinMachineLearning,Ph.DDissertation,SloanSch oolofManagement Science,MIT,September,2002 [78]I.Rish,AnEmpiricalStudyoftheNaiveBayesClassifer ,InProceedingsof IJCAI-01WorkshoponEmpiricalMethodsinArtifcialIntell igence,2001. [79]G.Salton,AutomaticTextProcession:TheTransformat ion,Analysis,and RetrievalofInformationbyComputer,Addison-Wesley,Rea ding,PA,1989.

PAGE 124

110 [80]S.Sarawagi,S.ChakrabartiandS.Godbole,Cross-trai ning:LearningProbabilisticMappingsBetweenTopics,InProceedingsofthenin thACMSIGKDD internationalconferenceonKnowledgediscoveryanddatam ining,Washington, D.C.pp.177{186,2003 [81]A.Shashua,OntheEquivalencebetweentheSupportVect orMachinesfor ClassifcationandsparsifedFisher'sLinearDiscriminant ,NeuralProcessing Letter9(2):129{139,1999. [82]B.Scholkopf,SupportVectorLearning,Ph.DDissertat ion,PublishedbyR. OldenbourgVerlag,Munich,Germany,1997. [83]B.Scholkopf,Tutorial:SupportVectorLearning,DAGM '99,Bonn,Germany, September,1999. [84]B.Scholkopf,StatisticalLearningandKernelMethods ,MicrosoftTechnical report,MSR-TR-2000-23,2000. [85]D.Sivia,DataAnalysisABayesianTutorial,OxfordUni versityPressInc., NewYork,1996. [86]D.Slonim,P.Tamayo,J.Mesirov,T.Golub,andE.Lander ,ClassPrediction andDiscoveryUsingGeneExpressionData,Proceedingsofth eFourthInternationalConferenceonComputationalMolecularBiology,R ECOMB2000,pp. 263{272,Tokyo,Japan,April,2000. [87]P.Tan,M.Steinbach,andV.Kumar,IntroductiontoData Mining,Preprint, 2003. [88]STheodoridisandKKoutroumbas,PatternRecognition, secondEdition,AcademicPress,SanDiego,CA,2003. [89]V.Vapnik,TheNatureofStatisticalLearningTheory,S pringer-Verlag,New York,NY,1995. [90]V.Vapnik,StatisticalLearningTheory,Wiley,NewYor k,1998. [91]V.Vapnik,O.Chapelle,Boundsonerrorexpectationfor supportvectormachine,in:ASmola,P.Bartlett,B.Scholkopf,andD.Schuurm ans(Eds.), AdvancesinLargeMarginClassifers,MITPres,Cambridge,M A,1999. [92]G.Wahba,Y.Lin,andH.Zhang,GeneralizedApproximate CrossValidation forSupportVectorMachinesin:A.Smola,P.Bartlett,B.Sch olkopfandD. Schuurmans(Eds),AdvancesinLargetMarginClassifersMIT Press,Cambridge,MA,1999. [93]G.WebbandM.Pazzani,AdjustedProbabilityNaiveBaye sInduction,Inthe ProceedingsoftheEleventhAustralianJointConf.onArtif cialIntelligence, WorldScientifc,Singapore,1998.

PAGE 125

111 [94]J.Weston,S.Mukherjee,O.Chapelle,M.Pontil,T.Pogg io,andV.Vapnik, FeatureSelectionforSVMs,AdvancesinNeuralInformation ProcessingSystems,Vol.13,pp.668{674,2000. [95]Y.YangandJ.Pedersen,AComparativeStudyonFeatureS electioninText Categorization,IntheProceedingsofICML-97,Fourteenth InternationalConferenceonMachineLearning,1997. [96]Y.YangandG.Webb,AComparativeStudyofDiscretizati onMethodsfor Naive-BayesClassifers,InProceedingsofPKAW2002,The20 02PacifcRim KnowledgeAcquisitionWorkshop,pp.159{173,Tokyo,Japan ,2002. [97]Y.YangandG.Webb,DiscretizationforNaive-BayesLea rning:ManagingDiscretizationBiasandVariance,TechnicalReport2003/131, SchoolofComputer ScienceandSoftwareEngineering,MonashUniversity,2003 [98]C.Yeang,S.Ramaswamy,P.Tamayo,S.Mukerjee,R.Rifki n,M.Angelo,M. Reich,E.Lander,J.Mesirov,andT.Golub,MolecularClassi fcationofMultiple TumorTypes,Bioinformatics,Vol.17,Suppl.1,pp.S316-32 2,2001. [99]E.Youn,FeatureSelectioninSupportVectorMachines, Master'sThesis, ComputerandInformationScienceandEngineering,Univers ityOfFlorida, May,2002.

PAGE 126

BIOGRAPHICALSKETCH EunSeogYounwasbornApril18,1968,inHwasoon,Korea.Here ceivedhis BachelorofSciencedegreeinindustrialengineeringfromH anyangUniversityinSeoul, Korea,andhisMasterofSciencedegreeinindustrialengine eringfromtheUniversity ofWisconsin-Madison.Hisspecializationwasoperationsr esearch,whichincluded mathematicalprogrammingandtheprobabilitymodeling. Heenteredthegraduateprogramincomputerandinformation scienceandengineeringattheUniversityofFloridainGainesvilleinAug ust1999.Hereceivedhis MasterofSciencedegreeundertheguidanceofDr.LiMinFu.H ismaster'sthesis researchfocusedonfeatureselectioninsupportvectormac hines.Heiscurrentlya Ph.D.candidateunderthesupervisionofDr.LiMinFu.HisPh .D.researchincludes microarrayanalysisandtextdataminingusingsupportvect ormachinesandother lineardiscriminantanalysis. 112