Supervised Machine Learning Models for Feature Selection and Classification on High Dimensional Datasets

MISSING IMAGE

Material Information

Title:
Supervised Machine Learning Models for Feature Selection and Classification on High Dimensional Datasets
Physical Description:
1 online resource (132 p.)
Language:
english
Creator:
Pappu, Vijay Sunder Naga
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Industrial and Systems Engineering
Committee Chair:
PARDALOS, PANAGOTE M
Committee Co-Chair:
BOGINSKIY, VLADIMIR L
Committee Members:
MOMCILOVIC,PETAR
RANGARAJAN,ANAND

Subjects

Subjects / Keywords:
cancer -- classification -- dimensions -- feature selection -- subspaces
Industrial and Systems Engineering -- Dissertations, Academic -- UF
Genre:
Industrial and Systems Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
High dimensional datasets are currently prevalent in many applications due to significant advances in technology over the past decade. High dimensional datasets are generally characterized by a large number of features and a relatively small number of samples. Classification tasks on such datasets pose significant challenges to standard statistical methods and render many existing classification techniques impractical due to the curse of dimensionality. Classification performance is improved by reducing the dimensionality using feature selection and/or feature extraction. In this work, we focus on building scalable, efficient classification models for high dimensional datasets that are also able to extract important features in addition to accurately predicting the class of unknown samples. In this regard, we begin with an overview of feature selection techniques and classification models and then proceed to introduce variants of existing methods that improve their generalization ability on high dimensional datasets. We also study a high dimensional data application involving Raman spectroscopy and propose a novel hierarchical classification framework to classify breast cancer cells from non-cancer cells.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Vijay Sunder Naga Pappu.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: PARDALOS, PANAGOTE M.
Local:
Co-adviser: BOGINSKIY, VLADIMIR L.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0046266:00001




Full Text


SUPERVISED MACHINE LEARNING MODELS FOR FEATURE SELECTION AND CLASSIFICATION ON HIGH DIMENSIONAL DATASETS

By

VIJAY SUNDER NAGA PAPPU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Vijay Sunder Naga Pappu

Mom, Dad and Manasa – This is for you.

ACKNOWLEDGMENTS

I would like to acknowledge the support of several people that have made my Ph.D. possible at the University of Florida. Firstly, the emotional support and encouragement of my wife has been invaluable, as it helped me to realize my dream of a Ph.D. The patience and support of my parents has always been one of my assets. I would like to thank my academic advisor Dr. Panos M. Pardalos for providing me opportunities to work on interesting research problems. I would also like to thank Dr. Pando G. Georgiev for the technical discussions.

I have greatly enjoyed working with Mike, Vladimir, Paul, Angelina, Orestis and Jason on several challenging research projects. I would also like to acknowledge the valuable advice that I received from Syed, Dmytro and Petros during my doctoral program. I would also like to thank other friends from the department: Ayse, Chrysafis, Jose, Kun, Oleg, Deon and Jing. Further, I would like to extend my thanks to Anuj, Kasi, Sam, Prateek and Imran for their confidence in and support of me.

Finally, I would like to thank Dr. Cole Smith and the ISE department for admitting me into the doctoral program. Last, but certainly not the least, I would like to thank Dr. Anand Rangarajan, Dr. Petar Momcilovic and Dr. Vladimir Boginski for serving on my Ph.D. committee and providing valuable suggestions on improving my work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
  1.1 Challenges of High Dimensional Data Spaces
  1.2 Dimensionality Reduction
    1.2.1 Feature Extraction
    1.2.2 Feature Selection
  1.3 Motivation and Significance
  1.4 Thesis Goal and Structure
2 OVERVIEW OF FEATURE SELECTION TECHNIQUES
  2.1 Filter Methods
  2.2 Wrapper Methods
  2.3 Embedded Methods
3 OVERVIEW OF CLASSIFICATION TECHNIQUES
  3.1 Support Vector Machines
  3.2 Discriminant Functions
  3.3 Hybrid Classifiers
  3.4 Ensemble Classifiers
4 SPARSE PROXIMAL SUPPORT VECTOR MACHINES
  4.1 Proximal Support Vector Machines (PSVMs)
  4.2 Sparse Proximal Support Vector Machines (sPSVMs)
  4.3 Numerical Experiments
    4.3.1 Category I Datasets
    4.3.2 Category II Datasets
5 COMBINING FEATURE SELECTION AND SUBSPACE LEARNING
  5.1 The $l_{2,1}$-norm
  5.2 Joint Feature Selection and Subspace Learning
  5.3 Current Approach
  5.4 Alternate Approach
  5.5 Extending Principal Component Analysis to Feature Selection (JSPCA)
6 CONSTRAINED SUBSPACE CLASSIFIERS
  6.1 Local Subspace Classifier (LSC)
  6.2 Motivating Examples
  6.3 Constrained Subspace Classifier (CSC)
  6.4 Comparison between LSC and CSC
7 BREAST CELL CHARACTERIZATION USING RAMAN SPECTROSCOPY
  7.1 Experimental Methods
  7.2 Fisher-based Feature Selection (FFS)
  7.3 Classification
  7.4 Results and Discussion
8 CONCLUSION

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Datasets used in numerical experiments of sPSVMs
4-2 Test accuracy comparison of sPSVMs for Category I datasets
4-3 Top ten selected features for each class in colon dataset
4-4 Average Jaccard Index comparison of sPSVMs for Category I datasets
4-5 CV accuracy comparison of sPSVMs for Category II datasets
5-1 Datasets used in numerical experiments of JSPCA
5-2 Accuracy comparison between PCA and JSPCA for spambase dataset
5-3 Accuracy comparison between PCA and JSPCA for WPBC dataset
5-4 Accuracy comparison between PCA and JSPCA for ionosphere dataset
5-5 Accuracy comparison between PCA and JSPCA for mushroom dataset
6-1 Comparison between LSC and CSC on two example datasets
7-1 Characteristics of the five ATCC breast cell lines studied
7-2 Sensitivity, specificity and average classification accuracy of FFS-SVM
7-3 Performance comparison among SVM, PCA-SVM, PCA-LDA and FFS-SVM
7-4 The top 10 features selected by FFS for the four binary classification tasks

LIST OF FIGURES

4-1 Feature frequency plot obtained from sPSVMs for the Colon dataset
4-2 Test accuracies obtained from sPSVMs as a function of … for Category I datasets
4-3 Features selected from colon dataset as a function of B for different values of …
6-1 Data and corresponding subspaces from LSC and CSC in example 1
6-2 Data and corresponding subspaces from LSC and CSC in example 2
6-3 LOOCV accuracies as a function of … for different k obtained from CSC
7-1 Spectral overlay of MCF10A cell line
7-2 Spectral preprocessing of Raman spectra
7-3 Peak finding in FFS
7-4 Peak coalescing in FFS
7-5 Hierarchical clustering of mean spectra of five cell lines
7-6 Mean spectra of the five cell lines
7-7 Classification accuracy comparison among RF-SVM, FFS-SVM, and PCA-SVM

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SUPERVISED MACHINE LEARNING MODELS FOR FEATURE SELECTION AND CLASSIFICATION ON HIGH DIMENSIONAL DATASETS

By

Vijay Sunder Naga Pappu

December 2013

Chair: Panos M. Pardalos
Major: Industrial and Systems Engineering

High dimensional datasets are currently prevalent in many applications due to significant advances in technology over the past decade. High dimensional datasets are generally characterized by a large number of features and a relatively small number of samples. Classification tasks on such datasets pose significant challenges to standard statistical methods and render many existing classification techniques impractical due to the curse of dimensionality. Classification performance is improved by reducing the dimensionality using feature selection and/or feature extraction. In this work, we focus on building scalable, efficient classification models for high dimensional datasets that are also able to extract important features in addition to accurately predicting the class of unknown samples. In this regard, we begin with an overview of feature selection techniques and classification models and then proceed to introduce variants of existing methods that improve their generalization ability on high dimensional datasets. We also study a high dimensional data application involving Raman spectroscopy and propose a novel hierarchical classification framework to classify breast cancer cells from non-cancer cells.

CHAPTER 1
INTRODUCTION

In the past decade, technological advances have had a profound impact on society and the research community [74]. High volume throughput data can be collected simultaneously and at relatively low cost in many applications. Often, each observation is characterized by thousands of variables or features. For example, in the medical domain, huge numbers of magnetic resonance images (MRI) and functional MRI data are collected for each subject, where the intensity at each pixel may constitute a feature. Furthermore, every sample in gene-expression microarray datasets consists of measurements from thousands of genes [28]. Various types of spectral measurements, including Mass Spectroscopy and Raman Spectroscopy, are very common in chemometrics, where the recorded spectra number well into the thousands [48]. In many such biomedical applications, the measurements tend to be very expensive and the number of samples for such datasets is on the order of tens, or maybe low hundreds. These datasets, often termed High Dimension Low Sample Size (HDLSS) datasets, are characterized by a large number of features p and a relatively small number of samples n, with p >> n [174]. Such high dimensional datasets arising in many fields create golden opportunities as well as significant challenges for the advancement of the mathematical sciences.

Classification is a supervised machine learning technique that maps some combination of input variables into predefined categorical outputs, also sometimes called class labels. Classification models infer the mapping function by learning from a training set of examples. Each training example is a pair consisting of an input vector and a desired class label. The inferred mapping function is then used to predict the class label of new examples. Classification problems occur in several fields of science and technology, such as discriminating cancerous cells from non-cancerous cells, web document classification, and categorizing images in remote sensing applications, among many others.

Several algorithms, starting with Neural Networks [73], Logistic Regression [90], Linear Discriminant Analysis [111], Support Vector Machines [165] and more recent ensemble methods like Boosting [54] and Random Forests [17], have been proposed to solve the classification problem in different contexts. However, classification tasks on HDLSS datasets pose significant challenges to standard statistical methods and render many existing classification techniques impractical [85]. Below, some of the inherent difficulties of high dimensional spaces that pose significant challenges to existing classification methods are discussed.

1.1 Challenges of High Dimensional Data Spaces

Poor generalization ability: One challenge for modeling in high dimensional data spaces is to avoid overfitting the training data [28]. Classification models, in addition to performing well on the training set, are required to perform equally well on an independent testing set, and hence it is imperative for these models to have good generalization ability. However, the small number of samples in high dimensional settings often causes the classification model to overfit the training data, leaving the model with poor generalization ability. The accuracy of classification algorithms tends to deteriorate in high dimensions due to a phenomenon called the curse of dimensionality [94]. This phenomenon is illustrated for classification by an example in [161]. Consider two equally probable, normally distributed classes with common variance. Let the mean of class 1 be $-1/k^{1/2}$ and the mean of class 2 be $1/k^{1/2}$, where k = 1, 2, 3, ... indicates the feature index. Thus the discriminative power of each feature decreases as k increases. The author evaluated error rates for the Bayes decision rule as a function of k. The variance is assumed to be known while the class means are estimated from the finite dataset. The following observations were made:

- as the number of features tends to infinity, the test error reduces to random guessing;
- a finite number of features is needed to achieve the best test error; and
- the optimal dimensionality increases with increasing sample size.

Thus, the ability of an algorithm to converge to a true model deteriorates rapidly with increasing feature dimensionality, leading to poor generalization ability.

Geometrical distortion of high dimensional data spaces: Several earlier studies have revealed the geometrical distortion of high dimensional data spaces. The authors in [8] attempted to quantify this distortion. They argued that, under reasonable assumptions, the ratio of distances between the nearest and the farthest point to a given target is almost 1. They also show that this ratio remains constant for a wide variety of data distributions and distance functions. In such a case, the nearest neighbor problem becomes ill-defined, as the concept of proximity may not be meaningful from a qualitative perspective. A similar study in [94] shows that the distance of a randomly selected point in a hypercube with diagonal length 1 from one corner approaches a constant value of 0.58 as the dimension p increases, showing again that distance-based classifiers like kNN might perform poorly.

Unreliable parameter estimation: Several classification models, like discriminant analysis and its variants, require knowing the class covariances a priori in order to establish a discriminant rule for classification problems. However, in many problems, since the covariance is not known a priori, researchers often attempt to estimate the covariance from the sample data. Such estimates for high dimensional datasets are unreliable, since an unknown covariance matrix for observations on p variables in theory has p(p+1)/2 unknown parameters and the sample covariance matrix is a poor estimator when p is large. Additionally, since the sample covariance matrix is rank-deficient, estimating quantities like its inverse would be unreliable, leading to unpredictable performance on independent test datasets.

1.2 Dimensionality Reduction

One common approach to address the aforementioned challenges involves reducing the dimensionality of the dataset, either using feature extraction [104] and/or feature selection [142] prior to classification.

These dimensionality reduction techniques decrease the complexity of the classification model and thus improve the classification performance [142].

1.2.1 Feature Extraction

Feature extraction, also known as subspace learning, transforms the input data into a set of meta-features which extract relevant information from the input data for classification. Several feature extraction techniques, such as Principal Component Analysis, have been successfully applied in many high dimensional data applications [107]. However, in many biomedical applications, these techniques offer limited model interpretability for extracting biomarker-type information, as they transform the input space during the process [48].

1.2.2 Feature Selection

Feature selection techniques, in contrast to feature extraction techniques, do not alter the input data space, but merely select a subset of features based on some optimality criteria [142]. Though feature extraction techniques generalize better when combined with classification models, feature selection techniques offer the advantage of interpretability by a domain expert, as they preserve the original feature space. In the context of classification, feature selection methods can be broadly organized into three categories: filter methods, wrapper methods and embedded methods [142].

Filter methods assess feature relevance from the intrinsic properties of the data. In most cases the features are ranked using a feature relevance score, where a high score indicates greater discriminative power. The low-scoring features are generally removed. The reduced data obtained from considering only the selected features are then presented as input to the classification algorithm. Additionally, feature ranking helps in limiting the number of features for classification depending on the sample sizes, thereby avoiding the curse of dimensionality and poor generalization ability. Contrary to filter methods, the wrapper methods integrate the classifier hypothesis search and the feature subset search.

In this framework, each method is tailored to a specific classification algorithm, where a search procedure in the feature space is first defined, and various subsets of features are generated and evaluated by training and testing the classification model [92, 142]. Similar to wrapper methods, embedded methods are also specific to a given classification model and integrate the search for an optimal subset of features with classifier building.

Filter methods offer several advantages, including scalability to high dimensional datasets, computational efficiency, and independence from the classification algorithm. This independence offers the advantage of performing feature selection only once and then evaluating different classifiers. The wrapper and embedded methods consider feature dependencies and include interactions between the feature subset search and learning model selection. However, a common drawback of both these methods is that they have an increased risk of overfitting and can also be computationally intensive depending on the computational complexity of the classification model.

1.3 Motivation and Significance

Due to the aforementioned challenges in high dimensional datasets, several standard classification methods like logistic regression, linear discriminant analysis, kNN classifiers, etc., are known to perform poorly on different classification tasks. However, with the continuous advancements in technology, the measurements collected from each sample will only increase, thereby making it imperative to build scalable and efficient classification models and algorithms that perform well on such datasets. Additionally, in many biomedical applications, in addition to building an accurate classification model with good generalization ability, it is also important to understand the features that contribute to the differences among the classes [48]. For example, while building a classification model for classifying normal patients from cancer patients using gene expression data, it is important to extract and analyze the most important genes that contribute to the differences among the classes.

Such an analysis could lead to the discovery of biomarkers in many biomedical applications.

1.4 Thesis Goal and Structure

In this work, our goal is to build scalable, efficient classification models for high dimensional datasets that are also able to extract important features in addition to accurately predicting the class of unknown samples. The rest of the work is organized as follows: Chapters 2 and 3 discuss standard feature selection and classification techniques and their applications to high dimensional datasets. In Chapter 4, we extend the classification technique proximal support vector machines to perform feature selection on high dimensional datasets. Chapter 5 explores the possibility of combining feature selection and feature extraction to take advantage of either technique. Constrained subspace classifiers for high dimensional datasets are discussed in Chapter 6. A high dimensional data application of characterizing breast cancer cells using Raman Spectroscopy is discussed in Chapter 7. We discuss further research directions and provide some conclusions in Chapter 8.
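To make the curse-of-dimensionality example from Section 1.1 concrete, the following is a minimal simulation sketch (not from the dissertation; the sample sizes, the number of features, and the use of scikit-learn's LDA are illustrative assumptions). It draws two Gaussian classes whose k-th feature means are ±1/√k and tracks the test error of a fitted classifier as progressively weaker features are included.

```python
# Minimal sketch of the Trunk-style example from Section 1.1 (illustrative only).
# Two Gaussian classes with identity covariance; the k-th feature means are -1/sqrt(k)
# and +1/sqrt(k), so each added feature carries less discriminative power.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
p_max, n_train, n_test = 200, 50, 2000
mu = 1.0 / np.sqrt(np.arange(1, p_max + 1))         # per-feature class means (+/- mu)

def sample(n):
    y = rng.integers(0, 2, size=n) * 2 - 1           # labels in {-1, +1}
    X = rng.standard_normal((n, p_max)) + np.outer(y, mu)
    return X, y

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)

for p in (2, 10, 50, 200):                           # use only the first p features
    clf = LinearDiscriminantAnalysis().fit(X_tr[:, :p], y_tr)
    err = 1.0 - clf.score(X_te[:, :p], y_te)
    print(f"p = {p:4d}  test error = {err:.3f}")
```

With a fixed training size, the estimated test error typically improves up to some finite p and then degrades as weaker features are added, mirroring the observations listed above.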

CHAPTER 2
OVERVIEW OF FEATURE SELECTION TECHNIQUES

Recently, feature selection has been an active area of research among many researchers due to tremendous advances in technology enabling the collection of samples with thousands of measurements in a single experiment. Feature selection techniques select a subset of features based on some optimality criteria [142]. Irrespective of the measure of optimality, the selected subset of features should ideally possess the following characteristics [66]:

- an optimally minimal number of features should be selected to accurately predict the class of unknown samples;
- the prediction accuracy of the classifier run on data with only the selected subset of features should be better than running the classifier on data containing all features;
- the resulting class distribution from the selected features should be as close as possible to the original class distribution given all feature values.

Based on the above feature characteristics, it is obvious that irrelevant features would not be part of the optimal set of features, where an irrelevant feature with respect to the target class is defined as follows [173]:

Let F be the full set of features and C be the target class. Define $F_i \in F$ and $S_i = F \setminus \{F_i\}$.

Definition 1 (Irrelevance): A feature $F_i$ is irrelevant if and only if:

$P(C \mid F_i, S'_i) = P(C \mid S'_i), \quad \forall S'_i \subseteq S_i$   (2-1)

Irrelevance simply means that the feature is not necessary for classification, since the class distribution resulting from any subset of other features does not change after eliminating the irrelevant feature.

The definition of relevance is not as straightforward as irrelevance. There have been several definitions for relevance in the past; however, Kohavi et al. [92] argued that the earlier definitions were not adequate to accurately classify the features. Hence, they defined relevance in terms of an optimal Bayes classifier.

A feature $F_i$ is strongly relevant if removal of $F_i$ alone results in a decrease of the classification performance of an optimal Bayes classifier. A feature $F_i$ is weakly relevant if it is not strongly relevant and there exists a subset of features $S'_i$ such that the performance of a Bayes classifier on $S'_i$ is worse than the performance on $S'_i \cup \{F_i\}$.

Definition 2 (Strong relevance): A feature $F_i$ is strongly relevant if and only if:

$P(C \mid F_i, S_i) \neq P(C \mid S_i)$   (2-2)

Definition 3 (Weak relevance): A feature $F_i$ is weakly relevant if and only if:

$P(C \mid F_i, S_i) = P(C \mid S_i) \quad \text{and} \quad \exists S'_i \subset S_i : P(C \mid F_i, S'_i) \neq P(C \mid S'_i)$   (2-3)

Strong relevance implies that the feature is indispensable and is required for an optimal set, while weak relevance implies that the feature may sometimes be required to improve the prediction accuracy. From this, one may conclude that the optimal set should consist of all the strongly relevant features, some of the weakly relevant features and none of the irrelevant features. However, the definitions do not explicitly mention which of the weakly relevant features should be included and which of them excluded. Hence, Yu et al. [173] claim that the weakly relevant features should be further classified to discriminate between the redundant features and the non-redundant features, since earlier research efforts showed that, along with irrelevant features, redundant features also adversely affect the classifier performance. Before we provide definitions, we introduce another concept, a feature's Markov Blanket, as defined by Koller et al. [93].

Definition 4 (Markov Blanket): Given a feature $F_i$, let $M_i \subset F$ ($F_i \notin M_i$). $M_i$ is said to be a Markov blanket for $F_i$ if and only if:

$P(F \setminus (M_i \cup \{F_i\}), C \mid F_i, M_i) = P(F \setminus (M_i \cup \{F_i\}), C \mid M_i)$   (2-4)

The Markov blanket $M_i$ can be imagined as a blanket for the feature $F_i$ that subsumes not only the information that $F_i$ possesses about the target class C, but also about other features. It is also important to note that strongly relevant features cannot have a Markov blanket. Since the irrelevant features do not contribute to classification, Yu et al. [173] further classified the weakly relevant features into either redundant or non-redundant using the concept of the Markov blanket.

Definition 5 (Redundant feature): Given a set of current features G, a feature is redundant, and hence should be removed from G, if and only if it has a Markov blanket within G.

From the above definitions, it is clear that the optimal set of features should consist of all of the strongly relevant features and the weakly relevant non-redundant features. However, an exhaustive search over the feature space is intractable, since there are $2^p$ possibilities with p being the number of features. Hence, over the past decade, several heuristic and approximate methods have been developed to perform feature selection. Feature selection techniques, in the context of classification, can be organized into three categories: filter methods, wrapper methods and embedded methods [142].

2.1 Filter Methods

Filter methods assess feature relevance from the intrinsic properties of the data. In most cases the features are ranked using a feature relevance score, where a high score indicates greater discriminative power. The low-scoring features are generally removed. The reduced data obtained from considering only the selected features are then presented as input to the classification algorithm. Filter techniques offer several advantages, including scalability to high dimensional datasets, computational efficiency, and independence from the classification algorithm. This independence offers the advantage of performing feature selection only once and then evaluating different classifiers.
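As a concrete illustration of the filter paradigm just described (score each feature, keep the highest-scoring ones, pass the reduced data to any classifier), here is a minimal sketch using scikit-learn; the synthetic dataset, the ANOVA F-test score, and the choice of k = 50 are illustrative assumptions, not choices made in the dissertation.

```python
# Minimal sketch of a univariate filter: score each feature independently,
# keep the top-k, and hand the reduced data to any downstream classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=80, n_features=2000, n_informative=20,
                           random_state=0)          # synthetic HDLSS-style data

selector = SelectKBest(score_func=f_classif, k=50)  # ANOVA F-test score per feature
X_reduced = selector.fit_transform(X, y)

top = np.argsort(selector.scores_)[::-1][:10]       # highest-scoring feature indices
print("reduced shape:", X_reduced.shape)
print("top 10 feature indices:", top)
```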

Some univariate filter techniques perform simple hypothesis testing, like the Chi-Square ($\chi^2$) test or the t-test, to eliminate irrelevant features, while other techniques estimate information theoretic measures like information gain and gain ratio to perform the filtering process [7]. Although these techniques are simple, fast and highly scalable, they ignore feature dependencies, which may lead to worse classification performance compared with other feature selection techniques. In order to account for feature dependencies, a number of multivariate filter techniques were introduced. The multivariate filter methods range from accounting for simple mutual interactions [12] to more sophisticated solutions. One such technique, called Correlation-based Feature Selection (CFS), introduced by Hall (1999) [69], evaluates a subset of features by considering the discriminative power of each feature in addition to the degree of redundancy between them:

$CFS_S = \dfrac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}$   (2-5)

where $CFS_S$ is the score of a feature subset S containing k features, $\bar{r}_{cf}$ is the average feature-to-class correlation ($f \in S$), and $\bar{r}_{ff}$ is the average feature-to-feature correlation. Unlike the univariate filter methods, CFS presents a score for a subset of features. Since exhaustive search is intractable, several heuristic techniques like greedy hill-climbing or best-first search have been proposed to find the feature subset with the highest CFS score.
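A minimal sketch of the CFS merit in (2-5) follows. It is illustrative only: absolute Pearson correlations stand in for both $\bar{r}_{cf}$ and $\bar{r}_{ff}$ (Hall's original method uses symmetrical uncertainty on discretized data), and the toy data and greedy forward search are assumptions rather than the dissertation's procedure.

```python
# Minimal sketch of the CFS merit in (2-5) for a candidate feature subset.
import numpy as np

def cfs_merit(X, y, subset):
    """Merit = k * mean|corr(f, y)| / sqrt(k + k*(k-1) * mean|corr(f, f')|)."""
    k = len(subset)
    Xs = X[:, subset]
    r_cf = np.mean([abs(np.corrcoef(Xs[:, j], y)[0, 1]) for j in range(k)])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(Xs[:, a], Xs[:, b])[0, 1])
                    for a in range(k) for b in range(a + 1, k)])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# Toy usage: greedy forward search for a subset with high merit.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 30))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(60) > 0).astype(float)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):
    best = max(remaining, key=lambda j: cfs_merit(X, y, selected + [j]))
    selected.append(best)
    remaining.remove(best)
print("greedily selected features:", selected)
```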

Another important multivariate filter method, called Markov blanket filtering, was introduced by Koller et al. [93]. The idea here is that once we find a Markov blanket of feature $F_i$ in a feature set G, we can safely remove $F_i$ from G without compromising the class distribution. Since estimating the Markov blanket for a feature is hard, Koller et al. [93] propose a simple iterative algorithm that starts with the full feature set F = G and then repeatedly eliminates one feature at a time, based on the cross-entropy of each feature, until a pre-selected number of features have been removed.

Koller et al. [93] further prove that in such a sequential elimination process, in which unnecessary features are removed one by one, a feature tagged as unnecessary based on the existence of a Markov blanket $M_i$ remains unnecessary in later stages when more features have been removed. Also, the authors claim that the process removes all of the irrelevant as well as redundant features. Several variations of the Markov blanket filtering method, like the Grow-Shrink (GS) algorithm, Incremental Association Markov Blanket (IAMB), Fast-IAMB, and other recent refinements, have been proposed by other authors [57]. There are other interesting multivariate filter methods, like the Fast Correlation-based Feature Selection (FCBF) [172], Minimum Redundancy-Maximum Relevance (MRMR) [41], and Uncorrelated Shrunken Centroid (USC) [171] algorithms, proposed in the literature.

2.2 Wrapper Methods

As seen in the earlier section, filter methods treat the problem of finding a good feature subset independently of the classifier building step. Wrapper methods, on the other hand, integrate the classifier hypothesis search within the feature subset search. In this framework, each method is tailored to a specific classification algorithm, where a search procedure in the feature space is first defined, and various subsets of features are generated and evaluated by training and testing the classification model [92, 142]. Advantages of wrapper methods include consideration of feature dependencies and the ability to include interactions between the feature subset search and model selection. A common drawback is a higher risk of overfitting than the filter methods; they can also be computationally intensive if the classification model has a high computational cost.

The wrapper methods generally employ a search algorithm in order to search through the space of all feature subsets. The search algorithm is wrapped around the classification model, which provides a feature subset that can be evaluated by the classification algorithm. As mentioned earlier, since an exhaustive search is not practical, heuristic search methods are used to guide the search. These search methods can be broadly classified as deterministic and randomized search algorithms.

Deterministic search methods include a set of sequential search techniques like Sequential Forward Selection [89], Sequential Backward Selection [89], Plus-l Minus-r Selection [50], Bidirectional Search, and Sequential Floating Selection [134], where features are either sequentially added or removed based on some criterion measure. Randomized search algorithms include popular techniques like Genetic Algorithms [33], Simulated Annealing [88], and Randomized Hill Climbing [149].

2.3 Embedded Methods

Embedded methods integrate the search for an optimal subset of features into the classifier construction and can be seen as a search in the combined space of feature subsets and hypotheses. Similar to wrapper methods, embedded approaches are also specific to a given learning algorithm and include the interaction with the classification model, but unlike the wrapper methods, they have the advantage of being less computationally intensive [142].

Recently, embedded methods have gained importance among the research community due to these advantages. The embedded ability of several classifiers to eliminate features irrelevant to classification, and thus select a subset of features, has been exploited by several authors. Examples include the use of random forests (discussed later) in an embedded way to calculate the importance of each feature [39, 82]. Another line of embedded feature selection techniques uses the weights of each feature in linear classifiers, such as SVM [67] and logistic regression [108]. These weights are used as a measure of relevance of each feature, and thus allow for the removal of features with very small weights. Also, recently, regularized classifiers like Lasso and elastic-net have been successfully employed to perform feature selection in microarray gene analysis [175]. Another interesting technique, called feature selection via sparse SVM, has recently been proposed by Tan et al. [153]. This technique, called the Feature Generating Machine (FGM), adds a binary variable for every feature in the sparse formulation of SVM via the $l_0$-norm, and the authors propose a cutting plane algorithm combined with multiple kernel learning to efficiently solve the convex relaxation of the optimization problem.
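To illustrate the embedded strategy described above (the weights of a sparse linear classifier used as relevance measures), here is a minimal sketch with an $l_1$-penalized linear SVM from scikit-learn. The dataset and the value C = 0.1 are placeholders, and this is not the FGM method of Tan et al.; it simply shows how an $l_1$ penalty zeroes out weights during classifier training.

```python
# Minimal sketch of embedded feature selection: an l1-penalized linear SVM drives
# many weights to exactly zero, and only the surviving features are kept.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=500, n_informative=15,
                           random_state=0)

sparse_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000)
selector = SelectFromModel(sparse_svm).fit(X, y)   # keeps features with non-zero weight
X_sel = selector.transform(X)
print("features kept:", X_sel.shape[1], "out of", X.shape[1])
```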

CHAPTER 3
OVERVIEW OF CLASSIFICATION TECHNIQUES

In this chapter, we present standard classification techniques and discuss their extensions for application to high dimensional datasets. We begin the discussion with Support Vector Machines and their variants, proceed to discriminant functions and their modifications for high dimensional datasets, and then introduce hybrid classifiers followed by ensemble methods and their applications.

3.1 Support Vector Machines

Hard-Margin Support Vector Machines: In the last decade, Support Vector Machines (SVM) [165] have attracted the attention of many researchers, with successful application to several classification problems in bioinformatics, finance and remote sensing, among many others [23, 130, 160]. Standard SVM construct a hyperplane, also known as a decision boundary, that best divides the input space $\mathcal{X}$ into two disjoint regions. The hyperplane $f: \mathcal{X} \rightarrow \mathbb{R}$ is estimated from the training set S. The class membership for an unknown sample $x \in \mathcal{X}$ can be based on the classification function g(x) defined as:

$g(x) = \begin{cases} -1, & f(x) < 0 \\ \ \ 1, & f(x) > 0 \end{cases}$   (3-1)

Consider a binary classification problem with the training set S defined as:

$S = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}, \quad i = 1, 2, \ldots, n$   (3-2)

where $y_i$ is either -1 or 1 depending on the class that each $x_i$ belongs to. Assume that the two classes are linearly separable, and hence there exists at least one hyperplane that separates the training data correctly. A hyperplane parametrized by the normal vector $w \in \mathbb{R}^p$ and bias $b \in \mathbb{R}$ is defined as:

$\langle w, x \rangle - b = 0$   (3-3)

where the inner product $\langle \cdot, \cdot \rangle$ is defined on $\mathbb{R}^p \times \mathbb{R}^p \rightarrow \mathbb{R}$.
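For concreteness, a minimal sketch of the decision rule (3-1)-(3-3) for a fixed hyperplane; the weight vector, bias, and sample points are arbitrary illustrative values.

```python
# Minimal sketch of the linear decision rule g(x) from (3-1)-(3-3):
# the hyperplane <w, x> - b = 0 splits R^p into two labeled half-spaces.
import numpy as np

w = np.array([2.0, -1.0])          # illustrative normal vector
b = 0.5                            # illustrative bias
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.5]])

f = X @ w - b                      # f(x) = <w, x> - b
g = np.where(f > 0, 1, -1)         # class assignment as in (3-1)
print(list(zip(f.round(2), g)))
```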

The training set S satisfies the following linear inequality with respect to the hyperplane:

$y_i(\langle w, x_i \rangle - b) \geq 1 \quad \forall i = 1, 2, \ldots, n$   (3-4)

where the parameters w and b are chosen such that the distance between the hyperplane and the closest point is maximized. This geometrical margin can be expressed by the quantity $\frac{1}{\|w\|}$. Hence, for a linearly separable set of training points, SVM can be formulated as a linearly constrained quadratic convex optimization problem given as:

$\underset{w, b}{\text{minimize}} \ \ \frac{\|w\|_2^2}{2} \quad \text{subject to} \ \ y_i(\langle w, x_i \rangle - b) \geq 1 \quad \forall i = 1, 2, \ldots, n$   (3-5)

This classical convex optimization problem can be rewritten (using the Lagrangian formulation [14]) into the following dual problem:

$\underset{\alpha \in \mathbb{R}^n}{\text{maximize}} \ \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \ \ \sum_{i=1}^{n} \alpha_i y_i = 0, \ \ \alpha_i \geq 0, \ i = 1, 2, \ldots, n$   (3-6)

where the Lagrange multipliers $\alpha_i$ ($i = 1, 2, \ldots, n$) expressed in (3-6) can be estimated using quadratic programming (QP) methods [37]. The optimal hyperplane $f^*$ can then be estimated using the Lagrange multipliers obtained from solving (3-6) and the training samples, i.e.,

$f^*(x) = \sum_{i \in S^*} \alpha_i y_i \langle x, x_i \rangle - b$   (3-7)

where $S^*$ is the subset of training samples, called support vectors, that correspond to non-zero Lagrange multipliers $\alpha_i$. Support vectors include the training points that exactly satisfy the inequality in (3-5) and lie at a distance equal to $\frac{1}{\|w\|}$ from the optimal separating hyperplane. Since the Lagrange multipliers are non-zero only for the support vectors and zero for other training samples, the optimal hyperplane in (3-7) effectively consists of contributions from the support vectors.

It is also important to note that the Lagrange multipliers $\alpha_i$ qualitatively provide the relative weight of each support vector in determining the optimal hyperplane.

The convex optimization problem in (3-5) and the corresponding dual in (3-6) converge to a global solution only if the training set is linearly separable. This SVM is called the hard-margin support vector machine.

Soft-Margin Support Vector Machines: The maximum-margin objective introduced in the previous subsection to obtain the optimal hyperplane is susceptible to the presence of outliers. Also, it is often difficult to adhere to the assumption of linear separability in real world datasets. Hence, in order to handle non-linearly separable datasets as well as be less sensitive to outliers, soft-margin support vector machines are proposed. The objective cost function in (3-5) is modified to represent two competing measures, namely margin maximization (as in the case of linearly separable data) and error minimization (to penalize the wrongly classified samples). The new cost function is defined as:

$\Phi(w) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n} \xi_i$   (3-8)

where $\xi$ is the slack variable introduced to account for the non-separability of the data, and the constant C represents a regularization parameter that controls the penalty assigned to errors. The larger the C value, the higher the penalty associated with misclassified samples. The minimization of the cost function expressed in (3-8) is subject to the following constraints:

$y_i(\langle w, x_i \rangle - b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i = 1, 2, \ldots, n$   (3-9)

The convex optimization problem can then be formulated using (3-8) and (3-9) for the non-linearly separable data as:

$\underset{w, b, \xi}{\text{minimize}} \ \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n} \xi_i \quad \text{subject to} \ \ y_i(\langle w, x_i \rangle - b) \geq 1 - \xi_i, \ \ \xi_i \geq 0, \ \ \forall i = 1, 2, \ldots, n$   (3-10)

The optimization problem in (3-10) accounts for the outliers by adding a penalty term $C\xi_i$ for each outlier to the objective function. The corresponding dual to (3-10) can be written using the Lagrange formulation as:

$\underset{\alpha \in \mathbb{R}^n}{\text{maximize}} \ \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \ \ \sum_{i=1}^{n} \alpha_i y_i = 0, \ \ 0 \leq \alpha_i \leq C, \ i = 1, 2, \ldots, n$   (3-11)

The quadratic optimization problem in (3-11) can be solved using standard QP techniques [37] to obtain the Lagrange multipliers $\alpha_i$.
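For reference, a minimal sketch of solving the soft-margin problem (3-10)/(3-11) with an off-the-shelf solver (scikit-learn's SVC); the toy data and the choice C = 1.0 are arbitrary. The fitted model exposes the support vectors and the products $\alpha_i y_i$ from the dual; note scikit-learn writes the decision function as $\langle w, x\rangle + \text{intercept}$, so b above corresponds to the negated intercept.

```python
# Minimal sketch: solve the soft-margin SVM (3-10)/(3-11) with a linear kernel.
# support_vectors_ are the x_i with non-zero alpha_i; dual_coef_ holds alpha_i * y_i,
# whose magnitudes are bounded by the regularization constant C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (40, 2)), rng.normal(+1.0, 1.0, (40, 2))])
y = np.array([-1] * 40 + [+1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])
print("alpha_i * y_i bounded by C:", np.abs(clf.dual_coef_).max() <= 1.0)
print("w =", clf.coef_.ravel(), " b =", -clf.intercept_[0])
```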

Kernel Support Vector Machines: The idea of linear separation between two classes mentioned in the subsections above can be naturally extended to handle nonlinear separation as well. This is achieved by mapping the data through a particular nonlinear transformation into a higher dimensional feature space. Assuming that the data is linearly separable in this high dimensional space, a linear separation, similar to the earlier subsections, can be found. Such a hyperplane can be obtained by solving a dual problem similar to (3-11), replacing the inner products in the original space with inner products in the transformed space. However, an explicit transformation from the original space to the feature space could be expensive and at times infeasible as well. The kernel method [22] provides an elegant way of dealing with such transformations.

Consider a kernel function $K(\cdot, \cdot)$, satisfying Mercer's theorem, that equals an inner product in the transformed higher dimensional feature space [112], i.e.,

$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$   (3-12)

where $\phi(x_i)$ and $\phi(x_j)$ correspond to the mapping of the data points $x_i$ and $x_j$ from the original space to the feature space. There are several kernel functions defined in the literature that satisfy Mercer's conditions. One such kernel, called the Gaussian kernel, is given by:

$K(x_i, x) = \exp(-\gamma\|x_i - x\|^2)$   (3-13)

where $\gamma$ is a parameter inversely proportional to the width of the Gaussian radial basis function. Another extensively studied kernel is the polynomial function of order p, expressed as:

$K(x_i, x) = (\langle x_i, x \rangle + 1)^p$   (3-14)

Such kernel functions allow for efficient estimation of inner products in feature spaces without the explicit functional form of the mapping $\phi$. This elegant calculation of inner products in higher dimensional feature spaces, also called the kernel trick, considerably simplifies the solution of the dual problem. The inner products between the training samples in the dual formulation (3-11) can be replaced with a kernel function K and rewritten as:

$\underset{\alpha \in \mathbb{R}^n}{\text{maximize}} \ \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to} \ \ \sum_{i=1}^{n} \alpha_i y_i = 0, \ \ 0 \leq \alpha_i \leq C, \ i = 1, 2, \ldots, n$   (3-15)

The optimal hyperplane $f^*$ obtained in the higher dimensional feature space can be conveniently expressed as a function of the data in the original input space as:

$f^*(x) = \sum_{i \in S^*} \alpha_i y_i K(x_i, x) - b$   (3-16)

where $S^*$ is a subset of training samples with non-zero Lagrange multipliers $\alpha_i$. The shape of $f^*(x)$ depends on the type of kernel function adopted.
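A minimal sketch of the kernelized classifier (3-15)/(3-16) with the Gaussian kernel (3-13), with (C, γ) chosen by a cross-validated grid search (one of the parameter-selection strategies discussed next); the synthetic data and the parameter grid are illustrative assumptions.

```python
# Minimal sketch: RBF-kernel SVM as in (3-15)/(3-16), with (C, gamma) selected by
# a cross-validated grid search over an illustrative parameter range.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=30, n_informative=10,
                           random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best (C, gamma):", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
y_hat = search.predict(X[:5])   # predictions use f*(x) = sum_i alpha_i y_i K(x_i, x) - b
```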

It is important to note that the performance of kernel-based SVM is dependent on the optimal selection of multiple parameters, including the kernel parameters (e.g., the $\gamma$ and p parameters for the Gaussian and polynomial kernels, respectively) and the regularization parameter C. A simple and successful technique that has been employed involves a grid search over a wide range of the parameters. The classification accuracy of SVM for every pair of parameters is estimated using a leave-one-out cross validation technique, and the pair corresponding to the highest accuracy is chosen. Also, some interesting automatic techniques have been developed to estimate these parameters [25, 27]. They involve constructing an optimization problem that maximizes the margin as well as minimizes an estimate of the expected generalization error. Optimization of the parameters is then carried out using a gradient descent search over the space of the parameters. Recently, more heuristic-based approaches have been proposed to deal with this issue. A continuous version of Simulated Annealing (SA) called Hide and Seek SA was employed in [101] to estimate multiple parameters as well as select a subset of features to improve the classification accuracy. Similar approaches combining Particle Swarm Optimization (PSO) with SVM are proposed in [65] and [102]. Furthermore, a modified Genetic Algorithm (GA) was also implemented along with SVM to estimate the optimal parameters [76].

SVM Applied to High Dimensional Classification Problems: SVM have been successfully applied to high dimensional classification problems arising in fields like remote sensing, web document classification, microarray analysis, etc. As mentioned earlier, conventional classifiers like logistic regression, maximum likelihood classification, etc., applied to high dimensional data tend to overfit the model using the training data and run the risk of achieving lower accuracies on testing data. Hence, a pre-processing step like feature selection and/or dimensionality reduction is proposed to alleviate the problem of the curse of dimensionality while working with these traditional classifiers.

Surprisingly, SVM have been successfully applied to hyperspectral remote sensing images without any pre-processing steps [130]. Researchers show that SVM are more effective than the traditional pattern recognition approach that involves a feature selection procedure followed by a conventional classifier, and are also insensitive to the Hughes phenomenon [80]. This is particularly helpful as it avoids the unnecessary additional computation of an intermediary step like feature selection/dimensionality reduction to achieve high classification accuracy.

Similar observations were reported in the field of document classification in [83], where SVM were trained directly on the original high dimensional input space. Kernel SVM (Gaussian and polynomial kernels) were employed and compared with other conventional classifiers like k-NN classifiers, the Naive Bayes classifier, the Rocchio classifier and the C4.5 decision tree classifier. The results show that kernel SVM outperform the traditional classifiers. Also, in the field of microarray gene expression analysis, SVM have been successfully applied to perform classification in several cancer diagnosis tasks [19, 139].

The insensitivity of SVM to overfitting and their ability to overcome the curse of dimensionality can be explained via the generalization error bounds developed by Vapnik et al. [166]. Vapnik showed the following generalization error bound for large margin classifiers:

$\epsilon = \tilde{O}\left(\frac{1}{m}\left(\frac{R^2}{r^2} + \log\frac{1}{\delta}\right)\right)$   (3-17)

where m is the number of training samples, r is the margin between the parallel planes, and $R, \delta \in \mathbb{R}_+$ with $0 < \delta \leq 1$. This error bound is inversely dependent on the sample size m and the margin r. For a finite sample size, maximizing the margin r (or minimizing the weight vector) would reduce the generalization error $\epsilon$. Interestingly, this error bound does not depend on the dimensionality of the input space. Since it is highly likely that data can be linearly separated in higher dimensions, SVM tend to perform well on classification tasks in high dimensions.

3.2 Discriminant Functions

A discriminant function $g: \mathbb{R}^p \rightarrow \{-1, 1\}$ assigns either class 1 or class 2 to an input vector $x \in \mathbb{R}^p$. We consider here a class of discriminant functions G that are well studied in the literature and traditionally applied to binary classification problems.

Quadratic and Linear Discriminant Analysis: Consider a binary classification problem with classes $C_1$ and $C_2$ and prior probabilities given as $\pi_1$ and $\pi_2$. Assume the class conditional probability densities $f_1(x)$ and $f_2(x)$ to be normally distributed with mean vectors $\mu_1$ and $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$ respectively:

$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\right), \quad k = 1, 2$   (3-18)

where $|\Sigma_k|$ is the determinant of the covariance matrix $\Sigma_k$. Following the Bayes optimal rule [11], Quadratic Discriminant Analysis (QDA) [111] assigns class 1 to an input vector x if the following condition holds:

$\pi_1 f_1(x) \geq \pi_2 f_2(x)$   (3-19)

Linear Discriminant Analysis (LDA) [111] further assumes the covariances $\Sigma_1$ and $\Sigma_2$ are equal to $\Sigma$ and classifies an input vector again in accordance with the Bayes optimal rule. The condition in (3-19) can then be rewritten as:

$\log\frac{\pi_1}{\pi_2} + (x - \bar{\mu})^T \Sigma^{-1}(\mu_1 - \mu_2) \geq 0, \quad \bar{\mu} = \frac{1}{2}(\mu_1 + \mu_2)$   (3-20)

Assuming the prior probabilities to be equal, (3-20) is equivalent to:

$(x - \mu_1)^T \Sigma^{-1}(x - \mu_1) \leq (x - \mu_2)^T \Sigma^{-1}(x - \mu_2)$   (3-21)

It is interesting to note that LDA compares the squared Mahalanobis distance [35] of x from the class means $\mu_1$ and $\mu_2$ and assigns the class that is closest.

The squared Mahalanobis distance of a point x from a distribution P characterized by mean vector $\mu$ and covariance matrix $\Sigma$ is defined as:

$d_M(x, P) = (x - \mu)^T \Sigma^{-1}(x - \mu)$   (3-22)

This distance measure, unlike the Euclidean distance measure, accounts for correlations among different dimensions of x. Equation (3-21) shows how LDA differs from other distance-based classifiers like the k-NN classifier [11], which measures Euclidean distance to assign the class.

Fisher Linear Discriminant Analysis: Fisher Linear Discriminant Analysis (FLDA) [11], unlike LDA, does not make assumptions on the class conditional densities. Instead, it estimates the class means from the training set. In practice, the most commonly used estimators are their maximum-likelihood estimates, given by:

$\hat{\mu}_1 = \frac{1}{N_1}\sum_{k \in C_1} x_k, \quad \hat{\mu}_2 = \frac{1}{N_2}\sum_{k \in C_2} x_k$   (3-23)

FLDA attempts to find a projection vector w that maximizes the class separation. In particular, it maximizes the following Fisher criterion given as:

$J(w) = \frac{w^T S_B w}{w^T S_W w}$   (3-24)

where $S_B$ is the between-class covariance matrix and is given by:

$S_B = (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T$   (3-25)

and $S_W$ is the within-class covariance matrix and is given by:

$S_W = \sum_{k \in C_1}(x_k - \hat{\mu}_1)(x_k - \hat{\mu}_1)^T + \sum_{k \in C_2}(x_k - \hat{\mu}_2)(x_k - \hat{\mu}_2)^T$   (3-26)

The optimal Fisher discriminant $w^*$ can be obtained by maximizing the Fisher criterion:

$\underset{w}{\text{maximize}} \ \ J(w)$   (3-27)

An important property to notice about the objective function J(w) is that it is invariant to rescalings of the vector $w \rightarrow \beta w$, $\forall \beta \in \mathbb{R}$. Hence, w can be chosen such that the denominator is simply $w^T S_W w = 1$, since it is a scalar itself. For this reason, we can transform the problem of maximizing the Fisher criterion J into the following constrained optimization problem:

$\underset{w}{\text{maximize}} \ \ w^T S_B w \quad \text{subject to} \ \ w^T S_W w = 1$   (3-28)

The KKT conditions for (3-28) can be solved to obtain the following generalized eigenvalue problem, given as:

$S_B w = \lambda S_W w$   (3-29)

where $\lambda$ represents the eigenvalue, and the optimal vector $w^*$ corresponds to the eigenvector with the largest eigenvalue $\lambda_{\max}$ and is proportional to:

$w^* \propto S_W^{-1}(\hat{\mu}_2 - \hat{\mu}_1)$   (3-30)

The class of an input vector x is determined using the following condition:

$\langle w^*, x \rangle < c$   (3-31)

where $c \in \mathbb{R}$ is a threshold constant.
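A minimal sketch of FLDA via (3-25), (3-26) and (3-30) on toy data follows. The pseudo-inverse (for numerical safety when $S_W$ is near-singular) and the choice of the threshold c as the midpoint of the projected class means are illustrative assumptions; the text above leaves c as a free constant.

```python
# Minimal sketch of FLDA: w* ~ S_W^{-1}(mu2_hat - mu1_hat), as in (3-30),
# with the threshold c in (3-31) set to the midpoint of the projected class means.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (30, 5)) + np.array([0, 0, 0, 1.5, 1.5])   # class 1 samples
X2 = rng.normal(0.0, 1.0, (30, 5)) - np.array([0, 0, 0, 1.5, 1.5])   # class 2 samples

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)          # within-class scatter (3-26)
w = np.linalg.pinv(S_W) @ (mu2 - mu1)                                 # pseudo-inverse for safety

c = 0.5 * (w @ mu1 + w @ mu2)                                         # midpoint threshold
x_new = np.array([0.2, -0.1, 0.3, -1.0, -1.2])
label = 1 if x_new @ w < c else 2                                     # rule (3-31)
print("discriminant w:", np.round(w, 3), " assigned class:", label)
```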

Diagonal Discriminant Analysis: Diagonal Linear Discriminant Analysis (DLDA) extends LDA and assumes independence among the features [55]. In particular, the discriminant rule in (3-20) is replaced with:

$\log\frac{\pi_1}{\pi_2} + (x - \bar{\mu})^T D^{-1}(\mu_1 - \mu_2) \geq 0$   (3-32)

where $D = \text{diag}(\Sigma)$. The off-diagonal elements of the covariance matrix $\Sigma$ are replaced with zeros by the independence assumption.

Similarly, Diagonal Quadratic Discriminant Analysis (DQDA) [44] assumes the independence rule for QDA. The discriminant rule in this case is given by:

$\log\frac{\pi_1}{\pi_2} + (x - \mu_2)^T D_2^{-1}(x - \mu_2) - (x - \mu_1)^T D_1^{-1}(x - \mu_1) \geq 0$   (3-33)

where $D_1 = \text{diag}(\Sigma_1)$ and $D_2 = \text{diag}(\Sigma_2)$.

DQDA and DLDA classifiers are sometimes called Naive Bayes classifiers because they can arise in a Bayesian setting [9]. Additionally, it is important to note that FLDA and Diagonal Discriminant Analysis (DLDA and DQDA) are commonly generalized to handle multi-class problems as well.

Sparse Discriminant Analysis: The optimal discriminant vector in FLDA (3-30) involves estimating the inverse of the covariance matrix obtained from sample data. However, the high dimensionality in some classification problems poses the threat of singularity and thus leads to poor classification performance. One approach to overcome singularity involves a variable selection procedure that selects a subset of variables most appropriate for classification. Such a sparse solution has several advantages, including better classification accuracy as well as interpretability of the model. One of the ways to induce sparsity is via the path of regularization. Regularization techniques have been traditionally used to prevent overfitting in classification models, but recently they have been extended to induce sparsity as well in high dimensional classification problems. Here, we briefly discuss some standard regularization techniques that facilitate variable selection and prevent overfitting.

Given a set of instance-label pairs $(x_i, y_i)$, i = 1, 2, ..., n, a regularized classifier optimizes the following unconstrained optimization problem:

$\underset{\beta}{\text{minimize}} \ \ \mathcal{L}(x, y) + \lambda\|\beta\|_p$   (3-34)

where $\mathcal{L}$ represents a non-negative loss function, $\lambda, p \in \mathbb{R}$, and $\beta$ is the coefficient vector.
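A minimal sketch of the regularized framework (3-34) with a logistic loss (an illustrative choice of $\mathcal{L}$): the same classifier is fit with p = 1 and p = 2 penalties and the resulting sparsity is compared. Note that scikit-learn parametrizes the regularization strength as C, roughly 1/λ; the data and C value are placeholders.

```python
# Minimal sketch of (3-34) with a logistic loss: the l1 penalty (p=1) zeroes out
# coefficients, while the l2 penalty (p=2) only shrinks them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=300, n_informative=10,
                           random_state=0)

lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ridge_like = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("non-zero coefficients, l1 penalty:", int(np.sum(lasso_like.coef_ != 0)))
print("non-zero coefficients, l2 penalty:", int(np.sum(ridge_like.coef_ != 0)))
```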

Classifiers with p = 1 (Lasso penalty) and p = 2 (ridge penalty) have been successfully applied to several classification problems [175].

In the context of regression, Tibshirani [156] introduced variable selection via the framework of regularized classifiers using the $l_1$-norm. This method, also called the Least Absolute Shrinkage and Selection Operator (LASSO), considers the least-squares error as the loss function. The user-defined parameter $\lambda$ trades off regularization against the loss term. The $l_1$-norm in LASSO produces some coefficients that are exactly 0, thus facilitating the selection of only a subset of variables useful for regression. LASSO regression, in addition to providing a sparse model, also shares the stability of ridge regression. Several algorithms have been successfully employed to solve the Lasso regression in the past decade. Efron et al. [46] showed that, starting from zero, the LASSO solution paths grow piecewise linearly in a predictable way, and exploited this predictability to propose a new algorithm called Least Angle Regression that solves the entire LASSO path efficiently. The LASSO framework has been further extended to several classification problems by considering different loss functions, and has been highly successful in producing sparse models with high classification accuracy.

A Lasso-type framework, however, is not without its limitations. Zou et al. [175] mention that a Lasso framework, in high dimensional problems, suffers from two drawbacks: the number of variables selected is limited by the number of samples n, and in the case of highly correlated features, the method selects one of them, neglecting the rest, without regard to which one is selected. The second limitation, also called the grouping effect, is very common in high dimensional classification problems like microarray gene analysis, where groups of variables are highly correlated with each other. The authors propose a new technique that overcomes the limitations of Lasso. The technique, called elastic-net, considers a convex combination of the $l_1$ and $l_2$-norms to induce sparsity.

In particular, in an elastic-net framework, the following optimization problem is minimized:

$\underset{\beta}{\text{minimize}} \ \ \mathcal{L}(x, y) + \lambda\left[\alpha\|\beta\|_1 + (1 - \alpha)\|\beta\|_2\right]$   (3-35)

where $\mathcal{L}$ is the loss function and $0 \leq \alpha \leq 1$. When $\alpha = 1$ (or $\alpha = 0$), the elastic-net framework simplifies to the Lasso (or ridge) framework. The method can simultaneously perform variable selection along with continuous shrinkage and can also select groups of correlated variables. An efficient algorithm, called LARS-EN, along the lines of LARS, was proposed to solve the elastic-net problem. It is important to note that these regularized frameworks are very general and can be added to models that suffer from overfitting. They provide better generalization performance by inherently performing variable selection and thus also produce more interpretable models.

Sparsity can be induced in the solution of FLDA using the regularization techniques described above. One such method, called Sparse Linear Discriminant Analysis (SLDA), is inspired by penalized least squares, where regularization is applied to the solution of the least squares problem via the Lasso penalty. The penalized least squares problem is formulated as:

$\underset{\beta}{\text{minimize}} \ \ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$   (3-36)

where X represents the data matrix and y is the outcome vector. The second term in (3-36) induces sparsity in the optimal $\beta^*$.

In order to induce sparsity in FLDA via the $l_1$ penalty, the generalized eigenvalue problem in (3-29) is first reformulated as an equivalent least squares regression problem, and it is shown that the optimal discriminant vector of FLDA is equivalent to the optimal regression coefficient vector. This is achieved by applying the following theorem:

Theorem 3.1. Assume the between-class covariance matrix $S_B \in \mathbb{R}^{p \times p}$ and the within-class covariance matrix $S_W \in \mathbb{R}^{p \times p}$ are given by (3-25) and (3-26). Also, assume $S_W$ is positive definite and denote its Cholesky decomposition as $S_W = R_W^T R_W$, where $R_W \in \mathbb{R}^{p \times p}$ is an upper triangular matrix. Let $H_B \in \mathbb{R}^{n \times p}$ satisfy $S_B = H_B^T H_B$.

Let $v_1, v_2, \ldots, v_q$ ($q \leq \min(p, n-1)$) denote the eigenvectors of problem (3-29) corresponding to the q largest eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_q$. Let $A \in \mathbb{R}^{p \times q} = [\alpha_1, \alpha_2, \ldots, \alpha_q]$ and $B \in \mathbb{R}^{p \times q} = [\beta_1, \beta_2, \ldots, \beta_q]$. For $\lambda > 0$, let $\hat{A}$ and $\hat{B}$ be the solution to the following least squares regression problem:

$\underset{A, B}{\text{minimize}} \ \ \sum_{i=1}^{n}\|R_W^{-T} H_{B,i} - AB^T H_{B,i}\|^2 + \lambda\sum_{j=1}^{q}\beta_j^T S_W \beta_j \quad \text{subject to} \ \ A^T A = I$   (3-37)

where $H_{B,i}$ is the i-th row of $H_B$. Then $\hat{\beta}_j$, j = 1, 2, ..., q, span the same subspace as $v_j$, j = 1, 2, ..., q.

Please refer to [138] for the proof of Theorem 3.1.

After establishing the equivalence, regularization is applied to the least squares formulation in (3-37) via the Lasso penalty, as shown below:

$\underset{A, B}{\text{minimize}} \ \ \sum_{i=1}^{n}\|R_W^{-T} H_{B,i} - AB^T H_{B,i}\|^2 + \lambda\sum_{j=1}^{q}\beta_j^T S_W \beta_j + \sum_{j=1}^{q}\lambda_{j,1}\|\beta_j\|_1 \quad \text{subject to} \ \ A^T A = I$   (3-38)

Since (3-38) is non-convex, finding the global optimum is often difficult. Qiao et al. (2009) [138] suggest a technique to obtain a local optimum by alternating optimization over A and B.

Clemmensen et al. [29] also propose a similar sparse model using FLDA for classification problems. They also follow the approach of re-casting the optimization problem of FLDA into an equivalent least squares problem and then inducing sparsity by introducing a regularization term. However, the reformulation is achieved via an optimal scoring function that maps categorical variables to continuous variables via a sequence of scorings.

Given a data matrix $X \in \mathbb{R}^{n \times p}$ with samples belonging to one of K classes, the equivalent regression problem can be formulated as:

$\underset{\theta_k, \beta_k}{\text{minimize}} \ \ \|Y\theta_k - X\beta_k\|_2^2 \quad \text{subject to} \ \ \frac{1}{n}\theta_k^T Y^T Y\theta_k = 1, \ \ \theta_k^T Y^T Y\theta_l = 0, \ \ \forall l < k$   (3-39)

where $\theta_k$ is the score vector and $\beta_k$ is the coefficient vector. It can be shown that the optimal vector $\beta_k$ from (3-39) is also optimal for the FLDA formulation in (3-28). Sparse discriminant vectors are then obtained by adding an $l_1$-penalty to the objective function in (3-39) as:

$\underset{\theta_k, \beta_k}{\text{minimize}} \ \ \|Y\theta_k - X\beta_k\|_2^2 + \gamma\,\beta_k^T \Omega\,\beta_k + \lambda\|\beta_k\|_1 \quad \text{subject to} \ \ \frac{1}{n}\theta_k^T Y^T Y\theta_k = 1, \ \ \theta_k^T Y^T Y\theta_l = 0, \ \ \forall l < k$   (3-40)

where $\Omega$ is a positive-definite matrix. The authors propose a simple iterative algorithm to obtain a local minimum for the optimization problem in (3-40). The algorithm involves holding $\theta_k$ fixed and optimizing with respect to $\beta_k$, and then holding $\beta_k$ fixed and optimizing with respect to $\theta_k$, until a pre-defined convergence criterion is met.

Discriminant Functions for High Dimensional Data Classification: LDA and QDA require the covariance within classes to be known a priori in order to establish a discriminant rule in classification problems. In many problems, since the covariance is not known a priori, researchers often attempt to estimate the covariance from the sample data. However, in high dimensional problems, the sample covariance matrix is ill-conditioned and hence induces singularity in the estimation of the inverse covariance matrix. FLDA faces similar challenges, since the within-class and between-class scatter are estimated from the sample data. In fact, even if the true covariance matrix is not ill-conditioned, the singularity of the sample covariance matrix will make these methods inapplicable when the dimensionality is larger than the sample size.

Several authors have performed theoretical studies on the performance of FLDA in high dimensional classification settings. Bickel et al. [9] showed that, under some regularity conditions, as the ratio of the number of features p to the number of samples n tends to infinity, the worst case misclassification rate tends to 0.5. This proves that as the dimensionality increases, FLDA is only as good as random guessing.

Several alternatives have been proposed to overcome the problem of singularity in LDA and QDA. Thomaz et al. [155] propose a new LDA algorithm (NLDA), which replaces the less reliable smaller eigenvalues of the sample covariance matrix with the grand mean of all eigenvalues and keeps the larger eigenvalues unchanged. NLDA has been used successfully in face recognition problems. Xu et al. [169] note the lack of theoretical basis for NLDA and introduced a modified version of LDA called MLDA, which is based on a well-conditioned estimator for high dimensional covariance matrices. This estimator has been shown to be asymptotically more accurate than the sample covariance matrix.

The assumption of independence in DLDA greatly reduces the number of parameters in the model and often results in an effective and interpretable classifier. Despite the fact that features will rarely be independent within a class, in the case of high dimensional classification problems the dependencies cannot be estimated due to lack of data. DLDA is shown to perform well in high dimensional classification settings in spite of this naive assumption. Bickel et al. [9] theoretically showed that it will outperform classical discriminant analysis in high dimensional problems. However, one shortcoming of DLDA is that it uses all features and hence is not convenient for interpretation. Tibshirani et al. [157] introduced further regularization in DLDA using a procedure called nearest shrunken centroids (NSC) in order to improve the misclassification error as well as interpretability. The regularization is introduced in a way that automatically assigns a weight of zero to features that do not contribute to the class predictions. This is achieved by shrinking the classwise mean toward the overall mean, for each feature separately.


analysisandisshowntobemoreaccuratethanothercompetin gmethods.Theauthors provethatthemethodishighlyefcientinndinggenesrepr esentativeofsmallround bluecelltumorsandleukemias.SeveralvariationsofNSCal soexistinliterature,for example[ 32 158 ].Interestingly,NSCisalsoshowntobehighlysuccessfuli nopen-set classicationproblems[ 144 145 ]wherethenumberofclassesisnotnecessarily closed. Anotherframeworkappliedtohighdimensionalclassicati onproblemsinclude combiningDLDAwithshrinkage[ 132 159 ].Pang etal. [ 132 ]combinedtheshrinkage estimatesofvarianceswithdiagonaldiscriminantscorest odenetwoshrinkage-based discriminantrulescalledShrinkage-basedDQDA(SDQDA)an dShrinkage-basedDLDA (SDLDA).Furthermore,theauthorsalsoappliedregulariza tiontofurtherimprovethe performanceofSDQDAandSDLDA.Thediscriminantrulecombi ningshrinkage-based variancesandregularizationindiagonaldiscriminantana lysisshowedimprovementover theoriginalDQDAandDLDA,SVM,and k -NearestNeighborsinmanyclassication problems.Recently,Huang etal. [ 78 ]observedthatthediagonaldiscriminantanalysis suffersfromseriousdrawbackofhavingbiaseddiscriminan tscores.Hence,they proposedbias-correcteddiagonaldiscriminantrulesbyco nsideringunbiasedestimates forthediscriminantscores.Especiallyinthecaseofhighl yunbalancedclassication problems,thebiascorrectedruleisshowntooutperformthe standardrules. Recently,SLDAhasshownpromiseinhighdimensionalclassi cationproblems.In [ 138 ],SLDAwasappliedtosyntheticandrealworlddatasetsincl udingwinedatasets andgeneexpressiondatasetsisshowntoperformverywellon trainingandtestingdata withlessernumberofsignicantvariables.Theauthorsin[ 29 ]comparedSLDAobtained viaoptimalscoringtoothermethodslikeshrunkencentroid regularizeddiscriminant analysis,sparsepartialleastsquaresregressionandthee lastic-netregressionona numberofhighdimensionaldatasetsandisshowntohavecomp arableperformanceto othermethodsbutwithlessernumberofsignicantvariable s. 39
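To make the shrunken-centroid idea above concrete, the following sketch fits a nearest-centroid rule with centroid shrinkage on synthetic high dimensional data. It uses scikit-learn's NearestCentroid classifier, whose shrink_threshold argument plays the role of the shrinkage amount; the synthetic dataset and the particular threshold value are illustrative assumptions rather than settings taken from the studies cited above.

    # Minimal sketch: nearest shrunken centroids on synthetic high dimensional data.
    # The data and the shrink_threshold value below are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import NearestCentroid

    # Many features, few samples, only a handful of informative dimensions.
    X, y = make_classification(n_samples=80, n_features=2000, n_informative=20,
                               n_redundant=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    # shrink_threshold controls how strongly class centroids are pulled toward the
    # overall centroid; features whose shrunken differences vanish stop contributing.
    nsc = NearestCentroid(shrink_threshold=0.5)
    nsc.fit(X_tr, y_tr)
    print("test accuracy:", nsc.score(X_te, y_te))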


3.3HybridClassiers Wenowdiscussanimportantsetofclassiersthatarefreque ntlyusedfor classicationinthecontextofhighdimensionaldataprobl ems.Highdimensional datasetsusuallyconsistofirrelevantandredundantfeatu resthatadverselyeffect theperformanceoftraditionalclassiers.Also,thehighd imensionalityofthedata makestheestimationofstatisticalmeasuresdifcult.Hen ce,severaltechniques havebeenproposedintheliteraturetoperformfeaturesele ctionthatselectsrelevant featuressuitableforclassication[ 75 ].Generally,featureselectionisperformedas adimensionalityreductionsteppriortobuildingtheclass icationmodelusingthe traditionalclassiers.Unlike,otherdimensionalityred uctiontechniqueslikethose basedontransformation(e.g.principalcomponentanalysi s)orcompression(e.g. basedoninformationtheory),featureselectiontechnique sdonotaltertheoriginal dimensionalspaceofthefeatures,butmerelyselectasubse tofthem[ 142 ].Thus, theyoffertheadvantageofinterpretabilitybyadomainexp ertastheypreservethe originalfeaturespace.Also,featureselectionhelpstoga inadeeperinsightintothe underlyingprocessesthatgeneratedthedataandthusplaya vitalroleinthediscovery ofbiomarkersespeciallyinbiomedicalapplications[ 48 ].Thustheclassication frameworkcanbeviewedasatwo-stageprocesswithdimensio nalityreductionvia featureselectionbeingtherststepfollowedbyaclassic ationmodel.Wecallthese setofclassiersashybridclassiersasdifferenttechniq uespertainingtotwostages havebeencombinedtoproduceclassicationframeworkstha thavebeensuccessfulin severalhighdimensionalproblems. Statnikov etal. [ 150 ]recentlyperformedacomprehensivecomparativestudy betweenRandomForests[ 17 ]andSupportVectorMachinesformicroarray-based cancerclassication.Theyadoptseveralltermethodslik ethesequentialltering techniquesasapre-processingsteptoselectasubsetoffea tureswhicharethenused asinputtotheclassiers.Itisshownthatonanaverage,Sup portVectorMachines 40


outperformRandomForestsonmostmicroarraydatasets.Rec ently,Pal etal. [ 129 ] studiedtheeffectofdimensionalityonperformanceofSupp ortVectorMachinesusing fourfeatureselectiontechniquesnamelyCFS,MRMR,Random ForestsandSVM-RFE [ 67 ]onhyperspectraldata.Unlikeearlierndings,theyshowt hatdimensionalitymight effecttheperformanceofSVMandhenceapre-processingste plikefeatureselection mightstillbeusefultoimprovetheperformance. 3.4EnsembleClassiers Ensembleclassiershavegainedincreasingattentionfrom theresearchcommunity overthepastyears,rangingfromsimpleaveragingofindivi duallytrainedneural networkstothecombinationofthousandsofdecisiontreest obuildRandomForests [ 17 ],totheboostingofweakclassierstobuildastrongclassi erwherethetrainingof eachsubsequentclassierdependsontheresultsofallprev iouslytrainedclassiers [ 141 ].Themainideaofanensemblemethodologyistocombineaset ofmodels,each ofwhichsolvesthesameoriginaltask,inordertoobtainabe ttercompositeglobal model,withmoreaccurateandreliableestimatesordecisio ns.Theycombinemultiple hypothesesofdifferentmodelswiththehopetoformabetter classier.Alternatively, anensembleclassiercanalsobeviewedasatechniqueforco mbiningmanyweak learnersinanattempttoproduceastronglearner.Henceane nsembleclassieris itselfasupervisedlearningalgorithmcapableofmakingpr edictiononunknownsample data.Thetrainedensembleclassier,therefore,represen tsasinglehypothesisthatis notnecessarilycontainedwithinthehypothesisspaceofth econstituentmodels.This exibilityofensembleclassierscantheoreticallyover ttothetrainingdatamorethan asinglemodelwould,buthoweversurprisingly,inpractice ,someensembletechniques (especiallybaggingandRandomForests)combiningtheoutp utsofmultipleclassiers reducesthegeneralizationerror[ 40 ]. Ensemblemethodsareparticularlyeffectiveduetothephen omenonthatvarious typesofclassiershavedifferentinductivebiases.Addit ionally,ensemblemethodscan 41


effectivelymakeuseofsuchdiversitytoreducethevarianc e-errorwhilekeepingthe bias-errorincheck.Incertainsituations,anensemblecan alsoreducebias-error,as shownbythetheoryoflargemarginclassiers.So,diversi edclassiershelpinbuilding lessernumberofclassiersespeciallyinthecaseofRandom Forests.Theincreasein predictionaccuracycomesatacostofperformingmoreopsi ncomparisontoasingle model.So,theensemblemethodscanbethoughtofastradingt hepoorperformance ofaweakclassierwithperformingalotofcomputations.So ,afastpoorlearnerlike decisiontreeshavecertainlyperformedbetterwithensemb lemethods;althoughslow algorithmscanalsobenetfromensembletechniques. Recently,ensemblemethodshaveshownpromiseinhighdimen sionaldata classicationproblems.Inparticular,baggingmethods,r andomforestsandboosting havebeenparticularlyimpressiveduetotheirexibilityt ocreatestrongerclassiers fromweakclassiers.Here,wedescribetwomethods:AdaBoo standRandomForests, andshowtheirimportanceinhighdimensionalproblems. AdaBoost: Boosting[ 54 146 ]isageneralmethodwhichattemptstoboostthe accuracyofanygivenlearningalgorithm.TheinceptionofB oostingcanbetraced backtoatheoreticalframeworkforstudyingmachinelearni ngcalledthePAClearning model[ 163 ].Kearns et.al. [ 86 ]wereamongtherstauthorstoposethequestionof whetheraweaklearnerwhichisonlyslightlycorrelatedwit htrueclassicationinthe PACmodelcanbeboostedintoanaccuratestronglearningalg orithmthatisarbitrarily well-correlatedwithtrueclassication.Schapire[ 146 ]proposedtherstprovable polynomial-timeboostingalgorithmin1989.In1990,Freun d[ 53 ]developedamuch moreefcientboostingalgorithmwhichbeingoptimalinone sensesufferedfromcertain practicaldrawbacks. Boostinggenerallyconsistsofafamilyofmethodsthatprod uceasetofclassiers. Thetrainingsetpresentedtoeachclassierischosenbased ontheperformanceof theearlierclassierintheseries.Unlikeotherensemblem ethodslikebagging[ 15 ], 42


inboosting,thebaseclassiersaretrainedinsequence,an deachbaseclassier istrainedusingaweightedvariantofthedatasetinwhichth eindividualweighting coefcientofeachdatapointdependsontheperformanceofp reviousclassiers.In particular,pointsthataremisclassiedbyoneoftheearli erclassiersaregivengreater weightwhenusedtotrainthenextclassierinthesequence. Oncealltheclassiers havebeentrained,theirpredictionsarethencombinedthro ughaweightedmajority votingscheme. AdaBoost,shortforAdaptiveBoosting,formulatedbyYoavF reundandRobert Schapire[ 54 ],solvedmanyoftheearlierpracticaldifcultiesofearli erboosting algorithms.Itcanbeconsideredasclassicationframewor kthatcanbeusedin conjunctionwithmanyotherlearnerstoimprovetheirperfo rmance.AdaBoostis adaptiveinthesensethatsubsequentclassiersbuiltaret ailoredtofavorthose instancesmisclassiedbypreviousclassiers.Theframew orkprovidesanewweak classierwithaformoftrainingsetthatisrepresentative oftheperformanceofprevious classiers.Theweightsofthosetrainingsamplesthatarem isclassiedbyearlier weakclassiersareassignedhighervaluesthanthosethata recorrectlyclassied. Thisallowsthenewclassiertoadapttothemisclassiedtr ainingsamplesandfocus onpredictingthemcorrectly.Afterthetrainingphaseisco mplete,eachclassieris assignedaweightandtheiroutputsarelinearlycombinedto makepredictionson theunknownsample.Generally,itprovidesasignicantper formanceboosttoweak learnersthatareonlyslightlybetterthanrandomguessing .Theclassierswithhigher errorratewouldhavenegativecoefcientsinthenalcombi nationofclassiersand therebybehaveliketheirinverses.ThestepsinvolvedinAd aBoostalgorithmare describedbelow[ 11 ]. Considerabinaryclassicationproblem,inwhichthetrain ingdatacomprisesinput vectors x 1 x 2 ,..., x N alongwithcorrespondingbinarytargetvariablesgivenby t where t n 2f 1,1 g .Eachdatapointisgivenanassociatedweightingparameter w n ,whichis 43


initiallysetto1/Nforalldatapoints.Weassumethatwehav eanalgorithmfortraininga baseclassierusingweighteddatatogiveadiscriminative function y ( x ) 2f 1,1 g Initializethedataweightingcoefcients f w n g bysetting w (1) n =1 = N for n = 1,2,..., N For m =1,..., M : (a)Fitaclassier y m ( x ) tothetrainingdatabyminimizingtheweightederror function J m = N X n =1 w ( m ) n I ( y m ( x n ) 6 = t n ) (3–41) where I ( y m ( x n ) 6 = t n ) istheindicatorfunctionandequals1when( y m ( x n ) 6 = t n )and 0otherwise. (b)Evaluatethequantities: m = P Nn =1 w ( m ) n I ( y m ( x n ) 6 = t n ) P Nn =1 w ( m ) n (3–42) andthenuse m toevaluate m = ln 1 m m (3–43) (c)Updatethedataweightingcoefcients w ( m +1) n = w ( m ) n exp f m I ( y m ( x n ) 6 = t n ) g (3–44) Makepredictionsusingthenalmodel,whichisgivenby: Y (x) M = sign M X m =1 m y m ( x ) (3–45) Weseethattherstweaklearner y 1 ( x ) istrainedusingweightingcoefcients w (1) n thatareallequalandhenceissimilartotrainingasinglecl assier.From( 3–44 ), weseethatinsubsequentiterationstheweightingcoefcie nts w ( m ) n areincreased fordatapointsthataremisclassiedanddecreasedfordata pointsthatarecorrectly classied.Successiveclassiersarethereforeforcedtof ocusonpointsthathavebeen misclassiedbypreviousclassiers,anddatapointsthatc ontinuetobemisclassiedby successiveclassiersreceiveevengreaterweight.Thequa ntities m representweighted measuresoftheerrorratesofeachofthebaseclassiersont hedataset.Wetherefore 44


seethattheweightingcoefcients m denedby( 3–43 )givegreaterweighttomore accurateclassierswhencomputingtheoveralloutputforu nknownsamplesgivenby ( 3–45 ).AdaBoostissensitivetonoisydataandoutliers.Insomep roblems,however,it canbelesssusceptibletotheoverttingproblemthanmostl earningalgorithms. Boostingframeworkinconjunctionwithseveralclassiers havebeensuccessfully appliedtohighdimensionaldataproblems.Asdiscussedin[ 16 ]boostingframework canbeviewedasafunctionalgradientdescenttechnique.Th isanalysisofboosting connectsthemethodtomorecommonoptimizationviewofstat isticalinference. B¨uhlmannandYu[ 21 ]investigateonesuchcomputationallysimplevariantofbo osting called L 2 Boost,whichisconstructedfromafunctionalgradientdesc entalgorithm withthe L 2 -lossfunction.Inparticular,theystudythealgorithmwit hcubicsmoothing splineasthebaselearnerandshowempericallyonrealandsi mulationdatasetsthe effectivenessofthealgorithminhighdimensionalpredict ors.B¨uhlmann(2003)[ 20 ] presentedaninterestingreviewonhowtheboostingmethods canbeusefulforhigh dimensionalproblems.Heproposesthatinherentvariables electionandassigning variableamountofdegreesoffreedomtoselectedvariables byboostingalgorithms couldbeareasonforhighperformanceinhighdimensionalpr oblems.Additionally,he suggeststhatboostingyieldsconsistentfunctionapproxi mationsevenwhenthenumber ofpredictorsgrowfasttoinnity,wheretheunderlyingtru efunctionissparse.Dettling andB¨uhlmann(2003)[ 38 ]appliedboostingtoperformclassicationtaskswithgene expressiondata.Amodiedboostingframeworkinconjuncti onwithdecisiontreesthat doespre-selectionwasproposedandshowntoyieldslightto drasticimprovementin performanceonseveralpubliclyavailabledatasets. RandomForests: Randomforestsisanensembleclassierthatconsistsofman y tree-likeclassierswitheachclassierbeingtrainedona bootstrappedsampleofthe originaltrainingdata,anddeterminesasplitforeachnode bysearchingonlyacross arandomlyselectedsubsetoftheinputvariables.Forclass ication,eachtreeinthe 45


RandomForestindependentlypredictstheclassforaninput x .Theoutputofthe RandomForestforanunknownsampleisthendeterminedbyama jorityvotingscheme amongthetrees.ThealgorithmforinducingRandomForestsw asdevelopedbyLeo Breiman[ 17 ]andcanbesummarizedasbelow: Assumethenumberoftrainingsamplesbe N ,andthenumberoffeaturesbegiven by M .Also,assumethatrandom m numberoffeatures( m < M )usedfordecisionat eachsplit.EachtreeintheRandomForestisconstructedasf ollows: Chooseatrainingsetforthistreebybootstrappingtheorig inaltrainingset n times. Therestofthesamplesareusedasatestingsettoestimateth eerrorofthetree. Foreachnodeofthetree,thebestsplitisbasedonrandomlyc hoosing m features foreachtrainingsampleandthetreeisfullygrownwithoutp runing. Forprediction,anewsampleispusheddownthetree.Itisass ignedthelabelofthe trainingsampleintheterminalnodeitendsupin.Thisproce dureisiteratedoverall treesintheensemble,andtheclassobtainedfrommajorityv oteofallthetreesis reportedasRandomForestprediction. RandomForestsisconsideredoneofthemostaccurateclassi ersandarereported tohaveseveraladvantages.RandomForestsareshowntohand lemanyfeatures andalsoassignaweightrelativetotheirimportanceinclas sicationtaskswhich canfurtherbeexploredforfeatureselection.Thecomputat ionalcomplexityofthe algorithmisreducedasthenumberoffeaturesusedforeachs plitisboundedby m Also,non-pruningofthetreesalsohelpinreducingthecomp utationalcomplexityfurther. Suchrandomselectionoffeaturestobuildthetreesalsolim itsthecorrelationamongthe treesthusresultinginerrorratessimilartothoseofAdaBo ost.TheanalysisofRandom Forestshowthatitscomputationaltimeis cT p MN log(N)wherecisaconstant,Tis thenumberoftreesintheensemble,Misthenumberoffeature sandNisthenumber oftrainingsamplesinthedataset.Itshouldbenotedthatal thoughRandomForestsare notcomputationallyintensive,theyrequireafairamounto fmemoryastheystoreanN 46


byTmatrixinmemory.Also,RandomForestshavesometimesbe enshowntoovertto thedatainsomeclassicationproblems. RandomForests,duetotheaforementionedadvantages,canh andlehigh dimensionaldatabybuildinglargenumberoftreesusingonl yasubsetoffeatures. Thiscombinedwiththefactthattherandomselectionoffeat uresforasplitseeksto minimizethecorrelationbetweenthetreesintheensemble, certainlyhelpsinbuilding anensembleclassierwithhighgeneralizationaccuracyfo rhighdimensionaldata problems.Gislason etal. (2005)[ 59 ]performedacomparativestudyamongRandom Forestsandotherwell-knownensemblemethodsformultisou rceremotesensingand geographicdata.TheyshowthatRandomForestsoutperforms asingleCARTclassier andperformsonparwithotherensemblemethodslikebagging andboosting.Ona relatedremotesensingapplication,Pal(2006)[ 128 ]investigatedtheuseofRandom Forestsforclassicationtasksandcompareditsperforman cewithSVM.Palshowed thatRandomForestsperformequallywelltoSVMintermsofcl assicationaccuracy andtrainingtime.Additionally,Palconcludesthattheuse rdenedparametersin RandomForestsarelessthanthoserequiredforSVM.Pang etal. [ 131 ]proposeda pathway-basedclassicationandregressionmethodusingR andomForeststoanalyze geneexpressiondata.Theproposedmethodallowstorankimp ortantpathways, discoverimportantgenesandndpathway-basedoutlyingca ses.RandomForests,in comparisonwithothermachinelearningalgorithms,wassho wntohaveeitherloweror second-lowestclassicationerrorrates.Recently,Genue r(2010)[ 58 ]usedRandom Foreststoperformfeatureselectionaswell.Theauthorspr oposeastrategyinvolving rankingoftheexplanatoryvariablesusingtheRandomFores tsscoreofimportance. 47


CHAPTER4 SPARSEPROXIMALSUPPORTVECTORMACHINES ProximalSupportVectorMachines(PSVMs)[ 64 109 ],unlikeSVMs,ndtwo hyperplanes,oneforeachclass,thatsatisfythecondition ofbeingclosesttooneclass andfarthestfromtheother.PSVMsareformulatedasaRaylei ghQuotientproblem andthehyperplanesareobtainedbysolvingtwogeneralized eigenvalueproblems. TheperformanceofPSVMsonseveralrealworlddatasetsshow sthattheyperform wellincomparisontoSVMs[ 64 109 ].Thesimpleproblemformulation,computational efciencyandcomparativeperformanceofPSVMsmakesthema nattractiveSVM-like alternativeforclassicationtasks.Though,severalrese archershaveextendedSVMs tofeatureselectionandshowntheirsuperiorperformance, extensionofPSVMsto featureselectionisseldomconsideredinliterature.Inth ischapter,weintroducea newembeddedfeatureselectionmethodcalledSparseProxim alSupportVector Machines(sPSVMs)thatextendPSVMstofeatureselectionby inducingsparsity inthehyperplanes.ThegeneralizedeigenvalueprobleminP SVMsisrstcastas anequivalentleastsquaresproblemandsparsityisintrodu cedvia l 1 -normonthe coefcientvector.Thisequivalentformulationhelpsinso lvingsPSVMsefcientlyusing alternateoptimizationtechniques.Theclassicationper formanceofsPSVMsonseveral publiclyavailabledatasetsisevaluatedandcomparedwith otherembeddedfeature selectionmethods.Additionally,featureselectionstabi lityofsPSVMsisalsostudied andcontrastedwithotherfeatureselectionmethods.Inadd itiontoperformingfeature selection,weshowthatsPSVMsalsooffertheadvantageofin terpretingtheselected featuresinthecontextoftheclasses. 4.1ProximalSupportVectorMachines(PSVMs) GeneralizedEigenvalueApproach: Considerabinaryclassicationproblem.Let thematrices A 1 2< m p and A 2 2< n p begiven,whoserowsrepresentthetraining examplesoftwoclasses C 1 and C 2 respectively.Thenumberofsamplesin C 1 and C 2 are 48


givenby m and n respectively.Thenumberoffeaturesisgivenby p .PSVMsattemptsto ndtwohyperplaneswitheachhyperplanebeingfarthestfro moneclassandclosestto theotherclass.Lettheproximalhyperplanefarthestfrom C 2 andclosestto C 1 begiven by: P 1 = f x 2< p jh w 1 x i b 1 =0 g (4–1) wheretheweightvector w 1 2< p andtheoffsetparameter b 1 2< PSVMsisformulatedasthefollowingRayleighQuotientprob lem: maximize z 2< p +1 r ( z )= z 0 H 2 z z 0 G 1 z (4–2) where, G 1 =[ A 1 e ] 0 [ A 1 e ]+ I H 2 =[ A 2 e ] 0 [ A 2 e ], z 0 =[ w 1 0 b 1 ] (4–3) where istheTikhonovregularizationconstantand e isavectorof1. Thestationarypointsof( 4–2 )aregivenbyeigenvectorsofthefollowinggeneralized eigenvalueproblem GEV( H 2 G 1 ) : H 2 z = G 1 z (4–4) where representtheeigenvaluesof GEV( H 2 G 1 ) Thehyperplane P 1 isgivenbytheeigenvectorcorrespondingtothemaximum eigenvaluesatisfying( 4–4 ). Similarly,theproximalhyperplane P 2 (closestto C 2 andfarthestfrom C 1 )isgivenby: P 2 = f x 2< p jh w 2 x i b 2 =0 g (4–5) wheretheweightvector w 2 2< p andtheoffsetparameter b 2 2< P 2 canbeobtainedbysolvingfortheeigenvectorcorrespondin gtomaximum eigenvalueofthegeneralizedeigenvalueproblem GEV ( H 1 G 2 )where, G 2 =[ A 2 e ] 0 [ A 2 e ]+ I H 1 =[ A 1 e ] 0 [ A 1 e ], z 0 =[ w 2 0 b 2 ] (4–6) 49
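A small numerical sketch of this generalized eigenvalue view is given below: it builds G_1 and H_2 from two random classes and extracts the eigenvector with the largest eigenvalue, as in (4–4). The random class matrices, the value of the Tikhonov constant δ, and the sign convention used for the augmented matrix [A, −e] (so that z = [w; b] encodes ⟨w, x⟩ − b = 0) are illustrative assumptions made only for this sketch.

    # Sketch of PSVMs as a generalized eigenvalue problem, eqs. (4-2)-(4-6).
    # scipy.linalg.eigh solves the symmetric-definite problem H z = lambda G z.
    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(0)
    A1 = rng.normal(0.0, 1.0, size=(40, 10))     # class C1 samples (rows); illustrative
    A2 = rng.normal(1.0, 1.0, size=(35, 10))     # class C2 samples (rows); illustrative
    delta = 0.1                                  # assumed Tikhonov regularization constant

    def augmented(A):
        # [A, -e] so that z = [w; b] encodes the hyperplane <w, x> - b = 0
        return np.hstack([A, -np.ones((A.shape[0], 1))])

    E1, E2 = augmented(A1), augmented(A2)
    G1 = E1.T @ E1 + delta * np.eye(E1.shape[1])
    H2 = E2.T @ E2

    # Largest generalized eigenpair of H2 z = lambda G1 z gives the plane closest to C1
    # and farthest from C2; eigh returns eigenvalues in ascending order.
    vals, vecs = eigh(H2, G1)
    z1 = vecs[:, -1]
    w1, b1 = z1[:-1], z1[-1]
    print("weight vector:", np.round(w1, 3), " offset:", round(float(b1), 3))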


Anewpoint x isclassiedbycomputingthedistancefromeitherhyperpla nes: dist ( x P i )= j w Ti x b i j j w i j i 2f 1,2 g (4–7) andtheclassisdeterminedas: class ( x )= argmin i 2f 1,2 g f dist ( x P i ) g (4–8) LeastSquaresApproach: AnalternateformulationforPSVMsviatheleast squaresapproachisproposedinthissection.Werstestabl ishtheequivalence betweeneigenvalueproblemsandtheleast-squaresproblem s. Theorem4.1. Considerarealmatrix X 2< n p withrank r min ( n p ) .Letmatrices V 2< p p and D 2< p p satisfythefollowingrelation: V T ( X T X ) V = D (4–9) where, D = diag ( 2 1 2 2 ,... 2 r ,0,0,...,0) p p .Assume 2 1 2 2 2 r .Forthe followingoptimizationproblem, minimize ^ 2< p jj X X ^ T jj 2F + ^ T ^ subjectto T =1 (4–10) ^ opt isproportionalto v 1 ,where v 1 istheeigenvectorcorrespondingtothelargest eigenvalue 2 1 Pleasereferto[ 176 ]foraproofoftheorem1. Usingtheorem1,wenowestablishthattheproximalhyperpla nes P 1 and P 2 can beobtainedviatheleast-squaresapproach.Letthecholesk ydecompositionofthe matrices H 2 and G 1 begivenby: H 2 = U T2 U 2 G 1 = U T1 U 1 (4–11) where U 1 and U 2 areuppertriangularmatrices. 50


Using( 4–11 )in GEV ( H 2 G 1 ) U T2 U 2 z = U T1 U 1 z (4–12) ( U 2 U 1 1 ) T ( U 2 U 1 1 ) U 1 z = U 1 z (4–13) ( U 2 U 1 1 ) T ( U 2 U 1 1 ) y = y (4–14) where U 1 z = y Theoptimaleigenvectorcorrespondingtoproximalhyperpl ane P 1 ( 4–1 )canbe foundbythefollowingrelation: z opt = U 1 1 ^ y (4–15) where ^ y istheeigenvectorcorrespondingtothemaximumeigenvalue ofthesymmetric eigenvalueproblemgivenin( 4–14 ). Bysubstituting X = U 2 U 1 1 ^ = U 1 in( 4–10 ),andre-arrangingtheterms,the followingleast-squaresoptimizationproblemisobtained : minimize jj U 2 U 1 1 U 2 T jj 2F + T G 1 subjectto T =1 (4–16) Bytheorem1,theoptimalsolutionfor( 4–16 ) opt isproportionalto z 1 ,theeigenvector correspondingtothelargesteigenvalueof GEV ( H 2 G 1 ). Thisoptimizationproblemissolvedbyalternatingover and .Foraxed ,the followingoptimizationproblemissolvedtoobtain minimize jj U 2 U 1 1 U 2 T jj 2F subjectto T =1 (4–17) Expandingtheobjectivefunction, tr ( U 2 U 1 1 U 2 T ) T ( U 2 U 1 1 U 2 T )) = tr ( 2 T H 2 U 1 1 + T H 2 + U T 1 H 2 U 1 1 ) 51


Subsituting T =1 ,theoptimizationproblemin( 4–17 )isequivalentto: maximize T U T 1 H 2 subjectto T =1 (4–18) Ananalyticalsolutionforthisproblemexistsandthe opt isgivenby, opt = U T 1 H 2 k U T 1 H 2 k (4–19) Foragiven ,theoptimizationproblem( 4–16 )canbereducedtoridgeregression-type problem.Toseethis,let ^ A beanorthogonalsetofvectorssuchthat A ort =[ ; ^ A ]is p p orthogonal.Consideringthersttermin( 4–16 ), jj U 2 U 1 1 U 2 T jj 2F (4–20) = tr (( U 2 U 1 1 U 2 T ) T ( U 2 U 1 1 U 2 T )) = tr ( A Tort ( U 2 U 1 1 U 2 T ) T ( U 2 U 1 1 U 2 T ) A ort ) = tr (( U 2 U 1 1 A ort U 2 [ 1 0 0 ,..., 0 ] p p ) T ( U 2 U 1 1 A ort U 2 [ 1 0 0 ,..., 0 ] p p )) = jj U 2 U 1 1 A ort U 2 [ 1 0 0 ,..., 0 ] p p jj 2F = jj U 2 U 1 1 U 2 jj 2 + jj U 2 U 1 1 ^ Ajj 2F (4–21) So,if isxed, optimizesthefollowingregressionproblem: minimize jj U 2 U 1 1 U 2 jj 2 + T G 1 (4–22) Inthiscaseaswell,ananalyticalsolutioncanbefoundbyth efollowingexpression: opt =( H 2 + G 1 ) 1 H 2 U 1 1 (4–23) Algorithm 1 summarizesthestepsneededtosolvefortheoptimalhyperpl ane P 1 in PSVMsusingtheleastsquares(LS)approach.Similarly,the hyperplane P 2 canbe obtainedfromAlgorithm1withinputparameters ( H 1 G 2 ) 52


Algorithm1 PSVMs-via-LS( H 2 G 1 ) 1.Initialize 2.Findtheuppertriangularmatrix U 1 fromthecholeskydecompositionof G 1 3.Find fromthefollowingrelation: = U T 1 H 2 k U T 1 H 2 k 4.Find asfollows: =( H 2 + G 1 ) 1 H 2 U 1 1 5.Alternatebetween3and4untilconvergence. 4.2SparseProximalSupportVectorMachines(sPSVMs) PSVMsareextendedtoperformfeatureselectionwithrespec ttoinputfeatures bylearningasparserepresentationoftheproximalhyperpl anes.Sparsityhas beenwellstudiedinthecontextofregressionproblemsandi sgenerallyachieved viaregularizationtechniques.Hence,sparsityisinduced inPSVMsbysolvingthe PSVMs-via-LSproblemwithanadditional l 1 -normterm.Unlikethe l 2 -norm,the l 1 -norm termisknowntodrivesomecomponentsofthecoefcientvect ortoexactlyzero[ 156 ]. Thefeaturescorrespondingtothenon-zerocoefcientsins parsehyperplanesobtained fromsPSVMscanbeinterpretedastheimportantfeaturesdis criminatingtheclasses. Inducingsparsity: .SparsityisinducedinPSVMsbyaddingan l 1 -normtermtothe objectivefunctiongivenin( 4–16 ).Theresultingoptimizationproblemisgivenby: minimize jj U 2 U 1 1 U 2 T jj 2F + T G 1 + jj jj 1 subjectto T =1 (4–24) wheretheparameter controlsthelevelofsparsityinthecoefcientvector Again,theoptimizationproblemissolvedbyalternatingov er and .Foraxed canbefoundasbefore: opt = U T 1 H 2 k U T 1 H 2 k (4–25) 53


Given canbefoundbysolvingthefollowingoptimizationproblem: minimize jj U 2 U 1 1 U 2 jj 2 + T G 1 + jj jj 1 (4–26) Expanding( 4–26 ), minimize ( U 2 U 1 1 U 2 ) T ( U 2 U 1 1 U 2 )+ T G 1 + jj jj 1 minimize T ( H 2 + G 1 ) 2 T U T 1 H 2 + jj jj 1 Assuming, W =[ U 2 p U 1 ] T y =[( U 2 U 1 1 ) T 0] T minimize T W T W 2 y T W + jj jj 1 (4–27) Theoptimizationproblem( 4–27 )iscommonlyreferredasLASSOregressionin thestatisticsliterature[ 156 ].SeveralefcientalgorithmslikeLeastAngleRegression (LARS)[ 46 ]andcoordinatedescentmethods[ 56 ]existinliteraturetosolvethisproblem andobtainpartialsolutionpaths.So,insteadoftuning ,wesolvesPSVMsfora pre-speciednumberofnon-zerocoefcientsorfeatures B .Thishelpsincomparingthe performanceofsPSVMswithotherembeddedfeatureselectio nmethodsfordifferent valuesof B .ThestepsofthealgorithmtosolvesPSVMsaresummarizedin Algorithm 2 4.3NumericalExperiments DatasetsandExperimentalSetup .TheperformanceofsPSVMsisevaluated ontwocategoriesofpubliclyavailabledatasets.Therstc ategoryincludesthreehigh dimensionaldatasets:Colon[ 3 ],Leukemia[ 61 ]andBreast[ 164 ],wherethenumberof featuresaremuchlargerthanthenumberofsamples.Theseco ndcategoryconsists of5publiclyavailabledatasetsfromtheUCImachinelearni ngrepository[ 51 ]:WPBC, Ionosphere,WDBC,SpambaseandMushroom;wherethenumbero fsamplesoutweigh thenumberoffeatures.Detailedinformationonthedimensi onsofthedatasetsaregiven inTable 4-1 54


Algorithm2 sPSVMs( H 2 G 1 ) 1.Initialize 2.Find U 1 and U 2 thatsatisfy, G 1 = U T1 U 1 H 2 = U T2 U 2 3.Find fromthefollowingequation: = U T 1 H 2 k U T 1 H 2 k 4.SolvethefollowingLASSOregressionproblemtoobtain : minimize k y W k 2 + jj jj 1 where W and y aregivenby: W = U 2 p U 1 y = U 2 U 1 1 0 5.Alternatebetween3and4untilconvergence. TheperformanceofsPSVMsonCategoryIdatasetsisevaluate dfordifferentvalues ofselectedfeatures B = f 10,20,30 g ,andcomparedwithC-SVMs,PSVMs,Regularized LogisticRegression(RLR)[ 175 ],sparseSVMs(s-SVMs)[ 153 ]andSVM-RFE[ 67 ].As twoindependentoptimizationproblemsaresolvedinsPSVMs ,numberofselected featuresforeachproblemarechosentobe B = 2 .Duetosmallsamplesize,the parametersfordifferentmethodsarechosen apriori .ThevalueofCforC-SVMsand SVM-RFEissetto1,while =0.1 inPSVMs.Thevalueof inRLRischosenas1 and =100insPSVMs.ForthecategoryIIdatasets,theperformanc eofsPSVMsis comparedagainstPSVMsandC-SVMs.Themodelparametersare obtainedvia10-fold crossvalidation.LibSVMpackage( http://www.csie.ntu.edu.tw/ ~ cjlin/libsvm/ )is usedforC-SVMsandSVM-RFE,FGMpackage( http://c2inet.sce.ntu.edu.sg/Mingkui/FGM.htm ) fors-SVMsandGLMNETpackage( http://www-stat.stanford.edu/ ~ tibs/glmnet-matlab/ ) forRLRandsPSVMs. 55
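A rough Python sketch of the alternating scheme in Algorithm 2 is shown below, with scikit-learn's Lasso solver standing in for the LARS and coordinate-descent solvers mentioned earlier. The small ridge added before the Cholesky factorizations, the value of δ, the l1 penalty and the fixed iteration count are implementation assumptions made only so the sketch runs end to end; they are not the tuned settings used in the experiments.

    # Rough sketch of sPSVMs (Algorithm 2): alternate the closed-form alpha-step (4-25)
    # with the LASSO beta-step (4-26)/(4-27).  Assumed: H2 and G1 as in eq. (4-3), made
    # positive definite here by a tiny ridge so that Cholesky succeeds.
    import numpy as np
    from numpy.linalg import cholesky, solve, norm
    from sklearn.linear_model import Lasso

    def spsvm_plane(H2, G1, delta=100.0, alpha_l1=0.05, n_iter=20):
        p = H2.shape[0]
        U1 = cholesky(G1 + 1e-8 * np.eye(p)).T      # G1 = U1' U1, U1 upper triangular
        U2 = cholesky(H2 + 1e-8 * np.eye(p)).T      # H2 = U2' U2
        W = np.vstack([U2, np.sqrt(delta) * U1])    # design matrix of eq. (4-27)
        beta = np.ones(p) / np.sqrt(p)
        for _ in range(n_iter):
            # alpha-step, eq. (4-25): closed form on the unit sphere
            a = solve(U1.T, H2 @ beta)
            alpha = a / norm(a)
            # beta-step: l1-penalized least squares (LASSO) as in eq. (4-27)
            y = np.concatenate([U2 @ solve(U1, alpha), np.zeros(p)])
            beta = Lasso(alpha=alpha_l1, fit_intercept=False, max_iter=5000).fit(W, y).coef_
        return alpha, beta      # nonzero entries of beta mark the selected features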


Table 4-1. Datasets used in numerical experiments.

    Dataset       #Samples   #Features
  Category I
    Colon             62        2000
    Leukemia          38        3051
    Breast            77        4869
  Category II
    WPBC             198          33
    Ionosphere       351          34
    WDBC             569          30
    Spambase        4601          57
    Mushroom        8124         126

4.3.1 Category I Datasets

Classification Performance: The performance of sPSVMs on Category I datasets is evaluated and compared with other methods. All experiments are conducted with an 80/20 split on the data, where 80% of the data is randomly selected for training the classification model and the remaining 20% is used to test its performance. This process is repeated for N = 50 repetitions. Since the test sets in the 50 repetitions are expected to overlap and hence are no longer independent, comparing test accuracies among different classification methods for statistical significance by performing conventional hypothesis testing may not be accurate [117]. Hence, we follow the procedure outlined in [117] and perform the corrected resampled t-test to calculate confidence intervals for the average test accuracies and also to assess the statistical significance of the observed test performance differences among the classification methods and sPSVMs. Table 4-2 shows the test accuracies along with their confidence intervals for all classification methods for different values of B. Two approaches to PSVMs, PSVMs-GEV and PSVMs-LS, are also evaluated. The results indicate that PSVMs-LS is an equivalent formulation of PSVMs-GEV. C-SVMs and PSVMs perform equally well in comparison to sPSVMs on all datasets and, interestingly, are not affected by the high dimensionality of the data. However, in comparison to C-SVMs and PSVMs, sPSVMs achieve similar test accuracies, but with a relatively small number of features by inherently performing feature


selection. In the case of the Colon dataset, all methods perform equally well for all values of B. For the Leukemia dataset, all methods perform well in the B = 10 case, while sPSVMs, RLR and s-SVMs perform better than SVM-RFE for B = {20, 30}. In the case of the Breast dataset, RLR performs well for all values of B, while sPSVMs and s-SVMs perform better than SVM-RFE for higher values of B.

Table 4-2. Test accuracies for Category I datasets from C-SVMs, PSVMs, sPSVMs, RLR, s-SVMs and SVM-RFE.

    Dataset    C-SVMs        PSVMs-GEV     PSVMs-LS      B    sPSVMs         RLR            s-SVMs         SVM-RFE
    Colon      82.17±8.64    87.83±6.95    87.83±6.95    10   80.67±17.82    84.17±8.50     78.17±10.63    82.33±11.41
                                                         20   83.80±12.50    83.67±10.06    81.17±11.30    81.33±12.40
                                                         30   84.67±11.76    81.33±10.15    78.33±12.48    80.67±10.57
    Leukemia   98.86±3.90    99.14±3.42    99.14±3.42    10   94.29±11.47    90.86±10.68    94.00±10.01    86.29±17.50
                                                         20   98.29±4.81     92.57±9.65     96.29±7.51     86.86±14.63
                                                         30   100.0          92.57±9.65     96.29±6.94     87.71±13.19
    Breast     63.75±10.28   63.50±9.95    63.50±9.95    10   64.12±11.85    70.25±11.01    62.63±11.65    52.38±14.15
                                                         20   65.25±12.22    69.25±10.70    60.25±12.86    52.75±10.81
                                                         30   66.13±11.94    67.87±11.53    61.75±10.93    53.63±10.29

Feature Elimination: Since different subsets of the data are used for training sPSVMs during the 50 repetitions, there is a possibility of different sets of features being selected during each repetition. Figure 4-1 shows the feature selection frequency plot for each class in the Colon dataset for the case B = 20. The y-axis represents the number of times each feature is selected during the training phase. It is interesting to note that different features are repeatedly selected for each class. The list of top-ten selected features for each class in terms of frequency is shown in Table 4-3. It is worth noting that only one feature (765) appears in both classes. Thus sPSVMs could be viewed as inducing class-specific local sparsity instead of global sparsity like other embedded methods. Such class-specific feature selection could potentially assist a domain expert to analyze the selected features in the context of different classes. Assuming that selection frequency reflects the importance of a feature, this could be further utilized to perform feature elimination. A feature filtering parameter τ is defined that reflects the minimum number of times a feature is selected. For example, setting τ = 20 would select those features that have been picked at least 20 times during the training phase.


Figure 4-1. Frequency plot of 2000 features in the Colon dataset during 50 iterations of the training phase - A) Tumor class, B) Normal class.

Table 4-3. The list of top-ten selected features in terms of frequency for each class in the Colon dataset for B = 20.

    Tumor class:  223, 405, 423, 765, 769, 1073, 1313, 1318, 1520, 1614
    Normal class: 249, 493, 513, 765, 897, 1042, 1325, 1423, 1504, 1668

Thus, a reduced dataset can be formed from the selected features, on which sPSVMs can be re-run to perform feature selection/classification. The number of selected features and their corresponding test accuracies as a function of τ for B = {10, 20, 30} are shown in Figure 4-2. As τ increases from 0 to 25, the number of selected features decreases, while the test accuracies increase for all datasets, showing that features irrelevant to classification are removed in all cases. The maximum numbers of selected features corresponding to τ = 25 for the 3 datasets are 9, 16 and 12 respectively, which account for less than 2% of the original number of features. Thus, sPSVMs offer an effective


way to eliminate more than 98% of the features without compromising on classification accuracies.

Figure 4-2. The number of selected features (dashed line) and the corresponding test accuracies (solid line) obtained from sPSVMs as a function of the threshold τ for Category I datasets - A) Colon, B) Leukemia, C) Breast.
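The frequency-based filtering described above amounts to counting, over the N training repetitions, how often each feature receives a nonzero coefficient and keeping the features selected at least τ times. A minimal sketch is given below; selected_per_run is an assumed list of selected-feature index arrays, one per repetition, produced by whatever feature selector is being studied.

    # Count selection frequencies across repetitions and keep features picked >= tau times.
    import numpy as np

    def filter_by_frequency(selected_per_run, n_features, tau=20):
        counts = np.zeros(n_features, dtype=int)
        for idx in selected_per_run:
            counts[np.asarray(idx)] += 1
        kept = np.flatnonzero(counts >= tau)
        return kept, counts

    # Example with dummy selections over 50 repetitions of a 2000-feature problem.
    rng = np.random.default_rng(0)
    runs = [rng.choice(2000, size=20, replace=False) for _ in range(50)]
    kept, counts = filter_by_frequency(runs, n_features=2000, tau=20)
    print(len(kept), "features selected at least 20 times")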


Feature Selection Stability: It is important for the proposed method to show stability in the feature selection process by repeatedly selecting important features despite different model parameters and small perturbations of the training sets. Hence, we first study the identity of the selected features over different folds in the training phase. As mentioned earlier, different sets of features could be selected during the training phase as different subsets of the data are presented to sPSVMs. For any two sets of features S_i and S_j, we estimate the Jaccard index J(i, j) [70] as follows:

J(i, j) = |S_i ∩ S_j| / |S_i ∪ S_j|,   for all i, j = 1, 2, ..., N    (4–28)

Intuitively, the Jaccard index indicates the degree of overlap in the selected features across any two folds. Generally, univariate feature selection techniques are known to promote feature selection stability [72]. For this reason, the t-test, the Fisher criterion and the Wilcoxon rank-sum test [142] are considered here for comparison with sPSVMs. Table 4-4 shows the average Jaccard index for these methods along with sPSVMs on all the datasets. The performance of sPSVMs is on par with the other univariate feature selection techniques and generally seems to promote greater stability for smaller values of B.

Table 4-4. Average Jaccard index for Category I datasets from sPSVMs, t-test, Fisher criterion and Wilcoxon rank-sum tests.

    Dataset    B    sPSVMs   t-test   Fisher criterion   Wilcoxon rank-sum
    Colon      10   0.21     0.16     0.16               0.21
               20   0.21     0.21     0.23               0.24
               30   0.22     0.23     0.23               0.25
    Leukemia   10   0.25     0.12     0.14               0.16
               20   0.22     0.14     0.18               0.20
               30   0.21     0.16     0.21               0.21
    Breast     10   0.10     0.11     0.11               0.12
               20   0.09     0.12     0.12               0.12
               30   0.09     0.13     0.13               0.13

The effect of B on the feature selection process is studied next. It is expected that the proposed method selects important features irrespective of the value of B. The


features selected for all values of B at different thresholds τ = {15, 20, 25} for the Colon dataset are shown in Figure 4-3. For τ = 15, the features {493, 765, 1042, 1325} selected for B = 10 consistently appear for other values of B as well. Similarly, the features {765, 1042, 1325} selected for τ = {20, 25} also appear for other values of B. This behavior has been observed for other datasets (not shown here), where the features selected for lower values of B are repeatedly picked for higher values of B as well. Thus sPSVMs are consistent in selecting features important for classification, which could be further analyzed by a domain expert.

Figure 4-3. Features selected from the Colon dataset as a function of B for different values of τ - A) τ = 15, B) τ = 20, C) τ = 25.

4.3.2 Category II Datasets

The effectiveness of sPSVMs in eliminating irrelevant/noisy features on normal datasets is evaluated by performing numerical experiments on the publicly available smaller dimensional datasets listed as Category II in Table 4-1. The number of selected features B is varied between 0.1p and 0.5p, where p represents the total number of features. The


parameters and arevariedintherange f 10 5 ,10 4 ... ,0.1 g and f 0.1,1,10,100 g respectively.TheparameterCinC-SVMsischosenfromthese t f 10 5 ,10 4 ... 100 g .A10-foldcrossvalidationisperformedtoselectthebestp arametersandthe correspondingaverageclassicationaccuraciesarerepor tedinTable 4-5 .Generally, theaccuraciesofsPSVMsincreasewiththevalueof B .Overall,standardSVMsare moreefcientoncategoryIIdatasets,butstillsPSVMsperf ormbetterthanPSVMsfor thedatasetsWPBC,IonosphereandWDBCwithlessthanhalfth enumberoffeatures. Additionally,sPSVMsperformcomparablywellonMushroomd atasets.Thisindicates thatsPSVMsareapplicableevenonsmallerdimensionaldata setsandareeffectivein removingirrelevantfeaturesthatwouldaffecttheclassi cationaccuracy. 62
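For the Category II experiments, the parameter search can be reproduced with a standard grid search wrapped in 10-fold cross validation. The sketch below shows this for the C-SVM baseline over the grid of C values quoted above, using scikit-learn's copy of the WDBC data as a convenient stand-in for the corresponding UCI dataset; the kernel choice is an illustrative assumption.

    # 10-fold cross-validated grid search for the C-SVM baseline; illustrative sketch.
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    from sklearn.datasets import load_breast_cancer   # stand-in for the WDBC dataset

    X, y = load_breast_cancer(return_X_y=True)
    grid = {"C": [10.0 ** k for k in range(-5, 3)]}    # {1e-5, ..., 100}
    search = GridSearchCV(SVC(kernel="linear"), grid, cv=10)
    search.fit(X, y)
    print("best C:", search.best_params_["C"],
          "CV accuracy:", round(search.best_score_, 3))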


Table4-5.CVaccuraciesforCategoryIIdatasetsfromC-SVM s,PSVMsandsPSVMs. Datasets C-SVMsPSVMssPSVMs CAcc(%) Acc(%) B Acc(%) WPBC183.88 6.04168.7 8.21 0.1 p 0.0011054.5 20.9 0.2 p 0.0110077.1 1.8 0.3 p 0.110078.1 2.6 0.4 p 0.110078.1 2.9 0.5 p 0.110077.2 3.6 Ionosphere1087.78 5.410.177.5 4.81 0.1 p 1160 8.51 0.2 p 0.11077 3.7 0.3 p 0.11078.5 3.8 0.4 p 0.11078.9 3.8 0.5 p 0.11079.1 4.5 WDBC195.9 2.2 10 3 92.6 2.9 0.1 p 110072.9 9.9 0.2 p 0.0110092.4 3.2 0.3 p 0.0110091.7 2.7 0.4 p 0.0110093.2 2.3 0.5 p 0.011094.7 2.7 Spambase172.1 1.20.168.4 1.3 0.1 p 0.011061.2 0.3 0.2 p 0.110061.2 0.3 0.3 p 0.110061.2 0.3 0.4 p 0.110061.3 0.4 0.5 p 0.110061.3 0.5 Mushroom0.0199.45 0.120.199.8 0.1 0.1 p 0.0110096.3 0.4 0.2 p 0.0110099.6 0.2 0.3 p 0.0110099.7 0.1 0.4 p 0.0110099.8 0.1 0.5 p 0.0110099.8 0.1 63


CHAPTER5 COMBININGFEATURESELECTIONANDSUBSPACELEARNING Recently,norm-basedsparsityindimensionalityreductio nhasbeenwidely investigatedandalsoappliedforfeatureselectionstudie s.Popularsubspacelearning methodslikePrincipalComponentAnalysis(PCA)andLinear DiscriminantAnalysis (LDA)havebeenreformulatedusingnormconstraintstoindu cefeaturesparsity.For example,Zou etal. [ 176 ]proposedasparsePCAalgorithmbasedon l 1 -normand l 2 -normregularization.Moghaddam etal. [ 113 ]proposedbothexactandgreedy algorithmsforbinaryclasssparseLDA.Cai etal. [ 24 ]proposedauniedsparse subspacelearningframeworkbasedon 1 1 -normregularization. Subspacelearningmethodslearnacoefcientmatrix W fromthedata, W =[ w 1 w 2 ,..., w k ] (5–1) where w i 2< p representsacoefcientvector.Here,eachrowin W representsa feature,whilethecolumnrepresentsdimensionofthesubsp ace.Forexample,inthe caseofPrincipalComponentAnalysis(PCA), w i representtheprincipalcomponents. However,thefeaturesselectedbysparsesubspacemethodsa regenerallyindependent andaredifferentforeachdimensionofthesubspace.Though dimensionalityreduction hasbeenachieved,itisstillunclearwhicharethemostimpo rtantfeatures.Inorderto achievefeatureselectioninsuchmethods,ideally,wewoul dliketoinduce rowsparsity on W ,wherethealltheelementsinaroware simultaneously zero.Then,therows correspondingtonon-zerocoefcientscanbeinterpreteda sthemostimportantfeatures acrossalldimensions.Rowsparsityhasbeenachievedinlit eratureusinga l 2,1 -normof amatrixasdescribedbelow: 5.1 l 2,1 -norm The l 2,1 -normofamatrixwasrstintroducedin[ 42 ]astherotationalinvariant l 1 -normandalsousedformulti-tasklearning[ 47 125 ]andtensorfactorization[ 77 ].Itis 64


denedas: k W k 2,1 = p X i =1 p ( k X i =1 w 2 ij )= p X i =1 k w i k 2 (5–2) where w i representsthe i -throwofthematrix. The l 2,1 -normisrotationalinvariantforrows: k WR k 2,1 = k W k 2,1 .Additionally,itcan beshownthatitisavalid norm asitsatisesthethreenormconditions: positivity: k A k 2,1 > 0, 8k A k6 = 0 (5–3) absolutehomogenity: k A k 2,1 = j jk A k 8 2<6 =0, (5–4) triangularinequality: k A k 2,1 + k B k 2,1 k A + B k 2,1 (5–5) where A and B 2< p k Similarto l 1 -norm,minimizationof l 2,1 -normencouragesrowsofthematrixto simultaneouslygotozero.Thispropertyof l 2,1 -normhasbeenutilizedtoinducerow sparsityinthecoefcientmatrix. 5.2JointFeatureSelectionandSubspaceLearning Yan et.al [ 170 ]proposedauniedgraphembeddingframeworkthatcanbe utilizedtointerpretpopularsubspacelearningmethodsli kePCAandLDA.Inagraph embeddingframework,givenadatamatrix X 2< m n ,adatagraph G isconstructed whoseverticescorrespondto f x 1 x 2 ,..., x m g .Let A 2< m m beasymmetricadjacency matrixwith A ij representingsomesimilarityrelationshipbetweendata.T hepurposeof graphembeddingistondalowerdimensionalsubspacerepre sentationofgraph G that bestpreservestherelationshipbetweenthedatapoints.Th eoptimal W canbeobtained by: minimize W 2< p k tr ( W T XLX T W ) subjectto W T XDX T W = I k k (5–6) 65


where D ii = P j A ij isthediagonalmatrix, L = D W isthegraphlaplacian[ 26 ]. Thesolutionto( 5–6 )canbeobtainedbysolvingthefollowinggeneralizedeigen value problemgivenby: XLX T W = XDX T W (5–7) where isadiagonalmatrixwitheigenvaluesasthediagonalelemen ts. Differentsubspacelearningmethodscanbeobtainedwithdi fferentchoicesofthe adjacencymatrix A .Forexample,inthecaseofLinearDiscriminantAnalysis,f orc classeswith n k samplesineachclass, f ( x )= 8>><>>: 1 n k if x i and x j belongtothek-thclass 0, otherwise (5–8) l 2,1 -normisusedtoinducerowsparsityinsubspacelearningmet hods.The followinggeneraloptimizationproblemissolvedtoperfor mjointfeatureselection andsubspacelearning: minimize W 2< p k tr ( W T XLX T W )+ C k W k 2,1 subjectto W T XDX T W = I k k (5–9) Theparameter C controlsrowsparsitywithincreasingvaluesof C forcingmore rowstogotozero.Thoughtheobjectivefunctionin( 5–9 )isconvex,theoptimization problemisdifculttosolveastheconstraintisnon-convex 5.3CurrentApproach Nie etal. [ 119 ]proposedanefcientreformulationto( 5–9 ).Thefollowingtheorem allowsustoefcientlysolve( 5–9 ). Theorem5.1. Let Y 2< n m beamatrixofwhicheachcolumnisaneigenvectorof eigen-problem Wy = Dy .Ifthereexistsamatrix A 2< d m suchthat X T A=Y ,then eachcolumnof A isaneigenvectorofeigen-problem XWX T a= XDX T a withthesame eigenvalue 66


PleaserefertocorollaryofTheorem1in[ 24 ]fortheproofoftheorem 5.1 Theorem 5.1 showsthatinsteadofsolvingtheeigen-problem XLX T W = XDX T W W canbeobtainedbythefollowingtwosteps: Solvetheeigenvalueproblem: AY=DY (5–10) Find W suchthatitsatisesthefollowinglinearsystemofequatio ns: X T W=Y (5–11) Since( 5–11 )isalinearsystem,itcouldhavethefollowingpossibiliti es: 1. Thesystemhasinnitelymanysolutions, 2. Thesystemhasasingleuniquesolution, 3. Thesystemhasnosolution. Gu etal. [ 62 ]proposealgorithmstosolveincaseofthreepossibilities Case1: Incasewhenthelinearsystemhasinnitelymanysolutions, thefollowing convexrelaxationisproposed: minimize W 2< p k k W k 2,1 subjectto X T W = Y (5–12) Asimpleiterativealgorithmisproposedin[ 62 ]tosolve( 5–12 ).Thealgorithmis summarizedasfollows: 67


Algorithm3 JointFeatureSelectionCase1( X,W,D ) 1.Initialize G 0 = I ,k=0. 2.Compute Y basedon WY = DY 3.Compute: W k +1 = G 1 t X ( X T G 1 t X ) 1 Y (5–13) 4.Computethediagonalmatrix G k +1 asfollows: g ii = 8>><>>: 0, if w i = 0 1 k a i k 2 otherwise (5–14) 5.Alternatebetween3and4untilconvergence. Theconvergenceofthisalgorithmisprovedin[ 119 ]. Case2: Inthecasewhenthelinearsystem( 5–11 )hasauniquesolutionorno solutions,thefollowingconstrainedoptimizationproble misconsidered: minimize W 2< p k k W k 2,1 subjectto k X T W Y k 2F (5–15) Here,theparameter controlstherowsparsitywithsmallervaluesof inducing largersparsity.[ 62 ]againproposesasimpleefcientalgorithmforsolving( 5–15 ). 68


Algorithm4 JointFeatureSelectionCase2( X,W,D ) 1.Initialize G 0 = I ,k=0. 2.Compute Y basedon WY = DY 3.Compute: W k +1 = G 1 t X ( X T G 1 t X + I ) 1 Y (5–16) 4.Computethediagonalmatrix G k +1 asfollows: g ii = 8>><>>: 0, if w i = 0 1 k a i k 2 otherwise (5–17) andincrement k to k +1 5.Alternatebetween3and4untilconvergence. Thoughthecurrentapproachissimpleandefcient,ithasit sownlimitations. Forexample,inthecasewhenthelinearsystem( 5–11 )hasinnitelymanysolutions, thereformulationto( 5–9 )limitscontrolonrowsparsityastheparameter C isnot explicitlyset.Thisisespeciallytrueinthecaseofhighdi mensionaldatasetsasthey generallyconformtocase1.Anexplicitcontrolonrowspars ityiscriticalinthecase oflimitedsamplehighdimensionaldatasetsasincreasingv aluesof C couldleadto highersparsityandthusimprovethegeneralizationperfor mance.Hence,wepropose toprovideanexactsolutiontotheoriginaloptimizationpr oblem( 5–9 )withanalternate approachdescribedbelow. 5.4AlternateApproach Theoptimizationproblem( 5–9 )canbere-writtenwithachangeofvariables.Let, ^ W = D 1 = 2 X T W (5–18) 69


Then,theoptimizationproblem( 5–9 )reducesto: minimize ^ W 2< p k tr ( W T D 1 = 2 LD T = 2 ^ W )+ C k D 1 = 2 X T ^ W k 2,1 subjectto ^ W T ^ W = I k k (5–19) Minimizationwithorthogonalityconstraints( ^ W T ^ W = I k k )haswideapplicationsin combinatorialoptimization,eigenvalueproblems,sparse PCAetc.Theseoptimization problemsaredifculttosolveastheconstraintsarenon-co nvexandcouldleadtomany localminimizers.Manyoftheseproblemsinspecialformsar eNP-hard.Additionally, theseconstraintsarenumericallyexpensivetopreservedu ringiterations.Thefeasible set M kp := f Y 2< p k : Y T Y = I g isoftenreferredtoastheStiefelmanifold.Most existingconstraint-preservingalgorithmseitherusemat rixre-orthogonalizationor generatepointsalonggeodesicsof M kp .Theformerrequiresmatrixfactorizationssuch asthesingularvaluedecomposition,andthelattermustcom putematrixexponentialsor solvepartialdifferentialequations. SeveraloptimizationmethodsforRiemannianmanifoldsare applicableforsolving theoptimizationproblem( 5–19 ).Suchmethodsgenerallyrelyoneithergeodesicsor projections.Simplemethodsarebasedonthesteepestdesce ntgradientapproach [ 1 162 ]whilemoresophisticatedmethodslikeconjugategradient methodand Quasi-NewtonmethodhavealsobeenextendedtoRiemannianm anifoldsin[ 45 ] and[ 137 ]respectively.Second-ordermethodslikeNewton'smethod s[ 1 45 ]havebeen appliedtoachievesuper-linearconvergence.Since,these cond-ordermethodsmay requireadditionalcomputationdependingonthecostfunct ions,theycanrunslower thanthesimplealgorithmsthatareonlybasedonthegradien t. Here,weproposetosolve( 5–19 )usinganalternateconstraint-preservingupdate schemesuggestedin[ 168 ].Wen etal. applytheCayleytransform,aCrank-Nicolson-like [ 60 ]updatescheme,topreservetheconstraintsandbasedonit, developcurvilinear 70


searchalgorithmswithlowercomputationalcostcomparedt othosebasedonprojections andgeodesics.Webrieydescribethemethodbelow: Cayleytransformationbasedminimization: Considerthefollowingoptimization problemwithorthogonalityconstraints: minimize Y 2< p k F ( Y ) subjectto Y T Y = I (5–20) where I istheidentitymatrixand F ( Y ): < p k !< istheobjectivefunction.Theideaof Cayleytransformationmethodistogeneratefeasiblepoint sontheStiefelmanifold M kp usingCayleytransformconstraintpreservingupdateschem eandmonotonecurvilinear searchalgorithmsateachiteration.Theupdateschemeandt hecurvilinearsearch approachisdescribedbelow. Constraint-preservingupdatescheme: Givenafeasiblepoint Y andthegradient G := DF ( Y )= @ F (Y) @ Y i j ,deneaskew-symmetricmatrix M as: M := GY T YG T (5–21) Giventhecurrentiterateas Y k ,thenewtrialpointisdeterminedalongthecurvegiven by: Y ( )= Y k M Y k + Y ( ) 2 (5–22) whichsatises Y ( ) T Y ( )= Y T Y foranyskew-symmetricmatrix M and 2< .Lemma 1 belowshowsthattheorthogonalityconstraintsarepreserv edandthat Y ( ) denesa descentdirectionat =0 Lemma1. (1)Givenanyskew-symmetricmatrix M 2< p p ,thenthematrix Q := ( I + M ) 1 (( I M ) iswell-denedandorthonormal,i.e. Q T Q = I 71


(2)Givenanyskew-symmetricmatrix M 2< p p ,thematrix Y ( ) denedby ( 5–22 ) satises Y ( ) T Y ( )= Y T Y .Additionally, Y ( ) canbeexpressedas: Y ( )= I + 2 M 1 I 2 M Y (5–23) anditsderivativewithrespectto isgivenas: Y 0 ( )= I + 2 M 1 M Y + Y ( ) 2 (5–24) andinparticular, Y 0 (0)= MY (3)Assume M := GY T YG T .Then Y ( ) isadescentcurveat =0 ,thatis, F 0 ( Y (0)):= @ F ( Y ( )) @ =0 = 1 2 k M k 2F (5–25) Pleasereferto[ 168 ]forproofofLemma 1 Thematrixinverse I + 2 M 1 dominatesthecomputationof Y ( ) .Thisis particularlyproblematicinthecaseofhighdimensionalda tasetsasthesizeofthe inversematrixwouldbeontheorderofthenumberoffeatures .However,theinverse computationbecomesrelativelycheapwhenthedimensional ityofthesubspaceismuch smallerthanthenumberoffeatures( k << p ).ThefollowingLemmashowsanefcient waytocalculatetheinverseusingSherman-Morrison-Woodb ury(SMW)formula[ 60 ]. Lemma2. Suppose M = LR T RL T ,where L R 2< p k .Rewrite W = UV T for U =[ L R ] and V =[ R L ] .If I + 2 V T U isinvertible,then ( 5–23 ) isequivalentto: Y ( )= Y U I + 2 V T U 1 V T Y (5–26) Pleasereferto[ 168 ]fortheproofofLemma 2 Generally,inthecaseofhighdimensionaldatasets,when k << p ,inverting I + V T U 2< 2 k 2 k ismucheasierthaninverting I + 2 M 2< p p 72
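The low-cost update of Lemma 2 is easy to state in code: with M = GY^T − YG^T written in thin factors, only a 2k × 2k system has to be solved per trial step. The sketch below follows (5–26); the random Y and G are placeholders, and the factor V = [Y, −G] is written with an explicit minus sign so that UV^T reproduces the skew-symmetric M.

    # Cayley-transform update of eqs. (5-21)-(5-26) in Sherman-Morrison-Woodbury form.
    import numpy as np

    def cayley_update(Y, G, tau):
        p, k = Y.shape
        # M = G Y' - Y G' is skew-symmetric; write it as U V' with thin factors.
        U = np.hstack([G, Y])                  # p x 2k
        V = np.hstack([Y, -G])                 # p x 2k, so that U V' = G Y' - Y G'
        core = np.eye(2 * k) + 0.5 * tau * (V.T @ U)
        return Y - tau * U @ np.linalg.solve(core, V.T @ Y)

    # Quick check that feasibility is preserved: Y(tau)'Y(tau) = I for an orthonormal Y.
    rng = np.random.default_rng(0)
    Y0, _ = np.linalg.qr(rng.normal(size=(200, 4)))
    G = rng.normal(size=(200, 4))
    Y1 = cayley_update(Y0, G, tau=0.1)
    print(np.allclose(Y1.T @ Y1, np.eye(4), atol=1e-8))   # True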


Monotonecurvilinearsearch: Firstly,weassumethatthedescentcurve Y ( ) is generatedbyaskew-symmetricmatrix M satisfyingthefollowingcondition: F 0 ( Y (0)) k M k 2F (5–27) where > 0 isaconstant. Severalpreviousstudiesshowedthatthesteepestdescentm ethodwithaxedstep sizemaynotconverge.However,bychoosinganappropriates tepsize,convergence canbeguaranteedwithacceleratedspeedwithoutsignican tlyincreasingthecostof eachiteration.Ideally,ateachiteration k ,onecouldchoosethestepsizebyminimizing F ( Y k ( )) alongthecurve Y k ( ) toglobaloptimality.Sincendingasuchaglobal optimizeriscomputationallyexpensiveateachiteration, onegenerallyndsan approximatestepsizethatsatisestheArmijo-Wolfecondi tions.Inourcase,since theobjectivefunctionisnon-differentiable,wendstepsizethatsatisestheweak Armijo-Wolfelineconditionsgivenby: F ( Y k ( k )) F ( Y k (0))+ 1 k F 0 ( Y k (0)) (5–28) F ( Y k ( k )) 2 F 0 ( Y k (0)) (5–29) where 0 < 1 < 2 < 1 aretwoparameters.Severalalgorithmshavebeenproposedi n literaturetond k thatsatisfy( 5–28 )and( 5–29 )[ 114 120 ].Thealgorithmtondthe stepsizeisgiveninAlgorithm 5 : TerminationRules: Astheiteratesapproachastationarypoint,itisimportant todetecttheslowdownandstopappropriately.Additionall y,itistrickytocorrectly predictwhetheranalgorithmistemporarilyorpermanently trappedinaregionwhen itsconvergencespeedhasreduced.Hence,itisimportantto haveexibletermination rules.Wedenethefollowingterminationrules: Normoffunctiongradient rF ( Y ) goesbelowathreshold : rF ( Y ):= G YG T Y (5–30) 73


Algorithm5 WeakWolfeLineSearch( F ( Y k (0)), F 0 ( Y k (0))) Set =1, =0, = 1 while true do if F ( Y k ( k )) > F ( Y k (0))+ 1 k F 0 ( Y k (0)) then = elseif F ( Y k ( k )) < 2 F 0 ( Y k (0)) then = else STOP. endifif < 1 then =( + )/2 else =2 endif endwhile Numberofiterations k aregreaterthanapre-denedlimit K Relativechangein Y atiteration k and k +1islessthanathreshold Y tol Yk := k Y ( k ) Y ( k +1) k F p n Y (5–31) Relativechangeinobjectivefunctionvalue F atiteration k and k +1islessthanthe threshold F tol Fk := F ( k +1) F ( k ) jF ( k ) j +1 F (5–32) Meanchangeiniterate Y ( k ) andobjectivefunction F ( k ) overthelast T iterationsis belowthresholds 10 Y and 10 F respectively: mean ([ tol Yk min ( k T )+1 ,..., tol Yk ]) 10 Y andmean ([ tol Fk min ( k T )+1 ,..., tol Fk ]) 10 F (5–33) Thealgorithmisrununtiloneoftheterminationrulesaresa tised.Ouralgorithm combiningtheconstraintpreservingupdateandthecurvili nearsearchisdescribedin Algorithm 6 74
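Algorithm 5 can be transcribed almost directly into code as a bisection on the step size. In the sketch below, phi(tau) stands for F(Y_k(tau)) and dphi(tau) for its derivative along the curve, so the curvature test (5–29) is written in its usual derivative form; the constants c1, c2 and the iteration cap are illustrative safeguards rather than values taken from the text.

    # Bisection-style weak Wolfe line search, following the structure of Algorithm 5.
    def weak_wolfe_step(phi, dphi, c1=1e-4, c2=0.9, max_iter=50):
        lo, hi, tau = 0.0, float("inf"), 1.0
        phi0, dphi0 = phi(0.0), dphi(0.0)
        for _ in range(max_iter):
            if phi(tau) > phi0 + c1 * tau * dphi0:   # sufficient decrease (5-28) violated
                hi = tau
            elif dphi(tau) < c2 * dphi0:             # curvature condition (5-29) violated
                lo = tau
            else:
                return tau                           # both weak Wolfe conditions hold
            tau = (lo + hi) / 2.0 if hi < float("inf") else 2.0 * tau
        return tau

    # Toy usage on phi(t) = (t - 1)^2, whose derivative at 0 is negative (descent).
    print(weak_wolfe_step(lambda t: (t - 1.0) ** 2, lambda t: 2.0 * (t - 1.0)))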


Algorithm6 Cayleytransformwithcurvilinearsearch( F ) 1.Initialize Y (0) 2M pk k 0 0 ,and 0 < 1 < 2 < 1 2.Calculate M accordingto( 5–21 ). 3.Computethestepsize k accordingtoAlgorithm[ref]. 4.Set Y ( k +1) Y ( k ) 5.Ifoneoftheterminationrulessatisfy,then STOP ;otherwise k k +1 andgotostep2. 5.5ExtendingPrincipalComponentAnalysistofeaturesele ction(JSPCA) Wenowapplytheabovementionedalternateapproachtotwodi fferentapplications inextendingPrincipalComponentAnalysis(PCA)tofeature selection.PCAnds asetoflinearlyuncorrelatedvariablescalledprincipalc omponentsfromasetof observationsofpossiblycorrelatedvariables.PCAremove sredundancybytransforming thedatafromahigherdimensionalspaceintoanorthogonall owerdimensional space.Thistransformationisperformedinawaythatthers tprincipalcomponent accountsforthemaximumpossiblevariance,andeachsuccee dingcomponenthaving decreasingvaluesofvariance.Generally,theprincipalco mponentsareobtainedas linearcombinationsofallfeaturesandhenceitisdifcult toextractthemostimportant featuresforfurtherclassicationtasks.Here,weshowhow PCAcouldbeextendedto performfeatureselectionusingthe l 2,1 -norm.WecallthistechniqueasJointSparse PrincipalComponentAnalysis(JSPCA).Wesolvetheoptimiz ationproblemusingthe alternateapproachdescribedaboveandshownumericalresu ltsonpubliclyavailable datasets. Let X 2< m p representbegivenwiththerowsrepresentingsamplesandth e columnsrepresentingfeatures.Let U =[ u 1 u 2 ,..., u k ] p k representtheorthonormal basisofa k -dimensionallinearsubspace S S attemptstocapturemaximalvariancein 75


the data by optimizing the following optimization problem:

maximize_{U ∈ R^{p×k}}  tr(U^T X^T X U)
subject to  U^T U = I_{k×k}    (5–34)

The solution to the optimization problem (5–34) is given by the eigenvectors corresponding to the k largest eigenvalues of the matrix X^T X [60]. We extend PCA to perform feature selection by introducing the l_{2,1}-norm on the matrix U in (5–34). Hence, the optimization problem is modified as:

maximize_{U ∈ R^{p×k}}  tr(U^T X^T X U) − C ||U||_{2,1}
subject to  U^T U = I_{k×k}    (5–35)

where C > 0 controls the row sparsity in U, with increasing values of C promoting higher levels of sparsity. We adopt the alternate approach using the Cayley transform to solve (5–35).

Numerical Experiments: We perform numerical experiments on publicly available datasets to evaluate the performance of JSPCA. We combine JSPCA with the 1-NN classifier to evaluate the quality of the features selected. We compare the results with the PCA-1NN classifier, where the original dataset is first transformed to a lower dimensional subspace using PCA and then classified using the 1-NN classifier. We perform 10-fold cross validation and compute the average classification accuracies. We vary the dimensionality of the subspaces over k = {1, 2, 3, 4, 5}. The dimensions of the selected datasets are given in Table 5-1.

Table 5-1. Datasets used in numerical experiments.

    Dataset       #Samples   #Features
    WPBC             198          33
    Ionosphere       351          34
    Spambase        4601          57
    Mushroom        8124         126
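To connect the JSPCA formulation (5–35) with the curvilinear search of Algorithm 6, the sketch below evaluates the objective (negated, so that it is minimized) and a subgradient with respect to U. The small epsilon added to the row norms is an implementation assumption that keeps the gradient defined at rows that are exactly zero; the random data and the value of C are illustrative.

    # JSPCA objective of eq. (5-35) and a smoothed (sub)gradient, written as a minimization.
    import numpy as np

    def l21_norm(U):
        # Sum of the Euclidean norms of the rows of U, eq. (5-2).
        return np.sqrt((U ** 2).sum(axis=1)).sum()

    def jspca_objective_and_grad(U, X, C, eps=1e-8):
        S = X.T @ X                                   # p x p scatter matrix
        row_norms = np.sqrt((U ** 2).sum(axis=1) + eps)
        f = -np.trace(U.T @ S @ U) + C * row_norms.sum()
        grad = -2.0 * S @ U + C * (U / row_norms[:, None])
        return f, grad

    # Illustrative call on random data with p = 30 features and a k = 3 subspace.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 30))
    U0, _ = np.linalg.qr(rng.normal(size=(30, 3)))
    f0, g0 = jspca_objective_and_grad(U0, X, C=10.0)
    print("objective:", round(f0, 3), " gradient shape:", g0.shape)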


Tables 5-2 5-3 5-4 and 5-5 comparetheperformanceofJSPCAagainstPCA.The resultsarereportedforvaryingvaluesof k and C .Thenumberoffeaturesselected decreaseswithincreasingvaluesof C foralldatasets.Theclassicationperformanceof JSPCA-1NNclassieroutperformsPCA-1NNclassierforlow ervaluesof k andhigher valuesof C forspambase,ionosphereandmushroomdatasets.Inallthec ases,the performanceofJSPCA-1NNclassiercompareswellwithPCA1NNclassierforhigher valuesof k .However,inallsuchcases,theperformanceisachievedwit hlessthanhalf thenumberoffeatures. Table5-2.AccuracycomparisonbetweenPCA-1NNclassiera ndJSPCA-1NNclassier forthespambasedataset. kAcc PCA C JSPCA #selectedfeatures Acc JSPCA 169.42 2.10 10004570.91 2.18 50001971.27 1.78 100001371.87 2.14 200001274.27 1.86 50000872.62 1.86 272.94 1.27 10005373.85 1.71 50003373.33 1.91 100002573.44 2.06 200001674.03 2.44 50000679.11 1.36 378.35 2.14 10005476.90 1.80 50003774.81 1.08 100002174.85 1.64 200001476.18 1.45 50000581.40 1.20 480.63 1.75 10005479.63 1.42 50003976.74 1.15 100002476.79 2.25 200001578.03 1.38 50000882.05 1.81 582.24 1.50 10005581.89 1.37 50004080.02 1.47 100002479.11 1.59 200001878.70 1.49 50000883.53 9.28 77


Table5-3.AccuracycomparisonbetweenPCA-1NNclassiera ndJSPCA-1NNclassier fortheWPBCdataset. kAcc PCA C JSPCA #selectedfeatures Acc JSPCA 168.14 9.75 5002067..69 8.15 10001664.11 7.28 20001464.27 9.00 3000467.40 11.43 4000169.27 5.55 271.15 12.03 5002866.64 8.92 10002666.22 9.21 2000667.20 9.31 3000465.66 9.75 4000465.66 10.42 363.09 12.28 5002963.62 13.50 10002861.65 13.24 2000563.09 6.42 3000563.11 7.45 4000564.67 7.66 461.04 5.72 5003161.07 8.57 10002861.65 9.27 20001365.59 8.67 3000865.29 8.47 4000760.77 9.89 561.54 6.36 5003362.62 6.26 10002564.57 9.29 20001668.07 7.48 30001466.52 8.49 4000661.41 10.33 78


Table5-4.AccuracycomparisonbetweenPCA-1NNclassiera ndJSPCA-1NNclassier fortheionospheredataset. kAcc PCA C JSPCA #selectedfeatures Acc JSPCA 168.67 5.80 5002570.33 5.75 10002170.68 10.61 20001871.79 11.94 30001469.47 7.32 40001072.33 4.58 273.43 6.98 5003271.20 5.99 10002771.46 5.68 20001377.44 7.35 3000378.58 8.75 4000379.11 11.26 383.48 5.50 5003282.58 5.05 10002881.71 7.49 20002281.16 5.34 3000682.58 5.72 4000379.76 8.47 482.88 5.90 5003285.99 6.50 10002986.26 8.19 20001185.47 4.26 30001282.61 5.63 4000785.76 5.51 587.46 4.90 5003286.87 5.61 10002987.40 7.18 20001387.72 4.14 3000987.72 7.29 4000586.02 8.18 79


Table5-5.AccuracycomparisonbetweenPCA-1NNclassiera ndJSPCA-1NNclassier forthemushroomdataset. kAcc PCA C JSPCA #selectedfeatures Acc JSPCA 184.58 1.78 100006799.34 0.35 3000034100500002099.90 0.97 700001195.65 0.84 900001496.05 0.99 29.383 0.78 100007799.20 0.25 3000041100500002798.50 0.55 700002198.71 0.35 900002099.19 0.43 398.70 0.30 100009099.88 0.12 3000053100500003799.70 0.19 700002899.40 0.18 900002899.40 0.30 499.99 0.04 10000941003000067100500005210070000411009000035100 599.96 0.06 1000010099.90 0.12 300007410050000551007000036100900003299.95 0.08 80


CHAPTER 6
CONSTRAINED SUBSPACE CLASSIFIERS

Feature extraction techniques transform the input data into a set of meta-features that capture the information in the input data that is relevant for classification. One popular technique, Principal Component Analysis (PCA), finds a set of linearly uncorrelated variables called principal components from a set of observations of possibly correlated variables. PCA removes redundancy by transforming the data from a higher dimensional space into an orthogonal lower dimensional space. The transformation is performed in such a way that the first principal component accounts for the maximum possible variance and each succeeding component accounts for a decreasing amount of variance. The number of retained principal components is usually less than or equal to the number of original variables and is determined using criteria such as the eigenvalue-one criterion, the scree test, or the proportion of variance accounted for.

The Local Subspace Classifier (LSC) [97] utilizes PCA to perform classification. During the training phase, a lower dimensional subspace is found for each class that approximates the data. In the testing phase, a new data point is classified by calculating the distance of the point to each subspace and choosing the class with minimal distance. Although LSC is simple and relatively easy to implement, it has its own limitations. LSC finds the subspace for each class separately, without knowledge of the presence of the other class. While each subspace approximates its class well, these projections may not be ideal from a classification perspective. In this chapter, we construct another classifier, the Constrained Subspace Classifier (CSC), that accounts for the presence of the other class while finding the local subspaces. The LSC formulation is modified to include the relative angle between the subspaces and is solved efficiently using alternate optimization techniques. The performance of CSC on publicly available datasets is evaluated and compared with LSC.

6.1 Local Subspace Classifier (LSC)

Consider a binary classification problem. Let the matrices X_1 ∈ R^{p×m} and X_2 ∈ R^{p×n} be given, whose columns represent the training examples of the two classes C_1 and C_2 respectively. The number of samples in C_1 and C_2 is given by m and n respectively, and the number of features is given by p. The Local Subspace Classifier (LSC) [97] attempts to find two subspaces separately, one for each class, that best approximate the data. Let U_1 = [u_1^{(1)}, u_2^{(1)}, ..., u_k^{(1)}] ∈ R^{p×k} and U_2 = [u_1^{(2)}, u_2^{(2)}, ..., u_k^{(2)}] ∈ R^{p×k} represent orthonormal bases of two k-dimensional linear subspaces S_1 and S_2 that approximate classes C_1 and C_2 respectively. Without loss of generality, we assume the dimensionality of the subspaces S_1 and S_2 to be the same and equal to k. S_1 and S_2 attempt to capture maximal variance in classes C_1 and C_2 respectively by solving the following optimization problems:

\[
\underset{U_1 \in \mathbb{R}^{p\times k}}{\text{maximize}} \ \operatorname{tr}(U_1^T X_1 X_1^T U_1) \quad \text{subject to} \quad U_1^T U_1 = I_{k\times k} \tag{6-1}
\]

The solution to the optimization problem (6-1) is given by the eigenvectors corresponding to the k largest eigenvalues of the matrix X_1 X_1^T [60].

Similarly, the following optimization problem is solved to obtain the orthonormal basis U_2 representing S_2:

\[
\underset{U_2 \in \mathbb{R}^{p\times k}}{\text{maximize}} \ \operatorname{tr}(U_2^T X_2 X_2^T U_2) \quad \text{subject to} \quad U_2^T U_2 = I_{k\times k} \tag{6-2}
\]

The orthonormal basis U_2 is obtained by choosing the eigenvectors corresponding to the k largest eigenvalues of the matrix X_2 X_2^T.

A new point x is classified by computing its distance from the subspaces S_1 and S_2,

\[
\operatorname{dist}(x, S_i) = \operatorname{tr}(I - U_i^T x x^T U_i), \tag{6-3}
\]

and the class of x is determined as

\[
\operatorname{class}(x) = \underset{i \in \{1,2\}}{\operatorname{argmin}} \ \{\operatorname{dist}(x, S_i)\}. \tag{6-4}
\]

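The two training problems and the decision rule above amount to one truncated eigendecomposition per class followed by a nearest-subspace assignment. The following minimal NumPy sketch illustrates (6-1)-(6-4); the function names, the random example data and the use of a dense eigendecomposition are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def fit_subspace(X, k):
    """Orthonormal basis solving (6-1)/(6-2): the k leading eigenvectors of X X^T.

    X is p x n, with columns holding the training samples of one class.
    """
    w, V = np.linalg.eigh(X @ X.T)            # eigenvalues in ascending order
    return V[:, np.argsort(w)[::-1][:k]]      # p x k basis of the class subspace

def lsc_classify(x, U1, U2):
    """Assign x to the class whose subspace is closest, as in (6-3)-(6-4)."""
    dists = [np.trace(np.eye(U.shape[1]) - U.T @ np.outer(x, x) @ U)
             for U in (U1, U2)]
    return int(np.argmin(dists)) + 1          # 1 for C_1, 2 for C_2

# toy usage: p = 50 features, 30 samples per class, k = 3
rng = np.random.default_rng(0)
X1 = rng.normal(size=(50, 30))
X2 = rng.normal(loc=1.0, size=(50, 30))
U1, U2 = fit_subspace(X1, 3), fit_subspace(X2, 3)
print(lsc_classify(X1[:, 0], U1, U2))
```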
Though the subspaces S_1 and S_2 approximate the classes well, these projections may not be ideal for classification tasks, as each of them is obtained without knowledge of the other class/subspace. Hence, from a classification performance perspective, these subspaces may not be the best projections for the classes. In order to account for the presence of the other subspace, we consider the relative orientation of the subspaces. In the following section, we show the effect of changing the relative orientation of the subspaces on classification performance through some motivating examples.

Table 6-1. Average classification accuracies and relative angle between the subspaces generated from LSC and CSC in the two examples.

            LSC                        CSC
            Acc (%)    Angle (rad)     Acc (%)    Angle (rad)
Example 1   69         0.29            89         0.83
Example 2   79.5       0.98            91         0.74

6.2 Motivating Examples

We consider two examples here showing the effect of changing the relative angle between the subspaces generated by LSC. The datasets are generated from two bivariate normal distributions N_1(μ_1, Σ_1) and N_2(μ_2, Σ_2) representing classes C_1 and C_2, with the parameters of N_1 and N_2 chosen differently in the two examples. Each class consists of 100 randomly generated points from N_1 and N_2 respectively. LSC and CSC are trained on the data with k = 1 and the classification accuracies are obtained via 10-fold cross validation. The subspaces obtained for each of the training folds in example 1 and example 2 are shown in Figures 6-1 and 6-2 respectively. The average classification accuracies and the average relative angle θ (0 ≤ θ ≤ π/2) between the subspaces for LSC and CSC are reported in Table 6-1. In example 1, increasing the relative angle between the subspaces clearly improves the classification accuracy by 20%. However, in example 2, decreasing the relative angle between the subspaces shows better
classification performance and outperforms LSC by about 11%. These examples show that the relative orientation of the subspaces should also be considered, in addition to capturing maximal variance in the data.

Figure 6-1. Data points generated by N_1 and N_2 in example 1 and the subspaces generated by LSC and CSC in each of the training folds - A) LSC, B) CSC.

Figure 6-2. Data points generated by N_1 and N_2 in example 2 and the subspaces generated by LSC and CSC in each of the training folds - A) LSC, B) CSC.

6.3 Constrained Subspace Classifier (CSC)

The Constrained Subspace Classifier (CSC) finds two subspaces simultaneously, one for each class, such that each subspace accounts for maximal variance in the data in the presence of the other class/subspace. Thus, CSC allows for a tradeoff between approximating the classes well and the relative orientation among the subspaces. The relative orientation between subspaces is generally defined in terms of principal angles [71].
We briefly review principal angles between subspaces below; they are then utilized to modify the formulation of LSC to include the relative orientation among the subspaces.

Definition 1: Let U_1 ∈ R^{p×k} and U_2 ∈ R^{p×k} be two orthonormal matrices spanning subspaces S_1 and S_2. The principal angles 0 ≤ θ_1 ≤ θ_2 ≤ ... ≤ θ_k ≤ π/2 between subspaces S_1 and S_2 are defined recursively by:

\[
\cos\theta_i = \max_{x_i \in S_1} \ \max_{y_i \in S_2} \ x_i^T y_i \quad \text{subject to} \quad x_i^T x_i = 1,\ y_i^T y_i = 1,\ x_i^T x_j = 0,\ y_i^T y_j = 0 \ \ \forall j = 1, 2, \dots, i-1. \tag{6-5}
\]

Intuitively, the first principal angle θ_1 is the smallest angle between all pairs of unit vectors in the first and second subspaces. The rest of the principal angles are similarly defined. The principal angles can be computed from the Singular Value Decomposition (SVD) of U_1^T U_2 [60],

\[
U_1^T U_2 = X (\cos\Theta)\, Y^T, \tag{6-6}
\]

where X = [x_1, x_2, ..., x_k], Y = [y_1, y_2, ..., y_k], and cos Θ is the diagonal matrix cos Θ = diag(cos θ_1, cos θ_2, ..., cos θ_k). The cosines of the principal angles are also sometimes known as canonical correlations.

Several distance metrics between subspaces have been defined in terms of principal angles. One such distance metric, known as the Projection Metric [71], is defined as:

\[
d_P(U_1, U_2) = \sqrt{\sum_{i=1}^{k} \sin^2\theta_i}. \tag{6-7}
\]

The projection metric d_P(U_1, U_2) can be expressed in terms of U_1 and U_2 as shown below.

Theorem 1: Let U_1 ∈ R^{p×k} and U_2 ∈ R^{p×k} be two orthonormal matrices spanning subspaces S_1 and S_2. The projection metric d_P(U_1, U_2) between S_1 and S_2 is given by:

\[
d_P(U_1, U_2) = \frac{1}{\sqrt{2}} \| U_1 U_1^T - U_2 U_2^T \|_F. \tag{6-8}
\]

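Definition 1 and Theorem 1 translate directly into a few lines of linear algebra. The sketch below (NumPy, illustrative only) computes the principal angles from the SVD of U_1^T U_2 as in (6-6) and numerically verifies that (6-7) and (6-8) yield the same projection metric.

```python
import numpy as np

def principal_angles(U1, U2):
    """Principal angles between the subspaces spanned by orthonormal U1 and U2."""
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)    # cosines, as in (6-6)
    return np.arccos(np.clip(s, -1.0, 1.0))           # ascending angles

def projection_metric(U1, U2):
    """d_P(U1, U2) = ||U1 U1^T - U2 U2^T||_F / sqrt(2), as in (6-8)."""
    return np.linalg.norm(U1 @ U1.T - U2 @ U2.T, 'fro') / np.sqrt(2)

# sanity check of (6-7) against (6-8) on random orthonormal bases
rng = np.random.default_rng(1)
U1 = np.linalg.qr(rng.normal(size=(20, 4)))[0]
U2 = np.linalg.qr(rng.normal(size=(20, 4)))[0]
theta = principal_angles(U1, U2)
print(np.sqrt(np.sum(np.sin(theta) ** 2)), projection_metric(U1, U2))
```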
The projection metric is utilized to incorporate the relative orientation between the subspaces in LSC. The formulation of LSC is modified as shown below to obtain the Constrained Subspace Classifier (CSC):

\[
\underset{U_1, U_2 \in \mathbb{R}^{p\times k}}{\text{maximize}} \ \operatorname{tr}(U_1^T X_1 X_1^T U_1) + \operatorname{tr}(U_2^T X_2 X_2^T U_2) - C \| U_1 U_1^T - U_2 U_2^T \|_F^2 \quad \text{subject to} \quad U_1^T U_1 = I_{k\times k},\ U_2^T U_2 = I_{k\times k}, \tag{6-9}
\]

where the parameter C controls the tradeoff between the relative orientation of the subspaces and the approximation of the data. It is important to note here that when C = 0, CSC reduces to LSC. Additionally, for larger positive values of C the relative orientation between the subspaces reduces, while for larger negative values of C the relative orientation increases.

The objective function in (6-9) can be further simplified by utilizing the orthogonality constraints. Considering the projection metric term,

\[
\begin{aligned}
\| U_1 U_1^T - U_2 U_2^T \|_F^2 &= \operatorname{tr}\big((U_1 U_1^T - U_2 U_2^T)^T (U_1 U_1^T - U_2 U_2^T)\big) \\
&= \operatorname{tr}(U_1 U_1^T U_1 U_1^T) - 2\operatorname{tr}(U_1^T U_2 U_2^T U_1) + \operatorname{tr}(U_2 U_2^T U_2 U_2^T) \\
&= 2k - 2\operatorname{tr}(U_1^T U_2 U_2^T U_1), \qquad \text{utilizing } U_1^T U_1 = I_{k\times k},\ U_2^T U_2 = I_{k\times k}.
\end{aligned} \tag{6-10}
\]

Introducing (6-10) in the objective function of CSC, dropping the constant term and absorbing the factor of two into C, the optimization problem in (6-9) can be re-written as:

\[
\underset{U_1, U_2 \in \mathbb{R}^{p\times k}}{\text{maximize}} \ \operatorname{tr}(U_1^T X_1 X_1^T U_1) + \operatorname{tr}(U_2^T X_2 X_2^T U_2) + C \operatorname{tr}(U_1^T U_2 U_2^T U_1) \quad \text{subject to} \quad U_1^T U_1 = I_{k\times k},\ U_2^T U_2 = I_{k\times k}. \tag{6-11}
\]

Algorithm: We introduce an alternating optimization algorithm to solve (6-11). For a fixed U_2, (6-11) reduces to:

\[
\underset{U_1 \in \mathbb{R}^{p\times k}}{\text{maximize}} \ \operatorname{tr}\big(U_1^T (X_1 X_1^T + C\, U_2 U_2^T) U_1\big) \quad \text{subject to} \quad U_1^T U_1 = I_{k\times k}. \tag{6-12}
\]

The solution to (6-12) is obtained by choosing the eigenvectors corresponding to the k largest eigenvalues of the symmetric matrix X_1 X_1^T + C U_2 U_2^T.

Similarly, for a fixed U_1, (6-11) reduces to:

\[
\underset{U_2 \in \mathbb{R}^{p\times k}}{\text{maximize}} \ \operatorname{tr}\big(U_2^T (X_2 X_2^T + C\, U_1 U_1^T) U_2\big) \quad \text{subject to} \quad U_2^T U_2 = I_{k\times k}, \tag{6-13}
\]

where the solution to (6-13) is again obtained by choosing the eigenvectors corresponding to the k largest eigenvalues of the symmetric matrix X_2 X_2^T + C U_1 U_1^T.

Termination Rules: As the iterates approach a stationary point, it is important to detect the slowdown and stop appropriately. Additionally, it is difficult to predict correctly whether an algorithm is temporarily or permanently trapped in a region once its convergence speed has reduced. Hence, it is beneficial to have flexible termination rules. We define the following three termination rules:

1. A maximum limit K on the number of iterations;
2. The relative change in U_1 and U_2 at iterations k and k+1,
\[
\operatorname{tol}^k_{U_1} = \frac{\| U_1^{(k)} - U_1^{(k+1)} \|_F}{\sqrt{n}}, \qquad \operatorname{tol}^k_{U_2} = \frac{\| U_2^{(k)} - U_2^{(k+1)} \|_F}{\sqrt{n}}; \tag{6-14}
\]
3. The relative change in the objective function value at iterations k and k+1,
\[
\operatorname{tol}^k_{f} = \frac{F^{(k+1)} - F^{(k)}}{|F^{(k)}| + 1}. \tag{6-15}
\]

The algorithm for CSC is summarized in Algorithm 7.

Algorithm 7: CSC(X_1, X_2, k, C)
1. Initialize U_1 and U_2 such that U_1^T U_1 = I_{k×k}, U_2^T U_2 = I_{k×k}.
2. Find the eigenvectors corresponding to the k largest eigenvalues of the symmetric matrix X_1 X_1^T + C U_2 U_2^T and set them as the columns of U_1.
3. Find the eigenvectors corresponding to the k largest eigenvalues of the symmetric matrix X_2 X_2^T + C U_1 U_1^T and set them as the columns of U_2.
4. Alternate between steps 2 and 3 until one of the termination rules is satisfied.

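A compact sketch of Algorithm 7 is given below. It follows the alternating eigenvalue updates (6-12)-(6-13); the random orthonormal initialization and the use of only the iteration limit and the objective-change rule (6-15) as stopping criteria are simplifications assumed for illustration.

```python
import numpy as np

def top_k_eigvecs(M, k):
    w, V = np.linalg.eigh(M)                  # symmetric M, ascending eigenvalues
    return V[:, np.argsort(w)[::-1][:k]]

def csc(X1, X2, k, C, max_iter=2000, tol_f=1e-6):
    """Constrained Subspace Classifier training (sketch of Algorithm 7)."""
    p = X1.shape[0]
    rng = np.random.default_rng(0)
    # step 1: any orthonormal initialization satisfies the constraints
    U1 = np.linalg.qr(rng.normal(size=(p, k)))[0]
    U2 = np.linalg.qr(rng.normal(size=(p, k)))[0]
    S1, S2 = X1 @ X1.T, X2 @ X2.T
    f_old = None
    for _ in range(max_iter):
        U1 = top_k_eigvecs(S1 + C * (U2 @ U2.T), k)   # update (6-12)
        U2 = top_k_eigvecs(S2 + C * (U1 @ U1.T), k)   # update (6-13)
        f = (np.trace(U1.T @ S1 @ U1) + np.trace(U2.T @ S2 @ U2)
             + C * np.trace(U1.T @ U2 @ U2.T @ U1))   # objective of (6-11)
        if f_old is not None and (f - f_old) / (abs(f_old) + 1) < tol_f:
            break                                     # rule (6-15)
        f_old = f
    return U1, U2
```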
6.4 Comparison between LSC and CSC

The performance of CSC is evaluated on three high dimensional publicly available datasets: DLBCL [148], Breast [164] and Colon [3], with the number of samples {77, 77, 62} and the number of features {5469, 4869, 2000} respectively. The performance of CSC is evaluated for different values of C and compared with LSC. The values of C are chosen in such a way that the relative angle between the subspaces varies uniformly. The relative angle between the subspaces is evaluated in terms of the projection metric d_P. The value of d_P varies between 0 and k, where k is the dimensionality of the subspaces. The value of k is chosen as {1, 3, 10}. The classification performance is evaluated using the leave-one-out cross validation (LOOCV) technique. We define a relative angle parameter β as follows:

\[
\beta_i = d_P^{LSC} + i\,\Delta_+, \qquad i = 1, 2, 3, \dots, m, \tag{6-16}
\]
\[
\beta_{-i} = d_P^{LSC} - i\,\Delta_-, \qquad i = 1, 2, 3, \dots, m, \tag{6-17}
\]
where
\[
\Delta_+ = \frac{k - d_P^{LSC}}{m}, \qquad \Delta_- = \frac{d_P^{LSC}}{m}. \tag{6-18}
\]

The classification accuracies as a function of β for different values of k are shown in Figure 6-3. β_0 represents the result of LSC; β_1 and β_2 correspond to C > 0, while β_{-1} and β_{-2} correspond to C < 0. As mentioned earlier, positive values of C decrease the relative angle between the subspaces while negative values of C increase the relative angle. The values of K, tol^k_f, tol^k_{U_1} and tol^k_{U_2} are chosen to be 2000, 10^{-6}, 10^{-6} and 10^{-6} respectively. For the DLBCL and Colon datasets, classification accuracy is improved
by reducing the relative angle between the subspaces. In the case of the Breast dataset, increasing the relative angle considerably improves the classification accuracy. However, the effect of the angle change is relatively small for k = 10 dimensional subspaces for all datasets. Additionally, in the case of the Colon dataset, the change in the dimension of the subspaces does not affect the performance of CSC.

Figure 6-3. LOOCV accuracies as a function of β for different values of k for the three high dimensional datasets A) DLBCL, B) Breast and C) Colon.

CHAPTER7 BREASTCELLCHARACTERIZATIONUSINGRAMANSPECTROSCOPY RecentadvancesinRamanspectroscopyhavespawnedasurgeo finterestin biomedicalapplicationsofthetechnology,particularlyi ntheeldofoncology[ 49 ]. Advancesinsensors,diodelasers,andotheropticalcompon ents,alongwiththe reductioninthecostofthishardware,havemadeitpossible forRamanspectrometers tobecomeincreasinglycommonplaceinlaboratories.Raman spectroscopyhas demonstratedthepotentialtosignicantlyaidintheresea rch,diagnosisandtreatment ofvariouscancers[ 68 115 121 ].BiomedicalapplicationsofRamanspectroscopy currentlyunderinvestigationrangefromtheresearchlabo ratorybench-toptotheclinical settingatthepatient'sbedside.Ramanspectroscopicanal ysisofbiologicalspecimens isadvantageousasitprovidesaspectralngerprintrichin molecularcompositional informationwithoutdisruptingthebiologicalenvironmen t;allowing in-situ biochemical observationstobemade.Itsabilitytodetectvariationsli nkedtoDNA/RNA,proteins, lipid,carbohydrates,andothersmallmoleculemetabolite smakeitanexcellenttoolfor monitoringbiochemicalchangesonthecellularlevel.Rama nspectroscopyofbiological samplesprovidesatimeefcient,noninvasivemethodforin vestigatingpotentialcancer therapeutics,yieldingasignicantamountofbiologicali nformationfromasingle spectrum. Theabilitytodifferentiatebetweencellandtissuetypesi sofgreatimportancefor theresearch,diagnosisandtreatmentofcancer. In-vitro goldstandardassayshave reliedonhighlyspecicandexpensivetechniquessuchasim munoassaysusingow cytometry,uorescencemicroscopy,geneexpressionandpr oteinanalysis.While in-vivo and ex-vivo diagnosishasbeenprimarilybasedonhistologicalstainin gand morphologicalexaminationbyhighlytrainedexpertpathol ogist.Alloftheseconventional methodsareinvasive,requiringtheuseofexogenousagents ,creatinganenvironment non-nativetocellsandtissues;veryfewofwhichenablethe collectionofreal-time 90


in-situ dynamicinformation.Ramanspectroscopyiscapableofover comingthese disadvantages,mainlyduetoitsnon-invasivenatureandit sabilitytoprovideaglobal overviewofthebiochemicalcompositionofdynamicbiologi calscenariosunlikemost singleend-pointassays.Thisversatiletechniquehasbeen demonstratedasaneffective tool,inparticularforoncology-basedapplicationsaswel lasforotherapplications includinginvestigationofstemcelldifferentiation,cel lproliferationontissueengineering constructs, in-vitro pharmacologicalandtoxicologicaleffects,amongmanyoth er applications;andalsonumerous in-vivo animalstudiesandhumanclinicaltrails[ 103 121 – 123 126 127 135 136 151 152 167 ]. Formanyoftheoncologyapplications,classicationandco mprehensivecharacterization ofcelltypesisofgreatimportancefortheselectionofther apiesforuse in-vivo aswellas theuseofthemostappropriateandclinicallyrelevantcell -basedmodelsforpre-clinical developmentoftherapies in-vitro [ 98 ].Advancesinboththehardware(e.g.optical technologies)andthesoftwarecomponents(e.g.dataanaly sismethods)willallowfor Ramanspectroscopytoparticipatemorereadilyintheresea rchandclinicalsettings.In thischapterthefocusisondevelopingarobustdataanalysi sframeworkforevaluating andcharacterizingvecommonlyusedbreastcelllinesfort herapydevelopmentand breastcancerresearch.Theaimistodevelopthisframework suchthatitcanboth classifycelltypesbasedoncell-linespecicspectralfea tureswhichmayultimately inthefutureallowforthepotentialdiscoveryofRaman-bas edspectralbiomarkersfor identifyingcancerandtumorsub-types[ 48 ]. Thecomplexityofthespectraobtainedfromabiologicalsam plemakesextracting relevantinformationandinterpretingthedatachallengin g.Themostprimitivedata analysisproceduresusedforRamanstudiessuchaspeak-bypeakanalysisandpeak deconvolutiondonotallowforextensivedataextractionan dareamajorlimitation, ofteninvolvingtedious,error-pronemanualanalysisproc edures.Suchmethods donotallowforcompletedataextraction,oftenonlyusinga verylimitedsubsetof 91


data.Therefore,awidevarietyofdataanalysismethodshav erecentlybeguntobe increasinglyemployedforevaluatingRamanspectra;inclu dingmultivariatemethods suchasprincipalcomponentanalysis,andvariousmachinel earning-baseddata miningandoptimizationmethods.Dataminingandmachinele arningtechniquesare abletocircumventthesepitfallsbyoptimizingdataextrac tion,exposingobscured correlationsandreducingclassicationerror.Computati onalexperimentsindicatethat supervisedclassicationalgorithmssuchasSVMandLDAare abletoseparatethehigh dimensionalspectraldatawithhighaccuracy[ 123 126 127 135 136 152 ].These algorithmscanbeusedincombinationwithdimensionalityr eductionorfeatureselection techniquesinordertoidentifythecriticalregionsandban dsofthespectrumthatallow fordiscriminationandclassicationofcelltypes,cellul arprocessesandcellresponseto variousstimulibasedvariationsinbiochemicalcompositi on.Therefore,thedevelopment andimplementationofadvanceddataminingtechniquesisvi talforcomplete,rapidand accuratedataanalysis.Currently,thereisnosinglestand ard,optimizeddataanalysis procedureforcompleteprocessing(pre-andpost-processi ng)ofRamanspectraldata. Developmentofaconsensusmethodcapableofbeinggenerali zableandtunableforthe manydifferentRamanspectraldatasetsisparticularlyimp ortantinregardstofostering theclinicaltranslationofthetechnology,aswellasforst andardizingdataresultsfor Raman-basedcancerresearchwithintheourishingeld. Ramanspectraldatasetsaredenedbyhigh-dimensionality ,frequentlywithlimited samplesizes.Standardclassicationmodelsareknowntope rformpoorlyonsuch highdimensionaldatasetsduetothecurseofdimensionalit y.Thus,implementing dimensionalityreductionand/orfeatureselectionbecome scrucialpriortoclassication. Frequently,thespectramustbeclassiedsoastodifferent iatebetweentwoormore classesofsamples(e.g.classifyingcelldeathmechanisms orcancersub-types).In otherinstances,thedetection,monitoringandquanticat ionofpeaksindicativeof bimolecularspeciesmayberequiredtostudyparticulardyn amicprocessesorlevels 92


ofbiologicalactivity(e.g.investigatingchangesinDNA/ RNAlevelsduringtreatment withanti-canceragents).Classicationtaskshaveinvolv edbothsupervisedand unsupervisedclassicationtechniques,whilethelatterh asinvolvedpeak-picking combinedwithconventionalstatisticaltest(e.g.student 'st-test,etc.),dimensionality reduction(e.g.PCA,LDA,etc.),andfeatureselectiontech niques(e.g.Correlation-based featureselection(CFS),Markovblanketlter,etc.). Herein,wehavecombinedRamanspectroscopywithournoveld atamining frameworkinordertoclassifyandcharacterize, in-situ ,vebreastcelllinesbasedon differencesinbiochemicalcompositiondeterminedfromth esignicanceofthespectral featuresutilizedforclassicationdirectlyfromtheorig inalfeaturespace.Ourframework, knownasFisher-basedFeatureSelectionSupportVectorMac hines(FFS-SVM),allows forsimultaneousfeatureselectionandsubsequentclassi cationbasedontheselected features.TheFFS-SVMframeworkiscomparedtoseveralofth emostcommonlyused methodsforclassicationanddimensionalityreductionaf terallspectraldatasetsare pre-processedbyastandardpre-processingprocedure.Anu nsupervisedmethod basedonhierarchicalclusteringisusedtoconstructfourp rototypicalcytopathological binaryclassicationtasks,typicaloftasksthatmightbee ncounteredinaresearchor clinicalsetting,andwhichareusedfortestingtherobustn essandgeneralizabilityof theFFS-SVMframework.Finally,thetop-tenmostsignican tlydiscriminativefeatures, deriveddirectlyaswavenumbersfromtheoriginalfeatures pace,arecorrelatedto theirbimolecularassignmentsbasedontheliterature.The sewavenumbers/features arethendiscussedintermsofbiologicalrelevancetothedi fferencesbetweenthe celllinegroupingsforeachclassicationtask.Althoughw efocusonbreastcancer celllinesinthisparticularstudy,weforeseethatthisfra meworkwillbeapplicableto virtuallyallcancertypesandalsoforbothresearchandcli nicalapplicationsofRaman spectroscopy. 93


7.1ExperimentalMethods Cellculture: Apanelconsistingofvecelllineswasassembledbasedonth ree cancerousandtwonon-cancerouscelllinesafterconsidera tionofdifferencesintumor origin,morphology,tumorigencityandselectgeneexpress ionproles.Allcelllines wereobtainedfromAmericanTypeCultureCollection(ATCC) ,Manassas,VA.Thecells linesobtainedforstudywereMCF7,MDA-MB-231,BT474,MCF1 2A,andMCF10A. Table 7-1 showsthecelllineschosenwithselectedcharacteristics. Allcelllineswere culturedat37 Cand5%CO 2 inhumidiedcellcultureincubator.Mediawasprepared foreachofthecelllinesbyltrationthrough0.02 mvacuumltrationsystem.Briey, MDA-MB-231mediaconsistedofDulbeccosModiedEaglesMed ium(DMEM)with 2mML-glutamine,1%Non-EssentialAminoAcids(NEAA),and1 0%FetalBovine Serum(FBS);MCF7mediaconsistedofMinimumEaglesMedium( MEM)withEarles BalancedSaltSolution(EBSS),2mML-glutamine,1mMsodium pyruvateand10 g/ml bovineinsulin,and10%FBS;BT474consistedofATCC46-XHyb ricarepowder preparedinsterilemolecular-gradewaterwith1.5g/Lsodi umbicarbonateand10%FBS; MCF12AandMCF10Awerebothgrowninthesamecompletemediaw hichconsisted ofDMEM/HamsF1250/50with,2.5mML-glutamine,15mMHEPES, 0.5mMsodium pyruvate,1.2g/Lsodiumbicarbonate,20ng/mlhumanEpider malGrowthFactor(hEGF), 100ng/mlcholeratoxin,10 g/mlbovineinsulin,500ng/mlhydrocortisone,and5%horse serum. Cells,oncethawed,weregrownuptolog-phase(usuallyafte r2-3passages post-thaw)priortoplatingforexperimentalproceduresas describedinthefollowing section.Cellswerepassageduponbecomingconuentusing0 .25%trypsin/EDTA solutionforharvesting,followedbycentrifugationandsu bsequentlyinoculatedatthe pre-determinedcell-lineconcentrationsin75cm 2 asks.Itshouldbenotedthateach celllinetakesadifferenttimeperiodtoreachconuencean dalsohavedifferentoptimal inoculationconcentrations(cells/cm 2 )basedongrowthcurvecharacteristics.Cellswere 94


harvestedforplatingforexperimentalproceduresusing0. 25%trypsin/EDTAsolutionto removecellsfromthecultureasksandthencollectedbycen trifugation.Thecellswere thendilutedtothedesiredconcentrationsforinoculation ontoMgF 2 chipsandculturedin 6-wellplates. Table7-1.CharacteristicsoftheveATCCbreastcelllines studied.IDC:InvasiveDuctal Carcinoma,AC:Adenocarcinoma,ER:EstrogenReceptor,PgR : ProgesteroneReceptor,HER2:HER2Overexpression,TP53:P 53protein levelsandmututionalstatus.*MCF10Acellmorphologyishi ghlydependent onConc.ofCa 2+ [ 87 118 ] CellLineMorphologyTumorTypeTumorigenicityER 1 PgR 1 HER2 2 TP53 1 MCF7MassIDCLessaggressive++-+/WT MDA-MB-231StellateACHighlyaggressive---++ M BT474MassIDCHighlyaggressive++++ MCF12ARoundNormalNo---+MCF10ACobblestone*NormalNo---+/WT CollectionofRamanspectrafromcells: ForcollectionofRamanspectrathe cellswereseededonMgF 2 chips(5 5 1mm)atconcentrationsdeterminedtoallow thecellstogrowuptoapproximately80%conuenceafteratl east24hoursincubation at37 Cand5%CO 2 priortoexperimentalcollectionofspectra.TheMgF 2 substrates areusedastheyexhibitaverylowbackgroundRamansignalan dalsoallowforeach ofthecelllinestogrowwiththenormalmorphologyofthatob servedwhengrown onculture-treatedplastic.Thecellsweregrownuptosligh tlylessthanconuence, approximately80%,buttostillallowforcell-cellinterac tionstooccur.Afterthecells reachedapproximately80%conuence,cell-containingMgF 2 chipswereplacedin sterileDelta-TCultureDishesandfreshcompletegrowthme diawasaddedtothecells. ThespectrawerecollectedbyaRenishaw2000InViaSpectrom eterSystemcoupled toaLeicaMicroscopewitha63xwater-immersionobjective( N.A.0.90).Thespectra fromsinglecellswerecollectedoneatatimewithexcitatio nfroma785nmwavelength diodelaser,usinga20sintegrationtime,andanestimatedl aserintensityof50mW atthesurfaceofthecells.Thenefocuswasusedtoadjustth efocalplaneandthe 95


laser spot size to a semi-elliptical shape of approximately 20 μm × 40 μm, which when centered over each cell allowed the laser spot to encompass the majority of the volume of the cell for spectral acquisition. 25-40 spectra were collected from each cell line, and apparent outliers were removed by visual inspection prior to pre-processing of the spectra. Figure 7-1 shows the overlay of 37 Raman spectra collected from the MCF10A cell line with the standard deviation plotted as well (blue line).

Figure 7-1. Spectral overlay: Overlay of 37 spectra collected from the MCF10A cell line. The standard deviation over all the spectra is also shown (blue line).

Spectral preprocessing: All spectra are pre-processed using background subtraction, baseline correction, normalization and smoothing procedures [100, 121, 126] prior to applying dimensionality reduction and/or feature selection followed by subsequent classification. The multiple spectra collected from a biological sample can sometimes vary in wavenumber resolution. Hence, the x-axis is standardized by applying interpolation techniques to estimate values at specific and regular wavenumbers. The raw spectrum spans 601 cm^-1 to 1800 cm^-1 with intensity values at unequal intervals. In order to compare different spectra as well as perform data analysis, the spectra are standardized to have intensity values at regular intervals. An interval of
one wavenumber is chosen between 602 cm^-1 and 1801 cm^-1, giving a total of 1200 data points for each spectrum. A simple linear interpolation method is used to find the intensity values at the predefined wavenumbers.

Noise removal is an important procedure in the data preprocessing of raw Raman spectra. First, Savitzky-Golay smoothing is performed on the standardized spectra with a 5th order polynomial [143]. Then, in accordance with the procedure outlined in [100], the fluorescence background is removed from the spectra by an automated baseline correction method. The smoothed spectra are then normalized using a standard normal variate transformation [6]. It has been shown in [2] that performing baseline correction prior to normalization helps in extracting robust and quantitative information from Raman spectra. The different steps of the spectral preprocessing procedure are shown in Figure 7-2.

Figure 7-2. Spectral preprocessing: A) The raw spectrum obtained from the Raman spectrometer. B) The smoothed & standardized spectrum along with the raw spectrum. C) Fitted fluorescence background shown with the spectrum obtained from (B). D) The final standardized, smoothed, baseline corrected and normalized spectrum.

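The preprocessing chain described above can be prototyped as follows (NumPy/SciPy). The sketch is illustrative only: the Savitzky-Golay window length and the low-order polynomial background fit are assumptions, whereas the procedure used in this work relies on the automated baseline method of [100].

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(wavenumbers, intensities, grid=np.arange(602, 1802)):
    """Resample, smooth, baseline-correct and SNV-normalize one raw spectrum."""
    # 1) resample onto a common 1 cm^-1 grid (602-1801 cm^-1, 1200 points);
    #    assumes the wavenumbers are given in increasing order
    y = np.interp(grid, wavenumbers, intensities)
    # 2) Savitzky-Golay smoothing with a 5th-order polynomial
    y = savgol_filter(y, window_length=11, polyorder=5)
    # 3) crude fluorescence background estimate via a polynomial fit
    #    (stand-in for the automated baseline correction of [100])
    y = y - np.polyval(np.polyfit(grid, y, 5), grid)
    # 4) standard normal variate (SNV) normalization
    return (y - y.mean()) / y.std()
```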
The subsequent sections describe in detail the development of the FFS-SVM framework, which includes a novel feature selection method called Fisher-based Feature Selection (FFS), in which the selected features, or peaks, from the average spectra of cells are ranked based on the Fisher criterion. Furthermore, a method for accounting for peak position inter-class variance by peak coalescing is also incorporated. The selected features are then used for classification by C-SVM with optimization of the hyperplane, and subsequent wavenumber assignment and correlation of each feature to its biological constituent. Our framework's performance is then compared with several popular data reduction and feature selection methods, including PCA and random feature selection, as well as common classification models such as SVM, LDA, PCA-LDA and PCA-SVM. Classification accuracies were determined using random sub-sampling, and finally classification sensitivities and specificities are also calculated.

7.2 Fisher-based Feature Selection (FFS)

The number of collected spectra is usually an order of magnitude smaller than the number of measurements for each spectrum. Several univariate [7, 44, 154] and multivariate [69, 93, 172] filter techniques have been introduced to perform feature selection in high dimensional classification problems. Univariate filter techniques assume feature independence and rank each feature according to a discriminative score, while the multivariate techniques account for feature dependencies and find an optimal subset of features based on a predefined criterion. Generally, finding an optimal subset of features is computationally intractable when the input space is reasonably large, since there are 2^n possibilities, where n is the dimension of the input space. Though the multivariate techniques may sometimes lead to better classification performance than univariate methods, they are computationally expensive and employ heuristic methods to search for the optimal subset of features.

A comparison of the univariate and multivariate filter techniques for classification of high dimensional datasets has been carried out in previous studies [75, 99]. Surprisingly,
it has been shown that the univariate selection techniques yield consistently better results than multivariate techniques, and the differences are attributed to the difficulty of extracting the feature dependencies from limited sample sizes. As Raman spectral datasets are also characterized by high dimensionality and limited sample sizes, a simple univariate technique based on the Fisher criterion [43] is employed, which ranks the features according to their class discriminative ability.

Peak finding: In addition to building a classification model for the identification of cell lines, it is also important to understand and analyze the biological relevance of the selected features which contribute most significantly to the classification. Generally, in a Raman spectrum, most wavenumbers correspond to peaks that are assigned to molecular vibrational modes and represent biologically relevant molecular species. Hence, prior to ranking the features, the features are selected by considering the peaks of the average spectrum of each cell line. The set of peaks S for a specific cell line is defined as the set of local maxima given by:

\[
S = \{\, x^* \mid f(x^*) \geq f(x) \ \ \forall x \in N_\varepsilon(x^*) \,\}, \tag{7-1}
\]

where x^* represents the peak location, f(x^*) is the corresponding intensity value of the average spectrum, and N_ε(x^*) represents an ε-neighborhood around x^*. The ε-neighborhood is chosen by considering an average full-width at half-maximum (FWHM) of 10 cm^-1 for Raman peaks. Figure 7-3 shows the average spectrum for MCF10A cells and the corresponding peaks selected for further analysis.

Figure 7-3. Peak finding: The average spectrum for MCF10A cells and the selected peaks (shown marked with asterisks).

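A direct implementation of the local-maximum rule (7-1) is sketched below (NumPy). The half-width of five grid points on either side, corresponding to the assumed ~10 cm^-1 FWHM of Raman peaks, is an illustrative choice.

```python
import numpy as np

def find_peaks(mean_spectrum, eps=5):
    """Indices of local maxima of a class-average spectrum, as in (7-1)."""
    peaks = []
    n = len(mean_spectrum)
    for i in range(n):
        lo, hi = max(0, i - eps), min(n, i + eps + 1)   # eps-neighborhood
        if mean_spectrum[i] >= mean_spectrum[lo:hi].max():
            peaks.append(i)
    return np.array(peaks)
```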
Peak coalescing: Peaks corresponding to the same biomolecular bonds may still be reported with slightly varying wavenumbers when comparing different biological samples. In order to account for this variance, and therefore to create a common wavenumber reference vector for the feature selection and classification tasks, peak coalescing using hierarchical clustering is employed [84]. The peak locations from different cell lines are clustered depending on their inter-distance values, choosing the number of clusters such that the total variance of all clusters is below a chosen threshold. In particular, the number of clusters N_C is defined as:

\[
N_C = \min \Big\{\, c \ \Big|\ \sum_{i=1}^{c} s_i \leq \theta_{th} \,\Big\}, \tag{7-2}
\]
where
\[
s_i = \sum_{j \in C_i} (x_{ij} - \mu_i)^2, \tag{7-3}
\]

C_i represents the index set of cluster i, μ_i and s_i are the mean and variance of cluster i, x_{ij} is the peak j assigned to cluster i, and θ_{th} is a pre-defined variance threshold. Figure 7-4 shows the individual peaks chosen for the MCF10A and MCF12A cell lines along with the clusters found from hierarchical clustering.

Figure 7-4. Peak coalescing: The average spectra for MCF10A cells and MCF12A cells are shown along with the coalesced clusters obtained from hierarchical clustering (denoted by the arrows).

Feature ranking: After obtaining a common wavenumber reference vector from peak finding and peak coalescing, the features are ranked based on the Fisher criterion [43]. For a given feature i, the Fisher score is defined as:

\[
J_i = \frac{(\mu^i_1 - \mu^i_2)^2}{\dfrac{(s^i_1)^2}{n_1} + \dfrac{(s^i_2)^2}{n_2}}, \qquad \forall i \in S, \tag{7-4}
\]

where μ^i_j, (s^i_j)^2 and n_j are the sample mean, variance and number of data samples in class j, and S is the set of selected peaks. Intuitively, the Fisher score is high for a feature with a large inter-class separation of means and a small total within-class variance. Hence, it is expected that the features with high Fisher scores are part of the maximally discriminating features. Additionally, it is important to note here that the Fisher score is similar to the two-sided t-statistic estimated using pooled variance. We call this univariate filter selection technique Fisher-based Feature Selection (FFS).

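The score (7-4) can be evaluated for all coalesced peak features at once. The short NumPy sketch below is illustrative; the use of the unbiased sample variance is an assumption.

```python
import numpy as np

def fisher_scores(X1, X2):
    """Fisher criterion (7-4) per feature; X1, X2 are samples-by-features arrays."""
    n1, n2 = X1.shape[0], X2.shape[0]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2
    return num / den

# rank the coalesced peak features by decreasing score and keep the top ten
# top10 = np.argsort(fisher_scores(X1, X2))[::-1][:10]
```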
7.3 Classification

The datasets formed from the features obtained from FFS are used to train a classification algorithm. Several techniques have been proposed for binary classification problems in the literature, out of which SVMs have attracted the attention of many researchers, with numerous successful applications in bioinformatics, finance and remote sensing, among many others [23, 130, 160]. SVMs can be extended to multiclass classification problems by two commonly used techniques: one-against-one (OAO) and one-against-all (OAA)
[4]. In OAA, the N-class dataset is divided into N two-class cases, with each class being classified against the rest of the classes. The OAO approach, on the other hand, builds N(N-1)/2 classification models, with every class being classified against every other class. A simple voting scheme is employed for class assignment of unknown samples. While both of these techniques are shown to perform well for several classification tasks, the performance of OAA is compromised in the case of unbalanced training datasets, and the OAO approach is more computationally intensive [63]. In order to alleviate these issues, the C-SVMs are extended to multiclass classification by incorporating hierarchical clustering. The pairwise distance matrix between the average spectra of the cell lines is first computed, and an agglomerative hierarchical cluster tree is generated from the distance matrix. A dendrogram plot of the binary cluster tree is shown in Figure 7-5. Remarkably, but as should be expected, the largest difference occurs between the group of cancer cells and the group of non-cancer cells. Based on Figure 7-5, a classification framework is constructed from four binary classification tasks: Cancer versus Non-Cancer, MCF7 versus Other Cancers, MDA-MB231 versus BT474, and MCF10A versus MCF12A. This classification framework, in comparison with the OAA and OAO approaches, reduces the computational complexity by building four cytologically relevant binary classification models for determining class membership.

Figure 7-5. Hierarchical clustering: Similarity in the mean spectra of the five cell lines is shown. The dendrogram is constructed by performing hierarchical clustering on the mean spectra of the five cell lines. The spectral dissimilarity (increasing from left to right) is shown on the abscissa.

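A sketch of how such a cluster tree can be generated from the class-average spectra is shown below (SciPy). The Euclidean pairwise distance and average linkage are assumptions made for illustration; the text does not fix these choices.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

def cluster_cell_lines(mean_spectra, labels):
    """Agglomerative tree over the class-average spectra (one row per cell line)."""
    d = pdist(mean_spectra, metric='euclidean')      # pairwise distance matrix
    tree = linkage(d, method='average')              # agglomerative cluster tree
    return dendrogram(tree, labels=labels, no_plot=True)

# the leaf ordering suggests the binary tasks, e.g. the cancer/non-cancer split:
# cluster_cell_lines(np.vstack([m_bt474, m_mda, m_mcf7, m_mcf10a, m_mcf12a]),
#                    ['BT474', 'MDA-MB-231', 'MCF7', 'MCF10A', 'MCF12A'])
```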
7.4 Results and Discussion

The dataset included a total of 133 Raman spectra, of which 82 spectra were from the cancer cell lines (BT474, MCF7 and MDA-MB231) and 51 spectra were from the non-cancer cell lines (MCF10A and MCF12A). The mean spectra of the five cell lines are shown in Figure 7-6. Given any two cell lines, the classification model is built in the following manner: the raw spectra from both classes are preprocessed, the peaks are selected from the processed spectra and coalesced to remove experimental artifacts, the common reference features are ranked according to their Fisher score, the C-SVM algorithm is then trained on the dataset created with the top k features chosen with FFS, and finally the model is cross-validated using the repeated random sub-sampling method [91]. The number of repetitions for cross-validation has been fixed at 100. In each repetition, 90% of the samples are used for training and the remaining 10% are used to test the classifier. The classification accuracies are reported as the average over all the classification accuracies obtained over the 100 repetitions. The C parameter for the SVMs has been obtained using a grid search, by varying the value of C between 10^-5, 10^-4, ..., 10^2 and choosing the value that gives the maximal classification accuracy.

Classification performance: The following presents the results obtained from our FFS-SVM classification framework when applied to our training datasets. The classification accuracies along with specificity, sensitivity and the number of extracted features for the four binary classification tasks are shown in Table 7-2. The results show that the FFS-SVM framework, combining feature selection with C-SVM, yields highly accurate classification of the cell types.

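The validation loop described above can be sketched as follows (scikit-learn). The 90/10 random split, the 100 repetitions and the grid of C values follow the text; the remaining SVC settings, such as the linear kernel, are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def repeated_random_subsampling(X, y, n_rep=100, train_frac=0.9,
                                C_grid=10.0 ** np.arange(-5, 3)):
    """Grid-search C for a C-SVM under repeated random sub-sampling validation."""
    rng = np.random.default_rng(0)
    best = 0.0
    for C in C_grid:
        accs = []
        for _ in range(n_rep):
            idx = rng.permutation(len(y))
            n_tr = int(train_frac * len(y))
            clf = SVC(C=C, kernel='linear').fit(X[idx[:n_tr]], y[idx[:n_tr]])
            accs.append(clf.score(X[idx[n_tr:]], y[idx[n_tr:]]))
        best = max(best, float(np.mean(accs)))   # keep the best average accuracy
    return best
```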
Figure 7-6. Mean spectra: The average spectra of all five cell lines.

Table 7-2. Sensitivity, specificity and average classification accuracy for the four binary classification tasks obtained from C-SVM and validated using random sub-sampling (100 repetitions).

Classification Task      # of features   Accuracy (%)   Sensitivity (%)   Specificity (%)
Cancer vs Non-Cancer      38              99.5           99.8              98.6
MCF7 vs Other Cancers     32              99.3           96.6              100.0
BT474 vs MDA-MB231        42              97.4           91.7              100.0
MCF10A vs MCF12A          42              91             97.1              62

The FFS-SVM framework is compared with other widely known methods for binary classification: SVM, PCA-SVM and PCA-LDA, and the results are shown in Table 7-3. The first method, SVM, trains the classifier with the original data; PCA-SVM and PCA-LDA perform dimensionality reduction using Principal Component Analysis (PCA) and then train SVM (or LDA [43]) on the transformed data. The number of principal components used for classification is obtained by considering the top principal components that account for 70% of the total variance in the data. The top ten features obtained from the Fisher criterion are used for classification in the FFS-SVM framework. The classification accuracies are obtained using the random subsampling technique with 100 iterations. The accuracies
obtained via the FFS-SVM framework are comparable to the original accuracies reported in Table 7-2 (within 2%), thereby showing that the features obtained from FFS are highly effective in discriminating the cell lines. All the methods, including FFS-SVM, exhibit high performance on the four classification tasks. It is interesting to note that employing SVM on the original data performs better in comparison with the other methods, which involve some form of dimensionality reduction. However, the FFS-SVM framework, in addition to providing high classification accuracies, also yields important features which allow maximum discrimination of the cell lines or cell line groupings.

Table 7-3. Sensitivity, specificity and average classification accuracies of the four frameworks SVM, PCA-SVM, PCA-LDA and FFS-SVM for the four binary classification tasks. The classification accuracies are obtained from cross-validation using random subsampling (100 repetitions).

                          Cancer vs.     MCF7 vs.         BT474 vs.      MCF10A vs.
                          Non-Cancer     Other Cancers    MDA-MB231      MCF12A
SVM       Accuracy (%)     99.2           100.0            97.6           93.4
          Sensitivity (%)  100.0          100.0            94.8           100.0
          Specificity (%)  99.4           100.0            99.5           80.6
PCA-SVM   Accuracy (%)     99.4           98.4             98.6           92.8
          Sensitivity (%)  100.0          95.1             96.4           99.3
          Specificity (%)  98.2           100.0            99.5           72.9
PCA-LDA   Accuracy (%)     99.5           98.3             96.4           85.8
          Sensitivity (%)  99.9           98.9             88.2           82.8
          Specificity (%)  98.6           97.6             99.3           96.6
FFS-SVM   Accuracy (%)     97.3           98.9             98.0           89.0
          Sensitivity (%)  100.0          96.7             93.4           97.6
          Specificity (%)  93.3           100.0            100.0          62.3

The relationship of the classification accuracies of the FFS-SVM framework to the number of features is shown in Figure 7-7, where it is compared with two other frameworks: Random Feature-SVM (RF-SVM) and PCA-SVM. The first framework, RF-SVM, selects the features at random and subsequently performs classification using C-SVM. The other framework, PCA-SVM, performs PCA on the data and chooses the top k principal components to perform classification using C-SVM. Here, the number of principal components k is varied between 1 and the total number of extracted features. The results are shown for the first classification task, Cancer versus Non-Cancer cell line classification. The accuracies are calculated using repeated random sub-sampling. Clearly, FFS-SVM performs better than RF-SVM and yields accuracies similar to PCA-SVM. Additionally, PCA-SVM is expected to perform better than the other two frameworks since PCA utilizes all the features in estimating the principal components, while the other two methods utilize only a subset of features for classification. However, the FFS-SVM framework has the advantage of a much more efficient and effective interpretation of results, since it works in the original feature space by selecting a subset of features, while PCA works in a transformed space obtained from all the features. Therefore, FFS-SVM provides a way to classify cell lines with high accuracy (comparable to PCA-SVM) and also finds a subset of potentially biologically relevant features which maximally discriminate between the cell lines.

Figure 7-7. Classification Accuracy Comparison: The accuracies versus the number of significant variables for the classification of Cancer versus Non-Cancer cell lines obtained from the three frameworks RF-SVM, FFS-SVM and PCA-SVM are shown. The x-axis shows the number of significant variables; features in the case of RF-SVM and FFS-SVM, and the principal components for PCA-SVM. All the accuracies are estimated using repeated random sub-sampling with 100 repetitions.

Biologicalvalidationofselectedfeatures: ByapplyingourFFS-SVMframework tothefourbinaryclassicationtaskresolvedbytheHCA,it waspossibletodevelop ahighlyaccurateclassicationmodelwhichcanbeappliedo nfutureclassication oftestingdatasetswhilealsoyieldingbiologicalinforma tionaboutthedifferences betweendifferentgroupingsofthecelltypes.Comparisono fvebreastcelllinesby Cuperlovic-Culf etal. (fourofwhichwerethesameasusedinthecurrentstudy)was performedusingNMRandcomparingseveralclusteringtechn iquesandPCA;primarily focusingonlipophiliccellularextracts[ 30 ].Remarkably,eventhoughNMRhasmuch higherspecicityinregardstobiochemicalspecies,theRa manspectrawereable toprovidemoreaccurateseparationofthecelllinesusingH CAwhendiscriminating amongcancerversusnon-cancercelllines,aswellascancer sub-typeclustering. AcomparisonoftheresultshereintothoseinthestudybyCup erlovic-Culf etal. demonstratestheutilityoftheuniquecellline-specicRa manspectralngerprints andtheamountofinformationthesespectralngerprintsco ntain.Furthermore, thiscomparisonexhibitsthepoweroftheFFS-SVMframework toefcientlyderive distinguishingbiochemicalinformationthatformsthebas isfortheclassseparationfrom thespectralfeaturesdirectlyintheformofwavenumbers. ThesubsequentbiologicalanalysisforeachofthefourHCAd enedbinary classicationtasksbasedonthetoptenclass-discriminat iveselectedfeatures arediscussedinthefollowingsubsections.Thetoptendisc riminativefeaturesare discussedintermsofwavenumberswhichhavebeencorrelate dtobiochemicalspecies basedonreferencevaluesfoundintheliteraturefromother Ramanspectroscopic studiesofbiologicalspecimens. Cancerversusnon-cancer: Thelargestdissimilaritybetweentwogroupsof spectraandthuscelllineswasthatbetweencancerandnon-c ancercelltypesascanbe seeninthedendrogramderivedfromhierarchicalclusterin gshowninFigure 7-5 .This isalsothemostapparentandbroadclassicationtask,rese mblingatypicalcytological 107


Table 7-4. The top 10 features (wavenumbers, cm^-1) selected by FFS for the four binary classification tasks.

Cancer vs. Non-Cancer   MCF7 vs. Other Cancers   BT474 vs. MDA-MB231   MCF10A vs. MCF12A
1047                    1341                     1049                  1047
811                     986                      1063                  1320
823                     1322                     760                   1156
765                     1658                     830                   1174
1450                    1405                     1085                  1211
1660                    1066                     1318                  941
829                     622                      1518                  811
1086                    1159                     604                   1338
1621                    1799                     1129                  719
785                     1316                     1661                  967

analysis activity. The unsupervised clustering confirms that the cancer cell lines and non-cancer cell lines group together based on significant similarities and differences found within the Raman spectra. Further analysis by classification with supervised learning methods and feature selection also provides a measure of the degree of similarity, established based upon cross-validation accuracy and the Fisher-based selected features which account for the class discrimination.

Here, the FFS-SVM framework demonstrates that Raman spectroscopy is able to detect clear differences between cancerous and non-cancerous cell lines. The random sub-sampling validation classification accuracy for classifying cancer versus non-cancer cells, after grouping all of the spectra from all cell lines together, is 97.3% with a sensitivity of 100.0% and a specificity of 93.3%. The top ten wavenumber features, in order of class discriminative power, are shown in the first column of Table 7-4. The selected feature at wavenumber 1047 cm^-1 is the most discriminative and correlates closely to a glycogen peak [116]. The difference in glycogen content is an indicator of glucose consumption and is characteristic for differentiating between cancer and non-cancer cell types, as cancer cells are generally typified by having significantly higher glucose metabolism than non-cancer cells. Five of the top ten discriminative features (811, 823, 765, 829, and 785 cm^-1) all correlate to DNA and RNA vibrational modes, including those of the nucleic acids uracil, thymidine, and cytosine from the pyrimidine ring breathing, and the
phosphatebackboneofDNA[ 34 116 ].Cancercellstypicallyhavehigheramountsof DNA,aswellasRNA,duetotheirhigherratesofproliferatio n[ 34 116 ]. Bocklitz etal. hasshownthatdifferencesinnucleisizeandshapearedeni ng differencesbetweenMCF12AandMCF7cellswhenexaminedbyR amanspectroscopy andapplyingSVM[ 13 ].Additionally,Bocklitz etal. showedthattwoofthemost distinguishingwavenumbers,whichareassignedtonucleic acids,were785cm 1 and 811cm 1 .Thesendingsarenoteworthybecauseasshowninboththest udyherein andthestudybyBocklitz etal. ,thesamewavenumberswerefoundtobesignicant indicatorsofdiscriminationbetweencancerandnon-cance rcelllines,evenwhen includingadditionalcancerousandnon-cancerouscelllin es.Thesespecicfeatures appearingindifferentstudies,performedindifferentlab s,usingdifferentinstruments demonstratethepotentialpossibilitiesthatthefeatures electedwavenumberscould beindicativeofspectralbiomarkersfordistinguishingca ncercellsfromnon-cancerous cells. Thewavenumber1086cm 1 wasalsofoundtoshowsignicantdifferences betweenthetwoclassesandisrelatedtoC-Cstretchoffatty acidchains.Thepeak at1450cm 1 thatisindicativeoflipidswasalsoselectedandhasbeensh ownin severalpreviousstudiestoberelatedtoinvasiveductalca rcinoma(IDC)[ 140 147 ]. Thefeatureat1621cm 1 isduetotheC=Cbondvibrationalmodeandhasbeen correlatedtoaminoacidsinstructuralproteins[ 116 ].Adifferenceamongstructural proteinswasalsoindicatedbythefeature1660cm 1 ,whichisduetotheAmideI vibration[ 52 95 147 ].Thepeakat1660cm 1 alsooriginatesfromC=Cstretchin lipids,particularlysphingolipidssuchasceramide[ 96 ].Moreover,ceramidehasbeen showntohaveelevatedlevelsincancerouscells,inparticu larthosewithhigh-level expressionofenzymesregulatingceramidesynthesissucha smulti-drugresistant cancercells[ 5 105 ].Ceramideshavebeenshowntoplayaroleintumorigensisas well asbeinginvolvedinproliferation,survivalandapoptosis [ 110 ].Thefeatures1086cm 1 109


1450cm 1 ,and1621cm 1 arepossiblyindicatorsofthedifferencesincellmembrane composition,aswellascellmorphologybetweenthecancero usandnon-cancerous celllines.Again,therelevanceofthesefeaturesandthebi ologicalandbiochemical correlationsthatthesefeaturesplayindiscriminatingam ongcancerandnon-cancer celllinescouldpotentiallyleadtothediscoveryof'Raman -basedspectralbiomarkers' forcytopathologicalpost-classicationanalysis.Furth erstudieswillbeneededto conrmthepresenceandrelevanceofsuchspectralbiomarke rsbyinvestigatingalarge, comprehensivepanelofdifferentbreastcellslinesinorde rtoidentifyuniversalRaman spectralfeaturesallowingforidenticationofcancerous celltypes. MCF7versusothercancertypes: BasedontheHCAresults,thesecondmost dissimilarclusteringoriginatesfromthedifferentiatio noftheMCF7cellsfromthe twocancerouscelltypes,BT474andMDA-MB-231.Theseparat ionobservedinthe dendrogramoftheMCF7cellsfromtheothercancerouscellty peshaveaninteresting biologicalbasisduetothedifferenceinthedegreeoftumor igencityofthecancer celllines.SubsequentclassicationbytheFFS-SVMframew orkoftheMCF7cells versustheotherhighlyinvasivecelllinesyieldsaclassi cationaccuracyof98.9% withasensitivityof96.7%andspecicityof100.0%asshown inTable 7-2 .Table 7-1 comparesseveralcharacteristicsbetweentheMCF7,BT474, andMDA-MB-231cell lines.FromTable 7-1 andthecorrespondingreferences[ 87 106 118 ]itisknown thatBT474andMDA-MB-231cellsarederivedfromhighlyinva siveandaggressive tumorscomparedtotherelativelylowlevelinvasivityofth eMCF7cellline.Moreover, theobservedclusteringthereforelikelyaccountsforthes trongsimilarityoftheBT474 spectraandMDA-MB-231spectra.Thetop-tendiscriminativ efeaturesselectedfor differentiatingtheMCF7fromtheBT474andMDA-MB-231cell linesareshowninthe secondcolumnofTable 7-4 Themajorityofthefeaturescorrelatetowavenumberassign mentsthatarerelated tovibrationsobservedfromstructuralproteinsandthesec ondaryproteinstructure; 110


particularlycollagen,actinandthe -helicalsecondaryproteinstructureaswellas aminoacidresiduessuchasvalineandprolinewhichareofte nfoundabundantlyin structuralproteins(e.g.collagen).Thefeatures1066cm 1 ,1316cm 1 ,1322cm 1 1341cm 1 ,and1658cm 1 haveallbeenpreviouslyassignedtocollagenvibrational modesaswellasaminoacidsandamideIvibrationalmodes[ 13 81 ].Thepeakat 1341cm 1 hasbeenshowntocorrelatetothestructuralproteinactin, andthemajority ofproteincontributionoftheRamanspectrumoriginatesfr omcollagenandactin [ 147 ].Wavenumbers1159cm 1 and1405cm 1 originatefromproteinvibrational modes,and622cm 1 isduetoaspecicvibrationalmodeofphenylalanine[ 34 133 ]. Therefore,thecomposition,ormoreaccuratelythediffere nceincomposition,ofthese structuralproteinsislikelytogiverisetothediscrimina tionofthetwoclassesofhighly invasiveversusnon-invasivebreastcelllinesbasedonthe Ramanspectra.Increased levelsofstructuralproteins,collagenandactin,havebee nassociatedwithincreased invasivenessandmetastasisofbreastcancercells[ 36 ].Furthermore,genesencoding forvimentinandactinhavebeenshowntobehighlyexpressed insuchinvasivecell linesasMDA-MB-231basedonastromal-mesenchymalgeneexp ressionsignature,but notexpressedintheMCF7cellline[ 10 36 ].Thefeatureat986cm 1 isthoughttobe duetotheCH 2 stretchoflipids. Itshouldbenotedthatthefeatureat1799cm 1 isadjacenttotheendofthe ngerprintrangefromwhichspectraldataiscollectedandt hereforewasnotevaluated. Duetotheoverwhelmingproportionofthefeaturesrelatedt ostructuralprotein differencesbetweentheinvasiveandnon-invasivecelllin esitisprobablethatthe clusteringandsubsequenthighaccuracyofthecross-valid ationclassicationisdueto thevariationincellularcompositionofthementionedstru cturalproteins. BT474versusMDA-MB-231: Thetwohighlyinvasivecelllines,BT474and MDA-MB-231,areshowntohavethehighestdegreeofsimilari tyandclusterclosely togetherbasedontheHCA-deriveddendrograminFigure 7-5 .Eventhoughthesetwo 111


celllinesclusterwithhighsimilarity,theFFS-SVMframew orkisstillabletoclassify thetwocelllinesfromone-anotherwithhighaccuracy.Thec lassicationaccuracyfor theclassicationofBT474versusMDA-MB-231is98.0%witha sensitivityof93.4% andspecicityof100.0%.Thisdemonstratesthepowerofthe FFS-SVMframework tobeabletoaccuratelyclassifyandthereforedifferentia teamongthetwohighly aggressivebreastcancercellsubtypesdespitehavinghigh clusteringsimilarity.The top-tendiscriminativefeaturesareshowninthethirdcolu mnofTable 7-4 .BT474and MDA-MB-231havedifferentmorphologiesandalsohavesigni cantlydifferentgene expressionproles,eventhoughtheyaresimilarintheirin vasivenature[ 87 118 ].A comparisonofselectedcharacteristicscanbefoundinTabl e 7-1 .Themostdominant feature,1049cm 1 ,hasbeenassignedtothevibrationalmodeofglycine.Furth ermore, thefeatureat760cm 1 ,whichisassignedtotryptophan,andthevibrationalmodea t 830cm 1 whichcorrelatestoproline,hydroxyproline,andtyrosine whichisindicative ofdifferenceofproteincontent.Thefeatureat1063cm 1 isstronglycorrelatedtothe C-Cstretchoflipids[ 124 ].Thispeakhasbeenshowntohavecontributionsfrommyrist ic acid,palmiticacidandotherfattyacids;indicatingadiff erenceinfattyacidcomposition amongthetwocelllines.Additionally,thefeatureat1129c m 1 isassignedtotheC-C stretchoftheacyllipidbackboneandcorrelateswithpalmi ticacid,stearicacidand phospholipids[ 34 116 ].Thefeatureat1518cm 1 isassignedtotheC=Cvibrational modeandindicatesadifferencesincarotenoidcomposition ,whichmaybeassociated withmembranestructureanduidity[ 18 79 ].Thefeatureat1661cm 1 isassociated with cis -isomerunsaturatedfattyacidsandmayfurtherindicatead ifferenceinlipid compositionbetweenthecelllines.Both1085cm 1 and1318cm 1 areindicativeof phosphodiesterbondvibrationalmodefoundinDNA[ 116 ]. Interestingly,thetwoIDCcelltypes,BT474andMCF7,botho fwhichareER+ andPR+,arenotfoundtoclusterbasedontheexpressionofER andPRasmightbe expected.InsteadtheACsub-type,MDA-MB-231(ER-,PR-),c lustersmorecloselywith 112


theBT474cellsbasedonthebiochemicalcompositionbasedo ntheRamanspectra. TheseparationoftheMCF7fromtheMDA-MB-231aresimilarto theresultsfrom investigationofthecelllinesbyNMRformetabolicprolin g;demonstratingagain thatRamanspectroscopymayprovideinformationwithsimil arimpactasthatof morecommonplacemetabolomictechniques[ 31 ].Furthermore,combiningRaman spectroscopywithtechniquessuchasNMRthroughadvancedd ataminingstrategies, anddatafusion,mayallowforcomplimentaryinformationto beextendedfromthe massivedatasetsandthereforeimproveunderstandingof in-vitro breastcancermodels. MCF10AversusMCF12A: TheMCF10AandMCF12Aclustertogetherbased onthedendrogramcreatedbytheHCAasseeninFigure 7-5 ,butthetwocelllines haveclearspectraldifferencesasshownbythelargebranch ingdistanceatthelevel oftheindividualcelllineclusters.Thespectrawerethenc lassiedbytheFFS-SVM frameworkyieldingan89.0%classicationaccuracywithas ensitivityof97.6%and aspecicityof62.3%,asshowninTable 7-2 .Thisclassicationtaskresultsinthe lowestclassicationaccuracyofalltheclassicationtas ksattemptedwiththeFFS-SVM framework.Thehighlevelofsimilaritybetweenthetwocell linesmaygiverisetothe relativelylowclassicationaccuracyforthistask.Furth ermore,bothcelllinesare growninthesamemedia,underthesameconditionsandhavesi milardoublingtime, growthratesandthusmostlikelymetabolicproles.Thetop tenfeaturesselected fordiscriminatingbetweenthetwonon-cancerouscellline sarelistedinthefourth columnofTable 7-4 .Again,thefeatureat1047cm 1 ,whichisassignedtoglycogen, isfoundtoprovidethegreatestsignicantdiscrimination betweenthetwoclasses. Differencesinglucoseconsumptionbetweenthetwocelllin esmayaccountforthis difference,andfurtherevidenceofdifferencesincarbohy dratecontentareprovided bythefeatureat941cm 1 [ 116 ].Although,asshownbyCuperlovic-Culf etal. ,the metabolomicproleofthesecelllinesarequitesimilarinr egardstolipophilicfractions basedonNMRanalysis[ 30 ].Inspiteofthis,severalofthefeatureslistedforthisbi nary 113


classicationtaskhaveassignmentsrelatedtofattyacids andlipidsincluding714 cm 1 ,967cm 1 ,and1156cm 1 (carotenoids)[ 116 ].Differencesintheproteincontent basedonselectedfeatures1211cm 1 and1338cm 1 mayoriginatefromtheslightly differentmorphologies(Table 7-1 ).Thefeaturesat811cm 1 and1320cm 1 correlate tothephosphodiesterbondstretchofRNAandguaninevibrat ionalmodesrespectively [ 116 ].Thedifferencesbetweenthetwonon-cancerouscelllines arenotverywell denedbytheRamanspectraasevidencedbythedifcultyina ccurateclassication bytheFFS-SVMframework.Asthesetwocelllinesarerelativ elysimilarandareboth non-invasive,non-cancerouscelllines,withsimilarexpr essionprolesitmightbe expectedthatthiswouldbethemostchallengingclassicat iontask. 114


CHAPTER 8
CONCLUSION

In this work, we presented an overview of feature selection and classification techniques in the context of high dimensional datasets. Due to technological advances in the last decade, high dimensional datasets have become prevalent in many applications arising from fields like bioinformatics, e-commerce and computer vision. Several traditional classification algorithms are known to perform poorly on such high dimensional datasets due to the curse of dimensionality. Additionally, in many biomedical applications, it is important to extract the features that contribute to the differences among the classes. Hence, our present work focused on building scalable, efficient classification models for high dimensional datasets that are also able to extract important features in addition to accurately predicting the class of unknown samples.

A new class-specific embedded feature selection method called Sparse Proximal Support Vector Machines (sPSVMs) is proposed in Chapter 4. sPSVMs learn a sparse representation of PSVMs by first casting the problem as an equivalent least squares problem and then inducing sparsity using the l_1-norm on the coefficient vector. An efficient algorithm based on alternate optimization techniques is proposed. Numerical experiments on several publicly available datasets verify the good classification performance and efficiency of sPSVMs. The feature selection stability of sPSVMs is also studied and compared with other univariate filter techniques. Moreover, sPSVMs successfully eliminate more than 98% of the features irrelevant to classification for high dimensional datasets and show consistency in the feature selection process. Additionally, our method offers the advantage of interpreting the selected features in the context of the classes.

Chapter 5 explores the possibility of combining feature selection with subspace learning for dimensionality reduction in high dimensional datasets. The current approach involves transforming the problem formulation to an equivalent convex relaxation and
using efficient alternate optimization techniques to solve it to optimality. Though the current approach is efficient, it limits explicit control over the level of sparsity. Hence, we propose an alternate approach of solving the original problem to local optimality using the Cayley transformation. This approach, in addition to being efficient, also allows for explicit control over the level of sparsity, which is crucial when performing classification tasks on high dimensional datasets.

Constrained Subspace Classifiers (CSC) for high dimensional datasets are discussed in Chapter 6. CSC improves on an earlier proposed subspace classifier, the Local Subspace Classifier (LSC). In addition to approximating the classes well by local subspaces, CSC also accounts for the relative angle between the subspaces by utilizing the projection metric. An efficient alternate optimization technique is proposed. CSC has been evaluated on publicly available datasets and compared with LSC. The improvement in classification accuracy shows the importance of considering the relative angle between the subspaces while approximating the classes. Additionally, CSC seems to be effective for lower dimensional subspaces.

A high dimensional data application of characterizing breast cell lines using Raman spectroscopy is presented in Chapter 7. We developed a robust method capable of classifying Raman spectra at different levels of cell line groupings based on the biological characteristics contained within the spectra. In order to identify the biological characteristics contained within a Raman spectrum which are used as the basis for classification, a method for selecting and extracting these features directly from the spectra is needed. A method employing a Fisher-based criterion for selecting the features is implemented for classification while being mindful of the curse of dimensionality, and tunable parameters were incorporated into the framework, allowing the number of features to be selected by the user as well. The purpose of developing a generalizable feature selection technique for Raman spectroscopy is based on the idea that spectral biomarkers for cancer and cancer sub-types may
likelyexist.Thisassumptionisbaseduponthefactthatoth erspectroscopicmethods, whichproducecomplex,richspectra,suchasNMRandmassspe ctrometry,areused todiscoverandidentifycancerbiomarkersregularly.Ther efore,Ramanspectroscopy shouldalsohavethepotentialtoyieldbiomarkers,althoug hthesebiomarkersbeing differentinnature,duetodifferencesinspecicityandse nsitivity,thusbeingmore comparabletoangerprint. Traditionally,featureselectionhasbeenemployedpriort oclassicationtohandle highdimensionaldataclassicationproblems.However,re cently,thefocushasbeen shiftedtoapplyingregularizationtechniquestotraditio nalclassicationalgorithmsfor bettergeneralizationperformance.Thoughtheprogressma desofarisencouraging,we believethathighdimensionaldataclassicationwouldcon tinuetobeanactiveareaof researchastechnologicalinnovationscontinuetoevolvea ndbecomemoreeffective. 117

BIOGRAPHICAL SKETCH

Vijay Pappu was born in Visakhapatnam, Andhra Pradesh, India. He received a bachelor's degree in mechanical engineering from the Indian Institute of Technology (IIT), Chennai, India in 2006. He joined Rutgers, The State University of New Jersey, in 2006 for a Master of Science in mechanical engineering, where he specialized in computational fluid mechanics. After finishing the master's program at Rutgers in 2008, he joined MathWorks Inc. in Natick, MA as an application support engineer. In 2010, he joined the Industrial and Systems Engineering (I.S.E.) doctoral program at the University of Florida (UF). He received a Master of Science in 2011 and a Ph.D. in 2013 from I.S.E. He also received a Master of Science in management from the Warrington School of Business at UF in 2013. His research interests include applications of data mining and machine learning techniques to real world problems. His work has been published in peer-reviewed journals and conference proceedings. He is the editor of a special issue of Energy Systems and has also co-edited two volumes published in a Springer book series. He has co-organized two conferences at UF related to applications in biomedicine and smart grid systems. He was the captain and the president of the Gator Cricket Club at UF in 2011, during which the UF cricket team was ranked 2nd in the nation. He was awarded the Outstanding International Student award from the UF International Center in 2011, the Scholar Athlete award from the Department of RecSports at UF in 2012, and an honorable mention for the Seth Bonder Scholarship for Applied Operations Research in Health Services from INFORMS in 2012. He is a student member of the Institute for Operations Research and the Management Sciences (INFORMS) and the Institute of Electrical and Electronics Engineers (IEEE).