
On Feature Selection in Data Mining


Material Information

Title:
On Feature Selection in Data Mining
Physical Description:
1 online resource (49 p.)
Language:
english
Creator:
Thottakkara, Paul Francis
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Master's ( Engr.)
Degree Grantor:
University of Florida
Degree Disciplines:
Industrial and Systems Engineering
Committee Chair:
Pardalos, Panagote M
Committee Members:
Momcilovic, Petar
Hager, William Ward

Subjects

Subjects / Keywords:
clustering -- datamining -- jointsparsity -- l21norm -- psvms -- sparsity
Industrial and Systems Engineering -- Dissertations, Academic -- UF
Genre:
Industrial and Systems Engineering thesis, Engr.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Analysing and extracting useful information from high dimensional datasets challenges the frontiers of statistical tools and methods. Traditional methods tend to fail when dealing with high dimensional datasets. Low sample size has always been a problem in statistical tests; this is aggravated in high dimensional data, where the number of features is comparable to or larger than the number of samples. The power of any statistical test is directly proportional to its ability to reject a false null hypothesis, and sample size is a major factor in controlling the probability of a type II error and hence in making valid conclusions. Hence one of the efficient ways of handling high dimensional datasets is to reduce their dimension through feature selection, so that valid statistical conclusions can be drawn more easily. This work focuses on different aspects associated with feature selection in data mining. Feature selection is one of the active research areas in data mining. The main idea behind feature selection methods is to identify a subset of the original input features that are pivotal in data classification or understanding; feature selection helps in eliminating features with little or no predictive information. The discussions in this thesis fall into three major sections: the first introduces a least squares formulation for proximal support vector machines, the second introduces the L(2,1) norm as a method to induce sparsity, and the last discusses the applicability of sparse clustering to Raman spectroscopy data.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Paul Francis Thottakkara.
Thesis:
Thesis (Engr.)--University of Florida, 2013.
Local:
Adviser: Pardalos, Panagote M.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045986:00001


This item has the following downloads:


Full Text

PAGE 1

ON FEATURE SELECTION IN DATA MINING
By
PAUL FRANCIS THOTTAKKARA
A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF ENGINEER
UNIVERSITY OF FLORIDA
2013

PAGE 2

© 2013 Paul Francis Thottakkara

PAGE 3

To my parents Alice Francis and Francis T. Paul

PAGE 4

ACKNOWLEDGMENTS
This thesis would not have been possible without the guidance and the help of several individuals who in one way or another contributed and extended their valuable assistance in the preparation and completion of this study. Foremost, I would like to express my gratitude to my adviser, Distinguished Prof. Panos M. Pardalos, for his contribution and guidance throughout my research. Besides my advisor, I would like to thank the rest of my thesis committee, Prof. William Hager and Dr. Petar Momcilovic, for their help and encouragement. My sincere thanks to my lab members Vijay Pappu, Dr. Pando G. Georgiev, Mohsen Rahmani, Michael Fenn and Syed Mujahid for supporting me in my research work. I would like to extend a big thank you to my friends Jorge Sefair, Zehra Melis Teksan, Rachna Manek, Radhika Medury, Amrutha Pattamatta, Mini Manchanda, Vishnu Narayanan, Rahul Subramany, Gokul Bhat and Vijaykumar Ramaswamy for the stimulating thoughts and encouragement. Finally, I thank my parents Alice Francis and Francis T. Paul and my sister Neetha Francis for all the motivation and for supporting me throughout my life.

PAGE 5

TABLE OF CONTENTS
                                                                  page
ACKNOWLEDGMENTS ................................................... 4
LIST OF TABLES .................................................... 7
LIST OF FIGURES ................................................... 8
ABSTRACT .......................................................... 9
CHAPTER
1 INTRODUCTION TO DATA MINING ..................................... 10
  1.1 What is Data Mining ......................................... 10
  1.2 Role of Feature Selection in Data Mining .................... 11
2 LEAST SQUARES FORMULATION FOR PROXIMAL SUPPORT VECTOR MACHINES .. 12
  2.1 Data Classification and Separating Hyperplanes .............. 12
    2.1.1 Support Vector Machines ................................. 12
    2.1.2 Proximal Support Vector Machines ........................ 13
    2.1.3 Twin Support Vector Machines ............................ 14
  2.2 Importance of Least Square Formulations ..................... 15
  2.3 Least Square Formulation for Generating Proximal Planes ..... 16
    2.3.1 Using Spectral Decomposition ............................ 16
    2.3.2 Special Case Eigenvalue Problem ......................... 20
  2.4 Results and Observations .................................... 24
3 JOINT SPARSE FEATURE SELECTION .................................. 26
  3.1 Dimensionality Reduction .................................... 26
  3.2 L21 Norm and Feature Selection .............................. 27
  3.3 Results and Observations .................................... 29
4 FEATURE SELECTION IN UNLABELLED DATASETS ........................ 33
  4.1 Introduction to Raman Spectra Signals ....................... 33
  4.2 Dataset ..................................................... 33
  4.3 Data Preprocessing .......................................... 34
    4.3.1 Remove Unnecessary Features ............................. 34
    4.3.2 Noise and Background Subtraction ........................ 34
    4.3.3 Peak Selection .......................................... 34
  4.4 Clustering .................................................. 35
  4.5 K-means Clustering .......................................... 35
  4.6 Spectral Clustering ......................................... 37
  4.7 Sparse Clustering for Feature Selection ..................... 39

PAGE 6

  4.8 Observations ................................................ 41
5 DISCUSSION AND CONCLUSION ....................................... 44
REFERENCES ........................................................ 46
BIOGRAPHICAL SKETCH ............................................... 49

PAGE 7

LIST OF TABLES
Table                                                             page
2-1 GEV and PSVM-LS Formulation Classification Accuracy ........... 25
3-1 PCA and JS Method Accuracy ..................................... 30
3-2 JS Method: Top features, threshold selection ................... 32
3-3 JS Method: Top T features ...................................... 32
4-1 Weights ω_j and corresponding feature (wavenumber) ............. 42

PAGE 8

LIST OF FIGURES
Figure                                                            page
4-1 K-Means Clustering, TRET Scan6 scan of dimension 120X21 ........ 36
4-2 K-Means Clustering, C3A Scan2 scan of dimension 125X50 ......... 37
4-3 K-means C3A Scan2 scan with nucleus marked ..................... 38
4-4 Spectral Clustering, TRET Scan6 scan of dimension 120X21 ....... 39
4-5 Spectral Clustering, C3A Scan2 scan of dimension 125X50 ........ 40
4-6 Cluster using top 15 features, C3A Scan2 scan .................. 42

PAGE 9

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Engineer

ON FEATURE SELECTION IN DATA MINING

By
Paul Francis Thottakkara
August 2013
Chair: Panos M. Pardalos
Major: Industrial and Systems Engineering

Analysing and extracting useful information from high dimensional datasets challenges the frontiers of statistical tools and methods. Traditional methods tend to fail when dealing with high dimensional datasets. Low sample size has always been a problem in statistical tests; this is aggravated in high dimensional data, where the number of features is comparable to or larger than the number of samples. The power of any statistical test is directly proportional to its ability to reject a false null hypothesis, and sample size is a major factor in controlling the probability of a type II error and hence in making valid conclusions. Hence one of the efficient ways of handling high dimensional datasets is to reduce their dimension through feature selection, so that valid statistical conclusions can be drawn more easily. This work focuses on different aspects associated with feature selection in data mining. Feature selection is one of the active research areas in data mining. The main idea behind feature selection methods is to identify a subset of the original input features that are pivotal in data classification or understanding; feature selection helps in eliminating features with little or no predictive information. The discussions in this thesis fall into three major sections: the first introduces a least squares formulation for proximal support vector machines, the second introduces the l2,1 norm as a method to induce sparsity, and the last discusses the applicability of sparse clustering to Raman spectroscopy data.

PAGE 10

CHAPTER 1
INTRODUCTION TO DATA MINING

1.1 What is Data Mining
Generally, data mining can be explained as the science of analysing data to extract useful information using statistical tools and methods. The overwhelming prospect of understanding a data-driven system using the underlying data has been a major motivating factor in promoting research in the field of data mining. Immense growth in technology has made data collection cheaper and more efficient. Meanwhile, an exponential rise in computational power has made data processing extremely fast and economical. These advances have accelerated the motivations behind research in data mining. Data classification, feature selection and outlier detection are the major research areas in the field of data mining. Data classification problems focus on learning a set of training data, and then use information from the training data to predict the nature of any new datum point. Data classification can be a supervised or an unsupervised learning method depending on the availability of label information for the training data. A supervised data classification algorithm studies a labelled training set (class labels of the training data are available) and generates a classification model to classify any set of new data points observed. Unsupervised algorithms, on the other hand, try to find patterns in the unlabelled training data. In any statistical analysis, the three major factors of concern are statistical accuracy, model interpretability and computational complexity. For any classification model, it is necessary to ensure that the efficiency of any of these three factors is not compromised. A set of data points is normally expressed as a matrix, where each column represents a datum point and the rows represent features. Standard statistical methods require a larger number of data points compared to the feature space dimension to make valid statistical inferences. Hence standard datasets have a large column dimension compared to the row dimension. Research in the past two decades has paved the way for efficient data

PAGE 11

mining algorithms that perform well on standard datasets. But most of the traditional classification models behave poorly when handling high dimensional datasets, i.e. datasets where the number of features is comparable to or larger than the number of data points. One of the prime reasons for poor performance on high dimensional datasets is the compromise made in statistical accuracy and computational complexity due to the higher dimensional feature space. Another concern with data classification in high dimensional space is the large amount of collinearity between features, resulting in wrong model selection [7]. Overfitting and higher noise levels are also associated with high dimensional datasets. These drawbacks associated with higher dimensional data in data mining are referred to as the curse of dimensionality.

1.2 Role of Feature Selection in Data Mining
Feature selection is one of the prime focuses in the field of data mining. Given a dataset, feature selection can be generalized as the process of selecting a subset of features for use in further data analysis. This selected subset of features is expected to capture the maximum information present in the dataset, i.e. the selected feature subset should contain the most prominent features for model construction. Feature selection is particularly important in high dimensional datasets since it reduces dimensionality and thereby nullifies the effects of the curse of dimensionality. Further, in many real life systems, feature selection is very important in identifying the behaviour and performance of the system. Especially in biomedical applications, feature selection can play a pivotal role in identifying biomarkers. In a disease classification problem in a genomic study, for example, feature selection techniques can identify the genes that differentiate diseased and healthy cells. This not only helps the data analyst in reducing the data dimension, but is also a huge breakthrough for biologists in understanding the biological system and identifying the disease triggering genes.

PAGE 12

CHAPTER 2
LEAST SQUARES FORMULATION FOR PROXIMAL SUPPORT VECTOR MACHINES

2.1 Data Classification and Separating Hyperplanes
A hyperplane in an n-dimensional vector space can be defined as a flat subset with n−1 dimensions that separates the vector space into two disjoint half spaces. Many data classification algorithms focus on finding hyperplanes which separate data into different classes or which assist in approximating data. The very first algorithm in machine learning was the Perceptron algorithm, which generates a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. Perceptron methods gained huge momentum and continued to be the major method for more than a decade. However, the algorithm had a number of issues such as the existence of multiple separating hyperplanes, slow convergence, failure in handling inseparable data, etc. This limited the applicability of perceptron algorithms in complex and large datasets. This motivated the development of advanced and more robust algorithms that could efficiently handle complex and large data. One of these was the Support Vector Machine, which produced efficient and robust classification models, was effective in handling complex datasets and produced lower generalization error.

2.1.1 Support Vector Machines
The support vector machine (SVM) is a supervised data classification method introduced by Vladimir Vapnik and coworkers [4, 32]. The basic idea of SVM is to generate a hyperplane that separates the data points into two classes. If the two classes are linearly separable, the standard SVM tries to generate a hyperplane that divides the input space into two disjoint half spaces where each class belongs to one of the half spaces. As there could exist more than one hyperplane separating the classes, the SVM algorithm selects the hyperplane which is farthest from its closest data points. Any new data point is classified to a class based on its location in the half space.
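As a concrete illustration of the separating-hyperplane idea (this example is not from the thesis), the following minimal sketch trains a linear SVM on synthetic two-class data using scikit-learn; the data, the penalty parameter C and the test point are illustrative assumptions.

```python
# Minimal illustration of a linear SVM separating two classes (not from the thesis).
# Assumes scikit-learn is available; data and parameters are made up for illustration.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two synthetic classes in R^2, shifted apart so they are (nearly) linearly separable.
A = rng.normal(loc=+2.0, scale=1.0, size=(50, 2))   # class +1
B = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))   # class -1
X = np.vstack([A, B])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = LinearSVC(C=1.0).fit(X, y)
# The separating hyperplane is w^T x + b = 0; new points are classified by the sign.
w, b = clf.coef_.ravel(), clf.intercept_[0]
print("hyperplane:", w, b)
print("predicted class of [1.5, 1.0]:", clf.predict([[1.5, 1.0]]))
```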

PAGE 13

2.1.2 Proximal Support Vector Machines
The Proximal Support Vector Machine (PSVM), introduced by Fung and Mangasarian, can be considered closely related to the SVM classifier [10]. Standard SVM classifies points based on their location in the disjoint subspaces generated by the hyperplane, while PSVM classifies points based on their proximity to two parallel hyperplanes. The objective of PSVM is to generate two parallel hyperplanes with each plane closest to one class while being farthest from the other class. Later Mangasarian and Wild introduced an extension to PSVM called the Multi-surface Proximal Support Vector Machine (MPSVM) by relaxing the requirement that the proximal planes be parallel [23]. MPSVM generates two hyperplanes such that each plane is closest to one class and farthest from the other class. As MPSVM closely resembles PSVM, the two are used interchangeably in the literature. In this study, we use PSVM to mean the Multi-Surface Proximal Support Vector Machine. Consider a binary classification problem with the two classes represented as A ∈ R^{m1×n} and B ∈ R^{m2×n}, whose rows are the data points of the two classes.
PAGE 14

Following [30], the regularized optimization model [6, 24, 25] is given by:

\min_{(\omega,\gamma)\neq 0} \frac{\|A\omega - e\gamma\|^2 + \delta\,\|[\omega;\gamma]\|^2}{\|B\omega - e\gamma\|^2}

where δ > 0 is a regularization constant. Define

G_A = [A\ \ -e]^T [A\ \ -e] + \delta I, \qquad H_B = [B\ \ -e]^T [B\ \ -e], \qquad z^T = [\omega^T\ \ \gamma].

Substituting the above variables into the optimization model, it can be reformulated as

\min_{z\neq 0} f(z) := \frac{z^T G_A z}{z^T H_B z}, \qquad \text{or equivalently} \qquad \max_{z\neq 0} \frac{z^T H_B z}{z^T G_A z}.

The stationary points are given by the eigenvectors of the generalized eigenvalue problem GEV(H_B, G_A):

H_B z = \lambda G_A z,

and the hyperplane P_A is given by the eigenvector corresponding to the largest eigenvalue. Similarly, the proximal hyperplane P_B (farthest from class A and closest to class B) is given by P_B = {x ∈ R^n : x^T ω − γ = 0}, where [ω^T γ] is obtained from the analogous generalized eigenvalue problem with the roles of the two classes interchanged.
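The generalized eigenvalue formulation above can be solved directly with standard numerical routines. The following is a small sketch, assuming SciPy and synthetic class matrices A and B; the regularization constant delta and the data are illustrative.

```python
# Sketch of the PSVM/GEPSVM generalized eigenvalue formulation described above.
# A and B hold the two classes (one data point per row); delta is illustrative.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
A = rng.normal(+1.0, 1.0, size=(40, 5))          # class A: 40 points, 5 features
B = rng.normal(-1.0, 1.0, size=(35, 5))          # class B

def augment(M):
    # Append the -e column so that z = [omega; gamma] parameterises x^T omega - gamma = 0.
    return np.hstack([M, -np.ones((M.shape[0], 1))])

delta = 1e-3
GA = augment(A).T @ augment(A) + delta * np.eye(A.shape[1] + 1)
HB = augment(B).T @ augment(B)

# Stationary points of z^T HB z / z^T GA z solve HB z = lambda GA z.
vals, vecs = eigh(HB, GA)                        # generalized symmetric eigenproblem
z = vecs[:, np.argmax(vals)]                     # eigenvector of the largest eigenvalue
omega, gamma = z[:-1], z[-1]                     # proximal plane P_A: x^T omega - gamma = 0
print("proximal plane for class A:", omega, gamma)
```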
PAGE 15

2.1.3 Twin Support Vector Machines
Jayadeva et al. [16] proposed the Twin Support Vector Machine (TWSVM), which generates two non-parallel planes such that each plane is close to one class and away from the other class. However, TWSVM is not an exact reformulation of the PSVM model above; it is very close to the standard SVM formulation. The Twin Support Vector Machine model solves a pair of quadratic programming (QP) problems, where each QP finds a hyperplane closest to the points of one class and at least a unit distance from the points of the other class. The TWSVM classifier is obtained by solving the pair of QP problems below, where TWSVM1 generates the hyperplane P_A, i.e. the hyperplane closest to class A and farthest from class B [1, 16]. Similarly, TWSVM2 generates the hyperplane P_B.

TWSVM1: \min_{\omega_1,\gamma_1,q} \tfrac{1}{2}(A\omega_1 + e_1\gamma_1)^T(A\omega_1 + e_1\gamma_1) + c_1 e_2^T q
         \text{s.t.}\; -(B\omega_1 + e_2\gamma_1) + q \geq e_2,\; q \geq 0

TWSVM2: \min_{\omega_2,\gamma_2,q} \tfrac{1}{2}(B\omega_2 + e_2\gamma_2)^T(B\omega_2 + e_2\gamma_2) + c_2 e_1^T q
         \text{s.t.}\; -(A\omega_2 + e_1\gamma_2) + q \geq e_1,\; q \geq 0

2.2 Importance of Least Square Formulations
Technological advances in the last decade have introduced new and efficient tools for data collection, especially in the field of biomedicine. This has paved the way for a large number of cases with high dimensional datasets, i.e. with a large number of input features compared to the number of samples. As discussed in the introductory section, traditional data mining techniques have produced appreciable results for standard datasets. But when data is represented as high dimensional feature vectors with limited sample size, it poses a great challenge for standard algorithms. Hence reducing the number of features is very important to obtain efficient and effective data analysis. Feature selection methods can be classified into three different categories: filter, wrapper and embedded methods. The most simple idea is to select a subset of features

PAGE 16

from the original set of features based on a feature ranking procedure; this is the filter method. Wrapper methods compare a set of feature subsets by their performance in predicting the data and select the best subset. Embedded methods perform feature selection along with the classification model construction process. One of the very common methods is to introduce an l1 norm into a least squares classification model, which induces sparsity in the model. This assists in feature selection by removing irrelevant features. Feature extraction for high dimensional datasets is very important, as most features in high dimensional vectors are usually non-informative or noisy and could affect the generalization performance. Hence there is great interest in many data mining applications in inducing sparsity in high dimensional datasets with respect to the input features, to remove insignificant features. Such sparse representations can provide information on relevant features and thereby assist in feature selection. Further, classification models with a sparse data matrix can simplify the decision rule for faster prediction in large-scale problems. Finally, in many data analysis applications, a small set of features is desirable to interpret the results. As sparsity can be very easily introduced into a least squares mathematical formulation using a regularization term, the focus of this study is to generate a least squares formulation for proximal support vector machines. The Least Absolute Shrinkage and Selection Operator (LASSO) method introduced by Tibshirani [2, 29] can be effectively applied to the least squares formulation for inducing sparsity. Further, the robust and efficient classification model of PSVM makes it an attractive model to study. These factors motivated the investigation of a least squares formulation for generating proximal planes.

2.3 Least Square Formulation for Generating Proximal Planes
2.3.1 Using Spectral Decomposition
Zou et al. [15] proved the following theorem that establishes the relation between eigenvalue problems and least-squares problems.

PAGE 17

Theorem 2.1 (Zou et al.). Consider a real matrix X ∈ R^{n×p}. For any λ > 0, let (α_opt, β_opt) be the solution of

\min_{\alpha,\beta} \|X - X\beta\alpha^T\|^2 + \lambda\|\beta\|^2 \quad \text{s.t.} \quad \alpha^T\alpha = 1.

Then β_opt is proportional to the eigenvector corresponding to the largest eigenvalue of X^T X.

To apply this result to PSVM, let G_A = U_A^T U_A be the Cholesky decomposition of G_A, with U_A upper triangular, and let H_B = U_B^T U_B = L_B L_B^T. The generalized eigenvalue problem H_B z = λ G_A z can then be transformed into the symmetric eigenvalue problem

U_A^{-T} H_B U_A^{-1} y = \lambda y,
PAGE 18

where U_A z = y. The optimal eigenvector related to the proximal hyperplane P_A in PSVMs can be found from the following relation:

z_{opt} = U_A^{-1}\hat{y},

where ŷ is the eigenvector corresponding to the maximum eigenvalue of the symmetric eigenvalue problem above. By substituting X = L_B^T U_A^{-1}, α = U_A α̂ and (L_B^T U_A^{-1})β_i = U_A^{-T} U_B β̂_i in the least squares problem of Theorem 2.1 and re-arranging the terms, the following least-squares optimization problem is obtained:

\min_{\alpha,\hat{\beta}} \|U_B U_A^{-1} - U_B\hat{\beta}\alpha^T\|^2 + \lambda\,\hat{\beta}^T G_A\hat{\beta} \quad \text{s.t.} \quad \alpha^T\alpha = 1,

where β̂_opt is proportional to z_1, the optimal eigenvector corresponding to the largest eigenvalue of GEV(H_B, G_A). This optimization problem can be solved by alternating over α and β̂.

β̂ fixed: For a fixed β̂, the following optimization problem is solved to obtain α:

\min_{\alpha} \|U_B U_A^{-1} - U_B\hat{\beta}\alpha^T\|^2 \quad \text{s.t.} \quad \alpha^T\alpha = 1.

Expanding the objective \|U_B U_A^{-1} - U_B\hat{\beta}\alpha^T\|^2 gives

(U_B U_A^{-1} - U_B\hat{\beta}\alpha^T)^T(U_B U_A^{-1} - U_B\hat{\beta}\alpha^T) = \text{const} - 2\,\alpha^T U_A^{-T} H_B\hat{\beta} + \alpha^T\alpha\,\hat{\beta}^T H_B\hat{\beta}.

Substituting α^T α = 1, the optimization problem can be re-written as:

PAGE 19

\max_{\alpha} \alpha^T U_A^{-T} H_B\hat{\beta} \quad \text{s.t.} \quad \alpha^T\alpha = 1.

An analytical solution for this problem exists and the optimal α is given by

\alpha_{opt} = \frac{U_A^{-T} H_B\hat{\beta}}{\|U_A^{-T} H_B\hat{\beta}\|}.

α fixed: For a given α, the optimization problem can be reduced to a ridge regression-type problem. To see this, let A_⊥ be an orthogonal matrix such that [α, A_⊥] is p×p orthogonal. Then,

\|U_B U_A^{-1} - U_B\hat{\beta}\alpha^T\|^2 = \|U_B U_A^{-1}[\alpha, A_\perp] - U_B\hat{\beta}\alpha^T[\alpha, A_\perp]\|^2 = \|U_B U_A^{-1}\alpha - U_B\hat{\beta}\|^2 + \|U_B U_A^{-1}A_\perp\|^2.

So, for a fixed α, β̂ optimizes the following regression problem:

\min_{\hat{\beta}} \|U_B U_A^{-1}\alpha - U_B\hat{\beta}\|^2 + \lambda\,\hat{\beta}^T G_A\hat{\beta}.

In this case as well, an analytical solution can be found, given by:

\hat{\beta}_{opt} = (H_B + \lambda G_A)^{-1} H_B U_A^{-1}\alpha.

The following algorithm summarizes the steps needed to solve for each optimal hyperplane in PSVM using the least-squares (LS) approach:

PAGE 20

Algorithm 1: PSVMs-via-LS(H_B, G_A)
1. Initialize β̂.
2. Find the upper triangular matrix U_A from the Cholesky decomposition of G_A.
3. Find α from the relation α = U_A^{-T} H_B β̂ / ||U_A^{-T} H_B β̂||.
4. Find β̂ as β̂ = (H_B + λ G_A)^{-1} H_B U_A^{-1} α.
5. Alternate between steps 3 and 4 until convergence.
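A minimal sketch of Algorithm 1, assuming NumPy/SciPy; the regularization parameter lam, the initialisation of beta and the stopping rule are illustrative choices rather than values prescribed by the thesis.

```python
# Sketch of Algorithm 1 (PSVMs-via-LS): alternating updates of alpha and beta-hat.
# GA must be symmetric positive definite; lam and the initial beta are illustrative choices.
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def psvm_via_ls(HB, GA, lam=1.0, n_iter=100, tol=1e-8):
    UA = cholesky(GA, lower=False)               # GA = UA^T UA, UA upper triangular
    p = GA.shape[0]
    beta = np.ones(p) / np.sqrt(p)               # step 1: initialise beta-hat
    for _ in range(n_iter):
        # step 3: alpha = UA^{-T} HB beta / ||UA^{-T} HB beta||
        v = solve_triangular(UA, HB @ beta, trans='T')
        alpha = v / np.linalg.norm(v)
        # step 4: beta = (HB + lam*GA)^{-1} HB UA^{-1} alpha
        beta_new = np.linalg.solve(HB + lam * GA, HB @ solve_triangular(UA, alpha))
        if np.linalg.norm(beta_new - beta) < tol:
            beta = beta_new
            break
        beta = beta_new
    return alpha, beta
```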
PAGE 21

2.3.2 Special Case Eigenvalue Problem
Liang Sun et al. [22] introduced a theorem that establishes a relation between a specially structured eigenvalue problem and a least squares problem.

Theorem 2.2. Consider a generalized eigenvalue problem of the form

X S X^T\omega = \lambda\, X X^T\omega \qquad\text{or}\qquad (X X^T)^{\dagger} X S X^T\omega = \lambda\,\omega,

where X ∈ R^{d×n} is a centered data matrix and S ∈ R^{n×n} is symmetric and positive semi-definite. As S is symmetric and positive semi-definite it can be decomposed as

S = H H^T

using the Cholesky decomposition, where H ∈ R^{n×s}.
PAGE 22

From H and X a target matrix T can be constructed such that solving the generalized eigenvalue problem above is equivalent to solving the least squares problem \min_W \|W^T X - T\|_F^2; at optimality the column vectors of W represent the eigenvectors of the generalized eigenvalue problem, with the closed form solution W_{opt} = (X X^T)^{\dagger} X T^T. To apply this result to the PSVM problem GEV(H_B, G_A), let C denote the centering matrix, let G_A = L_A L_A^T be the Cholesky decomposition of G_A with L_A lower triangular, H_B = U_B^T U_B with U_B upper triangular, and let L_A = UΣV^T be the singular value decomposition of L_A.
PAGE 23

Applying the Cholesky decomposition and the singular value decomposition to the centered matrix we have

C H_B C z = \lambda\, C G_A C z, \qquad C U_B^T U_B C z = \lambda\, C L_A L_A^T C z.

Define

H = V\Sigma^{\dagger} U^T U_B^T,

where Σ^† is the pseudo-inverse of Σ. Introducing U\Sigma V^T V\Sigma^{\dagger} U^T = I into the equation above and substituting for H and L_A,

C\, U\Sigma V^T V\Sigma^{\dagger} U^T\, U_B^T U_B\, U\Sigma^{\dagger} V^T V\Sigma U^T\, C z = \lambda\, C L_A L_A^T C z,
C L_A H H^T L_A^T C z = \lambda\, C L_A L_A^T C z.

The GEV problem H_B z = λ G_A z is thus reformulated as

C L_A H H^T L_A^T C z = \lambda\, C L_A L_A^T C z, \qquad (C L_A) H H^T (C L_A)^T z = \lambda\, (C L_A)(C L_A)^T z,

where C L_A = L̂_A is centered and S = H H^T is a symmetric and positive semi-definite matrix. Applying Theorem 2.2, solving L̂_A H H^T L̂_A^T z = λ L̂_A L̂_A^T z is equivalent to

\min_W \|W^T\hat{L}_A - T\|_F^2,

where T is generated from H and L̂_A as in Theorem 2.2. Further, at optimality the column vectors of W represent the eigenvectors of the generalized eigenvalue problem. The closed form solution for W_opt is given by W_{opt} = (\hat{L}_A\hat{L}_A^T)^{\dagger}\hat{L}_A T^T. The following algorithm summarizes the steps needed to solve for each optimal hyperplane in PSVM using the least-squares (LS) approach derived from Theorem 2.2:

PAGE 24

Algorithm 2: PSVMs-via-LS(H_B, G_A)
1. Using the Cholesky decomposition, find the upper triangular matrix U_B from H_B and the lower triangular matrix L_A from G_A.
2. Find U, Σ, V from the singular value decomposition of L_A.
3. Center the lower triangular matrix L_A to create L̂_A.
4. Generate H = VΣ^†U^T U_B^T.
5. Apply Theorem 2.2 to generate T from H and L̂_A.
6. The closed form solution is obtained using the equation W_{opt} = (\hat{L}_A\hat{L}_A^T)^{\dagger}\hat{L}_A T^T.

2.4 Results and Observations
In this chapter we introduced two least squares formulations for generating the proximal planes of PSVM. The correctness of the least squares formulations is confirmed mathematically using Theorems 2.1 and 2.2. It can be further validated by comparing the classification accuracies of the least squares formulations with the accuracies obtained from the standard PSVM formulation (generalized eigenvalue formulation). In this study, 10-fold cross validation accuracies are reported. As the least squares formulations are just reformulations of the standard generalized eigenvalue model, their classification accuracies are expected to be the same as or very close to the accuracies from the standard formulation. Numerical tests were done on publicly available binary class datasets. In the results in Table 2-1 the Dimensions column reports the number of data points by the number of features in the dataset. Colon, DBWorld and DLBCL are high dimensional datasets while the others are standard datasets. The PSVM-Eig column shows the accuracies obtained using the standard generalized eigenvalue formulation, the PSVM-LS-F1 column shows accuracies associated with the least squares formulation using Theorem 2.1 of Zou et al., and the PSVM-LS-F2 column shows accuracies associated with the least squares

PAGE 25

formulation using Theorem 2.2 from Liang Sun et al.

Table 2-1. Results: PSVM-Eig represents the standard generalized eigenvalue formulation, PSVM-LS-F1 is the least squares formulation using Theorem 2.1 (Zou et al. [15]) and PSVM-LS-F2 is the least squares formulation using Theorem 2.2 (Liang Sun et al. [22]).

Dataset      Dimensions   Class Ratio   PSVM-Eig   PSVM-LS-F1   PSVM-LS-F2
WDBC         569*30       212:357       93.3%      93.30%       92.80%
Spambase     4601*57      1813:2788     68.0%      67.96%       76.40%
Ionosphere   351*34       126:225       76.9%      76.91%       75.48%
WPBC         198*33       47:151        74.9%      74.70%       73.79%
Mushroom     8124*126     3916:4208     99.8%      99.80%       99.80%
Colon        62*2000      40:22         87.1%      87.14%       87.14%
DBWorld      64*4702      35:29         90.7%      90.71%       90.71%
DLBCL        77*5469      58:19         81.8%      81.79%       75.36%

The results infer that both least squares approaches are valid representations of the proximal planes, as they generate classification accuracies similar to the standard PSVM formulation. This new formulation paves the way for an easy introduction of embedded feature selection techniques into Proximal Support Vector Machines (PSVMs). An l1 norm can be introduced in the new least squares formulations developed in this study to obtain sparse classification and in turn attain feature selection for PSVMs. The quadratic model for PSVM using the Twin Support Vector Machine can also use the l1 norm to induce sparsity. This is a direct method of inducing sparsity; however, when the l1 norm is introduced into the TWSVM model it becomes a non-differentiable constrained optimization model which is computationally very challenging, while the least squares models developed in this chapter can be solved more efficiently. After introducing the l1 norm into the least squares formulation developed from Theorem 2.1, the optimization problem can be solved iteratively by alternating between α and β̂. There exist efficient algorithms [13] to solve the least squares formulation obtained using Theorem 2.2. These credits further signify the applicability of the new least squares approaches introduced in this study.
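To illustrate the kind of sparsity such an l1 extension would induce (this is a generic LASSO least-squares example, not the sparse PSVM itself), the following sketch assumes scikit-learn; the synthetic data and the penalty alpha are illustrative.

```python
# Minimal illustration (not the thesis's sparse PSVM): adding an l1 penalty to a
# least-squares fit drives coefficients of irrelevant features to exactly zero.
# Data, noise level and the penalty alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d = 60, 200                               # more features than samples (high dimensional)
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]     # only 5 informative features
y = X @ true_w + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.05).fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```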

PAGE 26

CHAPTER 3
JOINT SPARSE FEATURE SELECTION

3.1 Dimensionality Reduction
Dimensionality reduction is the technique of projecting a set of input data points onto a smaller dimensional space. That is, the data is represented on a lower dimensional subspace. It is normally achieved through feature selection or feature extraction methods. A feature selection method identifies a set of prominent features, and the reduced subspace is determined by this selected set of features, while a feature extraction method creates derived features which are combinations of existing features, and these new derived features are used for generating the reduced subspace. Dimensionality reduction has many advantages in data mining, especially when handling high dimensional datasets. Dimensionality reduction techniques play a vital role in high dimensional datasets as they help in reducing the dimensionality of the input space with minimum loss of information. Principal component analysis (PCA) is a very common dimensionality reduction technique based on feature extraction. It creates a set of derived features or linear subspaces that are linear combinations of the existing features such that maximum data variance is accounted for in the new subspace. This can also be looked at from the perspective of creating a new subspace where each basis vector of the subspace is a linear combination of the existing features. Consider a data matrix X ∈ R^{d×n} with d features and n data points; the problem of finding the reduced
PAGE 27

subspace can be formulated as a variance maximization problem. The subspace is represented by r orthogonal vectors u_i ∈ R^d, collected as the columns of a transformation matrix U ∈ R^{d×r}, and the variance captured in the subspace is maximized:

\max_{U} \operatorname{tr}(U^T X X^T U) \quad\text{s.t.}\quad U^T U = I.

3.2 L21 Norm and Feature Selection
Row sparsity in the transformation matrix can be encouraged through the quantity \|U\|_{2,1} = \sum_{i=1}^{d}\|u^i\|_2, where u^i denotes the i-th row of U. This quantity is referred to as the
PAGE 28

l2,1 norm. It is introduced into a dimensionality reduction problem expecting to induce row sparsity in the transformation matrix and thereby assist in feature selection. Joint sparsity is induced in the orthonormal vectors spanning S by introducing the l2,1 norm into the optimization model above, which is modified as below:

\max_{U} \operatorname{tr}(U^T X X^T U) - \lambda\,\|U\|_{2,1} \quad\text{s.t.}\quad U^T U = I,

where λ > 0 controls the intensity of the sparsity induced. The solution to the unregularized optimization model can be obtained by solving the following symmetric eigenvalue problem, and the optimal u_i are the eigenvectors corresponding to the r largest eigenvalues:

X X^T U = U D,

where D = diag(λ_1, λ_2, ..., λ_r) is the set of eigenvalues and U ∈ R^{d×r} collects the corresponding eigenvectors.
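A small sketch, assuming NumPy, of the two quantities just introduced: the l2,1 norm of a transformation matrix and the unregularized solution obtained from the top eigenvectors of XX^T; the data and the subspace dimension are illustrative.

```python
# Sketch: l2,1 norm of a transformation matrix U and the unregularised PCA-type
# solution of max tr(U^T X X^T U) s.t. U^T U = I via the top eigenvectors of X X^T.
import numpy as np

def l21_norm(U):
    # Sum over rows of the Euclidean norm of each row: sum_i ||u^i||_2.
    return np.sum(np.linalg.norm(U, axis=1))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 60))        # d = 100 features, n = 60 samples (columns are points)
r = 5
vals, vecs = np.linalg.eigh(X @ X.T)  # symmetric eigendecomposition, ascending order
U = vecs[:, -r:]                      # eigenvectors of the r largest eigenvalues
print("captured variance ratio:", vals[-r:].sum() / vals.sum())
print("l2,1 norm of U:", l21_norm(U))
```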
PAGE 29

The regularized model is further relaxed to the following optimization problem:

\min_{U} \operatorname{tr}\big(U^T(\lambda D - X X^T)U\big) \quad\text{s.t.}\quad U^T U = I,

where D is a diagonal re-weighting matrix whose entries are 0 if u^i = 0 and 1/(2\|u^i\|_2) otherwise. The relaxed problem is solved iteratively: starting from an initial D, the transformation matrix U is obtained from the eigenvectors of the current problem, D is re-computed from the rows of U using the rule above, and the two steps are repeated (t = t + 1) until convergence.
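The following sketch, assuming NumPy, follows the re-weighting idea described above for the l2,1-regularized problem: it alternates between an eigenvalue problem and an update of the diagonal weight matrix D. The value of lam, the zero-row threshold and the stopping rule are illustrative, and the exact iteration used in the thesis may differ.

```python
# Sketch of an iteratively re-weighted solver for
#   max tr(U^T X X^T U) - lam * ||U||_{2,1}  s.t.  U^T U = I,
# using the relaxation  min tr(U^T (lam*D - X X^T) U)  with D_ii = 1 / (2 ||u^i||_2).
# lam, the threshold eps and the iteration limit are illustrative assumptions.
import numpy as np

def joint_sparse_subspace(X, r, lam=1.0, n_iter=100, eps=1e-8):
    d = X.shape[0]
    C = X @ X.T
    D = np.eye(d)                                   # initial weights
    U = None
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(C - lam * D)    # maximise tr(U^T (C - lam*D) U)
        U_new = vecs[:, -r:]                        # top-r eigenvectors
        row_norms = np.linalg.norm(U_new, axis=1)
        # Re-weight: rows with (near-)zero norm get zero weight, others 1/(2*||u^i||_2).
        D = np.diag(np.where(row_norms < eps, 0.0,
                             1.0 / (2.0 * np.maximum(row_norms, eps))))
        if U is not None and np.linalg.norm(U_new - U) < 1e-6:
            U = U_new
            break
        U = U_new
    return U

# Rows of U with larger norms indicate features that matter most for the reduced subspace.
```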
PAGE 30

3.3 Results and Observations
The joint sparse (JS) feature selection method was tested on 4 high dimensional datasets. The transformation matrix U obtained using the above method was not only used to transform the data to a lower dimensional subspace but also to extract significant features. Each row in U can be directly correlated to a feature in the original data space. In this experiment 3 different reduced subspaces were considered, each having 5, 10 and 15 dimensions, i.e. 3 different U ∈ R^{d×r} with r = 5, 10, 15. The variance captured is an important measure of the performance of any dimensionality reduction method.

Table 3-1. Variance captured by PCA and the JS method (joint sparse feature selection method); r represents the dimension of the reduced space. Classification accuracy and standard deviation are also compared.

                             Variance       Accuracy %      Std Deviation
Dataset    Class Ratio  r    PCA    JS      PCA    JS       PCA     JS
Colon      40:22        5    0.71   0.70    64.2   71.3     13.55   13.38
                        10   0.84   0.84    72.9   73.3     13.21   14.20
                        15   0.90   0.89    68.3   68.3     13.41   13.94
DBWorld    35:29        5    0.23   0.16    89.6   88.5     7.60    8.46
                        10   0.37   0.27    87.7   86.9     7.23    7.93
                        15   0.48   0.36    88.8   86.9     6.35    7.53
Leukemia   27:11        5    0.45   0.40    100.0  90.7     0.00    6.99
                        10   0.62   0.58    98.6   98.6     4.40    4.40
                        15   0.73   0.69    99.3   99.3     3.19    3.19
Breast     44:33        5    0.37   0.31    66.6   68.4     10.39   11.38
                        10   0.50   0.44    63.8   62.2     11.03   13.97
                        15   0.59   0.52    60.3   60.0     13.34   13.36

The data variance in the reduced subspace is compared with the variance in the original input space to calculate the percentage of variance captured by the dimensionality reduction technique. Percentage of variance is the ratio of the variance in the subspace to the variance in the original input space. The percentage of variance captured by the principal components in PCA is compared with the variance captured using the joint sparsity method in Table 3-1. Classification accuracy using SVM on the reduced subspace (generated from both PCA and the JS method) is also compared for different values of the subspace dimension. The results show that the joint sparsity method captures variance similar to that of PCA and also performs well in classification accuracy. An iterative algorithm is used to obtain the transformation matrix U, where in each iteration the l2,1 norm decreases, forcing rows corresponding to irrelevant features to smaller magnitudes but never reducing them exactly to zero. Hence, during the iterations, when the norm of any row goes below a particular value it is forced to zero; in this study we have used a threshold of e-8. The algorithm terminates if Δu <= 5e-4 and Δf <= e-4, or if the

PAGE 31

number of iterations exceeds 100. Δu for the kth iteration is defined as ||U_k − U_{k−1}||_F / sqrt(r·d), where r is the dimension of the reduced subspace and d is the number of features in the dataset. Δf is defined as |Obj_k − Obj_{k−1}| / d, where Obj_k is the objective value (||U||_{2,1}) at the kth iteration. The maximum number of iterations was fixed at 100 as, for most of the datasets, the algorithm converged to acceptable levels in fewer than 100 iterations. As the algorithm does not provide efficient parameters to control the sparsity levels, the performance of the feature selection process was tested using the top prominent features. In this study two ideas were used to select the prominent feature subset. In the first case, features whose magnitude is larger than a threshold percentage of the largest magnitude feature form the active feature set. Table 3-2 shows the classification accuracies associated with the prominent feature subsets generated by this method and also the number of features in each subset. In the second approach the top T features (features with the largest magnitude) formed the prominent subset, and the classification accuracy is reported in Table 3-3. Both results tables compare the accuracy of the JS method with the widely accepted PCA method. The results show that the classification accuracies from the JS method are not higher than those of the PCA method, but they are very much comparable with PCA. Among dimensionality reduction methods, PCA is considered to be the best and most efficient algorithm, and the PCA-SVM classification model is known to provide high classification accuracies. The JS method provides classification accuracies comparable with PCA, and it additionally provides a list of prominent features. Hence, along with dimensionality reduction, feature selection is also performed using the JS method.

PAGE 32

Table 3-2. Accuracies of PCA and the JS method (joint sparse feature selection method). The threshold for selecting features is a percentage of the largest weight (e.g. 30% selects the features whose weights are greater than 30% of the largest weight); the number of relevant features is also given for the corresponding thresholds.

                   PCA      Accuracy (JS method) %     Non-zero features
Dataset    r       Acc %    30%     20%     10%        30%    20%    10%
Colon      5       64.16    76.25   69.58   75.41      70     83     100
           10      72.91    65      66.25   75.41      80     105    122
           15      68.33    64.16   66.25   59.16      85     105    133
DBWorld    5       89.61    88.07   90.76   88.84      29     52     112
           10      87.69    81.53   85.76   86.53      28     72     133
           15      88.84    86.92   85.76   85.38      95     150    211
Leukemia   5       100      90      91.42   91.42      42     62     86
           10      98.57    93.57   92.85   97.85      13     36     74
           15      99.28    87.85   99.28   99.28      19     50     88
Breast     5       66.56    69.68   67.18   65.31      26     51     93
           10      63.75    63.43   64.68   63.12      59     96     160
           15      60.31    59.37   60.93   59.68      68     122    196

Table 3-3. Accuracies of PCA and the JS method (joint sparse feature selection method). T is the number of top selected features (T = 10 selects the top 10 features with the largest weights).

                            Accuracy % (JS method, top T features)
Dataset    r       PCA      T=10    T=15    T=20    T=25    T=30
Colon      5       64.17    78.75   77.50   81.25   77.50   80.00
           10      72.92    65.00   69.58   71.25   72.08   76.25
           15      68.33    62.50   69.58   67.50   63.33   69.17
DBWorld    5       89.62    82.69   87.31   83.08   86.92   86.92
           10      87.69    77.31   88.08   84.23   82.69   85.38
           15      88.85    82.69   79.62   81.92   74.62   76.54
Leukemia   5       100.00   86.43   90.00   89.29   88.57   85.71
           10      98.57    91.43   86.43   87.86   95.71   93.57
           15      99.29    82.86   80.00   87.86   88.57   90.00
Breast     5       66.56    62.81   65.31   65.00   69.69   69.69
           10      63.75    60.63   68.44   69.06   67.50   72.50
           15      60.31    72.19   73.13   68.75   66.88   62.19

PAGE 33

CHAPTER 4
FEATURE SELECTION IN UNLABELLED DATASETS

4.1 Introduction to Raman Spectra Signals
This chapter focuses on feature selection in unlabelled Raman spectroscopy data. A Raman spectrum consists of Raman intensities measured at various wavenumbers. The peaks in a Raman spectrum can be associated with various biological elements. This noninvasive method is very vital in the study of cells and cellular processes. The amount of morphologic and chemical feature information in Raman spectra and its ease of measurement make Raman spectroscopy an attractive method to study cells. However, efficiently extracting information from Raman spectra is a challenge. The dataset used here is a cross sectional Raman spectroscopy scan of a cell embedded in a layer of trehalose. One of the motivations behind collecting a Raman spectra scan of the cell is to create an image of the cell using the Raman spectroscopy scan. The target study includes the task of creating a cell image based on the Raman intensity spectra and also identifying the important peaks that help in distinguishing the various regions in the image generated. Clustering methods are used to generate the image while sparse clustering is used to identify the relevant peaks in the Raman spectra.

4.2 Dataset
The dataset represents a Raman spectroscopy scan of a cell embedded in a trehalose layer. The scan is performed on the X-Z plane, i.e. it provides a cross sectional view of the cell. The expected cross sectional image is a layer of cell at the centre of the scan, with the trehalose layer on both the top and the bottom of the cell. The two datasets considered in this study are TRET Scan6 (scan area: pixel dimension 120x21) and C3A Scan2 (125x50). Each dataset represents a cross sectional scan of a cell embedded in a trehalose medium but with a different scan area and cell sample. Each pixel of the scan area represents a Raman spectrum consisting of 1024 features. The number of data points in TRET Scan6 is 120x21 = 2520 and C3A Scan2 has 125x50 = 6250.

PAGE 34

Both datasets have the same number of 1024 features, i.e. Raman intensities are measured at 1024 different wavenumbers varying between 0 and 3800. Hence our TRET Scan6 dataset consists of 2520 data points with 1024 features and the C3A Scan2 dataset consists of 6250 data points with 1024 features.

4.3 Data Preprocessing
4.3.1 Remove Unnecessary Features
Research on Raman spectra suggests that the intensity measures at very low wavenumbers do not give any information due to the presence of high noise levels. Hence the first 10 features of the data points are removed and the dataset is reduced to 1014 features.

4.3.2 Noise and Background Subtraction
In order to extract maximum information on the Raman scattering, both noise and background fluorescence must be removed. Subtraction of noise is performed using a Savitzky-Golay smoothing filter, which is found to be very effective for Raman spectra. The most promising type of background subtraction algorithms use polynomial fits because they can approximate the fluorescence profile while excluding the Raman peaks. However, there is no consensus on the best polynomial fit order for fluorescence background subtraction [3]. In this study we applied the subback function of MATLAB; this function subtracts the background of a spectrum by fitting a polynomial through the data points in an iterative way.
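A sketch of these two preprocessing steps, assuming SciPy and NumPy: Savitzky-Golay smoothing followed by an iterative polynomial fit used as a stand-in for a subback-style background estimate. The window length, polynomial orders and iteration count are illustrative choices.

```python
# Sketch of the preprocessing described above: Savitzky-Golay smoothing followed by
# an iterative polynomial fit used as a fluorescence background estimate.
# Window length, polynomial orders and iteration count are illustrative choices.
import numpy as np
from scipy.signal import savgol_filter

def preprocess_spectrum(intensity, poly_order=5, n_iter=20):
    smoothed = savgol_filter(intensity, window_length=11, polyorder=3)
    x = np.arange(len(smoothed))
    background = smoothed.copy()
    for _ in range(n_iter):
        coeffs = np.polyfit(x, background, poly_order)
        fit = np.polyval(coeffs, x)
        # Clip the working signal to the fit so Raman peaks do not pull the baseline up.
        background = np.minimum(background, fit)
    return smoothed - fit       # background-subtracted spectrum
```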

PAGE 35

4.3.3 Peak Selection
Biologically relevant wavenumbers in Raman spectra are associated with peaks of Raman intensity, hence only the wavenumbers corresponding to peaks are relevant for any analysis. Hence the peaks associated with each Raman spectrum are selected and their associated wavenumbers are shortlisted as potential features. Due to the resolution and noise in the process of plotting Raman spectra, peaks of different Raman spectra could be shifted by a few wavenumbers. This is taken care of by cohesion of peaks in different spectra to one prominent wavenumber. After peak selection and peak cohesion the number of relevant features is further reduced. Once the raw dataset undergoes preprocessing it is taken for further analysis. The preprocessed dataset is expected to contain only relevant features, with the noise and background subtracted.

4.4 Clustering
Clustering consists of partitioning the data points based on the differences between them. The most common dissimilarity measure considered is the Euclidean distance between the data points. Hence clustering tends to group points lying close to each other; it is the process of grouping similar elements together. Clustering is an unsupervised classification model where the class of the training dataset is unknown; in many cases even the number of classes present is unknown. In this study two clustering methods are analysed and applied to the dataset.

4.5 K-means Clustering
The K-means clustering algorithm is one of the most popular clustering algorithms. To perform K-means clustering the user needs to specify the number of clusters present in the training dataset. The K-means algorithm can be explained as a repetition of two steps. The main idea is to define k centroids, one for each cluster. The next step is to take each point belonging to the given dataset and associate it to the nearest centroid. When all data points are assigned to a centroid or mean, the first step is completed and an initial clustering is formed. At this point we need to re-calculate k new centroids based on the clusters resulting from the previous step. After we have these k new centroids, the data points are reassigned to their nearest new centroids, creating a new set of clusters. The process is repeated until no more data points are reassigned from their current cluster, or in other words the centroids do not move any more [20]. The performance of clustering heavily depends on the number of clusters present. So assigning the

PAGE 36

correct number of clusters is critical in obtaining a good clustering and thereby efficient classification. The first task in this study was identifying the extracellular (trehalose) region and the cellular region, so initially clustering was performed with two means, i.e. K = 2. K-means clustering was performed on both datasets and Figures 4-1 and 4-2 show the clustering. The cell is suspended in a medium of trehalose and hence in the cross sectional view it is expected to observe a cellular region at the centre while the upper and bottom layers contain trehalose. So the clustering algorithm should generate a layer at the centre while the upper and bottom layers consist of the same cluster. The K-means cluster output shown in Figures 4-1 and 4-2 provides the same picture and hence validates the applicability of using clustering methods in distinguishing the cellular and extracellular regions in a Raman spectra scan.

Figure 4-1. K-Means Clustering, TRET Scan6 scan of dimension 120X21

In the TRET Scan6 image (Figure 4-1), the centre red coloured strip represents the cluster associated with the cellular region and the blue coloured region the extracellular region.
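A sketch, assuming scikit-learn and matplotlib, of producing such a scan image: the preprocessed spectra are clustered with K = 2 and the labels are reshaped back to the scan grid. The file name is hypothetical and the grid dimensions follow the TRET Scan6 description.

```python
# Sketch of generating a scan image by K-means with K = 2 clusters
# (cellular vs. extracellular region), assuming scikit-learn and matplotlib.
# `spectra` is assumed to hold one preprocessed Raman spectrum per pixel (row).
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

n_rows, n_cols = 120, 21                            # TRET Scan6 scan grid
spectra = np.load("tret_scan6_preprocessed.npy")    # hypothetical file, shape (2520, n_features)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(spectra)
image = labels.reshape(n_rows, n_cols)              # each pixel coloured by its cluster

plt.imshow(image.T, aspect="auto")
plt.title("K-means clustering of the Raman scan (K = 2)")
plt.show()
```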

PAGE 37

A similar pattern is also expected from the C3A Scan2 scan, and the cluster obtained (Figure 4-2) also validates our proposition.

Figure 4-2. K-Means Clustering, C3A Scan2 scan of dimension 125X50

Once the image of the extracellular and cellular regions is generated, the region associated with the nucleus is located within the cellular region. This is a first step towards distinguishing the various elements in the cellular region. In this study only the cellular region of the C3A Scan2 scan is further clustered to locate the nucleus, and the light green coloured region shown in Figure 4-3 is expected to represent the nucleus.

4.6 Spectral Clustering
Spectral clustering is also one of the popular clustering algorithms due to its ease of implementation and the availability of efficient methods to solve it. In many cases it outperforms traditional methods like the K-means algorithm. For performing spectral clustering the dataset is represented as a graph network consisting of n nodes, where n is the number of data points, and each arc weight represents some dissimilarity measure between the nodes connected by the arc. In this study we take the Euclidean distance between two data points as the dissimilarity measure for the arc connecting

PAGE 38

them. Let W ∈ R^{n×n} denote the resulting matrix of pairwise arc weights. Spectral clustering forms a graph Laplacian from W, computes its leading eigenvectors, and applies K-means to the rows of the resulting eigenvector matrix to obtain the final clusters [31].

Figure 4-3. K-means C3A Scan2 scan with nucleus marked
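A sketch of the standard spectral clustering pipeline [31], assuming scikit-learn; the Gaussian affinity used to turn Euclidean distances into arc weights, its width gamma, and the file name are illustrative assumptions.

```python
# Sketch of the standard spectral clustering pipeline (von Luxburg [31]), assuming
# scikit-learn: Gaussian (RBF) affinity on Euclidean distances -> spectral embedding
# -> K-means on the embedding. The affinity width gamma is an illustrative choice.
import numpy as np
from sklearn.cluster import SpectralClustering

spectra = np.load("tret_scan6_preprocessed.npy")    # hypothetical file, one spectrum per pixel

model = SpectralClustering(n_clusters=2, affinity="rbf", gamma=1e-3,
                           assign_labels="kmeans", random_state=0)
labels = model.fit_predict(spectra)
image = labels.reshape(120, 21)                     # scan image, as with K-means
```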
PAGE 39

Observations from Spectral Clustering
Spectral clustering was performed on both datasets and Figures 4-4 and 4-5 show the clustered images. Similar to the K-means clustered images (Figures 4-1 and 4-2), the spectral clustering algorithm also generates the scan image. This further validates the applicability of using clustering methods in distinguishing the cellular and extracellular regions in a Raman spectra scan.

Figure 4-4. Spectral Clustering, TRET Scan6 scan of dimension 120X21

In the TRET Scan6 image (Figure 4-4), the red coloured strip represents the cluster associated with the cellular region and the blue coloured region the extracellular region. A similar pattern is also expected from the C3A Scan2 data (Figure 4-5), and the cluster obtained also validates our proposition on the expected scan image. Both clustering methods provide similar clusters and hence this classification output can be used as a reference for further analysis.

Figure 4-5. Spectral Clustering, C3A Scan2 scan of dimension 125X50

4.7 Sparse Clustering for Feature Selection
Technological advances in the last decade have introduced new and efficient tools for data collection, especially in the field of biomedicine. This has paved the way for a new class of large datasets with very high dimensions, i.e. with a large number of input

PAGE 40

features compared to the number of observations. Traditional data mining techniques have produced appreciable results for standard datasets, but when data is represented as very high dimensional vectors it poses great challenges for standard algorithms. Feature extraction for high dimensional datasets is very important, as most features in high dimensional vectors are usually non-informative or noisy and could affect the generalization performance. There is great interest in many machine learning applications in inducing sparsity in high dimensional datasets with respect to the input features. A sparse representation can provide significant information on relevant features and thereby assist in feature selection. Further, classification models with a sparse data matrix can simplify the decision rule for faster prediction in large-scale problems. Finally, in many data analysis applications, a small set of features is desirable to interpret the results. In this study sparse clustering is performed to extract the relevant features from the dataset. Feature selection will help in identifying the biologically significant wavenumbers that are critical in distinguishing between the extracellular and cellular regions. The sparse K-means method suggested by Witten et al. [5] is used for sparse

PAGE 41

clustering. The sparse K-means clustering optimization problem can be formulated as follows:

\max_{c_1,\dots,c_K,\;\omega} \sum_{j=1}^{p}\omega_j\Big(\frac{1}{n}\sum_{i=1}^{n}\sum_{i'=1}^{n}d_{i,i',j} - \sum_{k=1}^{K}\frac{1}{n_k}\sum_{i,i'\in C_k}d_{i,i',j}\Big)
\quad\text{subject to}\quad \|\omega\|^2 \le 1,\; \|\omega\|_1 \le s,\; \omega_j \ge 0\ \forall j,

where c_1, c_2, ..., c_K represent the K classes or clusters in the data space, ω represents the weights associated with the features, p is the number of features, s is the tuning parameter, d_{i,i',j} is the dissimilarity measure between nodes i and i' along feature j, and K is the number of classes. The above optimization problem is solved using an iterative process proposed by Witten et al. [5]. The weight ω_j associated with each feature represents the significance of that feature in the clustering, i.e. in differentiating between the extracellular and cellular regions. Hence features with greater weights are crucial in distinguishing between the cellular and extracellular regions, and they also represent the biologically significant wavenumbers associated with cellular material. Table 4-1 shows a list of relevant features (wavenumbers) and their corresponding weights. Wavenumbers are arranged in decreasing order of their weights, and weights less than 0.01 are discarded.
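A sketch, assuming NumPy, of the feature-weight update at the heart of the Witten-Tibshirani procedure: for fixed clusters each feature is scored by its between-cluster dissimilarity and the weights are obtained by soft-thresholding. The threshold delta is fixed here for illustration; in the full procedure it is chosen so that the l1 constraint on the weights holds.

```python
# Sketch of the sparse K-means feature-weight update (Witten & Tibshirani [5]):
# for fixed clusters, score each feature by its between-cluster dissimilarity and
# soft-threshold. delta is fixed for illustration; in the full procedure it is
# tuned (e.g. by binary search) so that ||w||_1 <= s.
import numpy as np

def feature_scores(X, labels):
    # For squared Euclidean dissimilarities,
    # (1/n) * sum_{i,i'} d_{i,i',j} = 2 * sum_i (x_ij - mean_j)^2, and likewise per cluster,
    # so the score is twice (total sum of squares - within-cluster sum of squares) per feature.
    total = 2.0 * ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += 2.0 * ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return total - within

def weight_update(scores, delta):
    w = np.maximum(scores - delta, 0.0)          # soft-thresholding of the positive scores
    return w / np.linalg.norm(w) if w.any() else w

# Example usage: labels from an ordinary K-means run; large weights flag informative features.
```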

PAGE 42

4.8 Observations
The study has shown that clustering methods can be effectively used to generate an image of the Raman spectroscopy scan. For both datasets, a well separated image could be generated distinguishing the cellular and extracellular regions. Both K-means and spectral clustering methods provided similar well separated images. This well separated image could be used to evaluate the clustering performed by the sparse clustering method.

Table 4-1. Weights ω_j and corresponding feature (wavenumber)

Weight   Wavenumber      Weight   Wavenumber
0.437    1128            0.073    868
0.416    1359            0.058    1318
0.328    1343            0.057    1273
0.326    1145            0.052    1444
0.312    1380            0.049    1035
0.268    1111            0.048    933
0.248    538             0.041    1253
0.208    1086            0.040    1428
0.197    555             0.035    402
0.152    1400            0.033    510
0.126    1465            0.030    1302
0.114    1061            0.028    907
0.096    429             0.023    1232
0.090    1161            0.019    724
0.079    842             0.015    1215
0.079    456             0.013    587

Figure 4-6. Image generated from the top 15 features from sparse clustering, C3A Scan2 scan of dimension 125X50

The relevant wavenumbers short-listed by the sparse clustering method (in Table 4-1) are used for further learning. These short-listed wavenumbers can be associated with biologically significant Raman peaks, thereby validating the feature selection process. Further, the cell

PAGE 43

scan image generated by the top 15 short-listed features provides an image very similar to the reference images generated by the K-means and spectral methods. This further validates the features selected and the feature selection process. Figure 4-6 shows the image generated by the top 15 features selected from the sparse clustering method; this can be compared with the images in Figures 4-2 and 4-5 generated by the K-means clustering and spectral clustering methods. In this study only a preliminary analysis of distinguishing between the cellular and extracellular regions is performed, and it shows that sparse clustering is effective in both clustering and feature extraction. There is a huge potential to extend this study further, even to differentiating the various regions inside a cell.

PAGE 44

CHAPTER 5
DISCUSSION AND CONCLUSION
Traditional statistical methods fail while handling high dimensional datasets; the introductory section discusses the probable reasons behind this behavior. Hence, high dimensional datasets are preferably studied in a lower dimensional space, and this can be achieved with the help of a feature selection process. The dimensionality of datasets can be reduced by picking only a few of the best features. By feature selection a new dataset is created with only a subset of the original features, but it captures maximum information from the original dataset. This study focused on various aspects of extracting this subset of features. The first section focused on introducing a least squares formulation for proximal support vector machines. The motivation behind this formulation is rooted in the standard method of inducing sparsity into a classification model using the l1 norm in a least squares formulation. It is very common to introduce the l1 norm into a least squares classification optimization model; this induces sparsity in the decision variables and hence helps in identifying the relevant variables, which normally correspond to the features in a dataset. Proximal support vector machines are very efficient classification algorithms and can handle complex datasets very well. This was a major drive in studying proximal planes and investigating ways of reformulating them as a least squares problem. The classification accuracy of the least squares proximal support vector machines is similar to that of the original eigenvalue formulation. So by this study we could develop a new least squares formulation for generating the proximal planes. This model could further be used to introduce the l1 norm and thereby induce sparsity for feature selection. In addition, the algorithms developed to solve the least squares formulation have closed form solutions for the proximal planes, which further improves the computational efficiency. The second section discussed a new norm that could induce joint sparsity in a dimensionality reduction problem. An l2,1 norm is introduced into a dimensionality

PAGE 45

reduction problem and the optimization model is iteratively solved. The direct relation between the transformation matrix and the input features is used to discard irrelevant features. This helps in creating a subset of prominent features. The approach not only reduces the dimensionality of the dataset but also helps in extracting features. The classification accuracies associated with the reduced feature set were comparable to those of the well known PCA-SVM classification method. This not only validates the feature selection process but also underscores the benefits of eliminating irrelevant features. The reduced dimensional space provided better classification accuracies, better feature interpretability and reduced computational complexity. In the last section, the sparse K-means method is applied to Raman spectroscopy data. This study concentrated on understanding the applicability of sparse clustering methods to generate the image of a Raman spectroscopy scan of a cell. Initially, clustering was performed using the standard methods, viz. the K-means clustering and spectral clustering algorithms. Both algorithms generated similar clusters and produced the expected images based on the scan setup. This image was used as a reference to compare the cluster generated by the sparse K-means method. Testing showed that sparse K-means also produced a similar clustering and also short-listed a set of relevant features. This helped in removing irrelevant features and creating a subset of prominent features. Further, the wavenumbers corresponding to the prominent features could be related to biologically significant wavenumbers, and this justified the feature selection process and the applicability of the sparse K-means method to Raman spectroscopy data. To summarize, the study targeted understanding the importance of the feature selection process and various ways of performing feature selection. The majority of the research focused on labelled datasets, where sparsity was induced in supervised classification models to assist feature selection. Lastly, feature selection in unlabelled datasets was also studied using sparse clustering methods.

PAGE 46

REFERENCES

[1] Balasundaram S, Kapil N. Application of Lagrangian twin support vector machines for classification. Second International Conference on Machine Learning and Computing (2010), pp. 193-397.

[2] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics (2004), Volume 32, Number 2, 407-499.

[3] Cao, Alex; Pandya, Abhilash K.; Serhatkulu, Gulay K.; Weber, Rachel E.; Dai, Houbei; Thakur, Jagdish S.; Naik, Vaman M.; Naik, Ratna; Auner, Gregory W.; Rabah, Raja; Freeman, D. Carl (2007). A robust method for automated background subtraction of tissue fluorescence. Journal of Raman Spectroscopy 38(9): 1199-1205.

[4] Cortes C, Vapnik V. Support-vector networks. Machine Learning (1995) 20:273-297.

[5] Daniela M. Witten and Robert Tibshirani (2010). A framework for feature selection in clustering. J Am Stat Assoc. 105(490):713-726.

[6] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Math. (2000), vol. 13, pp. 1-50.

[7] Fan J, Fan Y. (2008). High-dimensional classification using features annealed independence rules. Ann. Statist. 2008; 36:2605-2637.

[8] J Fan, Y Feng, X Tong (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society.

[9] J Fan, J Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica.

[10] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. Proc. Knowledge Discovery and Data Mining, F. Provost and R. Srikant, eds. (2001), pp. 77-86.

[11] M Gallagher, T Downs (1997). Visualization of learning in neural networks using principal component analysis. International Conference on Computational.

[12] Ghorai S, Mukherjee A, Dutta PK. Nonparallel plane proximal classifier. Signal Process (2009) 89:510-522.

[13] Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin. A comparison of optimization methods and software for large-scale L1-regularized linear classification. Journal of Machine Learning Research (2010) 11:3183-3234.

[14] J Hamm, D D Lee (2008). Grassmann discriminant analysis: a unifying view on subspace-based learning. 25th International Conference on Machine Learning.

PAGE 47

[15] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics (2006), Volume 15, Number 2, Pages 265-286.

[16] Jayadeva R, Khemchandani R, Chandra S. Twin support vector machines for pattern classification. (2007) IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5):905-910.

[17] B Jiang, Y H Dai (2013). A framework of constraint preserving update schemes for optimization on Stiefel manifold. arXiv:1301.0172.

[18] Jun Liu, Shuiwang Ji, Jieping Ye (2012). Multi-task feature learning via efficient l2,1-norm minimization. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.

[19] M Kolar, H Liu (2013). Feature selection in high-dimensional classification. Proceedings of the 30th International Conference on Machine Learning.

[20] Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, Robles V. (2006). Machine learning in bioinformatics. Brief Bioinform Vol 7 No. 1:86-112.

[21] K Lee, Y Bresler, M Junge (2012). Subspace methods for joint sparse recovery. Information Theory, IEEE.

[22] Liang Sun, Shuiwang Ji, Jieping Ye. A least squares formulation for a class of generalized eigenvalue problems in machine learning. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

[23] Mangasarian OL, Wild EW. Multisurface proximal support vector classification via generalized eigenvalues. IEEE Transactions on Pattern Analysis and Machine Intelligence (2006) 28(1):69-74.

[24] O. L. Mangasarian. Least norm solution of non-monotone complementarity problems. Functional Analysis, Optimization and Mathematical Economics (1990), pp. 217-221, New York: Oxford Univ. Press.

[25] O. L. Mangasarian and R. R. Meyer. Nonlinear perturbation of linear programs. SIAM J. Control and Optimization (1979), vol. 17, no. 6, pp. 745-752.

[26] D Niu, J G Dy, M I Jordan (2011). Dimensionality reduction for spectral clustering. 14th International Conference on Artificial Intelligence and Statistics.

[27] Osborne, M. R., Presnell, B., and Turlach, B. A. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis (2000), 20, 389-403.

PAGE 48

[28] Quanquan Gu, Zhenhui Li and Jiawei Han (2011). Joint feature selection and subspace learning. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence.

[29] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996), Series B, 58, 267-288.

[30] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. New York: John Wiley & Sons (1977).

[31] Ulrike von Luxburg (2007). A tutorial on spectral clustering. Statistics and Computing 17(4).

[32] Vapnik V. The Nature of Statistical Learning Theory, (1998) 2nd edn. Springer, New York.

[33] Yunhai Xiao, Soon-Yi Wu, Bing-Sheng He (2012). A proximal alternating direction method for the L2,1-norm least squares problem in multi-task feature learning. Journal of Industrial and Management Optimization.

PAGE 49

BIOGRAPHICAL SKETCH
Paul Francis Thottakkara was born in 1985 in Kerala, India. He graduated with a bachelor's degree in Mechanical Engineering from Mahatma Gandhi University in Kerala, India. After his bachelor's degree he worked at Sanmar Engineering Corporation in Chennai, India for two years and then went to the University of Florida to pursue a master's degree in Industrial Engineering. During his master's program at the University of Florida, Paul was an active member of the UF INFORMS chapter. It was during the master's program that he developed an interest in data mining techniques and optimization methods. He continued his studies at the University of Florida to obtain the Engineer Degree with a specialization in data mining and optimization. He will graduate with the Engineer Degree from the University of Florida in August 2013. After graduation he plans to join industry as a data analyst to utilize his skills and research experience.
http://plaza.ufl.edu/paulthottakkara/Paul.html