<%BANNER%>

Unsupervised Learning

Permanent Link: http://ufdc.ufl.edu/UFE0022531/00001

Material Information

Title: Unsupervised Learning: An Information Theoretic Framework
Physical Description: 1 online resource (140 p.)
Language: English
Creator: Rao, Sudhir
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: clustering, entropy, framework, information, learning, structures, theory
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The goal of this research is to develop a simple and unified framework for unsupervised learning. Within machine learning, this is the most difficult scenario: the machine (or learner) is presented only with data, without any desired output or reward for the actions it takes. All the machine can do is capture the underlying patterns and structures in the data at different scales. In doing so, it extracts maximal information from the data in an unbiased manner, which it can then use as feedback to learn and make decisions. Until now, the different facets of unsupervised learning, namely clustering, principal curves and vector quantization, have been studied separately. This is understandable given the complexity of the field and the fact that no unified theory exists. Recent attempts to develop such a theory have been beset by complications. One issue is the imposition of models and structures on the data rather than extracting them directly through self-organization of the data samples. Another is the lack of non-parametric estimators for information theoretic measures: Gaussian approximations generally follow, which fail to capture the higher order features in the data that are really of interest. In this thesis, we present a novel framework for unsupervised learning called the Principle of Relevant Entropy. Using concepts from information theoretic learning, we formulate the problem as a weighted combination of two terms: an information preservation term and a redundancy reduction term. This information theoretic formulation is a function of only two parameters. The user-defined weighting parameter controls the task (and hence the type of structure) to be achieved, whereas the inherent hidden scale parameter controls the resolution of the analysis. By including such a user-defined parameter, we allow the learning machine to influence the level of abstraction to be extracted from the data. The result is the unraveling of 'goal oriented structures' unique to the given dataset. Using information theoretic measures based on Renyi's definition, we estimate these quantities directly from the data. One can derive a fixed-point update scheme to optimize this cost function, thus avoiding Gaussian approximations altogether. This leads to interaction of data samples, giving us a new self-organizing paradigm. An added benefit is the absence of nuisance parameters (such as a step size) common in other optimization approaches. The strength of this new formulation can be judged from the fact that the existing mean shift algorithms appear as special cases of this cost function. By going beyond second order statistics and dealing directly with the probability density function, we effectively extract maximal information to reveal the underlying structure in the data. By bringing clustering, principal curves and vector quantization under one umbrella, this powerful principle truly discovers the underlying mechanism of unsupervised learning.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Sudhir Rao.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Principe, Jose C.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2009-02-28

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022531:00001





Full Text

ACKNOWLEDGMENTS

I would like to take this opportunity to thank my advisor, Dr. Jose C. Principe, for his constant, unwavering support and guidance throughout my stay at CNEL. He has been a great mentor, pulling me out of many local minima (in the language of CNEL!). I still wonder how he works non-stop from morning to evening without lunch, and I am sure this feeling is shared among many of my colleagues. In short, he has been an inspiration and continues to be so. I would like to express my gratitude to all my committee members, Dr. Murali Rao, Dr. John G. Harris and Dr. Clint Slatton, for readily agreeing to be part of my committee. They have helped immensely in improving this dissertation with their inquisitive nature and helpful feedback. I would like to especially thank Dr. Murali Rao, my math mentor, for keeping a vigil on all my notations and bringing sophistication to my engineering mind! Special mention is also needed for Julie, the research coordinator at CNEL, for constantly monitoring the pressure level at the lab and making us smile even if it is for a short while. My past as well as present colleagues at CNEL need due acknowledgement. Without them, I would have been shooting questions to walls and mirrors! I have grown to appreciate them intellectually, and I thank them for being there for me when I needed them most. It has been a pleasure to play ball on and off the basketball court! Finally, it would be foolish on my part not to acknowledge my parents and sister for their constant support and never-ending love. Thanks go out to one and all who have helped me to complete this journey successfully. It has been an adventure and a fun ride!

TABLE OF CONTENTS

ACKNOWLEDGMENTS  4
LIST OF TABLES  8
LIST OF FIGURES  9
ABSTRACT  12

CHAPTER

1 INTRODUCTION  14
    1.1 Background  14
    1.2 Motivation  17
    1.3 Previous Work  19
    1.4 Contribution of this Thesis  22

2 THEORETICAL FOUNDATION  26
    2.1 Information Theoretic Learning  26
        2.1.1 Renyi's Quadratic Entropy  27
        2.1.2 Cauchy-Schwartz Divergence  29
        2.1.3 Renyi's Cross Entropy  31
    2.2 Summary  34

3 A NEW UNSUPERVISED LEARNING PRINCIPLE  36
    3.1 Principle of Relevant Entropy  36
        3.1.1 Case 1: β = 0  38
        3.1.2 Case 2: β = 1  40
        3.1.3 Case 3: β → ∞  42
    3.2 Summary  44

4 APPLICATION I: CLUSTERING  48
    4.1 Introduction  48
    4.2 Review of Mean Shift Algorithms  49
    4.3 Contribution of this Thesis  52
    4.4 Stopping Criteria  53
        4.4.1 GMS Algorithm  54
        4.4.2 GBMS Algorithm  54
    4.5 Mode Finding Ability  55
    4.6 Clustering Example  58
    4.7 Image Segmentation  61
        4.7.1 GMS vs GBMS Algorithm  62
        4.7.2 GMS vs Spectral Clustering Algorithm  63

    4.8 Summary  72

5 APPLICATION II: DATA COMPRESSION  75
    5.1 Introduction  75
    5.2 Review of Previous Work  76
    5.3 Toy Problem  78
    5.4 Face Boundary Compression  80
    5.5 Image Compression  84
    5.6 Summary  85

6 APPLICATION III: MANIFOLD LEARNING  88
    6.1 Introduction  88
    6.2 A New Definition of Principal Curves  90
    6.3 Results  92
        6.3.1 Spiral Dataset  92
        6.3.2 Chain of Rings Dataset  95
    6.4 Summary  95

7 SUMMARY AND FUTURE WORK  96
    7.1 Discussion and Conclusions  96
    7.2 Future Direction of Research  101

APPENDIX

A NEURAL SPIKE COMPRESSION  105
    A.1 Introduction  105
    A.2 Theory  108
        A.2.1 Weighted LBG (WLBG) Algorithm  108
        A.2.2 Review of SOM and SOM-DL Algorithms  110
    A.3 Results  111
        A.3.1 Quantization Results  111
        A.3.2 Compression Ratio  114
        A.3.3 Real-time Implementation  115
    A.4 Summary  117

B SPIKE SORTING USING CAUCHY-SCHWARTZ PDF DIVERGENCE  119
    B.1 Introduction  119
    B.2 Data Description  121
        B.2.1 Electrophysiological Recordings  121
        B.2.2 Neuronal Spike Detection  121
    B.3 Theory  122
        B.3.1 Non-parametric Clustering using CS Divergence  122
        B.3.2 Online Classification of Test Samples  126

    B.4 Results  126
        B.4.1 Clustering of PCA Components  126
        B.4.2 Online Labeling of Neural Spikes  127
    B.5 Summary  127

C A SUMMARY OF KERNEL SIZE SELECTION METHODS  130

REFERENCES  133

BIOGRAPHICAL SKETCH  140

LIST OF TABLES

5-1 J(X) and MSQE on the face dataset for the three algorithms averaged over 50 Monte Carlo trials.  81
A-1 SNR of the spike region and of the whole test data obtained using the WLBG, SOM-DL and SOM algorithms.  113
A-2 Probability values obtained for the code vectors and used in Figure A-7.  115

LIST OF FIGURES

1-1 The process of learning showing the feedback between a learner and its associated environment.  16
1-2 A new framework for unsupervised learning.  25
2-1 Information force within a dataset.  29
2-2 Cross information force between two datasets.  35
3-1 A simple example dataset. A) Crescent shaped data. B) pdf plot for σ² = 0.05.  45
3-2 An illustration of the structures revealed by the Principle of Relevant Entropy for the crescent shaped dataset for σ² = 0.05. As the value of β increases we pass through the modes, principal curves and in the extreme case of β → ∞ we get back the data itself as the solution.  46
4-1 Ring of 16 Gaussians dataset with different a priori probabilities. The numbering of the clusters is in anticlockwise direction starting with center (1, 0). A) R16Ga dataset. B) A priori probabilities. C) Probability density function estimated using σ² = 0.01.  56
4-2 Modes of R16Ga dataset found using GMS and GBMS algorithms. A) Good mode finding ability of GMS algorithm. B) Poor mode finding ability of GBMS algorithm.  57
4-3 Cost function of GMS and GBMS algorithms.  58
4-4 Renyi's "cross" entropy H(X; S) computed for GBMS algorithm. This does not work as a good stopping criterion for GBMS in most cases since the assumption of two distinct phases of convergence does not hold true in general.  59
4-5 A clustering problem. A) Random Gaussian clusters (RGC) dataset. B) 2σg contour plots. C) Probability density estimated using σ² = 0.01.  59
4-6 Segmentation results of RGC dataset. A) GMS algorithm. B) GBMS algorithm.  61
4-7 Averaged norm distance moved by particles in each iteration.  61
4-8 Baseball image.  62
4-9 Baseball image segmentation using GMS and GBMS algorithms. The left column shows results from GMS for various different numbers of segments and the kernel size at which each was achieved. The right column similarly shows the results from the GBMS algorithm.  64

4-10 …  66
4-11 Bald eagle image.  67
4-12 Multiscale analysis of bald eagle image using GMS algorithm.  69
4-13 Performance of spectral clustering on bald eagle image.  70
4-14 Two popular images from Berkeley image database. A) Flight image. B) Tiger image.  70
4-15 Multiscale analysis of Berkeley images using GMS algorithm.  71
4-16 Performance of spectral clustering on Berkeley images.  72
4-17 Comparison of GMS and spectral clustering (SC) algorithms with a priori selection of parameters. A new postprocessing stage has been added to the GMS algorithm where close clusters were recursively merged until the required number of segments was achieved.  73
5-1 Half circle dataset and the square grid inside which 16 random points were generated from a uniform distribution.  79
5-2 Performance of ITVQ-FP and ITVQ-gradient methods on half circle dataset. A) Learning curve averaged over 50 trials. B) 16 point quantization results.  80
5-3 Effect of annealing the kernel size on ITVQ fixed point method.  82
5-4 64 point quantization results of ITVQ-FP, ITVQ-gradient method and LBG algorithm on face dataset. A) ITVQ fixed point method. B) ITVQ gradient method. C) LBG algorithm.  83
5-5 Two popular images selected for image compression application. A) Baboon image. B) Peppers image.  84
5-6 Portions of Baboon and Peppers images used for image compression.  85
5-7 Image compression results. The left column shows ITVQ-FP compression results and the right column shows LBG results. A, B) 99.75% compression for both the algorithms, σ² = 36 for ITVQ-FP algorithm. C, D) 90% compression, σ² = 16. E, F) 95% compression, σ² = 16. G, H) 85% compression, σ² = 16.  86
6-1 The principal curve for a mixture of 5 Gaussians using numerical integration method.  91
6-2 The evolution of principal curves starting with X initialized to the original dataset S. The parameters were set to β = 2 and σ² = 2.  93

6-3 …  94
6-4 Changes in the cost function and its two components for the spiral data as a function of the number of iterations. A) Cost function J(X). B) Two components of J(X), namely H(X) and H(X; S).  94
6-5 Denoising ability of principal curves. A) Noisy 3D "chain of rings" dataset. B) Result after denoising.  95
7-1 A novel idea of unfolding the structure of the data in the 2-dimensional space of β and σ.  99
7-2 The idea of information bottleneck method.  103
A-1 Synthetic neural data. A) Spikes from two different neurons. B) An instance of the neural data.  107
A-2 2-D embedding of training data which consists of total 100 spikes with a certain ratio of spikes from the two different neurons.  107
A-3 The outline of weighted LBG algorithm.  109
A-4 16 point quantization of the training data using WLBG algorithm.  112
A-5 Two-dimensional embedding of the training data and code vectors in a 5×5 lattice obtained using SOM-DL algorithm.  112
A-6 Performance of WLBG algorithm in reconstructing spike regions in the test data.  113
A-7 Firing probability of WLBG codebook on the test data.  114
A-8 A block diagram of Pico DSP system.  116
B-1 Example of extracellular potentials from two neurons.  122
B-2 Distribution of the two spike waveforms along the first and the second principal components.  123
B-3 Comparison of clustering based on CS divergence with K-means for spike sorting.  128

ABSTRACT

The goal of this research is to develop a simple and unified framework for unsupervised learning. As a branch of machine learning, this constitutes the most difficult scenario, where a machine (or learner) is presented only with the data, without any desired output or reward for the action it takes. All that the machine can do then is to somehow capture all the underlying patterns and structures in the data at different scales. In doing so, the machine has extracted maximal information from the data in an unbiased manner, which it can then use as feedback to learn and make decisions.

Until now, the different facets of unsupervised learning, namely clustering, principal curves and vector quantization, have been studied separately. This is understandable from the complexity of the field and the fact that no unified theory exists. Recent attempts to develop such a theory have been beset by complications. One of the issues is the imposition of models and structures on the data rather than extracting them directly through self-organization of the data samples. Another reason is the lack of non-parametric estimators for information theoretic measures. Gaussian approximations generally follow, which fail to capture the higher order features in the data that are really of interest.

In this thesis, we present a novel framework for unsupervised learning called the Principle of Relevant Entropy. Using concepts from information theoretic learning, we formulate this problem as a weighted combination of two terms: an information preservation term and a redundancy reduction term. This information theoretic formulation is a function of only two parameters. The user-defined weighting parameter controls the task (and hence the type of structure) to be achieved, whereas the inherent hidden scale parameter controls the resolution of our analysis. By including such a user-defined parameter, we allow the learning machine to influence the level of abstraction to be extracted from the data. The result is the unraveling of 'goal oriented structures' unique to the given dataset.

Using information theoretic measures based on Renyi's definition, we estimate these quantities directly from the data. One can derive a fixed-point update scheme to optimize this cost function, thus avoiding Gaussian approximations altogether. This leads to interaction of data samples, giving us a new self-organizing paradigm. An added benefit is the absence of nuisance parameters (like step size) common in other optimization approaches. The strength of this new formulation can be judged from the fact that the existing mean shift algorithms appear as special cases of this cost function. By going beyond the second order statistics and dealing directly with the probability density function, we effectively extract maximal information to reveal the underlying structure in the data. By bringing clustering, principal curves and vector quantization under one umbrella, this powerful principle truly discovers the underlying mechanism of unsupervised learning.

PAGE 14

Langley 1996 ].Itisaconuenceofnumerouseldsrangingfromcognitivescienceandneurobiologytoengineering,statisticsandoptimizationtheorytonameafew.Itsgoalinvolvesmodelingthemechanismunderlyinghumanlearningandtoconstantlycorroborateitthroughapplicationstorealworldproblemsusingempiricalandmathematicalapproaches.Thisbranchofsciencecanthusbeconsideredastheeldofresearchdevotedtoformalstudyoflearningsystems[ Ghahramani 2003 ]. Humanmindhasbeenthesubjectofinteresttomanygreatphilosophers.Morethan2000yearsback,GreekphilosopherslikePlatoandAristotlediscussedtheconceptofuniversal[ Watanabe 1985 ].Theuniversalsimplymeansaconceptorageneralnotion.Theyarguedthatthewaytonduniversalistond\forms"or\patterns"fromexemplars.Duringthe16thcenturyFrancisBaconfurtheredinductivereasoningascentraltounderstandingintelligence[ Forsyth 1989 ].Butthescienticdispositiontothesephilosophicalideaswasnotlaiduntil1748when Hume [ 1999 ]discoveredtheruleofinduction.Meanwhile,deductiveandformalreasoninggainedincreasingpopularity,especiallyamonglogicians.Workingwithsymbolsandheuristicrulebasedalgorithms,therstknowledgebasedsystemswerebuilt.ThisleadtothebirthofArticialIntelligence(AI)inmid-1950'swhichcanbeconsideredprimarilyasascienceofdeduction.Theideabehindthesedomainspecicexpertsystemswastoexploreroleandorganizationofknowledgeinmemory.Borrowingheavilyfromcognitivescienceliterature,researchonknowledgerepresentation,naturallanguageandexpertsystemsdominatedthisera. 14

PAGE 15

Thegapbetweenthesetwoeldscontinuedtoincreaseduetodierencesinthefundamentalnotionofintelligence.Itwasnotuntilthelate1970's,thatarenewedinterestemergedwhichspawnedtheeldofMachineLearning.Thiswasnecessitatedbytheneedtodoawaywithoverwhelmingnumberofdomainspecicexpertsystemsandaurgetounderstandbettertheunderlyingprinciplesoflearningwhichgovernthesesystems.Manynewmethodswereproposedwithemphasisintheareaofplanning,diagnosisandcontrol.Workonrealworldproblemsprovidedstrongcluestothesuccessofthesemethods.Theincrementalnatureofthiseldhelpedittogrowtremendouslyforthepast30yearswithgreatertendencytobuildonprevioussystems.Thiseldtodayoccupiesacentralpositioninourquesttoanswermanyunsolvedmysteriesofmanandmachinealike.BorrowingideasbothfromAIandpatternrecognition,thiseldofsciencehassoughttounravelthecentralroleoflearninginintelligence. Learningcanbebroadlydenedasimprovementinperformanceinsomeenvironmentthroughtheacquisitionofknowledgeresultingfromexperienceinthatenvironment[ Langley 1996 ].Thus,learningneveroccursinisolation.Alearneralwaysndshimselfattachedtoanenvironmentactivelygatheringdataforknowledgeacquisition.Withagoalinmind,thelearnerndspatternsindatarelevanttothetaskathandandtakesappropriate 15

PAGE 16

Theprocessoflearningshowingthefeedbackbetweenalearneranditsassociatedenvironment. actionsresultinginmoredatafromtheenvironment.Thisfeedbackenrichesthelearner'sknowledgethusimprovinghisperformance.Figure 1-1 depictsthisrelationship. Therearemanyaspectsofenvironmentthateectslearning,butthemostimportantaspectthatinuenceslearningisthedegreeofsupervisioninvolvedinperformingthetask.Broadly,threecasesemerge.Intherstscenariocalledsupervisedlearning,themachineisprovidedwithasequenceofdatasamplesalongwiththecorrespondingdesiredoutputsandthegoalistolearnthemappingsoastoproducecorrectoutputforagivennewtestsample.Anotherscenarioiscalledreinforcementlearningwherethemachinedoesnothaveanydesiredoutputvaluestolearnthemappingdirectly,butgetsscalarrewards(orpunishments)foreveryactionittakes.Thegoalofthemachinethen,istolearnandperformactionssoastomaximizeitsrewards(orinturnreduceitspunishments).Thenalandthemostdicultscenarioiscalledunsupervisedlearningwherethemachineneitherhasoutputsnorrewardsandhenceisleftonlywiththedata. Thisthesisisaboutunsupervisedlearning.Atrst,itmayseemunclearastowhatthemachinecanpossiblydojustwiththedataathand.Nevertheless,there 16

PAGE 17

Barlow [ 1989 ], statisticalredundancycontainsinformationaboutthepatternsandregularitiesofsensorystimuli.Structureandinformationexistinexternalstimulusanditisthetaskofperceptualsystemtodiscoverthisunderlyingstructure. Inordertodoso,weneedtoextractasmuchinformationaspossiblefromthedata.Itisherethatunsupervisedlearningreliesheavilyonstatisticsandinformationtheorytodynamicallyextractinformationandcapturetheunderlyingpatternsandregularitiesinthedata. Dudaetal. 2000 ].SomeotherexamplesincludeGaussianmixturemodels(GMM)[ MacKay 2003 ]andrecentkernelbasedspectralclusteringmethods[ ShiandMalik 2000 ; Ngetal. 2002 ]whichhavebecomequitepopularinvisioncommunity.Clusteringisusedextensivelyintheareasofdatamining,imagesegmentationandbioinformatics. Thesecondaspectisthenotionofprincipalcurves.Thiscanbeseenasanon-linearcounterpartofprincipalcomponentanalysis(PCA)wherethegoalistondalowerdimensionalembeddingappropriatetotheintrinsicdimensionalityofthedata.Thistopicwasrstaddressedby HastieandStuetzle [ 1989 ]whodeneditbroadlyas\self-consistent"smoothcurveswhichpassthroughthe\middle"ofad-dimensionalprobabilitydistributionordatacloud.Usingregularizationtoavoidovertting,Hastie's 17

PAGE 18

Tibshirani [ 1992 ]andtherecentpopularworkof Kegl [ 1999 ].Duetotheirabilitytondintrinsicdimensionalityofthedatamanifold,principalcurveshaveimmenseapplicationsindimensionalreduction,manifoldlearninganddenoising. Vectorquantization,thethirdcomponentofunsupervisedlearning,isacompressiontechniquewhichdealswithecientrepresentationofthedatawithfewcodevectorsbyminimizingadistortionmeasure[ GershoandGray 1991 ].AclassicexampleistheLinde-Buzo-Gray(LBG)[ Lindeetal. 1980 ]techniquewhichminimizesthemeansquareerrorbetweenthecodevectorsandthedata.AnotherpopularandwidelyusedtechniqueisKohonen'sselforganizingmaps(SOM)[ Kohonen 1982 ; Haykin 1999 ]whichnotonlyrepresentsthedataecientlybutalsopreservesthetopologybyaprocessofcompetitionandcooperationbetweenneurons.Speechcompression,videocodingandcommunicationsaresomeoftheeldswherevectorquantizationisusedextensively. Untilnow,acommonframeworkforunsupervisedlearninghadbeenmissing.Thethreedierentaspectsnamelyclustering,principalcurvesandvectorquantizationhavebeenstudiedseparately.Thisisquiteunderstandablefromthebreadthoftheeldandtheuniquenicheofapplicationsthesefacetsserve.Nevertheless,onecouldseethemunderthesamepurviewusingideasfromstatisticsandinformationtheory.Forexample,itiswellknownthatprobabilitydensityfunction(pdf)subsumesalltheinformationcontainedinthedata.Clustersofdatasampleswouldcorrespondtohighdensityregionsofthedatahighlightedbythepeaksinthedistribution.Theunderlyingmanifoldisnaturallyrevealedbytheridgelinesofthepdfestimateconnectingtheseclusters,whilevectorquantizationcanbeseenaspreservingdirectlythepdfinformationofthedata.Conceptsfrominformationtheorywouldhelpusquantifythesepdffeaturesthroughscalardescriptorslikeentropyanddivergence.Bydoingso,wenotonlygaintheadvantageofgoingbeyondthesecondorderstatisticsandcapturingalltheinformationinthedata,but 18

PAGE 19

Bishop 1995 ; MacKay 2003 ].Manyofthesemodelsemploytheobjectiveofminimumdescriptionlength.Minimizingthedescriptionlengthofthemodelforcesthenetworktolearneconomicrepresentationsthatcapturetheunderlyingregularitiesinthedata.Thedownsideofthisapproachistheimpositionofexternalmodelsonthedatawhichmayarticiallyconstraindatadescription.Ithasbeenobservedthatwhenthemodelassumptioniswrong,thesemethodsperformpoorlyleadingtowronginferences. Modernapproacheshaveproposedcomputationalarchitecturesinspiredbytheprincipalofselforganizationofbiologicalsystems.Mostnotablecontributioninthisregardwasmadeby Linsker [ 1988a b ]withtheintroductionof\Infomax"principle.Accordingtothisprinciple,inordertondrelevantinformationoneneedstomaximizetheinformationrateRofanetworkconstrainedbyitsarchitecture.ThisrateRisdenedasthemutualinformationbetweennoiselessinputXandtheoutputY,thatisR=I(X;Y)=H(Y)H(YjX): 19

PAGE 20

Oja 1982 1989 ]couldbeseenintermofinformationmaximizationleadingtominimumreconstructionerror. Adierentwaytoexplainthefunctioningofsensoryinformationprocessingistheprincipleofminimumredundancyproposedby Barlow [ 1989 ].Sincetheexternalstimulienteringourperceptualsystemarehighlyredundant,itwouldbedesirabletoextractindependentfeatureswhichwouldfacilitatesubsequentlearning.ThelearningofaneventEthenrequirestheknowledgeofonlytheMconditionalprobabilitiesofEgiveneachfeatureyi,insteadoftheconditionalprobabilitiesofEgivenallpossiblecombinationofNsensorystimuli.Inshort,aN-dimensionalproblemisreducedtoMone-dimensionalproblems.AnaturalconsequenceofthisideaistheIndependentComponentAnalysis(ICA)[ JuttenandHerault 1991 ; Comon 1994 ; HyvarinenandOja 2000 ].Unfortunately,theShannonentropydenitionmakestheseinformationtheoreticquantitiesdiculttocompute.FortheGaussianassumption,onecouldinsteadreducethecorrelationamongtheoutputs,butthiswouldonlyremovesecondorderstatisticsandnotaddresshigherorderdependencies. Pioneeringworkintheareaofunsupervisedlearninghasbeendoneby Becker [ 1991 ].Byframingtheproblemasbuildingglobalobjectivefunctionsunderoptimizationframework, Becker [ 1998 ]hasshownthatmanycurrentmethodscanbeseenasenergyminimizingfunctions,thuselucidatingthesimilarityamongdierentaspectsoflearningtheory.Inanattempttogobeyondpreprocessingstagesofsensorysystems, BeckerandHinton [ 1992 ]proposedtheImaxprincipleforunsupervisedlearningwhichdictatesthatsignalsofinterestshouldhavehighmutualinformationacrossdierentsensory 20

PAGE 21

BeckerandHinton 1992 ]. Theseinformationtheoreticcodingprincipleshavesignicantlyadvancedourunderstandingofunsupervisedlearning.Theprocessofinformationpreservationandredundancyreductionstrategieshaveprovidedimportantcluesabouttheinitialpreprocessingstagesrequiredtoextractrelevantfeaturesforlearning.Thereexistthough,twomajorbottlenecksinfullyrealizingthepotentialityofthesetheories.Therstconstraintisthewaytheobjectivefunctionisframed.Mostofthepreviousapproacheseitherpreservemaximalinformationoraimatredundancyreductionstrategy.Someauthorshaveeventriedtouseonetoachievetheother,thusconcentratingonthespecialcaseofequivalence.Forexample, BellandSejnowski [ 1995 ]used\Infomax"toachieveICAusinganonlinearnetworkunderthespecialcasewhenthenonlinearitymatchesthecumulativedensityfunction(cdf)oftheindependentcomponents.Indoingso,alotofrichintermediarystructuresarelost. Thesecondconstraintistheinformationtheoreticnatureofthecostfunctionwhichmakesitdiculttoderiveselforganizingrules.Allthesemodelssharethegeneralfeatureofmodelingthebrainasacommunicationchannelandapplyingconceptsfrominformationtheory.SinceShannondenitionofinformationtheoreticmeasuresarehardtocomputeinclosedform,Gaussianapproximationgenerallyfollowsgivingruleswhichfailtoreallycapturethehigherorderfeaturesthatareofinterest.Arecentexceptionistheinformationbottleneck(IB)methodproposedby Tishbyetal. [ 1999 ].InsteadoftryingtoencodealltheinformationinarandomvariableS,theauthorsonlyquantifytherelevantormeaningfulinformationbyintroducinganothervariableYwhichisdependentonS,so 21

PAGE 22

Forsyth [ 1989 ]putsit,learningis99%patternrecognitionand1%reasoning.Ifwethereforewishtodevelopauniedframeworkforunsupervisedlearning,weneedtoaddressthiscentralthemeofcapturingtheunderlyingstructureinthedata.Toputsimply,complexpatterncouldbedescribedhierarchically(andquiteoftenrecursively)intermsofsimplerstructures[ Pavlidis 1977 ].Thisisafundamentallynewandpromisingapproachforittriestounderstandthedevelopmentalprocessthatgeneratedtheobserveddataitself. Putsuccinctly,structuresareregularitiesinthedatawithinterdependenciesbetweensubsystems[ Watanabe 1985 ].Thisimmediatelytellsusthatwhitenoiseisnotastructuresinceknowledgeofthepartialsystemdoesnotgiveusanyideaofthewholesystem.Inshort,structureshaverigiditywithinterdependenciesandarenotamorphous.Sinceentropyoftheoverallsystemwithinterdependentsubsystemsissmallerthanthesumoftheentropiesofpartialsubsystems,structurescanbeseenasentropyminimizingentities.Bydoingso,theyreduceredundancyintherepresentationofthedata.Thus,formulationofstructurescanbecastaccordingtotheprincipleofminimumentropy.Butmereminimizationofentropywouldnotallowthesestructurestocaptureanyrelevantinformationofthedataleadingtotrivialsolutions.Thisformulationthereforeneedstobemodiedasminimizationofentropywiththeconstrainttoquantifycertainlevelofinformationaboutthedata.Byframingthisproblemasaweightedcombinationofaredundancyreductiontermandaninformationpreservationterm,weproposeanovel 22

PAGE 23

Alearningmachineisalwaysassociatedwithagoal[ Lakshmivarahan 1981 ].Inunsupervisedscenarioswherethemachineonlyhasthedata,goalsareespeciallycrucialsincetheyhelpdierentiaterelevantfromirrelevantinformationinthedata.Byseeking\goalorientedstructures"then,themachinecaneectivelyquantifytherelevantinformationinthedataascompactrepresentationsneededasarststepforsuccessfulandeectivelearning.Thisisanothercontributionofourthesis.Noticethatgoalscanexistatmanylevelsoflearning.Inthisthesis,weaddressthelowerlevelvisiontaskslikeclustering(fordierentiatingobjects),principalcurves(fordenoising)andvectorquantization(forcompression)whichhelpindecodingpatternsfromthedatarelevanttothegoalathand.Bydoingso,onecanextractaninternallyderivedteachingsignal,thusconvertinganunsupervisedproblemtosupervisedframework. Akeyaspectneededforthesuccessofthismethodologyistocomputetheinformationtheoreticmeasuresoftheglobalcostfunctiondirectlyfromthedata.ThoughShannondenitionispopularlyused,fewofthemhavenon-parametricestimatorswhichareeasytocomputeleadingtosimplicationsusingGaussianassumption.Inthisthesis,wedoawaywiththeShannondenitionofinformationtheoreticmeasuresanduseinsteadideasfromInformationTheoreticLearning(ITL)basedonRenyi'sentropy[ Prncipeetal. 2000 ; Erdogmus 2002 ].Thisisthethirdcontributionofthisthesis.UsinginformationtheoreticprinciplesbasedonRenyi'sdenition[ Renyi 1961 1976 ],weestimatedierentinformationtheoreticmeasuresdirectlyfromthedatanon-parametrically.Thesescalardescriptorsgobeyondthesecondorderstatisticstorevealhigherorderfeaturesinthedata. Interestingly,onecanderivexedpointupdateruleforself-organizationofthedatasamplestooptimizethecostfunction,thusavoidingGaussianapproximationsaltogether.Borrowingananalogyfromphysics,inthisadaptivecontext,onecanviewthedata 23

PAGE 24

Tosummarize,thisthesisproposesanumberofkeycontributionswhichfurthersthetheoryofunsupervisedlearning; ThisnewunderstandingofunsupervisedlearningcanbesummarizedwithablockdiagramasshowninFigure 1-2 .Theimportantideabehindthisschematicisthegoalorientednatureofunsupervisedlearning.Inthisframework,theneuro-mappercanbeseenasworkingonphysicsbasedprinciples.Usingananalogywithbiologicalsystems,datafromtheexternalenvironmententeringoursensoryinputsareprocessedtoextractimportantfeaturesforlearning.Basedonthegoalofthelearner,aselforganizingruleisinitiatedintheneuro-mappertondappropriatestructuresintheextractedfeatures.Thoughasmallportion,animportantstepiscriticalreasoning,whichhelpsonetojudgehowusefulthesepatternsareinachievingthegoal.Thisresultsinadualcorrection;rsttotakeappropriateactionsintheenvironmentsoastoactivelygenerateusefuldataandsecond,tochangethefunctioningoftheneuro-mapperitself.Thiscyclicalloopisessentialforeectivelearningandtoadaptinachangingenvironment.Anoptionalcorrectionmaybeusedtoupdatethegoalitselfwhichwouldchangethedirectionoflearning.Whatis 24

PAGE 25

Anewframeworkforunsupervisedlearning. interestinginthisapproachisthatthelearner'sblockisconvertedintoaclassicsupervisedlearningframework.Thisblockrepresentsanopensystemconstantlyinteractingwiththeenvironment.Togetherwiththeenvironment,theoverallschemeisaclosedsystemwheretheprinciplesofconservationisapplicable[see Watanabe 1985 ,chap.6]. Anaturaloutcomeofthisthesisisasimpleyetelegantframeworkbringingdierentfacetsofunsupervisedlearningnamelyclustering,principalcurvesandvectorquantizationunderoneumbrella,thusprovidinganewoutlooktothisexcitingeldofscience.Intheend,wewouldliketopresentaquotefrom Ghahramani [ 2003 ],whichbestsummarizesthiseldandisaconstantsourceofmotivationthroughoutthiswork; Inasense,unsupervisedlearningcanbethoughtofasndingpatternsinthedataaboveandbeyondwhatwouldbeconsideredpureunstructurednoise. 25

PAGE 26

Parzen 1962 ]asshownbelow. whereKisakernelsatisfyingthefollowingconditions: ThekernelsizehisactuallyafunctionofthenumberofsamplesN;representedash(N).AnalysisoftheestimatorpX(x)alongwiththefourpropertiesabovegivesthefollowingresults, WeparticularlyconsidertheGaussiankernelgivenbyG(x)=1 22forouranalysis.Theadvantageofthiskernelselectionistwo-folded.First,itisasmooth,continuousandinnitelydierentiablekernelrequiredforourpurposes(shownlater).Second,theGaussiankernelhasaspecialpropertythatconvolutionoftwoGaussiansgives 26

2.1.1 Renyi's Quadratic Entropy

Renyi [1961, 1976] proposed a family of entropy measures characterized by order α. Renyi's order-α entropy is defined as

H_α(X) = (1 / (1 − α)) log ∫ f^α(x) dx.   (2-2)

For the special case of α → 1, Renyi's α-entropy tends to the Shannon definition, that is,

lim_{α→1} H_α(X) = −∫ f(x) log f(x) dx = H_S(X).

Of particular interest in this work is α = 2, which gives Renyi's quadratic entropy

H_2(X) = −log ∫ f²(x) dx.   (2-3)

Using the Parzen estimate p_X(x) = (1/N) Σ_{i=1}^N G_σ(x − x_i) in 2-3, we can obtain a non-parametric estimator.
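To make the estimator concrete, here is a minimal Python/NumPy sketch of a Gaussian Parzen window density estimate. The kernel size, toy data and function names are assumptions made only for this illustration; they are not taken from the thesis.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Isotropic Gaussian kernel G_sigma(u) for d-dimensional vectors u."""
    d = u.shape[-1]
    norm = (2.0 * np.pi * sigma**2) ** (d / 2.0)
    return np.exp(-np.sum(u**2, axis=-1) / (2.0 * sigma**2)) / norm

def parzen_density(x, samples, sigma):
    """Parzen window estimate p_X(x) = (1/N) sum_i G_sigma(x - x_i)."""
    diffs = x[:, None, :] - samples[None, :, :]        # shape (n_eval, N, d)
    return gaussian_kernel(diffs, sigma).mean(axis=1)  # average over the N samples

# Assumed toy data for illustration
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 2))            # N = 200 samples in 2-D
grid = rng.uniform(-3, 3, size=(5, 2))   # a few evaluation points
print(parzen_density(grid, S, sigma=0.3))
```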

H_2(X) = −log( (1/N²) Σ_{i=1}^N Σ_{j=1}^N G_σ(x_i − x_j) ),

where σ² = 2σ_X². Notice the argument of the Gaussian kernel, which considers all possible pairs of samples. The idea of regarding the samples as information particles was first introduced by Principe and collaborators [Principe et al. 2000; Erdogmus 2002] upon realizing that these samples interact with each other through laws that resemble the potential field and their associated forces in physics.

Since the log is a monotonic function, any optimization based on H_2(X) can be translated into optimization of the argument of the log, which we denote by V(X), and call it the information potential of the samples. We can consider this quantity as an average of the contributions from each particle x_i as shown below. Note that V(x_i) is the potential field over the space of the samples, with an interaction law given by the kernel shape:

V(X) = (1/N) Σ_{i=1}^N V(x_i),   V(x_i) = (1/N) Σ_{j=1}^N G_σ(x_i − x_j).

Differentiating the potential field with respect to the particle position gives the information force acting on that sample,

F(x_i) = ∂V(x_i)/∂x_i = (1/(Nσ²)) Σ_{j=1}^N G_σ(x_i − x_j)(x_j − x_i).   (2-5)
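The information potential and information force translate directly into a few lines of code. The sketch below follows the formulas above under the same assumptions (isotropic Gaussian kernel, assumed toy data and kernel size); it is an illustration, not the thesis's implementation.

```python
import numpy as np

def pairwise_gaussian(A, B, sigma):
    """Matrix of G_sigma(a_i - b_j) for all pairs of rows of A and B."""
    d = A.shape[1]
    sq = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)

def information_potential(X, sigma):
    """V(X) = (1/N^2) sum_i sum_j G_sigma(x_i - x_j); then H_2(X) = -log V(X)."""
    return pairwise_gaussian(X, X, sigma).mean()

def information_force(X, sigma):
    """F(x_i) = (1/(N sigma^2)) sum_j G_sigma(x_i - x_j) (x_j - x_i), for every sample."""
    K = pairwise_gaussian(X, X, sigma)        # (N, N) kernel interactions
    diff = X[None, :, :] - X[:, None, :]      # diff[i, j] = x_j - x_i
    return (K[:, :, None] * diff).sum(axis=1) / (X.shape[0] * sigma**2)

# Assumed toy data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
V = information_potential(X, sigma=0.5)
print("V(X) =", V, " H2(X) =", -np.log(V))
print("force on first sample:", information_force(X, sigma=0.5)[0])
```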

PAGE 29

2-1 .Beinginherentlyattractive,thedirectionoftheinter-forceeldoneachsamplealwayspointsinwards(towardsothersamples)asshowninthisplot. Figure2-1. Informationforcewithinadataset. Itcouldbeseenasameasureofuncertaintywhenoneapproximatesthe\true"distributionf(x)withanotherdistributiong(x).Thus,DKL(fjjg)canalsoberewrittenas 29

PAGE 30

Onthecontrary,inthecaseofRenyi'sdenition,thereexistsmorethanonewayofrepresentingthedivergencemeasure.TwosuchdenitionsweregivenbyRenyihimself[ Renyi 1976 ].Themorepopularoneiswrittenas Forspecialcaseof!1,lim!1DRKL1(fjjg)=DKL(fjjg): Lutwaketal. [ 2005 ].Therelativeentropycanbewrittenasfollows, 1(Rg)1 Notethatthedenominatorintheargumentofthelogcontainsanintegralevaluation.Sof(x)couldbezeroatsomepointsbutoveralltheintegraliswelldened,thusavoidingnumericalissuesofthepreviousdenition.Again,for!1,thisgivesDKL(fjjg).Inparticular,for=2,onecanrewrite 2{9 as WecallthistheCauchy-SchwartzdivergencesincethesamecouldalsobederivedusingtheCauchy-Schwartzinequality[ Rudin 1976 ].Notethattheargumentisalways2(0;1] 30

PAGE 31

Dcs(fjjg)0forallfandg. Dcs(fjjg)=0if=g. Dcs(fjjg)issymmetrici.e.Dcs(fjjg)=Dcs(gjjf). 2{7 ,onecanderiveaconnectionbetweenCauchy-SchwartzdivergenceandRenyi'squadraticentropy.Rearranging 2{10 ,weget 2logZf2(x)dx+1 2logZg2(x)dx=H(X;Y)1 2H(X)1 2H(Y):(2{11) ThetermH(X;Y)canbeconsideredasRenyi'scrossentropy.TheargumentofthelogcanalsobeseenasEf(x)[g(x)].OnecanalsoderivethisfollowingtheactualderivationofRenyi'sentropy[ Renyi 1976 ].Weprovethisforthediscretecaseinthefollowingtheorem. 1logXkpkq(1)k:

PAGE 33

1logNXk=1pkq1k: where2=2X+2Y.TheargumentoftheGaussiankernelnowshowsinteractionofthesamplesfromonedatasetwithsamplesofanotherdataset.OnecanconsidertheoverallargumentofthelogasthecrossinformationpotentialV(X;Y)whichcanberepresentedasanaverageoverindividualpotentialsV(xi;Y).V(X;Y)=1

PAGE 34

@xiV(xi;Y)=1 Similarly,onecanderivethecrossinformationforceactingonsampleyjduetoallsamplesfxigNi=1ofthesetXbyexpressingV(X;Y)asanaverageoverV(yj;X)anddierentiatingwithrespecttoyjasshownbelowforcompleteness. @yjV(yj;X)=1 NotethatF(xijyj)andF(yjjxi)havethesamemagnitudebutoppositedirectionsgivingusasymmetricgraphmodel.ThisconceptiswelldepictedinFigure 2-2 whereonecanseetheintra-forceeldoperatingbetweensamplesofthetwodatasets. 2{11 ,onecanthengetanon-parametricestimatorforCauchy-Schwartzdivergencemeasureaswell.Thus, 2H(X)1 2H(Y):(2{15) Anotheraspectwhichrequiresparticularattentionistheconceptofinformationforces.Figures 2-1 and 2-2 urgesonetoaskthequestionastowhatwouldhappentothesamplesiftheywereallowedtofreelymoveundertheinuenceoftheseforces.Thisquestionledtotheculminationofthisthesis.Initialeortsofmovingthesamplesusing 34

PAGE 35

Crossinformationforcebetweentwodatasets. gradientmethodshowedpromisingresults,butweremiredwithproblemsofstep-sizeselectionandstabilityissues.Wesoonrealizedthatforsuccessofthisconcept,twothingswereessential; Thesubsequentchapterswouldelaborateonhowwedevelopedanovelframeworkforunsupervisedlearningtakingthesetwoaspectsintoconsideration. Theseideaslieattheheartofinformationtheoreticlearning(ITL)[ Prncipeetal. 2000 ].Byplayingdirectlywiththepdfofthedata,ITLeectivelygoesbeyondthesecondorderstatistics.Theresultisacostfunctionthatdirectlymanipulatesinformationgivingrisetopowerfultechniqueswithnumerousapplicationsintheareaofadaptivesystems[ Erdogmus 2002 ]andmachinelearning[ Jenssen 2005 ; Raoetal. 2006a ]. 35

CHAPTER 3
A NEW UNSUPERVISED LEARNING PRINCIPLE

3.1 Principle of Relevant Entropy

In the unsupervised learning context, on the other hand, the learner is only presented with the data. What is then the best strategy to learn? The answer lies in discovering structures in the data. As [Watanabe 1985] puts it, these are rigid entities with interdependent subsystems that capture the regularities in the data. Since these regularities or patterns are subsets of the given data, their entropy is smaller than that of the entire sample set. The first step therefore in discovering these patterns is to minimize the entropy of the data. This in turn would reduce redundancy. But mere minimization of entropy would give us a trivial solution of the data collapsing to a single point, which is not useful. To discover structures, one needs to model them as compact representations of the data. This connection between structures and their corresponding data can be established by introducing a distortion measure. Depending on the extent of the distortion then, one is bound to find a range of structures, which can be seen as an information extraction process. Since minimizing the entropy with this constraint adds relevance to the resulting structure, we call this the Principle of Relevant Entropy.

Given the original dataset S = {s_j}, j = 1, …, M, we search for structure over a set of samples X = {x_i}, i = 1, …, N, initialized as X = S, through the cost function

J(X) = min_X H(X) + β D_cs(p_X || p_S),   (3-1)

where β ∈ [0, ∞). The first quantity H(X) can be seen as a redundancy reduction term and the second quantity D_cs(p_X || p_S) as an information preservation term. At the first iteration, D_cs(p_X || p_S) = 0 due to the initialization X = S. This would reduce the entropy of X irrespective of the value of β selected, thus making D_cs(p_X || p_S) positive. This is a key factor in preventing X from diverging. From the second iteration onwards, depending on the β value chosen, one would iteratively move the samples of X to capture different regularities in S. Interestingly, there exists a hierarchical trend among these regularities resulting from the continuous nature of the distortion measure, which varies over [0, ∞). From this point of view, one can see these regularities as quantifying different levels of structure in the data, thus giving us a composite framework.

To understand this better, we investigate the behavior of the cost function for some special values of β. Before we do this, let us first simplify the cost function. We can substitute the expansion of the D_cs(p_X || p_S) term as given in 2-15 in 3-1. To avoid working with fractional values, we use a more convenient definition of the Cauchy-Schwartz divergence given by

D_cs(p_X || p_S) = −log [ (∫ p_X(u) p_S(u) du)² / (∫ p_X²(u) du · ∫ p_S²(u) du) ].

This is simply a rescaled version of the divergence used in 3-1. In any case, this is not going to alter any results, but would only change the values of β at which we attain those results. Having done this, the cost function now looks like

J(X) = min_X (1 − β) H(X) + 2β H(X; S).   (3-2)

Note that we have dropped the extra term H(S) since this is constant with respect to X. With this simplified form of the cost function, we will now investigate the effect of the weighting parameter β.

3.1.1 Case 1: β = 0

For this special case, the cost function in equation 3-2 reduces to

J(X) = min_X H(X).

Compared to 3-1, we see that the information preservation term has been eliminated and the cost function directly minimizes the entropy of the dataset iteratively. As pointed out earlier, this simple case leads to a trivial solution with all the samples converging to a single point.

Since the log is a monotonic function, minimization of entropy is equivalent to maximization of the argument of the log, which is the information potential V(X) of the dataset:

J(X) ≡ max_X V(X) = max_X (1/N²) Σ_{i=1}^N Σ_{j=1}^N G_σ(x_i − x_j).
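The simplified cost 3-2 is easy to estimate directly from samples for any β. The following minimal sketch does this with Gaussian kernels; the data, kernel size and β value are assumptions chosen only for illustration.

```python
import numpy as np

def pairwise_gaussian(A, B, sigma):
    """Matrix of G_sigma(a_i - b_j) for all pairs of rows of A and B."""
    d = A.shape[1]
    sq = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)

def relevant_entropy_cost(X, S, beta, sigma):
    """J(X) = (1 - beta) * H2(X) + 2 * beta * H2(X; S), estimated non-parametrically."""
    H2_x = -np.log(pairwise_gaussian(X, X, sigma).mean())    # Renyi quadratic entropy of X
    H2_xs = -np.log(pairwise_gaussian(X, S, sigma).mean())   # Renyi "cross" entropy of X and S
    return (1.0 - beta) * H2_x + 2.0 * beta * H2_xs

# Assumed toy data; at the initialization X = S the cross term equals H2(S)
rng = np.random.default_rng(3)
S = rng.normal(size=(150, 2))
print(relevant_entropy_cost(S.copy(), S, beta=2.0, sigma=0.3))
```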

Differentiating with respect to x_k and equating to zero,

∂V(X)/∂x_k = 0  ⟹  (2/(N²σ²)) Σ_{j=1}^N G_σ(x_k − x_j)(x_j − x_k) = 0,

which gives the fixed point update

x_k^(τ+1) = Σ_{j=1}^N G_σ(x_k^(τ) − x_j^(τ)) x_j^(τ) / Σ_{j=1}^N G_σ(x_k^(τ) − x_j^(τ)),   (3-3)

where τ is the iteration index. Due to the minimization of Renyi's entropy of the dataset iteratively, this update rule results in the collapse of the entire dataset to a single point, as proved in the following theorem.

Theorem 2. The fixed point iteration 3-3 converges with all the data samples merging to a single point.

Proof. We could rewrite equation 3-3 as

x_k^(τ+1) = Σ_{j=1}^N w_j^(τ)(k) x_j^(τ),   Σ_{j=1}^N w_j^(τ)(k) = 1.   (3-4)

Since each weight w_j^(τ)(k) in 3-4 is strictly positive and the weights sum to one, every updated sample x_k^(τ+1) lies strictly inside the convex hull of the current set X^(τ). From 3-4 then, one could conclude the following:

Hull(X^(0) = S) ⊃ Hull(X^(1)) ⊃ Hull(X^(2)) ⊃ … ⊃ Hull(X^(τ+1)) ⊃ …

This sequence of strictly shrinking convex hulls drives all the samples toward a single point.

3.1.2 Case 2: β = 1

For β = 1, the cost function directly minimizes Renyi's cross entropy between X and the original dataset S. Again, using the monotonic property of the log, the cost function can be rewritten as

J(X) = min_X H(X; S) ≡ max_X V(X; S) = max_X (1/(NM)) Σ_{i=1}^N Σ_{j=1}^M G_σ(x_i − s_j).

Differentiating with respect to x_k and equating to zero gives the fixed point update

x_k^(τ+1) = Σ_{j=1}^M G_σ(x_k^(τ) − s_j) s_j / Σ_{j=1}^M G_σ(x_k^(τ) − s_j).   (3-5)

Notice the lack of any superscript on s_j ∈ S since it is kept constant throughout the process of adaptation of X. Unlike 3-3 when β = 0 then, in this case the samples of X^(τ) are constantly compared to this fixed dataset. This results in the samples of X converging to the modes (peaks) of the pdf p_S(u), thus revealing an important structure of S. We prove this in the following theorem.

Theorem 3. The modes of p_S(u) are fixed points of the iteration 3-5.

Proof. We can rewrite 3-5 as

Σ_{j=1}^M G_σ(x_k − s_j) x_k = Σ_{j=1}^M G_σ(x_k − s_j) s_j
⟹ Σ_{j=1}^M G_σ(x_k − s_j) {s_j − x_k} = 0,

which is proportional to ∂p_S(x)/∂x evaluated at x_k. Since modes are stationary solutions of ∂p_S(x)/∂x_k = 0, they are the fixed points of 3-5.

We can also rewrite equation 3-5 as

x_k^(τ+1) = Σ_{j=1}^M w_j^(τ)(k) s_j,   Σ_{j=1}^M w_j^(τ)(k) = 1.

w_j^(τ)(k) are the weights evaluated with respect to the original data S, which is kept constant throughout the process. Since each w_j^(τ)(k) is strictly in (0, 1), each iteration sample x_k^(τ+1) is mapped strictly inside the convex hull of S (and not X^(τ) as in the case of β = 0).

3.1.3 Case 3: β → ∞

To treat the general case, we can rearrange 3-2 as

J(X) = min_X (1 − β) H(X) + 2β H(X; S)
     = min_X −(1 − β) log V(X) − 2β log V(X; S).

Differentiating with respect to x_k and equating to zero gives

(1 − β)/V(X) · F(x_k) + 2β/V(X; S) · F(x_k; S) = 0,   (3-6)

where F(x_k) and F(x_k; S) are given by equations 2-5 and 2-13. Equation 3-6 shows the net force acting on sample x_k ∈ X. This force is a weighted combination of two forces: the information force within the dataset X and the cross information force between X and S.
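The force balance 3-6 suggests a sample-by-sample fixed point iteration. The sketch below implements one possible rearrangement of 3-6; the thesis notes that several rearrangements exist and selects the one it found stable, so this particular grouping of constants, the kernel size and the toy data are assumptions of the example, not the exact rule 3-7. Setting beta = 0 recovers the blurring mean shift (GBMS) step and beta = 1 the Gaussian mean shift (GMS) step toward the modes of S.

```python
import numpy as np

def pairwise_gaussian(A, B, sigma):
    """G_sigma(a_i - b_j) for all pairs of rows of A and B."""
    d = A.shape[1]
    sq = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2) ** (d / 2.0)

def relevant_entropy_update(X, S, beta, sigma):
    """One fixed-point update of X under J(X) = (1 - beta) H2(X) + 2 beta H2(X; S).

    Derived from the weighted balance of the information force and the cross
    information force; for beta > 1 the self term gets a negative (repulsive) weight.
    """
    N, M = X.shape[0], S.shape[0]
    Kxx = pairwise_gaussian(X, X, sigma)     # within-set kernel, N x N
    Kxs = pairwise_gaussian(X, S, sigma)     # cross kernel, N x M
    V_x = Kxx.mean()                         # information potential V(X)
    V_xs = Kxs.mean()                        # cross information potential V(X; S)
    a = (1.0 - beta) / (V_x * N * N)         # weight on the self (redundancy) term
    b = 2.0 * beta / (V_xs * N * M)          # weight on the cross (preservation) term
    num = a * Kxx @ X + b * Kxs @ S
    den = a * Kxx.sum(axis=1, keepdims=True) + b * Kxs.sum(axis=1, keepdims=True)
    return num / den

# Illustration with assumed toy data: two Gaussian blobs
rng = np.random.default_rng(2)
S = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
X = S.copy()
for _ in range(30):
    X = relevant_entropy_update(X, S, beta=1.0, sigma=0.2)   # beta = 1: move to modes
print(np.unique(np.round(X, 2), axis=0))
```

In this toy run with beta = 1 all samples collapse onto the two cluster modes, mirroring the mode finding behavior established above; with beta = 0 the same routine reproduces the single-point collapse of Theorem 2.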

Rearranging 3{6 wehave, V(X;S)1 V(X;S)1 wherec=V(X;S) N.Noticethattherearethreedierentwaysofrearranging 3{6 togetaxedpointupdaterule,butwefoundonlythexedpointiteration 3{7 tobestableandconsistent. Havingderivedthegeneralform,wearenowreadytodiscusstheasymptoticcaseof!1.Itturnsoutthatforthisspecialcase,oneendsupdirectlyminimizingtheCauchy-SchwartzdivergencemeasureDcs(pXjjpS).Thisisnotobviousbylookingatequation 3{1 .WeprovethisinTheorem 4 .SincewestartwiththeinitializationX=S,aconsequenceofthisextremecaseisthatwegetbackthedataitselfasthesolution.Thiscorrespondstothehighestlevelofstructurewithalltheinformationpreservedatthecostofnoredundancyreduction. 3{1 directlyminimizestheCauchy-SchwartzpdfdivergenceDcs(pXjjpS).

PAGE 44

3{7 )when!1.Considerthecostfunction, Dierentiatingwithrespecttoxk=f1;2;:::;Ng2Xandequivatingtozero,wehave1 Rearrangingtheaboveequationjustaswedidtoderiveequation 3{7 ,weget whereagainc=V(X;S) N.Takingthelimit!1in 3{7 gives 3{9 3{9 Proof. 3{9 weseethatx(+1)k=x()k 3{9 convergesattheveryrstiterationgivingthedataitselfasthesolution. 44

PAGE 45

B Asimpleexampledataset.A)Crescentshapeddata.B)pdfplotfor2=0:05. dierentproportionsofthesetwoterms.Forexample,when1<<3onecouldgetprincipalcurvespassingthroughthedata.Thesecanbeconsideredasnon-linearversionsoftheprincipalcomponents,embeddingthedatainalowerdimensionalspace.Whatisevenmoreinterestingisthatthesecurvesrepresentridgespassingthroughthemodes(peaks)ofthepdfpS(u)andhencesubsumetheinformationofthemodesrevealedfor=1.Thereexiststherefore,ahierarchicalstructuretothisframework.Atoneextremewith=0,wehaveasinglepointwhichishighlyredundantbutpreservesnoinformationofthedatawhileattheotherextremewhen!1,wehavethedataitselfasthesolutionpreservingalltheinformationwehavewithnoredundancyreduction.Inbetweenthesetwoextremes,oneencountersdierentphaseslikemodesandprincipalcurveswhichrevealinterestingpatternsuniquetothegivendataset.Todemonstratethisidea,weuseasimpleexampleofcrescentshapeddatasetasshowninFigure 3-1 .ApplyingourcostfunctionfordierentvaluesofrevealsvariousstructuresrelevanttothedatasetasshowninFigure 3-2 Fromunsupervisedlearningpointofview,thereareveryinterestingaspectsuniquetothisprinciple.Noticethatthecostinequation 3{1 isafunctionofjusttwoparameters;namelytheweightingparameterandtheinherentlyhiddenscaleparameter.The 45

PAGE 46

B=1,Modes C=2,PrincipalCurve D=2:8 E=3:5 F=5:5 G=13:1 H=20 I!1,Thedata AnillustrationofthestructuresrevealedbythePrincipleofRelevantEntropyforthecrescentshapeddatasetfor2=0:05.Asthevaluesofincreaseswepassthroughthemodes,principalcurvesandintheextremecaseof!1wegetbackthedataitselfasthesolution. parametercontrolsthe\task"tobeachieved(likendingthemodesortheprincipalcurvesofthedata)whereascontrolstheresolutionofouranalysis.Byincludinganinternaluserdenedparameter,weallowthelearningmachinetoinuencethelevelofabstractiontobeextractedfromthedata.Tothebestofourknowledge,thisistherstattempttobuildaframeworkwiththeinclusionofgoalsatthedataanalysislevel.Anaturalconsequenceofthisisarichandhierarchicalstructureunfoldingtheintricaciesbetweenalllevelsofinformationinthedata.InChapters4to6,weshowdierentapplicationsofourprinciplelikeclustering,manifoldlearningandcompressionwhichnowfallunderacommonpurview. 46

PAGE 47

3-2 wereobtainedforkernelsizeof2=0:05.Foradierentkernelsize,wegetanewsetofstructuresrelevanttothatparticularscale.Thisrichrepresentationofthedataisanaturalconsequenceofframingourcostfunctionwithinformationtheoreticprinciples.Theotheraspectcorrespondstotheredundancyachieved.ThroughoutthisdiscussionwehavekeptN=M.Toachieveredundancyinthenumberofpointstorepresentthedata,wecanfollowwithapostprocessingstagewheremergedpointsarereplacedwithasinglepoint.ThiswouldnallygiveusNM.Forexample,tocapturethemodesofthedatawecanreplaceMpointswhichhavemergedtotheirrespectivemodeswithasinglepointateachmode.Thespecialcaseofequalityoccurswhen!1.Inthiscase,wesacriceredundancyandstoreallthesamplestopreservemaximalinformationaboutthedata. 47

PAGE 48

Dudaetal. 2000 ; Watanabe 1985 ].Thesetwoquantitiesgenerallyshareaninverserelationship.So,ifwedenoteanarbitrarysimilaritymeasurebySandthecorrespondingdistancemetricbyD,thenSD=constant: Thisapproachispopularsincedistancemetriciswelldenedandquantiable.Bydeningametricforafeaturespace,oneindirectlydenesanappropriatesimilaritymeasurethusbypassingtheillposedproblemofsimilarityitself.Nevertheless,oneneedstopickadistancemetricwhichcanbeseenaspartofaprioriinformation.Forexample,K-meansclusteringtechniqueusesaEuclideandistancemetric.ThisleadstoaprioriassumptionthattheclustershaveasphericalGaussiandistribution.Ifthisisreallytrue,thentheclusteringsolutionobtainedisgood.Butmostofthetimesuchaprioriinformationisnotavailableandisusedasaconveniencetoeasethedicultyoftheproblem.Theimpositionofsuchexternalmodelsonthedataleadstobiasedinformationextraction,thushamperingthelearningprocess.Forexample,mostneuralspikedatawehaveworkedwithdonotsatisfythisassumptionatall,butalotofmedicalwork 48

PAGE 49

Raoetal. 2006b ]. Recenttrendsinclusteringhavetriedtomakeuseofinformationfromhigherordermomentsortodealdirectlywiththeprobabilitydensityfunction(pdf)ofthedata.Onesuchtechniquewhichhasbecomequitepopularlatelyinimageprocessingcommunityarethemeanshiftalgorithms.Theideaisverysimple.Sinceclusterscorrespondtohighdensityregionsinthedataset,agoodpdfestimationwouldresultintheseclustersdemarcatedbypeaks(ormodes)ofthepdf.Wecould,forexample,useParzenwindowtechniquetonon-parametricallyestimatethepdfthusavoidinganyapriorimodelassumption.Usinganymodendingprocedure,wecouldthenmovethesepointsinthedirectionofnormalizeddensitygradient.Bycorrelatingthepointswiththemodestowhichtheyconvergeonecaninessenceachieveaverygoodclusteringsolution.Thexedpointnon-parametriciterativeschememakesthismethodparticularlyattractivewithonlyasingleparameter(kernelsize)toestimate.AnaddedbenetisautomaticinferenceofthenumberofclustersKwhichinmanyotheralgorithmslikeK-meansandspectralclusteringneedstobespeciedapriori. Weshowinthischapter,thatessentiallythesemean-shiftalgorithmsarespecialcasesofournovelprinciplecorrespondingto=1and=0.Theindependentderivationofthesealgorithmsgivesanewperspectivefromoptimizationpointofviewthushighlightingtheirdierencesandingeneraltheirconnectiontoabroaderframework.Beforewedwellintothis,webrieyreviewtheexistingliteratureonmeanshiftalgorithms.

PAGE 50

22isaGaussiankernelwithbandwidth>0.Inordertosolvethemodesofthepdf,wesolvethestationarypointequationrxpS(x)=0forwhichthemodesarethexedpoints.Thisgivesusaniterativeschemeasshownbelow. TheexpressionT(x)isthesamplemeanofallthesamplessiweightedbythekernelcenteredatx.Thus,thetermT(x)xwascoined\meanshift"intheirlandmarkpaperby FukunagaandHostetler [ 1975 ].Inthisprocess,twodierentdatasetsaremaintainednamelyXandS.ThedatasetXwouldbeinitializedtoSasX(0)=S.Ateveryiteration,anewdatasetX(+1)isproducedbycomparingthepresentdatasetX()withS.ThroughoutthisprocessSisxedandkeptconstant.DuetotheusageofGaussiankernel,thisalgorithmwascalledGaussianMeanShift(GMS). Unfortunately,theoriginalpaperby FukunagaandHostetler usedamodiedversionof 4{1 whichisshownbelow. Inthisiterationscheme,onecomparesthesamplesofthedatasetX()withX()itself.StartingwiththeinitializationX(0)=Sandusing 4{2 ,wethensuccessively\blur"thedatasetStoproducedatasetsX(1);X(2):::X(+1).Asthenewdatasetsareproducedweforgetthepreviousonegivingrisetotheblurringprocesswhichisinherentlyunstable.Itwas Cheng [ 1995 ]whorstpointedoutthisandnamedthexedpointupdate 4{2 asGaussianBlurringMeanShift(GBMS). Recentadvancementsinmeanshifthavemadethesealgorithmsquitepopularinimageprocessingliterature.Inparticular,themeanshiftvectorofGMShasbeenshowntoalwayspointinthedirectionofnormalizeddensitygradient[ Cheng 1995 ].SincepointslyinginlowdensityregionhavesmallvalueofpS(x),thenormalizedgradientatthesepointshavelargevalue.Thishelpsthesamplestoquicklymovefromlowdensityregions 50

PAGE 51

ArigorousproofofstabilityandconvergenceofGMSwasgivenin[ ComaniciuandMeer 2002 ]wheretheauthorsprovedthatthesequencegeneratedforeachsamplexkby 4{1 isaCauchysequencethatconvergesduetothemonotonicincreasingsequenceofthepdfsestimatedatthesepoints.Furtherthetrajectoryisalwayssmoothinthesensethattheconsecutiveanglesbetweenmeanshiftvectorsisalwaysbetween M.A.Carreira-Perpi~nan [ 2007 ]alsoshowedthatGMSisanExpectation-Maximization(EM)algorithmandthushasalinearconvergencerate. Duetotheseinterestingandusefulproperties,GMShasbeensuccessfullyappliedinlowlevelvisiontaskslikeimagesegmentationanddiscontinuitypreservingsmoothing[ ComaniciuandMeer 2002 ]aswellasinhighlevelvisiontaskslikeappearancebasedclustering[ RamananandForsyth 2003 ]andreal-timetrackingofnonrigidobjects[ Comaniciuetal. 2000 ]. Carreira-Perpi~nan [ 2000 ]usedmeanshiftformodendinginmixtureofGaussiandistributions.TheconnectiontoNadarayana-WatsonestimatorfromkernelregressionandtherobustM-estimatorsoflocationhasbeenthoroughlyexploredby ComaniciuandMeer [ 2002 ].Withjustasingleparametertocontrolthescaleofanalysis,thissimplenon-parametriciterativeprocedurehasbecomeparticularlyattractiveandsuitableforwiderangeofapplications. ComparedtoGMS,theunderstandingofGBMSalgorithmremainsrelativelypoorsincethisconceptrstappearedin[ FukunagaandHostetler 1975 ].Apartfromthepreliminaryworkdonein[ Cheng 1995 ],theonlyothernotablecontributionwhichweareawareofwasrecentlymadebyCarreira-Perpi~nan.Inhispaper[ Carreira-Perpi~nan 2006 ],theauthorshowedthatGBMShasacubicconvergencerateandtoovercomeitsinstability,developedanewstoppingcriterion.Byremovingtheredundancyamongpoints 51


Though Cheng [1995] provided a comprehensive comparison, the analysis is quite complex. Fashing and Tomasi [2005] showed mean shift as quadratic bound maximization, but the analysis is indirect and its scope limited. On top of this, the derivation of GBMS by replacing the original samples of S with X^(τ) itself is quite heuristic.

The right step to tackle this issue is to pose the question: "What do mean shift algorithms optimize?" Answering this elucidates the cost function and hence the optimization surface itself. Rigorous theoretical analysis is then possible to account for their inherent properties. This, we claim, is our major contribution in this area. Notice that equation 4-1 is the same as 3-5. The mode finding ability of our principle for λ = 1 actually corresponds to the GMS algorithm. Thus, GMS optimizes Renyi's cross entropy between the datasets X and S. From Theorem 3, it is then clear that the iteration procedure 4-1 leads to a stable algorithm. The data samples move in the direction of the normalized gradient, converging to their respective modes.

On the other hand, equation 4-2 corresponds exactly to 3-3. This iteration scheme is the result of directly minimizing Renyi's quadratic entropy of the data X. Since this happens for λ = 0, GBMS is a special case of our general principle. From the optimization point of view then, GBMS with initialization X^(0) = S leads to a rapid collapse of the data to a single point, as proved in Theorem 2. GBMS therefore does not truly possess mode finding ability.


Some authors have suggested GBMS as a fast alternative to the GMS algorithm if it can be stopped appropriately. The inherent assumption here is that the clusters are well separated in the given data. This would lead to two phases in the GBMS optimization. In the first phase, the points rapidly collapse to the modes of their respective clusters, and in the second phase, the modes slowly move toward each other to ultimately yield a single point. By stopping the algorithm after the first phase, one could ideally obtain the clustering solution. Further, since GBMS has an approximately three times faster convergence rate than GMS, this approach can be seen as a fast alternative to GMS. Unfortunately, this assumption of well demarcated clusters is not true in practical applications, especially in areas like image segmentation. Further, since the modes are not the fixed points of GBMS, any stopping criterion is at best heuristic. We demonstrate this rigorously for many problems in the following sections of this chapter.

By clarifying these differences through an optimization route, we bring a new perspective to these algorithms [Rao et al. 2008]. It is important to note that our derivation is completely independent of the earlier efforts. Further, it is only in our case that the patterns extracted from the data can be seen as structures of a broader unsupervised learning framework, with their connections to other facets like vector quantization and manifold learning. By elucidating their respective strengths and weaknesses from an information theoretic point of view, we greatly simplify the understanding of these algorithms.


Since the modes are the fixed points of iteration 4-1, the average distance moved by the samples becomes smaller over subsequent iterations. By setting a tolerance level on this quantity to a low value, we can obtain the modes as well as stop GMS from running unnecessarily. This is summarized in 4-3:

Stop when \frac{1}{N}\sum_{i=1}^{N}\|x_i^{(\tau)} - x_i^{(\tau-1)}\| < tol.   (4-3)

Another version of the stopping criterion, which we found more useful in image segmentation applications, is to stop when the maximum distance moved among all the particles, rather than the average distance, is less than some tolerance level, that is,

Stop when \max_i \|x_i^{(\tau)} - x_i^{(\tau-1)}\| < tol.   (4-4)
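Both displacement-based criteria translate directly into code. The sketch below is illustrative (the tolerance value is an assumption, not a prescription from the thesis) and computes either the average or the maximum displacement per iteration.

```python
import numpy as np

def displacement_stop(X_new, X_old, tol=1e-6, use_max=False):
    # Criterion 4-3 uses the mean displacement; criterion 4-4 uses the maximum,
    # which the text recommends for image segmentation applications.
    moved = np.linalg.norm(X_new - X_old, axis=1)
    stat = moved.max() if use_max else moved.mean()
    return stat < tol
```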

The assumption of two distinct phases was effectively used to formulate a stopping criterion by Carreira-Perpiñán [2006]. In phase 2, d^(τ) = {d^(τ)(x_i)}_{i=1}^{N} takes on at most K different values (for K modes). Binning d^(τ) using a large number of bins gives a histogram with K or fewer nonempty bins. Since entropy does not depend on the exact location of the bins, its value does not change and can be used to stop the algorithm as shown in 4-5, where H_s(d) = -\sum_{i=1}^{B} f_i \log f_i is the Shannon entropy, f_i is the relative frequency of bin i, and the bins span the interval [0, max(d)]. The number of bins B was selected as B = 0.9N.

It is clear that there is no guarantee that we would find all the modes using this rule. Further, the assumption used in developing this criterion does not hold true in many practical scenarios, as will be shown in our experiments.

The dataset in Figure 4-1 consists of a mixture of 16 Gaussians with centers spread uniformly around a circle of unit radius. Each Gaussian density has a spherical covariance of σ_g² I = 0.01 I. To include a more realistic scenario, different a priori probabilities were selected, as shown in Figure 4-1B. Using this mixture model, 1500 iid data points were generated. We selected the scale of analysis σ² = 0.01 such that the estimated modes are very close to the modes of the Gaussian mixture. Note that since the dataset is a


mixture of 16 Gaussians, each with variance σ_g² = 0.01 and spread across the unit circle, the overall variance of the data is much larger than 0.01. Thus, by using a kernel size of σ² = 0.01 for Parzen estimation of the pdf, we ensure that the Parzen kernel size is smaller than the actual kernel size of the data. Figure 4-1C shows the 3D view of this estimated pdf. Note the unequal peaks due to the different proportion of points in each cluster.

Figure 4-1. Ring of 16 Gaussians dataset with different a priori probabilities. The numbering of the clusters is in the anticlockwise direction starting with center (1, 0). A) R16Ga dataset. B) A priori probabilities. C) Probability density function estimated using σ² = 0.01.

Figure 4-2 shows the mode finding ability of the two algorithms. To compare with ground truth we also plot the 2σ_g contour lines and the actual centers of the Gaussian mixture. With the tolerance level in 4-3 set to 10^-6, the GMS algorithm stops at the 46th iteration, giving almost perfect results. On the other hand, using stopping criterion 4-5, GBMS stops at the 20th iteration, having already missed 4 modes (shown with arrows). We would also like to point out that this is the best result achievable by the GBMS algorithm even if we had used stopping criterion 4-3 and selectively hand-picked the best tolerance value.

Figure 4-3 shows the cost functions which these algorithms minimize over a duration of 70 iterations. Notice how the cost function H(X) of GBMS drops continuously as the modes merge. This goes on until H(X) becomes zero, when all the samples have merged into a single point. For GMS, on the other hand, H(X;S) decreases and settles down smoothly to its fixed points, which are the modes of p_S(x). Thus, a more intuitive stopping criterion for GMS, which originates directly from its cost function, is to stop when


the absolute difference between subsequent values of H(X;S) becomes smaller than some tolerance level, as summarized in 4-6. These are some of the unforeseen advantages of knowing exactly what we are optimizing.

Figure 4-2. Modes of the R16Ga dataset found using the GMS and GBMS algorithms. A) Good mode finding ability of the GMS algorithm. B) Poor mode finding ability of the GBMS algorithm.

Another interesting result pops up with this new understanding. Notice that even though GBMS does not minimize Renyi's cross entropy H(X;S) directly, we can always measure this quantity between X^(τ) at every iteration and the original dataset S. If the assumption of two distinct and well separated phases in GBMS holds true, then the samples will quickly collapse to the actual modes of the pdf before they start slowly moving toward each other. Since we start with the initialization X = S, H(X;S) will reach its local minimum at this point before it starts increasing again due to the merging of GBMS modes (which moves them away from the actual modes of the pdf). By stopping GBMS at this minimum, we could devise an effective stopping criterion giving the same result as GMS with fewer iterations.


Figure 4-3. Cost functions of the GMS and GBMS algorithms.

Unfortunately, we found that this works only when the modes (or clusters) are very well separated compared to the kernel size (making the assumption hold true). For example, Figure 4-4 shows H(X;S) computed for GBMS on the R16Ga dataset. The minimum is reached at the 7th iteration. Therefore, this stopping criterion would have prematurely stopped the GBMS algorithm, giving very poor results. It is clear that GBMS is not a good mode finding algorithm.

These results shed new light on our understanding of these two algorithms. Mode finding can be used as a means to cluster data into different groups. We will see next the performance of these algorithms in clustering, where their respective properties greatly affect the outcome of the applications.

Figure 4-5 shows the dataset with true labeling as well as the 2σ_g contour plots. Although different kernel sizes should ideally be used for density estimation of the different clusters, for simplicity and to express our idea clearly, we use a common Parzen kernel


size for pdf estimation. Using Silverman's rule of thumb [Silverman 1986], we estimated the kernel size as σ² = 0.01 (a small sketch of this rule appears below). A cross check with the pdf plot validated the efficacy of this estimate. The plot is shown in Figure 4-5C. Note that all the clusters are well identified for this particular kernel size. By correlating the points with their respective modes, we wish to segment this dataset into meaningful clusters.

Figure 4-4. Renyi's "cross" entropy H(X;S) computed for the GBMS algorithm. This does not work as a good stopping criterion for GBMS in most cases, since the assumption of two distinct phases of convergence does not hold true in general.

Figure 4-5. A clustering problem. A) Random Gaussian clusters (RGC) dataset. B) 2σ_g contour plots. C) Probability density estimated using σ² = 0.01.

With the tolerance level set at 10^-6, the GMS algorithm converges at the 41st iteration. The segmentation result is shown in Figure 4-6A. Clearly, GMS performs very well on this dataset.
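For reference, one common multivariate form of Silverman's rule of thumb is sketched below. This is an illustrative implementation rather than the exact estimator used in the thesis, and, as noted above, the rule implicitly assumes roughly Gaussian data.

```python
import numpy as np

def silverman_kernel_variance(X):
    # X: N x d data matrix. Returns sigma^2 for an isotropic Gaussian Parzen kernel,
    # using the rule-of-thumb bandwidth h = sigma_avg * (4 / ((d + 2) N))^(1 / (d + 4)).
    n, d = X.shape
    sigma_avg = X.std(axis=0).mean()          # average marginal standard deviation
    h = sigma_avg * (4.0 / ((d + 2.0) * n)) ** (1.0 / (d + 4.0))
    return h ** 2
```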


A few points are nonetheless misclassified; given the overlap between neighboring clusters visible in Figure 4-5B, these misclassifications are bound to occur. Another interesting mistake occurs at the top right corner, where 4 points belonging to one cluster are misclassified and assigned to another, highly concentrated cluster. These points lie in the narrow valley bordering the two clusters, and unfortunately their gradient directions point toward the incorrect mode. But it should be appreciated that even for this complex dataset with varying shapes of Gaussian clusters, GMS with the simplest solution of a single kernel size gives such a good result.

On the other hand, using stopping criterion 4-5, GBMS stops at the 18th iteration with the output shown in Figure 4-6B. Notice the poor segmentation result as a consequence of multiple modes merging. It should be kept in mind that by defining the kernel size σ² = 0.01, we have selected the similarity measure for clustering and are looking for spherical Gaussians with variance around this value. In this regard, the result of GBMS is incoherent. The segmentation result obtained with GMS, in contrast, is much more homogeneous and consistent with our similarity measure. Further, it is only in the case of the GMS algorithm that the modes estimated from the pdf directly translate into clusters. For the GBMS algorithm, it is not clear how the modes in Figure 4-5C correlate with the clustering solution obtained in Figure 4-6B.

Figure 4-7 shows the average change in sample position for both algorithms. Notice the peaks in the GBMS curve corresponding to modes merging. This is a classic example where the assumption of two distinct phases in GBMS becomes fuzzy. By the 5th iteration, two of the modes have already merged, and by the 18th iteration a total of 5 modes are lost, giving rise to a poor segmentation result. In the case of GMS, on the other hand, the


averaged norm distance steadily decreases, and by selecting a tolerance level sufficiently low, we are always assured of a good segmentation result.

Figure 4-6. Segmentation results on the RGC dataset. A) GMS algorithm. B) GBMS algorithm.

Figure 4-7. Averaged norm distance moved by the particles in each iteration.


In the first part, we compare the two mean shift algorithms on the baseball image from [Shi and Malik 2000]. In the second part, we compare the GMS algorithm with the current state of the art in the segmentation literature, called spectral clustering, on a wide range of popular images.

The baseball image used by Shi and Malik [2000] is shown in Figure 4-8. For computational reasons, the image has been reduced to 110 × 73 pixels. This gray level image is transformed into a 3-dimensional feature space consisting of two spatial features, namely the x and y coordinates of the pixels, and the range feature, which is the intensity value at that location. Thus, the dataset consists of 8030 points in the feature space. In order to use an isotropic kernel, we scale the intensity values such that they fall in the same range as the spatial features, as done in [Carreira-Perpiñán 2007]. All values reported are in pixel units. Since this is an image segmentation application, we use stopping criterion 4-4 for GMS. Further, we set the tolerance level equal to 10^-3 for both algorithms throughout this experiment.

Figure 4-8. Baseball image.

We performed an elaborate multiscale analysis where the kernel size was changed from a small value to a large value in steps of 0.5. We selected the best segmentation result for both algorithms for a particular number of segments. The results are shown in Figure 4-9. The top row shows the segmentation results for 8 clusters.


For 8 segments, both algorithms give very similar results, comparable to those reported in [Carreira-Perpiñán 2007; Shi and Malik 2000].

Figures 4-9C and 4-9D show the GMS and GBMS results for 6 segments. Note the poor performance of GBMS. Instead of grouping similar objects into one segment, GBMS splits and merges them into two different clusters. The disc segment in the image was split into two, with one part merging with the player and the other with the bottom background. This is counterintuitive, given the fact that two of the coordinates of the feature space are spatial coordinates of the image. On the other hand, GMS clearly gives a very good segmentation result, with each segment corresponding to an object in the image. Further, a nice, consistent and hierarchical structure is seen in GMS. As we reduce the number of clusters, GMS merges clusters of the same intensity which are closer to each other before merging similar intensity clusters which are far apart. This is what we would expect for this feature space. The result is a beautiful pattern in the image space where whole objects which are similar are merged together in an intuitive manner. This phenomenon is observed again as we move from 6 segments to 4, where GMS puts all the gray objects in one cluster, thus putting three full objects of similar intensity together in one group.

Thus, starting from results for 8 segments which were very similar to each other, GMS and GBMS tread very different paths for lower numbers of segments. GMS neatly clusters objects in the image into different segments and hence is very close to a human segmentation result. The different paths followed by the two algorithms result in completely different 2-level image segmentations, as shown in Figure 4-9.


Figure 4-9. Baseball image segmentation using the GMS and GBMS algorithms. The left column shows results from GMS for various numbers of segments and the kernel size at which each was achieved; the right column similarly shows the results from the GBMS algorithm. B) GBMS: segments=8, σ=10. C) GMS: segments=6, σ=13. D) GBMS: segments=6, σ=11.5. E) GMS: segments=4, σ=18. F) GBMS: segments=4, σ=13. G) GMS: segments=2, σ=28.5. H) GBMS: segments=2, σ=18.


We use the spectral clustering algorithm of Ng et al. [2002]. For completeness, we briefly summarize this method below. Given a set of points S = {s_1, s_2, ..., s_M} in R^d that we want to segment into K clusters:

1. Form an affinity matrix A ∈ R^{M×M} defined by A_ij = exp(−||s_i − s_j||² / 2σ²) if i ≠ j, and A_ii = 0.
2. Define D to be the diagonal matrix whose (i,i) element is the sum of A's i-th row, and construct the matrix L = D^{-1/2} A D^{-1/2}.
3. Find e_1, e_2, ..., e_K, the K largest eigenvectors of L (chosen to be orthogonal to each other in the case of repeated eigenvalues), and form the matrix E = [e_1, e_2, ..., e_K] ∈ R^{M×K} by stacking the eigenvectors in columns.
4. Form the matrix Y from E by normalizing each of E's rows to have unit length (i.e., Y_ij = E_ij / (Σ_j E_ij²)^{1/2}).
5. Treating each row of Y as a point in R^K, cluster them into K clusters via the K-means algorithm.
6. Finally, assign the original point s_i to cluster j if and only if row i of the matrix Y was assigned to cluster j.

The idea here is that if the clusters are well defined, the affinity matrix has a clear block structure. The dominant eigenvectors then correspond to the individual blocks, with ones in the rows belonging to that block and zeros elsewhere. The Y matrix thus projects the clusters along orthogonal directions, which can then be separated easily by a simple algorithm like K-means. To eliminate the effects of K-means initialization, we run K-means with 10 different initializations and select the result with the minimum distortion (which is the mean square error in the case of K-means). Note that the number of clusters K needs to be given a priori in this algorithm.
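A compact Python sketch of these six steps is given below. It relies on NumPy, SciPy and scikit-learn, and the kernel size σ² is assumed to be supplied by the user, as in the experiments described here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(S, K, sigma2):
    # Step 1: affinity matrix with zero diagonal.
    A = np.exp(-cdist(S, S, "sqeuclidean") / (2.0 * sigma2))
    np.fill_diagonal(A, 0.0)
    # Step 2: normalized matrix L = D^{-1/2} A D^{-1/2}.
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))
    # Step 3: K largest eigenvectors (eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(L)
    E = vecs[:, -K:]
    # Step 4: normalize each row of E to unit length.
    Y = E / np.linalg.norm(E, axis=1, keepdims=True)
    # Steps 5-6: K-means on the rows of Y (10 restarts, keep the best), labels map back to S.
    return KMeans(n_clusters=K, n_init=10).fit(Y).labels_
```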


The results on the RGC dataset are shown in Figure 4-10. Although both methods perform very well, spectral clustering gives a better result, with just 5 misclassifications. In the case of GMS, the number of misclassifications is 20, mainly arising in low valley areas of the pdf estimate, which are generally tricky regions. It seems that by projecting the clusters onto their dominant eigenvectors, spectral clustering is able to enhance the separation between different clusters, leading to a better result. Nevertheless, it should not be forgotten that the number of clusters was detected automatically by the GMS algorithm, unlike in spectral clustering where it had to be specified a priori.

Figure 4-10. Performance comparison of the GMS and spectral clustering algorithms on the RGC dataset. A) Clustering solution obtained using the GMS algorithm. B) Segmentation obtained using the spectral clustering algorithm.

Selecting a priori the number of segments in images is a difficult task. Instead, we first gain an understanding of the problem through a multiscale analysis using the GMS algorithm. Since GMS naturally reveals the number of clusters at a particular scale, this analysis helps us ascertain the kernel size and the number of segments K to compare with spectral clustering. In our experience, a good segmentation result is stable over a broad range of kernel sizes.


Figure 4-11. Bald eagle image.

Figure 4-11 shows a full resolution image of a bald eagle. We downsampled this image to 48 × 72 pixels using a bicubic interpolation technique. This still preserves the segments very well, with the benefit of reducing the computation time. This step is important since both algorithms have O(N²) complexity. In particular, the affinity matrix construction in spectral clustering becomes difficult to manage for more than 5000 data points due to memory issues. In order to get meaningful segments close to human accuracy, the perceived color differences should correspond to Euclidean distances between pixels in the feature space. One such feature space that best approximates perceptually uniform color spaces is the L*u*v space [Connolly 1996; Wyszecki and Stiles 1982]. Further, we added the x and y coordinates to this feature space to take the spatial correlation into account (a small sketch of this feature construction is given at the end of this passage). Thus, we have 3456 data points spread across a 5-dimensional feature space.

The results obtained for the multiscale analysis are shown in Figure 4-12. Before plotting these segments, we performed a postprocessing operation where segments with fewer than 10 data points were merged with the closest cluster. This is needed to eliminate spurious clusters arising from isolated points or outliers. In the case of the bald eagle image, we found only the segmentation result for kernel size σ² = 9 to have such spurious clusters. This is clearly a result of the low kernel size. As we increase the kernel size, the number of segments is drastically reduced, reaching a level of 7 segments for both σ² = 25 and σ² = 36. Note the sharp and clear segments obtained. For example, the eagle itself forms a clean, well defined segment.
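The sketch below shows one way to build the 5-dimensional feature space described above (L*u*v color plus pixel coordinates). It assumes an RGB input image and uses scikit-image for the color conversion; any rescaling of the coordinates relative to the color channels is left to the user.

```python
import numpy as np
from skimage.color import rgb2luv

def image_to_features(img_rgb):
    # Convert to the (approximately) perceptually uniform L*u*v space.
    luv = rgb2luv(img_rgb)                     # H x W x 3
    H, W, _ = luv.shape
    yy, xx = np.mgrid[0:H, 0:W]                # pixel coordinates
    # Stack (x, y, L*, u*, v*) per pixel: an N x 5 matrix with N = H * W points.
    return np.column_stack([xx.ravel(), yy.ravel(), luv.reshape(-1, 3)])
```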


Since both the 5 and 7 segment results look very appealing in the previous analysis, we show the comparison with spectral clustering for both. Figure 4-13 shows these results for σ² = 25 and σ² = 49 respectively. Spectral clustering performs extremely well, with clear, well defined segments. Two things need special mention. First, the segment boundaries are very sharp in spectral clustering compared to the GMS results. This is understandable, since in GMS each data point is moved toward its mode, and in boundary regions there may be a clear ambiguity as to which peak to climb. This localized phenomenon leads to a pixelization effect at the boundary. Second, it is surprising how well the spectral clustering technique depicts the beak region. It should be remembered that the image used is a low resolution image, as shown in Figure 4-12A. A close observation shows that the region of intersection between the beak and the face of the eagle has a color gradation. This is clearly depicted in the GMS segmentation result for σ² = 16, shown in Figure 4-12C. Probably, by using the dominant eigenvectors one can concentrate on the main component and reject other information. This could also explain the clear, crisp boundaries produced in Figure 4-13.

As a next step in this comparison, we selected two popular images, shown in Figure 4-14, from the Berkeley image database [Martin et al. 2001], which has been widely used as a standard benchmark. The most important part of this database is that human segmentation results are available for all the images, which helps one compare the performance of an algorithm. Once again we performed a multiscale analysis of these two images. Selecting the number of clusters is especially difficult in the case of the tiger image, as can be seen in the human segmentation results as well.


Figure 4-12. Multiscale analysis of the bald eagle image using the GMS algorithm. B) σ²=9, segments=20. C) σ²=16, segments=14. D) σ²=25, segments=7. E) σ²=36, segments=7. F) σ²=49, segments=5. G) σ²=64, segments=4.


Figure 4-13. Performance of spectral clustering on the bald eagle image. B) σ²=49, segments=5.

Figure 4-14. Two popular images from the Berkeley image database. A) Flight image. B) Tiger image.

Our analysis, depicted in Figure 4-15, shows the good performance of the GMS algorithm. Both images are well segmented, especially the flight and the tiger regions. Based on this analysis, we selected a value of 2 segments for the flight image and 8 segments for the tiger image. Using the same kernel size as in the GMS algorithm, we initialized the spectral clustering method. Figure 4-16 shows the performance of this algorithm. Again, spectral clustering gives a very good segmentation, especially of the flight image. In the case of the GMS algorithm, a portion of the flight region is lost due to the large kernel size and the oversmoothing effect of the pdf estimation. The results for the tiger image are pretty much the same, though spectral clustering seems slightly better.

The case of the flight image is especially interesting. Ideally, one would want to use a smaller kernel size if the flight region is of interest. This is seen in the segmentation results for GMS with σ² = 16 in Figure 4-15C. But such a small kernel size would lead to many


Figure 4-15. Multiscale analysis of the Berkeley images using the GMS algorithm. B) 48×72 tiger image. C) σ²=16, segments=15. D) σ²=16, segments=14. E) σ²=49, segments=4. F) σ²=25, segments=10. G) σ²=100, segments=2. H) σ²=36, segments=8.


clusters. Since it is very clear from the beginning that we need 2 clusters for this image, one can add a postprocessing stage where close clusters are recursively merged until only two clusters are left. Such a result is shown in Figure 4-17A for σ² = 16. Note the improved and clear segmentation of the flight region. To be fair, we provide a comparison with spectral clustering for the same parameters in Figure 4-17C. Clearly, both perform very well, with a marked improvement in the performance of the GMS algorithm. A similar comparison is shown for the tiger image with an a priori selection of σ² = 16 and number of segments K = 8.

Figure 4-16. Performance of spectral clustering on the Berkeley images. B) σ²=36, segments=8.

For more complicated images like natural scenery, we need to go beyond a single kernel size for all data points. One technique is to use a different kernel for each sample, estimated using the K nearest neighbor (KNN) method. Readers who are interested in this are advised to refer to Appendix C. Other adaptive kernel size techniques have been proposed in the literature, as in [Comaniciu 2003], and have greatly improved the performance of the GMS algorithm, giving results similar to spectral clustering.


Figure 4-17. Comparison of the GMS and spectral clustering (SC) algorithms with a priori selection of parameters. A new postprocessing stage has been added to the GMS algorithm, where close clusters were recursively merged until the required number of segments was achieved. B) GMS: σ²=16, segments=8. C) SC: σ²=16, segments=2. D) SC: σ²=16, segments=8.

In this chapter, we have studied the mean shift family of algorithms and its two variations, namely the GMS and GBMS algorithms. Since both these methods appear as special cases of our novel unsupervised learning principle, we were not only able to show the intrinsic connection between these techniques, but also the general relationship they share with other unsupervised learning facets like principal curves and vector quantization.

With this new understanding, a number of interesting results follow. We have shown that GBMS directly minimizes Renyi's quadratic entropy and hence is an unstable mode finding algorithm. Since the modes are not the fixed points of this cost function, any stopping criterion is at best heuristic. On the other hand, its stable counterpart GMS minimizes Renyi's "cross" entropy, giving the modes as stationary solutions. Thus, a new stopping criterion is to stop when the change in the cost function is small. We have corroborated this theoretical analysis with extensive experiments showing the superior performance of GMS over the GBMS algorithm. Finally, we have also compared the performance of GMS with spectral clustering, the current state of the art in image segmentation.




Vector quantization is a classical technique for lossy data compression [Gersho and Gray 1991]. Inherent to every vector quantization technique, then, there should be an associated distortion measure to quantify the degree of compression. For example, in the simplest technique, the Linde-Buzo-Gray (LBG) algorithm [Linde et al. 1980], one tries to recursively minimize the sum of Euclidean distances between the data points and their corresponding winner codevectors. This can be summarized as

J_{LBG} = \frac{1}{2N}\sum_{i=1}^{N}\|x_i - c_i\|^2,   (5-1)

where {c_1, c_2, ..., c_K} ∈ C are K codevectors with K ≪ N, and c_i in 5-1 denotes the winner (nearest) codevector for data point x_i.
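A minimal sketch of the alternating assignment and re-centering steps implied by Equation 5-1 is given below. It is illustrative only: the LBG codebook-splitting schedule and stopping rule are omitted, and the halving factor simply mirrors the 1/(2N) normalization reconstructed above.

```python
import numpy as np

def assign(X, C):
    # Index of the winner (nearest) codevector for every data point.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def distortion(X, C, labels):
    # Equation 5-1: average squared error between points and their winner codevectors.
    return 0.5 * np.mean(((X - C[labels]) ** 2).sum(axis=-1))

def lbg_iteration(X, C):
    labels = assign(X, C)
    for k in range(len(C)):
        if np.any(labels == k):
            C[k] = X[labels == k].mean(axis=0)   # centroid (winner) update
    return C, distortion(X, C, labels)
```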

Rao et al. 2007a]. Lehn-Schiler et al. [2005] use a gradient method to achieve the same goal. The gradient update is given by

x_k(n+1) = x_k(n) - \eta \frac{\partial}{\partial x_k} J(X),   (5-2)

where η is the learning rate parameter and J(X) = D_cs(p_X || p_S). The derivative ∂J(X)/∂x_k is then given by the expression derived in Theorem 4.

This approach is combined with a kernel annealing strategy in the spirit of simulated annealing [Kirkpatrick et al. 1983; Haykin 1999], where the kernel size is slowly annealed from a large value to a small value. A large kernel size has the effect of oversmoothing the surface of the cost function, thus eliminating or suppressing most local minima and quickly accelerating the samples into the vicinity of the global solution. Reducing the kernel size then allows the samples to reduce the bias in the solution and effectively capture the true global minimum. Equation 5-3 shows how the kernel size at each iteration is obtained by annealing from a constant multiple of the initial kernel size σ_o at a fixed annealing rate.


The problem with the gradient method is that, to actually utilize the benefit of kernel annealing, the step size also needs to be annealed, albeit with a different set of parameters. This helps speed up the algorithm by using a step size proportional to the kernel size. As shown in 5-4, the step size at the nth iteration is likewise annealed from a constant multiple of its initial value at its own annealing rate.

As can be imagined, selecting these parameters can be a daunting task. Not only do we need to tune them finely, we also need to take into consideration the interdependencies between the parameters. For example, for a given annealing rate of the kernel size there is only a narrow range of values for the annealing rate of the step size which makes the algorithm work. Selecting a value outside this range either makes the algorithm unstable or stops it prematurely due to a slow convergence process. The sketch below illustrates the kind of schedules involved.

By doing away with the gradient method and using the fixed point update rule 3-9, we effectively solve this problem and at the same time speed up the algorithm tremendously. One of the easiest initializations for X is to randomly select N points from S, with N much smaller than the number of samples in S.
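The following illustrative schedule assumes the common exponential-decay form for both quantities; the constants, rates and floor values shown are placeholders standing in for the tuning parameters discussed above, not the exact expressions 5-3 and 5-4.

```python
import numpy as np

def annealed(n, initial, kappa, rate, floor):
    # Exponential decay from kappa * initial, never allowed to fall below the floor.
    return max(kappa * initial * np.exp(-rate * n), floor)

# Kernel size and step size at iteration n (values purely illustrative):
#   sigma_n = annealed(n, sigma_o, kappa=1.0, rate=0.05, floor=sigma_o)
#   eta_n   = annealed(n, eta_o,   kappa=1.0, rate=0.15, floor=0.05)
```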

In this chapter, we present the quantization results obtained on an artificial dataset as well as on some real images. The artificial dataset shown here was used in [Lehn-Schiler et al. 2005]. A real application of compressing the edges of a face image is shown next. To distinguish between the two variations of the ITVQ method, we call our method ITVQ fixed point (ITVQ-FP) and the gradient method ITVQ-gradient. A comparison between ITVQ-FP, ITVQ-gradient and the standard Linde-Buzo-Gray (LBG) algorithm is provided and quantified. To compare with LBG, we also report the mean square quantization error (MSQE) defined in 5-5. Finally, we present some image compression results showing the performance of the ITVQ-FP and LBG techniques.

The half circle dataset and the square grid used for initialization are shown in Figure 5-1. In the gradient method, the initial kernel was set as shown in 5-6, with the diagonal entries being the variance along each feature component. The kernel was annealed with the constant set to 1 and the annealing rate to 0.05. At no point was the kernel allowed to fall below its floor value, following the setup used in [Lehn-Schiler et al. 2005].

The most difficult part of the gradient method is to select the step size and its annealing parameters to best suit the kernel annealing rates. The step size should be sufficiently


small to ensure smooth convergence. Further, the step size annealing rate should be selected such that the step size at each iteration is well below the maximum allowed step size for the present iteration's kernel size. After many trial and error runs, the following step size parameters were selected: an initial value of 1, a constant of 1 and an annealing rate of 0.15. Further, the step size was never allowed to go below 0.05.

Figure 5-1. Half circle dataset and the square grid inside which 16 random points were generated from a uniform distribution.

In the case of the fixed point method, for a fair comparison with its gradient counterpart, we select the same kernel initialization and the associated annealing parameters. There is no step size or step size annealing in this case. To quantify the statistical variation, 50 Monte Carlo simulations were performed. It was ensured that the initialization was the same for both methods in every simulation. With these selected parameters, it was observed that both methods gave good results almost every time. Nevertheless, the fixed point method was found to be more consistent in its results. Figure 5-2A shows the plot of the cost function J(X) = D_cs(p_X || p_S). Clearly, the fixed point algorithm is almost 10 times faster than the gradient method and gives a better solution in terms of minimizing the cost function. Figure 5-2B shows one of the best results for the gradient and fixed point methods. A careful look shows some small differences which may explain the smaller J(X)


obtained for the fixed point over the gradient method. Further, the fixed point method has a lower MSQE, with a value of 0.0201 compared to 0.0215 for the gradient method.

Figure 5-2. Performance of the ITVQ-FP and ITVQ-gradient methods on the half circle dataset. A) Learning curve averaged over 50 trials. B) 16 point quantization results.

We repeat the procedure of finding the best parameters for the gradient method. After an exhaustive search we end up with the following parameters: a kernel annealing constant of 1 with rate 0.1, and a step size with initial value 1, constant 1 and annealing rate 0.15. The initial kernel was set to a diagonal matrix with the variance along each feature component as the diagonal entries. The algorithm was initialized with 64 random points inside a small square grid centered on the mouth region. As done before, the kernel parameters for the fixed point method were set exactly as in the gradient method for a fair comparison.

As discussed earlier, annealing the kernel size in ITVQ-FP not only gives the samples enough freedom to move quickly and capture the important aspects of the data, thus speeding up the algorithm, but also makes the final solution more robust to random initialization. This idea is best illustrated in Figure 5-3, where the initialization is done inside a small region away from most of the data.


Figure 5-4 shows the results obtained using the ITVQ fixed point, ITVQ gradient and LBG methods. Two important conclusions can be made. Firstly, among the ITVQ algorithms, the fixed point method represents the facial features better than the gradient method. For example, the ears are very well modeled in the fixed point method. Perhaps the most important advantage of the fixed point over the gradient method is the speed of convergence. We found that the ITVQ fixed point was more than 5 times faster than its gradient counterpart. In image applications, where the number of data points (pixels) is generally large, this translates into a huge saving of computational time.

Secondly, both ITVQ algorithms outperformed LBG in terms of facial feature representation. LBG uses many codevectors to model the shoulder region and few codevectors to model the eyes or ears. This is due to the fact that LBG uses just second order statistics, whereas ITVQ, due to its intrinsic formulation, uses all higher order statistics to better extract the information from the data and allocate the codevectors to suit the structural properties. This also explains why the ITVQ methods perform very close to LBG from the MSQE point of view, as shown in Table 5-1. On the other hand, notice the poor performance of LBG in terms of minimizing the cost function J(X). Obviously, the LBG and ITVQ fixed point methods give the lowest values for their respective cost functions.

Table 5-1. J(X) and MSQE on the face dataset for the three algorithms, averaged over 50 Monte Carlo trials.

Method          J(X)     MSQE
ITVQ-FP         0.0253   3.355 x 10^-4
ITVQ-gradient   0.0291   3.492 x 10^-4
LBG             0.1834   3.05 x 10^-4


Figure 5-3. Effect of annealing the kernel size on the ITVQ fixed point method.


Figure 5-4. 64 point quantization results of the ITVQ-FP, ITVQ-gradient and LBG algorithms on the face dataset. A) ITVQ fixed point method. B) ITVQ gradient method. C) LBG algorithm.


Figure 5-5. Two popular images selected for the image compression application. A) Baboon image. B) Peppers image.

Figure 5-5 shows the two images used in this section. The first is the popular baboon image and the other is the peppers image available in Matlab. Due to the huge size of these images, the computational complexity would be exorbitant. Therefore, we selected only some portions of these images to work with, as shown in Figure 5-6. These portions were limited to at most 5000 pixel points for convenience and ease of implementation.

We used the L*u*v feature space for this application and initialized the codebook with random points selected from the dataset itself, depending on the compression level. Instead of using the simulated annealing technique, we performed 10 to 20 Monte Carlo simulations for both methods and selected the best result in terms of lowering their respective cost functions. We found that our algorithm was able to find good results every time with these Monte Carlo runs, and the results were similar to those generated using the simulated annealing technique. Different levels of compression were evaluated, starting from 80% and going as far as 99.75% compression in some images. We hand-picked the maximum level of


compression for both ITVQ-FP and LBG after comparing the reconstructed image to the original image. Since ITVQ-FP has the extra parameter of kernel size, we ran the algorithm for different kernel sizes in the range σ² = 16 to σ² = 64 and selected the best result.

Figure 5-6. Portions of the baboon and peppers images used for image compression.

Figure 5-7 shows the comparison between the ITVQ-FP and LBG techniques. In the case of images, it is hard to quantify the difference as such, unless this compression is used as a preprocessing stage in some bigger application. Nevertheless, one can see some small differences between the methods, like the eye part of the baboon in Figures 5-7A and 5-7B. Overall, both methods are fast, perform reliably and produce a very high level of compression suitable for many practical applications.


Figure 5-7. Image compression results. The left column shows ITVQ-FP compression results and the right column shows LBG results. A, B) 99.75% compression for both algorithms, σ² = 36 for the ITVQ-FP algorithm. C, D) 90% compression, σ² = 16. E, F) 95% compression, σ² = 16. G, H) 85% compression, σ² = 16.


In this chapter, we have shown how information theoretic vector quantization arises naturally as a special case of our general unsupervised learning principle. This algorithm has a dual advantage. First, by using information theoretic concepts, we use all the higher order statistics available in the data and hence preserve the maximum possible information. Secondly, by using the fixed point update rule, we substantially speed up the algorithm compared to the gradient method. At the same time, we circumvent the many parameters used in the gradient method, whose tuning is too heuristic to be applicable in any real life scenario.


Principal curves were first introduced by Hastie and Stuetzle [1989] as a nonlinear extension of principal component analysis (PCA). The authors describe them as "self-consistent" smooth curves which pass through the "middle" of a d-dimensional probability distribution or data cloud. The Hastie-Stuetzle (HS) algorithm attempts to minimize the average squared distance between the data points and the curve. In essence, the curve is built by finding the local mean along directions orthogonal to the curve. This definition also uses two different smoothing techniques to avoid overfitting. An alternative definition of principal curves, based on a mixture model, was given by Tibshirani [1992]. An EM algorithm was used to carry out the estimation iteratively.

A major drawback of Hastie's definition is that it is not amenable to theoretical analysis. This has prevented researchers from formally proving the convergence of the HS algorithm. The only known fact is that principal curves are fixed points of the algorithm. Recently, Kegl [1999] developed a regularized version of Hastie's definition. By constraining the total length of the curve fit to the data, one can prove the existence of a principal curve for every dataset with finite second order moment. With this theoretical model, Kegl derived a practical algorithm to iteratively estimate the principal curve of the data. The Polygonal Line Algorithm produces piecewise linear approximations to the principal curve using a gradient based method. This has been applied to hand-written character skeletonization to find the medial axis of the character [Kegl and Krzyzak 2002]. In general, one can find a principal graph of a given dataset using this technique. Similarly, other improvements were proposed by Sandilya and Kulkarni [2002], who regularize the principal curve by imposing bounds on the turns of the curve.


In light of this background, consider once again our cost function, which is reproduced here for convenience:

J(X) = \min_X H(X) + \lambda D_{cs}(p_X \| p_S).


It is interesting to note that the principal curve proposed here passes through the modes of the data, unlike other definitions which pass through the mean of the projections of all the points orthogonal to the curve. This dictates a new maximum likelihood definition of principal curves. Since the idea of principal curves is to find the intrinsic lower dimensional manifold from which the data originated, we are convinced that the mode (or peak) along every orthogonal direction is much more informative than the mean. By capturing the local dense regions, one can imagine this principal curve as a ridge passing through the pdf contours. Such a definition was recently proposed by Erdogmus and Ozertem [2007]. We briefly summarize the definition and the results obtained here for our discussion.


Figure 6-1. The principal curve for a mixture of 5 Gaussians obtained using the numerical integration method.

In order to implement this, Erdogmus first found the modes of the data using the Gaussian Mean Shift (GMS) algorithm. Starting from each mode, a trajectory is then shot in the direction of the eigenvector corresponding to the largest eigenvalue and traced out numerically; a rough sketch of this procedure is given below. Figure 6-1 shows an example to clarify this idea.

The example shown here is a simple case with well defined Hessian values. In many practical problems, though, this approach suffers from ill conditioning and numerical issues. Further, the notion that a 1-dimensional structure exists for every dataset is not true. For example, for a T-shaped 2-dimensional dataset there does not exist a 1-dimensional manifold. What is possible is only a 2-dimensional denoised representation, which can be seen as a principal graph of the data. In the following section, we experimentally show how our algorithm satisfies the same definition, extracting principal curves (or principal graphs) directly from the data.
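The sketch below illustrates the numerical idea just described: compute the gradient and Hessian of a Gaussian kernel density estimate and, starting from a mode, repeatedly step along the Hessian eigenvector with the largest eigenvalue. The step size, step count and sign-consistency rule are illustrative choices, not the exact settings used by Erdogmus and Ozertem.

```python
import numpy as np

def kde_grad_hess(x, S, sigma2):
    # Gradient and Hessian of the Gaussian KDE at x (common constant factor dropped).
    diff = S - x                                     # N x d
    w = np.exp(-(diff ** 2).sum(axis=1) / (2 * sigma2))
    g = (w[:, None] * diff).sum(axis=0) / sigma2
    H = (w[:, None, None] * (diff[:, :, None] * diff[:, None, :] / sigma2
         - np.eye(x.size)[None])).sum(axis=0) / sigma2
    return g, H

def trace_from_mode(mode, S, sigma2, step=0.01, n_steps=500):
    x, path, direction = mode.copy(), [mode.copy()], None
    for _ in range(n_steps):
        _, H = kde_grad_hess(x, S, sigma2)
        vals, vecs = np.linalg.eigh(H)
        v = vecs[:, -1]                              # eigenvector of the largest eigenvalue
        if direction is not None and v @ direction < 0:
            v = -v                                   # keep a consistent travel direction
        x = x + step * v
        direction = v
        path.append(x.copy())
    return np.array(path)
```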


We first test our method on the spiral dataset used in [Kegl 1999]. The data consists of 1000 samples perturbed by Gaussian noise with variance equal to 0.25. We would like to point out that this is a much noisier dataset than the one actually used in [Kegl 1999]. We use λ = 2 for our experiment here, although we found that any value between 1 < λ < 3 can be used. Figure 6-2 shows the different stages of the principal curve as it evolves, starting with the initialization X = S, where S is the spiral data.

Note how quickly the samples tend to the curve. By the 10th iteration the structure of the curve is clearly revealed. After this, the changes in the curve are minimal, with the samples only moving along the curve (and hence always preserving it). What is even more exciting is that this curve passes exactly through the modes of the data for the same scale, as shown in Figure 6-3. Thus our method naturally gives a principal curve which satisfies Definition 1. We also depict the development of the cost function J(X) and its two components H(X) and H(X;S) as a function of the number of iterations in Figure 6-4. Notice the quick decrease in the cost function due to the rapid collapse of the data to the curve. Further decrease is associated with small changes in the movement of samples along the curve, and by stopping at this juncture we can get a very good result, as shown in Figure 6-2F.


Figure 6-2. The evolution of the principal curve starting with X initialized to the original dataset S. The parameters were set to λ = 2 and σ² = 2. C) Iteration 5. D) Iteration 10. E) Iteration 20. F) Iteration 30.


Figure 6-3. The principal curve passing through the modes (shown with black squares).

Figure 6-4. Changes in the cost function and its two components for the spiral data as a function of the number of iterations. A) Cost function J(X). B) The two components of J(X), namely H(X) and H(X;S).


Figure 6-5. Denoising ability of principal curves. A) Noisy 3D "chain of rings" dataset. B) Result after denoising.

Figure 6-5A shows a 3-dimensional "chain of rings" dataset corrupted with Gaussian noise of unit variance. The parameters of the algorithm were set to λ = 1.5 and σ² = 1. In fewer than 20 iterations, the algorithm successfully removed the noise and extracted the underlying "principal manifold", as can be seen in Figure 6-5B.


These ideas have been applied in syntactic pattern recognition [Fu 1974], optical character recognition [Amin 2002], time-series data analysis [Olszewski 2001], image processing [Caelli and Bischof 1997], face and gesture recognition [Bartlett 2001] and many more areas. Obviously, the type of "structures" or "primitives" depends on the data at hand. This has been one of the bottlenecks of this field. All the applications described so far are domain dependent and use extensive knowledge of the problem at hand to construct the appropriate structures. Creating a knowledge base is a difficult task, and in some applications, especially online documents or biological data, it is a costly affair. Further, in some applications like scene analysis, quantifying "structures" manually may be an ill posed problem due to their variety and the multitude of effective combinations possible to generate different pictures. A self organizing technique to dynamically find appropriate structures from the data itself would not only make knowledge bases redundant, but also lead to online and fast learning systems. For a thorough treatment of this topic, we direct the reader to Watanabe [1985, chap. 10].

The notion of a "goal" is a crucial aspect of unsupervised learning, so much so that one can call this field Goal Oriented Learning (GOL). This approach is needed since the learner only has the data available, without any additional a priori information like a desired output or a reward signal.


Looking back at the schematic in Figure 1-2, we see that our formulation is driven by the goal the machine seeks. This is modeled through the parameter λ, which is under the control of the system itself. By allowing the machine to influence the level of abstraction to be extracted from the data, and in turn capturing "goal oriented structures", the Principle of Relevant Entropy addresses this fundamental concept behind unsupervised learning.

Appropriate goals are needed for appropriate levels of learning. For example, in a game of chess, the ultimate goal is to checkmate your opponent and win the game. There are also intermediary goals, like making the best move to remove the queen. These correspond to higher level goals. A lower level vision task is to make such a move happen. This involves processing the incoming vision signals and going through the processes of clustering (segregation of different objects), principal curves (denoising) and vector quantization (compact representation) to assimilate the data. It is these lower level goals that we have addressed in this thesis. Further, beyond this sensory processing and cognition stage, one needs to convert this knowledge into appropriate actions and motor actuation. This involves the process of reasoning and inference (especially inductive inference, though deductive reasoning may be necessary to compare with past experiences), which generates appropriate control signals to drive actions. These actions in the environment (like making your respective moves in the game of chess) in turn generate new data which needs to be further analyzed. Though our schematic in Figure 1-2 shows this complete idea, we certainly do not intend to solve this entire gamut of machine learning, and we encourage readers to think along these lines to further this theory. An interesting aspect of this schematic is that reinforcement learning can be seen as a special case of our framework, where the goal and the critical reasoning blocks are external to the learner and controlled by an independent critic.


Several techniques exist for estimating the kernel size, such as Silverman's rule of thumb [Silverman 1986], the maximum-likelihood (ML) technique and the K nearest neighbor (KNN) method. We have summarized these techniques in Appendix C for easy reference. Some of these techniques make extra assumptions to find a single suitable kernel size appropriate to the given dataset. For example, Silverman's rule assumes that the data is Gaussian in nature. A more robust method is to assign a different kernel size to each sample. The KNN technique can yield such a result, as discussed in Appendix C.

A different way to see the kernel size issue is its role in controlling the scale of analysis. Since structures exist at different scales, it is prudent to analyze the data with a multiscale analysis before we make an inference. This is certainly true in image and vision applications as well as for many biological data. From this angle, we see the kernel size as a strength of our formulation, encompassing the multiscale analysis needed for effective learning. This gives rise to an interesting perspective where one can see our framework as analyzing the data in a 2-D plane. The two axes of this plane correspond to the two continuous parameters λ and σ. By extracting structures for different combinations of (λ, σ), we unfold the data in this two dimensional space to truly capture the very essence of unsupervised learning. This idea is depicted in Figure 7-1. One can also see this analysis as a "2D spectrum" of the data, revealing all the underlying patterns and giving rise to a strong biological connection.


Figure 7-1. A novel idea of unfolding the structure of the data in the 2-dimensional space of λ and σ.

The computational complexity of our algorithm is O(N²) in general, where N is the number of data samples. This compares favorably with other methods, particularly clustering techniques. For example, the recent method of spectral clustering [Shi and Malik 2000] requires one to find the N×N Gram matrix (which requires huge memory, especially for images) and then find the K dominant eigenvectors. This entails a theoretical cost of O(N³), which can be brought down to O(KN²) by using the symmetry of the matrix and recursively finding the top eigenvectors. A way to alleviate the computational burden of our algorithm is to use the Fast Gauss transform (FGT) technique [Greengard and Strain 1991; Yang et al. 2005]. By approximating the Gaussian with a Hermite series expansion, we can reduce the computational complexity to O(LpN), where L and p are parameters of the expansion (these are very small compared to N). Another idea would be to employ resampling techniques to work with only M points (with M much smaller than N).

Information Theoretic Learning (ITL), developed by Principe and collaborators at CNEL, is the first theory which is able to break this barrier and address all four important issues of the unsupervised learning framework listed at the end of Chapter 1. Crucial to this success are the non-parametric estimators for information theoretic measures like entropy and divergence, which can be computed directly from the data. Further, the ability to derive a simple and fast fixed point iterative scheme creates a self organizing paradigm, thus avoiding problems like step size selection which are part of gradient methods. This is an important as well as a crucial bottleneck which ITL solves elegantly. Due to the lack of such an iterative scheme, many methods end up making Gaussian approximations and missing important higher order features in the data.

This thesis deals in general with spatial data under the assumption of iid samples. Such data occur in diverse fields including pattern recognition, image analysis and machine learning applications. There are other problems where the data has time structure. Examples include financial data, speech and video signals, to name a few. We can also extract similar features from these time series (as is done in speaker recognition) or perform a local embedding to create a feature space and still apply our method. We show two examples of this approach in Appendix A and B.


Therearesomeparticularlyinterestingareasinwhichwewouldliketopursueourfuturework.Onesuchareaisthetheoreticalanalysisofthecostfunction.Theconvergenceofthexedpointupdateruleforsomespecialcaseshavebeenshowninthisthesis.Butforthegeneralcase,thisseemsespeciallydicult.Specialattentionisrequiredinunderstandingtheworkingoftheprinciplebeyond=1.Inparticular,rigorousmathematicalanalysisisneededtoquantifytheconceptofprincipalcurves.Webelievewehavesomethingmorethanjustprincipalcurveshere,butthiswouldneedatotally


Anothertopicwhichisofinterestistheconnectiontoinformationbottleneck(IB)method[ Tishbyetal. 1999 ].Therearetwoparticularreasonsweareinterestedinpursuingthisresearch.AspointedoutearlierinChapter1,thistechniquetriestoaddressmeaningfulorrelevantinformationinagivendataSthroughanothervariableYwhichisdependentonS.Inherenttothissetup,istheassumptionthatthejointdensityp(s;y)existssothatI(S;Y)>0.ThisideaisdepictedclearlyinFigure 7-2 .InordertocapturethisrelevantinformationYinaminimallyredundantcodebookX,oneneedstominimizethemutualinformationI(X;S)constrainedonmaximizingthemutualinformationI(X;Y).Thisisframedasminimizingavariationalproblemasshownbelow whereisaLagrangemultiplier.MostmethodsatthisstagemakeaGaussianassumptionduetothedicultyinderivingiterativeschemeusingShannondenition.Buttheauthors,usingtheideasfromratedistortiontheoryandinparticularbygeneralizingtheBlahut-Arimotoalgorithm[ Blahut 1972 ],wereabletoderiveaconvergentre-estimationscheme.Theseiterativeselfconsistentequationsworkdirectlywiththeprobabilitydensity,thusprovidinganalternativesolutiontocapturerelevanthigherorderinformationandavoidGaussianapproximationsaltogether. Insteadofassuminganexternalrelevantvariable,thePrincipleofRelevantEntropyextractssuchrelevantinformationdirectlyfromthedatafordierentcombinationsof(;).Thus,wegoonestepfurther,bynotjustextractingonestructure,butarangeofstructuresrelevantfordierenttasksandatdierentresolutions.Bydoingso,wegathermaximalinformationtoactivelylearndirectlyfromthedata.Additionally,workingwithjusttwovariablesnamelyXandS,wesimplifytheproblemwithnoextraassumptionoftheexistenceofthejointdensityfunctionp(s;y). 102


Theideaofinformationbottleneckmethod. Itshouldbepointedoutthatourmethodisparticularlydesignedtoaddressunsupervisedlearning.TheIBmethodhasbeenappliedtosupervisedlearningparadigmsduetotheavailabilityofextravariableY,thoughsomeapplicationsofIBmethodtounsupervisedlearninglikeclusteringandICAhavealsobeenpursued.Inparticular,changingthevalueofparameterleadstoacoarsetoaveryneanalysisandhasinterestingimplicationswiththeparametersinourcostfunction.Similarformofcostfunctionbasedonenergyminimizingprincipleforunsupervisedlearninghasbeenproposedby Ranzatoetal. [ 2007 ]anditwouldbeimportanttostudyfurthertheconnectiontothesemethods. Finally,wewouldliketopursueresearchintheareaofinference,datafusionandactivelearning[ Cohnetal. 1996 ].GoingbacktoourschematicinFigure 1-2 ,aconstantfeedbackloopexistsbetweentheenvironmentandthelearner.Thus,alearnerisnotapassiverecipientofthedata,butcanactivelysenseandgatherittoadvancehislearning.Todoso,oneneedstoaddresstheissueofinferenceandreasoningbeyondthestructuralanalysisoftheincomingdata.Lessonsfromreinforcementlearningcouldalsobeusedto 103


Tosummarize,machinelearningisanexcitingeldandweencouragereadersaswellascriticstopursueresearchinthisimportantareaofscience. 104


Theworkpresentedinthissectionspecicallyaddressestheapplicationofunsupervisedlearningtechniquestotime-seriesdata.Wehavedevisedanewcompressionalgorithmsuitableandhighlytailoredtotheneedsofneuralspikecompression[ Raoetal. 2007b ].Duetoitssimplicity,wewereabletosuccessfullyimplementthisalgorithminlowpowerDSPprocessorforreal-timeBMIapplications[ Gohetal. 2008 ].Whatfollowsisasummaryofthisresearch. Nicolelis 2003 ].Theultimategoalistoprovideparalyzedormotor-impairedpatientsamodeofcommunicationthroughthetranslationofthoughtintodirectcomputercontrol.Inthisemergingtechnology,atinychipcontaininghundredsofelectrodesarechronicallyimplantedinthemotor,premotorandparietalcorticesandconnectedthroughwirestoexternalsignalprocessor,whichisthenprocessedtogeneratecontrolsignals[ Nicolelisetal. 1999 ; Wessbergetal. 2000 ]. Recently,theideaofwirelessneuronaldatatransmissionprotocolshasgainedconsiderableattention[ Wiseetal. 2004 ].Notonlywouldthisenableincreasedmobilityandreduceriskofinfectioninclinicalsettings,butalsowouldfreecumbersomewiredbehaviorparadigms.Althoughthisidealookssimple,amajorbottleneckinimplementingitisthehighconstraintsonthebandwidthandpowerimposedonthesebio-chips.Ontheotherhand,toextractasmuchinformationaspossible,wewouldliketotransmitalltheelectrophysiologicalsignalsforwhichthebandwidthrequirementcanbedaunting.Forexample,totransmittheentirerawdigitizedpotentialsfrom32channelssampledat20kHzwith16bitsofresolutionweneedahugebandwidthof10Mbps. 105


Bossettietal. 2004 ].Analternativeistousespikedetectionandsortingtechniquessothatbinning(thenumberofspikesinagiventimeinterval)canbeimmediatelydone[ Wiseetal. 2004 ].Thedisadvantageofthesemethodsliesintheweaknessofcurrentautomatedspikedetectionmethodswithouthumaninteractionaswellasthemissedopportunitiesofanypostprocessingsincetheoriginalwaveformislost.Itisinthisregard,thatweproposecompressingtherawneuronpotentialsusingwellestablishedvectorquantizationtechniquesasanalternativeandviablesolution. Beforewedwellintoneuralspikecompression,itisimportanttorstunderstandthedataathand.Wepresenthereasyntheticneuraldatawhichwasdesignedtoemulateasaccuratelyaspossiblethescenarioencounteredwithactualrecordingsofneuronalactivity.ThewaveformcontainsspikefromtwoneuronsdieringinbothpeakamplitudeandwidthasshowninFigure A-1A .BothneuronsredaccordingtohomogeneousPoissonprocesswithringratesof10spikes/s(continuousline)and20spikes/s(dottedline).Further,tointroducesomevariabilityintherecordedtemplate,eachtimeaneuronredthetemplatewasscaledbyaGaussiandistributedrandomnumberwithmean1andstandarddeviation0:01.Finally,thewaveformwascontaminatedwithzeromeanwhitenoiseofstandarddeviation0:05.AninstanceoftheneuraldataisshowninFigure A-1B .Noticethesparsenessofthespikescomparedtothenoiseinthedata.Thisisapeculiarcharacteristicofneuralsignalingeneral. Weshowherethetwodimensionalnon-overlappingembeddingofthetrainingdatainFigure A-2 .Notethedenseclusterneartheorigincorrespondingtothenoisewhichconstitutesmorethan90%ofthedata.Further,sincethespikescorrespondtolargeamplitudesinmagnitude,thefartherthedatapointfromtheoriginthemorelikelyitistobelongtothespikeregion.Itisthesesparsepointswhichweneedtopreserveasaccuratelyaspossiblewhilecompressingtheneuralinformation. 106


B Syntheticneuraldata.A)Spikesfromtwodierentneurons.B)Aninstanceoftheneuraldata. FigureA-2. 2-Dembeddingoftrainingdatawhichconsistsoftotal100spikeswithacertainratioofspikesfromthetwodierentneurons. 107


Choetal. 2007 ].ThoughthereisslightimprovementintheSNRofthespikeregion,thechangesmadewereheuristic.Further,thesealgorithmsarecomputationallyintensivewithlargenumberofparameterstotuneandcanonlybeexecutedoine. Inthischapter,weintroduceanoveltechniquecalledweightedLBG(WLBG)algorithmwhicheectivelysolvesthisproblem.Usinganovelweightingfactorwegivemoreweightagetosparseregioncorrespondingtothespikesintheneuraldataleadingtoa15dBincreaseintheSNRofthespikeregionapartfromachievingacompressionratioof150:1.Thesimplicityandthespeedofthealgorithmmakesitfeasibletoimplementthisinreal-time,openingnewdoorsofopportunityinonlinespikecompressionforBMIapplications. A.2.1WeightedLBG(WLBG)Algorithm A{1 ). whereciisthenearestcodevectortodatapointxiasgivenin( A{2 ). 108


a. Calculateforeachxi2X,thenearestcodevectorci2Cusing( A{2 ). b. Updatethethecodevectorsinthecodebookusing( A{3 )wherethesumistakenoveralldatapointsforwhichcjisthenearestcodevector. c. MeasurethenewdistortionD(C)asshownin( A{1 ).IfD(C)Dmaxthengotostep5orelsecontinue. d. Gobackto(a)unlessthechangeindistortionmeasureislessthan FigureA-3. TheoutlineofweightedLBGalgorithm. ConsideradatasetX=fx1;x2;:::;xNg2Rd.LetC=fc1;c2;:::;cMgdenotethecodebooktobefound.TheoutlineofthealgorithmisshowinFigure A-3 .Bothandaresettoaverysmallvalue.Atypicalvalueis=0:001and=0:0001[111:::]wheretherandom1and-1isaddimensionalvector.ThisrecursivesplittingofthecodebookhastwoadvantagesoverthedirectK-meansmethod. 109


where ε is a small constant that prevents the weighting from going to zero. Though an arbitrary choice of ε would do, we can make an intelligent selection. Note that we can estimate the standard deviation σ of the noise from the data, which corresponds to the dense Gaussian cluster at the origin in Figure A-2. Since 2σ spans the 95 percent confidence interval of the Gaussian noise, we can set ε = (2σ)², giving the same weight to all points belonging to the Gaussian noise. For higher dimensional embeddings like L = 5 or L = 10, a correspondingly scaled value of ε can be used.

The self organizing map (SOM) is a classical neural approach to this problem [Haykin 1999]. Each processing element (PE) in the lattice of M PEs has a corresponding synaptic weight vector with the same dimensionality as the input space. At every iteration, the synaptic weight closest to each input vector x_k is found as shown in (A-5). Having found the winner PE for each x_k, a topological neighborhood is determined around the winner neuron. The weight vector of each PE is then updated as in (A-6), where η_k ∈ [0, 1] is the learning rate. The topological neighborhood is typically defined as Λ_{i,k} = exp(−||r_i − r_{i*}||² / 2σ_k²), where r_i is the lattice position of PE i, r_{i*} that of the winner, and σ_k the neighborhood width.
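For reference, a minimal sketch of the plain SOM competition and cooperative update steps (A-5, A-6) is given below. The lattice positions, learning rate and neighborhood width are illustrative; the additional dynamic-learning term of the SOM-DL variant discussed next is not reproduced here, since its exact form is not legible in this copy.

```python
import numpy as np

def som_update(W, R, x, eta, sigma_lat):
    # W: M x d synaptic weight vectors, R: M x q lattice positions of the PEs.
    # Competition (A-5): find the PE whose weight vector is closest to the input x.
    winner = np.argmin(((W - x) ** 2).sum(axis=1))
    # Topological neighborhood around the winner, measured in lattice space.
    lam = np.exp(-((R - R[winner]) ** 2).sum(axis=1) / (2.0 * sigma_lat ** 2))
    # Cooperative update (A-6): every PE moves toward x, scaled by eta and its neighborhood value.
    return W + eta * lam[:, None] * (x - W)
```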


When applying SOM to neural data, it was found that most of the PEs were used to model the noise rather than the spikes in the data. This is typical of any neural recording, which generally has a sparse number of spikes. In order to alleviate this problem and to move PEs from the low amplitude region of the state space to the region corresponding to the spikes, a modified update rule was proposed. This was called the self organizing map with dynamic learning (SOM-DL) [Cho et al. 2007]. By accelerating the movement of the PEs toward the spikes, SOM-DL represents the spikes better. But for good performance, careful tuning of the parameters is important. For example, it was experimentally verified that a value between 0.05 and 0.5 for the additional learning constant balances fast convergence against small quantization error for the spikes. Further, it is well known that SOM based algorithms are computationally very intensive.

Figure A-4 shows the 16 point quantization obtained using WLBG on the training data. As can be seen, more codevectors are used to model the points far away from the origin even though they are sparse. This helps code the spike information in greater detail and hence minimize reconstruction errors. On the other hand, SOM-DL wastes a lot of points in modeling the noise cluster, as shown in Figure A-5. Further, not only does SOM-DL have a large number of parameters which need to be fine tuned for optimal


performance, it also takes an immense amount of time to train the network, making it suitable only for offline training.

Figure A-4. 16 point quantization of the training data using the WLBG algorithm.

Figure A-5. Two-dimensional embedding of the training data and the codevectors in a 5×5 lattice obtained using the SOM-DL algorithm.

We test this on separate test data generated to emulate a real neural spike signal. A small region is highlighted in Figure A-6, which shows the comparison between the original and the reconstructed signal. Clearly, weighted LBG does a very good job of preserving


spike features. Also notice the suppression of noise in the non-spike region. This denoising ability is one of the strengths of this algorithm and is attributed to the novel weighting factor we selected.

Figure A-6. Performance of the WLBG algorithm in reconstructing spike regions in the test data.

We report the SNR obtained using WLBG, SOM-DL and SOM in Table A-1. As can be seen, there is a huge increase of 15 dB in the SNR of the spike regions of the test data compared to SOM-DL, which only marginally improves the SNR over SOM. Obviously, by concentrating more on the spike region, our performance on the non-spike region suffers, but the decrease is negligible compared to SOM-DL. It should be noted that good reconstruction of the spike region is of utmost importance, and hence the only measure which should be considered is the SNR in the spike region. Further, the result reported here for WLBG uses 16 codevectors, far fewer than the 25 codevectors (5×5 lattice) used by the SOM-DL and SOM algorithms.

Table A-1. SNR of the spike region and of the whole test data obtained using the WLBG, SOM-DL and SOM algorithms.

SNR of            WLBG       SOM-DL    SOM
Spike region      31.12 dB   16.8 dB   14.6 dB
Whole test data   8.12 dB    8.6 dB    9.77 dB


Figure A-7 shows the firing probability of the WLBG codebook. Codevector 16 models the noisy part of the signal and hence fires most of the time. It should be noted that, in general, neural data has a very sparse number of actual spikes. The probability values for the codevectors are given in Table A-2.

Figure A-7. Firing probability of the WLBG codebook on the test data.

The entropy of this distribution is

H(C) = -\sum_{i=1}^{16} p_i \log(p_i) = 0.2141.


Table A-2. Probability values obtained for the codevectors and used in Figure A-7.

Codevector   Probability
1            0.0007
2            0.0017
3            0.0007
4            0.0014
5            0.0003
6            0.0007
7            0.0010
8            0.0193
9            0.0003
10           0.0018
11           0.0007
12           0.0045
13           0.0002
14           0.0015
15           0.0006
16           0.9647

To transmit the raw signal of a single channel sampled at 20 kHz with 16 bits of resolution, the number of bits needed is 20k × 16 = 320 kbps. With the 2-D embedding and entropy coding of the codevector indices, the coded rate is roughly (20000/2) × 0.2141 ≈ 2.1 kbps. Thus we achieve a compression ratio of 150 and at the same time maintain a 32 dB SNR in the spike region. Further, on real datasets, where a 10-D embedding is generally used, the compression ratio would increase to 750, with only 428 bps needed to transmit the data. This is a significant achievement and would help alleviate the bandwidth problem faced in transmitting data in BMI experiments.

The real-time implementation uses a DSP board [Cieslewski et al. 2006] that couples with the University of Florida's existing technology, the PicoRemote (a battery powered neural data processor and wireless transmitter) [Cheney et al. 2007]. This board consists of a DSP from Texas Instruments (TMS320F28335), a low power Altera Max II CPLD, and a Nordic Semiconductor ultra-low-power transceiver (nRF24L01). The PicoDSP can provide up to 150 MIPS and can operate for nearly 4 hours in low power modes.


Figure A-8. A block diagram of the PicoDSP system.

This system is designed to have a maximum of 8 PicoRemotes, each with a maximum of 8 channels (20 kHz, 12 bits of resolution), resulting in a total of 64 channels of neural data. Figure A-8 shows the block diagram of this system.

The signal is sampled in real time using the A/D conversion in the PicoRemotes. All these daughterboards are connected to the CPLD, which buffers the data and controls the flow to the DSP. The neural data captured by the DSP are then encoded using the codevectors generated by the WLBG algorithm. When the wireless buffer is full with compressed data, it is transmitted wirelessly using the DSP's onboard Serial Peripheral Interface (SPI), which is connected to the wireless transceiver. A brief summary of the results follows. This is ongoing work, and interested readers should refer to [Goh et al. 2008].

Two different neural signals were used to test this architecture. The first is generated by a Bionic Neural Simulator and has an ideal signal to noise ratio (SNR) of 55 dB. The second signal is a real neuronal recording from a chronically implanted microwire electrode.


There are a number of ways to build upon this work. We plan to construct an efficient k-d tree search algorithm for the codebook, taking into account the weighting factor and the probability of the code vectors. This would further speed up the algorithm. We would also like to use advanced encoding techniques like entropy coding and achieve bit


The research summarized here proposes a new method of clustering neural spike waveforms for spike sorting [Rao et al. 2006b]. After detecting the spikes using a threshold detector, we use principal component analysis (PCA) to get the first few PCA components of the data. Clustering on these PCA components is achieved by maximizing the Cauchy-Schwarz PDF divergence measure, which uses the Parzen window method to non-parametrically estimate the pdf of the clusters. We provide a comparison with other clustering techniques used in spike sorting, like K-means and the Gaussian mixture model, showing the superiority of our method in terms of classification results and computational complexity.

[Fetz 1992]. Recently, this knowledge of neurophysiology has been applied to Brain Machine Interface (BMI) experiments where multielectrode arrays are used to monitor the electrical activity of hundreds of neurons from the motor, premotor, and parietal cortices [Wessberg et al. 2000]. In multielectrode BMI experiments, experimenters are faced with the labor-intensive task of analyzing each of the extracellular recordings for the signatures of electrical activity related to the neurons surrounding the electrode tip. Separating these different neural sources, a term called "spike sorting", helps the neurophysiologist to study and infer the role played by each individual neuron with respect to the experimental task at hand.

Spike sorting is based upon the property that every neuron has its own characteristic "spike" shape, which is dependent on its intrinsic electrochemical dynamics as well as


[Sanchez et al. 2005]. To overcome these challenges, many signal processing and machine learning techniques have been successfully applied, which are well summarized in [Lewicki 1998]. Modern methods use multiple electrodes and sophisticated techniques like Independent Component Analysis (ICA) to address the issue of overlapping and similarity between spikes [Brown et al. 2004]. Nevertheless, in many cases, good single-unit activity can be obtained with a simple hardware threshold detector [Wheeler 1999]. After detection, the classification is done using either template matching or clustering of the principal components (PCA) of the waveforms [Lewicki 1998; Wood et al. 2004]. The advantage with the PCA of the spike waveforms is that it dynamically exploits differences in the variance of the waveshapes to discriminate and cluster neurons.

A common clustering algorithm which is used extensively on the PCA of the waveforms is the ubiquitous K-means [MacKay 2003]. For spike sorting, the K-means algorithm always clusters neurons, but there is no guarantee that it converges to the optimum solution, leading to incorrect sorts. The result depends on the original cluster centers (the random initialization problem) as well as on the fact that K-means assumes hyperspherical or hyperellipsoidal clusters. Researchers have employed other clustering techniques to overcome this problem. Lewicki [1994] used Bayesian clustering to successfully classify neural signals. The advantage of using the Bayesian framework is that it is possible to quantify the certainty of the classification. Recently, Hillel et al. [2004] extended this technique by automating the process of detection and classification. Other clustering techniques like the Gaussian Mixture Model (GMM) [Duda et al. 2000] and Support Vector Machines (SVM) [Vogelstein et al. 2004] have also been applied


In this section, we propose the Cauchy-Schwarz PDF divergence measure for clustering of neural spike data. We show that this method not only yields superior results to K-means but also is computationally less expensive, with O(N) complexity for classifying online test samples.

B.2.1 Electrophysiological Recordings

An example of the recorded extracellular potentials is shown in Figure B-1. Here, the action potentials from two neurons can be identified (with asterisks) by the differences in amplitude and width of the waveshapes. As seen in Figure B-1, a threshold of 25 µV is sufficient to detect each of the two waveshapes. A set of unique waveshapes was constructed from the thresholded waveforms based upon the width, which was measured 0.6 ms to the left and 1.5 ms to the right of the threshold crossing. Using electrophysiological parameters (amplitude and width) of the spike, artifact signals (e.g., electrical noise, movement artifact) were removed. The peak-to-peak amplitude, waveform shape, and interspike interval (ISI) were then evaluated to ensure that the detected spikes
had a characteristic and distinct waveform shape when compared with other waveforms in the same channel. Next, the first ten principal components (PC) were computed from all of the detected spike waveforms. Of all the PCs, only the first two eigenvalues were greater than the value 1 and captured the majority of the variance, as shown in Figure B-2, where two overlapping clusters of points correspond to each of the detected waveshapes. The challenge here is to cluster each of the neurons using automated techniques.

Figure B-1. Example of extracellular potentials from two neurons.

Jenssen [2005].
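
A minimal sketch of the PCA preprocessing described above is given below; it assumes the detected spike snippets are stacked as rows of a matrix and simply keeps the two leading components.

```python
import numpy as np

def spike_pca_features(waveforms, n_components=2):
    """Project detected spike snippets (rows of an N x L array) onto their leading PCs."""
    X = waveforms - waveforms.mean(axis=0)               # remove the mean waveform
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]                    # descending eigenvalue order
    return X @ eigvecs[:, order[:n_components]], eigvals[order]

# Hypothetical usage: 'spikes' holds the thresholded spike snippets, one per row.
# features, eigvals = spike_pca_features(spikes)         # 'features' feeds the clustering
```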


Figure B-2. Distribution of the two spike waveforms along the first and the second principal components.

Since the pdfs are non-negative, by the Cauchy-Schwarz inequality the following condition holds:

\left( \int p(u) q(u) \, du \right)^2 \le \int p^2(u) \, du \int q^2(u) \, du.

The Cauchy-Schwarz PDF divergence is then defined as

D_{cs}(p \| q) = -\log J_{cs}(p, q), \qquad J_{cs}(p, q) = \frac{\int p(u) q(u) \, du}{\sqrt{\int p^2(u) \, du \int q^2(u) \, du}}.   (B-1)

Note that J_{cs}(p, q) \in (0, 1], and hence the divergence measure D_{cs}(p \| q) \ge 0, with equality iff the two pdfs are the same.

Given the sample populations {x_1, x_2, ..., x_{N_1}} and {y_1, y_2, ..., y_{N_2}} of the random variables X and Y, one can estimate this divergence measure directly from the data using ideas from information theoretic learning [Príncipe et al. 2000]. The readers are referred to Chapter 2 for a detailed derivation of the following non-parametric estimator:

\hat{J}_{cs} = \frac{\frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} G_{ij, 2\sigma^2 I}(x_i, y_j)}{\sqrt{\frac{1}{N_1^2} \sum_{i,j=1}^{N_1} G_{ij, 2\sigma^2 I}(x_i, x_j) \; \frac{1}{N_2^2} \sum_{i,j=1}^{N_2} G_{ij, 2\sigma^2 I}(y_i, y_j)}},

where G_{ij, 2\sigma^2 I}(\cdot, \cdot) denotes a Gaussian kernel with covariance 2\sigma^2 I evaluated at the difference of its arguments.


To recursively find the best labeling of each data point so as to minimize \hat{J}_{cs}, we first have to express the above quantity in terms of the membership vectors m_i as

\hat{J}_{cs} = \frac{\frac{1}{2} \sum_{i,j=1}^{N} (1 - m_i^T m_j) \, G_{ij, 2\sigma^2 I}}{\sqrt{\sum_{i,j=1}^{N} (1 - m_{i1})(1 - m_{j1}) \, G_{ij, 2\sigma^2 I} \; \sum_{i,j=1}^{N} (1 - m_{i2})(1 - m_{j2}) \, G_{ij, 2\sigma^2 I}}}.

With m_{ik}, k = 1, 2, initialized to any value in the interval [0, 1], the optimization problem can be formulated as

\min_{m_1, m_2, \ldots, m_N} \hat{J}_{cs}(m_1, m_2, \ldots, m_N) \quad \text{subject to} \quad m_j^T \mathbf{1} - 1 = 0, \; j = 1, 2, \ldots, N.
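
For illustration, the membership-based cost can be evaluated as in the sketch below; it follows the two-cluster expression as reconstructed above (a Gaussian kernel with covariance 2σ²I, whose normalization constants cancel in the ratio), so the exact constants should be checked against Jenssen [2005].

```python
import numpy as np

def gauss_kernel_matrix(X, sigma):
    """Pairwise Gaussian kernel with covariance 2*sigma^2*I (normalization cancels in the ratio)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (4.0 * sigma ** 2))

def j_cs(M, G):
    """Two-cluster CS cost from soft memberships M (N x 2) and kernel matrix G (N x N)."""
    cross = 0.5 * np.sum((1.0 - M @ M.T) * G)                     # between-cluster term
    within1 = np.sum(np.outer(1.0 - M[:, 1], 1.0 - M[:, 1]) * G)  # ~ cluster-1 potential
    within2 = np.sum(np.outer(1.0 - M[:, 0], 1.0 - M[:, 0]) * G)  # ~ cluster-2 potential
    return cross / np.sqrt(within1 * within2)

# Minimizing j_cs over M, with each row constrained to sum to one, yields the fuzzy labels.
```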


Each membership vector m_j is updated by taking a step along the negative gradient \partial \hat{J}_{cs} / \partial m_j and renormalizing so that the constraint m_j^T \mathbf{1} = 1 is maintained. After the convergence of the algorithm, one can get crisp membership vectors by designating the maximum value of the elements of each m_i, i = 1, 2, ..., N, to one, and the rest to zero.

This technique implements a constrained gradient descent search, with a built-in variable step-size for each coordinate direction. The generalization to more than two clusters can be found in [Jenssen 2005]. Jenssen also demonstrated the superior performance of this algorithm compared to GMM and Fuzzy K-means (FKM) for many nonconvex clustering problems. For d-dimensional data, the kernel size is calculated using Silverman's rule of thumb given by equation B-3, where \sigma_X^2 = d^{-1} \sum_i \Sigma_{X,ii} and \Sigma_{X,ii} are the diagonal elements of the sample covariance matrix. To avoid local minima, we anneal the kernel size from 2\sigma_{opt} to 0.5\sigma_{opt} over a period of 100 iterations. Calculating the gradients requires O(N^2) computations. To reduce the complexity, we stochastically sample the membership space by randomly selecting M membership vectors, where M << N.

equation B-3 using the corresponding training samples and then averaged to obtain an estimate of \sigma_{opt}.

Since information potential and entropy are inversely related, we assign the test point to the cluster which incurs the maximum change in information potential. The change in information potential of the clusters C_1 and C_2 due to a new test point x_t is denoted \Delta V_{C_1} and \Delta V_{C_2}, respectively. Thus the classification rule can be summed up as

If \Delta V_{C_1} > \Delta V_{C_2}  →  classify as a Cluster 1 sample,
If \Delta V_{C_2} > \Delta V_{C_1}  →  classify as a Cluster 2 sample.   (B-5)

Note that this computation takes just O(N) calculations, and hence is much more efficient than calculating the change in the entire CS divergence measure.

B.4.1 Clustering of PCA Components

Figure B-3A. As seen in Figure B-3B, K-means clearly fails to


Using equation B-5, the remaining 2234 points were classified as belonging to either of the clusters by just comparing with the training samples, as shown in Figure B-3C. For comparison, we have also plotted the training samples along with the test samples. The algorithm took 23 seconds to classify all 2234 test points. We also used another method of comparison where a new test point is not only compared to the training samples, but also to the previous test points which have already been classified. This will linearly increase the computational complexity, but generally gives better classification results. Nevertheless, in our case, the two methods gave identical results, which may be due to the fact that the training samples defined the clusters sufficiently well.
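
A rough sketch of the O(N) assignment rule is shown below; since equation B-4 is not reproduced above, the change in information potential is taken to be V(C ∪ {x_t}) − V(C) with the usual Gaussian information potential, which is an assumption about the exact form used.

```python
import numpy as np

def gauss(d2, sigma):
    """Unnormalised Gaussian kernel for squared distances d2."""
    return np.exp(-d2 / (2.0 * sigma ** 2))

def delta_ip(cluster, x_t, sigma):
    """Assumed change in information potential of a cluster when x_t is added.

    Uses V(C) = (1/Nc^2) * sum_{i,j in C} G_sigma(x_i - x_j); the double sum is
    recomputed here for clarity, but caching it leaves only O(Nc) new cross terms.
    """
    n = len(cluster)
    pair_sum = gauss(np.sum((cluster[:, None] - cluster[None, :]) ** 2, -1), sigma).sum()
    cross = gauss(np.sum((cluster - x_t) ** 2, axis=1), sigma).sum()
    return (pair_sum + 2.0 * cross + 1.0) / (n + 1) ** 2 - pair_sum / n ** 2

# Rule B-5: assign x_t to the cluster whose information potential changes the most.
# label = 1 if delta_ip(C1, x_t, sigma) > delta_ip(C2, x_t, sigma) else 2
```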


Figure B-3. Comparison of clustering based on CS divergence with K-means for spike sorting. B) Clustering of training data using K-means. C) Online classification of test points.



We provide here a brief summary of some kernel size selection methods which we have used in this thesis. There exists an extensive literature in this field and we certainly do not intend to summarize everything in this chapter. For a thorough treatment of this topic, the readers are directed to the following references [Katkovnik and Shmulevich 2002; Wand and Jones 1995; Fukunaga 1990; Silverman 1986].

One of the earliest approaches taken to find an optimal kernel size was by minimizing the mean square error (MSE) between the estimated and the original pdf. The MSE between the estimator \hat{f}(x) and the true density f(x) at a particular point x is given by

MSE\{\hat{f}(x)\} = E\{[\hat{f}(x) - f(x)]^2\},

and the optimal kernel size minimizing this criterion was derived in [Silverman 1986], as shown in equation C-1.
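
The sketch below gives a rule-of-thumb kernel size for an N × d sample; the constant follows the multivariate form commonly used in the ITL literature and is only assumed to match equation C-1.

```python
import numpy as np

def silverman_bandwidth(X):
    """Rule-of-thumb kernel size for an (N x d) sample with a Gaussian kernel.

    sigma_X^2 is the average of the diagonal entries of the sample covariance;
    the constant follows the multivariate form common in the ITL literature.
    """
    N, d = X.shape
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    sigma_x = np.sqrt(np.mean(np.diag(cov)))
    return sigma_x * (4.0 / ((2.0 * d + 1.0) * N)) ** (1.0 / (d + 4.0))
```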


Another technique of optimality is the maximum likelihood solution. Consider a random sample {x_1, ..., x_N}. The general Parzen pdf estimate can be formulated as

p(x) = \frac{1}{N} \sum_{i=1}^{N} G_{\sigma}(x - x_i),

and the kernel size is chosen to maximize the likelihood of the samples under this estimate (equation C-2). The maximization is carried out as a brute-force line search. A good upper limit is the kernel estimate from Silverman's rule, since it yields an overestimated result. With a log-spaced line search, the \sigma that yields the maximum value in equation C-2 is selected as a good approximation to the true optimal \sigma.

One could also use the K nearest neighbor (KNN) technique for each sample x_i to get a local estimate of the kernel size. This local estimate \sigma_i is computed as the mean of the distances to the K nearest neighbors, as shown below:

\sigma_i = \frac{1}{K} \sum_{j=1}^{K} \| x_i - x_{(j)} \|,

where x_{(j)} denotes the j-th nearest neighbor of x_i.
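
As a rough illustration of these last two selectors, the sketch below runs a log-spaced line search over a leave-one-out Parzen log-likelihood (assumed here as the form of equation C-2) and computes the KNN-based local kernel sizes as mean distances to the K nearest neighbours.

```python
import numpy as np

def loo_log_likelihood(X, sigma):
    """Leave-one-out log likelihood of the data under a Gaussian Parzen estimate."""
    N, d = X.shape
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    np.fill_diagonal(K, 0.0)                          # leave the point itself out
    return np.sum(np.log(K.sum(axis=1) / (N - 1)))

def ml_kernel_size(X, sigma_max, num=50):
    """Brute-force log-spaced line search; sigma_max is typically Silverman's estimate."""
    grid = np.logspace(np.log10(sigma_max) - 2.0, np.log10(sigma_max), num)
    return grid[np.argmax([loo_log_likelihood(X, s) for s in grid])]

def knn_kernel_sizes(X, K=10):
    """Local kernel size per sample: mean distance to its K nearest neighbours."""
    dist = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    dist.sort(axis=1)
    return dist[:, 1:K + 1].mean(axis=1)              # skip the zero self-distance
```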



A. Amin. Structural description to recognising arabic characters using decision tree learning techniques. In T. Caelli, A. Amin, R. P. W. Duin, M. S. Kamel, and D. de Ridder, editors, SSPR/SPR, volume 2396 of Lecture Notes in Computer Science, pages 152-158. Springer, 2002. ISBN 3-540-44011-9.
H. B. Barlow. Unsupervised learning. Neural Computation, 1(3):295-311, 1989.
M. S. Bartlett. Face Image Analysis by Unsupervised Learning. Springer, 2001.
S. Becker. Unsupervised learning procedures for neural networks. International Journal of Neural Systems, 2:17-33, 1991.
S. Becker. Unsupervised learning with global objective functions. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 997-1000. MIT Press, Cambridge, MA, USA, 1998.
S. Becker and G. E. Hinton. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163, 1992.
A. Bell and T. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
R. E. Blahut. Computation of channel capacity and rate distortion function. IEEE Trans. on Information Theory, IT-18:460-473, 1972.
C. A. Bossetti, J. M. Carmena, M. A. L. Nicolelis, and P. D. Wolf. Transmission latencies in a telemetry-linked brain-machine interface. IEEE Transactions on Biomedical Engineering, 51(6):919-924, June 2004.
E. M. Brown, R. E. Kass, and P. P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5):456-461, May 2004.
T. Caelli and W. F. Bischof. Machine Learning and Image Interpretation. Springer, 1997.
M. A. Carreira-Perpiñán. Mode-finding for mixtures of gaussian distributions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(11):1318-1323, November 2000.
M. A. Carreira-Perpiñán. Fast nonparametric clustering with gaussian blurring mean-shift. In W. W. Cohen and A. Moore, editors, ICML, pages 153-160. ACM, 2006. ISBN 1-59593-383-2.
D. Cheney, A. Goh, J. Xu, K. Gugel, J. G. Harris, J. C. Sanchez, and J. C. Príncipe. Wireless, in vivo neural recording using a custom integrated bioamplifier and the pico system. In International IEEE/EMBS Conference on Neural Engineering, pages 19-22, May 2007.


J. Cho, A. R. C. Paiva, S. Kim, J. C. Sanchez, and J. C. Príncipe. Self-organizing maps with dynamic learning for signal reconstruction. Neural Networks, 20(11):274-284, March 2007.
G. Cieslewski, D. Cheney, K. Gugel, J. Sanchez, and J. C. Príncipe. Neural signal sampling via the low power wireless pico system. In Proceedings of IEEE EMBS, pages 5904-5907, August 2006.
D. A. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, March 1996.
D. Comaniciu. An algorithm for data-driven bandwidth selection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(2):281-288, 2003.
D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(5):603-619, May 2002.
D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 142-149, June 2000.
P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287-314, 1994.
C. Connolly. The relationship between colour metrics and the appearance of three-dimensional coloured objects. Color Research and Applications, 21:331-337, 1996.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd edition. Wiley-Interscience, 2000.
D. Erdogmus. Information Theoretic Learning: Renyi's Entropy and its Applications to Adaptive System Training. PhD thesis, University of Florida, 2002.
D. Erdogmus and U. Ozertem. Self-consistent locally defined principal surfaces. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 15-20, April 2007.
M. Fashing and C. Tomasi. Mean shift is a bound optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(3):471-474, March 2005.
E. E. Fetz. Are movement parameters recognizably coded in the activity of single neurons. Behavioral and Brain Sciences, 15(4):679-690, March 1992.
R. Forsyth. Machine Learning: Principles and Techniques. Chapman and Hall, 1989.
K. S. Fu. Syntactic Methods in Pattern Recognition. Academic Press, New York, 1974.


K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function with applications in pattern recognition. IEEE Trans. on Information Theory, 21(1):32-40, January 1975.
A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer, 1991.
Z. Ghahramani. Unsupervised learning. In O. Bousquet, U. von Luxburg, and G. Ratsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 72-112. Springer, 2003. ISBN 3-540-23122-6.
A. Goh, S. Craciun, S. Rao, D. Cheney, K. Gugel, J. C. Sanchez, and J. C. Príncipe. Wireless transmission of neural recordings using a portable real-time discrimination/compression algorithm. In Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society, August 2008.
L. Greengard and J. Strain. The fast gauss transform. SIAM Journal on Scientific Computing, 24:79-94, 1991.
T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502-516, 1989.
S. Haykin. Neural Networks: A Comprehensive Foundation, 2nd edition. New Jersey, Prentice Hall, 1999.
A. B. Hillel, A. Spiro, and E. Stark. Spike sorting: Bayesian clustering of non-stationary data. In Proceedings of NIPS, December 2004.
D. Hume. An Enquiry Concerning Human Understanding. Oxford University Press, 1999.
A. Hyvarinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.
R. Jenssen. An Information Theoretic Approach to Machine Learning. PhD thesis, University of Tromso, 2005.
C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24:1-10, 1991.
V. Katkovnik and I. Shmulevich. Kernel density estimation with adaptive varying window size. Pattern Recognition Letters, 23:1641-1648, 2002.
B. Kegl. Principal curves: learning, design, and applications. PhD thesis, Concordia University, 1999.
B. Kegl and A. Krzyzak. Piecewise linear skeletonization using principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):59-74, 2002.


T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.
S. Lakshmivarahan. Learning Algorithms: Theory and Applications. Springer-Verlag, Berlin-Heidelberg-New York, 1981.
P. Langley. Elements of Machine Learning. Morgan Kaufmann, 1996.
T. Lehn-Schiøler, A. Hegde, D. Erdogmus, and J. C. Príncipe. Vector-quantization using information theoretic concepts. Natural Computing, 4:39-51, Jan. 2005.
M. S. Lewicki. Bayesian modeling and classification of neural signals. Neural Computation, 6:1005-1030, May 1994.
M. S. Lewicki. A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems, 9(4):R53-R78, 1998.
Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantizer design. IEEE Trans. on Communications, 28(1):84-95, Jan. 1980.
R. Linsker. Self-organization in a perceptual network. IEEE Computer, 21(3):105-117, March 1988a.
R. Linsker. Towards an organizing principle for a layered perceptual network. In D. Z. Anderson, editor, Neural Information Processing Systems - Natural and Synthetic. American Institute of Physics, New York, 1988b.
E. Lutwak, D. Yang, and G. Zhang. Cramer-Rao and moment-entropy inequalities for Renyi entropy and generalized Fisher information. IEEE Transactions on Information Theory, 51(2):473-478, February 2005.
M. A. Carreira-Perpiñán. Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):767-776, May 2007.
D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proc. 8th Int'l Conf. Computer Vision, volume 2, pages 416-423, July 2001.
A. Y. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Proceedings of NIPS, 2002.
M. A. L. Nicolelis. Brain machine interfaces to restore motor function and probe neural circuits. Nature Reviews Neuroscience, 4:417-422, 2003.


E. Oja. A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15:267-273, 1982.
E. Oja. Neural networks, principal components and subspaces. International Journal of Neural Systems, 1(1):61-68, 1989.
R. T. Olszewski. Generalized Feature Extraction for Structural Pattern Recognition in Time-Series Data. PhD thesis, Carnegie Mellon University, 2001.
E. Parzen. On the estimation of probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Berlin-Heidelberg-New York, 1977.
J. C. Príncipe, D. Xu, and J. Fisher. Information theoretic learning. In Simon Haykin, editor, Unsupervised Adaptive Filtering, pages 265-319. John Wiley, 2000.
D. Ramanan and D. A. Forsyth. Finding and tracking people from the bottom up. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, pages 467-474, June 2003.
M. Ranzato, Y. Boureau, S. Chopra, and Y. LeCun. A unified energy-based framework for unsupervised learning. In Proc. of the 11-th Int'l Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.
S. Rao, S. Han, and J. C. Príncipe. Information theoretic vector quantization with fixed point updates. In Proceedings of the International Joint Conference on Neural Networks, pages 1020-1024, August 2007a.
S. Rao, A. M. Martins, W. Liu, and J. C. Príncipe. Information theoretic mean shift algorithm. In Proceedings of IEEE Conf. on Machine Learning for Signal Processing, pages 155-160, Sept. 2006a.
S. Rao, A. M. Martins, and J. C. Príncipe. Mean shift: An information theoretic perspective. Submitted to Pattern Recognition Letters, 2008.
S. Rao, A. R. C. Paiva, and J. C. Príncipe. A novel weighted LBG algorithm for neural spike compression. In Proceedings of the International Joint Conference on Neural Networks, pages 1883-1887, August 2007b.
S. Rao, J. C. Sanchez, and J. C. Príncipe. Spike sorting using nonparametric clustering via Cauchy Schwartz pdf divergence. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, volume 5, pages 881-884, May 2006b.


A. Renyi. Some fundamental questions of information theory. In Selected Papers of Alfred Renyi, pages 526-552. Akademia Kiado, 1976.
W. Rudin. Principles of Mathematical Analysis. McGraw-Hill Publishing Co., 1976.
J. Sanchez, J. Pukala, J. C. Príncipe, F. Bova, and M. Okun. Linear predictive analysis for targeting the basal ganglia in deep brain stimulation surgeries. In Proceedings of the 2nd Int. IEEE Workshop on Neural Engineering, Washington, 2005.
S. Sandilya and S. R. Kulkarni. Principal curves with bounded turn. IEEE Transactions on Information Theory, 48(10):2789-2793, 2002.
I. Santamaria, P. P. Pokharel, and J. C. Príncipe. Generalized correlation function: Definition, properties and application to blind equalization. IEEE Transactions on Signal Processing, 54(6):2187-2197, 2006.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
R. Tibshirani. Principal curves revisited. Statistics and Computing, 2:183-190, 1992.
N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, pages 368-377, 1999.
R. J. Vogelstein, K. Murari, P. H. Thakur, G. Cauwenberghs, S. Chakrabartty, and C. Diehl. Spike sorting with support vector machines. In Proceedings of IEEE EMBS, pages 546-549, September 2004.
M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman and Hall, 1995.
S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley, 1985.
J. Wessberg, C. R. Stambaugh, J. D. Kralik, P. D. Beck, M. Laubach, J. K. Chapin, J. Kim, S. J. Biggs, M. A. Srinivasan, and M. A. L. Nicolelis. Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810):361-365, November 2000.
B. C. Wheeler. Automatic discrimination of single units. In Miguel A. L. Nicolelis, editor, Methods for Neural Ensemble Recordings, chapter 4, pages 61-78. CRC Press, 1999.
K. D. Wise, D. J. Anderson, J. F. Hetke, D. R. Kipke, and K. Najafi. Wireless implantable microsystems: high-density electronic interfaces to the nervous system. Proceedings of the IEEE, 92(1):76-97, January 2004.


G. Wyszecki and W. S. Stiles. Color Science: Concepts and Methods, Quantitative Data and Formulae. Wiley, 1982.
C. Yang, R. Duraiswami, and L. Davis. Efficient mean-shift tracking via a new similarity measure. In Proceedings of IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 176-183, June 2005.


Sudhir Madhav Rao was born in Hyderabad, India, in September of 1980. He received his B.E. from the Department of Electronics and Communications Engineering, Osmania University, India, in 2002. He obtained his M.S. in electrical and computer engineering in 2004 from the University of Florida with emphasis in the area of signal processing. In the Fall of 2004, he joined the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida and started working towards his Ph.D. under the supervision of Dr. Jose C. Príncipe. He received his Ph.D. in the Summer of 2008. His primary interests include signal processing, machine learning and adaptive systems. He is a member of Eta Kappa Nu, Tau Beta Pi and IEEE.