Reproducing Kernel Hilbert Space Methods for Information Theoretic Learning

MISSING IMAGE

Material Information

Title:
Reproducing Kernel Hilbert Space Methods for Information Theoretic Learning
Physical Description:
1 online resource (135 p.)
Language:
english
Creator:
Sanchez Giraldo, Luis Gonzalo
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Principe, Jose C
Committee Members:
Wong, Tan F
Rangarajan, Anand
Rao, Murali

Subjects

Subjects / Keywords:
information -- kernel -- learning
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
Electronic Thesis or Dissertation
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )

Notes

Abstract:
Information theory provide principled models for different machine learning problems, such as clustering, dimensionality reduction, classification, and many others. However, using the definitions of information theoretic quantities directly, posses a challenging estimation problem that often leads to strong simplifications, such as Gaussian models, or the use of plug in density estimators that restrict the kind of representations that we can use on the data. In this work, we adopt a data-driven perspective based on reproducing kernel Hilbert space methods that leads to successful application of information theoretic principles without resorting to estimation of the underlying probability distributions. The proposed methodology offers several advantages compared to other state of the art work such as entropic graphs because it can provide quantities that are more amenable for optimization (differentiability) as well as the representation flexibility that kernel methods provide. The work is divided into two main parts. First, we introduce an information theoretic objective function for unsupervised learning called the principle of relevant information. We employ an information theoretic reproducing kernel Hilbert space (RKHS) formulation, which can overcome some of the limitations of previous approaches based on Parzen's density estimation. Results are competitive with kernel-based feature extractors such as kernel PCA. Moreover, the proposed framework goes further on the relation between information theoretic learning, kernel methods and support vector algorithms. For the second part, which is motivated by the results and insights obtained by the RKHS formulation of the principle of relevant information, we develop a framework for information theoretic learning based on infinitely divisible matrices. We formulate an entropy-like functional on positive definite matrices based on Renyi's definition and examine some key properties of this functional that lead to the concept of infinite divisibility. This formulation avoids the plug in estimation of density and allows to use the representation power that comes with the use of positive definite kernels. Learning from data comes from using this functional on positive definite matrices that correspond to Gram matrices constructed by pairwise evaluations of infinitely divisible kernels on data samples. We show how we can define analogues to quantities such as conditional entropy that can be employed to formulate different learning problems, and provide efficient ways to solve the optimization that arise from these formulations. Numerical results using the proposed quantities to test independence, metric learning, and image super-resolution show that the proposed framework can obtain state of the art performances.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2012.
Local:
Adviser: Principe, Jose C.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2013-06-30
Statement of Responsibility:
by Luis Gonzalo Sanchez Giraldo.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2012
System ID:
UFE0044655:00001


This item is only available as the following downloads:


Full Text

PAGE 1

REPRODUCINGKERNELHILBERTSPACEMETHODSFORINFORMATION THEORETICLEARNING By LUISGONZALOS ANCHEZGIRALDO ADISSERTATIONPRESENTEDTOTHEGRADUATESCHOOL OFTHEUNIVERSITYOFFLORIDAINPARTIALFULFILLMENT OFTHEREQUIREMENTSFORTHEDEGREEOF DOCTOROFPHILOSOPHY UNIVERSITYOFFLORIDA 2012

PAGE 2

c r 2012LuisGonzaloS anchezGiraldo 2

PAGE 3

Tomypresentandfuturefamily. 3

PAGE 4

ACKNOWLEDGMENTS IwanttoexpressmymostsinceregratitudetomyPh.D.adviso rDr.Jos ePrncipe forhisenthusiasticandsupportingattitudetowardsmywor k.Ialsowanttosaythanks forhispatienceandunderstandingthatreectsthewisdomo fatruementor.Iam veryindebtedwithProfessorMuraliRao,forhisinvaluable help,friendship,andlove formathematics.IcansaythatbothDr.PrncipeandDr.Rao areexamplesofmen withcontagiouspassionforwhattheydo.APh.D.isajourney fullofupsanddowns andIstronglybelievethattheirexcitementformyworkhelp edmethroughthehard times.IalsowanttoextendmygratitudetoDr.TanWongandDr .AnandRangarajan fortheirvaluablecommentsandsharpremarks.Theyreallyh elpshapemythinking andenhancetheperspectiveIhaveofmywork.Duringthisyea rs,Ihadthegreat opportunitytobepartofanamazinggroupoftalentedstuden ts,butmostimportantly valuablepersons.IreallyenjoyedthetimeatCNELandallth ethingswedidasthe familyawayfromfamily thatweare.EspecialthankstoErionHasanbelliuforhis friendshipanddisinterestedhelpandsupport.SohanSethf orhisfriendship,help, andtheinterestingdiscussions,AlexanderSinghgraciasp orsuamistadyayudaen elmomentooportuno,ShalomDarmajan,Il“Memming”Park,St efanCracium,Austin Brockmeier,LinLi,HectorGalloza,GoktugCinar,RakeshCh alasani,AbhishekSingh, andalltheCNELersforthegoodtimes. Finally,Iwanttothankmyfamilyforalwaysbeingthereform e,andtoGodfor bringingJihyeBaeintomylife.Jihyenaesarang,Icannotas kformorewhenIalready haveitallwithyou.Thisworkisasyoursasitismine. 4

PAGE 5

TABLEOFCONTENTS page ACKNOWLEDGMENTS ..................................4 LISTOFTABLES ......................................8 LISTOFFIGURES .....................................9 ABSTRACT .........................................10 CHAPTER 1INTRODUCTION ...................................12 1.1TheProblemofLearning ...........................12 1.2StatisticalLearningPerspective ........................13 1.3RateDistortion,theInformationBottleneck,andLearn ing .........15 1.3.1Example:RateDistortionandPCA ..................16 1.3.2TheInformationBottleneck ......................17 1.4ContributionsofthePresentWork ......................19 2MATHEMATICALPRELIMINARIES ........................20 2.1ReproducingKernelHilbertSpaces .....................20 2.2TheCovarianceFunction ...........................23 2.3RKHSsinMachineLearning .........................24 3THEPRINCIPLEOFRELEVANTINFORMATION ................27 3.1InformationTheoreticLearninginaNutshell .................27 3.2ThePrincipleofRelevantInformation ....................28 3.2.1PRIasaSelf-organizationMechanism ................30 3.2.2OntheInuenceof ..........................30 3.2.3ANoteonInformationTheoreticVectorQuantization ........34 3.2.4PracticalIssuesandOpenQuestions .................35 3.3AlternativeSolutionstothePrincipleofRelevantInfo rmation .......36 3.3.1ThePRIasaWeightingProblem ...................36 3.3.2SequentialMinimalOptimizationforthePRI .............39 3.3.2.1Decompositionintosmallersubproblems .........41 3.3.2.2Sequentialminimaloptimizationalgorithm .........43 3.3.2.3SMOalgorithm ........................44 3.3.2.4Selectingtheworkingset ..................45 3.3.3Experiments ...............................46 3.3.3.1Syntheticdata ........................46 3.3.3.2ImageretrievalwithpartiallyoccludeddataMNIST ....48 3.4TheInformationPotentialRKHSFormulationofthePRI ..........52 5

PAGE 6

4ESTIMATINGENTROPY-LIKEQUANTITIESWITHKERNELS .........56 4.1Motivation ....................................57 4.1.1HilbertSpaceRepresentationofData ................58 4.1.2TheCross-InformationPotentialRKHS ................60 4.2PositiveDeniteMatrices,andRenyi'sEntropyAxioms ...........61 4.2.1EntropyinequalitiesforHadamardProducts .............64 4.2.2TheTensorandHadamardProductEntropyGap ..........67 4.2.3TheSingleandHadamardProductEntropyGap ..........67 4.3InnitelyDivisibleFunctions ..........................68 4.3.1Direct-SumandProductkernels ....................68 4.3.1.1Direct-sumkernels ......................68 4.3.1.2Productkernelandtensorproductspaces .........69 4.3.2NegativeDeniteFunctionsandInniteDivisibleMat rices .....70 4.3.2.1NegativedenitefunctionsandHilbertianmetrics ....70 4.3.2.2Innitedivisiblematrices ..................70 4.4StatisticalPropertiesofGramMatricesandtheirconne ctionwithITL ...72 4.4.1Thetraceof G .............................73 4.4.2TheSpectrumof G andConsistencyofitsEstimator ........75 4.5Experiments:IndependenceTest .......................78 5INFORMATIONTHEORETICLEARNINGWITHMATRIX-BASEDENTROP Y .83 5.1ComputingDerivativesofMatrixEntropy ...................83 5.2SupervisedMetricLearning ..........................84 5.3TransductiveLearningwithanApplicationtoImageSupe rResolution ..87 6CONCLUSIONSANDFUTUREWORK ......................91 APPENDIX ABASICDEFINITIONSANDSTANDARDNOTATION ...............94 A.1Shannon'sEntropyandMutualInformation .................94 A.2R enyi'sMeasuresofInformation .......................96 BAUXILIARYTHEOREMSPROOFSANDDERIVATIONS .............100 B.1SufcientConditionsforPseudo-ConvexPrograms .............100 B.2DetailsoftheSolutiontotheMinimalSubproblem .............101 CAPPROACHESTOUNSUPERVISEDLEARNING ................102 C.1EncoderLearningMethods ..........................103 C.2DecoderLearningMethods ..........................109 C.3Encoder-DecoderorChannelLearningMethods ..............113 6

PAGE 7

DRANK-DEFICIENTCOMPUTATIONOFTHEPRI .................117 D.1RankDecientApproximationforITL .....................117 D.1.1Renyi's -OrderEntropyandRelatedFunctions ...........117 D.1.2RankDecientApproximation .....................119 D.2ThePrincipleofRelevantInformation ....................121 D.3Experiments ..................................124 D.3.1SimulatedData .............................124 D.3.2ImageSegmentationSignalDenoisingwiththePRI .........124 D.4Remarks ....................................126 REFERENCES .......................................128 BIOGRAPHICALSKETCH ................................135 7

PAGE 8

LISTOFTABLES Table page 3-1Resultsfortheimageretrievalfrompartiallyoccluded queries ..........50 4-1Listofdistributionsusedintheindependencetest ................79 8

PAGE 9

LISTOFFIGURES Figure page 1-1Differentttingstothedata .............................15 3-1EmployeddatasetforPRI ..............................32 3-2Entropyandcross-entropyvaluesfordifferent ..................32 3-3Resultsafterself-organization ............................33 3-4ResultingvectorquantizationusingthePRI. ....................35 3-5CornereffectofthePRIobjectivebasedonweights ...............38 3-6Computationtimesfordifferenttolerancelevelsandsa mplesizes ........47 3-7Numberofsupportvectorsfordifferenttolerancelevel sandsamplesizes ...47 3-8Entropyandcross-entropyvs intheweight-basedPRI .............48 3-9Resultingsupportvectorsfordifferentvaluesof ................49 3-10Distributionoftheweightsfor =10 ........................49 3-11Incompletequeriesandtheiroriginalversions ...................51 3-12Retrievedpattersfromquerieswithmissinginformati on .............52 4-1Spacesinvolvedinthedata-drivenapproach ...................72 4-2Resultsoftheindependencetest ..........................81 4-3Resultsofindependencetestunderdifferentparameter s .............82 5-1ResultsfortheMetriclearningapplication .....................86 5-2Imagesuper-resolutionresults ...........................90 C-1Blockdiagramencoderdecoderscheme ......................103 D-1AccuraciesfortheIPandCIPestimators .....................125 D-2ImagesegmentationusingPRI ...........................125 D-3NoisyandDenoisedversionsofaperiodicsignal .................126 9

PAGE 10

AbstractofDissertationPresentedtotheGraduateSchool oftheUniversityofFloridainPartialFulllmentofthe RequirementsfortheDegreeofDoctorofPhilosophy REPRODUCINGKERNELHILBERTSPACEMETHODSFORINFORMATION THEORETICLEARNING By LuisGonzaloS anchezGiraldo December2012 Chair:Jos eC.Prncipe Major:ElectricalandComputerEngineering Informationtheoryprovidesprincipledmodelsfordiffere ntmachinelearning problems,suchasclustering,dimensionalityreduction,c lassication,andmanyothers. However,usingthedenitionsofinformationtheoreticqua ntitiesdirectly,possesa challengingestimationproblemthatoftenleadstostrongs implications,suchas Gaussianmodels,ortheuseofplugindensityestimatorstha trestrictthekindof representationsthatwecanuseonthedata.Inthiswork,wea doptadata-driven perspectivebasedonreproducingkernelHilbertspacemeth odsthatleadstosuccessful applicationofinformationtheoreticprincipleswithoutr esortingtoestimationof theunderlyingprobabilitydistributions.Theproposedme thodologyoffersseveral advantagescomparedtootherstateoftheartworksuchasent ropicgraphsbecause itcanprovidequantitiesthataremoreamenableforoptimiz ation(differentiability)as wellastherepresentationexibilitythatkernelmethodsp rovide.Theworkisdivided intotwomainparts.First,weintroduceaninformationtheo reticobjectivefunction forunsupervisedlearningcalledtheprincipleofrelevant information.Weemployan informationtheoreticreproducingkernelHilbertspace(R KHS)formulation,which canovercomesomeofthelimitationsofpreviousapproaches basedonParzen's densityestimation.Resultsarecompetitivewithkernel-b asedfeatureextractors suchaskernelPCA.Moreover,theproposedframeworkgoesfu rtherontherelation betweeninformationtheoreticlearning,kernelmethodsan dsupportvectoralgorithms. 10

PAGE 11

Forthesecondpart,whichismotivatedbytheresultsandins ightsobtainedbythe RKHSformulationoftheprincipleofrelevantinformation, wedevelopaframework forinformationtheoreticlearningbasedoninnitelydivi siblematrices.Weformulate anentropy-likefunctionalonpositivedenitematricesba sedonRenyi'sdenition andexaminesomekeypropertiesofthisfunctionalthatlead totheconceptofinnite divisibility.Thisformulationavoidsthepluginestimati onofdensityandallowsto usetherepresentationpowerthatcomeswiththeuseofposit ivedenitekernels. Learningfromdatacomesfromusingthisfunctionalonposit ivedenitematricesthat correspondtoGrammatricesconstructedbypairwiseevalua tionsofinnitelydivisible kernelsondatasamples.Weshowhowwecandeneanaloguesto quantitiessuch asconditionalentropythatcanbeemployedtoformulatedif ferentlearningproblems, andprovideefcientwaystosolvetheoptimizationthatari sefromtheseformulations. Numericalresultsusingtheproposedquantitiestotestind ependence,metriclearning, andimagesuper-resolutionshowthattheproposedframewor kcanobtainstateofthe artperformances. 11

PAGE 12

CHAPTER1 INTRODUCTION 1.1TheProblemofLearning Inasystem,learningisthepropertytoincorporateexterna linformationtoimprove itsperformanceinaparticulartask.Inthissensethesyste mtriestoinfergeneralrules fromasetofgivenexamples.Therulescanbeintheformofafu nctionthatmodelthe dependencerelationbetweeninputoutputpairsorcanconsi stofsomeobservation aboutthestructureofthespaceswheretheexamplesarerepr esented.Theinformation providedtothesystemmainlycomesfromasetofobservedinp uts f x i gX ,although othersourcesofinformationmaybegivendependinguponthe taskandthecontext inwhichthesystemisput.Usuallythreescenarioscanbedes cribed.The supervised setting,wherethetaskcorrespondtondingaruleofassoci ationbetweenpairs f ( x i y i ) g ofobservedinputsandcorrespondingtargetsgivenbyanexp ert(external agent).Therefore,learningoccurswhenthesystemeffecti velypredictsthecorrect outputforannonpreviouslyseeninput.Asecondscenariois called reinforcement learning inwhichthesysteminteractswiththeenvironmentbyperfor mingactionsthat feedbacktothesystemintheformofrewardsorpunishments. Inthiscasetheobserved inputsarecalledthestates.Thegoalofadaptationistond asuitablesetofactions thatwillmaximizetherewardovertimeandthissequencewil ldependontheobserved state.Thethirdsetting,called unsupervised learning.Inthiscase,wecanlooselysay thattheonlyinformationavailableisthroughtheobserved inputs.Theassumptionisthe presenceofregularitiesamongtheobservedexamplessince thereisaprocessbehind theirgeneration.Therefore,thegoalistondsuchregular itiesandonemotivationfor doingsoisthatexpressingtheavailableinformationinter msoftheunderlyingcauses maysimplifyfurtherstagesofprocessing.Thisviewassume sagenerativemodelfor theobserveddata.However,onecannotarguetheactualcaus eswillbeunveiledbythe learningprocesssincelearningmustbealwaysaccompanied byassumptionsthatmay 12

PAGE 13

ormaynotnecessarilyagreewithreality 1 .Therefore,wecanthinkofexploitingthe statisticalregularitiestoencodetheobservedinputsint omorecompactrepresentations andalsoreducetheeffectofsomeexternalnoisewhichinthe absenceofadditional informationisassumedtobeunstructured,byexploitingth eredundancyintheinputs. Thisisafeatureextractionpointofviewofunsupervisedle arning. Fromsystemdesignperspective,featureextractionhasbee nconsideredan importantstageforincreasingaccuracyandreducingoverttingincommonly encounteredsubsequenttaskssuchasclassicationorregr essionoractionselection. Whilefeaturescanbehandcraftedbasedontheexternalknow ledgeaboutthedomain oftheapplication,learningthefeaturesoffersadvantage ssuchasadaptabilityacross differentdomainswithoutmuchexternalintervention.How ever,toapplythe learning fromexamples paradigmtotheproblemofndingarepresentation,weneedt oaddress twomainquestions: Howtoassesstheeffectivenessofarepresentationwithout havingto evaluatetheresultsonthesubsequenttasks? Whatarethedesirablepropertiesintherepresentation? Throughoutliterature,thisquestionshavebeenapproache dfromdifferentperspectives leadingtomyriadoftechniques.Yetsurprisingly,isthefa ctthatmanyofthese techniquesoverlapeitherbyusingthesamecriteriatoasse sstheireffectiveness,or inthepropertiesoftheconveyedrepresentations. 1.2StatisticalLearningPerspective Letusbeginwiththesupervisedlearningsetting.Here,wea relimitedtoaset of n observations f ( x i y i ) g ni =1 XY thatareassumedtobe i.i.d. andsampled 1 Intheabsenceofassumptionsthereisnoprivilegedor“best ”feature representation,andthateventhenotionbetweenpatternsd ependsimplicitlyon assumptionsthatmayornotbecorrect[ 22 ] 13

PAGE 14

fromanunknownjointdistribution P ( x y ) .Thedependencebetween X and Y is modeledthroughafunctionalrelationship y = f ( x ) thattriestocapturetheunknown conditionaldistribution P ( y j x ) .Thefunction f ( x ) ischosenfromasetofpossible functions(hypothesis) F = f f ( x ); 2 g .Thequalityofaparticularapproximation f ( x ) is ideally measuredintermsofanaveragelossdenedas: R ( )= Z X Y ` ( x y f ( x ) ) d P ( x y ). (1–1) Thelossfunction ` measuresthepoint-wiseaccuracyoftheapproximationandt herefore inuencestheoptimalchoice.Typically, ` isanonnegativefunctionand ` ( x y f ( x ) ) = 0 if y = f ( x ) .( 1–1 )alsoknowasriskfunctionalcannotbecomputedinpractice since P ( x y ) isunknown.Instead,anempiricalversioniscomputedbased onthesampleof size n as R emp ( )= Z X Y ` ( x y f ( x ) ) d P n ( x y )= 1 n n X i =1 ` ( x i y i f ( x i ) ) (1–2) where P n denotestheempiricaldistribution P n ( x y )= 1 n if ( x y )=( x i y i ) for i =1,..., n and 0 otherwise.Theempiricalrisk( 1–2 )isoftenaccompaniedbyaregularizationterm thatcontroloverttingwhenthefunctionclass F istoorich R reg ( )= R emp ( )+n ( f ( x ) ) (1–3) Theregularizationterm n ( f ( x ) ) articiallyshrinksthefunctionclass,whichisjustied theoreticallybygeneralizationboundsoftheform R ( ) R emp ( )+ C ( F ) ,where C ( F ) isatermmeasuringthe capacity ofthefunctionclass.Theaboveriskcanbeadapted tothecaseofunsupervisedlearning.Inthissettingthedes iredoutputsaretheinputs themselvesand f capturesthestructureofthedata.Thereforeitisnecessar ytond waystorestrictthethepossiblechoicesfor f toavoidtrivialsolutions.Howtorestrictthe choicesisanopenquestionthatexplainswhyinunsupervise dlearning,therehasbeen distinctionsbetweenclusteringandprojectionmethods.H owever,someofthetools 14

PAGE 15

Figure1-1.Differentttingstothedata.Theleftguresho wsareasonabletunder smoothnessassumptionsofthefunctionaldependency.Them iddlegureis anexampleofundertting,whichconsideranoverlysimplis ticmodelofthe dataandthusintroduceslargeerrors.Therightmostgures howsan exampleofhowarichclassoffunctionscanprovideacandida tethatcant perfectlythedatabutmaybewronginpredictingtherelatio nforunseen exemplarsundertheassumptionofsmoothness. employedtoanalyzeandsolvetheproblemsshowthattheyare intimatelyrelated[ 45 ]. Inthefollowing,wemotivatetheideaofusinginformationt heorytoaddresstheproblem oflearningandhighlighttherelationwiththeregularized riskminimizationsettingthat wehavejustpresented. 1.3RateDistortion,theInformationBottleneck,andLearn ing IntheShannon'sviewofinformationtransferedthroughach annel,itwasshown thatthenoiseandbandwidthinherenttothechanneldetermi nestherateatwhich informationcanbetransfered,andanyattempttotransferi nformationabovethelimit imposedbythechannelwillincurinsomeerrorintherecover edmessage.Thisresultis knownasthechannelcapacitywheretheobjectiveistotrans mitinformationwithzero errorlimit.Allowingsomelossofinformationintheproces syieldstherate-distortion function.Theinformationrate-distortionfunction R ( D ) isdenedas: R ( D )=min q (^ x j x ) I ( X ; ^ X ) subjectto E[ d ( X ^ X )] D (1–4) where d ( ) isadistortionfunctionthatcanberelatedtothelossincur redwhen theexactvalueoftheinputcannotberecoveredfromtheoutp utand E[ d ( X ^ X )]= P x ,^ x p ( x ) q (^ x j x ) d ( x ,^ x ) .Aswecanseetheaveragedistortionisananaloguequantity 15

PAGE 16

totheriskfunctional( 1–1 ),andthemutualinformation I ( X ; ^ X ) canberelatedtoa regularizationtermthatprovidestheleastcommittedmapp ingfrom x to ^ x 1.3.1Example:RateDistortionandPCA TherelationbetweenratedistortionandPCAarisesfromthe assumptionof Gaussiansource.ConsidertheGaussianrandomvector X 2 R d withzeromeanand covariance .Wecanndsimilaritytransformation U toaGaussianrandomvector Z 2 R d withzeromeanandcovariance ,suchthat U T U = .Thetransformation matrix U isunitarywhichimplies det( U )=1 ,andthus, h ( X )= h ( Z ) andthemappingis abijection 2 .Thefollowingtheoremfrom[ 16 ]willcompleteourresult. Theorem1.3.1.(RatedistortionforaparallelGaussiansou rce) Let X i N (0, r 2 i ) i =1,..., d beindependentGaussianrandomvariables,andthedistorti onmeasure d ( x ,^ x )= P di =1 ( x i ^ x i ) 2 .Thentheratedistortionfunctionisgivenby R ( D )= d X i =1 1 2 log r i D i (1–5) where D i = 8><>: if
PAGE 17

inthegenerativemodelforPCA,butthemaindifferenceisth at ^ Z isnotassumedtobe N (0, I ) 1.3.2TheInformationBottleneck Inratedistortion,thebasicassumptionisthatthedistort ionmeasureisgiven; therefore,theresultswilldependonwhatdistortionmeasu rehasbeenchosen. Theinformationbottleneck[ 87 ]proposesanalternativewayofformulatingthe precision-complexitytrade-offwithoutdeningadistort ionmeasureinadvance.Thisis donebyincludingareferencevariable Y ,calledthe relevantvariable ,intotheproblem. Asinratedistortion,onewantstocompress X into ^ X ,suchthat,themutualinformation between ^ X and Y ismaximized,thatis, min q (^ x j x ) I ( X ; ^ X ) I ( ^ X ; Y ) (1–7) where > 0 isthetrade-offparameter.Anequivalentformulationof( 1–7 )asa constrainedoptimizationproblemisbasedonthesocalledt herelevance-compression function,whichdependsonthejointdistribution P ( x y ) ,andisdenedasfollows: ^ R ( ^ D )=min q (^ x j x ) I ( X ; ^ X ) subjectto I ( ^ X ; Y ) ^ D (1–8) thatis, ^ R ( ^ D ) istheminimalachievablecompression,forwhichthereleva ntinformationis above ^ D [ 83 ].Itisimportanttomakeclearthat ^ X shouldbecompletelydenedgiven X alone,thisimpliesthefollowingMarkovianrelation Y X ^ X ,thatcombinedwiththe dataprocessinginequalityshowsthat I ( ^ X ; Y ) I ( X ; Y ) Intheabove,wewantedtomotivatetheuseofinformationthe orytodescribethe learningproblem.Nevertheless,theaboveformulationass umetheavailabilityofthe distributions,whichaswepointedoutatthebeginningofth ediscussion,isnotthecase. Tomakethisformulationpractical,wemusthaveactualesti matorsofsuchoperational quantitiesbasedontheobserveddata.Here,wethriveinthe previousworkon informationtheoreticlearning (ITL)anditsrelationtokernelmethodsinmachinelearning 17

PAGE 18

Thisbringsalongthegeneralitylevelatwhichinformation theorytreatsaproblemwith thepowerfultoolsfrommachinelearningtodealwithpracti calproblems.Thegoalof thisthesisistodevelopaframeworkforlearningbasedinfo rmationtheoreticprinciples thateffectivelycapturesstatisticalregularitiesdirec tlyfromdata.Tothisend,weemploy reproducingkernelHilbertspacemethodsandshowhowtheyc anbeemployedto computequantitieswithsimilarpropertiestoentropy,mut ualinformationanddivergence. Inthisway,wecansuccessfullyintegrateinformationtheo reticprinciplestotheproblem oflearningandobtaindatadrivenrepresentationsthatcon veythenecessaryinformation toperformsubsequenttasks, i.e. furtherstagesofdataprocessing. First,weinvestigateanobjectivefunctioncalledthe principleofrelevantinformation (PRI),whichtradesoffbetweenfaithfulnessofthereprese ntationasinformation preservationandparsimonyoftherepresentation.Thisbri ngstogethertwocore conceptsofinformationtheory,divergenceandentropy.Fr omtheprinciple,ithas beenobserveddifferentlevelsofinteractionofthesetwoq uantitiesrelatetocommon tasksinunsupervisedlearning,rangingfromclusteringto manifoldlearning,andback toquantization.Theestimatorsemployedwithintheprinci plehaveledtoestablish connectionsbetweenkernelmethodsandinformationtheore ticobjectives,andintegrate powerfultoolsfromoptimizationandmachinelearninginto thealgorithmsthateffectively manipulateinformationatitsveryessence. Inthesecondpart,motivatedbytheideaofavoidingdensity estimation,we proposeamatrixbasedformulationofentropy-likequantit iesthatcanbeemployed forinformationtheoreticlearning.Theseideasleadtothe conceptinnitelydivisible matricesandestablishatheoreticalframeworkthatallows theuseofreproducing kernelHilbertspacemethodstoposelearningproblemsasop timizationofinformation theoreticquantities. 18

PAGE 19

1.4ContributionsofthePresentWork Weintroduceanalternativeformulationoftheprincipleof relevantinformation basedonthereproducingkernelHilbertspaceunderstandin gofITLwiththecross informationpotential.Weshowedhowtheinformationtheor eticformulationofthe principlecanbecastasquasi-convexoptimizationproblem andprovideanefcient solutionthatwasmotivatedbythesequentialminimaloptim izationmethoddeveloped forsupportvectormachines.Thisrstpartservesasamotiv ationtoinvestigatefurther ontheRKHSformulationsforITLWedevelopakernelbasedfra meworkforinformation theoreticlearningbystudyingthepropertiesofentropyli kefunctionalsofpositive denitematricesthatavoidstheprocessofestimatingtheu nderlyingprobability distributionasintermediatestep.Wearriveattheconcept ofinnitelydivisible matricesasthebridgebetweentwoHilbertspaces.OneHilbe rtspaceisrelatedto theinnitedivisiblekernelprovidestheGrammatrixthatc anbeemployedtocompute theentropy-likefunctionalsonpositivedenitematrices .ThesecondHilbertspaceis theactualrepresentationofthedata,thatcomesfromthere lationbetweenaninnite divisiblematricesandconditionallynegativedenitemat rices,thatcanbeembedded inHilbertspaces.Westudiedtheintegraloperatorsinvolv edinthelimitcasesofthe proposedmatrixfunctionalusingsomeoftheanalysisthath asbeendoneforpositive denitekernelsinthecontextoflearning.Thisframeworkp rovidesmeansofposing differentlearningproblems(wemainlyfocusonsupervised andunsupervised)based oninformationtheoreticquantities,andbringstothetabl enewapplicationdomainsthat wherenotaddressedbefore,forinstanceapplicationsthat employkernelsforstructured domainssuchasgraphs,timeseries,couldbeincludedinthe proposedframework. 19

PAGE 20

CHAPTER2 MATHEMATICALPRELIMINARIES Inthischapter,weprovideabriefaccountoftheintroducto ryconceptsofthe theoryofreproducingkernelHilbertspaces.Westartwitht heirformaldenitionand introducethenecessaryandsufcientconditionsforafunc tiontobeareproducing kernel.Following,wepresentsomesituationswherethethe oryofreproducingkernel thatarerelevanttolearning.Thecontentsofthischaptera rebasedonthepapersby Aronszajn[ 3 ],Parzen[ 58 ],thebookbySch ¨ olkpfandSmola[ 76 ],andthebook[ 17 ]. 2.1ReproducingKernelHilbertSpaces Let X beasetand F beavectorspaceoffunctionsfrom X totheaeld F ;in particular,let F = R .Wesaythat H isareproducingkernelHilbertspace(RKHS)on X over R ,if: (i) H isavectorsubspaceof F ; (ii) H isendowedwithaninnerproductproduct, h i ,andiscompleteinthemetric inducedbyit; (iii)forevery x 2X and f 2H ,thelinearevaluationfunctional F x : H7! R ,denedas F x ( f )= f ( x ) ,isbounded. FromRieztheorem[ 41 ],weknowthatforanyboundedfunctional H onaHilbertspace H ,thereexistsauniquevector h 2H suchthat: H ( f )= h h f i forall f 2H .Inparticular, foreachevaluationfunctionals F x thereexistacorrespondingvector x 2H .The bivariatefunctiondenedby ( x y )= x ( y ) (2–1) iscalleda reproducingkernel for H ;itiseasytoverifythat, ( x y )= h x y i (2–2) and k F x k 2 = k x k 2 = h x x i = ( x x ) .Let H beaRKHSontheset X withkernel .Thelinearspanof f ( x ): x 2Xg isdensein H .Thisresultsfromthefactthat 20

PAGE 21

anyfunction f orthogonaltothethespanof f ( x ): x 2Xg mustsatisfy h f x i forall x 2X ,andthus f ( x )=0 forevery x 2X Lemma2.1.1. Let f f n gH .If lim n k f n f k =0 ,then f ( x )=lim n f n ( x ) forevery x 2X Proof2.1.1. Thisisasimpleconsequenceofthereproducingpropertyand CauchySchwarzinequality. j f n ( x ) f ( x ) j = jh f n f x ijk f n f kk x k! 0 Aconsequenceoftheabovelemmais: Proposition2.1.1. Let H 1 and H 2 beRKHSon X withkernels 1 and 2 ,respectively.If 1 ( x y )= 2 ( x y ) forall x y 2X ,then H 1 = H 2 and k f k 1 = k f k 2 forevery f Proof2.1.2. wecantake ( x y )= 1 ( x y )= 2 ( x y ) andthusthe M ` =span f x 2 M ` : x 2Xg isdensein H ` ,andforany f ( x )= P i i x i ( x ) thereisnoregardaboutwhether f belongstoeither M 1 or M 2 .Notethat k f k 21 = P i j i alpha j ( x i x j )= k f k 22 ,andthus k f k 1 = k f k 2 forevery f 2 M 1 = M 2 .If f 2H 1 ,thenthereisasequenceoffunctions f f n g M 1 thatconvergeto f innorm.Since f f n g isCauchyin M 1 isalsoCauchyin M 2 sobycompletenessof H 2 thereexist g 2H 2 suchthat f n g .Then,byLemma 2.1.1 wehavethat f ( x )=lim n f n ( x )= g ( x ) forevery x 2X ,thusevery f 2H 1 isalsoin H 2 andviceversa,and H 1 = H 2 .Finally,wecanextend k f k 1 = k f k 2 toall H 1 and H 2 Inotherwords,twodifferentreproducingkernelHilbertsp acesdonothavethe samereproducingkernel.Thefollowingtheoremshowsanalt ernativewaytoexpress thereproducingkernelofaRKHS H Theorem2.1.1. Let H havereproducingkernel .If f e : 2 g isanorthonormalbasis for H ;then ( x y )= X 2 e ( x ) e ( y ). (2–3) wheretheseriesconvergespoint-wise. 21

PAGE 22

Proof2.1.3. Since f e : 2 g isanorthonormalbasis x = X 2 h e x i e = X 2 e ( x ) e wheretheseriesconvergesinnorm(Parsevalidentity).The n.byLemma 2.1.1 x ( y )= X 2 e ( x ) e ( y )= ( x y ). Now,weturnthefocuson ( x y ) andexplorenecessaryandsufcientconditions forthisfunctiontobeareproducingkernel.Amatrix A 2 R n n iscalled positivedenite ifitissymmetricand n X i j =1 i j A ij 0 (2–4) forany i 2 R .Arealfunctionoftwovariables ( x y ) iscalleda positivedenite function ifforanynitesubset f x i gX thematrix K ij = ( x i x j ) ispositivedenite. Proposition2.1.2. Let H beaRKHSon X withreproducingkernel .Then, isa positivedenitefunctionProof2.1.4. Foraxed f x i gX ,wehave n X i j =1 i j ( x i x j )= n X i =1 i x i n X i =1 i x i + = rrrrr n X i =1 i x i rrrrr 2 0 Toconcludethissection,weshowMoore'sTheorem,whichist heconverseto theaboveresultandgivesusacharacterizationofapositiv edenitefunctionstobea sufcientconditionforthefunctiontobethereproducingk ernelofsomeRKHS H Theorem2.1.2. Let X beasetand : XX7! R beapositivedenitefunction.Then, thereexistareproducingkernelHilbertspace H offunctionson X ,suchthat, isthe reproducingkernelof H 22

PAGE 23

Proof2.1.5. Considerthefunctions x ( y )= ( x y ) andthespace W spannedbythe set f x : x 2Xg .Thefollowingbilinearmap B : W W 7! R B X i i x i X j j x j = X i j i j ( x i x j ), where i j 2 R ,iswelldenedon W .Tosupporttheaboveclaim,noticethatif f ( x )= P i i x i ( x ) iszeroforall x 2X ,thenbydenition B ( f x )=0 forall x Conversely,if B ( f w )=0 forall w 2 W ,thenbytaking w = x weseethat f ( x )=0 Thus B iswelldened. Since ispositivedenite B ( f f ) 0 andweseethat B ( f f )=0 ifandonly if B ( w f )=0 forall w 2 W ,therefore f ( x )=0 forall X .Nowwehaveshownthat W isapre-Hilbertspacewithinnerproduct B .Let H denotethecompletionof W ,we needtoshowthateveryelementof H isfunctionon X .Let h 2H bethelimitpoint ofaCauchysequence f f n g W .ByCauchy-Schwarzinequality j f n ( x ) f m ( x ) j = j B ( f n f m x ) jk f n f m k ( x x ) .Hence,thepoint-wiselimit h ( x )=lim n f n ( x ) iswelldened.TOconclude,let h cdot i betheinnerproducton H .Then,wehave h h x i =lim n h f n x i =lim n B ( f n x )= h ( x ) .Thus H isareproducingkernelHilbert spacewithreproducingkernel Finally,combiningProposition 2.1.1 withMoore'sTheoremshowsthereisa correspondencebetweenRKHS'sontheset X andpositivedenitefunctionsonthis set. 2.2TheCovarianceFunction Considerastochasticprocess f X ( t ): t 2Tg ,where X ( t ) arerealrandom variablesdenedonaprobabilityspace (n, B P ) withboundedsecondordermoments, thatis, E j X ( t ) j 2 = Z n j X ( t ) j 2 d P < 1 (2–5) 23

PAGE 24

Withoutlossofgenerality,wecanconsiderrandomvariable swithzeromean, E[ X ( t )]= 0 forall t 2T ;thecovariancefunctionisdenedas, R ( s t )=E[ X ( s ) X ( t )]= Z n X ( s ) X ( t )d P (2–6) Itiseasytoverifythat R isapositivedenitefunctionandthereforedenesareprod ucing kernelHilbertspaceoffunctionson T .AresultoriginallyduetoLo eve,andpresented byParzenin[ 58 ]showedacongruencemapbetweentheRKHSinducedbythe function R andthe L 2 spacethatcorrespondtothecompletionofthespanoftheset f X ( t ): t 2Tg denotedby L 2 ( X ( t ): t 2T ) Theorem2.2.1. Let f X ( t ): t 2Tg bearandomprocesswithcovariancekernel R Then L 2 ( X ( t ): t 2T ) iscongruentwiththereproducingkernelHilbertspace H with reproducingkernel R .Furthermore,anylinearmap : H7! L 2 ( X ( t ): t 2T ) whichhas thepropertythatforany f 2H andany t 2T E[ ( f ) X ( t )]= f ( t ) (2–7) isthecongruencefrom H onto L 2 ( X ( t ): t 2T ) ,whichmaps R ( t ) into X ( t ) 2.3RKHSsinMachineLearning Thestudypositivedenitekernelsinmachinelearningwasi nitiallymotivatedasa generalizationofawellbodyoftheorythathasbeendevelop edforlinearmodelsand algorithms.Inthiscontext,apositivedenitekernel isanimplicitwaytorepresent theset X objectsofinterest.Wehavealreadyseenthatthereisacorr espondence betweenapositivedenitekernel andareproducingkernelHilbertspaceoffunctions H withreproducingkernel .Therefore,thekernelcanbeunderstoodasanindirectway tocomputeinnerproductsbetweenelementsofaHilbertspac ethataretheresultof mappingtheelementsof X to H .Inotherwords,thereexistamapping : X7!H ,such that ( x y )= h ( x ), ( y ) i (2–8) 24

PAGE 25

Thespace H isknowasthefeaturespaceand iscalledthefeaturemap.Oneofthe bigappealsofthisideaisthatbyperforminglinearoperati onsin H ,wemaybeable toperformnonlinearmanipulationsintheinputspace X ,andmostimportantlythat thereisnoneedtoperformanyexplicitcomputationsinthef eaturespace.Notice,this ideaiscompletelydifferenttothecongruencemapintroduc edinTheorem 2.2.1 .An importantresultassociatedwiththeuseofpositivedenit ekernelinmachinelearningis therepresentertheorem[ 40 ]whichispresentedbelow. Theorem2.3.1.(Representertheorem)[ 76 ]. Let n:[0, 1 ) 7! R beastrictlymonotonic increasingfunction, X beaset,and c :( X R 2 ) n 7! R [1 beanarbitrarylossfunction. Then,eachminimizer f 2H oftheregularizedriskfunctional c (( x 1 y 1 f ( x 1 )),...,( x n y n f ( x n )))+n k f k 2H (2–9) admitsarepresentationoftheform f ( x )= n X i =1 i ( x i x ). (2–10) Proof2.3.1. Let S =span f ( x i ): x i 2X i =1,..., n g denotethesubspaceof H spannedbythe n trainingsamples.Considerasolution f 2H ,thissolutioncanbe expressedas f = f S + f S ? where f S 2 S and f S ? 2 S ? .Therefore f ( x i )= f S ( x )+ f S ? ( x i )= f S ( x )+0 .Nowforthesecondtermof ( 2–9 ) ,wehavethat n k f k 2H =n k f S k 2H + k f S ? k 2H since n isstrictlymonotonicincreasingwecanseethattheminimum willbeachieved for k f S ? k 2H =0 whichimpliesthat f S ? =0 Therepresentertheorembasicallystatesthatthesolution totheminimizationof theregularizedriskfunctionalcanbeexpressedintermoft hesocalledtrainingsample f ( x i y i ) g ni =1 .Thisisimportantbecauseitallowsustodealwithproblems thatatrst 25

PAGE 26

glanceappeartobeinnitedimensional.Noticethatthereg ularizationdoesnotprevent ( 2–9 )ofhavinglocalmultipleminima,thispropertywouldrequi resomeextraconditions suchasconvexity. 26

PAGE 27

CHAPTER3 THEPRINCIPLEOFRELEVANTINFORMATION 3.1InformationTheoreticLearninginaNutshell Analternativeparadigmforlearningbasedontheideasunde rlyinginformation theorywasproposedin[ 63 ] 1 .Informationtheoreticlearning(ITL)isablendofideas frominformationtheoryandthetheoryofadaptivesystemst oaccomplish information ltering inanonparametricfashion.ITLobjectivefunctionsarebui ldupontheconcepts ofentropyanddivergence(extendstomutualinformation). Learningfromdataismade possiblebyusingParzenwindowstocomputepluginestimato rsofRenyi'sentropy.For arandomvariable X 2X R d andprobabilitydensityfunction f ( x ) ,Renyideneda familyofentropyfunctionals H ( X )= 1 1 log Z X f 1 ( x ) f ( x )d x (3–1) indexedbytheparameter .Thecaseof =2 isofparticularinterestbecauseit providesanestimatorofarathersimpleform.Let f x i g ni =1 X bean i.i.d. sample fromtherandomvariable X withdensity f ( x ) .ForaParzenwindowestimate ^f ( x )= 1 n P ni =1 ( x x i ) ,thepluginestimatorofRenyi'sentropyoforder 2 isgivenby ^ H 2 ( X )= log 1 n 2 n X i j =1 p 2 ( x i x j ) (3–2) where ( x y )= C exp k x y k 2 2 2 istheGaussiankernelwithparameter .Followinga similarlineofthinking,ITLusestwodenitionofdivergen cethatarecompatiblewith 1 Amorecompleteaccountofideasdevelopmentsandalgorithm scanbefoundin [ 62 ]. 27

PAGE 28

Renyi'sentropyoforder 2 :adivergencemeasurebasedontheEuclideannorm, D ED ( X ; Y )= Z X ( f ( x ) g ( x ) ) 2 d x = Z X f 2 ( x )d x + Z X g 2 ( x )d x 2 Z X f ( x ) g ( x )d x ; (3–3) andameasurebasedontheCauchy-Schwarzsinequality, D CS ( X ; Y )= log R X f ( x ) g ( x )d x 2 R X f 2 ( x )d x R X f 2 ( x )d x =log Z X f 2 ( x )d x +log Z X g 2 ( x )d x 2log Z X f ( x ) g ( x )d x (3–4) Notethatbothdivergencessharesimilarterms,butmostint erestinglyisthattheseterms correspondtovariationsofthesamebuildingblock.Thecro ss-informationpotential V ( X Y )= Z X f ( x ) g ( x )d x (3–5) whichcorrespondtoaninnerproductbetweenprobabilityde nsityfunctionsandthus actsasameasureofsimilarity.Itturnsoutthattheestimat orof( 3–5 )alsotakesa simpleformwhenestimatedfromdata.Let f x i g ni =1 and f y i g mi =1 be i.i.d. samplesfrom f ( x ) and g ( x ) ,respectively;Thepluginestimatorofthecross-informat ionpotentialis givenby: ^ V ( X Y )= 1 nm n X i =1 m X j =1 p 2 ( x i y j ) (3–6) 3.2ThePrincipleofRelevantInformation Allthemethodswereviewedforunsupervisedlearning C refertoinformation preservation,buttheirapproachatquantifyinginformati onissomehowloosegiven thatinformationtheoryprovidesoperationalquantitiest hatappropriatelyaccomplish suchobjective.Nevertheless,estimatinginformationthe oreticquantitiesfromdatais adifcultproblemleadingtosimplicationsandassumptio nsthatmayleadtopoor 28

PAGE 29

resultsornegligibleimprovementifdataisfarfrombeingi nagreementwiththeimposed restrictions.Inthisrespect, informationtheoreticlearning (ITL)[ 62 ]developsa frameworkthatdealswithestimatorsthathavetheinformat iontheoreticavoryetare simpletocomputefromdata.Here,wepresentaprinciplemot ivatedbythegoalsin unsupervisedlearning,whicheffectivelymanipulatesinf ormationbysettingtheproblem withinthITLframework. Structurecanbeassociatedtothestatisticalregularitie spresentontheoutcomes ofaprocess.Therefore,theentropyrelatedtotheseoutcom escanbeattributedinpart totheunderlyingstructure,andtheresttoparticularitie sofeachoutcome, i.e. details orevenrandomperturbations.Hence,wecanthinkofthemini mizationofentropyasa meansofndingsuchregularities.Supposewearegivenaran domvariable S withPDF g ,forwhichwewanttondadescriptionintermsofaPDF f withreducedentropy,that is,avariable X thatcapturestheunderlyingstructureof S .The principleofrelevant information (PRI)formulatestheaboveproblemasatrade-offbetweenth eentropy H 2 ( f ) of X anditsdescriptivepowerabouttheobservedrandomvariabl e S intermsof theirrelativeentropy D CS ( f k g ) .ForaxedPDF g 2F theobjectiveisgivenby: J ( f )= H 2 ( f )+2 D CS ( f k g ), (3–7) where isthetrade-offparameter.Theminimizationof J withinasetofadmissible PDFs F shouldleadtoafunction f 2F thathasminimumentropy,butatthesame time,theinformationgainfromobservingthesamplewhichi srepresentedby g isalso maximized.Nevertheless,asitisoftenthecase,itisneces sarytochooseawayto compute g fromthesampleandsuitablespace F tosearchfor f .Theonlyavailable informationabout g isencodedinasample S = f x i g Ni =1 ,andsomeassumptionsabout thefunctionclass F mustbemadeinordertoobtainatractablesolution. 29

PAGE 30

3.2.1PRIasaSelf-organizationMechanism Asuitablewaytoframetheabovesearchasanoptimizationpr oblemwasproposed in[ 68 ].ThissolutioncombinesParzendensityestimationwithas elforganizationofa sampletomatchthedesireddensity f thatminimizes( D–8 ).Theoptimizationproblem becomes: argmin X 2 ( R d ) M h ^ H 2 ( X )+ ^ D CS ( X k S ) i (3–8) where S 2 ( R d ) N isasetof d -dimensionalpointswithcardinality N ,and X asetof d -dimensionalpointswithcardinality M .FortheParzenwindowwithGaussiankernel G wecanevaluatethecostas: (1 )log 1 M 2 M X i j =1 G p 2 ( x i x j ) log 1 MN M P i =1 N P j =1 G p 2 ( x i s j ) 2 1 N 2 N P i j =1 G p 2 ( s i s j ) (3–9) where x i 2 X and s i 2 S .Wecanfurthersimplifythecostbydroppingthedenominato r ontherighttermof( D–11 )sinceitisconstant.Theselforganizationprinciplemove s eachparticle x i accordingtotheforcesexertedbythesamples S and X .Theentropy minimizationcreatesattractiveforcesamongtheinformat ionparticles x i 'sinteracting withtheforcesinducedbytheeldcreatedbythesample S ,whichrestrictsthemotion ofeach x i .Theparticleswillmovearoundthespaceuntilthewholesys temreaches apointofequilibrium(Localminimumof J ( X ) ).Computingthepartialderivativesof ( D–11 )withrespecttoeachpointin X yieldsaxedpointupdatedescribedinAlgorithm 1 3.2.2OntheInuenceof Noticethatthetrade-offparameter canbevariedbetween 0 and 1 determining thestrengthsofeachtermin J andthustheequilibriumpoints.Consequently,we shouldexpectdifferentbehaviorsofthesystemaccordingt o .First,wecanlookatthe extremepointsintherange [0, 1 ) .For 0 allthepointsin X willcollapsetoone 30

PAGE 31

Algorithm1 PRIalgorithmwithselforganization 1: Initializationelements: S setofpointsfortheoriginalpdf ^ g X (0) = S initialguessfor ^ f 2: Computetheinformationpotentialof X ( t ) ^ V ( X ( t ) )= 1 M 2 M X i j =1 G p 2 ( x ( t ) i x ( t ) j ) 3: Computethecross-informationpotentialbetween X ( t ) and S ^ V ( X ( t ) S )= 1 MN M X i =1 N X j =1 G p 2 ( x ( t ) i s j ) 4: Updatetheelementsof X withthefollowingxedpointrule: x ( t +1) i = c 1 P Nj =1 G p 2 ( x ( t ) i x ( t ) j ) x ( t ) j P Mj =1 G p 2 ( x ( t ) i s j ) + P Mj =1 G p 2 ( x ( t ) i s j ) s j P Mj =1 G p 2 ( x ( t ) i s j ) c 1 P Nj =1 G p 2 ( x ( t ) i x ( t ) j ) P Mj =1 G p 2 ( x ( t ) i s j ) x ( t ) i where c = M ^ V ( X ( t ) S ) N ^ V ( X ( t ) ) 5: Iteratethesteps 2 through 4 untilconvergence. singlepoint,whichinthelimitcasebecomesindependentof thetargetsample S .This doesnotappeartobeaveryinterestingcase.Theotherextre mecaseiswhen X is initializedbythelocationsprovidedbythesample S and !1 .Herethelocations of X willnotmoveawayfromthelocationsof S ,thesystemisalreadyinequilibrium. Interestingcasesarisewhen 1 .Forinstance,ithasbeenshownthatthecase =1 correspondtotheGaussianmeanshiftalgorithm[ 67 ].Othervaluesoflambda canprovideotherdescriptionsofthestatisticalstructur eofthedata.Wewillillustrate thisphenomenonwithasimpleexample,gure 3-1 showsa 2 -dimensionalsample S of pointsdistributedinanon-lineararrangement.Algorithm 1 isrunforkernelsize =0.5 anddifferentvaluesof .Figure 3-2 showsthebehavioroftheentropyterm ^ H 2 ( X ) and 31

PAGE 32

-4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 Target sample set S Figure3-1.Datasetutilizedtoillustratethebehavioroft heselforganizationobtainedby theprincipleofrelevantinformation 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3 1 H(X) H(X;S) Figure3-2.Entropy X andcross-entropywithrespectto S afterequilibriumfordifferent valuesof .Noticethatthevaluesareplottedagainst 1 forbetter visualization. thecross-entropyterm ^ H 2 ( X ; S ) obtainedfromthenumeratoroftherightmosttermof ( D–11 )fordifferentvaluesofthetrade-offparameteronthedata set.Wecanobserve howthistermsgothroughaseriesoftransitionsas increases.Figure 3-3 shows thedifferentarrangementsofthepointsgivenbythediffer entvaluesofthetradeoff parameter.Wecanseethatthersttransitionoccurswhen startsmovingawayfrom 1 .Inthecaseof =1 thepointslocateonthemodesoftheParzendensityestimato r 32

PAGE 33

-4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 1 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 1.05 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 1.25 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 1.5 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 2 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 3.5 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 10 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 20 S X Figure3-3.Resultingsample X afterself-organizationfordifferentvaluesof 33

PAGE 34

ofthePDFaccordingto S .As increasespointsin X starttospreadout,howeverthey donotmovefreelyinalldirections,andaccommodateonrath erregularpatterns.For example,for =2 thepointsappeartoformregularpatternsreminiscentofpr incipal curves.Aswegoforlargervaluesof ,weseethatregularitybecomeslessandless evident,andpointsin X willspreadsimilarlytothepointsinthesample S 3.2.3ANoteonInformationTheoreticVectorQuantization Above,wearguedthatifweusethesample S asinitialguessfor X andwelet !1 ,weendupwith X = S asthenearestequilibriumforthesystemrelatedtothe PRIcost.Basically,thisoccursbecauseweonlyputemphasi sontheC-Sdivergence termin( D–8 ),andthus,aminimumof J ( X ) isachievedifwehaveasetofpoints thatgeneratethesameParzendensityestimatewiththeoneb asedon S .Indeed, thetermcontrollingtheuncertaintyof X isneglected.Thisapparentlynon-interesting casecanbegivenapositivetwistifweimposeanentropycons traintinanindirect fashion.Theentropyminimizationcreatesabottleneckont heamountofinformation theoutputcontains.Quantizationisanimplicitwayoflimi tingtheamountofinformation thatcanbepossiblyconveyedabouttheinputvariablebylim itingtherepresentation alphabet.Ifweletthenumberofpoints X belessthanthenumberofpointsin S ,with properinitialization,wecanobtainavectorquantization algorithmsincepointsin X will distributeevenlyonaccordingtotheCauchy-Schwarzdiver gencewithrespectto S to faithfullyrepresentthesetofpoints S .Animportantpointhereisthatthecapacityofthe representationisnotlimitedbyanentropicterm,butbythe numberofpointsemployed in X .Thequantizationoptimizationproblemisgivenby: argmin X 2 ( R d ) M h ^ H 2 ( X )+ ^ D CS ( X k S ) i (3–10) where M isthenumberofpossiblestatesafterquantization.Theref ore,theself-organization mechanismprovidedbythePRIcanbeappliedinthetrainingo faquantizer.Figure 3-4 displaysthevectorquantizationsolutionobtainedbytheP RIxedpointalgorithmwith 34

PAGE 35

-4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI l = 20 S X Figure3-4.ResultingvectorquantizationusingthePRI.linearkernelannealinganduniformgridinitializationfo r X .Noticethatinthiscase thedelitymeasureofthevectorquantizationisbasedonth edistribution,whichhas acleareffectinhowthevectorsin X willdistributeaccordingto S .Sinceallpoints in X areequallyweighted,thePRIwillplacemorepointsonregio nswherethePDF estimatedfrom S haslargervalues.Thismayormaynotbeanappealingpropert y,but itishighlyproblemdependent.Forexample,theapplicatio nshownin[ 68 ]fortheface contoursbenetsfromtheabovepropertysincemorevectors arelocatedinregionsthat containmoredetailedfeaturessincetheyaremoredenselys ampled.However,forother compressionscenariostheactualdetailsareveryrareeven tsinthedistributionandthus overlookedbythePRI.3.2.4PracticalIssuesandOpenQuestions Theuseofmarginaldistributionstooptimizetheobjective functionthroughself organizationarisespracticalissuesandtheoreticalques tionsthateventuallywilllead toabetterunderstandingoftheprinciple.Amongthepracti calissuesthatwerenot previouslyaddressedwecanlist: Howcanwereducethecomputationalcomplexityoftheoptimi zation? 35

PAGE 36

Howwelldoestheselforganizationprincipleoperateinhig hdimensionalspaces? WhatistheinuenceoftheParzendensityestimationunderl yingthesolution? Howcanweapplytheprincipletootherdomainsbesides R d ? Isthereawaytomaketheencodingexplicit? Intheattempttosolvetheseissues,wealsoencounteranumb erofquestionsfrom thetheoreticalpointofview: HowdoesthePRIavoidthecomputationofmutualinformation ,whichseemstobe abetter-suitedgoal? IsthereamoregeneralformulationofthePRIthantheonebas edofParzen? IsthereageometricunderstandingofthePRI? Whataretheapproximationguaranteesforthenitesampler egime? Inthefollowingsections,westarttoaddresssomeoftheabo veissuestoshapethe developmentspresentedthoughtherestofthisthesis. 3.3AlternativeSolutionstothePrincipleofRelevantInfo rmation 3.3.1ThePRIasaWeightingProblem Fortheset F ofprobabilitydensityfunctionsthataresquareintegrabl ein X R n wecandenethecross-informationpotential V (CIP)asabilinearformthatmaps densities f i f j 2F totherealnumberstroughtheintegral, V ( f i f j )= Z X f i ( x ) f j ( x )d x (3–11) Itiseasytoseethatforabasisofuniformlybounded,square integrableprobability densityfunctions, V isapositivesemidenitefunctiononthe span fFg .Now,consider theset G = f g = P mi =1 i ( x i ) j x i 2 R n P mi =1 i =1 ,and i 0 g ,where isa“Parzen”typeofkernelwhichisalsosquareintegrable, thatis, issymmetric, nonnegative,hasboundedintegral(canbenormalized),bel ongsto L 2 ,andshift invariantwith asthescaleparameter.Clearly,forany g 2G wehave k g k 2 36

PAGE 37

k ( x ) k 2 .Thence, G isbounded.However,ifthe X isnon-compactoursearchspace isalsonon-compact. Theobjectivefunctionfortheprincipleofrelevantinform ation( D–8 )canbewritten intermsofIPfunction.UsingtheParzenbasedestimation,w erestrictthesearch problemto GF .Inthiscase,wehavethatequation( D–8 )canberewrittenas: J ( f )= log V ( f f ) log [ V ( f g )] 2 V ( f f ) V ( g g ) (3–12) straightforwardmanipulationofthetermsyieldsanequiva lentproblem: argmin f 2G [ (1 )log V ( f f ) 2 log V ( f g ) ] (3–13) Twoimportantaspectsoftheaboveobjectiveare:thechoice ofthekernel,shape andsize ,determinesdifferentscalesfortheanalysis;thetrade-o ffparameter denesasetofregimesforthepossiblesolutionstotheprob lem.Aswepreviously mentioned,theonlyavailableinformationiscontainedint hesample S = f x i g Ni =1 Anapproximationofthetargetdensity g isthengivenbyitsweightedParzenwindow estimator ^ g ( x )= P Ni =1 i ( x i x ) ,where i 0 and P Ni =1 i =1 .Inourexperiments,we limitto i =1 = N .Toenforcecompactnessinoursearchspace,welookforasol ution f thathasthesameformof ^ g ,thatis f ( x )= N X i =1 i f i ( x )= N X i =1 i ( x i x ). (3–14) where i 0 and P Ni =1 i =1 .Byxing andevaluatingtheinformationpotential betweeneachpair ( x i x j ) 2 S S ,wecanrewrite( 3–13 )inmatrixnotationas: min ( 1)log T V 2 log T V subjectto i 0 and N X i =1 i =1, (3–15) 37

PAGE 38

where V ij = R X ( x i x ) ( x j x )d x .Ititalsoimportanttohighlightthatforvaluesof 1 theregularizationtermalmostvanishes,andthemaximumin formationgain (innerproductterm)canberelatedtothemodesof ^ g whichareapproximatedby thesharpestfunctionsavailable,whichinoursetting,are individualwindowsthat satisfyconvexcombinationconstraint.Therefore,wecane xpectthemodesof ^ g tobe almostorthogonalcomponents.When isveryclosetoone,thecross-entropytermis dominantintheobjectiveandthesolutionwilllieonthecor nersinoneofthecornersof thesimplex.Figure 3-5 depictsthegeometricalelementsthatexplainthisphenome non. Thesemicirclesarelociofminimumandmaximum L 2 normthatintersectwiththe simplexbeingthesolution f withthelargestnormtheonemaximizingthedotproduct withthetargetdistribution g Figure3-5.CornereffectofthePRIobjectivebasedonweigh ts Thisinsighthighlightsoneofthemajordifferencesfromth epredecessorofthethis alternativeformulation,andidentiestheclusteringbeh avioroftheself-organization principleasalocalminimumoftheobjectivefunction.Neve rtheless,thisisnottobe betakenasadisadvantage,itactuallysuggestaconnection withtheenergy-models forlearningthatwereviewedinSection C .Thiscornerphenomenaalsomotivatesthe extractionofmorethanonesolution.Areasonableapproach istodeatethetargetPDF 38

PAGE 39

^ g andrunthealgorithmonthenewfunction.Thedeationisgiv enby: new = crt T V crt ( T V ) (3–16) thedeationjustseemstherightthingtodo,butwedon'thav eanytheoretical justicationfordoingthis.Theoptimizationproblemin( 3–15 )canbesolvedusing methodssuchasprojectedgradientalongwithapenaltyfunc tion,butweruninto difcultiestosetstepsize,schedulingthepenaltyfuncti on[ 14 ],andamatrixinversion isrequiredintheprojectionstep.Applicationofsecondor dermethodssuchasNewton giverisetoatleast O ( N 2 ) complexitiesinmemoryandcalculations.Inthenext section,weinvestigateasequentialoptimizationschemet hatalleviatesthememory requirement,makingthealgorithmscalabletomorerealist icscenarioswheresample sizesareintheordertenthsofthousands.3.3.2SequentialMinimalOptimizationforthePRI Noticethattheformoftheproblemadoptedin( 3–15 )isnotaconvexprogram. Nevertheless,itcanbeturnedintoanequivalentformthatc anberecognizedasa convexprogram.Proposition3.3.1. Theconvexprogram, min T V subjectto 0 q T =0 1 T 1=0, (3–17) isequivalentto ( 3–15 ) ,where q = V andsome > 0 Proof3.3.1. Bydenition > 0 ,thustheconstraint log q T =log isequivalentto q T =0 .Thepositivesemidenitenessoftheinformationpotentia ltellusthat T V 0 .However,takingintoaccount q T =0 guaranteesstrictinequality; therefore,theminimizersof log T V and T V ontheconstraintsetdenedin ( 3–17 ) 39

PAGE 40

arethesame.Hence,solvingthefollowingpseudo-convexpr ogram min log T V subjectto 0 log q T =log 1 T 1=0, (3–18) shouldyieldthesamesolution.Now,thegradientoftheobje ctivein ( 3–15 ) withrespect totheweightvector is, r J ( )=2 1 T V V 2 T V V (3–19) Byincludingtheconstraints 1 T =1 and 0 ,for > 1 ,thesetofKKTnecessary conditionsforlocaloptimalityintheLagrangian L ( r )= J ( )+ P Ni =1 i c i ( )+ r e ( ) is 8>>>>>>>>>><>>>>>>>>>>: @ @ L ( r )= r J ( )+ P Ni =1 i @ @ c i ( )+ r @ @ e ( )=0, @ @ L ( r )= c ( ) 0, T c ( )=0= T 0, @ @r L ( r )= e ( )=0= 1 T 1. (3–20) Therearetwopossiblecasesforeach i i > 0 Forwhich i =0 and 2 t i T V 2 1 q i T q + r =0, (3–21) where t = V i =0 Yields 2 t i T V 2 1 q i T q i + r =0. (3–22) 40

PAGE 41

Noticethat r =2 1 1 ,therefore, 2 V T V 2 1 q T q +2 1 1 =0. (3–23) Pre-multiplying ( 3–23 ) by ( ) intheconstraintset,yieldsthefollowingsetof conditions ( ) T 2 V T V 0, 8 0: q T = 1 T =1 T =0 0 0 log q T =log 1 T =1, (3–24) whichbyTheorem B.1.1 aresufcientconditionsforthesolutionofapseudo-conve x functiondenedonanopensetwithconvexinequalityconstr aints,thatinourcase correspondsto ( 3–18 ) Twoimportantresultscomefromtheaboveproposition.Onei sobviousfromthe statementinthepropositionthattellsusthereexistanequ ivalentconvexprogramthat solves( 3–15 ).Butevenbetteristheonethatcomesasabyproductofthepr oof.The KKTrstorderconditionsin( 3–20 )arenecessaryandsufcienttosolve( 3–15 ). 3.3.2.1Decompositionintosmallersubproblems IntheproofofProposition 3.3.1 ,wesolveamoreconvenientformof( 3–15 ),for whichwefactorize ( 1) fromtheobjective.Ifwederivethethesolutionfortheorig inal problem,thetwocases( 3–21 )and( 3–22 )arereplacedby: i > 0 With i =0 and 2 1 T V t i 2 T q q i + r =0, (3–25) 41

PAGE 42

where t = V i =0 2 1 T V t i 2 T q q i i + r =0. (3–26) Notethatcombining( 3–25 )and( 3–26 )withtheoptimal ,wehavethat r =2 Usingthisfactalongwiththenon-negativityof ,yieldthefollowingcondition, At i Bq i > 1, (3–27) where A = 1 T V and B = T q Let'spartitionthesetofindexesoftheentriesof into W ,theworkingset,and P the complementarysetofinactiveelements.Then, = T W T P T ,forwhichwedenethe followingsubproblem: min W ( 1)log T W V WW W +2 T P V PW W + T P V PP P + 2 log T W q W + T P q P subjectto W 0, and T W 1 + T P 1 =1 (3–28) Similarremarkstotheonesmadein[ 55 ]canbeobtainedfor( 3–28 ): Theterms A = T P V PP P and B = T P q P areconstantinthesubproblem Thecomputationof 2 T P V PW W isindependentofthesizeof P andalsoofthe numberofnonzero i 's Replacing i ,with i 2 W with j with j 2 P leavesthecostunchangedandthe feasibilityremainsintact. Ifthesubproblemisoptimalbeforetheabovereplacement,t henewsubproblemis optimalifandonlyif j satisestheoptimalityconditions. Thesocalled“ Builddown ”stepisratherobvious.Nowthe“ Buildup ”stepthatstates thatmovingavariablefrom P to W givesastrictimprovementinthecostwhenthe subproblemisre-optimized.Inourcase,wecanjustifytheb uildupsinceweprove thattheKKTrstorderconditionsarenecessaryandsufcie ntforasolutionto( 3–15 ). Thesebuilddownandbuildupstepssuggestastrategyforopt imizing( 3–15 )bysolving 42

PAGE 43

smallersubproblems.Ateachiteration,solveasubproblem thatincludesaconstraint violatorpickedfromthecomplementaryset P .Iterateuntiloptimalityconditionsare satiseduptosomedesiredlevelofaccuracy.3.3.2.2Sequentialminimaloptimizationalgorithm Intheprevioussection( 3.3.2.1 ),theoptimizationproblemrelatedtotheprincipleof informationwasdecomposedintosmallersubproblemsthatc anbesolvediterativelyto achievethesolutionofthefullproblem.Animportantchara cteristicofsuchdecomposition isthatthesizeoftheworkingset W andthecomplementaryset P areindependentof thenumberofsupportvectorsinthesolution,thatis,the i 'sgreaterthanzero.The sequentialminimaloptimizationproposedin[ 60 ]choosesthesmallestsubproblemthat canbesolvedateachiteration.Thiscorrespondstosolving fortwovariablesatthetime, whichcanbefoundanalytically.Thelatterisofparticular appealtosolvePRIsinceour costdoesnothaveanstandardformasitisthecaseforSVMs(q uadraticprogram), therefore,wecannotresorttoofftheshelfsolversforourp roblem. Withoutlossofgenerality,wewillrefertoourvariablesin theworkingsetas 1 and 2 andthecomplementarysetas P .Bytheequalityconstraintinthesubproblem( 3–28 ) wehavethat 1 + 2 =1 T P 1 = w andthence 1 = w 2 .Letusdenoteby i the valueof i fromthepreviousiteration.Wecanformulatethesubproble mintermsof 2 as: min 2 [ ( 1)log A ( 2 ) 2 log B ( 2 ) ] subjectto 0 2 w and w = 1 + 2 (3–29) with A ( 2 )= 2 2 ( V 11 + V 22 2 V 12 )+2 2 ( w ( V 12 V 11 )+( 2 1 ))+ w 2 V 11 +2 w 1 + A where i = V i V 1 i 1 V 1 i 2 ,and A = P T V PP P ;and B ( 2 )= 2 ( q 2 q 1 )+ wq 1 + B 43

PAGE 44

where B = P T q P .Thesolutiontoproblem( 3–29 )liesonthelinesegment 1 = w 2 with 0 2 w .Computingthederivativeoftheobjectivein( 3–29 )yieldsasecond orderpolynomialon 2 (DetailsaregiveninAppendix B.2 ),thussolving c 2 2 2 + c 1 2 + c 0 =0 (3–30) withcoefcients: c 2 = 2( V 11 + V 22 2 V 12 )( q 2 q 1 ) c 1 =2( 1)( wq 1 + B )( V 11 + V 22 2 V 12 )+ 2( +1)( w ( V 12 V 11 )+( 2 1 ))( q 2 q 1 ) c 0 =2( 1)( w ( V 12 V 11 )+( 2 1 ))( wq 1 + B )+ 2 ( w 2 V 11 +2 w 1 + A )( q 2 q 1 ) conveyscandidatesolutionsthatoughttobecheckedalongw iththeendpointsof thelinesegment.Let r 1 and r 2 betherootsof( 3–30 ).Rulingoutcaseswithcomplex numbers,wehave: L =min f r 1 r 2 g and U =max f r 1 r 2 g thecandidatesolutionsare, s 1 = 8>>>><>>>>: 0 L 0 L 0 < L < w wL w and s 2 = 8>>>><>>>>: 0 U 0 U 0 < U < w wU w (3–31) If s 1 6 = s 2 wecheck J ( s i )= [ ( 1)log A ( s i ) 2 log B ( s i ) ] and 2 =argmin s i 2f s 1 s 2 g f J ( s i ) g (3–32) otherwise, 2 = s 1 = s 2 3.3.2.3SMOalgorithm Thealgorithmcanbedescribedintothreebasicsteps: 44

PAGE 45

Step1:Initialization q V f q IP ( ) T q CIP ( ) IP ( ) Step2:Constantswithinaniteration i f i V 1 i 1 V 2 i 2 A IP ( ) (2( 2 f 1 + 2 f 2 ) ( 2 1 V 11 + 2 2 V 22 +2 1 V 12 2 )) B CIP ( ) ( 1 q 1 + 2 q 2 ) w 1 + 2 Step3:Updates 2 solutiondescribedin ( 3–32 ) 1 w 2 f f +( 2 2 ) V T 1 +( 2 2 ) V T 2 IP ( ) A +(2( 1 f 1 + 2 f 2 ) ( 2 1 V 11 + 2 2 V 22 +2 1 V 12 2 )) CIP ( ) B +( 1 q 1 + 2 q 2 ) Steps2and3areiteratedfordifferentworkingsetschosena ccordingtosomeheuristics thataredescribedbelow.3.3.2.4Selectingtheworkingset Therearetwotypesofconstraintviolations:anequalityco nstraint( 3–25 )if i > 0 andtheinequalityconstraint( 3–27 )if i =0 .Theconstraintviolationsareeasyto 45

PAGE 46

computeateachiteration.Let bedenedas =2 1 IP ( ) f 2 CIP ( ) q (3–33) theconstraintqualicationsare i =2 if i > 0 ,and i > 2 if i =0 .Inthedescription ofouralgorithm,wechosetoinitialize withthesamevaluesof .However,ourcost functionsuggestthatpointsforwhich q i islargewillbeexpectedtobecomesupport vectors,thatis i > 0 .Wecanthenuse = q = ( q T 1 ) astheinitialguess.However, thiswouldimplythecomputationof f attheinitialization.Itiscustomarytochoose = 1 N 1 .Thenattheinitialiterationallconstraintswillbeviola ted(unless !1 ). Onepassthroughthewholesettakingpairsofindexes ( i j ) ,where i correspondtoa descendingorderofthesamplesaccordingto f and j 'stakenatrandomwillcreatethe rststageofsparsenessinourweightvector ;thisisourrstheuristic.Afterthispass, wecancheckwhether( 3–27 )issatisedforthecurrent i 'sthatarezero.Asecond stagesuggestcheckingthewithinthesetofsampleswith i > 0 ,andforwhich i > 2 sincetheyaremostlikelytovanish.Wewillstopwhencondit ionsarefullledwithin tolerance.3.3.3Experiments3.3.3.1Syntheticdata Here,weareconcernedwiththecomputationoftheprinciple ofrelevantinformation onlargesamplesizes.Thepurposeofthisexperimentalsetu pistoobservethe behaviorofthealgorithmintermsof whichcontrolsthenumberofnonzeroweights andthereforethenumberofequalityconstraintsthataremu chhardertosatisfy.Datais obtainedbysamplingfromatwodimensionalmixtureofthree Gaussianswithcenters (0,0) (3,3) ,and ( 6,4) ;sphericalcovariances 0.8 2 I 1.2 2 I ,and I ;andmixing proportions 0.2 0.3 ,and 0.5 ,respectively.Thekernelemployedinourexperimentsis theGaussiankernel ( x y )=exp( 1 2 2 k x y k 2 ) ,with =0.2 .Figure 3-6 depictsthe computationtimesfordifferenttolerancelevelsonthecon straintviolationsaswellas 46

PAGE 47

10 2 10 3 10 4 10 -1 10 0 10 1 10 2 10 3 Number of PointsComputation time [seconds]Tolerance e = 10 -1 l = 1.001 l = 1.5 l = 2.5 l = 5 l = 10 A 10 2 10 3 10 4 10 -1 10 0 10 1 10 2 10 3 Number of PointsComputation time [seconds]Tolerance e = 10 -2 l = 1.001 l = 1.5 l = 2.5 l = 5 l = 10 B 10 2 10 3 10 4 10 -1 10 0 10 1 10 2 10 3 Number of PointsComputation time [seconds]Tolerance e = 10 -3 l = 1.001 l = 1.5 l = 2.5 l = 5 l = 10 C Figure3-6.Computationtimesfordifferenttoleranceleve lsandsamplesizes 10 2 10 3 10 4 10 0 10 1 10 2 10 3 10 4 Number of PointsNumber of support vectorsTolerance e = 10 -1 l = 1.001 l = 1.5 l = 2.5 l = 5 l = 10 A 10 2 10 3 10 4 10 0 10 1 10 2 10 3 10 4 Number of PointsNumber of support vectorsTolerance e = 10 -2 l = 1.001 l = 1.5 l = 2.5 l = 5 l = 10 B 10 2 10 3 10 4 10 0 10 1 10 2 10 3 10 4 Number of PointsNumber of support vectorsTolerance e = 10 -3 l = 1.001 l = 1.5 l = 2.5 l = 5 l = 10 C Figure3-7.Numberofsupportvectorsfordifferenttoleran celevelsandsamplesizes varioussamplesizesandtradeoffparameter .Figure 3-7 showsthenalnumberof supportvectors(nonzeroweights)whentheabovementioned parametersarevaried. First,noticethatthekernelsize waskeptxedregardlessofthesizeofthe sample.Thisallowsforstudyingthealgorithmbehaviorint ermsofsparsityofdata, whichinthiscasecorrespondstosmallsamplesizes.Thetol erancelevelhasaclear effectonthecomputationtime,butmoreinterestingisthee ffectonthenumberof supportvectorswhichreduceswhenthelevelofaccuracyinc reases.Onthesmall sampleregime,theincrementonthecomputationtimeduetot hemoredemanding tolerancelevel =10 3 canbeattributedtothescarcityofdatawhichmakesthe costfunctionmuchmoresensitivetoanychangeintheweight vector .Intermsof computationalcomplexity,thealgorithmbehaveswithinth ereasonablelevels.Inthe experimentscarriedweboundthemaximumnumberofiteratio nsby N log N .This upperboundwasneverattainedbythelargersamplesizesand onlyreachedbysmall samplesizesonthemostdemandingscenarios,thatis,small andlarge ,sincethe 47

PAGE 48

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 -0.5 0 0.5 1 1.5 2 2.5 3 1 H(X) H(X;S) Figure3-8.Entropyoftheweightedsampleandcross-entrop ywithrespecttothe originalsample S fordifferentvaluesof .Noticethatthevaluesareplotted against 1 forbettervisualization. tradeoffparameter iscloselyrelatedwiththenumberofsupportvectorsandthu sthe proportionofconstraintviolatorsincreases. Wealsoperformtheillustrativeexperimentsonthedatasho wninFigure 3-1 .Figure 3-8 showsthedifferentvaluesoftheentropywhenthetradeoffp arameter isvaried between 1 and 20 .Theactualplotshowstheinverseof forbettervisualization.Inthis case,wecanobservetworegimes;for < 1.5 ,alltheimportanceisgiventoonepoint, whichcorrespondstothelargestpeakontheParzendensitye stimatefromtheobserved data S .Once becomeslargerthan 1.5 morepointsbecomeactive,andtheweights progressivelyequalizeacrossalldatapoints.Figure 3-9 displaytheresultingsupport vectorsfoundbythealgorithmfordifferentvaluesof .Figure 3-10 showsinagray intensitiesthedistributionoftheweightsacross S when =10 .Noticethatthelarger valuesareevenlydistributedtoaccountforthemoredensel ypopulatedregionsofthe data.3.3.3.2ImageretrievalwithpartiallyoccludeddataMNIST WeemployasubsetoftheMNISTdatabasetoperformexperimen tsonpattern retrievalfrompartiallyoccludedqueries.Theweightings chemefortheprincipleof 48

PAGE 49

-4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI SMO l = 1 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI SMO l = 1.25 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI SMO l = 2 S X -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI SMO l = 10 S X Figure3-9.Resultingsupportvectorsfordifferentvalues of -4 -2 0 2 4 -4 -3 -2 -1 0 1 2 3 4 PRI SMO l = 10 S 0 1 2 3 4 5 6 7 x 10 -3 Figure3-10.Distributionoftheweightsfor =10 49

PAGE 50

Table3-1.Resultsfortheimageretrievalfrompartiallyoc cludedqueries ncomp 38153050100 KPCA 0.59(0.03)0.55(0.015)0.53(0.007)0.50(0.007)0.48(0.0 08)0.47(0.006) KECA 0.64(0.018)0.57(0.018)0.51(0.022)0.49(0.013)0.49(0. 009)0.48(0.007) 234568 PRI 0.58(0.013)0.53(0.004)0.51(0.007)0.50(0.009)0.49(0. 01)0.48(0.01) nsuppvectors 62.7126181232276347 relevantinformationisappliedtolearntherepresentativ esamplesfromthetrainingdata. ResultsarecomparedagainstkernelPCAandkernelentropyc omponentanalysisfor differentnumberofcomponentsanddifferentvaluesof inthecaseofPRI.Thepattern retrievalapplicationrequirespre-imagingofthepattern sfromthefeaturespaceback totheinputspacetoapplyKPCAKECA).Weemploythemethodpr esentedin[ 42 ]to computethepre-imagesoftheprojectedpatternsintheKPCA andKECAsubspaces (Formoredetailsonthisproblemsee[ 37 50 ]).Theprincipleofrelevantinformation requiresadifferentapproachsinceitdoesnotprovideanex plicitprojectionmethodas itisthecaseforKPCAandKECA.Asimplepre-imagingalgorit hmcanbebasedon theNadaraya-Watsonkernelregression[ 51 ].Inourexperiments,weusetheGaussian kernel.ThepreimagingforPRIconsistofthefollowingstep s: Computethek-nearestneighborsonthesetofsupportvector s(trainingpointswith nonzeroweights)ofthequerypattern x Reconstructusingthefollowingequation: x rec = P ki =1 ( x i x ) x i P ki =1 ( x i x ) (3–34) wheretheindexesfor x i 'sareunderstoodasthe k nearestneighborsof x Inthiscase,wewanttoseeifwecaneffectivelyretrieveadi gitbypresentinganimage withamissingportion.Themodelistrainedwithcompleteim agesandtestedwith imagesforwhichthelowerhalfofthematrixhasbeenreplace dwithzeros. Table 3-1 showsthenormalizedSNRfortheretrievedpatternsusingaK PCAKECA andtheRKHSPRI.ThenumberscorrespondtoaverageoftenMon tecarlosimulations 50

PAGE 51

Figure3-11.Samplequerieswithmissinginformationandth eircorrespondingcomplete versions using200imagesperdigit(total600)fortrainingand50inc ompleteimagesofeach digit(150total)fortesting.Wealsopresenttheaveragenu mberofsupportvectorsfor theRKHSPRI.Thestandarddeviationsarealsoshown.Theker nelsizeis5,andthe numberofneighborsforpre-imagingis10.Noticethatdiffe rencesbetweenclosest resultsforallmethodsandalsothebestperformancesarewi thintherangesetby + thestandarddeviation.Figure 3-11 showssomeoftheexemplarsthatwereused fortestingalongwiththeincompleteversionsthatwereinp uttothealgorithms.Figure 3-12A showsthereconstructedimagesforKPCAwhereeachrowcorre spondstothe numberofcomponentsinTable 3-1 .Likewise,Figure 3-12B displaystheresultsfor KECA,andFigure 3-12C showtheresultsoftheproposedapproachforthedifferent LambdavaluesinTable 3-1 .Althoughtheerrorsforallmethodsareverysimilar,the resultsforthePRIaremorevisuallyappealing. 51

PAGE 52

AKPCA BKECA CPRI Figure3-12.Retrievedpattersfromquerieswithmissingin formationfordifferent methodsontheMNISTdataset. 3.4TheInformationPotentialRKHSFormulationofthePRI Forapositivesemidenitekernel on X ,wecandeneaspaceofcontinuous boundedfunctionalson X asthecompletionofthespanofall f 'softheform f ( x )= X i 2A i ( x x i ) (3–35) where i 2 R and x i 2X .ThisspaceisaReproducingKernelHilbertspaceof continuousfunctionsfrom X .Asimilardevelopmentcanbeadoptedfortheinformation potentialsintroducedby[ 62 ].Fortheset F ofprobabilitydensityfunctionsthatare squareintegrablein R n ,wecandenethecross-informationpotential V (CIP)asa 52

PAGE 53

bilinearformthatmapsdensities f i f j 2F totherealnumberstroughtheintegral, V ( f i f j )= Z R n f i ( x ) f j ( x )d x (3–36) Itiseasytoseethatforabasisofuniformlybounded,square integrableprobability densityfunctions, V denesaRKHSonthe span fFg (undercompletion).Nowconsider theset G = f g = P mi =1 i ( x i ) j x i 2 R n P mi =1 i =1 ,and i 0 g ,where isa “Parzen”typeofkernel,thatis issymmetric,nonnegative,hasboundedintegral(can benormalized),andshiftinvariantwith asthescaleparameter; V alsodenesaRKHS K on G .Clearly,forany g 2G ,wehave kV ( g ) k K k ( x ) k 2 ;therefore, K isaspace offunctionalsonabounded,albeitnon-compactset. Theobjectivefunctionfortheprincipleofrelevantinform ation(Equation( D–8 )) canbewrittenintermsoftheinformationpotentialRKHS.Us ingtheParzenbased estimation,werestrictthesearchproblemto GF .Inthiscase,wehavethatequation ( D–8 )canberewrittenas: J ( f )= log V ( f f ) log [ V ( f g )] 2 V ( f f ) V ( g g ) (3–37) straightforwardmanipulationofthetermsyieldsanequiva lentproblem: argmin f 2G [ (1 )log V ( f f ) 2 log V ( f g ) ] (3–38) orequivalently, argmin f 2G (1 )log k ( f ) k 2K 2 log h ( f ), ( g ) i K (3–39) where : G7!K isanunderlyingmappingsuchthat V ( f g )= h ( f ), ( g ) i K 2 Twoimportantaspectsoftheaboveobjectiveare:thechoice ofthekernel,shapeand 2 Thisiscommonlyknownasthefeaturemapwithinthemachinel earningcommunity. 53

PAGE 54

size ,determinesdifferentspacesoffunctionals(RKHS's);the trade-offparameter denesasetofregimesforthepossiblesolutionstotheprob lem.Notehoweverthat thereisnodifferencebetween( D–8 )or( 3–39 )intermsofthespacebeingsearched (itis G forboth).Therefore,thecorequestioniswhetherwecanper formthesearchin theIPRKHS.Letusfocusonthecase > 1 .Consideranelement F 2K ,theterm log h F ( g ) i K willplaytheroleofariskfunctional R g evaluatedat F .Now,sincewe arenotprovidedwiththefunction g ,wetakeempiricalestimatorof R emp : R emp ( F )= 2log 1 N N X i =1 h F ( ( x i )) i K (3–40) Thisquantitycomesfrommappingtheempiricaldistributio nafterbeingconvolved withtheParzenkernel .Nonetheless,inorderforthisfunctional F tobevalidasan estimatorofainformationtheoreticquantity,wealsoneed : h F ( ( x )) i K 0, 8 x 2 R n (3–41) SolutiontotheconstrainedproblemintheIPRKHS: Inthecasewhere > 1 wehavethefollowingconstrainedproblem(fornotationalc onveniencewewilldenote ( ( x i )) by ( i ) ): min F 2K ( 1)log k F k 2K 2 log 1 N N X i =1 h F ( i ) i K # s.t. h F ( i ) i K 0, 8 x 2 R n (3–42) Hence,( 3–42 )canbeseenasaregularizedriskminimizationproblem min F 2 K R emp ( F )+ ( 1) n k F k 2K where K denotesthesetoffeasiblepointsin K .Theevaluationoftheconstraintin ( 3–42 )thatrequiresnonnegativityoftheinnerproductin K forall x 2 R n willbe relaxedbyevaluatingtheconditiononlyat x i inthesample S .Thisyieldsthefollowing 54

PAGE 55

formulation, min F 2K ( 1)log k F k 2K 2 log 1 N N X i =1 i # s.t. h F ( i ) i K i i 0 (3–43) withLagrangianfunction L ( F ) : ( 1)log k F k 2K 2 log 1 N N X i =1 i + N X i =1 i ( h F ( i ) i K i ) N X i =1 i i (3–44) where and arethemultipliers.Settingderivativeswithrespectto F and tozero yields: F = k F k 2K 2( 1) N X i =1 i ( i ) k F k 2K = 2( 1) N P i j =1 i j h ( i ), ( j ) i K N X i =1 i i 0 i = i + 2 P Ni =1 i (3–45) Replacing( 3–45 )intheLagrangian( 3–44 ),weobtainthedualproblem min 2 R N ( 1)log N X i j =1 i j h ( i ), ( j ) i K s.t. 2 log 1 N N X i j =1 j h ( i ), ( j ) i K = 0, and N X i =1 i =1 (3–46) 55

PAGE 56

CHAPTER4 ESTIMATINGENTROPY-LIKEQUANTITIESWITHINFINITELYDIVIS IBLEKERNELS Theoperationalquantitiesininformationtheoryarebased ontheprobabilitylaws underlyingthedatagenerationprocess,butthesearerarel yorneverknowninthe statisticallearningsettingweretheonlyinformationava ilablecomesfromasample f z i g i =1 N .TheuseofRenyi'sdenitionofentropyalongwithParzende nsityestimation wereproposedasthemaintooltoworklearningproblemsinte rmofinformation theoryquantitiessuchentropyandrelativeentropy[ 62 ].Renyi'sentropyoforder is ageneralizationofShannon'sentropybyrelaxingthecondi tionofadditivityofentropy tothegeneralized mean 1 .Theentropyofarandomvariable X asafunctionofthe parameter isgivenby, H ( X )= 1 1 log Z X p 1 ( x ) p ( x )d x (4–1) where p istheprobabilitydensityfunction(continuouscase)orth eprobabilitymass function(takingtheintegralasasum)oftherandomvariabl e X withsupport X .The limitingcase 1 bringsbackShannon'sentropy.Aplug-inestimatorof( 4–1 )for = 2 canbederivedusingParzenwindowapproximation.Forasamp le X = f x i g ni =1 R d anestimatorofR enyi's 2 -orderentropybasedonParzenwindowwithGaussiankernel ofsize isgivenby: ^ H 2 ( X )= log 1 n 2 n X i j =1 p 2 ( x i x j ) (4–2) where ( x y )= C exp 1 2 2 k x y k 2 and C isanormalizationconstant.Animportant conceptualelementderivedfromthisframeworkisthefunct ionalthatcorresponds totheargumentofthe log ,whichhasbeencalledinformationpotentialinanalogy tothepotentialeldsarisinginphysics[ 62 ].Morerecentworkhasshownthatthe 1 h X i = 1 P ni =1 w i ( x i ) ,where P w i =1 w i 0 and isaKolmogorov-Nagumo function 56

PAGE 57

informationpotentialisaspecialcaseofapositivedenit ekernelcalledthecross informationpotentialbetweentwopdfsthatmapsprobabili tydensityfunctionsin L 2 toa ReproducingkernelHilbertSpaceoffunctions[ 95 ].Moreover,thisframeworkhasbeen alreadyexploitedtosolveoptimizationproblemswithinfo rmationtheoreticobjective functionsthatbearcloseresemblancetokernelmethods[ 71 ].Theserecentinsights motivateviewsthatgobeyondtheParzendensityestimation thatinitiallymotivatedits study. Inthiswork,weshowhowbyusingpositivedenitekernelswi thspecicproperties, wecanobtainentropy-likequantitieswithoutassumingtha tprobabilitiesofeventsare knownorhavebeenestimated.Otherapproacheswithsimilar motivationbasedonthe conceptofentropicgraphshavebeenrecentlyaddressedin[ 15 31 64 ].However,one ofthemaindisadvantagesthequantitiesproposedbytheseg raphbasedformulationare notwellsuitedforadaptationschemessincetheyarenotdif ferentiable.Thisisnotthe casefortheproposedframeworkasweshowinChapter 5 .Wewillfollowthestatistical learningsettingwheretheonlyavailableinformationisco ntainedinanite i.i.d. sample Z = f z i g ni =1 .Inthissensewethinkaboutentropyasameasureofthelack( uncertainty) orpresence(structure)ofstatisticalregularitiesinagi vensamplerepresentedbythe aGrammatrix.Fromtheaxiomaticcharacterizationofentro pythatleadstoR enyi's denition,wedevelopananalogueversionofthisfunctiont hatisappliedtopositive denitematrices.Then,welookatsomebasicinequalitieso finformationandhow theycanbetranslatedtothesettingofpositivedenitemat rices.Thepurposeofthis characterizationistoestablishsomedesirablepropertie sonthepositivedenitekernels thatcanbeemployedtoconstructtheGrammatrixforwhichou rextensionofentropy makessenseintermsofinformation. 4.1Motivation TheuseofHilbertspacestorepresentdataisnotanewidea[ 1 2 ],andithas becomeofcommonpracticeinmachinelearningunderthename ofkernelmethods. 57

PAGE 58

Kernelmethodsprovideanappealingframeworktodealwithd ataofdifferentnature byembeddingabstractsetinreproducingkernelHilbertspa ces,whereitispossible tocarryoutmanipulationsoftherepresentationofdatabyt heoperationsofaddition, scalarmultiplicationandinnerproduct.Thisallowsoneto dealwithalgorithmsina rathergenericwayprovidedthekerneliswell-ttedtothep articularproblem.These propertyhasbeenrecentlyexploitedinmanypracticalappl icationswheredata isnotnecessarilygivenasvectorin R p ,forexampletext,trees,pointprocesses, functionaldata,amongothers[ 82 ].Ithasbeennoticedthatkernelinducedmapcanbe understoodasameanforcomputinghighorderstatisticsoft hedataandmanipulating theminalinearfashionasrstorderstatistics.Methodssu chaskernelindependent componentanalysis[ 4 ],theworkonmeasuresofdependenceandindependenceusing Hilbert-Schmidtnorms[ 29 ],andrecentworkonquadraticmeasuresofindependence [ 79 ]arejustamongtheexamplesofthisemergingline.Asweshal lsee,theGram matrixplaysafundamentalroleinestablishingtheconnect ionbetweeninformation theoreticconceptsandkernelmethods.4.1.1HilbertSpaceRepresentationofData Inthissection,wewanttomotivatetheuseofpositivedeni tematricesassuitable descriptorsofdata.Forthis,weneedtounderstandtherole oftheHilbertspace representationandhowitnaturallyarisesfromthefundame ntalideasofpattern analysis.Let ( X B X ) betheobjectspace,thiswith -algebra B X andaprobability measure P X denedonit.Afunction : X7! R iscalledafeature.Arepresentation isafamilyoffeatures f t g t 2T ,where ( T B T T ) isameasurespaceand T is -nite. Let t bealsoboundedforall t 2T ,andletusdenote t ( x ) by ( t x ) where t 2T and x 2X .Ifwealsorequirethatforallxed x and y in X G ( x y )= Z T ( t x ) ( t y )d T ( t ) < 1 (4–3) 58

PAGE 59

Then,theset F denedasthecompletionofthesetoffunctions F oftheform, F ( t )= N X i =1 i ( t x i ), (4–4) where i 2 R x i 2X ,and 8 N 2 N 2 .Thespace F isaHilbertspacerepresentation oftheset X .Nevertheless,dealingexplicitlywithsuchan F maybedifcultifnot impossibleforpracticalpurposes.Thefollowingresultgi vesanalternativewaytodeal withtheproblembasedonthebivariatefunction G ( x y ) denedabove.Considertheset offunctionson X oftheform f ( x )= N X i =1 i G ( x x i ), (4–5) where i 2 R x i 2X ,and 8 N 2 N .Letusdenetheinnerproductbetweenelements f = P Ni =1 i G ( x x i ) and g = P Mj =1 i G ( x x j ) oftheabovesetas: h f g i = N X i =1 M X j =1 i j G ( x i x j ), (4–6) thecompletionoftheabovesetisaHilbertspace H offunctionson X .Moreover, H isa reproducingkernelHilbertspacewithkernel G .Noticethatforanyniteset f x i g Ni =1 we havethat N X i =1 N X j =1 i j G ( x i x j ) 0, (4–7) forall 2 R N .Functionssatisfyingtheaboveconditionarecalledposit ivedenite. Theorem4.1.1. (BasicCongruenceTheorem[ 58 ]): Let H 1 and H 2 betwoabstract Hilbertspaces.Let X beanindexset.Let f F ( x ), x 2Xg ,beafamilyofvectorswhich span H 1 .Similarly,let f f ( x ), x 2Xg beafamilyofvectorswhichspan H 2 .Suppose 2 Eventhoughisnotexplicitlystated,weassumetheconstruc tionofalinearspaceof realfunctionswithdomain T 59

PAGE 60

that,forevery x and y in X h F ( x ), F ( y ) i 1 = h f ( x ), f ( y ) i 2 (4–8) Thenthespaces H 1 and H 2 arecongruent,andonecandeneacongruence from H 1 to H 2 whichhasthepropertythat ( F ( x ))= f ( x ) for x 2X Proposition4.1.1. Let X beacompactspace.Thespaces F and H arecongruent. Proof4.1.1. Thecongruencefollowsfromdenitionof F and H .For F = P ni =1 i ( t x i ) simplytake : F7!H as: ( F )= n X i =1 i Z X ( t x i ) ( t )d T ( t )= f Theabovepropositionallowsustoperformtheanalysisofth erepresentationof X ontheequivalenceclassesthatcanbeformedbyusingthefun ction G todene relationsbetweentheelementsof X .Fromthecongruence,wecandeneadistance functionbetweentherepresentationsoftwoelements x y 2X usingthefunction G as follows: d 2 ( ( t x ), ( t y ))= G ( x x )+ G ( y y ) 2 G ( x y ), (4–9) forconveniencewewrite d 2 ( ( t x ), ( t y )) as d 2 ( x y ) 4.1.2TheCross-InformationPotentialRKHS Fortheset F ofprobabilitydensityfunctionsthataresquareintegrabl ein R n ,we candenethecross-informationpotential V (CIP)asabilinearformthatmapsdensities f i f j 2F totherealnumberstroughtheintegral, V ( f i f j )= Z R n f i ( x ) f j ( x )d x (4–10) Itiseasytoseethatforabasisofuniformlybounded,square integrableprobability densityfunctions, V denesaRKHSonthe span fFg (uptocompletion).Nowconsider theset G := f g = P mi =1 i ( x i ) j x i 2 R n P mi =1 i =1 ,and i 0 g ,where isa 60

PAGE 61

“Parzen”typeofkernel,thatis issymmetric,nonnegative,hasboundedintegral(can benormalized),andshiftinvariantwith asthescaleparameter; V alsodenesaRKHS K on G .Clearly,forany g 2G ,wehave kV ( g ) k K k ( x ) k 2 ;therefore, K isaspace offunctionalsonabounded,albeitnon-compactset.Notice thatthecrossinformation potential,bydenition,isapositivedenitefunctiontha tisdatadependent,anddiffers fromtheinstance-basedrepresentationinthisrespect.Ne vertheless,theempirical estimator( 4–2 )linksbothHilbertspacerepresentations.Ifweconstruct theGrammatrix K withelements K ij = ( x i x j ) ,itiseasytoverifythat( 4–2 )correspondsto: ^ H 2 ( X )= log 1 n 2 tr ( KK ) (4–11) Aswecansee,theestimatoriftheinformationpotentialcan berelatedtothenorm oftheGrammatrix K denedas k K k 2 =tr ( KK ) .Inthenextsection,weextendthis concepttoothernormsandshowhowthepropertiesofRenyi's denitionofentropy carryon. 4.2PositiveDeniteMatrices,andRenyi'sEntropyAxioms Hermitianmatricesareconsideredasgeneralizationsofre alnumbers.Itispossible todeneapartialorderingonthissetbyusingpositiveden itematrices;fortwo Hermitianmatrices A B 2 M n ,wesay A < B if A B ispositivedenite.Likewise, A B meansthat A B isstrictlypositivedenite.Thefollowingspectraldecom position theoremrelatestothefunctionalcalculusonmatricesandp rovidesareasonablewayto extendcontinuousscalar-valuedfunctionstoHermitianma trices. Theorem4.2.1. Let D C beagivensetandlet N n ( D ):= f A 2 M n : A isnormaland ( A ) D g .If f ( t ) isacontinuousscalar-valuedfunctionon D ,thentheprimarymatrixfunction f ( A )= U 0BBBB@ f ( 1 ) 0 ... ... 0 f ( n ) 1CCCCA U (4–12) 61

PAGE 62

iscontinuouson N n ( D ) ,where A = U U = diag ( 1 ,..., n ) ,and U 2 M n isunitary. NowwearereadytodeneamatrixanaloguetoRenyi'sentropy thatwillbeapplied toGrammatricesconstructedusingapositivedenitefunct ion. Considertheset +n ofpositivedenitematricesin M n forwhich tr( A ) 1 .Itisclear thatthissetisclosedunderniteconvexcombinations.Proposition4.2.1. Let A 2 +n and B 2 +n andalso tr( A )=tr( B )=1 .Thefunctional S ( A )= 1 1 log 2 ( tr A ) (4–13) satisesthefollowingsetofconditions: (i) S ( PAP )= S ( A ) foranyorthonormalmatrix P 2 M n (ii) S ( pA ) isacontinuousfunctionfor 0 < p 1 (iii) S ( 1 n I )=log 2 n (entropyisexhaustive). (iv) S ( A n B )= S ( A )+ S ( B ) (v)If AB = BA = 0 ;thenforthestrictlymonotonicandcontinuousfunction g ( x )=2 ( 1) x for 6 =1 and > 0 ,wehavethat: S ( tA +(1 t ) B )= g 1 ( tg ( S ( A ))+ +(1 t ) g ( S ( B )) ) (4–14) Proof4.2.1. Theproofof( i )easilyfollowsfromTheorem 4.2.1 .Take A = U U now PU isalsoaunitarymatrixandthus f ( A ) = f ( PAP ) thetracefunctionalisinvariant underunitarytransformations.For( ii ),theproofreducestothecontinuityof 1 1 log 2 ( p ) For( iii ),asimplecalculationyields tr A = 1 n 1 .Now,forproperty( iv ),noticethatif tr A =tr B =1 ,then, tr( A n B )=1 .Since A = U U and B = V V wecanwrite A n B =( U n V )( n )( U n V ) ,fromwhich tr( A n B ) =tr( n ) =tr( )tr( ) andthus( iv )isproved.Finally,( v )noticethatforanyintegerpower k of tA +(1 t ) B wehave: ( tA +(1 t ) B ) k =( tA ) k +((1 t ) B ) k since AB = BA = 0 .Under extraconditionssuchas f (0)=0 theargumentintheproofofTheorem 4.2.1 canbe extendedtothiscase.Sincetheeigen-spacesforthenon-nu lleigenvaluesof A and B 62

PAGE 63

areorthogonalwecansimultaneouslydiagonalize A and B withtheorthonormalmatrix U ,thatis A = U U and B = U U where and arediagonalmatricescontainingthe eigenvaluesof A and B respectively.Since AB = BA = 0 ,then = 0 .Undertheextra condition f (0)=0 ,wehavethat f ( tA +(1 t ) B )= f ( tA )+ f ((1 t ) B ) yieldingthe desiredresultfor( v ). Noticealsothatif ( A )=1 ,thatis A isrankonematrix, S =0 for 6 =0 Thefollowingimportantpropertyisalsotrue. Proposition4.2.2. Let A 2 +n ,and tr( A )=1 .For > 1 S ( A ) S ( 1 n I ) (4–15) Proof4.2.2. Let f i g bethesetofeigenvaluesof A .Thenwehavethat, S ( A ) S ( 1 n I )= 1 1 log 2 tr( AA 1 ) n ( 1) ; (4–16) = 1 1 log 2 tr A ( nA ) 1 ; (4–17) = 1 1 log 2 X i i f ( n i ) # ; (4–18) 1 1 X i i log 2 f ( n i ); (4–19) = X i i log 2 i 1 n ; (4–20) log 2 X i i 1 n i # =0. (4–21) Where ( 4–19 ) and ( 4–21 ) areduetoJensen'sinequality. However,suchacharacterizationmaynotbeenoughtotellwh atunit-tracepositive denitematriceshaveaninformationtheoreticinterpreta tion.Forexampleconsider thematrix L = 1 n ll T where n 1 oftheentriesare 1 andtheremaining n n 1 entriesare 1 .Thiscanbeseenasavectorencodingclasses.Noticethat,e valuatingtheentropy 63

PAGE 64

functionaldenedaboveyields 0 forany .However,withatwo-columnmatrix M ,for whichthecolumnsrepresenttheclassandtherowsthesample s,thatencodestheclass membershipsas M ij =1 ifthe i -thsamplebelongstothe j -thclass,and 0 otherwise, weobtainamorereasonablequantityusing L = MM T ,relatedtoabinomialdistribution with p = n 1 n .Interestingly,wecansimplerelate l and M by l = M ( 1, 1 ) T .Inthe followingsections,wewilladdressthisissuebyconsideri ngaparticularcontextinwhich Hadamardproductsbetweenpositivedenitematricesarise 4.2.1EntropyinequalitiesforHadamardProducts InthepropertieslistedaboveinPropositions 4.2.1 and 4.2.2 ,wedidnotconsidered Hadamardproductsofpositivedenitematrices.Thisprodu ctmaybeofinterestin thecasewehavetwomatrices A and B in n withunittracewherethereexistssome relationbetweentheelements A ij and B ij forall i and j .TheHadamardproductcanbe usefulindevelopinganaloguestojointentropies,whereea chonethematricesinvolved intheHadamardproductrepresentsarandomvariable.Befor ewepresentthemain resultofthispartofthesection,weneedtointroducetheco nceptofmajorizationand someresultspertainingtheorderingthatarisesfromthisd enition. Denition4.2.1. (Majorization): Let p and q betwononnegativevectorsin R n such that P ni =1 p i = P ni =1 q i < 1 .Wesay p 4 q q majorizes p ,iftheirrespectiveordered sequences p [1] p [2] p [ n ] and q [1] q [2] q [ n ] denotedby f p [ i ] g ni =1 and f p [ i ] g ni =1 ,satisfy: k X i =1 p [ i ] k X i =1 q [ i ] for k =1,..., n (4–22) Itcanbeshownthatif p 4 q then p = Aq forsomedoublystochasticmatrix A [ 11 ]. Itisalsoeasytoverifythatif p 4 q and p 4 h then p 4 tq +(1 t ) h for t 2 [0,1] Themajorizationorderisimportantbecauseitcanbeassoci atedwiththedenitionof Schur-concave(convex) functions.Arealvaluedfunction f on R n iscalledSchur-convex if p 4 q implies f ( p ) f ( q ) andSchur-concaveif f ( q ) f ( p ) 64

PAGE 65

Lemma4.2.1. Thefunction f : S n 7! R + ( S n denotesthe n dimensionalsimplex), denedas, f ( p )= 1 1 log 2 n X i =1 p i (4–23) isSchur-concavefor > 0 Noticethat,Schur-concavity(Schur-convexity)cannotbe confusedwithconcavity (convexity)ofafunctionintheusualsense.Now,weareread ytostatetheinequalityfor Hadamardproducts.Proposition4.2.3. Let A and B betwo n n positivedenitematriceswithtrace 1 with nonnegativeentries,and A i i = 1 n for i =1,2,..., n .Then,thefollowinginequalitieshold: (i) S A B tr( A B ) S ( B ), (4–24) (ii)and S A B tr( A B ) S ( A )+ S ( B ). (4–25) Proof4.2.3. Inproving ( 4–24 ) and ( 4–25 ) ,wewillusethefactthat S preservesthe majorizationorder(inversely)ofnonnegativesequenceso nthe n -dimensionalsimplex. Firstlookattheidentity x T ( A B ) x =tr ( AD x BD x ) = 1 n Inparticular,if f x i g ni =1 isanorthonormalbasisfor R n tr ( A B ) = n X i =1 x T i ( A B ) x i 65

PAGE 66

Ifwelet f x i g ni =1 betheeigenvectorsof A B orderedaccordingtotheirrespective eigenvaluesindecreasingorder,then, k X i =1 x T i ( A B ) x i = k X i =1 tr ( AD x i BD x i ) 1 n k X i =1 tr 11 T D x i BD x i = 1 n k X i =1 x T i Bx i 1 n k X i =1 y T i By i (4–26) where k =1,..., n and f y i g ni =1 aretheeigenvectorsof B orderedaccordingtotheir respectiveeigenvaluesindecreasingorder.Theinequalit y ( 4–26 ) isequivalenttosay that n ( A B ) 4 ( B ) ,thatis,thesequenceofeigenvaluesof ( A B ) = tr( A B ) is majorizedbythesequenceofeigenvaluesof B ,whichimplies ( 4–24 ) Toprove ( 4–25 ) noticethatfor A wehavetwoextremecases A = 1 n I and A = 1 n 11 T Taking A = 1 n 11 T wehavethat k X i =1 i ( B )= n k X i =1 1 n tr 11 T D x i BD x i = k X i =1 i A B tr( A B ) (4–27) theotherextremecasewhere A = 1 n I wehave, 1 n k X i =1 i ( B ) 1 n n k X i =1 1 n d i ( B )= k X i =1 i A B tr( A B ) (4–28) where f i ( X ) g aretheeigenvaluesof X indecreasingorderand f d i ( X ) g arethe elementsofthediagonalof X orderedindecreasingorder.Theinequalities ( 4–27 ) and ( 4–28 ) imply 4–25 66

PAGE 67

4.2.2TheTensorandHadamardProductEntropyGap Themutualinformationofapairofrandomvariables X and Y canbeseenasthe gainofinformationfromassuming X and Y tobeindependenttoknowingthejoint probabilitydistribution.Inotherwords,theamountofunc ertaintyreducedfromfrom knowingthemarginaldistributionstoknowingthejointdis tribution.IntheShannon denitionthisinformationgaincanbeexpressedas: I ( X ; Y )= H ( X )+ H ( Y ) H ( X Y ) (4–29) where H ( X ) and H ( Y ) arethemarginalentropiesof X and Y ,and H ( X Y ) istheirjoint entropy.Inanalogywecancomputethequantity: I ( A ; B )= S ( A )+ S ( B ) S A B tr( A B ) (4–30) forpositivesemidenite A and B withnonnegativeentriesandunittracesuchthat A ii = 1 n forall i =1,..., n .Noticethattheabovequantityisnonnegativeandsatises S ( A ) I ( A ; A ). 4.2.3TheSingleandHadamardProductEntropyGap Anotherquantityofinterestisrelatedtoconditionalentr opyof X given Y ,which canbeunderstoodastheuncertaintyabout X thatremainsafterknowingthejoint distributionof X and Y .InShannon'sdenitiontheConditionalentropy H ( X j Y ) canbe expressedas: H ( X j Y )= H ( X Y ) H ( Y ). (4–31) Extendingthisideatothematrixcaseyields: H ( A j B )= S A B tr( A B ) S ( B ) (4–32) forpositivesemidenite A and B withnonnegativeentriesandunittracesuchthat A ii = 1 n forall i =1,..., n .Theabovequantityisnonnegativeandupperboundedby 67

PAGE 68

S ( A ) .Noticethatthenormalizationprocedureforinnitelydiv isiblematricesproposed inTheorem 4.3.3 isnowbeautifullyjustiedasthemaximumentropymatrixam ong allmatricesforwhichtheHilbertspaceembeddingsareisom etricallyisomorphic.In thefollowingsection,wewillseehowinnitedivisiblemat ricesrelatetheHadamard productswithconcatenationoftherepresentationsofthev ariableswewanttoanalyze jointly. 4.3InnitelyDivisibleFunctions 4.3.1Direct-SumandProductkernels4.3.1.1Direct-sumkernels Let 1 and 2 betwopositivedenitekernelsdenedon XX .Thekernel = 1 + 2 ,denedas ( x y )= 1 ( x y )+ 2 ( x y ) ,isapositivedenitekernel.The abovefunctioniscalleddirectsumkernelanditistherepro ducingkernelofaspace H offunctionsoftheform f = f 1 + f 2 ,where f 1 2H 1 and f 2 2H 2 ,and H 1 and H 2 arethe RKHSsdenedby 1 and 2 ,respectively.ConsidertheHilbertspace H = H 1 H 2 formedbyallpairs ( f 1 f 2 ) comingfrom H 1 and H 2 ,respectively.Itispossiblethatsome functions f 6 =0 belongtothe H 1 and H 2 atthesametime.Thesefunctionsformaset ofpairs ( f f ) 2H ,whichturnouttobeaclosedsubspaceof H denotedby H 0 ,such that, H = H 0 H ? 0 .Therefore,thelinearcorrespondence f ( x )= f 1 ( x )+ f 2 ( x ) between f 2H and ( f 1 f 2 ) 2H issuchthatallelementsin H 0 maptothezerofunctionin H andtheelementsof H and H ? 0 areinonetoonecorrespondence.Thenormof f 2H canbedenedfromthecorrespondence f 7! ( g 1 ( f ), g 2 ( f )) as: k f k 2H = k ( g 1 ( f ), g 2 ( f )) k 2H = k g 1 ( f ) k 2H 1 + k g 2 ( f ) k 2H 2 (4–33) Noticethat, ( g 1 ( f ), g 2 ( f )) isthedecompositionof f intothepair H withminimumnorm inthisspace.Thefollowingtheoremstatestheresult[ 3 ]. Theorem4.3.1. If i ( x y ) isthereproducingkerneloftheclass H i ,withnorm kk i ,then ( x y )= 1 ( x y )+ 2 ( x y ) isthereproducingkerneloftheclassoffunctions H ofall 68

PAGE 69

functions f = f 1 + f 2 with f i 2H i ,andwiththenormdenedby k f k 2 =min k f 1 k 21 + k f k 22 (4–34) theminimumistakenoveralldecompositions f = f 1 + f 2 with f i 2H i 4.3.1.2Productkernelandtensorproductspaces Considertwopositivedenitekernels 1 and 2 denedon XX and YY respectively.Theirtensorproduct 1 n 2 :( XY ) ( XY ) denedby: 1 n 2 (( x i y i ),( x j y j ))= 1 ( x i x j ) 2 ( y i y j ) (4–35) isalsoapositivedenitekernel.Notethatwecanconsidert wokernels ~ 1 and ~ 2 bothdenedon ( XY ) ( XY ) ,suchthat ~ 1 (( x i y i ),( x j y j ))= 1 ( x i x j ) and ~ 2 (( x i y i ),( x j y j ))= 2 ( y i y j ) ;thekernel ~ 1 ~ 2 (( x i y i ),( x j y j ))=~ 1 (( x i y i ),( x j y j ))~ 2 (( x i y i ),( x j y j ))= 1 n 2 (( x i y i ),( x j y j )) ispositivedenitebySchurTheorem[ 34 ].Letuslookatthespaceoffunctionsthat n = 1 n 2 spans.Let H n = H 1 nH 2 ,where H 1 and H 2 aretheRKHSsspannedby 1 and 2 ,respectively.Thespace H n isthecompletionofthespaceofallfunctions f on XY oftheform: f ( x y )= n X i =1 f ( i ) 1 ( x ) f ( i ) 2 ( y ) (4–36) with f ( i ) 1 2H 1 and f ( i ) 2 2H 2 ,andinnerproduct, h f g i n = n X i =1 m X j =1 h f ( i ) 1 g ( j ) 1 i 1 h f ( i ) 2 g ( j ) 2 i 2 (4–37) Thefunctions f and g mayhavemultiplerepresentationsoftheform( 4–36 )without changing h f g i n .Letuslookatthecasewhere X and Y arethesameset.The followingtheoremdescribesthekernelderivedfromtheres trictionof 1 n 2 tothe diagonalsubsetof XX [ 3 ]. 69

PAGE 70

Theorem4.3.2. For x y 2X ,thekernel ( x y )= 1 ( x y ) 2 ( x y ) isthereproducing kerneloftheclass H oftherestrictionsofthedirectproduct H n = H 1 nH 2 tothe diagonalsetformedbyallelements ( x x ) 2XX .Foranysuchrestriction f k f k =min k g k n forall g 2H n suchthat f ( x )= g ( x x ) 4.3.2NegativeDeniteFunctionsandInniteDivisibleMat rices 4.3.2.1NegativedenitefunctionsandHilbertianmetrics Let M = ( X d ) beaseparablemetricspace,anecessaryandsufcientcondi tion for M tobeembeddableinaHilbertspace H isthatforanyset f x i gX of n +1 points, thefollowinginequalityholds: n X i j =1 i j d 2 ( x 0 x i )+ d 2 ( x 0 x j ) d 2 ( x i x j ) 0, (4–38) forany 2 R n .Thisconditionisequivalentto n X i j =0 i j d 2 ( x i x j ) 0, (4–39) forany 2 R n +1 ,suchthat P ni =0 i =0 .Thisconditionisknownasnegative deniteness.Interestingly,theaboveconditionimpliest hat exp( rd 2 ( x i x j )) ispositive denitein X forall r > 0 [ 75 ].Indeed,matricesderivedfromfunctionssatisfyingthe abovepropertyconformaspecialclassofmatricesknowasin nitedivisible. 4.3.2.2Innitedivisiblematrices AccordingtotheSchurproducttheorem A < 0 impliesthat A n = A A A < 0 foranypositiveinteger n .Aninterestingquestioniswhentheaboveholdsifoneweret o takefractionalpowersof A ,thatis,when A 1 m < 0 foranypositiveinteger m .Thisleadto theconceptofinnitedivisiblematrices[ 12 33 ]. Denition4.3.1. Supposethat A < 0 and a ij 0 forall i and j A issaidtobeinnite divisibleif A r < 0 foreverynonnegative r Innitedivisiblematricesareintimatelyrelatedtonegat ivedenitenessaswecan seefromthefollowingproposition 70

PAGE 71

Proposition4.3.1. If A isinnitedivisible,thenthematrix B ij = log A ij isnegative denite Fromthisfactitispossibletorelateinnitelydivisiblem atriceswithisometric embeddingintoHilbertspaces.Ifweconstructthematrix, D ij = B ij 1 2 ( B ii + B jj ), (4–40) usingthematrix B fromproposition 4.3.1 .ThereexistaHilbertspace H andamapping suchthat D ij = k ( i ) ( j ) k 2H (4–41) Moreover,noticethatif A ispositivedenite A isnegativedeniteand exp A ij is innitelydivisible.Inasimilarway,wecanconstructamat rix, D ij = A ij + 1 2 ( A ii + A jj ), (4–42) withthesameproperty( 4–41 ).Thisrelationbetween( 4–40 )and( 4–42 )suggestsa normalizationofinnitelydivisiblematriceswithnon-ze rodiagonalelementsthatcanbe formalizedinthefollowingtheorem.Theorem4.3.3. Let X beanonemptysetand d 1 and d 2 twometricsonit,suchthatfor anyset f x i g ni =1 n X i j =1 i j d 2 ` ( x i x j ) 0, (4–43) forany 2 R n ,and P ni =1 i =0 ,istruefor ` =1,2 .Considerthematrices A ( ` ) ij = exp d 2 ` ( x i x j ) andtheirnormalizations ^ A ( ` ) ,denedby: ^ A ( ` ) ij = A ( ` ) ij q A ( ` ) ii q A ( ` ) jj (4–44) Then,if ^ A (1) = ^ A (2) foranyniteset f x i g ni =1 X ,thereexistisometricallyisomorphic Hilbertspaces H 1 and H 2 ,thatcontainareHilbertspaceembeddingsofthemetric spaces ( X d ` ) ` =1,2 .Moreover, ^ A ( ` ) areinnitelydivisible. 71

PAGE 72

nrnnnnnrnn nnn n rrnn Figure4-1.Spacesinvolvedinthecomputationofthedata-d riveninformationtheoretic quantities Figure 4-1 providesanillustrationoftherelationsbetweenthespace sinvolved intheproposedmatrixframework.Thenormalizedinnitedi visiblekernelprovidesa directrepresentationthatissuitableforthecomputation ofthedata-driveninformation theoreticquantities.Atwo-stepprocessthatisequivalen trequiresembeddinga negativedenitemetricspaceintoaHilbertspacefollowed bytheexponentialfunction onthesquareddistances. 4.4StatisticalPropertiesofGramMatricesandtheirconne ctionwithITL Let ( X B X P X ) beacountablygeneratedmeasurespace.Let : XX7! R bea reproducingkernelandthemapping : X7!H suchthat ( x y )= h ( x ), ( y ) i ,and: E X [ ( X X ) ] =E X k ( X ) k 2 = Z X h ( x ), ( x ) i d P X ( x )=1 (4–45) 72

PAGE 73

Since E X k ( X ) k 2 < 1 wecandeneanoperator G : H7!H throughthefollowing bilinearform 3 : G ( f g )= h f Gg i = Z X h f ( x ) ih ( x ), g i d P X ( x ) (4–46) noticethat f and g belongto H andfromthereproducingpropertyof ,wehavethat f ( x )= h f ( x ) i andthus G ( f g )=E X [ f ( X ) g ( X ) ] .Fromthenormalizationcondition ( 4–45 )wehavethat: tr( G )= N H X i =1 G ( i i ) = N H X i =1 Z X h i ( x ) ih ( x ), i i d P X ( x )=1 (4–47) where f i g N H i =1 isacompleteorthonormalbasisfor H ,andthus G istraceclass. 4.4.1Thetraceof G Ourdenitionoftheentropylikequantityforpositiveden itematrices,weemploy thefunctionalcalculususingthespectraltheoremtocompu te tr( A ) .Inparticular,we considertheGrammatrix K constructedbyallpairwiseevaluationsofanormalized innitedivisiblekernel andscaleby 1 N suchthat 1 N P Ni =1 ( x i x i )=1 .Theabove scalingcanbethoughasnormalizingthekernelsuchthatfor theempiricaldistribution P N E emp [ ( X X ) ] =E emp k ( X ) k 2 = Z X h ( x ), ( x ) i d P N ( x ) = 1 N N X i =1 ( x i x i )=1 (4–48) 3 Notice,that f 2H) f 2 L 2 ( P X ) .First, j f ( x ) j = jh f ( x ) ijk f k ( x x ) 1 2 ,andthus f ( x ) 2 k f k 2 ( x x ) .Since R ( x x )d P X =1 ,wehave k f k 22 = R f 2 d P X k f k 2 73

PAGE 74

ItfollowsimmediatelyfromProposition 4.4.1 that tr b G N = tr 1 N K .Aswehave seen, G denesabilinearform G thatcoincideswiththecorrelationoffunctionson X thatbelongtotheRKHSinducedby .Letuslookatthecase =2 ,whichistheinitial motivationofthisstudyandhasbeenextensivelytreatedin ITLinrelationtoplugin estimatorsofRenyi'sentropy.Thiscaseisalsoimportants incethereareinteresting linkswithmaximumdiscrepancyandHilbertSchmidtnorms.I nthelimitcasewehave: tr G 2 = N H X i =1 h i G 2 i i = N H X i =1 h G i G i i = N H X i =1 k G i k 2 = k G k 2HS = N H X i =1 Z X Z X h ( x ) h ( x ), i i ,... ( y ) h ( y ), i ii d P X ( x )d P X ( y ) = Z X Z X h ( x ), ( y ) ih ( x ),... N H X i =1 i h i ( y ) ii d P X ( x )d P X ( y ) = Z X Z X h ( x ), ( y ) i 2 d P X ( x )d P X ( y ) = k X k 2K (4–49) where k X k 2 2 denotesthesquarednormofatheamapping P X 7! X intheRKHS K inducedbythekernel 2 ( x y )= ( x y ) ( x y ) .Inthemoregeneralcaseofany > 1 74

PAGE 75

wehave, tr ( G ) = N H X i =1 h i G i i = N H X i =1 h G i G 1 i i = N H X i =1 Z X h i ( x ) ih ( x ), G 1 i i d P X ( x ) = Z X h ( x ), G 1 ( x ) i d P X ( x ) = Z X h ( x x )d P X ( x ) (4–50) noticethat h ( x y ) itself,isapositivedenitefunctionon XX thatalsodependson P X ( x ) 4.4.2TheSpectrumof G andConsistencyofitsEstimator Bydenition,itisobviousthatthebilinearform G isapositivedenitekernelin H since N X i j =1 i j G ( f i f j ) 0 (4–51) foranyniteset f f i g Ni =1 H .Noticefrom( 4–46 ) G issymmetricandthus G isself adjoint.Moreover,since G ispositivedenite,itcanbeshownthat G isapositive deniteoperator.Insteadofdealingdirectlywiththespec trumof G ,forwhichwe shouldknowtheprobabilitymeasure P X ,wearegoingtolookatthespectrumof b G N andtheconvergencepropertiesofthisoperator.Basedonth eempiricaldistribution P N = 1 N x i ( x ) ,theempiricalversion b G N of G obtainedfromasample f x i g ofsize N is givenby: h f b G N g i = b G ( f g )= Z X h f ( x ) ih ( x ), g i d P N ( x ) = 1 N N X i =1 h f ( x i ) ih ( x i ), g i (4–52) 75

PAGE 76

Proposition4.4.1. (Spectrumof b G N ): Forasample f x i g Ni =1 ,let b G N bedenedasin ( 4–52 ) ,andlet K betheGrammatrixofproducts K ij = h ( x i ), ( x j ) i .Then, b G N hasat most N positiveeigenvalues k satisfying: 1 N K i = i i (4–53) Moreover, N i areallthepositiveeigenvaluesof K Proof4.4.1. Firstnoticethatforall f ? span f ( x i ) g ,wehave b G N f =0 ,andthusany eigenvectorwithacorrespondingpositiveeigenvaluemust belongtothe span f ( x i ) g whichisan N dimensionalsubspaceandtherefore,since b G N isnormaltherecanbeat most N positiveeigenvalues.Nowlet v beaneigenvectorof b G N ,,wehavethat h b G N v i = 1 N N X j =1 h ( x j ) ih ( x j ), v i = h v i Then,foreach ( x i ) itistruethat h ( x i ), b G N v i = 1 N N X j =1 h ( x i ), ( x j ) ih ( x j ), v i = h ( x i ), v i Bytaking i = h ( x i ), v i wecanformthefollowingsystemofequations: 1 N K = (4–54) whichistrueforallpositiveeigenvaluesof b G N Proposition4.4.2. (Compactnessof G ): G : H7!H denedby ( 4–46 ) iscompact. Proof4.4.2. Wewillshowthatif g n w g in H ( f g n g isweaklyconvergent),impliesthat Gg n g stronglyin H .Since H isaHilbertspaceweonlyneedtoshowthat k Gg n k7!k Gg k (4–55) 76

PAGE 77

Sinceany f 2H isalsoin L 2 ( P X ) h Gg n Gg n i = Z X g n ( x )( Gg n )( x )d P X ( x ), h Gg Gg n i = Z X g ( x )( Gg )( x )d P X ( x ). (4–56) Moreover, j g n ( x ) jk g n k ( x x ) 1 2 j Gg n ( x ) jk Gg n k ( x x ) 1 2 (4–57) andtherefore, j g n ( Gg n )( x ) jk g n kk Gg n k ( x x ) .Sinceboth f g n g and f Gg n g areweakly convergentin H theirnormsarebounded;then f g n ( Gg n ) g isboundedbythe L 1 ( P X ) normof ( x x ) (uptoaconstant).Theweakconvergencepropertyof f g n g impliesthat g n ( x ) g ( x ) point-wise,whichalsoimplies g n ( Gg n )( x ) g ( Gg )( x ) point-wise.Since thesefunctionsareuniformlyboundedbytheintegrablefun ction ( x x ) ,byLebesgue dominatedconvergencein L 1 ( P X ) wehave: Z X g n ( Gg n )( x )d P X ( x ) Z X g ( Gg )( x )d P X ( x ), (4–58) whichprovesthat k Gg n k!k G g k ,andthus G iscompact. Thefollowingtheoremfoundin[ 39 ]isavariationalcharacterizationofthediscrete spectrum(eigenvalues)ofacompactoperatorinaseparable Hilbertspace. Theorem4.4.1. Let A B beselfadjointoperatorsinaseparableHilbertspace H ,such that B = A + C ,where C isacompactselfadjointoperator.Let f r k g beanenumeration ofnonzeroeigenvaluesof C .Thenthereexistsextendedenumerations f j g f j g of discreteeigenvaluesfor A B ,respectively,suchthat: X j ( j j ) X k ( r k ), (4–59) 77

PAGE 78

where isanynonnegativeconvexfunctionon R ,and (0)=0 Thedenitionofextendedenumeration f i g accordingtoTheorem 4.4.1 means thataforaselfadjointoperator A in H onlythediscreteeigenvalueswithnitemultiplicity m arelisted m timesandanyothervaluesarelistedaszero.Ifwehaveaboun ded kernel,whichinthecaseofanormalizedversionoftheinni telydivisiblematrixis alwaysthecase,wecanapplyHoeffding'sinequality.Let i beasequenceofzero mean,independentrandomvariablestakingvaluesinasepar ableHilbertspacesuch that k i k < C forall i then: Pr rrrrr 1 N n X i =1 n rrrrr # 2exp N 2 2 C 2 (4–60) notethat ( b G N G ) iscompactoperator.Let j beacompleteorthonormalbasisfor H wecansetthat, N H X j =1 ( b G N G ) j = 1 N N X i =1 ( ( X i ) k ( X i ) k E[ ( X ) k ( X ) k ] ) (4–61) Combining( 4–60 )with( 4–61 )andTheorem 4.4.1 ,yieldsthefollowingresult. Theorem4.4.2. Forapositivedenitekernel satisfying ( 4–45 ) ,and ( x x ) C .Let i and b i theextendedenumerationsofthediscreteeigenvaluesof G and b G N ,respectively. Then,withprobability 1 X i ( i b i ) 2 1 2 s 2 C log 2 N (4–62) Proof4.4.3. ApplytheresultofTheorem 4.4.1 using ( x )= x 2 4.5Experiments:IndependenceTest Here,wedevelopatestforindependencebetweenrandomelem ents X and Y basedonthegapbetweentheentropyofthetensorandHadamar dproductsoftheir Grammatrices.Here,wereportresultsforanexperimentals etupsimilarto[ 30 ].We 78

PAGE 79

Table4-1.Listofdistributionsusedintheindependencete stalongwiththeir correspondingoriginalandresultingkurtosisaftercentr alizationandrescaling DistributionKurtosis Student's t distribution 3 DOF 1 Doubleexponential 3.00 Uniform 1.20 Student's t distribution 5 DOF 6.00 Exponential 6.00 Mixture, 2 doubleexponentials 1.16 Symmetricmixture, 2 Gaussian,multimodal 1.68 Symmetricmixture, 2 Gaussian,transitional 0.74 Symmetricmixture, 2 Gaussian,unimodal 0.50 Asymmetricmixture, 2 Gaussian,multimodal 0.53 Asymmetricmixture, 2 Gaussian,transitional 0.67 Asymmetricmixture, 2 Gaussian,unimodal 0.47 Symmetricmixture, 4 Gaussian,multimodal 0.82 Symmetricmixture, 4 Gaussian,transitional 0.62 Symmetricmixture, 4 Gaussian,unimodal 0.80 Asymmetricmixture, 4 Gaussian,multimodal 0.77 Asymmetricmixture, 4 Gaussian,transitional 0.29 Asymmetricmixture, 4 Gaussian,unimodal 0.67 draw N i.i.d. samplesfromtworandomlypickeddensitiescorrespondingt otheICA benchmarkdensities[ 4 ].Thesedensitiesarescaledandcentralizedsuchthatthey havezeromeanandunitvariance(seeTable 4-1 ).Theserandomvariablesaremixed usinga 2 -dimensionalrotationmatrixwithrotationangle 2 [0, = 4] .Gaussiannoise withunitvarianceandzeromeanisaaddedasextradimension s.Finallyeachone oftherandomvectorsisrotatedbyarandomrotation(orthon ormalmatrix)in R 2 ,and R 3 ,accordingly.Thiscausestheresultingrandomvectorstob edependentacrossall observeddimensions.Weperformexperimentsvaryingangle s,samplessizesand, dimensionality.Thetestcomparesthevalueofthegap: S ( K X )+ S ( K Y ) S K X K Y tr( K X K Y ) (4–63) where K X and K Y aretheGrammatrices(Gaussiankernel)forthe X and Y components ofthesample f ( x i y i ) g Ni =1 ,withathresholdcomputedbysamplingasurrogateofthenul l 79

PAGE 80

hypothesis H 0 basedonshufingoneofthecomponentsofthesample k times,that is,thecorrespondencesbetween x i and y i arebrokenbytherandompermutations. Thethresholdisthetheestimatedquantile 1 where isthesignicancelevelof thetest(TypeIerror),meaningthatthetestisdatadepende nt.Thehypothesis H 0 X isindependentof Y ,isacceptedifthegap( 4–63 )isbelowthethreshold,otherwise, wereject H 0 .Inallourexperiments k =100 .hesolidlinesinFigures 4-2A 4-2B ,and 4-2C showtheestimatedprobabilityof H 0 beingacceptedfortheproposedtestwith =0.05 .Theresultsareaveragesover 100 simulationsforeachoneoftheparameter congurations.Wecompareourresultswiththeonesobtaine dbyusingthekernel basedstatisticproposedin[ 30 ], T n = 1 n 2 n X i j =1 L h ( x i x j ) L 0h ( y i y j )+ 2 n 3 n X j =1 n X i =1 L h ( x i x j ) n X i =1 L 0h ( y i y j ) !# + + 1 n 2 n X i j =1 L h ( x i x j ) 1 n 2 n X i j =1 L 0h ( y i y j ) (4–64) where L h and L 0h arecharacteristickernelson R d [ 84 ].Inthecaseof X Y 2 R (Figure 4-2A ),thetypeIIerrorislowevenforsmallsamplesizes,wherea sthedependence becomesmoredifculttodetectas d increases,requiringalarger N toobtainan acceptabletypeIIerror.Ourresultsarecompetitivetotho seobtainedwiththekernel basedstatistic( 4–64 ).Thetwomethodsperformrelativelysimilarforlargeangl e,but itcanbenoticedthattheproposedmethodworkbetterwhenth eangleiscloseto 0 Itisimportanttopointoutthatinbothcases,theproposeds tatisticusingthegapand theonein( 4–64 ),thethresholdwasempiricallydeterminedbyapproximati ngthenull distributionusingpermutationsononeofthevariables.Wh etherwecanprovidea distributionofthenullhypothesisfor( 4–63 )issubjectoffuturework.Figure 4-3 shows theinuenceoftheparametersinthepoweroftheproposedin dependencetest.The behaviorofthetestfordifferentorders andkernelsizes canbeexplainedfromthe 80

PAGE 81

0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Dim 1, s 2 = 1 Angle ( p /4)Acceptance of H 0 128 512 1024 2048 A 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Dim 2, s 2 = 2 Angle ( p /4)Acceptance of H 0 B 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Dim 3, s 2 = 3 Angle ( p /4)Acceptance of H 0 C Figure4-2.Resultsoftheindependencetestbasedonthegap betweentensorand Hadamardproductentropiesfordifferentsamplesizesandd imensionality. Figures 4-2A 4-2B ,and 4-2C ,correspondtorandomvariablesof 1 2 ,and 3 dimensions.Thelargertheangletheeasiertoreject H 0 (Independence). 81

PAGE 82

0 0.5 1 1.5 2 2.5 3 3.5 4 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 sAcceptance of H 0 a = 1.01 a = 1.5 a = 2 a = 3 a = 5 Figure4-3.Resultsoftheindependencetestbasedonthegap betweentensorand Hadamardproductentropiesfordifferentkernelsizes andentropyorders foraxedsampleofsize 1024 androtationangle = 8 .Thedimensionality oftheoftherandomvariablesis d =2 spectralpropertiesoftheGrammatrices.Forsmallerkerne lsizestheGrammatrix approachestoidentityandthusitseigenvaluesbecomemore similar,with 1 = n asthe limitcase.Therefore,thegap( 4–63 )monotonicallyincreasesas 0 ,sodoesthe gapforthepermutedsample.Sincebothquantitieshavethes ameupperbound,the probabilityofaccepting H 0 increases.Theotherphenomenonisrelatedtotheentropy order,itcanbenoticedthatthelargertheorder thesmallerthekernelsize thatis neededtominimizethetypeIIerror.Theorderhasansmoothi ngeffectintheresulting operatordenedin( 4–50 ).Large willemphasizeonthelargesteigenvaluesofthe Grammatricesthatarecommonlyassociatedwithslowlychan gingfeatures. 82

PAGE 83

CHAPTER5 INFORMATIONTHEORETICLEARNINGWITHMATRIX-BASEDENTROPY Sofar,wehaveseenhowtheproposedmatrixbasedframeworkc anbeuseful whencomparingquantitiesthatdescribethedatainastatic manner.Forexample,we havederivedquantitiesthatcanbeemployedtoteststatist icalindependencebetween tworandomvariables X and Y .Inthischapter,wewanttopursueadifferentgoal. Instead,wewanttousethequantitiesbasedonmatrix-entro pyasobjectivefunctions toformulatelearningproblemsasmathematicaloptimizati onproblems.Thus,weare lookingatmaximizingorminimizingagiveninformationthe oreticquantityoverasubset ofthesetofpositivedenitematricesthatwewilldescribe below.Wearelookingata constrainedoptimizationproblemoftheform: minimize X 2 +n f 0 ( X ) subjectto f i ( X ) 0, for i =1,..., m ; h j ( X )=0, for j =1,..., ` (5–1) +n denotesthesetofpositivedenitematrices.Inourinforma tiontheoreticcontext, f 0 andinsomecases f i for i =1,..., m willbesomeofthefunctionalsthatcanbederived fromtheentropylikefunctionaldenedin( 4–13 ).Thematrixentropyfunctional,fallinto thefamilyofmatrixfunctionsknowasspectralfunctions.L et H n denotethevectorspace ofrealHermitianmatricesofsize n n endowedwithinnerproduct h X Y i =tr XY ; andlet U n denotethesetof n n unitarymatrices.Arealvaluedfunction f deneson asubsetof H n isunitarilyinvariantif f ( UXU )= f ( X ) forany U 2 U n ;thesefunctions onlydependontheeigenvaluesof X andthereforearecalledspectralfunctions[ 24 ]. Recallingproperty( i )inProposition 4.2.1 ,wecanclearlyseethatthefunction( 4–13 ) belongtotheclassofspectralfunctionswehavejustdescri bed. 5.1ComputingDerivativesofMatrixEntropy Associatedwitheachspectralfunction f thereisasymmetricfunction F on R n .By symmetricwemeantthat F ( x )= F ( Px ) forany n n permutationmatrix P .Let ( X ) 83

PAGE 84

denotethevectoroforderedeigenvaluesof X ;then,aspectralfunction f ( X ) isofthe form F ( ( X )) forasymmetricfunction F .Weareinterestedinthedifferentiationofthe composition ( F )( )= F ( ( )) at X .Thefollowingresult[ 46 ]allowsustodifferentiate anspectralfunction f at X Theorem5.1.1. Lettheset n R n beopenandsymmetric,thatis,forany x 2 n andany n n permutationmatrix P Px 2 n .Supposethat F issymmetric,Then,the spectralfunction F ( ( )) isdifferentiableatamatrix X ifandonlyif F isdifferentiableat thevector ( X ) .Inthiscase,thegradientof F at X is r ( F )( X )= U diag( r F( ( X ))) U (5–2) foranyunitarymatrixsatisfying X = U diag( ( X )) U Formtheabovetheoremitisstraightforwardtoobtainthede rivativeof( 4–13 )at A asfollows: r S ( A )= (1 )tr( A ) U 1 U (5–3) where A = U U .Itisimportanttonotethatthisdecompositioncanbeuseto our advantage.Insteadofcomputingthefullsetofeigenvector sandeigenvaluesof A ,we canapproximatethegradientof S byusingonlyafewleadingeigenvalues.Itiseasyto seethatthisapproximationwillbeoptimalintheFrobenius norm k X k Fro = p tr( X X ) 5.2SupervisedMetricLearning Here,weapplytheproposedmatrixframeworktotheproblemo fsupervisedmetric learning.Thisproblemcanbeformulatedasfollows.Givena setofpoints f ( x i l i ) g ni =1 weseekforapositivesemidenitematrix AA T ,thatparametrizesaMahalanobis distancebetweensamples x x 0 2 R d as d ( x x 0 )=( x x 0 ) T AA T ( x x 0 ) .Ourgoalisto ndparametrizationmatrix A suchthattheconditionalentropyofthelabels l i giventhe projectedsamples y i = A T x i with y i 2 R p and p d ,isminimized.Thiscanbeposedas 84

PAGE 85

thefollowingoptimizationproblem: minimize A 2 R d p S ( L j Y ) subjectto A T x i = y i for i =1,..., n ; tr( A T A )= p (5–4) wherethetraceconstraintpreventsthesolutionfromgrowi ngunbounded.Wecan translatethisproblemtoourmatrix-basedframeworkinthe followingway.Let K bethe matrixrepresentingtheprojectedsamples K ij = 1 n exp ( x i x j ) T AA T ( x i x j ) 2 2 and L bethematrixofclassco-occurrenceswhere L ij = 1 n if l i = l j andzerootherwise. Theconditionalentropycanbecomputedas S ( L j Y )= S ( n K L ) S ( K ) ,andits gradientat A ,whichcanbederivedbasedon( 5–2 ),isgivenby: X T ( P diag( P1 ) XA ) (5–5) where P = ( L r S ( n K L ) r S ( K ) ) K (5–6) Finally,wecanuse( 5–5 )tosearchfor A iteratively.Toevaluatetheresultsweusethe sameexperimentalsetupproposedin[ 18 ],whichcompares5differentapproaches tosupervisedmetriclearningbasedontheclassicationer rorobtainedfromtwo-fold cross-validationusinga 4 -nearestneighborclassier.Thereportederrorsareavera ges errorsfromvesrunsofeachalgorithm;inourcasetheparam etersare p =3 =1.01 and = p 3 .thefeaturevectorswerecenteredandscaledtohaveunitva riance. Figure 5-1A showstheresultsoftheproposedapproachconditionalentr opymetric learning(CEML),twovariantsofinformationtheoreticmet riclearning(ITML)proposed in[ 18 ],thebaselineEuclideandistance,adistancebasedonthei nversecovariance, themaximallycollapsingmetriclearning(MCML)methodfro m[ 27 ],andthelarge 85

PAGE 86

Wine Ionosphere Scale Iris 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 ErrorUCI Datasets CEML(proposed method) ITML-MaxEnt ITML-Online Euclidean Inverse Covariance MCML LMNN AClassicationerrorUCIdata -10 -5 0 5 10 -10 -5 0 5 10 Projected Features Gram Matrix Projected Features BProjectedfacesUMistdatasetandresultingGrammatrix = p 2 Figure5-1.ResultsfortheMetriclearningapplicationmarginnearestneighbor(LMNN)methodfoundin[ 91 ].Theresultsforthe Soybean datasetarenotreportedsincethereismorethanonepossibl edatasetintheUCI repositoryunderthatname.Theerrorsobtainedbythemetri clearningalgorithm usingtheproposedmatrix-basedentropyframeworkarecons istentlyamongthebest performingmethodsincludedinthecomparison.Wealsorunt healgorithmontheUMist dataset;ThisdatasetconsistsofGrayscalefaces(8bit[0255])of20differentpeople. Thetotalnumberofimagesis575andthesizeofeachimageis1 12x92pixelsfora 86

PAGE 87

totalof 10304 dimensions.Pixelvalueswerenormalizeddivingby 255 andcentered. Figure 5-1B showstheprojectedimagesinto R 2 .itisremarkablehowalinearprojection canseparatethefacesanditcanalsobeseenfromtheGrammat rixthatittriesto approximatetheco-occurrencematrix L 5.3TransductiveLearningwithanApplicationtoImageSupe rResolution Considerasampleofinputoutputpairs f ( x i y i ) g ni =1 ,theusualinductivelearning settingtriestondafunction f thatcapturesthedependencybetweentheinputand outputvariables X and Y .Thisfunctionsisthenappliedtonewincominginputs x j j > n topredictthecorresponding y j = f ( x j ) .Ontheotherhand,learningbytransduction asintroducedin[ 25 ]canbeuderstoodasinferencefromparticulartoparticula r;in otherwords,atransductivelearningalgorithmprovidesap redictedvalueof y j thatis consistentwiththepreviouslyobservedpairs f ( x i y i ) g ni =1 withoutexplicitcomputationof thefunction f 1 Letusdescribetheinformationtheoreticlearningapproac htotransductionina binaryclassicationsetting.Supposewearegivenasetof f ( x i y i ) g ni =1 of n labeled points x i ,eachpoint x i eitherbelongstoclass 1 or 1 .Thereisapoint x n +1 forwhichno label y n +1 isprovided.Wecanconstructan ( n +1) ( n +1) Grammatrixbycomputing agivenkernelonthesetofinputs f x i g n +1 i =1 andtwopossiblecompletionsofthesetof outputs Y +1 = f y 1 ,..., y i ,+1 g and Y 1 andtheircorrespondingGrammatrices L +1 and L 1 .Let I ( y ) bedenedas, I score ( y )= 8><>: I ( K ; L 1 ) y = 1, I ( K ; L +1 ) y =+1 (5–7) 1 Recallthatasupervisedlearningalgorithmisusuallydesc ribedasamappingfrom thespacesofsamples f z i g ni =1 XY tofunctions f on X 87

PAGE 88

Thepredictedlabel ^ y n +1 forthepoint x n +1 willbechosenusingthescore I score ( y ) as, ^ y n +1 =argmin y 2f 1,+1 g I score ( y ) (5–8) Amoregeneralcaseoftheaboveformulationproblemneedsto consideraset Y thatis notnecessarilydiscrete( e.g. Y iscontinuous).Inthiscase,thepredictionof y canbe castasaoptimizationproblem. Example-basedimagesuperresolution: Superresolutionofimagesisaproblem thathasbeenstudiedfromdifferentperspectives[ 57 ].Alearningapproachtoimage super-resolutionisknownastheexample-basedsuper-reso lution;inthiscasethe missingpartsofinformationarereconstructedbyincorpor atingpriorknowledgetothe process.Asetoftrainimagesispartitionedintopatchesfo rwhichalowresolution correspondingpatchisassumedtobeasubsampledversionof thehighresolution patch.Inadditiontothesubsampling,itmaybeassumedthat thehighresolutionpatch undergoesalow-passlteringprocedurebeforebeingsubsa mpled(anti-aliasing). Themethodoflocalcorrelationsproposedin[ 13 ]canbecategorizedamongthe example-basedmethods.Itbasicimplementationemploysas etofassociative memoriesthatarecomputedbypartitioningthespaceoflowr esolutionpatchesand trainingandassociativememoryforeachoneofthepartitio n;anewimageisthen processedbythefollowingsteps:1.Thelowresolutionimageistransformedintoasetofpatch esof p sizeforwhich themeanvalueofthepixelintensitieswithinapatcheachre moved. 2.Eachpatchisthecomparestodictionaryofcentersthatar etheVoronoycentersof thepartitiondonetothespaceoflowrespatches. 3.Thehighresolutionpatchisobtainedastheoutputofthea ssociativememory correspondingtotheclosestcentertothelowresolutionim agepatch. Ourapproachusingthetransductivelearningmethodexplai nedabovecanbeadapted totheproblemofimagesuperresolution.Let ( X train Y train ) denotethesetoflow res,highrespairsoftrainingpatches,and X query denotethesetofpatchesofalow 88

PAGE 89

resolutionquery.Thegoalistondthecorrespondingsethi ghresolutionpatches Y query suchthattheentropygapismaximizedonthefullsample ( X Y ) = 0B@ 264 X train X query 375 264 Y train Y query 375 1CA (5–9) Thatis, maximize Y query S ( K X )+ S ( K Y ) S K X K Y tr( K X K Y ) (5–10) where K X and K Y aretheGrammatrices(Gaussiankernel)forthe X and Y components ofthefullsample.Thismethodrequiresaninitialguess,wh ichinourcaseisprovidedby thelocalcorrelationmethod.Figure 5-2 displayssomeoftheresultsobtainedwiththe proposedmethodwithamagnicationfactorof 3 89

PAGE 90

A B C D E F G H I J K L M N O P Figure5-2.Fromlefttoright:therstcolumncorrespondto thelowresolutionqueries, thesecondcolumnistheresultofapplyingBi-cubicinterpo lationto upsampletheimage( 3 ),thethirdcolumnistheresultoflocalcorrelation methodintroducedin[ 61 ],andthefourthcolumnistheproposedapproach usingtrasnductivelearning.Inthelastcasetheinitialgu esscorrespondsto theoutputofthelocalcorrelationmethod. 90

PAGE 91

CHAPTER6 CONCLUSIONSANDFUTUREWORK Inthisthesis,weextendtheworkonreproducingkernelHilb ertspacemethods forinformationtheoreticlearningthatwasinitiatedin[ 95 ].Weintroducetheprinciple ofrelevantinformationandanalternativeformulationusi ngtheinformationpotential RKHS.Thisalternativeformulationshowstheadvantagesof usingthereproducing kernelHilbertspaceformulationforinformationtheoreti clearning.Inparticular,weshow howcanweincorporateresults,methods,andalgorithmstha thavebeendeveloped developedinthemachinelearningcommunitytoourframewor kInthespiritofkernel methods,thePRIcanbesolvedwithasupportvectortypealgo rithm.Theinformation potentialRKHSformulationeffectivelyby-passesthedens ityestimationstepthatis commononpreviousalgorithmsusinginformationtheoretic learningframework.The experimentsshowthatthemethodcanperformdifferentunsu pervisedlearningtasksby settingthetrade-offparameterappropriatelyandresults arecomparablewithotherstate oftheartkernel-basedfeatureextractors. Weintroduceasequentialminimaloptimizationalgorithmf ortheprincipleof relevantinformationbasedonweighteddensityestimation .Inordertoguarantee convergenceofthealgorithm,weshowthattheKarush-KuhnTuckerrstorder optimalityconditionarenecessaryandsufcientinourfor mulation.Inprovingthis, wefoundthereexistaconvexprogramthatyieldsthesamesol ution.Resultsshow thatcomputationalcomplexityismanageableevenforsampl esizesofseveraltens ofthousands.Theveryimportantfeatureisthatelementsof theGrammatrixare computedatrequestanddonotneedtobestored,nevertheles s,speedimprovements canbeachievedbyusingacachethattemporarilystoresfreq uentlyvisitedsamples. Theresultsandinsightsobtainedwiththenewformulations oftheprincipleof relevantinformationmotivatethesecondpartofthisdisse rtation.Here,wepresent anestimationframeworkofentropy-likequantitiesbasedo ninnitedivisiblematrices. 91

PAGE 92

ByusingtheaxiomaticcharacterizationofRenyi'sentropy ,afunctionalonpositive denitematricesisdened.Wedeneestimatorsofentropylikequantitiesbased ongrammatricesthatcorrespondtoinnitedivisiblekerne ls.TheuseofHadamard productsallowsustodenequantitiesthataresimilartomu tualinformationand conditionalentropy.Weshowedsomepropertiesofthepropo sedquantitiesand theirasymptoticbehaviorasoperatorsinreproducingkern elHilbertspacedening distributiondependentkernels.Numericalexperimentssh owedtheusefulnessofthe proposedapproachwithresultsthatarecompetitivewithth estateoftheart.Moreover, Weshowhowquantitiessuchasconditionalentropyandmutua linformationcanbe employedtoformulatedifferentlearningproblems,andpro videefcientwaystosolve theoptimizationthatarisefromtheseformulations. FutureWork: Oneimportantsubjectforfurtherresearchisrelatedtoext ensions oftheprincipleofrelevantinformationthatcanleadtosup ervisedorsemi-supervised learning.Forinstance,minimizationofthejointentropyo finput-outputtargetpairsalong withaninformationpreservationconstraintwithrespectt othegivenmarginalofthe inputsisverysimilartotheconceptofmanifoldregulariza tionproposedby[ 9 ]. Thereisroomforimprovementsonthesequentialminimalopt imizationalgorithm proposedtosolvetheRKHSPRIformulation.Namely,improve mentsintermsof speedbybetterselectionheuristicsandmemorytradeoffsc anbepursuedinfuture implementations. Anotherinterestingquestioniswhethertheprinciplecanb eextendedtodealwith randomvariablesofvariousdomainscapturingrelevantinf ormationthroughvariables thatcanhelpinproblemssuchasvisualization. Anotherinterestinglineofworkinvolvetherelationbetwe eninformationtheory andstatisticalmechanics.Addingphysicalunderstanding oftheInformationtheoretic formulationsoflearningproblemscanbeilluminatinginpr oposingbiologicallyplausible implementationsofinformationtheoreticobjectivefunct ions. 92

PAGE 93

Thereisplentyofroomfordevelopmentsinoptimizationand numericalanalysis thatleadtoefcient,largescalealgorithmsandapproxima tionsthatcanbeemployedto ndsolutiontotheinformationtheoreticlearningobjecti vefunctionsbasedonmatrices. Oneofthebighurdlesoftheproposedframeworkistheestima tionofthespectrum,we proposetousethelargesteigenvaluestoapproximatethegr adientoftheinformation theoreticquantities,buttheseapproximationstillrequi resthestorageofthefullGram matrix.ItseemsthatgreedyapproximationtotheGrammatri xareanecessarystepto largescaleimplementations. 93

PAGE 94

APPENDIXA BASICDEFINITIONSANDSTANDARDNOTATION A.1Shannon'sEntropyandMutualInformation Oneoftheunderpinningconceptsininformationtheoryisen tropy,whichcanbe understoodastheaverageuncertainityforsomegivenproce ss.Therstdenitionof entropywasintroducedbyShannon[ 80 ]inthecontextofcommunicationsystems, andgaverisetoinformationtheory.Theentropyfunctionwa sdenedtosatisfy thefollowingrequirements:continuityinthemeasureofun certainity,thatisonthe probabilityarguments;forapartitionwhereallelements( disjointsubsets)havethe sameprobability,thefunctionmustdependonthenumberofs ubsets;andfora renementofacertaincollectionofdisjointsubsetstheen tropyshouldincrease 1 Let (n, B n P ) beaprobabilityspaceand X :(n, B n ) 7! ( X B X ) ameasurablefunction. Forsimplicity,rst,considerthecasewheretheelements x i of X arecountable(or nitelymany)anddeneapartition U = f X 1 ( x 1 ), X 1 ( x 2 ),... g (A–1) of n .Forafunctional H representingtheuncertainityaboutthepartition U (inducedby themeasurablefunction X ).Theentropyofdiscreterandomvariable X thatcantake n possiblevaluesdenotedby x n H ( X )=E X log 1 P( X ) = n X i =1 P( x i )logP( x i ) (A–2) Startingfromasimilarformulation,itispossibletodene thejointentropy H ( X Y )=E X Y log 1 P( X Y ) = m X j =1 n X i =1 P( x i y j )logP( x i y j ) (A–3) 1 Noticewedonexplicitlystatethatthefunctionshouldbest rictlyincreasing,butbut thisisthecasewheretheresultingsubsetsofthepartition havenon-zeromeasuresince theirprobailitydecreaseandhencetheuncertainityincre ases 94

PAGE 95

andtheconditionalentropy H ( Y j X )=E X Y log 1 P( Y j X ) = m X j =1 n X i =1 P( y i x i )logP( y j j X = x i ) (A–4) observethatthejointentropycorrespondstotheexpection oftheuncertaintityofa variable( Y inourcase)giventhatwehaveknowledgeoftheother( X ).andinthecase ofindependentrandomvariableswehavethat H ( Y )= H ( Y j X ) Andwhen y = g ( x ) foradeterministicmapping g wehave H ( Y j X )=0 .Wecan alsointerpretentropynotintermsofuncertainityrelated toapossibleoutcomeof anexperiment,butasameasureoftheinformationwegainfro mobservingagiven outcome.Usingtheabovedenitionsandthelatterinterpre tationofentropy,wecan introducetheconceptofmutualinformationbetweentworan domvariablesasthe amountofinformationthatiscommontothetwovariables.In thiscasetheadditivity propertyofentropyplaysanimportantrole: I ( X ; Y )= H ( X ) H ( X j Y )= H ( Y ) H ( Y j X )= I ( Y ; X ) (A–5) Noticethatmutualinformationisasymmetricfunctionandi szeroif X and Y are independentAstraighforwardalgebraicmanipulationof( A–5 )yields: I ( X ; Y )= m X j =1 n X i =1 P( y i x i )log P( y i x i ) P( y j )P( x i ) (A–6) Theargumentofthe log functionin( A–6 )sugestamoregeneralstatement.Theratio betweenthejointandtheproductofthemarginalscanbeseen asacomparison betweentwovalidprobablitymassfunctions.The Kullback-Leiblerdivergence isthe generalizationofourratherinformalstatement.DenitionA.1.1. (Kullback-LeiblerKLdivergence): Let P and Q betwoprobabilitymass functionsdenedoverthesamedomain.Therelativeentropy orKLdivergenceisis 95

PAGE 96

denedas KL( P k Q )= n X i =1 P ( x i )log P ( x i ) Q ( x i ) (A–7) Noticetheaboveabovedenitionshavebeendenedfortheca seofdiscrete probabilityspaces(countablesets).Nevertheless,these conceptscanalsobe extendedtocontinuousrandomvariablesaswellasothermea surablefunctions 2 Foracontinuousrandomvariable X andprobabilitydensityfunction f ( x ) withsupport X thedifferentialentropyisgivenby: H ( X )=E X log 1 f ( X ) = Z X f ( x )log f ( x )d x (A–8) thejointentropy H ( X Y )=E X Y log 1 f ( X Y ) = ZZ X Y f ( x y )log f ( x y )d x d y (A–9) andtheconditionalentropy H ( Y j X )=E X Y log 1 f ( Y j X ) = ZZ X Y f ( x y )log f ( y j x )d y d x (A–10) Mutualinformationcanobtainedusing( A–5 )andtheKLdivergencefortwovaliddensity functions f and g KL( f k g )= Z X f ( x )log f ( x ) g ( x ) d x (A–11) A.2R enyi'sMeasuresofInformation Theaboveintroductionisratherinformalandaimsatprovid ingacondensedpicture ofthetoolswewanttousethroughoutthiswork.Inthissecti on,wewanttoprovidea 2 Aslongaswecandeneapartitiondirectlyorindirectly[ 56 ].However,some propertiesdenedfordiscreteRVsmaydifferwhenwemoveto thecontinuouscase. 96

PAGE 97

moredetaileddescriptionoftheparticularmeasuresfromw hichinformationtheoretic learninghasbeendeveloped[ 70 ]. DenitionA.2.1. (GeneralizedRandomVariable): Let (n, B n P ) beaprobabilityspace, and X ( ) afunctiondenedon n 1 2B n where P (n 1 ) < 1 andismeasurablewith respectto B n X iscalleda generalizedrandomvariable Thedistributionofageneralizedrandomvariableiscalled a generalizeddistribution Inthecaseofdiscretegeneralizedrandomvariablewithni tesupport,thegeneralized distributioncanberepresentedbyasequenceof P =( p 1 p 2 ,..., p n ) andaweight W ( P )= P ni =1 p i ,whereeach p i correspondstotheprobabilityofeachelementofthe partition(recall( A–1 )).Sincewearedealingwithgeneralizeddistributionsofs ubsets n 3 ofthesamplespace n .Withtheseelementsathand,itispossibletocharacterize theentropyfunctionbythefollowingpostulates 4 : ( H1 ) : H ( P ) isasymmetricfucntionoftheelementsof P ( H2 ) : H ( f p g ) iscontinuouson ( 0,1 ] ( H3 ) : Forgeneralizeddistributions P and Q H ( PQ )= H ( P )+ H ( Q ) ( H4 ) : Forgeneralizeddistributions P and Q suchthat W ( P )+ W ( Q ) 1 H ( P[Q )= W ( P ) H ( P )+ W ( Q ) H ( P ) W ( P )+ W ( Q ) (A–12) Afunctionsatisfyingsuchpostulatesisgivenby H ( P )= n P i =1 p i log 1 p i n P i =1 p i (A–13) 3 Accordingtodenition A.2.1 ,if (n, B n P ) isaprobabilityspace.Wecandenea generalizeddistributionfor n n ,if (n ,n \B n ) isameasurablespace. 4 Thereisanextrapostulate H ( f 1 = 2 g )=1 butthisisonlyusefultodenethebase log function.Itisanscalingconstant. 97

PAGE 98

Thisconstructionagreeswiththenotionofofaverageuncer tainity.Infact,itcanbe extendedtoamoregeneralformofameanvalueusingtheKolmo gorov-Nagumo function ( x ) ,thatis: 1 n X i =1 w k ( x k ) (A–14) where isanarbitrarystrictlymonotonicandcontinuousfunction .Underthese conditions 1 istheinversefunction.Postulate ( H4 ) isthenreplacedby ( H4 0 ) : Forgeneralizeddistributions P and Q ,thereexistsastrictlymonotonicand increasingfunction ,suchthatif W ( P )+ W ( Q ) 1 H ( P[Q )= 1 W ( P ) ( H ( P ) ) + W ( Q ) ( H ( P ) ) W ( P )+ W ( Q ) (A–15) Itisclearthatforthecase ( x )= ax + b and a 6 =0 thepostulatereducestotheoriginal construction.Now,if ( x )=2 ( 1) x ,thefunction H ( P )= 1 1 log 0BB@ n P i =1 p i n P i =1 p i 1CCA (A–16) iscalledtheentropyoforder ofthegeneralizeddistribution P ,with lim a 1 H ( P )= H ( P ) where H ( P ) isdenedinequation( A–13 ).Aswehavepointedoutbeforetheentropy notonlyreferstouncertainitybuttoinformation,aswell. Inthissense,wecangoback tothemoregeneralconceptrelatedtomutualinformationdi vergence.Followinga similarschemeusedtodenetheentropyof order,wecandenethe -Divergencefor adiscreterandomvariableas: D ( PkQ )= 1 1 log 0BB@ n P i =1 p i q 1 i n P i =1 p i 1CCA (A–17) Foracontinuousrandomvariable X withpdf f X ( x ) ,thedifferentialentropyof order, H ( X )= 1 1 log 0@ Z X f ( x ) d x 1A (A–18) 98

PAGE 99

andthefortwodensityfunctions f and g the -Divergence D ( f k g )= 1 1 log 0@ Z X f ( x ) g ( x ) 1 d x 1A (A–19) As 1 ( A–18 )approximatestoShannon'sentropyand( A–19 )totheK-Ldivergence. AmodiedversionofRenyi'sdenitionof -relativeentropybetweenrandomvariables withPDFs f and g isgivenin[ 48 ], D ( f k g )=log R g 1 f 1 1 R g 1 R f 1 (1 ) (A–20) Similarly,Shannon'srelativeentropy(K-Ldivergence)is thelimitfor 1 .Animportant componentintherelativeentropyisthecross-entropyterm H ( f ; g ) thatquantiesthe informationgainfromobserving g withrespecttothe“true”density f .Itturnsoutthat forthecaseof =2 ,theabovequantitiescanbeexpressed,undersomerestrict ions, asfunctionsofinnerproductsbetweenPDFs.Inparticular, the 2 -orderentropyof f and cross-entropybetween f and g ,canberespectivelyexpressedas, H 2 ( f )= log Z X f 2 ( x )d x ; H 2 ( f ; g )= log Z X f ( x ) g ( x )d x (A–21) theassociatedrelativeentropyoforder 2 iscalledtheCauchy-Schwarzdivergenceand isdenedasfollows: D CS ( f k g )= 1 2 log R fg 2 R f 2 R g 2 (A–22) 99

PAGE 100

APPENDIXB AUXILIARYTHEOREMSPROOFSANDDERIVATIONS B.1SufcientConditionsforPseudo-ConvexPrograms Thefollowingtheoremisextractedfrom[ 49 ]Chapter10. TheoremB.1.1. Let X 0 beanopensetin R n anlet f and g berespectivelyandscalar anda m -dimensionalvectorfunctionbothdenedin X 0 .Let x 2X 0 I = f i j g i ( x )=0 g f bepseudo-convexat x ,and g bedifferentiableandquasi-convexat x .Ifthereexists a 2 R m suchthatthepair ( x ) satisfythefollowingconditions: h r f ( x )+ T D g ( x ) i ( x x ) 0, 8 x 2X 0 ; g ( x ) 0 (B–1) T g ( x )=0 g ( x ) 0 0 Then, x isasolutionofthefollowingminimizationproblem min x 2X 0 f ( x ) subjectto g ( x ) 0. (B–2) ProofB.1.1. Let I = f i j g i ( x )=0 g J = f j j g i ( x ) < 0 g ,thence I [ J = f 1,..., m g since 0 g ( x ) 0 ,and 0 ,wehavethat f j g j 2 J =0 ,andfromquasiconvexity of g at x ,thegradientsof g i at x for i 2 I areorthogonaltotangentplanestothelevel setsdenedby g i ( x )=0 andthereforeforanyfeasiblepoint x 2X 0 and g ( x ) 0 D g I ( x )( x x ) 0 ,bynon-negativityof andsince J =0 wehave: I T D g I ( x )( x x ) 0 (B–3) J T D g J ( x )( x x )=0 T D g ( x )( x x )= h I T D g I ( x )+ J T D g J ( x ) i ( x x ) 0. 100

PAGE 101

Finally,since r f ( x )+ T D g ( x ) ( x x ) 0 forall x 2X 0 and g ( x ) 0 ,weneed that r f ( x )( x x ) 0 andthusbypseudo-convexityof f implyingthat f ( x ) f ( x ) for all x 2X 0 suchthat g ( x ) 0 AgeneralizationoftheKuhn-Tuckersufcientoptimalityc riterionfollowsfromthe abovetheorembyreplacingcondition( B–1 )with r f ( x )+ T D g ( x )=0 (B–4) B.2DetailsoftheSolutiontotheMinimalSubproblem Werefertotheobjectivein( 3–29 )as, J ( 2 )=( 1)log A ( 2 ) 2 log B ( 2 ). (B–5) Takingthederivativeof J ( 2 ) andequatingtozeroyields: d d 2 J ( 2 )=0=( 1) B ( 2 ) d d 2 A ( 2 ) 2 A ( 2 ) d d 2 B ( 2 ) (B–6) where d d 2 A ( 2 )=2( 2 ( V 11 + V 22 V 12 )+( w ( V 12 V 11 )+( 2 1 ))) (B–7) and d d 2 B ( 2 )= q 2 q 1 (B–8) Expandingandrearrangingyields( 3–30 ) 101

PAGE 102

APPENDIXC APPROACHESTOUNSUPERVISEDLEARNING Amongthevastamountofliteraturethatcanbefoundonunsup ervisedlearning, thereiscustomaryelementintheirpresentationthatdivid esthemethodsintotwo maincategories:discretevariablemethods,andcontinuou svariablerepresentations. Discreterepresentationscanbeeasilyassociatedwithbin aryrepresentationsforwhich afeatureiseitherpresentornot;clusteringandvectorqua ntizationarerepresentatives ofthiscategory.Inwhatconcernstocontinuousvariablere presentations,wecan ndtechniquesfordimensionalityreductionthatlookforf aithfullowdimensional representationofthedata;methodsthatdonotnecessarily reducedimensionbut giverepresentationsthatdecomposeintoindependentfact orsandmethodsthatexploit redundancytobuildrepresentationsthatarerobust.Final lylearningtheprobability distribution,whichmayfallintoethergroups.Thelatterp roblemmotivatesagenerative viewofunsupervisedlearningthatperformsinferencefora givenmodelfromobserved datatogiveaposteriordistributionandalearningsteptha tusestheposteriortoupdate themodel. Aswementionedbefore,inordertolearntherepresentation fromtheobserved dataweneedtwoaddressthequestionsofwhatlookforinathe representationand howtoassesstheeffectivenesswithoutresortingtofurthe rstagesofprocessing. Inhere,weattempttopresentsomeoftheexistingtechnique sforunsupervised learningbasedonourmainobjective,whichisunderstandth eroleofinformation theoreticprinciplesbehindunsupervisedlearningandhow theyapplytolearningdata representation.Forthisreason,ourreviewdividestheset oftechniquesaccordingtothe wayinformationowsineachmethod. Let X beourinputsetand Z thecodeset.Amapping f : X7!Z (themappingcan berandomordeterministic)willbecalledthe encoder ,andthemapping g : Z7!X the decoder .Thecomposition g f iscalled encoder-decoderchannel orsimply channel 102

PAGE 103

Figure C-1 depictsablockdiagramoftheaboveelements.Theencoder-d ecoder FigureC-1.Blockdiagramencoderdecoderschemestructureallowsustoclassifytheexistingunsupervisedl earningmethodsintothree maingroups:Encoderlearning,decoderlearning,andencod er-decoderchannel learning. C.1EncoderLearningMethods Inthiscaseourmainconcernistolearnhowtocode x into z .Recovering x oran approximationtoitbyusing z mightbepossiblebutitisnotrequiredforthesemethods. Therelationbetween X and Z isestablishedthroughamapping f thatcanbelearned explicitlyorimplicitly. k -MeansClustering issimplealgorithmthatpartitionstheinputspacein k disjoint portionsaccordingtothedistributionofthedata.Let X be R d C = f ` g k` =1 R d aset ofcenters.Theobjectiveistoadjustthelocationsofthese tofcenters C suchthatthe objectiveisminimized: E rrrr X argmin ` 2 C fk X ` k 2 g rrrr 2 # (C–1) Thesampleestimatorofthecostleadstoaniterativeproced ureforwhichthesample S = f x i g Ni =1 ispartitionedinto k disjointsubsetsaccordingto C ,usingtheelements ofeachsubset S ` thevaluesof ` areupdatedaccordingtothesamplemean.The algorithmiteratesuntilthepartitionsdonotchange.The k -meansalgorithmisasimple methodandhasbeenwidelyappliedindifferenttasksthatre quirequantizationora prototypebasedrepresentationofaverylargesample.Asan informationtheoretic objectivethe k -meansclusteringcorrespondtotheaminimizationofadive rgence 103

PAGE 104

betweenthedatadistributionandamixtureof k isotropicGaussiandistribution.One caveatofthecostfunctionisthepresenceofmultiplelocal minimum,whichmakesthe algorithmsensitivetoinitialconditions. Principalcomponentanalysis(PCA) seeksforalinearmapping A : R d 7! R p ; x 7! Ax ,where p < d AA T = I ,and E[ X ]= 0 ,suchthat trE[ ZZ T ] ismaximized.Inother words,welookforalineartransformationforwhichoutputs havemaximumvariance [ 38 ]andbecomeuncorrelated.Thesecondconditioncanberathe rthoughtofasa consequenceofthesolution,therowsof A correspondtothetransposeof p leading eigenvectorsofthecovariancematrixof X .DuetoitssimplicityPCAiswidelyappliedin practice,butmaygittedtorepresentvepoorresultssince onlyworkswiththesecond orderstatisticsofthedata.Assumingthat X Gaussiandistributed,PCAcanbethough asndingthelinearsubspaceforwhich Z hasmaximumdifferentialentropy,sinceany lineartransformationofaGaussianvariableresultsinano therGaussianthemaximum entropyobjectivebecomesavariancemaximizationobjecti ve. KernelPCA [ 77 ]isnonlineargeneralizationofPCAbasedonthetheoryofpo sitive denitekernels.Let : XX7! R beapositivedenitekernel,where X isan arbitraryset.For x y 2X ( x y )= h ( x ), ( y ) i representsaninnerproductona reproducingkernelHilbertspace H ,where : X7!H isanunderlyingmappinginduced bythekernelfunction.Inthisgeneralization,linearPCAi scarriedoutin H byadual representationoftheeigenvectorsofthecovarianceopera torin H .Forthesample S = f x i g Ni =1 ,let K betheGrammatrixwithentries K ij = ( x i x j ) for i j =1,..., N ,and foreaseofexpositionletusassumethekernelissuchthatth esampleiscentralizedin thefeaturespace.Thesolutionisfoundbysolving K ` = ` ` forthe p eigenvectors withlargesteigenvalues.Theencoderisanonlinearmappin gfrom X to R p suchthatfor eachdimensionof z thereisafunction f ` : X7! R .For x 2X theevaluationof f ` on x is 104

PAGE 105

expressedintermofkernelevaluationasfollows: f ` ( x )= 1 p ` X i =1 ` i ( x i x ), (C–2) where ` i isthe i -thentryofthe ` -theigenvectorof K Kernelentropycomponentanalysis(KECA) [ 36 ]isamethodverysimilarto kernelPCAinthewaytheeigenvectorsin H arecomputedusingthedualformulation. However,thesolutionismotivatedbytherelationbetweene stimatorsofRenyi'ssecond orderentropybasedonParzenwindowsandtheGrammatrix K 1 ,andthusthismatrix doesnotrequirecenteringasinkernelPCA.InKECAthemappi ngiscomputedusing the p eigenvectorsthathavethehighestscores r ` = p ` P Ni =1 ` i .Although,both kernelPCAandKECAarenonlinearalgorithmswithaveryeleg antmathematical formulation,thedualrepresentationemployedfortheeige nvectorscomesattheprice ofincreasedcomputationalburdenforlarger N .Moreover,theinformationtheoretic interpretationofKECAusingtheRenyi'sentropyrelationi snotshowntoremainvalid foreverypositivesemidenitekernel;thus,thechoiceoft heeigenvaluesextractedby KECAmaybemisleadinginsomecases. Noiselessindependentcomponentanalysis(ICA) [ 10 35 ]isaparadigm motivatedbytheearlierworkofBarlowonminimunentropyco desandredundancy reduction[ 6 – 8 ].ICAcomputesalineartransformationofthedata z = Ax ,where A : R d 7! R d ,suchthatthecomponentsof Z arestatisticallyindependent.Because independenceimplieszerolinearcorrelationbetweenvari ables,ICAisusuallyseen asthecompositionoftwolinearoperations A = UB where B isawhiteningoperation and U anorthonormalmatrix( i.e. arotation).Asaresult, E[ ZZ T ]= I ,thisexplains whythereisnodimensionalityreductionasinPCAsincether eisnoexplicitreasonto selectasubsetofthecomponents.Oneoftheearlyalgorithm sforICAwasproposed 1 ThetopicofRenyi'sentropyisdevelopedindetailintheApp endix 105

PAGE 106

in[ 10 ].Theideaistomaximizethemutualinformationbetweenthe input X 2 R d and theoutputofanorlinearmapping y =( h A ) h ( A x ) .Thisobjectiveisfullyachievedif thenonlinearity h matchestheshapeofthecumulativedistributionofeachcom ponent of Z = A X .Sincetheinputoutputmapisdeterministic,themaximizat ionofthe mutualinformationonlydependsontheentropyoftheoutput s.Themaximumentropy distributionoveraniteintervalistheuniformdistribut ionandthusthepropertyofthe networktomaximizemutualinformationifthenonlinearity matchesthecumulative distribution.Anotherassumptionaboutindependentcompo nentsisthattheytendto benon-Gaussian,thispropertyisexploitedin[ 35 ],whereeachcomponentof Z is optimizedtomaximizeorminimizeitskurtosis,inparticul ar,valuesofthisstatistics thatarecloseto 3 sincethisisthevalueofkurtosisforGaussiandistributed random variables.Noticethatthisalgorithmdoesnotemploytheno nlinearity h tondthelinear map.AlthoughICAiswell-motivatedfromthephysiological pointofviewmorerecent approachesthatincorporaterobustcodingseemtobemorepl ausibleandexposebetter features. Selforganizingmaps(SOM) [ 22 ]seekforamappingfrom X toalow-dimensional spacesuchthatproximityrelationsarepreservedasmuchas possible.Themappingis computedtoatopologicallyconstrainednetwork.Theunits ofthenetworkarearranged inalatticethatforcesneighboringunitstolearnatthesam etime.Forinstance,consider asetof k N units,where N isthesizeofthedataset,the j -thunitcomputesa map f i : X7! R oftheform f j ( x i )= h ( x i ), w j i .Bycompetitivelearning,theunit parametersareupdatedaccordingtotheunitwiththehighes tactivationvalue i ,that is, w j w j + ( t ) P i ( j ) ( x i ) ,where ( t ) isthestepsize,whichmayvaryovertime, and P ` isastrengthfunctioncenteredatthe ` -thpointinthelatticethatweightsthe neighboringunitsto ` .Thistechniqueisverysimpletocompute,andhasbeensubje ct ofasignicantamountofresearch.Severalconnectionstov ectorquantizationand manifoldlearninghavebeenalsoestablished.Someknowpro blemsoftheSOMare 106

PAGE 107

localminimumandmodelselection.Pickingtherighttopolo gyisnottrivialandsince thereisnoparticularcostfunction,itisdifculttoasses sthesuccessoflearning. Asmentionedabove,theencodercanbeexplicit, e.g. themethodswejust presented,orthemappingcanbelearnedimplicitly.Inthel attercase,wedonot havedirectaccesstotheencoderfunction f ,butweusethevaluesof z toimplya transformation z = f ( x ) .Instancesoftheimplicitencoderlearningapproachare spectralclustering [ 90 ]and multidimensionalscaling(MDS) [ 22 ].Spectralclustering partitionstheinputdatasetbylookingatthespectrumofdi fferentnormalizationsof thegraphLaplacianofthedata(foranintroductorytreatme ntsee[ 90 ]).Themethod hassomeappealingpropertiessuchas:convexityofthecost ,abilitytounveilnonlinear structuresindata,andforsomecases,theoreticalguarant eessuchasconsistency[ 89 ]. MDSusesadistancefunction D : XX7! R toconstructadistancematrix D with entries D ij = D ( x i x j ) ,andadistancefunction(usuallyeuclidean) d : ZZ7! R to computeadistancematrix .Then,thelocationsofaset S z = f z i g Ni =1 ,usuallyin R p ,are optimizedtominimizethecost J ( S z )= N X i < j ( d ij D ij ) 2 w ij (C–3) where w ij isaweightfunctiontoemphasizethepreservationoflargeo rsmalldistances inthemapping. Someofthemanifoldlearningmethodsalsofallintothiscat egory. Locallylinear embedding(LLE) [ 74 ]isatwo-stagemethodthatlooksforaglobalcoordinatesys tem thatcanrepresentwellthelocalgeometryextractedfromlo calneighborhoods.Inthe rststage,asetofweights W i thatminimizes P j 2N i k x i W ij x j k 2 isobtainedbysolvinga leastsquaresproblemunderthetranslationinvariancecon straint P j 2N i W ij =1 forall i Inthesecondstage,similartoMDS,weoptimizethelocation ofaset S z = f z i g Ni =1 R p byminimizing P Ni =1 P j 2N i k x i W ij x j k 2 subjectto E[ ZZ T ]= I .Thefreeparameteris thechoiceoftheneighborhoods N i .Themethodisveryappealingsinceitcansolvea 107

PAGE 108

highlynon-linearproblembysolvinglinearproblems.Howe ver,themethodmaysuffer degeneraciesforlatticeorganizationofpointsaswellasn otbeinggoodatpreserving globalstructuresinceveryfarawaypointsintheinputspac emaybemappedtoclose locationintheoutputspace. Isomap [ 85 ]isalsoatwo-stagemethod.First,itcomputesanapproxima tionto thegeodesicdistanceonamanifoldbasedontheshortestpat hdistanceinagraph constructedbylocalneighborhoodsofeachdatapoint.Theo btaineddistanceis mappedtopointsin R p usingMDS.UnlikeLLE,theisomapalgorithmhasconvergence guaranteesforsomespecialclassesofmanifolds.Neverthe less,thealgorithmcannot computeagoodmapiftheintrinsicparameterspaceisnotcon vex,orthemanifoldis notuniformlysampled,orifthecurvatureisinvariantunde rlocalisometry.Alsothe complexityincreasesquadraticallywiththenumberofpoin ts,andsub-samplingmethods arenotgoodalternativessincetheyjeopardizetheaccurac yoftheapproximationofthe geodesicdistance.Morerecentalgorithms, semideniteembedding [ 92 ], minimum volumeembedding [ 81 ],arealsoinstancesoftwo-stagemethods.Theyestimate acenteredsimilaritymatrix K underlocalconstraintsandpositivedeniteness,the objectivefunctionoperatesonthespectrumof K andcanbesolvedviasemidenite programming.Forsemideniteembedding,thecostfunction is rm( K ) whichis equivalenttothesumoftheeigenvaluesof K .Sincethematrixiscentered,thisis equivalenttomaximizingthevarianceoftheunderlyingset ofpointsgenerating K Minimumvolumeembeddinggoesonestepbeyondbymaximizing thegapbetween thesumof p largesteigenvaluesof K andthesumofthe N p remainingeigenvalues. Inbothalgorithms,oncethesimilaritymatrixislearned,a mapto R p iscomputedasin kernelPCA.Ithasbeenarguedthatthesetechniquesarenotg uaranteedtopreserve localandglobalstructureatthesametime.Moreover,theas sumptiondatalyingona manifoldisnotclearinthepresenceofdisjointclasses.Th ecomputationalcomplexity ofthesemethodsiscubicandtheirrobustnessagainstnoise hasnotbeenaddressed. 108

PAGE 109

The t -distributedversion stochasticneighborembedding(SNE) [ 88 ]isapowerful algorithmtovisualizehigh-dimensionaldata.Insteadofp reservingthedistancesdirectly SNE[ 32 ]convertsthedistancesbetween x i 2X andtheotherpointsintosimilarities basedonconditionalprobabilities p j j i that x i wouldpick x j asaneighbor,andthen optimizethelocationsofpointsin z i 2 R p suchthattheconversionoftheirdistanceto conditionalprobabilities q j j i areagoodmatchto p j j i using KL( P k Q ) asthecostfunction. The t -SNEisamodicationoftheoriginalmethodtoalleviatethe crowdingproblem. Firstthefunctionthatmodelsprobabilitiesin Z isheavy-tailedforcinglargedistancesin X toremainrelativelylargein Z .Also,theconditionalprobabilitiesarereplacedbyjoint probabilitiestomakethecostsymmetric.Theresultsforvi sualizationarecomparable withotherstateofartmethods;however,theusefulnessoft hetechniquehasnot beenevaluatedforothersubsequenttasksandontargetspac esofmorethanthree dimensions.Besides,insomecasestheintrinsicdimension alityofthemanifoldcanbe veryhighandalowerdimensionalembeddingmaynotbepossib le 2 C.2DecoderLearningMethods Inthiscase,weassumethereisagenerativemodelfor X forwhich Z isalatent variable.Sincethelatentvariablesareconsideredunobse rvedtheyareestimatedfrom thedecoder.Themodelassumesaclassofdecodersalongwith acodedistribution. Gaussianmixturemodels(GMM) areasimpletypeoflatentvariablemodels. Inthisgenerativemodel, Z isadiscretevariabletaking k possiblevalueseachone ofthemassociatedwithaGaussiancomponentofthemixture. TheEMalgorithm isthemostpopularmethodforlearningthemixturemodel.Th eexpectationstep assign responsibilities toeachobserveddatapoint x i .Themaximizationstepupdate theparametersofthemodelbytakingintoaccounttherespon sibilitiesviamaximum 2 Anembeddingisaninjectivemapbydenition 109

PAGE 110

likelihood.Oncethemodelhasbeenlearned,wecancodenewi nputsbypartitioning thespacebasedonmaximumaposterioriusingBayestheorem. Aunifyingviewof Lineargenerativemodels ispresentedin[ 69 ].Inthisview,a pairoflinearstatespaceandobservationmodelsareusedas agenerativemodelforthe observeddata.Themodeliswrittenas 3 : s t +1 = Bs t + z t x t = Cs t + t (C–4) Thismodelcomprisesvariousstaticlatentvariablemodels suchas factoranalysis probabilisticPCA ,particularcasesoftheGMM,andevenICAmodelsbyincludin g nonlinearitiesinthestateequationofthemodel.Thisgene ralizationalsoconsider dynamicmodelssuchHiddenMarkovmodelsandKalmanlters. Infactoranalysis, B =0 and z t 2 R p istimeindependentandGaussiandistributedwithzeromean andcovariance I ;theobservationmodel C : R p 7! R d isalinearmapand t isa timeindependentGaussianvectorwithzeromeananddiagona lcovariancematrix R .ProbabilisticPCAhasasimilarmodeltofactoranalysisbu t isisotropic,thatis, thecovariancematrixis 2 I .InthecaseofICA,thestateequationis s = g ( z ) ,with z N ( 0 I ) .Noticethedistinctionbetweenlearningtheencodermodel aspresented aboveandthedecodermodel.Theencodermodelisthelimitin gcasefor 0 and thusthelearningcanbedoneintermsof A = C 1 ,thedecoderschemecanbe estimatedinthepresenceofnoiseviaEM.Howevercomputati onoftheposteriorusually becomesintractableandthussamplingmethodsshouldbecon sidered. The generativetopographicmap(GTM) [ 86 ]isatopologicallyconstrainedmodel thatsolvesmanyoftheissuesoftheSOM.Thisagenerativemo delfor X ,withaprior 3 Notethechangesinnotationwithrespecttopliteratureonc ontrolsystemstheory whereitiscustomarytouse x todenotethestatesand y fortheoutputsofthesystem. 110

PAGE 111

p ( z ) on Z aparametricclassofcontinuousfunctions G = f g w : Z7!Xj w 2 n g and aconditionalprobability p ( x j z w ) where aretheparametersofnoisemodelthat followsthemap g ( z ) .BayesiantreatmentoftheGTMmayincludeaprioron G .Inthe originalformulation,parameters w and areadjustedbymaximizingthe log likelihood oftheobserved x 'sbyintegratingover Z ,thatis, p ( x j w )= R Z p ( x j z w ) p ( z )d z .In ordertomakecomputationstractable,theGTMassumesaprio roftheform p ( x )= 1 p P pi =1 ( z z i ) wherelocationsof z i areusuallytakentoformauniformgridin Z Gaussianprocesslatentvariablemodels(GP-LVM) [ 43 ]isaBayesianlatent variablemodelbasedonadualinterpretationofprobabilis ticPCA.Recallthegenerative modelforprobabilisticPCA, x = C z + where z N (0, I ) and N (0, 2 I ) ,assume X = R d and Z = R p .InaBayesianframework,wecanincludeaprioron C ofthe form p ( W )= Q di =1 N ( w i j 0, I ) .Sincesimultaneousmarginalizationover Z and W is intractable,thedualformulationmarginalizesover W ,thereforeweobtainalikelihood functionoftheform p ( X j Z 2 )= Q di =1 p ( x (i) j Z 2 ) ,where Z isthematrixrepresentation ofasetof N latentpointsassociatedwiththeobservedset S = f x i g Ni =1 alsoarrangedin thematrix X ,and x ( i ) isthe i -thcolumnof X .Itturnsoutthat p ( x ( i ) j Z 2 )= N ( x ( i ) j 0, K ) where K = ZZ T + 2 I .Theoptimal Z iscalculatedaccordingtotheloglikelihood: dN 2 log2 d 2 log j K j 1 2 tr( K 1 XX T ) (C–5) Thisformulationallowsastraightforwardnonlinearexten sionbyconstructing K using positivedenitekernels,then K ij = ( z i z j )+ 2 ij andthustherelationbetween z and x canbenonlinear.Moreover,theprobabilisticformalismal lowstheuseofothernoise modelsratherthanGaussian,butthiscomesataprice,appro ximateinferencemust beemployed.Analgorithmbasedondirectoptimizationof( C–5 )doesnotscalewell with O ( N 3 ) ,tocircumventthisproblemsparseapproximationsof K areneeded.The optimizationprocedureisalsopronetolocalminimumandin itializationproceduresmust beconsidered. 111

PAGE 112

Theworkon Sparsecoding [ 54 ]and Sparsecodingwithover-completebasis [ 53 ]describeplausiblemechanismsthatcanexplaintheproper tiesofreceptiveelds ofsimplecellsinmammalianprimaryvisualcortex.Here,de codingprocessislinear generativemodel x = Cz + withasparsityconstraintonthevaluesthevariable z can take.Thesparserepresentationforcesalargeportionofth e p dimensionsof z totake zerovalues.Thisallowscaseswhere p > d ,thatis,Ccancontainanover-complete basis.Giventhebasisthrough C ,thecode z 2 R p foraparticular x 2 R d isfoundby solvingtheobjective: min z 2 R p k x Cz k 2 + p X i =1 S ( z ( i ) ) (C–6) where z ( i ) denotesthe i -thdimensionof z .Thisobjectivecorrespondstoamaximum likelihoodestimationof z assuming isisotropicuncorrelatedGaussianandtheprior on z isaproductofmarginalsoftheform 1 Z part exp S ( z ( i ) ) thatcorrespondtohigh kurtosisdistributions(see[ 52 ]fordetails).Animportantobservationisthattheencodin g procedureassociatedwiththisdecodinglearningschemeis non-linear.Inpractice, thebasisalsoneedstobeestimated.Forsuchpurposeinterl eavedoptimizationof codesandbasisiscarriedout.Inshort,givenaxed C ,wesolve( C–6 )forall x i 'sinthe trainingset;usingthecodevectors z i weminimize P Ni =1 k x Cz k 2 withrespectto C .As aconsequence,theoveralloptimizationispronetolocalmi nimumifboth C and z i are tobelearned.Underthemaximumlikelihoodassumptionthes emodelstrytominimize theKLdivergencebetweentheoutputdistributionoftehgen erativemodelandthetrue distributionofthedata. Non-negativematrixfactorization [ 44 ]isarepresentationof theobserveddatainmatrixform.Thesetof N observedpointsin R d isrepresentedby a N d matrix X .Foranon-negativematrix X ,welookfordecomposition ZW T ,where sizeof W and Z are ( d p ) and ( N p ) ,respectively,suchthatthenorm k X ZW T k 112

PAGE 113

isminimized 4 .In[ 44 ]twotypesofnon-negativefactorizationsaredescribed:c onvex andconic.Inboth, 0 W ij 1 and 0 Z ij ,butintheconvexcasetheextraconstraint Z1 = 1 isalsoimposed. C.3Encoder-DecoderorChannelLearningMethods Wehaveseenthatsomeoftheunsupervisedlearningmethodsh aveencoderand decoderformulation,namelyPCAvsprobabilisticPCA,ICAi nfomaxvsICAmaximum likelihood,andk-meansvsGMM.Theencodermethodsareeffe ctiveincomputing amapfromtheobservedinputspace X tothecodespace Z ,howevertheyarenot concernedaboutrecoveringtheinput x fromthecode z .Ontheotherhand,decoder methodsaregoodatrecovering x fromaninferredcode,sincetheyaregenerative modelsfortheobserveddata.Intheabsenceofadditionalin formation,howcana systemrecognizetheimportantfeatures?Thisbringusback informallytotheprinciple ofinformationpreservation[ 47 ].Acodingschemeeffectivelypreservestheinformation abouttheinputsifwecanrecoverthemfromtheircodedversi ons.Theencoder-decoder orchannellearningschemeisanintuitivewaytospecifythe desiredpropertiesofthe features,andatthesametime,guaranteethatsuchfeatures preservetheinformation abouttheobserveddata.Intheencoder-decoderlearning,t heinputdataundergoesa seriesoftransformationsaimedatextractingthestructur econveyedbytheredundancy intheinput.Thepreservationofinformationismeasuredby adelitycriterion.The compositionencoder-decoder,orthechannel,isamapfrom X to X ,where x denotes theinputand ^ x thereconstructedversion(outputofthechannel). Theworkon generalizedHebbianlearning presentedin[ 72 ]andhisextension tononlinearunitsin[ 73 ],areearlyexamplesthatmakeexplicittheencoder-decode r 4 Extensionstoothermeasuresofdiscrepancybetween X and ZW T canbe considered.In[ 19 ]algorithmsfornon-negativematrixfactorizationbasedo nBregman divergencesareproposed 113

PAGE 114

learningmethodforunsupervisedlearning.Inthelinearca se,itwasshownthatthe algorithmhastheoptimalitypropertiesofPCA(Karhunen-L oevetranform).Instead ofmeasuringthemutualinformationbetween X and ^ X ,theobjectivefunctionisthe expectedsquaredEuclideandistancebetween x anditsafter-channelversion ^ x .The assumptionisthatthestructureoftheinputin R d canbecapturedbyavariabletaking valueson R p .Inthelinearcase p d butinthenonlinearcasethisneednotbethe case. Withinthesparsecodingliteraturethe robustcoding pointofviewpresented in[ 20 ]isanotherexampleoftheencoder-decodermethod.Therobu stnessinthe encodingisenforcedbyintroducingadditivenoisetotheco dedinputs.Consideralinear model z = Ax forwhichwanttodecode ^ z = z + .Thedecodedversionof x ,isalso alinearprocedure ^ x = C ^ z .Addingthenoisehasabiologicalmotivationoflimiting theinformationcapacityoftheprocessingunits.Aswement ionedbefore,sparseness involvesanonlinearoperationandneedstobeexplicitlyst atedinthecostfunction.The objectivefunctionforsparserobustcodingis, E[ k ^ X X k ] 1 p X i =1 E[log p ( u ( i ) )]+ 2 p X i =1 log E[ u ( i ) ] 2 2 (C–7) Thersttermistheinformationpreservationbydelityoft herecoveredinput.The secondtermisasparsenessconstraint;thisisanentropyte rmandthisdoesnot enforcesparsenessperse.Thethirdterm,thatconstrainst hepowerofthecode,can pullthepeakofthedistributionof Z towardstheorigin.Importantremarkisthatthe noisetermwillproduceencodingproceduresthatrangefrom noiselessICAtoPCAas thepowerofthenoisetermincreaseswithrespecttothecode power. In energy-basedmodel learning,energyfunctionsareparametrizedfamiliesof functionswithlowenergyvaluesoncertainregionsoftheir domains.Inunsupervised learning,theenergyfunctionsareoptimizedsuchthatlowe nergieswillbeassociated withregionswheretrainingdataliesandhighenergyvalues willbegiventoregions 114

PAGE 115

wherenodatahasbeenobserved.Oncetheseenergy-modelsha vebeenlearned, inferencecanbedonebyreturningthepointclosestlocalen ergyminimumtoaninput x .Commonexamplesofenergyfunctionsarenorms, log likelihood.Thesecondoneis associatedwithGibbsdistributionoftheform 1 Z E exp E ( X ) .Oneoftheadvantagesof theenergy-modelbasedlearning,comparedtoprobabilisti cmodels,isthatiteliminates normalizationofthefunctions,whichcanbecomputational lyexpensiveorintractable. Whencombiningenergyterms,itisimportanttoconsiderthe scaleatwhicheachofthe termsworksinceimpropertrade-offmayleadtopoorresults .Theenergy-basedmodel frameworkwasadoptedforunsupervisedlearningin[ 65 ]thatallowstheformulationof encoder-decoderschemesasacombinationofenergytermsfo rinference(encoding), codecost(sparseness),andreconstructioncost(decoding ).Namely, Codepredictionenergy E C [ X Z A ] Reconstructionenergy(decoder) E D [ X ^ Z C ] Codecost E Z [ c ( ^ Z )] thispenalizesundesiredformsforthecode,propertiessuc h assparsenesswillthenhavelowenergyvalues. Theenergyfunctionisthesumoftheaboveterms: E C [ X Z A ]+ E D [ X ^ Z C ]+ E Z [ c ( ^ Z )] (C–8) Evenaftertheenergy-modelhasbeenlearned,inferencefor codingisstillproblematic sinceitinvolvesthecodecost.Toallowfastinference,[ 66 ]removesthecodecost termandtrainsasoftmaxsparsifyingfunction.Byusingafa stinferenceencoder,itis possibletobuilddeeparchitecturesthatcomputefeatures efciently. Encoder-decodermethodswehavereviewedsofar,haveexpli citencoderand decoderfunctions.Wecanalso,combinethemintoonemappin gfrom X to X .Inthis case,theintermediatevariable z isimplicit;theinformationprovidedbytheoutput variable ^ x abouttheinput x servesasalowerboundontheinformationtheimplicit variable z containsabout x .Itiscommonacrossalltheabovepresentedmethods 115

PAGE 116

thatunsupervisedlearninghasthegoalofpreservingthein formationconveyedby theobservedinputs.Atthesametime,itassumesthatallthe uncertaintycarriedby theinputdataneednotbepreserved.Finally,theencoder-d ecoderlearningscheme alsosuggeststhat ^ x shouldnotcarrymoreuncertaintythannecessarytocarryth e informationabout x 116

PAGE 117

APPENDIXD ANEFFICIENTRANK-DEFICIENTCOMPUTATIONOFTHEPRINCIPLEO F RELEVANTINFORMATION Amajorissue,whichweaddressinthiswork,isthattheamoun tofcomputation associatedtothePRIgrowsquadraticallywiththesizeofth eavailablesample.This limitsthescaleoftheapplicationsifoneweretoapplythef ormulasdirectly.The problemofpolynomialgrowthoncomplexityhasalsoreceive dattentionwithinthe machinelearningcommunityworkingonkernelmethods.Cons equently,approaches tocomputeapproximationstopositivesemidenitematrice sbasedonkernelshave beenproposed[ 23 94 ].ThegoalofthesemethodsistoaccuratelyestimatelargeG ram matriceswithoutcomputingtheir n 2 elements,directly.Ithasbeenobservedthatin practicetheeigenvaluesoftheGrammatrixdroprapidlyand thereforereplacingthe originalmatrixbyalowrankapproximationseemsreasonabl e[ 5 23 ].Inourwork,we deriveanalgorithmfortheprincipleofrelevantinformati onbasedonrankdecient approximationsofaGrammatrix.Wealsoproposeasimplemod iedversionofthe Nystr ¨ ommethodparticularlysuitedforestimationinITL. ThechapterstartswithabriefintroductiontoRenyi'sEntr opyandtheassociated informationquantitieswiththeircorrespondingrankdec ientapproximations.Then,the objectivefunctionfortheprincipleofrelevantinformati on(PRI)ispresented.Following, weproposeanimplementationoftheoptimizationproblemba sedonrankdecient approximations.Thealgorithmistestedonsimulateddataf orvariousaccuracyregimes (differentranks)followedbysomeresultsonrealisticsce narios.Finally,weprovide someconclusionsalongwithfutureworkdirections. D.1RankDecientApproximationforITL D.1.1Renyi's -OrderEntropyandRelatedFunctions Ininformationtheory,anaturalextensionofthecommonlyu sedShannon'sentropy is -orderentropyproposedbyRenyi[ 70 ].Forarandomvariable X withprobability 117

PAGE 118

densityfunction(PDF) f ( x ) andsupport X ,the -entropy H ( X ) isdenedas; H ( f )= 1 1 log Z X f ( x )d x (D–1) Thecase 1 givesShannon'sentropy.Similarly,amodiedversionofRe nyi's denitionof -relativeentropybetweenrandomvariableswithPDFs f and g isgivenin [ 48 ], D ( f k g )=log R g 1 f 1 1 R g 1 R f 1 (1 ) (D–2) likewise, 1 yieldsShannon'srelativeentropy(KLdivergence).Animpo rtant componentoftherelativeentropyisthecross-entropyterm H ( f ; g ) thatquantiesthe informationgainfromobserving g withrespecttothe“true”density f .Itturnsoutthatfor thecaseof =2 ,theabovequantitiescanbeexpressed,undersomerestrict ions,asa functionoftheinnerproductbetweenPDFs.Inparticular,t he 2 -orderentropyof f and cross-entropybetween f and g ,are: H 2 ( f )= log Z X f 2 ( x )d x (D–3) H 2 ( f ; g )= log Z X f ( x ) g ( x )d x (D–4) Theassociatedrelativeentropyoforder 2 iscalledtheCauchy-Schwarzdivergenceand isdenedasfollows: D CS ( f k g )= 1 2 log R fg 2 R f 2 R g 2 (D–5) Theaboveoperationsassumethat f and g areknow,whichisalmostneverthe whenlearningfromdata.Pluginestimatorsofthesecondord erRenyi'sentropyand cross-entropy(Cauchy-Schwarz( D–5 ))canbederivedusingParzendenstityestimators. Foran i.i.d. sample S = f x i g ni =1 R p drawnfrom g ,theParzendensityestimator ^ g at x isgivenby ^ g ( x )= 1 n P ni =1 ( x x i ), where isanadmissiblekernel[ 59 ].Considertwo samples S 1 = f x i g ni =1 and S 2 = f y i g mi =1 ,bothin R p ,drawn i.i.d. from g and f ,respectively. 118

PAGE 119

Let K 1 beamatrixwithallpairwiseevaluationsof on S 1 ,thatis K 1 ( i j )= ( x i x j ) ;the estimateoftheentropyis ^ H ( g )= log 1 n 2 > K 1 ^ H ( f ) canbederivedsimilarfashionfrom matrix K 2 .Cross-entropyis ^ H ( f ; g )= log 1 nm > K 12 ,where K 12 ( i j )= ( x i y j ) .Notethat wearebasicallyestimatingtheargumentsofthe log functionsin( D–3 )and( D–4 );we willrefertothemasinformationpotential(IP)andcross-i nformationpotential(CIP)[ 62 ]. ForpositivesemidenitekernelsthatarealsoParzen, K 'sareGrammatrices. D.1.2RankDecientApproximation Anysymmetricpositivesemidenitematrix A canbewrittenastheproduct GG > Notethedecomposition A neednotbeunique. IncompleteCholeskyDecomposition: Thisdecompositionisaspecialcaseof the LU factorizationisknowasthe Choleskydecompositon [ 28 ].Here, G isalower triangularmatrixwithpositivediagonal.Theadvantageof thisdecompositionisthat wecanapproximateourGrammatrix K witharbitraryaccuracy bychoosingalower triangularmatrix ~ G with d columnssuchthat k K ~ G ~ G > k (Forasuitablematrixnorm). ThisincompleteCholeskydecomposition(ICD)canbecomput edbyagreedyapproach thatminimizesthetraceoftheresidualof K ~ G ~ G > [ 5 23 78 ].Foran n n matrixthe complexityofthismethodis O ( nd ) andthetimecomplexityis O ( nd 2 ) .Therefore,this algorithmispreferableonlywhen d 2 n .TheerroroftheICDbasedestimatorscan beeasilybounded.Forapositivesemidenitematrixwehave that k A k 2 A ,whichfor errormatrix ( K > ) .theestimatorstreatedinthispaperaremostlyoftheform a > b andsotheerrorisboundedby k a kk b k Nystr ¨ omAproximation: Thisisawellknowrankdecientapproximationto K in machinelearning[ 93 ].TheapproximateGrammatrix ~ K iscomputedbyprojectingallthe datapointsontoasubspacespannedbyarandomsubsampleofs ize d inthefeature space.Consequently ~ K = K d K 1 dd K >d ,where K d isthekernelevaluationbetweenall datapointsandthesubsampleofsize d ,and K dd isthekernelevaluationbetweenall thepointsinthesubsample.Thepricewepayforthissimplic ityisthattheaccuracyof 119

PAGE 120

theapproximationcannotbesimplydetermined.Animproved versionofthismethod witherrorguaranteecanbefoundin[ 21 ].OneimportantremarkontheNystr ¨ ommethod relatestothecomputationof K 1 dd forwhichwecanemploytheeigen-decompositionof K dd = UU > Nystr ¨ om-KECA: Assumewewanttoreduceevenfurtherthesizeof K dd based onitseigenvalues,wewillhaveagoodprojectionintermsof thesquarednormin thereproducingkernelHilbertspaceassociatedto ifwepickthecolumnsof U correspondingtothelargesteigenvaluesonthediagonalof .However,wearemore interestedontheprojectionofthemeanofthemappeddatapo ints .Thisidea resemblestheapproachfollowedby[ 26 ]onthestoppingcriterionfortheorthogonal seriesdensityestimationbasedonKernelPCA.Thematrix K d K 1 dd K >d represents anAnsatzproductinthefeaturespace H as h ( x ),P( y ) i ,where P isaprojection operator.Inparticular h ,P ih i .Wecanndafasterconvergenceseries thantheoneobtainedbyorderingtheeigenvaluesof K 1 dd inanon-increasingway.Such seriescanbecreatedbyorderingthecolumnsof U andtheirrespectiveeigenvalues accordingtothescore s i = 1 = 2 i [ u >i K >d ] 2 .WecallthisdecompositionNystr ¨ om-KECA becauseitresemblesthekernelentropycomponentanalysis proposedin[ 36 ]. ThecomputationoftheestimatorsoftheIPandCIPcanbeeasi lycarriedoutby computingalowrankdecompositionofthe ( n + m ) ( n + m ) Grammatrix K ofthe augmentedsample S = f x i g n1 [f y i g m1 K = 264 K 1 K 12 K >12 K 2 375 (D–6) 120

PAGE 121

Since K ~ G ~ G > with d n + m .Theblockarrayin( D–6 )canbeexpressedinsub-blocks of ~ G as 1 : 264 ~ G 1 ~ G 2 375 ~ G >1 ~ G >2 = 264 ~ G 1 ~ G >1 ~ G 1 ~ G >2 ~ G 2 ~ G >1 ~ G 2 ~ G >2 375 (D–7) ThenfortheIP 1 n 2 > K 1 1 n 2 > G 1 ~ G >1 ,andfortheCIP 1 nm > K 1 1 nm > G 1 ~ G >2 .Notethat computingCIPneedsroughly O (( n + m ) d 2 ) operationsratherthan O ( nm ) .However,it seemsredundanttoindirectlyworkwitha ( n + m ) ( n + m ) matrixwhileweareonly interestedina n m block.However,ourproblemrequiresbothIPandCIP. D.2ThePrincipleofRelevantInformation Regularitiesonthedatacanbeattributedtostructureinth eunderlyinggenerating process.Theseregularitiescanbequantiedbytheentropy estimatedfromdata, hence,wecanthinkoftheminimizationofentropyasameansf orndingsuch regularities.Supposewearegivenarandomvariable X withPDF g ,forwhichwe wanttondadescriptionintermsofaPDF f withreducedentropy,thatis,avariable Y thatcapturestheunderlyingstructureof X .Theprincipleofrelevantinformation(PRI) caststhisproblemasatrade-offbetweentheentropyof YH 2 ( f ) anditsdescriptive powerabout X intermsoftheirrelativeentropy D CS ( f k g ) .Theprinciplecanbebriey understoodasatrade-offbetweentheminimizationofredun dancypreservingmostof theoriginalstructureofagivenprobabilitydensityfunct ion.Foraxedpdf g 2F the objectiveisgivenby: argmin f 2F [ H 2 ( f )+ D CS ( f k g )] for 1. (D–8) Thetradeoffparameter denesvariousregimesforthiscostfunctionsrangingfrom clustering( =1 )toareminiscentprincipalcurves( t p )tovectorquantization ( !1 ). 1 Sub-blockscannotbecomputedindividuallyfrom K 1 and K 2 121

PAGE 122

PRIasaSelfOrganizationMechanism: Asolutiontotheabovesearchproblem wasproposedin[ 68 ].ThemethodcombinesParzendensityestimationwithaself organizationofasampletomatchthedesireddensity f thatminimizes( D–8 ).The optimizationproblembecomes: min Y 2 ( R p ) m [ ^ H 2 ( Y )+ ^ D CS ( Y k X )] (D–9) where X 2 ( R p ) n isasetof p -dimensionalpointswithcardinality n ,and Y asetof p -dimensionalpointswithcardinality m .Problem( D–9 )isequivalentto: min Y 2 ( R p ) m [(1 ) ^ H 2 ( Y )+2 ^ H 2 ( Y ; X )]=min Y 2 ( R p ) m J ( Y ) (D–10) FortheGaussiankernel ( x y )=exp 1 2 2 k x y k 2 wecanevaluatethecost( D–10 )as 2 : (1 )log 1 m 2 > K 2 2 log 1 mn > K 12 (D–11) Theselforganizationprinciplemoveseachparticle y i accordingtotheforcesexertedby thesamples X and Y .Theentropyminimizationcreatesattractiveforcesamong the y i 's andthesample X inducesaforceeldthatrestrictthemovementofeach y i .Computing thepartialderivativesof( D–11 )withrespecttoeachpoint y r 2 Y yields: @ @ y r log 1 m 2 > K 2 = 2 m P i =1 ( y r y i ) y i y r 2 > > K 2 (D–12) @ @ y r log 1 mn > K 12 = n P i =1 ( y r x i ) x i y r 2 > > K 12 (D–13) 2 Notethattheemployedkerneldoesnotintegratetoone.This isnotaproblemsince anormalizationfactorbecomesanadditiveconstantintheo bjectivefunction. 122

PAGE 123

Adirectoptimizationofthecostin( D–11 )iscomputationallyburdensomeandonly feasibleforsamplessizesuptoafewthousands,thislimits theapplicabilityofthe principle.Here,wewanttoovercomethislimitationbyallo wingcomputationonlarger samplescommonlyencounteredinsignalprocessingapplica tions.Below,wedevelop awayofincorporatingtherankdecientapproximationstot hegradientofthePRIcost function.Itisimportanttoremindthereaderthatthissolu tioncanbeeasilyadaptedto otherITLobjectives. Considerthefollowingidentity:For A = ab > ,where a and b arecolumnvectors, C A =diag( a ) C diag( b ) (D–14) where diag( z ) denotesadiagonalmatrixwiththeelementsof z onthemaindiagonal,for simplicitywedenote diag( ) as d( ) WewillrestricttheanalysistotheGaussiankernelbutasim ilartreatmentforother kernelswithsimilarpropertiesontheirderivativecanbea dopted.Let Y ( k ) and X ( k ) bematriceswithentries Y ( k ) ij = y ( k ) i y ( k ) j and X ( k ) ij = x ( k ) i y ( k ) j ,respectively.By z ( k ) i ,wemeanthe k -thcomponentofthevector z i .Equations( D–12 )and( D–13 )canbe re-expressedandcombinedforall r =1,..., m usingmatrixoperationsas: @ @ y ( k ) ^ H 2 ( Y )= 2 2 > K 2 Y ( k ) > K 2 (D–15) @ @ y ( k ) ^ H 2 ( Y ; X )= 1 2 > K 12 X ( k ) > K 12 (D–16) decomposing Y ( k ) = y ( k ) > y ( k ) > and X ( k ) = x ( k ) 1 > y ( k ) > ,where y ( k ) = ( y ( k ) 1 y ( k ) 2 ,..., y ( k ) m ) > and x ( k ) =( x ( k ) 1 x ( k ) 2 ,..., x ( k ) m ) > ,andapplyingtheidentity( D–14 )on ( D–15 )and( D–16 ),aftersomealgebra,yields: @ @ y ( k ) J ( Y )= a y ( k ) > K 2 > K 2 d( y ( k ) ) + + b x ( k ) > K 12 > K 12 d( y ( k ) ) (D–17) 123

PAGE 124

where a = 2(1 ) = ( 2 > K 2 ) and b = 2 = ( 2 > K 12 ) .Finally,itiseasytoverifythatfor therankdecientapproximation K ~ G ~ G > : @ @ y ( k ) J ( Y ) a y ( k ) > ~ G 2 + b x ( k ) > ~ G 1 ~ G >2 + a > ~ G 2 + b > ~ G 1 ~ G >2 d( y ( k ) ), (D–18) Thelastequationcanbecomputedin O (max f n m g d ) d istherankof ~ K ,insteadof O (max f nm m 2 g ) whichisthecomplexityofthexedpointalgorithmpresente din[ 68 ]. D.3Experiments Theexperimentalsetupfortheabovemethodsisdividedinto twostages;inorder toassesstheperformanceofthemethodologyintermsofaccu racywetesttherank decientestimationalgorithmsonsimulateddata.Then,we applyourimplementation methodologyforthePRIintworealscenarios,namely,autom aticimagesegmentation andsignaldenoising.D.3.1SimulatedData Forthesimulateddatawecomputetheinformationpotential andthecross-information potentialofamixtureoftwounitaryisotropicGaussiandis tributionsinafour-dimensional space.Themeanvectorsare ( 1,1,1,1 ) > and ( 1,1,1,1 ) > .Thekernelsizeissettobe =2 andthesizeofthesetis n =1000 .Figure D-1 displaystheperformanceofthe estimatorsfordifferentranksthatarerelatedtotheaccur acylevelsetfortheincomplete Choleskydecomposition.Noticethattheperformanceofthe Nystr ¨ om-KECAremains constant,albeitslightlyworsethanpureNystr ¨ om.Thisisbecausewedropvectorswith lowscoresasdescribedinsection D.1.2 .Thereforewesacricemoreaccuracytolower therankevenfurther.D.3.2ImageSegmentationSignalDenoisingwiththePRI Automaticimagesegmentationisusuallyseenasaclusterin gproblem,where spatialandintensityfeaturesarecombinedtodiscernobje ctsfrombackground.A wellestablishedprocedureforimagesegmentationandforc lusteringingeneralisthe 124


D.3 Experiments

The experimental setup for the above methods is divided into two stages. In order to assess the performance of the methodology in terms of accuracy, we test the rank-deficient estimation algorithms on simulated data. Then, we apply our implementation methodology for the PRI in two real scenarios, namely automatic image segmentation and signal denoising.

D.3.1 Simulated Data

For the simulated data we compute the information potential and the cross-information potential of a mixture of two unitary isotropic Gaussian distributions in a four-dimensional space. The mean vectors are (1,1,1,1)ᵀ and (−1,−1,−1,−1)ᵀ. The kernel size is set to σ = 2 and the size of the set is n = 1000. Figure D-1 displays the performance of the estimators for different ranks, which are related to the accuracy level set for the incomplete Cholesky decomposition. Notice that the performance of Nyström-KECA remains constant, albeit slightly worse than pure Nyström. This is because we drop vectors with low scores, as described in Section D.1.2; we therefore sacrifice more accuracy to lower the rank even further.

Figure D-1. Accuracies for the IP (A) and CIP (B) estimators as a function of the incomplete Cholesky accuracy parameter epsilon, for the incomplete Cholesky, Nyström, Nyström-KECA, and subsampling approximations.

D.3.2 Image Segmentation and Signal Denoising with the PRI

Automatic image segmentation is usually seen as a clustering problem, where spatial and intensity features are combined to discern objects from the background. A well-established procedure for image segmentation, and for clustering in general, is Gaussian mean shift (GMS). It has been shown that GMS is a special case of the PRI objective when the trade-off parameter λ is set to 1. Treating images as collections of pixels is a fairly challenging task due to the amount of points to be processed. Figure D-2 shows the segmentation results using the PRI optimization described in Section D.2 and the Nyström-KECA rank-deficient factorization. The image resolution is 130 × 194 pixels, for a total of 25220 points.

Figure D-2. Image segmentation using PRI (A: original image; B: segmented image using PRI).
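For reference, the sketch below shows one way of turning an image into the point set fed to the PRI. Stacking spatial coordinates with intensity, and the relative scaling of the two feature groups, are assumptions made for illustration and are not taken from the original experiment.

    import numpy as np

    def image_to_points(img, spatial_scale=1.0, intensity_scale=1.0):
        # Stack (row, col, intensity) for every pixel into an (n, 3) point set
        rows, cols = np.indices(img.shape)
        return np.column_stack([
            spatial_scale * rows.ravel().astype(float),
            spatial_scale * cols.ravel().astype(float),
            intensity_scale * img.ravel().astype(float),
        ])

    # X = image_to_points(img); Y = X.copy()   # initialize the particles at the pixels,
    # then update Y with the low-rank gradient sketched in Section D.2.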


Setting λ = 1, as we already mentioned, defines a mode-seeking regime for the PRI. For larger values of λ we obtained a solution for which points concentrate on highly dense regions and are nicely scattered on patterns resembling principal curves of the data. The key interpretation is that the estimated distribution gets pulled towards regions of higher entropy in the manifold of PDFs where we are looking for an optimum of the PRI objective. Figure D-3 shows the resulting denoised signal, which was embedded in a two-dimensional space, along with the contour plots of the estimated PDF. The number of points is 15000.

Figure D-3. Noisy and denoised versions of a periodic signal embedded in a two-dimensional space.

D.4 Remarks

In this chapter we suggest the use of rank-deficient approximations to the Gram matrices involved in the estimation of ITL objective functions. In particular, we focus on the principle of relevant information, which requires estimation of second-order entropies and cross-entropies (equivalently, information and cross-information potentials) along with their respective derivatives, which are employed during the optimization. We developed a methodological approach to factorize the elements involved in the gradient calculation, and these results can be extended to other methods that involve similar forms. Along the lines of rank-deficient approximation, we propose a simple modification to the Nyström method that is motivated by the nature of the quantities we want to estimate. The presented methodology allows the application of the PRI to much larger datasets; we


expect this improvement to open new directions in which we can apply this principle. Some of the analysis that led to the modification of the Nyström-based decomposition raises the question of which kernels can be more useful in the context of information theoretic learning, based on their convergence properties for certain vectors.


REFERENCES

[1] Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). The method of the potential functions for the problem of restoring the characteristic of a function converter from randomly observed points. Avtomatika i Telemekhanika, 25(12), 1705–1714.

[2] Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Avtomatika i Telemekhanika, 25(6), 917–936.

[3] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.

[4] Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.

[5] Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. JMLR, 3, 1–48.

[6] Barlow, H. (1961). Possible principles underlying the transformations of sensory messages. Sensory Communication, (pp. 217–234).

[7] Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311.

[8] Barlow, H., Kaushal, T., & Mitchison, G. (1989). Finding minimum entropy codes. Neural Computation, 1, 412–423.

[9] Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labelled and unlabelled examples. Journal of Machine Learning Research, 7, 2399–2434.

[10] Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

[11] Bhatia, R. (1996). Matrix Analysis. Graduate Texts in Mathematics. Springer.

[12] Bhatia, R. (2006). Infinitely divisible matrices. The American Mathematical Monthly, 113(3), 221–235.

[13] Candocia, F. M., & Principe, J. C. (1999). Super-resolution of images based on local correlations. IEEE Transactions on Neural Networks, 10(2), 372–380.

[14] Chong, E. K. P., & Zak, S. H. (2001). An Introduction to Optimization. Discrete Mathematics and Optimization. Wiley Interscience, second ed.

[15] Costa, J. A., & Hero III, A. O. (2004). Geodesic entropic graphs for dimension and entropy estimation in manifold learning. IEEE Transactions on Signal Processing, 52, 2210–2221.


[16] Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory. Wiley-Interscience, second ed.

[17] Cucker, F., & Zhou, D. X. (2007). Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press.

[18] Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In ICML, (pp. 209–216). Corvallis, Oregon, USA.

[19] Dhillon, I. S., & Sra, S. (2000). Generalized non-negative matrix approximations with Bregman divergences. In NIPS, (pp. 556–562).

[20] Doi, E., & Lewicki, M. S. (2004). Sparse coding of natural images using an overcomplete set of limited capacity units. In NIPS, (pp. 377–384).

[21] Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6, 2153–2175.

[22] Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern Classification. John Wiley and Sons, second ed.

[23] Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. JMLR, 2, 243–264.

[24] Friedland, S. (1981). Convex spectral functions. Linear and Multilinear Algebra, 9, 299–316.

[25] Gammerman, A., Vovk, V., & Vapnik, V. (1998). Learning by transduction. In Uncertainty in Artificial Intelligence, (pp. 148–155).

[26] Girolami, M. (2002). Orthogonal series density estimation and the kernel eigenvalue problem. Neural Computation, 14(3), 669–688.

[27] Globerson, A., & Roweis, S. (2005). Metric learning by collapsing classes. In NIPS.

[28] Golub, G. H., & Van Loan, C. F. (1996). Matrix Computations. Baltimore, Maryland: The Johns Hopkins University Press, third ed.

[29] Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. Simon, & E. Tomita (Eds.), Proceedings of Algorithmic Learning Theory, (pp. 63–77).

[30] Gretton, A., & Györfi, L. (2010). Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11, 1391–1423.

[31] Hero, A., & Michel, O. (1998). Robust entropy estimation strategies based on edge weighted random graphs. In Proceedings of the Meeting of the International Society for Optical Engineering (SPIE).


[32] Hinton, G., & Roweis, S. (2003). Stochastic neighbor embedding. In NIPS 2002, (pp. 857–864).

[33] Horn, R. A. (1969). The theory of infinitely divisible matrices and kernels. Transactions of the American Mathematical Society, 136, 269–286.

[34] Horn, R. A., & Johnson, C. R. (1985). Matrix Analysis. Cambridge University Press.

[35] Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent Component Analysis. John Wiley and Sons.

[36] Jenssen, R. (2009). Kernel entropy component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37] Jenssen, R., Eltoft, T., Girolami, M., & Erdogmus, D. (2006). Kernel maximum entropy data transformation and an enhanced spectral clustering algorithm. In NIPS, (pp. 633–640).

[38] Jolliffe, I. T. (2002). Principal Component Analysis. Springer, first ed.

[39] Kato, T. (1987). Variation of discrete spectra. Communications in Mathematical Physics, 111, 501–504.

[40] Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1), 82–95.

[41] Kreyszig, E. (1978). Introductory Functional Analysis with Applications. John Wiley & Sons.

[42] Kwok, J. T., & Tsang, I. W. (2003). The pre-image problem in kernel methods. In Proceedings of the 20th International Conference on Machine Learning, (pp. 408–415).

[43] Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6, 1783–1816.

[44] Lee, D. D., & Seung, H. S. (1997). Unsupervised learning by convex and conic coding. In NIPS, (pp. 515–521).

[45] Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

[46] Lewis, A. S. (1996). Derivatives of spectral functions. Mathematics of Operations Research, 21(3), 576–588.

[47] Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–117.


[48] Lutwak, E., Yang, D., & Zhang, G. (2005). Cramér-Rao and moment-entropy inequalities for Rényi entropy and generalized Fisher information. IEEE Transactions on Information Theory, 51(2), 473–478.

[49] Mangasarian, O. (1969). Nonlinear Programming. Systems and Science. McGraw-Hill.

[50] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In IX IEEE Signal Processing Society Workshop on Neural Networks for Signal Processing, (pp. 41–48).

[51] Nadaraya, E. A. (1964). On estimating regression. Theory of Probability and its Applications, 9(1), 141–142.

[52] Olshausen, B. A. (1996). Learning linear, sparse, factorial codes. Tech. Rep. AI Memo 1580, CBCL and AI Lab, MIT.

[53] Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.

[54] Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

[55] Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks for Signal Processing, (pp. 276–285).

[56] Papoulis, A. (1991). Probability, Random Variables, and Stochastic Processes. McGraw-Hill, third ed.

[57] Park, M. K., & Kang, M. G. (2003). Super-resolution image reconstruction: a technical overview. IEEE Signal Processing Magazine, 20(3), 21–36.

[58] Parzen, E. (1959). Statistical inference on time series by Hilbert space methods, I. Tech. Rep. 23, Stanford University.

[59] Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33(3), 1065–1076.

[60] Platt, J. C. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines. Tech. rep., Microsoft Research.

[61] Principe, J. C. (2010). Information Theoretic Learning: Renyi's Entropy and Kernel Perspective. Information Science and Statistics. Springer.

[62] Principe, J. C. (2010). Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives. Series in Information Science and Statistics. Springer.

[63] Principe, J. C., Xu, D., & Fisher III, J. W. (2000). Unsupervised Adaptive Filtering, chap. Information-Theoretic Learning, (pp. 265–319). John Wiley and Sons.


[64] Pál, D., Póczos, B., & Szepesvári, C. (2010). Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In NIPS.

[65] Ranzato, M. (2009). Unsupervised Learning of Feature Hierarchies. Ph.D. thesis, New York University.

[66] Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. In NIPS.

[67] Rao, S., de Medeiros Martins, A., & Principe, J. C. (2008). Mean shift: An information theoretic perspective. Pattern Recognition Letters, 30, 222–230.

[68] Rao, S. M. (2008). Unsupervised Learning: An Information Theoretic Framework. Ph.D. thesis, University of Florida.

[69] Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11, 305–345.

[70] Rényi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, (pp. 547–561). Berkeley: University of California Press.

[71] Sanchez-Giraldo, L. G., & Principe, J. C. (2011). A reproducing kernel Hilbert space formulation of the principle of relevant information. In IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

[72] Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer feedforward neural network. Neural Networks, 2, 459–473.

[73] Sanger, T. D. (1989). An optimality principle for unsupervised learning. In NIPS, (pp. 11–19).

[74] Saul, L. K., & Roweis, S. T. (2003). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.

[75] Schoenberg, I. J. (1938). Metric spaces and positive definite functions. Transactions of the American Mathematical Society, 44(3), 522–536.

[76] Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning. MIT Press.

[77] Schölkopf, B., Smola, A. J., & Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem. Tech. Rep. 44, Max-Planck-Institut für biologische Kybernetik.

[78] Seth, S., & Principe, J. (2009). On speeding up computation in information theoretic learning. In IJCNN.

[79] Seth, S., Rao, M., Park, I., & Principe, J. C. (2011). A unified framework for quadratic measures of independence. IEEE Transactions on Signal Processing, 59(8), 3624–3635.


[79]Seth,S.,Rao,M.,Park,I.,&Prncipe,J.C.(2011).Au niedframeworkfor quadraticmeasuresofindependence. IEEETransactionsonSignalProcessing 59 (8),3624–3635. [80]Shannon,C.E.(1948).Amathematicaltheoryofcommuni cation. ThebellSystem technicalJournal 27 ,379–423,623–656. [81]Shaw,B.,&Jebara,T.(2007).Minimumvolumeembedding .In Proceedingsof theEleventhInternationalConferenceonArticialIntell igenceandStatistics ,(pp. 460–467). [82]Shawe-Taylor,J.,&Cristianini,N.(2004). KernelMethodsforPatternAnalysis CambridgeUniversityPress. [83]Slonim,N.(2002). TheInformationBottleneck:TheoryandApplications .Ph.D. thesis,HebrewUniversity. [84]Sriperumbudur,B.K.,Gretton,A.,Fukumizu,K.,Lanck riet,G.,&Sch ¨ olkopf,B. (2008).Injectivehilbertspaceembeddingsofprobability measures.In Proceedings ofthe21stAnnualConferenceonLearningTheory ,(pp.111–122). [85]Tenenbaum,J.B.,deSilva,V.,&Langford,J.(2000).Ag lobalgeometricframework fornonlineardimensionalityreduction. Science 290 (5500),2319–2323. [86]Tipping,M.E.,&Bishop,C.M.(1999).Probabilisticpr incipalcomponentanalysis. JournaloftheRoyalStatisticalSociety 61 (3),611–622. [87]Tishby,N.,Pereira,F.C.,&Bialek,W.(1999).Theinfo rmationbottleneckmethod. In The37thannualAllertonConferenceonCommunication,Cont rol,andComputing ,(pp.368–377). [88]vanderMaaten,L.,&Hinton,G.(2008).Visualizingdat ausingt-sne. Journalof MachineLearningResearch 9 ,2579–2605. [89]vonLuxburg,U.(2004). StatisticalLearningwithSimilarityandDissimilarity Functions .Ph.D.thesis,TechnicalUniversityofBerlin. [90]vonLuxburg,U.(2007).Atutorialonspectralclusteri ng.Tech.Rep.149,MaxPlank InstituteforBiologicalCybernetics. [91]Weinberger,K.Q.,Blitzer,J.,&Saul,L.K.(2005).Dis tancemetriclearningfor largemarginnearestneighborclassication.In NIPS [92]Weinberger,K.Q.,Sha,F.,&Saul,L.K.(2004).Learnin gakernelmatrixfor nonlineardimensionalityreduction.In ProceedingoftheTwenty-rstInternational ConferenceonMachineLearning ,(pp.106–113). [93]Williams,C.K.,&Seeger,M.(2000).Usingthenystr ¨ ommethodtospeedupkernel machines.In NIPS ,(pp.682–688). 133


[94] Williams, C. K. I., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In ICML, (pp. 1159–1166).

[95] Xu, J.-W., Paiva, A. R. C., Park, I., & Principe, J. C. (2008). A reproducing kernel Hilbert space framework for information theoretic learning. IEEE Transactions on Signal Processing, 56(12), 5891–5902.


BIOGRAPHICAL SKETCH

Luis Gonzalo Sánchez Giraldo was born in 1983 in Manizales, Colombia. He received the B.S. in electronics engineering and the M.Eng. in industrial automation from Universidad Nacional de Colombia in 2005 and 2008, respectively, and his Ph.D. in electrical and computer engineering from the University of Florida in 2012. Between 2004 and 2008, he was appointed as a research assistant at the Control and Digital Signal Processing Group (GCPDS) at Universidad Nacional de Colombia. During his Ph.D. studies he worked as a research assistant at the Computational Neuro-Engineering Laboratory (CNEL) at the University of Florida. His main research interests are in machine learning and signal processing.