
Adaptive Filtering in Reproducing Kernel Hilbert Spaces

Permanent Link: http://ufdc.ufl.edu/UFE0022748/00001

Material Information

Title: Adaptive Filtering in Reproducing Kernel Hilbert Spaces
Physical Description: 1 online resource (161 p.)
Language: english
Creator: Liu, Weifeng
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2008

Subjects

Subjects / Keywords: active, adaptive, filter, kernel, learning, nonlinear, online, regularization
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: The theory of linear adaptive filters has reached maturity, unlike the field of nonlinear adaptive filters. Although nonlinear adaptive filters are very useful in nonlinear and nonstationary signal processing, complexity and non-convexity issues limit existing algorithms like Volterra series, time-lagged feedforward networks and Bayesian filtering in an online scenario. Kernel methods are also nonlinear methods and their solid mathematical foundation and experimental successes are making them very popular in recent years, but most of the algorithms use block adaptation and are computationally very expensive using a large Gram matrix of dimensionality given by the number of data points; therefore computationally efficient online algorithms are very much needed for their useful flexibility in design. This work developed systematically for the first time a class of on-line learning algorithms in reproducing kernel Hilbert spaces (RKHS). The reproducing kernel Hilbert space provides an elegant means of obtaining nonlinear extensions of linear algorithms expressed in terms of inner products using the so-called kernel trick. We presented kernel extensions for three well-known adaptive filtering methods, namely the least-mean-square, the affine-projection-algorithms and the recursive-least-squares, studied their properties and validated them in real applications. We focused on revealing the unique structures of the linear adaptive filters and demonstrated how the nonlinear extensions are derived. These algorithms are universal approximators, use convex optimization (no local minima) and display moderate complexity. Simulations of time series prediction, nonlinear channel equalization, nonlinear fading channel tracking, and noise cancellation were included to illustrate the applicability and correctness of our theory. Besides, a unifying view of active data selection for kernel adaptive filters was introduced and analyzed to address their growing structure. Finally we discussed the wellposedness of the proposed gradient based algorithms for completeness.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Weifeng Liu.
Thesis: Thesis (Ph.D.)--University of Florida, 2008.
Local: Adviser: Principe, Jose C.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2010-12-31

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2008
System ID: UFE0022748:00001



Full Text

ADAPTIVE FILTERING IN REPRODUCING KERNEL HILBERT SPACES

By

WEIFENG LIU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2008

(c) 2008 Weifeng Liu

To my family, friends and professors.

ACKNOWLEDGMENTS

I would like to start by thanking my supervisor, Dr. Jose C. Principe. His constant support, encouragement and supervision have been integral to this work. His vision, imagination and enthusiasm are what I admire and try to live up to. His trust, guidance and mentoring are long treasured and appreciated. I am also grateful to the members of my advisory committee, especially Dr. John Harris, Dr. Murali Rao and Dr. Jay Gopalakrishnan. Their help and guidance are deeply appreciated. Finally, I would like to thank all the CNEL members, especially the ITL group members, for their collaboration in this and other research projects. Without them, life in CNEL would not be so enjoyable and memorable. Dr. Puskal Pohkarel initiated this research project by studying the kernel least mean square algorithm. Dr. Jay Gopalakrishnan helped in the development of Theorems 4.3 and 6.6. Dr. Murali Rao helped in the development of Theorem 6.4.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Supervised, Sequential and Active Learning
    1.1.1 What Is Learning?
    1.1.2 Supervised Learning
    1.1.3 Sequential Learning
    1.1.4 Active Learning
  1.2 Linear and Nonlinear Adaptive Filtering
    1.2.1 Linear Adaptive Filtering
    1.2.2 Nonlinear Adaptive Filtering
  1.3 Reproducing Kernel Hilbert Space and Kernel Methods
  1.4 Kernel Adaptive Filters
  1.5 Notation

2 KERNEL LEAST MEAN SQUARE ALGORITHM
  2.1 Formulation of Kernel Least Mean Square Algorithm
    2.1.1 Learning Problem Setting
    2.1.2 Least Mean Square Algorithm
    2.1.3 Kernel Least Mean Square Algorithm
  2.2 Implementation of Kernel Least Mean Square Algorithm
    2.2.1 Misadjustment
    2.2.2 Step Size
    2.2.3 Infinite Training Data and Sparsification
    2.2.4 Penalizing Solution Norm
  2.3 Simulations
    2.3.1 Mackey-Glass Time Series Prediction
    2.3.2 Nonlinear Channel Equalization
  2.4 Discussions

3 KERNEL AFFINE PROJECTION ALGORITHMS
  3.1 Formulation of Kernel Affine Projection Algorithms
    3.1.1 Affine Projection Algorithms
    3.1.2 Kernel Affine Projection Algorithms

      3.1.2.1 Simple KAPA (KAPA-1)
      3.1.2.2 Normalized KAPA (KAPA-2)
      3.1.2.3 Leaky KAPA (KAPA-3)
      3.1.2.4 Leaky KAPA with Newton's recursion (KAPA-4)
  3.2 Taxonomy for Related Algorithms
    3.2.1 Kernel Least Mean Square Algorithm
    3.2.2 Kivinen's NORMA
    3.2.3 Kernel ADALINE
    3.2.4 Recursively Adapted Radial Basis Function Networks
    3.2.5 Sliding Window Kernel Recursive Least Squares
    3.2.6 Regularization Networks
  3.3 Implementation of Kernel Affine Projection Algorithms
    3.3.1 Error Reusing
    3.3.2 Sliding Window Gram Matrix Inversion
    3.3.3 Sparsification and Novelty Criterion
  3.4 Simulations
    3.4.1 Mackey-Glass Time Series Prediction
    3.4.2 Noise Cancelation
    3.4.3 Nonlinear Channel Equalization
  3.5 Discussions

4 EXTENDED KERNEL RECURSIVE LEAST SQUARES
  4.1 Formulation of Extended Kernel Recursive Least Squares
    4.1.1 Recursive Least Squares
    4.1.2 Kernel Recursive Least Squares
    4.1.3 Extended Recursive Least Squares
    4.1.4 Reformulation of Extended Recursive Least Squares
    4.1.5 Extended Kernel Recursive Least Squares
      4.1.5.1 Random walk kernel recursive least squares
      4.1.5.2 Exponentially weighted kernel recursive least squares
  4.2 Implementation of Extended Kernel Recursive Least Squares
    4.2.1 Sparsification and Approximate Linear Dependency
    4.2.2 Approximate Linear Dependency and Stability
  4.3 Simulations
    4.3.1 Rayleigh Channel Tracking
    4.3.2 Lorenz Time Series Prediction
  4.4 Discussions

5 CONDITIONAL INFORMATION FOR ACTIVE DATA SELECTION
  5.1 Definition of Conditional Information
    5.1.1 Problem Statement
    5.1.2 Definition of Conditional Information
  5.2 Evaluation of Conditional Information
    5.2.1 Gaussian Processes Theory

    5.2.2 Evaluation of Conditional Information
      5.2.2.1 Input distribution
      5.2.2.2 Unknown desired signal
  5.3 Learning Rules
    5.3.1 Relation to Kernel Recursive Least Squares
    5.3.2 Updating Rule for Learnable Data
    5.3.3 Updating Rule for Abnormal and Redundant Data
    5.3.4 Active Online GP Regression
  5.4 Simulations
    5.4.1 Nonlinear Regression
    5.4.2 Mackey-Glass Time Series Prediction
    5.4.3 CO2 Concentration Forecasting
  5.5 Discussions

6 WELLPOSEDNESS ANALYSIS
  6.1 Regularization Networks
  6.2 Wellposedness Analysis of Kernel ADALINE
    6.2.1 Kernel ADALINE
    6.2.2 Tikhonov Regularization
    6.2.3 Self-regularization through Gradient Descent
  6.3 Wellposedness Analysis of Kernel Least Mean Square
    6.3.1 Model Free Solution Norm Bound
    6.3.2 Model Based Solution Norm Bound
  6.4 Simulations
    6.4.1 Mackey-Glass Time Series Prediction
  6.5 Discussions

7 CONCLUSION AND FUTURE WORK
  7.1 Conclusion
  7.2 Future Work
    7.2.1 Kernel Kalman Filter
    7.2.2 Characterize the Properties through Applications
    7.2.3 Kernel Design

APPENDIX

A ALD-STABILITY THEOREM

B SOLUTION NORM UPPER BOUNDS
  B.1 Kernel ADALINE
  B.2 Kernel Least Mean Square

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

1-1  Comparison of different nonlinear adaptive filters
1-2  Notation
2-1  Performance comparison of KLMS with different step sizes and RN with different regularization parameters in Mackey-Glass time series prediction
2-2  Complexity comparison at iteration i
2-3  Performance comparison with different noise levels in Mackey-Glass time series prediction (training MSE)
2-4  Performance comparison with different noise levels in Mackey-Glass time series prediction (testing MSE)
2-5  Effect of kernel size of Gaussian kernel in Mackey-Glass time series prediction
2-6  Effect of order of polynomial kernel in Mackey-Glass time series prediction
2-7  Performance comparison in nonlinear channel equalization
3-1  Comparison of four KAPA update rules
3-2  List of related algorithms
3-3  Performance comparison in Mackey-Glass time series prediction
3-4  Complexity comparison at iteration i
3-5  Effect of novelty criterion in Mackey-Glass time series prediction
3-6  Noise reduction comparison in noise cancelation
3-7  Noise reduction comparison with fMRI noise source
4-1  Performance comparison in Rayleigh channel tracking
4-2  Performance comparison in Lorenz system prediction
5-1  Network sizes of RAN, SKLMS-1, and AOGR

LIST OF FIGURES

1-1  Block diagram of linear adaptive filters
1-2  Block diagram of nonlinear adaptive filters
1-3  Relation between different adaptive filtering algorithms
2-1  Network topology of KLMS
2-2  Learning curves of LMS and KLMS in Mackey-Glass time series prediction
2-3  Effect of explicit regularization in NORMA in Mackey-Glass time series prediction
2-4  Learning curves of LMS (eta=0.005) and KLMS (eta=0.1) in nonlinear channel equalization (sigma=0.4)
3-1  Learning curves of LMS, KLMS-1, KAPA-1 (K=10), KAPA-2 (K=10), SW-KRLS (K=50) and KRLS in Mackey-Glass time series prediction
3-2  Learning curves of KLMS-1, KAPA-1 (K=10) and KAPA-2 (K=10) with and without novelty criterion in Mackey-Glass time series prediction
3-3  Noise cancelation system
3-4  Ensemble learning curves of NLMS, SKLMS-1 and SKAPA-2 (K=10) in noise cancelation
3-5  Segment of fMRI noise recording
3-6  Ensemble learning curves of NLMS and SKAPA-2 (K=10) in fMRI noise cancelation
3-7  Basic structure of nonlinear channel
3-8  Ensemble learning curves of LMS, APA-1, SKLMS-1, SKAPA-1 and SKAPA-2 in nonlinear channel equalization (sigma=0.1)
3-9  Network sizes of SKLMS-1, SKAPA-1 and SKAPA-2 over training in nonlinear channel equalization
3-10 Performance comparison of LMS, APA-1, SKLMS-1, SKAPA-1 and SKAPA-2 with different SNR in nonlinear channel equalization
3-11 Ensemble learning curves of LMS, APA-1, SKLMS-1, SKAPA-1 and SKAPA-2 with an abrupt change at iteration 500 in nonlinear channel equalization
3-12 Network sizes of LMS, APA-1, SKLMS-1, SKAPA-1 and SKAPA-2 over training with an abrupt change at iteration 500 in nonlinear channel equalization

4-1  Ensemble learning curves of LMS-2, EX-RLS, KRLS and EX-KRLS in tracking a Rayleigh fading multipath channel
4-2  Effect of approximate linear dependency in EX-KRLS
4-3  Network size vs. performance in EX-KRLS with ALD
4-4  Effect of approximate linear dependency in EX-KRLS (=0)
4-5  Network size vs. performance in EX-KRLS with ALD (=0)
4-6  State trajectory of Lorenz system for values beta=8/3, sigma=10 and rho=28
4-7  Typical waveform of the first component from the Lorenz system
4-8  Performance comparison of LMS-2, RLS, EX-RLS, KRLS and EX-KRLS in Lorenz system prediction
4-9  Ensemble learning curves of LMS-2, RLS, EX-RLS, KRLS and EX-KRLS in Lorenz system prediction
4-10 Performance comparison of SW-KRLS and EX-KRLS in Lorenz system prediction
5-1  Illustration of conditional information along training in nonlinear regression
5-2  Network size vs. testing MSE for conditional information criterion in nonlinear regression
5-3  Network size vs. testing MSE for ALD in nonlinear regression
5-4  Comparison of CI and ALD in redundancy removal in nonlinear regression
5-5  Training data with outliers in nonlinear regression
5-6  Comparison of CI and ALD criteria in outlier detection in nonlinear regression
5-7  Learning curves of AOGR and KRLS-ALD with outliers in nonlinear regression
5-8  Network size vs. testing MSE of AOGR for different T1 in Mackey-Glass time series prediction
5-9  Learning curves of LMS, SKLMS-1, RAN and AOGR in Mackey-Glass time series prediction
5-10 CO2 concentration trend from year 1958 to year 2008
5-11 Learning curve of AOGR and conditional information of training data along iteration in CO2 concentration forecasting
5-12 Network size vs. testing MSE of AOGR for different T1 in CO2 concentration forecasting

5-13 Learning curve of AOGR and conditional information of training data along iteration with effective examples circled in CO2 concentration forecasting
5-14 Forecasting result of AOGR for CO2 concentration
6-1  The reg-functions of three regularization approaches
6-2  Effect of step size on the reg-function of kernel ADALINE (N=500, n=600)
6-3  Effect of iteration number on the reg-function of kernel ADALINE (eta=0.1, N=500)
6-4  Effect of training data size on the reg-function of kernel ADALINE (eta=0.1, n=600)
6-5  Learning curves of KLMS on both training and testing data sets
6-6  Solution norm of KLMS along iteration with its model free upper bound
6-7  Learning curves of kernel ADALINE on both training and testing data sets
6-8  Solution norm of kernel ADALINE along iteration and its upper bound
6-9  Cross-validation result of regularization network with different regularization parameters
6-10 Solution norm of regularization network with different regularization parameters and its upper bound
6-11 Solution norm of KLMS and its model-free upper bound with different step sizes (convergence and divergence)
6-12 Solution norm of KLMS and its model free upper bound along iteration (eta=10)
6-13 Solution norm of KLMS and its model-based upper bound with different step sizes
6-14 Solution norm of KLMS and its model-based upper bound along iteration (eta=.03)

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ADAPTIVE FILTERING IN REPRODUCING KERNEL HILBERT SPACES

By

Weifeng Liu

December 2008

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

The theory of linear adaptive filters has reached maturity, unlike the field of nonlinear adaptive filters. Although nonlinear adaptive filters are very useful in nonlinear and nonstationary signal processing, complexity and non-convexity issues limit existing algorithms like Volterra series, time-lagged feedforward networks and Bayesian filtering in an online scenario. Kernel methods are also nonlinear methods and their solid mathematical foundation and experimental successes are making them very popular in recent years, but most of the algorithms use block adaptation and are computationally very expensive, using a large Gram matrix of dimensionality given by the number of data points; therefore computationally efficient online algorithms are very much needed for their useful flexibility in design.

This work developed systematically, for the first time, a class of online learning algorithms in reproducing kernel Hilbert spaces (RKHS). The reproducing kernel Hilbert space provides an elegant means of obtaining nonlinear extensions of linear algorithms expressed in terms of inner products using the so-called kernel trick. We presented kernel extensions for three well-known adaptive filtering methods, namely the least-mean-square, the affine-projection algorithms and the recursive-least-squares, studied their properties and validated them in real applications.

We focused on revealing the unique structures of the linear adaptive filters and demonstrated how the nonlinear extensions are derived.

These algorithms are universal approximators, use convex optimization (no local minima) and display moderate complexity. Simulations of time series prediction, nonlinear channel equalization, nonlinear fading channel tracking, and noise cancelation were included to illustrate the applicability and correctness of our theory. Besides, a unifying view of active data selection for kernel adaptive filters was introduced and analyzed to address their growing structure. Finally, we discussed the wellposedness of the proposed gradient based algorithms for completeness.

CHAPTER 1
INTRODUCTION

1.1 Supervised, Sequential and Active Learning

1.1.1 What Is Learning?

Learning is a process by which the free parameters and the topology of a neural network are adapted through a process of stimulation by the environment in which the network is embedded [46].

Adjustment of the free parameters is well studied in the adaptive filtering and neural network fields, whereas adaptation of the topology still needs a lot of improvement. Traditionally, the growing and pruning techniques of multi-layer perceptrons are viewed more as a heuristic network design tool than treated as an integral part of learning itself [45, 56]. However, learning a network topology may be equally important, if not more important, than adjusting the free parameters in the network, as exemplified in biological learning, where neurons in the human brain can die and new synaptic connections can grow [46]. If the environment is changing over time, the design of an "optimal" network beforehand by conventional methods such as the Akaike Information Criterion [1], the Bayesian Information Criterion [85], and minimum description length [79] is not possible. Therefore, it makes more sense to adapt the network structure over time as part of learning.

1.1.2 Supervised Learning

In conceptual terms, we may think there is a teacher who has knowledge of the environment, with that knowledge being represented by input-output examples. Famous learning rules include error-correction learning like the Widrow-Hoff rule [105] and memory-based learning exemplified by k-nearest neighbor classifiers [21] and radial basis function networks [13, 70].

1.1.3 Sequential Learning

Historically, machine learning has focused on non-sequential learning tasks, where the training set can be constructed a priori and learning stops once this set is duly

processed. Despite its wide applicability, there are many situations where learning takes place over time. A learning task is sequential if the training examples become available over time, usually one at a time. A learning algorithm is sequential if, for any given training examples {u(1), d(1)}, {u(2), d(2)}, ..., {u(i), d(i)}, ..., it produces a sequence of hypotheses h(1), h(2), ..., h(i), ... such that h(i) depends only on the last hypothesis h(i-1) and the current example {u(i), d(i)} [36]. Note that sometimes one relaxes this definition slightly to allow the next hypothesis to depend on the previous one and a small subset of new training examples (rather than a single one) to trade off complexity and performance. It has long been argued that learning involves the ability to improve performance over time [88]. Clearly, humans acquire knowledge over time, and knowledge is constantly revised based on newly acquired information. In the study of robotic and intelligent agents, one finds that sequential learning is a must to deal with complex operating environments which are usually changing and unpredictable [12, 94].

Moreover, sequential learning algorithms usually require fewer computational resources to perform the hypothesis update at each iteration. The famous least-mean-square adaptive filter [105] is widely used in adaptive signal processing, adaptive controls and communications, where nonstationary environments, strict timing requirements and limited computation resources are main concerns. It has been observed that whenever there is a requirement to process signals that result from operation in an environment of unknown statistics, or one that is inherently nonstationary, the use of an adaptive filter offers a highly attractive solution to the problem [47].

1.1.4 Active Learning

There are possibly two scenarios where active learning finds applications. In the first, unlabeled data (input without desired) is abundant but labeling data is expensive. For example, it is quite easy to get raw speech samples for a speech recognizer, whereas labeling these samples is very tedious. The idea of active learning here is to construct an accurate recognizer with significantly fewer labels than you would need in a conventional

supervised learning framework. This topic has been studied under the names of "sequential design", "query learning" or "selective sampling" in economic theory, statistics and machine learning [15, 27, 29, 57, 61, 97]. For these problems, no desired signal (or label) is available to quantify the importance of the candidate data sample.

In the second scenario, a large amount of data has already been gathered in pairs of input and target, but a subset of the data must be selected for efficient training and sparse representation [74, 86, 89]. This kind of problem commonly arises in kernel methods and Gaussian processes (GP) modeling [77]. The rationale behind it is that training data are not equally important. Especially in a sequential learning setting, depending on the state of a learning system, the same data may contain different amounts of information. This is understandable in our daily experience. A message contains the most information when it is first perceived. The amount of information decreases after it has been presented many times. We term this kind of problem active data selection to distinguish it from the first scenario. We use active learning as a general term referring to both cases.

Active sequential learning is quite common in practice. For example, human beings assess incoming data every day based on their knowledge and then decide how much resource they would allocate to learn it. Intuitively, for a learning machine, one can expect that after processing sufficient samples from the same source there is little need to learn, because of redundancy. Active sequential methods have been studied in [22, 28, 69] for regression problems and in [8, 38] for classification problems. It has been demonstrated that active learning can significantly reduce the computational complexity with equivalent performance. It can even provide a more accurate and more stable solution in some situations.

1.2 Linear and Nonlinear Adaptive Filtering

1.2.1 Linear Adaptive Filtering

Linear adaptive filters, built around a linear combiner w, are designed to perform sequential learning based on a sequence of input-output examples (see Figure 1-1). Accordingly, the definition of sequential learning can be translated into the following equation [41]:

    w(i) = w(i-1) + Gain(i) e(i)                                            (1-1)

where w(i-1) is the previous estimate of the filter weight (which corresponds to the previous hypothesis in the definition of sequential learning), the error e(i) is the model prediction error on the new data arising from the use of w(i-1), and Gain(i) is the algorithm gain. In the case of the least-mean-square algorithm,

    w(i) = w(i-1) + eta u(i) e(i)                                           (1-2)

where u(i) is the input vector, eta is the learning rate, and the error e(i) is calculated by

    e(i) = d(i) - w(i-1)^T u(i)                                             (1-3)

Figure 1-1. Block diagram of linear adaptive filters
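The recursion (1-1)-(1-3) can be stated directly in code. The following is a minimal NumPy sketch of the LMS update; the step size, the toy identification task and all variable names are illustrative assumptions rather than anything specified in the text.

    import numpy as np

    def lms(U, d, eta=0.1):
        """Least-mean-square: w(i) = w(i-1) + eta * e(i) * u(i)."""
        N, L = U.shape
        w = np.zeros(L)                # w(0) = 0
        errors = np.empty(N)
        for i in range(N):
            e = d[i] - w @ U[i]        # e(i) = d(i) - w(i-1)^T u(i)
            w = w + eta * e * U[i]     # weight update (1-2)
            errors[i] = e
        return w, errors

    # toy example: identify a 3-tap linear system from noisy data
    rng = np.random.default_rng(0)
    U = rng.standard_normal((500, 3))
    w_true = np.array([0.5, -0.3, 0.2])
    d = U @ w_true + 0.01 * rng.standard_normal(500)
    w_hat, err = lms(U, d, eta=0.1)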

Another celebrated algorithm is the recursive-least-squares [68]:

    w(i) = w(i-1) + g(i) e(i)                                               (1-4)

where g(i) is the gain vector, whose meaning will become clear in Chapter 4.

Despite their simple structure (and probably because of it), linear adaptive filters enjoy wide applicability and success in diverse fields such as communications, control, radar, sonar, seismology, and biomedical engineering, among others. The theory of linear adaptive filters has reached a highly mature stage of development [47]. However, the same cannot be said about nonlinear adaptive filters.

1.2.2 Nonlinear Adaptive Filtering

The problem of designing a nonlinear adaptive filter is much harder. One of the simplest ways to get a nonlinear adaptive filter is by cascading a static nonlinearity with a linear filter, as in the Hammerstein and Wiener models [6, 62, 106]. However, in this approach the modeling capability is quite limited, the choice of the nonlinearity is highly problem dependent, and there are local minima during training. Gabor tried using the Volterra series to bypass the mathematical difficulties of nonlinear adaptive filtering, but it is clear that the complexity of the Volterra series explodes exponentially as its modeling capacity increases [35, 102]. An improvement of the Volterra series is the Wiener series proposed by [55, 106], but slow convergence and high complexity still hinder its wide application. Later on, time-lagged multi-layer perceptrons, radial basis function networks or recurrent neural networks [46, 55, 73] were used to replace the linear combiner and trained in a real-time fashion by stochastic gradient (see Figure 1-2). They have a history of successes, but their non-convex optimization nature prevents their widespread use in on-line applications.

Other forms of sequential learning can be found in the Bayesian learning literature [109]. Recursive Bayesian estimation is a general probabilistic approach for estimating an unknown probability density function recursively over time using incoming measurements

Figure 1-2. Block diagram of nonlinear adaptive filters

and a mathematical process model. It has a close connection to the Kalman filter in adaptive filtering theory [51, 80] but is usually complicated for arbitrary data distributions, as exemplified by sequential Monte Carlo methods [25].

Reinforcement learning [92] is another field where sequential learning prevails. Reinforcement learning differs from supervised learning in several ways. The most important difference is that there is no presentation of input-output pairs. Instead, after choosing an action based on the current state, the algorithm is told an immediate reward and the subsequent state, but is not told which action would have been in its best long-term interest. Another difference is that on-line performance is important: the evaluation of the system is often concurrent with learning. It provides a way of programming agents by reward and punishment without needing to specify how the task is to be achieved, but there are formidable computational obstacles to fulfilling this promise.

For all these reasons, the purpose of this research is to investigate a family of nonlinear adaptive filtering algorithms which have the following features:

1. They have the universal approximation property.
2. They are convex optimization problems and hence have no local minima.
3. They have moderate complexity in terms of computation and memory.

In other words, we want to build a nonlinear adaptive filter which possesses the ability of modeling any continuous input-output mapping y = f(u) and obeys the following sequential learning rule

    f_i = f_{i-1} + Gain(i) e(i)                                            (1-5)

where f_i denotes the estimate of the mapping at time i. This sequential learning is very attractive in practice since the current estimate consists of two parts by addition, namely the previous estimate and a modification term proportional to the prediction error on the new data. This unique incremental nature distinguishes our methods from all the others. Although (1-5) appears very simple, the algorithm can in fact be motivated by many different objective functions. Also, depending on the precise meanings of Gain(i) and e(i), the algorithm can take many different forms. We explore this in detail in the subsequent chapters. This amazing feature is achieved with the underlying linear structure of the reproducing kernel Hilbert space (RKHS) where the algorithms exist.

1.3 Reproducing Kernel Hilbert Space and Kernel Methods

A kernel [4] is a continuous, symmetric, positive-definite function kappa: U x U -> R. U is the input domain, a compact subset of R^L. The commonly used kernels include the Gaussian kernel (1-6) and the polynomial kernel (1-7):

    kappa(u, u') = exp(-a ||u - u'||^2)                                     (1-6)
    kappa(u, u') = (u^T u' + 1)^p                                           (1-7)

The Mercer theorem [4, 14] states that any kernel kappa(u, u') can be expanded as follows:

    kappa(u, u') = sum_{i=1}^inf varsigma_i psi_i(u) psi_i(u')              (1-8)
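For concreteness, the two kernels (1-6) and (1-7) can be written as short functions. This is a minimal sketch; the kernel parameter a and the polynomial order p are placeholder values, not settings prescribed by the text.

    import numpy as np

    def gaussian_kernel(u, v, a=1.0):
        """kappa(u, u') = exp(-a ||u - u'||^2), Eq. (1-6)."""
        diff = np.asarray(u) - np.asarray(v)
        return np.exp(-a * np.dot(diff, diff))

    def polynomial_kernel(u, v, p=2):
        """kappa(u, u') = (u^T u' + 1)^p, Eq. (1-7)."""
        return (np.dot(u, v) + 1.0) ** p

    u = np.array([0.1, 0.2]); v = np.array([0.3, -0.1])
    print(gaussian_kernel(u, v), polynomial_kernel(u, v))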

where varsigma_i and psi_i are the eigenvalues and the eigenfunctions, respectively. The eigenvalues are non-negative. Therefore, a mapping phi can be constructed as

    phi: U -> F
    phi(u) = [sqrt(varsigma_1) psi_1(u), sqrt(varsigma_2) psi_2(u), ...]    (1-9)

By construction, the dimensionality of F is determined by the number of strictly positive eigenvalues, which can be infinite in the Gaussian kernel case.

In the machine learning literature, phi is usually treated as the feature mapping and phi(u) is the transformed feature vector lying in the feature space F (which is an inner product space). By doing so, an important implication is

    phi(u)^T phi(u') = kappa(u, u')                                         (1-10)

F is isometric-isomorphic to the RKHS induced by the kernel. Denote this RKHS as F', which is a functional space satisfying the following properties:

1. F' is the closure of the span of all kappa(u, .) with u in U. In other words,
       h = sum_{u(i) in U} a_i kappa(u(i), .),  for all h in F',
   where {kappa(u, .) | u in U} is the kernel basis and the a_i are real coefficients.
2. F' has the reproducing property
       <h, kappa(u, .)> = h(u),  for all h in F', for all u in U.
3. An interesting special case follows as
       <kappa(u', .), kappa(u, .)> = kappa(u', u),  for all u, u' in U.
4. The inner product induces the standard norm ||h||^2_{F'} = <h, h>.

It is easy to check that F' is essentially the same as F by identifying phi(u) = kappa(u, .) (by trivial congruence), which are the bases of the two spaces respectively and preserve

the inner product. By slightly abusing the notation, we do not distinguish F' and F in this research if no confusion is involved.

The main idea of kernel methods can be summarized as follows: transform the input data into a high-dimensional feature space via a positive definite kernel such that the inner product operation in the feature space can be computed efficiently through the kernel evaluation (1-10). Then appropriate linear methods are subsequently applied on the transformed data. As long as we can formulate the algorithm in terms of inner products (or equivalent kernel evaluations), we never explicitly have to compute in the high dimensional feature space. While this methodology is called the kernel trick, we have to point out that the underlying reproducing kernel Hilbert space plays a central role in providing linearity, convexity, and universal approximation capability at the same time. Successful examples of this methodology include support vector machines [101], principal component analysis [84], Fisher discriminant analysis [64], and many others.

It has been proved [90] that in the case of the Gaussian kernel, for any continuous input-output mapping f: U -> R and any epsilon > 0, there exist parameters {c_i}_{i=1}^m in U and real numbers {a_i}_{i=1}^m such that

    || f - sum_{i=1}^m a_i kappa(., c_i) ||_2 < epsilon                     (1-11)

If we denote a vector omega in F as

    omega = sum_{i=1}^m a_i phi(c_i)

then by (1-10) and (1-11), we have

    || f - omega^T phi ||_2 < epsilon

This shows that the linear model in F has the universal approximation property.

Furthermore, if our problem is to minimize a cost function over a finite data set {(u(i), d(i))}_{i=1}^N,

    min_f  J(f) = sum_{i=1}^N (d(i) - f(u(i)))^2 + lambda ||f||^2

it has been proved that the optimal solution can be expressed as

    f = sum_{i=1}^N a_i kappa(., u(i))

for suitable a_i. This result is called the representer theorem [83]. In other words, although we did consider functions which were expansions in terms of arbitrary points c_i (see (1-11)), it turns out that we can always express the solution in terms of the training points u(i) only. Hence the optimization problem over an arbitrarily large number of variables is transformed into one over N variables, where N is the number of training points.

Recently, it has also been shown that the Volterra series and Wiener series can be treated as a special case of a kernel regression framework [32]. By formulating the Volterra and Wiener series as a linear regression in RKHS, the complexity is now independent of the input dimensionality and the order of nonlinearity.

Based on these advantages and arguments, our strategy is clear: to formulate the classic adaptive filters in the RKHS such that we are iteratively solving a convex least-squares problem there. As long as we can formulate these algorithms in terms of inner products, we obtain nonlinear adaptive filters which have the universal approximation property and convexity at the same time. Convexity is a very important feature which prevents the algorithms from being stuck in local minima. (See Table 1-1.)

1.4 Kernel Adaptive Filters

In recent years, there have been some efforts at "kernelizing" adaptive filters in the literature. Frieß first used this idea to derive the kernel ADALINE, which is formulated as a deterministic gradient method based on all the training data (not online) [33]. Then Kivinen proposed an algorithm called NORMA by directly differentiating the regularized functional to get the stochastic gradient [53].
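As a concrete illustration of the representer theorem, the batch regularized solution used later as a baseline (the regularization network) reduces to solving an N x N linear system for the expansion coefficients, a = (G + lambda I)^{-1} d, where G is the Gram matrix. The sketch below assumes a Gaussian kernel; the parameter values are placeholders, and this is not code from the dissertation.

    import numpy as np

    def gram_matrix(U, a=1.0):
        """G[i, j] = exp(-a ||u(i) - u(j)||^2) for the Gaussian kernel."""
        sq = np.sum(U**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * U @ U.T
        return np.exp(-a * d2)

    def regularization_network(U, d, lam=1.0, a=1.0):
        """Coefficients of f = sum_i a_i kappa(., u(i)) minimizing
        sum_i (d(i) - f(u(i)))^2 + lam * ||f||^2."""
        G = gram_matrix(U, a)
        return np.linalg.solve(G + lam * np.eye(len(d)), d)

    def rn_predict(coeff, U_train, u, a=1.0):
        k = np.exp(-a * np.sum((U_train - u)**2, axis=1))
        return float(coeff @ k)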

Table 1-1. Comparison of different nonlinear adaptive filters

Algorithm                        Modeling capacity       Convexity   Complexity
Linear adaptive filters          linear only             Yes         very simple
Hammerstein/Wiener models        limited nonlinearity    No          simple
Volterra/Wiener series           universal               Yes         very high
Time-lagged neural networks      universal               No          modest
Recurrent neural networks        universal               No          high
Kernel adaptive filters          universal               Yes         modest
Recursive Bayesian estimation    universal               No          very high

While the derivation involves advanced mathematics, the results are actually equivalent to a kernel version of the leaky least mean square. At almost the same time, Engel studied the case of kernel recursive least squares by utilizing the matrix inversion lemma [28]. However, neither realized the full scope of their contributions in terms of kernel adaptive filters as we present in Figure 1-3. This research serves to fill the blanks left by the previous works and investigate them in a unifying framework. We propose several new algorithms, namely the kernel least mean square, the kernel affine projection algorithms, and the extended kernel recursive least squares. The relation among these algorithms as illustrated in Figure 1-3 will become clear when we present them in the subsequent chapters.

Kernel adaptive filters provide a new perspective for linear adaptive filters, since linear adaptive filters become a special case, alternatively expressed in the dual space. Kernel adaptive filters clearly show that there is a growing memory structure embedded in the filter weights. They naturally create a growing radial basis function network, learning the network topology and adapting the free parameters directly from the data at the same time. The learning rule is a beautiful combination of error-correction and memory-based learning, and potentially it will have a deep impact on our understanding of the essence of learning theory. The learning strategy is similar to the resource-allocating network (RAN) proposed by Platt [69], but the depth and the scope we cover in this work are far beyond that.

Figure 1-3. Relation between different adaptive filtering algorithms

Historically, most kernel methods use block adaptation and are computationally very expensive, using a large Gram matrix of dimensionality given by the number of data points; therefore the computationally efficient online algorithms provide the useful flexibility of trading off performance with complexity. And in nonstationary environments, the tracking ability of online algorithms provides extra advantages, if it is not outright essential.

The combination of sequential learning and memory-based learning requires, and at the same time enables, the network to select informative training examples instead of treating all examples equally. Empirical evidence shows that selecting informative examples can drastically reduce the training time and produce much more compact networks with equivalent accuracy [8, 22, 28, 38, 69]. Therefore, in the case of large and redundant data sets, performing kernel online learning algorithms can have a big edge over batch mode methods in terms of efficiency.

The widely used active data selection methods for kernel adaptive filters include the novelty criterion [69] and approximate linear dependency [28]. Both are based on heuristic distance functions, while we will present a principled and unifying approach.

Our criterion is an objective function based on the negative log likelihood of the candidate data, which is termed conditional information here, indicating that it quantifies how informative the data candidate is conditioned on the knowledge of the learning system. It turns out that approximate linear dependency is a special case, and the novelty criterion is an approximation, in this information theoretic framework.

The resource-allocating network is probably the earliest attempt in this research line. Though the starting point is vastly different, the learning procedure is conceptually very similar to the algorithms we will discuss in this work, and many of our ideas are influenced by this pioneering algorithm; we think an early treatment helps the subsequent discussions.

The RAN is essentially a radial basis function network. It stores the centers, the widths of the centers and the linear coefficients in the format {c_j, w_j, a_j} for the j-th unit. The output for an input pattern u is calculated by

    x_j = exp(-||u - c_j||^2 / w_j^2)
    y = sum_j a_j x_j + gamma

where gamma is a bias term.

The learning strategy is as follows. The network starts with a blank slate. When {u, d} is identified as a pattern that is not currently well represented by the network, the network allocates a new unit that memorizes the pattern. Let the index of this new unit be n. The center of the unit is set to the novel input,

    c_n = u.

The linear coefficient on the second layer is set to the difference between the output of the network and the novel output,

    a_n = d - y.

The width of the new unit is proportional to the distance from the nearest stored center to the novel input,

    w_n = k ||u - c_nearest||,

where k is an overlap factor. As k grows larger, the responses of the units overlap more and more.

The RAN uses a two-part novelty condition. An input-output pair {u, d} is considered novel if the input is far away from the existing centers,

    ||u - c_nearest|| > delta(t),

and if the difference between the desired output and the output of the network is large,

    ||d - y(u)|| > epsilon_2.

Errors larger than epsilon_2 are immediately corrected by the allocation of a new unit, while errors smaller than epsilon_2 are gradually repaired using gradient descent. The distance delta(t) is the scale of resolution that the network is fitting at the t-th input presentation. The learning starts with delta(t) = delta_max, which is the largest length scale of interest, typically the size of the entire input space of non-zero probability density. The distance delta(t) shrinks until it reaches delta_min, which is the smallest length scale of interest. The following function is used to determine delta(t):

    delta(t) = max(delta_max exp(-t/tau), delta_min),

where tau is a decay constant.

When a new unit is not allocated, the Widrow-Hoff LMS algorithm [105] is used to decrease the error:

    delta a_j   = eta (d - y) x_j
    delta gamma = eta (d - y)
    delta c_j   = (2 eta / w_j) (u - c_j) x_j [(d - y) a_j]

It is shown that the RAN is able to learn quickly and accurately and to form a compact representation [69]. However, we have to point out that the RAN is built upon intuition and heuristics; it is not a convex optimization problem; and its convergence and wellposedness are hard to prove and not guaranteed.

1.5 Notation

The notation used throughout the work is summarized in Table 1-2. There is no rule without an exception: f_i is used to denote the estimate of an input-output mapping f at the i-th iteration, since parentheses are reserved for inputs, as in f_i(u).

Table 1-2. Notation

Description                          Convention                   Examples
Scalars                              small italic letters         d
Vectors                              small bold letters           w, omega, a
Matrices                             capital bold letters         U, Phi
Time or iteration indices            in parentheses               u(i), d(i)
Components of vectors/matrices       subscript indices            a_j(i), G_{i,j}
Linear spaces                        capital mathbb letters       F, H
Scalar constants                     capital italic letters       L, N
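A compact sketch of the allocation logic described above may help fix ideas. The thresholds, overlap factor and gradient step below are illustrative placeholders, and the shrinking delta(t) schedule and the center update are omitted for brevity, so this is only an approximation of the full RAN rather than Platt's published algorithm.

    import numpy as np

    class ResourceAllocatingNetwork:
        """Minimal RAN sketch: allocate a unit for novel patterns, else adapt."""
        def __init__(self, delta=0.5, eps2=0.05, overlap=0.87, eta=0.05):
            self.centers, self.widths, self.coeffs = [], [], []
            self.bias = 0.0
            self.delta, self.eps2, self.overlap, self.eta = delta, eps2, overlap, eta

        def output(self, u):
            if not self.centers:
                return self.bias, np.array([])
            x = np.array([np.exp(-np.sum((u - c)**2) / w**2)
                          for c, w in zip(self.centers, self.widths)])
            return float(np.dot(self.coeffs, x)) + self.bias, x

        def update(self, u, d):
            y, x = self.output(u)
            e = d - y
            dist = (min(np.linalg.norm(u - c) for c in self.centers)
                    if self.centers else np.inf)
            if dist > self.delta and abs(e) > self.eps2:
                # two-part novelty condition satisfied: allocate a new unit
                self.centers.append(np.array(u, dtype=float))
                self.widths.append(self.overlap * dist if np.isfinite(dist) else 1.0)
                self.coeffs.append(e)
            else:
                # otherwise repair gradually with Widrow-Hoff style updates
                for j in range(len(self.coeffs)):
                    self.coeffs[j] += self.eta * e * x[j]
                self.bias += self.eta * e
            return e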

CHAPTER 2
KERNEL LEAST MEAN SQUARE ALGORITHM

The combination of the famed kernel trick and the least-mean-square (LMS) algorithm provides an interesting sample-by-sample update for an adaptive filter in RKHS, which is named here the kernel least mean square (KLMS).

The KLMS naturally creates a growing radial basis function network (RBF), learning the network topology and adapting the free parameters directly from the training data, combining beautifully the error-correction and the memory-based learning strategies.

2.1 Formulation of Kernel Least Mean Square Algorithm

2.1.1 Learning Problem Setting

Suppose the goal is to approximate a continuous input-output mapping f: U -> R based on a sequence of examples {u(1), d(1)}, {u(2), d(2)}, ..., {u(N), d(N)}. U is assumed to be a compact subspace of R^L. Here we assume that the training data are finite, and we will address the problem of sequential learning with infinite training data later.

2.1.2 Least Mean Square Algorithm

The LMS algorithm uses the following procedure [105]

    w(0) = 0
    e(i) = d(i) - w(i-1)^T u(i)
    w(i) = w(i-1) + eta e(i) u(i)                                           (2-1)

to find the optimal weight w_o, which minimizes the empirical risk

    min_w  J_emp(w) = sum_{i=1}^N (d(i) - w^T u(i))^2

In LMS, e(i) is called the prediction error, eta is the step size, and w(i) is the estimate of the optimal weight at time i.

The repeated application of the weight-update equation yields

    w(n) = eta sum_{i=1}^n e(i) u(i)

that is, after n steps of training, the weight estimate is expressed as a linear combination of the previous and present input data weighted by the prediction errors. More importantly, the input-output operation of this learning system can be expressed solely in terms of inner products

    f_n(u') = w(n)^T u' = eta sum_{i=1}^n e(i) u(i)^T u'                    (2-2)
    e(n+1) = d(n+1) - eta sum_{i=1}^n e(i) u(i)^T u(n+1)                    (2-3)

Therefore, by the kernel trick, the LMS algorithm can be readily extended to RKHS [72].

2.1.3 Kernel Least Mean Square Algorithm

We utilize the kernel induced mapping (1-9) to transform the data u(i) into the feature space F as phi(u(i)) and interpret (1-10) as the usual dot product. Denote phi(i) = phi(u(i)) for simplicity. Then the KLMS is nothing but the LMS performed on the example sequence {phi(i), d(i)}, which is summarized as follows:

    omega(0) = 0
    e(i) = d(i) - omega(i-1)^T phi(i)
    omega(i) = omega(i-1) + eta e(i) phi(i)

where omega(i) denotes the estimate (at time i) of the weight vector in the RKHS. Since the algorithm can be formulated in terms of inner products, i.e.

    f_n(u') = omega(n)^T phi(u') = eta sum_{i=1}^n e(i) phi(u(i))^T phi(u') (2-4)
    e(n+1) = d(n+1) - eta sum_{i=1}^n e(i) phi(u(i))^T phi(u(n+1))          (2-5)

we can efficiently compute the network output and errors using the kernel trick (1-10)

    f_n(u') = eta sum_{i=1}^n e(i) kappa(u(i), u')                          (2-6)
    e(n+1) = d(n+1) - eta sum_{i=1}^n e(i) kappa(u(i), u(n+1))              (2-7)

If we denote f_i as the estimate of the input-output mapping at time i, we have the following sequential learning rule for the KLMS:

    f_i = f_{i-1} + eta e(i) kappa(u(i), .)                                 (2-8)

By (2-8), it is seen that the KLMS allocates a new kernel unit when a new training datum comes in, with u(i) as the center and eta e(i) as the coefficient. The coefficients and the centers should be stored in the computer during training. Denote a(i) as the coefficient vector at time i and C(i) as the corresponding set of centers. The updates needed for the KLMS at time i are

    a_i(i) = eta e(i)                                                       (2-9)
    a_j(i) = a_j(i-1),  j = 1, ..., i-1                                     (2-10)
    C(i) = {C(i-1), u(i)}                                                   (2-11)

The algorithm is summarized in Algorithm 1 and illustrated in Figure 2-1.

Algorithm 1 Kernel Least Mean Square (KLMS-1)

    Initialization:
        learning step eta
        a_1(1) = eta d(1)
    while {u(i), d(i)} available do
        % evaluate output of the current network
        y(i) = sum_{j=1}^{i-1} a_j(i-1) kappa(u(i), u(j))
        % compute error
        e(i) = d(i) - y(i)
        % allocate a new unit
        C(i) = {C(i-1), u(i)}
        % calculate the coefficient for the new unit
        a_i(i) = eta e(i)
        % keep the existing centers and coefficients
        a_j(i) = a_j(i-1), j = 1, ..., i-1
    end while
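Algorithm 1 translates almost line for line into code. The NumPy sketch below mirrors the pseudocode with a Gaussian kernel; the step size and kernel parameter are placeholder values, and this is an illustrative rendering rather than the author's released implementation.

    import numpy as np

    def klms(U, d, eta=0.2, a=1.0):
        """Kernel LMS (Algorithm 1) with a Gaussian kernel.
        Returns the centers, their coefficients, and the prediction errors."""
        centers = [U[0]]
        coeffs = [eta * d[0]]          # a_1(1) = eta * d(1)
        errors = [d[0]]
        for i in range(1, len(d)):
            k = np.exp(-a * np.sum((np.array(centers) - U[i])**2, axis=1))
            y = float(np.dot(coeffs, k))   # output of the current network
            e = d[i] - y                   # prediction error
            centers.append(U[i])           # allocate a new unit centered at u(i)
            coeffs.append(eta * e)         # with coefficient eta * e(i)
            errors.append(e)
        return np.array(centers), np.array(coeffs), np.array(errors)

    def klms_predict(centers, coeffs, u, a=1.0):
        k = np.exp(-a * np.sum((centers - u)**2, axis=1))
        return float(np.dot(coeffs, k))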

Figure 2-1. Network topology of KLMS

2.2 Implementation of Kernel Least Mean Square Algorithm

Since the KLMS is the LMS in the chosen RKHS, many results from the adaptive filtering literature can be utilized. We will briefly discuss misadjustment, step size, the growing structure and penalizing the solution norm in the following.

2.2.1 Misadjustment

It is clear that the misadjustment for KLMS is [47]

    M_t = (J(inf) - J_min) / J_min = (eta/2) tr[R_phi] = (eta/2N) tr[G_phi] (2-12)

where R_phi is the autocorrelation matrix of the transformed data and G_phi is the Gram matrix. Denoting the data matrix Phi = [phi(1), phi(2), ..., phi(N)], one has

    R_phi = (1/N) Phi Phi^T = (1/N) sum_{i=1}^N phi(i) phi(i)^T             (2-13)
    G_phi = Phi^T Phi                                                       (2-14)

And for shift-invariant kernels, i.e.,

    <phi(i), phi(i)>_F = ||phi(i)||^2_F = g_0,

the misadjustment of the KLMS simply equals eta g_0 / 2, which is data-independent but proportional to the step size.

2.2.2 Step Size

The step size is required to satisfy the following condition for the algorithm to stay stable:

    eta < 1 / varsigma_1

where varsigma_1 is the largest eigenvalue of the autocorrelation matrix. Since varsigma_1 is bounded by tr[R_phi] = tr[G_phi]/N, a more practical (if conservative) upper bound is

    eta < N / tr[G_phi]

which for shift-invariant kernels reduces to eta < 1/g_0.

2.2.3 Infinite Training Data and Sparsification

The KLMS allocates a new unit for every training example, so the network size grows linearly with the number of training data. With finite training data this growth is manageable, but in a true sequential setting with infinite training data the memory and computation requirements keep increasing, and sparsification becomes necessary. The commonly used criteria are the novelty criterion (NC) and approximate linear dependency (ALD), which accept a new center only when it is sufficiently informative. In Chapter 5, we will present an active data selection approach based on conditional information to address the growing structure
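The two bounds above are easy to check numerically for a given training set. The sketch below assumes a Gaussian kernel and compares the exact bound 1/varsigma_1 with the practical bound N/tr[G_phi] (which equals 1/g_0 = 1 for the Gaussian kernel); the data are synthetic placeholders.

    import numpy as np

    def step_size_bounds(U, a=1.0):
        """Exact bound 1/largest-eigenvalue of R_phi vs. practical bound N/tr[G]."""
        sq = np.sum(U**2, axis=1)
        G = np.exp(-a * (sq[:, None] + sq[None, :] - 2.0 * U @ U.T))
        N = len(U)
        # the nonzero eigenvalues of R_phi = Phi Phi^T / N coincide with those of G / N
        varsigma_1 = np.linalg.eigvalsh(G / N).max()
        return 1.0 / varsigma_1, N / np.trace(G)

    rng = np.random.default_rng(0)
    exact, practical = step_size_bounds(rng.standard_normal((200, 10)))
    print(exact, practical)   # the practical bound is the more conservative one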
PAGE 34

andunifyNCandALDinarigorousinformationtheoreticframework.NCandALDarediscussedinChapter 3 andChapter 4 respectively. 2.2.4PenalizingSolutionNorm Kivinen[ 53 ]derivedasimilaralgorithm,calledNORMAbutfromavastlydierentviewpoint.TheydierentiatedthefollowingregularizedfunctionaldirectlytogetthestochasticgradientinthefunctionalspaceminfJ(f)=Ejdf(u)j2+jjfjj2 withastheregularizationparameter. Whilethederivationinvolvesadvancedmathematics,theresultsareactuallyequivalenttothefollowingupdaterule fi=(1)fi1+e(i)(u(i);)(2{15) Comparing( 2{15 )withKLMS( 2{8 ),ithasascalingfactor(1)onthepreviousestimate,whichisusuallylessthan1,anditimposesaforgettingmechanismsothatthetrainingdatainthefarpastarescaleddownexponentially.Therefore,byneglectingtheunitswithverysmallcoecients,thenumberofactualactiveunitsarenite. TheregularizationintroducesabiasinthesolutionasiswellknowninleakyLMS[ 82 ].Itisreportedin[ 72 ]thatevenaverysmallregularizationparameterdegradestheoverallperformancecomparedwithKLMS.InChapter 6 ,arigorousmathematicalanalysisisconductedthatKLMSpossessesaself-regularizationmechanismwithnoexplicitsolutionnormpenalty. 34

PAGE 35

2.3Simulations 2.3.1Mackey-GlassTimeSeriesPrediction Therstexampleistheshort-termpredictionoftheMackey-Glass(MG)chaotictimeseries[ 39 65 ].Itisgeneratedfromthefollowingtimedelayordinarydierentialequation dx(t) dt=bx(t)+ax(t) 1+x(t)10(2{16) withb=0:1,a=0:2,and=30.Thetimeseriesisdiscretizedatasamplingperiodof6seconds.Inthisexample,thetimeembeddingisset10,i.e.u(i)=[x(i10);x(i9);:::;x(i1)]Tareusedastheinputtopredictthepresentonex(i)whichisthedesiredresponsehere.Asegmentof500samplesisusedasthetrainingdataandanother100asthetestdata.AllthedataiscorruptedbyGaussiannoisewithzeromeanand0.04standarddeviation. WecomparetheperformanceofalinearcombinertrainedwithLMS,theKLMSandaregularizationnetwork(RN),whichimplicitlyusesaradialbasisfunctionnetworktopologyspeciedbythekernelutilized[ 37 70 ].AGaussiankernel( 1{6 )witha=1ischosenforbothRNandKLMS.InRN,everyinputpointisusedasthecenterandthetrainingisdoneinbatchmode.OnehundredMonteCarlosimulationsarerunwithdierentrealizationsofnoise.Theresultsaresummarizedinthefollowingtables.Figure 2-2 isthelearningcurvesfortheLMSandKLMSwithalearningrateof0.2forbothalgorithms.Asexpected,theKLMSconvergestoasmallervalueofMSEduetoitsnonlinearnature.Surprisingly,therateofdecayofbothlearningcurvesisbasicallythesame. AsonecanobserveinTable 2-1 ,theperformanceoftheKLMSismuchbetterthanthelinearLMSascanbeexpected(Mackey-Glasstimeseriesisanonlinearsystem)andiscomparabletotheRNwithsuitableregularization.ThisisindeedsurprisingsincetheRNcanbeviewedasabatchmodekernelregressionmethodversustheKLMSwhichisastraightstochasticgradientapproachimplementedintheRKHS.Alltheresultsinthe 35

PAGE 36

Figure2-2. LearningcurvesofLMSandKLMSinMackey-Glasstimeseriesprediction Table2-1. PerformancecomparisonofKLMSwithdierentstepsizesandRNwithdierentregularizationparametersinMackey-Glasstimeseriesprediction AlgorithmtrainingMSEtestingMSEsolutionnorm LinearLMS0:0210:0020:0260:007KLMS(=0:1)0:00740:00030:00690:00080:840:02KLMS(=0:2)0:00540:00040:00560:00081:140:02KLMS(=0:6)0:00620:00120:00580:00171:730:06RN(=0)000:0120:0043375639RN(=1)0:00380:00020:00390:00081:470:03RN(=10)0:0110:00010:0100:00030:550:01 tablesareintheformof\averagestandarddeviation".ItisinterestingtocomparethedesignandperformanceofKLMSwithdierentstepsizesandtheRNwithdierentregularizationssinceeachcontrolsthestabilityoftheobtainedsolution[ 59 ].Firstofall,whentheregularizationparameteriszero,theRNperformanceinthetestsetisprettybad(worsethanthelinearsolution),whichindicatesthattheproblemisill-posed(i.e.thattheinverseisillconditioned)andneedsregularization.TheRNiscapableofoutperformingtheKLMSinthetestsetwiththeproperregularizationparameter(=1),butthedierenceissmallandattheexpenseofamorecomplexsolution(seeTable 2-2 )aswellaswithacarefulselectionoftheregularizationparameter.Table 2-2 summarizesthecomputationalcomplexityofthethreealgorithms.TheKLMSeectivelyreducesthe 36

PAGE 37

Table2-2. Complexitycomparisonatiterationi AlgorithmComputationMemory LMSO(L)O(L)KLMSO(i)O(i)RNO(i3)O(i2) computationalcomplexityandmemorystoragewhencomparedwiththeRN.Andifwehavedirectaccesstothefeaturespace,thecomplexitycanbereducedevenfurthertothedimensionalityofthefeaturespaceifitisnotprohibitivelylarge.Table 2-1 supportsourtheorythatthenormoftheKLMSsolutioniswell-bounded[ 59 ],whichwillbeclearinChapter 6 .Aswesee,increasingthestepsizeintheKLMSincreasesthenormofthesolutionbutfailstoincreasetheperformancebecauseofthegradientnoiseintheestimation(misadjustment). Next,dierentnoisevariance2isusedinthedatatofurthervalidateKLMS'sapplicability.AsweseeinTables 2-3 and 2-4 ,theKLMSperformsconsistentlyonboththetrainingandtestwithdierentnoiselevelanddegradesgracefullywithincreasingnoise,whichisanillustrationofitswellposedness.Itisobservedthatatseverenoiselevel(=:5),allmethodsfallapartsincethenoisecomponentwillnolongercorrespondtothesmallestsingularvalueasrequiredbyTikhonovregularization[ 96 ].Withsmallnoise,theregularizationnetworkoutperformstheKLMSsincemisadjustmentplaysamorecriticalrolehere.ThisisaquitegoodillustrationofthedicultytheKLMSmayfacetobalanceamongconvergence,misadjustmentandregularization.ButremembertheKLMSisamuchsimpler,onlinealgorithmandtheperformancegapcomparedwithRNisthepricetopay.Throughoutthissetofsimulation,thekernelusedintheKLMSandtheRNistheGaussiankernelwitha=1.Thelearningstepis0.1forboththelinearLMSandtheKLMS.TheregularizationparameteroftheRNissetatthebestvalue(=1). Anykernelmethod,includingtheKLMS,needstochooseasuitablekernel.Inthethirdsetofsimulations,theeectofdierentkernelsanddierentkernelparametersonKLMSisdemonstrated.InthecaseoftheGaussiankernel( 1{6 ),wechoose3kernel 37

PAGE 38

Table2-3. PerformancecomparisonwithdierentnoiselevelsinMackey-Glasstimeseriesprediction(trainingMSE) AlgorithmLinearLMSKLMS(=0:1)RN(=1) =:0050:0175e50:00502e50:00141e5=:020:0180:00020:00550:00010:00206e5=:040:0210:0020:00740:00030:00380:0002=:10:0330:0010:0190:0010:0100:0005=:50:3260:0150:2520:0100:0970:003 Table2-4. PerformancecomparisonwithdierentnoiselevelsinMackey-Glasstimeseriesprediction(testingMSE) AlgorithmLinearLMSKLMS(=0:1)RN(=1) =:0050:0180:00020:00410:00010:00126e5=:020:0180:00070:00460:00040:00160:0002=:040:0260:0070:00690:00080:00390:0008=:10:0310:0050:0180:0030:0170:003=:50:3630:0570:3320:0520:3310:052 parameters:10,2,and0.2.Thelearningrateis0.1forboththelinearLMSandtheKLMSandtheregularizationparameteroftheRNis1throughoutthesimulation.TheresultsaresummarizedinTable 2-5 .Asexpected,toosmallortoolargekernelsizeshurtperformanceforbothKLMSandRN.Inthisproblem,akernelsizearound1givesthebestperformanceonthetestset.Inthecaseofthepolynomialkernel( 1{7 ),theorderissetto2,5,and8.ThelearningrateischosenaccordinglyintheKLMSaslistedinTable 2-6 (recalltherelationbetweenthelearningrateandthetraceoftheGrammatrix).Itisobservedthattheperformancedeterioratessubstantiallywhenpistoolarge(>8)fortheKLMS.Thisisalsovalidatedbythemisadjustmentformula( 2{12 ). Table2-5. EectofkernelsizeofGaussiankernelinMackey-Glasstimeseriesprediction AlgorithmtrainingMSEtestingMSE LinearLMS0:0220:0010:0220:001KLMS(a=10)0:00850:00050:00780:0010KLMS(a=2)0:00610:00030:00560:0014KLMS(a=:2)0:0170:00070:0160:0010RN(a=10)0:00400:00020:00680:0009RN(a=2)0:00430:00020:00470:0006RN(a=:2)0:00980:00030:00920:0005 38

PAGE 39

Table2-6. EectoforderofpolynomialkernelinMackey-Glasstimeseriesprediction AlgorithmtrainingMSEtestingMSE KLMS(p=2;=0:1)0:0100:0010:0090:002KLMS(p=5;=0:01)0:00990:00060:00990:0007KLMS(p=8;=:0006)0:0270:0090:0250:009RN(p=2;=1)0:00640:00050:00660:0008RN(p=5;=1)0:00340:00030:00590:0007RN(p=8;=1)0:00140:00010:00780:0004 Thegeneralkernelselectionproblemhasreceivedalotofattentionintheliterature(see[ 63 ]andreferencestherein).Thegeneralideaistoformulatetheproblemasaconvexoptimizationthroughparameterizationofthekernelfunctionfamily.ItwillbeinterestingtoinvestigatethisproblemspecicallyfortheKLMS.OneoftheinterestingobservationsistheautomaticnormalizationoperatedbytheGaussiankernel(oranyshift-invariantkernel)intheKLMS.Thisalsoexplainstherobustnessw.r.t.stepsizeinthecaseoftheGaussiankernel,andthedicultiesencounteredinthepolynomialcase. Inthelastsimulationoftheexample,weexaminetheeectoftheregularizationparameterontheperformanceofNORMA(leakyKLMS).30regularizationparametersarechosenwithin[0,1].Foreachregularizationparameter,50MonteCarlosimulationsarerunwithdierentrealizationsofnoise(=0:01).ThenalaverageMSEonthetestingsetisplottedinFigure 2-3 alongwithitsstandarddeviation.Aswesee,theexplicitregularizationhasadetrimentaleectinthisexample.ItwillbeclearlaterthattheKLMShasaself-regularizationmechanismthusnoexplicitnormpenaltyisneeded. 2.3.2NonlinearChannelEqualization TheLMSalgorithmiswidelyusedinchannelequalization[ 82 ]andwetestKLMSonanonlinearchannelequalizationproblem.Theproblemsettingisasfollows:Abinarysignalfs(1);s(2);:::;s(N)gisfedintoanonlinearchannel.Atthereceiverendofthechannel,thesignalisfurthercorruptedbyadditivewhiteGaussiannoiseandisthenobservedasfr(1);r(2);:::;r(N)g.Theaimofchannelequalizationistoconstructan\inverse"lterthatreproducestheoriginalsignalwithaslowanerrorrateaspossible. 39


Figure 2-3. Effect of explicit regularization in NORMA in Mackey-Glass time series prediction

It is easy to formulate it as a regression problem, with examples {([r(i), r(i+1), ..., r(i+l)], s(i-D))}, where l is the time embedding length and D is the equalization time lag.

In this experiment, the nonlinear channel model is defined by z(t) = s(t) + 0.5 s(t-1), r(t) = z(t) - 0.9 z(t)^2 + v(t), where v(t) is white Gaussian noise with variance σ^2. The learning curve is plotted in Figure 2-4 and is similar to the previous example. The filters are trained with 1000 data points and fixed afterwards. Testing was performed on a 5000-sample random test sequence. We compare the performance of the LMS, the KLMS and the RN as a batch-mode baseline. The Gaussian kernel with a = 0.1 is used in the KLMS for best results, with l = 5 and D = 2. The results are presented in Table 2-7; each entry consists of the average and the standard deviation over 100 independent Monte Carlo tests. The results in Table 2-7 show that the RN outperforms the KLMS in terms of the bit error rate (BER), but not by much, which is surprising since one is a batch method and the other is online. Both outperform the conventional LMS substantially, as can be expected because the channel is nonlinear. The regularization parameter for the RN and the learning rate of the KLMS were set by cross validation [54].
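For readers who want to reproduce this experiment, the following Python sketch generates the input-target pairs for the nonlinear channel described above. It is only a minimal sketch: the binary source, the exact windowing indices and the random seed are illustrative assumptions, not the dissertation's own code.

    import numpy as np

    def equalization_data(N=1000, sigma=0.1, l=5, D=2, seed=0):
        """Pairs for the channel z(t) = s(t) + 0.5 s(t-1), r(t) = z(t) - 0.9 z(t)^2 + v(t)."""
        rng = np.random.default_rng(seed)
        s = rng.choice([-1.0, 1.0], size=N)                  # binary source symbols
        z = s.copy()
        z[1:] += 0.5 * s[:-1]                                # linear channel part
        r = z - 0.9 * z**2 + sigma * rng.standard_normal(N)  # static nonlinearity + noise
        X, d = [], []
        for i in range(D, N - l):                            # window [r(i), ..., r(i+l)]
            X.append(r[i:i + l + 1])
            d.append(s[i - D])                               # desired symbol s(i-D)
        return np.asarray(X), np.asarray(d)

With these pairs, any of the filters compared in Table 2-7 can be trained sample by sample and then frozen for testing.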


Figure2-4. LearningcurvesofLMS(=0:005)andKLMS(=0:1)innonlinearchannelequalization(=0:4) Table2-7. Performancecomparisoninnonlinearchannelequalization AlgorithmLinearLMS(=:005)KLMS(=0:1)RN(=1) BER(=:1)0:1620:0140:0200:0120:0080:001BER(=:4)0:1770:0120:0580:0080:0460:003BER(=:8)0:2180:0120:1300:0100:1180:004 2.4Discussions TheKLMSalgorithmisintrinsicallyastochasticgradientmethodologytosolvetheleastsquaresprobleminRKHS.SincetheupdateequationoftheKLMScanbewrittenasinnerproducts,KLMScanbeecientlycomputedintheinputspace.ThegoodapproximationabilityoftheKLMSstemsfromthefactthatthetransformeddataincludespossiblyinnitedierentfeaturesoftheoriginaldata.Intheframeworkofstochasticprojection,thespacespannedbyf'(i)gissolargethattheprojectionerrorofthedesiredsignald(i)couldbeverysmall[ 67 ],asiswellknownfromCover'stheorem[ 46 ].Thiscapabilityincludesmodelingofnonlinearsystems,whichisthemainreasonwhytheKLMScanachievegoodperformanceintheMackey-Glasssystempredictionandnonlinearchannelequalization. Demonstratedbytheexperiments,theKLMShasgeneralapplicabilityduetoitssimplicitysinceitisimpracticaltoworkwithbatchmodekernelmethodsinlargedata 41


sets.TheKLMSisveryusefulinproblemslikenonlinearchannelequalization,nonlinearsystemidentication,nonlinearactivenoisecontrol,whereonlinelearningisanecessity. Aspointedout,thestudyoftheKLMShasacloserelationwiththeresource-allocatingnetworks,buttheKLMSismuchsimplertechnicallyandmathematically.FewerparametersareneededcomparedwithRAN.AlmostalltheliteratureforLMScanbeusedtoanalyzeKLMSespeciallyitsconvergenceandwellposednessarewellunderstood.AlsointheframeworkofRKHS,anyMercerkernelcanbeusedforKLMSinsteadofrestrictingthearchitecturetotheGaussiankernelasinRAN. Inthischapter,twoimportantissuesarepurposelyomitted,namelysparsicationandwellposednessanalysiswhichwillbetreatedinsubsequentchaptersforbetterpresentation. 42


CHAPTER 3
KERNEL AFFINE PROJECTION ALGORITHMS

The combination of the famed kernel trick and the affine projection algorithms (APA) yields powerful nonlinear extensions, named collectively here KAPA. This chapter is a follow-up study of the kernel least-mean-square algorithm. KAPA inherits its simplicity and online nature while reducing the gradient noise, and hence boosts the performance compared with the KLMS. More interestingly, it provides a unifying model for several existing neural network techniques, including kernel least-mean-square algorithms, sliding-window kernel recursive least-squares, kernel recursive least-squares and regularization networks. Therefore, many insights can be gained into the basic relations among them and the trade-off between computation complexity and performance.

3.1 Formulation of Kernel Affine Projection Algorithms

We will start with a review of affine projection algorithms, focusing on their subtle variations due to different optimization techniques. Then the matrix inversion lemma is used to derive equivalent representations which are more suitable for kernel extensions. Finally, a treatment of kernel affine projection algorithms follows naturally.

3.1.1 Affine Projection Algorithms

Let d be a zero-mean scalar-valued random variable and let u be a zero-mean L×1 random variable with a positive-definite covariance matrix R_u = E[u u^T]. The cross-covariance vector of d and u is denoted by r_du = E[d u]. The weight vector w that solves

    min_w E|d - w^T u|^2    (3-1)

is given by w_o = R_u^{-1} r_du [82].

Several methods that approximate w iteratively also exist. For example, the common gradient method

    w(0) = initial guess,  w(i) = w(i-1) + η [r_du - R_u w(i-1)]    (3-2)


ortheregularizedNewton'srecursion, w(0)=initialguess;w(i)=w(i1)+(Ru+"I)1[rduRuw(i1)](3{3) where"isasmallpositiveregularizationfactorandisthestepsizespeciedbydesigners. Stochastic-gradientalgorithmsreplacethecovariancematrixandthecross-covariancevectorbylocalapproximationsdirectlyfromdataateachiteration.Thereareseveralwaysforobtainingsuchapproximations.Thetrade-oiscomputationcomplexity,convergenceperformance,andsteady-statebehavior[ 82 ]. Assumethatwehaveaccesstoobservationsoftherandomvariablesdanduovertimefd(1);d(2);:::gandfu(1);u(2);:::g Theleast-mean-square(LMS)algorithmsimplyusestheinstantaneousvaluesforapproximations^Ru=u(i)u(i)Tand^rdu=d(i)u(i).Thecorrespondingsteepest-descentrecursion( 3{2 )andNewton'srecursion( 3{3 )becomew(i)=w(i1)+u(i)[d(i)u(i)Tw(i1)] (3{4)w(i)=w(i1)+u(i)[u(i)Tu(i)+"I]1[d(i)u(i)Tw(i1)] (3{5) Theaneprojectionalgorithmhoweveremploysbetterapproximations.Specically,RuandrduarereplacedbytheapproximationsfromtheKmostrecentregressorsandobservations.DenotingU(i)=[u(iK+1);:::;u(i)]LKandd(i)=[d(iK+1);:::;d(i)]T onehas^Ru=1 KU(i)U(i)T^rdu=1 KU(i)d(i) (3{6) 44


Therefore( 3{2 )and( 3{3 )becomew(i)=w(i1)+U(i)[d(i)U(i)Tw(i1)] (3{7)w(i)=w(i1)+[U(i)U(i)T+"I]1U(i)[d(i)U(i)Tw(i1)] (3{8) and( 3{8 ),bythematrixinversionlemma,isequivalentto[ 82 ] w(i)=w(i1)+U(i)[U(i)TU(i)+"I]1[d(i)U(i)Tw(i1)](3{9) Itisnotedthatthisequivalenceletsusdealwiththematrix[U(i)TU(i)+"I]insteadof[U(i)U(i)T+"I]anditplaysaveryimportantroleinthederivationofkernelextensions.Wecallrecursion( 3{7 )APA-1andrecursion( 3{9 )APA-2. Insomecircumstances,aregularizedsolutionisneededinsteadof( 3{1 ).TheregularizedLSproblemis minwEjdwTuj2+jjwjj2(3{10) whereistheregularizationparameter(donotconfusewiththeregularizationfactor"inNewton'srecursion,whichisintroducedmainlytoensurenumericalstabilityandisnotdirectlyrelatedtothenormconstraintas).Thegradientmethodis w(i)=w(i1)+[rdu(I+Ru)w(i1)]=(1)w(i1)+[rduRuw(i1)](3{11) TheNewton'srecursionwith"=0is w(i)=w(i1)+(I+Ru)1[rdu(I+Ru)w(i1)]=(1)w(i1)+(I+Ru)1rdu(3{12) Iftheapproximations( 3{6 )areused,wehave w(i)=(1)w(i1)+U(i)[d(i)U(i)Tw(i1)](3{13) 45


and w(i)=(1)w(i1)+[I+U(i)U(i)T]1U(i)d(i)(3{14) whichis,bythematrixinversionlemma,equivalentto w(i)=(1)w(i1)+U(i)[I+U(i)TU(i)]1d(i)(3{15) Forsimplicity,recursions( 3{13 )and( 3{15 )arenamedhereAPA-3andAPA-4respectively. 3.1.2KernelAneProjectionAlgorithms WeutilizetheMercertheoremagainheretotransformthedatau(i)intothefeaturespaceFas'(u(i))(denotedas'(i)).Weformulatetheaneprojectionalgorithmsontheexamplesequencefd(1);d(2);:::gandf'(1);'(2);:::gtoestimatetheweightvector!thatsolves min!Ejd!T'(u)j2(3{16) Bystraightforwardmanipulation,( 3{7 )becomes !(i)=!(i1)+(i)[d(i)(i)T!(i1)](3{17) and( 3{9 )becomes !(i)=!(i1)+(i)[(i)T(i)+"I]1[d(i)(i)T!(i1)](3{18) where(i)=['(iK+1);:::;'(i)]. Accordingly,( 3{13 )becomes !(i)=(1)!(i1)+(i)[d(i)(i)T!(i1)](3{19) and( 3{15 )becomes !(i)=(1)!(i1)+(i)[(i)T(i)+I]1d(i)(3{20) 46


Forsimplicity,werefertotherecursions( 3{17 ),( 3{18 ),( 3{19 ),and( 3{20 )asKAPA-1,KAPA-2,KAPA-3,andKAPA-4respectively. 3.1.2.1SimpleKAPA(KAPA-1) Recursion( 3{17 )usesthestraightgradientdescentandisthesimplestamongall.ItishencealsonamedsimpleKAPAhere. Itmaybediculttohavedirectaccesstotheweightsandthetransformeddatainfeaturespace,so( 3{17 )needstobemodied.Ifwesettheinitialguess!(0)=0,theiterationof( 3{17 )willbe !(0)=0;!(1)=d(1)'(1)=a1(1)'(1);:::!(i1)=i1Xj=1aj(i1)'(j);(i)T!(i1)=[i1Xj=1aj(i1)iK+1;j;;:::;i1Xj=1aj(i1)i1;j;i1Xj=1aj(i1)i;j]T;e(i)=d(i)(i)T!(i1);!(i)=!(i1)+(i)e(i)=i1Xj=1aj(i1)'(j)+KXj=1ej(i)'(ij+K):(3{21) wherei;j=(u(i);u(j))forsimplicity. Notethatduringtheiteration,theweightvectorinthefeaturespaceassumesthefollowingexpansion !(i)=iXj=1aj(i)'(j)8i>0(3{22) i.e.theweightattimeiisalinearcombinationoftheprevioustransformedinput.Thisresultmayseemsimplyarestatementoftherepresentertheoremin[ 83 ].However,itshouldbeemphasizedthatthisresultdoesnotrelyonanyexplicitminimalnormconstraintasrequiredfortherepresentertheorem.Aspointedoutin[ 59 ],thegradient 47


search involved has an inherent regularization mechanism which guarantees the solution is in the data subspace under appropriate initialization. In general, the initialization ω(0) can introduce whatever a priori information is available, which can be any linear combination of any transformed data in order to utilize the kernel trick.

By (3-22), the updating on the weight vector reduces to the updating on the expansion coefficients

    a_k(i) = η (d(i) - Σ_{j=1}^{i-1} a_j(i-1) κ_{i,j}),               k = i
    a_k(i) = a_k(i-1) + η (d(k) - Σ_{j=1}^{i-1} a_j(i-1) κ_{k,j}),    i-K+1 ≤ k ≤ i-1
    a_k(i) = a_k(i-1),                                                1 ≤ k ≤ i-K    (3-23)

Algorithm 2  Kernel Affine Projection Algorithm (KAPA-1)

    Initialization:
        learning step η
        a_1(1) = η d(1)
    while {u(i), d(i)} available do
        % allocate a new unit
        a_i(i-1) = 0
        for k = max(1, i-K+1) to i do
            % evaluate outputs of the current network
            y(i, k) = Σ_{j=1}^{i-1} a_j(i-1) κ_{k,j}
            % compute errors
            e(i, k) = d(k) - y(i, k)
            % update the min(i, K) most recent units
            a_k(i) = a_k(i-1) + η e(i, k)
        end for
        if i > K then
            % keep the remaining
            for k = 1 to i-K do
                a_k(i) = a_k(i-1)
            end for
        end if
    end while

3.1.2.2 Normalized KAPA (KAPA-2)

Similarly, the regularized Newton's recursion (3-18) can be factorized into the following steps

    ω(i-1) = Σ_{j=1}^{i-1} a_j(i-1) φ(j)
    e(i) = d(i) - Φ(i)^T ω(i-1)
    G(i) = Φ(i)^T Φ(i)
    ω(i) = ω(i-1) + η Φ(i) [G(i) + εI]^{-1} e(i)    (3-29)

In practice, we do not have access to the transformed weight ω or any transformed data, so the update has to be on the expansion coefficients a like in KAPA-1. The whole recursion is similar to the KAPA-1 except that the error is normalized by a K×K matrix [G(i) + εI]^{-1}.
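To make the coefficient bookkeeping of Algorithm 2 concrete, here is a small Python sketch of KAPA-1 with a Gaussian kernel. It is a plain transcription for illustration, without the error reusing and sparsification discussed later in the chapter; the kernel width, step size and helper names are assumptions.

    import numpy as np

    def gauss(u, v, a=1.0):
        """Gaussian kernel kappa(u, v) = exp(-a ||u - v||^2)."""
        return np.exp(-a * np.sum((u - v) ** 2))

    def kapa1(U, d, eta=0.05, K=10, a=1.0):
        """KAPA-1 (Algorithm 2): returns the centers and expansion coefficients."""
        centers, coeff = [np.asarray(U[0])], [eta * d[0]]    # a_1(1) = eta * d(1)
        for i in range(1, len(U)):
            centers.append(np.asarray(U[i]))
            coeff.append(0.0)                                # allocate a new unit
            old = list(coeff)                                # a(i-1), frozen for the errors
            for k in range(max(0, i - K + 1), i + 1):        # the min(i, K) most recent units
                y = sum(old[j] * gauss(centers[k], centers[j], a) for j in range(i))
                coeff[k] = old[k] + eta * (d[k] - y)         # a_k(i) = a_k(i-1) + eta e(i, k)
        return centers, coeff

    def predict(u, centers, coeff, a=1.0):
        return sum(c * gauss(u, ctr, a) for c, ctr in zip(coeff, centers))

Without the error-reusing trick of Section 3.3.1, each iteration here costs roughly K times as many kernel evaluations as the KLMS, which is exactly the gap that Section 3.3 closes.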


3.1.2.3LeakyKAPA(KAPA-3) Thefeaturespacemaybeinnitedimensionaldependingonthechosenkernel,whichmaycausethecostfunction( 3{16 )tobeill-posedintheconventionalempiricalriskminimization(ERM)sense[ 37 ].Thecommonpracticeistoconstrainthesolutionnorm: min!Ejd!T'(u)j2+jj!jj2(3{30) Aswehavealreadyshownin( 3{19 ),theleakyKAPAis !(i)=(1)!(i1)+(i)[d(i)(i)T!(i1)](3{31) Again,theiterationwillbeontheexpansioncoecienta,whichissimilartotheKAPA-1. ak(i)=8>>>>>><>>>>>>:(d(i)i1Pj=1aj(i1)i;j);k=i(1)ak(i1)+(d(k)i1Pj=1aj(i1)k;j);iK+1ki1(1)ak(i1);1k

3.1.2.4LeakyKAPAwithNewton'srecursion(KAPA-4) Asbefore,theKAPA-4( 3{20 )reducesto ak(i)=8>>>><>>>>:~d(i);k=i(1)ak(i1)+~d(k);iK+1ki1(1)ak(i1);1k

3.2.2 Kivinen's NORMA

Similarly, the KAPA-3 (3-19) reduces to the NORMA algorithm introduced by Kivinen in [53]:

    ω(i) = (1 - λη) ω(i-1) + η φ(i) [d(i) - φ(i)^T ω(i-1)]    (3-35)

As we discussed in Chapter 2, penalizing explicitly the solution norm introduces a bias and significantly degenerates the overall performance, so in general we do not recommend the use of KAPA-3.

3.2.3 Kernel ADALINE

Assume that the size of the training data is finite, N. If we set K = N, then the update rule of the KAPA-1 becomes

    ω(i) = ω(i-1) + η Φ [d - Φ^T ω(i-1)]

where the full data matrices are

    Φ = [φ(1), ..., φ(N)],  d = [d(1), ..., d(N)]

It is easy to check that the weight vector also assumes the following expansion

    ω(i) = Σ_{j=1}^{N} a_j(i) φ(j)

And the updating on the expansion coefficients is

    a_j(i) = a_j(i-1) + η [d(j) - φ(j)^T ω(i-1)]

This is nothing but the kernel ADALINE (KA) introduced in [33]. Notice that the kernel ADALINE is not an online method.


Table 3-2. List of related algorithms

    Algorithm   Update equation                                               Relation to KAPA
    KLMS        ω(i) = ω(i-1) + η φ(i)[d(i) - φ(i)^T ω(i-1)]                  KAPA-1, K = 1
    NKLMS       ω(i) = ω(i-1) + η φ(i)/(ε + κ_{i,i}) [d(i) - φ(i)^T ω(i-1)]   KAPA-2, K = 1
    NORMA       ω(i) = (1-λη) ω(i-1) + η φ(i)[d(i) - φ(i)^T ω(i-1)]           KAPA-3, K = 1
    KA          ω(i) = ω(i-1) + η Φ[d - Φ^T ω(i-1)]                           KAPA-1, K = N
    RA-RBF      ω(i) = η Φ[d - Φ^T ω(i-1)]                                    KAPA-3, λη = 1, K = N
    SW-KRLS     ω(i) = Φ(i)[Φ(i)^T Φ(i) + λI]^{-1} d(i)                       KAPA-4, η = 1
    RegNet      ω(i) = Φ[Φ^T Φ + λI]^{-1} d                                   KAPA-4, η = 1, K = N

3.2.4 Recursively Adapted Radial Basis Function Networks

Assume the size of the training data is N as above. If we set λη = 1 and K = N, the update rule of KAPA-3 becomes

    ω(i) = η Φ [d - Φ^T ω(i-1)]

which is the recursively-adapted RBF (RA-RBF) network introduced in [58]. This is a very intriguing algorithm using the global error directly to compose the new network. By contrast, the KLMS-1 uses the a priori errors to compose the network.

3.2.5 Sliding Window Kernel Recursive Least Squares

In KAPA-4, if we set η = 1, we have

    ω(i) = Φ(i) [Φ(i)^T Φ(i) + λI]^{-1} d(i)    (3-36)

which is the sliding-window kernel RLS (SW-KRLS) introduced in [100].

3.2.6 Regularization Networks

We assume there are only N training data and K = N. Equation (3-20) becomes directly

    ω(i) = Φ [Φ^T Φ + λI]^{-1} d    (3-37)

which is the regularization network (RegNet) [37].

We summarize all the related algorithms in Table 3-2 for convenience.
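Since the KLMS row of Table 3-2 is the K = 1 special case that the rest of the chapter generalizes, a compact Python sketch of that update in its input-space (expansion-coefficient) form may be useful; the Gaussian kernel and the parameter values are illustrative assumptions.

    import numpy as np

    def klms(U, d, eta=0.1, a=1.0):
        """KLMS (Table 3-2, KAPA-1 with K = 1): every sample adds one center
        whose coefficient is the step size times the a priori error."""
        centers, coeff = [], []
        for u, target in zip(U, d):
            u = np.asarray(u)
            y = sum(c * np.exp(-a * np.sum((u - ctr) ** 2))   # current network output
                    for c, ctr in zip(coeff, centers))
            e = target - y                                    # a priori error
            centers.append(u)                                 # allocate a new unit at u
            coeff.append(eta * e)                             # coefficient eta * e(i)
        return centers, coeff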


3.3 Implementation of Kernel Affine Projection Algorithms

In this section, we will discuss the implementation of the KAPA algorithms in detail, including the error-reusing technique, the sliding-window Gram matrix inversion to speed up the calculation, and the novelty criterion for sparsification.

3.3.1 Error Reusing

As we see in KAPA-1, KAPA-2 and KAPA-3, the most time-consuming part of the computation is to obtain the error information. For example, suppose ω(i-1) = Σ_{j=1}^{i-1} a_j(i-1) φ(j). We need to calculate e(i, k) = d(k) - ω(i-1)^T φ(k) for i-K+1 ≤ k ≤ i to compute ω(i), which consists of (i-1)K kernel evaluations. As i increases, this dominates the computation time. In this sense, the computation complexity of the KAPA is K times that of the KLMS. However, after a careful manipulation, we can shrink the complexity gap between KAPA and the KLMS.

Assume that we store all the K errors e(i-1, k) = d(k) - ω(i-2)^T φ(k) for i-K ≤ k ≤ i-1 from the previous iteration. At the present iteration, we have

    e(i, k) = d(k) - φ(k)^T ω(i-1)
            = d(k) - φ(k)^T [ω(i-2) + η Σ_{j=i-K}^{i-1} e(i-1, j) φ(j)]
            = [d(k) - φ(k)^T ω(i-2)] - η Σ_{j=i-K}^{i-1} e(i-1, j) κ_{j,k}
            = e(i-1, k) - η Σ_{j=i-K}^{i-1} e(i-1, j) κ_{j,k}    (3-38)

Since e(i-1, i) has not been computed yet, we have to calculate e(i, i) by i-1 kernel evaluations anyway. Overall the computation complexity of the KAPA-1 is O(i + K^2), which is only O(K^2) more than the KLMS.

3.3.2 Sliding Window Gram Matrix Inversion

In KAPA-2 and KAPA-4, another computation difficulty is to invert a K×K matrix, which normally requires O(K^3). However, in the KAPA, the data matrix Φ(i) has a


sliding window structure, therefore a trick can be used to speed up the computation. The trick is based on the matrix inversion formula and was introduced in [100]. We outline the basic calculation steps here. Suppose the sliding matrices share the same sub-matrix D:

    G(i-1) + λI = [ a     b^T
                    b     D   ],     G(i) + λI = [ D     h
                                                   h^T   g ]    (3-39)

and we know from the previous iteration

    (G(i-1) + λI)^{-1} = [ e    f^T
                           f    H   ]    (3-40)

First we need to calculate the inverse of D as

    D^{-1} = H - f f^T / e    (3-41)

Then we can update the inverse of the new Gram matrix as

    (G(i) + λI)^{-1} = [ D^{-1} + (D^{-1}h)(D^{-1}h)^T s^{-1}    -(D^{-1}h) s^{-1}
                         -(D^{-1}h)^T s^{-1}                      s^{-1}          ]    (3-42)

with s = g - h^T D^{-1} h. The overall complexity is O(K^2).

3.3.3 Sparsification and Novelty Criterion

As we see in the formulation of KAPA as well as KLMS, the network size increases linearly with the number of training data, which may pose a big problem for these algorithms to be applied in online applications. To curb the growing structure, many methods have been proposed, including the novelty criterion [69] and approximate linear dependency (ALD) [28]. There are many other ways to achieve sparseness that require the creation of a basis dictionary and storage of the corresponding coefficients. Suppose the present dictionary is C(i) = {c_j}_{j=1}^{m_i}, where c_j is the jth center and m_i is the cardinality. When a new data pair {u(i+1), d(i+1)} is presented, a decision is made immediately whether u(i+1) should be added into the dictionary as a center.
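Returning to the sliding-window Gram matrix inversion of Section 3.3.2, the block bookkeeping in (3-39)-(3-42) is easy to get wrong, so the following NumPy sketch performs one window update and checks it against a direct inverse; the regularization value and the random test data are only for illustration.

    import numpy as np

    def slide_inverse(Ginv_prev, h, g):
        """One update of (G(i-1)+lam*I)^{-1} -> (G(i)+lam*I)^{-1} per (3-39)-(3-42).
        Ginv_prev = [[e, f^T], [f, H]]; h, g are the new column/corner of G(i)+lam*I."""
        e = Ginv_prev[0, 0]
        f = Ginv_prev[1:, 0:1]
        H = Ginv_prev[1:, 1:]
        Dinv = H - (f @ f.T) / e              # (3-41): inverse of the shared block D
        Dh = Dinv @ h.reshape(-1, 1)
        s = float(g - h @ Dh)                 # Schur complement s = g - h^T D^{-1} h
        top = np.hstack([Dinv + (Dh @ Dh.T) / s, -Dh / s])
        bot = np.hstack([-Dh.T / s, np.array([[1.0 / s]])])
        return np.vstack([top, bot])          # (3-42)

    # quick self-check on random data (illustrative only)
    rng = np.random.default_rng(0)
    K, lam = 5, 0.1
    X = rng.standard_normal((K + 1, 3))
    G = X @ X.T + lam * np.eye(K + 1)              # regularized Gram of K+1 samples
    Ginv_prev = np.linalg.inv(G[:K, :K])           # window = samples 0..K-1
    new = slide_inverse(Ginv_prev, G[1:K, K], G[K, K])
    assert np.allclose(new, np.linalg.inv(G[1:, 1:]))  # window slid to samples 1..K

Each such update costs O(K^2), matching the complexity claim above.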


ThenoveltycriterionintroducedbyPlattisrelativelysimpleandhasalreadybeendiscussedinChapter 1 .Herewetailorthesameidea(asimpliedversion)fortheKLMSandKAPA:Firstitcalculatesthedistanceofu(i+1)tothepresentdictionarydis1=mincj2C(i)jju(i+1)cjjj.Ifitissmallerthansomepresetthreshold,say1,u(i+1)willnotbeaddedintothedictionary.Otherwise,itcomputesthepredictionerrore(i+1).Onlyifthepredictionerrorislargerthananotherpresetthreshold,say2,u(i+1)willbeacceptedasanewcenter. TheALDtestintroducedin[ 28 ]ismorecomputationallyinvolvedandwillbediscussedinChapter 4 .Inthischapter,wemainlyfocusontheeectivenessofNC. Theimportantconsequencesofthesparsicationprocedureareasfollows: 1)IftheinputdomainUisacompactset,thecardinalityofthedictionaryisalwaysniteandupperbounded.Thisstatementisnothardtoproveusingthenitecoveringtheoremofthecompactsetandthefactthatelementsinthedictionaryare-separable[ 28 ].Hereisthebriefidea:supposesphereswithdiameterareusedtocoverUandtheoptimalcoveringnumberisNc.Thenbecauseanytwocentersinthedictionarycannotbeinthesamesphere,thetotalnumberofthecenterswillbenogreaterthanNcregardlessofthedistributionandtemporalstructureofu.Ofcoursethisisaworstcaseupperbound.Inthecaseofnitetrainingdata,thenetworksizewillbeniteanyway.Thisistrueinapplicationslikechannelequalization,wherethetrainingsequenceispartofeachtransmissionframe.Inastationaryenvironment,thenetworkconvergesquicklyandthethresholdonpredictionerrorsplaysitsparttoconstrainthenetworksize.Wewillvalidatethisclaiminthesimulationsection.Inanon-stationaryenvironment,moresophisticatedpruningmethodsshouldbeusedtoconstrainthenetworksize.Simplestrategiesincludepruningtheoldestunitinthedictionary[ 100 ],pruningrandomly[ 17 ],pruningtheunitwiththeleastcoecientorsimilar[ 23 91 ].Anotheralternativeapproachistosolvetheproblemintheprimalspace[ 66 93 ]directlybyusingthelowrankapproximationmethodssuchasNystrommethod[ 108 ],incompleteCholeskyfactorization[ 31 ]andkernelprincipal 56


componentanalysis[ 84 ].Itshouldbepointedoutthatthescalabilityissueisatthecoreofthekernelmethodsandsoallthekernelmethodsneedtodealwithitinonewayortheother.Indeed,thesequentialnatureoftheKAPAenablesactivelearning[ 8 34 ]onhugedatasetswhichisimpossibleinbatchmodealgorithmslikeregularizationnetworks.ThediscussiononactivelearningwillcontinueinChapter 5 2)Basedon1),wecanprovethatthesolutionnormsofKLMS-1andKAPA-1areupperbounded[ 59 ]. Thesignicanceof1)isofpracticalinterestbecauseitstatesthatthesystemcomplexityiscontrolledbythenoveltycriterionparametersanddesignerscanestimateaworstcaseupperbound.Thesignicanceof2)isoftheoreticalinterestbecauseitguaranteesthewellposednessofthealgorithms.ThewellposednessoftheKAPA-3andKAPA-4ismostlyensuredbytheregularizationterm.See[ 37 ]and[ 100 ]fordetails. 3.4Simulations 3.4.1Mackey-GlassTimeSeriesPrediction Thisexampleisafurtherstudyontheshort-termpredictionoftheMackey-Glass(MG)chaotictimeseriesdiscussedinChapter 2 .Wesetthetimeembeddingas7here,i.e.u(i)=[x(i7);x(i6);:::;x(i1)]Tareusedastheinputtopredictthepresentonex(i)whichisthedesiredresponsehere.Asegmentof500samplesisusedasthetrainingdataandanother100pointsasthetestdata(inthetestingphase,thelterisxed).AllthedataiscorruptedbyGaussiannoisewithzeromeanand0.001variance. WecomparethepredictionperformanceofKLMS-1,KAPA-1,KAPA-2,KRLS,andalinearcombinertrainedwithLMS.AGaussiankernel( 1{6 )withkernelparametera=1ischosenforallthekernel-basedalgorithms.OnehundredMonteCarlosimulationsarerunwithdierentrealizationsofnoise.TheresultsaresummarizedinTable 3-3 .Figure 3-1 isthelearningcurvesfortheLMS,KLMS-1,KAPA-1,KAPA-2(K=10)andKRLSrespectively. 57


Figure3-1. LearningcurvesofLMS,KLMS-1,KAPA-1(K=10),KAPA-2(K=10),SW-KRLS(K=50)andKRLSinMackey-Glasstimeseriesprediction Table3-3. PerformancecomparisoninMackey-Glasstimeseriesprediction AlgorithmParametersTestMeanSquareError LMS=0:040:02080:0009KLMS-1=0:020:00520:00022SW-KRLSK=50,=0:10:00520:00026KAPA-1=0:03,K=100:00480:00023KAPA-2=0:03,K=10,=0:10:00400:00028KRLS=0:10:00270:00009 AswecanseeinTable 3-3 ,theperformanceoftheKAPA-2issubstantiallybetterthantheKLMS-1.Alltheresultsinthetablesareintheformof\averagestandarddeviation".Table 3-4 summarizesthecomputationalcomplexityofthesealgorithms.TheKLMSandKAPAeectivelyreducethecomputationalcomplexityandmemorystoragewhencomparedwiththeKRLS.KAPA-3andsliding-windowKRLSarealsotestedonthisproblem.ItisobservedthattheperformanceoftheKAPA-3issimilartoKAPA-1when Table3-4. Complexitycomparisonatiterationi AlgorithmComputationMemory LMSO(L)O(L)KLMS-1O(i)O(i)SW-KRLSO(K2)O(K2)KAPA-1O(i+K2)O(i+K)KAPA-2O(i+K2)O(i+K2)KAPA-4O(K2)O(i+K2)KRLSO(i2)O(i2) 58


theforgettingtermisverycloseto1asexpectedandtheresultsareseverelybiasedwhentheforgettingtermisreducedfurther.Thereasoncanbefoundin[ 59 ].Theperformanceofthesliding-windowKRLSisincludedinFigure 3-1 andinTable 3-3 withK=50.ItisobservedthatKAPA-4(includingthesliding-windowKRLS)doesnotperformwellwithsmallK(<50). Next,wetesthowthenoveltycriterionaectstheperformance.Asegmentof1000samplesisusedasthetrainingdataandanother200asthetestdata.AllthedataiscorruptedbyGaussiannoisewithzeromeanand0.0001variance.Thethresholdsinthenoveltycriterionaresetas1=0:1and2=0:05.ThelearningcurvesareshowninFigure 3-2 andtheresultsaresummarizedinTable 3-5 ,whichiscalculatedfromthelast100pointsofthelearningcurves.Itisseenthatthecomplexitycanbereduceddramaticallywiththenoveltycriterionwithequivalentaccuracy.HereSKLMSandSKAPAdenotethesparseKLMSandthesparseKAPArespectively. Figure3-2. LearningcurvesofKLMS-1,KAPA-1(K=10)andKAPA-2(K=10)withandwithoutnoveltycriterioninMackey-Glasstimeseriesprediction Severalcommentsfollow:Althoughformallybeingadaptivelters,thesealgorithmscanbeviewedasecientalternativestobatchmodeRBFnetworks,thereforeitispracticaltofreezetheirweightsduringtestphase.Moreover,whencomparedwithothernonlinearlterssuchasRBFs,wedividethedataintrainingandtestingasnormally 59


Table3-5. EectofnoveltycriterioninMackey-Glasstimeseriesprediction AlgorithmParametersTestMeanSquareErrorDictionarysize KLMS-1=0:10:00430:000661000SKLMS-1=0:10:00620:00056378KAPA-1=0:050:00280:000111000SKAPA-1=0:050:00320:00041278KAPA-2=0:05,=0:10:00210:000341000SKAPA-2=0:05,=0:10:00160:00016246 doneinneuralnetworks.Ofcourse,itisalsofeasibletousethepredictionerrorasaperformanceindicatorlikeinconventionaladaptivelteringliterature. 3.4.2NoiseCancelation Anotherimportantprobleminsignalprocessingisnoisecancelationinwhichanunknowninterferencehastoberemovedbasedonsomereferencemeasurement.ThebasicstructureofanoisecancelationsystemisshowninFigure 3-3 .Theprimarysignaliss(i)anditsnoisymeasurementd(i)actsasthedesiredsignalofthesystem.n(i)isawhitenoiseprocesswhichisunknown,andu(i)isitsreferencemeasurement,i.e.adistortedversionofthenoiseprocessthroughsomedistortionfunction,whichisunknowningeneral.Hereu(i)istheinputoftheadaptivelter.Theobjectiveistouseu(i)astheinputtothelterandtoobtainasthelteroutputanestimateofthenoisesourcen(i).Therefore,thenoisecanbesubtractedfromd(i)toimprovethesignal-noise-ratio. Figure3-3. Noisecancelationsystem 60


Figure3-4. EnsemblelearningcurvesofNLMS,SKLMS-1andSKAPA-2(K=10)innoisecancelation Inthisexample,thenoisesourceisassumedwhite,uniformlydistributedbetween[0:5;0:5].Theinterferencedistortionfunctionisassumedtobe u(i)=n(i)0:2u(i1)u(i1)n(i1)+0:1n(i1)+0:4u(i2)(3{43) Aswesee,thedistortionfunctionisnonlinear(multiplicative)andhasinniteimpulsiveresponse,whichontheotherhand,meansitisimpossibletorecovern(i)fromanitetimedelayembeddingofu(i).Werewritethedistortionfunctionasn(i)=u(i)+0:2u(i1)0:4u(i2)+(u(i1)0:1)n(i1) Thereforethepresentvalueofthenoisesourcen(i)notonlydependsonthereferencenoisemeasure[u(i);u(i1);u(i2)],butitalsodependsonthepreviousvaluen(i1),whichinturndependson[u(i1);u(i2);u(i3)]andsoon.Itmeansweneedaverylongtimeembedding(innitelongtheoretically)inordertorecovern(i)accurately.However,therecursivenatureofanadaptivesystemprovidesafeasiblealternative,i.e.wefeedbacktheoutputofthelter^n(i1),whichistheestimateofn(i1)toestimatethepresentone,pretending^n(i1)isthetruevalueofn(i1).Thereforetheinputoftheadaptiveltercanbeintheformof[u(i);u(i1);u(i2);^n(i1)].Itcanbe 61


seen that the system is inherently recurrent. In the linear case with a DARMA model, it is studied as output error methods [41]. However, it will be non-trivial to generalize the results concerning convergence and stability to nonlinear cases, and it serves as a line of future work.

We assume the primary signal s(i) = 0 during the training phase, and the system simply tries to reconstruct the noise source from the reference measure. We use a linear filter trained with the normalized LMS (NLMS) and two nonlinear filters trained with the SKLMS-1 and the SKAPA-2 (K = 10), respectively. 1000 training samples are used and 200 Monte Carlo simulations are run to get the ensemble learning curves shown in Figure 3-4. The step size and regularization parameter for the NLMS are 0.2 and 0.005. The step sizes for SKLMS-1 and SKAPA-2 are 0.5 and 0.2, respectively. The Gaussian kernel is used for both SKLMS-1 and SKAPA-2 with kernel parameter a = 1. The tolerance parameters for SKLMS-1 and SKAPA-2 are δ1 = 0.15 and δ2 = 0.01. The noise reduction factor (NR), defined as 10 log10 {E[n^2(i)] / E[n(i) - y(i)]^2}, is listed in Table 3-6 along with the corresponding network size (the final number of units). The performance improvement of SKAPA-2 is obvious when compared with SKLMS-1.

Table 3-6. Noise reduction comparison in noise cancelation

    Algorithm   Network size   NR (dB)
    NLMS        N/A            9.09 ± 0.45
    SKLMS-1     407 ± 14       15.58 ± 0.48
    SKAPA-2     370 ± 14       21.99 ± 0.80

Next we use a more realistic noise source (instead of the white noise), which is an fMRI recording recorded and provided by Dr. Issa Panahi at the University of Texas at Dallas. The mean of the fMRI noise is 0 and the standard deviation is 0.051. The typical waveform is shown in Figure 3-5. We compare SKAPA-2 with NLMS. 200 Monte Carlo simulations are conducted using different segments of the recording. We average all the learning curves together to get the ensemble learning curves plotted in Figure 3-6. The step size and regularization parameter for the NLMS are 0.2 and 0.005. The step size for SKAPA-2 is 0.2.


TheGaussiankernelisusedforSKAPA-2withkernelparametera=1.Thetoleranceparametersare1=0and2=0:001.Andthenoisereductionfactor(NR)islistedinTable 3-7 alongwiththecorrespondingnetworksize(thenalnumberofunits).TheperformanceimprovementofSKAPA-2isobviouswhencomparedwithNLMS. Table3-7. NoisereductioncomparisonwithfMRInoisesource AlgorithmNetworkSizeNR(dB) NLMSN/A23.684.14SKAPA-21701236.502.29 Figure3-5. SegmentoffMRInoiserecording Figure3-6. EnsemblelearningcurvesofNLMSandSKAPA-2(K=10)infMRInoisecancelation 63
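Going back to the white-noise experiment, the sketch below generates the reference measurement from the interference model in (3-43) and assembles the recurrent filter input [u(i), u(i-1), u(i-2), n̂(i-1)] discussed earlier; the sign pattern and the zero initial conditions are assumptions for illustration.

    import numpy as np

    def reference_measure(N=1000, seed=0):
        """Reference u(i) from the interference model (3-43), assuming
        u(i) = n(i) - 0.2 u(i-1) - u(i-1) n(i-1) + 0.1 n(i-1) + 0.4 u(i-2),
        with n(i) white and uniform on [-0.5, 0.5]."""
        rng = np.random.default_rng(seed)
        n = rng.uniform(-0.5, 0.5, size=N)
        u = np.zeros(N)
        for i in range(N):
            u1 = u[i - 1] if i >= 1 else 0.0
            n1 = n[i - 1] if i >= 1 else 0.0
            u2 = u[i - 2] if i >= 2 else 0.0
            u[i] = n[i] - 0.2 * u1 - u1 * n1 + 0.1 * n1 + 0.4 * u2
        return u, n

    def recurrent_input(u, n_hat_prev, i):
        """Filter input [u(i), u(i-1), u(i-2), n_hat(i-1)] used during training."""
        return np.array([u[i], u[i - 1], u[i - 2], n_hat_prev])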


3.4.3NonlinearChannelEqualization Inthisexample,weconsideranonlinearchannelequalizationproblem,wherethenonlinearchannelismodeledbyanonlinearWienermodel.ThenonlinearWienermodelconsistsofaserialconnectionofalinearlterandamemorylessnonlinearity(SeeFigure 3-7 ).Thiskindofmodelhasbeenusedtomodeldigitalsatellitecommunicationchannels[ 52 ]anddigitalmagneticrecordingchannels[ 81 ]. Figure3-7. Basicstructureofnonlinearchannel Theproblemsettingisasfollows:Abinarysignalfs(1);s(2);:::;s(N)gisfedintothenonlinearchannel.Atthereceiverendofthechannel,thesignalisfurthercorruptedbyadditivewhiteGaussiannoiseandisthenobservedasfr(1);r(2);:::;r(N)g.Theaimofchannelequalizationistoconstructaninverselterthatreproducestheoriginalsignalwithaslowanerrorrateaspossible.Itiseasytoformulateitasaregressionproblem,withinput-outputexamplesf(r(t+D);r(t+D1);:::;r(t+Dl+1));s(t)g,wherelisthetimeembeddinglength,andDistheequalizationtimelag. Inthisexperiment,thenonlinearchannelmodelisdenedbyx(t)=s(t)+0:5s(t1),r(t)=x(t)0:9x(t)2+n(t),wheren(t)isthewhiteGaussiannoisewithavarianceof2.WecomparetheperformanceoftheLMS,theAPA-1,theSKLMS-1,theSKAPA-1(K=10),andtheSKAPA-2(K=10).TheGaussiankernelwitha=0:1isusedintheSKLMS-1,SKAPA-1andSKAPA-2selectedwithcrossvalidation.l=3andD=2intheequalizer.Thenoisevarianceisxedhere=0:1.TheensemblelearningcurvesareplottedinFigure 3-8 with50MonteCarlosimulations.ForeachMonteCarlosimulation,thelearningcurvesarecalculatedonasegmentof100testingdata.TheMSEiscalculatedbetweenthecontinuousoutput(beforetakingtheharddecision)andthedesiredsignal.FortheSKLMS-1,SKAPA-1,andSKAPA-2,thenoveltycriterionisemployedwith 64


1=0:26,2=0:08.ThedynamicchangingofthenetworksizeisalsoplottedinFigure 3-9 overthetraining.Itcanbeseenthatatthebeginning,thenetworksizesincreasequicklybutafterconvergencethenetworksizesincreaseslowly.Andinfact,wecanstopaddingnewcentersafterconvergencebycross-validationbynoticingthattheMSEdoesnotchangeafterconvergence. Figure3-8. EnsemblelearningcurvesofLMS,APA-1,SKLMS-1,SKAPA-1andSKAPA-2innonlinearchannelequalization(=0:1) Figure3-9. NetworksizesofSKLMS-1,SKAPA-1andSKAPA-2overtraininginnonlinearchannelequalization Next,dierentnoisevariancesareset.Tomakethecomparisonfair,wetunethenoveltycriterionparameterstomakethenetworksizealmostthesame(around100)in 65


eachscenariobycrossvalidation.Foreachsetting,20MonteCarlosimulationsarerunwithdierenttrainingdataanddierenttestingdata.Thesizeofthetrainingdatais1000andthesizeofthetestingdatais105.Theltersarexedduringthetestingphase.TheresultsarepresentedinFigure 3-10 .Thenormalizedsignal-noise-ratio(SNR)isdenedas10log101 2.ItisclearlyshownthattheSKAPA-2outperformstheSKLMS-1substantiallyintermsofthebiterrorrate(BER).ThelinearmethodsneverreallyworkinthissimulationregardlessoftheSNR.TheimprovementoftheSKAPA-1ontheSKLMS-1ismarginalbutitexhibitsasmallervariance.Theroughnessinthecurvesismostlyduetothevariancefromthestochastictraining. Figure3-10. PerformancecomparisonofLMS,APA-1,SKLMS-1,SKAPA-1andSKAPA-2withdierentSNRinnonlinearchannelequalization Inthelastsimulation,wetestthetrackingabilityoftheproposedmethodsbyintroducinganabruptchangeduringtraining.Thetrainingdatais1500.Fortherst500data,thechannelmodeliskeptthesameasbefore,butforthelast1000datathenonlinearityofthechannelisswitchedtor(t)=x(t)+0:9x(t)2+n(t).Theensemblelearningcurvesfrom100MonteCarlosimulationsareplottedinFigure 3-11 andthedynamicchangeofthenetworksizeisplottedinFigure 3-12 .ItisseenthattheSKAPA-2outperformsothermethodswithitsfasttrackingspeed.Itisalsonotedthatthenetworksizesincreaserightafterthechangetothechannelmodel. 66


Figure3-11. EnsemblelearningcurvesofLMS,APA-1,SKLMS-1,SKAPA-1andSKAPA-2withanabruptchangeatiteration500innonlinearchannelequalization Figure3-12. NetworksizesofLMS,APA-1,SKLMS-1,SKAPA-1andSKAPA-2overtrainingwithanabruptchangeatiteration500innonlinearchannelequalization 3.5Discussions ThischapterdiscussestheKAPAalgorithmfamilywhichisintrinsicallyastochasticgradientmethodologytosolvetheleastsquaresprobleminRKHS.Itisafollow-upstudyoftheKLMS.SincetheKAPAupdateequationcanbewrittenasinnerproducts,KAPAcanbeecientlycomputedintheinputspace.ThegoodapproximationabilityoftheKAPAstemsfromthefactthatthetransformeddata'(u)includespossiblyinnitedierentfeaturesoftheoriginaldata. 67


ComparingwiththeKLMS,KRLS,andregularizationnetworks(batchmodetraining),KAPAgivesyetanotherwayofcalculatingthecoecientsforshallowRBFlikeneuralnetworks.TheperformanceoftheKAPAissomewherebetweentheKLMSandKRLS,whichisspeciedbythewindowlengthK.ThereforeitnotonlyprovidesafurthertheoreticalunderstandingofRBFlikeneuralnetworks,butitalsobringsmuchexibilityforapplicationdesignwiththeconstraintsonperformanceandcomputationresources. Threeexamplesarestudiedinthechapter,namely,timeseriesprediction,nonlinearchannelequalizationandnonlinearnoisecancelation.Inallexamples,theKAPAdemonstratessuperiorperformancewhencomparedwiththeKLMS,whichisexpectedfromtheclassicadaptivelteringtheory. 68


CHAPTER4EXTENDEDKERNELRECURSIVELEASTSQUARES Inrecentyears,therehavebeenmanyeortstostudythegeneralformulationofrecursiveleastsquaresinRKHS,suchaskernelrecursiveleastsquares(KRLS)[ 28 ]andsliding-windowKRLS(SW-KRLS)[ 100 ].DuetothecloserelationshipbetweentheextendedrecursiveleastsquaresandtheKalmanlter,itwillbeofgreatsignicancetostudythepossibilityofderivinganextendedRLSinRKHS.Thischapterisonesuchattemptinthisdirection,presentingtheextendedkernelRLSfortracking(EX-KRLS).ThispossibilitywillopenanewresearchlineintheareaofnonlinearKalmanlteringbesidestheextendedKalmanlter,unscentedKalmanlter,andparticleKalmanlter[ 5 47 76 104 ]. 4.1FormulationofExtendedKernelRecursiveLeastSquares Wewillstartwithareviewonrecursiveleastsquares,kernelrecursiveleastsquaresandextendedrecursiveleastsquares.ThedicultyofderivingextendedkernelrecursiveleastsquaresdirectlyfromextendedRLSorKRLSispointedout.ThentwoimportanttheoremsareprovidedtoovercomethedicultysuchthattheextendedKRLSfollowsnaturally.Exponentially-weightedKRLSandrandom-walkKRLSarealsobrieydiscussedasspecialcasesoftheextendedKRLS. 4.1.1RecursiveLeastSquares Withasequenceoftrainingdatafu(j);d(j)gi1j=1uptotimei1,therecursiveleastsquaresalgorithmestimatestheweightw(i1)byminimizingthefollowingcost minw(i1)i1Xj=1jd(j)u(j)Tw(i1)j2+jjw(i1)jj2(4{1) Hereu(j)istheL1regressorinput,d(j)isthedesiredresponse,andistheregularizationparameter.Whenanewinput-outputpairfu(i);d(i)gbecomesavailable, 69


the weight estimate w(i), which is the minimizer of

    min_{w(i)} Σ_{j=1}^{i} |d(j) - u(j)^T w(i)|^2 + λ ||w(i)||^2    (4-2)

can be calculated recursively from the previous estimate w(i-1) without solving (4-2) directly. The standard recursive least squares (RLS) is summarized in Algorithm 3 [47]:

Algorithm 3  Recursive Least Squares (RLS)

    Start with w(0) = 0, P(0) = λ^{-1} I
    iterate for i ≥ 1:
        r_e(i) = 1 + u(i)^T P(i-1) u(i)
        k_p(i) = P(i-1) u(i) / r_e(i)
        e(i) = d(i) - u(i)^T w(i-1)
        w(i) = w(i-1) + k_p(i) e(i)
        P(i) = [P(i-1) - P(i-1) u(i) u(i)^T P(i-1) / r_e(i)]    (4-3)

In (4-3), P(i) can be interpreted as the inversion of the regularized data autocorrelation matrix, k_p(i) is called the gain vector, and e(i) is the prediction error. Therefore the update equation on the weight,

    w(i) = w(i-1) + k_p(i) e(i)

aligns nicely with our discussion in Chapter 1. Notice that P is L×L, where L is the dimensionality of the input u. Therefore the time and memory complexities are both O(L^2).

The RLS distributes the computation load evenly into each iteration, which is very appealing in applications like channel equalization [82] where data is available sequentially over time.

4.1.2 Kernel Recursive Least Squares

The kernel recursive least squares is introduced in [28] and we provide a brief review on it to better distinguish and appreciate our contribution.
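A direct NumPy transcription of Algorithm 3 is given below as a reference point before the kernelized versions; the regularization value and the data shapes are illustrative assumptions.

    import numpy as np

    def rls(U, d, lam=1e-3):
        """Recursive least squares (Algorithm 3). U: (N, L) inputs, d: (N,) targets."""
        N, L = U.shape
        w = np.zeros(L)
        P = np.eye(L) / lam                       # P(0) = lam^{-1} I
        for i in range(N):
            u = U[i]
            Pu = P @ u
            r_e = 1.0 + u @ Pu                    # r_e(i)
            k_p = Pu / r_e                        # gain vector k_p(i)
            e = d[i] - u @ w                      # prediction error e(i)
            w = w + k_p * e
            P = P - np.outer(Pu, Pu) / r_e        # inverse autocorrelation update
        return w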


We utilize the Mercer theorem to transform the data u(i) into the feature space F as φ(u(i)) (denoted as φ(i)). We formulate the recursive least squares algorithm on the example sequence {d(1), d(2), ...} and {φ(1), φ(2), ...}. At each iteration, the weight vector ω(i), which is the minimizer of

    min_{ω(i)} Σ_{j=1}^{i} |d(j) - ω(i)^T φ(j)|^2 + λ ||ω(i)||^2    (4-4)

needs to be solved recursively as in (4-3). However, (4-3) cannot be directly applied here because the dimensionality of φ(j) is so high (it can be even infinite) that it is not feasible in practice.

Introducing

    d(i) = [d(1), ..., d(i)]^T,  Φ(i) = [φ(1), ..., φ(i)]    (4-5)

one has

    ω(i) = [λI + Φ(i) Φ(i)^T]^{-1} Φ(i) d(i)    (4-6)

further, by the matrix inversion lemma [82],

    ω(i) = Φ(i) [λI + Φ(i)^T Φ(i)]^{-1} d(i)    (4-7)

We have to emphasize the significance of the change from (4-6) to (4-7) here. First, Φ(i)^T Φ(i) is computable by the kernel trick (1-10), and second, the weight is explicitly expressed as a linear combination of the input data, ω(i) = Φ(i) a(i).

Denote

    Q(i) = (λI + Φ(i)^T Φ(i))^{-1}    (4-8)

It is easy to see that

    Q(i)^{-1} = [ Q(i-1)^{-1}    h(i)
                  h(i)^T         λ + φ(i)^T φ(i) ]    (4-9)


whereh(i)=(i1)T'(i).Usingthissliding-windowstructure,theupdatingoftheinversionofthisgrowingmatrixcanbequiteecient[ 82 ] Q(i)=r(i)1264Q(i1)r(i)+z(i)z(i)Tz(i)z(i)T1375(4{10) where z(i)=Q(i1)h(i)r(i)=+'(i)T'(i)z(i)Th(i)(4{11) Thereforetheexpansioncoecientsoftheweightare a(i)=Q(i)d(i)=264Q(i1)+z(i)z(i)Tr(i)1z(i)r(i)1z(i)Tr(i)1r(i)1375264d(i1)d(i)375=264a(i1)z(i)r(i)1e(i)r(i)1e(i)375(4{12) wheree(i)isthepredictionerrorcomputedbythedierencebetweenthedesiredsignalandthepredictionfi1(u(i)): fi1(u(i))=h(i)Ta(i1)=i1Xj=1aj(i1)(u(j);u(i)) (4{13) e(i)=d(i)fi1(u(i)) (4{14) ThereforetheKRLSassumesaradialbasisfunctionnetworkstructureatanytimei[ 46 ].aj(i1)isthejthcomponentofa(i1).From( 4{12 ),weseethatthelearningprocedureofKRLSisverysimilartoKLMS,KAPAandRANinthesensethatitallocatesanewunitwithu(i)asthecenterandr(i)1e(i)asthecoecient.AtthesametimeKRLSalsoupdatesallthepreviouscoecientsbyz(i)r(i)1e(i)whereasKLMSneverupdatespreviouscoecientsandKAPAonlyupdatestheK1mostrecentones. 72


If we denote f_i as the estimate of the input-output mapping at time i, we have the following sequential learning rule for the KRLS:

    f_i = f_{i-1} + r(i)^{-1} [κ(u(i), ·) - Σ_{j=1}^{i-1} z_j(i) κ(u(j), ·)] e(i)    (4-15)

The coefficients a(i) and the centers C(i) should be stored in the computer during training. The updates needed for the KRLS at time i are

    a_i(i) = r(i)^{-1} e(i)    (4-16)
    a_j(i) = a_j(i-1) - r(i)^{-1} e(i) z_j(i),  j = 1, ..., i-1    (4-17)
    C(i) = {C(i-1), u(i)}    (4-18)

The KRLS is summarized in Algorithm 4.

Algorithm 4  Kernel recursive least squares (KRLS)

    Start with
        Q(1) = (λ + κ(u(1), u(1)))^{-1},  a(1) = Q(1) d(1)
    iterate for i > 1:
        h(i) = [κ(u(i), u(1)), ..., κ(u(i), u(i-1))]^T
        z(i) = Q(i-1) h(i)
        r(i) = λ + κ(u(i), u(i)) - z(i)^T h(i)
        Q(i) = r(i)^{-1} [ Q(i-1) r(i) + z(i) z(i)^T    -z(i)
                           -z(i)^T                       1    ]
        e(i) = d(i) - h(i)^T a(i-1)
        a(i) = [ a(i-1) - z(i) r(i)^{-1} e(i)
                 r(i)^{-1} e(i)              ]    (4-19)

The time and memory complexities are both O(i^2). As we will see, with some sparsification, the complexity will be reduced to O(m_i^2), where m_i is the effective number of centers in the network at time i.
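The following NumPy sketch transcribes Algorithm 4 without any sparsification, so it grows one center per sample and is only practical for short sequences; the Gaussian kernel and its width are illustrative choices.

    import numpy as np

    def krls(U, d, lam=0.1, a=1.0):
        """Kernel RLS (Algorithm 4) with kappa(u, v) = exp(-a ||u - v||^2)."""
        def kappa(x, y):
            return np.exp(-a * np.sum((x - y) ** 2))

        centers = [np.asarray(U[0])]
        Q = np.array([[1.0 / (lam + kappa(U[0], U[0]))]])
        alpha = np.array([Q[0, 0] * d[0]])                  # a(1) = Q(1) d(1)
        for i in range(1, len(U)):
            u = np.asarray(U[i])
            h = np.array([kappa(u, c) for c in centers])
            z = Q @ h
            r = lam + kappa(u, u) - z @ h                   # r(i)
            e = d[i] - h @ alpha                            # prediction error e(i)
            Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                          [-z[None, :],            np.array([[1.0]])]]) / r
            alpha = np.concatenate([alpha - z * (e / r), [e / r]])
            centers.append(u)
        return centers, alpha

The prediction at a new input is simply the kernel expansion Σ_j alpha_j κ(u, c_j), as in (4-13).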


4.1.3ExtendedRecursiveLeastSquares TheproblemfortheRLS(andtheKRLS)isthatithasapoortrackingperformance.Fromtheviewpointofstate-spacemodel,theRLSimplicitlyassumesthatthedatasatisfy[ 47 ] x(i+1)=x(i)d(i)=u(i)Tx(i)+v(i)(4{20) i.e.thestatex(i)isxedovertimeandwithnosurprisetheoptimalestimateofthestatew(i)cannottrackvariations. Toimproveitstrackingability,severaltechniquescanbeemployed.Forexample,thedatacanbeexponentially-weightedovertimeoratruncatedwindowcanbeappliedonthetrainingdata(whichalsocanbeviewedasaspecialweighting).Theworkforasliding-windowKRLShasalreadybedonein[ 100 ]andthischaptertriestoderiveanexponentially-weightedKRLSwithstatenoise. Fortheexponentially-weightedscheme,theassumedstate-spacemodelis 1 x(i+1)=x(i)d(i)=u(i)Tx(i)+v(i)(4{21) whereisascalingfactor.AsisknownfromtheextendedRLSmethod,themostgeneralstate-spacemodelwouldbe x(i+1)=Ax(i)+n(i)d(i)=u(i)Tx(i)+v(i)(4{22) withAasthestatetransitionmatrix.Whileitwouldbemostdesirabletohavesuchageneralstate-spacemodelintheRKHS,itturnsouttobeatbestverydicult.Inthis 1 Thisfactisnontrivialandpleasereferto[ 82 ]fordetails. 74


chapterwefocusonaspecialcaseof( 4{22 ),i.e., x(i+1)=x(i)+n(i)d(i)=u(i)Tx(i)+v(i)(4{23) Kalman[ 51 ]proposedatwostepsequentialestimationalgorithmtoupdatethestateestimate.Atthecoreofthisprocedureistherecursiveleastsquareupdateoftheobservationmodel.Indeed,thesolutionofthisstate-spaceestimationproblemamountstosolvingthefollowingleastsquarescostfunction[ 82 ]: minfx(1);n(1);:::;n(i)g[iXj=1ijjd(j)u(j)Tx(j)j2+ijjx(1)jj2+q1iXj=1ijjjn(j)jj2](4{24) subjecttox(j+1)=x(j)+n(j). isintroducedtohaveexponentialweightingonthepastdata,istheregularizationparametertocontroltheinitialstate-vectornormandqprovidessometrade-obetweenthemodelingvariationandmeasurementdisturbance.Observethatifq=0;=1,( 4{24 )reducestotheexponentially-weightedRLS.Furtherif=1,itreducestothestandardRLS.Wehavetoemphasizethat( 4{24 )isamuchharderquadraticoptimizationproblemwithlinearconstraintscomparedtotheconstraint-freeleastsquaresproblem( 4{2 ).Whentheinputdataaretransformedintoahighdimensionalfeaturespaceviaakernelmapping,thisproblemgetsjustmuchharder. TheextendedRLSrecursionsaregiveninAlgorithm 5 [ 82 ]: Algorithm5Extendedrecursiveleastsquares(EX-RLS) Startwithw(0)=0;P(0)=11I iteratefori1 re(i)=i+u(i)TP(i1)u(i)kp(i)=P(i1)u(i)=re(i)e(i)=d(i)u(i)Tw(i1)w(i)=w(i1)+kp(i)e(i)P(i)=jj2[P(i1)P(i1)u(i)u(i)TP(i1)=re(i)]+iqI(4{25) 75


4.1.4ReformulationofExtendedRecursiveLeastSquares InthefeaturespaceF,themodelbecomes x(i+1)=x(i)+n(i)d(i)='(i)Tx(i)+v(i)(4{26) whichissimilarto( 4{23 )exceptthattheinputis'(i)insteadofu(i). Wecannotuse( 4{25 )directlybecausetheinputdataandthestatevectornowlieinapossiblyinnitedimensionalspace(consequentlythePmatrixis11).WecannotusethematrixinversionlemmalikeintheKRLSbecauseofthecomplicatedconstrainedleastsquarescostfunction( 4{24 ).Theapplicationofthekerneltrickrequiresthereformulationoftherecursionsolelyintermsofinnerproductoperationsbetweeninputvectors. Bycarefullyobservingtherecursion( 4{25 ),onecanconcludethatallthecalculationsarebasedonu(j)TP(k)u(i)foranyk,i,j. Theorem4.1. ThematricesP(j)in( 4{25 )assumethefollowingform P(j)=(j)IH(j)Q(j)H(j)T(4{27) where(j)isascalarandQ(j)isajjmatrixwithH(j)=[u(1);:::;u(j)],forallj>0. Proof. Firstnoticethatby( 4{25 )P(0)=11I;P(1)=jj2[11I22u(1)u(1)T re(i)]+qI=[jj2 +q]Iu(1)[jj222 +11u(1)Tu(1)]u(1)T sotheclaimisvalidforj=1,namely,(1)=jj211+q;Q(1)=jj222 +11u(1)Tu(1) 76


Thenusingthemathematicalinduction,theproofforalljfollows.Assumeitistrueforj=i1,i.e., P(i1)=(i1)IH(i1)Q(i1)H(i1)T(4{28) Bysubstitutingitintothelastequationof( 4{25 ),onehasP(i)=jj2[P(i1)P(i1)u(i)u(i)TP(i1) re(i)]+iqI=(jj2(i1)+iq)Ijj2r1e(i)H(i)266666664Q(i1)re(i)+z(i)z(i)T(i1)z(i)(i1)z(i)T2(i1)377777775H(i)T Therefore(i)=jj2(i1)+iqQ(i)=jj2 re(i)266666664Q(i1)re(i)+z(i)z(i)T(i1)z(i)(i1)z(i)T2(i1)377777775P(i)=(i)H(i)Q(i)H(i)T wherez(i)=Q(i1)H(i1)Tu(i). Bytheorem 4.1 ,thecalculationu(j)TP(k)u(i)onlyinvolvesinnerproductoperationsbetweentheinputvectors,whichistheprerequisitetousingthekerneltrick. Theorem4.2. Theoptimalstateestimatein( 4{25 )isalinearcombinationofthepastinputvector,namely w(j)=H(j)a(j)(4{29) Proof. Noticethatby( 4{25 )w(0)=0w(1)=11d(1)u(1) +11u(1)Tu(1) soa(1)=11d(1) +11u(1)Tu(1) 77


Thustheclaimisvalidforj=1.Thenweusethemathematicalinductiontoproveitisvalidforanyj.Assumeitistruefori1.Bytherecursion( 4{25 )andtheresultfromtheorem 4.1 ,wehavew(i)=w(i1)+kp(i)e(i)=H(i1)a(i1)+P(i1)u(i)e(i)=re(i)=H(i1)a(i1)+(i1)u(i)e(i)=re(i)H(i1)z(i)e(i)=re(i)=H(i)264a(i1)z(i)e(i)r1e(i)(i1)e(i)r1e(i)375 wherez(i)=Q(i1)H(i1)Tu(i). Hence,afterrepresentingwwitha,PwithandQ,wehavethefollowingequivalentrecursions: Startwitha(1)=11d(1) +11u(1)Tu(1);(1)=jj211+q;Q0(1)=jj222 +11u(1)Tu(1) 78


Iteratefori>1:h(i)=H(i1)Tu(i)z0(i)=Q0(i1)h(i)re(i)=i+(i1)u(i)Tu(i)h(i)Tz0(i)e(i)=d(i)h(i)Ta(i1)a(i)=264a(i1)z0(i)r1e(i)e(i)(i1)r1e(i)e(i)375(i)=jj2(i1)+iqQ0(i)=jj2 re(i)264Q0(i1)re(i)+z0(i)z0(i)T(i1)z0(i)(i1)z0(i)T2(i1)375 Thisisquitesucientforourpurposebutwhenisverysmall,,Q0,z0andreareallverylarge,causingpossiblenumericalissues.Andalsotobetterunderstandthemeaningsofthesequantities,wedothefollowingvariablechanges: Q(i1)=Q0(i1)=(i1)z(i)=z0(i)=(i1)r(i)=re(i)=(i1)r(i)=1=(i1) ThereforewehaveAlgorithm 6 whichisequivalenttoAlgorithm 5 : Noticethatthroughouttheiteration,theinputvectoru(i)isonlyinthecalculationofh(i)andr(i),bothintheformofinnerproduct.Thesignicanceofthisreformulationisitsindependenceonthedatadimensionality.ComparingwithAlgorithm 5 ,wereplacetherecursiononw(i)withtheoneona(i)inAlgorithm 6 .Further,wereplacetherecursiononP(i)withtheoneson(i)andQ(i)where(i)isascalarandQ(i)isii.Inaword,thedimensionofa(i),(i)andQ(i)onlydependsonthesizeoftrainingdatairegardlessofthedimensionoftheinputu. 79


Algorithm6AnovelvariantofextendedRLS(EX-RLS-2) Startwitha(1)=d(1) 2+u(1)Tu(1);r(1)==(jj2+q);Q(1)=jj2 [2+u(1)Tu(1)][jj2+2q] iteratefori>1:h(i)=H(i1)Tu(i)z(i)=Q(i1)h(i)r(i)=ir(i1)+u(i)Tu(i)h(i)Tz(i)e(i)=d(i)h(i)Ta(i1)a(i)=a(i1)z(i)r1(i)e(i)r1(i)e(i)r(i)=r(i1)=(jj2+iqr(i1))Q(i)=jj2 r(i)(jj2+iqr(i1))Q(i1)r(i)+z(i)z(i)Tz(i)z(i)T1 4.1.5ExtendedKernelRecursiveLeastSquares NowthenonlinearextensionofAlgorithm 6 isstraightforwardbyreplacingu(i)Tu(j)with(u(i);u(j))(SeeAlgorithm 7 ). ThelearningprocedureofEX-KRLSisquitesimilartoKRLS:itallocatesanewunitwithu(i)asthecenterandr1(i)e(i)asthecoecient;atthesametimeallthepreviouscoecientsareupdatedbyaquantityz(i)r1(i)e(i).Finallyascalarismultipliedona(i)accordingtothestateupdateequation.Themajordierenceistheintroductionof,whichreectstheuncertaintyofthestatenoise. 4.1.5.1Randomwalkkernelrecursiveleastsquares Bysetting=1,wehavetheKRLSforthefollowingrandomwalkmodel x(i+1)=x(i)+n(i)d(i)='(i)Tx(i)+v(i)(4{31) 80


Algorithm7Extendedkernelrecursiveleastsquares(EX-KRLS) Startwitha(1)=d(1) 2+(u(1);u(1));r(1)==(jj2+q);Q(1)=jj2 [2+(u(1);u(1))][jj2+2q] iteratefori>1: h(i)=[(u(i);u(1));:::;(u(i);u(i1))]Tz(i)=Q(i1)h(i)r(i)=ir(i1)+(u(i);u(i))h(i)Tz(i)e(i)=d(i)h(i)Ta(i1)a(i)=a(i1)z(i)r1(i)e(i)r1(i)e(i)r(i)=r(i1)=(jj2+iqr(i1))Q(i)=jj2 r(i)(jj2+iqr(i1))Q(i1)r(i)+z(i)z(i)Tz(i)z(i)T1(4{30) TherecursiveequationtoestimatetherandomwalkstateinRKHSissummarizedinAlgorithm 8 4.1.5.2Exponentiallyweightedkernelrecursiveleastsquares WiththeextendedKRLS,itisveryeasytohavetheexponentiallyweightedKRLSbysettingq=0,=1.TheexponentiallyweightedKRLSissummarizedinAlgorithm 9 Itisnoticedthat(i)becomesaconstantandisnolongerneeded.Afactorof2isabsorbedintoforsimplicity.NowitisalmostthesameasAlgorithm 4 (KRLS)exceptthattheregularizationparameterisexponentiallyweightedbyi1inr(i).Setting=1,wehavetheKRLSback. 4.2ImplementationofExtendedKernelRecursiveLeastSquares TheimplementationoftheextendedKRLSisquitestraightforward.Weneedtostorea(i),(i)andQ(i)incomputeralongtraining,sothetotaltimecomplexityandmemory 81


Algorithm8Randomwalkkernelrecursiveleastsquares(RW-KRLS) Startwitha(1)=d(1) 2+(u(1);u(1));r(1)==(1+q);Q(1)=1 [2+(u(1);u(1))][1+2q] iteratefori>1:h(i)=[(u(i);u(1));:::;(u(i);u(i1))]Tz(i)=Q(i1)h(i)r(i)=ir(i1)+(u(i);u(i))h(i)Tz(i)e(i)=d(i)h(i)Ta(i1)a(i)=a(i1)z(i)r1(i)e(i)r1(i)e(i)r(i)=r(i1)=(1+iqr(i1))Q(i)=1 r(i)(1+iqr(i1))Q(i1)r(i)+z(i)z(i)Tz(i)z(i)T1 Algorithm9Exponentiallyweightedkernelrecursiveleastsquares(EW-KRLS) Startwitha(1)=d(1) +(u(1);u(1));Q(1)=1 +(u(1);u(1)) iteratefori>1:h(i)=[(u(i);u(1));:::;(u(i);u(i1))]Tz(i)=Q(i1)h(i)r(i)=i1+(u(i);u(i))h(i)Tz(i)e(i)=d(i)h(i)Ta(i1)a(i)=a(i1)z(i)r1(i)e(i)r1(i)e(i)Q(i)=r(i)1Q(i1)r(i)+z(i)z(i)Tz(i)z(i)T1 82
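Algorithm 9 above differs from Algorithm 4 only in that the regularization term inside r(i) is exponentially weighted by the forgetting factor, so a sketch can reuse the KRLS transcription given earlier almost verbatim. The version below is such a sketch; the forgetting factor, kernel width and initialization values are illustrative assumptions.

    import numpy as np

    def ew_krls(U, d, lam=0.1, beta=0.99, a=1.0):
        """Exponentially weighted KRLS (Algorithm 9): same as Algorithm 4 except that
        the regularization inside r(i) is weighted by beta^(i-1)."""
        def kappa(x, y):
            return np.exp(-a * np.sum((x - y) ** 2))

        centers = [np.asarray(U[0])]
        Q = np.array([[1.0 / (lam + kappa(U[0], U[0]))]])
        alpha = np.array([Q[0, 0] * d[0]])
        for i in range(1, len(U)):
            u = np.asarray(U[i])
            h = np.array([kappa(u, c) for c in centers])
            z = Q @ h
            r = lam * beta ** i + kappa(u, u) - z @ h   # beta^(i-1) in the text's 1-based indexing
            e = d[i] - h @ alpha
            Q = np.block([[Q * r + np.outer(z, z), -z[:, None]],
                          [-z[None, :],            np.array([[1.0]])]]) / r
            alpha = np.concatenate([alpha - z * (e / r), [e / r]])
            centers.append(u)
        return centers, alpha

Setting beta = 1 recovers the KRLS sketch above, consistent with the remark that EW-KRLS defaults to KRLS in that case.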


complexity are both O(i^2). The only thing we need to address is sparsification. The novelty criterion can still be used here. However, there is a more efficient way to achieve that.

4.2.1 Sparsification and Approximate Linear Dependency

The approximate linear dependency (ALD) test introduced in [28] is more computationally involved than the novelty criterion discussed in Chapter 3. Suppose the present dictionary is C(i) = {c_j}_{j=1}^{m_i}, where c_j is the jth center and m_i is the cardinality. When a new data pair {u(i+1), d(i+1)} is presented, it tests the following cost

    dis2 = min_{∀b} || φ(u(i+1)) - Σ_{c_j ∈ C(i)} b_j φ(c_j) ||

which indicates the distance of the new input to the linear span of the present dictionary in the feature space.

By straightforward calculus, it turns out that

    G(i) b = h(i+1)    (4-32)

where G(i) and h(i+1) are redefined based on C(i) as

    h(i+1) = [κ(u(i+1), c_1), ..., κ(u(i+1), c_{m_i})]^T    (4-33)

    G(i) = [ κ(c_1, c_1)      ...   κ(c_{m_i}, c_1)
             ...              ...   ...
             κ(c_1, c_{m_i})  ...   κ(c_{m_i}, c_{m_i}) ]    (4-34)

If G(i) is invertible, one has

    b = G(i)^{-1} h(i+1)    (4-35)

    dis2^2 = κ(u(i+1), u(i+1)) - h(i+1)^T G^{-1}(i) h(i+1)    (4-36)

u(i+1) will be rejected if dis2 is smaller than some preset threshold δ3 in ALD. The computation complexity is O(m_i^2).
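The ALD test of (4-32)-(4-36) amounts to one kernel vector, one matrix solve and a comparison, as in the sketch below; the Gaussian kernel, the threshold value and the optional jitter term are assumptions for illustration.

    import numpy as np

    def ald_accept(u_new, centers, delta3=0.05, a=1.0, jitter=0.0):
        """Return True if u_new passes the ALD test and should become a center.
        dis2^2 = kappa(u, u) - h^T G^{-1} h, per (4-36)."""
        def kappa(x, y):
            return np.exp(-a * np.sum((x - y) ** 2))

        if not centers:
            return True
        G = np.array([[kappa(ci, cj) for cj in centers] for ci in centers])
        h = np.array([kappa(u_new, c) for c in centers])
        b = np.linalg.solve(G + jitter * np.eye(len(centers)), h)   # b = G^{-1} h
        dis2_sq = kappa(u_new, u_new) - h @ b
        return dis2_sq > delta3 ** 2

In practice, the text points out that r(i+1), which KRLS computes anyway, is essentially the same quantity when the regularization is small, so a separate solve is often unnecessary.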


Ontheotherhand,by( 4{8 )and( 4{11 ),onehas r(i+1)=+(u(i+1);u(i+1))h(i+1)T(G(i)+I)1h(i+1)(4{37) inKRLS( 4{19 ).Itisnotedthatwhenisverysmall,r(i+1)isessentiallyequivalenttodis2.AnditwillbeclearinChapter 5 thatr(i+1)hasamoreprincipledinterpretation.Inotherwords,r(i+1)ismorethansucientforALDpurposeandtheextracalculationfordis2isnotneeded. Ifthenewdataisdeterminedtobenot\novel",itissimplydiscardedinthischapterbutdierentstrategycanbeemployedtoutilizetheinformationlikein[ 69 ]and[ 28 ]. InALD,theerrorinformationisnotcalculatedforactivedataselectionasinthenoveltycriterion.Inotherwords,thedesiredsignalisnotused,whichfallsintotherstscenarioaswediscussedactivelearninginChapter 1 .Whenthedesiredsignalisactuallyavailable,itisjustwasteful.Naturallywecancalculatethepredictionerrore(i+1)andcompareitwithathresholdaswedoinnoveltycriterion,butthisisquiteheuristic.TheseissueswillbecomemuchclearerinChapter 5 Furthermore,intheextendedKRLS,r(i)in( 4{30 )playsaverysimilarroletor(i)inKRLS.ThoughitsmeaningisnotasclearasinKRLS,itisatleastaverygoodapproximationespeciallywhen,arecloseto1andqisverysmall,whichisusuallyvalidinpractice.ThereforetheALDisreadilyapplicableforEX-KRLS,EW-KRLSandRW-KRLSwithoutextracomputation. ItisalsointerestingtonotethatNCisanapproximationofALD.Italwaysholdsthatdis1dis2sincedis1correspondstoaspecialchoiceofb(1fortheclosestcenterand0foralltheothers).Whenalocalkernelfunction(likeGaussian)isused,pickingtheclosestoneisagoodstrategytoestimatedis2. 4.2.2ApproximateLinearDependencyandStability AveryimportantconsequenceofapplyingALDinKRLSisthatitsstabilityisimproved.Toillustratetheidea,weassumetheregularizationparameter=0inKRLS. 84


Theillposedness(orinstability)arisesfromtheinversionoperationoftheGrammatrixG(i),thatis,ifG(i)isclosetosingularity,a(i)=G(i)1d(i)couldbeoutofcontrol.AlthoughG(i)issemi-positivedenite,itseigenvaluescouldbearbitraryclosetozero.Thusinversionrendersinstability[ 40 ]. Weareinterestedtoanswerthefollowingquestion:Ifthesystemisstableattimei,givendatau(i+1),whatcanwesayaboutthesystemattimei+1?Usingmathematicalterms,iftheminimaleigenvalueofG(i)ismin(i),givennewdatau(i+1),whatwouldtheminimaleigenvalueofG(i+1)be?RememberG(i+1)hasthefollowingstructure G(i+1)=264G(i)h(i+1)h(i+1)T(u(i+1);u(i+1))375(4{38) Denotingmin(i+1)astheminimumeigenvalueofG(i+1)andmin(i)astheminimumeigenvalueofG(i),ourgoalistondarelationbetweenthemandunderstandhowALDhelpstopreventillposedness. Let'squotesomekeyquantitiesfromKRLShereforeasypresentation.Q(i+1)=G(i+1)1 Q(i+1)=264Q(i)+z(i+1)z(i+1)Tr(i+1)1z(i+1)r(i+1)1z(i+1)Tr(i+1)1r(i+1)1375(4{39) where z(i+1)=Q(i)h(i+1)r(i+1)=(u(i+1);u(i+1))z(i+1)Th(i+1)(4{40) r(i+1)istheSchurcomplementofG(i)w.r.t.G(i+1)[ 82 ].Aswecanobservefrom( 4{39 )thatwhenr(i+1)!0,sometermsinQ(i+1)approachinnity,whichisasureindicationofillposedness.Anill-conditionedGrammatrixisundesirableinanycircumstance.WithALD,ifr(i+1)istoosmall,u(i+1)willbeautomaticallyexcluded.ThisisgoodbutdoesALDguaranteeG(i+1)iswellconditioned?Wewillanswerthisquestionbythefollowingtheorem. 85


Theorem4.3. min(i+1)2min(i)r (g+r+min(i))+p (g+r+min(i))28min(i)r(4{41) Thefollowingnotationsareusedforshort:g:=(u(i+1);u(i+1))andr:=r(i+1).ThistheoremiscalledALD-stabilitytheorem.TheproofisquitecomplicatedandisincludedinAppendixA. Alsobytheinterlacingtheorem[ 40 ],onehas min(i+1)min(i)(4{42) Noticethatg>rholdsallthetime.Forshift-invariantkernels,by( 4{42 ),onehasg=min(1)min(i)foranyi1.Thereforearelaxedbutsimplerlowerboundis min(i+1)min(i)r g+r+min(i)min(i)r 3g(4{43) Toconclude,min(i+1)islowerboundedandcannotbearbitrarilysmall.Ifmin(i)andrarebothseparatedfrom0bysomedistance,soismin(i+1). 4.3Simulations 4.3.1RayleighChannelTracking WeconsidertheproblemoftrackinganonlinearRayleighfadingmultipathchannelandcomparetheperformanceoftheproposedEX-KRLSalgorithmtotheoriginalKRLS.AlsoperformanceofthenormalizedLMS,RLSandEX-RLSareincludedforcomparison. ThenonlinearRayleighfadingmultipathchannelemployedhereisthecascadeofatraditionalRayleighfadingmultipathchannelfrom[ 82 ]andasaturationnonlinearity.IntheRayleighmultipathfadingchannel,thenumberofthepathsischosenasM=5,themaximumDopplerfrequencyfD=100HzandthesamplingrateTs=0:8s(soitisaslowfadingchannelwiththesamefadingrateforallthepaths).AllthetapcoecientsaregeneratedaccordingtotheRayleighmodelbutonlytherealpartisusedinthisexperiment.AwhiteGaussiandistributedtimeseries(withunitpower)issentthrough 86


thischannel,corruptedwiththeadditivewhiteGaussiannoise(withvariance2=0:001)andthenthesaturationnonlinearityy=tanh(x)isappliedonit,wherexistheoutputoftheRayleighchannel.Thewholenonlinearchannelistreatedasablackboxandonlytheinputandoutputareknown. Thetrackingtaskistestedon5methods.TherstoneisthenormalizedLMS 2 (regularizationfactor=103,stepsize=0:25);thesecondistheRLS(withregularizationparameter=103);thethirdoneistheEX-RLS(=0:9999999368,q=3:26107,=0:995,=103accordingto[ 82 ](onpage759).Thelasttwoarenonlinearmethods,namelytheKRLS(regularizationparameter=0:01)andtheproposedEX-KRLS(=0:999998,q=104,=0:995,=0:01).WeusetheGaussiankernelinbothcaseswithkernelparametera=1.Noticethatisverycloseto1andqverycloseto0sincethefadingofthechannelisveryslow. Wegenerate1000symbolsforeveryexperimentandperformindependently200MonteCarloexperiments.TheensemblelearningcurvesareplottedinFigure 4-1 ,whichclearlyshowsEX-KRLShasabettertrackingabilitythanKRLS.Thelast100valuesinthelearningcurvesareusedtocalculatethenalmeansquareerror(MSE),whichislistedinTable 4-1 .Itisseenthatthenonlinearmethodsoutperformthelinearmethodssignicantly,sincethechannelmodelweusehereisnonlinear.ThoughtheRayleighchannelisaslowfadingchannelinthisproblem,westillenjoynearly4dBimprovementbyusingtheEX-KRLSmodel. Table4-1. PerformancecomparisoninRayleighchanneltracking AlgorithmMSE(dB) LMS-2-11.780.94RLS-12.020.46EX-RLS-12.620.56KRLS-16.630.52EX-KRLS-20.440.75 2 ShownasLMS-2inthetableandgure. 87


Figure4-1. EnsemblelearningcurvesofLMS-2,EX-RLS,KRLSandEX-KRLSintrackingaRayleighfadingmultipathchannel Inthesecondsimulation,wetesttheeectivenessofALDtoreducethecomplexityinEX-KRLS.Wechoose10thresholdsinALDintherangeof[0,0.05].Foreachthreshold,500symbolsaregeneratedforeveryexperimentand50MonteCarloexperimentsareconducted.ThenalMSEiscalculatedwiththelast100valuesintheensemblelearningcurves.ThenalMSEvs.thethresholdisplottedinFigure 4-2 .ThenalMSEvs.thenalnetworksizeisplottedinFigure 4-3 .Theregularizationparameterissetas0.01. Figure4-2. EectofapproximatelineardependencyinEX-KRLS Inthethirdsimulation,weshowhowtheALDcanhelpstabilizethesolution.Forthispurpose,weset=0inEX-KRLS.Wechoose30thresholdsinALDintherangeof 88


Figure4-3. Networksizevs.performanceinEX-KRLSwithALD [0,0.1].Foreachthreshold,500symbolsaregeneratedforeveryexperimentand50MonteCarloexperimentsareconducted.ThenalMSEiscalculatedwiththelast100valuesintheensemblelearningcurves.ThenalMSEvs.thethresholdisplottedinFigure 4-4 .ThenalMSEvs.thenalnetworksizeisplottedinFigure 4-5 .Aswesee,therstpointinFigure 4-4 correspondstonoALDandthealgorithmdoesnotworkduetoillposedness.Incontrast,alltheotherpointsshowabetterstabilityduetotheuseofALD. Figure4-4. EectofapproximatelineardependencyinEX-KRLS(=0) 89


Figure 4-5. Network size vs. performance in EX-KRLS with ALD (λ = 0)

4.3.2 Lorenz Time Series Prediction

The Lorenz attractor, introduced by Edward Lorenz in 1963 [60], is a 3-dimensional structure corresponding to the long-term behavior of a chaotic flow, noted for its butterfly shape. The system is nonlinear, three-dimensional and deterministic. In 2001 it was proven by Warwick Tucker that for a certain set of parameters the system exhibits chaotic behavior and displays what is today called a strange attractor [99]. The system arises in lasers, dynamos, and specific waterwheels.

The following map dictates how the state of the Lorenz system evolves over time in a complex, non-repeating pattern:

    dx/dt = -βx + yz
    dy/dt = σ(z - y)    (4-44)
    dz/dt = -xy + ρy - z

Setting β = 8/3, σ = 10 and ρ = 28, the state evolution pattern is plotted in Figure 4-6. The first-order approximation is used with a step size of 0.01.

We pick the first component, namely x, here for the short-term prediction task. The first component is plotted in Figure 4-7.
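The signal used in this experiment can be generated with the same first-order (Euler) approximation and 0.01 step size, as in the sketch below; the initial condition, the sequence length and the normalization step are assumptions.

    import numpy as np

    def lorenz_x(n=5000, dt=0.01, beta=8/3, sigma=10.0, rho=28.0, x0=(1.0, 1.0, 1.0)):
        """First-order (Euler) integration of (4-44); returns the first component x,
        normalized to zero mean and unit variance as described in the text."""
        x, y, z = x0
        xs = np.empty(n)
        for i in range(n):
            dx = -beta * x + y * z
            dy = sigma * (z - y)
            dz = -x * y + rho * y - z
            x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
            xs[i] = x
        return (xs - xs.mean()) / xs.std()

    def embedding(x, order=5, T=10):
        """Inputs u(i) = [x(i-5), ..., x(i-1)] and targets x(i+T) for T-step prediction."""
        U = np.array([x[i - order:i] for i in range(order, len(x) - T)])
        d = np.array([x[i + T] for i in range(order, len(x) - T)])
        return U, d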


Figure4-6. StatetrajectoryofLorenzsystemforvalues=8=3,=10and=28 Figure4-7. TypicalwaveformoftherstcomponentfromtheLorenzsystem Theshorttermpredictiontaskcanbeformulatedasfollows:using5pastdatau(i)=[x(i5);x(i4);:::;x(i1)]Tastheinputtopredictx(i+T)whichisthedesiredresponsehere.Tisthepredictionstep.ThelargerTis,themorenonlinearitythesystemexhibitsandhencethehardertheproblemis.Thesignalispre-processedtobezeromeanandunitvariancebeforethemodeling. WetestLMS-2,RLS,EX-RLS,KRLSandEX-KRLSonthistask.Wepick20predictionstepsintherangeof[1,20].Foreachpredictionstep,50MonteCarlosimulationsarerunwithdierentsegmentsofthesignal.Thelengthofeachsegmentis 91


1000points.ThenalMSEiscalculatedfromthelast100pointsintheensemblelearningcurveswhichareaveragedoveralltheMonteCarlosimulations.Theregularizationfactor=103,stepsize=0:2inLMS-2.Theregularizationparameter=103inRLS.=1,q=0:01,=0:99,=103inEX-RLS.=0:001inKRLSand=1,q=0:01,=0:99,=0:001inEX-KRLS.WeusetheGaussiankernelwithkernelparametera=1.TheperformanceisplottedinFigure 4-8 .ItcanbeseenthattheextendedmodelsexhibitbetterperformancesincetheLorenzsystemisswitchingbetweentwoattractorsasshowninFigure 4-6 .Withsmallpredictionsteps,thelinearmodelperformequallywellwiththenonlinearmodelsbecausethesamplingrateisquitehighandthesignalisverysmoothinFigure 4-7 Figure4-8. PerformancecomparisonofLMS-2,RLS,EX-RLS,KRLSandEX-KRLSinLorenzsystemprediction NextinFigure 4-9 theensemblelearningcurvesareshownwith200MonteCarlosimulations.Thepredictionstepisxedat10.ThenalMSEisalsolistedinTable 4-2 Table4-2. PerformancecomparisoninLorenzsystemprediction AlgorithmMSE(dB) LMS-2-8.120.91RLS-11.830.71EX-RLS-19.890.77KRLS-32.531.13EX-KRLS-44.924.80 92


Figure4-9. EnsemblelearningcurvesofLMS-2,RLS,EX-RLS,KRLSandEX-KRLSinLorenzsystemprediction FinallywecomparetheperformanceofSW-KRLSwithEX-KRLS.SincethemodelingandtrackingperformanceofSW-KRLSdependsonthewindowlength,wepick20dierentwindowlengthsintherangeof[10,400].Foreachwindowlength,50MonteCarlosimulationsarerunwithdierentsegmentsofsignal.Eachsegmenthas500points.Thepredictionstepisxedas10.TheperformanceisplottedinFigure 4-10 .ClearlyEX-KRLSoutperformsSW-KRLSsignicantlyinthisexample.ThetrendofimprovingperformancewithlongerwindowlengthshowsSW-KRLSfailstoimprovethetrackingabilitysinceiteventuallydefaultstoKRLSwiththebestperformance. 4.4Discussions AkernelbasedversionoftheEX-RLS(fortrackingmodel)waspresented.ComparingwiththeexistingKRLSalgorithms,itprovidesamoregeneralstate-spacemodel,whichisastepclosertothepossiblekernelKalmanlters.Preliminaryresultsofthisalgorithmarepromising,whichsuggestsitcanhaveawideapplicabilityinnonlinearextensionsofmostproblemswhichisdealtwithbytheEX-RLS. Exponentially-weightedKRLSandrandom-walkKRLSarepresentedasspecialcasesoftheextendedKRLS.Approximatelineardependencycriterionisdiscussedas 93


Figure4-10. PerformancecomparisonofSW-KRLSandEX-KRLSinLorenzsystemprediction amainmethodforsparsication.Itseectonthestabilityisalsodemonstratedbyamathematicaltreatment. ExamplesofRayleighchanneltrackingandLorenzsystempredictionareillustratedwithcomparisonamongEX-KRLS,KRLS,SW-KRLSandlinearmethods.ItshowsthatEX-KRLSoutperformsalltheothersintermsofitsmodelingandtrackingability. 94


CHAPTER 5
CONDITIONAL INFORMATION FOR ACTIVE DATA SELECTION

This chapter addresses the principal bottleneck of this class of on-line kernel algorithms, which is related to their growing structure with each new sample. Intuitively, one can expect that after processing sufficient samples from the same source there is little need to keep growing the structure of the filters, because of redundancy. This is also evident from the use of the novelty criterion [69] and approximate linear dependency [28]. Though these two methods are quite effective in practical applications, they are heuristic in nature. Therefore, how to mathematically establish a framework to test whether a given sample is needed to improve performance is of great significance.

To this end, a new criterion is proposed based on the negative log likelihood. The criterion is termed conditional information, since it indicates how much information a candidate example contains conditional on the "knowledge of the learning system". By defining the instantaneous information contained in a data sample, it is possible to estimate it directly from data using Gaussian process theory [77]. This criterion allows us to discard or include new exemplars in the filter structures systematically and curb the filter growth. It provides a unifying view of the existing methods discussed in [28, 69], and it also gives a general framework for redundancy removal, abnormality detection and knowledge discovery.

5.1 Definition of Conditional Information

As we discussed in Chapter 1, there are two kinds of active learning depending on the availability of the desired signal (labeling information). In both cases, we need some criterion to objectively estimate the utility of candidate data points.

In this chapter, we use an objective function based on the negative log likelihood (NLL) of a candidate datum conditioned on the learning system involved. The concept of the NLL by itself is not new; it is widely used in parameter estimation and hypothesis testing [9, 16, 26]. However, its application to active data selection is quite novel.
One work, by Dima and Hebert [24], falls in the same line as ours, but they used kernel density estimation to estimate the NLL while we use Gaussian process (GP) theory to carry out the calculation mathematically. We name the NLL conditional information here to emphasize that we are measuring how informative the candidate is to the current learning system. Under this framework, we are able to investigate the two scenarios (with or without the desired signal) in a unifying way, which allows new insights to be gained and highlights the relationship between existing methods.

5.1.1 Problem Statement

Imagine that we are gathering data in the form of a set of input-output pairs D(i) = {u(j), d(j)}_{j=1}^{i}. This data set is modeled with a learning system y(u; T(i)), where T(i) specifies the model and architecture of the learning system at time i. The problem is to measure how much information a new data pair {u(i+1), d(i+1)} contains with respect to the current learning system, and how to use this information to efficiently update the learning system. Not surprisingly, the amount of information a message contains is related to its receptor. For example, a live game show may be quite interesting, but a replay (the same message) might be just boring to the same person. Notice that the state of the person has changed since he watched the live broadcast.

5.1.2 Definition of Conditional Information

Borrowing the idea from Hartley [44] and later Shannon [87], we define the information measure of a particular datum {u(i+1), d(i+1)} conditional on the current learning system T(i) as

    CI(i+1) = -ln p(u(i+1), d(i+1) | T(i))    (5-1)

where p(u(i+1), d(i+1) | T(i)) is the conditional probability density of {u(i+1), d(i+1)} given T(i). We call CI(i+1) the conditional information of {u(i+1), d(i+1)} given T(i).

Hartley and Shannon studied the concept of information in digital communication, where only models of memoryless channels and independent sources are considered. By design, both transmitters and receivers know exactly the distribution of any transmitted message.
Receiving previous messages does not change the distribution of any future message and hence will not help. Mathematically, one has

    p(message at time i+1 | receiver at time i) = p(message at time i+1)

Therefore, a "conditionless" information concept is quite sufficient for that purpose [87].

However, when dealing with real-world learning problems, even if an independent source assumption can be used, the learning system never has exact knowledge of the data distribution (which is exactly what is meant to be learnt). The system finds the best guess it can possibly have: the posterior distribution of the message. In this sense, the conditional information measure is consistent with, and a natural generalization of, Shannon's definition used in digital communications.

Intuitively, if p(u(i+1), d(i+1) | T(i)) is very large, the new datum {u(i+1), d(i+1)} is well expected by the learning system T(i) and thus contains a small amount of information to be learnt. On the other hand, if p(u(i+1), d(i+1) | T(i)) is small, the new datum "surprises" the learning system, which means either that the data contains something new for the system to discover or that it is suspicious and rejectable.

According to the conditional information, we can classify the new datum into three categories:

- Abnormal: CI(i+1) > T1.
- Learnable: T1 >= CI(i+1) >= T2.
- Redundant: CI(i+1) < T2.
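A minimal sketch of this three-way split in Python follows; the function name is illustrative, and only the decision logic is taken from the listing above. The same predicate reappears later in step 2 of Algorithm 10.

```python
def classify_datum(ci, T1, T2):
    """Classify a candidate datum by its conditional information CI(i+1).

    Assumes T1 >= T2, matching the listing above: values above T1 are
    abnormal, values below T2 are redundant, the rest are learnable.
    """
    if ci > T1:
        return "abnormal"     # suspicious; reject or handle in tracking mode
    if ci < T2:
        return "redundant"    # already well predicted; no update needed
    return "learnable"        # informative; grow/update the filter
```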
A direct estimate of this conditional density (for instance, by kernel density estimation) is problematic when the dimensionality of u is high. We will instead follow Gaussian process theory [77] to evaluate the conditional information (5-1).

5.2.1 Gaussian Processes Theory

By the GP theory, the prior distribution of the outputs of the learning system is assumed to be jointly Gaussian, i.e.,

    [y(u(1)), ..., y(u(i))]^T ~ N(0, sigma_n^2 I + G(i))

where

    G(i) = [ k(u(1),u(1)) ... k(u(i),u(1)) ;  ...  ;  k(u(1),u(i)) ... k(u(i),u(i)) ]

for any i, and sigma_n^2 is the variance of the noise contained in the observation. Here k is the covariance function. It is symmetric and positive-definite, just like the reproducing kernel; in fact, any reproducing kernel is a covariance function and vice versa (please refer to [77, 107] for details). The commonly used covariance is the Gaussian kernel

    k(u, u') = exp(-a ||u - u'||^2)    (5-2)

The idea behind this seemingly naive assumption is as follows:

- For any input u, the prior distribution of the output y(u) is Gaussian with zero mean and variance k(u, u) + sigma_n^2, where the component sigma_n^2 comes from the noise.
- For any two inputs u and u', the a priori distribution of the outputs y(u) and y(u') is jointly Gaussian. The correlation between the two is determined by k(u, u'). It is clear that if the Gaussian kernel (5-2) is used, the closer the two inputs are, the stronger the correlation between the two outputs; this explicitly imposes a smoothness constraint [77].
- For more inputs, the same idea applies.

As we know from Bayesian inference theory [26], the a priori distribution is only important when there is insufficient data. In practice, a naive prior such as a Gaussian or uniform gives very good inference results.
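As a small illustration of the prior just described, the following sketch builds the Gram matrix G(i) for the Gaussian covariance (5-2); the kernel parameter a and the noise variance are placeholder values.

```python
import numpy as np

def gauss_kernel(u, v, a=1.0):
    """Gaussian covariance of (5-2): k(u, v) = exp(-a * ||u - v||^2)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-a * np.dot(diff, diff)))

def gram_matrix(U, a=1.0):
    """G(i) with entries k(u(j), u(k)); inputs stacked row-wise in U (i x L)."""
    U = np.asarray(U, dtype=float)
    sq = np.sum(U ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * U @ U.T, 0.0)
    return np.exp(-a * d2)

# Under the GP assumption, the prior covariance of [y(u(1)), ..., y(u(i))] is
# sigma_n**2 * np.eye(len(U)) + gram_matrix(U, a).
```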
Other non-Gaussian priors are also possible, but the price to pay is a much higher computational complexity [77].

With this prior assumption, the posterior distribution of the output given the input u(i+1) and all past observations D(i) can be derived as [77]

    p(y | u(i+1), D(i)) ~ N(d_bar(i+1), sigma^2(i+1))    (5-3)

which is again normally distributed, with

    d_bar(i+1) = h(i+1)^T [sigma_n^2 I + G(i)]^{-1} d(i)    (5-4)
    sigma^2(i+1) = sigma_n^2 + k(u(i+1), u(i+1)) - h(i+1)^T [sigma_n^2 I + G(i)]^{-1} h(i+1)    (5-5)

where

    h(i+1) = [k(u(i+1), u(1)), ..., k(u(i+1), u(i))]^T
    d(i) = [d(1), ..., d(i)]^T

The notation y(u(i)) denotes the output of the learning system, which is a random variable by definition, while d(i) is the actual deterministic observation (realization) of the output associated with u(i). So p(y | u(i+1), D(i)) is a distribution (a function), whereas p(d(i) | u(i+1), D(i)) is an exact probability density (a number).

Comparing (5-4) with (4-13), and (5-5) with (4-11) and (4-36), we find that GP regression is almost equivalent to KRLS except for its probabilistic interpretation.
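A direct transcription of (5-4) and (5-5), reusing the Gaussian kernel sketched above; for clarity it inverts the regularized Gram matrix explicitly instead of updating it recursively as KRLS does, and the noise variance value is a placeholder.

```python
import numpy as np

def gp_posterior(U, d, u_new, a=1.0, sigma_n2=1e-3):
    """Posterior mean (5-4) and variance (5-5) of the output at u(i+1) = u_new."""
    U = np.asarray(U, dtype=float)
    d = np.asarray(d, dtype=float)
    u_new = np.asarray(u_new, dtype=float)
    G = np.exp(-a * np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=2))
    h = np.exp(-a * np.sum((U - u_new) ** 2, axis=1))        # h(i+1)
    Q = np.linalg.inv(sigma_n2 * np.eye(len(d)) + G)         # [sigma_n^2 I + G(i)]^{-1}
    mean = h @ Q @ d                                         # (5-4)
    var = sigma_n2 + 1.0 - h @ Q @ h                         # (5-5); k(u, u) = 1 for (5-2)
    return mean, var
```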
5.2.2 Evaluation of Conditional Information

Let us assume T(i) = D(i) for the moment, i.e., the learning system memorizes all the past input-output pairs. The conditional probability density p(u(i+1), d(i+1) | T(i)) can then be evaluated as

    p(u(i+1), d(i+1) | T(i)) = p(d(i+1) | u(i+1), T(i)) p(u(i+1) | T(i))
                             = (1 / (sqrt(2 pi) sigma(i+1))) exp(-(d(i+1) - d_bar(i+1))^2 / (2 sigma^2(i+1))) p(u(i+1) | T(i))

and therefore the conditional information is

    CI(i+1) = -ln[p(u(i+1), d(i+1) | T(i))]
            = ln sqrt(2 pi) + ln sigma(i+1) + (d(i+1) - d_bar(i+1))^2 / (2 sigma^2(i+1)) - ln[p(u(i+1) | T(i))]    (5-6)

Equation (5-6) gives a whole picture of which factors affect the conditional information of the new datum, and how. Some observations are as follows:

1. CI(i+1) is directly proportional to the square of the prediction error

       e(i+1) = d(i+1) - d_bar(i+1)

   since d_bar(i+1) is the prediction mean and also the maximum a posteriori (MAP) estimate of d(i+1) by the current learning system T(i). If e^2(i+1) is very small, which means the learning system predicts well near u(i+1), the corresponding CI(i+1) is small.

2. sigma(i+1) has a dual effect. In the case of a very small prediction error, say e(i+1) close to 0, ln sigma(i+1) plays the major role and CI(i+1) grows with sigma(i+1). A large variance indicates that the learning system is uncertain about its guess even though the guess happens to be right; incorporating the new datum will boost the prediction confidence in the neighborhood of the new datum in future inference. On the other hand, with a large prediction error, the second term (d(i+1) - d_bar(i+1))^2 / (2 sigma^2(i+1)) may dominate; in other words, decreasing sigma(i+1) will actually increase CI(i+1) significantly when sigma(i+1) is small. This is a strong indication of abnormality.

3. A smaller p(u(i+1) | T(i)) leads to a larger CI(i+1); that is, a rare occurrence contains more information.

5.2.2.1 Input distribution

The distribution p(u(i+1) | T(i)) is problem dependent. In the regression model, it is reasonable to assume

    p(u(i+1) | T(i)) = p(u(i+1))    (5-7)

that is, the distribution of u(i+1) is independent of the previous observations, or memoryless.
If the input has a normal distribution N(mu, Sigma), we have

    CI(i+1) = ln sigma(i+1) + (d(i+1) - d_bar(i+1))^2 / (2 sigma^2(i+1)) + (u(i+1) - mu)^T Sigma^{-1} (u(i+1) - mu) / 2    (5-8)

In general, we can assume the distribution p(u(i+1)) is uniform if no a priori information is available. Therefore, by discarding the constant terms, the conditional information simplifies to

    CI(i+1) = ln sigma(i+1) + (d(i+1) - d_bar(i+1))^2 / (2 sigma^2(i+1))    (5-9)

All the methods in [22, 28, 69] implicitly assume memoryless uniform input.

5.2.2.2 Unknown desired signal

The conditional information depends on knowledge of the desired signal d(i+1). In the case of an unknown desired signal, we simply average CI(i+1) over the posterior distribution of y(u(i+1)). By using (5-6), one has

    CI(i+1) = Integral of CI(i+1) p(y | u(i+1), T(i)) dy
            = ln sqrt(2 pi) + ln sigma(i+1) + 1/2 - ln[p(u(i+1) | T(i))]

Neglecting the constant terms yields

    CI(i+1) = ln sigma(i+1) - ln[p(u(i+1) | T(i))]    (5-10)

Furthermore, under the memoryless uniform input assumption, it simplifies to

    CI(i+1) = ln sigma(i+1)    (5-11)

By identifying the relation between the regularization parameter in KRLS and the noise variance, lambda = sigma_n^2, the prediction variance is nothing but the quantity r(i) defined in KRLS. This statement explains rigorously many observations in [22, 28] and why ALD is a valid active data selection criterion. Obviously, however, it does not utilize the important information contained in the desired signal.
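Combining the pieces, the simplified criteria (5-9) and (5-11) can be computed from the posterior sketched earlier; the constant ln sqrt(2 pi) terms are dropped, as in the text.

```python
import numpy as np

def conditional_information(mean, var, d_new=None):
    """Simplified conditional information under memoryless uniform input.

    With the desired sample d_new this is (5-9); without it, averaging over
    the posterior leaves only the variance term, as in (5-11).
    """
    if d_new is None:
        return 0.5 * np.log(var)                                   # ln sigma(i+1), (5-11)
    return 0.5 * np.log(var) + (d_new - mean) ** 2 / (2.0 * var)   # (5-9)

# Typical use together with the earlier sketch:
#   mean, var = gp_posterior(U, d, u_new)
#   ci = conditional_information(mean, var, d_new)
```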
5.3 Learning Rules

There are many possible learning rules for dealing with the different categories of data. We discuss some of them to illustrate the main ideas.

5.3.1 Relation to Kernel Recursive Least Squares

As we mentioned, the learning system T(i) can simply be the set of all the examples D(i), but it is more convenient in practice to interpret T(i) in terms of its architecture. By doing so, we explicitly establish the equivalence between KRLS and online GP regression.

In (5-4), we denote a(i) = [sigma_n^2 I + G(i)]^{-1} d(i), and the prediction made by the learning system given any input u' is

    sum_{j=1}^{i} a_j(i) k(u(j), u')

which assumes a radial basis function (RBF) network structure, just as in KRLS. Additionally, we denote the set of centers as C(i) = {u(1), ..., u(i)}, since at this moment we assume all the past inputs are centers. Hence the learning system can be sufficiently represented by T(i) = {C(i), a(i)}.

The output of the RBF network gives the prediction mean (5-4). It has been shown in Chapter 4 that when sigma_n^2 = 0, the prediction variance (5-5) is a distance measure between the new input and the existing centers:

    sigma(i+1) = min_b || phi(u(i+1)) - sum_{j=1}^{i} b_j phi(u(j)) ||    (5-12)

where phi is the transformation induced by the kernel. This understanding is very useful if we seek some sort of approximation. For example, the resource-allocating network equipped with the novelty criterion can be regarded as an approximation of this methodology. Though the novelty criterion involves the same concepts of prediction error and distance, it is not as principled as the conditional information.
5.3.2 Updating Rule for Learnable Data

Assume we know the system T(i) = {a(i), C(i)} at time i and the datum {u(i+1), d(i+1)} is learnable. We use the standard GP approach to update T(i+1) = {a(i+1), C(i+1)}, which is equivalent to the KRLS recursion (4-19). We re-derive the procedure to better illustrate the relation between GP and KRLS. By (5-4), we have

    a(i+1) = [sigma_n^2 I + G(i+1)]^{-1} d(i+1)    (5-13)

For notational simplicity, we denote Q(i+1) = [sigma_n^2 I + G(i+1)]^{-1}. By the matrix inversion lemma, one has

    Q(i+1) = [ Q(i) + z(i+1) z(i+1)^T / sigma^2(i+1) ,  -z(i+1) / sigma^2(i+1) ;
               -z(i+1)^T / sigma^2(i+1) ,  1 / sigma^2(i+1) ]    (5-14)

where

    z(i+1) = Q(i) h(i+1)    (5-15)

Therefore the linear coefficients of the RBF network are

    a(i+1) = Q(i+1) d(i+1) = Q(i+1) [ d(i) ; d(i+1) ]
           = [ a(i) - z(i+1) e(i+1) / sigma^2(i+1) ;  e(i+1) / sigma^2(i+1) ]    (5-16)

where e(i+1) is the prediction error as defined before.

To sum up, for a learnable datum a new center u(i+1) is added into the dictionary, i.e.,

    C(i+1) = {C(i), u(i+1)}

and the coefficients a(i+1) are updated according to (5-16).
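A sketch of this growth step (5-14)-(5-16) follows; Q, a and the center list are the quantities maintained by the recursion, k is the covariance function, and the noise variance value is a placeholder.

```python
import numpy as np

def update_learnable(Q, a, centers, u_new, d_new, k, sigma_n2=1e-3):
    """Incorporate a learnable datum via the rank-one updates (5-14)-(5-16).

    centers is a Python list of past inputs; Q = [sigma_n^2 I + G]^{-1}.
    """
    h = np.array([k(c, u_new) for c in centers])            # h(i+1), (5-18)
    z = Q @ h                                               # (5-15)
    var = sigma_n2 + k(u_new, u_new) - h @ z                # sigma^2(i+1), (5-20)
    err = d_new - h @ a                                     # prediction error e(i+1)
    n = len(a)
    Q_new = np.empty((n + 1, n + 1))                        # block update (5-14)
    Q_new[:n, :n] = Q + np.outer(z, z) / var
    Q_new[:n, n] = Q_new[n, :n] = -z / var
    Q_new[n, n] = 1.0 / var
    a_new = np.concatenate([a - z * err / var, [err / var]])  # (5-16)
    return Q_new, a_new, centers + [u_new]
```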
The updating rule is consistent with the observations of (5-6). If the prediction error is small and the prediction variance is large, the modifying quantities are small; in the extreme case, a redundant datum leads to a negligible update. On the other hand, a large prediction error and a small prediction variance result in a large modification to the coefficients; in the extreme case, an abnormal datum causes a large solution norm and possible instability.

5.3.3 Updating Rule for Abnormal and Redundant Data

As we observed from the updating equation (5-16) and the discussion thereafter, it is undesirable to learn the redundant and abnormal data in the same way as the learnable ones. In the case of redundant data, a simple strategy is to discard them. However, in the case of abnormal data, throwing them away may not be efficient when the data is non-stationary in nature or the input-output mapping undergoes some abrupt change during learning. Therefore, an alternative learning strategy to deal with abnormal data in a tracking mode is to make small, controlled adjustments using stochastic gradient descent [69, 105].

In the previous sections, we simply assumed that all the past data are learnable and that the set of centers consists of all the past inputs. In the presence of abnormal and redundant data this assumption no longer holds, because some of them will be thrown away. Therefore we redefine our notation to avoid confusion:

    C(i) = {c_1, ..., c_{m_i}}

where m_i is the number of centers, and accordingly

    G(i) = [ k(c_1, c_1) ... k(c_{m_i}, c_1) ;  ...  ;  k(c_1, c_{m_i}) ... k(c_{m_i}, c_{m_i}) ]    (5-17)
    h(i+1) = [k(u(i+1), c_1), ..., k(u(i+1), c_{m_i})]^T    (5-18)
Consequently, the calculation of the prediction mean and prediction variance given T(i) = {a(i), C(i)} becomes

    d_bar(i+1) = h(i+1)^T a(i)    (5-19)
    sigma^2(i+1) = sigma_n^2 + k(u(i+1), u(i+1)) - h(i+1)^T [sigma_n^2 I + G(i)]^{-1} h(i+1)    (5-20)

based on (5-17) and (5-18). Hence one is able to calculate CI(i+1) under appropriate assumptions using (5-19) and (5-20).

If the datum is categorized as abnormal and the learning machine is in the non-tracking mode, it will be thrown away. Otherwise, we need to distinguish two cases. First, if the new input is very close to some cluster of the existing centers, say sigma^2(i+1) below a threshold, we use a variant of the Widrow-Hoff LMS algorithm [69, 105] to decrease the error in a controlled manner:

    a_j(i+1) = a_j(i) + eta_1 e_tilde(i+1) k(u(i+1), c_j),   j = 1, ..., m_i    (5-21)

where e_tilde(i+1) is the clamped error defined as

    e_tilde(i+1) = e(i+1)   if |e(i+1)| < eps
                   eps      if e(i+1) >= eps
                   -eps     if e(i+1) <= -eps

However, if the new input u(i+1) is far away from the existing centers, (5-21) is not very effective due to the exponential attenuation of k(u(i+1), c_j). Therefore, when sigma^2(i+1) is at or above the threshold, we add the new input into the dictionary and assign the corresponding coefficient as

    a_{m_i+1}(i+1) = eta_2 e_tilde(i+1)    (5-22)

5.3.4 Active Online GP Regression

With the derivations listed above, the overall procedure (non-tracking) is summarized in Algorithm 10, and we call the algorithm Active Online GP Regression (AOGR). The time and space complexity at iteration i are both O(m_i^2).
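Before turning to the non-tracking summary in Algorithm 10, the tracking-mode adjustments (5-21) and (5-22) can be sketched as follows; the step sizes, the clamping level and the variance threshold are illustrative placeholders for the parameters introduced above.

```python
import numpy as np

def update_abnormal_tracking(a, centers, u_new, d_new, var, k,
                             eta1=0.1, eta2=0.1, eps=1.0, var_thresh=0.5):
    """Controlled adjustment for an abnormal datum in tracking mode.

    Implements the clamped-error updates (5-21) and (5-22); eta1, eta2, eps
    and var_thresh are placeholder values, and centers is a Python list.
    """
    h = np.array([k(c, u_new) for c in centers])
    err_clamped = float(np.clip(d_new - h @ a, -eps, eps))   # the clamped error
    if var < var_thresh:
        # close to existing centers: LMS-like coefficient adjustment, (5-21)
        return a + eta1 * err_clamped * h, centers
    # far from the centers: allocate the new input with coefficient (5-22)
    return np.concatenate([a, [eta2 * err_clamped]]), centers + [u_new]
```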
Algorithm 10  Active Online GP Regression (non-tracking) (AOGR)

Start with T(1) = {C(1), a(1)} and Q(1). Iterate the following steps for i >= 1:

1. Calculate CI(i+1):
       h(i+1) = [k(u(i+1), c_1), ..., k(u(i+1), c_{m_i})]^T
       d_bar(i+1) = h(i+1)^T a(i)
       sigma^2(i+1) = sigma_n^2 + k(u(i+1), u(i+1)) - h(i+1)^T Q(i) h(i+1)
       CI(i+1) = ln sigma(i+1) + (d(i+1) - d_bar(i+1))^2 / (2 sigma^2(i+1)) - ln[p(u(i+1) | T(i))]

2. Update T(i+1):
   2.1 Abnormal (CI(i+1) > T1): T(i+1) = T(i).
   2.2 Learnable (T1 >= CI(i+1) >= T2): a(i+1) updated by (5-16), C(i+1) = {C(i), u(i+1)}, Q(i+1) updated by (5-14).
   2.3 Redundant (CI(i+1) < T2): T(i+1) = T(i).
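Putting the pieces together, a compact Python rendering of the non-tracking loop above; the kernel, noise variance and thresholds are placeholders, and a memoryless uniform input is assumed so that the last term of the criterion drops out.

```python
import numpy as np

def aogr(inputs, desired, k, sigma_n2=1e-3, T1=10.0, T2=-10.0):
    """Sketch of Algorithm 10 (AOGR, non-tracking, uniform-input assumption)."""
    centers = [inputs[0]]
    Q = np.array([[1.0 / (sigma_n2 + k(inputs[0], inputs[0]))]])
    a = Q[0] * desired[0]                                    # a(1) = Q(1) d(1)
    for u, d in zip(inputs[1:], desired[1:]):
        h = np.array([k(c, u) for c in centers])
        mean = h @ a                                         # prediction mean
        z = Q @ h
        var = sigma_n2 + k(u, u) - h @ z                     # prediction variance
        ci = 0.5 * np.log(var) + (d - mean) ** 2 / (2.0 * var)
        if T2 <= ci <= T1:                                   # learnable: grow the network
            err = d - mean
            n = len(a)
            Q_new = np.empty((n + 1, n + 1))
            Q_new[:n, :n] = Q + np.outer(z, z) / var
            Q_new[:n, n] = Q_new[n, :n] = -z / var
            Q_new[n, n] = 1.0 / var
            Q = Q_new
            a = np.concatenate([a - z * err / var, [err / var]])
            centers.append(u)
        # abnormal (ci > T1) or redundant (ci < T2): T(i+1) = T(i), nothing to do
    return centers, a
```

With a Gaussian kernel, k can be passed as, for example, lambda x, y: float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2))).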
5.4 Simulations

5.4.1 Nonlinear Regression

A Gaussian kernel with kernel parameter a = 0.2 is used. The regularization parameter is 0.001. The parameters are selected through cross-validation.

Figure 5-1. Illustration of conditional information along training in nonlinear regression

In the second simulation, we show how effective the method is at removing redundancy. A large T2 is used to disable the abnormality detection. We compare the conditional information (CI) criterion (equation (5-8)) with the approximate linear dependency (ALD) test (equation (5-11)). We test both algorithms with 30 different thresholds. For each threshold, 100 Monte Carlo simulations are conducted with independent inputs to calculate the average number of centers and the corresponding average testing MSE. For each Monte Carlo simulation, 200 training points and 100 testing points are used. The result is illustrated in Figures 5-2, 5-3 and 5-4. It is clear that CI is very effective over the wide range [-3, 5]. Though ALD is equally effective in the range [-3, -2], a larger threshold leads to catastrophic results (almost every point is excluded except the first one). By contrast, CI provides a balance by checking the prediction error. Nevertheless, the overall performance of ALD and CI is comparable, as shown in Figure 5-4. For a simple problem like this, the learning system learns the task very quickly (after about 20 iterations) and the prediction afterwards is quite accurate, with e(i+1) close to 0, such that only the first term in (5-8) is actually effective.
Figure 5-2. Network size vs. testing MSE for the conditional information criterion in nonlinear regression

Figure 5-3. Network size vs. testing MSE for ALD in nonlinear regression

In the third simulation, we show how CI can be used to detect outliers while ALD cannot. 200 training data are generated as before, but 15 outliers are manually added at time indices 50, 60, ..., 190 (by flipping their signs). We choose T1 = 3.14, T2 = 200 for CI and T1 = 3.37 for ALD, based on the results of the second simulation. There are actually 11 effective outliers in Figure 5-5, since another 4 points are very close to the origin, and Figure 5-6 clearly shows 11 large peaks. The performance of the ALD method seriously degrades due to the detrimental effects of the outliers. We have to emphasize the importance of initial training (childhood education): if an outlier is accepted as a normal datum at the very beginning of training, the strategy presented here is not very effective, which leads to a very interesting direction for future research. This example clearly demonstrates the CI criterion's ability to detect and reject outliers.
Figure 5-4. Comparison of CI and ALD in redundancy removal in nonlinear regression

Figure 5-5. Training data with outliers in nonlinear regression

Figure 5-6. Comparison of the CI and ALD criteria in outlier detection in nonlinear regression

Figure 5-7. Learning curves of AOGR and KRLS-ALD with outliers in nonlinear regression

5.4.2 Mackey-Glass Time Series Prediction

We use the Mackey-Glass (MG) chaotic time series [39, 65] as the benchmark data again here. It is generated from the following time-delay ordinary differential equation

    dx(t)/dt = -b x(t) + a x(t - tau) / (1 + x(t - tau)^10)    (5-23)

with b = 0.1, a = 0.2 and tau = 30. The time series is discretized at a sampling period of 6 seconds.
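For completeness, a simple Euler discretization that reproduces a series of the form (5-23); the integration step, initial history and washout length below are illustrative choices, not the exact settings used here, and the experiments then sample the integrated signal with the 6-second period stated above.

```python
import numpy as np

def mackey_glass(n_samples, b=0.1, a=0.2, tau=30.0, dt=0.1, x0=1.2, washout=500):
    """Generate a Mackey-Glass series from (5-23) by Euler integration."""
    delay = int(round(tau / dt))
    total = n_samples + washout
    x = np.full(total + delay, x0, dtype=float)      # constant initial history
    for t in range(delay, total + delay - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (-b * x[t] + a * x_tau / (1.0 + x_tau ** 10))
    return x[delay + washout:]
```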
The time embedding is 7, i.e., u(i) = [x(i-7), x(i-6), ..., x(i-1)]^T is used as the input to predict the present value x(i), which is the desired response here. A segment of 500 samples is used as the training data and another 100 points as the test data (in the testing phase, the filter is fixed).

First, we compare the performance of KRLS using conditional information (CI) and approximate linear dependency (ALD), respectively. A Gaussian kernel with kernel parameter a = 1 is chosen.
A large T2 is used to disable the abnormality detection. We test both algorithms with 30 different values of T1. The result is illustrated in Figure 5-8. The overall performance of ALD and CI is comparable. The regularization parameter, 0.001, is selected by cross-validation in both algorithms, and a memoryless uniform input distribution is assumed in CI.

Figure 5-8. Network size vs. testing MSE of AOGR for different T1 in Mackey-Glass time series prediction

Secondly, we compare the performance of LMS, sparse KLMS-1 (SKLMS-1), the resource-allocating network (RAN), and AOGR. A Gaussian kernel (5-2) with kernel parameter a = 1 is chosen for all the algorithms. One hundred Monte Carlo simulations are run with different realizations of noise. The noise is additive white Gaussian noise with zero mean and 0.004 variance. The step size for LMS is 0.01. The step size is 0.5 for SKLMS-1, and the two thresholds of the novelty criterion are both set to 0.1. For RAN, the step size is 0.05 and the tolerance for the prediction error is 0.05; the distance resolution decays from 0.5 to 0.05 with a decay constant of 45, and the overlap factor is 0.87. Please refer to the discussion of RAN in Chapter 1 or [69] for the meanings and settings of the RAN parameters. Figure 5-9 shows the ensemble learning curves for LMS, sparse KLMS-1, the resource-allocating network, and AOGR, respectively. The performance of RAN and SKLMS-1 is comparable, and AOGR outperforms both significantly.
The network sizes are listed in Table 5-1.

Figure 5-9. Learning curves of LMS, SKLMS-1, RAN and AOGR in Mackey-Glass time series prediction

Table 5-1. Network sizes of RAN, SKLMS-1, and AOGR

    Algorithm    Network size
    RAN          365 +/- 13
    SKLMS-1      205 +/- 11
    AOGR         70 +/- 7

5.4.3 CO2 Concentration Forecasting

The data consist of monthly average atmospheric CO2 concentrations (in parts per million by volume, ppmv) collected at the Mauna Loa Observatory, Hawaii, between 1958 and 2008, with 603 observations in total [95]. The first 423 points are used for training and the last 180 points for testing. The data are shown in Figure 5-10. We try to model the CO2 concentration as a function of time. Several features are immediately apparent: a long-term rising trend, a pronounced seasonal variation and some smaller irregularities. The problem of kernel design for this specific task is thoroughly discussed in [77], and we use the same kernel in this example. Our goal is to test how effectively AOGR can model this nonlinear time series.
Figure 5-10. CO2 concentration trend from 1958 to 2008

At first we simply assume all data are learnable and calculate the conditional information for every point during training. The learning curve is the mean square error (MSE) calculated on the testing data. Figure 5-11 shows the correspondence between the additions of informative data (blue crosses) and drops in testing MSE (green solid line).

Figure 5-11. Learning curve of AOGR and conditional information of the training data along the iterations in CO2 concentration forecasting

Next we show how effective AOGR is at removing redundancy. A large T2 is used to disable the abnormality detection. 50 different values of T1 are chosen from [-1.5, 3]. The result is illustrated in Figure 5-12. The number of centers can be safely reduced from 423 to 77 with equivalent accuracy. Setting T1 = 0.3061, we obtain the corresponding learning curve with the effective training data highlighted (red circles) in Figure 5-13.
The long-term prediction result is plotted in Figure 5-14. As is clear, the prediction is very accurate at the beginning and deviates significantly in the far future. Indeed, Figure 5-14 suggests that the increase of the CO2 concentration is accelerating at an unforeseen speed.

Figure 5-12. Network size vs. testing MSE of AOGR for different T1 in CO2 concentration forecasting

Figure 5-13. Learning curve of AOGR and conditional information of the training data along the iterations, with effective examples circled, in CO2 concentration forecasting

5.5 Discussions

This chapter presents a unifying criterion for on-line active learning. Active learning is crucial in many machine learning applications, and it also shines light on the action-perception cycle of how biological organisms interact with the real world.
Figure 5-14. Forecasting result of AOGR for the CO2 concentration

Any learning system develops a model of the world; therefore, not all new samples encountered contain the same information with which to update the system state. The issue has been how to formulate a cost function that is computable in real time. For regression, we have shown that the theory of Gaussian processes enables an elegant and still reasonably efficient algorithm to carry out the computation. We are particularly interested in this case because of our work deriving on-line learning algorithms in kernel spaces for nonlinear adaptive filtering, but there are many more applications that can benefit from this development. An interesting next question is how to apply the same technique to classification problems, where the criterion is harder to compute and a full information-theoretic approach may be required.

CHAPTER 6
WELLPOSEDNESS ANALYSIS

In this chapter, we focus on the wellposedness analysis of the kernel adaptive filters. All the kernel adaptive filters are derived in a high-dimensional feature space by implicitly or explicitly solving a least-squares problem. In the case of finite training data, how to avoid overfitting and stay stable is obviously a very important topic.

The concept of wellposedness was proposed by Hadamard [42] and is closely related to the concepts of stability and generalization. Regularization as a remedy for ill-posedness became widely known due to the work of Tikhonov [96] and ridge regression [48], and it also has a very principled Bayesian learning interpretation [77]. In least-squares problems, Tikhonov regularization is essentially a trade-off between fitting the training data and reducing the solution norm. The solution with the smaller norm possesses more stability and guarantees better generalization performance in a statistical sense [11].

We will start by showing that the regularization network can be formulated as a least-squares problem in a high-dimensional kernel space F. Using batch and stochastic gradient descent to solve this least-squares problem in F leads to the kernel ADALINE (KADALINE) and the kernel least-mean-square algorithm (KLMS), respectively.

Although establishing these connections is important, our main focus will be on the wellposedness analysis of these iterative methods. This is not a trivial problem when the least-squares problem is formulated in a high-dimensional space without explicit regularization. In other words, we are interested in showing how these iterative methods can give regularized solutions solely by proper early stopping. Mathematically, this requires upper bounds on the solution norm. By establishing such bounds, we provide a conclusive answer to why early stopping works.
The significance of an upper bound for the solution norm is well studied in [37, 71] in the context of regularization network theory. A constraint on the solution norm ensures wellposedness of the problem through the resulting compactness of the effective hypothesis space. In Poggio's words, compactness of the hypothesis space is sufficient for consistency of empirical error minimization, forcing smoothness and stability.

6.1 Regularization Networks

Regularization networks are widely used in nonlinear regression [70], pattern recognition [49] and financial analysis [50], to name just a few applications. The design of regularization networks (RN) has been extensively studied in terms of regularization, wellposedness, and generalization [37, 71]. After picking a suitable kernel function, the design problem of a regularization network amounts to solving the following linear system

    [G + lambda I] a = d    (6-1)

where G is the N x N Gram matrix, lambda is the regularization parameter, a is the N x 1 vector of unknown coefficients and d is the desired response vector. We have to emphasize that lambda plays a significant role here. If it is too small, the solution may be under-regularized, leading to high variance in the output and poor generalization on test data. On the other hand, if lambda is too large, the solution will be over-regularized, resulting in a large bias in the output and overall poor performance [71].

Using direct methods to solve the regularization network problem requires O(N^3) computation, where N is the number of training data [37, 40]. Besides, a suitable regularization parameter needs to be selected so that the obtained solution is properly regularized, especially in large-scale problems. The commonly used method to determine the optimal lambda is cross-validation [103], [46], which multiplies the total training complexity by the number of candidate regularization parameters.

A curve-fitting approach will be adopted in this chapter to introduce the main ideas of RN. Suppose we have training data {u(i), d(i)}_{i=1}^{N}, where the u(i) are L x 1 regressors and the d(i) are scalar desired responses.
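In code, designing the network reduces to the single regularized solve (6-1); the Gaussian kernel and the regularization value below are placeholders, and the direct solve exhibits the O(N^3) cost mentioned above.

```python
import numpy as np

def train_rn(U, d, lam=0.1, a=1.0):
    """Solve [G + lam*I] a = d of (6-1) for the RBF coefficients."""
    U = np.asarray(U, dtype=float)
    sq = np.sum(U ** 2, axis=1)
    G = np.exp(-a * np.maximum(sq[:, None] + sq[None, :] - 2.0 * U @ U.T, 0.0))
    return np.linalg.solve(G + lam * np.eye(len(d)), np.asarray(d, dtype=float))

def rn_predict(U, coeff, u_new, a=1.0):
    """f(u_new) = sum_i coeff[i] * k(u(i), u_new)."""
    dists = np.sum((np.asarray(U, dtype=float) - np.asarray(u_new, dtype=float)) ** 2, axis=1)
    return float(coeff @ np.exp(-a * dists))
```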
The goal of curve fitting is to find a function in some hypothesis space H such that the following risk is minimized:

    min_f  sum_{i=1}^{N} |d(i) - f(u(i))|^2 + lambda ||f||_H^2    (6-2)

Note that lambda is non-negative and can also be viewed as a Lagrange multiplier for a constraint on the norm ||f||_H. Writing f(u) = omega^T phi(u), with phi the mapping induced by the kernel and phi(i) = phi(u(i)) for short, the risk becomes a least-squares problem in the feature space F,

    min_omega  sum_{i=1}^{N} |d(i) - omega^T phi(u(i))|^2 + lambda ||omega||^2    (6-3)
Further, by the representer theorem [83], the minimizer of (6-3) is a linear combination of the transformed training data, i.e.,

    omega = sum_{i=1}^{N} a_i phi(i)    (6-4)

By substituting (6-4) into (6-3), one has

    min_a  ||d - Ga||^2 + lambda a^T G a    (6-5)

where G_{i,j} = k(u(i), u(j)) and d = [d(1), ..., d(N)]^T. G is symmetric and positive semi-definite. Taking the derivative of (6-5) with respect to a, we have

    G[G + lambda I] a = G d    (6-6)

If G is invertible, (6-6) is equivalent to

    [G + lambda I] a = d    (6-7)

If lambda is nonzero, the solution [G + lambda I]^{-1} d automatically satisfies (6-6). On the other hand, if lambda = 0, the linear systems (6-7) and (6-6) have the same solution space (if d is not orthogonal to the null space of G, the pseudo-inverse solution is used). The case lambda = 0 is what we are interested in in this chapter.

6.2 Wellposedness Analysis of Kernel ADALINE

We will start by establishing the fact that the kernel ADALINE is batch-mode gradient descent in the RKHS. Then, by examining the behavior of its convergence (different speeds along different eigen-directions), we explain why the kernel ADALINE possesses a self-regularization mechanism. Finally, a solution norm upper bound is mathematically derived.
6.2.1 Kernel ADALINE

Let lambda = 0. We will show that gradient descent on (6-3) is well posed as long as the training is properly early-stopped. This is achieved by explicitly establishing an upper bound on the solution norm. It has long been known that early stopping can prevent over-learning in the area of multi-layer perceptrons [7, 43, 46, 78], but no one has rigorously analyzed the underlying mechanism because of the nonlinear nature of the multi-layer perceptrons. Our analysis is specifically for RN, but the insights gained are also helpful for other neural network techniques.

Denote Phi = [phi(1), ..., phi(N)] and rewrite the cost (6-3) as

    min_omega  J = ||d - Phi^T omega||^2    (6-8)

with lambda = 0, i.e., no explicit regularization. The ill-posedness of this problem arises from the fact that the dimension of the unknown variable omega is far larger than the number of training data N. The gradient of the cost function is

    grad J = -2 Phi (d - Phi^T omega)    (6-9)

Therefore, the gradient descent method is

    omega(i) = omega(i-1) + eta Phi (d - Phi^T omega(i-1)) / N    (6-10)

where omega(i) denotes the estimate of the weight at iteration i and eta is the step size. The purpose of introducing the factor N will become clear later.

Theorem 6.1. With initial value omega(0) = 0, the weight estimate given by (6-10) is a linear combination of the transformed data at any iteration, i.e.,

    omega(i) = Phi a(i) = sum_{j=1}^{N} a_j(i) phi(j),  for all i    (6-11)
Proof. Since omega(0) = 0, the claim is true for i = 0. We then use mathematical induction. Suppose (6-11) is true for i-1. Therefore

    e(i) = d - Phi^T omega(i-1) = d - (Phi^T Phi) a(i-1) = d - G a(i-1)

Then, by (6-10), one has

    omega(i) = omega(i-1) + eta Phi (d - Phi^T omega(i-1)) / N
             = Phi a(i-1) + eta Phi e(i) / N
             = Phi (a(i-1) + eta e(i) / N)

i.e.,

    a(i) = a(i-1) + eta e(i) / N    (6-12)

This completes the proof.

Notice that this result cannot be derived from the representer theorem, because we do not have the explicit norm constraint in (6-8). By the proof, it leads to the equivalent update equation

    a(i) = a(i-1) + eta (d - G a(i-1)) / N    (6-13)

This result is crucial in kernel methods, since omega lives in a high-dimensional space to which we usually do not have access. By writing omega as a linear combination of the training data, we actually solve a problem of dimensionality N.

The update rule (6-12) is equivalent to the kernel ADALINE in [33]. Since this algorithm is originally formulated in a very high-dimensional space, it is nontrivial to show that the solution from (6-12) is properly regularized. Before we show why the kernel ADALINE is self-regularized, a brief review of Tikhonov regularization is necessary for further analysis.
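A sketch of the dual recursion (6-13): batch gradient descent on (6-8) carried out entirely through the Gram matrix, with the number of epochs acting as the early-stopping parameter analyzed in the remainder of the chapter. The step size and epoch count below are placeholders.

```python
import numpy as np

def kernel_adaline(G, d, eta=0.5, n_epochs=500):
    """Batch-mode gradient descent in the dual coefficients, equation (6-13).

    G is the N x N Gram matrix and d the desired vector; the returned a
    parameterizes omega = sum_j a[j] * phi(j) as in (6-11).
    """
    d = np.asarray(d, dtype=float)
    N = len(d)
    a = np.zeros(N)
    for _ in range(n_epochs):
        e = d - G @ a                # residual on the training set
        a = a + eta * e / N          # (6-13)
    return a
```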
6.2.2TikhonovRegularization Letthesingularvaluedecomposition(SVD)ofbe =P264S000375QT(6{14) whereP,QareorthogonalmatricesandS=diag(s1;:::;sr)withsithesingularvaluesandrtherankof.Itisassumedthats1:::sr>0withoutlossofgenerality.IntroducethecorrelationmatrixR=T=N.ThenonehasR=P264S2=N000375PT (6{15)G=Q264S2000375QT (6{16) Thewellknownpseudo-inversesolutionto( 6{8 )isgiveby[ 40 ] !PI=Pdiag(s11;:::;s1r;0;:::;0)QTd(6{17) TheLSproblem,evenwiththepseudo-inverse(consideraverysmallsr),canbeill-posedduetothenatureoftheproblem,smalldatasize,orseverenoise.TheTikhonovregularization[ 96 ]iswidelyusedtoaddressthisissue.AregularizationtermisintroducedintheLScostfunctionwhichpenalizesthesolutionnormin( 6{8 )as J=jjdT!jj2+jj!jj2F(6{18) whichleadstotheTikhonovregularizationsolution !TR=Pdiag(s1 s21+;:::;sr s2r+;0;:::;0)QTd(6{19) 122

PAGE 123

Comparing( 6{19 )with( 6{17 ),weseethattheTikhonovregularizationmodiesthediagonaltermsthroughthefollowingregularizationfunction(reg-function): HTR(x)=x2 x2+(6{20) Noticethatifsrisverysmall,thepseudo-inversesolutionbecomesproblematicasthesolutionapproachesinnity.However,fortheTikhonovregularization,HTR(sr)s1r!0ifsrissmallandHTR(sr)s1r!s1rifsrislarge.Inthissense,theTikhonovregularizationsmoothlyltersouttheminorcomponentsthatcorrespondtosmallsingularvalues(relativeto).Itisalsoclearthatattenuatingtheminorcomponentsisimportantforustogetasmallernormsolutionorinotherwords,amorestablesolution. Andthesolutionfor( 6{6 )is aTR=Qdiag((s1+)1;:::;(sr+)1;0;:::;0)QTd(6{21) Itisalsoinstructivetorelatethesolutionnormtotheregularizationparameterdirectlyasinthefollowingtheorem. Theorem6.2. jjaTRjjjjdjj (6{22) jj!TRjjjjdjj 2p (6{23) Theproofisquitestraightforwardbutitclearlyshowshowtheregularizationparametercontrolsthesolutionnorm. Withthisunderstanding,theso-calledtruncatedpseudo-inverseregularization[ 40 ]isnothingbutthefollowinghardcut-oreg-function HPCA(x)=8><>:1ifx>t0ifx6t(6{24) 123

PAGE 124

wheretisthecut-othreshold.Ifsk>tsk+1(usuallykr),thesolutionbecomes !PCA=Pdiag(s11;:::;s1k;0;:::;0)QTd(6{25) Noticethatthismethodisequivalenttoapplyingprincipalcomponentanalysistechnique(PCA)[ 46 ]tothedataandusingtherstkprincipalcomponentstorepresenttheoriginaldata.Underreasonablesignal-noise-ratio(SNR),thesmallsingularvaluecomponentsarepurelyassociatedwiththenoise.Discardingthesespuriousfeaturescaneectivelypreventover-learning.Asimilarideacanbefoundin[ 28 ]and[ 30 ]. 6.2.3Self-regularizationthroughGradientDescent Inthesection,wewillshowthatthegradientdescentiterationprovidesaninherentregularizationsimilartotheTikhonovregularization.Firstrewrite( 6{10 )as !(i)=!(i1)+(dT!(i1))=N=(IT=N)!(i1)+d=N=(IP264S2=N000375PT)!(i1)+P264S=N000375QTd(by( 6{14 ))=P[(I264S2=N000375)(PT!(i1))+264S=N000375QTd](6{26) Denoteb(i)=PT!(i),whichamountstodecomposingtheweightvectoralongthecolumnvectorsofmatrixPas!(i)=MXm=1bm(i)Pm=Pb(i) whereMisthedimensionalityoftheRKHS.Thereforeby( 6{26 ),onehas b(i)=(I264S2=N000375)(b(i1))+264S=N000375QTd(6{27) 124

PAGE 125

orequivalentlyforeachcomponent bm(i)=(1s2m=N)bm(i1)+smQTmd=N(6{28) for1mM. Observethatifsm=0,bm(i)=bm(i1)=:::=bm(0) Ifsm6=0,werepeatedlyuse( 6{28 )fori=1;2;:::andobtain bm(i)=(1s2m=N)ibm(0)+(smQTmd=N)(i1Xj=0(1s2m=N)j)=(1s2m=N)ibm(0)+[1(1s2m=N)i](QTmd)=sm(6{29) Noticethats2m=NistheeigenvalueofthecorrelationmatrixwhichisasymptoticallyindependentofN,sointroducingafactorofNmakestheconvergenceconditiononindependentofN. Theinterestingobservationisifproperearly-stoppingisusedinthetraining,thesolutionoftheKADALINEisself-regularized.Forexample,westartfrom!(0)=0andthetrainingstopsafternsteps.Therefore bm(n)=[1(1s2m=N)n](QTmd)=sm(6{30) Thisequationshowsthatalongdierenteigen-directions,thealgorithmconvergesatvastlydierentspeed.Ifsmisverysmall,(1s2m=N)isverycloseto1,whichleadstoaveryslowconvergence.Ontheotherhand,forlargesm,(1s2m=N)iscloseto0andtheconvergenceisveryfast.ThispartiallyexplainswhythesolutionfromtheKADALINEisnotaectedbythesmallsingularvalueasthepseudo-inversedoes. Furtherby!(n)=Pb(n),onehas !KA;n=Pdiag([1(1s21=N)n]s11;:::;[1(1s2r=N)n]s1r;0;:::;0)QTd(6{31) 125

PAGE 126

Itmeansthereg-functionforthekernel-ADALINEstoppedatiterationnisHKA;n(x)=1(1x2=N)n Andthefollowingtheoremtellswhytheearly-stoppingtakescareofthesmallsingularvalues. Theorem6.3. limx!0HKA;n(x)x1=0 Proof. HKA;n(x)x1=1 x[1(1x2=N)][1+(1x2=N)+:::+(1x2=N)n1]=x=N[1+(1x2=N)+:::+(1x2=N)n1] Thereforeitisapolynomialinxandtheconclusionfollowsdirectly. Thesignicanceofthisresultisthatwecanreplacethetediouscrossvalidationforselectingtheoptimalregularizationparameterbythesimplegradientdescentwithearly-stopping. AcomparisonofthreeregularizationmethodsisillustratedinFigure 6-1 .Inthereg-functionofTikhonovregularization,theregularizationparameterischosenas1.Forthereg-functionofPCA,t=0:5.Forthereg-functionoftheKADALINE,=0:1,N=500,andn=600.FurtherinFigures 6-2 6-3 6-4 ,weshowtheeectofthestepsize,iterationnumberanddatasizeontheregularizationfunctionofthekernelADALINE. Asoneofthemaincontributionsofthischapter,weexplicitlyestablishupperboundsforthesolutionnormsjj!KA;njjandjjaKA;njj. Theorem6.4. Assumej1s2i=Nj<1,8i. jjaKA;njjn Njjdjj (6{32) jj!KA;njjr 2n Njjdjj (6{33) TheproofisinAppendixB.Thefactorintheboundsn=Nconveysalotofinsightsintotheadaptation.Increasingnorgivesalargerbound,indicatinglessregularization. 126

PAGE 127

Figure6-1. Thereg-functionsofthreeregularizationapproaches Figure6-2. Eectofstepsizeonthereg-functionofkernelADALINE(N=500,n=600) Ontheotherhand,increasingNmakestheboundsmaller,indicatingthewellposednessoftheproblem.Thiscorroboratestheobservationsabove. Aswesee,itisverysignicanttoestablishupperboundsforthesolutionnorm.Bydoingso,wenotonlyverifythewellposednessofthekernelADALINE,butalsorevealhowthegradientadaptation(dierentparametersetting)aectsit. 127

PAGE 128

Figure6-3. Eectofiterationnumberonthereg-functionofkernelADALINE(=0:1,N=500) Figure6-4. Eectoftrainingdatasizeonthereg-functionofkernelADALINE(=0:1,n=600) 6.3WellposednessAnalysisofKernelLeastMeanSquare Thekernelleast-mean-square(KLMS)isdiscussedinChapter 2 .Theupdateequationisquotedbelowforconvenience !(i)=!(i1)+'(i)(d(i)'(i)T!(i1))(6{34) 128

PAGE 129

Usuallywewillstopati=Nbygoingthroughallthetrainingdataonce,i.e. !KLMS=!(N)=NXj=1e(j)'(j)(6{35) andsimplytheexpansioncoecientsequaltothepredictionerrorsscaledbythestepsize,i.e. aKLMS=[e(1);:::;e(N)]T(6{36) WecanexpectthattheKLMSalgorithmiswellposed.Firstofall,bythesmallstepsizetheory[ 47 ],theconvergencebehavioroftheKLMSissimilartothatoftheKADALINE.Wecanobtainsimilarresultsto( 6{31 )fortheKLMSbutinthemeansenseduetoitsstochasticnature[ 59 ]. Inthissection,wetrytoestablishupperboundsforthenormofthesolutionderivedabove.Twodierentapproacheswillbedemonstrated. 6.3.1ModelFreeSolutionNormBound Inthissection,weareabletoshowthatKLMSamountstosolvingalowertriangularlinearsystem.Fromanumericalanalysisviewpoint,toinvertaunitlowertriangularmatrixTismuchmorestablethantoinvertthepositivesemi-denitematrixG.WewillestablishasolutionnormboundforKLMSpurelyfromanumericalanalysisviewpoint. Theorem6.5. eanddarelinearlyrelatedthroughaunitlowertriangularmatrix,Te=d,i.e. 0BBBBBBB@100:::01;210:::0::::::::::::1;N2;N3;N:::11CCCCCCCANN0BBBBBBB@e(1)e(2):::e(N)1CCCCCCCAN1=0BBBBBBB@d(1)d(2):::d(N)1CCCCCCCAN1(6{37) wherei;j=(u(i);u(j))forsimplicity. 129

PAGE 130

Proof. Itiseasytoseethate(i)=d(i)Xi1k=1e(k)(u(k);u(i)) sod(i)=e(i)+Xi1k=1e(k)(u(k);u(i)) fori=1;:::;N.Thiscompletestheproof. Thisresultisveryinteresting.Ittellsusthatinsteadofsolvingalarge-scaledenselinearsystem( 6{7 ),wecanndan\approximate"solutionbysolvingaunitlowertriangularlinearsystem.Moreexplicitly,wehavethefollowingupperbounds: Theorem6.6. Assumeji;jj1.Then jjaKLMSjjvuut NXl=1(1+)2(Nl)!jjdjj (6{38) jj!KLMSjjvuut NNXl=1(1+)2(Nl)!jjdjj (6{39) Noticethatnoassumptionismadeontomaketheresultvalid.Itisessentialtoconstrainforconvergencebuttheupperboundholdseventhealgorithmdiverges.Inthissense,thisisaworst-caseboundandwillbeverylooseinthecaseofconvergence.Totaketheconvergenceanddatastatisticsintoconsideration,anotherupperboundwillbediscussedinthenextsection.Howeveritismodel-basedandthemodelisusuallyunknown.TheproofisincludedinAppendixB. Corollary6.7. TheminimumsingularvalueofTislowerboundedby smin(T)1 PNl=1(1+)2(Nl)(6{40) 130

PAGE 131

Proof. smax(T1)=maxdjjT1djj2 jjdjj2=maxdjjejj2 jjdjj2vuut NXl=1(1+)2(Nl)!(by( 6{38 )) Bynoticingthefactsmin(T)=1=smax(T1),onecompletestheproof. Asweknow,theproblemtoinvertGisthatitssmallestsingularvaluecanbearbitrarilysmall,sohavingalowerboundonthesmallestsingularvalueisverysignicantintermsofstability.Bynow,itisclearthattheKLMSisself-regularized.AsistrueintheTikhonovregularization,theregularizationcontrolsthetrade-obetweenthevarianceandbias,itcanbeseenthatthestepsize,thedatasizeNandtheiterationnumberntogetherplayasimilarrolehereinboththeKADALINEandtheKLMS. 6.3.2ModelBasedSolutionNormBound Assumethetrainingdatafu(i);d(i)gNi=1satisfyamultiplelinearregressionmodelinF: d(i)='(i)T!o+v(i)(6{41) where!oistheoptimalweightandv(i)isthemodelinguncertainty.ThenbytheH1robustnesstheorem[ 82 ]:foranyunknownvector!oandniteenergynoisesequencev(i)withoutanystatisticalassumption,thefollowinginequalityholds Pij=1j^s(j)s(j)j2 1jj!ojj2+Pi1j=1jv(j)j2<1;foralli=1;2;:::;N(6{42) ifandonlyifthematricesf1I'(i)'(i)Tgarepositive-deniteforalli.Intheinequality,s(i)=(!o)T'(i)and^s(i)=!(i1)T'(i),where!(i1)isgivenbytheKLMSrecursion( 6{34 ). Thisresultwillbeusedtoprovethefollowingtheorem. 131

PAGE 132

Lemma6.8. UndertheH1stablecondition,thepredictionerrorsatisesthefollowinginequality: jjejj2<1jj!ojj2+2jjvjj2(6{43) wherev=[v(1);:::;v(N)]T. Proof. Firstnoticethate(i)v(i)=s(i)^s(i) Substitutingitinto( 6{42 ),onehasPij=1je(j)v(j)j2 1jj!ojj2+Pi1j=1jv(j)j2<1;foralli=1;2;:::;N orequivalently,Xij=1je(j)v(j)j2<1jj!ojj2+Xi1j=1jv(j)j2;foralli=1;2;:::;N BythetriangleinequalityPij=1je(j)j2Pij=1je(j)v(j)j2+Pij=1jv(j)j2<1jj!ojj2+Pi1j=1jv(j)j2+Pij=1jv(j)j2;foralli=1;2;:::;N Intermofnorms,jjejj2<1jj!ojj2+2jjvjj2 Theorem6.9. UndertheH1stablecondition,aKLMSand!KLMSareupper-bounded: jjaKLMSjj


PAGE 133

wellposednessaccordingtothedenitionof[ 42 ].Theeectofchoosingasmoothkernelisdemonstratedbythedependencyoftheupperboundonthelargestsingularvalue.InthecaseoftheGaussiankernel,thelargestsingularvalueiswellupperboundedregardlessoftheinputsignalenergyunlikethepolynomialkernel.Ontheotherhand,inthelinearcase,theLMSoperatesinarelativelysmalldimensionalspaceandtheill-posednessduetoinsucienttrainingdataisunlikelytohappen. 6.4Simulations 6.4.1Mackey-GlassTimeSeriesPrediction Theexampleistheshort-termpredictionoftheMackey-Glass(MG)chaotictimeseriesstudiedinChapter 2 .Thetimeembeddingis7,i.e.[x(i7);x(i6);:::;x(i1)]Tareusedastheinputu(i)topredictthepresentonex(i)whichisthedesiredresponsed(i).Asegmentof500samplesisusedasthetrainingdataandanother100pointsasthetestdata.AllthedataiscorruptedbyGaussiannoisewithzeromeanand0.01variance. WecomparetheperformanceoftheKLMS,thekernelADALINEandtheregularizationnetwork.AGaussiankernelwitha=1ischosenforallthealgorithms.Intheregularizationnetwork,everyinputpointisusedasacenterandthetrainingisdoneinbatchmode(directinversiontotheGrammatrix). Figure 6-5 isthelearningcurvesfortheKLMSwithalearningrateof0.1.Aswecansee,theKLMSconvergesveryquicklyandthemeansquareerror(MSE)forthetrainingdatasetandthetestdatasetarecomparableallthetime,whichmeansthenetworkisnotover-learningasthetraininggoeson.Theminimummeansquareerrorofthetestsetis0.0169andtheaveragetestingMSEofthelast50iterationsis0.0177.InFig. 6-6 ,theactualsolutionnormjjaKLMSjjanditsupperbound( 6{38 )areplotted.ItisobservedthattheboundisveryloosewhenNincreases. Figure 6-7 isthelearningcurvesforthekernelADALINEwithalearningrateof0.5.Aswecansee,thekernelADALINEconvergesveryquicklyandthemeansquareerrorforthetrainingdatasetkeepsdecreasingastraininggoesonwhereasthemean 133

PAGE 134

Figure6-5. LearningcurvesofKLMSonbothtrainingandtestingdatasets Figure6-6. SolutionnormofKLMSalongiterationwithitsmodelfreeupperbound squareerrorforthetestdatasetincreasesaftersomepoint.Thispointisactuallytheearly-stoppingpoint.InFigure 6-7 ,theminimalmeansquareerrorofthetestdataoccursatiteration430andtheminimalvalueis0.0155.Andmeanwhile,thetestingMSEisnotverysensitivetothestoppingpointaswecanseeinthelearningcurves.InFigure 6-8 ,theactualsolutionnormjjaKAjjanditsupperbound( 6{32 )areplotted.Theupperboundisfairlytight. Thecross-validationresultfortheregularizationnetworkispresentinFigure 6-9 .Wepick20regularizationparametersintheinterval[0:01;10].Thebestperformanceonthe 134

PAGE 135

Figure6-7. LearningcurvesofkernelADALINEonbothtrainingandtestingdatasets Figure6-8. SolutionnormofkernelADALINEalongiterationanditsupperbound testsetoccurswhentheregularizationparameterequal0.8483andminimalmeansquareerrorofthetestsetis0.0160.ItisalsonoticedinthisexamplethatthetestingMSEisquitesensitivetothechoiceoftheregularizationparameter.ThesolutionnormjjaTRjjanditsupperbound( 6{22 )areplottedinFigure 6-10 TherstsetofsimulationsillustratethatKLMSandKADALINEprovideselfregularizedsolutionswhicharecomparabletotheregularizationnetwork.ItisobservedthattheboundderivedfortheKLMSisveryloose,whichisexpectedbecausewhenthe 135

PAGE 136

Figure6-9. Cross-validationresultofregularizationnetworkwithdierentregularizationparameters Figure6-10. Solutionnormofregularizationnetworkwithdierentregularizationparametersanditsupperbound KLMSconverges,theerrorsarenormallyverysmall.Nowinthissimulation,wewanttoshowthattheboundisactuallyfairlygoodiftheKLMSdiverges. Weusethesameproblemsettingasintherstexample,butwepick10dierentstepsizesfrom0.1to10.Foreachstepsize,werunthealgorithmfor150iterations.Figure 6-11 showshowthesolutionnormchangesasthestepsizeincreases.Itisnotedthattheupperboundisveryloosewhenissmallandfairlytightwhenislarge.Itisexpectedthatwhenthestepsizeissmall,theKLMSalgorithmconvergesandtheerrors 136

PAGE 137

arenormallysmallafterwards.Whilethestepsizeincreases,theKLMSalgorithmdivergesandtheerrorsgooutofcontrol. Figure 6-12 showshowtheactualsolutionnormandtheupperboundchangeswiththeiterationasisxedat10.Itshowsthattheactualsolutionnormincreasesalmostexponentiallywiththeiterationnumberandtheboundisfairlytight. Figure6-11. SolutionnormofKLMSanditsmodel-freeupperboundwithdierentstepsizes(convergenceanddivergence) Figure6-12. SolutionnormofKLMSanditsmodelfreeupperboundalongiteration(=10) Inmostpracticalsituations,theKLMSconverges.Wederivedamodelbasedbound( 6{43 ),butunfortunatelywedonotknow!oorv(i). 137

PAGE 138

Ontheotherhand,bythesmall-step-sizetheory[ 47 ],wehaveJ(m)=E[je(m)j2]Jmin+NJmin 2 whenmissucientlylarge.Therefore,jjejj2NXm=1J(m)N(Jmin+NJmin 2) HereJmin=E[jv(i)j2]andisfairlyboundedbyE[jd(i)j2]underreasonablesignalnoiseratio.Hence jjejj2<(1+N 2)jjdjj2 (6{46) jjaKLMSjj
PAGE 139

Figure6-14. SolutionnormofKLMSanditsmodel-basedupperboundalongiteration(=:03) 6.5Discussions Thischapterinvestigatesthewellposednessofthegradientdescentmethodandstochasticgradientdescentmethod,whichisaveryimportanttopicinthetheoryofmachinelearningandkernelmethods[ 10 20 ]. Byexaminingtheadaptionofthegradientiteration,wendthattheconvergencespeedsalongdierenteigen-directionsarevastlydierent.Itconvergesfasteralongdirectionscorrespondingtolargereigen-values.Inotherwords,theprincipalcomponentsplaymajorroleintheadaptation.Thenbymathematicallyestablishingsolutionnormupperbounds,wegiveaconclusiveanswerthatthegradient-basedmethodspossessaself-regularizationmechanismandexplainrigorouslywhyearly-stoppingworksinthetrainingofneuralnetworks. Althoughthisisatheoreticchapter,weconductsomenumericalsimulationstomainlyillustrateourideas.WetestourtheoryontheMackey-Glasstimeseriespredictionproblemandtheresultsagreewiththetheoryverywell.ItisobservedthattheupperboundfortheKLMSisveryloosesinceitisaworst-caseupperboundanddoesnotconsideranydatastaticsingeneral.Alternativeprobabilisticupperboundsarediscussed 139

PAGE 140

consideringthedatastatisticsandconvergence.Theresultsarealsotestedonthesameexample. Allthediscussionscanbeextendedforanygradientdescentbasedmethods(eitherbatchorstochastic)suchastheKAPA-1.Ontheotherhand,thewellposednessofsecond-ordergradientmethodsisusuallyensuredbytheexplicitregularizationlikeintheKAPA-2andKRLS. 140


CHAPTER 7
CONCLUSION AND FUTURE WORK

7.1 Conclusion

In this work, we have presented a framework and a family of nonlinear adaptive filters, collectively called kernel adaptive filters, including the kernel least mean square (KLMS), the kernel affine projection algorithms (KAPA-1, KAPA-2, KAPA-3, and KAPA-4), the exponentially-weighted kernel recursive least squares (EW-KRLS), the random-walk kernel recursive least squares (RW-KRLS) and the extended kernel recursive least squares (EX-KRLS).

They are natural extensions of the linear adaptive filters within a unifying RKHS framework. They share the same learning strategy, tracing back to the resource-allocating networks [69]. While the resource-allocating networks are built upon intuition and heuristics, we have laid down a rigorous mathematical foundation for this strategy here.

The kernel adaptive filters employ a very nice incremental learning rule, which distinguishes them from other nonlinear sequential learning algorithms in both scope and detail. Compared with conventional methods, which suffer from local minima, slow convergence or intense computation, the kernel adaptive filters have the following features:

- They have the universal approximation property.
- They are formulated as convex optimization problems and hence have no local minima.
- They have moderate complexity in terms of computation and memory.

Furthermore, viewed as online alternatives to the batch-mode kernel methods, the kernel adaptive filters have the following advantages:

- They are suited for active learning, being able to identify redundancy, novelty and abnormality in large data sets.
- They are more efficient in nonstationary environments due to their tracking ability.
- They are simpler in terms of computation and memory.

Besides, this work also proposes a new criterion for active data selection in sequential learning problems, based on conditional information. It incorporates the existing methods [28, 69] in a principled and rigorous information-theoretic framework and demonstrates superior or equivalent performance compared with those methods.

This work has also presented a wellposedness analysis of the KLMS. Since kernel methods are usually formulated in a very high-dimensional space, a wellposedness analysis for the case of finite training data is crucial. The analysis intuitively, and also decisively, shows that the KLMS trained with finite data is well posed and thus no explicit regularization is needed.

7.2 Future Work

We have presented elegant nonlinear versions of linear adaptive filters, where improvement is possible either in bettering their practical implementation or in completing the theoretical details. Hence, future work will address these main issues.

7.2.1 Kernel Kalman Filter

We have presented a kernel extended RLS for a simple tracking model, but our ambition is to derive a kernel version of the full-blown state-space model. Given the high dimensionality of the feature space and the kernel trick that is employed, it might not be feasible to derive it in general, but with some simplification a kernel extended RLS for a richer state-space model would be welcome and possibly more feasible.

7.2.2 Characterize the Properties through Applications

Though many simulations have been conducted to validate their applicability, we still know little about the class of problems for which kernel filters are suitable, and under what conditions they outperform other filtering methods in the literature. Since real-world data are normally nonstationary and noisy, these appear as the two most important issues that need to be studied. Moreover, we still do not know the class of nonlinear models to which kernel filters are particularly well matched.
Possible applications that seem interesting include dynamic modeling, novelty detection, and financial time series prediction.

7.2.3 Kernel Design

All kernel methods have the freedom of choosing the kernel type and parameters. The most popular method so far is cross-validation [2, 18, 75]. A nearest neighbor method is also used in the resource-allocating networks [69], which allows adaptation of the kernel size during learning. With its close relation to Gaussian process theory, marginal maximum likelihood [77] is also applicable to kernel adaptive filters. Besides, the general kernel selection problem has been examined from a more rigorous viewpoint, formulating the problem as a convex optimization through parameterization of the kernel function family [3, 19, 63]. It will be interesting to investigate this problem specifically for the kernel adaptive filters.

APPENDIXAALD-STABILITYTHEOREM ThefollowingistheproofofTheorem 4.3 .FirstwehavethefollowinglemmafortheSchurcomplementr(i+1): LemmaA.1. hr(i+1)v;vi=infwhG(i+1)264wv375;264wv375i(A{1) foranyrealnumberv. Proof. Letwv=G(i)1h(i+1)v.ThenG(i+1)264wvv375=2640r(i+1)v375 Hence, hr(i+1)v;vi=hG(i+1)264wvv375;264wvv375i(A{2) Therefore hr(i+1)v;viinfwhG(i+1)264wv375;264wv375i(A{3) Toprovethereverseinequality(),observethat hG(i+1)264wvv375;264wvv375i=hr(i+1)v;vi=hG(i+1)264wvv375;264wv375i(A{4) foranyw.ApplyingCauchy-SchwarzinequalityintheinnerproductgeneratedbyG(i+1)andcancelingacommonfactor,wehavethat hr(i+1)v;vi=hG(i+1)264wvv375;264wvv375ihG(i+1)264wv375;264wv375i(A{5) 144

PAGE 145

forallw.Takingtheinmumoverallw,wendthatthereverseinequalityalsoholds. Nowweprovethetheorem. TheoremA.2. min(i+1)2min(i)r (g+r+min(i))+p (g+r+min(i))28min(i)r(A{6) Theproofisquitecomplicatedandwebreakitintothreeparts.Thefollowingnotationsareusedforshort:g:=(u(i+1);u(i+1)),r:=r(i+1)andh:=h(i+1). Proof. Firstly,bythepropertyofthesmallesteigenvalue[ 40 ] infw;v;jjwjj2+jjvjj2=1hG(i+1)264wv375;264wv375i=min(i+1)(A{7) DenotingtheeigenvectorofG(i+1)w.r.t.min(i+1)is[z;v1]Twithjj[z;v1]Tjj=1,onehasmin(i+1)=hG(i+1)264zv1375;264zv1375iinfwhG(i+1)264wv1375;264wv1375i=rv21(bylemma A.1 ) Hencewehavealowerboundformin(i+1), min(i+1)rv21(A{8) Howeverifjv1jistoosmall,thislowerboundistrivial. 145

PAGE 146

Secondly,v1isnotarbitraryandisinterrelatedwithwandmin(i+1)bythefollowingequationG(i)z+hv1=min(i+1)z (A{9)hTz+gv1=min(i+1)v1 (A{10) whichisthebasicequalityofeigenvalueandeigenvector.Ifv1=0,onehasG(i)z=min(i+1)z (A{11)hTz=0 (A{12) i.e.min(i+1)=min(i).Inthiscase,thelowerboundof( A{8 )becomesveryloose. Theideaistoutilizetheseconstraints( A{9 )and( A{10 ),togetabetterbound.BymultiplyingzTon( A{9 ),onehas zThv1=zTmin(i+1)zzTG(i)z(A{13) SincezTh=hTz,bysubstituting( A{10 )into( A{13 ),onehas (gmin(i+1))v21=min(i+1)jjzjj2+zTG(i)z(A{14) Recallthatjjzjj2+v21=1.Ifv21=1andjjzjj2=0,onehasmin(i+1)=g,whichisgreaterthanr(i+1)andconsistentwith( A{8 ).Nowifjjzjj2>0,onehas zTG(i)z jjzjj2=(gmin(i+1))v21 jjzjj2+min(i+1)(A{15) Therefore,itfollowsthat (gmin(i+1))v21 jjzjj2+min(i+1)=zTG(i)z jjzjj2min(i)(A{16) Aftersomerearrangement,wehavethesecondlowerbound (1v21 jjzjj2)min(i+1)gv21 jjzjj2+min(i)(A{17) 146

PAGE 147

Thirdly,ifv21iscloseto1,( A{8 )istighterandontheotherhandifv21iscloseto0,( A{17 )ismuchtighter.Itsucestoknowthatmin(i+1)cannotbearbitrarilysmall.Thelowestvalueonthelowerboundisachievedatthecrosspointofthetwo,i.e. (1v21 jjzjj2)rv21=gv21 jjzjj2+min(i)(A{18) Solvingthisequation,onehas v21=(g+r+min(i))p (g+r+min(i))28min(i)r 4r(A{19) Noticethatr=ghTG(i)1hg,so(g+r+min(i))28min(i)r(2r+min(i))28min(i)r=(2rmin(i))20 Pickingthesmallersolutionandusing( A{8 ),onehas min(i+1)(g+r+min(i))p (g+r+min(i))28min(i)r 4(A{20) whichholdsinallsituations.Noticethat (g+r+min(i))p (g+r+min(i))28min(i)r 4=2min(i)r (g+r+min(i))+p (g+r+min(i))28min(i)r(A{21) Thiscompletestheproof. 147

PAGE 148

APPENDIXBSOLUTIONNORMUPPERBOUNDS B.1KernelADALINE BelowistheproofofTheorem 6.4 .Webreakitintotwoparts. LemmaB.1. Assumej1x2=Nj<1andx0.j1(1x2=N)n xjr 2 Nn Proof. Letz=p NxandH(z)=1(1z2)n z.jH(z)j=1 z[1(1z2)]j[1+(1z2)+:::+(1z2)n1]j=zj[1+(1z2)+:::+(1z2)n1]jz[1+j(1z2)j+:::+j(1z2)n1j]zn forallz.Substitutingz=p Nx,onehasj1(1x2=N)n xjn Nx Usingthefactthat0xp 2N=,onehasj1(1x2=N)n xjr 2 Nn TheoremB.2. Assumej1s2i=Nj<1,8i.jj!KA;njjr 2 Nnjjdjj 148

PAGE 149

Proof. jj!KA;njj=jjPdiag([1(1s21=N)n]s11;:::;[1(1s2r=N)n]s1r;0;:::;0)QTdjjjjdiag([1(1s21=N)n]s11;:::;[1(1s2r=N)n]s1r;0;:::;0)jjjjdjjr 2 Nnjjdjj(usingLemma B.1 ) whereP,Qareorthogonalmatrices. LemmaB.3. Assumej1x2=Nj<1andx0.j1(1x2=N)n x2jn N Proof. j1(1x2=N)n x2j=1 x2[1(1x2=N)]j[1+(1x2=N)+:::+(1x2=N)n1]j= Nj[1+(1x2=N)+:::+(1x2=N)n1]jn N TheoremB.4. Assumej1s2i=Nj<1,8i. jjaKA;njjn Njjdjj(B{1) Proof. Firstusingthesametechniquetoderive!KA;n,wehave aKA;n=Qdiag([1(1s21=N)n]s21;:::;[1(1s2r=N)n]s2r;0;:::;0)QTd(B{2) ThenitisstraightforwardtohavetheconclusionbyusingLemma B.3 Thereforeusingtherelationthat!KA;n=aKA;n,wehaveanotherboundfor!KA;n. 149

PAGE 150

CorollaryB.5. Assumej1s2i=Nj<1,8iandtr(G)N.jj!KA;njjn p Njjdjj Proof. jj!KA;njj2=aTKA;nGaKA;ns21jjaKA;njj2N(n N)2jjdjj2=(2n2 N)jjdjj2 usingthefactthats21isthelargesteigenvalueofGandissmallerthantr(G)N.Thiscompletestheproof. B.2KernelLeastMeanSquare Inthefollowing,weproveTheorem 6.6 LemmaB.6. Assumeji;jj1.Then jXl=1je(l)jjXl=1(1+)jljd(l)j(B{3) Proof. Toprove( B{3 ),weuseinductiononj.First,observethattheresultholdsforj=1.Next,weprovethattheestimateistrueforsomej>1,assumingthat( B{3 )holdsforj1. Bythebacksubstitutionalgorithmappliedto( 6{37 ),wehavee(j)=d(j)j1Xk=1e(k)k;j Hence, je(j)jjd(j)j+j1Xk=1je(k)jjk;jj(B{4) 150

PAGE 151

Usingthisestimate,wehavejXl=1je(l)j=je(j)j+j1Xl=1je(l)j(jd(j)j+j1Xk=1je(k)jjk;jj)+j1Xl=1je(l)j(by( B{4 ))jd(j)j+(1+)j1Xk=1je(k)j(byji;jj1)jd(j)j+(1+)j1Xl=1(1+)j1ljd(l)j(byinductionassumption)=jd(j)j+j1Xl=1(1+)jljd(l)j=jXl=1(1+)jljd(l)j Thiscompletestheproof. Theassumptionji;jj1holdsiftheGaussiankernel( 1{6 )isused. LemmaB.7. Assumeji;jj1.Then jjejj22NXl=1(1+)2(Nl)!jjdjj22(B{5) Proof. By( B{3 ),onehase(j)jXl=1(1+)jljd(l)j Thereforeitprovidesamax-normboundjjejj1NXl=1(1+)Nljd(l)j 151

PAGE 152

Hence[ 98 ]jjejj22jjejj1jjejj1NXl=1(1+)Nljd(l)j!2NXl=1(1+)2(Nl)!jjdjj22(byCauchy-Schwarzinequality) Thiscompletestheproof. TheoremB.8. Assumeji;jj1.ThenjjaKLMSjjvuut NXl=1(1+)2(Nl)!jjdjj (B{6)jj!KLMSjjvuut NNXl=1(1+)2(Nl)!jjdjj (B{7) Proof. Itisstraightforwardtoprove( B{6 ),soweonlygivetheprooffor( B{7 ). By( 6{35 ),onehasjj!KLMSjj2=2eTGe2s21jjejj22NNXl=1(1+)2(Nl)!jjdjj22 usingthefactthats21isthelargesteigenvalueofGandissmallerthantrace(G)N.Thiscompletestheproof. 152

PAGE 153

REFERENCES [1] H.Akaike.Anewlookatthestatisticalmodelidentication.IEEETransactionsonAutomaticControl,19:716{723,1974. [2] S.An,W.Liu,andS.Venkatesh.Fastcross-validationalgorithmsforleastsquaressupportvectormachineandkernelridgeregression.PatternRecognition,40:2154{2162,2007. [3] A.Argyriou,C.A.Micchelli,andM.Pontil.Learningconvexcombinationsofcontinuouslyparameterizedbasickernels.InProc.ofthe18thConferenceonLearningTheory,pages338{352,2005. [4] N.Aronszajn.Theoryofreproducingkernels.Trans.Amer.Math.Soc.,68:337{404,1950. [5] S.Arulampalam,S.Maskell,N.J.Gordon,andT.Clapp.Atutorialonparticleltersforon-linenon-linear/non-gaussianbayesiantracking.IEEETransactionsofSignalProcessing,50(2):174{188,2002. [6] S.BillingsandS.Fakhouri.Identicationofsystemscontaininglineardynamicsandstaticnonlinearelements.Automatica,18:15{26,1982. [7] C.M.Bishop.NeuralNetworksforPatternRecognition.OxfordUniversityPress,1994. [8] A.Bordes,S.Ertekin,J.Weston,andL.Bottou.Fastkernelclassierswithonlineandactivelearning.J.Mach.Learn.Res.,6:1579{1619,2005. [9] P.A.N.BosmanandD.Thierens.Negativelog-likelihoodandstatisticalhypothesistestingasthebasisofmodelselectioninIDEAs.InD.Whitley,editor,LateBreakingPapersatthe2000GeneticandEvolutionaryComputationConference,pages51{58,LasVegas,Nevada,USA,82000. [10] L.Bottou.Learningwithlargedatasets.NIPS2007tutorial,2007. [11] O.BousquetandA.Elissee.Stabilityandgeneralization.JournalofMachineLearningResearch,2:499{526,2002. [12] P.Brazdil.Learninginmulti-agentenvironments.InProceedingsoftheEuropeanWorkingSessiononLearning,pages598{605,1991. [13] D.BroomheadandD.Lowe.Multivariablefunctionalinterpolationandadaptivenetworks.ComplexSystems,2:321{355,1988. [14] C.J.C.Burges.Atutorialonsupportvectormachinesforpatternrecognition.DataMiningandKnowledgeDiscovery,2(2):121{167,1998. 153

PAGE 154

[15] C.Campbell,N.Cristianini,andA.Smola.Querylearningwithlargemarginclassiers.InProc.17thInternationalConf.onMachineLearning,pages111{118.MorganKaufmann,SanFrancisco,CA,2000. [16] G.CasellaandR.L.Berger.StatisticalInference.DuxburyPress,2001. [17] G.Cavallanti,N.Cesa-Bianchi,andC.Gentile.Trackingthebesthyperplanewithasimplebudgetperceptron.MachineLearning,69:143{167,2007. [18] G.C.CawleyandN.L.C.Talbot.Ecientleave-one-outcross-validationofkernelsherdiscriminantclassiers.PattenRecognition,36:2585{2592,2003. [19] O.Chapelle,V.Vapnik,O.Bousquet,andS.Mukherjee.Choosingmultipleparametersforsupportvectormachines.MachineLearning,46:131{159,2002. [20] R.Collobert.LargeScaleMachineLearning.PhDthesis,UniversitedeParisVI,Paris,France,2004. [21] T.M.CoverandP.Hart.Nearestneighborpatternclassication.IEEETransactionsonInformationTheory,13:21{27,1967. [22] L.CsatoandM.Opper.Sparseonlinegaussianprocesses.NeuralComputation,14:641{668,2002. [23] O.Dekel,S.Shalev-Shwartz,andY.Singer.Theforgetron:Akernel-basedperceptrononaxedbudget.InAdvancesinNeuralInformationProcessingSystems18,pages1342{1372,Cambridge,MA,2006.MITPress. [24] C.DimaandM.Hebert.Activelearningforoutdoorobstacledetection.InProc.ScienceandSystemsI,August2005. [25] A.Doucet,C.Andrieu,andS.Godsill.Onsequentialmontecarlosamplingmethodsforbayesianltering.StatisticsandComputing,10(3):197{208,2000. [26] R.O.Duda,P.E.Hart,andD.G.Stork.PatternClassication.Wiley-Interscience,2000. [27] M.A.El-Gamal.Theroleofpriorsinactivebayesianlearninginthesequentialstatisticaldecisionframework.InJ.W.T.GrandyandL.H.Schick,editors,MaximumEntropyandBayesianMethods,pages33{38.Kluwer,1991. [28] Y.Engel,S.Mannor,andR.Meir.Thekernelrecursiveleast-squaresalgorithm.IEEETrans.onSignalProcessing,52(8):2275{2285,2004. [29] V.V.Fedorov.Theoryofoptimalexperiment.AcademicPress,1972. [30] R.Fierro,G.Golub,P.Hansen,andD.O'Leary.Regularizationbytruncatedtotalleastsquares.Technicalreport,ReportUNIC-93-14,1993. 154


[31] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, pages 242–264, 2001.
[32] M. O. Franz and B. Scholkopf. A unifying view of Wiener and Volterra theory and polynomial kernel regression. Neural Computation, 18:3097–3118, 2006.
[33] T.-T. Frieb and R. F. Harrison. A kernel-based adaline. In Proceedings European Symposium on Artificial Neural Networks 1999, pages 245–250, April 1999.
[34] K. Fukumizu. Active learning in multilayer perceptrons. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 295–301. The MIT Press, 1996.
[35] D. Gabor. Holographic model of temporal recall. Nature, 217:584–585, 1968.
[36] C. Giraud-Carrier. A note on the utility of incremental learning. AI Communications, 2000.
[37] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219–269, 1995.
[38] T. Glasmachers. Second-order SMO improves SVM online and active learning. Neural Computation, 20:374–382, 2008.
[39] L. Glass and M. Mackey. From Clocks to Chaos: The Rhythms of Life. Princeton University Press, Princeton, NJ, 1988.
[40] G. Golub and C. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD.
[41] G. Goodwin and K. Sin. Adaptive Filtering: Prediction and Control. Prentice-Hall, NJ, 1984.
[42] J. Hadamard. Sur les problemes aux derivees partielles et leur signification physique. Princeton University Bulletin, number 23, 1902.
[43] K. Hagiwara and K. Kuno. Regularization learning and early stopping in linear networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000, volume 4, pages 511–516. MIT Press, 2000.
[44] R. Hartley. Transmission of information. Bell System Technical Journal, pages 535–563, July 1928.
[45] B. Hassibi and G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Neural Information Processing Systems, pages 164–171, 1992.
[46] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, second edition, 1998.
[47] S. Haykin. Adaptive Filter Theory. Prentice-Hall, NJ, 2002.


[48] A. Hoerl and R. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[49] A. Howell and H. Buxton. Face recognition using radial basis function neural networks. In Proceedings of British Machine Vision Conference, BMVA, Edinburgh, pages 455–464, 1996.
[50] J. M. Hutchinson. A radial basis function approach to financial time series analysis. Technical Report AITR-1457, 1993.
[51] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82:35–45, 1960.
[52] G. Kechriotis, E. Zarvas, and E. S. Manolakos. Using recurrent neural networks for adaptive communication channel equalization. IEEE Trans. on Neural Networks, 5:267–278, March 1994.
[53] J. Kivinen, A. Smola, and R. C. Williamson. Online learning with kernels. IEEE Trans. on Signal Processing, 52:2165–2176, Aug. 2004.
[54] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–1143, 1995.
[55] K. Lang and G. Hinton. The development of the time-delay neural network architecture for speech recognition. Tech. Report CMU-CS-88-152, Carnegie-Mellon University, Pittsburgh, PA, 1988.
[56] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems 2, pages 598–605, 1990.
[57] D. V. Lindley. On the measure of information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
[58] W. Liu, P. Pokharel, and J. Principe. Recursively adapted radial basis function networks and its relationship to resource allocating networks and online kernel learning. In Proceedings IEEE International Workshop on Machine Learning for Signal Processing 2007, pages 245–250, 2007.
[59] W. Liu, P. Pokharel, and J. Principe. The kernel least mean square algorithm. IEEE Transactions on Signal Processing, 56:543–554, 2008.
[60] E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20:130–141, 1963.
[61] D. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.


[62] V. Z. Marmarelis. Identification of nonlinear biological systems using Laguerre expansions of kernels. Ann. Biomed. Eng., 21:573–589, 1993.
[63] C. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
[64] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller. Fisher discriminant analysis with kernels. In Proceedings of IEEE Neural Networks for Signal Processing Workshop 1999, pages 8–10, 1999.
[65] S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, IEEE Workshop on Neural Networks for Signal Processing VII, pages 511–514. IEEE Press, Piscataway, NJ, 1997.
[66] A. Navia-Vazquez, F. Perez-Cruz, A. Artes-Rodriguez, and A. Figueiras-Vidal. Weighted least squares training of support vector classifiers leading to compact and adaptive schemes. IEEE Trans. Neural Networks, 12:1047–1059, 2001.
[67] E. Parzen. Statistical methods on time series by Hilbert space methods. Technical Report 23, Applied Mathematics and Statistics Laboratory, Stanford University, CA, 1959.
[68] R. L. Plackett. The discovery of the method of least-squares. Biometrika, 59:239–251, 1972.
[69] J. Platt. A resource-allocating network for function interpolation. Neural Computation, 3(2):213–225, 1991.
[70] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE, 78(9):1481–1497, 1990.
[71] T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Amer. Math. Soc., 50:537–544, November 2003.
[72] P. Pokharel, W. Liu, and J. Principe. Kernel LMS. In Proc. International Conference on Acoustics, Speech and Signal Processing 2007, pages 1421–1424, 2007.
[73] J. Principe, B. de Vries, J. Kuo, and P. G. de Oliveira. Modeling applications with the focused gamma net. In Advances in Neural Information Processing Systems, volume 4, pages 143–150, 1992.
[74] J. Quinonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res., 6:1939–1959, 2005.
[75] J. Racine. An efficient cross-validation algorithm for window width selection for nonparametric kernel regression. Communications in Statistics: Simulation and Computation, 22:1107–1114, 1993.


[76] L. Ralaivola and F. d'Alche Buc. Time series filtering, smoothing and learning using the kernel Kalman filter. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, pages 1449–1454, 2005.
[77] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[78] S. Raudys and T. Cibas. Regularization by early stopping in single layer perceptron training. In ICANN, pages 77–82, 1996.
[79] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[80] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2), 1999.
[81] N. P. Sands and J. M. Cioffi. Nonlinear channel models for digital magnetic recording. IEEE Trans. Magn., 29:3996–3998, November 1993.
[82] A. Sayed. Fundamentals of Adaptive Filtering. Wiley, New York, 2003.
[83] B. Scholkopf, R. Herbrich, A. Smola, and R. Williamson. A generalized representer theorem. In Proceedings of the Annual Conference on Computational Learning Theory, pages 416–426, 2001.
[84] B. Scholkopf, A. Smola, and K. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[85] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[86] M. Seeger and C. Williams. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics 9, 2003.
[87] C. Shannon. A mathematical theory of communication. Bell System Technical Journal, pages 379–423, July 1948.
[88] H. Simon. Why should machines learn? In R. S. Michalski, J. G. Carbonell, and T. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach. Tioga Publishing Company, 1983.
[89] A. J. Smola and P. L. Bartlett. Sparse greedy Gaussian process regression. In NIPS, pages 619–625, 2000.
[90] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
[91] Y. Sun, P. Saratchandran, and N. Sundararajan. A direct link minimal resource allocation network for adaptive noise cancellation. Neural Processing Letters, 12(3):255–265, 2000.


[92] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[93] J. A. K. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.
[94] M. Tan. Adaptation-based cooperative learning multiagent systems. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.
[95] P. Tans. Trends in atmospheric carbon dioxide – Mauna Loa. NOAA/ESRL, 2008.
[96] A. Tikhonov and V. Arsenin. Solution of Ill-posed Problems. Winston & Sons, Washington, 1977.
[97] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In P. Langley, editor, Proceedings of ICML-00, 17th International Conference on Machine Learning, pages 999–1006, Stanford, US, 2000. Morgan Kaufmann Publishers, San Francisco, US.
[98] L. N. Trefethen and D. Bau III. Numerical Linear Algebra. SIAM: Society for Industrial and Applied Mathematics, 1997.
[99] W. Tucker. A rigorous ODE solver and Smale's 14th problem. Foundations of Computational Mathematics, 2:53–117, 2002.
[100] S. V. Vaerenbergh, J. Via, and I. Santamaria. A sliding-window kernel RLS algorithm and its application to nonlinear channel identification. In Proc. International Conference on Acoustics, Speech and Signal Processing 2006, pages 789–792, May 2006.
[101] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.
[102] V. Volterra. Sopra le funzioni che dipendono da altre funzioni. Rend. R. Academia dei Lincei 2 Sem., 59:97–105, 1887.
[103] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
[104] E. A. Wan and R. van der Merwe. The unscented Kalman filter for nonlinear estimation. In Proc. of IEEE Symposium 2000 (AS-SPCC), Lake Louise, Alberta, Canada, 2000.
[105] B. Widrow and M. E. Hoff. Adaptive switching circuits. IRE WESCON Convention Record, 4:96–104, 1960.
[106] N. Wiener. Nonlinear Problems in Random Theory. Wiley, NJ, 1958.
[107] C. K. I. Williams. Gaussian processes. In M. A. Arbib, editor, Handbook of Brain Theory and Neural Networks, pages 466–470. The MIT Press, 2002.


[108] C. K. I. Williams and M. Seeger. Using the Nystrom method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, chapter 13, pages 682–688. MIT Press, 2001.
[109] R. L. Winkler. Introduction to Bayesian Inference and Decision. Probabilistic Publishing, second edition, 2003.


BIOGRAPHICAL SKETCH

Weifeng (Aaron) Liu grew up in Shanghai, China. He received his B.S. and M.S. degrees in electrical engineering from Shanghai Jiao Tong University in 2003 and 2005, respectively. His earlier research experience includes optimizing motion estimation algorithms for H.264 video compression at Bell Labs Shanghai, serving as designer and DSP programmer for a satellite communication system, and working on wireless communication (Turbo Hybrid Automatic Repeat reQuest) at Shanghai Jiao Tong University. In 2005, he joined the Computational NeuroEngineering Laboratory at the University of Florida as a Ph.D. student. His research focuses on signal processing, adaptive filtering, and machine learning. He has four publications in refereed journals and nine conference papers.