Kernel Temporal Differences for Reinforcement Learning with Applications to Brain Machine Interfaces


Material Information

Title:
Kernel Temporal Differences for Reinforcement Learning with Applications to Brain Machine Interfaces
Physical Description:
1 online resource (130 p.)
Language:
english
Creator:
Bae, Jihye
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Principe, Jose C
Committee Members:
Gader, Paul D
Harris, John Gregory
Banerjee, Arunava

Subjects

Subjects / Keywords:
learning
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Reinforcement learning brain machine interfaces (RLBMI) have been shown to be a promising avenue for practical implementations of BMIs. In the RLBMI, a computer agent and a user in the environment cooperate and learn co-adaptively. An essential component in the agent is the neural decoder which translates the neural states of the user into control actions for the external device in the environment. However, to realize the advantages of the RLBMI in practice, there are several challenges that need to be addressed. First, the neural decoder must be able to handle high dimensional neural states containing spatial-temporal information. Second, the mapping from neural states to actions must be flexible enough without making strong assumptions. Third, the computational complexity of the decoder should be reasonable such that real time implementations are feasible. Fourth, it should be robust in the presence of outliers or perturbations in the environment. We introduce algorithms that take into account these four issues. To efficiently handle the high dimensional state spaces, we adopt temporal difference (TD) learning, which allows learning the state value function using function approximation. For a flexible decoder, we propose the use of kernel-based representations, which provide a nonlinear extension of TD(lambda) that we call kernel temporal difference (KTD)(lambda). Two key advantages of KTD(lambda) are its nonlinear functional approximation capabilities and convergence guarantees that gracefully emerge as an extension of the convergence results known for linear TD learning algorithms. To address the robustness issue, we introduce correntropy temporal difference (CTD) and correntropy kernel temporal difference (CKTD), which are robust alternatives to the mean square error (MSE) criterion employed by conventional TD learning. From state value function estimation, all fundamental features of the proposed algorithms can be observed. However, this is only an intermediate step in finding a proper policy. Therefore, we extend all proposed TD algorithms to state-action value function estimation based on Q-learning: Q-learning via correntropy temporal difference (Q-CTD), Q-KTD(lambda), and Q-CKTD. To illustrate the behavior of the proposed algorithms, we apply them to the problem of finding an optimal policy on simulated sequential decision making with continuous state spaces. The results show that Q-KTD and Q-CKTD are able to find a proper control policy and give stable performance with the appropriate parameters, and that Q-CKTD improves performance in off-policy learning. Finally, the Q-KTD(lambda) and Q-CKTD algorithms are applied to neural decoding in RLBMIs. First, they are applied in open-loop experiments to find a proper mapping between a monkey's neural states and desired positions of a computer cursor or a robotic arm. The experimental results show that the algorithms can effectively learn the neural state-action mapping. Moreover, Q-CKTD shows that the optimal policy can be estimated even without having perfect predictions of the value function with a discrete set of actions. Q-KTD is also applied to closed-loop RLBMI experiments. The co-adaptation of the decoder and the subject is observed. Results show that the algorithm succeeds in finding a proper mapping between neural states and desired actions. The kernel-based representation combined with temporal differences is a suitable approach to obtain a flexible neural state decoder that can be learned and adapted online. These observations show the algorithms' potential advantages in relevant practical applications of RL.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Jihye Bae.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: Principe, Jose C.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2014-08-31

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045881:00001




Full Text

KERNEL TEMPORAL DIFFERENCES FOR REINFORCEMENT LEARNING WITH APPLICATIONS TO BRAIN MACHINE INTERFACES

By

JIHYE BAE

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Jihye Bae

I dedicate this to my family for their endless support.

ACKNOWLEDGMENTS

I would like to sincerely thank my Ph.D. advisor Prof. Jose C. Principe for his invaluable guidance, understanding, and patience. It is hard to imagine that I could complete this program without his continual support. It was a fortune for me to meet Prof. Principe. Thanks to him, I was able to obtain in-depth knowledge in adaptive signal processing and information theoretic learning and to have an unforgettable lifetime opportunity to enhance my view about research.

I would like to thank Prof. Justin C. Sanchez for enriching my knowledge in neuroscience and supporting my research. His willingness and openness to collaborate gave me chances to learn and to conduct practical experiments that have become an important part of this dissertation. In addition, I want to thank Dr. Sanchez's lab members, especially Dr. Eric Pohlmeyer and Dr. Babak Mahmoudi for their help, advice, and fruitful discussions.

I would also like to thank my Ph.D. committee members, Prof. John G. Harris, Prof. Paul D. Gader, and Prof. Arunava Banerjee for their valuable comments and the critical feedback about my research.

I was very fortunate for having the opportunity to be part of the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida. Thanks to CNEL members, I did not only gain knowledge but also had memories that will remain for life. I specially thank my lovely girls, Dr. Lin Li and Dr. Songlin Zhao, for their constant support and wonderful friendship; I will never forget the first day at the University of Florida and CNEL. You will always be with me in my heart. I also thank Austin Brockmeier, Evan Kriminger, and Matthew Emigh for the valuable discussions, help, and for introducing me to the diverse culture of the US. I thank my good old CNEL friend, Stefan Craciun; I will always miss the energetic and exciting corner. I thank CNEL alumni, Dr. Erion Hasanbelliu, Dr. Sohan Seth, and Dr. Alexander Singh Alvarado, for good memories at Greenwich Green and CNEL.

I also thank Dr. Divya Agrawal, Veronica Bolon Canedo, Rakesh Chalasani, Goktug T. Cinar, Rosha Pokharel, Kwansun Cho, Jongmin Lee, Pingping Zhu, Kan Li, In Jun Park, Miguel D. Teixeira, Bilal Fadlallah, Gavin Philips, and Gabriel Nallathambi for their support and good friendship.

Last but not least, I want to thank my family for their encouragement and endless support, including my new family, Dr. Luis Gonzalo Sanchez Giraldo. You are my best friend, classmate, labmate, and lifetime partner. You brought me happiness and faith and enriched my life both as a researcher and as a person.

TABLE OF CONTENTS

ACKNOWLEDGMENTS  4
LIST OF TABLES  8
LIST OF FIGURES  9
ABSTRACT  12

CHAPTER
1 INTRODUCTION  14
2 REINFORCEMENT LEARNING  28
3 STATE VALUE FUNCTION ESTIMATION / POLICY EVALUATION  32
  3.1 Temporal Difference(λ)  33
    3.1.1 Temporal Difference(λ) in Reinforcement Learning  36
    3.1.2 Convergence of Temporal Difference(λ)  37
  3.2 Kernel Temporal Difference(λ)  41
    3.2.1 Kernel Methods  41
    3.2.2 Kernel Temporal Difference(λ)  42
    3.2.3 Convergence of Kernel Temporal Difference(λ)  44
  3.3 Correntropy Temporal Differences  46
    3.3.1 Correntropy  47
    3.3.2 Maximum Correntropy Criterion  48
    3.3.3 Correntropy Temporal Difference  49
    3.3.4 Correntropy Kernel Temporal Difference  50
4 SIMULATIONS - POLICY EVALUATION  53
  4.1 Linear Case  53
  4.2 Linear Case - Robustness Assessment  57
  4.3 Nonlinear Case  65
  4.4 Nonlinear Case - Robustness Assessment  69
5 POLICY IMPROVEMENT  75
  5.1 State-Action-Reward-State-Action  76
  5.2 Q-learning  76
  5.3 Q-learning via Kernel Temporal Differences and Correntropy Variants  77
  5.4 Reinforcement Learning Brain Machine Interface Based on Q-learning with Function Approximation  80

6 SIMULATIONS - POLICY IMPROVEMENT  82
  6.1 Mountain Car Task  82
  6.2 Two Dimensional Spatial Navigation Task  88
7 PRACTICAL IMPLEMENTATIONS  95
  7.1 Open Loop Reinforcement Learning Brain Machine Interface: Q-KTD(λ)  95
    7.1.1 Environment  96
    7.1.2 Agent  96
    7.1.3 Center-out Reaching Task - Single Step  97
    7.1.4 Center-out Reaching Task - Multi-Step  102
  7.2 Open Loop Reinforcement Learning Brain Machine Interface: Q-CKTD  104
  7.3 Closed Loop Brain Machine Interface Reinforcement Learning  107
    7.3.1 Environment  108
    7.3.2 Agent  110
    7.3.3 Results  111
    7.3.4 Closed Loop Performance Analysis  113
8 CONCLUSIONS AND FUTURE WORK  118

APPENDIX
A MERCER'S THEOREM  123
B QUANTIZATION METHOD  124

REFERENCES  125
BIOGRAPHICAL SKETCH  130

LIST OF TABLES

6-1 The average success rate of Q-KTD and Q-CKTD.  93

LIST OF FIGURES

1-1 The decoding structure of the reinforcement learning model in a brain machine interface.  17
2-1 The agent and environment interaction in reinforcement learning.  28
3-1 Diagram of adaptive value function estimation in reinforcement learning.  32
3-2 Contours of CIM(X, 0) in 2-dimensional sample space.  48
4-1 A 13-state Markov chain [6] for the linear case.  54
4-2 Performance comparison over different combinations of eligibility trace rates λ and initial step sizes α0 in TD(λ).  55
4-3 Performance over different kernel sizes in KTD(λ).  56
4-4 Performance comparison over different combinations of eligibility trace rates λ and initial step sizes α0 in KTD(λ) with h = 0.2.  56
4-5 Learning curve of TD(λ) and KTD(λ).  58
4-6 The comparison of state value V(x), x ∈ X, convergence between TD(λ) and KTD(λ).  58
4-7 The performance of TD for different levels (variances σ²) of additive Gaussian noise on the rewards.  59
4-8 The performance change of CTD over different correntropy kernel sizes hc.  61
4-9 Learning curve of TD and CTD when Gaussian noise with variance σ² = 10 is added to the reward.  61
4-10 Performance of CTD corresponding to different correntropy kernel sizes hc, with a mixture of Gaussians noise distribution.  62
4-11 Learning curves of TD and CTD when the noise added to the rewards corresponds to a mixture of Gaussians.  62
4-12 Performance changes of TD with respect to different Laplacian noise variances b².  63
4-13 Performance of CTD depending on different correntropy kernel sizes hc with various Laplacian noise variances.  64
4-14 Learning curve of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward.  65
4-15 A 13-state Markov chain for the nonlinear case.  66

4-16 The effect of λ and the initial step size α0 in TD(λ).  66
4-17 The performance of KTD with different kernel sizes.  67
4-18 Performance comparison over different combinations of λ and the initial step size in KTD(λ) with h = 0.2.  67
4-19 Learning curves of TD(λ) and KTD(λ).  68
4-20 The comparison of state value convergence between TD(λ) and KTD(λ).  69
4-21 Performances of CKTD depending on the different correntropy kernel sizes.  70
4-22 Learning curve of KTD and CKTD.  71
4-23 The comparison of the state value function Ṽ estimated by KTD and Correntropy KTD.  71
4-24 Mean and standard deviation of RMS error over 100 runs at the 2000th trial.  73
4-25 Mean RMS error over 100 runs. Notice this is a log plot in the horizontal axis.  73
5-1 The structure of Q-learning via kernel temporal difference(λ).  79
5-2 The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm.  80
6-1 The Mountain-car task.  83
6-2 Performance of Q-TD(λ) with various combinations of λ and α.  84
6-3 The performance of Q-KTD(λ) with respect to different kernel sizes.  85
6-4 Performance of Q-KTD(λ) with various combinations of λ and α.  85
6-5 Relative frequency with respect to average number of iterations per trial of Q-TD(λ) and Q-KTD(λ).  86
6-6 Average number of iterations per trial of Q-TD(λ) and Q-KTD(λ).  86
6-7 The performance of Q-CKTD with different correntropy kernel sizes.  88
6-8 Average number of steps per trial of Q-KTD and Q-CKTD.  88
6-9 The average success rates over 125 trials and 50 implementations.  90
6-10 The average final filter sizes over 125 trials and 50 implementations.  91
6-11 Two dimensional state transitions of the first, third, and fifth sets with γ = 0.9 and λ = 0.  92

6-12 The average success rates over 125 trials and 50 implementations with respect to different filter sizes.  92
6-13 The change of success rates (top) and final filter size (bottom) with U = 5.  93
6-14 The change of average success rates by Q-KTD and Q-CKTD.  94
7-1 The center-out reaching task for 8 targets.  96
7-2 The comparison of average learning curves from 50 Monte Carlo runs between Q-KTD(0) and MLP.  98
7-3 The average success rates over 20 epochs and 50 Monte Carlo runs with respect to different filter sizes.  99
7-4 The comparison of KTD(0) with different final filter sizes and TDNN with 10 hidden units.  100
7-5 The effect of filter size control on the 8-target single-step center-out reaching task.  101
7-6 The average success rates for various filter sizes.  102
7-7 Reward distribution for right target.  103
7-8 The learning curves for multi-step multi-target tasks.  104
7-9 Average success rates over 50 runs.  106
7-10 Q-value changes per trial during 10 epochs.  107
7-11 Target index and matching Q-values.  108
7-12 The success rates of each target over 1 through 5 epochs.  109
7-13 Performance of Q-learning via KTD in the closed loop RLBMI.  112
7-14 Proposed visualization method.  115
7-15 The estimated Q-values and resulting policy for the projected neural states.  116

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

KERNEL TEMPORAL DIFFERENCES FOR REINFORCEMENT LEARNING WITH APPLICATIONS TO BRAIN MACHINE INTERFACES

By

Jihye Bae

August 2013

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

Reinforcement learning brain machine interfaces (RLBMI) have been shown to be a promising avenue for practical implementations of BMIs. In the RLBMI, a computer agent and a user in the environment cooperate and learn co-adaptively. An essential component in the agent is the neural decoder which translates the neural states of the user into control actions for the external device in the environment. However, to realize the advantages of the RLBMI in practice, there are several challenges that need to be addressed. First, the neural decoder must be able to handle high dimensional neural states containing spatial-temporal information. Second, the mapping from neural states to actions must be flexible enough without making strong assumptions. Third, the computational complexity of the decoder should be reasonable such that real time implementations are feasible. Fourth, it should be robust in the presence of outliers or perturbations in the environment.

We introduce algorithms that take into account these four issues. To efficiently handle the high dimensional state spaces, we adopt temporal difference (TD) learning, which allows learning the state value function using function approximation. For a flexible decoder, we propose the use of kernel-based representations, which provide a nonlinear extension of TD(λ) that we call kernel temporal difference (KTD)(λ).

Two key advantages of KTD(λ) are its nonlinear functional approximation capabilities and convergence guarantees that gracefully emerge as an extension of the convergence results known for linear TD learning algorithms. To address the robustness issue, we introduce correntropy temporal difference (CTD) and correntropy kernel temporal difference (CKTD), which are robust alternatives to the mean square error (MSE) criterion employed by conventional TD learning.

From state value function estimation, all fundamental features of the proposed algorithms can be observed. However, this is only an intermediate step in finding a proper policy. Therefore, we extend all proposed TD algorithms to state-action value function estimation based on Q-learning: Q-learning via correntropy temporal difference (Q-CTD), Q-KTD(λ), and Q-CKTD. To illustrate the behavior of the proposed algorithms, we apply them to the problem of finding an optimal policy on simulated sequential decision making with continuous state spaces. The results show that Q-KTD and Q-CKTD are able to find a proper control policy and give stable performance with the appropriate parameters, and that Q-CKTD improves performance in off-policy learning.

Finally, the Q-KTD(λ) and Q-CKTD algorithms are applied to neural decoding in RLBMIs. First, they are applied in open-loop experiments to find a proper mapping between a monkey's neural states and desired positions of a computer cursor or a robotic arm. The experimental results show that the algorithms can effectively learn the neural state-action mapping. Moreover, Q-CKTD shows that the optimal policy can be estimated even without having perfect predictions of the value function with a discrete set of actions. Q-KTD is also applied to closed-loop RLBMI experiments. The co-adaptation of the decoder and the subject is observed. Results show that the algorithm succeeds in finding a proper mapping between neural states and desired actions. The kernel-based representation combined with temporal differences is a suitable approach to obtain a flexible neural state decoder that can be learned and adapted online. These observations show the algorithms' potential advantages in relevant practical applications of RL.

CHAPTER 1
INTRODUCTION

Research in brain machine interfaces (BMIs) is a multidisciplinary effort involving fields such as neurophysiology and engineering. Developments in this area have a wide range of applications, especially for subjects with neuromuscular disabilities, for whom BMIs may become a significant aid. Neural decoding of motor signals is one of the main tasks that needs to be executed by the BMI.

Neural decoding is a process of extracting information from brain signals. For example, we can reconstruct a stimulus based on the spike trains produced by certain neurons in the brain. The main goal of neural decoding is to characterize the electrical activity of groups of neurons, that is, identifying patterns of behavior that correlate with a given task. This process is a fundamental step towards the design of prosthetic devices that communicate directly with the brain. Ideas from system theory can be used to frame the decoding problem. Bypassing the body can be achieved by modelling the transfer function from brain activity to limb movement and utilizing the output of the properly trained model to control a robotic device to implement the intention of movement.

Some approaches to the design of neural decoding systems involve machine learning methods. In order to choose the appropriate learning method, factors such as learning speed and stability help in determining the usefulness of a particular method. Supervised learning is commonly applied in BMI [19] because of the tremendous body of work in system identification. Given a training set of neural signals and synchronized movements, the problem is to find a mapping between the two, which can be solved by applying supervised learning techniques; the kinematic variables of an external device are set as desired signals, and the system can be trained to obtain the regression model.

[22] showed that well known supervised learning algorithms such as the Wiener filter, the least mean square adaptive filter, and the time delay neural network are able to estimate the mapping from spike trains from the motor cortex to the kinematic variables of a monkey's hand movements. [35] applied linear estimation algorithms including ridge regression and a modified Kalman filter to estimate the cursor position on a computer screen based on a monkey's neural activity; the system was also implemented for closed-loop brain control experiments. In addition, [49] used an echo state network, which is one type of recurrent neural network, to decode a monkey's neural activity in a center-out reach task in closed loop BMIs. Note that when closed loop BMI experiments are conducted using supervised learning, a pre-trained functional regression model is applied to estimate the desired kinematic values; after pre-training, fixed model parameters are applied, and the system does not adapt simultaneously during the experiments.

Even though the supervised learning approach has been applied to neural decoding in real time control of BMIs, it is probably not the most appropriate methodology for the problem because of the absence of ground truth in a paraplegic user who cannot move. In addition, even if the desired signal was available, there are other factors such as brain plasticity that still limit the functionality of supervised learning since frequent calibration (retraining) becomes necessary. In BMIs, it is necessary to have direct communication between the central nervous system and the computer that controls an external device such as a prosthetic arm for disabled individuals. Thus, methods that can adapt and adjust to subtle neural variations are preferred.

When we frame neural decoding as a sequential decision making problem, dynamic programming (DP) is a classical approach to solve such problems. In sequential decision making problems, there is a dynamic system whose evolution is affected by the decisions being made. The goal is to find a decision making rule (feedback policy) that optimizes a given performance criterion. However, DP has the following drawbacks: it assumes all model components including the dynamics and the environment are known, and all states are fully observable.

However, in many practical applications, the above conditions are often unsatisfied. In addition, to find an optimal decision maker, DP requires the evaluation of all the states and controls. This results in high computational demands when the problem dimension scales up (Bellman's curse of dimensionality). Furthermore, direct modelling is rather difficult since there are many factors that need to be accounted for even within the same task and subject.

Although the theoretical foundation of reinforcement learning (RL) is drawn from dynamic programming (DP), RL addresses the drawbacks of dynamic programming because it allows us to achieve an approximation of the optimal value functions of DP without explicitly knowing the environment. On the other hand, RL is one of the representative learning schemes (along with supervised and unsupervised learning) which provides a general framework for adapting a system to a novel environment. RL differs from the other learning schemes in the sense that RL not only observes but also interacts with the environment to collect the information. Also, RL receives reward information from the environment which is frequently delayed by unspecified time amounts. Thus, RL is considered the most realistic class of learning and is rich with many algorithms for on-line learning with low computational complexity.

Reinforcement learning (RL) algorithms are a general framework for system adaptation to a novel environment; this characteristic is similar to the way biological organisms interact with the environment and learn from experience. In RL, it is possible to learn only with information from the environment, and thus the need for a desired signal is suppressed. Therefore, RL is well suited for the neural decoding stage of a BMI application. A BMI architecture based on reinforcement learning (RLBMI) is introduced in [13], and successful applications of this approach can be found in [1, 30, 37].

In the RLBMI architecture, there are two intelligent systems: the BMI decoder in the agent, and the user in the environment. The two intelligent systems learn co-adaptively based on closed loop feedback (Figure 1-1).

The agent updates the state of the environment, namely, the position of a cursor on a screen or a robot's arm position, based on the user's neural activity and the received rewards. At the same time, the subject produces the corresponding brain activity. Through iterations, both systems learn how to earn rewards based on their joint behavior. The BMI decoder learns a control strategy based on the user's neural state and performs actions in goal directed tests that update the state of the external device in the environment. In addition, the user learns the task based on the state of the external device. Notice that both systems act symbiotically by sharing the external device to complete their tasks, and this co-adaptation allows for continuous synergistic adaptation between the BMI decoder and the user even in changing environments.

Figure 1-1. The decoding structure of the reinforcement learning model in a brain machine interface.

Note that in the agent, the proper neural decoding of the motor signals is essential to control the external device that interacts with the physical environment. However, there are several challenges that must be addressed in practical implementations of RLBMI:

1. High dimensional input state spaces
Algorithms must be able to readily handle high dimensional state spaces that correspond to the neural state representations.

2. Nonlinear mappings
The mapping from neural states to actions must be flexible enough to handle nonlinear mappings while making few assumptions.

3. Computational complexity
Algorithms should execute with a reasonable amount of time and resources that allow them to perform control actions in real time.

4. Robustness
The algorithms should handle cases where assumptions may not hold, e.g. the presence of outliers or perturbations in the environment.

In this dissertation, we introduce algorithms that take into account the aforementioned issues.

RL learns optimal control policies (a map from states to actions) by observing the interaction of a learning agent with the environment. At each step, the decision maker decides an action given a state from a system (environment) to generate desirable states. Over time, the controller (agent) learns by interacting with the system (environment) while maximizing a quantity known as total reward. The aim of learning is to derive the optimal control policies to bring the desired behavior into the system, and the optimality is assessed in terms of the expected total reward known as the value function. Therefore, estimating the value function is a fundamental and crucial algorithmic component in reinforcement learning problems.

Temporal difference (TD) learning is a method that can be applied to approximate value functions through incremental computation directly from new experience without having an associated model of the environment. This allows us to efficiently handle high dimensional states and actions by using adaptive functional approximators, which can be trained directly from the data. TD algorithms approximate the value function based on the difference between two estimations corresponding to subsequent inputs in time (temporal difference error).

The introduction of the TD(λ) algorithm in [50] revived the interest of TD learning in the RL community. Here, λ represents an eligibility trace rate which is added to the averaging process over temporal differences to put emphasis on the most recently observed states and to efficiently deal with the delayed reward.

TD(λ) [50] is the fundamental algorithm used to estimate state value functions, which can be utilized to compute an approximate solution to Bellman's equation using parametrized functions. Because TD learning allows system updates directly from the sequence of states, online learning becomes possible without having a desired signal at all times. For a majority of real world prediction problems, TD learning has lower memory and computational demands than supervised learning [50].

Since TD(λ) updates the value function whenever any state transitions are observed, this may cause inefficient use of data. In addition, the manual selection of optimal parameters (step size and the eligibility trace rate) is still required. A poor choice of the step size and the eligibility trace parameters can cause a dramatically slow convergence rate or an unstable system. TD(λ) is also sensitive to the distance between optimal and initial parameters. However, it is popularly applied because of its simplicity and ability to be used for online learning in multi-step prediction problems.

To avoid the possibility of poor performance due to an improper choice of the step size and the initialization of parameters in TD(λ), the least squares TD (LSTD) and recursive least squares TD (RLSTD) were introduced in [8]. Subsequently, an extension to arbitrary values of λ, LSTD(λ), was proposed in [5]. However, in comparison to TD (O(d)), LSTD and RLSTD have increased computational complexity per update: O(d³) and O(d²) respectively, where d is the dimensionality of the state representation space. The necessity of addressing computational efficiency has stimulated further interest in online learning. Incremental least squares TD learning called iLSTD, which achieves per-time-step complexities of O(d), was introduced in [15], and its theoretical analysis extended to iLSTD(λ) can be found in [16]. This iLSTD uses a similar approach to RLSTD, but to update the system and keep a low computational load, it only modifies a single dimension of the weights, the one that corresponds to the largest TD update. However, theoretical analysis shows that convergence cannot be guaranteed under this greedy approach, and modifications that guarantee convergence increase the computational cost dramatically. This makes the above algorithm unattractive for online learning.

Even though all the above methods provide their own advantages such as convergence, stability, or learning rate, they are limited to parametrized linear function approximation, which may not be as flexible, especially in practical applications where little prior knowledge can be incorporated. The importance of finding a proper functional space turns our interest towards nonlinear models, which are generally more flexible. Nonlinear variants of TD algorithms have also been proposed. However, they are mostly based on time delay neural networks, sigmoidal multilayer perceptrons, or radial basis function networks. Despite their good approximation capabilities, these algorithms are usually prone to fall into local minima [3, 7, 20, 54], turning training into an art.

There has been a growing interest in a class of learning algorithms that have nonlinear approximation capabilities and yet allow cost functions that are convex. They are known as kernel based learning algorithms [44]. One of the major appeals of kernel methods is the ability to handle nonlinear operations on the data by indirectly using an underlying nonlinear mapping to a so called feature space (reproducing kernel Hilbert space (RKHS)) which is endowed with an inner product. A linear operation in the RKHS corresponds to a nonlinear operation in the input space; for some kernel functions these properties can lead to universal approximation of functions on the input space. Many of the related optimization problems can be posed as convex (no local minima) with algorithms that are still reasonably easy to compute (using the kernel trick [44]). Recent work in adaptive filtering has shown the usefulness of kernel methods in solving nonlinear adaptive filtering problems [20, 28]. Successful applications of the kernel-based approach in supervised learning are well known through support vector machines (SVM) [4], kernel least squares (KLS) [43], and Gaussian processes (GP) [40]. Kernel-based learning has also been successfully integrated into reinforcement learning [11, 12, 14, 17, 39, 57], demonstrating its potential advantages in this context. Furthermore, kernel methods have been integrated with temporal difference algorithms, showing superior performance in nonlinear approximation problems.

The close relation between Gaussian processes and kernel recursive least squares was exploited in [14] to bring the Bayesian framework into TD learning. Gaussian process temporal difference uses kernels in probabilistic discriminative models based on Gaussian processes, incorporating parameters such as the variance of the observation noise and providing predictive distributions (posterior variance) to evaluate predictions. Similar work using kernel-based least squares temporal difference learning with eligibilities, called KLSTD(λ), was introduced in [58]. Unlike GPTD, KLSTD(λ) does not use a probabilistic approach. The idea in KLSTD is to extend LSTD(λ) [5] using the concept of duality. However, KLSTD(λ) uses a batch update, so its computational complexity per time update is O(n³), which is not practical for online learning.

Here, we will investigate TD learning integrated with kernel methods, which we call kernel temporal difference (KTD)(λ) [1, 2]. We adopt a learning approach based on stochastic gradient methods, which is very popular in adaptive filtering. When combined with kernel methods, the stochastic gradient can reduce the computational complexity to O(n). Namely, we show how KTD(λ) can be derived from the kernel least mean square (KLMS) [27] algorithm. Although the standard setting in supervised learning differs from RL, since RL does not use explicit information from a desired signal at every sample, elements such as the adaptive gain and the approximation error terms can be well exploited in solving RL problems [50]. KTD shares many features with the KLMS algorithm [27] except that the error is now obtained using the temporal differences, i.e. the difference of consecutive outputs is used as the error guiding the adaptation process.

Online KTD(λ) is well suited for nonlinear function approximation. It avoids some of the main issues such as local minima or proper initialization that are common in other nonlinear function approximation methods. In addition, based on the dual representation, we can show other implicit advantages of using kernels. For instance, universal kernels automatically satisfy one of the conditions for convergence of TD(λ).

Namely, linearly independent representations of states are obtained through the implicit mapping associated with the kernel. Even though this non-parametric technique requires a high computational cost that comes with the inherently growing structure, when the problem is highly complicated and requires a large amount of data, these techniques produce better solutions than any other simple linear function approximation methods. In addition, as we will see in this work, there are methods that we can employ to overcome scalability issues such as growing filter sizes [9, 25].

In practice, it is common to face the situation where assumptions about the noise or the model deviate from standard considerations. For example, outliers, which result from unexpected perturbations such as noisy state representations, transitions, or rewards, can be difficult to be accounted for. In such cases, the controller may fail to obtain the desired behavior. To the best of our knowledge, no study has addressed the issue of how noise or small perturbations to the model affect performance in TD learning. Most studies on TD algorithms focus on synthetic experiments such as simulated Markov chains or random walk problems. In our work, we investigate the maximum correntropy criterion (MCC) as an objective function [38] that aims at coping with the above mentioned difficulty.

Correntropy is a generalized correlation measure between two random variables first introduced in [42]. It has been shown that correntropy is useful in non-Gaussian signal processing [26] and effective for many applications under noisy environments [18, 21, 36, 45, 47]. Correntropy can be applied as a cost function, resulting in the maximum correntropy criterion (MCC). A system can be adapted in such a way that the similarity between desired and predicted signals is maximized. MCC serves as an alternative to MSE that uses higher order information, which makes it applicable to cases where Gaussianity and linearity assumptions do not necessarily hold.

MCC has been applied to obtain robust methods for adaptive systems in supervised learning [45, 46, 59]. In particular, an interesting blend between KLMS and the maximum correntropy criterion (MCC) was proposed in [59]. The basic idea of kernel maximum correntropy (KMC) is that the input data is transferred to an RKHS using a nonlinear mapping function, and the maximum correntropy criterion (MCC) is applied as a cost function to minimize the error. It was shown that KMC accurately approximates nonlinear systems, and it was able to reduce the detrimental effects of various types of noise in comparison to the conventional MSE criterion. We will show how KMC can be incorporated into TD learning. Correntropy kernel temporal difference (CKTD) can be derived in a similar way to TD learning when posed as a supervised learning problem. As a result, we obtain a correntropy temporal difference (CTD) algorithm, which extends the TD and KTD algorithms to the robust maximum correntropy criterion.

Note that the TD algorithms we have studied are introduced for state value function estimation given a fixed policy. To solve complete RL problems, the algorithms should allow the construction of near optimal policies. We want to find the optimal state-action mapping (policy) by maximizing a cumulative reward, and this mapping can be exclusively determined by the estimated state-action value function because it quantifies the relative desirability of different state and action pairs.

Actor-Critic is one way to find an optimal policy based on the estimated action value function. This is a well-known method that combines the advantages of policy gradient and value function approximation. The Actor-Critic method contains two separate systems (actor and critic), and each one of the systems is updated based on the other. The actor controls the policy to select actions, and the critic estimates the value function. Thus, after each action is selected from the given policy by the actor, the critic evaluates the policy using the estimated value function.

In [23], it is shown how TD algorithms can be applied to the critic to estimate the value function, while the policy gradient method is applied to update the actor. Based on the gradient of the value function obtained from the critic, the policy in the actor is updated. The critic evaluates the value function given a fixed policy from the actor. However, since the Actor-Critic method includes two systems, it is challenging to adjust them simultaneously.

On the other hand, Q-learning [55] is a simple online learning method to find an optimal policy based on the action value function Q. Despite being a simple approach, Q-learning is commonly used because it is effective, and the agent can be updated based solely on observations. The basic idea of Q-learning is that when the action value Q is close to the optimal action value Q*, the policy, which is greedy with respect to all action values for a given state, is close to optimal. Therefore, we can think of extending the proposed TD algorithms (KTD(λ), CTD, and CKTD) to approximate Q-functions from which we can derive the optimal policy. In particular, we will introduce the Q-CTD, Q-KTD, and Q-CKTD algorithms.

Q-learning is a well known off-policy TD control algorithm; the form of the state-action mapping function (policy) is undetermined, and TD learning is applied to estimate the state-action value function. The convergence of Q-learning with function approximation has been a main concern in its application [51]. [54] showed that it is possible to diverge in Q-learning with nonlinear function approximation. In addition, [3] pointed out that value based RL algorithms can become unstable when combined with function approximation. Despite the above issues, [32] showed convergence properties of Q-learning with linear function approximation under restricted conditions. Furthermore, the extension of the gradient temporal difference (GTD) family of learning algorithms to Q-learning, called Greedy-GQ [29], results in better convergence properties; the system converges independently of the sampling distribution. However, Greedy-GQ may get stuck in local minima even with linear function approximation because the objective function is non-convex.
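As a concrete reference point for the Q-learning scheme discussed above, here is a minimal tabular Q-learning update with an ε-greedy behavior policy. It is a generic sketch of the standard off-policy rule, not the Q-KTD or Q-CKTD decoders developed later in this dissertation; the toy environment, ε, α, and γ values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # step size, discount, exploration rate

def step(x, a):
    # Hypothetical environment: random next state, reward 1 only in state 0.
    x_next = int(rng.integers(n_states))
    return x_next, float(x_next == 0)

x = int(rng.integers(n_states))
for n in range(10000):
    # epsilon-greedy action selection over the current action-value estimates
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[x]))
    x_next, r = step(x, a)
    # Off-policy TD target: greedy (max) action value at the next state.
    td_error = r + gamma * np.max(Q[x_next]) - Q[x, a]
    Q[x, a] += alpha * td_error
    x = x_next
```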

Although [29, 32] showed the feasibility of applying Q-learning with linear function approximation, the use of a nonlinear function approximator in Q-learning has not yet been actively considered, mainly because of the lack of convergence guarantees. However, incorporating the kernel-based representation may bring the advantages of nonlinear function approximation, while the convergence properties of linear function approximation in Q-learning would still hold.

A convergence result for Q-learning using linear function approximation by temporal difference (TD)(λ) [50] is introduced in [32]. They proved that when the learning policy and the greedy policy are close enough, the algorithm converges to a fixed point of a recursion based on the Bellman operator they introduced in [32]. Their convergence result is based on a relation between the autocorrelation matrices of the basis functions with respect to the learning policy and the greedy policy. In addition, they assume a compact state space with a finite set of bounded linearly independent basis functions. In Q-KTD, the representation space is possibly infinitely dimensional. Therefore, the direct extension of the results from [32] would require an extended version of the ordinary differential equation (ODE) method to a Hilbert space valued differential equation.

Since the policy is not fixed in Q-learning, it is required that the system explores the environment and learns under changing policies. The system should respond accordingly and be able to disregard large changes that may result from exploration. To address this problem we explore robustness through the maximum correntropy criterion in the context of changing policies.

As we mentioned above, one of the practical objectives of our work is to apply the proposed TD algorithms to neural decoding within the reinforcement learning brain machine interface framework. In the RLBMI structure, the agent learns how to translate the neural states into actions (directions) based on predefined reward values from the environment.

Since there are two intelligent systems, a BMI decoder in the agent and a BMI user in the environment, in closed loop feedback, we can understand the system as a cooperative game. In fact, the BMI user has no direct access to actions, and the agent must interpret the user's brain activity correctly to facilitate the rewards [13]. Therefore, the proposed algorithms can be applied to the agent, which decodes the neural states, transforming them to the proper action directions that are in turn executed by an external device such as a computer screen or a robotic arm. The updated position of the actuator will influence the user's subsequent neural states because of the visual feedback involved in the process. That is how the two intelligent systems learn co-adaptively and the closed loop feedback is created. In other words, the input to the BMI decoder is the user's neural states, which can be considered as the user's output. Likewise, the action directions of the external device are the decoder's output, and because of the visual feedback they can also be considered as the input to the user.

We will examine the capability of the Q-KTD algorithm both in open and closed loop reinforcement learning brain machine interfaces (RLBMI) to perform reaching tasks. The closed loop RLBMI experiment will show how the two intelligent systems co-adaptively learn in a real time reaching task. Note that Q-learning via KTD (Q-KTD) is powerful in practical applications due to its nonlinear approximation capabilities. Also, this algorithm is advantageous for real time applications since parameters can be chosen on the fly based on the observed input states, and no normalization is required. In addition, we will see the performance of Q-learning via correntropy KTD (Q-CKTD) in open loop RLBMI experiments, and see how correntropy can improve performance under changing policies.

The main contributions of this thesis are the three new state value function approximation algorithms based on temporal difference algorithms: kernel temporal difference(λ), correntropy temporal difference, and correntropy kernel temporal difference.

The proposed algorithms are extended to find a control policy in reinforcement learning problems based on Q-learning, and this leads to Q-learning via kernel temporal difference (Q-KTD)(λ), Q-CTD, and Q-CKTD. Moreover, we provide a theoretical analysis on the convergence and degree of sub-optimality of the proposed algorithms based on the extension of existing results to the TD algorithm and its Q-learning counterpart. Furthermore, we test the algorithms to illustrate their behavior and overall performance both in state value function approximation and policy estimation problems. Finally, we apply the proposed algorithms to RLBMIs, showing how the developed methodology can be useful in relevant practical scenarios.

CHAPTER 2
REINFORCEMENT LEARNING

We will show the background of RL including the mathematical formulation of the value function in Markov decision processes, and in the following chapters, we will see how the temporal differences can be derived for value function estimation and applied to RL algorithms.

In reinforcement learning, a controller (agent) interacts with a system (environment) over time and modifies its behavior to improve performance. This performance is assessed in terms of cumulative rewards, which are assigned based on a task goal. In RL the agent tries to adjust its behavior by taking actions that will increase the reward in the long run; these actions are directed towards the accomplishment of the task goal (Figure 2-1).

Figure 2-1. The agent and environment interaction in reinforcement learning.

Assuming the environment is a stochastic process (that is, if a certain state is visited different times and the same action is taken, the following state may not be the same each time) and a stationary process that satisfies the Markov condition

  P(x(n) | x(n−1), x(n−2), …, x(0)) = P(x(n) | x(n−1)),    (2-1)

it is possible to model the interaction between the learning agent and the environment as a Markov decision process (MDP). For the sake of simplicity, we assume the states and actions are discrete, but they can also be continuous. A Markov decision process (MDP) consists of the following elements:

- x(n) ∈ X: states
- a(n) ∈ A: actions
- R^a_{xx'}: (X × A) × X → R: reward function over states x' ∈ X given a state-action pair (x, a) ∈ X × A,

  R^a_{xx'} = E[r(n+1) | x(n) = x, a(n) = a, x(n+1) = x'].    (2-2)

- P^a_{xx'}: state transition probability that gives a probability distribution over states X given a state-action pair in X × A,

  P^a_{xx'} = P(x(n+1) = x' | x(n) = x, a(n) = a).    (2-3)

At time step n, the agent receives the representation of the environment's state x(n) ∈ X as input, and according to this input the agent selects an action a(n) ∈ A. By performing the selected action a(n), the agent receives a reward r(n+1) ∈ R, and the state of the environment changes from x(n) to x(n+1). The new state x(n+1) follows the state transition probability P^a_{xx'} given the action a(n) and the current state x(n). At the new state x(n+1), the process repeats; the agent takes an action a(n+1), and this will result in a reward r(n+2) and a state transition from x(n+1) to x(n+2). This process continues either indefinitely or until a terminal state is reached, depending on the process.

There are two important concepts associated with the agent: the policy and value functions.

- Policy π: X → A is a function that maps a state x(n) to an action a(n).
- The value function is a measure of the long-term performance of an agent following a policy π starting from a state x(n),

  State value function:   V^π(x(n)) = E_π[R(n) | x(n)]    (2-4)
  Action value function:  Q^π(x(n), a(n)) = E_π[R(n) | x(n), a(n)]    (2-5)

where R(n) is a return.

A common choice for the return is the infinite-horizon discounted model

  R(n) = Σ_{k=0}^∞ γ^k r(n+k+1),   0 < γ < 1,    (2-6)

that takes into account the rewards in the long run, but weights them with a discount factor γ that prevents the function from growing unbounded as k → ∞ and also provides mathematical tractability [52].

The objective of RL is to find a good policy that maximizes the expected reward of all future actions given the current knowledge. Since the value function represents the expected cumulative reward given a policy, the optimal policy can be obtained based on the value function; a policy π is better than another policy π' when the policy π gives greater expected return than the policy π'. In other words, π ≥ π' when V^π(x) ≥ V^{π'}(x) or Q^π(x, a) ≥ Q^{π'}(x, a) for all x ∈ X and a ∈ A. Therefore, the optimal state value function V* is defined by V*(x(n)) = max_π V^π(x(n)), and the optimal action value function Q* can be obtained by Q*(x(n), a(n)) = max_π Q^π(x(n), a(n)).

When we have complete knowledge of R^a_{xx'} and P^a_{xx'}, an optimal policy can be directly computed using the definition of the value function. For a given policy π, the state value function V^π can be expressed as

  V^π(x(n)) = E_π[R(n) | x(n)]    (2-7)
            = E_π[ Σ_{k=0}^∞ γ^k r(n+k+1) | x(n) ]    (2-8)
            = E_π[ r(n+1) + γ Σ_{k=0}^∞ γ^k r(n+k+2) | x(n) ]    (2-9)
            = Σ_a π(x(n), a(n)) Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ E_π[ Σ_{k=0}^∞ γ^k r(n+k+2) | x' ] ]    (2-10)
            = Σ_a π(x(n), a(n)) Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^π(x') ].    (2-11)
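As a small numerical illustration of the recursion in Equation (2-11), the sketch below performs iterative policy evaluation on a hypothetical two-state, two-action MDP; the transition probabilities, rewards, policy, and γ are made-up values chosen only to exercise the formula.

```python
import numpy as np

# Hypothetical model: P[x, a, x'] = P^a_{xx'}, R[x, a, x'] = R^a_{xx'}.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
pi = np.array([[0.5, 0.5], [0.9, 0.1]])   # pi[x, a]: a fixed stochastic policy
gamma = 0.9

# Repeatedly apply the right-hand side of Eq. (2-11) until the values settle.
V = np.zeros(2)
for _ in range(500):
    V = np.array([sum(pi[x, a] * sum(P[x, a, x2] * (R[x, a, x2] + gamma * V[x2])
                                     for x2 in range(2))
                      for a in range(2)) for x in range(2)])
print("V_pi:", V)
```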

The optimal policy is obtained by selecting an action a(n) satisfying

  V*(x) = max_a Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V*(x') ].    (2-12)

Equation (2-12) is commonly known as the Bellman optimality equation for V*. For the action value function Q, the optimality equation can be obtained in a similar fashion:

  Q*(x, a) = Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ max_{a'} Q*(x', a') ].    (2-13)

The solution to equations (2-12) and (2-13) can be obtained using dynamic programming (DP) methods. However, this procedure is infeasible when the number of variables increases due to the exponential growth of the state space (curse of dimensionality) [52]. RL allows us to find policies which approach the Bellman optimal policies without explicit knowledge of the environment (P^a_{xx'} and R^a_{xx'}); as we will see in the following chapter, in reinforcement learning, temporal difference (TD) algorithms approximate the value functions by learning the parameters using simulations rather than using the explicit state transition probability P^a_{xx'} and reward function R^a_{xx'}. The estimated value functions will allow comparisons between policies and thus guide the optimal policy search.

In this chapter we reviewed the learning paradigm and basic components of reinforcement learning. The interaction between agent and environment is an important feature in RL, and policy and value functions are key concepts in the agent providing the control; based on the value function, a proper policy can be obtained.
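For contrast with the model-free TD methods developed in the next chapter, the following sketch applies value iteration to the Bellman optimality equation (2-12) when P^a_{xx'} and R^a_{xx'} are fully known. It reuses the same hypothetical two-state model as the previous sketch; all numbers are illustrative assumptions.

```python
import numpy as np

# Same hypothetical 2-state, 2-action model as in the previous sketch.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])    # P[x, a, x'] = P^a_{xx'}
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])    # R[x, a, x'] = R^a_{xx'}
gamma = 0.9

V_star = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: V*(x) = max_a sum_x' P^a_xx' [R^a_xx' + gamma V*(x')]
    V_star = np.max(np.sum(P * (R + gamma * V_star), axis=2), axis=1)

# Greedy policy extracted from the converged optimal value function.
greedy = np.argmax(np.sum(P * (R + gamma * V_star), axis=2), axis=1)
print("V*:", V_star, "greedy actions:", greedy)
```

This is exactly the procedure that becomes infeasible as the state space grows, which motivates the sample-based TD approach of Chapter 3.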

CHAPTER 3
STATE VALUE FUNCTION ESTIMATION / POLICY EVALUATION

Value function estimation is an important sub-problem in finding an optimal policy in reinforcement learning. In this chapter, we will introduce three new temporal difference algorithms: kernel temporal difference (KTD)(λ), correntropy temporal difference (CTD), and correntropy kernel temporal difference (CKTD). The algorithms will be extended based on the conventional temporal difference (TD) algorithm called TD(λ), which is a representative online learning algorithm to estimate the value function. All of the algorithms listed above use the temporal difference (TD) error to update the system, and given a fixed policy, the optimal value function can be estimated based on the TD error.

Figure 3-1 shows how the value function can be estimated using an adaptive system based on the TD error. In an adaptive system, there are two important elements: the learning algorithm, concerned with the class of functions that can be approximated by the system, and the cost function, which quantifies the fitness of the function approximations.

Figure 3-1. Diagram of adaptive value function estimation in reinforcement learning. Given a fixed policy, the value function can be estimated based on the temporal difference error.

We propose using a kernel framework for the mapper. The implicit linear mapping in a kernel space can provide universal approximation in the input space, and many of the related optimization problems can be posed as convex (no local minima) with algorithms that are still reasonably easy to compute (using the kernel trick [44]).

In addition, we apply correntropy as a cost function to find the optimal solution. Correntropy is a robust similarity measure between two random variables or signals when heavy tailed or non-Gaussian distributions are involved [21, 36, 45].

3.1 Temporal Difference(λ)

Temporal difference learning is an incremental learning method specialized for prediction problems, and it provides an efficient learning procedure that can be applied to reinforcement learning. In particular, TD learning allows learning directly from new experience without having a model of the environment. It employs previous estimations to provide updates to the current predictor. In [50], the TD(λ) algorithm is derived as the solution to a multi-step prediction problem.

For a multi-step prediction problem, we have a sequence of input-output pairs (x(1), d(1)), (x(2), d(2)), …, (x(m), d(m)), in which the desired output d can only be observed at time m+1. Then, a system will produce a sequence of predictions y(1), y(2), …, y(m) based solely on the observed input sequences. In general, the predicted output is a function of all previous inputs,

  y(n) = f(x(1), x(2), …, x(n));    (3-1)

here, we assume that y(n) = f(x(n)) for simplicity. The predictor f can be defined based on a set of parameters w, that is,

  y(n) = f(x(n), w).    (3-2)

Writing the multi-step prediction problem as a supervised learning problem, the input-output pairs become (x(1), d), (x(2), d), …, (x(m), d), and the update rule at each time step can be written as

  Δw_n = α (d − y(n)) ∇_w y(n),    (3-3)

where α is the learning rate, and the gradient vector ∇_w y(n) contains the partial derivatives of y(n) with respect to w.

As we mentioned above, the desired value of the prediction d only becomes available at time m+1, and thus the parameter vector w can only be updated after m time steps. The update is given by the following expression

  w ← w + Σ_{n=1}^m Δw_n.    (3-4)

When the predicted output y(n) is a linear function of x(n), we can write the predictor as y(n) = w^T x(n), for which ∇_w y(n) = x(n), and the update rule becomes

  Δw_n = α (d − w^T x(n)) x(n).    (3-5)

The key observation to extend the supervised learning approach to the TD method is that the difference between desired and predicted output at time n can be written as

  d − y(n) = Σ_{k=n}^m (y(k+1) − y(k)),    (3-6)

where y(m+1) ≜ d. Using this expansion in terms of the differences between sequential predictions, we can update the system at each time step. The TD update rule is derived as follows:

  w ← w + Σ_{n=1}^m Δw_n    (3-7)
    = w + α Σ_{n=1}^m (d − y(n)) ∇_w y(n)    (3-8)
    = w + α Σ_{n=1}^m Σ_{k=n}^m (y(k+1) − y(k)) ∇_w y(n)    (3-9)
    = w + α Σ_{k=1}^m Σ_{n=1}^k (y(k+1) − y(k)) ∇_w y(n)    (3-10)
    = w + α Σ_{n=1}^m (y(n+1) − y(n)) Σ_{k=1}^n ∇_w y(k).    (3-11)

In this case, all predictions are used equally. By using exponential weighting on recency, we can emphasize more recent predictions, and this yields the following update rule, which uses the eligibility trace:

  Δw_n = α (y(n+1) − y(n)) Σ_{k=1}^n λ^{n−k} ∇_w y(k).    (3-12)

The eligibility trace is a common method used in RL to deal with delayed reward; it allows propagating the rewards backward over the current state without remembering the trajectory explicitly. Expression (3-12) is known as the TD(λ) update rule [50], and the difference between predictions of sequential inputs is called the TD error

  e_TD(n) = y(n+1) − y(n).    (3-13)

Note that when λ = 0, the update rule becomes

  w ← w + α Σ_{n=1}^m (y(n+1) − y(n)) x(n),    (3-14)

and this has the same form as LMS except that the error term is substituted by the incremental difference in the outputs.

In supervised learning, the predictor can only be updated once the error (difference between predicted output and desired signal) is available. Therefore, in the multi-step prediction problem, the system could not be updated until the error was available at the reward time, which becomes available only in the future at time m+1. In contrast, the TD algorithm allows system updates directly from the sequence of states. Therefore, online learning becomes possible without having the desired signal available at all times. This allows efficient learning in most real world prediction problems; TD learning has lower memory and computational demands than supervised learning, and empirical results show that TD(λ) can provide more accurate predictions [50].
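To make the LMS connection concrete, the sketch below writes the two update rules side by side for a linear predictor y(n) = w^T x(n): the supervised LMS step of Eq. (3-5), which requires the desired value d, and the TD(0) step of Eq. (3-14), which only requires the next prediction. The feature vectors and step size are illustrative assumptions.

```python
import numpy as np

alpha = 0.05                       # step size (illustrative)
w = np.zeros(3)

def lms_step(w, x, d):
    # Supervised LMS (Eq. 3-5): needs the desired outcome d at this step.
    y = w @ x
    return w + alpha * (d - y) * x

def td0_step(w, x, x_next):
    # TD(0) (single term of Eq. 3-14): the error is the temporal difference
    # between successive predictions, so no desired signal is needed yet.
    e_td = w @ x_next - w @ x
    return w + alpha * e_td * x

x, x_next = np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])
w = td0_step(w, x, x_next)         # usable online, before the outcome is known
w = lms_step(w, x, d=1.0)          # usable only once d is revealed
```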

3.1.1 Temporal Difference(λ) in Reinforcement Learning

Now, let us see how the TD(λ) algorithm is employed in RL. When we consider the prediction y as the state value function V given a fixed policy, TD(λ) can approximate the state value function Ṽ using a parametrized family of functions of the form

  Ṽ(x(n)) = w^T x(n)    (3-15)

with parameter vector w ∈ R^d. For convenience, we use V to denote V^π unless we need to indicate different policies. Note that the objective of TD(λ) is to minimize the mean square error (MSE) criterion,

  min E[ (V(x(n)) − Ṽ(x(n)))² ].    (3-16)

Based on (2-11), we can obtain an approximate form of the recursion involved in the Bellman equation as follows:

  Ṽ(x(n)) ≈ r(n+1) + γ Ṽ(x(n+1)).    (3-17)

Thus, the TD error at time n (3-13) can be associated with the following expression

  e_TD(n) = r(n+1) + γ V(x(n+1)) − V(x(n)),    (3-18)

and the error term (3-18) combined with (3-12) gives us the following per time-step update rule:

  Δw_n = α (r(n+1) + γ V(x(n+1)) − V(x(n))) Σ_{k=1}^n λ^{n−k} ∇_w V(k).    (3-19)

Algorithm 1 shows pseudocode for the implementation of the TD(λ) algorithm for linear value function approximation. The algorithm assumes the following information to be given:

- a fixed policy π in the MDP
- a discount factor γ ∈ [0, 1]
- a parameter λ ∈ [0, 1]
- a sequence of step sizes α_1, α_2, … for incremental coefficient updating

Algorithm 1: Pseudocode of the TD(λ) algorithm in reinforcement learning

  Set w = 0 (or an arbitrary estimate)
  Set n = 1
  for n ≥ 1 do
    z(n) = x(n), where x(n) ∈ X is a start state
    while x(n) ≠ terminal state do
      Simulate a one step process producing reward r(n+1) and next state x(n+1)
      Δw_n = (r(n+1) + γ w^T x(n+1) − w^T x(n)) z(n)
      w ← w + α_n Δw_n
      z(n+1) = λ z(n) + x(n+1)
      n = n + 1
    end while
  end for

At each state transition, the algorithm computes the one step TD error r(n+1) + γ w^T x(n+1) − w^T x(n), and depending on the eligibilities z(n) = Σ_{k=n_0}^n λ^{n−k} x(k), a portion of the TD error is propagated back to update the system.
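The following is a minimal runnable rendering of Algorithm 1 for a linear value function, with a simple episodic random-walk chain standing in for the MDP; the chain, reward placement, and parameter values (γ, λ, α) are illustrative assumptions rather than the experiments of Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states = 13                          # nonterminal states, one-hot features
gamma, lam, alpha = 0.9, 0.5, 0.05     # discount, eligibility rate, step size

def features(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)                 # linear value estimate V(x) = w @ x
for episode in range(2000):
    s = n_states // 2                  # start in the middle of the chain
    z = features(s)                    # eligibility trace, z(n) = x(n) at start
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)   # fixed random policy
        terminal = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0           # reward at the right end
        v_next = 0.0 if terminal else w @ features(s_next)
        td_error = r + gamma * v_next - w @ features(s)  # one step TD error
        w += alpha * td_error * z                        # propagate via eligibilities
        if terminal:
            break
        z = lam * z + features(s_next)                   # trace update, as in Algorithm 1
        s = s_next

print(np.round(w, 3))                  # learned state values along the chain
```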

3.1.2 Convergence of Temporal Difference(λ)

We will see that in the cases of λ = 0 and λ = 1, the TD solutions converge asymptotically to the ideal solution under given conditions in absorbing Markov processes. For all other cases λ ≠ 1, the solution also converges, but the answer is different in general from the one given by the least mean squares algorithm. Remember that the conventional TD algorithm assumes that the function class is linearly parametrized satisfying y = w^T x. This assumption will also be considered in the convergence proof for TD with any 0 ≤ λ ≤ 1.

λ = 1 case. The TD(1) procedure is equivalent to the supervised update (3-4), and it gives the same per-sequence weight changes as the supervised learning method, since the TD update is derived by directly replacing the error term in supervised learning using the telescoping identity (3-6) (Theorem 3.1).

Theorem 3.1. On multi-step prediction problems, the linear TD(1) procedure produces the same per-sequence weight changes as the Widrow-Hoff rule [50].

λ = 0 case. The convergence result for linear TD(0) presented in [50] is proved under the assumption that the dynamic system providing the states corresponds to an absorbing Markov process. In an absorbing Markov process, there is a set of terminal states T, a set of non-terminal states N, and transition probabilities p_ij where i ∈ N, j ∈ N ∪ T. The transition probabilities are set such that a terminal state will be visited in a finite number of state transitions. Here, we assume that an initial state is selected with probability μ_i among non-terminal states. Given the initial state x(i), an absorbing Markov process generates a state sequence x(i), x(i+1), …, x(j) where x(j) ∈ T. At the terminal state x(j), the output d is selected from an arbitrary probability distribution with expected value d_j. This procedure converges to the desired behavior asymptotically based on experience. Here, the desired behavior is to map each non-terminal state x(i) to the expected outcome d given the sequence starting from i; thus, the ideal predictions y(i) = f(x(i), w) should be equal to E[d | x(i)], for all i ∈ N. For a completed sequence, we have the following relation

  E[d | x(i)] = Σ_{j∈T} p_ij d_j + Σ_{j∈N} p_ij Σ_{k∈T} p_jk d_k + Σ_{j∈N} p_ij Σ_{k∈N} p_jk Σ_{l∈T} p_kl d_l + …    (3-20)
             = [ Σ_{k=0}^∞ Q^k h ]_i    (3-21)
             = [ (I − Q)^{−1} h ]_i,    (3-22)

where [Q]_ij = p_ij for i, j ∈ N, and [h]_i = Σ_{j∈T} p_ij d_j for i ∈ N. The following theorem shows that TD(0) converges to the ideal predictions for the appropriate step size when the states {x(i) | i ∈ N} are linearly independent.


Theorem 3.2. For any absorbing Markov chain, for any distribution of starting probabilities μ_i, for any outcome distributions with finite expected values d̄_j, and for any linearly independent set of observation vectors {x(i) | i ∈ N}, there exists an ε > 0 such that, for all positive step sizes η < ε and for any initial weight vector, the predictions of linear TD(0) converge in expected value to the ideal predictions, E[w^⊤ x(i)] → E[d | x(i)] = [(I − Q)^{-1} h]_i, ∀i ∈ N [50].

0 < λ < 1 case: The work in [10] extended the convergence of TD to general λ. The TD update rule can be expressed in terms of the expected weights as

w̄_{s,n+1} = w̄_{s,n} + η X D [ Q^s X^⊤ w̄_{s,n} − X^⊤ w̄_{s,n} + (Q^{s−1} + Q^{s−2} + ⋯ + I) h ],

where w̄ are the expected weights and X is the state matrix defined by [X]_ab = [x_a]_b, with a running over the states and b over the dimensions. D is a diagonal matrix satisfying [D]_ab = δ_ab d_a, where δ_ab is the Kronecker delta and d_a is the expected number of times the Markov chain visits state x_a in one sequence, and s denotes the number of state transitions being traced. Multiplying both sides by X^⊤ and reorganizing the equation using (I + Q + Q² + ⋯ + Q^{s−1}) h = (I − Q^s) E[d | x(i)] and w̄_n = (1 − λ) Σ_{s=1}^{∞} λ^{s−1} w̄_{s,n}, we obtain

X^⊤ w̄_{n+1} = X^⊤ w̄_n − η X^⊤ X D [ I − (1 − λ) Q (I − λQ)^{-1} ] ( X^⊤ w̄_n − E[d | x(i)] ).

When the state representations are linearly independent, X has full rank, and the matrix appearing in the right-hand term,

− X^⊤ X D [ I − (1 − λ) Q (I − λQ)^{-1} ],


has a full set of nonzero eigenvalues whose real parts are negative. Therefore, if the above conditions hold, it can be shown that TD(λ) converges with probability 1 using Theorem 3.3 [24].

Theorem 3.3. Let {y(n)} be given by

y(n+1) = y(n) + η_n ( g(y(n)) + ζ_n + β_n ),

satisfying the following assumptions:

1. g is a continuous R^d-valued function on R^d.
2. {β_n} is a sequence of R^d-valued random variables that is bounded with probability 1 and such that β_n → 0 with probability 1.
3. {η_n} is a sequence of positive real numbers such that η_n → 0 and Σ_n η_n = ∞.
4. {ζ_n} is a sequence of R^d-valued random variables such that for some T > 0 and each ε > 0,

   lim_{n→∞} P{ sup_{j≥n} max_{t≤T} ‖ Σ_{i=m(jT)}^{m(jT+t)−1} η_i ζ_i ‖ ≥ ε } = 0,

   where m(t) is defined by max{n : t_n ≤ t} for t ≥ 0 and t_n = Σ_{i=0}^{n−1} η_i.

Also, let {y(n)} be bounded with probability 1. Then there is a null set Ω₀ such that ω ∉ Ω₀ implies that {y_n(ω)} is equicontinuous, and also that the limit y(ω) of any convergent subsequence of {y_n(ω)} is bounded and satisfies the ordinary differential equation (ODE)

ẏ = g(y)

on the time interval (−∞, ∞). Let y₀ be a locally asymptotically stable (in the sense of Liapunov) solution of the ODE with domain of attraction DA(y₀). Then, if ω ∉ Ω₀ and there is a compact set A ⊂ DA(y₀) such that y(n) ∈ A infinitely often, we have y(n) → y₀ as n → ∞ [24].

When we identify y(n) in the theorem with X^⊤ w̄_n, the TD(λ) recursion above satisfies the ODE conditions. Therefore, under the assumptions of Theorem 3.3, the differential


equation is asymptotically stable at E[d | x(i)]; that is, w_n → w* as n → ∞ with probability 1, where X^⊤ w* = E[d | x(i)] and X has full rank.

3.2 Kernel Temporal Difference(λ)

In the previous section, we introduced TD(λ) and observed how the value function can be estimated adaptively. Note that TD(λ) approximates the value function with a linear function, which may be limiting in practice. As an alternative, algorithms with nonlinear approximation capabilities have become a topic of growing interest. Nonlinear variants of TD algorithms have been proposed, mostly based on time delay neural networks, sigmoidal multilayer perceptrons, or radial basis function networks. Despite their good approximation capabilities, these algorithms are usually prone to falling into local minima [3, 7, 20, 54], which removes the optimality guarantees of TD(λ). Kernel methods have become an appealing choice due to their elegant way of dealing with nonlinear function approximation problems; kernel based algorithms have nonlinear approximation capabilities, yet the cost function can remain convex [44]. In the following, we show how the conventional TD(λ) algorithm can be extended using kernel functions to obtain nonlinear variants of the algorithm; we introduce a kernel adaptive filter implemented with stochastic gradient on temporal differences, called kernel temporal difference (KTD)(λ).

3.2.1 Kernel Methods

The basic idea of kernel methods is to nonlinearly map the input data into a high dimensional feature space. Let X be a nonempty set. For a positive definite function κ: X × X → R [28, 44], there exists a Hilbert space H and a mapping φ: X → H such that

κ(x, y) = ⟨φ(x), φ(y)⟩.

The inner product in the high dimensional feature space can thus be calculated by evaluating the kernel function in the input space. Here, H is called a reproducing kernel Hilbert


space (RKHS) because it satisfies the reproducing property

f(x) = ⟨f, φ(x)⟩ = ⟨f, κ(x, ·)⟩, ∀f ∈ H.

The mapping implied by the use of the kernel function can also be understood through Mercer's Theorem (Appendix A) [33]. These properties allow us to transform conventional linear algorithms operating in the feature space into nonlinear systems in the input space, without explicitly computing inner products in the high dimensional space.

3.2.2 Kernel Temporal Difference(λ)

In supervised learning, a stochastic gradient solution to least squares function approximation using a kernel method, called kernel least mean square (KLMS), is introduced in [27]. The KLMS algorithm attempts to minimize the risk functional E[(d − f(x))²] by minimizing the empirical risk J(f) = Σ_{n=1}^{N} (d(n) − f(x(n)))² on the space H induced by the kernel. Using the reproducing property, we can rewrite

J(f) = Σ_{n=1}^{N} [ d(n) − ⟨f, φ(x(n))⟩ ]².

Differentiating the empirical risk J(f) with respect to f and approximating the sum by the current term (stochastic gradient) yields the update rule

f₀ = 0,    f_n = f_{n−1} + η e(n) φ(x(n)),

where e(n) = d(n) − f_{n−1}(x(n)); this is the KLMS algorithm [27]. Given a new state x(n), the output can be calculated using the kernel expansion,

f_{n−1}(x(n)) = f_{n−2}(x(n)) + η e(n−1) κ(x(n−1), x(n))
              = η Σ_{k=1}^{n−1} e(k) κ(x(k), x(n)).
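The kernel expansion above is easy to implement directly. The following is a minimal sketch of KLMS with a Gaussian kernel; the class and method names are ours and are meant only as an illustration, not as a reference implementation of [27].

```python
import numpy as np

class KLMS:
    """Kernel least mean square: f_{n-1}(x) = eta * sum_k e(k) * kappa(x(k), x)."""

    def __init__(self, eta=0.1, kernel_size=0.2):
        self.eta = eta
        self.h = kernel_size
        self.centers = []        # stored inputs x(1), ..., x(n-1)
        self.errors = []         # stored prediction errors e(1), ..., e(n-1)

    def _kernel(self, a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.h ** 2))

    def predict(self, x):
        # kernel expansion of the current function estimate
        return self.eta * sum(e * self._kernel(c, x)
                              for c, e in zip(self.centers, self.errors))

    def update(self, x, d):
        e = d - self.predict(x)                  # error against the desired signal
        self.centers.append(np.asarray(x, dtype=float))
        self.errors.append(e)
        return e
```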


As mentioned above, for a multi-step prediction problem we can simply set y(n) = f(x(n)). Let the function f belong to an RKHS H, as in KLMS. By treating the observed input sequence and the desired prediction as a sequence of pairs (x(1), d), (x(2), d), …, (x(m), d), and setting d ≐ y(m+1), we obtain the update of the function f after the whole sequence of m inputs has been observed as

f ← f + η Σ_{n=1}^{m} Δf_n
  = f + η Σ_{n=1}^{m} e(n) φ(x(n))
  = f + η Σ_{n=1}^{m} [ d − f(x(n)) ] φ(x(n)).

Here, Δf_n = [d − ⟨f, φ(x(n))⟩] φ(x(n)) are the instantaneous updates of the function f from the input data, based on the kernel expansion. Replacing the error d − f(x(n)) by the telescoping sum of temporal differences, d − y(n) = Σ_{k=n}^{m} (y(k+1) − y(k)), and reorganizing the equation as in the TD(λ) derivation from [50], we obtain the update

f ← f + η Σ_{n=1}^{m} [ f(x(n+1)) − f(x(n)) ] Σ_{k=1}^{n} φ(x(k)),

and generalizing for λ yields

f ← f + η Σ_{n=1}^{m} [ f(x(n+1)) − f(x(n)) ] Σ_{k=1}^{n} λ^{n−k} φ(x(k)).

The temporal differences f(x(n+1)) − f(x(n)) can be rewritten using the kernel expansion as ⟨f, φ(x(n+1))⟩ − ⟨f, φ(x(n))⟩. This yields

f ← f + η Σ_{n=1}^{m} ⟨f, φ(x(n+1)) − φ(x(n))⟩ Σ_{k=1}^{n} λ^{n−k} φ(x(k)),

where Δf_n = ⟨f, φ(x(n+1)) − φ(x(n))⟩ Σ_{k=1}^{n} λ^{n−k} φ(x(k)). This update rule is called kernel temporal difference (KTD)(λ) [1, 2]. Using the RKHS properties, the evaluation of the function f at a given x can be calculated as a kernel expansion.
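The KTD(λ) update admits a simple online form in which every visited state becomes a kernel center and the TD error is backed up along an eligibility trace over the stored centers. The sketch below is a hedged illustration for state value estimation with a Gaussian kernel; the class and method names are ours, not the exact implementation used later in the experiments.

```python
import numpy as np

class KTDLambda:
    """Kernel TD(lambda) for state-value estimation, V(x) = sum_k alpha_k * kappa(x_k, x).

    Every visited state is added as a kernel center, and the coefficient of center
    x(k) accumulates eta * lambda^(n-k) * e_TD(n) for every later step n of the trial.
    """

    def __init__(self, eta=0.3, lam=0.0, gamma=1.0, kernel_size=0.2):
        self.eta, self.lam, self.gamma, self.h = eta, lam, gamma, kernel_size
        self.centers, self.alphas, self.trace = [], [], []

    def _kernel(self, a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.h ** 2))

    def value(self, x):
        return sum(a * self._kernel(c, x) for c, a in zip(self.centers, self.alphas))

    def start_trial(self):
        self.trace = [0.0] * len(self.centers)   # eligibilities are reset per trial

    def update(self, x, r, x_next, terminal):
        # the current state becomes a new center with eligibility lambda^0 = 1
        self.centers.append(np.asarray(x, dtype=float))
        self.alphas.append(0.0)
        self.trace = [self.lam * t for t in self.trace] + [1.0]
        v_next = 0.0 if terminal else self.value(x_next)
        td_error = r + self.gamma * v_next - self.value(x)
        for k, t in enumerate(self.trace):       # back up the error along the trace
            self.alphas[k] += self.eta * td_error * t
        return td_error
```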


When λ = 0, the update rule becomes

f ← f + η Σ_{n=1}^{m} ⟨f, φ(x(n+1)) − φ(x(n))⟩ φ(x(n)),

and it is noticeable that this update has exactly the same form as KLMS except for the error term: in supervised learning, the error is the difference between the desired signal and the prediction at time n, whereas in TD learning the error is the difference between successive predictions. In addition, the update can be modified for state value function approximation by replacing the error term with the TD error:

f ← f + η Σ_{n=1}^{m} [ r(n+1) + γV(x(n+1)) − V(x(n)) ] Σ_{k=1}^{n} λ^{n−k} φ(x(k))
  = f + η Σ_{n=1}^{m} [ r(n+1) + ⟨f, γφ(x(n+1)) − φ(x(n))⟩ ] Σ_{k=1}^{n} λ^{n−k} φ(x(k)).

3.2.3 Convergence of Kernel Temporal Difference(λ)

Based on the convergence guarantees for TD(λ), we can extend the results to the convergence of KTD(λ).

λ = 1 case: Theorem 3.1 shows that for TD with λ = 1, the solution converges to the same solution as supervised learning (least squares), because the TD update rule is derived from the supervised one through the telescoping error relation. We can use this relation to show the convergence of KTD(1). The following proposition is proved in [27]:

Proposition 3.1. The KLMS algorithm converges asymptotically in the mean sense to the optimal solution under the small-step-size condition [27].

In a multi-step prediction problem, KTD(1) is derived by replacing the error in supervised learning with the TD error term. Thus, we obtain the following theorem:

Theorem 3.4. On multi-step prediction problems, the KTD(1) procedure produces the same per-sequence weight changes as the least squares solution.


Proof. Since, by the telescoping relation, the sequence of TD errors can be replaced by the multi-step prediction error e(n) = d − y(n), the result of Proposition 3.1 also applies in this case.

This means that KTD(1) also converges asymptotically to the optimal solution when the step size satisfies Σ_n η_n = ∞ and Σ_n η_n² < ∞ for η_n ≥ 0.

λ < 1 case: For the general case (λ < 1), we saw that the convergence of TD relies heavily on the state representation x(n); convergence is proved given that the state feature vectors are linearly independent (Theorems 3.2 and 3.3). Many models can be reformulated using a dual representation, and this idea arises naturally when using kernel functions. We derived KTD(λ) using the dual representation to express the solution of TD(λ) in terms of the kernel function. Note that the weight vector in the RKHS can be expressed as a linear combination of the feature vectors φ(x) (Proposition 3.2). Therefore, we can extend Theorem 3.2 and the convergence proof of TD(0 < λ < 1) to KTD(λ < 1) by showing that the feature map creates a representation of the states in the RKHS that satisfies the linear independence assumption whenever the kernel is strictly positive definite. This implies that the convergence guarantee of TD(λ < 1) carries over to KTD(λ < 1) when the latter is viewed as a linear function approximator in the RKHS.

Proposition 3.2. If κ: X × X → R is a strictly positive definite kernel, then for any finite set {x_i}_{i=1}^{N} ⊆ X of distinct elements, the set {φ(x_i)} is linearly independent.

Proof. If κ is strictly positive definite, then Σ_{ij} α_i α_j κ(x_i, x_j) > 0 for any set {x_i} with x_i ≠ x_j for all i ≠ j, and any α_i ∈ R such that not all α_i = 0. Suppose there exists a set {x_i} for which {φ(x_i)} is not linearly independent. Then, there must be a set of coefficients α_i ∈ R, not


all equal to zero, such that Σ_i α_i φ(x_i) = 0, which implies that ‖Σ_i α_i φ(x_i)‖² = 0:

0 = Σ_{ij} α_i α_j ⟨φ(x_i), φ(x_j)⟩ = Σ_{ij} α_i α_j κ(x_i, x_j),

which contradicts the assumption.

This shows that if a strictly positive definite kernel is used, the condition of linearly independent state representations is satisfied in KTD(λ). This is a necessary condition for the convergence of TD(0) in Theorem 3.2, and of TD(0 < λ < 1) based on the ODE representation from Theorem 3.3.

3.3 Correntropy Temporal Differences

In the previous sections, we focused our attention on the functional mapper of the adaptive system. We now turn our attention to the cost function. A common issue in practical scenarios is that the assumptions about the noise or the model may not hold, or may be subject to perturbations. Most studies on TD algorithms report performance on synthetic experiments such as noiseless Markov chains or random walk problems, and do not usually address how noise or small perturbations to the model affect performance. In practice, noisy state transitions or rewards may be observed, and noise may even be present in the input state representations. Highly noise-corrupted environments lead to difficulties in learning, and this may result in failure to obtain the desired behavior of the controller. One of the most popular figures of merit is the mean square error (MSE), a second order statistic, and methods such as TD(λ) and KTD(λ) use this criterion. It is well known that the MSE criterion is most useful under Gaussianity assumptions [20]; any departure from this behavior can affect performance significantly. Correntropy [42] is an alternative to MSE that has been shown to handle situations where Gaussianity does not hold. One of the main features of correntropy as a cost function is its robustness to large perturbations in the


learning process; performance improvements over MSE in many realistic scenarios, including fat-tail distributions and severe outlier noise, have been demonstrated in [26, 59].

3.3.1 Correntropy

The generalized correlation function called correntropy was first introduced in [42]. Correntropy is defined in terms of inner products of vectors in a kernel feature space,

v(X, Y) = E[ κ(X − Y) ],

where X and Y are two random variables and κ is a translation invariant kernel. When κ is the Gaussian kernel, the Taylor series expansion of correntropy is given by

v(X, Y) = (1/(√(2π) h_c)) Σ_{n=0}^{∞} ( (−1)^n / (2^n h_c^{2n} n!) ) E[ ‖X − Y‖^{2n} ].

This expansion shows that correntropy includes all the even-order moments of the random variable ‖X − Y‖. A different kernel leads to a different expansion, but what is noticeable is that, by using a nonlinear kernel, correntropy contains information beyond the second order statistics of the distribution, and thus it is better suited for nonlinear and non-Gaussian signal processing. It has also been observed that in impulsive noise environments correntropy can obtain performance improvements over the conventional MSE criterion [26],

MSE(X, Y) = E[ (X − Y)² ].

The geometric meaning of correntropy in the sample space can be explained through the correntropy induced metric (CIM), defined as

CIM(X, Y) = ( κ(0) − v(X, Y) )^{1/2},

where the Gaussian kernel κ(x, y) = exp( −‖x − y‖²/(2h_c²) ) is used, and the input space vectors are X = (x₁, x₂, …, x_N)^⊤ and Y = (y₁, y₂, …, y_N)^⊤.


Figure 3-2. Contours of CIM(X, 0) in a 2-dimensional sample space.

Figure 3-2 shows the behavior of the CIM for a sample {(x₁, y₁), (x₂, y₂)} of size N = 2, using a kernel size h_c = 0.2. Whereas MSE measures the L2-norm distance between random variables with finite variance, CIM based on the Gaussian kernel approximates the L2-norm distance only for points that are close; as points get further apart, the metric goes through a transition phase where it resembles the L1-norm distance, and it finally approaches the L0-norm for points that are far away. Notice that if only one of the errors is large, the CIM barely changes as long as the other error remains small. This behavior shows how the CIM can effectively deal with outliers. Furthermore, the kernel bandwidth h_c controls the scale of the CIM norm; a smaller kernel size enlarges the region of L0-norm behavior, and a larger kernel size extends the L2-norm region. Thus, selecting a proper kernel size is necessary.
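This saturation behavior is easy to verify numerically. The short sketch below (illustrative only; the function name and test values are ours) computes a sample-based CIM with the Gaussian kernel and contrasts a single large outlier error with several small errors.

```python
import numpy as np

def cim(x, y, hc=0.2):
    """Sample-based correntropy induced metric with a Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    kappa_zero = 1.0                                      # Gaussian kernel value at zero lag
    v_hat = np.mean(np.exp(-(x - y) ** 2 / (2.0 * hc ** 2)))
    return np.sqrt(kappa_zero - v_hat)

# One large outlier error versus several small errors:
print(cim([10.0, 0.0, 0.0, 0.0], np.zeros(4)))   # the outlier saturates; CIM stays bounded
print(cim([0.1, 0.1, 0.1, 0.1], np.zeros(4)))    # small errors: close to L2-like behavior
```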


3.3.2 Maximum Correntropy Criterion

Correntropy can be used as a cost function, and it has been applied to adaptive systems [45, 46, 59]. Let κ be the shift invariant kernel employed in correntropy. The cost function can be written as

J = E[ κ(e) ] ≈ (1/N) Σ_{n=1}^{N} κ(e(n)).

For a system described by a parametric mapping y = f(x|θ), the parameter set θ can be adapted such that the correntropy of the error signal d − y is maximized. This is called the maximum correntropy criterion (MCC),

MCC = max_θ Σ_{n=1}^{N} κ(e(n)).

The MCC can be understood in the context of M-estimation, a generalized maximum likelihood method that estimates parameters under the cost function

min_θ Σ_{n=1}^{N} ρ( e(n) | θ ),

where ρ is a differentiable function satisfying ρ(e) ≥ 0, ρ(0) = 0, ρ(e) = ρ(−e), and ρ(e(i)) ≥ ρ(e(j)) for |e(i)| > |e(j)|. This general estimation problem is equivalent to a weighted least squares problem,

min_θ Σ_{n=1}^{N} w(e(n)) e(n)²,

where w(e) = ρ′(e)/e and ρ′ is the derivative of ρ. When ρ(e) = (1 − exp(−e²/(2h_c²)))/(√(2π) h_c), the generalized likelihood problem becomes MCC. The relation to the weighted least squares problem becomes obvious by looking at the gradient of J, in which a Gaussian weighting term places more emphasis on small errors and diminishes the effect of large errors. This property is key to the robustness against outliers or sudden perturbations in the error. Notice that the kernel size still controls the weights.
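The weighted least squares view can be seen directly by evaluating the Gaussian weighting implied by MCC for a range of errors: small errors receive a weight close to one, while outliers are effectively ignored. A brief illustration (the function name and example values are ours):

```python
import numpy as np

def mcc_weight(e, hc=1.0):
    """Gaussian weighting implied by MCC, proportional to exp(-e^2 / (2 hc^2))."""
    return np.exp(-np.asarray(e, float) ** 2 / (2.0 * hc ** 2))

errors = np.array([0.1, 0.5, 1.0, 3.0, 10.0])
# Weights shrink rapidly as |e| grows, unlike the constant weighting implied by MSE.
print(mcc_weight(errors))
```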


3.3.3 Correntropy Temporal Difference

A variant of the least mean square (LMS) algorithm using MCC has been formulated for supervised learning [45]. As with the MSE criterion, a stochastic gradient ascent approach can be used to maximize the correntropy between the desired signal d(n) and the system output y(n). Let G denote the Gaussian kernel employed by correntropy. The gradient of the cost function is

∇J_n = ∂v(d(n), y(n))/∂w = (∂G(e(n))/∂e(n)) (∂e(n)/∂w) = (1/h_c²) e(n) G(e(n)) ∇_w y(n).

In addition, in the previously described multi-step prediction problem, the temporal difference (TD) error can be linked to the LMS algorithm through the telescoping recursion. Therefore, we can also apply TD with MCC as follows:

w ← w + η Σ_{n=1}^{m} Δw_n
  = w + η Σ_{n=1}^{m} e(n) G(e(n)) ∇_w y(n)
  = w + η Σ_{n=1}^{m} ( Σ_{k=n}^{m} e(k) ) exp( −(Σ_{k=n}^{m} e(k))² / (2h_c²) ) x(n),

where e(n) = y(n+1) − y(n) when y(n) is a linear function of x(n). In the case of λ = 0, we saw that supervised learning algorithms and their TD extensions have exactly the same form of update rule except for the error terms. Thus, we can obtain a direct extension, the correntropy temporal difference (CTD):

w ← w + η Σ_{n=1}^{m} (y(n+1) − y(n)) exp( −(y(n+1) − y(n))² / (2h_c²) ) x(n).

This equation also covers the weight updates in the case of single step prediction problems (m = 1).


3.3.4 Correntropy Kernel Temporal Difference

Using the ideas of both kernel least mean square (KLMS) and the maximum correntropy criterion, kernel maximum correntropy (KMC) is introduced in [59]. Again, to maximize the correntropy of the error signal, we can use stochastic gradient ascent, and the updates to the system are based on the positive gradient of the new cost function in the feature space. Thus, in KMC, the gradient can be expressed as

∇J_n = ∂v(d(n), y(n))/∂f = (∂G(e(n))/∂e(n)) (∂e(n)/∂f) = (1/h_c²) e(n) G(e(n)) φ(x(n)).

The estimated function at time n+1 can then be obtained as

f₀ = 0
f_{n+1} = f_n + η ∇J_n
        = f_n + η exp( −e(n)²/(2h_c²) ) e(n) φ(x(n))
        = η Σ_{i=1}^{n} exp( −e(i)²/(2h_c²) ) e(i) φ(x(i)).

Again, by using the relation between the supervised and TD errors, in the multi-step prediction problem the temporal difference (TD) error can be integrated into KMC as

f ← f + η Σ_{n=1}^{m} exp( −( Σ_{k=n}^{m} (y(k+1) − y(k)) )² / (2h_c²) ) Σ_{k=n}^{m} (y(k+1) − y(k)) φ(x(n)).

In the case of λ = 0, we saw that the KLMS and KTD(0) update rules have exactly the same form except for the error terms. Thus, we can derive the correntropy kernel temporal difference (CKTD) as

f ← f + η Σ_{n=1}^{m} exp( −(y(n+1) − y(n))²/(2h_c²) ) (y(n+1) − y(n)) φ(x(n)).

This equation also covers the case of single step predictions (m = 1).
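For completeness, the CKTD (λ = 0) update can be obtained from the KTD sketch given earlier by scaling each TD error with the Gaussian correntropy weight before it enters the kernel expansion. The following is a minimal illustration under that assumption; the class name, interface, and default parameter values are ours.

```python
import numpy as np

class CKTD:
    """Correntropy kernel TD (lambda = 0): each TD error is Gaussian-weighted."""

    def __init__(self, eta=0.3, gamma=1.0, h=0.2, hc=5.0):
        self.eta, self.gamma, self.h, self.hc = eta, gamma, h, hc
        self.centers, self.coeffs = [], []

    def _kernel(self, a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.h ** 2))

    def value(self, x):
        return sum(c * self._kernel(s, x) for s, c in zip(self.centers, self.coeffs))

    def update(self, x, r, x_next, terminal):
        v_next = 0.0 if terminal else self.value(x_next)
        e = r + self.gamma * v_next - self.value(x)       # TD error
        weight = np.exp(-e ** 2 / (2.0 * self.hc ** 2))   # correntropy weighting term
        self.centers.append(np.asarray(x, dtype=float))
        self.coeffs.append(self.eta * weight * e)         # weighted error becomes the coefficient
        return e
```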


Note that, compared to TD(0) and KTD(0), the only difference in the CTD and CKTD update rules is the extra weighting term, the exponential of the error. Therefore, the stability result from Theorem 3.3 should also apply in the correntropy case, since the extra weighting term can be factored together with the step size. This does not change the conditions on the step size sequence, η_n → 0 and Σ_n η_n = ∞, because the Gaussian kernel employed for correntropy satisfies 0 < G(e) ≤ 1. Nevertheless, the convergence points of correntropy TD and TD will in general be different.

In this chapter, three new temporal difference algorithms were introduced for state value function estimation. First, an algorithm that combines kernel based representations with conventional TD learning, kernel temporal difference (KTD)(λ), was introduced. One of the key advantages of KTD(λ) is its nonlinear function approximation capability in the input space with convergence guarantees. Because of the linear structure of the computations that are implicitly carried out in the feature space through the kernel expansion, existing results on linear function approximation can be extended to the kernel setting. Following this, the maximum correntropy criterion (MCC) was applied to the TD(λ) and KTD(λ) algorithms as a robust alternative to MSE. We introduced the correntropy temporal difference (CTD) and correntropy kernel temporal difference (CKTD) algorithms, which are shown to be stable and robust under noisy conditions. Nonlinear function approximation capabilities and robustness are appealing properties for practical implementations. Learning methods with nonlinear function approximation capabilities have been the subject of active research; however, the lack of convergence guarantees has made this avenue less attractive for real applications. A powerful aspect of KTD(λ) is its approximation mechanism, which overcomes this convergence issue.


CHAPTER 4
SIMULATIONS - POLICY EVALUATION

In this chapter, we examine the empirical performance of the temporal difference algorithms introduced in the previous sections on the problem of estimating the state value function Ṽ given a fixed policy. First, we carry out experiments on a simple illustrative Markov chain described in [6]; we refer to this problem as the Boyan chain problem. This is a popular episodic task for testing TD learning algorithms. The experiment is useful for illustrating linear as well as nonlinear functions of the state representations, and it shows how the state value function is estimated using adaptive systems. TD(λ) and KTD(λ) are compared on the linear and nonlinear function approximation problems. Furthermore, TD(λ), KTD(λ), CTD, and CKTD are applied in a noisy environment where the policy does not remain fixed but is randomly perturbed.

4.1 Linear Case

To test the efficacy of the proposed method, we first observe the performance on a simple Markov chain (Figure 4-1). There are 13 states, numbered from 12 down to 0. Each trial starts at state 12 and terminates at state 0. Each state is represented by a 4-dimensional vector, and the rewards are assigned in such a way that the value function V is a linear function of the states; namely, V takes the values [0, −2, −4, …, −22, −24] at states [0, 1, 2, …, 11, 12]. In the case of V = w^⊤ x, the optimal weights are w = [−24, −16, −8, 0]. To assess the performance of the algorithms, the updated estimate of the state value function Ṽ(x) is compared to the optimal value function V at the end of each trial. This is done by computing the RMS error of the value function over all states,

RMS = sqrt( (1/n) Σ_{x∈X} ( V(x) − Ṽ(x) )² ),

where n is the number of states, n = 13.


Figure 4-1. A 13 state Markov chain [6]. For states from 2 to 12, the state transition probability is 0.5 and the corresponding reward is −3. State 1 has a transition probability of 1 to the terminal state 0 and reward −2. States 12, 8, 4, and 0 have the 4-dimensional representations [1,0,0,0], [0,1,0,0], [0,0,1,0], and [0,0,0,1], respectively, and the representations of the other states are linear interpolations between these vectors.

We saw in the previous chapter that the step size is required to satisfy η(n) ≥ 0, Σ_{n=1}^{∞} η(n) = ∞, and Σ_{n=1}^{∞} η(n)² < ∞ to guarantee convergence. Consequently, the following step size scheduling is applied:

η(n) = η₀ (a₀ + 1)/(a₀ + n),   n = 1, 2, …,

where η₀ is the initial step size and a₀ is the annealing factor, which controls how fast the step size decreases. In this experiment, a₀ = 100 is used. Furthermore, we assume that the policy is guaranteed to terminate, which means that the value function V is well behaved without a discount factor; that is, γ = 1.
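A sketch of this setup is given below. It is only illustrative: the transition structure assumes the standard Boyan chain from [6] (from a state i ≥ 2 the chain moves to state i−1 or i−2 with equal probability), and the function names are ours.

```python
import numpy as np

def step_size(n, eta0=0.1, a0=100):
    """Annealed step size eta(n) = eta0 * (a0 + 1) / (a0 + n)."""
    return eta0 * (a0 + 1.0) / (a0 + n)

def boyan_features(state):
    """4-dimensional interpolated representation of states 0..12 (Figure 4-1)."""
    corners = {12: [1, 0, 0, 0], 8: [0, 1, 0, 0], 4: [0, 0, 1, 0], 0: [0, 0, 0, 1]}
    lo = 4 * (state // 4)                       # nearest corner states around `state`
    hi = min(lo + 4, 12)
    frac = (state - lo) / 4.0 if hi != lo else 0.0
    return ((1 - frac) * np.array(corners[lo], float)
            + frac * np.array(corners[hi], float))

def boyan_step(state, rng):
    """One transition; assumed move to state-1 or state-2 with probability 0.5 each."""
    if state == 1:
        return 0, -2.0                          # transition to the terminal state
    next_state = int(state - rng.choice([1, 2]))
    return max(next_state, 0), -3.0
```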


Using the above setup, we first apply TD(λ) to estimate the value function of the Boyan chain (Figure 4-1). To obtain the optimal parameters, various combinations of the eligibility trace rate λ and the initial step size η₀ are evaluated: λ from 0 to 1 in steps of 0.2 and η₀ between 0.1 and 0.9 at 0.1 intervals, observed over 1000 trials (Figure 4-2). The RMS errors of the value function are averages over 10 Monte Carlo runs, and the initial weight vector is set to w = 0 at each run. Across all values of λ with the optimal step size, TD(λ) provides a good approximation to V after 1000 trials. We observe that small step sizes generally give better performance.

Figure 4-2. Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η₀ in TD(λ). The plotted vertical line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).

However, if the step size is very small, the system fails to reach a good performance level, especially for small λ values (λ = 0, 0.2, and 0.4). The step size mainly controls the tradeoff between accuracy and speed of learning, so learning with a very small step size may be too slow to converge within 1000 trials. Large step sizes, on the other hand, result in larger error due to misadjustment. Based on Figure 4-2, the parameter values λ = 1 and η₀ = 0.1 are selected for further observation.

Before extending the experiment, we want to observe the behavior of KTD(λ) on a linear function approximation problem. We previously emphasized the capability of KTD(λ) as a nonlinear function approximator; however, with an appropriate kernel size, KTD(λ) should also approximate linear functions well on a region of interest. In KTD(λ), we employ the Gaussian kernel,

κ(x(i), x(j)) = exp( −‖x(i) − x(j)‖² / (2h²) ),

which is a universal kernel commonly encountered in practice. To find the optimal kernel size, we fix all the other free parameters around median values, λ = 0.4 and η₀ = 0.5, and compare the average RMS error over 10 Monte Carlo runs (Figure 4-3). For this specific experiment, smaller kernel sizes yield better performance, since the state representations are finite. However, in general, applying


Figure 4-3. Performance over different kernel sizes in KTD(λ). The vertical line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).

too small a kernel size leads to overfitting, or in this case to slow learning. In particular, choosing a very small kernel leads to a procedure very similar to a table lookup method. Thus, we choose the kernel size h = 0.2, the largest kernel size for which we obtain mean RMS values at the 1000th trial similar to those for h = 0.1 and h = 0.05, and the lowest mean RMS at the 100th trial. After fixing the kernel size at h = 0.2, different combinations of eligibility trace rates λ and initial step sizes η₀ are evaluated experimentally. Figure 4-4 shows the average performance over 10 Monte Carlo runs for 1000 trials.

Figure 4-4. Performance comparison over different combinations of eligibility trace rates λ and initial step sizes η₀ in KTD(λ) with h = 0.2. The plotted vertical line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).


All λ values with the optimal step size show a good approximation to V after 1000 trials. Smaller step sizes with larger λ values show better performance in TD (Figure 4-2), whereas larger step sizes with smaller λ perform better in KTD. Notice that KTD(λ = 0) shows slightly better performance than KTD(λ = 1); this may be attributed to the local nature of KTD when using the Gaussian kernel. In addition, varying the step size has a relatively small effect on KTD(λ): the Gaussian kernel, like other normalized kernels, provides an implicitly normalized update rule, which is known to be less sensitive to the step size. Based on Figure 4-4, the eligibility trace rate and initial step size values λ = 0.6 and η₀ = 0.3 are selected for KTD with kernel size h = 0.2.

The learning curves of TD(λ) and KTD(λ) are compared using the optimal parameters from the experimental evaluation (λ = 1 and η₀ = 0.1 for TD, and λ = 0.6 and η₀ = 0.3 for KTD), with the RMS error averaged over 50 Monte Carlo runs for 1000 trials. Comparative learning curves are given in Figure 4-5. Both algorithms reach a mean RMS value of around 0.06. Here, we confirmed the ability of TD(λ) and KTD(λ) to handle the function approximation problem when the fixed policy yields a state value function that is linear in the state representation. As expected, TD(λ) converges faster to the optimal solution because of the linear nature of the problem. KTD(λ) converges more slowly than TD(λ), but it is also able to approximate the value function properly; in this sense, KTD is open to a wider class of problems than its linear counterpart. The estimated state values Ṽ for the last 50 trials are shown in Figure 4-6, which shows that both TD(λ) and KTD(λ) successfully estimate the optimal state values V.

4.2 Linear Case - Robustness Assessment

In this section, we want to observe the role of the cost function in the adaptation process. In the following experiment, we consider the same Boyan chain from the previous linear case (Figure 4-1), but unlike the above case, the rewards are random


Figure 4-5. Learning curves of TD(λ) and KTD(λ). The solid line shows the mean RMS error and the dashed line shows the standard deviation over 50 Monte Carlo runs.

Figure 4-6. Comparison of state value convergence, V(x) for x ∈ X, between TD(λ) and KTD(λ). The solid line shows the optimal state values V and the dashed line shows the estimated state values Ṽ by TD(λ) (left) and KTD(λ) (right).

variables themselves; we refer to them as noisy rewards. Three types of noise are added to the original discrete reward values, and the behaviors of TD and correntropy TD are compared. First, Gaussian noise with probability density function

G(μ, σ²) = (1/√(2πσ²)) exp( −(x − μ)²/(2σ²) )


is added to the rewards. Here, the mean is set to zero, and different variance values (σ² = 0.2, 0.5, 1, 2, 10, 20, 50) are applied. From Figure 4-2, we observed that the parameter set λ = 1 and η₀ = 0.1 leads to the best performance; however, for a fair comparison with CTD, TD with λ = 0 and η₀ = 0.3 is also applied. To observe the influence of the Gaussian noise on the performance of the TD algorithm, we apply it for the two parameter sets (λ = 1 and η₀ = 0.1, red line, and λ = 0 and η₀ = 0.3, blue line, as depicted in Figure 4-7). The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.

Figure 4-7. The performance of TD for different levels (variances σ²) of additive Gaussian noise on the rewards.

For both parameter sets, increasing the noise variance worsens the performance of TD in terms of the mean and standard deviation of the average RMS errors. Now, we examine how CTD behaves with respect to the noise variance. Recall that correntropy itself requires setting an extra kernel size, different from the kernel size parameter required in KTD. To distinguish the two kernels, we refer to the kernel with size h_c as the correntropy kernel. Different correntropy kernel sizes (h_c = 1, 2, 3, 4, 5, 10, 20, 50, 100) are applied to the CTD algorithm, and its performance is observed with respect to the noise variances σ² = 0.2, 1, 10, 50 (Figure 4-8).


Smaller correntropy kernel sizes h_c yield higher RMS error, but as the correntropy kernel size gets larger, the error converges to values similar to those obtained with TD for λ = 0 and η₀ = 0.3 (the blue line in Figure 4-7). This result is intuitive: MSE is optimal under Gaussian assumptions, and for a large enough kernel size correntropy behaves similarly to MSE. We further examine the learning behavior of TD and CTD when the noise variance σ² = 10 is used (Figure 4-9). The RMS error is averaged over 50 Monte Carlo runs, and the mean and standard deviation at the 1000th trial are displayed. As in Figures 4-7 and 4-8C, similar means and variances for TD and CTD can be observed. Nevertheless, CTD shows smoother learning curves than TD. This is expected since correntropy behaves similarly to MSE as h_c → ∞. Comparing the two correntropy kernel sizes h_c = 5 and h_c = 10, we observe that the smaller correntropy kernel size has a slower convergence rate. In this experiment, we confirmed that the MSE criterion is optimal in the case of zero-mean Gaussian noise, and that CTD is also able to approximate the value function with a proper choice of kernel size h_c. Note that in Figure 4-8, Gaussian noise with different variances is added to the assigned rewards.

Second, we explore the behavior of TD and CTD under outlier noise conditions; a mixture of Gaussian distributions, 0.9G(0,1) + 0.1G(5,1), is added to the reward values. From Figure 4-2, we know that for TD(λ = 0) the initial step size η₀ = 0.3 is optimal. To find the optimal correntropy kernel size h_c in CTD, we evaluate different correntropy kernel sizes, h_c = 1, 2, 2.5, 3, 5, 10, 20, 50, 100 (Figure 4-10).

Small correntropy kernel sizes, h_c = 1 and h_c = 2, lead to large RMS error. In this case, the convergence can be very slow, since only in a small vicinity of the optimal solution does the gradient take values that make the adaptive system respond accordingly. With h_c = 2.5, correntropy TD shows the lowest RMS error, and as h_c increases, the average RMS increases and converges to results similar to TD. Again, a large correntropy kernel size takes into account larger error values, especially those caused by the second component of the mixture.


Figure 4-8. The performance change of CTD over different correntropy kernel sizes h_c. Panels: A) σ² = 0.2, B) σ² = 1, C) σ² = 10, D) σ² = 50.

Figure 4-9. Learning curves of TD and CTD when Gaussian noise with variance σ² = 10 is added to the reward. The RMS error is averaged over 50 Monte Carlo runs; the solid line shows the mean RMS error and the dashed line represents the standard deviation.

The learning curves of TD and CTD for the mixture noise are compared in Figure 4-11.


Figure 4-10. Performance of CTD for different correntropy kernel sizes h_c with the mixture-of-Gaussians noise distribution. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.

Figure 4-11. Learning curves of TD and CTD when the noise added to the rewards corresponds to a mixture of Gaussians. The RMS error is averaged over 50 Monte Carlo runs; the solid line is the mean RMS error and the dashed line shows the standard deviation.

We can observe that as the correntropy kernel size increases, the performance becomes similar to TD. Even though CTD with the optimal correntropy kernel size h_c initially converges more slowly than TD, its error keeps decreasing beyond the values obtained with TD. This is a clear example of the robustness of correntropy to non-Gaussian, non-symmetric impulsive noise.


It is well known that heavy tail distributions such as the Laplacian make MSE non-optimal [45, 47]. Thus, our third experiment considers Laplacian distributed additive noise,

L(μ, 2b²) = (1/(2b)) exp( −|x − μ|/b ),

on the assigned reward. The mean is set to zero, and different variances (b² = 0.04, 0.25, 1, 4, 25, 100) are applied. Again, TD with λ = 1 and η₀ = 0.1, and with λ = 0 and η₀ = 0.3, is applied to observe the influence of the Laplacian noise (Figure 4-12). In both cases, the performance degrades as the variance increases. Moreover, the RMS values obtained for Gaussian distributed noise with similar variances are smaller, in agreement with the fact that MSE is suboptimal here.

Figure 4-12. Performance changes of TD with respect to different Laplacian noise variances b². The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial.

The performance of CTD is observed for the noise variances b² = 0.04, 1, 25, 100, with correntropy kernel sizes h_c = 1, 2, 3, 4, 5, 10, 20, 50, 100. Figure 4-13 shows the corresponding RMS values. When the noise variance is small (b² = 0.04 and b² = 1), performance does not degrade as the correntropy kernel size becomes larger (Figures 4-13A and 4-13B). In this case, the two cost functions, MSE and MCC, do not expose significant differences. However, when the noise variance is large (b² = 25


Figure 4-13. Performance of CTD for different correntropy kernel sizes h_c with various Laplacian noise variances. The RMS error is averaged over 50 Monte Carlo runs, and the plot shows the mean and standard deviation at the 1000th trial. Panels: A) b² = 0.04, B) b² = 1, C) b² = 25, D) b² = 100.

and b² = 100), certain correntropy kernel sizes show smaller error than other, larger correntropy kernel sizes (Figures 4-13C and 4-13D). Since MSE is not optimal under heavy tail noise distributions, making correntropy approximate MSE by increasing the kernel size results in worse performance. Figure 4-14 shows the learning curves of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward. At the beginning, TD shows a slightly faster convergence rate, but after around the 50th trial, CTD reaches lower RMS error than TD. Again, this example verifies the robustness of correntropy in non-Gaussian scenarios, in particular for heavy tail distributed (sparse) noise.


Figure 4-14. Learning curves of TD and CTD when Laplacian noise with variance b² = 25 is added to the reward. The RMS error is averaged over 50 Monte Carlo runs; the solid line is the mean RMS error and the dashed line shows the standard deviation.

4.3 Nonlinear Case

We have seen the performance of TD(λ), KTD(λ), and CTD on the problem of estimating a state value function that is a linear function of the given state representation. The same problem can be turned into a nonlinear one by modifying the reward values in the chain such that the resulting state value function V is no longer a linear function of the states. The number of states and the state representations remain the same as in the previous section, but the optimal value function V becomes nonlinear with respect to the representation of the states; namely, V = [0, −0.2, −0.6, −1.4, −3, −6.2, −12.6, −13.4, −13.5, −14.45, −15.975, −19.2125, −25.5938] for states 0 to 12. This implies that the reward values for each state are also different from the ones given for the linear case (Figure 4-15). Again, to evaluate the performance, after each trial is completed the estimated state value Ṽ is compared to the optimal state value V using the RMS error, as described above for the linear case.


Figure 4-15. A 13 state Markov chain. In states from 2 to 12, each state transition has probability 0.5, and state 1 has transition probability 1 to the absorbing state 0. The optimal state value function is a nonlinear function of the states, and corresponding reward values are assigned to each state.

First, TD(λ) is applied to estimate the value function with various combinations of λ and initial step size η₀: λ from 0 to 1 at 0.2 intervals and η₀ between 0.1 and 0.9 at 0.1 intervals, observed over 1000 trials. The RMS error of the value function is averaged over 10 Monte Carlo runs, and the initial weight vector is set to w = 0 at each run (Figure 4-16).

Figure 4-16. The effect of λ and the initial step size η₀ in TD(λ). The plotted line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).

It is noticeable that larger λ values show better performance; as we know, the λ = 1 case corresponds to the least mean squares solution. The intermediate cases (λ < 1) are not guaranteed to converge to the optimal solution, since the representations of the states do not form a linearly independent set of vectors. However, the solution for λ = 1 will still try to approximate E[d|x] because of the implicit regularization in the


stochastic gradient algorithm. For further observation, TD with λ = 0.8 and η₀ = 0.1 is applied. For KTD(λ), the Gaussian kernel is applied, and kernel size h = 0.2 is chosen based on Figure 4-17; after fixing all the other free parameters around median values, λ = 0.4 and η₀ = 0.5, the average RMS error over 10 Monte Carlo runs is compared. Then, the performance for different combinations of the parameters λ and η₀ is compared with h = 0.2. Figure 4-18 shows the average RMS error over 10 Monte Carlo runs for 1000 trials.

Figure 4-17. The performance of KTD with different kernel sizes. The plotted line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers).

Figure 4-18. Performance comparison over different combinations of λ and the initial step size η₀ in KTD(λ) with h = 0.2. The plotted segments contain the mean RMS values after 100 trials (top segment), 500 trials (middle segment), and 1000 trials (bottom segment).


Again, compared to TD, larger step sizes with smaller λ values perform better in KTD. The combination λ = 0.4 and η₀ = 0.3 shows the best performance, but the λ = 0 case also performs well; unlike TD, there is no dominant value of λ. Recall that convergence to E[d|x] is guaranteed for linearly independent representations of the states, which is automatically fulfilled in KTD when the kernel is universal. Therefore, the differences are rather due to the convergence speed controlled by the interaction between the step size and the eligibility trace. Based on Figures 4-16 and 4-18, the optimal step size and eligibility trace rate values are selected (λ = 0.8 and η₀ = 0.1 for TD, and λ = 0.4 and η₀ = 0.3 for KTD), and their respective average RMS errors over 50 Monte Carlo runs are shown in Figure 4-19.

Figure 4-19. Learning curves of TD(λ) and KTD(λ). The solid line shows the mean RMS error, and the dashed line represents the standard deviation over 50 Monte Carlo runs.

The linear function approximator, TD(λ) (blue line), cannot estimate the optimal state values, whereas KTD(λ) outperforms the linear algorithm; this behavior is expected since the Gaussian kernel is universal. KTD(λ) reaches a mean value of around 0.07, while the mean value of TD(λ) is around 1.8. Figure 4-20 shows the optimal state values V and the state values Ṽ predicted by TD(λ) and KTD(λ) over the last 50 trials. Notice that TD(λ) tries to estimate the value function with a piece-wise, evenly spaced


pattern. This is associated with the degrees of freedom of the representation space (4-dimensional in the present case). In contrast, KTD(λ) is able to faithfully reproduce the nonlinear behavior of the value function.

Figure 4-20. Comparison of state value convergence between TD(λ) and KTD(λ). The solid line shows the optimal state values V and the dashed line shows the estimated state values Ṽ by TD(λ) (left) and KTD(λ) (right).

4.4 Nonlinear Case - Robustness Assessment

In this section, we extend the experiment to observe the performance of KTD(λ) and CKTD under noisy rewards. We consider the same Boyan chain from the previous nonlinear case (Figure 4-15), with either a noisy reward or a perturbed policy. First, impulsive noise with probability density function 0.95G(0, 0.05) + 0.05G(0, 5) is added to the current reward with probability 0.05. This can be thought of as randomly replacing the policy with probability 0.05. Since the state representations and the optimal state values are the same as in the previous experiment, based on Figures 4-17 and 4-18, a Gaussian kernel with kernel size h = 0.2 and initial step size η₀ = 0.3 with annealing factor a₀ = 100 are applied. For a fair comparison with correntropy KTD, λ is set to 0. We validate the optimal correntropy kernel size based on Figure 4-21, fixing all the other free parameters around median values, λ = 0.4 and η₀ = 0.5, and comparing the average RMS errors over 10 Monte Carlo runs.


Figure 4-21. Performance of CKTD for different correntropy kernel sizes. The plotted line segments contain the mean RMS values after 100 trials (top markers), 500 trials (middle markers), and 1000 trials (bottom markers). Note the log scales on the x and y axes.

We can observe that as the correntropy kernel size gets larger, the mean RMS error at the 100th trial decreases, and we know it converges to the same solution as KTD. However, after 1000 trials, CKTD with h_c = 5 shows the lowest mean RMS error. This implies that a larger correntropy kernel size brings a faster initial convergence speed, but fails to reach lower errors if the correntropy kernel size remains too large. This motivates the idea that by controlling the correntropy kernel size during adaptation, we may obtain fast and robust function approximation. In Figure 4-22, we compare the learning curves of KTD and CKTD with different correntropy kernel sizes, and Figure 4-23 shows the estimated state values Ṽ for KTD and CKTD. In the case of KTD, it is noticeable that when an undesirable noisy transition occurs the estimation process degrades, and thus the overall performance is affected. On the other hand, CKTD shows more stable performance even with the impulsive noise. For CKTD, we apply a fixed correntropy kernel size h_c = 5 (blue line); as expected, it shows a slower convergence rate than KTD (red line) but lower error values. To obtain faster convergence, we start with h_c = 150 and switch the correntropy kernel size to h_c = 5 at the 100th trial. In this way, we can accelerate the initial convergence rate and, after switching, reach lower error values. A similar


switching scheme has already been utilized in correntropy based adaptive filtering algorithms, but instead of applying a large initial kernel size, the algorithm uses MSE at the initial stage and then switches to correntropy.

Figure 4-22. Learning curves of KTD and CKTD. The solid line is the mean RMS error over 50 Monte Carlo runs and the dashed line shows the standard deviation.

Figure 4-23. Comparison of the state value function Ṽ estimated by KTD and correntropy KTD.

Now, we want to see how a perturbed policy influences the performance of KTD and CKTD. We consider the same Boyan chain from the previous nonlinear case (Figure 4-15), but the observations come from a policy that has been perturbed. Note that in this experiment the reward does not contain any noise; noise is only added to the state transitions. At each step, with probability 0.1, the state transition is made from


state i to a state j drawn uniformly from j = 0, …, 12, and the reward value corresponding to state x_i is assigned. Since the state representations and the optimal state values are the same as in the previous experiment, a Gaussian kernel with kernel size h = 0.2, initial step size η₀ = 0.3 with annealing factor a₀ = 100, and λ = 0 are applied. KTD and CKTD are trained for 2000 trials and the RMS error is averaged over 100 Monte Carlo runs. Figure 4-24 shows the mean and standard deviation of the error norm over the 100 Monte Carlo runs at the 2000th trial when CKTD is used with correntropy kernel sizes h_c between 1 and 10 in increments of 1, and also for h_c = 15, 20, 100. Figure 4-25 shows the learning curves of KTD and CKTD with correntropy kernel sizes h_c = 2, 3, 5, 6, 7, 100 in terms of mean RMS error over the 100 Monte Carlo runs. Again, we observe that a larger correntropy kernel size brings a faster initial convergence speed but fails to reach lower errors if it remains too large, showing performance similar to KTD. This experiment provides evidence that by controlling the kernel size we can obtain fast and robust function approximation. Figures 4-24 and 4-25 show that the correntropy kernel size h_c = 100 performs the same as KTD; in the case of KTD, the mean and standard deviation at the last trial are 1.7016 and 0.4069, respectively. Although h_c = 5 and h_c = 6 give the same interval for the error value at the last trial, since a larger kernel size is faster, h_c = 6 is selected as the optimal correntropy kernel size. In the case of KTD, we observe that when an undesirable noisy transition occurs the estimation process degrades, and thus the overall performance is affected. On the other hand, CKTD shows more stable performance even under the random state transitions.

In this chapter, we examined the behavior of the algorithms introduced in the previous chapter. We presented experimental results on synthetic examples to approximate the state value function under a fixed policy; in particular, we applied the algorithms to absorbing Markov chains. We observed that KTD(λ) performs well on both linear and nonlinear function approximation problems. In addition, we showed how the linear


Figure 4-24. Mean and standard deviation of the RMS error over 100 runs at the 2000th trial. Note the log scale on the x-axis.

Figure 4-25. Mean RMS error over 100 runs. Note the log scale on the horizontal axis.

independence of the input state representations can affect the performance of the algorithms. This is an essential guarantee for the convergence of TD with eligibility traces. The use of strictly positive definite kernels in KTD(λ) implies the linear independence condition, and thus this algorithm converges for all λ ∈ [0,1]. Moreover, we performed experiments with the maximum correntropy criterion under noisy conditions. Experiments with heavy tail distributions on noisy rewards and state transition probabilities show that the CTD and CKTD algorithms can improve performance over conventional MSE. In particular, the robust behavior of correntropy was tested for Laplacian noise and impulsive


noise that represents the effect of outliers in the reward. Correntropy was also tested when the policy is randomly replaced, which was achieved by adding a random perturbation to the state transitions. In the following chapters, we extend the TD algorithms to estimate the action value function, which can be applied to finding a proper control policy.


CHAPTER 5
POLICY IMPROVEMENT

We have shown how the kernel based nonlinear mapping and TD(λ) can be combined into a kernel based least mean squares temporal difference learning algorithm with eligibilities, called KTD(λ), and we have seen the advantages of KTD(λ) in nonlinear function approximation problems. Moreover, a new robust cost function based on correntropy has been integrated into the TD and KTD algorithms. So far, we have only used TD learning algorithms to estimate the state value function given a fixed policy. However, this is still an intermediate step in RL. Recall that we want to find a proper state to action mapping that results in maximum return. Since the value function quantifies the relative desirability of the different states in the state space, it allows comparisons between policies and thus guides the search for an optimal policy. Therefore, we can extend the proposed methods to solve complete RL problems.

Here, our goal is to find the optimal control action A*(n) at each time n which maximizes the cumulative reward. When the optimal state value function V* is available, an optimal policy can be derived from it; the optimal action sequence {A*(n)} is given by

A*(n) = argmax_{a ∈ A(x)} Σ_{x′} P^a_{xx′} [ R^a_{xx′} + γ V*(x′) ].

Here, for the sake of simplicity, we denote x(n) by x. However, direct use of this expression is limited because in practice P^a_{xx′} and R^a_{xx′} are unknown most of the time. One way to get around these issues is Q-learning [55]. Q-learning allows the estimation of the optimal value function Q*(x, a) incrementally, and based on the estimated Q*, a proper policy can be obtained. From the definition of the state-action value function, we have the relation V*(x) = max_{a ∈ A(x)} Q*(x, a). This shows that the optimal action process can be


obtained using the action value function Q*,

A*(n) = argmax_{a ∈ A(x)} Q*(X(n), a),

where {X(n)} is the controlled Markov chain [32].

5.1 State-Action-Reward-State-Action

The first step in applying Q-learning is to estimate the state-action value function Q instead of the state value function V. State-Action-Reward-State-Action (SARSA) is introduced to learn the state-action value function Q given a fixed policy. The update rule of SARSA is

Q(x(n), a(n)) ← Q(x(n), a(n)) + η [ r(n+1) + γ Q(x(n+1), a(n+1)) − Q(x(n), a(n)) ].

To complete one update, the state-action pair (x(n), a(n)), the corresponding reward r(n+1), and the transition to the next state-action pair (x(n+1), a(n+1)) are required; hence the name SARSA. SARSA has a strong relation with Q-learning: it can be understood as Q-learning [55] under a fixed policy [52]. Q-learning does not use a fixed policy, but explores different policies to ultimately obtain a good one. For large state spaces X and action spaces A, we can estimate the Q values using function approximators, where the proposed TD(λ) algorithms are applied to state-action pairs rather than to states only [48]. This gives the basic idea of how the TD algorithms can be associated with Q function approximation in policy evaluation.

5.2 Q-learning

Q-learning is a well known off-policy TD control algorithm. The form of the state-action mapping function (policy) is left undetermined, and TD learning is applied to estimate the state-action value function. This allows the system to explore policies towards finding an optimal one, which is an important feature for practical applications, since prior information about a policy is usually not available.


Since value functions represent the expected cumulative reward given a policy, we can say that policy π is better than policy π′ when π gives greater expected return than π′; in other words, π ≥ π′ if and only if Q^π(x, a) ≥ Q^{π′}(x, a) for all x ∈ X and a ∈ A. Therefore, the optimal action value function Q* can be written as

Q*(x(n), a(n)) = max_π Q^π(x(n), a(n))
              = E[ r(n+1) + γ max_{a(n+1)} Q*(x(n+1), a(n+1)) | x(n), a(n) ].

This expectation can be estimated online, and a one-step Q-learning update can be defined as

Q(x(n), a(n)) ← Q(x(n), a(n)) + η [ r(n+1) + γ max_a Q(x(n+1), a) − Q(x(n), a(n)) ],

which maximizes the expected reward E[r(n+1) | x(n), a(n), x(n+1)]. At time n, an action a(n) can be selected using methods such as ε-greedy or the Boltzmann distribution, which are commonly applied [53]. When the state set X and action set A are finite, this update allows the action value function Q to be computed explicitly. However, when X and A are infinite or very large, it is infeasible to maintain explicit Q values. Thus, we will see how function approximation can be integrated into Q-learning.
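For finite state and action sets, the one-step update above can be written directly over a Q table. The following is a minimal sketch with ε-greedy action selection; the environment interface is an assumption of the sketch.

```python
import numpy as np

def q_learning(env, num_states, num_actions, eta=0.1, gamma=0.9,
               epsilon=0.05, num_episodes=100, seed=0):
    """Tabular one-step Q-learning with epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        x = env.reset()                     # assumed to return an integer state index
        done = False
        while not done:
            # exploit the greedy action with probability 1 - epsilon
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[x]))
            r, x_next, done = env.step(a)   # assumed (reward, next state, terminal flag)
            target = r if done else r + gamma * np.max(Q[x_next])
            Q[x, a] += eta * (target - Q[x, a])   # one-step Q-learning update
            x = x_next
    return Q
```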


5.3 Q-learning via Kernel Temporal Differences and Correntropy Variants

We have seen how temporal difference algorithms approximate the state value function using a parametrized family of functions. In Q-learning, the state-action value function Q can be approximated with the proposed methods (KTD(λ), CTD, and CKTD), employing the same approach used for state value function estimation. We previously approximated the state value function with a parametrized family of functions Ṽ(x(n)) = f(x(n), w) using the TD algorithms. We can apply the same approach to approximate the state-action value function:

Q̃(x, a = i) = f(x, w | a = i).

In the case of the linear function approximators (TD(λ) and CTD), the action value function can be estimated as Q̃(x(n), a = i) = w^⊤ x(n), and for their kernel extensions (KTD(λ) and CKTD), it can be approximated as Q̃(x(n), a = i) = ⟨f, φ(x(n))⟩. Note that Q̃(x(n), a = i) denotes the state-action value given a state x(n) at time n and a discrete action i. Therefore, based on the Q-learning update, the KTD(λ) update rule can be integrated as

f ← f + η Σ_{n=1}^{m} [ r(n+1) + γ max_a Q(x(n+1), a) − Q(x(n), a(n)) ] Σ_{k=1}^{n} λ^{n−k} φ(x(k)).

We call this approach Q-learning via kernel temporal differences, Q-KTD(λ). For single-step prediction problems (m = 1), this yields single updates for Q-KTD(λ) of the form

Q_i(x(n)) = η Σ_{j=1}^{n−1} e_{TD_i}(j) I_k(j) κ(x(n), x(j)).

Here, Q_i(x(n)) = Q(x(n), a = i), and e_{TD_i}(n) denotes the TD error defined as

e_{TD_i}(n) = r_i + γ max_a Q(x(n+1), a) − Q_i(x(n)),

and I_k(n) is an indicator vector with the same size as the number of outputs (actions); only the kth entry of the vector is set to 1 and the rest of the entries are 0. The selection of the action unit k at time n can be based on a greedy method. Therefore, only the weights (parameter vector) corresponding to the winning action get updated. Recall that the reward r_i corresponds to the action selected by the current policy with input x(n), because it is assumed that this action causes the next input state x(n+1).
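A minimal sketch of the Q-KTD (λ = 0) single-step update is given below: each discrete action keeps its own kernel expansion, only the expansion of the selected (winning) action is grown, and actions are chosen ε-greedily. The class and method names are ours and the sketch is a hedged illustration, not the exact implementation used in the BMI experiments.

```python
import numpy as np

class QKTD:
    """Q-learning via kernel temporal differences (lambda = 0, Gaussian kernel)."""

    def __init__(self, num_actions, eta=0.3, gamma=0.9, h=0.2, epsilon=0.05, seed=0):
        self.eta, self.gamma, self.h, self.epsilon = eta, gamma, h, epsilon
        self.rng = np.random.default_rng(seed)
        # one kernel expansion (centers + coefficients) per discrete action
        self.centers = [[] for _ in range(num_actions)]
        self.coeffs = [[] for _ in range(num_actions)]

    def _kernel(self, a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * self.h ** 2))

    def q_value(self, x, a):
        return sum(c * self._kernel(s, x)
                   for s, c in zip(self.centers[a], self.coeffs[a]))

    def q_values(self, x):
        return np.array([self.q_value(x, a) for a in range(len(self.centers))])

    def select_action(self, x):
        # epsilon-greedy over the estimated Q values
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.centers)))
        return int(np.argmax(self.q_values(x)))

    def update(self, x, a, r, x_next, terminal):
        q_next = 0.0 if terminal else np.max(self.q_values(x_next))
        td_error = r + self.gamma * q_next - self.q_value(x, a)
        # only the winning action's expansion receives a new weighted center
        self.centers[a].append(np.asarray(x, dtype=float))
        self.coeffs[a].append(self.eta * td_error)
        return td_error
```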


The selection of the action unit k at time n is based on methods such as ε-greedy and the Boltzmann distribution, which are commonly applied for action selection [53]. We adopt ε-greedy in our experiments. This is one of the most popular methods to control the exploration and exploitation tradeoff: the action corresponding to the unit with the highest Q value is selected with probability 1 − ε; otherwise, any other action is selected at random. In other words, the probability of selecting a random action is ε. The structure of Q-learning based on KTD(0) is shown in Figure 5-1.

Figure 5-1. The structure of Q-learning via kernel temporal difference(λ).

The number of units (kernel evaluations) increases as more training data arrive. Each added unit is centered at one of the previous input locations x(1), x(2), …, x(n−1). Likewise, Q-learning via correntropy temporal difference (Q-CTD) has the following update rule,

w ← w + η Σ_{n=1}^{m} exp( −e_TD(n)²/(2h_c²) ) e_TD(n) x(n),

and Q-CKTD,

f ← f + η Σ_{n=1}^{m} exp( −e_TD(n)²/(2h_c²) ) e_TD(n) φ(x(n)).

Here, the temporal difference error e_TD is defined as

e_TD(n) = r(n+1) + γ max_a Q(x(n+1), a) − Q(x(n), a(n)).


5.4 Reinforcement Learning Brain Machine Interface Based on Q-learning with Function Approximation

We have seen how the agent and the environment interact in the reinforcement learning paradigm in Figure 2-1. Moreover, in Figure 1-1, we have shown how the environment can be conceived in the reinforcement learning brain machine interface (RLBMI) paradigm. The TD algorithms we proposed help model the agent. Figure 3-1 shows how the state value function V can be estimated using the proposed TD algorithms under a fixed policy. Note that state value function approximation is only an intermediate step in which the form of the policy is fixed. In RLBMI, it is essential to find the policy which conveys the desired action to the external device. Direct computation of the optimal policy is challenging, since all the information required to calculate it is not known in practice. Therefore, we estimate the optimal policy using the action value function Q. Figure 5-2 depicts the RLBMI structure using Q-learning with the proposed TD algorithms.

Figure 5-2. The decoding structure of the reinforcement learning model in a brain machine interface using a Q-learning based function approximation algorithm.

Based on the neural state from the environment, the action value function Q can be approximated using an adaptive system. We proposed algorithms focusing on both the functional mapping and the cost function: kernel based representations have been integrated to improve the functional mapping capabilities of the system, and correntropy has been employed as the cost function to obtain robustness. Based on


the estimated Q values, a policy decides a proper action. Note that this policy is the learning policy, which changes over time. Recall that the main advantage of RLBMI is the co-adaptation between two intelligent systems: the BMI decoder in the agent, and the BMI user in the environment. Both systems learn how to earn rewards based on their joint behavior. The BMI decoder learns a control strategy based on the user's neural state and performs actions in goal directed tasks that update the state of the external device in the environment. In addition, the user learns the task based on the state of the external device. Both the BMI decoder and the user receive feedback after each movement is completed and use this feedback to adapt. Notice that both systems act symbiotically by sharing the external device to complete their tasks, and this co-adaptation allows continuous synergistic adjustment of the BMI decoder and the user even in changing environments. In Chapter 7, we will examine how this co-adaptation process works in practice by showing experiments on real BMIs.


CHAPTER 6
SIMULATIONS - POLICY IMPROVEMENT

In this chapter, we examine the empirical performance of the extended temporal difference algorithms on the problem of finding a proper state to action mapping based on the estimated action value function Q. In the following, we not only assess their performance and behavior but also examine the methods' applicability to practical situations. Note that in the following simulations the block diagram of the agent remains the same as in Figure 5-2; nonetheless, the components in the environment block are different. For instance, in the mountain car problem below, the states are position and velocity, and the actions are the left and right accelerations as well as coasting.

6.1 Mountain Car Task

We first carry out experiments on a simple dynamical system which was first introduced in [34]. This experiment is well known as the Mountain-car task, a famous episodic task in control problems. A car drives along a mountain track, as depicted in Figure 6-1, and the goal of this task is to reach the top of the right side hill. The challenge is that there are regions near the center of the hill where maximum acceleration of the car is not enough to overcome the force imposed by gravity; therefore, a more sophisticated strategy that allows the car to gain momentum using the hill must be learned. If the system simply tries to maximize short term rewards, it fails to reach the goal. The only way to reach the goal is to first accelerate backwards, even though this moves the car further away from the goal, and then drive forward with full acceleration. This is a representative example to evaluate a system's capability to find a proper policy to achieve a goal in RL.

The details of the model are based on [48]. The observed states correspond to a pair of continuous variables, the position p(n) and velocity v(n) of the car. The values are restricted to the intervals −1.2 ≤ p(n) ≤ 0.5 and −0.07 ≤ v(n) ≤ 0.07 for all


Figure 6-1. The Mountain-car task.

time n. The mountain altitude is sin(3p), and the state evolution dynamics are given by

v(n+1) = v(n) + 0.001 a(n) − g cos(3p(n)),
p(n+1) = p(n) + v(n+1),

where g represents gravity (g = 0.0025) and a(n) is the action chosen at time n. There are 3 possible actions: accelerate backwards a = −1, coast a = 0, and accelerate forward a = +1. At each time step, a reward r = −1 is assigned, and once the updated position p(n+1) exceeds 0.5, the trial terminates. We run 30 trials to learn the policy. At each trial, the initial state is drawn randomly from −1.2 ≤ p ≤ 0.5 and −0.07 ≤ v ≤ 0.07. The system is initialized when the first trial starts, and each trial has a maximum of 10⁴ steps. At each trial, the number of steps is counted and averaged over the 30 trials and 50 Monte Carlo runs. For each Monte Carlo run, the same set of 30 initial values is used. In addition, for the ε-greedy method, we apply an exploration rate ε = 0.05.
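The dynamics above translate directly into code. The sketch below implements a single environment step under the stated equations; the function name and interface are ours, and the way the position and velocity bounds are enforced (clipping the velocity and resetting it at the left wall) follows the common convention for this task and is an assumption of the sketch.

```python
import numpy as np

def mountain_car_step(p, v, a, g=0.0025):
    """One transition of the Mountain-car task; a is -1 (back), 0 (coast), or +1 (forward)."""
    v_next = v + 0.001 * a - g * np.cos(3.0 * p)
    v_next = float(np.clip(v_next, -0.07, 0.07))   # velocity restricted to [-0.07, 0.07]
    p_next = p + v_next
    if p_next < -1.2:                              # assumed: stop at the left boundary
        p_next, v_next = -1.2, 0.0
    done = p_next >= 0.5                           # trial terminates past the right hilltop
    reward = -1.0                                  # every step costs -1 until termination
    return p_next, v_next, reward, done
```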

PAGE 84

Figure6-2. PerformanceofQ-TD()withvariouscombinationofand. (p)]TJ /F5 11.955 Tf 23.64 0 Td[(0.5),soitfailstoreachthegoalwithinthemaximumsteplimit,104.Notethatinthistask,theinputstatespaceiscontinuous,sothethereareaninnitenumberofstates,andusingtheposition-velocityrepresentationcertainlydoesnotfulllthelinearindependencecriterion.AttemptstomakeQ-TD()applicableincontinuousinputspacebydiscretizingthestatespaceareusuallyconsidered.Forinstance,placingoverlappingtilestopartitiontheinputspace,aprocesscalledtilecoding,isausualapproachtoprovidearepresentationthatwouldbeexpectedtodoabetterjob.ExampleswherewecanseetheperformanceofTDincludingthispreprocessingmethodcanbefoundin[ 16 48 ].However,properstaterepresentationsaredifculttoobtainbecausetheyrequirepriorinformationaboutthestatespace.ItisherewherewebelieveQ-KTD()canprovideanadvantage.ForQ-KTD(),weemploytheGaussiankernel( 4 ).FromtheQ-TD()application,itisobservedthatkernelsizeh=0.2whichisclosetotheheuristicthatusesthedistributionofsquareddistancebetweenpairsofinputstates.Toconrmtheusefulnessofthesevalues,weapplydifferentkernelsizes(h=0.01,0.05,0.1,0.2,0.3,0.4),andthemeannumberofstepspertrialisobserved.Thismeanistheaverageover30trialsand50MonteCarloruns.Forthisevaluation,wex=0.4and=0.5.Kernelsizeh=0.05showsthelowestmeannumberofstepspertrial,butperformancesarenotsignicantlydifferentforabroaderrangeofparametervaluesthatincludeh=0.2,whichisthelargestkernelsizethatexposesgoodperformance. 84

PAGE 85

Figure6-3. TheperformanceofQ-KTD()withrespecttodifferentkernelsizes. Again,apreferenceforalargerkernelsizeismotivatedbythesmoothnessassumption.TheperformanceofQ-KTD()withadifferentcombinationofandareobserved.Here,thesamecombinationasFigure 6-2 istested. Figure6-4. PerformanceofQ-KTD()withvariouscombinationofand. BasedonFigure 6-2 6-3 ,and 6-4 ,theoptimalparametersforQ-TDandQ-KTDcanbeobtained(=0.4and=0.5forQ-TDand=0,=0.3,andh=0.2forQ-KTD).Withtheselectedparameters,wefurthercomparetheperformancesofQ-TD()andQ-KTD().First,ateachtrial,wecountthenumberofiterationsuntilthecarreachthegoal,andthenweaveragethenumberofiterationspertrialover30trialsand50MonteCarloruns.Figure 6-5 showstherelativefrequencywithrespecttotheaveragenumberofiterationspertrial.Forbetterunderstanding,Figure 6-6 plotstheaveragenumberofiterationspertrialwithrespecttothetrialnumber.Notethatthex-axisofFigure 6-5 correspondstoy-axisofFigure 6-6 .TheresultsshowsthatbothQ-TD()and 85

PAGE 86

Figure6-5. RelativefrequencywithrespecttoaveragenumberofiterationspertrialofQ-TD()andQ-KTD(). Figure6-6. AveragenumberofiterationspertrialofQ-TD()andQ-KTD(). Q-KTD()areabletondaproperpolicy.However,comparedtoQ-TD(),Q-KTD()worksbetterforpolicyimprovement.Q-KTD()hasmoretrialswithlessnumberofiterations(Figure 6-6 ).Inaddition,thelargenumberofiterationsinFigure 6-5 isduetoexplorationattheinitialstageoflearning.Inthestatevalueestimationproblems(Boyanchainexperimentsinthepreviouschapters),wehaveseentherobustnessofmaximumcorrentropycriteria(MCC)underdifferenttypesofperturbationonthepolicyorenvironment;namely,thenoisyrewardandstatetransitionprobability.Here,wewillseetheusefulnessofcorrentropyforlearningunderswitchingpolicies,whichcanbethecaseinimplementinganexploration 86

PAGE 87

/exploitationtradeoffinreinforcementlearning.TDalgorithmsintegratedwithcorrentropycanprovidebetterperformanceundersuchlearningscenarios.Whenwetrytoobtainagoodpolicywithoutanypriorknowledgeofhowtheoptimalpolicyshouldbe,thesystemisrequiredtolearnbyexploringtheenvironment.Thus,ateachtime,thesystemobservescertainstatetoactionmapsfromexperience,andthesystemneedstoevaluatethegivenpolicytoupdatethefunctionalmapping;thatis,itisessentialthatthesystemisabletolearnunderchangingpolicies.Therefore,herewewillobservehowtheproposedalgorithmscanefcientlylearnagoodpolicywhileconstantlychangingpoliciesduringthelearningprocess.WeusetheMountain-cartaskandvarytheexplorationratetoconrmhowthesystemlearnsunderchangingpolicy.Westartwithatotallyrandompolicy(100%explorationrate,=1).Thisexplorationrateiskeptuntil200thstepandthenweswitchto=0.Whentheexplorationrateis0,theobservedperformanceshowsexactlywhatthesystemhasbeenabletolearnfromrandomexploration.Inaddition,furtherstepsareallowedtoletthesystemadjustitscurrentestimateofthepolicy.BykeepingtheoptimalparametersofQ-KTD=0.3andh=0.2,weexamineQ-CKTDwithdifferentcorrentropykernelsizes(hc=1,2,3,4,5,10,50).Figure 6-7 showstheaveragenumberofstepspertrialoverthe30trialsand50MonteCarloruns.Inthecaseofhc=3,Q-CKTDresultsshowsameanandstandarddeviationof349.8713368.0790,whereasQ-KTDshows558.57731012.3withtheoptimalparameters.Thisobservationrevealsthepositiveeffectthatrobustnessofcorrentropyasacostfunctionbringstolearningunderchangingpolicies.Forbetterunderstanding,wefurtherobservetheaveragestepnumberateachtrialover50MonteCarloruns(Figure 6-8 ).Notethatthesame30initialstatesareappliedforthe50MonteCarloruns.Q-CKTDtakesalargernumberofstepsatthebeginning,butaslearningprogresses(trialnumberincreases),itrequiresasignicantlyfewerstepspertrial.Wecanalsoseethatthesystemadaptstotheenvironmentandisabletondabetterpolicy.Notethat 87

PAGE 88

Figure6-7. TheperformanceofQ-CKTDwithdifferentcorrentropykernelsizes. Figure6-8. AveragenumberofstepspertrialofQ-KTDandQ-CKTD. untilthe200thstep,thepolicyiscompletelyrandom,andthus,bothQ-KTDandQ-CKTDshowanaveragenumberofstepslargerthan200.Thetrialsthatreachthegoalevenunderarandompolicyareabletodosobecausetheirinitialpositionsarecloseenoughtothegoal. 6.2TwoDimensionalSpatialNavigationTaskWehaveobservedthebenetsofusingthekernelbaserepresentationsinpracticalapplications.BeforeapplyingQ-KTD()andQ-CKTDtoneuraldecodinginbrainmachineinterfaces,wepresentsomeresultsonasimulated,2-dimensionalspatialnavigationtask.Thissimulationwillprovideinsightsabouthowthesystemwillperform 88

PAGE 89

infurtherpracticalexperiments.Thissimulationsharessomesimilaritieswiththeneuraldecodingexperiment;basedontheinputstates,thesystempredictswhichdirectionshouldfollow,anddependingontheupdatedposition,thenextinputstatesareprovided.Thegoalistoreachatargetareawhereapositiverewardisassigned.Nopriorinformationoftheenvironmentisgiven;thesystemisrequiredtoexploretheenvironmenttoreachthetarget.Thissimulationisamodiedversionofthemazeproblemin[ 14 ].Inourcasethereisa2-dimensionalstatespacethatcorrespondstoasquarewithaside-lengthof20units.Thegoalistonavigatefromanypositiononthesquaretoatargetlocatedwithinthesquare.Inourexperiments,onetargetislocatedatthecenterofthesquare(10,10)andanyapproximationswithina2unitradiusareconsideredsuccessful.Axedsetof25pointsdistributedinalatticecongurationaretakenasinitialseedsforrandominitialstates.Eachinitialstatecorrespondstodrawingrandomlyoneofthese25pointswithequalprobability.Thelocationoftheselectedpointisfurtherperturbedwithunitvariance,zeromeanadditiveGaussiannoise,G(0,1).Tonavigatethroughthestatespace,wecanchoosetomove3unitsoflengthinoneofthe8possibledirectionsthatareallowed.Themaximumnumberofstepspertrialislimitedto20.Theagentgetsareward+0.6everytimeitreachesthegoal,andthenanewtrialstarts.Otherwise,areward)]TJ /F5 11.955 Tf 9.3 0 Td[(0.6isgiven.Explorationrateof=0.05anddiscountfactor=0.9areused.ThekernelemployedistheGaussiankernelwithsizeh=4.Thiskernelsizeisselectedbasedonthedistributionofsquareddistancebetweenpairsofinputstates.Toassesstheperformance,wecountthenumberoftrialswhichearnedthepositiverewardwithinagroupof25trials;thatis,every25trials,wecalculatethesuccessrateofthelearnedmappingas(#ofsuccessfultrials)=25.Tohelpinunderstandingthebehaviorandillustratingtheroleoftheparameters,withaxedkernelsizeof4,the 89

PAGE 90

performanceoverthevariousstepsizes(=0.01,0.10.9with0.1intervals)andvaluesofeligibilitytracerate(=0,0.2,0.5,0.8,1)areshowninFigure 6-9 Figure6-9. Theaveragesuccessratesover125trialsand50implementations. Thestepsizemainlyaffectsthespeedoflearning,andwithinthestabilitylimits,largerstepsizesprovidefasterconvergence.However,duetotheeffectofeligibilitytracerate,thestabilitylimitssuggestedin[ 28 ]mustbeadjustedaccordingly
PAGE 91

largestnalltersizeis2500.However,withagoodadaptivesystem,thenalltersizecanbereduced.Thenalltersizecorrespondsinverselywiththesuccessrates(Figure 6-9 and 6-10 ).Highsuccessratesmeanthatasystemhaslearnedthestate-actionmapping,whereasasystemthathasnotadaptedtonewenvironment,keepsexploringthespace.Therefore,highsuccessrateswillcorrespondtosmallltersizesandviceversa. Figure6-10. Theaveragenalltersizesover125trialsand50implementations. Bothaveragesuccessrateandnalltersizeshowthat=0.9and=0havethebestperformance.Withtheselectedparameters,thesuccessratesapproachtoover95%after100trials.FromFigure 6-11 ,wecanobservehowlearningisaccomplished.Atthebeginning,thesystemexploresmoreofthespacebasedontherewardinformation,andthetrajectorieslooksrathererratic.Oncethesystemstartslearning,actionscorrespondingtostatesnearthetargetpointtowardtherewardzone,andastimegoesbythisareabecomeslargerandlargeruntilitcoversthewholestatespace.Thebluestartsrepresentthe25initialstates,andgreenarrowsshowstheactionchosenateachstate.Reddotatthecenteristhetargetandredcircleshowstherewardzone. 91

PAGE 92

Figure6-11. Twodimensionalstatetransitionsoftherst,third,andfthsetswith=0.9and=0. Kernelmethodsarepowerfulforsolvingnonlinearproblems,butthegrowingcomputationalcomplexityandmemorysizelimittheirapplicabilityinpracticalscenarios.Toovercomethis,wealsoshowhowthequantizationapproachpresented[ 9 ]canbeemployedtoamelioratethelimitationsimposedbygrowingltersizes.Foraxedsetof125inputs,weconsiderquantizationsizesU=40,30,20,10,5,2,1.Figure 6-12 showstheeffectofdifferentquantizationsizesonthenalperformance.Noticethattheminimumsizeforstableperformanceofthelterisreachedaroundapproximately60units.Therefore,thequantizationsizeUcanbeselected,andthemaximumsuccessrateisstillbeingachieved(seeFigure 6-13 ). Figure6-12. Theaveragesuccessratesover125trialsand50implementationswithrespecttodifferentltersizes. 92

PAGE 93

Figure6-13. Thechangeofsuccessrates(top)andnalltersize(bottom)withU=5. LetusnowcomparetheperformancebetweenQ-KTDandQ-CKTD.BasedonFigure 6-9 and 6-10 ,weselect=0,=0.9,kernelsizeh=4forbothalgorithms.InthecaseofQ-CKTD,thecorrentropykernelsizehc=10isselectedbyvisualinspection.Table 6-1 showstheaveragesuccessrateofQ-KTDandQ-CKTD.Notethattheaveragevaluescorrespondto50MonteCarlorunsusing125trialsperrun. Table6-1. TheaveragesuccessrateofQ-KTDandQ-CKTD. mean standarddeviation Q-KTD 0.7019 0.0674 Q-CKTD 0.7248 0.0455 Aswecanseee,theQ-CKTDalgorithmshowshigheraveragesuccessratesaswellassmallervarianceamongruns.Figure 6-14 depictstheevolutionoftheaveragesuccessratesalongwiththeirvarianceestimatesacross50MonteCarloruns.Every25trials,wecountthenumberoftrialsthatearnedpositiverewardwithinthe25trialintervals.Bothalgorithms,Q-KTDandQ-CKTD,showsimilarperformanceattheverybeginning.However,asthenumberoftrialsincrease,Q-CKTDdisplayshigheraveragesuccessratesthanQ-KTD.Thesedifferencesaremorenoticeableatthe50thand100thtrials.Inaddition,itisalsoimportanttohighlightthebehaviorofthestandarddeviationforQ-CKTD,whichdecreasesmuchfasterthanQ-KTDasthenumberoftrials 93

PAGE 94

increases.Theseresultsshowthattherobustnessofthecorrentropycriterionasacostfunctioncanhelpinlearningthepolicy. Figure6-14. ThechangeofaveragesuccessratesbyQ-KTDandQ-CKTD. Inthischapter,wetestedQ-KTD()andQ-CKTDonsyntheticexperimentstondagoodpolicybasedontheapproximationoftheactionvaluefunctionQinRL.WesawthatQ-KTD()providedstableperformanceincontinuousstatespacesandfoundgoodstatetoactionmappings.Inaddition,weobservedthattherobustnatureofQ-CKTDhelpedimproveperformanceunderchangingpolicies.Experimentalresultsalsoprovidedinsightsonhowtoperformparameterselection.Inaddition,weshowedhowthequantizationapproachcouldbesuccessfullyappliedtocontrolthegrowingltersize.Theresultsshowedthatthemethodwasabletondagoodpoliciesandtobeimplementedinmorerealisticscenarios. 94

PAGE 95

CHAPTER7PRACTICALIMPLEMENTATIONSInthepreviouschaptersweusetheBoyanChainproblemtoelucidatethepropertiesofthedifferentproposedalgorithmswhenestimatingstatevaluefunctions.WeobservedbothlinearandnonlinearcapabilitiesinKTD().Giventheappropriatekernelsize,KTDshouldbeabletoapproximatebothlinearandnonlinearfunctions.Inaddition,theMountain-carand2-dimensionalspatialnavigationexperimentsshowedtheadvantagesofQ-KTD()incontinuousstatespaceswherethenumberofstatesisessentiallyinnite.Theuseofkernelsallowsarbitraryinputspacesandworkswithlittlepriorknowledgeofpolicy.Q-KTD()isasimpleyetpowerfulalgorithmtosolveRLproblems.OurultimategoalistoshowthatKTD()canworkinmorerealisticscenarios.Toillustratethis,wepresentarelevantsignalprocessingapplicationinbrainmachineinterfaces.InourRLBMIexperiments,weusemonkeys'neuralsignaltomapanactiondirection(computercursorposition/robotarmposition).Theagentstartsatanaivestate,butthesubjecthasbeentrainedtoreceiverewardsfromtheenvironment.Onceitreachestheassignedtarget,thesystemandthesubjectearnareward,andtheagentupdatesitsdecoderofbrainactivity.Throughiteration,theagentlearnshowtocorrectlytranslateneuralstatesintoaction-direction. 7.1OpenLoopReinforcementLearningBrainMachineInterface:Q-KTD()WerstapplytheneuraldecoderonopenloopRLBMIexperiments;thealgorithmlearnsbasedonthemonkey'sneuralstatestondapropermappingtoactionswhilethemonkeyisconductingagoalreachingtask.However,theoutputoftheagentdoesnotdirectlychangethestateoftheenvironmentbecausethisisdonewithpre-recordeddata.Theexternaldeviceisupdatedbasedonlyontheactualmonkey'sphysicalresponse.Thus,ifthemonkeyconductsthetaskproperly,theexternaldevicereachesthegoal.Inthissense,weonlyconsiderthemonkey'sneuralstatefromsuccessfultrials 95

PAGE 96

totraintheagent.Thegoalofthisexperimentistoevaluatethesystem'scapabilitytopredicttheproperstatetoactionmappingbasedonthemonkey'sneuralstatesandtoassesstheviabilityoffurtherclosedloopRLBMIexperiments. 7.1.1EnvironmentThedataemployedintheseexperimentsisprovidedbySUNYDownstateMedicalCenter.Afemalebonnetmacaqueistrainedforacenter-outreachingtaskallowing8actiondirections.Afterthesubjectattainsabout80%successrate,micro-electrodearraysareimplantedinthemotorcortex(M1).AnimalsurgeryisperformedundertheInstitutionalAnimalCareandUseCommittee(IACUC)regulationsandassistedbytheDivisionofLaboratoryAnimalResources(DLAT)atSUNYDownstateMedicalCenter.Asetof185unitsareobtainedaftersortingfrom96channels,andtheringtimesoftheseunitsaretheonesusedfortheneuraldecoding;theneuralstatesarerepresentedbytheringratesona100mswindow.Thereisasetof8possibletargetsand8possibleactiondirections.Everytrialstartsatthecenterpoint,andthedistancefromthecentertoeachtargetis4cm;anythingwithinaradiusof1cmfromthetargetpointisconsideredasavalidreach(Figure 7-1 ). Figure7-1. Thecenter-outreachingtaskfor8targets. 7.1.2AgentIntheagent,Q-learningviakerneltemporaldifference(Q-KTD)()isappliedtoneuraldecoding.Aftertheneuralstatesarepreprocessedbynormalizingtheirdynamicrangetoliebetween)]TJ /F5 11.955 Tf 9.3 0 Td[(1and1,theyareinputtothesystem.Basedonthepreprocessedneuralstates,thesystempredictswhichdirectionthecomputercursorwillbeupdated. 96

PAGE 97

Eachoutputunitrepresentsoneofthe8possibledirections,andamongthe8outputsoneactionisselectedbythe-greedymethod[ 56 ].Theperformanceisevaluatedbycheckingwhethertheupdatedpositionreachestheassignedtarget,anddependingontheupdatedposition,arewardvaluewillbeassignedtothesystem. 7.1.3Center-outReachingTask-SingleStepFirst,weobservethebehaviorofthealgorithmsonasinglestepreachingtask.Thismeansthatrewardsfromtheenvironmentarereceivedafterasinglestepandoneactionisperformedbytheagentpertrial.Theassignmentofrewardisbasedonthe1)]TJ /F5 11.955 Tf 12.08 0 Td[(0distancetothetarget,thatis,dist(x,d)=0ifx=d,anddist(x,d)=1,otherwise.Oncethecursorreachestheassignedtarget,theagentgetsapositivereward(+0.6),otherwiseitreceivesnegativereward()]TJ /F5 11.955 Tf 9.3 0 Td[(0.6)[ 41 ].Basedontheselectedactionwithexplorationrate=0.01,andtheassignedrewardvalue,thesystemisadaptedasinQ-learningviakernelTD()with=0.9.Inourcase,wecanconsider=0sinceourexperimentperformssinglestepupdatespertrial.Inthisexperiment,theringratesofthe185unitson100mswindowsaretimeembeddedusing6thordertapdelay,thiscreatesarepresentationspacewhereeachstateisavectorwith1295dimensions.Thesimplestversionoftheproblemlimitsthenumberoftargetsto2(rightandleft),andthetargetsshouldbereachedwithinasinglestep.Thetimedelayedneuralnet(TDNN)hasalreadybeenappliedtoRLBMIexperiments,anditsapplicabilityinneuraldecodinghasbeenvalidatedin[ 13 31 ].Thus,theperformanceoftheQ-KTDalgorithmiscomparedwithaTDNNasamapper.Thetotalnumberoftrialsis43forthe2targets.ForQ-KTD,weemploytheGaussiankernel( 4 ),andthekernelsizehisheuristicallychosenbasedonthedistributionofthemeansquareddistancebetweenpairsofinputstates;lets=E[kxi)]TJ /F4 11.955 Tf 12.7 0 Td[(xjk2)],thenh=p s=2.Forthisparticulardataset,theaboveheuristicgivesakernelsizeh=7.Thestepsize=0.3isselectedbasedonthestabilityboundthatwasderivedforKLMS[ 28 ], 97

PAGE 98


PAGE 99

However,oneapparentdisadvantageofusinganonparametricapproachsuchasKTDisthegrowinglterstructure,whichisconsideredaprohibitiveconstraintforpracticalapplications;theltersizeincreaseslinearlywiththeinputdata,whichinanonlinescenarioisprohibitive.Therefore,methodsforcontrollingthegrowthofthelterarenecessary;fortunately,thereexistsmethodstoavoidthisproblemsuchasthesurprisemeasure[ 25 ]orthequantizationapproach[ 9 ],whichareincorporatedinouralgorithmforthe2targetcenter-outreachingtasks.Withoutcontrollingtheltersize,thesuccessratesreacharound100%within3epochs,butwithinonly20epochs,theltersizebecomesaslargeas861units.Usingthesurprisemeasure[ 25 ],theltersizecanbereducedto87centerswithacceptableperformance.However,quantizationmethod[ 9 ]allowstheltersizetobereducedto10unitsandtohaveperformanceabove90%successrate.Therefore,moreexperimentsapplyingthequantizationapproachareconducted.Figure 7-3 showstheeffectofltersizeinthe2targetexperiment. Figure7-3. Theaveragesuccessratesover20epochsand50MonteCarlorunswithrespecttodifferentltersizes. Forltersizesassmallas10units,theaveragesuccessratesremainstable.Thus,ltersize10canbechosenwhenefcientcomputationisnecessary.Figure 7-4 showsthelearningcurvescorrespondingtodifferentltersizesincomparisonwithTDNN.Theaveragesuccessratesarecomputedover50MonteCarloruns. 99

PAGE 100

Figure7-4. ThecomparisonofKTD(0)withdifferentnalltersizesandTDNNwith10hiddenunits. Aswepointedout,inthecaseoftotalltersizeof10(redline),thealgorithmshowsalmostthesamelearningspeedasthelinearlygrowingltersize,withsuccessratesabove90%.WhenwecomparetheaveragelearningcurvestoTDNN,evenalterwith3units(magentaline)usingKTD(0)performsbetterthanTDNN.Inthe2targetsinglestepcenteroutreachingtask,Q-KTD(0)showedpromisingresultssolvingtheinitializationandgrowingltersizeissues.FurtheranalysisofQ-KTD(0)isconductedonamoredifculttaskinvolvingalargernumberoftargets.Alltheexperimentalvaluesarekeptxedusingthesamesetupfromtheaboveexperiments.Theonlychangesarethenumberoftargetsfrom2to8(18)andstepsize=0.5.Sincethetotalnumberoftrialsis178inthisexperiment,withoutanymechanismtocontroltheltersize,thelterstructurecangrowupto1780unitswithin10epochs.Thequantizationapproach[ 9 ]isagainappliedtoreducetheltersize.Intuitively,thereisanintrinsicrelationbetweenquantizationsizeUandkernelsizeh.Consequently,basedonthedistributionofsquareddistancebetweenpairsofinputstates,variouskernelsizes(h=0.5,1,1.5,2,3,5,7)andquantizationsizes(U=1,110,120,130)aretested.The 100

PAGE 101

correspondingsuccessratesfornalltersizesof178,133,87,and32aredisplayedinFigure 7-5 Figure7-5. Theeffectofltersizecontrolon8-targetsingle-stepcenter-outreachingtask.Theaveragesuccessratesarecomputedover50MonteCarlorunsafterthe10thepoch. Again,sincealltheparametersarexedoverthe50MonteCarloruns,thenarrowerrorbarsareduetotherandomactionselectionforexploration,andthissmallvariationsupportsthatthiskernelapproachdoesnotheavilydependoninitializationunliketheconventionalTDlearningalgorithmssuchasneuralnets.Withanalltersizeof178(blueline),thesuccessratesaresuperiortoanyotherltersizesforeverykernelsizestested,sinceitcontainsalltheinputinformation.Especiallyforsmallkernelsizes(h2),successratesabove96%areobserved.Moreover,notethatevenafterreductionofthestateinformation(redline),thesystemstillproducesacceptablesuccessratesforkernelsizesrangingfrom0.5to2(around90%successrates).Intuitivelythelargestkernelsizesthatprovidegoodperformancearebetterforgeneralization;inthissense,akernelsizeh=2isselectedsincethisisthelargestkernelsizethatconsiderablyreducestheltersizeandyieldsaneuralstatetoactionmappingthatperformswell(around90%ofsuccessrates).Inthecaseofkernelsizeh=2withnalltersizeof178,thesystemreaches100%successratesafter6epochs 101

PAGE 102

withamaximumvarianceof4%(Figure 7-6 ).Toobservethelearningprocess,successratesarecalculatedaftereachepoch(1epochcontains178trials). Figure7-6. Theaveragesuccessratesforvariousltersizes. The8-targetexperimentshowstheeffectoftheltersize,andhowitconvergesafter6epochs(Figure 7-6 ).Aswecanseefromthenumberofunitsinbothcases,higherrepresentationcapacityisrequiredtoobtainthedesiredperformanceasthetaskbecomesmorecomplex(Figure 7-4 and 7-6 ).Theresultsofthealgorithmonthe8-targetcenter-outreachingtaskshowedthatthemethodcaneffectivelylearnthebrain-stateactionmappingforthistaskandisstillfeasible. 7.1.4Center-outReachingTask-Multi-StepHere,wewanttodevelopamorerealisticscenario.Therefore,weextendthetasktomulti-stepandmulti-targetexperiments.ThiscaseallowsustoexploretheroleoftheeligibilitytracesinQ-KTD().Thepricepaidforthisextensionisthatnow,lambda0<<1selectionneedstobecarriedoutaccordingtothebestobservedperformance.Testingbasedonthesameexperimentalsetupaswiththesinglesteptask,thatis,adiscreterewardvalueisassignedatthetarget,causesextremelyslowlearningsincenoguidanceisgiven.Thesystemrequireslongperiodsofexplorationuntilitactuallyreachesthetarget.Therefore,weemployacontinuousrewarddistributionaroundthe 102

PAGE 103

selectedtargetdenedbythefollowingexpression: r(s)=8><>:prewardG(s)ifG(s)>0.1,nrewardifG(s)0.1.whereG(s)=exp(s)]TJ /F3 11.955 Tf 11.95 0 Td[()>C)]TJ /F6 7.97 Tf 6.59 0 Td[(1(s)]TJ /F3 11.955 Tf 11.96 0 Td[()(7)wheres2R2isthepositionofthecursor,preward=1,andnreward=)]TJ /F5 11.955 Tf 9.3 0 Td[(0.6.ThemeanvectorcorrespondtotheselectedtargetlocationandthecovariancematrixC=R0B@7.5000.11CAR>whereR=0B@cossin)]TJ /F5 11.955 Tf 11.29 0 Td[(sincos1CAdependsontheangleoftheselectedtargetasfollows:fortargetsoneandvetheangleis0,twoandsix)]TJ /F3 11.955 Tf 9.3 0 Td[(=4,threeandseven=2,andfourandeight=4.Figure 7-7 showstherewarddistributionfortargetone. Figure7-7. Rewarddistributionforrighttarget. Thesameformofdistributionisappliedtotheotherdirectionscenteredattheassignedtargetpoint.Theblackdiamondistheinitialposition,andthepurplediamondshowsthepossibledirectionsincludingtheassignedtargetdirection(reddiamond).Oncethesystemreachestheassignedtarget,thesystemearnsamaximumrewardof+1,andreceivespartialrewardsaccordingto( 7 )duringtheapproachingstage. 103

PAGE 104

Whenthesystemearnsthemaximumreward,thetrialisclassiedasasuccessfultrial.Themaximumnumberofstepspertrialislimitedsuchthatthecursormustapproachthetargetonastraightlinetrajectory.Here,wealsocontrolthecomplexityofthetaskbyallowingdifferentnumberoftargetsandsteps.Namely,2-step4-target(right,up,left,anddown);and4-step3-target(right,up,anddown)experimentsareperformed.Increasingthenumberofstepspertrialamountstomakingsmallerjumpsaccordingtoeachaction.Aftereachepoch,thenumberofsuccessfultrialsarecountedforeachtargetdirection.Figure 7-8 showsthelearningcurvesforeachtargetandtheaveragesuccessrates. A2-step4-target B4-step3-targetFigure7-8. Thelearningcurvesformultistepmultitargettasks. Largernumberofstepsresultsinlowersuccessrates.However,thetwocases(twoandfoursteps)obtainanaveragesuccessrateabove60%for1epoch.Thisresultsuggeststhatthealgorithmscouldbeappliedinonlinescenarios.Theperformancesshowalldirectionscanachievesuccessratesabove70%afterconvergence. 7.2OpenLoopReinforcementLearningBrainMachineInterface:Q-CKTDWehavealreadyseentheperformanceofQ-KTD()tondanoptimalneuraltomotormapping.Inthissection,wewanttocomparetheperformanceofQ-learningviaKTD()andCKTD.Bothalgorithmsareappliedtopassivedataonacenter-outreachingtaskaimingat4targets(right,up,left,anddown).Thedifferencebetween 104

PAGE 105

thepassivedataandthedataemployedintheprevioussectionisthatinthepassivedatathemonkeydoesnotperformanymovement,itonlyobserveshowthepositionofacursorchangesovertime.Neuralstatesarerecordedwhilethemonkeywatchesthescreenchangingthroughthedurationoftheexperiment.Spiketimesfrom49unitsareconvertedtoringratesusinga100mswindow,anda9thordertapdelaylineisappliedtoinput,hence,490dimensionsareusedtorepresenttheneuralstates.Thetotalnumberoftrialsis144,andeachtrialisinitializedatthecenter,andallows2stepstoapproachthetarget.Thedistancebetweentheinitialpoint(center)andthetargetcanbecoveredin1step.Atrialisterminatedonceitpasses2stepsorreceivespositivereward+1.5.Here,thepositiverewardvalueisassignedwhenthecursorreachestherewardzone(0.2distancefromanassignedtarget).Otherwise,itearnsnegativereward)]TJ /F5 11.955 Tf 9.3 0 Td[(0.6.Thediscountfactoris=0.9,theexplorationrate=0.01,andstepsize=0.5.Intheseexperiments,wedonotapplyanyltersizecontrol.ThekernelsizeforKTDischosenbasedonthedistributionofsquareddistancebetweenpairsofinputstatesresultinginh=0.8.Whenwexthelterkernelsizetoh=0.8andapplyQ-CKTD,thereisnosignicantdifferencebetweenQ-KTDandQ-CKTD.However,bychangingthelterkernelsizeto1,Q-CKTDshowsimprovementoverKTD.Here,thecorrentropykernelsizeishc=1.Thesuccessratesateachepochareobtainedastheaveragenumberofsuccessfultrialsoverthe4targets.Thesesuccessratesarefurtherestimatedbyaveragingover50MonteCarloruns;theresultsaredisplayedinFigure 7-9 .TheaveragesuccessratesbetweenQ-KTDwithlterkernelsizeh=0.8andQ-CKTDwithlterkernelsizeh=1andcorrentropykernelsizehc=1arecompared(Figure 7-9 (a)).CKTDshowsimprovedsuccessratesforthe1stand2ndepochs.However,thesuccessratesafterthe3rdepochremainessentiallyequaltothoseforKernelTD.SincecorrentropyKTDweightsareacombinationoftheerrorvaluesewiththelevelofimportancebasedonthe(e),andtheerrordistributionchangesduring 105

PAGE 106

ACorrentropyKTDwithxedcorrentropykernelsize1. BCorrentropyKTDwithreducedkernelsizefrom1to0.8at3rdepoch.Figure7-9. Averagesuccessratesover50runs.Thesolidlineshowsthemeansuccessrates,andthedashedlineshowsthestandarddeviation. thelearningprocess,itisreasonabletoassumethatthesizeofcorrentropykernelmayneedtobeadjustedaslearningprogresses.Aprincipledmethodtoselectthecorrentropykernelsizeisstillunderdevelopment;however,wechosetomanuallysetchangesinthecorrentropykernelsizebyobservingtheevolutionoftheerrors.Thecorrentropykernelsizehcisreducedfrom1to0.8atthe3rdepoch.Aswepredicted,improvementinsuccessratesisobservedatthe3rdand4thepochs(Figure 7-9 (b)).Thismotivatesfurtherworkonnetuningthecorrentropykernelsize,andthussomeoftheeffortwillbedevotedtothisissue.TounderstandbetterthepropertiesofQ-KTDandQ-CKTD,weobservethebehaviorofotherquantitiessuchastheactualpredictionsoftheQ-valuesandindividualsuccessratesaccordingtoeachtarget.ThesequantitiesareobservedbyemployingthesameparametersetasinFigure 7-9 (b),buttheresultsareobtainedfromasinglerun.First,theQ-valuechangesareobservedateachtrial(Figure 7-10 ).CorrentropyKTDhasslowerconvergenceinQ-valuesthanKernelTD.However,correntropyKTDshowshighersuccessratesovertime.Inaddition,whenwechecktheQ-valuechangesatthe1stepoch(Figure 7-11 (a)and(b)),correntropyKTDhashighervalues,anditattemptstoexploremoredirectionsduringlearning.Sincethepositiverewardis1.5and 106

PAGE 107

AKernelTD. BCorrentropyKTD.Figure7-10. Q-valuechangespertiralduring10epochs. theQ-valuerepresentsexpectedrewardgivenstateandaction,itisdesirableasavaluepredictorforittoconvergeto1.5.AlthoughtheQ-valuespredictedbyKernelTDareclosertothepositivereward1.5(Figure 7-11 (c)and(d)),thevarianceoftheQ-valuedoesnotaffectthesuccessrates.Thisleavesanopenquestionaboutwhatpropertiesofcorrentropymaybeinvolvedinthisbehavior,anditbecomesanimportantreasontocarryoutfurtheranalysisinordertofullyunderstandthealgorithm.Thesuccessrateofeachtargetisobservedfrom1stto5thepoch(Figure 7-12 ).Targetindices1,3,5,and7representright,up,left,anddownrespectively.WhenweapplyKernelTD,atthebeginning,thelearningdirectiontendstofocusoncertaindirections;duringtherstepochtheagentmainlylearnsthedowndirection(targetindex7),andduringthesecondepochthelearninginclinestowardstheleftdirection(targetindex5)(Figure 7-12 (a)).However,thelearningvariationovereachdirectionincorrentropyKTDissmallerincomparisonwithKernelTD(Figure 7-12 (b)). 7.3ClosedLoopBrainMachineInterfaceReinforcementLearningQ-KTD()hasbeentestedonopenloopRLBMIexperiments,andwehaveseenthatthealgorithmperformswellontheopenloopRLBMIexperiment.Therefore,theapplicationhasprogressedtoclosedloopRLBMIexperiments.InclosedloopRLBMIexperiments,theagentistrainedtondamappingfromthemonkey'sneuralstates 107

PAGE 108

AAt1stepochbyKTD. BAt1stepochbyCorrentropyKTD. CAt10thepochbyKTD. DAt10thepochbyCorrentropyKTD.Figure7-11. TargetindexandmatchingQ-values. toarobotarmposition.Themonkeyhasbeentrainedtoassociateitsneuralstateswithaparticulartaskgoal.Thebehaviortaskisareachingtaskusingaroboticarm,inwhichthedecodercontrolstherobotarm'sactiondirectionbypredictingthemonkey'sintentbasedonitsneuronalactivity.Iftherobotarmreachestoanassignedtarget,arewardwillbegiventoboththemonkey(foodreward)andthedecoder(positivevalue).Noticethatthetwointelligentsystemslearnco-adaptivelytoaccomplishthegoal.TheseexperimentsareconductedincooperationwiththeNeuroprostheticsResearchGroupattheUniversityofMiami.Theperformanceisevaluatedintermsoftaskcompletionaccuracyandspeed.Furthermore,weattempttoevaluatetheindividualperformanceofeachoneofthesystemsintheRLBMI. 7.3.1EnvironmentDuringpre-training,amarmosetmonkeyhasbeentrainedtoperformatargetreachingtaskaimedattwospatiallocations(AorBtrial);themonkeywastaughttoassociatechangesinmotoractivityduringAtrials,andproducestaticmotorresponses 108

PAGE 109

AKTD. BCorrentropyKTD.Figure7-12. Thesuccessratesofeachtargetover1through5epochs. duringBtrials.Whenonetargetisassigned,thetrialstartswithabeep.Toconductthetrialduringtheusertrainingphase,themonkeyisrequiredtosteadilyplaceitshandonatouchpadfor7001200ms.ThisactionproducesagobeepthatisfollowedbyoneofthetwotargetLEDsbeingliton(Atrial:redlightforleftdirectionorBtrial:greenlightforrightdirection).Therobotarmgoesuptohomeposition,namely,thecenterpositionbetweenthetwotargets.Itsgrippershowsanobject(foodrewardsuchaswaxwormormarshmallowforAtrialandundesirableobject(woodenbead)forBtrial).FortheAtrial,themonkeyshouldmoveitsarmtoasensorwithin2000ms,andfortheBtrial,themonkeyshouldholditsarmontheinitialsensorfor2500ms.Ifthemonkeysuccessfullyconductsthetask,therobotarmmovestotheassigneddirection,thetargetLEDlightblinks,andthemonkeygetsthefoodreward.Afterthemonkeyistrainedtoperformtheassignedtaskproperly,amicro-electrodearray(16-channeltungstenmicroelectrodearrays,TuckerDavisTechnologies,FL)issurgicallyimplantedunderisouraneanesthesiaandsterileconditions.IntheclosedloopRLBMI,neuralstatesfromthemotorcortex(M1)arerecorded.Theseneuralstatesbecomeinputstotheneuraldecoder.Allsurgicalandanimalcareprocedureswere 109

PAGE 110

consistentwiththeNationalResearchCouncilGuidefortheCareandUseofLaboratoryAnimalsandwereapprovedbytheUniversityofMiamiInstitutionalAnimalCareandUseCommittee.Intheclosedloopexperiment,aftertheinitialholdingtimethatproducesthegobeep,theroboticarm'spositionisupdatedbasedsolelyonthemonkey'sneuralstates,andthemonkeyisnotrequiredtoperformanymovementunlikeduringtheuserpre-trainingsessions.Duringthereal-timeexperiment,14neuronsareobtainedfrom10electrodes.Theneuralstatesarerepresentedbytheringratesona2secwindowfollowingthegosignal. 7.3.2AgentFortheBMIdecoder,weuseQ-learningimplementedwithkernelTemporalDifferences(Q-KTD)().TheadvantageofKTDforonlineapplicationsisthatitdoesnotdependontheinitialization;neitherdoesitrequireanypriorinformationaboutinputsates.Also,thisalgorithmbringstheadvantagesofbothTDlearning[ 50 ]andkernelmethods[ 44 ].Thereforeitisexpectedthatthealgorithmpredictsproperlytheneuralstatetoactionmap,eventhoughtheneuralstatesvaryineachexperiment.Basedonthemonkey'sneuralstate,theBMIdecoderproducesanoutputusingtheQ-KTDalgorithm.Theoutputrepresentsthe2possibledirections(leftandright),andtherobotarmmovesaccordingly.Onebigdifferencebetweenopenandclosedloopapplicationsistheamountofaccessibledata;intheclosedloopexperiment,wecanonlygetinformationabouttheneuralstatesuptothecurrentstate.Inthepreviousofineexperiment,normalizationandkernelselectionwereconductedofflinebasedontheentiredataset.However,itisnotpossibletoapplythesamemethodtotheonlinesettingsinceweonlyhaveinformationabouttheinputstatesuptothepresenttime.Normalizationisascalingfactorwhichimpactsthekernelsize;properselectionoftgekernelsizebringsproperscalingtothedata.Thedynamicrangeofstatescanchangefromexperimentto 110

PAGE 111

experiment.Consequently,inanonlineapplication,thekernelsizeneedstobeadjustedateachtime.Beforegettinganyneuralstates,thekernelsizecannotbedetermined.Thus,incontrasttothepreviousopenloopexperiments,normalizationoftheinputneuralstatesisnotapplied,andthekernelsizeisautomaticallyselectedfromthegiveninputs.ForQ-KTD(),theGaussiankernel( 4 )isemployed.Thekernelsizehisautomaticallyselectedbasedonthehistoryofinputs.Notethatintheclosedloopexperiments,thedynamicrangeofstatesvariesfromexperimenttoexperiment.Consequently,thekernelsizeneedstobereadjustedeachtimeanewexperimenttakesplaceandcannotbedeterminedbeforehand.Ateachtime,thedistancesbetweenthestatesarecomputedtocalculatetheoutputvalues.Therefore,weusethedistancevaluestoselectthekernelsizeasfollows: htemp(n)=vuut 1 2(n)]TJ /F5 11.955 Tf 11.96 0 Td[(1)n)]TJ /F6 7.97 Tf 6.59 0 Td[(1Xi=1kx(i))]TJ /F4 11.955 Tf 11.95 0 Td[(x(n)k2 (7) h(n)=1 n"n)]TJ /F6 7.97 Tf 6.59 0 Td[(1Xi=1h(i)+htemp(n)# (7) Usingthesquareddistancesbetweenpairsofpreviouslyseeninputstates,wecanobtainanestimateofthemeandistance,andthisvalueisalsoaveragedalongwithpastkernelsizestoassignthecurrentkernelsize.Theinitialerrorissettozero,andtherstinputstatevectorisassignedastherstunit'scenter.Normalizationoftheinputneuralstatesisnotapplied,andastepsize=0.5isused.Moreover,weconsider=1and=0sinceourexperimentperformssinglesteptrialsin( 5 ). 7.3.3ResultsTheoverallperformanceisevaluatedbycheckingwhethertheroboticarmreachestheassignedtargetornot.Oncetherobotarmreachesthetarget,thedecodergetsapositivereward+1,otherwise,itreceivesnegativereward)]TJ /F5 11.955 Tf 9.29 0 Td[(1. 111

PAGE 112

Figure 7-13 showsthedecoderperformancefor2experiments;therstexperiment(leftcolumn)hasatotalof20trials(10Atrialsand10Btrials).Theoverallsuccessratewas90%.Onlythersttrialforeachtargetwasmis-assigned.Thesecondexperiment(rightcolumn)hasatotalof53trials(27Atrialsand26Btrials),withoverallsuccessrateof41=53(around77%).Althoughthesuccessrateofthesecondexperimentis Figure7-13. PerformanceofQ-learningviaKTDintheclosedloopRLBMIcontrolledbyamonkeyforexperiment1(left)andexperiment2(right);Thesuccess(+1)andfailure()]TJ /F5 11.955 Tf 9.29 0 Td[(1)indexofeachtrial(top),thechangeofTDerror(middle),andthechangeofQ-values(down). notashighastherstexperiment,bothexperimentsshowthatthealgorithmlearnsanappropriateneuralstatetoactionmap.Eventhoughthereisvariationamongtheneuralstateswithineachexperiment,thedecoderadaptswelltominimizetheTDerror,andtheQ-valuesconvergetothedesiredvaluesforeachaction;sincethisisasinglesteptaskandthereward+1isassignedforasuccessfultrial,itisdesiredthattheestimatedQ-value~Qbecloseto+1. 112

PAGE 113

ItisobservedthattheTDerrorandQ-valueareoscillating.ThedrasticchangeofTDerrororQ-valuecorrespondstothemissedtrial.Theoverallperformancecanbeevaluatedbycheckingwhethertherobotarmreachesthedesiredtargetornot(thetopplotsinFigure 7-13 ).However,thisassessmentdoesnotshowwhatcausesthechangeinthesystemvalues.Inaddition,itishardtoknowhowthetwoseparateintelligentsystemsinteractduringlearningandhowneuralstatesaffecttheoverallperformance. 7.3.4ClosedLoopPerformanceAnalysisSincethisRLBMIarchitecturecontains2separateintelligentsystemsthatco-adapt,itisimportanttohavenotonlyawellperformingBMIdecoderbutalsoawelltrainedBMIuser.Undertheco-adaptationscenario,itisobviousthatifonesystemdoesnotperformproperly,itwillcausedetrimentaleffectsontheperformanceoftheothersystem.IftheBMIdecoderdoesnotgiveproperupdatestotheroboticdevice,itwillconfusetheuserconductingthetask,andiftheusergivesimproperstateinformationorthetranslationiswrong,theresultingupdatemayfaileventhoughtheBMIdecoderwasabletondtheoptimalmappingfunction.Here,weanalyzehoweachparticipant(agentanduser)inuencestheoverallperformancebothinsuccessfulandmissedtrialsbyvisualizingthestates,correspondingactionvaluesQ,andresultingpolicyinatwo-dimensionalspace.ThisistherstattempttoevaluatetheindividualperformanceofthesubjectandthecomputeragentonaclosedloopReinforcementLearningBrainMachineInterface(RLBMI).Withtheproposedmethodology,wecanobservehowthedecodereffectivelylearnsagoodstatetoactionmapping,andhowneuralstatesaffectthepredictionperformance.Amajorassumptioninourmethodologyistoassumethattheuseralwaysimplementsthesamestrategytosolvethetask,otherwisethisanalysisbreaksdown.Undertheseconditions,whenthesystemencountersanewconditionwethereforeassumethattheuserisdistractedoruncooperative.Butthismaynotbethecaseandwedidnothaveaccesstoenoughextrainformationtoquantifybehaviorbesidesvisualinspection. 113

PAGE 114

Inthetwo-targetreachingtask,thedecodercontainstwooutputunitsrepresentingthefunctionsQ(x,a=left)andQ(x,a=right).ThepolicyisdeterminedbyselectingtheactionassociatedwithoneoftheseunitsbasedontheirQ-values.Theperformanceofthedecoderiscommonlyevaluatedintermsofsuccessratebycountingthesuccessfultrialsthatreachthedesiredtargets,alongwiththechangesintheTDerrorortheQ-values.However,thesecriteriaarenotwellsuitedtounderstandhowthetwointelligentsystemsinteractduringlearning.Forinstance,ifthereisachangeinperformanceoranerrorinthedecodingprocessitishardtotellwhichoneofthetwosubsystemsismorelikelytoberesponsibleforit.Anotheraddeddifcultyinevaluatingtheuser'soutputisthattheneuralstatesarehighdimensionalvectors.Inthissense,wewanttoapplyadimensionalityreductiontechniquetoproduceauser'soutputthatcanbevisualizedandeasilyinterpreted,neverthelessbeingindependentoftheclasslabels(unsupervised).Wefoundthatprincipalcomponentanalysis(PCA)onthesetofobservedneuralstatesissufcientforthegoalofthisanalysis.PCAisawellknownmethodtotransformdatatoanewcoordinatesystembasedoneigenvaluedecompositionofadatacovariancematrix.LetX=[x(1),x(2),,x(n)]>bethedatamatrixcontainingthesetofobservedstatesduringtheclosedloopexperimentuntiltimen.AtransformeddatasetY=XWcanbeobtainedbyusingthetransformationmatrixW,whichcorrespondstothematrixofeigenvectorsofthecovariancematrixN)]TJ /F6 7.97 Tf 6.58 0 Td[(1X>X.WithoutlossofgeneralityweassumethatthedataXhaszeromean.Thedistributionofstatesuptotimencanbevisualizedbyprojectingthehighdimensionalneuralstatesintotwodimensionsusingthersttwolargestprincipalcomponents.Inthistwo-dimensionalspaceofprojectedneuralstates,wecanalsoshowtherelationwiththedecoderbycomputingtheoutputsoftheunitsassociatedwitheachoneoftheactionsanddisplayingthemascontourplots.Asetoftwo-dimensionalspacelocationsYgridevenlydistributedontheplanecanbeprojectedinthehighdimensional 114

PAGE 115

spaceofneuralstatesas^X=YgridW>.LetQ(n)ibetheiunitfromthedecoderupdatedusing( 5 )attimen.WecancomputetheestimatedQ-valuesatapointyonthetwodimensionalplaneusing^Q(n)(^x=Wy,a=i).Inthisway,wecanextrapolatethepossibleoutputsthatthedecoderwouldproduceinthevicinityofthealreadyobserveddatapoints.Furthermore,thenalestimatedpolicycanbeobtainedbyselectingtheunitthatmatchestheactionwiththemaximumQ-valueamongalloutputunits(Figure 7-14 ). Figure7-14. Proposedvisualizationmethod. Here,wevisualizetheneuralstatesandcorrespondingQ-valuesandpolicyrelatedtothenalperformance.Thus,thenallearneddecoder^Q(T)andalltheneuralstatesXareutilized;thatis,n=TandXisofsizeTdwheredisthedimensionoftheneuralstatevectors.Noticethattheproposedmethodcanalsobeappliedatanystageofthelearningprocess;wecanobservethebehavioroftwosystemsatanyintermediatetimebyusingthesubsetofneuralstatesthathavebeenobservedaswellasthelearneddecoderuptothistime.Figure 7-15 providesavisualizationofthedistributionofthe14dimensionalneuralstatesprojectedintotwodimensions.Thecorrespondingcontourlevelsaretheestimatedactionvalues~Qusingthelearneddecoderfromtheclosedloopexperiment.Inaddition,weprovidethepartitionforleftandrightactionsintheprojectedtwodimensionalspace,whichcorrespondstothenalpolicyderivedfromtheestimatedQ-values.Theprojectionshowsthattheneuralstatesfromthetwoclassesareseparable.Asweexpected,theQ-valuesforeachdirectionhavehighervaluesonregionsoccupiedbythecorrespondingneuralstates.Forexample,theQ-valuesforthe 115

PAGE 116

Figure7-15. TheestimatedQ-values(top)andresultingpolicy(bottom)fortheprojectedneuralstatesusingPCAfromexperiment1(leftcolumn)andexperiment2(rightcolumn).TherstandthirdtopplotsshowtheQ-valuesforrightdirection,andthesecondandforthtopplotsshowtheQ-valuesforleftdirection. rightdirectionhavelargervaluesfortheareaslledbythestatescorrespondingtoBtrial.Thisisconrmedbyshowingthepartitionsachievedbytheresultingpolicy.Duringthetrainingsession,thesuccessrateswerehighlydependentonthemonkey'sperformance.Mostofthetimeswhentheagentpredictedthewrongtarget,itwasobservedthatthemonkeywasdistracted,oritwasnotinteractingwiththetaskproperly.Wearealsoabletoseethisphenomenonfromtheplots;thefailedtrialsduringtheclosedloopexperimentaremarkedasredstars(missedAtrials)andgreendots(missedBtrials).Wecanseethatmostoftheneuralstatesthatweremisclassiedappeartobeclosertothestatescorrespondingtotheoppositetargetintheprojectedstatespace.Thissupportstheideathatfailureduringthesetrialswasmainlyduetothemonkey'sbehaviorandnottothedecoder. 116

PAGE 117

Fromthebottomplots,itisapparentthatthedecodercanpredictnonlinearpolicies.Finally,theestimatedpolicyinexperiment2(bottomrightplot)showsthatthesystemeffectivelylearnsandgoesfromaninitiallymisclassiedAtrial(duringtheclosedloopexperiment),whichislocatedneartheborderandrightbottomareas,toanaldecoderwherethesamestatewouldbeassignedtotherightdirection.Itisaremarkablefactthatthesystemadaptstotheenvironmenton-line.WehaveappliedtheQ-KTD()andCKTDalgorithmstoneuraldecodingforbrainmachineinterfaces.IntheopenloopRLBMIexperiment,weconrmedthatthesystemwasabletondaproperneuralstatetoactionmapping.Inaddition,wesawhowbyusingcorrentropyasacostfunctiontherecouldbepotentialimprovementstothelearningspeed.Finally,Q-KTD()wassuccessfullyappliedtoclosedloopexperiments.Thedecoderwasabletoprovidetheproperrobotarmactions.Finally,weexploredarstattemptatanalyzingthebehaviorofthetwointelligentsystemsseparately.Withtheproposedmethodology,weobservedhowtheneuralstateinuencesthedecoderperformance. 117

PAGE 118

CHAPTER8CONCLUSIONSANDFUTUREWORKThereinforcementlearningbrainmachineinterface(RLBMI)[ 13 ]hasbeenshowntobeapromisingparadigmforBMIimplementations.Itallowsco-adaptivelearningbetweentwointelligentsystems;oneistheBMIdecoderontheagentside,andtheotheristheBMIuseraspartoftheenvironment.Fromtheagentside,theproperneuraldecodingofthemotorsignalsisessentialtocontroltheexternaldevicethatinteractswiththephysicalenvironment.However,thereareseveralchallengesthatmustbeaddressedinordertoturnRLBMIintoapracticalreality.First,algorithmsmustbeabletoreadilyhandlehighdimensionalstatesspacesthatcorrespondtotheneuralstaterepresentation.Themappingfromneuralstatestoactionsmustbeexibleenoughtohandlenonlinearmappingsyetmakinglittleassumptions.Algorithmsshouldrequireareasonableamountofcomputationalresourcesthatallowsrealtimeimplementations.Thealgorithmsshouldhandlecaseswhereassumptionsmaynothold,i.e.thepresenceofoutliersorperturbationsintheenvironment.Inthisthesis,wehaveintroducedalgorithmsthattakeintoaccounttheabovementionedissues.Wehaveemployedsyntheticexperimentsthatillustratethepropertiesoftheproposedmethodsandencouragetheirapplicabilityinpracticalscenarios.Finally,weappliedthesealgorithmstoRLBMIexperiments,showingtheirpotentialadvantagesinarelevantapplication.Westartedbyintroducingthreenewtemporaldifference(TD)algorithmsforstatevaluefunctionestimation.Statevaluefunctionestimationisanintermediatesteptondapropermappingfromstatetoaction,fromwhichallfundamentalfeaturesofthealgorithmscouldbeobserved.Thisfunctionalapproximationisabletohandlelargeamountsofinputdatawhichisoftenrequiredinpracticalimplementations.WehaveseenhowtheproposedTDalgorithmscanprovidefunctionalapproximationofstatevaluefunctionsgivenapolicy. 118

PAGE 119

Kerneltemporaldifference(KTD)()wasproposedbyintegratingkernel-basedrepresentationstotheconventionalTDlearning.Thebigadvantagesofthiskernel-basedlearningalgorithmarethenonlinearfunctionalapproximationcapabilitiesalongwiththeknownconvergenceguaranteesoflinearTDlearning,whichresultsinmoreaccurateandfasterlearning.Usingthedualrepresentations,itcanbeshownthattheconvergenceresultsforlinearTD()extendtothekernel-basedalgorithm.Byusingstrictlypositivedenitekernels,thelinearindependenceconditionisautomaticallysatisedforinputstaterepresentationsinabsorbingMarkovprocesses.ExperimentsonsimulateddatadrawnfromabsorbingMarkovchainsallowedustoconrmthemethod'snonlinearapproximationcapabilities.Moreover,robustvariantsofTD()andKTD()algorithmswereproposedbyusingcorrentropyasacostfunction.Namely,correntropytemporaldifference(CTD)andcorrentropykerneltemporaldifference(CKTD)werederivedforthecaseof=0.TD()andKTD()usemeansquareerror(MSE)astheirobjectivefunctionwhichhasknownlimitationsforhandlingnonGaussiannoisecorruptedenvironments.ExperimentsusingasyntheticabsorbingMarkovchainshowedCTDandCKTDareabletoprovidebetterrobustnessperformancethanMSEundernon-Gaussiannoiseorperturbedstatetransitions.WehaveobservedthatKTD()hasbetterperformancewith=0thanlargerduetotherelationbetweenstepsizeandtheeligibilitytracerate.Inmultisteppredictionproblems,whentheGaussiankernelisemployedinthesystem,largereligibilitytraceratesrequiresmallerstepsizesforstableperformance,whichalsodependsontheallowednumberofstepspertrial.Smallstepsizesforlargemaketheperformanceslowercomparedtothelargerlearningratesthatsmallvaluesallow.Thus,itisintuitivethattheperformanceofCTDandCKTDwithlargermaynotperformaswellas=0inon-lineimplementations.However,itisnecessarytofurtherexplorethebehaviorofCTDandCKTDwithgeneral.TheextensionofTD(0)andKTD(0)togeneral 119

PAGE 120

usingthemulti-steppredictionasastartingpointdoesnotseemtobeapplicableforcorrentropy,sincethereisnoobviouswaytointerchangetermsinthecostduetothenonlinearityofthekernelemployedbycorrentropy;noupdatescanbemadebeforeatrialiscomplete.Theupdaterulewederivedforgeneralrequiresustoupdatethesystemonceatrialiscomplete.Therefore,furtherstudyforthederivationofCTDandCKTDforgeneralisrequired.Inaddition,weobservedthatCTDandCKTDhavestableperformance.However,furtheranalysisisstillrequiredtodeterminetheconvergencepoints.Weextendedallproposedalgorithmstostate-actionvaluefunctionsbasedonQ-learning.Thisextensionallowsustondaproperstatetoactionmappingwhichcanbefurtherexploitedinpracticalcasessuchastheneuraldecodingprobleminreinforcementlearningbrainmachineinterfaces.TheintroducedTDalgorithmswereextendedtoestimateaction-valuefunctions,andbasedontheestimatedvalues,theoptimalpolicycanbedecidedusingQ-learning.ThreevariantsofQ-learningwerederived:Q-learningviacorrentropytemporaldifference(Q-CTD),Q-KTD(),andQ-CKTD.TheobservationandanalysisofCTD,KTD(),andCKTDgivesusabasicideaofhowtheproposedextendedalgorithmsbehave.However,inthecaseofQ-CTD,Q-KTD(),andQ-CKTD,theconvergenceanalysisisstillchallengingsinceQ-learningcontainsbothalearningpolicyandagreedypolicy.InthecaseofQ-KTD(),theconvergenceproofforQ-learningusingtemporaldifference(TD)()withlinearfunctionapproximationin[ 32 ]givesabasicintuitionfortheroleoffunctionapproximationontheconvergenceofQ-learning.Forthekernel-basedrepresentationinQ-KTD(),thedirectextensionoftheresultsfrom[ 32 ]wouldbringtheadvantagesofnonlinearfunctionapproximation.Nonetheless,toapplytheseresults,itisrequiredanextendedversionoftheordinarydifferentialequation(ODE)methodforHilbertspacevalueddifferentialequations. 120

PAGE 121

Theextendedalgorithmswereappliedtondanoptimalcontrolpolicyindecisionmakingproblemswherethestatespaceiscontinuous.WeobservedthebehaviorofQ-KTDandQ-CKTDundervariousparametersetsincludingkernelsize,stepsize,andeligibilitytracerate.Fromtheexperiments,weobservedthattheoptimallterkernelsizedependsontheinputdistributionandaffectsthelearningspeed,andproperannealingofthestepsizeisrequiredforconvergence.ForKTDsmalleligibilitytracestendtoworkbetter.Inthecaseofcorrentropy,thekernelsizepresentsatradeoffbetweenlearningspeedandrobustnessandalsodependsontheerrordistribution.ResultsshowedthatQ-KTD()canofferperformanceadvantagesoverotherconventionalnonlinearfunctionapproximationmethods.Furthermore,itisimportanttohighlighthowtherobustnesspropertyofthecorrentropycriterioncanbeexploitedtoimprovelearningunderchangingpolicies.WehaveempiricallyobservedthatQ-CKTDwasabletoprovideabetterpolicyintheoff-policylearningparadigm.Furthermore,Q-KTD()wasappliedtoestimateanoptimalpolicyinopenloopbrainmachineinterface(BMI)problems,andexperimentalresultsshowthemethodcaneffectivelylearnthebrain-stateactionmapping.WealsotestedQ-CKTDonanopenloopRLBMIapplicationtoassessthealgorithm'scapabilityinestimatingaproperstatetoactionmap.Inoff-policyTDlearning,Q-CKTDresultsshowedthattheoptimalpolicycouldbeestimatedevenwithouthavingperfectpredictionsofthevaluefunctioninaprobleminvolvingadiscretesetofactions.Finally,weappliedQ-KTDtoclosedloopRLBMIexperimentsusingamonkey.Resultsshowedthatthealgorithmsucceedsinndingapropermappingbetweenneuralstatesanddesiredactions.Therefore,thekernellterstructureisasuitableapproachtoobtainaexibleneuralstatedecoderthatcanbelearnedandadaptedonline.Wealsoprovidedamethodologytoteaseaparttheinuencesoftheuserandtheagentintheoverallperformanceofthesystem.Thismethodologyhelpedusvisualizethecases 121

PAGE 122

wheretheerrorsmayhavebeencausedbytheuseraswellasthedecisionboundariesthatthedecoderimplementsbasedontheobservedneuralstates.WesawthesuccessfulintegrationoftheproposedTDalgorithmsinpolicysearch.ThisshowsthattheintroducedTDmethodshavethecapabilitytoapproximatevaluefunctionsproperly,whichcancontributetondingaproperpolicy.Actor-Criticisanotherwellknownmethodtondapolicyusinganestimatedvaluefunction.WecanalsoextendtheapplicationoftheQ-CTD,Q-KTD,andQ-CKTDalgorithmstotheActor-Criticframework.TheActor-Criticmethodcombinestheadvantagesofpolicygradientandvaluefunctionapproximationwiththepossibilityofbetterconvergenceguaranteesandreducedvarianceontheestimation.TheTDalgorithmscanbeappliedtotheCritictoestimatethevaluefunction,andthepolicygradientmethodcanbeappliedtoupdatetheActorthatchoosestheaction[ 23 ]. 122

PAGE 123

APPENDIXAMERCER'STHEOREMLetXbeacompactsubsetofRn.SupposeisacontinuoussymmetricfunctionsuchthattheintegraloperatorT:L2(X)!L2(X)(Tkf)()=ZX(,x)f(x)dx,ispositive,thatisZXX(x,z)f(x)f(z)dxdz0,forallf2Lx(X).Thenwecanexpand(x,z)inauniformlyconvergentseries(onXX)intermsoffunctionj,satisfyinghj,iiL2(X)=ij(x,z)=1Xj=1jj(x)j(z).Furthermore,theseriesP1i=1jijisconvergent[ 33 ]. 123

PAGE 124

APPENDIXBQUANTIZATIONMETHODThequantizationapproachintroducedin[ 9 ]isasimpleyeteffectiveapproximationheuristicthatlimitsthegrowingstructureofthelterbyaddingunitsinaselectivefashion.Onceanewstateinputx(i)arrives,itsdistancestoeachexistingunitC(i)]TJ /F5 11.955 Tf 12.21 0 Td[(1)arecalculated dist(x(i),C(i)]TJ /F5 11.955 Tf 11.95 0 Td[(1))=min1jsize(C(i)]TJ /F6 7.97 Tf 6.58 0 Td[(1))kx(i))]TJ /F4 11.955 Tf 11.95 0 Td[(Cj(i)]TJ /F5 11.955 Tf 11.96 0 Td[(1)k.(B)Iftheminimumdistancedist(x(i),C(i)]TJ /F5 11.955 Tf 12.2 0 Td[(1))issmallerthanthequantizationsizeU,thenewinputstatex(i)isabsorbedbytheclosestexistingunittoit,andhencenonewunitisaddedtothestructure.Inthiscase,unitcentersremainthesameC(i)=C(i)]TJ /F5 11.955 Tf 12 0 Td[(1),buttheconnectionweightstotheclosestunitareupdated. 124

PAGE 125

REFERENCES [1] Bae,Jihye,Chhatbar,Pratic,Francis,JosephT.,Sanchez,JustinC.,andPrincipe,JoseC.ReinforcementLearningviaKernelTemporalDifference.The33rdAnnualInternationalConferenceoftheIEEEonEngineeringinMedicineandBiologySociety.2011,5662. [2] Bae,Jihye,Giraldo,LuisSanchez,Chhatbar,Pratic,Francis,JosephT.,Sanchez,JustinC.,andPrincipe,JoseC.StochasticKernelTemporalDifferenceforReinforcementLearning.IEEEInternationalWorkshoponMachineLearningforSignalProcessing.2011,1. [3] Baird,Leemon.ResidualAlgorithms:ReinforcementLearningwithFunctionApproximation.MachineLearning.1995,30. [4] Boser,BernhardE.,Guyon,IsabelleM.,andVapnik,VladimirN.ATrainingAlgorithmforOptimalMarginClassiers.InProceedingsofthe5thAnnualWorkshoponComputationalLearningTheory(COLT).1992,144. [5] Boyan,JustinA.LearningEvaluationFunctionsforGlobalOptimization.Ph.D.thesis,CarnegieMellonUniversity,1998. [6] .TechnicalUpdate:Least-SquaresTemporalDifferenceLearning.MachineLearning49(2002):233. [7] Boyan,JustinA.andMoore,AndrewW.GeneralizationinReinforcementLearning:SafelyApproximatingtheValueFunction.AdvancesinNeuralInformationProcessingSystems.1995,369. [8] Bradtke,StevenJ.andBarto,AndrewG.LinearLeast-SquaresAlgorithmsforTemporalDifferenceLearning.MachineLearning22(1996):33. [9] Chen,Badong,Zhao,Songlin,Zhu,Pingping,andPrincipe,JoseC.QuantizedKernelLeastMeanSquareAlgorithm.IEEETransactionsonNeuralNetworksandLearningSystems23(2012).1:22. [10] Dayan,PeterandSejnowski,TerrenceJ.TD()ConvergeswithProbability1.MachineLearning14(1994):295. [11] Deisenroth,MarcPeter.EfcientReinforcementLearningusingGaussianProcess.Ph.D.thesis,KarlsruheInstituteofTechnology,2010. [12] Dietterich,ThomasG.andWang,Xin.BatchValueFunctionApproximationviaSupportVectors.AdvancesinNeuralInformationProcessingSystems.MITPress,2001,1491. [13] DiGiovanna,Jack,Mahmoudi,Babak,Fortes,Jose,Principe,JoseC.,andSanchez,JustinC.CoadaptiveBrain-MachineInterfaceviaReinforcementLearning.IEEETransactionsonBiomedicalEngineering56(2009).1. 125

PAGE 126

[14] Engel,Yaakov,Mannor,Shie,andMeir,Ron.ReinforcementlearningwithGaussianprocesses.InProceedingsofthe22ndInternationalConferenceonMachineLearning.2005,201. [15] Geramifard,Alborz,Bowling,Michael,andSutton,RichardS.IncrementalLeast-SquaresTemporalDifferenceLearning.InProceedingsofthe21stNationalConferenceonArticialIntelligence.2006,356. [16] Geramifard,Alborz,Bowling,Michael,Zinkevich,Martin,andSutton,RichardS.iLSTD:EligibilityTracesandConvergenceAnalysis.AdvancesinNeuralInforma-tionProcessingSystems.2007,441. [17] Ghavamzadeh,MohammadandEngel,Yaakov.BayesianActor-CriticAlgorithms.InProceedingsofthe24thInternationalConferenceonMachineLearning.2007. [18] Gunduz,AysegulandPrincipe,JoseC.CorrentropyasaNovelMeasureforNonlinearityTests.InternationalJointConferenceonNeuralNetworks89(2009). [19] Haykin,Simon.NeuralNetworks:acomprehensivefoundation.Maxwell,1994. [20] .NeuralNetworksandlearningMachines.PrenticeHall,2009. [21] Jeong,Kyu-HwaandPrincipe,JoseC.TheCorrentropyMaceFilterforImageRecognition.InProceedingsofthe16thIEEESignalProcessingSocietyWorkshoponMachineLearningforSignalProcessing.2006,9. [22] Kim,Sung-Phil,Sanchez,JustinC.,Rao,YadunandanaN.,Erdogmus,Deniz,Carmena,JoseM.,Lebedev,MikhailA.,Nicolelis,Miguel.A.L.,andPrincipe,JoseC.AComparisonofOptimalMIMOLinearandNonlinearModelsforBrain-MachineInterfaces.JournalofNeuralEngineering3(2006).145. [23] Konda,VijayR.andTsitsiklis,JohnN.OnActor-CriticAlgorithms.SocietyforIndustrialandAppliedMathematicsJournalonControlandOptimization42(2003).4:1143. [24] Kushner,HaroldJ.andClark,DeanS.StochasticApproximationMethodsforConstrainedandUnconstrainedSystems.Springer-Verlag,1978. [25] Liu,Weifeng,Park,Il,andPrincipe,JoseC.AnInformationTheoreticApproachofDesigningSparseKernelAdaptiveFilters.IEEETransactionsonNeuralNetwtorks20(2009).12:1950. [26] Liu,Weifeng,Pokharel,PuskalP.,andPrincipe,JoseC.Correntropy:PropertiesandApplicationsinNon-GaussianSignalProcessing.IEEETransactionsonSignalProcessing55(2007).11:5286. [27] .TheKernelLeastMeanSquareAlgorithm.IEEETransactionsonSignalProcessing56(2008).2:543. 126

PAGE 127

[28] Liu,Weifeng,Principe,JoseC.,andHaykin,Simon.KernelAdaptiveFiltering:AComprehensiveIntroduction.Wiley,2010. [29] Maei,HamidReza,Szepesvari,Csaba,Bhatnagar,Shalabh,andSutton,RichardS.TowardOff-PolicyLearningControlwithFunctionApproximation.Proceedingofthe27thInternationalConferenceonMachineLearning.2010. [30] Mahmoudi,Babak.IntegratingRoboticActionwithBiologicPerception:ABrainMachineSymbiosisTheory.Ph.D.thesis,UniversityofFlorida,2010. [31] Mahmoudi,Babak,DiGiovanna,Jack,Principe,JoseC.,andSanchez,JustinC.Co-AdaptiveLearninginBrain-MachineInterfaces.BrainInspiredCognitiveSystems(BICS).2008. [32] Melo,FranciscoS.,Meyn,SeanP.,andRibeiro,M.Isabel.AnAnalysisofReinforcementLearningwithFunctionAproximation.InProceedingsofthe25thInternationalConferenceonMachineLearning.2008,664. [33] Mercer,John.FunctionsofPositiveandNegativeType,andTheirConnectionwiththeTheoryofIntegralEquations.PhilosophicalTransactionsoftheRoyalSocietyofLondon209(1909):415. [34] Moore,AndrewW.VariableResolutionDynamicProgramming:EfcientlyLearningActionMapsinMultivariateReal-valuedState-spaces.InProceedingsofthe8thInternationalConferenceonMachineLearning.1991. [35] Mulliken,GrantH.,Musallam,Sam,andAndersen,RichardA.DecodingTrajectoriesfromPosteriorParietalCortexEnsembles.TheJournalofNeuro-science28(2008).48:12913. [36] Park,IlandPrincipe,JoseC.CorrentropyBasedGrangerCausality.IEEEInternationalConferenceonAcoustics,Speech,andSignalProcessing(ICASSP).2008,3605. [37] Pohlmeyer,EricA.,Mahmoudi,Babak,Geng,Shijia,Prins,Noe,andSanchez,JustinC.Brain-machineinterfacecontrolofarobotarmusingactor-criticrainforcementlearning.AnnualInternationalConferenceoftheIEEEonEngi-neeringinMedicineandBiologySociety(EMBC).2012,4108. [38] Principe,JoseC.InformationTheoreticLearning.Springer,2010. [39] Rasmussen,CarlEdwardandKuss,Malte.GaussianProcessesinReinforcementLearning.AdvancesinNeuralInformationProcessingSystems.MITPress,2004,751. [40] Rasmussen,CarlEdwardandWilliams,ChristopherK.I.GaussianProcessesforMachineLearning.MITPress,2006. 127

PAGE 128

[41] Sanchez,JustinC.,Tarigoppula,Aditya,Choi,JohnS.,Marsh,BrandiT.,Chhatbar,PratikY.,Mahmoudi,Babak,andFrancis,JosephT.Controlofacenter-outreachingtaskusingareinforcementlearningBrain-MachineInterface.The5thInternationalIEEE/EMBSConferenceonNeuralEngineering(NER).2011,525. [42] Santamaria,Ignacio,Pokharel,PuskalP.,andPrincipe,JoseC.GeneralizedCorrelationFunction:Denition,Properties,andApplicationtoBlindEqualization.IEEETransactionsonSignalProcessing54(2006).6. [43] Saunders,Craig,Gammerman,Alexander,andVovk,Volodya.RidgeRegressionLearningAlgorithminDualVariables.InProceedingsofthe15thInternationalConferenceonMachineLearning.1998,515. [44] Scholkopf,BernhardandSmola,AlexanderJ.LearningwithKernels.MITPress,2002. [45] Singh,AbhishekandPrincipe,JoseC.UsingCorrentropyasacostfunctioninlinearadaptivelters.The2009InternationalJointConferenceonNeuralNetworks(IJCNN).2009,2950. [46] .AClosedFormRecursiveSolutionforMaximumCorrentropyTraining.2010IEEEInternationalConferenceonAcousticsSpeechandSignalProcessing(ICASSP).2010,20702073. [47] .Alossfunctionforclassicationbasedonarobustsimilaritymetric.The2010InternationalJointConferenceonNeuralNetworks(IJCNN).2010,1. [48] Singh,SatinderP.andSutton,RichardS.ReinforcementLearningwithReplacingEligibilityTraces.MachineLearning22(1996):123. [49] Sussillo,David,Nuyujukian,Paul,Fan,JolineM.,Kao,JonathanC.,Stavisky,SergeyD.,Ryu,Stephen,andShenoy,Krishna.Arecurrentneuralnetworkforclosed-loopintracorticalbrain-machineinterfacedecoders.JournalofNeuralEngineering9(2012).2. [50] Sutton,RichardS.LearningtoPredictbytheMethodsofTemporalDifferences.MachineLearning3(1988):9. [51] .OpenTheoreticalQuestionsinReinforcementLearning.Tech.rep.,AT&TLabs,1999. [52] Sutton,RichardS.andBarto,AndrewG.ReinforcementLearning:AnIntroduction.MITPress,1998. [53] Szepesvari,Csaba.AlgorithmsforReinforcementLearning.Morgan&Slaypool,2010. 128

PAGE 129

[54] Tsitsiklis,JohnN.andRoy,BenjaminVan.AnAnalysisofTemporal-DifferenceLearningwithFunctionApproximation.Tech.Rep.5,IEEETransactionsonAutomaticControl,1997. [55] Watkins,ChristopherJ.C.H.LearningfromDelayedRewards.Ph.D.thesis,King'sCollege,1989. [56] Watkins,ChristopherJ.C.H.andDayan,Peter.TechnicalNote:Q-Learning.MachineLearning8(1992).3-4:279. [57] Xu,Xin,Hu,Dewen,andLu,Xicheng.Kernel-BasedLeastSquaresPolicyIterationforReinforcementLearning.IEEETransactionsonNeuralNetworks18(2007).4. [58] Xu,Xin,Xie,Tao,Hu,Dewen,andLu,Xicheng.KernelLeast-SquaresTemporalDifferenceLearning.InternationalJournalofInformationTechnology.vol.11.2005,54. [59] Zhao,Songlin,Chen,Badong,andPrincipe,JoseC.KernelAdaptiveFilteringwithMaximumCorrentropyCriterion.The2011InternationalJointConferenceonNeuralNetworks(IJCNN).2011,2012. 129

PAGE 130

BIOGRAPHICALSKETCH JihyeBaereceivedaBachelorofEngineeringintheSchoolofElectricalEngineeringandComputerScienceatKyungpookNationalUniversity,Daegu,SouthKoreain2007,andtheMasterofScienceandDoctorofPhilosophy(Ph.D.)intheDepartmentofElectricalandComputerEngineeringatUniversityofFlorida,Gainesville,Florida,theUnitedStatesofAmericain2009and2013,respectively.ShejoinedtheComputationalNeuro-EngineeringLaboratory(CNEL)atUniversityofFloridain2010duringherPh.D.studiesandworkedasaresearchassistantunderthesupervisionofProf.JoseC.PrincipeatCNEL.Herresearchinterestsencompassadaptivesignalprocessing,machinelearning,andtheirapplicationsinbrainmachineinterfacesincludingneuraldecodingandcontrolproblems.Hercurrentresearchmainlyfocusesonkernelmethodsandinformationtheoreticlearning,andhowbothareascanbeappliedinreinforcementlearning. 130