Reinforcement Learning and Optimal Control Methods for Uncertain Nonlinear Systems

Material Information

Title:
Reinforcement Learning and Optimal Control Methods for Uncertain Nonlinear Systems
Physical Description:
1 online resource (125 p.)
Language:
english
Creator:
Bhasin,Shubhendu
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:
2011

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Mechanical Engineering, Mechanical and Aerospace Engineering
Committee Chair:
Dixon, Warren E
Committee Members:
Barooah, Prabir
Kumar, Mrinal
Khargonekar, Pramod
Lewis, Frank

Subjects

Subjects / Keywords:
adaptive -- approximate -- control -- dynamic -- learning -- neural -- nonlinear -- optimal -- reinforcement
Mechanical and Aerospace Engineering -- Dissertations, Academic -- UF
Genre:
Mechanical Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Notions of optimal behavior expressed in natural systems led researchers to develop reinforcement learning (RL) as a computational tool in machine learning to learn actions by trial and error interactions yielding either a reward or punishment. RL provides a way for learning agents to optimally interact with uncertain complex environments, and hence, can address problems from a variety of domains, including artificial intelligence, controls, economics, operations research, etc. The focus of this work is to investigate the use of RL methods in feedback control to improve the closed-loop performance of nonlinear systems. Most RL-based controllers are limited to discrete-time systems, are offline methods, require knowledge of system dynamics and/or lack a rigorous stability analysis. This research investigates new control methods as an approach to address some of the limitations associated with traditional RL-based controllers. A robust adaptive controller with an adaptive critic or actor-critic (AC) architecture is developed for a class of uncertain nonlinear systems with disturbances. The AC structure is inspired by RL and uses a two-pronged neural network (NN) architecture: an action NN, also called the actor, which approximates the plant dynamics and generates appropriate control actions; and a critic NN, which evaluates the performance of the actor based on some performance index. In the context of current literature on RL-based control, the contribution of this work is the development of controllers which learn the optimal policy (approximately) for uncertain nonlinear systems. In contrast to model learning strategies for RL-based control of uncertain systems, the requirement of model knowledge is obviated in this work by the development of a robust identification-based state derivative estimator. The robust identifier is designed to yield asymptotically convergent state derivative estimates which are leveraged for model-free formulation of the Bellman error. The identifier is combined with the traditional actor-critic resulting in a novel actor-critic-identifier architecture, which is used to approximate the infinite-horizon optimal control for continuous-time uncertain nonlinear systems. The method is online, partially model-free, and is the first ever indirect adaptive control approach to continuous-time RL.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Shubhendu Bhasin.
Thesis:
Thesis (Ph.D.)--University of Florida, 2011.
Local:
Adviser: Dixon, Warren E.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2011
System ID:
UFE0042825:00001



Full Text

REINFORCEMENT LEARNING AND OPTIMAL CONTROL METHODS FOR UNCERTAIN NONLINEAR SYSTEMS

By

SHUBHENDU BHASIN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2011

© 2011 Shubhendu Bhasin

Dedicated with love to my parents and my brother; and with reverence to my Guru.

ACKNOWLEDGMENTS

I thank my advisor Dr. Warren E. Dixon for his guidance and motivation during my doctoral research. He groomed me during the initial years of my PhD program and made me understand the virtues of rigor in research. In the latter part of my PhD, he gave me enough freedom to develop my own ideas and grow as an independent researcher. His excellent work ethic has been a constant source of inspiration.

I am also thankful to my committee members, Dr. Pramod Khargonekar, Dr. Prabir Barooah, Dr. Mrinal Kumar and Dr. Frank Lewis, for providing insightful suggestions to improve the quality of my research. I especially thank my collaborators, Dr. Frank Lewis and his student, Kyriakos Vamvoudakis, for giving me a new perspective about the field.

I would also like to acknowledge my coworkers at the Nonlinear Controls and Robotics (NCR) Lab who filled my days with lively technical discussions and friendly banter. I will definitely miss the times when the whole lab would go for Tijuana Flats lunch excursions and NCR Happy Hours. I thank the innumerable friends I made in Gainesville for making my stint at the University of Florida a memorable experience.

From the bottom of my heart, I thank my parents for their love, support and sacrifice. My mother always took a keen interest in my education. I was fortunate to have her with me during the last year of my PhD. Her unconditional love and encouragement saw me through to the end. I cannot thank her enough. Last but not the least, I am grateful to God and Gurus for guiding me and helping me to draw strength and inspiration from within.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Background and Motivation
  1.2 Problem Statement
  1.3 Literature Survey
  1.4 Dissertation Outline
  1.5 Contributions

2 REINFORCEMENT LEARNING AND OPTIMAL CONTROL
  2.1 Reinforcement Learning Methods
    2.1.1 Policy Iteration
    2.1.2 Value Iteration
    2.1.3 Q-Learning
  2.2 Aspects of Reinforcement Learning Methods
    2.2.1 Curse of Dimensionality and Function Approximation
    2.2.2 Actor-Critic Architecture
    2.2.3 Exploitation vs. Exploration
  2.3 Infinite Horizon Optimal Control Problem
  2.4 Optimal Control Methods
  2.5 Adaptive Optimal Control and Reinforcement Learning

3 ASYMPTOTIC TRACKING BY A REINFORCEMENT LEARNING-BASED ADAPTIVE CRITIC CONTROLLER
  3.1 Dynamic Model and Properties
  3.2 Control Objective
  3.3 Action NN-Based Control
  3.4 Critic NN Architecture
  3.5 Stability Analysis
  3.6 Experimental Results
  3.7 Comparison with Related Work
  3.8 Summary

4 ROBUST IDENTIFICATION-BASED STATE DERIVATIVE ESTIMATION FOR NONLINEAR SYSTEMS
  4.1 Robust Identification-Based State Derivative Estimation
  4.2 Comparison with Related Work
  4.3 Experiment and Simulation Results
  4.4 Summary

5 AN ACTOR-CRITIC-IDENTIFIER ARCHITECTURE FOR APPROXIMATE OPTIMAL CONTROL OF UNCERTAIN NONLINEAR SYSTEMS
  5.1 Actor-Critic-Identifier Architecture for HJB Approximation
  5.2 Actor-Critic Design
    5.2.1 Least Squares Update for the Critic
    5.2.2 Gradient Update for the Actor
  5.3 Identifier Design
  5.4 Convergence and Stability Analysis
  5.5 Comparison with Related Work
  5.6 Simulation
    5.6.1 Nonlinear System Example
    5.6.2 LQR Example
  5.7 Summary

6 CONCLUSION AND FUTURE WORK
  6.1 Dissertation Summary
  6.2 Future Work
    6.2.1 Model-Free RL
    6.2.2 Relaxing the Persistence of Excitation Condition
    6.2.3 Asymptotic RL-Based Optimal Control
    6.2.4 Better Function Approximation Methods
    6.2.5 Robustness to Disturbances
    6.2.6 Output Feedback RL Control
    6.2.7 Extending RL beyond the Infinite-Horizon Regulator

APPENDIX

A ASYMPTOTIC TRACKING BY A REINFORCEMENT LEARNING-BASED ADAPTIVE CRITIC CONTROLLER
  A.1 Derivation of Sufficient Conditions in Eq. 3-42
  A.2 Differential Inclusions and Generalized Solutions

B ROBUST IDENTIFICATION-BASED STATE DERIVATIVE ESTIMATION FOR NONLINEAR SYSTEMS
  B.1 Proof of Inequalities in Eqs. 4-12 to 4-14
    B.1.1 Proof of Inequality in Eq. 4-12

    B.1.2 Proof of Inequalities in Eq. 4-13
    B.1.3 Proof of Inequality in Eq. 4-14
  B.2 Derivation of Sufficient Conditions in Eq. 4-18

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Summarized experimental results and P values of one-tailed unpaired t-test for Link 1.
3-2 Summarized experimental results and P values of one-tailed unpaired t-test for Link 2.
4-1 Comparison of transient (t = 0-5 sec.) and steady-state (t = 5-10 sec.) state derivative estimation errors $\dot{\tilde{x}}(t)$.

LIST OF FIGURES

2-1 Reinforcement Learning for MDP.
2-2 Reinforcement Learning control system.
2-3 Actor-critic architecture for online policy iteration.
3-1 Architecture of the RISE-based AC controller.
3-2 Two-link experiment testbed.
3-3 Comparison of tracking errors and torques between NN+RISE and AC+RISE for link 1.
3-4 Comparison of tracking errors and torques between NN+RISE and AC+RISE for link 2.
4-1 Comparison of the state derivative estimate $\dot{\hat{x}}(t)$.
4-2 Comparison of the state estimation errors $\tilde{x}(t)$.
4-3 Comparison of the state derivative estimation errors $\dot{\tilde{x}}(t)$.
4-4 Comparison of the state derivative estimation errors $\dot{\tilde{x}}(t)$ at steady state.
4-5 State derivative estimation errors $\dot{\tilde{x}}(t)$ for numerical differentiation methods.
5-1 Actor-critic-identifier architecture to approximate the HJB.
5-2 System states $x(t)$ with persistently excited input for the first 3 seconds.
5-3 Error in estimating the state derivative $\dot{\tilde{x}}(t)$ by the identifier.
5-4 Convergence of critic weights $\hat{W}_c(t)$.
5-5 Convergence of actor weights $\hat{W}_a(t)$.
5-6 Error in approximating the optimal value function by the critic at steady state.
5-7 Error in approximating the optimal control by the actor at steady state.
5-8 Errors in approximating the (a) optimal value function, and (b) optimal control, as a function of time.
5-9 System states $x(t)$ with persistently excited input for the first 25 seconds.
5-10 Convergence of critic weights $\hat{W}_c(t)$.
5-11 Convergence of actor weights $\hat{W}_a(t)$.
5-12 Errors in approximating the (a) optimal value function, and (b) optimal control, as a function of time.

LIST OF ABBREVIATIONS

AC    Adaptive Critic (or Actor-Critic)
ACI   Actor-Critic-Identifier
ADP   Approximate Dynamic Programming
DNN   Dynamic Neural Network
DHP   Dual Heuristic Programming
DP    Dynamic Programming
GHJB  Generalized Hamilton-Jacobi-Bellman
HDP   Heuristic Dynamic Programming
HJB   Hamilton-Jacobi-Bellman
MDP   Markov Decision Process
NN    Neural Network
PDE   Partial Differential Equation
PE    Persistence of Excitation
PI    Policy Iteration
RISE  Robust Integral of the Sign of the Error
RL    Reinforcement Learning
TD    Temporal Difference
UUB   Uniformly Ultimately Bounded

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

REINFORCEMENT LEARNING AND OPTIMAL CONTROL METHODS FOR UNCERTAIN NONLINEAR SYSTEMS

By

Shubhendu Bhasin

August 2011

Chair: Warren E. Dixon
Major: Mechanical Engineering

Notions of optimal behavior expressed in natural systems led researchers to develop reinforcement learning (RL) as a computational tool in machine learning to learn actions by trial and error interactions yielding either a reward or punishment. RL provides a way for learning agents to optimally interact with uncertain complex environments, and hence, can address problems from a variety of domains, including artificial intelligence, controls, economics, operations research, etc.

The focus of this work is to investigate the use of RL methods in feedback control to improve the closed-loop performance of nonlinear systems. Most RL-based controllers are limited to discrete-time systems, are offline methods, require knowledge of system dynamics and/or lack a rigorous stability analysis. This research investigates new control methods as an approach to address some of the limitations associated with traditional RL-based controllers.

A robust adaptive controller with an adaptive critic or actor-critic (AC) architecture is developed for a class of uncertain nonlinear systems with disturbances. The AC structure is inspired by RL and uses a two-pronged neural network (NN) architecture: an action NN, also called the actor, which approximates the plant dynamics and generates appropriate control actions; and a critic NN, which evaluates the performance of the actor, based on some performance index.

In the context of current literature on RL-based control, the contribution of this work is the development of controllers which learn the optimal policy (approximately) for uncertain nonlinear systems. In contrast to model learning strategies for RL-based control of uncertain systems, the requirement of model knowledge is obviated in this work by the development of a robust identification-based state derivative estimator. The robust identifier is designed to yield asymptotically convergent state derivative estimates which are leveraged for model-free formulation of the Bellman error. The identifier is combined with the traditional actor-critic resulting in a novel actor-critic-identifier architecture, which is used to approximate the infinite-horizon optimal control for continuous-time uncertain nonlinear systems. The method is online, partially model-free, and is the first ever indirect adaptive control approach to continuous-time RL.

CHAPTER 1
INTRODUCTION

1.1 Background and Motivation

RL refers to an agent which interacts with its environment and modifies its actions based on stimuli received in response to its actions. Learning happens through trial and error and is based on a cause and effect relationship between the actions and the rewards/punishment. Decisions/actions which lead to a satisfactory outcome are reinforced and are more likely to be taken when the same situation arises again. Although RL originated in psychology to explain human behavior, it has become a useful computational tool for learning by experience in many engineering applications, such as computer game playing, industrial manufacturing, traffic management, robotics and control, etc. From a computational intelligence perspective, an RL agent chooses actions which minimize the cost of its long-term interactions with the environment [1, 2]. A cost function, which captures the performance criteria, is used to critique the actions of the agent as a numerical reward, called the reinforcement signal. Unlike supervised learning, where learning is instructional and based on a set of examples of correct input/output behavior, RL is more evaluative and indicates only the measure of goodness of a particular action. Since interaction is done without a teacher, RL is particularly effective in situations where examples of desired behavior are not available but it is possible to evaluate the performance of actions based on some performance criterion. Improving the closed-loop performance of nonlinear systems has been an active research area in the controls community. Strong connections between RL and feedback control [3] have prompted a major effort towards convergence of the two fields: computational intelligence and controls. Several issues still exist that hinder RL methods for control of nonlinear systems, such as stability, convergence, choice of function approximator, etc. This work attempts to highlight and address some of these issues and provide a scaffolding for constructive RL-based methods for optimal control of uncertain nonlinear systems.

1.2 Problem Statement

The analogy between a continuously learning RL agent in an unknown environment and a continuously adapting and improving controller for an uncertain system defines the problem statement of this work. Specifically, the problem addressed in this work is developing RL-based controllers for continuous-time uncertain nonlinear systems. These controllers are inspired by RL and inherit many important features, like learning by interacting with an uncertain environment, reward-based learning, online implementation, and optimality.

1.3 Literature Survey

AC architectures have been proposed as models of RL [2, 4]. Since AC methods are amenable to online implementation, they have become an important subject of research, particularly in the controls community [5-14]. In AC-based RL, an actor network learns to select actions based on evaluative feedback from the critic to maximize future rewards. Due to the success of NNs as universal approximators [15, 16], they have become a natural choice in AC architectures for approximating unknown plant dynamics and cost functions [17, 18]. Typically, the AC architecture consists of two NNs: an action or actor NN and a critic NN. The critic NN approximates the evaluation function, mapping states to an estimated measure of the value function, while the action NN approximates an optimal control law and generates actions or control signals. Following the works of Sutton [1], Barto [19], Watkins [20], and Werbos [21], current research focuses on the relationship between RL and dynamic programming (DP) [22] methods for solving optimal control problems. Due to the curse of dimensionality associated with using DP [22], Werbos [5] introduced an alternative Approximate Dynamic Programming (ADP) approach which gives an approximate solution to the DP problem, or the Hamilton-Jacobi-Bellman (HJB) equation for optimal control. A detailed review of ADP designs can be found in [6]. Various modifications to ADP algorithms have since been proposed [7, 23, 24].

The performance of ADP-based controllers has been successfully demonstrated on various nonlinear plants with unknown dynamics. Venayagamoorthy et al. used ADP for control of turbogenerators, synchronous generators, and power systems [25, 26]. Ferrari and Stengel [27] used a Dual Heuristic Programming (DHP) based ADP approach to control a nonlinear simulation of a jet aircraft in the presence of parameter variations and control failures. Jagannathan et al. [28] used ACs for grasping control of a three-finger gripper. Some other interesting applications are missile control [29], HVAC control [30], and control of distributed parameter systems [11].

Convergence of ADP algorithms for RL-based control is studied in [7-10, 31, 32]. A policy iteration (PI) algorithm is proposed in [33] using Q-functions for the discrete-time LQR problem, and convergence to the state feedback optimal solution is proven. In [34], model-free Q-learning is proposed for linear discrete-time systems with guaranteed convergence to the $H_2$ and $H_\infty$ state feedback control solution. Most of the previous work on ADP has focused on either finite state Markovian systems or discrete-time systems [35, 36]. The inherently iterative nature of the ADP algorithm has impeded the development of closed-loop controllers for continuous-time uncertain nonlinear systems. Extensions of ADP-based controllers to continuous-time systems entail challenges in proving stability, convergence, and ensuring the algorithm is online and model-free. Early solutions to the problem consisted of using a discrete-time formulation of time and state, and then applying an RL algorithm on the discretized system. Discretizing the state space for high dimensional systems requires a large memory space and a computationally prohibitive learning process. Convergence of PI for continuous-time LQR was first proved in [37]. Baird [38] proposed Advantage Updating, an extension of the Q-learning algorithm which could be implemented in continuous time and provided faster convergence. Doya [39] used an HJB framework to derive algorithms for value function approximation and policy improvement, based on a continuous-time version of the temporal difference (TD) error. Murray et al. [8] also used the HJB framework to develop a stepwise stable iterative ADP algorithm for continuous-time input-affine systems with an input quadratic performance measure.

In Beard et al. [40], Galerkin's spectral method is used to approximate the solution to the generalized HJB (GHJB), using which a stabilizing feedback controller was computed offline. Similar to [40], Abu-Khalaf and Lewis [41] proposed a least-squares successive approximation solution to the GHJB, where an NN is trained offline to learn the GHJB solution. Another continuous-time formulation of adaptive critic is proposed in Hanselman [12].

All of the aforementioned approaches for continuous-time nonlinear systems require complete knowledge of system dynamics. The fact that continuous-time ADP requires knowledge of the system dynamics has hampered the development of continuous-time extensions to ADP-based controllers for nonlinear systems. Recent results by [13, 42] have made new inroads by addressing the problem for partially unknown nonlinear systems. A PI-based hybrid continuous-time/discrete-time sampled data controller is designed in [13, 42], where the feedback control operation of the actor occurs at a faster time scale than the learning process of the critic. Vamvoudakis and Lewis [14] extended the idea by designing a model-based online algorithm called synchronous PI which involved synchronous continuous-time adaptation of both actor and critic NNs.

1.4 Dissertation Outline

Chapter 1 serves as an introduction. The motivation, problem statement, literature survey and the contributions of the work are provided in this chapter.

Chapter 2 discusses the key elements in the field of RL from a computational intelligence point of view and discusses how these techniques can be applied to solve control problems. Further, the optimal control problem, optimal control methods, and their limitations are discussed. Connections between RL and optimal control are established and implementation issues are highlighted.

Chapter 3 develops a continuous-time adaptive critic controller to yield asymptotic tracking of a class of uncertain nonlinear systems with bounded disturbances.

The proposed AC-based controller consists of two NNs: an action NN, also called the actor, which approximates the plant dynamics and generates appropriate control actions; and a critic NN, which evaluates the performance of the actor based on some performance index. The reinforcement signal from the critic is used to develop a composite weight tuning law for the action NN based on Lyapunov stability analysis. A recently developed robust feedback technique, RISE (Robust Integral of the Sign of the Error), is used in conjunction with the feedforward action NN to yield a semi-global asymptotic result.

Chapter 4 develops a robust identification-based state derivative estimation method for uncertain nonlinear systems. The identifier architecture consists of a recurrent multi-layer dynamic NN which approximates the system dynamics online, and a continuous robust feedback RISE term which accounts for modeling errors and exogenous disturbances. The developed method finds applications in RL-based control methods for uncertain nonlinear systems.

Chapter 5 develops an online adaptive RL-based solution for the infinite-horizon optimal control problem for continuous-time uncertain nonlinear systems. A novel actor-critic-identifier (ACI) is developed to approximate the HJB equation using three NN structures: actor and critic NNs approximate the optimal control and the optimal value function, respectively, and a robust dynamic NN (DNN) identifier asymptotically approximates the uncertain system dynamics. An advantage of using the ACI architecture is that learning by the actor, critic, and identifier is continuous and simultaneous, without requiring knowledge of system drift dynamics. Convergence of the algorithm is analyzed using Lyapunov-based adaptive control methods.

Chapter 6 concludes the dissertation with a discussion of the key ideas, contributions and limitations of this work. It also points to future research directions and paves a path forward for further developments in the field.

1.5 Contributions

This work focuses on developing RL-based controllers for continuous-time nonlinear systems. The contributions of Chapters 3-5 are as follows.

Asymptotic tracking by an RL-based adaptive critic controller: AC-based controllers are typically discrete and/or yield a uniformly ultimately bounded stability result due to the presence of disturbances and uncertain approximation errors. A continuous asymptotic AC-based tracking controller is developed for a class of nonlinear systems with bounded disturbances. The approach is different from the optimal control-based ADP approaches proposed in literature [8-10, 13, 14, 32, 42], where the critic usually approximates a long-term cost function and the actor approximates the optimal control. However, the similarity with the ADP-based methods is in the use of the AC architecture, borrowed from RL, where the critic, through a reinforcement signal, affects the behavior of the actor, leading to an improved performance. The proposed robust adaptive controller consists of a NN feedforward term (actor NN) and a robust feedback term, where the weight update laws of the actor NN are designed as a composite of a tracking error term and a RL term (from the critic), with the objective of minimizing the tracking error [43-45]. The robust term is designed to withstand the external disturbances and modeling errors in the plant. Typically, the presence of bounded disturbances and NN approximation errors leads to a uniformly ultimately bounded (UUB) result. The main contribution of this work is the use of a recently developed continuous feedback technique, RISE [46, 47], in conjunction with the AC architecture to yield asymptotic tracking of an unknown nonlinear system subjected to bounded external disturbances. The use of RISE in conjunction with the action NN makes the design of the critic NN architecture challenging from a stability standpoint. To this end, the critic NN is combined with an additional RISE-like term to yield a reinforcement signal, which is used to update the weights of the action NN. A Lyapunov stability analysis guarantees closed-loop stability of the system.

Experiments are performed to demonstrate the improved performance with the proposed RL-based AC method.

Robust identification-based state derivative estimation for nonlinear systems: A state derivative estimation method is developed which can be used to design completely or partially model-free RL methods for control of uncertain nonlinear systems. The developed robust identifier provides online estimates of the state derivative of uncertain nonlinear systems in the presence of exogenous disturbances. The result differs from existing pure robust methods in that the proposed method combines an adaptive DNN system identifier with a robust RISE feedback to ensure asymptotic convergence to the state derivative, which is proven using a Lyapunov-based stability analysis. Simulation results in the presence of noise show an improved transient and steady state performance of the developed identifier in comparison to several other derivative estimation methods including: a high gain observer, a 2-sliding mode robust exact differentiator, and numerical differentiation methods, such as backward difference and central difference.

A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems: A novel actor-critic-identifier architecture is developed to learn the approximate solution to the HJB equation for infinite-horizon optimal control of uncertain nonlinear systems. The online method is the first ever indirect adaptive control approach to continuous-time RL. Another contribution of the developed method is that, unlike previous results in literature, the learning by the actor, critic and identifier is continuous and simultaneous, and the novel addition of the identifier to the traditional actor-critic architecture eliminates the need to know the system drift dynamics. The stability and convergence of the algorithm is rigorously analyzed. A PE condition is required to ensure exponential convergence to a bounded region in the neighborhood of the optimal control and UUB stability of the closed-loop system.

CHAPTER 2
REINFORCEMENT LEARNING AND OPTIMAL CONTROL

RL refers to the problem of a goal-directed agent interacting with an uncertain environment. The goal of an RL agent is to maximize a long-term scalar reward by sensing the state of the environment and taking actions which affect the state. At each step, an RL system gets evaluative feedback about the performance of its action, allowing it to improve the performance of subsequent actions. Several RL methods have been developed and successfully applied in machine learning to learn optimal policies for finite-state finite-action discrete-time Markov Decision Processes (MDPs), shown in Fig. 2-1. An analogous RL control system is shown in Fig. 2-2, where the controller, based on state feedback and reinforcement feedback about its previous action, calculates the next control which should lead to an improved performance. The reinforcement signal is the output of a performance evaluator function, which is typically a function of the state and the control. An RL system has a similar objective to an optimal controller, which aims to optimize a long-term performance criterion while maintaining stability. This chapter discusses the key elements in the field of RL and how they can be applied to solve control problems. Further, the optimal control problem, optimal control methods, and their limitations are discussed. Connections between RL and optimal control are established and implementation issues are highlighted, which motivate the methods developed in this dissertation.

2.1 Reinforcement Learning Methods

RL methods typically estimate the value function, which is a measure of goodness of a given action for a given state. The value function represents the reward/penalty accumulated by the agent in the long run, and for a deterministic MDP, may be defined as an infinite-horizon discounted return as [2]

$V^u(x_0) = \sum_{k=0}^{\infty} \gamma^k r_{k+1},$

where $x_k$ and $u_k$ are the state and action, respectively, for the discrete-time system $x_{k+1} = f(x_k, u_k)$, $r_{k+1} \triangleq r(x_k, u_k)$ is the reward/penalty at the $k$th step, and $\gamma \in [0, 1)$ is the discount factor used to discount future rewards.

Figure 2-1. Reinforcement Learning for MDP.

Figure 2-2. Reinforcement Learning control system.

The objective of an RL method is to determine a policy which maximizes the value function. Since the value function is unknown, typically the first step is to estimate the value function, which can be expressed using Bellman's equation as [2]

$V^u(x) = r(x, u) + \gamma V^u(f(x, u)),$

where the index $k$ is suppressed.
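To make the discounted return and the Bellman consistency condition concrete, the following minimal Python sketch evaluates a fixed policy on a small deterministic MDP; the chain dynamics, cost values, policy, and rollout horizon are hypothetical illustrations, not taken from this chapter.

```python
# Minimal sketch: evaluate a fixed policy u on a deterministic MDP by
# (a) rolling out the discounted return and (b) checking Bellman's equation.
# The 3-state chain, rewards, and policy below are hypothetical examples.

gamma = 0.9                                  # discount factor in [0, 1)

def f(x, u):                                 # deterministic transition x_{k+1} = f(x_k, u_k)
    return min(x + u, 2)                     # states {0, 1, 2}; state 2 absorbs

def r(x, u):                                 # local reward/penalty r(x, u)
    return 0.0 if x == 2 else 1.0 + 0.1 * u

def policy(x):                               # a fixed policy being evaluated
    return 1

def rollout_return(x0, horizon=200):
    """Approximate V^u(x0) = sum_k gamma^k r_{k+1} by a truncated rollout."""
    V, x = 0.0, x0
    for k in range(horizon):
        u = policy(x)
        V += gamma**k * r(x, u)
        x = f(x, u)
    return V

# Bellman consistency: V^u(x) should equal r(x, u) + gamma * V^u(f(x, u)).
for x in (0, 1, 2):
    u = policy(x)
    lhs = rollout_return(x)
    rhs = r(x, u) + gamma * rollout_return(f(x, u))
    print(x, round(lhs, 4), round(rhs, 4))
```

Truncating the rollout at a finite horizon is justified by the geometric decay $\gamma^k$ of future rewards.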

The optimal value function is defined as

$V^*(x) = \min_u V^u(x),$

which can also be expressed using the Bellman optimality condition as

$V^*(x) = \min_u \left[ r(x, u) + \gamma V^*(f(x, u)) \right],$
$u^*(x) = \arg\min_u \left[ r(x, u) + \gamma V^*(f(x, u)) \right].$ (2-1)

The above Bellman relations form the basis of all RL methods: policy iteration, value iteration, and Q-learning [2, 20, 35]. RL methods can be categorized as model-based and model-free. Model-based or DP-based RL algorithms utilize the expression in Eq. 2-1 but are offline and require perfect knowledge of the environment, as seen from Eq. 2-1. On the other hand, model-free RL algorithms are based on temporal difference (TD), which refers to the difference between temporally successive estimates of the same quantity. In contrast to DP-based RL methods, TD-based RL methods are online and do not use an explicit model of the system; rather, they use data (sets of samples, trajectories, etc.) obtained from the process, i.e., they learn by interacting with the environment. Some of the popular RL methods are subsequently discussed.

2.1.1 Policy Iteration

Policy Iteration (PI) algorithms [22, 48] successively alternate between policy evaluation and policy improvement. The algorithm starts with an initial admissible policy, estimates the value function (policy evaluation), and then improves the policy using a greedy search on the estimated value function (policy improvement). The policy evaluation step in DP-based PI is performed using the following recurrence relation until convergence to the value function

$V^u(x) \leftarrow r(x, u) + \gamma V^u(f(x, u)),$ (2-2)

where the symbol '$\leftarrow$' denotes the value on the right being assigned to the quantity on the left. After the convergence of policy evaluation, policy improvement is performed using

$u(x) = \arg\min_a \left[ r(x, a) + \gamma V^u(f(x, a)) \right].$ (2-3)

It can be seen from Eqs. 2-2 and 2-3 that the DP-based PI algorithm requires knowledge of the system model $f(x, u)$. Using the model-free $TD(0)$ algorithm [1], which learns from interacting with the environment, this limitation is overcome. Using the $TD(0)$ algorithm, the value function is estimated using the following update

$V^u(x) \leftarrow V^u(x) + \alpha \left[ r(x, u) + \gamma V^u(x') - V^u(x) \right],$ (2-4)

where $\alpha \in (0, 1]$ is the learning rate, and $x'$ denotes the next state observed after performing action $u$ at $x$. In contrast to DP-based policy evaluation, the value function estimation in Eq. 2-4 does not require an explicit model of the system. The PI algorithm converges to the optimal policy [48]. Online PI algorithms do not wait for the convergence of the policy evaluation step to implement policy improvement; however, their convergence can be guaranteed only under very restrictive conditions, such as generation of infinitely long trajectories for each iteration [49].

2.1.2 Value Iteration

Value Iteration (VI) algorithms directly estimate the optimal value function, which is then used to compute the optimal policy. It combines the truncated policy evaluation and policy improvement steps in one step using the following recurrence relation from DP [2]

$V(x) \leftarrow \min_a \left[ r(x, a) + \gamma V(f(x, a)) \right].$

VI converges to the optimal $V^*(x)$, and is said to be less computationally intensive than PI, although PI typically converges in fewer iterations [35].
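The following sketch, a hypothetical illustration rather than an algorithm from this dissertation, implements the DP-based recurrences of Eqs. 2-2 and 2-3 and the VI backup on a small finite MDP; replacing the model-based evaluation loop with sampled updates of the form in Eq. 2-4 would give the model-free TD(0) variant.

```python
# Minimal sketch of DP-based policy iteration (Eqs. 2-2, 2-3) and value
# iteration on a hypothetical finite MDP: a 5-state chain with actions
# {-1, +1}; f, r, and gamma are illustrative choices.
states, actions, gamma = range(5), (-1, 1), 0.9
f = lambda x, a: min(max(x + a, 0), 4)       # deterministic chain transitions
r = lambda x, a: abs(x - 4)                  # cost: distance from goal state 4

def policy_iteration():
    u = {x: -1 for x in states}              # initial admissible policy
    V = {x: 0.0 for x in states}
    for _ in range(100):
        for _ in range(500):                 # policy evaluation: iterate Eq. 2-2
            V = {x: r(x, u[x]) + gamma * V[f(x, u[x])] for x in states}
        u_new = {x: min(actions, key=lambda a: r(x, a) + gamma * V[f(x, a)])
                 for x in states}            # policy improvement: Eq. 2-3
        if u_new == u:                       # stop when the policy is stable
            break
        u = u_new
    return u, V

def value_iteration():
    V = {x: 0.0 for x in states}
    for _ in range(500):                     # Bellman optimality backup
        V = {x: min(r(x, a) + gamma * V[f(x, a)] for a in actions)
             for x in states}
    return V

print(policy_iteration()[0])                 # expected greedy policy: always +1
print(value_iteration())
```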

2.1.3 Q-Learning

Q-learning algorithms use Q-factors $Q(x, u)$, which are functions of state-action pairs, instead of the state value function $V(x)$. The Q-iteration algorithm uses TD learning to find the optimal Q-factor $Q^*(x, u)$ as

$Q(x, u) \leftarrow Q(x, u) + \alpha \left[ r(x, u) + \gamma \min_a Q(x', a) - Q(x, u) \right].$

The Q-learning algorithm [20] is one of the major breakthroughs in reinforcement learning, since it involves learning the optimal action-value function independent of the policy being followed (also called off-policy)¹, which greatly simplifies the convergence analysis of the algorithm. Adequate exploration is, however, needed for the convergence to $Q^*$. The optimal policy can be directly found from performing a greedy search on $Q^*$ as

$u^*(x) = \arg\min_a Q^*(x, a).$

¹ An on-policy variant of Q-learning, SARSA [50], is based on policy iteration.
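A minimal tabular Q-learning sketch, written in the cost-minimization convention used above, is given below; the chain MDP, learning rate, and exploration probability are hypothetical choices, not parameters from this dissertation.

```python
# Minimal tabular Q-learning sketch (greedy action = argmin over Q-factors).
# The chain MDP and the fixed exploration rate are hypothetical examples;
# only the update rule mirrors the Q-iteration shown in the text.
import random

states, actions, gamma, alpha = range(5), (-1, 1), 0.9, 0.5
f = lambda x, a: min(max(x + a, 0), 4)
r = lambda x, a: abs(x - 4)                       # cost: distance from goal

Q = {(x, a): 0.0 for x in states for a in actions}

for episode in range(2000):
    x = random.choice(states)
    for step in range(20):
        # epsilon-greedy exploration (see Section 2.2.3)
        if random.random() < 0.1:
            u = random.choice(actions)
        else:
            u = min(actions, key=lambda a: Q[(x, a)])
        x_next = f(x, u)
        td_target = r(x, u) + gamma * min(Q[(x_next, a)] for a in actions)
        Q[(x, u)] += alpha * (td_target - Q[(x, u)])  # TD update of Q-factor
        x = x_next

greedy = {x: min(actions, key=lambda a: Q[(x, a)]) for x in states}
print(greedy)   # expected: move toward the low-cost absorbing end (state 4)
```

Note that the update is off-policy: the TD target uses the greedy (min) action at $x'$ regardless of which action the exploration rule actually takes next.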

2.2 Aspects of Reinforcement Learning Methods

This section discusses aspects of, and issues in, implementation of RL methods on high dimensional and large-scale practical systems.

2.2.1 Curse of Dimensionality and Function Approximation

RL methods where value function estimates are represented as a table require, at every iteration, storage and updating of all the table entries corresponding to the entire state space. In fact, the computation and storage requirements increase exponentially with the size of the state space, a problem also called the curse of dimensionality. The problem is compounded when considering continuous spaces, which contain infinitely many states and actions. One solution approach is to represent value functions using function approximators, which are based on supervised learning, and generalize based on limited information about the state space [2]. A convenient way to represent value functions is by using linearly parameterized approximators of the form $\theta^T \phi(x)$, where $\theta$ is the unknown parameter vector, and $\phi$ is a user-defined basis function. Selecting the right basis function which represents all the independent features of the value function is crucial in solving the RL problem. Some prior knowledge regarding the process is typically included in the basis function. The parameter vector is estimated using optimization algorithms, e.g., gradient descent, least squares, etc. Multi-layer neural networks may also be used as nonlinearly parameterized approximators; however, weight convergence is harder to prove as compared to linearly parameterized network structures.
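As an illustration of the linearly parameterized approximator $\theta^T \phi(x)$ discussed above, the following sketch runs a gradient-descent TD(0) parameter update for policy evaluation on a hypothetical scalar system; the polynomial basis, dynamics, policy, and gains are assumptions made for the example.

```python
# Minimal sketch of TD(0) with a linearly parameterized value function
# V(x) ~ theta^T phi(x), updated by gradient descent on the TD error.
# The 1-D dynamics, basis choice, and fixed policy are hypothetical.
import numpy as np

gamma, alpha = 0.95, 0.05
phi = lambda x: np.array([1.0, x, x**2])      # user-defined basis functions
f = lambda x, u: 0.9 * x + u                  # used only to generate samples
policy = lambda x: -0.1 * x                   # fixed policy being evaluated
r = lambda x, u: x**2 + u**2                  # local cost

theta = np.zeros(3)
x = 1.0
for k in range(5000):
    u = policy(x)
    x_next = f(x, u)
    # TD error uses only sampled data (x, u, r, x_next), not the model f
    delta = r(x, u) + gamma * theta @ phi(x_next) - theta @ phi(x)
    theta += alpha * delta * phi(x)           # gradient step on theta
    # restart near-converged trajectories to keep the samples informative
    x = x_next if abs(x_next) > 1e-3 else np.random.uniform(-2, 2)

print(theta)   # approximate value function parameters
```

The restart step is one crude way to keep the regressor $\phi(x)$ exciting, anticipating the persistence-of-excitation issues discussed in later chapters.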

2.2.2 Actor-Critic Architecture

Actor-critic methods, introduced by Barto [19], implement the policy iteration algorithm online, where the critic is typically a neural network which implements policy evaluation and approximates the value function, whereas the actor is another neural network which approximates the control. The critic evaluates the performance of the actor using a scalar reward from the environment and generates a TD error. The actor-critic neural networks, shown in Fig. 2-3, are updated using gradient update laws based on the TD error.

Figure 2-3. Actor-critic architecture for online policy iteration.

2.2.3 Exploitation vs. Exploration

The trade-off between exploitation and exploration has been a topic of much research in the RL community [51]. For an agent in an unknown environment, exploration is required to try out different actions and learn based on trial and error, whereas past experience may also be exploited to select the best actions and minimize the cost of learning. For sample or trajectory based RL methods (e.g., Monte Carlo) in large dimensional spaces, selecting the best actions (e.g., greedy policy) based on current estimates is not sufficient, because better alternative actions may potentially never be explored. Sufficient exploration is essential to learn the globally optimal solution. However, too much exploration can also be costly in terms of performance and stability when the method is implemented online. One approach is to use an $\epsilon$-greedy policy, where the exploration is the highest when the agent starts learning, but gradually decays as experience is gained and exploitation is preferred to reach the optimal solution.
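A minimal sketch of the decaying $\epsilon$-greedy rule described above follows; the decay schedule and the tabular Q interface are hypothetical choices for illustration.

```python
# Minimal sketch of decaying epsilon-greedy action selection: explore heavily
# at the start of learning, then shift toward exploitation. The schedule
# constants and action set are hypothetical.
import random

def epsilon(k, eps0=1.0, eps_min=0.01, decay=0.999):
    """Exploration probability after k learning steps."""
    return max(eps_min, eps0 * decay**k)

def select_action(Q, x, actions, k):
    """With probability epsilon(k), explore; otherwise act greedily (min cost)."""
    if random.random() < epsilon(k):
        return random.choice(actions)
    return min(actions, key=lambda a: Q.get((x, a), 0.0))

# usage inside a learning loop: u = select_action(Q, x, (-1, 1), k)
```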

2.3 Infinite Horizon Optimal Control Problem

RL has close connections with optimal control. In this section, the undiscounted infinite horizon optimal control problem is formulated for continuous-time nonlinear systems. Consider a continuous-time nonlinear system

$\dot{x} = F(x, u),$ (2-5)

where $x(t) \in \mathcal{X} \subseteq \mathbb{R}^n$, $u(t) \in \mathcal{U} \subseteq \mathbb{R}^m$ is the control input, and $F : \mathcal{X} \times \mathcal{U} \to \mathbb{R}^n$ is Lipschitz continuous on $\mathcal{X} \times \mathcal{U}$ containing the origin, such that the solution $x(t)$ of the system in Eq. 2-5 is unique for any finite initial condition $x_0$ and control $u \in \mathcal{U}$. It is also assumed that $F(0, 0) = 0$. Further, the system is stabilizable, i.e., there exists a continuous feedback control law $u(x(t))$ such that the closed-loop system is asymptotically stable.

The infinite-horizon scalar cost function for the system in Eq. 2-5 can be defined as

$J(x(t), u(\cdot)\,|\,t \le \tau < \infty) = \int_t^{\infty} r(x(s), u(s))\, ds,$ (2-6)

where $t$ is the initial time, and $r(x, u) \in \mathbb{R}$ is the immediate or local cost for the state and control, defined as

$r(x, u) = Q(x) + u^T R u,$ (2-7)

where $Q(x) \in \mathbb{R}$ is continuously differentiable and positive definite, and $R \in \mathbb{R}^{m \times m}$ is a positive-definite symmetric matrix. The optimal control problem is to find an admissible control $u^* \in \psi(\mathcal{X})$ such that the cost in Eq. 2-6 associated with the system in Eq. 2-5 is minimized [52]. An admissible control input $u(t)$ can be defined as a continuous feedback control law $u(x(t)) \in \psi(\mathcal{X})$, where $\psi(\cdot)$ denotes the set of admissible controls, which asymptotically stabilizes the system in Eq. 2-5 on $\mathcal{X}$, $u(0) = 0$, and $J(\cdot)$ in Eq. 2-6 is finite. The optimal value function can be defined as

$V^*(x(t)) = \min_{u(\tau) \in \psi(\mathcal{X}),\ t \le \tau < \infty} \int_t^{\infty} r(x(s), u(x(s)))\, ds.$ (2-8)

Assuming the value function is continuously differentiable, Bellman's principle of optimality can be used to derive the following optimality condition [52]

$0 = \min_{u(t) \in \psi(\mathcal{X})} \left[ r(x, u) + \frac{\partial V^*(x)}{\partial x} F(x, u) \right],$ (2-9)

which is a nonlinear partial differential equation (PDE), also called the HJB equation. Based on the assumption that $V^*(x)$ is continuously differentiable, the HJB in Eq. 2-9 provides a means to obtain the optimal control $u^*(x)$ in feedback form. Using the convex local cost in Eqs. 2-7 and 2-9, a closed-form expression for the optimal control can be derived as

$u^*(x) = -\frac{1}{2} R^{-1} \left( \frac{\partial F(x, u)}{\partial u} \right)^T \left( \frac{\partial V^*(x)}{\partial x} \right)^T.$ (2-10)

For the control-affine dynamics of the form

$\dot{x} = f(x) + g(x)u,$ (2-11)

where $f(x) \in \mathbb{R}^n$ and $g(x) \in \mathbb{R}^{n \times m}$, the expression in Eq. 2-10 can be written in terms of the system state as

$u^*(x) = -\frac{1}{2} R^{-1} g^T(x) \left( \frac{\partial V^*(x)}{\partial x} \right)^T.$ (2-12)

In general, the solutions to the optimal control problem may not be smooth [53]. Existence of a unique non-smooth solution (called the viscosity solution) is studied in [53], [54].

The HJB in Eq. 2-9 can be rewritten in terms of the optimal value function by substituting for the local cost in Eq. 2-7, the system in Eq. 2-11, and the optimal control in Eq. 2-12, as

$0 = Q(x) + \frac{\partial V^*(x)}{\partial x} f(x) - \frac{1}{4} \frac{\partial V^*(x)}{\partial x} g(x) R^{-1} g^T(x) \left( \frac{\partial V^*(x)}{\partial x} \right)^T,$ (2-13)
$0 = V^*(0).$

Although in closed form, the optimal policy in Eq. 2-12 requires knowledge of the optimal value function $V^*(x)$, the solution of the HJB equation in Eq. 2-13. The HJB equation is problematic to solve in general and may not have an analytical solution.
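As a worked illustration of Eqs. 2-12 and 2-13 (a standard special case, not part of the development above; the matrices $A$, $B$, $\bar{Q}$, and $P$ are introduced only for this example), consider linear dynamics $f(x) = Ax$, $g(x) = B$ with quadratic state cost $Q(x) = x^T \bar{Q} x$. Trying the quadratic ansatz $V^*(x) = x^T P x$ with $P = P^T > 0$ gives $\partial V^*(x)/\partial x = 2x^T P$, so Eq. 2-12 yields the familiar LQR feedback

$u^*(x) = -\tfrac{1}{2} R^{-1} B^T (2Px) = -R^{-1} B^T P x,$

and Eq. 2-13 reduces to

$0 = x^T \left( \bar{Q} + PA + A^T P - P B R^{-1} B^T P \right) x \quad \forall x,$

i.e., the algebraic Riccati equation $A^T P + PA - P B R^{-1} B^T P + \bar{Q} = 0$ referenced in Section 2.4. For general nonlinear $f$ and $g$, no such finite parameterization of $V^*$ exists, which is precisely why approximate solutions are pursued in the following sections.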

2.4 Optimal Control Methods

Since the solution of the HJB is prohibitively difficult and sometimes even impossible, several alternative methods are investigated in the literature. The calculus of variations approach generates a set of first-order necessary optimality conditions, called the Euler-Lagrange equations, resulting in a two-point (or multi-point) boundary value problem, which is typically solved numerically using indirect methods, such as shooting, multiple shooting, etc. [52]. Another numerical approach is to use direct methods, where the state and/or control are approximated using function approximators or discretized using collocation, and the optimal control problem is transcribed to a nonlinear programming problem, which can be solved using methods such as direct shooting, direct collocation, pseudo-spectral methods, etc. [55, 56]. Although these numerical approaches are effective and practical, they are open-loop, offline, require exact model knowledge, and are dependent on initial conditions. Another approach, based on feedback linearization, involves robustly canceling the system nonlinearities, thereby reducing the system to a linear system, and solving the associated Algebraic Riccati Equation (ARE)/Differential Riccati Equation (DRE) for optimal control [57, 58]. A drawback of feedback linearization is that it solves a transformed optimal control problem with respect to a part of the control while the other part is used to cancel the nonlinear terms. Moreover, linearization cancels all nonlinearities, some of which may be useful for the system. Inverse optimal controllers circumvent the task of solving the HJB by proving optimality of a control law for a meaningful cost function [59-61]. The fact that the cost function cannot be chosen a priori by the user limits the applicability of the method.

Given the limitations of methods that seek an exact optimal solution, the focus of some literature has shifted towards developing methods which yield a sub-optimal or an approximately optimal solution. Model-predictive control (MPC) or receding horizon control (RHC) [62, 63] is an example of an online model-based approximate optimal control method which solves the optimal control problem over a finite time horizon at every state transition, leading to a state feedback optimal control solution. These methods have been successfully applied in process control, where the model is exactly known and the dynamics are slowly varying [64, 65]. An offline successive approximation method, proposed in [66], improves the performance of an initial stabilizing control by approximating the solution to the generalized HJB (GHJB) equation and then using Bellman's optimality principle to compute an improved control law. This process is repeated and proven to converge to the optimal policy. The GHJB, unlike the HJB, is a linear PDE which is more tractable to solve, e.g., using methods like the Galerkin projection [40].

The successive approximation method is similar to the policy iteration algorithm in RL; however, the method is offline and requires complete model knowledge. To alleviate the curse of dimensionality associated with dynamic programming, a family of methods, called AC designs (also called ADP), were developed in [6, 17, 35, 36] to solve the optimal control problem using RL and neural network backpropagation algorithms. The methods are, however, applicable only for discrete-time systems and lack a rigorous Lyapunov stability analysis.

2.5 Adaptive Optimal Control and Reinforcement Learning

Most optimal control approaches discussed in Section 2.4 are offline and require complete model knowledge. Even for linear systems, where the LQR gives the closed-form analytical solution to the optimal control problem, the ARE is solved offline and requires exact knowledge of the system dynamics. Adaptive control provides an inroad to design controllers which can adapt online to the uncertainties in system dynamics, based on minimization of the output error (e.g., using gradient or least squares methods). However, classical adaptive control methods do not maximize a long-term performance function, and hence are not optimal. Adaptive optimal control refers to methods which learn the optimal solution online for uncertain systems. RL methods described in Section 2.1 have been successfully used in MDPs to learn optimal policies in uncertain environments, e.g., TD-based Q-learning is an online model-free RL method for learning optimal policies. In [3], Sutton et al. argue that RL is a direct adaptive optimal control technique. Owing to the discrete nature of RL algorithms, many methods have been proposed for adaptive optimal control of discrete-time systems [6, 7, 10, 33, 67-70]. Unfortunately, an RL formulation for continuous-time systems is not as straightforward as in the discrete-time case, because while the TD error in the latter is model-free, it is not the case with the former, where the TD error formulation inherently requires complete knowledge of the system dynamics (see Eq. 2-9). RL methods based on the model-based TD error for continuous-time systems are proposed in [8, 14, 39, 41].

A partially model-free solution is proposed in [13] using an actor-critic architecture; however, the resulting controller is hybrid, with a continuous-time actor and a discrete-time critic. Other issues concerning RL-based controllers are: closed-loop stability, convergence to the optimal control, function approximation, and the tradeoff between exploitation and exploration. Few results have rigorously addressed these issues, which are critical for successful implementation of RL methods for feedback control. The work in this dissertation is motivated by the need to provide a theoretical foundation for RL-based control methods and explore their potential as adaptive optimal control methods.

CHAPTER 3
ASYMPTOTIC TRACKING BY A REINFORCEMENT LEARNING-BASED ADAPTIVE CRITIC CONTROLLER

AC based controllers are typically discrete and/or yield a uniformly ultimately bounded stability result due to the presence of disturbances and unknown approximation errors. The objective in this chapter is to design a continuous-time AC controller which yields asymptotic tracking of a class of uncertain nonlinear systems with bounded disturbances. The proposed AC-based controller architecture consists of two NNs: an action NN, also called the actor, which approximates the plant dynamics and generates appropriate control actions; and a critic NN, which evaluates the performance of the actor based on some performance index. The reinforcement signal from the critic is used to develop a composite weight tuning law for the action NN based on Lyapunov stability analysis. A recently developed robust feedback technique, RISE, is used in conjunction with the feedforward action neural network to yield a semi-global asymptotic result.

3.1 Dynamic Model and Properties

The mn-th order MIMO Brunovsky form¹ can be written as [43]

$\dot{x}_1 = x_2$
$\vdots$
$\dot{x}_{n-1} = x_n$ (3-1)
$\dot{x}_n = g(x) + u + d$
$y = x_1,$

¹ The Brunovsky form can be used to model many physical systems, e.g., Euler-Lagrange systems.

where $x(t) \triangleq [x_1^T\ x_2^T\ \ldots\ x_n^T]^T \in \mathbb{R}^{mn}$ are the measurable system states, $u(t) \in \mathbb{R}^m$, $y \in \mathbb{R}^m$ are the control input and system output, respectively, $g(x) \in \mathbb{R}^m$ is an unknown smooth function, locally Lipschitz in $x$, and $d(t) \in \mathbb{R}^m$ is an external bounded disturbance.

Assumption 3.1. The function $g(x)$ is second-order differentiable, i.e., $g(\cdot), \dot{g}(\cdot), \ddot{g}(\cdot) \in \mathcal{L}_\infty$ if $x^{(i)}(t) \in \mathcal{L}_\infty$, $i = 0, 1, 2$, where $(\cdot)^{(i)}(t)$ denotes the $i$th derivative with respect to time.

Assumption 3.2. The desired trajectory $y_d(t) \in \mathbb{R}^m$ is designed such that $y_d^{(i)}(t) \in \mathcal{L}_\infty$, $i = 0, 1, \ldots, n+1$.

Assumption 3.3. The disturbance term and its first and second time derivatives are bounded, i.e., $d(t), \dot{d}(t), \ddot{d}(t) \in \mathcal{L}_\infty$.

3.2 Control Objective

The control objective is to design a continuous RL-based NN controller such that the output $y(t)$ tracks a desired trajectory $y_d(t)$. To quantify the control objective, the tracking error $e_1(t) \in \mathbb{R}^m$ is defined as

$e_1 \triangleq y - y_d.$ (3-2)

The following filtered tracking errors are defined to facilitate the subsequent stability analysis

$e_2 \triangleq \dot{e}_1 + \alpha_1 e_1,$
$e_i \triangleq \dot{e}_{i-1} + \alpha_{i-1} e_{i-1} + e_{i-2}, \quad i = 3, \ldots, n,$ (3-3)
$r \triangleq \dot{e}_n + \alpha_n e_n,$ (3-4)

where $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ are positive constant control gains. Note that the signals $e_1(t), \ldots, e_n(t) \in \mathbb{R}^m$ are measurable, whereas the filtered tracking error $r(t) \in \mathbb{R}^m$ in Eq. 3-4 is not measurable since it depends on $\dot{x}_n(t)$. The filtered tracking errors in Eq. 3-3 can be expressed in terms of the tracking error $e_1(t)$ as

$e_i = \sum_{j=0}^{i-1} a_{ij}\, e_1^{(j)}, \quad i = 2, \ldots, n,$ (3-5)

where $a_{ij} \in \mathbb{R}$ are positive constants obtained from substituting Eq. 3-5 in Eq. 3-3 and comparing coefficients [47]. It can be easily shown that

$a_{ij} = 1, \quad j = i - 1.$ (3-6)
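To make the error-system construction concrete, the following sketch (a hypothetical illustration for $n = 2$ with a scalar output, $m = 1$) computes the errors of Eqs. 3-2 to 3-4; the gains and signal values are assumptions, and $r$ is included only for analysis since it depends on the unmeasurable $\dot{x}_n(t)$.

```python
# Minimal sketch of the filtered error cascade in Eqs. 3-2 to 3-4 for n = 2 and
# a scalar output (m = 1). Gains and signal values are hypothetical; r is shown
# for analysis only, since computing it requires the unmeasurable x_n_dot.
alpha1, alpha2 = 2.0, 2.0          # positive constant gains (alpha_1, alpha_n)

def errors(y, y_dot, yd, yd_dot):
    """Measurable errors: e_1 = y - y_d, e_2 = e_1_dot + alpha_1 * e_1."""
    e1 = y - yd
    e1_dot = y_dot - yd_dot
    e2 = e1_dot + alpha1 * e1
    return e1, e2

def filtered_r(e2_dot, e2):
    """r = e_2_dot + alpha_2 * e_2 (Eq. 3-4); not implementable online."""
    return e2_dot + alpha2 * e2

# usage with hypothetical signal values:
e1, e2 = errors(y=0.10, y_dot=0.00, yd=0.00, yd_dot=0.10)   # e1 = 0.1, e2 = 0.1
```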

3.3 Action NN-Based Control

Using Eqs. 3-2 to 3-6, the open-loop error system can be written as

$r = y^{(n)} - y_d^{(n)} + f,$ (3-7)

where $f(e_1, \dot{e}_1, \ldots, e_1^{(n-1)}) \in \mathbb{R}^m$ is a function of known and measurable terms, defined as

$f \triangleq \sum_{j=0}^{n-2} a_{nj} \left( e_1^{(j+1)} + \alpha_n e_1^{(j)} \right) + \alpha_n e_1^{(n-1)}.$

Substituting the dynamics from Eq. 3-1 into Eq. 3-7 yields

$r = g(x) + d - y_d^{(n)} + f + u.$ (3-8)

Adding and subtracting $g(x_d) : \mathbb{R}^{mn} \to \mathbb{R}^m$, where $g(x_d)$ is a smooth unknown function of the desired trajectory $x_d(t) \triangleq [y_d^T\ \dot{y}_d^T\ \ldots\ (y_d^{(n-1)})^T]^T \in \mathbb{R}^{mn}$, the expression in Eq. 3-8 can be written as

$r = g(x_d) + S + d + Y + u,$ (3-9)

where $Y(e_1, \dot{e}_1, \ldots, e_1^{(n-1)}, y_d^{(n)}) \in \mathbb{R}^m$ contains known and measurable terms and is defined as

$Y \triangleq -y_d^{(n)} + f,$ (3-10)

and the auxiliary function $S(x, x_d) \in \mathbb{R}^m$ is defined as $S \triangleq g(x) - g(x_d)$. The unknown nonlinear term $g(x_d)$ can be represented by a multi-layer NN as

$g(x_d) = W_a^T \sigma_a(V_a^T x_a) + \varepsilon_a(x_a),$ (3-11)

where $x_a(t) \triangleq [1\ x_d^T]^T \in \mathbb{R}^{mn+1}$ is the input to the NN, $W_a \in \mathbb{R}^{(N_a+1) \times m}$ and $V_a \in \mathbb{R}^{(mn+1) \times N_a}$ are the constant bounded ideal weights for the output and hidden layers, respectively, with $N_a$ being the number of neurons in the hidden layer, $\sigma_a(\cdot) \in \mathbb{R}^{N_a+1}$ is the bounded activation function, and $\varepsilon_a(x_a) \in \mathbb{R}^m$ is the function reconstruction error.

Remark 3.1. The NN used in Eq. 3-11 is referred to as the action NN or the associative search element (ASE) [19], and it is used to approximate the system dynamics and generate appropriate control signals.

Based on the assumption that the desired trajectory is bounded, the following inequalities hold

$\|\varepsilon_a(x_a)\| \le \bar{\varepsilon}_{a1}, \quad \|\dot{\varepsilon}_a(x_a, \dot{x}_a)\| \le \bar{\varepsilon}_{a2}, \quad \|\ddot{\varepsilon}_a(x_a, \dot{x}_a, \ddot{x}_a)\| \le \bar{\varepsilon}_{a3},$ (3-12)

where $\bar{\varepsilon}_{a1}, \bar{\varepsilon}_{a2}, \bar{\varepsilon}_{a3} \in \mathbb{R}$ are known positive constants. Also, the ideal weights are assumed to exist and be bounded by known positive constants [18], such that

$\|V_a\| \le \bar{V}_a, \quad \|W_a\| \le \bar{W}_a.$ (3-13)

Substituting Eq. 3-11 in Eq. 3-9, the open-loop error system can now be written as

$r = W_a^T \sigma_a(V_a^T x_a) + \varepsilon_a(x_a) + S + d + Y + u.$ (3-14)

The NN approximation for $g(x_d)$ can be represented as

$\hat{g}(x_d) = \hat{W}_a^T \sigma_a(\hat{V}_a^T x_a),$

where $\hat{W}_a(t) \in \mathbb{R}^{(N_a+1) \times m}$ and $\hat{V}_a(t) \in \mathbb{R}^{(mn+1) \times N_a}$ are the subsequently designed estimates of the ideal weights. The control input $u(t)$ in Eq. 3-14 can now be designed as

$u \triangleq -Y - \hat{g}(x_d) - \mu_a,$ (3-15)

where $\mu_a(t) \in \mathbb{R}^m$ denotes the RISE feedback term defined as [46, 47]

$\mu_a \triangleq (k_a + 1)\, e_n(t) - (k_a + 1)\, e_n(0) + v,$ (3-16)

where $v(t) \in \mathbb{R}^m$ is the generalized solution to

$\dot{v} = (k_a + 1)\,\alpha_n e_n + \beta_1\,\mathrm{sgn}(e_n), \quad v(0) = 0,$ (3-17)

where $k_a, \beta_1 \in \mathbb{R}$ are constant positive control gains, and $\mathrm{sgn}(\cdot)$ denotes a vector signum function.

Remark 3.2. Typically, the presence of the function reconstruction error and disturbance terms in Eq. 3-14 would lead to a UUB stability result. The RISE term used in Eq. 3-15 robustly accounts for these terms, guaranteeing asymptotic tracking with a continuous controller [71] (i.e., compared with similar results that can be obtained by discontinuous sliding mode control). The derivative of the RISE structure includes a $\mathrm{sgn}(\cdot)$ term in Eq. 3-17 which allows it to implicitly learn and cancel terms in the stability analysis that are $C^2$ with bounded time derivatives.

Substituting the control input from Eq. 3-15 in Eq. 3-14 yields

$r = W_a^T \sigma_a(V_a^T x_a) - \hat{W}_a^T \sigma_a(\hat{V}_a^T x_a) + S + d + \varepsilon_a - \mu_a.$ (3-18)

To facilitate the subsequent stability analysis, the time derivative of Eq. 3-18 is expressed as

$\dot{r} = \hat{W}_a^T \hat{\sigma}_a' \tilde{V}_a^T \dot{x}_a + \tilde{W}_a^T \hat{\sigma}_a' \hat{V}_a^T \dot{x}_a + W_a^T \sigma_a'(V_a^T x_a) V_a^T \dot{x}_a - W_a^T \hat{\sigma}_a' \hat{V}_a^T \dot{x}_a - \hat{W}_a^T \hat{\sigma}_a' \tilde{V}_a^T \dot{x}_a - \dot{\hat{W}}_a^T \sigma_a(\hat{V}_a^T x_a) - \hat{W}_a^T \hat{\sigma}_a' \dot{\hat{V}}_a^T x_a + \dot{S} + \dot{d} + \dot{\varepsilon}_a - \dot{\mu}_a,$ (3-19)

where $\hat{\sigma}_a' \triangleq d\sigma_a(V_a^T x_a)/d(V_a^T x_a)\big|_{V_a^T x_a = \hat{V}_a^T x_a}$, and $\tilde{W}_a(t) \in \mathbb{R}^{(N_a+1) \times m}$ and $\tilde{V}_a(t) \in \mathbb{R}^{(mn+1) \times N_a}$ are the mismatches between the ideal and the estimated weights, defined as

$\tilde{V}_a \triangleq V_a - \hat{V}_a, \quad \tilde{W}_a \triangleq W_a - \hat{W}_a.$
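The following sketch, a minimal illustration rather than the experimental implementation, integrates the RISE feedback term of Eqs. 3-16 and 3-17 with forward Euler; the gains, step size, and the source of the error signal are hypothetical.

```python
# Minimal sketch of the RISE feedback term in Eqs. 3-16 and 3-17, integrated
# with forward Euler. Gains, step size, and the e_n signal source are
# hypothetical; in the controller this runs alongside the action NN of Eq. 3-15.
import numpy as np

k_a, alpha_n, beta1, dt = 5.0, 2.0, 1.0, 1e-3

class RISE:
    def __init__(self, e_n0):
        self.e_n0 = np.asarray(e_n0, dtype=float)   # stored e_n(0)
        self.v = np.zeros_like(self.e_n0)           # v(0) = 0

    def step(self, e_n):
        """Return mu_a = (k_a + 1)(e_n - e_n(0)) + v, then integrate v forward."""
        e_n = np.asarray(e_n, dtype=float)
        mu = (k_a + 1.0) * (e_n - self.e_n0) + self.v
        # v_dot = (k_a + 1) * alpha_n * e_n + beta1 * sgn(e_n)   (Eq. 3-17)
        self.v += dt * ((k_a + 1.0) * alpha_n * e_n + beta1 * np.sign(e_n))
        return mu

# usage each sample time: rise = RISE(e_n0=[0.2]); mu_a = rise.step(e_n=[0.15])
```

Note that the discontinuous $\mathrm{sgn}(\cdot)$ appears only inside the integrator, so the feedback $\mu_a(t)$ delivered to the plant remains continuous, which is the defining feature of the RISE structure.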

The weight update laws for the action NN are designed based on the subsequent stability analysis as

$\dot{\hat{W}}_a = \mathrm{proj}\big( \Gamma_{aw}\, \hat{\sigma}_a' \hat{V}_a^T \dot{x}_a e_n^T + \Gamma_{aw}\, \hat{\sigma}_a\, R\, \hat{W}_c^T \hat{\sigma}_c' \hat{V}_c^T \big),$
$\dot{\hat{V}}_a = \mathrm{proj}\big( \Gamma_{av}\, \dot{x}_a e_n^T \hat{W}_a^T \hat{\sigma}_a' + \Gamma_{av}\, \dot{x}_a R\, \hat{W}_c^T \hat{\sigma}_c' \hat{V}_c^T \hat{W}_a^T \hat{\sigma}_a' \big),$ (3-20)

where $\hat{\sigma}_a \triangleq \sigma_a(\hat{V}_a^T x_a)$, $\hat{\sigma}_c' \triangleq \sigma_c'(\hat{V}_c^T e_n)$, $\Gamma_{aw} \in \mathbb{R}^{(N_a+1) \times (N_a+1)}$ and $\Gamma_{av} \in \mathbb{R}^{(mn+1) \times (mn+1)}$ are constant, positive-definite, symmetric gain matrices, $R(t) \in \mathbb{R}$ is the subsequently designed reinforcement signal, $\mathrm{proj}(\cdot)$ is a smooth projection operator utilized to guarantee that the weight estimates $\hat{W}_a(t)$ and $\hat{V}_a(t)$ remain bounded [72], [73], and $\hat{V}_c(t) \in \mathbb{R}^{m \times N_c}$ and $\hat{W}_c(t) \in \mathbb{R}^{(N_c+1) \times 1}$ are the subsequently introduced weight estimates for the critic NN. The NN weight update law in Eq. 3-20 is composite in the sense that it consists of two terms, one of which is affine in the tracking error $e_n(t)$ and the other in the reinforcement signal $R(t)$. The update law in Eq. 3-20 can be decomposed into two terms

$\dot{\hat{W}}_a^T = \Omega_{We_n} + \Omega_{WR}, \quad \dot{\hat{V}}_a^T = \Omega_{Ve_n} + \Omega_{VR},$ (3-21)

where $\Omega_{We_n}, \Omega_{Ve_n}$ denote the $e_n$-driven parts and $\Omega_{WR}, \Omega_{VR}$ the $R$-driven parts of the respective update laws. Using Assumption 3.2, Eq. 3-13, and the use of the projection algorithm in Eq. 3-20, the following bounds can be established

$\|\Omega_{We_n}\| \le c_1 \|e_n\|, \quad \|\Omega_{WR}\| \le c_2 |R|, \quad \|\Omega_{Ve_n}\| \le c_3 \|e_n\|, \quad \|\Omega_{VR}\| \le c_4 |R|,$ (3-22)

where $c_1, c_2, c_3, c_4 \in \mathbb{R}$ are known positive constants. Substituting Eqs. 3-16, 3-20, and 3-21 in Eq. 3-19, and grouping terms, the following expression is obtained

$\dot{r} = \tilde{N} + N_R + N - e_n - (k_a + 1)\, r - \beta_1\,\mathrm{sgn}(e_n),$ (3-23)

where the unknown auxiliary terms $\tilde{N}(t) \in \mathbb{R}^m$ and $N_R(t) \in \mathbb{R}^m$ are defined as

$\tilde{N} \triangleq \dot{S} + e_n - \Omega_{We_n}^T \sigma_a(\hat{V}_a^T x_a) - \hat{W}_a^T \sigma_a'(\hat{V}_a^T x_a)\, \Omega_{Ve_n}^T x_a,$ (3-24)

$N_R \triangleq -\Omega_{WR}^T \sigma_a(\hat{V}_a^T x_a) - \hat{W}_a^T \sigma_a'(\hat{V}_a^T x_a)\, \Omega_{VR}^T x_a.$ (3-25)

The auxiliary term $N(t) \in \mathbb{R}^m$ is segregated into two terms as

$N = N_d + N_B,$ (3-26)

where $N_d(t) \in \mathbb{R}^m$ is defined as

$N_d \triangleq W_a^T \sigma_a'(V_a^T x_a) V_a^T \dot{x}_a + \dot{d} + \dot{\varepsilon}_a,$ (3-27)

and $N_B(t) \in \mathbb{R}^m$ is further segregated into two terms as

$N_B = N_{B1} + N_{B2},$ (3-28)

where $N_{B1}(t), N_{B2}(t) \in \mathbb{R}^m$ are defined as

$N_{B1} \triangleq -W_a^T \sigma_a'(\hat{V}_a^T x_a) \hat{V}_a^T \dot{x}_a - \hat{W}_a^T \sigma_a'(\hat{V}_a^T x_a) \tilde{V}_a^T \dot{x}_a,$
$N_{B2} \triangleq \tilde{W}_a^T \sigma_a'(\hat{V}_a^T x_a) \hat{V}_a^T \dot{x}_a + \hat{W}_a^T \sigma_a'(\hat{V}_a^T x_a) \tilde{V}_a^T \dot{x}_a.$ (3-29)

Using the Mean Value Theorem, the following upper bound can be developed [47], [71]

$\|\tilde{N}(t)\| \le \rho_1(\|z\|)\, \|z\|,$ (3-30)

where $z(t) \in \mathbb{R}^{(n+1)m}$ is defined as

$z \triangleq [e_1^T\ e_2^T\ \ldots\ e_n^T\ r^T]^T,$ (3-31)

and the bounding function $\rho_1(\cdot) \in \mathbb{R}$ is a positive, globally invertible, non-decreasing function. Using Assumptions 3.2 and 3.3, Eqs. 3-12, 3-13, and 3-20, the following bounds can be developed for Eqs. 3-25 to 3-29

$\|N_d\| \le \zeta_1, \quad \|N_{B1}\| \le \zeta_2, \quad \|N_{B2}\| \le \zeta_3, \quad \|N\| \le \zeta_1 + \zeta_2 + \zeta_3, \quad \|N_R\| \le \zeta_4 |R|.$ (3-32)

The bounds for the time derivatives of Eqs. 3-27 and 3-28 can be developed using Assumptions 3.2 and 3.3, Eqs. 3-12 and 3-20, as

$\|\dot{N}_d\| \le \zeta_5, \quad \|\dot{N}_B\| \le \zeta_6 + \zeta_7 \|e_n\| + \zeta_8 |R|,$ (3-33)

where $\zeta_i \in \mathbb{R}$, $(i = 1, 2, \ldots, 8)$ are computable positive constants.

Remark 3.3. The segregation of the auxiliary terms in Eqs. 3-21 and 3-23 follows a typical RISE strategy [71] which is motivated by the desire to separate terms that can be upper bounded by state-dependent terms and terms that can be upper bounded by constants. Specifically, $\tilde{N}(t)$ contains terms upper bounded by tracking-error state-dependent terms; $N(t)$ has terms bounded by a constant, and is further segregated into $N_d(t)$ and $N_B(t)$, whose derivatives are bounded by a constant and a linear combination of tracking error states, respectively. Similarly, $N_R(t)$ contains reinforcement-signal-dependent terms. The terms in Eq. 3-28 are further segregated because $N_{B1}(t)$ will be rejected by the RISE feedback, whereas $N_{B2}(t)$ will be partially rejected by the RISE feedback and partially canceled by the NN weight update law.

3.4 Critic NN Architecture

In RL literature [2], the critic generates a scalar evaluation signal which is then used to tune the action NN. The critic itself consists of a NN which approximates an evaluation function based on some performance measure. The proposed AC architecture is shown in Fig. 3-1. The filtered tracking error $e_n(t)$ can be considered as an instantaneous utility function of the plant performance [43, 44].

The reinforcement signal $R(t) \in \mathbb{R}$ is defined as [43]

$R \triangleq \hat{W}_c^T \sigma_c(\hat{V}_c^T e_n) + \rho,$ (3-34)

where $\hat{V}_c \in \mathbb{R}^{m \times N_c}$, $\hat{W}_c \in \mathbb{R}^{(N_c+1) \times 1}$, $\sigma_c(\cdot) \in \mathbb{R}^{N_c+1}$ is the nonlinear activation function, $N_c$ is the number of hidden layer neurons of the critic NN, the performance measure $e_n(t)$ defined in Eq. 3-3 is the input to the critic NN, and $\rho \in \mathbb{R}$ is an auxiliary term.

PAGE 41

Plant Performance Evaluator RISE Feedback Action NN Critic NN x(t) en(t) Instantaneous Utility R(t) Reinforcement signal u(t) (t) xd(t) xd(t) Figure3-1.ArchitectureoftheRISE-basedACcontroller. generatedas = ^ W T c 0 ( ^ V T c e n ) ^ V T c ( a + n e n ) k c R 2 sgn ( R ) ; (3{35) where k c ; 2 2 R areconstantpositivecontrolgains.Theweightupdatelawforthe critic NNisgeneratedbasedonthesubsequentstabilityanalysisas ^ W c = proj ( cw ( ^ V T c e n ) R cw ^ W c ) (3{36) ^ V c = proj ( cv e n ^ W T c 0 ( ^ V T c e n ) R cv ^ V c ) ; where cw ; cv 2 R areconstantpositivecontrolgains. Remark3.4. Thestructureofthereinforcementsignal R ( t ) inEq. 3{34 ismotivatedby literaturesuchas[ 43 { 45 ],wherethereinforcementsignalistypicallytheoutputof acritic NNwhichtunestheactorbasedonaperformancemeasure.Thep erformancemeasure consideredinthisworkisthetrackingerror e n ( t ) ,andthecriticweightupdatelawsare 41
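To make the critic computations concrete, the following minimal sketch steps the reinforcement signal and critic weights of Eqs. 3-34 through 3-36 forward in time. The tanh activation, the omission of the bias row and of the projection operator, the forward-Euler discretization, and all argument conventions are illustrative assumptions, not part of the original design.

```python
import numpy as np

def critic_step(W_c, V_c, eta, e_n, e_n_dot, gains, dt):
    """One forward-Euler step of the critic of Eqs. 3-34 to 3-36 (sketch).

    W_c: (Nc,) output weights; V_c: (m, Nc) input weights; e_n: (m,) tracking error.
    Bias terms and the projection operator are omitted for brevity."""
    k_c, beta2, gamma_n, G_cw, G_cv = gains
    z = V_c.T @ e_n                          # critic hidden-layer input
    sig = np.tanh(z)                         # sigma(V_c^T e_n)
    dsig = np.diag(1.0 - np.tanh(z)**2)      # sigma'(V_c^T e_n), diagonal for tanh
    R = float(W_c @ sig) + eta               # reinforcement signal, Eq. 3-34
    # RISE-like auxiliary dynamics, Eq. 3-35
    eta_dot = (float(W_c @ dsig @ V_c.T @ (e_n_dot + gamma_n * e_n))
               - k_c * R - beta2 * np.sign(R))
    # gradient-based critic updates with sigma-modification, Eq. 3-36
    W_c_dot = -G_cw * sig * R - G_cw * W_c
    V_c_dot = -G_cv * np.outer(e_n, W_c @ dsig) * R - G_cv * V_c
    return W_c + dt * W_c_dot, V_c + dt * V_c_dot, eta + dt * eta_dot, R
```

In a full implementation each integration step would additionally pass the weight derivatives through the projection operator before updating, so that the boundedness assumed in the stability analysis is preserved.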

To aid the subsequent stability analysis, the time derivative of the reinforcement signal in Eq. 3-34 is obtained as

$\dot{R} = \dot{\hat{W}}_c^T\sigma(\hat{V}_c^T e_n) + \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\dot{\hat{V}}_c^T e_n + \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\hat{V}_c^T\dot{e}_n + \dot{\eta}. \qquad (3-37)$

Using Eqs. 3-18, 3-35, 3-36, and the Taylor series expansion [18]

$\sigma(V_a^T x_a) = \sigma(\hat{V}_a^T x_a) + \sigma'(\hat{V}_a^T x_a)\tilde{V}_a^T x_a + O(\tilde{V}_a^T x_a)^2,$

where $O(\cdot)^2$ represents the higher order terms, the expression in Eq. 3-37 can be written as

$\dot{R} = \dot{\hat{W}}_c^T\sigma(\hat{V}_c^T e_n) + \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\dot{\hat{V}}_c^T e_n + N_{dc} + N_s + \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\hat{V}_c^T\tilde{W}_a^T\sigma(\hat{V}_a^T x_a) + \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\hat{V}_c^T\hat{W}_a^T\sigma'(\hat{V}_a^T x_a)\tilde{V}_a^T x_a - k_c R - \beta_2\,\mathrm{sgn}(R), \qquad (3-38)$

where the auxiliary terms $N_{dc}(t)\in\mathbb{R}$ and $N_s(t)\in\mathbb{R}$ are unknown functions defined as

$N_{dc} \triangleq \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\hat{V}_c^T\big[\tilde{W}_a^T\sigma'(\hat{V}_a^T x_a)\tilde{V}_a^T x_a + W_a^TO(\tilde{V}_a^T x_a)^2 + d + \varepsilon_a\big], \qquad N_s \triangleq \hat{W}_c^T\sigma'(\hat{V}_c^T e_n)\hat{V}_c^T S. \qquad (3-39)$

Using Assumptions 3.2 and 3.3, Eqs. 3-12 and 3-36, and the Mean Value Theorem, the following bounds can be developed for Eq. 3-39

$\|N_{dc}\|\le\zeta_9, \qquad \|N_s\|\le\rho_2(\|z\|)\|z\|, \qquad (3-40)$

where $\zeta_9\in\mathbb{R}$ is a computable positive constant, and $\rho_2(\cdot)\in\mathbb{R}$ is a positive, globally invertible, non-decreasing function.

3.5 Stability Analysis

Theorem 3.1. The RISE-based AC controller given in Eqs. 3-15 and 3-34, along with the weight update laws for the action and critic NNs given in Eqs. 3-20 and 3-36, respectively, ensures that all system signals are bounded under closed-loop operation and that the tracking error is regulated in the sense that

$\|e_1(t)\|\to0 \quad \text{as} \quad t\to\infty,$

provided the control gains $k_a$ and $k_c$ are selected sufficiently large based on the initial conditions of the states, $\gamma_{n-1}$, $\gamma_n$, $\beta_2$, and $k_c$ are chosen according to the following sufficient conditions

$\gamma_{n-1} > \tfrac12, \qquad \gamma_n > \zeta_3 + \tfrac12, \qquad \beta_2 > \zeta_9, \qquad k_c > \zeta_4, \qquad (3-41)$

and $\beta_1, \beta_3, \beta_4\in\mathbb{R}$, introduced in Eq. 3-46, are chosen to satisfy the following sufficient conditions

$\beta_1 > \max\Big(\zeta_1+\zeta_2+\zeta_3,\ \zeta_1+\zeta_2+\frac{\zeta_5}{\gamma_n}+\frac{\zeta_6}{\gamma_n}\Big), \qquad \beta_3 > \zeta_7 + \frac{\zeta_8}{2}, \qquad \beta_4 > \frac{\zeta_8}{2}. \qquad (3-42)$

(The derivation of the sufficient conditions in Eq. 3-42 is provided in Appendix A.1.)

Proof. Let $\mathcal{D}\subseteq\mathbb{R}^{(n+1)m+3}$ be a domain containing $y(t)=0$, where $y(t)\in\mathbb{R}^{(n+1)m+3}$ is defined as

$y \triangleq [\,z^T\ R\ \sqrt{P}\ \sqrt{Q}\,]^T, \qquad (3-43)$

where the auxiliary function $Q(t)\in\mathbb{R}$ is defined as

$Q \triangleq \tfrac12\mathrm{tr}(\tilde{W}_a^T\Gamma_{aw}^{-1}\tilde{W}_a) + \tfrac12\mathrm{tr}(\tilde{V}_a^T\Gamma_{av}^{-1}\tilde{V}_a) + \tfrac12\mathrm{tr}(\hat{W}_c^T\hat{W}_c) + \tfrac12\mathrm{tr}(\hat{V}_c^T\hat{V}_c), \qquad (3-44)$

where $\mathrm{tr}(\cdot)$ is the trace of a matrix. The auxiliary function $P(z,R,t)\in\mathbb{R}$ in Eq. 3-43 is the generalized solution to the differential equation

$\dot{P} = -L, \qquad P(0) = \beta_1\sum_{i=1}^m|e_{ni}(0)| - e_n(0)^TN(0), \qquad (3-45)$

where the subscript $i = 1, 2, \ldots, m$ denotes the $i$-th element of the vector, and the auxiliary function $L(z,R,t)\in\mathbb{R}$ is defined as

$L \triangleq r^T(N_d + N_{B1} - \beta_1\,\mathrm{sgn}(e_n)) + \dot{e}_n^TN_{B2} - \beta_3\|e_n\|^2 - \beta_4|R|^2, \qquad (3-46)$

where $\beta_1, \beta_3, \beta_4\in\mathbb{R}$ are chosen according to the sufficient conditions in Eq. 3-42. Provided the sufficient conditions introduced in Eq. 3-42 are satisfied, $P(z,R,t)\ge0$. From Eqs. 3-23, 3-32, 3-38, and 3-40, some disturbance terms in the closed-loop error systems are bounded by a constant. Typically, such terms (e.g., the NN reconstruction error) lead to a UUB stability result. The definition of $P(z,R,t)$ is motivated by the RISE control structure to compensate for such disturbances so that an asymptotic tracking result is obtained.

Let $V(y): \mathcal{D}\times[0,\infty)\to\mathbb{R}$ be a Lipschitz continuous regular positive definite function defined as

$V \triangleq \tfrac12z^Tz + \tfrac12R^2 + P + Q, \qquad (3-47)$

which satisfies the following inequalities:

$U_1(y) \le V(y) \le U_2(y), \qquad (3-48)$

where $U_1(y), U_2(y)\in\mathbb{R}$ are continuous positive definite functions. From Eqs. 3-3, 3-4, 3-23, 3-38, 3-44, and 3-45, the differential equations of the closed-loop system are continuous except in the set $\{(y,t)\,|\,e_n = 0 \text{ or } R = 0\}$. Using Filippov's differential inclusion [74-77], the existence and uniqueness of solutions can be established for $\dot{y} = f(y,t)$ (a.e.), where $f(y,t)\in\mathbb{R}^{(n+1)m+3}$ denotes the right-hand side of the closed-loop error signals. Under Filippov's framework, a generalized Lyapunov stability theory can be used (see [77-80] and Appendix A.2 for further details) to establish strong stability of the closed-loop system. The generalized time derivative of Eq. 3-47 exists almost everywhere (a.e.), and $\dot{V}(y)\overset{a.e.}{\in}\dot{\tilde{V}}(y)$, where

$\dot{\tilde{V}} = \bigcap_{\xi\in\partial V(y)}\xi^TK\big[\,\dot{z}^T\ \dot{R}\ \tfrac12P^{-1/2}\dot{P}\ \tfrac12Q^{-1/2}\dot{Q}\ 1\,\big]^T,$

where $\partial V$ is the generalized gradient of $V$ [78], and $K[\cdot]$ is defined as [79, 80]

$K[f](y,t) \triangleq \bigcap_{\delta>0}\ \bigcap_{\mu N=0}\overline{\mathrm{co}}\,f(B(y,\delta)\setminus N,\,t), \qquad (3-49)$

where $\bigcap_{\mu N=0}$ denotes the intersection over all sets $N$ of Lebesgue measure zero, $\overline{\mathrm{co}}$ denotes convex closure, and $B(y,\delta)$ represents a ball of radius $\delta$ around $y$. Since $V(y)$ is a Lipschitz continuous regular function,

$\dot{\tilde{V}} = \nabla V^TK\big[\,\dot{z}^T\ \dot{R}\ \tfrac12P^{-1/2}\dot{P}\ \tfrac12Q^{-1/2}\dot{Q}\ 1\,\big]^T = \big[\,z^T\ R\ 2P^{1/2}\ 2Q^{1/2}\ 0\,\big]K\big[\,\dot{z}^T\ \dot{R}\ \tfrac12P^{-1/2}\dot{P}\ \tfrac12Q^{-1/2}\dot{Q}\ 1\,\big]^T.$

Using the calculus for $K[\cdot]$ from [80], substituting the dynamics from Eqs. 3-23, 3-38, 3-44, and 3-45, substituting the NN weight update laws from Eqs. 3-20 and 3-36, splitting $k_c$ as $k_c = k_{c1} + k_{c2}$, and using the fact that $(r^T - r^T)_i\,SGN(e_{ni}) = 0$ (the subscript $i$ denotes the $i$-th element), where $K[\mathrm{sgn}(e_n)] = SGN(e_n)$ [80], such that $SGN(e_{ni}) = 1$ if $e_{ni} > 0$, $[-1,1]$ if $e_{ni} = 0$, and $-1$ if $e_{ni} < 0$, yields

$\dot{\tilde{V}} \le -\sum_{i=1}^n\gamma_i\|e_i\|^2 + e_{n-1}^Te_n - \|r\|^2 - (k_{c1}+k_{c2})|R|^2 + r^T(\tilde{N} + N_R - k_ar) + R(N_{dc} + N_s) - \beta_2|R| + \beta_3\|e_n\|^2 + \beta_4|R|^2 - \Gamma_{cw}\|\sigma(\hat{V}_c^Te_n)\|^2|R|^2 - \Gamma_{cw}\|\hat{W}_c\|^2 + 2\Gamma_{cw}|R|\,\|\hat{W}_c\|\,\|\sigma(\hat{V}_c^Te_n)\| - \Gamma_{cv}\|\hat{W}_c^T\sigma'(\hat{V}_c^Te_n)\|^2\|e_n\|^2|R|^2 - \Gamma_{cv}\|\hat{V}_c\|^2 + 2\Gamma_{cv}\|\hat{W}_c^T\sigma'(\hat{V}_c^Te_n)\|\,\|e_n\|\,\|\hat{V}_c\|\,|R|. \qquad (3-50)$

Upper bounding the expression in Eq. 3-50 using Eqs. 3-30, 3-32, and 3-40 yields

$\dot{\tilde{V}} \le -\sum_{i=1}^{n-2}\gamma_i\|e_i\|^2 - \big(\gamma_{n-1}-\tfrac12\big)\|e_{n-1}\|^2 - \|r\|^2 - \big(\gamma_n-\zeta_3-\tfrac12\big)\|e_n\|^2 - (k_{c1}-\zeta_4)|R|^2 + (\zeta_9-\beta_2)|R| - \big[k_a\|r\|^2 - \rho_1(\|z\|)\|z\|\|r\|\big] - \big[k_{c2}|R|^2 - (\rho_2(\|z\|)+\zeta_4)|R|\|z\|\big]. \qquad (3-51)$

Provided the gains are selected according to Eq. 3-41, the expression in Eq. 3-51 can be further upper bounded by completing the squares as

$\dot{\tilde{V}} \le -\lambda\|z\|^2 + \frac{\rho^2(\|z\|)\|z\|^2}{4k} - (k_{c1}-\zeta_4)|R|^2 \le -U(y) \quad \forall y\in\mathcal{D}, \qquad (3-52)$

where $k \triangleq \min(k_a, k_{c2})$ and $\lambda\in\mathbb{R}$ is a positive constant defined as

$\lambda \triangleq \min\big\{\gamma_1, \gamma_2, \ldots, \gamma_{n-2},\ \gamma_{n-1}-\tfrac12,\ \gamma_n-\zeta_3-\tfrac12,\ 1\big\}.$

In Eq. 3-52, $\rho(\cdot)\in\mathbb{R}$ is a positive, globally invertible, non-decreasing function defined as

$\rho^2(\|z\|) = \rho_1^2(\|z\|) + \big(\rho_2(\|z\|)+\zeta_4\big)^2,$

and $U(y) \triangleq c\,\|[z^T\ R]^T\|^2$, for some positive constant $c$, is a continuous, positive semi-definite function defined on the domain

$\mathcal{D} \triangleq \big\{y(t)\in\mathbb{R}^{(n+1)m+3}\ \big|\ \|y\|\le\rho^{-1}\big(2\sqrt{\lambda k}\big)\big\}.$

The size of the domain $\mathcal{D}$ can be increased by increasing $k$. The result in Eq. 3-52 indicates that $\dot{V}(y)\le -U(y)$ for all $\dot{V}(y)\in\dot{\tilde{V}}(y)$ and all $y\in\mathcal{D}$. The inequalities in Eqs. 3-48 and 3-52 can be used to show that $V(y)\in\mathcal{L}_\infty$ in $\mathcal{D}$; hence, $e_1(t), e_2(t), \ldots, e_n(t), r(t)$, and $R(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. Standard linear analysis methods can be used along with Eqs. 3-1 and 3-5 to prove that $\dot{e}_1(t), \dot{e}_2(t), \ldots, \dot{e}_n(t), x^{(i)}(t)\in\mathcal{L}_\infty$ ($i = 0, 1, 2$) in $\mathcal{D}$. Further, Assumptions 3.1 and 3.3 can be used to conclude that $u(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. From these results, Eqs. 3-12, 3-13, 3-19, 3-20, 3-34, and 3-37 can be used to conclude that $\dot{r}(t), \dot{\eta}(t), \dot{R}(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. Hence, $U(y)$ is uniformly continuous in $\mathcal{D}$. Let $\mathcal{S}\subset\mathcal{D}$ denote a set defined as

$\mathcal{S} \triangleq \Big\{y(t)\subset\mathcal{D}\ \Big|\ U_2(y(t)) < \lambda_1\big(\rho^{-1}(2\sqrt{\lambda k})\big)^2\Big\}. \qquad (3-53)$

The region of attraction in Eq. 3-53 can be made arbitrarily large to include any initial conditions by increasing the control gain $k$ (i.e., a semi-global type of stability result), and hence $\|e_1(t)\|, |R|\to0$ as $t\to\infty$ for all $y(0)\in\mathcal{S}$.

3.6 Experimental Results

To test the performance of the proposed AC-based approach, the controller in Eqs. 3-15, 3-20, 3-34, and 3-36 was implemented on a two-link robot manipulator, where two aluminum links are mounted on a 240 Nm (first link) and a 20 Nm (second link) switched reluctance motor. The motor resolvers provide rotor position measurements with a resolution of 614,400 pulses/revolution, and a standard backwards difference algorithm is used to numerically determine angular velocity from the encoder readings (Fig. 3-2). The two-link revolute robot is modeled as an Euler-Lagrange system with the following dynamics

$M(q)\ddot{q} + V_m(q,\dot{q})\dot{q} + F(\dot{q}) + \tau_d = \tau, \qquad (3-54)$

Figure 3-2. Two-link experiment testbed.

where $M(q)\in\mathbb{R}^{2\times2}$ denotes the inertia matrix, $V_m(q,\dot{q})\in\mathbb{R}^{2\times2}$ denotes the centripetal-Coriolis matrix, $F(\dot{q})\in\mathbb{R}^2$ denotes friction, $\tau_d(t)\in\mathbb{R}^2$ denotes an unknown external disturbance, $\tau(t)\in\mathbb{R}^2$ represents the control torque, and $q(t), \dot{q}(t), \ddot{q}(t)\in\mathbb{R}^2$ denote the link position, velocity, and acceleration. The dynamics in Eq. 3-54 can be transformed into the Brunovsky form as

$\dot{x}_1 = x_2, \qquad \dot{x}_2 = g(x) + u + d, \qquad (3-55)$

where $x_1 \triangleq q$, $x_2 \triangleq \dot{q}$, $x = [x_1^T\ x_2^T]^T$, $g(x) \triangleq -M^{-1}(q)[V_m(q,\dot{q})\dot{q} + F(\dot{q})]$, $u \triangleq M^{-1}(q)\tau$, and $d \triangleq -M^{-1}(q)\tau_d(t)$. The control objective is to track a desired link trajectory, selected as (in degrees)

$q_d(t) = 60\sin(2.5t)\big(1 - e^{-0.01t^3}\big).$

(For this experiment, the inertia matrix is assumed to be known, as it is required for the calculation of the joint torques $\tau(t)$, which are determined using the expression $\tau = M(q)u$.)
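The mapping from the Euler-Lagrange terms of Eq. 3-54 into the Brunovsky form of Eq. 3-55, and the recovery of joint torques via $\tau = M(q)u$, can be sketched as follows. The function names and the assumption that $M$, $V_m$, $F$, and $\tau_d$ are supplied as arrays already evaluated at $(q, \dot{q})$ are illustrative conventions, not part of the original implementation.

```python
import numpy as np

def brunovsky_fields(M, Vm, F, tau_d, qdot):
    """Map Euler-Lagrange terms (Eq. 3-54) into the Brunovsky form of Eq. 3-55.
    M, Vm are 2x2 arrays; F, tau_d, qdot are length-2 arrays at (q, qdot)."""
    Minv = np.linalg.inv(M)
    g_x = -Minv @ (Vm @ qdot + F)   # drift term g(x) of Eq. 3-55
    d = -Minv @ tau_d               # transformed disturbance
    return g_x, d

def joint_torque(M, u):
    """Recover joint torques from the Brunovsky-space control: tau = M(q) u."""
    return M @ u
```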

Two controllers are implemented on the system, both having the same expression for the control $u(t)$ as in Eq. 3-15; however, they differ in the NN weight update laws. The first controller (denoted NN+RISE) employs a standard NN gradient-based weight update law which is affine in the tracking error, given as

$\dot{\hat{W}}_a = \Gamma_{aw}\big[\mathrm{proj}\big(\gamma_n\,\sigma'(\hat{V}_a^Tx_a)\hat{V}_a^T\dot{x}_a\,e_n^T\big)\big], \qquad \dot{\hat{V}}_a = \Gamma_{av}\big[\mathrm{proj}\big(\gamma_n\,\dot{x}_ae_n^T\hat{W}_a^T\sigma'(\hat{V}_a^Tx_a)\big)\big].$

The proposed AC-based controller (denoted AC+RISE) uses a composite weight update law, consisting of a gradient-based term and a reinforcement-based term, as in Eq. 3-20, where the reinforcement term is generated from the critic architecture in Eq. 3-34. For the NN+RISE controller, the initial weights of the NN, $\hat{W}_a(0)$, are chosen to be zero, whereas $\hat{V}_a(0)$ is randomly initialized in $[-1,1]$ such that it forms a basis [81]. The input to the action NN is chosen as $x_a = [1\ q_d^T\ \dot{q}_d^T]$, and the number of hidden layer neurons is chosen by trial and error as $N_a = 10$. All other states are initialized to zero. A sigmoid activation function is chosen for the NN, and the adaptation gains are selected as $\Gamma_{aw} = I_{11}$ and $\Gamma_{av} = 0.1I_{11}$, with feedback gains selected as $\gamma_1 = \mathrm{diag}(10,15)$, $\gamma_2 = \mathrm{diag}(20,15)$, $k_a = \mathrm{diag}(20,15)$, and $\beta_1 = \mathrm{diag}(2,1)$. For the AC+RISE controller, the critic is added to NN+RISE by including the additional RL term in the weight update law of the action NN. The actor NN and the RISE term in AC+RISE use the same gains as NN+RISE. The number of hidden layer neurons for the critic is selected by trial and error as $N_c = 3$. The initial critic NN weights $\hat{W}_c(0)$ and $\hat{V}_c(0)$ are randomly chosen in $[-1,1]$. The control gains for the critic are selected as $k_c = 5$, $\beta_2 = 0.1$, $\Gamma_{cw} = 0.4$, and $\Gamma_{cv} = 1$.

Experiments with both controllers were repeated 10 consecutive times with the same gains to check the repeatability and accuracy of the results. For each run, the RMS values of the tracking error $e_1(t)$ and torque $\tau(t)$ are calculated, and a one-tailed unpaired t-test is performed with a significance level of $\alpha = 0.05$. A summary of the comparative results with the two controllers is tabulated in Tables 3-1 and 3-2.
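The statistical procedure described above can be reproduced with a short script such as the following sketch. The per-run RMS values below are hypothetical placeholders (only their summary statistics appear in Tables 3-1 and 3-2), and the use of scipy for the t-test is an assumed tool choice.

```python
import numpy as np
from scipy import stats

def rms(signal, axis=0):
    """Root-mean-square value of a signal, used for tracking error and torque."""
    return np.sqrt(np.mean(np.square(signal), axis=axis))

# hypothetical per-run RMS tracking errors from 10 repeated experiments
rms_nn_rise = np.array([0.143, 0.138, 0.133, 0.129, 0.126,
                        0.122, 0.118, 0.113, 0.107, 0.101])
rms_ac_rise = np.array([0.123, 0.118, 0.114, 0.111, 0.108,
                        0.106, 0.104, 0.102, 0.100, 0.098])

# one-tailed unpaired t-test at significance level alpha = 0.05
t_stat, p_two_sided = stats.ttest_ind(rms_nn_rise, rms_ac_rise, equal_var=True)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.3f}, one-sided P = {p_one_sided:.4f}")
```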

Table 3-1. Summarized experimental results and P values of a one-tailed unpaired t-test for Link 1.

                      RMS error [Link 1] (deg)    Torque [Link 1] (Nm)
                      NN+RISE      AC+RISE        NN+RISE      AC+RISE
  Maximum             0.143        0.123          15.937       16.013
  Minimum             0.101        0.098          15.451       15.470
  Mean                0.125        0.108          15.687       15.764
  Std. dev.           0.014        0.009          0.152        0.148
  P(T <= t)           0.003*                      0.134

  * denotes a statistically significant value.

Table 3-2. Summarized experimental results and P values of a one-tailed unpaired t-test for Link 2.

                      RMS error [Link 2] (deg)    Torque [Link 2] (Nm)
                      NN+RISE      AC+RISE        NN+RISE      AC+RISE
  Maximum             0.161        0.138          1.856        1.858
  Minimum             0.112        0.107          1.717        1.670
  Mean                0.137        0.127          1.783        1.753
  Std. dev.           0.015        0.010          0.045        0.054
  P(T <= t)           0.046*                      0.098

  * denotes a statistically significant value.

Tables 3-1 and 3-2 indicate that the AC+RISE controller has statistically smaller mean RMS errors for Link 1 (P = 0.003) and Link 2 (P = 0.046) as compared to the NN+RISE controller. The AC+RISE controller, while having a reduced error, uses approximately the same amount of control torque (a statistically insignificant difference) as NN+RISE. The results indicate that the mean RMS position tracking errors for Link 1 and Link 2 are approximately 14% and 7% smaller for the proposed AC+RISE controller. The plots of the tracking errors and control torques for a typical experiment are shown in Figs. 3-3 and 3-4.

3.7 Comparison with Related Work

A continuous asymptotic AC-based tracking controller is developed for a class of nonlinear systems with bounded disturbances. The approach differs from the optimal control-based ADP approaches proposed in the literature [8-10, 13, 14, 32, 42], where the critic usually approximates a long-term cost function and the actor approximates the optimal control. The similarity with the ADP-based methods lies in the use of the AC architecture, borrowed from RL, where the critic, through a reinforcement signal, affects the behavior of the actor, leading to improved performance.

Figure 3-3. Comparison of tracking errors and torques between NN+RISE and AC+RISE for link 1: (a) tracking error with NN+RISE, (b) control torque with NN+RISE, (c) tracking error with AC+RISE, (d) control torque with AC+RISE.

The proposed adaptive robust controller consists of a NN feedforward term (the actor NN) and a robust feedback term, where the weight update laws of the actor NN are designed as a composite of a tracking error term and a RL term (from the critic), with the objective of minimizing the tracking error [43-45]. The robust term is designed to withstand the external disturbances and modeling errors in the plant. Typically, the presence of bounded disturbances and NN approximation errors leads to a UUB result. The main contribution of this work is the use of a recently developed continuous feedback technique, RISE [46, 47], in conjunction with the AC architecture to yield asymptotic tracking of an unknown nonlinear system subjected to bounded external disturbances.

Figure 3-4. Comparison of tracking errors and torques between NN+RISE and AC+RISE for link 2: (a) tracking error with NN+RISE, (b) control torque with NN+RISE, (c) tracking error with AC+RISE, (d) control torque with AC+RISE.

The use of RISE in conjunction with the action NN makes the design of the critic NN architecture challenging from a stability standpoint. To this end, the critic NN is combined with an additional RISE-like term to yield a reinforcement signal, which is used to update the weights of the action NN. A smooth projection algorithm is used to bound the NN weight estimates, and a Lyapunov stability analysis guarantees closed-loop stability of the system.

3.8 Summary

An AC-based controller is developed for a class of uncertain nonlinear systems with additive bounded disturbances. The main contribution of this work is the combination of the continuous RISE feedback with the AC architecture to guarantee asymptotic tracking for the nonlinear system. The feedforward action NN approximates the nonlinear system dynamics, and the robust feedback (RISE) rejects the NN functional reconstruction error and disturbances. In addition, the action NN is trained online using a combination of the tracking error and a reinforcement signal generated by the critic. Experimental results and t-test analysis demonstrate faster convergence of the tracking error when a RL term is included in the NN weight update laws.

CHAPTER 4
ROBUST IDENTIFICATION-BASED STATE DERIVATIVE ESTIMATION FOR NONLINEAR SYSTEMS

The requirement of complete model knowledge has impeded the development of RL-based optimal control solutions for continuous-time uncertain nonlinear systems, which motivates the development of the state derivative estimator in this chapter. Besides providing a model-free value function approximation in RL-based control, estimation of the state derivative is useful for many other applications, including disturbance and parameter estimation [82], fault detection in dynamical systems [83], digital differentiation in signal processing, acceleration feedback in robot contact transition control [84], DC motor control [85], and active vibration control [86]. The problem of computing the state derivative becomes trivial if the state is fully measurable and the system dynamics are exactly known. The presence of uncertainties (parametric and non-parametric) and exogenous disturbances, however, makes the problem challenging and motivates the state derivative estimation method for uncertain nonlinear systems developed in this work.

4.1 Robust Identification-Based State Derivative Estimation

Consider a control-affine uncertain nonlinear system

$\dot{x} = f(x) + \sum_{i=1}^m g_i(x)u_i + d, \qquad (4-1)$

where $x(t)\in\mathbb{R}^n$ is the measurable system state, $f(x)\in\mathbb{R}^n$ and $g_i(x)\in\mathbb{R}^n$, $i = 1,\ldots,m$, are unknown functions, $u_i(t)\in\mathbb{R}$, $i = 1,\ldots,m$, is the control input, and $d(t)\in\mathbb{R}^n$ is an exogenous disturbance. The objective is to design an estimator for the state derivative $\dot{x}(t)$ using a robust identification-based approach that adaptively identifies the uncertain dynamics.

Assumption 4.1. The functions $f(x)$ and $g_i(x)$, $i = 1,\ldots,m$, are second-order differentiable.

Assumption 4.2. The system in Eq. 4-1 is bounded-input bounded-state (BIBS) stable, i.e., $u_i(t), x(t)\in\mathcal{L}_\infty$, $i = 1,\ldots,m$. Also, $u_i(t)$ is second-order differentiable, and $\dot{u}_i(t), \ddot{u}_i(t)\in\mathcal{L}_\infty$, $i = 1,\ldots,m$.

Assumption 4.3. The disturbance $d(t)$ is second-order differentiable, and $d(t), \dot{d}(t), \ddot{d}(t)\in\mathcal{L}_\infty$.

Assumption 4.4. Given a continuous function $F: S\to\mathbb{R}^n$, where $S$ is a compact simply connected set, there exist ideal weights such that the output of the NN, denoted $\hat{F}(\cdot)$, approximates $F(\cdot)$ to an arbitrary accuracy [15].

Remark 4.1. Assumptions 4.1-4.3 indicate that the technique developed in this work is only applicable to sufficiently smooth systems (i.e., at least second-order differentiable) that are BIBS stable. The requirement that the disturbance is $C^2$ can be restrictive. For example, random noise does not satisfy this assumption; however, simulations with added noise show robustness to such disturbances as well. Assumption 4.4 states the universal approximation property of NNs, which is proved for sigmoidal activation functions in [15]. Since $x(t)$ is assumed to be bounded (Assumption 4.2), the functions $f(x)$ and $g(x)$ can be defined on a compact set; hence, the NN universal approximation property (Assumption 4.4) holds.

Using Assumption 4.4, the dynamic system in Eq. 4-1 can be represented by replacing the unknown functions with multi-layer NNs as

$\dot{x} = W_f^T\sigma_f(V_f^Tx) + \varepsilon_f(x) + \sum_{i=1}^m\big[W_{gi}^T\sigma_{gi}(V_{gi}^Tx) + \varepsilon_{gi}(x)\big]u_i + d, \qquad (4-2)$

where $W_f\in\mathbb{R}^{(L_f+1)\times n}$, $V_f\in\mathbb{R}^{n\times L_f}$, $W_{gi}\in\mathbb{R}^{(L_{gi}+1)\times n}$, $V_{gi}\in\mathbb{R}^{n\times L_{gi}}$, $i = 1,\ldots,m$, are the unknown ideal NN weights, $\sigma_f(V_f^Tx)\in\mathbb{R}^{L_f+1}$ and $\sigma_{gi}(V_{gi}^Tx)\in\mathbb{R}^{L_{gi}+1}$ are the NN activation functions, and $\varepsilon_f(x)\in\mathbb{R}^n$ and $\varepsilon_{gi}(x)\in\mathbb{R}^n$ are the function reconstruction errors.

Assumption 4.5. The ideal weights are bounded by known positive constants [18], i.e., $\|W_f\|_F\le\bar{W}_f$, $\|V_f\|_F\le\bar{V}_f$, $\|W_{gi}\|_F\le\bar{W}_g$, and $\|V_{gi}\|_F\le\bar{V}_g$, $\forall i$.

Assumption 4.6. The activation functions $\sigma_f(\cdot)$ and $\sigma_{gi}(\cdot)$, and their derivatives with respect to their arguments, $\sigma_f'(\cdot)$, $\sigma_{gi}'(\cdot)$, $\sigma_f''(\cdot)$, $\sigma_{gi}''(\cdot)$, are bounded with known bounds (e.g., sigmoidal and hyperbolic tangent activation functions).

Assumption 4.7. The function reconstruction errors $\varepsilon_f(\cdot)$ and $\varepsilon_{gi}(\cdot)$, and their derivatives with respect to their arguments, $\varepsilon_f'(\cdot)$, $\varepsilon_{gi}'(\cdot)$, $\varepsilon_f''(\cdot)$, $\varepsilon_{gi}''(\cdot)$, are bounded with known bounds [18].

The following multi-layer dynamic neural network (MLDNN) identifier is proposed to identify the system in Eq. 4-2 and estimate the state derivative

$\dot{\hat{x}} = \hat{W}_f^T\hat{\sigma}_f + \sum_{i=1}^m\hat{W}_{gi}^T\hat{\sigma}_{gi}u_i + \mu, \qquad (4-3)$

where $\hat{x}(t)\in\mathbb{R}^n$ is the identifier state, $\hat{W}_f(t)\in\mathbb{R}^{(L_f+1)\times n}$, $\hat{V}_f(t)\in\mathbb{R}^{n\times L_f}$, $\hat{W}_{gi}(t)\in\mathbb{R}^{(L_{gi}+1)\times n}$, $\hat{V}_{gi}(t)\in\mathbb{R}^{n\times L_{gi}}$, $i = 1,\ldots,m$, are the weight estimates, $\hat{\sigma}_f \triangleq \sigma_f(\hat{V}_f^T\hat{x})\in\mathbb{R}^{L_f+1}$, $\hat{\sigma}_{gi} \triangleq \sigma_{gi}(\hat{V}_{gi}^T\hat{x})\in\mathbb{R}^{L_{gi}+1}$, $i = 1,\ldots,m$, and $\mu(t)\in\mathbb{R}^n$ denotes the RISE feedback term defined as [47, 71]

$\mu \triangleq k\tilde{x}(t) - k\tilde{x}(0) + v,$

where $\tilde{x}(t) \triangleq x(t) - \hat{x}(t)\in\mathbb{R}^n$ is the identification error, and $v(t)\in\mathbb{R}^n$ is the generalized solution (in Filippov's sense [74]) to

$\dot{v} = (k\alpha + \gamma)\tilde{x} + \beta_1\,\mathrm{sgn}(\tilde{x}), \qquad v(0) = 0,$

where $k, \alpha, \gamma, \beta_1\in\mathbb{R}$ are positive constant control gains, and $\mathrm{sgn}(\cdot)$ denotes a vector signum function.

Remark 4.2. The DNN-based system identifiers in the literature [87-91] typically do not include a feedback term based on the identification error, except in results such as [92-94], where a high gain proportional feedback term is used to guarantee bounded stability. The novel use of the RISE feedback term $\mu(t)$ in Eq. 4-3 ensures asymptotic regulation of the identification error in the presence of disturbances and NN function approximation errors.
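A minimal sketch of one integration step of the identifier in Eq. 4-3 is given below for the single-input case (m = 1). The tanh activations, the forward-Euler discretization, and the omission of the weight projection are simplifying assumptions.

```python
import numpy as np

def identifier_step(x, xhat, v, xtilde0, W_f, V_f, W_g, V_g, u, gains, dt):
    """One forward-Euler step of the MLDNN identifier of Eq. 4-3 (single input).

    The RISE term is mu = k*(xtilde - xtilde(0)) + v, with v integrating
    (k*alpha + gamma)*xtilde + beta1*sgn(xtilde) as in the text."""
    k, alpha, gamma, beta1 = gains
    xtilde = x - xhat
    mu = k * (xtilde - xtilde0) + v
    sig_f = np.tanh(V_f.T @ xhat)           # sigma_f(V_f^T xhat)
    sig_g = np.tanh(V_g.T @ xhat)           # sigma_g(V_g^T xhat)
    xhat_dot = W_f.T @ sig_f + (W_g.T @ sig_g) * u + mu   # Eq. 4-3
    v_dot = (k * alpha + gamma) * xtilde + beta1 * np.sign(xtilde)
    return xhat + dt * xhat_dot, v + dt * v_dot
```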

The identification error dynamics can be written as

$\dot{\tilde{x}} = W_f^T\sigma_f - \hat{W}_f^T\hat{\sigma}_f + \sum_{i=1}^m\big[(W_{gi}^T\sigma_{gi} - \hat{W}_{gi}^T\hat{\sigma}_{gi}) + \varepsilon_{gi}(x)\big]u_i + \varepsilon_f(x) + d - \mu. \qquad (4-4)$

A filtered identification error is defined as

$r \triangleq \dot{\tilde{x}} + \alpha\tilde{x}. \qquad (4-5)$

Taking the time derivative of Eq. 4-5 and using Eq. 4-4 yields

$\dot{r} = W_f^T\sigma_f'V_f^T\dot{x} - \dot{\hat{W}}_f^T\hat{\sigma}_f - \hat{W}_f^T\hat{\sigma}_f'\dot{\hat{V}}_f^T\hat{x} - \hat{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}} + \sum_{i=1}^m(W_{gi}^T\sigma_{gi} - \hat{W}_{gi}^T\hat{\sigma}_{gi})\dot{u}_i + \sum_{i=1}^m\big[W_{gi}^T\sigma_{gi}'V_{gi}^T\dot{x}u_i - \dot{\hat{W}}_{gi}^T\hat{\sigma}_{gi}u_i - \hat{W}_{gi}^T\hat{\sigma}_{gi}'\dot{\hat{V}}_{gi}^T\hat{x}u_i - \hat{W}_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{\hat{x}}u_i\big] + \sum_{i=1}^m\big[\dot{\varepsilon}_{gi}(x)u_i + \varepsilon_{gi}(x)\dot{u}_i\big] + \dot{\varepsilon}_f(x) + \dot{d} - kr - \gamma\tilde{x} - \beta_1\,\mathrm{sgn}(\tilde{x}) + \alpha\dot{\tilde{x}}. \qquad (4-6)$

The weight update laws for the DNN in Eq. 4-3 are developed based on the subsequent stability analysis as

$\dot{\hat{W}}_f = \mathrm{proj}(\Gamma_{wf}\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}}\tilde{x}^T), \qquad \dot{\hat{V}}_f = \mathrm{proj}(\Gamma_{vf}\dot{\hat{x}}\tilde{x}^T\hat{W}_f^T\hat{\sigma}_f'),$
$\dot{\hat{W}}_{gi} = \mathrm{proj}(\Gamma_{wgi}\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{\hat{x}}u_i\tilde{x}^T), \qquad \dot{\hat{V}}_{gi} = \mathrm{proj}(\Gamma_{vgi}\dot{\hat{x}}u_i\tilde{x}^T\hat{W}_{gi}^T\hat{\sigma}_{gi}'), \qquad i = 1,\ldots,m, \qquad (4-7)$

where $\mathrm{proj}(\cdot)$ is a smooth projection operator, and $\Gamma_{wf}\in\mathbb{R}^{(L_f+1)\times(L_f+1)}$, $\Gamma_{vf}\in\mathbb{R}^{n\times n}$, $\Gamma_{wgi}\in\mathbb{R}^{(L_{gi}+1)\times(L_{gi}+1)}$, $\Gamma_{vgi}\in\mathbb{R}^{n\times n}$ are constant positive diagonal adaptation gain matrices. The space of DNN weight estimates is projected onto a compact convex set constructed using the known upper bounds of the ideal weights (Assumption 4.5). This ensures that the weight estimates are always bounded, which is exploited in the subsequent stability analysis. Any of several smooth projection algorithms may be used ([72], [73]).
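One simple realization of such a projection, which smoothly removes the outward component of the adaptation law when an estimate approaches a Frobenius-norm bound, is sketched below. The cited algorithms [72], [73] differ in their details, so this is an illustrative choice rather than the one used in the experiments.

```python
import numpy as np

def proj_ball(W_dot, W, bound, eps=0.1):
    """Smooth projection sketch: keeps the weight estimate W inside a
    Frobenius-norm ball of radius bound*sqrt(1+eps).

    Inside the ball the adaptation law passes through unchanged; in the
    boundary layer its outward component is ramped off continuously."""
    nrm2 = float(np.sum(W * W))
    outward = float(np.sum(W * W_dot))   # component of W_dot pushing ||W|| up
    if nrm2 > bound**2 and outward > 0.0:
        c = min(1.0, (nrm2 - bound**2) / (eps * bound**2))  # ramps 0 -> 1
        W_dot = W_dot - c * outward * W / nrm2
    return W_dot
```

A design note: the smooth boundary layer (rather than a hard clamp) is what keeps the right-hand side of the weight dynamics locally Lipschitz, which matters for the existence arguments used later in the stability analysis.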

Adding and subtracting $\tfrac12W_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}} + \tfrac12\hat{W}_f^T\hat{\sigma}_f'V_f^T\dot{\hat{x}} + \sum_{i=1}^m\big[\tfrac12W_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{\hat{x}}u_i + \tfrac12\hat{W}_{gi}^T\hat{\sigma}_{gi}'V_{gi}^T\dot{\hat{x}}u_i\big]$, and grouping similar terms, the expression in Eq. 4-6 can be rewritten as

$\dot{r} = \tilde{N} + N_{B1} + \hat{N}_{B2} - kr - \gamma\tilde{x} - \beta_1\,\mathrm{sgn}(\tilde{x}), \qquad (4-8)$

where the auxiliary signals $\tilde{N}(x,\tilde{x},r,\hat{W}_f,\hat{V}_f,\hat{W}_{gi},\hat{V}_{gi},t)$, $N_{B1}(x,\hat{x},\hat{W}_f,\hat{V}_f,\hat{W}_{gi},\hat{V}_{gi},t)$, and $\hat{N}_{B2}(\hat{x},\dot{\hat{x}},\hat{W}_f,\hat{V}_f,\hat{W}_{gi},\hat{V}_{gi},t)\in\mathbb{R}^n$ in Eq. 4-8 are defined as

$\tilde{N} \triangleq \alpha\dot{\tilde{x}} - \dot{\hat{W}}_f^T\hat{\sigma}_f - \hat{W}_f^T\hat{\sigma}_f'\dot{\hat{V}}_f^T\hat{x} + \tfrac12W_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\tilde{x}} + \tfrac12\hat{W}_f^T\hat{\sigma}_f'V_f^T\dot{\tilde{x}} - \sum_{i=1}^m\big[\dot{\hat{W}}_{gi}^T\hat{\sigma}_{gi}u_i + \hat{W}_{gi}^T\hat{\sigma}_{gi}'\dot{\hat{V}}_{gi}^T\hat{x}u_i - \tfrac12\hat{W}_{gi}^T\hat{\sigma}_{gi}'V_{gi}^T\dot{\tilde{x}}u_i - \tfrac12W_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{\tilde{x}}u_i\big], \qquad (4-9)$

$N_{B1} \triangleq \sum_{i=1}^m\big[W_{gi}^T\sigma_{gi}\dot{u}_i + W_{gi}^T\sigma_{gi}'V_{gi}^T\dot{x}u_i + \dot{\varepsilon}_{gi}(x)u_i + \varepsilon_{gi}(x)\dot{u}_i\big] + W_f^T\sigma_f'V_f^T\dot{x} + \dot{\varepsilon}_f(x) + \dot{d} - \sum_{i=1}^m\big[\tfrac12\hat{W}_{gi}^T\hat{\sigma}_{gi}'V_{gi}^T\dot{x}u_i + \tfrac12W_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{x}u_i + \hat{W}_{gi}^T\hat{\sigma}_{gi}\dot{u}_i\big] - \tfrac12W_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{x} - \tfrac12\hat{W}_f^T\hat{\sigma}_f'V_f^T\dot{x}, \qquad (4-10)$

$\hat{N}_{B2} \triangleq \sum_{i=1}^m\big[\tfrac12\tilde{W}_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{\hat{x}}u_i + \tfrac12\hat{W}_{gi}^T\hat{\sigma}_{gi}'\tilde{V}_{gi}^T\dot{\hat{x}}u_i\big] + \tfrac12\tilde{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}} + \tfrac12\hat{W}_f^T\hat{\sigma}_f'\tilde{V}_f^T\dot{\hat{x}}. \qquad (4-11)$

To facilitate the subsequent stability analysis, an auxiliary term $N_{B2}(\hat{x},\dot{x},\hat{W}_f,\hat{V}_f,\hat{W}_{gi},\hat{V}_{gi},t)\in\mathbb{R}^n$ is defined by replacing $\dot{\hat{x}}(t)$ in $\hat{N}_{B2}(\cdot)$ by $\dot{x}(t)$, and $\tilde{N}_{B2}(\hat{x},\dot{\tilde{x}},\hat{W}_f,\hat{V}_f,\hat{W}_{gi},\hat{V}_{gi},t) \triangleq \hat{N}_{B2}(\cdot) - N_{B2}(\cdot)$. The terms $N_{B1}(\cdot)$ and $N_{B2}(\cdot)$ are grouped as $N_B \triangleq N_{B1} + N_{B2}$. Using Assumptions 4.2, 4.5, and 4.7, Eq. 4-5, and Eq. 4-7, the following bound can be obtained for Eq. 4-9

$\|\tilde{N}\| \le \rho_1(\|z\|)\|z\|, \qquad (4-12)$

where $z \triangleq [\tilde{x}^T\ r^T]^T\in\mathbb{R}^{2n}$, and $\rho_1(\cdot)\in\mathbb{R}$ is a positive, globally invertible, non-decreasing function. The following bounds can be developed based on Eq. 4-2, Assumptions 4.2, 4.3, 4.5, and 4.7, Eq. 4-7, Eq. 4-10, and Eq. 4-11

$\|N_{B1}\| \le \zeta_1, \qquad \|N_{B2}\| \le \zeta_2, \qquad \|\dot{N}_B\| \le \zeta_3 + \zeta_4\rho_2(\|z\|)\|z\|, \qquad (4-13)$
$\|\tilde{x}^T\tilde{N}_{B2}\| \le \zeta_5\|\tilde{x}\|^2 + \zeta_6\|r\|^2, \qquad (4-14)$

where $\zeta_i\in\mathbb{R}$, $i = 1,\ldots,6$, are computable positive constants, and $\rho_2(\cdot)\in\mathbb{R}$ is a positive, globally invertible, non-decreasing function.

To facilitate the subsequent stability analysis, let $\mathcal{D}\subset\mathbb{R}^{2n+2}$ be a domain containing $y(t) = 0$, where $y(t)\in\mathbb{R}^{2n+2}$ is defined as

$y \triangleq \big[\tilde{x}^T\ r^T\ \sqrt{P}\ \sqrt{Q}\big]^T, \qquad (4-15)$

where the auxiliary function $P(z,t)\in\mathbb{R}$ is the generalized solution (in Filippov's sense) to the differential equation

$\dot{P} = -L, \qquad P(0) = \beta_1\sum_{i=1}^n|\tilde{x}_i(0)| - \tilde{x}^T(0)N_B(0), \qquad (4-16)$

where the auxiliary function $L(z,t)\in\mathbb{R}$ is defined as

$L \triangleq r^T(N_{B1} - \beta_1\,\mathrm{sgn}(\tilde{x})) + \dot{\tilde{x}}^TN_{B2} - \beta_2\rho_2(\|z\|)\|z\|\|\tilde{x}\|, \qquad (4-17)$

where $\beta_1, \beta_2\in\mathbb{R}$ are selected according to the following sufficient conditions

$\beta_1 > \max\big(\zeta_1 + \zeta_2,\ \zeta_1 + \tfrac{\zeta_3}{\alpha}\big), \qquad \beta_2 > \zeta_4, \qquad (4-18)$

to ensure that $P(t)\ge0$. (The derivation of the sufficient conditions in Eq. 4-18 is provided in the Appendix.) The auxiliary function $Q(\tilde{W}_f,\tilde{V}_f,\tilde{W}_{gi},\tilde{V}_{gi})\in\mathbb{R}$ in Eq. 4-15 is defined as

$Q \triangleq \tfrac14\Big[\mathrm{tr}(\tilde{W}_f^T\Gamma_{wf}^{-1}\tilde{W}_f) + \mathrm{tr}(\tilde{V}_f^T\Gamma_{vf}^{-1}\tilde{V}_f) + \sum_{i=1}^m\big(\mathrm{tr}(\tilde{W}_{gi}^T\Gamma_{wgi}^{-1}\tilde{W}_{gi}) + \mathrm{tr}(\tilde{V}_{gi}^T\Gamma_{vgi}^{-1}\tilde{V}_{gi})\big)\Big], \qquad (4-19)$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix.

Theorem 4.1. The identifier developed in Eq. 4-3 along with its weight update laws in Eq. 4-7 ensures asymptotic convergence, in the sense that

$\lim_{t\to\infty}\|\tilde{x}(t)\| = 0 \quad\text{and}\quad \lim_{t\to\infty}\|\dot{\tilde{x}}(t)\| = 0,$

provided the control gains $k$ and $\gamma$ are selected sufficiently large based on the initial conditions of the states (see the subsequent stability analysis) and satisfy the following sufficient conditions

$\gamma > \zeta_5, \qquad k > \zeta_6, \qquad (4-20)$

where $\zeta_5$ and $\zeta_6$ are introduced in Eq. 4-14, and $\beta_1$ and $\beta_2$ are selected according to the sufficient conditions in Eq. 4-18.

Proof. Let $V: \mathcal{D}\to\mathbb{R}$ be a Lipschitz continuous regular positive definite function defined as

$V \triangleq \tfrac12r^Tr + \tfrac12\gamma\tilde{x}^T\tilde{x} + P + Q, \qquad (4-21)$

which satisfies the following inequalities:

$U_1(y) \le V(y) \le U_2(y), \qquad (4-22)$

where $U_1(y), U_2(y)\in\mathbb{R}$ are continuous positive definite functions defined as $U_1 \triangleq \tfrac12\min(1,\gamma)\|y\|^2$ and $U_2 \triangleq \max(1,\gamma)\|y\|^2$, respectively.

Let $\dot{y} = F(y,t)$ represent the closed-loop differential equations in Eqs. 4-4, 4-7, 4-8, and 4-16, where $F(\cdot)\in\mathbb{R}^{2n+2}$ denotes the right-hand side of the closed-loop error signals. Since $F(y,t)$ is discontinuous in the set $\{(y,t)\,|\,\tilde{x} = 0\}$, the existence and stability of solutions cannot be studied in the classical sense. Using the differential inclusion $\dot{y}\in K[F](y,t)$, where $y$ is absolutely continuous and $F(\cdot)$ is Lebesgue measurable and locally bounded, existence and uniqueness of solutions can be established in Filippov's sense (see [74-76] and Appendix A.2 for further details). The stability of solutions based on the differential inclusion is studied using non-smooth Lyapunov functions, following the development in [79, 80]. The generalized time derivative of Eq. 4-21 exists almost everywhere (a.e.), and $\dot{V}(y)\overset{a.e.}{\in}\dot{\tilde{V}}(y)$, where

$\dot{\tilde{V}} = \bigcap_{\xi\in\partial V(y)}\xi^TK\big[\,\dot{r}^T\ \dot{\tilde{x}}^T\ \tfrac12P^{-1/2}\dot{P}\ \tfrac12Q^{-1/2}\dot{Q}\ 1\,\big]^T, \qquad (4-23)$

where $\partial V$ is the generalized gradient of $V$ [78], and $K[\cdot]$ is defined in Eq. 3-49. Since $V(y)$ is a Lipschitz continuous regular function, Eq. 4-23 can be simplified as [79]

$\dot{\tilde{V}} = \nabla V^TK\big[\,\dot{r}^T\ \dot{\tilde{x}}^T\ \tfrac12P^{-1/2}\dot{P}\ \tfrac12Q^{-1/2}\dot{Q}\ 1\,\big]^T = \big[\,r^T\ \gamma\tilde{x}^T\ 2P^{1/2}\ 2Q^{1/2}\ 0\,\big]K\big[\,\dot{r}^T\ \dot{\tilde{x}}^T\ \tfrac12P^{-1/2}\dot{P}\ \tfrac12Q^{-1/2}\dot{Q}\ 1\,\big]^T.$

Using the calculus for $K[\cdot]$ from [80] (Theorem 1, Properties 2, 5, 7), substituting the dynamics from Eqs. 4-8 and 4-16, substituting the weight update laws from Eq. 4-7, using $K[\mathrm{sgn}(\tilde{x})] = SGN(\tilde{x})$ [80] and the fact that $(r^T - r^T)_i\,SGN(\tilde{x}_i) = 0$ (the subscript $i$ denotes the $i$-th element), where $SGN(\tilde{x}_i) = 1$ if $\tilde{x}_i > 0$, $[-1,1]$ if $\tilde{x}_i = 0$, and $-1$ if $\tilde{x}_i < 0$, and canceling common terms yields

$\dot{\tilde{V}} \le -\gamma\alpha\|\tilde{x}\|^2 - k\|r\|^2 + r^T\tilde{N} + \tilde{x}^T\tilde{N}_{B2} + \beta_2\rho_2(\|z\|)\|z\|\|\tilde{x}\|. \qquad (4-24)$

Substituting $k \triangleq k_1 + k_2$ and $\gamma \triangleq \gamma_1 + \gamma_2$, using Eqs. 4-12 and 4-14, and completing the squares, the expression in Eq. 4-24 can be upper bounded as

$\dot{\tilde{V}} \le -(\gamma_1 - \zeta_5)\|\tilde{x}\|^2 - (k_1 - \zeta_6)\|r\|^2 + \frac{\rho_1(\|z\|)^2}{4k_2}\|z\|^2 + \frac{\beta_2^2\rho_2(\|z\|)^2}{4\gamma_2}\|z\|^2. \qquad (4-25)$

Provided the sufficient conditions in Eq. 4-20 are satisfied, the expression in Eq. 4-25 can be rewritten as

$\dot{\tilde{V}} \le -\lambda\|z\|^2 + \frac{\rho(\|z\|)^2}{4\eta}\|z\|^2 \le -U(y) \quad \forall y\in\mathcal{D}, \qquad (4-26)$

where $\lambda \triangleq \min\{\gamma_1 - \zeta_5,\ k_1 - \zeta_6\}$, $\eta \triangleq \min\{k_2,\ \gamma_2/\beta_2^2\}$, $\rho(\|z\|)^2 \triangleq \rho_1(\|z\|)^2 + \rho_2(\|z\|)^2$ is a positive, globally invertible, non-decreasing function, and $U(y) = c\|z\|^2$, for some positive constant $c$, is a continuous positive semi-definite function defined on the domain $\mathcal{D} \triangleq \{y(t)\in\mathbb{R}^{2n+2}\ |\ \|y\|\le\rho^{-1}(2\sqrt{\lambda\eta})\}$. The size of the domain $\mathcal{D}$ can be increased by increasing the gains $k$ and $\gamma$. The result in Eq. 4-26 indicates that $\dot{V}(y)\le -U(y)$ for all $\dot{V}(y)\overset{a.e.}{\in}\dot{\tilde{V}}(y)$ and all $y\in\mathcal{D}$. The inequalities in Eq. 4-22 and Eq. 4-26 can be used to show that $V(y)\in\mathcal{L}_\infty$ in $\mathcal{D}$; hence, $\tilde{x}(t), r(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. Using Eq. 4-5, standard linear analysis can be used to show that $\dot{\tilde{x}}(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. Since $\dot{x}(t)\in\mathcal{L}_\infty$ from Eq. 4-1 and Assumptions 4.2-4.3, $\dot{\hat{x}}(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. From the use of projection in Eq. 4-7, $\hat{W}_f(t), \hat{W}_{gi}(t)\in\mathcal{L}_\infty$, $i = 1,\ldots,m$. Using the above bounding arguments, it can be shown from Eq. 4-8 that $\dot{r}(t)\in\mathcal{L}_\infty$ in $\mathcal{D}$. Since $\tilde{x}(t), r(t)\in\mathcal{L}_\infty$, the definition of $U(y)$ can be used to show that it is uniformly continuous in $\mathcal{D}$. Let $\mathcal{S}\subset\mathcal{D}$ denote a set defined as $\mathcal{S} \triangleq \{y(t)\subset\mathcal{D}\ |\ U_2(y(t)) < \tfrac12(\rho^{-1}(2\sqrt{\lambda\eta}))^2\}$, where the region of attraction can be made arbitrarily large to include any initial conditions by increasing the control gains (i.e., a semi-global type of stability result), and hence $c\|z\|^2\to0$ as $t\to\infty$ for all $y(0)\in\mathcal{S}$. Using the definition of $z(t)$, it can be shown that $\|\tilde{x}(t)\|, \|\dot{\tilde{x}}(t)\|, \|r\|\to0$ as $t\to\infty$ for all $y(0)\in\mathcal{S}$.

4.2 Comparison with Related Work

The most common approach to estimating derivatives is numerical differentiation. The Euler backward difference is one of the simplest and most common numerical methods to differentiate a signal; however, this ad hoc approach yields erroneous results in the presence of sensor noise. The central difference algorithm performs better than backward difference, but it is non-causal, since it requires future state values to estimate the current derivative. Noise attenuation in numerical differentiators may be achieved by using a low-pass filter, at the cost of introducing a phase delay in the system. A more analytically rigorous approach is to cast the problem of state derivative estimation as an observer design problem by augmenting the state with its derivative, where the state is fully measurable and the state derivative is not, thereby reducing the problem to designing an observer for the unmeasurable state derivative. Previous approaches to this problem use purely robust feedback methods requiring infinite gain or infinite frequency [95-97]. A high gain observer is presented in [96] to estimate the output derivatives, and asymptotic convergence to the derivative is achieved as the gain tends to infinity, which is problematic in general and especially in the presence of noise. In [97], a robust exact differentiator using a 2-sliding mode algorithm is developed, which assumes a known upper bound for a Lipschitz constant of the derivative. All of the above-mentioned methods are robust, non-model-based approaches. In contrast to purely robust feedback methods, an identification-based robust adaptive approach is considered in this work. The proposed identifier consists of a dynamic neural network (DNN) [87, 88, 90, 98] and a RISE (Robust Integral of the Sign of the Error) term [47, 71], where the DNN adaptively identifies the unknown system dynamics online, while RISE, a continuous robust feedback term, is used to guarantee asymptotic convergence to the state derivative in the presence of uncertainties and exogenous disturbances. The DNN, with its recurrent feedback connections, has been shown to learn the dynamics of high-dimensional uncertain nonlinear systems with arbitrary accuracy [98, 99], motivating its use in the proposed identifier.

Unlike most previous results on DNN-based system identification [88-91, 94], which only guarantee bounded stability of the identification error system in the presence of DNN approximation errors and exogenous disturbances, the addition of RISE to the DNN identifier guarantees asymptotic identification. The RISE structure combines the features of the high gain observer and higher-order sliding mode methods, in the sense that it consists of high gain proportional and integral state feedback terms (similar to a high gain observer) and the integral of a signum term, allowing it to implicitly learn and cancel the effects of DNN approximation errors and exogenous disturbances in the Lyapunov stability analysis, guaranteeing asymptotic convergence.

4.3 Experiment and Simulation Results

Experiments and simulations on a two-link robot manipulator (Fig. 3-2) are performed to compare the proposed method with several other derivative estimation methods. The following robot dynamics are considered:

$M(q)\ddot{q} + V_m(q,\dot{q})\dot{q} + F_d\dot{q} + F_s(\dot{q}) = u(t), \qquad (4-27)$

where $q(t) = [q_1\ q_2]^T$ and $\dot{q}(t) = [\dot{q}_1\ \dot{q}_2]^T$ are the angular positions (rad) and angular velocities (rad/sec) of the two links, respectively, $M(q)$ is the inertia matrix, and $V_m(q,\dot{q})$ is the centripetal-Coriolis matrix, defined as

$M \triangleq \begin{bmatrix} p_1 + 2p_3c_2 & p_2 + p_3c_2 \\ p_2 + p_3c_2 & p_2 \end{bmatrix}, \qquad V_m \triangleq \begin{bmatrix} -p_3s_2\dot{q}_2 & -p_3s_2(\dot{q}_1 + \dot{q}_2) \\ p_3s_2\dot{q}_1 & 0 \end{bmatrix},$

where $p_1 = 3.473$ kg·m², $p_2 = 0.196$ kg·m², $p_3 = 0.242$ kg·m², $c_2 = \cos(q_2)$, $s_2 = \sin(q_2)$, and $F_d = \mathrm{diag}\{5.3, 1.1\}$ N·m·sec and $F_s(\dot{q}) = \mathrm{diag}\{8.45\tanh(\dot{q}_1), 2.35\tanh(\dot{q}_2)\}$ N·m are the models for dynamic and static friction, respectively.
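The model of Eq. 4-27 with the parameter values above can be evaluated as in the following sketch, which is useful for reproducing the simulation study. The integration scheme is left to the caller, and the disturbance is applied at the torque level here for simplicity, whereas the chapter adds it to all four states.

```python
import numpy as np

P1, P2, P3 = 3.473, 0.196, 0.242           # inertia parameters (kg m^2), Eq. 4-27
FD = np.diag([5.3, 1.1])                    # dynamic friction (N m sec)

def robot_fields(q, qdot):
    """Inertia, Coriolis, and friction terms of the two-link model, Eq. 4-27."""
    c2, s2 = np.cos(q[1]), np.sin(q[1])
    M = np.array([[P1 + 2*P3*c2, P2 + P3*c2],
                  [P2 + P3*c2,   P2        ]])
    Vm = np.array([[-P3*s2*qdot[1], -P3*s2*(qdot[0] + qdot[1])],
                   [ P3*s2*qdot[0],  0.0                      ]])
    Fs = np.array([8.45*np.tanh(qdot[0]), 2.35*np.tanh(qdot[1])])  # static friction
    return M, Vm, Fs

def xdot(t, x, u):
    """State-space form xdot = f(x) + g(x)u + d used in the simulations."""
    q, qdot = x[:2], x[2:]
    M, Vm, Fs = robot_fields(q, qdot)
    d = 0.1 * np.sin(10*t) * np.ones(2)     # disturbance (torque level, simplified)
    qddot = np.linalg.solve(M, u + d - Vm @ qdot - FD @ qdot - Fs)
    return np.concatenate([qdot, qddot])
```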

The robot model in Eq. 4-27 can be expressed as $\dot{x} = f(x) + g(x)u + d$, where the state $x(t)\in\mathbb{R}^4$ is defined as $x(t) \triangleq [q_1\ q_2\ \dot{q}_1\ \dot{q}_2]^T$, $d(t) \triangleq 0.1\sin(10t)[1\ 1\ 1\ 1]^T$ is an exogenous disturbance, and $f(x)\in\mathbb{R}^4$ and $g(x)\in\mathbb{R}^{4\times2}$ are defined as $f(x) \triangleq \big[\dot{q}^T\ \{-M^{-1}[(V_m + F_d)\dot{q} + F_s]\}^T\big]^T$ and $g(x) = [0_{2\times2}\ M^{-1}]^T$, respectively. The control input is designed as a PD controller to track the desired trajectory $q_d(t) = [0.5\sin(2t)\ 0.5\cos(2t)]^T$, as

$u(t) = -2\big[q_1(t) - 0.5\sin(2t)\quad q_2(t) - 0.5\cos(2t)\big]^T - \big[\dot{q}_1(t) - \cos(2t)\quad \dot{q}_2(t) + \sin(2t)\big]^T.$

The objective is to design a state derivative estimator $\dot{\hat{x}}(t)$ that asymptotically converges to $\dot{x}(t)$. The performance of the developed RISE-based DNN identifier in Eqs. 4-3 and 4-7 is compared with the 2-sliding mode robust exact differentiator [97]

$\dot{\hat{x}} = z_s + \lambda_s\sqrt{|\tilde{x}|}\,\mathrm{sgn}(\tilde{x}), \qquad \dot{z}_s = \alpha_s\,\mathrm{sgn}(\tilde{x}), \qquad (4-28)$

and the high gain observer [96]

$\dot{\hat{x}} = z_h + \frac{\alpha_{h1}}{\varepsilon_{h1}}\tilde{x}, \qquad \dot{z}_h = \frac{\alpha_{h2}}{\varepsilon_{h2}}\tilde{x}. \qquad (4-29)$

The motor encoders in Fig. 3-2 provide position measurements for the two links ($x_1(t)$ and $x_2(t)$) with a resolution of 614,400 pulses/revolution, and a standard backwards difference algorithm is used to numerically determine the angular velocities ($x_3(t)$ and $x_4(t)$) from the encoder readings. The experimental results for the state derivative estimates with the 2-sliding mode, the high gain observer, the proposed method, and the backward difference algorithm are shown in Fig. 4-1. Because velocity and acceleration sensors were not available to verify the state derivative estimates, no quantitative performance comparisons could be made. However, a few observations can be made from Fig. 4-1: the steady-state estimates of the state derivative for the 2-sliding mode, the high gain observer, and the proposed method look similar, while the transient response of the 2-sliding mode differs from that of the high gain observer and the proposed method. On the other hand, the state derivative estimate with backward difference is very noisy and does not resemble the response of any of the other methods. The experimental results demonstrate that the performance of the proposed identifier-based state derivative estimator is comparable to existing methods in the literature, and that the estimates from backward difference are prone to error in the presence of sensor noise.
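For reference, the two benchmark estimators of Eqs. 4-28 and 4-29 can be implemented elementwise as in the following sketch; the default gain values are those used in the simulation study below, and the forward-Euler stepping is an assumption.

```python
import numpy as np

def sliding_mode_step(xhat, z_s, x, lam_s=4.1, alp_s=4.0, dt=1e-3):
    """2-sliding mode robust exact differentiator, Eq. 4-28 (elementwise).
    xhat_dot is the state derivative estimate."""
    e = x - xhat
    xhat_dot = z_s + lam_s * np.sqrt(np.abs(e)) * np.sign(e)
    z_s_dot = alp_s * np.sign(e)
    return xhat + dt * xhat_dot, z_s + dt * z_s_dot

def high_gain_step(xhat, z_h, x, a1=0.2, e1=0.01, a2=0.3, e2=0.001, dt=1e-3):
    """High gain observer, Eq. 4-29; xhat_dot is the derivative estimate."""
    e = x - xhat
    xhat_dot = z_h + (a1 / e1) * e
    z_h_dot = (a2 / e2) * e
    return xhat + dt * xhat_dot, z_h + dt * z_h_dot
```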

Figure 4-1. Comparison of the state derivative estimate $\dot{\hat{x}}(t)$ for (a) 2-sliding mode, (b) high gain observer, (c) proposed method, and (d) backward difference on a two-link experiment testbed.

Simulations are performed to compare, qualitatively and quantitatively, the performance of the different estimators. The gains for the identifier in Eqs. 4-3 and 4-7 are selected as $k = 20$, $\alpha = 5$, $\gamma = 200$, $\beta_1 = 1.25$, and the DNN adaptation gains are selected as $\Gamma_{wf} = 0.1I_{11\times11}$, $\Gamma_{vf} = I_{4\times4}$, $\Gamma_{wg1} = 0.7I_{4\times4}$, $\Gamma_{wg2} = 0.4I_{4\times4}$, $\Gamma_{vg1} = \Gamma_{vg2} = I_{4\times4}$, where $I_{n\times n}$ denotes an identity matrix of appropriate dimensions. The neural networks for $f(x)$ and $g(x)$ are designed to have 10 and 3 hidden layer neurons, respectively, and the DNN weights are initialized as uniformly distributed random numbers in the interval $[-1,1]$. The gains for the 2-sliding mode differentiator in Eq. 4-28 are selected as $\lambda_s = 4.1$, $\alpha_s = 4$, while the gains for the high gain observer in Eq. 4-29 are selected as $\alpha_{h1} = 0.2$, $\varepsilon_{h1} = 0.01$, $\alpha_{h2} = 0.3$, $\varepsilon_{h2} = 0.001$. To ensure a fair comparison, the gains of all three estimators were tuned for best performance (least RMS error) for the same settling time of approximately 0.4 seconds for the state derivative estimation errors. White Gaussian noise was added to the state measurements, maintaining a signal-to-noise ratio of 60 dB. The initial conditions of the system and the estimators are chosen as $x(t) = \hat{x}(t) = [1\ 1\ 1\ 1]^T$.

Figure 4-2. Comparison of the state estimation errors $\tilde{x}(t)$ for (a) 2-sliding mode, (b) high gain observer, and (c) proposed methods, in the presence of sensor noise (SNR 60 dB).

Table 4-1. Comparison of transient (t = 0-5 sec) and steady-state (t = 5-10 sec) state derivative estimation errors $\dot{\tilde{x}}(t)$ for different derivative estimation methods in the presence of noise (60 dB).

                           Backward     Central      2-sliding   High gain   Proposed
                           difference   difference   mode        observer
  Transient RMS error      14.4443      7.6307       2.3480      2.1326      1.7808
  Steady-state RMS error   14.1461      7.0583       0.1095      0.0414      0.0297

Figs. 4-2 through 4-4 show the simulation results for the state estimation and state derivative estimation errors for the 2-sliding mode robust exact differentiator in [97], the high gain observer in [96], and the developed RISE-based DNN estimator. While the maximum overshoot in estimating the state derivative (Fig. 4-3) using the 2-sliding mode is smaller, its steady-state errors are comparatively larger than those of both the high gain observer and the proposed method.

Figure 4-3. Comparison of the state derivative estimation errors $\dot{\tilde{x}}(t)$ for (a) 2-sliding mode, (b) high gain observer, and (c) proposed methods, in the presence of sensor noise (SNR 60 dB).

Figure 4-4. Comparison of the state derivative estimation errors $\dot{\tilde{x}}(t)$ at steady state for (a) 2-sliding mode, (b) high gain observer, and (c) proposed methods, in the presence of sensor noise (SNR 60 dB).

Table 4-1 gives a comparison of the transient and steady-state RMS state derivative estimation errors for the different estimation methods. Results of standard numerical differentiation algorithms, backward difference and central difference (with a step size of $10^{-4}$), are also included; as seen from Table 4-1 and Fig. 4-5, they perform significantly worse than the other methods in the presence of noise.

Figure 4-5. State derivative estimation errors $\dot{\tilde{x}}(t)$ for the numerical differentiation methods (a) backward difference and (b) central difference with a step size of $10^{-4}$, in the presence of sensor noise (SNR 60 dB).

Although the simulation results for the high gain observer and the developed method are comparable, as seen from Figs. 4-2 through 4-4 and Table 4-1, differences exist in the structure of the estimators and the proof of convergence of the estimates. The developed identifier includes the RISE structure, which combines the features of the high gain observer with the integral of a signum term, allowing it to implicitly learn and cancel terms in the stability analysis, thus guaranteeing asymptotic convergence. While singular perturbation methods can be used to prove asymptotic convergence of the high gain observer to the derivative of the output signal ($\dot{x}(t)$ in this case) as the gains tend to infinity [100], Lyapunov-based stability methods are used to prove asymptotic convergence of the proposed identifier (as $t\to\infty$) with finite gains. Further, while both the high gain observer and the 2-sliding mode robust exact differentiator are purely robust feedback methods, the developed method, in addition to using a robust RISE feedback term, uses a DNN to adaptively identify the system dynamics.

4.4 Summary

A robust identifier is developed for online estimation of the state derivative of uncertain nonlinear systems in the presence of exogenous disturbances. The result differs from existing purely robust methods in that the proposed method combines a DNN system identifier with a robust RISE feedback to ensure asymptotic convergence to the state derivative, which is proven using a Lyapunov-based stability analysis. Simulation results in the presence of noise show improved transient and steady-state performance of the developed identifier in comparison to several other derivative estimation methods.

CHAPTER 5
AN ACTOR-CRITIC-IDENTIFIER ARCHITECTURE FOR APPROXIMATE OPTIMAL CONTROL OF UNCERTAIN NONLINEAR SYSTEMS

RL uses evaluative feedback from the environment to take appropriate actions [101]. One of the most widely used architectures to implement RL algorithms is the AC architecture, where an actor performs actions by interacting with its environment, and a critic evaluates the actions and gives feedback to the actor, leading to improvement in the performance of subsequent actions [4, 19, 101]. AC algorithms are pervasive in machine learning and are used to learn the optimal policy online for finite-space discrete-time Markov decision problems [6, 17, 19, 101, 102]. The objective of this chapter is to append an identifier structure to the standard AC architecture, called the actor-critic-identifier (ACI), which solves the continuous-time optimal control problem for nonlinear systems without requiring complete knowledge of the system dynamics.

5.1 Actor-Critic-Identifier Architecture for HJB Approximation

Consider a continuous-time nonlinear system

$\dot{x} = F(x,u),$

where $x(t)\in\mathcal{X}\subseteq\mathbb{R}^n$, $u(t)\in\mathcal{U}\subseteq\mathbb{R}^m$ is the control input, and $F:\mathcal{X}\times\mathcal{U}\to\mathbb{R}^n$ is Lipschitz continuous on $\mathcal{X}\times\mathcal{U}$ containing the origin, such that the solution $x(t)$ of the system is unique for any finite initial condition $x_0$ and control $u\in\mathcal{U}$. The optimal value function can be defined as

$V^*(x(t)) = \min_{u(\tau)\in\Psi(\mathcal{X}),\ t\le\tau<\infty}\ \int_t^\infty r\big(x(s), u(x(s))\big)\,ds, \qquad (5-1)$

where $\Psi(\mathcal{X})$ is a set of admissible policies, and $r(x,u)\in\mathbb{R}$ is the immediate or local cost, defined as

$r(x,u) = Q(x) + u^TRu, \qquad (5-2)$

where $Q(x)\in\mathbb{R}$ is continuously differentiable and positive definite, and $R\in\mathbb{R}^{m\times m}$ is a positive-definite symmetric matrix. For the local cost in Eq. 5-2, which is convex in the control, and control-affine dynamics of the form

$\dot{x} = f(x) + g(x)u, \qquad (5-3)$

where $f(x)\in\mathbb{R}^n$ and $g(x)\in\mathbb{R}^{n\times m}$, the closed-form expression for the optimal control is derived as [52]

$u^*(x) = -\frac12R^{-1}g^T(x)\left(\frac{\partial V^*(x)}{\partial x}\right)^T, \qquad (5-4)$

where it is assumed that the value function $V^*(x)$ is continuously differentiable and satisfies $V^*(0) = 0$. The Hamiltonian of the system in Eq. 5-3 is given by

$H(x,u,V_x) \triangleq V_xF_u + r_u,$

where $V_x \triangleq \frac{\partial V}{\partial x}\in\mathbb{R}^{1\times n}$ denotes the gradient of the value function $V(x)$, $F_u(x,u) \triangleq f(x) + g(x)u\in\mathbb{R}^n$ denotes the system dynamics with control $u(x)$, and $r_u \triangleq r(x,u)$ denotes the local cost with control $u(x)$. The optimal value function $V^*(x)$ in Eq. 5-1 and the associated optimal policy $u^*(x)$ in Eq. 5-4 satisfy the HJB equation

$H(x,u^*,V_x^*) = V_x^*F_{u^*} + r_{u^*} = 0. \qquad (5-5)$

Replacing $u^*(x)$, $V_x^*(x)$, and $F_{u^*}(x,u^*)$ in Eq. 5-5 by their approximations $\hat{u}(x)$ (actor), $\hat{V}(x)$ (critic), and $\hat{F}_{\hat{u}}(x,\hat{x},\hat{u})$ (identifier), respectively, the approximate HJB equation is given by

$\hat{H}(x,\hat{x},\hat{u},\hat{V}_x) = \hat{V}_x\hat{F}_{\hat{u}} + r_{\hat{u}}, \qquad (5-6)$

where $\hat{x}(t)$ denotes the state of the identifier.
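The closed-form control in Eq. 5-4 follows from the stationarity of the Hamiltonian in the control; the intermediate step, omitted above, is

$\frac{\partial H}{\partial u} = \frac{\partial}{\partial u}\Big[V_x^*\big(f(x) + g(x)u\big) + Q(x) + u^TRu\Big] = g^T(x)V_x^{*T} + 2Ru = 0 \quad\Longrightarrow\quad u^*(x) = -\tfrac12R^{-1}g^T(x)V_x^{*T},$

and the convexity of $r(x,u)$ in $u$ (since $R$ is positive definite) guarantees that this stationary point is the minimizer.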

Using Eqs. 5-5 and 5-6, the error between the actual and the approximate HJB equations is given by the Bellman residual error $\delta_{hjb}(x,\hat{x},\hat{u},\hat{V}_x)$, defined as

$\delta_{hjb} \triangleq \hat{H}(x,\hat{x},\hat{u},\hat{V}_x) - H(x,u^*,V_x^*). \qquad (5-7)$

Since $H(x,u^*,V_x^*) \equiv 0$, the Bellman error can be written in a measurable form as

$\delta_{hjb} = \hat{H}(x,\hat{x},\hat{u},\hat{V}_x) = \hat{V}_x\hat{F}_{\hat{u}} + r(x,\hat{u}). \qquad (5-8)$

The actor and the critic learn based on the Bellman error $\delta_{hjb}(\cdot)$, whereas the identifier estimates the system dynamics online using the identification error $\tilde{x}(t) \triangleq x(t) - \hat{x}(t)$, and hence is decoupled from the design of the actor and critic. The block diagram of the ACI architecture is shown in Fig. 5-1.

Figure 5-1. Actor-critic-identifier architecture to approximate the HJB.
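As a sketch, the measurable Bellman error of Eq. 5-8 can be evaluated as follows once the critic weights, the basis gradient, and the identifier output are available; the argument shapes and the quadratic local cost of Eq. 5-2 are the only structure assumed here.

```python
import numpy as np

def bellman_error(W_c, phi_grad, F_hat, x, u_hat, Q, R):
    """Measurable Bellman residual of Eq. 5-8: delta = Vhat_x @ Fhat + r(x, uhat).

    W_c: (N,) critic weights; phi_grad: (N, n) basis gradient dphi/dx at x;
    F_hat: (n,) identifier output; Q: callable state cost; R: (m, m) matrix."""
    V_x = W_c @ phi_grad                    # critic value-function gradient (n,)
    r = Q(x) + float(u_hat @ R @ u_hat)     # local cost, Eq. 5-2
    return float(V_x @ F_hat) + r
```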

The following assumptions are made about the control-affine system in Eq. 5-3.

Assumption 5.1. The functions $f(x)$ and $g(x)$ are second-order differentiable.

Assumption 5.2. The input gain matrix $g(x)$ is known and bounded, i.e., $0 < \|g(x)\| \le \bar{g}$, where $\bar{g}$ is a known positive constant.

Assuming the optimal control, the optimal value function, and the system dynamics are continuous and defined on compact sets, NNs can be used to approximate them [15, 103]. Some standard NN assumptions which will be used throughout this work are:

Assumption 5.3. Given a continuous function $h: S\to\mathbb{R}^n$, where $S$ is a compact simply connected set, there exist ideal weights $W, V$ such that the function can be represented by a NN as $h(x) = W^T\sigma(V^Tx) + \varepsilon(x)$, where $\sigma(\cdot)$ is the nonlinear activation function, and $\varepsilon(x)$ is the function reconstruction error.

Assumption 5.4. The ideal NN weights are bounded by known positive constants, i.e., $\|W\|\le\bar{W}$, $\|V\|\le\bar{V}$ [18].

Assumption 5.5. The NN activation function $\sigma(\cdot)$ and its derivative with respect to its arguments, $\sigma'(\cdot)$, are bounded.

Assumption 5.6. Using the NN universal approximation property [15, 103], the function reconstruction errors and their derivatives with respect to their arguments are bounded [18] as $\|\varepsilon(\cdot)\|\le\bar{\varepsilon}$, $\|\varepsilon'(\cdot)\|\le\bar{\varepsilon}'$.

5.2 Actor-Critic Design

Using Assumption 5.3 and Eq. 5-4, the optimal value function and the optimal control can be represented by NNs as

$V^*(x) = W^T\phi(x) + \varepsilon_v(x), \qquad u^*(x) = -\tfrac12R^{-1}g^T(x)\big(\phi'(x)^TW + \varepsilon_v'(x)^T\big), \qquad (5-9)$

where $W\in\mathbb{R}^N$ are unknown ideal NN weights, $N$ is the number of neurons, $\phi(x) = [\phi_1(x)\ \phi_2(x)\ \ldots\ \phi_N(x)]^T\in\mathbb{R}^N$ is a smooth NN activation function such that $\phi_i(0) = 0$ and $\phi_i'(0) = 0$, $\forall i = 1,\ldots,N$, and $\varepsilon_v(\cdot)\in\mathbb{R}$ is the function reconstruction error.

Assumption 5.7. The NN activation functions $\{\phi_i(x): i = 1,\ldots,N\}$ are selected so that as $N\to\infty$, $\phi(x)$ provides a complete independent basis for $V^*(x)$.

Using Assumption 5.7 and the Weierstrass higher-order approximation theorem, both $V^*(x)$ and $\partial V^*(x)/\partial x$ can be uniformly approximated by the NNs in Eq. 5-9, i.e., as $N\to\infty$, the approximation errors $\varepsilon_v(x), \varepsilon_v'(x)\to0$ [41]. The critic $\hat{V}(x)$ and the actor $\hat{u}(x)$ approximate the optimal value function and the optimal control in Eq. 5-9 and are given by

$\hat{V}(x) = \hat{W}_c^T\phi(x), \qquad \hat{u}(x) = -\tfrac12R^{-1}g^T(x)\phi'^T(x)\hat{W}_a, \qquad (5-10)$

where $\hat{W}_c(t)\in\mathbb{R}^N$ and $\hat{W}_a(t)\in\mathbb{R}^N$ are estimates of the ideal weights of the critic and actor NNs, respectively. The weight estimation errors for the critic and actor NNs are defined as $\tilde{W}_c(t) \triangleq W - \hat{W}_c(t)\in\mathbb{R}^N$ and $\tilde{W}_a(t) \triangleq W - \hat{W}_a(t)\in\mathbb{R}^N$, respectively.

Remark 5.1. Since the optimal control is determined using the gradient of the optimal value function in Eq. 5-9, the critic NN in Eq. 5-10 may be used to determine the actor without using another NN for the actor. However, for ease in deriving the weight update laws and in the subsequent stability analysis, separate NNs are used for the actor and the critic [14].

The actor and critic NN weights are both updated based on the minimization of the Bellman error $\delta_{hjb}(\cdot)$ in Eq. 5-8, which can be rewritten by substituting $\hat{V}(x)$ from Eq. 5-10 as

$\delta_{hjb} = \hat{W}_c^T\omega + r(x,\hat{u}), \qquad (5-11)$

where $\omega(x,\hat{x},\hat{u}) \triangleq \phi'(x)\hat{F}_{\hat{u}}(x,\hat{x},\hat{u})\in\mathbb{R}^N$ is the critic NN regressor vector.

5.2.1 Least Squares Update for the Critic

Let $E_c(\delta_{hjb})\in\mathbb{R}^+$ denote the integral squared Bellman error

$E_c = \int_0^t\delta_{hjb}^2(\tau)\,d\tau. \qquad (5-12)$

The least squares (LS) update law for the critic is generated by minimizing Eq. 5-12 as

$\frac{\partial E_c}{\partial\hat{W}_c} = 2\int_0^t\delta_{hjb}(\tau)\frac{\partial\delta_{hjb}(\tau)}{\partial\hat{W}_c(t)}\,d\tau = 0. \qquad (5-13)$

Using $\partial\delta_{hjb}/\partial\hat{W}_c = \omega^T$ from Eq. 5-11, the batch LS critic weight estimate is determined from Eq. 5-13 as [104]

$\hat{W}_c(t) = -\Big(\int_0^t\omega(\tau)\omega(\tau)^Td\tau\Big)^{-1}\int_0^t\omega(\tau)r(\tau)\,d\tau, \qquad (5-14)$

provided the inverse $\big(\int_0^t\omega(\tau)\omega(\tau)^Td\tau\big)^{-1}$ exists. For online implementation, a normalized recursive formulation of the LS algorithm is developed by taking the time derivative of Eq. 5-14 and normalizing as [104]

$\dot{\hat{W}}_c = -\eta_c\Gamma\frac{\omega}{1 + \nu\omega^T\Gamma\omega}\,\delta_{hjb}, \qquad (5-15)$

where $\nu, \eta_c\in\mathbb{R}$ are constant positive gains, and $\Gamma(t) \triangleq \big(\int_0^t\omega(\tau)\omega(\tau)^Td\tau\big)^{-1}\in\mathbb{R}^{N\times N}$ is a symmetric estimation gain matrix generated as

$\dot{\Gamma} = -\eta_c\Gamma\frac{\omega\omega^T}{1 + \nu\omega^T\Gamma\omega}\Gamma, \qquad \Gamma(t_r^+) = \Gamma(0) = \varphi_0I, \qquad (5-16)$

where $t_r^+$ is the resetting time at which $\lambda_{\min}\{\Gamma(t)\}\le\varphi_1$, with $\varphi_0 > \varphi_1 > 0$. The covariance resetting ensures that $\Gamma(t)$ is positive definite for all time and prevents its value from becoming arbitrarily small in some directions, thus avoiding slow adaptation in some directions (also called the covariance wind-up problem) [104].
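A minimal sketch of the normalized recursive least-squares update of Eqs. 5-15 and 5-16 with covariance resetting follows; the gain values and the forward-Euler discretization are illustrative assumptions.

```python
import numpy as np

def critic_rls_step(W_c, Gamma, omega, delta_hjb, eta_c=1.0, nu=1.0,
                    phi0=10.0, phi1=0.1, dt=1e-3):
    """Normalized recursive LS critic update, Eqs. 5-15 and 5-16 (sketch)."""
    denom = 1.0 + nu * float(omega @ Gamma @ omega)
    W_c = W_c - dt * eta_c * (Gamma @ omega) * delta_hjb / denom
    Gamma = Gamma - dt * eta_c * Gamma @ np.outer(omega, omega) @ Gamma / denom
    # covariance resetting: keep Gamma from becoming arbitrarily small
    if np.linalg.eigvalsh(Gamma).min() < phi1:
        Gamma = phi0 * np.eye(len(W_c))
    return W_c, Gamma
```

The eigenvalue test implements the resetting condition on the minimum eigenvalue of $\Gamma(t)$ stated above, which is what prevents the wind-up behavior in directions that stop being excited.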

From Eq. 5-16, it is clear that $\dot{\Gamma}\le0$, which means that the covariance matrix $\Gamma(t)$ can be bounded as

$\varphi_1I \le \Gamma(t) \le \varphi_0I. \qquad (5-17)$

5.2.2 Gradient Update for the Actor

The actor update, like the critic update in Section 5.2.1, is based on the minimization of the Bellman error $\delta_{hjb}(\cdot)$. However, unlike the critic weights, the actor weights appear nonlinearly in $\delta_{hjb}(\cdot)$, making it problematic to develop a LS update law. Hence, a gradient update law is developed for the actor which minimizes the squared Bellman error $E_a(t) \triangleq \tfrac12\delta_{hjb}^2$, whose gradient is given by

$\frac{\partial E_a}{\partial\hat{W}_a} = \frac{\partial\delta_{hjb}}{\partial\hat{W}_a}\delta_{hjb} = \Big[\Big(\hat{W}_c^T\phi'\frac{\partial\hat{F}_{\hat{u}}}{\partial\hat{u}}\frac{\partial\hat{u}}{\partial\hat{W}_a}\Big)^T + \tfrac12\phi'G\phi'^T\hat{W}_a\Big]\delta_{hjb}, \qquad (5-18)$

where Eq. 5-11 is used, and $G(x) \triangleq g(x)R^{-1}g(x)^T\in\mathbb{R}^{n\times n}$ is a symmetric matrix. Using Eq. 5-18, the gradient-based update law for the actor NN is given by

$\dot{\hat{W}}_a = \mathrm{proj}\Big\{-\frac{\eta_{a1}}{\sqrt{1+\omega^T\omega}}\Big(\hat{W}_c^T\phi'\frac{\partial\hat{F}_{\hat{u}}}{\partial\hat{u}}\frac{\partial\hat{u}}{\partial\hat{W}_a}\Big)^T\delta_{hjb} - \frac{\eta_{a1}}{\sqrt{1+\omega^T\omega}}\phi'G\phi'^T\hat{W}_a\,\delta_{hjb} - \eta_{a2}(\hat{W}_a - \hat{W}_c)\Big\}, \qquad (5-19)$

where $\mathrm{proj}\{\cdot\}$ is a projection operator used to bound the weight estimates [72], [73], $\eta_{a1}, \eta_{a2}\in\mathbb{R}$ are positive adaptation gains, $1/\sqrt{1+\omega^T\omega}$ is a normalization term, and the last term in Eq. 5-19 is added for stability (based on the subsequent stability analysis).

5.3 Identifier Design

The following assumption is made for the identifier design:

Assumption 5.8. The control input is bounded, i.e., $u(t)\in\mathcal{L}_\infty$.

Remark 5.2. Using Assumptions 5.2 and 5.5 and the projection algorithm in Eq. 5-19, Assumption 5.8 holds for the control design $u(t) = \hat{u}(x)$ in Eq. 5-10.

Using Assumption 5.3, the dynamic system in Eq. 5-3, with control $\hat{u}(x)$, can be represented using a multi-layer NN as

$\dot{x} = F_{\hat{u}}(x,\hat{u}) = W_f^T\sigma_f(V_f^Tx) + \varepsilon_f(x) + g(x)\hat{u}, \qquad (5-20)$

where $W_f\in\mathbb{R}^{(L_f+1)\times n}$, $V_f\in\mathbb{R}^{n\times L_f}$ are the unknown ideal NN weights, $\sigma_f(V_f^Tx)\in\mathbb{R}^{L_f+1}$ is the NN activation function, and $\varepsilon_f(x)\in\mathbb{R}^n$ is the function reconstruction error. The following multi-layer dynamic neural network (MLDNN) identifier is used to approximate the system in Eq. 5-20

$\dot{\hat{x}} = \hat{F}_{\hat{u}}(x,\hat{x},\hat{u}) = \hat{W}_f^T\hat{\sigma}_f + g(x)\hat{u} + \mu, \qquad (5-21)$

where $\hat{x}(t)\in\mathbb{R}^n$ is the DNN state, $\hat{\sigma}_f \triangleq \sigma_f(\hat{V}_f^T\hat{x})\in\mathbb{R}^{L_f+1}$, $\hat{W}_f(t)\in\mathbb{R}^{(L_f+1)\times n}$ and $\hat{V}_f(t)\in\mathbb{R}^{n\times L_f}$ are weight estimates, and $\mu(t)\in\mathbb{R}^n$ denotes the RISE feedback term defined as [47, 71]

$\mu \triangleq k\tilde{x}(t) - k\tilde{x}(0) + v,$

where $\tilde{x}(t) \triangleq x(t) - \hat{x}(t)\in\mathbb{R}^n$ is the identification error, and $v(t)\in\mathbb{R}^n$ is the generalized solution (in Filippov's sense [105]) to

$\dot{v} = (k\alpha + \gamma)\tilde{x} + \beta_1\,\mathrm{sgn}(\tilde{x}), \qquad v(0) = 0,$

where $k, \alpha, \gamma, \beta_1\in\mathbb{R}$ are positive constant control gains, and $\mathrm{sgn}(\cdot)$ denotes a vector signum function. The identification error dynamics can be written as

$\dot{\tilde{x}} = \tilde{F}_{\hat{u}}(x,\hat{x},\hat{u}) = W_f^T\sigma_f - \hat{W}_f^T\hat{\sigma}_f + \varepsilon_f(x) - \mu, \qquad (5-22)$

where $\tilde{F}_{\hat{u}}(x,\hat{x},\hat{u}) \triangleq F_{\hat{u}}(x,\hat{u}) - \hat{F}_{\hat{u}}(x,\hat{x},\hat{u})\in\mathbb{R}^n$. A filtered identification error is defined as

$r \triangleq \dot{\tilde{x}} + \alpha\tilde{x}. \qquad (5-23)$

Taking the time derivative of Eq. 5-23 and using Eq. 5-22 yields

$\dot{r} = W_f^T\sigma_f'V_f^T\dot{x} - \dot{\hat{W}}_f^T\hat{\sigma}_f - \hat{W}_f^T\hat{\sigma}_f'\dot{\hat{V}}_f^T\hat{x} - \hat{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}} + \dot{\varepsilon}_f(x) - kr - \gamma\tilde{x} - \beta_1\,\mathrm{sgn}(\tilde{x}) + \alpha\dot{\tilde{x}}. \qquad (5-24)$

Based on Eq. 5-24 and the subsequent stability analysis, the weight update laws for the DNN are designed as

$\dot{\hat{W}}_f = \mathrm{proj}(\Gamma_{wf}\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}}\tilde{x}^T), \qquad \dot{\hat{V}}_f = \mathrm{proj}(\Gamma_{vf}\dot{\hat{x}}\tilde{x}^T\hat{W}_f^T\hat{\sigma}_f'), \qquad (5-25)$

where $\mathrm{proj}(\cdot)$ is a smooth projection operator [72], [73], and $\Gamma_{wf}\in\mathbb{R}^{(L_f+1)\times(L_f+1)}$, $\Gamma_{vf}\in\mathbb{R}^{n\times n}$ are positive constant adaptation gain matrices. The expression in Eq. 5-24 can be rewritten as

$\dot{r} = \tilde{N} + N_{B1} + \hat{N}_{B2} - kr - \gamma\tilde{x} - \beta_1\,\mathrm{sgn}(\tilde{x}), \qquad (5-26)$

where the auxiliary signals $\tilde{N}(x,\tilde{x},r,\hat{W}_f,\hat{V}_f,t)$, $N_{B1}(x,\hat{x},\hat{W}_f,\hat{V}_f,t)$, and $\hat{N}_{B2}(\hat{x},\dot{\hat{x}},\hat{W}_f,\hat{V}_f,t)\in\mathbb{R}^n$ are defined as

$\tilde{N} \triangleq \alpha\dot{\tilde{x}} - \dot{\hat{W}}_f^T\hat{\sigma}_f - \hat{W}_f^T\hat{\sigma}_f'\dot{\hat{V}}_f^T\hat{x} + \tfrac12W_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\tilde{x}} + \tfrac12\hat{W}_f^T\hat{\sigma}_f'V_f^T\dot{\tilde{x}}, \qquad (5-27)$
$N_{B1} \triangleq W_f^T\sigma_f'V_f^T\dot{x} - \tfrac12W_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{x} - \tfrac12\hat{W}_f^T\hat{\sigma}_f'V_f^T\dot{x} + \dot{\varepsilon}_f(x), \qquad (5-28)$
$\hat{N}_{B2} \triangleq \tfrac12\tilde{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\hat{x}} + \tfrac12\hat{W}_f^T\hat{\sigma}_f'\tilde{V}_f^T\dot{\hat{x}}, \qquad (5-29)$

where $\tilde{W}_f \triangleq W_f - \hat{W}_f(t)\in\mathbb{R}^{(L_f+1)\times n}$ and $\tilde{V}_f \triangleq V_f - \hat{V}_f(t)\in\mathbb{R}^{n\times L_f}$. To facilitate the subsequent stability analysis, an auxiliary term $N_{B2}(\hat{x},\dot{x},\hat{W}_f,\hat{V}_f,t)\in\mathbb{R}^n$ is defined by replacing $\dot{\hat{x}}(t)$ in $\hat{N}_{B2}(\cdot)$ by $\dot{x}(t)$, and $\tilde{N}_{B2}(\hat{x},\dot{\tilde{x}},\hat{W}_f,\hat{V}_f,t) \triangleq \hat{N}_{B2}(\cdot) - N_{B2}(\cdot)$. The terms $N_{B1}(\cdot)$ and $N_{B2}(\cdot)$ are grouped as $N_B \triangleq N_{B1} + N_{B2}$. Using Assumptions 5.2, 5.4, 5.6, and Eqs. 5-23, 5-25, 5-28, and 5-29, the following bounds can be obtained

$\|\tilde{N}\| \le \rho_1(\|z\|)\|z\|, \qquad (5-30)$
$\|N_{B1}\| \le \zeta_1, \qquad \|N_{B2}\| \le \zeta_2, \qquad \|\dot{N}_B\| \le \zeta_3 + \zeta_4\rho_2(\|z\|)\|z\|, \qquad (5-31)$
$\|\tilde{x}^T\tilde{N}_{B2}\| \le \zeta_5\|\tilde{x}\|^2 + \zeta_6\|r\|^2, \qquad (5-32)$

where $z \triangleq [\tilde{x}^T\ r^T]^T\in\mathbb{R}^{2n}$, $\rho_1(\cdot), \rho_2(\cdot)\in\mathbb{R}$ are positive, globally invertible, non-decreasing functions, and $\zeta_i\in\mathbb{R}$, $i = 1,\ldots,6$, are computable positive constants. To facilitate the subsequent stability analysis, let $\mathcal{D}\subset\mathbb{R}^{2n+2}$ be a domain containing $y(t) = 0$, where $y(t)\in\mathbb{R}^{2n+2}$ is defined as

$y \triangleq \big[\tilde{x}^T\ r^T\ \sqrt{P}\ \sqrt{Q}\big]^T, \qquad (5-33)$

where the auxiliary function $P(z,t)\in\mathbb{R}$ is the generalized solution to the differential equation

$\dot{P} = -L, \qquad P(0) = \beta_1\sum_{i=1}^n|\tilde{x}_i(0)| - \tilde{x}^T(0)N_B(0), \qquad (5-34)$

where the auxiliary function $L(z,t)\in\mathbb{R}$ is defined as

$L \triangleq r^T(N_{B1} - \beta_1\,\mathrm{sgn}(\tilde{x})) + \dot{\tilde{x}}^TN_{B2} - \beta_2\rho_2(\|z\|)\|z\|\|\tilde{x}\|, \qquad (5-35)$

where $\beta_1, \beta_2\in\mathbb{R}$ are chosen according to the following sufficient conditions to ensure $P(z,t)\ge0$ [71]

$\beta_1 > \max\big(\zeta_1 + \zeta_2,\ \zeta_1 + \tfrac{\zeta_3}{\alpha}\big), \qquad \beta_2 > \zeta_4. \qquad (5-36)$

The auxiliary function $Q(\tilde{W}_f,\tilde{V}_f)\in\mathbb{R}$ in Eq. 5-33 is defined as

$Q \triangleq \tfrac14\big[\mathrm{tr}(\tilde{W}_f^T\Gamma_{wf}^{-1}\tilde{W}_f) + \mathrm{tr}(\tilde{V}_f^T\Gamma_{vf}^{-1}\tilde{V}_f)\big],$

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix.

Theorem 5.1. For the system in Eq. 5-3, the identifier developed in Eq. 5-21 along with the weight update laws in Eq. 5-25 ensures asymptotic identification of the state and its derivative, in the sense that

$\lim_{t\to\infty}\|\tilde{x}(t)\| = 0 \quad\text{and}\quad \lim_{t\to\infty}\|\dot{\tilde{x}}(t)\| = 0,$

provided the control gains $k$ and $\gamma$ are chosen sufficiently large based on the initial conditions of the states (see the subsequent semi-global stability analysis) and satisfy the following sufficient conditions

$\gamma > \zeta_5, \qquad k > \zeta_6, \qquad (5-37)$

where $\zeta_5$ and $\zeta_6$ are introduced in Eq. 5-32, and $\beta_1, \beta_2$, introduced in Eq. 5-35, are chosen according to the sufficient conditions in Eq. 5-36.

Proof. The proof is similar to the proof of Theorem 4.1, the difference being that $g(x)$ is assumed to be exactly known in this chapter. This simplifies the design of the identifier, where $g(x)$ is used directly, unlike in Chapter 4 where its NN estimate is used instead.

Using the developed identifier in Eq. 5-21, the actor weight update law can now be simplified using Eq. 5-19 as

$\dot{\hat{W}}_a = \mathrm{proj}\Big\{-\frac{\eta_{a1}}{\sqrt{1+\omega^T\omega}}\phi'G\phi'^T(\hat{W}_a - \hat{W}_c)\,\delta_{hjb} - \eta_{a2}(\hat{W}_a - \hat{W}_c)\Big\}. \qquad (5-38)$
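The simplified actor law of Eq. 5-38 admits a direct implementation sketch; the projection operator is omitted for brevity, and the gain values and forward-Euler stepping are placeholders.

```python
import numpy as np

def actor_step(W_a, W_c, phi_grad, G, omega, delta_hjb,
               eta_a1=1.0, eta_a2=0.5, dt=1e-3):
    """Actor update of Eq. 5-38 (sketch, projection omitted):
    a normalized gradient term plus a term pulling W_a toward W_c."""
    S = phi_grad @ G @ phi_grad.T              # phi' G phi'^T  (N x N)
    norm = np.sqrt(1.0 + float(omega @ omega))
    W_a_dot = (-(eta_a1 / norm) * (S @ (W_a - W_c)) * delta_hjb
               - eta_a2 * (W_a - W_c))
    return W_a + dt * W_a_dot
```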
where Eqs. 5-9 and 5-10 are used. The dynamics of the critic weight estimation error $\tilde{W}_c(t)$ can now be developed by substituting Eq. 5-39 in Eq. 5-15, as

$\dot{\tilde{W}}_c = -\eta_c\Gamma\psi\psi^T\tilde{W}_c + \eta_c\Gamma\frac{\omega}{1+\nu\omega^T\Gamma\omega}\Big[-W^T\sigma'\tilde{F}_{\hat{u}} + \tfrac{1}{4}\tilde{W}_a^T\sigma'G\sigma'^T\tilde{W}_a - \tfrac{1}{4}\varepsilon_v'G\varepsilon_v'^T - \varepsilon_v'F_u\Big],$  (5-40)

where $\psi(t) \triangleq \frac{\omega(t)}{\sqrt{1+\nu\,\omega(t)^T\Gamma\omega(t)}}\in\mathbb{R}^N$ is the normalized critic regressor vector, bounded as

$\|\psi\| \le \frac{1}{\sqrt{\nu\varphi_1}},$  (5-41)

where $\varphi_1$ is introduced in Eq. 5-17. The error system in Eq. 5-40 can be represented by the following perturbed system

$\dot{\tilde{W}}_c = \Omega_{nom} + \Delta_{per},$  (5-42)

where $\Omega_{nom}(\tilde{W}_c,t) \triangleq -\eta_c\Gamma\psi\psi^T\tilde{W}_c \in\mathbb{R}^N$ denotes the nominal system, and $\Delta_{per}(t) \triangleq \eta_c\Gamma\frac{\omega}{1+\nu\omega^T\Gamma\omega}\big[-W^T\sigma'\tilde{F}_{\hat{u}} + \tfrac{1}{4}\tilde{W}_a^T\sigma'G\sigma'^T\tilde{W}_a - \tfrac{1}{4}\varepsilon_v'G\varepsilon_v'^T - \varepsilon_v'F_u\big]\in\mathbb{R}^N$ denotes the perturbation. Using Theorem 2.5.1 in [104], the nominal system

$\dot{\tilde{W}}_c = -\eta_c\Gamma\psi\psi^T\tilde{W}_c$  (5-43)

is globally exponentially stable, if the bounded signal $\psi(t)$ is PE, i.e.

$\mu_2 I \ \ge\ \int_{t_0}^{t_0+\delta}\psi(\tau)\psi(\tau)^T\,d\tau\ \ge\ \mu_1 I \qquad \forall t_0\ge 0,$

for some positive constants $\mu_1,\mu_2,\delta\in\mathbb{R}$. Since $\Omega_{nom}(\tilde{W}_c,t)$ is continuously differentiable and the Jacobian $\partial\Omega_{nom}/\partial\tilde{W}_c = -\eta_c\Gamma\psi\psi^T$ is bounded for the exponentially stable system in Eq. 5-43, the converse Lyapunov Theorem 4.14 in [106] can be used to show that there exists
a function $V_c:\mathbb{R}^N\times[0,\infty)\to\mathbb{R}$, which satisfies the following inequalities

$c_1\|\tilde{W}_c\|^2 \le V_c(\tilde{W}_c,t) \le c_2\|\tilde{W}_c\|^2$
$\frac{\partial V_c}{\partial t} + \frac{\partial V_c}{\partial\tilde{W}_c}\Omega_{nom}(\tilde{W}_c,t) \le -c_3\|\tilde{W}_c\|^2$  (5-44)
$\Big\|\frac{\partial V_c}{\partial\tilde{W}_c}\Big\| \le c_4\|\tilde{W}_c\|,$

for some positive constants $c_1,c_2,c_3,c_4\in\mathbb{R}$. Using Assumptions 5.2-5.4, 5.6 and 5.8, the projection bounds in Eq. 5-19, the fact that $F_u\in\mathcal{L}_\infty$ (since $u^*(x)$ is stabilizing), and provided the conditions of Theorem 5.1 hold (required to prove that $\tilde{F}_{\hat{u}}\in\mathcal{L}_\infty$), the following bounds can be developed:

$\|\tilde{W}_a\| \le \kappa_1, \qquad \|\sigma'G\sigma'^T\| \le \kappa_2,$
$\Big\|\tfrac{1}{4}\tilde{W}_a^T\sigma'G\sigma'^T\tilde{W}_a - W^T\sigma'\tilde{F}_{\hat{u}} - \varepsilon_v'F_u\Big\| \le \kappa_3,$  (5-45)
$\Big\|\tfrac{1}{2}W^T\sigma'G\varepsilon_v'^T + \tfrac{1}{2}\varepsilon_v'G\varepsilon_v'^T + \tfrac{1}{2}W^T\sigma'G\sigma'^T\tilde{W}_a + \tfrac{1}{2}\varepsilon_v'G\sigma'^T\tilde{W}_a\Big\| \le \kappa_4,$

where $\kappa_1,\kappa_2,\kappa_3,\kappa_4\in\mathbb{R}$ are computable positive constants.

Theorem 5.2. If Assumptions 5.1-5.8 hold, the normalized critic regressor $\psi(t)$ defined in Eq. 5-40 is PE (persistently exciting), and provided Eq. 5-36, Eq. 5-37 and the following sufficient gain condition is satisfied

$\frac{2c_3}{\eta_{a1}} > \kappa_1\kappa_2,$  (5-46)

where $\eta_{a1}$, $c_3$, $\kappa_1,\kappa_2$ are introduced in Eqs. 5-19, 5-44, and 5-45 (since $c_3$ is a function of the critic adaptation gain $\eta_c$, $\eta_{a1}$ is the actor adaptation gain, and $\kappa_1,\kappa_2$ are known constants, this condition can be easily satisfied), then the controller in Eq. 5-10, the actor and critic weight update laws in Eqs. 5-15, 5-16 and 5-38, and the
identifier in Eqs. 5-21 and 5-25, guarantee that the state of the system $x(t)$, and the actor and critic weight estimation errors $\tilde{W}_a(t)$ and $\tilde{W}_c(t)$, are UUB.

Proof. To investigate the stability of Eq. 5-3 with control $\hat{u}(x)$, and the perturbed system in Eq. 5-42, consider $V_L:\mathcal{X}\times\mathbb{R}^N\times\mathbb{R}^N\times[0,\infty)\to\mathbb{R}$ as the continuously differentiable, positive-definite Lyapunov function candidate defined as

$V_L(x,\tilde{W}_c,\tilde{W}_a,t) \triangleq V^*(x) + V_c(\tilde{W}_c,t) + \tfrac{1}{2}\tilde{W}_a^T\tilde{W}_a,$

where $V^*(x)$ (the optimal value function) is the Lyapunov function for Eq. 5-3, and $V_c(\tilde{W}_c,t)$ is the Lyapunov function for the exponentially stable system in Eq. 5-43. Since $V^*(x)$ is continuously differentiable and positive-definite from Eqs. 5-1 and 5-2, there exist class $\mathcal{K}$ functions $\alpha_1$ and $\alpha_2$ defined on $[0,r]$, where $\mathcal{B}_r\subset\mathcal{X}$ (Lemma 4.3 in [106]), such that

$\alpha_1(\|x\|) \le V^*(x) \le \alpha_2(\|x\|) \qquad \forall x\in\mathcal{B}_r.$  (5-47)

Using Eqs. 5-44 and 5-47, $V_L(x,\tilde{W}_c,\tilde{W}_a,t)$ can be bounded as

$\alpha_1(\|x\|) + c_1\|\tilde{W}_c\|^2 + \tfrac{1}{2}\|\tilde{W}_a\|^2 \ \le\ V_L(x,\tilde{W}_c,\tilde{W}_a,t) \ \le\ \alpha_2(\|x\|) + c_2\|\tilde{W}_c\|^2 + \tfrac{1}{2}\|\tilde{W}_a\|^2,$

which can be written as

$\alpha_3(\|\tilde{z}\|) \le V_L(x,\tilde{W}_c,\tilde{W}_a,t) \le \alpha_4(\|\tilde{z}\|) \qquad \forall\tilde{z}\in\mathcal{B}_s,$

where $\tilde{z}(t) \triangleq [x(t)^T\ \tilde{W}_c(t)^T\ \tilde{W}_a(t)^T]^T\in\mathbb{R}^{n+2N}$, and $\alpha_3$ and $\alpha_4$ are class $\mathcal{K}$ functions defined on $[0,s]$, where $\mathcal{B}_s\subset\mathcal{X}\times\mathbb{R}^N\times\mathbb{R}^N$. Taking the time derivative of $V_L(\cdot)$ yields

$\dot{V}_L = \frac{\partial V^*}{\partial x}f + \frac{\partial V^*}{\partial x}g\hat{u} + \frac{\partial V_c}{\partial t} + \frac{\partial V_c}{\partial\tilde{W}_c}\Omega_{nom} + \frac{\partial V_c}{\partial\tilde{W}_c}\Delta_{per} - \tilde{W}_a^T\dot{\hat{W}}_a,$  (5-48)

where the time derivative of $V^*(\cdot)$ is taken along the trajectories of the system Eq. 5-3 with control $\hat{u}(\cdot)$, and the time derivative of $V_c(\cdot)$ is taken along the trajectories of the perturbed system Eq. 5-42. To facilitate the subsequent analysis, the HJB in Eq.
5-5 is rewritten as $\frac{\partial V^*}{\partial x}f = -\frac{\partial V^*}{\partial x}gu^* - Q(x) - u^{*T}Ru^*$. Substituting for $\frac{\partial V^*}{\partial x}f$ in Eq. 5-48, using the fact that $\frac{\partial V^*}{\partial x}g = -2u^{*T}R$ from Eq. 5-4, and using Eqs. 5-19 and 5-44, Eq. 5-48 can be upper bounded as

$\dot{V}_L \le -Q - u^{*T}Ru^* - c_3\|\tilde{W}_c\|^2 + c_4\|\tilde{W}_c\|\|\Delta_{per}\| + 2u^{*T}R(u^*-\hat{u}) + \eta_{a2}\tilde{W}_a^T(\hat{W}_a-\hat{W}_c) + \frac{\eta_{a1}}{\sqrt{1+\omega^T\omega}}\tilde{W}_a^T\sigma'G\sigma'^T(\hat{W}_a-\hat{W}_c)\,\delta_{hjb}.$  (5-49)

Substituting for $u^*$, $\hat{u}$, $\delta_{hjb}$, and $\Delta_{per}$ using Eqs. 5-4, 5-10, 5-39, and 5-42, respectively, and using Eq. 5-17 and Eq. 5-41 in Eq. 5-49, yields

$\dot{V}_L \le -Q - c_3\|\tilde{W}_c\|^2 - \eta_{a2}\|\tilde{W}_a\|^2 + \Big(\tfrac{1}{2}W^T\sigma'G\varepsilon_v'^T + \tfrac{1}{2}\varepsilon_v'G\varepsilon_v'^T + \tfrac{1}{2}W^T\sigma'G\sigma'^T\tilde{W}_a + \tfrac{1}{2}\varepsilon_v'G\sigma'^T\tilde{W}_a\Big)$
$\phantom{\dot{V}_L \le} + \frac{c_4\eta_c}{2\sqrt{\nu\varphi_1}}\Big\|W^T\sigma'\tilde{F}_{\hat{u}} - \tfrac{1}{4}\tilde{W}_a^T\sigma'G\sigma'^T\tilde{W}_a + \tfrac{1}{4}\varepsilon_v'G\varepsilon_v'^T + \varepsilon_v'F_u\Big\|\,\|\tilde{W}_c\|$  (5-50)
$\phantom{\dot{V}_L \le} + \frac{\eta_{a1}}{\sqrt{1+\omega^T\omega}}\tilde{W}_a^T\sigma'G\sigma'^T(\tilde{W}_c-\tilde{W}_a)\Big[-\tilde{W}_c^T\omega - W^T\sigma'\tilde{F}_{\hat{u}} + \tfrac{1}{4}\tilde{W}_a^T\sigma'G\sigma'^T\tilde{W}_a - \tfrac{1}{4}\varepsilon_v'G\varepsilon_v'^T - \varepsilon_v'F_u\Big] + \eta_{a2}\|\tilde{W}_a\|\|\tilde{W}_c\|.$

Using the bounds developed in Eq. 5-45, Eq. 5-50 can be further upper bounded as

$\dot{V}_L \le -Q - \Big(c_3 - \frac{\eta_{a1}\kappa_1\kappa_2}{2}\Big)\|\tilde{W}_c\|^2 - \eta_{a2}\|\tilde{W}_a\|^2 + \frac{\eta_{a1}\kappa_1^2\kappa_2\kappa_3}{2} + \kappa_4 + \Big[\frac{c_4\eta_c}{2\sqrt{\nu\varphi_1}}\kappa_3 + \frac{\eta_{a1}\kappa_1\kappa_2\kappa_3}{2} + \frac{\eta_{a1}\kappa_1^2\kappa_2}{2} + \eta_{a2}\kappa_1\Big]\|\tilde{W}_c\|.$

Provided $c_3 > \frac{\eta_{a1}\kappa_1\kappa_2}{2}$, completing the square yields

$\dot{V}_L \le -Q - (1-\theta)\Big(c_3 - \frac{\eta_{a1}\kappa_1\kappa_2}{2}\Big)\|\tilde{W}_c\|^2 - \eta_{a2}\|\tilde{W}_a\|^2 + \frac{\eta_{a1}\kappa_1^2\kappa_2\kappa_3}{2} + \kappa_4 + \frac{1}{4\theta\big(c_3-\frac{\eta_{a1}\kappa_1\kappa_2}{2}\big)}\Big[\frac{c_4\eta_c}{2\sqrt{\nu\varphi_1}}\kappa_3 + \frac{\eta_{a1}\kappa_1\kappa_2\kappa_3}{2} + \frac{\eta_{a1}\kappa_1^2\kappa_2}{2} + \eta_{a2}\kappa_1\Big]^2,$  (5-51)

where $0 < \theta < 1$. Since $Q(x)$ is positive definite, Lemma 4.3 in [106] indicates that there exist class $\mathcal{K}$ functions $\alpha_5$ and $\alpha_6$ such that

$\alpha_5(\|\tilde{z}\|) \le Q + (1-\theta)\Big(c_3 - \frac{\eta_{a1}\kappa_1\kappa_2}{2}\Big)\|\tilde{W}_c\|^2 + \eta_{a2}\|\tilde{W}_a\|^2 \le \alpha_6(\|\tilde{z}\|) \qquad \forall\tilde{z}\in\mathcal{B}_s.$  (5-52)
Using Eq. 5-52, the expression in Eq. 5-51 can be further upper bounded as

$\dot{V}_L \le -\alpha_5(\|\tilde{z}\|) + \frac{\eta_{a1}\kappa_1^2\kappa_2\kappa_3}{2} + \kappa_4 + \frac{1}{4\theta\big(c_3-\frac{\eta_{a1}\kappa_1\kappa_2}{2}\big)}\Big[\frac{c_4\eta_c}{2\sqrt{\nu\varphi_1}}\kappa_3 + \frac{\eta_{a1}\kappa_1\kappa_2\kappa_3}{2} + \frac{\eta_{a1}\kappa_1^2\kappa_2}{2} + \eta_{a2}\kappa_1\Big]^2,$

which proves that $\dot{V}_L(\cdot)$ is negative whenever $\tilde{z}(t)$ lies outside the compact set

$\Omega_{\tilde{z}} \triangleq \Big\{\tilde{z} : \|\tilde{z}\| \le \alpha_5^{-1}\Big(\frac{1}{4\theta\big(c_3-\frac{\eta_{a1}\kappa_1\kappa_2}{2}\big)}\Big[\frac{c_4\eta_c}{2\sqrt{\nu\varphi_1}}\kappa_3 + \frac{\eta_{a1}\kappa_1\kappa_2\kappa_3}{2} + \frac{\eta_{a1}\kappa_1^2\kappa_2}{2} + \eta_{a2}\kappa_1\Big]^2 + \frac{\eta_{a1}\kappa_1^2\kappa_2\kappa_3}{2} + \kappa_4\Big)\Big\},$

and hence, $\|\tilde{z}(t)\|$ is UUB (Theorem 4.18 in [106]). The bounds in Eq. 5-45 depend on the actor NN approximation error $\varepsilon_v'$, which can be reduced by increasing the number of neurons $N$, thereby reducing the size of the residual set $\Omega_{\tilde{z}}$. From Assumption 5.7, as the number of neurons of the actor and critic NNs $N\to\infty$, the reconstruction error $\varepsilon_v'\to 0$.

Remark 5.3. Since the actor, critic and identifier are continuously updated, the developed RL algorithm can be compared to fully optimistic PI in the machine learning literature [107], where policy evaluation and policy improvement are done after every state transition, unlike traditional PI, where policy improvement is done after convergence of the policy evaluation step. Proving convergence of optimistic PI is complicated and is an active area of research in machine learning [107, 108]. By considering an adaptive control framework, this result investigates the convergence and stability behavior of fully optimistic PI in continuous time.

Remark 5.4. The PE condition in Theorem 5.2 is equivalent to the exploration paradigm in RL which ensures sufficient sampling of the state space and convergence to the optimal policy [101].

5.5 Comparison with Related Work

Similar to RL, optimal control involves selection of an optimal policy based on some long-term performance criterion. DP provides a means to solve optimal control problems [52]; however, DP is implemented backward in time, making it offline and computationally expensive for complex systems. Owing to the similarities between optimal control and
RL [3], Werbos [17] introduced RL-based AC methods for optimal control, called ADP. ADP uses NNs to approximately solve DP forward-in-time, thus avoiding the curse of dimensionality. A detailed discussion of ADP-based designs is found in [6, 24, 107]. The success of ADP prompted a major research effort towards designing ADP-based optimal feedback controllers. The discrete/iterative nature of the ADP formulation lends itself naturally to the design of discrete-time optimal controllers [7, 10, 67-70, 109].

Extensions of ADP-based controllers to continuous-time systems entail challenges in proving stability and convergence, and in ensuring the algorithm is online and model-free. Early solutions to the problem consisted of using a discrete-time formulation of time and state, and then applying an RL algorithm on the discretized system. Discretizing the state space for high dimensional systems requires a large memory space and a computationally prohibitive learning process. Baird [38] proposed Advantage Updating, an extension of the Q-learning algorithm which could be implemented in continuous time and provided faster convergence. Doya [39] used a HJB framework to derive algorithms for value function approximation and policy improvement, based on a continuous-time version of the temporal difference error. Murray et al. [8] also used the HJB framework to develop a stepwise stable iterative ADP algorithm for continuous-time input-affine systems with an input quadratic performance measure. In Beard et al. [40], Galerkin's spectral method is used to approximate the solution to the GHJB, using which a stabilizing feedback controller was computed offline. Similar to [40], Abu-Khalaf and Lewis [41] proposed a least-squares successive approximation solution to the GHJB, where an NN is trained offline to learn the GHJB solution. Recent results by [13, 42] have made new inroads by addressing the problem for partially unknown nonlinear systems. However, the inherently iterative nature of the ADP algorithm has prevented the development of rigorous stability proofs of closed-loop controllers for continuous-time uncertain nonlinear systems.

All the aforementioned approaches for continuous-time nonlinear systems are offline and/or require complete knowledge of the system dynamics. One of the contributions in
[13] is that only partial knowledge of the system dynamics is required, and a hybrid continuous-time/discrete-time sampled data controller is developed based on PI, where the feedback control operation of the actor occurs at a faster time scale than the learning process of the critic. Vamvoudakis and Lewis [14] extended the idea by designing a model-based online algorithm called synchronous PI which involved synchronous, continuous-time adaptation of both actor and critic neural networks. Inspired by the work in [14], a novel actor-critic-identifier architecture is proposed in this work to approximately solve the continuous-time infinite horizon optimal control problem for uncertain nonlinear systems; however, unlike [14], the developed method does not require knowledge of the system drift dynamics. The actor and critic NNs approximate the optimal control and the optimal value function, respectively, whereas the identifier DNN estimates the system dynamics online. The integral RL technique in [13] leads to a hybrid continuous-time/discrete-time controller with a two time-scale actor and critic learning process, whereas the approach in [14], although continuous-time, requires complete knowledge of the system dynamics. A contribution of this work is the use of a novel actor-critic-identifier architecture, which obviates the need to know the system drift dynamics, and where the learning of the actor, critic and identifier is continuous and simultaneous. Moreover, the actor-critic-identifier method utilizes an identification-based online learning scheme, and hence is the first ever indirect adaptive control approach to RL. The idea is similar to the Heuristic Dynamic Programming (HDP) algorithm [5], where Werbos suggested the use of a model network along with the actor and critic networks.

In the developed method, the actor and critic NNs use gradient and least squares-based update laws, respectively, to minimize the Bellman error, which is the difference between the exact and the approximate HJB equation. The identifier DNN is a combination of a Hopfield-type [110] component, in parallel configuration with the system [111], and a novel RISE (Robust Integral of Sign of the Error) component. The Hopfield component of
the DNN learns the system dynamics based on online gradient-based weight tuning laws, while the RISE term robustly accounts for the function reconstruction errors, guaranteeing asymptotic estimation of the state and the state derivative. The online estimation of the state derivative allows the actor-critic-identifier architecture to be implemented without knowledge of the system drift dynamics; however, knowledge of the input gain matrix is required to implement the control policy. While the design of the actor and critic are coupled through the HJB equation, the design of the identifier is decoupled from the actor-critic, and can be considered as a modular component in the actor-critic-identifier architecture. Convergence of the actor-critic-identifier-based algorithm and stability of the closed-loop system are analyzed using Lyapunov-based adaptive control methods, and a persistence of excitation (PE) condition is used to guarantee exponential convergence to a bounded region in the neighborhood of the optimal control and UUB stability of the closed-loop system. The PE condition is equivalent to the exploration paradigm in RL [101] and ensures adequate sampling of the system's dynamics, required for convergence to the optimal policy.

5.6 Simulation

5.6.1 Nonlinear System Example

The following nonlinear system is considered [14]

$\dot{x} = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2\big(1 - (\cos(2x_1)+2)^2\big) \end{bmatrix} + \begin{bmatrix} 0 \\ \cos(2x_1)+2 \end{bmatrix}u,$  (5-53)

where $x(t) \triangleq [x_1(t)\ x_2(t)]^T\in\mathbb{R}^2$ and $u(t)\in\mathbb{R}$. The state and control penalties are chosen as

$Q(x) = x^T\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}x, \qquad R = 1.$
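For reference when reproducing this example, a minimal Python encoding of the plant and cost follows; the function names are illustrative assumptions, and only $g(x)$ is treated as known by the controller, consistent with the method.

```python
import numpy as np

def f(x):
    # Drift dynamics of Eq. 5-53 (unknown to the controller;
    # the identifier DNN estimates this term online).
    x1, x2 = x
    return np.array([-x1 + x2,
                     -0.5*x1 - 0.5*x2*(1.0 - (np.cos(2.0*x1) + 2.0)**2)])

def g(x):
    # Input gain of Eq. 5-53 (assumed known, as the method requires).
    return np.array([0.0, np.cos(2.0*x[0]) + 2.0])

def running_cost(x, u):
    # r(x, u) = x^T Q x + u^T R u with Q = I_2 and R = 1.
    return float(x @ x + u * u)
```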
Figure 5-2. System states x(t) with persistently excited input for the first 3 seconds.

The optimal value function and optimal control for the system in Eq. 5-53 are known, and given by [14]

$V^*(x) = \frac{1}{2}x_1^2 + x_2^2, \qquad u^*(x) = -(\cos(2x_1)+2)\,x_2.$

The activation function for the critic NN is selected with $N = 3$ neurons as

$\sigma(x) = [x_1^2\ \ x_1x_2\ \ x_2^2]^T,$

while the activation function for the identifier DNN is selected as a symmetric sigmoid with $L_f = 5$ neurons in the hidden layer. The identifier gains are selected as

$k = 800,\ \ \alpha = 300,\ \ \gamma = 5,\ \ \beta_1 = 0.2,\ \ \Gamma_{wf} = 0.1I_{6\times6},\ \ \Gamma_{vf} = 0.1I_{2\times2},$

and the gains for the actor-critic learning laws are selected as

$\eta_{a1} = 10,\ \ \eta_{a2} = 50,\ \ \eta_c = 20,\ \ \nu = 0.005.$
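A sketch of the value and policy parameterizations used in this example is given below. The actor policy `u_hat` follows the form $\hat{u} = -\frac{1}{2}R^{-1}g(x)^T\nabla\sigma(x)^T\hat{W}_a$, which is assumed here to match the controller of Eq. 5-10; the function names are illustrative. With the ideal weights $W = [0.5\ 0\ 1]^T$ reported later in this section, these reproduce $V^*$ and $u^*$ above.

```python
import numpy as np

def sigma(x):
    # Critic activation, N = 3: sigma(x) = [x1^2, x1 x2, x2^2]^T.
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2])

def grad_sigma(x):
    # Jacobian d sigma / dx (3 x 2), shared by actor and Bellman error.
    x1, x2 = x
    return np.array([[2.0 * x1, 0.0],
                     [x2,       x1],
                     [0.0, 2.0 * x2]])

def V_hat(W_c, x):
    # Critic estimate of the value function, V_hat = W_c^T sigma(x).
    return W_c @ sigma(x)

def u_hat(W_a, x, R=1.0):
    # Actor policy u_hat = -(1/2) R^{-1} g(x)^T grad_sigma(x)^T W_a.
    gx = np.array([0.0, np.cos(2.0 * x[0]) + 2.0])
    return -0.5 / R * gx @ (grad_sigma(x).T @ W_a)
```

As a quick check, `u_hat(np.array([0.5, 0.0, 1.0]), x)` evaluates to $-(\cos(2x_1)+2)x_2$, i.e., the known optimal control.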
Figure 5-3. Error in estimating the state derivative $\dot{\tilde{x}}(t)$ by the identifier.

Figure 5-4. Convergence of critic weights $\hat{W}_c(t)$.
Figure 5-5. Convergence of actor weights $\hat{W}_a(t)$.

Figure 5-6. Error in approximating the optimal value function by the critic at steady state.
Figure 5-7. Error in approximating the optimal control by the actor at steady state.

Figure 5-8. Errors in approximating the (a) optimal value function, and (b) optimal control, as a function of time.
The covariance matrix is initialized to $\Gamma(0) = 5000I$, all the NN weights are randomly initialized in $[-1,1]$, and the states are initialized to $x(0) = [3,\ -1]^T$. An implementation issue in using the developed algorithm is to ensure PE of the critic regressor vector. Unlike linear systems, where PE of the regressor translates to sufficient richness of the external input, no verifiable method exists to ensure PE in nonlinear regulation problems. To ensure PE qualitatively, a small exploratory signal consisting of sinusoids of varying frequencies, $n(t) = \sin^2(t)\cos(t) + \sin^2(2t)\cos(0.1t) + \sin^2(-1.2t)\cos(0.5t) + \sin^5(t)$, is added to the control $u(t)$ for the first 3 seconds [14]. The evolution of states is shown in Fig. 5-2. The identifier approximates the system dynamics, and the state derivative estimation error is shown in Fig. 5-3. Persistence of excitation ensures that the weights converge to their optimal values of $W = [0.5\ \ 0\ \ 1]^T$ in approximately 2 seconds, as seen from the evolution of actor and critic weights in Figs. 5-4 and 5-5. The errors in approximating the optimal value function and optimal control at steady state ($t = 10\,\mathrm{s}$) are plotted against the states in Figs. 5-6 and 5-7, respectively. Fig. 5-8 shows the error between the optimal value function and approximate optimal value function, and the optimal control and approximate optimal control, as a function of time along the trajectory $x(t)$.
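The probing signal above and a numerical PE diagnostic can be sketched as follows. The windowed-integral check is an illustrative simulation-time diagnostic (an assumption of this sketch), not part of the controller.

```python
import numpy as np

def n_probe(t):
    # Exploratory signal added to u(t) for the first 3 s (Section 5.6.1).
    return (np.sin(t)**2 * np.cos(t) + np.sin(2*t)**2 * np.cos(0.1*t)
            + np.sin(-1.2*t)**2 * np.cos(0.5*t) + np.sin(t)**5)

def pe_margin(psi_samples, dt):
    # Approximate M = int psi psi^T dt over a window of regressor samples
    # and return its extreme eigenvalues; PE requires mu_1 I <= M <= mu_2 I
    # with mu_1 > 0 (cf. the PE bound used with Eq. 5-43).
    M = dt * sum(np.outer(p, p) for p in psi_samples)
    eigs = np.linalg.eigvalsh(M)
    return eigs[0], eigs[-1]
```

In practice, a small minimum eigenvalue over the window signals insufficient excitation and slow or stalled critic convergence.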
5.6.2 LQR Example

The following linear system is considered [14]

$\dot{x} = \underbrace{\begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix}}_{A}x + \underbrace{\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}}_{B}u,$

with the following state and control penalties

$Q(x) = x^T\begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}x, \qquad R = 1.$

Figure 5-9. System states x(t) with persistently excited input for the first 25 seconds.

The following solution to the ARE can be obtained

$P = \begin{bmatrix} 1.4245 & 1.1682 & -0.1352 \\ 1.1682 & 1.4349 & -0.1501 \\ -0.1352 & -0.1501 & 2.4329 \end{bmatrix}.$

The optimal value function is given by $V^*(x) = x^TPx$, and the optimal control is given by

$u^* = -R^{-1}B^TPx = -\begin{bmatrix} -0.1352 & -0.1501 & 2.4329 \end{bmatrix}x.$

The above LQR design assumes complete knowledge of the system dynamics (i.e., $A$ and $B$), and the ARE is solved offline to obtain $P$. The proposed actor-critic-identifier architecture is used to solve the LQR problem online without requiring knowledge of the system drift dynamics (i.e., $A$). The basis for the critic NN is selected by exploiting the
Figure 5-10. Convergence of critic weights $\hat{W}_c(t)$.

Figure 5-11. Convergence of actor weights $\hat{W}_a(t)$.
Figure 5-12. Errors in approximating the (a) optimal value function, and (b) optimal control, as a function of time.

structure of the value function as

$\sigma(x) = \begin{bmatrix} x_1^2 & x_1x_2 & x_1x_3 & x_2^2 & x_2x_3 & x_3^2 \end{bmatrix}^T,$

and the optimal weights are given by

$W = \begin{bmatrix} 1.4245 & 2.3364 & -0.2704 & 1.4349 & -0.3002 & 2.4329 \end{bmatrix}^T.$

The same identifier as in the nonlinear example in Section 5.6.1 is used, and the gains for the actor and critic learning laws are selected as

$\eta_{a1} = 5,\ \ \eta_{a2} = 50,\ \ \eta_c = 20,\ \ \nu = 0.001.$

The covariance matrix is initialized to $\Gamma(0) = 50000I$, all the NN weights are randomly initialized in $[-1,1]$, and the states are initialized to $x(0) = [-15\ \ 13\ \ 12]^T$. To ensure PE, an exploratory signal consisting of sinusoids of varying frequencies, $n(t) = 10(\sin(2t) + \sin(et) + \cos^5(5t) + \sin(10t) + \cos(3t) + \sin^2(2t)\cos(0.1t) + \sin(0.5t) + \cos(10t) + \sin(20t))$, is added to the control $u(t)$ for the first 25 seconds. The evolution of states is shown in Fig. 5-9, and Figs. 5-10 and 5-11 show the convergence of critic and actor weights, respectively. Fig. 5-12 shows the error between the optimal value function and approximate optimal value function, and the optimal control and approximate optimal control, as a function of time along the trajectory $x(t)$.
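As a cross-check of the offline LQR baseline, the ARE solution and the corresponding ideal critic weights can be computed with standard tools. A minimal sketch follows, assuming SciPy is available; the variable names are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-1.01887,  0.90506, -0.00215],
              [ 0.82225, -1.07741, -0.17555],
              [ 0.0,      0.0,     -1.0    ]])
B = np.array([[0.0], [0.0], [1.0]])
Q = np.eye(3)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)   # should match the P given above
K = np.linalg.solve(R, B.T @ P)        # optimal gain, u* = -K x

# Ideal critic weights for sigma(x) = [x1^2, x1x2, x1x3, x2^2, x2x3, x3^2]:
# since V*(x) = x^T P x, diagonal entries of P map directly and
# off-diagonal entries appear doubled in the weight vector.
W_star = np.array([P[0, 0], 2*P[0, 1], 2*P[0, 2],
                   P[1, 1], 2*P[1, 2], P[2, 2]])
```

Running this sketch reproduces the weight vector $W$ stated above, which is the target the online critic weights converge to in Fig. 5-10.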
5.7 Summary

An actor-critic-identifier architecture is proposed to learn the approximate solution to the HJB equation for infinite-horizon optimal control of uncertain nonlinear systems. The online method is the first ever indirect adaptive control approach to continuous-time RL. The learning by the actor, critic and identifier is continuous and simultaneous, and the novel addition of the identifier to the traditional AC architecture eliminates the need to know the system drift dynamics. The actor and critic minimize the Bellman error using gradient and least-squares update laws, respectively, and provide online approximations to the optimal control and the optimal value function, respectively. The identifier estimates the system dynamics online and asymptotically converges to the system state and its derivative. A PE condition is required to ensure exponential convergence to a bounded region in the neighborhood of the optimal control and UUB stability of the closed-loop system. Simulation results demonstrate the performance of the actor-critic-identifier-based method.
CHAPTER 6
CONCLUSION AND FUTURE WORK

This chapter concludes the dissertation by discussing the key ideas developed in each chapter. Limitations and implementation issues of the work are discussed and recommendations are made regarding possible future research directions.

6.1 Dissertation Summary

This work focuses on replicating the success of RL methods in machine learning to control continuous-time nonlinear systems. While in Chapter 3 the RL approach is used to develop robust adaptive controllers which guarantee asymptotic tracking, RL methods
are used in Chapter 5 to develop online adaptive optimal controllers. The improvement in performance of the closed-loop system demonstrated through simulations and experiments shows the potential of online data-driven RL methods, where the controller is able to learn the optimal policy by interacting with the environment. The RL approach for optimal control is cast as a parameter estimation and identification problem, and is considered in an adaptive control framework. The adaptive control framework allows rigorous analysis of stability and convergence of the algorithm. For the RL-based optimal control in Chapter 5, a persistence of excitation condition is found to be crucial in ensuring exponential convergence of the parameters to a bounded region in the neighborhood of the optimal control and yields UUB stability of the closed-loop system.

The focus of Chapter 3 is to develop a non-dynamic programming based adaptive critic controller for a class of continuous-time uncertain nonlinear systems with additive bounded disturbances. This work overcomes the limitation of previous work where adaptive critic controllers are either discrete-time and/or yield a uniformly ultimately bounded stability result due to the presence of disturbances and unknown approximation
errors. The asymptotic tracking result is made possible by combining a continuous RISE feedback with both the actor and the critic NN structures. The feedforward actor NN approximates the nonlinear system dynamics while the robust feedback (RISE) rejects the NN functional reconstruction error and disturbances. In addition, the actor NN is trained online using a combination of tracking error and a reinforcement signal, generated by the critic. Experimental results and t-test analysis demonstrate faster convergence of the tracking error when a reinforcement learning term is included in the NN weight update laws. Although the proposed method guarantees asymptotic tracking, a limitation of the controller is that it does not ensure optimality, which is a common feature (at least approximate optimal control) of DP-based RL controllers.

The development of the state derivative estimator in Chapter 4 is motivated by the need to develop model-free RL-based solutions to the optimal control problem for nonlinear systems. In contrast to purely robust feedback methods in the literature, an identification-based robust adaptive approach is developed. The result differs from existing pure robust methods in that the proposed method combines a DNN system identifier with a robust RISE feedback to ensure asymptotic convergence to the state derivative, which is proven using a Lyapunov-based stability analysis. Simulation results in the presence of noise show an improved transient and steady state performance of the developed state derivative identifier in comparison to several other derivative estimation methods. Initially developed for model-free RL-based control, the developed estimator can be used in a wide range of applications, e.g., parameter estimation, fault detection, acceleration feedback, output feedback control, etc.

Due to the difficulty in solving the HJB for optimal control of continuous-time systems, few results exist which solve/circumvent the problem in an online, model-free way. The state derivative estimator developed in Chapter 4 paved the way for the development of a novel actor-critic-identifier architecture in Chapter 5 which learns the approximate optimal solution for infinite-horizon optimal control of uncertain nonlinear systems. The method is online, partially model-free, and is the first ever indirect adaptive control approach to continuous-time RL. The actor and critic minimize the Bellman error using gradient and least-squares update laws, respectively, and provide online approximations to the optimal control and the optimal value function, respectively. The identifier estimates the system dynamics online and asymptotically converges to the system state and its derivative. Another contribution of the result is that the learning by the actor, critic and identifier is continuous and simultaneous, and the novel addition of the identifier to the traditional actor-critic architecture eliminates the need to know the system drift dynamics. A limitation of the method, however, is the requirement of the knowledge of the input gain matrix.

6.2 Future Work

This work illustrates that RL methods can be successfully applied to feedback control. While the methods developed are fairly general and applicable to a wide range of systems, research in this area is still at a nascent stage and several interesting open problems exist. This section discusses the open theoretical problems, implementation issues, and future research directions.

6.2.1 Model-Free RL

RL methods based on TD learning typically do not need a model to learn the optimal policy; they either learn the model online (indirect adaptive approach) or directly learn the parameters of the optimal control (direct adaptive approach). The controller developed in Chapter 5 is based on an indirect adaptive approach, where an identifier is used to approximate the system dynamics online, resulting in a model-free formulation of the Bellman error which is used to approximate the value function. Although the approximation of the value function is model-free, the greedy policy used to compute the optimal policy requires knowledge of the input gain matrix. Hence, the developed approach is only partially model-free. A possible approach to completely model-free RL for continuous-time nonlinear systems is to use Q-learning methods [20], a direct adaptive model-free approach to learn optimal policies in MDPs. However, Q-learning-based control design still remains an open problem for continuous-time nonlinear systems. A recent result in [112] points to a possible approach to solve the problem.
6.2.2 Relaxing the Persistence of Excitation Condition

The critic regressor in Chapter 5 is required to satisfy the PE condition for convergence to a neighborhood of the optimal control. As observed in Chapter 5, the PE condition in adaptive control is equivalent to the exploration paradigm, which lies at the heart of RL. Exploration is essential to explore the state space and converge to the global optimal solution. For linear systems, the PE condition translates to the sufficient richness of the external input. However, PE is hard to verify in general for nonlinear systems. Future efforts can focus on relaxing the PE assumption by replacing it with a milder condition on the regressor vector. A recent result in [113] attempts to relax the PE assumption by exploiting prior information about the system, but that may go against the spirit of RL, which relies on online learning.

6.2.3 Asymptotic RL-Based Optimal Control

Although asymptotic tracking is guaranteed in Chapter 3, the controller is not optimal. In Chapter 5, where an optimal controller is developed, a UUB stability result is achieved. An open problem in RL-based optimal control is asymptotic stability of the closed-loop system in the presence of NN approximation errors. One way is to account for approximation errors by combining the optimal control with a robust feedback, e.g., sliding mode or RISE. Although asymptotic stability can be proved by the addition of these robust methods, optimality of the overall controller may be compromised in doing so. Hence, it is not straightforward to extend the robust feedback control tools to optimal control in the presence of NN approximation errors.

6.2.4 Better Function Approximation Methods

Generalization and the use of appropriate function approximators for value function approximation is one of the most important issues facing RL, preventing its use in large-scale systems. Function approximation was introduced in RL to alleviate the curse of dimensionality when solving sequential decision problems with large or continuous state spaces [35]. Most RL algorithms for continuous-time control involve parameterization
ofthevaluefunctionandthecontrol.Theseparameterizationsinv olveselectingan appropriatebasisfunctionforthevalueandthecontrol,ataskwh ichcanbeveryhard withoutanypriorknowledgeaboutthesystem.Linearfunctionapp roximators,though convenienttousefromanalysispointofview,havelimitedapproximat ioncapability. NonlinearapproximatorslikemultilayerNNshavebetterapproximatio ncapabilitybut arenotamenableforanalysisandprovingconvergence.Achallenge forthecomputational intelligencecommunityistodevelopsimpleyetpowerfulapproximator swhicharealso amenabletomathematicalanalysis.6.2.5RobustnesstoDisturbances Inpracticalsystems,disturbancesareinevitable,e.g.,windgustp ushingagainst anaircraft,contaminantinachemicalprocess,suddenpoliticalup heavalaecting thestockmarket,etc.ThesystemconsideredinChapter 5 isnotsubjectedtoany externaldisturbances,andhence,robustnesstoexternaldis turbancesisnotguaranteed. Optimalcontrolofsystemssubjectedtodisturbancescanbeco nsideredintheframework ofminimaxdierentialgames[ 114 ],wherethecontrolanddisturbancearetreated asplayerswithconrictinginterests{oneminimizestheobjectivefun ctionwhereas theothermaximizesit,andbothreachanoptimalcompromise(ifitex ists)whichis calledthesaddlepointsolution.Recentresultsin[ 115 116 ]havemadeinroadsintothe continuous-timedierentialgameproblem.TheACImethoddevelop edinChapter 5 can beextendedtosolvethedierentialgameprobleminanonline,partia llymodel-freeway. 6.2.6OutputFeedbackRLControl Themethodsdevelopedinthisworkassumefullstatefeedback,ho wever,theremay besituationswhereallthestatesarenotavailableformeasuremen t.InRLjargon,such situationsarereferredasPartiallyObservableMarkovDecisionPro cesses(POMDPs) [ 117 ].Fromacontrolsperspective,inabsenceoffull-statefeedback, theproblemcanbe dealtbydevelopingobserversandoutputfeedbackcontrollers.A nopenproblemisto extendthesemethodstoRL-basedcontrol.Achallengeinextendin gtheobserver-based 103
techniques for output feedback RL is that observers typically need a model of the system while RL methods are ideally model-free. A possible alternative is to use non-model-based observers, like high gain or sliding mode. The state derivative estimator developed in Chapter 4 can also be extended to the output feedback case.

6.2.7 Extending RL beyond the Infinite-Horizon Regulator

The methods developed in this work are applicable only for infinite-horizon regulation of continuous-time systems. Also, the system considered in Chapter 5 is restricted to be autonomous. ADP for time-varying systems and tracking are interesting open problems. Other extensions where future research efforts can be directed are: minimum time, finite-time, and constrained optimal control problems.
APPENDIX A
ASYMPTOTIC TRACKING BY A REINFORCEMENT LEARNING-BASED ADAPTIVE CRITIC CONTROLLER

A.1 Derivation of Sufficient Conditions in Eq. 3-42

Integrating Eq. 3-46, the following expression is obtained

$\int_0^t L(\tau)\,d\tau = \int_0^t \Big[r^T\big(N_d + N_{B1} - \beta_1\,\mathrm{sgn}(e_n)\big) + \dot{e}_n^TN_{B2} - \beta_3\|e_n\|^2 - \beta_4|R|^2\Big]d\tau.$

Using Eq. 3-4, integrating the first integral by parts, and integrating the second integral, yields

$\int_0^t L(\tau)\,d\tau = e_n^TN - e_n^T(0)N(0) - \int_0^t e_n^T\big(\dot{N}_B + \dot{N}_d\big)d\tau + \beta_1\sum_{i=1}^m|e_{ni}(0)| - \beta_1\sum_{i=1}^m|e_{ni}(t)| + \int_0^t \alpha_n e_n^T\big(N_d + N_{B1} - \beta_1\,\mathrm{sgn}(e_n)\big)d\tau - \int_0^t\big(\beta_3\|e_n\|^2 + \beta_4|R|^2\big)d\tau.$

Using the fact that $\|e_n\| \le \sum_{i=1}^m|e_{ni}|$, and using the bounds in Eqs. 3-32 and 3-33, yields

$\int_0^t L(\tau)\,d\tau \le \beta_1\sum_{i=1}^m|e_{ni}(0)| - e_n^T(0)N(0) - (\beta_1-\zeta_1-\zeta_2-\zeta_3)\|e_n\| - \int_0^t\Big(\beta_3 - \tfrac{7}{8}\zeta_2\Big)\|e_n\|^2\,d\tau - \int_0^t\Big(\beta_4 - \tfrac{\zeta_2}{8}\Big)|R|^2\,d\tau + \int_0^t \alpha_n\|e_n\|\Big(\zeta_1 + \zeta_2 + \tfrac{\zeta_5}{\alpha_n} + \tfrac{\zeta_6}{\alpha_n} - \beta_1\Big)d\tau.$

If the sufficient conditions in Eq. 3-42 are satisfied, then the following inequality holds

$\int_0^t L(\tau)\,d\tau \le \beta_1\sum_{i=1}^m|e_{ni}(0)| - e_n(0)^TN(0), \qquad i.e., \qquad \int_0^t L(\tau)\,d\tau \le P(0).$  (A-1)

Using Eqs. A-1 and 3-45, it can be shown that $P(z,R,t) \ge 0$.
A.2 Differential Inclusions and Generalized Solutions

Consider a system

$\dot{x} = f(x,t),$  (A-2)

where $x\in\mathbb{R}^n$ and $f:\mathbb{R}^n\times\mathbb{R}\to\mathbb{R}^n$. If the function $f$ is Lipschitz continuous in $x$ and piecewise continuous in $t$, existence and uniqueness of solutions can be studied and proved in the classical sense (Cauchy-Peano theorem). However, many practical systems with discontinuous right-hand sides exist, e.g., Coulomb friction, sliding mode, contact transition, etc. For such systems, there may be no solutions in the usual sense, and the notion of a solution has to be generalized to ensure its existence. One of the ways (if the function $f$ is discontinuous in $t$ and continuous in $x$, the solution to Eq. A-2 can instead be studied in the sense of Carathéodory) is to study the generalized solutions in Filippov's sense using the following differential inclusion

$\dot{x} \in K[f](x,t),$  (A-3)

where $f$ is Lebesgue measurable and locally bounded, and $K[\cdot]$ is defined as

$K[f](x,t) \triangleq \bigcap_{\delta>0}\ \bigcap_{\mu M=0}\ \overline{co}\,f\big(B(x,\delta)-M,\ t\big),$

where $\bigcap_{\mu M=0}$ denotes the intersection over all sets $M$ of Lebesgue measure zero, and $\overline{co}$ denotes the convex closure. In words, $K[\cdot]$ is the convex closure of the set of all possible limit values of $f$ in small neighborhoods of a given point $x$. If $x$ is absolutely continuous (i.e., differentiable a.e.) and satisfies Eq. A-3, then it is called a generalized solution (in Filippov's sense) of the differential equation Eq. A-2.

The differential equations of the closed-loop system, Eqs. 3-3, 3-4, 3-20, 3-23, 3-36, 3-38, and 3-45, have discontinuous right-hand sides. Specifically, they are continuous except in the set $\{(y,t)\,|\,\tilde{x}=0\}$, which has a Lebesgue measure of 0. Hence, the Filippov differential inclusion framework is used to ensure existence and uniqueness of solutions
(a.e.) for $\dot{y} = F(y,t)$, where $F$ denotes the right-hand sides of the differential equations of the closed-loop system. The function $F$ is Lebesgue measurable and locally bounded, and is continuous except in the set $\{(y,t)\,|\,\tilde{x}=0\}$. Stability of solutions based on the differential inclusion is studied using non-smooth Lyapunov functions, using the development in [79, 80]. The generalized time derivative of Eq. 3-47 exists almost everywhere (a.e.), and $\dot{V}(y) \in^{a.e.} \dot{\tilde{V}}(y)$, where

$\dot{\tilde{V}} = \bigcap_{\xi\in\partial V(y)} \xi^T K[F](y,t),$  (A-4)

where $\partial V$ is the generalized gradient of $V$ [78]. Since the Lyapunov function in Eq. 3-47 is a Lipschitz continuous regular function, the generalized time derivative in Eq. A-4 can be computed as

$\dot{\tilde{V}} = \nabla V^T K[F](y,t).$

The following relations from [80] are then used to arrive at equation Eq. 3-50:

1. If $f$ and $g$ are locally bounded functions, $K[f+g](x) \subset K[f](x) + K[g](x)$.
2. If $g:\mathbb{R}^m\to\mathbb{R}^{p\times n}$ is $C^0$ and $f:\mathbb{R}^m\to\mathbb{R}^n$ is locally bounded, $K[gf](x) = g(x)K[f](x)$.
3. If $f:\mathbb{R}^m\to\mathbb{R}^n$ is continuous, $K[f](x) = \{f(x)\}$.
4. $K[\mathrm{sgn}(x)] = \mathrm{SGN}(x)$, where $\mathrm{SGN}(\cdot)$ refers to the set-valued $\mathrm{sgn}(\cdot)$ function.
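As an illustration of relation 4 (a standard fact in the Filippov framework of [74], [75]), the set-valued map of the scalar signum function is

$K[\mathrm{sgn}](x) = \mathrm{SGN}(x) = \begin{cases} \{1\}, & x > 0, \\ [-1,\,1], & x = 0, \\ \{-1\}, & x < 0, \end{cases}$

so a non-smooth Lyapunov analysis that encounters $\mathrm{sgn}(\tilde{x})$ at $\tilde{x}=0$ must account for the entire interval $[-1,1]$ rather than a single value.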
APPENDIX B
ROBUST IDENTIFICATION-BASED STATE DERIVATIVE ESTIMATION FOR NONLINEAR SYSTEMS

B.1 Proof of Inequalities in Eqs. 4-12 - 4-14

Some preliminary inequalities are proved which will facilitate the proof of the inequalities in Eqs. 4-12-4-14. Using the triangle inequality in Eq. 4-2, the following bound can be obtained

$\|\dot{x}\| \le \|W_f\|\|\sigma_f\| + \|\varepsilon_f\| + \sum_{i=1}^m\big[\|W_{gi}\|\|\sigma_{gi}\| + \|\varepsilon_{gi}\|\big]\|u_i\| + \|d\| \le c_1,$  (B-1)

where Assumptions 4.2-4.3, 4.5-4.7 are used and $c_1\in\mathbb{R}$ is a computable constant. Using the triangle inequality in Eq. 4-3, and the fact that $\dot{\hat{x}} = \dot{x} - r + \alpha\tilde{x}$, the following bound can be obtained

$\|\dot{\hat{x}}\| \le \|\dot{x}\| + \|r\| + \alpha\|\tilde{x}\| \le c_1 + c_2\|z\|,$  (B-2)

where $c_2 \triangleq \max\{1,\alpha\}\in\mathbb{R}$. Using Assumptions 4.2-4.6, the projection bounds on the weight estimates in Eq. 4-7, and the bounds in Eqs. B-1 and B-2, the following bounds can be developed for the DNN weight update laws in Eq. 4-7

$\|\dot{\hat{W}}_f\| \le c_3\|\tilde{x}\| + c_4\|\tilde{x}\|\|z\|, \qquad \|\dot{\hat{V}}_f\| \le c_5\|\tilde{x}\| + c_6\|\tilde{x}\|\|z\|,$
$\|\dot{\hat{W}}_{gi}\| \le c_7\|\tilde{x}\| + c_8\|\tilde{x}\|\|z\|, \qquad \|\dot{\hat{V}}_{gi}\| \le c_9\|\tilde{x}\| + c_{10}\|\tilde{x}\|\|z\| \quad \forall i = 1\dots m,$  (B-3)

where $c_i\in\mathbb{R},\ i = 3\dots10$ are computable constants. Using Assumptions 4.1-4.3, the derivative of the dynamics in Eq. 4-2 yields

$\ddot{x} = W_f^T\sigma_f'V_f^T\ddot{x}_{\sigma} + \varepsilon_f'\dot{x} + \sum_{i=1}^m\Big[\big(W_{gi}^T\sigma_{gi}'V_{gi}^T\dot{x} + \varepsilon_{gi}'\dot{x}\big)u_i + \big(W_{gi}^T\sigma_{gi} + \varepsilon_{gi}\big)\dot{u}_i\Big] + \dot{d},$  (B-4)

where $\ddot{x}_{\sigma}$ denotes the argument rate $\dot{x}$ entering through the activation Jacobian, and using the triangle inequality yields the following bound

$\|\ddot{x}\| \le \|W_f\|\|\sigma_f'\|\|V_f\|\|\dot{x}\| + \|\varepsilon_f'\|\|\dot{x}\| + \sum_{i=1}^m\Big[\big(\|W_{gi}\|\|\sigma_{gi}'\|\|V_{gi}\|\|\dot{x}\| + \|\varepsilon_{gi}'\|\|\dot{x}\|\big)\|u_i\| + \big(\|W_{gi}\|\|\sigma_{gi}\| + \|\varepsilon_{gi}\|\big)\|\dot{u}_i\|\Big] + \|\dot{d}\| \le c_{11},$  (B-5)

where Assumptions 4.2-4.3, 4.5-4.7, and Eq. B-1 are used, and $c_{11}\in\mathbb{R}$ is a computable constant.

B.1.1 Proof of Inequality in Eq. 4-12

Using the triangle inequality in Eq. 4-9 yields

$\|\tilde{N}\| \le \alpha\|\dot{\tilde{x}}\| + \|\dot{\hat{W}}_f\|\|\hat{\sigma}_f\| + \|\hat{W}_f\|\|\hat{\sigma}_f'\|\|\dot{\hat{V}}_f\|\|\hat{x}\| + \tfrac{1}{2}\|W_f\|\|\hat{\sigma}_f'\|\|\hat{V}_f\|\|\dot{\tilde{x}}\| + \tfrac{1}{2}\|\hat{W}_f\|\|\hat{\sigma}_f'\|\|V_f\|\|\dot{\tilde{x}}\|$
$\phantom{\|\tilde{N}\| \le} + \sum_{i=1}^m\Big[\|\dot{\hat{W}}_{gi}\|\|\hat{\sigma}_{gi}\|\|u_i\| + \|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\dot{\hat{V}}_{gi}\|\|\hat{x}\|\|u_i\| + \tfrac{1}{2}\|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|V_{gi}\|\|\dot{\tilde{x}}\|\|u_i\| + \tfrac{1}{2}\|W_{gi}\|\|\hat{\sigma}_{gi}'\|\|\hat{V}_{gi}\|\|\dot{\tilde{x}}\|\|u_i\|\Big].$  (B-6)
andusingthetriangleinequalityyieldsthefollowingbound k x kk W f k rr 0 f rr k V f kk x k + rr 0f rr k x k + m X i =1 k W gi k rr 0 gi rr k V gi kk x k + rr 0gi rr k x k k u i k +[ k W gi kk gi k + k gi k ] k u i k )+ rrr d rrr ; c 11 ; (B{5) whereAssumptions 4.2 4.3 4.5 4.7 ,andEq. B{1 isused,and c 11 2 R isacomputable constant.B.1.1ProofofInequalityinEq. 4{12 UsingtriangleinequalityinEq. 4{9 yields rrr ~ N rrr rr ~ x rr + rrr ^ W f rrr k ^ f k + rrr ^ W f rrr rr ^ 0 f rr rrr ^ V f rrr k ^ x k + 1 2 rr W T f rr rr ^ 0 f rr rrr ^ V f rrr rr ~ x rr + 1 2 rrr ^ W f rrr rr ^ 0 f rr k V f k rr ~ x rr + m X i =1 h rrr ^ W gi rrr k ^ gi kk u i k + rrr ^ W gi rrr rr ^ 0 gi rr rrr ^ V gi rrr k ^ x kk u i k + 1 2 rrr ^ W gi rrr rr ^ 0 gi rr k V gi k rr ~ x rr k u i k + 1 2 rr W T gi rr rr ^ 0 gi rr rrr ^ V gi rrr rr ~ x rr k u i k : (B{6) UsingEq. 4{5 ,thefactthat k x k ; k r kk z k ,andtheboundsdevelopedinEqs. B{1 B{2 and B{3 ,theexpressioninEq. B{6 canbefurtherupperboundedas rrr ~ N rrr h k wf k rr ^ 0 f rr rrr ^ V f rrr k ^ f k ( c 3 + c 4 k z k )+ rrr ^ W f rrr rr ^ 0 f rr ( k x k + k ~ x k )( c 5 + c 6 k z k ) + + 2 + 1 2 c 2 rr W T f rr rr ^ 0 f rr rrr ^ V f rrr + 1 2 c 2 rrr ^ W f rrr rr ^ 0 f rr k V f k k z k + m X i =1 n k ^ gi kk u i k ( c 7 + c 8 k z k )+ c 2 rrr ^ W gi rrr rr ^ 0 gi rr k V gi kk u i k o # k z k + 1 2 m X i =1 rrr ^ W gi rrr rr ^ 0 gi rr k u i k ( k x k + k ~ x k )( c 9 + c 10 k z k ) # k z k + 1 2 m X i =1 1 2 c 2 rr W T gi rr rr ^ 0 gi rr rrr ^ V gi rrr k u i k # k z k 1 ( k z k ) k z k ; where 1 ( ) 2 R isapositive,globallyinvertible,non-decreasingfunction. 109
B.1.2 Proof of Inequalities in Eq. 4-13

Using the triangle inequality in Eq. 4-10 yields

$\|N_{B1}\| \le \sum_{i=1}^m\Big[\|W_{gi}\|\|\sigma_{gi}\|\|\dot{u}_i\| + c_1\|W_{gi}\|\|\sigma_{gi}'\|\|V_{gi}\|\|u_i\| + \|\varepsilon_{gi}\|\|\dot{u}_i\| + c_1\|\varepsilon_{gi}'\|\|u_i\|\Big] + c_1\|W_f\|\|\sigma_f'\|\|V_f\| + c_1\|\varepsilon_f'\| + \|\dot{d}\| + \tfrac{1}{2}c_1\|W_f\|\|\hat{\sigma}_f'\|\|\hat{V}_f\|$
$\phantom{\|N_{B1}\| \le} + \sum_{i=1}^m\Big[\tfrac{1}{2}c_1\|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|V_{gi}\|\|u_i\| + \tfrac{1}{2}c_1\|W_{gi}\|\|\hat{\sigma}_{gi}'\|\|\hat{V}_{gi}\|\|u_i\| + \|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}\|\|\dot{u}_i\|\Big] + \tfrac{1}{2}c_1\|\hat{W}_f\|\|\hat{\sigma}_f'\|\|V_f\| \le \zeta_1,$  (B-7)

where Assumptions 4.2-4.3, 4.5-4.7, and the projection bounds on the weight estimates in Eq. 4-7 are used, and $\zeta_1\in\mathbb{R}$ is a positive constant computed using the upper bounds of the terms in Eq. B-7. By replacing $\dot{\hat{x}}(t)$ by $\dot{x}(t)$ in Eq. 4-11, the expression for $N_{B2}(\cdot)$ can be obtained as

$N_{B2} \triangleq \sum_{i=1}^m\Big[\tfrac{1}{2}\tilde{W}_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{x}\,u_i + \tfrac{1}{2}\hat{W}_{gi}^T\hat{\sigma}_{gi}'\tilde{V}_{gi}^T\dot{x}\,u_i\Big] + \tfrac{1}{2}\tilde{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{x} + \tfrac{1}{2}\hat{W}_f^T\hat{\sigma}_f'\tilde{V}_f^T\dot{x}.$  (B-8)

Using the triangle inequality in Eq. B-8 yields

$\|N_{B2}\| \le \sum_{i=1}^m\Big[\tfrac{1}{2}c_1\|\tilde{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\hat{V}_{gi}\|\|u_i\| + \tfrac{1}{2}c_1\|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\tilde{V}_{gi}\|\|u_i\|\Big] + \tfrac{1}{2}c_1\|\tilde{W}_f\|\|\hat{\sigma}_f'\|\|\hat{V}_f\| + \tfrac{1}{2}c_1\|\hat{W}_f\|\|\hat{\sigma}_f'\|\|\tilde{V}_f\| \le \zeta_2,$  (B-9)

where Assumptions 4.2-4.3, 4.5-4.7 and the projection bounds on the weight estimates in Eq. 4-7 are used, and $\zeta_2\in\mathbb{R}$ is a positive constant computed using the upper bounds of the terms in Eq. B-9. Taking the derivative of $N_B \triangleq N_{B1} + N_{B2}$, and using Eqs. 4-11 and B-8, yields $\dot{N}_B(\cdot)$, which can be split as

$\dot{N}_B \triangleq \dot{N}_{Ba} + \dot{N}_{Bb},$  (B-10)
where $\dot{N}_{Ba}$ collects the terms of $\dot{N}_B$ involving only the ideal weights, the activation functions and their first two derivatives, and the bounded signals $\dot{x}$, $\ddot{x}$, $u_i$, $\dot{u}_i$, $\ddot{u}_i$, $\ddot{\varepsilon}_f$, $\ddot{\varepsilon}_{gi}$ and $\ddot{d}$,

$\dot{N}_{Ba} \triangleq \sum_{i=1}^m\Big[W_{gi}^T\sigma_{gi}''\big(V_{gi}^T\dot{x}\big)V_{gi}^T\dot{x}\,u_i + W_{gi}^T\sigma_{gi}'V_{gi}^T\ddot{x}\,u_i + 2W_{gi}^T\sigma_{gi}'V_{gi}^T\dot{x}\,\dot{u}_i + W_{gi}^T\sigma_{gi}\ddot{u}_i + \ddot{\varepsilon}_{gi}u_i + 2\dot{\varepsilon}_{gi}\dot{u}_i + \varepsilon_{gi}\ddot{u}_i\Big]$
$\phantom{\dot{N}_{Ba} \triangleq} - \sum_{i=1}^m\Big[\tfrac{1}{2}\hat{W}_{gi}^T\hat{\sigma}_{gi}'V_{gi}^T\ddot{x}\,u_i + \tfrac{1}{2}\hat{W}_{gi}^T\hat{\sigma}_{gi}'V_{gi}^T\dot{x}\,\dot{u}_i + \tfrac{1}{2}W_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\ddot{x}\,u_i + \tfrac{1}{2}W_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{x}\,\dot{u}_i + \hat{W}_{gi}^T\hat{\sigma}_{gi}\ddot{u}_i - \tfrac{1}{2}\tilde{W}_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\ddot{x}\,u_i - \tfrac{1}{2}\tilde{W}_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{x}\,\dot{u}_i\Big]$  (B-11)
$\phantom{\dot{N}_{Ba} \triangleq} + \ddot{\varepsilon}_f + \ddot{d} + W_f^T\sigma_f''\big(V_f^T\dot{x}\big)V_f^T\dot{x} + W_f^T\sigma_f'V_f^T\ddot{x} - \tfrac{1}{2}W_f^T\hat{\sigma}_f'\hat{V}_f^T\ddot{x} - \tfrac{1}{2}\hat{W}_f^T\hat{\sigma}_f'V_f^T\ddot{x} + \tfrac{1}{2}\tilde{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\ddot{x} + \tfrac{1}{2}\hat{W}_f^T\hat{\sigma}_f'\tilde{V}_f^T\ddot{x},$

while $\dot{N}_{Bb}$ collects the remaining terms, which involve the time derivatives of the weight estimates $\dot{\hat{W}}_f$, $\dot{\hat{V}}_f$, $\dot{\hat{W}}_{gi}$, $\dot{\hat{V}}_{gi}$, of the weight estimation errors $\dot{\tilde{W}}_f$, $\dot{\tilde{V}}_f$, $\dot{\tilde{W}}_{gi}$, $\dot{\tilde{V}}_{gi}$, and of the estimated activation Jacobians $\dot{\hat{\sigma}}_f'$, $\dot{\hat{\sigma}}_{gi}'$:

$\dot{N}_{Bb} \triangleq \sum_{i=1}^m\Big[\tfrac{1}{2}\big(\dot{\hat{W}}_{gi}^T\hat{\sigma}_{gi}' + \hat{W}_{gi}^T\dot{\hat{\sigma}}_{gi}'\big)\big(V_{gi}^T + \hat{V}_{gi}^T\big)\dot{x}\,u_i + \tfrac{1}{2}\big(\dot{\tilde{W}}_{gi}^T\hat{\sigma}_{gi}' + \tilde{W}_{gi}^T\dot{\hat{\sigma}}_{gi}'\big)\hat{V}_{gi}^T\dot{x}\,u_i + \tfrac{1}{2}\tilde{W}_{gi}^T\hat{\sigma}_{gi}'\dot{\hat{V}}_{gi}^T\dot{x}\,u_i$
$\phantom{\dot{N}_{Bb} \triangleq \sum}+ \tfrac{1}{2}\big(\dot{\hat{W}}_{gi}^T\hat{\sigma}_{gi}' + \hat{W}_{gi}^T\dot{\hat{\sigma}}_{gi}'\big)\tilde{V}_{gi}^T\dot{x}\,u_i + \tfrac{1}{2}\hat{W}_{gi}^T\hat{\sigma}_{gi}'\dot{\tilde{V}}_{gi}^T\dot{x}\,u_i - \dot{\hat{W}}_{gi}^T\hat{\sigma}_{gi}\dot{u}_i - \hat{W}_{gi}^T\dot{\hat{\sigma}}_{gi}\dot{u}_i\Big]$  (B-12)
$\phantom{\dot{N}_{Bb} \triangleq} - \tfrac{1}{2}\big(W_f^T + \hat{W}_f^T\big)\dot{\hat{\sigma}}_f'\hat{V}_f^T\dot{x} - \tfrac{1}{2}\big(\dot{\hat{W}}_f^T\hat{\sigma}_f' + \hat{W}_f^T\dot{\hat{\sigma}}_f'\big)V_f^T\dot{x} + \tfrac{1}{2}\big(\dot{\tilde{W}}_f^T\hat{\sigma}_f' + \tilde{W}_f^T\dot{\hat{\sigma}}_f'\big)\hat{V}_f^T\dot{x} + \tfrac{1}{2}\tilde{W}_f^T\hat{\sigma}_f'\dot{\hat{V}}_f^T\dot{x} + \tfrac{1}{2}\big(\dot{\hat{W}}_f^T\hat{\sigma}_f' + \hat{W}_f^T\dot{\hat{\sigma}}_f'\big)\tilde{V}_f^T\dot{x} + \tfrac{1}{2}\hat{W}_f^T\hat{\sigma}_f'\dot{\tilde{V}}_f^T\dot{x},$

where $\dot{\hat{\sigma}}_{gi}'$ denotes the time derivative of $\hat{\sigma}_{gi}'$. To develop upper bounds for Eqs. B-11 and B-12, the following bound will be used

$\|\dot{\hat{\sigma}}_{gi}'\| \le \|\hat{\sigma}_{gi}''\|\|\dot{\hat{V}}_{gi}\|\|\hat{x}\| + \|\hat{\sigma}_{gi}''\|\|\hat{V}_{gi}\|\|\dot{\hat{x}}\| \le \big(c_9\|\tilde{x}\| + c_{10}\|\tilde{x}\|\|z\|\big)\|\hat{\sigma}_{gi}''\|\big(\|x\| + \|\tilde{x}\|\big) + \big(c_1 + c_2\|z\|\big)\|\hat{\sigma}_{gi}''\|\|\hat{V}_{gi}\| \le c_{12} + \rho_0(\|z\|)\|z\|,$

where Eqs. B-2 and B-3 are used, $c_{12}\in\mathbb{R}^+$, and $\rho_0(\cdot)\in\mathbb{R}$ is a positive, globally invertible, non-decreasing function. Using the bound in Eq. B-5, $\|\dot{N}_{Ba}\|$ in Eq. B-11 can
be upper bounded using the triangle inequality: every term in Eq. B-11 is a product of ideal weights, activation functions and their first two derivatives, and the signals $\dot{x}$, $\ddot{x}$, $u_i$, $\dot{u}_i$, $\ddot{u}_i$, $\ddot{\varepsilon}_f$, $\ddot{\varepsilon}_{gi}$, and $\ddot{d}$, each of which is bounded by a constant using Assumptions 4.2-4.3, 4.5-4.7 and Eqs. B-1 and B-5. Hence the following bound can be developed

$\|\dot{N}_{Ba}\| \le \zeta_{31},$  (B-13)

where $\zeta_{31}\in\mathbb{R}^+$ is a computable constant. The bound in Eq. B-14 is developed by substituting the weight update law bounds of Eq. B-3, the bound on $\|\dot{\hat{\sigma}}_{gi}'\|$ above, and Eqs. B-1 and B-2 into each term of Eq. B-12; representative terms take the form

$\|\dot{N}_{Bb}\| \le \sum_{i=1}^m\Big[\tfrac{1}{2}c_1\big(c_7\|\tilde{x}\| + c_8\|\tilde{x}\|\|z\|\big)\|\hat{\sigma}_{gi}'\|\|V_{gi}\|\|u_i\| + \tfrac{1}{2}c_1\big(c_{12} + \rho_0(\|z\|)\|z\|\big)\|\hat{W}_{gi}\|\|V_{gi}\|\|u_i\| + \cdots\Big]$
$\phantom{\|\dot{N}_{Bb}\| \le} + \tfrac{1}{2}c_1\big(c_5\|\tilde{x}\| + c_6\|\tilde{x}\|\|z\|\big)\|W_f\|\|\hat{\sigma}_f'\| + \tfrac{1}{2}c_1\big(c_{12} + \rho_0(\|z\|)\|z\|\big)\|\hat{W}_f\|\|V_f\| + \cdots,$

where the elided terms are of the same two types: terms bounded by constants, and terms bounded by functions of the states,
which can be simplified by combining the terms bounded by constants and the terms bounded by a function of the states, as

$\|\dot{N}_{Bb}\| \le \zeta_{32} + \zeta_4\rho_2(\|z\|)\|z\|,$  (B-14)

where $\zeta_{32},\zeta_4\in\mathbb{R}^+$ are computable constants, and $\rho_2(\cdot)\in\mathbb{R}$ is a positive, globally invertible, non-decreasing function. From Eqs. B-10, B-13, and B-14, the following bound can be
obtained

$\|\dot{N}_B\| \le \zeta_3 + \zeta_4\rho_2(\|z\|)\|z\|,$

where $\zeta_3 \triangleq \zeta_{31} + \zeta_{32}$.

B.1.3 Proof of Inequality in Eq. 4-14

Using the definition $\tilde{N}_{B2} \triangleq \hat{N}_{B2} - N_{B2}$,

$\dot{\tilde{x}}^T\tilde{N}_{B2} = \dot{\tilde{x}}^T\big(\hat{N}_{B2} - N_{B2}\big) = \dot{\tilde{x}}^T\sum_{i=1}^m\Big[\tfrac{1}{2}\tilde{W}_{gi}^T\hat{\sigma}_{gi}'\hat{V}_{gi}^T\dot{\tilde{x}}\,u_i + \tfrac{1}{2}\hat{W}_{gi}^T\hat{\sigma}_{gi}'\tilde{V}_{gi}^T\dot{\tilde{x}}\,u_i\Big] + \tfrac{1}{2}\dot{\tilde{x}}^T\tilde{W}_f^T\hat{\sigma}_f'\hat{V}_f^T\dot{\tilde{x}} + \tfrac{1}{2}\dot{\tilde{x}}^T\hat{W}_f^T\hat{\sigma}_f'\tilde{V}_f^T\dot{\tilde{x}},$

which can be upper bounded using the triangle inequality as

$\big|\dot{\tilde{x}}^T\tilde{N}_{B2}\big| \le \tfrac{1}{2}\|\dot{\tilde{x}}\|^2\Big\{\sum_{i=1}^m\Big[\|\tilde{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\hat{V}_{gi}\|\|u_i\| + \|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\tilde{V}_{gi}\|\|u_i\|\Big] + \|\tilde{W}_f\|\|\hat{\sigma}_f'\|\|\hat{V}_f\| + \|\hat{W}_f\|\|\hat{\sigma}_f'\|\|\tilde{V}_f\|\Big\}.$  (B-15)

Using the fact that $\|\dot{\tilde{x}}\|^2 = \|r - \alpha\tilde{x}\|^2 = (r-\alpha\tilde{x})^T(r-\alpha\tilde{x}) \le \|r\|^2 + \alpha^2\|\tilde{x}\|^2 + 2\alpha\|r\|\|\tilde{x}\| \le (1+\alpha)\|r\|^2 + \alpha(1+\alpha)\|\tilde{x}\|^2$, Eq. B-15 can be further upper bounded as

$\big|\dot{\tilde{x}}^T\tilde{N}_{B2}\big| \le \tfrac{1}{2}\big[(1+\alpha)\|r\|^2 + \alpha(1+\alpha)\|\tilde{x}\|^2\big]\Big\{\|\tilde{W}_f\|\|\hat{\sigma}_f'\|\|\hat{V}_f\| + \|\hat{W}_f\|\|\hat{\sigma}_f'\|\|\tilde{V}_f\| + \sum_{i=1}^m\Big[\|\tilde{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\hat{V}_{gi}\|\|u_i\| + \|\hat{W}_{gi}\|\|\hat{\sigma}_{gi}'\|\|\tilde{V}_{gi}\|\|u_i\|\Big]\Big\}.$

Using Assumptions 4.2-4.3, 4.5-4.7, the following bound can be obtained

$\big|\dot{\tilde{x}}^T\tilde{N}_{B2}\big| \le \zeta_5\|\tilde{x}\|^2 + \zeta_6\|r\|^2,$  (B-16)

where $\zeta_5,\zeta_6\in\mathbb{R}^+$ are computable constants.
B.2 Derivation of Sufficient Conditions in Eq. 4-18

Integrating Eq. 4-17 yields

$\int_0^t L(\tau)\,d\tau = \int_0^t\Big[r^T\big(N_{B1}(\tau) - \beta_1\,\mathrm{sgn}(\tilde{x})\big) + \dot{\tilde{x}}(\tau)^TN_{B2}(\tau) - \beta_2\rho_2(\|z\|)\|z\|\|\tilde{x}\|\Big]d\tau$
$= \tilde{x}^TN_B - \tilde{x}^T(0)N_B(0) - \int_0^t\tilde{x}^T\dot{N}_B\,d\tau + \beta_1\sum_{i=1}^n|\tilde{x}_i(0)| - \beta_1\sum_{i=1}^n|\tilde{x}_i(t)| + \int_0^t\alpha\tilde{x}^T\big(N_{B1} - \beta_1\,\mathrm{sgn}(\tilde{x})\big)d\tau - \int_0^t\beta_2\rho_2(\|z\|)\|z\|\|\tilde{x}\|\,d\tau,$

where Eq. 4-8 is used. Using the fact that $\|\tilde{x}\| \le \sum_{i=1}^n|\tilde{x}_i|$, and using the bounds in Eq. 4-13, yields

$\int_0^t L(\tau)\,d\tau \le \beta_1\sum_{i=1}^n|\tilde{x}_i(0)| - \tilde{x}^T(0)N_B(0) - (\beta_1 - \zeta_1 - \zeta_2)\|\tilde{x}\| - \int_0^t\alpha\Big(\beta_1 - \zeta_1 - \frac{\zeta_3}{\alpha}\Big)\|\tilde{x}\|\,d\tau - \int_0^t(\beta_2 - \zeta_4)\rho_2(\|z\|)\|z\|\|\tilde{x}\|\,d\tau.$

If the sufficient conditions in Eq. 4-18 are satisfied, then the following inequality holds

$\int_0^t L(\tau)\,d\tau \le \beta_1\sum_{i=1}^n|\tilde{x}_i(0)| - \tilde{x}^T(0)N_B(0) = P(0).$  (B-17)

Using Eqs. 4-16 and B-17, it can be shown that $P(z,t) \ge 0$.
REFERENCES

[1] R. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9-44, 1988.

[2] R. Sutton and A. Barto, Introduction to Reinforcement Learning. Cambridge, MA: MIT Press, 1998.

[3] R. Sutton, A. Barto, and R. Williams, "Reinforcement learning is direct adaptive optimal control," IEEE Contr. Syst. Mag., vol. 12, no. 2, pp. 19-22, 1992.

[4] B. Widrow, N. Gupta, and S. Maitra, "Punish/reward: Learning with a critic in adaptive threshold systems," IEEE Trans. Syst. Man Cybern., vol. 3, no. 5, pp. 455-465, 1973.

[5] P. Werbos, "Approximate dynamic programming for real-time control and neural modeling," in Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, D. A. White and D. A. Sofge, Eds. New York: Van Nostrand Reinhold, 1992.

[6] D. V. Prokhorov and D. C. Wunsch, II, "Adaptive critic designs," IEEE Trans. Neural Networks, vol. 8, pp. 997-1007, 1997.

[7] S. Ferrari and R. Stengel, "An adaptive critic global controller," in Proc. Am. Control Conf., vol. 4, 2002.

[8] J. Murray, C. Cox, G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., vol. 32, no. 2, pp. 140-153, 2002.

[9] X. Liu and S. Balakrishnan, "Convergence analysis of adaptive critic based optimal control," in Proc. Am. Control Conf., vol. 3, 2000.

[10] T. Dierks, B. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Networks, vol. 22, no. 5-6, pp. 851-860, 2009.

[11] R. Padhi, S. Balakrishnan, and T. Randolph, "Adaptive-critic based optimal neuro control synthesis for distributed parameter systems," Automatica, vol. 37, no. 8, pp. 1223-1234, 2001.

[12] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Trans. Neural Networks, vol. 18, no. 3, pp. 631-647, 2007.

[13] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237-246, 2009.
[14] K. Vamvoudakis and F. Lewis, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, pp. 878-888, 2010.

[15] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Control Signals Syst., vol. 2, pp. 303-314, 1989.

[16] A. Barron, "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930-945, 1993.

[17] P. Werbos, "A menu of designs for reinforcement learning over time," Neural Networks for Control, pp. 67-95, 1990.

[18] F. L. Lewis, R. Selmic, and J. Campos, Neuro-Fuzzy Control of Industrial Systems with Actuator Nonlinearities. Philadelphia, PA: Society for Industrial and Applied Mathematics, 2002.

[19] A. Barto, R. Sutton, and C. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst. Man Cybern., vol. 13, no. 5, pp. 834-846, 1983.

[20] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279-292, 1992.

[21] P. Werbos, "Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research," IEEE Trans. Syst. Man Cybern., vol. 17, no. 1, pp. 7-20, 1987.

[22] R. Bellman, Dynamic Programming. Dover Publications, Inc., 2003.

[23] J. Si and Y. Wang, "On-line learning control by association and reinforcement," IEEE Trans. Neural Networks, vol. 12, no. 2, pp. 264-276, 2001.

[24] J. Si, A. Barto, W. Powell, and D. Wunsch, Eds., Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press, 2004.

[25] G. Venayagamoorthy, R. Harley, and D. Wunsch, "Comparison of heuristic dynamic programming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator," IEEE Trans. Neural Networks, vol. 13, no. 3, pp. 764-773, 2002.

[26] ---, "Dual heuristic programming excitation neurocontrol for generators in a multimachine power system," IEEE Trans. Ind. Appl., vol. 39, no. 2, pp. 382-394, 2003.

[27] S. Ferrari and R. Stengel, "Online adaptive critic flight control," J. Guid. Control Dynam., vol. 27, no. 5, pp. 777-786, 2004.
[28] S. Jagannathan and G. Galan, "Adaptive critic neural network-based object grasping control using a three-finger gripper," IEEE Trans. Neural Networks, vol. 15, no. 2, pp. 395-407, 2004.

[29] D. Han and S. Balakrishnan, "State-constrained agile missile control with adaptive-critic-based neural networks," IEEE Trans. Control Syst. Technol., vol. 10, no. 4, pp. 481-489, 2002.

[30] C. Anderson, D. Hittle, M. Kretchmar, and P. Young, "Robust reinforcement learning for heating, ventilation, and air conditioning control of buildings," in Handbook of Learning and Approximate Dynamic Programming, J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Eds. Wiley-IEEE Press, August 2004, pp. 517-529.

[31] T. Landelius, "Reinforcement learning and distributed local model synthesis," Ph.D. dissertation, Linkoping University, Sweden, 1997.

[32] D. Prokhorov, R. Santiago, and D. Wunsch, "Adaptive critic designs: A case study for neurocontrol," Neural Networks, vol. 8, no. 9, pp. 1367-1372, 1995.

[33] S. Bradtke, B. Ydstie, and A. Barto, "Adaptive linear quadratic control using policy iteration," in Proc. Am. Control Conf. IEEE, 1994, pp. 3475-3479.

[34] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control," Automatica, vol. 43, pp. 473-481, 2007.

[35] D. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 2007.

[36] D. White and D. Sofge, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold Company, 1992.

[37] D. Kleinman, "On an iterative technique for Riccati equation computations," IEEE Trans. Autom. Contr., vol. 13, no. 1, pp. 114-115, 1968.

[38] L. Baird, "Advantage updating," Wright Lab, Wright-Patterson Air Force Base, OH, Tech. Rep., 1993.

[39] K. Doya, "Reinforcement learning in continuous time and space," Neural Comput., vol. 12, no. 1, pp. 219-245, 2000.

[40] R. Beard, G. Saridis, and J. Wen, "Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation," Automatica, vol. 33, pp. 2159-2178, 1997.

[41] M. Abu-Khalaf and F. Lewis, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779-791, 2005.
[42] D. Vrabie, M. Abu-Khalaf, F. Lewis, and Y. Wang, "Continuous-time ADP for linear systems with partially unknown dynamics," in Proc. IEEE Int. Symp. Approx. Dyn. Program. Reinf. Learn., 2007, pp. 247-253.

[43] J. Campos and F. Lewis, "Adaptive critic neural network for feedforward compensation," in Proc. Am. Control Conf., vol. 4, 1999.

[44] O. Kuljaca and F. Lewis, "Adaptive critic design using non-linear network structures," Int. J. Adapt. Control Signal Process., vol. 17, no. 6, pp. 431-445, 2003.

[45] Y. Kim and F. Lewis, High-Level Feedback Control with Neural Networks. World Scientific Pub Co Inc, 1998.

[46] P. M. Patre, W. MacKunis, C. Makkar, and W. E. Dixon, "Asymptotic tracking for systems with structured and unstructured uncertainties," IEEE Trans. Control Syst. Technol., vol. 16, no. 2, pp. 373-379, 2008.

[47] B. Xian, D. M. Dawson, M. S. de Queiroz, and J. Chen, "A continuous asymptotic tracking control strategy for uncertain nonlinear systems," IEEE Trans. Autom. Control, vol. 49, pp. 1206-1211, 2004.

[48] R. Howard, Dynamic Programming and Markov Processes. Technology Press of Massachusetts Institute of Technology (Cambridge), 1960.

[49] J. Tsitsiklis, "On the convergence of optimistic policy iteration," The Journal of Machine Learning Research, vol. 3, pp. 59-72, 2003.

[50] R. Sutton, "Generalization in reinforcement learning: Successful examples using sparse coarse coding," Advances in Neural Information Processing Systems, pp. 1038-1044, 1996.

[51] L. Kaelbling, M. Littman, and A. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.

[52] D. Kirk, Optimal Control Theory: An Introduction. Dover Pubns, 2004.

[53] M. Crandall and P. Lions, "Viscosity solutions of Hamilton-Jacobi equations," Transactions of the American Mathematical Society, vol. 277, no. 1, pp. 1-42, 1983.

[54] M. Bardi and I. Dolcetta, Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Springer, 1997.

[55] J. Betts, Practical Methods for Optimal Control Using Nonlinear Programming. Society for Industrial Mathematics, 2001, no. 3.

[56] Q. Gong, W. Kang, and I. Ross, "A pseudospectral method for the optimal control of constrained feedback linearizable systems," IEEE Trans. Autom. Contr., vol. 51, no. 7, pp. 1115-1129, 2006.
[57] R. Freeman and P. Kokotovic, "Optimal nonlinear controllers for feedback linearizable systems," in Proc. Am. Control Conf., Jun. 1995, pp. 2722-2726.

[58] K. Dupree, P. Patre, Z. Wilcox, and W. Dixon, "Asymptotic optimal control of uncertain nonlinear Euler-Lagrange systems," Automatica, 2010.

[59] R. Sepulchre, M. Jankovic, and P. V. Kokotovic, Constructive Nonlinear Control. New York: Springer-Verlag, 1997.

[60] M. Krstic and Z.-H. Li, "Inverse optimal design of input-to-state stabilizing nonlinear controllers," IEEE Trans. Autom. Control, vol. 43, no. 3, pp. 336-350, March 1998.

[61] M. Krstic, "Inverse optimal adaptive control - the interplay between update laws, control laws, and Lyapunov functions," in Proc. Am. Control Conf., 2009, pp. 1250-1255.

[62] D. Mayne and H. Michalska, "Receding horizon control of nonlinear systems," IEEE Trans. Autom. Contr., vol. 35, no. 7, pp. 814-824, 1990.

[63] M. Morari and J. Lee, "Model predictive control: past, present and future," Computers & Chemical Engineering, vol. 23, no. 4-5, pp. 667-682, 1999.

[64] B. Foss, T. Johansen, and A. Sorensen, "Nonlinear predictive control using local models - applied to a batch fermentation process," Control Engineering Practice, vol. 3, no. 3, pp. 389-396, 1995.

[65] J. Richalet, A. Rault, J. Testud, and J. Papon, "Model predictive heuristic control: Applications to industrial processes," Automatica, vol. 14, no. 5, pp. 413-428, 1978.

[66] G. Saridis and C. Lee, "An approximation theory of optimal control for trainable manipulators," IEEE Trans. Syst. Man Cybern., vol. 9, no. 3, 1979.

[67] S. Balakrishnan, "Adaptive-critic-based neural networks for aircraft optimal control," J. Guid. Contr. Dynam., vol. 19, no. 4, pp. 893-898, 1996.

[68] R. Padhi, N. Unnikrishnan, X. Wang, and S. Balakrishnan, "A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems," Neural Networks, vol. 19, no. 10, pp. 1648-1660, 2006.

[69] P. He and S. Jagannathan, "Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints," IEEE Trans. Syst. Man Cybern. Part B Cybern., vol. 37, no. 2, pp. 425-436, 2007.

[70] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst. Man Cybern. Part B Cybern., vol. 38, pp. 943-949, 2008.
[71] P. M. Patre, W. MacKunis, K. Kaiser, and W. E. Dixon, "Asymptotic tracking for uncertain dynamic systems via a multilayer neural network feedforward and RISE feedback control structure," IEEE Trans. Autom. Control, vol. 53, no. 9, pp. 2180-2185, 2008.

[72] W. E. Dixon, A. Behal, D. M. Dawson, and S. Nagarkatti, Nonlinear Control of Engineering Systems: A Lyapunov-Based Approach. Birkhauser Boston, 2003.

[73] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos, Nonlinear and Adaptive Control Design. John Wiley & Sons, 1995.

[74] A. Filippov, "Differential equations with discontinuous right-hand side," Am. Math. Soc. Transl., vol. 42, no. 2, pp. 199-231, 1964.

[75] ---, Differential Equations with Discontinuous Right-Hand Side. Netherlands: Kluwer Academic Publishers, 1988.

[76] G. V. Smirnov, Introduction to the Theory of Differential Inclusions. American Mathematical Society, 2002.

[77] J. P. Aubin and H. Frankowska, Set-Valued Analysis. Birkhauser, 2008.

[78] F. H. Clarke, Optimization and Nonsmooth Analysis. SIAM, 1990.

[79] D. Shevitz and B. Paden, "Lyapunov stability theory of nonsmooth systems," IEEE Trans. Autom. Control, vol. 39, no. 9, pp. 1910-1914, 1994.

[80] B. Paden and S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Trans. Circuits Syst., vol. 34, no. 1, pp. 73-82, 1987.

[81] F. L. Lewis, "Nonlinear network structures for feedback control," Asian J. Control, vol. 1, no. 4, pp. 205-228, 1999.

[82] M. Niethammer, P. Menold, and F. Allgower, "Parameter and derivative estimation for nonlinear continuous-time system identification," in 5th IFAC Symposium on Nonlinear Control Systems (NOLCOS 01), Russia, 2001.

[83] T. Floquet, J. Barbot, W. Perruquetti, and M. Djemai, "On the robust fault detection via a sliding mode disturbance observer," Int. J. Control, vol. 77, no. 7, pp. 622-629, 2004.

[84] W. Xu, J. Han, and S. Tso, "Experimental study of contact transition control incorporating joint acceleration feedback," IEEE/ASME Trans. Mechatron., vol. 5, no. 3, pp. 292-301, 2000.

[85] P. Schmidt and R. Lorenz, "Design principles and implementation of acceleration feedback to improve performance of DC drives," IEEE Trans. Ind. Appl., vol. 28, no. 3, pp. 594-599, 1992.


[86] N. Olgac, H. Elmali, M. Hosek, and M. Renzulli, "Active vibration control of distributed systems using delayed resonator with acceleration feedback," J. Dyn. Syst. Meas. Control, vol. 119, p. 380, 1997.
[87] K. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 4–27, 1990.
[88] M. Polycarpou and P. Ioannou, "Identification and control of nonlinear systems using neural network models: Design and stability analysis," Systems Report 91-0901, University of Southern California, 1991.
[89] G. A. Rovithakis and M. A. Christodoulou, "Adaptive control of unknown plants using dynamical neural networks," IEEE Trans. Syst. Man Cybern., vol. 24, pp. 400–412, 1994.
[90] A. Poznyak, W. Yu, E. Sanchez, and J. Perez, "Nonlinear adaptive trajectory tracking using dynamic neural networks," IEEE Trans. Neural Networks, vol. 10, no. 6, pp. 1402–1411, 1999.
[91] W. Yu, A. Poznyak, and X. Li, "Multilayer dynamic neural networks for non-linear system on-line identification," Int. J. Control, vol. 74, no. 18, pp. 1858–1864, 2001.
[92] R. Sanner and J. Slotine, "Stable recursive identification using radial basis function networks," in Proc. Am. Control Conf., 1992, pp. 1829–1833.
[93] S. Lu and T. Basar, "Robust nonlinear system identification using neural-network models," IEEE Trans. Neural Networks, vol. 9, no. 3, pp. 407–429, 1998.
[94] J. Huang and F. Lewis, "Neural-network predictive control for nonlinear dynamic systems with time-delay," IEEE Trans. Neural Networks, vol. 14, no. 2, pp. 377–389, 2003.
[95] S. Ibrir, "Online exact differentiation and notion of asymptotic algebraic observers," IEEE Trans. Autom. Control, vol. 48, no. 11, pp. 2055–2060, 2003.
[96] L. Vasiljevic and H. Khalil, "Error bounds in differentiation of noisy signals by high-gain observers," Systems & Control Letters, vol. 57, no. 10, pp. 856–862, 2008.
[97] A. Levant, "Robust exact differentiation via sliding mode technique," Automatica, vol. 34, no. 3, pp. 379–384, 1998.
[98] M. Gupta, L. Jin, and N. Homma, Static and Dynamic Neural Networks: From Fundamentals to Advanced Theory. Wiley-IEEE Press, 2003.
[99] K. Funahashi and Y. Nakamura, "Approximation of dynamical systems by continuous-time recurrent neural networks," Neural Networks, vol. 6, pp. 801–806, 1993.


[100] H. Khalil and F. Esfandiari, "Semiglobal stabilization of a class of nonlinear systems using output feedback," IEEE Trans. Autom. Control, vol. 38, no. 9, pp. 1412–1415, 1993.
[101] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[102] V. Konda and J. Tsitsiklis, "On actor-critic algorithms," SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[103] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359–366, 1989.
[104] S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence, and Robustness. Upper Saddle River, NJ: Prentice-Hall, 1989.
[105] A. F. Filippov, Differential Equations with Discontinuous Righthand Sides. Kluwer Academic Publishers, 1988, pp. 48–122.
[106] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice Hall, 2002.
[107] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[108] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010.
[109] G. Lendaris, L. Schultz, and T. Shannon, "Adaptive critic design for intelligent steering and speed control of a 2-axle vehicle," in Int. Joint Conf. Neural Netw., 2000, pp. 73–78.
[110] J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Nat. Acad. Sci. U.S.A., vol. 81, no. 10, pp. 3088–3092, 1984.
[111] A. Poznyak, E. Sanchez, and W. Yu, Differential Neural Networks for Robust Nonlinear Control: Identification, State Estimation and Trajectory Tracking. World Scientific Pub Co Inc, 2001.
[112] P. Mehta and S. Meyn, "Q-learning and Pontryagin's minimum principle," in Proc. IEEE Conf. Decis. Control, 2009, pp. 3598–3605.
[113] G. Chowdhary and E. Johnson, "Concurrent learning for convergence in adaptive control without persistency of excitation," in Proc. IEEE Conf. Decis. Control, 2010, pp. 3674–3679.
[114] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory. SIAM, 1999.


[115] D. Vrabie and F. Lewis, "Integral reinforcement learning for online computation of feedback Nash strategies of nonzero-sum differential games," in Proc. IEEE Conf. Decis. Control, 2010, pp. 3066–3071.
[116] K. Vamvoudakis and F. Lewis, "Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton-Jacobi equations," Automatica, 2011.
[117] T. Jaakkola, S. Singh, and M. Jordan, "Reinforcement learning algorithm for partially observable Markov decision problems," in Advances in Neural Information Processing Systems, 1995, pp. 345–352.


BIOGRAPHICAL SKETCH

Shubhendu Bhasin was born in Delhi, India, in 1982. He received his Bachelor of Engineering degree in manufacturing processes and automation engineering from Netaji Subhas Institute of Technology, University of Delhi, India, in 2004. From August 2004 to March 2006, he worked at Tata Elxsi Ltd., Bangalore, as a Design and Development Engineer in their embedded systems division. Thereafter, he joined Conexant Systems Ltd., Noida, where he worked as a Software Engineer until July 2006. He then joined the Nonlinear Controls and Robotics (NCR) research lab in the Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, to pursue his M.S. and doctoral research under the advisement of Dr. Warren E. Dixon. He received his M.S. in mechanical engineering in the spring of 2009 and his Ph.D. in the summer of 2011. His research interests include reinforcement learning-based control, approximate dynamic programming, differential games, and robust and adaptive control of mechanical systems.