Citation
Model-Based Reinforcement Learning for Online Approximate Optimal Control

Material Information

Title:
Model-Based Reinforcement Learning for Online Approximate Optimal Control
Creator:
Kamalapurkar, Rushikesh L
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
2014
Language:
English
Physical Description:
1 online resource (189 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Mechanical Engineering
Mechanical and Aerospace Engineering
Committee Chair:
DIXON, WARREN E
Committee Co-Chair:
BAROOAH, PRABIR
Committee Members:
KUMAR, MRINAL
RAO, ANIL
MEYN, SEAN PETER
Graduation Date:
8/9/2014

Subjects

Subjects / Keywords:
Approximation ( jstor )
Differential games ( jstor )
Error rates ( jstor )
Identifiers ( jstor )
Liapunov functions ( jstor )
Mathematical independent variables ( jstor )
Optimal control ( jstor )
Signals ( jstor )
Simulations ( jstor )
Trajectories ( jstor )
Mechanical and Aerospace Engineering -- Dissertations, Academic -- UF
adaptive -- control -- learning -- lyapunov -- nonlinear -- optimal -- reinforcement
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Mechanical Engineering thesis, Ph.D.

Notes

Abstract:
The objective of an optimal control synthesis problem is to compute the policy that an agent should follow in order to maximize the accumulated reward. Analytical solution of optimal control problems is often impossible when the system dynamics are nonlinear. Many numerical solution techniques are available to solve optimal control problems; however, such methods generally require perfect model knowledge and may not be implementable in real-time.

Inroads to solve optimal control problems for nonlinear systems can be made through insights gained from examining the value function. Under a given policy, the value function provides a map from the state space to the set of real numbers that measures the value of a state, generally defined as the total accumulated reward starting from that state. If the value function is known, a reasonable strategy is to apply control to drive the states towards increasing value. If the value function is unknown, a reasonable strategy is to use input-output data to estimate the value function online, and use the estimate to compute the control input. Reinforcement learning (RL)-based optimal control synthesis techniques implement the aforementioned strategy by approximating the value function using a parametric approximation scheme. The approximate optimal policy is then computed based on the approximate value function.

RL-based techniques are valuable not only as online optimization tools but also as control synthesis tools. In discrete-time stochastic systems with countable state and action spaces, RL-based techniques have demonstrated the ability to synthesize stabilizing policies with minimal knowledge of the structure of the system. Techniques such as Q-learning have been shown to be effective tools to generate stabilizing policies based on input-output data without any other information about the system. RL thus offers a potential alternative to traditional control design techniques. However, the extensions of RL techniques to continuous-time systems that evolve on a continuous state-space are scarce, and often require more information about the system than just input-output data.

This dissertation investigates extending the applicability of RL-based techniques in a continuous-time deterministic setting to generate approximate optimal policies online by relaxing some of the limitations imposed by the continuous-time nature of the problem. State-of-the-art implementations of RL in continuous-time systems require a restrictive persistence of excitation (PE) condition for convergence to optimality. In this dissertation, model-based RL is implemented via simulation of experience to relax the restrictive PE condition. The RL-based approach is also extended to obtain approximate feedback-Nash equilibrium solutions to N-player nonzero-sum games.

In trajectory tracking problems, since the error dynamics are nonautonomous, the value function depends explicitly on time. Since universal function approximators can approximate functions with arbitrary accuracy only on compact domains, value functions for infinite-horizon optimal tracking problems cannot be approximated with arbitrary accuracy using universal function approximators. Hence, the extension of RL-based techniques to optimal tracking problems for continuous-time nonlinear systems has remained a non-trivial open problem. In this dissertation, RL-based approaches are extended to solve trajectory tracking problems by using the desired trajectory, in addition to the tracking error, as an input to learn the value function.

Distributed control of groups of multiple interacting agents is a challenging problem with multiple practical applications. When the agents possess cooperative or competitive objectives, the trajectory and the decisions of each agent are affected by the trajectories and decisions of the neighboring agents. The external influence renders the dynamics of each agent nonautonomous; hence, optimization in a network of agents presents challenges similar to the optimal tracking problem. The interaction between the agents in a network is often modeled as a differential game on a graph, defined by coupled dynamics and coupled cost functions. Using insights gained from the tracking problem, this dissertation extends the model-based RL technique to generate feedback-Nash equilibrium optimal policies online for agents in a network with cooperative or competitive objectives. In particular, the network of agents is separated into autonomous subgraphs, and the differential game is solved separately on each subgraph.

The applicability of the developed methods is demonstrated through simulations, and to illustrate their effectiveness, comparative simulations are presented wherever alternate methods exist to solve the problem under consideration. The dissertation concludes with a discussion about the limitations of the developed technique, and further extensions of the technique are proposed along with the possible approaches to achieve them. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: DIXON, WARREN E.
Local:
Co-adviser: BAROOAH, PRABIR.
Statement of Responsibility:
by Rushikesh L Kamalapurkar.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Resource Identifier:
969976957 ( OCLC )
Classification:
LD1780 2014 ( lcc )

Full Text

MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE OPTIMAL CONTROL

By

RUSHIKESH LAMBODAR KAMALAPURKAR

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2014

© 2014 Rushikesh Lambodar Kamalapurkar

To my parents Arun and Sarita Kamalapurkar for their invaluable support

ACKNOWLEDGMENTS

I would like to express sincere gratitude towards Dr. Warren E. Dixon, whose constant encouragement and support have been instrumental in my academic success. As my academic advisor, he has provided me with valuable advice regarding research. As a mentor, he has played a central role in preparing me for my academic career by inspiring me to do independent research, providing me valuable insights into the nitty-gritties of an academic career, and helping me hone my grant writing skills. I would also like to extend my gratitude towards my committee members Dr. Prabir Barooah, Dr. Mrinal Kumar, Dr. Anil Rao, and Dr. Sean Meyn, and my professors Dr. Paul Robinson and Dr. Michael Jury for their time, the valuable recommendations they provided, and for being excellent teachers from whom I have drawn a lot of knowledge and inspiration. I would also like to thank my colleagues at the University of Florida Nonlinear Controls and Robotics laboratory for countless fruitful discussions that have helped shape the ideas in this dissertation. I acknowledge that this dissertation would not have been possible without the support and encouragement provided by my family and my friends and without the financial support provided by the National Science Foundation and the Office of Naval Research.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
   1.1 Motivation
   1.2 Literature Review
   1.3 Outline of the Dissertation
   1.4 Contributions
       1.4.1 Approximate Optimal Regulation
       1.4.2 N-player Nonzero-sum Differential Games
       1.4.3 Approximate Optimal Tracking
       1.4.4 Model-based Reinforcement Learning for Approximate Optimal Tracking
       1.4.5 Differential Graphical Games

2 PRELIMINARIES
   2.1 Notation
   2.2 Problem Formulation
   2.3 Exact Solution
   2.4 Value Function Approximation
   2.5 RL-based Online Implementation
   2.6 LIP Approximation of the Value Function
   2.7 Uncertainties in System Dynamics

3 MODEL-BASED REINFORCEMENT LEARNING FOR APPROXIMATE OPTIMAL REGULATION
   3.1 Motivation
   3.2 System Identification
       3.2.1 CL-based Parameter Update
       3.2.2 Convergence Analysis
   3.3 Approximate Optimal Control
       3.3.1 Value Function Approximation
       3.3.2 Simulation of Experience via BE Extrapolation

   3.4 Stability Analysis
   3.5 Simulation
       3.5.1 Problem with a Known Basis
       3.5.2 Problem with an Unknown Basis
   3.6 Concluding Remarks

4 MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE FEEDBACK-NASH EQUILIBRIUM SOLUTION OF N-PLAYER NONZERO-SUM DIFFERENTIAL GAMES
   4.1 Problem Formulation and Exact Solution
   4.2 Approximate Solution
       4.2.1 System Identification
       4.2.2 Value Function Approximation
   4.3 Stability Analysis
   4.4 Simulation
       4.4.1 Problem Setup
       4.4.2 Analytical Solution
       4.4.3 Simulation Parameters
       4.4.4 Simulation Results
   4.5 Concluding Remarks

5 EXTENSION TO APPROXIMATE OPTIMAL TRACKING
   5.1 Formulation of Time-invariant Optimal Control Problem
   5.2 Approximate Optimal Solution
   5.3 Stability Analysis
       5.3.1 Supporting Lemmas
       5.3.2 Gain Conditions and Gain Selection
       5.3.3 Main Result
   5.4 Simulation
   5.5 Concluding Remarks

6 MODEL-BASED REINFORCEMENT LEARNING FOR APPROXIMATE OPTIMAL TRACKING
   6.1 Problem Formulation and Exact Solution
   6.2 System Identification
   6.3 Value Function Approximation
   6.4 Simulation of Experience
   6.5 Stability Analysis
   6.6 Simulation
       6.6.1 Nonlinear System
       6.6.2 Linear System
   6.7 Concluding Remarks

7 MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE FEEDBACK-NASH EQUILIBRIUM SOLUTION OF DIFFERENTIAL GRAPHICAL GAMES
   7.1 Graph Theory Preliminaries
   7.2 Problem Formulation
       7.2.1 Elements of the Value Function
       7.2.2 Optimal Formation Tracking Problem
   7.3 System Identification
   7.4 Approximation of the BE and the Relative Steady-state Controller
   7.5 Value Function Approximation
   7.6 Simulation of Experience via BE Extrapolation
   7.7 Stability Analysis
   7.8 Simulations
       7.8.1 One-dimensional Example
       7.8.2 Two-dimensional Example
   7.9 Concluding Remarks

8 CONCLUSIONS

APPENDIX

A ONLINE DATA COLLECTION (CH. 3)

B PROOF OF SUPPORTING LEMMAS (CH. 5)
   B.1 Proof of Lemma 5.1
   B.2 Proof of Lemma 5.2
   B.3 Proof of Lemma 5.3

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

4-1 Learning gains for value function approximation
4-2 Initial conditions for the system and the two players
7-1 Simulation parameters for the one-dimensional example
7-2 Simulation parameters for the two-dimensional example

LIST OF FIGURES

2-1 Actor-critic architecture
2-2 Actor-critic-identifier architecture
3-1 Simulation-based actor-critic-identifier architecture
3-2 System state and control trajectories generated using the developed method for the system in Section 3.5.1
3-3 Actor and critic weight trajectories generated using the developed method for the system in Section 3.5.1 compared with their true values. The true values computed based on the analytical solution are represented by dotted lines
3-4 Drift parameter estimate trajectories generated using the developed method for the system in Section 3.5.1 compared to the actual drift parameters. The dotted lines represent true values of the drift parameters
3-5 System state and control trajectories generated using the developed method for the system in Section 3.5.2
3-6 Actor and critic weight trajectories generated using the developed method for the system in Section 3.5.2. Since an analytical optimal solution is not available, the weight estimates cannot be compared with their true values
3-7 Drift parameter estimate trajectories generated using the developed method for the system in Section 3.5.2 compared to the actual drift parameters. The dotted lines represent true values of the drift parameters
3-8 State and control trajectories generated using feedback policy û(x) compared to a numerical optimal solution for the system in Section 3.5.2
4-1 Trajectories of actor and critic weights for player 1 compared against their true values. The true values computed based on the analytical solution are represented by dotted lines
4-2 Trajectories of actor and critic weights for player 2 compared against their true values. The true values computed based on the analytical solution are represented by dotted lines
4-3 Trajectories of the estimated parameters in the drift dynamics compared against their true values. The true values are represented by dotted lines
4-4 System state trajectory and the control trajectories for players 1 and 2 generated using the developed technique
5-1 State and error trajectories with probing signal

5-2 Evolution of value function and policy weights
5-3 Hamiltonian and costate of the numerical solution computed using GPOPS
5-4 Control trajectories û(t) obtained from GPOPS and the developed technique
5-5 Tracking error trajectories e(t) obtained from GPOPS and the developed technique
6-1 System trajectories generated using the proposed method for the nonlinear system
6-2 Value function and the policy weight trajectories generated using the proposed method for the nonlinear system. Since an analytical solution of the optimal tracking problem is not available, weights cannot be compared against their ideal values
6-3 Trajectories of the unknown parameters in the system drift dynamics for the nonlinear system. The dotted lines represent the true values of the parameters
6-4 Satisfaction of Assumptions 6.1 and 6.2 for the nonlinear system
6-5 Comparison between control and error trajectories resulting from the developed technique and a numerical solution for the nonlinear system
6-6 System trajectories generated using the proposed method for the linear system
6-7 Value function and the policy weight trajectories generated using the proposed method for the linear system. Since an analytical solution of the optimal tracking problem is not available, weights cannot be compared against their ideal values
6-8 Trajectories of the unknown parameters in the system drift dynamics for the linear system. The dotted lines represent the true values of the parameters
6-9 Satisfaction of Assumptions 6.1 and 6.2 for the linear system
7-1 Communication topology: a network containing five agents
7-2 State trajectories for the five agents for the one-dimensional example. The dotted lines show the desired state trajectories
7-3 Tracking error trajectories for the agents for the one-dimensional example
7-4 Trajectories of the control input and the relative control error for all agents for the one-dimensional example
7-5 Value function weights and drift dynamics parameter estimates for agent 1 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters

7-6 Value function weights and drift dynamics parameter estimates for agent 2 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters
7-7 Value function weights and drift dynamics parameter estimates for agent 3 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters
7-8 Value function weights and drift dynamics parameter estimates for agent 4 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters
7-9 Value function weights and drift dynamics parameter estimates for agent 5 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters
7-10 Phase portrait in the state-space for the two-dimensional example. The actual pentagonal formation is represented by a solid black pentagon, and the desired pentagonal formation around the leader is represented by a dotted black pentagon
7-11 Phase portrait of all agents in the error space for the two-dimensional example
7-12 Trajectories of the control input and the relative control error for Agent 1 for the two-dimensional example
7-13 Trajectories of the control input and the relative control error for Agent 2 for the two-dimensional example
7-14 Trajectories of the control input and the relative control error for Agent 3 for the two-dimensional example
7-15 Trajectories of the control input and the relative control error for Agent 4 for the two-dimensional example
7-16 Trajectories of the control input and the relative control error for Agent 5 for the two-dimensional example
7-17 Value function weights and policy weights for agent 1 for the two-dimensional example
7-18 Value function weights and policy weights for agent 2 for the two-dimensional example
7-19 Value function weights and policy weights for agent 3 for the two-dimensional example
7-20 Value function weights and policy weights for agent 4 for the two-dimensional example

7-21 Value function weights and policy weights for agent 5 for the two-dimensional example

LIST OF ABBREVIATIONS

ACI   actor-critic-identifier
ADP   adaptive dynamic programming
ARE   algebraic Riccati equation
BE    Bellman error
CL    concurrent learning
DP    dynamic programming
DRE   differential Riccati equation
GHJB  generalized Hamilton-Jacobi-Bellman
HJ    Hamilton-Jacobi
HJB   Hamilton-Jacobi-Bellman
LIP   linear-in-the-parameters
LP    linearly parameterizable/linearly parameterized
MPC   model predictive control
NN    neural network
PE    persistence of excitation/persistently exciting
PI    policy iteration
RL    reinforcement learning
SDRE  state dependent Riccati equation
TD    temporal-difference
UB    ultimately bounded
VI    value iteration

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE OPTIMAL CONTROL

By

Rushikesh Lambodar Kamalapurkar

August 2014

Chair: Warren E. Dixon
Major: Mechanical Engineering

The objective of an optimal control synthesis problem is to compute the policy that an agent should follow in order to maximize the accumulated reward. Analytical solution of optimal control problems is often impossible when the system dynamics are nonlinear. Many numerical solution techniques are available to solve optimal control problems; however, such methods generally require perfect model knowledge and may not be implementable in real-time.

Inroads to solve optimal control problems for nonlinear systems can be made through insights gained from examining the value function. Under a given policy, the value function provides a map from the state space to the set of real numbers that measures the value of a state, generally defined as the total accumulated reward starting from that state. If the value function is known, a reasonable strategy is to apply control to drive the states towards increasing value. If the value function is unknown, a reasonable strategy is to use input-output data to estimate the value function online, and use the estimate to compute the control input. Reinforcement learning (RL)-based optimal control synthesis techniques implement the aforementioned strategy by approximating the value function using a parametric approximation scheme. The approximate optimal policy is then computed based on the approximate value function.

RL-based techniques are valuable not only as online optimization tools but also as control synthesis tools.

In discrete-time stochastic systems with countable state and action spaces, RL-based techniques have demonstrated the ability to synthesize stabilizing policies with minimal knowledge of the structure of the system. Techniques such as Q-learning have been shown to be effective tools to generate stabilizing policies based on input-output data without any other information about the system. RL thus offers a potential alternative to traditional control design techniques. However, the extensions of RL techniques to continuous-time systems that evolve on a continuous state-space are scarce, and often require more information about the system than just input-output data.

This dissertation investigates extending the applicability of RL-based techniques in a continuous-time deterministic setting to generate approximate optimal policies online by relaxing some of the limitations imposed by the continuous-time nature of the problem. State-of-the-art implementations of RL in continuous-time systems require a restrictive persistence of excitation (PE) condition for convergence to optimality. In this dissertation, model-based RL is implemented via simulation of experience to relax the restrictive PE condition. The RL-based approach is also extended to obtain approximate feedback-Nash equilibrium solutions to N-player nonzero-sum games.

In trajectory tracking problems, since the error dynamics are nonautonomous, the value function depends explicitly on time. Since universal function approximators can approximate functions with arbitrary accuracy only on compact domains, value functions for infinite-horizon optimal tracking problems cannot be approximated with arbitrary accuracy using universal function approximators. Hence, the extension of RL-based techniques to optimal tracking problems for continuous-time nonlinear systems has remained a non-trivial open problem. In this dissertation, RL-based approaches are extended to solve trajectory tracking problems by using the desired trajectory, in addition to the tracking error, as an input to learn the value function.

Distributed control of groups of multiple interacting agents is a challenging problem with multiple practical applications.

When the agents possess cooperative or competitive objectives, the trajectory and the decisions of each agent are affected by the trajectories and decisions of the neighboring agents. The external influence renders the dynamics of each agent nonautonomous; hence, optimization in a network of agents presents challenges similar to the optimal tracking problem. The interaction between the agents in a network is often modeled as a differential game on a graph, defined by coupled dynamics and coupled cost functions. Using insights gained from the tracking problem, this dissertation extends the model-based RL technique to generate feedback-Nash equilibrium optimal policies online for agents in a network with cooperative or competitive objectives. In particular, the network of agents is separated into autonomous subgraphs, and the differential game is solved separately on each subgraph.

The applicability of the developed methods is demonstrated through simulations, and to illustrate their effectiveness, comparative simulations are presented wherever alternate methods exist to solve the problem under consideration. The dissertation concludes with a discussion about the limitations of the developed technique, and further extensions of the technique are proposed along with the possible approaches to achieve them.

CHAPTER 1
INTRODUCTION

1.1 Motivation

The ability to learn the correct behavior from interactions with the environment is a highly desirable characteristic of a cognitive agent. Typical interactions between an agent and its environment can be described in terms of actions, states, and rewards or penalties. The actions executed by the agent affect the state of the system (i.e., the agent and the environment), and the agent is presented with a reward or a penalty. Assuming that the agent chooses an action based on the state of the system, the behavior (or the policy) of the agent can be described as a map from the state space to the action space.

To learn the correct policy, it is crucial to establish a measure of correctness. The correctness of a policy can be quantified in numerous ways depending on the objectives of the agent-environment interaction. For guidance and control applications, the correctness of a policy is often quantified in terms of a Lagrange cost and a Mayer cost. The Lagrange cost is the cumulative penalty accumulated along a path traversed by the agent, and the Mayer cost is the penalty at the boundary. Policies with lower total cost are considered better, and policies that minimize the total cost are considered optimal. The problem of finding the optimal policy that minimizes the total Lagrange and Mayer cost is known as the Bolza optimal control problem.
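For concreteness, the total cost in a Bolza problem can be written compactly. The following display is a generic sketch (state x, control u, horizon [t_0, t_f]; the symbols Φ and L are illustrative stand-ins for the Mayer and Lagrange terms, not notation taken from later chapters):

$$ J\big(x(\cdot), u(\cdot)\big) \;=\; \underbrace{\Phi\big(x(t_f)\big)}_{\text{Mayer cost}} \;+\; \underbrace{\int_{t_0}^{t_f} L\big(x(t), u(t)\big)\, dt}_{\text{Lagrange cost}}, $$

and the Bolza problem is to find an admissible policy that minimizes J subject to the system dynamics.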

Obtaining an analytical solution to the Bolza problem is often infeasible if the system dynamics are nonlinear. Many numerical solution techniques are available to solve Bolza problems; however, numerical solution techniques require exact model knowledge and are realized via open-loop implementation of offline solutions. Open-loop implementations are sensitive to disturbances, changes in objectives, and changes in the system dynamics; hence, online closed-loop solutions of optimal control problems are sought after. Inroads to solve an optimal control problem online can be made by looking at the so-called value function. Under a given policy, the value function provides a map from the state space to the set of real numbers that measures the quality of a state. In other words, under a given policy, the value function evaluated at a given state is the cost accumulated when starting in the given state and following the given policy. Under general conditions, the policy that drives the system state along the steepest negative gradient of the optimal value function turns out to be the optimal policy; hence, online optimal control design relies on computation of the optimal value function.

For systems with finite state and action spaces, value function-based dynamic programming (DP) techniques such as policy iteration (PI) and value iteration (VI) are established as effective tools for optimal control synthesis. However, both PI and VI suffer from Bellman's curse of dimensionality, i.e., they become computationally intractable as the size of the state space grows. Furthermore, both PI and VI require exact knowledge of the system dynamics. The need for excessive computation can be realistically sidestepped if one seeks to obtain an approximation to the optimal value function instead of the exact optimal value function (i.e., approximate dynamic programming). The need for exact model knowledge can be eliminated by using a simulation-based approach where the goal is to learn the optimal value function using state-action-reward triplets observed along the state trajectory (i.e., reinforcement learning (RL)).

Approximate dynamic programming algorithms approximate the classical PI and VI algorithms by using a parametric approximation of the policy or the value function. The central idea is that if the policy or the value function can be parameterized with sufficient accuracy using a small number of parameters, the optimal control problem reduces to an approximation problem in the parameter space. Furthermore, this formulation lends itself to an online solution approach using RL, where the parameters are adjusted on the fly using input-output data.
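The link between the optimal value function and the optimal policy can be made concrete for the control-affine setting treated in later chapters. As a standard sketch (assuming dynamics \dot{x} = f(x) + g(x)u and instantaneous cost r(x, u) = Q(x) + u^T R u with R positive definite; the notation here is illustrative), the optimal value function V* satisfies the Hamilton-Jacobi-Bellman equation

$$ 0 \;=\; \min_{u}\Big[\nabla V^{*}(x)\big(f(x) + g(x)u\big) + Q(x) + u^{\top} R u\Big], $$

and the minimizing policy, available in closed form, drives the state against the gradient of V*:

$$ u^{*}(x) \;=\; -\tfrac{1}{2} R^{-1} g^{\top}(x) \big(\nabla V^{*}(x)\big)^{\top}. $$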

However, sufficient exploration of the state-action space is required for convergence, and the optimality of the obtained policy depends heavily on the accuracy of the parameterization scheme, the formulation of which requires some insight into the dynamics of the system. Despite the aforementioned drawbacks, RL has given rise to effective techniques that can synthesize nearly optimal policies to control nonlinear systems that have large state and action spaces and unknown or partially known dynamics. As a result, RL has been a growing area of research in the past two decades.

In recent years, RL techniques have been extended to autonomous continuous-time deterministic systems. In online implementations of RL, the control policy derived from the approximate value function is used to control the system; hence, obtaining a good approximation of the value function is critical to the stability of the closed-loop system. Obtaining a good approximation of the value function online requires convergence of the unknown parameters to their ideal values. Hence, similar to adaptive control, the sufficient exploration condition manifests itself as a persistence of excitation (PE) condition when RL is implemented online. In general, it is impossible to guarantee PE a priori; hence, a probing signal designed using trial and error is added to the controller to ensure PE. The probing signal is not considered in the stability analysis; hence, stability of the closed-loop implementation cannot be guaranteed. In this dissertation, a model-based RL scheme is developed to relax the PE condition. Model-based RL is implemented using a concurrent learning (CL)-based system identifier to simulate experience by evaluating the Bellman error (BE) over unexplored areas of the state space.
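To make the simulation-of-experience idea concrete, the following minimal Python sketch evaluates an approximate BE at user-selected, off-trajectory points once a model estimate is available. Everything here is illustrative: the drift estimate f_hat, the basis sigma, the cost weights, and the sampled grid are hypothetical stand-ins, not the gains or basis functions used in later chapters.

    import numpy as np

    # Illustrative two-state model estimate and cost (placeholders).
    def f_hat(x):                       # estimated drift dynamics
        return np.array([-x[0] + x[1], -0.5 * x[1]])

    def g(x):                           # known control effectiveness
        return np.array([[0.0], [1.0]])

    R = np.array([[1.0]])

    def Q(x):                           # state penalty x'x
        return x @ x

    def sigma(x):                       # polynomial basis, V_hat = W' sigma(x)
        return np.array([x[0]**2, x[0]*x[1], x[1]**2])

    def grad_sigma(x):                  # Jacobian d(sigma)/dx
        return np.array([[2*x[0], 0.0],
                         [x[1],   x[0]],
                         [0.0,    2*x[1]]])

    def bellman_error(W, x):
        """BE of weights W at an arbitrary state x; x need not be visited."""
        grad_V = W @ grad_sigma(x)                        # dV_hat/dx
        u = -0.5 * np.linalg.solve(R, g(x).T @ grad_V)    # approximate policy
        x_dot = f_hat(x) + (g(x) @ u).ravel()             # model-based derivative
        return grad_V @ x_dot + Q(x) + float(u @ R @ u)   # delta(x)

    # Simulated experience: extrapolate the BE over a grid around the origin.
    W = np.zeros(3)
    points = [np.array([a, b]) for a in np.linspace(-1, 1, 5)
                               for b in np.linspace(-1, 1, 5)]
    deltas = [bellman_error(W, x) for x in points]
    print(f"mean |BE| over sampled points: {np.mean(np.abs(deltas)):.3f}")

In a full implementation, the value-function weights W and the model estimate would be updated online from these extrapolated errors; the point of the sketch is only that the BE can be queried anywhere once a model is in hand.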

A multitude of relevant control problems can be modeled as multi-input systems, where each input is computed by a player, and each player attempts to influence the system state to minimize its own cost function. In this case, the optimization problem for each player is coupled with the optimization problem for other players; hence, in general, an optimal solution in the usual sense does not exist, motivating the formulation of alternative solution concepts. The most popular solution concept is a Nash equilibrium solution, which finds applications in optimal disturbance rejection, i.e., $H_\infty$ control, where the disturbance is modeled as a player in a two-player zero-sum differential game. A set of policies is called a Nash equilibrium solution to a multi-objective optimization problem if none of the players can improve their outcome by changing their policy while all the other players abide by the Nash equilibrium policies. Thus, Nash equilibrium solutions provide a secure set of strategies, in the sense that none of the players have an incentive to diverge from their equilibrium policy. Motivated by the widespread applications of differential games, this dissertation extends the model-based RL techniques to obtain feedback-Nash equilibrium solutions to N-player nonzero-sum differential games.
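In symbols (a standard definition, with J_i denoting player i's cost and u_{-i} the collection of the remaining players' policies; the notation is illustrative), a tuple of policies (u_1^*, ..., u_N^*) is a Nash equilibrium if, for every player i,

$$ J_i\big(u_i^{*},\, u_{-i}^{*}\big) \;\le\; J_i\big(u_i,\, u_{-i}^{*}\big) \quad \text{for every admissible } u_i, $$

that is, no player can lower its own cost by deviating unilaterally.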

Extension of RL to trajectory tracking problems is not trivial because the error dynamics are nonautonomous, resulting in time-varying value functions. Since universal function approximators can approximate functions with arbitrary accuracy only on compact domains, value functions for infinite-horizon optimal tracking problems cannot be approximated with arbitrary accuracy using universal function approximators. The results in this dissertation extend RL-based approaches to trajectory tracking problems by using the desired trajectory, in addition to the tracking error, as an input to learn the value function.

The fact that the value function depends on the desired trajectory results in a challenge in establishing system stability during the learning phase. Stability during the learning phase is often established using Lyapunov-based stability analysis methods, which are motivated by the fact that, under general conditions, the optimal value function is a Lyapunov function for the closed-loop system under the optimal policy. In tracking problems, the value function, as a function of the tracking error and the desired trajectory, is not a Lyapunov function for the closed-loop system under the optimal policy. In this dissertation, the aforementioned challenge is addressed by proving that the value function, as a time-varying function of the tracking error, can be used as a Lyapunov function.

RL techniques are valuable not only for optimization but also for control synthesis in complex systems such as a distributed network of cognitive agents. Combined efforts from multiple autonomous agents can yield tactical advantages including: improved munitions effects; distributed sensing, detection, and threat response; and distributed communication pipelines. While coordinating behaviors among autonomous agents is a challenging problem that has received mainstream focus, unique challenges arise when seeking autonomous collaborative behaviors in low bandwidth communication environments. For example, most collaborative control literature focuses on centralized approaches that require all nodes to continuously communicate with a central agent, yielding a heavy communication demand that is subject to failure due to delays and missing information. Furthermore, the central agent is required to carry enough on-board computational resources to process the data and to generate command signals. These challenges motivate the need for a decentralized approach where the nodes only need to communicate with their neighbors for guidance, navigation, and control tasks. Furthermore, when the agents possess cooperative or competitive objectives, the trajectory and the decisions of each agent are affected by the trajectories and decisions of the neighboring agents. The external influence renders the dynamics of each agent nonautonomous, and hence, optimization in a network of agents presents challenges similar to the optimal tracking problem.

The interaction between the agents in a network is often modeled as a differential game on a graph, defined by coupled dynamics and coupled cost functions. Using insights gained from the tracking problem, this dissertation extends the model-based RL technique to generate feedback-Nash equilibrium optimal policies online for agents in a network with cooperative or competitive objectives.

In particular, the network of agents is separated into autonomous subgraphs, and the differential game is solved separately on each subgraph.

The applicability of the developed methods is demonstrated through simulations, and to illustrate their effectiveness, comparative simulations are presented wherever alternate methods exist to solve the problem under consideration. The dissertation concludes with a discussion about the limitations of the developed technique, and further extensions of the technique are proposed along with the possible approaches to achieve them.

1.2 Literature Review

One way to develop optimal controllers for general nonlinear systems is to use numerical methods [1]. A common approach is to formulate the optimal control problem in terms of a Hamiltonian and then to numerically solve a two-point boundary value problem for the state and co-state equations [2, 3]. Another approach is to cast the optimal control problem as a nonlinear programming problem via direct transcription and then solve the resulting nonlinear program [4–9]. Numerical methods are offline, do not generally guarantee stability or optimality, and are often open-loop. These issues motivate the desire to find an analytical solution. Developing analytical solutions to optimal control problems for linear systems is complicated by the need to solve an algebraic Riccati equation (ARE) or a differential Riccati equation (DRE). Developing analytical solutions for nonlinear systems is even further complicated by the sufficient condition of solving a Hamilton-Jacobi-Bellman (HJB) partial differential equation, where an analytical solution may not exist in general. If the nonlinear dynamics are exactly known, then the problem can be simplified at the expense of optimality by solving an ARE through feedback-linearization methods (cf. [10–14]).

Alternatively, some investigators temporarily assume that the uncertain system could be feedback-linearized, solve the resulting optimal control problem, and then use adaptive/learning methods to asymptotically learn the uncertainty, i.e., asymptotically converge to the optimal controller [15–18]. Inverse optimal control [19–24] is also an alternative method to solve the nonlinear optimal control problem by circumventing the need to solve the HJB equation. By finding a control Lyapunov function, which can be shown to also be a value function, an optimal controller can be developed that optimizes a derived cost. However, since the cost is derived rather than specified by mission/task objectives, this approach is not explored in this dissertation. Optimal control-based algorithms such as state dependent Riccati equations (SDRE) [25–28] and model predictive control (MPC) [29–35] have been widely utilized for control of nonlinear systems. However, both SDRE and MPC are inherently model-based. Furthermore, due to nonuniqueness of state dependent linear factorization in SDRE-based techniques, and since the control problem is solved over a small prediction horizon in MPC, SDRE and MPC generally result in suboptimal policies. Furthermore, MPC-based approaches are computationally intensive, and closed-loop stability of SDRE-based methods is generally impossible to establish a priori and has to be established through extensive simulation. Owing to the aforementioned drawbacks, SDRE and MPC approaches are not explored in this dissertation. This dissertation focuses on DP-based techniques.

The fundamental idea in all DP techniques is the principle of optimality, due to Bellman [36]. DP techniques based on the principle of optimality have been extensively studied in the literature (cf. [37–42]). The applicability of classical DP techniques like PI and VI is limited by the curse of dimensionality and the need for model knowledge. Simulation-based reinforcement learning (RL) techniques such as Q-learning [40] and temporal-difference (TD) learning [38, 43] avoid the curse of dimensionality and the need for exact model knowledge. However, these techniques require the states and the actions to be on finite sets. Even though the theory is developed for finite state spaces of any size, the implementation of simulation-based RL techniques is feasible only if the size of the state space is small.
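As a reference point for the tabular methods named above, the following generic Python sketch shows a textbook Q-learning update on a toy two-state problem; the transition table, rewards, and gains are arbitrary illustrations, not drawn from this dissertation.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 2, 2
    P = np.array([[1, 0], [0, 1]])                 # next state for (s, a)
    R_sa = np.array([[0.0, 1.0], [1.0, 0.0]])      # reward for (s, a)
    Q_tab = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.9, 0.2              # step size, discount, exploration

    s = 0
    for _ in range(5000):
        # epsilon-greedy action: the "sufficient exploration" requirement
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q_tab[s]))
        s_next, r = P[s, a], R_sa[s, a]
        # model-free update: only the observed (s, a, r, s') transition is used
        Q_tab[s, a] += alpha * (r + gamma * np.max(Q_tab[s_next]) - Q_tab[s, a])
        s = s_next

    print(np.round(Q_tab, 2))   # greedy policy: argmax over each row

The update needs no model of P or R_sa, which is exactly the model-free property discussed above, and the lookup table makes the finite state-action restriction explicit.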

Extensions of simulation-based RL techniques to general state spaces or very large finite state spaces involve parametric approximation of the policy. Such algorithms have been studied in depth for systems with countable state and action spaces under the name of neuro-DP (cf. [42, 44–48] and the references therein). The extension of these techniques to general state spaces and continuous time-domains is challenging, and only a small number of results are available in the literature [49].

For deterministic systems, RL algorithms have been extended to solve finite and infinite-horizon discounted and total cost optimal regulation problems (cf. [50–59]) under names such as adaptive dynamic programming (ADP) or adaptive critic algorithms. The discrete/iterative nature of the approximate dynamic programming formulation lends itself naturally to the design of discrete-time optimal controllers [50, 53, 55, 60–67], and the convergence of algorithms for DP-based RL controllers is studied in results such as [61, 68–70]. Most prior work has focused on convergence analysis for discrete-time systems, but some continuous examples are available [52, 54, 57, 70–79]. For example, in [72] Advantage Updating was proposed as an extension of the Q-learning algorithm which could be implemented in continuous time and provided faster convergence. The result in [74] used an HJB-based framework to derive algorithms for value function approximation and policy improvement, based on a continuous version of the TD error. An HJB framework was also used in [70] to develop a stepwise stable iterative approximate dynamic programming algorithm for continuous input-affine systems with an input-quadratic performance measure. Based on the successive approximation method first proposed in [71], an adaptive optimal control solution is provided in [73], where a Galerkin's spectral method is used to approximate the solution to the generalized HJB (GHJB). A least-squares-based successive approximation solution to the GHJB is provided in [52], where an NN is trained offline to learn the GHJB solution. Another continuous formulation is proposed in [75].

In online real-time applications, DP-based techniques generally require a restrictive PE condition to establish stability and convergence.

However, recent research indicates that data-driven learning based on recorded experience can improve the efficiency of information utilization, thereby mollifying the PE requirements. Experience replay techniques have been studied in the RL literature to circumvent the PE requirement, which is analogous to the requirement of sufficient exploration. Experience replay techniques involve repeated processing of recorded input-output data in order to improve efficiency of information utilization [80–85].

ADP-based methods that seek an online solution to the optimal control problem (cf. [53, 57, 59, 63, 86, 87] and the references therein) are structurally similar to adaptive control schemes. In adaptive control, the estimates for the uncertain parameters in the plant model are updated using the current tracking error as the performance metric, whereas, in online RL-based techniques, estimates for the uncertain parameters in the value function are updated using a continuous-time counterpart of the TD error, called the BE, as the performance metric. Convergence of online RL-based techniques to the optimal solution is analogous to parameter convergence in adaptive control.

Parameter convergence has been a focus of research in adaptive control for several decades. It is common knowledge that least-squares and gradient descent-based update laws generally require PE in the system state for convergence of the parameter estimates. Modification schemes such as projection algorithms, σ-modification, and e-modification are used to guarantee boundedness of parameter estimates and overall system stability; however, these modifications do not guarantee parameter convergence unless the PE condition, which is often impossible to verify online, is satisfied [88–91].

As recently shown in results such as [92] and [93], CL-based methods can be used to guarantee parameter convergence in adaptive control without relying on the PE condition. Concurrent learning relies on recorded state information along with current state measurements to update the parameter estimates. Learning from recorded data is effective since it is based on the model error, which is closely related to the parameter estimation error.
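As a sketch of the mechanism (a representative CL-style gradient update for a linearly parameterized model \dot{x} = Y(x)\theta + g(x)u, written in the spirit of the cited CL literature rather than copied from a later chapter; symbols illustrative), the estimate \hat{\theta} is driven by the instantaneous estimation error together with a sum over a recorded history stack {x_j, \dot{x}_j, u_j}:

$$ \dot{\hat{\theta}} \;=\; \Gamma Y^{\top}(x)\,\tilde{x} \;+\; k_{CL}\,\Gamma \sum_{j=1}^{M} Y^{\top}(x_j)\big(\dot{x}_j - g(x_j)u_j - Y(x_j)\hat{\theta}\big), $$

where \tilde{x} is a state estimation error, \Gamma a positive-definite gain, and k_{CL} a concurrent learning gain. The summed term is computable because \dot{x}_j at a past data point can be recovered by numerical smoothing, and parameter convergence follows when the recorded regressors Y(x_j) are collectively full rank, a condition on stored data rather than on future excitation.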

The key concept that enables the computation of the model error from past recorded data is that the model error can be computed if the state derivative is known, and the state derivative can be accurately computed at a past recorded data point using numerical smoothing techniques [92, 93]. Similar techniques have been recently shown to be effective for online real-time optimal control [94, 95]. In particular, the results in [95] indicate that recorded values of the BE can be used to solve the online real-time optimal control problem without the need of PE. However, a finite amount of added probing noise is required for the recorded data to be rich enough. Inspired by the results in [96] and [97], which suggest that simulated experience based on a system model can be more effective than recorded experience, the efforts in this dissertation focus on the development of online real-time optimal control techniques based on model learning and BE extrapolation.

A multitude of relevant control problems can be modeled as multi-input systems, where each input is computed by a player, and each player attempts to influence the system state to minimize its own cost function. In this case, the optimization problem for each player is coupled with the optimization problem for other players, and hence, in general, an optimal solution in the usual sense does not exist, motivating the formulation of alternative optimality criteria.

Differential game theory provides solution concepts for many multi-player, multi-objective optimization problems [98–100]. For example, a set of policies is called a Nash equilibrium solution to a multi-objective optimization problem if none of the players can improve their outcome by changing their policy while all the other players abide by the Nash equilibrium policies [101]. Thus, Nash equilibrium solutions provide a secure set of strategies, in the sense that none of the players have an incentive to diverge from their equilibrium policy. Hence, Nash equilibrium has been a widely used solution concept in differential game-based control techniques.

In general, Nash equilibria are not unique.

For a closed-loop differential game (i.e., the control is a function of the state and time) with perfect information (i.e., all the players know the complete state history), there can be infinitely many Nash equilibria. If the policies are constrained to be feedback policies, the resulting equilibria are called subgame perfect Nash equilibria or feedback-Nash equilibria. The value functions corresponding to feedback-Nash equilibria satisfy a coupled system of Hamilton-Jacobi (HJ) equations [102–107].

If the system dynamics are nonlinear and uncertain, an analytical solution of the coupled HJ equations is generally infeasible; hence, dynamic programming-based approximate solutions are sought [56, 58, 87, 108–112]. In this dissertation, a simulation-based actor-critic-identifier (ACI) architecture is developed to obtain an approximate feedback-Nash equilibrium solution to an infinite-horizon N-player nonzero-sum differential game online, without requiring PE, for a nonlinear control-affine system with uncertain linearly parameterized drift dynamics.
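For orientation, the coupled HJ system referenced above takes the following standard form for control-affine dynamics \dot{x} = f(x) + \sum_j g_j(x)u_j and player costs with integrands Q_i(x) + \sum_j u_j^{\top} R_{ij} u_j (notation illustrative): each value function V_i^* must satisfy, for i = 1, ..., N,

$$ 0 \;=\; \nabla V_i^{*}(x)\Big(f(x) + \sum_{j=1}^{N} g_j(x)\, u_j^{*}(x)\Big) + Q_i(x) + \sum_{j=1}^{N} u_j^{*\top}(x)\, R_{ij}\, u_j^{*}(x), \qquad u_i^{*}(x) \;=\; -\tfrac{1}{2} R_{ii}^{-1} g_i^{\top}(x)\big(\nabla V_i^{*}(x)\big)^{\top}, $$

and the coupling through the other players' policies u_j^* is what makes an analytical solution infeasible when the dynamics are uncertain.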

For trajectory tracking problems in discrete-time systems, several approaches have been developed to address the nonautonomous nature of the open-loop system. Park et al. [113] use generalized backpropagation through time to solve a finite-horizon tracking problem that involves offline training of neural networks (NNs). An ADP-based approach is presented in [114] to solve an infinite-horizon optimal tracking problem where the desired trajectory is assumed to depend on the system states. A greedy heuristic dynamic programming based algorithm is presented in [86] which uses a system transformation to express a nonautonomous system as an autonomous system. However, this result lacks an accompanying stability analysis. ADP-based approaches are presented in [115, 116] for tracking in continuous-time systems. In both results, the value function (i.e., the critic) and the controller (i.e., the actor) presented are time-varying functions of the tracking error. However, since the problem is an infinite-horizon optimal control problem, time does not lie on a compact set. NNs can only approximate functions on a compact domain. Thus, it is unclear how an NN with time-invariant basis functions can approximate the time-varying value function and the policy.

For problems with multiple agents, as the desired action by an individual agent depends on the actions and the resulting trajectories of its neighbors, the error system for each agent becomes a complex nonautonomous dynamical system. Nonautonomous systems, in general, have non-stationary value functions. Since non-stationary functions are difficult to approximate using parameterized function approximation schemes such as NNs, designing optimal policies for nonautonomous systems is not trivial. To address this challenge, differential game theory is often employed in multi-agent optimal control, where solutions to coupled Hamilton-Jacobi (HJ) equations (cf. [112]) are sought. Since the coupled HJ equations are difficult to solve, some form of RL is often employed to get an approximate solution. Results such as [58, 112, 117–120] indicate that ADP can be used to generate approximate optimal policies online for multi-agent systems. Since the HJ equations are coupled, all of these results have a centralized control architecture.

Decentralized control techniques focus on finding control policies based on local data for individual agents that collectively achieve the desired goal, which, for the problem considered in this effort, is tracking a desired trajectory while maintaining a desired formation. Various methods have been developed to solve formation tracking problems for linear systems (cf. [121–125] and the references therein). For nonlinear systems, MPC-based approaches [126, 127] and ADP-based approaches [128, 129] have been proposed. The MPC-based controllers require extensive numerical computations and lack stability and optimality guarantees. The ADP-based approaches either require offline computations, or are suboptimal because not all the inter-agent interactions are considered in the value function. In this dissertation, a simulation-based ACI architecture is developed to cooperatively control a group of agents to track a trajectory while maintaining a desired formation.

1.3 Outline of the Dissertation

Chapter 1 serves as the introduction. Motivation behind the results in the dissertation is presented along with a detailed review of the state of the art.

Chapter 2 contains a brief review of available techniques used in the application of RL to deterministic continuous-time systems. This chapter also highlights the problems and the limitations of existing techniques, thereby motivating the development in the dissertation.

Chapter 3 implements model-based RL to solve approximate optimal regulation problems online with a relaxed PE-like condition using a simulation-based ACI architecture. The development is based on the observation that, given a model of the system, model-based RL can be implemented by evaluating the Bellman error at any number of desired points in the state space. In this result, a parametric system model is considered, and a CL-based parameter identifier is developed to compensate for uncertainty in the parameters. Ultimately bounded (UB) regulation of the system states to a neighborhood of the origin, and convergence of the developed policy to a neighborhood of the optimal policy, are established using a Lyapunov-based analysis, and simulations are presented to demonstrate the performance of the developed controller.

Chapter 4 extends the results of Chapter 3 to obtain an approximate feedback-Nash equilibrium solution to an infinite-horizon N-player nonzero-sum differential game online, without requiring PE, for a nonlinear control-affine system with uncertain linearly parameterized drift dynamics. It is shown that, under a condition milder than PE, uniformly ultimately bounded convergence of the developed control policies to the feedback-Nash equilibrium policies can be established. Simulation results are presented to demonstrate the performance of the developed technique without an added excitation signal.

Chapter 5 presents an ADP-based approach using the policy evaluation (critic) and policy improvement (actor) architecture to approximately solve the infinite-horizon optimal tracking problem for control-affine nonlinear systems with quadratic cost. The problem is solved by transforming the system to convert the tracking problem, which has a non-stationary value function, into a stationary optimal control problem.

The ultimately bounded (UB) tracking and estimation result is established using Lyapunov analysis for nonautonomous systems. Simulations are performed to demonstrate the applicability and the effectiveness of the developed method.

Chapter 6 utilizes model-based reinforcement learning to extend the results of Chapter 5 to systems with uncertainties in the drift dynamics. A system identifier is used for approximate model inversion to facilitate the formulation of a feasible optimal control problem. Model-based reinforcement learning is implemented using a concurrent learning-based system identifier to simulate experience by evaluating the Bellman error over unexplored areas of the state space. Tracking of the desired trajectory and convergence of the developed policy to a neighborhood of the optimal policy are established via Lyapunov-based stability analysis. Simulation results demonstrate the effectiveness of the developed technique.

Chapter 7 combines graph theory and differential game theory with the actor-critic-identifier architecture in ADP to synthesize approximate online feedback-Nash equilibrium control policies for agents on a communication network with a spanning tree. NNs are used to approximate the policy, the value function, and the system dynamics. UB convergence of the agents to the desired formation, UB convergence of the agent trajectories to the desired trajectories, and UB convergence of the agent controllers to their respective feedback-Nash equilibrium policies are established through a Lyapunov-based stability analysis. Simulations are presented to demonstrate the applicability of the proposed technique to cooperatively control a group of five agents.

Chapter 8 concludes the dissertation. A summary of the dissertation is provided along with a discussion on open problems and future research directions.

1.4 Contributions

This section details the contributions of this dissertation over the state of the art.

1.4.1 Approximate Optimal Regulation

In RL-based approximate online optimal control, the HJB equation along with an estimate of the state derivative (cf. [49, 59]), or an integral form of the HJB equation (cf. [130]), is utilized to approximately evaluate the BE at each visited state along the system trajectory. The BE provides an indirect measure of the quality of the current estimate of the value function at each visited state along the system trajectory. Hence, the unknown value function parameters are updated based on the BE along the system trajectory. Such weight update strategies create two challenges for analyzing convergence. The system states need to be PE, and the system trajectory needs to visit enough points in the state space to generate a good approximation to the value function over the entire operating domain. These challenges are typically addressed by adding an exploration signal to the control input (cf. [43, 49, 130]) to ensure sufficient exploration in the desired region of the state space. However, no analytical methods exist to compute the appropriate exploration signal when the system dynamics are nonlinear.

In this dissertation, the aforementioned challenges are addressed by observing that the restriction that the BE can only be evaluated along the system trajectories is a consequence of the model-free nature of RL-based approximate online optimal control. In particular, the integral BE is only meaningful as a measure of quality of the value function if evaluated along the system trajectories, and state derivative estimators can only generate estimates of the state derivative along the system trajectories using numerical smoothing. However, if the system dynamics are known, the state derivative, and hence the BE, can be evaluated at any desired point in the state space. Unknown parameters in the value function can therefore be adjusted based on least squares minimization of the BE evaluated at any number of desired points in the state space. For example, in an infinite-horizon regulation problem, the BE can be computed at sampled points uniformly distributed in a neighborhood around the origin of the state space. The results of this dissertation indicate that convergence of the unknown parameters in the value function is guaranteed provided the selected points satisfy a rank condition.
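A representative form of such a rank condition (stated here in the style of the concurrent learning literature as a sketch, not the exact condition derived later): with \hat{V}(x) = \hat{W}^{\top}\sigma(x) and regressors \omega_i computed from the extrapolated BE at the selected points {x_i}_{i=1}^{M}, convergence can be expected when

$$ \frac{1}{M}\sum_{i=1}^{M} \frac{\omega_i\,\omega_i^{\top}}{\rho_i^{2}} \;\succeq\; c\, I \quad \text{for some } c > 0, $$

where \rho_i is a normalization term. Unlike PE, this is a condition on user-selected points and stored quantities, so it can be monitored online.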

Since the BE can be evaluated at any desired point in the state space, sufficient exploration can be achieved by appropriately selecting the points in a desired neighborhood.

If the system dynamics are partially unknown, an approximation to the BE can be evaluated at any desired point in the state space based on an estimate of the system dynamics. If each new evaluation of the BE along the system trajectory is interpreted as gaining experience via exploration, an evaluation of the BE at an unexplored point in the state space can be interpreted as a simulated experience. Learning based on simulation of experience has been investigated in results such as [131–136] for stochastic model-based RL; however, these results solve the optimal control problem offline in the sense that repeated learning trials need to be performed before the algorithm learns the controller, and system stability during the learning phase is not analyzed. This dissertation furthers the state of the art for nonlinear, control-affine plants with linearly parameterizable (LP) uncertainties in the drift dynamics by providing an online solution to deterministic infinite-horizon optimal regulation problems. In this dissertation, a CL-based parameter estimator is developed to exponentially identify the unknown parameters in the system model, and the parameter estimates are used to implement simulated experience by extrapolating the BE. The main contributions of this chapter include:

- Novel implementation of simulated experience in deterministic nonlinear systems using CL-based system identification.

- Detailed stability analysis to establish simultaneous online identification of system dynamics and online approximate learning of the optimal controller, while maintaining system stability. The stability analysis shows that, provided the system dynamics can be approximated fast enough and with sufficient accuracy, simulation of experience based on the estimated model, implemented via approximate BE extrapolation, can be utilized to approximately solve an infinite-horizon optimal regulation problem online.

- For the first time, simulation results that demonstrate the approximate solution of an infinite-horizon optimal regulation problem online for an inherently unstable control-affine nonlinear system with uncertain drift dynamics without the addition of an external ad-hoc probing signal.

1.4.2 N-player Nonzero-sum Differential Games

In [58], a PE-based integral reinforcement learning algorithm is presented to solve nonzero-sum differential games in linear systems without the knowledge of the drift matrix. In [112], a PE-based dynamic programming technique is developed to find an approximate feedback-Nash equilibrium solution to an infinite-horizon N-player nonzero-sum differential game online for nonlinear control-affine systems with known dynamics. In [119], a PE-based ADP method is used to solve a two-player zero-sum game online for nonlinear control-affine systems without the knowledge of drift dynamics. In this dissertation, a simulation-based ACI architecture (cf. [59]) is used to obtain an approximate feedback-Nash equilibrium solution to an infinite-horizon N-player nonzero-sum differential game online, without requiring PE, for a nonlinear control-affine system with uncertain LP drift dynamics. The contribution of this result is that it extends the development in Chapter 3 to the more general N-player nonzero-sum differential game framework.

1.4.3 Approximate Optimal Tracking

Approximation techniques like NNs are commonly used in the ADP literature for value function approximation. ADP-based approaches are presented in results such as [115, 116] to address the tracking problem for continuous-time systems, where the value function and the controller presented are time-varying functions of the tracking error. However, for an infinite-horizon optimal control problem, the domain of the value function is not compact. NNs can only approximate functions on a compact domain. Thus, it is unclear how an NN with the tracking error as an input can approximate the time-varying value function and controller.
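The resolution developed in this dissertation can be summarized in one display (a sketch; e is the tracking error, x_d the desired trajectory, \sigma an NN basis, and \hat{W} the estimated weights; the symbols are illustrative): instead of a time-varying approximation \hat{V}(e, t), the value function is learned over the concatenated state

$$ \hat{V}(\zeta) = \hat{W}^{\top}\sigma(\zeta), \qquad \zeta = \begin{bmatrix} e \\ x_d \end{bmatrix}, $$

which evolves on a compact set whenever the tracking error and the desired trajectory are bounded, so a time-invariant basis suffices and the universal approximation property applies.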

PAGE 34

timetosolveanitehorizontrackingproblemthatinvolvesofinetrainingofNNs.An ADP-basedapproachispresentedin[114]tosolveaninnite-horizonoptimaltracking problemwherethedesiredtrajectoryisassumedtodependonthesystemstates. Greedyheuristicdynamicprogrammingbasedalgorithmsarepresentedinresultssuch as[86,137,138]whichtransformthenonautonomoussystemintoanautonomous system,andapproximateconvergenceofthesequenceofvaluefunctionstotheoptimal valuefunctionisestablished.However,theseresultslackanaccompanyingstability analysis.Inthisresult,thetrackingerrorandthedesiredtrajectorybothserveasinputs totheNNforvaluefunctionapproximation.Effectivenessofthedevelopedtechniqueis demonstratedvianumericalsimulations.Themaincontributionsofthisresultinclude: Formulationofastationaryoptimalcontrolproblemforinnite-horizontotal-cost optimaltrackingcontrol. Formulationandproofofthehypothesisthattheoptimalvaluefunctionisavalid candidateLyapunovfunctionwheninterpretedasatime-varyingfunctionofthe trackingerror. NewLyapunov-likestabilityanalysistoestablishultimateboundednessunder sufcientpersistentexcitation. 1.4.4Model-basedReinforcementLearningforApproximateOptimalTracking Thischapterextendstheactor-criticmethoddevelopedinthepreviouschapter tosolveaninnite-horizonoptimaltrackingproblemforsystemswithunknowndrift dynamicsusingmodel-basedRL.Thedevelopmentinthepreviouschapterrelieson minimizingthedifferencebetweentheimplementedcontrollerandthesteady-statecontroller.Thecomputationofthesteady-statecontrollerrequiresexactmodelknowledge. Inthischapter,aCL-basedsystemidentierisdevelopedgenerateanonlineapproximationtothesteady-statecontroller.Furthermore,theCL-basedsystemidentieris alsousedtoimplementmodel-basedRLtosimulateexperiencebyevaluatingtheBE overunexploredareasofthestatespace.Effectivenessofthedevelopedtechniqueis demonstratedvianumericalsimulations.Themaincontributionsofthisresultinclude: 34

PAGE 35

Extensionoftrackingtechniquetosystemswithuncertaindriftdynamicsviathe useofaCL-basedsystemidenticationforapproximatemodelinversion. Lyapunov-basedstabilityanalysistoshowsimultaneoussystemidenticationand ultimatelyboundedtrackinginthepresenceofuncertainties. 1.4.5DifferentialGraphicalGames Variousmethodshavebeendevelopedtosolveformationtrackingproblemsfor linearsystems.Anoptimalcontrolapproachisusedin[139]toachieveconsensus whileavoidingobstacles.In[140],anoptimalcontrollerisdevelopedforagentswith knowndynamicstocooperativelytrackadesiredtrajectory.In[141]aninverseoptimal controllerisdevelopedforunmannedaerialvehiclestocooperativelytrackadesired trajectorywhilemaintainingadesiredformation.In[142]adifferentialgame-based approachisdevelopedforunmannedaerialvehiclestoachievedistributedNash strategies.In[143],anoptimalconsensusalgorithmisdevelopedforacooperative teamofagentswithlineardynamicsusingonlypartialinformation.Avaluefunction approximationbasedapproachispresentedin[128]forcooperativesynchronizationina stronglyconnectednetworkofagentswithknownlineardynamics. Fornonlinearsystems,anMPC-basedapproachispresentedin[126],however, nostabilityorconvergenceanalysisispresented.AstabledistributedMPC-based approachispresentedin[127]fornonlineardiscrete-timesystemswithknownnominal dynamics.Asymptoticstabilityisprovedwithoutanyinteractionbetweenthenodes, however,anonlinearoptimalcontrolproblemneedtobesolvedateveryiterationto implementthecontroller.Anoptimaltrackingapproachforformationcontrolispresented in[129]usingsinglenetworkadaptivecriticswherethevaluefunctionislearned ofine.Onlinefeedback-Nashequilibriumsolutionofdifferentialgraphicalgamesina topologicalnetworkofagentswithcontinuous-timeuncertainnonlineardynamicshas remainedanopenproblem.Thecontributionsofthischapterarethefollowing: Introductionofrelativecontrolerrorminimizationtechniquetofacilitatetheformulationofafeasibleinnite-horizontotal-costdifferentialgraphicalgame. 35

PAGE 36

DevelopmentasetofcoupledHJequationscorrespondingtofeedback-Nash equilibriumsolutionsofdifferentialgraphicalgames. Lyapunov-basedstabilityanalysistoshowultimatelyboundedformationtrackingin thepresenceofuncertainties. 36


CHAPTER 2
PRELIMINARIES

2.1 Notation

Throughout the dissertation, $\mathbb{R}^n$ denotes $n$-dimensional Euclidean space, $\mathbb{R}_{>a}$ denotes the set of real numbers strictly greater than $a \in \mathbb{R}$, and $\mathbb{R}_{\geq a}$ denotes the set of real numbers greater than or equal to $a \in \mathbb{R}$. Unless otherwise specified, the domain of all the functions is assumed to be $\mathbb{R}_{\geq 0}$. Functions with domain $\mathbb{R}_{\geq 0}$ are defined by abuse of notation using only their image. For example, the function $x : \mathbb{R}_{\geq 0} \to \mathbb{R}^n$ is defined by abuse of notation as $x \in \mathbb{R}^n$, and referred to as $x$ instead of $x(t)$. By abuse of notation, the state variables are also used to denote state trajectories. For example, the state variable $x$ in the equation $\dot{x} = f(x) + u$ is also used as $x(t)$ to denote the state trajectory (i.e., the general solution $x : \mathbb{R}_{\geq 0} \to \mathbb{R}^n$ to $\dot{x} = f(x) + u$) evaluated at time $t$. Unless otherwise specified, all the mathematical quantities are assumed to be time-varying. Unless otherwise specified, an equation of the form $g(x) = f + h(y, t)$ is interpreted as $g(x(t)) = f(t) + h(y(t), t)$ for all $t \in \mathbb{R}_{\geq 0}$, and a definition of the form $g(x, y) \triangleq f(y) + h(x)$ for functions $g : A \times B \to C$, $f : B \to C$, and $h : A \to C$ is interpreted as $g(x, y) \triangleq f(y) + h(x)$ for all $(x, y) \in A \times B$. The only exception to the aforementioned equation and definition notation is the definitions of cost functionals, where the arguments to the cost functional are functions. The total derivative $\frac{\partial f(x)}{\partial x}$ is denoted by $\nabla f$, and the partial derivative $\frac{\partial f(x,y)}{\partial x}$ is denoted by $\nabla_x f(x, y)$. An $n \times n$ identity matrix is denoted by $I_n$; $n \times m$ matrices of zeros and ones are denoted by $0_{n \times m}$ and $1_{n \times m}$, respectively; and $1_S$ denotes the indicator function of the set $S$.

2.2 Problem Formulation

The focus of this dissertation is to obtain online approximate solutions to infinite-horizon total-cost optimal control problems. To facilitate the formulation of the optimal control problem, consider a control-affine nonlinear dynamical system

$\dot{x} = f(x) + g(x)u,$

where $x \in \mathbb{R}^n$ denotes the system state, $u \in \mathbb{R}^m$ denotes the control input, $f : \mathbb{R}^n \to \mathbb{R}^n$ denotes the drift dynamics, and $g : \mathbb{R}^n \to \mathbb{R}^{n \times m}$ denotes the control effectiveness matrix. The functions $f$ and $g$ are assumed to be locally Lipschitz continuous functions such that $f(0) = 0$ and $\nabla f(x)$ is continuous and bounded for every bounded $x \in \mathbb{R}^n$. In the following, the notation $\phi^u(t; t_0, x_0)$ denotes a trajectory of the system under the control signal $u$ with the initial condition $x_0 \in \mathbb{R}^n$ and initial time $t_0 \in \mathbb{R}_{\geq 0}$.

The control objective is to solve the infinite-horizon optimal regulation problem online, i.e., to simultaneously design and utilize a control signal $u$ online to minimize the cost functional

$J(x, u) \triangleq \int_{t_0}^{\infty} r(x(\tau), u(\tau))\, d\tau,$

under the dynamic constraint $\dot{x} = f(x) + g(x)u$ while regulating the system state to the origin. The instantaneous cost $r : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}_{\geq 0}$ is defined as

$r(x, u) \triangleq Q(x) + u^T R u,$

where $Q : \mathbb{R}^n \to \mathbb{R}_{\geq 0}$ is a positive definite function and $R \in \mathbb{R}^{m \times m}$ is a constant positive definite symmetric matrix.

2.3 Exact Solution

It is well known that if the functions $f$, $g$, and $Q$ are stationary (time-invariant) and the time horizon is infinite, then the optimal control input is a stationary state-feedback policy $u(t) = \kappa(x(t))$ for some function $\kappa : \mathbb{R}^n \to \mathbb{R}^m$. Furthermore, the function that maps each state to the total accumulated cost starting from that state and following a stationary state-feedback policy, i.e., the value function, is also a stationary function. Hence, the optimal value function $V^* : \mathbb{R}^n \to \mathbb{R}_{\geq 0}$ can be expressed as

$V^*(x) \triangleq \inf_{u(\tau) \in U,\ \tau \in \mathbb{R}_{\geq t}} \int_t^{\infty} r\left(\phi^u(\tau; t, x), u(\tau)\right) d\tau,$

for all $x \in \mathbb{R}^n$, where $U \subseteq \mathbb{R}^m$ is the action space.
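To make the problem setup concrete, the following minimal sketch simulates the control-affine dynamics and accumulates a finite-horizon truncation of the cost functional. The specific $f$, $g$, $Q$, $R$, and the feedback policy used here are illustrative assumptions, not quantities defined in this dissertation.

```python
# Minimal sketch of the problem setup: x_dot = f(x) + g(x)u with running
# cost r(x,u) = Q(x) + u'Ru; all specific choices below are illustrative.
import numpy as np

def f(x):                      # drift dynamics (illustrative)
    return np.array([-x[0] + x[1], -0.5 * x[0] - 0.5 * x[1]])

def g(x):                      # control effectiveness (illustrative)
    return np.array([[0.0], [1.0]])

R = np.eye(1)                  # positive definite control penalty
def Q(x):                      # positive definite state penalty
    return x @ x

def r(x, u):                   # instantaneous cost r(x,u) = Q(x) + u'Ru
    return Q(x) + u @ R @ u

def rollout_cost(x0, policy, dt=1e-3, T=10.0):
    """Euler rollout accumulating a finite-horizon truncation of J."""
    x, J = np.array(x0, dtype=float), 0.0
    for _ in range(int(T / dt)):
        u = policy(x)
        J += r(x, u) * dt
        x = x + (f(x) + g(x) @ u) * dt
    return J

print(rollout_cost([1.0, -1.0], lambda x: -np.array([x[1]])))
```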


Assuming an optimal controller exists, the optimal value function can be expressed as

$V^*(x) \triangleq \min_{u(\tau) \in U,\ \tau \in \mathbb{R}_{\geq t}} \int_t^{\infty} r\left(\phi^u(\tau; t, x), u(\tau)\right) d\tau.$

The optimal value function is characterized by the corresponding HJB equation [1]

$0 = \min_{u \in U}\left(\nabla V^*(x^o)\left(f(x^o) + g(x^o)u\right) + r(x^o, u)\right), \qquad \forall x^o \in \mathbb{R}^n,$

with the boundary condition $V^*(0) = 0$. Provided the HJB equation admits a continuously differentiable solution, it constitutes a necessary and sufficient condition for optimality; i.e., if the optimal value function is continuously differentiable, then it is the unique solution to the HJB equation [144]. The optimal control policy $u^* : \mathbb{R}^n \to \mathbb{R}^m$ can be determined from the HJB equation as [1]

$u^*(x^o) = -\frac{1}{2} R^{-1} g^T(x^o)\left(\nabla V^*(x^o)\right)^T, \qquad \forall x^o \in \mathbb{R}^n.$

The HJB equation can be expressed in the open-loop form

$\nabla V^*(x^o)\left(f(x^o) + g(x^o)u^*(x^o)\right) + r\left(x^o, u^*(x^o)\right) = 0, \qquad \forall x^o \in \mathbb{R}^n,$

and, using the expression for the optimal policy, in the closed-loop form

$\nabla V^*(x^o) f(x^o) - \frac{1}{4}\nabla V^*(x^o) g(x^o) R^{-1} g^T(x^o)\left(\nabla V^*(x^o)\right)^T + Q(x^o) = 0,$

for all $x^o \in \mathbb{R}^n$. The optimal policy can now be obtained from the policy expression above if the HJB equation can be solved for the optimal value function $V^*$.

2.4 Value Function Approximation

An analytical solution of the HJB equation is generally infeasible; hence, an approximate solution is sought. In an approximate actor-critic-based solution, the optimal value function $V^*$ is replaced by a parametric estimate $\hat{V}(x, \hat{W}_c)$ and the optimal policy $u^*$ by a parametric estimate $\hat{u}(x, \hat{W}_a)$, where $\hat{W}_c \in \mathbb{R}^L$ and $\hat{W}_a \in \mathbb{R}^L$ denote vectors of estimates of the ideal parameters. The objective of the critic is to learn the parameters $\hat{W}_c$, and the objective of the actor is to learn the parameters $\hat{W}_a$. Substituting the estimates $\hat{V}$ and $\hat{u}$ for $V^*$ and $u^*$ in the HJB equation, a residual error $\delta : \mathbb{R}^n \times \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$, called the BE, is defined as

$\delta(x, \hat{W}_c, \hat{W}_a) \triangleq \nabla_x\hat{V}(x, \hat{W}_c)\left(f(x) + g(x)\hat{u}(x, \hat{W}_a)\right) + r\left(x, \hat{u}(x, \hat{W}_a)\right).$

To solve the optimal control problem, the critic aims to find a set of parameters $\hat{W}_c$ and the actor aims to find a set of parameters $\hat{W}_a$ such that $\delta(x^o, \hat{W}_c, \hat{W}_a) = 0$ and $\hat{u}(x^o, \hat{W}_a) = -\frac{1}{2}R^{-1}g^T(x^o)\left(\nabla_x\hat{V}(x^o, \hat{W}_c)\right)^T$ for all $x^o \in \mathbb{R}^n$. Since an exact basis for value function approximation is generally not available, an approximate set of parameters that minimizes the BE is sought. In particular, to ensure uniform approximation of the value function and the policy over an operating domain $D \subset \mathbb{R}^n$, it is desirable to find parameters that minimize the integral error $E_s : \mathbb{R}^L \times \mathbb{R}^L \to \mathbb{R}$ defined as

$E_s(\hat{W}_c, \hat{W}_a) \triangleq \int_{x \in D} \delta^2(x, \hat{W}_c, \hat{W}_a)\, dx.$

Hence, in an online implementation of the deterministic actor-critic method, it is desirable to update the parameter estimates $\hat{W}_c$ and $\hat{W}_a$ online to minimize the instantaneous error $E_s(\hat{W}_c(t), \hat{W}_a(t))$ or the cumulative instantaneous error

$E(t) \triangleq \int_{t_0}^{t} E_s(\hat{W}_c(\tau), \hat{W}_a(\tau))\, d\tau,$

while the system is being controlled using the control law $u(t) = \hat{u}(x(t), \hat{W}_a(t))$.
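The following sketch shows how the BE is evaluated for a given parametric value function estimate, anticipating the linear-in-the-parameters form of Section 2.6. The quadratic basis and the example dynamics are illustrative assumptions.

```python
# Sketch of the Bellman error delta(x, Wc, Wa) for V_hat = Wc' sigma(x);
# sigma, f, g, Q, R below are illustrative choices, not the text's.
import numpy as np

def sigma(x):                          # polynomial basis sigma(x)
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def grad_sigma(x):                     # Jacobian of sigma, shape (3, 2)
    return np.array([[2*x[0], 0.0],
                     [x[1], x[0]],
                     [0.0, 2*x[1]]])

def u_hat(x, Wa, g, Rinv):             # actor: -1/2 R^{-1} g' (grad V_hat)'
    return -0.5 * Rinv @ g(x).T @ grad_sigma(x).T @ Wa

def bellman_error(x, Wc, Wa, f, g, Q, R):
    Rinv = np.linalg.inv(R)
    u = u_hat(x, Wa, g, Rinv)
    gradV = Wc @ grad_sigma(x)         # row vector dV_hat/dx, shape (2,)
    return gradV @ (f(x) + g(x) @ u) + Q(x) + u @ R @ u

if __name__ == "__main__":
    f = lambda x: np.array([-x[0] + x[1], -0.5*x[0] - 0.5*x[1]])
    g = lambda x: np.array([[0.0], [1.0]])
    Q = lambda x: x @ x
    R = np.eye(1)
    print(bellman_error(np.array([1.0, -1.0]), np.ones(3), np.ones(3), f, g, Q, R))
```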


2.5 RL-based Online Implementation

Computation of the BE and of the integral error $E_s$ requires exact model knowledge. Furthermore, computation of the integral error over the domain $D$ is generally infeasible. Two prevalent approaches employed to render the control design robust to uncertainties in the system drift dynamics are integral RL (cf. [95] and [145]) and state derivative estimation (cf. [59] and [146]).

Integral RL exploits the fact that for all $T > 0$ and $t > t_0 + T$, the BE has an equivalent integral form

$\delta_{int}(t) = \hat{V}(x(t-T), \hat{W}_c(t)) - \hat{V}(x(t), \hat{W}_c(t)) - \int_{t-T}^{t} r(x(\tau), u(\tau))\, d\tau,$

where $u(t) = \hat{u}(x(t), \hat{W}_a(t))$ for all $t \in \mathbb{R}_{\geq t_0}$. Since the integral form does not require model knowledge, policies designed based on $\delta_{int}$ can be implemented without knowledge of $f$.

State derivative estimation-based techniques exploit the fact that if the system model is uncertain, the critic can compute the BE at each time instance $t$ using the state derivative $\dot{x}(t)$ as

$\delta_t(t) \triangleq \nabla_x\hat{V}(x(t), \hat{W}_c(t))\,\dot{x}(t) + r\left(x(t), \hat{u}(x(t), \hat{W}_a(t))\right).$

If the state derivative is not directly measurable, an approximation of the BE can be computed using a dynamically generated estimate of the state derivative. Note that the integral form of the BE is inherently dependent on the state trajectory, and since adaptive derivative estimators estimate the derivative only along the trajectory, the derivative estimation-based techniques are also dependent on the state trajectory. Hence, in techniques such as [59, 95, 145, 146], the BE can only be evaluated along the system trajectory.

Since the HJB equation constitutes a necessary and sufficient condition for optimality, the BE serves as an indirect measure of how close the critic parameter estimates $\hat{W}_c$ are to their ideal values; hence, in the RL literature, each evaluation of the BE is interpreted as gained experience. In particular, the critic receives state-derivative-action-reward tuples $\left(x(t), \dot{x}(t), u(t), r(x(t), u(t))\right)$ and computes the BE. The critic then performs a one-step update to the parameter estimates $\hat{W}_c$ based on either the instantaneous experience, quantified by the squared error $\delta_t^2(t)$, or the cumulative experience, quantified by the integral squared error

$E_t(t) \triangleq \int_{t_0}^{t}\delta_t^2(\tau)\, d\tau,$

using a steepest descent update law. The use of the cumulative squared error is motivated by the fact that in the presence of uncertainties, the BE can only be evaluated along the system trajectory; hence, $E_t(t)$ is the closest approximation to $E(t)$ that can be computed using the available information.

Intuitively, for $E_t(t)$ to approximate $E(t)$ over an operating domain, the state trajectory $x(t)$ needs to visit as many points in the operating domain as possible. This intuition is formalized by the fact that the use of the approximation $E_t(t)$ to update the critic parameter estimates is valid provided certain exploration conditions¹ are met. In RL terms, the exploration conditions translate to the need for the critic to gain enough experience in order to learn the value function. The exploration conditions can be relaxed using experience replay, where each evaluation of the BE $\delta_{int}$ is interpreted as gained experience, and these experiences are stored in a history stack and are repeatedly used in the learning algorithm to improve data efficiency; however, a finite amount of exploration is still required, since the values stored in the history stack are also constrained to the system trajectory.

While the estimates $\hat{W}_c$ are being updated by the critic, the actor simultaneously updates the parameter estimates $\hat{W}_a$ using a gradient-based approach so that the quantity $\hat{u}(x, \hat{W}_a) + \frac{1}{2}R^{-1}g^T(x)\left(\nabla_x\hat{V}(x, \hat{W}_c)\right)^T$ decreases. The weight updates are performed online in real time while the system is being controlled using the control law $u(t) = \hat{u}(x(t), \hat{W}_a(t))$. Naturally, it is difficult to guarantee stability during the learning phase. In fact, the use of two different sets of parameters to approximate the value function and the policy is motivated by the stability analysis. In particular, to date, the author is unaware of any results that can guarantee stability during the learning phase in an online continuous-time deterministic implementation of an RL-based actor-critic technique in which only the value function is approximated and the system is controlled using the control law $u = -\frac{1}{2}R^{-1}g^T(x)\left(\nabla_x\hat{V}(x, \hat{W}_c)\right)^T$.

¹ The exploration conditions are detailed in the next section for a linear-in-the-parameters (LIP) approximation of the value function.
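As a complement to the discussion above, the following sketch computes the integral form of the BE from logged data only, with no model knowledge. The Euler quadrature and the uniformly sampled buffer layout are illustrative assumptions.

```python
# Sketch of the integral BE over one window [t - T, t]: it needs only
# stored states and rewards, not the drift dynamics f.
import numpy as np

def integral_be(Wc, xs, rs, dt, sigma):
    """xs[k]: state at the k-th sample of the window; rs[k]: r(x_k, u_k).
    Returns V_hat(x(t-T)) - V_hat(x(t)) - integral of r (Euler sum)."""
    V_start = Wc @ sigma(xs[0])
    V_end = Wc @ sigma(xs[-1])
    running = float(np.sum(rs[:-1]) * dt)   # Euler approximation of the integral
    return V_start - V_end - running
```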


2.6 LIP Approximation of the Value Function

For feasibility of analysis, the optimal value function is approximated using a LIP approximation

$\hat{V}(x, \hat{W}_c) \triangleq \hat{W}_c^T\sigma(x),$

where $\sigma : \mathbb{R}^n \to \mathbb{R}^L$ is a continuously differentiable nonlinear activation function such that $\sigma(0) = 0$ and $\nabla\sigma(0) = 0$, and $\hat{W}_c \in \mathbb{R}^L$, where $L$ denotes the number of unknown parameters in the approximation of the value function. Based on the optimal policy expression, the optimal policy is approximated using the LIP approximation

$\hat{u}(x, \hat{W}_a) \triangleq -\frac{1}{2}R^{-1}g^T(x)\nabla\sigma^T(x)\hat{W}_a.$

The update law used by the critic to update the weight estimates is given by

$\dot{\hat{W}}_c = -\eta_c\Gamma\frac{\omega}{\rho}\delta_t, \qquad \dot{\Gamma} = \left(\beta\Gamma - \eta_c\Gamma\frac{\omega\omega^T}{\rho^2}\Gamma\right)\mathbf{1}_{\{\|\Gamma\| \leq \bar{\Gamma}\}}, \quad \|\Gamma(t_0)\| \leq \bar{\Gamma},$

where $\omega \triangleq \nabla\sigma(x)\dot{x} \in \mathbb{R}^L$ denotes the regressor vector, $\rho \triangleq 1 + \nu\omega^T\Gamma\omega \in \mathbb{R}$, $\eta_c, \nu, \beta \in \mathbb{R}_{>0}$ are constant learning gains, $\bar{\Gamma} \in \mathbb{R}_{>0}$ is a constant saturation constant, and $\Gamma$ is the least-squares gain matrix. The update law used by the actor to update the weight estimates is derived using a Lyapunov-based stability analysis, and is given by

$\dot{\hat{W}}_a = -\eta_{a1}\left(\hat{W}_a - \hat{W}_c\right) - \eta_{a2}\hat{W}_a + \frac{\eta_c\nabla\sigma(x)g(x)R^{-1}g^T(x)\nabla\sigma^T(x)\hat{W}_a\,\omega^T}{4\rho}\hat{W}_c,$

where $\eta_{a1}, \eta_{a2} \in \mathbb{R}_{>0}$ are constant learning gains. A block diagram of the resulting control architecture is presented in Figure 2-1.

Figure 2-1. Actor-critic architecture.

The stability analysis indicates that the sufficient exploration condition takes the form of a PE condition that requires the existence of positive constants $\underline{\psi}$ and $T$ such that the regressor vector satisfies

$\underline{\psi}I_L \leq \int_t^{t+T}\frac{\omega(\tau)\omega^T(\tau)}{\rho^2(\tau)}\,d\tau,$

for all $t \in \mathbb{R}_{\geq t_0}$.

Let $\tilde{W}_c \triangleq W - \hat{W}_c$ and $\tilde{W}_a \triangleq W - \hat{W}_a$ denote the vectors of parameter estimation errors, where $W \in \mathbb{R}^L$ denotes the constant vector of ideal parameters. Provided the PE condition is satisfied, and under sufficient conditions on the learning gains and the constants $\underline{\psi}$ and $T$, the candidate Lyapunov function

$V_L(x, \tilde{W}_c, \tilde{W}_a) \triangleq V^*(x) + \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\tilde{W}_c + \frac{1}{2}\tilde{W}_a^T\tilde{W}_a$

can be used to establish convergence of $x(t)$, $\tilde{W}_c(t)$, and $\tilde{W}_a(t)$ to a neighborhood of zero as $t \to \infty$, when the system is controlled using the control law $u(t) = \hat{u}(x(t), \hat{W}_a(t))$ and the parameter estimates $\hat{W}_c$ and $\hat{W}_a$ are updated using the critic and actor update laws above, respectively.

2.7 Uncertainties in System Dynamics

The use of the state derivative to compute the BE is advantageous because it is easier to obtain a dynamic estimate of the state derivative than it is to identify the system dynamics. For example, consider the high-gain dynamic state derivative estimator

$\dot{\hat{x}} = g(x)u + k\tilde{x} + \mu, \qquad \dot{\mu} = (k\alpha + 1)\tilde{x},$

where $\dot{\hat{x}} \in \mathbb{R}^n$ is an estimate of the state derivative, $\tilde{x} \triangleq x - \hat{x}$ is the state estimation error, and $k, \alpha \in \mathbb{R}_{>0}$ are identification gains. Using the estimator, the BE can be approximated by $\hat{\delta}_t$ as

$\hat{\delta}_t(t) = \nabla_x\hat{V}(x(t), \hat{W}_c(t))\,\dot{\hat{x}}(t) + r(x(t), u(t)).$

The critic can then learn the value function weights by using an approximation of the cumulative experience, quantified by the integral error

$\hat{E}_t(t) = \int_{t_0}^{t}\hat{\delta}_t^2(\tau)\,d\tau,$

i.e., by using $\hat{\delta}_t$ instead of $\delta_t$ in the critic update law.
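A minimal sketch of the high-gain estimator above, integrated with forward Euler alongside the plant. The plant model, the policy, and the gain values are illustrative assumptions, and only the state $x$ (not its derivative) is treated as measured.

```python
# Sketch of the high-gain state-derivative estimator:
# x_hat_dot = g(x)u + k*x_tilde + mu,  mu_dot = (k*alpha + 1)*x_tilde.
import numpy as np

def estimate_state_derivative(f, g, policy, x0, k=20.0, alpha=5.0,
                              dt=1e-4, T=5.0):
    n = len(x0)
    x = np.array(x0, float)
    x_hat = np.zeros(n)                       # state estimate
    mu = np.zeros(n)                          # auxiliary estimator state
    for _ in range(int(T / dt)):
        u = policy(x)
        x_tilde = x - x_hat                   # measurable estimation error
        x_hat_dot = g(x) @ u + k * x_tilde + mu
        mu = mu + (k * alpha + 1.0) * x_tilde * dt
        x_hat = x_hat + x_hat_dot * dt
        x = x + (f(x) + g(x) @ u) * dt        # true plant, used only to simulate
    return x_hat_dot                          # approximates x_dot at time T
```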


Under additional sufficient conditions on the gains $k$ and $\alpha$, the candidate Lyapunov function

$V_L(x, \tilde{W}_c, \tilde{W}_a, \tilde{x}, x_f) \triangleq V^*(x) + \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\tilde{W}_c + \frac{1}{2}\tilde{W}_a^T\tilde{W}_a + \frac{1}{2}\tilde{x}^T\tilde{x} + \frac{1}{2}x_f^Tx_f,$

where $x_f \triangleq \dot{\tilde{x}} + \alpha\tilde{x}$, can be used to establish convergence of $x(t)$, $\tilde{W}_c(t)$, $\tilde{W}_a(t)$, $\tilde{x}$, and $x_f$ to a neighborhood of zero, when the system is controlled using the control law $u(t) = \hat{u}(x(t), \hat{W}_a(t))$. This extension of the actor-critic method to handle uncertainties in the system dynamics using derivative estimation is known as the ACI architecture. A block diagram of the ACI architecture is presented in Figure 2-2.

Figure 2-2. Actor-critic-identifier architecture.

In general, the controller $u(t) = \hat{u}(x(t), \hat{W}_a(t))$ does not ensure the PE condition required for convergence. Thus, in an online implementation, an ad-hoc exploration signal is often added to the controller (cf. [43, 49, 54]). Since the exploration signal is not considered in the stability analysis, it is difficult to ensure stability of the online implementation. Moreover, the added probing signal causes large control effort expenditure, and there is no means to know when it is sufficient to remove the probing signal. The following chapter addresses the challenges associated with the satisfaction of the PE condition by using simulated experience along with the cumulative experience collected along the system trajectory.
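Summarizing the preliminaries, the following sketch performs one discrete-time step of the Section 2.6 critic and actor update laws. The forward-Euler discretization and all gain values are illustrative assumptions; $\delta_t$ is assumed to be computed externally from a measured or estimated state derivative.

```python
# Sketch of one Euler step of the LIP actor-critic update laws:
# normalized least-squares critic with saturated gain, Lyapunov-based actor.
import numpy as np

def ac_step(Wc, Wa, Gamma, x, x_dot, delta_t, grad_sigma, g, Rinv,
            eta_c=1.0, eta_a1=10.0, eta_a2=0.1, nu=0.005, beta=0.1,
            Gamma_bar=1e4, dt=1e-3):
    omega = grad_sigma(x) @ x_dot                     # regressor, shape (L,)
    rho = 1.0 + nu * omega @ Gamma @ omega            # normalization
    Wc = Wc - dt * eta_c * Gamma @ omega * delta_t / rho
    dGamma = beta * Gamma - eta_c * Gamma @ np.outer(omega, omega) @ Gamma / rho**2
    if np.linalg.norm(Gamma) <= Gamma_bar:            # saturated gain update
        Gamma = Gamma + dt * dGamma
    Gsig = grad_sigma(x) @ g(x) @ Rinv @ g(x).T @ grad_sigma(x).T
    Wa = Wa - dt * (eta_a1 * (Wa - Wc) + eta_a2 * Wa
                    - eta_c * (Gsig @ Wa) * (omega @ Wc) / (4 * rho))
    return Wc, Wa, Gamma
```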


CHAPTER 3
MODEL-BASED REINFORCEMENT LEARNING FOR APPROXIMATE OPTIMAL REGULATION

In this chapter, a CL-based implementation of model-based RL is developed to solve approximate optimal regulation problems online with a relaxed PE-like condition. The development is based on the observation that, given a model of the system, model-based RL can be implemented by evaluating the BE at any number of desired points in the state space. In this result, a parametric system model is considered, and a CL-based parameter identifier is developed to compensate for uncertainty in the parameters. UB regulation of the system states to a neighborhood of the origin, and convergence of the developed policy to a neighborhood of the optimal policy, are established using a Lyapunov-based analysis, and simulations are presented to demonstrate the performance of the developed controller.

3.1 Motivation

An ACI architecture to solve optimal regulation problems was presented in Chapter 2, under the restrictive PE requirement. The PE requirement is a consequence of the attempt to achieve uniform approximation using information obtained along one system trajectory. In particular, in order to approximate the value function, the critic in the ACI method utilizes experience gained along the system trajectory, quantified by the cumulative observed error $\hat{E}_t$, instead of the total error $E$. The critic in the ACI architecture is restricted to the use of experience gained along the system trajectory because evaluation of the BE requires state derivatives, and the dynamic state-derivative estimator can only estimate state derivatives along the system trajectory.

If the system dynamics are known, or if a system identifier can be developed to estimate the state derivative uniformly over the entire operating domain, then the critic can utilize simulated experience along with gained experience to learn the value function. In particular, the BE can be approximated as

$\hat{\delta}_X(x, \hat{W}_c, \hat{W}_a) \triangleq \nabla_x\hat{V}(x, \hat{W}_c)\,\dot{X}\!\left(x, \hat{u}(x, \hat{W}_a)\right) + r\left(x, \hat{u}(x, \hat{W}_a)\right),$

where $\dot{X} : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ denotes the estimated dynamics that map the state-action pair $\left(x, \hat{u}(x, \hat{W}_a)\right)$ to the corresponding state derivative. Since the control effectiveness and the control signal are known, a uniform parametric approximation $\hat{f}(x, \hat{\theta})$ of the function $f$, where $\hat{\theta}$ denotes the vector of parameter estimates, is sufficient to generate a uniform estimate of the system dynamics. In particular, using $\hat{f}$, the BE can be approximated as

$\hat{\delta}(x, \hat{W}_c, \hat{W}_a, \hat{\theta}) \triangleq \nabla_x\hat{V}(x, \hat{W}_c)\left(\hat{f}(x, \hat{\theta}) + g(x)\hat{u}(x, \hat{W}_a)\right) + r\left(x, \hat{u}(x, \hat{W}_a)\right).$

Similar to Section 2.6, the cumulative gained experience can be quantified using the integral error $\hat{E}_t$, where $\hat{\delta}_t(\tau) = \hat{\delta}(x(\tau), \hat{W}_c(\tau), \hat{W}_a(\tau), \hat{\theta}(\tau))$.

Given current parameter estimates $\hat{W}_c(t)$, $\hat{W}_a(t)$, and $\hat{\theta}(t)$, the approximate BE can be evaluated at any point $x_i \in \mathbb{R}^n$. That is, the critic can gain experience on how well the value function is estimated at any arbitrary point $x_i$ in the state space without actually visiting $x_i$. In other words, given a fixed state $x_i$ and a corresponding planned action $\hat{u}(x_i, \hat{W}_a)$, the critic can use the estimated drift dynamics $\hat{f}(x_i, \hat{\theta})$ to simulate a visit to $x_i$ by computing an estimate of the state derivative at $x_i$, resulting in simulated experience quantified by the BE $\hat{\delta}_{ti}(t) = \hat{\delta}(x_i, \hat{W}_c(t), \hat{W}_a(t), \hat{\theta}(t))$. The simulated experience can then be used along with gained experience by the critic to learn the value function. The motivation behind using simulated experience is that via selection of multiple (say $N$) points, the error signal $\hat{E}_t$ can be augmented to yield a heuristically better approximation

$\hat{E}_{ti}(t) \triangleq \int_{t_0}^{t}\left(\hat{\delta}_t^2(\tau) + \sum_{i=1}^{N}\hat{\delta}_{ti}^2(\tau)\right)d\tau,$

to the desired error signal $E(t)$. A block diagram of the simulation-based ACI architecture is presented in Figure 3-1.
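A minimal sketch of BE extrapolation, assuming a linearly parameterized drift estimate $Y(x)\hat{\theta}$ (anticipating Section 3.2) and a LIP value function. The helper names and the set of extrapolation points are illustrative.

```python
# Sketch of "simulated experience": evaluating the approximate BE at
# arbitrary unexplored points x_i using the identified drift model.
import numpy as np

def extrapolated_bes(Wc, Wa, theta_hat, points, Y, grad_sigma, g, Q, R):
    Rinv = np.linalg.inv(R)
    deltas = []
    for xi in points:                           # unexplored points x_i
        u_i = -0.5 * Rinv @ g(xi).T @ grad_sigma(xi).T @ Wa
        f_hat = Y(xi) @ theta_hat               # model-based drift estimate
        gradV = Wc @ grad_sigma(xi)
        deltas.append(gradV @ (f_hat + g(xi) @ u_i) + xi @ Q @ xi + u_i @ R @ u_i)
    return np.array(deltas)
```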


Figure 3-1. Simulation-based actor-critic-identifier architecture.

Online implementation of simulation of experience requires uniform online estimation of the function $f$ using the parametric approximation $\hat{f}(x, \hat{\theta})$, i.e., the parameter estimates $\hat{\theta}$ need to converge to their true values $\theta$. In the following, a system identifier that achieves uniform approximation of $f$ is developed based on recent ideas on data-driven parameter convergence in adaptive control (cf. [92, 93, 147]).

3.2 System Identification

Let $f(x^o) = Y(x^o)\theta$, for all $x^o \in \mathbb{R}^n$, be a linear parameterization of the function $f$, where $Y : \mathbb{R}^n \to \mathbb{R}^{n \times p}$ is the regression matrix and $\theta \in \mathbb{R}^p$ is the vector of constant unknown parameters.¹ Let $\hat{\theta} \in \mathbb{R}^p$ be an estimate of the unknown parameter vector $\theta$. To estimate the drift dynamics, an identifier is designed as

$\dot{\hat{x}} = Y(x)\hat{\theta} + g(x)\hat{u} + k_x\tilde{x},$

where the measurable state estimation error $\tilde{x}$ is defined as $\tilde{x} \triangleq x - \hat{x}$, and $k_x \in \mathbb{R}^{n \times n}$ is a positive definite, constant diagonal observer gain matrix. The identification error dynamics can be derived as

$\dot{\tilde{x}} = Y(x)\tilde{\theta} - k_x\tilde{x},$

where $\tilde{\theta}$ is the parameter identification error defined as $\tilde{\theta} \triangleq \theta - \hat{\theta}$.

3.2.1 CL-based Parameter Update

In traditional adaptive control, convergence of the estimates $\hat{\theta}$ to their true values is ensured by assuming that a PE condition is satisfied [89-91]. To ensure convergence without the PE condition, this result employs a CL-based approach to update the parameter estimates using recorded input-output data [92, 93, 147].

For ease of exposition, the following system identifier development is based on the assumption that the data required to perform CL-based system identification is available a priori in a history stack. For example, data recorded in a previous run of the system can be utilized, or the history stack can be recorded by running the system using a different known stabilizing controller for a finite amount of time until the recorded data satisfies the rank condition detailed in the following assumption.

From a practical perspective, a recorded history stack is unlikely to be available a priori. For such applications, the history stack can be recorded online.

¹ The function $f$ is assumed to be LP for ease of exposition. The system identifier can also be developed using multi-layer NNs for non-LP functions. For example, a system identifier developed using single-layer NNs is presented in Chapter 6.


Provided the system states are exciting over a finite time interval $t \in [t_0, t_0 + \bar{t}]$ (versus $t \in [t_0, \infty)$ as in traditional PE-based approaches) until the history stack satisfies the rank condition in the following assumption, a modified form of the controller developed in Section 3.3 can be used over the time interval $t \in [t_0, t_0 + \bar{t}]$, and the controller developed in Section 3.3 can be used thereafter. The required modifications to the controller, and the resulting modifications to the stability analysis, are provided in Appendix A.

Assumption 3.1. [92, 93] A history stack $H_{id}$ containing recorded state-action pairs $\{(x_j, \hat{u}_j) \mid j = 1, \ldots, M\}$, and corresponding numerically computed estimates $\{\dot{\bar{x}}_j \mid j = 1, \ldots, M\}$ of the state derivatives $\dot{x}_j \triangleq f(x_j) + g(x_j)\hat{u}_j$, that satisfies

$\mathrm{rank}\left(\sum_{j=1}^{M} Y_j^T Y_j\right) = p, \qquad \left\|\dot{\bar{x}}_j - \dot{x}_j\right\| < \bar{d}, \ \forall j,$

is available a priori, where $Y_j = Y(x_j)$ and $\bar{d} \in \mathbb{R}_{\geq 0}$ is a positive constant.

Based on Assumption 3.1, the update law for the parameter estimates is designed as

$\dot{\hat{\theta}} = \Gamma_\theta Y(x)^T\tilde{x} + \Gamma_\theta k_\theta\sum_{j=1}^{M} Y_j^T\left(\dot{\bar{x}}_j - g_j\hat{u}_j - Y_j\hat{\theta}\right),$

where $g_j \triangleq g(x_j)$, $\Gamma_\theta \in \mathbb{R}^{p \times p}$ is a constant positive definite adaptation gain matrix, and $k_\theta \in \mathbb{R}$ is a constant positive CL gain. Using the identification error dynamics and the definition of $\tilde{\theta}$, the bracketed term can be expressed as $\dot{\bar{x}}_j - g_j\hat{u}_j - Y_j\hat{\theta} = Y_j\tilde{\theta} + d_j$, where $d_j \triangleq \dot{\bar{x}}_j - \dot{x}_j \in \mathbb{R}^n$, and the parameter update law can be expressed in the advantageous form

$\dot{\hat{\theta}} = \Gamma_\theta Y(x)^T\tilde{x} + \Gamma_\theta k_\theta\left(\sum_{j=1}^{M} Y_j^T Y_j\right)\tilde{\theta} + \Gamma_\theta k_\theta\sum_{j=1}^{M} Y_j^T d_j.$

Even if a history stack is available a priori, the performance of the estimator may be improved by replacing old data with new data. The stability analysis in Section 3.4 allows for a changing history stack through the use of a singular value maximizing algorithm (cf. [93, 147]).

3.2.2 Convergence Analysis

Let $V_0 : \mathbb{R}^{n+p} \to \mathbb{R}_{\geq 0}$ be a positive definite continuously differentiable candidate Lyapunov function defined as

$V_0(z) \triangleq \frac{1}{2}\tilde{x}^T\tilde{x} + \frac{1}{2}\tilde{\theta}^T\Gamma_\theta^{-1}\tilde{\theta},$

where $z \triangleq \left[\tilde{x}^T, \tilde{\theta}^T\right]^T \in \mathbb{R}^{n+p}$. The following bounds on the Lyapunov function can be established:

$\frac{1}{2}\min\{1, \underline{\gamma}\}\|z\|^2 \leq V_0(z) \leq \frac{1}{2}\max\{1, \overline{\gamma}\}\|z\|^2,$

where $\underline{\gamma}, \overline{\gamma} \in \mathbb{R}$ denote the minimum and the maximum eigenvalues of the matrix $\Gamma_\theta^{-1}$. Using the update law and the error dynamics, the Lyapunov derivative can be expressed as

$\dot{V}_0 = -\tilde{x}^T k_x\tilde{x} - \tilde{\theta}^T k_\theta\left(\sum_{j=1}^{M} Y_j^T Y_j\right)\tilde{\theta} - k_\theta\tilde{\theta}^T\sum_{j=1}^{M} Y_j^T d_j.$

Let $y \in \mathbb{R}$ be the minimum eigenvalue of $\sum_{j=1}^{M} Y_j^T Y_j$. Since $\sum_{j=1}^{M} Y_j^T Y_j$ is symmetric and positive semidefinite, the rank condition in Assumption 3.1 can be used to conclude that it is also positive definite, and hence $y > 0$. The Lyapunov derivative can then be bounded as

$\dot{V}_0 \leq -k_x\|\tilde{x}\|^2 - k_\theta y\|\tilde{\theta}\|^2 + k_\theta\overline{d}_Y\|\tilde{\theta}\|,$

where $\overline{d}_Y = \bar{d}\sum_{j=1}^{M}\|Y_j\|$, and $k_x \in \mathbb{R}$ here denotes the minimum eigenvalue of the matrix $k_x$. The inequalities above can be used to conclude that $\|\tilde{\theta}\|$ and $\|\tilde{x}\|$ exponentially decay to an ultimate bound as $t \to \infty$.

The CL-based observer results in exponential regulation of the parameter and the state derivative estimation errors to a neighborhood around the origin.
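A minimal sketch of the CL-based parameter update and the rank condition check of Assumption 3.1. The history stack layout, the forward-Euler discretization, and the gain values are illustrative assumptions.

```python
# Sketch of the CL parameter update: instantaneous term driven by the
# state estimation error plus a concurrent-learning term replaying a
# recorded history stack {(x_j, u_j, xdot_j)}.
import numpy as np

def cl_theta_update(theta_hat, x, x_tilde, stack, Y, g, Gamma_theta,
                    k_theta, dt):
    cl_term = np.zeros_like(theta_hat)
    for xj, uj, xdot_j in stack:
        cl_term += Y(xj).T @ (xdot_j - g(xj) @ uj - Y(xj) @ theta_hat)
    theta_dot = Gamma_theta @ (Y(x).T @ x_tilde + k_theta * cl_term)
    return theta_hat + dt * theta_dot

def rank_condition_ok(stack, Y, p):
    """Checks rank(sum_j Y_j' Y_j) == p (Assumption 3.1)."""
    S = sum(Y(xj).T @ Y(xj) for xj, _, _ in stack)
    return np.linalg.matrix_rank(S) == p
```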


In the following, the parameter and state derivative estimates are used to approximately solve the HJB equation without knowledge of the drift dynamics.

3.3 Approximate Optimal Control

3.3.1 Value Function Approximation

Approximations to the optimal value function $V^*$ and the optimal policy $u^*$ are designed based on NN-based representations. A single-layer NN can be used to represent the optimal value function $V^*$ as

$V^*(x^o) = W^T\sigma(x^o) + \epsilon(x^o),$

for all $x^o \in \mathbb{R}^n$, where $W \in \mathbb{R}^L$ is the ideal weight vector, and $\sigma : \mathbb{R}^n \to \mathbb{R}^L$ and $\epsilon : \mathbb{R}^n \to \mathbb{R}$ are the activation function and the function reconstruction error introduced in Chapter 2. Based on this representation, a NN-based representation of the optimal controller is derived as

$u^*(x^o) = -\frac{1}{2}R^{-1}g^T(x^o)\left(\nabla\sigma^T(x^o)W + \nabla\epsilon^T(x^o)\right),$

for all $x^o \in \mathbb{R}^n$. The NN-based approximations $\hat{V} : \mathbb{R}^n \times \mathbb{R}^L \to \mathbb{R}$ of the optimal value function and $\hat{u} : \mathbb{R}^n \times \mathbb{R}^L \to \mathbb{R}^m$ of the optimal policy are given by the LIP approximations of Section 2.6, where $\hat{W}_c \in \mathbb{R}^L$ and $\hat{W}_a \in \mathbb{R}^L$ are estimates of the ideal weights $W$. The use of two sets of weights to estimate the same set of ideal weights is motivated by the stability analysis and the fact that it enables a formulation of the BE that is linear in the value function weight estimates $\hat{W}_c$, enabling a least squares-based adaptive update law. Using the parametric estimates $\hat{V}$ and $\hat{u}$ of the value function and the policy, and using the system identifier developed in Section 3.2, the approximate BE can be expressed as

$\hat{\delta}_t = \omega^T\hat{W}_c + x^TQx + \hat{u}^T(x, \hat{W}_a)R\,\hat{u}(x, \hat{W}_a),$

where $\omega \in \mathbb{R}^L$ is the regressor vector defined as

$\omega \triangleq \nabla\sigma(x)\left(Y(x)\hat{\theta} + g(x)\hat{u}(x, \hat{W}_a)\right).$


3.3.2 Simulation of Experience via BE Extrapolation

In traditional RL-based algorithms, the value function estimate and the policy estimate are updated based on observed data. The use of observed data to learn the value function naturally leads to a sufficient exploration condition which demands sufficient richness in the observed data. In stochastic systems, this is achieved using a randomized stationary policy (cf. [43, 48, 49]), whereas in deterministic systems, a probing noise is added to the derived control law (cf. [56, 57, 59, 114, 115]). The technique developed in this result implements simulation of experience in a model-based RL scheme by using $Y\hat{\theta}$ as an estimate of the uncertain drift dynamics $f$ to extrapolate the approximate BE to unexplored areas of the state space. The following rank condition enables the extrapolation of the approximate BE to a predefined set of points $\{x_i \in \mathbb{R}^n \mid i = 1, \ldots, N\}$ in the state space.

Assumption 3.2. There exists a finite set of points $\{x_i \in \mathbb{R}^n \mid i = 1, \ldots, N\}$ such that

$0 < \underline{c} \triangleq \inf_{t \in \mathbb{R}_{\geq t_0}}\lambda_{\min}\left\{\frac{1}{N}\sum_{i=1}^{N}\frac{\omega_i(t)\,\omega_i^T(t)}{\rho_i(t)}\right\},$

where $\lambda_{\min}$ denotes the minimum eigenvalue, $\omega_i$ and $\rho_i$ denote the regressor and the normalization term evaluated at $x_i$, and $\underline{c} \in \mathbb{R}$ is a positive constant.

To simulate experience, the approximate BE is evaluated at the sampled points $\{x_i \mid i = 1, \ldots, N\}$ as

$\hat{\delta}_{ti} = \omega_i^T\hat{W}_c + x_i^TQx_i + \hat{u}^T(x_i, \hat{W}_a)R\,\hat{u}(x_i, \hat{W}_a).$

For notational brevity, the dependence of the functions $f$, $Y$, $g$, $\sigma$, $\epsilon$, $\hat{u}$, $\hat{u}_i$, $\hat{\delta}_t$, and $\hat{\delta}_{ti}$ on the state, time, and the weights is suppressed hereafter. A CL-based least-squares update law for the value function weights is designed based on the subsequent stability analysis as

$\dot{\hat{W}}_c = -\eta_{c1}\Gamma\frac{\omega}{\rho}\hat{\delta}_t - \frac{\eta_{c2}}{N}\Gamma\sum_{i=1}^{N}\frac{\omega_i}{\rho_i}\hat{\delta}_{ti}, \qquad \dot{\Gamma} = \left(\beta\Gamma - \eta_{c1}\Gamma\frac{\omega\omega^T}{\rho^2}\Gamma\right)\mathbf{1}_{\{\|\Gamma\| \leq \bar{\Gamma}\}}, \quad \|\Gamma(t_0)\| \leq \bar{\Gamma},$

where $\mathbf{1}_{\{\cdot\}}$ denotes the indicator function, $\bar{\Gamma} \in \mathbb{R}_{>0}$ is the saturation constant, $\beta \in \mathbb{R}_{>0}$ is the forgetting factor, and $\eta_{c1}, \eta_{c2} \in \mathbb{R}_{>0}$ are constant adaptation gains. The update law ensures that the adaptation gain matrix is bounded such that

$\underline{\Gamma} \leq \|\Gamma(t)\| \leq \bar{\Gamma}, \qquad \forall t \in \mathbb{R}_{\geq t_0},$

where $\underline{\Gamma} \in \mathbb{R}_{>0}$ is a constant. The policy weights are then updated to follow the value function weights as²

$\dot{\hat{W}}_a = -\eta_{a1}\left(\hat{W}_a - \hat{W}_c\right) - \eta_{a2}\hat{W}_a + \left(\frac{\eta_{c1}G_\sigma^T\hat{W}_a\omega^T}{4\rho} + \sum_{i=1}^{N}\frac{\eta_{c2}G_{\sigma i}^T\hat{W}_a\omega_i^T}{4N\rho_i}\right)\hat{W}_c,$

² Using the fact that the ideal weights are bounded, a projection-based (cf. [148]) update law $\dot{\hat{W}}_a = \mathrm{proj}\left\{-\eta_{a1}(\hat{W}_a - \hat{W}_c)\right\}$ can be utilized to update the policy weights. Since the policy weights are bounded a priori by the projection algorithm, a less complex stability analysis can be used to establish the result in Theorem 3.1.


where $\eta_{a1}, \eta_{a2} \in \mathbb{R}$ are positive constant adaptation gains and $G_\sigma \triangleq \nabla\sigma\, g R^{-1}g^T\nabla\sigma^T \in \mathbb{R}^{L \times L}$.

The update law above is fundamentally different from the CL-based adaptive update in results such as [92, 93], in the sense that the points $\{x_i \in \mathbb{R}^n \mid i = 1, \ldots, N\}$ are selected a priori based on prior information about the desired behavior of the system, and, using an estimate of the system dynamics, the approximate BE is evaluated at $\{x_i \in \mathbb{R}^n \mid i = 1, \ldots, N\}$. In the CL-based adaptive update in results such as [92, 93], the prediction error is used as a metric for learning. The prediction error depends on measured or numerically computed values of the state derivative; hence, the prediction error can only be evaluated at observed data points along the state trajectory.

3.4 Stability Analysis

To facilitate the subsequent stability analysis, the approximate BE is expressed in terms of the weight estimation errors $\tilde{W}_c$ and $\tilde{W}_a$ as

$\hat{\delta}_t = -\omega^T\tilde{W}_c - W^T\nabla\sigma\, Y\tilde{\theta} + \frac{1}{4}\tilde{W}_a^TG_\sigma\tilde{W}_a + \frac{1}{4}G_\epsilon - \nabla\epsilon\, f + \frac{1}{2}W^T\nabla\sigma\, G\nabla\epsilon^T,$

where $G \triangleq gR^{-1}g^T \in \mathbb{R}^{n \times n}$ and $G_\epsilon \triangleq \nabla\epsilon\, G\nabla\epsilon^T \in \mathbb{R}$. Similarly, the approximate BE evaluated at the sampled states $\{x_i \mid i = 1, \ldots, N\}$ can be expressed as

$\hat{\delta}_{ti} = -\omega_i^T\tilde{W}_c + \frac{1}{4}\tilde{W}_a^TG_{\sigma i}\tilde{W}_a - W^T\nabla\sigma_i\, Y_i\tilde{\theta} + \Delta_i,$

where $Y_i = Y(x_i)$, and $\Delta_i \triangleq \frac{1}{2}W^T\nabla\sigma_i\, G_i\nabla\epsilon_i^T + \frac{1}{4}G_{\epsilon i} - \nabla\epsilon_i\, f_i \in \mathbb{R}$ is a constant.
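A minimal sketch of monitoring the excitation condition of Assumption 3.2 online, here with $\Gamma$ taken as the identity in the normalization term; the function names and gain value are illustrative assumptions.

```python
# Sketch of checking the extrapolation rank condition: the normalized
# extrapolation regressors must keep a strictly positive minimum eigenvalue.
import numpy as np

def extrapolation_excitation(points, theta_hat, Wa, Y, grad_sigma, g,
                             Rinv, nu=1.0):
    N = len(points)
    S = 0.0
    for xi in points:
        u_i = -0.5 * Rinv @ g(xi).T @ grad_sigma(xi).T @ Wa
        omega_i = grad_sigma(xi) @ (Y(xi) @ theta_hat + g(xi) @ u_i)
        rho_i = 1.0 + nu * omega_i @ omega_i     # normalization (Gamma = I here)
        S = S + np.outer(omega_i, omega_i) / rho_i
    return np.linalg.eigvalsh(S / N).min()       # should stay bounded away from 0
```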


Let $\mathcal{Z} \subset \mathbb{R}^{2n+2L+p}$ denote a compact set, and let $\chi \triangleq \mathcal{Z} \cap \mathbb{R}^n$. On the compact set $\chi \subset \mathbb{R}^n$ the function $Y$ is Lipschitz continuous; hence, there exists a positive constant $L_Y \in \mathbb{R}$ such that³

$\|Y(x)\| \leq L_Y\|x\|, \qquad \forall x \in \chi.$

Furthermore, using the universal function approximation property, the ideal weight vector $W \in \mathbb{R}^L$ is bounded above by a known positive constant $\overline{W}$ in the sense that $\|W\| \leq \overline{W}$, and the function reconstruction error is uniformly bounded over $\chi$ such that $\sup_{x^o \in \chi}|\epsilon(x^o)| \leq \overline{\epsilon}$ and $\sup_{x^o \in \chi}\|\nabla\epsilon(x^o)\| \leq \overline{\nabla\epsilon}$. Using the bounds on $\Gamma$, the normalized regressor can be bounded as

$\sup_{t \in \mathbb{R}_{\geq t_0}}\left\|\frac{\omega(t)}{\rho(t)}\right\| \leq \frac{1}{2\sqrt{\nu\underline{\Gamma}}}.$

For brevity of notation, for a function $\eta : \mathbb{R}^n \to \mathbb{R}$, define the operator $\overline{(\cdot)}$ as $\overline{\eta} \triangleq \sup_{x^o \in \chi}\|\eta(x^o)\|$, and define the following positive constants:

$\vartheta_1 \triangleq \frac{\eta_{c1}L_Y\|\theta\|\,\overline{\nabla\epsilon}}{4\sqrt{\nu\underline{\Gamma}}}, \quad \vartheta_2 \triangleq \sum_{i=1}^{N}\frac{\eta_{c2}\|\nabla\sigma_iY_i\|\,\overline{W}}{4N\sqrt{\nu\underline{\Gamma}}}, \quad \vartheta_3 \triangleq \frac{\eta_{c1}L_Y\overline{W}\,\overline{\nabla\sigma}}{4\sqrt{\nu\underline{\Gamma}}}, \quad \vartheta_4 \triangleq \frac{1}{4}\overline{G_\epsilon},$

$\vartheta_5 \triangleq \frac{\eta_{c1}\overline{\left\|2W^T\nabla\sigma\,G\nabla\epsilon^T + G_\epsilon\right\|}}{8\sqrt{\nu\underline{\Gamma}}} + \sum_{i=1}^{N}\frac{\eta_{c2}}{N}\overline{\left\|\frac{\omega_i}{\rho_i}\Delta_i\right\|}, \quad \vartheta_7 \triangleq \frac{\eta_{c1}\overline{\|G_\sigma\|}}{8\sqrt{\nu\underline{\Gamma}}} + \sum_{i=1}^{N}\frac{\eta_{c2}\|G_{\sigma i}\|}{8N\sqrt{\nu\underline{\Gamma}}},$

$\vartheta_6 \triangleq \frac{1}{2}\overline{\left\|W^TG_\sigma + \nabla\epsilon\,G\nabla\sigma^T\right\|} + \vartheta_7\overline{W}^2 + \eta_{a2}\overline{W}, \quad q \triangleq \lambda_{\min}\{Q\},$

$v_l \triangleq \frac{1}{2}\min\left\{\frac{q}{2},\ \frac{\eta_{c2}\underline{c}}{3},\ \frac{\eta_{a1}+2\eta_{a2}}{6},\ k_x,\ \frac{k_\theta y}{4}\right\}, \quad \iota \triangleq \frac{3\vartheta_5^2}{4\eta_{c2}\underline{c}} + \frac{3\vartheta_6^2}{2(\eta_{a1}+2\eta_{a2})} + \frac{k_\theta\overline{d}_Y^2}{2y} + \vartheta_4.$

To facilitate the stability analysis, let $V_L : \mathbb{R}^{2n+2L+p} \times \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ be a continuously differentiable positive definite candidate Lyapunov function defined as

$V_L(Z, t) \triangleq V^*(x) + \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\tilde{W}_c + \frac{1}{2}\tilde{W}_a^T\tilde{W}_a + V_0(z),$

where $V^*$ is the optimal value function, $V_0$ was introduced in Section 3.2.2, and $Z = \left[x^T, \tilde{W}_c^T, \tilde{W}_a^T, \tilde{x}^T, \tilde{\theta}^T\right]^T$. Using the fact that $V^*$ is positive definite, the bounds on $\Gamma$ and Lemma 4.3 from [149] yield

$\underline{v}(\|Z^o\|) \leq V_L(Z^o, t) \leq \overline{v}(\|Z^o\|),$

for all $t \in \mathbb{R}_{\geq t_0}$ and for all $Z^o \in \mathbb{R}^{2n+2L+p}$, where $\underline{v}, \overline{v} : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ are class $\mathcal{K}$ functions.

The sufficient conditions for UB convergence are derived based on the subsequent stability analysis as

$\frac{\eta_{a1}+2\eta_{a2}}{6} > \vartheta_7\overline{W}\frac{\gamma_2^2+1}{2\gamma_2}, \qquad \frac{k_\theta y}{4} > \frac{\vartheta_2}{\gamma_1} + \gamma_3\vartheta_3\bar{Z}, \qquad \frac{q}{2} > \vartheta_1,$

$\frac{\eta_{c2}\underline{c}}{3} > \vartheta_1 + \gamma_1\vartheta_2 + 2\vartheta_7\overline{W} + \frac{\eta_{a1}}{2} + \frac{\vartheta_3\bar{Z}}{\gamma_3}, \qquad \underline{v}^{-1}\left(\overline{v}\left(\sqrt{\frac{\iota}{v_l}}\right)\right) < r,$

where $\bar{Z} \triangleq \underline{v}^{-1}\left(\overline{v}\left(\max\left\{\|Z(t_0)\|, \sqrt{\iota/v_l}\right\}\right)\right)$, $r \in \mathbb{R}_{\geq 0}$ denotes the radius of the set $\mathcal{Z}$ defined as $r \triangleq \frac{1}{2}\sup\{\|x - y\| \mid x, y \in \mathcal{Z}\}$, and $\gamma_1, \gamma_2, \gamma_3 \in \mathbb{R}$ are known positive adjustable constants. The Lipschitz constant and the NN function approximation errors depend on the underlying compact set; hence, given a bound on the initial condition $Z(t_0)$ for the concatenated state $Z$, a compact set that contains the concatenated state trajectory needs to be established before adaptation gains satisfying the sufficient conditions can be selected. In the following, based on the subsequent stability analysis, an algorithm is developed to compute the required compact set, denoted by $\mathcal{Z} \subset \mathbb{R}^{2n+2L+p}$. In Algorithm 3.1, the notation $\{\xi\}_i$ denotes the value of $\xi$ computed in the $i$-th iteration. Since the constants $\iota$ and $v_l$ depend on $L_Y$ only through the products $L_Y\overline{\nabla\epsilon}$ and $L_Y\gamma_3$, Algorithm 3.1 ensures the satisfaction of the sufficient conditions. The main result of this chapter can now be stated as follows.

³ The Lipschitz property is exploited here for clarity of exposition. The bound can easily be generalized to $\|Y(x)\| \leq L_Y(\|x\|)\|x\|$, where $L_Y : \mathbb{R} \to \mathbb{R}$ is a positive, non-decreasing function.


Algorithm 3.1 Gain Selection

First iteration: Given $\bar{z} \in \mathbb{R}_{>0}$ such that $\|Z(t_0)\| < \bar{z}$, let $\mathcal{Z}_1 \triangleq \left\{\xi \in \mathbb{R}^{2n+2L+p} \mid \|\xi\| \leq \underline{v}^{-1}(\overline{v}(\bar{z}))\right\}$. Using $\mathcal{Z}_1$, compute the bounds and select the gains according to the sufficient conditions. If $\left\{\sqrt{\iota/v_l}\right\}_1 \leq \bar{z}$, set $\mathcal{Z} = \mathcal{Z}_1$ and terminate.

Second iteration: If $\bar{z} < \left\{\sqrt{\iota/v_l}\right\}_1$, let $\mathcal{Z}_2 \triangleq \left\{\xi \in \mathbb{R}^{2n+2L+p} \mid \|\xi\| \leq \underline{v}^{-1}\left(\overline{v}\left(\left\{\sqrt{\iota/v_l}\right\}_1\right)\right)\right\}$. Using $\mathcal{Z}_2$, compute the bounds and select the gains according to the sufficient conditions. If $\left\{\sqrt{\iota/v_l}\right\}_2 \leq \left\{\sqrt{\iota/v_l}\right\}_1$, set $\mathcal{Z} = \mathcal{Z}_2$ and terminate.

Third iteration: If $\left\{\sqrt{\iota/v_l}\right\}_2 > \left\{\sqrt{\iota/v_l}\right\}_1$, increase the number of NN neurons to $\{L\}_3$ to ensure $\left\{L_Y\overline{\nabla\epsilon}\right\}_3 \leq \left\{L_Y\overline{\nabla\epsilon}\right\}_2$, increase the constant $\gamma_3$ to ensure $\left\{L_Y\gamma_3\right\}_3 \leq \left\{L_Y\gamma_3\right\}_2$, and increase the gain $k_\theta$ to satisfy the gain conditions. These adjustments ensure $\{\iota\}_3 \leq \{\iota\}_2$. Set $\mathcal{Z} = \left\{\xi \in \mathbb{R}^{2n+2L+p} \mid \|\xi\| \leq \underline{v}^{-1}\left(\overline{v}\left(\left\{\sqrt{\iota/v_l}\right\}_2\right)\right)\right\}$ and terminate.

Theorem 3.1. Provided Assumptions 3.1 and 3.2 hold and the gains $q$, $\eta_{c2}$, $\eta_{a2}$, and $k_\theta$ are selected large enough using Algorithm 3.1, the observer developed in Section 3.2 along with the CL-based parameter update law, and the controller $\hat{u}(x, \hat{W}_a)$ along with the critic and actor update laws, ensure that the state $x$, the state estimation error $\tilde{x}$, the value function weight estimation error $\tilde{W}_c$, and the policy weight estimation error $\tilde{W}_a$ are UB.

Proof. The time derivative of $V_L$ along the closed-loop trajectories is given by

$\dot{V}_L = \nabla V^*\left(f + g\hat{u}\right) - \tilde{W}_c^T\left(-\eta_{c1}\frac{\omega}{\rho}\hat{\delta}_t - \frac{\eta_{c2}}{N}\sum_{i=1}^{N}\frac{\omega_i}{\rho_i}\hat{\delta}_{ti}\right) - \tilde{W}_a^T\left(-\eta_{a1}\left(\hat{W}_a - \hat{W}_c\right) - \eta_{a2}\hat{W}_a\right) - \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\left(\beta\Gamma - \eta_{c1}\Gamma\frac{\omega\omega^T}{\rho^2}\Gamma\right)\Gamma^{-1}\tilde{W}_c - \tilde{x}^Tk_x\tilde{x} - k_\theta\tilde{\theta}^T\left(\sum_{j=1}^{M}Y_j^TY_j\right)\tilde{\theta} - k_\theta\tilde{\theta}^T\sum_{j=1}^{M}Y_j^Td_j - \tilde{W}_a^T\left(\frac{\eta_{c1}G_\sigma^T\hat{W}_a\omega^T}{4\rho} + \sum_{i=1}^{N}\frac{\eta_{c2}G_{\sigma i}^T\hat{W}_a\omega_i^T}{4N\rho_i}\right)\hat{W}_c.$

Substituting for the approximate BEs, using the bounds established above, and using Young's inequality, the Lyapunov derivative can be upper-bounded as

$\dot{V}_L \leq -\frac{q}{2}\|x\|^2 - \frac{\eta_{c2}\underline{c}}{3}\left\|\tilde{W}_c\right\|^2 - \frac{\eta_{a1}+2\eta_{a2}}{6}\left\|\tilde{W}_a\right\|^2 - k_x\|\tilde{x}\|^2 - \frac{k_\theta y}{4}\left\|\tilde{\theta}\right\|^2 - \left(\frac{q}{2} - \vartheta_1\right)\|x\|^2 - \left(\frac{\eta_{c2}\underline{c}}{3} - \vartheta_1 - \gamma_1\vartheta_2 - 2\vartheta_7\overline{W} - \frac{\eta_{a1}}{2} - \frac{\vartheta_3\|x\|}{\gamma_3}\right)\left\|\tilde{W}_c\right\|^2 - \left(\frac{k_\theta y}{4} - \frac{\vartheta_2}{\gamma_1} - \gamma_3\vartheta_3\|x\|\right)\left\|\tilde{\theta}\right\|^2 - \left(\frac{\eta_{a1}+2\eta_{a2}}{6} - \vartheta_7\overline{W}\frac{\gamma_2^2+1}{2\gamma_2}\right)\left\|\tilde{W}_a\right\|^2 + \frac{3\vartheta_5^2}{4\eta_{c2}\underline{c}} + \frac{3\vartheta_6^2}{2(\eta_{a1}+2\eta_{a2})} + \frac{k_\theta\overline{d}_Y^2}{2y} + \frac{1}{4}\overline{G_\epsilon}.$

Provided the gains are selected using Algorithm 3.1, the Lyapunov derivative can be upper-bounded as

$\dot{V}_L \leq -v_l\|Z\|^2, \qquad \forall\|Z\| \geq \sqrt{\frac{\iota}{v_l}} > 0,$

for all $t \geq 0$ and for all $Z \in \mathcal{Z}$. Using the bounds on $V_L$, Theorem 4.18 in [149] can now be invoked to conclude that $Z$ is UB in the sense that $\limsup_{t \to \infty}\|Z(t)\| \leq \underline{v}^{-1}\left(\overline{v}\left(\sqrt{\iota/v_l}\right)\right)$. Furthermore, the concatenated state trajectories are bounded such that $\|Z(t)\| \leq \bar{Z}$ for all $t \in \mathbb{R}_{\geq t_0}$. Since the estimates $\hat{W}_a$ approximate the ideal weights $W$, the definitions of $\hat{u}$ and $u^*$ can be used to conclude that the policy $\hat{u}$ approximates the optimal policy $u^*$.⁴

3.5 Simulation

This section presents two simulations to demonstrate the performance and the applicability of the developed technique. First, the performance of the developed controller is demonstrated through approximate solution of an optimal control problem that has a known analytical solution. Based on the known solution, an exact polynomial basis is used for value function approximation.

⁴ If $H_{id}$ is updated with new data, the identifier and the weight update laws form a switched system. Provided $H_{id}$ is updated using a singular value maximizing algorithm, the bounds on $V_L$ can be used to establish that $V_L$ is a common Lyapunov function for the switched system (cf. [93]).


The second simulation demonstrates the applicability of the developed technique in the case where the analytical solution, and hence the basis for value function approximation, is unknown. In this case, since the optimal solution is unknown, the optimal trajectories obtained using the developed technique are compared with optimal trajectories obtained through a numerical optimal control technique.

3.5.1 Problem with a Known Basis

The performance of the developed controller is demonstrated by simulating a nonlinear, control-affine system with a two-dimensional state $x = [x_1, x_2]^T$. The system dynamics are of the control-affine form introduced in Chapter 2, where [57]

$f(x) = \begin{bmatrix} x_1 & x_2 & 0 & 0\\ 0 & 0 & x_1 & x_2\left(1 - (\cos(2x_1)+2)^2\right)\end{bmatrix}\begin{bmatrix} a\\ b\\ c\\ d\end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0\\ \cos(2x_1)+2\end{bmatrix},$

and $a, b, c, d \in \mathbb{R}$ are unknown parameters. The parameters are selected as⁵ $a = -1$, $b = 1$, $c = -0.5$, and $d = -0.5$. The control objective is to minimize the cost functional with $Q = I_{2 \times 2}$ and $R = 1$. The optimal value function and optimal control for this system are given by $V^*(x) = \frac{1}{2}x_1^2 + x_2^2$ and $u^*(x) = -(\cos(2x_1)+2)x_2$ (cf. [57]).

To facilitate the identifier design, thirty data points are recorded using a singular value maximizing algorithm (cf. [93]) for the CL-based adaptive update law. The state derivative at the recorded data points is computed using a fifth-order Savitzky-Golay smoothing filter (cf. [150]).

To facilitate the ADP-based controller, the basis function $\sigma : \mathbb{R}^2 \to \mathbb{R}^3$ for value function approximation is selected as $\sigma(x) = \left[x_1^2, x_1x_2, x_2^2\right]^T$. Based on the analytical solution, the ideal weights are $W = [0.5, 0, 1]^T$. The data points for the CL-based update law are selected to be on a $5 \times 5$ grid on a $2 \times 2$ square around the origin. The learning gains are selected as $\eta_{c1} = 1$, $\eta_{c2} = 15$, $\eta_{a1} = 100$, $\eta_{a2} = 0.1$, $\nu = 0.005$, $k_x = 10I_{2 \times 2}$, $\Gamma_\theta = 20I_{4 \times 4}$, and $k_\theta = 30$. The policy and the value function weight estimates are initialized using a stabilizing set of initial weights as $\hat{W}_c(t_0) = \hat{W}_a(t_0) = [1, 1, 1]^T$, and the least-squares gain is initialized as $\Gamma(t_0) = 100I_{3 \times 3}$. The initial condition for the system state is selected as $x(t_0) = [-1, -1]^T$, the state estimates $\hat{x}$ are initialized to zero, the parameter estimates $\hat{\theta}$ are initialized to one, and the history stack for CL is recorded online.

Figures 3-2 through 3-4 demonstrate that the system state is regulated to the origin, the unknown parameters in the drift dynamics are identified, and the value function and the policy weights converge to their true values. Furthermore, unlike previous results, an ad-hoc probing signal to ensure PE is not required.

Figure 3-2. System state and control trajectories generated using the developed method for the system in Section 3.5.1.

⁵ The origin is an unstable equilibrium point of the unforced system $\dot{x} = f(x)$.
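Since the Section 3.5.1 example admits the stated analytic solution, a cheap numerical cross-check is to sample the closed-loop HJB residual, which should vanish identically for $V^*$; the grid of test points below is an illustrative choice.

```python
# Sanity check of the Section 3.5.1 example: with V*(x) = 0.5 x1^2 + x2^2
# and Q = I, R = 1, the closed-loop HJB residual is identically zero.
import numpy as np

a, b, c, d = -1.0, 1.0, -0.5, -0.5      # true drift parameters

def f(x):
    return np.array([a*x[0] + b*x[1],
                     c*x[0] + d*x[1]*(1.0 - (np.cos(2*x[0]) + 2.0)**2)])

def g(x):
    return np.array([0.0, np.cos(2*x[0]) + 2.0])

def gradV(x):                            # gradient of V*(x) = 0.5 x1^2 + x2^2
    return np.array([x[0], 2.0*x[1]])

def hjb_residual(x, R=1.0):              # grad V* f - (1/4)(grad V* g)^2/R + x'x
    gv = gradV(x)
    return gv @ f(x) - 0.25 * (gv @ g(x))**2 / R + x @ x

pts = np.random.uniform(-2, 2, size=(100, 2))
print(max(abs(hjb_residual(x)) for x in pts))   # numerically zero (~1e-14)
```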


Figure 3-3. Actor and critic weight trajectories generated using the developed method for the system in Section 3.5.1 compared with their true values. The true values computed based on the analytical solution are represented by dotted lines.

Figure 3-4. Drift parameter estimate trajectories generated using the developed method for the system in Section 3.5.1 compared to the actual drift parameters. The dotted lines represent true values of the drift parameters.

3.5.2 Problem with an Unknown Basis

To demonstrate the applicability of the developed controller, a nonlinear, control-affine system with a four-dimensional state $x = [x_1, x_2, x_3, x_4]^T$ is simulated. The system dynamics are of the control-affine form introduced in Chapter 2, where

$f(x) = \begin{bmatrix} x_3\\ x_4\\ -M^{-1}V_m\begin{bmatrix}x_3\\x_4\end{bmatrix}\end{bmatrix} + \begin{bmatrix} 0_{2 \times 4}\\ -\begin{bmatrix}M^{-1} & M^{-1}\end{bmatrix}D\end{bmatrix}\begin{bmatrix} f_{d1}\\ f_{d2}\\ f_{s1}\\ f_{s2}\end{bmatrix}, \qquad g(x) = \begin{bmatrix} 0_{2 \times 2}\\ M^{-1}\end{bmatrix}.$

Here, $D \triangleq \mathrm{diag}\left[x_3, x_4, \tanh(x_3), \tanh(x_4)\right]$, and the matrices $M, V_m, F_d, F_s \in \mathbb{R}^{2 \times 2}$ are defined as

$M \triangleq \begin{bmatrix} p_1 + 2p_3c_2 & p_2 + p_3c_2\\ p_2 + p_3c_2 & p_2\end{bmatrix}, \quad V_m \triangleq \begin{bmatrix} -p_3s_2x_4 & -p_3s_2(x_3 + x_4)\\ p_3s_2x_3 & 0\end{bmatrix},$

$F_d \triangleq \begin{bmatrix} f_{d1} & 0\\ 0 & f_{d2}\end{bmatrix}, \quad F_s \triangleq \begin{bmatrix} f_{s1}\tanh(x_3) & 0\\ 0 & f_{s2}\tanh(x_4)\end{bmatrix},$

where $c_2 = \cos(x_2)$, $s_2 = \sin(x_2)$, $p_1 = 3.473$, $p_2 = 0.196$, and $p_3 = 0.242$, and $f_{d1}, f_{d2}, f_{s1}, f_{s2} \in \mathbb{R}$ are unknown parameters. The parameters are selected as $f_{d1} = 5.3$, $f_{d2} = 1.1$, $f_{s1} = 8.45$, and $f_{s2} = 2.35$. The control objective is to minimize the cost functional with $Q = \mathrm{diag}[10, 10, 1, 1]$ and $R = \mathrm{diag}[1, 1]$.

To facilitate the ADP-based controller, the basis function $\sigma : \mathbb{R}^4 \to \mathbb{R}^{10}$ for value function approximation is selected as

$\sigma(x) = \left[x_1x_3,\ x_2x_4,\ x_3x_2,\ x_4x_1,\ x_1x_2,\ x_4x_3,\ x_1^2,\ x_2^2,\ x_3^2,\ x_4^2\right]^T.$

The data points for the CL-based update law are selected to be on a $3 \times 3 \times 3 \times 3$ grid around the origin, and the policy weights are updated using a projection-based update law. The learning gains are selected as $\eta_{c1} = 1$, $\eta_{c2} = 30$, $\eta_{a1} = 0.1$, $\nu = 0.0005$, $k_x = 10I_4$, $\Gamma_\theta = \mathrm{diag}[90, 50, 160, 50]$, and $k_\theta = 1.1$. The least-squares gain is initialized as $\Gamma(t_0) = 1000I_{10}$, and the policy and the value function weight estimates are initialized as $\hat{W}_c(t_0) = \hat{W}_a(t_0) = [5, 5, 0, 0, 0, 0, 25, 0, 2, 2]^T$. The initial condition for the system state is selected as $x(t_0) = [1, 1, 0, 0]^T$, the state estimates $\hat{x}$ are initialized to zero, the parameter estimates $\hat{\theta}$ are initialized to one, and a history stack containing thirty data points is recorded online using a singular value maximizing algorithm (cf. [93]) for the CL-based adaptive update law. The state derivative at the recorded data points is computed using a fifth-order Savitzky-Golay smoothing filter (cf. [150]).
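A minimal sketch of the state-derivative estimation used for the history stacks in these simulations: scipy's Savitzky-Golay filter with polynomial order 5 and `deriv=1`. The window length, sampling period, and stand-in signal are illustrative assumptions.

```python
# Sketch of numerically estimating recorded state derivatives with a
# fifth-order Savitzky-Golay smoothing filter.
import numpy as np
from scipy.signal import savgol_filter

dt = 1e-3
t = np.arange(0.0, 2.0, dt)
x = np.stack([np.sin(t), np.cos(2*t)], axis=1)   # stand-in recorded states

# polyorder=5 ("fifth order"); deriv=1 returns the smoothed first derivative
x_dot = savgol_filter(x, window_length=21, polyorder=5, deriv=1,
                      delta=dt, axis=0)
print(np.max(np.abs(x_dot[:, 0] - np.cos(t))))   # derivative error is small
```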


Figure 3-5. System state and control trajectories generated using the developed method for the system in Section 3.5.2.

Figures 3-5 through 3-7 demonstrate that the system state is regulated to the origin, the unknown parameters in the drift dynamics are identified, and the value function and the policy weights converge. The value function and the policy weights converge to the following values:

$\hat{W}_c = \hat{W}_a = [24.7,\ 1.19,\ 2.25,\ 2.67,\ 1.18,\ 0.93,\ 44.34,\ 11.31,\ 3.81,\ 0.10]^T.$

Since the true values of the value function weights are unknown, the weights above cannot be compared to their true values. However, a measure of proximity of the weights to the ideal weights $W$ can be obtained by comparing the system trajectories resulting from applying the feedback control policy $\hat{u}(x) = -\frac{1}{2}R^{-1}g^T(x)\nabla\sigma^T(x)\hat{W}_a$ to the system against numerically computed optimal system trajectories. In Figure 3-8, the numerical optimal solution is obtained using an infinite-horizon Gauss pseudospectral method (cf. [9]) using 45 collocation points. Figure 3-8 indicates that the converged weights generate state and control trajectories that closely match the numerically computed optimal trajectories.

Figure 3-6. Actor and critic weight trajectories generated using the developed method for the system in Section 3.5.2. Since an analytical optimal solution is not available, the weight estimates cannot be compared with their true values.

Figure 3-7. Drift parameter estimate trajectories generated using the developed method for the system in Section 3.5.2 compared to the actual drift parameters. The dotted lines represent true values of the drift parameters.

Figure 3-8. State and control trajectories generated using feedback policy $\hat{u}(x)$ compared to a numerical optimal solution for the system in Section 3.5.2.

3.6 Concluding Remarks

An online approximate optimal controller is developed, where the value function is approximated without PE via novel use of a CL-based system identifier to implement simulation of experience in model-based RL. The PE condition is replaced by a weaker rank condition that can be verified online from recorded data. UB regulation of the system states to a neighborhood of the origin, and convergence of the policy to a neighborhood of the optimal policy, are established using a Lyapunov-based analysis. Simulations demonstrate that the developed technique generates an approximation to the optimal controller online, while maintaining system stability, without the use of an ad-hoc probing signal. The Lyapunov analysis suggests that the convergence critically depends on the amount of collective information available in the set of BEs evaluated at the predefined points.


This relationship is similar to the conditions on the strength and the interval of PE that are required for parameter convergence in adaptive systems in the presence of bounded or Lipschitz additive disturbances.

The control technique developed in this chapter does not account for additive external disturbances. Traditionally, optimal disturbance rejection is achieved via a feedback-Nash equilibrium solution of an $H_\infty$ control problem. The $H_\infty$ control problem is a two-player zero-sum differential game problem. Motivated by the need to accomplish disturbance rejection, the following chapter extends the results of this chapter to obtain feedback-Nash equilibrium solutions to a more general N-player nonzero-sum differential game.


CHAPTER 4
MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE FEEDBACK-NASH EQUILIBRIUM SOLUTION OF N-PLAYER NONZERO-SUM DIFFERENTIAL GAMES

In this chapter, a CL-based ACI architecture (cf. [59]) is used to obtain an approximate feedback-Nash equilibrium solution to an infinite-horizon N-player nonzero-sum differential game online, without requiring PE, for a nonlinear control-affine system with uncertain LP drift dynamics.

A system identifier is used to estimate the unknown parameters in the drift dynamics. The solutions to the coupled HJ equations and the corresponding feedback-Nash equilibrium policies are approximated using parametric universal function approximators. Based on estimates of the unknown drift parameters, estimates for the Bellman errors are evaluated at a set of pre-selected points in the state space. The value function and the policy weights are updated using a concurrent learning-based least-squares approach to minimize the instantaneous BEs and the BEs evaluated at the pre-selected points. Simultaneously, the unknown parameters in the drift dynamics are updated using a history stack of recorded data via a concurrent learning-based gradient descent approach. It is shown that under a condition milder than PE, UB convergence of the unknown drift parameters, the value function weights, and the policy weights to their true values can be established. Simulation results are presented to demonstrate the performance of the developed technique without an added excitation signal.

4.1 Problem Formulation and Exact Solution

Consider a class of control-affine multi-input systems

$\dot{x} = f(x) + \sum_{i=1}^{N} g_i(x)u_i,$

where $x \in \mathbb{R}^n$ is the state and $u_i \in \mathbb{R}^{m_i}$ are the control inputs (i.e., the players). The unknown function $f : \mathbb{R}^n \to \mathbb{R}^n$ is LP¹, the functions $g_i : \mathbb{R}^n \to \mathbb{R}^{n \times m_i}$ are known, locally Lipschitz continuous, and uniformly bounded, the function $f$ is locally Lipschitz, and $f(0) = 0$. Define a cost functional

$J_i(x, u_1, \ldots, u_N) = \int_0^{\infty} r_i(x(\tau), u_1(\tau), \ldots, u_N(\tau))\, d\tau,$

where $r_i : \mathbb{R}^n \times \mathbb{R}^{m_1} \times \cdots \times \mathbb{R}^{m_N} \to \mathbb{R}_{\geq 0}$ denotes the instantaneous cost defined as $r_i(x, u_1, \ldots, u_N) \triangleq x^TQ_ix + \sum_{j=1}^{N}u_j^TR_{ij}u_j$, and $Q_i \in \mathbb{R}^{n \times n}$ and $R_{ij} \in \mathbb{R}^{m_j \times m_j}$ are constant positive definite matrices. The objective of each agent is to minimize its cost functional. To facilitate the definition of a feedback-Nash equilibrium solution, let

$U \triangleq \left\{\left\{u_i : \mathbb{R}^n \to \mathbb{R}^{m_i},\ i = 1, \ldots, N\right\} \mid \{u_1, \ldots, u_N\} \text{ is admissible with respect to the dynamics}\right\}$

be the set of all admissible tuples of feedback policies. A tuple $\{u_1, \ldots, u_N\}$ is called admissible if the functions $u_i$ are continuous for all $i = 1, \ldots, N$ and result in finite costs $J_i$ for all $i = 1, \ldots, N$. Let $V_i^{\{u_1, \ldots, u_N\}} : \mathbb{R}^n \to \mathbb{R}_{\geq 0}$ denote the value function of the $i$-th player with respect to the tuple of feedback policies $\{u_1, \ldots, u_N\} \in U$, defined as

$V_i^{\{u_1, \ldots, u_N\}}(x) \triangleq \int_t^{\infty} r_i\left(\phi(\tau; x),\ u_1(\phi(\tau; x)), \ldots, u_N(\phi(\tau; x))\right) d\tau,$

where $\phi(\tau; x)$ for $\tau \in [t, \infty)$ denotes the trajectory of the system obtained using the feedback controllers $u_i = u_i(\phi(\tau; x))$ and the initial condition $\phi(t; x) = x$.

¹ The function $f$ is assumed to be LP for ease of exposition. The system identifier can also be developed using multi-layer NNs. For example, a system identifier developed using single-layer NNs is presented in Chapter 6.


The control objective is to find an approximate feedback-Nash equilibrium solution to the infinite-horizon regulation differential game online, i.e., to find a tuple $\{u_1^*, \ldots, u_N^*\} \in U$ such that for all $i \in \{1, \ldots, N\}$ and for all $x \in \mathbb{R}^n$, the corresponding value functions satisfy

$V_i^*(x) \triangleq V_i^{\{u_1^*, u_2^*, \ldots, u_i^*, \ldots, u_N^*\}}(x) \leq V_i^{\{u_1^*, u_2^*, \ldots, u_i, \ldots, u_N^*\}}(x)$

for all $u_i$ such that $\{u_1^*, u_2^*, \ldots, u_i, \ldots, u_N^*\} \in U$.

Provided a feedback-Nash equilibrium solution exists and provided the value functions are continuously differentiable, an exact closed-loop feedback-Nash equilibrium solution $\{u_1^*, \ldots, u_N^*\}$ can be expressed in terms of the value functions as [100, 103, 104, 107, 112]

$u_i^*(x^o) = -\frac{1}{2}R_{ii}^{-1}g_i^T(x^o)\left(\nabla V_i^*(x^o)\right)^T, \qquad \forall x^o \in \mathbb{R}^n,$

and the value functions $\{V_1^*, \ldots, V_N^*\}$ are the solutions to the coupled HJ equations

$x^{oT}Q_ix^o + \sum_{j=1}^{N}\frac{1}{4}\nabla V_j^*(x^o)G_{ij}(x^o)\left(\nabla V_j^*(x^o)\right)^T - \frac{1}{2}\nabla V_i^*(x^o)\sum_{j=1}^{N}G_j(x^o)\left(\nabla V_j^*(x^o)\right)^T + \nabla V_i^*(x^o)f(x^o) = 0,$

for all $x^o \in \mathbb{R}^n$, where $G_j(x^o) \triangleq g_j(x^o)R_{jj}^{-1}g_j^T(x^o)$ and $G_{ij}(x^o) \triangleq g_j(x^o)R_{jj}^{-1}R_{ij}R_{jj}^{-1}g_j^T(x^o)$. The HJ equations above are in the so-called closed-loop form; they can be expressed in an open-loop form as

$x^{oT}Q_ix^o + \sum_{j=1}^{N}u_j^{*T}(x^o)R_{ij}u_j^*(x^o) + \nabla V_i^*(x^o)f(x^o) + \nabla V_i^*(x^o)\sum_{j=1}^{N}g_j(x^o)u_j^*(x^o) = 0,$

for all $x^o \in \mathbb{R}^n$.
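A minimal sketch of evaluating the closed-loop feedback-Nash policies from given value function gradients; the gradient callables are placeholders for whatever approximation is available.

```python
# Sketch of the closed-loop feedback-Nash policies:
# u_i(x) = -1/2 R_ii^{-1} g_i(x)' (grad V_i(x))'.
import numpy as np

def nash_policies(x, grad_V_list, g_list, R_list):
    """grad_V_list[i](x) -> (n,); g_list[i](x) -> (n, m_i); R_list[i] = R_ii."""
    return [-0.5 * np.linalg.solve(R_list[i], g_list[i](x).T @ grad_V_list[i](x))
            for i in range(len(g_list))]
```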


anapproximation n ^ u 1 x; ^ W a 1 ;::; ^ u N x; ^ W aN o totheclosed-loopfeedback-Nash equilibriumsolutioniscomputed,where ^ W ci 2 R p W i ,i.e.,thevaluefunctionweights, and ^ W ai 2 R p W i ,i.e.,thepolicyweights,denotetheparameterestimates.Sincethe approximatesolution,ingeneral,doesnotsatisfytheHJequations,asetofresidual errors i : R n R p Wi R p W 1 ; R p W N ! R ,calledBEs,isdenedas i x; ^ W ci ; ^ W a 1 ; ; ^ W aN , x T Q i x + N X j =1 ^ u T j x; ^ W aj R ij ^ u j x; ^ W aj + r ^ V i x; ^ W ci f x + r ^ V i x; ^ W ci N X j =1 g j x ^ u j x; ^ W aj ; andtheapproximatesolutionisrecursivelyimprovedtodrivetheBEstozero.ThecomputationoftheBEsin4requiresknowledgeofthedriftdynamics f .Toeliminatethis requirement,andtoenablesimulationofexperienceviaBEextrapolation,aconcurrent learning-basedsystemidentierisdevelopedinthefollowingsection. 4.2.1SystemIdentication Let f x o = Y x o ; forall x o 2 R n ,bethelinearparameterizationofthedrift dynamics,where Y : R n ! R n p denotesthelocallyLipschitzregressionmatrix,and 2 R p denotesthevectorofconstant,unknowndriftparameters.Thesystemidentier isdesignedas _ ^ x = Y x ^ + N X i =1 g i x u i + k x ~ x; wherethemeasurablestateestimationerror ~ x isdenedas ~ x , x )]TJ/F15 11.9552 Tf 13.292 0 Td [(^ x , k x 2 R n n is apositivedenite,constantdiagonalobservergainmatrix,and ^ 2 R p denotesthe vectorofestimatesoftheunknowndriftparameters.Intraditionaladaptivesystems, theestimatesareupdatedtominimizetheinstantaneousstateestimationerror,and convergenceofparameterestimatestotheirtruevaluescanbeestablishedundera restrictivePEcondition.Inthisresult,aconcurrentlearning-baseddata-drivenapproach isdevelopedtorelaxthePEconditiontoaweaker,veriablerankconditionasfollows. 72


Assumption 4.1. [92, 93] A history stack $H_{id}$ containing state-action tuples $\left(x_j,u_{ij}\right)$, $i=1,\ldots,N$, $j=1,\ldots,M_\theta$, recorded along the trajectories of the system, that satisfies

$\operatorname{rank}\left(\sum_{j=1}^{M_\theta} Y_j^T Y_j\right) = p_\theta,$

is available a priori, where $Y_j\triangleq Y(x_j)$ and $p_\theta$ denotes the number of unknown parameters in the drift dynamics.

To facilitate the concurrent learning-based parameter update, numerical methods are used to compute the state derivative $\dot x_j$ corresponding to $\left(x_j,u_{ij}\right)$. The update law for the drift parameter estimates is designed as

$\dot{\hat\theta} = \Gamma_\theta Y^T\tilde x + \Gamma_\theta k_\theta\sum_{j=1}^{M_\theta} Y_j^T\left(\dot x_j - \sum_{i=1}^N g_{ij}\,u_{ij} - Y_j\hat\theta\right),$

where $g_{ij}\triangleq g_i(x_j)$, $\Gamma_\theta\in\mathbb{R}^{p_\theta\times p_\theta}$ is a constant positive definite adaptation gain matrix, and $k_\theta\in\mathbb{R}$ is a constant positive concurrent learning gain. The update law requires the unmeasurable state derivative $\dot x_j$. Since the state derivative at a past recorded point on the state trajectory is required, past and future recorded values of the state can be used along with accurate noncausal smoothing techniques to obtain good estimates of $\dot x_j$. In the presence of derivative estimation errors, the parameter estimation errors can be shown to be UUB, where the size of the ultimate bound depends on the error in the derivative estimate [93].

To incorporate new information, the history stack is updated with new data. Thus, the resulting closed-loop system is a switched system. To ensure the stability of the switched system, the history stack is updated using a singular value maximizing algorithm (cf. [93]). Using the linear parameterization, the state derivative satisfies

$\dot x_j - \sum_{i=1}^N g_{ij}\,u_{ij} = Y_j\theta.$
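The rank condition in Assumption 4.1 and the concurrent learning update law can be checked and evaluated directly from recorded data. The following is a minimal sketch under stated assumptions; the history stack contents, gains, and dimensions are hypothetical placeholders rather than values from the dissertation.

```python
import numpy as np

def rank_condition_satisfied(Y_stack, p_theta, tol=1e-8):
    """Check rank(sum_j Y_j^T Y_j) == p_theta for a list of regressor matrices."""
    S = sum(Yj.T @ Yj for Yj in Y_stack)
    return np.linalg.matrix_rank(S, tol=tol) == p_theta

def theta_hat_dot(theta_hat, Y, x_tilde, stack, Gamma, k_theta):
    """Right-hand side of the concurrent learning update law.

    stack is a list of tuples (Y_j, x_dot_j, gu_j), where gu_j = sum_i g_i(x_j) u_ij
    and x_dot_j is a numerically estimated state derivative.
    """
    cl_term = sum(Yj.T @ (x_dot_j - gu_j - Yj @ theta_hat)
                  for (Yj, x_dot_j, gu_j) in stack)
    return Gamma @ (Y.T @ x_tilde) + k_theta * Gamma @ cl_term
```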


Hence, the update law can be expressed in the advantageous form

$\dot{\tilde\theta} = -\Gamma_\theta Y^T\tilde x - \Gamma_\theta k_\theta\left(\sum_{j=1}^{M_\theta} Y_j^T Y_j\right)\tilde\theta,$

where $\tilde\theta\triangleq\theta-\hat\theta$ denotes the drift parameter estimation error. The closed-loop dynamics of the state estimation error are given by

$\dot{\tilde x} = Y\tilde\theta - k_x\tilde x.$

4.2.2 Value Function Approximation

The value functions, i.e., the solutions to the HJ equations, are continuously differentiable functions of the state. Using the universal approximation property of NNs, the value functions can be represented as

$V_i^*(x^o) = W_i^T\sigma_i(x^o) + \epsilon_i(x^o),\quad \forall x^o\in\mathbb{R}^n,$

where $W_i\in\mathbb{R}^{p_{Wi}}$ denotes the constant vector of unknown NN weights, $\sigma_i:\mathbb{R}^n\to\mathbb{R}^{p_{Wi}}$ denotes the known NN activation function, $p_{Wi}\in\mathbb{N}$ denotes the number of hidden layer neurons, and $\epsilon_i:\mathbb{R}^n\to\mathbb{R}$ denotes the unknown function reconstruction error. The universal function approximation property guarantees that over any compact domain $\mathcal{C}\subset\mathbb{R}^n$ there exist a set of weights and basis functions such that $\|W_i\|\leq\bar W$, $\sup_{x\in\mathcal C}\|\sigma_i(x)\|\leq\bar\sigma_i$, $\sup_{x\in\mathcal C}\|\nabla\sigma_i(x)\|\leq\bar\sigma_{ri}$, $\sup_{x\in\mathcal C}\|\epsilon_i(x)\|\leq\bar\epsilon_i$, and $\sup_{x\in\mathcal C}\|\nabla\epsilon_i(x)\|\leq\bar\epsilon_{ri}$, where $\bar W,\bar\sigma_i,\bar\sigma_{ri},\bar\epsilon_i,\bar\epsilon_{ri}\in\mathbb{R}$ are positive constants. Based on the NN representation, the feedback-Nash equilibrium solutions are given by

$u_i^*(x^o) = -\frac12 R_{ii}^{-1}g_i^T(x^o)\left(\nabla\sigma_i^T(x^o)\,W_i + \nabla\epsilon_i^T(x^o)\right),\quad\forall x^o\in\mathbb{R}^n.$

The NN-based approximations to the value functions and the controllers are defined as

$\hat V_i\left(x,\hat W_{ci}\right)\triangleq \hat W_{ci}^T\sigma_i(x),\qquad \hat u_i\left(x,\hat W_{ai}\right)\triangleq -\frac12 R_{ii}^{-1}g_i^T(x)\nabla\sigma_i^T(x)\hat W_{ai}.$


The use of two different sets $\{\hat W_{ci}\}$ and $\{\hat W_{ai}\}$ of estimates to approximate the same set of ideal weights $\{W_i\}$ is motivated by the subsequent stability analysis and by the fact that it facilitates an approximate formulation of the BEs that is affine in the value function weights, enabling least squares-based adaptation. Based on the system identifier, measurable approximations $\hat\delta_i:\mathbb{R}^n\times\mathbb{R}^{p_{Wi}}\times\mathbb{R}^{p_{W1}}\times\cdots\times\mathbb{R}^{p_{WN}}\times\mathbb{R}^{p_\theta}\to\mathbb{R}$ to the BEs are defined as

$\hat\delta_i\left(x,\hat W_{ci},\hat W_{a1},\ldots,\hat W_{aN},\hat\theta\right) \triangleq \hat W_{ci}^T\nabla\sigma_i(x)\left(Y(x)\hat\theta - \frac12\sum_{j=1}^N G_j(x)\nabla\sigma_j^T(x)\hat W_{aj}\right) + x^TQ_ix + \sum_{j=1}^N\frac14\hat W_{aj}^T\nabla\sigma_j(x)\,G_{ij}(x)\nabla\sigma_j^T(x)\hat W_{aj}.$

The following assumption, which in general is weaker than the PE assumption, is required for convergence of the concurrent learning-based value function weight estimates.

Assumption 4.2. For each $i\in\{1,\ldots,N\}$, there exists a finite set of $M_{xi}$ points $\left\{x_{ik}\in\mathbb{R}^n \mid k=1,\ldots,M_{xi}\right\}$ such that

$\underline c_{xi} \triangleq \inf_{t\in\mathbb{R}_{\geq0}}\ \frac{\lambda_{\min}\left(\sum_{k=1}^{M_{xi}}\frac{\omega_i^k(t)\left(\omega_i^k(t)\right)^T}{\rho_i^k(t)}\right)}{M_{xi}} > 0,$

where $\lambda_{\min}$ denotes the minimum eigenvalue and $\underline c_{xi}\in\mathbb{R}$ is a positive constant. In the expression above,

$\omega_i^k = \nabla\sigma_i^{ik}\left(Y^{ik}\hat\theta - \frac12\sum_{j=1}^N G_j^{ik}\left(\nabla\sigma_j^{ik}\right)^T\hat W_{aj}\right),$

where the superscript $ik$ indicates that the function is evaluated at $x=x_{ik}$, and $\rho_i^k\triangleq 1+\nu_i\left(\omega_i^k\right)^T\Gamma_i\,\omega_i^k$, where $\nu_i\in\mathbb{R}_{>0}$ is the normalization gain and $\Gamma_i\in\mathbb{R}^{p_{Wi}\times p_{Wi}}$ is the adaptation gain matrix.
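To illustrate how the extrapolated BEs are evaluated at preselected points without measuring the drift $f$, the sketch below composes $\hat u_j$ and $\hat\delta_i$ from the weight estimates and the identifier output $Y(x)\hat\theta$. All callables and the nested cost-weight structure are hypothetical stand-ins for a two-or-more-player instance; this is a sketch, not the dissertation's implementation.

```python
import numpy as np

def u_hat(x, W_a, g, grad_sigma, R_inv):
    """Approximate policy: u_hat = -0.5 R_ii^{-1} g_i(x)^T grad_sigma_i(x)^T W_ai."""
    return -0.5 * R_inv @ g(x).T @ grad_sigma(x).T @ W_a

def be_extrapolated(x, W_ci, W_a_all, theta_hat, Y, Q_i, R, g_all,
                    grad_sigma_all, R_inv_all, i):
    """Extrapolated BE delta_hat_i at an (possibly off-trajectory) point x."""
    N = len(W_a_all)
    u_all = [u_hat(x, W_a_all[j], g_all[j], grad_sigma_all[j], R_inv_all[j])
             for j in range(N)]
    f_hat = Y(x) @ theta_hat                          # identifier-based drift estimate
    x_dot_hat = f_hat + sum(g_all[j](x) @ u_all[j] for j in range(N))
    cost = x @ Q_i @ x + sum(u_all[j] @ R[i][j] @ u_all[j] for j in range(N))
    return W_ci @ (grad_sigma_all[i](x) @ x_dot_hat) + cost
```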


The concurrent learning-based least-squares update law for the value function weights is designed as

$\dot{\hat W}_{ci} = -\eta_{c1i}\Gamma_i\frac{\omega_i}{\rho_i}\hat\delta_{ti} - \frac{\eta_{c2i}\Gamma_i}{M_{xi}}\sum_{k=1}^{M_{xi}}\frac{\omega_i^k}{\rho_i^k}\hat\delta_{ti}^k,$

$\dot\Gamma_i = \left(\lambda_i\Gamma_i - \eta_{c1i}\Gamma_i\frac{\omega_i\omega_i^T}{\rho_i^2}\Gamma_i\right)\mathbf 1_{\left\{\|\Gamma_i\|\leq\bar\Gamma_i\right\}},\qquad \left\|\Gamma_i(t_0)\right\|\leq\bar\Gamma_i,$

where $\omega_i = \nabla\sigma_i(x)\left(Y(x)\hat\theta-\frac12\sum_{j=1}^N G_j(x)\nabla\sigma_j^T(x)\hat W_{aj}(t)\right)$, $\rho_i\triangleq 1+\nu_i\omega_i^T\Gamma_i\omega_i$, $\mathbf 1_{\{\cdot\}}$ denotes the indicator function, $\bar\Gamma_i>0\in\mathbb{R}$ is the saturation constant, $\lambda_i\in\mathbb{R}$ is the constant positive forgetting factor, $\eta_{c1i},\eta_{c2i}\in\mathbb{R}$ are constant positive adaptation gains, and the instantaneous BEs $\hat\delta_{ti}$ and $\hat\delta_{ti}^k$ are defined as

$\hat\delta_{ti}(t)\triangleq\hat\delta_i\left(x(t),\hat W_{ci}(t),\hat W_{a1}(t),\ldots,\hat W_{aN}(t),\hat\theta(t)\right),\qquad \hat\delta_{ti}^k(t)\triangleq\hat\delta_i\left(x_{ik},\hat W_{ci}(t),\hat W_{a1}(t),\ldots,\hat W_{aN}(t),\hat\theta(t)\right).$

The policy weight update laws are designed based on the subsequent stability analysis as

$\dot{\hat W}_{ai} = -\eta_{a1i}\left(\hat W_{ai}-\hat W_{ci}\right) - \eta_{a2i}\hat W_{ai} + \frac14\sum_{k=1}^{M_{xi}}\sum_{j=1}^N\frac{\eta_{c2i}}{M_{xi}}\nabla\sigma_j^{ik}\,G_{ij}^{ik}\left(\nabla\sigma_j^{ik}\right)^T\hat W_{aj}\frac{\left(\omega_i^k\right)^T}{\rho_i^k}\hat W_{ci} + \frac{\eta_{c1i}}{4}\sum_{j=1}^N\nabla\sigma_j(x)\,G_{ij}(x)\nabla\sigma_j^T(x)\hat W_{aj}\frac{\omega_i^T}{\rho_i}\hat W_{ci},$

where $\eta_{a1i},\eta_{a2i}\in\mathbb{R}$ are positive constant adaptation gains. The forgetting factor $\lambda_i$, along with the saturation in the update law for the least-squares gain matrix, ensures (cf. [91]) that the least-squares gain matrix $\Gamma_i$ and its inverse are positive definite and bounded as $\underline\Gamma_i \leq \left\|\Gamma_i(t)\right\| \leq \bar\Gamma_i$ for all $t\in\mathbb{R}_{\geq0}$ and all $i\in\{1,\ldots,N\}$.
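The update law above integrates both $\hat W_{ci}$ and the gain matrix $\Gamma_i$. A minimal sketch of one explicit-Euler step, including the indicator-function saturation on $\|\Gamma_i\|$, is shown below; the Euler discretization, gains, and the BE values are hypothetical inputs used only for illustration.

```python
import numpy as np

def critic_step(W_c, Gamma, omega, delta, omega_k, rho_k, delta_k,
                eta_c1, eta_c2, lam, nu, Gamma_bar, dt):
    """One Euler step of the CL-based least-squares critic update for one player."""
    rho = 1.0 + nu * omega @ Gamma @ omega
    M = len(omega_k)
    cl_sum = sum(w / r * d for w, r, d in zip(omega_k, rho_k, delta_k))
    W_c_dot = -eta_c1 * Gamma @ omega * delta / rho - eta_c2 / M * Gamma @ cl_sum
    Gamma_dot = lam * Gamma - eta_c1 * Gamma @ np.outer(omega, omega) @ Gamma / rho**2
    if np.linalg.norm(Gamma) > Gamma_bar:     # indicator: freeze Gamma at saturation
        Gamma_dot = np.zeros_like(Gamma)
    return W_c + dt * W_c_dot, Gamma + dt * Gamma_dot
```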


Here, $\underline\Gamma_i\in\mathbb{R}$ is a positive constant, and the normalized regressor is bounded as

$\left\|\frac{\omega_i}{\rho_i}\right\| \leq \frac{1}{2\sqrt{\nu_i\underline\Gamma_i}}.$

For notational brevity, state-dependence of the functions $f$, $g_i$, $u_i^*$, $G_i$, $G_{ij}$, $\sigma_i$, $\epsilon_i$, $Y$, and $V_i^*$ is suppressed hereafter.

4.3 Stability Analysis

Subtracting the HJ equation from the approximate BE, the approximate BE can be expressed in an unmeasurable form as

$\hat\delta_{ti} = \omega_i^T\hat W_{ci} + \sum_{j=1}^N\frac14\hat W_{aj}^T\nabla\sigma_j\,G_{ij}\nabla\sigma_j^T\hat W_{aj} - \sum_{j=1}^N u_j^{*T}R_{ij}u_j^* - \nabla V_i^*f - \nabla V_i^*\sum_{j=1}^N g_ju_j^*.$

Substituting for $V_i^*$ and $u_i^*$ from the NN representations, using $f = Y\theta$, and then adding and subtracting $\frac14\hat W_{aj}^T\nabla\sigma_j\,G_{ij}\nabla\sigma_j^TW_j + \omega_i^TW_i$, the approximate BE can be expressed as

$\hat\delta_{ti} = -\omega_i^T\tilde W_{ci} + \frac14\sum_{j=1}^N\tilde W_{aj}^T\nabla\sigma_j\,G_{ij}\nabla\sigma_j^T\tilde W_{aj} - \frac12\sum_{j=1}^N\left(W_i^T\nabla\sigma_i\,G_j - W_j^T\nabla\sigma_j\,G_{ij}\right)\nabla\sigma_j^T\tilde W_{aj} - W_i^T\nabla\sigma_i\,Y\tilde\theta - \nabla\epsilon_i\,Y\theta + \Delta_i,$

where $\Delta_i \triangleq \frac12\sum_{j=1}^N\left(W_i^T\nabla\sigma_i\,G_j - W_j^T\nabla\sigma_j\,G_{ij}\right)\nabla\epsilon_j^T + \frac12\sum_{j=1}^N W_j^T\nabla\sigma_j\,G_j\nabla\epsilon_i^T + \frac12\sum_{j=1}^N\nabla\epsilon_i\,G_j\nabla\epsilon_j^T - \frac14\sum_{j=1}^N\nabla\epsilon_j\,G_{ij}\nabla\epsilon_j^T$. Similarly, the approximate BE evaluated at the selected points can be expressed in an unmeasurable form as


$\hat\delta_{ti}^k = -\left(\omega_i^k\right)^T\tilde W_{ci} + \Delta_i^k - W_i^T\nabla\sigma_i^{ik}Y^{ik}\tilde\theta - \frac12\sum_{j=1}^N\left(W_i^T\nabla\sigma_i^{ik}G_j^{ik} - W_j^T\nabla\sigma_j^{ik}G_{ij}^{ik}\right)\left(\nabla\sigma_j^{ik}\right)^T\tilde W_{aj} + \frac14\sum_{j=1}^N\tilde W_{aj}^T\nabla\sigma_j^{ik}G_{ij}^{ik}\left(\nabla\sigma_j^{ik}\right)^T\tilde W_{aj},$

where the constant $\Delta_i^k\in\mathbb{R}$ is defined as $\Delta_i^k\triangleq -\nabla\epsilon_i^{ik}Y^{ik}\theta + \Delta_i^{ik}$. To facilitate the stability analysis, a candidate Lyapunov function is defined as

$V_L = \sum_{i=1}^NV_i^* + \frac12\sum_{i=1}^N\tilde W_{ci}^T\Gamma_i^{-1}\tilde W_{ci} + \frac12\sum_{i=1}^N\tilde W_{ai}^T\tilde W_{ai} + \frac12\tilde x^T\tilde x + \frac12\tilde\theta^T\Gamma_\theta^{-1}\tilde\theta.$

Since the $V_i^*$ are positive definite, the bound on $\Gamma_i$ and Lemma 4.3 in [149] can be used to bound the candidate Lyapunov function as

$\underline v\left(\left\|Z^o\right\|\right) \leq V_L\left(Z^o,t\right) \leq \bar v\left(\left\|Z^o\right\|\right)$

for all $Z^o = \left[x^T,\tilde W_{c1}^T,\ldots,\tilde W_{cN}^T,\tilde W_{a1}^T,\ldots,\tilde W_{aN}^T,\tilde x^T,\tilde\theta^T\right]^T \in\mathbb{R}^{2n+2N\sum_ip_{Wi}+p_\theta}$, where $\underline v,\bar v:\mathbb{R}_{\geq0}\to\mathbb{R}_{\geq0}$ are class $\mathcal K$ functions. For any compact set $\mathcal Z\subset\mathbb{R}^{2n+2N\sum_ip_{Wi}+p_\theta}$, define

$\iota_1 \triangleq \max_{i,j}\sup_{Z\in\mathcal Z}\left\|\frac12W_i^T\nabla\sigma_i\,G_j\nabla\epsilon_j^T + \frac12\nabla\epsilon_i\,G_j\nabla\epsilon_j^T\right\|,\qquad \iota_4 \triangleq \max_{i,j}\sup_{Z\in\mathcal Z}\left\|\nabla\sigma_j\,G_{ij}\nabla\sigma_j^T\right\|,\qquad \iota_{5i} \triangleq \frac{\eta_{c1i}L_Y\bar\sigma_{ri}}{4\sqrt{\nu_i\underline\Gamma_i}},$

$\iota_2 \triangleq \max_{i,j}\sup_{Z\in\mathcal Z}\left\|\frac{\eta_{c1i}\,\omega_i}{4\rho_i}\left(3W_j^T\nabla\sigma_j\,G_{ij}-2W_i^T\nabla\sigma_i\,G_j\right)\nabla\sigma_j^T + \sum_{k=1}^{M_{xi}}\frac{\eta_{c2i}\,\omega_i^k}{4M_{xi}\rho_i^k}\left(3W_j^T\nabla\sigma_j^{ik}G_{ij}^{ik}-2W_i^T\nabla\sigma_i^{ik}G_j^{ik}\right)\left(\nabla\sigma_j^{ik}\right)^T\right\|,$

$\iota_3 \triangleq \max_{i,j}\sup_{Z\in\mathcal Z}\left\|\frac12\sum_{i,j=1}^N\left(W_i^T\nabla\sigma_i+\nabla\epsilon_i\right)G_j\nabla\epsilon_j^T - \frac14\sum_{i,j=1}^N\left(2W_j^T\nabla\sigma_j+\nabla\epsilon_j\right)G_{ij}\nabla\epsilon_j^T\right\|,$

$\iota_{6i} \triangleq \frac{\eta_{c1i}L_Y\bar W\bar\sigma_{ri}}{4\sqrt{\nu_i\underline\Gamma_i}},\qquad \iota_{7i} \triangleq \frac{\eta_{c2i}\max_k\left\|\nabla\sigma_i^{ik}Y^{ik}\right\|\bar W}{4\sqrt{\nu_i\underline\Gamma_i}},\qquad \iota_8 \triangleq \sum_{i=1}^N\frac{\left(\eta_{c1i}+\eta_{c2i}\right)\bar W\iota_4}{8\sqrt{\nu_i\underline\Gamma_i}},$

$\iota_{9i} \triangleq \left(\iota_1N + \eta_{a2i} + \iota_8\right)\bar W,\qquad \iota_{10i} \triangleq \frac{\eta_{c1i}\sup_{Z\in\mathcal Z}\left\|\Delta_i\right\| + \eta_{c2i}\max_k\left\|\Delta_i^k\right\|}{2\sqrt{\nu_i\underline\Gamma_i}},$

$v_l \triangleq \min\left\{\frac{\underline q_i}{2},\ \frac{\eta_{c2i}\underline c_{xi}}{4},\ k_x,\ \frac{2\eta_{a1i}+\eta_{a2i}}{8},\ \frac{k_\theta\underline y}{2}\right\},\qquad \iota \triangleq \sum_{i=1}^N\left(\frac{2\iota_{9i}^2}{2\eta_{a1i}+\eta_{a2i}} + \frac{\iota_{10i}^2}{\eta_{c2i}\underline c_{xi}}\right) + \iota_3,$


where $\underline q_i$ denotes the minimum eigenvalue of $Q_i$, $\underline y$ denotes the minimum eigenvalue of $\sum_{j=1}^{M_\theta}Y_j^TY_j$, $k_x$ denotes the minimum eigenvalue of the observer gain matrix, and the suprema exist since $\omega_i/\rho_i$ is uniformly bounded for all $Z$ and the functions $G_i$, $G_{ij}$, $\sigma_i$, and $\nabla\epsilon_i$ are continuous. Here, $L_Y\in\mathbb{R}_{\geq0}$ denotes the Lipschitz constant such that $\left\|Y(\varsigma)\right\|\leq L_Y\left\|\varsigma\right\|$ for all $\varsigma\in\mathcal Z\cap\mathbb{R}^n$. The sufficient conditions for UB convergence are derived based on the subsequent stability analysis as

$\underline q_i > 2\iota_{5i},\qquad \eta_{c2i}\underline c_{xi} > 2\iota_{5i} + 2\zeta_1\iota_{7i} + \zeta_2\iota_2N + \eta_{a1i} + 2\zeta_3\iota_{6i}\overline Z,\qquad 2\eta_{a1i}+\eta_{a2i} > 4\iota_8 + \frac{\iota_2N}{\zeta_2},\qquad k_\theta\underline y > \frac{2\iota_{7i}}{\zeta_1} + \frac{2\iota_{6i}}{\zeta_3}\overline Z,$

where $\overline Z \triangleq \underline v^{-1}\left(\bar v\left(\max\left(\left\|Z(t_0)\right\|,\sqrt{\iota/v_l}\right)\right)\right)$ and $\zeta_1,\zeta_2,\zeta_3\in\mathbb{R}$ are known positive adjustable constants. Furthermore, the compact set $\mathcal Z$ satisfies the sufficient condition

$\sqrt{\frac{\iota}{v_l}} \leq \bar r,$

where $\bar r\in\mathbb{R}_{\geq0}$ denotes the radius of the set $\mathcal Z$.

Since the NN function approximation error and the Lipschitz constant $L_Y$ depend on the compact set that contains the state trajectories, the compact set needs to be established before the gains can be selected. Based on the subsequent stability analysis, an algorithm is developed to compute the required compact set, denoted by $\mathcal Z$, based on the initial conditions. In Algorithm 4.1, the notation $\{\varsigma\}_i$ for any parameter $\varsigma$ denotes the value of $\varsigma$ computed in the $i$th iteration. Since the constants $\iota$ and $v_l$ depend on $L_Y$ only through the products $L_Y\bar\sigma_{ri}$ and $L_Y\zeta_3$, Algorithm 4.1 ensures the satisfaction of the sufficient condition on $\mathcal Z$.


Algorithm 4.1 Gain Selection
First iteration: Given $\bar z\in\mathbb{R}_{\geq0}$ such that $\left\|Z(t_0)\right\|\leq\bar z$, select a candidate compact set, compute the constants, and select the gains to satisfy the sufficient conditions; if the resulting bound $\left\{\sqrt{\iota/v_l}\right\}_1$ satisfies the set condition, terminate.
Subsequent iterations: Otherwise, increase the number of NN neurons to $\{p_{Wi}\}_3$ to ensure $\{L_Y\}_2\{\bar\sigma_{ri}\}_3\leq\{L_Y\}_2\{\bar\sigma_{ri}\}_2$ for all $i=1,\ldots,N$, decrease the constant $\zeta_3$ to ensure $\{L_Y\}_2\{\zeta_3\}_3\leq\{L_Y\}_2\{\zeta_3\}_2$, and increase the gain $k_\theta$ to satisfy the gain conditions. These adjustments ensure $\{\iota\}_3\leq\{\iota\}_2$. Set $\mathcal Z = \left\{\xi\in\mathbb{R}^{2n+2N\sum_i\{p_{Wi}\}_3+p_\theta} \mid \left\|\xi\right\|\leq\underline v^{-1}\left(\bar v\left(\left\{\sqrt{\iota/v_l}\right\}_2\right)\right)\right\}$ and terminate.

Theorem 4.1. Provided Assumptions 4.1-4.2 hold and the control gains satisfy the sufficient conditions above, where the constants are computed based on the compact set $\mathcal Z$ selected using Algorithm 4.1, the system identifier along with the adaptive update law for $\hat\theta$ and the controllers $u_i(t)=\hat u_i\left(x(t),\hat W_{ai}(t)\right)$ along with the adaptive update laws for $\hat W_{ci}$ and $\hat W_{ai}$ ensure that the state $x$, the state estimation error $\tilde x$, the value function weight estimation errors $\tilde W_{ci}$, and the policy weight estimation errors $\tilde W_{ai}$ are UB, resulting in UB convergence of the controllers $\hat u_i$ to the feedback-Nash equilibrium controllers $u_i^*(x)$.

Proof. The derivative of the candidate Lyapunov function along the trajectories of the closed-loop system is given by

$\dot V_L = \sum_{i=1}^N\nabla V_i^*\left(f+\sum_{j=1}^Ng_j\hat u_j\right) + \tilde x^T\left(Y\tilde\theta-k_x\tilde x\right) + \tilde\theta^T\left(-Y^T\tilde x-k_\theta\left(\sum_{j=1}^{M_\theta}Y_j^TY_j\right)\tilde\theta\right) - \frac12\sum_{i=1}^N\tilde W_{ci}^T\left(\lambda_i\Gamma_i^{-1}-\eta_{c1i}\frac{\omega_i\omega_i^T}{\rho_i^2}\right)\tilde W_{ci} + \sum_{i=1}^N\tilde W_{ci}^T\left(\eta_{c1i}\frac{\omega_i}{\rho_i}\hat\delta_{ti}+\frac{\eta_{c2i}}{M_{xi}}\sum_{k=1}^{M_{xi}}\frac{\omega_i^k}{\rho_i^k}\hat\delta_{ti}^k\right) - \sum_{i=1}^N\tilde W_{ai}^T\left(-\eta_{a1i}\left(\hat W_{ai}-\hat W_{ci}\right)-\eta_{a2i}\hat W_{ai}+\frac{\eta_{c1i}}{4}\sum_{j=1}^N\nabla\sigma_j\,G_{ij}\nabla\sigma_j^T\hat W_{aj}\frac{\omega_i^T}{\rho_i}\hat W_{ci}+\frac1{4}\sum_{k=1}^{M_{xi}}\sum_{j=1}^N\frac{\eta_{c2i}}{M_{xi}}\nabla\sigma_j^{ik}G_{ij}^{ik}\left(\nabla\sigma_j^{ik}\right)^T\hat W_{aj}\frac{\left(\omega_i^k\right)^T}{\rho_i^k}\hat W_{ci}\right).$


Substituting the unmeasurable forms of the BEs and using the triangle inequality, the Cauchy-Schwarz inequality, and Young's inequality, the Lyapunov derivative can be bounded above on the set $\mathcal Z$, provided the sufficient gain conditions and the conditions

$\frac{\eta_{c2i}\underline c_{xi}}{2} > \iota_{5i} + \zeta_1\iota_{7i} + \frac{\zeta_2\iota_2N}{2} + \frac{\eta_{a1i}}{2} + \zeta_3\iota_{6i}\left\|x\right\|,\qquad \frac{k_\theta\underline y}{2} > \frac{\iota_{7i}}{\zeta_1} + \frac{\iota_{6i}}{\zeta_3}\left\|x\right\|$

hold for all $Z\in\mathcal Z$. Completing the squares, the bound on the Lyapunov derivative can be expressed as

$\dot V_L \leq -\sum_{i=1}^N\frac{\underline q_i}{2}\left\|x\right\|^2 - \sum_{i=1}^N\frac{\eta_{c2i}\underline c_{xi}}{4}\left\|\tilde W_{ci}\right\|^2 - k_x\left\|\tilde x\right\|^2 - \sum_{i=1}^N\frac{2\eta_{a1i}+\eta_{a2i}}{8}\left\|\tilde W_{ai}\right\|^2 - \frac{k_\theta\underline y}{2}\left\|\tilde\theta\right\|^2 + \iota \leq -v_l\left\|Z\right\|^2,\quad \forall\left\|Z\right\|\geq\sqrt{\frac{\iota}{v_l}},\ Z\in\mathcal Z.$

Using these bounds, Theorem 4.18 in [149] can be invoked to conclude that $\limsup_{t\to\infty}\left\|Z(t)\right\|\leq\underline v^{-1}\left(\bar v\left(\sqrt{\iota/v_l}\right)\right)$. Furthermore, the system trajectories are bounded as $\left\|Z(t)\right\|\leq\overline Z$ for all $t\in\mathbb{R}_{\geq0}$. Hence, the gain conditions are sufficient for the conditions above to hold for all $t\in\mathbb{R}_{\geq0}$.


The error between the feedback-Nash equilibrium controller and the approximate controller can be expressed as

$\left\|u_i^*\left(x(t)\right)-u_i(t)\right\| \leq \frac12\left\|R_{ii}^{-1}\right\|\bar g_i\left(\bar\sigma_{ri}\left\|\tilde W_{ai}(t)\right\|+\bar\epsilon_{ri}\right),$

for all $i=1,\ldots,N$, where $\bar g_i\triangleq\sup_{x^o}\left\|g_i(x^o)\right\|$. Since the weights $\tilde W_{ai}$ are UB, UB convergence of the approximate controllers to the feedback-Nash equilibrium controllers is obtained.

Remark 4.1. The closed-loop system analyzed using the candidate Lyapunov function is a switched system. The switching happens when the history stack is updated and when the least-squares regression matrices $\Gamma_i$ reach their saturation bound. Similar to least squares-based adaptive control (cf. [91]), $V_L$ can be shown to be a common Lyapunov function for the regression matrix saturation, and the use of a singular value maximizing algorithm to update the history stack ensures that $V_L$ is a common Lyapunov function for the history stack updates (cf. [93]). Since $V_L$ is a common Lyapunov function, the bounds above establish UB convergence of the switched system.

4.4 Simulation
4.4.1 Problem Setup

To portray the performance of the developed approach, the concurrent learning-based adaptive technique is applied to the nonlinear control-affine system [112]

$\dot x = f(x) + g_1(x)u_1 + g_2(x)u_2,$

where $x\in\mathbb{R}^2$, $u_1,u_2\in\mathbb{R}$, and

$f = \begin{bmatrix} x_2 - 2x_1 \\ -\frac12x_1 - x_2 + \frac14x_2\left(\cos(2x_1)+2\right)^2 + \frac14x_2\left(\sin(4x_1^2)+2\right)^2\end{bmatrix},$


$g_1 = \begin{bmatrix}0\\\cos(2x_1)+2\end{bmatrix},\qquad g_2 = \begin{bmatrix}0\\\sin(4x_1^2)+2\end{bmatrix}.$

The value function has the structure shown in the NN representation with the weights $Q_1 = 2Q_2 = 2I_2$ and $R_{11}=R_{12}=2R_{21}=2R_{22}=2$. The system identification protocol given in Section 4.2.1 and the concurrent learning-based scheme given in Section 4.2.2 are implemented simultaneously to provide an approximate online feedback-Nash equilibrium solution to the given nonzero-sum two-player game.

4.4.2 Analytical Solution

The control-affine system is selected for this simulation because it is constructed using the converse HJ approach [12] such that the analytical feedback-Nash equilibrium solution of the nonzero-sum game is

$V_1^* = \begin{bmatrix}0.5\\0\\1\end{bmatrix}^T\begin{bmatrix}x_1^2\\x_1x_2\\x_2^2\end{bmatrix},\qquad V_2^* = \begin{bmatrix}0.25\\0\\0.5\end{bmatrix}^T\begin{bmatrix}x_1^2\\x_1x_2\\x_2^2\end{bmatrix},$

and the feedback-Nash equilibrium control policies for player 1 and player 2 are

$u_1^* = -\frac12R_{11}^{-1}g_1^T\begin{bmatrix}2x_1&0\\x_2&x_1\\0&2x_2\end{bmatrix}^T\begin{bmatrix}0.5\\0\\1\end{bmatrix},\qquad u_2^* = -\frac12R_{22}^{-1}g_2^T\begin{bmatrix}2x_1&0\\x_2&x_1\\0&2x_2\end{bmatrix}^T\begin{bmatrix}0.25\\0\\0.5\end{bmatrix}.$

Since the analytical solution is available, the performance of the developed method can be evaluated by comparing the obtained approximate solution against the analytical solution.

4.4.3 Simulation Parameters

The dynamics are linearly parameterized as $f(x)=Y(x)\theta$, where

$Y(x) = \begin{bmatrix}x_2&x_1&0&0&0&0\\0&0&x_1&x_2&x_2\left(\cos(2x_1)+2\right)^2&-x_2\left(\sin(4x_1^2)+2\right)^2\end{bmatrix}$


is known, and the constant vector of parameters $\theta = \left[1,\ -2,\ -\frac12,\ -1,\ \frac14,\ -\frac14\right]^T$ is assumed to be unknown. The initial guess for $\theta$ is selected as $\hat\theta(t_0)=0.5\cdot\mathbf 1_{6\times1}$. The system identification gains are selected as $k_x=5$, $\Gamma_\theta=\operatorname{diag}(20,\,20,\,100,\,100,\,60,\,60)$, and $k_\theta=1.5$. A history stack of 30 points is selected using a singular value maximizing algorithm (cf. [93]) for the concurrent learning-based update law, and the state derivatives are estimated using a fifth order Savitzky-Golay filter (cf. [150]). Based on the structure of the feedback-Nash equilibrium value functions, the basis function for value function approximation is selected as $\sigma=\left[x_1^2,\ x_1x_2,\ x_2^2\right]^T$, and the adaptive learning parameters and initial conditions are shown for both players in Tables 4-1 and 4-2. Twenty-five points lying on a $5\times5$ grid on a $2\times2$ square around the origin are selected for the concurrent learning-based update laws.

Table 4-1. Learning gains for value function approximation
                 Player 1    Player 2
  nu             0.005       0.005
  eta_c1         1.0         1.0
  eta_c2         1.5         1.0
  eta_a1         10.0        10.0
  eta_a2         0.1         0.1
  lambda         3.0         3.0
  Gamma_bar      10,000.0    10,000.0

Table 4-2. Initial conditions for the system and the two players
                 Player 1    Player 2
  W_c(t_0)       [3,3,3]^T   [3,3,3]^T
  W_a(t_0)       [3,3,3]^T   [3,3,3]^T
  Gamma(t_0)     100 I_3     100 I_3
  x(t_0)         [1,1]^T
  x_hat(t_0)     [0,0]^T
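As a sanity check on the linear parameterization, the sketch below evaluates $f(x)$ both directly and as $Y(x)\theta$ at a sample point. Note that the signs of $\theta$ and of the trigonometric entry of $Y$ are a best-effort reconstruction from the source, so this pairing is an assumption chosen to keep $f = Y\theta$ consistent.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([
        x2 - 2.0 * x1,
        -0.5 * x1 - x2
        + 0.25 * x2 * (np.cos(2.0 * x1) + 2.0) ** 2
        + 0.25 * x2 * (np.sin(4.0 * x1 ** 2) + 2.0) ** 2,
    ])

def Y(x):
    x1, x2 = x
    return np.array([
        [x2, x1, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, x1, x2,
         x2 * (np.cos(2.0 * x1) + 2.0) ** 2,
         -x2 * (np.sin(4.0 * x1 ** 2) + 2.0) ** 2],
    ])

theta = np.array([1.0, -2.0, -0.5, -1.0, 0.25, -0.25])
x = np.array([1.0, 1.0])
assert np.allclose(f(x), Y(x) @ theta)   # verifies f(x) = Y(x) theta
```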


4.4.4 Simulation Results

Figures 4-1 and 4-2 show the rapid convergence of the actor and critic weights to the approximate feedback-Nash equilibrium values for both players, resulting in the value functions and control policies

$\hat V_1(x) = \begin{bmatrix}0.5021\\-0.0159\\0.9942\end{bmatrix}^T\begin{bmatrix}x_1^2\\x_1x_2\\x_2^2\end{bmatrix},\qquad \hat u_1(x) = -\frac12R_{11}^{-1}g_1^T\begin{bmatrix}2x_1&0\\x_2&x_1\\0&2x_2\end{bmatrix}^T\begin{bmatrix}0.4970\\-0.0137\\0.9810\end{bmatrix},$

$\hat V_2(x) = \begin{bmatrix}0.2510\\-0.0074\\0.4968\end{bmatrix}^T\begin{bmatrix}x_1^2\\x_1x_2\\x_2^2\end{bmatrix},\qquad \hat u_2(x) = -\frac12R_{22}^{-1}g_2^T\begin{bmatrix}2x_1&0\\x_2&x_1\\0&2x_2\end{bmatrix}^T\begin{bmatrix}0.2485\\-0.0055\\0.4872\end{bmatrix}.$

Figure 4-3 demonstrates that, without the injection of a PE signal, the system identification parameters also approximately converged to the correct values. The state and control signal trajectories are displayed in Figure 4-4.

Figure 4-1. Trajectories of actor and critic weights for player 1 compared against their true values. The true values computed based on the analytical solution are represented by dotted lines.


Figure 4-2. Trajectories of actor and critic weights for player 2 compared against their true values. The true values computed based on the analytical solution are represented by dotted lines.

Figure 4-3. Trajectories of the estimated parameters in the drift dynamics compared against their true values. The true values are represented by dotted lines.


Figure 4-4. System state trajectory and the control trajectories for players 1 and 2 generated using the developed technique.

4.5 Concluding Remarks

A concurrent learning-based adaptive approach is developed to determine the feedback-Nash equilibrium solution to an $N$-player nonzero-sum game online. The solutions to the associated coupled HJ equations and the corresponding feedback-Nash equilibrium policies are approximated using parametric universal function approximators. Based on estimates of the unknown drift parameters, estimates for the Bellman errors are evaluated at a set of preselected points in the state space. The value function and the policy weights are updated using a concurrent learning-based least-squares approach to minimize the instantaneous BEs and the BEs evaluated at the preselected points. Simultaneously, the unknown parameters in the drift dynamics are updated using a history stack of recorded data via a concurrent learning-based gradient descent approach.

The simulation-based ACI technique developed in this chapter and Chapter 3 achieves approximate optimal control for autonomous systems and stationary cost functions. Extension of the ACI techniques to optimal trajectory tracking problems presents unique challenges for value function approximation due to the time-varying nature of the problem.


The following chapter describes the challenges and presents a solution to extend the ACI architecture to solve infinite-horizon trajectory tracking problems.


CHAPTER 5
EXTENSION TO APPROXIMATE OPTIMAL TRACKING

ADP has been investigated and used as a tool to approximately solve optimal regulation problems. For these problems, function approximation techniques can be used to approximate the value function because it is a time-invariant function. In tracking problems, the tracking error, and hence the value function, is a function of the state and an explicit function of time. Approximation techniques like NNs are commonly used in the ADP literature for value function approximation. However, NNs can only approximate functions on compact domains, which leads to a technical challenge in approximating the value function for a tracking problem, because the infinite-horizon nature of the problem implies that time does not lie on a compact set. Hence, the extension of this technique to optimal tracking problems for continuous-time nonlinear systems has remained a non-trivial open problem.

In this result, the tracking error and the desired trajectory both serve as inputs to the NN. This makes the developed controller fundamentally different from previous results, in the sense that a different HJB equation must be solved and its solution, i.e., the feedback component of the controller, is a time-varying function of the tracking error. In particular, this chapter addresses the technical obstacles that result from the time-varying nature of the optimal control problem by including the partial derivative of the value function with respect to the desired trajectory in the HJB equation, and by using a system transformation to convert the problem into a time-invariant optimal control problem in such a way that the resulting value function is a time-invariant function of the transformed states, and hence, lends itself to approximation using a NN. A Lyapunov-based analysis is used to prove ultimately bounded tracking and that the controller converges to the approximate optimal policy. Simulation results are presented to demonstrate the applicability of the presented technique. To gauge the performance of the proposed method, a comparison with a numerical optimal solution is presented.


5.1 Formulation of Time-invariant Optimal Control Problem

Consider the class of nonlinear control-affine systems described in Chapter 2. The control objective is to track a bounded continuously differentiable signal $x_d\in\mathbb{R}^n$. To quantify this objective, a tracking error is defined as $e\triangleq x-x_d$. The open-loop tracking error dynamics can then be written as

$\dot e = f(x) + g(x)u - \dot x_d.$

The following assumptions are made to facilitate the formulation of an approximate optimal tracking controller.

Assumption 5.1. The function $g$ is bounded, the matrix $g(x^o)$ has full column rank for all $x^o\in\mathbb{R}^n$, and the function $g^+:\mathbb{R}^n\to\mathbb{R}^{m\times n}$ defined as $g^+\triangleq\left(g^Tg\right)^{-1}g^T$ is bounded and locally Lipschitz.

Assumption 5.2. The desired trajectory is bounded such that $\left\|x_d\right\|\leq\bar d\in\mathbb{R}$, and there exists a locally Lipschitz function $h_d:\mathbb{R}^n\to\mathbb{R}^n$ such that $\dot x_d = h_d(x_d)$ and $g(x_d)g^+(x_d)\left(h_d(x_d)-f(x_d)\right) = h_d(x_d)-f(x_d)$ for all $t\in\mathbb{R}_{\geq t_0}$.

The steady-state control policy $u_d:\mathbb{R}^n\to\mathbb{R}^m$ corresponding to the desired trajectory $x_d$ is

$u_d(x_d) = g_d^+\left(h_d(x_d)-f_d\right),$

where $g_d^+\triangleq g^+(x_d)$ and $f_d\triangleq f(x_d)$. To transform the time-varying optimal control problem into a time-invariant optimal control problem, a new concatenated state $\zeta\in\mathbb{R}^{2n}$ is defined as [86]

$\zeta\triangleq\left[e^T,\ x_d^T\right]^T.$

Based on the error dynamics and Assumption 5.2, the time derivative of $\zeta$ can be expressed as

$\dot\zeta = F(\zeta) + G(\zeta)\mu,$


where the functions $F:\mathbb{R}^{2n}\to\mathbb{R}^{2n}$, $G:\mathbb{R}^{2n}\to\mathbb{R}^{2n\times m}$, and the control $\mu\in\mathbb{R}^m$ are defined as

$F(\zeta)\triangleq\begin{bmatrix}f(e+x_d)-h_d(x_d)+g(e+x_d)\,u_d(x_d)\\h_d(x_d)\end{bmatrix},\qquad G(\zeta)\triangleq\begin{bmatrix}g(e+x_d)\\0_{n\times m}\end{bmatrix},\qquad \mu\triangleq u-u_d.$

Local Lipschitz continuity of $f$ and $g$, the fact that $f(0)=0$, and Assumption 5.2 imply that $F(0)=0$ and that $F$ is locally Lipschitz. The objective of the optimal control problem is to minimize the cost functional $J(\zeta,\mu)$, introduced in Chapter 2, subject to the dynamic constraints above while tracking the desired trajectory. For ease of exposition, let the function $Q_\zeta:\mathbb{R}^{2n}\to\mathbb{R}_{\geq0}$ be defined as $Q_\zeta(\zeta)\triangleq\zeta^T\bar Q\zeta$, where $\bar Q\in\mathbb{R}^{2n\times2n}$ is a constant matrix defined as

$\bar Q\triangleq\begin{bmatrix}Q&0_{n\times n}\\0_{n\times n}&0_{n\times n}\end{bmatrix},$

where $Q\in\mathbb{R}^{n\times n}$ is a positive definite symmetric matrix of constants with minimum eigenvalue $\underline q\in\mathbb{R}_{>0}$. Thus, the reward $r:\mathbb{R}^{2n}\times\mathbb{R}^m\to\mathbb{R}$ is given by

$r(\zeta,\mu)=\zeta^T\bar Q\zeta+\mu^TR\mu.$

5.2 Approximate Optimal Solution

Similar to the development in Chapter 2, assuming that a minimizing policy exists and that the optimal value function satisfies $V^*\in C^1$ and $V^*(0)=0$, the local cost and the dynamics above yield the optimal policy $\mu^*:\mathbb{R}^{2n}\to\mathbb{R}^m$ as

$\mu^*(\zeta^o) = -\frac12R^{-1}G^T(\zeta^o)\left(\nabla V^*(\zeta^o)\right)^T,\quad\forall\zeta^o\in\mathbb{R}^{2n}.$
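The state augmentation above is mechanical to implement. The sketch below assembles $F(\zeta)$ and $G(\zeta)$ from user-supplied callables $f$, $g$, and $h_d$ (hypothetical placeholders), so that the tracking problem evolves as $\dot\zeta = F(\zeta)+G(\zeta)\mu$; `pinv` is used as a stand-in for the left pseudoinverse $(g^Tg)^{-1}g^T$ under the full-column-rank assumption.

```python
import numpy as np

def u_d(x_d, f, g, h_d):
    """Steady-state control u_d = g^+(x_d) (h_d(x_d) - f(x_d))."""
    return np.linalg.pinv(g(x_d)) @ (h_d(x_d) - f(x_d))

def F(zeta, n, f, g, h_d):
    """Drift of the concatenated state zeta = [e; x_d]."""
    e, x_d = zeta[:n], zeta[n:]
    top = f(e + x_d) - h_d(x_d) + g(e + x_d) @ u_d(x_d, f, g, h_d)
    return np.concatenate([top, h_d(x_d)])

def G(zeta, n, m, g):
    """Input matrix of the concatenated state dynamics."""
    e, x_d = zeta[:n], zeta[n:]
    return np.vstack([g(e + x_d), np.zeros((n, m))])
```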


Here, $V^*:\mathbb{R}^{2n}\to\mathbb{R}_{\geq0}$ denotes the optimal value function defined as in Chapter 2 with the local cost defined above.¹ The policy $\mu^*$ and the value function $V^*$ satisfy the HJB equation [1]

$\nabla V^*(\zeta^o)\left(F(\zeta^o)+G(\zeta^o)\mu^*(\zeta^o)\right)+r\left(\zeta^o,\mu^*(\zeta^o)\right)=0,$

for all $\zeta^o\in\mathbb{R}^{2n}$, with the initial condition $V^*(0)=0$.

The value function $V^*$ can be represented using a NN with $L$ neurons as

$V^*(\zeta^o)=W^T\sigma(\zeta^o)+\epsilon(\zeta^o),\quad\forall\zeta^o\in\mathbb{R}^{2n},$

where $W\in\mathbb{R}^L$ is the ideal weight vector, bounded above by a known positive constant $\bar W\in\mathbb{R}$ in the sense that $\left\|W\right\|\leq\bar W$, $\sigma:\mathbb{R}^{2n}\to\mathbb{R}^L$ is a bounded continuously differentiable nonlinear activation function, and $\epsilon:\mathbb{R}^{2n}\to\mathbb{R}$ is the function reconstruction error [151, 152].

Using this representation, the optimal policy can be represented as

$\mu^*(\zeta^o)=-\frac12R^{-1}G^T(\zeta^o)\left(\nabla\sigma^T(\zeta^o)\,W+\nabla\epsilon^T(\zeta^o)\right),\quad\forall\zeta^o\in\mathbb{R}^{2n}.$

Based on these representations, the NN approximations to the optimal value function and the optimal policy are defined as

$\hat V\left(\zeta,\hat W_c\right)\triangleq\hat W_c^T\sigma(\zeta),\qquad \hat\mu\left(\zeta,\hat W_a\right)\triangleq-\frac12R^{-1}G^T(\zeta)\nabla\sigma^T(\zeta)\hat W_a,$

where $\hat W_c\in\mathbb{R}^L$ and $\hat W_a\in\mathbb{R}^L$ are estimates of the ideal neural network weights $W$. The use of two separate sets of weight estimates $\hat W_a$ and $\hat W_c$ for $W$ is motivated by the fact that the BE is linear with respect to the value function weight estimates and nonlinear with respect to the policy weight estimates.

¹Since the closed-loop system corresponding to the transformed dynamics under a feedback policy is autonomous, the cost-to-go, i.e., the integral in the cost functional, is independent of the initial time. Hence, the value function is only a function of $\zeta$.


Use of a separate set of weight estimates for the value function facilitates least squares-based adaptive updates.

The controller for the transformed dynamics is $\mu(t)=\hat\mu\left(\zeta(t),\hat W_a(t)\right)$, and the controller implemented on the actual system is obtained as

$u = -\frac12R^{-1}G^T(\zeta)\nabla\sigma^T(\zeta)\hat W_a + g_d^+\left(h_d(x_d)-f_d\right).$

Using the approximations $\hat\mu$ and $\hat V$ for $\mu^*$ and $V^*$, respectively, the BE is given in a measurable form by

$\delta_t = \nabla\hat V\left(\zeta,\hat W_c\right)\dot\zeta + r\left(\zeta,\hat\mu\right),$

where the derivative $\dot\zeta$ is measurable because the system model is known. For notational brevity, state-dependence of the functions $h_d$, $F$, $G$, $V^*$, $\mu^*$, $\sigma$, and $\epsilon$, and the arguments to the functions $\hat\mu$ and $\hat V$, are suppressed hereafter. The value function weights are updated to minimize $\int_{t_0}^t\delta_\tau^2\,d\tau$ using a normalized least squares update law² with an exponential forgetting factor as [91]

$\dot{\hat W}_c = -\eta_c\Gamma\frac{\omega}{1+\nu\omega^T\Gamma\omega}\delta_t,\qquad \dot\Gamma = -\eta_c\left(-\lambda\Gamma+\frac{\Gamma\omega\omega^T\Gamma}{1+\nu\omega^T\Gamma\omega}\right),$

where $\nu,\eta_c\in\mathbb{R}$ are positive adaptation gains, $\omega\in\mathbb{R}^L$ is defined as $\omega\triangleq\nabla\sigma\,\dot\zeta$, and $\lambda\in(0,1)$ is the forgetting factor for the estimation gain matrix $\Gamma\in\mathbb{R}^{L\times L}$.

²The least-squares approach is motivated by faster convergence. With minor modifications to the stability analysis, the result can also be established for a gradient descent update law.
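The measurable BE and the normalized least-squares update can be composed directly from $\dot\zeta$. The sketch below shows one evaluation of the BE and of the right-hand sides of the two update laws; the basis gradient, reward weights, and gains are hypothetical stand-ins, and the sketch is an illustration rather than the dissertation's implementation.

```python
import numpy as np

def tracking_critic_update(W_c, Gamma, zeta, zeta_dot, mu, grad_sigma,
                           Q_bar, R, eta_c, nu, lam):
    """BE delta_t and the right-hand sides of the W_c and Gamma update laws."""
    omega = grad_sigma(zeta) @ zeta_dot              # omega = grad(sigma) zeta_dot
    reward = zeta @ Q_bar @ zeta + mu @ R @ mu
    delta_t = W_c @ omega + reward                   # measurable Bellman error
    rho = 1.0 + nu * omega @ Gamma @ omega
    W_c_dot = -eta_c * Gamma @ omega * delta_t / rho
    Gamma_dot = -eta_c * (-lam * Gamma
                          + Gamma @ np.outer(omega, omega) @ Gamma / rho)
    return delta_t, W_c_dot, Gamma_dot
```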


The policy weights are updated to follow the critic weights³ as

$\dot{\hat W}_a = -\eta_{a1}\left(\hat W_a-\hat W_c\right)-\eta_{a2}\hat W_a,$

where $\eta_{a1},\eta_{a2}\in\mathbb{R}$ are positive adaptation gains. The following assumption facilitates the stability analysis using PE.

Assumption 5.3. The regressor $\psi:\mathbb{R}_{\geq0}\to\mathbb{R}^L$ defined as $\psi\triangleq\frac{\omega}{\sqrt{1+\nu\omega^T\Gamma\omega}}$ satisfies the PE condition, i.e., there exist constants $T,\underline\psi\in\mathbb{R}_{>0}$ such that

$\underline\psi\, I_L \leq \int_t^{t+T}\psi(\tau)\psi^T(\tau)\,d\tau.$⁴

Using Assumption 5.3 and [91, Corollary 4.3.2], it can be concluded that

$\underline\varphi\, I_{L\times L}\leq\Gamma(t)\leq\bar\varphi\, I_{L\times L},\quad\forall t\in\mathbb{R}_{\geq0},$

where $\underline\varphi,\bar\varphi\in\mathbb{R}$ are constants such that $0<\underline\varphi<\bar\varphi$.⁵ Based on this bound, the regressor vector can be bounded as

$\left\|\psi(t)\right\|\leq\frac{1}{\sqrt{\underline\varphi}},\quad\forall t\in\mathbb{R}_{\geq0}.$

Using the transformed dynamics and the NN representations, an unmeasurable form of the BE can be written as

$\delta_t = -\tilde W_c^T\omega + \frac14\tilde W_a^TG_\sigma\tilde W_a + \frac14\nabla\epsilon\, G_R\nabla\epsilon^T + \frac12W^T\nabla\sigma\, G_R\nabla\epsilon^T - \nabla\epsilon\, F,$

³The least-squares approach cannot be used to update the policy weights because the BE is a nonlinear function of the policy weights.
⁴The regressor is defined here as a trajectory indexed by time. This definition suppresses the fact that different initial conditions result in different regressor trajectories. Assumption 5.3 describes the properties of one specific trajectory starting from one specific initial condition. Naturally, the final result of the chapter also describes limiting properties of one specific state trajectory. That is, the final result is not uniform in the initial conditions.
⁵Since the evolution of $\Gamma$ is dependent on the initial condition, the constants $\underline\varphi$ and $\bar\varphi$ depend on the initial condition.


where $G_R\triangleq GR^{-1}G^T$ and $G_\sigma\triangleq\nabla\sigma\, GR^{-1}G^T\nabla\sigma^T$. The weight estimation errors for the value function and the policy are defined as $\tilde W_c\triangleq W-\hat W_c$ and $\tilde W_a\triangleq W-\hat W_a$, respectively.

5.3 Stability Analysis

Before stating the main result of the chapter, three supplementary technical lemmas are stated. To facilitate the discussion, let $\mathcal Y\subset\mathbb{R}^{2n+2L}$ be a compact set, and let $\mathcal Z\triangleq\mathcal Y\cap\mathbb{R}^{n+2L}$. Using the universal approximation property of NNs, on the compact set $\mathcal Y\cap\mathbb{R}^{2n}$ the NN approximation errors can be bounded such that $\sup_{\zeta^o\in\mathcal Y\cap\mathbb{R}^{2n}}\left|\epsilon(\zeta^o)\right|\leq\bar\epsilon$ and $\sup_{\zeta^o\in\mathcal Y\cap\mathbb{R}^{2n}}\left|\nabla\epsilon(\zeta^o)\right|\leq\bar\epsilon_r$, where $\bar\epsilon,\bar\epsilon_r\in\mathbb{R}$ are positive constants. Using Assumptions 5.1 and 5.2 and the fact that on the compact set $\mathcal Y\cap\mathbb{R}^{2n}$ there exists a positive constant $L_F\in\mathbb{R}$ such that⁶ $\sup_{\zeta^o\in\mathcal Y\cap\mathbb{R}^{2n}}\left\|F(\zeta^o)\right\|\leq L_F\left\|\zeta^o\right\|$, the following bounds are developed to aid the subsequent stability analysis:

$\left\|\frac{\nabla\epsilon}{4}G_R\nabla\epsilon^T+\frac{W^T\nabla\sigma}{2}G_R\nabla\epsilon^T\right\|+\bar\epsilon_rL_F\left\|x_d\right\|\leq\iota_1,\qquad \left\|G_\sigma\right\|\leq\iota_2,\qquad \left\|\nabla\sigma\, G_R\nabla\epsilon^T\right\|\leq\iota_3,$

$\left\|\frac12W^TG_\sigma+\frac12\nabla\epsilon\, G_R\nabla\sigma^T\right\|\leq\iota_4,\qquad \left\|\frac14\nabla\epsilon\, G_R\nabla\epsilon^T+\frac12W^T\nabla\sigma\, G_R\nabla\epsilon^T\right\|\leq\iota_5,$

where $\iota_1,\iota_2,\iota_3,\iota_4,\iota_5\in\mathbb{R}$ are positive bounds that are constant for a fixed initial condition.

5.3.1 Supporting Lemmas

The contribution in the previous section was the development of a transformation that enables the optimal policy and the optimal value function to be expressed as a time-invariant function of $\zeta$. The use of this transformation presents a challenge in the sense that the optimal value function, which is used as the Lyapunov function for the stability analysis, is not a positive definite function of $\zeta$, because the matrix $\bar Q$ is positive semi-definite.

⁶Instead of using the fact that locally Lipschitz functions on compact sets are Lipschitz, it is possible to bound the function $F$ as $\left\|F(\zeta)\right\|\leq\rho(\left\|\zeta\right\|)\left\|\zeta\right\|$, where $\rho:\mathbb{R}_{\geq0}\to\mathbb{R}_{\geq0}$ is non-decreasing. This approach is feasible and results in additional gain conditions.


In this section, this technical obstacle is addressed by exploiting the fact that the time-invariant optimal value function $V^*:\mathbb{R}^{2n}\to\mathbb{R}$ can be interpreted as a time-varying map $V_t^*:\mathbb{R}^n\times\mathbb{R}_{\geq0}\to\mathbb{R}$, such that

$V_t^*(e,t)\triangleq V^*\left(\begin{bmatrix}e\\x_d(t)\end{bmatrix}\right)$

for all $e\in\mathbb{R}^n$ and for all $t\in\mathbb{R}_{\geq0}$. Specifically, the time-invariant form facilitates the development of the approximate optimal policy, whereas the equivalent time-varying form can be shown to be a positive definite and decrescent function of the tracking error. In the following, Lemma 5.1 is used to prove that $V_t^*:\mathbb{R}^n\times\mathbb{R}_{\geq0}\to\mathbb{R}$ is positive definite and decrescent, and hence, a candidate Lyapunov function.

Lemma 5.1. Let $B_a$ denote a closed ball around the origin with radius $a\in\mathbb{R}_{>0}$. The optimal value function $V_t^*:\mathbb{R}^n\times\mathbb{R}_{\geq0}\to\mathbb{R}$ satisfies the following properties:

$\underline v\left(\left\|e\right\|\right)\leq V_t^*(e,t),\qquad V_t^*(0,t)=0,\qquad V_t^*(e,t)\leq\bar v\left(\left\|e\right\|\right),$

for all $t\in\mathbb{R}_{\geq0}$ and for all $e\in B_a$, where $\underline v:[0,a]\to\mathbb{R}_{\geq0}$ and $\bar v:[0,a]\to\mathbb{R}_{\geq0}$ are class $\mathcal K$ functions.

Proof. See Appendix B.1.

Since the stability analysis is subject to the PE condition in Assumption 5.3, the behavior of the system states is examined over the time interval $[t,t+T]$. The following two lemmas establish growth bounds on the tracking error and on the actor and the critic weights.

Lemma 5.2. Let $Z\triangleq\left[e^T,\tilde W_c^T,\tilde W_a^T\right]^T$, and suppose that $Z(\tau)\in\mathcal Z$ for all $\tau\in[t,t+T]$. Then the NN weights and the tracking errors satisfy

$-\inf_{\tau\in[t,t+T]}\left\|e(\tau)\right\|^2 \leq -\varpi_0\sup_{\tau\in[t,t+T]}\left\|e(\tau)\right\|^2 + \varpi_1T^2\sup_{\tau\in[t,t+T]}\left\|\tilde W_a(\tau)\right\|^2 + \varpi_2,$


$-\inf_{\tau\in[t,t+T]}\left\|\tilde W_a(\tau)\right\|^2\leq-\varpi_3\sup_{\tau\in[t,t+T]}\left\|\tilde W_a(\tau)\right\|^2+\varpi_4\inf_{\tau\in[t,t+T]}\left\|\tilde W_c(\tau)\right\|^2+\varpi_5\sup_{\tau\in[t,t+T]}\left\|e(\tau)\right\|^2+\varpi_6,$

where $\varpi_0 = 1-6nT^2L_F^2$, $\varpi_3 = 1-6L\left(\eta_{a1}+\eta_{a2}\right)^2T^2$, $\varpi_4 = \frac{6L\eta_{a1}^2T^2}{1-6L\eta_c^2\bar\varphi^2T^2/\underline\varphi^2}$, and $\varpi_1,\varpi_2,\varpi_5,\varpi_6\in\mathbb{R}$ are positive constants that depend on the gains, the PE interval $T$, the NN bounds, and the Lipschitz constant $L_F$; their explicit forms follow from the analysis in Appendix B.2.

Proof. See Appendix B.2.

Lemma 5.3. Let $Z\triangleq\left[e^T,\tilde W_c^T,\tilde W_a^T\right]^T$, and suppose that $Z(\tau)\in\mathcal Z$ for all $\tau\in[t,t+T]$. Then the critic weights satisfy

$-\int_t^{t+T}\left(\tilde W_c^T\psi\right)^2d\tau \leq -\varpi_7\left\|\tilde W_c(t)\right\|^2 + \varpi_8\int_t^{t+T}\left\|e\right\|^2d\tau + 3\iota_2^2\int_t^{t+T}\left\|\tilde W_a\right\|^4d\tau + \varpi_9T,$

where $\varpi_8=3\bar\epsilon_r^2L_F^2$, $\varpi_9=2\left(\iota_5^2+\bar\epsilon_r^2L_F^2\bar d^2\right)$, and $\varpi_7$ is a positive constant determined by $\underline\psi$, $\underline\varphi$, $\bar\varphi$, $\eta_c$, and $T$.

Proof. See Appendix B.3.

5.3.2 Gain Conditions and Gain Selection

This section details sufficient gain conditions derived based on a stability analysis performed using the candidate Lyapunov function $V_L:\mathbb{R}^{n+2L}\times\mathbb{R}_{\geq0}\to\mathbb{R}$ defined as

$V_L(Z,t)\triangleq V_t^*(e,t)+\frac12\tilde W_c^T\Gamma^{-1}\tilde W_c+\frac12\tilde W_a^T\tilde W_a.$

Using Lemma 5.1 and the bounds on $\Gamma$,

$\underline v_l\left(\left\|Z^o\right\|\right)\leq V_L\left(Z^o,t\right)\leq\bar v_l\left(\left\|Z^o\right\|\right),\quad\forall Z^o\in B_b,\ \forall t\in\mathbb{R}_{\geq0},$

where $\underline v_l:[0,b]\to\mathbb{R}_{\geq0}$ and $\bar v_l:[0,b]\to\mathbb{R}_{\geq0}$ are class $\mathcal K$ functions, and $B_b\subset\mathbb{R}^{n+2L}$ denotes a ball of radius $b\in\mathbb{R}_{>0}$ around the origin, containing $\mathcal Z$.


To facilitate the discussion, define $\eta_{a12}\triangleq\eta_{a1}+\eta_{a2}$,

$\iota\triangleq\eta_{a2}\bar W+\frac{\iota_4^2}{\eta_{a12}}+\frac{\eta_c\iota_1^2}{2}+\frac{\iota_3}{4},\qquad \varpi_{10}\triangleq\frac{\varpi_6\eta_{a12}+2\varpi_2\underline q+\eta_c\varpi_9}{8}+\iota,\qquad Z\triangleq\left[e^T,\tilde W_c^T,\tilde W_a^T\right]^T,$

and $\varpi_{11}\triangleq\frac1{16}\min\left(\eta_c\varpi_7,\ 2\varpi_0\underline qT,\ \varpi_3\eta_{a12}T\right)$. Let $\overline Z_0\in\mathbb{R}_{\geq0}$ denote a known constant bound on the initial conditions such that $\left\|Z(t_0)\right\|\leq\overline Z_0$, and let

$\overline Z\triangleq\underline v_l^{-1}\left(\bar v_l\left(\max\left(\overline Z_0,\ \sqrt{\frac{\varpi_{10}T}{\varpi_{11}}}\right)\right)+\iota T\right).$

The sufficient gain conditions for the subsequent Theorem 5.1 are given by⁷

$\eta_{a12}>\max\left(\frac{\eta_{a1}}{2}+\frac{\eta_c\iota_2}{4}\sqrt{\frac{\overline Z}{\underline\varphi}},\ \frac{3\eta_c\iota_2}{2}\overline Z\right),\qquad \zeta_1>2\bar\epsilon_rL_F,\qquad \eta_c>\frac{\eta_{a1}}{2},\qquad \underline\psi>\frac{2\varpi_4\eta_{a12}}{\eta_c\varpi_7}T,$

$\underline q>\max\left(\frac{\varpi_5\eta_{a12}}{\varpi_0},\ \frac{\eta_c\varpi_8}{2},\ \frac{\eta_cL_F\bar\epsilon_r}{\zeta_1}\right),\qquad T<\min\left(\frac1{\sqrt{6L}\,\eta_{a12}},\ \frac{\underline\varphi}{\sqrt{6L}\,\eta_c\bar\varphi},\ \frac1{2\sqrt n\,L_F},\ \sqrt{\frac{\eta_{a12}}{6L\left(3\eta_{a12}+8\underline q\varpi_1\right)}}\right).$

Furthermore, the compact set $\mathcal Z$ satisfies the sufficient condition

$\overline Z\leq\bar r,$

where $\bar r\triangleq\frac12\sup_{z,y\in\mathcal Z}\left\|z-y\right\|$ denotes the radius of $\mathcal Z$. Since the Lipschitz constant and the bounds on the NN approximation error depend on the size of the compact set $\mathcal Z$, the constant $\overline Z$ depends on $\bar r$; hence, feasibility of the sufficient condition is not apparent. Algorithm 5.1 details an iterative gain selection process in order to ensure satisfaction of the sufficient condition. In Algorithm 5.1, the notation $\{\varsigma\}_i$ for any parameter $\varsigma$ denotes the value of $\varsigma$ computed in the $i$th iteration. Algorithm 5.1 ensures satisfaction of the sufficient condition.

⁷Similar conditions on $\underline\psi$ and $T$ can be found in PE-based adaptive control in the presence of bounded or Lipschitz uncertainties (cf. [153, 154]).


Algorithm 5.1 Gain Selection
First iteration: Given $\overline Z_0\in\mathbb{R}_{\geq0}$ such that $\left\|Z(t_0)\right\|\leq\overline Z_0$, select a candidate set $\{\mathcal Z\}_1$. Using $\{\mathcal Z\}_1$, compute the bounds and select the gains according to the gain conditions. If $\{\overline Z\}_1\leq\underline v_l^{-1}\left(\bar v_l\left(\overline Z_0\right)\right)$, set $\mathcal Z=\{\mathcal Z\}_1$ and terminate.
Second iteration: If $\{\overline Z\}_1>\underline v_l^{-1}\left(\bar v_l\left(\overline Z_0\right)\right)$, let $\{\mathcal Z\}_2\triangleq\left\{\varrho\in\mathbb{R}^{n+2\{L\}_1}\mid\left\|\varrho\right\|\leq2\{\overline Z\}_1\right\}$. Using $\{\mathcal Z\}_2$, compute the bounds and select the gains according to the gain conditions. If $\{\overline Z\}_2\leq\{\overline Z\}_1$, set $\mathcal Z=\{\mathcal Z\}_2$ and terminate.
Third iteration: If $\{\overline Z\}_2>\{\overline Z\}_1$, increase the number of NN neurons to $\{L\}_3$ to yield a lower function approximation error $\{\bar\epsilon_r\}_3$ such that $\{L_F\}_2\{\bar\epsilon_r\}_3\leq\{L_F\}_1\{\bar\epsilon_r\}_1$. The increase in the number of NN neurons ensures that $\{\iota\}_3\leq\{\iota\}_1$. Furthermore, the assumption that the PE interval $\{T\}_3$ is small enough such that $\{L_F\}_2\{T\}_3\leq\{T\}_1\{L_F\}_1$ and $\{L\}_3\{T\}_3\leq\{T\}_1\{L\}_1$ ensures that $\left\{\frac{\varpi_{10}}{\varpi_{11}}\right\}_3\leq\left\{\frac{\varpi_{10}}{\varpi_{11}}\right\}_1$, and hence, $\{\overline Z\}_3\leq2\{\overline Z\}_1$. Set $\mathcal Z=\left\{\varrho\in\mathbb{R}^{n+2\{L\}_3}\mid\left\|\varrho\right\|\leq2\{\overline Z\}_1\right\}$ and terminate.

5.3.3 Main Result

Theorem 5.1. Provided that the sufficient conditions above are satisfied and Assumptions 5.1-5.3 hold, the controller and the update laws guarantee that the tracking error is ultimately bounded, and that the error $\left\|\hat\mu(t)-\mu^*(t)\right\|$ is ultimately bounded as $t\to\infty$.

Proof. The time derivative of $V_L$ is

$\dot V_L=\nabla V^*F+\nabla V^*G\hat\mu+\tilde W_c^T\Gamma^{-1}\dot{\tilde W}_c-\frac12\tilde W_c^T\Gamma^{-1}\dot\Gamma\,\Gamma^{-1}\tilde W_c-\tilde W_a^T\dot{\hat W}_a.$

Using the HJB equation and the facts that $\nabla V^*F=-\nabla V^*G\mu^*-r(\zeta,\mu^*)$ and $\nabla V^*G=-2\mu^{*T}R$ yields

$\dot V_L=-e^TQe+\mu^{*T}R\mu^*-2\mu^{*T}R\hat\mu-\eta_c\tilde W_c^T\psi\psi^T\tilde W_c-\frac{\eta_c\lambda}2\tilde W_c^T\Gamma^{-1}\tilde W_c+\frac{\eta_c}2\tilde W_c^T\frac{\omega\omega^T}{\rho}\tilde W_c-\tilde W_a^T\dot{\hat W}_a+\frac{\eta_c\tilde W_c^T\psi}{\sqrt\rho}\left(\frac14\tilde W_a^TG_\sigma\tilde W_a-\nabla\epsilon\,F+\frac14\nabla\epsilon\,G_R\nabla\epsilon^T+\frac12W^T\nabla\sigma\,G_R\nabla\epsilon^T\right),$

where $\rho\triangleq1+\nu\omega^T\Gamma\omega$. Using the update laws and the bounds developed above, the Lyapunov derivative can be bounded above on the set $\mathcal Z$ as

$\dot V_L\leq-\frac{\underline q}2\left\|e\right\|^2-\frac{\eta_c}4\left(\tilde W_c^T\psi\right)^2-\frac{\eta_{a12}}2\left\|\tilde W_a\right\|^2+\left(2\eta_{a2}\bar W+\iota_4\right)\left\|\tilde W_a\right\|-\left(\frac{\underline q}2-\frac{\eta_cL_F\bar\epsilon_r}{\zeta_1}\right)\left\|e\right\|^2-\frac12\left(\eta_{a12}-\eta_{a1}\zeta_2-\frac{\eta_c\iota_2}4\left\|\tilde W_c^T\psi\right\|\right)\left\|\tilde W_a\right\|^2-\frac12\left(\eta_c\lambda-\frac{\eta_{a1}}{\zeta_2}\right)\left\|\tilde W_c\right\|^2+\eta_c\left(\iota_1+\frac{\iota_2\bar W^2}2\right)\left\|\tilde W_c^T\psi\right\|+\frac{\iota_3}4,$


where $\zeta_1,\zeta_2\in\mathbb{R}$ are known adjustable positive constants. Provided the sufficient conditions are satisfied, completion of squares yields

$\dot V_L\leq-\frac{\underline q}2\left\|e\right\|^2-\frac{\eta_c}8\left(\tilde W_c^T\psi\right)^2-\frac{\eta_{a12}}4\left\|\tilde W_a\right\|^2+\iota.$

This inequality is valid provided $Z(t)\in\mathcal Z$. Integrating and using Lemma 5.3 and the gain conditions yields

$V_L\left(Z(t+T),t+T\right)-V_L\left(Z(t),t\right)\leq-\frac18\eta_c\varpi_7\left\|\tilde W_c(t)\right\|^2-\frac{\underline q}4\int_t^{t+T}\left\|e\right\|^2d\tau+\frac18\eta_c\varpi_9T-\frac{\eta_{a12}}8\int_t^{t+T}\left\|\tilde W_a\right\|^2d\tau+\iota T,$

provided $Z(\tau)\in\mathcal Z$ for all $\tau\in[t,t+T]$. Using the facts that $-\int_t^{t+T}\left\|e\right\|^2d\tau\leq-T\inf_{\tau\in[t,t+T]}\left\|e(\tau)\right\|^2$ and $-\int_t^{t+T}\left\|\tilde W_a\right\|^2d\tau\leq-T\inf_{\tau\in[t,t+T]}\left\|\tilde W_a(\tau)\right\|^2$, and Lemma 5.2, yields

$V_L\left(Z(t+T),t+T\right)-V_L\left(Z(t),t\right)\leq-\frac{\eta_c\varpi_7}{16}\left\|\tilde W_c(t)\right\|^2-\frac{\varpi_3\eta_{a12}T}{16}\left\|\tilde W_a(t)\right\|^2+\varpi_{10}T-\frac{\varpi_0\underline qT}8\left\|e(t)\right\|^2,$

provided $Z(\tau)\in\mathcal Z$ for all $\tau\in[t,t+T]$. Thus, $V_L\left(Z(t+T),t+T\right)-V_L\left(Z(t),t\right)<0$ provided $\left\|Z(t)\right\|>\sqrt{\varpi_{10}T/\varpi_{11}}$ and $Z(\tau)\in\mathcal Z$ for all $\tau\in[t,t+T]$. The bounds on the Lyapunov function yield $V_L\left(Z(t+T),t+T\right)-V_L\left(Z(t),t\right)<0$ provided $V_L\left(Z(t),t\right)>\bar v_l\left(\sqrt{\varpi_{10}T/\varpi_{11}}\right)$ and $Z(\tau)\in\mathcal Z$ for all $\tau\in[t,t+T]$.

Since $Z(t_0)\in\mathcal Z$, the bound above can be used to conclude that $\dot V_L\left(Z(t_0),t_0\right)\leq\iota$. The sufficient condition ensures that $\underline v_l^{-1}\left(V_L\left(Z(t_0),t_0\right)+\iota T\right)\leq\bar r$; hence, $Z(t)\in\mathcal Z$ for all $t\in[t_0,t_0+T]$.


If $V_L\left(Z(t_0),t_0\right)>\bar v_l\left(\sqrt{\varpi_{10}T/\varpi_{11}}\right)$, then $Z(t)\in\mathcal Z$ for all $t\in[t_0,t_0+T]$ implies $V_L\left(Z(t_0+T),t_0+T\right)-V_L\left(Z(t_0),t_0\right)<0$; hence, $\underline v_l^{-1}\left(V_L\left(Z(t_0+T),t_0+T\right)+\iota T\right)\leq\bar r$. Thus, $Z(t)\in\mathcal Z$ for all $t\in[t_0+T,t_0+2T]$. Inductively, the system state is bounded such that $\sup_{t\in[0,\infty)}\left\|Z(t)\right\|\leq\bar r$ and ultimately bounded⁸ such that

$\limsup_{t\to\infty}\left\|Z(t)\right\|\leq\underline v_l^{-1}\left(\bar v_l\left(\sqrt{\frac{\varpi_{10}T}{\varpi_{11}}}\right)+\iota T\right).$

5.4 Simulation

Simulations are performed on a two-link manipulator to demonstrate the ability of the presented technique to approximately optimally track a desired trajectory. The two-link robot manipulator is modeled using Euler-Lagrange dynamics as

$M\ddot q+V_m\dot q+F_d\dot q+F_s=u,$

where $q=\left[q_1,q_2\right]^T$ and $\dot q=\left[\dot q_1,\dot q_2\right]^T$ are the angular positions (in radians) and the angular velocities (in radians/s), respectively. Here, $M\in\mathbb{R}^{2\times2}$ denotes the inertia matrix and $V_m\in\mathbb{R}^{2\times2}$ denotes the centripetal-Coriolis matrix, given by

$M\triangleq\begin{bmatrix}p_1+2p_3c_2&p_2+p_3c_2\\p_2+p_3c_2&p_2\end{bmatrix},\qquad V_m\triangleq\begin{bmatrix}-p_3s_2\dot q_2&-p_3s_2\left(\dot q_1+\dot q_2\right)\\p_3s_2\dot q_1&0\end{bmatrix},$

where $c_2=\cos(q_2)$, $s_2=\sin(q_2)$, $p_1=3.473\ \mathrm{kg\,m^2}$, $p_2=0.196\ \mathrm{kg\,m^2}$, and $p_3=0.242\ \mathrm{kg\,m^2}$, and $F_d=\operatorname{diag}(5.3,\,1.1)\ \mathrm{N\,m\,s}$ and $F_s(\dot q)=\left[8.45\tanh(\dot q_1),\ 2.35\tanh(\dot q_2)\right]^T\ \mathrm{N\,m}$ are the models for the dynamic and the static friction, respectively.

The objective is to find a policy $\hat\mu$ that ensures that the state $x\triangleq\left[q_1,q_2,\dot q_1,\dot q_2\right]^T$ tracks the desired trajectory $x_d(t)=\left[0.5\cos(2t),\ 0.33\cos(3t),\ -\sin(2t),\ -\sin(3t)\right]^T$,

⁸If the regressor $\psi$ satisfies a stronger u-PE assumption (cf. [155, 156]), the tracking error and the weight estimation errors can be shown to be uniformly ultimately bounded.


while minimizing the cost $\int_0^\infty\left(e^TQe+\hat\mu^T\hat\mu\right)dt$, where $Q=\operatorname{diag}(10,\,10,\,2,\,2)$. Using the transformation and the definitions

$f\triangleq\left[x_3,\ x_4,\ \left(M^{-1}\left(-V_m-F_d\right)\begin{bmatrix}x_3\\x_4\end{bmatrix}-M^{-1}F_s\right)^T\right]^T,\qquad h_d\triangleq\left[x_{d3},\ x_{d4},\ -4x_{d1},\ -9x_{d2}\right]^T,$

$g_d^+\triangleq\begin{bmatrix}0_{2\times2}&M(x_d)\end{bmatrix},\qquad g\triangleq\begin{bmatrix}0_{2\times2}\\M^{-1}\end{bmatrix},$

the optimal tracking problem can be transformed into the time-invariant form.

The two major challenges in the application of ADP to systems such as this manipulator are selecting an appropriate basis for the value function approximation and ensuring that the regressor $\psi$ introduced in Assumption 5.3 is PE. Due to the size of the state space and the complexity of the dynamics, obtaining an analytical solution to the HJB equation for this problem is prohibitively difficult. Furthermore, since the regressor is a complex nonlinear function of the states, it is difficult to ensure that it remains PE. As a result, this serves as a model problem to demonstrate the applicability of ADP-based approximate online optimal control.

In this effort, the basis selected for the value function approximation is a polynomial basis with 23 elements given by

$\sigma(\zeta)=\left[\zeta_1\zeta_2,\ \zeta_1^2,\ \zeta_2^2,\ \zeta_1\zeta_3,\ \zeta_1\zeta_4,\ \zeta_2\zeta_3,\ \zeta_2\zeta_4,\ \zeta_1^2\zeta_5^2,\ \zeta_1^2\zeta_6^2,\ \zeta_1^2\zeta_7^2,\ \zeta_1^2\zeta_8^2,\ \zeta_2^2\zeta_5^2,\ \zeta_2^2\zeta_6^2,\ \zeta_2^2\zeta_7^2,\ \zeta_2^2\zeta_8^2,\ \zeta_3^2\zeta_5^2,\ \zeta_3^2\zeta_6^2,\ \zeta_3^2\zeta_7^2,\ \zeta_3^2\zeta_8^2,\ \zeta_4^2\zeta_5^2,\ \zeta_4^2\zeta_6^2,\ \zeta_4^2\zeta_7^2,\ \zeta_4^2\zeta_8^2\right]^T.$

The control gains are selected as $\eta_{a1}=5$, $\eta_{a2}=0.001$, $\eta_c=1.25$, $\nu=0.001$, and $\lambda=0.005$, and the initial conditions are $x(0)=\left[1.8,\ 1.6,\ 0,\ 0\right]^T$, $\hat W_c(0)=10\cdot\mathbf 1_{23\times1}$, $\hat W_a(0)=6\cdot\mathbf 1_{23\times1}$, and $\Gamma(0)=2000\,I_{23}$.
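The drift dynamics $f$ used in this simulation can be assembled directly from the Euler-Lagrange model terms. The following is a minimal sketch of that transcription; the sample state passed at the end is hypothetical and used only to exercise the function.

```python
import numpy as np

P1, P2, P3 = 3.473, 0.196, 0.242          # inertia parameters, kg m^2
F_D = np.diag([5.3, 1.1])                 # dynamic (viscous) friction, N m s

def manipulator_f(x):
    """Drift f(x) = [x3, x4, (M^{-1}(-V_m - F_d)[x3, x4]^T - M^{-1} F_s)^T]^T."""
    q2, qd = x[1], x[2:]
    c2, s2 = np.cos(q2), np.sin(q2)
    M = np.array([[P1 + 2 * P3 * c2, P2 + P3 * c2],
                  [P2 + P3 * c2, P2]])
    Vm = np.array([[-P3 * s2 * qd[1], -P3 * s2 * (qd[0] + qd[1])],
                   [P3 * s2 * qd[0], 0.0]])
    Fs = np.array([8.45 * np.tanh(qd[0]), 2.35 * np.tanh(qd[1])])
    acc = np.linalg.solve(M, (-Vm - F_D) @ qd - Fs)
    return np.concatenate([qd, acc])

print(manipulator_f(np.array([1.8, 1.6, 0.0, 0.0])))
```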


Figure 5-1. State and error trajectories with probing signal.

To ensure PE, a probing signal consisting of tanh-modulated sums of sinusoids of mixed frequencies is added to the control signal for the first 30 seconds of the simulation [57].

It is clear from Figure 5-1 that the system states are bounded during the learning phase and that the algorithm converges to a stabilizing controller, in the sense that the tracking errors go to zero when the probing signal is eliminated. Furthermore, Figure 5-2 shows that the weight estimates for the value function and the policy are bounded and that they converge. Thus, Figures 5-1 and 5-2 demonstrate that an approximate optimal policy can be generated online to solve an optimal tracking problem using a simple polynomial basis and a probing signal that consists of a combination of sinusoidal signals.


Figure 5-2. Evolution of value function and policy weights.

The NN weights converge to the following values:

$\hat W_c=\hat W_a=[83.36,\ 2.37,\ 27.0,\ 2.78,\ -2.83,\ 0.20,\ 14.13,\ 29.81,\ 18.87,\ 4.11,\ 3.47,\ 6.69,\ 9.71,\ 15.58,\ 4.97,\ 12.42,\ 11.31,\ 3.29,\ 1.19,\ -1.99,\ 4.55,\ -0.47,\ 0.56]^T.$

Note that the last sixteen weights, which correspond to the terms containing the desired trajectories $\zeta_5,\ldots,\zeta_8$, are non-zero. Thus, the resulting value function $\hat V$ and the resulting policy $\hat\mu$ depend on the desired trajectory, and hence, are time-varying functions of the tracking error. Since the true weights are unknown, a direct comparison of the obtained weights with the true weights is not possible. Instead, to gauge the performance of the presented technique, the state and the control trajectories obtained using the estimated policy are compared with those obtained using Radau-pseudospectral numerical optimal control computed using the GPOPS software [7]. Since an accurate numerical solution is difficult to obtain for an infinite-horizon optimal control problem, the numerical optimal control problem is solved over a finite horizon ranging over approximately 5 times the settling time associated with the slowest state variable. Based on the solution obtained using the proposed technique, the slowest settling time is estimated to be approximately 20 seconds. Thus, to approximate the infinite-horizon solution, the numerical solution is computed over a 100 second time horizon using 300 collocation points.


Figure 5-3. Hamiltonian and costate of the numerical solution computed using GPOPS.

Figure 5-4. Control trajectories $\hat\mu(t)$ obtained from GPOPS and the developed technique.


Figure 5-5. Tracking error trajectories $e(t)$ obtained from GPOPS and the developed technique.

As seen in Figure 5-3, the Hamiltonian of the numerical solution is approximately zero. This supports the assertion that the optimal control problem is time-invariant. Furthermore, since the Hamiltonian is close to zero, the numerical solution obtained using GPOPS is sufficiently accurate as a benchmark to compare against the ADP-based solution obtained using the proposed technique. Note that in Figure 5-3 the costate variables corresponding to the desired trajectories are nonzero. Since these costate variables represent the sensitivity of the cost with respect to the desired trajectories, this further supports the assertion that the optimal value function depends on the desired trajectory, and hence, is a time-varying function of the tracking error.

Figures 5-4 and 5-5 show the control and the tracking error trajectories obtained from the developed technique (dashed lines) plotted alongside the numerical solution obtained using GPOPS (solid lines). The trajectories obtained using the developed technique are close to the numerical solution.


The inaccuracies are a result of the facts that the set of basis functions is not exact and that the proposed method attempts to find the weights that generate the least total cost for the given set of basis functions. The accuracy of the approximation can be improved by choosing a more appropriate set of basis functions, or, at an increased computational cost, by adding more basis functions to the existing set. The total cost $\int_0^{100}\left(e(t)^TQe(t)+\mu(t)^TR\mu(t)\right)dt$ obtained using the numerical solution is found to be 75.42, and the total cost $\int_0^\infty\left(e(t)^TQe(t)+\mu(t)^TR\mu(t)\right)dt$ obtained using the developed method is found to be 84.31. Note that, from Figures 5-4 and 5-5, it is clear that both the tracking error and the control converge to zero after approximately 20 seconds, and hence, the total cost obtained from the numerical solution is a good approximation of the infinite-horizon cost.

5.5 Concluding Remarks

An ADP-based approach using the policy evaluation and policy improvement architecture is presented to approximately solve the infinite-horizon optimal tracking problem for control-affine nonlinear systems with quadratic cost. The problem is solved by transforming the system to convert the tracking problem, which has a time-varying value function, into a time-invariant optimal control problem. The ultimately bounded tracking and estimation result was established using Lyapunov analysis for nonautonomous systems. Simulations are performed to demonstrate the applicability and the effectiveness of the developed method. The developed method can be applied to high-dimensional nonlinear dynamical systems using simple polynomial basis functions and sinusoidal probing signals. However, the accuracy of the approximation depends on the choice of basis functions, and the result hinges on the system states being PE. Furthermore, computation of the desired control requires exact model knowledge. The following chapter uses model-based RL ideas from Chapter 3 to relax the PE requirement and to allow for uncertainties in the system dynamics.


CHAPTER 6
MODEL-BASED REINFORCEMENT LEARNING FOR APPROXIMATE OPTIMAL TRACKING

In this chapter, the tracking controller developed in Chapter 5 is extended to solve infinite-horizon optimal tracking problems for control-affine continuous-time nonlinear systems with uncertain drift dynamics using model-based RL. In Chapter 5, model knowledge is used in the computation of the BE and in the computation of the steady-state control signal. In this chapter, a CL-based system identifier is used to simulate experience by evaluating the BE over unexplored areas of the state space. The system identifier is also utilized to approximate the steady-state control signal. A Lyapunov-based stability analysis is presented to establish simultaneous identification and trajectory tracking. Effectiveness of the developed technique is demonstrated via numerical simulations.

6.1 Problem Formulation and Exact Solution

Consider the concatenated nonlinear control-affine system described by the differential equation in Chapter 5. Similar to Chapter 5, the objective of the optimal control problem is to minimize the cost functional $J(\zeta,\mu)$, introduced in Chapter 2, subject to the dynamic constraints while tracking the desired trajectory. In this chapter, a more general form of the reward signal is considered. The reward signal $r:\mathbb{R}^{2n}\times\mathbb{R}^m\to\mathbb{R}$ is given by

$r(\zeta,\mu)\triangleq Q_\zeta(\zeta)+\mu^TR\mu,$

where the function $Q_\zeta:\mathbb{R}^{2n}\to\mathbb{R}$ is defined as

$Q_\zeta\left(\begin{bmatrix}e\\x_d\end{bmatrix}\right)\triangleq Q(e),\quad\forall x_d\in\mathbb{R}^n,$

where $Q:\mathbb{R}^n\to\mathbb{R}$ is a continuous positive definite function that satisfies

$\underline q\left(\left\|e^o\right\|\right)\leq Q(e^o)\leq\bar q\left(\left\|e^o\right\|\right),\quad\forall e^o\in\mathbb{R}^n,$


where $\underline q:\mathbb{R}\to\mathbb{R}$ and $\bar q:\mathbb{R}\to\mathbb{R}$ are class $\mathcal K$ functions.

Using the estimates $\hat V\left(\zeta,\hat W_c\right)$ and $\hat\mu\left(\zeta,\hat W_a\right)$, the BE can be obtained as

$\delta\left(\zeta,\hat W_c,\hat W_a\right)\triangleq\nabla\hat V\left(\zeta,\hat W_c\right)\left(F(\zeta)+G(\zeta)\hat\mu\left(\zeta,\hat W_a\right)\right)+r\left(\zeta,\hat\mu\left(\zeta,\hat W_a\right)\right).$

In this chapter, simulation of experience via BE extrapolation is used to improve data efficiency, based on the observation that if a dynamic system identifier is developed to generate an estimate $\hat F\left(\zeta,\hat\theta\right)$ of the drift dynamics $F(\zeta)$, an estimate of the BE can be evaluated at any $\zeta\in\mathbb{R}^{2n}$. That is, using $\hat F$, experience can be simulated by extrapolating the BE over unexplored off-trajectory points in the operating domain. Hence, if an identifier can be developed such that $\hat F$ approaches $F$ exponentially fast, learning laws for the optimal policy can utilize simulated experience along with experience gained and stored along the state trajectory.

If parametric approximators are used to approximate $F$, convergence of $\hat F$ to $F$ is implied by convergence of the parameters to their unknown ideal values. It is well known that adaptive system identifiers require PE to achieve parameter convergence. To relax the PE condition, a CL-based (cf. [92, 93, 97, 147]) system identifier that uses recorded data for learning is developed in the following section.

6.2 System Identification

On any compact set $\mathcal C\subset\mathbb{R}^n$, the function $f$ can be represented using a NN as

$f(x^o)=\theta^T\sigma_f\left(Y^Tx_1^o\right)+\epsilon_\theta(x^o),\quad\forall x^o\in\mathbb{R}^n,$

where $x_1^o\triangleq\left[1,\ x^{oT}\right]^T\in\mathbb{R}^{n+1}$, $\theta\in\mathbb{R}^{(p+1)\times n}$ and $Y\in\mathbb{R}^{(n+1)\times p}$ denote the unknown output-layer and hidden-layer NN weights, $\sigma_f:\mathbb{R}^p\to\mathbb{R}^{p+1}$ denotes a bounded NN basis function, $\epsilon_\theta:\mathbb{R}^n\to\mathbb{R}^n$ denotes the function reconstruction error, and $p\in\mathbb{N}$ denotes the number of NN neurons. Using the universal function approximation property of single layer NNs, given a constant matrix $Y$ such that the rows of $\sigma_f\left(Y^Tx_1\right)$ form a proper basis, there exist constant ideal weights $\theta$ and known constants $\bar\theta$, $\bar\epsilon_\theta$, and $\bar\epsilon_\theta'\in\mathbb{R}$ such


that $\|\theta\|_F \le \bar\theta < \infty$, $\sup_{x_o\in C}\|\epsilon_\theta(x_o)\| \le \bar\epsilon_\theta$, and $\sup_{x_o\in C}\|\nabla_{x_o}\epsilon_\theta(x_o)\| \le \bar\epsilon'_\theta$, where $\|\cdot\|_F$ denotes the Frobenius norm.

Using an estimate $\hat\theta\in\mathbb{R}^{(p+1)\times n}$ of the weight matrix $\theta$, the function $f$ can be approximated by the function $\hat{f}:\mathbb{R}^{2n}\times\mathbb{R}^{(p+1)\times n}\to\mathbb{R}^n$ defined as $\hat{f}(\zeta,\hat\theta) \triangleq \hat\theta^T\sigma_\theta(\zeta)$, where $\sigma_\theta:\mathbb{R}^{2n}\to\mathbb{R}^{p+1}$ is defined as $\sigma_\theta(\zeta) = \sigma_f\left(Y^T\left[1,\,(e+x_d)^T\right]^T\right)$. Based on this approximation, an estimator for online identification of the drift dynamics is developed as

$\dot{\hat{x}} = \hat\theta^T\sigma_\theta(\zeta) + g(x)u + k\tilde{x},$

where $\tilde{x} \triangleq x - \hat{x}$, and $k\in\mathbb{R}$ is a positive constant learning gain. The following assumption facilitates CL-based system identification.

Assumption 6.1. [92] A history stack containing recorded state-action pairs $\{x_j, u_j\}_{j=1}^M$ along with numerically computed state derivatives $\{\dot{\bar{x}}_j\}_{j=1}^M$ that satisfies

$\lambda_{\min}\left(\sum_{j=1}^M \sigma_{fj}\sigma_{fj}^T\right) = \underline{\sigma} > 0, \qquad \left\|\dot{\bar{x}}_j - \dot{x}_j\right\| < \bar{d}, \quad \forall j,$

is available a priori. Here, $\sigma_{fj} \triangleq \sigma_f\left(Y^T\left[1,\, x_j^T\right]^T\right)$, $\bar{d}\in\mathbb{R}$ is a known positive constant, and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue.

The weight estimates $\hat\theta$ are updated using the following CL-based update law:

$\dot{\hat\theta} = \Gamma_\theta\sigma_f\left(Y^T x_1\right)\tilde{x}^T + k_\theta\Gamma_\theta\sum_{j=1}^M \sigma_{fj}\left(\dot{\bar{x}}_j - g_j u_j - \hat\theta^T\sigma_{fj}\right)^T,$

where $k_\theta\in\mathbb{R}$ is a constant positive CL gain, and $\Gamma_\theta\in\mathbb{R}^{(p+1)\times(p+1)}$ is a constant, diagonal, and positive definite adaptation gain matrix.
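To make the structure of the CL-based identifier concrete, the following Python sketch integrates the estimator and the weight update law above with a forward-Euler step. All names, gains, and the placeholder history stack are illustrative assumptions, not part of the text.

```python
import numpy as np

def cl_identifier_step(theta_hat, x, x_hat, u, g, sigma_f, history, Gamma, k, k_theta, dt):
    """One Euler step of the CL-based identifier (a sketch; basis and gains are illustrative).

    theta_hat: (p+1, n) output-layer weight estimate
    history:   list of (sigma_fj, xdot_j, g_j, u_j) tuples recorded online
    """
    x_tilde = x - x_hat
    sigma = sigma_f(x)                               # regressor at the current state
    # State estimator: x_hat_dot = theta_hat^T sigma + g(x) u + k x_tilde
    x_hat_dot = theta_hat.T @ sigma + g(x) @ u + k * x_tilde
    # Instantaneous (gradient) term plus the concurrent-learning history-stack term
    theta_dot = Gamma @ np.outer(sigma, x_tilde)
    for sigma_j, xdot_j, g_j, u_j in history:
        resid = xdot_j - g_j @ u_j - theta_hat.T @ sigma_j   # prediction error on stored data
        theta_dot += k_theta * Gamma @ np.outer(sigma_j, resid)
    return theta_hat + dt * theta_dot, x_hat + dt * x_hat_dot
```

The history-stack sum is what removes the PE requirement: even when the current regressor is uninformative, the stored data keep the parameter update excited.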


To facilitate the subsequent stability analysis, a candidate Lyapunov function $V_0:\mathbb{R}^n\times\mathbb{R}^{(p+1)\times n}\to\mathbb{R}$ is selected as

$V_0(\tilde{x},\tilde\theta) \triangleq \frac{1}{2}\tilde{x}^T\tilde{x} + \frac{1}{2}\mathrm{tr}\left(\tilde\theta^T\Gamma_\theta^{-1}\tilde\theta\right),$

where $\tilde\theta \triangleq \theta - \hat\theta$ and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Using the identifier and the update law above, the following bound on the time derivative of $V_0$ is established:

$\dot{V}_0 \le -k\|\tilde{x}\|^2 - k_\theta\underline{\sigma}\|\tilde\theta\|_F^2 + \bar\epsilon_\theta\|\tilde{x}\| + k_\theta d_\theta\|\tilde\theta\|_F,$

where $d_\theta \triangleq \bar{d}\sum_{j=1}^M\|\sigma_{fj}\| + \sum_{j=1}^M\|\epsilon_{\theta j}\|\|\sigma_{fj}\|$. Using these bounds, a Lyapunov-based stability analysis can be used to show that $\hat\theta$ converges exponentially to a neighborhood around $\theta$.

Using the identified drift dynamics, the BE can be approximated as

$\hat\delta(\zeta,\hat{W}_c,\hat{W}_a,\hat\theta) \triangleq \nabla_\zeta\hat{V}(\zeta,\hat{W}_c)\left(F_\theta(\zeta,\hat\theta) + F_1(\zeta) + G(\zeta)\hat{\mu}(\zeta,\hat{W}_a)\right) + Q(\zeta) + \hat{\mu}^T(\zeta,\hat{W}_a)R\hat{\mu}(\zeta,\hat{W}_a),$

where

$F_\theta(\zeta,\hat\theta) \triangleq \left[\begin{array}{c}\hat\theta^T\sigma_\theta(\zeta) - g(x)g^+(x_d)\hat\theta^T\sigma_\theta\left(\left[0_{n\times 1}^T,\, x_d^T\right]^T\right)\\ 0_{n\times 1}\end{array}\right],$

and $F_1(\zeta) \triangleq \left[\left(-h_d + g(e+x_d)g^+(x_d)h_d\right)^T,\; h_d^T\right]^T$. The optimal tracking problem is thus reformulated as the need to find estimates $\hat\theta$ and $\hat{V}$ online, to minimize the total integral error

$\hat{E}_{\hat\theta}(\hat{W}_c,\hat{W}_a) \triangleq \int_{\zeta\in\mathbb{R}^{2n}}\hat\delta^2(\zeta,\hat{W}_c,\hat{W}_a,\hat\theta)\,d\zeta,$
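Because the approximate BE above depends on $\zeta$ only through known or identified quantities, it can be evaluated at any point of the operating domain. The following minimal sketch shows such an evaluation; every callable argument is a hypothetical stand-in for the corresponding function defined in the text.

```python
import numpy as np

def approx_bellman_error(zeta, W_c, W_a, theta_hat, basis_grad, F_theta, F_1, G, Q, R):
    """Evaluate the approximate BE at an arbitrary point zeta (a sketch).

    basis_grad(zeta): (L, 2n) Jacobian of the value-function basis sigma.
    F_theta, F_1, G, Q: stand-ins for the identified drift, residual drift,
    control effectiveness, and state penalty defined in the text.
    """
    grad_V = W_c @ basis_grad(zeta)                     # row vector W_c^T grad(sigma)
    mu = -0.5 * np.linalg.solve(R, G(zeta).T @ basis_grad(zeta).T @ W_a)
    zeta_dot = F_theta(zeta, theta_hat) + F_1(zeta) + G(zeta) @ mu
    return grad_V @ zeta_dot + Q(zeta) + mu @ R @ mu
```

Since `zeta` is arbitrary, the same routine serves both the on-trajectory BE and the extrapolated BEs used for simulation of experience.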


for a given $\hat\theta$, while simultaneously improving $\hat\theta$ using the CL-based update law, and ensuring stability of the system using the control law

$u = \hat{\mu} + \hat{u}_d,$

where $\hat{u}_d(\zeta,\hat\theta) \triangleq g_d^+\left(h_d - \hat\theta^T\sigma_{\theta d}\right)$, and $\sigma_{\theta d} \triangleq \sigma_\theta\left(\left[0_{1\times n},\, x_d^T\right]^T\right)$.

6.3 Value Function Approximation

Since $V^*$ and $\mu^*$ are functions of the state $\zeta$, the minimization problem stated in Section 6.2 is infinite-dimensional, and hence, intractable. To obtain a finite-dimensional minimization problem, the optimal value function is represented over any compact operating domain $C\subset\mathbb{R}^{2n}$ using a NN as

$V^*(\zeta_o) = W^T\sigma(\zeta_o) + \epsilon(\zeta_o), \quad \forall \zeta_o\in\mathbb{R}^{2n},$

where $W\in\mathbb{R}^L$ denotes a vector of unknown NN weights, $\sigma:\mathbb{R}^{2n}\to\mathbb{R}^L$ denotes a bounded NN basis function, $\epsilon:\mathbb{R}^{2n}\to\mathbb{R}$ denotes the function reconstruction error, and $L\in\mathbb{N}$ denotes the number of NN neurons. Using the universal function approximation property of single layer NNs, there exist constant ideal weights $W$ and known constants $\bar{W}$, $\bar\epsilon$, and $\bar\epsilon'\in\mathbb{R}$ such that $\|W\| \le \bar{W} < \infty$, $\sup_{\zeta_o\in C}\|\epsilon(\zeta_o)\| \le \bar\epsilon$, and $\sup_{\zeta_o\in C}\|\nabla\epsilon(\zeta_o)\| \le \bar\epsilon'$.

Using the NN representation of the value function, a NN representation of the optimal policy is obtained as

$\mu^*(\zeta_o) = -\frac{1}{2}R^{-1}G^T(\zeta_o)\left(\nabla\sigma^T(\zeta_o)W + \nabla\epsilon^T(\zeta_o)\right), \quad \forall \zeta_o\in\mathbb{R}^{2n}.$

Using estimates $\hat{W}_c$ and $\hat{W}_a$ for the ideal weights $W$, the optimal value function and the optimal policy are approximated as

$\hat{V}(\zeta,\hat{W}_c) \triangleq \hat{W}_c^T\sigma(\zeta), \qquad \hat{\mu}(\zeta,\hat{W}_a) \triangleq -\frac{1}{2}R^{-1}G^T(\zeta)\nabla\sigma^T(\zeta)\hat{W}_a.$


The optimal control problem is thus reformulated as the need to find a set of weights $\hat{W}_c$ and $\hat{W}_a$ online, to minimize the integral of the instantaneous squared error

$\hat{E}_{\hat\theta}(\hat{W}_c,\hat{W}_a) \triangleq \int_{\zeta\in\mathbb{R}^{2n}}\hat\delta^2(\zeta,\hat{W}_c,\hat{W}_a,\hat\theta)\,d\zeta,$

for a given $\hat\theta$, while simultaneously improving $\hat\theta$ using the CL-based update law, and ensuring stability of the system using the control law

$u = \hat{\mu}(\zeta,\hat{W}_a) + \hat{u}_d(\zeta,\hat\theta).$

Using the definitions of $\hat{\mu}$ and $\hat{u}_d$, the virtual controller $\mu$ for the concatenated system can be expressed as

$\mu = \hat{\mu}(\zeta,\hat{W}_a) + g_d^+\tilde\theta^T\sigma_{\theta d} + g_d^+\epsilon_{\theta d},$

where $\epsilon_{\theta d} \triangleq \epsilon_\theta(x_d)$.

6.4 Simulation of Experience

Since computation of the integral $\hat{E}_{\hat\theta}$ is, in general, intractable, simulation of experience is implemented by approximation of the integral with a summation over finitely many points in the state space. The following assumption facilitates the aforementioned approximation.

Assumption 6.2. [97] There exists a finite set of points $\{\zeta_i\in C \mid i=1,\dots,N\}$ such that

$0 < c \triangleq \frac{1}{N}\inf_{t\in\mathbb{R}_{\ge t_0}}\lambda_{\min}\left(\sum_{i=1}^N\frac{\omega_i(t)\,\omega_i^T(t)}{\rho_i(t)}\right),$

where $\omega_i$ and $\rho_i$ denote the regressor $\omega$ and the normalization term $\rho$, defined in the update laws below, evaluated at $\zeta = \zeta_i$.

Using Assumption 6.2, simulation of experience is implemented by the weight update laws

$\dot{\hat{W}}_c = -\eta_{c1}\Gamma\frac{\omega}{\rho}\hat\delta_t - \frac{\eta_{c2}}{N}\Gamma\sum_{i=1}^N\frac{\omega_i}{\rho_i}\hat\delta_{ti},$

$\dot\Gamma = \left(\beta\Gamma - \eta_{c1}\Gamma\frac{\omega\omega^T}{\rho^2}\Gamma\right)\mathbb{1}_{\{\|\Gamma\|\le\bar\Gamma\}}, \qquad \|\Gamma(t_0)\| \le \bar\Gamma,$

$\dot{\hat{W}}_a = -\eta_{a1}\left(\hat{W}_a - \hat{W}_c\right) - \eta_{a2}\hat{W}_a + \left(\frac{\eta_{c1}G_\sigma^T\hat{W}_a\omega^T}{4\rho} + \sum_{i=1}^N\frac{\eta_{c2}G_{\sigma i}^T\hat{W}_a\omega_i^T}{4N\rho_i}\right)\hat{W}_c,$

where $\omega \triangleq \nabla\sigma\left(F_\theta + F_1 + G\hat{\mu}\right)$, $\Gamma\in\mathbb{R}^{L\times L}$ is the least-squares gain matrix, $\bar\Gamma\in\mathbb{R}$ denotes a positive saturation constant, $\beta\in\mathbb{R}$ denotes the forgetting factor, $\eta_{c1},\eta_{c2},\eta_{a1},\eta_{a2}\in\mathbb{R}$ denote constant positive adaptation gains, $\mathbb{1}_{\{\cdot\}}$ denotes the indicator function of the set $\{\cdot\}$, $G_\sigma \triangleq \nabla\sigma\, G R^{-1}G^T\nabla\sigma^T$, and $\rho \triangleq 1 + \nu\omega^T\Gamma\omega$, where $\nu\in\mathbb{R}$ is a positive normalization constant. In the update laws above and in the subsequent development, for any function $\xi(\zeta,\cdot)$, the notation $\xi_i$ is defined as $\xi_i \triangleq \xi(\zeta_i,\cdot)$, and the instantaneous BEs $\hat\delta_t$ and $\hat\delta_{ti}$ are defined as $\hat\delta_t(t) \triangleq \hat\delta(\zeta(t),\hat{W}_c(t),\hat{W}_a(t),\hat\theta(t))$ and $\hat\delta_{ti}(t) \triangleq \hat\delta(\zeta_i,\hat{W}_c(t),\hat{W}_a(t),\hat\theta(t))$. The saturated least-squares update law ensures that there exist positive constants $\underline\Gamma,\bar\Gamma\in\mathbb{R}$ such that $\underline\Gamma \le \|\Gamma(t)\| \le \bar\Gamma$, $\forall t\in\mathbb{R}$.

6.5 Stability Analysis

If the state penalty function $Q$ is positive definite, then the optimal value function $V^*$ is positive definite, and serves as a Lyapunov function for the concatenated system under the optimal control policy $\mu^*$; hence, $V^*$ is used (cf. [57, 59, 145]) as a candidate Lyapunov function for the closed-loop system under the policy $\hat{\mu}$. Based on the definition in this chapter, the function $Q$, and hence, the function $V^*$, are positive semidefinite; hence, the function $V^*$ is not a valid candidate Lyapunov function. However, the results in Chapter 5 can
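The following sketch assembles the critic, gain, and actor updates above into one integration step. It is a minimal illustration under stated assumptions: the extrapolated actor coupling terms are collapsed to the on-trajectory term for brevity, and all arguments are placeholders mirroring the quantities defined in the text.

```python
import numpy as np

def acl_update_step(W_c, W_a, Gamma, omega, rho, delta_t, omegas, rhos, deltas,
                    eta_c1, eta_c2, eta_a1, eta_a2, beta, Gamma_bar, G_sigma, dt):
    """One Euler step of the critic/actor update laws (sketch)."""
    N = len(omegas)
    # Critic: instantaneous BE plus the N extrapolated BEs
    W_c_dot = -eta_c1 * Gamma @ omega * (delta_t / rho)
    for om_i, rho_i, del_i in zip(omegas, rhos, deltas):
        W_c_dot -= (eta_c2 / N) * Gamma @ om_i * (del_i / rho_i)
    # Saturated least-squares gain update with forgetting factor beta
    Gamma_dot = beta * Gamma - eta_c1 * Gamma @ np.outer(omega, omega) @ Gamma / rho**2
    if np.linalg.norm(Gamma) > Gamma_bar:          # indicator 1_{||Gamma|| <= Gamma_bar}
        Gamma_dot = np.zeros_like(Gamma)
    # Actor: driven toward the critic estimate, plus the stabilizing coupling term
    W_a_dot = (-eta_a1 * (W_a - W_c) - eta_a2 * W_a
               + (eta_c1 / (4 * rho)) * G_sigma.T @ W_a * (omega @ W_c))
    return W_c + dt * W_c_dot, W_a + dt * W_a_dot, Gamma + dt * Gamma_dot
```

The extrapolated sum plays the role that PE plays in classical adaptive critics: Assumption 6.2 guarantees the summed regressors are jointly exciting even when the trajectory itself is not.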


be used to show that a nonautonomous form of the optimal value function, denoted by $V_t^*:\mathbb{R}^n\times\mathbb{R}\to\mathbb{R}$ and defined as

$V_t^*(e,t) \triangleq V^*\left(\left[e^T,\, x_d^T(t)\right]^T\right), \quad \forall e\in\mathbb{R}^n,\; t\in\mathbb{R},$

is positive definite and decrescent. Hence, $V_t^*(0,t) = 0$, $\forall t\in\mathbb{R}$, and there exist class $\mathcal{K}$ functions $\underline{v}:\mathbb{R}\to\mathbb{R}$ and $\overline{v}:\mathbb{R}\to\mathbb{R}$ such that

$\underline{v}(\|e_o\|) \le V_t^*(e_o,t) \le \overline{v}(\|e_o\|),$

for all $e_o\in\mathbb{R}^n$ and for all $t\in\mathbb{R}$.

To facilitate the stability analysis, a concatenated state $Z\in\mathbb{R}^{2n+2L+n(p+1)}$ is defined as

$Z \triangleq \left[e^T,\; \tilde{W}_c^T,\; \tilde{W}_a^T,\; \tilde{x}^T,\; \mathrm{vec}(\tilde\theta)^T\right]^T,$

and a candidate Lyapunov function is defined as

$V_L(Z,t) \triangleq V_t^*(e,t) + \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\tilde{W}_c + \frac{1}{2}\tilde{W}_a^T\tilde{W}_a + V_0(\tilde{x},\tilde\theta),$

where $\mathrm{vec}(\cdot)$ denotes the vectorization operator and $V_0$ is the identifier Lyapunov function defined above. Using the bounds on $\Gamma$ and $V_t^*$, and the fact that $\mathrm{tr}\left(\tilde\theta^T\Gamma_\theta^{-1}\tilde\theta\right) = \mathrm{vec}(\tilde\theta)^T\left(I_n\otimes\Gamma_\theta^{-1}\right)\mathrm{vec}(\tilde\theta)$, the candidate Lyapunov function can be bounded as

$\underline{v}_l(\|Z_o\|) \le V_L(Z_o,t) \le \overline{v}_l(\|Z_o\|),$

for all $Z_o\in\mathbb{R}^{2n+2L+n(p+1)}$ and for all $t\in\mathbb{R}$, where $\underline{v}_l:\mathbb{R}\to\mathbb{R}$ and $\overline{v}_l:\mathbb{R}\to\mathbb{R}$ are class $\mathcal{K}$ functions.

To facilitate the stability analysis, given any compact ball $\chi\subset\mathbb{R}^{2n+2L+n(p+1)}$ of radius $r\in\mathbb{R}$ centered at the origin, a positive constant $\iota\in\mathbb{R}$ is defined as


$\iota \triangleq \frac{3(\eta_{c1}+\eta_{c2})\bar{W}^2\|G_\sigma\|}{16\sqrt{\nu}} + \frac{1}{4}\left\|W^T G_\sigma + \nabla\epsilon\, G_r\nabla\sigma^T\right\| + \frac{\eta_{a2}^2\bar{W}^2}{2(\eta_{a1}+\eta_{a2})} + \frac{1}{2}\left\|\nabla\epsilon\, G_r\nabla\epsilon^T\right\| + \frac{1}{2}\left\|W^T\nabla\sigma\, G_r\nabla\epsilon^T\right\| + \left\|W^T\nabla\sigma\, G g_d^+\epsilon_{\theta d}\right\| + \left\|\nabla\epsilon\, G g_d^+\epsilon_{\theta d}\right\| + \frac{3\left(\left\|\left(W^T\nabla\sigma + \nabla\epsilon\right)G g_d^+\right\|\|\sigma_{\theta d}\| + k_\theta d_\theta\right)^2}{4k_\theta\underline{\sigma}} + \frac{(\eta_{c1}+\eta_{c2})^2\|\Delta\|^2}{4\eta_{c2}c} + \frac{\bar\epsilon_\theta^2}{2k},$

where $G_r \triangleq G R^{-1}G^T$, $G_\sigma \triangleq \nabla\sigma\, G_r\nabla\sigma^T$, and $\Delta$ denotes the residual term defined in the proof of Theorem 6.1. A class $\mathcal{K}$ function $v_l:\mathbb{R}\to\mathbb{R}$ is defined as

$v_l(\|Z\|) \triangleq \frac{\underline{q}(\|e\|)}{2} + \frac{\eta_{c2}c}{8}\left\|\tilde{W}_c\right\|^2 + \frac{\eta_{a1}+\eta_{a2}}{6}\left\|\tilde{W}_a\right\|^2 + \frac{k}{4}\|\tilde{x}\|^2 + \frac{k_\theta\underline{\sigma}}{6}\left\|\mathrm{vec}(\tilde\theta)\right\|^2.$

The sufficient gain conditions used in the subsequent Theorem 6.1 are

$v_l^{-1}(\iota) < \overline{v}_l^{-1}\left(\underline{v}_l(r)\right),$

$\eta_{c2}c > \frac{3(\eta_{c2}+\eta_{c1})^2\bar{W}^2\|\nabla\sigma\|^2\bar{g}_\theta^2}{4k_\theta\underline{\sigma}},$

$\eta_{a1}+\eta_{a2} > \frac{3(\eta_{c1}+\eta_{c2})\bar{W}\|G_\sigma\|}{8\sqrt{\nu}} + \frac{3}{c\,\eta_{c2}}\left(\frac{(\eta_{c1}+\eta_{c2})\bar{W}\|G_\sigma\|}{8\sqrt{\nu}} + \eta_{a1}\right)^2.$

In the expressions above, for any function $\varpi:\mathbb{R}^l\to\mathbb{R}$, $l\in\mathbb{N}$, the notation $\|\varpi\|$ denotes $\sup_{y\in\chi}\|\varpi(y)\|$, and $\bar{g}_\theta \triangleq \|\sigma_\theta\| + \left\|g g_d^+\right\|\|\sigma_{\theta d}\|$.

The sufficient condition on the radius requires the set $\chi$ to be large enough based on the constant $\iota$. Since the NN approximation errors depend on the compact set $\chi$, in general, for a fixed number of NN neurons, the constant $\iota$ increases with the size of the set $\chi$. However, for a fixed set $\chi$, the constant $\iota$ decreases with decreasing function reconstruction errors, i.e., with an increasing number of NN neurons. Hence, a sufficiently large number of NN neurons is required to satisfy the condition.

Theorem 6.1. Provided Assumptions 5.2-6.2 hold, and the control gains are selected based on the sufficient conditions above, the controller $u = \hat{\mu} + \hat{u}_d$, along with the weight update laws


for $\hat{W}_c$, $\Gamma$, and $\hat{W}_a$, and the identifier along with the CL-based update law for $\hat\theta$, ensure that the system states remain bounded, the tracking error is uniformly ultimately bounded, and that the control policy $\hat{\mu}$ converges to a neighborhood around the optimal control policy $\mu^*$.

Proof. Using the fact that $\dot{V}_t^*(e(t),t) = \dot{V}^*(\zeta(t))$, $\forall t\in\mathbb{R}$, the time-derivative of the candidate Lyapunov function is

$\dot{V}_L = \nabla_\zeta V^*(F + G\mu) - \tilde{W}_c^T\Gamma^{-1}\dot{\hat{W}}_c - \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\dot\Gamma\Gamma^{-1}\tilde{W}_c - \tilde{W}_a^T\dot{\hat{W}}_a + \dot{V}_0 + \nabla_\zeta V^* G\mu^* - \nabla_\zeta V^* G\mu^*.$

Using the HJB equation from Chapter 5 and the expression for the virtual controller $\mu$, the expression above is bounded as

$\dot{V}_L \le -Q - \tilde{W}_c^T\Gamma^{-1}\dot{\hat{W}}_c - \frac{1}{2}\tilde{W}_c^T\Gamma^{-1}\dot\Gamma\Gamma^{-1}\tilde{W}_c - \tilde{W}_a^T\dot{\hat{W}}_a + \dot{V}_0 + \frac{1}{2}\left(W^T G_\sigma + \nabla\epsilon\, G_r\nabla\sigma^T\right)\tilde{W}_a + W^T\nabla\sigma\, G g_d^+\tilde\theta^T\sigma_{\theta d} + \nabla\epsilon\, G g_d^+\tilde\theta^T\sigma_{\theta d} + \frac{1}{2}\nabla\epsilon\, G_r\nabla\epsilon^T + \frac{1}{2}W^T\nabla\sigma\, G_r\nabla\epsilon^T + W^T\nabla\sigma\, G g_d^+\epsilon_{\theta d} - \mu^{*T}R\mu^* + \nabla\epsilon\, G g_d^+\epsilon_{\theta d}.$

The approximate BE is expressed in terms of the weight estimation errors as

$\hat\delta_t = -\omega^T\tilde{W}_c - W^T\nabla\sigma\tilde{F}_\theta + \frac{1}{4}\tilde{W}_a^T G_\sigma\tilde{W}_a + \Delta,$

where $\tilde{F}_\theta \triangleq F_\theta(\zeta,\tilde\theta)$ and $\Delta = O\left(\bar\epsilon,\overline{\nabla\epsilon},\bar\epsilon_\theta\right)$. Using this expression, the bound on $\dot{V}_0$, and the update laws, the bound on $\dot{V}_L$ becomes

$\dot{V}_L \le -Q - \sum_{i=1}^N\frac{\eta_{c2}}{N}\tilde{W}_c^T\frac{\omega_i\omega_i^T}{\rho_i}\tilde{W}_c - k_\theta\underline{\sigma}\left\|\tilde\theta\right\|_F^2 - (\eta_{a1}+\eta_{a2})\tilde{W}_a^T\tilde{W}_a - k\|\tilde{x}\|^2 - \eta_{c1}\tilde{W}_c^T\frac{\omega}{\rho}W^T\nabla\sigma\tilde{F}_\theta + \eta_{a1}\tilde{W}_a^T\tilde{W}_c + \eta_{a2}\tilde{W}_a^T W + \frac{\eta_{c1}}{4}\tilde{W}_c^T\frac{\omega}{\rho}\tilde{W}_a^T G_\sigma\tilde{W}_a - \sum_{i=1}^N\frac{\eta_{c2}}{N}\tilde{W}_c^T\frac{\omega_i}{\rho_i}W^T\nabla\sigma_i\tilde{F}_{\theta i} + \sum_{i=1}^N\frac{\eta_{c2}}{4N}\tilde{W}_c^T\frac{\omega_i}{\rho_i}\tilde{W}_a^T G_{\sigma i}\tilde{W}_a + \frac{\eta_{c2}}{N}\tilde{W}_c^T\sum_{i=1}^N\frac{\omega_i}{\rho_i}\Delta_i - \tilde{W}_a^T\left(\frac{\eta_{c1}G_\sigma^T\hat{W}_a\omega^T}{4\rho} + \sum_{i=1}^N\frac{\eta_{c2}G_{\sigma i}^T\hat{W}_a\omega_i^T}{4N\rho_i}\right)\hat{W}_c + \bar\epsilon_\theta\|\tilde{x}\| + k_\theta d_\theta\left\|\tilde\theta\right\|_F + \frac{1}{2}\left(W^T G_\sigma + \nabla\epsilon\, G_r\nabla\sigma^T\right)\tilde{W}_a + W^T\nabla\sigma\, G g_d^+\tilde\theta^T\sigma_{\theta d} + \nabla\epsilon\, G g_d^+\tilde\theta^T\sigma_{\theta d} + \frac{1}{2}\nabla\epsilon\, G_r\nabla\epsilon^T + \eta_{c1}\tilde{W}_c^T\frac{\omega}{\rho}\Delta + \frac{1}{2}W^T\nabla\sigma\, G_r\nabla\epsilon^T + W^T\nabla\sigma\, G g_d^+\epsilon_{\theta d} + \nabla\epsilon\, G g_d^+\epsilon_{\theta d}.$


Segregation of terms, completion of squares, and the use of Young's inequalities yields

$\dot{V}_L \le -\underline{q}(\|e\|) - \frac{\eta_{c2}c}{4}\left\|\tilde{W}_c\right\|^2 - \frac{\eta_{a1}+\eta_{a2}}{3}\left\|\tilde{W}_a\right\|^2 - \frac{k}{2}\|\tilde{x}\|^2 - \frac{k_\theta\underline{\sigma}}{3}\left\|\tilde\theta\right\|_F^2 - \left(\frac{\eta_{c2}c}{4} - \frac{3(\eta_{c2}+\eta_{c1})^2\bar{W}^2\|\nabla\sigma\|^2\bar{g}_\theta^2}{16k_\theta\underline{\sigma}}\right)\left\|\tilde{W}_c\right\|^2 - \left(\frac{\eta_{a1}+\eta_{a2}}{3} - \frac{(\eta_{c1}+\eta_{c2})\bar{W}\|G_\sigma\|}{8\sqrt{\nu}} - \frac{1}{c\,\eta_{c2}}\left(\frac{(\eta_{c1}+\eta_{c2})\bar{W}\|G_\sigma\|}{8\sqrt{\nu}} + \eta_{a1}\right)^2\right)\left\|\tilde{W}_a\right\|^2 + \frac{3\left(\left\|\left(W^T\nabla\sigma + \nabla\epsilon\right)G g_d^+\right\|\|\sigma_{\theta d}\| + k_\theta d_\theta\right)^2}{4k_\theta\underline{\sigma}} + \frac{3(\eta_{c1}+\eta_{c2})\bar{W}^2\|G_\sigma\|}{16\sqrt{\nu}} + \frac{\left\|W^T G_\sigma + \nabla\epsilon\, G_r\nabla\sigma^T\right\|}{4} + \frac{\eta_{a2}^2\|W\|^2}{2(\eta_{a1}+\eta_{a2})} + \frac{(\eta_{c1}+\eta_{c2})^2\|\Delta\|^2}{4\eta_{c2}c} + \frac{\bar\epsilon_\theta^2}{2k} + \frac{1}{2}\nabla\epsilon\, G_r\nabla\epsilon^T + \frac{1}{2}W^T\nabla\sigma\, G_r\nabla\epsilon^T + W^T\nabla\sigma\, G g_d^+\epsilon_{\theta d} + \nabla\epsilon\, G g_d^+\epsilon_{\theta d},$

for all $Z\in\chi$. Provided the sufficient conditions are satisfied, the expression above yields

$\dot{V}_L \le -v_l(\|Z\|), \quad \forall \|Z\| \ge v_l^{-1}(\iota),$

for all $Z\in\chi$. Using the bounds on $V_L$, Theorem 4.18 in [149] can be invoked to conclude that every trajectory $Z(t)$ satisfying $\|Z(t_0)\| \le \overline{v}_l^{-1}\left(\underline{v}_l(r)\right)$ is bounded for all $t\in\mathbb{R}$ and satisfies

$\limsup_{t\to\infty}\|Z(t)\| \le \underline{v}_l^{-1}\left(\overline{v}_l\left(v_l^{-1}(\iota)\right)\right).$

6.6 Simulation

6.6.1 Nonlinear System

The effectiveness of the developed technique is demonstrated via numerical simulation on a nonlinear system of the form considered in Chapter 5, where

$f = \left[\begin{array}{ccc}\theta_1 & \theta_2 & \theta_3\\ \theta_4 & \theta_5 & \theta_6\end{array}\right]\left[\begin{array}{c}x_1\\ x_2\\ x_2\left(\cos(2x_1)+2\right)\end{array}\right], \qquad g = \left[\begin{array}{c}0\\ \cos(2x_1)+2\end{array}\right].$


The ideal values of the unknown parameters are $\theta_1 = -1$, $\theta_2 = 1$, $\theta_3 = 0$, $\theta_4 = -0.5$, $\theta_5 = 0$, and $\theta_6 = -0.5$. The control objective is to follow a desired trajectory, which is the solution of the initial value problem

$\dot{x}_d = \left[\begin{array}{cc}-1 & 1\\ -2 & 1\end{array}\right]x_d, \qquad x_d(0) = \left[\begin{array}{c}0\\ 2\end{array}\right],$

while ensuring convergence of the estimated policy $\hat{\mu}$ to a neighborhood of the policy $\mu^*$, such that the control law $\mu(t) = \mu^*(\zeta(t))$ minimizes the cost $\int_0^\infty\left(e^T(t)\,\mathrm{diag}([10,10])\,e(t) + \mu(t)^2\right)dt$, subject to the dynamic constraint.

The value function is approximated using the polynomial basis $\sigma(\zeta) = \left[e_1^2,\, e_2^2,\, e_1^2x_{d1}^2,\, e_2^2x_{d2}^2,\, e_2^2x_{d1}^2,\, e_1^2x_{d2}^2,\, e_1e_2\right]^T$, and the unknown drift dynamics are approximated using the basis $\sigma_\theta(x) = \left[x_1,\, x_2,\, x_2\left(\cos(2x_1)+2\right)\right]^T$. Learning gains for system identification and value function approximation are selected as

$\eta_{c1} = 0.1,\; \eta_{c2} = 2.5,\; \eta_{a1} = 1,\; \eta_{a2} = 0.01,\; \beta = 0.3,\; \nu = 0.005,\; \bar\Gamma = 100000,\; k = 500,\; \Gamma_\theta = I_3,\; \Gamma(0) = 5000\, I_9,\; k_\theta = 20.$

To implement BE extrapolation, error values $\{\zeta_i\}_{i=1}^{81}$ are selected to be uniformly spaced over a $2\times2\times2\times2$ hypercube centered at the origin. The history stack required for CL contains ten points, and is recorded online using a singular value maximizing algorithm (cf. [93]), and the required state derivatives are computed using a fifth-order Savitzky-Golay smoothing filter (cf. [150]).

The initial values for the state and the state estimate are selected to be $x(0) = [1,\, 2]^T$ and $\hat{x}(0) = [0,\, 0]^T$, respectively. The initial values for the NN weights for the value function, the policy, and the drift dynamics are selected to be $5\times\mathbf{1}_7$, $3\times\mathbf{1}_7$, and $\mathbf{0}_6$, respectively. Since the system has no stable equilibria, the initial policy $\hat{\mu}(\cdot,\hat{W}_a(0))$ is not stabilizing. The stabilization demonstrated in Figure 6-1 is achieved via fast simultaneous learning of the system dynamics and the value function.
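Two of the implementation details above are easy to misread in prose, so the following sketch makes them concrete: building the uniform extrapolation grid over the hypercube, and Savitzky-Golay differentiation of recorded states for the history stack. The window length and the random placeholder data are illustrative assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

# Uniform BE-extrapolation grid over a 2x2x2x2 hypercube centered at the origin
# (3 points per dimension yields the 81 points used above).
pts = np.linspace(-1.0, 1.0, 3)
grid = np.array(np.meshgrid(pts, pts, pts, pts)).reshape(4, -1).T   # shape (81, 4)

# Numerically differentiated state histories for the CL history stack, using a
# fifth-order Savitzky-Golay smoothing filter (window length is an illustrative choice).
dt = 0.001
x_hist = np.random.randn(1000, 2)      # placeholder for recorded states x(t)
x_dot = savgol_filter(x_hist, window_length=11, polyorder=5, deriv=1,
                      delta=dt, axis=0)
```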


Figures 6-1 and 6-2 demonstrate that the controller remains bounded, the tracking error is regulated to the origin, and the NN weights converge. In Figure 6-3, the dashed lines denote the ideal values of the NN weights for the system drift dynamics.

Figure 6-1. System trajectories generated using the proposed method for the nonlinear system.

Figure 6-4 demonstrates satisfaction of the rank conditions in Assumptions 6.1 and 6.2. The rank condition on the history stack in Assumption 6.1 is ensured by selecting points using a singular value maximization algorithm, and the condition in Assumption 6.2 is met via over-sampling, i.e., by selecting 160 points to identify 9 unknown parameters. Unlike previous results that rely on the addition of an ad hoc probing signal to satisfy the PE condition, this result ensures sufficient exploration via BE extrapolation.

Since an analytical solution of the optimal tracking problem is not available for the nonlinear system, the value function and the policy weights cannot be compared against their ideal values. However, a measure of proximity of the obtained weights $\hat{W}_a$ to the ideal weights $W$ can be obtained by comparing the system trajectories resulting from applying the feedback control policy $\hat{\mu}(\zeta) = -\frac{1}{2}R^{-1}G^T(\zeta)\nabla\sigma^T(\zeta)\hat{W}_a$ for fixed weights $\hat{W}_a$ to the system, against numerically computed optimal system trajectories. Figure 6-5 shows that the control and error trajectories resulting from the


Figure 6-2. Value function and policy weight trajectories generated using the proposed method for the nonlinear system. Since an analytical solution of the optimal tracking problem is not available, the weights cannot be compared against their ideal values.

obtained weights are close to the numerical solution. The numerical solution is obtained from the GPOPS optimal control software [7] using 300 collocation points.

A comparison between the learned weights and the optimal weights is possible for linear systems provided the dynamics $h_d$ of the desired trajectory are also linear.

6.6.2 Linear System

To demonstrate convergence to the ideal weights, the following linear system is simulated:

$\dot{x} = \left[\begin{array}{cc}-1 & 1\\ -0.5 & 0.5\end{array}\right]x + \left[\begin{array}{c}0\\ 1\end{array}\right]u.$

The control objective is to follow a desired trajectory, which is the solution of the initial value problem

$\dot{x}_d = \left[\begin{array}{cc}-1 & 1\\ -2 & 1\end{array}\right]x_d, \qquad x_d(0) = \left[\begin{array}{c}0\\ 2\end{array}\right],$

while ensuring convergence of the estimated policy $\hat{\mu}$ to a neighborhood of the policy $\mu^*$, such that the control law $\mu(t) = \mu^*(\zeta(t))$ minimizes the cost


Figure 6-3. Trajectories of the unknown parameters in the system drift dynamics for the nonlinear system. The dotted lines represent the true values of the parameters.

$\int_0^\infty\left(e^T(t)\,\mathrm{diag}([10,10])\,e(t) + \mu(t)^2\right)dt$, subject to the dynamic constraint, over $\mu\in U$.

The value function is approximated using the polynomial basis $\sigma(\zeta) = \left[e_1^2,\, e_1e_2,\, e_1x_{d1},\, e_1x_{d2},\, e_2^2,\, e_2x_{d1},\, e_2x_{d2}\right]^T$, and the unknown drift dynamics are approximated using the linear basis $\sigma_\theta(x) = [x_1,\, x_2]^T$. Learning gains for system identification and value function approximation are selected as

$\eta_{c1} = 0.5,\; \eta_{c2} = 10,\; \eta_{a1} = 10,\; \eta_{a2} = 0.001,\; \beta = 0.1,\; \nu = 0.005,\; \bar\Gamma = 100000,\; k = 500,\; \Gamma_\theta = I_2,\; \Gamma(0) = 1000\, I_7,\; k_\theta = 10.$

To implement BE extrapolation, error values $\{e_i\}_{i=1}^{25}$ are selected to be uniformly spaced in a $5\times5$ grid on a $2\times2$ square around the origin, and the points $\{x_d(t_j)\}_{j=1}^{11}$ are selected along the desired trajectory such that the time instances $t_j$ are linearly spaced over the


Figure 6-4. Satisfaction of Assumptions 6.1 and 6.2 for the nonlinear system.

interval $[0.1,\, 2\pi]$. The set of points $\{\zeta_k\}_{k=1}^{275}$ is then computed as $\{\zeta_k\} = \left\{\left[e_i^T,\, x_d^T(t_j)\right]^T\right\}$, $i = 1,\dots,25$, $j = 1,\dots,11$. The history stack required for CL contains ten points, and is recorded online using a singular value maximizing algorithm (cf. [93]), and the required state derivatives are computed using a fifth-order Savitzky-Golay smoothing filter (cf. [150]).

The linear system and the linear desired dynamics result in a linear time-invariant concatenated system. Since the system is linear, the optimal tracking problem reduces to an optimal regulation problem, which can be solved by solving the resulting algebraic Riccati equation. The optimal value function is given by $V^*(\zeta) = \zeta^T P\zeta$, where the matrix $P$ is given by

$P = \left[\begin{array}{cccc}4.43 & 0.67 & 0 & 0\\ 0.67 & 2.91 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\end{array}\right].$

Using the matrix $P$, the ideal weights corresponding to the selected basis can be computed as $W = [4.43,\, 1.35,\, 0,\, 0,\, 2.91,\, 0,\, 0]^T$.
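The mapping from the quadratic form $\zeta^T P\zeta$ to the basis weights is mechanical but easy to get wrong because cross terms pick up a factor of two. A quick check, assuming the basis ordering listed above with $\zeta = [e_1, e_2, x_{d1}, x_{d2}]^T$:

```python
import numpy as np

# Map V*(zeta) = zeta^T P zeta to the weights of the polynomial basis
# sigma(zeta) = [e1^2, e1 e2, e1 xd1, e1 xd2, e2^2, e2 xd1, e2 xd2]^T.
P = np.array([[4.43, 0.67, 0, 0],
              [0.67, 2.91, 0, 0],
              [0,    0,    0, 0],
              [0,    0,    0, 0]])
pairs = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3)]
W = np.array([P[i, j] if i == j else 2 * P[i, j] for i, j in pairs])
print(W)   # [4.43 1.34 0. 0. 2.91 0. 0.], matching the ideal weights up to rounding
```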


Figure 6-5. Comparison between control and error trajectories resulting from the developed technique and a numerical solution for the nonlinear system.

Figures 6-6 through 6-8 demonstrate that the controller remains bounded, the tracking error goes to zero, and the weight estimates $\hat{W}_c$, $\hat{W}_a$, and $\hat\theta$ go to their true values, establishing the convergence of the approximate policy to the optimal policy. Figure 6-9 demonstrates satisfaction of the rank conditions in Assumptions 6.1 and 6.2.

6.7 Concluding Remarks

A concurrent-learning based implementation of model-based RL is developed to obtain an approximate online solution to infinite-horizon optimal tracking problems for nonlinear continuous-time control-affine systems. The desired steady-state controller is used to facilitate the formulation of a feasible optimal control problem, and the system state is augmented with the desired trajectory to facilitate the formulation of a stationary optimal control problem. A CL-based system identifier is developed to remove the dependence of the desired steady-state controller on the system drift dynamics, and to facilitate simulation of experience via BE extrapolation.

The design variable $\mu$ introduced in Chapter 5 and the inversion of the control effectiveness matrix are necessary because the controller does not asymptotically go to zero, causing the total cost to be infinite for any policy. The definition of $\mu$ and the inversion of the control


Figure 6-6. System trajectories generated using the proposed method for the linear system.

effectiveness matrix can be avoided if the optimal control problem is formulated in terms of a discounted cost. An online solution of the discounted-cost optimal control problem is possible by making minor modifications to the technique developed in this chapter.

The history stack in Assumption 6.1 is assumed to be available a priori for ease of exposition. Provided the system states are exciting over the finite amount of time needed for data collection, the history stack can be collected online. For the case when a history stack is not available initially, the developed controller needs to be modified during the data collection phase to ensure stability. The required modifications are similar to those described in Appendix A. Once the condition in Assumption 6.1 is met, the developed controller can be used thereafter.

Technical challenges similar to the optimal tracking problem are encountered while dealing with multiple interacting agents. Since the trajectory of one agent is influenced by other agents, the value function becomes time-varying. The following chapter extends the simulation-based ACI method to obtain an approximate feedback-Nash equilibrium solution to a class of graphical games based on ideas developed in previous chapters.


Figure 6-7. Value function and policy weight trajectories generated using the proposed method for the linear system.

Figure 6-8. Trajectories of the unknown parameters in the system drift dynamics for the linear system. The dotted lines represent the true values of the parameters.


Figure 6-9. Satisfaction of Assumptions 6.1 and 6.2 for the linear system.


CHAPTER 7
MODEL-BASED REINFORCEMENT LEARNING FOR ONLINE APPROXIMATE FEEDBACK-NASH EQUILIBRIUM SOLUTION OF DIFFERENTIAL GRAPHICAL GAMES

Efforts in this chapter seek to combine differential game theory with the ADP framework to determine forward-in-time, approximate optimal controllers for formation tracking in multi-agent systems with uncertain nonlinear dynamics. A continuous control strategy is proposed, using communication feedback from extended neighbors on a communication topology that has a spanning tree. The simulation-based ACI architecture from Chapter 3 is extended to cooperatively control a group of agents to track a trajectory in a desired formation using ideas from Chapter 6.

7.1 Graph Theory Preliminaries

Consider a set of $N$ autonomous agents moving in the state space $\mathbb{R}^n$. The control objective is for the agents to track a desired trajectory while maintaining a desired formation. To aid the subsequent design, another agent (henceforth referred to as the leader) is assumed to be traversing the desired trajectory, denoted by $x_0\in\mathbb{R}^n$. The agents are assumed to be on a network with a fixed communication topology modeled as a static directed graph (i.e., digraph).

Each agent forms a node in the digraph. The set of all nodes excluding the leader is denoted by $\mathcal{N} = \{1,\dots,N\}$. If node $i$ can receive information from node $j$, then there exists a directed edge from the $j$-th to the $i$-th node of the digraph, denoted by the ordered pair $(j,i)$. Let $E$ denote the set of all edges. Let there be a positive weight $a_{ij}\in\mathbb{R}$ associated with each edge $(j,i)$. Note that $a_{ij}\neq 0$ if and only if $(j,i)\in E$. The digraph is assumed to have no repeated edges, i.e., $(i,i)\notin E$, $\forall i$, which implies $a_{ii} = 0$, $\forall i$. Note that $a_{i0}$ denotes the edge weight (also referred to as the pinning gain) for the edge between the leader and node $i$. Similar to the other edge weights, $a_{i0}\neq 0$ if and only if there exists a directed edge from the leader to agent $i$. The neighborhood sets of node $i$ are denoted by $N_{-i}$ and $N_i$, defined as $N_{-i} \triangleq \{j\in\mathcal{N} \mid (j,i)\in E\}$ and


$N_i \triangleq N_{-i}\cup\{i\}$. To streamline the analysis, a graph connectivity matrix $\mathcal{A}\in\mathbb{R}^{N\times N}$ is defined as $\mathcal{A} \triangleq [a_{ij} \mid i,j\in\mathcal{N}]$, a diagonal pinning gain matrix $\mathcal{A}_0\in\mathbb{R}^{N\times N}$ is defined as $\mathcal{A}_0 \triangleq \mathrm{diag}(a_{i0} \mid i\in\mathcal{N})$, an in-degree matrix $\mathcal{D}\in\mathbb{R}^{N\times N}$ is defined as $\mathcal{D} \triangleq \mathrm{diag}(d_i)$, where $d_i \triangleq \sum_{j\in N_i}a_{ij}$, and a graph Laplacian matrix $\mathcal{L}\in\mathbb{R}^{N\times N}$ is defined as $\mathcal{L} \triangleq \mathcal{D} - \mathcal{A}$. The graph is said to have a spanning tree if, given any node $i$, there exists a directed path from the leader $0$ to $i$. A node $j$ is said to be an extended neighbor of node $i$ if there exists a directed path from node $j$ to node $i$. The extended neighborhood set of node $i$, denoted by $S_{-i}$, is defined as the set of all extended neighbors of node $i$. Formally, $S_{-i} \triangleq \{j\in\mathcal{N} \mid j\neq i \,\wedge\, \exists n\le N,\, \{j_1,\dots,j_n\}\subset\mathcal{N} \mid \{(j,j_1),(j_1,j_2),\dots,(j_n,i)\}\subset E\}$. Let $S_i \triangleq S_{-i}\cup\{i\}$, and let the edge weights be normalized such that $\sum_j a_{ij} = 1$ for all $i\in\mathcal{N}$. Note that the sub-graphs are nested in the sense that $S_j\subset S_i$ for all $j\in S_i$.

7.2 Problem Formulation

The state $x_i\in\mathbb{R}^n$ of each agent evolves according to the control-affine dynamics

$\dot{x}_i = f_i(x_i) + g_i(x_i)u_i,$

where $u_i\in\mathbb{R}^{m_i}$ denotes the control input, and $f_i:\mathbb{R}^n\to\mathbb{R}^n$ and $g_i:\mathbb{R}^n\to\mathbb{R}^{n\times m_i}$ are locally Lipschitz continuous functions.

Assumption 7.1. The group of agents follows a virtual leader whose dynamics are described by $\dot{x}_0 = f_0(x_0)$, where $f_0:\mathbb{R}^n\to\mathbb{R}^n$ is a locally Lipschitz continuous function. The function $f_0$ and the initial condition $x_0(t_0)$ are selected such that the trajectory $x_0(t)$ is bounded for all $t\in\mathbb{R}_{\ge t_0}$.

The control objective is for the agents to maintain a predetermined formation around the leader while minimizing a cost function. For all $i\in\mathcal{N}$, the $i$-th agent is aware of its constant desired relative position $x_{dij}\in\mathbb{R}^n$ with respect to all its neighbors $j\in N_{-i}$, such that the desired formation is realized when $x_i - x_j \to x_{dij}$ for all $i,j\in\mathcal{N}$. To facilitate control design, the formation is expressed in terms of a set of constant vectors $\{x_{di0}\in\mathbb{R}^n\}_{i\in\mathcal{N}}$, where each $x_{di0}$ denotes the constant final desired position of


agent $i$ with respect to the leader. The vectors $\{x_{di0}\}_{i\in\mathcal{N}}$ are unknown to the agents not connected to the leader, and the known desired inter-agent relative positions can be expressed in terms of $\{x_{di0}\}_{i\in\mathcal{N}}$ as $x_{dij} = x_{di0} - x_{dj0}$. The control objective is thus satisfied when $x_i \to x_{di0} + x_0$ for all $i\in\mathcal{N}$. To facilitate control design, define the local neighborhood tracking error signal

$e_i = \sum_{j\in\{0\}\cup N_{-i}}a_{ij}\left(x_i - x_j - x_{dij}\right).$

To facilitate analysis, the error signal is expressed in terms of the unknown leader-relative desired positions as

$e_i = \sum_{j\in\{0\}\cup N_{-i}}a_{ij}\left(\left(x_i - x_{di0}\right) - \left(x_j - x_{dj0}\right)\right).$

Stacking the error signals in a vector $E \triangleq \left[e_1^T, e_2^T, \dots, e_N^T\right]^T\in\mathbb{R}^{nN}$, the error signal can be expressed in matrix form as

$E = \left(\left(\mathcal{L} + \mathcal{A}_0\right)\otimes I_n\right)\left(X - X_d - X_0\right),$

where $X = \left[x_1^T,\dots,x_N^T\right]^T\in\mathbb{R}^{nN}$, $X_d = \left[x_{d10}^T,\dots,x_{dN0}^T\right]^T\in\mathbb{R}^{nN}$, $X_0 = \left[x_0^T,\dots,x_0^T\right]^T\in\mathbb{R}^{nN}$, $I_n$ denotes an $n\times n$ identity matrix, and $\otimes$ denotes the Kronecker product. It can be concluded that, provided the matrix $\left(\mathcal{L}+\mathcal{A}_0\right)\otimes I_n\in\mathbb{R}^{nN\times nN}$ is nonsingular, $\|E\|\to 0$ implies $x_i\to x_{di0}+x_0$ for all $i$, and hence, the satisfaction of the control objective. The matrix $\left(\mathcal{L}+\mathcal{A}_0\right)\otimes I_n$ can be shown to be nonsingular provided the graph has a spanning tree with the leader at the root. To facilitate the formulation of an optimization problem, the following section explores the functional dependence of the state value functions for the network of agents.
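The stacked error construction is compact but dense; the following sketch builds it for a hypothetical five-agent network. The adjacency and pinning structure below is an illustrative assumption, not the topology of Figure 7-1.

```python
import numpy as np

# Sketch: stacked neighborhood error E = ((L + A_0) (x) I_n)(X - X_d - X_0)
N, n = 5, 2
A = np.zeros((N, N))
A[1, 0] = A[2, 0] = A[3, 2] = A[4, 2] = 1.0       # a_ij != 0  iff edge (j, i)
a0 = np.array([1.0, 0, 0, 0, 0])                   # pinning gains (leader -> agent 1)
D = np.diag(A.sum(axis=1))
Lap = D - A                                        # graph Laplacian L = D - A
M = np.kron(Lap + np.diag(a0), np.eye(n))          # (L + A_0) (x) I_n

X, X_d, X_0 = (np.random.randn(N * n) for _ in range(3))   # placeholder stacks
E = M @ (X - X_d - X_0)
# M is nonsingular when the graph has a spanning tree rooted at the leader,
# so ||E|| -> 0 implies each x_i -> x_di0 + x_0.
assert np.linalg.matrix_rank(M) == N * n
```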


7.2.1 Elements of the Value Function

The dynamics for the open-loop neighborhood tracking error are

$\dot{e}_i = \sum_{j\in\{0\}\cup N_{-i}}a_{ij}\left(f_i(x_i) + g_i(x_i)u_i - f_j(x_j) - g_j(x_j)u_j\right).$

Under the temporary assumption that each controller $u_i$ is an error-feedback controller, i.e., $u_i(t) = \hat{u}_i(e_i(t),t)$, the error dynamics are expressed as

$\dot{e}_i = \sum_{j\in\{0\}\cup N_{-i}}a_{ij}\left(f_i(x_i) + g_i(x_i)\hat{u}_i(e_i,t) - f_j(x_j) - g_j(x_j)\hat{u}_j(e_j,t)\right).$

Thus, the error trajectory $\{e_i(t)\}_{t=t_0}^\infty$, where $t_0$ denotes the initial time, depends on $\hat{u}_j(e_j(t),t)$, $\forall j\in N_i$. Similarly, the error trajectory $\{e_j(t)\}_{t=t_0}^\infty$ depends on $\hat{u}_k(e_k(t),t)$, $\forall k\in N_j$. Recursively, the trajectory $\{e_i(t)\}_{t=t_0}^\infty$ depends on $\hat{u}_j(e_j(t),t)$, and hence, on $e_j(t)$, $\forall j\in S_i$. Thus, even if the controller for each agent is restricted to use local error feedback, the resulting error trajectories are interdependent. In particular, a change in the initial condition of one agent in the extended neighborhood causes a change in the error trajectories corresponding to all the extended neighbors. Consequently, the value function corresponding to an infinite-horizon optimal control problem, where each agent tries to minimize $\int_{t_0}^\infty\left(Q(e_i(\tau)) + R(u_i(\tau))\right)d\tau$, where $Q:\mathbb{R}^n\to\mathbb{R}$ and $R:\mathbb{R}^{m_i}\to\mathbb{R}$ are positive definite functions, is dependent on the error states of all the extended neighbors. In other words, the infinite-horizon value of an error state depends on the error states of all the extended neighbors; hence, communication with extended neighbors is vital for the solution of an optimal control problem in the presented framework.

7.2.2 Optimal Formation Tracking Problem

When the agents are perfectly tracking the desired trajectory in the desired formation, even though the states of all the agents are different, the time derivatives of the states of all the agents are identical. Hence, in steady state, the control signal applied by each agent must be such that the time derivatives are all identical. In particular, the


relative control signal $u_{ij}\in\mathbb{R}^{m_i}$ that will keep node $i$ in its desired relative position with respect to node $j$, i.e., $x_i = x_j + x_{dij}$, must be such that the time derivative of $x_i$ is the same as the time derivative of $x_j$. Using the dynamics of the agents and substituting the desired relative position $x_j + x_{dij}$ for the state $x_i$, the relative control signal $u_{ij}$ must satisfy

$f_i\left(x_j + x_{dij}\right) + g_i\left(x_j + x_{dij}\right)u_{ij} = \dot{x}_j.$

The relative steady-state control signal can be expressed in an explicit form provided the following assumption is satisfied.

Assumption 7.2. The matrix $g_i(x)$ is full rank for all $i\in\mathcal{N}$ and for all $x\in\mathbb{R}^n$; furthermore, the relative steady-state control signal, expressed as

$u_{ij} = f_{ij}(x_j) + g_{ij}(x_j)u_j,$

satisfies the matching condition above along the desired trajectory, where $f_{ij}(x_j) \triangleq g_i^+\left(x_j + x_{dij}\right)\left(f_j(x_j) - f_i\left(x_j + x_{dij}\right)\right)\in\mathbb{R}^{m_i}$, $g_{ij}(x_j) \triangleq g_i^+\left(x_j + x_{dij}\right)g_j(x_j)\in\mathbb{R}^{m_i\times m_j}$, $g_0(x) \triangleq 0$ for all $x\in\mathbb{R}^n$, $u_{i0}\equiv 0$ for all $i\in\mathcal{N}$, and $g_i^+(x)$ denotes the Moore-Penrose pseudoinverse of the matrix $g_i(x)$ for all $x\in\mathbb{R}^n$.

To facilitate the formulation of an optimal formation tracking problem, define the control error $\mu_i\in\mathbb{R}^{m_i}$ as

$\mu_i \triangleq \sum_{j\in N_{-i}\cup\{0\}}a_{ij}\left(u_i - u_{ij}\right).$

In the remainder of this chapter, the control errors $\{\mu_i\}$ will be treated as the design variables. In order to implement the controllers $\{u_i\}$ using the designed control errors $\{\mu_i\}$, it is essential to invert the relationship above. To facilitate the inversion, let $S_i^o \triangleq \{1,\dots,s_i\}$, where $s_i \triangleq |S_i|$. Let $\lambda_i: S_i^o\to S_i$ be a bijective map such that $\lambda_i(1) = i$. For notational brevity, let $\mu_{S_i}$ denote the concatenated vector $\left[\mu_{1_i}^T,\mu_{2_i}^T,\dots,\mu_{s_i{}_i}^T\right]^T$, let $\mu_{S_{-i}}$ denote the concatenated vector $\left[\mu_{2_i}^T,\dots,\mu_{s_i{}_i}^T\right]^T$, let $\sum_i$ denote $\sum_{j\in N_{-i}\cup\{0\}}$, let $j_i$ denote $\lambda_i(j)$, let $E_i \triangleq \left[e_{S_i}^T,\, x_{1_i}^T\right]^T\in\mathbb{R}^{n(s_i+1)}$, and let $E_{-i} \triangleq \left[e_{S_{-i}}^T,\, x_{1_i}^T\right]^T\in\mathbb{R}^{ns_i}$. Then, the control


error vector $\mu_{S_i}\in\mathbb{R}^{\sum_{k\in S_i^o}m_{k_i}}$ can be expressed as

$\mu_{S_i} = L_{gi}(E_i)\,u_{S_i} - \mathcal{F}_i(E_i),$

where the matrix $L_{gi}:\mathbb{R}^{n(s_i+1)}\to\mathbb{R}^{\left(\sum_{k\in S_i^o}m_{k_i}\right)\times\left(\sum_{k\in S_i^o}m_{k_i}\right)}$ is defined as

$L_{gi}(E_i) \triangleq \left[\begin{array}{cccc}\sum_{1_i}a_{1_ij}I_{m_{1_i}} & -a_{1_i2_i}g_{1_i2_i}(x_{2_i}) & \cdots & -a_{1_is_i{}_i}g_{1_is_i{}_i}(x_{s_i{}_i})\\ -a_{2_i1_i}g_{2_i1_i}(x_{1_i}) & \sum_{2_i}a_{2_ij}I_{m_{2_i}} & \cdots & -a_{2_is_i{}_i}g_{2_is_i{}_i}(x_{s_i{}_i})\\ \vdots & & \ddots & \vdots\\ -a_{s_i{}_i1_i}g_{s_i{}_i1_i}(x_{1_i}) & -a_{s_i{}_i2_i}g_{s_i{}_i2_i}(x_{2_i}) & \cdots & \sum_{s_i{}_i}a_{s_i{}_ij}I_{m_{s_i{}_i}}\end{array}\right],$

and $\mathcal{F}_i:\mathbb{R}^{n(s_i+1)}\to\mathbb{R}^{\sum_{k\in S_i^o}m_{k_i}}$ is defined as

$\mathcal{F}_i(E_i) \triangleq \left[\sum_{1_i}a_{1_ij}f_{1_ij}^T(x_j),\;\dots,\;\sum_{s_i{}_i}a_{s_i{}_ij}f_{s_i{}_ij}^T(x_j)\right]^T.$

Assumption 7.3. The matrix $L_{gi}(E_i(t))$ is invertible for all $t\in\mathbb{R}$.

Assumption 7.3 is a controllability-like condition. Intuitively, Assumption 7.3 requires the control effectiveness matrices to be compatible to ensure the existence of relative control inputs that allow the agents to follow the desired trajectory in the desired formation.

Using Assumption 7.3, the control vector can be expressed as

$u_{S_i} = L_{gi}^{-1}(E_i)\,\mu_{S_i} + L_{gi}^{-1}\mathcal{F}_i(E_i).$

Let $L_{gi}^k$ denote the $\left(\lambda_i^{-1}(k)\right)$-th block row of $L_{gi}^{-1}$. Then, the controller $u_i$ can be implemented as

$u_i = L_{gi}^i(E_i)\,\mu_{S_i} + L_{gi}^i\mathcal{F}_i(E_i),$

and for any $j\in N_{-i}$, $u_j = L_{gi}^j(E_i)\,\mu_{S_i} + L_{gi}^j\mathcal{F}_i(E_i)$. Using these expressions, the error and the state dynamics for the agents can be represented as

$\dot{e}_i = F_i(E_i) + G_i(E_i)\,\mu_{S_i},$
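The relative steady-state control of Assumption 7.2 is the elementary building block of the $\mathcal{F}_i$ vector above, and it reduces to a single pseudoinverse computation. A minimal sketch, with all callables as hypothetical stand-ins for the agent dynamics:

```python
import numpy as np

def relative_control(f_i, g_i, f_j, g_j, x_j, x_dij, u_j):
    """Relative steady-state control u_ij (sketch of Assumption 7.2):
    u_ij = g_i^+(x_j + x_dij) (f_j(x_j) + g_j(x_j) u_j - f_i(x_j + x_dij)),
    so that node i, placed at x_j + x_dij, matches the velocity of node j."""
    gi_pinv = np.linalg.pinv(g_i(x_j + x_dij))     # Moore-Penrose pseudoinverse
    return gi_pinv @ (f_j(x_j) + g_j(x_j) @ u_j - f_i(x_j + x_dij))
```

This is exactly $u_{ij} = f_{ij}(x_j) + g_{ij}(x_j)u_j$ with the two terms combined; the pseudoinverse is what makes the full-rank requirement on $g_i$ essential.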


and

$\dot{x}_i = F_i^x(E_i) + G_i^x(E_i)\,\mu_{S_i},$

where $F_i(E_i) \triangleq \sum_i a_{ij}\left(f_i(x_i) - f_j(x_j) + \left(g_i(x_i)L_{gi}^i(E_i) - g_j(x_j)L_{gi}^j(E_i)\right)\mathcal{F}_i(E_i)\right)$, $G_i(E_i) \triangleq \sum_i a_{ij}\left(g_i(x_i)L_{gi}^i(E_i) - g_j(x_j)L_{gi}^j(E_i)\right)$, $F_i^x(E_i) \triangleq f_i(x_i) + g_i(x_i)L_{gi}^i(E_i)\mathcal{F}_i(E_i)$, and $G_i^x(E_i) \triangleq g_i(x_i)L_{gi}^i(E_i)$.

Let $h_{ei}^{\mu_i,\mu_{S_{-i}}}(t;t_0,E_{i0})$ and $h_{xi}^{\mu_i,\mu_{S_{-i}}}(t;t_0,E_{i0})$ denote the trajectories of the error and state dynamics, respectively, with initial time $t_0$, initial condition $E_i(t_0) = E_{i0}$, and policies $\mu_j:\mathbb{R}^{n(s_j+1)}\to\mathbb{R}^{m_j}$, and let $H_i = \left[h_{eS_i}^T,\, h_{x1_i}^T\right]^T$. Define a cost functional

$J_i(e_i,\mu_i) \triangleq \int_0^\infty r_i\left(e_i(\tau),\mu_i(\tau)\right)d\tau,$

where $r_i:\mathbb{R}^n\times\mathbb{R}^{m_i}\to\mathbb{R}_{\ge 0}$ denotes the local cost defined as $r_i(e_i,\mu_i) \triangleq Q_i(e_i) + \mu_i^T R_i\mu_i$, where $Q_i:\mathbb{R}^n\to\mathbb{R}_{\ge 0}$ is a positive definite function and $R_i\in\mathbb{R}^{m_i\times m_i}$ is a constant positive definite matrix. The objective of each agent is to minimize this cost functional. To facilitate the definition of a feedback-Nash equilibrium solution, define value functions $V_i:\mathbb{R}^{n(s_i+1)}\to\mathbb{R}_{\ge 0}$ as

$V_i^{\mu_i,\mu_{S_{-i}}}(E_i) \triangleq \int_t^\infty r_i\left(h_{ei}^{\mu_i,\mu_{S_{-i}}}(\tau;t,E_i),\; \mu_i\left(H_i^{\mu_i,\mu_{S_{-i}}}(\tau;t,E_i)\right)\right)d\tau,$

where the notation $V_i^{\mu_i,\mu_{S_{-i}}}(E_i)$ denotes the total cost-to-go under the policies $\mu_{S_i}$, starting from the state $E_i$. Note that these value functions are time-invariant because the dynamical systems $\left\{\dot{e}_j = F_j(E_i) + G_j(E_i)\mu_{S_j}\right\}_{j\in S_i}$ and $\dot{x}_i = F_i^x(E_i) + G_i^x(E_i)\mu_{S_i}$ together form an autonomous dynamical system.

A graphical feedback-Nash equilibrium solution within the subgraph $S_i$ is defined as the tuple of policies $\left\{\mu_j^*:\mathbb{R}^{n(s_j+1)}\to\mathbb{R}^{m_j}\right\}_{j\in S_i}$ such that the value functions satisfy

$V_j^*(E_j) \triangleq V_j^{\mu_j^*,\mu_{S_{-j}}^*}(E_j) \le V_j^{\mu_j,\mu_{S_{-j}}^*}(E_j),$


for all $j\in S_i$, for all $E_i\in\mathbb{R}^{n(s_i+1)}$, and for all admissible policies $\mu_j$. Provided a feedback-Nash equilibrium solution exists and the value functions above are continuously differentiable, the feedback-Nash equilibrium value functions can be characterized in terms of the following system of HJ equations:

$\sum_{j\in S_i}\nabla_{e_j}V_i^*(E_i^o)\left(F_j(E_i^o) + G_j(E_i^o)\mu_{S_j}^*(E_i^o)\right) + \nabla_{x_i}V_i^*(E_i^o)\left(F_i^x(E_i^o) + G_i^x(E_i^o)\mu_{S_i}^*(E_i^o)\right) + Q_i(E_i^o) + \mu_i^{*T}(E_i^o)R_i\mu_i^*(E_i^o) = 0, \quad \forall E_i^o\in\mathbb{R}^{n(s_i+1)},$

where, with a slight abuse of notation, $Q_i(E_i^o) \triangleq Q_i(e_i)$.

Theorem 7.1. Provided a feedback-Nash equilibrium solution exists and the value functions are continuously differentiable, the system of HJ equations constitutes a necessary and sufficient condition for feedback-Nash equilibrium.

Proof. Consider the cost functional above, and assume that all the extended neighbors of the $i$-th agent follow their feedback-Nash equilibrium policies. The value function corresponding to any admissible policy $\mu_i$ can be expressed as

$V_i^{\mu_i,\mu_{S_{-i}}^*}\left(\left[e_i^T,\, E_{-i}^T\right]^T\right) = \int_t^\infty r_i\left(h_{ei}^{\mu_i,\mu_{S_{-i}}^*}(\tau;t,E_i),\; \mu_i\left(H_i^{\mu_i,\mu_{S_{-i}}^*}(\tau;t,E_i)\right)\right)d\tau.$

Treating the dependence on $E_{-i}$ as explicit time dependence, define

$\bar{V}_i^{\mu_i,\mu_{S_{-i}}^*}(e_i,t) \triangleq V_i^{\mu_i,\mu_{S_{-i}}^*}\left(\left[e_i^T,\, E_{-i}^T(t)\right]^T\right),$

for all $e_i\in\mathbb{R}^n$ and for all $t\in\mathbb{R}_{\ge 0}$. Assuming that the optimal controller that minimizes the cost when all the extended neighbors follow their feedback-Nash equilibrium policies exists, and assuming that the optimal value function $\bar{V}_i^* \triangleq \bar{V}_i^{\mu_i^*,\mu_{S_{-i}}^*}$ exists, optimal control theory for single-objective optimization problems (cf. [144]) can be used to derive the following necessary and sufficient condition:

$\frac{\partial\bar{V}_i^*(e_i,t)}{\partial e_i}\left(F_i(E_i) + G_i(E_i)\mu_{S_i}^*(E_i)\right) + \frac{\partial\bar{V}_i^*(e_i,t)}{\partial t} + Q_i(e_i) + \mu_i^{*T}(E_i)R_i\mu_i^*(E_i) = 0.$


Using the definition of $\bar{V}_i^*$, the partial derivative with respect to the state can be expressed as

$\frac{\partial\bar{V}_i^*(e_i,t)}{\partial e_i} = \frac{\partial V_i^*(E_i)}{\partial e_i},$

for all $e_i\in\mathbb{R}^n$ and for all $t\in\mathbb{R}_{\ge 0}$, and the partial derivative with respect to time can be expressed as

$\frac{\partial\bar{V}_i^*(e_i,t)}{\partial t} = \sum_{j\in S_{-i}}\frac{\partial V_i^*(E_i)}{\partial e_j}\left(F_j(E_i) + G_j(E_i)\mu_{S_j}^*(E_i)\right) + \frac{\partial V_i^*(E_i)}{\partial x_i}\left(F_i^x(E_i) + G_i^x(E_i)\mu_{S_i}^*(E_i)\right),$

for all $e_i\in\mathbb{R}^n$ and for all $t\in\mathbb{R}_{\ge 0}$. Substituting these two expressions into the necessary and sufficient condition and repeating the process for each $i$, the system of HJ equations is obtained.

Minimizing the HJ equations using the stationary condition, the feedback-Nash equilibrium solution is expressed in the explicit form

$\mu_i^*(E_i^o) = -\frac{1}{2}R_i^{-1}\sum_{j\in S_i}\left(G_j^i(E_i^o)\right)^T\left(\nabla_{e_j}V_i^*(E_i^o)\right)^T - \frac{1}{2}R_i^{-1}\left(G_i^{xi}(E_i^o)\right)^T\left(\nabla_{x_i}V_i^*(E_i^o)\right)^T,$

for all $E_i^o\in\mathbb{R}^{n(s_i+1)}$, where $G_j^i \triangleq G_j\frac{\partial\mu_{S_j}}{\partial\mu_i}$ and $G_i^{xi} \triangleq G_i^x\frac{\partial\mu_{S_i}}{\partial\mu_i}$. Since the solution of the system of HJ equations is generally infeasible, the feedback-Nash value functions and the feedback-Nash policies are approximated using parametric approximation schemes as $\hat{V}_i(E_i,\hat{W}_{ci})$ and $\hat{\mu}_i(E_i,\hat{W}_{ai})$, respectively, where $\hat{W}_{ci}\in\mathbb{R}^{L_i}$ and $\hat{W}_{ai}\in\mathbb{R}^{L_i}$ are parameter estimates. Substitution of the approximations into the HJ equations leads to a set of BEs $\delta_i$ defined as

$\delta_i\left(E_i,\hat{W}_{ci},\hat{W}_{aS_j}\right) \triangleq \sum_{j\in S_i}\nabla_{e_j}\hat{V}_i(E_i,\hat{W}_{ci})\left(F_j(E_j) + G_j(E_j)\hat{\mu}_{S_j}(E_j,\hat{W}_{aS_j})\right) + \nabla_{x_i}\hat{V}_i(E_i,\hat{W}_{ci})\left(F_i^x(E_i) + G_i^x(E_i)\hat{\mu}_{S_i}(E_i,\hat{W}_{aS_j})\right) + Q_i(e_i) + \hat{\mu}_i^T(E_i,\hat{W}_{ai})R_i\hat{\mu}_i(E_i,\hat{W}_{ai}).$

Approximate feedback-Nash equilibrium control is realized by tuning the estimates $\hat{V}_i$ and $\hat{\mu}_i$ so as to minimize the Bellman errors $\delta_i$. However, computation of $\delta_i$ and that


of $u_{ij}$ requires exact model knowledge. In the following, a CL-based system identifier is developed to relax the exact model knowledge requirement. In particular, the developed controllers do not require knowledge of the system drift functions $f_i$.

7.3 System Identification

On any compact set $\chi\subset\mathbb{R}^n$ the function $f_i$ can be represented using a NN as

$f_i(x) = \theta_i^T\sigma_{\theta i}(x) + \epsilon_{\theta i}(x),$

for all $x\in\mathbb{R}^n$, where $\theta_i\in\mathbb{R}^{(P_i+1)\times n}$ denotes the unknown output-layer NN weights, $\sigma_{\theta i}:\mathbb{R}^n\to\mathbb{R}^{P_i+1}$ denotes a bounded NN basis function, $\epsilon_{\theta i}:\mathbb{R}^n\to\mathbb{R}^n$ denotes the function reconstruction error, and $P_i\in\mathbb{N}$ denotes the number of NN neurons. Using the universal function approximation property of single layer NNs, provided the rows of $\sigma_{\theta i}(x)$ form a proper basis, there exist constant ideal weights $\theta_i$ and positive constants $\bar\theta_i\in\mathbb{R}$ and $\bar\epsilon_{\theta i}\in\mathbb{R}$ such that $\|\theta_i\|_F \le \bar\theta_i < \infty$ and $\sup_{x\in\chi}\|\epsilon_{\theta i}(x)\| \le \bar\epsilon_{\theta i}$, where $\|\cdot\|_F$ denotes the Frobenius norm.

Assumption 7.4. The bounds $\bar\theta_i$ and $\bar\epsilon_{\theta i}$ are known for all $i\in\mathcal{N}$.

Using an estimate $\hat\theta_i\in\mathbb{R}^{(P_i+1)\times n}$ of the weight matrix $\theta_i$, the function $f_i$ can be approximated by the function $\hat{f}_i:\mathbb{R}^n\times\mathbb{R}^{(P_i+1)\times n}\to\mathbb{R}^n$ defined by $\hat{f}_i(x,\hat\theta) \triangleq \hat\theta_i^T\sigma_{\theta i}(x)$. Based on this approximation, an estimator for online identification of the drift dynamics is developed as

$\dot{\hat{x}}_i = \hat\theta_i^T\sigma_{\theta i}(x_i) + g_i(x_i)u_i + k_i\tilde{x}_i,$

where $\tilde{x}_i \triangleq x_i - \hat{x}_i$, and $k_i\in\mathbb{R}$ is a positive constant learning gain. The following assumption facilitates CL-based system identification.

Assumption 7.5. [92] A history stack containing recorded state-action pairs $\left\{x_i^k, u_i^k\right\}_{k=1}^{M_i}$ along with numerically computed state derivatives $\left\{\dot{\bar{x}}_i^k\right\}_{k=1}^{M_i}$ that satisfies

$\lambda_{\min}\left(\sum_{k=1}^{M_i}\sigma_i^k\left(\sigma_i^k\right)^T\right) = \underline{\sigma}_i > 0, \qquad \left\|\dot{\bar{x}}_i^k - \dot{x}_i^k\right\| < \bar{d}_i, \quad \forall k,$


is available a priori. Here, $\sigma_i^k \triangleq \sigma_{\theta i}\left(x_i^k\right)$, $\bar{d}_i\in\mathbb{R}$ is a known positive constant, and $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue.

The weight estimates $\hat\theta_i$ are updated using the following CL-based update law:

$\dot{\hat\theta}_i = k_{\theta i}\Gamma_{\theta i}\sum_{k=1}^{M_i}\sigma_i^k\left(\dot{\bar{x}}_i^k - g_i^k u_i^k - \hat\theta_i^T\sigma_i^k\right)^T + \Gamma_{\theta i}\sigma_{\theta i}(x_i)\tilde{x}_i^T,$

where $g_i^k \triangleq g_i\left(x_i^k\right)$, $k_{\theta i}\in\mathbb{R}$ is a constant positive CL gain, and $\Gamma_{\theta i}\in\mathbb{R}^{(P_i+1)\times(P_i+1)}$ is a constant, diagonal, and positive definite adaptation gain matrix.

To facilitate the subsequent stability analysis, a candidate Lyapunov function $V_{0i}:\mathbb{R}^n\times\mathbb{R}^{(P_i+1)\times n}\to\mathbb{R}$ is selected as

$V_{0i}(\tilde{x}_i,\tilde\theta_i) \triangleq \frac{1}{2}\tilde{x}_i^T\tilde{x}_i + \frac{1}{2}\mathrm{tr}\left(\tilde\theta_i^T\Gamma_{\theta i}^{-1}\tilde\theta_i\right),$

where $\tilde\theta_i \triangleq \theta_i - \hat\theta_i$ and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Using the identifier and the update law above, the following bound on the time derivative of $V_{0i}$ is established:

$\dot{V}_{0i} \le -k_i\|\tilde{x}_i\|^2 - k_{\theta i}\underline{\sigma}_i\left\|\tilde\theta_i\right\|_F^2 + \bar\epsilon_{\theta i}\|\tilde{x}_i\| + k_{\theta i}d_{\theta i}\left\|\tilde\theta_i\right\|_F,$

where $d_{\theta i} \triangleq \bar{d}_i\sum_{k=1}^{M_i}\left\|\sigma_i^k\right\| + \sum_{k=1}^{M_i}\left\|\epsilon_{\theta i}\left(x_i^k\right)\right\|\left\|\sigma_i^k\right\|$. Using these bounds, a Lyapunov-based stability analysis can be used to show that $\hat\theta_i$ converges exponentially to a neighborhood around $\theta_i$.

7.4 Approximation of the BE and the Relative Steady-state Controller

Using the approximations $\hat{f}_i$ for the functions $f_i$, the BEs can be approximated as

$\hat\delta_i\left(E_i,\hat{W}_{ci},\hat{W}_{aS_j},\hat\theta_{S_i}\right) \triangleq \nabla_{x_i}\hat{V}_i(E_i,\hat{W}_{ci})\left(\hat{F}_i^x(E_i,\hat\theta_{S_i}) + G_i^x(E_i)\hat{\mu}_{S_i}(E_i,\hat{W}_{aS_j})\right) + \sum_{j\in S_i}\nabla_{e_j}\hat{V}_i(E_i,\hat{W}_{ci})\left(\hat{F}_j(E_j,\hat\theta_{S_j}) + G_j(E_j)\hat{\mu}_{S_j}(E_j,\hat{W}_{aS_j})\right) + Q_i(e_i) + \hat{\mu}_i^T(E_i,\hat{W}_{ai})R_i\hat{\mu}_i(E_i,\hat{W}_{ai}).$


In the approximate BE, $\hat{F}_i(E_i,\hat\theta_{S_i}) \triangleq \sum_i a_{ij}\left(\hat{f}_i(x_i,\hat\theta_i) - \hat{f}_j(x_j,\hat\theta_j)\right) + \sum_i a_{ij}\left(g_i(x_i)L_{gi}^i - g_j(x_j)L_{gi}^j\right)\hat{\mathcal{F}}_i(E_i,\hat\theta_{S_i})$, $\hat{F}_i^x(E_i,\hat\theta_{S_i}) \triangleq \hat\theta_i^T\sigma_{\theta i}(x_i) + g_i(x_i)L_{gi}^i\hat{\mathcal{F}}_i(E_i,\hat\theta_{S_i})$,

$\hat{\mathcal{F}}_i(E_i,\hat\theta_{S_i}) \triangleq \left[\sum_{1_i}a_{1_ij}\hat{f}_{1_ij}^T\left(x_{1_i},\hat\theta_{1_i},x_j,\hat\theta_j\right),\;\dots,\;\sum_{s_i{}_i}a_{s_i{}_ij}\hat{f}_{s_i{}_ij}^T\left(x_{s_i{}_i},\hat\theta_{s_i{}_i},x_j,\hat\theta_j\right)\right]^T,$

and $\hat{f}_{ij}\left(x_i,\hat\theta_i,x_j,\hat\theta_j\right) \triangleq g_i^+\left(x_j + x_{dij}\right)\left(\hat{f}_j(x_j,\hat\theta_j) - \hat{f}_i\left(x_j + x_{dij},\hat\theta_i\right)\right)$. The approximations $\hat{F}_i$, $\hat{F}_i^x$, and $\hat{\mathcal{F}}_i$ are related to the original unknown functions as $\hat{F}_i(E_i,\theta_{S_i}) + B_i(E_i) = F_i(E_i)$, $\hat{F}_i^x(E_i,\theta_{S_i}) + B_i^x(E_i) = F_i^x(E_i)$, and $\hat{\mathcal{F}}_i(E_i,\theta_{S_i}) + \mathcal{B}_i(E_i) = \mathcal{F}_i(E_i)$, where $B_i$, $B_i^x$, and $\mathcal{B}_i$ are $O\left(\bar\epsilon_{\theta S_i}\right)$ terms that denote bounded function approximation errors.

Using the approximations $\hat{f}_i$, an implementable form of the controllers is expressed as

$u_{S_i} = L_{gi}^{-1}(E_i)\,\hat{\mu}_{S_i}(E_i,\hat{W}_{aS_j}) + L_{gi}^{-1}\hat{\mathcal{F}}_i(E_i,\hat\theta_{S_i}).$

Accordingly, an unmeasurable form of the virtual controllers implemented on the error and state dynamics is given by

$\mu_{S_i} = \hat{\mu}_{S_i}(E_i,\hat{W}_{aS_j}) - \hat{\mathcal{F}}_i(E_i,\tilde\theta_{S_i}) - \mathcal{B}_i(E_i).$

7.5 Value Function Approximation

On any compact set $\chi\subset\mathbb{R}^{n(s_i+1)}$, the value functions can be represented as

$V_i^*(E_i^o) = W_i^T\sigma_i(E_i^o) + \epsilon_i(E_i^o), \quad \forall E_i^o\in\mathbb{R}^{n(s_i+1)},$

where $W_i\in\mathbb{R}^{L_i}$ are ideal NN weights, $\sigma_i:\mathbb{R}^{n(s_i+1)}\to\mathbb{R}^{L_i}$ are NN basis functions, and $\epsilon_i:\mathbb{R}^{n(s_i+1)}\to\mathbb{R}$ are function approximation errors. Using the universal function approximation property of single layer NNs, provided $\sigma_i(E_i^o)$ forms a proper basis, there exist constant ideal weights $W_i$ and positive constants $\bar{W}_i\in\mathbb{R}$ and $\bar\epsilon_i, \overline{\nabla\epsilon}_i\in\mathbb{R}$ such that $\|W_i\| \le \bar{W}_i < \infty$, $\sup_{E_i^o\in\chi}\|\epsilon_i(E_i^o)\| \le \bar\epsilon_i$, and $\sup_{E_i^o\in\chi}\|\nabla\epsilon_i(E_i^o)\| \le \overline{\nabla\epsilon}_i$.


Assumption 7.6. The constants $\bar\epsilon_i$, $\overline{\nabla\epsilon}_i$, and $\bar{W}_i$ are known for all $i\in\mathcal{N}$.

Using the NN representation and the closed-form policy expression, the feedback-Nash equilibrium policies can be represented as

$\mu_i^*(E_i^o) = -\frac{1}{2}R_i^{-1}G_{\sigma i}(E_i^o)W_i - \frac{1}{2}R_i^{-1}G_{\epsilon i}(E_i^o), \quad \forall E_i^o\in\mathbb{R}^{n(s_i+1)},$

where

$G_{\sigma i}(E_i) \triangleq \sum_{j\in S_i}\left(G_j^i(E_i)\right)^T\left(\nabla_{e_j}\sigma_i(E_i)\right)^T + \left(G_i^{xi}(E_i)\right)^T\left(\nabla_{x_i}\sigma_i(E_i)\right)^T$

and

$G_{\epsilon i}(E_i) \triangleq \sum_{j\in S_i}\left(G_j^i(E_i)\right)^T\left(\nabla_{e_j}\epsilon_i(E_i)\right)^T + \left(G_i^{xi}(E_i)\right)^T\left(\nabla_{x_i}\epsilon_i(E_i)\right)^T.$

The value functions and the policies are approximated using NNs as

$\hat{V}_i(E_i,\hat{W}_{ci}) \triangleq \hat{W}_{ci}^T\sigma_i(E_i), \qquad \hat{\mu}_i(E_i,\hat{W}_{ai}) \triangleq -\frac{1}{2}R_i^{-1}G_{\sigma i}(E_i)\hat{W}_{ai}.$

7.6 Simulation of Experience via BE Extrapolation

A consequence of Theorem 7.1 is that the BE provides an indirect measure of how close the weights $\hat{W}_{ci}$ and $\hat{W}_{ai}$ are to the ideal weights $W_i$. From a reinforcement learning perspective, each evaluation of the BE along the system trajectory can be interpreted as experience gained by the critic, and each evaluation of the BE at points not yet visited can be interpreted as simulated experience. In previous results such as [95, 112, 119, 128, 157], the critic is restricted to the experience gained, in other words, BEs evaluated, along the system state trajectory. The development in [112, 119, 128, 157] can be extended to employ simulated experience; however, the extension requires exact model knowledge. In results such as [95], the formulation of the BE does not allow for simulation of experience. The formulation in this chapter employs the system identifier developed in Section 7.3 to facilitate approximate evaluation of the BE at off-trajectory points. For notational brevity, the arguments to the functions $\sigma_i$, $\hat{F}_i$, $G_i$, $G_{\sigma i}$, $\hat{F}_i^x$, $\hat{\mu}_i$, $G_i^x$, $G_{\epsilon i}$, and $\epsilon_i$ are suppressed hereafter.


To simulate experience, each agent selects a set of points $\left\{E_i^k\right\}_{k=1}^{M_i}$ and evaluates the instantaneous BE at the current state, denoted by $\hat\delta_{ti}$, and the instantaneous BE at the selected points, denoted by $\hat\delta_{ti}^k$. The BEs $\hat\delta_{ti}$ and $\hat\delta_{ti}^k$ are defined as

$\hat\delta_{ti}(t) \triangleq \hat\delta_i\left(E_i(t),\hat{W}_{ci}(t),\hat{W}_{aS_j}(t),\hat\theta_{S_i}(t)\right), \qquad \hat\delta_{ti}^k(t) \triangleq \hat\delta_i\left(E_i^k,\hat{W}_{ci}(t),\hat{W}_{aS_j}(t),\hat\theta_{S_i}(t)\right).$

Note that once $\{e_j\}_{j\in S_i}$ and $x_i$ are selected, the $i$-th agent can compute the states of all the remaining agents in the subgraph.

The critic uses simulated experience to update the value function weights using the least-squares-based update law

$\dot{\hat{W}}_{ci} = -\eta_{c1i}\Gamma_i\frac{\omega_i}{\rho_i}\hat\delta_{ti} - \frac{\eta_{c2i}}{M_i}\Gamma_i\sum_{k=1}^{M_i}\frac{\omega_i^k}{\rho_i^k}\hat\delta_{ti}^k,$

$\dot\Gamma_i = \left(\beta_i\Gamma_i - \eta_{c1i}\Gamma_i\frac{\omega_i\omega_i^T}{\rho_i^2}\Gamma_i\right)\mathbb{1}_{\{\|\Gamma_i\|\le\bar\Gamma_i\}}, \qquad \|\Gamma_i(t_0)\| \le \bar\Gamma_i,$

where $\rho_i \triangleq 1 + \nu_i\omega_i^T\Gamma_i\omega_i$, $\Gamma_i\in\mathbb{R}^{L_i\times L_i}$ denotes the time-varying least-squares learning gain, $\bar\Gamma_i\in\mathbb{R}$ denotes the saturation constant, and $\eta_{c1i}, \eta_{c2i}, \nu_i, \beta_i\in\mathbb{R}$ are constant positive learning gains. Here,

$\omega_i \triangleq \sum_{j\in S_i}\nabla_{e_j}\sigma_i\left(\hat{F}_j + G_j\hat{\mu}_{S_j}\right) + \nabla_{x_i}\sigma_i\left(\hat{F}_i^x + G_i^x\hat{\mu}_{S_i}\right),$

$\omega_i^k \triangleq \sum_{j\in S_i}\nabla_{e_j}\sigma_i^k\left(\hat{F}_j^k + G_j^k\hat{\mu}_{S_j}^k\right) + \nabla_{x_i}\sigma_i^k\left(\hat{F}_i^{xk} + G_i^{xk}\hat{\mu}_{S_i}^k\right),$

where, for a function $\phi_i(E_i,\cdot)$, the notation $\phi_i^k$ indicates evaluation at $E_i = E_i^k$, i.e., $\phi_i^k \triangleq \phi_i\left(E_i^k,\cdot\right)$. The actor updates the policy weights using the following update law, derived based on a Lyapunov-based stability analysis:

$\dot{\hat{W}}_{ai} = -\eta_{a1i}\left(\hat{W}_{ai} - \hat{W}_{ci}\right) - \eta_{a2i}\hat{W}_{ai} + \frac{\eta_{c1i}}{4\rho_i}G_{\sigma i}^T R_i^{-1}G_{\sigma i}\hat{W}_{ai}\omega_i^T\hat{W}_{ci} + \sum_{k=1}^{M_i}\frac{\eta_{c2i}}{4M_i\rho_i^k}\left(G_{\sigma i}^k\right)^T R_i^{-1}G_{\sigma i}^k\hat{W}_{ai}\left(\omega_i^k\right)^T\hat{W}_{ci},$


where $\eta_{a1i}, \eta_{a2i}\in\mathbb{R}$ are constant positive learning gains. The following assumption facilitates simulation of experience.

Assumption 7.7. [97] For each $i\in\mathcal{N}$, there exists a finite set of $M_i$ points $\left\{E_i^k\right\}_{k=1}^{M_i}$ such that

$\underline{c}_i \triangleq \frac{1}{M_i}\inf_{t\in\mathbb{R}_{\ge 0}}\lambda_{\min}\left(\sum_{k=1}^{M_i}\frac{\omega_i^k(t)\left(\omega_i^k(t)\right)^T}{\rho_i^k(t)}\right) > 0,$

where $\lambda_{\min}$ denotes the minimum eigenvalue, and $\underline{c}_i\in\mathbb{R}$ is a positive constant.

7.7 Stability Analysis

To facilitate the stability analysis, the left-hand side of the HJ equation is subtracted from the approximate BE to express the BE in terms of the weight estimation errors as

$\hat\delta_{ti} = -\tilde{W}_{ci}^T\omega_i - W_i^T\nabla_{x_i}\sigma_i(E_i)\hat{F}_i^x(E_i,\tilde\theta_{S_i}) + \frac{1}{4}\tilde{W}_{ai}^T G_{\sigma i}^T R_i^{-1}G_{\sigma i}\tilde{W}_{ai} - \frac{1}{2}W_i^T G_{\sigma i}^T R_i^{-1}G_{\sigma i}\tilde{W}_{ai} - W_i^T\sum_{j\in S_i}\nabla_{e_j}\sigma_i(E_i)\hat{F}_j(E_j,\tilde\theta_{S_j}) + \frac{1}{2}W_i^T\sum_{j\in S_i}\nabla_{e_j}\sigma_i(E_i)G_j R_{S_j}\tilde{W}_{aS_j} + \frac{1}{2}W_i^T\nabla_{x_i}\sigma_i(E_i)G_i^x R_{S_i}\tilde{W}_{aS_i} + \Delta_i,$

where $\tilde{(\cdot)} \triangleq (\cdot) - \hat{(\cdot)}$, $\Delta_i = O\left(\bar\epsilon_{S_i}, \overline{\nabla\epsilon}_{S_i}, \bar\epsilon_{\theta S_i}\right)$, and $R_{S_j} \triangleq \mathrm{diag}\left(R_{1_j}^{-1}G_{\sigma 1_j}^T,\dots,R_{s_j{}_j}^{-1}G_{\sigma s_j{}_j}^T\right)$ is a block diagonal matrix. Consider a set of extended neighbors $S_p$ corresponding to the $p$-th agent. To analyze asymptotic properties of the agents in $S_p$, consider the following candidate Lyapunov function:

$V_{Lp}(Z_p,t) \triangleq \sum_{i\in S_p}V_{ti}(e_{S_i},t) + \sum_{i\in S_p}\frac{1}{2}\tilde{W}_{ci}^T\Gamma_i^{-1}\tilde{W}_{ci} + \sum_{i\in S_p}\frac{1}{2}\tilde{W}_{ai}^T\tilde{W}_{ai} + \sum_{i\in S_p}V_{0i}(\tilde{x}_i,\tilde\theta_i),$

where $Z_p\in\mathbb{R}^{ns_i+2L_is_i+n(P_i+1)s_i}$ is defined as

$Z_p \triangleq \left[e_{S_p}^T,\; \tilde{W}_{cS_p}^T,\; \tilde{W}_{aS_p}^T,\; \tilde{x}_{S_p}^T,\; \mathrm{vec}\left(\tilde\theta_{S_p}\right)^T\right]^T,$

$\mathrm{vec}(\cdot)$ denotes the vectorization operator, and $V_{ti}:\mathbb{R}^{ns_i}\times\mathbb{R}\to\mathbb{R}$ is defined as

$V_{ti}(e_{S_i},t) \triangleq V_i^*\left(\left[e_{S_i}^T,\, x_i^T(t)\right]^T\right),$


for all $e_{S_i}\in\mathbb{R}^{ns_i}$ and for all $t\in\mathbb{R}$. Since $V_{ti}$ depends on $t$ only through uniformly bounded leader trajectories, Lemmas 1 and 2 from [146] can be used to show that $V_{ti}$ is a positive definite and decrescent function. Thus, using Lemma 4.3 from [149], the following bounds on the candidate Lyapunov function are established:

$\underline{v}_{lp}\left(\left\|Z_p^o\right\|\right) \le V_{Lp}\left(Z_p^o,t\right) \le \overline{v}_{lp}\left(\left\|Z_p^o\right\|\right),$

for all $Z_p^o\in\mathbb{R}^{ns_i+2L_is_i+n(P_i+1)s_i}$ and for all $t$, where $\underline{v}_{lp}, \overline{v}_{lp}:\mathbb{R}\to\mathbb{R}$ are class $\mathcal{K}$ functions.

To facilitate the stability analysis, given any compact ball $\chi_p\subset\mathbb{R}^{ns_i+2L_is_i+n(P_i+1)s_i}$ of radius $r_p\in\mathbb{R}$ centered at the origin, a positive constant $\iota_p\in\mathbb{R}$ is defined as

$\iota_p \triangleq \sum_{i\in S_p}\left(\frac{\bar\epsilon_{\theta i}^2}{2k_i} + \frac{3\left(k_{\theta i}d_{\theta i} + \|A_i\|\|B_i\|\right)^2}{4k_{\theta i}\underline{\sigma}_i}\right) + \sum_{i\in S_p}\frac{5(\eta_{c1i}+\eta_{c2i})^2\left\|\frac{\omega_i}{\rho_i}\right\|^2\|\Delta_i\|^2}{4\eta_{c2i}\underline{c}_i} + \sum_{i\in S_p}\frac{1}{2}\left(\left\|\nabla_{x_i}V_i^*(E_i)G_i^x R_{S_i}G_{\epsilon S_i}\right\| + \sum_{j\in S_i}\left\|\nabla_{e_j}V_i^*(E_i)G_j R_{S_j}G_{\epsilon S_j}\right\|\right) + \sum_{i\in S_p}\frac{3\left(\frac{\eta_{c1i}+\eta_{c2i}}{4}\left\|W_i^T\frac{\omega_i}{\rho_i}\right\|\left\|W_i^T G_{\sigma i}^T R_i^{-1}G_{\sigma i}\right\| + \frac{1}{2}\|A_{a1i}\| + \eta_{a2i}\bar{W}_i\right)^2}{4(\eta_{a1i}+\eta_{a2i})} + \sum_{i\in S_p}\left(\sum_{j\in S_i}\left\|\nabla_{e_j}V_i^*(E_i)G_j\mathcal{B}_j\right\| + \left\|\nabla_{x_i}V_i^*(E_i)G_i^x B_i^x\right\|\right),$

where, for any function $\varpi:\mathbb{R}^l\to\mathbb{R}$, $l\in\mathbb{N}$, the notation $\|\varpi\|$ denotes $\sup_{y\in\chi_p}\|\varpi(y)\|$, and $A_i$, $B_i$, and $A_{a1i}$ are uniformly bounded state-dependent terms. Define a class $\mathcal{K}$ function $v_{lp}:\mathbb{R}\to\mathbb{R}$ as

$v_{lp}\left(\left\|Z_p\right\|\right) \triangleq \frac{1}{2}\sum_{i\in S_p}\underline{q}_i(\|e_i\|) + \frac{1}{2}\sum_{i\in S_p}\frac{\eta_{c2i}\underline{c}_i}{5}\left\|\tilde{W}_{ci}\right\|^2 + \frac{1}{2}\sum_{i\in S_p}\frac{\eta_{a1i}+\eta_{a2i}}{3}\left\|\tilde{W}_{ai}\right\|^2 + \frac{1}{2}\sum_{i\in S_p}\frac{k_i}{2}\left\|\tilde{x}_i\right\|^2 + \frac{1}{2}\sum_{i\in S_p}\frac{k_{\theta i}\underline{\sigma}_i}{3}\left\|\tilde\theta_i\right\|_F^2,$


where $\underline{q}_i:\mathbb{R}\to\mathbb{R}$ are class $\mathcal{K}$ functions such that $\underline{q}_i(\|e\|) \le Q_i(e)$, $\forall e\in\mathbb{R}^n$, $\forall i\in\mathcal{N}$. The sufficient gain conditions used in the subsequent Theorem 7.2 are

$\frac{\eta_{c2i}\underline{c}_i}{5} > \sum_{j\in S_p}3s_p\mathbb{1}_{\{j\in S_i\}}\frac{(\eta_{c1i}+\eta_{c2i})^2\left\|A_{aij}^1\right\|^2\left\|B_{aij}^1\right\|^2}{4k_{\theta j}\underline{\sigma}_j},$

$\frac{\eta_{a1i}+\eta_{a2i}}{3} > \sum_{j\in S_p}5s_p\mathbb{1}_{\{i\in S_j\}}\frac{(\eta_{c1j}+\eta_{c2j})^2\left\|A_{acji}^1\right\|^2}{16\eta_{c2j}\underline{c}_j} + \frac{5\eta_{a1i}^2}{4\eta_{c2i}\underline{c}_i} + \frac{(\eta_{c1i}+\eta_{c2i})\bar{W}_i\left\|\frac{\omega_i}{\rho_i}\right\|\left\|G_{\sigma i}^T R_i^{-1}G_{\sigma i}\right\|}{4},$

$\underline{v}_{lp}^{-1}(\iota_p) < \overline{v}_{lp}^{-1}\left(\underline{v}_{lp}(r_p)\right),$

where $A_{aij}^1$, $B_{aij}^1$, and $A_{acji}^1$ are uniformly bounded state-dependent terms.

Theorem 7.2. Provided Assumptions 7.1-7.7 hold and the sufficient gain conditions above are satisfied, the controller implemented through the inversion of $L_{gi}$, along with the actor and critic update laws and the system identifier with its CL-based weight update law, ensures that the local neighborhood tracking errors $e_i$ are ultimately bounded and that the policies $\hat{\mu}_i$ converge to a neighborhood around the feedback-Nash policies $\mu_i^*$ for all $i\in\mathcal{N}$.

Proof. The time derivative of the candidate Lyapunov function is given by

$\dot{V}_{Lp} = \sum_{i\in S_p}\dot{V}_{ti}(e_{S_i},t) - \frac{1}{2}\sum_{i\in S_p}\tilde{W}_{ci}^T\Gamma_i^{-1}\dot\Gamma_i\Gamma_i^{-1}\tilde{W}_{ci} - \sum_{i\in S_p}\tilde{W}_{ci}^T\Gamma_i^{-1}\dot{\hat{W}}_{ci} - \sum_{i\in S_p}\tilde{W}_{ai}^T\dot{\hat{W}}_{ai} + \sum_{i\in S_p}\dot{V}_{0i}(\tilde{x}_i,\tilde\theta_i).$

Using the update laws, the identifier error bound, and the definition of $V_{ti}$, the derivative can be bounded in terms of the closed-loop error and state dynamics, the identifier errors, and the critic and actor weight errors, as follows:


$\dot{V}_{Lp} \le -\sum_{i\in S_p}Q_i(e_i) - \frac{1}{2}\sum_{i\in S_p}\eta_{c1i}\tilde{W}_{ci}^T\frac{\omega_i\omega_i^T}{\rho_i}\tilde{W}_{ci} - \sum_{i\in S_p}\frac{\eta_{c2i}}{M_i}\tilde{W}_{ci}^T\sum_{k=1}^{M_i}\frac{\omega_i^k\left(\omega_i^k\right)^T}{\rho_i^k}\tilde{W}_{ci} - \sum_{i\in S_p}(\eta_{a1i}+\eta_{a2i})\tilde{W}_{ai}^T\tilde{W}_{ai} - \sum_{i\in S_p}k_i\left\|\tilde{x}_i\right\|^2 - \sum_{i\in S_p}k_{\theta i}\underline{\sigma}_i\left\|\tilde\theta_i\right\|_F^2$
$\quad + \frac{1}{2}\sum_{i\in S_p}\sum_{j\in S_i}\nabla_{e_j}V_i^*(E_i)G_j R_{S_j}\tilde{W}_{aS_j} + \frac{1}{2}\sum_{i\in S_p}\nabla_{x_i}V_i^*(E_i)G_i^x R_{S_i}\tilde{W}_{aS_i} + \frac{1}{2}\sum_{i\in S_p}\eta_{c1i}\tilde{W}_{ci}^T\frac{\omega_i}{\rho_i}W_i^T\nabla_{x_i}\sigma_i(E_i)G_i^x R_{S_i}\tilde{W}_{aS_i} + \sum_{i\in S_p}\eta_{a2i}\tilde{W}_{ai}^T W_i + \sum_{i\in S_p}\eta_{a1i}\tilde{W}_{ai}^T\tilde{W}_{ci}$
$\quad - \frac{1}{4}\sum_{i\in S_p}\eta_{c1i}W_i^T\frac{\omega_i}{\rho_i}W_i^T G_{\sigma i}^T R_i^{-1}G_{\sigma i}\tilde{W}_{ai} - \frac{1}{4}\sum_{i\in S_p}\frac{\eta_{c2i}}{M_i}W_i^T\sum_{k=1}^{M_i}\frac{\omega_i^k}{\rho_i^k}W_i^T\left(G_{\sigma i}^k\right)^T R_i^{-1}G_{\sigma i}^k\tilde{W}_{ai} + \frac{1}{4}\sum_{i\in S_p}\frac{\eta_{c2i}}{M_i}W_i^T\sum_{k=1}^{M_i}\frac{\omega_i^k}{\rho_i^k}\tilde{W}_{ai}^T\left(G_{\sigma i}^k\right)^T R_i^{-1}G_{\sigma i}^k\tilde{W}_{ai} + \sum_{i\in S_p}\eta_{c1i}W_i^T\frac{\omega_i}{4\rho_i}\tilde{W}_{ai}^T G_{\sigma i}^T R_i^{-1}G_{\sigma i}\tilde{W}_{ai}$
$\quad - \sum_{i\in S_p}\sum_{j\in S_i}\nabla_{e_j}V_i^*(E_i)G_j\hat{F}_j(E_j,\tilde\theta_{S_j}) - \sum_{i\in S_p}\nabla_{x_i}V_i^*(E_i)G_i^x\hat{F}_i^x(E_i,\tilde\theta_{S_i}) - \sum_{i\in S_p}\sum_{j\in S_i}\nabla_{e_j}V_i^*(E_i)G_j\mathcal{B}_j - \sum_{i\in S_p}\nabla_{x_i}V_i^*(E_i)G_i^x B_i^x + \frac{1}{2}\sum_{i\in S_p}\nabla_{x_i}V_i^*(E_i)G_i^x R_{S_i}G_{\epsilon S_i} + \frac{1}{2}\sum_{i\in S_p}\sum_{j\in S_i}\nabla_{e_j}V_i^*(E_i)G_j R_{S_j}G_{\epsilon S_j}$
$\quad + \sum_{i\in S_p}\eta_{c1i}\tilde{W}_{ci}^T\frac{\omega_i}{\rho_i}\Delta_i + \sum_{i\in S_p}\frac{\eta_{c2i}}{M_i}\tilde{W}_{ci}^T\sum_{k=1}^{M_i}\frac{\omega_i^k}{\rho_i^k}\Delta_i^k - \sum_{i\in S_p}\eta_{c1i}\tilde{W}_{ci}^T\frac{\omega_i}{\rho_i}W_i^T\left(\sum_{j\in S_i}\nabla_{e_j}\sigma_i\hat{F}_j(E_j,\tilde\theta_{S_j}) + \nabla_{x_i}\sigma_i\hat{F}_i^x(E_i,\tilde\theta_{S_i})\right) - \sum_{i\in S_p}\frac{\eta_{c2i}}{M_i}\tilde{W}_{ci}^T\sum_{k=1}^{M_i}\frac{\omega_i^k}{\rho_i^k}W_i^T\left(\sum_{j\in S_i}\nabla_{e_j}\sigma_i^k\hat{F}_j^k\left(E_j^k,\tilde\theta_{S_j}\right) + \nabla_{x_i}\sigma_i^k\hat{F}_i^{xk}\left(E_i^k,\tilde\theta_{S_i}\right)\right)$
$\quad + \frac{1}{2}\sum_{i\in S_p}\eta_{c1i}\tilde{W}_{ci}^T\frac{\omega_i}{\rho_i}W_i^T\sum_{j\in S_i}\nabla_{e_j}\sigma_i G_j R_{S_j}\tilde{W}_{aS_j} + \frac{1}{2}\sum_{i\in S_p}\frac{\eta_{c2i}}{M_i}\tilde{W}_{ci}^T\sum_{k=1}^{M_i}\frac{\omega_i^k}{\rho_i^k}W_i^T\left(\sum_{j\in S_i}\nabla_{e_j}\sigma_i^k G_j^k R_{S_j}^k\tilde{W}_{aS_j} + \nabla_{x_i}\sigma_i^k G_i^{xk}R_{S_i}^k\tilde{W}_{aS_i}\right) + \sum_{i\in S_p}k_{\theta i}d_{\theta i}\left\|\tilde\theta_i\right\|_F + \sum_{i\in S_p}\bar\epsilon_{\theta i}\left\|\tilde{x}_i\right\|.$

PAGE 146

Using the Cauchy-Schwarz inequality, the triangle inequality, and completion of squares, the derivative can be bounded as $\dot{V}_{Lp} \leq -v_{lp}\left(\|Z_p\|\right)$ for all $Z_p\in\chi_p$ such that $\|Z_p\| \geq \underline{v}_{lp}^{-1}\left(\iota_p\right)$. Using the bounds developed above, the sufficient gain conditions, and this derivative bound, Theorem 4.18 in [149] can be invoked to conclude that every trajectory $Z_p(t)$ satisfying $\|Z_p(t_0)\| \leq \overline{v}_{lp}^{-1}\left(\underline{v}_{lp}\left(r_p\right)\right)$ is bounded for all $t\in\mathbb{R}_{\geq t_0}$ and satisfies
$$\limsup_{t\to\infty}\|Z_p(t)\| \leq \underline{v}_{lp}^{-1}\left(\overline{v}_{lp}\left(\underline{v}_{lp}^{-1}\left(\iota_p\right)\right)\right).$$
Since the choice of the subgraph $S_p$ was arbitrary, the neighborhood tracking errors $e_i$ are ultimately bounded for all $i\in\mathcal{N}$. Furthermore, the weight estimates $\hat{W}_{ai}$ converge to a neighborhood of the ideal weights $W_i$; hence, invoking Theorem 7.1, the policies $\hat{\mu}_i$ converge to a neighborhood of the feedback-Nash equilibrium policies $\mu_i^*$ for all $i\in\mathcal{N}$.

7.8 Simulations

This section provides two simulation examples to demonstrate the applicability of the developed technique. The agents in both examples are assumed to have the communication topology shown in Figure 7-1, with unit pinning gains and edge weights. The motion of the agents in the first example is described by identical nonlinear one-dimensional dynamics, and the motion of the agents in the second example is described by identical nonlinear two-dimensional dynamics.

Figure 7-1. Communication topology: a network containing five agents.

Figure 7-2. State trajectories for the five agents for the one-dimensional example. The dotted lines show the desired state trajectories.

7.8.1 One-dimensional Example

The dynamics of all the agents are selected to be of the control-affine form considered in this chapter, where $f_i(x_i) = \theta_{i1}x_i + \theta_{i2}x_i^2$ and $g_i(x_i) = \cos(2x_i) + 2$ for all $i = 1,\ldots,5$. The ideal values of the unknown parameters are selected to be $\theta_{i1} = 0$ and $\theta_{i2} = 1$ for all $i$. The agents start at $x_i = 2$ for all $i$, and their final desired locations with respect to each other are given by $x_{d12} = 0.5$, $x_{d21} = -0.5$, $x_{d43} = -0.5$, and $x_{d53} = -0.5$. The leader traverses an exponentially decaying trajectory $x_0(t) = e^{-0.1t}$. The desired positions of agents 1 and 3 with respect to the leader are $x_{d10} = 0.75$ and $x_{d30} = 1$, respectively.
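A minimal sketch of these agent dynamics and the leader trajectory is given below (Python is used purely for illustration; the integration scheme and function names are hypothetical and not from the original simulations).

```python
import numpy as np

THETA = np.array([0.0, 1.0])  # ideal drift parameters [theta_i1, theta_i2]

def f(x, theta=THETA):
    """Drift dynamics f_i(x_i) = theta_i1*x_i + theta_i2*x_i^2."""
    return theta[0] * x + theta[1] * x**2

def g(x):
    """Control effectiveness g_i(x_i) = cos(2*x_i) + 2, bounded away from zero."""
    return np.cos(2.0 * x) + 2.0

def leader(t):
    """Exponentially decaying leader trajectory x_0(t) = exp(-0.1*t)."""
    return np.exp(-0.1 * t)

def step(x, u, dt=1e-3):
    """One forward-Euler step of x_dot = f(x) + g(x)*u."""
    return x + dt * (f(x) + g(x) * u)
```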
Table 7-1. Simulation parameters for the one-dimensional example.

Parameter    Agent 1     Agent 2     Agent 3     Agent 4     Agent 5
Q_i          10          10          10          10          10
R_i          0.1         0.1         0.1         0.1         0.1
x_i(0)       2           2           2           2           2
x̂_i(0)       0           0           0           0           0
Ŵ_ci(0)      1_{4×1}     1_{4×1}     1_{4×1}     1_{5×1}     3·1_{8×1}
Ŵ_ai(0)      1_{4×1}     1_{4×1}     1_{4×1}     1_{5×1}     3·1_{8×1}
θ̂_i(0)       0_{2×1}     0_{2×1}     0_{2×1}     0_{2×1}     0_{2×1}
Γ_ci(0)      500 I_4     500 I_4     500 I_4     500 I_5     500 I_8
η_c1i        0.1         0.1         0.1         0.1         0.1
η_c2i        10          10          10          10          10
η_a1i        5           5           5           5           5
η_a2i        0.1         0.1         0.1         0.1         0.1
ν_i          0.005       0.005       0.005       0.005       0.005
Γ_θi         I_2         0.8 I_2     I_2         I_2         I_2
k_i          500         500         500         500         500
k_θi         30          30          25          20          30

The basis functions are selected as
$$\sigma_1(E_1) = \tfrac{1}{2}\left[e_1^2,\ \tfrac{1}{2}e_1^4,\ e_1^2x_1^2,\ e_2^2\right]^T, \quad \sigma_2(E_2) = \tfrac{1}{2}\left[e_2^2,\ \tfrac{1}{2}e_2^4,\ e_2^2x_2^2,\ e_1^2\right]^T, \quad \sigma_3(E_3) = \tfrac{1}{2}\left[e_3^2,\ \tfrac{1}{2}e_3^4,\ e_3^2x_3^2,\ \tfrac{1}{2}e_3^4x_3^2\right]^T,$$
$$\sigma_4(E_4) = \tfrac{1}{2}\left[e_4^2,\ \tfrac{1}{2}e_4^4,\ e_3^2e_4^2,\ e_4^2x_4^2,\ e_3^2\right]^T, \quad \sigma_5(E_5) = \tfrac{1}{2}\left[e_5^2,\ \tfrac{1}{2}e_5^4,\ e_4^2e_5^2,\ e_3^2e_5^2,\ e_5^2x_5^2,\ e_3^2e_4^2,\ e_3^2,\ e_4^2\right]^T.$$
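As an illustration, the basis vector for agent 1 can be evaluated as follows (the helper name and test values are hypothetical; the vector layout mirrors σ_1(E_1) in Table 7-1).

```python
import numpy as np

def sigma_1(e1, x1, e2):
    """Critic basis for agent 1: 0.5*[e1^2, 0.5*e1^4, e1^2*x1^2, e2^2]^T.

    e1 and e2 are local neighborhood tracking errors and x1 is agent 1's
    state, stacked from the extended-neighbor information E_1.
    """
    return 0.5 * np.array([e1**2, 0.5 * e1**4, e1**2 * x1**2, e2**2])

# Example evaluation at the initial state x_1 = 2 with unit errors:
print(sigma_1(1.0, 2.0, 1.0))
```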
Figure 7-3. Tracking error trajectories for the agents for the one-dimensional example.

Table 7-1 summarizes the optimal control problem parameters, basis functions, and adaptation gains for the agents. For each agent $i$, five values of $e_i$, three values of $x_i$, and three values of the errors corresponding to all the extended neighbors are selected for BE extrapolation, resulting in $5\cdot 3^{s_i}$ total values of $E_i$. All agents estimate the unknown drift parameters using history stacks containing thirty points recorded online using a singular value maximizing algorithm (cf. [93]), and compute the required state derivatives using a fifth-order Savitzky-Golay smoothing filter (cf. [150]).

Figures 7-2-7-4 show the tracking errors, the state trajectories compared with the desired trajectories, and the control inputs for all the agents, demonstrating convergence to the desired formation and the desired trajectory. Note that agents 2, 4, and 5 do not have a communication link to the leader, nor do they know their desired relative position from the leader. The convergence to the desired formation is achieved via cooperative control based on decentralized objectives. Figures 7-5-7-9 show the evolution and convergence of the value function weights and the unknown parameters in the drift dynamics.
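To make the extrapolation-grid construction concrete, the following sketch assembles a set of BE-extrapolation points (illustrative only: the interval bounds are hypothetical, and a single extended neighbor is assumed).

```python
import itertools
import numpy as np

# Candidate values for BE extrapolation: five tracking-error values,
# three state values, and three values per extended-neighbor error.
e_vals = np.linspace(-2.0, 2.0, 5)
x_vals = np.linspace(-2.0, 2.0, 3)
neighbor_e_vals = np.linspace(-2.0, 2.0, 3)

# The Cartesian product gives a fixed set of extrapolation points E_i at
# which the Bellman error is evaluated using the identified model.
grid = [np.array(p) for p in itertools.product(e_vals, x_vals, neighbor_e_vals)]
print(len(grid))  # 5 * 3 * 3 = 45 points for one extended neighbor
```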
Figure 7-4. Trajectories of the control input and the relative control error for all agents for the one-dimensional example.

Figure 7-5. Value function weights and drift dynamics parameter estimates for agent 1 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters.

Figure 7-6. Value function weights and drift dynamics parameter estimates for agent 2 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters.

Figure 7-7. Value function weights and drift dynamics parameter estimates for agent 3 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters.

Figure 7-8. Value function weights and drift dynamics parameter estimates for agent 4 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters.

Figure 7-9. Value function weights and drift dynamics parameter estimates for agent 5 for the one-dimensional example. The dotted lines in the drift parameter plot are the ideal values of the drift parameters.
Figure 7-10. Phase portrait in the state space for the two-dimensional example. The actual pentagonal formation is represented by a solid black pentagon, and the desired pentagonal formation around the leader is represented by a dotted black pentagon.

7.8.2 Two-dimensional Example

In this simulation, the dynamics of all the agents are assumed to be exactly known and are selected to be of the control-affine form considered in this chapter, where, for all $i = 1,\ldots,5$,
$$f_i(x_i) = \begin{bmatrix} -x_{i1} + x_{i2} \\ -0.5x_{i1} - 0.5x_{i2}\left(1 - \left(\cos(2x_{i1}) + 2\right)^2\right) \end{bmatrix}, \qquad g_i(x_i) = \begin{bmatrix} \sin(2x_{i1}) + 2 & 0 \\ 0 & \cos(2x_{i1}) + 2 \end{bmatrix}.$$
The agents start at the origin, and their final desired relative positions are given by $x_{d12} = [-0.5,\ 1]^T$, $x_{d21} = [0.5,\ -1]^T$, $x_{d43} = [0.5,\ 1]^T$, and $x_{d53} = [-1,\ 1]^T$.
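A minimal sketch of the two-dimensional dynamics and the sinusoidal leader trajectory (defined later in this subsection) follows; the function names are placeholders.

```python
import numpy as np

def f(x):
    """Drift dynamics for the two-dimensional example."""
    x1, x2 = x
    return np.array([
        -x1 + x2,
        -0.5 * x1 - 0.5 * x2 * (1.0 - (np.cos(2.0 * x1) + 2.0) ** 2),
    ])

def g(x):
    """Diagonal control-effectiveness matrix."""
    x1, _ = x
    return np.diag([np.sin(2.0 * x1) + 2.0, np.cos(2.0 * x1) + 2.0])

def leader(t):
    """Sinusoidal leader trajectory x_0(t) = [2 sin t, 2 sin t + 2 cos t]^T."""
    return np.array([2.0 * np.sin(t), 2.0 * np.sin(t) + 2.0 * np.cos(t)])
```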
Table 7-2. Simulation parameters for the two-dimensional example.

Parameter    Agent 1      Agent 2      Agent 3      Agent 4      Agent 5
Q_i          10 I_2       10 I_2       10 I_2       10 I_2       10 I_2
R_i          I_2          I_2          I_2          I_2          I_2
x_i(0)       0_{2×1}      0_{2×1}      0_{2×1}      0_{2×1}      0_{2×1}
Ŵ_ci(0)      1_{10×1}     1_{10×1}     2·1_{7×1}    5·1_{10×1}   3·1_{13×1}
Ŵ_ai(0)      1_{10×1}     1_{10×1}     2·1_{7×1}    5·1_{10×1}   3·1_{13×1}
Γ_ci(0)      500 I_10     500 I_10     500 I_7      500 I_10     500 I_13
η_c1i        0.1          0.1          0.1          0.1          0.1
η_c2i        2.5          5            2.5          2.5          2.5
η_a1i        2.5          0.5          2.5          2.5          2.5
η_a2i        0.01         0.01         0.01         0.01         0.01
ν_i          0.005        0.005        0.005        0.005        0.005

The basis functions are selected as
$$\sigma_1(E_1) = \tfrac{1}{2}\left[2e_{11}^2,\ 2e_{11}e_{12},\ 2e_{12}^2,\ e_{21}^2,\ 2e_{21}e_{22},\ e_{22}^2,\ e_{11}^2x_{11}^2,\ e_{12}^2x_{11}^2,\ e_{11}^2x_{12}^2,\ e_{12}^2x_{12}^2\right]^T,$$
$$\sigma_2(E_2) = \tfrac{1}{2}\left[2e_{21}^2,\ 2e_{21}e_{22},\ 2e_{22}^2,\ e_{11}^2,\ 2e_{11}e_{12},\ e_{12}^2,\ e_{21}^2x_{21}^2,\ e_{22}^2x_{21}^2,\ e_{21}^2x_{22}^2,\ e_{22}^2x_{22}^2\right]^T,$$
$$\sigma_3(E_3) = \tfrac{1}{2}\left[2e_{31}^2,\ 2e_{31}e_{32},\ 2e_{32}^2,\ e_{31}^2x_{31}^2,\ e_{32}^2x_{31}^2,\ e_{31}^2x_{32}^2,\ e_{32}^2x_{32}^2\right]^T,$$
$$\sigma_4(E_4) = \tfrac{1}{2}\left[2e_{41}^2,\ 2e_{41}e_{42},\ 2e_{42}^2,\ e_{31}^2,\ 2e_{31}e_{32},\ e_{32}^2,\ e_{41}^2x_{41}^2,\ e_{42}^2x_{41}^2,\ e_{41}^2x_{42}^2,\ e_{42}^2x_{42}^2\right]^T,$$
$$\sigma_5(E_5) = \tfrac{1}{2}\left[2e_{51}^2,\ 2e_{51}e_{52},\ 2e_{52}^2,\ e_{41}^2,\ 2e_{41}e_{42},\ e_{42}^2,\ e_{31}^2,\ 2e_{31}e_{32},\ e_{32}^2,\ e_{51}^2x_{51}^2,\ e_{52}^2x_{51}^2,\ e_{51}^2x_{52}^2,\ e_{52}^2x_{52}^2\right]^T.$$
Figure 7-11. Phase portrait of all agents in the error space for the two-dimensional example.

The relative positions are designed such that the final desired formation is a pentagon with the leader node at the center. The leader traverses a sinusoidal trajectory $x_0(t) = \left[2\sin(t),\ 2\sin(t) + 2\cos(t)\right]^T$. The desired positions of agents 1 and 3 with respect to the leader are $x_{d10} = [-1,\ 0]^T$ and $x_{d30} = [0.5,\ -1]^T$, respectively.

Table 7-2 summarizes the optimal control problem parameters, basis functions, and adaptation gains for the agents. For each agent $i$, nine values of $e_i$, $x_i$, and the errors corresponding to all the extended neighbors are selected for BE extrapolation on a uniform $3\times 3$ grid in a $1\times 1$ square around the origin, resulting in $9\cdot 9^{s_i}$ total values of $E_i$.

Figures 7-10-7-16 show the tracking errors, the state trajectories, and the control inputs for all the agents, demonstrating convergence to the desired formation and the desired trajectory. Note that agents 2, 4, and 5 do not have a communication link to the leader, nor do they know their desired relative position from the leader. The convergence to the desired formation is achieved via cooperative control based on decentralized objectives.
Figure 7-12. Trajectories of the control input and the relative control error for Agent 1 for the two-dimensional example.

Figure 7-13. Trajectories of the control input and the relative control error for Agent 2 for the two-dimensional example.

Figure 7-14. Trajectories of the control input and the relative control error for Agent 3 for the two-dimensional example.

Figure 7-15. Trajectories of the control input and the relative control error for Agent 4 for the two-dimensional example.

Figure 7-16. Trajectories of the control input and the relative control error for Agent 5 for the two-dimensional example.

Figure 7-17. Value function weights and policy weights for agent 1 for the two-dimensional example.

Figure 7-18. Value function weights and policy weights for agent 2 for the two-dimensional example.

Figure 7-19. Value function weights and policy weights for agent 3 for the two-dimensional example.

Figure 7-20. Value function weights and policy weights for agent 4 for the two-dimensional example.

Figure 7-21. Value function weights and policy weights for agent 5 for the two-dimensional example.
Figures 7-17-7-21 show the evolution and convergence of the value function weights and the policy weights for all the agents. Since an alternative method to solve this problem is not available to the best of the author's knowledge, a comparative simulation cannot be provided.

7.9 Concluding Remarks

A simulation-based ACI architecture is developed to cooperatively control a group of agents to track a trajectory while maintaining a desired formation. Communication among extended neighbors is needed to implement the developed method. Since an analytical feedback-Nash equilibrium solution is not available, the presented simulation does not demonstrate convergence to feedback-Nash equilibrium solutions. To the best of the author's knowledge, alternative methods to solve differential graphical game problems are not available in the literature; hence, a comparative simulation is infeasible.
CHAPTER 8
CONCLUSIONS

RL is a powerful tool for online learning and optimization; however, the application of RL to dynamical systems is challenging from a control theory perspective. The challenges take three different forms: analysis and design challenges, applicability challenges, and implementation challenges.

Since the controller is simultaneously learned and used online, unique analysis challenges arise in establishing stability during the learning phase. Furthermore, RL-based controllers are hard to design owing to the necessary tradeoffs between exploration and exploitation, which also complicate the stability analysis owing to the fact that, in general, the learned controller does not meet the exploration demands, necessitating the addition of an exploration signal. In the case of deterministic nonlinear systems, an explicit characterization of the necessary exploration signals is hard to obtain; hence, the exploration signal is generally left out of the stability analysis, defeating the purpose of the stability analysis.

Applicability challenges spring from the fact that RL in continuous-state systems is usually realized using value function approximation. Since the action that a controller takes in a particular state depends on the value of that state, the control policy depends on the value function; hence, a uniform approximation of the value function over the entire operating domain is vital for the control design. Results that use parametric approximation techniques for value function approximation are ubiquitous in the literature. Since parametric approximators can only generate uniform approximations over compact domains, approximation becomes challenging if the value function is time-varying and if the time horizon is infinite. Hence, traditional RL methods are not applicable to trajectory tracking applications, network control applications, and other applications that exhibit time-varying value functions.
The results of this dissertation partially address the aforementioned challenges via the development of new, innovative model-based RL methods and rigorous Lyapunov-based methods for stability analysis. In Chapter 3, a data-driven model-based RL technique that does not require an added exploration signal is developed to solve infinite-horizon total-cost optimal regulation problems for uncertain control-affine nonlinear systems. In Chapter 4, the data-driven model-based RL technique is extended to obtain feedback-Nash equilibrium solutions to N-player nonzero-sum differential games, without external ad-hoc application of an exploration signal. In Chapters 3 and 4, sufficient exploration is simulated by using an estimate of the system dynamics, obtained using a data-driven system identifier, to extrapolate the BE to unexplored areas of the state space. A set of points in the state space is selected a priori for BE extrapolation, and the value function is approximated using a time-varying regressor matrix computed based on the selected points. The developed result relies on a sufficient condition on the minimum eigenvalue of a time-varying regressor matrix. While this condition can be heuristically satisfied by choosing enough points, and can be easily verified online, it cannot, in general, be guaranteed a priori. Further research is required to investigate the existence of a set of points that guarantees that the resulting regressor matrix has a uniformly positive minimum singular value. The fact that the convergence rate of the value function approximation depends on the aforementioned minimum singular value motivates further research into a priori selection of, and online adjustments to, the set of points used for BE extrapolation. For example, threshold-based algorithms can be employed to ensure sufficient exploration by selecting new points if the minimum singular value of the regressor falls below a certain threshold.
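A sketch of such a threshold-based point-selection rule is given below (illustrative only; the resampling strategy, the threshold value, and the helper names are hypothetical design choices).

```python
import numpy as np

SIGMA_MIN_THRESHOLD = 0.1  # hypothetical excitation threshold

def regressor_matrix(points, omega):
    """Stack the BE-extrapolation regressors for the current point set;
    omega(p) returns the normalized regressor vector at point p."""
    return np.column_stack([omega(p) for p in points])

def adjust_points(points, omega, sample_new_point, max_tries=100):
    """Resample extrapolation points whenever the regressor matrix loses
    excitation, i.e., its minimum singular value drops below the threshold."""
    for _ in range(max_tries):
        s_min = np.linalg.svd(regressor_matrix(points, omega), compute_uv=False).min()
        if s_min >= SIGMA_MIN_THRESHOLD:
            break
        # Replace a randomly chosen point with a freshly sampled one.
        points[np.random.randint(len(points))] = sample_new_point()
    return points
```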
In Chapter 5, RL-based methods are extended to a class of infinite-horizon optimal trajectory tracking problems where the value function is time-varying. Provided that the desired trajectory is the output of an autonomous dynamical system, the optimal control problem can be formulated so that the value function depends on time only through the desired trajectory. Value function approximation is then achieved by using the desired trajectory, along with the tracking error, as training inputs. A Lyapunov-based stability analysis is developed by proving that the time-varying value function is a Lyapunov function for the optimal closed-loop error system. The developed result relies on the assumption that a steady-state controller that can make the system exactly track the desired trajectory exists, and that it can be computed by inversion of the system dynamics. Inversion of the system dynamics requires exact model knowledge. Motivated by the need to obtain an optimal tracking solution for uncertain systems, a data-driven system identifier is developed for approximate model inversion in Chapter 6. The data-driven system identifier is also used to extrapolate the BE, thereby removing the need for an added exploration signal from the tracking controller developed in Chapter 5. The developed technique requires knowledge of the dynamics of the desired trajectory. The fact that in many real-world control applications the desired trajectory is generated online using a trajectory planner module motivates the development of an optimal tracking controller that is robust to uncertainties in the dynamics of the desired trajectory. Further research is required to apply RL-based methods to time-varying systems that cannot be transformed into stationary systems on compact domains using state augmentation. In adaptive control, it is generally possible to formulate the control problem such that PE along the desired trajectory is sufficient to achieve parameter convergence. In the ADP-based tracking problem, PE along the desired trajectory would be sufficient to achieve parameter convergence if the BE could be formulated in terms of the desired trajectories. Achieving such a formulation is not trivial and is a subject for future research.

In Chapter 7, the RL-based methods are extended to obtain feedback-Nash equilibrium solutions to a class of differential graphical games using ideas from Chapters 3-6. It is established that in a cooperative game based on minimization of the local neighborhood tracking errors, the value function corresponding to each agent depends on information obtained from all of its extended neighbors.
A set of coupled HJ equations is developed that serves as necessary and sufficient conditions for feedback-Nash equilibrium, and closed-form expressions for the feedback-Nash equilibrium policies are developed based on the HJ equations. The fact that the developed technique requires each agent to communicate with all of its extended neighbors motivates the search for a decentralized method to generate feedback-Nash equilibrium policies.

In all the chapters of this dissertation, parametric approximation techniques are used to approximate the value functions. Parametric approximation of the value function requires selection of appropriate basis functions. Selection of basis functions for general nonlinear systems is a nontrivial open problem, even if the system dynamics are known. Implementation of RL-based controllers for general nonlinear systems is difficult because the basis functions and the exploration signal need to be selected by trial and error, with very little insight to be gained from domain knowledge about the system. Note that a uniform approximation of the value function over the entire domain is required only if an optimal controller is desired. For real-time sub-optimal control, a good approximation of the value function over a small neighborhood of the current state is sufficient. This motivates the development of basis functions that follow the system state and are capable of approximating the value function over a small domain. Analysis of the convergence and stability issues arising from the use of moving basis functions is a subject for future research.
APPENDIX A
ONLINE DATA COLLECTION (CH. 3)

The history stack $H_{id}$ that satisfies the conditions in Chapter 3 can be collected online provided the developed controller results in the system states being sufficiently exciting over a finite time interval $[t_0, t_0+\overline{t}]\subset\mathbb{R}$.¹ During this finite time interval, since a history stack is not available, an adaptive update law that ensures fast convergence of $\tilde{\theta}$ to zero without PE cannot be developed. Hence, the system dynamics cannot be directly estimated without PE. Since extrapolation of the BE to unexplored areas of the state space requires estimates of the system dynamics, without PE, such extrapolation is infeasible during the time interval $[t_0, t_0+\overline{t}]$.

However, evaluation of the BE along the system trajectories does not explicitly depend on the parameters $\theta$. Estimation of the state derivative is enough to evaluate the BE along system trajectories. This motivates the development of the following state derivative estimator:
$$\dot{\hat{x}}_f = gu + k_f\tilde{x}_f + \nu_f, \qquad \dot{\nu}_f = \left(k_f\alpha_f + 1\right)\tilde{x}_f, \tag{A-1}$$
where $\hat{x}_f\in\mathbb{R}^n$ is an estimate of the state $x$, $\tilde{x}_f \triangleq x - \hat{x}_f$, and $k_f, \alpha_f\in\mathbb{R}_{>0}$ are constant estimation gains. To facilitate the stability analysis, define a filtered error signal $r\in\mathbb{R}^n$ as $r \triangleq \dot{\tilde{x}}_f + \alpha_f\tilde{x}_f$, where $\dot{\tilde{x}}_f \triangleq \dot{x} - \dot{\hat{x}}_f$. Using the system dynamics and (A-1), the dynamics of the filtered error signal can be expressed as $\dot{r} = -k_f r + \tilde{x}_f + \nabla_x f\, f + \nabla_x f\, g\hat{u} + \alpha_f\dot{\tilde{x}}_f$.

¹To collect the history stack, the first $M$ values of the state, the control, and the corresponding numerically computed state derivative are added to the history stack. Then, the existing values are progressively replaced with new values using a singular value maximization algorithm.
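A minimal discrete-time sketch of the estimator in (A-1) follows (illustrative only: forward-Euler integration is used, and the gain values are hypothetical).

```python
import numpy as np

K_F, ALPHA_F = 50.0, 5.0  # hypothetical estimation gains

def estimator_step(x, x_hat, nu, g_x, u, dt=1e-3):
    """One Euler step of the state-derivative estimator (A-1):
    x_hat_dot = g(x)u + k_f*x_tilde + nu,  nu_dot = (k_f*alpha_f + 1)*x_tilde.
    Returns the updated filter states and the derivative estimate x_hat_dot."""
    x_tilde = x - x_hat
    x_hat_dot = g_x @ u + K_F * x_tilde + nu
    nu_dot = (K_F * ALPHA_F + 1.0) * x_tilde
    return x_hat + dt * x_hat_dot, nu + dt * nu_dot, x_hat_dot
```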
The instantaneous BE can be approximated along the state trajectory using the state derivative estimate as
$$\hat{\delta}_f = \omega_f^T\hat{W}_{cf} + x^TQx + \hat{u}^T\left(x,\hat{W}_{af}\right)R\,\hat{u}\left(x,\hat{W}_{af}\right), \tag{A-2}$$
where $\omega_f\in\mathbb{R}^L$ is the regressor vector defined as $\omega_f \triangleq \nabla\sigma\,\dot{\hat{x}}_f$. During the interval $[t_0, t_0+\overline{t}]$, the value function and the policy weights can be learned based on the approximate BE in (A-2) provided the system states are exciting, i.e., if the following assumption is satisfied.

Assumption A.1. There exists a time interval $[t_0, t_0+\overline{t}]\subset\mathbb{R}$ and positive constants $\underline{\lambda}, T\in\mathbb{R}$ such that closed-loop trajectories of the system with the developed controller, along with the weight update laws
$$\dot{\hat{W}}_{cf} = -\eta_{cf}\Gamma_f\frac{\omega_f}{\rho_f}\hat{\delta}_f, \qquad \dot{\Gamma}_f = \beta_f\Gamma_f - \eta_{cf}\Gamma_f\frac{\omega_f\omega_f^T}{\rho_f}\Gamma_f, \qquad \dot{\hat{W}}_{af} = -\eta_{a1f}\left(\hat{W}_{af} - \hat{W}_{cf}\right) - \eta_{a2f}\hat{W}_{af}, \tag{A-3}$$
where $\rho_f \triangleq 1 + \nu_f\omega_f^T\Gamma_f\omega_f$ is the normalization term, $\eta_{a1f}, \eta_{a2f}, \eta_{cf}, \nu_f\in\mathbb{R}$ are constant positive gains, and $\Gamma_f\in\mathbb{R}^{L\times L}$ is the least-squares gain matrix, and the state derivative estimator in (A-1) satisfy
$$\underline{\lambda}I_L \leq \int_t^{t+T}\psi_f\psi_f^T\,d\tau, \qquad \forall t\in[t_0, t_0+\overline{t}], \tag{A-4}$$
where $\psi_f \triangleq \frac{\omega_f}{\sqrt{1+\nu_f\omega_f^T\Gamma_f\omega_f}}\in\mathbb{R}^L$ is the normalized regressor vector. Furthermore, there exists a set of time instances $\{t_1, \ldots, t_M\}\subset[t_0, t_0+\overline{t}]$ such that the history stack $H_{id}$, containing the values of state-action pairs and the corresponding numerical derivatives recorded at $\{t_1, \ldots, t_M\}$, satisfies the conditions in Assumption 3.1.

Conditions similar to (A-4) are ubiquitous in the online approximate optimal control literature. In fact, Assumption A.1 requires the regressor $\psi_f$ to be exciting over a finite time interval, whereas the PE conditions used in related results such as [57-59, 114, 158] require similar regressor vectors to be exciting over all $t\in\mathbb{R}_{\geq t_0}$.
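To illustrate how (A-2) is evaluated, a sketch follows; it assumes the standard actor parameterization $\hat{u} = -\frac{1}{2}R^{-1}g^T\nabla\sigma^T\hat{W}_{af}$ used in Chapter 3, and the function handles are placeholders.

```python
import numpy as np

def approx_bellman_error(x, x_hat_dot, W_c_hat, W_a_hat, grad_sigma, g, Q, R):
    """Approximate BE (A-2) evaluated along the state trajectory:
    delta_f = omega_f^T W_c_hat + x^T Q x + u_hat^T R u_hat,
    where omega_f = grad_sigma(x) @ x_hat_dot is the regressor and
    u_hat = -0.5 R^{-1} g(x)^T grad_sigma(x)^T W_a_hat is the actor policy."""
    Gs = grad_sigma(x)                      # L x n Jacobian of the basis sigma
    omega_f = Gs @ x_hat_dot                # regressor vector
    u_hat = -0.5 * np.linalg.solve(R, g(x).T @ Gs.T @ W_a_hat)
    return omega_f @ W_c_hat + x @ (Q @ x) + u_hat @ (R @ u_hat)
```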
On any compact set $\chi\subset\mathbb{R}^n$, the function $f$ is Lipschitz continuous; hence, there exist positive constants $L_f, L_{df}\in\mathbb{R}$ such that $\|f(x)\| \leq L_f\|x\|$ and $\|\nabla_x f(x)\| \leq L_{df}$ for all $x\in\chi$. The update laws in (A-3), along with the excitation condition in (A-4), ensure that the adaptation gain matrix is bounded such that
$$\underline{\Gamma}_f \leq \|\Gamma_f(t)\| \leq \overline{\Gamma}_f, \qquad \forall t\in\mathbb{R}_{\geq t_0}, \tag{A-5}$$
where (cf. [91, Proof of Corollary 4.3.2])
$$\underline{\Gamma}_f = \min\left\{\eta_{cf}\underline{\lambda}T,\ \lambda_{\min}\left(\Gamma_f(t_0)\right)\right\}e^{-\beta_f T}. \tag{A-6}$$
Positive constants $\vartheta_8$, $\vartheta_9$, $\vartheta_{10}$, and $\iota_f$, which depend on the Lipschitz bounds, the bound $\overline{W}$ on the ideal weights, and the learning gains, are defined for brevity of notation, along with
$$v_{lf} = \frac{1}{2}\min\left\{\frac{q}{2},\ \frac{\underline{\Gamma}_f}{4},\ \frac{\eta_{a1f}+\eta_{a2f}}{3},\ \frac{\alpha_f}{3},\ \frac{k_f}{5}\right\}.$$
To facilitate the stability analysis, let $V_{Lf}:\mathbb{R}^{3n+2L}\times\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ be a continuously differentiable, positive definite candidate Lyapunov function defined as
$$V_{Lf}\left(Z_f,t\right) \triangleq V^*(x) + \frac{1}{2}\tilde{W}_{cf}^T\Gamma_f^{-1}(t)\tilde{W}_{cf} + \frac{1}{2}\tilde{W}_{af}^T\tilde{W}_{af} + \frac{1}{2}\tilde{x}_f^T\tilde{x}_f + \frac{1}{2}r^Tr. \tag{A-7}$$
Using the fact that $V^*$ is positive definite, (A-5) and Lemma 4.3 from [149] yield
$$\underline{v}_{lf}\left(\|Z_f\|\right) \leq V_{Lf}\left(Z_f,t\right) \leq \overline{v}_{lf}\left(\|Z_f\|\right), \tag{A-8}$$
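The finite excitation condition (A-4) can be checked numerically; the sketch below is one way to do so (rectangular quadrature over a recorded window; the sampling setup is hypothetical).

```python
import numpy as np

def excitation_level(psi_history, dt):
    """Smallest eigenvalue of the integrated outer product
    int_t^{t+T} psi(tau) psi(tau)^T dtau over the recorded window;
    (A-4) holds on the window if this value is at least lambda > 0."""
    M = sum(np.outer(psi, psi) for psi in psi_history) * dt
    return np.linalg.eigvalsh(M).min()

# Example: psi_history is a list of regressor vectors psi_f sampled every
# dt seconds over a window of length T; compare the result against lambda.
```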
for all $t\in\mathbb{R}_{\geq t_0}$ and for all $Z_f\in\mathbb{R}^{3n+2L}$. In (A-8), $\underline{v}_{lf}, \overline{v}_{lf}:\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ are class $\mathcal{K}$ functions and $Z_f \triangleq \left[x^T, \tilde{W}_{cf}^T, \tilde{W}_{af}^T, \tilde{x}_f^T, r^T\right]^T$. The sufficient conditions (A-9) for UUB convergence are derived based on the subsequent stability analysis; they require the actor gains $\eta_{a1f}+\eta_{a2f}$, the state penalty $q$, the estimator gains $k_f$ and $\alpha_f$, and the least-squares lower bound $\underline{\Gamma}_f$ to be large enough to dominate cross terms that depend on the Lipschitz constants $L_f$ and $L_{df}$, the ideal weight bound $\overline{W}$, and the constants $\vartheta_8$-$\vartheta_{10}$, together with two known positive adjustable constants. In these conditions,
$$\overline{Z}_f \triangleq \underline{v}_{lf}^{-1}\left(\overline{v}_{lf}\left(\max\left\{\|Z_f(t_0)\|,\ \sqrt{\frac{\iota_f}{v_{lf}}}\right\}\right)\right).$$
An algorithm similar to Algorithm 3.1 is employed to select the gains and a compact set $\mathcal{Z}_f\subset\mathbb{R}^{3n+2L}$ such that
$$\sqrt{\frac{\iota_f}{v_{lf}}} \leq \frac{1}{2}\operatorname{diam}\left(\mathcal{Z}_f\right). \tag{A-10}$$

Theorem A.1. Provided the gains are selected to satisfy the sufficient conditions in (A-9) based on an algorithm similar to Algorithm 3.1, the developed controller, the weight update laws in (A-3), the state derivative estimator in (A-1), and the excitation condition in (A-4) ensure that the state trajectory $x$, the state estimation error $\tilde{x}_f$, and the parameter estimation errors $\tilde{W}_{cf}$ and $\tilde{W}_{af}$ remain bounded such that $\|Z_f(t)\| \leq \overline{Z}_f$, $\forall t\in[t_0, t_0+\overline{t}]$.

Proof. Using techniques similar to the proof of Theorem 3.1, the time derivative of the candidate Lyapunov function in (A-7) can be bounded as
$$\dot{V}_{Lf} \leq -v_{lf}\|Z_f\|^2, \qquad \forall\|Z_f\| \geq \sqrt{\frac{\iota_f}{v_{lf}}}, \tag{A-11}$$
in the domain $\mathcal{Z}_f$. Using (A-8), (A-10), and (A-11), Theorem 4.18 in [149] is used to show that $Z_f$ is UUB and that $\|Z_f(t)\| \leq \overline{Z}_f$, $\forall t\in[t_0, t_0+\overline{t}]$.
During the interval $[t_0, t_0+\overline{t}]$, the developed controller is used along with the weight update laws in Assumption A.1. When enough data is collected in the history stack to satisfy the rank condition in Chapter 3, the update laws from Section 3.3 are used. The bound $\overline{Z}_f$ is used to compute gains for Theorem 3.1 using Algorithm 3.1. Theorem A.1 and Theorem 3.1 together establish UUB regulation of the system state and the parameter estimation errors for the overall switched system.
APPENDIX B
PROOF OF SUPPORTING LEMMAS (CH. 5)

B.1 Proof of Lemma 5.1

The following supporting technical lemma is used to prove Lemma 5.1.

Lemma B.1. Let $D\subset\mathbb{R}^n$ contain the origin and let $\Xi: D\times\mathbb{R}_{\geq 0}\to\mathbb{R}_{\geq 0}$ be positive definite. If $\Xi(x,t)$ is bounded, uniformly in $t$, for all bounded $x$, and if $x\mapsto\Xi(x,t)$ is continuous, uniformly in $t$, then $\Xi$ is decrescent in $D$.

Proof. Since $\Xi(x,t)$ is bounded, uniformly in $t$, $\sup_{t\in\mathbb{R}_{\geq 0}}\{\Xi(x,t)\}$ exists and is unique for all bounded $x$. Let the function $\overline{\Xi}: D\to\mathbb{R}_{\geq 0}$ be defined as
$$\overline{\Xi}(x) \triangleq \sup_{t\in\mathbb{R}_{\geq 0}}\{\Xi(x,t)\}. \tag{B-1}$$
Since $x\mapsto\Xi(x,t)$ is continuous, uniformly in $t$, $\forall\varepsilon>0$, $\exists\varsigma_x>0$ such that $\forall y\in D$,
$$d_{D\times\mathbb{R}_{\geq 0}}\left((x,t),(y,t)\right) < \varsigma_x \implies d_{\mathbb{R}_{\geq 0}}\left(\Xi(x,t),\Xi(y,t)\right) < \varepsilon, \tag{B-2}$$
where $d_M(\cdot,\cdot)$ denotes the standard Euclidean metric on the metric space $M$. By the definition of $d_M(\cdot,\cdot)$, $d_{D\times\mathbb{R}_{\geq 0}}\left((x,t),(y,t)\right) = d_D(x,y)$. Using (B-2),
$$d_D(x,y) < \varsigma_x \implies \left|\Xi(x,t) - \Xi(y,t)\right| < \varepsilon. \tag{B-3}$$
Given the fact that $\Xi$ is positive, (B-3) implies $\Xi(x,t) < \Xi(y,t) + \varepsilon$ and $\Xi(y,t) < \Xi(x,t) + \varepsilon$, which from (B-1) implies $\overline{\Xi}(x) \leq \overline{\Xi}(y) + \varepsilon$ and $\overline{\Xi}(y) \leq \overline{\Xi}(x) + \varepsilon$; hence, from (B-3), $d_D(x,y) < \varsigma_x \implies \left|\overline{\Xi}(x) - \overline{\Xi}(y)\right| \leq \varepsilon$. Since $\Xi$ is positive definite, (B-1) can be used to conclude $\overline{\Xi}(0) = 0$. Thus, $\Xi$ is bounded above by a continuous positive definite function; hence, $\Xi$ is decrescent in $D$.

Based on the definitions in Chapter 5,
$$V_t(e,t) = \int_t^{\infty}\left(e^TQe + \mu^{*T}R\mu^*\right)d\tau \geq V_e(e), \qquad \forall t\in\mathbb{R}_{\geq 0}, \tag{B-4}$$
where $V_e(e) \triangleq \int_t^{\infty}e^TQe\,d\tau$ is a positive definite function. Lemma 4.3 in [149] can be invoked to conclude that there exists a class $\mathcal{K}$ function $\underline{v}:[0,a]\to\mathbb{R}_{\geq 0}$ such that $\underline{v}\left(\|e\|\right) \leq V_e(e)$, which, along with (B-4), implies the lower bound in Lemma 5.1.

From (B-4), $V^*\left(\left[0^T, x_d^T\right]^T\right) = \int_t^{\infty}\mu^{*T}R\mu^*\,d\tau$, with the minimizer $\mu^*(t) = 0$, $\forall t\in\mathbb{R}_{\geq 0}$. Furthermore, $V^*\left(\left[0^T, x_d^T\right]^T\right)$ is the cost incurred when starting with $e = 0$ and following the optimal policy thereafter for any arbitrary desired trajectory $x_d$ (cf. Section 3.7 of [43]). Substituting $x(t_0) = x_d(t_0)$ and $\mu^*(t_0) = 0$ into the error dynamics indicates that $\dot{e}(t_0) = 0$. Thus, when starting from $e = 0$, the zero policy satisfies the dynamic constraints of the tracking problem. Furthermore, the optimal cost satisfies $V^*\left(\left[0^T, x_d^T\right]^T\right) = 0$ for all bounded $x_d$. Moreover, for every $\varepsilon > 0$, there exists a $\varsigma > 0$ such that for all $\left[e_o^T, x_d^T\right]^T$ and $\left[e_1^T, x_d^T\right]^T$ with $d\left(\left[e_o^T, x_d^T\right]^T, \left[e_1^T, x_d^T\right]^T\right) < \varsigma$, it follows that $d_{\mathbb{R}}\left(V^*\left(\left[e_o^T, x_d^T\right]^T\right), V^*\left(\left[e_1^T, x_d^T\right]^T\right)\right) < \varepsilon$. Thus, for each $e_o\in\mathbb{R}^n$, there exists a $\varsigma > 0$, independent of $x_d$, that establishes the continuity of $e\mapsto V^*\left(\left[e^T, x_d^T\right]^T\right)$ at $e_o$. Thus, $e\mapsto V^*\left(\left[e^T, x_d^T\right]^T\right)$ is continuous, uniformly in $x_d$, and hence, $e\mapsto V_t(e,t)$ is continuous, uniformly in $t$. Using Lemma B.1 and the bounds established above, there exists a positive definite function $\overline{\Xi}:\mathbb{R}^n\to\mathbb{R}_{\geq 0}$ such that $V_t(e,t) \leq \overline{\Xi}(e)$, $\forall (e,t)\in\mathbb{R}^n\times\mathbb{R}_{\geq 0}$. Lemma 4.3 in [149] indicates that there exists a class $\mathcal{K}$ function $\overline{v}:[0,a]\to\mathbb{R}_{\geq 0}$ such that $\overline{\Xi}(e) \leq \overline{v}\left(\|e\|\right)$, which implies the upper bound in Lemma 5.1.
B.2 Proof of Lemma 5.2

Using the definition of the controller, the tracking error dynamics can be expressed as
$$\dot{e} = f + \frac{1}{2}gR^{-1}G^T\sigma'^T\tilde{W}_a + gg_d^{+}\left(h_d - f_d\right) - \frac{1}{2}gR^{-1}G^T\sigma'^T W - h_d.$$
On any compact set, the tracking error derivative can be bounded above as $\|\dot{e}\| \leq L_F\|e\| + L_W\|\tilde{W}_a\| + L_e$, where $L_e = L_F\|x_d\| + \left\|gg_d^{+}\left(h_d - f_d\right) - \frac{1}{2}gR^{-1}G^T\sigma'^T W - h_d\right\|$ and $L_W = \left\|\frac{1}{2}gR^{-1}G^T\sigma'^T\right\|$. Using the fact that $e$ and $\tilde{W}_a$ are continuous functions of time, on the interval $[t, t+T]$ the time derivative of $e$ can be bounded as
$$\|\dot{e}\| \leq L_F\sup_{\tau\in[t,t+T]}\|e(\tau)\| + L_W\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\| + L_e.$$
Since the infinity norm is less than the 2-norm, the derivative of the $j$th component of $e$ is bounded as
$$|\dot{e}_j| \leq L_F\sup_{\tau\in[t,t+T]}\|e(\tau)\| + L_W\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\| + L_e.$$
Thus, the maximum and the minimum values of $e_j$ are related as
$$\sup_{\tau\in[t,t+T]}|e_j(\tau)| \leq \inf_{\tau\in[t,t+T]}|e_j(\tau)| + \left(L_F\sup_{\tau\in[t,t+T]}\|e(\tau)\| + L_W\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\| + L_e\right)T.$$
Squaring the above expression and using the inequality $(x+y)^2 \leq 2x^2 + 2y^2$,
$$\sup_{\tau\in[t,t+T]}|e_j(\tau)|^2 \leq 2\inf_{\tau\in[t,t+T]}|e_j(\tau)|^2 + 2\left(L_F\sup_{\tau\in[t,t+T]}\|e(\tau)\| + L_W\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\| + L_e\right)^2T^2.$$
Summing over $j$, and using the facts that $\sup_{\tau\in[t,t+T]}\|e(\tau)\|^2 \leq \sum_{j=1}^{n}\sup_{\tau\in[t,t+T]}|e_j(\tau)|^2$ and $\sum_{j=1}^{n}\inf_{\tau\in[t,t+T]}|e_j(\tau)|^2 \leq \inf_{\tau\in[t,t+T]}\|e(\tau)\|^2$,
$$\sup_{\tau\in[t,t+T]}\|e(\tau)\|^2 \leq 2\inf_{\tau\in[t,t+T]}\|e(\tau)\|^2 + 2\left(L_F\sup_{\tau\in[t,t+T]}\|e(\tau)\| + L_W\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\| + L_e\right)^2nT^2.$$
Using the inequality $(x+y+z)^2 \leq 3x^2 + 3y^2 + 3z^2$, the tracking error bound in Lemma 5.2 can be obtained. Using a similar procedure on the update law for $\tilde{W}_a$,
$$-\inf_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\|^2 \leq -\frac{1 - 6N\left(\eta_{a1}+\eta_{a2}\right)^2T^2}{2}\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_a(\tau)\right\|^2 + 3N^2\eta_{a2}^2\overline{W}^2T^2 + 3N^2\eta_{a1}^2\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_c(\tau)\right\|^2T^2. \tag{B-5}$$
Similarly, the dynamics for $\tilde{W}_c$ yield a bound, denoted (B-6), that relates $\sup_{\tau\in[t,t+T]}\left\|\tilde{W}_c(\tau)\right\|^2$ to $\inf_{\tau\in[t,t+T]}\left\|\tilde{W}_c(\tau)\right\|^2$ and $\sup_{\tau\in[t,t+T]}\|e(\tau)\|^2$, with coefficients depending on $N$, $\eta_c$, $\overline{\varphi}$, $\underline{\varphi}$, $\sigma'_0$, $L_F$, $\overline{d}$, and $T$. Substituting (B-6) into (B-5), the weight estimation error bound in Lemma 5.2 can be obtained.

B.3 Proof of Lemma 5.3

The integrand on the left-hand side can be written as $\tilde{W}_c^T(\tau)\psi(\tau) = \tilde{W}_c^T(t)\psi(\tau) + \left(\tilde{W}_c^T(\tau) - \tilde{W}_c^T(t)\right)\psi(\tau)$. Using the inequality $(x+y)^2 \geq \frac{1}{2}x^2 - y^2$ and integrating,
$$\int_t^{t+T}\left(\tilde{W}_c^T\psi\right)^2d\tau \geq \frac{1}{2}\tilde{W}_c^T(t)\left(\int_t^{t+T}\psi(\tau)\psi^T(\tau)\,d\tau\right)\tilde{W}_c(t) - \int_t^{t+T}\left(\left(\int_t^{\tau}\dot{\tilde{W}}_c(s)\,ds\right)^T\psi(\tau)\right)^2d\tau.$$
Substituting the dynamics for $\tilde{W}_c$ and using the PE condition in Assumption 5.3, the inner integral can be expanded into terms involving $\psi\psi^T\tilde{W}_c$, the residual term $\Delta \triangleq \frac{1}{4}\sigma'_0G\sigma_0'^T + \frac{1}{2}W^T\sigma'_0G\sigma_0'^T$, the actor error term $\tilde{W}_a^TG_\sigma\tilde{W}_a$, and the identification error $\tilde{F}$, each normalized by $\sqrt{1+\nu\omega^T\Gamma\omega}$.
Using the inequality $(x+y+w-z)^2 \leq 2x^2 + 6y^2 + 6w^2 + 6z^2$, the Cauchy-Schwarz inequality, the Lipschitz property, the fact that $1/\sqrt{1+\omega^T\Gamma\omega} \leq 1$, and the established bounds, each of the resulting cross terms can be bounded by a double integral of $\left(\tilde{W}_c^T\psi\right)^2$, $\left\|\tilde{W}_a\right\|^4$, or $\|e\|^2$ over $[t, t+T]$, weighted by constants that depend on $\eta_c$, $\overline{\varphi}$, $\underline{\varphi}$, $\sigma'_0$, $L_F$, and $\overline{d}$.
Thus, changing the order of integration in the double integrals and letting $A \triangleq 1/\sqrt{\underline{\varphi}}$, the bound reduces to an inequality of the form
$$\int_t^{t+T}\left(\tilde{W}_c^T\psi\right)^2d\tau \geq \frac{1}{2}\tilde{W}_c^T(t)\left(\int_t^{t+T}\psi\psi^Td\tau\right)\tilde{W}_c(t) - c_1T^2\int_t^{t+T}\left(\tilde{W}_c^T\psi\right)^2d\tau - c_2T^2\int_t^{t+T}\|e\|^2d\tau - c_3T^2\int_t^{t+T}\left\|\tilde{W}_a\right\|^4d\tau - c_4T^3,$$
where $c_1, \ldots, c_4$ are positive constants determined by $\eta_c$, $A$, $\overline{\varphi}$, $\sigma'_0$, $L_F$, $\overline{\Delta}$, and $\overline{d}$. Reordering the terms, the bound in Lemma 5.3 is obtained.
REFERENCES

[1] D. Kirk, Optimal Control Theory: An Introduction. Dover, 2004.
[2] O. von Stryk and R. Bulirsch, "Direct and indirect methods for trajectory optimization," Ann. Oper. Res., vol. 37, no. 1, pp. 357, 1992.
[3] J. T. Betts, "Survey of numerical methods for trajectory optimization," J. Guid. Contr. Dynam., vol. 21, no. 2, pp. 193, 1998.
[4] C. R. Hargraves and S. Paris, "Direct trajectory optimization using nonlinear programming and collocation," J. Guid. Contr. Dynam., vol. 10, no. 4, pp. 338, 1987.
[5] G. T. Huntington, "Advancement and analysis of a Gauss pseudospectral transcription for optimal control," Ph.D. dissertation, Department of Aeronautics and Astronautics, MIT, May 2007.
[6] F. Fahroo and I. M. Ross, "Pseudospectral methods for infinite-horizon nonlinear optimal control problems," J. Guid. Contr. Dynam., vol. 31, no. 4, pp. 927, 2008.
[7] A. V. Rao, D. A. Benson, C. L. Darby, M. A. Patterson, C. Francolin, and G. T. Huntington, "Algorithm 902: GPOPS, a MATLAB software for solving multiple-phase optimal control problems using the Gauss pseudospectral method," ACM Trans. Math. Softw., vol. 37, no. 2, pp. 1, 2010.
[8] C. L. Darby, W. W. Hager, and A. V. Rao, "An hp-adaptive pseudospectral method for solving optimal control problems," Optim. Control Appl. Methods, vol. 32, no. 4, pp. 476, 2011.
[9] D. Garg, W. W. Hager, and A. V. Rao, "Pseudospectral methods for solving infinite-horizon optimal control problems," Automatica, vol. 47, no. 4, pp. 829-837, 2011.
[10] R. Freeman and P. Kokotovic, "Optimal nonlinear controllers for feedback linearizable systems," in Proc. Am. Control Conf., Jun. 1995, pp. 2722.
[11] Q. Lu, Y. Sun, Z. Xu, and T. Mochizuki, "Decentralized nonlinear optimal excitation control," IEEE Trans. Power Syst., vol. 11, no. 4, pp. 1957, Nov. 1996.
[12] V. Nevistic and J. A. Primbs, "Constrained nonlinear optimal control: a converse HJB approach," California Institute of Technology, Pasadena, CA 91125, Tech. Rep. CIT-CDS 96-021, 1996.
[13] J. A. Primbs and V. Nevistic, "Optimality of nonlinear design techniques: A converse HJB approach," California Institute of Technology, Pasadena, CA 91125, Tech. Rep. CIT-CDS 96-022, 1996.
[14] M. Sekoguchi, H. Konishi, M. Goto, A. Yokoyama, and Q. Lu, "Nonlinear optimal control applied to STATCOM for power system stabilization," in Proc. IEEE/PES Transm. Distrib. Conf. Exhib., Oct. 2002, pp. 342.
[15] Y. Kim and F. Lewis, "Optimal design of CMAC neural-network controller for robot manipulators," IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., vol. 30, no. 1, pp. 22, Feb. 2000.
[16] Y. Kim, F. Lewis, and D. Dawson, "Intelligent optimal control of robotic manipulator using neural networks," Automatica, vol. 36, no. 9, pp. 1355, 2000.
[17] K. Dupree, C. Liang, G. Hu, and W. E. Dixon, "Global adaptive Lyapunov-based control of a robot and mass-spring system undergoing an impact collision," IEEE Trans. Syst. Man Cybern., vol. 38, pp. 1050, 2008.
[18] K. Dupree, C. Liang, G. Hu, and W. E. Dixon, "Lyapunov-based control of a robot and mass-spring system undergoing an impact collision," Int. J. Robot. Autom., vol. 206, no. 4, pp. 3166, 2009.
[19] R. A. Freeman and P. V. Kokotovic, Robust Nonlinear Control Design: State-Space and Lyapunov Techniques. Boston, MA: Birkhäuser, 1996.
[20] J. Fausz, V.-S. Chellaboina, and W. Haddad, "Inverse optimal adaptive control for nonlinear uncertain systems with exogenous disturbances," in Proc. IEEE Conf. Decis. Control, Dec. 1997, pp. 2654.
[21] Z. H. Li and M. Krstic, "Optimal design of adaptive tracking controllers for nonlinear systems," Automatica, vol. 33, pp. 1459, 1997.
[22] M. Krstic and Z.-H. Li, "Inverse optimal design of input-to-state stabilizing nonlinear controllers," IEEE Trans. Autom. Control, vol. 43, no. 3, pp. 336, March 1998.
[23] M. Krstic and P. Tsiotras, "Inverse optimal stabilization of a rigid spacecraft," IEEE Trans. Autom. Control, vol. 44, no. 5, pp. 1042, May 1999.
[24] W. Luo, Y.-C. Chu, and K.-V. Ling, "Inverse optimal adaptive control for attitude tracking of spacecraft," IEEE Trans. Autom. Control, vol. 50, no. 11, pp. 1639, Nov. 2005.
[25] J. R. Cloutier, "State-dependent Riccati equation techniques: an overview," in Proc. Am. Control Conf., vol. 2, 1997, pp. 932.
[26] T. Çimen, "State-dependent Riccati equation (SDRE) control: a survey," in Proc. IFAC World Congr., 2008, pp. 6.
[27] ——, "Systematic and effective design of nonlinear feedback controllers via the state-dependent Riccati equation (SDRE) method," Annu. Rev. Control, vol. 34, no. 1, pp. 32, 2010.
[28] T. Yucelen, A. S. Sadahalli, and F. Pourboghrat, "Online solution of state dependent Riccati equation for nonlinear system stabilization," in Proc. Am. Control Conf., 2010, pp. 6336.
[29] C. E. Garcia, D. M. Prett, and M. Morari, "Model predictive control: theory and practice - a survey," Automatica, vol. 25, no. 3, pp. 335, 1989.
[30] D. Mayne and H. Michalska, "Receding horizon control of nonlinear systems," IEEE Trans. Autom. Contr., vol. 35, no. 7, pp. 814, 1990.
[31] M. Morari and J. Lee, "Model predictive control: past, present and future," Computers & Chemical Engineering, vol. 23, no. 4-5, pp. 667, 1999.
[32] F. Allgöwer and A. Zheng, Nonlinear Model Predictive Control. Springer, 2000, vol. 26.
[33] D. Mayne, J. Rawlings, C. Rao, and P. Scokaert, "Constrained model predictive control: Stability and optimality," Automatica, vol. 36, pp. 789, 2000.
[34] E. F. Camacho and C. Bordons, Model Predictive Control. Springer, 2004, vol. 2.
[35] L. Grüne and J. Pannek, Nonlinear Model Predictive Control. Springer, 2011.
[36] R. Bellman, "The theory of dynamic programming," DTIC Document, Tech. Rep., 1954.
[37] A. Barto, R. Sutton, and C. Anderson, "Neuron-like adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst. Man Cybern., vol. 13, no. 5, pp. 834, 1983.
[38] R. Sutton, "Learning to predict by the methods of temporal differences," Mach. Learn., vol. 3, no. 1, pp. 9, 1988.
[39] P. Werbos, "A menu of designs for reinforcement learning over time," Neural Netw. for Control, pp. 67, 1990.
[40] C. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, no. 3, pp. 279, 1992.
[41] R. Bellman, Dynamic Programming. Dover Publications, Inc., 2003.
[42] D. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 2007.
[43] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[44] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[45] J. N. Tsitsiklis and B. Van Roy, "An analysis of temporal-difference learning with function approximation," IEEE Trans. Automat. Contr., vol. 42, no. 5, pp. 674, 1997.
[46] J. N. Tsitsiklis and B. V. Roy, "Average cost temporal-difference learning," Automatica, vol. 35, no. 11, pp. 1799-1808, 1999.
[47] J. Tsitsiklis, "On the convergence of optimistic policy iteration," J. Mach. Learn. Res., vol. 3, pp. 59, 2003.
[48] V. Konda and J. Tsitsiklis, "On actor-critic algorithms," SIAM J. Contr. Optim., vol. 42, no. 4, pp. 1143, 2004.
[49] P. Mehta and S. Meyn, "Q-learning and Pontryagin's minimum principle," in Proc. IEEE Conf. Decis. Control, Dec. 2009, pp. 3598.
[50] S. Balakrishnan, "Adaptive-critic-based neural networks for aircraft optimal control," J. Guid. Contr. Dynam., vol. 19, no. 4, pp. 893, 1996.
[51] M. Abu-Khalaf and F. Lewis, "Nearly optimal HJB solution for constrained input systems using a neural network least-squares approach," in Proc. IEEE Conf. Decis. Control, Las Vegas, NV, 2002, pp. 943.
[52] ——, "Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach," Automatica, vol. 41, no. 5, pp. 779, 2005.
[53] R. Padhi, N. Unnikrishnan, X. Wang, and S. Balakrishnan, "A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems," Neural Netw., vol. 19, no. 10, pp. 1648, 2006.
[54] D. Vrabie, M. Abu-Khalaf, F. Lewis, and Y. Wang, "Continuous-time ADP for linear systems with partially unknown dynamics," in Proc. IEEE Int. Symp. Approx. Dyn. Program. Reinf. Learn., 2007, pp. 247.
[55] A. Al-Tamimi, F. L. Lewis, and M. Abu-Khalaf, "Discrete-time nonlinear HJB solution using approximate dynamic programming: Convergence proof," IEEE Trans. Syst. Man Cybern. Part B Cybern., vol. 38, pp. 943, 2008.
[56] K. Vamvoudakis and F. Lewis, "Online synchronous policy iteration method for optimal control," in Recent Advances in Intelligent Control Systems, W. Yu, Ed. Springer, 2009, pp. 357.
[57] ——, "Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem," Automatica, vol. 46, no. 5, pp. 878, 2010.
[58] D. Vrabie and F. Lewis, "Integral reinforcement learning for online computation of feedback Nash strategies of nonzero-sum differential games," in Proc. IEEE Conf. Decis. Control, 2010, pp. 3066.
[59] S. Bhasin, R. Kamalapurkar, M. Johnson, K. Vamvoudakis, F. L. Lewis, and W. Dixon, "A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems," Automatica, vol. 49, no. 1, pp. 89, 2013.
[60] G. Lendaris, L. Schultz, and T. Shannon, "Adaptive critic design for intelligent steering and speed control of a 2-axle vehicle," in Int. Joint Conf. Neural Netw., 2000, pp. 73.
[61] S. Ferrari and R. Stengel, "An adaptive critic global controller," in Proc. Am. Control Conf., vol. 4, 2002, pp. 2665.
[62] D. Han and S. Balakrishnan, "State-constrained agile missile control with adaptive-critic-based neural networks," IEEE Trans. Control Syst. Technol., vol. 10, no. 4, pp. 481, 2002.
[63] P. He and S. Jagannathan, "Reinforcement learning neural-network-based controller for nonlinear discrete-time systems with input constraints," IEEE Trans. Syst. Man Cybern. Part B Cybern., vol. 37, no. 2, pp. 425, 2007.
[64] Z. Chen and S. Jagannathan, "Generalized Hamilton-Jacobi-Bellman formulation-based neural network control of affine nonlinear discrete-time systems," IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 90, Jan. 2008.
[65] T. Dierks, B. Thumati, and S. Jagannathan, "Optimal control of unknown affine nonlinear discrete-time systems using offline-trained neural networks with proof of convergence," Neural Netw., vol. 22, no. 5-6, pp. 851, 2009.
[66] A. Heydari and S. Balakrishnan, "Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 1, pp. 145, 2013.
[67] D. Liu and Q. Wei, "Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 3, pp. 621, Mar. 2014.
[68] D. Prokhorov, R. Santiago, and D. Wunsch, "Adaptive critic designs: A case study for neurocontrol," Neural Netw., vol. 8, no. 9, pp. 1367, 1995.
[69] X. Liu and S. Balakrishnan, "Convergence analysis of adaptive critic based optimal control," in Proc. Am. Control Conf., vol. 3, 2000.
[70] J. Murray, C. Cox, G. Lendaris, and R. Saeks, "Adaptive dynamic programming," IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., vol. 32, no. 2, pp. 140, 2002.
[71] R. Leake and R. Liu, "Construction of suboptimal control sequences," SIAM J. Control, vol. 5, p. 54, 1967.
[72] L. Baird, "Advantage updating," Wright Lab, Wright-Patterson Air Force Base, OH, Tech. Rep., 1993.
[73] R. Beard, G. Saridis, and J. Wen, "Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation," Automatica, vol. 33, pp. 2159, 1997.
[74] K. Doya, "Reinforcement learning in continuous time and space," Neural Comput., vol. 12, no. 1, pp. 219, 2000.
[75] T. Hanselmann, L. Noakes, and A. Zaknich, "Continuous-time adaptive critics," IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 631, 2007.
[76] D. Vrabie and F. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Netw., vol. 22, no. 3, pp. 237-246, 2009.
[77] S. Bhasin, N. Sharma, P. Patre, and W. E. Dixon, "Robust asymptotic tracking of a class of nonlinear systems using an adaptive critic based controller," in Proc. Am. Control Conf., Baltimore, MD, 2010, pp. 3223.
[78] Y. Jiang and Z.-P. Jiang, "Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics," Automatica, vol. 48, no. 10, pp. 2699-2704, 2012.
[79] X. Yang, D. Liu, and D. Wang, "Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints," Int. J. Control, vol. 87, no. 3, pp. 553, 2014.
[80] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Mach. Learn., vol. 8, no. 3-4, pp. 293, 1992.
[81] P. Cichosz, "An analysis of experience replay in temporal difference learning," Cybern. Syst., vol. 30, no. 5, pp. 341, 1999.
[82] S. Kalyanakrishnan and P. Stone, "Batch reinforcement learning in a complex domain," in Proc. Int. Conf. Auton. Agents Multi-Agent Syst., Honolulu, HI, 2007, pp. 650.
[83] L. Dung, T. Komeda, and M. Takagi, "Efficient experience reuse in non-Markovian environments," in Proc. Int. Conf. Instrum. Control Inf. Technol., Tokyo, Japan, 2008, pp. 3327.
[84] P. Wawrzyński, "Real-time reinforcement learning by sequential actor-critics and experience replay," Neural Netw., vol. 22, no. 10, pp. 1484, 2009.
[85] S. Adam, L. Busoniu, and R. Babuska, "Experience replay for real-time reinforcement learning control," IEEE Trans. Syst. Man Cybern. Part C Appl. Rev., vol. 42, no. 2, pp. 201, 2012.
[86] H. Zhang, Q. Wei, and Y. Luo, "A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm," IEEE Trans. Syst. Man Cybern. Part B Cybern., vol. 38, no. 4, pp. 937, 2008.
[87] H. Zhang, D. Liu, Y. Luo, and D. Wang, Adaptive Dynamic Programming for Control: Algorithms and Stability, ser. Communications and Control Engineering. London: Springer-Verlag, 2013.
[88] K. S. Narendra and A. M. Annaswamy, "A new adaptive law for robust adaptive control without persistent excitation," IEEE Trans. Autom. Control, vol. 32, pp. 134, 1987.
[89] K. Narendra and A. Annaswamy, Stable Adaptive Systems. Prentice-Hall, Inc., 1989.
[90] S. Sastry and M. Bodson, Adaptive Control: Stability, Convergence, and Robustness. Upper Saddle River, NJ: Prentice-Hall, 1989.
[91] P. Ioannou and J. Sun, Robust Adaptive Control. Prentice Hall, 1996.
[92] G. V. Chowdhary and E. N. Johnson, "Theory and flight-test validation of a concurrent-learning adaptive controller," J. Guid. Contr. Dynam., vol. 34, no. 2, pp. 592, March 2011.
[93] G. Chowdhary, T. Yucelen, M. Mühlegg, and E. N. Johnson, "Concurrent learning adaptive control of linear systems with exponentially convergent bounds," Int. J. Adapt. Control Signal Process., vol. 27, no. 4, pp. 280, 2013.
[94] X. Zhang and Y. Luo, "Data-based on-line optimal control for unknown nonlinear systems via adaptive dynamic programming approach," in Proc. Chin. Control Conf. IEEE, 2013, pp. 2256.
[95] H. Modares, F. L. Lewis, and M.-B. Naghibi-Sistani, "Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems," Automatica, vol. 50, no. 1, pp. 193, 2014.
[96] L. K. R. Sutton, "Model-based reinforcement learning with an approximate, learned model," in Proc. Yale Workshop Adapt. Learn. Syst., 1996, pp. 101.
[97] R. Kamalapurkar, P. Walters, and W. E. Dixon, "Concurrent learning-based approximate optimal regulation," in Proc. IEEE Conf. Decis. Control, Florence, IT, Dec. 2013, pp. 6256.
[98] R. Isaacs, Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization, ser. Dover Books on Mathematics. Dover Publications, 1999.
[99] S. Tijs, Introduction to Game Theory. Hindustan Book Agency, 2003.
[100] T. Basar and G. J. Olsder, Dynamic Noncooperative Game Theory: Second Edition, ser. Classics in Applied Mathematics. SIAM, 1999.
[101] J. Nash, "Non-cooperative games," Annals of Math., vol. 2, pp. 286, 1951.
[102] J. Case, "Toward a theory of many player differential games," SIAM J. Control, vol. 7, pp. 179, 1969.
[103] A. Starr and C.-Y. Ho, "Nonzero-sum differential games," J. Optim. Theory App., vol. 3, no. 3, pp. 184, 1969.
[104] A. Starr and C.-Y. Ho, "Further properties of nonzero-sum differential games," J. Optim. Theory App., vol. 4, pp. 207, 1969.
[105] A. Friedman, Differential Games. Wiley, 1971.
[106] A. Bressan and F. S. Priuli, "Infinite horizon noncooperative differential games," J. Differ. Equ., vol. 227, no. 1, pp. 230-257, 2006.
[107] A. Bressan, "Noncooperative differential games," Milan J. Math., vol. 79, no. 2, pp. 357, December 2011.
[108] M. Littman, "Value-function reinforcement learning in Markov games," Cogn. Syst. Res., vol. 2, no. 1, pp. 55, 2001.
[109] Q. Wei and H. Zhang, "A new approach to solve a class of continuous-time nonlinear quadratic zero-sum games using ADP," in IEEE Int. Conf. Netw. Sens. Control, 2008, pp. 507.
[110] H. Zhang, Q. Wei, and D. Liu, "An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games," Automatica, vol. 47, pp. 207, 2010.
[111] X. Zhang, H. Zhang, Y. Luo, and M. Dong, "Iteration algorithm for solving the optimal strategies of a class of nonaffine nonlinear quadratic zero-sum games," in Proc. IEEE Conf. Decis. Control, May 2010, pp. 1359.
[112] K. Vamvoudakis and F. Lewis, "Multi-player non-zero-sum games: Online adaptive learning solution of coupled Hamilton-Jacobi equations," Automatica, vol. 47, pp. 1556, 2011.
[113] Y. M. Park, M. S. Choi, and K. Y. Lee, "An optimal tracking neuro-controller for nonlinear dynamic systems," IEEE Trans. Neural Netw., vol. 7, no. 5, pp. 1099, 1996.
[114] T. Dierks and S. Jagannathan, "Optimal tracking control of affine nonlinear discrete-time systems with unknown internal dynamics," in Proc. IEEE Conf. Decis. Control, 2009, pp. 6750.
[115] ——, "Optimal control of affine nonlinear continuous-time systems," in Proc. Am. Control Conf., 2010, pp. 1568.
[116] H. Zhang, L. Cui, X. Zhang, and Y. Luo, "Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method," IEEE Trans. Neural Netw., vol. 22, no. 12, pp. 2226, 2011.
[117] M. Johnson, T. Hiramatsu, N. Fitz-Coy, and W. E. Dixon, "Asymptotic Stackelberg optimal control design for an uncertain Euler-Lagrange system," in Proc. IEEE Conf. Decis. Control, Atlanta, GA, 2010, pp. 6686.
[118] K. Vamvoudakis and F. Lewis, "Online neural network solution of nonlinear two-player zero-sum games using synchronous policy iteration," in Proc. IEEE Conf. Decis. Control, 2010.
[119] M. Johnson, S. Bhasin, and W. E. Dixon, "Nonlinear two-player zero-sum game approximate solution using a policy iteration algorithm," in Proc. IEEE Conf. Decis. Control, 2011, pp. 142.
[120] K. Vamvoudakis, F. L. Lewis, M. Johnson, and W. E. Dixon, "Online learning algorithm for Stackelberg games in problems with hierarchy," in Proc. IEEE Conf. Decis. Control, Maui, HI, Dec. 2012, pp. 1883.
[121] M. Lewis and K. Tan, "High precision formation control of mobile robots using virtual structures," Autonomous Robots, vol. 4, no. 4, pp. 387, 1997.
[122] T. Balch and R. Arkin, "Behavior-based formation control for multirobot teams," IEEE Trans. Robot. Autom., vol. 14, no. 6, pp. 926, Dec. 1998.
[123] A. Das, R. Fierro, V. Kumar, J. Ostrowski, J. Spletzer, and C. Taylor, "A vision-based formation control framework," IEEE Trans. Robot. Autom., vol. 18, no. 5, pp. 813, Oct. 2002.
[124] J. Fax and R. Murray, "Information flow and cooperative control of vehicle formations," IEEE Trans. Autom. Control, vol. 49, no. 9, pp. 1465, Sept. 2004.
[125] R. Murray, "Recent research in cooperative control of multivehicle systems," J. Dyn. Syst. Meas. Control, vol. 129, pp. 571, 2007.
[126] D. H. Shim, H. J. Kim, and S. Sastry, "Decentralized nonlinear model predictive control of multiple flying robots," in Proc. IEEE Conf. Decis. Control, vol. 4, 2003, pp. 3621.
[127] L. Magni and R. Scattolini, "Stabilizing decentralized model predictive control of nonlinear systems," Automatica, vol. 42, no. 7, pp. 1231-1236, 2006.
[128] K. G. Vamvoudakis, F. L. Lewis, and G. R. Hudas, "Multi-agent differential graphical games: Online adaptive learning solution for synchronization with optimality," Automatica, vol. 48, no. 8, pp. 1598-1611, 2012.
[129] A. Heydari and S. N. Balakrishnan, "An optimal tracking approach to formation control of nonlinear multi-agent systems," in Proc. AIAA Guid. Navig. Control Conf., 2012.
[130] D. Vrabie, "Online adaptive optimal control for continuous-time systems," Ph.D. dissertation, University of Texas at Arlington, 2010.
[131] S. P. Singh, "Reinforcement learning with a hierarchy of abstract models," in AAAI Natl. Conf. Artif. Intell., vol. 92, 1992, pp. 202.
[132] C. G. Atkeson and S. Schaal, "Robot learning from demonstration," in Int. Conf. Mach. Learn., vol. 97, 1997, pp. 12.
[133] P. Abbeel, M. Quigley, and A. Y. Ng, "Using inaccurate models in reinforcement learning," in Int. Conf. Mach. Learn. New York, NY, USA: ACM, 2006, pp. 1.
[134] M. P. Deisenroth, Efficient Reinforcement Learning Using Gaussian Processes. KIT Scientific Publishing, 2010.
[135] D. Mitrovic, S. Klanke, and S. Vijayakumar, "Adaptive optimal feedback control with learned internal dynamics models," in From Motor Learning to Interaction Learning in Robots, ser. Studies in Computational Intelligence, O. Sigaud and J. Peters, Eds. Springer Berlin Heidelberg, 2010, vol. 264, pp. 65.
[136] M. P. Deisenroth and C. E. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Int. Conf. Mach. Learn., 2011, pp. 465.
[137] Y. Luo and M. Liang, "Approximate optimal tracking control for a class of discrete-time non-affine systems based on GDHP algorithm," in IWACI Int. Workshop Adv. Comput. Intell., 2011, pp. 143.
[138] D. Wang, D. Liu, and Q. Wei, "Finite-horizon neuro-optimal tracking control for a class of discrete-time nonlinear systems using adaptive dynamic programming approach," Neurocomputing, vol. 78, no. 1, pp. 14-22, 2012.
[139] J. Wang and M. Xin, "Multi-agent consensus algorithm with obstacle avoidance via optimal control approach," Int. J. Control, vol. 83, no. 12, pp. 2606, 2010.
[140] ——, "Distributed optimal cooperative tracking control of multiple autonomous robots," Robotics and Autonomous Systems, vol. 60, no. 4, pp. 572-583, 2012.
[141] ——, "Integrated optimal formation control of multiple unmanned aerial vehicles," IEEE Trans. Control Syst. Technol., vol. 21, no. 5, pp. 1731, 2013.
[142] W. Lin, "Distributed UAV formation control using differential game approach," Aerosp. Sci. Technol., vol. 35, pp. 54, 2014.
[143] E. Semsar-Kazerooni and K. Khorasani, "Optimal consensus algorithms for cooperative team of agents subject to partial information," Automatica, vol. 44, no. 11, pp. 2766-2777, 2008.
[144] D. Liberzon, Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, 2012.
[145] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control, 3rd ed. Wiley, 2012.
[146] R. Kamalapurkar, H. Dinh, S. Bhasin, and W. Dixon, "Approximately optimal trajectory tracking for continuous time nonlinear systems," arXiv:1301.7664.
[147] G. Chowdhary, "Concurrent learning adaptive control for convergence without persistency of excitation," Ph.D. dissertation, Georgia Institute of Technology, December 2010.
[148] W. E. Dixon, A. Behal, D. M. Dawson, and S. Nagarkatti, Nonlinear Control of Engineering Systems: A Lyapunov-Based Approach. Boston: Birkhäuser, 2003.
[149] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice Hall, 2002.
[150] A. Savitzky and M. J. E. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Anal. Chem., vol. 36, no. 8, pp. 1627, 1964.
[151] K. Hornik, M. Stinchcombe, and H. White, "Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks," Neural Netw., vol. 3, no. 5, pp. 551-560, 1990.
[152] F. L. Lewis, R. Selmic, and J. Campos, Neuro-Fuzzy Control of Industrial Systems with Actuator Nonlinearities. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2002.
[153] K. M. Misovec, "Friction compensation using adaptive non-linear control with persistent excitation," Int. J. Control, vol. 72, no. 5, pp. 457, 1999.
[154] K. Narendra and A. Annaswamy, "Robust adaptive control in the presence of bounded disturbances," IEEE Trans. Automat. Contr., vol. 31, no. 4, pp. 306, 1986.
[155] E. Panteley, A. Loria, and A. Teel, "Relaxed persistency of excitation for uniform asymptotic stability," IEEE Trans. Autom. Contr., vol. 46, no. 12, pp. 1874, 2001.
[156] A. Loría and E. Panteley, "Uniform exponential stability of linear time-varying systems: revisited," Syst. Control Lett., vol. 47, no. 1, pp. 13-24, 2002.
[157] R. Kamalapurkar, H. T. Dinh, P. Walters, and W. E. Dixon, "Approximate optimal cooperative decentralized control for consensus in a topological network of agents with uncertain nonlinear dynamics," in Proc. Am. Control Conf., Washington, DC, June 2013, pp. 1322.
[158] H. Zhang, L. Cui, and Y. Luo, "Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP," IEEE Trans. Cybern., vol. 43, no. 1, pp. 206, 2013.
BIOGRAPHICAL SKETCH

Rushikesh Kamalapurkar received his Bachelor of Technology degree in mechanical engineering from Visvesvaraya National Institute of Technology, Nagpur, India. He worked for two years as a Design Engineer at Larsen and Toubro Ltd., Mumbai, India. He received his Master of Science degree and his Doctor of Philosophy degree from the Department of Mechanical and Aerospace Engineering at the University of Florida under the supervision of Dr. Warren E. Dixon. His research interests include dynamic programming, optimal control, reinforcement learning, and data-driven adaptive control for uncertain nonlinear dynamical systems.
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 58, NO. 9, SEPTEMBER 2013

Technical Notes and Correspondence

LaSalle-Yoshizawa Corollaries for Nonsmooth Systems

Nicholas Fischer, Rushikesh Kamalapurkar, and Warren E. Dixon

Abstract: In this technical note, two generalized corollaries to the LaSalle-Yoshizawa Theorem are presented for nonautonomous systems described by nonlinear differential equations with discontinuous right-hand sides. Lyapunov-based analysis methods that achieve asymptotic convergence when the candidate Lyapunov derivative is upper bounded by a negative semi-definite function in the presence of differential inclusions are presented. A design example illustrates the utility of the corollaries.

Index Terms: LaSalle's Theorem, nonlinear control systems, sliding mode control.

I. INTRODUCTION

Various Lyapunov-based analysis methods have been developed for differential inclusions in the literature for both autonomous (cf. [1]-[9]) and nonautonomous (cf. [6], [10]-[13]) systems. Of these, several stability theorems have been established which apply to nonsmooth systems for which the derivative of the candidate Lyapunov function can be upper bounded by a negative-definite function: Lyapunov's generalized theorem and finite-time convergence in [8], [10]-[14] are examples of such. However, for certain classes of controllers (e.g., adaptive controllers, output feedback controllers, etc.), a negative-definite bound may be difficult (or impossible) to achieve, restricting the use of such methods.

Matrosov's Theorem [15] provides a framework for examining the stability of equilibrium points (and sets, through various extensions) when the candidate Lyapunov function has a negative semi-definite decay. Various extensions of this theorem have been developed (cf. [16]-[20]) to encompass discrete and hybrid systems to establish stability of closed sets. In particular, [19] (see also the related work in [16] and [17]) extended Matrosov's Theorem to differential inclusions, while also addressing the stability of sets.

In contrast to Matrosov Theorems, LaSalle's Invariance Principle [21] has been widely adopted as a method, for continuous autonomous (time-invariant) systems, to relax the strict negative-definite condition on the candidate Lyapunov function derivative while still ensuring asymptotic stability of the origin. Stability of the origin is proven by showing that bounded solutions converge to the largest invariant subset contained in the set of points where the derivative of the candidate Lyapunov function is zero. In [22], LaSalle's Invariance Principle was modified to state that bounded solutions converge to the largest invariant subset of the set where an integrable output function is zero. The integral invariance method was further extended in [23] to differential inclusions. As described in [24], additional extensions of the invariance principle to systems with discontinuous right-hand sides were presented in [4], [6], [9] for Filippov solutions and in [25] for Carathéodory solutions.

Manuscript received July 12, 2012; revised November 29, 2012 and January 27, 2013; accepted February 05, 2013. Date of publication April 25, 2013; date of current version August 15, 2013. This work was supported in part by National Science Foundation Awards 0547448, 0901491, 1161260, and a contract with the Air Force Research Laboratory, Munitions Directorate at Eglin AFB. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsoring agencies. Recommended by Associate Editor C. Prieur.

The authors are with the Department of Mechanical and Aerospace Engineering, University of Florida, Gainesville, FL, 32605 USA (e-mail: nic.r.fischer@gmail.com; rkamalapurkar@ufl.edu; wdixon@ufl.edu).

Digital Object Identifier 10.1109/TAC.2013.2246900
Various extensions of LaSalle's Invariance Principle have also been developed for hybrid systems (cf. [24], [26]–[30]). The results in [26] and [29] focus on switched linear systems, whereas the result in [30] focuses on switched nonlinear systems. In [28], hybrid extensions of LaSalle's Invariance Principle were applied for systems where at least one solution exists for each initial condition, for deterministic systems and continuous hybrid systems. Left-continuous and impulsive hybrid systems are considered in the extensions in [27]. In [24], two invariance principles are developed for hybrid systems: one involves a Lyapunov-like function that is nonincreasing along all trajectories that remain in a given set, and the other considers a pair of auxiliary output functions that satisfy certain conditions only along the hybrid trajectory. A review of invariance principles for hybrid systems is provided in [31].

The challenge in developing invariance-like principles for nonautonomous systems is that it may be unclear how to even define a set where the derivative of the candidate Lyapunov function is stationary, since the candidate Lyapunov function is a function of both state and time [32], [33]. By augmenting the state vector with time (cf. [34], [35]), a nonautonomous system can be expressed as an autonomous system; this technique allows autonomous systems results (cf. [36] and [37]) to be extended to nonautonomous systems. While the state augmentation method can be a useful tool, in general, augmenting the state vector yields a non-compact attractor (when the time dependence is not periodic), destroying some of the structure of the original equation; for example, the new system will not have any bounded, periodic, or almost periodic motions. Some results (cf. [38]–[40]) have explored ways to utilize the augmented system's non-compact attractors by focusing on solution operator decomposition, energy equations, or new notions of compactness, but these methods typically require additional regularity conditions (with respect to time) compared to cases where time is kept as a distinct variable.

The Krasovskii-LaSalle Theorem [41] was originally developed for periodic systems, with several generalizations also existing for not necessarily periodic systems (e.g., see [6], [42]–[45]). In particular, a (Krasovskii-LaSalle) Extended Invariance Principle is developed in [45] to prove that the origin of a nonautonomous switched system with a piecewise continuous, uniformly bounded in time right-hand side is globally asymptotically stable (or uniformly globally asymptotically stable for autonomous systems). The result in [45] uses a Lipschitz continuous, radially unbounded, positive-definite function with a negative semi-definite derivative (condition C1) along with an auxiliary Lipschitz continuous (possibly indefinite) function whose derivative is upper bounded by terms whose sum is positive-definite (condition C2).

Also for nonautonomous systems, the LaSalle-Yoshizawa Theorem (i.e., [33, Theorem 8.4] and [46, Theorem A.8]), based on the work in [21], [47], [48], provides a convenient analysis tool which allows the limiting set (which does not need to be invariant) to be defined where the negative semi-definite bound on the candidate Lyapunov derivative is equal to zero, guaranteeing asymptotic convergence of the state.


Given its utility, the LaSalle-Yoshizawa Theorem has been applied, for example, in adaptive control and in deriving stability from passivity properties such as feedback passivation and backstepping designs of nonlinear systems [21]. Available proofs for the LaSalle-Yoshizawa Theorem exploit Barbalat's Lemma, which is often invoked to show asymptotic convergence for general classes of nonlinear systems [33]. In general, adapting the LaSalle-Yoshizawa Theorem to systems where the right-hand side is not locally Lipschitz has only recently been explored. The result in [49] presents three invariance-like semistability theorems that utilize similar arguments to the LaSalle-Yoshizawa Theorem under the assumption that the system dynamics are uniformly bounded. Alternatively, using Barbalat's Lemma and the observation that an absolutely continuous function that has a uniformly locally integrable derivative is uniformly continuous, the result in [50] proves asymptotic convergence of an output function for nonlinear systems with disturbances. The result in [50] is developed for differential equations with a continuous right-hand side, but [50, Facts 1–4] provide insights into the application of Barbalat's Lemma to discontinuous systems.

In this technical note, we present two corollaries to the LaSalle-Yoshizawa Theorem for nonautonomous systems with right-hand side discontinuities that are essentially locally bounded, uniformly in $t$, utilizing Filippov solutions and Lipschitz continuous and regular Lyapunov-like functions whose time derivatives can be upper bounded by negative semi-definite functions. Applicability of one of the corollaries is illustrated for an example problem.

II. PRELIMINARIES

Consider the system

$\dot{x} = f(x, t)$   (1)

where $x(t) \in \mathcal{D}$ denotes the state vector, $f : \mathcal{D} \times [0, \infty) \to \mathbb{R}^n$ is Lebesgue measurable and essentially locally bounded, uniformly in $t$, and $\mathcal{D} \subseteq \mathbb{R}^n$ is an open and connected set. Existence and uniqueness of the continuous solution $x(t)$ are provided under the condition that the function $f$ is Lipschitz continuous. However, if $f$ contains a discontinuity at any point in $\mathcal{D}$, then a solution to (1) may not exist in the classical sense. Thus, it is necessary to redefine the concept of a solution. Utilizing differential inclusions, the value of a generalized solution (e.g., Filippov [51] or Krasovskii [52] solutions) at a certain point can be found by interpreting the behavior of its derivative at nearby points. Generalized solutions will be close to the trajectories of the actual system since they are a limit of solutions of ordinary differential equations with a continuous right-hand side [13]. While there exists a Filippov solution for any arbitrary initial condition $x(t_0) \in \mathcal{D}$, the solution is generally not unique [5], [51].

Definition 1 (Filippov Solution): [51] A function $x : [t_0, t_1] \to \mathbb{R}^n$ is called a solution of (1) on the interval $[t_0, t_1]$ if $x$ is absolutely continuous and, for almost all $t \in [t_0, t_1]$, $\dot{x} \in K[f](x, t)$, where $K[f]$ is an upper semi-continuous, nonempty, compact and convex valued map on $\mathcal{D}$, defined as

$K[f](x, t) \triangleq \bigcap_{\delta > 0} \bigcap_{\mu(S) = 0} \overline{\mathrm{co}}\, f\left( B(x, \delta) \setminus S, t \right)$   (2)

where $\bigcap_{\mu(S) = 0}$ denotes the intersection over sets $S$ of Lebesgue measure zero, $\overline{\mathrm{co}}$ denotes convex closure, and $B(x, \delta) \triangleq \{ v \in \mathbb{R}^n \mid \| x - v \| < \delta \}$.

Remark 1: One can also formulate the solutions of (1) in other ways [53]; for instance, using Krasovskii's definition of solutions [52]. The corollaries presented in this work can also be extended to Krasovskii solutions (see [3], for example). In the case of Krasovskii solutions, one would get stronger conclusions (i.e., conclusions for a potentially larger set of solutions) at the cost of slightly stronger assumptions (e.g., local boundedness rather than essentially local boundedness).
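As a numerical illustration of the map in (2), the following minimal Python sketch (the shrinking radii and sampling grid are arbitrary choices made only for this illustration) approximates $K[f](x)$ for the scalar field $f(x) = -\mathrm{sgn}(x)$. In the scalar case the convex closure is an interval, so the sketch takes the min and max of sampled values over punctured balls and intersects over decreasing $\delta$; it recovers $[-1, 1]$ at the discontinuity and the singleton $\{-1\}$ away from it.

```python
# Sketch of the Filippov set-valued map K[f](x) in (2) for a scalar f.
# The convex closure over the punctured ball B(x, delta) \ {x} is the
# interval [min, max] of sampled values; the intersection over delta > 0
# is approximated by a few shrinking radii. Removing the single point x
# stands in for the exclusion of measure-zero sets S.
import numpy as np

def filippov_interval(f, x, radii=(1e-1, 1e-2, 1e-3), n=1001):
    lo, hi = -np.inf, np.inf
    for delta in radii:
        pts = x + delta * np.linspace(-1.0, 1.0, n)
        pts = pts[pts != x]          # puncture the ball at x
        vals = f(pts)
        lo = max(lo, vals.min())     # intersect the nested intervals:
        hi = min(hi, vals.max())     # keep the tightest [min, max]
    return float(lo), float(hi)

f = lambda y: -np.sign(y)
print(filippov_interval(f, 0.0))     # ~(-1.0, 1.0): K[f](0) = [-1, 1]
print(filippov_interval(f, 0.5))     # ~(-1.0, -1.0): the singleton {-1}
```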
To facilitate the main results, four definitions are provided.

Definition 2 (Directional Derivative): [54] Given a function $V : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}$, the right directional derivative of $V$ at $(x, t)$ in the direction of $v \in \mathbb{R}^{n+1}$ is defined as

$V'\left( (x, t); v \right) \triangleq \lim_{h \to 0^{+}} \frac{V\left( (x, t) + h v \right) - V(x, t)}{h}.$

Additionally, the generalized directional derivative of $V$ at $(x, t)$ in the direction of $v$ is defined as

$V^{\circ}\left( (x, t); v \right) \triangleq \limsup_{(y, s) \to (x, t),\ h \to 0^{+}} \frac{V\left( (y, s) + h v \right) - V(y, s)}{h}.$

Definition 3 (Regular Function): [34] A function $V : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}$ is said to be regular at $(x, t)$ if, for all $v$, the right directional derivative of $V$ at $(x, t)$ in the direction of $v$ exists and $V'\left( (x, t); v \right) = V^{\circ}\left( (x, t); v \right)$.¹

Definition 4 (Clarke's Generalized Gradient): [34] For a function $V : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}$ that is locally Lipschitz in $(x, t)$, define the generalized gradient of $V$ at $(x, t)$ by

$\partial V(x, t) \triangleq \overline{\mathrm{co}}\left\{ \lim \nabla V(x_i, t_i) \mid (x_i, t_i) \to (x, t),\ (x_i, t_i) \notin \Omega_V \right\}$

where $\Omega_V$ is the set of measure zero where the gradient of $V$ is not defined.

Definition 5 (Locally Bounded, Uniformly in $t$): Let $f : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}^n$. The map $f$ is locally bounded, uniformly in $t$, if for each compact set $\mathcal{K} \subset \mathbb{R}^n$ there exists $M > 0$ such that $\| f(x, t) \| \leq M$ for all $x \in \mathcal{K}$, $t \in [0, \infty)$.

The following lemma provides a method for computing the time derivative of a regular function $V$ using Clarke's generalized gradient [34] and $K[f]$, from (2), along the solution trajectories of (1).

Lemma 1 (Chain Rule): [6], [56] Let $x(\cdot)$ be a Filippov solution of (1) and $V : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}$ be a locally Lipschitz, regular function. Then $t \mapsto V(x(t), t)$ is absolutely continuous, $\frac{d}{dt} V(x(t), t)$ exists almost everywhere (a.e.), i.e., for almost all $t \in [t_0, t_1]$, and $\frac{d}{dt} V(x(t), t) \overset{\text{a.e.}}{\in} \dot{\tilde{V}}(x(t), t)$, where²

$\dot{\tilde{V}}(x, t) \triangleq \bigcap_{\xi \in \partial V(x, t)} \xi^{T} \begin{bmatrix} K[f](x, t) \\ 1 \end{bmatrix}.$

Remark 2: Throughout the subsequent discussion, for brevity of notation, let a.e. refer to almost all $t$.

¹Note that any continuously differentiable function is regular, and the sum of regular functions is regular [55].
²Equivalently, almost everywhere, $\frac{d}{dt} V(x(t), t) = \xi^{T} \left[ \nu^{T}\ \ 1 \right]^{T}$ for all $\xi \in \partial V(x(t), t)$ and some $\nu \in K[f](x(t), t)$.
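To make the chain rule concrete, consider the standard scalar example $V(x) = |x|$ with $f(x) = -\mathrm{sgn}(x)$ (an illustration chosen here, not an example from the references). Since this $V$ does not depend on $t$, the time component of the generalized gradient vanishes and $\dot{\tilde{V}}(x) = \bigcap_{\xi \in \partial V(x)} \xi\, K[f](x)$. The sketch below represents the convex sets by intervals and takes the intersection over a grid of $\xi$ values: away from the origin the derivative set is the singleton $\{-1\}$, while at $x = 0$ the intersection collapses to $\{0\}$, so the bound $\dot{\tilde{V}} \leq 0$ holds everywhere.

```python
# Sketch of the set-valued derivative of Lemma 1 for V(x) = |x| and
# f(x) = -sgn(x). Clarke's generalized gradient of |x| is {sgn(x)} for
# x != 0 and [-1, 1] at x = 0; likewise K[f](x) = {-sgn(x)} for x != 0
# and [-1, 1] at x = 0. Pairs (lo, hi) stand in for the convex sets.
import numpy as np

def clarke_grad_abs(x):
    return (-1.0, 1.0) if x == 0 else (float(np.sign(x)),) * 2

def K_f(x):
    return (-1.0, 1.0) if x == 0 else (float(-np.sign(x)),) * 2

def set_valued_derivative(x, n=201):
    g_lo, g_hi = clarke_grad_abs(x)
    k_lo, k_hi = K_f(x)
    lo, hi = -np.inf, np.inf
    for xi in np.linspace(g_lo, g_hi, n):   # intersect over the gradient
        prods = (xi * k_lo, xi * k_hi)      # xi * [k_lo, k_hi] as an interval
        lo = max(lo, min(prods))
        hi = min(hi, max(prods))
    return (float(lo), float(hi)) if lo <= hi else None  # None: empty set

print(set_valued_derivative(1.0))   # (-1.0, -1.0): V decays at unit rate
print(set_valued_derivative(0.0))   # (0.0, 0.0): only 0 survives at x = 0
```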


III. MAIN RESULT

For the system described in (1) with a continuous right-hand side, existing Lyapunov theory can be used to examine the stability of the closed-loop system using continuous techniques such as those described in [57]. However, these theorems must be altered for the set-valued map $K[f](x, t)$ for systems with right-hand sides which are not Lipschitz continuous [6], [13], [14]. Lyapunov analysis for nonsmooth systems is analogous to the analysis used for continuous systems. The differences are that differential equations are replaced with inclusions, gradients are replaced with generalized gradients, and points are replaced with sets throughout the analysis. The following presentation and subsequent proofs demonstrate how the LaSalle-Yoshizawa Theorem can be adapted for such systems.

The following auxiliary lemma from [56] and Barbalat's Lemma are provided to facilitate the proofs of the nonsmooth LaSalle-Yoshizawa Corollaries.

Lemma 2: [56] Let $x(t)$ be any Filippov solution to the system in (1) and $V : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}$ be a locally Lipschitz, regular function. If $\frac{d}{dt} V(x(t), t) \overset{\text{a.e.}}{\leq} 0$, then $V(x(t), t) \leq V(x(t_0), t_0)$ for all $t \geq t_0$.

Proof: For the sake of contradiction, let there exist some $t_2 > t_0$ such that $V(x(t_2), t_2) > V(x(t_0), t_0)$. Then

$\int_{t_0}^{t_2} \frac{d}{dt} V(x(t), t)\, dt = V(x(t_2), t_2) - V(x(t_0), t_0) > 0.$

It follows that $\frac{d}{dt} V(x(t), t) > 0$ on a set of positive measure, which contradicts that $\frac{d}{dt} V(x(t), t) \leq 0$, a.e.

Lemma 3 (Barbalat's Lemma): [57] Let $\phi : \mathbb{R} \to \mathbb{R}$ be a uniformly continuous function. Suppose that $\lim_{t \to \infty} \int_{0}^{t} \phi(\tau)\, d\tau$ exists and is finite. Then $\phi(t) \to 0$ as $t \to \infty$.

Based on Lemmas 2 and 3, nonsmooth corollaries to the LaSalle-Yoshizawa Theorem (cf. [33, Theorem 8.4] and [46, Theorem A.8]) are provided in Corollaries 1 and 2.

Corollary 1: For the system given in (1), let $\mathcal{D} \subseteq \mathbb{R}^n$ be an open and connected set containing $x = 0$, and suppose $f$ is Lebesgue measurable and essentially locally bounded, uniformly in $t$. Let $V : \mathcal{D} \times [0, \infty) \to \mathbb{R}$ be locally Lipschitz and regular such that

$W_1(x) \leq V(x, t) \leq W_2(x)$   (3)
$\frac{d}{dt} V(x(t), t) \overset{\text{a.e.}}{\leq} -W(x(t))$   (4)

where $W_1$ and $W_2$ are continuous positive definite functions and $W$ is a continuous positive semi-definite function on $\mathcal{D}$; choose $r > 0$ and $c > 0$ such that $\overline{B}(0, r) \subset \mathcal{D}$ and $c < \min_{\| x \| = r} W_1(x)$, and let $x(\cdot)$ be a Filippov solution to (1) where $x(t_0) \in \{ x \in \overline{B}(0, r) \mid W_2(x) \leq c \}$. Then $x(t)$ is bounded and satisfies

$W(x(t)) \to 0 \ \text{as} \ t \to \infty.$   (5)

Proof: Since $W_2(x(t_0)) \leq c$ and $c < \min_{\| x \| = r} W_1(x)$, $x(t_0)$ is in the interior of $\overline{B}(0, r)$. Define a time-dependent set $\Omega_{t, c}$ by

$\Omega_{t, c} \triangleq \{ x \in \overline{B}(0, r) \mid V(x, t) \leq c \}.$

From (3), the set $\Omega_{t, c}$ contains $\{ x \in \overline{B}(0, r) \mid W_2(x) \leq c \}$ since

$W_2(x) \leq c \implies V(x, t) \leq c.$

On the other hand, $\Omega_{t, c}$ is a subset of $\{ x \in \overline{B}(0, r) \mid W_1(x) \leq c \}$ since

$V(x, t) \leq c \implies W_1(x) \leq c.$

Thus

$\{ x \in \overline{B}(0, r) \mid W_2(x) \leq c \} \subset \Omega_{t, c} \subset \{ x \in \overline{B}(0, r) \mid W_1(x) \leq c \} \subset \overline{B}(0, r).$

Based on (4), $\frac{d}{dt} V(x(t), t) \overset{\text{a.e.}}{\leq} 0$; hence, $V(x(t), t)$ is non-increasing from Lemma 2. For any $t_0 \geq 0$ and any $x(t_0) \in \Omega_{t_0, c}$, the solution starting at $(x(t_0), t_0)$ stays in $\Omega_{t, c}$ for every $t \geq t_0$. Therefore, any solution starting in $\{ x \in \overline{B}(0, r) \mid W_2(x) \leq c \}$ stays in $\Omega_{t, c}$, and consequently in $\{ x \in \overline{B}(0, r) \mid W_1(x) \leq c \}$, for all future time. Hence, the Filippov solution $x(t)$ is bounded such that $\| x(t) \| \leq r$ for all $t \geq t_0$. From Lemma 2, $V(x(t), t)$ is also bounded such that $V(x(t), t) \leq V(x(t_0), t_0)$. Since $W$ is Lebesgue measurable, from (4),

$\int_{t_0}^{t} W(x(\tau))\, d\tau \leq V(x(t_0), t_0) - V(x(t), t).$   (6)

Therefore, $\int_{t_0}^{t} W(x(\tau))\, d\tau$ is bounded for all $t \geq t_0$. Existence of $\lim_{t \to \infty} \int_{t_0}^{t} W(x(\tau))\, d\tau$ is guaranteed since the left-hand side of (6) is monotonically nondecreasing (based on the definition of $W$) and bounded above. Since $x(\cdot)$ is locally absolutely continuous and $f$ is essentially locally bounded, uniformly in $t$, $x(\cdot)$ is uniformly continuous. Because $W$ is continuous in $x$, and $x(t)$ is on the compact set $\overline{B}(0, r)$, $W(x(t))$ is uniformly continuous in $t$ on $[t_0, \infty)$. Therefore, by Lemma 3, $W(x(t)) \to 0$ as $t \to \infty$.

Remark 3: From Definition 1, $K[f](x, t)$ is an upper semi-continuous, nonempty, compact and convex valued map. While existence of a Filippov solution for any arbitrary initial condition $x(t_0) \in \mathcal{D}$ is provided by the definition, generally speaking, the solution is non-unique [5], [51].

Note that Corollary 1 establishes (5) for a specific Filippov solution $x(\cdot)$. Under the stronger condition that³ $\dot{\tilde{V}}(x, t) \leq -W(x)$, it is possible to show that (5) holds for all Filippov solutions of (1). The next corollary is presented to illustrate this point.

Corollary 2: For the system given in (1), let $\mathcal{D} \subseteq \mathbb{R}^n$ be an open and connected set containing $x = 0$, and suppose $f$ is Lebesgue measurable and essentially locally bounded, uniformly in $t$. Let $V : \mathcal{D} \times [0, \infty) \to \mathbb{R}$ be locally Lipschitz and regular such that

$W_1(x) \leq V(x, t) \leq W_2(x)$   (7)
$\dot{\tilde{V}}(x, t) \leq -W(x)$   (8)

for all $t \geq 0$ and $x \in \mathcal{D}$, where $W_1$ and $W_2$ are continuous positive definite functions, and $W$ is a continuous positive semi-definite function on $\mathcal{D}$. Choose $r > 0$ and $c > 0$ such that $\overline{B}(0, r) \subset \mathcal{D}$ and $c < \min_{\| x \| = r} W_1(x)$. Then, all Filippov solutions of (1) such that $x(t_0) \in \{ x \in \overline{B}(0, r) \mid W_2(x) \leq c \}$ are bounded and satisfy

$W(x(t)) \to 0 \ \text{as} \ t \to \infty.$

Proof: Let $x(\cdot)$ be any arbitrary Filippov solution of (1). Then, from Lemma 1 and (8), $\frac{d}{dt} V(x(t), t) \overset{\text{a.e.}}{\leq} -W(x(t))$, which is the condition in (4). Since the selection of $x(\cdot)$ is arbitrary, Corollary 1 can be used to imply that the result in (5) holds for each $x(\cdot)$.
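Both corollaries ultimately invoke Lemma 3, and the uniform continuity hypothesis is essential. The following numerical sketch (the two signals are hypothetical, chosen only for this illustration) contrasts a uniformly continuous, integrable signal, which must vanish, with a train of ever-narrower unit spikes whose integral converges even though the signal does not tend to zero.

```python
# Numerical contrast for Lemma 3 (Barbalat): both signals below have a
# convergent integral on [0, T], but only the uniformly continuous one
# is forced toward zero.
import numpy as np

t = np.linspace(0.0, 60.0, 600001)          # grid step 1e-4

# phi1: uniformly continuous and integrable -> Lemma 3 gives phi1 -> 0
phi1 = np.exp(-0.2 * t) * (1.0 + np.sin(t))

# phi2: unit-height triangular spikes at t = k with half-width 1/k^2; the
# total area sum_k 1/k^2 is finite, yet phi2(k) = 1 at every integer k,
# so phi2 does not tend to zero -- the narrowing spikes destroy uniform
# continuity in the limit t -> infinity.
phi2 = np.zeros_like(t)
for k in range(1, 60):
    w = 1.0 / (k * k)
    phi2 = np.maximum(phi2, np.clip(1.0 - np.abs(t - k) / w, 0.0, 1.0))

for name, phi in (("phi1", phi1), ("phi2", phi2)):
    integral = float(((phi[:-1] + phi[1:]) / 2 * np.diff(t)).sum())  # trapezoid rule
    tail_sup = float(phi[t >= 50.0].max())   # sup over the tail [50, 60]
    print(f"{name}: integral ~ {integral:.3f}, sup over tail ~ {tail_sup:.3f}")
# phi1's tail sup is ~0, while phi2's stays at ~1 for all time.
```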
IV. DESIGN EXAMPLE

The LaSalle-Yoshizawa Corollaries (and the LaSalle-Yoshizawa Theorem) are useful in their ability to provide boundedness and convergence of solutions, while providing a compact framework to define the region of attraction for which the boundedness and convergence results hold. In fact, the region of attraction is provided as part of the corollary structures. In the case of semi-global and local results, these domains and sets are especially useful. It is important to note that Barbalat's Lemma can be used to achieve the same results (in fact, it is used in the proof of Corollary 1); however, the use of Barbalat's Lemma would require the identification of the region of attraction for which convergence holds, and it does not provide boundedness of the trajectories. For illustrative purposes, the following design example targets the regulation of a first-order nonlinear system.

³The inequality $\dot{\tilde{V}}(x, t) \leq -W(x)$ is used to indicate that every element of the set $\dot{\tilde{V}}(x, t)$ is less than or equal to the scalar $-W(x)$.


Consider a first-order nonlinear differential equation given by

$\dot{x} = f(x, t) + u + d$   (9)

where $f(x, t) \in \mathbb{R}$ is an unknown, linear-parameterizable, essentially locally bounded, uniformly in $t$ function that can be expressed as $f(x, t) = Y(x, t)\theta$, where $\theta \in \mathbb{R}^{p}$ is a vector of unknown constant parameters and $Y(x, t) \in \mathbb{R}^{1 \times p}$ is the regression matrix for $f$. In addition, $u(t) \in \mathbb{R}$ is the control input, $x(t) \in \mathbb{R}$ is the measurable system state, and $d(t) \in \mathbb{R}$ is an essentially locally bounded disturbance that satisfies

$|d| \leq \bar{d} + \rho(|x|)\,|x|$   (10)

where $\bar{d} \in \mathbb{R}$ is a positive constant, and $\rho : [0, \infty) \to [0, \infty)$ is a positive, globally invertible, state-dependent function. A regulation controller for (9) can be designed as

$u \triangleq -Y(x, t)\hat{\theta} - k_1 x - k_2\, \mathrm{sgn}(x)$   (11)

where $\hat{\theta}(t) \in \mathbb{R}^{p}$ is an estimate of $\theta$, $k_1, k_2 \in \mathbb{R}$ are gain constants, and $\mathrm{sgn}(\cdot)$ is defined as $\mathrm{sgn}(x) \triangleq x / |x|$ for $x \neq 0$ with $\mathrm{sgn}(0) \triangleq 0$. Based on the subsequent stability analysis, an adaptive update law can be defined as

$\dot{\hat{\theta}} \triangleq \Gamma\, Y(x, t)^{T} x$   (12)

where $\Gamma \in \mathbb{R}^{p \times p}$ is a positive gain matrix. The closed-loop system is given by

$\dot{x} = Y(x, t)\tilde{\theta} - k_1 x - k_2\, \mathrm{sgn}(x) + d$   (13)

where $\tilde{\theta}(t) \in \mathbb{R}^{p}$ denotes the mismatch $\tilde{\theta} \triangleq \theta - \hat{\theta}$. In (13), it is apparent that the RHS contains a discontinuity in $x$, and the use of differential inclusions provides for existence of solutions.

Let $y \in \mathbb{R}^{p+1}$ be defined as $y \triangleq [x\ \ \tilde{\theta}^{T}]^{T}$, and choose a positive-definite, locally Lipschitz, regular candidate Lyapunov function as

$V(y, t) \triangleq \frac{1}{2} x^{2} + \frac{1}{2} \tilde{\theta}^{T} \Gamma^{-1} \tilde{\theta}.$   (14)

The candidate Lyapunov function in (14) satisfies the following inequalities:

$W_1(y) \leq V(y, t) \leq W_2(y)$   (15)

where the continuous positive-definite functions $W_1, W_2$ are defined as $W_1(y) \triangleq \lambda_1 \| y \|^{2}$ and $W_2(y) \triangleq \lambda_2 \| y \|^{2}$, where $\lambda_1, \lambda_2 \in \mathbb{R}$ are known positive constants. Then the closed-loop dynamics in (12) and (13) can be expressed as $\dot{y} = h(y, t)$, and since $V$ is $C^{1}$ in $(y, t)$,⁴

$\dot{\tilde{V}} = \nabla V^{T} \left[ K[h](y, t)^{T}\ \ 1 \right]^{T}.$   (16)

After using (13), the expression in (16) can be written as

$\dot{\tilde{V}} = x \left( Y\tilde{\theta} - k_1 x - k_2\, \mathrm{SGN}(x) + d \right) - \tilde{\theta}^{T} \Gamma^{-1} \dot{\hat{\theta}}$   (17)

where $\mathrm{SGN}(x) \triangleq K[\mathrm{sgn}](x)$ such that $\mathrm{SGN}(x) = \{1\}$ if $x > 0$, $[-1, 1]$ if $x = 0$, and $\{-1\}$ if $x < 0$.

Remark 4: One could also consider the discontinuous function instead of the differential inclusion (i.e., the function can alternatively be defined as $\mathrm{sgn}(0) \triangleq 0$) using Carathéodory solutions; however, this method would not be an indicator for what happens when measurement noise is present in the system. As described in results such as [58]–[60], Filippov and Krasovskii solutions for discontinuous differential equations are appropriate for capturing the possible closed-loop system behavior in the presence of arbitrarily small measurement noise. By utilizing the set-valued map $\mathrm{SGN}(\cdot)$ in the analysis, we account for the possibility that when the true state satisfies $x = 0$, $\mathrm{sgn}(\cdot)$ (of the measured state) falls within the set $[-1, 1]$. Therefore, the presented analysis is more robust to measurement noise than an analysis that depends on $\mathrm{sgn}(0)$ being defined as a known singleton.

Substituting for the adaptive update law in (12), canceling terms, and utilizing the bound for $d$ in (10), the expression in (17) can be upper bounded as

$\dot{\tilde{V}} \leq -k_1 x^{2} - k_2 |x| + |x| \left( \bar{d} + \rho(|x|)\,|x| \right).$   (18)

The set in (17) reduces to the scalar inequality in (18) since, in the case when $\mathrm{SGN}(x)$ is defined as a set, it is multiplied by $x$, i.e., when $x = 0$, $x\, \mathrm{SGN}(x) = \{0\}$. Regrouping similar terms, the expression in (18) can be written as

$\dot{\tilde{V}} \leq -\left( k_1 - \rho(|x|) \right) x^{2} - \left( k_2 - \bar{d} \right) |x|.$   (19)

Provided $k_1 > \rho(|x|)$ and $k_2 > \bar{d}$, the expression in (19) can be upper bounded as

$\dot{\tilde{V}} \leq -W(y)$

where $W(y) \triangleq (k_2 - \bar{d})\,|x|$ is a positive semi-definite function defined on the domain $\mathcal{D} \triangleq \{ y \in \mathbb{R}^{p+1} \mid \| y \| < \rho^{-1}(k_1) \}$. The inequalities in (15) can be used to show that $V \in \mathcal{L}_{\infty}$ in $\mathcal{D}$; hence, $x$ and $\tilde{\theta} \in \mathcal{L}_{\infty}$ in $\mathcal{D}$. Since $\theta$ contains the constant unknown system parameters and $\tilde{\theta} \in \mathcal{L}_{\infty}$ in $\mathcal{D}$, the definition for $\tilde{\theta}$ can be used to show that $\hat{\theta} \in \mathcal{L}_{\infty}$ in $\mathcal{D}$. Given that $x \in \mathcal{L}_{\infty}$ in $\mathcal{D}$, $Y(x, t) \in \mathcal{L}_{\infty}$ in $\mathcal{D}$. Since $x$, $\hat{\theta}$, and $Y \in \mathcal{L}_{\infty}$ in $\mathcal{D}$, the control is bounded from (11) and the adaptation law in (12). The closed-loop dynamics in (10) and (13) can be used to conclude that $\dot{x} \in \mathcal{L}_{\infty}$ in $\mathcal{D}$; hence, $x$ is uniformly continuous in $\mathcal{D}$.

Choose $r > 0$ such that $\overline{B}(0, r) \subset \mathcal{D}$ denotes a closed ball, and let $\mathcal{S}$ denote the set defined as

$\mathcal{S} \triangleq \left\{ y \in \overline{B}(0, r) \mid W_2(y) < \lambda_1 r^{2} \right\}.$   (20)

Invoking Corollary 2, $W(y(t)) \to 0$ as $t \to \infty$ for all $y(t_0) \in \mathcal{S}$; thus, $|x(t)| \to 0$ as $t \to \infty$. The region of attraction in (20) can be made arbitrarily large to include all initial conditions (a semi-global type result) by increasing the gain $k_1$.

⁴For continuously differentiable candidate Lyapunov functions, the generalized gradient reduces to the standard gradient. However, this is not required by the Corollary itself and only assists in evaluation.
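The closed loop in (11)–(13) can be exercised numerically. In the sketch below, the regressor, the true parameters, the disturbance, and all gains are hypothetical choices made only for this illustration (none are specified in the design example), and a forward-Euler step is used, which is crude near the $\mathrm{sgn}$ discontinuity but adequate to exhibit the behavior guaranteed by Corollary 2: $x(t)$ is driven to a numerical neighborhood of zero while $\hat{\theta}$ remains bounded.

```python
# Simulation sketch of the design example with illustrative choices:
#   plant (9):       x' = Y(x) @ theta + u + d(t)
#   controller (11): u  = -Y(x) @ theta_hat - k1*x - k2*sgn(x)
#   update law (12): theta_hat' = Gamma @ Y(x).T * x
import numpy as np

theta = np.array([1.5, -0.8])              # unknown constant parameters
Y = lambda x: np.array([x, np.sin(x)])     # assumed regression vector
d = lambda t: 0.2 * np.sin(3.0 * t)        # disturbance with |d| <= 0.2

k1, k2 = 2.0, 0.5                          # gains chosen with k2 > 0.2
Gamma = np.diag([5.0, 5.0])                # positive adaptation gain matrix

dt, T = 1e-4, 10.0
x, theta_hat = 2.0, np.zeros(2)
for i in range(int(T / dt)):
    t = i * dt
    u = -Y(x) @ theta_hat - k1 * x - k2 * np.sign(x)   # controller (11)
    xdot = Y(x) @ theta + u + d(t)                     # plant (9)
    theta_hat = theta_hat + dt * (Gamma @ Y(x)) * x    # update law (12)
    x = x + dt * xdot                                  # forward-Euler step

print(f"|x(T)| = {abs(x):.1e}")     # regulated to a ~dt-sized neighborhood of 0
print("theta_hat =", theta_hat)     # bounded; need not converge to theta
```

Consistent with the semi-global discussion above, enlarging k1 in the sketch enlarges the verified region of attraction in (20); the estimates need not converge to the true parameters because the analysis only guarantees boundedness of the parameter error.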
Remark 5: For some systems (e.g., closed-loop error systems with sliding mode control laws), it may be possible to show that Corollary 2 is more easily applied, as is the focus of the example in Section IV. However, in other cases, it may be difficult to satisfy the inequality in (8). The usefulness of Corollary 1 is demonstrated in those cases where it is difficult or impossible to show that the inequality in (8) can be satisfied, but it is possible to show that (4) can be satisfied for almost all time.


V. CONCLUSION

In this technical note, the LaSalle-Yoshizawa Theorem is extended to differential systems whose right-hand sides are discontinuous. The result presents two theoretical tools applicable to nonautonomous systems with discontinuities in the closed-loop error system. Generalized Lyapunov-based analysis methods are developed utilizing differential inclusions in the sense of Filippov to achieve asymptotic convergence when the candidate Lyapunov derivative is upper bounded by a negative semi-definite function. Cases when the bound on the Lyapunov derivative holds for all possible Filippov solutions are also considered. An adaptive, sliding mode control example is provided to illustrate the utility of the main results.

ACKNOWLEDGMENT

The authors would like to express their gratitude to Professor A. Teel for his constructive comments during the development of this work.

REFERENCES

[1] F. H. Clarke, Y. S. Ledyaev, and R. J. Stern, "Asymptotic stability and smooth Lyapunov functions," J. Diff. Equations, vol. 149, pp. 69–114, 1998.
[2] C. M. Kellett and A. R. Teel, "Smooth Lyapunov functions and robustness of stability for difference inclusions," Syst. & Control Lett., vol. 52, pp. 395–405, 2004.
[3] F. M. Ceragioli, "Discontinuous ordinary differential equations and stabilization," Ph.D. dissertation, Universita di Firenze, Firenze, Italy, 1999.
[4] A. Bacciotti and F. Ceragioli, "Stability and stabilization of discontinuous systems and nonsmooth Lyapunov functions," Control, Optim., and Calc. of Var., vol. 4, pp. 361–376, 1999.
[5] J. P. Aubin and A. Cellina, Differential Inclusions. Berlin, Germany: Springer, 1984.
[6] D. Shevitz and B. Paden, "Lyapunov stability theory of nonsmooth systems," IEEE Trans. Autom. Control, vol. 39, no. 9, pp. 1910–1914, Sep. 1994.
[7] A. N. Michel and K. Wang, Qualitative Theory of Dynamical Systems, the Role of Stability Preserving Mappings. New York: Marcel Dekker, 1995.
[8] E. Moulay and W. Perruquetti, "Finite time stability of differential inclusions," IMA J. Math. Control Info., vol. 22, pp. 465–475, 2005.
[9] H. Logemann and E. Ryan, "Asymptotic behaviour of nonlinear systems," Amer. Math. Month., vol. 111, pp. 864–889, 2004.
[10] M. Forti, M. Grazzini, P. Nistri, and L. Pancioni, "Generalized Lyapunov approach for convergence of neural networks with discontinuous or non-Lipschitz activations," Physica D, vol. 214, pp. 88–99, 2006.
[11] Q. Wu and N. Sepehri, "On Lyapunov's stability analysis of non-smooth systems with applications to control engineering," Int. J. Nonlin. Mech., vol. 36, no. 7, pp. 1153–1161, 2001.
[12] Q. Wu, N. Sepehri, P. Sekhavat, and S. Peles, "On design of continuous Lyapunov's feedback control," J. Franklin Inst., vol. 342, no. 6, pp. 702–723, 2005.
[13] Z. Guo and L. Huang, "Generalized Lyapunov method for discontinuous systems," Nonlinear Anal., vol. 71, pp. 3083–3092, 2009.
[14] G. Cheng and X. Mu, "Finite-time stability with respect to a closed invariant set for a class of discontinuous systems," Appl. Math. Mech., vol. 30, no. 8, pp. 1069–1075, 2009.
[15] V. M. Matrosov, "On the stability of motion," J. Appl. Math. Mech., vol. 26, pp. 1337–1353, 1962.
[16] A. Loría, E. Panteley, D. Popovic, and A. R. Teel, "A nested Matrosov theorem and persistency of excitation for uniform convergence in stable nonautonomous systems," IEEE Trans. Autom. Control, vol. 50, no. 2, pp. 183–198, Feb. 2005.
[17] R. Sanfelice and A. Teel, "Asymptotic stability in hybrid systems via nested Matrosov functions," IEEE Trans. Autom. Control, vol. 54, no. 7, pp. 1569–1574, Jul. 2009.
[18] M. Malisoff and F. Mazenc, "Constructions of strict Lyapunov functions for discrete time and hybrid time-varying systems," Nonlin. Anal.: Hybrid Syst., vol. 2, no. 2, pp. 394–407, 2008.
[19] A. Teel, E. Panteley, and A. Loría, "Integral characterizations of uniform asymptotic and exponential stability with applications," Math. Control, Signals, and Syst., vol. 15, pp. 177–201, 2002.
[20] A. Astolfi and L. Praly, "A LaSalle version of Matrosov theorem," in Proc. IEEE Conf. Decis. Control, 2011, pp. 320–324.
[21] J. P. LaSalle, An Invariance Principle in the Theory of Stability. New York: Academic, 1967.
[22] C. Byrnes and C. Martin, "An integral-invariance principle for nonlinear systems," IEEE Trans. Autom. Control, vol. 40, no. 6, pp. 983–994, Jun. 1995.
[23] E. Ryan, "An integral invariance principle for differential inclusions with applications in adaptive control," SIAM J. Control Optim., vol. 36, no. 3, pp. 960–980, 1998.
[24] R. Sanfelice, R. Goebel, and A. Teel, "Invariance principles for hybrid systems with connections to detectability and asymptotic stability," IEEE Trans. Autom. Control, vol. 52, no. 12, pp. 2282–2297, Dec. 2007.
[25] A. Bacciotti and F. Ceragioli, "Nonpathological Lyapunov functions and discontinuous Caratheodory systems," Automatica, vol. 42, pp. 453–458, 2006.
[26] J. Hespanha, "Uniform stability of switched linear systems: Extensions of LaSalle's Invariance Principle," IEEE Trans. Autom. Control, vol. 49, no. 4, pp. 470–482, Apr. 2004.
[27] V. Chellaboina, S. Bhat, and W. Haddad, "An invariance principle for nonlinear hybrid and impulsive dynamical systems," Nonlin. Anal., vol. 53, pp. 527–550, 2003.
[28] J. Lygeros, K. Johansson, S. Simić, J. Zhang, and S. Sastry, "Dynamical properties of hybrid automata," IEEE Trans. Autom. Control, vol. 48, no. 1, pp. 2–17, Jan. 2003.
[29] J. Hespanha, D. Liberzon, D. Angeli, and E. Sontag, "Nonlinear norm-observability notions and stability of switched systems," IEEE Trans. Autom. Control, vol. 50, no. 2, pp. 154–168, Feb. 2005.
[30] A. Bacciotti and L. Mazzi, "An invariance principle for nonlinear switched systems," Syst. Contr. Lett., vol. 54, pp. 1109–1119, 2005.
[31] R. Goebel, R. Sanfelice, and A. Teel, Hybrid Dynamical Systems. Princeton, NJ: Princeton University Press, 2012.
[32] T.-C. Lee, D.-C. Liaw, and B. S. Chen, "A general invariance principle for nonlinear time-varying systems and its applications," IEEE Trans. Autom. Control, vol. 46, no. 12, pp. 1989–1993, Dec. 2001.
[33] H. K. Khalil, Nonlinear Systems, 3rd ed. Prentice Hall, 2002.
[34] F. Clarke, Optimization and Nonsmooth Analysis. Reading, MA: Addison-Wesley, 1983.
[35] V. V. Nemyckiï and V. V. Stepanov, Qualitative Theory of Differential Equations. Princeton, NJ: Princeton Univ. Press, 1960.
[36] C. M. Kellett and A. R. Teel, "Weak converse Lyapunov theorems and control Lyapunov functions," SIAM J. Contr. Optimiz., vol. 42, no. 6, pp. 1934–1959, 2004.
[37] A. R. Teel and L. Praly, "A smooth Lyapunov function from a class-KL estimate involving two positive semidefinite functions," ESAIM Control Optim. Calc. Var., vol. 5, pp. 313–367, 2000.
[38] I. Moise, R. Rosa, and X. Wang, "Attractors for noncompact nonautonomous systems via energy equations," Discret. Contin. Dynam. Syst., vol. 10, pp. 473–496, 2004.
[39] T. Caraballo, G. Łukaszewicz, and J. Real, "Pullback attractors for asymptotically compact non-autonomous dynamical systems," Nonlin. Anal.: Theory, Methods, Appl., vol. 64, no. 3, pp. 484–498, 2006.
[40] G. R. Sell, "Nonautonomous differential equations and topological dynamics I. The basic theory," Trans. Amer. Math. Society, vol. 127, no. 2, pp. 241–262, 1967.
[41] M. Vidyasagar, Nonlinear Systems Analysis, 2nd ed. Philadelphia, PA: SIAM, 2002.
[42] T.-C. Lee and Z.-P. Jiang, "A generalization of Krasovskii-LaSalle theorem for nonlinear time-varying systems: Converse results and applications," IEEE Trans. Autom. Control, vol. 50, no. 8, pp. 1147–1163, Aug. 2005.
[43] Z. Artstein, "Uniform asymptotic stability via the limiting equations," J. Diff. Equat., vol. 27, pp. 172–189, 1978.


[44] J. Alvarez, Y. Orlov, and L. Acho, "An invariance principle for discontinuous dynamic systems with applications to a Coulomb friction oscillator," ASME J. Dynam. Syst., Meas., Control, vol. 74, pp. 190–198, 2000.
[45] Y. Orlov, "Extended invariance principle for nonautonomous switched systems," IEEE Trans. Autom. Control, vol. 48, no. 8, pp. 1448–1452, Aug. 2003.
[46] M. Krstic, P. V. Kokotovic, and I. Kanellakopoulos, Nonlinear and Adaptive Control Design. New York: Wiley, 1995.
[47] J. P. LaSalle, "Some extensions of Liapunov's second method," IRE Trans. Circuit Theory, vol. CT-7, pp. 520–527, 1960.
[48] T. Yoshizawa, "Asymptotic behavior of solutions of a system of differential equations," Contrib. Differ. Equat., vol. 1, pp. 371–387, 1963.
[49] Q. Hui, W. M. Haddad, and S. P. Bhat, "Semistability for time-varying discontinuous dynamical systems with application to agreement problems in switching networks," in IEEE Proc. Conf. Decis. Control, 2008, pp. 2985–2990.
[50] A. Teel, "Asymptotic convergence from Lp stability," IEEE Trans. Autom. Control, vol. 44, no. 11, pp. 2169–2170, Nov. 1999.
[51] A. F. Filippov, Differential Equations with Discontinuous Right-hand Sides. Norwell, MA: Kluwer, 1988.
[52] N. N. Krasovskii, Stability of Motion. Stanford, CA: Stanford University Press, 1963.
[53] O. Hájek, "Discontinuous differential equations," J. Diff. Equat., vol. 32, pp. 149–170, 1979.
[54] W. Kaplan, Advanced Calculus, 4th ed. Reading, MA: Addison-Wesley, 1991.
[55] F. Clarke, Y. Ledyaev, R. Stern, and P. Wolenski, Nonsmooth Analysis and Control Theory, Graduate Texts in Mathematics, vol. 178. New York: Springer, 1998.
[56] B. Paden and S. Sastry, "A calculus for computing Filippov's differential inclusion with application to the variable structure control of robot manipulators," IEEE Trans. Circuits Syst., vol. 34, no. 1, pp. 73–82, Jan. 1987.
[57] H. K. Khalil, Nonlinear Systems, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1996.
[58] H. Hermes, "Discontinuous vector fields and feedback control," in Differential Equations and Dynamical Systems. New York: Academic Press, 1967.
[59] J.-M. Coron and L. Rosier, "A relation between continuous time-varying and discontinuous feedback stabilization," J. Math. Systems, Estim. Control, vol. 4, no. 1, pp. 67–84, 1994.
[60] R. Goebel, R. Sanfelice, and A. Teel, "Hybrid dynamical systems," IEEE Control Syst. Mag., vol. 29, no. 2, pp. 28–93, 2009.