
Integrating Robotic Action with Biologic Perception

Permanent Link: http://ufdc.ufl.edu/UFE0042506/00001

Material Information

Title: Integrating Robotic Action with Biologic Perception: A Brain-Machine Symbiosis Theory
Physical Description: 1 online resource (134 p.)
Language: english
Creator: Mahmoudi, Babak
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2010

Subjects

Subjects / Keywords: accumbens, action, actor, brain, critic, cycle, interface, learning, machine, perception, reinforcement, reward, striatum, symbiosis
Biomedical Engineering -- Dissertations, Academic -- UF
Genre: Biomedical Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: In patients with motor disability, the natural cyclic flow of information between the brain and the external environment is disrupted by their limb impairment. Brain-Machine Interfaces (BMIs) aim to provide new communication channels between the brain and the environment by directly translating the brain's internal states into actions. To enable the user in a wide range of daily life activities, the challenge is designing neural decoders that autonomously adapt to different tasks, environments, and changes in the pattern of neural activity. In this dissertation, a novel decoding framework for BMIs is developed in which a computational agent autonomously learns how to translate neural states into actions by maximizing a measure of the goal shared between the user and the agent. Since the agent and the brain share the same goal, a symbiotic relationship between them evolves; this decoding paradigm is therefore called a Brain-Machine Symbiosis (BMS) framework. A decoding agent was implemented within the BMS framework based on the Actor-Critic method of Reinforcement Learning. The role of the Actor as a neural decoder was to find a mapping between the neural representation of motor states in the primary motor cortex (MI) and robot actions in order to solve reaching tasks. The Actor learned the optimal control policy using an evaluative feedback that was estimated by the Critic directly from the user's neural activity in the Nucleus Accumbens (NAcc). Through a series of computational neuroscience studies in a cohort of rats, it was demonstrated that NAcc could provide a useful evaluative feedback by predicting the increase or decrease in the probability of earning reward based on the environmental conditions. Using a closed-loop BMI simulator, it was demonstrated that the Actor-Critic decoding architecture was able to adapt to different tasks as well as to changes in the pattern of neural activity. The custom design of a dual micro-wire array enabled simultaneous implantation of MI and NAcc for the development of a full closed-loop system. The Actor-Critic decoding architecture was able to solve the brain-controlled reaching task using a robotic arm by capturing the interdependency between the simultaneous action representation in MI and reward expectation in NAcc.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Babak Mahmoudi.
Thesis: Thesis (Ph.D.)--University of Florida, 2010.
Local: Adviser: Sanchez, Justin.
Electronic Access: RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2011-06-30

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2010
System ID: UFE0042506:00001

Full Text

PAGE 1

INTEGRATING ROBOTIC ACTION WITH BIOLOGIC PERCEPTION: A BRAIN-MACHINE SYMBIOSIS THEORY

By

BABAK MAHMOUDI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

PAGE 2

© 2010 Babak Mahmoudi

PAGE 3

To my family, who have always inspired me to reach higher goals in my life. None of this would have been possible without their unconditional love and support.

PAGE 4

ACKNOWLEDGMENTS

Earning a PhD is all about exploring unknown territories. All along this journey many people helped me, and because of them my graduate experience has been one that I will cherish forever. I am indebted to all of them, but I will have the chance to thank only a few here. My deepest gratitude is to my advisor, Dr. Justin Sanchez, who has been both a professional mentor and a strong supporter throughout all stages of this adventure. Many discussions and long hours with Dr. Sanchez served to elevate this research to a higher level. Throughout it all he has developed my abilities as a researcher. Dr. Jose Principe got me to understand how a machine could learn, even to think like an adaptive filter. All of our hard work enabled a major contribution to the field; I don't know if it would have been possible any other way. Dr. Jeff Kleim's expertise about motor cortex organization and the comments of Dr. Tom DeMarse about learning were helpful in the BMI design. I would like to thank Dr. Van Oostrom and Dr. Harris for their support, especially during the last months of my PhD. I owe much of my success to Dr. John DiGiovanna, who is a great collaborator, and I have the privilege of calling him one of my best friends. Much of the success of this work was due to his contributions. I am grateful to April Lane Derfyniak and Tifiny McDonald in the Department of Biomedical Engineering for always being helpful, all the way from admission to graduation. I would like to thank all of my friends, especially the members of the Computational NeuroEngineering Laboratory, who helped me during my research. Finally, I would like to send my deepest thanks to my family, who over the last four years supported me from thousands of miles away with their endless love.

PAGE 5

TABLE OF CONTENTS

ACKNOWLEDGMENTS .... 4
LIST OF FIGURES .... 8
LIST OF ABBREVIATIONS .... 11
ABSTRACT .... 12

CHAPTER

1 INTRODUCTION .... 14
   Brain Machine Interface (BMI) .... 14
   Trajectory BMIs .... 15
   Goal Driven BMIs .... 18
   Limitations of the Current BMI Design .... 19
   Brain Machine Symbiosis (BMS) Theory .... 21
   Organization of the Dissertation .... 23

2 A THEORETICAL FOUNDATION FOR THE BMS THEORY .... 25
   Introduction .... 25
   RL Methods for BMS .... 29
      Q Learning .... 30
      Actor Critic Learning .... 31
   Perception-Action-Reward Cycle .... 33
   Reward Processing in the Brain .... 35

3 MOTOR STATE REPRESENTATION AND PLASTICITY DURING REINFORCEMENT LEARNING BASED BMI .... 41
   Introduction .... 41
   Reinforcement Learning Based BMI .... 41
   Experiment Setup .... 42
   Neuronal Shaping As a Measure of Plasticity in MI .... 46
   Neuronal Tuning As a Measure of Robustness in MI States .... 48

4 REPRESENTATION OF REWARD EXPECTATION IN THE NUCLEUS ACCUMBENS AND MODELING OF EVALUATIVE FEEDBACK .... 52
   Introduction .... 52
   Experiment Setup .... 52
   Temporal Properties of NAcc Activity Leading up to Reward .... 55
   Extracting a Scalar Reward Predictor from NAcc .... 57

5 ACTOR-CRITIC REALIZATION OF THE BMS THEORY .... 60
   Introduction .... 60
   BMI Control Architecture .... 60
      Critic Structure .... 64
      Actor Structure .... 65
   Closed Loop Simulator .... 67
      Convergence of the Actor-Critic During Environmental Changes .... 70
      Reorganization of Neural Representation .... 77
      Effect of Noise in the States and the Evaluative Feedback .... 81

6 CLOSED-LOOP IMPLEMENTATION OF THE ACTOR-CRITIC ARCHITECTURE .... 85
   Introduction .... 85
   Experiment Setup .... 85
      Training Paradigm .... 86
      Electrophysiology .... 88
      Closed Loop Experiment Paradigm .... 89
   Critic Learning .... 91
      Neurophysiology of NAcc under Rewarding and Non-Rewarding Conditions .... 91
      State Estimation from NAcc Neural Activity .... 94
         Desired response .... 94
         Linear vs. nonlinear regression .... 97
         Classification vs. regression .... 98
         Time segmentation .... 100
   Actor Learning .... 102
      Preliminary Simulations Using Sign and Magnitude of the Evaluative Feedback for Training the Actor .... 103
      Inaccuracy in State Estimation and its Influence on the Actor Learning .... 106
      Actor Learning Based on Real MI Neural States and NAcc Evaluative Feedback .... 107

7 CONCLUSIONS .... 113
   Overview .... 113
   Broader Impact and Future Works .... 116

APPENDIX: DUAL MICRO ARRAY DESIGN .... 120
LIST OF REFERENCES .... 124
BIOGRAPHICAL SKETCH .... 134

PAGE 7

LIST OF TABLES

Table    page
2-1 Generic actor-critic algorithm .... 32
3-1 Robot actions .... 47
3-2 Action tuning depths .... 49
5-1 Decoding performance during sequential target acquisition .... 73
5-2 State of the neural ensemble during learning .... 80
5-3 BMI performance with synthetic and surrogate data .... 82
6-1 State estimation performance .... 102
6-2 Effect of inaccuracy in the evaluative feedback on the Actor performance .... 107

PAGE 8

LIST OF FIGURES

Figure    page
1-1 Organization chart of the dissertation .... 24
2-1 Agent-environment interaction .... 26
2-2 Actor-Critic architecture .... 32
2-3 Components of PARC .... 34
2-4 Actor-Critic structure in machine learning and neurobiology .... 39
3-1 RLBMI architecture .... 42
3-2 Overview of the RLBMI experimental paradigm .... 45
3-3 The timeline of the brain-controlled two-target choice task .... 46
3-4 Neural adaptation over multiple sessions .... 47
3-5 Action tuning curves of 3 neurons over three experiment sessions .... 51
4-1 Stereotaxic coordinates for microwire electrode implantation and experimental setup .... 54
4-2 Perievent time histogram of three categories of NAcc neurons .... 56
4-3 Extracting a scalar reward expectation signal from NAcc .... 58
4-4 Estimating an evaluative feedback from NAcc neural activity .... 59
5-1 An implementation of symbiotic BMI based on the Actor-Critic architecture .... 61
5-2 Structure of the actor .... 68
5-3 Simulation experiment setup .... 69
5-4 Spatial distribution of targets in the 2D workspace .... 71
5-5 Decoding performance during sequential presentation of the targets in the four-target reaching task .... 75
5-6 Decoding performance with random distribution of targets inside and outside of the workspace .... 76
5-7 Reorganization of the neural tuning map .... 78
5-8 Network adaptation after reorganization of the tuning map .... 79
5-9 Decoding performance during different user learning phases .... 80
6-1 Critic training and the closed-loop experiment setup .... 86
6-2 Neuromodulation of 3 NAcc neurons during non-rewarding (red trace) and rewarding (blue trace) trials over multiple sessions .... 93
6-3 Critic learning performance by using a ramp function as the desired response .... 96
6-4 Estimating the sign of the desired response using a nonlinear regression method (TDNN) .... 97
6-5 Estimating the sign of the desired response using a linear regression method (Wiener filter) .... 98
6-6 Supervised state classification of rewarding and non-rewarding states from NAcc neuromodulation using TDNN .... 99
6-7 Classification performance over different segments of the data using different window sizes .... 101
6-8 Perievent time histogram of NAcc neurons that were used for classification .... 101
6-9 Actor learning based on MI neural states using both amplitude and sign of the simulated evaluative feedback during one target acquisition task .... 104
6-10 Actor learning based on MI neural states using only the sign of the simulated evaluative feedback during one target acquisition task .... 105
6-11 Offline closed-loop control performance of the Actor-Critic architecture using real MI and NAcc neural data .... 109
6-12 Actor's movement trajectory in 3D space during closed-loop control using the simultaneous real neural activity in MI and NAcc .... 109
6-13 Actor's parameter adaptation during closed-loop control .... 110
6-14 Actor learning performance based on the real MI neural states and a random evaluative feedback (surrogate analysis) .... 111
6-15 Actor's movement trajectory in 3D space based on the real MI neural states and a random evaluative feedback .... 112
A-1 Dual micro-wire electrode .... 121
A-2 Relative anatomical positions of the MI and NAcc in a coronal cross section (1.7 mm anterior to the bregma) .... 121
A-3 Relative anatomical positions of the MI and NAcc in a sagittal cross section (0.9 mm lateral to the midline) .... 122
A-4 Relative anatomical position of the MI and NAcc in a sagittal cross section (1.9 mm lateral to the midline) .... 123
A-5 Relative anatomical position of the MI and NAcc in a sagittal cross section (2.4 mm lateral to the midline) .... 123

PAGE 11

LIST OF ABBREVIATIONS

BMI    Brain-Machine Interface
BMS    Brain-Machine Symbiosis
IA     Intelligent Assistant
MDP    Markov Decision Process
MI     Primary motor cortex
MLP    Multi-Layer Perceptron
NAcc   Nucleus Accumbens
PARC   Perception-Action-Reward Cycle
PE     Processing Element
RL     Reinforcement Learning
TD     Temporal Difference
TDNN   Time-Delayed Neural Network

PAGE 12

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

INTEGRATING ROBOTIC ACTION WITH BIOLOGIC PERCEPTION: A BRAIN-MACHINE SYMBIOSIS THEORY

By

Babak Mahmoudi

December 2010

Chair: Justin C. Sanchez
Major: Biomedical Engineering

In patients with motor disability, the natural cyclic flow of information between the brain and the external environment is disrupted by their limb impairment. Brain-Machine Interfaces (BMIs) aim to provide new communication channels between the brain and the environment by directly translating the brain's internal states into actions. To enable the user in a wide range of daily life activities, the challenge is designing neural decoders that autonomously adapt to different tasks, environments, and changes in the pattern of neural activity. In this dissertation, a novel decoding framework for BMIs is developed in which a computational agent autonomously learns how to translate neural states into actions by maximizing a measure of the goal shared between the user and the agent. Since the agent and the brain share the same goal, a symbiotic relationship between them evolves; this decoding paradigm is therefore called a Brain-Machine Symbiosis (BMS) framework. A decoding agent was implemented within the BMS framework based on the Actor-Critic method of Reinforcement Learning. The role of the Actor as a neural decoder was to find a mapping between the neural representation of motor states in the primary motor cortex (MI) and robot actions in order to solve reaching tasks. The Actor learned the optimal control policy using an evaluative feedback that was estimated by the Critic directly from the user's neural activity in the Nucleus Accumbens (NAcc). Through a series of computational neuroscience studies in a cohort of rats, it was demonstrated that NAcc could provide a useful evaluative feedback by predicting the increase or decrease in the probability of earning reward based on the environmental conditions. Using a closed-loop BMI simulator, it was demonstrated that the Actor-Critic decoding architecture was able to adapt to different tasks as well as to changes in the pattern of neural activity. The custom design of a dual micro-wire array enabled simultaneous implantation of MI and NAcc for the development of a full closed-loop system. The Actor-Critic decoding architecture was able to solve the brain-controlled reaching task using a robotic arm by capturing the interdependency between the simultaneous action representation in MI and reward expectation in NAcc.

PAGE 14

CHAPTER 1
INTRODUCTION

Brain Machine Interface (BMI)

The seminal works of Fetz and Schmidt [1,2] showed the possibility of long-term recording from cortical neurons in the monkey and suggested that the single-unit activity of motor cortical neurons could be used as a potential source of control for external devices. In patients with sensorimotor disability, directly interfacing prosthetic devices with the brain was a promising solution for restoring function. From this, Brain-Machine Interfaces (BMIs) emerged as a new assistive technology. BMIs are creating new pathways to interact with the brain, and they can be roughly divided into four categories: the sensory BMIs, which substitute sensory inputs (like visual [3,4] or auditory [5-7]) and are the most common (120,000 people have been implanted worldwide with cochlear implants); the motor BMIs, which substitute parts of the body to convey intent of motion to prosthetic limbs; the cognitive BMIs, which repair communication between brain areas such as the hippocampus [8] that mediates short-term to long-term memories; and the clinical BMIs, which stimulate specific brain areas to repair normal function, such as deep brain stimulation for Parkinson's disease [9] or to avoid or abort epileptic seizures [10]. This dissertation focuses on motor BMIs, which are revolutionizing the way paralyzed users interact with the environment because they offer a direct link between the brain and a tool that interacts with the environment, bypassing the body to express intent [11-13]. Within the motor BMIs there are two basic types: the trajectory BMIs and the goal-driven BMIs.

PAGE 15

Trajectory BMIs

The theory of trajectory BMIs is rooted in the concept of the population vector. In the 1980s, Georgopoulos showed that the firing rate of motor cortical neurons in monkeys was tuned to the direction of movement during a reaching task [14,15]. In this study, movement direction was measured in a standard center-out reaching task. The tuning function (cosine-shaped) relating discharge rate to direction is broad, covering all movement directions, and shows that each cell changes its discharge rate for all directions, or conversely, that all cells actively code each direction (simultaneous activity). By itself, the tuning function of a single cell is not very useful for decoding direction, because a single direction will correspond to more than one discharge rate, and the broadness of the function means that small fluctuations of discharge rate will correspond to large changes in direction. However, specific directions were well predicted if the weighted responses from many cells were added linearly together in vector form. This method is called the population vector algorithm (PVA) [15,16]. In 1999, it was shown for the first time that simultaneously recorded populations of single neurons could be used to control a closed-loop BMI in a rat model [17]. In this experiment, microelectrodes were implanted chronically in the primary motor cortex (MI) and ventrolateral (VL) thalamus. The recorded signal was used to control a robotic arm with one degree of freedom. Following the successful implementation of a closed-loop BMI in a rat model, in 2000 the results of predicting hand trajectory from an ensemble of cortical neurons in primates were published [18]. In this experiment, neural activity was recorded bilaterally from multiple cortical areas, including dorsal premotor cortex, primary motor cortex, and posterior parietal cortex, and both linear and nonlinear models were used to predict the robot trajectory from the ensemble neural activity.
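To make the population vector idea concrete, the following minimal sketch (illustrative only; the number of cells and all tuning parameters are synthetic, not data from the studies cited above) simulates cosine-tuned firing rates and decodes movement direction by summing each cell's preferred-direction vector weighted by its rate modulation:

```python
# Illustrative sketch: cosine tuning and population-vector decoding.
# All tuning parameters below are made up for the demonstration.
import numpy as np

rng = np.random.default_rng(0)
n_cells = 50
preferred = rng.uniform(0, 2 * np.pi, n_cells)   # preferred direction of each cell
baseline = rng.uniform(5, 15, n_cells)           # baseline firing rate (spikes/s)
gain = rng.uniform(5, 10, n_cells)               # modulation depth

def firing_rates(theta):
    """Cosine tuning: each cell fires most for movement along its preferred direction."""
    return baseline + gain * np.cos(theta - preferred)

def population_vector(rates):
    """Weight each cell's preferred-direction unit vector by its rate above baseline."""
    w = rates - baseline
    px = np.sum(w * np.cos(preferred))
    py = np.sum(w * np.sin(preferred))
    return np.arctan2(py, px)                     # decoded movement direction

true_dir = np.deg2rad(135.0)
noisy_rates = firing_rates(true_dir) + rng.normal(0, 1, n_cells)
decoded = population_vector(noisy_rates)
print(f"true {np.rad2deg(true_dir):.1f} deg, decoded {np.rad2deg(decoded) % 360:.1f} deg")
```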

PAGE 16

Neural information obtained concurrently over different cortical areas showed some level of motor control; however, the prediction result was slightly different based on the information obtained from each individual area. This observation implies that the high-dimensional representation of motor states in the cortex, especially MI, is correlated with motor parameters other than hand position alone [19]. Therefore, for accurate trajectory decoding, access to other sources of information might be necessary to constrain the high degrees of freedom in MI [20]. In the same year, Kennedy reported the first human invasive BMI [21]. This BMI was designed as a cursor controller for typing on a virtual keyboard and selecting icons on the screen by an amyotrophic lateral sclerosis (ALS) patient. Neural data were recorded using a neurotrophic electrode that used a trophic factor to encourage growth of neurons around the tip of the electrode. Because of ALS complications, the experiments were stopped after 5 months, and this attempt produced no long-term improvement over noninvasive BMIs. The most successful human microelectrode BMI was reported by Hochberg et al., who demonstrated that a tetraplegic human could achieve brain cursor control 3 years after a spinal cord injury. In this work, a 96-microelectrode Utah array was implanted in the primary motor cortex [22]. In 2002, Serruya et al. implemented a closed-loop BMI in monkeys [23]. In this experiment, the monkey controlled a cursor on a screen using a joystick, where the cursor served as visual feedback to the animal. Hand trajectory was reconstructed from 7-30 neurons in MI using a linear filter. The result was presented to the animal in the form of cursor movement on the screen. One of the subjects could move the cursor through brain control.

PAGE 17

In this paradigm, the plasticity of the brain compensated for inaccuracies of the linear model through feedback learning. These works demonstrate the remarkable capability of the brain in learning to control the prosthetic device, but they also highlight the importance of increasing the speed of learning in BMIs, especially for clinical applications. To overcome these issues, Taylor et al. investigated the performance of the BMI in the presence of biofeedback in a 3D center-out task the same year [24]. The results showed a significant improvement of trajectory reconstruction performance in closed-loop experiments compared to open-loop trajectories. The experiments were conducted using a fixed decoding algorithm; therefore, to reconcile the changes in neural modulation, recalibration was necessary every day. This issue impeded the applicability of BMI in patients with motor deficiency; therefore, in the next step, Taylor et al. developed a co-adaptive algorithm for trajectory prediction that compensated for the changes in the brain's tuning function. This approach was the first step in addressing the problem of the desired signal for BMI in patients with motor disability. Co-adaptive algorithms were designed to adapt to changes in the neural activity of the target brain area, which is the result of learning or reorganization in the brain [25-29]. Carmena et al. demonstrated that it was possible to decode two different upper limb functions, reaching trajectory and grasping, from the same brain area [30]. This result was in accordance with the fact that several movement parameters are encoded in cortex. In addition, by recording from different cortical areas, they decoded various kinematic parameters (hand position, velocity, gripping force, and the EMGs of multiple arm muscles). This work was the first to demonstrate recording from multiple brain areas in order to complete multiple tasks.

PAGE 18

In all of these works the focus has been on extracting kinematic parameters from neural activity by training an input-output model with data recorded within sessions in which the subject performed the physical movement to provide a desired signal. Trajectory BMIs, as the name indicates, learn how to control a robotic arm to follow a trajectory. They are basically signal translators that actuate prosthetics; they collect firing patterns of dozens to hundreds of neurons in the motor cortex and surrounding areas to decode the user's intent expressed in the neural signal time structure.

Goal Driven BMIs

In contrast to trajectory-based BMIs, whose focus is on extracting movement trajectory information from motor cortex, the goal-driven BMIs extract the location in space for the intended movement from a set of predetermined targets using electrodes in the parietal cortices, and they can be used as high-level coarse commands for robots to implement motion to the desired location in space [31]. The primary distinction between goal-driven BMIs and trajectory BMIs is in the type of information extracted from the brain and the control strategy. From this point of view, some noninvasive Brain-Computer Interfaces that utilize multiple electrodes placed on the scalp (or directly over the cortex) to map cognition signatures into a set of predefined discrete goals can also be categorized as goal-driven BMIs. In this class of BMIs, goal information is extracted from the brain, and then a robotic device makes the necessary movements to reach the goal. Shenoy et al. demonstrated the possibility of using neural activity of the Parietal Reach Region (PRR), before or without arm movement, as a control signal for prosthetic control. In this work, maximum likelihood estimation was used to estimate reach parameters. The PRR is a part of the Posterior Parietal Cortex (PPC), which lies at the functional interface between sensory and motor representations in the primate brain.

PAGE 19

The PPC receives sensory input from visual and proprioceptive pathways and sends output to primary motor cortex. These sensory inputs could be integrated to compute a goal vector in eye-centered coordinates for a reaching movement. However, this high-level motor information is not limited to this specific anatomical region and could be extracted from many higher-order areas in the brain, e.g., the frontal lobe [32]. Musallam et al. implemented a closed-loop cognitive BMI based on higher-order information about the targets, instead of low-level motor commands, for controlling the prosthetic device [33]. In this work, planning of a hand movement from the center of a screen to some fixed discrete targets was decoded from the Parietal Reach Region (PRR) and the dorsal premotor cortex (PMd). The decoding algorithm in this work was based on classification of each discrete peripheral target using a Bayesian classifier. At the beginning of each session, a database of reach movements was collected to build a classifier, and during the brain-control phase this classifier was used to decode target location. A recent study has reported that trajectory information also could be extracted from PRR [34]. In the goal-based BMIs, information about the goal is merely used as a cognitive signal for selecting a task, and the contribution of goal information to action generation is completely ignored.

Limitations of the Current BMI Design

Many groups have conducted research in trajectory BMIs, and the approach has been strongly signal-processing based, without much concern for incorporating the design principles of the biologic system in the interface. The implementation path has either taken an unsupervised approach by finding causal relationships in the data [35], a supervised approach using (functional) regression [36], or more sophisticated methods of sequential estimation [37] to minimize the error between predicted and known behavior.

PAGE 20

These approaches are primarily data-driven techniques that seek out correlation and structure between the spatiotemporal neural activation and behavior. Once the model is trained, the procedure is to fix the model parameters for use in a test set that assumes stationarity in the functional mapping. Some of the best-known models that have used this architecture in the BMI literature are the linear Wiener filter [38,39] and Population Vector [40], generative models [24,41,42], and nonlinear dynamic neural networks (a time-delay neural network or recurrent neural networks [43-45]); these are models that assume behavior can be captured by a static input-output model and that the spike train statistics do not change over time. While these models have been shown to work well in specific scenarios, they carry with them strong assumptions about stationarity of the neural response and will likely not be feasible over the long term. The success of BMI control is in part due to the brain plasticity that incorporates the prosthetic device into its cognitive space [46] and uses it as an extension of the biologic body [47]. Analyzing the trajectory BMI paradigm, it still follows the approach of the user primarily driving the prosthetic with very little feedback or cooperation from the device itself. It can be argued that this engineering approach shows the proof of concept of a BMI as long as the combined system solves the task. Unfortunately, there have been difficulties in translating the trajectory paradigm from proof-of-concept to clinical environments because it requires too much information from the setting, namely the existence of a desired trajectory to train the decoding algorithms. Quadriplegics, the intended clinical group for trajectory BMIs, cannot move, so there is no trajectory in real settings, and the current solutions are rather poor.
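As an illustration of the static input-output decoders discussed here, the sketch below fits a Wiener-filter style linear map from time-embedded spike counts to a one-dimensional hand position by least squares. It is a toy example on synthetic data, not the preprocessing or regularization used in the cited BMI studies:

```python
# Minimal Wiener-filter style trajectory decoder on synthetic data (illustration only).
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_neurons, n_taps = 2000, 20, 10

spikes = rng.poisson(2.0, size=(n_samples, n_neurons)).astype(float)
true_w = rng.normal(0, 0.1, size=n_neurons * n_taps)   # synthetic "true" mapping

def embed(x, taps):
    """Stack the current bin and (taps - 1) past bins of every neuron into one row."""
    rows = [x[t - taps + 1 : t + 1].ravel() for t in range(taps - 1, len(x))]
    return np.asarray(rows)

X = embed(spikes, n_taps)                        # time-embedded spike counts
y = X @ true_w + rng.normal(0, 0.5, len(X))      # synthetic 1-D hand position

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # Wiener solution via least squares
print("training correlation:", np.corrcoef(X @ w_hat, y)[0, 1].round(3))
```

Because the weights are fixed after training, this kind of decoder inherits exactly the stationarity assumptions criticized in the text.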

PAGE 21

Moreover, with continuous neural interface use, the neural representation supporting such behavior will change [48]. It has been shown in animals and humans that intelligent users can switch to brain control seamlessly [48,49]. However, it has also been shown that the time it takes to achieve a certain level of mastery of the prosthetic device can be extremely long, especially when the details of the dynamics of control are unknown to the user. From a behavioral perspective, even simple issues of scale (i.e., the dynamic range of reaching) can create problems for input-output models if the full range of values was not encountered during training [50]. Even with the great adaptability of the user's brain, it can take significant time for the performance to recover. To contend with these issues, it has been suggested by a few groups that adaptability of the interface is a critical design principle for engineering the next generation of BMIs [51-53]. In these studies, the concept of adaptability typically refers only to very detailed aspects of the signal translation, including automatic selection of features, electrode sites, or training signals [54,55]. This concept of adaptability does not go far enough, because it is unable to raise the level of the bidirectional dialogue with the user and ignores some of the fundamental aspects of goal-directed behavior. The work in this dissertation is motivated in part by the need for BMI systems that enable the user in a wide range of daily life activities; the challenge is designing neural decoders that autonomously adapt to different tasks, changes in the pattern of neural activity, and changes in the environment.

Brain Machine Symbiosis (BMS) Theory

The design of a new framework to transform BMIs begins with the view that intelligent neuroprosthetics emerge from the process where the user and the neuroprosthetic cooperatively seek to maximize goals while interacting with a complex, dynamical environment.

PAGE 22

Emergence, as is discussed here and in the cognitive sciences, depends on a series of events or elemental procedures that promote specific brain or behavioral syntax, feedback, and repetition over time [56]; hence, the sequential evaluative process is always ongoing, adheres to strict timing and cooperative-competitive processes, and is very different from the notion of static computational methods. With these elemental procedures, goal-directed behavior can be built on closed-loop mechanisms which continuously adapt internal and external antecedents of the world, express intent through behavior in the environment, and evaluate the consequences of those behaviors to promote learning. This form of adaptive behavior is the main feature of intelligent behavior that distinguishes between reactive and proactive behavior, and it relies on continuous processing of sensory information that is used to guide a series of goal-directed actions; this is called the Perception-Action-Reward Cycle (PARC). Several basic computations are involved in the PARC: (i) the first step is the formulation of goals that, based on one's internal motivations, sets the context for planning an action. Beyond the setting of an initial goal, the PARC involves (ii) estimating the value of all the possible actions for attaining one's goals, and (iii) choosing the best action that will achieve one's goal, based on action values. The underlying substrate for all these processes is the existence of an internal model of the environment, but it is rarely the case that one has a complete world model and can accurately select the optimal action in a given environment. More typically, one is dealing with a nonstationary world model in an ever-changing environment. Thus, the PARC should continuously run in order to adapt to changes in the environment.

PAGE 23

Collectively these components play a critical role in organizing behavior in the nervous system [57] and form the basis of Brain-Machine Symbiosis (BMS). The concept of the BMS theory, which will be developed throughout this dissertation, is based on distributing the PARC between the user and a computational agent that follows the same principles of goal-directed behavior as the user does. By aligning the goal of the computational agent with the user's goal, a symbiotic relationship between them emerges, and the computational agent evolves as an Intelligent Assistant (IA) that helps the user reach its goal. The structure of the IA can be engineered for specific tasks and purposes. In this work I have designed the IA for decoding motor neural commands. Based on the BMS theory, I have framed the BMI decoding problem as a continuous cyclic process in which the user's internal antecedents will be translated to external states in the environment. The BMS design approach is fundamentally different from the traditional BMI design approach in the sense that the decoder autonomously adapts to maximize the user's satisfaction without the need for an external teaching signal.

Organization of the Dissertation

The dissertation is focused on developing a theory and architecture for symbiotic BMI and testing them through simulation and experimentation. The text is organized in two main parts and six chapters. In the first part I will develop the BMS theory based on the computational theories of reinforcement learning and the neurobiological principles of value-based decision making in the brain. Next, I will present the experimental results that support the theory. In the second part I will develop the architecture and test it through simulation and closed-loop experimentation. Figure 1-1 outlines the organizational chart of the dissertation and the title of each chapter.

PAGE 24

Figure 1-1. Organization chart of the dissertation. [The chart maps the two parts of the work onto chapters: developing the theory (theory of reinforcement learning and of reward processing in the brain, Chapter 2) and its validation (action representation in MI, Chapter 3; evaluative feedback in NAcc, Chapter 4), followed by the symbiotic architecture and simulation (Chapter 5) and the closed-loop experiments (Chapter 6).]

PAGE 25

CHAPTER 2
A THEORETICAL FOUNDATION FOR THE BMS THEORY

Introduction

In the previous chapter the concept of Brain-Machine Symbiosis was introduced. In this chapter and the next, the theoretical foundation for how that concept can be implemented using the well-established computational and neurophysiological theories of value-based decision making will be developed. Since we are interested in the interaction of a learning agent with its environment and the principles that define intelligent behavior, we begin our investigation using reinforcement learning (RL), because it is a computational framework built upon action-value-reward sequences. Because of this feature, RL can constitute the computational framework of the symbiotic BMI. In this section I will discuss how RL theory can be used for developing the theory of symbiotic BMI. In addition, some of the RL concepts that will be used for developing the BMS theory will be introduced. Another aspect of the BMS theory is rooted in the theory of reward processing and the perception-action-reward cycle in the brain. In this section, the neurophysiological principles that contribute to developing the theory will be introduced.

Reinforcement Learning: A Computational Framework for the BMS Theory

Reinforcement Learning (RL) is a computational framework for reward-based learning and decision making. Learning through interaction with the environment distinguishes this method from other learning paradigms. Unlike supervised learning, in which there is a desired signal that tells the learner exactly what to do, in RL a scalar signal, called reward, evaluates the performance of the learner during interaction with the environment.

PAGE 26

In reinforcement learning, instead of saying what to do (as in supervised learning), the actions or decisions that the learner takes are evaluated based on their outcome. The ultimate goal of the learner is to maximize a measure of reward over time through interaction with the environment; therefore, those actions and decisions that lead to higher reward are reinforced more [58]. Another important feature of learning based on a dimensionless reward signal is that the learning task is decoupled from the teaching signal; therefore, depending on different states, the agent can learn different control policies to solve various tasks using the same reward signal. Learning based upon interaction with the environment is the key feature of RL as the computational framework of the BMS theory. Figure 2-1 shows a schematic diagram of the agent (learner) and environment interaction in reinforcement learning. By taking actions the agent changes the state of the environment and receives rewards. The agent's goal is maximizing rewards over time; therefore, depending on the state of the environment, it should learn the optimal action selection policy that yields maximum reward over time.

Figure 2-1. Agent-environment interaction

PAGE 27

Throughout this section, I will discuss how an RL agent satisfies all the requirements of an intelligent tool that were mentioned in the introduction, and I will demonstrate the important implications of this feature of RL in the BMS theory.

RL Components

By formulating BMI control as a decision making problem, there are a number of different actions we have to choose from. A Markov Decision Process (MDP) is a way to model problems so that one can automate the process of decision making in uncertain environments and use off-the-shelf algorithms to solve the decision making process. An MDP model is composed of four components: a set of states, a set of actions, the effects of actions on states (state transition probabilities), and the immediate value of the actions (rewards). The main problem in solving an MDP is to find the best action to take in each particular state of the environment. The transitions specify how each of the actions changes the state. Since an action could have different effects depending upon the state, we need to specify the action's effect for each state in the MDP. The fact that the effects of an action can be probabilistic is the most powerful aspect of the MDP. This probability is known as the transition probability:

P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}    (2-1)

In Equation 2-1, s_t and a_t define the state and action, respectively, at time t, and s_{t+1} is the next state. The Markov property of an MDP implies that the transition from the current state to the next state is independent of all previous transitions. If we want to automate the decision making process, then we must be able to have some measure of an action's cost or a state's value so that we can compare different alternative action policies over the long term.
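To make the four MDP components concrete, the short sketch below writes a toy finite MDP as transition-probability and reward tables and samples transitions according to Equation 2-1. The states, actions, and numbers are invented for illustration and are not part of the BMI formulation above:

```python
# Toy finite MDP (illustrative only): 3 states, 2 actions.
# P[a][s, s'] is the transition probability of Equation 2-1;
# R[a][s, s'] is the expected immediate reward of Equation 2-2.
import numpy as np

n_states, n_actions = 3, 2
P = np.array([
    [[0.8, 0.2, 0.0],        # action 0
     [0.0, 0.6, 0.4],
     [0.1, 0.0, 0.9]],
    [[0.5, 0.5, 0.0],        # action 1
     [0.3, 0.0, 0.7],
     [0.0, 0.2, 0.8]],
])
R = np.zeros((n_actions, n_states, n_states))
R[:, :, 2] = 1.0             # reward is earned whenever the agent lands in state 2

rng = np.random.default_rng(0)

def step(s, a):
    """Sample s_{t+1} from P(. | s_t = s, a_t = a) and return (next state, reward)."""
    s_next = rng.choice(n_states, p=P[a, s])
    return s_next, R[a, s, s_next]

s = 0
for t in range(5):
    a = rng.integers(n_actions)          # a random policy, just to exercise the model
    s, r = step(s, a)
    print(f"t={t} action={a} -> state={s} reward={r}")
```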

PAGE 28

We specify some immediate value for performing each action in each state. The learner expects a reward upon visiting a new state:

R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}    (2-2)

Equation 2-2 computes the expected reward from taking action a in state s. The result of taking action a is changing the state of the environment to state s', which yields the immediate reward r_{t+1} at time t+1. The transition probability and the expected reward specify the dynamics of a finite MDP. One main assumption behind the MDP is that the environment obeys the Markov property, i.e., state transitions are based only on the current state of the environment and the actions selected by the learning agent. In a motor BMI for reaching tasks, the states can be the neural states of the brain and the actions can be defined by a set of discrete prosthetic arm movements. The role of the agent is to learn the neural dynamics in terms of state transition probabilities. Beyond state and action, reinforcement learning is composed of three main elements: the reward function, the value function, and the policy. These elements constitute a bidirectional transfer function between perception (states) and actions, which forms the backbone of the BMS theory. The reward function determines the immediate reward of visiting each state; it maps each state or action-state pair to a scalar reward value. The discounted return is defined as a weighted sum of all rewards that will be earned in the future, where the discount factor γ ≤ 1 gives more weight to immediate rewards:

R_t = \sum_{n=t+1}^{\infty} \gamma^{\,n-t-1} r_n    (2-3)

Once the reward function defines the immediate return (r_n in Equation 2-3) of visiting a state, the value function determines the value of that state or action-state pair in terms of the accumulated expected reward (R_t in Equation 2-3) over the long term if the agent starts from that state.
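Equation 2-3 can be read directly as code; the sketch below evaluates the discounted return for an arbitrary, made-up reward sequence and several discount factors:

```python
# Discounted return of Equation 2-3: R_t = sum over n > t of gamma^(n-t-1) * r_n.
# The reward sequence and discount factors below are illustration values only.
import numpy as np

def discounted_return(future_rewards, gamma):
    """Weighted sum of future rewards; smaller gamma weights immediate rewards more."""
    weights = gamma ** np.arange(len(future_rewards))
    return float(np.sum(weights * np.asarray(future_rewards, dtype=float)))

future_rewards = [0.0, 0.0, 1.0, 0.0, 1.0]   # r_{t+1}, r_{t+2}, ...
for gamma in (0.5, 0.9, 1.0):
    print(f"gamma={gamma}: R_t = {discounted_return(future_rewards, gamma):.3f}")
```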

PAGE 29

This definition implies that a particular state may yield a low instant reward, but if its next states yield high reward, then the value of that state is high even though its instantaneous reward is low.

V^{\pi}(s,a) = E_{\pi}\{ R_t \mid s_t = s, a_t = a \}    (2-4)

Equation 2-4 computes the expected long-term reward (R_t) obtained by taking action a in state s at time t, provided that the agent follows policy π.

RL Methods for BMS

The solution to an MDP is called a policy, and it simply defines a mapping from the state space to the action space. Solving a reinforcement learning problem means finding a policy that maximizes the reward over the long run. By definition, the policy is the solution to the decoding problem in the BMS framework. Although the policy is what we are after, we will usually compute a value function, and then we can derive the policy from the value function. A value function is similar to a policy, except that instead of specifying an action for each state, it specifies a numerical value for each state. Dynamic programming (DP), which is a model-based approach, and Temporal Difference (TD) learning are two fundamental classes of solution to the reinforcement learning problem. Typically, for the BMI application, a model of the environment does not exist; therefore, in this work the focus will be on the TD learning method. If the agent does not know the expected rewards and the transition probabilities, but rather has to learn by interacting with the environment, then methods of temporal differences should be used for prediction and optimization. Conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes; temporal difference methods, however, assign credit by means of the difference between temporally successive predictions.
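The prediction side of TD learning can be illustrated with the tabular TD(0) rule, which adapts the value estimate using exactly this difference between successive predictions. The example below runs it on a small random-walk chain with a fixed random policy; the task and all constants are invented for illustration:

```python
# Tabular TD(0) value prediction on a small random-walk chain (illustration only).
# V(s) <- V(s) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s)]
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.1
V = np.zeros(n_states)
rng = np.random.default_rng(0)

for episode in range(2000):
    s = n_states // 2                        # start every episode in the middle state
    done = False
    while not done:
        s_next = s + rng.choice([-1, 1])     # fixed random policy: step left or right
        if s_next < 0:                       # falling off the left end: no reward
            r, done = 0.0, True
        elif s_next >= n_states:             # reaching the right end: reward of 1
            r, done = 1.0, True
        else:
            r = 0.0
        target = r if done else r + gamma * V[s_next]
        V[s] += alpha * (target - V[s])      # bootstrapped TD(0) update
        if not done:
            s = s_next

print("estimated state values:", np.round(V, 3))
```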

PAGE 30

Since the transition probabilities are unknown, the difference between two consecutive predictions is used to adapt the value function. It has been shown that this bootstrapping method converges to the optimal value function for finite MDP models [59]. When function approximation is used, the TD algorithm also converges with linear function approximators. There are two principal classes of temporal difference methods, Q-learning and Actor-Critic, each of which will be discussed as a potential computational framework for the BMS theory.

Q Learning

In Q-learning methods, action-value functions are learned, and a policy is determined exclusively from the estimated values:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]    (2-5)

In Equation 2-5, Q(s_t, a_t) represents the state-action value function at time t and r_{t+1} is the immediate reward at the next time step. In this method, the Q function is updated independently of the policy, so we can derive actions from Q using any policy; therefore, Q-learning is known as an off-policy method. However, for correct convergence, all action-state pairs should be updated. The above formulation is for one-step Q-learning. Combining Q-learning with an eligibility trace, we can look farther ahead in the action-state space to find the expected return. Q-learning, as an off-policy method, is more suited for episodic tasks; in the BMS framework, however, we are dealing with a continuous task. In Q-learning, the control policy and the state-value function are both computed from the Q function; therefore, there is no structured way of splitting the control architecture between the user and the IA in the BMS framework.
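A minimal tabular instance of the one-step Q-learning update of Equation 2-5, run on a toy chain task (the states, rewards, and epsilon-greedy behavior policy are illustrative assumptions, not part of the BMS formulation):

```python
# One-step Q-learning (Equation 2-5) on a tiny chain task, illustration only:
# states 0..4, actions 0 = left / 1 = right, reward 1 for stepping past the right end.
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(3000):
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy; the update itself is off-policy
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = s - 1 if a == 0 else s + 1
        if s_next >= n_states:
            r, done = 1.0, True
        elif s_next < 0:
            r, s_next = 0.0, 0               # bumping into the left wall
        else:
            r = 0.0
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])   # Equation 2-5
        if not done:
            s = s_next

print("greedy action per state:", np.argmax(Q, axis=1))   # expect all 1s (move right)
```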

PAGE 31

Actor Critic Learning

Actor-critic (AC) is a class of TD learning algorithms in which the parameters of two separate structures, one for action selection (the Actor) and one for value estimation (the Critic), are simultaneously estimated. The Actor contains the action selection policy by establishing a probabilistic mapping between states and actions. The Critic is a conventional value function that maps states to expected future reward. AC learning is based on simultaneously solving a prediction problem, addressed by the critic, and a problem of control, addressed by the actor. These problems are separable but are solved simultaneously to find an optimal policy. A variety of methods can be used to solve the prediction problem, but the ones that have proved most effective in large applications are those based on some form of TD learning. Figure 2-2 shows the basic architecture of the actor-critic. In this architecture, the critic uses a scalar signal (the TD error) to evaluate the actions taken by the actor. The critic implements a state-value function. After each action selection, the critic evaluates the new state to determine whether things have gone better or worse than expected. That evaluation is in the form of the TD error:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)    (2-6)

Here V is the value function and γ, 0 ≤ γ ≤ 1, is a discount factor; s_t and r_t represent the state of the environment and the immediate reward at time t, respectively. If the TD error is positive, the action taken by the actor should be reinforced; otherwise, the association between the action and the state should be weakened. Based on this TD error, the Actor's parameters are then adapted. A generic actor-critic algorithm is summarized in Table 2-1 [59]. Various implementations of this algorithm are presented in the literature.


Table 2-1. Generic actor critic algorithm.
1. Initialize the parameters of the stochastic policy and the value-function estimate.
2. Execute an action a drawn from the stochastic policy at the current state and compute the TD error.
3. Adjust the probability of action a based on the TD error: if δ_t > 0 the probability is increased; otherwise it is decreased.
4. Update the critic by adjusting the estimated value of the state: V(s_t) ← V(s_t) + α δ_t.
5. Return to step 2.

Figure 2-2. Actor-Critic architecture.

Actor-Critic algorithms can be viewed as stochastic gradient algorithms on the parameter space of the Actor [60]. Actor-Critic is a policy gradient method in which the policy is explicitly represented, independent of the value function, and is updated based on the gradient of the expected reward with respect to the policy parameters [61]. This is an important feature that makes Actor-Critic a very appealing candidate for the BMS


framework. Another class of policy gradient methods is Williams' REINFORCE algorithm [62,63]. An advantage of policy gradient methods over value-function-based approaches like Q-learning is that, with an arbitrary differentiable function approximator, Actor-Critic algorithms converge to a locally optimal policy [61]. However, in situations where there are large plateaus in the expected reward, the gradients are small and usually do not point towards the optimal solution. In these situations, natural Actor-Critic algorithms [64,65] perform better than regular policy gradient methods. In this class of Actor-Critic algorithms, steepest ascent with respect to the Fisher information metric, called the natural policy gradient [66], is used instead of the ordinary gradient.

Perception-Action-Reward Cycle

The fundamental feature that distinguishes Actor-Critic RL from other learning paradigms is the cyclic flow of information between the agent and the environment in the form of action and perception. All forms of adaptive goal-directed behavior require processing the sensory information that results from actions and modifying actions based on the processed information. Depending upon the task complexity, sensory and motor information are integrated at different levels through this cyclic process to maximize reward, which is known as the Perception-Action-Reward Cycle (PARC) [57]. Figure 2-3 shows the components of PARC for action selection during goal-directed behavior [67]. In this diagram, a representation of the goal and of the possible actions for reaching it is first constructed in the brain. Based on the ultimate goal, a value is assigned to each action, and the action with the highest value is selected and executed. This part of the process forms the action component of PARC. In the next step, the outcome of the selected


action is perceived and the result is evaluated in terms of the final goal. A learning mechanism modifies the action process based on the perceived results.

Figure 2-3. Components of PARC.

Comparing the Actor-Critic architecture in Figure 2-2 with the structure of PARC in Figure 2-3, it can be seen that, by assigning separate structures to perception (the Critic) and action (the Actor), the Actor-Critic architecture inherently implements PARC. The idea for promoting a symbiotic relationship between the brain and the IA in the BMS framework, using the Actor-Critic architecture, is to train an artificial Actor with a biologic Critic. The cyclic aspect of sensory-motor integration is instrumental for motor control in a changing environment. In patients with motor deficiency, the PARC is disrupted by the inability to execute actions; therefore, a BMI should bridge the disrupted loop with an external actor. For a brain-controlled device to remain in the loop with the user, the BMI should provide a bidirectional information channel between the brain and the IA, such that at each time step the user can send neural motor commands to the IA for action selection, and the IA can perceive the outcome of its action from the user's perspective in order to modify the control policy. However, in the current BMI decoders that use input-output


models, the flow of information is unidirectional, and it is assumed that biofeedback (the perceptual aspect of PARC) is implicitly represented in the motor commands. Understanding the dynamics of PARC in the brain is important from both engineering and neuroscience perspectives. From the neuroscience point of view, this information is critical for understanding the mechanisms of motor control. From the engineering standpoint, this knowledge could be used in the design of intelligent robots that are able to adapt to their environment; in other words, insight into PARC is essential for reverse engineering the brain. The BMS framework is a powerful tool for functionally dissecting the PARC by embedding computational models in the loop with the brain and studying the information dynamics through the adaptation of the model parameters. My work throughout the rest of this dissertation is focused on the systematic development of a framework for generating goal-directed locomotion based on the cyclic integration of goal perception and motor actions using the computational structure of Actor-Critic learning. In this framework, the reward-processing circuitry of the brain is sought as a source of goal representation that provides an evaluative feedback for the IA's actions. The IA in the BMS framework integrates the perceptual and motor states of the brain for action generation by learning how to decode actions based on the brain's reward perception.

Reward Processing in the Brain

One of the key features of the Actor-Critic method is that it provides a computational framework for continuously transforming the perception of the state of the environment into actions based on reward maximization. The compatibility between the PARC in the Actor-Critic method and the brain is fundamental to the BMS theory. Reward is the main link that promotes a symbiotic relationship between the brain and


machine. In the previous section it was explained how an artificial learning agent can learn based on the reward signal. If we can align the intention of the user with the reward signal of the agent, then the learning agent will learn a state-action mapping that is based on the user's intention. It is very important to note that the states and actions are completely arbitrary; therefore, the agent can learn any task based on the user's intention, provided the states and actions are defined appropriately. In the BMS theory, the mapping between internal brain states and actions will be called the intention function. The task of the learning agent is to approximate this intention function, but the question is how the agent can learn it without the need for a teaching signal external to the user. This question returns to the limitations of current BMI design, which were discussed in Chapter 1. Different parts of the brain encode different aspects of the user's reward-seeking behavior [68], but the basal ganglia are known for their instrumental contribution to reward processing and motor control through cortico-basal ganglia loops [69-71]. If we consider the reaching task as a reward-seeking movement, the reward expectation signal can provide an evaluative feedback for the IA about the likelihood of reaching the target. A real-time measure of this likelihood is important because it signals the IA whether to continue with its current control policy or to change it. The striatum is a brain area that plays a key role in the representation of the reward expectation associated with actions. There is a hypothesis that the striatum contributes to action selection and modification of the control policy [72]. This hypothesis is supported by clinical studies showing that at the early stage of Huntington's disease (HD), degeneration begins in striatal neurons; therefore, it is likely that


dysfunction in the striatal patches causes the patient to fail to adjust the reaching movement. Under pathological conditions in which the striatum is involved, patients exhibit jerky movements in reaching tasks. This clinical observation suggests that the control policy is computed in the striatum. Because the control policy changes too often, error corrections of the movement trajectory fail in these patients. In healthy people, small errors do not change the control policy, and the next-state planner can maintain a smooth trajectory in guiding the end effector to the target [73]. These symptoms are different from the tremors of Parkinson's disease (PD), in which the Substantia Nigra is involved. From the computational modeling perspective, the basal ganglia circuitry by itself implements a reinforcement learning framework for motor learning and sensory-motor integration during goal-directed behavior [74-76]. In this framework, the striatum represents action-specific reward values in cortico-basal ganglia loops [77] and encodes the reward expectation of actions [78] during goal-directed behavior. It has been suggested that the striatum enhances the association between sensory information and the motor response that is followed by reward [79]. In this process, the Nucleus Accumbens (NAcc), a major component of the ventral striatum, is known for associating reward values with sensory information and selecting actions that lead to reward. The most important hypothesis underlying both computational and experimental research is the function of striatal circuits (NAcc) in facilitating action selection through the transformation of sensory information into actions [80,81]. This hypothesis partially stems from the anatomical arrangement of the NAcc. The NAcc receives inputs from the hippocampus, amygdala, prefrontal cortex, and ventral tegmental area, which are known to contribute to reward perception, and sends output to areas that are involved in motor generation, such as the ventral


pallidum. Taha et al. have shown that the neural activity of a subset of neurons in the NAcc was tightly correlated with movement direction [82]. This observation suggests that the NAcc represents both sensory and motor aspects of reward-seeking behavior. The integration of reward perception and motor information in the NAcc has given rise to the idea that the NAcc serves as a limbic-motor interface [83]. From the standpoint of developing a symbiotic BMI, where the integration of goal information and trajectory generation is the key concept, the NAcc can be an important source of information; however, which aspects of reward perception or motor action useful for BMI are encoded in this area is unknown and is the subject of part of the research in this dissertation. Biological plausibility of the control architecture is another important design characteristic, and it brings two significant advantages for the BMI application. The first is that we can incorporate the functionality of a brain area as an embedded system in the BMI control architecture; the second stems from the fact that the interaction between the brain and an external controller in a unified framework makes the BMI an efficient tool for understanding the control mechanisms of the brain. This approach is a departure from the traditional approach to designing BMIs, which currently focuses on decoding control commands from the brain. It not only opens a new avenue in neural interface design but also provides a tool for reverse engineering the brain. In order to utilize the functionality of the brain in the design of a neural interface, a processing model of the target brain area is instrumental. If the overall architecture of the controller is similar to the computational model of the target brain area, the mathematical and biological components of the overall architecture can be merged together.


Figure 2-4. Actor-Critic structure in machine learning and neurobiology. A) Schematic implementation of the Actor-Critic. B) Neural correlates of the components of structure A.

The basal ganglia are known for their instrumental role in goal-directed behavior, and they have been the target of many computational studies [84-88]. Numerous neurophysiologic studies propose reinforcement learning as the computational model of the basal ganglia [86,89-91]. These results, along with the neuroanatomical structure of the basal ganglia, suggest that substructures of the basal ganglia, including the striatum and midbrain dopaminergic neurons, implement an Actor-Critic realization of reinforcement learning. Figure 2-4 compares the regular Actor-Critic architecture with its biological counterpart in the brain [92]. In Figure 2-4A, the Critic implements a state-value function that evaluates the actions of the Actor based on the states and the instantaneous reward. The Critic criticizes the Actor through the temporal difference error. If the action is good in terms of increasing reward expectation, the Critic generates a positive prediction error that reinforces the selected action by strengthening the association between the state and the action. On the other hand, a negative prediction error decreases the probability of selecting that particular action in the same state again. Figure 2-4B is the biological counterpart of Figure 2-4A, in which the Dorso-Lateral Striatum (DLS) takes the role of


the Actor by implementing the action-selection policy, while the Ventral Striatum (VS) implements a value function. In this diagram, HT+ corresponds to the hypothalamus and other structures, such as the habenula, pedunculopontine nucleus, and superior colliculus, that are potentially involved in processing the received reward. For the BMS framework, the duality between these two structures suggests the possibility of replacing the Critic in Figure 2-4A with its biologic counterpart.


CHAPTER 3
MOTOR STATE REPRESENTATION AND PLASTICITY DURING REINFORCEMENT LEARNING BASED BMI

Introduction

In the previous chapter, the theoretical aspects of the BMS theory were introduced, and it was explained how RL methods can provide the computational tools for designing a symbiotic BMI. The first building block of any RL paradigm is the definition of state and reward. It was also discussed that, in order to promote the symbiotic relationship between the user and the prosthetic device, both the neural states and a measure of reward should be estimated from the brain. In this chapter, experimental results will show the feasibility of using the motor neural representation in MI as RL states and of extracting reward information from the NAcc in a rat model.

Reinforcement Learning based BMI

We designed and tested a BMI platform based on RL principles (RLBMI) in which a computational agent learned to map MI neural states to a set of robotic actions in order to perform a reaching task. The purpose of this section is to show how the neural states emerged during the experiments in which the RL agent used those states to complete the task. The RLBMI framework in Figure 3-1 was designed for studying the causal relationship between the neural states and the RL agent. Here, the interaction between a computational agent (BMI Algorithm) and the user's brain (Rat's Brain) occurs through the generation of a sequence of brain states that are mapped by the agent to a series of actions of a robotic arm to complete a reaching task. Upon completing the task, the animal receives a water reward. The agent and user must learn coadaptively (based on actions, neural states, and rewards) which strategies will maximize the reward.


Figure 3-1. RLBMI architecture.

Co-adaptation here means that the rat's brain and the agent participated in a dialogue and adapted to each other to maximize their cumulative rewards. The RLBMI was the first step in developing the BMS framework. In this work we demonstrated the feasibility of decoding MI neural states using RL techniques. In the RLBMI, reward information was manually provided to the agent, and we assumed that the reward landscape defined in the environment matched the internal reward representation in the brain; therefore, the RLBMI can be categorized as a semi-symbiotic paradigm.

Experiment Setup

Three male Sprague-Dawley rats were trained (about 100 trials per session) in a two-lever choice task via operant conditioning to associate robot control with lever pressing. The rats were trained using shaping and chaining [93] to associate control of the robot with rewards obtained when the goal was achieved by reaching to the correct target in the external environment. As shown in Figure 3-2, the rat was enclosed in a behavioral cage with plexiglass walls. A set of retractable levers (Med Associates, St. Albans, VT) in the robotic workspace is referred to as the target levers. There were three sets of green LEDs: the set immediately behind the rat levers are


cage LEDs, the set in the robot workspace are midfield LEDs, and the set on the robot levers are target LEDs. The positioning of the three sets of LEDs and levers offers a technique to guide attention from inside the cage to the robot environment outside. One additional blue LED mounted on the robot endpoint (the guide LED) was used to cue the animal for tracking the position of the robot. Because the behavioral cage walls were constructed from plexiglass, the robotic workspace was within the rat's field of vision [94]. The robot operated in a workspace based on an action representation defined in Cartesian space, as shown in Figure 3-2. The action set included 26 movements: 6 uni-directional (i.e., up, down, forward, back, left, and right), 12 bi-directional (e.g., left-forward), and 8 tri-directional (e.g., left-forward-down), plus "do not move", for a total of 27 possible actions. A solenoid controller dispensed 0.04 mL of water into the reward center on successful trials, when the animal maneuvered the robot to the target. An IR beam passed through the most distal portion of the reward center. The rat initiated the trials. Once the animals had been operantly conditioned, they were implanted with microelectrodes and entered brain-control mode, in which their neuronal activity drove the movement of the robot arm. To derive the internal neural representation, rats participating in the BMI experiment were chronically implanted bilaterally with two microelectrode arrays (32 total electrodes) in layer V of the caudal forelimb area of the primary motor cortex (MI) [95,96]. The intent of the animal was derived directly from these signals. Each array was 8x2 electrodes with 250 μm row and 500 μm column spacing (Tucker-Davis Technologies (TDT), Alachua, FL). Neuronal signals were recorded from the caudal forelimb area of MI because this area has been shown to be


predictive of limb motion in a rat model; additionally, similar modulations occurred when operating a BMI without physical movements [43]. Electrophysiological recordings were performed with commercial neural recording hardware (TDT, Alachua, FL). A TDT system (one RX5 and two RP2 modules) operated synchronously at 24414.06 Hz to record neuronal potentials from both microelectrode arrays. The neuronal potentials were band-pass filtered (0.5-6 kHz) and spike sorting was performed to isolate single neurons in the vicinity of each electrode. Once the neurons were isolated, the TDT system recorded unit firing times and a firing rate estimate. As in other BMI experiments, we defined the state by the neuronal firing rates in 100 ms windows [12,24,97,98], which were embedded in longer time windows (667 ms) [99,100] to respect the Markov assumption and account for motor planning. The vector of firing rates obtained from all recorded neurons was used as the input to the RL agent. In brain control, the rat's neuronal modulations in the primary motor cortex defined the environmental states of the agent, which generated the robot movements. The mapping between states and actions (value function) was updated every 100 ms based on a reward distribution defined in the workspace. The rewards and penalties for the RL agent were assigned in the robot workspace based on the robot completing the task that the animal was trained to achieve; i.e., if the rat maneuvered the robot proximal to a target, then the agent was reinforced (r_t = 1) and the rat earned a water reward. Penalties were assigned (r_t = -0.01) whenever the task was not completed, to encourage minimization of task time. As with operant conditioning, the rat and the agent had to co-adapt to learn the task over multiple sessions spanning several days (cumulative training). Essential to the success of this task was the coupling of the motivation and


actions of the rat with the parameters of the agent and the resulting movement of the robot. While the rat was learning how to obtain its reward, the agent had to change its parameters and learn to respond more effectively to the animal's brain signals.

Figure 3-2. Overview of the RLBMI experimental paradigm. A) Schematic showing the layout of the plexiglass cage, levers, LEDs, and robot arm. B) Image of the complete experimental setup. The Cartesian coordinate system is superimposed on the workspace of the robot.

At each instance in time, an action must be selected from the set of 27. Once a reaching trial began (i.e., with a nose poke in the water receptacle), the agent selected the best action given the value function. Action selection continued every 100 ms based on the evolving state. The agent had to select specific temporal action sequences based on MI neuromodulations to maneuver the robot proximal to the target. Figure 3-3 shows the timeline of each trial during the experiment.
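A minimal sketch of how such a state vector and reward might be formed; the 100-ms bins, the roughly 667-ms embedding, and the +1 / -0.01 reward values follow the description above, while the data layout and the proximity radius are hypothetical:

import numpy as np

BIN_MS, EMBED_BINS = 100, 7          # 100-ms bins, ~667-ms embedding (7 bins)

def make_state(spike_counts):
    """Stack the last EMBED_BINS firing-rate bins of all neurons into one state vector.
    spike_counts: array (n_neurons, n_bins) of spikes per 100-ms bin."""
    rates = spike_counts[:, -EMBED_BINS:] / (BIN_MS / 1000.0)   # spikes per second
    return rates.flatten()

def reward(robot_pos, target_pos, radius=0.02):
    """+1 when the end effector is proximal to the target, small penalty otherwise."""
    return 1.0 if np.linalg.norm(robot_pos - target_pos) < radius else -0.01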


Figure 3-3. The timeline of the brain-controlled two-target choice task.

The trial time limit for brain control was extended to 4.3 s to allow the rat to make corrections using visual feedback of the robot position. The rat was not cued explicitly that it was in brain control, since all four levers were extended for each trial. However, we observed in the first session of brain control that all animals ceased making movements immediately once they began obtaining water using neural activation alone. The animals tended to remain stationary in the center of the cage, directly in front of the water center. The robot was maneuvered in a 3D workspace based on 27 actions (summarized in Table 3-1).

Neuronal Shaping As a Measure of Plasticity in MI

In order to investigate the plasticity of the neural representation in MI, a requirement for executive motor commands in the symbiotic BMI, the firing properties of MI neurons over different sessions of the closed-loop BMI experiments were analyzed. Figure 3-4 shows meaningful changes in the firing properties of a selected neuron over different experiment sessions as the task difficulty level increased over multiple sessions. In this figure, the mean firing rate and the Coefficient of Variation (CV) were used to characterize the firing properties of this neuron. For this rat, 60% of the neurons had a decrease, 30% had an increase, and 10% had no significant change in the mean firing rate over different sessions when compared to the first session. Interestingly, of


the neurons with a decrease in firing rate, 73% had an increase in their CV of firing and the rest had no significant change in their CV. Of the neurons with an increase in their mean firing rate, 86% had no significant change in their CV. The remaining 10% of the neurons, those with no significant change in their mean firing rate, also showed no significant change in their CV. All metrics were tested for significance using ANOVA at the 95% level. Based on these results, co-adaptation does not occur primarily as a general up-regulation of neuronal firing but as an increase in temporally specific neuromodulation of the ensemble related to subgoals of the complete reaching task.

Table 3-1. Robot actions.
L   Left          RB  Right-Back    BLU  Back-Left-Up
R   Right         LU  Left-Up       BRU  Back-Right-Up
F   Forward       RU  Right-Up      FRD  Fwd-Right-Down
B   Back          LD  Left-Down     FRU  Fwd-Right-Up
U   Up            RD  Right-Down    BRD  Back-Right-Down
D   Down          BD  Back-Down     FLD  Fwd-Left-Down
LF  Left-Fwd      BU  Back-Up       FLU  Fwd-Left-Up
RF  Right-Fwd     FD  Fwd-Down      BLD  Back-Left-Down
LB  Left-Back     FU  Forward-Up    St   Stay

Figure 3-4. Neural adaptation over multiple sessions. Changes in the A) mean firing rate and B) coefficient of variation of one example neuron. This neuron consistently increased its firing rate with the increase in task difficulty level over multiple sessions.
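A minimal sketch of one way to encode the 27-action set of Table 3-1 as Cartesian direction triples; the (x, y, z) ordering and the step size are illustrative assumptions, not taken from the original setup:

from itertools import product

# Each action is a displacement of -1, 0, or +1 along (x, y, z):
# 6 uni-directional, 12 bi-directional, and 8 tri-directional moves plus "Stay".
ACTIONS = list(product((-1, 0, 1), repeat=3))   # 27 triples, (0, 0, 0) = Stay

def apply_action(position, action_index, step=1.0):
    """Move the robot endpoint by one step along the chosen direction."""
    dx, dy, dz = ACTIONS[action_index]
    x, y, z = position
    return (x + step * dx, y + step * dy, z + step * dz)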


Neuronal Tuning As a Measure of Robustness in MI States

In the previous section it was shown that, as experience was gained between the user and the agent, the neural response of the user was shaped [101]. In this section, a neural tuning analysis was performed with respect to the robot actions to study the robustness of MI neural states for BMI control. In co-adaptive BMIs a biological and an artificial learning system are coupled together [102]; therefore, learning in the artificial system can be used to characterize learning in the biologic system. Once the rat and the agent learned to accomplish the task, sequences of optimal actions were used most frequently. The tuning for these actions also became deeper as the performance of the BMI increased, indicating that the user's internal representation of these actions was strengthened. We built upon the classical formulation for computing the tuning direction, which measures neuronal firing rates given a particular kinematic variable. While co-adaptation was occurring throughout brain control of the prosthetic arm, we assumed that changes in the tuning function varied smoothly within a session. This was assessed by observing the weight tracks of the value function and the rate of reward returns for each session. For the data reported here, no abrupt changes were observed, and we used the timescale of the entire session to compute the tuning. Neural tuning was computed for the robot control actions that the agent had taken at each time step. Tuning curves were constructed for each action by taking the mean instantaneous firing rate over all instances of the agent taking that action. The tuning depth and direction were used as scalar descriptors of the tuning curves. However, unlike classic methods for computing the tuning depth [30], we could not normalize the tuning depth by the standard deviation of the firing rate. In RLBMI, the number of actions was variable between different sessions; therefore, sessions with few


actions would have a heavily biased estimate of the standard deviation. This bias could distort trends in the tuning curves. To avoid this problem, the tuning depth of each neuron was computed by taking the difference between the maximum and minimum of the tuning curve and normalizing it by the area under the tuning curve (the sum of the mean firing rates). Table 3-2 summarizes the action tuning depths corresponding to the action tuning curves and the overall performance of the animal for left and right trials. For a given session and neuron, the most tuned action and its corresponding tuning depth value were computed.

Table 3-2. Action tuning depths.
Session/Performance   N03/Action   N19/Action   N23/Action
Left trials
1/56%                 0.0959 L     0.2015 FLU   0.2373 LF
2/61%                 0.0467 R     0.1589 FRU   0.1920 FRU
3/90%                 0.0458 L     0.4410 FRU   0.4649 FRU
Right trials
1/75%                 0.1002 R     0.2430 FLU   0.1904 FLU
2/42%                 0.0386 R     0.1423 FRU   0.2479 FRU
3/78%                 0.1918 FRU   0.5643 FRU   0.1075 FRU
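A minimal sketch of the tuning-curve and tuning-depth computation described above; the input format (per-time-step firing rates and selected-action indices) is an assumption:

import numpy as np

def tuning_curve(rates, actions, n_actions):
    """Mean instantaneous firing rate of one neuron for each action the agent took.
    rates: (n_timesteps,) firing-rate samples; actions: (n_timesteps,) action indices."""
    curve = np.full(n_actions, np.nan)
    for a in range(n_actions):
        taken = actions == a
        if taken.any():
            curve[a] = rates[taken].mean()
    return curve

def tuning_depth(curve):
    """(max - min) of the tuning curve, normalized by the area under the curve."""
    c = curve[~np.isnan(curve)]
    return (c.max() - c.min()) / c.sum()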


Figures 3-5A and 3-5B show the action tuning curves of three neurons for successful left and right trials, respectively. The actions in these figures were selected exclusively by the agent based on their estimated value at each time step throughout the session. Although we recorded 29 neurons from this rat, we present only two representative neuron types. The first is an example of a fast-switching neuron (Neuron 3), which changes its representation for left or right trials. The second example is the slow-switching neurons (Neurons 19 and 23), which change their tuning representation over sessions. This behavior shows redundancy in the population representation. Specifically, using the results of Table 3-2, we see that Neuron 3 is most deeply tuned to an action with an L component for left trials and to an action with an R component for right trials for all sessions except session 2 (left). This neuron is characterized by a relatively shallow overall tuning depth. In contrast, we see from the table that Neurons 19 and 23 both changed their tuning direction from actions with left components to actions with right components between sessions (FLU to FRU). These neurons are characterized by much deeper tuning. The slow-switching neurons were also tuned to the same actions for both left and right trials; for example, FRU is selected for both left and right trials. Over these three sessions, we can see that the number of actions selected by the agent decreased (left 6 to 4, right 4 to 2) to a subset of effective actions for accomplishing the task. These modeling changes are also accompanied by increases in tuning depth, as shown above. This implies that both the animal and the agent converged to a stationary point on the joint performance surface. These results show that MI cortical neurons are able to provide stable yet plastic state sequences that can be shaped towards a specific task. In the previous experiments we externally shaped the neural modulation by manually specifying the reward distribution in the agent's workspace. The main purpose of the symbiotic BMI is to use the internal goal representation to automatically shape the action representation. In the next section, I will discuss a set of experiments characterizing the goal representation in the NAcc.


Figure 3-5. Action tuning curves of 3 neurons over three experiment sessions. A) Left trials and B) right trials.


CHAPTER 4
REPRESENTATION OF REWARD EXPECTATION IN THE NUCLEUS ACCUMBENS AND MODELING OF EVALUATIVE FEEDBACK

Introduction

Since understanding the reward representation in the brain plays a pivotal role in the BMS theory, neurophysiological studies were performed to characterize the representation of goal information in the NAcc during goal-directed behavior involving a prosthetic arm. In a second, follow-up set of experiments, the hypothesis was tested that reward expectation could be estimated in the form of a scalar signal from the NAcc. This signal would eventually play the role of a binding link between the user and the IA; using this signal, the learning process of the IA would be directed towards maximizing the user's satisfaction.

Experiment Setup

Microwire array electrodes were implanted into the left NAcc of three Sprague-Dawley rats, and the single-unit activity of accumbal neurons was chronically recorded. The arrays were positioned stereotaxically and lowered with a hydraulic micropositioner to an approximate depth of 7.0 to 7.2 mm (Figure 4-1A) [103]. This site was chosen because of the high density of medium spiny neurons at this level [104]. The rats were given up to two weeks to recover from surgery before resuming the experiment. All procedures were approved by the Institutional Animal Care and Use Committee (IACUC) at the University of Florida. All rats were trained in a two-lever choice task via operant conditioning to earn a water reward by pressing retractable levers (Med Associates, St. Albans, VT) inside their behavioral chamber, cued by LEDs (Figure 4-1B). A solenoid


controller (Med Associates) dispensed 0.04 mL of water into the reward center on successful trials. An IR beam (Med Associates) passed through the most distal portion of the reward center. The rat initiated the trial by a nose poke in the water center. The workspace used low-level lighting and was designed to maximize the rat's visual abilities. After the rats reached the operant conditioning inclusion criterion of 80% on each side, neural data were recorded for six sessions. Electrophysiological recordings were performed using commercial neural recording hardware (TDT, Alachua, FL). A TDT system (one RX5 and two RP2 modules) operated synchronously at 24414.06 Hz to record neuronal potentials from the microelectrode arrays. The neuronal potentials were band-pass filtered (0.5-6 kHz). Next, online spike sorting [105] was performed to isolate single neurons in the vicinity of each electrode. Prior to the in vivo recording, the experimenter reviewed each sorted unit over multiple days to refine the spike-sorting thresholds and templates. The number of sorted single units varied between rats: rat01 had 12 units, rat02 had 13 units (including one multi-unit), and rat03 had 41 units. The isolation of these units was repeatable over sessions with high confidence from the recordings. Once the neurons were isolated, the TDT system recorded unit firing times, and a firing rate estimate was obtained by summing firing within non-overlapping 100 ms bins. Additionally, all behavioral signals (e.g., water rewards, LED activation) were recorded synchronously using the shared time clock.


Figure 4-1. Stereotaxic coordinates for microwire electrode implantation and experimental setup. A) Stereotaxic neurosurgical methods were used to target the NAcc and MI. In experiments involving simultaneous recording of NAcc and MI, a dual electrode array was implanted. B) Top view of the animal behavioral box. A nose poke into the IR beam initiated the random selection of a target lever, cued by a light (LED). The animal had up to 4 seconds to press a lever. If the correct lever was pressed, a water reward was delivered.


Temporal Properties of NAcc Activity Leading up to Reward

Since a biological, neural-based error signal was used to adapt the network through reinforcement learning, a set of guidelines for what can be expected from the temporal modulation of the NAcc leading up to target acquisition is developed here. For the symbiotic BMI, the critical time period is the segment of time between target selection and acquisition, so we require a measure of how accumbal neurons were excited or inhibited leading up to the target providing reward. To investigate this, the data from the rat lever-pressing experiments were segmented so that each trial was time-aligned to the onset of the lever press, indicated by time 0 in Figures 4-2A-F. Next, 4 seconds of data leading up to this point were extracted, which we will call the target acquisition time. This duration was selected because it corresponded to the mean time between cue and press for all animals and trials. During the target acquisition time, neuronal firing was binned into 100-ms windows and a firing rate was computed. The Peri-Event Time Histograms (PETHs) in Figures 4-2A-F correspond to the average firing rate over left or right trials during the target acquisition time. Also included in this figure are the raster plots for each trial. In this analysis, three groups emerged in the neuronal firing, and representative plots are presented in Figures 4-2A-F. We performed a statistical analysis to identify the neurons in each group quantitatively. For each neuron we compared the baseline activity (2 seconds before the cue) with the neural activity during the 2 seconds before the lever press using the Kolmogorov-Smirnov test (KS test). In the first group, neurons selectively responded to each target by increasing their firing rate. The neurons in this category exhibited excitatory and inhibitory activity when the animal approached the left and right targets, respectively. The neurons in the second group responded only


when the rat approached one of the targets, and they did not respond to the other target. The third group of neurons responded non-selectively to both targets: as the animal approached either target, the neurons in this group increased or decreased their firing rates. Of the 66 isolated neurons, 63% significantly changed their firing rate during goal-approach behavior compared to their baseline activity. In accordance with the three categories we identified, 25% belonged to the first category, 39% to the second category, and 46% to the third category.

Figure 4-2. Peri-event time histograms of three categories of NAcc neurons. Dual non-selective neurons (both decrease firing after the cue) for A) left and B) right trials. Dual selective neurons (increase and decrease firing for both targets) for C) left and D) right trials. Uni-selective neurons (decrease for one target and stay constant for the other) for E) left and F) right trials.
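A minimal sketch of the PETH construction and KS-test comparison described above; the bin width, 4-s acquisition window, and 2-s comparison windows follow the text, while the data layout (spike times relative to the lever press) is a hypothetical assumption:

import numpy as np
from scipy.stats import ks_2samp

BIN = 0.1   # 100-ms bins

def peth(trial_spikes, window=(-4.0, 0.0)):
    """Average firing rate aligned to the lever press (time 0) across trials.
    trial_spikes: list of arrays of spike times (s) relative to the press."""
    edges = np.arange(window[0], window[1] + BIN, BIN)
    counts = np.vstack([np.histogram(sp, bins=edges)[0] for sp in trial_spikes])
    return counts.mean(axis=0) / BIN, edges[:-1]

def responds_to_approach(baseline_rates, approach_rates, alpha=0.05):
    """KS test: did firing in the 2 s before the press differ from the 2-s baseline?"""
    _, p = ks_2samp(baseline_rates, approach_rates)
    return p < alpha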


The results of this neurophysiology study suggested that there was a heterogeneous and rich representation of goal information in the NAcc during goal-approach behavior. The next step in the design of the Actor-Critic was to transform this neural representation into a scalar evaluative feedback for adaptation of the Actor. As in conventional Actor-Critic learning, both negative and positive reinforcement are required for training the Actor. Extracting the positive and negative components of the evaluative feedback is supported by the role of the NAcc in representing reward and aversion prediction [106]. In order to extract a control signal for the BMI, we needed to estimate two different states (left and right) from the ensemble neural activity in real time. Based on the physiological observations and the histogram analysis, we modeled the reward expectation as a signal that started at zero at the beginning of the trial and increased to +1 at the time of the lever press for right trials, or decreased from zero to -1 for left trials. As the rats got closer to the target, the absolute value of the reward expectation increased linearly.

Extracting a Scalar Reward Predictor from NAcc

We tested the performance of the value estimator on a single-trial basis by computing the correlation coefficient between the output of the value estimator and the desired signal. Two Time-Delayed Neural Networks (TDNNs) were used to model the reward expectation during left and right trials. Figure 4-3 shows a schematic of the desired response and the structure of the reward estimator. Each network contained 10 tanh nonlinear Processing Elements (PEs) in the hidden layer and one linear PE at the output. In order to preserve the temporal structure of the neural response, a gamma memory structure with 3 taps was used to capture the temporal structure of the neural


data. The number of gamma tap delays and PEs was selected empirically. Using error backpropagation, the TDNNs were trained on 60% of the data (80 trials) and tested on the remaining 32 trials. At each time step during a trial, the multi-channel neural vector was fed to the TDNN and the instantaneous error between the output of the model and the desired signal was computed. The average correlation coefficient between the model output and the desired signal showed that the TDNNs were able to find a functional relationship between the NAcc population response and the reward expectation. Figure 4-4 shows two sample trials of estimating reward during left and right trials.

Figure 4-3. Extracting a scalar reward expectation signal from NAcc. A) Conceptual diagram of the reward-expectation modulation of the user based on IA actions. The temporal structure of the NAcc neuronal activity indicates the expectation of reward or aversion in goal-directed tasks. The critic must interpret this activity and transform it into a scalar error signal. B) Estimating reward expectation in the form of a scalar signal from NAcc multi-channel neural activity.
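A minimal sketch of this regression setup: the ramp-shaped desired signal and a small tanh network on a delay-embedded version of the NAcc firing rates. A plain tapped delay line and scikit-learn's MLPRegressor are used here only as stand-ins for the gamma-memory TDNN and its backpropagation training; they are not the implementation used in this work:

import numpy as np
from sklearn.neural_network import MLPRegressor

def ramp_target(n_bins, side):
    """Desired reward expectation: 0 at trial start, +1 (right) or -1 (left) at the press."""
    return np.linspace(0.0, 1.0 if side == "right" else -1.0, n_bins)

def delay_embed(rates, n_taps=3):
    """Stack the current and previous n_taps-1 bins of all neurons into one input vector.
    rates: (n_bins, n_neurons) firing-rate matrix for one trial."""
    padded = np.vstack([np.zeros((n_taps - 1, rates.shape[1])), rates])
    return np.hstack([padded[i:i + rates.shape[0]] for i in range(n_taps)])

# X: stacked embedded neural vectors from training trials, y: the matching ramp targets
# model = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh", max_iter=2000).fit(X, y)
# per-trial performance: np.corrcoef(model.predict(X_trial), y_trial)[0, 1]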


Figure 4-4. Estimating an evaluative feedback from NAcc neural activity. A) Left and B) right trials.


CHAPTER 5
ACTOR-CRITIC REALIZATION OF THE BMS THEORY

Introduction

In the previous chapters, the foundation of the BMS theory was developed from both theoretical and experimental perspectives. In the rest of the dissertation I will present the design and test procedure for a BMI implementation of the BMS theory. Borrowing the idea from the Actor-Critic method in machine learning, in this chapter I will introduce a control architecture that operates on the principles of the BMS theory.

BMI Control Architecture

By formulating BMI control as a decision-making problem, the optimization process can be built on the theory of reinforcement learning (RL) [107]. In the design of BMIs, learning through reinforcement is very appropriate because it is inspired by operant conditioning of biological systems, where the learner must discover through experience which actions yield the most reward [86]. The approach is built on the concept of valuation; as described before, valuation is the process by which a system assigns importance to actions and behavioral outcomes. For goal-directed motor behavior, we seek systems that compute with action-outcome sequences and assign high value to outcomes that yield desirable rewards. This approach is very different from habitual valuation, which does not involve continual self-analysis [108], an ability that is important in dynamic environments. One of the main computational goals of the methods presented here is to develop real-time techniques for modeling and coupling the valuation between the user and the BMI (to enhance symbiosis) in a variety of tasks. The conventional way to build value functions in RL is to couple two entities: an agent and the environment [58]. The agent represents an intelligent being attempting to


achieve a goal. The environment represents anything the agent cannot directly modify but can interact with. The interaction is defined by the agent's actions, which influence the environment and the states and rewards observed from the environment. After each action is completed by the agent, a reward is supplied. The agent attempts to maximize these rewards over the entire task. The main problem to be solved in this formulation of the BMI is to find the best action to take in each particular state of the environment. Through this process, a value function is developed.

Figure 5-1. An implementation of the symbiotic BMI based on the Actor-Critic architecture. A) Generic Actor-Critic architecture. B) Block diagram of the symbiotic BMI controller. The architecture contains two key components. The Actor is driven by the primary motor cortex, and its primary role is to select actions in the environment. These actions are evaluated by the Critic, which is driven by the NAcc. At each instance in time, the Critic provides an error signal that is used to adapt the parameters of the Actor toward choosing actions that lead to reward. In this entire system there is an intrinsic coupling between the motor system, the reward system, and the environment.

RL can be implemented in many ways [109]; however, since we are seeking to build BMIs that utilize both motor and reward signals from the brain, the Actor-Critic computational framework is a good choice for the reasons described next. A schematic of the generic Actor-Critic architecture is shown in Figure 5-1A.


Here, the Critic contains a value function that associates with each state a scalar value in terms of the reward that can be expected in the future. The Actor performs actions following a rule that favors the maximal accumulation of rewards. As time progresses, the value function is learned through interaction with the environment. After each action, the Critic computes a temporal difference error, which is used to update the value function. The Critic criticizes the Actor in the sense that if the action is good (increasing reward expectation), the Critic generates a positive prediction error that strengthens the association between the state and the action. Otherwise, a negative prediction error decreases the probability of selecting that particular action in the same state again. This cycle is repeated continuously; the Critic's value function becomes progressively more accurate, and the Actor changes its actions to maximize reward. An important aspect of this computational framework for BMIs is that it has been theorized to have a biological counterpart. The neuroanatomical structure of the basal ganglia suggests that substructures including the striatum and midbrain dopaminergic neurons implement an Actor-Critic realization of reinforcement learning. For BMIs, we seek to derive the action-selection policy from the primary motor cortex, while the Nucleus Accumbens (NAcc) implements a value function. There is evidence that neuromodulation in the NAcc, which is known for encoding the TD error, predicts reward and aversion [106]; therefore, we used the NAcc response for evaluating the Actor. In the conventional Actor-Critic architecture, the Critic should learn a value function mapping states to the expected cumulative future reward [60]. However, in this BMI application, we seek a Critic that is biologically embedded in the user's brain and evaluates the actions of the IA based on the user's reward expectation. If the IA's action is


favorable to the user, the reward expectation increases and that action is reinforced; otherwise, the reward expectation decreases and the action is penalized. In Figure 5-1B, we formulate the control architecture of the BMI based on the Actor-Critic implementation of reinforcement learning. For BMI applications, the Actor plays the role of decoding the user's neural motor commands. The Critic provides an evaluative feedback to the Actor in the form of a TD error (δ) that represents a measure of the user's goal. Figure 5-1B combines the key elements of the proposed Actor-Critic framework: actions, states, and value, which are distributed between the user and the computer code, which we call an Intelligent Assistant (IA). The Critic translates the internal goals of the BMI user to provide an evaluative feedback to the Actor when it performs actions in the environment. Hence, the Critic's input is driven by the user's Nucleus Accumbens (NAcc) and is responsible for reward expectation. The Actor's state input is the neural activity collected from the primary motor cortex (MI), which must be translated into prosthetic actions by the Agent. Note that the evaluation subsystem (Critic) and the controller (Actor) are split into two embodiments (neural activity and computer code), creating a symbiotic (brain-machine) system due to the tight, real-time feedback. In the Actor-Critic framework, the value of an action is specified by a measure of the reward received when that action is selected. At every instance in time, the brain generates new MI states, the Actor selects actions, and the Critic (driven by the NAcc) provides an evaluative feedback in terms of reward expectation. The state-to-action mapping is updated based on the past history of rewards and the estimation of future rewards. The modulation of reward activity in the user's brain defines the task, which is a great advantage for reaching tasks in the


external world, because the designer does not need to specify the goal. The agent finds an optimal control strategy based on the user's neuronal state and on the actions, which are defined as movement directions. The key problems in this architecture are the following:

- Translate the NAcc activity into an evaluative feedback signal. This involves the integration of improved real-time signal processing methods that capture global computation on multiple spatial, temporal, and behavioral scales.
- Estimate the state-action value function (shown mathematically later) that selects future actions given the states.

To best capture the effects of the neural inputs on the architecture, initialization, and parameter selection of the Actor-Critic model, I performed experimentation on a group of parallel multiple models. By running in parallel, data analysis and optimization could be completed in a more efficient manner than with conventional BMI model-building techniques.

Critic Structure

In the traditional Actor-Critic architecture, the Critic implemented a value function that provided an evaluative feedback for the Actor through the temporal difference error. This error simultaneously adapted both the Actor and the Critic architectures. In contrast, in our architecture, the value function was implemented in the brain and was biologically trained. Therefore, we only needed to estimate the evaluative feedback from the neural response in the NAcc. Depending upon the user's goal, the IA's actions might increase or decrease the reward expectation [110]. The user's reward expectation was the source of goal information, which the IA used in the form of an evaluative feedback for action selection. During BMI use, we sought to best capture and model the response of NAcc neurons over time and translate it into a scalar value that could be used for evaluation


of the actions of the Actor. This model was denoted as the Value Estimator in the Actor-Critic model of Figure 5-1B. We modeled the hidden parameters in the NAcc data that pertained to goal proximity and movement directions. By finding the modulatory effect of the IA's actions on the reward expectation of the user, the value function predicted how the actions over time influenced the reward expectation. Training the Critic involved a nonlinear regression from the NAcc neural modulation to a scalar function representing the reward expectation of the user in the presence of known goals. Figure 4-3A schematically plots the reward expectation, where a positive slope corresponds to approaching the goal and a negative slope represents moving away from the goal. A Time-Delayed Neural Network (TDNN) was trained using conventional error backpropagation [111] to estimate the user's reward expectation from the firing pattern of NAcc neurons. The TDNN was composed of a gamma structure, to capture the temporal pattern in the NAcc activity, and a Multi-Layer Perceptron (MLP) with a linear output Processing Element (PE) for the regression.

Actor Structure

The Actor in Figure 5-1B was a parameterized policy estimator that treated the neural activity in MI as states and tried to find a mapping between the user's neural states and robot actions so as to maximize a measure of the user's reward expectation presented by the Critic. The user's reward expectation (δ) was a function of the IA's actions and the user's neural state; r_t is the reward at each time step.

δ(s_t, a_t) = E{ r_{t+1} | s_t = s, a_t = a },   s ∈ S, a ∈ A    (5-1)

At each time step, the Actor, which was parameterized by θ, took action a_t according to the policy

π(s, a; θ) = Pr{ a_t = a | s_t = s; θ }    (5-2)
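A minimal sketch of such a parameterized policy (Equation 5-2) as a small feedforward network whose outputs are converted into action probabilities; the layer sizes, weight initialization, and softmax readout are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

class ActorPolicy:
    """Maps an MI state vector to action probabilities, Pr{a_t = a | s_t = s; theta}."""
    def __init__(self, n_inputs, n_hidden, n_actions):
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_inputs))    # hidden-layer weights
        self.W2 = rng.normal(0, 0.1, (n_actions, n_hidden))   # per-action output weights

    def features(self, s):
        return np.tanh(self.W1 @ s)             # phi(s_t): hidden-layer projection

    def action_probs(self, s):
        z = self.W2 @ self.features(s)
        z -= z.max()                            # numerical stability
        p = np.exp(z)
        return p / p.sum()

    def select_action(self, s):
        return int(np.argmax(self.action_probs(s)))   # greedy choice, as in the text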


The cost function was defined as the average expected reward over time:

J(θ) = (1/T) Σ_{t=0}^{T-1} δ(s_t, a_t)    (5-3)

The Actor should find a set of parameters θ that maximizes J over time, i.e., θ* = argmax_θ J(θ). Once the Actor converged to the optimal policy, the user could actively control the actions by modulating the appropriate neural states in a feed-forward way. During adaptation, the Actor estimated the gradient of J with respect to states and actions and improved the policy by adjusting its parameters in the direction of ∇J; therefore, an instantaneous measure of the gradient direction was required. We defined an instantaneous error, resembling the temporal difference error in the regular Actor-Critic architecture, as an estimate of ∇J:

ε_t = δ_{t+1} - δ_t    (5-4)

To approximate the optimal policy, we used a Time-Delayed Neural Network with a gamma memory structure and a Multi-Layer Perceptron topology (Figure 5-2). As in other BMI experiments, the input was defined by neuromodulation (firing rate) over all channels within 100 ms windows, which were embedded in longer time windows using a gamma memory structure [112]. The network architecture was composed of a set of sensory nodes that received the MI neural state as input, a hidden layer that formed a set of basis functions, and finally the output layer, which spanned the action space of the IA. In a discrete action space, each output PE represented one action and computed the probability of the corresponding action given the parameter set θ and the input neural state s_t (Equation 5-2). The agent executed the action with the highest probability and received


an evaluative feedback from the Critic. The evaluative feedback was computed by Equation 5-4 and backpropagated to adjust the parameters of the selected action:

θ^i_{t+1} = θ^i_t + ε_t φ(s_t)    (5-5)

where φ(s_t) represents the projection of the input neural state onto the feature space. The superscript i in Equation 5-5 corresponds to the index of the selected action. The Actor's adaptation procedure in the symbiotic BMI architecture is summarized as follows (a minimal sketch of this update is given below):

1. The user generates motor state s_t, and its reward expectation is δ_t.
2. The Actor associates s_t with action a_i and executes the selected action.
3. Execution of action a_i increases or decreases the reward expectation, which is reflected by δ_{t+1}.
4. The error is defined as ε_t = δ_{t+1} - δ_t.
5. If ε_t > 0, this error is used to update the parameters of the selected action; the hidden-layer weights are not changed. If ε_t < 0, the parameters of the selected action are updated, and the error is backpropagated to the hidden layer to update the hidden-layer weights.
6. Return to step one.
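A minimal sketch of this adaptation step, reusing the illustrative ActorPolicy class sketched above; the sign-dependent treatment of the hidden layer follows the procedure, while the hidden-layer gradient shown is a simplified stand-in for full backpropagation:

import numpy as np

def adapt_actor(policy, s, action, delta_t, delta_next):
    """One adaptation step: epsilon_t = delta_{t+1} - delta_t (Equations 5-4 and 5-5)."""
    eps = delta_next - delta_t
    phi = policy.features(s)                      # phi(s_t)
    policy.W2[action] += eps * phi                # update the selected action's parameters
    if eps < 0:
        # negative feedback also propagates to the hidden layer (simplified gradient)
        hidden_grad = policy.W2[action] * (1.0 - phi ** 2)
        policy.W1 += eps * np.outer(hidden_grad, s)
    return eps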


Closed-Loop Simulator

In order to measure the performance and quantify the convergence of the Actor-Critic adaptation, a simulator was developed to investigate the learning capability of the IA. Since it was difficult to control every aspect of the neuronal responses during in vivo experiments, the simulator offered a method for investigating model parameters before running closed-loop experiments with the animal in the loop. The simulator was composed of three main modules that continuously interacted with each other: a neuromodulation synthesizer, the Actor-Critic controller, and the environment. The MLP in the Actor consisted of two layers, with 3 nonlinear hidden PEs and 4 linear output PEs that represented four discrete actions. The actions were defined as the principal directions of movement in the 2D grid space (up, down, right, and left). For the sake of simplicity and tractability, the reaching task was in 2D, but it could easily be extended to 3D.

Figure 5-2. Structure of the actor.

The environment consisted of a 20x20 grid world with 0.1 spacing between nodes (Figure 5-3A). The task was to navigate a robotic arm from the center of the grid to any target in 2D space based on the motor and reward neuromodulation. The neuromodulation synthesizer in our simulation consisted of an ensemble of synthetic MI neurons generated based on the model presented in [113]. The main parameter of the neuromodulation module was the tuning properties of the neurons. The


ensemble of cortical neurons was composed of four subsets, where the neurons in each subset were tuned to a particular action. At each time step, the neuromodulation synthesizer produced a motor command that was encoded into MI neural activity by exciting the corresponding subsets of neurons. For example, if the user decided to navigate the robot in the up-right direction, the neurons in the ensemble that were tuned to the right direction and to the up direction were activated.

Figure 5-3. Simulation experiment setup. A) 2D grid world. B) Error (reinforcement) definition based on the projection of the movement vector onto the target vector.

The Actor-Critic controller received neural input from the neuromodulator and used it to navigate the end effector of a robot (red circle in Figure 5-3A) in the grid space. Based on the robot's movement with respect to the target, the movement and target vectors were computed at each time step. In this experimental simulation, we required an error signal that mimicked the modulation of an ensemble of NAcc neurons. To meet this need, the following method was implemented. Taking the target vector as the desired direction, a scalar error was defined by computing the cosine of the angle between the movement and target vectors.
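A minimal sketch of this synthetic evaluative feedback; the movement and target vectors are assumed to be 2D numpy arrays:

import numpy as np

def synthetic_feedback(movement_vec, target_vec):
    """Cosine of the angle between the movement and target vectors:
    positive when the robot moves towards the target, negative otherwise."""
    num = float(np.dot(movement_vec, target_vec))
    den = np.linalg.norm(movement_vec) * np.linalg.norm(target_vec)
    return num / den if den > 0 else 0.0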


We selected the cosine function because it converted movements towards the target into positive values: if the robot moved towards the target, a positive error was generated; otherwise, the error was negative. This error signal resembled the TD error in the Actor-Critic algorithm in the sense that if the action was desirable the agent received a positive reinforcement; otherwise, the reinforcement was negative. The value estimator in Figure 5-1B was responsible for estimating the TD error from the neural activity in the NAcc. Each experiment was composed of 100 trials, where each trial consisted of a single reach to a specified target. In each trial, if the agent could not reach the target in 50 steps, the trial was considered unsuccessful.

Convergence of the Actor-Critic During Environmental Changes

One of the primary advantages of the Actor-Critic architecture is that it is designed to adapt symbiotically with the user during environmental changes. The goal of the simulated closed-loop experiments here is to determine how the performance of the Actor-Critic model is affected when a new target (unforeseen by the user) is introduced to the behavioral workspace. The addition of new targets is a common occurrence in the activities of daily life, and it is expected that with each new target there are unseen aspects of the control scheme that need to be learned, a process that will take time. Because the Actor-Critic learns on-line and can respond to changes (unlike static BMIs, which learn from a training set), we will test the condition in which navigating the robot to a new target will require a new action (or action set) to be learned in order to reach the new target. The new target will be located outside the space spanned by the previously learned control policy; therefore, the IA will not be able to reach the new target without learning to acquire this new action set. The experimental approach is designed to introduce a perturbation to the BMI control paradigm so that the


Convergence of the Actor-Critic During Environmental Changes

One of the primary advantages of the Actor-Critic architecture is that it is designed to symbiotically adapt with the user during environmental changes. The goal of the simulated closed-loop experiments presented here is to determine how the performance of the Actor-Critic model is affected when a new target (unforeseen to the user) is introduced to the behavioral workspace. The addition of new targets is a common occurrence in the activities of daily life, and it is expected that with each new target there are unseen aspects of the control scheme that need to be learned, a process that takes time. Because the Actor-Critic learns online and can respond to changes (unlike static BMIs, which learn from a training set), we test the condition where navigating the robot to a new target requires a new action (or action set) to be learned. The new target is located outside the space spanned by the previously learned control policy; therefore, the IA cannot reach the new target without learning this new action set. The experimental approach is designed to introduce a perturbation to the BMI control paradigm so that the performance difference can be measured. In this section, I specifically focus on an important question regarding the applicability of BMI in daily life activities: how does learning a new task affect the previously learned functional mapping for BMI control?

Figure 5-4. Spatial distribution of targets in the 2D workspace. The targets at the corners of the blue rectangle were used in the sequential learning task. The area inside the blue rectangle represents the space spanned by the control policy that was learned during the sequential learning task. In order for the agent to reach the targets outside the blue rectangle, it had to adapt its previously learned control policy. The star at the center of the workspace shows the start point of the agent.

For these experiments, we designed a sequential learning task. The task was to navigate to a set of targets located at the corners of the blue square 2D workspace shown in Figure 5-4. However, all of the targets were not presented at the same time. The targets were numbered as follows: 1-upper right, 2-lower left, 3-upper left and 4-lower right.


Starting from a naïve state (random small initial actor weight values between -0.5 and 0.5), the Actor-Critic decoder was required to adaptively find a control policy to reach each target using only MI and NAcc activity. Once the decoder found an appropriate control policy for the task, a new target was introduced. In this way, I presented all four targets sequentially (1-4), so the decoder had to change its control policy for reaching each target. For the first target, the parameters of the decoder were initialized randomly, but afterward the network started from the previously learned control policy (i.e., the previous actor weight values). Once the Actor-Critic had learned each task individually, we presented all four targets, with one of the four targets presented randomly in each trial. In this task, the decoder had to derive a control policy that enabled switching among all the targets; in other words, the network had to remember its previous control policies.

In each of the tasks, to consolidate the control, the learning rate was annealed to zero as the decoder learned a control policy, and the parameters of the decoder were frozen. When a new task was introduced, the learning rate was reset and the network resumed adaptation. Learning rate annealing is an important aspect of co-adaptation because it controls to what extent the BMI adapts to the user. In this symbiotic BMI architecture in particular, there were two reasons for annealing the learning rate. First, from a machine learning point of view, every time the decoder successfully completed the task, the associations between the MI neural states and the Actor-Critic actions were reinforced by increasing the corresponding network weights; annealing the learning rate prevented the network weights from growing without bound. Second, based on the representation of reward, the NAcc became habituated to specific goals over time [114,115] and may reduce the amount of evaluative feedback over time.
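A minimal sketch of this annealing-and-reset schedule is shown below; the exponential decay, the decay constant, and the freezing threshold are illustrative assumptions rather than the values used in the original experiments.

    # Illustrative learning-rate schedule: anneal toward zero as trials are completed
    # successfully, freeze the weights, and reset when a new target (task) appears.
    class LearningRateSchedule:
        def __init__(self, initial_rate=0.1, decay=0.9, freeze_below=1e-4):
            self.initial_rate = initial_rate
            self.decay = decay
            self.freeze_below = freeze_below
            self.rate = initial_rate

        def after_successful_trial(self):
            self.rate *= self.decay                 # consolidate: shrink the step size
            if self.rate < self.freeze_below:
                self.rate = 0.0                     # parameters are effectively frozen

        def on_new_task(self):
            self.rate = self.initial_rate           # new target: resume adaptation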


Table 5-1. Decoding performance during sequential target acquisition
            T1     T2     T3     T4 (1)   T4 (2)   T all (1)   T all (2)
Speed       5      22     8      29       0        30          10
Accuracy    100%   96%    100%   46%      100%     40%         98%

Figure 5-5 shows the performance of the decoder during this set of experiments. The red stem plot in Figure 5-5A shows the targets that were presented during each trial. Each target was represented by a number: right-up (1), left-down (2), right-down (3) and left-up (4). The blue stems in Figure 5-5A show whether the decoder was successful in the corresponding trial. We can see that when a new target was introduced, performance degraded at the beginning, but after a few trials the decoder was able to learn each new task. Table 5-1 summarizes the decoding performance during the sequential target acquisition task. For each target, 100 trials were presented. The number of trials it took for the decoder to find the target during the first 50 trials of each target was used as a measure of the speed of learning, and the percentage of successful trials during the second 50 trials was used as a measure of performance. If the performance was less than 90%, another epoch of 100 trials was presented to the decoder.

Recall that the output of the Actor network provides the value of each action taken. Figure 5-5B shows the values of the actor output processing elements over time, where the actions left, right, up and down are represented by the colors blue, green, red and light blue, respectively. For each task, we can see that the network adjusted its parameters in a way that maximized the probability of selecting the actions required to accomplish the task. For example, in trials 100-200 the actions down and right were necessary for reaching the target.


These are the actions with the highest value (green, light blue). However, when all targets were presented during trials 500-800, we can see that a mixture of actions had high value, and these values modulated depending on the target and the feedback from NAcc. In Figure 5-5C we can see how the network adjusted the output layer weights to find a mapping between neural states and optimal actions based on the error signal. It is important to note that at the introduction of a new target, we can observe adaptation in the weight values and then consolidation of the control scheme through the plateau of the weight values. When an environmental change occurred, the weights again adapted appropriately. To guide the adaptation, the critic provided the reinforcement signal; it was the gradient of the reward expectation of the user, which was approximated by the cosine of the angle between the movement vector and the target vector. From Figure 5-5D, we can see that when a new target was introduced, the evaluative feedback was largest in the negative direction (-1, because the movement direction and desired direction are 180 degrees apart). Over time, through adaptation, the movement direction and target direction become collinear and the cosine becomes maximally positive. We emphasize here that all of the adaptation of the decoder was through NAcc evaluative feedback; there was no a priori training or any external teaching signal.

In the next step, the ability of the Actor-Critic to scale its control policy to targets spanned by its most valuable action set was tested. In the first part of this simulation, we specified random target points inside the blue square workspace (Figure 5-4). During the first 200 trials, we did not adapt the network and the IA followed the control policy from the four-target task.


Figure 5-5. Decoding performance during sequential presentation of the targets in the four-target reaching task. A) Sequential presentation of the 4 targets as indicated by red stems. Blue stems indicate whether the target was acquired (1) or missed (0). Note that when a new target is introduced the performance decreases, but within a few trials it recovers. B) Temporal sequence of action values. Each colored trace represents the value of one action (i.e., up, down, left, right). Note that for each target only certain actions have high value, since they are required to acquire the target. C) Weight values for the output layer of the Actor. Each colored trace corresponds to an individual weight. Note that when a new target is introduced the weights adapt and then plateau once the performance improves. D) The temporal difference error becomes maximally positive when the targets are acquired.

Figure 5-6 shows the performance of the decoder, which was able to reach 96% of the 200 new targets inside the workspace based on the control policy learned during the four-target task. In the second part of the experiments (trials 201-300), we presented targets outside the workspace while the Actor-Critic remained fixed. Even though the decoder was successful during some of the trials, performance degraded to 66% compared to the case where the targets were inside the workspace.


In the third part of this experiment (trials 301-500), adaptation was enabled in the Actor-Critic and the performance increased to 95% as the decoder adjusted its control policy. As the IA modified its control policy, it can be seen that the evaluative feedback in Figure 5-6D became more positive. It is important to note that the network adapted only two times (around time points 5500 and 7500) during this experiment (Figure 5-6C), emphasizing the fact that the IA adapted only when it was necessary for improving its performance.

Figure 5-6. Decoding performance with a random distribution of targets inside and outside of the workspace.


Reorganization of Neural Representation

Brain plasticity is an important design factor for BMI and has been observed in the context of many research areas. From a signal processing point of view, changes in the pattern of neural activity can be a challenging problem for decoding because standard, static input-output models assume stationarity in the neural input [116]. In this section, two scenarios of reorganization in the neural pattern were investigated. First, I considered an extreme situation where, after the decoder converged to a control policy, the input pattern for all neurons was perturbed by shuffling the preferred actions. Second, the emergence of a stable tuning map during the learning process of the user was simulated; this consisted of changing the tuning width and depth of certain neurons in the ensemble. During these experiments, the task again was reaching targets located at each corner of the workspace (four-target task).

In the first scenario, a specific tuning map was defined by dividing the 12 neurons in an ensemble into four subgroups, each tuned to one of the principal directions mentioned in the previous section (Figure 5-6A). In this environment, a naïve decoder learned to perform the task perfectly. Figure 5-8A shows the performance of the decoder during 100 trials. For each trial, one of the four targets was picked randomly. We can see the decoder reached 100% accuracy after 3 trials. At this point, the tuning map was disturbed by shuffling the Preferred Action (PA) of the neurons. Figure 5-7 shows the tuning map of the neural ensemble before and after reorganization.


Figure 5-7. Reorganization of the neural tuning map. Hot colors indicate maximal firing A) before and B) after reorganization.

In Figure 5-8 (trials 1-100) we can see the decoder converged to an appropriate control policy to solve the four-target task. After convergence, the neural tuning map was reorganized and the learning rate was reset to allow the decoder to adapt to the new situation. Figure 5-8 (trials 100-300) shows the adaptation and performance of the decoder after reorganization of the tuning map. We can see that after reorganization the performance degraded at trial 100, but the decoder was able to recover. By letting the decoder adapt further, we can see that during trials 200-300 the decoder reached the level of performance that it had achieved prior to reorganization.


Figure 5-8. Network adaptation after reorganization of the tuning map.

For the second scenario, we simulated the user's learning in the form of changes in the tuning width and depth of MI neurons. During learning, neurons became gradually tuned to specific actions over consecutive phases of learning. Neurons were categorized based on their preferred actions and formed subpopulations that increased their size as well as their tuning depth as the user learned the task. We increased the size of the tuned population in 3 steps, in which 33%, 66% and 100% of the entire neural ensemble became tuned to actions. The tuning depth of the neurons was initially shallow and increased over consecutive phases. Therefore, in each session both shallow- and deeply-tuned neurons existed in the population. In the last session, all of the neurons were deeply tuned to actions and there were no neurons with shallow tuning.


Spanning from a naïve to an expert user, Table 5-2 summarizes the emergence of the neural pattern during learning.

Figure 5-9. Decoding performance during the different learning phases of the user. Note that as the tuning depth increases, the performance increases. Despite the rapidly changing characteristics of the neuronal tuning, the network was able to respond appropriately given the available firing at the input.

Table 5-2. State of the neural ensemble during learning
Naïve user          No tuning
Learning phase 1    1/3 shallow, 2/3 no tuning
Learning phase 2    1/3 deep, 1/3 shallow, 1/3 no tuning
Learning phase 3    2/3 deep, 1/3 shallow
Expert user         All neurons deeply tuned to their PA


Figure 5-9 shows the performance of the decoder during the different phases of learning. Trials 1-100, 101-300 and 301-400 correspond to learning phases 1 to 3, respectively, and trials 401-500 correspond to the state in which the user has completely learned the task. In Figure 5-9A, we can see that when there was no information in the MI neural ensemble, the decoder was not able to learn a control policy. By increasing the amplitude of the output layer weights during the first three phases of learning, the decoder increased the variance of the action values in order to magnify the structure of the input neural data (Figures 5-9B and 5-9C), but since the tuning was shallow, the decoder could not derive an appropriate control policy and the performance was poor in these phases. By increasing the depth of tuning in the fourth phase, the decoder was able to solve the task with less magnification compared to the first three phases. Likewise, when the user had completely learned the task in the fifth phase, the amplitude of the weights and the variance of the action values were smaller than in the previous phase. As the depth of tuning increased, the level of reinforcement became positive. In this figure we can see that performance increased with tuning depth; therefore, the tuning depth of MI neurons plays an important role in the performance. At the end of each phase the weights of the network were reset to random values between -1 and 1, and the learning rates were reset to their initial values.

Effect of Noise in the States and the Evaluative Feedback

An important characteristic of any BMI decoder is its performance in the presence of noise. To investigate the effect of noise on our BMI, we tested the system under three different conditions. First, we reduced the tuning depth of MI neurons from 1 to 0.2.

    Tuning Depth = 0.5 (Mean Firing Rate_Action / Mean Firing Rate_Baseline - 1)        (5-6)


The tuning depth was computed by Equation 5-6. In the next step, Gaussian noise was added to the TD error to generate a noisy error signal with a 5 dB signal-to-noise ratio. Finally, I put the agent under the noisy-input, noisy-error condition. Table 5-3 summarizes the performance of the agent under these three conditions for different target configurations. Here, NF Input NF Error corresponds to Noise-Free Input, Noise-Free Error.

Table 5-3. BMI performance with synthetic and surrogate data
                           1 Target   2 Target   4 Target
NF Input NF Error          99.8%      98.3%      95.6%
Noisy Input                96.0%      80.6%      65.3%
Noisy Error                100%       99.2%      96.2%
Noisy Input Noisy Error    98.2%      78.0%      53.3%
Surrogate                  14.1%      14.6%      14.2%

In order to test against the null hypothesis that the agent can reach the target without any structure in the input data, we ran a test with a surrogate dataset. The surrogate data were generated by reducing the tuning depth of all the neurons to zero. In this test, the error signal was computed the same way as in the experiments with tuned neurons. The performance of the BMI with surrogate data is presented along with the results for synthetic neural data in Table 5-3. The results in Table 5-3 demonstrate that noise in the error signal slightly improved the performance of the system because it helped the agent escape from local minima. It should be noted that these results are based on the ideal case in which task-related information is represented in both the motor states and the evaluative feedback. The noise in the critic's response also served as a mechanism for exploration. Since the agent is most sensitive to the sign of the error, fluctuations in the amplitude did not degrade the performance. In general, increasing the number of targets decreases the performance; however, this decrease in performance is more prominent with noisy motor commands.


However, even with noisy input states and noisy error signals, the agent performed well in the one- and two-target tasks. A random walk test was performed to specify the probability of reaching a target by chance. Since the probability of reaching each target was the same, we computed the chance level for one target. For the random walk test, the actions of the agent were selected randomly at each step, and the same limit on the number of steps per trial was applied. The probability of reaching a target by chance was 0.1%.
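A chance level of this kind can be estimated with a short Monte Carlo run over the grid world sketched earlier; the code below is an illustrative estimate, not the original analysis script.

    import numpy as np

    rng = np.random.default_rng(1)

    def random_walk_success_rate(env_cls, target, n_trials=10000):
        """Fraction of trials in which a uniformly random policy reaches the target
        within the same 50-step limit used in the experiments."""
        successes = 0
        for _ in range(n_trials):
            env = env_cls(target)
            env.reset()
            while True:
                _, reached, timeout = env.step(rng.integers(4))
                if reached:
                    successes += 1
                    break
                if timeout:
                    break
        return successes / n_trials

    # Example (uses the GridWorld sketch from above):
    # print(random_walk_success_rate(GridWorld, target=(1.8, 1.8)))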


In this chapter, I introduced a control architecture and adaptation procedure for decoding motor commands in MI based on an evaluative feedback from NAcc that indicates the user's goal. The performance of the BMI controller was studied in the presence of noise in the error and input for three different tasks. Our results demonstrated the feasibility of this architecture under simulated biological constraints. The adaptive agent was able to navigate the robot to targets in the continuum of space. We made no assumption about the location of the target in space; in other words, the environment was novel for the naïve agent and no training was involved. The agent learned how to decode MI neural activity on the fly based only on an evaluative feedback from the user. Since the error signal is extracted from the brain, the system is self-contained in the sense that it does not require an external source of information for adaptation.

The BMI design based on the Actor-Critic architecture faces two main challenges. First, the performance of this system relies on extracting motor commands and evaluative feedback from the brain. Provided reliable signals are extracted from the brain, this system is able to reach any point in the continuum of its workspace. The BMI demonstrated fairly robust performance in the presence of noise in both the motor commands and the TD error signal; however, the BMI was more sensitive to noise in the motor commands than in the TD error. The second challenge is the adaptability of the actor during reorganization in the brain, which stems from the winner-take-all approach to training the decoder. In this approach, only the parameters of the winning action are updated; therefore, over time, other actions become less competitive compared to the winner and the BMI cannot reorganize itself to accomplish the task. In a stationary environment this may not cause a problem, but in the presence of nonstationarity the system should have enough flexibility to converge to a new control policy. However, the results in Table 5-3 show that in spite of decreasing the tuning depth of MI neurons and decreasing the SNR of the TD error, the agent had good performance in reaching a single target. This observation implies that using multiple agents, each specialized for a particular task, might be a better approach than having one agent learn multiple tasks.


CHAPTER 6
CLOSED-LOOP IMPLEMENTATION OF THE ACTOR-CRITIC ARCHITECTURE

Introduction

In the previous chapter, the Actor-Critic architecture for the implementation of a symbiotic BMI was introduced. Using a simulator, the system behavior was studied under a variety of environmental and neural conditions. In the next step, I designed a full closed-loop in vivo BMI experiment to find the learning specifications based on real neural data. The focus of this chapter is on two main questions: How can we extract an evaluative feedback from the brain reliably in real time? How can we adapt the BMI decoder using this evaluative feedback? This chapter is organized into the following parts. In the first part, the experiment setup is introduced. In the second, the factors that are important for training the Critic structure and eventually estimating a robust evaluative feedback from NAcc are discussed. In the last part, Actor learning based on real neural states recorded simultaneously from MI and NAcc during closed-loop experiments is demonstrated.

Experiment Setup

A two-target choice task, shown from a top view in Figure 6-1, was designed. The rat must maneuver a five degree-of-freedom (DOF) robotic arm (Dynaservo, Markham ON) based on visual and reward feedback to reach a set of targets and earn a water reward. The concept behind this experimental paradigm was to find a set of specifications for the Actor-Critic learning to operate in a novel environment. The closed-loop experiment was divided into a supervised-control mode and a brain-controlled mode. While the purpose of the supervised-control mode was to collect a training dataset for designing the Actor-Critic architecture, the purpose of the brain-controlled mode was to test the BMI performance in a novel environment.


Figure 6-1. Critic training and the closed-loop experiment setup.

Training Paradigm

A new paradigm was developed to train the rat for the closed-loop experiments. The main difference between this new paradigm and the previous training paradigm was that it shifted the attention of the rat to the movements of the robot towards or away from the target. A male Sprague-Dawley rat was trained in a two-lever choice task via operant conditioning to associate robot control with lever pressing [93]. As shown in Figure 6-1, there are two sets of retractable levers (Med Associates, St. Albans VT): the set within the behavioral cage is referred to as the cage levers; the set in the robotic workspace is referred to as the distal levers. A solenoid controller (Med Associates) dispenses 0.04 mL of water into the reward center on successful trials. There are three sets of green LEDs: the set immediately behind the cage levers are the cage LEDs, the set in the robot workspace are the midfield LEDs, and the set on the target levers are the distal LEDs.


The positioning of the three sets of LEDs and levers offers a technique to guide attention from inside the cage to the robot environment outside. There is one additional blue LED mounted on the robot endpoint; it is referred to as the guide LED and is used to assist the rat in tracking the position of the robot. Because the behavioral cage walls are constructed from plexiglass, the robotic workspace is within the rat's field of vision [94]. The workspace uses low-level lighting and is designed to maximize the rat's visual abilities. The distal LEDs and guide LED provide contrast in low-light conditions, and targets are positioned to maximize the angle subtended at the rat's eye.

Initially, the robotic arm tip (guide LED) is positioned directly in front of the water reward center. The rat initiates a trial with a nose-poke through the IR beam in the reward center. The target side and robot speed are randomly selected. All levers are extended synchronously and the LEDs on the target side are illuminated to cue the rat. The robot follows a predetermined trajectory to reach the target lever within 0.8-1.8 s, and the robot will only press the target lever while the rat is pressing the correct cage lever. If the correct cage and target levers are pressed concurrently for 500 ms, the task is successfully completed; a water reward positively reinforces the rat's association of the robot lever pressing with reward and the trial ends. If the rat presses the incorrect cage lever at any time, the trial is aborted, a brief tone indicates the choice was wrong, and there is a timeout (4-8 s) before the next trial can begin. Additionally, if the task is not completed within 2.5 s the trial is ended. Whenever a trial ends, all levers are retracted, the LEDs are turned off, and the robot is reset to the initial position. A 4-s refractory period prevents a new trial while the rat may be drinking.


The rat is then shaped to attend to the robot workspace by gradually moving the center of attention from within the cage to the robot workspace outside. This is achieved by turning off the cage and midfield LED cues in sequence during training. The variable robot speed also encourages attention to the robot: the rat can minimize task energy by synchronizing its pressing with the robot. Eventually, the cues are reduced to the proximity of the guide LED to the target LED for completing the task and obtaining water. The rat learns to perform stereotypical motions in response to the environmental cues [94]. The timeout and time limit both encourage correct behavior: the rat can maximize the water rewards earned by avoiding timeouts and unsuccessful trials. Once the rat made the association between the reward and the robot actions, catch trials were introduced in which the robot moved to the non-target lever. In this case the rat received aversive feedback (a negative tone with no water reward) for pressing any lever inside its cage. The order of target and non-target trials was random throughout the training session but was balanced to keep the rat motivated.

Electrophysiology

After reaching the operant conditioning inclusion criterion, the rat was chronically implanted unilaterally with a custom-designed microelectrode array (see appendix) that simultaneously targeted layer V of the forelimb area in the primary motor cortex (MI) [95,96] and the Nucleus Accumbens (NAcc). Each array was 8x2 electrodes with 250 µm row and 500 µm column spacing (Tucker-Davis Technologies (TDT), Alachua FL), but the lengths of the arrays for MI and NAcc were different (the design of the dual microarray is explained in Appendix A). The arrays were positioned stereotaxically and lowered simultaneously with a hydraulic micropositioner to an approximate depth of 1.6 mm for the MI array and 7.5 mm for the NAcc array.


Spatiotemporal characteristics of the neuronal signal during insertion provided additional information about the array location relative to neurophysiologic landmarks. More details of the surgical technique are given in [117]. The rat was given up to two weeks to recover from surgery before resuming the experiment.

Electrophysiological recordings were performed with commercial neural recording hardware (TDT, Alachua FL). A TDT system (one RX5 and two RP2 modules) operated synchronously at 24414.06 Hz to record neuronal potentials from both microelectrode arrays. The neuronal potentials were bandpass filtered (0.5-6 kHz). Next, online spike sorting was performed to isolate single neurons in the vicinity of each electrode. Prior to the first closed-loop experiment, the experimenter reviewed each sorted unit over multiple days to refine the spike sorting thresholds and templates. Twenty single units in MI and 23 single units in NAcc were isolated. The isolation of these units was repeatable over sessions with high confidence from the recordings. Once the neurons were isolated, the TDT system recorded unit firing times and a firing rate estimate was obtained by summing spikes within nonoverlapping 100 ms bins. Additionally, all behavioral signals (e.g., water rewards, LED activation) were recorded synchronously using the shared time clock. In addition to the firing rates of single units, the Local Field Potential was also recorded at a 381.47 Hz sampling frequency across all 32 channels.
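The firing rate estimate described above is a simple bin count; a sketch of that computation is shown below (spike times in seconds, 100 ms non-overlapping bins), with all names illustrative.

    import numpy as np

    def binned_rates(spike_times, duration_s, bin_s=0.1):
        """Count spikes of one unit in non-overlapping bins (100 ms by default)."""
        edges = np.arange(0.0, duration_s + bin_s, bin_s)
        counts, _ = np.histogram(spike_times, bins=edges)
        return counts

    # Example: three spikes of a single unit within a 1-second window.
    rates = binned_rates(np.array([0.05, 0.31, 0.33]), duration_s=1.0)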


Closed-Loop Experiment Paradigm

After the rat was trained to associate the robot movement with earning reward, it was transferred to the closed-loop experiment. Each trial was started by the rat's nose-poke in a water receptacle in its cage, but the cage levers were not presented. During the closed-loop experiment, only the distal levers were presented and the target was cued by turning on a set of LEDs below a distal lever. In the supervised-control mode, one distal lever in the right corner of the robot workspace was presented as the target, and neural data were collected while the robot moved towards or away from the target. In those trials in which the robot moved towards the target, the rat received a water reward when the robot reached the target and pressed the lever; hence these were rewarding trials. In non-rewarding trials, the robot moved away from the target lever to a lever located at the left corner of the workspace. Once the robot reached the left (non-target) lever and pressed it, the rat received a negative tone without a water reward. In the supervised-robot mode, the target information was known to the Critic in the form of a desired response. The neural data in this mode were used to train the Critic to predict the reward expectation of the rat based on the robot's actions with respect to the target. The role of the critic was to discriminate between the actions that increased or decreased the probability of earning reward, so that when a new target at a different location in the robot workspace was presented it could guide the Actor to learn the novel task, which is the remapping of MI neural states to actions in order to navigate the robotic arm to the new target.

In each experiment session, the temporal order of rewarding and non-rewarding trials was random. The robot started moving 700 ms after the start of a trial (nose-poke), towards or away from the target lever depending on the trial. The reason for this delay was to pass the initial transient of the gamma filter at the input of the actor and critic; the gamma filter was used to preserve the temporal structure of the neural signal.


During the target trials, the robot reached and pressed the target distal lever, and the rat received a water reward accompanied by a reward tone. During the non-target trials, the target lever was still cued by light, but the robot moved to another (uncued) lever at a different location and pressed it. In this case, the rat received only a negative tone without any water reward. During the experiment, the rat sat motionless in the middle of the cage near the water receptacle and watched the robot workspace.

Critic Learning

The experiment setup is designed to investigate the specifications of Actor-Critic learning in real environments. As mentioned in the introduction section, the first question in designing a symbiotic BMI based on the Actor-Critic method is: how can we extract an evaluative feedback from the brain in real time? The answer to this question has two components. First, from a neurophysiological perspective, we need to understand how the positive and negative components of an evaluative feedback are represented in the NAcc during goal-directed behavior. Second, what engineering techniques can be used to extract this information in the form of an evaluative feedback? NAcc neural activity was recorded over 10 sessions while the animal was engaged in the closed-loop experiments. The experiment setup was designed to study the NAcc neural response under rewarding and non-rewarding conditions. From the perspective of BMI design, incrementally predicting these two states in real time from NAcc neural activity is the main objective of the Critic structure.

Neurophysiology of NAcc under Rewarding and Non-Rewarding Conditions

In the first level of analysis, the NAcc neural activity was compared under rewarding and non-rewarding conditions during the supervised-robot control mode. Figure 6-2 shows the Peri-Event Time Histogram of 3 representative accumbal neurons during four experiment sessions.


The red and blue traces correspond to non-rewarding and rewarding trials, respectively. In the first three experiment sessions (Figures 6-2A-C), the conditions were the same. Sessions 1 and 2 were recorded on two consecutive days, there was one day between the second and third sessions, and session 4 was recorded 4 days later. As can be seen, over time the difference between the neuromodulation of NAcc during rewarding and non-rewarding trials diminished. This might be due to the fact that over time the rat became habituated to the experimental conditions. After the third session, the environmental conditions were changed by introducing a new target and putting the robot under brain-control mode, in which the Actor translated MI neural activity into robot movement commands. In this mode, the robot performed non-optimal trajectories in the workspace rather than directly reaching the targets, since learning had just been initiated. After this change in the experiment procedure, the conditions were switched back to the previous mode (experiments 1-3) and another experiment was performed in the supervised-control mode. Figure 6-2D shows the NAcc neuromodulation after exposing the animal to the new environment. We can see that on the fourth day the difference between target and non-target trials increased again. These results imply that over time the NAcc responses to rewarding and non-rewarding stimuli become similar, but engagement in a novel environment might increase the contrast between rewarding and non-rewarding conditions. This result is in agreement with other studies on the effect of environmental novelty on the reward representation in the striatum [114,118,119].

These observations have two important implications for BMIs. First, the diminished separation is an important indicator of consolidation [120]; once the task was consolidated, the IA should stop adaptation.


Second, introducing a new environment will evoke more separation, which will trigger learning; this is an indicator for adaptation that tells the IA to adjust its control policy. The consolidation-adaptation dilemma is also compatible with the concept of surprise in machine learning.

Figure 6-2. Neuromodulation of 3 NAcc neurons during non-rewarding (red trace) and rewarding (blue trace) trials over multiple sessions. Time zero corresponds to the start of robot movement, and at time step 15 the robot reaches the target. A) The first session, B) the second session, C) the third session and D) the fourth session after introducing a novel environment to the rat. Note that the separability between the red and blue traces decreases over sessions A-C. However, when a new environment is introduced the separation increases. These two properties in closed-loop BMI indicate the cyclical nature of learning and consolidation (habituation) which the critic's evaluative feedback provides to the actor.

This result has an important implication for BMI. The key feature of the symbiotic BMI design is that the system will adapt to the user whenever the performance degrades; otherwise, the parameters of the BMI decoder will not change. In other words, whenever the BMI does not perform as the user expects and the parameters of the BMI decoder need to be adapted, the NAcc response will be available as a teaching signal. The NAcc activity conveys two pieces of information for adaptation of the BMI decoder.


First, it specifies the window of learning in which the parameters of the decoder will be unlocked. Second, within the learning phase, it provides a teaching signal for the decoder in the form of an evaluative feedback. The current experiment setup is designed for characterizing the learning within the learning phase; however, identifying the learning window, or segmentation, is the subject of future research.

State Estimation from NAcc Neural Activity

The engineering challenge of extracting the evaluative feedback from NAcc lies in modeling the multi-dimensional NAcc neural vector. The results in the previous section were based on the average of neural activity over multiple sessions; however, for real-time control the evaluative feedback should be extracted on a single-trial basis. From this perspective there are two scenarios for estimating an evaluative feedback from the NAcc: supervised and unsupervised. In this section, the results are focused on the supervised approach, which is based on regressing the multi-channel NAcc neural activity to a set of predefined rewarding and non-rewarding states as the desired response. I investigated different factors that contributed to state estimation from NAcc as an evaluative feedback.

Desired response

In the supervised-robot control mode of the closed-loop experiments, the desired response was defined based on the robot trajectory towards or away from the target. The assumption was that when the robot got closer to the target the reward expectation increased, and when the robot moved away from the target the reward expectation decreased. During the supervised-control mode, the robot trajectory was a straight path from the start point to the distal levers. The NAcc neural activity was segmented by the robot travel time window and regressed to a linearly increasing function for rewarding trials and a linearly decreasing function for non-rewarding trials.


By defining the desired response as a ramp function, the critic faces two problems at the same time: first, discriminating between the rewarding and non-rewarding states (sign), and second, estimating the probability of each one (magnitude). Here I used the same architecture that was shown in Figure 4-3 for the critic structure, but instead of the gamma memory structure I used a regular tapped-delay line. The advantage of using a tapped-delay line is that the timing is more precise and, in the case of segmented data, there will be no discontinuity at the input of the critic. Figure 6-3 shows the training and test performance of the critic in regressing the NAcc neural activity to the ramp function as the desired response. From these results it can be seen that the network had poor performance both during training and testing. The poor performance is in part due to the difficulty of learning both the amplitude and the sign of the desired response. From the Actor-Critic learning perspective, discriminating between rewarding and non-rewarding states is more important than estimating the magnitude of each probability. One way to mitigate this problem was to simplify the state estimation task by extracting only the sign of the evaluative feedback. In this way, the Critic would evaluate the Actor's performance by labeling each state as rewarding (positive) or non-rewarding (negative) and the amplitude would be ignored. With this strategy, two main questions remain to be investigated: first, what are the design specifications of the Critic network, and second, how would the Actor's performance be affected by receiving only the sign information from the Critic?
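The two forms of desired response discussed here (a ramp over the robot travel window versus a constant sign label) can be written down directly; the sketch below builds both targets for a trial of a given number of time bins and is illustrative only (the trial length used in the example is an assumption).

    import numpy as np

    def desired_response(n_bins, rewarding, use_ramp=True):
        """Target signal for critic training over one trial of n_bins time steps.

        Ramp: rises linearly to +1 for rewarding trials, falls to -1 otherwise.
        Sign: constant +1 (rewarding) or -1 (non-rewarding) for every bin.
        """
        if use_ramp:
            ramp = np.linspace(0.0, 1.0, n_bins)
            return ramp if rewarding else -ramp
        return np.full(n_bins, 1.0 if rewarding else -1.0)

    d_ramp = desired_response(15, rewarding=True)                    # ramp target
    d_sign = desired_response(15, rewarding=False, use_ramp=False)   # sign-only target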


Figure 6-3. Critic learning performance using the ramp function as the desired response.

Throughout the rest of this section the first question will be investigated, and the results will focus on classifying the rewarding and non-rewarding states from the NAcc multi-channel neural activity as the critic's response. The second question will be studied in the Actor learning section. In the next step, instead of using linearly increasing/decreasing functions, I used a constant function (+1 for rewarding and -1 for non-rewarding trials) as the desired response. All the time steps within a rewarding trial were marked as rewarding states, and likewise for the non-rewarding trials. Figure 6-4 shows the performance of the critic in learning the sign of the desired response. In this figure it can be seen that the critic still had problems learning the task.


Figure 6-4. Estimating the sign of the desired response using a nonlinear regression method (TDNN).

Linear vs. Nonlinear regression

The next step in designing the critic was selecting between linear and nonlinear structures. The results so far were based on a nonlinear structure (TDNN) that was trained using error backpropagation. The hidden layer of the TDNN provided a projection space for reducing the dimensionality of the neural space. As a linear structure, I used a Finite Impulse Response (FIR) filter and trained it using the Wiener-Hopf equation. Similar to the TDNN, I used the Wiener filter to regress the multi-channel neural vector over time onto the temporal sequence of states as the desired response. Figure 6-5 shows the regression performance of the Wiener filter. Comparing the performance plots in Figures 6-5 and 6-4, no significant improvement was observed.
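For reference, the Wiener filter used here amounts to a least-squares solution of the Wiener-Hopf equation on the tap-delayed neural input; the following is a minimal sketch of that computation (the embedding details and the small ridge term are illustrative choices, not necessarily those of the original analysis).

    import numpy as np

    def wiener_filter(neural, desired, n_taps=3, ridge=1e-3):
        """Solve (R + ridge*I) w = p for an FIR mapping from tap-delayed neural
        activity (time x channels) to the scalar desired response."""
        T, C = neural.shape
        # Build the tap-delay design matrix plus a bias column.
        X = np.hstack([np.roll(neural, k, axis=0) for k in range(n_taps)])
        X = np.hstack([X, np.ones((T, 1))])
        X[:n_taps - 1] = 0.0                       # discard rows with incomplete history
        R = X.T @ X / T                            # autocorrelation estimate
        p = X.T @ desired / T                      # cross-correlation estimate
        w = np.linalg.solve(R + ridge * np.eye(R.shape[0]), p)
        return w, X @ w                            # filter weights and filter output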


Figure 6-5. Estimating the sign of the desired response using a linear regression method (Wiener filter).

Classification vs. regression

So far, the state estimation was based on regressing the multi-dimensional NAcc neuromodulation to a desired response. The regression approach emphasizes the temporal structure in the NAcc data; however, inhomogeneity or nonstationarity in the data can have an adverse effect on the state estimation. Another approach to the estimation of the evaluative feedback is classification, in which the critic identifies the rewarding and non-rewarding states by looking at the whole pattern in the signal. For the classification, I used the same topology for the critic's network as before. In the first step, all the data points across all the neurons within a trial were used in order to classify the rewarding and non-rewarding trials. Figure 6-6A shows the critic's performance during the training phase. In this figure, the red trace shows the desired response and the blue trace shows the output of the critic that classifies the rewarding and non-rewarding states based on NAcc neural activity. The figure shows a concatenation of sequential trials within a session.


It can be seen that the critic had good performance after 100 iterations over the training dataset. Figure 6-6B shows the generalization performance over a test dataset; from this figure it can be seen that the critic did not perform well on the test dataset.

Figure 6-6. Supervised classification of rewarding and non-rewarding states from NAcc neuromodulation using a TDNN. A) Training and B) generalization performance. The red trace shows the desired response and the blue trace is the output of the critic.


Time segmentation

An important specification in the design of the critic structure is the time resolution required for computing the evaluative feedback. This resolution sets an upper limit on the speed of adaptation of the Actor. Another important aspect of characterizing the time resolution is specifying the window of time that can be used for adaptation. By looking at the peri-event time histograms of NAcc neural activity, it can be seen that during a specific time window within the trial the separation between the rewarding and non-rewarding states was most pronounced. I used sliding windows of different lengths in order to find the optimal window size for the critic and also to identify the time interval in the data in which the separability between rewarding and non-rewarding conditions was high. This separability could be due to various behavioral and electrophysiological parameters in the experiment, such as the attention and visual depth of the rat. Figure 6-7 shows the classification performance over different parts of the data using different window sizes. I tried window sizes from 2 up to 10 samples; each subplot corresponds to a specific window size. For each window, the neural activity across all the channels was used for classification. By sliding the window by one sample, the same procedure was repeated over time and a new classification rate was computed. In Figure 6-7, the results show the classification rate for the 40 trials in the test set. Figure 6-8 shows the peri-event time histogram of selected neurons for the same dataset used in Figure 6-7. The red and blue traces correspond to the rewarding and non-rewarding trials, respectively. It can be seen that the neural responses in the rewarding and non-rewarding conditions differed most during the first 600 ms of the trial. This observation is in agreement with the classification results shown in Figure 6-7.


Figure 6-7. Classification performance over different segments of the data using different window sizes. Each subplot shows the classification performance over time using a sliding window (2-sample to 10-sample windows).

Figure 6-8. Peri-event time histogram of the NAcc neurons that were used for classification.
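The sliding-window scan summarized in Figure 6-7 can be described as follows: for each window length and start bin, a classifier is trained on the NAcc activity inside the window and its test accuracy is recorded. The sketch below uses a simple leave-one-out nearest-class-mean classifier as a stand-in; the classifier choice and all names are illustrative, not the TDNN used in the text.

    import numpy as np

    def window_scan(trials, labels, win_sizes=range(2, 11)):
        """trials: (n_trials, n_bins, n_channels) NAcc activity; labels: +/-1 per trial.
        Returns {window_size: accuracy per start bin} using leave-one-out
        nearest-class-mean classification inside each window."""
        n_trials, n_bins, _ = trials.shape
        results = {}
        for w in win_sizes:
            accs = []
            for start in range(n_bins - w + 1):
                X = trials[:, start:start + w, :].reshape(n_trials, -1)
                correct = 0
                for i in range(n_trials):                      # leave-one-out
                    keep = np.arange(n_trials) != i
                    mu_pos = X[keep & (labels == 1)].mean(axis=0)
                    mu_neg = X[keep & (labels == -1)].mean(axis=0)
                    closer_to_pos = np.linalg.norm(X[i] - mu_pos) < np.linalg.norm(X[i] - mu_neg)
                    pred = 1 if closer_to_pos else -1
                    correct += (pred == labels[i])
                accs.append(correct / n_trials)
            results[w] = np.array(accs)
        return results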


Table 6-1 summarizes the best performance that was achieved using each approach. From these results, the highest performance was achieved by time segmentation. The performance of the regression methods was computed by thresholding the output of the critic and comparing the result with the desired response. The results of this analysis have two important implications for training the critic. First, the critic should be trained during specific time windows; therefore, in a real-time application of the Actor-Critic architecture, a segmentation strategy should be employed to identify when the critic should adapt. This information is also important for adaptation of the actor, which should be investigated in future research. The second implication for BMI is the selection of the input buffer size for the critic. Since NAcc neural activity is local in time, using large buffer sizes allows irrelevant information at the input of the critic to degrade the state estimation performance and also decreases the control time resolution. On the other hand, state estimation using shorter buffer sizes can be less robust; therefore, there is a trade-off between robustness and temporal resolution in the selection of the buffer size for the critic structure.

Table 6-1. State estimation performance
Ramp function                    52%
TDNN regression                  57%
Linear regression                55%
Classification                   58%
Segmentation / Classification    72%

Actor Learning

The ultimate goal of the Actor-Critic learning for BMI is training the actor as a neural decoder. In the previous section, different aspects of estimating an evaluative feedback from NAcc neural activity were discussed. In this section, I discuss actor learning based on the motor states in MI and the Critic's evaluative feedback.


Preliminary Simulations Using Sign and Magnitude of the Evaluative Feedback for Training the Actor

An important aspect of actor learning for BMI is finding the minimum requirements for convergence of the actor to a control policy that enables the user to accomplish the task. As shown in the previous chapter, the bottleneck of BMI control using Actor-Critic learning is estimating the evaluative feedback. It was shown that the critic could decode the sign of the evaluative feedback; however, it remains unknown how much information is needed to reliably train the actor. In this section, I study the effect of the magnitude and sign information in the evaluative feedback on the actor's learning. In order to characterize the actor learning based on different layers of information in the evaluative feedback, I used the closed-loop simulator that was developed in the previous chapter. The activity of the 20 simulated MI neurons formed the neural input vector to the actor structure. The actor topology consisted of a gamma memory structure at the input of an MLP network (Figure 4-2). The linear output nodes of the MLP represented the 27 actions in the robot workspace (Table 3-1). The assumption here was that the rat generated goal-directed motor states in the MI neural activity. The actor's task was to decode the motor representation in MI and maneuver the robot from the start point to the target in the same grid space that was used during the experiment. Since the focus was on actor learning, an evaluative feedback was computed from the actions of the actor, representing the critic's response in the ideal case. Figure 6-9 shows the actor's decoding performance when the critic provided complete information in terms of both sign and magnitude. The sign of the evaluative feedback represented whether the action increased or decreased the reward expectation, and the magnitude showed to what extent the action was desirable or undesirable.


Figure 6-9A shows the performance over 100 trials, and Figure 6-9B depicts the action values computed by the actor over time. In Figure 6-9 it can be seen that after about 35 trials the actor found an effective sequence of actions for completing the task, which was reaching a single target in the robot workspace.

Figure 6-9. Actor learning based on MI neural states using both the amplitude and sign of the simulated evaluative feedback during the one-target acquisition task. A) Performance. B) Action values.


In the next step, I repeated the same experiment with the same dataset, but this time I quantized the critic's response to +1 (desirable action) and -1 (undesirable action). Here I excluded the magnitude information from the evaluative feedback and reduced the information content to 1 bit. Figure 6-10 shows the actor's performance using only the sign of the evaluative feedback. It can be seen that the actor was still able to converge to a control policy that completed the task, but compared to the case of using both sign and magnitude it took 50% (34 trials) longer for the actor to converge.

Figure 6-10. Actor learning based on MI neural states using only the sign of the simulated evaluative feedback during the one-target acquisition task. A) Performance. B) Action values.

An important characteristic of Actor-Critic learning is that when the Actor finds a control policy that earns reward, the corresponding actions continually get reinforced. In the neural network implementation of the Actor, the weights corresponding to the winning action will grow; therefore, over time, other actions will lose the chance to compete with the winning action. This is a critical issue when the Actor needs to change its control policy. To mitigate this problem, I introduced a decaying factor that decreases the learning rate of the winning action at the output layer after earning reward. If an action is selected several times, the learning rate becomes zero and the weights corresponding to the winning action freeze.
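A minimal sketch of this per-action decay is shown below; maintaining one learning rate per output action and shrinking the winner's rate after each rewarded selection is the idea described above, while the decay factor and reset rule are illustrative assumptions.

    import numpy as np

    class PerActionLearningRate:
        """One learning rate per output action; the rate of a repeatedly rewarded
        (winning) action decays toward zero so its weights eventually freeze."""
        def __init__(self, n_actions, initial_rate=0.05, decay=0.8, floor=1e-5):
            self.rates = np.full(n_actions, initial_rate)
            self.initial_rate = initial_rate
            self.decay = decay
            self.floor = floor

        def rate(self, action):
            return self.rates[action]

        def after_reward(self, winning_action):
            self.rates[winning_action] *= self.decay
            if self.rates[winning_action] < self.floor:
                self.rates[winning_action] = 0.0      # weights for this action freeze

        def reset(self):
            self.rates[:] = self.initial_rate         # e.g., when a new task appears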


Selecting the rate of decay is very important for BMI because if the learning rate decays too quickly the Actor can adapt to noise; in other words, if the actor finds the reward through a random walk, the weights will freeze quickly, disregarding the underlying structure in the state space.

Inaccuracy in State Estimation and its Influence on the Actor Learning

Another important factor in the learning performance of the actor is the inaccuracy of the evaluative feedback. From the results of analyzing the critic learning, it is clear that the critic always provides a noisy estimate of the reward expectation. On a positive note, the level of noise in the evaluative feedback can be regarded as an exploration mechanism: if the evaluative feedback is purely random, then all of the actions will be exploratory actions. Because of the stochasticity of the neural states, we need to consider some level of exploration in the Actor learning process; therefore, adding noise to the evaluative feedback can improve the learning. Combining this strategy with an appropriate weight annealing rate will improve the learning, but an important question is what the maximum level of noise in the evaluative feedback is that the actor can tolerate. The answer to this question directly sets some specifications for the design of the critic. In this section, I investigated this question using the closed-loop simulator that was developed in Chapter 5. The advantage of using the simulator is that, by having access to the ground truth in terms of the ideal motor states and evaluative feedback, we can estimate the performance boundaries. The simulation experiment was performed in a two-target task where the targets were presented randomly. The ideal evaluative feedback was randomly replaced by Gaussian noise during the Actor adaptation, and the percentage of time that noise was fed back to the Actor specified the level of feedback randomness.
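The corruption procedure can be written as a one-line substitution: with probability p (the feedback randomness level), the true evaluative feedback is replaced by the sign of a Gaussian sample, which matches the sign-only feedback used for training. The sketch below is illustrative; only the swept probability values are taken from Table 6-2.

    import numpy as np

    rng = np.random.default_rng(2)

    def corrupted_feedback(true_feedback, randomness):
        """With probability `randomness` (0.0-1.0), replace the evaluative feedback
        by the sign of a Gaussian sample; otherwise pass the true value through."""
        if rng.random() < randomness:
            return float(np.sign(rng.normal()))
        return true_feedback

    # Randomness levels swept in Table 6-2: 0%, 10%, 30%, 50%, 100%.
    fb = corrupted_feedback(+1.0, randomness=0.3)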


Based on the results in the critic training section, the sign of the evaluative feedback was used for training the Actor. Table 6-2 shows the effect of inaccuracy in the evaluative feedback on the performance of the Actor. The Actor performance was computed based on the number of times that the Actor successfully completed the task. It should be noted that these results are based on an ideal presentation of motor states to the Actor. It can be seen that some level of randomness improved the performance, but beyond that the performance degraded.

Table 6-2. Effect of inaccuracy in the evaluative feedback on the Actor performance
Feedback randomness   0%    10%   30%   50%   100%
Performance           75%   80%   60%   45%   3%

Actor Learning Based on Real MI Neural States and NAcc Evaluative Feedback

In the final step, I put together all the lessons learned from investigating each component of the Actor-Critic architecture and trained the whole system using the real neural data collected during the closed-loop experiment. Simultaneously recorded MI and NAcc activity was used in order to show that the Actor-Critic architecture is able to capture the structure in the MI neural states based on the evaluative feedback extracted from NAcc. The data recorded during the supervised-robot control mode of the experiment were used because the timing of the robot trajectory in this mode was repeatable; therefore, the neural states could be more repeatable as well. The Actor was composed of an MLP neural network with 3 nonlinear hidden PEs and 12 linear output PEs that represented 12 directions of robot movement in 3D space. Three tap-delays were used at the input of the network to capture the temporal structure in the data. As explained in the experiment procedure, the target was located at the right corner of the robot workspace.
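A minimal sketch of the Actor network described above is given below: binned MI activity passes through a short tap-delay buffer into a one-hidden-layer MLP with 12 linear outputs, and the action with the highest value is selected. The layer sizes follow the text; the weight initialization, activation function, and greedy selection rule are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(3)

    class Actor:
        """MLP actor: tap-delayed MI input -> 3 tanh hidden PEs -> 12 linear action values."""
        def __init__(self, n_channels, n_taps=3, n_hidden=3, n_actions=12):
            self.n_taps = n_taps
            n_in = n_channels * n_taps
            self.W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.uniform(-0.5, 0.5, (n_actions, n_hidden))
            self.b2 = np.zeros(n_actions)
            self.buffer = np.zeros((n_taps, n_channels))

        def act(self, mi_bin):
            """mi_bin: binned firing rates for one 100 ms step (length n_channels)."""
            self.buffer = np.vstack([mi_bin, self.buffer[:-1]])   # push newest bin
            x = self.buffer.ravel()
            hidden = np.tanh(self.W1 @ x + self.b1)
            action_values = self.W2 @ hidden + self.b2
            return int(np.argmax(action_values)), action_values

    actor = Actor(n_channels=20)
    action, values = actor.act(np.zeros(20))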


Half of the trials were used to train the critic. Similar to the actor, the critic structure was composed of an MLP neural network with 5 nonlinear hidden PEs and one linear output PE. The output of the critic was thresholded at zero; therefore, positive values corresponded to the rewarding condition and negative values were associated with non-rewarding states. Based on the analysis performed on the critic learning (Figure 6-7), a 3-sample buffer size was selected and the first 1-second segment of each trial was used for training. The accuracy of the critic was 72% over this time segment. For testing the learning performance of the system, the evaluative feedback from the critic was used to train the naïve actor. The parameters of the actor were initialized with random values. The task was to train the actor to navigate the robot from the starting point to the target based on the motor neural states in MI and the evaluative feedback from the critic. It should be noted that the test dataset was new to both the critic and the actor structures. The task difficulty level was 4 steps, which means it took 4 steps along the direct path to reach the target from the start point. Figure 6-11 shows the performance of the actor over 40 trials. The first subplot shows whether each trial was a success, and the second subplot shows the time taken to complete the task. There was a 1-second time limit; if the actor was not able to reach the target within it, the trial was considered unsuccessful. From Figure 6-11 it can be seen that at the beginning the actor was not able to find the target, but after a few trials it converged to a control policy that consistently solved the task. Another factor reflected in the second subplot is that as the actor learned the optimal control policy, the time to the target decreased, because the actor learned the direct path to the target. This is in agreement with the actual robot trajectory during the task, where the robot took a direct path to the target. Figure 6-12 shows the points that the actor visited during learning. It can be seen that after convergence the movement trajectory was a direct path of 4 steps towards the target.


Figure 6-11. Offline closed-loop control performance of the Actor-Critic architecture using real MI and NAcc neural data.

Figure 6-12. The Actor's movement trajectory in 3D space during closed-loop control using the simultaneously recorded real neural activity in MI and NAcc. The optimal path (magenta) was repeatedly selected after convergence.


In order to take a closer look at the Actor adaptation based on the evaluative feedback from NAcc, the values of the different actions at the output of the actor are plotted over time, together with the cumulative reward and the output of the hidden layer, in Figure 6-13. It can be seen that the Actor adjusted the values of the different actions in order to find the appropriate action. This adjustment was made at both the hidden and output layers. The hidden layer in the Actor's structure provides a projection space (feature space) that reduces the dimensionality of the input space. From Figure 6-13C it can be seen that around time step 72 the hidden layer found a projection that was kept throughout the rest of learning. After the hidden layer found the right projection, the output layer found the optimal action.

Figure 6-13. The Actor's parameter adaptation during closed-loop control. A) Cumulative reward over time. B) Action values computed at the output layer of the actor. C) Output of the hidden layer processing elements of the actor.

In order to show the effect of the evaluative feedback extracted from the NAcc on the Actor learning, a surrogate analysis was performed in which the evaluative feedback from the NAcc was replaced by uniformly distributed random noise.

PAGE 111

111 feedback from the NAcc was replaced by uniformly distributed random noise. T he sign of the signal was computed and used it as the evaluative feedback for training the actor. Figure 6 1 4 shows the performance of the actor, using this signal over the same dataset. From the performance plot it can be seen that actor was not able t o complete the task without the NAcc information. In Figure 6 1 5 the robot trajectory shows the actor was not able to reach the target repeatedly using surrogate evaluative feedback within the time limit of each trial 6 1 4 Actor learning performance based on the real MI neural states and a random evaluative feedback (surrogate analysis)
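A minimal sketch of this surrogate control is given below, reusing the hypothetical actor and env objects from the earlier sketch; the only change is that the NAcc-derived feedback is replaced by the sign of uniformly distributed noise.

    import numpy as np

    def surrogate_feedback(rng):
        # Sign of uniformly distributed noise: +1 or -1 with equal probability,
        # carrying no information about the rat's reward expectation.
        return 1.0 if rng.uniform(-1.0, 1.0) >= 0.0 else -1.0

    def run_surrogate_session(actor, env, n_trials=40, time_limit_s=1.0, dt=0.1, seed=1):
        """Repeat the closed-loop trials, but train the actor with random feedback
        instead of the NAcc-derived signal (all names here are illustrative)."""
        rng = np.random.default_rng(seed)
        outcomes = []
        for _ in range(n_trials):
            state, t, success = env.reset(), 0.0, False
            while t < time_limit_s:
                a = actor.select_action(state)
                state, at_target, _ = env.step(a)       # discard the real critic output
                actor.update(state, a, surrogate_feedback(rng))
                t += dt
                if at_target:
                    success = True
                    break
            outcomes.append((success, t))
        return outcomes                                  # expected: mostly failed trials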


Figure 6-15. Actor's movement trajectory in 3D space based on the real MI neural states and a random evaluative feedback.


CHAPTER 7
CONCLUSIONS

Overview

During daily life activities, BMIs should be able to perform complex tasks in changing environments based on dynamic neuronal activation. Robust performance in nonstationary conditions requires intelligent adaptive control strategies. A transformative framework for goal-directed behavior was introduced here that enables a symbiosis between two learning systems: an artificial intelligent agent and the user's brain. This framework is based on well-established concepts, including the perception-action-reward cycle and value-based decision making. However, unlike traditional computational modeling or neurobiological study of these systems, a method was presented that enables a direct, real-time dialogue between the biological and computational systems. An important design element of the symbiotic BMI is that neither the user nor the IA can solve the task independently. The user's brain has no direct access to the external space where the reward is located, and the IA's reward, which is located in the brain, can only be achieved by satisfying the user. Both need to learn how to cooperate symbiotically and use the prerequisites of value-based decision making to solve the task. In the BMS theory, the user's natural PARC is broken to enable the expression of intent through the robotic arm, which, owing to the real-time operation and visual feedback, appears to be causally assimilated by the user. By consistently modulating brain activity in the motor and reward areas, the user allows the IA to learn an effective control policy. Therefore, it can be concluded that as long as the IA adapts quickly and consistently within the causality window of the user's behavior, it is possible to break the natural PARC and close the loop externally with the IA.


The main research question in my work was how two different intelligent entities (artificial and biologic) could engage in a symbiotic relationship. The key concept in promoting a symbiotic relationship between the user and the IA was to link the PARCs of the user and the IA by aligning their goals. A challenge in this regard was to match the neural representation of goal in the brain with the mathematical definition of goal in the IA. The fundamentals of the Actor-Critic architecture were adopted for the implementation of the IA because it is a goal-driven architecture with separate structures for the representation of goal (critic) and action (actor), where action selection is based on goal information presented by the critic in the form of an evaluative feedback. Therefore, by extracting an evaluative feedback from the brain, the IA could learn how to take action based on the user's goals. We formulated BMI control as a decision-making process in which the actor learned action values corresponding to each neural state. Instead of a specific, context-dependent mapping, in a symbiotic BMI the actor learned a control policy for associating neural states with actions. Goal-directed adaptation of the IA played a pivotal role in aligning the control policy with the user's goal. For the control of the neuroprosthesis, the evaluative feedback was used to adapt the actor structure only when the user needed help (e.g., in a novel environment); otherwise the BMI did not change the control policy.

Compared to other BMIs trained with an external supervisory signal, the first step in the Actor-Critic implementation of the symbiotic BMI was to extract an internal measure of the user's goal, in the form of an evaluative feedback, from the brain. The possibility of extracting such a signal from NAcc for adaptation of the actor was investigated. The results suggested that NAcc contained a rich representation of goal information during goal-approaching behavior. An important aspect of an evaluative feedback is that it has to provide both positive and negative reinforcement, where the positive component predicts reward and the negative component predicts aversion. For real-time BMI control we needed to estimate the evaluative feedback on a single-trial basis with high temporal resolution. The modeling results confirmed that, as the animal approached or moved away from the preferred target, the value estimator was able to predict the probability of earning reward as a decreasing or increasing linear function within a 100-ms time scale. The critic provided an evaluative feedback with high temporal resolution by computing the gradient of the signal at the output of the value estimator.

In a simulation study, the adaptation of the IA based on the evaluative feedback was tested under two conditions: changing environments and dynamic neural states. The IA was able to adapt its control policy in changing environments to solve novel tasks where only an evaluative feedback was available as a teaching signal. In all of our simulations, when the environment changed, the IA adapted its control policy accordingly to utilize the actions required for solving the task. One of the appealing characteristics of the Actor-Critic control architecture was that if a new task was within the space spanned by a learned control policy, the IA was able to accomplish the task without the need for adaptation. In other words, the IA adapted its control policy only if it was not able to accomplish the task. The learning rate was the parameter that controlled the IA's adaptation. In the simulation experiments, when the IA had learned a control policy the learning rate was set to zero, so that all of the parameters of the IA were fixed.
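A minimal sketch of the gradient-based evaluative feedback described above is given below. The function and variable names are illustrative, and the value trace is assumed to be the value estimator's reward-probability output per 100-ms bin; the buffer length is assumed to correspond to the 3-sample buffer reported in Chapter 6.

    import numpy as np

    def evaluative_feedback(value_trace, buffer_size=3):
        """Sign of the temporal gradient of the value-estimator output,
        smoothed over a short buffer of recent 100-ms bins:
        +1 for a rewarding (rising) trend, -1 for a non-rewarding (falling) trend."""
        v = np.asarray(value_trace, dtype=float)
        if v.size < 2:
            return 0.0                              # not enough samples yet
        recent = v[-(buffer_size + 1):]             # last few value estimates
        gradient = np.mean(np.diff(recent))         # average slope over the buffer
        return 1.0 if gradient > 0 else -1.0

    # Example: a value trace that rises as the rat approaches its preferred target
    print(evaluative_feedback([0.2, 0.3, 0.45, 0.6]))   # -> 1.0
    print(evaluative_feedback([0.6, 0.5, 0.35, 0.2]))   # -> -1.0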


In the BMS framework the IA adapts to the user only if performance degrades; however, adaptation of the learning rate based on a measure of performance is the subject of future research. Adaptation of the control policy for novel tasks required utilizing new sequences of actions; in the case of changing neural patterns, however, the IA needed to find a new mapping between neural states and actions. I introduced a new neural pattern by shuffling the Action Preference of the neurons. Again, the IA could associate the new neural states with appropriate actions using only an evaluative feedback. The simulation results also showed that the decoding performance of the Actor-Critic architecture was robust to noise in the evaluative feedback [121]. The Actor-Critic architecture gives the IA great flexibility to adapt to changes in both the environment and the neural states. As long as there are repeatable sets of neural states that correlate with the task, the IA automatically associates them with appropriate actions so as to maximize the user's satisfaction. Since the IA uses the brain's computational capability for reward/punishment prediction, the Actor-Critic-based BMI is more computationally efficient than the conventional Actor-Critic method. However, the value function estimation is replaced by the computation required for estimating the evaluative feedback from the neural ensemble activity in the brain. Adaptation of the value estimator is the subject of future research.
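A minimal sketch of the Action Preference shuffle described above is given below. The neuron-to-action assignment (action_preference) is a hypothetical stand-in for the simulator's tuning model, not the dissertation's actual data structure.

    import numpy as np

    def shuffle_action_preference(action_preference, seed=0):
        """Permute the preferred action assigned to each simulated neuron so the
        ensemble produces an unfamiliar pattern of modulation; the IA must then
        relearn the state-to-action mapping from evaluative feedback alone."""
        rng = np.random.default_rng(seed)
        shuffled = np.array(action_preference, copy=True)
        rng.shuffle(shuffled)                      # in-place permutation across neurons
        return shuffled

    # Example: 8 simulated neurons, each tuned to one of 4 discrete robot actions
    original = np.array([0, 1, 2, 3, 0, 1, 2, 3])
    print(shuffle_action_preference(original))     # a seed-dependent permutation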


Broader Impact and Future Works

Brain plasticity is fundamental for constructing new representations of the environment. Primary motor cortex neurons may not naturally fire to evaluate actions of external devices; they initiate motor actions through the limbs. As shown in the tuning results, the animal was evaluating the robotic actions and expressing intent by creating consistent modulations in MI neurons to purposely decide actions that moved the robotic arm to the targeted goal. Moreover, the representation changed over time to support improved performance. This plasticity opens up tremendous options for rehabilitation, because it shows that neural function may be purposely rerouted to extend natural function if the IA responds within the time span of the user's PARC and is able to decipher the new modulation expressing the user's goals. This leads to new research avenues and very interesting questions for neurobiology and the design of new experimental paradigms. Through the parameters of the BMI architecture (learning rates, action values, model weights), the IA can participate not only as an assistant but also as an observer that demarcates important events during brain control. The learning rate of the IA affects the speed at which the user and the IA must adapt to solve tasks. By adjusting the learning rate, one can specify which player effectively drives the learning process and further shape behavior. Furthermore, in the Actor-Critic framework the output of the actor network explicitly states the numeric value of each action, which can be read out at every instant in time. Therefore, finding causal relationships between neural activity and behavior is potentially more powerful, because analysis of neuronal tuning can be evaluated quantitatively and triggered by actions of increasing value. Determining which neurons adjust their encoding for those actions naturally exposes the functional role of the representation in solving the task.

From an engineering perspective, BMS theory opens up a new class of human-machine interaction in which peripheral devices autonomously adapt to the user's needs without explicit programming or body actions, simply by decoding brain activity. Users must be engaged in a dialogue with the IA to allow the translation of their intent into the IA's model of the world. There are many scientific and technological challenges still to be overcome, such as the physical interface between the brain and external devices and the specification of bandwidth, adaptability, and convergence requirements.


Another important aspect is to show that the BMS framework can be replicated at other levels of abstraction, such as the electrocorticogram (ECoG) or electroencephalogram (EEG). This is especially important from the perspective of applicability in humans. EEG and ECoG are much noisier and less specific than neural firing rates, but on the other hand the human has much higher cognitive capabilities than the rat, which can be translated into user-controlled neuromodulations detectable in the mass action of large neural populations. BMS theory has the potential to transform ordinary tools and appliances into assistive devices that are more responsive to the goals and needs of their users, and to broaden again the level of interaction with the world.

Continuous adaptation and reward learning are two unique features of reinforcement learning as the computational framework of the BMS theory. Conventional BMI retraining with a desired response requires that the patient physically or mentally (in the case of the paralyzed) generate a training set, which imposes a delay before the interface can be used. In addition, retraining may create learning confounds because it generates a different control mapping (network weights) for the patient each day. Continuous symbiotic adaptation incorporates the prior knowledge that the IA has gained, which allows the patient to learn a control strategy over multiple days (network weights are preserved, hence prior knowledge is preserved). Therefore, reinforcement learning enables a training philosophy unlike that of conventional BMIs because (1) it does not need an explicit desired signal, and (2) it improves performance with usage and may allow more difficult tasks due to the feedback between the user and the BMI agent.
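A minimal sketch of the weight preservation across days described above, assuming the hypothetical NaiveActor from the earlier sketch and an illustrative file name; the dissertation does not specify this mechanism, so this is only one way it could be realized.

    import numpy as np

    WEIGHTS_FILE = "actor_weights.npy"   # hypothetical per-subject weight store

    def load_or_init_actor(actor):
        """Start each session from the previous day's policy if one exists,
        instead of retraining from scratch (network weights are preserved)."""
        try:
            actor.W = np.load(WEIGHTS_FILE)
        except FileNotFoundError:
            pass                          # first session: keep the random initialization
        return actor

    def save_actor(actor):
        np.save(WEIGHTS_FILE, actor.W)    # persist prior knowledge for the next session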


The Actor-Critic architecture currently uses a model-free RL technique because the environmental dynamics are unknown. The agent can only learn from experience and cannot predict future states. Future BMS implementations may benefit from using sensory information to model the environment and deploying model-based RL methods to estimate future states and rewards. This modification would allow the IA to learn not only from experience but also from model predictions of possible environmental interactions, thus facilitating faster learning.
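A minimal sketch contrasting the current model-free evaluation with a hypothetical one-step model-based lookahead; transition_model and reward_model are illustrative callables that a future implementation would have to learn from sensory data, and are not part of the dissertation's system.

    import numpy as np

    def model_free_value(q, state, action):
        # Current approach: the value of (state, action) comes only from
        # experienced feedback accumulated in a learned table/network q.
        return q[state, action]

    def one_step_lookahead(q, state, transition_model, reward_model, gamma=0.9):
        """Hypothetical model-based extension: evaluate every action through a
        learned transition model (distribution over next states) and reward model
        before the action is ever tried, which can speed up learning."""
        n_actions = q.shape[1]
        values = np.zeros(n_actions)
        for a in range(n_actions):
            p_next = transition_model(state, a)        # probability vector over next states
            expected_r = reward_model(state, a)
            values[a] = expected_r + gamma * p_next @ q.max(axis=1)
        return values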


APPENDIX
DUAL MICRO ARRAY DESIGN

A key requirement for the BMS framework is to record the MI and NAcc neural activity simultaneously. The anatomical location of these two brain structures with respect to each other poses challenges for using two separate micro-wire arrays. With the conventional design of the arrays it is not possible to record simultaneously from shallow and deep brain structures in the rat. We custom designed a dual micro-wire array (Figure A-1A) for targeting MI and NAcc simultaneously, with the specifications shown in Figure A-1B. Based on the anatomical locations of MI and NAcc with respect to each other, the MI array was on the lateral side and the NAcc array on the medial side of the assembly. Each array consisted of 2 rows of 8 tungsten microwires. The arrays had to be placed as close to each other as possible. The MI array was designed to target layer V of the cortex. The spacing between the two arrays had to be selected such that when the shallow array reached MI, the deep array simultaneously reached NAcc. Figure A-2 shows the coronal cross section of the rat's brain at the level where MI and NAcc were aligned vertically [103]. Because of fabrication limitations it was not possible to position the MI and NAcc arrays independently in the anterior-posterior direction; therefore, the spacing between the two arrays and the stereotaxic coordinate had to be selected in such a way that MI and NAcc were aligned.
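A small worked sketch of the depth constraint described above. The dorsoventral depths below are assumed example values for illustration only; they are not the dissertation's actual stereotaxic coordinates.

    # Illustrative check of the dual-array depth constraint.
    # Depths are assumed example values (mm below the brain surface).
    mi_layer_v_depth_mm = 1.5      # assumed depth of the MI layer V target
    nacc_depth_mm = 7.0            # assumed depth of the NAcc target

    # When both arrays are lowered as one assembly, the deep (NAcc) wires must
    # extend below the shallow (MI) wires by the difference in target depths.
    required_tip_offset_mm = nacc_depth_mm - mi_layer_v_depth_mm
    print(f"deep array tips must extend {required_tip_offset_mm:.1f} mm below the shallow array")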


Figure A-1. Dual micro-wire electrode. A) Actual array. B) Physical specifications.

Figure A-2. Relative anatomical positions of the MI and NAcc in a coronal cross section (1.7 mm anterior to the bregma).


Figures A-3 to A-5 show how the relative position of these two areas changes at different medial-lateral levels [103]. Another factor in the design of the electrodes was ensuring that the MI array covered the hand representation area of the motor cortex. After designing the electrodes and specifying the stereotaxic coordinate, the next step was implanting the dual array. As mentioned in the electrophysiology section, we used a hydraulic micropositioner to implant the electrode at the desired depth. During the surgery, in order to minimize brain injury, we lowered the electrode in a single pass. The challenge in lowering the electrode was that once the NAcc array was on target, the MI array also had to be on target. I used the neurophysiological characteristics of the signal along with the anatomical landmarks to verify the location of the electrode.

Figure A-3. Relative anatomical positions of the MI and NAcc in a sagittal cross section (0.9 mm lateral to the midline).


Figure A-4. Relative anatomical positions of the MI and NAcc in a sagittal cross section (1.9 mm lateral to the midline).

Figure A-5. Relative anatomical positions of the MI and NAcc in a sagittal cross section (2.4 mm lateral to the midline).


LIST OF REFERENCES

1. Fetz EE, Finocchio DV (1975) Correlations between activity of motor cortex cells and arm muscles during operantly conditioned response patterns. Experimental Brain Research 23: 217-240.
2. Schmidt E (1980) Single neuron recording from motor cortex as a possible source of signals for control of external devices. Annals of Biomedical Engineering 8: 339-349.
3. Chelvanayagam DK, Vickery RM, Kirkcaldie MTK, Coroneo MT, Morley JW (2008) Multichannel surface recordings on the visual cortex: implications for a neuroprosthesis. Journal of Neural Engineering 5: 125-132.
4. Zrenner E (2002) Will retinal implants restore vision? Science 295: 1022-1025.
5. Miller CA, Woodruff KE, Pfingst BE (1995) Functional responses from guinea pigs with cochlear implants. I. Electrophysiological and psychophysical measures. Hearing Research 92: 85-99.
6. Nie K, Barco A, Zeng F (2006) Spectral and temporal cues in cochlear implant speech perception. Ear & Hearing 27: 208-217.
7. Rouger J, Lagleyre S, Fraysse B, Deneve S, Deguine O, et al. (2007) Evidence that cochlear-implanted deaf patients are better multisensory integrators. PNAS 104: 7295-7300.
8. Berger TW, Baudry M, Brinton RD, Liaw JS, Marmarelis VZ, et al. (2001) Brain-implantable biomimetic electronics as the next era in neural prosthetics. Proceedings of the IEEE 89: 993-1012.
9. Lozano AM, Mahant N (2004) Deep brain stimulation surgery for Parkinson's disease: Mechanisms and consequences. Parkinsonism and Related Disorders 10: S49-S57.
10. Ludvig N, Kovacs L, Medveczky G, Kuzniecky RI, Devinsky O (2005) Toward the development of a subdural hybrid neuroprosthesis for the treatment of intractable focal epilepsy. Epilepsia 46: 270.
11. Sanchez JC, Principe JC (2007) Brain-Machine Interface Engineering. San Rafael: Morgan and Claypool.
12. Donoghue JP (2002) Connecting cortex to machines: recent advances in brain interfaces. Nature Neuroscience 5: 1085-1088.
13. Nicolelis MAL (2003) Brain-machine interfaces to restore motor function and probe neural circuits. Nature Reviews Neuroscience 4: 417-422.
14. Georgopoulos AP, Kalaska JF, Caminiti R, Massey JT (1982) On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience 2: 1527-1537.
15. Georgopoulos AP, Schwartz AB, Kettner RE (1986) Neuronal population coding of movement direction. Science 233: 1416-1419.
16. Schwartz AB (2004) Cortical neural prosthetics. Annual Review of Neuroscience 27: 487-507.
17. Chapin JK, Moxon KA, Markowitz RS, Nicolelis MAL (1999) Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience 2: 664-670.
18. Wessberg J, Stambaugh CR, Kralik JD, Beck PD, Laubach M, et al. (2000) Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408: 361-365.
19. Scott SH (2003) The role of primary motor cortex in goal-directed movements: insights from neurophysiological studies on non-human primates. Current Opinion in Neurobiology 13: 671-677.
20. Srinivasan L, Eden U, Willsky A, Brown E (2006) A state-space analysis for reconstruction of goal-directed movements using neural signals. Neural Computation 18: 2465-2494.
21. Kennedy PR, Bakay RAE, Moore MM, Adams K, Goldwaithe J (2000) Direct control of a computer from the human central nervous system. IEEE Transactions on Rehabilitation Engineering 8: 198-202.
22. Hochberg LR, Serruya MD, Friehs GM, Mukand JA, Saleh M, et al. (2006) Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442: 164-171.
23. Serruya MD, Hatsopoulos NG, Paninski L, Fellows MR, Donoghue JP (2002) Instant neural control of a movement signal. Nature 416: 141-142.
24. Taylor DM, Tillery SIH, Schwartz AB (2002) Direct cortical control of 3D neuroprosthetic devices. Science 296: 1829-1832.
25. Gage GJ, Ludwig KA, Otto KJ, Ionides EL, Kipke DR (2005) Naive coadaptive cortical control. Journal of Neural Engineering 2: 52-63.
26. DiGiovanna J, Mahmoudi B, Fortes J, Principe JC, Sanchez JC (2009) Co-adaptive brain-machine interface via reinforcement learning. IEEE Transactions on Biomedical Engineering 56: 54-64.
27. Mahmoudi B, DiGiovanna J, Principe JC, Sanchez JC (2008) Neuronal tuning in a brain-machine interface during reinforcement learning. 30th Annual International Conference of the IEEE EMBS. pp. 4491-4494.
28. Buonomano DV, Merzenich MM (1998) Cortical plasticity: from synapses to maps. Annual Review of Neuroscience 21: 149.
29. Jackson A, Mavoori J, Fetz EE (2006) Long-term motor cortex plasticity induced by an electronic neural implant. Nature 444: 56-60.
30. Carmena JM, Lebedev MA, Crist RE, O'Doherty JE, Santucci DM, et al. (2003) Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology 1: 193-208.
31. Shenoy KV, Meeker D, Cao S, Kureshi SA, Pesaran B, et al. (2003) Neural prosthetic control signals from plan activity. NeuroReport 14: 591-597.
32. Andersen RA, Musallam S, Pesaran B (2004) Selecting the signals for a brain-machine interface. Current Opinion in Neurobiology 14: 720-726.
33. Musallam S, Corneil BD, Greger B, Scherberger H, Andersen RA (2004) Cognitive control signals for neural prosthetics. Science 305: 258-262.
34. Mulliken GH, Musallam S, Andersen RA (2008) Decoding trajectories from posterior parietal cortex ensembles. Journal of Neuroscience 28: 12913-12926.
35. Buzsáki G (2006) Rhythms of the Brain. New York: Oxford University Press.
36. Kim SP, Sanchez JC, Rao YN, Erdogmus D, Principe JC, et al. (2006) A comparison of optimal MIMO linear and nonlinear models for brain-machine interfaces. Journal of Neural Engineering 3: 145-161.
37. Brown EN, Kass RE, Mitra PP (2004) Multiple neural spike train data analysis: state of the art and future challenges. Nature Neuroscience 7: 456-461.
38. Serruya MD, Hatsopoulos NG, Paninski L, Fellows MR, Donoghue JP (2002) Brain-machine interface: Instant neural control of a movement signal. Nature 416: 141-142.
39. Wessberg J, Stambaugh CR, Kralik JD, Beck PD, Laubach M, et al. (2000) Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature 408: 361-365.
40. Helms Tillery SI, Taylor DM, Schwartz AB (2003) Training in cortical control of neuroprosthetic devices improves signal extraction from small neuronal ensembles. Reviews in the Neurosciences 14: 107-119.
41. Moran DW, Schwartz AB (1999) Motor cortical representation of speed and direction during reaching. Journal of Neurophysiology 82: 2676-2692.
42. Wu W, Black MJ, Gao Y, Bienenstock E, Serruya M, et al. (2002) Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter. University of Edinburgh, Scotland. pp. 66-73.
43. Chapin JK, Moxon KA, Markowitz RS, Nicolelis MA (1999) Real-time control of a robot arm using simultaneously recorded neurons in the motor cortex. Nature Neuroscience 2: 664-670.
44. Gao Y, Black MJ, Bienenstock E, Wu W, Donoghue JP (2003) A quantitative comparison of linear and nonlinear models of motor cortical activity for the encoding and decoding of arm motions. 1st IEEE EMBS Conference on Neural Engineering.
45. Sanchez JC, Kim SP, Erdogmus D, Rao YN, Principe JC, et al. (2002) Input-output mapping performance of linear and nonlinear models for estimating hand trajectories from cortical neuronal firing patterns. IEEE Workshop on Neural Networks for Signal Processing. pp. 139-148.
46. Velliste M, Perel S, Spalding MC, Whitford AS, Schwartz AB (2008) Cortical control of a prosthetic arm for self-feeding. Nature 453: 1098-1101.
47. Holmes NP, Calvert GA, Spence C (2004) Extending or projecting peripersonal space with tools? Multisensory interactions highlight only the distal and proximal ends of tools. Neuroscience Letters 372: 62-67.
48. Carmena JM, Lebedev MA, Crist RE, O'Doherty JE, Santucci DM, et al. (2003) Learning to control a brain-machine interface for reaching and grasping by primates. PLoS Biology 1: 1-16.
49. Hochberg LR, Serruya MD, Friehs GM, Mukand JA, Saleh M, et al. (2006) Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature 442: 164-171.
50. Moody J (1992) The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. Neural Information Processing Systems 4: 847-854.
51. Taylor DM, Helms Tillery SI, Schwartz AB (2003) Information conveyed through brain-control: Cursor versus robot. IEEE Transactions on Neural Systems and Rehabilitation Engineering 11: 195-199.
52. Tillery SIH, Taylor DM, Schwartz AB (2003) Training in cortical control of neuroprosthetic devices improves signal extraction from small neuronal ensembles. Reviews in the Neurosciences 14: 107-119.
53. Millan J (2003) Adaptive brain interfaces. Communications of the ACM 46: 75-80.
54. McFarland DJ, Krusienski DJ, Wolpaw JR (2006) Brain-computer interface signal processing at the Wadsworth Center: mu and sensorimotor beta rhythms. Event-Related Dynamics of Brain Oscillations. pp. 411-419.
55. Birbaumer N, Kubler A, Ghanayim N, Hinterberger T, Perelmouter J, et al. (2000) The thought translation device (TTD) for completely paralyzed patients. IEEE Transactions on Rehabilitation Engineering 8: 190-193.
56. Calvin WH (1990) The emergence of intelligence. Scientific American 9: 44-51.
57. Fuster JM (2004) Upper processing stages of the perception-action cycle. Trends in Cognitive Sciences 8: 143-145.
58. Sutton RS, Barto AG (1998) Reinforcement Learning: An Introduction. Cambridge: MIT Press.
59. Crites RH, Barto AG (1995) An Actor-Critic algorithm that is equivalent to Q-learning. Advances in Neural Information Processing Systems 7.
60. Konda VR, Tsitsiklis JR (2003) On Actor-Critic algorithms. SIAM Journal on Control and Optimization 42: 1143-1166.
61. Sutton RS, McAllester D, Singh S, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12: 1057-1063.
62. Williams RJ (1988) Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science.
63. Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8: 229-256.
64. Peters J, Schaal S (2008) Natural Actor-Critic. Neurocomputing 71: 1180-1190.
65. Kakade SA (2002) Natural policy gradient. Advances in Neural Information Processing Systems 14.
66. Amari SI (1998) Natural gradient works efficiently in learning. Neural Computation 10: 251-276.
67. Rangel A, Camerer C, Montague PR (2008) A framework for studying the neurobiology of value-based decision making. Nature Reviews Neuroscience 9: 545-556.
68. Schultz W (2000) Multiple reward signals in the brain. Nature Reviews Neuroscience 1: 199-207.
69. Samejima K, Doya K (2007) Multiple representations of belief states and action values in corticobasal ganglia loops. Reward and Decision Making in Corticobasal Ganglia Networks. Oxford: Blackwell Publishing. pp. 213-228.
70. Tanaka SC, Doya K, Okada G, Ueda K, Okamoto Y, et al. (2004) Prediction of immediate and future rewards differentially recruits corticobasal ganglia loops. Nature Neuroscience 7: 887-893.
71. Graybiel AM, Aosaki T, Flaherty AW, Kimura M (1994) The basal ganglia and adaptive motor control. Science 265: 1826-1831.
72. Hare TA, O'Doherty J, Camerer CF, Schultz W, Rangel A (2008) Dissociating the role of the orbitofrontal cortex and the striatum in the computation of goal values and prediction errors. Journal of Neuroscience 28: 5623-5630.
73. Shadmehr R, Wise SP (2005) The Computational Neurobiology of Reaching and Pointing. Cambridge: MIT Press.
74. Doya K (2000) Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology 10: 732-739.
75. Houk JC, Davis JL, Beiser DG (1995) Models of Information Processing in the Basal Ganglia. Cambridge: MIT Press.
76. Kawato M, Samejima K (2007) Efficient reinforcement learning: computational theories, neuroscience and robotics. Current Opinion in Neurobiology 17: 205-212.
77. Samejima K, Ueda Y, Doya K, Kimura M (2005) Representation of action-specific reward values in the striatum. Science 310: 1337-1340.
78. Schultz W, Dayan P, Montague PR (1997) A neural substrate of prediction and reward. Science 275: 1593-1599.
79. Williams ZM, Eskandar EN (2006) Selective enhancement of associative learning by microstimulation of the anterior caudate. Nature Neuroscience 9: 562-568.
80. Pennartz CMA, Dasilva FHL, Groenewegen HJ (1994) The nucleus accumbens as a complex of functionally distinct neuronal ensembles: An integration of behavioral, electrophysiological and anatomical data. Progress in Neurobiology 42: 719-761.
81. Redgrave P, Prescott TJ, Gurney K (1999) The basal ganglia: a vertebrate solution to the selection problem. Neuroscience 89: 15.
82. Taha SA, Nicola SM, Fields HL (2007) Cue-evoked encoding of movement planning and execution in the rat nucleus accumbens. Journal of Physiology (London) 584: 801-818.
83. Mogenson GJ, Jones DL, Yim CY (1980) From motivation to action: Functional interface between the limbic system and the motor system. Progress in Neurobiology 14: 69-97.
84. Morris G, Nevet A, Arkadir D, Vaadia E, Bergman H (2006) Midbrain dopamine neurons encode decisions for future action. Nature Neuroscience 9: 1057-1063.
85. Doya K (2008) Modulators of decision making. Nature Neuroscience 11: 410-416.
86. Cohen MX (2008) Neurocomputational mechanisms of reinforcement-guided learning in humans: A review. Cognitive, Affective, & Behavioral Neuroscience 8: 113-125.
87. Hikosaka O, Nakamura K, Nakahara H (2006) Basal ganglia orient eyes to reward. Journal of Neurophysiology 95: 567-584.
88. Arbuthnott GW, Wickens J (2007) Space, time and dopamine. Trends in Neurosciences 30: 62-69.
89. Khamassi M, Lacheze L, Girard B, Berthoz A, Guillot A (2005) Actor-Critic models of reinforcement learning in the basal ganglia: From natural to artificial rats. Adaptive Behavior 13: 131-148.
90. Suri RE, Bargas J, Arbib MA (2001) Modeling functions of striatal dopamine modulation in learning and planning. Neuroscience 103: 65-85.
91. Schonberg T, Daw ND, Joel D, O'Doherty JP (2007) Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience 27: 12860-12867.
92. Botvinick M, Niv Y, Barto A (2008) Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition.
93. Bower GH (1981) Theories of Learning. Englewood Cliffs: Prentice-Hall, Inc.
94. Whishaw IQ (2005) The Behavior of the Laboratory Rat; Whishaw IQ, Kolb B, editors. New York: Oxford University Press, Inc.
95. Kleim JA, Barbay S, Nudo RJ (1998) Functional reorganization of the rat motor cortex following motor skill learning. Journal of Neurophysiology 80: 3321-3325.
96. Donoghue JP, Wise SP (1982) The motor cortex of the rat: cytoarchitecture and microstimulation mapping. Journal of Comparative Neurology 212: 76-88.
97. Nicolelis MAL (1999) Methods for Neural Ensemble Recordings. Boca Raton: CRC Press.
98. Andersen RA, Musallam S, Pesaran B (2004) Selecting the signals for a brain-machine interface. Current Opinion in Neurobiology 14: 720.
99. Wu W, Gao Y, Bienenstock E, Donoghue JP, Black MJ (2005) Bayesian population decoding of motor cortical activity using a Kalman filter. Neural Computation 18: 80-118.
100. DiGiovanna J, Mahmoudi B, Mitzelfelt J, Sanchez JC, Principe JC (2007) Brain-machine interface control via reinforcement learning. 3rd IEEE EMBS Conference on Neural Engineering.
101. Mahmoudi B, DiGiovanna J, Sanchez JC (2008) Neuronal shaping in a co-adaptive brain-machine interface. Computational and Systems Neuroscience (COSYNE) Conference, Salt Lake City, Utah.
102. Mahmoudi B, DiGiovanna J, Principe JC, Sanchez JC (2008) Co-adaptive learning in brain-machine interfaces. Brain Inspired Cognitive Systems Conference, Sao Luis, Brazil.
103. Paxinos G, Watson C (1998) The Rat Brain in Stereotaxic Coordinates. San Diego: Academic Press.
104. Groenewegen HJ, Wright CI, Beijer AVJ (1996) Chapter 29: The nucleus accumbens: gateway for limbic structures to reach the motor system. In: Holstege G, Bandler R, Saper CB, editors. Progress in Brain Research. Elsevier. pp. 485-511.
105. Lewicki MS (1998) A review of methods for spike sorting: the detection and classification of neural action potentials. Network: Computation in Neural Systems 9.
106. Carlezon WA Jr, Thomas MJ (2009) Biological substrates of reward and aversion: A nucleus accumbens activity hypothesis. Neuropharmacology 56: 122-132.
107. Kaelbling LP, Littman ML, Moore AW (1996) Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4: 237-285.
108. Dayan P, Niv Y, Seymour B, Daw ND (2006) The misbehavior of value and the discipline of the will. Neural Networks 19: 1153-1160.
109. Gosavi A (2009) Reinforcement learning: A tutorial survey and recent advances. INFORMS Journal on Computing 21: 178-192.
110. Roitman MF, Wheeler RA, Carelli RM (2005) Nucleus accumbens neurons are innately tuned for rewarding and aversive taste stimuli, encode their predictors, and are linked to motor output. Neuron 45: 587-597.
111. Haykin S (2001) Neural Networks: A Comprehensive Foundation. Prentice Hall.
112. Principe JC, de Vries B, de Oliveira PG (1993) The gamma filter - a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing 41: 649-656.
113. Izhikevich EM (2003) Simple model of spiking neurons. IEEE Transactions on Neural Networks 14: 1569-1572.
114. Struthers W, DuPriest A, Runyan J (2005) Habituation reduces novelty-induced FOS expression in the striatum and cingulate cortex. Experimental Brain Research 167: 136-140.
115. Ferretti V, Roullet P, Sargolini F, Rinaldi A, Perri V, et al. (2010) Ventral striatal plasticity and spatial memory. Proceedings of the National Academy of Sciences 107: 7945-7950.
116. Sanchez JC, Principe JC (2006) Optimal signal processing for brain-machine interfaces. In: Akay M, editor. Handbook of Neural Engineering. New York: Wiley.
117. Sanchez JC, Alba N, Nishida T, Batich C, Carney PR (2006) Structural modifications in chronic microwire electrodes for cortical neuroprosthetics: a case study. IEEE Transactions on Neural Systems and Rehabilitation Engineering 14: 217-221.
118. Guitart-Masip M, Bunzeck N, Stephan KE, Dolan RJ, Duzel E (2010) Contextual novelty changes reward representations in the striatum. Journal of Neuroscience 30: 1721-1726.
119. Lee RS, Koob GF, Henriksen SJ (1998) Electrophysiological responses of nucleus accumbens neurons to novelty stimuli and exploratory behavior in the awake, unrestrained rat. Brain Research 799: 317-322.
120. Ganguly K, Carmena JM (2009) Emergence of a stable cortical map for neuroprosthetic control. PLoS Biology 7: e1000153.
121. Mahmoudi B, Principe JC, Sanchez JC (2009) An Actor-Critic architecture and simulator for goal-directed brain-machine interfaces. 31st International Conference of the IEEE EMBS. pp. 3365-3368.


BIOGRAPHICAL SKETCH

Babak Mahmoudi received a Master of Science in biomedical engineering from the University of Florida and a Master of Science in electrical engineering from Iran University of Science and Technology, Tehran, in 2009 and 2003, respectively. He earned a Bachelor of Science in electrical engineering from the University of Tehran, Tehran, Iran, in 1999. He worked in the Advanced Brain Signal Processing Laboratory, RIKEN, Japan, in 2004. He joined the Neuroprosthetics Research Group at the University of Florida in 2006, where he has been working on developing a framework for Brain-Machine Symbiosis towards his PhD. He has authored more than 20 peer-reviewed conference papers and 5 journal papers. Babak's research is focused on artificial intelligence and Brain-Machine Symbiosis.