
University of Cambridge
Department of Engineering

MODULAR ON-LINE FUNCTION APPROXIMATION FOR SCALING UP REINFORCEMENT LEARNING

Chen Khong Tham
Jesus College, Cambridge, England

October 1994

This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge

Summary

Reinforcement learning is a powerful learning paradigm for autonomous agents which interact with unknown environments with the objective of maximizing cumulative payoff. Recent research has addressed issues concerning the scaling up of reinforcement learning methods in order to solve problems with large state spaces, composite tasks and tasks involving non-Markovian situations. In this dissertation, I extend existing ways of scaling up reinforcement learning methods and propose several new approaches.

An array of Cerebellar Model Articulation Controller (CMAC) networks is used as fast function approximators so that the evaluation function and policy can be learnt on-line as the agent interacts with the environment. Learning systems which combine reinforcement learning techniques with CMAC networks are developed to solve problems with large state and action spaces. Actions can be either discrete or real-valued. The problem of generating a sequence of torque or position change commands in order to drive a simulated multi-linked manipulator towards desired arm configurations is examined.

A hierarchical and modular function approximation architecture using CMAC networks is then developed, following the Hierarchical Mixtures of Experts framework. The non-linear function approximation ability of CMAC networks enables non-linear functions to be modelled in expert and gating networks, while permitting fast linear learning rules to be used. An on-line gradient ascent learning procedure derived from the Expectation Maximization algorithm is proposed, enabling faster learning to be achieved.

The new architecture can be used to enable reinforcement learning agents to acquire context-dependent evaluation functions and policies. This is demonstrated in an implementation of the Compositional Q-Learning framework in which composite tasks consisting of several elemental tasks are decomposed using reinforcement learning. The framework is extended to the case where rewards can be received in non-terminal states of elemental tasks, and to `vector of actions' situations where the agent produces several coordinated actions in order to achieve a goal. The resulting system is employed to enable the simulated multi-linked manipulator to position its end-effector at several positions in the workspace sequentially.

Finally, the benefits of using prior knowledge in order to extend the capabilities of reinforcement learning agents are examined. A classifier system-based Q-learning scheme is developed to enable agents to reason using condition-action rules. The utility of this scheme is illustrated in a blocks world planning task. The state of the environment and the set of valid actions are determined at the end of sequences of satisfied condition-action rules.

These methods have been developed to further enable the reinforcement learning paradigm to be used for solving difficult real world problems. They perform in an on-line and incremental manner with low computational cost.

Keywords: Machine Learning, Reinforcement Learning, Neural Networks, Control and Planning, Hierarchical and Modular Architectures, Robot Learning, Artificial Intelligence

Acknowledgements

I am grateful to Steve Waterhouse and Gavin Rummery for numerous discussions during the course of this research. Many people in the Speech, Vision and Robotics Group at the Department of Engineering, especially Tony Robinson, Andrew Senior, Mahesan Niranjan, Tim Jervis and Chris Dance, have contributed to my understanding of the field. I was also fortunate to have discussions with Chris Watkins and Richard Sutton when they were in Cambridge. I would like to thank my supervisor Dr Richard Prager for keeping my work on course during the past three years. I am also grateful to the National University of Singapore for financial support. In addition, the Department of Engineering and Jesus College have been generous in awarding conference grants.

This dissertation is dedicated to my wife Jasmine, my parents and my good friend Eleana Yalouri.

Declaration

I declare that this dissertation is the result of my own original work. Where my research has drawn on the work of others, this is acknowledged at the appropriate points in the text. This dissertation has not been submitted in whole or part for a degree at any other institution.

Chen-Khong Tham
Cambridge
October 1994

Contents

1 Introduction
    Agent-environment interaction
    The Artificial Intelligence (AI) approach
    The control engineering approach
    Learning systems
    Supervised Learning
    Reinforcement Learning
    Objectives of this work
    Summary

2 Reinforcement Learning
    Basic concepts
    Markov Decision Processes (MDP)
    Stochastic Dynamic Programming (DP)
    Policy Iteration
    Value Iteration
    Solving sequential decision tasks
    The Temporal Differences (TD) method
    The TD(λ) algorithm
    Predicting return with TD(λ)
    Convergence of TD algorithms
    Q-Learning
    The Q-learning algorithm
    Convergence of the Q-learning algorithm
    Combining Q-learning and TD(λ)
    Learning automata
    Associative stochastic learning automata (ASLA)
    Stochastic hill-climbing algorithms
    Real-valued actions and multi-parameter distributions
    Stochastic Real Valued (SRV) units
    Actor-critic learning systems
    Exploration and action selection
    Summary

3 Function Approximators for Reinforcement Learning
    The need for function approximation
    Function approximation: theoretical issues
    Batch vs on-line learning
    A comparison
    Stochastic approximation methods
    Adding momentum and using sub-sets of data
    Momentum and TD(λ)
    Multi-Layer Perceptron (MLP) networks
    Radial Basis Function (RBF) networks
    Resource Allocating Networks (RAN)
    Advantages of RBF networks
    Disadvantages of RBF networks

    3.6 Cerebellar Model Articulation Controller (CMAC)
    Overview
    How a CMAC network operates
    Mappings within a CMAC network
    Training procedure for a CMAC network
    Storage requirements
    Hashing
    Advantages of CMAC networks
    Disadvantages of CMAC networks
    Comparison of function approximators in a supervised learning task
    Performance of MLP networks
    Performance of GaRBF-RAN networks
    Performance of CMAC networks
    Comparison of learnt functions
    Summary and conclusion

4 Manipulator Control using Reinforcement Learning
    An actor-critic learning system using CMAC networks
    Prediction element
    Performance element
    A Q-learning system using CMAC networks
    Robot control
    Robot Learning
    Dynamical simulation of a two-linked manipulator
    Learning manipulator control and obstacle avoidance
    Details of experiments
    Implementation details
    Reinforcement schedule
    Results
    Discussion
    Advantages of using CMAC networks
    Q-learning and function approximation
    Implementation on real robots
    Summary

5 Modular Function Approximation
    Hierarchical Mixtures of Experts (HME)
    Description of the HME architecture
    A probability model and posterior probabilities
    Likelihood and gradient ascent
    The Expectation Maximization (EM) algorithm
    Basic concepts
    Applying EM to the HME architecture
    Incremental EM algorithms
    Optimization methods for the M phase
    First order methods
    Second order methods
    A composite linear regression problem
    First order methods
    Second order methods
    Summary

6 A Hierarchical CMAC Architecture
    Requirements in reinforcement learning
    HME-CMAC architecture
    Context-dependent function approximation
    A composite non-linear regression problem
    On-line mode using the on-line GEM algorithm
    Increasing the learning rate
    On-line learning with Recursive Least Squares (RLS)
    Summary and conclusion

7 Hierarchical and Modular Reinforcement Learning
    Motivation for hierarchical and modular approaches
    Extended Compositional Q-Learning (CQ-L)
    Elemental and composite tasks
    Extended CQ-L architecture
    Manipulator task decomposition using CQ-L
    Agent-environment interaction
    Tasks to be performed
    Implementing CQ-L with HME-CMAC
    Experiments and results
    Two phase training
    On-line GEM and three elemental tasks
    On-line GEM and single phase training
    Six Q-modules in CQ-L
    Twelve composite tasks
    Noise in sensing
    Discussion
    On-line GEM algorithm and reinforcement learning
    TD(λ) in HME-CMAC and CQ-L
    Advantages of the CQ-L approach
    Disadvantages of the CQ-L approach
    HME-CMAC for context-dependent learning
    Other hierarchical and modular approaches
    The subsumption architecture
    Learning high-level skills by Q-learning
    Feudal reinforcement learning
    HDG Learning
    Multiple-agent architectures
    Summary

8 Incorporation of Prior Knowledge
    Methods to incorporate prior knowledge
    Feature-based state representation
    Initialization
    Cooperating policies
    Competing policies
    Embedded knowledge
    Models of the environment
    Classifier system-based Q-learning
    Combining AI techniques with reinforcement learning
    Classifier systems
    Modifications for Q-learning
    Blocks world planning task
    Experiments and results
    Related work
    Discussion
    Summary

Conclusions and Future Research
    Contributions in this dissertation
    Concluding remarks
    Directions for future research
    Summary

A Symbols in CMAC Framework

B Dynamical Model of Manipulator
    B.1 Lagrangian approach
    B.2 Model of robot with two joints
    B.3 Equations of motion

C Incremental EM Algorithms
    C.1 Standard EM iteration
    C.2 Incremental version of standard EM
    C.3 Incremental version with decay
    C.4 On-line GEM algorithm

D Derivatives of Likelihood Terms
    D.1 Expert networks
    D.2 Top level gating network
    D.3 Second level gating networks
    D.4 Intermediate derivatives

E Q-values of Elemental and Composite Tasks
    E.1 Preliminaries
    E.2 Assumptions
    E.3 Elemental tasks
    E.4 Composite tasks
    E.5 Result

F Parameter Values in Experiments
    F.1 Function approximators
    F.2 Reinforcement learning for manipulator control
    F.3 HME architecture
    F.4 HME-CMAC architecture
    F.5 CQ-L architecture for manipulator task decomposition
    F.6 CS-QL architecture

Bibliography

List of Figures

    1.1 Interaction between an agent and the environment.
    Learning systems as a bridge between AI and control engineering approaches.
    Interaction between a reinforcement learning agent and the environment.
    Context-dependent reinforcement learning.
    Interaction between a Q-learning system and the environment.
    An associative stochastic learning automata (ASLA) unit.
    Interaction between an actor-critic learning system and the environment.
    A multi-layer perceptron (MLP) network.
    A Gaussian radial basis function, implemented as a Resource Allocating Network (GaRBF-RAN).
    Example of CMAC usage with a two-dimensional input space.
    Non-linear and linear mappings within a CMAC.
    Root Mean Square Error (RMSE) of MLP networks on the test set.
    Root Mean Square Error (RMSE) of GaRBF-RAN networks on the test set.
    Root Mean Square Error (RMSE) of CMAC networks on the test set with and without added noise.
    Mesh-plots of functions learnt by different function approximators.
    Robot manipulator with obstacles in the workspace and system block diagram showing the interaction between learning agent and environment.
    Learning curves for different reinforcement learning algorithms in the manipulator control task when torque commands were generated.
    Learning curves for different reinforcement learning algorithms in the manipulator control task when position change commands were generated.
    Trajectories followed by the manipulator from different start positions to the destination.
    Mesh-plots showing the evaluation function and real-valued policy learnt.
    Hierarchical Mixtures of Experts architecture.
    Relationship between gradient ascent, GEM and EM algorithms.
    The on-line GEM algorithm.
    Training (x) and test (+) data for the composite linear regression problem.
    Learning curves for first order batch algorithms with the HME architecture.
    Learning curves for the on-line GEM algorithm with the HME architecture.
    Root Mean Squared Error (RMSE) when maxm is determined according to the stage of training and posterior probabilities.
    Learning curves for second order algorithms with the HME architecture.
    Context-dependent learning using the HME-CMAC architecture.
    Learning curves for the on-line GEM algorithm with the HME-CMAC architecture.
    Outputs of the gating network in the HME-CMAC architecture.
    Output of expert networks in the HME-CMAC architecture.
    Multiple M steps vs one M step and higher learning rates with the HME-CMAC architecture.
    The CQ-L architecture for an agent with three Q-modules and two actuators.

    7.2 Robot manipulator with obstacles in the workspace and three destinations.
    Interaction between an agent with the CQ-L architecture and the environment.
    Learning curves for two phase training with the CQ-L architecture.
    Variation of gating module outputs for two phase training.
    Variation of average number of steps per trial for two phase training.
    Mesh-plots showing the variation of Q-values over the range of manipulator movement for the three Q-modules.
    Trajectories followed by the manipulator for the elemental and composite tasks.
    Learning curves for three elemental tasks with different numbers of M steps.
    Variation of gating module outputs for three elemental tasks with different numbers of M steps.
    Learning curves for single phase training with different numbers of M steps.
    Variation of gating module outputs for single phase training with different numbers of M steps.
    The effect of noise on task decomposition.
    The effect of using momentum in the CQ-L architecture.
    Several ways to incorporate prior knowledge.
    Classifier system-based Q-learning (CS-QL) architecture.
    An agent-environment interaction cycle under the CS-QL scheme.
    Condition-action rule firing sequences and Q-classifiers.
    Example of a blocks world planning task with the optimal sequence of actions from a start to a goal configuration.
    The meanings of bits in the condition and action parts of classifiers in the CS-QL architecture.
    Precondition-add-delete lists in the CS-QL architecture.
    The resulting Q-classifiers for performing the test task.
    Learning curves for the blocks world planning task using the CS-QL architecture.
    B.1 The real and simulated multi-linked manipulator.

List of Tables

    3.1 Performance of MLP networks on the test set.
    Performance of GaRBF-RAN networks on the test set.
    Performance of CMAC networks at two resolutions on the test set.
    Performance of different reinforcement learning algorithms when torque commands were generated.
    Performance of different reinforcement learning algorithms when position change commands were generated.
    Parameters, output values, target values, internal weights and observation weights in the IRLS algorithm for expert and gating networks.
    Performance of first order batch algorithms with the HME architecture.
    Performance of the on-line GEM algorithm with the HME architecture.
    Performance of the on-line GEM algorithm with the HME architecture when maxm is variable.
    Performance of second order algorithms with the HME architecture.
    Performance of the on-line GEM algorithm with the HME-CMAC architecture.
    Performance of the on-line GEM algorithm with the HME-CMAC architecture using one M step at different learning rates.
    Elemental and composite tasks.
    Performance of the CQ-L approach under different training conditions.
    Performance of the CQ-L approach for three elemental tasks with different numbers of M steps.
    Performance of the CQ-L approach for single phase training with different numbers of M steps.

Chapter 1

Introduction

This dissertation addresses learning methods for solving the general problem of an autonomous agent interacting with an unknown and uncertain environment [1] in order to achieve certain goals, as shown in Figure 1.1. A good example of agent-environment interaction is that of a robot operating in its workspace, but more generally, tasks such as game-playing (Tesauro 1991; Schraudolph et al. 1994), network routing, job scheduling and pattern discrimination (Narendra and Thathachar 1989) also fall within this framework.

[1] In control engineering literature, `agent' and `environment' are referred to as controller and plant or process, respectively.

1.1 Agent-environment interaction

[Figure 1.1: Interaction between an agent and the environment. Diagram labels: Task/Goal command, Context/State information, Environment, Agent, Action/Control input.]

Both the environment and the agent can be characterized in many different ways. The state of the environment describes fully the situation of the environment at any given time. The environment can be deterministic, where an action at a particular state will always produce the same outcome, i.e. remaining in the same state or moving to some other fixed state, or stochastic, where the same action at a particular state can lead to several possible outcomes and transitions to different states. The state of the environment is often difficult to determine. It may be only partially observable with some relevant aspects hidden from view. Furthermore, noise in sensors means that only an estimate of the true state can be known.

The agent can use a variety of methods to determine actions which bring about the outcome in the environment that satisfies the task or goal command. Different amounts of prior knowledge may be given to the agent to enable it to achieve this objective. The agent can be given a set of condition-action rules which specify actions to be taken in all possible situations. In this case, the behaviour of the agent is completely defined by a human designer. On the other hand, an agent which learns is able to improve its performance over time. Terms such as adaptive and self-organizing are also used to describe this class of agents. The agent can be given examples of how it should act in several different situations and then be expected to generalize between them in order to determine the appropriate action in novel situations. In a more difficult case, the agent merely receives a reward signal when the goal is achieved at the end of a sequence of actions, or a penalty when an undesirable outcome occurs. These paradigms are referred to as supervised learning and reinforcement learning, respectively.

The agent can construct a world model from its experiences and use it to predict the consequences of actions without actually acting in the environment. This is useful when interaction with the environment is costly. When a complete world model is available, a plan can be made according to criteria such as the shortest, quickest, or safest path to the goal.

1.2 The Artificial Intelligence (AI) approach

The traditional approach taken by Artificial Intelligence (AI) researchers to enable an autonomous agent to interact with its environment has been to provide it with a huge amount of relevant knowledge in the form of rules or frames in a knowledge base. This knowledge permits the agent to reason about its current situation and the state of the environment, and formulate a plan for achieving its goals. An exhaustive search in the knowledge base for a solution which satisfies various constraints is usually involved, although efficient techniques such as best-first and A* search can be used. Planning requires the recursive evaluation of various courses of action, each with its own set of consequences, and is computationally intensive. For example, when precondition-add-delete lists are used for planning, only actions whose preconditions are satisfied by the current state description are treated as candidate actions. When each of the candidate actions is executed, new conditions become true and are added to the state description, while those which are no longer true are deleted.
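The precondition-add-delete mechanism described above can be illustrated with a minimal sketch. It assumes that a state description is a set of condition strings; the `Action` type, the two blocks-world actions and the condition names are hypothetical and chosen only for illustration, not taken from this dissertation.

```python
from typing import FrozenSet, List, NamedTuple


class Action(NamedTuple):
    """Hypothetical action with precondition, add and delete lists."""
    name: str
    preconditions: FrozenSet[str]
    add_list: FrozenSet[str]
    delete_list: FrozenSet[str]


def candidate_actions(state: FrozenSet[str], actions: List[Action]) -> List[Action]:
    # Only actions whose preconditions all hold in the current state qualify.
    return [a for a in actions if a.preconditions <= state]


def apply_action(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    # Remove conditions that are no longer true, then add the new ones.
    return (state - action.delete_list) | action.add_list


# Illustrative two-block world (assumed for this example only).
state = frozenset({"on(A,table)", "on(B,table)", "clear(A)", "clear(B)"})
stack_a_on_b = Action(
    name="stack(A,B)",
    preconditions=frozenset({"clear(A)", "clear(B)", "on(A,table)"}),
    add_list=frozenset({"on(A,B)"}),
    delete_list=frozenset({"on(A,table)", "clear(B)"}),
)
unstack_a_from_b = Action(
    name="unstack(A,B)",
    preconditions=frozenset({"on(A,B)", "clear(A)"}),
    add_list=frozenset({"on(A,table)", "clear(B)"}),
    delete_list=frozenset({"on(A,B)"}),
)
actions = [stack_a_on_b, unstack_a_from_b]

candidates = candidate_actions(state, actions)
next_state = apply_action(state, candidates[0])
```

A planner would call `candidate_actions` and `apply_action` repeatedly while searching over courses of action; the search strategy itself is left out of this sketch.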
This process is repeated until some terminating condition is reached, after which another course of action is tried.

While the AI approach is useful in environments where complex relationships exist between objects and actions, it requires considerable human design effort and almost complete knowledge of the world in which the agent operates. In addition, many AI approaches assume that the environment is deterministic and fully predictable. Real world situations are often fraught with uncertainty and probabilistic reasoning systems such as Bayesian belief networks (Spiegelhalter et al. 1993) are required.

1.3 The control engineering approach

The field of control engineering involves very precise methods for effecting a change in the environment, typically physical systems, in order to bring about the desired outcome. There are two main classes of problems: (1) the regulation problem, where a fixed operating point has to be maintained in the presence of external disturbances, and (2) the tracking problem, where a desired trajectory has to be followed. When controlling such systems, issues such as stability, fast response times and robustness in the presence of noise are of paramount importance. When a model of the process or plant is required, system identification procedures (Söderström and Stoica 1989) can be used. Specifically, real-time recursive identification techniques (Ljung and Söderström 1983) enable variations in the process to be tracked by adapting the parameters of these models on-line. Closely related to these are adaptive control techniques (Åström and Wittenmark 1989) which allow the parameters of the controller to adapt to changes in the process. While these techniques are rigorous, their application has been largely restricted to the control of physical systems in the manner described above. In regulation and tracking problems, the set point and desired trajectory, respectively, are pre-determined by a human designer.
As in the case of AI techniques, considerable design effort is required in order to specify the desired behaviour for these systems.

1.4 Learning systems

Learning systems are characterized by their ability to improve their performance over time. A learning system, especially one performing reinforcement learning, can be regarded as a bridge between the AI and control engineering approaches discussed above (see Figure 1.2). Fu (1970)

provided a comprehensive overview of learning control systems. He described ways in which conventional control schemes can be enhanced with methods from fields such as pattern classification, reinforcement learning, Bayesian estimation, stochastic approximation and stochastic automata models. More recently, Saridis and Valavanis (1988) presented an analytical formulation for the design of 'intelligent machines' which consisted of three components hierarchically ordered according to the principle of 'increasing precision with decreasing intelligence'. The three components are: (1) the organizational level, performing general information processing tasks requiring a long-term memory, (2) the coordination level, dealing with specific information processing tasks with a short-term memory, and (3) the control level, which involves the execution of tasks through hardware using feedback control methods.

[Figure 1.2: Learning systems as a bridge between AI and control engineering approaches. Diagram labels: AI, Reinforcement Learning, Control; increasing intelligence/autonomy; increasing precision.]

The incorporation of the ability to learn reduces to a large extent the design effort required for realizing autonomous agents. The resulting agents are also more flexible and robust as they can adapt to changing situations. The supervised and reinforcement learning paradigms will now be discussed in greater detail.

1.4.1 Supervised Learning

In supervised learning, the main task facing the learner is to learn a mapping from input patterns to target output values. These target values are assumed to be supplied to the learner by a 'teacher'. When an input pattern is presented to the learner, an output value is produced. The error, i.e. the difference between the target and actual output values, can be used to improve the performance of the learner. It is not sufficient to merely 'memorize' what the desired output values should be for a given input pattern since the data may be corrupted by noise.
The learner is required to generalize from input-output pairs which have been encountered before in order to predict the output values for unseen but similar input patterns. This involves finding a model which fits the data and is optimal in some sense, e.g. least mean squared error between target outputs and actual outputs. Supervised learning techniques can be applied to the training of function approximators, which are parametrized models performing the mapping from input patterns to output values. Examples of function approximators are multi-layer perceptron (MLP), radial basis function (RBF) and Cerebellar Model Articulation Controller (CMAC) networks. These networks will be described in Chapter 3.

1.4.2 Reinforcement Learning

Reinforcement learning problems typically involve control, where actions which affect the environment are generated by the learning agent. A signal in the form of reinforcement or payoff evaluates the agent's actions and is provided to the agent by the environment. This signal simply indicates whether a favourable outcome has been achieved or otherwise, and does not indicate what the correct action is or how far the current action is from the correct one. The agent's objective is to perform actions so as to maximize the cumulative payoff it receives over time from the environment. Agent-environment interaction in the case of reinforcement learning is shown in Figure 1.3. The key advantage of the reinforcement learning paradigm is that, unlike supervised learning, a 'teacher' does not have to be present to provide a target output value, i.e. the 'correct' action in this case, for every input pattern. A further difficulty that the paradigm copes with is that the reinforcement signal cannot be used directly to derive an error signal which can be used for improving the agent's performance.
Hence, learning usually involves performing actions in a trial-and-error manner, correlating outcomes with actions, and increasing the probability of performing actions which bring about favourable outcomes.
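The trial-and-error process described above can be illustrated with a minimal sketch. It is not a method taken from this dissertation: the two-action environment, its success probabilities, the step size `alpha` and the function name `run_trials` are all assumptions made for illustration, and the update is a linear reward-inaction style rule of the kind studied in stochastic learning automata.

```python
import random

def run_trials(success_prob, steps=5000, alpha=0.01, seed=0):
    """Trial-and-error learning with a linear reward-inaction style update.

    success_prob[a] is the (hypothetical) chance that action a yields payoff 1.
    """
    rng = random.Random(seed)
    n = len(success_prob)
    probs = [1.0 / n] * n  # action probabilities, initially uniform
    for _ in range(steps):
        # Perform an action by sampling from the current probabilities.
        action = rng.choices(range(n), weights=probs)[0]
        # The environment returns only a payoff (1 = favourable, 0 = not);
        # it never reveals which action would have been correct.
        payoff = 1 if rng.random() < success_prob[action] else 0
        if payoff == 1:
            # Favourable outcome: shift probability mass towards this action.
            probs = [p + alpha * (1.0 - p) if i == action else (1.0 - alpha) * p
                     for i, p in enumerate(probs)]
        # Unfavourable outcome: leave the probabilities unchanged.
    return probs

probs = run_trials([0.2, 0.8])
```

With these assumed settings the probability of choosing the second action should grow over the run, since it is rewarded more often. No target action is ever supplied to the learner.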

[Figure 1.3: Interaction between a reinforcement learning agent and the environment. Diagram labels: Environment, Agent, disturbances, context/state information, payoff/reinforcement, actions.]

Most reinforcement learning procedures require the learning of: (1) an evaluation function which predicts the expected sum of payoff, and (2) a policy which specifies the action to be taken in each state. These quantities are commonly stored in look-up tables.

1.5 Objectives of this work

In this section, the objectives of the work described in this dissertation are presented together with an overview of the contents of each chapter.

Reinforcement learning for autonomous agents

The reinforcement learning paradigm provides a method for realizing autonomous systems which can learn to perform tasks with minimal human supervision and design effort in a wide range of environments. The main concepts in reinforcement learning will be reviewed in a comprehensive survey of the field in Chapter 2. This work deals with techniques for scaling up reinforcement learning to handle real-world problems with large state and action spaces. Ideally, these techniques should work in an on-line manner (see footnote 2) so that the agent can improve its performance as it interacts continuously with the environment. In order to be suitable for implementation on truly autonomous systems, these techniques must also perform well without requiring enormous amounts of computation and storage. In addition, this dissertation addresses two important ways to extend the capabilities of reinforcement learning agents: (1) hierarchical and modular learning, and (2) incorporation of prior knowledge.

Footnote 2: In this dissertation, the term on-line learning refers to learning which takes place on the basis that training data is observed only once by the agent (Sutton and Whitehead 1993), as in the reinforcement learning problems considered in Chapters 4 and 7. The parameters in the function approximator are updated after each observation. In the supervised learning problems considered in Chapters 3, 5 and 6, the same data in the training set is seen in each epoch. Thus, the term 'on-line mode' in these chapters refers to the application of an on-line learning method to an off-line supervised learning task.

Barto (1993) lists several open areas of research in reinforcement learning:
1. using compact representations of evaluation functions, i.e. not look-up tables
2. dealing with incomplete state information and non-Markovian situations
3. performing exploration effectively
4. incorporating prior knowledge
5. using modular and hierarchical architectures
6. integration with other problem solving and planning methods

This dissertation is focussed towards points 1, 4, 5 and 6.

Scaling up with function approximation

Until recently, reinforcement learning has only been applied to small problems with several hundred states and a few discrete actions in each state. To achieve the objective of scaling up to problems with large state and action spaces, the evaluation function and policy of a reinforcement learning agent can be stored in function approximators instead of look-up tables. Supervised learning then becomes a sub-problem of reinforcement learning. Among others, Tesauro (1991), Lin (1993b), Tham and Prager (1994), and Rummery and Niranjan (1994) have shown that the combination

of reinforcement learning and function approximation can be successfully applied for solving large problems. However, a drawback of function approximators is that they usually require a long training process involving repeated passes through training data before the input-output mapping is learnt accurately. This may limit the usefulness of function approximators for on-line reinforcement learning. For example, Lin (1993b) described an 'experience replay' algorithm in which experiences during a trial involving agent-environment interaction were recorded so that they could be replayed to train several MLP networks. In Chapter 3, several function approximators commonly used in reinforcement learning applications will be described. The performance of these function approximators in terms of on-line learning speed, accuracy, computational cost and storage requirements is compared. The Cerebellar Model Articulation Controller (CMAC) (Albus 1975) network emerged as a non-linear function approximator well-suited for reinforcement learning and shall be used extensively in later parts of this dissertation.

Manipulator control using reinforcement learning

Using a fast function approximator which can perform incremental learning should enable reinforcement learning to be used in an on-line manner to solve problems with large state and action spaces. Learning systems which employ different reinforcement learning algorithms integrated with CMAC networks are developed in Chapter 4. These systems are then tested on a multi-linked manipulator control and obstacle avoidance task, which has approximately 600,000 distinguishable states and either real-valued actions or 11 discrete actions. The performance of these learning systems in terms of the quality of solutions, amount of training required, computational cost and storage requirements is compared.
Hierarchical and modular reinforcement learning

So far, only single reinforcement learning tasks which require monolithic function approximators have been considered. This was the view presented in Figure 1.3. A more useful approach is to have task-dependent or context-dependent reinforcement learning according to the scheme shown in Figure 1.4.

[Figure 1.4: Context-dependent reinforcement learning. Diagram labels: task command, context information, context-dependent switch, action, Skill 1, Skill 2, ..., Skill n, each skill receiving detailed state information.]

This can be viewed as a hierarchical and modular approach to reinforcement learning. The most important benefits from using a hierarchical and modular approach are
1. transfer of learning from basic or elemental skills in order to solve more complex tasks, e.g. composite tasks which involve several elemental skills executed sequentially, and
2. reduction in the temporal and spatial resolution at the higher levels of the hierarchy, leading to a smaller search space and faster re-planning when the goal changes.

There are many schemes for performing hierarchical and modular reinforcement learning. In this dissertation, I shall focus on the Compositional Q-Learning (CQ-L) framework proposed by Singh (1992b) which requires hierarchical and modular function approximation.

Hierarchical and modular function approximation

The Hierarchical Mixtures of Experts (HME) architecture (Jordan and Jacobs 1993) is a modular approach to supervised learning. It consists of gating networks which mix the outputs from expert networks in order to produce the final output value. Essentially, it is a divide-and-conquer approach to supervised learning where different regions in input space are allocated to different expert networks. These expert networks can model the data in sub-regions better than a single monolithic network assigned to the entire input space. Fast batch and on-line learning algorithms derived from the Expectation-Maximization (EM) algorithm and second-order methods were proposed for the case where the gating and expert networks contain linear approximators. However, these algorithms are computationally expensive when the number of parameters in the networks is large. In Chapter 5, a new on-line Generalized EM (GEM) algorithm is formulated which gives the benefits of faster learning provided by the EM algorithm, with significantly lower computational and storage costs than the algorithms mentioned above. The performance of these algorithms is compared in a composite linear regression task, according to criteria similar to those used when comparing function approximators above.

Hierarchical CMAC architecture

By incorporating CMAC networks into the HME architecture, non-linear function approximation tasks with large state spaces can be solved with a one-level HME, compared to the case where several levels are required when linear approximators are used in expert networks.
Since the output of a CMAC network is linear in its parameters, the fast batch and on-line learning algorithms proposed for the HME architecture can be used. In particular, the new on-line GEM algorithm will also bring about faster learning and savings in computational and storage costs as in the case of the HME architecture with linear approximators considered above. The hierarchical CMAC architecture will be described in Chapter 6 together with an illustration of its usefulness in a composite non-linear regression problem. This problem can be viewed as a context-dependent function approximation problem.

Extending Compositional Q-Learning

We return to the Compositional Q-Learning (CQ-L) framework mentioned during the discussion on hierarchical and modular reinforcement learning above. The CQ-L framework was designed to facilitate transfer of learning from elemental skills to composite skills. In Chapter 7, two extensions to this framework are proposed to enhance its usefulness for solving composite reinforcement learning tasks. The hierarchical CMAC architecture, incorporating the on-line GEM algorithm, is then used to implement the extended CQ-L framework. The resulting learning system is the main contribution of this dissertation. In order to evaluate its effectiveness in solving composite reinforcement learning tasks with large state and action spaces, the manipulator obstacle avoidance and control problem considered above is re-visited. The agent is now required to learn how to solve up to fifteen different tasks, up to twelve of which are composite tasks.

Incorporation of prior knowledge

Most approaches to reinforcement learning are tabula rasa, i.e. the agent starts off with small random values of the parameters in its evaluation function and policy. However, prior knowledge is often available and can be used to reduce the training time needed before the agent becomes competent. Instead of relying on a random walk, exploration strategies can be specified.
Certain actions which are known to be damaging in particular situations can also be removed from the set of candidate actions. This involves run-time determination of the set of legal actions. Different ways of incorporating prior knowledge are reviewed in Chapter 8. In particular, the use of condition-action rules to perform reasoning is considered. A classifier system (Holland 1986)

based reinforcement learning system is developed and its usefulness is demonstrated in a blocks world planning task.

1.6 Summary

In this chapter, the issue of agent-environment interaction was discussed. The AI, control engineering and learning approaches for the control of autonomous agents were compared, with the conclusion that permitting agents to learn reduces human design effort while producing more flexible and robust agents. An introduction to the supervised and reinforcement learning paradigms was given, followed by a detailed account of the objectives of this work and an overview of this dissertation.

Chapter 2 Reinforcement Learning

"Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of machine learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation, and through that, all subsequent rewards. These two characteristics - trial-and-error search and delayed reward - are the two most distinguishing features of reinforcement learning."
- R.S. Sutton, ed., Machine Learning: Special Issue on Reinforcement Learning, May 1992

In this chapter, the main concepts in reinforcement learning will be reviewed. The ideas and algorithms examined here provide the foundation for the work described in subsequent parts of this dissertation. First, the reinforcement learning paradigm is described. Then, the mathematical formalism of Markovian decision processes and principles of stochastic dynamic programming are presented. Two well-known algorithms in reinforcement learning, the TD($\lambda$) algorithm and the Q-learning algorithm, are described, followed by a discussion on associative stochastic learning automata and actor-critic learning systems.

2.1 Basic concepts

Reinforcement learning methods are characterized by a reinforcement signal that evaluates the performance of the learning agent with respect to a given set of goals. Typically, a positive reward is given for an action which brought about a desirable outcome, and a negative penalty is imposed for an action which caused an unwanted consequence. These methods differ from supervised learning methods since the reinforcement signal does not provide the agent with the correct answer at each step; instead, it only indicates how favourable the outcome of a sequence of actions is.
Thus, reinforcement learning methods are able to overcome one of the main limitations of supervised learning: the requirement of a 'teacher'. In addition, the reinforcement signal does not contain gradient or directional information. It does not indicate whether improvement is possible or how, i.e. by how much and in which direction, the behaviour should be changed for improvement. The agent has to infer this directional information from a collection of reinforcement signals received over time. The reinforcement signal can be immediate, evaluating the most recent action performed by the agent. In the more challenging case of delayed reinforcement, the reinforcement signal for a particular action arrives long after the action has been taken, following further cycles of agent-environment interaction. The agent then has the task of relating this reinforcement signal to an action which was taken some time in the past - this is known as the temporal credit assignment problem (see footnote 1).

Footnote 1: In contrast, the structural credit assignment problem deals with the apportionment of credit to the part(s) of the system responsible for a particular decision or action.

The reinforcement learning paradigm has its origins in the theory of stochastic learning automata (Narendra and Thathachar 1974) (see Section 2.6) which deals with the selection of

actions in unknown stochastic environments in order to minimize penalties received. This earlier work was extended in two ways: (1) to the associative case (Barto and Anandan 1985; Williams 1988), and (2) to the delayed reinforcement case, mentioned above. In the case of associative reinforcement learning, the agent receives context or state information from the environment. Therefore, different actions can be generated in different situations. In this dissertation, only associative reinforcement learning tasks will be considered. Initially, the agent performs exploration by trying different actions randomly in order to discover their utilities. As learning progresses, it encounters a conflict between performing: (1) actions which enable it to learn more about the environment and potentially take better actions in the future, but which may have undesirable short-term consequences, and (2) actions that lead to high payoff based on the knowledge it currently has. This is commonly referred to as the exploration vs exploitation trade-off. Thrun (1992) suggested several directed exploration methods to minimize the costs of learning (see Section 2.9).

2.2 Markov Decision Processes (MDP)

Mathematically, a reinforcement learning agent interacting with the environment can be considered as undergoing a Markov decision process with four essential components:
1. states $x \in S$, where $S$ is the state space
2. actions $a \in A(x)$, i.e. the set of possible actions may be different in different states
3. state transition function $T(x, a)$, with state transition probabilities $P_{xy}(a) = \Pr(T(x, a) = y)$, where $y$ is the state reached from state $x$ when action $a$ is taken
4. reward function $R(x, a)$ which gives a reward when action $a$ is taken in state $x$.

The state $x$ contains a complete description of the condition of the system which, together with future actions, determines all aspects of the future behaviour of the system.
This is the Markov property: once the state is known, there is no need to have information about the history of the system, i.e. previous states, actions and rewards, in order to make a decision about what action to take. To simplify analysis, a finite and discrete-time dynamical system is considered. This means that $S$ is a finite set of states and $A(x)$ is a finite set of actions. The reward function $R(x, a)$ may be stochastic, with actual rewards $r$ coming from a probability distribution determined by $x$ and $a$. It is sufficient to consider the expected reward, written as $\rho(x, a) = E[R(x, a)]$ for fixed $x$ and $a$.

A policy $\pi$ specifies the action $a$ to be performed in each state $x$, i.e. $a = \pi(x)$. A stationary policy specifies the same action each time a particular state is entered. On the other hand, a stochastic policy specifies an action chosen from a fixed probability distribution over actions in $A(x)$. In a reinforcement learning problem with delayed reinforcement, the aim of the agent is to perform actions that lead to maximum cumulative reward over time. It is not enough to simply maximize the immediate reward which it receives. Although the total or average reward, e.g. Schwartz (1993), received over time can be used as a measure of cumulative reward, it is more common to use the sum of discounted rewards, referred to as the return, which, from time $t$, is given by:

\[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^n r_{t+n} + \dots \]

The term $r_t$ is the reward received at time $t$ and $\gamma$ is the discount factor, with $0 \le \gamma \le 1$. If the number of time steps of operation, i.e. the horizon, is infinite, the return with $\gamma < 1$ is still a finite quantity. The discount factor adjusts the degree to which long-term consequences of actions must be accounted for. In a delayed reinforcement task, $r_t$ may depend on any of $a_t$, $a_{t-1}$, $a_{t-2}$, $\dots$, where $a_t$ is the action taken at time $t$.
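As a small worked illustration of the return defined above, the sketch below sums discounted rewards for a finite reward sequence. The function name `discounted_return` and the example reward sequences are assumptions chosen for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of discounted rewards: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    total = 0.0
    # Accumulate from the last reward backwards, so that each earlier step
    # adds its own reward to the discounted value of everything after it.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# A delayed reward: nothing for two steps, then a payoff of 1.
delayed = discounted_return([0.0, 0.0, 1.0], gamma=0.9)

# A long run of constant reward 1 approaches 1 / (1 - gamma), which shows
# why the return stays finite for gamma < 1 even over a very long horizon.
long_run = discounted_return([1.0] * 1000, gamma=0.9)
```

Here `delayed` is approximately 0.81, i.e. the single payoff discounted twice, and `long_run` is close to 10.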

2.3 Stochastic Dynamic Programming (DP)

Dynamic programming (Bertsekas 1987) is a method of solving the credit assignment problem in sequential or multi-stage decision processes. Most reinforcement learning algorithms operate by approximating dynamic programming. This enables them to handle delayed reinforcement situations in stochastic environments in a computationally efficient manner. When stochastic factors are involved, the expected return, which is the expected value of the actual return, is considered. As a result of the Markov property, the expected return from state $x$ depends only on $x$ and the policy that will be followed. Define the random variable $R(x, n)$ to be the immediate reward obtained after starting in state $x$ and following policy $\pi$ for $n$ steps. Thus, the expected return from state $x$ when policy $\pi$ is followed is written as

\[ V^{\pi}(x) = E[R(x, 1) + \gamma R(x, 2) + \dots + \gamma^{n-1} R(x, n) + \dots] \tag{2.1} \]

where $V^{\pi}(x)$ is the evaluation function (see footnote 2) for policy $\pi$. It gives an immediately accessible prediction of expected return at state $x$. The evaluation function can be estimated by repeatedly running the process under policy $\pi$ and averaging the discounted sums of rewards that follow. Equation 2.1 can also be written as

\[ V^{\pi}(x) = \rho(x, \pi(x)) + \gamma \sum_{y \in S} P_{xy}(\pi(x)) V^{\pi}(y) \tag{2.2} \]

If the expected rewards $\rho$ and state transition probabilities $P_{xy}$ are known, i.e. a model of the underlying task is available, the evaluation function for policy $\pi$ can be calculated by solving a set of linear equations, one for each state. Usually, we wish to find a policy that maximizes the evaluation function such that

\[ V^{*}(x) = \max_{\pi} V^{\pi}(x) \tag{2.3} \]

for all possible initial states $x$. Such a policy is referred to as an optimal policy, denoted as $\pi^{*}$, and the corresponding evaluation function $V^{*}(x)$ is referred to as the optimal evaluation function. There may be several optimal policies, but all of them give the same unique optimal evaluation function.
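The remark that Equation 2.2 yields one linear equation per state can be made concrete with a minimal sketch. The data layout (`P[a][x][y]` for transition probabilities, `rho[x][a]` for expected rewards, `policy[x]` for the chosen action), the helper names and the two-state example are all assumptions introduced here for illustration; a plain Gaussian elimination stands in for whatever linear solver would be used in practice.

```python
def solve_linear(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(M[i][j] * v[j] for j in range(i + 1, n))
        v[i] = (M[i][n] - s) / M[i][i]
    return v

def evaluate_policy(P, rho, policy, gamma):
    """Evaluation function of a fixed policy, from the model (Equation 2.2)."""
    n = len(policy)
    # Row x encodes: V(x) - gamma * sum_y P_xy(pi(x)) V(y) = rho(x, pi(x)).
    A = [[(1.0 if x == y else 0.0) - gamma * P[policy[x]][x][y]
          for y in range(n)] for x in range(n)]
    b = [rho[x][policy[x]] for x in range(n)]
    return solve_linear(A, b)

# Hypothetical two-state task: action 0 stays put, action 1 switches state;
# any action taken in state 1 earns an expected reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch
rho = [[0.0, 0.0], [1.0, 1.0]]
V = evaluate_policy(P, rho, policy=[1, 0], gamma=0.5)
```

For this assumed task the policy moves to state 1 and stays there, so the values work out to V(1) = 1/(1 - 0.5) = 2 and V(0) = 0.5 * V(1) = 1.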
The Bellman Optimality Equation (Bellman 1957) characterizes the optimal value of a state $x$ in terms of the optimal values of possible successor states $y$:

\[ V^{*}(x) = \max_{a \in A(x)} \Big\{ \rho(x, a) + \gamma \sum_{y \in S} P_{xy}(a) V^{*}(y) \Big\} \tag{2.4} \]

where $V^{*}(x)$ is a unique bounded solution. There are a variety of computational techniques for solving Bellman's equation. Here, policy iteration and value iteration are considered.

Footnote 2: The evaluation function is also referred to as the value function.

2.3.1 Policy Iteration

Consider two policies: $\pi_1$ with evaluation function $V^{\pi_1}$, and $\pi_2$. One way to determine whether $\pi_2$ is uniformly better than $\pi_1$ is to compute $V^{\pi_2}$ and compare it with $V^{\pi_1}$ over the entire state space, but this is computationally wasteful. Assume that policy $\pi_1$ recommends action $a$ and policy $\pi_2$ recommends action $b$ in state $x$. The expected return, starting from state $x$, following policy $\pi_2$ for one step, i.e. taking action $b$, and then following policy $\pi_1$ thereafter is

\[ Q^{\pi_1}(x, b) = \rho(x, b) + \gamma \sum_{y \in S} P_{xy}(b) V^{\pi_1}(y) \]

which is easier to compute than $V^{\pi_2}$. If $Q^{\pi_1}(x, \pi_2(x)) \ge V^{\pi_1}(x)$ for all states $x$, then $\pi_2$ is uniformly as good as or better than $\pi_1$. In general, the quantity $Q^{\pi}(x, a)$ is referred to as the action value of action $a$ in state $x$ under policy $\pi$. The following algorithm will converge to the optimal policy in a finite Markov decision process (Bellman and Dreyfus 1962):
1. arbitrary initial policy
2. Repeat