
University of Cambridge
Department of Engineering

MODULAR ON-LINE FUNCTION APPROXIMATION FOR SCALING UP REINFORCEMENT LEARNING

Chen Khong Tham
Jesus College, Cambridge, England.

October 1994

This dissertation is submitted for consideration for the degree of Doctor of Philosophy at the University of Cambridge

Summary

Reinforcement learning is a powerful learning paradigm for autonomous agents which interact with unknown environments with the objective of maximizing cumulative payoff. Recent research has addressed issues concerning the scaling up of reinforcement learning methods in order to solve problems with large state spaces, composite tasks and tasks involving non-Markovian situations. In this dissertation, I extend existing ways of scaling up reinforcement learning methods and propose several new approaches.

An array of Cerebellar Model Articulation Controller (CMAC) networks is used as fast function approximators so that the evaluation function and policy can be learnt on-line as the agent interacts with the environment. Learning systems which combine reinforcement learning techniques with CMAC networks are developed to solve problems with large state and action spaces. Actions can be either discrete or real-valued. The problem of generating a sequence of torque or position change commands in order to drive a simulated multi-linked manipulator towards desired arm configurations is examined.

A hierarchical and modular function approximation architecture using CMAC networks is then developed, following the Hierarchical Mixtures of Experts framework. The non-linear function approximation ability of CMAC networks enables non-linear functions to be modelled in expert and gating networks, while permitting fast linear learning rules to be used. An on-line gradient ascent learning procedure derived from the Expectation Maximization algorithm is proposed, enabling faster learning to be achieved.

The new architecture can be used to enable reinforcement learning agents to acquire context-dependent evaluation functions and policies. This is demonstrated in an implementation of the Compositional Q-Learning framework in which composite tasks consisting of several elemental tasks are decomposed using reinforcement learning. The framework is extended to the case where rewards can be received in non-terminal states of elemental tasks, and to `vector of actions' situations where the agent produces several coordinated actions in order to achieve a goal. The resulting system is employed to enable the simulated multi-linked manipulator to position its end-effector at several positions in the workspace sequentially.

Finally, the benefits of using prior knowledge in order to extend the capabilities of reinforcement learning agents are examined. A classifier system-based Q-learning scheme is developed to enable agents to reason using condition-action rules. The utility of this scheme is illustrated in a blocks world planning task. The state of the environment and the set of valid actions are determined at the end of sequences of satisfied condition-action rules.

These methods have been developed to further enable the reinforcement learning paradigm to be used for solving difficult real world problems. They perform in an on-line and incremental manner with low computational cost.

Keywords: Machine Learning, Reinforcement Learning, Neural Networks, Control and Planning, Hierarchical and Modular Architectures, Robot Learning, Artificial Intelligence

Acknowledgements

I am grateful to Steve Waterhouse and Gavin Rummery for numerous discussions during the course of this research. Many people in the Speech, Vision and Robotics Group at the Department of Engineering, especially Tony Robinson, Andrew Senior, Mahesan Niranjan, Tim Jervis and Chris Dance, have contributed to my understanding of the field. I was also fortunate to have discussions with Chris Watkins and Richard Sutton when they were in Cambridge.

I would like to thank my supervisor Dr Richard Prager for keeping my work on course during the past three years.

I am also grateful to the National University of Singapore for financial support. In addition, the Department of Engineering and Jesus College have been generous in awarding conference grants.

This dissertation is dedicated to my wife Jasmine, my parents and my good friend Eleana Yalouri.

Declaration

I declare that this dissertation is the result of my own original work. Where my research has drawn on the work of others, this is acknowledged at the appropriate points in the text. This dissertation has not been submitted in whole or part for a degree at any other institution.

Chen-Khong Tham
Cambridge
October 1994

Contents

1 Introduction
Agent-environment interaction
The Artificial Intelligence (AI) approach
The control engineering approach
Learning systems
Supervised Learning
Reinforcement Learning
Objectives of this work
Summary

2 Reinforcement Learning
Basic concepts
Markov Decision Processes (MDP)
Stochastic Dynamic Programming (DP)
Policy Iteration
Value Iteration
Solving sequential decision tasks
The Temporal Differences (TD) method
The TD(λ) algorithm
Predicting return with TD(λ)
Convergence of TD algorithms
Q-Learning
The Q-learning algorithm
Convergence of the Q-learning algorithm
Combining Q-learning and TD(λ)
Learning automata
Associative stochastic learning automata (ASLA)
Stochastic hill-climbing algorithms
Real-valued actions and multi-parameter distributions
Stochastic Real Valued (SRV) units
Actor-critic learning systems
Exploration and action selection
Summary

3 Function Approximators for Reinforcement Learning
The need for function approximation
Function approximation: theoretical issues
Batch vs on-line learning
A comparison
Stochastic approximation methods
Adding momentum and using sub-sets of data
Momentum and TD(λ)
Multi-Layer Perceptron (MLP) networks
Radial Basis Function (RBF) networks
Resource Allocating Networks (RAN)
Advantages of RBF networks
Disadvantages of RBF networks
3.6 Cerebellar Model Articulation Controller (CMAC)
Overview
How a CMAC network operates
Mappings within a CMAC network
Training procedure for a CMAC network
Storage requirements
Hashing
Advantages of CMAC networks
Disadvantages of CMAC networks
Comparison of function approximators in a supervised learning task
Performance of MLP networks
Performance of GaRBF-RAN networks
Performance of CMAC networks
Comparison of learnt functions
Summary and conclusion

4 Manipulator Control using Reinforcement Learning
An actor-critic learning system using CMAC networks
Prediction element
Performance element
A Q-learning system using CMAC networks
Robot control
Robot Learning
Dynamical simulation of a two-linked manipulator
Learning manipulator control and obstacle avoidance
Details of experiments
Implementation details
Reinforcement schedule
Results
Discussion
Advantages of using CMAC networks
Q-learning and function approximation
Implementation on real robots
Summary

5 Modular Function Approximation
Hierarchical Mixtures of Experts (HME)
Description of the HME architecture
A probability model and posterior probabilities
Likelihood and gradient ascent
The Expectation Maximization (EM) algorithm
Basic concepts
Applying EM to the HME architecture
Incremental EM algorithms
Optimization methods for the M phase
First order methods
Second order methods
A composite linear regression problem
First order methods
Second order methods
Summary

6 A Hierarchical CMAC Architecture
Requirements in reinforcement learning
HME-CMAC architecture
Context-dependent function approximation
A composite non-linear regression problem
On-line mode using the on-line GEM algorithm
Increasing the learning rate
On-line learning with Recursive Least Squares (RLS)
Summary and conclusion

7 Hierarchical and Modular Reinforcement Learning
Motivation for hierarchical and modular approaches
Extended Compositional Q-Learning (CQ-L)
Elemental and composite tasks
Extended CQ-L architecture
Manipulator task decomposition using CQ-L
Agent-environment interaction
Tasks to be performed
Implementing CQ-L with HME-CMAC
Experiments and results
Two phase training
On-line GEM and three elemental tasks
On-line GEM and single phase training
Six Q-modules in CQ-L
Twelve composite tasks
Noise in sensing
Discussion
On-line GEM algorithm and reinforcement learning
TD(λ) in HME-CMAC and CQ-L
Advantages of the CQ-L approach
Disadvantages of the CQ-L approach
HME-CMAC for context-dependent learning
Other hierarchical and modular approaches
The subsumption architecture
Learning high-level skills by Q-learning
Feudal reinforcement learning
HDG Learning
Multiple-agent architectures
Summary

8 Incorporation of Prior Knowledge
Methods to incorporate prior knowledge
Feature-based state representation
Initialization
Cooperating policies
Competing policies
Embedded knowledge
Models of the environment
Classifier system-based Q-learning
Combining AI techniques with reinforcement learning
Classifier systems
Modifications for Q-learning
Blocks world planning task
Experiments and results
Related work
Discussion
Summary

9 Conclusions and Future Research
Contributions in this dissertation
Concluding remarks
Directions for future research
Summary

A Symbols in CMAC Framework

B Dynamical Model of Manipulator
B.1 Lagrangian approach
B.2 Model of robot with two joints
B.3 Equations of motion

C Incremental EM Algorithms
C.1 Standard EM iteration
C.2 Incremental version of standard EM
C.3 Incremental version with decay
C.4 On-line GEM algorithm

D Derivatives of Likelihood Terms
D.1 Expert networks
D.2 Top level gating network
D.3 Second level gating networks
D.4 Intermediate derivatives

E Q-values of Elemental and Composite Tasks
E.1 Preliminaries
E.2 Assumptions
E.3 Elemental tasks
E.4 Composite tasks
E.5 Result

F Parameter Values in Experiments
F.1 Function approximators
F.2 Reinforcement learning for manipulator control
F.3 HME architecture
F.4 HME-CMAC architecture
F.5 CQ-L architecture for manipulator task decomposition
F.6 CS-QL architecture

Bibliography

List of Figures

1.1 Interaction between an agent and the environment.
Learning systems as a bridge between AI and control engineering approaches.
Interaction between a reinforcement learning agent and the environment.
Context-dependent reinforcement learning.
Interaction between a Q-learning system and the environment.
An associative stochastic learning automata (ASLA) unit.
Interaction between an actor-critic learning system and the environment.
A multi-layer perceptron (MLP) network.
A Gaussian radial basis function, implemented as a Resource Allocating Network (GaRBF-RAN).
Example of CMAC usage with a two-dimensional input space.
Non-linear and linear mappings within a CMAC.
Root Mean Square Error (RMSE) of MLP networks on the test set.
Root Mean Square Error (RMSE) of GaRBF-RAN networks on the test set.
Root Mean Square Error (RMSE) of CMAC networks on the test set with and without added noise.
Mesh-plots of functions learnt by different function approximators.
Robot manipulator with obstacles in the workspace and system block diagram showing the interaction between learning agent and environment.
Learning curves for different reinforcement learning algorithms in the manipulator control task when torque commands were generated.
Learning curves for different reinforcement learning algorithms in the manipulator control task when position change commands were generated.
Trajectories followed by the manipulator from different start positions to the destination.
Mesh-plots showing the evaluation function and real-valued policy learnt.
Hierarchical Mixtures of Experts architecture.
Relationship between gradient ascent, GEM and EM algorithms.
The on-line GEM algorithm.
Training (x) and test (+) data for the composite linear regression problem.
Learning curves for first order batch algorithms with the HME architecture.
Learning curves for the on-line GEM algorithm with the HME architecture.
Root Mean Squared Error (RMSE) when maxm is determined according to the stage of training and posterior probabilities.
Learning curves for second order algorithms with the HME architecture.
Context-dependent learning using the HME-CMAC architecture.
Learning curves for the on-line GEM algorithm with the HME-CMAC architecture.
Outputs of the gating network in the HME-CMAC architecture.
Output of expert networks in the HME-CMAC architecture.
Multiple M steps vs one M step and higher learning rates with the HME-CMAC architecture.
The CQ-L architecture for an agent with three Q-modules and two actuators.
7.2 Robot manipulator with obstacles in the workspace and three destinations.
Interaction between an agent with the CQ-L architecture and the environment.
Learning curves for two phase training with the CQ-L architecture.
Variation of gating module outputs for two phase training.
Variation of average number of steps per trial for two phase training.
Mesh-plots showing the variation of Q-values over the range of manipulator movement for the three Q-modules.
Trajectories followed by the manipulator for the elemental and composite tasks.
Learning curves for three elemental tasks with different numbers of M steps.
Variation of gating module outputs for three elemental tasks with different numbers of M steps.
Learning curves for single phase training with different numbers of M steps.
Variation of gating module outputs for single phase training with different numbers of M steps.
The effect of noise on task decomposition.
The effect of using momentum in the CQ-L architecture.
Several ways to incorporate prior knowledge.
Classifier system-based Q-learning (CS-QL) architecture.
An agent-environment interaction cycle under the CS-QL scheme.
Condition-action rule firing sequences and Q-classifiers.
Example of a blocks world planning task with the optimal sequence of actions from a start to a goal configuration.
The meanings of bits in the condition and action parts of classifiers in the CS-QL architecture.
Precondition-add-delete lists in the CS-QL architecture.
The resulting Q-classifiers for performing the test task.
Learning curves for the blocks world planning task using the CS-QL architecture.
B.1 The real and simulated multi-linked manipulator.

List of Tables

3.1 Performance of MLP networks on the test set.
Performance of GaRBF-RAN networks on the test set.
Performance of CMAC networks at two resolutions on the test set.
Performance of different reinforcement learning algorithms when torque commands were generated.
Performance of different reinforcement learning algorithms when position change commands were generated.
Parameters, output values, target values, internal weights and observation weights in the IRLS algorithm for expert and gating networks.
Performance of first order batch algorithms with the HME architecture.
Performance of the on-line GEM algorithm with the HME architecture.
Performance of the on-line GEM algorithm with the HME architecture when maxm is variable.
Performance of second order algorithms with the HME architecture.
Performance of the on-line GEM algorithm with the HME-CMAC architecture.
Performance of the on-line GEM algorithm with the HME-CMAC architecture using one M step at different learning rates.
Elemental and composite tasks
Performance of the CQ-L approach under different training conditions.
Performance of the CQ-L approach for three elemental tasks with different numbers of M steps.
Performance of the CQ-L approach for single phase training with different numbers of M steps.

Chapter 1

Introduction

This dissertation addresses learning methods for solving the general problem of an autonomous agent interacting with an unknown and uncertain environment [1] in order to achieve certain goals, as shown in Figure 1.1. A good example of agent-environment interaction is that of a robot operating in its workspace, but more generally, tasks such as game-playing (Tesauro 1991; Schraudolph et al. 1994), network routing, job scheduling and pattern discrimination (Narendra and Thathachar 1989) also fall within this framework.

[1] In control engineering literature, `agent' and `environment' are referred to as controller and plant or process, respectively.

1.1 Agent-environment interaction

[Figure 1.1: Interaction between an agent and the environment. Diagram labels: Task/Goal command; Context/State information; Environment; Agent; Action/Control input.]

Both the environment and the agent can be characterized in many different ways. The state of the environment describes fully the situation of the environment at any given time. The environment can be deterministic, where an action at a particular state will always produce the same outcome, i.e. remaining in the same state or moving to some other fixed state, or stochastic, where the same action at a particular state can lead to several possible outcomes and transitions to different states. The state of the environment is often difficult to determine. It may be only partially observable with some relevant aspects hidden from view. Furthermore, noise in the sensors allows only an estimate of the true state to be known.

The agent can use a variety of methods to determine actions which bring about the outcome in the environment that satisfies the task or goal command. Different amounts of prior knowledge may be given to the agent to enable it to achieve this objective. The agent can be given a set of condition-action rules which specify actions to be taken in all possible situations. In this case, the behaviour of the agent is completely defined by a human designer. On the other hand, an agent which learns is able to improve its performance over time. Terms such as adaptive and self-organizing are also used to describe this class of agents. The agent can be given examples of how it should act in several different situations and then be expected to generalize between them in order to determine the appropriate action in novel situations. In a more difficult case, the agent merely receives a reward signal when the goal is achieved at the end of a sequence of actions, or a penalty when an undesirable outcome occurs. These paradigms are referred to as supervised learning and reinforcement learning, respectively.

The agent can construct a world model from its experiences and use it to predict the consequences of actions without actually acting in the environment. This is useful when interaction with the environment is costly. When a complete world model is available, a plan can be made according to criteria such as the shortest, quickest, or safest path to the goal.

1.2 The Artificial Intelligence (AI) approach

The traditional approach taken by Artificial Intelligence (AI) researchers to enable an autonomous agent to interact with its environment has been to provide it with a huge amount of relevant knowledge in the form of rules or frames in a knowledge base. This knowledge permits the agent to reason about its current situation and the state of the environment, and formulate a plan for achieving its goals. An exhaustive search in the knowledge base for a solution which satisfies various constraints is usually involved, although efficient techniques such as best first and A* search can be used. Planning requires the recursive evaluation of various courses of action, each with its own set of consequences, and is computation intensive. For example, when precondition-add-delete lists are used for planning, only actions whose preconditions are satisfied by the current state description are treated as candidate actions. When each of the candidate actions is executed, new conditions become true and are added to the state description, while those which are no longer true are deleted.
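One expansion step of the precondition-add-delete scheme described above can be sketched as follows. This is a minimal illustration only: the `Operator` type, the two blocks-world operators and the predicate strings are hypothetical names chosen for the example and are not taken from this dissertation.

```python
from typing import FrozenSet, List, NamedTuple


class Operator(NamedTuple):
    """A planning operator described by precondition, add and delete lists."""
    name: str
    preconditions: FrozenSet[str]
    add_list: FrozenSet[str]
    delete_list: FrozenSet[str]


def candidate_actions(state: FrozenSet[str],
                      operators: List[Operator]) -> List[Operator]:
    """Keep only operators whose preconditions all hold in the state."""
    return [op for op in operators if op.preconditions <= state]


def apply_operator(state: FrozenSet[str], op: Operator) -> FrozenSet[str]:
    """Delete conditions that are no longer true, then add the new ones."""
    return (state - op.delete_list) | op.add_list


# Hypothetical two-block world: blocks A and B both start on the table.
state = frozenset({"on(A,table)", "on(B,table)", "clear(A)", "clear(B)"})
operators = [
    Operator("stack(A,B)",
             preconditions=frozenset({"clear(A)", "clear(B)", "on(A,table)"}),
             add_list=frozenset({"on(A,B)"}),
             delete_list=frozenset({"on(A,table)", "clear(B)"})),
    Operator("unstack(A,B)",
             preconditions=frozenset({"on(A,B)", "clear(A)"}),
             add_list=frozenset({"on(A,table)", "clear(B)"}),
             delete_list=frozenset({"on(A,B)"})),
]

candidates = candidate_actions(state, operators)
next_state = apply_operator(state, candidates[0])
```

A planner would call these two functions recursively on each resulting state description, which is where the computational cost noted in the text arises.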
This process is repeated until some terminating condition is reached, after which another course of action is tried.

While the AI approach is useful in environments where complex relationships exist between objects and actions, it requires considerable human design effort and almost complete knowledge of the world in which the agent operates. In addition, many AI approaches assume that the environment is deterministic and fully predictable. Real world situations are often fraught with uncertainty and probabilistic reasoning systems such as Bayesian belief networks (Spiegelhalter et al. 1993) are required.

1.3 The control engineering approach

The field of control engineering involves very precise methods for effecting a change in the environment, typically physical systems, in order to bring about the desired outcome. There are two main classes of problems: (1) the regulation problem, where a fixed operating point has to be maintained in the presence of external disturbances, and (2) the tracking problem, where a desired trajectory has to be followed. When controlling such systems, issues such as stability, fast response times and robustness in the presence of noise are of paramount importance. When a model of the process or plant is required, system identification procedures (Soderstrom and Stoica 1989) can be used. Specifically, real-time recursive identification techniques (Ljung and Soderstrom 1983) enable variations in the process to be tracked by adapting the parameters of these models on-line. Closely related to these are adaptive control techniques (Astrom and Wittenmark 1989) which allow the parameters of the controller to adapt to changes in the process.

While these techniques are rigorous, their application has been largely restricted to the control of physical systems in the manner described above. In regulation and tracking problems, the set point and desired trajectory, respectively, are pre-determined by a human designer.
As in the case of AI techniques, considerable design effort is required in order to specify the desired behaviour for these systems.

1.4 Learning systems

Learning systems are characterized by their ability to improve their performance over time. A learning system, especially one performing reinforcement learning, can be regarded as a bridge between the AI and control engineering approaches discussed above (see Figure 1.2). Fu (1970) provided a comprehensive overview of learning control systems. He described ways in which conventional control schemes can be enhanced with methods from fields such as pattern classification, reinforcement learning, Bayesian estimation, stochastic approximation and stochastic automata models.

More recently, Saridis and Valavanis (1988) presented an analytical formulation for the design of `intelligent machines' which consisted of three components hierarchically ordered according to the principle of `increasing precision with decreasing intelligence'. The three components are: (1) the organizational level, performing general information processing tasks requiring a long-term memory, (2) the coordination level, dealing with specific information processing tasks with a short-term memory, and (3) the control level, which involves the execution of tasks through hardware using feedback control methods.

[Figure 1.2: Learning systems as a bridge between AI and control engineering approaches. Diagram labels: increasing intelligence/autonomy; AI; Reinforcement Learning; Control; increasing precision.]

The incorporation of the ability to learn reduces to a large extent the design effort required for realizing autonomous agents. The resulting agents are also more flexible and robust as they can adapt to changing situations. The supervised and reinforcement learning paradigms will now be discussed in greater detail.

1.4.1 Supervised Learning

In supervised learning, the main task facing the learner is to learn a mapping from input patterns to target output values. These target values are assumed to be supplied to the learner by a `teacher'. When an input pattern is presented to the learner, an output value is produced. The error, i.e. the difference between the target and actual output values, can be used to improve the performance of the learner. It is not sufficient to merely `memorize' what the desired output values should be for a given input pattern since the data may be corrupted by noise.
The learner is required to generalize from input-output pairs which have been encountered before in order to predict the output values for unseen but similar input patterns. This involves finding a model which fits the data and is optimal in some sense, e.g. least mean squared error between target outputs and actual outputs.

Supervised learning techniques can be applied to the training of function approximators, which are parametrized models performing the mapping from input patterns to output values. Examples of function approximators are multi-layer perceptron (MLP), radial basis function (RBF) and Cerebellar Model Articulation Controller (CMAC) networks. These networks will be described in Chapter 3.

1.4.2 Reinforcement Learning

Reinforcement learning problems typically involve control, where actions which affect the environment are generated by the learning agent. A signal in the form of reinforcement, or payoff, evaluates the agent's actions and is provided to the agent by the environment. This signal simply indicates whether a favourable outcome has been achieved or otherwise, and does not indicate what the correct action is or how far the current action is from the correct one. The agent's objective is to perform actions so as to maximize the cumulative payoff it receives over time from the environment. Agent-environment interaction in the case of reinforcement learning is shown in Figure 1.3.

The key advantage of the reinforcement learning paradigm is that, unlike supervised learning, a `teacher' does not have to be present to provide a target output value, i.e. the `correct' action in this case, for every input pattern. A further difficulty that the paradigm copes with is that the reinforcement signal cannot be used directly to derive an error signal for improving the agent's performance. Hence, learning usually involves performing actions in a trial-and-error manner, correlating outcomes with actions, and increasing the probability of performing actions which bring about favourable outcomes.

[Figure 1.3: Interaction between a reinforcement learning agent and the environment. Diagram labels: disturbances; Environment; Context/State information; payoff/reinforcement; actions; Agent.]

Most reinforcement learning procedures require the learning of: (1) an evaluation function which predicts the expected sum of payoff, and (2) a policy which specifies the action to be taken in each state. These quantities are commonly stored in look-up tables.

1.5 Objectives of this work

In this section, the objectives of the work described in this dissertation are presented together with an overview of the contents of each chapter.

Reinforcement learning for autonomous agents

The reinforcement learning paradigm provides a method for realizing autonomous systems which can learn to perform tasks with minimal human supervision and design effort in a wide range of environments. The main concepts in reinforcement learning will be reviewed in a comprehensive survey of the field in Chapter 2.

This work deals with techniques for scaling up reinforcement learning to handle real-world problems with large state and action spaces. Ideally, these techniques should work in an on-line² manner so that the agent can improve its performance as it interacts continuously with the environment. In order to be suitable for implementation on truly autonomous systems, these techniques must also perform well without requiring enormous amounts of computation and storage. In addition, this dissertation addresses two important ways to extend the capabilities of reinforcement learning agents: (1) hierarchical and modular learning, and (2) incorporation of prior knowledge.

² In this dissertation, the term on-line learning refers to learning which takes place on the basis that training data is observed only once by the agent (Sutton and Whitehead 1993), as in the reinforcement learning problems considered in Chapters 4 and 7. The parameters in the function approximator are updated after each observation. In the supervised learning problems considered in Chapters 3, 5 and 6, the same data in the training set is seen in each epoch. Thus, the term `on-line mode' in these chapters refers to the application of an on-line learning method to an off-line supervised learning task.

Barto (1993) lists several open areas of research in reinforcement learning:

1. using compact representations of evaluation functions, i.e. not look-up tables
2. dealing with incomplete state information and non-Markovian situations
3. performing exploration effectively
4. incorporating prior knowledge
5. using modular and hierarchical architectures
6. integration with other problem solving and planning methods

This dissertation is focussed on points 1, 4, 5 and 6.

Scaling up with function approximation

Until recently, reinforcement learning has only been applied to small problems with several hundred states and a few discrete actions in each state. To achieve the objective of scaling up to problems with large state and action spaces, the evaluation function and policy of a reinforcement learning agent can be stored in function approximators instead of look-up tables. Supervised learning then becomes a sub-problem of reinforcement learning. Among others, Tesauro (1991), Lin (1993b), Tham and Prager (1994), and Rummery and Niranjan (1994) have shown that the combination of reinforcement learning and function approximation can be successfully applied to solving large problems.

However, a drawback of function approximators is that they usually require a long training process involving repeated passes through training data before the input-output mapping is learnt accurately. This may limit the usefulness of function approximators for on-line reinforcement learning. For example, Lin (1993b) described an `experience replay' algorithm in which experiences during a trial involving agent-environment interaction were recorded so that they could be replayed to train several MLP networks.

In Chapter 3, several function approximators commonly used in reinforcement learning applications will be described. The performance of these function approximators in terms of on-line learning speed, accuracy, computational cost and storage requirements is compared. The Cerebellar Model Articulation Controller (CMAC) (Albus 1975) network emerged as a non-linear function approximator well-suited for reinforcement learning and shall be used extensively in later parts of this dissertation.

Manipulator control using reinforcement learning

Using a fast function approximator which can perform incremental learning should enable reinforcement learning to be used in an on-line manner to solve problems with large state and action spaces. Learning systems which employ different reinforcement learning algorithms integrated with CMAC networks are developed in Chapter 4. These systems are then tested on a multi-linked manipulator control and obstacle avoidance task, which has approximately 600,000 distinguishable states and either real-valued actions or 11 discrete actions. The performance of these learning systems in terms of the quality of solutions, amount of training required, computational cost and storage requirements is compared.
Hierarchical and modular reinforcement learning

So far, only single reinforcement learning tasks which require monolithic function approximators have been considered. This was the view presented in Figure 1.3. A more useful approach is to have task-dependent or context-dependent reinforcement learning according to the scheme shown in Figure 1.4.

[Figure 1.4: Context-dependent reinforcement learning. Diagram labels: Task command; Context information; Context-dependent switch; action; Skill 1, Skill 2, ..., Skill n, each receiving detailed state information.]

This can be viewed as a hierarchical and modular approach to reinforcement learning. The most important benefits of using a hierarchical and modular approach are:

1. transfer of learning from basic or elemental skills in order to solve more complex tasks, e.g. composite tasks which involve several elemental skills executed sequentially, and
2. reduction in the temporal and spatial resolution at the higher levels of the hierarchy, leading to a smaller search space and faster re-planning when the goal changes.

There are many schemes for performing hierarchical and modular reinforcement learning. In this dissertation, I shall focus on the Compositional Q-Learning (CQ-L) framework proposed by Singh (1992b), which requires hierarchical and modular function approximation.

Hierarchical and modular function approximation

The Hierarchical Mixtures of Experts (HME) architecture (Jordan and Jacobs 1993) is a modular approach to supervised learning. It consists of gating networks which mix the outputs from expert networks in order to produce the final output value. Essentially, it is a divide-and-conquer approach to supervised learning where different regions in input space are allocated to different expert networks. These expert networks can model the data in sub-regions better than a single monolithic network assigned to the entire input space.

Fast batch and on-line learning algorithms derived from the Expectation-Maximization (EM) algorithm and second order methods were proposed for the case where the gating and expert networks contain linear approximators. However, these algorithms are computationally expensive when the number of parameters in the networks is large. In Chapter 5, a new on-line Generalized EM (GEM) algorithm is formulated which gives the benefits of faster learning provided by the EM algorithm, with significantly lower computational and storage costs than the algorithms mentioned above. The performance of these algorithms is compared in a composite linear regression task, according to criteria similar to those used when comparing function approximators above.

Hierarchical CMAC architecture

By incorporating CMAC networks into the HME architecture, non-linear function approximation tasks with large state spaces can be solved with a one-level HME, compared to the case where several levels are required when linear approximators are used in expert networks. Since the output of a CMAC network is linear in its parameters, the fast batch and on-line learning algorithms proposed for the HME architecture can be used. In particular, the new on-line GEM algorithm will also bring about faster learning and savings in computational and storage costs, as in the case of the HME architecture with linear approximators considered above. The hierarchical CMAC architecture will be described in Chapter 6 together with an illustration of its usefulness in a composite non-linear regression problem. This problem can be viewed as a context-dependent function approximation problem.

Extending Compositional Q-Learning

We return to the Compositional Q-Learning (CQ-L) framework mentioned during the discussion on hierarchical and modular reinforcement learning above. The CQ-L framework was designed to facilitate transfer of learning from elemental skills to composite skills. In Chapter 7, two extensions to this framework are proposed to enhance its usefulness for solving composite reinforcement learning tasks. The hierarchical CMAC architecture, incorporating the on-line GEM algorithm, is then used to implement the extended CQ-L framework. The resulting learning system is the main contribution of this dissertation.

In order to evaluate its effectiveness in solving composite reinforcement learning tasks with large state and action spaces, the manipulator obstacle avoidance and control problem considered above is re-visited. The agent is now required to learn how to solve up to fifteen different tasks, up to twelve of which are composite tasks.

Incorporation of prior knowledge

Most approaches to reinforcement learning are tabula rasa, i.e. the agent starts off with small random values of the parameters in its evaluation function and policy. However, prior knowledge is often available and can be used to reduce the training time needed before the agent becomes competent. Instead of relying on a random walk, exploration strategies can be specified.
Certain actions which are known to be damaging in particular situations can also be removed from the set of candidate actions. This involves run-time determination of the set of legal actions. Different ways of incorporating prior knowledge are reviewed in Chapter 8. In particular, the use of condition-action rules to perform reasoning is considered. A reinforcement learning system based on a classifier system (Holland 1986) is developed and its usefulness is demonstrated in a blocks world planning task.

1.6 Summary

In this chapter, the issue of agent-environment interaction was discussed. The AI, control engineering and learning approaches for the control of autonomous agents were compared, with the conclusion that permitting agents to learn reduces human design effort while producing more flexible and robust agents. An introduction to the supervised and reinforcement learning paradigms was given, followed by a detailed account of the objectives of this work and an overview of this dissertation.

Chapter 2

Reinforcement Learning

"Reinforcement learning is the learning of a mapping from situations to actions so as to maximize a scalar reward or reinforcement signal. The learner is not told which action to take, as in most forms of machine learning, but instead must discover which actions yield the highest reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation, and through that, all subsequent rewards. These two characteristics - trial-and-error search and delayed reward - are the two most distinguishing features of reinforcement learning."
- R.S. Sutton, ed., Machine Learning: Special Issue on Reinforcement Learning, May 1992

In this chapter, the main concepts in reinforcement learning will be reviewed. The ideas and algorithms examined here provide the foundation for the work described in subsequent parts of this dissertation. First, the reinforcement learning paradigm is described. Then, the mathematical formalism of Markovian decision processes and principles of stochastic dynamic programming are presented. Two well-known algorithms in reinforcement learning, the TD($\lambda$) algorithm and the Q-learning algorithm, are described, followed by a discussion on associative stochastic learning automata and actor-critic learning systems.

2.1 Basic concepts

Reinforcement learning methods are characterized by a reinforcement signal that evaluates the performance of the learning agent with respect to a given set of goals. Typically, a positive reward is given for an action which brought about a desirable outcome, and a negative penalty is imposed for an action which caused an unwanted consequence. These methods differ from supervised learning methods since the reinforcement signal does not provide the agent with the correct answer at each step; instead, it only indicates how favourable the outcome of a sequence of actions is. Thus, reinforcement learning methods are able to overcome one of the main limitations of supervised learning: the requirement of a `teacher'.

In addition, the reinforcement signal does not contain gradient or directional information. It does not indicate whether improvement is possible, nor how the behaviour should be changed for improvement, i.e. by how much and in which direction. The agent has to infer this directional information from a collection of reinforcement signals received over time.

The reinforcement signal can be immediate, evaluating the most recent action performed by the agent. In the more challenging case of delayed reinforcement, the reinforcement signal for a particular action arrives long after the action was taken, following further cycles of agent-environment interaction. The agent then has the task of relating this reinforcement signal to an action which was taken some time in the past - this is known as the temporal credit assignment problem¹.

¹ In contrast, the structural credit assignment problem deals with the apportionment of credit to the part(s) of the system responsible for a particular decision or action.

The reinforcement learning paradigm has its origins in the theory of stochastic learning automata (Narendra and Thathachar 1974) (see Section 2.6), which deals with the selection of actions in unknown stochastic environments in order to minimize penalties received. This earlier work was extended in two ways: (1) to the associative case (Barto and Anandan 1985; Williams 1988), and (2) to the delayed reinforcement case, mentioned above. In the case of associative reinforcement learning, the agent receives context or state information from the environment. Therefore, different actions can be generated in different situations. In this dissertation, only associative reinforcement learning tasks will be considered.

Initially, the agent performs exploration by trying different actions randomly in order to discover their utilities. As learning progresses, it encounters a conflict between performing: (1) actions which enable it to learn more about the environment and potentially take better actions in the future, but which may have undesirable short-term consequences, and (2) actions that lead to high payoff based on the knowledge it currently has. This is commonly referred to as the exploration vs exploitation trade-off. Thrun (1992) suggested several directed exploration methods to minimize the costs of learning (see Section 2.9).

2.2 Markov Decision Processes (MDP)

Mathematically, a reinforcement learning agent interacting with the environment can be considered as undergoing a Markov decision process with four essential components:

1. states $x \in S$, where $S$ is the state-space
2. actions $a \in A(x)$, i.e. the set of possible actions may be different in different states
3. state transition function $T(x, a)$, with state transition probabilities $P_{xy}(a) = \Pr(T(x, a) = y)$, where $y$ is the state reached from state $x$ when action $a$ is taken
4. reward function $R(x, a)$ which gives a reward when action $a$ is taken in state $x$.

The state $x$ contains a complete description of the condition of the system which, together with future actions, determines all aspects of the future behaviour of the system.
This is the Markov property: once the state is known, there is no need to have information about the history of the system, i.e. previous states, actions and rewards, in order to make a decision about what action to take.

To simplify analysis, a finite and discrete-time dynamical system is considered. This means that $S$ is a finite set of states and $A(x)$ is a finite set of actions. The reward function $R(x, a)$ may be stochastic, with actual rewards $r$ coming from a probability distribution determined by $x$ and $a$. It is sufficient to consider the expected reward, written as

$$\rho(x, a) = E[R(x, a)]$$

for fixed $x$ and $a$.

A policy $\pi$ specifies the action $a$ to be performed in each state $x$, i.e. $a = \pi(x)$. A stationary policy specifies the same action each time a particular state is entered. On the other hand, a stochastic policy specifies an action chosen from a fixed probability distribution over actions in $A(x)$.

In a reinforcement learning problem with delayed reinforcement, the aim of the agent is to perform actions that lead to maximum cumulative reward over time. It is not enough to simply maximize the immediate reward which it receives. Although the total or average reward received over time, e.g. Schwartz (1993), can be used as a measure of cumulative reward, it is more common to use the sum of discounted rewards, referred to as the return, which, from time $t$, is given by:

$$r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^n r_{t+n} + \ldots$$

The term $r_t$ is the reward received at time $t$ and $\gamma$ is the discount factor, with $0 \le \gamma \le 1$. If the number of time steps of operation, i.e. the horizon, is infinite, the return with $\gamma < 1$ is still a finite quantity. The discount factor adjusts the degree to which long-term consequences of actions must be accounted for. In a delayed reinforcement task, $r_t$ may depend on any of $a_t$, $a_{t-1}$, $a_{t-2}$, $\ldots$, where $a_t$ is the action taken at time $t$.
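The return defined above can be illustrated with a short Python sketch. The function name `discounted_return` and the example reward sequences are assumptions made for illustration only; they do not come from the dissertation.

```python
def discounted_return(rewards, gamma):
    """Sum r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... over a finite reward list."""
    total = 0.0
    # Work backwards so that each earlier step multiplies the tail by gamma once.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A delayed reward: nothing for two steps, then a payoff of 1 (hypothetical data).
delayed = discounted_return([0.0, 0.0, 1.0], gamma=0.9)

# A long run of constant rewards approaches 1 / (1 - gamma), which is why the
# return stays finite over an infinite horizon when gamma < 1.
long_run = discounted_return([1.0] * 1000, gamma=0.9)
print(delayed, long_run)
```

With a smaller discount factor the delayed payoff contributes less to the return, which is the sense in which the discount factor controls how far ahead the agent looks.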

2.3 Stochastic Dynamic Programming (DP)

Dynamic programming (Bertsekas 1987) is a method of solving the credit assignment problem in sequential or multi-stage decision processes. Most reinforcement learning algorithms operate by approximating dynamic programming. This enables them to handle delayed reinforcement situations in stochastic environments in a computationally efficient manner.

When stochastic factors are involved, the expected return, which is the expected value of the actual return, is considered. As a result of the Markov property, the expected return from state $x$ depends only on $x$ and the policy $\pi$ that will be followed. Define the random variable $R(x, n)$ to be the immediate reward obtained after starting in state $x$ and following policy $\pi$ for $n$ steps. Thus, the expected return from state $x$ when policy $\pi$ is followed is written as

$$V^{\pi}(x) = E[R(x, 1) + \gamma R(x, 2) + \ldots + \gamma^{n-1} R(x, n) + \ldots] \qquad (2.1)$$

where $V^{\pi}(x)$ is the evaluation function² for policy $\pi$. It gives an immediately accessible prediction of expected return at state $x$. The evaluation function can be estimated by repeatedly running the process under policy $\pi$ and averaging the discounted sums of rewards that follow. Equation 2.1 can also be written as

$$V^{\pi}(x) = \rho(x, \pi(x)) + \gamma \sum_{y \in S} P_{xy}(\pi(x)) \, V^{\pi}(y) \qquad (2.2)$$

If the expected reward $\rho$ and state transition probabilities $P_{xy}$ are known, i.e. a model of the underlying task is available, the evaluation function for policy $\pi$ can be calculated by solving a set of linear equations, one for each state.

Usually, we wish to find a policy that maximizes the evaluation function such that

$$V^{\pi^*}(x) = \max_{\pi} V^{\pi}(x) \qquad (2.3)$$

for all possible initial states $x$. Such a policy is referred to as an optimal policy, denoted as $\pi^*$, and the corresponding evaluation function $V^*(x)$ is referred to as the optimal evaluation function. There may be several optimal policies, but all of them give the same unique optimal evaluation function.
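As a rough illustration of Equation 2.2, the sketch below evaluates a fixed policy on a tiny two-state problem. Note the substitution: rather than solving the linear equations directly, it applies Equation 2.2 repeatedly as an update (successive approximation), which converges to the same values when the discount factor is below 1. The MDP, the nested-list layout `P[x][a][y]` and `rho[x][a]`, and the name `evaluate_policy` are all assumptions made for this example.

```python
def evaluate_policy(P, rho, policy, gamma, sweeps=1000):
    """Approximate V^pi by sweeping Equation 2.2 over all states.

    P[x][a][y] : probability of reaching state y from x under action a
    rho[x][a]  : expected reward for taking action a in state x
    policy[x]  : action chosen by the (deterministic) policy in state x
    """
    n = len(P)
    V = [0.0] * n
    for _ in range(sweeps):
        # Each sweep builds a new estimate from the previous one.
        V = [
            rho[x][policy[x]]
            + gamma * sum(P[x][policy[x]][y] * V[y] for y in range(n))
            for x in range(n)
        ]
    return V


# Hypothetical two-state, two-action MDP used only for illustration.
# State 0: action 0 stays put (reward 0); action 1 moves to state 1 (reward 1).
# State 1: both actions stay in state 1 (reward 2).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
rho = [[0.0, 1.0],
       [2.0, 2.0]]

V = evaluate_policy(P, rho, policy=[1, 0], gamma=0.5)
print(V)
```

For this example the values can be checked by hand: state 1 earns 2 on every step, so its value is 2 / (1 - 0.5) = 4, and state 0 earns 1 and then moves to state 1, giving 1 + 0.5 × 4 = 3.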
The Bellman Optimality Equation (Bellman 1957) characterizes the optimal value of a state $x$ in terms of the optimal values of possible successor states $y$:

$$V^*(x) = \max_{a \in A(x)} \Big\{ \rho(x, a) + \gamma \sum_{y \in S} P_{xy}(a) \, V^*(y) \Big\} \qquad (2.4)$$

where $V^*(x)$ is a unique bounded solution. There are a variety of computational techniques for solving Bellman's equation. Here, policy iteration and value iteration are considered.

2.3.1 Policy Iteration

Consider two policies: $\pi_1$, with evaluation function $V^{\pi_1}$, and $\pi_2$. One way to determine whether $\pi_2$ is uniformly better than $\pi_1$ is to compute $V^{\pi_2}$ and compare it with $V^{\pi_1}$ over the entire state space, but this is computationally wasteful. Assume that policy $\pi_1$ recommends action $a$ and policy $\pi_2$ recommends action $b$ in state $x$. The expected return, starting from state $x$, following policy $\pi_2$ for one step, i.e. taking action $b$, and then following policy $\pi_1$ thereafter is

$$Q^{\pi_1}(x, b) = \rho(x, b) + \gamma \sum_{y \in S} P_{xy}(b) \, V^{\pi_1}(y)$$

which is easier to compute than $V^{\pi_2}$. If $Q^{\pi_1}(x, \pi_2(x)) \ge V^{\pi_1}(x)$ for all states $x$, then $\pi_2$ is uniformly as good as or better than $\pi_1$. In general, the quantity $Q^{\pi}(x, a)$ is referred to as the action value of action $a$ in state $x$ under policy $\pi$.

The following algorithm will converge to the optimal policy in a finite Markov decision process (Bellman and Dreyfus 1962):

1. $\pi \leftarrow$ arbitrary initial policy
2. Repeat

² The evaluation function is also referred to as the value function.
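The one-step comparison between two policies described in this section can be sketched in Python as follows. This is only an illustration of the action-value test, not a reconstruction of the algorithm listed above; the two-state MDP, the data layout and the function names are assumptions, and the values of $V^{\pi_1}$ are entered by hand for this small example.

```python
def action_value(P, rho, V, x, a, gamma):
    """Q(x, a): take action a once in state x, then follow the policy whose values are V."""
    return rho[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(len(V)))


def is_as_good_or_better(P, rho, V1, policy2, gamma, tol=1e-12):
    """Check Q^{pi1}(x, pi2(x)) >= V^{pi1}(x) for every state x."""
    return all(
        action_value(P, rho, V1, x, policy2[x], gamma) >= V1[x] - tol
        for x in range(len(V1))
    )


# Hypothetical two-state, two-action MDP (P[x][a][y], rho[x][a]).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
rho = [[0.0, 1.0],
       [2.0, 2.0]]
gamma = 0.5

# pi1 always takes action 0; its values, worked out by hand, are [0, 4].
V_pi1 = [0.0, 4.0]
pi2 = [1, 0]
print(is_as_good_or_better(P, rho, V_pi1, pi2, gamma))
```

Only the values of the current policy are needed for the test, which is what makes it cheaper than evaluating the second policy over the whole state space.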


More information

Learning outcomes. Knowledge and understanding. Competence and skills

Learning outcomes. Knowledge and understanding. Competence and skills Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

NEURAL NETWORK FUNDAMENTALS WITH GRAPHS, ALGORITHMS, AND APPLICATIONS

NEURAL NETWORK FUNDAMENTALS WITH GRAPHS, ALGORITHMS, AND APPLICATIONS NEURAL NETWORK FUNDAMENTALS WITH GRAPHS, ALGORITHMS, AND APPLICATIONS N. K. Bose HRB-Systems Professor of Electrical Engineering The Pennsylvania State University, University Park P. Liang Associate Professor

More information

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate

More information

A Memory Reduction Method in Pricing American Options Raymond H. Chan Yong Chen y K. M. Yeung z Abstract This paper concerns with the pricing of American options by simulation methods. In the traditional

More information

Lecture 1: Introduction to Neural Networks Kevin Swingler / Bruce Graham

Lecture 1: Introduction to Neural Networks Kevin Swingler / Bruce Graham Lecture 1: Introduction to Neural Networks Kevin Swingler / Bruce Graham kms@cs.stir.ac.uk 1 What are Neural Networks? Neural Networks are networks of neurons, for example, as found in real (i.e. biological)

More information

THE LOGIC OF ADAPTIVE BEHAVIOR

THE LOGIC OF ADAPTIVE BEHAVIOR THE LOGIC OF ADAPTIVE BEHAVIOR Knowledge Representation and Algorithms for Adaptive Sequential Decision Making under Uncertainty in First-Order and Relational Domains Martijn van Otterlo Department of

More information

International Journal of Software and Web Sciences (IJSWS) www.iasir.net

International Journal of Software and Web Sciences (IJSWS) www.iasir.net International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International

More information

Force/position control of a robotic system for transcranial magnetic stimulation

Force/position control of a robotic system for transcranial magnetic stimulation Force/position control of a robotic system for transcranial magnetic stimulation W.N. Wan Zakaria School of Mechanical and System Engineering Newcastle University Abstract To develop a force control scheme

More information

Mapping an Application to a Control Architecture: Specification of the Problem

Mapping an Application to a Control Architecture: Specification of the Problem Mapping an Application to a Control Architecture: Specification of the Problem Mieczyslaw M. Kokar 1, Kevin M. Passino 2, Kenneth Baclawski 1, and Jeffrey E. Smith 3 1 Northeastern University, Boston,

More information

6.2.8 Neural networks for data mining

6.2.8 Neural networks for data mining 6.2.8 Neural networks for data mining Walter Kosters 1 In many application areas neural networks are known to be valuable tools. This also holds for data mining. In this chapter we discuss the use of neural

More information

Inductive QoS Packet Scheduling for Adaptive Dynamic Networks

Inductive QoS Packet Scheduling for Adaptive Dynamic Networks Inductive QoS Packet Scheduling for Adaptive Dynamic Networks Malika BOURENANE Dept of Computer Science University of Es-Senia Algeria mb_regina@yahoo.fr Abdelhamid MELLOUK LISSI Laboratory University

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

More information

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I

Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Gerard Mc Nulty Systems Optimisation Ltd gmcnulty@iol.ie/0876697867 BA.,B.A.I.,C.Eng.,F.I.E.I Data is Important because it: Helps in Corporate Aims Basis of Business Decisions Engineering Decisions Energy

More information

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

More information

Agents: Rationality (2)

Agents: Rationality (2) Agents: Intro Agent is entity that perceives and acts Perception occurs via sensors Percept is one unit of sensory input Percept sequence is complete history of agent s percepts In general, agent s actions

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Neural Computation : Lecture 12 John A. Bullinaria, 2015 1. Recurrent Neural Network Architectures 2. State Space Models and Dynamical Systems 3. Backpropagation Through Time

More information

Machine Learning. 01 - Introduction

Machine Learning. 01 - Introduction Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge

More information

Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object

More information

IAI : Biological Intelligence and Neural Networks

IAI : Biological Intelligence and Neural Networks IAI : Biological Intelligence and Neural Networks John A. Bullinaria, 2005 1. How do Humans do Intelligent Things? 2. What are Neural Networks? 3. What are Artificial Neural Networks used for? 4. Introduction

More information

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

More information

Using Markov Decision Processes to Solve a Portfolio Allocation Problem

Using Markov Decision Processes to Solve a Portfolio Allocation Problem Using Markov Decision Processes to Solve a Portfolio Allocation Problem Daniel Bookstaber April 26, 2005 Contents 1 Introduction 3 2 Defining the Model 4 2.1 The Stochastic Model for a Single Asset.........................

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

PMR5406 Redes Neurais e Lógica Fuzzy Aula 3 Multilayer Percetrons

PMR5406 Redes Neurais e Lógica Fuzzy Aula 3 Multilayer Percetrons PMR5406 Redes Neurais e Aula 3 Multilayer Percetrons Baseado em: Neural Networks, Simon Haykin, Prentice-Hall, 2 nd edition Slides do curso por Elena Marchiori, Vrie Unviersity Multilayer Perceptrons Architecture

More information

Next Generation Intrusion Detection: Autonomous Reinforcement Learning of Network Attacks

Next Generation Intrusion Detection: Autonomous Reinforcement Learning of Network Attacks Next Generation Intrusion Detection: Autonomous Reinforcement Learning of Network Attacks James Cannady Georgia Tech Information Security Center Georgia Institute of Technology Atlanta, GA 30332-0832 james.cannady@gtri.gatech.edu

More information

Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network

Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network Chapter 2 The Research on Fault Diagnosis of Building Electrical System Based on RBF Neural Network Qian Wu, Yahui Wang, Long Zhang and Li Shen Abstract Building electrical system fault diagnosis is the

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning LU 2 - Markov Decision Problems and Dynamic Programming Dr. Martin Lauer AG Maschinelles Lernen und Natürlichsprachliche Systeme Albert-Ludwigs-Universität Freiburg martin.lauer@kit.edu

More information

Advanced Signal Processing and Digital Noise Reduction

Advanced Signal Processing and Digital Noise Reduction Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New

More information

Software Quality Factors OOA, OOD, and OOP Object-oriented techniques enhance key external and internal software quality factors, e.g., 1. External (v

Software Quality Factors OOA, OOD, and OOP Object-oriented techniques enhance key external and internal software quality factors, e.g., 1. External (v Object-Oriented Design and Programming Deja Vu? In the past: Structured = Good Overview of Object-Oriented Design Principles and Techniques Today: Object-Oriented = Good e.g., Douglas C. Schmidt www.cs.wustl.edu/schmidt/

More information

A Sarsa based Autonomous Stock Trading Agent

A Sarsa based Autonomous Stock Trading Agent A Sarsa based Autonomous Stock Trading Agent Achal Augustine The University of Texas at Austin Department of Computer Science Austin, TX 78712 USA achal@cs.utexas.edu Abstract This paper describes an autonomous

More information

Intelligent Agents Serving Based On The Society Information

Intelligent Agents Serving Based On The Society Information Intelligent Agents Serving Based On The Society Information Sanem SARIEL Istanbul Technical University, Computer Engineering Department, Istanbul, TURKEY sariel@cs.itu.edu.tr B. Tevfik AKGUN Yildiz Technical

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Include Requirement (R)

Include Requirement (R) Using Inuence Diagrams in Software Change Management Colin J. Burgess Department of Computer Science, University of Bristol, Bristol, BS8 1UB, England Ilesh Dattani, Gordon Hughes and John H.R. May Safety

More information

Reinforcement Learning of Task Plans for Real Robot Systems

Reinforcement Learning of Task Plans for Real Robot Systems Reinforcement Learning of Task Plans for Real Robot Systems Pedro Tomás Mendes Resende pedro.resende@ist.utl.pt Instituto Superior Técnico, Lisboa, Portugal October 2014 Abstract This paper is the extended

More information

Hyperspectral images retrieval with Support Vector Machines (SVM)

Hyperspectral images retrieval with Support Vector Machines (SVM) Hyperspectral images retrieval with Support Vector Machines (SVM) Miguel A. Veganzones Grupo Inteligencia Computacional Universidad del País Vasco (Grupo Inteligencia SVM-retrieval Computacional Universidad

More information

SHAPE REGISTRATION USING OPTIMIZATION FOR MOBILE ROBOT NAVIGATION. Feng Lu

SHAPE REGISTRATION USING OPTIMIZATION FOR MOBILE ROBOT NAVIGATION. Feng Lu SHAPE REGISTRATION USING OPTIMIZATION FOR MOBILE ROBOT NAVIGATION by Feng Lu A A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer

More information

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

More information

Robot Communication: Issues and Implementations. Holly A. Yanco. at the

Robot Communication: Issues and Implementations. Holly A. Yanco. at the Robot Communication: Issues and Implementations by Holly A. Yanco B.A. Wellesley College (1991) Submitted to the Department of Electrical Engineering and Computer Science in partial fulllment of the requirements

More information

Lecture 1: Introduction to Reinforcement Learning

Lecture 1: Introduction to Reinforcement Learning Lecture 1: Introduction to Reinforcement Learning David Silver Outline 1 Admin 2 About Reinforcement Learning 3 The Reinforcement Learning Problem 4 Inside An RL Agent 5 Problems within Reinforcement Learning

More information

Graduate Co-op Students Information Manual. Department of Computer Science. Faculty of Science. University of Regina

Graduate Co-op Students Information Manual. Department of Computer Science. Faculty of Science. University of Regina Graduate Co-op Students Information Manual Department of Computer Science Faculty of Science University of Regina 2014 1 Table of Contents 1. Department Description..3 2. Program Requirements and Procedures

More information

Figure 1: Cost and Speed of Access of different storage components. Page 30

Figure 1: Cost and Speed of Access of different storage components. Page 30 Reinforcement Learning Approach for Data Migration in Hierarchical Storage Systems T.G. Lakshmi, R.R. Sedamkar, Harshali Patil Department of Computer Engineering, Thakur College of Engineering and Technology,

More information

Clustering and scheduling maintenance tasks over time

Clustering and scheduling maintenance tasks over time Clustering and scheduling maintenance tasks over time Per Kreuger 2008-04-29 SICS Technical Report T2008:09 Abstract We report results on a maintenance scheduling problem. The problem consists of allocating

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

MULTI-SPERT PHILIPP F ARBER AND KRSTE ASANOVI C. International Computer Science Institute,

MULTI-SPERT PHILIPP F ARBER AND KRSTE ASANOVI C. International Computer Science Institute, PARALLEL NEURAL NETWORK TRAINING ON MULTI-SPERT PHILIPP F ARBER AND KRSTE ASANOVI C International Computer Science Institute, Berkeley, CA 9474 Multi-Spert is a scalable parallel system built from multiple

More information

The Lindsey-Fox Algorithm for Factoring Polynomials

The Lindsey-Fox Algorithm for Factoring Polynomials OpenStax-CNX module: m15573 1 The Lindsey-Fox Algorithm for Factoring Polynomials C. Sidney Burrus This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 2.0

More information

Programmable Logic Controllers Definition. Programmable Logic Controllers History

Programmable Logic Controllers Definition. Programmable Logic Controllers History Definition A digitally operated electronic apparatus which uses a programmable memory for the internal storage of instructions for implementing specific functions such as logic, sequencing, timing, counting,

More information

Software development process

Software development process OpenStax-CNX module: m14619 1 Software development process Trung Hung VO This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 2.0 Abstract A software development

More information

Options with exceptions

Options with exceptions Options with exceptions Munu Sairamesh and Balaraman Ravindran Indian Institute Of Technology Madras, India Abstract. An option is a policy fragment that represents a solution to a frequent subproblem

More information

Online Model Predictive Control of a Robotic System by Combining Simulation and Optimization

Online Model Predictive Control of a Robotic System by Combining Simulation and Optimization Mohammad Rokonuzzaman Pappu Online Model Predictive Control of a Robotic System by Combining Simulation and Optimization School of Electrical Engineering Department of Electrical Engineering and Automation

More information

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

More information

Università degli Studi di Bologna

Università degli Studi di Bologna Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS - University of

More information

Deterministic Sampling-based Switching Kalman Filtering for Vehicle Tracking

Deterministic Sampling-based Switching Kalman Filtering for Vehicle Tracking Proceedings of the IEEE ITSC 2006 2006 IEEE Intelligent Transportation Systems Conference Toronto, Canada, September 17-20, 2006 WA4.1 Deterministic Sampling-based Switching Kalman Filtering for Vehicle

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

CHAPTER 5 PREDICTIVE MODELING STUDIES TO DETERMINE THE CONVEYING VELOCITY OF PARTS ON VIBRATORY FEEDER

CHAPTER 5 PREDICTIVE MODELING STUDIES TO DETERMINE THE CONVEYING VELOCITY OF PARTS ON VIBRATORY FEEDER 93 CHAPTER 5 PREDICTIVE MODELING STUDIES TO DETERMINE THE CONVEYING VELOCITY OF PARTS ON VIBRATORY FEEDER 5.1 INTRODUCTION The development of an active trap based feeder for handling brakeliners was discussed

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Accurate and robust image superresolution by neural processing of local image representations

Accurate and robust image superresolution by neural processing of local image representations Accurate and robust image superresolution by neural processing of local image representations Carlos Miravet 1,2 and Francisco B. Rodríguez 1 1 Grupo de Neurocomputación Biológica (GNB), Escuela Politécnica

More information

Optimization applications in finance, securities, banking and insurance

Optimization applications in finance, securities, banking and insurance IBM Software IBM ILOG Optimization and Analytical Decision Support Solutions White Paper Optimization applications in finance, securities, banking and insurance 2 Optimization applications in finance,

More information

Sensory-motor control scheme based on Kohonen Maps and AVITE model

Sensory-motor control scheme based on Kohonen Maps and AVITE model Sensory-motor control scheme based on Kohonen Maps and AVITE model Juan L. Pedreño-Molina, Antonio Guerrero-González, Oscar A. Florez-Giraldo, J. Molina-Vilaplana Technical University of Cartagena Department

More information

TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play

TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play Gerald Tesauro IBM Thomas J. Watson Research Center P. O. Box 704 Yorktown Heights, NY 10598 (tesauro@watson.ibm.com) Abstract.

More information

Thesis work and research project

Thesis work and research project Thesis work and research project Hélia Pouyllau, INRIA of Rennes, Campus Beaulieu 35042 Rennes, helia.pouyllau@irisa.fr July 16, 2007 1 Thesis work on Distributed algorithms for endto-end QoS contract

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning LU 2 - Markov Decision Problems and Dynamic Programming Dr. Joschka Bödecker AG Maschinelles Lernen und Natürlichsprachliche Systeme Albert-Ludwigs-Universität Freiburg jboedeck@informatik.uni-freiburg.de

More information

Robot Task-Level Programming Language and Simulation

Robot Task-Level Programming Language and Simulation Robot Task-Level Programming Language and Simulation M. Samaka Abstract This paper presents the development of a software application for Off-line robot task programming and simulation. Such application

More information

Introduction to Engineering System Dynamics

Introduction to Engineering System Dynamics CHAPTER 0 Introduction to Engineering System Dynamics 0.1 INTRODUCTION The objective of an engineering analysis of a dynamic system is prediction of its behaviour or performance. Real dynamic systems are

More information

Intelligent Agents. Based on An Introduction to MultiAgent Systems and slides by Michael Wooldridge

Intelligent Agents. Based on An Introduction to MultiAgent Systems and slides by Michael Wooldridge Intelligent Agents Based on An Introduction to MultiAgent Systems and slides by Michael Wooldridge Denition of an Agent An agent is a computer system capable of autonomous action in some environment, in

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Invited Applications Paper

Invited Applications Paper Invited Applications Paper - - Thore Graepel Joaquin Quiñonero Candela Thomas Borchert Ralf Herbrich Microsoft Research Ltd., 7 J J Thomson Avenue, Cambridge CB3 0FB, UK THOREG@MICROSOFT.COM JOAQUINC@MICROSOFT.COM

More information

Modular Neural Networks

Modular Neural Networks 16 Modular Neural Networks In the previous chapters we have discussed different models of neural networks linear, recurrent, supervised, unsupervised, self-organizing, etc. Each kind of network relies

More information

Learning outcomes. Knowledge and understanding. Ability and Competences. Evaluation capability and scientific approach

Learning outcomes. Knowledge and understanding. Ability and Competences. Evaluation capability and scientific approach Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

More information

D. E. Perry A. Porter? L. G. Votta M. W. Wade. Software Production Research Dept Quality Management Group

D. E. Perry A. Porter? L. G. Votta M. W. Wade. Software Production Research Dept Quality Management Group Evaluating Workow and Process Automation in Wide-Area Software Development D. E. Perry A. Porter? Software Production Research Dept Computer Science Dept Bell Laboratories University of Maryland Murray

More information

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of

More information

Safe robot motion planning in dynamic, uncertain environments

Safe robot motion planning in dynamic, uncertain environments Safe robot motion planning in dynamic, uncertain environments RSS 2011 Workshop: Guaranteeing Motion Safety for Robots June 27, 2011 Noel du Toit and Joel Burdick California Institute of Technology Dynamic,

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Neural Network Add-in

Neural Network Add-in Neural Network Add-in Version 1.5 Software User s Guide Contents Overview... 2 Getting Started... 2 Working with Datasets... 2 Open a Dataset... 3 Save a Dataset... 3 Data Pre-processing... 3 Lagging...

More information

Robotics. Chapter 25. Chapter 25 1

Robotics. Chapter 25. Chapter 25 1 Robotics Chapter 25 Chapter 25 1 Outline Robots, Effectors, and Sensors Localization and Mapping Motion Planning Motor Control Chapter 25 2 Mobile Robots Chapter 25 3 Manipulators P R R R R R Configuration

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information