Reinforcement Learning and the Bayesian Control Rule


Pedro A. Ortega, Daniel A. Braun, and Simon Godsill
Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK
{pao32,dab54,sjg}@cam.ac.uk

Abstract. We present an actor-critic scheme for reinforcement learning in complex domains. The main contribution is to show that planning and I/O dynamics can be separated such that an intractable planning problem reduces to a simple multi-armed bandit problem, where each lever stands for a potentially arbitrarily complex policy. Furthermore, we use the Bayesian control rule to construct an adaptive bandit player that is universal with respect to a given class of optimal bandit players, thus indirectly constructing an adaptive agent that is universal with respect to a given class of policies.

Keywords: Reinforcement learning, actor-critic, Bayesian control rule.

1 Introduction

Actor-critic (AC) methods [1] are reinforcement learning (RL) algorithms [9] whose implementation can be conceptually subdivided into two modules: the actor, responsible for interacting with the environment; and the critic, responsible for evaluating the performance of the actor. In this paper we present an AC method that conceptualizes learning a complex policy as a multi-armed bandit problem [2, 6], where pulling one lever corresponds to executing one iteration of a policy. The critic, who plays the role of the multi-armed bandit player, is implemented using the recently introduced Bayesian control rule (BCR) [8, 7]. This has the advantage of bypassing the computational cost of calculating the optimal policy by translating adaptive control into a probabilistic inference problem. The actor is implemented as a pool of parameterized policies. The scheme that we put forward here significantly simplifies the design of RL agents capable of learning complex I/O dynamics. Furthermore, we argue that this scheme offers important advantages over current RL approaches as a basis for general adaptive agents in real-world applications.

This research was supported by the European Commission FP7-ICT project GUIDE (Gentle User Interfaces for Disabled and Elderly citizens).

Fig. 1. The critic is an agent interacting with the actor-environment system.

2 Setup

The interaction between the agent and the environment proceeds in cycles t = 1, 2, ..., where at cycle t the agent produces an action a_t ∈ A that is gathered by the environment, which in turn responds with an observation o_t ∈ O and a reinforcement signal r_t ∈ R that are collected by the agent. To implement the actor-critic architecture, we introduce a signal θ_t ∈ Θ generated by the critic at the beginning of each cycle, i.e. immediately before a_t, o_t and r_t are produced.

2.1 Critic

The critic is modeled as a multi-armed bandit player with a possibly (uncountably) infinite number of levers to choose from. More precisely, the critic iteratively tries out levers θ_1, θ_2, θ_3, ... so as to maximize the sum of the reinforcements r_1, r_2, r_3, ... In this sense, the θ_t and the r_t are the critic's actions and observations respectively, not to be confused with the actions a_t and observations o_t of the actor. According to the BCR, the critic has to sample the lever θ_t from the distribution [8]

    P(\theta_t \mid \hat\theta_{1:t-1}, r_{1:t-1}) = \int_\Phi P(\theta_t \mid \phi, \theta_{1:t-1}, r_{1:t-1}) \, P(\phi \mid \hat\theta_{1:t-1}, r_{1:t-1}) \, d\phi,    (1)

where the hat-notation \hat\theta_{1:t-1} denotes causal intervention rather than probabilistic conditioning. Expression (1) corresponds to a weighted mixture of policies P(θ_t | φ, θ_{1:t-1}, r_{1:t-1}) parameterized by φ ∈ Φ, with weights given by the posterior P(φ | \hat\theta_{1:t-1}, r_{1:t-1}). The posterior can in turn be expressed as

    P(\phi \mid \hat\theta_{1:t-1}, r_{1:t-1}) = \frac{P(\phi) \prod_{\tau=1}^{t-1} P(r_\tau \mid \phi, \theta_{1:\tau}, r_{1:\tau-1})}{\int_\Phi P(\phi') \prod_{\tau=1}^{t-1} P(r_\tau \mid \phi', \theta_{1:\tau}, r_{1:\tau-1}) \, d\phi'},    (2)

where the P(r_t | φ, θ_{1:t}, r_{1:t-1}) are the likelihoods of the reinforcements under the hypothesis φ, and where P(φ) is the prior over Φ. Note that there are no interventions on the right-hand side of this equation. Furthermore, we assume that each parameter φ ∈ Φ fully determines the likelihood function P(r_t | φ, θ_{1:t}, r_{1:t-1}), representing the probability of observing the reinforcement r_t given that (an arbitrary) lever θ_t was pulled. The terms θ_{1:t-1}, r_{1:t-1} can be used to model the internal state of the bandit at cycle t. We assume that each bandit has a unique lever θ_φ ∈ Θ that maximizes the expected sum of rewards, and that the optimal strategy consists in pulling it in every time step:

    P(\theta_t \mid \phi, \theta_{1:t-1}, r_{1:t-1}) = P(\theta_t \mid \phi) = \begin{cases} 1 & \text{if } \theta_t = \theta_\phi, \\ 0 & \text{if } \theta_t \neq \theta_\phi. \end{cases}

Finally, we assume a prior P(φ) over the set of operation modes Φ. This completes the specification of the critic. We will give a concrete example in Sec. 3.

2.2 Actor

The aim of the actor is to offer a rich pool of I/O dynamics parameterized by Θ that the critic can choose from. More precisely, from Fig. 1 it is seen that the actor implements the stream over the actions, i.e.

    P(a_t \mid \theta_{1:t}, a_{1:t-1}, o_{1:t-1}) = P(a_t \mid \theta_t, a_{1:t-1}, o_{1:t-1}),

where we have assumed that this distribution is independent of the previously chosen parameters θ_{1:t-1}. For implementation purposes it is convenient to summarize the experience a_{1:t}, o_{1:t} as a sufficient statistic s^θ_{t+1} representing an internal state of the I/O dynamics θ ∈ Θ at time t+1. States are then updated recursively as

    s^\theta_{t+1} = f_\theta(s^\theta_t, a_t, o_t),

where f_θ maps the old state s^θ_t and the interaction (a_t, o_t) into the new state s^θ_{t+1}. This scheme facilitates running the different I/O dynamics in parallel. The behavior of our proposed actor-critic scheme is described in the pseudo-code listed in Alg. 1.
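As a concrete (and purely illustrative) rendering of the critic described above, the following Python sketch implements the BCR step for a finite set of operation modes, so that the integrals in (1) and (2) become sums over modes: sampling a lever amounts to drawing a mode φ from the posterior weights and pulling its optimal lever θ_φ, and the update reweights each mode by the likelihood of the observed reward. The interface assumed here (optimal_lever, reward_loglik) is our own naming, not the paper's.

    import math
    import random

    class BCRCritic:
        """Bayesian-control-rule bandit player for a finite set of operation modes.

        Illustrative only: `modes` is a list of hypothesis objects, each exposing
        an `optimal_lever` attribute (the lever theta_phi it would always pull)
        and a `reward_loglik(lever, reward)` method, i.e. log P(r_t | phi, theta_t).
        """

        def __init__(self, modes, prior=None):
            self.modes = modes
            n = len(modes)
            # Unnormalized log-posterior weights, initialized to the (uniform by default) prior.
            self.logw = [math.log(p) for p in (prior or [1.0 / n] * n)]

        def sample_lever(self):
            # Eq. (1) with the deterministic per-mode policy: draw phi from the
            # posterior weights, then pull that mode's optimal lever.
            top = max(self.logw)
            w = [math.exp(lw - top) for lw in self.logw]
            phi = random.choices(self.modes, weights=w, k=1)[0]
            return phi.optimal_lever

        def update(self, lever, reward):
            # Eq. (2): reweight each mode by the likelihood of the observed reward.
            # Interventions affect only the lever choice, not this update.
            self.logw = [lw + phi.reward_loglik(lever, reward)
                         for lw, phi in zip(self.logw, self.modes)]

With the deterministic per-mode policy assumed above, this posterior-sampling step reduces to Thompson sampling over the levers θ_φ.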

3 Experimental Results

We have applied the proposed scheme to a toy problem containing elements that are usually regarded as challenging in the literature: nonlinear, high-dimensional dynamics and an only partially observable state. The I/O domains are A = O = [-1, 1]^10, with reinforcements in R.

Algorithm 1: Actor-Critic BCR
 1  foreach φ ∈ Φ do set P_1(φ) ← P(φ)
 2  foreach θ ∈ Θ do initialize the state s^θ_0
 3  for t = 1, 2, 3, ... do
 4      sample φ_t ~ P_t(φ)
 5      set θ_t ← θ_{φ_t}
 6      sample a_t ~ P(a_t | θ_t, s^{θ_t}_t)
 7      issue a_t and collect o_t and r_t
 8      foreach φ ∈ Φ do set P_{t+1}(φ) ∝ P_t(φ) P(r_t | φ, θ_{1:t}, r_{1:t-1})
 9      foreach θ ∈ Θ do set s^θ_{t+1} ← f_θ(s^θ_t, a_t, o_t)

The environment produces observations following the equation

    [o_t, q_t]^T = f(\mu \, [a_t, o_{t-1}, q_{t-1}]^T),

where a_t, o_t, q_t ∈ [-1, 1]^10 are the 10-dimensional action, observation and internal-state vectors respectively, μ is a 20 × 30 parameter matrix, and f(·) is a sigmoid mapping each component x to 2/(1 + e^{-x}) - 1. Rewards are issued as r_t = h(θ) + ν_t, where h is an unknown reward-mean function and ν_t is Gaussian noise with variance σ².

Analogously, the actor implements a family of 300 different policies, each of the form

    [a_t, s_t]^T = f(\theta \, [a_{t-1}, o_{t-1}, s_{t-1}]^T),

where s_t ∈ [-1, 1]^10 is the internal state vector and θ ∈ Θ is the parameter matrix of the policy. These 300 matrices were sampled randomly.

The critic is modeled as a bandit player with |Θ| = 300 levers to choose from, where each lever is a parameter matrix θ ∈ Θ. We assume that pulling lever θ produces a normally distributed reward r ~ N(φ_θ, 1/λ), where φ_θ ∈ R is the mean specific to lever θ and λ > 0 is a known precision term that is common to all levers and all bandits. Thus, a bandit is fully specified by the vector φ = [φ_θ]_{θ ∈ Θ} of all its means. To include all possible bandits we use Φ = R^Θ. For each φ ∈ Φ, the likelihood model is

    P(r_t \mid \phi, \theta_{1:t}, r_{1:t-1}) = P(r_t \mid \phi, \theta_t) = \mathcal{N}(r_t; \phi_{\theta_t}, 1/\lambda),

and the policy is P(θ_t | φ) = 1 if θ_t = argmax_θ {φ_θ} and zero otherwise. Because the likelihood is normal, we place a conjugate prior P(φ_θ) = N(φ_θ; m_θ, 1/p_θ) over Φ, where m_θ and p_θ are the mean and precision hyperparameters. This allows an easy update of the posterior after obtaining a reward [4].
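To make the setup and Algorithm 1 concrete, the following NumPy sketch instantiates the whole loop under our own illustrative assumptions rather than the paper's exact configuration: the unknown reward-mean function h is replaced by a fixed hidden mean per policy, the reward noise has unit variance so that it matches the critic's known precision λ = 1, and the prior hyperparameters are m_θ = 0, p_θ = 1. None of these constants come from the paper; the sketch only mirrors its structure (critic sampling, actor execution, conjugate update, parallel state updates).

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, N_POLICIES, LAM = 10, 300, 1.0       # action/observation dimension, |Theta|, known precision lambda

    def squash(x):
        # Component-wise sigmoid into (-1, 1): 2 / (1 + exp(-x)) - 1.
        return 2.0 / (1.0 + np.exp(-x)) - 1.0

    mu = rng.normal(size=(2 * DIM, 3 * DIM))                  # environment parameter matrix (20 x 30)
    thetas = rng.normal(size=(N_POLICIES, 2 * DIM, 3 * DIM))  # 300 randomly sampled policy matrices

    # Stand-in for the unknown reward-mean function h(theta): one fixed hidden mean per policy.
    hidden_means = rng.normal(size=N_POLICIES)

    m = np.zeros(N_POLICIES)   # prior means m_theta of the conjugate Normal prior
    p = np.ones(N_POLICIES)    # prior precisions p_theta

    o_prev, q_prev = np.zeros(DIM), np.zeros(DIM)    # environment observation and hidden state
    s = np.zeros((N_POLICIES, DIM))                  # internal states s_t^theta of all policies
    acts = np.zeros((N_POLICIES, DIM))               # candidate actions of all policies
    total = 0.0

    for t in range(5000):
        # Critic (Alg. 1, lines 4-5): sample one mean per lever from its posterior, pull the best lever.
        phi = rng.normal(m, 1.0 / np.sqrt(p))
        k = int(np.argmax(phi))

        # Actor (line 6): the chosen policy's action, from [a_t, s_t] = f(theta [a_{t-1}, o_{t-1}, s_{t-1}]).
        a_t = acts[k]

        # Environment (line 7): [o_t, q_t] = f(mu [a_t, o_{t-1}, q_{t-1}]), plus a noisy reward.
        env = squash(mu @ np.concatenate([a_t, o_prev, q_prev]))
        o_t, q_t = env[:DIM], env[DIM:]
        r_t = hidden_means[k] + rng.normal()         # r_t = h(theta_t) + nu_t (illustrative h)
        total += r_t

        # Critic update (line 8): conjugate Normal update of the pulled lever's mean.
        m[k] = (p[k] * m[k] + LAM * r_t) / (p[k] + LAM)
        p[k] += LAM

        # Actor update (line 9): run all I/O dynamics in parallel on the shared interaction (a_t, o_t).
        joint = np.hstack([np.tile(np.concatenate([a_t, o_t]), (N_POLICIES, 1)), s])
        outs = squash(np.einsum('kij,kj->ki', thetas, joint))
        acts, s = outs[:, :DIM], outs[:, DIM:]

        o_prev, q_prev = o_t, q_t
        if (t + 1) % 1000 == 0:
            print(f"t={t + 1}: time-averaged reward = {total / (t + 1):.3f}")

Under these toy assumptions the printed time-averaged reward climbs towards the best hidden mean, mirroring the exploration-to-exploitation transition that Fig. 2 reports for the paper's own environment.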

To assess the performance of our algorithm, we averaged a total of 100 runs of 5000 time steps each. Fig. 2 shows the resulting performance curve. It can be seen that the interaction moves from an exploratory to an exploitative phase, converging towards the optimal performance.

Fig. 2. Time-averaged reward of the adaptive system versus optimum performance.

4 Discussion and Conclusion

The main contribution of this paper is to show how to separate the planning problem from the underlying I/O dynamics into the critic and the actor respectively, reducing a complex planning problem to a simple multi-armed bandit problem. The critic is a bandit player based on the Bayesian control rule, while the actor is treated as a black box, possibly implementing arbitrarily complex policies.

There are important differences between our approach and other actor-critic methods. First, current actor-critic algorithms critically depend on the state-space view of the environment (see for instance [3, 9, 5]). In our opinion, this view leads to an entanglement of planning and dynamics that renders the RL problem far more difficult than necessary; we argue that the separation proposed here allows tackling domains that are otherwise intractable. Second, current reinforcement learning algorithms rely on constructing a point estimate of the optimal policy, which is intractable when done exactly and very costly even when only approximated. In contrast, we use the Bayesian control rule to maintain a distribution over optimal policies that is refined on-line as more observations become available.

Additional experimental work is required to investigate the scalability of our actor-critic scheme to larger and more realistic domains.

References

1. A. Barto, R. Sutton, and C. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 1983.
2. D. A. Berry and B. Fristedt. Bandit Problems: Sequential Allocation of Experiments. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1985.
3. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
4. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
5. M. Ghavamzadeh and Y. Engel. Bayesian actor-critic algorithms. In Proceedings of the 24th International Conference on Machine Learning, 2007.
6. J. C. Gittins. Multi-Armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. John Wiley & Sons, Ltd., Chichester, 1989.
7. P. A. Ortega and D. A. Braun. A Bayesian rule for adaptive control based on causal interventions. In Proceedings of the Third Conference on Artificial General Intelligence, pages 121-126, Paris, 2010. Atlantis Press.
8. P. A. Ortega and D. A. Braun. A minimum relative entropy principle for learning and acting. Journal of Artificial Intelligence Research, 38:475-511, 2010.
9. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.