Lecture 5: Model-Free Control



Lecture 5: Model-Free Control David Silver

Outline 1 Introduction 2 On-Policy Monte-Carlo Control 3 On-Policy Temporal-Difference Learning 4 Off-Policy Learning 5 Summary

Introduction Model-Free Reinforcement Learning
Last lecture: model-free prediction, estimating the value function of an unknown MDP.
This lecture: model-free control, optimising the value function of an unknown MDP.

Introduction Uses of Model-Free Control
Some example problems that can be modelled as MDPs: elevator, parallel parking, ship steering, bioreactor, helicopter, aeroplane logistics, Robocup soccer, Quake, portfolio management, protein folding, robot walking, the game of Go.
For most of these problems, either the MDP model is unknown but experience can be sampled, or the MDP model is known but is too big to use except by samples.
Model-free control can solve these problems.

Introduction On and Off-Policy Learning
On-policy learning: "learn on the job". Learn about policy π from experience sampled from π.
Off-policy learning: "look over someone's shoulder". Learn about policy π from experience sampled from µ.

On-Policy Monte-Carlo Control Generalised Policy Iteration (Refresher)
Policy evaluation: estimate v_π, e.g. iterative policy evaluation.
Policy improvement: generate π′ ≥ π, e.g. greedy policy improvement.

On-Policy Monte-Carlo Control Generalised Policy Iteration with Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, V = v_π?
Policy improvement: greedy policy improvement?

On-Policy Monte-Carlo Control Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over V(s) requires a model of the MDP:
    π′(s) = argmax_{a∈A} ( R^a_s + Σ_{s′} P^a_{ss′} V(s′) )
Greedy policy improvement over Q(s, a) is model-free:
    π′(s) = argmax_{a∈A} Q(s, a)

On-Policy Monte-Carlo Control Generalised Policy Iteration with Action-Value Function
[GPI diagram: starting from (Q, π), alternate Q = q_π and π = greedy(Q), converging to (q_*, π_*).]
Policy evaluation: Monte-Carlo policy evaluation, Q = q_π.
Policy improvement: greedy policy improvement?

On-Policy Monte-Carlo Control Exploration Example of Greedy Action Selection
There are two doors in front of you.
You open the left door and get reward 0: V(left) = 0.
You open the right door and get reward +1: V(right) = +1.
You open the right door and get reward +3: V(right) = +2.
You open the right door and get reward +2: V(right) = +2.
Are you sure you've chosen the best door?

On-Policy Monte-Carlo Control Exploration ɛ-Greedy Exploration
Simplest idea for ensuring continual exploration: all m actions are tried with non-zero probability.
With probability 1 − ɛ choose the greedy action; with probability ɛ choose an action at random:
    π(a|s) = ɛ/m + 1 − ɛ   if a = argmax_{a′∈A} Q(s, a′)
    π(a|s) = ɛ/m           otherwise
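
As an illustration (not part of the lecture), here is a minimal Python sketch of ɛ-greedy action selection over a tabular Q, assuming Q is a NumPy array indexed as Q[state, action]; the function name and array layout are assumptions made for illustration.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """epsilon-greedy selection: with probability epsilon pick uniformly among
    all m actions, otherwise pick the greedy action. The resulting probabilities
    are epsilon/m + 1 - epsilon for the greedy action and epsilon/m otherwise."""
    m = Q.shape[1]                        # number of actions
    if rng.random() < epsilon:
        return int(rng.integers(m))       # explore: uniform random action
    return int(np.argmax(Q[state]))       # exploit: greedy action
```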

On-Policy Monte-Carlo Control Exploration ɛ-Greedy Policy Improvement
Theorem: For any ɛ-greedy policy π, the ɛ-greedy policy π′ with respect to q_π is an improvement, v_π′(s) ≥ v_π(s).
Proof:
    q_π(s, π′(s)) = Σ_{a∈A} π′(a|s) q_π(s, a)
                  = ɛ/m Σ_{a∈A} q_π(s, a) + (1 − ɛ) max_{a∈A} q_π(s, a)
                  ≥ ɛ/m Σ_{a∈A} q_π(s, a) + (1 − ɛ) Σ_{a∈A} [(π(a|s) − ɛ/m) / (1 − ɛ)] q_π(s, a)
                  = Σ_{a∈A} π(a|s) q_π(s, a) = v_π(s)
Therefore, from the policy improvement theorem, v_π′(s) ≥ v_π(s).

On-Policy Monte-Carlo Control Exploration Monte-Carlo Policy Iteration
[GPI diagram: starting from (Q, π), alternate Q = q_π and π = ɛ-greedy(Q), converging to (q_*, π_*).]
Policy evaluation: Monte-Carlo policy evaluation, Q = q_π.
Policy improvement: ɛ-greedy policy improvement.

On-Policy Monte-Carlo Control Exploration Monte-Carlo Control
[Diagram: starting from Q, alternate Q ≈ q_π and π = ɛ-greedy(Q), converging to (q_*, π_*).]
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, Q ≈ q_π.
Policy improvement: ɛ-greedy policy improvement.

On-Policy Monte-Carlo Control GLIE
Definition: Greedy in the Limit with Infinite Exploration (GLIE).
All state-action pairs are explored infinitely many times:
    lim_{k→∞} N_k(s, a) = ∞
The policy converges on a greedy policy:
    lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈A} Q_k(s, a′))
For example, ɛ-greedy is GLIE if ɛ reduces to zero at ɛ_k = 1/k.

On-Policy Monte-Carlo Control GLIE GLIE Monte-Carlo Control
Sample the kth episode using π: {S_1, A_1, R_2, ..., S_T} ~ π.
For each state S_t and action A_t in the episode:
    N(S_t, A_t) ← N(S_t, A_t) + 1
    Q(S_t, A_t) ← Q(S_t, A_t) + (1 / N(S_t, A_t)) (G_t − Q(S_t, A_t))
Improve the policy based on the new action-value function:
    ɛ ← 1/k
    π ← ɛ-greedy(Q)
Theorem: GLIE Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q_*(s, a).
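
To make the loop concrete, below is a rough Python sketch of GLIE Monte-Carlo control as described above, with ɛ_k = 1/k and 1/N(S_t, A_t) step-sizes. It assumes a hypothetical episodic environment exposing reset() and step(action) -> (next_state, reward, done); that interface and the epsilon_greedy helper from the earlier sketch are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, gamma=1.0, n_episodes=10_000):
    """GLIE Monte-Carlo control sketch (every-visit updates)."""
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))            # visit counts N(s, a)
    rng = np.random.default_rng(0)

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                          # GLIE schedule: eps_k = 1/k
        # Sample episode {S_1, A_1, R_2, ..., S_T} under the current policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Update Q(S_t, A_t) towards the return G_t with step-size 1/N(S_t, A_t)
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q
```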

On-Policy Monte-Carlo Control Blackjack Example Back to the Blackjack Example

On-Policy Monte-Carlo Control Blackjack Example Monte-Carlo Control in Blackjack

On-Policy Temporal-Difference Learning MC vs. TD Control
Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC): lower variance, online, works with incomplete sequences.
Natural idea: use TD instead of MC in our control loop. Apply TD to Q(S, A), use ɛ-greedy policy improvement, and update every time-step.

On-Policy Temporal-Difference Learning Sarsa(λ) Updating Action-Value Functions with Sarsa
[Backup diagram: from (S, A), observe reward R and next state S′, then select next action A′.]
    Q(S, A) ← Q(S, A) + α ( R + γ Q(S′, A′) − Q(S, A) )

On-Policy Temporal-Difference Learning Sarsa(λ) On-Policy Control With Sarsa
[Diagram: starting from Q, alternate Q ≈ q_π and π = ɛ-greedy(Q), converging to (q_*, π_*).]
Every time-step:
Policy evaluation: Sarsa, Q ≈ q_π.
Policy improvement: ɛ-greedy policy improvement.

On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa Algorithm for On-Policy Control
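
The algorithm box on this slide is a figure; as a stand-in, here is a minimal tabular Sarsa control sketch under the same assumed env/epsilon_greedy interface as the earlier snippets. The fixed ɛ and α are simplifications for brevity; the convergence conditions on the next slide require decaying schedules.

```python
import numpy as np

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=1.0,
          epsilon=0.1, n_episodes=5_000):
    """On-policy TD control: choose A' with the same epsilon-greedy policy
    being evaluated, then update Q(S, A) towards R + gamma * Q(S', A')."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    for _ in range(n_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # Sarsa update: Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```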

On-Policy Temporal-Difference Learning Sarsa(λ) Convergence of Sarsa
Theorem: Sarsa converges to the optimal action-value function, Q(s, a) → q_*(s, a), under the following conditions:
GLIE sequence of policies π_t(a|s)
Robbins-Monro sequence of step-sizes α_t:
    Σ_{t=1}^∞ α_t = ∞
    Σ_{t=1}^∞ α_t² < ∞

On-Policy Temporal-Difference Learning Sarsa(λ) Windy Gridworld Example
Reward = -1 per time-step until reaching the goal; undiscounted.

On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa on the Windy Gridworld

On-Policy Temporal-Difference Learning Sarsa(λ) n-Step Sarsa
Consider the following n-step returns for n = 1, 2, ..., ∞:
    n = 1 (Sarsa):  q_t^(1) = R_{t+1} + γ Q(S_{t+1})
    n = 2:          q_t^(2) = R_{t+1} + γ R_{t+2} + γ² Q(S_{t+2})
    ...
    n = ∞ (MC):     q_t^(∞) = R_{t+1} + γ R_{t+2} + ... + γ^{T−1} R_T
Define the n-step Q-return:
    q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})
n-step Sarsa updates Q(s, a) towards the n-step Q-return:
    Q(S_t, A_t) ← Q(S_t, A_t) + α ( q_t^(n) − Q(S_t, A_t) )
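
A small sketch of computing the n-step Q-return from a stored episode; the list layout (rewards[k] holding R_{k+1}) is an assumption for illustration, and the bootstrap term is taken from Q(S_{t+n}, A_{t+n}), the pair actually visited, which is what the shorthand Q(S_{t+n}) above refers to.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + ... + gamma^(n-1) R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n}).
    If t + n reaches the end of the episode, this falls back to the MC return."""
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]         # rewards[k] holds R_{k+1}
        discount *= gamma
    if t + n < T:                          # bootstrap only if S_{t+n} is non-terminal
        G += discount * Q[states[t + n], actions[t + n]]
    return G
```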

On-Policy Temporal-Difference Learning Sarsa(λ) Forward View Sarsa(λ)
The q^λ return combines all n-step Q-returns q_t^(n), using weight (1 − λ) λ^(n−1):
    q_t^λ = (1 − λ) Σ_{n=1}^∞ λ^(n−1) q_t^(n)
Forward-view Sarsa(λ):
    Q(S_t, A_t) ← Q(S_t, A_t) + α ( q_t^λ − Q(S_t, A_t) )
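
Continuing the sketches above, the forward-view q_t^λ can be computed for a finite episode by truncating the infinite sum: every n-step return with t + n ≥ T equals the Monte-Carlo return, so their combined weight λ^(T−t−1) is assigned in one final term. This reuses the hypothetical n_step_q_return helper and is an illustrative sketch, not code from the lecture.

```python
def lambda_q_return(rewards, states, actions, Q, t, lam, gamma):
    """q_t^lambda = (1 - lam) * sum_{n>=1} lam^(n-1) * q_t^(n), truncated at episode end."""
    T = len(rewards)
    G_lambda, weight = 0.0, 1.0 - lam
    for n in range(1, T - t):              # n-step returns that still bootstrap from Q
        G_lambda += weight * n_step_q_return(rewards, states, actions, Q, t, n, gamma)
        weight *= lam
    # all n >= T - t give the MC return; their total weight is lam^(T-t-1)
    G_lambda += (lam ** (T - t - 1)) * n_step_q_return(
        rewards, states, actions, Q, t, T - t, gamma)
    return G_lambda
```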

On-Policy Temporal-Difference Learning Sarsa(λ) Backward View Sarsa(λ)
Just like TD(λ), we use eligibility traces in an online algorithm, but Sarsa(λ) has one eligibility trace for each state-action pair:
    E_0(s, a) = 0
    E_t(s, a) = γλ E_{t−1}(s, a) + 1(S_t = s, A_t = a)
Q(s, a) is updated for every state s and action a, in proportion to the TD-error δ_t and the eligibility trace E_t(s, a):
    δ_t = R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)
    Q(s, a) ← Q(s, a) + α δ_t E_t(s, a)
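
For concreteness, a sketch of a single backward-view Sarsa(λ) time-step over tabular NumPy arrays Q and E (the accumulating-trace variant shown above); the in-place array updates and argument names are illustrative assumptions.

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    """Apply one backward-view Sarsa(lambda) update to every state-action pair."""
    # TD error delta_t for the observed transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})
    delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
    E[s, a] += 1.0                 # accumulate the trace for the visited pair
    Q += alpha * delta * E         # update all pairs in proportion to their traces
    E *= gamma * lam               # decay all traces
    return Q, E
```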

On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa(λ) Algorithm

On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa(λ) Gridworld Example

Off-Policy Learning
Evaluate target policy π(a|s) to compute v_π(s) or q_π(s, a), while following behaviour policy µ(a|s):
    {S_1, A_1, R_2, ..., S_T} ~ µ
Why is this important?
Learn from observing humans or other agents.
Re-use experience generated from old policies π_1, π_2, ..., π_{t−1}.
Learn about the optimal policy while following an exploratory policy.
Learn about multiple policies while following one policy.

Off-Policy Learning Importance Sampling
Estimate the expectation of a different distribution:
    E_{X~P}[f(X)] = Σ P(X) f(X)
                  = Σ Q(X) (P(X) / Q(X)) f(X)
                  = E_{X~Q}[(P(X) / Q(X)) f(X)]

Off-Policy Learning Importance Sampling for Off-Policy Monte-Carlo
Use returns generated from µ to evaluate π.
Weight return G_t according to the similarity between policies, multiplying importance sampling corrections along the whole episode:
    G_t^{π/µ} = (π(A_t|S_t) / µ(A_t|S_t)) (π(A_{t+1}|S_{t+1}) / µ(A_{t+1}|S_{t+1})) ... (π(A_T|S_T) / µ(A_T|S_T)) G_t
Update the value towards the corrected return:
    V(S_t) ← V(S_t) + α ( G_t^{π/µ} − V(S_t) )
Cannot be used if µ is zero when π is non-zero.
Importance sampling can dramatically increase variance.
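
A rough sketch of the Monte-Carlo importance-sampling correction for one episode collected under µ; pi and mu are assumed to be arrays of action probabilities indexed as pi[state, action], an illustrative choice rather than anything specified in the lecture.

```python
def off_policy_mc_update(V, episode, pi, mu, alpha, gamma):
    """Move V(S_t) towards the corrected return G_t^{pi/mu} for every step of
    an episode [(S_t, A_t, R_{t+1}), ...] generated by the behaviour policy mu."""
    G, rho = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G                          # return G_t
        rho *= pi[state, action] / mu[state, action]    # product of corrections from t onwards
        V[state] += alpha * (rho * G - V[state])        # requires mu > 0 wherever pi > 0
    return V
```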

Off-Policy Learning Importance Sampling for Off-Policy TD
Use TD targets generated from µ to evaluate π.
Weight the TD target R + γV(S′) by importance sampling; only a single importance sampling correction is needed:
    V(S_t) ← V(S_t) + α ( (π(A_t|S_t) / µ(A_t|S_t)) (R_{t+1} + γ V(S_{t+1})) − V(S_t) )
Much lower variance than Monte-Carlo importance sampling.
Policies only need to be similar over a single step.
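
And the corresponding one-step importance-sampled TD(0) update, under the same assumed pi/mu array layout as the previous sketch:

```python
def off_policy_td_update(V, s, a, r, s_next, done, pi, mu, alpha, gamma):
    """TD(0) update towards (pi(a|s)/mu(a|s)) * (R + gamma * V(S'))."""
    rho = pi[s, a] / mu[s, a]                           # single-step correction
    target = rho * (r + (0.0 if done else gamma * V[s_next]))
    V[s] += alpha * (target - V[s])
    return V
```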

Off-Policy Learning Q-Learning
We now consider off-policy learning of action-values Q(s, a); no importance sampling is required.
The next action is chosen using the behaviour policy: A_{t+1} ~ µ(·|S_{t+1}).
But we consider an alternative successor action A′ ~ π(·|S_{t+1}),
and update Q(S_t, A_t) towards the value of the alternative action:
    Q(S_t, A_t) ← Q(S_t, A_t) + α ( R_{t+1} + γ Q(S_{t+1}, A′) − Q(S_t, A_t) )

Off-Policy Learning Off-Policy Control with Q-Learning
We now allow both behaviour and target policies to improve.
The target policy π is greedy w.r.t. Q(s, a):
    π(S_{t+1}) = argmax_{a′} Q(S_{t+1}, a′)
The behaviour policy µ is e.g. ɛ-greedy w.r.t. Q(s, a).
The Q-learning target then simplifies:
    R_{t+1} + γ Q(S_{t+1}, A′) = R_{t+1} + γ Q(S_{t+1}, argmax_{a′} Q(S_{t+1}, a′))
                               = R_{t+1} + γ max_{a′} Q(S_{t+1}, a′)

Off-Policy Learning Q-Learning Control Algorithm
[Backup diagram: from (S, A), observe reward R and next state S′, then take the maximum over successor actions a′.]
    Q(S, A) ← Q(S, A) + α ( R + γ max_{a′∈A} Q(S′, a′) − Q(S, A) )
Theorem: Q-learning control converges to the optimal action-value function, Q(s, a) → q_*(s, a).
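
Finally, a minimal tabular Q-learning sketch: the behaviour policy is ɛ-greedy while the target policy is greedy, which is why the update bootstraps from the max over successor actions. It uses the same assumed env/epsilon_greedy interface as the earlier sketches, not anything prescribed by the lecture.

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=1.0,
               epsilon=0.1, n_episodes=5_000):
    """Off-policy TD control: act epsilon-greedily, update towards the greedy target."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)       # behaviour policy mu
            next_state, reward, done = env.step(action)
            # greedy target policy pi: bootstrap from max_a' Q(S', a')
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```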

Off-Policy Learning Q-Learning Q-Learning Algorithm for Off-Policy Control

Off-Policy Learning Q-Learning Demo

Off-Policy Learning Q-Learning Cliff Walking Example

Summary Relationship Between DP and TD
[Figure: full-backup vs. sample-backup diagrams for each Bellman equation.]
Bellman Expectation Equation for v_π(s):  Full Backup (DP): Iterative Policy Evaluation  |  Sample Backup (TD): TD Learning
Bellman Expectation Equation for q_π(s, a):  Full Backup (DP): Q-Policy Iteration  |  Sample Backup (TD): Sarsa
Bellman Optimality Equation for q_*(s, a):  Full Backup (DP): Q-Value Iteration  |  Sample Backup (TD): Q-Learning

Summary Relationship Between DP and TD (2)
Full Backup (DP)  vs  Sample Backup (TD):
Iterative Policy Evaluation:  V(s) ← E[R + γ V(S′) | s]   vs   TD Learning:  V(S) ←α R + γ V(S′)
Q-Policy Iteration:  Q(s, a) ← E[R + γ Q(S′, A′) | s, a]   vs   Sarsa:  Q(S, A) ←α R + γ Q(S′, A′)
Q-Value Iteration:  Q(s, a) ← E[R + γ max_{a′∈A} Q(S′, a′) | s, a]   vs   Q-Learning:  Q(S, A) ←α R + γ max_{a′∈A} Q(S′, a′)
where x ←α y  means  x ← x + α(y − x)

Summary Questions?