Markov Decision Processes and Exact Solution Methods: Value Iteration, Policy Iteration, Linear Programming

Markov Decision Processes and Exact Solution Methods: Value Iteration, Policy Iteration, Linear Programming. Pieter Abbeel, UC Berkeley EECS

Markov Decision Process Assumption: agent gets to observe the state [Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]

Markov Decision Process (S, A, T, R, H). Given:
S: set of states
A: set of actions
T: S x A x S x {0, 1, …, H} → [0, 1], T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
R: S x A x S x {0, 1, …, H} → ℝ, R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
H: horizon over which the agent will act
Goal: find π : S x {0, 1, …, H} → A that maximizes the expected sum of rewards, i.e., π* = argmax_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]
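
As a concrete (hypothetical) illustration of the definition above, the tuple can be stored as numpy arrays; the sizes, the toy two-state dynamics, and the names T, R, H below are assumptions of this sketch, not part of the slides (a time-dependent model would add a leading time index):

    import numpy as np

    nS, nA, H = 2, 2, 10                 # toy sizes and horizon, assumed for illustration
    T = np.zeros((nS, nA, nS))           # T[s, a, s'] = P(s_{t+1} = s' | s_t = s, a_t = a)
    R = np.zeros((nS, nA, nS))           # R[s, a, s'] = reward for the transition (s, a, s')

    for s in range(nS):
        T[s, 0, s] = 1.0                 # action 0: stay put
        T[s, 1, 1 - s] = 0.9             # action 1: switch to the other state with prob. 0.9 ...
        T[s, 1, s] = 0.1                 # ... and stay with prob. 0.1
    R[:, :, 1] = 1.0                     # reward 1 whenever the next state is state 1

    assert np.allclose(T.sum(axis=2), 1.0)   # every T[s, a, :] must be a probability distribution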

Examples. MDP (S, A, T, R, H), goal: max_π E[ Σ_t R_t ]. Cleaning robot; walking robot; pole balancing; games: Tetris, backgammon; server management; shortest path problems; models for animals, people.

Canonical Example: Grid World. The agent lives in a grid; walls block the agent's path. The agent's actions do not always go as planned: 80% of the time, the action North takes the agent North (if there is no wall there); 10% of the time, North takes the agent West; 10% East. If there is a wall in the direction the agent would have been taken, the agent stays put. Big rewards come at the end.
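
The 80/10/10 noise model can be written down directly; the direction encoding and the function name below are hypothetical, and the wall handling (staying put) would be applied when the full transition model T is assembled:

    NORTH, EAST, SOUTH, WEST = 0, 1, 2, 3

    def action_outcomes(intended):
        """Return [(actual direction, probability), ...] for one intended move:
        80% as planned, 10% slip 90 degrees left, 10% slip 90 degrees right."""
        left = (intended - 1) % 4        # e.g. North -> West
        right = (intended + 1) % 4       # e.g. North -> East
        return [(intended, 0.8), (left, 0.1), (right, 0.1)]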

Solving MDPs. In an MDP, we want an optimal policy π* : S x {0, …, H} → A. A policy π gives an action for each state for each time (figure: grid snapshots for t = 0, 1, …, 5 = H). An optimal policy maximizes the expected sum of rewards. Contrast: in a deterministic problem, we want an optimal plan, or sequence of actions, from start to a goal.

Outline. Optimal Control = given an MDP (S, A, T, R, γ, H), find the optimal policy π*. Exact Methods: Value Iteration, Policy Iteration, Linear Programming. For now: discrete state-action spaces, as they are simpler to get the main concepts across. We will consider continuous spaces later!

Value Iteration. Algorithm: start with V_0*(s) = 0 for all s. For i = 1, …, H, given V_{i-1}*, calculate for all states s ∈ S:
V_i*(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_{i-1}*(s') ]
This is called a value update or Bellman update/back-up. V_i*(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps.
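
A minimal numpy sketch of this loop (assuming the toy T, R arrays from the earlier sketch and a discount gamma; not code from the slides):

    import numpy as np

    def value_iteration(T, R, H, gamma=0.9):
        """Finite-horizon value iteration.
        T, R: arrays of shape (nS, nA, nS). Returns V_i* for i = 0..H and the
        greedy policy for each horizon length i = 1..H."""
        nS, nA, _ = T.shape
        V = np.zeros(nS)                                   # V_0*(s) = 0 for all s
        values, policies = [V], []
        for i in range(1, H + 1):
            # Bellman back-up: Q[s, a] = sum_{s'} T(s,a,s') (R(s,a,s') + gamma * V_{i-1}*(s'))
            Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
            policies.append(Q.argmax(axis=1))              # optimal action with i steps to go
            V = Q.max(axis=1)                              # V_i*
            values.append(V)
        return values, policies

With the toy arrays from the earlier sketch, values, policies = value_iteration(T, R, H) returns V_0*, …, V_H* and the corresponding greedy actions per horizon.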

Value Iteration in Gridworld (figures: value estimates after successive iterations of value iteration). noise = 0.2, γ = 0.9, two terminal states with R = +1 and -1.

Exercise 1: Effect of discount, noise. Match each preference to a parameter setting:
(a) Prefer the close exit (+1), risking the cliff (-10)
(b) Prefer the close exit (+1), but avoiding the cliff (-10)
(c) Prefer the distant exit (+10), risking the cliff (-10)
(d) Prefer the distant exit (+10), avoiding the cliff (-10)
(1) γ = 0.1, noise = 0.5
(2) γ = 0.99, noise = 0
(3) γ = 0.99, noise = 0.5
(4) γ = 0.1, noise = 0

Exercise 1 Solution. (a) Prefer close exit (+1), risking the cliff (-10): γ = 0.1, noise = 0

Exercise 1 Solution. (b) Prefer close exit (+1), avoiding the cliff (-10): γ = 0.1, noise = 0.5

Exercise 1 Solution. (c) Prefer distant exit (+10), risking the cliff (-10): γ = 0.99, noise = 0

Exercise 1 Solution. (d) Prefer distant exit (+10), avoiding the cliff (-10): γ = 0.99, noise = 0.5

Value Iteration Convergence. Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite horizon problem, which satisfies the Bellman equations:
V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Now we know how to act for infinite horizon with discounted rewards! Run value iteration till convergence. This produces V*, which in turn tells us how to act, namely by following:
π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
Note: the infinite horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
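
A sketch of the infinite-horizon procedure (run back-ups until the change is below a threshold, then act greedily with respect to the result); the stopping threshold and names are assumptions of this sketch:

    import numpy as np

    def value_iteration_to_convergence(T, R, gamma, eps=1e-6):
        """Iterate Bellman back-ups until the max-norm change is below eps,
        then return the (approximate) V* and the stationary greedy policy."""
        nS, nA, _ = T.shape
        V = np.zeros(nS)
        while True:
            Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < eps:     # small change => close to V* (see next slide)
                return V_new, Q.argmax(axis=1)      # stationary greedy policy pi*(s)
            V = V_new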

Convergence and Contractions. Define the max-norm: ||V|| = max_s |V(s)|.
Theorem: for any two approximations U^i and V^i,
||U^{i+1} - V^{i+1}|| ≤ γ ||U^i - V^i||
I.e., any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true V*, and value iteration converges to a unique, stable, optimal solution.
Theorem:
||V^{i+1} - V^i|| < ε ⇒ ||V^{i+1} - V*|| < 2εγ / (1 - γ)
I.e., once the change in our approximation is small, it must also be close to correct.

Outline. Optimal Control = given an MDP (S, A, T, R, γ, H), find the optimal policy π*. Exact Methods: Value Iteration, Policy Iteration, Linear Programming. For now: discrete state-action spaces, as they are simpler to get the main concepts across. We will consider continuous spaces later!

Policy Evaluation. Recall value iteration iterates:
V_i*(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V_{i-1}*(s') ]
Policy evaluation (for a fixed policy π):
V_i^π(s) ← Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V_{i-1}^π(s') ]
At convergence, for all s:
V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
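
The fixed-policy update above, written as a numpy sketch (array shapes and names carried over from the earlier sketches; pi is an integer action per state):

    import numpy as np

    def policy_evaluation(T, R, gamma, pi, eps=1e-8):
        """Iterate the fixed-policy Bellman update until convergence; returns V^pi."""
        nS = T.shape[0]
        idx = np.arange(nS)
        T_pi = T[idx, pi]                # T_pi[s, s'] = T(s, pi(s), s')
        R_pi = R[idx, pi]                # R_pi[s, s'] = R(s, pi(s), s')
        V = np.zeros(nS)
        while True:
            V_new = (T_pi * (R_pi + gamma * V[None, :])).sum(axis=1)
            if np.max(np.abs(V_new - V)) < eps:
                return V_new
            V = V_new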

Exercise 2

Policy Iteration. Alternative approach: Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence. Step 2: policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values. Repeat steps until the policy converges. This is policy iteration. It's still optimal! Can converge faster under some conditions.
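
A compact sketch of the two alternating steps; the evaluation step here is done exactly by solving the linear system (I - γ T_π) V = r_π, anticipating "Idea 2" on the next slide. Names and array shapes are assumptions carried over from the earlier sketches:

    import numpy as np

    def policy_iteration(T, R, gamma):
        """Alternate exact policy evaluation and greedy policy improvement."""
        nS, nA, _ = T.shape
        idx = np.arange(nS)
        pi = np.zeros(nS, dtype=int)                      # start from an arbitrary policy
        while True:
            # Step 1: policy evaluation by solving (I - gamma * T_pi) V = r_pi
            T_pi = T[idx, pi]
            r_pi = (T_pi * R[idx, pi]).sum(axis=1)
            V = np.linalg.solve(np.eye(nS) - gamma * T_pi, r_pi)
            # Step 2: policy improvement by one-step look-ahead
            Q = (T * (R + gamma * V[None, None, :])).sum(axis=2)
            pi_new = Q.argmax(axis=1)
            if np.array_equal(pi_new, pi):                # policy stable => optimal
                return pi, V
            pi = pi_new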

Policy Evaluation Revisited. Idea 1: modify the Bellman updates, as above. Idea 2: it's just a linear system, solve with Matlab (or whatever); variables: V^π(s), constants: T, R.

Policy Iteration Guarantees. Policy iteration iterates over:
Policy evaluation: V^{π_k}(s) = Σ_{s'} T(s, π_k(s), s') [ R(s, π_k(s), s') + γ V^{π_k}(s') ]
Policy improvement: π_{k+1}(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ]
Theorem. Policy iteration is guaranteed to converge, and at convergence, the current policy and its value function are the optimal policy and the optimal value function!
Proof sketch:
(1) Guaranteed to converge: in every step the policy improves. This means that a given policy can be encountered at most once. This means that after we have iterated as many times as there are different policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: by definition of convergence, at convergence π_{k+1}(s) = π_k(s) for all states s. This means
V^{π_k}(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^{π_k}(s') ]
Hence V^{π_k} satisfies the Bellman equation, which means V^{π_k} is equal to the optimal value function V*.

Outline. Optimal Control = given an MDP (S, A, T, R, γ, H), find the optimal policy π*. Exact Methods: Value Iteration, Policy Iteration, Linear Programming. For now: discrete state-action spaces, as they are simpler to get the main concepts across. We will consider continuous spaces later!

Infinite Horizon Linear Program. Recall, at value iteration convergence we have, for all s:
V*(s) = max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V*(s') ]
LP formulation to find V*:
min_V Σ_s µ_0(s) V(s)
subject to: V(s) ≥ Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V(s') ] for all s ∈ S, a ∈ A
µ_0 is a probability distribution over S, with µ_0(s) > 0 for all s ∈ S.
Theorem. V* is the solution to the above LP.
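
A sketch of this LP with scipy.optimize.linprog, written in the inequality form linprog expects; the uniform choice of µ_0 and all names are assumptions of this sketch:

    import numpy as np
    from scipy.optimize import linprog

    def solve_primal_lp(T, R, gamma):
        """min_V sum_s mu0(s) V(s)  s.t.  V(s) >= sum_{s'} T(s,a,s') (R(s,a,s') + gamma V(s'))."""
        nS, nA, _ = T.shape
        mu0 = np.full(nS, 1.0 / nS)                       # any distribution with mu0(s) > 0
        exp_r = (T * R).sum(axis=2)                       # expected immediate reward, shape (nS, nA)
        # Constraint for each (s, a):  gamma * T[s, a, :] . V  -  V(s)  <=  -exp_r[s, a]
        A_ub = np.zeros((nS * nA, nS))
        b_ub = np.zeros(nS * nA)
        for s in range(nS):
            for a in range(nA):
                row = s * nA + a
                A_ub[row] = gamma * T[s, a]
                A_ub[row, s] -= 1.0
                b_ub[row] = -exp_r[s, a]
        res = linprog(c=mu0, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
        return res.x                                      # the optimal value function V*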

Theorem Proof

Dual Linear Program.
max_λ Σ_{s,a} λ(s, a) Σ_{s'} T(s, a, s') R(s, a, s')    (Equation 1)
subject to: λ(s, a) ≥ 0, and for all s': Σ_a λ(s', a) = µ_0(s') + γ Σ_{s,a} λ(s, a) T(s, a, s')    (Equation 2)
Interpretation: λ(s, a) = expected discounted number of times action a is taken in state s. Equation 2: ensures λ has the above meaning. Equation 1: maximize expected discounted sum of rewards.
Optimal policy: π*(s) = argmax_a λ(s, a).
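
A corresponding sketch of the dual (decision variables λ(s, a), flattened into one vector; again, the uniform µ_0 and the names are assumptions of this sketch):

    import numpy as np
    from scipy.optimize import linprog

    def solve_dual_lp(T, R, gamma):
        """max_lambda sum_{s,a} lambda(s,a) sum_{s'} T(s,a,s') R(s,a,s')
           s.t. lambda >= 0 and, for all s':
                sum_a lambda(s',a) = mu0(s') + gamma * sum_{s,a} lambda(s,a) T(s,a,s')."""
        nS, nA, _ = T.shape
        mu0 = np.full(nS, 1.0 / nS)
        exp_r = (T * R).sum(axis=2)                       # shape (nS, nA)
        # Variable ordering: x[s * nA + a] = lambda(s, a)
        A_eq = np.zeros((nS, nS * nA))
        for s in range(nS):
            for a in range(nA):
                col = s * nA + a
                A_eq[:, col] = -gamma * T[s, a]           # - gamma * T(s, a, s') term
                A_eq[s, col] += 1.0                       # + lambda(s', a) term for s' = s
        res = linprog(c=-exp_r.reshape(-1), A_eq=A_eq, b_eq=mu0,
                      bounds=[(0, None)] * (nS * nA))     # linprog minimizes, so negate the objective
        lam = res.x.reshape(nS, nA)                       # discounted state-action frequencies
        return lam.argmax(axis=1)                         # optimal policy pi*(s)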

Outline. Optimal Control = given an MDP (S, A, T, R, γ, H), find the optimal policy π*. Exact Methods: Value Iteration, Policy Iteration, Linear Programming. For now: discrete state-action spaces, as they are simpler to get the main concepts across. We will consider continuous spaces later!

Today and forthcoming lectures. Optimal control: provides a general computational approach to tackle control problems. Dynamic programming / value iteration: exact methods on discrete state spaces (DONE!), discretization of continuous state spaces, function approximation. Linear systems: LQR, extensions to nonlinear settings (local linearization, differential dynamic programming). Optimal control through nonlinear optimization: open-loop, model predictive control. Examples.