An Introduction to Markov Decision Processes. Bob Givan, Purdue University; Ron Parr, Duke University. MDP Tutorial - 1

Outline: Markov Decision Processes defined (Bob): objective functions; policies. Finding Optimal Solutions (Ron): dynamic programming; linear programming. Refinements to the basic model (Bob): partial observability; factored representations. MDP Tutorial - 2

Stochastic Automata with Utilities A Markov Decision Process (MDP) model contains: a set of possible world states S; a set of possible actions A; a real-valued reward function R(s,a); a description T of each action's effects in each state. We assume the Markov Property: the effects of an action taken in a state depend only on that state and not on the prior history. MDP Tutorial - 3
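As a concrete illustration of these four ingredients, here is a minimal sketch that bundles S, A, R, and T into a small Python container. The container, its field names, and the toy two-state example are assumptions made for illustration only, not part of the tutorial.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite MDP: states S, actions A, reward R(s, a), transitions T(s, a, s')."""
    states: list        # S
    actions: list       # A
    reward: dict        # (s, a) -> real-valued reward
    transition: dict    # (s, a) -> {next_state: probability}

# Toy two-state example (illustrative values only).
toy = MDP(
    states=["ok", "broken"],
    actions=["work", "repair"],
    reward={("ok", "work"): 1.0, ("ok", "repair"): 0.0,
            ("broken", "work"): -1.0, ("broken", "repair"): 0.0},
    transition={("ok", "work"): {"ok": 0.9, "broken": 0.1},
                ("ok", "repair"): {"ok": 1.0},
                ("broken", "work"): {"broken": 1.0},
                ("broken", "repair"): {"ok": 0.8, "broken": 0.2}},
)
```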

Representing Actions Deterministic Actions: T : S × A → S. For each state and action we specify a new state. Stochastic Actions: T : S × A → Prob(S). For each state and action we specify a probability distribution over next states; T represents the distribution P(s′ | s, a). (Figures: a deterministic transition with probability 1.0, and a stochastic transition with probabilities 0.6 and 0.4.) MDP Tutorial - 5
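A minimal sketch of the stochastic case, assuming the toy dictionary-based MDP container above: the transition function is just a table from (state, action) pairs to distributions over next states, from which a next state can be sampled.

```python
import random

def sample_next_state(transition, s, a, rng=random):
    """Sample s' ~ P(. | s, a) from a table {(s, a): {next_state: prob}}."""
    dist = transition[(s, a)]              # e.g. {"ok": 0.6, "broken": 0.4}
    states, probs = zip(*dist.items())
    return rng.choices(states, weights=probs, k=1)[0]
```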

Representing Solutions A policy π is a mapping from S to A. MDP Tutorial - 7

Following a Policy Following a policy π: 1. Determine the current state s. 2. Execute action π(s). 3. Go to step 1. This assumes full observability: the new state resulting from executing an action will be known to the system. MDP Tutorial - 8
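A minimal sketch of this loop, assuming the toy MDP container and the sample_next_state helper sketched earlier; the horizon argument exists only to stop the otherwise infinite loop.

```python
def follow_policy(mdp, policy, s0, horizon=20):
    """Execute a policy (dict state -> action) from s0; return the visited (state, action, reward) steps."""
    s, trajectory = s0, []
    for _ in range(horizon):
        a = policy[s]                                      # step 2: execute pi(s)
        r = mdp.reward[(s, a)]
        s_next = sample_next_state(mdp.transition, s, a)
        trajectory.append((s, a, r))
        s = s_next                                         # full observability: the new state is known
    return trajectory
```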

Evaluating a Policy How good is a policy π in a state s? For deterministic actions, just total the rewards obtained... but the result may be infinite. For stochastic actions, use the expected total reward instead; this again typically yields an infinite value. How do we compare policies of infinite value? MDP Tutorial - 9

Objective Functions An objective function maps infinite sequences of rewards to a single real number (representing utility). Options: 1. Set a finite horizon and just total the reward. 2. Discount to prefer earlier rewards. 3. Use the average reward rate in the limit. Discounting is perhaps the most analytically tractable and most widely studied approach. MDP Tutorial - 10

Discounting A reward n steps away is discounted by γⁿ, for discount rate 0 < γ < 1. Discounting models mortality: you may die at any moment; it models a preference for shorter solutions; it is a smoothed-out version of limited-horizon lookahead. We use cumulative discounted reward as our objective. (If each step's reward is at most M, the maximum value is M + γM + γ²M + ... = M / (1 − γ).) MDP Tutorial - 11
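A small numerical illustration of this bound (the discount rate and per-step reward cap are arbitrary example values): summing γⁿ·M over a long horizon approaches M / (1 − γ).

```python
gamma, M = 0.9, 5.0                      # illustrative discount rate and per-step reward cap
discounted_total = sum((gamma ** n) * M for n in range(1000))
bound = M / (1 - gamma)
print(discounted_total, bound)           # both are approximately 50.0
```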

Value Functions A value function Vπ : S → R represents the expected objective value obtained following policy π from each state in S. Value functions partially order the policies, but at least one optimal policy exists, and all optimal policies have the same value function, V*. MDP Tutorial - 12

Bellman Equations Bellman equations relate the value function to itself via the problem dynamics. For the discounted objective function: Vπ(s) = R(s, π(s)) + Σ_{s′ ∈ S} T(s, π(s), s′) γ Vπ(s′), and V*(s) = max_{a ∈ A} [ R(s, a) + Σ_{s′ ∈ S} T(s, a, s′) γ V*(s′) ]. In each case, there is one equation per state in S. MDP Tutorial - 13
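The optimality equation above is the fixed point computed by value iteration. The sketch below is only an illustration of that equation over the dictionary-based MDP container assumed earlier; it is not the solution-methods segment of the tutorial, which is not transcribed here.

```python
def value_iteration(mdp, gamma=0.9, tol=1e-6):
    """Iterate V(s) <- max_a [ R(s,a) + sum_{s'} T(s,a,s') * gamma * V(s') ] to convergence."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        V_new = {}
        for s in mdp.states:
            # assumes every state has at least one applicable action
            V_new[s] = max(
                mdp.reward[(s, a)]
                + sum(p * gamma * V[s2] for s2, p in mdp.transition[(s, a)].items())
                for a in mdp.actions
                if (s, a) in mdp.transition
            )
        if max(abs(V_new[s] - V[s]) for s in mdp.states) < tol:
            return V_new
        V = V_new
```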

Finite-horizon Bellman Equations Finite-horizon values at adjacent horizons are related by the action dynamics: Vπ,0(s) = R(s, π(s)) and Vπ,n(s) = R(s, π(s)) + Σ_{s′ ∈ S} T(s, π(s), s′) γ Vπ,n−1(s′). MDP Tutorial - 14
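A short sketch of this recursion (backward induction for a fixed policy), again assuming the dictionary-based MDP container from earlier:

```python
def finite_horizon_values(mdp, policy, gamma, n):
    """Compute V_{pi,n} via the finite-horizon Bellman recursion, starting from V_{pi,0}."""
    V = {s: mdp.reward[(s, policy[s])] for s in mdp.states}        # V_{pi,0}
    for _ in range(n):
        V = {s: mdp.reward[(s, policy[s])]
                + sum(p * gamma * V[s2]
                      for s2, p in mdp.transition[(s, policy[s])].items())
             for s in mdp.states}                                  # V_{pi,k} from V_{pi,k-1}
    return V
```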

Relation to Model Checking Some thoughts on the relationship: MDP solution focuses critically on expected value; contrast safety properties, which focus on the worst case. This contrast allows MDP methods to exploit sampling and approximation more aggressively. MDP Tutorial - 15

At this point, Ron Parr spoke on solution methods for about half an hour, and then I continued. MDP Tutorial - 16

Large State Spaces In AI problems, the state space is typically astronomically large: it is described implicitly, not enumerated, and decomposed into factors, or aspects of state. Issues raised: How can we represent reward and action behaviors in such MDPs? How can we find solutions in such MDPs? MDP Tutorial - 17

A Factored MDP Representation State Space S: assignments to the state variables On-Mars?, Need-Power?, Daytime?, etc. Partitions: each block is a DNF formula (or BDD, etc.), e.g. Block 1: not On-Mars?; Block 2: On-Mars? and Need-Power?; Block 3: On-Mars? and not Need-Power?. Reward function R: a labelled state-space partition: Block 1 (not On-Mars?) has Reward=0; Block 2 (On-Mars? and Need-Power?) has Reward=4; Block 3 (On-Mars? and not Need-Power?) has Reward=5. MDP Tutorial - 18
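A hedged sketch of this labelled-partition idea in code, using the tutorial's Mars variables. The encoding of block formulas as Python predicates over a variable-assignment dict is an illustrative choice, not the tutorial's representation.

```python
# Reward function R as a labelled state-space partition.
# A state is an assignment to the boolean variables On-Mars?, Need-Power?, Daytime?, ...
reward_partition = [
    (lambda x: not x["On-Mars?"],                        0),   # Block 1
    (lambda x: x["On-Mars?"] and x["Need-Power?"],       4),   # Block 2
    (lambda x: x["On-Mars?"] and not x["Need-Power?"],   5),   # Block 3
]

def reward(x):
    """Return the reward of the (unique) block whose formula the state satisfies."""
    for formula, r in reward_partition:
        if formula(x):
            return r
    raise ValueError("partition does not cover this state")
```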

Factored Representations of Actions Assume: actions affect state variables independently (this assumption can be relaxed), e.g. Pr(Need-Power? ∧ On-Mars? | x, a) = Pr(Need-Power? | x, a) · Pr(On-Mars? | x, a). Represent the effect on each state variable as a labelled partition. Effects of action Charge-Battery on variable Need-Power?, given as Pr(Need-Power? | Block n): Block 1 (not On-Mars?): 0.9; Block 2 (On-Mars? and Need-Power?): 0.3; Block 3 (On-Mars? and not Need-Power?): 0.1. MDP Tutorial - 19
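Under this independence assumption, a next state can be sampled one variable at a time. Below is an illustrative sketch for the Charge-Battery effect on Need-Power?, reusing the predicate-based block encoding assumed above; the probabilities are the slide's, and the other variables' partitions would be written analogously.

```python
import random

# Pr(Need-Power? = true | block of the current state x) under action Charge-Battery.
charge_battery_need_power = [
    (lambda x: not x["On-Mars?"],                        0.9),  # Block 1
    (lambda x: x["On-Mars?"] and x["Need-Power?"],       0.3),  # Block 2
    (lambda x: x["On-Mars?"] and not x["Need-Power?"],   0.1),  # Block 3
]

def sample_variable(partition, x, rng=random):
    """Sample one state variable's next value from its labelled partition."""
    for formula, p_true in partition:
        if formula(x):
            return rng.random() < p_true
    raise ValueError("partition does not cover this state")
```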

Representing Blocks Identifying irrelevant state variables; decision trees; DNF formulas; Binary/Algebraic Decision Diagrams. MDP Tutorial - 20

Partial Observability System state cannot always be determined: a Partially Observable MDP (POMDP). Action outcomes are not fully observable. Add a set of observations O to the model; add an observation distribution U(s, o) for each state; add an initial state distribution I. Key notion: the belief state, a distribution over system states representing "where I think I am". MDP Tutorial - 21

POMDP to MDP Conversion The belief state Pr(x) can be updated to Pr(x | o) using Bayes' rule: Pr(s′ | s, o) = Pr(o | s, s′) Pr(s′ | s) / Pr(o | s) = U(s′, o) T(s, a, s′), normalized; Pr(s′ | o) = Σ_s Pr(s′ | s, o) Pr(s). A POMDP is Markovian and fully observable relative to the belief state, so a POMDP can be treated as a continuous-state MDP. MDP Tutorial - 22
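A compact sketch of this belief update for a finite POMDP, assuming dictionary tables T[(s, a)] = {s′: prob} and U[(s, o)] in the style of the earlier sketches (the names and table layout are illustrative assumptions):

```python
def update_belief(belief, a, o, T, U):
    """Bayes update: b'(s') is proportional to U(s', o) * sum_s T(s, a, s') * b(s)."""
    new_belief = {}
    for s_next in set(s2 for dist in T.values() for s2 in dist):
        new_belief[s_next] = U[(s_next, o)] * sum(
            T[(s, a)].get(s_next, 0.0) * p for s, p in belief.items()
        )
    total = sum(new_belief.values())      # normalization constant Pr(o | belief, a)
    return {s: p / total for s, p in new_belief.items()}
```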

Belief State Approximation Problem: when the MDP state space is astronomical, belief states cannot be explicitly represented. Consequence: the MDP conversion of a POMDP is impractical. Solution: represent the belief state approximately, typically exploiting the factored state representation and (near) conditional independence properties of the belief state factors. MDP Tutorial - 23
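One simple approximate form consistent with this idea (an assumption made for illustration, not a method given in the tutorial) is a fully factored belief: a product of independent marginals, one per state variable, as sketched below.

```python
# Approximate belief over a factored state as a product of per-variable marginals:
# b(x) is approximated by the product over variables v of Pr(x[v] = True) or its complement.
approx_belief = {"On-Mars?": 0.95, "Need-Power?": 0.30, "Daytime?": 0.50}

def approx_prob(belief_marginals, x):
    """Probability of the full assignment x under the fully factored approximation."""
    p = 1.0
    for var, p_true in belief_marginals.items():
        p *= p_true if x[var] else (1.0 - p_true)
    return p
```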