2.2 Stochastic Dynamic Programming


In general, the result of a given action will be unknown, although we will often be able to at least estimate the distribution of the resulting state given the action taken and the present state. These probabilities will be called the transition probabilities. The reward or cost incurred at each stage can depend on the action taken and the state.

Calculation of the optimal value recursively

Consider the recursive derivation of the optimal value. Suppose we have the optimal payoffs starting from each of the possible states at stage i + 1. We can calculate the expected cost (payoff) starting from any state at stage i by using the appropriate transition probabilities. Hence, we can calculate the optimal expected reward starting from any state at stage i.

Calculation of the optimal value inductively

Now consider the inductive derivation of the optimal value. At each stage we need to calculate the optimal cost of arriving at a given state given the optimal payoffs at previous stages. In order to do this, we need to calculate the probabilities of coming from each of the previous possible states given the present state (these are not the transition probabilities). It follows that it will be easier to use the recursive procedure.

Determination of the optimal strategy

The optimal policy (strategy) should define what action is to be taken in each possible combination of state and stage. This policy is built up by remembering the action that achieves the optimal cost/reward at each step of the calculation. We cannot in general infer a full description of the strategy used by an individual by observing their actions, since some possible states may not be reached.

2.2.2 The Bellman equation

Assume that the set of possible states at stage i is $S_i$ and that the set of actions available in state j is $A_j$ (the set of available actions is assumed to be independent of the moment). All of these sets are assumed to be finite. It is also assumed that the probability of a transition from state j to state k depends only on the action taken and not on the moment at which the action is taken, i.e. the transition probability of going from state j to state k after taking action a can be denoted $p_{j,k}(a)$.

Bellman equation for minimisation

Suppose the optimal expected cost in the minimisation problem starting in state k at stage i is $d_i(k)$. It is simple to calculate the expected costs of taking an action in a given state at the final stage. The total costs starting at earlier stages in a given state can then be calculated by recursion.

We have

$$d_{i-1}(j) = \min_{a \in A_j} \Big\{ c_a + \sum_{k \in S_i} p_{j,k}(a)\, d_i(k) \Big\},$$

where $c_a$ is the immediate expected cost of the action a. The sum is simply the expected cost starting at stage i given that the action a is taken (the sum is taken over the set of states possible at stage i). We minimise this sum over the set of possible actions in state j.

Bellman equation for maximisation

Analogously, suppose the optimal expected reward in the maximisation problem starting in state k at stage i is $R_i(k)$. We have

$$R_{i-1}(j) = \max_{a \in A_j} \Big\{ r_a + \sum_{k \in S_i} p_{j,k}(a)\, R_i(k) \Big\},$$

where $r_a$ is the expected immediate reward gained by taking action a.
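To make the recursion concrete, here is a minimal Python sketch of the backward calculation for the minimisation problem. The data layout is an assumption made for illustration, not something prescribed by the notes: states is a list, actions[j] lists the actions available in state j, cost[(j, a)] is the immediate expected cost of taking action a in state j, and P[(j, a)][k] is the transition probability $p_{j,k}(a)$.

```python
# A minimal sketch of the backward recursion for a minimisation problem.
# Hypothetical data layout: states is a list, actions[j] is the list of
# actions available in state j, cost[(j, a)] is the immediate expected cost
# of taking action a in state j, P[(j, a)][k] is the transition probability.

def backward_recursion(states, actions, cost, P, n_stages):
    """Return optimal expected costs d[i][j] and an optimal policy[(i, j)]."""
    d = {n_stages + 1: {j: 0.0 for j in states}}  # no costs after the final stage
    policy = {}
    for i in range(n_stages, 0, -1):              # stages n, n-1, ..., 1
        d[i] = {}
        for j in states:
            best_a, best_val = None, float("inf")
            for a in actions[j]:
                val = cost[(j, a)] + sum(P[(j, a)][k] * d[i + 1][k] for k in states)
                if val < best_val:
                    best_a, best_val = a, val
            d[i][j] = best_val
            policy[(i, j)] = best_a
    return d, policy

# For a maximisation problem, replace the minimisation over actions by a
# maximisation (start from -inf and keep the largest value).
```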

Example 2.2.1

At the beginning of a day a machine is in one of two states: G (Good) and B (Bad). Two actions are available each morning: F (Fix) and L (Leave). The cost of F is 4. No immediate costs are associated with the action L. There are no costs associated with being in state G, but the cost of being in state B at the end of the day is 3 (this cost is incurred on the same day and the state does not change overnight).

If the present state is G, then the state at the end of the day will be G with probability 0.95 if action F is taken and with probability 0.8 if action L is taken. If the present state is B, then the state at the end of the day will be B with probability 0.1 if action F is taken and with probability 0.9 if action L is taken. At the beginning of day 1 the machine is in state G. Derive the optimal policy to minimise the costs over a 3-day period. In order to solve such a problem, we first derive the transition matrix.

The transition matrix

The rows in the transition matrix correspond to the possible combinations of present state and action (here, with 2 states and 2 actions in each state, we have four combinations {F, G}, {F, B}, {L, G} and {L, B}). The columns correspond to the state at the next stage (here G or B). The entries are the transition probabilities to the new state given the combination of the present state and action. The sum of the entries in a row must be equal to 1.

          G      B
F, G    0.95   0.05
L, G    0.80   0.20
F, B    0.90   0.10
L, B    0.10   0.90
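For illustration, the transition matrix and the expected immediate costs of Example 2.2.1 can be encoded in the hypothetical data layout used in the sketch above; the final assertion checks that every row of the transition matrix sums to 1.

```python
# Data for Example 2.2.1 (hypothetical variable names).
states = ["G", "B"]
actions = {"G": ["F", "L"], "B": ["F", "L"]}

# Expected immediate cost of each (state, action): fixing costs 4, and a cost
# of 3 is incurred with the probability of ending the day in state B.
cost = {
    ("G", "F"): 4 + 0.05 * 3,   # 4.15
    ("G", "L"): 0 + 0.2 * 3,    # 0.6
    ("B", "F"): 4 + 0.1 * 3,    # 4.3
    ("B", "L"): 0 + 0.9 * 3,    # 2.7
}

# Transition matrix: P[(state, action)][next_state] = transition probability.
P = {
    ("G", "F"): {"G": 0.95, "B": 0.05},
    ("G", "L"): {"G": 0.80, "B": 0.20},
    ("B", "F"): {"G": 0.90, "B": 0.10},
    ("B", "L"): {"G": 0.10, "B": 0.90},
}

# Each row of the transition matrix must sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in P.values())
```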

Solution of Example 2.2.1

We calculate the optimal policy by recursion. We need to derive the optimal action to be taken in each possible state at each moment. At the beginning of the last day the system can be either in state G or in state B. We first calculate $d_3(G)$ and $d_3(B)$, where the index gives the day on which the action is taken.

In state G, taking the action F we incur a cost of 4, and with probability 0.05 the machine finishes the day in state B, in which case we incur a further cost of 3. Thus, the expected immediate cost is $4 + 0.05 \cdot 3 = 4.15$. Taking the action L, with probability 0.2 the machine finishes in state B. Hence, the expected immediate cost is $0.2 \cdot 3 = 0.6$. Thus, $d_3(G) = 0.6$ and the optimal action at stage 3 in state G is L.

Similarly, $d_3(B) = \min\{4 + 0.1 \cdot 3,\ 0.9 \cdot 3\} = \min\{4.3,\ 2.7\} = 2.7$. Here, the first entry is the expected cost of using F and the second entry is the expected cost of using L (this convention is used in all the calculations below). Hence, $d_3(B) = 2.7$ and the optimal action at stage 3 in state B is L.

Working backwards, the Bellman equation in state G at stage 2 is given by

$$d_2(G) = \min\{4.15 + 0.95\, d_3(G) + 0.05\, d_3(B),\ 0.6 + 0.8\, d_3(G) + 0.2\, d_3(B)\} = \min\{4.855,\ 1.62\} = 1.62.$$

The optimal action at stage 2 in state G is L. Note that 4.15 is the expected immediate cost at stage 2 when action F is taken. This is made up of the cost of fixing plus the expected cost associated with the state of the machine at the end of the day (3 with probability 0.05). Given that the present state is G and the action is F, the next state will be G with probability 0.95, otherwise it will be B. Hence, the optimal expected future costs are $0.95\, d_3(G) + 0.05\, d_3(B)$.

Similarly,

$$d_2(B) = \min\{4.3 + 0.9\, d_3(G) + 0.1\, d_3(B),\ 2.7 + 0.1\, d_3(G) + 0.9\, d_3(B)\} = \min\{5.11,\ 5.19\} = 5.11.$$

The optimal action at stage 2 in state B is F.

Working backwards, the Bellman equation in state G at stage 1 is given by

$$d_1(G) = \min\{4.15 + 0.95\, d_2(G) + 0.05\, d_2(B),\ 0.6 + 0.8\, d_2(G) + 0.2\, d_2(B)\} = \min\{5.9445,\ 2.918\} = 2.918.$$

The optimal action at stage 1 in state G is L.

Similarly,

$$d_1(B) = \min\{4.3 + 0.9\, d_2(G) + 0.1\, d_2(B),\ 2.7 + 0.1\, d_2(G) + 0.9\, d_2(B)\} = \min\{6.27,\ 7.46\} = 6.27.$$

The optimal action at stage 1 in state B is F.
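The whole three-day calculation can be reproduced with a short self-contained script (variable names are hypothetical; printed values are rounded in the comments).

```python
# Self-contained check of the recursion in Example 2.2.1 (hypothetical names).
states = ["G", "B"]
cost = {("G", "F"): 4.15, ("G", "L"): 0.6, ("B", "F"): 4.3, ("B", "L"): 2.7}
P = {("G", "F"): {"G": 0.95, "B": 0.05}, ("G", "L"): {"G": 0.8, "B": 0.2},
     ("B", "F"): {"G": 0.9, "B": 0.1}, ("B", "L"): {"G": 0.1, "B": 0.9}}

d = {4: {"G": 0.0, "B": 0.0}}      # no further costs after day 3
policy = {}
for i in (3, 2, 1):                # work backwards from the last day
    d[i] = {}
    for j in states:
        vals = {a: cost[(j, a)] + sum(P[(j, a)][k] * d[i + 1][k] for k in states)
                for a in ("F", "L")}
        a_star = min(vals, key=vals.get)
        d[i][j], policy[(i, j)] = vals[a_star], a_star

print(d[3])     # {'G': 0.6, 'B': 2.7}
print(d[2])     # {'G': 1.62, 'B': 5.11}
print(d[1])     # {'G': 2.918, 'B': 6.27} (approximately)
print(policy)   # L in G at every stage; L in B on day 3, F in B on days 1 and 2
```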

2.3 Infinite Horizon Problems

The imposition of a finite time horizon is unnatural, since in general we are interested in long-term control procedures. In this case, we may define the problem as an infinite horizon problem. Of course, there are problems with such a formulation:

1. The total costs (rewards) obtained in such a problem would not be finite.

2. Such problems cannot be solved by recursion.

Formulation of costs (rewards) in an infinite horizon problem

In order to eliminate the first problem, we can either

a. Discount the costs (rewards) over time. The logic behind this is that people naturally prefer immediate rewards to delayed rewards (due to inflation, uncertainty, etc.). A reward of x at stage $i_0 + i$ is defined to be worth the same as a reward of $x\beta^i$ at stage $i_0$, where $0 < \beta < 1$ is the discount factor.

b. Optimise the average cost (reward) per stage.

The method of solution depends on the approach we use.

2.3.1 Dynamic Programming with Discounted Costs (Payoffs)

It should be noted that discount factors may also be used in finite horizon problems, in which case the recursive method of solution can still be applied. The appropriate Bellman equations are as follows. For minimisation problems

$$d_{i-1}(j) = \min_{a \in A_j} \Big\{ c_a + \beta \sum_{k \in S_i} p_{j,k}(a)\, d_i(k) \Big\}.$$

For maximisation problems

$$R_{i-1}(j) = \max_{a \in A_j} \Big\{ r_a + \beta \sum_{k \in S_i} p_{j,k}(a)\, R_i(k) \Big\}.$$
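In code, the only change relative to the undiscounted recursion sketched earlier is the factor β multiplying the expected future costs. Again this is a sketch under the same assumed data layout.

```python
def discounted_backward_recursion(states, actions, cost, P, n_stages, beta):
    """Finite-horizon backward recursion with discount factor beta (minimisation)."""
    d = {n_stages + 1: {j: 0.0 for j in states}}   # no costs after the final stage
    policy = {}
    for i in range(n_stages, 0, -1):
        d[i] = {}
        for j in states:
            vals = {a: cost[(j, a)] + beta * sum(P[(j, a)][k] * d[i + 1][k]
                                                 for k in states)
                    for a in actions[j]}
            a_star = min(vals, key=vals.get)
            d[i][j], policy[(i, j)] = vals[a_star], a_star
    return d, policy
```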

Recursive calculations and discounted payoffs

When this procedure is used to calculate the optimal actions at stage n − 1, the calculation clearly discounts the expected costs obtained in the final stage by a factor of β. This discount is incorporated in the optimal costs $\{d_{n-1}(j)\}$. At stage n − 2, the costs obtained at stage n − 1 are discounted by a factor of β, and thus the costs incurred at stage n are discounted in total by a factor of $\beta^2$. Hence, these calculations implicitly take the discount factor into account.

Infinite horizon problems with discounted payoffs

We assume that the set of possible states is the same at each stage and that the set of permissible actions only depends on the present state. Let m be the (finite) number of possible states and $S = \{1, 2, \ldots, m\}$. $A_j$ is the (finite) set of permissible actions in state j, with $A_j = \{a_{1,j}, a_{2,j}, \ldots, a_{n_j,j}\}$, where $n_j$ is the number of available actions in state j.

Stationary strategies

We only consider stationary strategies. These are strategies which choose actions that depend only on the present state and not on the stage. In order to specify a stationary strategy $\pi$, it suffices to define which action should be taken in each state, i.e. to define a(j) for $j = 1, 2, \ldots, m$. There is a finite number of stationary strategies.

Optimality and stationary strategies

It is intuitively clear that there must be a strategy in this class that is optimal in the class of all possible strategies. This is due to the fact that the problem faced by an individual at stage $i_0$ in state j is identical (self-similar) to the problem faced by an individual at stage $i_0 + i$ in state j. Hence, if the action a(j) is optimal at stage $i_0$, then it must be optimal at stage $i_0 + i$.

Methods of solution

We consider two methods of solving such problems:

1. Policy iteration.

2. Value iteration.

Policy Iteration

Consider the stationary policy $\pi = \{a(j)\}_{j=1}^{m}$ (i.e. a policy defines what action is taken in each state j). The idea of policy iteration is based on the self-similarity of infinite horizon problems. That is to say, if the expected sum of discounted costs starting at stage 0 in state j under the strategy $\pi$ is $C^{\pi}(j)$, then the expected sum of discounted costs incurred from stage 1 onwards, starting at stage 1 in state j and discounted back to stage 0, will be $\beta C^{\pi}(j)$.

Derivation of expected costs under a policy

Using this, the expected costs (payoffs) when starting in each of the possible states, $C^{\pi}(1), C^{\pi}(2), \ldots, C^{\pi}(m)$, under the strategy $\pi$ are defined by a set of m linear equations in m unknowns. The j-th equation is

$$C^{\pi}(j) = c_{a(j)} + \beta \sum_{k=1}^{m} p_{j,k}[a(j)]\, C^{\pi}(k),$$

where $c_{a(j)}$ is the immediate expected cost of using action a(j) and $p_{j,k}[a(j)]$ is the probability of a transition from state j to state k given that action a(j) is taken.
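In matrix form these m equations read $(I - \beta P_\pi) C_\pi = c_\pi$, where $P_\pi$ is the transition matrix under $\pi$ and $c_\pi$ is the vector of immediate expected costs, so they can be solved directly. A sketch using numpy, under the same assumed data layout as before:

```python
import numpy as np

def evaluate_policy(states, policy, cost, P, beta):
    """Solve C_pi(j) = c_{a(j)} + beta * sum_k p_{j,k}[a(j)] * C_pi(k) for all j."""
    m = len(states)
    # Transition matrix and immediate cost vector under the stationary policy.
    P_pi = np.array([[P[(j, policy[j])][k] for k in states] for j in states])
    c_pi = np.array([cost[(j, policy[j])] for j in states])
    C = np.linalg.solve(np.eye(m) - beta * P_pi, c_pi)
    return dict(zip(states, C))
```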

Example 2.3.1

Consider the infinite horizon version of Example 2.2.1 in which β = 0.95. Assume that the costs of the machine ending the day in a bad state are incurred on that day. Calculate the expected discounted cost of adopting the policy π of never fixing the machine, i.e. always choose L.

Calculation of the expected costs under a policy

If the present state is G, then under the policy the state will be G at the next stage with probability 0.8 and B with probability 0.2. An immediate cost of 3 is only incurred in the second case. Hence, the expected immediate cost is 0.6. It follows that

$$C^{\pi}(G) = 0.6 + \beta[0.8\, C^{\pi}(G) + 0.2\, C^{\pi}(B)] = 0.6 + 0.76\, C^{\pi}(G) + 0.19\, C^{\pi}(B).$$

Arguing similarly, it follows that

$$C^{\pi}(B) = 2.7 + \beta[0.1\, C^{\pi}(G) + 0.9\, C^{\pi}(B)] = 2.7 + 0.095\, C^{\pi}(G) + 0.855\, C^{\pi}(B).$$

Rearranging these two equations, we obtain

$$0.24\, C^{\pi}(G) = 0.6 + 0.19\, C^{\pi}(B), \qquad -0.095\, C^{\pi}(G) + 0.145\, C^{\pi}(B) = 2.7.$$

Solving this system of equations, we obtain $C^{\pi}(G) \approx 35.82$, $C^{\pi}(B) \approx 42.09$.
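The rearranged pair of equations can be checked numerically (numpy assumed).

```python
import numpy as np

#  0.24*C(G) - 0.19*C(B)  = 0.6
# -0.095*C(G) + 0.145*C(B) = 2.7
A = np.array([[0.24, -0.19],
              [-0.095, 0.145]])
b = np.array([0.6, 2.7])
C_G, C_B = np.linalg.solve(A, b)
print(round(C_G, 2), round(C_B, 2))   # 35.82 42.09
```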

Evaluation of the optimal policy

Since there is a finite number of stationary strategies, we could derive the optimal policy by deriving the expected costs (rewards) starting in each state, $\{C^{\pi}(j)\}_{j=1}^{m}$, for each possible stationary strategy π. The optimal policy minimises the expected costs regardless of the starting state. However, when the number of possible strategies is large, this is not an efficient procedure.

The policy iteration algorithm

The algorithm defining the method of policy iteration is as follows:

0. Set i = 0 and choose some initial stationary policy $\pi_0$.

1. Calculate the expected costs under $\pi_i$ when starting in each of the possible states, i.e. solve the appropriate system of linear equations.

2. Calculate the expected cost starting in state j = 1 for each strategy of the form "carry out action a at the first stage and thereafter follow $\pi_i$"; such a strategy is denoted $(a, \pi_i)$. [Note that if a is the action prescribed by $\pi_i$ in state j, then this expected cost is simply $C^{\pi_i}(j)$.] Set a(j) to be the action that minimises these costs (maximises rewards). Repeat for j = 2, 3, ..., m.

3. The policy $\pi_{i+1}$ is defined to be $\{a(j)\}_{j=1}^{m}$ (i.e. it is made up of the actions that minimised the costs in each state, calculated in Step 2). If $\pi_{i+1} = \pi_i$, then stop; in this case $\pi_i$ is the optimal policy. If $\pi_{i+1} \neq \pi_i$, then set i = i + 1 and return to Step 1.

Calculation of expected costs for a perturbed policy

In Step 2, we need to calculate the expected costs starting in state j, using a in the first step and thereafter using $\pi_i$. These are given by

$$C^{(a,\pi_i)}(j) = c_a + \beta \sum_{k=1}^{m} p_{j,k}(a)\, C^{\pi_i}(k).$$

Note that $C^{\pi_i}(k)$ was calculated in Step 1 (it gives the expected cost (reward) under the present policy). Hence, we only need to evaluate the appropriate expressions rather than solve a system of equations.
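Putting Steps 0 to 3 together, a compact sketch of policy iteration might look as follows (data layout as assumed in the earlier sketches; on ties the current action is kept so that the loop is guaranteed to terminate).

```python
import numpy as np

def evaluate(states, policy, cost, P, beta):
    """Step 1: expected discounted costs C_pi(j) under a stationary policy."""
    m = len(states)
    P_pi = np.array([[P[(j, policy[j])][k] for k in states] for j in states])
    c_pi = np.array([cost[(j, policy[j])] for j in states])
    return dict(zip(states, np.linalg.solve(np.eye(m) - beta * P_pi, c_pi)))

def policy_iteration(states, actions, cost, P, beta, initial_policy):
    """Iterate evaluation (Step 1) and improvement (Step 2) until stable (Step 3)."""
    policy = dict(initial_policy)
    while True:
        C = evaluate(states, policy, cost, P, beta)
        new_policy = {}
        for j in states:
            # One-step lookahead: act with a now, follow the current policy after.
            lookahead = {a: cost[(j, a)] + beta * sum(P[(j, a)][k] * C[k]
                                                      for k in states)
                         for a in actions[j]}
            best_a = min(lookahead, key=lookahead.get)
            if lookahead[policy[j]] <= lookahead[best_a] + 1e-12:
                best_a = policy[j]          # keep the current action on ties
            new_policy[j] = best_a
        if new_policy == policy:
            return policy, C                # no state changes action: optimal
        policy = new_policy
```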

Example 2.3.2

Derive the optimal policy for the problem given in Example 2.3.1.

Solution of Example 2.3.2

Step 0 is to set some initial policy. Let $\pi_0$ be the policy "always choose L". Step 1 involves the calculation of the expected costs starting in each of the possible states. From Example 2.3.1, we have $C^{\pi_0}(G) \approx 35.82$, $C^{\pi_0}(B) \approx 42.09$.

The policy improvement step

Step 2 is to check whether this policy can be improved. Starting in state G, instead of using L at the first stage we could use F. In this case, the state at the next stage will be G with probability 0.95, otherwise the state will be B. The action F is associated with an immediate cost of 4, and a transition to B (which occurs with probability 0.05) is associated with an immediate cost of 3. It follows that the expected immediate cost is 4.15.

It follows that

$$C^{(F,\pi_0)}(G) = 4.15 + \beta[0.95\, C^{\pi_0}(G) + 0.05\, C^{\pi_0}(B)] \approx 38.48.$$

Since $C^{\pi_0}(G) \approx 35.82 < 38.48$, we set a(G) = L.

Similarly, starting in state B and using F at the first stage, we have

$$C^{(F,\pi_0)}(B) = 4.3 + \beta[0.9\, C^{\pi_0}(G) + 0.1\, C^{\pi_0}(B)] \approx 38.92.$$

Since $C^{\pi_0}(B) \approx 42.09 > 38.92$, it can be seen that a(B) = F. Hence, the policy $\pi_1$ is to fix the machine when it is in a bad state and leave it when it is in a good state.
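These two one-step lookahead values are easy to verify numerically.

```python
# Step 2 lookahead from pi_0 = "always L", using C(G) ~ 35.82 and C(B) ~ 42.09.
beta, C_G, C_B = 0.95, 35.82, 42.09
C_F_G = 4.15 + beta * (0.95 * C_G + 0.05 * C_B)   # ~ 38.48 > 35.82, so keep L in G
C_F_B = 4.30 + beta * (0.90 * C_G + 0.10 * C_B)   # ~ 38.92 < 42.09, so switch to F in B
print(round(C_F_G, 2), round(C_F_B, 2))
```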

Calculation of the expected costs under the new policy

Returning to Step 1, we calculate the expected costs under this new policy. As in Example 2.3.1, this requires the solution of m linear equations (where m is the number of states, here 2). As the action used in state G is L, we have

$$C^{\pi_1}(G) = 0.6 + \beta[0.8\, C^{\pi_1}(G) + 0.2\, C^{\pi_1}(B)] = 0.6 + 0.76\, C^{\pi_1}(G) + 0.19\, C^{\pi_1}(B).$$

In state B we use the action F (cost of 4). The state at the next stage will be G with probability 0.9 (no costs). Otherwise the state will be B, which is associated with an immediate cost of 3. Hence, the expected immediate cost is 4.3. We have

$$C^{\pi_1}(B) = 4.3 + \beta[0.9\, C^{\pi_1}(G) + 0.1\, C^{\pi_1}(B)] = 4.3 + 0.855\, C^{\pi_1}(G) + 0.095\, C^{\pi_1}(B).$$

This leads to the pair of linear equations

$$0.24\, C^{\pi_1}(G) = 0.6 + 0.19\, C^{\pi_1}(B), \qquad -0.855\, C^{\pi_1}(G) + 0.905\, C^{\pi_1}(B) = 4.3.$$

Solving this system of equations gives $C^{\pi_1}(G) \approx 24.84$, $C^{\pi_1}(B) \approx 28.22$.

The policy improvement step

We now see if it is possible to improve this policy. Starting in state G, we could change the action used in the first stage to F. The cost of this action is 4 and the probability of going to state B is 0.05, which is associated with an immediate cost of 3. The expected immediate cost is thus 4.15.

We have

$$C^{(F,\pi_1)}(G) = 4.15 + \beta[0.95\, C^{\pi_1}(G) + 0.05\, C^{\pi_1}(B)] \approx 27.91.$$

Since $C^{\pi_1}(G) \approx 24.84 < 27.91$, it does not pay to change this action.

Similarly, starting in state B, we could change the action taken to L. Arguing as above,

$$C^{(L,\pi_1)}(B) = 2.7 + \beta[0.1\, C^{\pi_1}(G) + 0.9\, C^{\pi_1}(B)] \approx 29.19.$$

Since $C^{\pi_1}(B) \approx 28.22 < 29.19$, it does not pay to change this action. Hence, $\pi_2 = \pi_1$. The optimal policy is to fix the machine when it is in state B and leave it when it is in state G.
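Finally, the whole of Example 2.3.2 can be reproduced with a short self-contained script (hypothetical names, as in the earlier sketches).

```python
import numpy as np

states = ["G", "B"]
actions = {"G": ["F", "L"], "B": ["F", "L"]}
cost = {("G", "F"): 4.15, ("G", "L"): 0.6, ("B", "F"): 4.3, ("B", "L"): 2.7}
P = {("G", "F"): {"G": 0.95, "B": 0.05}, ("G", "L"): {"G": 0.8, "B": 0.2},
     ("B", "F"): {"G": 0.9, "B": 0.1}, ("B", "L"): {"G": 0.1, "B": 0.9}}
beta = 0.95

policy = {"G": "L", "B": "L"}                      # pi_0: never fix the machine
while True:
    # Step 1: evaluate the current policy by solving the linear equations.
    P_pi = np.array([[P[(j, policy[j])][k] for k in states] for j in states])
    c_pi = np.array([cost[(j, policy[j])] for j in states])
    C = dict(zip(states, np.linalg.solve(np.eye(2) - beta * P_pi, c_pi)))
    # Step 2: one-step improvement against the current values.
    new_policy = {j: min(actions[j],
                         key=lambda a: cost[(j, a)]
                         + beta * sum(P[(j, a)][k] * C[k] for k in states))
                  for j in states}
    if new_policy == policy:                       # Step 3: stop when stable
        break
    policy = new_policy

print(policy)                                  # {'G': 'L', 'B': 'F'}
print({j: round(C[j], 2) for j in states})     # {'G': 24.84, 'B': 28.22}
```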