2.2 Stochastic Dynamic Programming


In general, the result of a given action will be unknown, although we will often be able to at least estimate the distribution of the resulting state given the action taken and the present state. These probabilities will be called the transition probabilities. The reward or cost incurred at each stage can depend on the action taken and the state.

Calculation of the optimal value recursively

Consider the recursive derivation of the optimal value. Suppose we have the optimal payoffs starting from each of the possible states at stage i + 1. We can calculate the expected cost (payoff) starting from any state at stage i by using the appropriate transition probabilities. Hence, we can calculate the optimal expected reward starting from any state at stage i.

Calculation of the optimal value inductively

Now consider the inductive derivation of the optimal value. At each stage we need to calculate the optimal cost of arriving at a given state given the optimal payoffs at previous stages. In order to do this, we need to calculate the probabilities of coming from each of the previous possible states given the present state (these are not the transition probabilities). It follows that it will be easier to use the recursive procedure.

Determination of the optimal strategy

The optimal policy (strategy) should define what action is to be taken in each possible combination of state and stage. This policy is built up by remembering the action that achieves the optimal cost/reward at each step of the calculation. We cannot in general infer a full description of the strategy used by an individual by observing their actions, since some possible states may not be reached.

2.2.2 The Bellman equation

Assume that the set of possible states at stage i is $S_i$ and that the set of actions available in state j is $A_j$ (the set of available actions is assumed to be independent of the moment). All of these sets are assumed to be finite. It is also assumed that the probability of a transition from state j to state k depends only on the action taken and not on the moment at which the action is taken, i.e. the transition probability of going from state j to state k after taking action a can be denoted $p_{j,k}(a)$.

Bellman equation for minimisation

Suppose the optimal expected cost in the minimisation problem starting in state k at stage i is $d_i(k)$. It is simple to calculate the expected costs of taking an action in a given state at the final stage. The total costs starting at earlier stages in a given state can then be calculated by recursion.

We have

$$d_{i-1}(j) = \min_{a \in A_j} \Big\{ c_a + \sum_{k \in S_i} p_{j,k}(a)\, d_i(k) \Big\},$$

where $c_a$ is the immediate expected cost of the action a. The sum is simply the expected cost starting at stage i given that the action a is taken (the sum is taken over the set of states possible at stage i). We minimise this sum over the set of possible actions in state j.

Bellman equation for maximisation

Analogously, suppose the optimal expected reward in the maximisation problem starting in state k at stage i is $R_i(k)$. We have

$$R_{i-1}(j) = \max_{a \in A_j} \Big\{ r_a + \sum_{k \in S_i} p_{j,k}(a)\, R_i(k) \Big\},$$

where $r_a$ is the expected immediate reward gained by taking action a.
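To make the recursion concrete, here is a minimal Python sketch of the backward calculation for the minimisation problem. The data layout is an assumption made for illustration, not something prescribed by the notes: states is a list, actions[j] lists the actions available in state j, cost[(j, a)] is the immediate expected cost of taking action a in state j, and P[(j, a)][k] is the transition probability $p_{j,k}(a)$.

```python
# A minimal sketch of the backward recursion for a minimisation problem.
# Hypothetical data layout: states is a list, actions[j] is the list of
# actions available in state j, cost[(j, a)] is the immediate expected cost
# of taking action a in state j, P[(j, a)][k] is the transition probability.

def backward_recursion(states, actions, cost, P, n_stages):
    """Return optimal expected costs d[i][j] and an optimal policy[(i, j)]."""
    d = {n_stages + 1: {j: 0.0 for j in states}}  # no costs after the final stage
    policy = {}
    for i in range(n_stages, 0, -1):              # stages n, n-1, ..., 1
        d[i] = {}
        for j in states:
            best_a, best_val = None, float("inf")
            for a in actions[j]:
                val = cost[(j, a)] + sum(P[(j, a)][k] * d[i + 1][k] for k in states)
                if val < best_val:
                    best_a, best_val = a, val
            d[i][j] = best_val
            policy[(i, j)] = best_a
    return d, policy

# For a maximisation problem, replace the minimisation over actions by a
# maximisation (start from -inf and keep the largest value).
```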

Example 2.2.1

At the beginning of a day a machine is in one of two states: G (Good) and B (Bad). Two actions are available each morning: F (Fix) and L (Leave). The cost of F is 4. No immediate costs are associated with the action L. There are no costs associated with being in state G, but the cost of being in state B at the end of the day is 3 (this cost is incurred on the same day and the state does not change overnight).

If the present state is G, then the state at the end of the day will be G with probability 0.95 if action F is taken and with probability 0.8 if action L is taken. If the present state is B, then the state at the end of the day will be B with probability 0.1 if action F is taken and with probability 0.9 if action L is taken. At the beginning of day 1 the machine is in state G. Derive the optimal policy to minimise the costs over a 3-day period. In order to solve such a problem, we first derive the transition matrix.

The transition matrix

The rows in the transition matrix correspond to the possible combinations of present state and action (here, with 2 states and 2 actions in each state, we have four combinations {F, G}, {F, B}, {L, G} and {L, B}). The columns correspond to the state at the next stage (here G or B). The entries are the transition probabilities to the new state given the combination of the present state and action. The sum of the entries in a row must be equal to 1.

          G      B
F, G    0.95   0.05
L, G    0.80   0.20
F, B    0.90   0.10
L, B    0.10   0.90
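For illustration, the transition matrix and the expected immediate costs of Example 2.2.1 can be encoded in the hypothetical data layout used in the sketch above; the final assertion checks that every row of the transition matrix sums to 1.

```python
# Data for Example 2.2.1 (hypothetical variable names).
states = ["G", "B"]
actions = {"G": ["F", "L"], "B": ["F", "L"]}

# Expected immediate cost of each (state, action): fixing costs 4, and a cost
# of 3 is incurred with the probability of ending the day in state B.
cost = {
    ("G", "F"): 4 + 0.05 * 3,   # 4.15
    ("G", "L"): 0 + 0.2 * 3,    # 0.6
    ("B", "F"): 4 + 0.1 * 3,    # 4.3
    ("B", "L"): 0 + 0.9 * 3,    # 2.7
}

# Transition matrix: P[(state, action)][next_state] = transition probability.
P = {
    ("G", "F"): {"G": 0.95, "B": 0.05},
    ("G", "L"): {"G": 0.80, "B": 0.20},
    ("B", "F"): {"G": 0.90, "B": 0.10},
    ("B", "L"): {"G": 0.10, "B": 0.90},
}

# Each row of the transition matrix must sum to 1.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in P.values())
```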

Solution of Example 2.2.1

We calculate the optimal policy by recursion. We need to derive the optimal action to be taken in each possible state at each moment. At the beginning of the last day the system can be either in state G or in state B. We first calculate $d_3(G)$ and $d_3(B)$, where the index gives the day on which the action is taken.

In state G, taking the action F we incur a cost of 4, and with probability 0.05 the machine finishes the day in state B, in which case we incur a further cost of 3. Thus, the expected immediate cost is $4 + 0.05 \cdot 3 = 4.15$. Taking the action L, with probability 0.2 the machine finishes in state B. Hence, the expected immediate cost is $0.2 \cdot 3 = 0.6$. Thus, $d_3(G) = 0.6$ and the optimal action at stage 3 in state G is L.

Similarly, $d_3(B) = \min\{4 + 0.1 \cdot 3,\ 0.9 \cdot 3\} = \min\{4.3,\ 2.7\} = 2.7$. Here, the first entry is the expected cost of using F and the second entry is the expected cost of using L (this convention is used in all the calculations below). Hence, $d_3(B) = 2.7$ and the optimal action at stage 3 in state B is L.

Working backwards, the Bellman equation in state G at stage 2 is given by

$$d_2(G) = \min\{4.15 + 0.95\, d_3(G) + 0.05\, d_3(B),\ 0.6 + 0.8\, d_3(G) + 0.2\, d_3(B)\} = \min\{4.855,\ 1.62\} = 1.62.$$

The optimal action at stage 2 in state G is L. Note that 4.15 is the expected immediate cost at stage 2 when action F is taken. This is made up of the cost of fixing plus the expected cost associated with the state of the machine at the end of the day (3 with probability 0.05). Given that the present state is G and the action is F, the next state will be G with probability 0.95, otherwise it will be B. Hence, the optimal expected future costs are $0.95\, d_3(G) + 0.05\, d_3(B)$.

Similarly,

$$d_2(B) = \min\{4.3 + 0.9\, d_3(G) + 0.1\, d_3(B),\ 2.7 + 0.1\, d_3(G) + 0.9\, d_3(B)\} = \min\{5.11,\ 5.19\} = 5.11.$$

The optimal action at stage 2 in state B is F.

Working backwards, the Bellman equation in state G at stage 1 is given by

$$d_1(G) = \min\{4.15 + 0.95\, d_2(G) + 0.05\, d_2(B),\ 0.6 + 0.8\, d_2(G) + 0.2\, d_2(B)\} = \min\{5.9445,\ 2.918\} = 2.918.$$

The optimal action at stage 1 in state G is L.

Similarly,

$$d_1(B) = \min\{4.3 + 0.9\, d_2(G) + 0.1\, d_2(B),\ 2.7 + 0.1\, d_2(G) + 0.9\, d_2(B)\} = \min\{6.27,\ 7.46\} = 6.27.$$

The optimal action at stage 1 in state B is F.
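The whole three-day calculation can be reproduced with a short self-contained script (variable names are hypothetical; printed values are rounded in the comments).

```python
# Self-contained check of the recursion in Example 2.2.1 (hypothetical names).
states = ["G", "B"]
cost = {("G", "F"): 4.15, ("G", "L"): 0.6, ("B", "F"): 4.3, ("B", "L"): 2.7}
P = {("G", "F"): {"G": 0.95, "B": 0.05}, ("G", "L"): {"G": 0.8, "B": 0.2},
     ("B", "F"): {"G": 0.9, "B": 0.1}, ("B", "L"): {"G": 0.1, "B": 0.9}}

d = {4: {"G": 0.0, "B": 0.0}}      # no further costs after day 3
policy = {}
for i in (3, 2, 1):                # work backwards from the last day
    d[i] = {}
    for j in states:
        vals = {a: cost[(j, a)] + sum(P[(j, a)][k] * d[i + 1][k] for k in states)
                for a in ("F", "L")}
        a_star = min(vals, key=vals.get)
        d[i][j], policy[(i, j)] = vals[a_star], a_star

print(d[3])     # {'G': 0.6, 'B': 2.7}
print(d[2])     # {'G': 1.62, 'B': 5.11}
print(d[1])     # {'G': 2.918, 'B': 6.27} (approximately)
print(policy)   # L in G at every stage; L in B on day 3, F in B on days 1 and 2
```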

2.3 Infinite Horizon Problems

The imposition of a finite time horizon is unnatural, since in general we are interested in long-term control procedures. In this case, we may define the problem as an infinite horizon problem. Of course, there are problems with such a formulation:

1. The total costs (rewards) obtained in such a problem would not be finite.

2. Such problems cannot be solved by recursion.

Formulation of costs (rewards) in an infinite horizon problem

In order to eliminate the first problem, we can either

a. Discount the costs (rewards) over time. The logic behind this is that people naturally prefer immediate rewards to delayed rewards (due to inflation, uncertainty, etc.). A reward of x at stage $i_0 + i$ is defined to be worth the same as a reward of $x\beta^i$ at stage $i_0$, where $0 < \beta < 1$ is the discount factor.

b. Optimise the average cost (reward) per stage.

The method of solution depends on the approach we use.

2.3.1 Dynamic Programming with Discounted Costs (Payoffs)

It should be noted that discount factors may also be used in finite horizon problems, in which case the recursive method of solution can still be applied. The appropriate Bellman equations are as follows. For minimisation problems

$$d_{i-1}(j) = \min_{a \in A_j} \Big\{ c_a + \beta \sum_{k \in S_i} p_{j,k}(a)\, d_i(k) \Big\}.$$

For maximisation problems

$$R_{i-1}(j) = \max_{a \in A_j} \Big\{ r_a + \beta \sum_{k \in S_i} p_{j,k}(a)\, R_i(k) \Big\}.$$
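In code, the only change relative to the undiscounted recursion sketched earlier is the factor β multiplying the expected future costs. Again this is a sketch under the same assumed data layout.

```python
def discounted_backward_recursion(states, actions, cost, P, n_stages, beta):
    """Finite-horizon backward recursion with discount factor beta (minimisation)."""
    d = {n_stages + 1: {j: 0.0 for j in states}}   # no costs after the final stage
    policy = {}
    for i in range(n_stages, 0, -1):
        d[i] = {}
        for j in states:
            vals = {a: cost[(j, a)] + beta * sum(P[(j, a)][k] * d[i + 1][k]
                                                 for k in states)
                    for a in actions[j]}
            a_star = min(vals, key=vals.get)
            d[i][j], policy[(i, j)] = vals[a_star], a_star
    return d, policy
```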

Recursive calculations and discounted payoffs

When this procedure is used to calculate the optimal actions at stage n − 1, the calculation clearly discounts the expected costs obtained in the final stage by a factor of β. This discount is incorporated in the optimal costs $\{d_{n-1}(j)\}$. At stage n − 2, the costs obtained at stage n − 1 are discounted by a factor of β, and thus the costs incurred at stage n are discounted in total by a factor of $\beta^2$. Hence, these calculations implicitly take the discount factor into account.

Infinite horizon problems with discounted payoffs

We assume that the set of possible states is the same at each stage and that the set of permissible actions only depends on the present state. Let m be the (finite) number of possible states and $S = \{1, 2, \ldots, m\}$. $A_j$ is the (finite) set of permissible actions in state j, with $A_j = \{a_{1,j}, a_{2,j}, \ldots, a_{n_j,j}\}$, where $n_j$ is the number of available actions in state j.

Stationary strategies

We only consider stationary strategies. These are strategies which choose actions that depend only on the present state and not on the stage. In order to specify a stationary strategy $\pi$, it suffices to define which action should be taken in each state, i.e. to define a(j) for $j = 1, 2, \ldots, m$. There is a finite number of stationary strategies.

Optimality and stationary strategies

It is intuitively clear that there must be a strategy in this class that is optimal in the class of all possible strategies. This is due to the fact that the problem faced by an individual at stage $i_0$ in state j is identical (self-similar) to the problem faced by an individual at stage $i_0 + i$ in state j. Hence, if the action a(j) is optimal at stage $i_0$, then it must be optimal at stage $i_0 + i$.

Methods of solution

We consider two methods of solving such problems:

1. Policy iteration.

2. Value iteration.

Policy Iteration

Consider the stationary policy $\pi = \{a(j)\}_{j=1}^{m}$ (i.e. a policy defines what action is taken in each state j). The idea of policy iteration is based on the self-similarity of infinite horizon problems. That is to say, if the expected sum of discounted costs starting at stage 0 in state j under the strategy $\pi$ is $C^{\pi}(j)$, then the expected sum of discounted costs incurred from stage 1 onwards, starting at stage 1 in state j and discounted back to stage 0, will be $\beta C^{\pi}(j)$.

Derivation of expected costs under a policy

Using this, the expected costs (payoffs) when starting in each of the possible states, $C^{\pi}(1), C^{\pi}(2), \ldots, C^{\pi}(m)$, under the strategy $\pi$ are defined by a set of m linear equations in m unknowns. The j-th equation is

$$C^{\pi}(j) = c_{a(j)} + \beta \sum_{k=1}^{m} p_{j,k}[a(j)]\, C^{\pi}(k),$$

where $c_{a(j)}$ is the immediate expected cost of using action a(j) and $p_{j,k}[a(j)]$ is the probability of a transition from state j to state k given that action a(j) is taken.
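In matrix form these m equations read $(I - \beta P_\pi) C_\pi = c_\pi$, where $P_\pi$ is the transition matrix under $\pi$ and $c_\pi$ is the vector of immediate expected costs, so they can be solved directly. A sketch using numpy, under the same assumed data layout as before:

```python
import numpy as np

def evaluate_policy(states, policy, cost, P, beta):
    """Solve C_pi(j) = c_{a(j)} + beta * sum_k p_{j,k}[a(j)] * C_pi(k) for all j."""
    m = len(states)
    # Transition matrix and immediate cost vector under the stationary policy.
    P_pi = np.array([[P[(j, policy[j])][k] for k in states] for j in states])
    c_pi = np.array([cost[(j, policy[j])] for j in states])
    C = np.linalg.solve(np.eye(m) - beta * P_pi, c_pi)
    return dict(zip(states, C))
```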

Example 2.3.1

Consider the infinite horizon version of Example 2.2.1 in which β = 0.95. Assume that the costs of the machine ending the day in a bad state are incurred on that day. Calculate the expected discounted cost of adopting the policy π of never fixing the machine, i.e. always choose L.

Calculation of the expected costs under a policy

If the present state is G, then under the policy the state will be G at the next stage with probability 0.8 and B with probability 0.2. An immediate cost of 3 is only incurred in the second case. Hence, the expected immediate cost is 0.6. It follows that

$$C^{\pi}(G) = 0.6 + \beta[0.8\, C^{\pi}(G) + 0.2\, C^{\pi}(B)] = 0.6 + 0.76\, C^{\pi}(G) + 0.19\, C^{\pi}(B).$$

Arguing similarly, it follows that

$$C^{\pi}(B) = 2.7 + \beta[0.1\, C^{\pi}(G) + 0.9\, C^{\pi}(B)] = 2.7 + 0.095\, C^{\pi}(G) + 0.855\, C^{\pi}(B).$$

Rearranging these two equations, we obtain

$$0.24\, C^{\pi}(G) = 0.6 + 0.19\, C^{\pi}(B), \qquad -0.095\, C^{\pi}(G) + 0.145\, C^{\pi}(B) = 2.7.$$

Solving this system of equations, we obtain $C^{\pi}(G) \approx 35.82$, $C^{\pi}(B) \approx 42.09$.
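The rearranged pair of equations can be checked numerically (numpy assumed).

```python
import numpy as np

#  0.24*C(G) - 0.19*C(B)  = 0.6
# -0.095*C(G) + 0.145*C(B) = 2.7
A = np.array([[0.24, -0.19],
              [-0.095, 0.145]])
b = np.array([0.6, 2.7])
C_G, C_B = np.linalg.solve(A, b)
print(round(C_G, 2), round(C_B, 2))   # 35.82 42.09
```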

Evaluation of the optimal policy

Since there is a finite number of stationary strategies, we could derive the optimal policy by deriving the expected costs (rewards) starting in each state, $\{C^{\pi}(j)\}_{j=1}^{m}$, for each possible stationary strategy π. The optimal policy minimises the expected costs regardless of the starting state. However, when the number of possible strategies is large, this is not an efficient procedure.

The policy iteration algorithm

The algorithm defining the method of policy iteration is as follows:

0. Set i = 0 and choose some initial stationary policy $\pi_0$.

1. Calculate the expected costs under $\pi_i$ when starting in each of the possible states, i.e. solve the appropriate system of linear equations.

2. Calculate the expected cost starting in state j = 1 for each strategy of the form "carry out action a at the first stage and thereafter follow $\pi_i$"; such a strategy is denoted $(a, \pi_i)$. [Note that if a is the action prescribed by $\pi_i$ in state j, then this expected cost is simply $C^{\pi_i}(j)$.] Set a(j) to be the action that minimises these costs (maximises rewards). Repeat for j = 2, 3, ..., m.

3. The policy $\pi_{i+1}$ is defined to be $\{a(j)\}_{j=1}^{m}$ (i.e. it is made up of the actions that minimised the costs in each state, calculated in Step 2). If $\pi_{i+1} = \pi_i$, then stop; in this case $\pi_i$ is the optimal policy. If $\pi_{i+1} \neq \pi_i$, then set i = i + 1 and return to Step 1.

Calculation of expected costs for a perturbed policy

In Step 2, we need to calculate the expected costs starting in state j, using a in the first step and thereafter using $\pi_i$. These are given by

$$C^{(a,\pi_i)}(j) = c_a + \beta \sum_{k=1}^{m} p_{j,k}(a)\, C^{\pi_i}(k).$$

Note that $C^{\pi_i}(k)$ was calculated in Step 1 (it gives the expected cost (reward) under the present policy). Hence, we only need to evaluate the appropriate expressions rather than solve a system of equations.
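Putting Steps 0 to 3 together, a compact sketch of policy iteration might look as follows (data layout as assumed in the earlier sketches; on ties the current action is kept so that the loop is guaranteed to terminate).

```python
import numpy as np

def evaluate(states, policy, cost, P, beta):
    """Step 1: expected discounted costs C_pi(j) under a stationary policy."""
    m = len(states)
    P_pi = np.array([[P[(j, policy[j])][k] for k in states] for j in states])
    c_pi = np.array([cost[(j, policy[j])] for j in states])
    return dict(zip(states, np.linalg.solve(np.eye(m) - beta * P_pi, c_pi)))

def policy_iteration(states, actions, cost, P, beta, initial_policy):
    """Iterate evaluation (Step 1) and improvement (Step 2) until stable (Step 3)."""
    policy = dict(initial_policy)
    while True:
        C = evaluate(states, policy, cost, P, beta)
        new_policy = {}
        for j in states:
            # One-step lookahead: act with a now, follow the current policy after.
            lookahead = {a: cost[(j, a)] + beta * sum(P[(j, a)][k] * C[k]
                                                      for k in states)
                         for a in actions[j]}
            best_a = min(lookahead, key=lookahead.get)
            if lookahead[policy[j]] <= lookahead[best_a] + 1e-12:
                best_a = policy[j]          # keep the current action on ties
            new_policy[j] = best_a
        if new_policy == policy:
            return policy, C                # no state changes action: optimal
        policy = new_policy
```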

Example 2.3.2

Derive the optimal policy for the problem given in Example 2.3.1.

Solution of Example 2.3.2

Step 0 is to set some initial policy. Let $\pi_0$ be the policy "always choose L". Step 1 involves the calculation of the expected costs starting in each of the possible states. From Example 2.3.1, we have $C^{\pi_0}(G) \approx 35.82$, $C^{\pi_0}(B) \approx 42.09$.

The policy improvement step

Step 2 is to check whether this policy can be improved. Starting in state G, instead of using L at the first stage we could use F. In this case, the state at the next stage will be G with probability 0.95, otherwise the state will be B. The action F is associated with an immediate cost of 4, and a transition to B (which occurs with probability 0.05) is associated with an immediate cost of 3. It follows that the expected immediate cost is 4.15.

It follows that

$$C^{(F,\pi_0)}(G) = 4.15 + \beta[0.95\, C^{\pi_0}(G) + 0.05\, C^{\pi_0}(B)] \approx 38.48.$$

Since $C^{\pi_0}(G) \approx 35.82 < 38.48$, we set a(G) = L.

Similarly, starting in state B and using F at the first stage, we have

$$C^{(F,\pi_0)}(B) = 4.3 + \beta[0.9\, C^{\pi_0}(G) + 0.1\, C^{\pi_0}(B)] \approx 38.92.$$

Since $C^{\pi_0}(B) \approx 42.09 > 38.92$, it can be seen that a(B) = F. Hence, the policy $\pi_1$ is to fix the machine when it is in a bad state and leave it when it is in a good state.
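These two one-step lookahead values are easy to verify numerically.

```python
# Step 2 lookahead from pi_0 = "always L", using C(G) ~ 35.82 and C(B) ~ 42.09.
beta, C_G, C_B = 0.95, 35.82, 42.09
C_F_G = 4.15 + beta * (0.95 * C_G + 0.05 * C_B)   # ~ 38.48 > 35.82, so keep L in G
C_F_B = 4.30 + beta * (0.90 * C_G + 0.10 * C_B)   # ~ 38.92 < 42.09, so switch to F in B
print(round(C_F_G, 2), round(C_F_B, 2))
```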

Calculation of the expected costs under the new policy

Returning to Step 1, we calculate the expected costs under this new policy. As in Example 2.3.1, this requires the solution of m linear equations (where m is the number of states, here 2). As the action used in state G is L, we have

$$C^{\pi_1}(G) = 0.6 + \beta[0.8\, C^{\pi_1}(G) + 0.2\, C^{\pi_1}(B)] = 0.6 + 0.76\, C^{\pi_1}(G) + 0.19\, C^{\pi_1}(B).$$

In state B we use the action F (cost of 4). The state at the next stage will be G with probability 0.9 (no costs). Otherwise the state will be B, which is associated with an immediate cost of 3. Hence, the expected immediate cost is 4.3. We have

$$C^{\pi_1}(B) = 4.3 + \beta[0.9\, C^{\pi_1}(G) + 0.1\, C^{\pi_1}(B)] = 4.3 + 0.855\, C^{\pi_1}(G) + 0.095\, C^{\pi_1}(B).$$

This leads to the pair of linear equations

$$0.24\, C^{\pi_1}(G) = 0.6 + 0.19\, C^{\pi_1}(B), \qquad -0.855\, C^{\pi_1}(G) + 0.905\, C^{\pi_1}(B) = 4.3.$$

Solving this system of equations gives $C^{\pi_1}(G) \approx 24.84$, $C^{\pi_1}(B) \approx 28.22$.

The policy improvement step

We now see if it is possible to improve this policy. Starting in state G, we could change the action used in the first stage to F. The cost of this action is 4 and the probability of going to state B is 0.05, which is associated with an immediate cost of 3. The expected immediate cost is thus 4.15.

We have

$$C^{(F,\pi_1)}(G) = 4.15 + \beta[0.95\, C^{\pi_1}(G) + 0.05\, C^{\pi_1}(B)] \approx 27.91.$$

Since $C^{\pi_1}(G) \approx 24.84 < 27.91$, it does not pay to change this action.

Similarly, starting in state B, we could change the action taken to L. Arguing as above,

$$C^{(L,\pi_1)}(B) = 2.7 + \beta[0.1\, C^{\pi_1}(G) + 0.9\, C^{\pi_1}(B)] \approx 29.19.$$

Since $C^{\pi_1}(B) \approx 28.22 < 29.19$, it does not pay to change this action. Hence, $\pi_2 = \pi_1$. The optimal policy is to fix the machine when it is in state B and leave it when it is in state G.
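Finally, the whole of Example 2.3.2 can be reproduced with a short self-contained script (hypothetical names, as in the earlier sketches).

```python
import numpy as np

states = ["G", "B"]
actions = {"G": ["F", "L"], "B": ["F", "L"]}
cost = {("G", "F"): 4.15, ("G", "L"): 0.6, ("B", "F"): 4.3, ("B", "L"): 2.7}
P = {("G", "F"): {"G": 0.95, "B": 0.05}, ("G", "L"): {"G": 0.8, "B": 0.2},
     ("B", "F"): {"G": 0.9, "B": 0.1}, ("B", "L"): {"G": 0.1, "B": 0.9}}
beta = 0.95

policy = {"G": "L", "B": "L"}                      # pi_0: never fix the machine
while True:
    # Step 1: evaluate the current policy by solving the linear equations.
    P_pi = np.array([[P[(j, policy[j])][k] for k in states] for j in states])
    c_pi = np.array([cost[(j, policy[j])] for j in states])
    C = dict(zip(states, np.linalg.solve(np.eye(2) - beta * P_pi, c_pi)))
    # Step 2: one-step improvement against the current values.
    new_policy = {j: min(actions[j],
                         key=lambda a: cost[(j, a)]
                         + beta * sum(P[(j, a)][k] * C[k] for k in states))
                  for j in states}
    if new_policy == policy:                       # Step 3: stop when stable
        break
    policy = new_policy

print(policy)                                  # {'G': 'L', 'B': 'F'}
print({j: round(C[j], 2) for j in states})     # {'G': 24.84, 'B': 28.22}
```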