Chapter 4: Dynamic Programming

Objectives of this chapter:
- Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
- Show how DP can be used to compute value functions, and hence, optimal policies
- Discuss efficiency and utility of DP
Policy Evaluation

Policy evaluation: for a given policy π, compute the state-value function V^π.

Recall the state-value function for policy π:

    V^π(s) = E_π{ R | s_0 = s } = E_π{ Σ_{t=0}^∞ γ^t r_t | s_0 = s }

Bellman equation for V^π:

    V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

— a system of |S| simultaneous linear equations.
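Because the Bellman equation for V^π is linear, it can be solved directly. A minimal sketch in Python/NumPy, assuming a tabular MDP stored as arrays P[s, a, s'] and R[s, a, s'] (the array layout and the function name are illustrative choices, not from the slides):

    import numpy as np

    def evaluate_policy_exact(P, R, pi, gamma):
        """Solve the |S|-equation Bellman linear system (I - gamma P_pi) V = R_pi.

        P:  transition probabilities, shape (S, A, S), P[s, a, s2]
        R:  rewards, shape (S, A, S), R[s, a, s2]
        pi: policy, shape (S, A), pi[s, a] = probability of a in s
        """
        S = P.shape[0]
        # Marginalize over actions to get the chain and rewards induced by pi.
        P_pi = np.einsum('sa,sat->st', pi, P)        # P_pi[s, s2]
        R_pi = np.einsum('sa,sat,sat->s', pi, P, R)  # expected one-step reward
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)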
Iterative Methods

    V_0 → V_1 → ... → V_k → V_{k+1} → ... → V^π

Each arrow is a sweep. A sweep consists of applying a backup operation to each state.

A full policy-evaluation backup:

    V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
Iterative Policy Evaluation
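A sketch of the iterative algorithm in the same NumPy setup as above; the small threshold theta and the sweep-until-small-change stopping rule follow the standard formulation, stated here as assumptions:

    def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
        """Sweep the full policy-evaluation backup over all states until
        no state's value changes by more than theta."""
        V = np.zeros(P.shape[0])
        while True:
            # One sweep: back up every state at once.
            Q = np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)
            V_new = np.einsum('sa,sa->s', pi, Q)
            if np.max(np.abs(V_new - V)) < theta:
                return V_new
            V = V_new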
A Small Gridworld

- An undiscounted episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice as shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is -1 until the terminal state is reached
Iterative Policy Evaluation for the Small Gridworld

π = equiprobable random action choices
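A sketch of this gridworld as arrays compatible with the evaluator above. The 4x4 layout, state numbering, -1 rewards, and off-grid behavior follow the slide; representing the one terminal state as both corner cells 0 and 15 is an implementation choice:

    def make_gridworld():
        """4x4 gridworld: cells 0..15, with 0 and 15 the (shared) terminal state.
        Deterministic moves; off-grid actions leave the state unchanged; reward -1."""
        S, A = 16, 4
        P = np.zeros((S, A, S))
        R = np.full((S, A, S), -1.0)
        for s in range(1, 15):                      # nonterminal states 1..14
            row, col = divmod(s, 4)
            for a, (dr, dc) in enumerate([(-1, 0), (1, 0), (0, 1), (0, -1)]):
                nr, nc = row + dr, col + dc
                s2 = 4 * nr + nc if 0 <= nr < 4 and 0 <= nc < 4 else s
                P[s, a, s2] = 1.0
        for s in (0, 15):                           # terminal: self-loop, reward 0
            P[s, :, s] = 1.0
            R[s, :, s] = 0.0
        return P, R

    P, R = make_gridworld()
    pi_random = np.full((16, 4), 0.25)              # equiprobable random policy
    V = iterative_policy_evaluation(P, R, pi_random, gamma=1.0)

γ = 1 is safe here even though the task is undiscounted: the zero-reward terminal self-loops make the sweeps converge to V^π for this episodic task.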
Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

    Q^π(s,a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

It is better to switch to action a for state s if and only if Q^π(s,a) > V^π(s).
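This one-step lookahead is a single expression in the setup above (a small helper, named here for reuse in later sketches):

    def q_from_v(P, R, V, gamma):
        """Q^pi(s, a) from V^pi via one-step lookahead."""
        return np.einsum('sat,sat->sa', P, R) + gamma * np.einsum('sat,t->sa', P, V)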
Policy Improvement Theorem

Let π and π' be any pair of deterministic policies such that, for all s ∈ S,

    Q^π(s, π'(s)) ≥ V^π(s).

Then the policy π' must be as good as, or better than, π:

    V^{π'}(s) ≥ V^π(s) for all s ∈ S.
Policy Improvement Cont.

Do this for all states to get a new policy π' that is greedy with respect to V^π:

    π'(s) = argmax_a Q^π(s,a) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Then V^{π'} ≥ V^π.
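Greedification in code, building on q_from_v above. Encoding the deterministic policy one-hot over actions is a convenience choice so the result plugs straight back into the evaluator:

    def greedy_policy(P, R, V, gamma):
        """Deterministic policy greedy w.r.t. V, encoded one-hot over actions."""
        Q = q_from_v(P, R, V, gamma)
        pi = np.zeros_like(Q)
        pi[np.arange(Q.shape[0]), np.argmax(Q, axis=1)] = 1.0
        return pi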
Policy Improvement Cont.

What if V^{π'} = V^π? I.e., for all s ∈ S,

    V^π(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ] ?

But this is the Bellman optimality equation. So V^{π'} = V^* and both π and π' are optimal policies.
Policy Iteration

    π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π^* → V^* → π^*

Each step to a V^{π_i} is policy evaluation; each step to a new π_{i+1} is policy improvement ("greedification").
Policy Iteration
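The full loop, assembled from the sketches above. The equiprobable starting policy and tie-breaking via argmax are implementation choices:

    def policy_iteration(P, R, gamma, theta=1e-8):
        """Alternate policy evaluation and greedy improvement until the policy is stable."""
        S, A = P.shape[0], P.shape[1]
        pi = np.full((S, A), 1.0 / A)
        while True:
            V = iterative_policy_evaluation(P, R, pi, gamma, theta)
            pi_new = greedy_policy(P, R, V, gamma)
            if np.array_equal(pi_new, pi):   # stable: pi is greedy w.r.t. its own V
                return pi, V
            pi = pi_new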
Value Iteration

Recall the full policy-evaluation backup:

    V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Here is the full value-iteration backup:

    V_{k+1}(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
Value Iteration Cont.
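A sketch of value iteration in the same setup; stopping on a small threshold and returning the greedy policy at the end are the usual conventions, stated here as assumptions:

    def value_iteration(P, R, gamma, theta=1e-8):
        """Backups with a max: iterate the Bellman optimality backup to convergence."""
        V = np.zeros(P.shape[0])
        while True:
            V_new = q_from_v(P, R, V, gamma).max(axis=1)
            if np.max(np.abs(V_new - V)) < theta:
                return greedy_policy(P, R, V_new, gamma), V_new
            V = V_new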
Asynchronous DP

- All the DP methods described so far require exhaustive sweeps of the entire state set.
- Asynchronous DP does not use sweeps. Instead it works like this: repeat, until a convergence criterion is met: pick a state at random and apply the appropriate backup.
- Still needs lots of computation, but does not get locked into hopelessly long sweeps.
- Can you select states to back up intelligently? YES: an agent's experience can act as a guide. (A sketch of the random-state variant follows this list.)
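A minimal sketch of the random-state variant using the value-iteration backup; the choice of backup, the fixed backup count standing in for a convergence criterion, and the RNG seed are all illustrative:

    def asynchronous_value_iteration(P, R, gamma, n_backups=200_000, seed=0):
        """In-place DP: back up one randomly chosen state at a time, no sweeps."""
        rng = np.random.default_rng(seed)
        S = P.shape[0]
        V = np.zeros(S)
        for _ in range(n_backups):
            s = rng.integers(S)          # pick a state at random
            # Apply the value-iteration backup to this single state, in place.
            V[s] = np.max(np.einsum('at,at->a', P[s], R[s]) + gamma * P[s] @ V)
        return V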
Generalized Policy Iteration

Generalized policy iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

A geometric metaphor for convergence of GPI: evaluation and improvement act as two constraints (V = V^π, and π greedy with respect to V); each process pulls the pair (V, π) toward its own constraint, and the joint process converges to their intersection at V^*, π^*.
Linear Programming

Let T be the Bellman optimality backup operator. Since lim_{k→∞} T^k V = V^* for all V, we have, by the monotonicity of T:

    V ≥ TV  ⟹  V ≥ TV ≥ T^2 V ≥ ... ≥ V^*

Thus V^* is the smallest V that satisfies the constraint V ≥ TV.
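This observation yields the standard LP formulation, sketched here in LaTeX (the usual form; weighting the objective by any positive state distribution also works):

    \begin{align*}
    \min_{V} \quad & \sum_{s \in S} V(s) \\
    \text{s.t.} \quad & V(s) \;\geq\; \sum_{s'} P^{a}_{ss'}\bigl[R^{a}_{ss'} + \gamma V(s')\bigr]
        \quad \text{for all } s \in S,\ a \in A(s)
    \end{align*}

Requiring the constraint for every action a is equivalent to V ≥ TV, since TV takes the max over actions.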
Efficiency of DP

- Finding an optimal policy is polynomial in the number of states.
- BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.
Efficiency of DP and LP

Total number of deterministic policies: |A|^{|S|}

DP methods are polynomial-time algorithms:
- VI (each iteration): O(|S|^2 |A|)
- PI (each iteration) = the cost of policy evaluation + the cost of policy improvement
  - policy evaluation, by solving the linear system of equations: O(|S|^3), or O(|S|^{2.807}) via Strassen-style matrix inversion
  - policy evaluation, iterative: O(|S|^2 log(1/θ) / log(1/γ))
  - policy improvement: O(|S|^2 |A|)

- Each iteration of PI is computationally more expensive than each iteration of VI.
- PI typically requires fewer iterations to converge than VI.
- Both are exponentially faster than any direct search in policy space.
- The number of states often grows exponentially with the number of state variables.
Efficiency of LP

LP methods:
- Their worst-case convergence guarantees are better than those of DP methods.
- They become impractical at a much smaller number of states than do DP methods.
Summary

- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Generalized Policy Iteration (GPI)
- Asynchronous DP: a way to avoid exhaustive sweeps
- Bootstrapping: updating estimates based on other estimates