Reinforcement Learning

1 Reinforcement Learning: Solving MDPs (March/April 2015)

2 Brute Force
Solving an MDP means finding an optimal policy. A naive approach consists of enumerating all the deterministic Markov policies, evaluating each policy, and returning the best one. The number of policies is exponential: |A|^|S|. We need a more intelligent search for the best policies, e.g., restricting the search to a subset of the possible policies, or using stochastic optimization algorithms.
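To make the |A|^|S| blow-up concrete, here is a minimal sketch of the naive enumeration in Python/NumPy. It assumes a tabular MDP stored as arrays P[s, a, s'] (transition probabilities) and R[s, a] (rewards), with start distribution mu; these names are illustrative, not from the slides.

```python
import itertools
import numpy as np

def evaluate(policy, P, R, gamma):
    """Exact value of a deterministic policy: V = (I - gamma P_pi)^{-1} R_pi."""
    n = P.shape[0]
    P_pi = P[np.arange(n), policy]              # (n, n) transitions under the policy
    R_pi = R[np.arange(n), policy]              # (n,) rewards under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def brute_force(P, R, gamma, mu):
    """Enumerate all |A|^|S| deterministic Markov policies; keep the best."""
    n_states, n_actions = R.shape
    best_pi, best_val = None, -np.inf
    for pi in itertools.product(range(n_actions), repeat=n_states):
        v = evaluate(np.array(pi), P, R, gamma)
        if mu @ v > best_val:                   # score = expected value under mu
            best_pi, best_val = np.array(pi), mu @ v
    return best_pi, best_val
```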

3 What is Dynamic Programming?
Dynamic: a sequential or temporal component to the problem. Programming: optimizing a "program", i.e., a policy (c.f. linear programming). Dynamic programming is a method for solving complex problems by breaking them down into subproblems: solve the subproblems, then combine the solutions to the subproblems.

4 Requirements for Dynamic Programming
Dynamic programming is a very general solution method for problems which have two properties. Optimal substructure: the principle of optimality applies, so an optimal solution can be decomposed into subproblems. Overlapping subproblems: subproblems recur many times, so solutions can be cached and reused. Markov decision processes satisfy both properties: the Bellman equation gives a recursive decomposition, and the value function stores and reuses solutions.

5 Planning by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP; it is used for planning in an MDP. Prediction — Input: MDP ⟨S, A, P, R, γ, µ⟩ and policy π (i.e., the MRP ⟨S, P^π, R^π, γ, µ⟩); Output: value function V^π. Control — Input: MDP ⟨S, A, P, R, γ, µ⟩; Output: optimal value function V* and optimal policy π*.

6 Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems: scheduling algorithms, string algorithms (e.g., sequence alignment), graph algorithms (e.g., shortest path algorithms), graphical models (e.g., the Viterbi algorithm), and bioinformatics (e.g., lattice models).

7 Finite Horizon
Principle of optimality: the tail of an optimal policy is optimal for the tail problem. Backward induction (backward recursion):

V_k(s) = max_{a ∈ A_k} [ R_k(s, a) + Σ_{s' ∈ S_{k+1}} P_k(s'|s, a) V_{k+1}(s') ],   ∀s, k = N−1, ..., 0

Optimal policy:

π_k(s) ∈ argmax_{a ∈ A_k} [ R_k(s, a) + Σ_{s' ∈ S_{k+1}} P_k(s'|s, a) V_{k+1}(s') ],   ∀s, k = 0, ..., N−1

Cost: N |S| |A| backups, versus the |A|^{N |S|} policies of brute-force policy search. From now on, we will consider infinite horizon discounted MDPs.
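A compact sketch of the backward recursion, again in NumPy. For brevity it assumes the rewards, transitions, and action sets are the same at every stage (R_k = R, P_k = P, A_k = A); handling stage-dependent arrays is a mechanical extension.

```python
import numpy as np

def backward_induction(P, R, N):
    """Finite-horizon DP: V_k(s) = max_a [ R(s,a) + sum_s' P(s'|s,a) V_{k+1}(s') ]."""
    n_states = R.shape[0]
    V = np.zeros(n_states)                      # V_N = 0: no terminal reward
    policy = np.zeros((N, n_states), dtype=int)
    for k in range(N - 1, -1, -1):
        Q = R + P @ V                           # Q[s, a]; P @ V sums over s'
        policy[k] = Q.argmax(axis=1)            # pi_k(s)
        V = Q.max(axis=1)                       # V_k(s)
    return V, policy                            # N * |S| * |A| backups in total
```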

8 Policy Evaluation
For a given policy π, compute the state value function V^π. Recall the state value function for policy π:

V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]

Bellman equation for V^π:

V^π(s) = Σ_{a ∈ A} π(a|s) [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V^π(s') ]

A system of |S| simultaneous linear equations. Solution in matrix notation (complexity O(n³)):

V^π = (I − γ P^π)^{−1} R^π
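The closed-form solve, sketched for a stochastic tabular policy pi[s, a] (the array layout is an assumption, as before):

```python
import numpy as np

def policy_evaluation_exact(pi, P, R, gamma):
    """Solve (I - gamma P_pi) V = R_pi directly; O(n^3) in the number of states n."""
    P_pi = np.einsum('sa,sat->st', pi, P)       # P_pi[s,s'] = sum_a pi(a|s) P(s'|s,a)
    R_pi = np.einsum('sa,sa->s', pi, R)         # R_pi[s]    = sum_a pi(a|s) R(s,a)
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
```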

9 Iterative Policy Evaluation
Iterative application of the Bellman expectation backup:

V_0 → V_1 → ... → V_k → V_{k+1} → ... → V^π

A full policy evaluation backup:

V_{k+1}(s) ← Σ_{a ∈ A} π(a|s) [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V_k(s') ]

A sweep consists of applying a backup operation to each state, using synchronous backups: at each iteration k+1, for all states s ∈ S, update V_{k+1}(s) from V_k(s').
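A minimal sketch of the synchronous sweep, iterated until the values stop moving (the tolerance-based stopping rule is a common choice, not something the slide prescribes):

```python
import numpy as np

def iterative_policy_evaluation(pi, P, R, gamma, tol=1e-8):
    """Repeat the Bellman expectation backup until successive sweeps agree."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (P @ V)                 # Q[s,a] = R(s,a) + gamma sum_s' P V
        V_new = (pi * Q).sum(axis=1)            # weight actions by pi(a|s)
        if np.max(np.abs(V_new - V)) < tol:     # synchronous: V_new built from old V
            return V_new
        V = V_new
```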

10 Example: Small Gridworld
Undiscounted episodic MDP (γ = 1). All episodes terminate in an absorbing terminal state. Transient states 1, ..., 14; one terminal state (shown twice as shaded squares). Actions that would take the agent off the grid leave the state unchanged. Reward is −1 per step until the terminal state is reached.

11 Policy Evaluation in Small Gridworld

12 Policy Evaluation in Small Gridworld

13 Policy Improvement
Consider a deterministic policy π. For a given state s, would it be better to take an action a ≠ π(s)? We can improve the policy by acting greedily:

π'(s) = argmax_{a ∈ A} Q^π(s, a)

This improves the value from any state s over one step:

Q^π(s, π'(s)) = max_{a ∈ A} Q^π(s, a) ≥ Q^π(s, π(s)) = V^π(s)

14 Policy Improvement Theorem
Theorem. Let π and π' be any pair of deterministic policies such that Q^π(s, π'(s)) ≥ V^π(s) for all s ∈ S. Then the policy π' must be as good as, or better than, π: V^{π'}(s) ≥ V^π(s) for all s ∈ S.

Proof.
V^π(s) ≤ Q^π(s, π'(s)) = E_{π'}[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
≤ E_{π'}[ r_{t+1} + γ Q^π(s_{t+1}, π'(s_{t+1})) | s_t = s ]
≤ E_{π'}[ r_{t+1} + γ r_{t+2} + γ² Q^π(s_{t+2}, π'(s_{t+2})) | s_t = s ]
≤ ... ≤ E_{π'}[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s ] = V^{π'}(s)

15 Policy Iteration
What if improvement stops (V^{π'} = V^π)? Then

Q^π(s, π'(s)) = max_{a ∈ A} Q^π(s, a) = Q^π(s, π(s)) = V^π(s)

But this is the Bellman optimality equation. Therefore V^π(s) = V^{π'}(s) = V*(s) for all s ∈ S, so π is an optimal policy! Policy iteration alternates evaluation and greedy improvement:

π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V^{π*}
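The whole loop fits in a few lines; a sketch combining the exact evaluation above with greedy improvement, terminating when the greedy policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy improvement until stable."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        P_pi = P[np.arange(n_states), pi]       # evaluation: solve the linear system
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        pi_new = (R + gamma * (P @ V)).argmax(axis=1)   # improvement: greedy in Q^pi
        if np.array_equal(pi_new, pi):          # improvement stopped: pi is optimal
            return pi, V
        pi = pi_new
```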

16 Example of Jack's Car Rental
States: two locations, maximum of 20 cars each. Actions: move up to 5 cars between the two locations overnight. Reward: $10 for each car rented (the car must be available). Transitions: cars are returned and requested randomly, Poisson distributed — n returns/requests with probability (λⁿ / n!) e^{−λ}. First location: average requests = 3, average returns = 3. Second location: average requests = 4, average returns = 2.

17 Example of Jack's Car Rental

18 Modified Policy Iteration
Does policy evaluation need to converge to V^π? Or should we introduce a stopping condition, e.g., ɛ-convergence of the value function? Or simply stop after k iterations of iterative policy evaluation? (For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy.) Why not update the policy every iteration, i.e., stop after k = 1? This is equivalent to value iteration.

19 Generalized Policy Iteration
Policy evaluation: estimate V^π, e.g., by iterative policy evaluation. Policy improvement: generate π' ≥ π, e.g., by greedy policy improvement.

20 Principle of Optimality
Deterministic Shortest Path Problem: reach the goal state g from the start state s_1 with minimum cost. Each action a from state s leads to a deterministic transition: P(s'|s, a) = 1 iff s' = succ(s, a). The reward R(s, a) is negative (it is a cost), and the problem is undiscounted: γ = 1.

Theorem. A path s_1, a_1, s_2, a_2, ..., s_T is optimal if and only if, for any intermediate state s_t along the solution path, the tail s_t, a_t, ..., s_T is an optimal path from s_t to g.

21 Principle of Optimality
Specifically, consider subdividing the path after one step. Then an optimal path from s consists of an optimal first action a*, followed by an optimal path from s' = succ(s, a*). Therefore the value of an optimal path must satisfy:

V*(s) = max_{a ∈ A} [ R(s, a) + V*(s') ],   s' = succ(s, a)

22 Deterministic Value Iteration
If we know the solution to the subproblems V*(s'), then it is easy to construct the solution to V*(s):

V(s) ← max_{a ∈ A} [ R(s, a) + V(s') ],   s' = succ(s, a)

The idea of value iteration is to apply these updates iteratively, e.g., starting from the goal and working backward.

23 Value Iteration in MDPs
Many MDPs don't have a finite horizon; they are typically loopy, so there is no end to work backwards from. However, we can still propagate information backwards, using the Bellman optimality equation to back up V(s) from V(s'). Each subproblem is easier due to the discount factor γ. Iterate until convergence.

24 Principle of Optimality in MDPs
Theorem. A policy π(a|s) achieves the optimal value from state s, V^π(s) = V*(s), if and only if, for any state s' reachable from s, π achieves the optimal value from state s': V^π(s') = V*(s').

So an optimal policy π*(a|s) must consist of an optimal first action a*, followed by an optimal policy from the successor state s':

V*(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V*(s') ]

25 Value Iteration
Problem: find the optimal policy π*. Solution: iterative application of the Bellman optimality backup:

V_1 → V_2 → ... → V*

Using synchronous backups: at each iteration k+1, for all states s ∈ S, update V_{k+1}(s) from V_k(s'). Unlike policy iteration, there is no explicit policy, and intermediate value functions may not correspond to any policy. demo: poole/demos/mdp/vi.html
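A sketch of the loop, with a greedy policy read off only at the end (consistent with the remark that intermediate value functions need not correspond to any policy); the stopping threshold eps is an assumption, tied to the bound on the next slide:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Iterate the Bellman optimality backup until ||V_{i+1} - V_i|| < eps."""
    V = np.zeros(R.shape[0])
    while True:
        V_new = (R + gamma * (P @ V)).max(axis=1)   # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < eps:         # then ||V_new - V*|| < 2*eps*gamma/(1-gamma)
            break
        V = V_new
    pi = (R + gamma * (P @ V_new)).argmax(axis=1)   # greedy policy extracted at the end
    return pi, V_new
```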

26 Convergence and Contractions
Define the max norm: ‖V‖_∞ = max_s |V(s)|.

Theorem. Value iteration converges to the optimal state-value function: lim_{k→∞} V_k = V*.

Proof. ‖V_{k+1} − V*‖_∞ = ‖T V_k − T V*‖_∞ ≤ γ ‖V_k − V*‖_∞ ≤ ... ≤ γ^{k+1} ‖V_0 − V*‖_∞ → 0

Theorem. If ‖V_{i+1} − V_i‖_∞ < ɛ, then ‖V_{i+1} − V*‖_∞ < 2ɛγ / (1 − γ).

27 Synchronous Dynamic Programming Algorithms

Problem    | Bellman Equation                                   | Algorithm
Prediction | Bellman expectation equation                       | Iterative policy evaluation
Control    | Bellman expectation equation + greedy improvement  | Policy iteration
Control    | Bellman optimality equation                        | Value iteration

These algorithms are based on the state value function V^π(s) or V*(s); complexity O(mn²) per iteration, for m actions and n states. They could also be applied to the action value function Q^π(s, a) or Q*(s, a); complexity O(m²n²) per iteration.

28 Efficiency of DP
Finding an optimal policy is polynomial in the number of states... but the number of states is often astronomical, e.g., often growing exponentially with the number of state variables: the curse of dimensionality. In practice, classical DP can be applied to problems with a few million states. Asynchronous DP can be applied to larger problems and is appropriate for parallel computation. It is surprisingly easy to come up with MDPs for which DP methods are not practical.

29 Complexity of DP
DP methods are polynomial time algorithms for fixed discounted MDPs. Value iteration: O(|S|² |A|) for each iteration. Policy iteration: cost of policy evaluation + cost of policy improvement. Policy evaluation — solving the linear system directly: O(|S|³), or O(|S|^{2.373}) with fast matrix multiplication; iteratively: O(|S|² log(ɛ^{−1}) / log(γ^{−1})). Policy improvement: the number of improvement steps was recently proven to be O( (|A| / (1 − γ)) log(|S| / (1 − γ)) ). Each iteration of PI is computationally more expensive than each iteration of VI, but PI typically requires fewer iterations to converge than VI. This is exponentially faster than any direct policy search. The number of states often grows exponentially with the number of state variables.

30 Asynchronous Dynamic Programming
The DP methods described so far used synchronous backups, i.e., all states are backed up in parallel. Asynchronous DP backs up states individually, in any order: for each selected state, apply the appropriate backup. This can significantly reduce computation, and convergence is guaranteed if all states continue to be selected. Three ideas for asynchronous DP: in-place DP, prioritized sweeping, and real-time DP.

31 In-Place Dynamic Programming
Synchronous value iteration stores two copies of the value function: for all s ∈ S,

V_new(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V_old(s') ],   then V_old ← V_new

In-place value iteration stores only one copy of the value function: for all s ∈ S,

V(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V(s') ]
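The difference is a one-line change; a sketch of a single in-place sweep, where later backups within the sweep already see the values written by earlier ones:

```python
import numpy as np

def in_place_sweep(V, P, R, gamma):
    """One in-place sweep of value iteration: only one copy of V is stored."""
    for s in range(len(V)):
        V[s] = np.max(R[s] + gamma * P[s] @ V)  # immediately reuses fresh values
    return V
```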

32 Prioritized Sweeping
Use the magnitude of the Bellman error to guide state selection, e.g.,

| max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V_k(s') ] − V(s) |

Back up the state with the largest remaining Bellman error, and update the Bellman error of the affected states after each backup. This requires knowledge of the reverse dynamics (predecessor states), and can be implemented efficiently by maintaining a priority queue.
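A sketch under the same array conventions; it simplifies the bookkeeping by leaving stale entries in the heap and re-checking the error on pop, which is a common lazy-deletion shortcut rather than the only way to maintain the queue:

```python
import heapq
import numpy as np

def prioritized_sweeping(P, R, gamma, max_backups=10_000, tol=1e-10):
    """Back up the state with the largest Bellman error first."""
    n_states = R.shape[0]
    V = np.zeros(n_states)
    # reverse dynamics: predecessors of each state, needed to propagate errors
    pred = [np.nonzero(P[:, :, s].sum(axis=1))[0] for s in range(n_states)]

    def bellman_error(s):
        return abs(np.max(R[s] + gamma * P[s] @ V) - V[s])

    heap = [(-bellman_error(s), s) for s in range(n_states)]
    heapq.heapify(heap)                         # max-heap via negated priorities
    for _ in range(max_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        if bellman_error(s) < tol:              # entry may be stale: re-check
            continue
        V[s] = np.max(R[s] + gamma * P[s] @ V)
        for p in pred[s]:                       # predecessors' errors changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return V
```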

33 Real-Time Dynamic Programming
Idea: only back up states that are relevant to the agent, using the agent's experience to guide the selection of states. After each time step ⟨s_t, a_t, r_{t+1}⟩, select the greedy action

a_t ∈ argmax_{a ∈ A} [ R(s_t, a) + γ Σ_{s' ∈ S} P(s'|s_t, a) V(s') ]

and back up the state s_t:

V(s_t) ← max_{a ∈ A} [ R(s_t, a) + γ Σ_{s' ∈ S} P(s'|s_t, a) V(s') ]

Theorem. If V_0 ≥ V*, then there exists t̄ such that a_t is optimal for all t ≥ t̄ (where t̄ < ∞ with probability 1).
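One RTDP step might look as follows; sampling the next state from the model is an assumption made for the sketch (in a real system the environment supplies it):

```python
import numpy as np

def rtdp_step(s, V, P, R, gamma, rng):
    """Greedy action at the current state, backup of that state only, then move on."""
    q = R[s] + gamma * P[s] @ V                 # one-step lookahead values at s
    a = int(np.argmax(q))                       # a_t: greedy action
    V[s] = q[a]                                 # back up only the visited state
    return rng.choice(len(V), p=P[s, a])        # sample s_{t+1} from the model

# usage: s = rtdp_step(s, V, P, R, gamma, np.random.default_rng(0))
```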

34 Full Width Backups
Dynamic programming uses full-width backups: for each backup (synchronous or asynchronous), every successor state and action is considered, using knowledge of the MDP transition and reward functions. Dynamic programming is effective for medium-size problems (millions of states). For large problems dynamic programming suffers from Bellman's curse of dimensionality: the number of states n = |S| grows exponentially with the number of state variables, and even one backup can be too expensive.

35 Sample Backups
Reinforcement learning techniques exploit sample backups. Sample backups do not use the reward function R and the transition dynamics P; instead they use sample rewards and sample transitions ⟨s, a, s', r⟩. Advantages: model-free (no prior knowledge of the MDP required); breaks the curse of dimensionality through sampling; the cost of a backup is constant, independent of n = |S|.

36 Approximate Dynamic Programming
Approximate the value function using a function approximator V_θ(s) = f(s, θ), and apply dynamic programming to V_θ. For example, fitted value iteration repeats at each iteration k: sample states S̃ ⊆ S; for each state s ∈ S̃, estimate the target value using the Bellman optimality equation,

Ṽ_k(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V_{θ_k}(s') ];

then train the next value function V_{θ_{k+1}} using the targets { ⟨s, Ṽ_k(s)⟩ }.
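A sketch with a linear approximator V_θ(s) = φ(s)ᵀθ fitted by least squares; Phi_all (features of every state, used to evaluate V_θ at successors) and S_hat (indices of the sampled states) are illustrative names, and the choice of regression is an assumption:

```python
import numpy as np

def fitted_value_iteration(S_hat, Phi_all, P, R, gamma, n_iters=100):
    """Fitted VI with V_theta = Phi_all @ theta; regression only on sampled states."""
    theta = np.zeros(Phi_all.shape[1])
    for _ in range(n_iters):
        V = Phi_all @ theta                     # current V_theta at all states
        targets = (R[S_hat] + gamma * (P[S_hat] @ V)).max(axis=1)   # Bellman targets
        theta, *_ = np.linalg.lstsq(Phi_all[S_hat], targets, rcond=None)
    return theta
```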

37 Infinite Horizon Linear Programming
Recall that at value iteration convergence we have, for all s ∈ S:

V*(s) = max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V*(s') ]

LP formulation to find V*:

min_V Σ_{s ∈ S} µ(s) V(s)
s.t. V(s) ≥ R(s, a) + γ Σ_{s' ∈ S} P(s'|s, a) V(s'),   ∀s ∈ S, a ∈ A

There are |S| variables and |S| × |A| constraints.

Theorem. V* is the solution of the above LP.
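A sketch of this primal LP using scipy.optimize.linprog; rearranging each constraint as (γ P(·|s, a) − e_s)ᵀ V ≤ −R(s, a) puts it in the solver's A_ub V ≤ b_ub form, and V must be declared free since linprog defaults to nonnegative variables:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_primal_lp(P, R, gamma, mu):
    """min mu^T V  s.t.  V(s) >= R(s,a) + gamma sum_s' P(s'|s,a) V(s') for all s, a."""
    n_s, n_a = R.shape
    A_ub = gamma * P.reshape(n_s * n_a, n_s)    # one row per (s, a) pair, C order
    A_ub[np.arange(n_s * n_a), np.repeat(np.arange(n_s), n_a)] -= 1.0   # subtract e_s
    res = linprog(c=mu, A_ub=A_ub, b_ub=-R.reshape(-1),
                  bounds=[(None, None)] * n_s)  # V is free, not nonnegative
    return res.x                                # V*, assuming the solver succeeds
```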

38 Theorem Proof
Let T be the optimal Bellman operator; then the LP can be written as:

min_V µᵀ V   s.t.   V ≥ T(V)

Monotonicity property: if U ≥ V then T(U) ≥ T(V). Hence, if V ≥ T(V), then T(V) ≥ T(T(V)), and by repeated application,

V ≥ T(V) ≥ T²(V) ≥ T³(V) ≥ ... ≥ T^∞(V) = V*

Any feasible solution to the LP must satisfy V ≥ T(V), and hence must satisfy V ≥ V*. Hence, assuming all entries of µ are positive, V* is the optimal solution to the LP.

39 Dual Program

max_λ Σ_{s ∈ S} Σ_{a ∈ A} λ(s, a) R(s, a)      (1)
s.t. Σ_{a' ∈ A} λ(s', a') = µ(s') + γ Σ_{s ∈ S} Σ_{a ∈ A} λ(s, a) P(s'|s, a),   ∀s' ∈ S      (2)
     λ(s, a) ≥ 0,   ∀s ∈ S, a ∈ A

Interpretation: λ(s, a) = Σ_{t=0}^∞ γ^t P(s_t = s, a_t = a). Equation (2) ensures λ has the above meaning; equation (1) maximizes the expected discounted sum of rewards. Optimal policy: π*(s) = argmax_a λ(s, a).
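The same solver handles the dual; a sketch that builds the flow-balance constraints (2) over the flattened (s, a) variables and reads the optimal policy off the visitation frequencies:

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_dual_lp(P, R, gamma, mu):
    """max sum lambda(s,a) R(s,a)  s.t. the discounted visitation-flow constraints."""
    n_s, n_a = R.shape
    # A_eq[s', (s,a)] = 1{s = s'} - gamma * P(s'|s,a); variables flattened in C order
    A_eq = np.kron(np.eye(n_s), np.ones(n_a)) - gamma * P.reshape(n_s * n_a, n_s).T
    res = linprog(c=-R.reshape(-1),             # linprog minimizes, so negate rewards
                  A_eq=A_eq, b_eq=mu)           # default bounds give lambda >= 0
    lam = res.x.reshape(n_s, n_a)               # assuming the solver succeeds
    return lam.argmax(axis=1)                   # pi*(s) = argmax_a lambda(s, a)
```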

40 Complexity of LP
LP worst-case convergence guarantees are better than those of DP methods, but LP methods become impractical at a much smaller number of states than DP methods do.
