Model-Free vs. Model-Based RL: Q, SARSA, & E³


1 Model-Free vs. Model-Based RL: Q, SARSA, & E³

2 Administrivia Reminder: office hours tomorrow are truncated to 9:00-10:15 AM; can schedule other times if necessary. Final projects: final presentations Dec 2, 7, 9; 20 min (max) presentations, 3 or 4 per day. Sign up for presentation slots today!

3 The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; action space A; discount γ (0<=γ<1); learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s=get_current_world_state()
  a=pick_next_action(Q,s)
  (r,s')=act_in_world(a)
  Q(s,a)=Q(s,a)+α*(r+γ*max_a'(Q(s',a'))-Q(s,a))
} Until (bored)
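A minimal tabular sketch of this update in Python (illustrative only; the environment interface env.reset()/env.step() and the ε-greedy action choice are assumptions, not part of the original slides):

# Minimal tabular Q-learning sketch (illustrative; the env interface is an assumption).
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)                          # Q[(s, a)] -> value, default 0
    for _ in range(episodes):
        s = env.reset()                             # assumed: returns the start state
        done = False
        while not done:
            # pick_next_action: here, epsilon-greedy exploration
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = env.step(a)           # assumed: reward, next state, terminal flag
            # off-policy backup: bootstrap from the greedy action at s'
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q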

4 SARSA-learning algorithm
Algorithm: SARSA_learn
Inputs: State space S; action space A; discount γ (0<=γ<1); learning rate α (0<=α<1)
Outputs: Q
s=get_current_world_state()
a=pick_next_action(Q,s)
Repeat {
  (r,s')=act_in_world(a)
  a'=pick_next_action(Q,s')
  Q(s,a)=Q(s,a)+α*(r+γ*Q(s',a')-Q(s,a))
  a=a'; s=s'
} Until (bored)
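The same kind of sketch for SARSA; the only change is which action's value gets backed up (again illustrative, with the same assumed env interface):

# Minimal tabular SARSA sketch (illustrative; same assumed env interface as above).
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=1000, gamma=0.9, alpha=0.1, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            r, s_next, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            # on-policy backup: bootstrap from the action SARSA will actually take at s'
            target = r + gamma * (0.0 if done else Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q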

5 SARSA vs. Q SARSA and Q-learning are very similar. SARSA updates Q(s,a) for the policy it's actually executing: it lets the pick_next_action() function choose the action used in the update. Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q: it uses max_a' to pick the update action, which might differ from the action it actually executes at s'. In practice: Q-learning will learn the true π*, but SARSA will learn about what it's actually doing. Exploration can get Q-learning in trouble...

6 Radioactive breadcrumbs Can now define eligibility traces for SARSA. In addition to the Q(s,a) table, keep an e(s,a) table that records an eligibility (a real number) for each state/action pair. At every step (each (s,a,r,s',a') tuple): increment e(s,a) for the current (s,a) pair by 1; update all Q values in proportion to their eligibilities; decay all eligibilities by a factor of λγ. Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.

7 SARSA(λ)-learning alg.
Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a)=0 // for all s, a
s=get_curr_world_st(); a=pick_nxt_act(Q,s)
Repeat {
  (r,s')=act_in_world(a)
  a'=pick_next_action(Q,s')
  δ=r+γ*Q(s',a')-Q(s,a)
  e(s,a)+=1
  foreach (s'',a'') pair in S×A {
    Q(s'',a'')=Q(s'',a'')+α*e(s'',a'')*δ
    e(s'',a'')*=λγ
  }
  a=a'; s=s'
} Until (bored)
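A sketch of the same loop in Python with an accumulating trace table (illustrative; it keeps only nonzero traces in a dict rather than literally looping over all of S×A):

# Minimal SARSA(lambda) sketch with accumulating traces (illustrative; env interface assumed).
import random
from collections import defaultdict

def sarsa_lambda(env, actions, episodes=1000, gamma=0.9, alpha=0.1, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)                      # eligibility traces, reset each episode
        s = env.reset()
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda x: Q[(s, x)]))
        done = False
        while not done:
            r, s_next, done = env.step(a)
            a_next = (random.choice(actions) if random.random() < epsilon
                      else max(actions, key=lambda x: Q[(s_next, x)]))
            delta = r + gamma * (0.0 if done else Q[(s_next, a_next)]) - Q[(s, a)]
            e[(s, a)] += 1.0                        # accumulate eligibility for the visited pair
            for sa in list(e):                      # only pairs with nonzero trace matter
                Q[sa] += alpha * e[sa] * delta
                e[sa] *= gamma * lam                # decay every trace by lambda*gamma
            s, a = s_next, a_next
    return Q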

8 The trail of crumbs [Figure: path taken by the agent through a gridworld. Sutton & Barto, Sec 7.5]

9 The trail of crumbs [Figure: action values increased by one-step SARSA (λ=0). Sutton & Barto, Sec 7.5]

10 The trail of crumbs [Figure: action values increased by SARSA(λ) with λ=0.9. Sutton & Barto, Sec 7.5]

11 Eligibility for a single state [Figure: the accumulating eligibility trace e(s_i,a_j) over time, jumping up at the 1st and 2nd visits to the state and decaying between visits. Sutton & Barto, Sec 7.5]

12 Eligibility trace followup The eligibility trace allows: tracking where the agent has been; backup of rewards over longer periods; credit assignment: state/action pairs are rewarded for having contributed to getting to the reward. Why does it work?

13 The forward view of elig. Original SARSA did a one-step backup: Q(s,a) ← (1−α)·Q(s,a) + α·Δ^(1)Q(s,a), where Δ^(1)Q(s,a) = r_t + γ·Q(s_{t+1},a_{t+1}). [Diagram: the backed-up info is the immediate reward r_t plus Q(s_{t+1},a_{t+1}) standing in for the rest of the trajectory.]

14 The forward view of elig. Original SARSA did a one-step backup: Q(s,a) ← (1−α)·Q(s,a) + α·Δ^(1)Q(s,a), where Δ^(1)Q(s,a) = r_t + γ·Q(s_{t+1},a_{t+1}). Could also do a two-step backup: Q(s,a) ← (1−α)·Q(s,a) + α·Δ^(2)Q(s,a), where Δ^(2)Q(s,a) = r_t + γ·r_{t+1} + γ²·Q(s_{t+2},a_{t+2}). [Diagram: the backed-up info is now r_t, r_{t+1}, and Q(s_{t+2},a_{t+2}) for the rest of the trajectory.]

15 The forward view of elig. Original SARSA did a one-step backup: Δ^(1)Q(s,a) = r_t + γ·Q(s_{t+1},a_{t+1}). Could also do a two-step backup: Δ^(2)Q(s,a) = r_t + γ·r_{t+1} + γ²·Q(s_{t+2},a_{t+2}). Or even an n-step backup: Q(s,a) ← (1−α)·Q(s,a) + α·Δ^(n)Q(s,a), where Δ^(n)Q(s,a) = Σ_{i=0}^{n−1} γ^i·r_{t+i} + γ^n·Q(s_{t+n},a_{t+n}).

16 The forward view of elig. Small-step backups (n=1, n=2, etc.) are slow and nearsighted. Large-step backups (n=100, n=1000, n=∞) are expensive and may miss near-term effects. Want a way to combine them. Can take a weighted average of different backups, e.g. with weight 1/3 on a 2-step backup and 2/3 on a longer n-step backup: Q(s,a) ← (1−α)·Q(s,a) + α·( (1/3)·Δ^(2)Q(s,a) + (2/3)·Δ^(n)Q(s,a) )

17 The forward view of elig. [Diagram: the weighted-average backup Q(s,a) ← (1−α)·Q(s,a) + α·( (1/3)·Δ^(2)Q(s,a) + (2/3)·Δ^(n)Q(s,a) ) drawn over the trajectory, with weight 1/3 on the shorter backup and 2/3 on the longer one.]

18 The forward view of elig. How do you know which number of steps to average over? And what the weights should be? Accumulating eligibility traces are just a clever way to easily average over all n: Q(s,a) ← (1−α)·Q(s,a) + α·(1−λ)·Σ_{i=1}^{∞} λ^{i−1}·Δ^(i)Q(s,a)
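For intuition, here is a small forward-view sketch that computes the n-step backups and their λ-weighted average for one recorded episode (illustrative only; the reward and Q-value lists are assumed inputs, and the leftover weight on the full-episode backup handles the finite-episode case):

# Forward-view lambda-return sketch for one recorded episode (illustrative).
# rewards[i] is r_{t+i}; q_values[i] is Q(s_{t+i}, a_{t+i}); index 0 is the step being updated.

def n_step_backup(rewards, q_values, n, gamma):
    """Delta^(n) Q: n discounted rewards plus the bootstrapped value n steps ahead."""
    n = min(n, len(rewards))                        # truncate at the end of the episode
    target = sum(gamma ** i * rewards[i] for i in range(n))
    if n < len(q_values):                           # bootstrap only if a later estimate exists
        target += gamma ** n * q_values[n]
    return target

def lambda_return(rewards, q_values, gamma, lam):
    """Weighted average (1-lambda) * sum_i lambda^(i-1) * Delta^(i) Q over all available i."""
    total, weight_left = 0.0, 1.0
    for i in range(1, len(rewards) + 1):
        w = (1 - lam) * lam ** (i - 1)
        total += w * n_step_backup(rewards, q_values, i, gamma)
        weight_left -= w
    # the leftover weight goes to the full-episode backup so the weights sum to 1
    total += weight_left * n_step_backup(rewards, q_values, len(rewards), gamma)
    return total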

19 The forward view of elig. [Diagram: the λ-weighted average α·(1−λ)·Σ_{i=1}^{∞} λ^{i−1}·Δ^(i)Q(s,a), with weights λ⁰, λ¹, λ², ..., λ^{n−1} on successively longer backups.]

20 Replacing traces The kind just described are accumulating e-traces: every time you return to a state/action, you add extra eligibility. There are also replacing eligibility traces: every time you return to a state/action, reset e(s,a) to 1. Works better sometimes. [Figure: e(s,a) over times of state visits for an accumulating trace vs. a replacing trace. Sutton & Barto, Sec 7.8]
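In code the two variants differ by a single line in the trace update (sketch, using the e table from the SARSA(λ) sketch above):

e[(s, a)] += 1.0    # accumulating trace: repeat visits pile up, so eligibility can exceed 1
e[(s, a)] = 1.0     # replacing trace: a repeat visit just resets eligibility to 1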

21 Model-free vs. Model-based

22 What do you know? Both Q-learning and SARSA(λ) are model-free methods (a.k.a. value-based methods): they learn a Q function but never learn T or R explicitly. At the end of learning, the agent knows how to act, but doesn't explicitly know anything about the environment. Also, there are no guarantees about the explore/exploit tradeoff. Sometimes you want one or both of the above.

23 Model-based methods Model-based methods, OTOH, do explicitly learn T & R. At the end of learning, they have the entire model M = ⟨S,A,T,R⟩, and also have π*. At least one model-based method also guarantees explore/exploit tradeoff properties.

24 E³ Efficient Explore & Exploit algorithm (Kearns & Singh, Machine Learning 49, 2002). Explicitly keeps a T matrix and an R table. Plan (policy iteration) with the current T & R -> current π. Every state/action entry in T and R can be marked known or unknown and has a #visits counter, nv(s,a). After every ⟨s,a,r,s'⟩ tuple, update T & R (running average). When nv(s,a) > NVthresh, mark the cell as known & re-plan. When all states are known, learning is done & we have π*.

25 The E³ algorithm
Algorithm: E3_learn_sketch // only an overview
Inputs: S, A, γ (0<=γ<1), NVthresh, R_max, Var_max
Outputs: T, R, π*
Initialization:
  R(s)=R_max // for all s
  T(s,a,s')=1/|S| // for all s,a,s'
  known(s,a)=0; nv(s,a)=0 // for all s,a
  π=policy_iter(S,A,T,R)

26 The E³ algorithm
Algorithm: E3_learn_sketch // con't
Repeat {
  s=get_current_world_state()
  a=π(s)
  (r,s')=act_in_world(a)
  T(s,a,s')=(1+T(s,a,s')*nv(s,a))/(nv(s,a)+1)
  nv(s,a)++
  if (nv(s,a)>NVthresh) {
    known(s,a)=1
    π=policy_iter(S,A,T,R)
  }
} Until (all (s,a) known)
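A rough Python sketch of this bookkeeping (illustrative only: the planner is passed in as an assumed policy_iteration(states, actions, T, R, gamma) function returning a policy dict, the env interface is an assumption, and the full explore-vs-exploit analysis of Kearns & Singh is omitted):

# Rough sketch of the E^3 bookkeeping loop (illustrative; omits the balanced-wandering /
# exploit-vs-explore machinery of the real algorithm).
from collections import defaultdict

def e3_sketch(env, states, actions, policy_iteration,
              gamma=0.9, nv_thresh=50, r_max=1.0, max_steps=100000):
    T = defaultdict(lambda: 1.0 / len(states))      # T[(s, a, s')] -> estimated probability
    R = {s: r_max for s in states}                  # optimistic initial reward estimates
    nv = defaultdict(int)                           # visit counts nv[(s, a)]
    ns = defaultdict(int)                           # state visit counts, for the R average
    known = set()                                   # (s, a) pairs with enough visits
    pi = policy_iteration(states, actions, T, R, gamma)   # assumed planner

    s = env.reset()
    for _ in range(max_steps):
        a = pi[s]
        r, s_next = env.step(a)                     # assumed env interface
        n = nv[(s, a)]
        # running-average update of T(s,a,.); the slide shows only the observed-s' line,
        # the renormalization of the other destinations is added here so T stays a distribution
        T[(s, a, s_next)] = (1 + T[(s, a, s_next)] * n) / (n + 1)
        for s2 in states:
            if s2 != s_next:
                T[(s, a, s2)] = T[(s, a, s2)] * n / (n + 1)
        R[s] = (r + R[s] * ns[s]) / (ns[s] + 1)     # running-average reward estimate
        ns[s] += 1
        nv[(s, a)] += 1
        if nv[(s, a)] > nv_thresh and (s, a) not in known:
            known.add((s, a))                       # enough data: mark known and re-plan
            pi = policy_iteration(states, actions, T, R, gamma)
        if len(known) == len(states) * len(actions):
            break                                   # all (s, a) known: pi approximates pi*
        s = s_next
    return pi, T, R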

Suggested reading: Chapter 7 (Eligibility Traces) in R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
