Machine Learning
Reinforcement Learning

Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Institut für Informatik, Technische Fakultät
Albert-Ludwigs-Universität Freiburg
riedmiller@informatik.uni-freiburg.de

Motivation

Can a software agent learn to play Backgammon by itself?
Learning from success or failure.
Neuro-Backgammon: playing at world-champion level (Tesauro, 1995)
Can a software agent learn to balance a pole by itself?
Learning from success or failure.
Neural RL controllers: noisy, unknown, nonlinear systems (Riedmiller et al.)

Can a software agent learn to cooperate with others by itself?
Learning from success or failure.
Cooperative RL agents: complex, multi-agent, cooperative tasks (Riedmiller et al.)
Reinforcement Learning has biological roots: reward and punishment.
There is no teacher. Instead, the agent has actions and a goal, and must learn an algorithm/policy by itself.
Overview

I. Reinforcement Learning - Basics

Actor-Critic Scheme (Barto, Sutton, 1983)
The agent consists of an actor and a critic interacting with the world: the actor emits actions, the world returns states s and an external reward.
Critic: maps the external, delayed reward into an internal training signal.
Actor: represents the policy.

A First Example
In every state, choose an action a from a small discrete action set, until the goal state is reached.
The Temporal Credit Assignment Problem
Which action(s) in the sequence has to be changed? This is the temporal credit assignment problem.

Sequential Decision Making
Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1992), Cart-Pole balancing (AHC/ACE (Barto, Sutton, Anderson, 1983)), robotics and control, ...
Three Steps
1. Describe the environment as a Markov Decision Process (MDP).
2. Formulate the learning task as a dynamic optimization problem.
3. Solve the dynamic optimization problem by dynamic programming methods.
1. Description of the environment
S: (finite) set of states
A: (finite) set of actions
Behaviour of the environment (model): p : S × S × A → [0, 1], where p(s', s, a) is the probability distribution of the transition.
For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function f : S × A → S, s' = f(s, a).
Markov property: the transition depends only on the current state and action:
Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...)

2. Formulation of the learning task
Every transition emits transition costs (immediate costs), c : S × A → R (sometimes also called immediate reward, r).
Now, an agent policy π : S → A can be evaluated (and judged).
Consider the path costs: J^π(s) = Σ_t c(s_t, π(s_t)), with s_0 = s.
Wanted: an optimal policy π* : S → A, where J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }.
Additive (path-)costs allow all events along the way to be taken into account.
The choice of the immediate cost function c(·) specifies the policy to be learned. Example:
c(s) = 0, if s is a success state; 1000, if s is a failure state; 1, else.
With such costs, e.g. a path that fails after four steps has path costs J^π(s_start) = 1000 + 4 · 1 = 1004, while any successful path is strictly cheaper.
Does this solve the temporal credit assignment problem? YES!
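The path-cost definition can be made concrete with a small sketch. The corridor world below (states, transition function, and costs) is entirely made up for illustration; it is not the maze from the slides.

```python
# Hypothetical 4-state corridor: states 0..3, state 3 is the absorbing goal.
# Costs in the slide's scheme: 0 at the goal, 1 per ordinary transition.

def f(s, a):
    """Deterministic transition: a = +1 (right) or -1 (left), clipped to [0, 3]."""
    return max(0, min(3, s + a))

def c(s, a):
    """Immediate cost: 0 once the goal is reached, 1 for every other transition."""
    return 0 if s == 3 else 1

def path_costs(s, policy):
    """J^pi(s) = sum_t c(s_t, pi(s_t)) with s_0 = s; stops at the absorbing goal."""
    total = 0
    while s != 3:
        a = policy(s)
        total += c(s, a)
        s = f(s, a)
    return total

# The "always step right" policy needs 3 steps from state 0:
print(path_costs(0, lambda s: +1))  # -> 3
```

A policy is judged purely by these accumulated costs; no step-by-step teacher signal is needed.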
3. Solving the optimization problem
For the optimal path costs it is known that J*(s) = min_a { c(s, a) + J*(f(s, a)) } (Principle of Optimality (Bellman, 1959)).
Can we compute J*? (We will see soon why we want it.)
Note: the specification of the requested policy by c(·) is simple!

Computing J*: the value iteration (VI) algorithm
Start with an arbitrary J_0(s).
For all states s: J_{k+1}(s) := min_{a∈A} { c(s, a) + J_k(f(s, a)) }

Convergence of value iteration
Value iteration converges under certain assumptions, i.e. we have lim_{k→∞} J_k = J*.
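The update rule above fits in a few lines of code. The 4-state corridor below is a made-up stand-in for the maze example (state 3 is the absorbing goal, every other step costs 1); it is a sketch, not the slides' original implementation.

```python
# Value iteration for a deterministic toy problem:
# J_{k+1}(s) := min_a { c(s,a) + J_k(f(s,a)) }.

N_STATES = 4          # states 0..3; state 3 is the absorbing goal
ACTIONS = (-1, +1)    # step left / step right

def f(s, a):
    return s if s == 3 else max(0, min(3, s + a))   # goal is absorbing

def c(s, a):
    return 0.0 if s == 3 else 1.0

def value_iteration(sweeps=50):
    J = [0.0] * N_STATES                            # arbitrary J_0
    for _ in range(sweeps):
        # synchronous sweep: every state backs up its best successor value
        J = [min(c(s, a) + J[f(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return J

print(value_iteration())  # -> [3.0, 2.0, 1.0, 0.0]
```

The fixed point counts the steps to the goal from each state, which is exactly J* for unit step costs.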
Discounted problems: J^{π*}(s) = min_π { Σ_t γ^t c(s_t, π(s_t)) | s_0 = s }, where 0 ≤ γ < 1 (contraction mapping).
Stochastic shortest path problems:
- there exists an absorbing terminal state with zero costs
- there exists a proper policy (a policy that has a non-zero chance to finally reach the terminal state)
- every non-proper policy has infinite path costs for at least one state
Ok, now we have J*
When J* is known, then we also know an optimal policy:
π*(s) ∈ arg min_{a∈A} { c(s, a) + J*(f(s, a)) }

Back to our maze
(Figure: the maze from the first example, each cell annotated with its optimal costs-to-go J*, decreasing from the start towards the goal.)
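Extracting the greedy policy from a known J* is a one-line lookup. The corridor world and its J* table below are made up for illustration (states 0..3, absorbing goal 3, unit step costs).

```python
# Greedy policy extraction: pi*(s) in arg min_a { c(s,a) + J*(f(s,a)) }.

ACTIONS = (-1, +1)
GOAL = 3
J_star = [3.0, 2.0, 1.0, 0.0]        # optimal costs-to-go for this toy world

def f(s, a):
    return s if s == GOAL else max(0, min(GOAL, s + a))

def c(s, a):
    return 0.0 if s == GOAL else 1.0

def pi_star(s):
    """One minimizer of c(s,a) + J*(f(s,a)): no search, just a lookup in J*."""
    return min(ACTIONS, key=lambda a: c(s, a) + J_star[f(s, a)])

print([pi_star(s) for s in range(3)])  # -> [1, 1, 1]: always step right
```

Note that this one-step lookahead needs the model f; removing that requirement is precisely what motivates Q-learning later on.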
Overview of the approach so far
- Description of the learning task as an MDP (S, A, T, f, c); c specifies the requested behaviour/policy.
- Iterative computation of the optimal path costs J*: for all s ∈ S: J_{k+1}(s) = min_{a∈A} { c(s, a) + J_k(f(s, a)) }
- Computation of an optimal policy from J*: π*(s) ∈ arg min_{a∈A} { c(s, a) + J*(f(s, a)) }
- The value function ("costs-to-go") can be stored in a table.
Overview of the approach: Stochastic Domains
Value iteration in stochastic environments:
for all s ∈ S: J_{k+1}(s) = min_{a∈A} Σ_{s'∈S} p(s', s, a) (c(s, a) + J_k(s'))
Computation of an optimal policy from J*:
π*(s) ∈ arg min_{a∈A} Σ_{s'∈S} p(s', s, a) (c(s, a) + J*(s'))
The value function J* ("costs-to-go") can be stored in a table.
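The stochastic backup differs from the deterministic one only by the expectation over successor states. The 3-state chain below, including its transition probabilities, is invented for illustration: from state 1 the action "go" reaches the absorbing goal 2 with probability 0.8 and slips back to state 0 with probability 0.2.

```python
# Stochastic value iteration:
# J_{k+1}(s) = min_a sum_{s'} p(s',s,a) (c(s,a) + J_k(s')).

ACTIONS = ("go", "stay")
P = {                                   # P[(s, a)] -> [(s_next, probability), ...]
    (0, "go"):   [(1, 1.0)],
    (0, "stay"): [(0, 1.0)],
    (1, "go"):   [(2, 0.8), (0, 0.2)],  # risky step towards the goal
    (1, "stay"): [(1, 1.0)],
    (2, "go"):   [(2, 1.0)],            # absorbing goal with zero costs
    (2, "stay"): [(2, 1.0)],
}

def c(s, a):
    return 0.0 if s == 2 else 1.0

def stochastic_vi(sweeps=200):
    J = [0.0, 0.0, 0.0]
    for _ in range(sweeps):
        J = [min(sum(p * (c(s, a) + J[s2]) for s2, p in P[(s, a)])
                 for a in ACTIONS)
             for s in range(3)]
    return J

J = stochastic_vi()
# Fixed point: J[1] = 1 + 0.2*J[0] and J[0] = 1 + J[1], i.e. J = [2.5, 1.5, 0.0]
```

This is a stochastic shortest path problem in the sense of the convergence conditions above: the goal is absorbing with zero costs, "go" everywhere is a proper policy, and "stay" accumulates infinite costs.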
Problems of Value Iteration
For all s ∈ S: J_{k+1}(s) = min_{a∈A} { c(s, a) + J_k(f(s, a)) }
Problems:
- the size of S (Chess, robotics, ...): learning time, storage?
- the model (transition behaviour) f(s, a) or p(s', s, a) must be known!

Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions - Overview
- Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
- Model-free learning (Q-Learning (Watkins, 1989))
- Neural representation of the value function (or alternative function approximators)
Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
Idea: instead of updating "for all s ∈ S", update only "for some s ∈ S". Learning is based on trajectories (experiences).

Q-Learning
Idea (Watkins, Diss., 1989): in every state, store for every action the expected costs-to-go.
Q^π(s, a) denotes the expected future path costs for applying action a in state s (and continuing according to policy π):
Q^π(s, a) := Σ_{s'∈S} p(s', s, a) (c(s, a) + J^π(s'))
where J^π(s') is the expected path cost when starting from s' and acting according to π.

Action selection is now possible without a model:
- Original VI: state evaluation J; action selection π*(s) ∈ arg min_a { c(s, a) + J*(f(s, a)) } requires the model f.
- Q-learning: state-action evaluation Q; action selection π*(s) = arg min_a Q*(s, a) needs no model.
Learning an optimal Q-Function
To find Q*, a value iteration algorithm can be applied:
Q_{k+1}(s, a) := Σ_{s'∈S} p(s', s, a) (c(s, a) + J_k(s')), where J_k(s) = min_{a'∈A(s)} Q_k(s, a').
Furthermore, it is possible to learn a Q-function without a model, from experienced transition tuples (s, a) → s' only: Q-learning (Q-value iteration + Robbins-Monro stochastic approximation):
Q_{k+1}(s, a) := (1 - α) Q_k(s, a) + α (c(s, a) + min_{a'∈A(s')} Q_k(s', a'))
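The stochastic-approximation step can be illustrated on a single state-action pair. The toy setting below is invented: from state 0, the single action "go" reaches a terminal state (costs-to-go 0) with probability 0.8 and loops back to state 0 with probability 0.2; with only one action, min_{a'} Q(s', a') reduces to the lone Q-value.

```python
import random

# Model-free Q-update on sampled transitions (s, a) -> s':
# Q(s,a) := (1 - alpha) Q(s,a) + alpha (c(s,a) + min_a' Q(s',a')).

random.seed(0)
q = 0.0                        # estimate of Q(0, "go")

for n in range(1, 50001):
    alpha = 1.0 / n            # Robbins-Monro: sum(alpha) = inf, sum(alpha^2) < inf
    terminal = random.random() < 0.8        # sample a transition, no model used
    target = 1.0 + (0.0 if terminal else q)  # c(s,a) + costs-to-go of successor
    q = (1 - alpha) * q + alpha * target

# q approaches the fixed point Q* = 1 + 0.2 * Q*, i.e. Q* = 1.25
print(q)
```

No transition probabilities appear in the update; the averaging over sampled successors replaces the explicit expectation of Q-value iteration.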
Summary Q-learning
Q-learning is a variant of value iteration for the case that no model is available.
It is based on two major ingredients:
- a representation of the costs-to-go for state/action pairs, Q(s, a)
- a stochastic approximation scheme that incrementally computes expectation values on the basis of observed transitions (s, a) → s'
The Q-learning algorithm converges under the same assumptions as value iteration, plus: every state/action pair has to be visited infinitely often, and the conditions for stochastic approximation must hold.
Q-Learning algorithm
1. Start in an arbitrary initial state s_0; t = 0.
2. Choose the action greedily, u_t := arg min_{a∈A} Q_k(s_t, a), or choose u_t according to an exploration scheme.
3. Apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t).
4. Learn the Q-value: Q_{k+1}(s_t, u_t) := (1 - α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1})), where J_k(s_{t+1}) := min_{a∈A} Q_k(s_{t+1}, a).
5. If a terminal state is reached, start a new episode; otherwise continue with step 2. Stop when the policy is optimal ("good enough").
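The whole loop can be sketched for a tiny deterministic corridor. The world, the constants, and the use of epsilon-greedy action choice as the "exploration scheme" are all made-up illustrations, not the slides' original setup.

```python
import random

# Tabular Q-learning on a 5-state corridor: goal = state 4,
# cost 1 per step, cost 0 at the goal, epsilon-greedy exploration.

random.seed(1)
N, GOAL = 5, 4
ACTIONS = (-1, +1)
ALPHA, EPS = 0.5, 0.2
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

def f(s, u):                                     # environment transition
    return max(0, min(N - 1, s + u))

def c(s, u):
    return 0.0 if s == GOAL else 1.0

for episode in range(500):
    s = random.randrange(N - 1)                  # arbitrary initial state
    while s != GOAL:                             # one episode
        if random.random() < EPS:                # exploration scheme
            u = random.choice(ACTIONS)
        else:                                    # greedy: arg min_a Q(s, a)
            u = min(ACTIONS, key=lambda a: Q[(s, a)])
        s_next = f(s, u)
        J_next = 0.0 if s_next == GOAL else min(Q[(s_next, a)] for a in ACTIONS)
        Q[(s, u)] = (1 - ALPHA) * Q[(s, u)] + ALPHA * (c(s, u) + J_next)
        s = s_next

# After training, the greedy policy steps right in every non-terminal state.
```

Exploration matters here: purely greedy action choice on an optimistic initial Q-table can keep revisiting the same actions, while epsilon-greedy guarantees that every state/action pair is tried again and again, matching the convergence condition above.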
Representation of the path costs in a function approximator
Idea: a neural representation of the value function (or alternative function approximators) (Neuro-Dynamic Programming (Bertsekas, 1987)).
(Figure: a feed-forward network with weights w_ij maps a state x to the value J(x).)
A few parameters (here: the weights) specify the value function for a large state space.
Learning by gradient descent on the error E = (c(s, a) + J(s') - J(s))²: Δw_ij = -η ∂E/∂w_ij.

Example: learning to intercept in robotic soccer
- as fast as possible (anticipation of the intercept position)
- random noise in ball and player movement: need for corrections
- a sequence of turn(θ) and dash(v) commands is required
- handcoding such a routine is a lot of work, with many parameters to tune!
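The gradient-descent update can be sketched with the simplest possible approximator: a linear model J(s) = w0 + w1·s stands in for the neural network, and the corridor task, policy, and step size are made up. For the "always step right" policy in a 5-state corridor with unit step costs, the true path costs J^π(s) = 4 - s happen to be exactly representable.

```python
import random

# Gradient-descent TD update for a parametrized value function:
# w := w + eta * (c(s,a) + J(s') - J(s)) * dJ(s)/dw.
# Linear stand-in for the network: J(s) = w[0] + w[1]*s.
# Fixed policy "step right" in a corridor with states 0..4, goal 4.

random.seed(2)
w = [0.0, 0.0]                 # two parameters represent J for all states
ETA = 0.1

def J(s):
    return w[0] + w[1] * s

for _ in range(3000):
    s = random.randrange(4)    # sample a non-terminal state
    s_next = s + 1             # transition under the fixed policy
    td_error = 1.0 + (0.0 if s_next == 4 else J(s_next)) - J(s)
    w[0] += ETA * td_error * 1.0        # dJ(s)/dw0 = 1
    w[1] += ETA * td_error * s          # dJ(s)/dw1 = s

# w approaches (4, -1), i.e. J(s) = 4 - s, the path costs of the policy
```

The same update shape carries over to a multilayer network, where dJ/dw is computed by backpropagation instead of being read off a linear model.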
Reinforcement learning of intercept
Goal: the ball is in the kickrange of the player.
- State space: S_work = positions on the pitch; S+: ball in kickrange (success); S-: e.g. collision with an opponent (failure).
- Costs: c(s) = 0 if s ∈ S+, a (large) failure cost if s ∈ S-, and 0.01 else.
- Actions: turn(θ) and dash(v) with discretized parameters (turn angles up to 360°).
- Neural value function (small feed-forward architecture).
Reinforcement learning of intercept : Ball is in kickrange of player state space: S work = positions on pitch S + : Ball in kickrange S : e.g. collision with opponent 0, s S + c(s) =, s S 0.0, else Actions: turn(0 o ), turn(0 o ),... turn(360 o ),... dash(0), dash(0),... Reinforcement learning of intercept : Ball is in kickrange of player state space: S work = positions on pitch S + : Ball in kickrange S : e.g. collision with opponent 0, s S + c(s) =, s S 0.0, else Actions: turn(0 o ), turn(0 o ),... turn(360 o ),... dash(0), dash(0),... neural value function (6-0--architecture) Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de Machine Learning 3 Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de Machine Learning 3 Learning curves 00 "kick.stat" u 3:7 0. "kick.stat" u 3: 90 0.8 80 0.6 70 0.4 60 0. 50 0. 40 0.08 30 0 0.06 0 0.04 0 0 5000 0000 5000 0000 5000 30000 35000 40000 45000 50000 Percentage of successes 0.0 0 5000 0000 5000 0000 5000 30000 35000 40000 45000 50000 Costs (time to intercept) Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de Machine Learning 33