Machine Learning: Reinforcement Learning

Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg
riedmiller@informatik.uni-freiburg.de

Motivation
Can a software agent learn to play Backgammon by itself? Learning from success or failure: Neuro-Backgammon plays at world-champion level (Tesauro, 1992).

Can a software agent learn to balance a pole by itself? Learning from success or failure: neural RL controllers master noisy, unknown, nonlinear plants (Riedmiller et al.).

Can a software agent learn to cooperate with others by itself? Learning from success or failure: cooperative RL agents master complex, multi-agent, cooperative tasks (Riedmiller et al.).

Reinforcement Learning has biological roots: reward and punishment. There is no teacher; the agent only has actions and a goal, and from these it must learn an algorithm, i.e. a policy, by itself (figure: agent acting on its environment through actions).

Overview
I. Reinforcement Learning - Basics

Actor-Critic Scheme (Barto, Sutton, 1983)
The critic maps the external, delayed reward into an internal training signal; the actor represents the policy that acts on the world (diagram: world, actor, internal critic, external reward).

A First Example
In each state, choose an action a from a small finite set and repeat until the goal state is reached.

The Temporal Credit Assignment Problem
Which action(s) in the sequence has to be changed? This question is the temporal credit assignment problem.

Sequential Decision Making
Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1992), cart-pole balancing (AHC/ACE; Barto, Sutton, Anderson, 1983), robotics and control, ...

Three Steps
1. Describe the environment as a Markov Decision Process (MDP).
2. Formulate the learning task as a dynamic optimization problem.
3. Solve the dynamic optimization problem by dynamic programming methods.

1. Description of the environment
S: (finite) set of states
A: (finite) set of actions
Behaviour of the environment (model): p : S × S × A → [0, 1], where p(s', s, a) is the probability of the transition from s to s' under action a.
For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function f : S × A → S with s' = f(s, a).

Markov property: the transition depends only on the current state and action:
Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...)

2. Formulation of the learning task
Every transition emits transition costs (immediate costs) c : S × A → R (sometimes also called immediate reward, r).
Now an agent policy π : S → A can be evaluated (and judged). Consider the path costs
J^π(s) = Σ_t c(s_t, π(s_t)),  s_0 = s.
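To make these objects concrete, here is a minimal sketch (not from the lecture) of a deterministic MDP on a small chain of states, together with the path-cost evaluation of a fixed policy. The state set, transition function and cost function are assumptions chosen purely for illustration.

```python
# Toy deterministic MDP (an assumption for illustration): a chain 0..4 with
# absorbing goal state 4, unit step costs, and actions "move left"/"move right".
S = range(5)
A = [-1, +1]
GOAL = 4

def f(s, a):                       # transition function s' = f(s, a)
    return GOAL if s == GOAL else max(0, min(GOAL, s + a))

def c(s, a):                       # immediate transition costs c(s, a)
    return 0.0 if s == GOAL else 1.0

def path_costs(pi, s0, horizon=100):
    """J^pi(s0) = sum_t c(s_t, pi(s_t)) along the trajectory starting in s0."""
    J, s = 0.0, s0
    for _ in range(horizon):
        if s == GOAL:              # absorbing terminal state with zero costs
            break
        a = pi(s)
        J += c(s, a)
        s = f(s, a)
    return J

print(path_costs(lambda s: +1, s0=0))   # always move right: 4 steps -> 4.0
```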

Wanted: an optimal policy π* : S → A with
J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }.
Additive (path-)costs allow to consider all events along a trajectory.

The choice of the immediate cost function c(·) specifies the policy to be learned. Example: c(s) = 0 if s is the success state, 1000 if s is a failure state, and 1 otherwise; the path costs from the start state then clearly separate a successful policy from a failing one (the failing policy in the slide's example accumulates J^π(s_start) = 1004).
Does this solve the temporal credit assignment problem? YES!

3. Solving the optimization problem
For the optimal path costs it is known that
J*(s) = min_a { c(s, a) + J*(f(s, a)) }
(Principle of Optimality; Bellman, 1959). Specification of the requested policy by c(·) is simple! Can we compute J*? (We will see soon why this is useful.)

Computing J*: the value iteration (VI) algorithm
Start with an arbitrary J_0(s). Then, for all states s ∈ S, iterate
J_{k+1}(s) := min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }.
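A minimal sketch of this value-iteration sweep for the deterministic case, reusing the assumed toy chain MDP from the earlier sketch (not the lecture's maze example):

```python
# Value iteration for the deterministic case:
# J_{k+1}(s) = min_a { c(s, a) + J_k(f(s, a)) }
S, A, GOAL = range(5), [-1, +1], 4

def f(s, a):
    return GOAL if s == GOAL else max(0, min(GOAL, s + a))

def c(s, a):
    return 0.0 if s == GOAL else 1.0

J = {s: 0.0 for s in S}                                  # arbitrary J_0
for k in range(20):                                      # a few full sweeps suffice here
    J = {s: min(c(s, a) + J[f(s, a)] for a in A) for s in S}

print(J)   # optimal costs-to-go: {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}
```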

Convergence of value iteration
Value iteration converges under certain assumptions, i.e. we have lim_{k→∞} J_k = J*, in particular for:
Discounted problems: J^{π*}(s) = min_π { Σ_t γ^t c(s_t, π(s_t)) | s_0 = s } with 0 ≤ γ < 1; the update is then a contraction mapping (see the sketch after this list).
Stochastic shortest path problems:
- there exists an absorbing terminal state with zero costs,
- there exists a proper policy (a policy that has a non-zero chance to finally reach the terminal state),
- every non-proper policy has infinite path costs for at least one state.
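For the discounted case the update only changes by the factor γ; a minimal sketch on the same assumed toy MDP:

```python
# Discounted value iteration: J_{k+1}(s) = min_a { c(s, a) + gamma * J_k(f(s, a)) };
# with 0 <= gamma < 1 the update is a contraction mapping and converges from any J_0.
GOAL, A, gamma = 4, [-1, +1], 0.9

def f(s, a):
    return GOAL if s == GOAL else max(0, min(GOAL, s + a))

def c(s, a):
    return 0.0 if s == GOAL else 1.0

J = {s: 0.0 for s in range(5)}
for _ in range(200):
    J = {s: min(c(s, a) + gamma * J[f(s, a)] for a in A) for s in range(5)}

print({s: round(v, 3) for s, v in J.items()})   # J(0) = 1 + 0.9 + 0.81 + 0.729 = 3.439
```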

OK, now we have J*
When J* is known, we also know an optimal policy:
π*(s) ∈ argmin_{a ∈ A} { c(s, a) + J*(f(s, a)) }.

Back to our maze (figure: maze with start state 'Start' and goal state 'Ziel'): acting greedily with respect to J* leads from the start to the goal.
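Extracting the greedy policy from a known J* is then straightforward; a minimal sketch, again on the assumed toy MDP with its previously computed optimal costs-to-go:

```python
# Greedy policy extraction: pi*(s) in argmin_a { c(s, a) + J*(f(s, a)) }.
GOAL, A = 4, [-1, +1]
J_star = {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}       # from the VI sketch above

def f(s, a):
    return GOAL if s == GOAL else max(0, min(GOAL, s + a))

def c(s, a):
    return 0.0 if s == GOAL else 1.0

def pi_star(s):
    return min(A, key=lambda a: c(s, a) + J_star[f(s, a)])

print([pi_star(s) for s in range(4)])   # [1, 1, 1, 1]: always move towards the goal
```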

Overview of the approach so far
Description of the learning task as an MDP (S, A, T, f, c); c specifies the requested behaviour/policy.
Iterative computation of the optimal path costs J*: for all s ∈ S, J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }.
Computation of an optimal policy from J*: π*(s) ∈ argmin_{a ∈ A} { c(s, a) + J*(f(s, a)) }.
The value function ('costs-to-go') can be stored in a table.

Overview of the approach: Stochastic Domains
Value iteration in stochastic environments: for all s ∈ S,
J_{k+1}(s) = min_{a ∈ A} Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J_k(s')).
Computation of an optimal policy from J*:
π*(s) ∈ argmin_{a ∈ A} Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J*(s')).
The value function J* ('costs-to-go') can be stored in a table.
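A minimal sketch of the stochastic update; the tiny transition model below is an assumption invented for illustration, not one of the lecture's domains:

```python
# Stochastic value iteration:
# J_{k+1}(s) = min_a sum_{s'} p(s', s, a) * (c(s, a) + J_k(s'))
S, A, TERMINAL = [0, 1, 2], ["go", "wait"], 2

# p[(s, a)] lists (s', probability); state 2 is absorbing with zero costs.
p = {
    (0, "go"):   [(1, 0.8), (0, 0.2)],
    (0, "wait"): [(0, 1.0)],
    (1, "go"):   [(2, 0.9), (0, 0.1)],
    (1, "wait"): [(1, 1.0)],
    (2, "go"):   [(2, 1.0)],
    (2, "wait"): [(2, 1.0)],
}

def c(s, a):
    return 0.0 if s == TERMINAL else 1.0

J = {s: 0.0 for s in S}
for _ in range(500):
    J = {s: min(sum(prob * (c(s, a) + J[s2]) for s2, prob in p[(s, a)]) for a in A)
         for s in S}

print({s: round(v, 2) for s, v in J.items()})   # {0: 2.5, 1: 1.25, 2: 0.0}
```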

Problems of Value Iteration
Value iteration sweeps over the complete state space: for all s ∈ S, J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }.

This raises two problems:
- the size of S (Chess, robotics, ...): learning time, storage?
- the model (transition behaviour) f(s, a) or p(s', s, a) must be known!
Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions - Overview
- Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989): instead of 'for all s ∈ S', now 'for some s ∈ S', i.e. learning based on trajectories (experiences).
- Model-free learning (Q-Learning; Watkins, 1989).
- Neural representation of the value function (or alternative function approximators).

Q-Learning
Idea (Watkins, Diss., 1989): in every state, store for every action the expected costs-to-go. Q^π(s, a) denotes the expected future path costs of applying action a in state s (and continuing according to policy π):
Q^π(s, a) := Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J^π(s')),
where J^π(s') are the expected path costs when starting from s' and acting according to π. In the deterministic case this reduces to Q^π(s, a) = c(s, a) + J^π(f(s, a)).
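A minimal sketch that evaluates this definition for a single state-action pair, using the assumed transition model from the stochastic value-iteration sketch and assumed path costs J^π:

```python
# Q^pi(s, a) = sum_{s'} p(s', s, a) * (c(s, a) + J^pi(s'))
p = {(0, "go"): [(1, 0.8), (0, 0.2)]}     # p[(s, a)] = [(s', prob), ...]
J_pi = {0: 2.5, 1: 1.25}                  # assumed path costs of some policy pi

def c(s, a):
    return 1.0                            # unit cost outside the terminal state

def Q(s, a):
    return sum(prob * (c(s, a) + J_pi[s2]) for s2, prob in p[(s, a)])

print(Q(0, "go"))                         # 0.8*(1+1.25) + 0.2*(1+2.5) = 2.5
```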

Action selection is now possible without a model:
- Original VI: state evaluation J. Action selection π*(s) ∈ argmin_a { c(s, a) + J*(f(s, a)) } still requires the model f.
- Q-learning: state-action evaluation Q. Action selection π*(s) = argmin_a Q*(s, a) needs no model.

Learning an optimal Q-Function
To find Q*, a value iteration algorithm can be applied:
Q_{k+1}(s, a) := Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J_k(s')),  where J_k(s') = min_{a' ∈ A(s')} Q_k(s', a').
Furthermore, learning a Q-function without a model, from experienced transition tuples (s, a) → s' only, is possible. This is Q-learning (Q-value iteration plus Robbins-Monro stochastic approximation):
Q_{k+1}(s, a) := (1 − α) Q_k(s, a) + α (c(s, a) + min_{a' ∈ A(s')} Q_k(s', a')).
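A minimal sketch of this model-free update applied to one observed transition tuple (s, a) → s'; the tabular Q-function and the toy action set are assumptions:

```python
from collections import defaultdict

ACTIONS = [-1, +1]              # assumed toy action set
Q = defaultdict(float)          # tabular Q-function, initialised to 0

def q_update(s, a, cost, s_next, alpha=0.1):
    """Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(c(s,a) + min_a' Q(s',a'))."""
    target = cost + min(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target

q_update(s=3, a=+1, cost=1.0, s_next=4)   # one observed transition (3, +1) -> 4
print(Q[(3, +1)])                         # 0.1 after a single update
```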

Summary: Q-learning
Q-learning is a variant of value iteration for the case that no model is available. It is based on two major ingredients:
- a representation of the costs-to-go for state/action pairs, Q(s, a);
- a stochastic approximation scheme that incrementally computes expectation values on the basis of observed transitions (s, a) → s'.
The Q-learning algorithm converges under the same assumptions as value iteration, plus: every state/action pair has to be visited infinitely often, and the learning rates must fulfil the conditions for stochastic approximation.

Q-Learning algorithm
- Start in an arbitrary initial state s_0; t = 0.
- Choose the action greedily, u_t := argmin_{a ∈ A} Q_k(s_t, a), or choose u_t according to an exploration scheme.
- Apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t).

- Learn the Q-value: Q_{k+1}(s_t, u_t) := (1 − α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1})), where J_k(s_{t+1}) := min_{a ∈ A} Q_k(s_{t+1}, a).
- When a terminal state is reached, start the next episode; repeat until the policy is optimal (or good 'enough'). A sketch of the complete loop follows below.
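A minimal sketch of the complete loop on the assumed toy chain MDP from the earlier sketches, with ε-greedy exploration standing in for the slide's unspecified exploration scheme:

```python
import random
from collections import defaultdict

GOAL, A = 4, [-1, +1]
f = lambda s, a: GOAL if s == GOAL else max(0, min(GOAL, s + a))
c = lambda s, a: 0.0 if s == GOAL else 1.0

Q = defaultdict(float)
alpha, epsilon = 0.5, 0.1

def greedy(s):
    return min(A, key=lambda a: Q[(s, a)])

for episode in range(500):
    s = random.randrange(GOAL)                 # arbitrary initial state s_0
    while s != GOAL:                           # run until the terminal state is reached
        a = random.choice(A) if random.random() < epsilon else greedy(s)
        s_next = f(s, a)                       # apply the action in the environment
        target = c(s, a) + min(Q[(s_next, b)] for b in A)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s_next

print([greedy(s) for s in range(GOAL)])        # learned greedy policy: [1, 1, 1, 1]
```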

Representation of the path costs in a function approximator
Idea: neural representation of the value function (or alternative function approximators); this is Neuro-Dynamic Programming (Bertsekas, 1987). A network with weights w_ij maps a state x to its value J(x); few parameters (here: weights) specify the value function for a large state space. Learning proceeds by gradient descent on the temporal-difference error:
Δw_ij ∝ (c(s, a) + J(s') − J(s)) ∂J(s)/∂w_ij.

Example: learning to intercept in robotic soccer
Intercept the ball as fast as possible (anticipation of the intercept position). Random noise in the ball and player movement creates the need for corrections, and a sequence of turn(θ) and dash(v) commands is required. Handcoding such a routine is a lot of work, with many parameters to tune!
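A minimal sketch of this gradient-descent update. As an assumption, a linear approximator J(s) = w·φ(s) with hand-made features replaces the slide's neural network, so the gradient ∂J(s)/∂w is simply φ(s):

```python
import numpy as np

def phi(s):                       # hand-made state features (an assumption)
    return np.array([1.0, s, s * s])

w = np.zeros(3)                   # few parameters specify J over the whole state space
eta = 0.01                        # learning rate

def J(s):
    return float(w @ phi(s))

def td_update(s, a, cost, s_next):
    """w <- w + eta * (c(s,a) + J(s') - J(s)) * dJ(s)/dw."""
    global w
    delta = cost + J(s_next) - J(s)      # temporal-difference error
    w += eta * delta * phi(s)

td_update(s=2, a=+1, cost=1.0, s_next=3)  # one observed transition
print(w)                                  # [0.01, 0.02, 0.04]
```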

Reinforcement learning of 'intercept'
Goal: the ball is in the kick range of the player.
State space: S_work = positions on the pitch; S+: ball in kick range; S−: e.g. collision with an opponent.
Costs: c(s) = 0 for s ∈ S+, a large cost (1) for s ∈ S−, and a small cost (0.01) for every other step, so that faster intercepts are cheaper.
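A minimal sketch of such a cost function; the failure cost and the per-step cost are reconstructions of partly garbled slide values and should be read as assumptions:

```python
def intercept_cost(ball_in_kickrange, collided):
    """Immediate cost of the current state in the intercept task."""
    if ball_in_kickrange:     # s in S+ : success, no further costs
        return 0.0
    if collided:              # s in S- : failure, e.g. collision with an opponent
        return 1.0            # assumed failure cost
    return 0.01               # small cost per step: intercept as fast as possible

print(intercept_cost(False, False))   # 0.01 for an ordinary step
```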

Actions: turn(0°), turn(10°), ..., turn(360°), ..., dash(0), dash(10), ...
Neural value function (a small feed-forward network).
Learning curves (figure): percentage of successes and costs (time to intercept) over the course of training.


More information

Bag of Pursuits and Neural Gas for Improved Sparse Coding

Bag of Pursuits and Neural Gas for Improved Sparse Coding Bag of Pursuits and Neural Gas for Improved Sparse Coding Kai Labusch, Erhardt Barth, and Thomas Martinetz University of Lübec Institute for Neuro- and Bioinformatics Ratzeburger Allee 6 23562 Lübec, Germany

More information

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly) STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University (Joint work with R. Chen and M. Menickelly) Outline Stochastic optimization problem black box gradient based Existing

More information

IN AVIATION it regularly occurs that an airplane encounters. Guaranteed globally optimal continuous reinforcement learning.

IN AVIATION it regularly occurs that an airplane encounters. Guaranteed globally optimal continuous reinforcement learning. GUARANTEED GLOBALLY OPTIMAL CONTINUOUS REINFORCEMENT LEARNING 1 Guaranteed globally optimal continuous reinforcement learning Hildo Bijl Abstract Self-learning and adaptable autopilots have the potential

More information

Creating a NL Texas Hold em Bot

Creating a NL Texas Hold em Bot Creating a NL Texas Hold em Bot Introduction Poker is an easy game to learn by very tough to master. One of the things that is hard to do is controlling emotions. Due to frustration, many have made the

More information

Hierarchical Reinforcement Learning in Computer Games

Hierarchical Reinforcement Learning in Computer Games Hierarchical Reinforcement Learning in Computer Games Marc Ponsen, Pieter Spronck, Karl Tuyls Maastricht University / MICC-IKAT. {m.ponsen,p.spronck,k.tuyls}@cs.unimaas.nl Abstract. Hierarchical reinforcement

More information

Efficient Network Marketing Strategies For Secondary Users

Efficient Network Marketing Strategies For Secondary Users Determining the Transmission Strategy of Cognitive User in IEEE 82. based Networks Rukhsana Ruby, Victor C.M. Leung, John Sydor Department of Electrical and Computer Engineering, The University of British

More information

Variational approach to restore point-like and curve-like singularities in imaging

Variational approach to restore point-like and curve-like singularities in imaging Variational approach to restore point-like and curve-like singularities in imaging Daniele Graziani joint work with Gilles Aubert and Laure Blanc-Féraud Roma 12/06/2012 Daniele Graziani (Roma) 12/06/2012

More information

Jae Won idee. School of Computer Science and Engineering Sungshin Women's University Seoul, 136-742, South Korea

Jae Won idee. School of Computer Science and Engineering Sungshin Women's University Seoul, 136-742, South Korea STOCK PRICE PREDICTION USING REINFORCEMENT LEARNING Jae Won idee School of Computer Science and Engineering Sungshin Women's University Seoul, 136-742, South Korea ABSTRACT Recently, numerous investigations

More information

Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining George Konidaris Computer Science Department University of Massachusetts Amherst Amherst MA 01003 USA gdk@cs.umass.edu

More information

An Introduction to Markov Decision Processes. MDP Tutorial - 1

An Introduction to Markov Decision Processes. MDP Tutorial - 1 An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal

More information

Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen (für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained

More information

Machine Learning. 01 - Introduction

Machine Learning. 01 - Introduction Machine Learning 01 - Introduction Machine learning course One lecture (Wednesday, 9:30, 346) and one exercise (Monday, 17:15, 203). Oral exam, 20 minutes, 5 credit points. Some basic mathematical knowledge

More information

Random access protocols for channel access. Markov chains and their stability. Laurent Massoulié.

Random access protocols for channel access. Markov chains and their stability. Laurent Massoulié. Random access protocols for channel access Markov chains and their stability laurent.massoulie@inria.fr Aloha: the first random access protocol for channel access [Abramson, Hawaii 70] Goal: allow machines

More information

Efficient Prediction of Value and PolicyPolygraphic Success

Efficient Prediction of Value and PolicyPolygraphic Success Journal of Machine Learning Research 14 (2013) 1181-1227 Submitted 10/11; Revised 10/12; Published 4/13 Performance Bounds for λ Policy Iteration and Application to the Game of Tetris Bruno Scherrer MAIA

More information

Options with exceptions

Options with exceptions Options with exceptions Munu Sairamesh and Balaraman Ravindran Indian Institute Of Technology Madras, India Abstract. An option is a policy fragment that represents a solution to a frequent subproblem

More information

A Behavior Based Kernel for Policy Search via Bayesian Optimization

A Behavior Based Kernel for Policy Search via Bayesian Optimization via Bayesian Optimization Aaron Wilson WILSONAA@EECS.OREGONSTATE.EDU Alan Fern AFERN@EECS.OREGONSTATE.EDU Prasad Tadepalli TADEPALL@EECS.OREGONSTATE.EDU Oregon State University School of EECS, 1148 Kelley

More information

Neural Networks and Learning Systems

Neural Networks and Learning Systems Neural Networks and Learning Systems Exercise Collection, Class 9 March 2010 x 1 x 2 x N w 11 3 W 11 h 3 2 3 h N w NN h 1 W NN y Neural Networks and Learning Systems Exercise Collection c Medical Informatics,

More information

KERNEL REWARDS REGRESSION: AN INFORMATION EFFICIENT BATCH POLICY ITERATION APPROACH

KERNEL REWARDS REGRESSION: AN INFORMATION EFFICIENT BATCH POLICY ITERATION APPROACH KERNEL REWARDS REGRESSION: AN INFORMATION EFFICIENT BATCH POLICY ITERATION APPROACH Daniel Schneegaß and Steffen Udluft Information & Communications, Learning Systems Siemens AG, Corporate Technology D-81739

More information

Game playing. Chapter 6. Chapter 6 1

Game playing. Chapter 6. Chapter 6 1 Game playing Chapter 6 Chapter 6 1 Outline Games Perfect play minimax decisions α β pruning Resource limits and approximate evaluation Games of chance Games of imperfect information Chapter 6 2 Games vs.

More information

Tetris: A Study of Randomized Constraint Sampling

Tetris: A Study of Randomized Constraint Sampling Tetris: A Study of Randomized Constraint Sampling Vivek F. Farias and Benjamin Van Roy Stanford University 1 Introduction Randomized constraint sampling has recently been proposed as an approach for approximating

More information

PageRank Optimization in Polynomial Time by Stochastic Shortest Path Reformulation

PageRank Optimization in Polynomial Time by Stochastic Shortest Path Reformulation PageRank Optimization in Polynomial Time by Stochastic Shortest Path Reformulation Balázs Csanád Csáji 12, Raphaël M. Jungers 34, and Vincent D. Blondel 4 1 Department of Electrical and Electronic Engineering,

More information