Machine Learning: Reinforcement Learning

Machine Learning: Reinforcement Learning
Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg

Motivation
Can a software agent learn to play Backgammon by itself?
Learning from success or failure. Neuro-Backgammon: playing at world-champion level (Tesauro, 1992).

Can a software agent learn to balance a pole by itself?
Learning from success or failure. Neural RL controllers: noisy, unknown, nonlinear (Riedmiller et al.)

Can a software agent learn to cooperate with others by itself?
Learning from success or failure. Cooperative RL agents: complex, multi-agent, cooperative (Riedmiller et al.)

Reinforcement Learning has biological roots: reward and punishment.
No teacher, but: actions + goal; learn an algorithm/policy.
[Figure: agent interacting with its environment through actions]

Actor-Critic Scheme (Barto, Sutton, 1983)
The critic maps the external, delayed reward into an internal training signal; the actor represents the policy.
[Figure: actor-critic architecture with world, external reward, internal critic and actor]

Overview
I Reinforcement Learning - Basics

A First Example
A maze: in every state, choose an action a from a small set of moves until the goal state is reached.

The Temporal Credit Assignment Problem
Which action(s) in the sequence has to be changed? This is the temporal credit assignment problem.

Sequential Decision Making
Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1992), Cart-Pole Balancing (AHC/ACE (Barto, Sutton, Anderson, 1983)), robotics and control, ...

Three Steps
1. Describe the environment as a Markov Decision Process (MDP).
2. Formulate the learning task as a dynamic optimization problem.
3. Solve the dynamic optimization problem by dynamic programming methods.

1. Description of the environment
S: (finite) set of states
A: (finite) set of actions
Behaviour of the environment (the model): p : S × S × A → [0, 1], where p(s', s, a) is the probability distribution of the transition to s' when action a is applied in state s.
For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function f : S × A → S, s' = f(s, a).

Markov property: the transition only depends on the current state and action:
Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...)

2. Formulation of the learning task
Every transition emits transition costs (immediate costs) c : S × A → R (sometimes also called immediate reward, r).
Now, an agent policy π : S → A can be evaluated (and judged). Consider the path costs
J^π(s) = Σ_t c(s_t, π(s_t)),   s_0 = s
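To make these definitions concrete, the following minimal Python sketch evaluates the path costs of a fixed policy on an invented five-state chain (states, actions, f and c are illustrative, not the maze from the slides):

S = [0, 1, 2, 3, 4]          # states; state 4 is the absorbing goal
A = ["left", "right"]        # actions

def f(s, a):                 # deterministic transition function f: S x A -> S
    if s == 4:               # the goal is absorbing
        return 4
    return max(0, s - 1) if a == "left" else min(4, s + 1)

def c(s, a):                 # immediate costs: 0 at the goal, 1 per step otherwise
    return 0 if s == 4 else 1

def path_costs(pi, s, horizon=100):
    """J^pi(s) = sum_t c(s_t, pi(s_t)) with s_0 = s (sum truncated after `horizon` steps)."""
    J = 0
    for _ in range(horizon):
        a = pi(s)
        J += c(s, a)
        s = f(s, a)
    return J

print(path_costs(lambda s: "right", 0))   # 4: four unit-cost steps from state 0 to the goal

Once the goal is reached, every further step costs 0, so the sum stays finite even though the horizon is truncated only as a safeguard.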

Wanted: an optimal policy π* : S → A with
J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }
Additive (path) costs allow to consider all events along a trajectory.

The choice of the immediate cost function c(·) specifies the policy to be learned. Example:
c(s) = 0, if s is a success state; 1000, if s is a failure state; 1, else.
A policy that reaches the goal then accumulates only small path costs, whereas a policy that, say, runs into the failure state after four steps has J^π(s_start) = 1004.
Does this solve the temporal credit assignment problem? YES!

Specification of the requested policy by c(·) is simple!

3. Solving the optimization problem
For the optimal path costs it is known that
J*(s) = min_a { c(s, a) + J*(f(s, a)) }
(Principle of Optimality (Bellman, 1959)).
Can we compute J*? (We will see why, soon.)

Computing J*: the value iteration (VI) algorithm
Start with an arbitrary J_0(s).
For all states s:
J_{k+1}(s) := min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }
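Continuing the invented chain from the sketch above, value iteration and the greedy policy extracted from J* could look like this (illustrative only):

def value_iteration(S, A, f, c, sweeps=50):
    J = {s: 0.0 for s in S}                          # arbitrary J_0
    for _ in range(sweeps):
        # J_{k+1}(s) := min_a { c(s,a) + J_k(f(s,a)) }
        J = {s: min(c(s, a) + J[f(s, a)] for a in A) for s in S}
    return J

def greedy_policy(J, S, A, f, c):
    # pi*(s) in arg min_a { c(s,a) + J*(f(s,a)) }
    return {s: min(A, key=lambda a: c(s, a) + J[f(s, a)]) for s in S}

J_star = value_iteration(S, A, f, c)
print(J_star)                             # costs-to-go, e.g. J*(0) = 4.0
print(greedy_policy(J_star, S, A, f, c))  # "right" in every non-goal state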

Convergence of value iteration
Value iteration converges under certain assumptions, i.e. we have lim_{k→∞} J_k = J*. Two important cases:

Discounted problems: J^{π*}(s) = min_π { Σ_t γ^t c(s_t, π(s_t)) | s_0 = s } with 0 ≤ γ < 1 (contraction mapping).
Stochastic shortest path problems:
- there exists an absorbing terminal state with zero costs
- there exists a proper policy (a policy that has a non-zero chance to finally reach the terminal state)
- every non-proper policy has infinite path costs for at least one state

Ok, now we have J*
When J* is known, then we also know an optimal policy:
π*(s) ∈ arg min_{a ∈ A} { c(s, a) + J*(f(s, a)) }

Back to our maze
[Figure: maze with start and goal states and the computed costs-to-go]

Overview of the approach so far
- Description of the learning task as an MDP (S, A, T, f, c); c specifies the requested behaviour/policy.
- Iterative computation of the optimal path costs J*: for all s ∈ S: J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }
- Computation of an optimal policy from J*: π*(s) ∈ arg min_{a ∈ A} { c(s, a) + J*(f(s, a)) }
- The value function ("costs-to-go") can be stored in a table.

Overview of the approach: Stochastic Domains
- Value iteration in stochastic environments: for all s ∈ S: J_{k+1}(s) = min_{a ∈ A} Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J_k(s'))
- Computation of an optimal policy from J*: π*(s) ∈ arg min_{a ∈ A} Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J*(s'))
- The value function J* ("costs-to-go") can be stored in a table.
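A possible sketch of this stochastic update on a tiny made-up "slippery" chain, where the transition probabilities p and the costs c are invented for illustration:

S = [0, 1, 2]
A = ["go"]

def p(s_next, s, a):
    if s == 2:                               # state 2 is the absorbing goal
        return 1.0 if s_next == 2 else 0.0
    # "go" advances with probability 0.8 and stays put with probability 0.2
    return 0.8 if s_next == s + 1 else (0.2 if s_next == s else 0.0)

def c(s, a):
    return 0.0 if s == 2 else 1.0

J = {s: 0.0 for s in S}
for _ in range(100):
    # J_{k+1}(s) := min_a sum_{s'} p(s', s, a) (c(s,a) + J_k(s'))
    J = {s: min(sum(p(s2, s, a) * (c(s, a) + J[s2]) for s2 in S) for a in A)
         for s in S}

print(J)   # expected costs-to-go: roughly J(0) = 2.5, J(1) = 1.25, J(2) = 0.0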

Reinforcement Learning
Problems of Value Iteration: for all s ∈ S: J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }

Problems:
- the size of S (Chess, robotics, ...): learning time, storage?
- the model (transition behaviour) f(s, a) or p(s', s, a) must be known!
Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions - Overview
- Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
- Model-free learning (Q-Learning (Watkins, 1989))
- Neural representation of the value function (or alternative function approximators)

Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
Idea: instead of "for all s ∈ S", now "for some s ∈ S"; learning is based on trajectories (experiences).

Q-Learning
Idea (Watkins, Diss., 1989): in every state, store for every action the expected costs-to-go. Q^π(s, a) denotes the expected future path costs for applying action a in state s (and continuing according to policy π):
Q^π(s, a) := Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J^π(s'))

where J^π(s') denotes the expected path costs when starting from s' and acting according to π.

Action selection is now possible without a model:
- Original VI: state evaluation J; action selection: π*(s) ∈ arg min_a { c(s, a) + J*(f(s, a)) }, which requires the model f.
- Q-learning: state-action evaluation Q; action selection: π*(s) = arg min_a Q*(s, a).
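The difference between the two action-selection rules in code form (function and variable names are placeholders):

def select_action_with_model(s, A, f, c, J_star):
    # pi*(s) in arg min_a { c(s,a) + J*(f(s,a)) }: needs the model f (and c)
    return min(A, key=lambda a: c(s, a) + J_star[f(s, a)])

def select_action_with_q(s, A, Q):
    # pi*(s) = arg min_a Q*(s,a): model-free, only the stored Q-values are needed
    return min(A, key=lambda a: Q[(s, a)])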

Learning an optimal Q-Function
To find Q*, a value iteration algorithm can be applied:
Q_{k+1}(s, a) := Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J_k(s')),   where J_k(s) = min_{a' ∈ A(s)} Q_k(s, a')
Furthermore, learning a Q-function without a model, from experienced transition tuples (s, a) → s' only, is possible: Q-learning (Q-value iteration + Robbins-Monro stochastic approximation):
Q_{k+1}(s, a) := (1 - α) Q_k(s, a) + α (c(s, a) + min_{a' ∈ A(s')} Q_k(s', a'))

Summary Q-learning
Q-learning is a variant of value iteration for the case that no model is available. It is based on two major ingredients:

- It uses a representation of the costs-to-go for state/action pairs, Q(s, a).
- It uses a stochastic approximation scheme to incrementally compute expectation values on the basis of observed transitions (s, a) → s'.
The Q-learning algorithm converges under the same assumptions as value iteration, plus: every state/action pair has to be visited infinitely often, and the conditions for stochastic approximation have to hold.

Q-Learning algorithm
- start in an arbitrary initial state s_0; t = 0
- choose an action greedily, u_t := arg min_{a ∈ A} Q_k(s_t, a), or choose u_t according to an exploration scheme
- apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t)

- learn the Q-value:
  Q_{k+1}(s_t, u_t) := (1 - α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1})),   where J_k(s_{t+1}) := min_{a ∈ A} Q_k(s_{t+1}, a)
- repeat until a terminal state is reached; run episodes until the policy is optimal (or "good enough").
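A tabular Q-learning sketch along the lines of this algorithm, using ε-greedy exploration as one possible exploration scheme and the invented deterministic chain from earlier as the environment; all parameter values are illustrative:

import random
from collections import defaultdict

GOAL, A = 4, ["left", "right"]
f = lambda s, a: s if s == GOAL else (max(0, s - 1) if a == "left" else min(GOAL, s + 1))
c = lambda s, a: 0.0 if s == GOAL else 1.0

Q = defaultdict(float)                        # Q_0(s, a) = 0 for all pairs
alpha, epsilon = 0.5, 0.1

for episode in range(200):
    s = 0                                     # arbitrary initial state s_0
    while s != GOAL:                          # until a terminal state is reached
        if random.random() < epsilon:         # exploration scheme
            u = random.choice(A)
        else:                                 # greedy choice: arg min_a Q(s_t, a)
            u = min(A, key=lambda a: Q[(s, a)])
        s_next = f(s, u)                      # apply u_t in the environment
        J_next = min(Q[(s_next, a)] for a in A)
        # Q_{k+1}(s_t,u_t) := (1-alpha) Q_k(s_t,u_t) + alpha (c(s_t,u_t) + J_k(s_{t+1}))
        Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (c(s, u) + J_next)
        s = s_next

print(min(Q[(0, a)] for a in A))              # approaches 4.0, the optimal costs-to-go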

Representation of the path costs in a function approximator
Idea: neural representation of the value function (or alternative function approximators) (Neuro-Dynamic Programming (Bertsekas, 1987)).
[Figure: neural network with weights w_ij mapping a state x to its value J(x)]
Few parameters (here: the weights) specify the value function for a large state space.
Learning by gradient descent on the temporal-difference error: Δw_ij = η (c(s, a) + J(s') - J(s)) ∂J(s)/∂w_ij
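One way such a neural value function and its gradient-descent TD update might be sketched (the 6-10-1 architecture, the learning rate and the example transition are assumptions, not the original implementation):

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 10
W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
b1 = np.zeros(n_hidden)
w2 = 0.1 * rng.standard_normal(n_hidden)
b2 = 0.0
eta = 0.01                                    # learning rate

def J(x):
    return w2 @ np.tanh(W1 @ x + b1) + b2

def td_update(x, cost, x_next):
    """w_ij <- w_ij + eta * (c(s,a) + J(s') - J(s)) * dJ(s)/dw_ij"""
    global W1, b1, w2, b2
    h = np.tanh(W1 @ x + b1)
    delta = cost + J(x_next) - (w2 @ h + b2)  # temporal-difference error
    grad_pre = w2 * (1.0 - h ** 2)            # dJ/d(hidden pre-activation)
    w2 = w2 + eta * delta * h
    b2 = b2 + eta * delta
    W1 = W1 + eta * delta * np.outer(grad_pre, x)
    b1 = b1 + eta * delta * grad_pre

# one example transition (s, a) -> s' with immediate cost 0.01
td_update(rng.standard_normal(n_in), 0.01, rng.standard_normal(n_in))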

Example: learning to intercept in robotic soccer
- intercept the ball as fast as possible (anticipation of the intercept position)
- random noise in ball and player movement: need for corrections
- a sequence of turn(θ) and dash(v) commands is required
- handcoding such a routine is a lot of work, with many parameters to tune!

Reinforcement learning of intercept
Goal: the ball is in the kickrange of the player.
State space: S_work = positions on the pitch; S+: ball in kickrange; S-: e.g. collision with an opponent.
Cost function: c(s) = 0, if s ∈ S+; 1, if s ∈ S-; 0.01, else.

Actions: turn(0°), turn(10°), ..., turn(360°), ..., dash(0), dash(10), ...
Neural value function (6-10-1 architecture).

Learning curves
[Figure: percentage of successes and costs (time to intercept) over the course of learning]
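As a rough illustration, the cost function and the discretized action set of the intercept task could be written as follows (the 10-degree and 10-unit step sizes and the boolean state predicates are assumptions):

def cost(ball_in_kickrange, collision):
    if ball_in_kickrange:            # s in S+
        return 0.0
    if collision:                    # s in S-
        return 1.0
    return 0.01                      # small cost per step rewards fast intercepts

actions = ([("turn", angle) for angle in range(0, 361, 10)] +
           [("dash", power) for power in range(0, 101, 10)])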


More information

4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac.

4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac. 4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac.uk 1 1 Outline The LMS algorithm Overview of LMS issues

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

An Associative State-Space Metric for Learning in Factored MDPs

An Associative State-Space Metric for Learning in Factored MDPs An Associative State-Space Metric for Learning in Factored MDPs Pedro Sequeira and Francisco S. Melo and Ana Paiva INESC-ID and Instituto Superior Técnico, Technical University of Lisbon Av. Prof. Dr.

More information

Learning a wall following behaviour in mobile robotics using stereo and mono vision

Learning a wall following behaviour in mobile robotics using stereo and mono vision Learning a wall following behaviour in mobile robotics using stereo and mono vision P. Quintía J.E. Domenech C.V. Regueiro C. Gamallo R. Iglesias Dpt. Electronics and Systems. Univ. A Coruña pquintia@udc.es,

More information

Batch Reinforcement Learning for Smart Home Energy Management

Batch Reinforcement Learning for Smart Home Energy Management Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015) Batch Reinforcement Learning for Smart Home Energy Management Heider Berlink, Anna Helena Reali Costa

More information

Creating Algorithmic Traders with Hierarchical Reinforcement Learning MSc Dissertation. Thomas Elder s0789506

Creating Algorithmic Traders with Hierarchical Reinforcement Learning MSc Dissertation. Thomas Elder s0789506 Creating Algorithmic Traders with Hierarchical Reinforcement Learning MSc Dissertation Thomas Elder s78956 Master of Science School of Informatics University of Edinburgh 28 Abstract There has recently

More information

Reinforcement learning and conditioning: an overview. Université Charles de Gaulle, Lille, France

Reinforcement learning and conditioning: an overview. Université Charles de Gaulle, Lille, France Reinforcement learning and conditioning: an overview Jérémie Jozefowiez Université Charles de Gaulle, Lille, France jozefowiez@univ-lille3.fr October 8, 2002 Contents 1 Computational foundations of reinforcement

More information

Creating a NL Texas Hold em Bot

Creating a NL Texas Hold em Bot Creating a NL Texas Hold em Bot Introduction Poker is an easy game to learn by very tough to master. One of the things that is hard to do is controlling emotions. Due to frustration, many have made the

More information

Learning Exercise Policies for American Options

Learning Exercise Policies for American Options Yuxi Li Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Csaba Szepesvari Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Dale Schuurmans

More information

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

More information