Machine Learning: Reinforcement Learning


Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg

Motivation

Can a software agent learn to play Backgammon by itself?
Learning from success or failure.
Neuro-Backgammon: playing at world-champion level (Tesauro, 1992).
Can a software agent learn to balance a pole by itself?
Learning from success or failure.
Neural RL controllers: noisy, unknown, nonlinear (Riedmiller et al.).

Can a software agent learn to cooperate with others by itself?
Learning from success or failure.
Cooperative RL agents: complex, multi-agent, cooperative (Riedmiller et al.).
Reinforcement Learning

Has biological roots: reward and punishment.
No teacher, but: actions + goal, from which the agent has to learn an algorithm/policy.
(Figure: an agent producing a sequence of actions; labels: Action, Agent, "Happy Programming".)
Actor-Critic Scheme (Barto, Sutton, 1983)

(Figure: the World sends the state s and an external reward to the agent, which consists of an internal Critic and an Actor.)
The Critic maps the external, delayed reward into an internal training signal.
The Actor represents the policy.

Overview

I Reinforcement Learning: Basics

A First Example

Choose an action a from a set of three options, and repeat until the goal is reached.
The Temporal Credit Assignment Problem

Which action(s) in the sequence have to be changed?
This is the Temporal Credit Assignment Problem.

Sequential Decision Making

Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1992), Cart-Pole Balancing (AHC/ACE (Barto, Sutton, Anderson, 1983)), robotics and control, ...
Three Steps

1. Describe the environment as a Markov Decision Process (MDP).
2. Formulate the learning task as a dynamic optimization problem.
3. Solve the dynamic optimization problem by dynamic programming methods.
1. Description of the environment

$S$: (finite) set of states
$A$: (finite) set of actions
Behaviour of the environment (model): $p : S \times S \times A \to [0, 1]$, where $p(s, s', a)$ is the probability of a transition from state $s$ to state $s'$ under action $a$.
For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function $f : S \times A \to S$, $s' = f(s, a)$.
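As a minimal sketch of these definitions, assume a hypothetical five-cell corridor that does not appear in the slides. The names `STATES`, `ACTIONS`, `GOAL` and `SLIP`, and the argument order of `p(s, s_next, a)`, are illustrative assumptions, not the lecture's notation.

```python
# Hypothetical toy environment (not from the slides): a corridor of five
# cells; the rightmost cell is an absorbing goal state.
STATES = range(5)     # S: finite set of states
ACTIONS = (-1, +1)    # A: step left or step right
GOAL = 4
SLIP = 0.1            # assumed probability that a move fails


def f(s, a):
    """Deterministic transition function f: S x A -> S."""
    if s == GOAL:
        return s                       # the goal absorbs every action
    return min(max(s + a, 0), GOAL)    # walls clip the move


def p(s, s_next, a):
    """Stochastic model: probability of moving from s to s_next under a."""
    target = f(s, a)
    if target == s:                    # blocked move or goal: stay for sure
        return 1.0 if s_next == s else 0.0
    if s_next == target:
        return 1.0 - SLIP              # the move succeeds
    if s_next == s:
        return SLIP                    # the move fails, the agent stays
    return 0.0
```

For every pair (s, a) the values `p(s, s_next, a)` sum to one over all successor states, which is what makes `p` a transition distribution.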
Markov property: the transition only depends on the current state and action:
$\Pr(s_{t+1} \mid s_t, a_t) = \Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, \ldots)$

2. Formulation of the learning task

Every transition emits transition costs ("immediate costs"), $c : S \times A \to \mathbb{R}$ (sometimes also called immediate reward, $r$).
Now, an agent policy $\pi : S \to A$ can be evaluated (and judged). Consider the path costs:
$J^{\pi}(s) = \sum_t c(s_t, \pi(s_t)), \quad s_0 = s$
Wanted: an optimal policy $\pi^* : S \to A$ where
$J^{\pi^*}(s) = \min_{\pi} \{ \sum_t c(s_t, \pi(s_t)) \mid s_0 = s \}$
Additive (path) costs allow all events along the path to be taken into account.
Does this solve the temporal credit assignment problem? YES!

The choice of the immediate cost function $c(\cdot)$ specifies the policy to be learned. Example:
$c(s) = \begin{cases} 0 & \text{if } s \text{ is a success state} \\ 1000 & \text{if } s \text{ is a failure state } (s \in \text{Failure}) \\ 1 & \text{else} \end{cases}$
Path costs of two example policies: $J^{\pi_1}(s_{start}) = \ldots$, $J^{\pi_2}(s_{start}) = 1004$
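To make the path costs concrete, here is a small sketch that sums the immediate costs along the trajectory of a fixed policy. The corridor environment, the unit step cost and the names `path_cost`, `go_right` and `go_left` are assumptions made for illustration; the infinite sum is truncated at a finite `horizon`.

```python
GOAL = 4  # hypothetical corridor with cells 0..4, goal at the right end


def f(s, a):
    """Deterministic transition; the goal is absorbing."""
    return s if s == GOAL else min(max(s + a, 0), GOAL)


def c(s, a):
    """Immediate cost: 1 per step until the goal is reached, 0 afterwards."""
    return 0.0 if s == GOAL else 1.0


def path_cost(policy, s0, horizon=100):
    """Sum c(s_t, pi(s_t)) along the trajectory started in s0."""
    total, s = 0.0, s0
    for _ in range(horizon):
        a = policy(s)
        total += c(s, a)
        s = f(s, a)
    return total


def go_right(s):
    return +1


def go_left(s):
    return -1


cost_right = path_cost(go_right, 0)   # reaches the goal after four steps
cost_left = path_cost(go_left, 0)     # never reaches it, pays at every step
```

The policy that never reaches the goal accumulates cost at every step, so its truncated path cost grows with the horizon, while the goal-reaching policy pays only for the four steps it needs.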
The specification of the requested policy by $c(\cdot)$ is simple!

3. Solving the optimization problem

For the optimal path costs it is known that
$J^*(s) = \min_a \{ c(s, a) + J^*(f(s, a)) \}$
(Principle of Optimality (Bellman, 1959)).
Can we compute $J^*$? (We will see why, soon.)

Computing $J^*$: the value iteration (VI) algorithm

Start with arbitrary $J_0(s)$.
For all states $s$:
$J_{k+1}(s) := \min_{a \in A} \{ c(s, a) + J_k(f(s, a)) \}$
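The value iteration update can be sketched in a few lines. The corridor environment, the unit step cost and the fixed number of sweeps are assumptions chosen for illustration; the update line itself mirrors $J_{k+1}(s) := \min_{a \in A} \{ c(s, a) + J_k(f(s, a)) \}$.

```python
# Hypothetical corridor: cells 0..4, absorbing goal at cell 4.
STATES = range(5)
ACTIONS = (-1, +1)
GOAL = 4


def f(s, a):
    """Deterministic transition function."""
    return s if s == GOAL else min(max(s + a, 0), GOAL)


def c(s, a):
    """Immediate cost: 1 per step, 0 in the goal."""
    return 0.0 if s == GOAL else 1.0


def value_iteration(sweeps=50):
    """Repeat the VI backup for all states, starting from J_0 = 0."""
    J = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        # Synchronous sweep: the new table is built from the old one.
        J = {s: min(c(s, a) + J[f(s, a)] for a in ACTIONS) for s in STATES}
    return J


J_star = value_iteration()
```

In this toy problem the resulting table holds the number of steps to the goal from each cell, i.e. the costs-to-go under the unit step cost.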
Convergence of value iteration

Value iteration converges under certain assumptions, i.e. we have $\lim_{k \to \infty} J_k = J^*$.

Discounted problems: $J^{\pi^*}(s) = \min_{\pi} \{ \sum_t \gamma^t c(s_t, \pi(s_t)) \mid s_0 = s \}$ where $0 \le \gamma < 1$ (contraction mapping).

Stochastic shortest path problems:
there exists an absorbing terminal state with zero costs;
there exists a proper policy (a policy that has a nonzero chance to finally reach the terminal state);
every non-proper policy has infinite path costs for at least one state.
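The "contraction mapping" remark for discounted problems can be spelled out. As a sketch for the deterministic case, write $\mathcal{B}$ for the discounted value iteration backup; the symbol $\mathcal{B}$ and the sup-norm notation are introduced here for illustration and do not come from the slides.

```latex
% Discounted backup operator
(\mathcal{B}J)(s) = \min_{a \in A} \{ c(s,a) + \gamma \, J(f(s,a)) \}

% Since |\min_a x_a - \min_a y_a| \le \max_a |x_a - y_a|:
|(\mathcal{B}J)(s) - (\mathcal{B}J')(s)|
  \le \gamma \max_{a \in A} |J(f(s,a)) - J'(f(s,a))|
  \le \gamma \, \|J - J'\|_\infty

% With J^* = \mathcal{B}J^* and J_k = \mathcal{B}^k J_0:
\|J_k - J^*\|_\infty \le \gamma^k \, \|J_0 - J^*\|_\infty \to 0 \quad (k \to \infty)
```

Because $0 \le \gamma < 1$, each sweep shrinks the worst-case error by at least the factor $\gamma$, which is why the starting table $J_0$ can be arbitrary.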
Ok, now we have $J^*$

When $J^*$ is known, then we also know an optimal policy:
$\pi^*(s) \in \arg\min_{a \in A} \{ c(s, a) + J^*(f(s, a)) \}$

Back to our maze

(Figure: maze with Start and Goal cells.)
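Reading an optimal policy off a known table of optimal path costs is a one-step lookahead with the model. The sketch below assumes the same hypothetical corridor as before and a hand-filled table `J_STAR`; both are illustrative and not taken from the slides.

```python
# Hypothetical corridor: cells 0..4, absorbing goal at cell 4.
STATES = range(5)
ACTIONS = (-1, +1)
GOAL = 4

# Assumed optimal costs-to-go for unit step costs (steps to the goal).
J_STAR = {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}


def f(s, a):
    """Deterministic transition function."""
    return s if s == GOAL else min(max(s + a, 0), GOAL)


def c(s, a):
    """Immediate cost: 1 per step, 0 in the goal."""
    return 0.0 if s == GOAL else 1.0


def greedy_policy(J):
    """Pick, in every state, an action minimising c(s, a) + J(f(s, a))."""
    return {s: min(ACTIONS, key=lambda a: c(s, a) + J[f(s, a)]) for s in STATES}


policy = greedy_policy(J_STAR)
```

Note that the lookahead needs the transition function `f`; this dependence on the model is what Q-learning later removes.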
Overview of the approach so far

Description of the learning task as an MDP $S, A, T, f, c$; $c$ specifies the requested behaviour/policy.
Iterative computation of the optimal path costs $J^*$: $\forall s \in S: J_{k+1}(s) = \min_{a \in A} \{ c(s, a) + J_k(f(s, a)) \}$
Computation of an optimal policy from $J^*$: $\pi^*(s) \in \arg\min_{a \in A} \{ c(s, a) + J^*(f(s, a)) \}$
The value function ("costs-to-go") can be stored in a table.
Overview of the approach: Stochastic Domains

Value iteration in stochastic environments, with $p(s, s', a)$ the probability of a transition from $s$ to $s'$ under action $a$:
$\forall s \in S: J_{k+1}(s) = \min_{a \in A} \{ \sum_{s' \in S} p(s, s', a) (c(s, a) + J_k(s')) \}$
Computation of an optimal policy from $J^*$:
$\pi^*(s) \in \arg\min_{a \in A} \{ \sum_{s' \in S} p(s, s', a) (c(s, a) + J^*(s')) \}$
The value function $J^*$ ("costs-to-go") can be stored in a table.
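The stochastic backup replaces the single successor by an expectation over successors. The sketch below assumes a hypothetical corridor in which every move fails with probability `SLIP` (the agent then stays where it is); the helper `successors` and all constants are illustrative assumptions.

```python
# Hypothetical noisy corridor: cells 0..4, absorbing goal at cell 4.
STATES = range(5)
ACTIONS = (-1, +1)
GOAL = 4
SLIP = 0.1  # assumed probability that a move fails


def successors(s, a):
    """Return {s_next: probability} for applying action a in state s."""
    if s == GOAL:
        return {s: 1.0}
    target = min(max(s + a, 0), GOAL)
    if target == s:                       # bumping into the left wall
        return {s: 1.0}
    return {target: 1.0 - SLIP, s: SLIP}


def c(s, a):
    """Immediate cost: 1 per step, 0 in the goal."""
    return 0.0 if s == GOAL else 1.0


def value_iteration(sweeps=200):
    """Value iteration with an expectation over successor states."""
    J = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        J = {
            s: min(
                sum(prob * (c(s, a) + J[s2]) for s2, prob in successors(s, a).items())
                for a in ACTIONS
            )
            for s in STATES
        }
    return J


J = value_iteration()
```

In this example each step towards the goal takes 1 / (1 - SLIP) attempts on average, so the expected costs-to-go from cell 0 approach 4 / 0.9.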
Reinforcement Learning

Problems of Value Iteration:
for all $s \in S$: $J_{k+1}(s) = \min_{a \in A} \{ c(s, a) + J_k(f(s, a)) \}$
Problems:
Size of $S$ (Chess, robotics, ...): learning time, storage?
The model (transition behaviour) $f(s, a)$ or $p(s, s', a)$ must be known!
Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions: Overview

Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
Model-free learning (Q-Learning (Watkins, 1989))
Neural representation of the value function (or alternative function approximators)
Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)

Idea: instead of "for all $s \in S$", now "for some $s \in S$" ...
Learning is based on trajectories (experiences).

Q-Learning

Idea (Watkins, Diss, 1989): in every state, store for every action the expected costs-to-go. $Q^{\pi}(s, a)$ denotes the expected future path costs for applying action $a$ in state $s$ (and continuing according to policy $\pi$):
$Q^{\pi}(s, a) := \sum_{s' \in S} p(s, s', a) (c(s, a) + J^{\pi}(s'))$
Here $J^{\pi}(s')$ denotes the expected path costs when starting from $s'$ and acting according to $\pi$.

Action selection is now possible without a model:
Original VI: state evaluation. Action selection: $\pi^*(s) \in \arg\min_a \{ c(s, a) + J^*(f(s, a)) \}$
Q-learning: state-action evaluation. Action selection: $\pi^*(s) = \arg\min_a Q^*(s, a)$
Learning an optimal Q-function

To find $Q^*$, a value iteration algorithm can be applied:
$Q_{k+1}(s, a) := \sum_{s' \in S} p(s, s', a) (c(s, a) + J_k(s'))$
where $J_k(s) = \min_{a' \in A(s)} Q_k(s, a')$.
Furthermore, learning a Q-function without a model, from experienced transition tuples $(s, a) \to s'$ only, is possible:
Q-learning (Q-value iteration + Robbins-Monro stochastic approximation)
$Q_{k+1}(s, a) := (1 - \alpha) Q_k(s, a) + \alpha (c(s, a) + \min_{a' \in A(s')} Q_k(s', a'))$
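The model-free update uses nothing but one observed transition. A minimal sketch of a single update step follows; the function name `q_update`, the dictionary-based table and the sample transition are illustrative assumptions.

```python
from collections import defaultdict


def q_update(Q, s, a, cost, s_next, actions, alpha=0.5):
    """One Q-learning step from a single observed transition (s, a) -> s_next."""
    # Sampled target: immediate cost plus the best known costs-to-go in s_next.
    target = cost + min(Q[(s_next, b)] for b in actions)
    # Stochastic approximation: move the old estimate towards the target.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


Q = defaultdict(float)  # all Q-values start at 0
ACTIONS = (-1, +1)

# The same transition observed twice: state 0, action +1, cost 1, successor 1.
first = q_update(Q, s=0, a=+1, cost=1.0, s_next=1, actions=ACTIONS)   # 0.5
second = q_update(Q, s=0, a=+1, cost=1.0, s_next=1, actions=ACTIONS)  # 0.75
```

With $\alpha = 0.5$ each update halves the distance between the stored value and the sampled target, which is the incremental averaging that stands in for the expectation over $s'$.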
Summary Q-learning

Q-learning is a variant of value iteration for the case that no model is available. It is based on two major ingredients:
it uses a representation of costs-to-go for state/action pairs, $Q(s, a)$;
it uses a stochastic approximation scheme to incrementally compute expectation values on the basis of observed transitions $(s, a) \to s'$.
The Q-learning algorithm converges under the same assumptions as value iteration + every state/action pair has to be visited infinitely often + conditions for stochastic approximation.
Q-Learning algorithm

Start in an arbitrary initial state $s_0$; $t = 0$.
Choose the action greedily, $u_t := \arg\min_{a \in A} Q_k(s_t, a)$, or choose $u_t$ according to an exploration scheme.
Apply $u_t$ in the environment: $s_{t+1} = f(s_t, u_t, w_t)$.
Learn the Q-value: $Q_{k+1}(s_t, u_t) := (1 - \alpha) Q_k(s_t, u_t) + \alpha (c(s_t, u_t) + J_k(s_{t+1}))$ where $J_k(s_{t+1}) := \min_{a \in A} Q_k(s_{t+1}, a)$.
Repeat until a terminal state is reached, and continue until the policy is optimal ("enough").
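The steps of the algorithm can be put together as a short training loop. This is a sketch under several assumptions that are not in the slides: a deterministic five-cell corridor as environment, epsilon-greedy exploration as the exploration scheme, a constant learning rate, and every episode starting in cell 0. The agent only calls `step`; it never inspects the transition function.

```python
import random
from collections import defaultdict

GOAL = 4              # hypothetical corridor with cells 0..4
ACTIONS = (-1, +1)    # step left or step right


def step(s, a):
    """Environment: returns the successor state and the immediate cost."""
    return min(max(s + a, 0), GOAL), 1.0


def q_learning(episodes=2000, alpha=0.5, epsilon=0.2, max_steps=200, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                # Q-table, initialised to 0
    for _ in range(episodes):
        s = 0                             # assumed fixed start state
        for _ in range(max_steps):
            if s == GOAL:                 # terminal state reached
                break
            if rng.random() < epsilon:    # exploration scheme (assumed)
                u = rng.choice(ACTIONS)
            else:                         # greedy action selection
                u = min(ACTIONS, key=lambda a: Q[(s, a)])
            s_next, cost = step(s, u)     # apply u in the environment
            J_next = min(Q[(s_next, a)] for a in ACTIONS)
            Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (cost + J_next)
            s = s_next
    return Q


Q = q_learning()
policy = {s: min(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
```

Because all costs are positive and the table starts at zero, untried actions look cheap at first, so the greedy choice itself pushes the agent to try them; the explicit exploration rate mainly guards against ties and slow corrections.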
Representation of the path costs in a function approximator

Idea: neural representation of the value function (or alternative function approximators) (Neuro Dynamic Programming (Bertsekas, 1987)).
(Figure: neural network with weights $w_{ij}$, input $x$, output $J(x)$.)
Few parameters (here: weights) specify the value function for a large state space.
Learning by gradient descent on the squared error $E = (c(s, a) + J(s') - J(s))^2$, using the gradient $\partial E / \partial w_{ij}$.

Example: learning to intercept in robotic soccer

As fast as possible (anticipation of the intercept position).
Random noise in ball and player movement: need for corrections.
A sequence of turn($\theta$) and dash($v$) commands is required.
Hand-coding a routine is a lot of work, with many parameters to tune!
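The gradient-descent idea can be sketched with a much simpler approximator than a neural network. The example below assumes a linear model with two hand-picked features, evaluates the fixed "always step right" policy on a hypothetical corridor, and treats the target $c(s, a) + J(s')$ as a constant when taking the gradient; all of these are illustrative choices, not the lecture's setup.

```python
GOAL = 4  # hypothetical corridor with cells 0..4


def features(s):
    """Assumed features: a bias term and the distance to the goal."""
    return (1.0, float(GOAL - s))


def J(w, s):
    """Approximate costs-to-go; the terminal state is fixed at zero."""
    if s == GOAL:
        return 0.0
    return sum(wi * xi for wi, xi in zip(w, features(s)))


def td_step(w, s, cost, s_next, lr=0.05):
    """Gradient step on 0.5 * (cost + J(s_next) - J(s))**2, target held fixed."""
    delta = cost + J(w, s_next) - J(w, s)
    # For a linear model the gradient of J(s) w.r.t. w is the feature vector.
    return tuple(wi + lr * delta * xi for wi, xi in zip(w, features(s)))


w = (0.0, 0.0)
for _ in range(2000):                 # repeated sweeps along one trajectory
    for s in range(GOAL):             # policy: always step right, cost 1
        w = td_step(w, s, cost=1.0, s_next=s + 1)
```

Two weights describe the costs-to-go of all cells here, which is the point of the slide: the number of parameters does not have to grow with the number of states.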
Reinforcement learning of intercept

Goal: ball is in kick range of player.
State space: $S^{work}$ = positions on pitch
$S^+$: ball in kick range
$S^-$: e.g. collision with opponent
$c(s) = \begin{cases} 0 & s \in S^+ \\ \ldots & s \in S^- \\ 0.0\ldots & \text{else} \end{cases}$
Actions: a discretized set of turn commands, up to turn(360°), and of dash commands.
Neural value function (60architecture)

Learning curves

(Figure: learning curves from "kick.stat"; panels: percentage of successes; costs (time to intercept).)
More informationVersion Spaces. riedmiller@informatik.unifreiburg.de
. Machine Learning Version Spaces Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät AlbertLudwigsUniversität Freiburg riedmiller@informatik.unifreiburg.de
More informationQ1. [19 pts] Cheating at cs188blackjack
CS 188 Spring 2012 Introduction to Artificial Intelligence Practice Midterm II Solutions Q1. [19 pts] Cheating at cs188blackjack Cheating dealers have become a serious problem at the cs188blackjack tables.
More informationChapter 4: Artificial Neural Networks
Chapter 4: Artificial Neural Networks CS 536: Machine Learning Littman (Wu, TA) Administration icml03: instructional Conference on Machine Learning http://www.cs.rutgers.edu/~mlittman/courses/ml03/icml03/
More informationArtificially Intelligent Adaptive Tutoring System
Artificially Intelligent Adaptive Tutoring System William Whiteley Stanford School of Engineering Stanford, CA 94305 bill.whiteley@gmail.com Abstract This paper presents an adaptive tutoring system based
More informationRational and Convergent Learning in Stochastic Games
+ ' ' Rational and Convergent Learning in Stochastic Games Michael Bowling mhb@cs.cmu.edu Manuela Veloso veloso@cs.cmu.edu Computer Science Department Carnegie Mellon University Pittsburgh, PA 523389
More informationReinforcement Learning on the Combinatorial Game of Nim ERIK JÄRLEBERG
Reinforcement Learning on the Combinatorial Game of Nim ERIK JÄRLEBERG Bachelor of Science Thesis Stockholm, Sweden 2011 Reinforcement Learning on the Combinatorial Game of Nim ERIK JÄRLEBERG Bachelor
More informationA Multiagent Qlearning Framework for Optimizing Stock Trading Systems
A Multiagent Qlearning Framework for Optimizing Stock Trading Systems Jae Won Lee 1 and Jangmin O 2 1 School of Computer Science and Engineering, Sungshin Women s University, Seoul, Korea 136742 jwlee@cs.sungshin.ac.kr
More informationMultiple Model Reinforcement Learning for Environments with Poissonian Time Delays
Multiple Model Reinforcement Learning for Environments with Poissonian Time Delays by Jeffrey Scott Campbell A Thesis submitted to the Faculty of Graduate Studies and Research in partial fulfilment of
More informationTemporal Difference Learning in the Tetris Game
Temporal Difference Learning in the Tetris Game Hans Pirnay, Slava Arabagi February 6, 2009 1 Introduction Learning to play the game Tetris has been a common challenge on a few past machine learning competitions.
More informationAn Environment Model for N onstationary Reinforcement Learning
An Environment Model for N onstationary Reinforcement Learning Samuel P. M. Choi DitYan Yeung Nevin L. Zhang pmchoi~cs.ust.hk dyyeung~cs.ust.hk lzhang~cs.ust.hk Department of Computer Science, Hong Kong
More informationIntroduction.
. Machine Learning Introduction Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät AlbertLudwigsUniversität Freiburg riedmiller@informatik.unifreiburg.de
More informationOnline Policy Improvement using MonteCarlo Search
Online Policy Improvement using MonteCarlo Search Gerald Tesauro IBM T. J. Watson Research Center P. O. Box 704 Yorktown Heights, NY 10598 Gregory R. Galperin MIT AI Lab 545 Technology Square Cambridge,
More informationBidding Dynamics of Rational Advertisers in Sponsored Search Auctions on the Web
Proceedings of the International Conference on Advances in Control and Optimization of Dynamical Systems (ACODS2007) Bidding Dynamics of Rational Advertisers in Sponsored Search Auctions on the Web S.
More information6.231 Dynamic Programming and Stochastic Control Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.231 Dynamic Programming and Stochastic Control Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 6.231
More informationTemporal Difference Learning of NTuple Networks for the Game 2048
This paper has been accepted for presentation at CIG 014 (IEEE Conference on Computational Intelligence and Games), August 014, Dortmund, Germany (http://wwwcig014de) Temporal Difference Learning of NTuple
More informationA Reinforcement Learning Approach for Supply Chain Management
A Reinforcement Learning Approach for Supply Chain Management Tim Stockheim, Michael Schwind, and Wolfgang Koenig Chair of Economics, esp. Information Systems, Frankfurt University, D60054 Frankfurt,
More informationPlaying Atari with Deep Reinforcement Learning
Playing Atari with Deep Reinforcement Learning Volodymyr Mnih Koray Kavukcuoglu David Silver Alex Graves Ioannis Antonoglou Daan Wierstra Martin Riedmiller DeepMind Technologies {vlad,koray,david,alex.graves,ioannis,daan,martin.riedmiller}
More informationPlanning in Artificial Intelligence
Logical Representations and Computational Methods for Marov Decision Processes Craig Boutilier Department of Computer Science University of Toronto Planning in Artificial Intelligence Planning has a long
More informationReinforcement Learning with Factored States and Actions
Journal of Machine Learning Research 5 (2004) 1063 1088 Submitted 3/02; Revised 1/04; Published 8/04 Reinforcement Learning with Factored States and Actions Brian Sallans Austrian Research Institute for
More informationUrban Traffic Control Based on Learning Agents
Urban Traffic Control Based on Learning Agents PierreLuc Grégoire, Charles Desjardins, Julien Laumônier and Brahim Chaibdraa DAMAS Laboratory, Computer Science and Software Engineering Department, Laval
More informationAlgorithms for Reinforcement Learning
Algorithms for Reinforcement Learning Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning series by Morgan & Claypool Publishers Csaba Szepesvári June
More informationTetris: Experiments with the LP Approach to Approximate DP
Tetris: Experiments with the LP Approach to Approximate DP Vivek F. Farias Electrical Engineering Stanford University Stanford, CA 94403 vivekf@stanford.edu Benjamin Van Roy Management Science and Engineering
More informationAbstracting Reusable Cases from Reinforcement Learning
Abstracting Reusable Cases from Reinforcement Learning Andreas von Hessling and Ashok K. Goel College of Computing Georgia Institute of Technology Atlanta, GA 30318 {avh, goel}@cc.gatech.edu Abstract.
More informationReinforcement Learning of Task Plans for Real Robot Systems
Reinforcement Learning of Task Plans for Real Robot Systems Pedro Tomás Mendes Resende pedro.resende@ist.utl.pt Instituto Superior Técnico, Lisboa, Portugal October 2014 Abstract This paper is the extended
More informationGamePlaying & Adversarial Search MiniMax, Search Cutoff, Heuristic Evaluation
GamePlaying & Adversarial Search MiniMax, Search Cutoff, Heuristic Evaluation This lecture topic: GamePlaying & Adversarial Search (MiniMax, Search Cutoff, Heuristic Evaluation) Read Chapter 5.15.2,
More informationEFFICIENT USE OF REINFORCEMENT LEARNING IN A COMPUTER GAME
EFFICIENT USE OF REINFORCEMENT LEARNING IN A COMPUTER GAME Yngvi Björnsson, Vignir Hafsteinsson, Ársæll Jóhannsson, and Einar Jónsson Reykjavík University, School of Computer Science Ofanleiti 2 Reykjavik,
More informationLearning the Structure of Factored Markov Decision Processes in Reinforcement Learning Problems
Learning the Structure of Factored Markov Decision Processes in Reinforcement Learning Problems Thomas Degris Thomas.Degris@lip6.fr Olivier Sigaud Olivier.Sigaud@lip6.fr PierreHenri Wuillemin PierreHenri.Wuillemin@lip6.fr
More informationUsing Markov Decision Theory to Provide a Fair Challenge in a RollandMove Board Game
Using Markov Decision Theory to Provide a Fair Challenge in a RollandMove Board Game Éric Beaudry, Francis Bisson, Simon Chamberland, Froduald Kabanza Abstract Board games are often taken as examples
More informationTDGammon, A SelfTeaching Backgammon Program, Achieves MasterLevel Play
TDGammon, A SelfTeaching Backgammon Program, Achieves MasterLevel Play Gerald Tesauro IBM Thomas J. Watson Research Center P. O. Box 704 Yorktown Heights, NY 10598 (tesauro@watson.ibm.com) Abstract.
More informationLearning Tetris Using the Noisy CrossEntropy Method
NOTE Communicated by Andrew Barto Learning Tetris Using the Noisy CrossEntropy Method István Szita szityu@eotvos.elte.hu András Lo rincz andras.lorincz@elte.hu Department of Information Systems, Eötvös
More informationRETALIATE: Learning Winning Policies in FirstPerson Shooter Games
RETALIATE: Learning Winning Policies in FirstPerson Shooter Games Megan Smith, Stephen LeeUrban, Héctor MuñozAvila Department of Computer Science & Engineering, Lehigh University, Bethlehem, PA 180153084
More informationOPTIMIZATION AND OPERATIONS RESEARCH Vol. IV  Markov Decision Processes  Ulrich Rieder
MARKOV DECISIO PROCESSES Ulrich Rieder University of Ulm, Germany Keywords: Markov decision problem, stochastic dynamic program, total reward criteria, average reward, optimal policy, optimality equation,
More informationStochastic Models for Inventory Management at Service Facilities
Stochastic Models for Inventory Management at Service Facilities O. Berman, E. Kim Presented by F. Zoghalchi University of Toronto Rotman School of Management Dec, 2012 Agenda 1 Problem description Deterministic
More informationUsing abstract models of behaviors to automatically generate reinforcement learning hierarchies
Hauptseminar Intelligente Autonome Systeme Using abstract models of behaviors to automatically generate reinforcement learning hierarchies Christian Sosnowski Betreuer: Freek Stulp Technische Universität
More informationAn Application of Inverse Reinforcement Learning to Medical Records of Diabetes Treatment
An Application of Inverse Reinforcement Learning to Medical Records of Diabetes Treatment Hideki Asoh 1, Masanori Shiro 1 Shotaro Akaho 1, Toshihiro Kamishima 1, Koiti Hasida 1, Eiji Aramaki 2, and Takahide
More informationUniversity of Cambridge Department of Engineering MODULAR ONLINE FUNCTION APPROXIMATION FOR SCALING UP REINFORCEMENT LEARNING Chen Khong Tham Jesus College, Cambridge, England. A October 1994 This dissertation
More informationChannel Allocation in Cellular Telephone. Systems. Lab. for Info. and Decision Sciences. Cambridge, MA 02139. bertsekas@lids.mit.edu.
Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems Satinder Singh Department of Computer Science University of Colorado Boulder, CO 803090430 baveja@cs.colorado.edu Dimitri
More informationLearning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.
Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose
More informationChisquare Tests Driven Method for Learning the Structure of Factored MDPs
Chisquare Tests Driven Method for Learning the Structure of Factored MDPs Thomas Degris Thomas.Degris@lip6.fr Olivier Sigaud Olivier.Sigaud@lip6.fr PierreHenri Wuillemin PierreHenri.Wuillemin@lip6.fr
More informationReinforcement Learning for Resource Allocation and Time Series Tools for Mobility Prediction
Reinforcement Learning for Resource Allocation and Time Series Tools for Mobility Prediction Baptiste Lefebvre 1,2, Stephane Senecal 2 and JeanMarc Kelif 2 1 École Normale Supérieure (ENS), Paris, France,
More informationGoalDirected Online Learning of Predictive Models
GoalDirected Online Learning of Predictive Models Sylvie C.W. Ong, Yuri Grinberg, and Joelle Pineau McGill University, Montreal, Canada Abstract. We present an algorithmic approach for integrated learning
More informationSupply Chain Yield Management based on Reinforcement Learning
Supply Chain Yield Management based on Reinforcement Learning Tim Stockheim 1, Michael Schwind 1, Alexander Korth 2, and Burak Simsek 2 1 Chair of Economics, esp. Information Systems, Frankfurt University,
More informationTHE LOGIC OF ADAPTIVE BEHAVIOR
THE LOGIC OF ADAPTIVE BEHAVIOR Knowledge Representation and Algorithms for Adaptive Sequential Decision Making under Uncertainty in FirstOrder and Relational Domains Martijn van Otterlo Department of
More informationMultiagent QLearning of Channel Selection in Multiuser Cognitive Radio Systems: A Two by Two Case
Multiagent QLearning of Channel Selection in Multiuser Cognitive Radio Systems: A Two by Two Case Husheng Li arxiv:93.5282v [cs.it] 3 Mar 29 Abstract Resource allocation is an important issue in cognitive
More informationA survey of pointbased POMDP solvers
DOI 10.1007/s1045801292002 A survey of pointbased POMDP solvers Guy Shani Joelle Pineau Robert Kaplow The Author(s) 2012 Abstract The past decade has seen a significant breakthrough in research on
More informationDistributed Reinforcement Learning for Network Intrusion Response
Distributed Reinforcement Learning for Network Intrusion Response KLEANTHIS MALIALIS Doctor of Philosophy UNIVERSITY OF YORK COMPUTER SCIENCE September 2014 For my late grandmother Iroulla Abstract The
More information10. Machine Learning in Games
Machine Learning and Data Mining 10. Machine Learning in Games Luc De Raedt Thanks to Johannes Fuernkranz for his slides Contents Game playing What can machine learning do? What is (still) hard? Various
More informationNovember 28 th, Carlos Guestrin. Lower dimensional projections
PCA Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University November 28 th, 2007 Lower dimensional projections Rather than picking a subset of the features, we can new features that are combinations
More informationOptimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning*
Optimizing admission control while ensuring quality of service in multimedia networks via reinforcement learning* Timothy X Brown t, Hui Tong t, Satinder Singh+ t Electrical and Computer Engineering +
More informationArtificial Intelligence Adversarial Search
Artificial Intelligence Adversarial Search Andrea Torsello Games Vs. Planning Problems In many problems you re are pitted against an opponent certain operators are beyond your control: you cannot control
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More informationReducing Operational Costs in Cloud Social TV: An Opportunity for Cloud Cloning
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 16, NO. 6, OCTOBER 2014 1739 Reducing Operational Costs in Cloud Social TV: An Opportunity for Cloud Cloning Yichao Jin, Yonggang Wen, Senior Member, IEEE, Han Hu,
More informationCall Admission Control and Routing in Integrated Service Networks Using Reinforcement Learning
Call Admission Control and Routing in Integrated Service Networks Using Reinforcement Learning Peter Marbach LIDS MIT, Room 507 Cambridge, MA, 09 email: marbach@mit.edu Oliver Mihatsch Siemens AG Corporate
More informationReinforcement Learning: A Tutorial Survey and Recent Advances
A. Gosavi Reinforcement Learning: A Tutorial Survey and Recent Advances Abhijit Gosavi Department of Engineering Management and Systems Engineering 219 Engineering Management Missouri University of Science
More informationCOMP424: Artificial intelligence. Lecture 7: Game Playing
COMP 424  Artificial Intelligence Lecture 7: Game Playing Instructor: Joelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp424 Unless otherwise noted, all material posted
More informationIN AVIATION it regularly occurs that an airplane encounters. Guaranteed globally optimal continuous reinforcement learning.
GUARANTEED GLOBALLY OPTIMAL CONTINUOUS REINFORCEMENT LEARNING 1 Guaranteed globally optimal continuous reinforcement learning Hildo Bijl Abstract Selflearning and adaptable autopilots have the potential
More informationLearning to play chess using TD(λ)learning with database games
Learning to play chess using TD(λ)learning with database games Henk Mannen & Marco Wiering Cognitive Artificial Intelligence University of Utrecht henk.mannen@phil.uu.nl & marco@cs.uu.nl Abstract In this
More informationCombining QLearning with Artificial Neural Networks in an Adaptive Light Seeking Robot
Combining QLearning with Artificial Neural Networks in an Adaptive Light Seeking Robot Steve Dini and Mark Serrano May 6, 2012 Abstract Qlearning is a reinforcement learning technique that works by learning
More informationHierarchical Reinforcement Learning in Computer Games
Hierarchical Reinforcement Learning in Computer Games Marc Ponsen, Pieter Spronck, Karl Tuyls Maastricht University / MICCIKAT. {m.ponsen,p.spronck,k.tuyls}@cs.unimaas.nl Abstract. Hierarchical reinforcement
More informationNonlinear Algebraic Equations Example
Nonlinear Algebraic Equations Example Continuous Stirred Tank Reactor (CSTR). Look for steady state concentrations & temperature. s r (in) p,i (in) i In: N spieces with concentrations c, heat capacities
More informationSecurity Risk Management via Dynamic Games with Learning
Security Risk Management via Dynamic Games with Learning Praveen Bommannavar Management Science & Engineering Stanford University Stanford, California 94305 Email: bommanna@stanford.edu Tansu Alpcan Deutsche
More informationSINGLESTAGE MULTIPRODUCT PRODUCTION AND INVENTORY SYSTEMS: AN ITERATIVE ALGORITHM BASED ON DYNAMIC SCHEDULING AND FIXED PITCH PRODUCTION
SIGLESTAGE MULTIPRODUCT PRODUCTIO AD IVETORY SYSTEMS: A ITERATIVE ALGORITHM BASED O DYAMIC SCHEDULIG AD FIXED PITCH PRODUCTIO Euclydes da Cunha eto ational Institute of Technology Rio de Janeiro, RJ
More informationSkillbased Routing Policies in Call Centers
BMI Paper Skillbased Routing Policies in Call Centers by Martijn Onderwater January 2010 BMI Paper Skillbased Routing Policies in Call Centers Author: Martijn Onderwater Supervisor: Drs. Dennis Roubos
More informationA Simultaneous Deterministic Perturbation ActorCritic Algorithm with an Application to Optimal Mortgage Refinancing
Proceedings of the 45th IEEE Conference on Decision & Control Manchester Grand Hyatt Hotel San Diego, CA, USA, December 1315, 2006 A Simultaneous Deterministic Perturbation ActorCritic Algorithm with
More information10.2 Series and Convergence
10.2 Series and Convergence Write sums using sigma notation Find the partial sums of series and determine convergence or divergence of infinite series Find the N th partial sums of geometric series and
More informationSome stability results of parameter identification in a jump diffusion model
Some stability results of parameter identification in a jump diffusion model D. Düvelmeyer Technische Universität Chemnitz, Fakultät für Mathematik, 09107 Chemnitz, Germany Abstract In this paper we discuss
More informationDetection and Mitigation of CyberAttacks Using Game Theory and Learning
Detection and Mitigation of CyberAttacks Using Game Theory and Learning João P. Hespanha Kyriakos G. Vamvoudakis Correlation Engine COAs Data Data Data Data Cyber Situation Awareness Framework Mission
More informationSemiMarkov Adaptive Critic Heuristics with Application to Airline Revenue Management
SemiMarkov Adaptive Critic Heuristics with Application to Airline Revenue Management Ketaki Kulkarni Abhijit Gosavi Email: gosavia@mst.edu Susan Murray Katie Grantham Department of Engineering Management
More information4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac.
4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department Email : sss40@eng.cam.ac.uk 1 1 Outline The LMS algorithm Overview of LMS issues
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationAn Associative StateSpace Metric for Learning in Factored MDPs
An Associative StateSpace Metric for Learning in Factored MDPs Pedro Sequeira and Francisco S. Melo and Ana Paiva INESCID and Instituto Superior Técnico, Technical University of Lisbon Av. Prof. Dr.
More informationLearning a wall following behaviour in mobile robotics using stereo and mono vision
Learning a wall following behaviour in mobile robotics using stereo and mono vision P. Quintía J.E. Domenech C.V. Regueiro C. Gamallo R. Iglesias Dpt. Electronics and Systems. Univ. A Coruña pquintia@udc.es,
More informationBatch Reinforcement Learning for Smart Home Energy Management
Proceedings of the TwentyFourth International Joint Conference on Artificial Intelligence (IJCAI 2015) Batch Reinforcement Learning for Smart Home Energy Management Heider Berlink, Anna Helena Reali Costa
More informationCreating Algorithmic Traders with Hierarchical Reinforcement Learning MSc Dissertation. Thomas Elder s0789506
Creating Algorithmic Traders with Hierarchical Reinforcement Learning MSc Dissertation Thomas Elder s78956 Master of Science School of Informatics University of Edinburgh 28 Abstract There has recently
More informationReinforcement learning and conditioning: an overview. Université Charles de Gaulle, Lille, France
Reinforcement learning and conditioning: an overview Jérémie Jozefowiez Université Charles de Gaulle, Lille, France jozefowiez@univlille3.fr October 8, 2002 Contents 1 Computational foundations of reinforcement
More informationCreating a NL Texas Hold em Bot
Creating a NL Texas Hold em Bot Introduction Poker is an easy game to learn by very tough to master. One of the things that is hard to do is controlling emotions. Due to frustration, many have made the
More informationLearning Exercise Policies for American Options
Yuxi Li Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Csaba Szepesvari Dept. of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 Dale Schuurmans
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More information