Machine Learning: Reinforcement Learning
Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg
[email protected]

Motivation
Can a software agent learn to play Backgammon by itself?
Learning from success or failure. Neuro-Backgammon: playing at world-champion level (Tesauro, 1991).
Can a software agent learn to balance a pole by itself?
Learning from success or failure. Neural RL controllers: noisy, unknown, nonlinear (Riedmiller et al.).

Can a software agent learn to cooperate with others by itself?
Learning from success or failure. Cooperative RL agents: complex, multi-agent, cooperative (Riedmiller et al.).
Reinforcement Learning has biological roots: reward and punishment.
No teacher, but: actions + a goal; the agent learns an algorithm/policy.
(Diagram: the agent acts on its environment through actions.)
Actor-Critic Scheme (Barto, Sutton, 1983)
The Critic maps the external, delayed reward into an internal training signal; the Actor represents the policy. (Diagram: Actor and Critic inside the agent, interacting with the World; external reward, internal critic signal, state s.)

Overview
I. Reinforcement Learning - Basics

A First Example
Choose: an action a from a small discrete set, until the goal state is reached. (Example figure not preserved in the transcription.)
The Temporal Credit Assignment Problem
Which action(s) in the sequence have to be changed? This is the temporal credit assignment problem.

Sequential Decision Making
Examples: Chess; Checkers (Samuel, 1959); Backgammon (Tesauro, 1991); Cart-Pole Balancing (AHC/ACE, Barto, Sutton, Anderson, 1983); robotics and control, ...
Three Steps
1. Describe the environment as a Markov Decision Process (MDP).
2. Formulate the learning task as a dynamic optimization problem.
3. Solve the dynamic optimization problem by dynamic programming methods.
1. Description of the environment
- S: (finite) set of states
- A: (finite) set of actions
- Behaviour of the environment (model): p : S × S × A → [0,1], where p(s', s, a) is the probability distribution of the transition from s to s' under action a.
- For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function f : S × A → S, with s' = f(s, a).
- Markov property: the transition depends only on the current state and action: Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...).
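To make the formalism concrete, here is a minimal sketch of such a deterministic environment in Python (illustrative only; the corridor layout, the state set 0..4 and the goal state are assumptions, not part of the slides):

```python
# Minimal sketch of a deterministic environment (illustrative, not from the slides):
# a corridor with states 0..4, where state 4 is an absorbing goal state.
S = list(range(5))          # (finite) set of states
A = ["left", "right"]       # (finite) set of actions
GOAL = 4

def f(s, a):
    """Deterministic transition function f : S x A -> S."""
    if s == GOAL:                        # the goal state is absorbing
        return s
    return max(0, s - 1) if a == "left" else min(GOAL, s + 1)

print(f(2, "right"))                     # -> 3
```

The Markov property holds trivially here: the successor depends only on the current state and the chosen action.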
2. Formulation of the learning task
- Every transition emits transition costs (immediate costs), c : S × A → R (sometimes also called immediate reward, r).
- Now an agent policy π : S → A can be evaluated (and judged). Consider the path costs J^π(s) = Σ_t c(s_t, π(s_t)), with s_0 = s.
- Wanted: an optimal policy π* : S → A with J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }.
- Additive (path) costs allow all events along the way to be taken into account.
- The choice of the immediate cost function c(·) specifies the policy to be learned. Example: c(s) = 0 if s is a success state, 1000 if s is a failure state, and 1 otherwise. In the slide's example, a policy that fails from s_start accumulates path costs of J^π(s_start) = 1004, while a successful policy stays far cheaper.
- Does this solve the temporal credit assignment problem? YES: the additive path costs spread the delayed outcome over the whole action sequence. The specification of the requested policy by c(·) is simple!
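As a small illustration of how additive path costs judge a policy, the following sketch rolls out a fixed policy in the corridor world from above and sums the immediate costs (the 1-per-step cost is an assumption in the spirit of the slide's example):

```python
# Path costs J^pi(s) of a fixed policy in the corridor world (illustrative sketch).
def c(s, a):
    """Immediate costs: 0 for the step that reaches the goal, 1 otherwise
    (a failure cost such as 1000 could be added for forbidden states)."""
    return 0 if f(s, a) == GOAL else 1

def J_pi(s, pi, max_steps=100):
    """Additive path costs of policy pi started in state s."""
    total = 0
    for _ in range(max_steps):
        if s == GOAL:
            break
        a = pi(s)
        total += c(s, a)
        s = f(s, a)
    return total

always_right = lambda s: "right"
print(J_pi(0, always_right))    # 4 steps to the goal, the last one is free -> 3
```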
3. Solving the optimization problem
- For the optimal path costs it is known that J*(s) = min_a { c(s,a) + J*(f(s, a)) } (Principle of Optimality, Bellman, 1959).
- Can we compute J*? (We will see soon why this helps.)

Computing J*: the value iteration (VI) algorithm
- Start with an arbitrary J_0(s).
- For all states s: J_{k+1}(s) := min_{a∈A} { c(s,a) + J_k(f(s,a)) }.
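The update above can be written down directly; a compact value-iteration sketch for the deterministic corridor world introduced earlier (reuses S, A, f, c and GOAL from the sketches above; the fixed iteration count is an assumption):

```python
# Value iteration for the deterministic corridor world (illustrative sketch).
def value_iteration(S, A, f, c, n_iter=100):
    J = {s: 0.0 for s in S}                       # arbitrary J_0
    for _ in range(n_iter):
        J = {s: 0.0 if s == GOAL else
                min(c(s, a) + J[f(s, a)] for a in A)
             for s in S}                          # J_{k+1}(s) = min_a {c(s,a) + J_k(f(s,a))}
    return J

J_star = value_iteration(S, A, f, c)
print(J_star)                                     # e.g. J*(0) = 3.0
```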
Convergence of value iteration
Value iteration converges under certain assumptions, i.e. we have lim_{k→∞} J_k = J*, for
- Discounted problems: path costs J^π(s) = Σ_t γ^t c(s_t, π(s_t)), s_0 = s, with 0 ≤ γ < 1 (the VI update is then a contraction mapping).
- Stochastic shortest path problems:
  - there exists an absorbing terminal state with zero costs,
  - there exists a proper policy (a policy that has a non-zero chance of finally reaching the terminal state),
  - every non-proper policy has infinite path costs for at least one state.
Ok, now we have J*
When J* is known, then we also know an optimal policy:
π*(s) ∈ arg min_{a∈A} { c(s,a) + J*(f(s,a)) }

Back to our maze
(Figure: the maze with Start and Ziel (goal).)
Overview of the approach so far
- Description of the learning task as an MDP (S, A, T, f, c); c specifies the requested behaviour/policy.
- Iterative computation of the optimal path costs J*: for all s ∈ S: J_{k+1}(s) = min_{a∈A} { c(s,a) + J_k(f(s, a)) }.
- Computation of an optimal policy from J*: π*(s) ∈ arg min_{a∈A} { c(s,a) + J*(f(s,a)) }.
- The value function ("costs-to-go") can be stored in a table.
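Reading off the greedy policy from J* is then a one-liner; a sketch for the corridor world (reusing A, c, f and J_star from above):

```python
# Greedy policy extraction from J* (illustrative sketch).
def greedy_policy(s, J):
    """pi*(s) in argmin_a { c(s,a) + J*(f(s,a)) }; note that this needs the model f."""
    return min(A, key=lambda a: c(s, a) + J[f(s, a)])

print([greedy_policy(s, J_star) for s in S if s != GOAL])
# -> ['right', 'right', 'right', 'right']
```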
Overview of the approach: Stochastic Domains
- Value iteration in stochastic environments: for all s ∈ S: J_{k+1}(s) = min_{a∈A} { Σ_{s'∈S} p(s', s, a) (c(s,a) + J_k(s')) }.
- Computation of an optimal policy from J*: π*(s) ∈ arg min_{a∈A} { Σ_{s'∈S} p(s', s, a) (c(s,a) + J*(s')) }.
- The value function J* ("costs-to-go") can be stored in a table.
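In the stochastic case the minimization runs over expected costs; a sketch with a hypothetical transition model given as nested dictionaries (all states, costs and probabilities below are made up for illustration):

```python
# Value iteration with a stochastic transition model (illustrative sketch).
# p[s][a] maps successor states s' to their probabilities p(s', s, a).
def stochastic_value_iteration(S, A, p, c, terminal, n_iter=200):
    J = {s: 0.0 for s in S}
    for _ in range(n_iter):
        J = {s: 0.0 if s in terminal else
                min(sum(prob * (c(s, a) + J[s2])
                        for s2, prob in p[s][a].items())
                    for a in A)
             for s in S}
    return J

# Tiny two-state example: action "go" reaches the terminal state 1 with
# probability 0.8 and stays in state 0 with probability 0.2; every step costs 1.
p2 = {0: {"go": {1: 0.8, 0: 0.2}}, 1: {"go": {1: 1.0}}}
J2 = stochastic_value_iteration([0, 1], ["go"], p2, lambda s, a: 1.0, terminal={1})
print(J2[0])   # expected number of steps to terminate: 1 / 0.8 = 1.25
```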
Reinforcement Learning
Problems of value iteration (for all s ∈ S: J_{k+1}(s) = min_{a∈A} { c(s,a) + J_k(f(s,a)) }):
- the size of S (Chess, robotics, ...): learning time, storage?
- the model (transition behaviour), f(s,a) or p(s', s, a), must be known!
Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions - Overview
- Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
- Model-free learning (Q-Learning, Watkins, 1989)
- Neural representation of the value function (or alternative function approximators)
Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989)
Idea: instead of "for all s ∈ S", now "for some s ∈ S" - learning is based on trajectories (experiences).

Q-Learning
Idea (Watkins, Diss., 1989): in every state, store for every action the expected costs-to-go. Q^π(s, a) denotes the expected future path costs for applying action a in state s and then continuing according to policy π:
Q^π(s, a) := Σ_{s'∈S} p(s', s, a) (c(s,a) + J^π(s')),
where J^π(s') are the expected path costs when starting from s' and acting according to π.
Action selection is now possible without a model:
- Original VI: state evaluation J; action selection π*(s) ∈ arg min_a { c(s,a) + J*(f(s,a)) } requires the model f.
- Q-learning: state-action evaluation Q; action selection π*(s) = arg min_a Q*(s, a) needs no model.
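The key difference can be shown in two lines: given a table of Q-values, the greedy action needs neither f nor p (the state name and the numbers below are hypothetical):

```python
# Model-free greedy action selection from a Q-table (illustrative sketch).
Q = {("s1", "left"): 2.0, ("s1", "right"): 1.0}    # hypothetical Q-values

def greedy_action(s, Q, actions):
    """pi*(s) = argmin_a Q*(s, a); no transition model is needed."""
    return min(actions, key=lambda a: Q[(s, a)])

print(greedy_action("s1", Q, ["left", "right"]))   # -> 'right'
```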
Learning an optimal Q-function
To find Q*, a value iteration algorithm can be applied:
Q_{k+1}(s, a) := Σ_{s'∈S} p(s', s, a) (c(s,a) + J_k(s')), where J_k(s) = min_{a'∈A(s)} Q_k(s, a').
Furthermore, learning a Q-function without a model, from experienced transition tuples (s, a) → s' only, is possible: Q-learning (Q-value iteration + Robbins-Monro stochastic approximation):
Q_{k+1}(s, a) := (1 - α) Q_k(s, a) + α (c(s,a) + min_{a'∈A(s')} Q_k(s', a'))
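The model-free update for a single observed transition (s, a) → s' can be sketched as follows (the learning rate alpha and the default value 0 for unseen pairs are assumptions of this sketch):

```python
# Q-learning update for one observed transition (illustrative sketch).
from collections import defaultdict

Qtab = defaultdict(float)     # Q(s, a), defaults to 0 for unseen state/action pairs

def q_update(s, a, cost, s_next, actions, alpha=0.1):
    """Q_{k+1}(s,a) := (1 - alpha) Q_k(s,a) + alpha (c(s,a) + min_a' Q_k(s', a'))."""
    target = cost + min(Qtab[(s_next, a2)] for a2 in actions)
    Qtab[(s, a)] = (1 - alpha) * Qtab[(s, a)] + alpha * target
```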
Summary Q-learning
- Q-learning is a variant of value iteration for the case that no model is available.
- It is based on two major ingredients:
  - a representation of the costs-to-go for state/action pairs, Q(s, a);
  - a stochastic approximation scheme to incrementally compute expectation values on the basis of observed transitions (s, a) → s'.
- The Q-learning algorithm converges under the same assumptions as value iteration, plus: every state/action pair has to be visited infinitely often, plus the conditions for stochastic approximation.
Q-Learning algorithm
- Start in an arbitrary initial state s_0; t = 0.
- Choose the action greedily, u_t := arg min_{a∈A} Q_k(s_t, a), or choose u_t according to an exploration scheme.
- Apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t).
- Learn the Q-value: Q_{k+1}(s_t, u_t) := (1 - α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1})), where J_k(s_{t+1}) := min_{a∈A} Q_k(s_{t+1}, a).
- Repeat until a terminal state is reached; repeat episodes until the policy is optimal ("enough").
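Putting these steps together for the deterministic corridor world from the beginning (reusing f, c, A, GOAL and q_update from the sketches above; epsilon-greedy exploration and the episode count are assumptions of this sketch):

```python
# Q-learning episode loop for the corridor world (illustrative sketch).
import random

def run_q_learning(episodes=500, eps=0.1):
    for _ in range(episodes):
        s = 0                                          # arbitrary initial state
        while s != GOAL:
            if random.random() < eps:                  # exploration scheme
                a = random.choice(A)
            else:                                      # greedy action
                a = min(A, key=lambda act: Qtab[(s, act)])
            s_next = f(s, a)                           # apply the action
            q_update(s, a, c(s, a), s_next, A)         # learn the Q-value
            s = s_next

run_q_learning()
print(min(Qtab[(0, a)] for a in A))                    # approaches J*(0) = 3
```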
Representation of the path costs in a function approximator
- Idea: neural representation of the value function (or alternative function approximators) (Neuro-Dynamic Programming, Bertsekas, 1987).
- (Diagram: a multilayer network with weights w_ij maps a state x to the value J(x).)
- A few parameters (here: the weights) specify the value function for a large state space.
- Learning by gradient descent on the error E between J(s) and the one-step target: Δw_ij ∝ (c(s,a) + J(s') - J(s)) ∂J(s)/∂w_ij.
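A minimal function-approximation sketch (linear one-hot features stand in for the multilayer network of the slide; the learning rate and feature choice are assumptions): the estimate J(s) = w^T phi(s) is nudged towards the one-step target c(s,a) + J(s').

```python
# TD-style gradient update for an approximated cost-to-go function
# (illustrative sketch; linear features stand in for the neural network).
import numpy as np

def features(s, n_states=5):
    phi = np.zeros(n_states)
    phi[s] = 1.0                       # one-hot features for a tiny state space
    return phi

w = np.zeros(5)                        # a few parameters represent J for all states

def td_update(s, cost, s_next, lr=0.1):
    """Move J(s) = w^T phi(s) towards the target c(s,a) + J(s')."""
    global w
    td_error = cost + w @ features(s_next) - w @ features(s)
    w += lr * td_error * features(s)   # the gradient of J(s) w.r.t. w is phi(s)
```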
Example: learning to intercept in robotic soccer
- Intercept the ball as fast as possible (anticipation of the intercept position).
- Random noise in the ball and player movement creates the need for corrections.
- A sequence of turn(θ) and dash(v) commands is required.
- Handcoding such a routine is a lot of work, with many parameters to tune!

Reinforcement learning of "intercept"
- Goal: the ball is in the kickrange of the player.
- State space: S_work = positions on the pitch; S+: ball in kickrange; S-: e.g. collision with an opponent.
- Costs: c(s) = 0 for s ∈ S+, a failure cost for s ∈ S-, and 0.01 per step otherwise.
- Actions: turn(0°), turn(10°), ..., turn(360°), ..., dash(0), dash(10), ...
- Neural value function (6-10-1 architecture).
27 Reinforcement learning of intercept : Ball is in kickrange of player state space: S work = positions on pitch S + : Ball in kickrange S : e.g. collision with opponent 0, s S + c(s) =, s S 0.0, else Actions: turn(0 o ), turn(0 o ),... turn(360 o ),... dash(0), dash(0),... Reinforcement learning of intercept : Ball is in kickrange of player state space: S work = positions on pitch S + : Ball in kickrange S : e.g. collision with opponent 0, s S + c(s) =, s S 0.0, else Actions: turn(0 o ), turn(0 o ),... turn(360 o ),... dash(0), dash(0),... neural value function (6-0--architecture) Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, [email protected] Machine Learning 3 Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, [email protected] Machine Learning 3 Learning curves 00 "kick.stat" u 3:7 0. "kick.stat" u 3: Percentage of successes Costs (time to intercept) Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, [email protected] Machine Learning 33