Machine Learning: Reinforcement Learning. Motivation: Can a software agent learn to play Backgammon by itself?



Similar documents
Reinforcement Learning

Reinforcement Learning

Lecture 1: Introduction to Reinforcement Learning

Boosting.

Lecture 5: Model-Free Control

Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method

NEURAL NETWORKS AND REINFORCEMENT LEARNING. Abhijit Gosavi

Using Markov Decision Processes to Solve a Portfolio Allocation Problem

Eligibility Traces. Suggested reading: Contents: Chapter 7 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction MIT Press, 1998.

Neuro-Dynamic Programming An Overview

A Sarsa based Autonomous Stock Trading Agent

Preliminaries: Problem Definition Agent model, POMDP, Bayesian RL

The Neuro Slot Car Racer: Reinforcement Learning in a Real World Setting

Version Spaces.

Inductive QoS Packet Scheduling for Adaptive Dynamic Networks

How I won the Chess Ratings: Elo vs the rest of the world Competition

Assignment #1. Example: Tetris state: board configuration + shape of the falling piece (~2^200 states!). Recap RL so far.

Chapter 4: Artificial Neural Networks

TD(0) Leads to Better Policies than Approximate Value Iteration

Learning Agents: Introduction

Reinforcement Learning on the Combinatorial Game of Nim ERIK JÄRLEBERG

A Multi-agent Q-learning Framework for Optimizing Stock Trading Systems

Bidding Dynamics of Rational Advertisers in Sponsored Search Auctions on the Web

An Environment Model for Nonstationary Reinforcement Learning

Temporal Difference Learning in the Tetris Game

6.231 Dynamic Programming and Stochastic Control Fall 2008

A Reinforcement Learning Approach for Supply Chain Management

Urban Traffic Control Based on Learning Agents

Reinforcement Learning with Factored States and Actions

Playing Atari with Deep Reinforcement Learning

Learning the Structure of Factored Markov Decision Processes in Reinforcement Learning Problems

Reinforcement Learning of Task Plans for Real Robot Systems

An Application of Inverse Reinforcement Learning to Medical Records of Diabetes Treatment

TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play

Channel Allocation in Cellular Telephone Systems. Lab. for Info. and Decision Sciences, Cambridge, MA

Algorithms for Reinforcement Learning

Stochastic Models for Inventory Management at Service Facilities

EFFICIENT USE OF REINFORCEMENT LEARNING IN A COMPUTER GAME


Learning Tetris Using the Noisy Cross-Entropy Method

Multi-agent Q-Learning of Channel Selection in Multi-user Cognitive Radio Systems: A Two by Two Case

Reinforcement Learning for Resource Allocation and Time Series Tools for Mobility Prediction

Chi-square Tests Driven Method for Learning the Structure of Factored MDPs

Supply Chain Yield Management based on Reinforcement Learning

RETALIATE: Learning Winning Policies in First-Person Shooter Games

Call Admission Control and Routing in Integrated Service Networks Using Reinforcement Learning

How To Maximize Admission Control On A Network

Distributed Reinforcement Learning for Network Intrusion Response

10. Machine Learning in Games

A survey of point-based POMDP solvers

THE LOGIC OF ADAPTIVE BEHAVIOR

Reducing Operational Costs in Cloud Social TV: An Opportunity for Cloud Cloning

An Associative State-Space Metric for Learning in Factored MDPs

Creating Algorithmic Traders with Hierarchical Reinforcement Learning MSc Dissertation. Thomas Elder s

Security Risk Management via Dynamic Games with Learning

Time Hopping Technique for Faster Reinforcement Learning in Simulations

Goal-Directed Online Learning of Predictive Models

Batch Reinforcement Learning for Smart Home Energy Management

Where Do Rewards Come From?

SINGLE-STAGE MULTI-PRODUCT PRODUCTION AND INVENTORY SYSTEMS: AN ITERATIVE ALGORITHM BASED ON DYNAMIC SCHEDULING AND FIXED PITCH PRODUCTION

Skill-based Routing Policies in Call Centers

The MultiAgent Decision Process toolbox: software for decision-theoretic planning in multiagent systems

Best-response play in partially observable card games

A Simultaneous Deterministic Perturbation Actor-Critic Algorithm with an Application to Optimal Mortgage Refinancing

Semi-Markov Adaptive Critic Heuristics with Application to Airline Revenue Management

Learning to Trade via Direct Reinforcement

Nonlinear Algebraic Equations Example

Reinforcement Learning: A Tutorial Survey and Recent Advances

Some stability results of parameter identification in a jump diffusion model

Introduction to Online Learning Theory

Detection and Mitigation of Cyber-Attacks Using Game Theory and Learning

Statistical Machine Learning

10.2 Series and Convergence

4F7 Adaptive Filters (and Spectrum Estimation) Least Mean Square (LMS) Algorithm Sumeetpal Singh Engineering Department sss40@eng.cam.ac.

Learning a wall following behaviour in mobile robotics using stereo and mono vision

Computing Near Optimal Strategies for Stochastic Investment Planning Problems

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Long-term Planning by Short-term Prediction

Bag of Pursuits and Neural Gas for Improved Sparse Coding

STORM: Stochastic Optimization Using Random Models Katya Scheinberg Lehigh University. (Joint work with R. Chen and M. Menickelly)

IN AVIATION it regularly occurs that an airplane encounters. Guaranteed globally optimal continuous reinforcement learning.

Creating a NL Texas Hold em Bot

Hierarchical Reinforcement Learning in Computer Games

Efficient Network Marketing Strategies For Secondary Users

Variational approach to restore point-like and curve-like singularities in imaging

Jae Won idee. School of Computer Science and Engineering Sungshin Women's University Seoul, , South Korea

Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining

An Introduction to Markov Decision Processes. MDP Tutorial - 1

Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Machine Learning Introduction

Random access protocols for channel access. Markov chains and their stability. Laurent Massoulié.

Efficient Prediction of Value and PolicyPolygraphic Success

Options with exceptions

A Behavior Based Kernel for Policy Search via Bayesian Optimization

Neural Networks and Learning Systems

KERNEL REWARDS REGRESSION: AN INFORMATION EFFICIENT BATCH POLICY ITERATION APPROACH

Game playing. Chapter 6. Chapter 6 1

Tetris: A Study of Randomized Constraint Sampling

PageRank Optimization in Polynomial Time by Stochastic Shortest Path Reformulation

Transcription:

Machine Learning: Reinforcement Learning
Prof. Dr. Martin Riedmiller
AG Maschinelles Lernen und Natürlichsprachliche Systeme, Institut für Informatik, Technische Fakultät, Albert-Ludwigs-Universität Freiburg
riedmiller@informatik.uni-freiburg.de

Motivation
Can a software agent learn to play Backgammon by itself? Learning from success or failure. Neuro-Backgammon: playing at world-champion level (Tesauro, 1991).

Can a software agent learn to balance a pole by itself? Learning from success or failure. Neural RL controllers: noisy, unknown, nonlinear (Riedmiller et al.).

Can a software agent learn to cooperate with others by itself? Learning from success or failure. Cooperative RL agents: complex, multi-agent, cooperative (Riedmiller et al.).

Reinforcement Learning has biological roots: reward and punishment. There is no teacher, but: actions + a goal, from which the agent learns an algorithm/policy.

Actor-Critic Scheme (Barto, Sutton, 1983)
[Figure: actor-critic architecture: the actor acts on the world, the critic turns the external, delayed reward into an internal training signal]
The Critic maps the external, delayed reward into an internal training signal; the Actor represents the policy.

Overview
I. Reinforcement Learning - Basics

A First Example
In every state, choose an action a from a small finite set until the goal state is reached.

The Temporal Credit Assignment Problem
Which action(s) in the sequence have to be changed? This question is the temporal credit assignment problem.

Sequential Decision Making
Examples: Chess, Checkers (Samuel, 1959), Backgammon (Tesauro, 1991), cart-pole balancing (AHC/ACE, Barto, Sutton, Anderson, 1983), robotics and control, ...

Three Steps
1. Describe the environment as a Markov Decision Process (MDP).
2. Formulate the learning task as a dynamic optimization problem.
3. Solve the dynamic optimization problem by dynamic programming methods.

1. Description of the environment
S: (finite) set of states
A: (finite) set of actions
Behaviour of the environment (model): p : S × S × A → [0, 1], where p(s', s, a) is the probability of the transition to s' when applying action a in state s.
For simplicity, we will first assume a deterministic environment. There, the model can be described by a transition function f : S × A → S, s' = f(s, a).

Markov property: the transition only depends on the current state and action:
Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, a_{t-2}, ...)

2. Formulation of the learning task
Every transition emits transition costs (immediate costs), c : S × A → R (sometimes also called immediate reward, r).
Now an agent policy π : S → A can be evaluated (and judged). Consider the path costs: J^π(s) = Σ_t c(s_t, π(s_t)), with s_0 = s.

Wanted: an optimal policy π* : S → A with J^{π*}(s) = min_π { Σ_t c(s_t, π(s_t)) | s_0 = s }.
Additive (path) costs allow all events along a trajectory to be taken into account.

Choice of the immediate cost function c(·) specifies the policy to be learned. Example:
c(s) = 0 if s is a success state, 1000 if s is a failure state, 1 otherwise.
With this choice the path costs of different policies from the start state (e.g. J^{π1}(s_start) versus J^{π2}(s_start) = 1004) can be compared directly.
Does this solve the temporal credit assignment problem? YES! And the specification of the requested policy via c(·) is simple.
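To make this concrete, here is a minimal sketch (not from the lecture) that evaluates the path costs of two hand-written policies on a tiny deterministic corridor task using exactly this cost scheme; the corridor layout and the policy definitions are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions only): path costs J^pi(s) of two
# policies in a tiny deterministic corridor with the cost scheme from above.

GOAL, FAIL = "goal", "fail"

def f(s, a):
    """Deterministic transition function f(s, a) -> s'.
    Cells 0..3; moving right from cell 3 reaches the goal,
    moving left from cell 0 falls off the edge (failure)."""
    if s in (GOAL, FAIL):
        return s
    s2 = s + 1 if a == "right" else s - 1
    if s2 < 0:
        return FAIL
    if s2 > 3:
        return GOAL
    return s2

def c(s, a):
    """Immediate costs: 0 for reaching the goal, 1000 for failure, 1 per step."""
    s2 = f(s, a)
    return 0 if s2 == GOAL else (1000 if s2 == FAIL else 1)

def path_costs(pi, s, max_steps=100):
    """J^pi(s) = sum_t c(s_t, pi(s_t)) along the trajectory starting in s."""
    total = 0
    for _ in range(max_steps):
        if s in (GOAL, FAIL):
            break
        a = pi(s)
        total += c(s, a)
        s = f(s, a)
    return total

print(path_costs(lambda s: "right", 1))   # small path costs: heads for the goal
print(path_costs(lambda s: "left", 1))    # about 1000: the failure term dominates
```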

3. Solving the optimization problem
For the optimal path costs it is known that
J*(s) = min_a { c(s, a) + J*(f(s, a)) }
(Principle of Optimality, Bellman 1959). Can we compute J*? (We will see why soon.)

Computing J*: the value iteration (VI) algorithm
Start with an arbitrary J_0(s), then iterate for all states s:

J_{k+1}(s) := min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }

[Figure: value-iteration backups on a small example grid, propagating costs-to-go from the goal]
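A minimal sketch of this tabular value-iteration sweep, run on a small invented corridor (the maze from the slides is not reproduced here); the greedy policy extraction at the end anticipates a later slide.

```python
# Sketch (illustrative environment): tabular value iteration
#   J_{k+1}(s) = min_a { c(s,a) + J_k(f(s,a)) }
# on a corridor with states 0..5, where state 5 is an absorbing goal.

N = 6
states = list(range(N))
actions = ["left", "right"]

def f(s, a):                       # deterministic model
    if s == N - 1:
        return s                   # absorbing goal state
    return max(0, s - 1) if a == "left" else s + 1

def c(s, a):                       # 0 in the goal state, 1 per ordinary step
    return 0 if s == N - 1 else 1

J = {s: 0.0 for s in states}       # arbitrary J_0
for k in range(100):               # synchronous sweeps over all states
    J = {s: min(c(s, a) + J[f(s, a)] for a in actions) for s in states}

# Greedy policy extraction: pi*(s) in argmin_a { c(s,a) + J*(f(s,a)) }
pi = {s: min(actions, key=lambda a, s=s: c(s, a) + J[f(s, a)]) for s in states}
print(J)    # optimal costs-to-go, e.g. J[0] == 5
print(pi)   # "right" in every non-goal state
```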

Convergence of value iteration
Value iteration converges under certain assumptions, i.e. we have lim_{k→∞} J_k = J*, in particular for:
- Discounted problems: J*(s) = min_π { Σ_t γ^t c(s_t, π(s_t)) | s_0 = s } with 0 ≤ γ < 1 (contraction mapping); see the sketch after this list.
- Stochastic shortest path problems:
  - there exists an absorbing terminal state with zero costs,
  - there exists a proper policy (a policy that has a non-zero chance of finally reaching the terminal state),
  - every non-proper policy has infinite path costs for at least one state.
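For the discounted case, the backup only gains the factor γ; a small sketch follows (the two-state example is invented for illustration).

```python
# Sketch: one sweep of discounted value iteration,
#   J_{k+1}(s) = min_a { c(s,a) + gamma * J_k(f(s,a)) },  0 <= gamma < 1.

def vi_sweep(J, states, actions, f, c, gamma=0.9):
    return {s: min(c(s, a) + gamma * J[f(s, a)] for a in actions)
            for s in states}

# Invented two-state example: "go" reaches the cost-free state 1, "stay" does not.
states, actions = [0, 1], ["stay", "go"]
f = lambda s, a: 1 if a == "go" else s
c = lambda s, a: 0 if s == 1 else 1
J = {0: 0.0, 1: 0.0}
for _ in range(200):
    J = vi_sweep(J, states, actions, f, c, gamma=0.9)
print(J)   # J[1] == 0.0, J[0] == 1.0 (one costly step, then the free state)
```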

Ok, now we have J*
When J* is known, we also know an optimal policy:
π*(s) ∈ arg min_{a ∈ A} { c(s, a) + J*(f(s, a)) }

Back to our maze
[Figure: maze with the optimal costs-to-go entered in every cell along the path from Start to the goal (Ziel)]

Overview of the approach so far
- Description of the learning task as an MDP (S, A, T, f, c); c specifies the requested behaviour/policy.
- Iterative computation of the optimal path costs J*: for all s ∈ S: J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }.
- Computation of an optimal policy from J*: π*(s) ∈ arg min_{a ∈ A} { c(s, a) + J*(f(s, a)) }.
- The value function ("costs-to-go") can be stored in a table.

Overview of the approach: Stochastic Domains
Value iteration in stochastic environments: for all s ∈ S:
J_{k+1}(s) = min_{a ∈ A} Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J_k(s'))
Computation of an optimal policy from J*: π*(s) ∈ arg min_{a ∈ A} Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J*(s')).
The value function J* ("costs-to-go") can again be stored in a table.
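A sketch of the stochastic backup with an explicit transition model p(s' | s, a); the two-state "slippery move" example is an assumption made up for illustration.

```python
# Sketch (invented example): stochastic value iteration,
#   J_{k+1}(s) = min_a sum_{s'} p(s', s, a) * (c(s, a) + J_k(s')).

states = ["far", "goal"]
actions = ["move", "wait"]

# p[(s, a)] maps successor states s' to their probabilities.
p = {
    ("far", "move"):  {"goal": 0.8, "far": 0.2},   # moving succeeds 80% of the time
    ("far", "wait"):  {"far": 1.0},
    ("goal", "move"): {"goal": 1.0},
    ("goal", "wait"): {"goal": 1.0},
}

def c(s, a):                      # one unit of cost per step until the goal
    return 0.0 if s == "goal" else 1.0

def backup(J, s, a):
    return sum(pr * (c(s, a) + J[s2]) for s2, pr in p[(s, a)].items())

J = {s: 0.0 for s in states}
for _ in range(200):
    J = {s: min(backup(J, s, a) for a in actions) for s in states}

pi = {s: min(actions, key=lambda a, s=s: backup(J, s, a)) for s in states}
print(J)    # J["far"] is about 1.25 = 1/0.8, the expected number of steps
print(pi)   # greedy policy: "move" while far from the goal
```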

Problems of Value Iteration
Value iteration sweeps over the entire state space: for all s ∈ S: J_{k+1}(s) = min_{a ∈ A} { c(s, a) + J_k(f(s, a)) }.

Problems:
- the size of S (Chess, robotics, ...): learning time, storage?
- the model (transition behaviour) f(s, a) or p(s', s, a) must be known!

Reinforcement Learning is dynamic programming for very large state spaces and/or model-free tasks.

Important contributions - Overview
- Real Time Dynamic Programming (Barto, Sutton, Watkins, 1989). Idea: instead of "for all s ∈ S", now "for some s ∈ S" ...
- Model-free learning (Q-Learning, Watkins, 1989)
- Neural representation of the value function (or alternative function approximators)

Real Time Dynamic Programming learns on the basis of trajectories (experiences).

Q-Learning
Idea (Watkins, Diss., 1989): in every state, store the expected costs-to-go for every action. Q^π(s, a) denotes the expected future path costs for applying action a in state s (and continuing according to policy π):
Q^π(s, a) := Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J^π(s'))

where J^π(s') are the expected path costs when starting from s' and acting according to π.

Action selection is now possible without a model (see the sketch below):
- Original VI: state evaluation J; action selection π*(s) ∈ arg min_a { c(s, a) + J*(f(s, a)) } requires the model f.
- Q-learning: state-action evaluation Q; action selection π*(s) = arg min_a Q*(s, a) needs no model.
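A short hedged sketch of the two greedy selection rules (function names are ours, not the lecture's):

```python
# Greedy action selection from J* needs the model f and the costs c;
# greedy action selection from Q* is a pure table lookup.

def greedy_from_J(s, actions, f, c, J):
    # pi*(s) in argmin_a { c(s,a) + J*(f(s,a)) }   -- model required
    return min(actions, key=lambda a: c(s, a) + J[f(s, a)])

def greedy_from_Q(s, actions, Q):
    # pi*(s) = argmin_a Q*(s,a)                    -- model-free
    return min(actions, key=lambda a: Q[(s, a)])

Q = {("s0", "left"): 3.0, ("s0", "right"): 1.0}
print(greedy_from_Q("s0", ["left", "right"], Q))   # "right"
```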

Learning an optimal Q-Function
To find Q*, a value iteration algorithm can be applied:
Q_{k+1}(s, a) := Σ_{s' ∈ S} p(s', s, a) (c(s, a) + J_k(s')), where J_k(s) = min_{a' ∈ A(s)} Q_k(s, a').
Furthermore, learning a Q-function without a model, from experienced transition tuples (s, a) → s' only, is possible: Q-learning (Q-value iteration combined with Robbins-Monro stochastic approximation):
Q_{k+1}(s, a) := (1 − α) Q_k(s, a) + α (c(s, a) + min_{a' ∈ A(s')} Q_k(s', a'))

Summary Q-learning
Q-learning is a variant of value iteration for the case that no model is available. It is based on two major ingredients:

- a representation of the costs-to-go for state/action pairs, Q(s, a);
- a stochastic approximation scheme that incrementally computes expectation values on the basis of observed transitions (s, a) → s'.

The Q-learning algorithm converges under the same assumptions as value iteration, plus: every state/action pair has to be visited infinitely often, and the conditions for stochastic approximation (on the learning rates) must hold.
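The stochastic-approximation ingredient can be looked at in isolation: with a learning-rate schedule satisfying the Robbins-Monro conditions (Σ α_t = ∞, Σ α_t² < ∞, e.g. α_t = 1/t), the update x := (1 − α_t) x + α_t · sample converges to the expectation of the samples. A tiny hedged illustration with an invented noisy target:

```python
import random

# Sketch: incremental estimation of an expectation by stochastic approximation,
#   x := (1 - a_t) * x + a_t * sample,   with a_t = 1/t.

random.seed(0)
x = 0.0
for t in range(1, 100001):
    sample = random.choice([0.0, 1000.0])   # noisy samples with mean 500
    a_t = 1.0 / t
    x = (1 - a_t) * x + a_t * sample
print(x)   # close to 500
```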

Q-Learning algorithm
1. Start in an arbitrary initial state s_0; set t = 0.
2. Choose the action greedily, u_t := arg min_{a ∈ A} Q_k(s_t, a), or choose u_t according to an exploration scheme.
3. Apply u_t in the environment: s_{t+1} = f(s_t, u_t, w_t).

4. Learn the Q-value:
   Q_{k+1}(s_t, u_t) := (1 − α) Q_k(s_t, u_t) + α (c(s_t, u_t) + J_k(s_{t+1})), where J_k(s_{t+1}) := min_{a ∈ A} Q_k(s_{t+1}, a).
5. Repeat (restarting from step 1 whenever a terminal state is reached) until the policy is optimal ("enough").
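Putting the steps together, a minimal tabular Q-learning loop; the corridor environment and the ε-greedy exploration scheme are illustrative assumptions, not part of the lecture.

```python
import random

# Sketch: tabular Q-learning on an invented corridor with states 0..4,
# where state 4 is the terminal goal and every ordinary step costs 1.

N, actions, alpha, eps = 5, ["left", "right"], 0.5, 0.1

def step(s, u):                       # environment: returns (cost, next state)
    s2 = min(N - 1, s + 1) if u == "right" else max(0, s - 1)
    return (0 if s2 == N - 1 else 1), s2

Q = {(s, u): 0.0 for s in range(N) for u in actions}

for episode in range(500):
    s = random.randrange(N - 1)                       # arbitrary initial state
    while s != N - 1:                                 # until the terminal state
        if random.random() < eps:                     # exploration scheme
            u = random.choice(actions)
        else:                                         # greedy action
            u = min(actions, key=lambda a: Q[(s, a)])
        cost, s2 = step(s, u)
        J_next = 0.0 if s2 == N - 1 else min(Q[(s2, a)] for a in actions)
        # Q_{k+1}(s,u) := (1 - alpha) Q_k(s,u) + alpha (c(s,u) + min_a Q_k(s',a))
        Q[(s, u)] = (1 - alpha) * Q[(s, u)] + alpha * (cost + J_next)
        s = s2

print(min(Q[(0, u)] for u in actions))   # about 3.0: three costly steps, then a
                                         # free step into the goal
```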

Representation of the path costs in a function approximator
Idea: neural representation of the value function, or alternative function approximators (Neuro-Dynamic Programming, Bertsekas and Tsitsiklis, 1996).
[Figure: multilayer neural network with weights w_ij mapping an input state x to the value estimate J(x)]
A few parameters (here: the weights) specify the value function for a large state space.
Learning by gradient descent on the temporal-difference error: Δw_ij ∝ (c(s, a) + J(s') − J(s)) ∂J(s)/∂w_ij.
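A hedged sketch of this gradient-descent update with the simplest possible approximator: a linear value function J(x) = w·x, for which ∂J/∂w is just the feature vector. The lecture uses a multilayer perceptron; only the gradient computation differs. The five-state corridor below is invented for illustration.

```python
import random

# Sketch: TD-style gradient descent on a linear value function J(x) = w . x.
# Weight update (as in the formula above):
#   w += eta * (c(s,a) + J(s') - J(s)) * dJ/dw,   with dJ/dw = x here.

random.seed(1)
eta = 0.05

def features(s):                          # hypothetical state encoding
    return [1.0, float(s)]

w = [0.0, 0.0]

def J(x):
    return sum(wi * xi for wi, xi in zip(w, x))

for _ in range(20000):
    s = random.randint(0, 4)              # sample a state; state 4 is terminal
    s2 = min(4, s + 1)                    # deterministic drift towards state 4
    cost = 0.0 if s == 4 else 1.0         # one unit of cost per step
    x, x2 = features(s), features(s2)
    target = cost + (0.0 if s2 == 4 else J(x2))
    td_error = target - J(x)              # (c(s,a) + J(s')) - J(s)
    w = [wi + eta * td_error * xi for wi, xi in zip(w, x)]

print([round(wi, 1) for wi in w])         # approximately [4.0, -1.0]
print(round(J(features(0)), 1))           # approximately 4.0 steps from state 0
```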

Example: learning to intercept in robotic soccer
- intercept the ball as fast as possible (anticipation of the intercept position)
- random noise in ball and player movement creates the need for corrections
- a sequence of turn(θ) and dash(v) commands is required
- handcoding such a routine is a lot of work, with many parameters to tune!

Reinforcement learning of intercept
Goal: the ball is in the kick range of the player.
State space: S_work = positions on the pitch; S^+: ball in kick range; S^-: e.g. collision with an opponent.
Cost function: c(s) = 0 if s ∈ S^+, 1 if s ∈ S^-, 0.01 otherwise.
Actions: turn(10°), turn(20°), ..., turn(360°), ..., dash(10), dash(20), ...
A neural value function (small multilayer perceptron with 6 inputs) represents the costs-to-go.
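A hedged sketch of how this task specification could be written down in code; the exact cost values and the 10-degree / 10-power action grids follow the figures given above and should be read as assumptions.

```python
# Sketch: cost function and discretized action set for the intercept skill,
# using the values stated above (success 0, failure 1, 0.01 per step).

def cost(in_kickrange, collided):
    if in_kickrange:         # s in S+ : ball is in the kick range
        return 0.0
    if collided:             # s in S- : e.g. collision with an opponent
        return 1.0
    return 0.01              # small per-step cost => intercept as fast as possible

# turn(10 deg) ... turn(360 deg) and dash(10) ... dash(100)  (discretization assumed)
actions = [("turn", deg) for deg in range(10, 370, 10)] + \
          [("dash", power) for power in range(10, 110, 10)]
print(len(actions), "actions")   # 36 turn commands + 10 dash commands
```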

Reinforcement learning of intercept : Ball is in kickrange of player state space: S work = positions on pitch S + : Ball in kickrange S : e.g. collision with opponent 0, s S + c(s) =, s S 0.0, else Actions: turn(0 o ), turn(0 o ),... turn(360 o ),... dash(0), dash(0),... Reinforcement learning of intercept : Ball is in kickrange of player state space: S work = positions on pitch S + : Ball in kickrange S : e.g. collision with opponent 0, s S + c(s) =, s S 0.0, else Actions: turn(0 o ), turn(0 o ),... turn(360 o ),... dash(0), dash(0),... neural value function (6-0--architecture) Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de Machine Learning 3 Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de Machine Learning 3 Learning curves 00 "kick.stat" u 3:7 0. "kick.stat" u 3: 90 0.8 80 0.6 70 0.4 60 0. 50 0. 40 0.08 30 0 0.06 0 0.04 0 0 5000 0000 5000 0000 5000 30000 35000 40000 45000 50000 Percentage of successes 0.0 0 5000 0000 5000 0000 5000 30000 35000 40000 45000 50000 Costs (time to intercept) Martin Riedmiller, Albert-Ludwigs-Universität Freiburg, riedmiller@informatik.uni-freiburg.de Machine Learning 33