Lecture 1: Introduction to Reinforcement Learning

Lecture 1: Introduction to Reinforcement Learning David Silver

Outline 1 Admin 2 About Reinforcement Learning 3 The Reinforcement Learning Problem 4 Inside An RL Agent 5 Problems within Reinforcement Learning

Admin Class Information Thursdays 9:30 to 11:00am Website: http://www.cs.ucl.ac.uk/staff/d.silver/web/teaching.html Group: http://groups.google.com/group/csml-advanced-topics Contact me: d.silver@cs.ucl.ac.uk

Admin Assessment Assessment will be 50% coursework, 50% exam Coursework Assignment A: RL problem Assignment B: Kernels problem Assessment = max(assignment1, assignment2) Examination A: 3 RL questions B: 3 kernels questions Answer any 3 questions

Admin Textbooks An Introduction to Reinforcement Learning, Sutton and Barto, MIT Press, 1998 (40 pounds) Available free online! http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html Algorithms for Reinforcement Learning, Szepesvari, Morgan and Claypool, 2010 (20 pounds) Available free online! http://www.ualberta.ca/~szepesva/papers/rlalgsinmdps.pdf

About RL Many Faces of Reinforcement Learning Reinforcement learning sits at the intersection of many fields: Computer Science (Machine Learning), Engineering (Optimal Control), Mathematics (Operations Research), Neuroscience (Reward System), Psychology (Classical/Operant Conditioning), and Economics (Bounded Rationality).

About RL Branches of Machine Learning Machine Learning has three main branches: Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

About RL Characteristics of Reinforcement Learning What makes reinforcement learning different from other machine learning paradigms? There is no supervisor, only a reward signal Feedback is delayed, not instantaneous Time really matters (sequential, non-i.i.d. data) The agent's actions affect the subsequent data it receives

About RL Examples of Reinforcement Learning Fly stunt manoeuvres in a helicopter Defeat the world champion at Backgammon Manage an investment portfolio Control a power station Make a humanoid robot walk Play many different Atari games better than humans

About RL Helicopter Manoeuvres

About RL Bipedal Robots

About RL Atari

The RL Problem Reward Rewards A reward R_t is a scalar feedback signal Indicates how well the agent is doing at step t The agent's job is to maximise cumulative reward Reinforcement learning is based on the reward hypothesis Definition (Reward Hypothesis) All goals can be described by the maximisation of expected cumulative reward Do you agree with this statement?

The RL Problem Reward Examples of Rewards Fly stunt manoeuvres in a helicopter: +ve reward for following desired trajectory, -ve reward for crashing. Defeat the world champion at Backgammon: +/-ve reward for winning/losing a game. Manage an investment portfolio: +ve reward for each $ in bank. Control a power station: +ve reward for producing power, -ve reward for exceeding safety thresholds. Make a humanoid robot walk: +ve reward for forward motion, -ve reward for falling over. Play many different Atari games better than humans: +/-ve reward for increasing/decreasing score.

The RL Problem Reward Sequential Decision Making Goal: select actions to maximise total future reward Actions may have long term consequences Reward may be delayed It may be better to sacrifice immediate reward to gain more long-term reward Examples: A financial investment (may take months to mature) Refuelling a helicopter (might prevent a crash in several hours) Blocking opponent moves (might help winning chances many moves from now)

The RL Problem Environments Agent and Environment [Figure: the agent-environment interaction loop; the agent receives observation O_t and reward R_t and emits action A_t]

The RL Problem Environments Agent and Environment At each step t the agent: Executes action A_t Receives observation O_t Receives scalar reward R_t The environment: Receives action A_t Emits observation O_{t+1} Emits scalar reward R_{t+1} t increments at env. step

The RL Problem State History and State The history is the sequence of observations, actions, rewards: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t i.e. all observable variables up to time t i.e. the sensorimotor stream of a robot or embodied agent What happens next depends on the history: The agent selects actions The environment selects observations/rewards State is the information used to determine what happens next Formally, state is a function of the history: S_t = f(H_t)
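To make the idea concrete, here is a minimal Python sketch (not from the lecture) of "state as a function of the history", S_t = f(H_t); the particular choice of f below, keeping only the most recent observation, is just one hypothetical summary.

```python
# Minimal sketch (illustrative, not from the lecture): state as a function of history.
# The history is the full sequence of (observation, reward, action) tuples;
# a state is any summary f(H_t) the agent chooses to work with.
from typing import Any, List, Tuple

History = List[Tuple[Any, float, Any]]  # (observation, reward, action) per step

def state_from_history(history: History):
    """One possible f(H_t): keep only the most recent observation.

    Any other summary (last k observations, running counts, ...) would
    also be a valid state in the sense S_t = f(H_t)."""
    if not history:
        return None
    last_observation, _, _ = history[-1]
    return last_observation

# Example usage with made-up observations, rewards and actions:
history: History = [("o1", 0.0, "a1"), ("o2", 1.0, "a2"), ("o3", 0.0, "a3")]
print(state_from_history(history))  # -> "o3"
```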

The RL Problem State Environment State The environment state S^e_t is the environment's private representation i.e. whatever data the environment uses to pick the next observation/reward The environment state is not usually visible to the agent Even if S^e_t is visible, it may contain irrelevant information

The RL Problem State Agent State The agent state S^a_t is the agent's internal representation i.e. whatever information the agent uses to pick the next action i.e. it is the information used by reinforcement learning algorithms It can be any function of history: S^a_t = f(H_t)

The RL Problem State Information State An information state (a.k.a. Markov state) contains all useful information from the history. Definition A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t] The future is independent of the past given the present: H_{1:t} -> S_t -> H_{t+1:∞} Once the state is known, the history may be thrown away i.e. the state is a sufficient statistic of the future The environment state S^e_t is Markov The history H_t is Markov

The RL Problem State Rat Example What if agent state = last 3 items in sequence? What if agent state = counts for lights, bells and levers? What if agent state = complete sequence?

The RL Problem State Fully Observable Environments Full observability: the agent directly observes the environment state: O_t = S^a_t = S^e_t Agent state = environment state = information state Formally, this is a Markov decision process (MDP) (Next lecture and the majority of this course)

The RL Problem State Partially Observable Environments Partial observability: the agent indirectly observes the environment: A robot with camera vision isn't told its absolute location A trading agent only observes current prices A poker playing agent only observes public cards Now agent state ≠ environment state Formally this is a partially observable Markov decision process (POMDP) The agent must construct its own state representation S^a_t, e.g. Complete history: S^a_t = H_t Beliefs of environment state: S^a_t = (P[S^e_t = s^1], ..., P[S^e_t = s^n]) Recurrent neural network: S^a_t = σ(S^a_{t-1} W_s + O_t W_o)
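A minimal numpy sketch of the last construction above, the recurrent update S^a_t = σ(S^a_{t-1} W_s + O_t W_o); the dimensions, weights and observations are arbitrary placeholders, not anything from the lecture.

```python
# Sketch (assumed shapes, random weights): recurrent update of the agent state
# S^a_t = sigma(S^a_{t-1} W_s + O_t W_o) in a partially observable environment.
import numpy as np

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))  # state-to-state weights
W_o = rng.normal(size=(obs_dim, state_dim))    # observation-to-state weights

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))            # element-wise sigmoid

def update_agent_state(prev_state, observation):
    """One recurrent step: fold the new observation into the agent state."""
    return sigma(prev_state @ W_s + observation @ W_o)

s = np.zeros(state_dim)                        # initial agent state
for o in rng.normal(size=(5, obs_dim)):        # a few made-up observations
    s = update_agent_state(s, o)
print(s)
```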

Inside An RL Agent Major Components of an RL Agent An RL agent may include one or more of these components: Policy: the agent's behaviour function Value function: how good is each state and/or action Model: the agent's representation of the environment

Inside An RL Agent Policy A policy is the agent's behaviour It is a map from state to action, e.g. Deterministic policy: a = π(s) Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
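A small Python sketch of the two kinds of policy; the states, actions and probabilities below are hypothetical, chosen only for illustration.

```python
# Sketch: deterministic vs. stochastic policies over a toy state/action space.
import random

def deterministic_policy(state):
    """a = pi(s): always returns the same action for a given state."""
    lookup = {"s0": "E", "s1": "N"}   # hypothetical state-to-action mapping
    return lookup.get(state, "S")

def stochastic_policy(state):
    """Samples a ~ pi(a|s) = P[A_t = a | S_t = s] (here the same distribution for every s)."""
    probs = {"N": 0.1, "E": 0.6, "S": 0.1, "W": 0.2}   # hypothetical distribution
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

print(deterministic_policy("s0"))  # always "E"
print(stochastic_policy("s0"))     # a random draw from pi(.|s0)
```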

Inside An RL Agent Value Function A value function is a prediction of future reward Used to evaluate the goodness/badness of states And therefore to select between actions, e.g. v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
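A sketch of the quantity inside that expectation, the discounted return R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...; v_π(s) is its expected value over trajectories starting in s. The reward sequence and γ below are arbitrary.

```python
# Sketch: the discounted return whose expectation defines the value function.
def discounted_return(rewards, gamma=0.9):
    """R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one trajectory."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# v_pi(s) would be the average of such returns over many episodes starting in s.
print(discounted_return([1.0, 0.0, 2.0, 1.0]))  # 1 + 0 + 0.81*2 + 0.729*1
```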

Inside An RL Agent Example: Value Function in Atari

Inside An RL Agent Model A model predicts what the environment will do next P predicts the next state R predicts the next (immediate) reward, e.g. P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a] R^a_s = E[R_{t+1} | S_t = s, A_t = a]
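A minimal sketch (an illustration, not the lecture's method) of a tabular model estimated from experience: transition counts approximate P^a_{ss'} and averaged rewards approximate R^a_s. The recorded transitions are made up.

```python
# Sketch: estimating a tabular model (transitions P and rewards R) from experience.
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a) -> total reward seen
visit_counts = defaultdict(int)                            # (s, a) -> number of visits

def record(s, a, r, s_next):
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def P(s, a, s_next):
    """Estimated P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]."""
    n = visit_counts[(s, a)]
    return transition_counts[(s, a)][s_next] / n if n else 0.0

def R(s, a):
    """Estimated R^a_s = E[R_{t+1} | S_t = s, A_t = a]."""
    n = visit_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0

# Made-up experience:
record("s0", "E", -1.0, "s1")
record("s0", "E", -1.0, "s1")
record("s0", "E", -1.0, "s0")
print(P("s0", "E", "s1"), R("s0", "E"))  # ~0.67, -1.0
```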

Inside An RL Agent Maze Example Start Goal Rewards: -1 per time-step Actions: N, E, S, W States: Agent's location
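As an illustration only, here is a toy environment matching this setup (reward -1 per time-step, actions N/E/S/W, state = agent's location); the layout is a hypothetical corridor, not the maze shown on the slide.

```python
# Sketch: a tiny maze-style environment with the slide's setup
# (states = locations, actions = N/E/S/W, reward = -1 per time-step).
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

class Maze:
    def __init__(self, walls, goal, start):
        self.walls, self.goal, self.pos = set(walls), goal, start

    def step(self, action):
        """Apply an action; bumping into a wall leaves the agent in place."""
        dr, dc = MOVES[action]
        candidate = (self.pos[0] + dr, self.pos[1] + dc)
        if candidate not in self.walls:
            self.pos = candidate
        done = self.pos == self.goal
        return self.pos, -1.0, done       # (next state, reward, terminal?)

# Hypothetical 1x4 corridor: start at (0, 0), goal at (0, 3), no interior walls.
env = Maze(walls=[], goal=(0, 3), start=(0, 0))
state, done = env.pos, False
while not done:
    state, reward, done = env.step("E")
print("reached goal at", state)
```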

Inside An RL Agent Maze Example: Policy Start Goal Arrows represent policy π(s) for each state s

Inside An RL Agent Maze Example: Value Function [Figure: the maze from the previous slide annotated with state values] Numbers represent the value v_π(s) of each state s

Inside An RL Agent Maze Example: Model [Figure: the agent's internal model of the maze, with -1 rewards marked in each cell] Agent may have an internal model of the environment Dynamics: how actions change the state Rewards: how much reward from each state The model may be imperfect Grid layout represents the transition model P^a_{ss'} Numbers represent the immediate reward R^a_s from each state s (same for all a)

Inside An RL Agent Categorizing RL agents (1) Value Based: no policy (implicit), value function. Policy Based: policy, no value function. Actor Critic: policy and value function.

Inside An RL Agent Categorizing RL agents (2) Model Free: policy and/or value function, no model. Model Based: policy and/or value function, model.

Inside An RL Agent RL Agent Taxonomy [Figure: Venn diagram relating value-based, policy-based and actor-critic agents to the model-free and model-based categories]

Problems within RL Learning and Planning Two fundamental problems in sequential decision making Reinforcement Learning: The environment is initially unknown The agent interacts with the environment The agent improves its policy Planning: A model of the environment is known The agent performs computations with its model (without any external interaction) The agent improves its policy a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Problems within RL Atari Example: Reinforcement Learning Rules of the game are unknown Learn directly from interactive game-play Pick actions on joystick, see pixels and scores

Problems within RL Atari Example: Planning Rules of the game are known Can query the emulator: a perfect model inside the agent's brain If I take action a from state s: what would the next state be? what would the score be? Plan ahead to find the optimal policy, e.g. by tree search
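A sketch of what "plan ahead" can mean with a perfect model: recursively query "if I take action a from state s, what happens?" and search the resulting tree. The model interface and toy dynamics below are assumptions for illustration, not the Atari emulator.

```python
# Sketch: depth-limited exhaustive tree search using a perfect (queryable) model.
# `model(state, action) -> (next_state, reward)` is an assumed interface,
# standing in for querying the game emulator.

def plan(state, model, actions, depth, gamma=1.0):
    """Return (best total reward, best first action) found by looking `depth` steps ahead."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        next_state, reward = model(state, a)               # "what would happen?"
        future, _ = plan(next_state, model, actions, depth - 1, gamma)
        value = reward + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# Toy model: states are integers, "right" adds 1, "left" subtracts 1; reward = new state.
def toy_model(state, action):
    next_state = state + (1 if action == "right" else -1)
    return next_state, float(next_state)

print(plan(0, toy_model, ["left", "right"], depth=3))  # prefers going right
```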

Problems within RL Exploration and Exploitation (1) Reinforcement learning is like trial-and-error learning The agent should discover a good policy From its experiences of the environment Without losing too much reward along the way

Problems within RL Exploration and Exploitation (2) Exploration finds more information about the environment Exploitation exploits known information to maximise reward It is usually important to explore as well as exploit

Problems within RL Examples Restaurant Selection: exploitation = go to your favourite restaurant; exploration = try a new restaurant. Online Banner Advertisements: exploitation = show the most successful advert; exploration = show a different advert. Oil Drilling: exploitation = drill at the best known location; exploration = drill at a new location. Game Playing: exploitation = play the move you believe is best; exploration = play an experimental move.
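One standard way to balance the two, not named on the slide but illustrative, is an ε-greedy rule: exploit the best-looking option most of the time and explore a random one occasionally. A sketch with hypothetical restaurant "values":

```python
# Sketch: epsilon-greedy choice between exploitation and exploration.
import random

def epsilon_greedy(estimated_values, epsilon=0.1):
    """With probability epsilon explore (random choice); otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.choice(list(estimated_values))          # exploration
    return max(estimated_values, key=estimated_values.get)    # exploitation

# Hypothetical running estimates of how much we like each restaurant:
values = {"favourite": 8.2, "new_place_1": 5.0, "new_place_2": 6.5}
print(epsilon_greedy(values))   # usually "favourite", occasionally a new one
```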

Problems within RL Prediction and Control Prediction: evaluate the future Given a policy Control: optimise the future Find the best policy

Problems within RL Gridworld Example: Prediction [Figure: (a) a gridworld with special states A (reward +10) and B (reward +5); (b) the value of each state under the uniform random policy] What is the value function for the uniform random policy?
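As an illustration of how the prediction question could be answered numerically, here is a sketch of iterative policy evaluation for the uniform random policy (a method covered later in the course); the grid size, rewards and discount are simplified assumptions, not the gridworld of the slide.

```python
# Sketch: evaluating the uniform random policy on a simplified gridworld
# by iterating the Bellman expectation backup until the values converge.
import numpy as np

N, gamma = 4, 0.9
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]     # N, S, E, W

def step(r, c, a):
    """Toy dynamics: move if inside the grid, otherwise stay; -1 reward for bumping the edge."""
    nr, nc = r + a[0], c + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return (r, c), -1.0

V = np.zeros((N, N))
for _ in range(500):                              # sweep until (approximately) converged
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            total = 0.0
            for a in ACTIONS:                     # uniform random policy: probability 1/4 each
                (nr, nc), reward = step(r, c, a)
                total += 0.25 * (reward + gamma * V[nr, nc])
            V_new[r, c] = total
    V = V_new
print(np.round(V, 1))                             # value of each state under the random policy
```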

Problems within RL Gridworld Example: Control [Figure: (a) the same gridworld; (b) the optimal value function v*; (c) the optimal policy π*] What is the optimal value function over all possible policies? What is the optimal policy?

Course Outline Course Outline Part I: Elementary Reinforcement Learning 1 Introduction to RL 2 Markov Decision Processes 3 Planning by Dynamic Programming 4 Model-Free Prediction 5 Model-Free Control Part II: Reinforcement Learning in Practice 1 Value Function Approximation 2 Policy Gradient Methods 3 Integrating Learning and Planning 4 Exploration and Exploitation 5 Case study - RL in games