
NEURAL NETWORKS AND REINFORCEMENT LEARNING
Abhijit Gosavi
Department of Engineering Management and Systems Engineering
Missouri University of Science and Technology, Rolla, MO 65409

Outline
- A Quick Introduction to Reinforcement Learning
- The Role of Neural Networks in Reinforcement Learning
- Some Algorithms
- The Success Stories and the Failures
- Some Online Demos
- Future of Neural Networks and Reinforcement Learning

What is Reinforcement Learning? Reinforcement Learning (RL) is a technique useful in solving control optimization problems. By control optimization, we mean the problem of identifying the best action in every state visited by the system so as to optimize some objective function, e.g., the average reward per unit time or the total discounted reward over a given time horizon. Typically, RL is used when the system has a very large number of states (>> 1000) and a complex stochastic structure that is not amenable to closed-form analysis. When a problem has a relatively small number of states and the underlying random structure is relatively simple, one can use dynamic programming instead.

Q-Learning The central idea in Q-Learning is to learn the optimal action in every state visited by the system (also called the optimal policy) via trial and error. The trial-and-error mechanism can be implemented within the real-world system (commonly seen in robotics) or within a simulator (commonly seen in management science / industrial engineering).

Q-Learning: Working Mechanism The agent chooses an action, obtains feedback for that action, and uses the feedback to update its database. In its database, the agent keeps a so-called Q-factor for every state-action pair. When the feedback for selecting an action in a state is positive, the associated Q-factor's value is increased, while if the feedback is negative, the value is decreased. The feedback consists of the immediate revenue or reward plus the value of the next state.

Figure 1: Trial-and-error mechanism of RL (RL algorithm/agent, simulator/environment, action a, feedback r(i, a, j)). The action selected by the RL agent is fed into the simulator. The simulator simulates the action, and the resulting feedback is fed back into the knowledge base (the Q-factors) of the agent. The agent uses the RL algorithm to update its knowledge base, becomes smarter in the process, and then selects a better action.

Q-Learning: Feedback The immediate reward is denoted by r(i, a, j), where i is the current state, a the action chosen in the current state, and j the next state. The value of any state is given by the maximum Q-factor in that state. Thus, if there are two actions in each state, the value of a state is the maximum of the two Q-factors for that state. In mathematical terms:

feedback = r(i, a, j) + λ max_b Q(j, b),

where λ is the discount factor, which discounts the values of future states. Usually, λ = 1/(1 + R), where R is the rate of discounting.
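As a quick illustration with made-up numbers: if the rate of discounting is R = 0.10, then λ = 1/(1 + 0.10) ≈ 0.909. If the transition from state i under action a to state j earns r(i, a, j) = 10 and the best Q-factor in state j is max_b Q(j, b) = 50, the feedback is 10 + 0.909 × 50 ≈ 55.45.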

Q-Learning: Algorithm The core of the Q-Learning algorithm uses the following updating equation:

Q(i, a) ← [1 − α] Q(i, a) + α [feedback], i.e.,

Q(i, a) ← [1 − α] Q(i, a) + α [r(i, a, j) + λ max_b Q(j, b)],

where α is the learning rate (or step size).
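A minimal sketch of this update in Python for a small tabular problem; the states, actions, learning rate, and the example transition below are hypothetical placeholders, not taken from the slides:

# Tabular Q-Learning update (hypothetical 2-state, 2-action example).
states, actions = [0, 1], [0, 1]
Q = {(i, a): 0.0 for i in states for a in actions}   # one Q-factor per state-action pair
alpha = 0.1    # learning rate (step size)
lam = 0.9      # discount factor lambda

def q_update(i, a, r, j):
    # Q(i,a) <- (1 - alpha) * Q(i,a) + alpha * [r(i,a,j) + lam * max_b Q(j,b)]
    feedback = r + lam * max(Q[(j, b)] for b in actions)
    Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * feedback

q_update(0, 1, 10.0, 1)   # one update after observing (i=0, a=1) -> j=1 with reward 10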

Q-Learning: Where Do We Need Neural Networks? When we have a very large number of state-action pairs, it is not feasible to store every Q-factor separately. Then, it makes sense to store the Q-factors for a given action within one neural network. When a Q-factor is needed, it is fetched from its neural network. When a Q-factor is to be updated, the new Q-factor is used to update the neural network itself. For any given action, Q(i, a) is a function of i, the state. Hence, we will call it a Q-function in what follows.

Incremental or Batch? Neural networks are generally of two types: batch updating or incremental updating. Batch updating neural networks require all the data at once, while incremental neural networks take one data piece at a time. For reinforcement learning, we need incremental neural networks, since every time the agent receives feedback we obtain a new piece of data that must be used to update some neural network.

Neurons and Backpropagation Neurons are used for fitting linear forms, e.g., y = a + b·i, where i is the input (the state in our case). This is also called the adaline rule or the Widrow-Hoff rule. Backprop is used when the Q-factor is non-linear in i, which is usually the case. (The algorithm was invented by Paul Werbos in 1975.) Backprop is a universal function approximator, and ideally should be able to fit any Q-function! Neurons can also be used to fit the Q-function in a piecewise manner, where a linear fit is introduced in every piece.

Algorithm for Incremental Neuron Step 1: Initialize the weights of the neural network. Step 2a: Compute the output o_p using o_p = Σ_{j=0}^{k} w(j) x(j), where w(j) is the jth weight of the neuron and x(j) is the jth input. Step 2b: Update each w(i) for i = 0, 1, ..., k using: w(i) ← w(i) + µ [target − output] x(i), where the target is the updated Q-factor. Step 3: Increment iter by 1. If iter < iter_max, return to Step 2.
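A minimal Python sketch of this incremental (Widrow-Hoff) neuron; the bias input x(0) = 1, the initial weights, and the target value are assumptions made purely for illustration:

# Incremental neuron: output = sum_j w(j) * x(j); Widrow-Hoff weight update.
def neuron_output(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def neuron_update(w, x, target, mu):
    # w(i) <- w(i) + mu * [target - output] * x(i)
    out = neuron_output(w, x)
    return [wi + mu * (target - out) * xi for wi, xi in zip(w, x)]

w = [0.01, 0.02]    # small initial weights; w(0) is the bias weight
x = [1.0, 3.0]      # x(0) = 1 (bias input), x(1) = state i = 3
w = neuron_update(w, x, target=55.45, mu=0.01)   # target = updated Q-factor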

Q-Learning combined with Neuron We now discuss a simple example of Q-Learning coupled with a neuron using incremental updating on an MDP with two states and two actions.

Step 1. Initialize the weights of the neuron for action 1, i.e., w(1, 1) and w(2, 1), to small random numbers, and set the corresponding weights for action 2 to the same values. Set k, the number of state transitions, to 0. Start the system simulation at any arbitrary state. Set k_max to a large number.

Step 2. Let the state be i. Simulate action a with a probability of 1/|A(i)|. Let the next state be j.

Step 3. Evaluate the Q-factor for the state-action pair (i, a), which we will call Q_old, using the following: Q_old = w(1, a) + w(2, a)·i. Now evaluate the Q-factor for state j associated with each action, i.e., Q_next(1) = w(1, 1) + w(2, 1)·j; Q_next(2) = w(1, 2) + w(2, 2)·j. Now set Q_next = max{Q_next(1), Q_next(2)}.

Step 3a. Update the relevant Q-factor as follows (via Q-Learning):

Q_new ← (1 − α) Q_old + α [r(i, a, j) + λ Q_next].   (1)

Step 3b. The current step may in turn contain a number of steps and involves the neural network updating. Set m = 0, where m is the number of iterations used within the neural network. Set m_max, the maximum number of iterations for neuronal updating, to a suitable value (we will discuss this value below).

Step 3b(i). Update the weights of the neuron associated with action a as follows:

w(1, a) ← w(1, a) + µ (Q_new − Q_old)·1;   w(2, a) ← w(2, a) + µ (Q_new − Q_old)·i.   (2)

Step 3b(ii). Increment m by 1. If m < m_max, return to Step 3b(i); otherwise, go to Step 4.

Step 4. Increment k by 1. If k < k_max, set i ← j and go to Step 2. Otherwise, go to Step 5.

Step 5. The policy learned, µ̂, is virtually stored in the weights. To determine the action prescribed in a state i, where i ∈ S, compute the following: µ̂(i) ∈ arg max_{a ∈ A(i)} [w(1, a) + w(2, a)·i].

Some important remarks need to be made in regard to the algorithm above. Remark 1. Note that the update in Equation (2) is the update used by an incremental neuron that seeks to store the Q-factors for a given action. Remark 2. The step size µ is the step size of the neuron, and it can also be decayed with every iteration m.
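A compact Python sketch of the algorithm above, assuming a hypothetical two-state, two-action simulator; the transition and reward model (simulate) is made up purely for illustration, and only the update logic follows Steps 1-5:

import random

S, A = [1, 2], [1, 2]                          # two states, two actions
w1 = [random.uniform(0, 0.01), random.uniform(0, 0.01)]
w = {1: list(w1), 2: list(w1)}                 # Step 1: same small initial weights for both actions
alpha, lam, mu = 0.1, 0.9, 0.01                # Q-Learning step size, discount factor, neuron step size
k_max, m_max = 10000, 1                        # transitions to simulate, neuronal iterations per transition

def simulate(i, a):
    # Hypothetical environment: returns (next state j, immediate reward r(i, a, j)).
    j = random.choice(S)
    r = 10.0 if (i, a) == (1, 2) else 5.0
    return j, r

def q_hat(i, a):
    return w[a][0] + w[a][1] * i               # Q(i, a) approximated by a linear neuron per action

i = random.choice(S)                           # Step 1: start in an arbitrary state
for k in range(k_max):
    a = random.choice(A)                       # Step 2: each action chosen with probability 1/|A(i)|
    j, r = simulate(i, a)
    q_old = q_hat(i, a)                        # Step 3
    q_next = max(q_hat(j, b) for b in A)
    q_new = (1 - alpha) * q_old + alpha * (r + lam * q_next)   # Step 3a, Equation (1)
    for m in range(m_max):                     # Step 3b
        w[a][0] += mu * (q_new - q_old) * 1.0  # Equation (2)
        w[a][1] += mu * (q_new - q_old) * i
    i = j                                      # Step 4

policy = {s: max(A, key=lambda a: q_hat(s, a)) for s in S}     # Step 5: greedy policy read off the weights
print(policy)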

Backpropagation The backpropagation algorithm is more complicated, since it involves multiple layers and threshold functions. The algorithm requires significant tuning. An incremental version of the algorithm must be used.

Integration of neural networks with Q-Learning In some sense, this resembles the integration of hardware (the neural network) and software (Q-Learning). The integration has to be tightly controlled; otherwise the results can be disappointing. There has been mixed success with backprop: Crites and Barto (1996; elevator case study), Das et al. (1999; preventive maintenance, which used backprop with great success), and Sui (supply chains) have obtained great success, but some other failures have been reported (Sutton and Barto's 1998 textbook). Piecewise fitting of the function using neurons has also shown robust and stable behavior: Gosavi (2004; airline revenue management).

Toy Problem: Two States and Two Actions
Table 1: Q-factors obtained by each method.

Method                               Q(1,1)   Q(1,2)   Q(2,1)   Q(2,2)
Q-factor value iteration             44.84    53.02    51.87    49.28
Q-Learning with α = 150/(300 + n)    44.40    52.97    51.84    46.63
Q-Learning with Neuron               43.90    51.90    51.54    49.26

Future Directions A great deal of current research in function approximation for RL uses regression, e.g., algorithms such as LSTD (least-squares temporal differences). However, most exciting RL applications in robotics and neuroscience studies still use neural networks. Support Vector Machines are yet another data-mining tool that has seen some recent applications in RL.

References
D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
R. Crites and A. Barto. Improving elevator performance using reinforcement learning. In Neural Information Processing Systems (NIPS), 1996.
T. K. Das, A. Gosavi, S. Mahadevan, and N. Marchalleck. Solving semi-Markov decision problems using average reward reinforcement learning. Management Science, 45(4):560-574, 1999.
A. Gosavi. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Kluwer Academic Publishers, 2009.
A. Gosavi, N. Bandla, and T. K. Das. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions (Special Issue on Large-Scale Optimization edited by Suvrajeet Sen), 34(9):729-742, 2002.
R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.