Eligibility Traces

Suggested reading: Chapter 7 in R. S. Sutton, A. G. Barto: Reinforcement Learning: An Introduction. MIT Press, 1998.


Contents:
- n-step TD Prediction
- The forward view of TD(λ)
- The backward view of TD(λ)
- Sarsa(λ)
- Q(λ)
- Actor-critic methods
- Replacing traces

Eligibility Traces

Eligibility traces are one of the basic mechanisms of reinforcement learning. There are two ways to view them:
- The more theoretical view is that they are a bridge from TD to Monte Carlo methods (the forward view).
- According to the other view, an eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action (the backward view). The trace marks the memory parameters associated with the event as eligible for undergoing learning changes.

n-step TD Prediction

Idea: look farther into the future when you do a TD backup (1, 2, 3, ..., n steps), rather than only one step as in TD(0).

Mathematics of n-step TD Prediction

Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T−t−1} r_T

TD: R_t^(1) = r_{t+1} + γ V_t(s_{t+1})   (use V to estimate the remaining return)

n-step TD:
2-step return: R_t^(2) = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})
n-step return: R_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})

If the episode ends in fewer than n steps, the truncation in an n-step return occurs at the episode's end, yielding the conventional complete return: if T − t ≤ n, then R_t^(n) = R_t^(T−t) = R_t.

n-step Backups

Backup (on-line or off-line): ΔV_t(s_t) = α [R_t^(n) − V_t(s_t)]

On-line: the updates are done during the episode, as soon as the increment is computed:
V_{t+1}(s) := V_t(s) + ΔV_t(s)

Off-line: the increments are accumulated "on the side" and are not used to change value estimates until the end of the episode:
V(s) := V(s) + Σ_{t=0}^{T−1} ΔV_t(s)
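As a rough illustration (not from the slides), the n-step return can be computed from a recorded episode as follows; the array layout, with `rewards[k]` holding r_{k+1} and the terminal value fixed at 0, is an assumption of this sketch:

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return R_t^(n) from a recorded episode.

    rewards[k] holds r_{k+1}; values[k] holds V(s_k), with the entry
    for the terminal state fixed at 0. If the episode ends within n
    steps, the return truncates to the complete return R_t.
    """
    T = len(rewards)               # episode terminates at time T
    steps = min(n, T - t)          # truncation at the episode's end
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    g += gamma ** steps * values[t + steps]   # bootstrap term (0 at terminal)
    return g
```

With n larger than the remaining episode length, this reduces to the Monte Carlo return, matching the truncation rule above.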

Error reduction property of n-step returns

For any V, the expected value of the n-step return using V is guaranteed to be a better estimate of V^π than V is: the worst error under the new estimate is guaranteed to be less than or equal to γ^n times the worst error under V:

max_s | E_π[ R_t^(n) | s_t = s ] − V^π(s) | ≤ γ^n max_s | V(s) − V^π(s) |

Using this, you can show that TD prediction methods using n-step backups converge.

n-step TD Prediction: Random Walk Examples

A one-step method would change only the estimate for the last state, V(E), which would be incremented toward 1, the observed return. How does 2-step TD work here? A two-step method would increment the values of the two states preceding termination, V(E) and V(D). How about 3-step TD?

n-step TD Prediction: A Larger Example

Task: 19-state random walk.

Averaging n-step Returns

Backups can be done not just toward any n-step return but toward any average of n-step returns, e.g. half of a 2-step return and half of a 4-step return:

R_t^avg = ½ R_t^(2) + ½ R_t^(4)

Such a backup is called a complex backup. In the backup diagram, draw each component with a horizontal line above it and label it with the weight for that component; the weights must be positive and sum to 1.

λ-return algorithm (forward view of TD(λ))

TD(λ) is a particular method for averaging all n-step backups: weight the n-step return by λ^{n−1} (decaying with the time since visitation), with normalization factor (1 − λ):

R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^(n)

Backup using the λ-return: ΔV_t(s_t) = α [R_t^λ − V_t(s_t)]

λ-return weighting function

After a terminal state has been reached, all subsequent n-step returns are equal to the complete return R_t, so

R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^(n) + λ^{T−t−1} R_t

where the sum gathers the weights until termination and the final term gathers all the weight after termination.
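A minimal sketch of the λ-return computation for a finished episode, under the same assumed array layout as before (`rewards[k]` holds r_{k+1}, terminal value 0); the function names are illustrative, not from the slides:

```python
def lambda_return(rewards, values, t, gamma, lam):
    """λ-return R_t^λ: (1−λ)-weighted average of the n-step returns,
    with weight λ^{T−t−1} on the complete return after termination."""
    T = len(rewards)

    def n_step(n):                      # R_t^(n), truncated at episode end
        steps = min(n, T - t)
        g = sum(gamma ** k * rewards[t + k] for k in range(steps))
        return g + gamma ** steps * values[t + steps]

    g = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    return g + lam ** (T - t - 1) * n_step(T - t)   # complete-return term
```

Setting `lam=1` leaves only the complete return (MC); setting `lam=0` leaves only the one-step return (TD(0)), as discussed next.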

Relation to TD(0) and MC

The λ-return can be rewritten by separating the weight placed on returns until termination from the weight placed after termination. If λ = 1, only the λ^{T−t−1} R_t term survives, and you get MC. If λ = 0, only the n = 1 term survives, and you get TD(0): R_t^λ = R_t^(1).

Forward view of TD(λ)

For each state visited, we look forward in time to all the future rewards and states to determine its update. For off-line updating, the λ-return algorithm is TD(λ).

λ-return on the Random Walk

Same 19-state random walk as before.

Backward View of TD(λ)

Picture shouting the TD error δ_t backwards over time: the strength of your voice decreases with temporal distance by γλ.

Backward View of TD(λ)

The forward view was for theory; the backward view is for mechanism. It introduces a new variable called the eligibility trace: the eligibility trace for state s at time t is denoted e_t(s). On each step, decay all traces by γλ (γ is the discount rate, λ the trace-decay parameter) and increment the trace for the current state by 1:

e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t
e_t(s) = γλ e_{t−1}(s)       otherwise

The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur. The reinforcing events we are concerned with are the moment-by-moment one-step TD errors; for example, the state-value prediction TD error:

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)

The global TD error signal triggers proportional updates to all recently visited states, as signaled by their nonzero traces:

ΔV_t(s) = α δ_t e_t(s)   for all s

As always, these increments could be done on each step to form an on-line algorithm, or saved until the end of the episode to produce an off-line algorithm.

On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0 for all s. On each step of an episode: take the action given by π for the current state s, observe reward r and next state s'; compute δ = r + γ V(s') − V(s); increment e(s) := e(s) + 1; then for all states u, update V(u) := V(u) + α δ e(u) and decay e(u) := γλ e(u).

Backward View: TD(0)

The backward view of TD(λ) is oriented backward in time: at each moment we look at the current TD error and assign it backward to each prior state according to the state's eligibility trace at that time. If λ = 0, the TD(λ) update reduces to the simple TD rule, TD(0): only the one state preceding the current one is changed by the TD error.
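The on-line tabular algorithm can be sketched end to end as below. The environment (a small random walk with reward 1 on the right exit), the constants, and the function name are illustrative assumptions, not from the slides:

```python
import random

def td_lambda_random_walk(n_states=5, episodes=200, alpha=0.1,
                          gamma=1.0, lam=0.8, seed=0):
    """On-line tabular TD(λ) with accumulating traces on a random walk.

    States 0..n_states-1; start in the middle; move left or right with
    equal probability; reward 1 on exiting right, 0 on exiting left.
    """
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        e = [0.0] * n_states                # traces reset each episode
        s = n_states // 2
        while True:
            s2 = s + rng.choice((-1, 1))
            if s2 < 0:
                r, done = 0.0, True
            elif s2 >= n_states:
                r, done = 1.0, True
            else:
                r, done = 0.0, False
            v_next = 0.0 if done else V[s2]
            delta = r + gamma * v_next - V[s]   # one-step TD error
            e[s] += 1.0                          # accumulating trace
            for i in range(n_states):
                V[i] += alpha * delta * e[i]     # update all traced states
                e[i] *= gamma * lam              # decay all traces
            if done:
                break
            s = s2
    return V
```

The learned values should increase from left to right, approaching the true probabilities of exiting on the right.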

Backward View: TD(1) and MC

If you set λ to 1, you get a method similar to MC: for an undiscounted episodic task, TD(1) behaves like the every-visit MC algorithm, but in a better way:
- TD(1) can be applied to continuing tasks, not just episodes.
- It works incrementally and on-line, instead of waiting until the end of the episode.

Equivalence of Forward and Backward Views

The forward (theoretical, λ-return algorithm) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. The book shows (the algebra is in the book) that the sums of updates over an episode agree:

Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{ss_t},  where I_{ss_t} = 1 if s = s_t and 0 otherwise

(backward updates on the left, forward updates on the right). On-line updating with small α is similar.

On-line versus Off-line on the Random Walk

Same 19-state random walk. On-line performs better over a broader range of parameters.

Sarsa(λ)

Eligibility traces for state-action pairs:

e_t(s,a) = γλ e_{t−1}(s,a) + 1   if s = s_t and a = a_t
e_t(s,a) = γλ e_{t−1}(s,a)       otherwise

Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a)
δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)

The TD error triggers an update of all recently visited state-action values.
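The Sarsa(λ) update above can be sketched as a single step over dict-based tables (the dict layout and function name are assumptions of this sketch):

```python
def sarsa_lambda_step(Q, e, s, a, r, s2, a2, done, alpha, gamma, lam):
    """One Sarsa(λ) update with accumulating traces.

    Q and e map (state, action) pairs to values and traces. The single
    TD error delta updates every recently visited state-action pair in
    proportion to its trace.
    """
    q_next = 0.0 if done else Q.get((s2, a2), 0.0)
    delta = r + gamma * q_next - Q.get((s, a), 0.0)   # one-step TD error
    e[(s, a)] = e.get((s, a), 0.0) + 1.0              # accumulate trace
    for sa in list(e):
        Q[sa] = Q.get(sa, 0.0) + alpha * delta * e[sa]
        e[sa] *= gamma * lam                          # decay all traces
    return delta
```

The caller is expected to pick a2 with the behavior policy (e.g. ε-greedy) before calling, since Sarsa is on-policy.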

Sarsa(λ): Algorithm

Sarsa(λ): Gridworld Example

With one trial, the agent has much more information about how to get to the goal (though not necessarily by the best way). Traces can considerably accelerate learning.

Q(λ)

A problem occurs for off-policy methods such as Q-learning when exploratory actions occur, since you would then back up over a non-greedy policy. This would violate GPI. There are three approaches to Q(λ):
- Watkins's: zero out the eligibility traces after a non-greedy action; do the max when backing up at the first non-greedy choice.
- Peng's: no distinction between exploratory and greedy actions.
- Naïve: similar to Watkins's method, except that the traces are not set to zero on exploratory actions.

Watkins's Q(λ)

Watkins's method zeroes out the eligibility traces after a non-greedy action and does the max when backing up at the first non-greedy choice. Disadvantage: early in learning, the eligibility trace will be cut frequently, resulting in short traces.
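One step of Watkins's Q(λ) can be sketched as follows; the dict layout, argument names, and the greedy test are assumptions of this sketch:

```python
def watkins_q_lambda_step(Q, e, s, a, r, s2, a2, actions,
                          alpha, gamma, lam, done=False):
    """One Watkins's Q(λ) update: back up toward max_b Q(s', b), then
    cut all traces to zero if the action actually taken next (a2) is
    non-greedy. Q and e map (state, action) pairs to values/traces."""
    q_next = 0.0 if done else max(Q.get((s2, b), 0.0) for b in actions)
    delta = r + gamma * q_next - Q.get((s, a), 0.0)  # backup uses the max
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
    for sa in list(e):
        Q[sa] = Q.get(sa, 0.0) + alpha * delta * e[sa]
        e[sa] *= gamma * lam
    if not done and Q.get((s2, a2), 0.0) < q_next:
        e.clear()   # exploratory action: cut all traces
    return delta
```

The `e.clear()` line is exactly the trace cutting that makes Watkins's traces short early in learning, when exploratory actions are frequent.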

Watkins's Q(λ): Algorithm

Peng's Q(λ)

No distinction between exploratory and greedy actions: back up the max action except at the end, and never cut traces. The earlier transitions of each backup are on-policy, whereas the last (fictitious) transition uses the greedy policy. Disadvantages: it is complicated to implement, and there is no theoretical guarantee of convergence to the optimal value function.

Peng's Q(λ): Algorithm

For a complete description of the needed implementation, see Peng & Williams (1994, 1996).

Naïve Q(λ)

Idea: is it really a problem to back up exploratory actions? Never zero the traces, and always back up the max at the current action (unlike Peng's or Watkins's method). It works well in preliminary empirical studies.

Comparison Results

Comparison of Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on a deterministic 10×10 gridworld with 25 randomly generated obstacles; 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces. From McGovern and Sutton (1997), "Towards a better Q(λ)."

Convergence of the Q(λ)'s

None of these methods is proven to converge:
- Watkins's is thought to converge to Q*.
- Peng's is thought to converge to a mixture of Q^π and Q*.
- Naïve: Q*? (open question)

Eligibility Traces for Actor-Critic Methods

Critic: on-policy learning of V^π; use TD(λ) as described before. Actor: needs eligibility traces for each state-action pair, and the actor's update equation changes so the preference increments are weighted by the trace: p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a).

Replacing Traces

With accumulating traces, frequently visited states can have eligibilities greater than 1; this can be a problem for convergence. Replacing traces: instead of adding 1 when you visit a state, set that state's trace to 1:

e_t(s) = 1               if s = s_t
e_t(s) = γλ e_{t−1}(s)   otherwise
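The difference between the two trace rules can be sketched in a few lines (function name and dict layout are assumptions of this sketch):

```python
def decay_and_mark(e, s, gamma, lam, replacing=True):
    """One trace-update step: decay every trace by γλ, then mark the
    current state. A replacing trace sets e(s_t) to 1 instead of adding
    1, so repeated visits cannot push a trace above 1."""
    for k in list(e):
        e[k] *= gamma * lam
    if replacing:
        e[s] = 1.0                   # replacing trace
    else:
        e[s] = e.get(s, 0.0) + 1.0   # accumulating trace
    return e
```

With γλ close to 1 and rapid revisits, the accumulating rule lets traces grow well past 1, which is exactly the behavior replacing traces are meant to suppress.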

Replacing Traces: Example

Same 19-state random walk task as before. Replacing traces perform better than accumulating traces over more values of λ.

Why Replacing Traces?

Replacing traces can significantly speed learning, and they can make the system perform well over a broader set of parameters. Accumulating traces can do poorly on certain types of tasks. (Why is this task particularly onerous for accumulating traces?)

More on Replacing Traces

There is an interesting relationship between replace-trace methods and Monte Carlo methods in the undiscounted case: just as conventional TD(1) is related to the every-visit MC algorithm, off-line replace-trace TD(1) is identical to first-visit MC (Singh and Sutton, 1996).

Extension to action values: when you revisit a state, what should you do with the traces for the other actions? Singh and Sutton (1996) proposed setting them to zero:

e_t(s,a) = 1                   if s = s_t and a = a_t
e_t(s,a) = 0                   if s = s_t and a ≠ a_t
e_t(s,a) = γλ e_{t−1}(s,a)     otherwise

Variable λ

The method can be generalized to a variable λ, where λ is a function of time: e_t(s) = γ λ_t e_{t−1}(s), plus 1 for the current state. For example, one could define λ_t = λ(s_t), a function of the current state.

Conclusions

Eligibility traces provide an efficient, incremental way to combine MC and TD:
- They include advantages of MC (they can deal with a lack of the Markov property).
- They include advantages of TD (using the TD error; bootstrapping).
- They can significantly speed learning.
- They do have a cost in computation.

References

Singh, S. P. and Sutton, R. S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, 22.


Elementary Number Theory We begin with a bit of elementary number theory, which is concerned

CONSTRUCTION OF THE FINITE FIELDS Z p S. R. DOTY Elementary Number Theory We begin with a bit of elementary number theory, which is concerned solely with questions about the set of integers Z = {0, ±1,

An ant colony optimization for single-machine weighted tardiness scheduling with sequence-dependent setups

Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006 19 An ant colony optimization for single-machine weighted tardiness

Lecture 6. Weight. Tension. Normal Force. Static Friction. Cutnell+Johnson: 4.8-4.12, second half of section 4.7

Lecture 6 Weight Tension Normal Force Static Friction Cutnell+Johnson: 4.8-4.12, second half of section 4.7 In this lecture, I m going to discuss four different kinds of forces: weight, tension, the normal

Learning Tetris Using the Noisy Cross-Entropy Method

NOTE Communicated by Andrew Barto Learning Tetris Using the Noisy Cross-Entropy Method István Szita szityu@eotvos.elte.hu András Lo rincz andras.lorincz@elte.hu Department of Information Systems, Eötvös

Reinforcement Learning: An Introduction. ****Draft****

i Reinforcement Learning: An Introduction Second edition, in progress ****Draft**** Richard S. Sutton and Andrew G. Barto c 2014, 2015, 2016 A Bradford Book The MIT Press Cambridge, Massachusetts London,

NODAL ANALYSIS. Circuits Nodal Analysis 1 M H Miller

NODAL ANALYSIS A branch of an electric circuit is a connection between two points in the circuit. In general a simple wire connection, i.e., a 'short-circuit', is not considered a branch since it is known

Big Graph Processing: Some Background

Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

A Behavior Based Kernel for Policy Search via Bayesian Optimization

via Bayesian Optimization Aaron Wilson WILSONAA@EECS.OREGONSTATE.EDU Alan Fern AFERN@EECS.OREGONSTATE.EDU Prasad Tadepalli TADEPALL@EECS.OREGONSTATE.EDU Oregon State University School of EECS, 1148 Kelley

COMPUTATIONAL MODELS OF CLASSICAL CONDITIONING: A COMPARATIVE STUDY

COMPUTATIONAL MODELS OF CLASSICAL CONDITIONING: A COMPARATIVE STUDY Christian Balkenius Jan Morén christian.balkenius@fil.lu.se jan.moren@fil.lu.se Lund University Cognitive Science Kungshuset, Lundagård

Tennessee Mathematics Standards 2009-2010 Implementation. Grade Six Mathematics. Standard 1 Mathematical Processes

Tennessee Mathematics Standards 2009-2010 Implementation Grade Six Mathematics Standard 1 Mathematical Processes GLE 0606.1.1 Use mathematical language, symbols, and definitions while developing mathematical

Fundamentals of Actuarial Mathematics

Fundamentals of Actuarial Mathematics S. David Promislow York University, Toronto, Canada John Wiley & Sons, Ltd Contents Preface Notation index xiii xvii PARTI THE DETERMINISTIC MODEL 1 1 Introduction

Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases. Andreas Züfle

Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases Andreas Züfle Geo Spatial Data Huge flood of geo spatial data Modern technology New user mentality Great research potential

Applied mathematics and mathematical statistics

Applied mathematics and mathematical statistics The graduate school is organised within the Department of Mathematical Sciences.. Deputy head of department: Aila Särkkä Director of Graduate Studies: Marija

Dynamic Programming. Lecture 11. 11.1 Overview. 11.2 Introduction

Lecture 11 Dynamic Programming 11.1 Overview Dynamic Programming is a powerful technique that allows one to solve many different types of problems in time O(n 2 ) or O(n 3 ) for which a naive approach

Absolute Value of Reasoning

About Illustrations: Illustrations of the Standards for Mathematical Practice (SMP) consist of several pieces, including a mathematics task, student dialogue, mathematical overview, teacher reflection