Lecture 5: Model-Free Control
1 Lecture 5: Model-Free Control David Silver
2 Outline
1 Introduction
2 On-Policy Monte-Carlo Control
3 On-Policy Temporal-Difference Learning
4 Off-Policy Learning
5 Summary
3 Introduction Model-Free Reinforcement Learning
Last lecture: model-free prediction, estimating the value function of an unknown MDP.
This lecture: model-free control, optimising the value function of an unknown MDP.
4 Introduction Uses of Model-Free Control
Some example problems that can be modelled as MDPs: Elevator, Parallel Parking, Ship Steering, Bioreactor, Helicopter, Aeroplane Logistics, Robocup Soccer, Quake, Portfolio Management, Protein Folding, Robot Walking, Game of Go.
For most of these problems, either:
the MDP model is unknown, but experience can be sampled, or
the MDP model is known, but is too big to use, except by samples.
Model-free control can solve these problems.
5 Introduction On and Off-Policy Learning
On-policy learning: "learn on the job"; learn about policy π from experience sampled from π.
Off-policy learning: "look over someone's shoulder"; learn about policy π from experience sampled from µ.
6 On-Policy Monte-Carlo Control Generalised Policy Iteration Generalised Policy Iteration (Refresher)
Policy evaluation: estimate $v_\pi$, e.g. iterative policy evaluation.
Policy improvement: generate $\pi' \ge \pi$, e.g. greedy policy improvement.
7 On-Policy Monte-Carlo Control Generalised Policy Iteration Generalised Policy Iteration With Monte-Carlo Evaluation
Policy evaluation: Monte-Carlo policy evaluation, $V = v_\pi$?
Policy improvement: greedy policy improvement?
8 On-Policy Monte-Carlo Control Generalised Policy Iteration Model-Free Policy Iteration Using Action-Value Function
Greedy policy improvement over $V(s)$ requires a model of the MDP:
$\pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \sum_{s'} \mathcal{P}^a_{ss'} V(s') \right)$
Greedy policy improvement over $Q(s, a)$ is model-free:
$\pi'(s) = \operatorname*{argmax}_{a \in \mathcal{A}} Q(s, a)$
9 On-Policy Monte-Carlo Control Generalised Policy Iteration Generalised Policy Iteration with Action-Value Function
[Diagram: starting from $Q, \pi$, alternate evaluation $Q = q_\pi$ and improvement $\pi = \text{greedy}(Q)$, converging to $q_*, \pi_*$]
Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$.
Policy improvement: greedy policy improvement?
10 On-Policy Monte-Carlo Control Exploration Example of Greedy Action Selection
There are two doors in front of you.
You open the left door and get reward 0: $V(\text{left}) = 0$.
You open the right door and get reward +1: $V(\text{right}) = +1$.
You open the right door and get reward +3: $V(\text{right}) = +2$.
You open the right door and get reward +2: $V(\text{right}) = +2$.
Are you sure you've chosen the best door?
11 On-Policy Monte-Carlo Control Exploration ɛ-greedy Exploration
Simplest idea for ensuring continual exploration:
All $m$ actions are tried with non-zero probability.
With probability $1 - \epsilon$ choose the greedy action.
With probability $\epsilon$ choose an action at random.
$\pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = \operatorname*{argmax}_{a' \in \mathcal{A}} Q(s, a') \\ \epsilon/m & \text{otherwise} \end{cases}$
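A minimal sketch of ɛ-greedy action selection over a tabular Q, assuming Q is stored as a 2-D NumPy array indexed by state and action (names and shapes are illustrative, not from the slides):

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """Pick an action for `state` from an epsilon-greedy policy over Q.

    With probability epsilon choose a uniformly random action,
    otherwise choose the greedy (argmax) action, so every action
    is taken with probability at least epsilon / m.
    """
    rng = rng or np.random.default_rng()
    m = Q.shape[1]                       # number of actions
    if rng.random() < epsilon:
        return int(rng.integers(m))      # explore
    return int(np.argmax(Q[state]))      # exploit
```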
12 On-Policy Monte-Carlo Control Exploration ɛ-greedy Policy Improvement
Theorem
For any ɛ-greedy policy $\pi$, the ɛ-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, $v_{\pi'}(s) \ge v_\pi(s)$.
$q_\pi(s, \pi'(s)) = \sum_{a \in \mathcal{A}} \pi'(a \mid s)\, q_\pi(s, a)$
$= \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a)$
$\ge \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon}\, q_\pi(s, a)$
$= \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s)$
Therefore, from the policy improvement theorem, $v_{\pi'}(s) \ge v_\pi(s)$.
13 On-Policy Monte-Carlo Control Exploration Monte-Carlo Policy Iteration
[Diagram: starting from $Q, \pi$, alternate evaluation $Q = q_\pi$ and improvement $\pi = \epsilon\text{-greedy}(Q)$, converging to $q_*, \pi_*$]
Policy evaluation: Monte-Carlo policy evaluation, $Q = q_\pi$.
Policy improvement: ɛ-greedy policy improvement.
14 On-Policy Monte-Carlo Control Exploration Monte-Carlo Control
[Diagram: starting from $Q$, alternate evaluation $Q \approx q_\pi$ and improvement $\pi = \epsilon\text{-greedy}(Q)$ every episode, converging to $q_*, \pi_*$]
Every episode:
Policy evaluation: Monte-Carlo policy evaluation, $Q \approx q_\pi$.
Policy improvement: ɛ-greedy policy improvement.
15 On-Policy Monte-Carlo Control GLIE GLIE
Definition
Greedy in the Limit with Infinite Exploration (GLIE):
All state-action pairs are explored infinitely many times, $\lim_{k \to \infty} N_k(s, a) = \infty$.
The policy converges on a greedy policy, $\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\!\left(a = \operatorname*{argmax}_{a' \in \mathcal{A}} Q_k(s, a')\right)$.
For example, ɛ-greedy is GLIE if ɛ reduces to zero at $\epsilon_k = \frac{1}{k}$.
16 On-Policy Monte-Carlo Control GLIE GLIE Monte-Carlo Control
Sample the $k$th episode using $\pi$: $\{S_1, A_1, R_2, \ldots, S_T\} \sim \pi$.
For each state $S_t$ and action $A_t$ in the episode,
$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$
Improve the policy based on the new action-value function:
$\epsilon \leftarrow 1/k$
$\pi \leftarrow \epsilon\text{-greedy}(Q)$
Theorem
GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
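A compact sketch of the GLIE Monte-Carlo control loop above, assuming a small tabular environment exposed through hypothetical `env.reset()` / `env.step(action)` methods returning `(next_state, reward, done)`; the interface is an assumption for illustration, not part of the slides:

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10_000, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))          # visit counts

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                        # GLIE schedule: epsilon_k = 1/k

        # Generate one episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Every-visit Monte-Carlo update, working backwards through the episode.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G               # return following (state, action)
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]

    return Q
```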
17 On-Policy Monte-Carlo Control Blackjack Example Back to the Blackjack Example
18 On-Policy Monte-Carlo Control Blackjack Example Monte-Carlo Control in Blackjack
19 On-Policy Temporal-Difference Learning MC vs. TD Control
Temporal-difference (TD) learning has several advantages over Monte-Carlo (MC):
Lower variance
Online
Incomplete sequences
Natural idea: use TD instead of MC in our control loop:
Apply TD to $Q(S, A)$
Use ɛ-greedy policy improvement
Update every time-step
20 On-Policy Temporal-Difference Learning Sarsa(λ) Updating Action-Value Functions with Sarsa
[Diagram: one step of experience $S, A, R, S', A'$]
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$
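A minimal sketch of tabular Sarsa built from this update, assuming the same hypothetical `env.reset()` / `env.step()` interface as the earlier sketch:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=5_000,
          alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def policy(state):
        # epsilon-greedy over the current Q
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = policy(state)
        while not done:
            next_state, reward, done = env.step(action)
            next_action = policy(next_state)
            # Sarsa target uses the action actually taken next (on-policy);
            # the bootstrap term is zeroed at episode end.
            td_target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action

    return Q
```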
21 On-Policy Temporal-Difference Learning Sarsa(λ) On-Policy Control With Sarsa
[Diagram: starting from $Q$, alternate evaluation $Q \approx q_\pi$ (Sarsa) and improvement $\pi = \epsilon\text{-greedy}(Q)$ every time-step, converging to $q_*, \pi_*$]
Every time-step:
Policy evaluation: Sarsa, $Q \approx q_\pi$.
Policy improvement: ɛ-greedy policy improvement.
22 On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa Algorithm for On-Policy Control
23 On-Policy Temporal-Difference Learning Sarsa(λ) Convergence of Sarsa
Theorem
Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions:
GLIE sequence of policies $\pi_t(a \mid s)$
Robbins-Monro sequence of step-sizes $\alpha_t$:
$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$
24 On-Policy Temporal-Difference Learning Sarsa(λ) Windy Gridworld Example Reward = -1 per time-step until reaching goal Undiscounted
25 On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa on the Windy Gridworld
26 On-Policy Temporal-Difference Learning Sarsa(λ) n-step Sarsa
Consider the following n-step returns for $n = 1, 2, \ldots, \infty$:
$n = 1$ (Sarsa): $q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
$n = 2$: $q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
$\vdots$
$n = \infty$ (MC): $q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
Define the n-step Q-return:
$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
n-step Sarsa updates $Q(s, a)$ towards the n-step Q-return:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)$
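A small sketch computing the n-step Q-return from a recorded trajectory, assuming parallel lists of rewards, states, and actions (the variable names and storage layout are illustrative assumptions):

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """n-step Q-return q_t^(n) for time t of a recorded trajectory.

    rewards[k] is R_{k+1}, the reward received after taking actions[k]
    in states[k]; Q is a tabular action-value array.  If the episode
    ends within n steps, the return falls back to the Monte-Carlo return.
    """
    T = len(rewards)
    horizon = min(n, T - t)
    # Discounted sum of the next `horizon` rewards.
    g = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + n < T:
        # Bootstrap from Q at the state-action pair reached after n steps.
        g += gamma ** n * Q[states[t + n], actions[t + n]]
    return g
```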
27 On-Policy Temporal-Difference Learning Sarsa(λ) Forward View Sarsa(λ)
The $q^\lambda$ return combines all n-step Q-returns $q_t^{(n)}$, using weight $(1 - \lambda)\lambda^{n-1}$:
$q_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$
Forward-view Sarsa(λ):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^\lambda - Q(S_t, A_t) \right)$
28 On-Policy Temporal-Difference Learning Sarsa(λ) Backward View Sarsa(λ)
Just like TD(λ), we use eligibility traces in an online algorithm.
But Sarsa(λ) has one eligibility trace for each state-action pair:
$E_0(s, a) = 0$
$E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$
$Q(s, a)$ is updated for every state $s$ and action $a$, in proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s, a)$:
$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$
$Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a)$
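A sketch of one episode of backward-view Sarsa(λ) with tabular traces, again assuming the hypothetical tabular `env` interface used in the earlier sketches and an ɛ-greedy behaviour policy:

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha, gamma, lam, epsilon, rng):
    """Run one episode of backward-view Sarsa(lambda), updating Q in place."""
    E = np.zeros_like(Q)                          # eligibility traces, one per (s, a)

    def policy(state):
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[state]))

    state, done = env.reset(), False
    action = policy(state)
    while not done:
        next_state, reward, done = env.step(action)
        next_action = policy(next_state)

        # TD error uses the on-policy successor action A'.
        target = reward + gamma * Q[next_state, next_action] * (not done)
        delta = target - Q[state, action]

        # Accumulate the trace for the visited pair, then update all pairs.
        E[state, action] += 1.0
        Q += alpha * delta * E
        E *= gamma * lam                          # decay every trace

        state, action = next_state, next_action
    return Q
```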
29 On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa(λ) Algorithm
30 On-Policy Temporal-Difference Learning Sarsa(λ) Sarsa(λ) Gridworld Example
31 Off-Policy Learning Off-Policy Learning
Evaluate target policy $\pi(a \mid s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, while following behaviour policy $\mu(a \mid s)$:
$\{S_1, A_1, R_2, \ldots, S_T\} \sim \mu$
Why is this important?
Learn from observing humans or other agents
Re-use experience generated from old policies $\pi_1, \pi_2, \ldots, \pi_{t-1}$
Learn about the optimal policy while following an exploratory policy
Learn about multiple policies while following one policy
32 Off-Policy Learning Importance Sampling Importance Sampling
Estimate the expectation of a different distribution:
$\mathbb{E}_{X \sim P}[f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\!\left[ \frac{P(X)}{Q(X)} f(X) \right]$
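A tiny numerical check of this identity: estimating an expectation under a distribution P from samples drawn under a different distribution Q, weighting each sample by P(X)/Q(X). The specific distributions and payoff are only an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P and behaviour Q over three outcomes, and some payoff f.
P = np.array([0.7, 0.2, 0.1])
Q = np.array([1 / 3, 1 / 3, 1 / 3])
f = np.array([1.0, 5.0, 10.0])

# Draw samples from Q, then correct each one with the ratio P/Q.
samples = rng.choice(3, size=100_000, p=Q)
is_estimate = np.mean((P[samples] / Q[samples]) * f[samples])

print(is_estimate, "vs exact", float(P @ f))   # both close to 2.7
```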
33 Off-Policy Learning Importance Sampling Importance Sampling for Off-Policy Monte-Carlo
Use returns generated from $\mu$ to evaluate $\pi$.
Weight return $G_t$ according to the similarity between policies.
Multiply importance sampling corrections along the whole episode:
$G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)}\, G_t$
Update value towards the corrected return:
$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)$
Cannot use if $\mu$ is zero when $\pi$ is non-zero.
Importance sampling can dramatically increase variance.
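A sketch of the corrected-return update for one recorded episode, assuming `pi` and `mu` are arrays of action probabilities indexed by `[state, action]` (an assumed representation, for illustration only):

```python
def mc_importance_sampling_update(V, episode, pi, mu, alpha, gamma):
    """Off-policy every-visit MC update of V towards the IS-corrected return.

    `episode` is a list of (state, action, reward) tuples generated by mu;
    pi[s, a] and mu[s, a] give the two policies' action probabilities.
    """
    G, rho = 0.0, 1.0
    # Work backwards so the return and the correction from time t onwards
    # can both be accumulated incrementally.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        rho *= pi[state, action] / mu[state, action]   # product of ratios from t onwards
        V[state] += alpha * (rho * G - V[state])
    return V
```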
34 Off-Policy Learning Importance Sampling Importance Sampling for Off-Policy TD
Use TD targets generated from $\mu$ to evaluate $\pi$.
Weight TD target $R + \gamma V(S')$ by importance sampling.
Only need a single importance sampling correction:
$V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)$
Much lower variance than Monte-Carlo importance sampling.
Policies only need to be similar over a single step.
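The corresponding one-step update, under the same assumed `pi[s, a]` / `mu[s, a]` representation:

```python
def td_importance_sampling_update(V, s, a, r, s_next, pi, mu, alpha, gamma):
    """Single off-policy TD(0) update with a one-step importance correction."""
    rho = pi[s, a] / mu[s, a]                    # single-step correction
    td_target = rho * (r + gamma * V[s_next])    # weighted TD target
    V[s] += alpha * (td_target - V[s])
    return V
```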
35 Off-Policy Learning Q-Learning Q-Learning
We now consider off-policy learning of action-values $Q(s, a)$.
No importance sampling is required.
The next action is chosen using the behaviour policy, $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$.
But we consider an alternative successor action, $A' \sim \pi(\cdot \mid S_{t+1})$.
And update $Q(S_t, A_t)$ towards the value of the alternative action:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$
36 Off-Policy Learning Q-Learning Off-Policy Control with Q-Learning
We now allow both behaviour and target policies to improve.
The target policy $\pi$ is greedy w.r.t. $Q(s, a)$:
$\pi(S_{t+1}) = \operatorname*{argmax}_{a'} Q(S_{t+1}, a')$
The behaviour policy $\mu$ is e.g. ɛ-greedy w.r.t. $Q(s, a)$.
The Q-learning target then simplifies:
$R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \gamma Q\!\left(S_{t+1}, \operatorname*{argmax}_{a'} Q(S_{t+1}, a')\right) = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$
37 Off-Policy Learning Q-Learning Q-Learning Control Algorithm
[Diagram: one step of experience $S, A, R, S'$]
$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a' \in \mathcal{A}} Q(S', a') - Q(S, A) \right)$
Theorem
Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
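A minimal tabular Q-learning sketch using this update, with the same assumed environment interface and an ɛ-greedy behaviour policy:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=5_000,
               alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)

            # Target policy: greedy, so bootstrap from max_a' Q(S', a').
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state

    return Q
```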
38 Off-Policy Learning Q-Learning Q-Learning Algorithm for Off-Policy Control
39 Off-Policy Learning Q-Learning Q-Learning Demo
40 Off-Policy Learning Q-Learning Cliff Walking Example
41 Summary Relationship Between DP and TD
[Backup diagrams omitted]
Full Backup (DP) | Sample Backup (TD)
Bellman Expectation Equation for $v_\pi(s)$: Iterative Policy Evaluation | TD Learning
Bellman Expectation Equation for $q_\pi(s, a)$: Q-Policy Iteration | Sarsa
Bellman Optimality Equation for $q_*(s, a)$: Q-Value Iteration | Q-Learning
42 Summary Relationship Between DP and TD (2)
Full Backup (DP) | Sample Backup (TD)
Iterative Policy Evaluation: $V(s) \leftarrow \mathbb{E}[R + \gamma V(S') \mid s]$ | TD Learning: $V(S) \xleftarrow{\alpha} R + \gamma V(S')$
Q-Policy Iteration: $Q(s, a) \leftarrow \mathbb{E}[R + \gamma Q(S', A') \mid s, a]$ | Sarsa: $Q(S, A) \xleftarrow{\alpha} R + \gamma Q(S', A')$
Q-Value Iteration: $Q(s, a) \leftarrow \mathbb{E}[R + \gamma \max_{a' \in \mathcal{A}} Q(S', a') \mid s, a]$ | Q-Learning: $Q(S, A) \xleftarrow{\alpha} R + \gamma \max_{a' \in \mathcal{A}} Q(S', a')$
where $x \xleftarrow{\alpha} y$ means $x \leftarrow x + \alpha(y - x)$
43 Summary Questions?