Reinforcement Learning
Reinforcement Learning, LU 2 - Markov Decision Problems and Dynamic Programming
Dr. Martin Lauer, AG Maschinelles Lernen und Natürlichsprachliche Systeme, Albert-Ludwigs-Universität Freiburg, martin.lauer@kit.edu
Prof. Dr. Martin Riedmiller, Dr. Martin Lauer, Machine Learning Lab, University of Freiburg
LU 2: Markov Decision Problems and DP
Goals:
- definition of Markov Decision Problems (MDPs)
- introduction to Dynamic Programming (DP)
Outline:
- short review
- definition of MDPs
- DP: principle of optimality
- the DP algorithm (backward DP)
Review
- process that can be influenced by actions
- agent: sensory input, output of actions
- feedback in RL: training information through evaluation only
- delayed reinforcement learning: decision, decision, decision, ..., evaluation
- multi-stage decision process
- optimization
The Agent Concept

Multi-stage decision problems

Three components:
- system, process
- rewards, costs
- policy, strategy
Requirements for the model
Goal: describing the system's behaviour (also called: process, world, environment).
Requirements for a model:
- situations
- activities
- current situation can be influenced
- adjustments possible at discrete points in time
- noise, interference, randomness
Goal specification: definition of costs / rewards
System description
- discrete decision points (stages) t ∈ T = {0, 1, ..., N} or T = {0, 1, ...}
- system state (situation) s_t ∈ S, here: S finite
- actions u_t ∈ U, here: U finite
- transition function s_{t+1} = f(s_t, u_t), the reaction of the system
Goal formulation: introducing costs
At every decision (= in every stage) direct costs arise.
Direct costs: c : S → R
Refinement, dependent on state and action: c : S × U → R
Reward, cost, punishment?
Summary: deterministic systems
- discrete decision points (stages) t ∈ T = {0, 1, ..., N} or T = {0, 1, ...}
- system state (situation) s_t ∈ S
- actions u_t ∈ U
- transition function s_{t+1} = f(s_t, u_t)
- direct costs c : S × U → R
5-tuple (T, S, U, f, c)
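Such a deterministic system is straightforward to represent in code. The following sketch (Python; the two-state dynamics and cost values are made up for illustration) models the 5-tuple and rolls out a fixed action sequence:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DeterministicSystem:
    """The 5-tuple (T, S, U, f, c) of a deterministic decision problem."""
    horizon: int                     # N: number of decision stages, T = {0, ..., N}
    states: List[int]                # S (finite)
    actions: List[int]               # U (finite)
    f: Callable[[int, int], int]     # transition function s_{t+1} = f(s_t, u_t)
    c: Callable[[int, int], float]   # direct costs c(s, u)

# Hypothetical example: two states, actions "stay" (0) and "switch" (1).
sys_ = DeterministicSystem(
    horizon=3,
    states=[0, 1],
    actions=[0, 1],
    f=lambda s, u: s if u == 0 else 1 - s,   # switching toggles the state
    c=lambda s, u: 1.0 * u,                  # switching costs 1, staying is free
)

# Roll out a fixed action sequence and accumulate the total costs.
s, total = 0, 0.0
for u in [1, 0, 1]:
    total += sys_.c(s, u)
    s = sys_.f(s, u)
print(s, total)   # final state and accumulated costs
```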
Example: shortest path problems
Find the shortest path from the start node to the finish node. Every edge has a specific cost that can be interpreted as its length.
- optimization goal over multiple stages
- evaluation of the whole sequence (reminder: decision, decision, ..., evaluation)
- look at the accumulated total costs: Σ_{t ∈ T} c(s_t, u_t)
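On an acyclic graph, these accumulated costs can be minimized by working backwards from the finish node, a preview of the backward-DP idea developed below. A minimal sketch (the graph and its edge costs are made up):

```python
# Shortest path on a small directed acyclic graph, solved backwards
# from the finish node. Graph and edge costs are invented for illustration.
edges = {                    # node -> {successor: edge cost}
    "start": {"a": 2, "b": 5},
    "a": {"b": 1, "finish": 7},
    "b": {"finish": 3},
    "finish": {},
}

# Process nodes in reverse topological order so successors are done first.
order = ["finish", "b", "a", "start"]
dist = {"finish": 0.0}       # cost-to-go from each node to the finish
for node in order[1:]:
    dist[node] = min(cost + dist[succ] for succ, cost in edges[node].items())

print(dist["start"])         # length of the shortest start-to-finish path
```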
Stochastic systems
Again, the requirements for a model:
- situations
- activities
- current situation can be influenced
- adjustments possible at discrete points in time
- noise, interference, randomness
Goal specification: definition of costs / rewards
Markov Decision Processes
Deterministic system: 5-tuple (T, S, U, f, c)
Stochastic system: the deterministic transition function f is replaced by a conditional probability distribution. In the following, we are looking at a finite state set S = {1, 2, ..., n}. Let i, j ∈ S be states.
Notation: P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u)
Markov Decision Process (MDP): 5-tuple (T, S, U, p_ij(u), c(s, u))
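In code, the transition probabilities are naturally stored as one n × n table per action. A small sketch (the numbers are made up for a 3-state, 2-action example) that also checks the defining property, namely that every row p_i·(u) is a probability distribution over the successor states:

```python
# Transition probabilities p_ij(u), one row per state i, for each action u.
# The numbers are invented for a 3-state, 2-action example.
P = {
    0: [[0.9, 0.1, 0.0],
        [0.0, 0.8, 0.2],
        [0.0, 0.0, 1.0]],
    1: [[0.2, 0.5, 0.3],
        [0.1, 0.1, 0.8],
        [0.0, 0.0, 1.0]],
}

# Sanity check: for every action u and state i, p_i.(u) must be a
# probability distribution over the successor states j.
for u, rows in P.items():
    for i, row in enumerate(rows):
        assert abs(sum(row) - 1.0) < 1e-12, (u, i)
print("all transition rows sum to 1")
```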
Markov property
It holds that: P(s_{t+1} = j | s_t, u_t) = P(s_{t+1} = j | s_t, s_{t-1}, ..., u_t, u_{t-1}, ...)
The probability distribution of the following state s_{t+1} is uniquely defined given the knowledge of the current state s_t and the action u_t. In particular, it does not depend on the previous history of the system.
Remarks (1)
A deterministic system is a special case of an MDP:
P(s_{t+1} | s_t, u_t) = 1 if s_{t+1} = f(s_t, u_t), and 0 otherwise.
Remarks (2)
Equivalent description with a deterministic transition function f.
Approach: an additional argument, the random variable w_t (noise):
s_{t+1} = f(s_t, u_t, w_t), with w_t a random variable with given probability distribution P(w_t | s_t, u_t).
Transformation into the previous form: let W(i, u, j) = {w | j = f(i, u, w)} be the set of all values of w for which the system transitions from state i, on input of u, into state j. Then it holds that: p_ij(u) = P(w ∈ W(i, u, j))
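This transformation is easy to carry out numerically: sum the probabilities of all noise values in W(i, u, j). A sketch, where the dynamics f and the noise distribution are invented for illustration:

```python
# Recovering p_ij(u) from the noise formulation s_{t+1} = f(s_t, u_t, w_t):
# sum the probabilities of all noise values w in W(i, u, j) = {w | j = f(i, u, w)}.

def f(i, u, w):
    """Hypothetical dynamics: the state moves by the action, perturbed by
    noise, clipped to the state set {0, 1, 2, 3}."""
    return max(0, min(3, i + u + w))

w_dist = {-1: 0.25, 0: 0.5, 1: 0.25}   # P(w_t), here independent of (s_t, u_t)

def p(i, u, j):
    """p_ij(u) = P(w in W(i, u, j))."""
    return sum(prob for w, prob in w_dist.items() if f(i, u, w) == j)

# Each row of the recovered transition probabilities sums to one:
i, u = 1, 1
row = [p(i, u, j) for j in range(4)]
print(row)
```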
Summary: MDPs
- discrete decision points (stages) t ∈ T = {0, 1, ..., N} or T = {0, 1, ...}
- system state (situation) s_t ∈ S
- actions u_t ∈ U
- transition probabilities p_ij(u): P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u)
  alternatively: transition function s_{t+1} = f(s_t, u_t, w_t) with w_t a random variable
- direct costs c : S × U → R
5-tuple (T, S, U, p_ij(u), c(s, u))
Summary: MDPs
- model: state, action, following state
- deterministic and stochastic transition function
- information about the history is summarized in the state
- very general description: OR, control engineering, games, ...
Generalizations (not covered here):
- non-stationary transition function p_ij,t(u)
- non-stationary costs c_t(i, u)
Example: stock keeping
Assume you are the owner of a toy shop at an exhibition. The exhibition lasts N days.
- state s_t: number of toys in your shop
- action u_t: ordered number of toys, to be delivered on the next day
- disturbance w_t: number of toys sold
- system equation: s_{t+1} = s_t + u_t - w_t
- costs: c(s, u) = c_1(s) + c_2(u) - gain, i.e. costs for toys in stock, plus acquisition costs for each toy that was ordered, minus the gain for sold toys
- there are also terminal costs g(s) if there are still toys in stock after the N days
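One run of this system can be simulated directly. In the sketch below, only the system equation s_{t+1} = s_t + u_t - w_t and the cost structure are from the slides; the concrete cost coefficients, the demand distribution, and the naive ordering policy are invented:

```python
import random

# One simulated run of the stock-keeping example.
random.seed(0)

N = 5                                  # the exhibition lasts N days
c1 = lambda s: 0.1 * s                 # holding costs for toys in stock (invented)
c2 = lambda u: 1.0 * u                 # acquisition costs per ordered toy (invented)
price = 2.0                            # gain per sold toy (invented)
g = lambda s: 0.5 * s                  # terminal costs for leftover toys (invented)

s, total = 10, 0.0
for t in range(N):
    u = 3                              # naive policy: always order 3 toys
    w = min(s, random.randint(0, 6))   # demand, capped by what is in stock
    total += c1(s) + c2(u) - price * w
    s = s + u - w                      # system equation
total += g(s)
print(round(total, 2))                 # accumulated costs of this run
```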
Policy and selection function
Policy: the selection function π_t : S → U, π_t(s) = u, chooses at time t an action u ∈ U as a function of the current state s ∈ S.
The selection function chooses an action in dependence of the situation (see the agent graphic).
Refinement: π_t(s) = u with u ∈ U(s), a situation-dependent action set (example: chess).
A policy π̂ consists of N selection functions (N being the number of decision points): π̂ = (π_0, π_1, ..., π_t, ...)
Non-stationary policies
The selection function π_t can depend on the time of the decision. Meaning: the same situation at different points in time can lead to different decisions of the agent.
π̂ = (π_0, π_1, ..., π_t, ...)
If the selection functions differ between time points, we call the policy non-stationary.
Example soccer, situation s: the midfield player has the ball.
- reasonable action in the first minute: π_1(s) = return pass
- reasonable action in the last minute: π_90(s) = shoot at goal
General rationale: a limited optimization time frame (finite horizon, see below) usually requires a non-stationary policy!
Stationary policies
We will mostly look at stationary policies. Then it holds that π_0 = π_1 = ... = π_t = ... =: π and π̂ = (π, π, ..., π, ...).
With stationary policies, the terms policy and selection function become interchangeable. As generally done in the literature, we will call the selection function π our policy.
Bertsekas uses the symbol µ for the selection function; hence minor differences from the notation used there arise.
Remark: in the following, only deterministic selection functions will be used.
Goal of the policy
- reach the optimization goal over multiple stages (sequence of decisions)
- solving a dynamic optimization problem
Cumulated costs (costs-to-go)
Interesting: the cumulated costs for a given state s under a given policy π:
J^π(s) = Σ_{t ∈ T} c(s_t, π(s_t)), with s_0 = s
Wanted: an optimal policy π* so that for all s it holds that:
J^{π*}(s) = min_π Σ_{t ∈ T} c(s_t, π(s_t))
under the constraints s_{t+1} = f(s_t, u_t) and s_0 = s.
Cumulated costs in MDPs
Expected cumulated costs for a given state s using a given policy π:
J^π(s) = E_w [ Σ_{t ∈ T} c(s_t, π(s_t)) ], with s_0 = s
Wanted: an optimal policy π* so that for all s it holds that:
J^{π*}(s) = min_{π ∈ Π} E_w [ Σ_{t ∈ T} c(s_t, π(s_t)) ], with s_0 = s
under the constraint s_{t+1} = f(s_t, u_t, w_t), or with the given probability distribution P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u).
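For a fixed policy, the expectation can be approximated by averaging sampled trajectories. A Monte Carlo sketch on a made-up 2-state MDP (the transition table, costs, and policy below are invented for illustration):

```python
import random

# Monte Carlo estimate of the expected cumulated costs J^pi(s) of a fixed
# policy over a finite horizon, on an invented 2-state MDP.
random.seed(1)

P = {  # P[(i, u)] = list of p_ij(u) over the successor states j in {0, 1}
    (0, 0): [1.0, 0.0], (0, 1): [0.5, 0.5],
    (1, 0): [0.0, 1.0], (1, 1): [0.5, 0.5],
}
c = lambda s, u: 1.0 if s == 0 else 0.0   # being in state 0 costs 1 per stage
pi = lambda s: 1 if s == 0 else 0         # try to leave state 0, then stay in 1

def rollout(s, horizon):
    """Sample one trajectory under pi and return its accumulated costs."""
    total = 0.0
    for _ in range(horizon):
        u = pi(s)
        total += c(s, u)
        s = random.choices([0, 1], weights=P[(s, u)])[0]
    return total

estimate = sum(rollout(0, horizon=10) for _ in range(5000)) / 5000
print(estimate)   # approximates J^pi(0); here roughly 2 (geometric escape from state 0)
```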
Problem types
Definition horizon: the horizon N of a problem denotes the number of decision stages to be traversed.
- Finite horizon: problems with a given termination time
- Infinite horizon: approximation for very long processes or for processes with an unknown end (e.g. a control system)
Finite horizon
N-stage decision problem. Each state has terminal costs g(i) that are due if the system ends in state i after N stages.
Costs of a policy π:
J^π_N(s) = E[ g(s_N) + Σ_{t=0}^{N-1} c(s_t, π_t(s_t)) | s_0 = s ]
Generally: a non-stationary policy.
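For a fixed policy, these finite-horizon costs can also be evaluated exactly by one backward sweep over the stages, stage by stage from the terminal costs. A sketch on an invented 2-state MDP (dynamics, costs, and policy are all made up):

```python
# Exact evaluation of J^pi_N by a backward sweep:
#   J_t(i) = c(i, pi_t(i)) + sum_j p_ij(pi_t(i)) * J_{t+1}(j),  J_N(i) = g(i).
N = 3
P = {  # P[u][i] = row of transition probabilities p_ij(u)
    0: [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    1: [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
}
c = lambda i, u: [2.0, 0.0][i] + 0.5 * u   # state 0 is expensive; switching costs 0.5
g = lambda i: [1.0, 0.0][i]                # terminal costs
pi = [lambda i: 1 if i == 0 else 0] * N    # same selection function at every stage

J = [g(i) for i in range(2)]               # J_N = terminal costs
for t in reversed(range(N)):
    u = [pi[t](i) for i in range(2)]
    J = [c(i, u[i]) + sum(P[u[i]][i][j] * J[j] for j in range(2))
         for i in range(2)]
print(J)   # [J^pi_N(0), J^pi_N(1)]
```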
Infinite horizon
Costs of a policy π:
J^π(s) = lim_{N→∞} E[ Σ_{t=0}^{N} c(s_t, π_t(s_t)) | s_0 = s ]
Problem: are the costs finite?
Solution: discount factor α < 1:
J^π(s) = lim_{N→∞} E[ Σ_{t=0}^{N} α^t c(s_t, π_t(s_t)) | s_0 = s ]
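Why discounting keeps the costs finite: if the direct costs are bounded, |c(s, u)| ≤ c_max, then the discounted sum is bounded by the geometric series Σ_t α^t c_max = c_max / (1 - α). A quick numerical check (α and c_max chosen arbitrarily):

```python
# Discounting bounds the infinite-horizon costs by a geometric series:
# |J^pi(s)| <= sum_t alpha^t * c_max = c_max / (1 - alpha).
alpha, c_max = 0.9, 1.0

partial = sum(alpha**t * c_max for t in range(1000))   # truncated series
bound = c_max / (1 - alpha)                            # geometric series limit
print(partial, bound)                                  # both close to 10
```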
Solution of dynamic optimization problems
Central question: how do we find the policy that leads (on average) to minimal costs?
Remark: we can formulate this analogously as a maximization problem (e.g. maximizing the gain).
Solution method: Dynamic Programming (Bellman, 1957)
- Backward Dynamic Programming
- Value Iteration (LU 3 ff.)
- Policy Iteration (LU 3 ff.)
Backward Dynamic Programming - idea
Problem: stochastic multi-stage decision problems with finite horizon.
Idea: calculate the costs starting from the last stage back to the first stage.
Example: find the shortest path in a graph.
Backward Dynamic Programming - problem specification (1)
- finite-horizon MDP with N discrete decision points t ∈ T = {0, 1, ..., N}
- finite state set: s_t ∈ S = {1, 2, ..., n}
- finite action set: u_t ∈ U = {u_1, ..., u_m}
- transition probabilities p_ij(u): P(s_{t+1} = j | s_t = i, u_t = u) = p_ij(u)
- direct costs c : S × U → R
- in the last stage N, every state causes terminal costs g(s_N) := c_N(s_N)
Backward Dynamic Programming - objective
Wanted: π* with J^{π*} = min_π J^π, where J^π_N(i) = E[ g(s_N) + Σ_{t=0}^{N-1} c(s_t, π_t(s_t)) | s_0 = i ].
The costs belonging to π* are called the optimal cumulated costs J* := J^{π*}.
Approach:
1. Calculation of the optimal cumulated costs ("costs-to-go") J*_k(·) for all states (J*_k(·) is an n-dimensional vector); k is the number of remaining steps.
2. From J*_k follows the optimal policy for the k-step problem (k steps until the process terminates).
Backward Dynamic Programming - motivation
Thesis (Bellman's principle of optimality): if I have k more steps to go, the optimal costs for a state i are given by the minimal expected value of the direct transition costs plus the optimal cumulated costs of the next state, given that k-1 more steps are to be done from there. The minimization here goes over all possible actions.
Bellman's Principle of Optimality
Formally, for the optimal cumulated costs J*_k(i) of the k-stage decision problem it holds that:
J*_k(i) = min_{u ∈ U(i)} E_{w_k} { c(i, u) + J*_{k-1}(f(i, u, w_k)) }
        = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) (c(i, u) + J*_{k-1}(j)),   i = 1, ..., n    (1)
Hence we can calculate the optimal cumulated costs of the N-stage optimization problem recursively, starting with k = 0: the backward-DP algorithm.
Bellman's Principle of Optimality - proof (1)
Policy π̂^(k) for k stages: π̂^(k) = (π_k, π_{k-1}, π_{k-2}, ...) = (π_k, π̂^(k-1)).
Let S^(k)(i) = (s_{N-k} = i, s_{N-k+1}, ..., s_N) be a possible state sequence starting in state i with k transitions.
J*_k(i) = min_{π̂^(k)} J^{π̂^(k)}_k(i)
  = min_{π̂^(k)} { Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }
  = min_{π̂^(k)} { c(i, π_k(i)) + Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }
Bellman's Principle of Optimality - proof (2)
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{j ∈ S} P(s_{N-k+1} = j | s_{N-k} = i, π_k) Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }
Bellman's Principle of Optimality - proof (3)
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{j ∈ S} P(s_{N-k+1} = j | s_{N-k} = i, π_k) Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }
= min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{N-k+1} = j | s_{N-k} = i, u) min_{π̂^(k-1)} { Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) } }
Bellman's Principle of Optimality - proof (4)
= min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{N-k+1} = j | s_{N-k} = i, u) min_{π̂^(k-1)} { Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) } }
= min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{N-k+1} = j | s_{N-k} = i, u) min_{π̂^(k-1)} J^{π̂^(k-1)}_{k-1}(j) }
= min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} P(s_{N-k+1} = j | s_{N-k} = i, u) J*_{k-1}(j) }
= min_{u ∈ U(i)} { c(i, u) + Σ_{j ∈ S} p_ij(u) J*_{k-1}(j) }
Backward Dynamic Programming - algorithm
k = 0: J*_0(i) = g(i)
For k = 1 to N, for all i ∈ S:
J*_k(i) = min_{u ∈ U(i)} E_{w_k} { c(i, u) + J*_{k-1}(f(i, u, w_k)) }
or
J*_k(i) = min_{u ∈ U(i)} Σ_{j=1}^{n} p_ij(u) (c(i, u) + J*_{k-1}(j))
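The recursion above translates directly into a few lines of code. A sketch for the probability form of the update, on an invented 3-state, 2-action MDP (all numbers made up for illustration):

```python
# Backward DP: compute the optimal costs-to-go J*_k(i) for k = 0, ..., N via
#   J*_k(i) = min_u sum_j p_ij(u) * (c(i, u) + J*_{k-1}(j)),   J*_0(i) = g(i).
n, N = 3, 4
P = {  # P[u][i][j] = p_ij(u)
    0: [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]],
    1: [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
}
c = lambda i, u: 1.0 + 0.5 * u     # every step costs 1; action 1 costs 0.5 extra
g = lambda i: [4.0, 2.0, 0.0][i]   # terminal costs: ending in state 2 is cheapest

J = [[g(i) for i in range(n)]]     # J[k][i] = J*_k(i); J[0] = terminal costs
for k in range(1, N + 1):
    J.append([min(sum(P[u][i][j] * (c(i, u) + J[k - 1][j]) for j in range(n))
                  for u in (0, 1))
              for i in range(n)])
print(J[N])   # optimal costs-to-go with N remaining steps
```

Note the nesting: one sweep over the N stages, and inside it a minimization over all m actions of a sum over all n successor states for each of the n states, which is exactly the O(N n² m) complexity mentioned below.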
Choosing an action
Requirement: J*_k(i) is known for all k ≤ N.
Approach: we simply calculate, for all possible actions, the expected costs and choose the best action (the one with minimal expected cumulated costs):
π*_k(i) ∈ argmin_{u ∈ U(i)} E_{w_k} { c(i, u) + J*_{k-1}(f(i, u, w_k)) }
The chosen optimal action minimizes the expected transition costs plus the expected cumulated costs of the remaining problem.
Remark: J*_k defines an optimal policy. The policy is not unique, but J*_k is.
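In the probability form, the argmin is a one-line comparison of the expected costs per action. A self-contained sketch (the 2-state model and the J*_{k-1} values below are made up for illustration):

```python
# Extracting the greedy (optimal) action from known costs-to-go:
#   pi*_k(i) in argmin_u sum_j p_ij(u) * (c(i, u) + J*_{k-1}(j)).
P = {  # P[u][i][j] = p_ij(u), invented 2-state, 2-action model
    0: [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    1: [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
}
c = lambda i, u: 2.0 if i == 0 else 0.0    # being in state 0 is expensive
J_prev = [5.0, 1.0]                        # assumed J*_{k-1}, one value per state

def greedy_action(i):
    """Return the action minimizing expected direct plus remaining costs."""
    q = {u: sum(P[u][i][j] * (c(i, u) + J_prev[j]) for j in range(2))
         for u in (0, 1)}
    return min(q, key=q.get)

print(greedy_action(0), greedy_action(1))  # leave the expensive state, stay in the cheap one
```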
Remarks
- complexity for deterministic systems: O(N · n · m)
- complexity for stochastic systems: O(N · n² · m)
- an exact analytic solution is rarely computable, so one computes numerically; but this can be very complex!
(N = number of stages, n = number of states, m = number of actions)
More informationTwo-Stage Stochastic Linear Programs
Two-Stage Stochastic Linear Programs Operations Research Anthony Papavasiliou 1 / 27 Two-Stage Stochastic Linear Programs 1 Short Reviews Probability Spaces and Random Variables Convex Analysis 2 Deterministic
More informationNumerical methods for American options
Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment
More informationCHAPTER 1. Basic Concepts on Planning and Scheduling
CHAPTER 1 Basic Concepts on Planning and Scheduling Scheduling, FEUP/PRODEI /MIEIC 1 Planning and Scheduling: Processes of Decision Making regarding the selection and ordering of activities as well as
More informationHow I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
More informationLecture 5: Model-Free Control
Lecture 5: Model-Free Control David Silver Outline 1 Introduction 2 On-Policy Monte-Carlo Control 3 On-Policy Temporal-Difference Learning 4 Off-Policy Learning 5 Summary Introduction Model-Free Reinforcement
More informationA Sarsa based Autonomous Stock Trading Agent
A Sarsa based Autonomous Stock Trading Agent Achal Augustine The University of Texas at Austin Department of Computer Science Austin, TX 78712 USA achal@cs.utexas.edu Abstract This paper describes an autonomous
More informationINTEGRATED OPTIMIZATION OF SAFETY STOCK
INTEGRATED OPTIMIZATION OF SAFETY STOCK AND TRANSPORTATION CAPACITY Horst Tempelmeier Department of Production Management University of Cologne Albertus-Magnus-Platz D-50932 Koeln, Germany http://www.spw.uni-koeln.de/
More informationMarkov Decision Processes for Ad Network Optimization
Markov Decision Processes for Ad Network Optimization Flávio Sales Truzzi 1, Valdinei Freire da Silva 2, Anna Helena Reali Costa 1, Fabio Gagliardi Cozman 3 1 Laboratório de Técnicas Inteligentes (LTI)
More informationReliability Guarantees in Automata Based Scheduling for Embedded Control Software
1 Reliability Guarantees in Automata Based Scheduling for Embedded Control Software Santhosh Prabhu, Aritra Hazra, Pallab Dasgupta Department of CSE, IIT Kharagpur West Bengal, India - 721302. Email: {santhosh.prabhu,
More informationBinomial lattice model for stock prices
Copyright c 2007 by Karl Sigman Binomial lattice model for stock prices Here we model the price of a stock in discrete time by a Markov chain of the recursive form S n+ S n Y n+, n 0, where the {Y i }
More informationComputing Near Optimal Strategies for Stochastic Investment Planning Problems
Computing Near Optimal Strategies for Stochastic Investment Planning Problems Milos Hauskrecfat 1, Gopal Pandurangan 1,2 and Eli Upfal 1,2 Computer Science Department, Box 1910 Brown University Providence,
More informationLOGISTIQUE ET PRODUCTION SUPPLY CHAIN & OPERATIONS MANAGEMENT
LOGISTIQUE ET PRODUCTION SUPPLY CHAIN & OPERATIONS MANAGEMENT CURSUS CONTENTS 1) Introduction 2) Human resources functions 3) A new factory 4) Products 5) Services management 6) Methods 7) Planification
More informationDetermining the Direct Mailing Frequency with Dynamic Stochastic Programming
Determining the Direct Mailing Frequency with Dynamic Stochastic Programming Nanda Piersma 1 Jedid-Jah Jonker 2 Econometric Institute Report EI2000-34/A Abstract Both in business to business and in consumer
More informationOptimal proportional reinsurance and dividend pay-out for insurance companies with switching reserves
Optimal proportional reinsurance and dividend pay-out for insurance companies with switching reserves Abstract: This paper presents a model for an insurance company that controls its risk and dividend
More informationFairness in Routing and Load Balancing
Fairness in Routing and Load Balancing Jon Kleinberg Yuval Rabani Éva Tardos Abstract We consider the issue of network routing subject to explicit fairness conditions. The optimization of fairness criteria
More informationOffline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com
More information2x + y = 3. Since the second equation is precisely the same as the first equation, it is enough to find x and y satisfying the system
1. Systems of linear equations We are interested in the solutions to systems of linear equations. A linear equation is of the form 3x 5y + 2z + w = 3. The key thing is that we don t multiply the variables
More information5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1
5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1 General Integer Linear Program: (ILP) min c T x Ax b x 0 integer Assumption: A, b integer The integrality condition
More informationVersion Spaces. riedmiller@informatik.uni-freiburg.de
. Machine Learning Version Spaces Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de
More informationFeature Selection with Monte-Carlo Tree Search
Feature Selection with Monte-Carlo Tree Search Robert Pinsler 20.01.2015 20.01.2015 Fachbereich Informatik DKE: Seminar zu maschinellem Lernen Robert Pinsler 1 Agenda 1 Feature Selection 2 Feature Selection
More informationApplication of Markov chain analysis to trend prediction of stock indices Milan Svoboda 1, Ladislav Lukáš 2
Proceedings of 3th International Conference Mathematical Methods in Economics 1 Introduction Application of Markov chain analysis to trend prediction of stock indices Milan Svoboda 1, Ladislav Lukáš 2
More informationROLLING HORIZON PROCEDURES FOR THE SOLUTION OF AN OPTIMAL REPLACEMENT
REVISTA INVESTIGACIÓN OPERACIONAL VOL 34, NO 2, 105-116, 2013 ROLLING HORIZON PROCEDURES FOR THE SOLUTION OF AN OPTIMAL REPLACEMENT PROBLEM OF n-machines WITH RANDOM HORIZON Rocio Ilhuicatzi Roldán and
More informationMaterial Requirements Planning MRP
Material Requirements Planning MRP ENM308 Planning and Control I Spring 2013 Haluk Yapıcıoğlu, PhD Hierarchy of Decisions Forecast of Demand Aggregate Planning Master Schedule Inventory Control Operations
More information3.2 Roulette and Markov Chains
238 CHAPTER 3. DISCRETE DYNAMICAL SYSTEMS WITH MANY VARIABLES 3.2 Roulette and Markov Chains In this section we will be discussing an application of systems of recursion equations called Markov Chains.
More informationHow Asymmetry Helps Load Balancing
How Asymmetry Helps oad Balancing Berthold Vöcking nternational Computer Science nstitute Berkeley, CA 947041198 voecking@icsiberkeleyedu Abstract This paper deals with balls and bins processes related
More informationA Programme Implementation of Several Inventory Control Algorithms
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume, No Sofia 20 A Programme Implementation of Several Inventory Control Algorithms Vladimir Monov, Tasho Tashev Institute of Information
More information17.3.1 Follow the Perturbed Leader
CS787: Advanced Algorithms Topic: Online Learning Presenters: David He, Chris Hopman 17.3.1 Follow the Perturbed Leader 17.3.1.1 Prediction Problem Recall the prediction problem that we discussed in class.
More informationST 371 (IV): Discrete Random Variables
ST 371 (IV): Discrete Random Variables 1 Random Variables A random variable (rv) is a function that is defined on the sample space of the experiment and that assigns a numerical variable to each possible
More informationRisk Management for IT Security: When Theory Meets Practice
Risk Management for IT Security: When Theory Meets Practice Anil Kumar Chorppath Technical University of Munich Munich, Germany Email: anil.chorppath@tum.de Tansu Alpcan The University of Melbourne Melbourne,
More information8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
More informationThe Ergodic Theorem and randomness
The Ergodic Theorem and randomness Peter Gács Department of Computer Science Boston University March 19, 2008 Peter Gács (Boston University) Ergodic theorem March 19, 2008 1 / 27 Introduction Introduction
More informationProject Scheduling: PERT/CPM
Project Scheduling: PERT/CPM CHAPTER 8 LEARNING OBJECTIVES After completing this chapter, you should be able to: 1. Describe the role and application of PERT/CPM for project scheduling. 2. Define a project
More information3. Regression & Exponential Smoothing
3. Regression & Exponential Smoothing 3.1 Forecasting a Single Time Series Two main approaches are traditionally used to model a single time series z 1, z 2,..., z n 1. Models the observation z t as a
More informationEstimating an ARMA Process
Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators
More information6.254 : Game Theory with Engineering Applications Lecture 2: Strategic Form Games
6.254 : Game Theory with Engineering Applications Lecture 2: Strategic Form Games Asu Ozdaglar MIT February 4, 2009 1 Introduction Outline Decisions, utility maximization Strategic form games Best responses
More informationA linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form
Section 1.3 Matrix Products A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form (scalar #1)(quantity #1) + (scalar #2)(quantity #2) +...
More informationLargest Fixed-Aspect, Axis-Aligned Rectangle
Largest Fixed-Aspect, Axis-Aligned Rectangle David Eberly Geometric Tools, LLC http://www.geometrictools.com/ Copyright c 1998-2016. All Rights Reserved. Created: February 21, 2004 Last Modified: February
More informationSystems of Linear Equations
Systems of Linear Equations Beifang Chen Systems of linear equations Linear systems A linear equation in variables x, x,, x n is an equation of the form a x + a x + + a n x n = b, where a, a,, a n and
More informationMachine Learning: Multi Layer Perceptrons
Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons
More informationA simpler and better derandomization of an approximation algorithm for Single Source Rent-or-Buy
A simpler and better derandomization of an approximation algorithm for Single Source Rent-or-Buy David P. Williamson Anke van Zuylen School of Operations Research and Industrial Engineering, Cornell University,
More informationGenOpt (R) Generic Optimization Program User Manual Version 3.0.0β1
(R) User Manual Environmental Energy Technologies Division Berkeley, CA 94720 http://simulationresearch.lbl.gov Michael Wetter MWetter@lbl.gov February 20, 2009 Notice: This work was supported by the U.S.
More information