Reinforcement Learning

Reinforcement Learning
LU 2 - Markov Decision Problems and Dynamic Programming
Dr. Joschka Bödecker, AG Maschinelles Lernen und Natürlichsprachliche Systeme, Albert-Ludwigs-Universität Freiburg, jboedeck@informatik.uni-freiburg.de
Acknowledgement: slides courtesy of Martin Riedmiller and Martin Lauer.

LU 2: Markov Decision Problems and DP
Goals: definition of Markov Decision Problems (MDPs); introduction to Dynamic Programming (DP).
Outline: short review; definition of MDPs; DP: principle of optimality; the DP algorithm (backward DP).

Review
Process that can be influenced by actions. Agent: sensory input, output of actions. Feedback RL: training information through evaluation only. Delayed reinforcement learning: decision, decision, decision, ... evaluation. Multi-stage decision process; optimization.

The Agent Concept

Multi-stage decision problems

Three components: system / process; rewards / costs; policy / strategy.

Requirements for the model
Goal: describing the system's behaviour (also called a system: process, world, environment). Requirements for a model: situations; activities; the current situation can be influenced; adjustments are possible at discrete points in time; noise, interference, randomness. Goal specification: definition of costs / rewards.

System description
Discrete decision points (stages) t ∈ T = {0, 1, ..., N} or T = {0, 1, ...}.
System state (situation) s_t ∈ S; here S is finite.
Actions u_t ∈ U; here U is finite.
Transition function s_{t+1} = f(s_t, u_t): the reaction of the system.

Goal formulation: introducing costs
At every decision (i.e. in every stage) direct costs arise. Direct costs c : S → R; refinement, dependent on state and action: c : S × U → R. Reward, cost, punishment?

Summary: deterministic systems
Discrete decision points (stages) t ∈ T = {0, 1, ..., N} or T = {0, 1, ...}; system state (situation) s_t ∈ S; actions u_t ∈ U; transition function s_{t+1} = f(s_t, u_t); direct costs c : S × U → R. Together: the 5-tuple (T, S, U, f, c).
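
To make the 5-tuple concrete, here is a minimal Python sketch (illustrative only, not part of the lecture): a deterministic problem given by f and c, plus a helper that rolls out a fixed action sequence and accumulates the direct costs. The toy instance (states 0..4 on a line, actions -1/+1, unit cost) is assumed.

```python
# Illustrative sketch of a deterministic decision problem (T, S, U, f, c).

def rollout(f, c, s0, actions):
    """Apply s_{t+1} = f(s_t, u_t) along a fixed action sequence and
    accumulate the direct costs c(s_t, u_t)."""
    states, total_cost = [s0], 0.0
    s = s0
    for u in actions:
        total_cost += c(s, u)
        s = f(s, u)
        states.append(s)
    return states, total_cost

f = lambda s, u: max(0, min(4, s + u))   # stay on the line segment {0,...,4}
c = lambda s, u: 1.0                     # every decision costs 1
print(rollout(f, c, s0=0, actions=[+1, +1, -1]))   # ([0, 1, 2, 1], 3.0)
```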

Example: shortest path problems
Find the shortest path from a start node to a finish node; every edge has a specific cost that can be interpreted as its length. Optimization goal over multiple stages: evaluation of the whole sequence (reminder: decision, decision, ... evaluation). Look at the accumulated total costs Σ_{t∈T} c(s_t, u_t).

Stochastic systems
Again, requirements for a model: situations; activities; the current situation can be influenced; adjustments are possible at discrete points in time; noise, interference, randomness. Goal specification: definition of costs / rewards.

Markov Decision Processes
Deterministic system: 5-tuple (T, S, U, f, c). Stochastic system: the deterministic transition function f is replaced by a conditional probability distribution. In the following we consider a finite state set S = {1, 2, ..., n}. For states i, j ∈ S we use the notation
P(s_{t+1} = j | s_t = i, u_t = u) = p_{ij}(u).
Markov Decision Process (MDP): 5-tuple (T, S, U, p_{ij}(u), c(s, u)).

Markov property
It holds that P(s_{t+1} = j | s_t, u_t) = P(s_{t+1} = j | s_t, s_{t-1}, ..., u_t, u_{t-1}, ...). The probability distribution of the following state s_{t+1} is uniquely defined given the current state s_t and the action u_t; in particular, it does not depend on the previous history of the system.

Remarks (1)
A deterministic system is a special case of an MDP:
P(s_{t+1} | s_t, u_t) = 1 if s_{t+1} = f(s_t, u_t), and 0 otherwise.

Remarks (2)
Equivalent description with a deterministic transition function f. Approach: an additional argument, a random variable w_t (noise):
s_{t+1} = f(s_t, u_t, w_t), with w_t a random variable with given probability distribution P(w_t | s_t, u_t).
Transformation into the previous form: let W(i, u, j) = {w | j = f(i, u, w)} be the set of all values of w for which the system transitions from state i to state j on input u. Then it holds that p_{ij}(u) = P(w ∈ W(i, u, j)).
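
A small sketch of this transformation (my own illustrative code; the dynamics and the noise distribution are assumed, and the noise is taken to be independent of state and action): p_{ij}(u) is obtained by summing the probabilities of all noise values w that map (i, u) to j.

```python
# Illustrative: derive p_ij(u) from a noise-driven transition s' = f(s, u, w).

def f(i, u, w):
    return max(0, i + u - w)          # assumed dynamics (e.g. a stock level)

w_dist = {0: 0.5, 1: 0.3, 2: 0.2}     # assumed P(w) over noise values

def p(i, j, u):
    """p_ij(u) = P(w in W(i, u, j)) with W(i, u, j) = {w : j = f(i, u, w)}."""
    return sum(prob for w, prob in w_dist.items() if f(i, u, w) == j)

print(p(i=2, j=2, u=1))   # only w = 1 maps 2 -> 2, so 0.3
print(p(i=0, j=0, u=0))   # every w maps 0 -> 0, so 1.0
```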

Summary: MDPs
Discrete decision points (stages) t ∈ T = {0, 1, ..., N} or T = {0, 1, ...}; system state (situation) s_t ∈ S; actions u_t ∈ U; transition probabilities p_{ij}(u) with P(s_{t+1} = j | s_t = i, u_t = u) = p_{ij}(u), or alternatively a transition function s_{t+1} = f(s_t, u_t, w_t) with w_t a random variable; direct costs c : S × U → R. Together: the 5-tuple (T, S, U, p_{ij}(u), c(s, u)).
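
The following minimal Python sketch (illustrative; the two-state, two-action instance is assumed, not from the slides) stores the p_{ij}(u) tables and the cost table explicitly and samples one transition.

```python
# Tiny tabular MDP: p[u][i][j] = P(s_{t+1}=j | s_t=i, u_t=u), c[i][u] = direct cost.
import random

p = {
    0: [[0.9, 0.1], [0.2, 0.8]],   # transition matrix for action u = 0
    1: [[0.5, 0.5], [0.5, 0.5]],   # transition matrix for action u = 1
}
c = [[1.0, 2.0], [0.0, 0.5]]

def sample_step(i, u):
    """Draw the successor state j ~ p_ij(u) and return (j, direct cost)."""
    j = random.choices([0, 1], weights=p[u][i])[0]
    return j, c[i][u]

print(sample_step(0, 0))
```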

Summary: MDPs
Model: state, action, following state; deterministic and stochastic transition function; information about the history is summarized in the state. Very general description: OR, control engineering, games, ... Generalizations (not covered here): non-stationary transition function p_{ij,t}(u); non-stationary costs c_t(i, u).

Example: stock keeping
Assume you are the owner of a toy shop at an exhibition; the exhibition lasts N days. State s_t: number of toys in your shop. Action u_t: number of toys ordered, to be delivered on the next day. Disturbance w_t: number of toys sold.
System equation: s_{t+1} = s_t + u_t - w_t.
Costs: costs for toys in stock plus acquisition costs for each toy that was ordered, minus the gain for sold toys: c(s, u) = c_1(s) + c_2(u) - gain.
There are also terminal costs g(s) if there are still toys in stock after the N days.
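
A small sketch of one day of this example (my own code; the linear cost coefficients and the rule that you cannot sell more toys than you have in stock are assumptions, not part of the slide):

```python
# Toy-shop dynamics s_{t+1} = s_t + u_t - w_t with per-day cost
# c(s, u) = c1(s) + c2(u) - gain; all numeric constants are assumed.

HOLDING_COST = 0.1    # c1: cost per toy kept in stock
PURCHASE_COST = 1.0   # c2: acquisition cost per ordered toy
SALE_PRICE = 1.5      # gain per sold toy

def day(s, u, w):
    """One exhibition day: sell, then receive the ordered toys."""
    sold = min(w, s)                          # cannot sell more than the stock
    s_next = s + u - sold
    cost = HOLDING_COST * s + PURCHASE_COST * u - SALE_PRICE * sold
    return s_next, cost

print(day(s=3, u=2, w=4))   # (2, -2.2): stock of 2 left, net gain of 2.2
```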

Policy and selection function
Policy: the selection function π_t : S → U, π_t(s) = u, chooses at time t an action u ∈ U as a function of the current state s ∈ S. The selection function chooses an action depending on the situation (see the agent graphic). Refinement: π_t : S → U, π_t(s) = u with u ∈ U(s), a situation-dependent action set (example: chess). A policy π̂ consists of N selection functions (N being the number of decision points): π̂ = (π_0, π_1, ..., π_t, ...).

Non-stationary policies
The selection function π_t can depend on the time of the decision: the same situation at different points in time can lead to different decisions of the agent. π̂ = (π_0, π_1, ..., π_t, ...). If the selection functions differ for individual time points, we call it a non-stationary policy.
Example soccer: situation s: a midfield player has the ball. Reasonable action in the first minute: π_1(s) = return pass. Reasonable action in the last minute: π_90(s) = shoot on goal. General rationale: a limited optimization time frame ("finite horizon", see below) usually requires a non-stationary policy.
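
A non-stationary policy can be sketched as a list of per-stage selection functions (illustrative code only; the states and decision rules are made up):

```python
# A non-stationary policy \hat{pi} = (pi_0, ..., pi_{N-1}) as a list of
# selection functions; the same state can map to different actions at
# different stages.

def cautious(s):                 # early-stage selection function (assumed rule)
    return "pass" if s == "midfield_has_ball" else "hold"

def aggressive(s):               # late-stage selection function (assumed rule)
    return "shoot" if s == "midfield_has_ball" else "hold"

N = 4
policy = [cautious] * (N - 1) + [aggressive]   # change behaviour in the last stage
print(policy[0]("midfield_has_ball"), policy[N - 1]("midfield_has_ball"))
```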

Stationary policies
We will mostly look at stationary policies. Then π_0 = π_1 = ... = π_t = ... =: π and π̂ = (π, π, ..., π, ...). With stationary policies, the terms policy and selection function become interchangeable; as is common in the literature, we call the selection function π our policy. Bertsekas uses the symbol µ for the selection function, so there are minor differences from the notation used there. Remark: in the following, only deterministic selection functions will be used.

Goal of the policy
Reach the optimization goal over multiple stages (a sequence of decisions): solving a dynamic optimization problem.

Cumulated costs (costs-to-go)
Of interest: the cumulated costs for a given state s under a given policy π:
J^π(s) = Σ_{t∈T} c(s_t, π(s_t)), with s_0 = s.
Wanted: an optimal policy π* so that for all s
J^{π*}(s) = min_π Σ_{t∈T} c(s_t, π(s_t)),
under the constraint s_{t+1} = f(s_t, u_t), s_0 = s.
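
For a deterministic system, J^π(s) can be computed by simply simulating the policy; a minimal sketch (illustrative, with a finite horizon N assumed so that the sum terminates):

```python
# Evaluate J^pi(s) = sum_t c(s_t, pi(s_t)) for a deterministic system by
# rolling the policy forward over a finite horizon N (assumed).

def policy_cost(f, c, pi, s0, N):
    """Simulate s_{t+1} = f(s_t, pi(s_t)) for N stages and sum the costs."""
    s, total = s0, 0.0
    for _ in range(N):
        u = pi(s)
        total += c(s, u)
        s = f(s, u)
    return total

f = lambda s, u: max(0, min(4, s + u))       # toy line-world dynamics (assumed)
c = lambda s, u: abs(s - 4)                  # cost: distance to the goal state 4
pi = lambda s: +1                            # policy: always move right
print(policy_cost(f, c, pi, s0=0, N=5))      # 4 + 3 + 2 + 1 + 0 = 10.0
```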

Cumulated costs in MDPs
Expected cumulated costs for a given state s under a given policy π:
J^π(s) = E_w[ Σ_{t∈T} c(s_t, π(s_t)) ], with s_0 = s.
Wanted: an optimal policy π* so that for all s
J^{π*}(s) = min_{π∈Π} E_w[ Σ_{t∈T} c(s_t, π(s_t)) ], s_0 = s,
under the constraint s_{t+1} = f(s_t, u_t, w_t), or with given probability distribution P(s_{t+1} = j | s_t = i, u_t = u) = p_{ij}(u).
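
In the stochastic case the expectation can be approximated by Monte-Carlo rollouts; a rough sketch (my own code, with the same assumed two-state MDP tables as above and a finite horizon N):

```python
# Monte-Carlo estimate of J^pi(s) = E[ sum_t c(s_t, pi(s_t)) ] over N stages.
import random

p = {0: [[0.9, 0.1], [0.2, 0.8]], 1: [[0.5, 0.5], [0.5, 0.5]]}
c = [[1.0, 2.0], [0.0, 0.5]]

def estimate_J(pi, s0, N, episodes=10000):
    total = 0.0
    for _ in range(episodes):
        s, ret = s0, 0.0
        for _ in range(N):
            u = pi(s)
            ret += c[s][u]
            s = random.choices([0, 1], weights=p[u][s])[0]
        total += ret
    return total / episodes

pi = lambda s: 0          # assumed stationary policy: always take action 0
print(round(estimate_J(pi, s0=0, N=3), 2))
```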

Problem types
Definition (horizon): the horizon N of a problem denotes the number of decision stages to be traversed. Finite horizon: problems with a given termination time. Infinite horizon: an approximation for very long processes or processes with an unknown end (e.g. a control system).

Finite horizon
N-stage decision problem. Each state has terminal costs g(i) that are due if the system ends in i after N stages. Costs of a policy π:
J_N^π(s) = E[ g(s_N) + Σ_{t=0}^{N-1} c(s_t, π_t(s_t)) | s_0 = s ].
Generally: a non-stationary policy.

Infinite horizon
Costs of a policy π:
J^π(s) = lim_{N→∞} E[ Σ_{t=0}^{N} c(s_t, π_t(s_t)) | s_0 = s ].
Problem: are the costs finite? Solution: discount α < 1:
J^π(s) = lim_{N→∞} E[ Σ_{t=0}^{N} α^t c(s_t, π_t(s_t)) | s_0 = s ].
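
The discount makes the infinite sum finite whenever the direct costs are bounded; a quick numeric illustration (assuming a constant per-stage cost of 1):

```python
# With |c| <= c_max the discounted sum is bounded by c_max / (1 - alpha).
alpha = 0.9
partial = sum(alpha**t * 1.0 for t in range(1000))     # long partial sum
print(round(partial, 4), round(1.0 / (1.0 - alpha), 4))  # both approx 10.0
```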

Solution of dynamic optimization problems
Central question: how do we find the policy that leads (on average) to minimal costs? Solution method: Dynamic Programming (Bellman, 1957). Backward Dynamic Programming (finite horizon); Value Iteration (LU 3 ff., infinite horizon); Policy Iteration (LU 3 ff., infinite horizon).

Backward Dynamic Programming - idea
Problem: stochastic multi-stage decision problems with finite horizon. Idea: calculate the costs starting from the last stage back to the first stage. Example: find the cheapest path in a graph.

Backward Dynamic Programming - problem specification (1)
Finite-horizon MDP: N discrete decision points t ∈ T = {0, 1, ..., N}; finite state set s_t ∈ S = {1, 2, ..., n}; finite action set u_t ∈ U = {u_1, ..., u_m}; transition probabilities p_{ij}(u) with P(s_{t+1} = j | s_t = i, u_t = u) = p_{ij}(u); direct costs c : S × U → R; in the last stage N, each state causes terminal costs g(s_N) := c_N(s_N).

Backward Dynamic Programming - objective
Wanted: π* with J^{π*} = min_π J^π, where J_N^π(i) = E[ g(s_N) + Σ_{t=0}^{N-1} c(s_t, π_t(s_t)) | s_0 = i ]. The costs belonging to π* are called the optimal cumulated costs J* := J^{π*}.
Approach: 1. Calculation of the optimal cumulated costs ("costs-to-go") J*_k(·) for all states (J*_k(·) is an n-dimensional vector); k is the number of remaining steps. 2. From J*_k the optimal policy for the k-step problem follows (k steps until the process terminates).

Backward Dynamic Programming - motivation
Thesis (Bellman's principle of optimality): if I have k more steps to go, the optimal costs for a state i are given by the minimal expected value of the direct transition costs plus the optimal cumulated costs of the next state with k-1 steps remaining from there. The minimization here goes over all possible actions.

Bellman's Principle of Optimality
Formally, for the optimal cumulated costs J*_k(i) of the k-stage decision problem it holds that:
J*_k(i) = min_{u∈U(i)} E_{w_k}{ c(i, u) + J*_{k-1}(f(i, u, w_k)) }
        = min_{u∈U(i)} Σ_{j=1}^{n} p_{ij}(u) (c(i, u) + J*_{k-1}(j)),   i = 1, ..., n.   (1)
Hence we can calculate the optimal cumulated costs of the N-stage optimization problem recursively, starting with k = 0: the backward-DP algorithm.
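
Equation (1) is a minimization over actions of an expectation over successor states, so a single backup can be written directly; an illustrative sketch with assumed tables p, c and a given vector J*_{k-1}:

```python
# One Bellman backup J*_k(i) = min_u sum_j p_ij(u) * (c(i,u) + J*_{k-1}(j)).

p = {0: [[0.9, 0.1], [0.2, 0.8]], 1: [[0.5, 0.5], [0.5, 0.5]]}
c = [[1.0, 2.0], [0.0, 0.5]]
J_prev = [0.0, 3.0]                       # J*_{k-1}(j), assumed known

def backup(i, actions=(0, 1)):
    return min(
        sum(p[u][i][j] * (c[i][u] + J_prev[j]) for j in range(2))
        for u in actions
    )

print(backup(0))   # action 0: 1.0 + 0.1*3 = 1.3; action 1: 2.0 + 1.5 = 3.5 -> 1.3
```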

Bellman's Principle of Optimality - proof (1)
Policy π̂^(k) for k stages: π̂^(k) = (π_k, π_{k-1}, π_{k-2}, ...) = (π_k, π̂^(k-1)). Let S^(k)(i) = (s_{N-k} = i, s_{(N-k)+1}, ..., s_N) be a possible state sequence starting in state i with k transitions.
J*_k(i) = min_{π̂^(k)} J_k^{π̂^(k)}(i)   (2)
= min_{π̂^(k)} { Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (3)
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (4)

Bellman's Principle of Optimality - proof (2)
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{S^(k)(i)} P(S^(k)(i) | π̂^(k)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (5)
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{j∈S} P(s_{(N-k)+1} = j | s_{N-k} = i, π_k) Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (6), (7)

Bellman's Principle of Optimality - proof (3)
= min_{π̂^(k)} { c(i, π_k(i)) + Σ_{j∈S} P(s_{(N-k)+1} = j | s_{N-k} = i, π_k) Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) }   (8)
= min_{u∈U(i)} { c(i, u) + Σ_{j∈S} P(s_{(N-k)+1} = j | s_{N-k} = i, u) · min_{π̂^(k-1)} { Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) } }   (9), (10)

Bellman's Principle of Optimality - proof (4)
= min_{u∈U(i)} { c(i, u) + Σ_{j∈S} P(s_{(N-k)+1} = j | s_{N-k} = i, u) · min_{π̂^(k-1)} { Σ_{S^(k-1)(j)} P(S^(k-1)(j) | π̂^(k-1)) ( Σ_{l=1}^{k-1} c(s_{N-l}, π_l(s_{N-l})) + g(s_N) ) } }   (11)
= min_{u∈U(i)} { c(i, u) + Σ_{j∈S} P(s_{(N-k)+1} = j | s_{N-k} = i, u) · min_{π̂^(k-1)} { J_{k-1}^{π̂^(k-1)}(j) } }   (12)
= min_{u∈U(i)} { c(i, u) + Σ_{j∈S} P(s_{(N-k)+1} = j | s_{N-k} = i, u) · J*_{k-1}(j) }   (13)
= min_{u∈U(i)} { c(i, u) + Σ_{j∈S} p_{ij}(u) J*_{k-1}(j) }   (14), (15)

Backward Dynamic Programming - algorithm
k = 0: J*_0(i) = g(i) for all i ∈ S.
For k = 1 to N, for all i ∈ S:
J*_k(i) = min_{u∈U(i)} E_{w_k}{ c(i, u) + J*_{k-1}(f(i, u, w_k)) }
or, equivalently,
J*_k(i) = min_{u∈U(i)} Σ_{j=1}^{n} p_{ij}(u) (c(i, u) + J*_{k-1}(j)).
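
The whole recursion fits in a few lines; the following sketch implements the p_{ij}(u) form of the algorithm for the assumed tabular MDP used in the earlier snippets (illustrative code, not from the lecture), returning J*_k for k = 0, ..., N.

```python
# Backward DP: J*_0(i) = g(i); J*_k(i) = min_u sum_j p_ij(u)(c(i,u) + J*_{k-1}(j)).
# p, c, g and the horizon N below are assumed for illustration.

p = {0: [[0.9, 0.1], [0.2, 0.8]], 1: [[0.5, 0.5], [0.5, 0.5]]}
c = [[1.0, 2.0], [0.0, 0.5]]
g = [0.0, 5.0]                  # terminal costs g(i)
n, actions, N = 2, (0, 1), 3

def backward_dp():
    J = [list(g)]               # J[k][i] = J*_k(i); start with k = 0
    for k in range(1, N + 1):
        J_prev = J[-1]
        J_k = [
            min(
                sum(p[u][i][j] * (c[i][u] + J_prev[j]) for j in range(n))
                for u in actions
            )
            for i in range(n)
        ]
        J.append(J_k)
    return J

for k, J_k in enumerate(backward_dp()):
    print(f"k={k}:", [round(v, 3) for v in J_k])
```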

Choosing an action
Requirement: J*_k(i) is known for all k ≤ N. Approach: we simply calculate the expected costs for all possible actions and choose the best action (with minimal expected cumulated costs):
π_k(i) ∈ arg min_{u∈U(i)} E_{w_k}{ c(i, u) + J*_{k-1}(f(i, u, w_k)) }
The chosen optimal action minimizes the sum of the expected transition costs plus the expected cumulated costs of the remaining problem. Remark: J*_k defines an optimal policy; the policy is not unique, but J*_k is.
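
Extracting the greedy action from the cost-to-go table is the same minimization with arg min instead of min; an illustrative sketch, reusing the assumed tables and J values from the backward-DP sketch:

```python
# Greedy action: pi_k(i) in argmin_u sum_j p_ij(u) * (c(i,u) + J*_{k-1}(j)).

p = {0: [[0.9, 0.1], [0.2, 0.8]], 1: [[0.5, 0.5], [0.5, 0.5]]}
c = [[1.0, 2.0], [0.0, 0.5]]
J_prev = [1.5, 3.0]             # J*_{k-1}, e.g. taken from backward DP

def greedy_action(i, actions=(0, 1)):
    def expected_cost(u):
        return sum(p[u][i][j] * (c[i][u] + J_prev[j]) for j in range(2))
    return min(actions, key=expected_cost)

print(greedy_action(0), greedy_action(1))
```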

Remarks
Complexity for deterministic systems: O(N · n · m). Complexity for stochastic systems: O(N · n² · m). An exact analytic solution is rarely available, so one computes a numeric solution; this, however, can be very costly. (N = number of stages, n = number of states, m = number of actions.)
