6.231 Dynamic Programming Midterm, Fall 2008




Instructions

The midterm comprises three problems. Problem 1 is worth 60 points, problem 2 is worth 40 points, and problem 3 is worth 40 points. Your grade $G \in [0, 100]$ is given by the formula
$$G = 60 f_1 + 40 \min\{1, f_2 + f_3\},$$
where $f_i \in [0, 1]$, $i = 1, 2, 3$, is the fraction of problem $i$ that you solved correctly. Notice that if you solve both problems 2 and 3 correctly but do not solve problem 1, your grade will only be $G = 40$. Thus, try to solve problem 1 first, then choose whichever of problems 2 and 3 you prefer, and if you still have time, try to do the remaining problem. Credit is given only for clear logic and careful reasoning. GOOD LUCK!
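For concreteness, the grading formula translates directly into code. A minimal Python sketch (the function name `grade` is ours):

```python
def grade(f1, f2, f3):
    """Grade G = 60*f1 + 40*min(1, f2 + f3), with each f_i in [0, 1]."""
    return 60 * f1 + 40 * min(1.0, f2 + f3)

# As noted above: solving problems 2 and 3 but not problem 1 gives G = 40.
assert grade(0.0, 1.0, 1.0) == 40.0
assert grade(1.0, 0.5, 0.5) == 100.0
```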

Problem 1 (Infinite Horizon Problem, 60 points)

An engineer has invented a better mouse trap and is interested in selling it for the right price. At the beginning of each period, he receives a sale offer that takes one of the values $s_1, \ldots, s_n$ with corresponding probabilities $p_1, \ldots, p_n$, independently of prior offers. If he accepts the offer, he retires from engineering. If he refuses the offer, he may accept subsequent offers, but he also runs the risk that a competitor will invent an even better mouse trap, rendering his own unsaleable; this happens with probability $\beta > 0$ at each time period, independently of earlier time periods. While he is overtaken by the competitor, at each time period he may choose to retire from engineering, or he may choose to invest an amount $v > 0$, in which case he has probability $\gamma$ of improving his mouse trap, overtaking his competitor, and starting to receive offers as before. The problem is to determine the engineer's strategy that maximizes his discounted expected payoff minus investment cost, assuming a discount factor $\alpha < 1$.

(a) [15 points] Formulate the problem as an infinite horizon discounted cost problem and write the corresponding Bellman's equation.

(b) [15 points] Characterize as best as you can an optimal policy.

(c) [10 points] Describe as best as you can how policy iteration would work for this problem. What kind of initial policy would lead to an efficient policy iteration algorithm?

(d) [10 points] Assume that there is no discount factor. Does the problem make sense as an average cost per stage problem?

(e) [10 points] Assume that there is no discount factor and that the investment cost $v$ is equal to 0. Does the problem make sense as a stochastic shortest path problem, and what is then the optimal policy?

Solution to Problem 1

(a) Let the set of possible states be $\{x_i : i = 1, \ldots, n\} \cup \{t, r\}$, where being in a state has the following meaning: $x_i$: receiving offer $s_i$; $t$: overtaken; $r$: retired. Define the control space by $\{A$ (accept offer), $I$ (invest), $R$ (retire), $Rej$ (reject)$\}$. The state $r$ is absorbing, and for the other states the sets of admissible controls are $U(x_i) = \{A, Rej\}$ and $U(t) = \{I, R\}$.

Define the corresponding transition probabilities and per-stage rewards as described in the problem. Bellman's equation for $\alpha < 1$ is
$$J(x_i) = \max\left[\, s_i,\; \alpha\Big((1-\beta)\sum_{j=1}^{n} p_j J(x_j) + \beta J(t)\Big)\right], \qquad (1)$$
$$J(t) = \max\left[\, 0,\; -v + \alpha\Big(\gamma \sum_{j=1}^{n} p_j J(x_j) + (1-\gamma) J(t)\Big)\right]. \qquad (2)$$

(b) From Eq. (1), since the second term in the maximization on the right-hand side does not depend on the current state $x_i$, the optimal policy at the states $x_i$, $i = 1, \ldots, n$, is a threshold policy: accept the offer if and only if it exceeds this second term. A single threshold suffices since $J(s)$ is clearly monotonically nondecreasing in $s$. From Eq. (2), once the inventor has been overtaken, it is either optimal to retire immediately, or to keep investing until his mouse trap is improved.

(c) Consider policy iteration. If the current policy $\mu$ is a threshold policy, the cost $J_\mu(s)$ is monotonically nondecreasing in $s$, so the cost of the improved policy is also monotonically nondecreasing in $s$. Hence, starting from a nonoptimal threshold policy, the method generates a new and improved threshold policy whose decision changes at at least one offer. It follows that, starting from a threshold policy, policy iteration will terminate with an optimal threshold policy after at most $n$ iterations.

(d) Even without discounting, all policies have finite total cost, except the one that never sells and always invests, which incurs infinite total cost. A finite total cost spread over infinitely many stages yields an average cost per stage of zero, so under the average cost criterion every finite-total-cost policy is optimal. Thus, the average cost criterion does not discriminate enough between different policies and makes no sense for this problem.

(e) Since there is no discounting and $v = 0$, there is no penalty for waiting as long as necessary until the maximum possible offer is received, which happens with probability 1 (investing costs nothing whenever the engineer is overtaken). So a stochastic shortest path formulation makes limited sense, since it excludes from consideration all offers except the maximum possible one, no matter how unlikely that offer is: the optimal policy is to reject every offer below $\max_i s_i$, invest whenever overtaken, and accept the maximum offer when it eventually arrives.
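To make Eqs. (1) and (2) concrete, here is a minimal value-iteration sketch in Python. All of the numbers (the offers, their probabilities, and $\alpha$, $\beta$, $\gamma$, $v$) are illustrative assumptions, not data from the exam:

```python
import numpy as np

# Illustrative data (assumptions, not from the exam).
s = np.array([1.0, 2.0, 5.0])        # offers s_1, ..., s_n
p = np.array([0.5, 0.3, 0.2])        # offer probabilities p_1, ..., p_n
alpha, beta, gamma, v = 0.95, 0.1, 0.5, 0.2

J = np.zeros(len(s))                 # J(x_i): value when holding offer s_i
Jt = 0.0                             # J(t): value when overtaken
for _ in range(10000):
    EJ = p @ J                       # expected value of a fresh offer
    J_new = np.maximum(s, alpha * ((1 - beta) * EJ + beta * Jt))     # Eq. (1)
    Jt_new = max(0.0, -v + alpha * (gamma * EJ + (1 - gamma) * Jt))  # Eq. (2)
    done = np.max(np.abs(J_new - J)) < 1e-12 and abs(Jt_new - Jt) < 1e-12
    J, Jt = J_new, Jt_new
    if done:
        break

# Part (b): accept an offer iff it is at least the value of rejecting it.
reject_value = alpha * ((1 - beta) * (p @ J) + beta * Jt)
print("J =", J, " J(t) =", Jt, " accept offers >=", reject_value)
```

Since the Bellman operator here is an $\alpha$-contraction, the iteration converges, and the printed rejection value is the single threshold of part (b).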

Problem 2 (Imperfect State Information Problem, 40 points)

A machine tosses a coin $N$ times, but each time it may switch the probability of heads between two possible values $\bar h$ and $h$. The switch occurs with probability $q$ at each time, independently of the switches at previous times. A gambler may bet on the outcome of each coin toss knowing $\bar h$, $h$, and $q$, the outcomes of previous tosses, and the probability $r$ that $\bar h$ is used at the initial toss. If the gambler bets on a coin toss, he wins $2 if a head occurs, while he loses $1 if a tail occurs. The gambler may also permanently stop betting prior to any coin toss.

(a) [20 points] Define $p_0 = r$ and, for $k = 1, \ldots, N-1$,
$$p_k = \mathrm{Prob}\{\text{coin with probability of heads } \bar h \text{ is used at toss } k \mid z_{k-1}, \ldots, z_0\},$$
where $z_{k-1}, \ldots, z_0$ are the previously observed tosses. Write an equation for the evolution of $p_k$ as the coin tosses are observed.

(b) [10 points] Use the equation of part (a) to write a DP algorithm to find the gambler's optimal policy for betting or stopping to bet prior to any coin toss. You will receive credit if you do this part correctly, assuming a generic functional form of this equation.

(c) [10 points] Repeat part (b) for the variant of the problem where the gambler may resume play at any time after he stops betting, and he keeps observing the coin tosses also when he is not betting.

Solution to Problem 2

(a) The machine uses two coins, one with probability of heads equal to $\bar h$, and the other with probability of heads equal to $h$. We call the first a type $\bar h$ coin and the second a type $h$ coin. Let $f_{\bar h}(z)$, $z \in \{\text{head}, \text{tail}\}$, be the probability mass function of the type $\bar h$ coin (clearly, $f_{\bar h}(\text{head}) = \bar h$). Similarly, let $f_h(z)$, $z \in \{\text{head}, \text{tail}\}$, be the probability mass function of the type $h$ coin (clearly, $f_h(\text{head}) = h$). We are faced with an imperfect state information problem where the state at stage $k$, which we call $x_k$, is the type of coin that the machine uses at stage $k$: we write $x_k = \bar h$ if at stage $k$ the machine uses a type $\bar h$ coin, and $x_k = h$ otherwise.

The conditional probability $p_k$ is generated recursively, starting from $p_0 = r$, by
$$p_{k+1} = \mathrm{Pr}\{x_{k+1} = \bar h \mid z_k, z_{k-1}, \ldots, z_0\} = \frac{\mathrm{Pr}\{x_{k+1} = \bar h,\, z_k \mid z_{k-1}, \ldots, z_0\}}{\mathrm{Pr}\{z_k \mid z_{k-1}, \ldots, z_0\}}.$$
Conditioning on $x_k$, and using $\mathrm{Pr}\{x_{k+1} = \bar h \mid x_k = \bar h\} = 1 - q$ and $\mathrm{Pr}\{x_{k+1} = \bar h \mid x_k = h\} = q$, the numerator equals
$$(1 - q) f_{\bar h}(z_k)\, p_k + q f_h(z_k)(1 - p_k),$$
while the denominator equals
$$f_{\bar h}(z_k)\, p_k + f_h(z_k)(1 - p_k).$$
Hence
$$p_{k+1} = \frac{(1 - q) f_{\bar h}(z_k)\, p_k + q f_h(z_k)(1 - p_k)}{f_{\bar h}(z_k)\, p_k + f_h(z_k)(1 - p_k)} = \phi(p_k, z_k).$$
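The recursion of part (a) is straightforward to implement. A minimal sketch (function and variable names are ours):

```python
def phi(p_k, z_k, h_bar, h, q):
    """Belief update p_{k+1} = phi(p_k, z_k) for the two-coin filter.

    p_k is the probability that the type h_bar coin is used at toss k;
    z_k is the observed outcome, 'head' or 'tail'.
    """
    f_bar = h_bar if z_k == 'head' else 1.0 - h_bar   # f_{h_bar}(z_k)
    f = h if z_k == 'head' else 1.0 - h               # f_h(z_k)
    num = (1 - q) * f_bar * p_k + q * f * (1 - p_k)   # coin may switch, w.p. q
    den = f_bar * p_k + f * (1 - p_k)                 # Pr{z_k | z_{k-1}, ..., z_0}
    return num / den

# Example with assumed values h_bar = 0.7, h = 0.2, q = 0.1, p_0 = r = 0.5:
# a head is evidence for the type h_bar coin, so the belief rises.
print(phi(0.5, 'head', 0.7, 0.2, 0.1))   # approx 0.72
```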

= 1 qp k q1 pk f h zk 1 q p 1 p + fh zk 1 qp k +q1 p k 1 q p k + q1 p k. = φp k, z k, k+q k f h z k p k + f h z k 1 p k ] since P r{z k x k+1 = h, z k 1,..., z 0 } = = P r{z k x k+1 = h, x k = h, z k 1,..., z0} P r{x k = h x k+1 = h, z k 1,..., z 0} + + P r{z k x k+1 = h, x k = h, z k 1,..., z 0 } P r{x k = h x k+1 = h, z k 1,..., z0} = P r{x k+1 = h x k = h, z k 1,..., z0}p k = f h z k + P r{x k+1 = h x k = h, z k 1,..., z 0 }p k + P r{x k+1 = h x k = h, z k 1,..., z 0} 1 p k P r{x k+1 = h x k = h, z k 1,..., z 0}1 p k + f h z k = P r{x k+1 = h x k = h, z k 1,..., z 0 }p k + P r{x k+1 = h 1 x k = h, z k 1,..., z 0 } p k 1 qpk q1 p k = f h z k + f h zk 1 q p k + q1 p k 1 qp k + q1 p k b The DP algorithm is J k p k = max }{{} 0, 2 p k h+1 p k h 1 p k 1 h+1 p k 1 h+e zk {J }] k+1 φk p k, z k, k = 0,..., N 1, with J N p N 0. stop c In this case we have for k = 0,..., N 1 J k p k = max 0 + E zk {J k+1 φk p k, z k } {, 2 p k h+1 p k h 1 p k 1 h+1 p k 1 h+e zk J k+1 φk p k, z k }], } {{ } with J N p N 0. Thus, we do not play if do not play 0 > 2 p k h + 1 p k h 1 p k 1 h + 1 p k 1 h, that is, assuming without loss of generality that h > h, we do not play if 1 3h p k <. 3h h Problem 3 Optimal Stopping Problem, 40 points 5

Problem 3 (Optimal Stopping Problem, 40 points)

A driver is looking for parking on the way to his destination. Each parking place is free with probability $p$, independently of whether other parking places are free or not. The driver cannot observe whether a parking place is free until he reaches it. If he parks $k$ places from his destination, he incurs a cost $k$. If he reaches the destination without having parked, the cost is $C$. The driver wishes to find an optimal parking policy.

(a) [20 points] Formulate the problem as a DP problem, clearly identifying states, controls, transition probabilities, and costs.

(b) [20 points] Write the DP algorithm and show that it can be represented by the sequence $\{C_k\}$ generated as follows:
$$C_k = p \min[k, C_{k-1}] + q C_{k-1}, \qquad k = 1, 2, \ldots, \qquad (1)$$
where $C_0 = C$ and $q = 1 - p$. Characterize as best as you can the optimal policy.

Solution to Problem 3

(a) Let the state be $x_k \in \{T, \bar T\}$, where $T$ represents the driver having parked before reaching the $k$th spot and $\bar T$ represents his not having parked. Let the control at each parking spot be $u_k \in \{P, \bar P\}$, where $P$ represents the choice to park in the $k$th spot. Let the disturbance $w_k$ equal 1 if the $k$th spot is free, and 0 otherwise. Clearly, we have the control constraint $u_k = \bar P$ if $x_k = T$ or $w_k = 0$. The cost associated with parking in the $k$th spot is
$$g_k(\bar T, P, 1) = k.$$
If the driver has not parked upon reaching his destination, he incurs a cost $g_N(\bar T) = C$. All other costs are zero. The system evolves according to
$$x_{k+1} = \begin{cases} T & \text{if } x_k = T \text{ or } u_k = P, \\ \bar T & \text{otherwise}. \end{cases}$$

(b) Once the driver has parked, his remaining cost is zero. Thus, we can define $C_k$ to be the expected remaining cost given that the driver has not parked before reaching the $k$th spot; note that this is simply $J_k(\bar T)$. The DP algorithm starts with $C_0 = C$ and, for $k = 1, 2, \ldots$, generates $C_k$ by
$$C_k = \min_{u_k \in \{P, \bar P\}} E_{w_k}\big\{ g_k(\bar T, u_k, w_k) + J_{k-1}(x_{k-1}) \big\} = \min\Big[\underbrace{p\big(k + J_{k-1}(T)\big)}_{\text{park, free}} + \underbrace{q J_{k-1}(\bar T)}_{\text{park, not free}},\; \underbrace{J_{k-1}(\bar T)}_{\text{don't park}}\Big].$$
Since $J_k(T) = 0$ for all $k$, we have
$$C_k = \min\big[pk + q C_{k-1},\; C_{k-1}\big] = p \min\big[k, C_{k-1}\big] + q C_{k-1}.$$
Thus the sequence $\{C_k\}$ is the sequence generated by the DP algorithm. The minimization also characterizes the optimal policy: park at the first free spot $k$ (counting places from the destination) with $k \le C_{k-1}$, and drive on otherwise.
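The recursion (1) is a one-liner. A minimal sketch with assumed numbers $p = 0.1$ and $C = 30$:

```python
# Recursion C_k = p*min(k, C_{k-1}) + q*C_{k-1}, with C_0 = C.
p, C = 0.1, 30.0          # illustrative assumptions
q = 1 - p
C_prev = C                # C_0 = C
for k in range(1, 16):
    C_k = p * min(k, C_prev) + q * C_prev
    # By the minimization in the DP: park at a free spot k iff k <= C_{k-1}.
    rule = "park if free" if k <= C_prev else "drive on"
    print(f"k={k:2d}  C_k={C_k:6.2f}  {rule}")
    C_prev = C_k
```

With these illustrative numbers the rule flips at k = 14: the driver passes free spots 14 or more places away and takes the first free spot within 13 places of the destination. Note that once $k$ exceeds $C_{k-1}$, the sequence $\{C_k\}$ stays constant.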

MIT OpenCourseWare
http://ocw.mit.edu

6.231 Dynamic Programming and Stochastic Control
Fall 2011

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.