6.231 Dynamic Programming and Stochastic Control Fall 2008




MIT OpenCourseWare http://ocw.mit.edu 6.231 Dynamic Programming and Stochastic Control Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

6.231 Dynamic Programming Midterm Exam, Fall 2002
October 29, 2002, 2:30-4:30pm
Prof. Dimitri Bertsekas

Problem 1 (20 points)

We have a set of N objects, denoted 1, 2, ..., N, which we want to group in clusters that consist of consecutive objects. For each cluster i, i+1, ..., j there is an associated cost a_{ij}. We want to find a grouping of the objects in clusters such that the total cost is minimum. Formulate the problem as a shortest path problem, and write a DP algorithm for its solution. (Note: An example of this problem arises in typesetting programs, such as TeX/LaTeX, that break down a paragraph into lines in a way that optimizes the paragraph's appearance.)

Problem 2 (40 points)

The latest casino sensation is a slot machine with N arms, labeled 1, ..., N. A single play with arm i costs C_i dollars and has two possible outcomes: a win, which occurs with probability p_i and pays a reward R_i, and a loss, which occurs with probability 1 − p_i. The rule is that each arm may be played at most once, and play must stop at the first loss or after all arms have been played once, whichever comes first. The objective is to find the arm-playing order that maximizes the total expected reward minus the total expected cost.

(a) Write a DP algorithm for solving the problem.

(b) Show that it is optimal to play the arms in order of nonincreasing (p_i R_i − C_i)/(1 − p_i). (Note: This may be shown with or without using the DP algorithm of part (a).)

(c) Assume that at any time, there is the option to stop playing, in addition to selecting a new arm to play. Write a DP algorithm for solving this variant of the problem, and find an optimal policy.

(d) Suppose that, in the context of part (c), you may play an arm as many times as you want, but each time the reward to be obtained diminishes by a factor β with 0 < β < 1. Assuming that C_i > 0, find an optimal policy.

Problem 3 (40 points)

Consider an inventory control problem where stock evolves according to x_{k+1} = x_k + u_k − w_k, and the cost of stage k is

c u_k + h max(0, w_k − x_k − u_k) + p max(0, x_k + u_k − w_k),

where c, h, and p are positive scalars with p > c. There is no terminal cost. The stock x_k is perfectly observed at each stage. The demands w_k are independent, identically distributed, nonnegative random variables. However, the (common) distribution of the w_k is unknown. Instead, it is known that this distribution is one of two known distributions F_1 and F_2, and that the a priori probability that F_1 is the correct distribution is a given scalar q, with 0 < q < 1. You may assume for convenience that w_k can take a finite number of values under each of F_1 and F_2.

(a) Formulate this as an imperfect state information problem, and identify the state, control, system disturbance, observation, and observation disturbance.

(b) Write a DP algorithm in terms of a suitable sufficient statistic.

(c) Characterize as best as you can the optimal policy.

MIDTERM SOLUTIONS

Problem 1 (20 points)

We may model this problem as a deterministic shortest path problem with nodes {0, 1, ..., N}, where 0 is the start and N is the destination, with arcs (i, j) only for j > i, and with a self-arc (N, N) at the absorbing node N. Each arc (i, j) with i ≠ N corresponds to the cluster of objects i+1, i+2, ..., j and has cost a_{i+1,j}, while arc (N, N) has cost 0. We have the following DP problem setup:

x_k = last object of a cluster,  x_k ∈ S = {0, 1, ..., N},  k = 0, 1, ..., N,
x_{k+1} = u_k,  k = 0, 1, ..., N − 1,  x_0 = 0,
u_k ∈ U_k(x) = {i ∈ S | i > x} if x ≠ N,  k = 0, 1, ..., N − 1,

and u_k ∈ U_k(x) = {N} if x = N. Moreover,

g_k(x, u) = a_{x+1,u} if x ≠ N,  k = 0, 1, ..., N − 1,

and g_k(x, u) = 0 if x = N. We then have the following DP algorithm:

J_N(N) = 0,
J_k(i) = min_{j ∈ S, j > i} [ a_{i+1,j} + J_{k+1}(j) ] if i ≠ N,  k = 0, 1, ..., N − 1,

and J_k(i) = 0 if i = N. The optimal cost is then J_0(0).

Problem 2 (40 points)

(a) (12 points) Choose as state at stage k the set of N − k arms not yet played (if no loss has occurred in the first k plays), or a special termination state otherwise (which has cost-to-go 0 at all stages under any policy). The initial state is then the set {1, 2, ..., N}. The expected reward at each stage is p_i R_i − C_i, where i is the arm selected to play. The DP algorithm at the nontermination states is

J_k({i_1, ..., i_{N−k}}) = max_{i ∈ {i_1, ..., i_{N−k}}} [ p_i R_i − C_i + p_i J_{k+1}({i_1, ..., i_{N−k}} \ {i}) ],  k = 0, ..., N − 1,
J_N(∅) = 0.

(b) (12 points) The problem is identical to the quiz problem with expected reward for trying the ith question equal to p_i R_i − C_i. The result follows by the interchange argument in the book, with the expected reward given the order i_1, i_2, ..., i_N equal to

p_{i_1} R_{i_1} − C_{i_1} + p_{i_1}(p_{i_2} R_{i_2} − C_{i_2}) + ... + p_{i_1} p_{i_2} ··· p_{i_{N−1}}(p_{i_N} R_{i_N} − C_{i_N}).

(c) (8 points) Transform the problem to the one of parts (a) and (b) by adding a new arm N + 1 with C_{N+1} = 0, p_{N+1} = 0, R_{N+1} = 0. Choosing this arm is equivalent to stopping, since its cost, reward, and probability of continuing are all equal to 0. Since this new arm's index (p_i R_i − C_i)/(1 − p_i) is equal to 0, we never play any arm with negative index. An optimal policy is to select the set of arms i with p_i R_i > C_i, play them in order of nonincreasing (p_i R_i − C_i)/(1 − p_i), and then stop.
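The index policy of parts (b) and (c) can be sketched in a few lines of Python. This is illustrative only, not part of the exam solutions: the arm data below are made up, and each p_i is assumed strictly less than 1 so the index is well defined.

```python
def optimal_order(arms):
    """arms: list of (p, R, C) triples with 0 <= p < 1.
    Keep the arms with p*R > C and sort them by nonincreasing index
    (p*R - C) / (1 - p), as in part (c)."""
    profitable = [(p, R, C) for (p, R, C) in arms if p * R > C]
    return sorted(profitable,
                  key=lambda a: (a[0] * a[1] - a[2]) / (1 - a[0]),
                  reverse=True)

def expected_net_reward(order):
    """Expected total reward minus cost of a play order, via the chain
    formula: sum_k p_{i_1} ... p_{i_{k-1}} (p_{i_k} R_{i_k} - C_{i_k})."""
    total, survive = 0.0, 1.0  # survive = prob. all previous plays won
    for p, R, C in order:
        total += survive * (p * R - C)
        survive *= p
    return total

# Hypothetical arms (p, R, C); all satisfy p*R > C here.
arms = [(0.5, 10.0, 2.0), (0.9, 4.0, 1.0), (0.2, 30.0, 5.0)]
best = optimal_order(arms)
```

For these data the indices are 26, 6, and 1.25, so the sketch plays the (0.9, 4.0, 1.0) arm first; a brute-force check over all orderings confirms that no permutation does better.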

(d) (8 points) Create additional arms with reward β^{m−1} R_i corresponding to playing arm i for the mth time. However, create new arms only for the finite number of values of m for which p_i β^{m−1} R_i > C_i. We initially ignore the constraint that arm (i, l) must be played before arm (i, l+1) for all i, l. Using the results of part (c), the optimal policy for this relaxed problem is to select the set of arms (i, m) satisfying p_i β^{m−1} R_i > C_i, play them in order of nonincreasing (p_i β^{m−1} R_i − C_i)/(1 − p_i), and then stop. Notice that this optimal policy, for any i, always plays arm (i, l) before arm (i, l+1) for all l, and is therefore also optimal over the set of orderings that obey the constraint.

Problem 3 (40 points)

(a) (13 points) The state is (x_k, d_k), where d_k takes the value 1 or 2 depending on whether the common distribution of the w_k is F_1 or F_2. The variable d_k stays constant (i.e., it satisfies d_{k+1} = d_k for all k), but is not observed perfectly. Instead, the sample demand values w_0, w_1, ... are observed (w_k = x_k + u_k − x_{k+1}) and provide information regarding the value of d_k. In particular, given the a priori probability q and the demand values w_0, ..., w_{k−1}, we can calculate the conditional probability that w_k will be generated according to F_1.

(b) (13 points) A suitable sufficient statistic is (x_k, q_k), where

q_k = P(d_k = 1 | w_0, ..., w_{k−1}).

The conditional probability q_k evolves according to

q_{k+1} = q_k P(w_k | F_1) / [ q_k P(w_k | F_1) + (1 − q_k) P(w_k | F_2) ],  q_0 = q,

where P(· | F_i) denotes probability under the distribution F_i, and where we assume that w_k can take a finite number of values under the distributions F_1 and F_2.
The initial step of the DP algorithm in terms of this sufficient statistic (at stage N − 1, since there is no terminal cost) is

J_{N−1}(x_{N−1}, q_{N−1}) = min_{u_{N−1} ≥ 0} [ c u_{N−1}
  + q_{N−1} E{ h max(0, w_{N−1} − x_{N−1} − u_{N−1}) + p max(0, x_{N−1} + u_{N−1} − w_{N−1}) | F_1 }
  + (1 − q_{N−1}) E{ h max(0, w_{N−1} − x_{N−1} − u_{N−1}) + p max(0, x_{N−1} + u_{N−1} − w_{N−1}) | F_2 } ],

where E{· | F_i} denotes expected value with respect to the distribution F_i. The typical step of the DP algorithm is

J_k(x_k, q_k) = min_{u_k ≥ 0} [ c u_k
  + q_k E{ h max(0, w_k − x_k − u_k) + p max(0, x_k + u_k − w_k) + J_{k+1}(x_k + u_k − w_k, φ_k(q_k, w_k)) | F_1 }
  + (1 − q_k) E{ h max(0, w_k − x_k − u_k) + p max(0, x_k + u_k − w_k) + J_{k+1}(x_k + u_k − w_k, φ_k(q_k, w_k)) | F_2 } ],

where

φ_k(q_k, w_k) = q_k P(w_k | F_1) / [ q_k P(w_k | F_1) + (1 − q_k) P(w_k | F_2) ].
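The recursion for q_k is just a Bayes update, which can be made concrete with a short Python sketch. This is illustrative only: the two finite-support demand distributions below are made up.

```python
def bayes_update(q, w, F1, F2):
    """One step of the sufficient-statistic recursion:
    q_{k+1} = q P(w|F1) / (q P(w|F1) + (1-q) P(w|F2)).
    F1, F2: dicts mapping demand value -> probability.
    Assumes w has positive probability under F1 or F2, so den > 0."""
    num = q * F1.get(w, 0.0)
    den = num + (1 - q) * F2.get(w, 0.0)
    return num / den

# Hypothetical finite-support demand distributions.
F1 = {0: 0.1, 1: 0.6, 2: 0.3}
F2 = {0: 0.4, 1: 0.2, 2: 0.4}

q = 0.5  # a priori probability that F1 is the correct distribution
for w in [1, 1, 0]:  # observed demands w_0, w_1, w_2
    q = bayes_update(q, w, F1, F2)
```

With these numbers, two demands of 1 (more likely under F_1) push the belief up to 0.9, and the subsequent demand of 0 (more likely under F_2) pulls it back down to 0.09/0.13 ≈ 0.69.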

(c) (14 points) It can be shown inductively, as in the text, that J_k(x_k, q_k) is convex and coercive as a function of x_k for fixed q_k. For a fixed value of q_k, the minimization on the right-hand side of the DP recursion is exactly the same as in the text, with the probability distribution of w_k being the mixture of the distributions F_1 and F_2 with corresponding probabilities q_k and 1 − q_k. It follows that for each value of q_k there is a threshold S_k(q_k) such that it is optimal to order the amount S_k(q_k) − x_k if S_k(q_k) > x_k, and to order nothing otherwise. In particular, S_k(q_k) minimizes over y the function

c y + q_k E{ h max(0, w_k − y) + p max(0, y − w_k) + J_{k+1}(y − w_k, φ_k(q_k, w_k)) | F_1 }
  + (1 − q_k) E{ h max(0, w_k − y) + p max(0, y − w_k) + J_{k+1}(y − w_k, φ_k(q_k, w_k)) | F_2 }.
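The threshold structure can be illustrated numerically for the last stage, where J_N ≡ 0 and the minimand above reduces to a single-stage mixture cost. The sketch below is not part of the exam solutions: the demand distributions, the cost parameters, and the grid of candidate order-up-to levels are all made up, and minimizing c·y over y is equivalent to minimizing c·u = c·(y − x) since c·x is a constant.

```python
def stage_cost(y, q, F1, F2, c, h, p):
    """Single-stage cost of ordering up to level y when demand follows
    the q/(1-q) mixture of F1 and F2 (last stage, so J_{k+1} = 0).
    In this problem's notation h multiplies shortage (w - y)+ and
    p multiplies leftover stock (y - w)+."""
    mix = {w: q * F1.get(w, 0.0) + (1 - q) * F2.get(w, 0.0)
           for w in set(F1) | set(F2)}
    return c * y + sum(prob * (h * max(0, w - y) + p * max(0, y - w))
                       for w, prob in mix.items())

def threshold(q, F1, F2, c, h, p, levels):
    """S(q): the order-up-to level minimizing the stage cost on a grid."""
    return min(levels, key=lambda y: stage_cost(y, q, F1, F2, c, h, p))

# Hypothetical data: finite-support demands and positive scalars with p > c.
F1 = {0: 0.1, 1: 0.6, 2: 0.3}
F2 = {0: 0.4, 1: 0.2, 2: 0.4}
c, h, p = 1.0, 4.0, 2.0
S = threshold(0.5, F1, F2, c, h, p, levels=[0, 1, 2])
# Policy at this q: order S - x if S > x, else order nothing.
```

For these data the mixture at q = 0.5 is {0: 0.25, 1: 0.4, 2: 0.35}, and evaluating the grid gives costs 4.4, 2.9, and 3.8 at y = 0, 1, 2, so the threshold is S(0.5) = 1. Re-running threshold for a range of q values traces out the function S_k(q_k) described above.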