OPTIMIZATION AND CONTROL


Richard Weber

Contents

DYNAMIC PROGRAMMING

1 Dynamic Programming: The Optimality Equation
   Control as optimization over time. The principle of optimality. Example: the shortest path problem. The optimality equation. Markov decision processes.
2 Some Examples of Dynamic Programming
   Example: managing spending and savings. Example: exercising a stock option. Example: accepting the best offer.
3 Dynamic Programming over the Infinite Horizon
   Discounted costs. Example: job scheduling. The infinite-horizon case. The optimality equation in the infinite-horizon case. Example: selling an asset.
4 Positive Programming
   Example: possible lack of an optimal policy. Characterization of the optimal policy. Example: optimal gambling. Value iteration. Example: pharmaceutical trials.
5 Negative Programming
   Stationary policies. Characterization of the optimal policy. Optimal stopping over a finite horizon. Example: optimal parking. Optimal stopping over the infinite horizon.
6 Average-cost Programming
   Average-cost optimization. Example: admission control at a queue. Value iteration bounds. Policy improvement.

LQG SYSTEMS

7 LQ Models
   The LQ regulation model. The Riccati recursion. White noise disturbances. LQ regulation in continuous-time.
8 Controllability
   Controllability. Controllability in continuous-time. Example: broom balancing. Example: satellite in a planar orbit.
9 Infinite Horizon Limits
   Linearization of nonlinear models. Stabilizability. Example: pendulum. Infinite-horizon LQ regulation. The A, B, C system.
10 Observability
   Observability. Observability in continuous-time. Examples. Imperfect state observation with noise.
11 Kalman Filtering and Certainty Equivalence
   Preliminaries. The Kalman filter. Certainty equivalence. Example: inertialess rocket with noisy position sensing.

CONTINUOUS-TIME MODELS

12 Dynamic Programming in Continuous Time
   The optimality equation. Example: LQ regulation. Example: estate planning. Example: harvesting.
13 Pontryagin's Maximum Principle
   Heuristic derivation. Example: bringing a particle to rest in minimal time. Connection with Lagrangian multipliers. Example: use of the transversality conditions.
14 Applications of the Maximum Principle
   Problems with terminal conditions. Example: monopolist. Example: insects as optimizers. Example: rocket thrust optimization.
15 Controlled Markov Jump Processes
   The dynamic programming equation. The case of a discrete state space. Uniformization in the infinite horizon case. Example: admission control at a queue.
16 Controlled Diffusion Processes
   Diffusion processes and controlled diffusion processes. Example: noisy LQ regulation in continuous time. Example: a noisy second order system. Example: passage to a stopping set.

Index

Schedules

The first 6 lectures are devoted to dynamic programming in discrete time and cover both finite and infinite-horizon problems; discounted-cost, positive, negative and average-cost programming; the time-homogeneous Markov case; stopping problems; value iteration and policy improvement. The next 5 lectures are devoted to the LQG model (linear systems, quadratic costs) and cover the important ideas of controllability and observability; the Riccati equation; imperfect observation, certainty equivalence and the Kalman filter. The final 5 lectures are devoted to continuous-time models and include treatment of Pontryagin's maximum principle and the Hamiltonian; Markov decision processes on a countable state space and controlled diffusion processes.

Each of the 16 lectures is designed to be a somewhat self-contained unit, e.g., there will be one lecture on Negative Programming, one on Controllability, etc. Examples and applications are important in this course, so there are one or more worked examples in each lecture.

Examples sheets

There are three examples sheets, corresponding to the thirds of the course. There are two or three questions for each lecture, some theoretical and some of a problem nature. Each question is marked to indicate the lecture with which it is associated.

Lecture Notes and Handouts

There are printed lecture notes for the course and other occasional handouts. There are sheets summarising notation and what you are expected to know for the exams. The notes include a list of keywords and I will be drawing your attention to these as we go along. If you have a good grasp of the meaning of each of these keywords, then you will be well on your way to understanding the important concepts of the course.

WWW pages

Notes for the course, and other information, are on the web at

Books

The following books are recommended.

D. P. Bertsekas, Dynamic Programming, Prentice Hall.
D. P. Bertsekas, Dynamic Programming and Optimal Control, Volumes I and II, Prentice Hall.
L. M. Hocking, Optimal Control: An Introduction to the Theory and Applications, Oxford.
S. Ross, Introduction to Stochastic Dynamic Programming, Academic Press.
P. Whittle, Optimization Over Time, Volumes I and II, Wiley.

Ross's book is probably the easiest to read. However, it only covers Part I of the course. Whittle's book is good for Part II and Hocking's book is good for Part III. The recent book by Bertsekas is useful for all parts. Many other books address the topics of the course and a collection can be found in Sections 3B and 3D of the DPMMS library. Notation differs from book to book. My notation will be closest to that of Whittle's books and consistent throughout.
For example, I will always denote a minimal cost function by F(·) (whereas, in the recommended books you will find F, V, φ, J and many other symbols used for this quantity).

3 1 Dynamic Programming: The Optimality Eqation We introdce the idea of dynamic programming and the principle of optimality. We give notation for state-strctred models, and introdce ideas of feedback, open-loop, and closed-loop controls, a Markov decision process, and the idea that it can be sefl to model things in terms of time to go. 1.1 Control as optimization over time Optimization is a key tool in modelling. Sometimes it is important to solve a problem optimally. Other times either a near-optimal soltion is good enogh, or the real problem does not have a single criterion by which a soltion can be jdged. However, even then optimization is sefl as a way to test thinking. If the optimal soltion is ridiclos it may sggest ways in which both modelling and thinking can be refined. Control theory is concerned with dynamic systems and their optimization over time. It acconts for the fact that a dynamic system may evolve stochastically and that key variables may be nknown or imperfectly observed (as we see, for instance, in the UK economy). This contrasts with optimization models in the IB corse (sch as those for LP and network flow models); these static and nothing was random or hidden. It is these three new featres: dynamic and stochastic evoltion, and imperfect state observation, that give rise to new types of optimization problem and which reqire new ways of thinking. We cold spend an entire lectre discssing the importance of control theory and tracing its development throgh the windmill, steam governor, and so on. Sch classic control theory is largely concerned with the qestion of stability, and there is mch of this theory which we ignore, e.g., Nyqist criterion and dynamic lags. 1.2 The principle of optimality A key idea is that optimization over time can often be regarded as optimization in stages. We trade off or desire to obtain the lowest possible cost at the present stage against the implication this wold have for costs at ftre stages. The best action minimizes the sm of the cost incrred at the crrent stage and the least total cost that can be incrred from all sbseqent stages, conseqent on this decision. This is known as the Principle of Optimality. Definition 1.1 (Principle of Optimality) From any point on an optimal trajectory, the remaining trajectory is optimal for the corresponding problem initiated at that point. 1.3 Example: the shortest path problem Consider the stagecoach problem in which a traveler wishes to minimize the length of a jorney from town A to town J by first traveling to one of B, C or D and then onwards to one of E, F or G then onwards to one of H or I and the finally to J. Ths there are 4 stages. The arcs are marked with distances between towns. 1 A B C D 5 E F G Road system for stagecoach problem Soltion. Let F(X) be the minimal distance reqired to reach J from X. Then clearly, F(J) =, F(H) = 3 and F(I) = 4. F(F) = min 6 + F(H), 3 + F(I) = 7, and so on. Recrsively, we obtain F(A) = 11 and simltaneosly an optimal rote, i.e., A D F I J (althogh it is not niqe). The stdy of dynamic programming dates from Richard Bellman, who wrote the first book on the sbject (1957) and gave it its name. A very large nmber of problems can be treated this way. 1.4 The optimality eqation The optimality eqation in the general case. In discrete-time t takes integer vales, say t =, 1,.... Sppose t is a control variable whose vale is to be chosen at time t. Let U t 1 = (,..., t 1 ) denote the partial seqence of controls (or decisions) taken over the first t stages. 
Suppose the cost up to the time horizon h is given by

C = G(U_{h-1}) = G(u_0, u_1, ..., u_{h-1}).

Then the principle of optimality is expressed in the following theorem.

Theorem 1.2 (The principle of optimality) Define the functions

G(U_{t-1}, t) = inf_{u_t, u_{t+1}, ..., u_{h-1}} G(U_{h-1}).

Then these obey the recursion

G(U_{t-1}, t) = inf_{u_t} G(U_t, t+1),   t < h,

with terminal evaluation G(U_{h-1}, h) = G(U_{h-1}).

The proof is immediate from the definition of G(U_{t-1}, t), i.e.,

G(U_{t-1}, t) = inf_{u_t} inf_{u_{t+1}, ..., u_{h-1}} G(u_0, ..., u_{t-1}, u_t, u_{t+1}, ..., u_{h-1}).
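The backward recursion used in the shortest path example is easy to mechanise. The following short sketch (in Python; not part of the original notes) computes F(X) for a small stage-structured road network. The distance table is illustrative only, since not all of the figure's arc lengths are quoted in the text, but the recursion F(X) = min_Y {d(X, Y) + F(Y)} is exactly the one used above, and with these distances it reproduces F(H) = 3, F(I) = 4, F(F) = 7 and F(A) = 11.

# Backward recursion for the stagecoach problem.
# The arc lengths are illustrative placeholders; only a few are quoted in the text.
arcs = {
    'A': {'B': 2, 'C': 4, 'D': 3},
    'B': {'E': 7, 'F': 4, 'G': 6},
    'C': {'E': 3, 'F': 2, 'G': 4},
    'D': {'E': 4, 'F': 1, 'G': 5},
    'E': {'H': 1, 'I': 4},
    'F': {'H': 6, 'I': 3},
    'G': {'H': 3, 'I': 3},
    'H': {'J': 3},
    'I': {'J': 4},
}

F = {'J': 0}                      # minimal distance to J, filled in backwards
best_next = {}
for X in ['H', 'I', 'E', 'F', 'G', 'B', 'C', 'D', 'A']:
    Y = min(arcs[X], key=lambda Y: arcs[X][Y] + F[Y])
    best_next[X] = Y
    F[X] = arcs[X][Y] + F[Y]

route, X = ['A'], 'A'
while X != 'J':
    X = best_next[X]
    route.append(X)
print(F['A'], route)              # minimal total distance and one optimal route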

4 The state strctred case. The control variable t is chosen on the basis of knowing U t 1 = (,..., t 1 ), (which determines everything else). Bt a more economical representation of the past history is often sfficient. For example, we may not need to know the entire path that has been followed p to time t, bt only the place to which it has taken s. The idea of a state variable x R d is that its vale at t, denoted x t, is calclable from known qantities and obeys a plant eqation (or law of motion) x t+1 = a(x t, t, t). Sppose we wish to minimize a cost fnction of the form h 1 C = c(x t, t, t) + C h (x h ), (1.1) t= by choice of controls {,..., h 1 }. Define the cost from time t onwards as, h 1 C t = c(x τ, τ, τ) + C h (x h ), (1.2) τ=t and the minimal cost from time t onwards as an optimization over { t,..., h 1 } conditional on x t = x, F(x, t) = inf C t. t,..., h 1 Here F(x, t) is the minimal ftre cost from time t onward, given that the state is x at time t. Then by an indctive proof, one can show as in Theorem 1.2 that F(x, t) = infc(x,, t) + F(a(x,, t), t + 1), t < h, (1.3) with terminal condition F(x, h) = C h (x). Here x is a generic vale of x t. The minimizing in (1.3) is the optimal control (x, t) and vales of x,..., x t 1 are irrelevant. The optimality eqation (1.3) is also called the dynamic programming eqation (DP) or Bellman eqation. The DP eqation defines an optimal control problem in what is called feedback or closed loop form, with t = (x t, t). This is in contrast to the open loop formlation in which {,..., h 1 } are to be determined all at once at time. A policy (or strategy) is a rle for choosing the vale of the control variable nder all possible circmstances as a fnction of the perceived circmstances. To smmarise: (i) The optimal t is a fnction only of x t and t, i.e, t = (x t, t). (ii) The DP eqation expresses the optimal t in closed loop form. It is optimal whatever the past control policy may have been. (iii) The DP eqation is a backward recrsion in time (from which we get the optimm at h 1, then h 2 and so on.) The later policy is decided first. Life mst be lived forward and nderstood backwards. (Kierkegaard) Markov decision processes Consider now stochastic evoltion. Let X t = (x,..., x t ) and U t = (,..., t ) denote the x and histories at time t. As above, state strctre is characterised by the fact that the evoltion of the process is described by a state variable x, having vale x t at time t, with the following properties. (a) Markov dynamics: (i.e., the stochastic version of the plant eqation.) P(x t+1 X t, U t ) = P(x t+1 x t, t ). (b) Decomposable cost, (i.e., cost given by (1.1)). These assmptions define state strctre. For the moment we also reqire. (c) Perfect state observation: The crrent vale of the state is observable. That is, x t is known at the time at which t mst be chosen. So, letting W t denote the observed history at time t, we assme W t = (X t, U t 1 ). Note that C is determined by W h, so we might write C = C(W h ). These assmptions define what is known as a discrete-time Markov decision process (MDP). Many of or examples will be of this type. As above, the cost from time t onwards is given by (1.2). Denote the minimal expected cost from time t onwards by F(W t ) = inf π E πc t W t, where π denotes a policy, i.e., a rle for choosing the controls,..., h 1. We can assert the following theorem. Theorem 1.3 F(W t ) is a fnction of x t and t alone, say F(x t, t). 
It obeys the optimality equation

F(x_t, t) = inf_{u_t} { c(x_t, u_t, t) + E[F(x_{t+1}, t+1) | x_t, u_t] },   t < h,   (1.4)

with terminal condition F(x_h, h) = C_h(x_h). Moreover, a minimizing value of u_t in (1.4) (which is also only a function of x_t and t) is optimal.

Proof. The value of F(W_h) is C_h(x_h), so the asserted reduction of F is valid at time h. Assume it is valid at time t + 1. The DP equation is then

F(W_t) = inf_{u_t} { c(x_t, u_t, t) + E[F(x_{t+1}, t+1) | X_t, U_t] }.   (1.5)

But, by assumption (a), the right-hand side of (1.5) reduces to the right-hand member of (1.4). All the assertions then follow.
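As a concrete illustration of the backward recursion (1.4), here is a short computational sketch (not part of the notes) for a generic finite-state, finite-action Markov decision process specified by transition matrices and one-step costs; the particular numbers at the bottom are hypothetical.

import numpy as np

def finite_horizon_dp(c, P, C_h, h):
    # c[u]: (n,) one-step costs of action u; P[u]: (n, n) transition matrix;
    # C_h: (n,) terminal cost. Returns value functions F[t] and minimising actions.
    n = len(C_h)
    F = [None] * (h + 1)
    policy = [None] * h
    F[h] = np.asarray(C_h, dtype=float)
    for t in range(h - 1, -1, -1):
        # Q[u, x] = c(x, u) + E[F(x_{t+1}, t+1) | x_t = x, u_t = u]
        Q = np.array([c[u] + P[u] @ F[t + 1] for u in range(len(P))])
        policy[t] = Q.argmin(axis=0)
        F[t] = Q.min(axis=0)
    return F, policy

# Hypothetical 2-state, 2-action example.
P = [np.array([[0.9, 0.1], [0.4, 0.6]]),
     np.array([[0.2, 0.8], [0.7, 0.3]])]
c = [np.array([1.0, 2.0]), np.array([0.5, 3.0])]
F, policy = finite_horizon_dp(c, P, C_h=np.zeros(2), h=5)
print(F[0], policy[0])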

5 2 Some Examples of Dynamic Programming We illstrate the method of dynamic programming and some sefl tricks. 2.1 Example: managing spending and savings An investor receives annal income from a bilding society of x t ponds in year t. He consmes t and adds x t t to his capital, t x t. The capital is invested at interest rate θ 1%, and so his income in year t + 1 increases to x t+1 = a(x t, t ) = x t + θ(x t t ). He desires to maximize his total consmption over h years, C = h 1 t= t. Soltion. In the notation we have been sing, c(x t, t, t) = t, C h (x h ) =. This is a time-homogeneos model, in which neither costs nor dynamics depend on t. It is easiest to work in terms of time to go, s = h t. Let F s (x) denote the maximal reward obtainable, starting in state x and when there is time s to go. The dynamic programming eqation is F s (x) = max x + F s 1(x + θ(x )), where F (x) =, (since no more can be obtained once time h is reached.) Here, x and are generic vales for x s and s. We can sbstitte backwards and soon gess the form of the soltion. First, Next, F 1 (x) = max + F ( + θ(x )) = max + = x. x x F 2 (x) = max + F 1(x + θ(x )) = max + x + θ(x ). x x Since + x + θ(x ) linear in, its maximm occrs at = or = x, and so F 2 (x) = max(1 + θ)x, 2x = max1 + θ, 2x = ρ 2 x. This motivates the gess F s 1 (x) = ρ s 1 x. Trying this, we find F s (x) = max x + ρ s 1(x + θ(x )) = max(1 + θ)ρ s 1, 1 + ρ s 1 x = ρ s x. Ths or gess is verified and F s (x) = ρ s x, where ρ s obeys the recrsion implicit in the above, and i.e., ρ s = ρ s 1 + maxθρ s 1, 1. This gives { s s s ρ s = (1 + θ) s s s s s, where s is the least integer sch that s 1/θ, i.e., s = 1/θ. The optimal strategy is to invest the whole of the income in years,...,h s 1, (to bild p capital) and then consme the whole of the income in years h s,...,h 1. 5 There are several things worth remembering from this example. (i) It is often sefl to frame things in terms of time to go, s. (ii) Althogh the form of the dynamic programming eqation can sometimes look messy, try working backwards from F (x) (which is known). Often a pattern will emerge from which we can piece together a soltion. (iii) When the dynamics are linear, the optimal control lies at an extreme point of the set of feasible controls. This form of policy, which either consmes nothing or consmes everything, is known as bang-bang control. 2.2 Example: exercising a stock option The owner of a call option has the option to by a share at fixed striking price p. The option mst be exercised by day h. If he exercises the option on day t and then immediately sells the share at the crrent price x t, he can make a profit of x t p. Sppose the price seqence obeys the eqation x t+1 = x t + ǫ t, where the ǫ t are i.i.d. random variables for which E ǫ <. The aim is to exercise the option optimally. Let F s (x) be the vale fnction (maximal expected profit) when the share price is x and there are s days to go. Show that (i) F s (x) is non-decreasing in s, (ii) F s (x) x is non-increasing in x and (iii) F s (x) is continos in x. Dedce that the optimal policy can be characterised as follows. There exists a non-decreasing seqence {a s } sch that an optimal policy is to exercise the option the first time that x a s, where x is the crrent price and s is the nmber of days to go before expiry of the option. Soltion. The state variable at time t is, strictly speaking, x t pls a variable which indicates whether the option has been exercised or not. 
However, it is only the latter case which is of interest, so x is the effective state variable. Since dynamic programming makes its calculations backwards, from the termination point, it is often advantageous to write things in terms of the time to go, s = h − t. So if we let F_s(x) be the value function (maximal expected profit) with s days to go then F_0(x) = max{x − p, 0}, and so the dynamic programming equation is

F_s(x) = max{x − p, E F_{s-1}(x + ǫ)},   s = 1, 2, ...

Note that the expectation operator comes outside, not inside, F_{s-1}(·). One can use induction to show (i), (ii) and (iii). For example, (i) is obvious, since increasing s means we have more time over which to exercise the option. However, for a formal proof,

F_1(x) = max{x − p, E F_0(x + ǫ)} ≥ max{x − p, 0} = F_0(x).

Now suppose, inductively, that F_{s-1} ≥ F_{s-2}. Then

F_s(x) = max{x − p, E F_{s-1}(x + ǫ)} ≥ max{x − p, E F_{s-2}(x + ǫ)} = F_{s-1}(x),

6 whence F s is non-decreasing in s. Similarly, an indctive proof of (ii) follows from F s (x) x } {{ } = max{ p, EF s 1(x + ǫ) (x + ǫ) } {{ } + E(ǫ)}, since the left hand nderbraced term inherits the non-increasing character of the right hand nderbraced term. Ths the optimal policy can be characterized as stated. For from (ii), (iii) and the fact that F s (x) x p it follows that there exists an a s sch that F s (x) is greater that x p if x < a s and eqals x p if x a s. It follows from (i) that a s is non-decreasing in s. The constant a s is the smallest x for which F s (x) = x p. 2.3 Example: accepting the best offer We are to interview h candidates for a job. At the end of each interview we mst either hire or reject the candidate we have jst seen, and may not change this decision later. Candidates are seen in random order and can be ranked against those seen previosly. The aim is to maximize the probability of choosing the candidate of greatest rank. Soltion. Let W t be the history of observations p to time t, i.e., after we have interviewed the tth candidate. All that matters are the vale of t and whether the tth candidate is better than all her predecessors: let x t = 1 if this is tre and x t = if it is not. In the case x t = 1, the probability she is the best of all h candidates is P(best of h best of first t) = P(best of h) P(best of first t) = 1/h 1/t = t h. Now the fact that the tth candidate is the best of the t candidates seen so far places no restriction on the relative ranks of the first t 1 candidates; ths x t = 1 and W t 1 are statistically independent and we have P(x t = 1 W t 1 ) = P(W t 1 x t = 1) P(x t = 1) = P(x t = 1) = 1 P(W t 1 ) t. Let F(, t 1) be the probability that nder an optimal policy we select the best candidate, given that we have seen t 1 candidates so far and the last one was not the best of those. Dynamic programming gives F(, t 1) = t 1 F(, t) + 1 ( ) ( t t 1 t t max h, F(, t) = max F(, t) + 1 ) t h, F(, t) These imply F(, t 1) F(, t) for all t h. Therefore, since t/h and F(, t) are respectively increasing and non-increasing in t, it mst be that for small t we have F(, t) > t/h and for large t we have F(, t) < t/h. Let t be the smallest t sch that F(, t) t/h. Then F(, t ), t < t, F(, t 1) = t 1 F(, t) + 1 t h, t t. 7 Solving the second of these backwards from the point t = h, F(, h) =, we obtain whence F(, t 1) t 1 = 1 F(, t) + = = h(t 1) t F(, t 1) = t 1 h h 1 τ=t 1 1 h(t 1) + 1 ht h(h 1), 1 τ, t t. Since we reqire F(, t ) t /h, it mst be that t is the smallest integer satisfying h 1 1 τ 1. τ=t For large h the sm on the left above is abot log(h/t ), so log(h/t ) 1 and we find t h/e. The optimal policy is to interview h/e candidates, bt withot selecting any of these, and then select the first one thereafter that is the best of all those seen so far. The probability of sccess is F(, t ) t /h 1/e = It is srprising that the probability of sccess is so large for arbitrarily large h. There are a cople lessons in this example. (i) It is often sefl to try to establish the fact that terms over which a maximm is being taken are monotone in opposite directions, as we did with t/h and F(, t). (ii) A typical approach is to first determine the form of the soltion, then find the optimal cost (reward) fnction by backward recrsion from the terminal point, where its vale is known. 8
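A quick numerical check of this rule (a sketch, not in the notes): for a given h, the policy that passes over the first k candidates and then accepts the first candidate better than all seen so far succeeds with probability (k/h) Σ_{t=k}^{h-1} 1/t (and 1/h for k = 0). The code below confirms that the best cutoff is close to h/e and the success probability close to 1/e.

import math

def success_prob(h, k):
    if k == 0:
        return 1.0 / h
    return (k / h) * sum(1.0 / t for t in range(k, h))

h = 100
best_k = max(range(h), key=lambda k: success_prob(h, k))
print(best_k, h / math.e)                    # cutoff, compared with h/e
print(success_prob(h, best_k), 1 / math.e)   # success probability, compared with 1/e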

7 3 Dynamic Programming over the Infinite Horizon We define the cases of disconted, negative and positive dynamic programming and establish the validity of the optimality eqation for an infinite horizon problem. 3.1 Disconted costs For a discont factor, β (, 1, the disconted-cost criterion is defined as h 1 C = β t c(x t, t, t) + β h C h (x h ). (3.1) t= This simplifies things mathematically, particlarly when we want to consider an infinite horizon. If costs are niformly bonded, say c(x, ) < B, and disconting is strict (β < 1) then the infinite horizon cost is bonded by B/(1 β). In economic langage, if there is an interest rate of r% per nit time, then a nit amont of money at time t is worth ρ = 1+r/1 at time t+1. Eqivalently, a nit amont at time t+1 has present vale β = 1/ρ. The fnction, F(x, t), which expresses the minimal present vale at time t of expected-cost from time t p to h is F(x, t) = inf E t,..., h 1 The DP eqation is now where F(x, h) = C h (x). h 1 β τ t c(x τ, τ, τ) + β h t C h (x h ) x t = x τ=t. (3.2) F(x, t) = inf c(x,, t) + βef(a(x,, t), t + 1), t < h, (3.3) 3.2 Example: job schedling A collection of n jobs is to be processed in arbitrary order by a single machine. Job i has processing time p i and when it completes a reward r i is obtained. Find the order of processing that maximizes the sm of the disconted rewards. Soltion. Here we take time k as the point at which the n k th job has jst been completed and the state at time k as the collection of ncompleted jobs, say S k. The dynamic programming eqation is F k (S k ) = max i S k r i β pi + β pi F k 1 (S k {i}). Obviosly F ( ) =. Applying the method of dynamic programming we first find F 1 ({i}) = r i β pi. Then, working backwards, we find F 2 ({i, j}) = maxr i β pi + β pi+pj r j, r j β pj + β pj+pi r i. There will be 2 n eqations to evalate, bt with perseverance we can determine F n ({1, 2,..., n}). However, there is a simpler way. 9 An interchange argment. Sppose that jobs are schedled in the order i 1,..., i k, i, j, i k+2,...,i n. Compare the reward of this schedle to one in which the order of jobs i and j are reversed: i 1,..., i k, j, i, i k+2,..., i n. The rewards nder the two schedles are respectively R 1 + β T+pi r i + β T+pi+pj r j + R 2 and R 1 + β T+pj r j + β T+pj+pi r i + R 2, where T = p i1 + + p ik, and R 1 and R 2 are respectively the sm of the rewards de to the jobs coming before and after jobs i, j; these are the same nder both schedles. The reward of the first schedle is greater if r i β pi /(1 β pi ) > r j β pj /(1 β pj ). Hence a schedle can be optimal only if the jobs are taken in decreasing order of the indices r i β pi /(1 β pi ). This type of reasoning is known as an interchange argment. There are a cople points to note. (i) An interchange argment can be sefl for solving a decision problem abot a system that evolves in stages. Althogh sch problems can be solved by dynamic programming, an interchange argment when it works is sally easier. (ii) The decision points need not be eqally spaced in time. Here they are the points at which a nmber of jobs have been completed. 3.3 The infinite-horizon case In the finite-horizon case the cost fnction is obtained simply from (3.3) by the backward recrsion from the terminal point. However, when the horizon is infinite there is no terminal point and so the validity of the optimality eqation is no longer obvios. Let s consider the time-homogeneos Markov case, in which costs and dynamics do not depend on t, i.e., c(x,, t) = c(x, ). 
Suppose also that there is no terminal cost, i.e., C_h(x) = 0. Define the s-horizon cost under policy π as

F_s(π, x) = E_π[ Σ_{t=0}^{s-1} β^t c(x_t, u_t) | x_0 = x ],

where E_π denotes expectation over the path of the process under policy π. If we take the infimum with respect to π we have the infimal s-horizon cost

F_s(x) = inf_π F_s(π, x).

Clearly, this always exists and satisfies the optimality equation

F_s(x) = inf_u { c(x, u) + β E[F_{s-1}(x_1) | x_0 = x, u_0 = u] },   (3.4)

with terminal condition F_0(x) = 0. The infinite-horizon cost under policy π is also quite naturally defined as

F(π, x) = lim_{s→∞} F_s(π, x).   (3.5)

This limit need not exist, but it will do so under any of the following scenarios.

8 D (disconted programming): < β < 1, and c(x, ) < B for all x,. N (negative programming): < β 1 and c(x, ) for all x,. P (positive programming): < β 1 and c(x, ) for all x,. Notice that the names negative and positive appear to be the wrong way arond with respect to the sign of c(x, ). However, the names make sense if we think of eqivalent problems of maximizing rewards. Maximizing positive rewards (P) is the same thing as minimizing negative costs. Maximizing negative rewards (N) is the same thing as minimizing positive costs. In cases N and P we sally take β = 1. The existence of the limit (possibly infinite) in (3.5) is assred in cases N and P by monotone convergence, and in case D becase the total cost occrring after the sth step is bonded by β s B/(1 β). 3.4 The optimality eqation in the infinite-horizon case The infimal infinite-horizon cost is defined as F(x) = inf π F(π, x) = inf π The following theorem jstifies or writing an optimality eqation. lim F s(π, x). (3.6) s Theorem 3.1 Sppose D, N, or P holds. Then F(x) satisfies the optimality eqation F(x) = inf {c(x, ) + βef(x 1) x = x, = )}. (3.7) Proof. We first prove that holds in (3.7). Sppose π is a policy, which chooses = when x = x. Then F s (π, x) = c(x, ) + βef s 1 (π, x 1 ) x = x, =. (3.8) Either D, N or P is sfficient to allow s to takes limits on both sides of (3.8) and interchange the order of limit and expectation. In cases N and P this is becase of monotone convergence. Infinity is allowed as a possible limiting vale. We obtain F(π, x) = c(x, ) + βef(π, x 1 ) x = x, = c(x, ) + βef(x 1 ) x = x, = inf {c(x, ) + βef(x 1) x = x, = }. Minimizing the left hand side over π gives. To prove, fix x and consider a policy π that having chosen and reached state x 1 then follows a policy π 1 which is sboptimal by less than ǫ from that point, i.e., F(π 1, x 1 ) F(x 1 )+ǫ. Note that sch a policy mst exist, by definition of F, althogh π 1 will depend on x 1. We have 11 F(x) F(π, x) = c(x, ) + βef(π 1, x 1 ) x = x, c(x, ) + βef(x 1 ) + ǫ x = x, c(x, ) + βef(x 1 ) x = x, + βǫ. Minimizing the right hand side over and recalling ǫ is arbitrary gives. 3.5 Example: selling an asset A spectlator owns a rare collection of tlip blbs and each day has one opportnity to sell it, which he may either accept or reject. The potential sale prices are independently and identically distribted with probability density fnction g(x), x. Each day there is a probability 1 β that the market for tlip blbs will collapse, making his blb collection completely worthless. Find the policy that maximizes his expected retrn and express it as the niqe root of an eqation. Show that if β > 1/2, g(x) = 2/x 3, x 1, then he shold sell the first time the sale price is at least β/(1 β). Soltion. There are only two states, depending on whether he has sold the collection or not. Let these be and 1 respectively. The optimality eqation is Hence F(1) = y= = βf(1) + = βf(1) + (1 β)f(1) = maxy, βf(1) g(y)dy y= y=βf(1) y=βf(1) maxy βf(1), g(y)dy y βf(1) g(y)dy y βf(1) g(y)dy. (3.9) That this eqation has a niqe root, F(1) = F, follows from the fact that left and right hand sides are increasing and decreasing in F(1) respectively. Ths he shold sell when he can get at least βf. His maximal reward is F. Consider the case g(y) = 2/y 3, y 1. The left hand side of (3.9) is less that the right hand side at F(1) = 1 provided β > 1/2. In this case the root is greater than 1 and we compte it as (1 β)f(1) = 2/βF(1) βf(1)/βf(1) 2, and ths F = 1/ β(1 β) and βf = β/(1 β). 
If β ≤ 1/2 he should sell at any price.

Notice that discounting arises in this problem because at each stage there is a probability 1 − β that a catastrophe will occur that brings things to a sudden end. This characterization of a manner in which discounting can arise is often quite useful.
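Equation (3.9) is easy to solve numerically for a general price density, using the fact noted above that its left and right hand sides are respectively increasing and decreasing in F(1). The sketch below (not in the notes) finds F* by bisection; for the density g(y) = 2/y^3 on y ≥ 1 it should agree with F* = 1/sqrt(β(1−β)) and reservation price βF* = sqrt(β/(1−β)).

import numpy as np

def reservation_value(g, beta, y_lo=1.0, y_hi=1e4, n=500_000):
    # Solve (1 - beta) F = integral of max(y - beta*F, 0) g(y) dy by bisection.
    ys = np.linspace(y_lo, y_hi, n)
    gy = g(ys)
    def rhs(F):
        return np.trapz(np.maximum(ys - beta * F, 0.0) * gy, ys)
    lo, hi = 0.0, 1e6
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if (1 - beta) * mid > rhs(mid):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

beta = 0.8
F_star = reservation_value(lambda y: 2.0 / y**3, beta)
print(F_star, 1 / np.sqrt(beta * (1 - beta)))        # both about 2.5
print(beta * F_star, np.sqrt(beta / (1 - beta)))     # both about 2.0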

9 4 Positive Programming We address the special theory of maximizing positive rewards, (noting that there may be no optimal policy bt that if a policy has a vale fnction that satisfies the optimality eqation then it is optimal), and the method of vale iteration. 4.1 Example: possible lack of an optimal policy. Positive programming concerns minimizing non-positive costs, c(x, ). The name originates from the eqivalent problem of maximizing non-negative rewards, r(x, ), and for this section we present reslts in that setting. The following example shows that there may be no optimal policy. Sppose the possible states are the non-negative integers and in state x we have a choice of either moving to state x + 1 and receiving no reward, or moving to state, obtaining reward 1 1/i, and then remaining in state thereafter and obtaining no frther reward. The optimality eqations is F(x) = max{1 1/x, F(x + 1)} x >. Clearly F(x) = 1, x >, bt the policy that chooses the maximizing action in the optimality eqation always moves on to state x+1 and hence has zero reward. Clearly, there is no policy that actally achieves a reward of Characterization of the optimal policy The following theorem provides a necessary and sfficient condition for a policy to be optimal: namely, its vale fnction mst satisfy the optimality eqation. This theorem also holds for the case of strict disconting and bonded costs. Theorem 4.1 Sppose D or P holds and π is a policy whose vale fnction F(π, x) satisfies the optimality eqation Then π is optimal. F(π, x) = sp{r(x, ) + βef(π, x 1 ) x = x, = }. Proof. Let π be any policy and sppose it takes t (x) = f t (x). Since F(π, x) satisfies the optimality eqation, F(π, x) r(x, f (x)) + βe π F(π, x 1 ) x = x, = f (x). By repeated sbstittion of this into itself, we find s 1 F(π, x) E π β t r(x t, t ) x = x + β s E π F(π, x s ) x = x. (4.1) t= 13 In case P we can drop the final term on the right hand side of (4.1) (becase it is non-negative) and then let s ; in case D we can let s directly, observing that this term tends to zero. Either way, we have F(π, x) F(π, x). 4.3 Example: optimal gambling A gambler has i ponds and wants to increase this to N. At each stage she can bet any fraction of her capital, say j i. Either she wins, with probability p, and now has i+j ponds, or she loses, with probability q = 1 p, and has i j ponds. Let the state space be {, 1,...,N}. The game stops pon reaching state or N. The only non-zero reward is 1, pon reaching state N. Sppose p 1/2. Prove that the timid strategy, of always betting only 1 pond, maximizes the probability of the gambler attaining N ponds. Soltion. The optimality eqation is F(i) = max{pf(i + j) + qf(i j)}. j,j i To show that the timid strategy is optimal we need to find its vale fnction, say G(i), and show that it is a soltion to the optimality eqation. We have G(i) = pg(i + 1) + qg(i 1), with G() =, G(N) = 1. This recrrence gives G(i) = 1 (q/p) i 1 (q/p) N p > 1/2, i N p = 1/2. If p = 1/2, then G(i) = i/n clearly satisfies the optimality eqation. If p > 1/2 we simply have to verify that G(i) = 1 { } (q/p)i 1 (q/p) i+j 1 (q/p) i j 1 (q/p) N = max p j:j i 1 (q/p) N + q 1 (q/p) N. It is a simple exercise to show that j = 1 maximizes the right hand side. 4.4 Vale iteration The infimal cost fnction F can be approximated by sccessive approximation or vale iteration. This is important and practical method of compting F. Let s define F (x) = lim F s(x) = lim inf F s(π, x). 
(4.2)

This exists (by monotone convergence under N or P, or by the fact that under D the cost incurred after time s is vanishingly small). Notice that (4.2) reverses the order of lim_{s→∞} and inf_π in (3.6). The following theorem states that we can interchange the order of these operations and that therefore

10 F s (x) F(x). However, in case N we need an additional assmption: F (finite actions): There are only finitely many possible vales of in each state. Theorem 4.2 Sppose that D or P holds, or N and F hold. Then F (x) = F(x). Proof. First we prove. Given any π, F (x) = lim s F s(x) = lim s inf π F s(π, x) lim s F s( π, x) = F( π, x). Taking the infimm over π gives F (x) F(x). Now we prove. In the positive case, c(x, ), so F s (x) F(x). Now let s. In the disconted case, with c(x, ) < B, imagine sbtracting B > from every cost. This redces the infinite-horizon cost nder any policy by exactly B/(1 β) and F(x) and F (x) also decrease by this amont. All costs are now negative, so the reslt we have jst proved applies. Alternatively, note that F s (x) β s B/(1 β) F(x) F s (x) + β s B/(1 β) (can yo see why?) and hence lim s F s (x) = F(x). In the negative case, F (x) = lim s min {c(x, ) + EF s 1(x 1 ) x = x, = } = min{c(x, ) + lim EF s 1(x 1 ) x = x, = } s = min{c(x, ) + EF (x 1 ) x = x, = }, (4.3) where the first eqality follows becase the minimm is over a finite nmber of terms and the second eqality follows by Lebesge monotone convergence (since F s (x) increases in s). Let π be the policy that chooses the minimizing action on the right hand side of (4.3). This implies, by sbstittion of (4.3) into itself, and sing the fact that N implies F, s 1 F (x) = E π c(x t, t ) + F (x s ) x = x t= s 1 E π c(x t, t ) x = x. t= Letting s gives F (x) F(π, x) F(x). other patients. The new drg is ntested and has an nknown probability of sccess θ, which the doctor believes to be niformly distribted over, 1. He treats one patient per day and mst choose which drg to se. Sppose he has observed s sccesses and f failres with the new drg. Let F(s, f) be the maximal expected-disconted nmber of ftre patients who are sccessflly treated if he chooses between the drgs optimally from this point onwards. For example, if he ses only the established drg, the expecteddisconted nmber of patients sccessflly treated is p + βp + β 2 p + = p/(1 β). The posterior distribtion of θ is f(θ s, f) = (s + f + 1)! θ s (1 θ) f, θ 1, s!f! and the posterior mean is θ(s, f) = (s + 1)/(s + f + 2). The optimality eqation is p F(s, f) = max 1 β, s + 1 f + 1 (1 + βf(s + 1, f)) + βf(s, f + 1). s + f + 2 s + f + 2 It is not possible to give a nice expression for F, bt we can find an approximate nmerical soltion. If s + f is very large, say 3, then θ(s, f) = (s + 1)/(s + f + 2) is a good approximation to θ. Ths we can take F(s, f) (1 β) 1 maxp, θ(s, f), s + f = 3 and work backwards. For β =.95, one obtains the following table. f s These nmbers are the greatest vales of p for which it is worth contining with at least one more trial of the new drg. For example, with s = 3, f = 3 it is worth contining with the new drg when p =.6 < At this point the probability that the new drg will sccessflly treat the next patient is.5 and so the doctor shold actally prescribe the drg that is least likely to cre! This example shows the difference between a myopic policy, which aims to maximize immediate reward, and an optimal policy, which forgets immediate reward in order to gain information and possibly greater rewards later on. Notice that it is worth sing the new drg at least once if p <.7614, even thogh at its first se the new drg will only be sccessfl with probability Example: pharmacetical trials A doctor has two drgs available to treat a disease. 
One is a well-established drug and is known to work for a given patient with probability p, independently of its success for other patients.
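The numerical scheme described in this example is easy to implement. The sketch below (not part of the notes) takes β = 0.95, imposes the approximation F(s, f) ≈ (1 − β)^{-1} max{p, θ(s, f)} on a large boundary s + f = 300, and works the optimality equation backwards; for a given value of p it then reports whether the new drug is worth trying in a few example states.

beta, N = 0.95, 300                 # discount factor and truncation level s + f = N

def values(p):
    # F[s][f] for s + f <= N, computed backwards in s + f.
    F = [[0.0] * (N + 1) for _ in range(N + 1)]
    for total in range(N, -1, -1):
        for s in range(total + 1):
            f = total - s
            theta = (s + 1) / (s + f + 2)             # posterior mean success probability
            if total == N:
                F[s][f] = max(p, theta) / (1 - beta)  # boundary approximation
            else:
                use_old = p / (1 - beta)
                use_new = theta * (1 + beta * F[s + 1][f]) + (1 - theta) * beta * F[s][f + 1]
                F[s][f] = max(use_old, use_new)
    return F

p = 0.6
F = values(p)
for s, f in [(0, 0), (3, 3), (5, 10)]:                # example states only
    theta = (s + 1) / (s + f + 2)
    use_new = theta * (1 + beta * F[s + 1][f]) + (1 - theta) * beta * F[s][f + 1]
    print((s, f), 'try new drug' if use_new > p / (1 - beta) else 'use old drug')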

11 5 Negative Programming We address the special theory of minimizing positive costs, (noting that the action that extremizes the right hand side of the optimality eqation gives an optimal policy), and stopping problems and their soltion. 5.1 Stationary policies A Markov policy is a policy that specifies the control at time t to be simply a fnction of the state and time. In the proof of Theorem 4.1 we sed t = f t (x t ) to specify the control at time t. This is a convenient notation for a Markov policy, and we write π = (f, f 1,... ). If in addition the policy does not depend on time, it is said to be a stationary Markov policy, and we write π = (f, f,...) = f. 5.2 Characterization of the optimal policy Negative programming concerns minimizing non-negative costs, c(x, ). The name originates from the eqivalent problem of maximizing non-positive rewards, r(x, ). The following theorem gives a necessary and sfficient condition for a stationary policy to be optimal: namely, it mst choose the optimal on the right hand side of the optimality eqation. Note that in the statement of this theorem we are reqiring that the infimm over is attained as a minimm over. Theorem 5.1 Sppose D or N holds. Sppose π = f is the stationary Markov policy sch that c(x, f(x)) + βef(x 1 ) x = x, = f(x) Then F(π, x) = F(x), and π is optimal. = min c(x, ) + βef(x 1 ) x = x, =. Proof. Sppose this policy is π = f. Then by sbstitting the optimality eqation into itself and sing the fact that π specifies the minimizing control at each stage, s 1 F(x) = E π β t c(x t, t ) x = x + β s E π F(x s ) x = x. (5.1) t= In case N we can drop the final term on the right hand side of (5.1) (becase it is non-negative) and then let s ; in case D we can let s directly, observing that this term tends to zero. Either way, we have F(x) F(π, x). A corollary is that if assmption F holds then an optimal policy exists. Neither Theorem 5.1 or this corollary are tre for positive programming (c.f., the example in Section 4.1) Optimal stopping over a finite horizon One way that the total-expected cost can be finite is if it is possible to enter a state from which no frther costs are incrred. Sppose has jst two possible vales: = (stop), and = 1 (contine). Sppose there is a termination state, say, that is entered pon choosing the stopping action. Once this state is entered the system stays in that state and no frther cost is incrred thereafter. Sppose that stopping is mandatory, in that we mst contine for no more that s steps. The finite-horizon dynamic programming eqation is therefore F s (x) = min{k(x), c(x) + EF s 1 (x 1 ) x = x, = 1}, (5.2) with F (x) = k(x), c() =. Consider the set of states in which it is at least as good to stop now as to contine one more step and then stop: S = {x : k(x) c(x) + Ek(x 1 ) x = x, = 1)}. Clearly, it cannot be optimal to stop if x S, since in that case it wold be strictly better to contine one more step and then stop. The following theorem characterises all finite-horizon optimal policies. Theorem 5.2 Sppose S is closed (so that once the state enters S it remains in S.) Then an optimal policy for all finite horizons is: stop if and only if x S. Proof. The proof is by indction. If the horizon is s = 1, then obviosly it is optimal to stop only if x S. Sppose the theorem is tre for a horizon of s 1. As above, if x S then it is better to contine for more one step and stop rather than stop in state x. If x S, then the fact that S is closed implies x 1 S and so F s 1 (x 1 ) = k(x 1 ). Bt then (5.2) gives F s (x) = k(x). 
So we should stop if x ∈ S. The optimal policy is known as a one-step look-ahead rule (OSLA).

5.4 Example: optimal parking

A driver is looking for a parking space on the way to his destination. Each parking space is free with probability p independently of whether other parking spaces are free or not. The driver cannot observe whether a parking space is free until he reaches it. If he parks s spaces from the destination, he incurs cost s, s = 0, 1, .... If he passes the destination without having parked the cost is D. Show that an optimal policy is to park in the first free space that is no further than s* from the destination, where s* is the greatest integer s such that (Dp + 1)q^s ≥ 1.

Solution. When the driver is s spaces from the destination it only matters whether the space is available (x = 1) or full (x = 0). The optimality equation gives

F_s(0) = qF_{s-1}(0) + pF_{s-1}(1),
F_s(1) = min{ s  (take available space),  qF_{s-1}(0) + pF_{s-1}(1)  (ignore available space) },

12 where F () = D, F (1) =. Sppose the driver adopts a policy of taking the first free space that is s or closer. Let the cost nder this policy be k(s), where k(s) = ps + qk(s 1), with k() = qd. The general soltion is of the form k(s) = q/p + s + cq s. So after sbstitting and sing the bondary condition at s =, we have k(s) = q ( p + s + D + 1 ) q s+1, s =, 1,.... p It is better to stop now (at a distance s from the destination) than to go on and take the first available space if s is in the stopping set S = {s : s k(s 1)} = {s : (Dp + 1)q s 1}. This set is closed (since s decreases) and so by Theorem 5.2 this stopping set describes the optimal policy. If the driver parks in the first available space past his destination and walk backs, then D = 1 + qd, so D = 1/p and s is the greatest integer sch that 2q s Optimal stopping over the infinite horizon Let s now consider the stopping problem over the infinite-horizon. As above, let F s (x) be the infimal cost given that we are reqired to stop by time s. Let F(x) be the infimal cost when all that is reqired is that we stop eventally. Since less cost can be incrred if we are allowed more time in which to stop, we have F s (x) F s+1 (x) F(x). Ths by monotone convergence F s (x) tends to a limit, say F (x), and F (x) F(x). Example: we can have F > F Consider the problem of stopping a symmetric random walk on the integers, where c(x) =, k(x) = exp( x). The policy of stopping immediately, π, has F(π, x) = exp( x), and this satisfies the infinite-horizon optimality eqation, F(x) = min{exp( x), (1/2)F(x + 1) + (1/2)F(x 1)}. However, π is not optimal. A symmetric random walk is recrrent, so we may wait ntil reaching as large an integer as we like before stopping; hence F(x) =. Indctively, one can see that F s (x) = exp( x). So F (x) > F(x). (Note: Theorem 4.2 says that F = F, bt that is in a setting in which there is no terminal cost and for different definitions of F s and F than we take here.) 19 Example: Theorem 4.1 is not tre for negative programming Consider the above example, bt now sppose one is allowed never to stop. Since contination costs are the optimal policy for all finite horizons and the infinite horizon is never to stop. So F(x) = and this satisfies the optimality eqation above. However, F(π, x) = exp( x) also satisfies the optimality eqation and is the cost incrred by stopping immediately. Ths it is not tre (as for positive programming) that a policy whose cost fnction satisfies the optimality eqation is optimal. The following lemma gives conditions nder which the infimal finite-horizon cost does converge to the infimal infinite-horizon cost. Lemma 5.3 Sppose all costs are bonded as follows. (a) K = sp k(x) < x Then F s (x) F(x) as s. (b) C = inf c(x) >. (5.3) x Proof. (*starred*) Sppose π is an optimal policy for the infinite horizon problem and stops at the random time τ. Then its cost is at least (s + 1)CP(τ > s). However, since it wold be possible to stop at time the cost is also no more than K, so (s + 1)CP(τ > s) F(x) K. In the s-horizon problem we cold follow π, bt stop at time s if τ > s. This implies F(x) F s (x) F(x) + KP(τ > s) F(x) + By letting s, we have F (x) = F(x). K2 (s + 1)C. Note that the problem posed here is identical to one in which we pay K at the start and receive a terminal reward r(x) = K k(x). Theorem 5.4 Sppose S is closed and (5.3) holds. Then an optimal policy for the infinite horizon is: stop if and only if x S. Proof. By Theorem 5.2 we have for all finite s, Lemma 5.3 gives F(x) = F (x). 
F_s(x) = k(x) for x ∈ S, and F_s(x) < k(x) for x ∉ S.
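For the parking example of Section 5.4, the OSLA threshold s* is trivial to compute, and the finite-horizon recursion can be used to confirm it. A small sketch (not in the notes), with illustrative values of p and D:

def osla_threshold(p, D):
    # greatest s with (D*p + 1) * q**s >= 1
    q = 1 - p
    s = 0
    while (D * p + 1) * q ** (s + 1) >= 1:
        s += 1
    return s

def dp_threshold(p, D, s_max=200):
    # largest distance s at which parking is optimal, via F_s(0), F_s(1)
    q = 1 - p
    F0, F1 = D, 0.0                        # F_0(0) = D, F_0(1) = 0
    last_stop = 0
    for s in range(1, s_max + 1):
        cont = q * F0 + p * F1
        if s <= cont:
            last_stop = s
        F0, F1 = cont, min(s, cont)
    return last_stop

p, D = 0.1, 30.0                           # illustrative numbers
print(osla_threshold(p, D), dp_threshold(p, D))   # the two thresholds agree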

13 6 Average-cost Programming We address the infinite-horizon average-cost case, the optimality eqation for this case and the policy improvement algorithm. 6.1 Average-cost optimization It can happen that the ndisconted expected total cost is infinite, bt the accmlation of cost per nit time is finite. Sppose that for a stationary Markov policy π, the following limit exists: 1 λ(π, x) = lim t t E π c(x s, s ) x = x. t 1 s= It is reasonable to expect that there is a well-defined notion of an optimal average-cost fnction, λ(x) = inf π λ(π, x), and that nder appropriate assmptions, λ(x) = λ shold not depend on x. Moreover, one wold expect F s (x) = sλ + φ(x) + ǫ(s, x), where ǫ(s, x) as s. Here φ(x) + ǫ(s, x) reflects a transient de to the initial state. Sppose that the state space and action space are finite. From the optimality eqation for the finite horizon problem we have F s (x) = min {c(x, ) + EF s 1 (x 1 ) x = x, = }. (6.1) So by sbstitting F s (x) sλ + φ(x) into (6.1), we obtain sλ + φ(x) min {c(x, ) + E(s 1)λ + φ(x 1 ) x = x, = } which sggests, what it is in fact, the average-cost optimality eqation: λ + φ(x) = min {c(x, ) + Eφ(x 1 ) x = x, = }. (6.2) Theorem 6.1 Let λ denote the minimal average-cost. Sppose there exists a constant λ and bonded fnction φ sch that for all x and, λ + φ(x) c(x, ) + Eφ(x 1 ) x = x, =. (6.3) Then λ λ. This also holds when is replaced by and the hypothesis is weakened to: for each x there exists a sch that (6.3) holds when is replaced by. Proof. Sppose is chosen by some policy π. By repeated sbstittion of (6.3) into itself we have t 1 φ(x) tλ + E π c(x s, s ) x = x + E π φ(x t ) x = x s= 21 Divide this by t and let t to obtain λ 1 + lim t t E π c(x s, s ) x = x, t 1 s= where the final term on the right hand side is simply the average-cost nder policy π. Minimizing the right hand side over π gives the reslt. The claim for replaced by is proved similarly. Theorem 6.2 Sppose there exists a constant λ and bonded fnction φ satisfying (6.2). Then λ is the minimal average-cost and the optimal stationary policy is the one that chooses the optimizing on the right hand side of (6.2). Proof. Eqation (6.2) implies that (6.3) holds with eqality when one takes π to be the stationary policy that chooses the optimizing on the right hand side of (6.2). Ths π is optimal and λ is the minimal average-cost. The average-cost optimal policy is fond simply by looking for a bonded soltion to (6.2). Notice that if φ is a soltion of (6.2) then so is φ+(a constant), becase the (a constant) will cancel from both sides of (6.2). Ths φ is ndetermined p to an additive constant. In searching for a soltion to (6.2) we can therefore pick any state, say x, and arbitrarily take φ( x) =. 6.2 Example: admission control at a qee Each day a consltant is presented with the opportnity to take on a new job. The jobs are independently distribted over n possible types and on a given day the offered type is i with probability a i, i = 1,...,n. Jobs of type i pay R i pon completion. Once he has accepted a job he may accept no other job ntil that job is complete. The probability that a job of type i takes k days is (1 p i ) k 1 p i, k = 1, 2,.... Which jobs shold the consltant accept? Soltion. Let and i denote the states in which he is free to accept a job, and in which he is engaged pon a job of type i, respectively. Then (6.2) is n λ + φ() = a i maxφ(), φ(i), i=1 λ + φ(i) = (1 p i )φ(i) + p i R i + φ(), i = 1,..., n. Taking φ() =, these have soltion φ(i) = R i λ/p i, and hence n λ = a i max, R i λ/p i. 
The left hand side is increasing in λ and the right hand side is decreasing in λ. Hence there is a root, say λ*, and this is the maximal average reward. The optimal policy takes the form: accept only jobs for which p_i R_i ≥ λ*.
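The root λ* is easily found numerically, since the right hand side Σ_i a_i max{0, R_i − λ/p_i} is decreasing in λ. A small sketch (not in the notes), with made-up job data:

def maximal_average_reward(a, R, p, tol=1e-10):
    # Solve lambda = sum_i a_i * max(0, R_i - lambda / p_i) by bisection.
    def rhs(lam):
        return sum(ai * max(0.0, Ri - lam / pi) for ai, Ri, pi in zip(a, R, p))
    lo, hi = 0.0, rhs(0.0)                 # the root lies in [0, sum_i a_i R_i]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid > rhs(mid):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

a = [0.5, 0.3, 0.2]                        # illustrative offer probabilities
R = [1.0, 4.0, 10.0]                       # rewards
p = [0.9, 0.3, 0.1]                        # per-day completion probabilities
lam = maximal_average_reward(a, R, p)
print(lam, [i for i in range(3) if p[i] * R[i] >= lam])   # lambda* and the jobs worth accepting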

14 6.3 Vale iteration bonds Vale iteration in the average-cost case is based pon the idea that F s (x) F s 1 (x) approximates the minimal average-cost for large s. Theorem 6.3 Define m s = min x {F s (x) F s 1 (x)}, Then m s λ M s, where λ is the minimal average-cost. M s = max x {F s(x) F s 1 (x)}. (6.4) Proof. (*starred*) Sppose that the first step of a s-horizon optimal policy follows Markov plan f. Then F s (x) = F s 1 (x) + F s (x) F s 1 (x) = c(x, f(x)) + EF s 1 (x 1 ) x = x, = f(x). Hence F s 1 (x) + m s c(x, ) + EF s 1 (x 1 ) x = x, =, for all x,. Applying Theorem 6.1 with φ = F s 1 and λ = m s, implies m s λ. The bond λ M s is established in a similar way. This jstifies the following vale iteration algorithm. At termination the algorithm provides a stationary policy that is within ǫ 1% of optimal. () Set F (x) =, s = 1. (1) Compte F s from F s (x) = min {c(x, ) + EF s 1 (x 1 ) x = x, = }. (2) Compte m s and M s from (6.4). Stop if M s m s ǫm s. Otherwise set s := s + 1 and goto step (1). 6.4 Policy improvement Policy improvement is an effective method of improving stationary policies. Policy improvement in the average-cost case. In the average-cost case a policy improvement algorithm can be based on the following observations. Sppose that for a policy π = f, we have that λ, φ is a soltion to λ + φ(x) = c(x, f(x )) + Eφ(x 1 ) x = x, = f(x ), and sppose for some policy π 1 = f 1, λ + φ(x) c(x, f 1 (x )) + Eφ(x 1 ) x = x, = f 1 (x ), (6.5) with strict ineqality for some x. Then following the lines of proof in Theorem 6.1 t 1 1 t 1 lim t t E π c(x s, s ) x 1 = x = λ lim t t E π 1 c(x s, s ) x = x. s= If there is no π 1 for which (6.5) holds then π satisfies (6.2) and is optimal. This jstifies the following policy improvement algorithm () Choose an arbitrary stationary policy π. Set s = 1. (1) For a given stationary policy π s 1 = f s 1 determine φ, λ to solve s= λ + φ(x) = c(x, f s 1 (x)) + Eφ(x 1 ) x = x, = f s 1 (x). This gives a set of linear eqations, and so is intrinsically easier to solve than (6.2). (2) Now determine the policy π s = f s from c(x, f s (x)) + Eφ(x 1 ) x = x, = f s (x) = min {c(x, ) + Eφ(x 1 ) x = x, = }, taking f s (x) = f s 1 (x) whenever this is possible. By applications of Theorem 6.1, this yields a strict improvement whenever possible. If π s = π s 1 then the algorithm terminates and π s 1 is optimal. Otherwise, retrn to step (1) with s := s + 1. If both the action and state spaces are finite then there are only a finite nmber of possible stationary policies and so the policy improvement algorithm will find an optimal stationary policy in finitely many iterations. By contrast, the vale iteration algorithm can only obtain more and more accrate approximations of λ. Policy improvement in the disconted-cost case. In the case of strict disconting, the following theorem plays the role of Theorem 6.1. The proof is similar, by repeated sbstittion of (6.6) into itself. Theorem 6.4 Sppose there exists a bonded fnction G sch that for all x and, G(x) c(x, ) + βeg(x 1 ) x = x, =. (6.6) Then G F, where F is the minimal disconted-cost fnction. This also holds when is replaced by and the hypothesis is weakened to: for each x there exists a sch that (6.6) holds when is replaced by. The policy improvement algorithm is similar. E.g., step (1) becomes (1) For a given stationary policy π s 1 = fs 1 determine G to solve G(x) = c(x, f s 1 (x)) + βeg(x 1 ) x = x, = f s 1 (x)
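For a finite state and action space the policy improvement algorithm is only a few lines of code. The sketch below (not in the notes) implements the discounted-cost version: policy evaluation solves the linear equations G = c_f + βP_f G, and the improvement step re-minimises the right hand side. The small MDP at the bottom is hypothetical.

import numpy as np

def policy_improvement(c, P, beta, max_iter=100):
    # c[u]: (n,) costs under action u; P[u]: (n, n) transition matrices; 0 < beta < 1.
    n = len(c[0])
    f = np.zeros(n, dtype=int)                             # arbitrary initial stationary policy
    for _ in range(max_iter):
        c_f = np.array([c[f[x]][x] for x in range(n)])
        P_f = np.array([P[f[x]][x] for x in range(n)])
        G = np.linalg.solve(np.eye(n) - beta * P_f, c_f)   # policy evaluation
        Q = np.array([c[u] + beta * P[u] @ G for u in range(len(c))])
        f_new = Q.argmin(axis=0)                           # policy improvement
        if np.array_equal(f_new, f):
            return f, G                                    # no further improvement: optimal
        f = f_new
    return f, G

P = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.1, 0.9], [0.6, 0.4]])]
c = [np.array([2.0, 1.0]), np.array([0.5, 3.0])]
print(policy_improvement(c, P, beta=0.9))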

15 7 LQ Models We present the LQ reglation model in discrete and continos time, the Riccati eqation, its validity in the model with additive white noise. 7.1 The LQ reglation model The elements needed to define a control optimization problem are specification of (i) the dynamics of the process, (ii) which qantities are observable at a given time, and (iii) an optimization criterion. In the LQG model the plant eqation and observation relations are linear, the cost is qadratic, and the noise is Gassian (jointly normal). The LQG model is important becase it has a complete theory and introdces some key concepts, sch as controllability, observability and the certainty-eqivalence principle. Begin with a model in which the state x t is flly observable and there is no noise. The plant eqation of the time-homogeneos A, B, system has the linear form x t = Ax t 1 + B t 1, (7.1) where x t R n, t R m, A is n n and B is n m. The cost fnction is h 1 C = c(x t, t ) + C h (x h ), (7.2) t= with one-step and terminal costs c(x, ) = x Rx + Sx + x S + x R S x Q =, (7.3) S Q C h (x) = x Π h x. (7.4) All qadratic forms are non-negative definite, and Q is positive definite. There is no loss of generality in assming that R, Q and Π h are symmetric. This is a model for reglation of (x, ) to the point (, ) (i.e., steering to a critical vale). To solve the optimality eqation we shall need the following lemma. Lemma 7.1 Sppose x, are vectors. Consider a qadratic form ( ) ( ) ( ) x Πxx Π x x. Π Π x Assme it is symmetric and Π >, i.e., positive definite. Then the minimm with respect to is achieved at = Π 1 Π xx, and is eqal to x Π xx Π x Π 1 Π x x. 25 Proof. Sppose the qadratic form is minimized at. Then ( x + h ) ( Πxx Π x Π x Π ) ( x + h = x Π xx x + 2x Π x + 2h Π x x + 2h Π } {{ } + Π + h Π h. To be stationary at, the nderbraced linear term in h mst be zero, so ) = Π 1 Π x x, and the optimal vale is x Π xx Π x Π 1 Π x x. Theorem 7.2 Assme the strctre of (7.1) (7.4). Then the vale fnction has the qadratic form F(x, t) = x Π t x, t < h, (7.5) and the optimal control has the linear form t = K t x t, t < h. The time-dependent matrix Π t satisfies the Riccati eqation where f is an operator having the action Π t = fπ t+1, t < h, (7.6) fπ = R + A ΠA (S + A ΠB)(Q + B ΠB) 1 (S + B ΠA), (7.7) and Π h has the vale prescribed in (7.4). The m n matrix K t is given by K t = (Q + B Π t+1 B) 1 (S + B Π t+1 A), t < h. Proof. Assertion (7.5) is tre at time h. Assme it is tre at time t + 1. Then F(x, t) = inf c(x, ) + (Ax + B) Π t+1 (Ax + B) = inf ( x ) ( R + A Π t+1 A S + A Π t+1 B S + B Π t+1 A Q + B Π t+1 B ) ( x By Lemma 7.1 the minimm is achieved by = K t x, and the form of f comes from this also. 26 )
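The recursion Π_t = fΠ_{t+1} of Theorem 7.2 and the gains K_t are mechanical to compute. A brief sketch (not from the notes; the matrices at the bottom are made up, and the gain carries the minus sign delivered by Lemma 7.1, so that u_t = K_t x_t is the minimiser):

import numpy as np

def riccati_recursion(A, B, R, S, Q, Pi_h, h):
    # Backward recursion (7.6)-(7.7); returns the matrices Pi_t and gains K_t.
    Pi = [None] * (h + 1)
    K = [None] * h
    Pi[h] = Pi_h
    for t in range(h - 1, -1, -1):
        M = Q + B.T @ Pi[t + 1] @ B
        L = S + B.T @ Pi[t + 1] @ A
        K[t] = -np.linalg.solve(M, L)                      # u_t = K_t x_t
        Pi[t] = R + A.T @ Pi[t + 1] @ A - L.T @ np.linalg.solve(M, L)
    return Pi, K

A = np.array([[1.0, 1.0], [0.0, 1.0]])                     # illustrative n = 2, m = 1 system
B = np.array([[0.0], [1.0]])
R, S, Q = np.eye(2), np.zeros((1, 2)), np.array([[1.0]])
Pi, K = riccati_recursion(A, B, R, S, Q, Pi_h=np.zeros((2, 2)), h=20)
print(Pi[0])                                               # F(x, 0) = x' Pi_0 x
print(K[0])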

16 7.2 The Riccati recrsion The backward recrsion (7.6) (7.7) is called the Riccati eqation. Note that (i) S can be normalized to zero by choosing a new control = +Q 1 Sx, and setting A = A BQ 1 S, R = R S Q 1 S. (ii) The optimally controlled process obeys x t+1 = Γ t x t. Here Γ t is called the gain matrix and is given by Γ t = A + BK t = A B(Q + B Π t+1 B) 1 (S + B Π t+1 A). (iii) An eqivalent expression for the Riccati eqation is fπ = inf R + K S + S K + K QK + (A + BK) Π(A + BK). K (iv) We might have carried ot exactly the same analysis for a time-heterogeneos model, in which the matrices A, B, Q, R, S are replaced by A t, B t, Q t, R t, S t. (v) We do not give details, bt comment that it is possible to analyse models in which x t+1 = Ax t + B t + α t, for a known seqence of distrbances {α t }, or in which the cost fnction is x xt R S x xt c(x, ) =. ū t S Q ū t so that the aim is to track a seqence of vales ( x t, ū t ), t =,...,h White noise distrbances Sppose the plant eqation (7.1) is now x t+1 = Ax t + B t + ǫ t, where ǫ t R n is vector white noise, defined by the properties Eǫ =, Eǫ t ǫ t and Eǫ t ǫ s =, t s. The DP eqation is then F(x, t) = inf c(x, ) + E ǫ (F(Ax + B + ǫ, t + 1). = N By definition F(x, h) = x Π h x. Try a soltion F(x, t) = x Π t x + γ t. This holds for t = h. Sppose it is tre for t + 1, then F(x, t) = inf c(x, ) + E(Ax + B + ǫ) Π t+1 (Ax + B + ǫ) + γ t+1 = inf c(x, ) + E(Ax + B) Π t+1 (Ax + B) + 2E ǫ (Ax + B) + E ǫ Π t+1 ǫ + γ t+1 = inf + + tr(nπ t+1) + γ t Here we se the fact that E ǫ Πǫ = E ǫ i Π ij ǫ j = E ǫ j ǫ i Π ij = N ji Π ij = tr(nπ). ij ij ij Ths (i) Π t follows the same Riccati eqation as before, (ii) the optimal control is t = K t x t, and (iii) F(x, t) = x Π t x + γ t = x Π t x + h j=t+1 tr(nπ j ). The final term can be viewed as the cost of correcting ftre noise. In the infinite horizon limit of Π t Π as t, we incr an average cost per nit time of tr(nπ), and a transient cost of x Πx that is de to correcting the initial x. 7.4 LQ reglation in continos-time In continos-time we take ẋ = Ax + B and C = h ( x ) ( R S S Q ) ( x ) dt + (x Πx) h. We can obtain the continos-time soltion from the discrete time soltion by moving forward in time in increments of. Make the following replacements. x t+1 x t+, A I + A, B B, R, S, Q R, S, Q. Then as before, F(x, t) = x Πx, where Π obeys the Riccati eqation Π t + R + A Π + ΠA (S + ΠB)Q 1 (S + B Π) =. This is simpler than the discrete time version. The optimal control is where (t) = K(t)x(t) K(t) = Q 1 (S + B Π). The optimally controlled plant eqation is ẋ = Γ(t)x, where Γ(t) = A + BK = A BQ 1 (S + B Π). 28

8 Controllability

We define and give conditions for controllability in discrete and continuous time.

8.1 Controllability

Consider the [A, B, ·] system with plant equation x_{t+1} = Ax_t + Bu_t. The controllability question is: can we bring x to an arbitrary prescribed value by some u-sequence?

Definition 8.1 The system is r-controllable if one can bring it from an arbitrary prescribed x_0 to an arbitrary prescribed x_r by some u-sequence u_0, u_1, ..., u_{r-1}. A system of dimension n is said to be controllable if it is r-controllable for some r.

Example. If B is square and non-singular then the system is 1-controllable, for x_1 = Ax_0 + Bu_0 where u_0 = B^{-1}(x_1 - Ax_0).

Example. Consider the case (n = 2, m = 1),

    x_t = [ a_11  a_12 ; a_21  a_22 ] x_{t-1} + [ 1 ; 0 ] u_{t-1}.

This system is not 1-controllable. But

    x_2 - A^2 x_0 = Bu_1 + ABu_0 = [ 1  a_11 ; 0  a_21 ] [ u_1 ; u_0 ].

So it is 2-controllable if and only if a_21 ≠ 0.

More generally, by substituting the plant equation into itself, we see that we must find u_0, u_1, ..., u_{r-1} to satisfy

    Δ = x_r - A^r x_0 = Bu_{r-1} + ABu_{r-2} + ··· + A^{r-1}Bu_0,        (8.1)

for arbitrary Δ. In providing conditions for controllability we shall need to make use of the following theorem.

Theorem 8.2 (The Cayley-Hamilton theorem) Any n × n matrix A satisfies its own characteristic equation. So that if

    det(λI - A) = Σ_{j=0}^{n} a_j λ^{n-j}

then

    Σ_{j=0}^{n} a_j A^{n-j} = 0.                                         (8.2)

The implication is that {I, A, A^2, ..., A^{n-1}} contains a basis for {A^r : r = 0, 1, ...}.

Proof. (*starred*) Define

    Φ(z) = Σ_{j=0}^{∞} (Az)^j = (I - Az)^{-1} = adj(I - Az) / det(I - Az).

Then

    det(I - Az) Φ(z) = [ Σ_{j=0}^{n} a_j z^j ] Φ(z) = adj(I - Az),

which implies (8.2) since the coefficient of z^n on the left must be zero.

We are now in a position to characterise controllability.

Theorem 8.3 (i) The system [A, B, ·] is r-controllable if and only if the matrix

    M_r = [ B  AB  A^2B  ···  A^{r-1}B ]

has rank n, or (ii) equivalently, if and only if the n × n matrix

    M_r M_r' = Σ_{j=0}^{r-1} A^j (BB') (A')^j

is nonsingular (or, equivalently, positive definite). (iii) If the system is r-controllable then it is s-controllable for s ≥ min(n, r), and (iv) a control transferring x_0 to x_r with minimal cost Σ_{t=0}^{r-1} u_t'u_t is

    u_t = B'(A')^{r-t-1} (M_r M_r')^{-1} (x_r - A^r x_0),   t = 0, ..., r-1.

Proof. (i) The system (8.1) has a solution u for arbitrary Δ if and only if M_r has rank n. (ii) M_r M_r' is singular if and only if there exists w ≠ 0 such that M_r M_r' w = 0, and

    M_r M_r' w = 0  ⟺  w'M_r M_r' w = 0  ⟺  M_r' w = 0.

(iii) The rank of M_r is non-decreasing in r, so if the system is r-controllable, then it is s-controllable for s ≥ r. But the rank is constant for r ≥ n by the Cayley-Hamilton theorem. (iv) Consider the Lagrangian

    Σ_{t=0}^{r-1} u_t'u_t + λ'(Δ - Σ_{t=0}^{r-1} A^{r-t-1} B u_t),

giving

    u_t = (1/2) B'(A')^{r-t-1} λ.

Now we can determine λ from (8.1).
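The conditions of Theorem 8.3 are easy to test numerically. The following sketch is my own illustration (the numerical values of A, B, x_0 and x_r are assumptions): it builds M_r, checks r-controllability by its rank, and computes the minimum-cost transfer control of part (iv).

import numpy as np

def controllability_matrix(A, B, r):
    """M_r = [B  AB  ...  A^{r-1}B], an n x (rm) matrix."""
    return np.hstack([np.linalg.matrix_power(A, j) @ B for j in range(r)])

def min_energy_control(A, B, x0, xr, r):
    """u_t = B'(A')^{r-t-1} (M_r M_r')^{-1} (x_r - A^r x_0), t = 0..r-1."""
    Mr = controllability_matrix(A, B, r)
    G = Mr @ Mr.T                                 # = sum_j A^j BB' (A')^j
    delta = xr - np.linalg.matrix_power(A, r) @ x0
    lam = np.linalg.solve(G, delta)
    return [B.T @ np.linalg.matrix_power(A.T, r - t - 1) @ lam for t in range(r)]

# Illustrative 2 x 2 example with B = (1, 0)'; 2-controllable since a_21 != 0.
A = np.array([[0.5, 0.2], [0.3, 0.7]])
B = np.array([[1.0], [0.0]])
print(np.linalg.matrix_rank(controllability_matrix(A, B, 2)))   # -> 2

x0, xr = np.array([1.0, -1.0]), np.array([0.0, 0.0])
u = min_energy_control(A, B, x0, xr, r=2)
x = x0
for ut in u:                                      # simulate the plant under u_0, u_1
    x = A @ x + B @ ut
print(x)                                          # -> approximately xr

Simulating the plant under these controls confirms that x_r is reached exactly (up to rounding).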

8.2 Controllability in continuous-time

Theorem 8.4 (i) The n-dimensional system [A, B, ·] is controllable if and only if the matrix M_n has rank n, or (ii) equivalently, if and only if

    G(t) = ∫_0^t e^{As} BB' e^{A's} ds

is positive definite for all t > 0. (iii) If the system is controllable then a control that achieves the transfer from x(0) to x(t) with minimal control cost ∫_0^t u(s)'u(s) ds is

    u(s) = B' e^{A'(t-s)} G(t)^{-1} (x(t) - e^{At} x(0)).

Note that there is now no notion of r-controllability. However, G(t) ↓ 0 as t ↓ 0, so the transfer becomes more difficult and costly as t ↓ 0.

8.3 Example: broom balancing

Consider the problem of balancing a broom in an upright position on your hand. By Newton's laws, the system obeys

    m(ü cos θ + Lθ̈) = mg sin θ.

For small θ we have cos θ ≈ 1 and θ ≈ sin θ = (x - u)/L, so with α = g/L the plant equation is

    ẍ = α(x - u),

equivalently,

    d/dt [ ẋ ; x ] = [ 0  α ; 1  0 ] [ ẋ ; x ] + [ -α ; 0 ] u.

[Figure 1: Force diagram for broom balancing]

Since

    [ B  AB ] = [ -α  0 ; 0  -α ],

the system is controllable if θ is initially small.

8.4 Example: satellite in a plane orbit

Consider a satellite of unit mass in a planar orbit and take polar coordinates (r, θ):

    r̈ = rθ̇^2 - c/r^2 + u_r,    θ̈ = -2ṙθ̇/r + u_θ/r,

where u_r and u_θ are the radial and tangential components of thrust. If u = 0 then a possible orbit (such that r̈ = θ̈ = 0) is one with r = ρ and θ̇ = ω = (c/ρ^3)^{1/2}.

Recall that one reason for taking an interest in linear models is that they tell us about controllability around an equilibrium point. Imagine there is a perturbing force. Take coordinates of perturbation

    x_1 = r - ρ,   x_2 = ṙ,   x_3 = θ - ωt,   x_4 = θ̇ - ω.

Then, with n = 4, m = 2,

    ẋ ≈ [ 0 1 0 0 ; 3ω^2 0 0 2ωρ ; 0 0 0 1 ; 0 -2ω/ρ 0 0 ] x + [ 0 0 ; 1 0 ; 0 0 ; 0 1/ρ ] [ u_r ; u_θ ] = Ax + Bu.

It is easy to check that M_2 = [ B  AB ] has rank 4 and that therefore the system is controllable.

But suppose u_θ = 0 (tangential thrust fails). Then B = [ 0 ; 1 ; 0 ; 0 ] and

    M_4 = [ B  AB  A^2B  A^3B ] = [ 0 1 0 -ω^2 ; 1 0 -ω^2 0 ; 0 0 -2ω/ρ 0 ; 0 -2ω/ρ 0 2ω^3/ρ ].

Since (2ωρ, 0, 0, ρ^2) M_4 = 0, this is singular and has rank 3. The uncontrollable component is the angular momentum,

    2ωρ δr + ρ^2 δθ̇ = δ(r^2 θ̇),   evaluated at r = ρ, θ̇ = ω.

On the other hand, if u_r = 0 then the system is controllable. We can change the radius by tangential braking or thrust.
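The rank computations in the satellite example can be checked mechanically. In the sketch below the symbols ω and ρ are given arbitrary illustrative values of my own; the claims of Section 8.4 (rank 4 with both thrusts, rank 3 with radial thrust only, and the left null vector (2ωρ, 0, 0, ρ^2)) do not depend on the particular values chosen.

import numpy as np

w, rho = 1.0, 2.0                                  # assumed values of ω and ρ
A = np.array([[0, 1, 0, 0],
              [3*w**2, 0, 0, 2*w*rho],
              [0, 0, 0, 1],
              [0, -2*w/rho, 0, 0]])
B_full = np.array([[0, 0], [1, 0], [0, 0], [0, 1/rho]])
B_radial = B_full[:, [0]]                          # tangential thrust fails

def ctrb(A, B, r):
    """Controllability matrix [B  AB  ...  A^{r-1}B]."""
    return np.hstack([np.linalg.matrix_power(A, j) @ B for j in range(r)])

print(np.linalg.matrix_rank(ctrb(A, B_full, 2)))   # -> 4, controllable
M4 = ctrb(A, B_radial, 4)
print(np.linalg.matrix_rank(M4))                   # -> 3, not controllable
print(np.array([2*w*rho, 0, 0, rho**2]) @ M4)      # -> the zero row vector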

9 Infinite Horizon Limits

We define stabilizability and discuss the LQ regulation problem over an infinite horizon.

9.1 Linearization of nonlinear models

Linear models are important because they arise naturally via the linearization of nonlinear models. Consider the state-structured nonlinear model:

    ẋ = a(x, u).

Suppose x, u are perturbed from an equilibrium (x̄, ū) where a(x̄, ū) = 0. Let x′ = x - x̄ and u′ = u - ū and immediately drop the primes. The linearized version is

    ẋ = Ax + Bu,   where   A = ∂a/∂x |_(x̄,ū),   B = ∂a/∂u |_(x̄,ū).

If (x̄, ū) is to be a stable equilibrium point then we must be able to choose a control that can stabilise the system in the neighbourhood of (x̄, ū).

9.2 Stabilizability

Suppose we apply the stationary control u = Kx, so that

    ẋ = Ax + Bu = (A + BK)x.

So with Γ = A + BK, we have ẋ = Γx and

    x_t = e^{Γt} x_0,   where   e^{Γt} = Σ_{j=0}^{∞} (Γt)^j / j!.

Similarly, in discrete time we can take the stationary control u_t = Kx_t, so that

    x_t = Ax_{t-1} + Bu_{t-1} = (A + BK)x_{t-1}.

Now x_t = Γ^t x_0. We are interested in choosing Γ so that x_t → 0 as t → ∞.

Definition 9.1 Γ is a stability matrix in the continuous-time sense if all its eigenvalues have negative real part, and hence x_t → 0 as t → ∞. Γ is a stability matrix in the discrete-time sense if all its eigenvalues lie strictly inside the unit disc in the complex plane, |z| < 1, and hence x_t → 0 as t → ∞. The [A, B] system is said to be stabilizable if there exists a K such that A + BK is a stability matrix.

Note that u_t = Kx_t is linear and Markov. In seeking controls such that x_t → 0 it is sufficient to consider only controls of this type since, as we see below, such controls arise as optimal controls for the infinite-horizon LQ regulation problem.

9.3 Example: pendulum

Consider a pendulum of length L, unit mass bob and angle θ to the vertical. Suppose we wish to stabilise θ to zero by application of a force u. Then

    θ̈ = -(g/L) sin θ + u.

We change the state variable to x = (θ, θ̇) and write

    d/dt [ θ ; θ̇ ] = [ θ̇ ; -(g/L) sin θ + u ] ≈ [ 0  1 ; -(g/L)  0 ] [ θ ; θ̇ ] + [ 0 ; 1 ] u.

Suppose we try to stabilise with a control u = Kθ = Kx_1. Then

    A + BK = [ 0  1 ; K - (g/L)  0 ],

and this has eigenvalues ±√(K - g/L). So either K - g/L > 0 and one eigenvalue has a positive real part, in which case there is in fact instability, or K - g/L < 0 and the eigenvalues are purely imaginary, which means we will in general have oscillations. So successful stabilization must be a function of θ̇ as well (and this would come out of the solution to the LQ regulation problem).

9.4 Infinite-horizon LQ regulation

Consider the time-homogeneous case and write the finite-horizon cost in terms of the time to go s. The terminal cost, when s = 0, is denoted F_0(x) = x'Π_0 x. In all that follows we take S = 0, without loss of generality.

Lemma 9.2 Suppose Π_0 = 0, R ≥ 0, Q > 0 and [A, B, ·] is controllable or stabilizable. Then {Π_s} has a finite limit Π.

Proof. Costs are non-negative, so F_s(x) is non-decreasing in s. Now F_s(x) = x'Π_s x. Thus x'Π_s x is non-decreasing in s for every x. To show that x'Π_s x is bounded we use one of two arguments. If the system is controllable then x'Π_s x is bounded because there is a policy which, for any x_0 = x, will bring the state to zero in at most n steps and at finite cost and can then hold it at zero with zero cost thereafter. If the system is stabilizable then there is a K such that Γ = A + BK is a stability matrix and, using u_t = Kx_t, we have

    F_s(x) ≤ x' [ Σ_{t=0}^{∞} (Γ')^t (R + K'QK) Γ^t ] x < ∞.

Hence in either case we have an upper bound, and so x'Π_s x tends to a limit for every x. By considering x = e_j, the vector with a unit in the jth place and zeros elsewhere, we conclude that the jth element on the diagonal of Π_s converges. Then taking x = e_j + e_k it follows that the off-diagonal elements of Π_s also converge.
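Referring back to the pendulum of Section 9.3: the eigenvalue calculation there is easy to reproduce numerically. In the sketch below the values of g, L and the feedback gains are illustrative choices of my own; it shows that feedback on θ alone gives either a purely oscillatory or an unstable closed loop, while feedback on θ̇ as well can make A + BK a (continuous-time) stability matrix.

import numpy as np

g, L = 9.8, 1.0
A = np.array([[0.0, 1.0], [-g/L, 0.0]])
B = np.array([[0.0], [1.0]])

def closed_loop_eigs(K):
    """Eigenvalues of A + BK for a 1 x 2 gain K."""
    return np.linalg.eigvals(A + B @ K)

print(closed_loop_eigs(np.array([[5.0, 0.0]])))    # K < g/L: purely imaginary pair
print(closed_loop_eigs(np.array([[25.0, 0.0]])))   # K > g/L: one eigenvalue > 0
print(closed_loop_eigs(np.array([[-5.0, -3.0]])))  # θ̇ feedback too: both in left half-plane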

Both value iteration and policy improvement are effective ways to compute the solution to an infinite-horizon LQ regulation problem. Policy improvement goes along the lines developed in Lecture 6. The following theorem establishes the efficacy of value iteration. It is similar to Theorem 4.2, which established the same fact for D, N and P programming. The LQ regulation problem is a negative programming problem; however, we cannot apply Theorem 4.2, because in general the terminal cost of x'Π_0 x is not zero.

Theorem 9.3 Suppose that R > 0, Q > 0 and the system [A, B, ·] is controllable. Then (i) the equilibrium Riccati equation

    Π = fΠ                                                              (9.1)

has a unique non-negative definite solution Π; (ii) for any finite non-negative definite Π_0 the sequence {Π_s} converges to Π; (iii) the gain matrix Γ corresponding to Π is a stability matrix.

Proof. (*starred*) Define Π as the limit of the sequence f^(s)0. By the previous lemma we know that this limit exists and that it satisfies (9.1). Consider u_t = Kx_t and x_{t+1} = (A + BK)x_t = Γx_t, so that x_t = Γ^t x_0, for arbitrary x_0, where K = -(Q + B'ΠB)^{-1}B'ΠA and Γ = A + BK. We can write (9.1) as

    Π = R + K'QK + Γ'ΠΓ,                                                 (9.2)

and hence

    x_t'Πx_t = x_t'(R + K'QK)x_t + x_{t+1}'Πx_{t+1} ≥ x_{t+1}'Πx_{t+1}.

Thus x_t'Πx_t decreases and, being bounded below by zero, it tends to a limit. Thus x_t'(R + K'QK)x_t tends to 0. Since R + K'QK is positive definite this implies x_t → 0, which implies (iii). Hence, for arbitrary finite non-negative definite Π_0,

    Π_s = f^(s)Π_0 ≥ f^(s)0 → Π.                                         (9.3)

However, if we choose the fixed policy u_t = Kx_t then it follows that

    Π_s ≤ Σ_{t=0}^{s-1} (Γ')^t (R + K'QK) Γ^t + (Γ')^s Π_0 Γ^s → Π.      (9.4)

Thus (9.3) and (9.4) imply (ii). Finally, if a non-negative definite Π* also satisfies (9.1) then Π* = f^(s)Π* → Π, whence (i) follows.

9.5 The [A, B, C] system

The notion of controllability rested on the assumption that the initial value of the state was known. If, however, one must rely upon imperfect observations, then the question arises whether the value of the state (either in the past or in the present) can be determined from these observations. The discrete-time system [A, B, C] is defined by the plant equation and observation relation

    x_t = Ax_{t-1} + Bu_{t-1},
    y_t = Cx_{t-1}.

Here y ∈ R^r is observed, but x is not. We suppose C is r × n. The observability question is whether or not we can infer x_0 from the observations y_1, y_2, .... The notion of observability stands in dual relation to that of controllability; a duality that indeed persists throughout the subject.
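Returning to Theorem 9.3: value iteration for the equilibrium Riccati equation is simple to carry out numerically. The sketch below (with illustrative matrices of my own, and S = 0 as in Section 9.4) iterates Π ← fΠ from Π_0 = 0 and then checks that the resulting gain matrix Γ has all eigenvalues strictly inside the unit disc.

import numpy as np

def f_operator(Pi, A, B, R, Q):
    """One value-iteration step Π <- fΠ of (9.1), with S = 0."""
    M = np.linalg.inv(Q + B.T @ Pi @ B)
    L = B.T @ Pi @ A
    return R + A.T @ Pi @ A - L.T @ M @ L

A = np.array([[1.0, 0.2], [0.0, 1.1]])             # an unstable open-loop A
B = np.array([[0.0], [1.0]])
R, Q = np.eye(2), np.array([[1.0]])

Pi = np.zeros((2, 2))
for _ in range(500):                               # value iteration from Π_0 = 0
    Pi_next = f_operator(Pi, A, B, R, Q)
    if np.max(np.abs(Pi_next - Pi)) < 1e-12:
        break
    Pi = Pi_next

K = -np.linalg.inv(Q + B.T @ Pi @ B) @ (B.T @ Pi @ A)
Gamma = A + B @ K
print(np.abs(np.linalg.eigvals(Gamma)))            # all moduli < 1: a stability matrix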
