OPTIMIZATION AND CONTROL
OPTIMIZATION AND CONTROL

Richard Weber

Contents

DYNAMIC PROGRAMMING

1 Dynamic Programming: The Optimality Equation
  1.1 Control as optimization over time
  1.2 The principle of optimality
  1.3 Example: the shortest path problem
  1.4 The optimality equation
  1.5 Markov decision processes
2 Some Examples of Dynamic Programming
  2.1 Example: managing spending and savings
  2.2 Example: exercising a stock option
  2.3 Example: accepting the best offer
3 Dynamic Programming over the Infinite Horizon
  3.1 Discounted costs
  3.2 Example: job scheduling
  3.3 The infinite-horizon case
  3.4 The optimality equation in the infinite-horizon case
  3.5 Example: selling an asset
4 Positive Programming
  4.1 Example: possible lack of an optimal policy
  4.2 Characterization of the optimal policy
  4.3 Example: optimal gambling
  4.4 Value iteration
  4.5 Example: pharmaceutical trials
5 Negative Programming
  5.1 Stationary policies
  5.2 Characterization of the optimal policy
  5.3 Optimal stopping over a finite horizon
  5.4 Example: optimal parking
  5.5 Optimal stopping over the infinite horizon
6 Average-cost Programming
  6.1 Average-cost optimization
  6.2 Example: admission control at a queue
  6.3 Value iteration bounds
  6.4 Policy improvement

LQG SYSTEMS

7 LQ Models
  7.1 The LQ regulation model
  7.2 The Riccati recursion
  7.3 White noise disturbances
  7.4 LQ regulation in continuous-time
8 Controllability
  8.1 Controllability
  8.2 Controllability in continuous-time
  8.3 Example: broom balancing
  8.4 Example: satellite in a plane orbit
9 Infinite Horizon Limits
  9.1 Linearization of nonlinear models
  9.2 Stabilizability
  9.3 Example: pendulum
  9.4 Infinite-horizon LQ regulation
  9.5 The [A, B, C] system
10 Observability
  10.1 Observability
  10.2 Observability in continuous-time
  10.3 Examples
  10.4 Imperfect state observation with noise
11 Kalman Filtering and Certainty Equivalence
  11.1 Preliminaries
  11.2 The Kalman filter
  11.3 Certainty equivalence
  11.4 Example: inertialess rocket with noisy position sensing

CONTINUOUS-TIME MODELS
12 Dynamic Programming in Continuous Time
  12.1 The optimality equation
  12.2 Example: LQ regulation
  12.3 Example: estate planning
  12.4 Example: harvesting
13 Pontryagin's Maximum Principle
  13.1 Heuristic derivation
  13.2 Example: bringing a particle to rest in minimal time
  13.3 Connection with Lagrangian multipliers
  13.4 Example: use of the transversality conditions
14 Applications of the Maximum Principle
  14.1 Problems with terminal conditions
  14.2 Example: monopolist
  14.3 Example: insects as optimizers
  14.4 Example: rocket thrust optimization
15 Controlled Markov Jump Processes
  15.1 The dynamic programming equation
  15.2 The case of a discrete state space
  15.3 Uniformization in the infinite horizon case
  15.4 Example: admission control at a queue
16 Controlled Diffusion Processes
  16.1 Diffusion processes and controlled diffusion processes
  16.2 Example: noisy LQ regulation in continuous time
  16.3 Example: a noisy second order system
  16.4 Example: passage to a stopping set

Index

Schedules

The first 6 lectures are devoted to dynamic programming in discrete time and cover both finite and infinite-horizon problems; discounted-cost, positive, negative and average-cost programming; the time-homogeneous Markov case; stopping problems; value iteration and policy improvement. The next 5 lectures are devoted to the LQG model (linear systems, quadratic costs) and cover the important ideas of controllability and observability; the Riccati equation; imperfect observation, certainty equivalence and the Kalman filter. The final 5 lectures are devoted to continuous-time models and include treatment of Pontryagin's maximum principle and the Hamiltonian; Markov decision processes on a countable state space; and controlled diffusion processes.

Each of the 16 lectures is designed to be a somewhat self-contained unit, e.g., there will be one lecture on Negative Programming, one on Controllability, etc. Examples and applications are important in this course, so there are one or more worked examples in each lecture.

Examples sheets

There are three examples sheets, corresponding to the thirds of the course.
There are two or three questions for each lecture, some theoretical and some of a problem nature. Each question is marked to indicate the lecture with which it is associated.

Lecture Notes and Handouts

There are printed lecture notes for the course and other occasional handouts. There are sheets summarising notation and what you are expected to know for the exams. The notes include a list of keywords and I will be drawing your attention to these as we go along. If you have a good grasp of the meaning of each of these keywords, then you will be well on your way to understanding the important concepts of the course.

WWW pages

Notes for the course, and other information, are on the web.

Books

The following books are recommended.

- D. P. Bertsekas, Dynamic Programming, Prentice Hall.
- D. P. Bertsekas, Dynamic Programming and Optimal Control, Volumes I and II, Prentice Hall.
- L. M. Hocking, Optimal Control: An Introduction to the Theory and Applications, Oxford University Press.
- S. Ross, Introduction to Stochastic Dynamic Programming, Academic Press.
- P. Whittle, Optimization Over Time, Volumes I and II, Wiley.

Ross's book is probably the easiest to read. However, it only covers Part I of the course. Whittle's book is good for Part II and Hocking's book is good for Part III. The recent book by Bertsekas is useful for all parts. Many other books address the topics of the course and a collection can be found in Sections 3B and 3D of the DPMMS library. Notation differs from book to book. My notation will be closest to that of Whittle's books and consistent throughout. For example, I will always denote a minimal cost function by F(·) (whereas in the recommended books you will find F, V, φ, J and many other symbols used for this quantity).
1 Dynamic Programming: The Optimality Equation

We introduce the idea of dynamic programming and the principle of optimality. We give notation for state-structured models, and introduce ideas of feedback, open-loop, and closed-loop controls, a Markov decision process, and the idea that it can be useful to model things in terms of time to go.

1.1 Control as optimization over time

Optimization is a key tool in modelling. Sometimes it is important to solve a problem optimally. Other times either a near-optimal solution is good enough, or the real problem does not have a single criterion by which a solution can be judged. However, even then optimization is useful as a way to test thinking. If the optimal solution is ridiculous it may suggest ways in which both modelling and thinking can be refined.

Control theory is concerned with dynamic systems and their optimization over time. It accounts for the fact that a dynamic system may evolve stochastically and that key variables may be unknown or imperfectly observed (as we see, for instance, in the UK economy). This contrasts with the optimization models in the IB course (such as those for LP and network flow models); those were static, and nothing was random or hidden. It is these three new features: dynamic and stochastic evolution, and imperfect state observation, that give rise to new types of optimization problem and which require new ways of thinking.

We could spend an entire lecture discussing the importance of control theory and tracing its development through the windmill, steam governor, and so on. Such classic control theory is largely concerned with the question of stability, and there is much of this theory which we ignore, e.g., the Nyquist criterion and dynamic lags.

1.2 The principle of optimality

A key idea is that optimization over time can often be regarded as optimization in stages. We trade off our desire to obtain the lowest possible cost at the present stage against the implication this would have for costs at future stages.
The best action minimizes the sum of the cost incurred at the current stage and the least total cost that can be incurred from all subsequent stages, consequent on this decision. This is known as the Principle of Optimality.

Definition 1.1 (Principle of Optimality) From any point on an optimal trajectory, the remaining trajectory is optimal for the corresponding problem initiated at that point.

1.3 Example: the shortest path problem

Consider the stagecoach problem in which a traveller wishes to minimize the length of a journey from town A to town J by first travelling to one of B, C or D, then onwards to one of E, F or G, then onwards to one of H or I, and then finally to J. Thus there are 4 stages. The arcs are marked with distances between towns.

(Figure: road system for the stagecoach problem.)

Solution. Let F(X) be the minimal distance required to reach J from X. Then clearly, F(J) = 0, F(H) = 3 and F(I) = 4. Also

F(F) = min{6 + F(H), 3 + F(I)} = 7,

and so on. Recursively, we obtain F(A) = 11 and simultaneously an optimal route, i.e., A-D-F-I-J (although it is not unique).

The study of dynamic programming dates from Richard Bellman, who wrote the first book on the subject (1957) and gave it its name. A very large number of problems can be treated this way.

1.4 The optimality equation

The optimality equation in the general case. In discrete time t takes integer values, say t = 0, 1, .... Suppose u_t is a control variable whose value is to be chosen at time t. Let U_{t-1} = (u_0, ..., u_{t-1}) denote the partial sequence of controls (or decisions) taken over the first t stages. Suppose the cost up to the time horizon h is given by

C = G(U_{h-1}) = G(u_0, u_1, ..., u_{h-1}).

Then the principle of optimality is expressed in the following theorem.

Theorem 1.2 (The principle of optimality) Define the functions

G(U_{t-1}, t) = inf_{u_t, u_{t+1}, ..., u_{h-1}} G(U_{h-1}).

Then these obey the recursion

G(U_{t-1}, t) = inf_{u_t} G(U_t, t + 1),   t < h,

with terminal evaluation G(U_{h-1}, h) = G(U_{h-1}).
The proof is immediate from the definition of G(U_{t-1}, t), i.e.,

G(U_{t-1}, t) = inf_{u_t} [ inf_{u_{t+1}, ..., u_{h-1}} G(u_0, ..., u_{t-1}, u_t, u_{t+1}, ..., u_{h-1}) ].
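The backward recursion of Theorem 1.2 is easy to mechanize. The sketch below re-solves the stagecoach example; since the figure's arc lengths are not reproduced here, the distances used are assumptions, chosen to be consistent with the values quoted in the text (F(H) = 3, F(I) = 4, F(F) = 7, F(A) = 11 via A-D-F-I-J):

```python
# Backward dynamic programming for the stagecoach problem.
# Arc lengths marked (h) are hypothetical; the rest are quoted in the text.
arcs = {
    'A': {'B': 4, 'C': 3, 'D': 2},   # (h)
    'B': {'E': 7, 'F': 4, 'G': 6},   # (h)
    'C': {'E': 6, 'F': 6, 'G': 6},   # (h)
    'D': {'E': 6, 'F': 2, 'G': 6},   # (h)
    'E': {'H': 1, 'I': 4},           # (h)
    'F': {'H': 6, 'I': 3},           # quoted: F(F) = min{6 + F(H), 3 + F(I)}
    'G': {'H': 3, 'I': 3},           # (h)
    'H': {'J': 3},                   # quoted: F(H) = 3
    'I': {'J': 4},                   # quoted: F(I) = 4
}

def solve(dest='J'):
    """Return (F, nxt): minimal distance-to-go and an optimal successor."""
    F, nxt = {dest: 0}, {}
    # Process towns in reverse stage order, so successors are already solved.
    for town in ['H', 'I', 'E', 'F', 'G', 'B', 'C', 'D', 'A']:
        succ, d = min(arcs[town].items(), key=lambda kv: kv[1] + F[kv[0]])
        nxt[town], F[town] = succ, d + F[succ]
    return F, nxt

F, nxt = solve()
route, x = ['A'], 'A'
while x != 'J':
    x = nxt[x]
    route.append(x)
print(F['A'], route)   # 11 ['A', 'D', 'F', 'I', 'J']
```

Note the design: because the graph is staged, one sweep in reverse stage order suffices; there is no need for a general shortest-path algorithm.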
The state structured case. The control variable u_t is chosen on the basis of knowing U_{t-1} = (u_0, ..., u_{t-1}), (which determines everything else). But a more economical representation of the past history is often sufficient. For example, we may not need to know the entire path that has been followed up to time t, but only the place to which it has taken us. The idea of a state variable x in R^d is that its value at t, denoted x_t, is calculable from known quantities and obeys a plant equation (or law of motion)

x_{t+1} = a(x_t, u_t, t).

Suppose we wish to minimize a cost function of the form

C = sum_{t=0}^{h-1} c(x_t, u_t, t) + C_h(x_h),   (1.1)

by choice of controls {u_0, ..., u_{h-1}}. Define the cost from time t onwards as

C_t = sum_{τ=t}^{h-1} c(x_τ, u_τ, τ) + C_h(x_h),   (1.2)

and the minimal cost from time t onwards as an optimization over {u_t, ..., u_{h-1}} conditional on x_t = x,

F(x, t) = inf_{u_t, ..., u_{h-1}} C_t.

Here F(x, t) is the minimal future cost from time t onward, given that the state is x at time t. Then by an inductive proof, one can show as in Theorem 1.2 that

F(x, t) = inf_u [ c(x, u, t) + F(a(x, u, t), t + 1) ],   t < h,   (1.3)

with terminal condition F(x, h) = C_h(x). Here x is a generic value of x_t. The minimizing u in (1.3) is the optimal control u(x, t) and values of x_0, ..., x_{t-1} are irrelevant. The optimality equation (1.3) is also called the dynamic programming equation (DP) or Bellman equation.

The DP equation defines an optimal control problem in what is called feedback or closed-loop form, with u_t = u(x_t, t). This is in contrast to the open-loop formulation in which {u_0, ..., u_{h-1}} are to be determined all at once at time 0. A policy (or strategy) is a rule for choosing the value of the control variable under all possible circumstances as a function of the perceived circumstances. To summarise:

(i) The optimal u_t is a function only of x_t and t, i.e., u_t = u(x_t, t).

(ii) The DP equation expresses the optimal u_t in closed-loop form. It is optimal whatever the past control policy may have been.
(iii) The DP equation is a backward recursion in time (from which we get the optimum at h-1, then h-2 and so on.) The later policy is decided first.

"Life must be lived forward and understood backwards." (Kierkegaard)

1.5 Markov decision processes

Consider now stochastic evolution. Let X_t = (x_0, ..., x_t) and U_t = (u_0, ..., u_t) denote the x and u histories at time t. As above, state structure is characterised by the fact that the evolution of the process is described by a state variable x, having value x_t at time t, with the following properties.

(a) Markov dynamics: (i.e., the stochastic version of the plant equation)

P(x_{t+1} | X_t, U_t) = P(x_{t+1} | x_t, u_t).

(b) Decomposable cost, (i.e., cost given by (1.1)).

These assumptions define state structure. For the moment we also require the following.

(c) Perfect state observation: The current value of the state is observable. That is, x_t is known at the time at which u_t must be chosen. So, letting W_t denote the observed history at time t, we assume W_t = (X_t, U_{t-1}). Note that C is determined by W_h, so we might write C = C(W_h).

These assumptions define what is known as a discrete-time Markov decision process (MDP). Many of our examples will be of this type. As above, the cost from time t onwards is given by (1.2). Denote the minimal expected cost from time t onwards by

F(W_t) = inf_π E_π[C_t | W_t],

where π denotes a policy, i.e., a rule for choosing the controls u_0, ..., u_{h-1}. We can assert the following theorem.

Theorem 1.3 F(W_t) is a function of x_t and t alone, say F(x_t, t). It obeys the optimality equation

F(x_t, t) = inf_{u_t} { c(x_t, u_t, t) + E[F(x_{t+1}, t + 1) | x_t, u_t] },   t < h,   (1.4)

with terminal condition F(x_h, h) = C_h(x_h). Moreover, a minimizing value of u_t in (1.4) (which is also only a function of x_t and t) is optimal.

Proof. The value of F(W_h) is C_h(x_h), so the asserted reduction of F is valid at time h. Assume it is valid at time t + 1. The DP equation is then

F(W_t) = inf_{u_t} { c(x_t, u_t, t) + E[F(x_{t+1}, t + 1) | X_t, U_t] }.   (1.5)

But, by assumption (a), the right-hand side of (1.5) reduces to the right-hand member of (1.4). All the assertions then follow.
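Theorem 1.3 reduces the search over policies to h backward steps of equation (1.4). The sketch below evaluates (1.4) on a made-up two-state, two-action MDP; all the numbers are illustrative assumptions, not from the course:

```python
# Finite-horizon backward induction for a hypothetical MDP.
# P[u][x][y] = P(x_{t+1} = y | x_t = x, u_t = u); c[x][u] = stage cost.
P = [[[0.9, 0.1], [0.2, 0.8]],    # transition matrix under action 0
     [[0.5, 0.5], [0.6, 0.4]]]    # transition matrix under action 1
c = [[1.0, 2.0], [4.0, 3.0]]      # c[x][u]
C_h = [0.0, 10.0]                 # terminal cost C_h(x)
h = 5

F = list(C_h)                     # F(x, h) = C_h(x)
policy = []
for t in reversed(range(h)):      # the backward recursion (1.4)
    newF, dec = [], []
    for x in (0, 1):
        q = [c[x][u] + sum(P[u][x][y] * F[y] for y in (0, 1))
             for u in (0, 1)]
        newF.append(min(q))
        dec.append(q.index(min(q)))
    F, policy = newF, [dec] + policy

print(F)        # minimal expected cost from each starting state at t = 0
print(policy)   # optimal action as a function of (t, x): policy[t][x]
```

As the theorem asserts, the computed control is a function of x_t and t only; the histories X_t, U_t never enter the recursion.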
2 Some Examples of Dynamic Programming

We illustrate the method of dynamic programming and some useful tricks.

2.1 Example: managing spending and savings

An investor receives annual income from a building society of x_t pounds in year t. He consumes u_t and adds x_t - u_t to his capital, 0 ≤ u_t ≤ x_t. The capital is invested at interest rate 100θ%, and so his income in year t + 1 increases to

x_{t+1} = a(x_t, u_t) = x_t + θ(x_t - u_t).

He desires to maximize his total consumption over h years, C = sum_{t=0}^{h-1} u_t.

Solution. In the notation we have been using, c(x_t, u_t, t) = u_t, C_h(x_h) = 0. This is a time-homogeneous model, in which neither costs nor dynamics depend on t. It is easiest to work in terms of time to go, s = h - t. Let F_s(x) denote the maximal reward obtainable, starting in state x and when there is time s to go. The dynamic programming equation is

F_s(x) = max_{0 ≤ u ≤ x} [ u + F_{s-1}(x + θ(x - u)) ],

where F_0(x) = 0, (since no more can be obtained once time h is reached.) Here, x and u are generic values for x_s and u_s.

We can substitute backwards and soon guess the form of the solution. First,

F_1(x) = max_{0 ≤ u ≤ x} [ u + F_0(x + θ(x - u)) ] = max_{0 ≤ u ≤ x} u = x.

Next,

F_2(x) = max_{0 ≤ u ≤ x} [ u + F_1(x + θ(x - u)) ] = max_{0 ≤ u ≤ x} [ u + x + θ(x - u) ].

Since u + x + θ(x - u) is linear in u, its maximum occurs at u = 0 or u = x, and so

F_2(x) = max{(1 + θ)x, 2x} = max{1 + θ, 2} x = ρ_2 x.

This motivates the guess F_{s-1}(x) = ρ_{s-1} x. Trying this, we find

F_s(x) = max_{0 ≤ u ≤ x} [ u + ρ_{s-1}(x + θ(x - u)) ] = max{(1 + θ)ρ_{s-1}, 1 + ρ_{s-1}} x = ρ_s x.

Thus our guess is verified and F_s(x) = ρ_s x, where ρ_s obeys the recursion implicit in the above, i.e.,

ρ_s = ρ_{s-1} + max{θρ_{s-1}, 1}.

This gives

ρ_s = s for s ≤ s*, and ρ_s = (1 + θ)^{s - s*} s* for s ≥ s*,

where s* is the least integer such that s* ≥ 1/θ, i.e., s* = ⌈1/θ⌉. The optimal strategy is to invest the whole of the income in years 0, ..., h - s* - 1, (to build up capital) and then consume the whole of the income in years h - s*, ..., h - 1.

There are several things worth remembering from this example.

(i) It is often useful to frame things in terms of time to go, s.
(ii) Although the form of the dynamic programming equation can sometimes look messy, try working backwards from F_0(x) (which is known). Often a pattern will emerge from which we can piece together a solution.

(iii) When the dynamics are linear, the optimal control lies at an extreme point of the set of feasible controls. This form of policy, which either consumes nothing or consumes everything, is known as bang-bang control.

2.2 Example: exercising a stock option

The owner of a call option has the option to buy a share at fixed striking price p. The option must be exercised by day h. If he exercises the option on day t and then immediately sells the share at the current price x_t, he can make a profit of x_t - p. Suppose the price sequence obeys the equation x_{t+1} = x_t + ε_t, where the ε_t are i.i.d. random variables for which E|ε| < ∞. The aim is to exercise the option optimally.

Let F_s(x) be the value function (maximal expected profit) when the share price is x and there are s days to go. Show that (i) F_s(x) is non-decreasing in s, (ii) F_s(x) - x is non-increasing in x and (iii) F_s(x) is continuous in x. Deduce that the optimal policy can be characterised as follows. There exists a non-decreasing sequence {a_s} such that an optimal policy is to exercise the option the first time that x ≥ a_s, where x is the current price and s is the number of days to go before expiry of the option.

Solution. The state variable at time t is, strictly speaking, x_t plus a variable which indicates whether the option has been exercised or not. However, it is only the latter case which is of interest, so x is the effective state variable. Since dynamic programming makes its calculations backwards, from the termination point, it is often advantageous to write things in terms of the time to go, s = h - t. So if we let F_s(x) be the value function (maximal expected profit) with s days to go then

F_0(x) = max{x - p, 0},

and so the dynamic programming equation is

F_s(x) = max{x - p, E[F_{s-1}(x + ε)]},   s = 1, 2, ...
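The thresholds a_s can be estimated by iterating this equation on a price grid. Everything numerical in the sketch below is an assumption for illustration: the striking price p = 10, the grid, the number of Monte Carlo samples, and ε ~ Normal(0, 1):

```python
import random

random.seed(1)

# Value iteration for the option thresholds a_s (illustrative parameters).
p = 10.0
step = 0.1
grid = [i * step for i in range(301)]               # prices 0.0 .. 30.0
eps = [random.gauss(0.0, 1.0) for _ in range(1000)]  # common noise sample

def interp(F, x):
    """Piecewise-linear interpolation of F on the grid, clamped at the ends."""
    x = min(max(x, grid[0]), grid[-1])
    i = min(int(x / step), len(grid) - 2)
    w = (x - grid[i]) / step
    return (1 - w) * F[i] + w * F[i + 1]

F = [max(x - p, 0.0) for x in grid]                  # F_0(x)
thresholds = []
for s in range(1, 6):
    cont = [sum(interp(F, x + e) for e in eps) / len(eps) for x in grid]
    F = [max(x - p, c) for x, c in zip(grid, cont)]  # F_s(x)
    # a_s: smallest grid price at which exercising is (weakly) optimal.
    a_s = next(x for x, c in zip(grid, cont) if x - p >= c)
    thresholds.append(a_s)

print(thresholds)   # non-decreasing in s, as the text deduces
```

Because the same noise sample is reused at every stage, the monotonicity F_s ≥ F_{s-1} holds exactly in the computation, so the estimated a_s are non-decreasing as property (i) predicts.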
Note that the expectation operator comes outside, not inside, F_{s-1}(·). One can use induction to show (i), (ii) and (iii). For example, (i) is obvious, since increasing s means we have more time over which to exercise the option. However, for a formal proof,

F_1(x) = max{x - p, E[F_0(x + ε)]} ≥ max{x - p, 0} = F_0(x).

Now suppose, inductively, that F_{s-1} ≥ F_{s-2}. Then

F_s(x) = max{x - p, E[F_{s-1}(x + ε)]} ≥ max{x - p, E[F_{s-2}(x + ε)]} = F_{s-1}(x),
whence F_s is non-decreasing in s. Similarly, an inductive proof of (ii) follows from

F_s(x) - x = max{ -p, E[F_{s-1}(x + ε) - (x + ε)] + E(ε) },

since the left-hand side inherits the non-increasing character of the term E[F_{s-1}(x + ε) - (x + ε)] on the right. Thus the optimal policy can be characterized as stated. For from (ii), (iii) and the fact that F_s(x) ≥ x - p, it follows that there exists an a_s such that F_s(x) is greater than x - p if x < a_s and equals x - p if x ≥ a_s. It follows from (i) that a_s is non-decreasing in s. The constant a_s is the smallest x for which F_s(x) = x - p.

2.3 Example: accepting the best offer

We are to interview h candidates for a job. At the end of each interview we must either hire or reject the candidate we have just seen, and may not change this decision later. Candidates are seen in random order and can be ranked against those seen previously. The aim is to maximize the probability of choosing the candidate of greatest rank.

Solution. Let W_t be the history of observations up to time t, i.e., after we have interviewed the t-th candidate. All that matters are the value of t and whether the t-th candidate is better than all her predecessors: let x_t = 1 if this is true and x_t = 0 if it is not. In the case x_t = 1, the probability she is the best of all h candidates is

P(best of h | best of first t) = P(best of h) / P(best of first t) = (1/h) / (1/t) = t/h.

Now the fact that the t-th candidate is the best of the t candidates seen so far places no restriction on the relative ranks of the first t - 1 candidates; thus x_t = 1 and W_{t-1} are statistically independent and we have

P(x_t = 1 | W_{t-1}) = [P(W_{t-1} | x_t = 1) / P(W_{t-1})] P(x_t = 1) = P(x_t = 1) = 1/t.

Let F(0, t - 1) be the probability that under an optimal policy we select the best candidate, given that we have seen t - 1 candidates so far and the last one was not the best of those.
Dynamic programming gives

F(0, t - 1) = ((t - 1)/t) F(0, t) + (1/t) max{ t/h, F(0, t) }
            = max{ ((t - 1)/t) F(0, t) + 1/h, F(0, t) }.

These imply F(0, t - 1) ≥ F(0, t) for all t ≤ h. Therefore, since t/h and F(0, t) are respectively increasing and non-increasing in t, it must be that for small t we have F(0, t) > t/h and for large t we have F(0, t) < t/h. Let t_0 be the smallest t such that F(0, t) ≤ t/h. Then

F(0, t - 1) = F(0, t_0),                            t < t_0,
F(0, t - 1) = ((t - 1)/t) F(0, t) + 1/h,            t ≥ t_0.

Solving the second of these backwards from the point t = h, F(0, h) = 0, we obtain

F(0, t - 1)/(t - 1) = 1/(h(t - 1)) + F(0, t)/t = ... = 1/(h(t - 1)) + 1/(ht) + ... + 1/(h(h - 1)),

whence

F(0, t - 1) = ((t - 1)/h) sum_{τ=t-1}^{h-1} 1/τ,   t ≥ t_0.

Since we require F(0, t_0) ≤ t_0/h, it must be that t_0 is the smallest integer satisfying

sum_{τ=t_0}^{h-1} 1/τ ≤ 1.

For large h the sum on the left above is about log(h/t_0), so log(h/t_0) ≈ 1 and we find t_0 ≈ h/e. The optimal policy is to interview about h/e candidates, but without selecting any of these, and then select the first one thereafter that is the best of all those seen so far. The probability of success is F(0, t_0) ≈ t_0/h ≈ 1/e = 0.3679. It is surprising that the probability of success is so large for arbitrarily large h.

There are a couple of lessons in this example.

(i) It is often useful to try to establish the fact that terms over which a maximum is being taken are monotone in opposite directions, as we did with t/h and F(0, t).

(ii) A typical approach is to first determine the form of the solution, then find the optimal cost (reward) function by backward recursion from the terminal point, where its value is known.
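The cutoff t_0 and the success probability can be computed exactly from the two displayed formulas; a short sketch:

```python
import math

def secretary(h):
    """Return (t0, F(0, t0)) where t0 is the smallest t with
    sum_{tau=t}^{h-1} 1/tau <= 1, and the success probability is
    F(0, t0) = (t0/h) * sum_{tau=t0}^{h-1} 1/tau."""
    t0 = next(t for t in range(1, h + 1)
              if sum(1.0 / tau for tau in range(t, h)) <= 1.0)
    prob = (t0 / h) * sum(1.0 / tau for tau in range(t0, h))
    return t0, prob

for h in (10, 100, 1000):
    t0, prob = secretary(h)
    print(h, t0, round(prob, 4), round(h / math.e, 1))
```

Running this shows t_0 tracking h/e and the success probability approaching 1/e from above, in line with the asymptotic argument.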
3 Dynamic Programming over the Infinite Horizon

We define the cases of discounted, negative and positive dynamic programming and establish the validity of the optimality equation for an infinite horizon problem.

3.1 Discounted costs

For a discount factor, β in (0, 1], the discounted-cost criterion is defined as

C = sum_{t=0}^{h-1} β^t c(x_t, u_t, t) + β^h C_h(x_h).   (3.1)

This simplifies things mathematically, particularly when we want to consider an infinite horizon. If costs are uniformly bounded, say |c(x, u)| < B, and discounting is strict (β < 1) then the infinite horizon cost is bounded by B/(1 - β). In economic language, if there is an interest rate of r% per unit time, then a unit amount of money at time t is worth ρ = 1 + r/100 at time t + 1. Equivalently, a unit amount at time t + 1 has present value β = 1/ρ. The function, F(x, t), which expresses the minimal present value at time t of expected cost from time t up to h is

F(x, t) = inf_{u_t, ..., u_{h-1}} E[ sum_{τ=t}^{h-1} β^{τ-t} c(x_τ, u_τ, τ) + β^{h-t} C_h(x_h) | x_t = x ].   (3.2)

The DP equation is now

F(x, t) = inf_u [ c(x, u, t) + βE F(a(x, u, t), t + 1) ],   t < h,   (3.3)

where F(x, h) = C_h(x).

3.2 Example: job scheduling

A collection of n jobs is to be processed in arbitrary order by a single machine. Job i has processing time p_i and when it completes a reward r_i is obtained. Find the order of processing that maximizes the sum of the discounted rewards.

Solution. Here we take time k as the point at which the (n - k)-th job has just been completed and the state at time k as the collection of uncompleted jobs, say S_k. The dynamic programming equation is

F_k(S_k) = max_{i in S_k} [ r_i β^{p_i} + β^{p_i} F_{k-1}(S_k - {i}) ].

Obviously F_0(∅) = 0. Applying the method of dynamic programming we first find F_1({i}) = r_i β^{p_i}. Then, working backwards, we find

F_2({i, j}) = max{ r_i β^{p_i} + β^{p_i + p_j} r_j,  r_j β^{p_j} + β^{p_j + p_i} r_i }.

There will be 2^n equations to evaluate, but with perseverance we can determine F_n({1, 2, ..., n}). However, there is a simpler way.

An interchange argument.
Suppose that jobs are scheduled in the order i_1, ..., i_k, i, j, i_{k+2}, ..., i_n. Compare the reward of this schedule to one in which the order of jobs i and j is reversed: i_1, ..., i_k, j, i, i_{k+2}, ..., i_n. The rewards under the two schedules are respectively

R_1 + β^{T + p_i} r_i + β^{T + p_i + p_j} r_j + R_2   and   R_1 + β^{T + p_j} r_j + β^{T + p_j + p_i} r_i + R_2,

where T = p_{i_1} + ... + p_{i_k}, and R_1 and R_2 are respectively the sums of the rewards due to the jobs coming before and after jobs i, j; these are the same under both schedules. The reward of the first schedule is greater if

r_i β^{p_i} / (1 - β^{p_i}) > r_j β^{p_j} / (1 - β^{p_j}).

Hence a schedule can be optimal only if the jobs are taken in decreasing order of the indices r_i β^{p_i} / (1 - β^{p_i}). This type of reasoning is known as an interchange argument. There are a couple of points to note.

(i) An interchange argument can be useful for solving a decision problem about a system that evolves in stages. Although such problems can be solved by dynamic programming, an interchange argument, when it works, is usually easier.

(ii) The decision points need not be equally spaced in time. Here they are the points at which a number of jobs have been completed.

3.3 The infinite-horizon case

In the finite-horizon case the cost function is obtained simply from (3.3) by the backward recursion from the terminal point. However, when the horizon is infinite there is no terminal point and so the validity of the optimality equation is no longer obvious.

Let us consider the time-homogeneous Markov case, in which costs and dynamics do not depend on t, i.e., c(x, u, t) = c(x, u). Suppose also that there is no terminal cost, i.e., C_h(x) = 0. Define the s-horizon cost under policy π as

F_s(π, x) = E_π[ sum_{t=0}^{s-1} β^t c(x_t, u_t) | x_0 = x ],

where E_π denotes expectation over the path of the process under policy π. If we take the infimum with respect to π we have the infimal s-horizon cost

F_s(x) = inf_π F_s(π, x).
Clearly, this always exists and satisfies the optimality equation

F_s(x) = inf_u { c(x, u) + βE[F_{s-1}(x_1) | x_0 = x, u_0 = u] },   (3.4)

with terminal condition F_0(x) = 0.

The infinite-horizon cost under policy π is also quite naturally defined as

F(π, x) = lim_{s→∞} F_s(π, x).   (3.5)

This limit need not exist, but it will do so under any of the following scenarios.
D (discounted programming): 0 < β < 1, and |c(x, u)| < B for all x, u.

N (negative programming): 0 < β ≤ 1 and c(x, u) ≥ 0 for all x, u.

P (positive programming): 0 < β ≤ 1 and c(x, u) ≤ 0 for all x, u.

Notice that the names negative and positive appear to be the wrong way around with respect to the sign of c(x, u). However, the names make sense if we think of equivalent problems of maximizing rewards. Maximizing positive rewards (P) is the same thing as minimizing negative costs. Maximizing negative rewards (N) is the same thing as minimizing positive costs. In cases N and P we usually take β = 1.

The existence of the limit (possibly infinite) in (3.5) is assured in cases N and P by monotone convergence, and in case D because the total cost occurring after the s-th step is bounded by β^s B/(1 - β).

3.4 The optimality equation in the infinite-horizon case

The infimal infinite-horizon cost is defined as

F(x) = inf_π F(π, x) = inf_π lim_{s→∞} F_s(π, x).   (3.6)

The following theorem justifies our writing an optimality equation.

Theorem 3.1 Suppose D, N, or P holds. Then F(x) satisfies the optimality equation

F(x) = inf_u { c(x, u) + βE[F(x_1) | x_0 = x, u_0 = u] }.   (3.7)

Proof. We first prove that ≥ holds in (3.7). Suppose π is a policy, which chooses u_0 = u when x_0 = x. Then

F_s(π, x) = c(x, u) + βE[F_{s-1}(π, x_1) | x_0 = x, u_0 = u].   (3.8)

Either D, N or P is sufficient to allow us to take limits on both sides of (3.8) and interchange the order of limit and expectation. In cases N and P this is because of monotone convergence. Infinity is allowed as a possible limiting value. We obtain

F(π, x) = c(x, u) + βE[F(π, x_1) | x_0 = x, u_0 = u]
        ≥ c(x, u) + βE[F(x_1) | x_0 = x, u_0 = u]
        ≥ inf_u { c(x, u) + βE[F(x_1) | x_0 = x, u_0 = u] }.

Minimizing the left-hand side over π gives ≥.

To prove ≤, fix x and consider a policy π that, having chosen u_0 and reached state x_1, then follows a policy π_1 which is suboptimal by less than ε from that point, i.e., F(π_1, x_1) ≤ F(x_1) + ε. Note that such a policy must exist, by definition of F, although π_1 will depend on x_1.
We have

F(x) ≤ F(π, x) = c(x, u) + βE[F(π_1, x_1) | x_0 = x, u_0 = u]
              ≤ c(x, u) + βE[F(x_1) + ε | x_0 = x, u_0 = u]
              = c(x, u) + βE[F(x_1) | x_0 = x, u_0 = u] + βε.

Minimizing the right-hand side over u and recalling that ε is arbitrary gives ≤.

3.5 Example: selling an asset

A speculator owns a rare collection of tulip bulbs and each day has one opportunity to sell it, which he may either accept or reject. The potential sale prices are independently and identically distributed with probability density function g(x), x ≥ 0. Each day there is a probability 1 - β that the market for tulip bulbs will collapse, making his bulb collection completely worthless. Find the policy that maximizes his expected return and express it as the unique root of an equation. Show that if β > 1/2, g(x) = 2/x^3, x ≥ 1, then he should sell the first time the sale price is at least √(β/(1 - β)).

Solution. There are only two states, depending on whether he has sold the collection or not. Let these be 0 and 1 respectively. The optimality equation is

F(1) = ∫_{y=0}^{∞} max{y, βF(1)} g(y) dy
     = βF(1) + ∫_{y=βF(1)}^{∞} [y - βF(1)] g(y) dy.

Hence

(1 - β)F(1) = ∫_{y=βF(1)}^{∞} [y - βF(1)] g(y) dy.   (3.9)

That this equation has a unique root, F(1) = F*, follows from the fact that the left and right hand sides are respectively increasing and decreasing in F(1). Thus he should sell when he can get at least βF*. His maximal reward is F*.

Consider the case g(y) = 2/y^3, y ≥ 1. The left-hand side of (3.9) is less than the right-hand side at F(1) = 1 provided β > 1/2. In this case the root is greater than 1 and we compute it from

(1 - β)F(1) = 2/(βF(1)) - βF(1)/(βF(1))^2 = 1/(βF(1)),

and thus F* = 1/√(β(1 - β)) and βF* = √(β/(1 - β)). If β ≤ 1/2 he should sell at any price.

Notice that discounting arises in this problem because at each stage there is a probability 1 - β that a catastrophe will occur that brings things to a sudden end. This characterization of a manner in which discounting can arise is often quite useful.
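The fixed point of (3.9) can also be checked numerically. A sketch for g(y) = 2/y^3 as above, solving (1 - β)F = 1/(βF) by bisection on its monotone left-minus-right difference:

```python
import math

def tulip_threshold(beta, tol=1e-12):
    """Solve (1 - beta) F = 1/(beta F) for F by bisection and return
    (F, beta * F); the optimal policy sells at the first price >= beta * F.
    Assumes beta > 1/2, so the root exceeds 1 as in the text."""
    def h(F):
        # Increasing in F: crosses zero exactly once.
        return (1 - beta) * F - 1 / (beta * F)
    lo, hi = 1.0, 1e6
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return lo, beta * lo

F_star, sell_at = tulip_threshold(0.8)
print(F_star, sell_at)   # 1/sqrt(0.8 * 0.2) = 2.5, and sqrt(0.8/0.2) = 2.0
```

For β = 0.8 the bisection reproduces the closed form: F* = 1/√(β(1 - β)) = 2.5 and the selling threshold βF* = √(β/(1 - β)) = 2.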
4 Positive Programming

We address the special theory of maximizing positive rewards, (noting that there may be no optimal policy, but that if a policy has a value function that satisfies the optimality equation then it is optimal), and the method of value iteration.

4.1 Example: possible lack of an optimal policy

Positive programming concerns minimizing non-positive costs, c(x, u) ≤ 0. The name originates from the equivalent problem of maximizing non-negative rewards, r(x, u) ≥ 0, and for this section we present results in that setting. The following example shows that there may be no optimal policy.

Suppose the possible states are the non-negative integers and in state x we have a choice of either moving to state x + 1 and receiving no reward, or moving to state 0, obtaining reward 1 - 1/x, and then remaining in state 0 thereafter and obtaining no further reward. The optimality equation is

F(x) = max{1 - 1/x, F(x + 1)},   x > 0.

Clearly F(x) = 1, x > 0, but the policy that chooses the maximizing action in the optimality equation always moves on to state x + 1 and hence has zero reward. Clearly, there is no policy that actually achieves a reward of 1.

4.2 Characterization of the optimal policy

The following theorem provides a necessary and sufficient condition for a policy to be optimal: namely, its value function must satisfy the optimality equation. This theorem also holds for the case of strict discounting and bounded costs.

Theorem 4.1 Suppose D or P holds and π is a policy whose value function F(π, x) satisfies the optimality equation

F(π, x) = sup_u { r(x, u) + βE[F(π, x_1) | x_0 = x, u_0 = u] }.

Then π is optimal.

Proof. Let π' be any policy and suppose it takes u_t = f_t(x_t). Since F(π, x) satisfies the optimality equation,

F(π, x) ≥ r(x, f_0(x)) + βE[F(π, x_1) | x_0 = x, u_0 = f_0(x)].

By repeated substitution of this into itself, we find

F(π, x) ≥ E_{π'}[ sum_{t=0}^{s-1} β^t r(x_t, u_t) | x_0 = x ] + β^s E_{π'}[F(π, x_s) | x_0 = x].   (4.1)
In case P we can drop the final term on the right-hand side of (4.1) (because it is non-negative) and then let s → ∞; in case D we can let s → ∞ directly, observing that this term tends to zero. Either way, we have F(π, x) ≥ F(π', x).

4.3 Example: optimal gambling

A gambler has i pounds and wants to increase this to N. At each stage she can bet any fraction of her capital, say j ≤ i. Either she wins, with probability p, and now has i + j pounds, or she loses, with probability q = 1 - p, and has i - j pounds. Let the state space be {0, 1, ..., N}. The game stops upon reaching state 0 or N. The only non-zero reward is 1, upon reaching state N. Suppose p ≥ 1/2. Prove that the timid strategy, of always betting only 1 pound, maximizes the probability of the gambler attaining N pounds.

Solution. The optimality equation is

F(i) = max_{j: j ≤ i} { pF(i + j) + qF(i - j) }.

To show that the timid strategy is optimal we need to find its value function, say G(i), and show that it is a solution to the optimality equation. We have G(i) = pG(i + 1) + qG(i - 1), with G(0) = 0, G(N) = 1. This recurrence gives

G(i) = [1 - (q/p)^i] / [1 - (q/p)^N],   p > 1/2,
G(i) = i/N,                             p = 1/2.

If p = 1/2, then G(i) = i/N clearly satisfies the optimality equation. If p > 1/2 we simply have to verify that

G(i) = [1 - (q/p)^i] / [1 - (q/p)^N]
     = max_{j: j ≤ i} { p [1 - (q/p)^{i+j}] / [1 - (q/p)^N] + q [1 - (q/p)^{i-j}] / [1 - (q/p)^N] }.

It is a simple exercise to show that j = 1 maximizes the right-hand side.

4.4 Value iteration

The infimal cost function F can be approximated by successive approximation or value iteration. This is an important and practical method of computing F. Let us define

F_∞(x) = lim_{s→∞} F_s(x) = lim_{s→∞} inf_π F_s(π, x).   (4.2)

This exists (by monotone convergence under N or P, or by the fact that under D the cost incurred after time s is vanishingly small.) Notice that (4.2) reverses the order of lim_s and inf_π in (3.6). The following theorem states that we can interchange the order of these operations and that therefore
F_s(x) → F(x). However, in case N we need an additional assumption:

F (finite actions): There are only finitely many possible values of u in each state.

Theorem 4.2 Suppose that D or P holds, or N and F hold. Then F∞(x) = F(x).

Proof. First we prove ≤. Given any policy π̄,

F∞(x) = lim_{s→∞} F_s(x) = lim_{s→∞} inf_π F_s(π, x) ≤ lim_{s→∞} F_s(π̄, x) = F(π̄, x).

Taking the infimum over π̄ gives F∞(x) ≤ F(x).

Now we prove ≥. In the positive case, c(x, u) ≤ 0, so F_s(x) ≥ F(x). Now let s → ∞. In the discounted case, with |c(x, u)| < B, imagine subtracting B from every cost. This reduces the infinite-horizon cost under any policy by exactly B/(1 − β), and F(x) and F∞(x) also decrease by this amount. All costs are now negative, so the result we have just proved applies. Alternatively, note that

F_s(x) − β^s B/(1 − β) ≤ F(x) ≤ F_s(x) + β^s B/(1 − β)

(can you see why?) and hence lim_{s→∞} F_s(x) = F(x). In the negative case,

F∞(x) = lim_{s→∞} min_u { c(x, u) + E[F_{s−1}(x_1) | x_0 = x, u_0 = u] }
      = min_u { c(x, u) + lim_{s→∞} E[F_{s−1}(x_1) | x_0 = x, u_0 = u] }
      = min_u { c(x, u) + E[F∞(x_1) | x_0 = x, u_0 = u] }, (4.3)

where the first equality holds because the minimum is over a finite number of terms and the second follows by Lebesgue monotone convergence (since F_s(x) increases in s). Let π be the policy that chooses the minimizing action on the right hand side of (4.3). Then, by substitution of (4.3) into itself, and using the fact that N implies F∞ ≥ 0,

F∞(x) = E_π[ Σ_{t=0}^{s−1} c(x_t, u_t) + F∞(x_s) | x_0 = x ] ≥ E_π[ Σ_{t=0}^{s−1} c(x_t, u_t) | x_0 = x ].

Letting s → ∞ gives F∞(x) ≥ F(π, x) ≥ F(x).

4.5 Example: pharmaceutical trials

A doctor has two drugs available to treat a disease. One is a well-established drug and is known to work for a given patient with probability p, independently of its success for other patients. The new drug is untested and has an unknown probability of success θ, which the doctor believes to be uniformly distributed over [0, 1]. He treats one patient per day and must choose which drug to use. Suppose he has observed s successes and f failures with the new drug. Let F(s, f) be the maximal expected-discounted number of future patients who are successfully treated if he chooses between the drugs optimally from this point onwards. For example, if he uses only the established drug, the expected-discounted number of patients successfully treated is p + βp + β²p + ··· = p/(1 − β).

The posterior distribution of θ is

f(θ | s, f) = ((s + f + 1)! / (s! f!)) θ^s (1 − θ)^f, 0 ≤ θ ≤ 1,

and the posterior mean is θ̄(s, f) = (s + 1)/(s + f + 2). The optimality equation is

F(s, f) = max{ p/(1 − β), ((s + 1)/(s + f + 2)) (1 + β F(s + 1, f)) + ((f + 1)/(s + f + 2)) β F(s, f + 1) }.

It is not possible to give a nice expression for F, but we can find an approximate numerical solution. If s + f is very large, say 300, then θ̄(s, f) = (s + 1)/(s + f + 2) is a good approximation to θ. Thus we can take F(s, f) ≈ (1 − β)^{−1} max{p, θ̄(s, f)} for s + f = 300, and work backwards. For β = 0.95 one obtains a table of values (omitted here), whose (s, f) entry is the greatest value of p for which it is worth continuing with at least one more trial of the new drug. For example, with s = 3, f = 3 it is worth continuing with the new drug when p = 0.6 < 0.6392. At this point the probability that the new drug will successfully treat the next patient is 0.5, and so the doctor should actually prescribe the drug that is least likely to cure! This example shows the difference between a myopic policy, which aims to maximize immediate reward, and an optimal policy, which forgoes immediate reward in order to gain information and possibly greater rewards later on. Notice that it is worth using the new drug at least once if p < 0.7614, even though at its first use the new drug will only be successful with probability 1/2.
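The backward recursion just described is easy to carry out numerically. The sketch below works back from the boundary s + f = 300, as in the text; the boundary approximation, the tolerances and the particular values of p checked against the quoted thresholds are my own choices.

```python
# Numerical solution of the pharmaceutical-trials optimality equation by
# working backwards from the boundary s + f = 300 (as in the text).
# Tolerances and the particular p values tested are illustrative choices.

def trial_value(p, beta=0.95, horizon=300):
    """Return F(s, f) on the grid s + f <= horizon, as a dict."""
    F = {}
    stop = p / (1 - beta)          # value of using the established drug forever
    # boundary: treat theta as (approximately) known once s + f = horizon
    for s in range(horizon + 1):
        f = horizon - s
        theta = (s + 1) / (s + f + 2)
        F[(s, f)] = max(p, theta) / (1 - beta)
    # backward recursion on the optimality equation
    for n in range(horizon - 1, -1, -1):
        for s in range(n + 1):
            f = n - s
            theta = (s + 1) / (s + f + 2)
            cont = theta * (1 + beta * F[(s + 1, f)]) \
                   + (1 - theta) * beta * F[(s, f + 1)]
            F[(s, f)] = max(stop, cont)
    return F

# with p = 0.6 < 0.6392 it should be worth experimenting from state (3, 3)
F = trial_value(0.6)
```

With p = 0.7, which lies above the quoted threshold 0.6392 for (s, f) = (3, 3), the computed F(3, 3) should equal p/(1 − β) exactly, since the max in the optimality equation is then achieved by the stopping branch.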
5 Negative Programming

We address the special theory of minimizing positive costs (noting that the action that extremizes the right hand side of the optimality equation gives an optimal policy), and stopping problems and their solution.

5.1 Stationary policies

A Markov policy is a policy that specifies the control at time t to be simply a function of the state and time. In the proof of Theorem 4.1 we used u_t = f_t(x_t) to specify the control at time t. This is a convenient notation for a Markov policy, and we write π = (f_0, f_1, ...). If in addition the policy does not depend on time, it is said to be a stationary Markov policy, and we write π = (f, f, ...) = f^∞.

5.2 Characterization of the optimal policy

Negative programming concerns minimizing non-negative costs, c(x, u) ≥ 0. The name originates from the equivalent problem of maximizing non-positive rewards, r(x, u) ≤ 0. The following theorem gives a necessary and sufficient condition for a stationary policy to be optimal: namely, it must choose the optimal u on the right hand side of the optimality equation. Note that in the statement of this theorem we are requiring that the infimum over u is attained as a minimum over u.

Theorem 5.1 Suppose D or N holds. Suppose π = f^∞ is the stationary Markov policy such that

c(x, f(x)) + β E[F(x_1) | x_0 = x, u_0 = f(x)] = min_u { c(x, u) + β E[F(x_1) | x_0 = x, u_0 = u] }.

Then F(π, x) = F(x), and π is optimal.

Proof. By substituting the optimality equation into itself and using the fact that π = f^∞ specifies the minimizing control at each stage,

F(x) = E_π[ Σ_{t=0}^{s−1} β^t c(x_t, u_t) | x_0 = x ] + β^s E_π[ F(x_s) | x_0 = x ]. (5.1)

In case N we can drop the final term on the right hand side of (5.1) (because it is non-negative) and then let s → ∞; in case D we can let s → ∞ directly, observing that this term tends to zero. Either way, we have F(x) ≥ F(π, x). Since trivially F(x) ≤ F(π, x), the two are equal.

A corollary is that if assumption F holds then an optimal policy exists.
Neither Theorem 5.1 nor this corollary is true for positive programming (cf. the example in Section 4.1).

5.3 Optimal stopping over a finite horizon

One way that the total expected cost can be finite is if it is possible to enter a state from which no further costs are incurred. Suppose u has just two possible values: u = 0 (stop) and u = 1 (continue). Suppose there is a termination state, say 0, that is entered upon choosing the stopping action. Once this state is entered the system stays in that state and no further cost is incurred thereafter. Suppose that stopping is mandatory, in that we must continue for no more than s steps. The finite-horizon dynamic programming equation is therefore

F_s(x) = min{ k(x), c(x) + E[F_{s−1}(x_1) | x_0 = x, u_0 = 1] }, (5.2)

with F_0(x) = k(x), c(0) = 0.

Consider the set of states in which it is at least as good to stop now as to continue one more step and then stop:

S = { x : k(x) ≤ c(x) + E[k(x_1) | x_0 = x, u_0 = 1] }.

Clearly, it cannot be optimal to stop if x ∉ S, since in that case it would be strictly better to continue one more step and then stop. The following theorem characterises all finite-horizon optimal policies.

Theorem 5.2 Suppose S is closed (so that once the state enters S it remains in S). Then an optimal policy for all finite horizons is: stop if and only if x ∈ S.

Proof. The proof is by induction. If the horizon is s = 1, then obviously it is optimal to stop only if x ∈ S. Suppose the theorem is true for a horizon of s − 1. As above, if x ∉ S then it is better to continue for one more step and stop rather than stop in state x. If x ∈ S, then the fact that S is closed implies x_1 ∈ S and so F_{s−1}(x_1) = k(x_1). But then (5.2) gives F_s(x) = k(x). So we should stop if x ∈ S.

The optimal policy is known as a one-step look-ahead rule (OSLA).

5.4 Example: optimal parking

A driver is looking for a parking space on the way to his destination. Each parking space is free with probability p, independently of whether other parking spaces are free or not.
The driver cannot observe whether a parking space is free until he reaches it. If he parks s spaces from the destination, he incurs cost s, s = 0, 1, .... If he passes the destination without having parked the cost is D.

Show that an optimal policy is to park in the first free space that is no further than s* from the destination, where s* is the greatest integer s such that (Dp + 1)q^s ≥ 1.

Solution. When the driver is s spaces from the destination it only matters whether the space is available (x = 1) or full (x = 0). The optimality equation gives

F_s(0) = q F_{s−1}(0) + p F_{s−1}(1),
F_s(1) = min{ s,                              (take available space)
              q F_{s−1}(0) + p F_{s−1}(1) },  (ignore available space)
where F_0(0) = D, F_0(1) = 0.

Suppose the driver adopts a policy of taking the first free space that is s or closer. Let the cost under this policy be k(s), where

k(s) = ps + q k(s − 1),

with k(0) = qD. The general solution is of the form k(s) = −q/p + s + c q^s. So after substituting and using the boundary condition at s = 0, we have

k(s) = −q/p + s + (D + 1/p) q^{s+1}, s = 0, 1, ....

It is better to stop now (at a distance s from the destination) than to go on and take the first available space if s is in the stopping set

S = { s : s ≤ k(s − 1) } = { s : (Dp + 1)q^s ≥ 1 }.

This set is closed (since s decreases) and so by Theorem 5.2 this stopping set describes the optimal policy.

If the driver parks in the first available space past his destination and walks back, then D = 1 + qD, so D = 1/p and s* is the greatest integer such that 2q^s ≥ 1.

5.5 Optimal stopping over the infinite horizon

Let us now consider the stopping problem over the infinite horizon. As above, let F_s(x) be the infimal cost given that we are required to stop by time s. Let F(x) be the infimal cost when all that is required is that we stop eventually. Since less cost can be incurred if we are allowed more time in which to stop, we have F_s(x) ≥ F_{s+1}(x) ≥ F(x). Thus by monotone convergence F_s(x) tends to a limit, say F∞(x), and F∞(x) ≥ F(x).

Example: we can have F∞ > F

Consider the problem of stopping a symmetric random walk on the integers, where c(x) = 0, k(x) = exp(−x). The policy of stopping immediately, π, has F(π, x) = exp(−x), and this satisfies the infinite-horizon optimality equation,

F(x) = min{ exp(−x), (1/2)F(x + 1) + (1/2)F(x − 1) }.

However, π is not optimal. A symmetric random walk is recurrent, so we may wait until reaching as large an integer as we like before stopping; hence F(x) = 0. Inductively, one can see that F_s(x) = exp(−x). So F∞(x) = exp(−x) > 0 = F(x). (Note: Theorem 4.2 says that F∞ = F, but that is in a setting in which there is no terminal cost and for different definitions of F_s and F than we take here.)
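The claim that F_s(x) = exp(−x) for every finite s can be confirmed by running the finite-horizon recursion (5.2) directly: since (1/2)(e^{−(x+1)} + e^{−(x−1)}) = e^{−x} cosh(1) > e^{−x}, stopping always wins. The truncated state range and the pinning of boundary states to their stopping cost are my own sketch choices.

```python
import math

# Finite-horizon DP for stopping a symmetric random walk with c(x) = 0 and
# k(x) = exp(-x), on a truncated range of integers (truncation is an
# illustrative choice; boundary states are pinned to their stopping cost).
xs = range(-15, 16)
k = {x: math.exp(-x) for x in xs}

F = dict(k)                               # F_0 = k
for s in range(1, 51):
    new = dict(F)
    for x in range(-14, 15):              # interior states only
        cont = 0.5 * F[x + 1] + 0.5 * F[x - 1]
        new[x] = min(k[x], cont)
    F = new
```

After any number of iterations the min is always achieved by the stopping branch, so F is unchanged from k, in line with F∞(x) = exp(−x).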
Example: Theorem 4.1 is not true for negative programming

Consider the above example, but now suppose one is allowed never to stop. Since continuation costs are zero, the optimal policy for all finite horizons and for the infinite horizon is never to stop. So F(x) = 0 and this satisfies the optimality equation above. However, F(π, x) = exp(−x) also satisfies the optimality equation and is the cost incurred by stopping immediately. Thus it is not true (as it is for positive programming) that a policy whose cost function satisfies the optimality equation is optimal.

The following lemma gives conditions under which the infimal finite-horizon cost does converge to the infimal infinite-horizon cost.

Lemma 5.3 Suppose all costs are bounded as follows:

(a) K = sup_x k(x) < ∞, (b) C = inf_x c(x) > 0. (5.3)

Then F_s(x) → F(x) as s → ∞.

Proof. (*starred*) Suppose π is an optimal policy for the infinite-horizon problem and stops at the random time τ. Then its cost is at least (s + 1)C P(τ > s). However, since it would be possible to stop at time 0, the cost is also no more than K, so

(s + 1)C P(τ > s) ≤ F(x) ≤ K.

In the s-horizon problem we could follow π, but stop at time s if τ > s. This implies

F(x) ≤ F_s(x) ≤ F(x) + K P(τ > s) ≤ F(x) + K² / ((s + 1)C).

By letting s → ∞, we have F∞(x) = F(x).

Note that the problem posed here is identical to one in which we pay K at the start and receive a terminal reward r(x) = K − k(x).

Theorem 5.4 Suppose S is closed and (5.3) holds. Then an optimal policy for the infinite horizon is: stop if and only if x ∈ S.

Proof. By Theorem 5.2 we have, for all finite s,

F_s(x) = k(x), x ∈ S; F_s(x) < k(x), x ∉ S.

Lemma 5.3 gives F(x) = F∞(x).
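Returning to the parking example of Section 5.4, both the closed form for k(s) and the OSLA stopping set can be checked numerically. The parameter values p = 0.3, D = 20 below are illustrative choices, not from the text.

```python
# Check of the optimal-parking solution: the recursion k(s) = p*s + q*k(s-1)
# with k(0) = q*D against the closed form k(s) = -q/p + s + (D + 1/p)*q**(s+1),
# and the stopping set {s : (D*p + 1)*q**s >= 1}. Parameters are illustrative.

p, q, D = 0.3, 0.7, 20.0

k_rec = [q * D]                       # k(0) = qD
for s in range(1, 31):
    k_rec.append(p * s + q * k_rec[-1])

k_closed = [-q / p + s + (D + 1 / p) * q ** (s + 1) for s in range(31)]

# OSLA threshold: greatest s with (D*p + 1)*q**s >= 1
s_star = max(s for s in range(200) if (D * p + 1) * q ** s >= 1)
```

For these parameters (Dp + 1)q^s = 7 × 0.7^s, which is ≥ 1 up to s = 5 and below 1 from s = 6, so s* = 5; and the set {s : s ≤ k(s − 1)} should coincide with {s : (Dp + 1)q^s ≥ 1}, as derived above.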
6 Average-cost Programming

We address the infinite-horizon average-cost case, the optimality equation for this case, and the policy improvement algorithm.

6.1 Average-cost optimization

It can happen that the undiscounted expected total cost is infinite, but the accumulation of cost per unit time is finite. Suppose that for a stationary Markov policy π the following limit exists:

λ(π, x) = lim_{t→∞} (1/t) E_π[ Σ_{s=0}^{t−1} c(x_s, u_s) | x_0 = x ].

It is reasonable to expect that there is a well-defined notion of an optimal average-cost function, λ(x) = inf_π λ(π, x), and that under appropriate assumptions λ(x) = λ should not depend on x. Moreover, one would expect

F_s(x) = sλ + φ(x) + ε(s, x),

where ε(s, x) → 0 as s → ∞. Here φ(x) + ε(s, x) reflects a transient due to the initial state. Suppose that the state space and action space are finite. From the optimality equation for the finite-horizon problem we have

F_s(x) = min_u { c(x, u) + E[F_{s−1}(x_1) | x_0 = x, u_0 = u] }. (6.1)

So by substituting F_s(x) ≈ sλ + φ(x) into (6.1), we obtain

sλ + φ(x) ≈ min_u { c(x, u) + E[(s − 1)λ + φ(x_1) | x_0 = x, u_0 = u] },

which suggests, what is in fact, the average-cost optimality equation:

λ + φ(x) = min_u { c(x, u) + E[φ(x_1) | x_0 = x, u_0 = u] }. (6.2)

Theorem 6.1 Let λ denote the minimal average-cost. Suppose there exists a constant λ' and a bounded function φ such that for all x and u,

λ' + φ(x) ≤ c(x, u) + E[φ(x_1) | x_0 = x, u_0 = u]. (6.3)

Then λ' ≤ λ. This also holds when ≤ is replaced by ≥ and the hypothesis is weakened to: for each x there exists a u such that (6.3) holds when ≤ is replaced by ≥.

Proof. Suppose u is chosen by some policy π. By repeated substitution of (6.3) into itself we have

φ(x) ≤ −tλ' + E_π[ Σ_{s=0}^{t−1} c(x_s, u_s) | x_0 = x ] + E_π[ φ(x_t) | x_0 = x ].

Divide this by t and let t → ∞ to obtain

0 ≤ −λ' + lim_{t→∞} (1/t) E_π[ Σ_{s=0}^{t−1} c(x_s, u_s) | x_0 = x ],

where the final term on the right hand side is simply the average-cost under policy π. Minimizing the right hand side over π gives the result. The claim with ≤ replaced by ≥ is proved similarly.
Theorem 6.2 Suppose there exists a constant λ and a bounded function φ satisfying (6.2). Then λ is the minimal average-cost and the optimal stationary policy is the one that chooses the optimizing u on the right hand side of (6.2).

Proof. Equation (6.2) implies that (6.3) holds with equality when one takes π to be the stationary policy that chooses the optimizing u on the right hand side of (6.2). Thus π is optimal and λ is the minimal average-cost.

The average-cost optimal policy is found simply by looking for a bounded solution to (6.2). Notice that if φ is a solution of (6.2) then so is φ + (a constant), because the constant will cancel from both sides of (6.2). Thus φ is undetermined up to an additive constant. In searching for a solution to (6.2) we can therefore pick any state, say x̄, and arbitrarily take φ(x̄) = 0.

6.2 Example: admission control at a queue

Each day a consultant is presented with the opportunity to take on a new job. The jobs are independently distributed over n possible types and on a given day the offered type is i with probability a_i, i = 1, ..., n. Jobs of type i pay R_i upon completion. Once he has accepted a job he may accept no other job until that job is complete. The probability that a job of type i takes k days is (1 − p_i)^{k−1} p_i, k = 1, 2, .... Which jobs should the consultant accept?

Solution. Let 0 and i denote the states in which he is free to accept a job, and in which he is engaged upon a job of type i, respectively. Then (6.2), written for maximal average-reward, is

λ + φ(0) = Σ_{i=1}^n a_i max{ φ(0), φ(i) },
λ + φ(i) = (1 − p_i)φ(i) + p_i (R_i + φ(0)), i = 1, ..., n.

Taking φ(0) = 0, these have solution φ(i) = R_i − λ/p_i, and hence

λ = Σ_{i=1}^n a_i max{ 0, R_i − λ/p_i }.

The left hand side is increasing in λ and the right hand side is decreasing in λ. Hence there is a root, say λ*, and this is the maximal average-reward. The optimal policy takes the form: accept only jobs for which p_i R_i ≥ λ*.
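Since the left hand side increases in λ and the right hand side decreases, the root λ* is easily found by bisection. The job data (a, R, p) below are made-up illustrations, not from the text.

```python
# Solving the consultant example numerically: find the root lam of
# lam = sum_i a_i * max(0, R_i - lam / p_i) by bisection (the left side is
# increasing and the right side decreasing in lam). Data are illustrative.

a = [0.3, 0.5, 0.2]           # offer probabilities, summing to 1
R = [10.0, 4.0, 1.0]          # payment on completion
p = [0.25, 0.5, 0.9]          # geometric completion rates

def rhs(lam):
    return sum(ai * max(0.0, Ri - lam / pi) for ai, Ri, pi in zip(a, R, p))

lo, hi = 0.0, max(Ri * pi for Ri, pi in zip(R, p))   # root lies in [0, max p_i R_i]
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if rhs(mid) > mid:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)

# optimal policy: accept only jobs with p_i * R_i >= lam
accept = [i for i in range(3) if p[i] * R[i] >= lam]
```

For these data p_i R_i = (2.5, 2.0, 0.9) and the root is about 1.56, so only the first two job types are accepted.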
6.3 Value iteration bounds

Value iteration in the average-cost case is based upon the idea that F_s(x) − F_{s−1}(x) approximates the minimal average-cost for large s.

Theorem 6.3 Define

m_s = min_x { F_s(x) − F_{s−1}(x) }, M_s = max_x { F_s(x) − F_{s−1}(x) }. (6.4)

Then m_s ≤ λ ≤ M_s, where λ is the minimal average-cost.

Proof. (*starred*) Suppose that the first step of an s-horizon optimal policy follows the Markov plan f. Then

F_s(x) = F_{s−1}(x) + [F_s(x) − F_{s−1}(x)] = c(x, f(x)) + E[F_{s−1}(x_1) | x_0 = x, u_0 = f(x)].

Hence

F_{s−1}(x) + m_s ≤ c(x, u) + E[F_{s−1}(x_1) | x_0 = x, u_0 = u], for all x, u.

Applying Theorem 6.1 with φ = F_{s−1} and λ' = m_s implies m_s ≤ λ. The bound λ ≤ M_s is established in a similar way.

This justifies the following value iteration algorithm. At termination the algorithm provides a stationary policy that is within ε × 100% of optimal.

(0) Set F_0(x) = 0, s = 1.
(1) Compute F_s from

F_s(x) = min_u { c(x, u) + E[F_{s−1}(x_1) | x_0 = x, u_0 = u] }.

(2) Compute m_s and M_s from (6.4). Stop if M_s − m_s ≤ ε m_s. Otherwise set s := s + 1 and go to step (1).

6.4 Policy improvement

Policy improvement is an effective method of improving stationary policies.

Policy improvement in the average-cost case. In the average-cost case a policy improvement algorithm can be based on the following observations. Suppose that for a policy π_0 = f_0^∞ we have that λ, φ is a solution to

λ + φ(x) = c(x, f_0(x)) + E[φ(x_1) | x_0 = x, u_0 = f_0(x)],

and suppose for some policy π_1 = f_1^∞,

λ + φ(x) ≥ c(x, f_1(x)) + E[φ(x_1) | x_0 = x, u_0 = f_1(x)], (6.5)

with strict inequality for some x. Then following the lines of proof in Theorem 6.1,

lim_{t→∞} (1/t) E_{π_0}[ Σ_{s=0}^{t−1} c(x_s, u_s) | x_0 = x ] = λ ≥ lim_{t→∞} (1/t) E_{π_1}[ Σ_{s=0}^{t−1} c(x_s, u_s) | x_0 = x ].

If there is no π_1 for which (6.5) holds then π_0 satisfies (6.2) and is optimal. This justifies the following policy improvement algorithm.

(0) Choose an arbitrary stationary policy π_0. Set s = 1.
(1) For a given stationary policy π_{s−1} = f_{s−1}^∞ determine φ, λ to solve

λ + φ(x) = c(x, f_{s−1}(x)) + E[φ(x_1) | x_0 = x, u_0 = f_{s−1}(x)].
This gives a set of linear equations, and so is intrinsically easier to solve than (6.2).

(2) Now determine the policy π_s = f_s^∞ from

c(x, f_s(x)) + E[φ(x_1) | x_0 = x, u_0 = f_s(x)] = min_u { c(x, u) + E[φ(x_1) | x_0 = x, u_0 = u] },

taking f_s(x) = f_{s−1}(x) whenever this is possible. By applications of Theorem 6.1, this yields a strict improvement whenever possible. If π_s = π_{s−1} then the algorithm terminates and π_{s−1} is optimal. Otherwise, return to step (1) with s := s + 1.

If both the action and state spaces are finite then there are only a finite number of possible stationary policies and so the policy improvement algorithm will find an optimal stationary policy in finitely many iterations. By contrast, the value iteration algorithm can only obtain more and more accurate approximations of λ.

Policy improvement in the discounted-cost case. In the case of strict discounting, the following theorem plays the role of Theorem 6.1. The proof is similar, by repeated substitution of (6.6) into itself.

Theorem 6.4 Suppose there exists a bounded function G such that for all x and u,

G(x) ≤ c(x, u) + β E[G(x_1) | x_0 = x, u_0 = u]. (6.6)

Then G ≤ F, where F is the minimal discounted-cost function. This also holds when ≤ is replaced by ≥ and the hypothesis is weakened to: for each x there exists a u such that (6.6) holds when ≤ is replaced by ≥.

The policy improvement algorithm is similar. E.g., step (1) becomes

(1) For a given stationary policy π_{s−1} = f_{s−1}^∞ determine G to solve

G(x) = c(x, f_{s−1}(x)) + β E[G(x_1) | x_0 = x, u_0 = f_{s−1}(x)].
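The bounds of Theorem 6.3 can be watched in action on a tiny example. The deterministic two-state MDP below is my own construction: in state 0 one may stay (cost 2) or move to state 1 (cost 5); in state 1 one may stay (cost 1) or move to state 0 for free. The best long-run behaviour is to sit in state 1, so the minimal average-cost is λ = 1, and m_s ≤ 1 ≤ M_s at every step.

```python
# Value iteration with the bounds m_s, M_s of Theorem 6.3 on a tiny
# deterministic MDP (the example data are my own, not from the text).

# actions: list of (cost, next_state) pairs available in each state
actions = {0: [(2, 0), (5, 1)],
           1: [(1, 1), (0, 0)]}

F_prev = {0: 0.0, 1: 0.0}          # F_0 = 0
bounds = []
for s in range(1, 11):
    F = {x: min(c + F_prev[y] for c, y in actions[x]) for x in actions}
    diffs = [F[x] - F_prev[x] for x in actions]
    bounds.append((min(diffs), max(diffs)))   # (m_s, M_s)
    F_prev = F
```

Because the example is deterministic the bounds close completely after a few iterations, rather than merely approaching λ.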
7 LQ Models

We present the LQ regulation model in discrete and continuous time, the Riccati equation, and its validity in the model with additive white noise.

7.1 The LQ regulation model

The elements needed to define a control optimization problem are specification of (i) the dynamics of the process, (ii) which quantities are observable at a given time, and (iii) an optimization criterion. In the LQG model the plant equation and observation relations are linear, the cost is quadratic, and the noise is Gaussian (jointly normal). The LQG model is important because it has a complete theory and introduces some key concepts, such as controllability, observability and the certainty-equivalence principle.

Begin with a model in which the state x_t is fully observable and there is no noise. The plant equation of the time-homogeneous [A, B, ·] system has the linear form

x_t = A x_{t−1} + B u_{t−1}, (7.1)

where x_t ∈ R^n, u_t ∈ R^m, A is n × n and B is n × m. The cost function is

C = Σ_{t=0}^{h−1} c(x_t, u_t) + C_h(x_h), (7.2)

with one-step and terminal costs

c(x, u) = x'Rx + u'Sx + x'S'u + u'Qu = ( x' u' ) [ R S' ; S Q ] ( x ; u ), (7.3)

C_h(x) = x'Π_h x, (7.4)

where ' denotes transpose. All quadratic forms are non-negative definite, and Q is positive definite. There is no loss of generality in assuming that R, Q and Π_h are symmetric. This is a model for regulation of (x, u) to the point (0, 0) (i.e., steering to a critical value).

To solve the optimality equation we shall need the following lemma.

Lemma 7.1 Suppose x, u are vectors. Consider a quadratic form

( x' u' ) [ Π_xx Π_xu ; Π_ux Π_uu ] ( x ; u ).

Assume it is symmetric and Π_uu > 0, i.e., positive definite. Then the minimum with respect to u is achieved at

u = −Π_uu^{-1} Π_ux x,

and is equal to

x' ( Π_xx − Π_xu Π_uu^{-1} Π_ux ) x.

Proof. Suppose the quadratic form is minimized at u. Then

( x' (u + h)' ) [ Π_xx Π_xu ; Π_ux Π_uu ] ( x ; u + h )
= x'Π_xx x + 2x'Π_xu u + u'Π_uu u + 2h'(Π_ux x + Π_uu u) + h'Π_uu h.

To be stationary at u, the linear term in h must vanish for all h, so

u = −Π_uu^{-1} Π_ux x,

and the optimal value is x'(Π_xx − Π_xu Π_uu^{-1} Π_ux)x.

Theorem 7.2 Assume the structure of (7.1)-(7.4).
Then the value function has the quadratic form

F(x, t) = x'Π_t x, t < h, (7.5)

and the optimal control has the linear form

u_t = K_t x_t, t < h.

The time-dependent matrix Π_t satisfies the Riccati equation

Π_t = f Π_{t+1}, t < h, (7.6)

where f is an operator having the action

fΠ = R + A'ΠA − (S' + A'ΠB)(Q + B'ΠB)^{-1}(S + B'ΠA), (7.7)

and Π_h has the value prescribed in (7.4). The m × n matrix K_t is given by

K_t = −(Q + B'Π_{t+1}B)^{-1}(S + B'Π_{t+1}A), t < h.

Proof. Assertion (7.5) is true at time h. Assume it is true at time t + 1. Then

F(x, t) = inf_u [ c(x, u) + (Ax + Bu)'Π_{t+1}(Ax + Bu) ]
        = inf_u ( x' u' ) [ R + A'Π_{t+1}A , S' + A'Π_{t+1}B ; S + B'Π_{t+1}A , Q + B'Π_{t+1}B ] ( x ; u ).

By Lemma 7.1 the minimum is achieved by u = K_t x, and the form of f comes from this also.
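In the scalar case (n = m = 1, S = 0) the Riccati recursion is a one-line update, and the optimality of u = K_t x can be checked directly by perturbing the control in the bracket being minimized. The parameter values below are illustrative choices.

```python
# Scalar (n = m = 1) Riccati recursion Pi_t = f(Pi_{t+1}) with S = 0:
#   f(Pi) = R + a*Pi*a - (a*Pi*b)**2 / (Q + b*Pi*b),
#   K_t   = -(Q + b*Pi*b)^{-1} * (b*Pi*a).
# Parameter values are illustrative, not from the text.

a, b, R, Q, Pi_h = 1.1, 0.5, 1.0, 1.0, 0.0
h = 20

Pi = [0.0] * (h + 1)
K = [0.0] * h
Pi[h] = Pi_h
for t in range(h - 1, -1, -1):
    P = Pi[t + 1]
    K[t] = -(b * P * a) / (Q + b * P * b)
    Pi[t] = R + a * P * a - (a * P * b) ** 2 / (Q + b * P * b)

def stage(x, u, P):
    """One-step cost plus cost-to-go: c(x, u) + (a x + b u)^2 * P."""
    return R * x * x + Q * u * u + P * (a * x + b * u) ** 2

x = 2.0
u_star = K[0] * x      # claimed minimizer of stage(x, u, Pi[1]) over u
```

Perturbing u away from K_0 x in either direction should increase the bracket, and the minimized bracket should equal Π_0 x², which is the inductive step of Theorem 7.2.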
7.2 The Riccati recursion

The backward recursion (7.6)-(7.7) is called the Riccati equation. Note that:

(i) S can be normalized to zero by choosing a new control u* = u + Q^{-1}Sx, and setting A* = A − BQ^{-1}S, R* = R − S'Q^{-1}S.

(ii) The optimally controlled process obeys x_{t+1} = Γ_t x_t. Here Γ_t is called the gain matrix and is given by

Γ_t = A + BK_t = A − B(Q + B'Π_{t+1}B)^{-1}(S + B'Π_{t+1}A).

(iii) An equivalent expression for the Riccati equation is

fΠ = inf_K [ R + K'S + S'K + K'QK + (A + BK)'Π(A + BK) ].

(iv) We might have carried out exactly the same analysis for a time-heterogeneous model, in which the matrices A, B, Q, R, S are replaced by A_t, B_t, Q_t, R_t, S_t.

(v) We do not give details, but comment that it is possible to analyse models in which x_{t+1} = Ax_t + Bu_t + α_t, for a known sequence of disturbances {α_t}, or in which the cost function is

c(x, u) = ( (x − x̄_t)' (u − ū_t)' ) [ R S' ; S Q ] ( x − x̄_t ; u − ū_t ),

so that the aim is to track a sequence of values (x̄_t, ū_t), t = 0, ..., h − 1.

7.3 White noise disturbances

Suppose the plant equation (7.1) is now

x_{t+1} = Ax_t + Bu_t + ε_t,

where ε_t ∈ R^n is vector white noise, defined by the properties Eε = 0, Eε_t ε_t' = N and Eε_t ε_s' = 0, t ≠ s. The DP equation is then

F(x, t) = inf_u { c(x, u) + E_ε[ F(Ax + Bu + ε, t + 1) ] }.

By definition F(x, h) = x'Π_h x. Try a solution F(x, t) = x'Π_t x + γ_t. This holds for t = h. Suppose it is true for t + 1; then

F(x, t) = inf_u { c(x, u) + E_ε[ (Ax + Bu + ε)'Π_{t+1}(Ax + Bu + ε) ] + γ_{t+1} }
        = inf_u { c(x, u) + (Ax + Bu)'Π_{t+1}(Ax + Bu) + 2E[ε]'Π_{t+1}(Ax + Bu) + E[ε'Π_{t+1}ε] + γ_{t+1} }
        = inf_u { c(x, u) + (Ax + Bu)'Π_{t+1}(Ax + Bu) } + tr(NΠ_{t+1}) + γ_{t+1}.

Here we use the fact that

E[ε'Πε] = E Σ_{ij} ε_i Π_ij ε_j = Σ_{ij} E[ε_j ε_i] Π_ij = Σ_{ij} N_ji Π_ij = tr(NΠ).

Thus (i) Π_t follows the same Riccati equation as before, (ii) the optimal control is u_t = K_t x_t, and (iii)

F(x, t) = x'Π_t x + γ_t = x'Π_t x + Σ_{j=t+1}^{h} tr(NΠ_j).

The final term can be viewed as the cost of correcting future noise. In the infinite-horizon limit of Π_t → Π as t → −∞, we incur an average cost per unit time of tr(NΠ), and a transient cost of x'Πx that is due to correcting the initial x.
7.4 LQ regulation in continuous-time

In continuous-time we take ẋ = Ax + Bu and cost

C = ∫_0^h ( x' u' ) [ R S' ; S Q ] ( x ; u ) dt + (x'Πx)_h.

We can obtain the continuous-time solution from the discrete-time solution by moving forward in time in increments of δ. Make the following replacements:

x_{t+1} → x_{t+δ}, A → I + Aδ, B → Bδ, R, S, Q → Rδ, Sδ, Qδ.

Then, as before, F(x, t) = x'Πx, where Π obeys the Riccati equation

dΠ/dt + R + A'Π + ΠA − (S' + ΠB)Q^{-1}(S + B'Π) = 0.

This is simpler than the discrete-time version. The optimal control is u(t) = K(t)x(t), where

K(t) = −Q^{-1}(S + B'Π).

The optimally controlled plant equation is ẋ = Γ(t)x, where

Γ(t) = A + BK = A − BQ^{-1}(S + B'Π).
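In the scalar case with S = 0 the continuous-time Riccati equation can be integrated backwards from the horizon by a simple Euler scheme; over a long horizon Π(t) settles at the positive root of the algebraic Riccati equation R + 2aΠ − Π²b²/Q = 0. The step size and the parameters a, b, R, Q below are my own illustrative choices.

```python
import math

# Scalar continuous-time Riccati equation with S = 0:
#   dPi/dt + R + 2*a*Pi - Pi**2 * b**2 / Q = 0,   Pi(h) = 0.
# Backward Euler-style integration from the horizon; step size and
# parameters are illustrative choices, not from the text.

a, b, R, Q = 1.0, 1.0, 1.0, 1.0
dt = 1e-3
Pi = 0.0                        # terminal condition Pi(h) = 0
for _ in range(int(20 / dt)):   # integrate 20 time units backwards
    Pi += dt * (R + 2 * a * Pi - Pi ** 2 * b ** 2 / Q)

# positive root of (b^2/Q) Pi^2 - 2 a Pi - R = 0
Pi_inf = Q * (a + math.sqrt(a * a + R * b * b / Q)) / (b * b)
```

For these values Pi_inf = 1 + sqrt(2), and the backward integration converges to it because the root is an attracting fixed point of the backward flow.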
8 Controllability

We define and give conditions for controllability in discrete and continuous time.

8.1 Controllability

Consider the [A, B, ·] system with plant equation x_{t+1} = Ax_t + Bu_t. The controllability question is: can we bring x to an arbitrary prescribed value by some u-sequence?

Definition 8.1 The system is r-controllable if one can bring it from an arbitrary prescribed x_0 to an arbitrary prescribed x_r by some u-sequence u_0, u_1, ..., u_{r−1}. A system of dimension n is said to be controllable if it is r-controllable for some r.

Example. If B is square and non-singular then the system is 1-controllable, for x_1 = Ax_0 + Bu_0 where u_0 = B^{-1}(x_1 − Ax_0).

Example. Consider the case (n = 2, m = 1),

x_t = [ a_11 0 ; a_21 a_22 ] x_{t−1} + ( 1 ; 0 ) u_{t−1}.

This system is not 1-controllable. But

x_2 − A²x_0 = B u_1 + AB u_0 = [ 1 a_11 ; 0 a_21 ] ( u_1 ; u_0 ).

So it is 2-controllable if and only if a_21 ≠ 0.

More generally, by substituting the plant equation into itself, we see that we must find u_0, u_1, ..., u_{r−1} to satisfy

Δ = x_r − A^r x_0 = B u_{r−1} + AB u_{r−2} + ··· + A^{r−1}B u_0, (8.1)

for arbitrary Δ. In providing conditions for controllability we shall need to make use of the following theorem.

Theorem 8.2 (The Cayley-Hamilton theorem) Any n × n matrix A satisfies its own characteristic equation. So that if

det(λI − A) = Σ_{j=0}^{n} a_j λ^{n−j},

then

Σ_{j=0}^{n} a_j A^{n−j} = 0. (8.2)

The implication is that the span of {I, A, A², ..., A^{n−1}} contains A^r for all r = 0, 1, ....

Proof. (*starred*) Define

Φ(z) = Σ_{j=0}^{∞} (Az)^j = (I − Az)^{-1} = adj(I − Az) / det(I − Az).

Then

det(I − Az) Φ(z) = Σ_{j=0}^{n} a_j z^j Φ(z) = adj(I − Az),

which implies (8.2) since on the right hand side, a polynomial of degree at most n − 1, the coefficient of z^n must be zero.

We are now in a position to characterise controllability.

Theorem 8.3 (i) The system [A, B, ·] is r-controllable if and only if the matrix

M_r = [ B AB A²B ··· A^{r−1}B ]

has rank n, or (ii) equivalently, if and only if the n × n matrix

M_r M_r' = Σ_{j=0}^{r−1} A^j (BB') (A')^j

is nonsingular (or, equivalently, positive definite).
(iii) If the system is r-controllable then it is s-controllable for s ≥ min(n, r), and (iv) a control transferring x_0 to x_r with minimal cost Σ_{t=0}^{r−1} u_t'u_t is

u_t = B'(A')^{r−t−1} (M_r M_r')^{-1} (x_r − A^r x_0), t = 0, ..., r − 1.

Proof. (i) The system (8.1) has a solution u for arbitrary Δ if and only if M_r has rank n. (ii) M_r M_r' is singular if and only if there exists w ≠ 0 such that M_r M_r' w = 0, and

M_r M_r' w = 0 ⟺ w'M_r M_r' w = 0 ⟺ M_r' w = 0.

(iii) The rank of M_r is non-decreasing in r, so if the system is r-controllable, then it is s-controllable for s ≥ r. But the rank is constant for r ≥ n by the Cayley-Hamilton theorem. (iv) Consider the Lagrangian

Σ_{t=0}^{r−1} u_t'u_t + λ'( Δ − Σ_{t=0}^{r−1} A^{r−t−1} B u_t ),

giving

u_t = (1/2) B'(A')^{r−t−1} λ.

Now we can determine λ from (8.1).
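The two-step example above can be checked end to end: with B = (1, 0)', the matrix [B AB] has determinant a_21, and solving (8.1) for (u_1, u_0) gives controls that actually drive the plant to the target. The numbers a_11 = 0.5, a_21 = 0.3, a_22 = −0.2 and the chosen states are illustrative.

```python
# Check of the 2-controllability example: solve x_2 - A^2 x_0 = B u_1 + AB u_0
# by Cramer's rule and simulate the plant. Parameter values are illustrative.

a11, a21, a22 = 0.5, 0.3, -0.2

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

A = [[a11, 0.0], [a21, a22]]
B = [1.0, 0.0]
AB = mat_vec(A, B)
M2 = [[B[0], AB[0]], [B[1], AB[1]]]                 # columns B, AB
det = M2[0][0] * M2[1][1] - M2[0][1] * M2[1][0]     # equals a21

# steer from x0 to x2 in two steps: M2 (u1, u0)' = x2 - A^2 x0
x0, x2 = [1.0, -2.0], [3.0, 4.0]
A2x0 = mat_vec(A, mat_vec(A, x0))
delta = [x2[0] - A2x0[0], x2[1] - A2x0[1]]
u1 = (delta[0] * M2[1][1] - delta[1] * M2[0][1]) / det
u0 = (M2[0][0] * delta[1] - M2[1][0] * delta[0]) / det

# simulate the plant x_{t+1} = A x_t + B u_t with u_0 then u_1
x = list(x0)
for u in (u0, u1):
    Ax = mat_vec(A, x)
    x = [Ax[0] + B[0] * u, Ax[1] + B[1] * u]
```

The simulated state after two steps should coincide with the prescribed target, confirming 2-controllability for a_21 ≠ 0.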
8.2 Controllability in continuous-time

Theorem 8.4 (i) The n-dimensional system [A, B, ·] is controllable if and only if the matrix M_n has rank n, or (ii) equivalently, if and only if

G(t) = ∫_0^t e^{As} BB' e^{A's} ds

is positive definite for all t > 0. (iii) If the system is controllable then a control that achieves the transfer from x(0) to x(t) with minimal control cost ∫_0^t u_s'u_s ds is

u(s) = B' e^{A'(t−s)} G(t)^{-1} ( x(t) − e^{At} x(0) ).

Note that there is now no notion of r-controllability. However, G(t) → 0 as t → 0, so the transfer becomes more difficult and costly as t → 0.

8.3 Example: broom balancing

Consider the problem of balancing a broom in an upright position on your hand. By Newton's laws, the system obeys m(ü cos θ + Lθ̈) = mg sin θ, where u is the position of the hand, L is the length of the broom and θ its angle from the vertical. (Figure 1: force diagram for broom balancing, showing the forces mg sin θ and mü cos θ acting on the bob.) For small θ we have cos θ ≈ 1 and θ ≈ sin θ = (x − u)/L, where x is the position of the tip of the broom, so with α = g/L the plant equation is

ẍ = α(x − u),

or equivalently,

d/dt ( x ; ẋ ) = [ 0 1 ; α 0 ] ( x ; ẋ ) + ( 0 ; −α ) u.

Since

[ B AB ] = [ 0 −α ; −α 0 ],

which is non-singular, the system is controllable if θ is initially small.

8.4 Example: satellite in a planar orbit

Consider a satellite of unit mass in a planar orbit and take polar coordinates (r, θ):

r̈ = rθ̇² − c/r² + u_r, θ̈ = −2ṙθ̇/r + u_θ/r,

where u_r and u_θ are the radial and tangential components of thrust. If u = 0 then a possible orbit (such that r̈ = θ̈ = 0) is the circular orbit with r = ρ and θ̇ = ω = (c/ρ³)^{1/2}.

Recall that one reason for taking an interest in linear models is that they tell us about controllability around an equilibrium point. Imagine there is a perturbing force. Take coordinates of perturbation

x_1 = r − ρ, x_2 = ṙ, x_3 = θ − ωt, x_4 = θ̇ − ω.

Then, with n = 4, m = 2,

ẋ ≈ [ 0 1 0 0 ; 3ω² 0 0 2ωρ ; 0 0 0 1 ; 0 −2ω/ρ 0 0 ] x + [ 0 0 ; 1 0 ; 0 0 ; 0 1/ρ ] ( u_r ; u_θ ) = Ax + Bu.

It is easy to check that M_2 = [ B AB ] has rank 4 and that therefore the system is controllable. But suppose u_θ = 0 (tangential thrust fails). Then

B = ( 0 ; 1 ; 0 ; 0 ) and M_4 = [ B AB A²B A³B ] = [ 0 1 0 −ω² ; 1 0 −ω² 0 ; 0 0 −2ω/ρ 0 ; 0 −2ω/ρ 0 2ω³/ρ ].

Since (2ωρ, 0, 0, ρ²) M_4 = 0, this is singular and has rank 3. The uncontrollable component is the angular momentum, 2ωρ δr + ρ² δθ̇ = δ(r²θ̇) evaluated at r = ρ, θ̇ = ω. On the other hand, if u_r = 0 then the system is still controllable. We can change the radius by tangential braking or thrust.
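The rank claims in the satellite example are easy to verify mechanically: with both thrusters [B AB] has rank 4, while with radial thrust only, w = (2ωρ, 0, 0, ρ²) annihilates every column of M_4. The numerical values of ω and ρ below are illustrative.

```python
# Check of the satellite example. Values of w0 (the orbital rate omega)
# and rho are illustrative choices; the structure of A and B is from the text.

w0, rho = 2.0, 1.5

A = [[0, 1, 0, 0],
     [3 * w0 ** 2, 0, 0, 2 * w0 * rho],
     [0, 0, 0, 1],
     [0, -2 * w0 / rho, 0, 0]]

def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

def rank(cols, tol=1e-9):
    """Rank, by Gaussian elimination on a list of column vectors."""
    rows = [list(r) for r in zip(*cols)]
    rk = 0
    for j in range(len(cols)):
        piv = next((i for i in range(rk, len(rows)) if abs(rows[i][j]) > tol), None)
        if piv is None:
            continue
        rows[rk], rows[piv] = rows[piv], rows[rk]
        for i in range(len(rows)):
            if i != rk and abs(rows[i][j]) > tol:
                f = rows[i][j] / rows[rk][j]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

# full thrust: B has columns e2 and e4 / rho
b1, b2 = [0, 1, 0, 0], [0, 0, 0, 1 / rho]
M2_cols = [b1, b2, mat_vec(A, b1), mat_vec(A, b2)]

# radial thrust only: columns B, AB, A^2 B, A^3 B
cols = [b1]
for _ in range(3):
    cols.append(mat_vec(A, cols[-1]))

w = [2 * w0 * rho, 0, 0, rho ** 2]
null_products = [sum(w[i] * col[i] for i in range(4)) for col in cols]
```

The vanishing of w'M_4 reflects conservation of angular momentum, the uncontrollable component identified above.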
9 Infinite Horizon Limits

We define stabilizability and discuss the LQ regulation problem over an infinite horizon.

9.1 Linearization of nonlinear models

Linear models are important because they arise naturally via the linearization of nonlinear models. Consider the state-structured nonlinear model ẋ = a(x, u). Suppose x, u are perturbed from an equilibrium (x̄, ū) where a(x̄, ū) = 0. Let x* = x − x̄ and u* = u − ū and immediately drop the stars. The linearized version is ẋ = Ax + Bu, where

A = ∂a/∂x |_{(x̄,ū)}, B = ∂a/∂u |_{(x̄,ū)}.

If (x̄, ū) is to be a stable equilibrium point then we must be able to choose a control that can stabilise the system in the neighbourhood of (x̄, ū).

9.2 Stabilizability

Suppose we apply the stationary control u = Kx so that ẋ = Ax + Bu = (A + BK)x. So with Γ = A + BK we have

ẋ = Γx, x_t = e^{Γt} x_0, where e^{Γt} = Σ_{j=0}^{∞} (Γt)^j / j!.

Similarly, in discrete time, we can take the stationary control u_t = Kx_t, so that

x_t = Ax_{t−1} + Bu_{t−1} = (A + BK)x_{t−1}.

Now x_t = Γ^t x_0. We are interested in choosing Γ so that x_t → 0 as t → ∞.

Definition 9.1 Γ is a stability matrix in the continuous-time sense if all its eigenvalues have negative real part, and hence x_t → 0 as t → ∞. Γ is a stability matrix in the discrete-time sense if all its eigenvalues lie strictly inside the unit disc |z| = 1 in the complex plane, and hence x_t → 0 as t → ∞.

The [A, B] system is said to be stabilizable if there exists a K such that A + BK is a stability matrix.

Note that u_t = Kx_t is linear and Markov. In seeking controls such that x_t → 0 it is sufficient to consider only controls of this type since, as we see below, such controls arise as optimal controls for the infinite-horizon LQ regulation problem.

9.3 Example: pendulum

Consider a pendulum of length L, with a bob of unit mass, hanging at angle θ to the vertical. Suppose we wish to stabilise θ to zero by application of a force u. Then

θ̈ = −(g/L) sin θ + u.

We change the state variable to x = (θ, θ̇) and write

d/dt ( θ ; θ̇ ) = ( θ̇ ; −(g/L) sin θ + u ) ≈ [ 0 1 ; −g/L 0 ] ( θ ; θ̇ ) + ( 0 ; 1 ) u.
Suppose we try to stabilise with a control u = Kθ = Kx_1. Then

A + BK = [ 0 1 ; K − g/L 0 ],

and this has eigenvalues ±(K − g/L)^{1/2}. So either K − g/L > 0 and one eigenvalue has a positive real part, in which case there is in fact instability, or K − g/L < 0 and the eigenvalues are purely imaginary, which means we will in general have oscillations. So successful stabilization must make u a function of θ̇ as well (and this would come out of the solution to the LQ regulation problem).

9.4 Infinite-horizon LQ regulation

Consider the time-homogeneous case and write the finite-horizon cost in terms of the time to go, s. The terminal cost, when s = 0, is denoted F_0(x) = x'Π_0 x. In all that follows we take S = 0, without loss of generality.

Lemma 9.2 Suppose Π_0 = 0, R ≥ 0, Q > 0 and [A, B, ·] is controllable or stabilizable. Then {Π_s} has a finite limit Π.

Proof. Costs are non-negative, so F_s(x) is non-decreasing in s. Now F_s(x) = x'Π_s x. Thus x'Π_s x is non-decreasing in s for every x. To show that x'Π_s x is bounded we use one of two arguments. If the system is controllable then x'Π_s x is bounded because there is a policy which, for any x_0 = x, will bring the state to zero in at most n steps and at finite cost, and can then hold it at zero with zero cost thereafter.
If the system is stabilizable then there is a K such that Γ = A + BK is a stability matrix and, using u_t = Kx_t, we have

F_s(x) ≤ x′ [ Σ_{t≥0} (Γ′)^t (R + K′QK) Γ^t ] x < ∞.

Hence in either case we have an upper bound, and so x′Π_s x tends to a limit for every x. By considering x = e_j, the vector with a unit in the jth place and zeros elsewhere, we conclude that the jth element on the diagonal of Π_s converges. Then taking x = e_j + e_k it follows that the off-diagonal elements of Π_s also converge.

Both value iteration and policy improvement are effective ways to compute the solution to an infinite-horizon LQ regulation problem. Policy improvement goes along the lines developed in Lecture 6. The following theorem establishes the efficacy of value iteration. It is similar to Theorem 4.2, which established the same fact for D, N and P programming. The LQ regulation problem is a negative programming problem; however, we cannot apply Theorem 4.2, because in general the terminal cost x′Π_0 x is not zero.

9.5 The [A, B, C] system

The notion of controllability rested on the assumption that the initial value of the state was known. If, however, one must rely upon imperfect observations, then the question arises whether the value of the state (either in the past or in the present) can be determined from these observations. The discrete-time system [A, B, C] is defined by the plant equation and observation relation

x_t = A x_{t−1} + B u_{t−1},
y_t = C x_{t−1}.

Here y ∈ R^r is observed, but x is not. We suppose C is r × n. The observability question is whether or not we can infer x from the observations y_1, y_2, .... The notion of observability stands in dual relation to that of controllability; a duality that indeed persists throughout the subject.

Theorem 9.3 Suppose that R > 0, Q > 0 and the system [A, B, ·] is controllable. Then (i) the equilibrium Riccati equation

Π = f(Π)   (9.1)

has a unique non-negative definite solution Π; (ii) for any finite non-negative definite Π_0 the sequence {Π_s} converges to Π.
(iii) The gain matrix Γ corresponding to Π is a stability matrix.

Proof. (*starred*) Define Π as the limit of the sequence f^(s)(0). By the previous lemma we know that this limit exists and that it satisfies (9.1). Consider u_t = Kx_t, so that x_{t+1} = (A + BK)x_t = Γx_t, i.e., x_t = Γ^t x_0 for arbitrary x_0, where K = −(Q + B′ΠB)^{−1} B′ΠA and Γ = A + BK. We can write (9.1) as

Π = R + K′QK + Γ′ΠΓ,   (9.2)

and hence

x_t′ Π x_t = x_t′ (R + K′QK) x_t + x_{t+1}′ Π x_{t+1} ≥ x_{t+1}′ Π x_{t+1}.

Thus x_t′ Π x_t decreases and, being bounded below by zero, it tends to a limit. Thus x_t′ (R + K′QK) x_t tends to 0. Since R + K′QK is positive definite this implies x_t → 0, which implies (iii).

Hence, for arbitrary finite non-negative definite Π_0,

Π_s = f^(s)(Π_0) ≥ f^(s)(0) → Π.   (9.3)

However, if we choose the fixed policy u_t = Kx_t then it follows that

Π_s ≤ f^(s)(0) + (Γ′)^s Π_0 Γ^s → Π.   (9.4)

Thus (9.3) and (9.4) imply (ii). Finally, if a non-negative definite Π̃ also satisfies (9.1) then Π̃ = f^(s)(Π̃) → Π by (ii), whence (i) follows.
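The content of Theorem 9.3 can be exercised numerically. The following is a minimal sketch, assuming an illustrative controllable 2x2 system with R, Q > 0; it runs value iteration from two different starting matrices and checks the stability of the resulting gain.

```python
import numpy as np

# Value iteration for the infinite-horizon LQ problem (a sketch; the
# 2x2 system and cost weights below are assumed example values).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
R = np.eye(2)          # state cost x' R x
Q = np.array([[1.0]])  # control cost u' Q u

def riccati_map(P):
    """One step of the recursion Pi_s = f(Pi_{s-1})."""
    G = Q + B.T @ P @ B
    return R + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)

def iterate(P0, n=500):
    P = P0
    for _ in range(n):
        P = riccati_map(P)
    return P

P_from_zero  = iterate(np.zeros((2, 2)))
P_from_other = iterate(10.0 * np.eye(2))   # an arbitrary finite Pi_0 >= 0

# Theorem 9.3(ii): both sequences converge to the same fixed point Pi.
print(np.allclose(P_from_zero, P_from_other))

# Theorem 9.3(iii): K = -(Q + B'PiB)^{-1} B'Pi A makes Gamma = A + BK a
# stability matrix, i.e. all eigenvalues strictly inside the unit disc.
P = P_from_zero
K = -np.linalg.solve(Q + B.T @ P @ B, B.T @ P @ A)
print(max(abs(np.linalg.eigvals(A + B @ K))) < 1.0)
```

Both printed checks come out true for this example, in line with parts (ii) and (iii) of the theorem.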
10 Observability

We present observability and the LQG model with process and observation noise.

10.1 Observability

The previous lecture introduced the [A, B, C] system,

x_t = A x_{t−1} + B u_{t−1},   (10.1)
y_t = C x_{t−1}.   (10.2)

Here y ∈ R^k is observed, but x is not. C is k × n. The observability question is: can we infer the value of x at a prescribed time from subsequent y values?

Definition 10.1 A system is said to be r-observable if x_0 can be inferred from knowledge of the subsequent observations y_1, ..., y_r and relevant control values u_0, ..., u_{r−2}, for any initial x_0. A system of dimension n is said to be observable if it is r-observable for some r.

From (10.1) and (10.2) we can determine y_t in terms of x_0 and subsequent controls:

x_t = A^t x_0 + Σ_{s=0}^{t−1} A^s B u_{t−s−1},
y_t = C x_{t−1} = C [ A^{t−1} x_0 + Σ_{s=0}^{t−2} A^s B u_{t−s−2} ].

Thus, if we define the reduced observation

ỹ_t = y_t − C Σ_{s=0}^{t−2} A^s B u_{t−s−2},

then x_0 is to be determined from the system of equations

ỹ_t = C A^{t−1} x_0,   1 ≤ t ≤ r.   (10.3)

By hypothesis, these equations are mutually consistent, and so have a solution; the question is whether this solution is unique. This is the reverse of the situation for controllability, when the question was whether the equation for u had a solution at all, unique or not. Note that an implication of the system definition is that the property of observability depends only on the matrices A and C, not upon B at all.

Theorem 10.2 (i) The system [A, ·, C] is r-observable if and only if the matrix

N_r = ( C ; CA ; CA^2 ; ... ; CA^{r−1} )   (the blocks stacked one above another)

has rank n, or (ii) equivalently, if and only if the n × n matrix

N_r′ N_r = Σ_{j=0}^{r−1} (A′)^j C′ C A^j

is nonsingular. (iii) If the system is r-observable then it is s-observable for s ≥ min(n, r), and (iv) the determination of x_0 can be expressed as

x_0 = (N_r′ N_r)^{−1} Σ_{j=1}^{r} (A′)^{j−1} C′ ỹ_j.   (10.4)

Proof. If the system (10.3) has a solution for x_0 (which is so by hypothesis) then this solution is unique if and only if the matrix N_r has rank n, whence assertion (i). Assertion (iii) follows from (i).
The equivalence of conditions (i) and (ii) can be verified directly, as in the case of controllability. If we define the deviation η_t = ỹ_t − C A^{t−1} x_0 then the equations amount to η_t = 0, 1 ≤ t ≤ r. If these equations were not consistent we could still define a least-squares solution to them by minimizing any positive-definite quadratic form in these deviations with respect to x_0. In particular, we could minimize Σ_{t=1}^{r} η_t′ η_t. This minimization gives (10.4). If equations (10.3) indeed have a solution (i.e., are mutually consistent, as we suppose) and this is unique, then expression (10.4) must equal this solution: the actual value of x_0. The criterion for uniqueness of the least-squares solution is that N_r′ N_r should be nonsingular, which is also condition (ii).

Note that we have again found it helpful to bring in an optimization criterion in proving (iv); this time, not so much to construct one definite solution out of many, but rather to construct a best-fit solution where an exact solution might not have existed. This approach lies close to the statistical approach necessary when observations are corrupted by noise.

10.2 Observability in continuous time

Theorem 10.3 (i) The n-dimensional continuous-time system [A, ·, C] is observable if and only if the matrix N_n has rank n, or (ii) equivalently, if and only if

H(t) = ∫_0^t e^{A′s} C′ C e^{As} ds

is positive definite for all t > 0. (iii) If the system is observable then the determination of x(0) can be written

x(0) = H(t)^{−1} ∫_0^t e^{A′s} C′ ỹ(s) ds,  where  ỹ(t) = y(t) − C ∫_0^t e^{A(t−s)} B u(s) ds.
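The rank test of Theorem 10.2 and the reconstruction (10.4) can be illustrated in a few lines. The following is a sketch, assuming an illustrative 2-dimensional system and noise-free reduced observations; (10.4) is exactly the least-squares solution of the stacked system (10.3).

```python
import numpy as np

# Observability test and reconstruction of x0 (a sketch; A, C and the
# trajectory below are assumed example data).
A = np.array([[0.8, 1.0], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])   # observe the first component only
n = A.shape[0]

def obs_matrix(A, C, r):
    """N_r: the blocks C, CA, ..., CA^{r-1} stacked one above another."""
    return np.vstack([C @ np.linalg.matrix_power(A, j) for j in range(r)])

N = obs_matrix(A, C, n)
print(np.linalg.matrix_rank(N) == n)   # the system [A, ., C] is observable

# Recover x0 from the reduced observations y~_t = C A^{t-1} x0, 1 <= t <= n.
# Formula (10.4) is the least-squares solution of the stacked system (10.3).
x0 = np.array([2.0, -1.0])
y_tilde = np.array([(C @ np.linalg.matrix_power(A, t - 1) @ x0).item()
                    for t in range(1, n + 1)])
x0_hat, *_ = np.linalg.lstsq(N, y_tilde, rcond=None)
print(np.allclose(x0_hat, x0))
```

Because the data here are exact and N has full column rank, the least-squares solution recovers x0 exactly; with noisy observations the same computation gives the best-fit estimate mentioned above.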
10.3 Examples

Example: observation of populations. Consider two populations whose sizes are changing according to the equations

ẋ_1 = λ_1 x_1,   ẋ_2 = λ_2 x_2.

Suppose we observe x_1 + x_2, so

A = ( λ_1  0 ; 0  λ_2 ),   C = ( 1  1 ),   N_2 = ( 1  1 ; λ_1  λ_2 ),

and so the individual populations are observable if λ_1 ≠ λ_2.

Example: radioactive decay. Suppose j = 1, 2, 3 and decay is from 1 to 2 at rate α, from 1 to 3 at rate β and from 2 to 3 at rate γ. We observe only the accumulation in state 3. Then ẋ = Ax, where

A = ( −(α+β)  0  0 ; α  −γ  0 ; β  γ  0 ),   C = ( 0  0  1 ),

N_3 = ( C ; CA ; CA^2 ) = ( 0  0  1 ; β  γ  0 ; αγ − β(α+β)  −γ^2  0 ).

The determinant equals γ(α + β)(β − γ), so we require γ > 0, α + β > 0 and β ≠ γ for a nonzero determinant and observability.

Example: satellite. Recall the linearised equation ẋ = Ax for perturbations of the orbit of a satellite (here taking ρ = 1), where

x = ( x_1 ; x_2 ; x_3 ; x_4 ) = ( r − ρ ; ṙ ; θ − ωt ; θ̇ − ω ),   A = ( 0  1  0  0 ; 3ω^2  0  0  2ω ; 0  0  0  1 ; 0  −2ω  0  0 ).

10.4 Imperfect state observation with noise

The full LQG model, whose description has been deferred until now, assumes linear dynamics, quadratic costs and Gaussian noise. Imperfect observation is the most important new point. The model is

x_t = A x_{t−1} + B u_{t−1} + ε_t,   (10.5)
y_t = C x_{t−1} + η_t,   (10.6)

where ε_t is process noise, y_t is the observation at time t and η_t is the observation noise. The state observations are degraded in that we observe only C x_{t−1}. Assume

cov( ε ; η ) = E [ ( ε ; η )( ε′ , η′ ) ] = ( N  L ; L′  M )

and that x_0 ~ N(x̂_0, V_0). Let W_t = (Y_t, U_{t−1}) = (y_1, ..., y_t; u_0, ..., u_{t−1}) denote the observed history up to time t. Of course we assume that t, A, B, C, N, L, M, x̂_0 and V_0 are also known; W_t denotes what might be different if the process were rerun. In the next lecture we turn to the question of estimating x from y.
We consider the issues of state estimation and optimal control and shall show:

(i) x̂_t can be calculated recursively from the Kalman filter (a linear operator):

x̂_t = A x̂_{t−1} + B u_{t−1} + H_t (y_t − C x̂_{t−1}),

which is like the plant equation except that the noise is given by an innovation process, ỹ_t = y_t − C x̂_{t−1}, rather than the white noise.

(ii) If with full information (i.e., y_t = x_t) the optimal control is u_t = K_t x_t, then without full information the optimal control is u_t = K_t x̂_t, where x̂_t is the best linear least squares estimate of x_t based on the information (Y_t, U_{t−1}) at time t.

Many of the ideas we encounter in this analysis are unrelated to the special state structure and are therefore worth noting as general observations about control with imperfect information.

Returning to the satellite example of Section 10.3: by taking C = ( 0  0  1  0 ) we see that the system is observable on the basis of angle measurements alone, since

N_4 = ( 0  0  1  0 ; 0  0  0  1 ; 0  −2ω  0  0 ; −6ω^3  0  0  −4ω^2 )

has rank 4; but it is not observable for C = ( 1  0  0  0 ), i.e., on the basis of radius measurements alone, since

Ñ_4 = ( 1  0  0  0 ; 0  1  0  0 ; 3ω^2  0  0  2ω ; 0  −ω^2  0  0 )

has a zero third column and hence rank less than 4.
11 Kalman Filtering and Certainty Equivalence

We present the important concepts of the Kalman filter, certainty equivalence and the separation principle.

11.1 Preliminaries

Lemma 11.1 Suppose x and y are jointly normal with zero means and covariance matrix

cov( x ; y ) = ( V_xx  V_xy ; V_yx  V_yy ).

Then the distribution of x conditional on y is Gaussian, with

E(x | y) = V_xy V_yy^{−1} y,   (11.1)

and

cov(x | y) = V_xx − V_xy V_yy^{−1} V_yx.   (11.2)

Proof. Both y and x − V_xy V_yy^{−1} y are linear functions of x and y and therefore they are Gaussian. From E[(x − V_xy V_yy^{−1} y) y′] = 0 it follows that they are uncorrelated, and this implies they are independent. Hence the distribution of x − V_xy V_yy^{−1} y conditional on y is identical with its unconditional distribution, and this is Gaussian with zero mean and the covariance matrix given by (11.2).

The estimate of x in terms of y defined as x̂ = Hy = V_xy V_yy^{−1} y is known as the linear least squares estimate of x in terms of y. Even without the assumption that x and y are jointly normal, this linear function of y has a smaller covariance matrix than any other unbiased estimate of x that is a linear function of y. In the Gaussian case, it is also the maximum likelihood estimator.

11.2 The Kalman filter

Let us make the LQG and state-structure assumptions of Section 10.4:

x_t = A x_{t−1} + B u_{t−1} + ε_t,   (11.3)
y_t = C x_{t−1} + η_t.   (11.4)

Notice that both x_t and y_t can be written as linear functions of the unknown noise and the known values of u_0, ..., u_{t−1}. Thus the distribution of x_t conditional on W_t = (Y_t, U_{t−1}) must be normal, with some mean x̂_t and covariance matrix V_t. The following theorem describes recursive updating relations for these two quantities.

Theorem 11.2 (The Kalman filter) Suppose that, conditional on W_0, the initial state x_0 is distributed N(x̂_0, V_0) and the state and observations obey the recursions of the LQG model (11.3)-(11.4). Then, conditional on W_t, the current state is distributed N(x̂_t, V_t).
The conditional mean and variance obey the updating recursions

x̂_t = A x̂_{t−1} + B u_{t−1} + H_t (y_t − C x̂_{t−1}),   (11.5)
V_t = N + A V_{t−1} A′ − (L + A V_{t−1} C′)(M + C V_{t−1} C′)^{−1}(L′ + C V_{t−1} A′),   (11.6)

where

H_t = (L + A V_{t−1} C′)(M + C V_{t−1} C′)^{−1}.   (11.7)

Proof. The proof is by induction on t. Consider the moment when u_{t−1} has been determined but y_t has not yet been observed. The distribution of (x_t, y_t) conditional on (W_{t−1}, u_{t−1}) is jointly normal with means

E(x_t | W_{t−1}, u_{t−1}) = A x̂_{t−1} + B u_{t−1},
E(y_t | W_{t−1}, u_{t−1}) = C x̂_{t−1}.

Let Δ_{t−1} = x̂_{t−1} − x_{t−1}, which by an inductive hypothesis is N(0, V_{t−1}). Consider the innovations

ξ_t = x_t − E(x_t | W_{t−1}, u_{t−1}) = x_t − (A x̂_{t−1} + B u_{t−1}) = ε_t − A Δ_{t−1},
ζ_t = y_t − E(y_t | W_{t−1}, u_{t−1}) = y_t − C x̂_{t−1} = η_t − C Δ_{t−1}.

Conditional on (W_{t−1}, u_{t−1}), these quantities are normally distributed with zero means and covariance matrix

cov( ε_t − A Δ_{t−1} ; η_t − C Δ_{t−1} ) = ( N + A V_{t−1} A′   L + A V_{t−1} C′ ; L′ + C V_{t−1} A′   M + C V_{t−1} C′ ) = ( V_ξξ  V_ξζ ; V_ζξ  V_ζζ ).

Thus it follows from Lemma 11.1 that the distribution of ξ_t conditional on knowing (W_{t−1}, u_{t−1}, ζ_t), which is equivalent to knowing W_t, is normal with mean V_ξζ V_ζζ^{−1} ζ_t and covariance matrix V_ξξ − V_ξζ V_ζζ^{−1} V_ζξ. These give (11.5)-(11.7).

11.3 Certainty equivalence

We say that a quantity a is policy-independent if E_π(a | W_0) is independent of π.

Theorem 11.3 Suppose the LQG model assumptions hold. Then

(i) F(W_t) = x̂_t′ Π_t x̂_t + ···,   (11.8)

where x̂_t is the linear least squares estimate of x_t whose evolution is determined by the Kalman filter in Theorem 11.2 and + ··· indicates terms that are policy-independent;

(ii) the optimal control is given by u_t = K_t x̂_t, where Π_t and K_t are the same matrices as in the full-information case.
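The recursions (11.5)-(11.7) can be run directly on a small system. The following is a sketch, assuming a scalar system with illustrative values A = B = C = 1, N = 0.5, M = 1, and L = 0 so that process and observation noise are uncorrelated; it also illustrates that the V_t recursion is deterministic and policy-independent.

```python
import numpy as np

# One step of the Kalman filter recursions (11.5)-(11.7) (a sketch; the
# scalar system below uses assumed values).
A = B = C = np.array([[1.0]])
N_, L_, M_ = np.array([[0.5]]), np.array([[0.0]]), np.array([[1.0]])

def kalman_step(x_hat, V, u, y):
    S = M_ + C @ V @ C.T                        # innovation covariance
    H = (L_ + A @ V @ C.T) @ np.linalg.inv(S)   # gain, (11.7)
    x_new = A @ x_hat + B @ u + H @ (y - C @ x_hat)  # (11.5)
    V_new = (N_ + A @ V @ A.T
             - (L_ + A @ V @ C.T) @ np.linalg.inv(S) @ (L_.T + C @ V @ A.T))  # (11.6)
    return x_new, V_new

rng = np.random.default_rng(0)
x = np.zeros(1); x_hat = np.zeros(1); V = np.array([[2.0]])
for t in range(200):
    u = -0.5 * x_hat        # any policy; the V_t recursion is policy-independent
    x_prev = x
    x = A @ x + B @ u + rng.normal(0.0, np.sqrt(N_[0, 0]), 1)   # plant (11.3)
    y = C @ x_prev + rng.normal(0.0, np.sqrt(M_[0, 0]), 1)      # observation (11.4)
    x_hat, V = kalman_step(x_hat, V, u, y)

# With these values, (11.6) reads V_t = 0.5 + V_{t-1}/(1 + V_{t-1}),
# whose fixed point is V = 1; V_t converges to it from any V_0 >= 0.
print(float(V[0, 0]))
```

Note that V_t never looks at the realized data: it could have been computed before any observation was made, which is the observation that underlies the separation principle below.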
It is important to grasp the remarkable fact that (ii) asserts: the optimal control u_t is exactly the same as it would be if all unknowns were known and took values equal to their linear least squares estimates (equivalently, their conditional means) based upon observations up to time t. This is the idea known as certainty equivalence. As we have seen in the previous section, the distribution of the estimation error x̂_t − x_t does not depend on U_{t−1}. The fact that the problems of optimal estimation and optimal control can be decoupled in this way is known as the separation principle.

Proof. The proof is by backward induction. Suppose (11.8) holds at t. Recall that

x̂_t = A x̂_{t−1} + B u_{t−1} + H_t ζ_t,   Δ_{t−1} = x̂_{t−1} − x_{t−1}.

Then, with a quadratic cost of the form c(x, u) = x′Rx + 2u′Sx + u′Qu, we have

F(W_{t−1}) = min_{u_{t−1}} E[ c(x_{t−1}, u_{t−1}) + x̂_t′ Π_t x̂_t + ··· | W_{t−1}, u_{t−1} ]
= min_{u_{t−1}} E[ c(x̂_{t−1} − Δ_{t−1}, u_{t−1}) + (A x̂_{t−1} + B u_{t−1} + H_t ζ_t)′ Π_t (A x̂_{t−1} + B u_{t−1} + H_t ζ_t) + ··· | W_{t−1}, u_{t−1} ]
= min_{u_{t−1}} [ c(x̂_{t−1}, u_{t−1}) + (A x̂_{t−1} + B u_{t−1})′ Π_t (A x̂_{t−1} + B u_{t−1}) ] + ···,

where we use the fact that, conditional on (W_{t−1}, u_{t−1}), both Δ_{t−1} and ζ_t have zero means and are policy-independent. This ensures that when we expand the quadratics in powers of Δ_{t−1} and H_t ζ_t the expected values of the linear terms in these quantities are zero and the expected values of the quadratic terms (absorbed into + ···) are policy-independent.

11.4 Example: inertialess rocket with noisy position sensing

Consider the scalar case of controlling the position of a rocket by inertialess control of its velocity, but in the presence of imperfect position sensing:

x_t = x_{t−1} + u_{t−1},   y_t = x_t + η_t,

where η_t is white noise with variance 1. Suppose it is desired to minimize

E [ Σ_{t=0}^{h−1} u_t^2 + D x_h^2 ].

(The relevant innovation process is now ỹ_t = y_t − x̂_{t−1} − u_{t−1}.) Taking a linear estimate of the form x̂_t = x̂_{t−1} + u_{t−1} + H_t ỹ_t, subtracting the plant equation and substituting for x_t and y_t gives, with Δ_t = x̂_t − x_t,

Δ_t = Δ_{t−1} + H_t (η_t − Δ_{t−1}).

The variance of Δ_t is therefore

var Δ_t = V_{t−1} − 2 H_t V_{t−1} + H_t^2 (1 + V_{t−1}).
Minimizing this with respect to H_t gives H_t = V_{t−1}(1 + V_{t−1})^{−1}, so the variance of the least squares estimate of x_t obeys the recursion

V_t = V_{t−1} − V_{t−1}^2 (1 + V_{t−1})^{−1} = V_{t−1}/(1 + V_{t−1}).

Hence

V_t^{−1} = V_{t−1}^{−1} + 1 = ··· = V_0^{−1} + t.

If there is complete lack of information at the start, then V_0^{−1} = 0, V_t = 1/t and

x̂_t = x̂_{t−1} + u_{t−1} + [V_{t−1}/(1 + V_{t−1})](y_t − x̂_{t−1} − u_{t−1}) = [(t − 1)(x̂_{t−1} + u_{t−1}) + y_t]/t.

As far as the optimal control is concerned, suppose an inductive hypothesis that F(W_t) = x̂_t^2 Π_t + ···, where ··· denotes policy-independent terms. Then

F(W_{t−1}) = inf_u { u^2 + E[ (x̂_{t−1} + u + H_t(y_t − x̂_{t−1} − u))^2 ] Π_t + ··· }
= inf_u { u^2 + (x̂_{t−1} + u)^2 Π_t + E[ H_t^2 (η_t − Δ_{t−1})^2 ] Π_t + ··· }.

Minimizing over u we obtain the usual Riccati recursion

Π_{t−1} = Π_t − Π_t^2/(1 + Π_t) = Π_t/(1 + Π_t).

Hence Π_t = D/(1 + D(h − t)) and the optimal control is the certainty-equivalence control u_t = −D x̂_t/(1 + D(h − t)). This is the same control as in the deterministic case, but with x_t replaced by x̂_t.

Notice that the observational relation differs from the usual model of y_t = C x_{t−1} + η_t. That is why we derived Kalman filter formulae for this variation inductively from scratch: supposing x̂_{t−1} − x_{t−1} ~ N(0, V_{t−1}), we posited a linear estimate of x_t of the form x̂_t = x̂_{t−1} + u_{t−1} + H_t (y_t − x̂_{t−1} − u_{t−1}) and chose H_t to minimize the variance of the estimation error.
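The two closed forms derived above can be checked step by step. The following is a sketch, assuming illustrative values for the horizon h and the terminal weight D; it runs the variance recursion forwards and the Riccati recursion backwards.

```python
# Closed forms in the inertialess-rocket example (a sketch; h and D are
# assumed values).  The variance recursion V_t = V_{t-1}/(1 + V_{t-1})
# gives V_t = 1/t from V_0^{-1} = 0, and the Riccati recursion
# Pi_{t-1} = Pi_t/(1 + Pi_t) with Pi_h = D gives Pi_t = D/(1 + D(h - t)).
h, D = 10, 2.0

# Variance: start from a huge V_0 to mimic complete lack of information.
V = 1e12
for t in range(1, h + 1):
    V = V / (1.0 + V)
    assert abs(V - 1.0 / t) < 1e-9, (t, V)

# Riccati recursion run backwards from the terminal cost D x_h^2.
Pi = D
for t in range(h - 1, -1, -1):
    Pi = Pi / (1.0 + Pi)
    assert abs(Pi - D / (1.0 + D * (h - t))) < 1e-9, (t, Pi)

print("closed forms verified")
```

Both loops pass their assertions, confirming V_t = 1/t and Pi_t = D/(1 + D(h - t)).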
12 Dynamic Programming in Continuous Time

We develop the HJB equation for dynamic programming in continuous time.

12.1 The optimality equation

In continuous time the plant equation is

ẋ = a(x, u, t).

Let us consider a discounted cost of

C = ∫_0^T e^{−αt} c(x, u, t) dt + e^{−αT} C(x(T), T).

The discount factor over δ is e^{−αδ} = 1 − αδ + o(δ). So the optimality equation is

F(x, t) = inf_u [ c(x, u, t)δ + e^{−αδ} F(x + a(x, u, t)δ, t + δ) ] + o(δ).

By considering the terms that multiply δ in the Taylor series expansion we obtain

inf_u [ c(x, u, t) − αF + ∂F/∂t + (∂F/∂x) a(x, u, t) ] = 0,   t < T,   (12.1)

with F(x, T) = C(x, T). In the undiscounted case, we simply put α = 0. The DP equation (12.1) is called the Hamilton-Jacobi-Bellman (HJB) equation. The heuristic derivation we have given above is justified by the following theorem.

Theorem 12.1 Suppose a policy π, using a control u, has a value function F which satisfies the HJB equation (12.1) for all values of x and t. Then π is optimal.

Proof. Consider any other policy, using control v, say. Then along the trajectory defined by ẋ = a(x, v, t) we have

d/dt [ e^{−αt} F(x, t) ] = e^{−αt} [ −αF + F_t + F_x a(x, v, t) ] ≥ −e^{−αt} c(x, v, t),

since by (12.1) we have c(x, v, t) − αF + F_t + F_x a(x, v, t) ≥ 0 for any v. Integrating this inequality along the v path, from x(0) to x(T), gives

F(x(0), 0) ≤ e^{−αT} C(x(T), T) + ∫_{t=0}^{T} e^{−αt} c(x, v, t) dt.

Thus the v path incurs a cost of at least F(x(0), 0), and hence π is optimal.

12.2 Example: LQ regulation

The undiscounted continuous-time DP equation for the LQ regulation problem is

0 = inf_u [ x′Rx + u′Qu + F_t + F_x (Ax + Bu) ].

Suppose we try a solution of the form F(x, t) = x′Π(t)x, where Π(t) is a symmetric matrix. Then F_x = 2x′Π(t) and the optimizing u is u = −½Q^{−1}B′F_x′ = −Q^{−1}B′Π(t)x. Therefore the DP equation is satisfied with this u if

0 = x′ [ R + ΠA + A′Π − ΠBQ^{−1}B′Π + dΠ/dt ] x,

where we use the fact that 2x′ΠAx = x′ΠAx + x′A′Πx. Hence we have a solution to the HJB equation if Π(t) satisfies the Riccati differential equation of Section 7.4 (LQ regulation in continuous time).

12.3 Example: estate planning

A man is considering his lifetime plan of investment and expenditure.
He has an initial level of savings x(0) and no other income other than that which he obtains from investment at a fixed interest rate. His total capital is therefore governed by the equation

ẋ(t) = βx(t) − u(t),

where β > 0 and u is his rate of expenditure. He wishes to maximize

∫_0^T e^{−αt} √(u(t)) dt,

for a given T. Find his optimal policy.

Solution. The optimality equation is

0 = sup_u [ √u − αF + ∂F/∂t + (∂F/∂x)(βx − u) ].

Suppose we try a solution of the form F(x, t) = f(t)√x. For this to work we need

0 = sup_u [ √u − αf√x + f′√x + (f/(2√x))(βx − u) ].

By d(·)/du = 0, the optimizing u is u = x/f^2, and the optimized value is

(√x/f) [ ½ − (α − ½β)f^2 + f f′ ].   (12.2)

We have a solution if we can choose f to make the bracketed term in (12.2) equal to 0. We have the boundary condition F(x, T) = 0, which imposes f(T) = 0. Thus we find

f(t)^2 = [ 1 − e^{−(2α−β)(T−t)} ] / (2α − β).
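That this f satisfies the required differential equation f f' - (α - β/2) f^2 + 1/2 = 0 with f(T) = 0 can be checked numerically. The following is a sketch, assuming illustrative values of α, β and T; it differentiates f numerically and evaluates the residual of the bracketed term in (12.2).

```python
import numpy as np

# Check that f(t)^2 = (1 - exp(-(2a - b)(T - t)))/(2a - b) makes the
# bracketed term in (12.2) vanish (a sketch; alpha, beta, T are assumed
# example values with 2*alpha > beta).
alpha, beta, T = 0.3, 0.1, 10.0
k = 2 * alpha - beta

def f(t):
    return np.sqrt((1.0 - np.exp(-k * (T - t))) / k)

t = np.linspace(0.0, T - 1e-3, 200)
h = 1e-6
fp = (f(t + h) - f(t - h)) / (2 * h)    # numerical derivative f'(t)
residual = f(t) * fp - (alpha - beta / 2) * f(t) ** 2 + 0.5
print(np.max(np.abs(residual)) < 1e-4)
```

The residual is zero to numerical precision over the whole horizon, so the candidate value function F(x, t) = f(t)√x does satisfy the HJB equation.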
We have found a policy whose value function F(x, t) satisfies the HJB equation. So by Theorem 12.1 it is optimal. In closed-loop form the optimal policy is u = x/f^2.

12.4 Example: harvesting

A fish population of size x obeys the plant equation

ẋ = a(x, u) = { a(x) − u,  x > 0;  [a(x) − u]^+,  x = 0 }.

The function a(x) reflects the facts that the population can grow when it is small, but is subject to environmental limitations when it is large. It is desired to maximize the discounted total harvest

∫_0^T e^{−αt} u dt.

Solution. The DP equation (with discounting) is

sup_u [ u − αF + ∂F/∂t + (∂F/∂x)(a(x) − u) ] = 0,   t < T.

Here u occurs linearly within the maximization and so we have a bang-bang optimal control of the form

u = { 0, undetermined, u_max }   for   F_x { > 1, = 1, < 1 },

where u_max is the largest practicable fishing rate.

Suppose F(x, t) → F(x) as T → ∞, and ∂F/∂t → 0. Then

sup_u [ u − αF + F_x (a(x) − u) ] = 0.   (12.3)

Let us make a guess that F(x) is concave, and then deduce that

u = { 0,  x < x̄;  undetermined, but effectively a(x̄),  x = x̄;  u_max,  x > x̄ }.   (12.4)

Clearly, x̄ is the operating point. We suppose

ẋ = { a(x) > 0,  x < x̄;  a(x) − u_max < 0,  x > x̄ }.

We say that there is chattering about the point x̄, in the sense that u switches between its maximum and minimum values either side of x̄, effectively taking the value a(x̄) at x̄. To determine x̄ we note that

F(x̄) = ∫_0^∞ e^{−αt} a(x̄) dt = a(x̄)/α.   (12.5)

So from (12.3) and (12.5) we have

F_x(x) = [αF(x) − u(x)] / [a(x) − u(x)] → 1   as x ↗ x̄ or x ↘ x̄.   (12.6)

For F to be concave, F_xx must be negative if it exists. Differentiating the expression for F_x, we must have

F_xx = [αF_x − u′(x)]/[a(x) − u(x)] − [αF − u(x)][a′(x) − u′(x)]/[a(x) − u(x)]^2 ≈ [α − a′(x)]/[a(x) − u(x)],

where the last step uses F_x ≈ 1, which by (12.6) holds in a neighbourhood of x̄. It is required that F_xx be negative. But the denominator changes sign at x̄, so the numerator must do so also, and therefore we must have a′(x̄) = α. Choosing this as our x̄, we have that F(x) is concave, as we conjectured from the start. We now have the complete solution. The control in (12.4) has a value function F which satisfies the HJB equation.
[Figure: the growth rate a(x), subject to environmental pressures, with the operating point x̄ where α = a′(x̄), the chattering harvest rate u = a(x̄), and the maximal rate u_max marked.]

Notice that there is a sacrifice of long-term yield for immediate return. If the initial population is greater than x̄ then the optimal policy is to overfish at u_max until we reach the new x̄ and then fish at rate u = a(x̄). As α ↗ a′(0), x̄ ↘ 0. So for sufficiently large α it is optimal to wipe out the fish population.
13 Pontryagin's Maximum Principle

We explain Pontryagin's maximum principle and give some examples of its use.

13.1 Heuristic derivation

Pontryagin's maximum principle (PMP) states a necessary condition that must hold on an optimal trajectory. It is a calculation for a fixed initial value of the state, x(0). In comparison, the DP approach is a calculation for a general initial value of the state. PMP can be used as both a computational and analytic technique (and in the second case can solve the problem for general initial value).

Consider first a time-invariant formulation, with plant equation ẋ = a(x, u), instantaneous cost c(x, u), stopping set S and terminal cost K(x). The value function F(x) obeys the DP equation (without discounting)

inf_{u∈U} [ c(x, u) + F_x a(x, u) ] = 0   (13.1)

outside S, with terminal condition

F(x) = K(x),   x ∈ S.   (13.2)

Define the adjoint variable

λ = −F_x.   (13.3)

This is a column n-vector, and is to be regarded as a function of time on the path. (The proof that F_x exists in the required sense is actually a tricky technical matter.) Also define the Hamiltonian

H(x, u, λ) = λ′ a(x, u) − c(x, u),   (13.4)

a scalar, defined at each point of the path as a function of the current x, u and λ.

Theorem 13.1 (PMP) Suppose u(t) and x(t) represent the optimal control and state trajectory. Then there exists an adjoint trajectory λ(t) such that together u(t), x(t) and λ(t) satisfy

ẋ = ∂H/∂λ = a(x, u),   (13.5)
λ̇ = −∂H/∂x, i.e., λ̇ = −(∂a/∂x)′λ + c_x,   (13.6)

and, for all t, 0 ≤ t ≤ T, and all feasible controls v,

H(x(t), v, λ(t)) ≤ H(x(t), u(t), λ(t)),   (13.7)

i.e., the optimal control u(t) is the value of v maximizing H(x(t), v, λ(t)).

Proof. Our heuristic proof is based upon the DP equation; this is the most direct and enlightening way to derive conclusions that may be expected to hold in general. Assertion (13.5) is immediate, and (13.7) follows from the fact that the minimizing value of u in (13.1) is optimal. We can write (13.1) in incremental form as

F(x) = inf_u [ c(x, u)δ + F(x + a(x, u)δ) ] + o(δ).
Using the chain rule to differentiate with respect to x_i yields

−λ_i(t) = (∂c/∂x_i)δ − λ_i(t + δ) − Σ_j (∂a_j/∂x_i) λ_j(t + δ) δ + o(δ),

whence (13.6) follows.

Notice that (13.5) and (13.6) each give n equations. Condition (13.7) gives a further m equations (since it requires stationarity with respect to variation of the m components of u). So in principle these equations, if nonsingular, are sufficient to determine the 2n + m functions u(t), x(t) and λ(t). One can make other assertions, including specification of end-conditions (the so-called transversality conditions).

Theorem 13.2 (i) H = 0 on the optimal path. (ii) The sole initial condition is specification of the initial x. The terminal condition

(λ + K_x)′σ = 0   (13.8)

holds at the terminal x for all σ such that x + εσ is within o(ε) of the termination point of a possible optimal trajectory for all sufficiently small positive ε.

Proof. Assertion (i) follows from (13.1), and the first assertion of (ii) is evident. We have the terminal condition (13.2), from which it follows that (F_x − K_x)′σ = 0 for all x, σ such that x and x + εσ lie in S for all small enough positive ε. However, we are only interested in points where an optimal trajectory makes its first entry to S, and at these points (13.3) holds. Thus we must have (13.8).

13.2 Example: bringing a particle to rest in minimal time

A particle with given initial position and velocity x_1(0), x_2(0) is to be brought to rest at position 0 in minimal time. This is to be done using the control force u, such that |u| ≤ 1, with dynamics ẋ_1 = x_2 and ẋ_2 = u. That is,

d/dt ( x_1 ; x_2 ) = ( 0  1 ; 0  0 )( x_1 ; x_2 ) + ( 0 ; 1 ) u,   (13.9)

and we wish to minimize

C = ∫_0^T 1 dt,
where T is the first time at which x = (0, 0). The Hamiltonian is

H = λ_1 x_2 + λ_2 u − 1,

which is maximized by u = sign(λ_2). The adjoint variables satisfy λ̇_i = −∂H/∂x_i, so

λ̇_1 = 0,   λ̇_2 = −λ_1.   (13.10)

The terminal x must be 0, so in (13.8) we can only take σ = 0, and so (13.8) provides no additional information for this problem. However, if at termination λ_1 = α, λ_2 = β, then in terms of time to go s we can compute

λ_1 = α,   λ_2 = β + αs.

These reveal the form of the solution: there is at most one change of sign of λ_2 on the optimal path; u is maximal in one direction and then possibly maximal in the other.

Appealing to the fact that H = 0 at termination (when x_2 = 0), we conclude that |β| = 1. We now consider the case β = 1; the case β = −1 is similar. If β = 1, α ≥ 0 then λ_2 = 1 + αs > 0 for all s and

u = 1,   x_2 = −s,   x_1 = s^2/2.

In this case the optimal trajectory lies on the parabola x_1 = x_2^2/2, x_1 ≥ 0, x_2 ≤ 0. This is half of the switching locus x_1 = ±x_2^2/2. If β = 1, α < 0 then u = −1 or u = 1 as the time to go is greater or less than s_0 = 1/|α|. In this case,

u = −1,   x_2 = −(2s_0 − s),   x_1 = 2s_0 s − ½s^2 − s_0^2,   s ≥ s_0,
u = 1,    x_2 = −s,            x_1 = ½s^2,                    s ≤ s_0.

The control rule expressed as a function of s is open-loop, but in terms of (x_1, x_2) and the switching locus it is closed-loop. Notice that the path is sensitive to the initial conditions, in that the optimal path is very different for two points just either side of the switching locus.

13.3 Connection with Lagrangian multipliers

An alternative way to understand the maximum principle is to think of λ as a Lagrange multiplier for the constraint ẋ = a(x, u). Consider the Lagrangian form

L = ∫_0^T [ −c − λ′(ẋ − a) ] dt − K(x(T)),

to be maximized with respect to the (x, u, λ) path. Here x(t) first enters a stopping set S at t = T. We integrate λ′ẋ by parts to obtain

L = −λ(T)′x(T) + λ(0)′x(0) + ∫_0^T [ λ̇′x + λ′a − c ] dt − K(x(T)).

[Figure 2: Optimal trajectories for the problem.]

The integrand must be stationary with respect to x(t) and hence λ̇ = −H_x, i.e., (13.6).
The expression must also be stationary with respect to variations of the terminal point, x(T) + εσ ∈ S, ε > 0, and hence (λ(T) + K_x(x(T)))′σ = 0, i.e., (13.8). It is good to have this alternative view, but the treatment is less immediate and less easy to rigorise.

13.4 Example: use of the transversality conditions

If the terminal time is constrained then (as we see in the next lecture) we no longer have Theorem 13.2(i), i.e., that H is maximized to 0, but the other claims of Theorems 13.1 and 13.2 continue to hold. Consider a problem with the dynamics (13.9), but with u unconstrained, x(0) = (0, 0) and cost function

C = ½ ∫_0^T u(t)^2 dt − x_1(T),

where T is fixed and given. Here K(x) = −x_1 and the Hamiltonian is

H(x, u, λ) = λ_1 x_2 + λ_2 u − ½u^2,

which is maximized at u(t) = λ_2(t). Now λ̇_i = −∂H/∂x_i gives

λ̇_1 = 0,   λ̇_2 = −λ_1.

In the terminal condition, (λ + K_x)′σ = 0, σ is arbitrary and so we also have

λ_1(T) − 1 = 0,   λ_2(T) = 0.

Thus the solution must be λ_1(t) = 1 and λ_2(t) = T − t. Hence the optimal applied force is u(t) = T − t, which decreases linearly with time and reaches zero at T.
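Returning to the minimal-time problem of Section 13.2: the switching locus x_1 = ±x_2^2/2 gives the closed-loop rule u = −sign(x_1 + x_2|x_2|/2), which can be simulated. The following is a sketch, assuming an illustrative initial state, step size and stopping tolerance.

```python
import numpy as np

# Closed-loop simulation of the time-optimal control via the switching
# locus x1 = -x2|x2|/2 (a sketch; initial state and step size are assumed).
def u_opt(x1, x2):
    sigma = x1 + 0.5 * x2 * abs(x2)   # > 0 above the locus, < 0 below
    if sigma > 0:
        return -1.0
    if sigma < 0:
        return 1.0
    return -np.sign(x2) if x2 != 0 else 0.0

dt, x1, x2, t = 1e-4, 1.0, 1.0, 0.0
while abs(x1) + abs(x2) > 2e-2 and t < 10.0:
    u = u_opt(x1, x2)
    x1, x2, t = x1 + x2 * dt, x2 + u * dt, t + dt

# From (1, 1) the optimal path uses u = -1 until the locus is hit, then
# u = +1 into the origin; the minimal time works out to 1 + 2*sqrt(1.5).
print(abs(x1) + abs(x2) <= 2e-2, abs(t - (1 + 2 * np.sqrt(1.5))) < 0.1)
```

The simulated stopping time agrees with the analytic minimal time to within the stated tolerance, illustrating how the open-loop solution in s becomes a feedback rule via the locus.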
14 Applications of the Maximum Principle

We discuss the terminal conditions of the maximum principle and further examples of its use. The arguments are typical of those used to synthesise a solution to an optimal control problem by use of the maximum principle.

14.1 Problems with terminal conditions

Suppose a, c, S and K are all t-dependent. The DP equation for F(x, t) is now

inf_u [ c + F_t + F_x a ] = 0, equivalently F_t − sup_u [ λ′a − c ] = 0,   (14.1)

outside a stopping set S, with F(x, t) = K(x, t) for (x, t) ∈ S. However, we can reduce this to a formally time-invariant case by augmenting the state variable x by the variable t. We then have the augmented variables

x → (x, t),   a → (a, 1),   λ → (λ, λ_0).

We keep the same definition (13.4) as before, that H = λ′a − c, and take λ_0 = −F_t. It now follows from (14.1) that on the optimal trajectory H(x, u, λ) is maximized to −λ_0. Theorem 13.1 still holds, as can be verified. However, to (13.6) we can now add

λ̇_0 = −∂H/∂t = c_t − λ′a_t,   (14.2)

and the transversality condition

(λ + K_x)′σ + (λ_0 + K_t)τ = 0,   (14.3)

which must hold at the termination point (x, t) if (x + εσ, t + ετ) is within o(ε) of the termination point of an optimal trajectory for all small enough positive ε.

We can now understand what to do with various types of terminal condition. If the stopping rule specifies only a fixed terminal time T then τ must be zero and σ is unconstrained, so that (14.3) becomes λ(T) = −K_x. The problem in Section 13.4 is like this. If there is a free terminal time then τ is unconstrained and so (14.3) gives λ_0(T) = −K_T. An example of this case appears in Section 14.2 below.

If the system is time-homogeneous, in that a and c are independent of t, but the terminal cost K(x, T) depends on T, then (14.2) implies that λ_0 is constant, and so the maximized value of H is constant on the optimal orbit. The problem in Section 13.2 could have been solved this way, by replacing C = ∫_0^T 1 dt by C = K(x, T) = T. We would deduce from the transversality condition that, since τ is unconstrained, λ_0 = −K_T = −1.
Thus H = λ_1 x_2 + λ_2 u is maximized to 1 at all points of the optimal trajectory.

14.2 Example: monopolist

Miss Prout holds the entire remaining stock of Cambridge elderberry wine for the vintage year 1959. If she releases it at rate u (in continuous time) she realises a unit price p(u) = (1 − u/2), for 0 ≤ u ≤ 2, and p(u) = 0 for u > 2. She holds an amount x(0) at time 0 and wishes to release it in a way that maximizes her total discounted return,

∫_0^T e^{−αt} u p(u) dt

(where T is unconstrained).

Solution. The plant equation is ẋ = −u and the Hamiltonian is

H(x, u, λ) = e^{−αt} u p(u) − λu = e^{−αt} u(1 − u/2) − λu.

Note that K = 0. Maximizing with respect to u and using λ̇ = −H_x = 0 gives

u = 1 − λe^{αt},   λ̇ = 0,   t ≥ 0,

so λ is constant. The terminal time is unconstrained, so the transversality condition gives λ_0(T) = −K_T = 0. Therefore, since H is maximized to −λ_0(T) = 0 at T, we have u(T) = 0 and hence

λ = e^{−αT},   u = 1 − e^{−α(T−t)},   0 ≤ t ≤ T,

where T is then the time at which all the wine has been sold, and so

x(0) = ∫_0^T u dt = T − (1 − e^{−αT})/α.

Thus u is implicitly a function of x(0), through T. The optimal value function is

F(x) = ∫_0^T (u − u^2/2) e^{−αt} dt = ½ ∫_0^T e^{−αt} (1 − e^{−2α(T−t)}) dt = (1 − e^{−αT})^2 / (2α).

14.3 Example: insects as optimizers

A colony of insects consists of workers and queens, of numbers w(t) and q(t) at time t. If a time-dependent proportion u(t) of the colony's effort is put into producing workers, 0 ≤ u(t) ≤ 1, then w, q obey the equations

ẇ = auw − bw,   q̇ = c(1 − u)w,

where a, b, c are constants, with a > b. The function u is to be chosen to maximize the number of queens at the end of the season. Show that the optimal policy is to produce only workers up to some moment, and produce only queens thereafter.

Solution. The Hamiltonian is

H = λ_1(auw − bw) + λ_2 c(1 − u)w.
The adjoint equations and transversality conditions (with K = q) give

λ̇_1 = −∂H/∂w = −[λ_1(au − b) + λ_2 c(1 − u)],   λ̇_2 = −∂H/∂q = 0,
λ_1(T) = K_w = 0,   λ_2(T) = K_q = 1,

and hence λ_2(t) = 1 for all t. Therefore H is maximized by u = 0 or u = 1 according as λ_1 a < c or λ_1 a > c. At T, λ_1(T) = 0, and this implies u(T) = 0. If t is a little less than T, λ_1 is small and u = 0, so the equation for λ_1 is

λ̇_1 = λ_1 b − c.   (14.4)

As long as λ_1 is small, λ̇_1 < 0. Therefore, as the remaining time s increases, λ_1(s) increases, until such point that λ_1 a ≥ c. The optimal control then becomes u = 1, and λ̇_1 = −λ_1(a − b) < 0, which implies that λ_1(s) continues to increase as s increases, right back to the start. So there is no further switch in u. The point at which the single switch occurs is found by integrating (14.4) from t to T, to give λ_1(t) = (c/b)(1 − e^{−(T−t)b}), and so the switch occurs where λ_1 a = c, i.e., (a/b)(1 − e^{−(T−t)b}) = 1, or

t_switch = T + (1/b) log(1 − b/a).

Experimental evidence suggests that social insects do closely follow this policy and adopt a switch time that is nearly optimal for their natural environment.

14.4 Example: rocket thrust optimization

Regard a rocket as a point mass with position x, velocity v and mass m. Mass is changed only by expulsion of matter in the jet. Suppose the jet has vector velocity k relative to the rocket and the rocket is subject to external force f. Then the condition of momentum conservation yields

(m − δm)(v + δv) + (v − k)δm − mv = f δt,

and this gives the so-called rocket equation,

m v̇ = −k ṁ + f.

Suppose the jet speed |k| = 1/b is fixed, but the direction and the rate of expulsion of mass can be varied. Then the control is the thrust vector u = −kṁ, subject to |u| ≤ 1, say. Find the control that maximizes the height that the rocket reaches.

Solution. The plant equation (in R^3) is

ẋ = v,   v̇ = (u + f)/m,   ṁ = −b|u|.
The optimal u is in the direction of q, so u = |u| q/|q|, and |u| maximizes |u|(|q|/m − br). Thus the optimal thrust should be

maximal / intermediate / null, as |q|/m − rb is > 0 / = 0 / < 0.

The control is bang/bang and p, q, r are determined from the dual equations. If the rocket is launched vertically then f = −mg, and the dual equations give ṗ = 0, q̇ = −p and ṙ = q·u/m² ≥ 0.

Suppose we want to maximize the height that the rocket attains. Let m̄ be the mass of the rocket structure, so that the maximum height has been reached when m = m̄ and v ≤ 0. Since K = x at termination, the transversality conditions give p(T) = 1, q(T) = 0. Thus, with s = T − t the time to go, p(s) = 1, q(s) = s, and u must maximize |u|(s/m − br). One can check that (d/ds)(s/m − rb) > 0, and hence we should use full thrust from launch up to some time, and thereafter coast to maximum height on zero thrust.
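Returning to the insect-colony example of 14.3: the single-switch rule and the switch time T + (1/b) log(1 − b/a) can be checked numerically, by simulating the population equations and comparing the final number of queens for nearby switch times. The sketch below is a minimal check; the constants a = 2, b = 1, c = 1, T = 3 and the initial condition w(0) = 1, q(0) = 0 are illustrative choices, not values from the notes.

```python
import math

def queens_at_end(t_switch, a=2.0, b=1.0, c=1.0, T=3.0, n=30000):
    """Euler-integrate w' = a u w - b w, q' = c (1 - u) w under the policy
    u = 1 (all workers) before t_switch and u = 0 (all queens) after it."""
    dt = T / n
    w, q = 1.0, 0.0
    for i in range(n):
        u = 1.0 if i * dt < t_switch else 0.0
        # simultaneous update: both increments use the current w
        w, q = w + (a * u - b) * w * dt, q + c * (1.0 - u) * w * dt
    return q

# Claimed optimal switch time: t = T + (1/b) log(1 - b/a)
t_star = 3.0 + math.log(1.0 - 1.0 / 2.0)
print(queens_at_end(t_star), queens_at_end(t_star - 0.3), queens_at_end(t_star + 0.3))
```

Switching either earlier or later than t_star gives fewer queens at time T, consistent with the adjoint argument above.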
15 Controlled Markov Jump Processes

We conclude with models for controlled optimization problems in a continuous-time stochastic setting. This lecture is about controlled Markov jump processes, which are relevant when the state space is discrete.

15.1 The dynamic programming equation

The DP equation in incremental form is

F(x, t) = inf_u { c(x, u)δt + E[F(x(t + δt), t + δt) | x(t) = x, u(t) = u] }.

If appropriate limits exist then, in the limit δt → 0, this can be written as

inf_u [ c(x, u) + F_t(x, t) + Λ(u)F(x, t) ] = 0.

Here Λ(u) is the operator defined by

Λ(u)φ(x) = lim_{δt→0} [ E(φ(x(t + δt)) | x(t) = x, u(t) = u) − φ(x) ] / δt,

or

Λ(u)φ(x) = lim_{δt→0} E[ (φ(x(t + δt)) − φ(x))/δt | x(t) = x, u(t) = u ], (15.1)

the conditional expectation of the rate of change of φ(x) along the path. The operator Λ converts a scalar function of state, φ(x), to another such function, Λφ(x). However, its action depends upon the control u, so we write it as Λ(u). It is called the infinitesimal generator of the controlled Markov process. Equation (15.1) is equivalent to

E[φ(x(t + δt)) | x(t) = x, u(t) = u] = φ(x) + Λ(u)φ(x)δt + o(δt).

This equation takes radically different forms depending upon whether the state space is discrete or continuous. Both are important, and we examine their forms in turn, beginning with a discrete state space.

15.2 The case of a discrete state space

Suppose that x can take only values in a discrete set, labelled by an integer j, say, and that the transition intensity

λ_{jk}(u) = lim_{δt→0} (1/δt) P(x(t + δt) = k | x(t) = j, u(t) = u)

is defined for all u and j ≠ k. Then

E[φ(x(t + δt)) | x(t) = j, u(t) = u] = Σ_{k≠j} λ_{jk}(u)φ(k)δt + (1 − Σ_{k≠j} λ_{jk}(u)δt)φ(j) + o(δt),

whence it follows that

Λ(u)φ(j) = Σ_k λ_{jk}(u)[φ(k) − φ(j)],

and the DP equation becomes

inf_u [ c(j, u) + F_t(j, t) + Σ_k λ_{jk}(u)(F(k, t) − F(j, t)) ] = 0. (15.2)

This is the optimality equation for a Markov jump process.

15.3 Uniformization in the infinite horizon case

In this section we explain how (in the infinite horizon case) the continuous-time DP equation (15.2) can be rewritten to look like a discrete-time DP equation.
Once this is done, all the ideas of Lectures 1–6 can be applied. In the discounted cost case (15.2) undergoes the usual modification to

inf_u [ c(j, u) − αF(j, t) + F_t(j, t) + Σ_k λ_{jk}(u)(F(k, t) − F(j, t)) ] = 0.

In the infinite-horizon case, everything becomes independent of time and we have

inf_u [ c(j, u) − αF(j) + Σ_k λ_{jk}(u)(F(k) − F(j)) ] = 0. (15.3)

Suppose we can choose a B large enough that it is possible to define

λ_{jj}(u) = B − Σ_{k≠j} λ_{jk}(u) ≥ 0,

for all j and u. By adding (B + α)F(j) to both sides of (15.3), the DP equation can be written

(B + α)F(j) = inf_u [ c(j, u) + Σ_k λ_{jk}(u)F(k) ].

Finally, dividing by B + α, this can be written as

F(j) = inf_u [ c̄(j, u) + β Σ_k p_{jk}(u)F(k) ], (15.4)

where

c̄(j, u) = c(j, u)/(B + α), β = B/(B + α), p_{jk}(u) = λ_{jk}(u)/B, and Σ_k p_{jk}(u) = 1.
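To illustrate the reduction (15.4), here is a minimal sketch of uniformization followed by value iteration, for a hypothetical two-state, two-action jump process. The rates, costs, discount rate and the choice B = 5 are invented for the example; the construction itself follows the formulas above.

```python
import numpy as np

# Hypothetical 2-state, 2-action jump process (rates and costs invented for
# the illustration): rates[j, u, k] = lambda_jk(u) for k != j, cost[j, u] = c(j, u).
rates = np.array([[[0.0, 1.0], [0.0, 3.0]],
                  [[2.0, 0.0], [4.0, 0.0]]])
cost = np.array([[1.0, 2.0],
                 [5.0, 6.0]])
alpha = 0.5     # discount rate
B = 5.0         # uniformization constant, at least the largest exit rate

# Build the equivalent discrete-time problem of (15.4).
beta = B / (B + alpha)           # discrete-time discount factor
cbar = cost / (B + alpha)        # rescaled costs
P = rates / B                    # p_jk(u) = lambda_jk(u) / B ...
for j in range(2):
    for u in range(2):
        P[j, u, j] = 1.0 - P[j, u].sum()   # ... plus the self-loop lambda_jj(u)/B

# Value iteration on F(j) = min_u [ cbar(j, u) + beta * sum_k p_jk(u) F(k) ].
F = np.zeros(2)
for _ in range(500):
    F = np.min(cbar + beta * (P @ F), axis=1)
print(F)
```

The fixed point of this discrete-time recursion also satisfies the original continuous-time equation (15.3), since the two are algebraically equivalent.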
This makes the dynamic programming equation look like a case of discounted dynamic programming in discrete time, or of negative programming if α = 0. All the results we have for those cases can now be used (e.g., value iteration, OSLA rules, etc.) The trick of using a large B to make the reduction from a continuous to a discrete-time formulation is called uniformization.

In the undiscounted case we could try a solution to (15.2) of the form F(j, t) = −γt + φ(j). Substituting this in (15.2), we see that this gives a solution provided

0 = inf_u [ c(j, u) − γ + Σ_k λ_{jk}(u)(φ(k) − φ(j)) ].

By adding Bφ(j) to both sides of the above, then dividing by B, setting γ̄ = γ/B, and making the other substitutions above (but with α = 0), this is equivalent to

φ(j) + γ̄ = inf_u [ c̄(j, u) + Σ_k p_{jk}(u)φ(k) ], (15.5)

which has the same form as the discrete-time average-cost optimality equation of Lecture 6. The theorems and techniques of that lecture can now be applied.

15.4 Example: admission control at a queue

Consider a queue of varying size 0, 1, 2, ..., with constant service rate µ and arrival rate u, where u is controllable between 0 and a maximum value λ. Let c(x, u) = ax − Ru. This corresponds to paying a cost a per unit time for each customer in the queue and receiving a reward R at the point that each new customer is admitted (and therefore accruing reward at rate Ru when the arrival rate is u). Let us take B = λ + µ and, without loss of generality, assume B = 1. The average-cost optimality equation from (15.5) is

φ(0) + γ = inf_u [ −Ru + uφ(1) + (µ + λ − u)φ(0) ]
         = inf_u { u[−R + φ(1) − φ(0)] } + (µ + λ)φ(0),

φ(x) + γ = inf_u [ ax − Ru + uφ(x + 1) + µφ(x − 1) + (λ − u)φ(x) ]
         = inf_u { ax + u[−R + φ(x + 1) − φ(x)] } + µφ(x − 1) + λφ(x), x > 0.

Thus u should be chosen to be 0 or λ as −R + φ(x + 1) − φ(x) is positive or negative.

Let us consider what happens under the policy that takes u = λ for all x. The relative costs for this policy, say f, are given by

f(x) + γ = ax − Rλ + λf(x + 1) + µf(x − 1), x > 0.

The solution to the homogeneous part of this recursion is of the form f(x) = d₁·1ˣ + d₂(µ/λ)ˣ. Assuming λ < µ, and since we desire a solution for f that does not grow exponentially, we take d₂ = 0, and so the solution is effectively the solution to the inhomogeneous part, i.e.,

f(x) = ax(x + 1)/(2(µ − λ)), γ = aλ/(µ − λ) − λR.

Applying the idea of policy improvement, we conclude that a better policy is to take u = 0 (i.e., don't admit a customer) if −R + f(x + 1) − f(x) > 0, i.e., if

(x + 1)a/(µ − λ) − R > 0.

Further policy improvement would probably be needed to reach the optimal policy. However, this policy already exhibits an interesting property: it rejects customers at smaller queue lengths x than does a policy which rejects a customer if and only if

(x + 1)a/µ − R > 0.

This second policy is optimal if one is purely concerned with whether or not an individual customer who joins when there are x customers in front of him will show a profit, on the basis of the difference between the reward R and his expected holding cost (x + 1)a/µ. This example exhibits the difference between individual optimality (which is myopic) and social optimality. The socially optimal policy is more reluctant to admit customers because it anticipates that more customers are on the way; thus it feels less badly about forgoing the profit on a customer who presents himself now, recognizing that admitting such a customer can cause customers who are admitted after him to suffer greater delay. As expected, the policies are nearly the same if the arrival rate λ is small.
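As a check on the policy-evaluation and policy-improvement steps above, the claimed f and γ can be verified against the evaluation equation, and the two rejection thresholds compared directly. The parameter values a, R, λ, µ below are illustrative; λ + µ = 1 matches the normalization B = 1.

```python
# Check f(x) = a x (x+1) / (2 (mu - lam)) and gamma = a lam/(mu - lam) - lam R
# for the "always admit" policy, then compare the two rejection thresholds.
a, R = 1.0, 8.0
lam, mu = 0.4, 0.6    # illustrative; lam + mu = 1 as in the text

def f(x):
    return a * x * (x + 1) / (2 * (mu - lam))

gamma = a * lam / (mu - lam) - lam * R

# Policy-evaluation identity: f(x) + gamma = a x - R lam + lam f(x+1) + mu f(x-1).
for x in range(1, 50):
    assert abs(f(x) + gamma - (a * x - R * lam + lam * f(x + 1) + mu * f(x - 1))) < 1e-9

# Improved (socially minded) rule rejects when (x+1) a/(mu - lam) > R;
# the myopic, individually optimal rule rejects when (x+1) a/mu > R.
social = min(x for x in range(100) if (x + 1) * a / (mu - lam) > R)
individual = min(x for x in range(100) if (x + 1) * a / mu > R)
print(social, individual)
```

With these numbers the socially minded rule starts rejecting at queue length 1, while the individually optimal rule admits until queue length 4, illustrating the gap discussed above.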
16 Controlled Diffusion Processes

We give a brief introduction to controlled continuous-time stochastic models with a continuous state space, i.e., controlled diffusion processes.

16.1 Diffusion processes and controlled diffusion processes

The Wiener process {B(t)} is a scalar process for which B(0) = 0, the increments in B over disjoint time intervals are statistically independent, and B(t) is normally distributed with zero mean and variance t. ('B' stands for Brownian motion.) This specification is internally consistent because, for example, B(t) = B(t₁) + [B(t) − B(t₁)], and for 0 ≤ t₁ ≤ t the two terms on the right-hand side are independent normal variables of zero mean, with variance t₁ and t − t₁ respectively.

If δB is the increment of B in a time interval of length δt then

E(δB) = 0, E(δB)² = δt, E(δB)ʲ = o(δt), for j > 2,

where the expectation is one conditional on the past of the process. Note that since E(δB/δt)² = (δt)^{−1} → ∞, the formal derivative ε = dB/dt (continuous-time 'white noise') does not exist in a mean-square sense, but expectations such as

E[ (∫ α(t)ε(t) dt)² ] = E[ (∫ α(t) dB(t))² ] = ∫ α(t)² dt

make sense if the integral is convergent.

Now consider a stochastic differential equation

δx = a(x, u)δt + g(x, u)δB,

which we shall write formally as

ẋ = a(x, u) + g(x, u)ε.

This, as a Markov process, has an infinitesimal generator with action

Λ(u)φ(x) = lim_{δt→0} E[ (φ(x(t + δt)) − φ(x))/δt | x(t) = x, u(t) = u ]
         = φ_x a + ½φ_{xx} g² = φ_x a + ½Nφ_{xx},

where N(x, u) = g(x, u)². So this controlled diffusion process has DP equation

inf_u [ c + F_t + F_x a + ½NF_{xx} ] = 0, (16.1)

and in the vector case

inf_u [ c + F_t + F_x a + ½ tr(NF_{xx}) ] = 0.

16.2 Example: noisy LQ regulation in continuous time

The dynamic programming equation is

inf_u [ x'Rx + u'Qu + F_t + F_x(Ax + Bu) + ½ tr(NF_{xx}) ] = 0.

In analogy with the discrete and deterministic continuous cases that we have considered previously, we try a solution of the form

F(x, t) = x'Π(t)x + γ(t).
This leads to the same Riccati equation as in Section 12.2,

0 = x' [ R + ΠA + A'Π − ΠBQ^{−1}B'Π + dΠ/dt ] x,

and also, as in Section 7.3,

dγ/dt + tr(NΠ(t)) = 0, giving γ(t) = ∫_t^T tr(NΠ(τ)) dτ.

16.3 Example: a noisy second order system

Consider a special case of LQ regulation: minimize

E[ x(T)² + ∫₀ᵀ u(t)² dt ],

where, for 0 ≤ t ≤ T,

ẋ(t) = y(t) and ẏ(t) = u(t) + ε(t),

u(t) is the control variable, and ε(t) is Gaussian white noise. Note that if we define z(t) = x(t) + (T − t)y(t), then

ż = ẋ − y + (T − t)ẏ = (T − t)u + (T − t)ε(t),

where z(T) = x(T). Hence the problem can be posed in terms of the scalars u and z only.
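This reduction can be checked by simulation: integrating the pair (x, y) and the scalar z with the same noise increments should give z(T) ≈ x(T). The control, initial conditions and horizon in the sketch below are arbitrary illustrative choices.

```python
import random, math

# Integrate x' = y, y' = u + noise, and z' = (T - t)(u + noise) with the SAME
# noise increments; the reduction z = x + (T - t) y predicts z(T) = x(T).
random.seed(2)
T, n = 1.0, 20000
dt = T / n
x, y = 0.5, -0.3                  # arbitrary initial state
z = x + T * y                     # z(0) = x(0) + T y(0)
for i in range(n):
    t = i * dt
    u = math.sin(3.0 * t)         # arbitrary control, just for the check
    db = random.gauss(0.0, math.sqrt(dt))
    z += (T - t) * (u * dt + db)
    x, y = x + y * dt, y + u * dt + db
print(x, z)                        # approximately equal
```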
Recalling what we know about LQ models, let us conjecture that the optimality equation is of the form

V(z, t) = z²P_t + γ_t. (16.2)

We could use (16.1). But let us argue from scratch. For (16.2) to work we will need

z²P_t + γ_t = min_u { u²δ + E[ (z + żδ)²P_{t+δ} + γ_{t+δ} ] }
           = min_u { u²δ + [ z² + 2(T − t)zuδ + (T − t)²δ ]P_{t+δ} + γ_{t+δ} } + o(δ).

The optimizing u is u = −(T − t)P_{t+δ}z. Substituting this and letting δ → 0, we have

−z²Ṗ_t − γ̇_t = −z²(T − t)²P_t² + (T − t)²P_t.

Thus

γ̇_t = −(T − t)²P_t and Ṗ_t = (T − t)²P_t².

Using the boundary condition P_T = 1, we find that the solution to the above differential equation is

P_t = (1 + (T − t)³/3)^{−1},

and the optimal control is

u(t) = −(T − t)(1 + (T − t)³/3)^{−1} z(t).

16.4 Example: passage to a stopping set

Consider a problem of movement on the unit interval 0 ≤ x ≤ 1 in continuous time, ẋ = u + ε, where ε is white noise of power v. The process terminates at time T when x reaches one end or the other of the interval. The cost is made up of an integral term ½ ∫₀ᵀ (L + Qu²) dt, penalising both control and time spent, and a terminal cost which takes the value C₀ or C₁ according as termination takes place at 0 or 1. Show that in the deterministic case v = 0 one should head straight for one of the termination points at a constant rate, and that the value function F(x) has a piecewise linear form, with possibly a discontinuity at one of the boundary points if that boundary point is the optimal target from no interior point of the interval.

Show, in the stochastic case, that the dynamic programming equation with the control value optimized out can be linearised by a transformation F(x) = α log φ(x) for a suitable constant α, and hence solve the problem.

Solution. In the deterministic case the optimality equation is

inf_u [ ½(L + Qu²) + u ∂F/∂x ] = 0, 0 < x < 1, (16.3)

with boundary conditions F(0) = C₀, F(1) = C₁. If one goes (from x) for x = 0 at speed w, one incurs a cost of C₀ + (x/2w)(L + Qw²), with a minimum over w value of C₀ + x√(LQ). Indeed (16.3) is solved by

F(x) = min[ C₀ + x√(LQ), C₁ + (1 − x)√(LQ) ].
The minimizing option determines the target, and the optimal w is √(L/Q). In the stochastic case

inf_u [ ½(L + Qu²) + u ∂F/∂x + ½v ∂²F/∂x² ] = 0.

So u = −Q^{−1}F_x and

L − Q^{−1}(∂F/∂x)² + v ∂²F/∂x² = 0.

Make the transform F(x) = −Qv log φ(x), so φ(x) = e^{−F(x)/Qv}. Then

Qv² ∂²φ/∂x² − Lφ = 0,

with solution

φ(x) = k₁ exp( (x/v)√(L/Q) ) + k₂ exp( −(x/v)√(L/Q) ).

We choose the constants k₁, k₂ to meet the two boundary conditions on F.

[Figure: F(x) against x for the passage to a stopping set.]

The figure shows the solution for L = 1, Q = 4, C₀ = 1, C₁ = 1.5 and v = 0.125, 0.25, 0.5, 1, 2, together with the deterministic solution. Notice that noise actually reduces cost, by lessening the time until absorption at one or the other of the endpoints.
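The boundary conditions determine k₁, k₂ through a 2×2 linear system, since φ(0) = e^{−C₀/Qv} and φ(1) = e^{−C₁/Qv}. The sketch below uses the figure's constants with v = 0.5 (one of the plotted values) and confirms the remark above: the stochastic value at an interior point lies below the deterministic value min(C₀ + x√(LQ), C₁ + (1 − x)√(LQ)).

```python
import math

# Solve for k1, k2 in phi(x) = k1 e^{(x/v) sqrt(L/Q)} + k2 e^{-(x/v) sqrt(L/Q)},
# using phi(0) = exp(-C0/(Q v)) and phi(1) = exp(-C1/(Q v)).
L, Q, C0, C1, v = 1.0, 4.0, 1.0, 1.5, 0.5
r = math.sqrt(L / Q)

p0 = math.exp(-C0 / (Q * v))
p1 = math.exp(-C1 / (Q * v))
e, einv = math.exp(r / v), math.exp(-r / v)
k1 = (p1 - p0 * einv) / (e - einv)
k2 = p0 - k1

def F(x):
    phi = k1 * math.exp(r * x / v) + k2 * math.exp(-r * x / v)
    return -Q * v * math.log(phi)

deterministic = min(C0 + 0.5 * math.sqrt(L * Q), C1 + 0.5 * math.sqrt(L * Q))
print(F(0.0), F(1.0), F(0.5), deterministic)   # noise lowers the interior cost
```

The endpoints recover C₀ and C₁ exactly, and F(0.5) comes out below the deterministic value 2, as in the figure.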
Index

adjoint variable, 49
average-cost, 21
bang-bang control, 6
Bellman equation, 3
Brownian motion, 61
certainty equivalence, 43
chattering, 47
closed loop, 3
control variable, 2
controllability, 29
controllable, 29
controlled diffusion process, 61, 62
decomposable cost, 4
diffusion process, 61
discounted programming, 11
discounted-cost criterion, 9
discrete-time, 2
dynamic programming equation, 3
feedback, 3
finite actions, 15
fixed terminal time, 53
free terminal time, 53
gain matrix, 27
Gaussian white noise, 62
Hamilton-Jacobi-Bellman equation, 45
Hamiltonian, 49
individual optimality, 6
infinitesimal generator, 57
innovation process, 4
innovations, 42
interchange argument, 1
linear least squares estimate, 41
LQG model, 25
Markov decision process, 4
Markov dynamics, 4
Markov jump process, 58
Markov policy, 17
myopic policy, 16
negative programming, 11
observability, 37
observable, 37
one-step look-ahead rule, 18
open loop, 3
optimality equation, 3
optimization over time, 1
perfect state observation, 4
plant equation, 3
policy, 3
policy improvement, 23
policy improvement algorithm, 24
Pontryagin's maximum principle, 49
positive programming, 11
power, 63
principle of optimality, 2
r-controllable, 29
r-observable, 37
regulation, 25
Riccati equation, 27
separation principle, 43
social optimality, 6
stability matrix, 33
stabilizable, 33
state variable, 3
stationary Markov policy, 17
stochastic differential equation, 61
stopping problem, 18
successive approximation, 14
switching locus, 51
time horizon, 2
time to go, 5
time-homogeneous, 5, 1
transition intensity, 57
transversality conditions, 5
uniformization, 59
value iteration, 14
value iteration algorithm, 23
value iteration bounds, 23
white noise, 27
Wiener process,
