Min-Max Approximate Dynamic Programming



2011 IEEE International Symposium on Computer-Aided Control System Design (CACSD)
Part of 2011 IEEE Multi-Conference on Systems and Control, Denver, CO, USA. September 28-30, 2011

Min-Max Approximate Dynamic Programming

Brendan O'Donoghue, Yang Wang, and Stephen Boyd

Abstract: In this paper we describe an approximate dynamic programming policy for a discrete-time dynamical system perturbed by noise. The approximate value function is the pointwise supremum of a family of lower bounds on the value function of the stochastic control problem; evaluating the control policy involves the solution of a min-max or saddle-point problem. For a quadratically constrained linear quadratic control problem, evaluating the policy amounts to solving a semidefinite program at each time step. By evaluating the policy, we obtain a lower bound on the value function, which can be used to evaluate performance: when the lower bound and the achieved performance of the policy are close, we can conclude that the policy is nearly optimal. We describe several numerical examples where this is indeed the case.

I. INTRODUCTION

We consider an infinite horizon stochastic control problem with discounted objective and full state information. In the general case this problem is difficult to solve, but exact solutions can be found for certain special cases. When the state and action spaces are finite, for example, the problem is readily solved. Another case for which the problem can be solved exactly is when the state and action spaces are finite dimensional real vector spaces, the system dynamics are linear, the cost function is convex quadratic, and there are no constraints on the action or the state. In this case the optimal control policy is affine in the state variable, with coefficients that are readily computable [1], [2], [3], [4].

One general method for finding the optimal policy is to use dynamic programming (DP). DP represents the optimal policy in terms of an optimization problem involving the value function of the stochastic control problem [3], [4], [5]. However, due to the curse of dimensionality, even representing the value function can be intractable when the state or action spaces are infinite or, as a practical matter, when the number of states or actions is very large. Even when the value function can be represented, evaluating the optimal policy can still be intractable. As a result, approximate dynamic programming (ADP) has been developed as a general method for finding suboptimal control policies [6], [7], [8]. In ADP we substitute an approximate value function for the value function in the expression for the optimal policy. The goal is to choose the approximate value function (also known as a control-Lyapunov function) so that the performance of the resulting policy is close to optimal, or at least, good.

In this paper we develop a control policy which we call the min-max approximate dynamic programming policy. We first parameterize a family of lower bounds on the true value function; then we perform control, taking the pointwise supremum over this family as our approximate value function. The condition we use to parameterize our family of bounds is related to the linear programming approach to DP, which was first introduced in [9], and extended to approximate dynamic programming in [10], [11]. The basic idea is that any function which satisfies the Bellman inequality is a lower bound on the true value function [3], [4]. It was shown in [12] that a better lower bound can be attained via an iterated chain of Bellman inequalities, which we use here.
We relate this chain of inequalities to a forward look-ahead in time, in a similar sense to that of model predictive control (MPC) [13], [14]. Indeed, many types of MPC can be thought of as performing min-max ADP with a particular (generally affine) family of underestimator functions.

In cases where we have a finite number of states and inputs, evaluating our policy requires solving a linear program at every time step. For problems with an infinite number of states and inputs, the method requires the solution of a semi-infinite linear program, with a finite number of variables, but an infinite number of constraints (one for every state-control pair). For these problems we can obtain a tractable semidefinite program (SDP) approximation using methods such as the S-procedure [8], [12]. Evaluating our policy then requires solving an SDP at each time step [15], [16]. Much progress has been made in solving structured convex programs efficiently; see, e.g., [17], [18], [19], [20]. These fast optimization methods make our policies practical, even for large problems, or those requiring fast sampling rates.

II. STOCHASTIC CONTROL

Consider a discrete time-invariant dynamical system, with dynamics described by

    x_{t+1} = f(x_t, u_t, w_t),   t = 0, 1, ...,    (1)

where x_t ∈ X is the system state, u_t ∈ U is the control input or action, w_t ∈ W is an exogenous noise or disturbance, at time t, and f : X × U × W → X is the state transition function. The noise terms w_t are independent identically distributed (IID), with known distribution. The initial state x_0 is also random with known distribution, and is independent of w_t. We consider causal, time-invariant state feedback control policies

    u_t = φ(x_t),   t = 0, 1, ...,

where φ : X → U is the control policy or state feedback function. The stage cost is given by ℓ : X × U → R ∪ {+∞}, where the infinite values of ℓ encode constraints on the states and inputs: the state-action constraint set C ⊆ X × U is C = {(z, v) | ℓ(z, v) < ∞}. The problem is unconstrained if C = X × U.

The stochastic control problem is to choose φ in order to minimize the infinite horizon discounted cost

    J(φ) = E ∑_{t=0}^∞ γ^t ℓ(x_t, φ(x_t)),    (2)

where γ ∈ (0, 1) is a discount factor. The expectations are over the noise terms w_t, t = 0, 1, ..., and the initial state x_0. We assume here that the expectation and limits exist, which is the case under various technical assumptions [3], [4]. We denote by J⋆ the optimal value of the stochastic control problem, i.e., the infimum of J(φ) over all policies φ : X → U.

A. Dynamic programming

In this section we briefly review the dynamic programming characterization of the solution to the stochastic control problem. For more details, see [3], [4]. The value function of the stochastic control problem, V⋆ : X → R ∪ {+∞}, is given by

    V⋆(z) = inf_φ E ∑_{t=0}^∞ γ^t ℓ(x_t, φ(x_t)),

subject to the dynamics and x_0 = z; the infimum is over all policies φ : X → U, and the expectation is over w_t for t = 0, 1, .... The quantity V⋆(z) is the cost incurred by an optimal policy, when the system is started from state z. The optimal total discounted cost is given by J⋆ = E_{x_0} V⋆(x_0).

It can be shown that the value function is the unique fixed point of the Bellman equation

    V⋆(z) = inf_v ( ℓ(z, v) + γ E V⋆(f(z, v, w)) ),

for all z ∈ X. We can write the Bellman equation in the form

    V⋆ = T V⋆,    (3)

where we define the Bellman operator T as

    (T g)(z) = inf_v ( ℓ(z, v) + γ E g(f(z, v, w)) ),

for any g : X → R ∪ {+∞}. An optimal policy for the stochastic control problem is given by

    φ⋆(z) = argmin_v ( ℓ(z, v) + γ E V⋆(f(z, v, w)) ),    (4)

for all z ∈ X.

B. Approximate dynamic programming

In many cases of interest, it is intractable to compute or even represent the value function V⋆, let alone carry out the minimization required to evaluate the optimal policy (4). In such cases, a common alternative is to replace the value function with an approximate value function V̂ [6], [7], [8]. The resulting policy, given by

    φ̂(z) = argmin_v ( ℓ(z, v) + γ E V̂(f(z, v, w)) ),

for all z ∈ X, is called an approximate dynamic programming (ADP) policy. Clearly, when V̂ = V⋆, the ADP policy is optimal. The goal of approximate dynamic programming is to find a V̂ for which the ADP policy can be easily evaluated (for instance, by solving a convex optimization problem), and which also attains near-optimal performance.

III. MIN-MAX APPROXIMATE DYNAMIC PROGRAMMING

We consider a family of linearly parameterized candidate value functions V_α : X → R,

    V_α = ∑_{i=1}^K α_i V_i,

where α ∈ R^K is a vector of coefficients and V_i : X → R are fixed basis functions. Now suppose we have a set A ⊆ R^K for which

    V_α(z) ≤ V⋆(z),   for all z ∈ X, α ∈ A.

Thus {V_α | α ∈ A} is a parameterized family of underestimators of the value function. We will discuss how to obtain such a family later. For any α ∈ A we have

    V_α(z) ≤ sup_{α ∈ A} V_α(z) ≤ V⋆(z),   for all z ∈ X,

i.e., the pointwise supremum over the family of underestimators must be at least as good an approximation of V⋆ as any single function from the family. This suggests the ADP control policy

    φ(z) = argmin_v ( ℓ(z, v) + γ E sup_{α ∈ A} V_α(f(z, v, w)) ),

where we use sup_{α ∈ A} V_α as an approximate value function. Unfortunately, this policy may be difficult to evaluate, since evaluating the expectation of the supremum can be hard, even when evaluating E V_α(f(z, v, w)) for a particular α can be done. Our last step is to exchange expectation and supremum to obtain the min-max control policy

    φ_mm(z) = argmin_v sup_{α ∈ A} ( ℓ(z, v) + γ E V_α(f(z, v, w)) ),    (5)

for all z ∈ X. Computing this policy involves the solution of a min-max or saddle-point problem, which we will see is tractable in certain cases. One such case is where the function ℓ(z, v) + E V_α(f(z, v, w)) is convex in v for each z and α, and the set A is convex.

A. Bounds

The optimal value of the optimization problem in the min-max policy (5) is a lower bound on the value function at every state. To see this we note that

    inf_v sup_{α ∈ A} ( ℓ(z, v) + γ E V_α(f(z, v, w)) )
        ≤ inf_v ( ℓ(z, v) + γ E sup_{α ∈ A} V_α(f(z, v, w)) )
        ≤ inf_v ( ℓ(z, v) + γ E V⋆(f(z, v, w)) ) = (T V⋆)(z) = V⋆(z),

where the first inequality is due to Fatou's lemma [21], the second inequality follows from the monotonicity of expectation, and the equality comes from the fact that V⋆ is the unique fixed point of the Bellman operator.

Using the pointwise bounds, we can evaluate a lower bound on the optimal cost J⋆ via Monte Carlo simulation:

    J_lb = (1/N) ∑_{j=1}^N V_lb(z_j),

where z_1, ..., z_N are drawn from the same distribution as x_0, and V_lb(z_j) is the lower bound we get from evaluating the min-max policy at z_j. The performance of the min-max policy can also be evaluated using Monte Carlo simulation, and provides an upper bound J_ub on the optimal cost. Ignoring Monte Carlo error we have

    J_lb ≤ J⋆ ≤ J_ub.

These upper and lower bounds on the optimal value of the stochastic control problem are readily evaluated numerically, through simulation of the min-max control policy. When J_lb and J_ub are close, we can conclude that the min-max policy is almost optimal. We will use this technique to evaluate the performance of the min-max policy for our numerical examples.

B. Evaluating the min-max control policy

Evaluating the min-max control policy often requires exchanging the order of minimization and maximization. For any function f : R^p × R^q → R and sets W ⊆ R^p, Z ⊆ R^q, the max-min inequality states that

    sup_{z ∈ Z} inf_{w ∈ W} f(w, z) ≤ inf_{w ∈ W} sup_{z ∈ Z} f(w, z).    (6)

In the context of the min-max control policy, this means we can swap the order of minimization and maximization in (5) and maintain the lower bound property. To evaluate the policy, we solve the optimization problem

    maximize    inf_v ( ℓ(z, v) + γ E V_α(f(z, v, w)) )
    subject to  α ∈ A    (7)

with variable α. If A is a convex set, (7) is a convex optimization problem, since the objective is the infimum over a family of functions affine in α, and is therefore concave. In practice, solving (7) is often much easier than evaluating the min-max control policy directly. In addition, if there exists a saddle point, i.e., a pair (w⋆, z⋆) ∈ W × Z with

    f(w⋆, z) ≤ f(w⋆, z⋆) ≤ f(w, z⋆),   for all w ∈ W and z ∈ Z,

then we have the strong max-min property (or saddle-point property) and (6) holds with equality. In such cases the problems (5) and (7) are equivalent, and we can use Newton's method or duality considerations to solve (5) or (7) [16], [22].

IV. ITERATED BELLMAN INEQUALITIES

In this section we describe how to parameterize a family of underestimators of the true value function. The idea is based on the Bellman inequality [10], [8], [12], and results in a convex condition on the coefficients α that guarantees V_α ≤ V⋆.
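
For the finite state and input case mentioned in the Introduction, the description of A developed in this section reduces to one linear inequality in α per state-action pair, and evaluating the min-max policy via (7) is a linear program. The Python sketch below (using numpy and cvxpy) is our own illustration of that case, not code from the paper; the MDP data, basis functions, and all names are placeholder assumptions.

    # Minimal sketch: min-max ADP on a small finite MDP (illustrative, not from the paper).
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 6, 3, 0.9

    # Placeholder data: P[a][s, s'] is a transition probability, cost[s, a] a stage cost.
    P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
    cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

    # Basis functions V_1, ..., V_K are the columns of Phi; V_alpha = Phi @ alpha.
    K = 4
    Phi = np.hstack([np.ones((n_states, 1)), rng.standard_normal((n_states, K - 1))])

    alpha = cp.Variable(K)
    V = Phi @ alpha
    # Bellman inequality V_alpha <= T V_alpha: one linear constraint per state-action pair.
    # This is the convex description of the set A used in Sections III and IV.
    A_set = [V[s] <= cost[s, a] + gamma * (P[a][s] @ V)
             for s in range(n_states) for a in range(n_actions)]

    def min_max_action(s):
        """Evaluate the min-max policy at state s via the max-min problem (7) (an LP)."""
        q = cp.hstack([cost[s, a] + gamma * (P[a][s] @ V) for a in range(n_actions)])
        prob = cp.Problem(cp.Maximize(cp.min(q)), A_set)
        prob.solve()  # prob.value is a lower bound on the true value at state s (Section III-A)
        # take the action minimizing the state-action values at the maximizing alpha
        q_star = [cost[s, a] + gamma * P[a][s] @ (Phi @ alpha.value) for a in range(n_actions)]
        return int(np.argmin(q_star)), prob.value

    action, lower_bound = min_max_action(0)
    print("action:", action, "lower bound at state 0:", lower_bound)

This sketch uses the basic Bellman inequality of Section IV-A below to describe A; replacing it with the iterated chain of Section IV-B links M copies of the constraints and can only improve the bound.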

A. Basic Bellman inequality

The basic condition works as follows. Suppose we have a function V : X → R, which satisfies the Bellman inequality

    V ≤ T V.    (8)

Then by the monotonicity of the Bellman operator, we have

    V ≤ lim_{k→∞} T^k V = V⋆,

so any function that satisfies the Bellman inequality must be a value function underestimator. Applying this condition to V_α and expanding (8) we get

    V_α(z) ≤ inf_v ( ℓ(z, v) + γ E V_α(f(z, v, w)) ),   for all z ∈ X.

For each z, the left hand side is linear in α, and the right hand side is a concave function of α, since it is the infimum over a family of affine functions. Hence, the Bellman inequality leads to a convex constraint on α.

B. Iterated Bellman inequalities

We can obtain better (i.e., larger) lower bounds on the value function by considering an iterated form of the Bellman inequality [12]. Suppose we have a sequence of functions V_i : X → R, i = 0, ..., M, that satisfy a chain of Bellman inequalities

    V_0 ≤ T V_1,   V_1 ≤ T V_2,   ...,   V_{M−1} ≤ T V_M,    (9)

with V_M = V_{M−1}. Then, using similar arguments as before we can show V_0 ≤ V⋆. Restricting each function to lie in the same subspace,

    V_i = ∑_{j=1}^K α_{ij} V_j,

we see that the iterated chain of Bellman inequalities also results in a convex constraint on the coefficients α_{ij}. Hence the condition on α_{0j}, j = 1, ..., K, which parameterizes our underestimator V_0, is convex. It is easy to see that the iterated Bellman condition must give bounds that are at least as good as the basic Bellman inequality, since any function that satisfies (8) must be feasible for (9) [12].

V. BOX CONSTRAINED LINEAR QUADRATIC CONTROL

This section follows a similar example presented in [12]. We have X = R^n, U = R^m, with linear dynamics

    x_{t+1} = A x_t + B u_t + w_t,

where A ∈ R^{n×n} and B ∈ R^{n×m}. The noise has zero mean, E w_t = 0, and covariance E w_t w_t^T = W. Our bounds and policy will only depend on the first and second moments of w_t. The stage cost is given by

    ℓ(z, v) = { v^T R v + z^T Q z,   ‖v‖_∞ ≤ 1
              { +∞,                  ‖v‖_∞ > 1,

where R = R^T ≻ 0 and Q = Q^T ⪰ 0.

A. Iterated Bellman inequalities

We look for convex quadratic approximate value functions

    V_i(z) = z^T P_i z + 2 p_i^T z + s_i,   i = 0, ..., M,

where P_i = P_i^T ⪰ 0, p_i ∈ R^n, and s_i ∈ R are the coefficients of our linear parameterization. The iterated Bellman inequalities are

    V_{i−1}(z) ≤ ℓ(z, v) + γ E V_i(Az + Bv + w),

for all v, z ∈ R^n, i = 1, ..., M. Defining

    S_i = [ 0   0      0
            0   P_i    p_i
            0   p_i^T  s_i ],

    L = [ R   0   0
          0   Q   0
          0   0   0 ],

    G_i = [ B^T P_i B   B^T P_i A   B^T p_i
            A^T P_i B   A^T P_i A   A^T p_i
            p_i^T B     p_i^T A     Tr(P_i W) + s_i ],

for i = 0, ..., M, we can write the Bellman inequalities as a quadratic form in (v, z, 1):

    (v, z, 1)^T ( L + γ G_i − S_{i−1} ) (v, z, 1) ≥ 0,    (10)

for all z ∈ R^n and all v with ‖v‖_∞ ≤ 1, i = 1, ..., M. We will obtain a tractable sufficient condition for this using the S-procedure [15], [12]. The input constraint can be written in terms of the quadratic inequalities

    v^T e_i e_i^T v ≤ 1,   i = 1, ..., m,

where e_i denotes the ith unit vector. Using the S-procedure, a sufficient condition for (10) is the existence of diagonal matrices D_i ⪰ 0, i = 1, ..., M, for which

    L + γ G_i − S_{i−1} + Λ_i ⪰ 0,   i = 1, ..., M,    (11)

where

    Λ_i = [ D_i   0   0
            0     0   0
            0     0   −Tr D_i ].    (12)

Finally we have the terminal constraint, S_M = S_{M−1}.
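
As an illustration of how the conditions above translate into code, the following cvxpy sketch builds L, G_i, S_i, and Λ_i, imposes the LMIs (11) together with the terminal constraint S_M = S_{M−1}, and maximizes the pointwise lower bound V_0(z_0) at a fixed state z_0. It is our own sketch, not code from the paper: the problem data are random placeholders, and the sign conventions in Λ_i follow the reconstruction of (12) given above. (The policy itself is obtained in Section V-B below by instead maximizing the inner infimum over v.)

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    n, m, M, gamma = 4, 2, 3, 0.9

    # Illustrative problem data (not the instance from Section V-D).
    A = rng.standard_normal((n, n)); A *= 0.95 / max(abs(np.linalg.eigvals(A)))
    B = rng.standard_normal((n, m))
    Q, R, W = np.eye(n), np.eye(m), 0.1 * np.eye(n)

    # L in the (v, z, 1) ordering used above.
    L = np.block([[R, np.zeros((m, n)), np.zeros((m, 1))],
                  [np.zeros((n, m)), Q, np.zeros((n, 1))],
                  [np.zeros((1, m)), np.zeros((1, n)), np.zeros((1, 1))]])

    P = [cp.Variable((n, n), PSD=True) for _ in range(M + 1)]  # V_i(z) = z'P_i z + 2p_i'z + s_i
    p = [cp.Variable((n, 1)) for _ in range(M + 1)]
    s = [cp.Variable() for _ in range(M + 1)]
    D = {i: cp.Variable(m, nonneg=True) for i in range(1, M + 1)}  # S-procedure multipliers, D_i >= 0

    def S(i):   # quadratic form of V_i(z) in (v, z, 1)
        return cp.bmat([[np.zeros((m, m)), np.zeros((m, n)), np.zeros((m, 1))],
                        [np.zeros((n, m)), P[i], p[i]],
                        [np.zeros((1, m)), p[i].T, cp.reshape(s[i], (1, 1))]])

    def G(i):   # quadratic form of E V_i(Az + Bv + w) in (v, z, 1)
        return cp.bmat([[B.T @ P[i] @ B, B.T @ P[i] @ A, B.T @ p[i]],
                        [A.T @ P[i] @ B, A.T @ P[i] @ A, A.T @ p[i]],
                        [p[i].T @ B, p[i].T @ A, cp.reshape(cp.trace(P[i] @ W) + s[i], (1, 1))]])

    def Lam(i):  # S-procedure term (12) for the box |v_j| <= 1
        return cp.bmat([[cp.diag(D[i]), np.zeros((m, n)), np.zeros((m, 1))],
                        [np.zeros((n, m)), np.zeros((n, n)), np.zeros((n, 1))],
                        [np.zeros((1, m)), np.zeros((1, n)), -cp.reshape(cp.sum(D[i]), (1, 1))]])

    constraints = []
    for i in range(1, M + 1):
        Mi = L + gamma * G(i) - S(i - 1) + Lam(i)
        constraints.append(0.5 * (Mi + Mi.T) >> 0)   # LMI (11), symmetrized for the solver
    constraints += [P[M] == P[M - 1], p[M] == p[M - 1], s[M] == s[M - 1]]  # S_M = S_{M-1}

    # Maximize the pointwise lower bound V_0(z0) <= V*(z0) at a fixed state z0.
    z0 = rng.standard_normal(n)
    obj = cp.Maximize(cp.quad_form(z0, P[0]) + 2 * cp.sum(cp.multiply(p[0][:, 0], z0)) + s[0])
    prob = cp.Problem(obj, constraints)
    prob.solve(solver=cp.SCS)   # any SDP-capable solver works here
    print("lower bound on the optimal cost-to-go from z0:", prob.value)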

B. Min-max control policy

For this problem, it is easy to show that the strong max-min property holds, and therefore problems (5) and (7) are equivalent. To evaluate the min-max control policy we solve problem (7), which we can write as

    maximize    inf_v ( ℓ(z, v) + γ E V_0(Az + Bv + w) )
    subject to  (11),  S_M = S_{M−1},  P_0 ⪰ 0,  P_i ⪰ 0,  D_i ⪰ 0,  i = 1, ..., M,

with variables P_i, p_i, s_i, i = 0, ..., M, and diagonal D_i, i = 1, ..., M. We will convert this max-min problem to a max-max problem by forming the dual of the minimization part. Introducing a diagonal matrix D_0 ⪰ 0 as the dual variable for the box constraints, we obtain the dual function

    inf_v (v, z, 1)^T ( L + γ G_0 + Λ_0 ) (v, z, 1),

where Λ_0 has the form given in (12). We can minimize over v analytically. If we block out the matrix L + γ G_0 + Λ_0 as

    L + γ G_0 + Λ_0 = [ M_11    M_12
                        M_12^T  M_22 ],    (13)

where M_11 ∈ R^{m×m}, then

    v = −M_11^{-1} M_12 (z, 1).

Thus our problem becomes

    maximize    (z, 1)^T ( M_22 − M_12^T M_11^{-1} M_12 ) (z, 1)
    subject to  (11),  S_M = S_{M−1},  P_i ⪰ 0,  D_i ⪰ 0,  i = 0, ..., M,

which is a convex optimization problem in the variables P_i, p_i, s_i, D_i, i = 0, ..., M, and can be solved as an SDP. To implement the min-max control policy, at each time t, we solve the above problem with z = x_t, and let

    u_t = −M_11^{⋆ -1} M_12^⋆ (x_t, 1),

where M_11^⋆ and M_12^⋆ denote the matrices M_11 and M_12, computed from the optimal P_0^⋆, p_0^⋆, s_0^⋆, D_0^⋆.

C. Interpretations

We can easily verify that the dual of the above optimization problem is a variant of model predictive control that uses both the first and second moments of the state. In this context, the number of iterations, M, is the length of the prediction horizon, and we can interpret our lower bound as a finite horizon approximation to an infinite horizon problem, which underestimates the optimal infinite horizon cost. The S-procedure relaxation also has a natural interpretation: in [23], the author obtains similar LMIs by relaxing almost sure constraints into constraints that are only required to hold in expectation.

D. Numerical instance

We consider a numerical example with n = 8, m = 3, and γ = 0.9. The parameters Q, R, A, and B are randomly generated; we scale A so that max_i |λ_i(A)| = 1, i.e., so that the system is marginally stable. The initial state x_0 is Gaussian, with zero mean. Table I shows the performance of the min-max policy and certainty equivalent MPC, both with horizons of M = 5 steps, as well as the lower bound on the optimal cost. In this case, both the min-max policy and MPC are within a few percent of optimal, modulo Monte Carlo error.

    Policy / Bound     Value
    MPC policy         1.347
    Min-max policy     1.345
    Lower bound        1.307

    TABLE I: Performance comparison, box constrained example.
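
The closed-form inner minimization used in Section V-B above can be made concrete with a short numpy sketch (again our own illustration, not code from the paper): given the matrix L + γ G_0 + Λ_0 assembled from the optimal P_0^⋆, p_0^⋆, s_0^⋆, D_0^⋆ returned by an SDP solver, the min-max control is affine in the state, per the partition (13).

    import numpy as np

    def minmax_lq_control(N, x, m):
        """Return u = -M11^{-1} M12 (x, 1), the minimizer over v of (v, x, 1)' N (v, x, 1).

        N is L + gamma*G_0 + Lambda_0, of size (m+n+1) x (m+n+1), built from the optimal
        SDP variables; the v-v block M11 is assumed positive definite.
        """
        M11 = N[:m, :m]
        M12 = N[:m, m:]
        y = np.concatenate([x, [1.0]])
        return -np.linalg.solve(M11, M12 @ y)

    # The optimal value of this inner minimization, (x, 1)' (M22 - M12' M11^{-1} M12) (x, 1),
    # is exactly the quantity maximized over the SDP variables in Section V-B, and is a
    # lower bound on the optimal cost-to-go from x.

The portfolio example in the next section uses the same pattern, with the inner minimization carried out over the stacked variable (v, y).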

VI. DYNAMIC PORTFOLIO OPTIMIZATION

In this example, we manage a portfolio of n assets over time. Our state x_t ∈ R^n is the vector of dollar values of the assets, at the beginning of investment period t. Our action u_t ∈ R^n represents buying or selling assets at the beginning of each investment period: (u_t)_i > 0 means we are buying asset i, for dollar value (u_t)_i, and (u_t)_i < 0 means we sell asset i. The post-trade portfolio is then given by x_t + u_t, and the total gross cash put in is 1^T u_t, where 1 is the vector with all components one. The portfolio propagates over time (i.e., over the investment period) according to

    x_{t+1} = A_t (x_t + u_t),

where A_t = diag(ρ_t), and (ρ_t)_i is the total return of asset i in investment period t. The return vectors ρ_t are IID, with first and second moments

    E ρ_t = μ,    E ρ_t ρ_t^T = Σ.

Here too, our bounds and policy will only depend on the first and second moments of ρ_t. We let Σ̂ = Σ − μμ^T denote the return covariance.

We now describe the constraints and objective. We constrain the risk of our post-trade portfolio, which we quantify as the portfolio return variance over the period:

    (x_t + u_t)^T Σ̂ (x_t + u_t) ≤ l,

where l ≥ 0 is the maximum variance risk allowed. Our action (buying and selling) u_t incurs a transaction cost with an absolute value and a quadratic component, given by

    κ^T |u| + u^T R u,

where κ ∈ R^n_+ is the vector of linear transaction cost rates, |u| means elementwise absolute value, and R ∈ R^{n×n}, which is diagonal with positive entries, represents the quadratic transaction cost coefficients. Linear transaction costs model effects such as crossing the bid-ask spread, while quadratic transaction costs model effects such as price impact. Thus at time t, we put into our portfolio the net cash amount

    g(u_t) = 1^T u_t + κ^T |u_t| + u_t^T R u_t.

When this is negative, it represents revenue. The first term is the gross cash in from purchases and sales; the second and third terms are the transaction fees. The stage cost, including the risk limit, is

    ℓ(z, v) = { g(v),   (z + v)^T Σ̂ (z + v) ≤ l
              { +∞,     otherwise.

Our goal is to minimize the discounted cost or, equivalently, to maximize the discounted revenue. In this example, the discount factor has a natural interpretation as reflecting the time value of money.

A. Iterated Bellman inequalities

We incorporate another variable, y ∈ R^n, to remove the absolute value term from the stage cost function, and add the constraints

    −(y_t)_i ≤ (v_t)_i ≤ (y_t)_i,   i = 1, ..., n.    (14)

We define the stage cost with these new variables to be

    ℓ(z, v, y) = 1^T v + κ^T y + (1/2) v^T R v + (1/2) y^T R y

for (z, v, y) that satisfy

    −y ≤ v ≤ y,    (z + v)^T Σ̂ (z + v) ≤ l,    (15)

and +∞ otherwise. Here, the first set of inequalities is interpreted elementwise. We look for convex quadratic candidate value functions, i.e.,

    V_i(z) = z^T P_i z + 2 p_i^T z + s_i,   i = 0, ..., M,

where P_i ⪰ 0, p_i ∈ R^n, s_i ∈ R are the coefficients of our linear parameterization. Defining

    L = [ R/2     0       0   1/2
          0       R/2     0   κ/2
          0       0       0   0
          1^T/2   κ^T/2   0   0 ],

    G_i = [ P_i∘Σ        0   P_i∘Σ        p_i∘μ
            0            0   0            0
            P_i∘Σ        0   P_i∘Σ        p_i∘μ
            (p_i∘μ)^T    0   (p_i∘μ)^T    s_i ],

    S_i = [ 0   0   0      0
            0   0   0      0
            0   0   P_i    p_i
            0   0   p_i^T  s_i ],    i = 0, ..., M,

where ∘ denotes the Hadamard product, we can write the iterated Bellman inequalities as

    (v, y, z, 1)^T ( L + γ G_i − S_{i−1} ) (v, y, z, 1) ≥ 0,

for all (v, y, z) that satisfy (15), for i = 1, ..., M. A tractable sufficient condition for the Bellman inequalities is (by the S-procedure) the existence of λ_i ≥ 0, ν_i ∈ R^n_+, τ_i ∈ R^n_+, i = 1, ..., M, such that

    L + γ G_i − S_{i−1} + Λ_i ⪰ 0,   i = 1, ..., M,    (16)

where

    Λ_i = [ λ_i Σ̂           0                λ_i Σ̂   ν_i − τ_i
            0                0                0        −(ν_i + τ_i)
            λ_i Σ̂           0                λ_i Σ̂   0
            (ν_i − τ_i)^T    −(ν_i + τ_i)^T   0        −λ_i l ].    (17)

Lastly we have the terminal constraint, S_M = S_{M−1}.

B. Min-max control policy

The discussion here follows almost exactly the one presented for the previous example. It is easy to show that in this case we have the strong max-min property. At each step, we solve problem (7) by converting the max-min problem into a max-max problem, using Lagrangian duality. We can write problem (7) as

    maximize    inf_{v, y} ( ℓ(z, v, y) + γ E_ρ V_0(A(z + v)) )
    subject to  (16),  S_M = S_{M−1},  P_i ⪰ 0,  i = 0, ..., M,

with variables P_i, p_i, s_i, i = 0, ..., M, and λ_i ∈ R_+, ν_i ∈ R^n_+, τ_i ∈ R^n_+, i = 1, ..., M. Next, we derive the dual function of the minimization part. We introduce variables λ_0 ∈ R_+, ν_0 ∈ R^n_+, τ_0 ∈ R^n_+, which are dual variables corresponding to the constraints (15) and (14). The dual function is given by

    inf_{v, y} (v, y, z, 1)^T ( L + γ G_0 + Λ_0 ) (v, y, z, 1),

where Λ_0 has the form given in (17). If we define

    L + γ G_0 + Λ_0 = [ M_11    M_12
                        M_12^T  M_22 ],    (18)

where M_11 ∈ R^{2n×2n}, then the minimizer of the Lagrangian is

    (v, y) = −M_11^{-1} M_12 (z, 1).

Thus our problem becomes

    maximize    (z, 1)^T ( M_22 − M_12^T M_11^{-1} M_12 ) (z, 1)
    subject to  (16),  S_M = S_{M−1},  P_i ⪰ 0,  i = 0, ..., M,

which is convex in the variables P_i, p_i, s_i, i = 0, ..., M, and λ_i ∈ R_+, ν_i ∈ R^n_+, τ_i ∈ R^n_+, i = 0, ..., M. To implement the policy, at each time t we solve the above optimization problem as an SDP with z = x_t, and let

    u_t = −[ I   0 ] M_11^{⋆ -1} M_12^⋆ (x_t, 1),

where M_11^⋆ and M_12^⋆ denote the matrices M_11 and M_12, computed from P_0^⋆, p_0^⋆, s_0^⋆, λ_0^⋆, ν_0^⋆, τ_0^⋆.

C. Numerical instance

We consider a numerical example with n = 8 assets and γ = 0.96. The initial portfolio x_0 is Gaussian, with zero mean. The returns follow a log-normal distribution, i.e., log ρ_t ∼ N(μ̄, Σ̄). The parameters μ and Σ are given by

    μ_i = exp( μ̄_i + Σ̄_ii / 2 ),    Σ_ij = μ_i μ_j exp( Σ̄_ij ).

Table II compares the performance of the min-max policy for M = 40 and M = 5, and certainty equivalent model predictive control with horizon T = 40, over 50 simulations each consisting of 50 time steps. We can see that the min-max policy significantly outperforms MPC, which actually makes a loss on average, since its average cost is positive. The cost achieved by the min-max policy is close to the lower bound, which shows that both the policy and the bound are nearly optimal. In fact, the gap is small even for M = 5, which corresponds to a relatively myopic policy.

    Bound / Policy            Value
    MPC policy, T = 40          25.4
    Min-max policy, M = 5     −224.1
    Min-max policy, M = 40    −225.1
    Lower bound, M = 40       −239.9
    Lower bound, M = 5        −242.0

    TABLE II: Performance comparison, portfolio example.

VII. CONCLUSIONS

In this paper we introduce a control policy which we refer to as min-max approximate dynamic programming. Evaluating this policy at each time step requires the solution of a min-max or saddle-point problem; in addition, we obtain a lower bound on the value function, which can be used to estimate (via Monte Carlo simulation) the optimal value of the stochastic control problem. We demonstrate the method with two examples, where the policy can be evaluated by solving a convex optimization problem at each time step. In both examples the lower bound and the achieved performance are very close, certifying that the min-max policy is very close to optimal.

REFERENCES

[1] R. Kalman, "When is a linear control system optimal?" Journal of Basic Engineering, vol. 86, no. 1, pp. 1-10, 1964.
[2] S. Boyd and C. Barratt, Linear Controller Design: Limits of Performance. Prentice-Hall, 1991.
[3] D. Bertsekas, Dynamic Programming and Optimal Control: Volume 1. Athena Scientific, 2005.
[4] D. Bertsekas, Dynamic Programming and Optimal Control: Volume 2. Athena Scientific, 2007.
[5] D. Bertsekas and S. Shreve, Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, 1996.
[6] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming, 1st ed. Athena Scientific, 1996.
[7] W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, Inc., 2007.
[8] Y. Wang and S. Boyd, "Performance bounds for linear stochastic control," Systems & Control Letters, vol. 58, no. 3, pp. 178-182, Mar. 2009.
[9] A. Manne, "Linear programming and sequential decisions," Management Science, vol. 6, no. 3, pp. 259-267, 1960.
[10] D. De Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Operations Research, vol. 51, no. 6, pp. 850-865, 2003.

[11] P. Schweitzer and A. Seidmann, "Generalized polynomial approximations in Markovian decision processes," Journal of Mathematical Analysis and Applications, vol. 110, no. 2, pp. 568-582, 1985.
[12] Y. Wang and S. Boyd, "Approximate dynamic programming via iterated Bellman inequalities," 2010, manuscript.
[13] C. Garcia, D. Prett, and M. Morari, "Model predictive control: theory and practice," Automatica, vol. 25, no. 3, pp. 335-348, 1989.
[14] J. Maciejowski, Predictive Control with Constraints. Prentice-Hall, 2002.
[15] S. Boyd, L. E. Ghaoui, E. Feron, and V. Balakrishnan, Linear Matrix Inequalities in System and Control Theory. Society for Industrial Mathematics, 1994.
[16] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, Sept. 2004.
[17] Y. Wang and S. Boyd, "Fast model predictive control using online optimization," IEEE Transactions on Control Systems Technology, vol. 18, pp. 267-278, 2010.
[18] Y. Wang and S. Boyd, "Fast evaluation of quadratic control-Lyapunov policy," IEEE Transactions on Control Systems Technology, pp. 1-8, 2010.
[19] J. Mattingley, Y. Wang, and S. Boyd, "Code generation for receding horizon control," in IEEE Multi-Conference on Systems and Control, 2010, pp. 985-992.
[20] J. Mattingley and S. Boyd, "CVXGEN: A code generator for embedded convex optimization," 2010, manuscript.
[21] D. Cohn, Measure Theory. Birkhäuser, 1997.
[22] A. Ben-Tal, L. E. Ghaoui, and A. Nemirovski, Robust Optimization. Princeton University Press, 2009.
[23] A. Gattami, "Generalized linear quadratic control theory," in Proceedings of the 45th IEEE Conference on Decision and Control, 2006, pp. 50-54.