Finitely Additive Dynamic Programming and Stochastic Games

Bill Sudderth
University of Minnesota
Discounted Dynamic Programming

Five ingredients: S, A, r, q, β.

S - state space
A - set of actions
q(· | s, a) - law of motion
r(s, a) - daily reward function (bounded, real-valued)
β ∈ [0, 1) - discount factor
Play of the game

You begin at some state s_1 ∈ S, select an action a_1 ∈ A, and receive a reward r(s_1, a_1). You then move to a new state s_2 with distribution q(· | s_1, a_1), select a_2 ∈ A, and receive β r(s_2, a_2). Then you move to s_3 with distribution q(· | s_2, a_2), select a_3 ∈ A, and receive β^2 r(s_3, a_3). And so on.

Your total reward is the expected value of ∑_{n=1}^∞ β^{n-1} r(s_n, a_n).
Strategies and Rewards

A strategy σ selects each action a_n, possibly at random, as a function of the history (s_1, a_1, ..., a_{n-1}, s_n). The reward from σ at the initial state s_1 = s is

    V(σ)(s) = E_{σ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ].

Given s_1 = s and a_1 = a, the conditional strategy σ[s, a] is just the continuation of σ, and

    V(σ)(s) = ∫ [ r(s, a) + β ∫ V(σ[s, a])(t) q(dt | s, a) ] σ(s)(da).
The Optimal Reward and the Bellman Equation

The optimal reward at s is V(s) = sup_σ V(σ)(s).

The Bellman equation for V is

    V(s) = sup_a [ r(s, a) + β ∫ V(t) q(dt | s, a) ].

I will sketch the proof for S and A countable.
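For S and A finite, the Bellman equation can be checked numerically by iterating its right-hand side (value iteration). The toy problem below is my own illustration, not from the lecture; the states, rewards, and transition probabilities are invented.

```python
# Toy check of the Bellman equation by value iteration on an invented
# 2-state, 2-action discounted problem.

S = [0, 1]
A = [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
q = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
beta = 0.9

def bellman(V):
    """One application of V -> sup_a [ r(s, a) + beta * E_q V ]."""
    return [max(r[s, a] + beta * sum(p * V[t] for t, p in enumerate(q[s, a]))
                for a in A) for s in S]

V = [0.0, 0.0]
for _ in range(2000):       # iterate to the (numerical) fixed point
    V = bellman(V)

# V now satisfies the Bellman equation up to rounding:
assert all(abs(V[s] - bellman(V)[s]) < 1e-8 for s in S)
```

With β = 0.9 the iterates converge geometrically, so a couple of thousand iterations is far more than enough.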
Proof of ≤:

For every strategy σ and s ∈ S,

    V(σ)(s) = ∫ [ r(s, a) + β ∫ V(σ[s, a])(t) q(dt | s, a) ] σ(s)(da)
            ≤ sup_{a'} [ r(s, a') + β ∫ V(σ[s, a'])(t) q(dt | s, a') ]
            ≤ sup_{a'} [ r(s, a') + β ∫ V(t) q(dt | s, a') ].

Now take the sup over σ.
Proof of ≥:

Fix ε > 0. For every state t ∈ S, select a strategy σ_t such that V(σ_t)(t) ≥ V(t) − ε/2. Fix a state s and choose an action a such that

    r(s, a) + β ∫ V(t) q(dt | s, a) ≥ sup_{a'} [ r(s, a') + β ∫ V(t) q(dt | s, a') ] − ε/2.

Define the strategy σ at s_1 = s to have first action a and conditional strategies σ[s, a](t) = σ_t. Then

    V(s) ≥ V(σ)(s) = r(s, a) + β ∫ V(σ_t)(t) q(dt | s, a)
         ≥ sup_{a'} [ r(s, a') + β ∫ V(t) q(dt | s, a') ] − ε.
Measurable Dynamic Programming

The first formulation of dynamic programming in a general measure-theoretic setting was given by Blackwell (1965). He assumed:

1. S and A are Borel subsets of a Polish space (say, a Euclidean space).
2. The reward function r(s, a) is Borel measurable.
3. The law of motion q(· | s, a) is a regular conditional distribution.

Strategies are required to select actions in a Borel measurable way.
Measurability Problems

In his 1965 paper, Blackwell showed by example that for a Borel measurable dynamic programming problem:

The optimal reward function V(·) need not be Borel measurable, and good Borel measurable strategies need not exist.

This led to nontrivial work by a number of mathematicians including R. Strauch, D. Freedman, M. Orkin, D. Bertsekas, S. Shreve, and Blackwell himself. It follows from their work that for a Borel problem:

The optimal reward function V(·) is universally measurable, and there do exist good universally measurable strategies.
The Bellman Equation Again

The equation still holds, but a proof requires a lot of measure theory. See, for example, Chapter 7 of Bertsekas and Shreve (1978) - about 85 pages. Some additional results are needed to measurably select the σ_t in the proof of ≥. See Feinberg (1996).

The proof works exactly as given in a finitely additive setting, and it works for general sets S and A.
Finitely Additive Probability

Let γ be a finitely additive probability defined on a sigma-field of subsets of some set F. The integral ∫ φ dγ of a simple function φ is defined in the usual way. The integral ∫ ψ dγ of a bounded, measurable function ψ is defined by squeezing with simple functions.

If γ is defined on the sigma-field of all subsets of F, it is called a gamble, and ∫ ψ dγ is defined for all bounded, real-valued functions ψ.
Finitely Additive Stochastic Processes

Let G(F) be the set of all gambles on F. A Dubins-Savage strategy (DS-strategy for brevity) γ is a sequence γ_1, γ_2, ... such that γ_1 ∈ G(F) and, for n ≥ 2, γ_n is a mapping from F^{n-1} to G(F).

Every DS-strategy γ naturally determines a finitely additive probability P_γ on the product sigma-field of F^N. P_γ is regarded as the distribution of a random sequence f_1, f_2, ..., f_n, .... Here f_1 has distribution γ_1 and, given f_1, f_2, ..., f_{n-1}, the conditional distribution of f_n is γ_n(f_1, f_2, ..., f_{n-1}).
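A gamble on an infinite set is in general not countably additive and cannot be simulated, but when each γ_n happens to be an ordinary distribution on a finite set F, P_γ is the usual law of a coordinate-by-coordinate construction. A minimal sketch (my own illustration; the two-point set F and the "repeat the last coordinate" rule are arbitrary choices):

```python
import random

# Sketch: sampling from a DS-strategy whose gambles are ordinary
# countably additive distributions on a finite set F.

F = [0, 1]  # a two-point "fortune" space, chosen only for illustration

def gamma(history):
    """gamma_1 when history is empty; gamma_n(f_1, ..., f_{n-1}) otherwise.
    Here: favor repeating the last coordinate (an arbitrary rule)."""
    if not history:
        return [0.5, 0.5]
    p = 0.8
    last = history[-1]
    return [p if f == last else 1 - p for f in F]

def sample_path(n, rng):
    """Draw f_1, ..., f_n coordinate by coordinate, as in the definition
    of P_gamma: each f_k is drawn from gamma_k given the past."""
    path = []
    for _ in range(n):
        weights = gamma(path)
        path.append(rng.choices(F, weights=weights)[0])
    return path

rng = random.Random(0)
print(sample_path(10, rng))
```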
Finitely Additive Dynamic Programming

For each (s, a), q(· | s, a) is a gamble on S. A strategy σ chooses actions using gambles on A. Each σ together with q and an initial state s_1 = s determines a DS-strategy γ = γ(s, σ) on (A × S)^N. For D ⊆ A × S,

    γ_1(D) = ∫ q(D_a | s, a) σ_1(da)

and

    γ_n(a_1, s_2, ..., a_{n-1}, s_n)(D) = ∫ q(D_a | s_n, a) σ(s_1, a_1, s_2, ..., a_{n-1}, s_n)(da),

where D_a = {t ∈ S : (a, t) ∈ D}. Let

    P_{σ,s} = P_γ.
Rewards and the Bellman Equation

For any bounded, real-valued reward function r, the reward for a strategy σ is well-defined by the same formula as before:

    V(σ)(s) = E_{σ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ].

Also as before, the optimal reward function is V(s) = sup_σ V(σ)(s). The Bellman equation

    V(s) = sup_a [ r(s, a) + β ∫ V(t) q(dt | s, a) ]

can be proved exactly as in the discrete case.
Blackwell Operators

Let B be the Banach space of bounded functions x : S → R equipped with the supremum norm. For each function f : S → A, define the operator T_f for elements x ∈ B by

    (T_f x)(s) = r(s, f(s)) + β ∫ x(s') q(ds' | s, f(s)).

Also define the operator T by

    (T x)(s) = sup_a [ r(s, a) + β ∫ x(s') q(ds' | s, a) ].

This definition of T makes sense in the finitely additive case, and in the countably additive case when S is countable. There is trouble in the general measurable case.
Fixed Points

The operators T_f and T are β-contractions. By a theorem of Banach, they have unique fixed points.

The fixed point of T is the optimal reward function V. The equality V(s) = (T V)(s) is just the Bellman equation

    V(s) = sup_a [ r(s, a) + β ∫ V(t) q(dt | s, a) ].
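The contraction property rests on the elementary bound |sup_a f(a) − sup_a g(a)| ≤ sup_a |f(a) − g(a)|, which gives ||Tx − Ty|| ≤ β ||x − y||. It is easy to confirm numerically on a finite toy problem (my example; the rewards and transitions below are arbitrary):

```python
import random

# Numerical sanity check: T is a beta-contraction in the sup norm
# on an invented 2-state, 2-action problem.

S = [0, 1]
A = [0, 1]
r = {(s, a): float(s + a) for s in S for a in A}   # arbitrary bounded rewards
q = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
beta = 0.9

def T(x):
    return [max(r[s, a] + beta * sum(p * x[t] for t, p in enumerate(q[s, a]))
                for a in A) for s in S]

rng = random.Random(1)
for _ in range(100):
    x = [rng.uniform(-10.0, 10.0) for _ in S]
    y = [rng.uniform(-10.0, 10.0) for _ in S]
    lhs = max(abs(u - v) for u, v in zip(T(x), T(y)))   # ||Tx - Ty||
    rhs = beta * max(abs(u - v) for u, v in zip(x, y))  # beta * ||x - y||
    assert lhs <= rhs + 1e-12
```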
Stationary Strategies

A strategy σ is stationary if there is a function f : S → A such that σ(s_1, a_1, ..., a_{n-1}, s_n) = f(s_n) for all histories (s_1, a_1, ..., a_{n-1}, s_n). Notation: σ = f.

The fixed point of T_f is the reward function V(σ)(·) for the stationary strategy σ = f:

    V(σ)(s) = r(s, f(s)) + β ∫ V(σ)(t) q(dt | s, f(s)) = (T_f V(σ))(s).

Fundamental Question: Do optimal or nearly optimal stationary strategies exist?
Existence of Good Stationary Strategies

Fix ε > 0. For each s, choose f(s) such that

    (T_f V)(s) ≥ V(s) − ε(1 − β).

Let σ = f. An easy induction shows that

    (T_f^n V)(s) ≥ V(s) − ε, for all s and n.

But, by Banach's theorem, (T_f^n V)(s) → V(σ)(s). So the stationary plan σ is ε-optimal.
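On a finite toy problem (my own example, with made-up data) the construction can be traced directly: the sup in the Bellman equation is attained, so the greedy choice of f satisfies the displayed inequality for every ε > 0, and iterating T_f recovers its reward.

```python
# Check the epsilon-optimal stationary strategy construction on an
# invented 2-state, 2-action problem.

S = [0, 1]
A = [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
q = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
beta = 0.9

def T(x):            # Bellman operator
    return [max(r[s, a] + beta * sum(p * x[t] for t, p in enumerate(q[s, a]))
                for a in A) for s in S]

def T_f(f, x):       # operator for the stationary strategy f
    return [r[s, f[s]] + beta * sum(p * x[t] for t, p in enumerate(q[s, f[s]]))
            for s in S]

V = [0.0, 0.0]
for _ in range(3000):
    V = T(V)         # V is (numerically) the optimal reward

# Greedy selection: here (T_f V)(s) = (T V)(s) = V(s), so any eps > 0 works.
f = [max(A, key=lambda a, s=s: r[s, a] +
         beta * sum(p * V[t] for t, p in enumerate(q[s, a]))) for s in S]

Vf = [0.0, 0.0]
for _ in range(3000):
    Vf = T_f(f, Vf)  # Banach iteration converges to the reward of f

assert all(Vf[s] >= V[s] - 1e-6 for s in S)   # f is (numerically) optimal
```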
The Measurable Case: Trouble for T

T does not preserve Borel measurability.
T does not preserve universal measurability.
T does preserve upper semianalytic functions, but these do not form a Banach space.

Good universally measurable stationary strategies do exist, but the proof is more complicated.
Finitely Additive Extensions of Measurable Problems

Every probability measure on an algebra of subsets of a set F can be extended to a gamble on F, that is, a finitely additive probability defined on all subsets of F. (The extension is typically not unique.)

Thus a measurable, discounted problem S, A, r, q, β can be extended to a finitely additive problem S, A, r, q̂, β where q̂(· | s, a) is a gamble on S that extends q(· | s, a) for every (s, a).

Questions: Is the optimal reward the same for both problems? Can a player do better by using non-measurable strategies?
Reward Functions for Measurable and for Finitely Additive Plans

For a measurable strategy σ, the reward

    V_M(σ)(s) = E_{σ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ]

is the expectation under the countably additive probability P_{σ,s}. Each measurable σ can be extended to a finitely additive strategy σ̂ with reward

    V(σ̂)(s) = E_{σ̂,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ]

calculated under the finitely additive probability P_{σ̂,s}.

Fact: V_M(σ)(s) = V(σ̂)(s).
Optimal Rewards

For a measurable problem, let

    V_M(s) = sup V_M(σ)(s),

where the sup is over all measurable strategies σ, and let

    V(s) = sup V(σ)(s),

where the sup is over all plans σ in some finitely additive extension.
Theorem: V_M(s) = V(s).

Proof: The Bellman equation is known to hold in the measurable theory:

    V_M(s) = sup_a [ r(s, a) + β ∫ V_M(t) q(dt | s, a) ].

In other terms, V_M(s) = (T V_M)(s). But V is the unique fixed point of T.
Two-Person Zero-Sum Discounted Stochastic Game (Finitely Additive)

Six ingredients: S, A, B, r, q, β.

S - state space
A, B - action sets for players 1 and 2
q(· | s, a, b) - law of motion (finitely additive)
r(s, a, b) - daily payoff from player 2 to player 1 (bounded, real-valued)
β ∈ [0, 1) - discount factor
Play of the game

Play begins at some state s_1 ∈ S, player 1 chooses a_1 ∈ A using a gamble on A, player 2 chooses b_1 ∈ B using a gamble on B, and player 2 pays player 1 the amount r(s_1, a_1, b_1). The next state s_2 has distribution q(· | s_1, a_1, b_1), and play continues.

The total payoff from player 2 to player 1 is the expected value of ∑_{n=1}^∞ β^{n-1} r(s_n, a_n, b_n).
Strategies and Payoffs

A strategy σ for player 1 (τ for player 2) selects each action a_n (respectively b_n) with a gamble that is a function of (s_1, a_1, b_1, ..., a_{n-1}, b_{n-1}, s_n). The pair σ, τ together with an initial state s_1 = s and the law of motion q determines a DS-strategy γ = γ(s, σ, τ). The expected payoff from player 2 to player 1 at the initial state s_1 = s is defined as

    V(σ, τ)(s) = E_γ[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n, b_n) ]
               = E_{σ,τ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n, b_n) ].

This expectation is well-defined in the Dubins-Savage framework.
Theorem: The discounted stochastic game has a value function V; that is, for all s,

    V(s) = inf_τ sup_σ V(σ, τ)(s) = sup_σ inf_τ V(σ, τ)(s).

The value function V satisfies the Shapley equation:

    V(s) = inf_ν sup_μ ∫∫ [ r(s, a, b) + β ∫ V(t) q(dt | s, a, b) ] μ(da) ν(db).

The infimum (respectively supremum) is over gambles ν on B (respectively μ on A). Also, both players have optimal stationary strategies. The proof is essentially the same as that of Shapley (1953).
Finitely Additive One-Shot Games

Let φ : A × B → R be a bounded function. For gambles μ and ν on A and B respectively, let

    E_{μ,ν} φ = ∫∫ φ(a, b) μ(da) ν(db).

Warning: the order of integration matters.

Lemma: The one-shot game with action sets A and B and payoff φ has a value in the sense that

    inf_ν sup_μ E_{μ,ν} φ = sup_μ inf_ν E_{μ,ν} φ.

Also both players have optimal strategies. This follows from Ky Fan (1953).
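When A and B are finite, the lemma reduces to von Neumann's minimax theorem, and for 2×2 games the value has a classical closed form. A small sketch (my own illustration, not part of the lecture):

```python
# Value of a 2x2 zero-sum matrix game phi[i][j] (row player maximizes).

def value_2x2(phi):
    (a, b), (c, d) = phi
    # A pure saddle point exists iff the row player's best pure guarantee
    # (maximin) equals the column player's best pure guarantee (minimax).
    lower = max(min(a, b), min(c, d))
    upper = min(max(a, c), max(b, d))
    if lower == upper:
        return lower                           # pure optimal strategies
    return (a * d - b * c) / (a + d - b - c)   # both players mix: classical formula

# Matching pennies: no saddle point, both mix 50/50, value 0.
print(value_2x2([[1.0, -1.0], [-1.0, 1.0]]))   # -> 0.0
```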
Shapley Operator

Recall that B is the Banach space of bounded x : S → R. For x ∈ B and s ∈ S, let G(s, x) be the one-shot game with payoff

    φ_{s,x}(a, b) = r(s, a, b) + β ∫ x(t) q(dt | s, a, b),

and let T x(s) be the value of G(s, x). So

    T x(s) = inf_ν sup_μ ∫∫ [ r(s, a, b) + β ∫ x(t) q(dt | s, a, b) ] μ(da) ν(db).

The operator T is a β-contraction with a unique fixed point x*. For each s, let f(s) and g(s) be gambles on A and B that are optimal for players 1 and 2 in G(s, x*). Define σ* and τ* to be the stationary strategies f and g.
Proof of Theorem

An induction shows that for all s ∈ S, strategies τ for player 2, and positive integers n:

    E_{σ*,τ,s}[ ∑_{k=1}^n β^{k-1} r(s_k, a_k, b_k) + β^n x*(s_{n+1}) ] ≥ x*(s).

Let n → ∞ to get that for all s and τ,

    V(σ*, τ)(s) ≥ x*(s).

Similarly, for all s and all strategies σ for player 1,

    V(σ, τ*)(s) ≤ x*(s).

Therefore x* is the value function for the stochastic game; σ* and τ* are optimal stationary strategies.
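The whole argument can be traced numerically on a toy game (my own; all payoffs and transitions are invented): iterate the Shapley operator from x = 0 and check that the limit is a fixed point. Here each per-state one-shot game is 2×2 and is solved in closed form.

```python
# Shapley iteration on an invented 2-state stochastic game where
# each player has 2 actions in every state.

S = [0, 1]
beta = 0.9

# r[s][i][j]: payoff from player 2 to player 1; q[s][i][j]: next-state law
r = {0: [[1.0, -1.0], [-1.0, 1.0]], 1: [[3.0, 1.0], [0.0, 2.0]]}
q = {0: [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.6, 0.4]]],
     1: [[[0.2, 0.8], [0.9, 0.1]], [[0.4, 0.6], [0.7, 0.3]]]}

def value_2x2(m):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    (a, b), (c, d) = m
    lower = max(min(a, b), min(c, d))
    upper = min(max(a, c), max(b, d))
    if lower == upper:
        return lower                           # pure saddle point
    return (a * d - b * c) / (a + d - b - c)   # both players mix

def shapley(x):
    """One step of the Shapley operator: solve the one-shot game G(s, x)."""
    return [value_2x2([[r[s][i][j] + beta * sum(p * x[t] for t, p in
                                                enumerate(q[s][i][j]))
                        for j in (0, 1)] for i in (0, 1)]) for s in S]

x = [0.0, 0.0]
for _ in range(2000):
    x = shapley(x)

# x is (numerically) the fixed point x*, hence the value function.
assert all(abs(x[s] - shapley(x)[s]) < 1e-8 for s in S)
print(x)
```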
Measurable Discounted Stochastic Game

S - a Borel subset of a Polish space
A, B - compact metric spaces
r(s, a, b) - Borel measurable and separately continuous in a and b for every fixed s
q - a regular conditional distribution with q(D | s, a, b) separately continuous in a and b for every fixed s and every Borel subset D of S
β ∈ [0, 1) - discount factor

Players are required to use measurable strategies.
Theorem: A measurable discounted stochastic game has a value function V_M that is Borel measurable. The players have optimal, Borel measurable stationary strategies. See Mertens, Sorin, and Zamir (2014).

Assume that the family {q(D | s, a, ·) : a ∈ A} of functions on B is equicontinuous for each fixed s and each Borel subset D of S. For each triple (s, a, b), let q̂(· | s, a, b) be a finitely additive extension of q(· | s, a, b) to all subsets of S. Let V be the value function for the corresponding finitely additive game.

Theorem: V_M = V. This follows from Maitra and Sudderth (1993).
References

D. Blackwell (1965). Discounted dynamic programming. Ann. Math. Statist. 36, 226-235.

D. Bertsekas and S. Shreve (1978). Stochastic Optimal Control: The Discrete Time Case. Academic Press.

L. Dubins (1974). On Lebesgue-like extensions of finitely additive measures. Ann. Prob. 2, 226-241.

L. E. Dubins and L. J. Savage (1965). How to Gamble If You Must: Inequalities for Stochastic Processes. McGraw-Hill.

E. Feinberg (1996). On measurability and representation of strategic measures in Markov decision theory. In Statistics, Probability, and Game Theory: Papers in Honor of David Blackwell (T. S. Ferguson, L. S. Shapley, and J. B. MacQueen, eds.). IMS Lecture Notes-Monograph Series 30, 29-44.
Ky Fan (1953). Minimax theorems. Proc. Natl. Acad. Sci. 39, 42-47.

A. Maitra and W. Sudderth (1993). Finitely additive and measurable stochastic games. Int. J. Game Theory 22, 201-223.

A. Maitra and W. Sudderth (1998). Finitely additive stochastic games with Borel measurable payoffs. Int. J. Game Theory 27, 257-267.

J.-F. Mertens, S. Sorin, and S. Zamir (2014). Repeated Games. Cambridge University Press.

R. Purves and W. Sudderth (1976). Some finitely additive probability theory. Ann. Prob. 4, 259-276.

L. S. Shapley (1953). Stochastic games. Proc. Natl. Acad. Sci. 39, 1095-1100.