Finitely Additive Dynamic Programming and Stochastic Games

Bill Sudderth
University of Minnesota
Discounted Dynamic Programming

Five ingredients: S, A, r, q, β.

S - state space
A - set of actions
q(· | s, a) - law of motion
r(s, a) - daily reward function (bounded, real-valued)
β ∈ [0, 1) - discount factor
Play of the game

You begin at some state s_1 ∈ S, select an action a_1 ∈ A, and receive a reward r(s_1, a_1). You then move to a new state s_2 with distribution q(· | s_1, a_1), select a_2 ∈ A, and receive β r(s_2, a_2). Then you move to s_3 with distribution q(· | s_2, a_2), select a_3 ∈ A, and receive β^2 r(s_3, a_3). And so on.

Your total reward is the expected value of ∑_{n=1}^∞ β^{n-1} r(s_n, a_n).
Strategies and Rewards

A strategy σ selects each action a_n, possibly at random, as a function of the history (s_1, a_1, ..., a_{n-1}, s_n). The reward from σ at the initial state s_1 = s is

    V(σ)(s) = E_{σ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ].

Given s_1 = s and a_1 = a, the conditional strategy σ[s, a] is just the continuation of σ, and

    V(σ)(s) = ∫ [ r(s, a) + β ∫ V(σ[s, a])(t) q(dt | s, a) ] σ(s)(da).
The Optimal Reward and the Bellman Equation

The optimal reward at s is V(s) = sup_σ V(σ)(s).

The Bellman equation for V is

    V(s) = sup_a [ r(s, a) + β ∫ V(t) q(dt | s, a) ].

I will sketch the proof for S and A countable.
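For S and A finite, the Bellman equation can be checked numerically by iterating its right-hand side (value iteration). The toy problem below is my own illustration, not from the lecture; the states, rewards, and transition probabilities are invented.

```python
# Toy check of the Bellman equation by value iteration on an invented
# 2-state, 2-action discounted problem.

S = [0, 1]
A = [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
q = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
beta = 0.9

def bellman(V):
    """One application of V -> sup_a [ r(s, a) + beta * E_q V ]."""
    return [max(r[s, a] + beta * sum(p * V[t] for t, p in enumerate(q[s, a]))
                for a in A) for s in S]

V = [0.0, 0.0]
for _ in range(2000):       # iterate to the (numerical) fixed point
    V = bellman(V)

# V now satisfies the Bellman equation up to rounding:
assert all(abs(V[s] - bellman(V)[s]) < 1e-8 for s in S)
```

With β = 0.9 the iterates converge geometrically, so a couple of thousand iterations is far more than enough.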
Proof of ≤:

For every strategy σ and s ∈ S,

    V(σ)(s) = ∫ [ r(s, a) + β ∫ V(σ[s, a])(t) q(dt | s, a) ] σ(s)(da)
            ≤ sup_{a'} [ r(s, a') + β ∫ V(σ[s, a'])(t) q(dt | s, a') ]
            ≤ sup_{a'} [ r(s, a') + β ∫ V(t) q(dt | s, a') ].

Now take the sup over σ.
Proof of ≥:

Fix ε > 0. For every state t ∈ S, select a strategy σ_t such that V(σ_t)(t) ≥ V(t) − ε/2. Fix a state s and choose an action a such that

    r(s, a) + β ∫ V(t) q(dt | s, a) ≥ sup_{a'} [ r(s, a') + β ∫ V(t) q(dt | s, a') ] − ε/2.

Define the strategy σ at s_1 = s to have first action a and conditional strategies σ[s, a](t) = σ_t. Then

    V(s) ≥ V(σ)(s) = r(s, a) + β ∫ V(σ_t)(t) q(dt | s, a)
         ≥ sup_{a'} [ r(s, a') + β ∫ V(t) q(dt | s, a') ] − ε.
Measurable Dynamic Programming

The first formulation of dynamic programming in a general measure-theoretic setting was given by Blackwell (1965). He assumed:

1. S and A are Borel subsets of a Polish space (say, a Euclidean space).
2. The reward function r(s, a) is Borel measurable.
3. The law of motion q(· | s, a) is a regular conditional distribution.

Strategies are required to select actions in a Borel measurable way.
Measurability Problems

In his 1965 paper, Blackwell showed by example that for a Borel measurable dynamic programming problem:

The optimal reward function V(·) need not be Borel measurable, and good Borel measurable strategies need not exist.

This led to nontrivial work by a number of mathematicians including R. Strauch, D. Freedman, M. Orkin, D. Bertsekas, S. Shreve, and Blackwell himself. It follows from their work that for a Borel problem:

The optimal reward function V(·) is universally measurable, and there do exist good universally measurable strategies.
The Bellman Equation Again

The equation still holds, but a proof requires a lot of measure theory. See, for example, Chapter 7 of Bertsekas and Shreve (1978) - about 85 pages. Some additional results are needed to measurably select the σ_t in the proof of ≥. See Feinberg (1996).

The proof works exactly as given in a finitely additive setting, and it works for general sets S and A.
Finitely Additive Probability

Let γ be a finitely additive probability defined on a sigma-field of subsets of some set F. The integral ∫ φ dγ of a simple function φ is defined in the usual way. The integral ∫ ψ dγ of a bounded, measurable function ψ is defined by squeezing with simple functions.

If γ is defined on the sigma-field of all subsets of F, it is called a gamble, and ∫ ψ dγ is defined for all bounded, real-valued functions ψ.
Finitely Additive Stochastic Processes

Let G(F) be the set of all gambles on F. A Dubins-Savage strategy (DS-strategy for brevity) γ is a sequence γ_1, γ_2, ... such that γ_1 ∈ G(F) and, for n ≥ 2, γ_n is a mapping from F^{n-1} to G(F).

Every DS-strategy γ naturally determines a finitely additive probability P_γ on the product sigma-field of F^N. P_γ is regarded as the distribution of a random sequence f_1, f_2, ..., f_n, .... Here f_1 has distribution γ_1 and, given f_1, f_2, ..., f_{n-1}, the conditional distribution of f_n is γ_n(f_1, f_2, ..., f_{n-1}).
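A gamble on an infinite set is in general not countably additive and cannot be simulated, but when each γ_n happens to be an ordinary distribution on a finite set F, P_γ is the usual law of a coordinate-by-coordinate construction. A minimal sketch (my own illustration; the two-point set F and the "repeat the last coordinate" rule are arbitrary choices):

```python
import random

# Sketch: sampling from a DS-strategy whose gambles are ordinary
# countably additive distributions on a finite set F.

F = [0, 1]  # a two-point "fortune" space, chosen only for illustration

def gamma(history):
    """gamma_1 when history is empty; gamma_n(f_1, ..., f_{n-1}) otherwise.
    Here: favor repeating the last coordinate (an arbitrary rule)."""
    if not history:
        return [0.5, 0.5]
    p = 0.8
    last = history[-1]
    return [p if f == last else 1 - p for f in F]

def sample_path(n, rng):
    """Draw f_1, ..., f_n coordinate by coordinate, as in the definition
    of P_gamma: each f_k is drawn from gamma_k given the past."""
    path = []
    for _ in range(n):
        weights = gamma(path)
        path.append(rng.choices(F, weights=weights)[0])
    return path

rng = random.Random(0)
print(sample_path(10, rng))
```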
Finitely Additive Dynamic Programming

For each (s, a), q(· | s, a) is a gamble on S. A strategy σ chooses actions using gambles on A. Each σ together with q and an initial state s_1 = s determines a DS-strategy γ = γ(s, σ) on (A × S)^N. For D ⊆ A × S,

    γ_1(D) = ∫ q(D_a | s, a) σ_1(da)

and

    γ_n(a_1, s_2, ..., a_{n-1}, s_n)(D) = ∫ q(D_a | s_n, a) σ(s_1, a_1, s_2, ..., a_{n-1}, s_n)(da),

where D_a = {t ∈ S : (a, t) ∈ D}. Let

    P_{σ,s} = P_γ.
Rewards and the Bellman Equation

For any bounded, real-valued reward function r, the reward for a strategy σ is well-defined by the same formula as before:

    V(σ)(s) = E_{σ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ].

Also as before, the optimal reward function is V(s) = sup_σ V(σ)(s). The Bellman equation

    V(s) = sup_a [ r(s, a) + β ∫ V(t) q(dt | s, a) ]

can be proved exactly as in the discrete case.
Blackwell Operators

Let B be the Banach space of bounded functions x : S → R equipped with the supremum norm. For each function f : S → A, define the operator T_f for elements x ∈ B by

    (T_f x)(s) = r(s, f(s)) + β ∫ x(s') q(ds' | s, f(s)).

Also define the operator T by

    (T x)(s) = sup_a [ r(s, a) + β ∫ x(s') q(ds' | s, a) ].

This definition of T makes sense in the finitely additive case, and in the countably additive case when S is countable. There is trouble in the general measurable case.
Fixed Points

The operators T_f and T are β-contractions. By a theorem of Banach, they have unique fixed points.

The fixed point of T is the optimal reward function V. The equality V(s) = (T V)(s) is just the Bellman equation

    V(s) = sup_a [ r(s, a) + β ∫ V(t) q(dt | s, a) ].
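The contraction property rests on the elementary bound |sup_a f(a) − sup_a g(a)| ≤ sup_a |f(a) − g(a)|, which gives ||Tx − Ty|| ≤ β ||x − y||. It is easy to confirm numerically on a finite toy problem (my example; the rewards and transitions below are arbitrary):

```python
import random

# Numerical sanity check: T is a beta-contraction in the sup norm
# on an invented 2-state, 2-action problem.

S = [0, 1]
A = [0, 1]
r = {(s, a): float(s + a) for s in S for a in A}   # arbitrary bounded rewards
q = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
beta = 0.9

def T(x):
    return [max(r[s, a] + beta * sum(p * x[t] for t, p in enumerate(q[s, a]))
                for a in A) for s in S]

rng = random.Random(1)
for _ in range(100):
    x = [rng.uniform(-10.0, 10.0) for _ in S]
    y = [rng.uniform(-10.0, 10.0) for _ in S]
    lhs = max(abs(u - v) for u, v in zip(T(x), T(y)))   # ||Tx - Ty||
    rhs = beta * max(abs(u - v) for u, v in zip(x, y))  # beta * ||x - y||
    assert lhs <= rhs + 1e-12
```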
Stationary Strategies

A strategy σ is stationary if there is a function f : S → A such that σ(s_1, a_1, ..., a_{n-1}, s_n) = f(s_n) for all histories (s_1, a_1, ..., a_{n-1}, s_n). Notation: σ = f.

The fixed point of T_f is the reward function V(σ)(·) for the stationary strategy σ = f:

    V(σ)(s) = r(s, f(s)) + β ∫ V(σ)(t) q(dt | s, f(s)) = (T_f V(σ))(s).

Fundamental Question: Do optimal or nearly optimal stationary strategies exist?
Existence of Good Stationary Strategies

Fix ε > 0. For each s, choose f(s) such that

    (T_f V)(s) ≥ V(s) − ε(1 − β).

Let σ = f. An easy induction shows that

    (T_f^n V)(s) ≥ V(s) − ε, for all s and n.

But, by Banach's theorem, (T_f^n V)(s) → V(σ)(s). So the stationary plan σ is ε-optimal.
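On a finite toy problem (my own example, with made-up data) the construction can be traced directly: the sup in the Bellman equation is attained, so the greedy choice of f satisfies the displayed inequality for every ε > 0, and iterating T_f recovers its reward.

```python
# Check the epsilon-optimal stationary strategy construction on an
# invented 2-state, 2-action problem.

S = [0, 1]
A = [0, 1]
r = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
q = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
beta = 0.9

def T(x):            # Bellman operator
    return [max(r[s, a] + beta * sum(p * x[t] for t, p in enumerate(q[s, a]))
                for a in A) for s in S]

def T_f(f, x):       # operator for the stationary strategy f
    return [r[s, f[s]] + beta * sum(p * x[t] for t, p in enumerate(q[s, f[s]]))
            for s in S]

V = [0.0, 0.0]
for _ in range(3000):
    V = T(V)         # V is (numerically) the optimal reward

# Greedy selection: here (T_f V)(s) = (T V)(s) = V(s), so any eps > 0 works.
f = [max(A, key=lambda a, s=s: r[s, a] +
         beta * sum(p * V[t] for t, p in enumerate(q[s, a]))) for s in S]

Vf = [0.0, 0.0]
for _ in range(3000):
    Vf = T_f(f, Vf)  # Banach iteration converges to the reward of f

assert all(Vf[s] >= V[s] - 1e-6 for s in S)   # f is (numerically) optimal
```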
The Measurable Case: Trouble for T

T does not preserve Borel measurability.
T does not preserve universal measurability.
T does preserve upper semianalytic functions, but these do not form a Banach space.

Good universally measurable stationary strategies do exist, but the proof is more complicated.
Finitely Additive Extensions of Measurable Problems

Every probability measure on an algebra of subsets of a set F can be extended to a gamble on F, that is, a finitely additive probability defined on all subsets of F. (The extension is typically not unique.)

Thus a measurable, discounted problem S, A, r, q, β can be extended to a finitely additive problem S, A, r, q̂, β where q̂(· | s, a) is a gamble on S that extends q(· | s, a) for every (s, a).

Questions: Is the optimal reward the same for both problems? Can a player do better by using non-measurable strategies?
Reward Functions for Measurable and for Finitely Additive Plans

For a measurable strategy σ, the reward

    V_M(σ)(s) = E_{σ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ]

is the expectation under the countably additive probability P_{σ,s}. Each measurable σ can be extended to a finitely additive strategy σ̂ with reward

    V(σ̂)(s) = E_{σ̂,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n) ]

calculated under the finitely additive probability P_{σ̂,s}.

Fact: V_M(σ)(s) = V(σ̂)(s).
Optimal Rewards

For a measurable problem, let

    V_M(s) = sup V_M(σ)(s),

where the sup is over all measurable strategies σ, and let

    V(s) = sup V(σ)(s),

where the sup is over all plans σ in some finitely additive extension.
Theorem: V_M(s) = V(s).

Proof: The Bellman equation is known to hold in the measurable theory:

    V_M(s) = sup_a [ r(s, a) + β ∫ V_M(t) q(dt | s, a) ].

In other terms, V_M(s) = (T V_M)(s). But V is the unique fixed point of T.
Two-Person Zero-Sum Discounted Stochastic Game (Finitely Additive)

Six ingredients: S, A, B, r, q, β.

S - state space
A, B - action sets for players 1 and 2
q(· | s, a, b) - law of motion (finitely additive)
r(s, a, b) - daily payoff from player 2 to player 1 (bounded, real-valued)
β ∈ [0, 1) - discount factor
Play of the game

Play begins at some state s_1 ∈ S, player 1 chooses a_1 ∈ A using a gamble on A, player 2 chooses b_1 ∈ B using a gamble on B, and player 2 pays player 1 the amount r(s_1, a_1, b_1). The next state s_2 has distribution q(· | s_1, a_1, b_1), and play continues.

The total payoff from player 2 to player 1 is the expected value of ∑_{n=1}^∞ β^{n-1} r(s_n, a_n, b_n).
Strategies and Payoffs

A strategy σ for player 1 (τ for player 2) selects each action a_n (respectively b_n) with a gamble that is a function of (s_1, a_1, b_1, ..., a_{n-1}, b_{n-1}, s_n). The pair σ, τ together with an initial state s_1 = s and the law of motion q determines a DS-strategy γ = γ(s, σ, τ). The expected payoff from player 2 to player 1 at the initial state s_1 = s is defined as

    V(σ, τ)(s) = E_γ[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n, b_n) ]
               = E_{σ,τ,s}[ ∑_{n=1}^∞ β^{n-1} r(s_n, a_n, b_n) ].

This expectation is well-defined in the Dubins-Savage framework.
Theorem: The discounted stochastic game has a value function V; that is, for all s,

    V(s) = inf_τ sup_σ V(σ, τ)(s) = sup_σ inf_τ V(σ, τ)(s).

The value function V satisfies the Shapley equation:

    V(s) = inf_ν sup_μ ∫∫ [ r(s, a, b) + β ∫ V(t) q(dt | s, a, b) ] μ(da) ν(db).

The infimum (respectively supremum) is over gambles ν on B (respectively μ on A). Also, both players have optimal stationary strategies. The proof is essentially the same as that of Shapley (1953).
Finitely Additive One-Shot Games

Let φ : A × B → R be a bounded function. For gambles μ and ν on A and B respectively, let

    E_{μ,ν} φ = ∫∫ φ(a, b) μ(da) ν(db).

Warning: the order of integration matters.

Lemma: The one-shot game with action sets A and B and payoff φ has a value in the sense that

    inf_ν sup_μ E_{μ,ν} φ = sup_μ inf_ν E_{μ,ν} φ.

Also both players have optimal strategies. This follows from Ky Fan (1953).
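When A and B are finite, the lemma reduces to von Neumann's minimax theorem, and for 2×2 games the value has a classical closed form. A small sketch (my own illustration, not part of the lecture):

```python
# Value of a 2x2 zero-sum matrix game phi[i][j] (row player maximizes).

def value_2x2(phi):
    (a, b), (c, d) = phi
    # A pure saddle point exists iff the row player's best pure guarantee
    # (maximin) equals the column player's best pure guarantee (minimax).
    lower = max(min(a, b), min(c, d))
    upper = min(max(a, c), max(b, d))
    if lower == upper:
        return lower                           # pure optimal strategies
    return (a * d - b * c) / (a + d - b - c)   # both players mix: classical formula

# Matching pennies: no saddle point, both mix 50/50, value 0.
print(value_2x2([[1.0, -1.0], [-1.0, 1.0]]))   # -> 0.0
```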
Shapley Operator

Recall that B is the Banach space of bounded x : S → R. For x ∈ B and s ∈ S, let G(s, x) be the one-shot game with payoff

    φ_{s,x}(a, b) = r(s, a, b) + β ∫ x(t) q(dt | s, a, b),

and let T x(s) be the value of G(s, x). So

    T x(s) = inf_ν sup_μ ∫∫ [ r(s, a, b) + β ∫ x(t) q(dt | s, a, b) ] μ(da) ν(db).

The operator T is a β-contraction with a unique fixed point x*. For each s, let f(s) and g(s) be gambles on A and B that are optimal for players 1 and 2 in G(s, x*). Define σ* and τ* to be the stationary strategies f and g.
Proof of Theorem

An induction shows that for all s ∈ S, strategies τ for player 2, and positive integers n:

    E_{σ*,τ,s}[ ∑_{k=1}^n β^{k-1} r(s_k, a_k, b_k) + β^n x*(s_{n+1}) ] ≥ x*(s).

Let n → ∞ to get that for all s and τ,

    V(σ*, τ)(s) ≥ x*(s).

Similarly, for all s and all strategies σ for player 1,

    V(σ, τ*)(s) ≤ x*(s).

Therefore x* is the value function for the stochastic game; σ* and τ* are optimal stationary strategies.
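The whole argument can be traced numerically on a toy game (my own; all payoffs and transitions are invented): iterate the Shapley operator from x = 0 and check that the limit is a fixed point. Here each per-state one-shot game is 2×2 and is solved in closed form.

```python
# Shapley iteration on an invented 2-state stochastic game where
# each player has 2 actions in every state.

S = [0, 1]
beta = 0.9

# r[s][i][j]: payoff from player 2 to player 1; q[s][i][j]: next-state law
r = {0: [[1.0, -1.0], [-1.0, 1.0]], 1: [[3.0, 1.0], [0.0, 2.0]]}
q = {0: [[[0.8, 0.2], [0.3, 0.7]], [[0.5, 0.5], [0.6, 0.4]]],
     1: [[[0.2, 0.8], [0.9, 0.1]], [[0.4, 0.6], [0.7, 0.3]]]}

def value_2x2(m):
    """Value of a 2x2 zero-sum matrix game (row player maximizes)."""
    (a, b), (c, d) = m
    lower = max(min(a, b), min(c, d))
    upper = min(max(a, c), max(b, d))
    if lower == upper:
        return lower                           # pure saddle point
    return (a * d - b * c) / (a + d - b - c)   # both players mix

def shapley(x):
    """One step of the Shapley operator: solve the one-shot game G(s, x)."""
    return [value_2x2([[r[s][i][j] + beta * sum(p * x[t] for t, p in
                                                enumerate(q[s][i][j]))
                        for j in (0, 1)] for i in (0, 1)]) for s in S]

x = [0.0, 0.0]
for _ in range(2000):
    x = shapley(x)

# x is (numerically) the fixed point x*, hence the value function.
assert all(abs(x[s] - shapley(x)[s]) < 1e-8 for s in S)
print(x)
```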
Measurable Discounted Stochastic Game

S - a Borel subset of a Polish space
A, B - compact metric spaces
r(s, a, b) - Borel measurable and separately continuous in a and b for every fixed s
q - a regular conditional distribution with q(D | s, a, b) separately continuous in a and b for every fixed s and every Borel subset D of S
β ∈ [0, 1) - discount factor

Players are required to use measurable strategies.
Theorem: A measurable discounted stochastic game has a value function V_M that is Borel measurable. The players have optimal, Borel measurable stationary strategies. See Mertens, Sorin, and Zamir (2014).

Assume that the family {q(D | s, a, ·) : a ∈ A} of functions on B is equicontinuous for each fixed s and each Borel subset D of S. For each triple (s, a, b), let q̂(· | s, a, b) be a finitely additive extension of q(· | s, a, b) to all subsets of S. Let V be the value function for the corresponding finitely additive game.

Theorem: V_M = V. This follows from Maitra and Sudderth (1993).
References

D. Blackwell (1965). Discounted dynamic programming. Ann. Math. Statist. 36, 226-235.

D. Bertsekas and S. Shreve (1978). Stochastic Optimal Control: The Discrete Time Case. Academic Press.

L. Dubins (1974). On Lebesgue-like extensions of finitely additive measures. Ann. Prob. 2, 226-241.

L. E. Dubins and L. J. Savage (1965). How to Gamble If You Must: Inequalities for Stochastic Processes. McGraw-Hill.

E. Feinberg (1996). On measurability and representation of strategic measures in Markov decision theory. In Statistics, Probability, and Game Theory: Papers in Honor of David Blackwell (T. S. Ferguson, L. S. Shapley, and J. B. MacQueen, eds.). IMS Lecture Notes-Monograph Series 30, 29-44.
Ky Fan (1953). Minimax theorems. Proc. Natl. Acad. Sci. 39, 42-47.

A. Maitra and W. Sudderth (1993). Finitely additive and measurable stochastic games. Int. J. Game Theory 22, 201-223.

A. Maitra and W. Sudderth (1998). Finitely additive stochastic games with Borel measurable payoffs. Int. J. Game Theory 27, 257-267.

J.-F. Mertens, S. Sorin, and S. Zamir (2014). Repeated Games. Cambridge University Press.

R. Purves and W. Sudderth (1976). Some finitely additive probability theory. Ann. Prob. 4, 259-276.

L. S. Shapley (1953). Stochastic games. Proc. Natl. Acad. Sci. 39, 1095-1100.