Planning, Learning, Prediction, and Games
Two Player Zero Sum Games and von Neumann's Minimax Theorem: 12/18/09-01/15/10


Lecturer: Patrick Briest, Peter Pietrzyk
Scribe: Philipp Brandes, Peter Pietrzyk, Patrick Briest

4 Two Player Zero Sum Games and von Neumann's Minimax Theorem

Definition 4.1 A normal form game is specified by
- $I$, a set of players,
- $A_i$, a set of strategies for each player $i \in I$, and
- $u_i : \prod_{j \in I} A_j \to \mathbb{R}$, a payoff function for each player $i \in I$.

Normal form games with 2 players can be written in matrix form, such that the elements of $A_1$ index rows, the elements of $A_2$ index columns, and the entry in row $r$ and column $c$ is $(u_1(r, c), u_2(r, c))$. We call player 1 the row player and player 2 the column player. We can also think of the game as being described by two matrices $R$ and $C$ specifying the payoffs of the row and column players, respectively.

Example 4.2 (Rock, Paper, Scissors) Rock beats Scissors, Scissors beats Paper, Paper beats Rock.
Strategies: R, P, S
Payoffs: 1 for winning, -1 for losing, 0 for draw.
Payoff matrix:

          R          P          S
   R   (0, 0)     (-1, 1)    (1, -1)
   P   (1, -1)    (0, 0)     (-1, 1)
   S   (-1, 1)    (1, -1)    (0, 0)

Example 4.3 (The Prisoner's Dilemma) Two players are accused of a crime. If both admit, both go to jail for 2 years. If they both keep quiet, they will go to jail for only one year. If only one of them admits, he becomes a principal witness and goes free, while the other goes to jail for 3 years.

Define the payoff of $x$ years in jail as $(3 - x)$.
Strategies of both players: A (admit), Q (keep quiet)
Payoff matrix:

          Q         A
   Q   (2, 2)    (0, 3)
   A   (3, 0)    (1, 1)

Definition 4.4 For a set of strategies $A$, let
\[ \Delta(A) = \Big\{ p : A \to [0, 1] \;\Big|\; \sum_{a \in A} p(a) = 1 \Big\} \]
denote the set of all probability distributions on $A$. For some player $i \in I$, we call the elements of $\Delta(A_i)$ her mixed strategies.

Mixed strategies are rules for randomly picking a strategy. Picking a strategy deterministically is called a pure strategy. Elements of $\prod_{i \in I} A_i$ are called pure strategy profiles. Elements of $\prod_{i \in I} \Delta(A_i)$ are called mixed strategy profiles. The expected payoff of player $i \in I$ given a mixed strategy profile $(p_1, p_2, \ldots, p_{|I|}) \in \prod_{j \in I} \Delta(A_j)$ is
\[ u_i(p_1, p_2, \ldots, p_{|I|}) = \sum_{(a_1, a_2, \ldots, a_{|I|}) \in \prod_{j \in I} A_j} \Big( \prod_{j \in I} p_j(a_j) \Big) \, u_i(a_1, a_2, \ldots, a_{|I|}), \]
i.e., the expected payoff if every player samples according to their random strategy independently.

For a strategy profile $a = (a_1, a_2, \ldots, a_{|I|})$ and $a_i' \in A_i$, let
\[ (a_i', a_{-i}) = (a_1, \ldots, a_{i-1}, a_i', a_{i+1}, \ldots, a_{|I|}), \]
i.e., the strategy profile obtained from $a$ by changing player $i$'s strategy from $a_i$ to $a_i'$. We will use the same notation for mixed strategy profiles.

Definition 4.5 A mixed strategy profile $p = (p_1, \ldots, p_{|I|})$ is a mixed Nash equilibrium, if for all $i \in I$ and $q_i \in \Delta(A_i)$,
\[ u_i(q_i, p_{-i}) \le u_i(p_i, p_{-i}). \]
If each $p_i$ is a pure strategy (assigning probability 1 to a single element from $A_i$), we call $p$ a pure Nash equilibrium.
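Definitions 4.4 and 4.5 can be made concrete with a small sketch. The helper names below (`expected_payoff`, `is_nash`, `point_mass`, `uniform`) and the dictionary encoding of games are assumptions made for illustration and do not come from the lecture. The check relies on the fact that the expected payoff is linear in each player's own mixed strategy, so it is enough to test deviations to pure strategies.

```python
from fractions import Fraction
from itertools import product

def expected_payoff(payoff, profile, player):
    """Expected payoff u_i(p_1, ..., p_k) when all players mix independently.

    payoff:  dict mapping pure strategy profiles (tuples) to payoff tuples
    profile: list of dicts, one per player, mapping EVERY strategy of that
             player to its probability (zero entries included)
    """
    total = Fraction(0)
    for pure_profile in product(*(list(p) for p in profile)):
        prob = Fraction(1)
        for p_j, a_j in zip(profile, pure_profile):
            prob *= p_j[a_j]
        total += prob * payoff[pure_profile][player]
    return total

def point_mass(strategies, a):
    """Pure strategy a, written as a degenerate mixed strategy."""
    return {s: Fraction(int(s == a)) for s in strategies}

def uniform(strategies):
    return {s: Fraction(1, len(strategies)) for s in strategies}

def is_nash(payoff, profile):
    """Check Definition 4.5 by testing all unilateral pure deviations."""
    for i, p_i in enumerate(profile):
        current = expected_payoff(payoff, profile, i)
        for a in p_i:
            deviation = list(profile)
            deviation[i] = point_mass(p_i, a)
            if expected_payoff(payoff, deviation, i) > current:
                return False
    return True

# Prisoner's Dilemma with payoff 3 - x for x years in jail.
PD = {("Q", "Q"): (2, 2), ("Q", "A"): (0, 3),
      ("A", "Q"): (3, 0), ("A", "A"): (1, 1)}

# Rock, Paper, Scissors: 1 for winning, -1 for losing, 0 for a draw.
BEATS = {"R": "S", "S": "P", "P": "R"}
RPS = {(r, c): (0, 0) if r == c else ((1, -1) if BEATS[r] == c else (-1, 1))
       for r in "RPS" for c in "RPS"}

print(is_nash(PD, [point_mass("QA", "A"), point_mass("QA", "A")]))  # True
print(is_nash(RPS, [uniform("RPS"), uniform("RPS")]))               # True
```

Exact rational arithmetic via `Fraction` avoids spurious "profitable" deviations caused by floating-point rounding when the players are exactly indifferent.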

Example 4.6 The Prisoner's Dilemma game has one pure Nash equilibrium (A, A). Rock, Paper, Scissors has no pure Nash equilibrium, but a mixed equilibrium in which both players choose P, R, S with probability $\frac{1}{3}$ each. (Easy to check: If one player randomizes uniformly, every mixed strategy performs equally well for the other player, so deviating does not increase the payoff.)

Example 4.7 (Bach or Mozart) Two players want to decide whether to go to a Bach concert or one of Mozart. They want to go together, but prefer different alternatives:

          B         M
   B   (2, 1)    (0, 0)
   M   (0, 0)    (1, 2)

There are 2 pure Nash equilibria:
   PureE_1: (B, B)
   PureE_2: (M, M)
There is also a third, mixed Nash equilibrium MixedE:
   row player: $\Pr(B) = \frac{2}{3}$, $\Pr(M) = \frac{1}{3}$
   column player: $\Pr(B) = \frac{1}{3}$, $\Pr(M) = \frac{2}{3}$

Some critiques of the Nash equilibrium concept:
1. The idea of a Nash equilibrium is that player 1 plays her side of the equilibrium, because she believes that player 2 plays her side of the equilibrium, because she thinks that player 1 plays her side of the equilibrium, because... But if there are multiple equilibria, why should we believe that players will be able to coordinate their beliefs?
2. Look at the payoffs in Bach or Mozart:
   PureE_1: $(2, 1)$
   PureE_2: $(1, 2)$
   MixedE: $(\frac{2}{3}, \frac{2}{3})$
   So different equilibria result in different payoffs. If we can't predict which Nash equilibrium will be reached, we also can't predict the payoffs.

In this lecture we will address these critiques, showing that players arrive at an equilibrium by playing a game repeatedly and using learning rules to adapt to their opponent's behavior.

Definition 4.8 A two player zero sum game is one in which $I = \{1, 2\}$ and $u_2(a_1, a_2) = -u_1(a_1, a_2)$ for all pure strategy profiles $(a_1, a_2)$.

Rock, Paper, Scissors is an example of a zero sum game (actually, even a win/lose game with payoffs in $\{-1, 1\}$).
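Returning to Example 4.7, the mixed equilibrium of Bach or Mozart can be recovered from indifference conditions: each player must mix so that the other is indifferent between B and M. The symbols $p_B$ and $q_B$ (the probabilities that the row and column player, respectively, choose B) are introduced here only for this worked check.

```latex
\begin{align*}
% Row player indifferent between B and M, given the column player's q_B:
2\, q_B &= 1 \cdot (1 - q_B) &&\Longrightarrow\quad q_B = \tfrac{1}{3}, \\
% Column player indifferent between B and M, given the row player's p_B:
1 \cdot p_B &= 2\, (1 - p_B) &&\Longrightarrow\quad p_B = \tfrac{2}{3}, \\
% Resulting expected payoffs; only (B, B) and (M, M) contribute:
u_1 &= 2 \cdot \tfrac{2}{3} \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{2}{3}, \\
u_2 &= 1 \cdot \tfrac{2}{3} \cdot \tfrac{1}{3} + 2 \cdot \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{2}{3}.
\end{align*}
```

Both players do strictly worse in the mixed equilibrium than in either pure equilibrium, which is what makes the payoff-prediction critique above concrete.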

4.1 Von Neumann's Minimax Theorem

Theorem 4.9 Let a two-player zero-sum game $G = (I, (A_i), (u_i))$ be given. Define
\[ v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \]
and
\[ v_1^{\max} = \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q). \]
It holds that $v_1^{\min} = v_1^{\max}$. We call this value $V$ the game value of $G$.

Intuitively, Theorem 4.9 says the following: The best payoff player 1 can guarantee for herself, if she has to pick a strategy first and player 2 is then allowed to play a best response ($v_1^{\min}$), is equal to the minimum payoff she can achieve if player 2 has to go first, and player 1 is allowed to respond optimally ($v_1^{\max}$). This has some very nice consequences.

Corollary 4.10 Let $G$ be a two-player zero-sum game with game value $V$. In every mixed Nash equilibrium $(p^*, q^*)$, we have $u_1(p^*, q^*) = V$.

Proof: Since $p^*$ is a best response to $q^*$,
\[ u_1(p^*, q^*) = \max_{p \in \Delta(A_1)} u_1(p, q^*) \ge \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max} = V. \]
Similarly, since $q^*$ is a best response to $p^*$, $u_2(p^*, q^*) \ge -V$. So $u_1(p^*, q^*) = -u_2(p^*, q^*) \le V$.

Corollary 4.11 Let $G$ be a two-player zero-sum game. A mixed strategy profile $(p^*, q^*)$ is a mixed Nash equilibrium, if and only if
\[ p^* \in \operatorname{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \]
and
\[ q^* \in \operatorname{argmin}_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q). \]
In particular, the set of mixed Nash equilibria is non-empty.

Proof: "$\Rightarrow$": Let $(p^*, q^*)$ be a Nash equilibrium. By Corollary 4.10, $u_1(p^*, q^*) = V$. Since $q^*$ is a best response to $p^*$,
\[ \min_{q \in \Delta(A_2)} u_1(p^*, q) = u_1(p^*, q^*) = V = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q). \]
Thus, $p^* \in \operatorname{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q)$. By symmetry, the same argument applies to player 2, as well.

"$\Leftarrow$": Let a strategy profile $(p^*, q^*)$ with $p^*$, $q^*$ from the respective sets of mixed strategies be given. Since $p^* \in \operatorname{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q)$, we have
\[ u_1(p^*, q^*) \ge \min_{q \in \Delta(A_2)} u_1(p^*, q) = V. \]
On the other hand, for any $p \in \Delta(A_1)$,
\[ u_1(p, q^*) \le \max_{p' \in \Delta(A_1)} u_1(p', q^*) = \min_{q \in \Delta(A_2)} \max_{p' \in \Delta(A_1)} u_1(p', q) = V, \]
where the last equality follows because $q^* \in \operatorname{argmin}_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q)$. So player 1 has no incentive to defect to a different strategy. Again, by symmetry, the same argument applies to player 2.

Remark Note that Corollary 4.11 resolves one of the critiques of the Nash equilibrium concept. Players don't have to coordinate their actions in order to reach an equilibrium, but can pick strategies from their respective argmax-sets independently.

Recall the MaxHedge algorithm from Homework Assignment 4. We showed:

Theorem 4.12 Algorithm MaxHedge (with $n$ experts, costs in $[0, 1]$) has regret $O\big(\sqrt{T \ln(n)}\big)$ against the class of adaptive adversaries. In particular, for any sequence of reward functions $r_t : [n] \to [0, 1]$, the sequence of experts $x_1, \ldots, x_T$ selected by the algorithm satisfies
\[ E\Big[ \sum_{t=1}^{T} r_t(x_t) \Big] \ge E\Big[ \max_{p \in \Delta([n])} \sum_{t=1}^{T} \Big( \sum_{x=1}^{n} p_x \, r_t(x) \Big) \Big] - O\big(\sqrt{T \ln(n)}\big). \]

In Theorem 4.12 above we compare our algorithm to the best mixture of experts. However, the bound follows immediately by observing that the maximum on the right hand side is always achieved by a single best expert.

Proof of Theorem 4.9: We start with the easy direction and prove that $v_1^{\min} \le v_1^{\max}$: For any strategy profile $(\hat{p}, \hat{q})$,
\[ \min_{q \in \Delta(A_2)} u_1(\hat{p}, q) \le u_1(\hat{p}, \hat{q}) \le \max_{p \in \Delta(A_1)} u_1(p, \hat{q}). \]
Thus, $\min_{q} u_1(\hat{p}, q) \le \max_{p} u_1(p, \hat{q})$ and, taking the maximum over $\hat{p}$ on both sides,
\[ v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \le \max_{p \in \Delta(A_1)} u_1(p, \hat{q}). \]
Since this holds for every $\hat{q}$, taking the minimum over $\hat{q}$ gives
\[ v_1^{\min} \le \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max}. \]

To prove the other direction $v_1^{\max} \le v_1^{\min}$, we will use the existence of expert learning algorithms with vanishing per-time-step regret (as, e.g., Hedge). We assume w.l.o.g. that $u_1(p, q) \in [0, 1]$ (and, thus, $u_2(p, q) \in [-1, 0]$) for all $p \in \Delta(A_1)$, $q \in \Delta(A_2)$. This can always be achieved by applying an appropriate linear transformation to the game matrix.

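Note that
\[ v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} \big( -u_2(p, q) \big) = -\min_{p \in \Delta(A_1)} \max_{q \in \Delta(A_2)} u_2(p, q) = -v_2^{\max}. \]
Similarly, $v_1^{\max} = -v_2^{\min}$.

Now assume that for $T$ steps, both players use an expert learning algorithm to determine their strategy. Formally, let $n = \max\{|A_1|, |A_2|\}$. Player 1 applies the algorithm with one expert for each $a \in A_1$. If player 2 plays strategy $b$ at time $t$, the reward of expert $a$ is $r_t(a) = u_1(a, b)$. Player 2 applies the algorithm with an expert for each $b \in A_2$; if player 1 plays strategy $a$ at time $t$, the reward of expert $b$ is $r_t(b) = u_2(a, b) + 1$ (making sure rewards are in $[0, 1]$).

Let $a_1, \ldots, a_T$ and $b_1, \ldots, b_T$ be the strategies selected by the 2 players and assume that both algorithms have regret $O\big(\sqrt{T \ln(n)}\big)$. Define mixed strategies
\[ \bar{p} = \frac{1}{T} \sum_{t=1}^{T} a_t \quad \text{and} \quad \bar{q} = \frac{1}{T} \sum_{t=1}^{T} b_t. \]
(The above is somewhat sloppy notation. We associate each $a_t$ with the vector $a_t \in \Delta(A_1)$ that assigns probability 1 to strategy $a_t \in A_1$.) Intuitively, strategies $\bar{p}$ and $\bar{q}$ mix pure strategies proportional to the frequency with which they have been played. Let
\[ p^* \in \operatorname{argmax}_{p \in \Delta(A_1)} u_1(p, \bar{q}), \qquad q^* \in \operatorname{argmax}_{q \in \Delta(A_2)} u_2(\bar{p}, q) \]
be best responses to $\bar{q}$ and $\bar{p}$, respectively. Note that
\[ u_1(p^*, \bar{q}) = \max_{p \in \Delta(A_1)} u_1(p, \bar{q}) \ge \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max}. \]
Analogously, $u_2(\bar{p}, q^*) \ge v_2^{\max}$. By our regret bound,
\[ E\Big[ \frac{1}{T} \sum_{t=1}^{T} u_1(a_t, b_t) \Big] \ge E\Big[ \frac{1}{T} \sum_{t=1}^{T} u_1(p^*, b_t) \Big] - O\big(\sqrt{\ln(n)/T}\big) = E\big[ u_1(p^*, \bar{q}) \big] - O\big(\sqrt{\ln(n)/T}\big). \]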

Finally, combining the above,
\begin{align*}
v_1^{\max} - O\big(\sqrt{\ln(n)/T}\big) &\le E\big[ u_1(p^*, \bar{q}) \big] - O\big(\sqrt{\ln(n)/T}\big) \\
&\le E\Big[ \frac{1}{T} \sum_{t=1}^{T} u_1(a_t, b_t) \Big] = -E\Big[ \frac{1}{T} \sum_{t=1}^{T} u_2(a_t, b_t) \Big] \\
&\le -E\big[ u_2(\bar{p}, q^*) \big] + O\big(\sqrt{\ln(n)/T}\big) \\
&\le -v_2^{\max} + O\big(\sqrt{\ln(n)/T}\big) = v_1^{\min} + O\big(\sqrt{\ln(n)/T}\big).
\end{align*}
Now the claim follows for $T \to \infty$ by compactness of $\Delta(A_1)$, $\Delta(A_2)$.

Remark By the last set of inequalities in the proof of Theorem 4.9, running the regret-minimizing algorithms for $\Omega\big(\ln(n)/(\delta/2)^2\big)$ steps yields a strategy profile $(\bar{p}, \bar{q})$ such that
\[ E\Big[ \max_{p \in \Delta(A_1)} u_1(p, \bar{q}) \Big] = E\big[ u_1(p^*, \bar{q}) \big] \le V + \delta/2. \]
Consequently, $E[u_1(\bar{p}, \bar{q})] \le V + \delta/2$ and, since the game is zero sum, $E[u_2(\bar{p}, \bar{q})] \ge -V - \delta/2$. Similarly, for player 2 we have that
\[ E\Big[ \max_{q \in \Delta(A_2)} u_2(\bar{p}, q) \Big] = E\big[ u_2(\bar{p}, q^*) \big] \le -V + \delta/2. \]
Thus, $E[u_2(\bar{p}, \bar{q})] \le -V + \delta/2$ and $E[u_1(\bar{p}, \bar{q})] \ge V - \delta/2$. Such a strategy profile, in which no player can gain more than $\delta$ by deviating unilaterally, is called a $\delta$-approximate Nash equilibrium. Since we only run the algorithms for $O(\log n)$ steps (for fixed $\delta$), the mixed strategies $\bar{p}$ and $\bar{q}$ assign positive probability to at most $O(\log n)$ pure strategies. We conclude that every two-player zero-sum game possesses $\delta$-approximate Nash equilibria with small support of size $O(\log n)$.

Remark Using Markov's inequality we can turn the above existence result into a constructive procedure. Running the algorithms for $\Omega\big(\ln(n)/(\delta/2c)^2\big)$ steps for any $c > 0$, we obtain a mixed strategy profile $(\bar{p}, \bar{q})$ with
\[ \mathrm{Prob}\Big( \max_{p \in \Delta(A_1)} u_1(p, \bar{q}) \ge V + \frac{\delta}{2} \Big) \le \frac{\delta/(2c)}{\delta/2} = \frac{1}{c} \]
and support of size $O(\log n)$. Running the procedure repeatedly yields the desired $\delta$-approximate Nash equilibrium with probability exponentially close to 1.

Remark Assume that a zero-sum game is played repeatedly. By using a regret-minimizing algorithm like Hedge, player 1 comes close to the best possible payoff against whatever distribution of strategies player 2 happens to use. It is not necessary to assume that player 2 is acting rationally.
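The self-play argument from the proof of Theorem 4.9 can be sketched in code. This is a simplified variant and not the MaxHedge algorithm from the homework: both players run a plain exponential-weights (Hedge-style) update and play their weight distributions directly instead of sampling pure strategies, which keeps the run deterministic but gives up the small-support property discussed above. The function names, the learning rate $\eta = \sqrt{8 \ln(n)/T}$, and the example matrix are assumptions made for illustration.

```python
import math

def normalize(w):
    total = sum(w)
    return [x / total for x in w]

def hedge_self_play(U, T):
    """Exponential-weights self-play on a zero-sum game.

    U[a][b] in [0, 1] is player 1's payoff. Player 2's reward is
    1 - U[a][b], i.e. u_2 + 1 as in the proof. Returns the
    time-averaged mixed strategies (p_bar, q_bar).
    """
    m, k = len(U), len(U[0])
    n = max(m, k)
    eta = math.sqrt(8.0 * math.log(n) / T)  # assumed learning rate
    p, q = [1.0 / m] * m, [1.0 / k] * k
    p_bar, q_bar = [0.0] * m, [0.0] * k
    for _ in range(T):
        for a in range(m):
            p_bar[a] += p[a] / T
        for b in range(k):
            q_bar[b] += q[b] / T
        # Expected reward of each expert against the opponent's current mix.
        r1 = [sum(U[a][b] * q[b] for b in range(k)) for a in range(m)]
        r2 = [sum((1.0 - U[a][b]) * p[a] for a in range(m)) for b in range(k)]
        p = normalize([p[a] * math.exp(eta * r1[a]) for a in range(m)])
        q = normalize([q[b] * math.exp(eta * r2[b]) for b in range(k)])
    return p_bar, q_bar

def value_bounds(U, p_bar, q_bar):
    """Return (lower, upper) = (min_b u_1(p_bar, b), max_a u_1(a, q_bar))."""
    m, k = len(U), len(U[0])
    upper = max(sum(U[a][b] * q_bar[b] for b in range(k)) for a in range(m))
    lower = min(sum(p_bar[a] * U[a][b] for a in range(m)) for b in range(k))
    return lower, upper

# Hypothetical 2x2 game whose value is 1/2 (row mixes 1/3, 2/3; column 1/2, 1/2).
U = [[0.0, 1.0],
     [0.75, 0.25]]
T = 2000
p_bar, q_bar = hedge_self_play(U, T)
lower, upper = value_bounds(U, p_bar, q_bar)
print(lower, upper)  # the game value lies between these two numbers
```

The two returned numbers sandwich the game value, and under the standard exponential-weights regret bound for this learning rate their difference is at most $2\sqrt{\ln(n)/(2T)}$, mirroring the final chain of inequalities in the proof.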

4.2 Yao's Minimax Principle

Consider a computational problem with
- $I$, a finite set of possible inputs, and
- $A$, a finite set of possible algorithms.
For all $i \in I$, $a \in A$ denote by $t(i, a)$ the running time of algorithm $a$ on input $i$. We can think of this as a game, in which player 1 gets to select the input, while player 2 is allowed to pick an algorithm. Similar to the payoff functions in the previous sections, we can generalize $t$ to distributions $p \in \Delta(I)$, $q \in \Delta(A)$ as
\[ t(p, q) = \sum_{i \in I} \sum_{a \in A} t(i, a) \, p(i) \, q(a). \]
In words, $t(p, q)$ denotes the expected running time of an algorithm from distribution $q$ on an input from distribution $p$.

Theorem 4.13 In the setting described above, it holds that
\[ \max_{p \in \Delta(I)} \min_{a \in A} t(p, a) = \min_{q \in \Delta(A)} \max_{i \in I} t(i, q). \]

Proof: This follows immediately from von Neumann's Minimax Theorem 4.9 and the observation that for any mixed strategy $p$ or $q$ of one of the players, the other player has a pure strategy that constitutes a best response.

Yao's Minimax Principle states that, for finite sets of algorithms and inputs, the best worst-case running time achievable by any randomized algorithm (the right hand side) is equal to the best running time obtainable by a deterministic algorithm on a worst-case distribution of inputs (the left hand side). This is very helpful in proving lower bounds on the performance guarantee of randomized algorithms (e.g., in online computation). If one can construct a distribution on inputs on which no deterministic algorithm performs well in expectation (which is often much easier than arguing about randomized algorithms), then it follows that for every randomized algorithm there must exist an instance on which it does not perform well.
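Theorem 4.13 can be checked by hand on a tiny instance. The sketch below uses a hypothetical $2 \times 2$ running-time table (two inputs, two algorithms) and a brute-force grid search over distributions. The table, the function names, and the grid resolution are assumptions for illustration. Grid search only works here because there are two inputs and two algorithms and both optima happen to lie on the grid; a general instance would call for linear programming instead.

```python
from fractions import Fraction

# Hypothetical running times: TIMES[i][a] = t(i, a) for input i, algorithm a.
TIMES = [[1, 3],
         [4, 2]]

def lower_bound_side(t, steps=100):
    """max over input distributions p = (x, 1 - x) of min_a t(p, a)."""
    best = None
    for s in range(steps + 1):
        x = Fraction(s, steps)
        p = (x, 1 - x)
        # Best deterministic algorithm against input distribution p.
        value = min(sum(p[i] * t[i][a] for i in range(2)) for a in range(2))
        best = value if best is None else max(best, value)
    return best

def randomized_side(t, steps=100):
    """min over algorithm distributions q = (y, 1 - y) of max_i t(i, q)."""
    best = None
    for s in range(steps + 1):
        y = Fraction(s, steps)
        q = (y, 1 - y)
        # Worst-case input against the randomized algorithm q.
        value = max(sum(t[i][a] * q[a] for a in range(2)) for i in range(2))
        best = value if best is None else min(best, value)
    return best

lhs = lower_bound_side(TIMES)   # attained at p = (1/2, 1/2)
rhs = randomized_side(TIMES)    # attained at q = (1/4, 3/4)
print(lhs, rhs)                 # 5/2 5/2
```

In this instance every deterministic algorithm has worst-case running time at least 3, while the randomized algorithm $q = (\frac{1}{4}, \frac{3}{4})$ achieves expected running time $\frac{5}{2}$ on every input, matching the distributional lower bound.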