How to Win Texas Hold'em Poker
Richard Mealing
Machine Learning and Optimisation Group, School of Computer Science, University of Manchester
How to Play Texas Hold'em Poker
1 Deal 2 private cards per player
2 1st (sequential) betting round
3 Deal 3 shared cards (the flop)
4 2nd betting round
5 Deal 1 shared card (the turn)
6 3rd betting round
7 Deal 1 shared card (the river)
8 4th (final) betting round
If all but 1 player folds, that player wins the pot (total bet)
Otherwise, at the end of the game, hands are compared (showdown) and the player with the best hand wins the pot
How to Play Texas Hold'em Poker
Ante = forced bet (everyone pays)
Blinds = forced bets (2 people pay big/small)
If players > 2 then (big blind player, small blind player, dealer)
If players = 2 (heads-up) then (big blind, small blind/dealer)
No-Limit Texas Hold'em lets you bet all your money in a round
Minimum bet = big blind
Maximum bet = all your money
Limit Texas Hold'em Poker has fixed betting limits
A $4/$8 game means in betting rounds 1 & 2 bets = $4 and in betting rounds 3 & 4 bets = $8
Big blind usually equals small bet e.g. $4, and small blind is usually 50% of big blind e.g. $2
Total number of raises per betting round is usually capped at 4 or 5
1-Card Poker Trees
Game tree - both players' private cards are known
1-Card Poker Trees
Public tree - both players' private cards are hidden
1-Card Poker Trees
P1 information set tree - P2's private card is hidden
1-Card Poker Trees
P2 information set tree - P1's private card is hidden
1-Card Poker Trees
1 Game tree - both players' private cards are known
2 Public tree - both players' private cards are hidden
3 P1 information set tree - P2's private card is hidden
4 P2 information set tree - P1's private card is hidden
Heads-Up Limit Texas Hold'em Poker Tree Size
[Figure: schematic game tree showing the cards dealt and the fold (F), call (C), and raise (R) branches at each betting node]
P1 dealt 2 private cards = (52 choose 2) = 1326
P2 dealt 2 private cards = (50 choose 2) = 1225
1st betting round = 19 sequences, 9 continuing
Flop dealt = (48 choose 3) = 17296
2nd betting round = 19 sequences, 9 continuing
Turn dealt = 45
3rd betting round = 19 sequences, 9 continuing
River dealt = 44
4th betting round = 19 sequences
Heads-Up Limit Texas Hold'em Poker Tree Size
Player 1 Deal = 1326
Player 2 Deal = 1326 * 1225
1st Betting Round = 1326 * 1225 * 19
2nd Betting Round = 1326 * 1225 * 9 * 17296 * 19
3rd Betting Round = 1326 * 1225 * 9 * 17296 * 9 * 45 * 19
4th Betting Round = 1326 * 1225 * 9 * 17296 * 9 * 45 * 9 * 44 * 19
Total ≈ 0.79 * 10^18 (near a quintillion game states)
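The arithmetic above can be reproduced directly. This is a sketch, not an official count: the per-round betting-sequence numbers (19 sequences, 9 continuing) are as reconstructed on this slide, and the exact total depends on them.

```python
# Sketch of the heads-up limit hold'em tree-size arithmetic.
import math

p1_deals = math.comb(52, 2)            # 1326 private-card deals for player 1
p2_deals = math.comb(50, 2)            # 1225 deals for player 2 from the rest
flops = math.comb(48, 3)               # 17296 possible flops
turns, rivers = 45, 44                 # one card each from what remains

SEQS, CONT = 19, 9                     # betting sequences per round / continuing ones

total = 0
states = p1_deals                      # player 1 deal states
total += states
states *= p2_deals                     # player 2 deal states
total += states
states *= SEQS                         # 1st betting round states
total += states
for chance in (flops, turns, rivers):  # betting rounds 2-4
    states = states // SEQS * CONT     # only continuing sequences see more cards
    states *= chance * SEQS
    total += states

print(f"{total:.2e}")                  # on the order of 10^17 to 10^18 game states
```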
Abstraction
Lossless
Suit isomorphism: before the flop (pre-flop), two hands are strategically the same if their card ranks match and both hands are suited or both are off-suit, e.g. A♠K♠ and A♥K♥, or T♠J♠ and T♥J♥; this gives 169 equivalence classes, reducing the 1326 possible starting hands per player to 169
Lossy
Bucketing (binning) groups hands into equivalence classes, e.g. based on their probability of winning at showdown against a random hand
Imperfect recall eliminates past information
Betting round reduction
Betting round elimination
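The 169 pre-flop equivalence classes can be enumerated directly. A minimal sketch: up to suit permutation, a starting hand is characterised by its two ranks plus whether it is suited, a pair, or off-suit.

```python
# Enumerate the pre-flop suit-isomorphism classes of hold'em starting hands.
from itertools import combinations_with_replacement

RANKS = "23456789TJQKA"

classes = set()
for r1, r2 in combinations_with_replacement(RANKS, 2):
    if r1 == r2:
        classes.add((r1, r2, "pair"))    # e.g. AA: the two suits always differ
    else:
        classes.add((r1, r2, "suited"))  # e.g. AKs
        classes.add((r1, r2, "offsuit")) # e.g. AKo

print(len(classes))  # 169 = 13 pairs + 78 suited + 78 offsuit
```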
Abstraction
Heads-up Limit Texas Hold'em poker has around 10^18 states
Abstraction can reduce the game to e.g. 10^7 states
Nesterov's excessive gap technique can find approximate Nash equilibria in a game with 10^10 states
Counterfactual regret minimization can find approximate Nash equilibria in a game with 10^12 states
Nash Equilibrium
Game-theoretic solution concept
A set of strategies, 1 per player, such that no one can do better by changing their strategy if the others keep their strategies fixed
Nash proved that every game with a finite number of players and pure strategies has at least 1 (possibly mixed) Nash equilibrium
Annual Computer Poker Competition
Heads-up Limit Texas Hold'em
Total Bankroll: 1 Slumbot (Eric Jackson, USA) 2 Little Rock (Rod Byrnes, Australia) and Zbot (Ilkka Rajala, Finland)
Bankroll Instant Run-off: 1 Slumbot (Eric Jackson, USA) 2 Hyperborean (University of Alberta, Canada) 3 Zbot (Ilkka Rajala, Finland)
Heads-up No-Limit Texas Hold'em
Total Bankroll: 1 Little Rock (Rod Byrnes, Australia) 2 Hyperborean (University of Alberta, Canada) 3 Tartanian5 (Carnegie Mellon University, USA)
Bankroll Instant Run-off: 1 Hyperborean (University of Alberta, Canada) 2 Tartanian5 (Carnegie Mellon University, USA) 3 Neo Poker Bot (Alexander Lee, Spain)
3-player Limit Texas Hold'em
Total Bankroll: 1 Hyperborean (University of Alberta, Canada) 2 Little Rock (Rod Byrnes, Australia) 3 Neo Poker Bot (Alexander Lee, Spain) and Sartre (University of Auckland, New Zealand)
Bankroll Instant Run-off: 1 Hyperborean (University of Alberta, Canada) 2 Little Rock (Rod Byrnes, Australia) 3 Neo Poker Bot (Alexander Lee, Spain) and Sartre (University of Auckland, New Zealand)
Source: http://www.computerpokercompetition.org/index.php/competitions/results/9--results
Annual Computer Poker Competition
Total Bankroll = total money won against all agents
Bankroll Instant Run-off
1 Set S = all agents
2 Set N = number of agents in a game
3 Play every (|S| choose N) possible match between agents in S, storing each agent's total bankroll
4 Remove the agent(s) with the lowest total bankroll from S
5 Repeat steps 3 and 4 until S only contains N agents
6 Play a match between the last N agents and rank them according to their total bankroll in this game
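The elimination loop above can be sketched in a few lines. This is a hedged illustration, not the competition's code: `play_match` is a hypothetical stand-in that returns fixed zero-sum winnings from made-up agent "skills" so the run-off logic can be exercised.

```python
# Sketch of bankroll instant run-off ranking with a toy match function.
from itertools import combinations

def play_match(agents):
    """Hypothetical match: returns {agent: winnings}. Replace with real games."""
    skill = {"A": 3, "B": 1, "C": -1, "D": -3}      # made-up skills
    mean = sum(skill[a] for a in agents) / len(agents)
    return {a: skill[a] - mean for a in agents}     # zero-sum winnings

def instant_runoff(S, N):
    S = set(S)
    while len(S) > N:
        bankroll = {a: 0.0 for a in S}
        for table in combinations(sorted(S), N):    # every (|S| choose N) match
            for agent, won in play_match(table).items():
                bankroll[agent] += won
        worst = min(bankroll.values())
        S -= {a for a, b in bankroll.items() if b == worst}  # drop the worst
    final = play_match(tuple(sorted(S)))            # one last match
    return sorted(final, key=final.get, reverse=True)

print(instant_runoff({"A", "B", "C", "D"}, N=2))  # ['A', 'B'] with the toy skills
```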
Extensive-Form Game
A finite set of players N = {1, 2, ..., N} plus the chance player c
A finite set of action sequences or histories, e.g. H = {(), ..., (A♠A♥), ...}
Z ⊆ H are the terminal histories, e.g. Z = {..., (A♠A♥, 7♦2♣, r, F), ...}
A(h) = {a : (h, a) ∈ H} are the actions available after history h ∈ H\Z
P(h) ∈ N ∪ {c} is the player who takes an action after history h ∈ H\Z
u_i : Z → R is the utility function for player i
Extensive-Form Game
f_c maps every history h where P(h) = c to an independent probability distribution f_c(a | h) over all a ∈ A(h)
I_i is an information partition for player i (a set of nonempty subsets of the histories where i acts, with each such history in exactly 1 subset)
I_j ∈ I_i is player i's jth information set, containing indistinguishable histories, e.g. I_j = {..., (A♠A♥, 7♦2♣), ..., (A♠A♥, 6♠3♦), ...}
Player i's strategy σ_i is a function that assigns a distribution over A(I_j) to every I_j ∈ I_i, where A(I_j) = A(h) for any h ∈ I_j
A strategy profile σ is a strategy for each player: σ = {σ_1, σ_2, ..., σ_N}
Nash Equilibrium
Nash Equilibrium:
u_1(σ) ≥ max_{σ'_1 ∈ Σ_1} u_1(σ'_1, σ_2)
u_2(σ) ≥ max_{σ'_2 ∈ Σ_2} u_2(σ_1, σ'_2)
ɛ-Nash Equilibrium:
u_1(σ) + ɛ ≥ max_{σ'_1 ∈ Σ_1} u_1(σ'_1, σ_2)
u_2(σ) + ɛ ≥ max_{σ'_2 ∈ Σ_2} u_2(σ_1, σ'_2)
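The conditions above can be checked mechanically in a small game. A minimal sketch, using matching pennies rather than poker: compute the smallest ɛ for which a given profile is an ɛ-Nash equilibrium (best deviations can be taken over pure strategies only).

```python
# Smallest epsilon making a profile an epsilon-Nash equilibrium in
# matching pennies (2-player zero-sum; player 2's payoff is -u1).
A = [[1, -1],
     [-1, 1]]  # player 1's payoff matrix

def u1(p, q):
    """Expected payoff to player 1 for mixed strategies p (rows), q (cols)."""
    return sum(p[i] * q[j] * A[i][j] for i in range(2) for j in range(2))

def epsilon(p, q):
    """Smallest eps such that (p, q) satisfies both eps-Nash inequalities."""
    best1 = max(u1([1, 0], q), u1([0, 1], q))     # player 1's best deviation
    best2 = max(-u1(p, [1, 0]), -u1(p, [0, 1]))   # player 2's best deviation
    return max(best1 - u1(p, q), best2 - (-u1(p, q)))

print(epsilon([0.5, 0.5], [0.5, 0.5]))  # 0.0: the uniform profile is an exact NE
print(epsilon([0.6, 0.4], [0.5, 0.5]))  # positive: biased play is exploitable
```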
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 3 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 4 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 5 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 6 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 7 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 8 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - I = {I, I, I 7, I 8 } and I = {I 3, I 4, I 5, I 6 } A((J, J)) = {C, R} and P((J, J)) = 9 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - f c (J (J)) =.5 and f c (K (J)) =.5 σ (I, C) =.6 and σ (I, R) =.4 3 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - f c (J (J)) =.5 and f c (K (J)) =.5 σ (I, C) =.6 and σ (I, R) =.4 3 / 44
Extensive-Form Game.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - f c (J (J)) =.5 and f c (K (J)) =.5 σ (I, C) =.6 and σ (I, R) =.4 3 / 44
Counterfactual Regret Minimization
Counterfactual regret minimization minimizes the maximum counterfactual regret (over all actions) at every information set
Minimizing the counterfactual regrets minimizes overall regret
In a two-player zero-sum game at time T, if both players' average overall regret is less than ɛ, then the average strategy profile σ̄^T is a 2ɛ-Nash equilibrium
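Why low average regret gives an approximate equilibrium (a standard folk-theorem sketch, not from the slides; it uses only the zero-sum structure and linearity of u_1 in each player's strategy):

```latex
\frac{R_1^T}{T}
  = \max_{\sigma_1'} \frac{1}{T}\sum_{t=1}^{T}
    \bigl(u_1(\sigma_1', \sigma_2^t) - u_1(\sigma^t)\bigr) < \epsilon
\;\Rightarrow\;
u_1(\sigma_1', \bar{\sigma}_2^T) < \bar{v} + \epsilon
\quad\text{for all } \sigma_1',
\text{ where } \bar{v} = \tfrac{1}{T}\textstyle\sum_t u_1(\sigma^t).
\\[4pt]
\text{Symmetrically } u_2(\bar{\sigma}_1^T, \sigma_2') < -\bar{v} + \epsilon,
\text{ i.e. } u_1(\bar{\sigma}_1^T, \sigma_2') > \bar{v} - \epsilon;
\text{ taking } \sigma_2' = \bar{\sigma}_2^T
\text{ gives } u_1(\bar{\sigma}^T) > \bar{v} - \epsilon.
\\[4pt]
\text{Hence } u_1(\sigma_1', \bar{\sigma}_2^T)
  < u_1(\bar{\sigma}^T) + 2\epsilon
\text{ for every } \sigma_1'
\text{ (and likewise for player 2): a } 2\epsilon\text{-Nash equilibrium.}
```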
Counterfactual Regret Minimization
Counterfactual Value
v_i(I_j | σ) = Σ_{n ∈ I_j} π^σ_{−i}(root, n) u_i(n)
u_i(n) = Σ_{z ∈ Z[n]} π^σ(n, z) u_i(z)
v_i(I_j | σ) is the counterfactual value to player i of information set I_j given strategy profile σ
π^σ_{−i}(root, n) is the probability of reaching node n from the root, ignoring player i's contributions, according to strategy profile σ
π^σ(n, z) is the probability of reaching node z from node n according to strategy profile σ
u_i(n) is the payoff to player i at node n if it is a leaf node, or its expected payoff if it is a non-leaf node
Z[n] is the set of terminal nodes that can be reached from node n
Counterfactual Regret Minimization.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - v (I 8 σ) = π i(root, σ n)u (n) n I 8 =.5.5. (. +. ) +.5.5.9 (. +. ) =. 35 / 44
Counterfactual Regret Minimization
Counterfactual Regret
r(I_j, a) = v_i(I_j | σ_{I_j → a}) − v_i(I_j | σ)
r(I_j, a) is the counterfactual regret of not playing action a at information set I_j (σ_{I_j → a} is σ modified so that a is always played at I_j)
Positive regret means the player would have preferred to play action a rather than follow their strategy
Zero regret means the player was indifferent between their strategy and action a
Negative regret means the player preferred their strategy to playing action a
Counterfactual Regret Minimization.5J.5K.5J.5K.5J.5K I I I I.6C.4R.6C.4R.3C.7R.3C.7R.8C.R.F.C.C.9R.F.C.8C.R.F.C.C.9R.F.C -.F.C.F.C.F.C.F.C - v (I 8 σ F ) =.5.5. (. +. ) +.5.5.9 (. +. ) =.75 r (I 8 F ) = v (I 8 σ F ) v (I 8 σ) =.75. =.375 37 / 44
Counterfactual Regret Minimization
Cumulative Counterfactual Regret
R^T(I_j, a) = Σ_{t=1}^{T} r^t(I_j, a)
R^T(I_j, a) is the cumulative counterfactual regret of not playing action a at information set I_j over T time steps
Positive cumulative regret means the player would have preferred to play action a rather than follow their strategy over those T steps
Zero cumulative regret means the player was indifferent between their strategy and action a over those T steps
Negative cumulative regret means the player preferred their strategy to playing action a over those T steps
Counterfactual Regret Minimization
Regret Matching
σ^{T+1}(I_j, a) = R^{T,+}(I_j, a) / Σ_{a' ∈ A(I_j)} R^{T,+}(I_j, a') if the denominator is positive
σ^{T+1}(I_j, a) = 1 / |A(I_j)| otherwise
R^{T,+}(I_j, a) = max(R^T(I_j, a), 0)
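Regret matching is easy to see in isolation before wiring it into CFR. A minimal sketch, using self-play on rock-paper-scissors (one "information set" with 3 actions) rather than the slides' poker example:

```python
# Regret matching in self-play on rock-paper-scissors.
import random

ACTIONS = 3  # rock, paper, scissors
# PAYOFF[i][j] = payoff for choosing action i against action j
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def matched_strategy(cum_regret):
    """Regret matching: play proportionally to positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / ACTIONS] * ACTIONS

def train(iterations, seed=0):
    rng = random.Random(seed)
    regret = [0.0] * ACTIONS
    strategy_sum = [0.0] * ACTIONS
    for _ in range(iterations):
        strategy = matched_strategy(regret)
        for a in range(ACTIONS):
            strategy_sum[a] += strategy[a]
        me = rng.choices(range(ACTIONS), weights=strategy)[0]
        opp = rng.choices(range(ACTIONS), weights=strategy)[0]  # self-play
        for a in range(ACTIONS):  # regret of not having played a instead
            regret[a] += PAYOFF[a][opp] - PAYOFF[me][opp]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]  # average strategy over time

print(train(100_000))  # each component approaches the equilibrium value 1/3
```

The average strategy, not the final one, is what converges; the per-step strategies cycle, which is why CFR also reports the average profile σ̄^T.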
Counterfactual Regret Minimization
1 Initialise the strategy profile σ, e.g. for all i ∈ N, for all I_j ∈ I_i and for all a ∈ A(I_j), set σ(I_j, a) = 1 / |A(I_j)|
2 For each player i ∈ N, for all I_j ∈ I_i and for all a ∈ A(I_j), calculate r(I_j, a) and add it to R(I_j, a)
3 For each player i ∈ N, for all I_j ∈ I_i and for all a ∈ A(I_j), use regret matching to update σ(I_j, a)
4 Repeat from 2
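The loop above can be run end-to-end on a toy game. A hedged sketch, not the slides' implementation: Kuhn poker (3 cards, 1 per player, actions pass "p" and bet "b") stands in for the abstracted hold'em game, and regrets/average strategies are tabulated per (card, history) information set.

```python
# Vanilla counterfactual regret minimization on Kuhn poker.
import itertools

def terminal_utility(cards, history):
    """Utility to the player about to act, or None if not terminal."""
    if history in ("pp", "bb", "pbb"):            # showdown
        payoff = 1 if history == "pp" else 2
        me, opp = len(history) % 2, (len(history) + 1) % 2
        return payoff if cards[me] > cards[opp] else -payoff
    if history.endswith("bp") and len(history) >= 2:
        return 1                                  # opponent folded to a bet
    return None

node_regret, node_strategy_sum = {}, {}

def cfr(cards, history, p0, p1):
    """Returns the expected utility to the player about to act."""
    player = len(history) % 2
    util = terminal_utility(cards, history)
    if util is not None:
        return util
    key = (cards[player], history)                # the information set
    regret = node_regret.setdefault(key, [0.0, 0.0])
    positive = [max(r, 0.0) for r in regret]      # regret matching
    norm = sum(positive)
    strategy = [p / norm for p in positive] if norm > 0 else [0.5, 0.5]
    reach = p0 if player == 0 else p1
    ssum = node_strategy_sum.setdefault(key, [0.0, 0.0])
    action_util, node_util = [0.0, 0.0], 0.0
    for a, move in enumerate("pb"):
        ssum[a] += reach * strategy[a]            # accumulate average strategy
        if player == 0:
            action_util[a] = -cfr(cards, history + move, p0 * strategy[a], p1)
        else:
            action_util[a] = -cfr(cards, history + move, p0, p1 * strategy[a])
        node_util += strategy[a] * action_util[a]
    opp_reach = p1 if player == 0 else p0
    for a in range(2):                            # counterfactual regret update
        regret[a] += opp_reach * (action_util[a] - node_util)
    return node_util

def train(iterations):
    value = 0.0
    deals = list(itertools.permutations([1, 2, 3], 2))
    for _ in range(iterations):
        for cards in deals:                       # uniform chance over deals
            value += cfr(cards, "", 1.0, 1.0) / len(deals)
    return value / iterations

print(round(train(2000), 3))  # approaches -1/18, the game value of Kuhn poker
```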
Counterfactual Regret Minimization
Cumulative counterfactual regret is bounded by
R^T_i(I_j) ≤ (max_z u_i(z) − min_z u_i(z)) √|A(I_j)| √T
Total counterfactual regret is bounded by
R^T_i ≤ (max_z u_i(z) − min_z u_i(z)) |I_i| √(max_{h : P(h) = i} |A(h)|) √T
Counterfactual Regret Minimization
(a) Number of game states, number of iterations, computation time, and exploitability of the resulting strategy for different sized abstractions
(b) Convergence rates for three different sized abstractions; the x-axis shows iterations divided by the number of information sets in the abstraction
Source: 2008 - Regret Minimization in Games with Incomplete Information - Zinkevich et al
Summary
If you want to win (in expectation) at Texas Hold'em poker (against exploitable players) then...
1 Abstract the version of Texas Hold'em poker you are interested in so that it has at most around 10^12 game states
2 Run the counterfactual regret minimization algorithm on the abstraction for T iterations and obtain the average strategy profile σ̄^T_abs
3 Map the average strategy profile σ̄^T_abs for the abstracted game to a strategy profile σ̄^T for the real game
4 Play your average strategy profile σ̄^T against your (exploitable) opponents
References
1 Annual Computer Poker Competition Website - http://www.computerpokercompetition.org/
2 2008 - Regret Minimization in Games with Incomplete Information - Zinkevich et al - http://martin.zinkevich.org/publications/regretpoker.pdf
3 2007 - Robust Strategies and Counter-Strategies: Building a Champion Level Computer Poker Player - Johanson - http://poker.cs.ualberta.ca/publications/johanson.msc.pdf
4 2013 - Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision-Making in Large Extensive Form Games - Lanctot - http://era.library.ualberta.ca/public/view/item/uuid:48ae86c-45-4c-b9c-3e7ce9bc9ae