Bayes' Bluff: Opponent Modelling in Poker

Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, Darse Billings, Chris Rayner
Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8
{finnegan,bowling,larson,carm,burch,darse,rayner}@cs.ualberta.ca

Abstract

Poker is a challenging problem for artificial intelligence, with non-deterministic dynamics, partial observability, and the added difficulty of unknown adversaries. Modelling all of the uncertainties in this domain is not an easy task. In this paper we present a Bayesian probabilistic model for a broad class of poker games, separating the uncertainty in the game dynamics from the uncertainty of the opponent's strategy. We then describe approaches to two key subproblems: (i) inferring a posterior over opponent strategies given a prior distribution and observations of their play, and (ii) playing an appropriate response to that distribution. We demonstrate the overall approach on a reduced version of poker using Dirichlet priors and then on the full game of Texas hold'em using a more informed prior. We demonstrate methods for playing effective responses to the opponent, based on the posterior.

1 Introduction

The game of poker presents a serious challenge to artificial intelligence research. Uncertainty in the game stems from partial information, unknown opponents, and game dynamics dictated by a shuffled deck. Add to this the large space of possible game situations in real poker games such as Texas hold'em, and the problem becomes very difficult indeed. Among the more successful approaches to playing poker is the game-theoretic approach, approximating a Nash equilibrium of the game via linear programming [5, 1]. Even when such approximations are good, Nash solutions represent a pessimistic viewpoint in which we face an optimal opponent.
Human players, and even the best computer players, are certainly not optimal, having idiosyncratic weaknesses that can be exploited to obtain higher payoffs than the Nash value of the game. Opponent modelling attempts to capture these weaknesses so they can be exploited in subsequent play. Existing approaches to opponent modelling have employed a variety of techniques including reinforcement learning [4], neural networks [2], and frequentist statistics [3]. Additionally, earlier work on using Bayesian models for poker [6] attempted to classify the opponent's hand into one of a variety of broad hand classes. They did not model uncertainty in the opponent's strategy, using instead an explicit strategy representation. The strategy was updated based on empirical frequencies of play, but they reported little improvement due to this updating. We present a general Bayesian probabilistic model for hold'em poker games, completely modelling the uncertainty in the game and the opponent. We start by describing hold'em style poker games in general terms, and then give detailed descriptions of the casino game Texas hold'em along with a simplified research game called Leduc hold'em for which game-theoretic results are known. We formally define our probabilistic model and show how the posterior over opponent strategies can be computed from observations of play. Using this posterior to exploit the opponent is non-trivial and we discuss three different approaches for computing a response. We have implemented the posterior and response computations in both Texas and Leduc hold'em, using two different classes of priors: independent Dirichlet and an informed prior provided by an expert. We show results on the performance of these Bayesian methods, demonstrating that they are capable of quickly learning enough to exploit an opponent.

2 Poker

There are many variants of poker. We will focus on hold'em, particularly the heads-up limit game (i.e., two players with pre-specified bet and raise amounts).
A single hand consists of a number of rounds. (A more thorough introduction to the rules of poker can be found in [2].) In the first round, players are dealt a fixed number of private cards. In all rounds,
Figure 1: An example decision tree for a single betting round in poker with a two-bet maximum. Leaf nodes with open boxes continue to the next round, while closed boxes end the hand.

some fixed number (possibly zero) of shared, public board cards are revealed. The dealing and/or revealing of cards is followed by betting. The betting involves alternating decisions, where each player can either fold (f), call (c), or raise (r). If a player folds, the hand ends and the other player wins the pot. If a player calls, they place into the pot an amount to match what the other player has already placed in the pot (possibly nothing). If a player raises, they match the other player's total and then put in an additional fixed amount. The players alternate until a player folds, ending the hand, or a player calls (as long as the call is not the first action of the round), continuing the hand to the next round. There is a limit on the number of raises (or bets) per round, so the betting sequence has a finite length. An example decision tree for a single round of betting with a two-bet maximum is shown in Figure 1. Since folding when both players have equal money in the pot is dominated by the call action, we do not include this action in the tree. If neither player folds before the final betting round is over, a showdown occurs. The players reveal their private cards and the player who can make the strongest poker hand with a combination of their private cards and the public board cards wins the pot. Many games can be constructed with this simple format for both analysis (e.g., Kuhn poker [7] and Rhode Island hold'em [9]) and human play. We focus on the commonly played variant, Texas hold'em, along with a simplified and more tractable game we constructed called Leduc hold'em.

Texas Hold'em. The most common format for hold'em is Texas hold'em, which is used to determine the human world champion and is widely considered the most strategically complex variant. A standard 52-card deck is used. There are four betting rounds.
In the first round, the players are dealt two private cards. In the second round (or flop), three board cards are revealed. In the third round (turn) and fourth round (river), a single board card is revealed. We use a four-bet maximum, with fixed raise amounts of 10 units in the first two rounds and 20 units in the final two rounds. Finally, blind bets are used to start the first round. The first player begins the hand with 5 units in the pot and the second player with 10 units.

Leduc Hold'em. We have also constructed a smaller version of hold'em, which seeks to retain the strategic elements of the large game while keeping the size of the game tractable. In Leduc hold'em, the deck consists of two suits with three cards in each suit. There are two rounds. In the first round a single private card is dealt to each player. In the second round a single board card is revealed. There is a two-bet maximum, with raise amounts of 2 and 4 in the first and second round, respectively. Both players start the first round with 1 unit already in the pot.

Challenges. The challenges introduced by poker are many. The game involves several forms of uncertainty, including stochastic dynamics from a shuffled deck, imperfect information due to the opponent's private cards, and, finally, an unknown opponent. These uncertainties are individually difficult and together the difficulties only escalate. A related challenge is the problem of folded hands, which amount to partial observations of the opponent's decision-making contexts. This has created serious problems for some opponent modelling approaches, and our Bayesian approach will shed some light on the additional challenge that fold data imposes. A third key challenge is the high variance of payoffs, also known as luck. This makes it difficult for a program even to assess its performance over short periods of time. To aggravate this difficulty, play against human opponents is necessarily limited.
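To make the rules above concrete, here is a minimal sketch of Leduc hold'em's chance events and betting structure. The helper names are ours, not the paper's, and we assume the usual Leduc showdown convention that a private card pairing the board beats any unpaired hand, with the higher rank winning otherwise:

```python
import random

RANKS = ["J", "Q", "K"]                       # three ranks ...
DECK = [(r, s) for r in RANKS for s in "SH"]  # ... in two suits: 6 cards

def deal(rng=random):
    """One private card per player, plus the board card for round two."""
    c1, c2, board = rng.sample(DECK, 3)
    return c1, c2, board

def showdown_winner(c1, c2, board):
    """1 if P1 wins, 2 if P2 wins, 0 on a tie (assumed pair-beats-high)."""
    if c1[0] == board[0]:
        return 1
    if c2[0] == board[0]:
        return 2
    r1, r2 = RANKS.index(c1[0]), RANKS.index(c2[0])
    return 0 if r1 == r2 else (1 if r1 > r2 else 2)

def betting_sequences(max_raises=2):
    """All legal betting sequences for one round (f=fold, c=call, r=raise).

    A call that is not the round's first action ends the round; folding
    with no raise outstanding is dominated and excluded (cf. Figure 1).
    """
    done = []

    def extend(seq, raises, facing_bet):
        if seq and (seq[-1] == "f" or (seq[-1] == "c" and len(seq) > 1)):
            done.append("".join(seq))
            return
        actions = (["f"] if facing_bet else []) + ["c"]
        if raises < max_raises:
            actions.append("r")
        for a in actions:
            extend(seq + [a], raises + (a == "r"), a == "r")

    extend([], 0, False)
    return done
```

With a two-bet maximum this yields nine sequences per round, matching the decision tree of Figure 1.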
If no more than two or three hundred hands are to be played in total, opponent modelling must be effective using only very small amounts of data. Finally, Texas hold'em is a very large game. It has on the order of 10^18 states [1], which makes even straightforward calculations, such as best response, non-trivial.

3 Modelling the Opponent

We will now describe our probabilistic model for poker. In all of the following discussion, we will assume that Player 1 (P1) is modelling its opponent, Player 2 (P2), and that all incomplete observations due to folding are from P1's perspective.

3.1 Strategies

In game-theoretic terms, a player makes decisions at information sets. In poker, information sets consist of the actions taken by all players so far, the public cards revealed so far, and the player's own private cards. A behaviour strategy specifies a distribution over the possible actions
for every information set of that player. Leaving aside the precise form of these distributions for now, we denote P1's complete strategy by α and P2's by θ. We make the following simplifying assumptions regarding the player strategies. First, P2's strategy is stationary. This is an unrealistic assumption, but modelling stationary opponents in full-scale poker is still an open problem. Even the most successful approaches make the same assumption or use simple methods such as decaying histories to accommodate opponent drift. However, we believe this framework can be naturally extended to dynamic opponents by constructing priors that explicitly model changes in opponent strategy. The second assumption is that the players' strategies are independent. More formally, P(α, θ) = P(α)P(θ). This assumption, implied by the stationarity, is also unrealistic. However, modelling opponents that learn, and effectively deceiving them, is a difficult task even in very small games, and we defer such efforts until we are sure of effective stationary opponent modelling. Finally, we assume the deck is uniformly distributed, i.e., the game is fair. These assumptions imply that all hands are i.i.d. given the strategies of the players.

3.2 Hands

The following notation is used for hand information. We consider a hand, H, with k decisions by each player. Each hand, as observed by an oracle with perfect information, is a tuple H = (C, D, R_{1:k}, A_{1:k}, B_{1:k}), where C and D denote P1's and P2's private cards, R_i is the set (possibly empty) of public cards dealt before either player makes their ith decision, and A_i and B_i denote P1's and P2's ith decisions (fold, call, or raise). We can model any limit hold'em style poker with these variables. A hand runs to at most k decisions. The fact that particular hands may have fewer real decisions (e.g., a player may call and end the current betting round, or fold and end the hand) can be handled by padding the decisions with specific values (e.g., once a player has folded, all subsequent decisions by both players are assumed to be folds). Probabilities in the players' strategies for these padding decisions are forced to 1. Furthermore, the public cards for a decision point (R_i) can be the empty set, so that multiple decisions constituting a single betting round can occur between revealed public cards. These special cases are quite straightforward and allow us to model the variable-length hands found in real games with fixed-length tuples.

3.3 Probability of Observations

Suppose a hand is fully observed, i.e., a showdown occurs. The probability of a particular showdown hand H_s occurring given the opponent's strategy is

P(H_s | θ) = P(C, D, R_{1:k}, A_{1:k}, B_{1:k} | θ)
  = P(D | C) P(C) ∏_{i=1}^{k} P(B_i | D, R_{1:i}, A_{1:i}, B_{1:i-1}, θ) P(A_i | C, R_{1:i}, A_{1:i-1}, B_{1:i-1}) P(R_i | C, D, R_{1:i-1})
  = P(D | C) P(C) [ ∏_{i=1}^{k} θ_{Z_i,D,B_i} α_{Y_i,C,A_i} ] [ ∏_{i=1}^{k} P(R_i | C, D, R_{1:i-1}) ]
  = p_showcards ∏_{i=1}^{k} θ_{Z_i,D,B_i} α_{Y_i,C,A_i},

where, for notational convenience, we separate the information sets for P1 (P2) into a public part Y_i (Z_i) and a private part C (D):

Y_i = (R_{1:i}, A_{1:i-1}, B_{1:i-1})    Z_i = (R_{1:i}, A_{1:i}, B_{1:i-1}).

Here α_{Y_i,C,A_i} is the probability of taking action A_i in the information set (Y_i, C), dictated by P1's strategy α; a similar interpretation applies to the subscripted θ. The constant p_showcards depends only on the number of cards dealt to the players and the number of public cards revealed. This simplification is possible because the deck has a uniform distribution and the number of cards revealed is the same for all showdowns. Notice that the final unnormalized probability depends only on θ. (Strictly speaking, this should be P(H_s | θ, α), but we drop the conditioning on α here and elsewhere to simplify the notation.)

Now consider a hand where either player folds. In this case, we do not observe P2's private cards, D. We must marginalize away this hidden variable by summing over all possible sets of cards P2 could hold.
The probability of a particular fold hand H_f occurring is

P(H_f | θ) = P(C, R_{1:k}, A_{1:k}, B_{1:k} | θ)
  = P(C) ∑_D [ P(D | C) ∏_{i=1}^{k} P(B_i | D, R_{1:i}, A_{1:i}, B_{1:i-1}, θ) P(A_i | C, R_{1:i}, A_{1:i-1}, B_{1:i-1}) P(R_i | C, D, R_{1:i-1}) ]
  = p_foldcards(H_f) [ ∏_{i=1}^{k} α_{Y_i,C,A_i} ] ∑_D ∏_{i=1}^{k} θ_{Z_i,D,B_i},

where the D are the sets of cards that P2 could hold given the observed C and R (i.e., all sets D that do not intersect with C ∪ R), and p_foldcards(H_f) is a function that depends only on the number of cards dealt to the players and the number of public cards revealed before the hand ended. It does not depend on the specific cards dealt or the players' strategies. Again, the unnormalized probability depends only on θ.

3.4 Posterior Distribution Over Opponent Strategies

Given a set O = O_s ∪ O_f of observations, where O_s are the observations of hands that led to showdowns and O_f are the observations of hands that led to folds, we wish to compute the posterior distribution over the space of opponent strategies. A simple application of Bayes' rule gives us

P(θ | O) = P(O | θ) P(θ) / P(O)
  = [ P(θ) / P(O) ] ∏_{H_s ∈ O_s} P(H_s | θ) ∏_{H_f ∈ O_f} P(H_f | θ)
  ∝ P(θ) ∏_{H_s ∈ O_s} P(H_s | θ) ∏_{H_f ∈ O_f} P(H_f | θ).

4 Responding to the Opponent

Given a posterior distribution over the opponent's strategy space, the question of how to compute an appropriate response remains. We present several options with varying computational burdens. In all cases we compute a response at the beginning of the hand and play it for the entire hand.

4.1 Bayesian Best Response

The fully Bayesian answer to this question is to compute the best response to the entire distribution. We will call this the Bayesian Best Response (BBR). The objective here is to maximize the expected value over all possible hands and opponent strategies, given our past observations of hands.
We start with a simple objective,

BBR(O) = argmax_α E_{H|O}[ V(H) ]
  = argmax_α ∑_{H ∈ ℋ} V(H) P(H | O, α)
  = argmax_α ∑_{H ∈ ℋ} V(H) ∫_θ P(H | θ, α, O) P(θ | O)
  = argmax_α ∑_{H ∈ ℋ} V(H) ∫_θ P(H | θ, α, O) P(O | θ) P(θ)
  = argmax_α ∑_{H ∈ ℋ} V(H) ∫_θ P(H | θ, α, O) P(θ) ∏_{H_s ∈ O_s} P(H_s | θ) ∏_{H_f ∈ O_f} P(H_f | θ),

where ℋ is the set of all possible perfectly observed hands (in effect, the set of all hands that could be played), and the normalizer P(O) is constant in α and so can be dropped. Although not immediately obvious from the equation above, one algorithm for computing the Bayesian best response is a form of Expectimax [8], which we will now describe. Begin by constructing the tree of possible observations in the order they would be observed by P1, including P1's cards, public cards, P2's actions, and P1's actions. At the bottom of the tree will be an enumeration of P2's cards for both showdown and fold outcomes. We can back up values to the root of the tree while computing the best-response strategy. For a leaf node, the value should be the payoff to P1 multiplied by the probability of P2's actions reaching this leaf given the posterior distribution over strategies. For an internal node, calculate the value from its children based on the type of node. For a P2 action node or a public card node, the value is the sum of the children's values. For a P1 action node, the value is the maximum of its children's values, and the best-response strategy assigns probability one to the action that leads to the maximal child for that node's information set. Repeat until every node has been assigned a value, which implies that every P1 information set has been assigned an action. More formally, Expectimax computes the following value for the root of the tree:

∑_{R_1} max_{A_1} ∑_{B_1} ⋯ ∑_{R_k} max_{A_k} ∑_{B_k} ∑_D V(H) ∫_θ [ ∏_{i=1}^{k} θ_{Z_i,D,B_i} ] P(O | θ) P(θ).

This corresponds to Expectimax, with the posterior inducing a probability distribution over actions at P2's action nodes. It now remains to prove that this version of Expectimax
computes the BBR. This will be done by showing that

max_α ∑_{H} V(H) ∫_θ P(H | θ, α) P(O | θ) P(θ)
  = ∑_{R_1} max_{A_1} ∑_{B_1} ⋯ ∑_{R_k} max_{A_k} ∑_{B_k} ∑_D V(H) ∫_θ [ ∏_{i=1}^{k} θ_{Z_i,D,B_i} ] P(O | θ) P(θ).

First we rewrite max_α as max_{α(1)} ⋯ max_{α(k)}, where max_{α(i)} is a max over the set of all parameters in α that govern the ith decision. Then, because max_x ∑_y f(x, y) ≤ ∑_y max_x f(x, y), we get

max_{α(1)} ⋯ max_{α(k)} ∑_{R_1,A_1,B_1} ⋯ ∑_{R_k,A_k,B_k} ∑_D
  ≤ ∑_{R_1} max_{α(1)} ⋯ max_{α(k)} ∑_{A_1,B_1} ∑_{R_2} ⋯ ∑_{R_k} ∑_{A_k,B_k} ∑_D
  ≤ ⋯ ≤ ∑_{R_1} max_{α(1)} ∑_{A_1,B_1} ⋯ ∑_{R_k} max_{α(k)} ∑_{A_k,B_k} ∑_D.

Second, we note that the only factors of P(H | θ, α) P(O | θ) P(θ) depending on α are of the form α_{Y_i,C,A_i}. We can distribute the parameters from α to obtain

∑_{R_1} max_{α(1)} ∑_{A_1} α_{Y_1,C,A_1} ∑_{B_1} ⋯ ∑_{R_k} max_{α(k)} ∑_{A_k} α_{Y_k,C,A_k} ∑_{B_k} ∑_D V(H) ∫_θ [ ∏_{i=1}^{k} θ_{Z_i,D,B_i} ] P(O | θ) P(θ)
  = ∑_{R_1} max_{A_1} ∑_{B_1} ⋯ ∑_{R_k} max_{A_k} ∑_{B_k} ∑_D V(H) ∫_θ [ ∏_{i=1}^{k} θ_{Z_i,D,B_i} ] P(O | θ) P(θ),

which is the Expectimax algorithm. This last step is possible because the parameters in α must sum to one over all possible actions at a given information set: the maximizing parameter setting is to take the highest-valued action with probability 1, so the max over α(i) becomes a max over A_i and the inequalities above hold with equality.

Computing the integral over opponent strategies depends on the form of the prior but is difficult in any event. For Dirichlet priors (see Section 5), it is possible to compute the posterior exactly, but the calculation is expensive except for small games with relatively few observations. This makes the exact BBR an ideal goal rather than a practical approach. For real play, we must consider approximations to BBR. One straightforward approach to approximating BBR is to approximate the integral over opponent strategies by importance sampling, using the prior as the proposal distribution:

∫_θ P(H | θ, α, O) P(O | θ) P(θ) ≈ (1/m) ∑_{j=1}^{m} P(H | θ_j, α, O) P(O | θ_j),

where the θ_j are sampled from the prior, P(θ). More effective Monte Carlo techniques might be possible, depending on the prior used. Note that P(O | θ_j) need only be computed once for each θ_j, while the much smaller computation P(H | θ_j, α, O) must be computed for every possible hand. The running time of computing the posterior for a strategy sample scales linearly in the number of samples used in the approximation, and the update is constant time for each hand played. This tractability facilitates other approximate response techniques.
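For a finite sample θ_1, ..., θ_m from the prior, the quantities above reduce to products of looked-up action probabilities. The following sketch (the dict-based strategy layout and helper names are ours, not the paper's) computes the unnormalized weights P(O | θ_j) from showdown and fold observations, and the posterior-averaged probability of a P2 action sequence, which is the quantity needed at the leaves of the Expectimax tree:

```python
def p2_actions_prob(theta, actions):
    """Product of theta's probabilities for a P2 action sequence.

    theta: dict mapping P2 information sets to dicts of action probabilities.
    actions: list of (info_set, action) pairs, info_set encoding (Z_i, D).
    """
    p = 1.0
    for info_set, action in actions:
        p *= theta[info_set][action]
    return p

def observation_likelihood(theta, showdowns, folds):
    """P(O | theta) up to constants that do not depend on theta.

    showdowns: list of fully observed P2 action sequences.
    folds: list of dicts mapping each candidate private holding D to the
    P2 action sequence implied by that holding (summed out, as in Sec. 3.3).
    """
    like = 1.0
    for actions in showdowns:
        like *= p2_actions_prob(theta, actions)
    for by_holding in folds:
        like *= sum(p2_actions_prob(theta, a) for a in by_holding.values())
    return like

def posterior_action_prob(thetas, weights, actions):
    """Importance-sampled posterior mean of P(actions | theta): the prior
    is the proposal, so each sample is weighted by P(O | theta_j)."""
    total = sum(weights)
    return sum(
        p2_actions_prob(theta, actions) * w
        for theta, w in zip(thetas, weights)
    ) / total
```

As noted above, `observation_likelihood` need only be updated once per sample when a hand finishes, after which `posterior_action_prob` is a cheap per-leaf computation.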
4.2 Max A Posteriori Response

An alternative goal to BBR is to find the maximum a posteriori (MAP) strategy of the opponent and compute a best response to that strategy. Computing a true MAP strategy for the opponent is also hard, so it is more practical to approximate this approach by sampling a set of strategies from the prior and finding the most probable amongst that set. This sampled strategy is taken to be an estimate of a MAP strategy, and a best response to it is computed and played. MAP is potentially dangerous for two reasons. First, if the distribution is multimodal, a best response to any single mode may be suboptimal. Second, repeatedly playing any single strategy may never fully explore the opponent's strategy.

4.3 Thompson's Response

A potentially more robust alternative to MAP is to sample a strategy from the posterior distribution and play a best response to that strategy. As with BBR and MAP, sampling the posterior directly may be difficult. Again we can use importance sampling, but in a slightly different way. We sample a set of opponent strategies from the prior, compute their posterior probabilities, and then sample one strategy according to those probabilities.
P(i) = P(θ_i | O) / ∑_j P(θ_j | O).

This approach was first proposed by Thompson [10]. Thompson's response has some probability of playing a best response to any nonzero-probability opponent strategy and so offers more robust exploration.

5 Priors

As with all Bayesian approaches, the resulting performance and efficiency depend on the choice of prior. Obviously the prior should capture our beliefs about the strategy of our opponent. The form of the prior also determines the tractability of (i) computing the posterior, and (ii) responding with the model. As the two games of hold'em are considerably different in size, we explore two different priors.

Independent Dirichlet. The game of Leduc hold'em is sufficiently small that we can have a fully parameterized model, with well-defined priors at every information set. Dirichlet distributions offer a simple prior for multinomials, which is a natural description for action probabilities. Any strategy (in behavioural form) specifies a multinomial distribution over legal actions for every information set. Our prior over strategies, which we will refer to as an independent Dirichlet prior, consists of independent Dirichlet distributions for each information set. We use Dirichlet(2, 2, 2) distributions, whose mode is the multinomial (1/3, 1/3, 1/3) over fold, call, and raise.

Informed. In the Texas hold'em game, priors with independent distributions for each information set are both intractable and ineffective. The size of the game virtually guarantees that one will never see the same information set twice. Any useful inference must be across information sets, and the prior must encode how the opponent's decisions at information sets are likely to be correlated. We therefore employ an expert-defined prior that we will refer to as an informed prior. The informed prior is based on a ten-dimensional recursive model. That is, by specifying values for two sets of five intuitive parameters (one set for each player), a complete strategy is defined.
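Tying the response methods of Section 4 to the independent Dirichlet prior, the sampled-strategy machinery is only a few lines. This sketch uses our own helper names and a per-sample weight list proportional to the posterior, as in the importance-sampling scheme above:

```python
import random

def sample_dirichlet_strategy(info_sets, alpha=(2.0, 2.0, 2.0), rng=random):
    """Sample a behaviour strategy from an independent Dirichlet prior:
    one Dirichlet(2, 2, 2) draw over (fold, call, raise) per information
    set, built from normalized gamma variates."""
    strategy = {}
    for info_set in info_sets:
        draws = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(draws)
        strategy[info_set] = {
            act: d / total for act, d in zip(("f", "c", "r"), draws)
        }
    return strategy

def map_response_index(weights):
    """MAP (within the sample): index of the most probable strategy."""
    return max(range(len(weights)), key=weights.__getitem__)

def thompson_response_index(weights, rng=random):
    """Thompson's response: draw index i with probability
    P(i) = w_i / sum_j w_j."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

A best response to the selected strategy would then be computed and played for the hand, per Sections 4.2 and 4.3.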
Table 1 summarizes the expert-defined meaning of these five parameters. From the modelling perspective, we can simply consider this expert abstraction to provide us with a mapping from some low-dimensional parameter space to the space of all strategies. By defining a density over this parameter space, the mapping specifies a resulting density over behaviour strategies, which serves as our prior. In this paper we use an independent Gaussian distribution over the parameter space with means and variances chosen by a domain expert. We omit further details of this model because it is not the intended contribution of this paper but rather a means to demonstrate our approach on the large game of Texas hold'em.

Table 1: The five parameter types in the informed prior parameter space. A corresponding set of five is required to specify the opponent's model of how we play.

Parameter  Description
r0         Fraction of the opponent's strength distribution that must be exceeded to raise after 0 bets (i.e., to initiate betting).
r1         Fraction of the opponent's strength distribution that must be exceeded to raise after >0 bets (i.e., to raise).
b          Fraction of the game-theoretic optimal bluff frequency.
f          Fraction of the game-theoretic optimal fold frequency.
t          Trap (slow-play) frequency.

6 Experimental Setup

We tested our approach on both Leduc hold'em with the Dirichlet prior and Texas hold'em with the informed prior. For the Bayesian methods, we used all three responses (BBR, MAP, and Thompson's) on Leduc and the Thompson's response for Texas (BBR has not been implemented for Texas, and MAP's behaviour is very similar to Thompson's, as we will describe below). For all Bayesian methods, strategies were sampled from the prior at the beginning of each trial and used throughout the trial. We have several players for our study. Opti is a Nash (or minimax) strategy for the game. In the case of Leduc, this has been computed exactly. We also sampled opponents from our priors in both Leduc and Texas, which we will refer to as Priors.
In the experiments shown, a new opponent was sampled for each trial (200 hands), so results are averaged over many samples from the priors. Both Priors and Opti are static players. Finally, for state-of-the-art opponent modelling, we used Frequentist (also known as Vexbot), described fully in [3] and implemented for Leduc. All experiments consisted of running two players against each other for two hundred hands per trial and recording the bankroll (accumulated winnings/losses) at each hand. These results were averaged over multiple trials (1000 trials for all Leduc experiments and 280 trials for the Texas experiments). We present two kinds of plots. The first is simply average bankroll per number of hands played. A straight line on such a plot indicates a constant winning rate. The second is the average winning rate per number of hands played (i.e., the first derivative of the average bankroll).

Figure 2: Leduc hold'em: Avg. Bankroll per hands played for BBR, MAP, Thompson's, Opti, and Frequentist vs. Priors.

Figure 3: Leduc hold'em: Avg. Winning Rate per hands played for BBR, MAP, Thompson's, Opti, and Frequentist vs. Priors.

This allows one to see the effects of learning more directly, since positive changes in slope indicate improved exploitation of the opponent. Note that winning rates for small numbers of hands are very noisy, so it is difficult to interpret the early results. All results are expressed in raw pot units (e.g., bets in the first and second rounds of Leduc are 2 and 4 units, respectively).

7 Results

7.1 Leduc Hold'em

Figures 2 and 3 show the average bankroll and average winning rate for Leduc against opponents sampled from the prior (a new opponent each trial). For such an opponent, we can compute a best response, which represents the best possible exploitation of the opponent. In complement, the Opti strategy shows the most conservative play, assuming that the opponent plays perfectly and making no attempt to exploit any possible weakness. This nicely bounds our results in these plots. Results are given for Best Response, BBR, MAP, Thompson's, Opti, and Frequentist. As we would expect, the Bayesian players do well against opponents drawn from their prior, with little difference between the three response types in terms of bankroll. The winning rates show that MAP and Thompson's converge within the first ten hands, whereas BBR is more erratic and takes longer to converge. The uninformed Frequentist is clearly behind. The independent Dirichlet prior is very broad, admitting a wide variety of opponents. It is encouraging that the Bayesian approach is able to exploit even this weak information to achieve a better result. However, it is unfair to make strong judgements on the basis of these results since, in general, playing versus its prior is the best possible scenario for the Bayesian approach.
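The two plot types are simple transformations of the recorded per-hand payoffs. A minimal sketch (the data layout is illustrative; payoffs are in pot units):

```python
def average_bankroll(trials):
    """Average accumulated winnings after each hand, over trials.

    trials: list of per-hand payoff sequences (pot units), one per trial.
    """
    n_hands = len(trials[0])
    running = [0.0] * len(trials)
    avg = []
    for h in range(n_hands):
        for t, payoffs in enumerate(trials):
            running[t] += payoffs[h]
        avg.append(sum(running) / len(trials))
    return avg

def winning_rate(avg_bankroll, window=1):
    """First difference of the average bankroll (units won per hand);
    a larger window smooths the noisy early-hand estimates."""
    return [
        (avg_bankroll[h] - avg_bankroll[h - window]) / window
        for h in range(window, len(avg_bankroll))
    ]
```

A constant winning rate then appears as a straight bankroll line, and a positive change in slope shows up directly in the rate plot.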
Figures 4 and 5 show bankroll and winning rate results for BBR, MAP, Thompson's, Opti, and Frequentist versus Opti on Leduc hold'em. Note that, on average, a positive bankroll against Opti is impossible, although sample variance allows for it in our experiments. From these plots we can see that the three Bayesian approaches behave very similarly. This is due to the fact that the posterior distribution over our sample of strategies concentrates very rapidly on a single strategy. Within less than 20 hands, one strategy dominates the rest. This means that the three responses become very similar (Thompson's is almost certain to pick the MAP strategy, and BBR puts most of its weight on the MAP strategy). Larger sample sizes would mitigate this effect. The winning rate graphs also show little difference between the three Bayesian players. Frequentist performs slightly worse than the Bayesian approaches. The key problem with it is that it can form models of the opponent that are not consistent with any behavioural strategy (e.g., it can be led to believe that its opponent can always show a winning hand). Such incorrect beliefs, untempered by any prior, can lead it to fold with high probability in certain situations. Once it starts folding, it can never make the observations required to correct its mistaken belief. Opti, of course, breaks even against itself. On the whole, independent Dirichlet distributions are a poor prior for play against the Opti solution, but we see a slight improvement over the pure frequentist approach. Our final Leduc results are shown in Figure 6, playing against the Frequentist opponent. These results are included for the sake of interest. Because the Frequentist opponent is not stationary, it violates the assumptions upon which the Bayesian (and, indeed, the Frequentist) player are based. We cannot draw any real conclusions from this data. It is interesting, however, that the BBR response is substantially worse than MAP or Thompson's.
Figure 4: Leduc hold'em: Avg. Bankroll per hands played for BBR, MAP, Thompson's, Opti, and Frequentist vs. Opti.

Figure 5: Leduc hold'em: Avg. Winning Rate per hands played for BBR, MAP, Thompson's, Opti, and Frequentist vs. Opti.

Figure 6: Leduc hold'em: Avg. Bankroll per hands played for BBR, MAP, Thompson's, and Opti vs. Frequentist.

Figure 7: Leduc hold'em: Avg. Winning Rate per hands played for BBR, MAP, Thompson's, and Opti vs. Frequentist.

It seems likely that the posterior distribution does not converge quickly against a non-stationary opponent, leading BBR to respond to several differing strategies simultaneously. Because the prior is independent for every information set, these various strategies could be giving radically different advice in many contexts, preventing BBR from generating a focused response. MAP and Thompson's necessarily generate more focused responses. We show winning rates in Figure 7 for the sake of completeness, with the same caveat regarding non-stationarity.

7.2 Texas Hold'em

Figure 8 shows bankroll results for Thompson's, Opti, and Frequentist versus opponents sampled from the informed prior for Texas hold'em. Here Thompson's and Frequentist give very similar performance, although there is a small advantage to Thompson's late in the run. It is possible that even with the more informed prior, two hundred hands does not provide enough information to effectively concentrate the posterior on good models of the opponent in this larger game. It may be that priors encoding strong correlations between many information sets are required to gain a substantial advantage over the Frequentist approach.

8 Conclusion

This research has presented a Bayesian model for hold'em style poker, fully modelling both game dynamics and opponent strategies. The posterior distribution has been described and several approaches for computing appropriate responses considered.
Opponents in both Texas hold'em and Leduc hold'em have been played against, using Thompson's sampling for Texas hold'em and approximate Bayesian best response, MAP, and Thompson's for Leduc hold'em. These results show that, for opponents drawn from our prior, the posterior captures them rapidly and the subsequent response is able to exploit the opponent, even in just 200 hands. On Leduc, the approach performs favourably compared with state-of-the-art opponent modelling techniques against prior-drawn opponents and a Nash equilibrium. Both approaches can play quickly enough for real-time play against humans. The next major step in advancing the play of these systems is constructing better informed priors capable of modelling more challenging opponents. Potential sources for such priors include approximate game-theoretic strategies, data mined from logged human poker play, and more sophisticated modelling by experts. In particular, priors that are capable of capturing correlations between related information sets would allow for generalization of observations over unobserved portions of the game. Finally, extending the approach to non-stationary opponents is under active investigation.

Figure 8: Texas hold'em: Avg. Bankroll per hands played for Thompson's, Frequentist, and Opti vs. Priors.

Acknowledgements

We would like to thank Rob Holte, Dale Schuurmans, Nolan Bard, and the University of Alberta poker group for their insights. This work was funded by the Alberta Ingenuity Centre for Machine Learning, iCORE, and NSERC.

References

[1] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. In Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 2003), 2003.

[2] D. Billings, A. Davidson, J. Schaeffer, and D. Szafron. The challenge of poker. Artificial Intelligence, 134(1-2):201-240, 2002.

[3] Darse Billings, Aaron Davidson, Terrance Schauenberg, Neil Burch, Michael Bowling, Rob Holte, Jonathan Schaeffer, and Duane Szafron. Game tree search with adaptation in stochastic imperfect information games. In Yngvi Bjornsson, Nathan Netanyahu, and Jaap van den Herik, editors, Computers and Games'04. Springer-Verlag, 2004.

[4] Fredrik A. Dahl. A reinforcement learning algorithm applied to simplified two-player Texas hold'em poker. In Proceedings of the 12th European Conference on Machine Learning (ECML-2001), pages 85-96, September 2001.

[5] D. Koller and A. Pfeffer. Representations and solutions for game-theoretic problems. Artificial Intelligence, 94(1):167-215, 1997.

[6] K. Korb, A. Nicholson, and N. Jitnah. Bayesian poker. In Uncertainty in Artificial Intelligence, pages 343-350, 1999.

[7] H. W. Kuhn. A simplified two-person poker. Contributions to the Theory of Games, 1:97-103, 1950.

[8] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 2003.

[9] J. Shi and M. Littman. Abstraction models for game theoretic poker. In Computers and Games. Springer-Verlag, 2001. To appear.

[10] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285-294, 1933.