
Deep Learning for Poker: Inference From Patterns in an Adversarial Environment
Nikolai Yakovenko, PokerPoker LLC
CU Neural Networks Reading Group, Dec 2, 2015

This work is not complicated. Fully explaining the problem would take all the available time, so please interrupt, for clarity and with suggestions!

Convolutional Network for Poker
Our approach:
- 3D tensor representation for any poker game
- Learn from self-play
- Stronger than a rule-based heuristic
- Competitive with expert human players
- Data-driven, gradient-based approach

Poker as a Function
Inputs: private cards, public cards, previous bets (public).
Action policy: Bet/Raise = X%, Check/Call = Y%, Fold = Z%, with X + Y + Z = 100%.
Action values: Bet = $X, Check = $Y, Fold = $0.0.
Explore & exploit.
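
A rough sketch of this input-to-policy/value mapping (all names here are illustrative, not from the released code):

```python
# Minimal sketch of the "poker as a function" interface described above.
# All names are illustrative; the released deep_draw code differs in detail.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PokerState:
    private_cards: List[str]    # e.g. ['Ah', 'Qs']
    public_cards: List[str]     # e.g. ['As', '9s', '6s']
    previous_bets: List[float]  # public betting history

def act(state: PokerState) -> Dict[str, Dict[str, float]]:
    """Map a game state to an action policy and action values."""
    # A real model replaces these constants with network outputs.
    policy = {'bet_raise': 0.50, 'check_call': 0.40, 'fold': 0.10}  # sums to 100%
    values = {'bet_raise': 1.25, 'check_call': 0.90, 'fold': 0.0}   # fold is always $0.0
    return {'policy': policy, 'values': values}
```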

Poker as a Turn-Based Video Game
[Diagram: each decision point offers Call, Raise, or Fold, with associated rewards.]

A Special Case of Atari Games?
[Diagram: Input → Convolutional Network → Action Values]

Value Estimate Before Every Action
- Frame ~ turn-based poker action
- Discounted reward ~ value of the hand before the next action [how much you'd sell it for?]

More Specifically
Our network plays three poker games:
- Casino video poker
- Heads-up (1 on 1) limit Texas Hold'em
- Heads-up (1 on 1) limit 2-7 Triple Draw
It can learn other heads-up limit games, and we are working on heads-up no-limit Texas Hold'em. Let's focus on Texas Hold'em.

Texas Hold'em
Private cards → Flop (public) → Turn → River → Showdown, with a betting round at each stage. The best 5-card hand wins [example: hero's flush beats the opponent's two pairs at showdown].

Representation: Cards as 2D Tensors
Each set of cards is encoded as a binary suit x rank matrix (rows: clubs, diamonds, hearts, spades; columns: deuce through ace), zero-padded to 17x17.
[Diagram: private cards [AhQs]; private cards plus flop [AhQs]+[As9s6s], showing a flush draw and a pair of aces; the full hand [AhQsAs9s6s9c2s] at showdown, showing a flush.]
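
A minimal sketch of this encoding in Python, assuming the suit-by-rank layout described above (the exact padding offsets in the released code may differ):

```python
import numpy as np

SUITS = 'cdhs'            # clubs, diamonds, hearts, spades
RANKS = '23456789TJQKA'   # 13 ranks, deuce through ace

def cards_to_plane(cards, height=17, width=17):
    """Encode a set of cards as one binary suit-x-rank matrix,
    zero-padded to 17x17 (sketch of the slide's layout)."""
    plane = np.zeros((height, width), dtype=np.float32)
    for card in cards:                      # e.g. 'Ah' = ace of hearts
        rank, suit = card[0], card[1]
        plane[SUITS.index(suit), RANKS.index(rank.upper())] = 1.0
    return plane

# One plane per card group (private cards, flop, turn, river, ...)
# stacks into the 6 x 17 x 17 input tensor described below.
hole = cards_to_plane(['Ah', 'Qs'])
board = cards_to_plane(['As', '9s', '6s'])
```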

Convnet for Texas Hold'em Basics
Input → convolutions → max pool → conv → pool → dense layer (rectified linear units) → 50% dropout → output layer.
Input: private cards + public cards, no bets (6 x 17 x 17 3D tensor).
Output: win % against a random hand, and the probability of each hand category (pair, two pairs, flush, etc.).
98.5% accuracy after 10 epochs (500k Monte Carlo examples).
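
A sketch of this pipeline, written in PyTorch for brevity (the released code uses Theano/Lasagne; the filter counts and layer sizes here are illustrative):

```python
import torch.nn as nn

# Sketch of the slide's conv -> pool -> conv -> pool -> dense -> dropout
# pipeline. Layer widths are illustrative, not the paper's exact values.
class PokerCNN(nn.Module):
    def __init__(self, in_channels=6, n_outputs=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # convolutions
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # max pool
            nn.Conv2d(16, 32, kernel_size=3, padding=1),           # conv
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # pool
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 256),   # dense layer
            nn.ReLU(),                    # rectified linear units
            nn.Dropout(0.5),              # 50% dropout
            nn.Linear(256, n_outputs),    # e.g. win% and hand-category logits
        )

    def forward(self, x):                 # x: (batch, 6, 17, 17)
        return self.head(self.features(x))
```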

What About the Adversary?
Our network learned the Texas Hold'em probabilities. Can it learn to bet against an opponent? Three strategies:
- Solve for equilibrium in the 2-player game [huge state space]
- Online simulation [exponential complexity]
- Learn a value function over a dataset: expert player games, or games generated by self-play [over-fitting, unexplored states]
We take the data-driven approach.

Add Bets to Convnet
Input → convolutions → max pool → conv → pool → dense layer → 50% dropout → output layer.
Input (31 x 17 x 17 3D tensor): private cards, public cards, pot size as a numerical encoding, position as an all-1 or all-0 tensor, and up to 5 all-1 or all-0 tensors for each previous betting round.
Output: action values for Bet/Raise, Check/Call, and Fold ($0.0, if allowed).
Masked loss: the single-trial $ win/loss, applied only to the action actually taken (or implied).
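
The masked loss might look like the following sketch, again in PyTorch with illustrative names:

```python
import torch.nn.functional as F

def masked_action_value_loss(pred_values, action_taken, observed_payoff):
    """Sketch of the masked loss described above: the network outputs a
    value for every action (bet/raise, check/call, fold), but only the
    action actually taken in the hand receives a gradient, against the
    single-trial dollar result.

    pred_values:     (batch, n_actions) predicted $ value per action
    action_taken:    (batch,) int64 index of the action taken
    observed_payoff: (batch,) realized $ win/loss for that action
    """
    taken = pred_values.gather(1, action_taken.unsqueeze(1)).squeeze(1)
    return F.mse_loss(taken, observed_payoff)
```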

That's it?
- Much better than naïve player models
- Better than a heuristic model (based on all-in value)
- Competitive with expert human players

What is everyone else doing?

CFR: Equilibrium Approximation
Counterfactual regret minimization (CFR) is the dominant approach in poker research (University of Alberta, 2007), used by all Annual Computer Poker Competition (ACPC) winners since 2007.
- Optimal solutions for small 1-on-1 games
- Within 1% of unexploitable for 1-on-1 limit Texas Hold'em
- Statistical tie against world-class players over 80,000 hands of heads-up no-limit Texas Hold'em
- Useful solutions for 3-player and 6-player games

CFR Algorithm
- Start with a balanced strategy.
- Loop over all canonical game states: compute the regret for each action by modeling the opponent's optimal response, then re-balance the player's strategy in proportion to regret.
- Keep iterating until the strategy is stable.
- Group game states into buckets to reduce memory and runtime complexity.
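
The re-balancing step is regret matching; a minimal sketch of that step (not a full CFR implementation):

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Play each action in proportion to its positive cumulative regret,
    the re-balancing step from the loop above."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No positive regret yet: fall back to the balanced strategy.
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

# After each pass over the game states, add each action's counterfactual
# regret to cumulative_regret and re-derive the strategy.
strategy = regret_matching(np.array([2.0, 0.5, -1.0]))  # e.g. raise/call/fold
```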

Equilibrium vs. Convnet
CFR: visits every state; computes regret for every action; models the optimal opponent response; converges to an unexploitable equilibrium.
Convnet: visits only the states in the data; gradients only for the actions taken; sees only the actual opponent response; over-fits even with 1M examples; no explicit balancing toward an overall equilibrium.
It's not even close!

But Weaknesses Can Be Strengths
- Visits only the states in the data → a usable model for large-state games
- Gradient only for actions taken → can train on human games without counterfactuals
- Actual opponent response → optimize a strategy for a specific opponent
- Over-fitting, even with 1M examples → distill a network for generalization?
- No explicit balance for overall equilibrium → unclear how important balance is in practice

Balance for Game Theory?
- U of Alberta's limit Hold'em CFR is within 1% of unexploitable, yet 90%+ of its preflop strategies are not stochastic.
- Several ACPC winners use Pure CFR, where the opponent's response is modeled by a single-action strategy.

Explore & Exploit for Limit Hold'em
Sample tail-distribution noise for the action values: ε * Gumbel. Better options?
We also learn an action-percentage: (bet_values) * action_percent / norm(action_percent).
- 100% single-action in most cases
- Generalizes more to game context than to specific cards (no intuition why)
- Useful for exploration
Similar cases from other problems?
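
A sketch of this exploration rule, with illustrative constants:

```python
import numpy as np

def sample_action(action_values, action_percent, epsilon=0.1, rng=np.random):
    """Perturb each action value with epsilon-scaled Gumbel
    (tail-distribution) noise, blend in the learned normalized
    action-percentage, and act greedily. Constants are illustrative."""
    noise = epsilon * rng.gumbel(size=len(action_values))
    weights = action_percent / action_percent.sum()  # norm(action_percent)
    scores = action_values * weights + noise
    return int(np.argmax(scores))

# e.g. values for bet/raise, check/call, fold with learned percentages:
choice = sample_action(np.array([1.25, 0.90, 0.0]),
                       np.array([0.6, 0.3, 0.1]))
```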

Observations from Model Evolution
- The first iteration of the learned model bluffs like crazy.
- Each re-training beats the previous version, but is sometimes weaker against older models. Over-fitting, or forgetting?
- Difficulty learning hard truths about extreme cases (cannot possibly win, cannot possibly lose). We are fixing this with a side output reinforcing Hold'em basics.
- Extreme rollout variance in single-trial training data.
- Over-fitting after ~10 epochs, even with a 1M dataset. Does this prevent learning higher-order patterns?

Network Improvements
- Training with cards in canonical form improves generalization: +0.15 bets/hand over the previous model.
- Training with 1% leaky rectified linear units relieves saturation in negative network values: +0.20 bets/hand over the previous model.
Gains are not cumulative.

TODO: Improvements
Things we are not doing:
- Input normalization
- Disk-based loading for 10M+ data points per epoch
- Full automation for batched self-play
- Database sampling for experience replay
- Reinforcement learning (bet sequences are short, but RL would still help; optimism in the face of uncertainty is a real problem)
- RNN for memory

Memory Units Change the Game?
[Range chart: if the opponent called preflop, his hand is in the blue region; if he raised, it is in the green.]
Use LSTM/GRU memory units to explicitly train for this information?

Next: No-Limit Texas Hold'em

Take It to the Limit
- The vast majority of tournament poker games are no-limit Texas Hold'em.
- With limit Hold'em weakly solved, the 2016 ACPC is no-limit Hold'em only.
- Despite the Carnegie Mellon team's success, no-limit Hold'em is not close to a perfect solution.

No-Limit Hold'em: Variable Betting
[Betting interface: Fold, Call $200, or Raise any amount, e.g. $650, from the $400 minimum up to the $20,000 all-in.]

From Binary to Continuous Control
[Charts: action values for limit Hold'em (Fold, Call, Raise) vs. no-limit Hold'em (Fold, Raise 3x, All-in), with high, low, and average values per action.]

CFR for No-Limit Hold'em
- Buttons for several fixed bet sizes (Fold, Call, Raise 2x, Raise 5x, Raise 10x, Raise All-in), fixed at a % of the chips in the pot.
- Linear (or log) interpolation between known states.
- Best-response rules assume future bets increase in size, culminating in an all-in bet; without such rules, traversal of the response tree is impossible.

CFR for NLH: Observations
Live demo from the 2012-2013 ACPC medal winner NeoPoker, which plays heads-up NLH, 3-player NLH, and 6-max NLH effectively: http://www.neopokerbot.com/play
It was easy to find a 3x-bet strategy that let me win most hands. This does not win a lot, but it requires no poker knowledge to beat the approximate equilibrium.

45 hands in 2.5 minutes: I raised 100% of hands. A human would push back.

Next-Generation CFR
The 2014 ACPC NLH winner, Slumbot, is based on CFR. Much harder to beat! Better than most human players (including me).
- 2014 Slumbot: +0.12 bets/hand over 1,000+ hands
- Still easy to win 80%+ of hands preflop with well-sized aggressive betting
Why? A game-theory equilibrium does not adjust to its opponent, and there are implicit assumptions in the opponent response modeling.

CFR Is an Arms Race
Slumbot specs (from the 2013 AAAI paper):
- 11 bet-size options for the first bet: pot * {0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0, 8.0, 15.0, 25.0, 50.0}
- 8, 3, and 1 bet sizes for subsequent bets
- 5.7 billion information sets; 14.5 billion information-set/action pairs
- Each state sampled with at least 50 run-outs
- Precise stochastic strategies for each information set
- Exclusively plays heads-up NLH, resetting to 200 bets after every hand
The 2016 ACPC competition is increasing the agent disk allotment to 200 GB.

Another Way: A Multi-Armed Bandit?
[Chart: estimated value distribution per action bucket (Fold, Raise 3x, All-in).]
- A Beta distribution for each bet-size bucket
- How to update it with a convolutional network? Hack: an SGD update for the Beta mean, with an offline process or a global constant for σ.
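
A sketch of this bandit idea, with the SGD "hack" shown as a simple moving-average update of the Beta mean and a global concentration constant standing in for σ:

```python
import numpy as np

class BetSizeBandit:
    """One Beta distribution per bet-size bucket, chosen Thompson-style.
    A sketch of the idea above; all constants are illustrative."""

    def __init__(self, n_buckets, concentration=20.0):
        self.mean = np.full(n_buckets, 0.5)  # estimated win rate per bucket
        self.k = concentration               # global constant in place of sigma

    def choose(self, rng=np.random):
        # Sample each bucket's Beta(mean*k, (1-mean)*k) and pick the best.
        samples = rng.beta(self.mean * self.k, (1 - self.mean) * self.k)
        return int(np.argmax(samples))

    def update(self, bucket, reward, lr=0.05):
        # SGD-style update of the Beta mean toward the observed reward in [0, 1].
        self.mean[bucket] += lr * (reward - self.mean[bucket])
```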

Using Convnet Output for No-Limit Betting
Fold_value = 0.0; Call_value = network output; Bet_value = network output.
Can the network estimate a confidence? If betting: sample the bet-bucket distributions, or stats.beta.fit(buckets). Fit a multinomial distribution to the point estimates? A MAP estimator? Ockham's razor?

Advantages of Betting with a ConvNet
- Forced to generalize from a dataset of any size; CFR requires full traversal at least once, and requires defining the game-state generalization.
- The model can be trained on actual hands, such as last year's ACPC competition; opponent hand histories are not useful for CFR.
- Tune-able explore & exploit.
- Adaptable to RL with continuous control: learn optimal bet sizes directly.

Build the ConvNet, Then Add Memory
Intra-hand memory:
- Remember the context of previous bets
- Side output [win % vs. opponent] for visualization
Inter-hand memory:
- Exploit predictable opponents
- Coach systems for specific opponents
- Focus on strategies that actually happen

This is a work in progress. ACPC no-limit Hold'em: code due January 2016.

Thank you! Questions?

Citations, Links
- Poker-CNN paper, to appear in AAAI 2016: http://arxiv.org/abs/1509.06731
- Source code (needs a bit of cleanup): https://github.com/moscow25/deep_draw
- Q-Learning for Atari games (DeepMind): http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
- Counterfactual regret minimization (CFR), original paper (NIPS 2007): http://webdocs.cs.ualberta.ca/~games/poker/publications/aamas13-abstraction.pdf
- Heads-up limit Hold'em is solved (within 1%): https://www.sciencemag.org/content/347/6218/145
- Heads-up no-limit Hold'em statistical tie vs. professional players: https://www.cs.cmu.edu/brains-vs-ai
- CFR-based AI agents: NeoPoker, 2012-2013 ACPC medals: http://www.neopokerbot.com/play
- Slumbot, 2014 ACPC winner (AAAI paper): https://www.aaai.org/ocs/index.php/ws/AAAIW13/paper/viewFile/7044/6479