Monte-Carlo Methods Timo Nolle

Outline: Minimax; Expected Outcome; Monte-Carlo Simulation; Monte-Carlo Tree Search; UCT; AMAF, RAVE, MC-RAVE, UCT-RAVE; Playing Games (Go, Bridge, Scrabble); Problems with MC (laziness, basin structure, dangers of random playouts).

Minimax. Assume that Black is first to move. The outcome of a game is either 0 (win for White) or 1 (win for Black). Black wants to maximize the outcome, White wants to minimize it. Given a small enough game tree (or enough time), minimax can be used to compute perfect play. Problem: this is not always possible.

Expected Outcome. Bruce Abramson (1990) used the idea of simulating random playouts of a game position. His concern: the minimax approach has several problems; it is hard to calculate, hard to estimate, and lacks the precision to extend beyond two-player games. He proposes a domain-independent model of static evaluation, namely the Expected Outcome of a game given random play.

Expected Outcome. The expected outcome is defined as

EO(G) = \sum_{leaf=1}^{k} V_{leaf} \cdot P_{leaf}

where G is a game-tree node, k is the number of leaves in the subtree below G, V_{leaf} is the leaf's value, and P_{leaf} is the probability that the leaf will be reached given random play.

Expected Outcome. Example: a node G with three leaves of value 1, 0, 1, reached with probabilities 1/2, 1/4, 1/4 under random play, gives

EO(G) = 1 \cdot 1/2 + 0 \cdot 1/4 + 1 \cdot 1/4 = 3/4

EO(G) is the chance of winning when playing the move that leads to G.
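A minimal sketch (not from the slides) of both views of EO: computing it exactly by recursing over a toy game tree, and estimating it by random playouts. The tree encoding and function names are illustrative assumptions.

```python
import random

# Toy game tree matching the example above: leaf values 1, 0, 1 reached with
# probabilities 1/2, 1/4, 1/4 under uniformly random play.
tree = {"children": [
    {"value": 1},
    {"children": [{"value": 0}, {"value": 1}]},
]}

def expected_outcome(node):
    """Exact EO: average the children's EO under uniformly random play."""
    if "value" in node:
        return node["value"]
    children = node["children"]
    return sum(expected_outcome(c) for c in children) / len(children)

def sampled_eo(node, n_playouts=10000):
    """Estimate EO by finishing n uniformly random playouts from the node."""
    wins = 0
    for _ in range(n_playouts):
        current = node
        while "value" not in current:
            current = random.choice(current["children"])
        wins += current["value"]
    return wins / n_playouts

print(expected_outcome(tree))  # 0.75
print(sampled_eo(tree))        # approximately 0.75
```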

Expected Outcome: Problem. The perfect-play assumption is easily defended; the random-play assumption, on the other hand, is not. A strong move against a random player does not have to be good against a real one. Consider this hypothetical scenario: two identical chess midgame positions, and a human player who has 1 minute to make a move on each board. Board 1 is then played out randomly, whereas board 2 is played out by (oracularly defined) minimax play. Obviously, on board 1 the player wants to play the move with the highest EO, whereas on board 2 he wants to play the strongest minimax move. Assumption: the correct move is frequently (though not always) the same.

Expected Outcome. Apply EO methods to games with incomplete information (e.g. games where the game tree is too big). Example: Othello (8x8). The EO has to be estimated; do this by sampling a finite number of random playouts. Sample N = 16 leaves and calculate \mu_0 = WINS / N; then keep doubling the sample size, computing \mu_i = WINS / (2^i N), until |\mu_i - \mu_{i-1}| \le \epsilon. This Sampler beat Weighted-Squares (a fairly strong player) 48-2, although it occasionally took more than 2 hours to make a move. The disk difference shows that Sampler is a full class better than Weighted-Squares.
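A minimal sketch of this stopping rule, assuming a playout(position) function that plays one random game to the end and returns 1 for a win; n0, eps and the cap on doublings are illustrative parameters, the cap being added so the sketch always terminates.

```python
import random

def estimate_eo(position, playout, n0=16, eps=0.01, max_doublings=20):
    """Estimate EO by random playouts, doubling the sample size until two
    successive estimates differ by at most eps (mirrors the schedule above)."""
    wins = sum(playout(position) for _ in range(n0))
    n = n0
    mu_prev = wins / n
    for _ in range(max_doublings):
        wins += sum(playout(position) for _ in range(n))  # n more games -> 2n total
        n *= 2
        mu = wins / n
        if abs(mu - mu_prev) <= eps:
            return mu
        mu_prev = mu
    return mu_prev

# Usage with a stand-in playout whose true expected outcome is 0.7:
print(estimate_eo(None, lambda pos: int(random.random() < 0.7)))
```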

Monte-Carlo Simulation. From Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go (Gelly and Silver, 2011). Definition:

Q(s,a) = \frac{1}{N(s,a)} \sum_{i=1}^{N(s)} I_i(s,a) \, z_i

where N(s,a) is the number of times action a was selected in state s, z_i is the value of the ith simulation, N(s) is the total number of simulations, and I_i(s,a) is a function that returns 1 if action a was selected in state s in the ith simulation and 0 otherwise.

Monte-Carlo Tree Search. Uses MC simulation to evaluate the nodes of a search tree. How to generate the tree: start with an empty tree T; every simulation adds one node to T (the first node visited that is not yet in the tree); after each simulation, every node in T updates its MC value. As the tree grows bigger, the node values approximate the minimax value (as n → ∞).
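A minimal sketch of this loop, assuming a game object that provides legal_moves(state), play(state, move) returning a new hashable state, is_terminal(state), and result(state) giving the outcome of a finished game. Selection here is greedy on the stored MC values; the UCT variant described below replaces that selection step.

```python
import random
from collections import defaultdict

class MCTS:
    """Sketch of the simulation loop described above."""

    def __init__(self, game):
        self.game = game
        self.value = defaultdict(float)   # sum of outcomes seen through a node
        self.visits = defaultdict(int)    # number of simulations through a node
        self.tree = set()                 # the nodes currently stored in T

    def simulate(self, root):
        path, state = [root], root
        # Selection: walk down stored nodes.  NOTE: for brevity both players
        # maximize the outcome here; a real two-player search alternates
        # max/min (or negates z).
        while state in self.tree and not self.game.is_terminal(state):
            children = [self.game.play(state, m) for m in self.game.legal_moves(state)]
            state = max(children, key=lambda s: self.value[s] / self.visits[s]
                        if self.visits[s] else float("inf"))
            path.append(state)
        # Expansion: the first node not yet in the tree is added.
        self.tree.add(state)
        # Playout: finish the game with uniformly random moves.
        while not self.game.is_terminal(state):
            move = random.choice(self.game.legal_moves(state))
            state = self.game.play(state, move)
        z = self.game.result(state)
        # Backpropagation: every visited tree node updates its MC value.
        for s in path:
            self.visits[s] += 1
            self.value[s] += z
```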

Monte-Carlo Tree Search (figure-only slides illustrating the growth of the search tree over successive simulations).

UCT (Upper Confidence Bound 1 applied to Trees). Idea: optimism in the face of uncertainty when searching the tree. Greedy action selection typically stops searching actions after one or more poor outcomes. UCT treats each state of the search tree as a multi-armed bandit: the action value is augmented by an exploration bonus that is highest for rarely visited state-action pairs, and the tree policy selects the action a* maximizing the augmented value

a^* = \arg\max_a \left( Q(s,a) + c \sqrt{\log N(s) / N(s,a)} \right)

where c is the exploration constant.
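A small sketch of this tree policy; q, n_sa, n_s and c are stand-ins for the per-node statistics a search would keep.

```python
import math

def uct_select(actions, q, n_sa, n_s, c=1.0):
    """Pick a* = argmax_a Q(s,a) + c * sqrt(log N(s) / N(s,a)).
    q and n_sa map actions to their value and visit count; n_s is N(s)."""
    def augmented(a):
        if n_sa[a] == 0:
            return float("inf")       # unvisited actions are tried first
        return q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=augmented)
```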

All Moves As First Heuristic. MC alone can't generalize between related positions. Idea: keep one general value for each move, independent of when it is played, by combining all branches where an action a is played at any point after s. MC simulation can be used to approximate the AMAF value. Gelly and Silver (2011) used this in computer Go.
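A rough sketch of an AMAF update during backpropagation, with illustrative names; note that a Go implementation would only credit moves made by the same player, a detail this sketch glosses over.

```python
def amaf_backup(path_states, simulation_actions, amaf_value, amaf_visits, z):
    """AMAF backpropagation: for each state s on the visited path, credit the
    simulation outcome z to every distinct action played after s, not only the
    action actually chosen at s.

    path_states[i] is the state before simulation_actions[i] was played;
    amaf_value / amaf_visits are e.g. collections.defaultdict(float) /
    defaultdict(int), keyed by (state, action)."""
    for i, s in enumerate(path_states):
        for a in set(simulation_actions[i:]):   # credit each action once per game
            amaf_visits[(s, a)] += 1
            amaf_value[(s, a)] += z
```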

Rapid Action Value Estimation. The RAVE algorithm uses the all-moves-as-first heuristic to share knowledge between nodes. Since normal MC has to play out many games for every action in every state, it is a good idea to save simulations by using the AMAF heuristic: moves are often unaffected by moves played elsewhere on the board, so one general value per move is reasonable. Especially in Go the branching factor is very big.

MC-RAVE. The RAVE algorithm learns very quickly, but is often wrong. Idea: combine the RAVE value with the MC value and make decisions based on the combination. Each node in the tree then has both an AMAF value and an MC value, mixed by a weighting parameter β(s,a) for state s and action a that depends on the number of simulations that have been seen. When only a few simulations have been seen, the AMAF value is weighted more highly (β(s,a) → 1); when many simulations have been seen, the MC value is weighted more highly (β(s,a) → 0). Heuristic MC-RAVE: additionally use a heuristic that initializes node values.
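In Gelly and Silver's paper the combined value is (1 − β(s,a)) · Q(s,a) + β(s,a) · Q̃(s,a), where Q̃ is the AMAF value. Below is a minimal sketch using the simple equivalence-parameter schedule for β; the slides do not say which schedule is used, so the schedule and the value of k are assumptions.

```python
import math

def rave_beta(n_s, k=1000):
    """Equivalence-parameter schedule: close to 1 while few simulations have
    passed through the node, decaying towards 0 as N(s) grows."""
    return math.sqrt(k / (3 * n_s + k))

def mc_rave_value(q_mc, q_amaf, n_s, k=1000):
    """Combined node value: (1 - beta) * MC value + beta * AMAF value."""
    b = rave_beta(n_s, k)
    return (1 - b) * q_mc + b * q_amaf
```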

UCT-RAVE. Application of the optimism-in-the-face-of-uncertainty principle.

Playing Go. For the last 30 years, computers have evaluated Go positions using handcrafted heuristics based on human expert knowledge, patterns, and rules. With MC, no human knowledge about positions is required. When heuristic MCTS was added to MoGo, it became the first program to reach the dan (master) level and the first to beat a professional player. Traditional programs are rated at about 1800 Elo; MC programs with RAVE at about 2500 Elo. After the initial jump of MC programs in Go, computer programs have improved by about one rank every year.

Playing Bridge. GIB: Imperfect Information in a Computationally Challenging Game (Ginsberg, 2001). GIB uses MC to generate deals that are consistent with both the bidding and the play of the deal thus far. MC is used for card play and for bidding (though the bidding relies on a big bidding database).

Playing Scrabble. World-Championship-Caliber Scrabble (Sheppard, 2002). MAVEN was one of the first programs to employ simulated games for positional analysis. It uses different algorithms for the early game, midgame, and endgame. Problem: the early-game engine favors move A, but the midgame engine prefers another move B. Simulation can be used here to get an answer: just simulate out the game after moves A and B and calculate the winning probability for each move.

Playing Scrabble. MAVEN averages 35.0 points per move, its games are over in 10.5 moves, and it plays 1.9 bingos per game. Human experts average 33.0 points per move, 11.5 moves per game, and about 1.5 bingos per game. MAVEN is stronger than any human.

Problems with MC. On the Laziness of Monte-Carlo Game Tree Search in Non-tight Situations (Althöfer, 2008) introduced the Double Step Race: on every move you are allowed to move either 1 or 2 squares, and the player who first reaches the goal (green) square wins. For a human the optimal strategy is obvious: always move 2 squares unless you are 1 square away from the finish. The figure shows an example of a 6-vs-6 Double Step Race.
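A small sketch of the game and of a pure Monte-Carlo player of the kind Althöfer studies: each candidate move is scored by n random playouts and the best-scoring move is played. The names and the exact finishing rule (reaching or passing the goal wins) are assumptions of this sketch.

```python
import random

def random_playout(dist_to_move, dist_other, mover_is_root):
    """Finish a Double Step Race with uniformly random 1- or 2-square moves.
    dist_to_move is the distance of the player about to move.  Returns 1 if
    the root player wins, 0 otherwise."""
    while True:
        dist_to_move -= random.choice((1, 2))
        if dist_to_move <= 0:
            return 1 if mover_is_root else 0
        dist_to_move, dist_other = dist_other, dist_to_move
        mover_is_root = not mover_is_root

def pure_mc_move(my_dist, opp_dist, n=16):
    """Pure Monte-Carlo move choice: score each move (1 or 2 squares) by n
    random playouts and play the best-scoring one."""
    best_step, best_wins = 1, -1
    for step in (1, 2):
        if my_dist - step <= 0:
            return step                                   # immediate win
        wins = sum(random_playout(opp_dist, my_dist - step, False)
                   for _ in range(n))                     # opponent moves next
        if wins > best_wins:
            best_step, best_wins = step, wins
    return best_step

# Example: first move of a 6-vs-6 race with n = 16 playouts per candidate move
print(pure_mc_move(6, 6, n=16))
```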

Problems with MC. Experiment: look at different DSRs (6-vs-3 up to 6-vs-10) and play 10,000 games each. The game tree was generated by playing n random games from the starting position.

Table: number of wins for Black (out of 10,000 games)

n    6-vs-3  6-vs-4  6-vs-5  6-vs-6  6-vs-7  6-vs-8  6-vs-9  6-vs-10
1      361    2204    4805    6933    9340    9649    9967     9988
2      193    1987    5799    6970    9654    9752    9991     9996
4       47    1468    7039    7450    9905    9868    9999     9999
8        2     836    8573    8228    9989    9969   10000     9999
16       0     286    9557    9264   10000    9992   10000    10000
32       0      40    9936    9875   10000   10000   10000    10000
64       0       0   10000    9991   10000   10000   10000    10000

Problems with MC. Table: number of games with a bad single step on the first move for Black

n    6-vs-3  6-vs-4  6-vs-5  6-vs-6  6-vs-7  6-vs-8  6-vs-9  6-vs-10
1     4714    3992    3511    3732    4259    4798    4935     4954
2     4363    3270    2797    2993    3625    4493    4803     4955
4     3864    2368    2091    2081    2870    3988    4755     4897
8     2968    1319    1131    1281    1835    3331    4502     4920
16    1775     568     396     465     831    2399    4185     4912
32     642     118      64      86     250    1334    3395     4687
64      80       9       0       6      32     438    2339     4453

Evaluation: MC performs better in tight positions; when Black already has an advantage, it tends to be lazy.

Problems with MC. Game Self-Play with Pure Monte-Carlo: The Basin Structure (Althöfer, 2010) is a continuation of the first paper. Self-play experiments: two MC players (with different MC parameters) play the Double Step Race against each other, for example MC(k) vs MC(2k).

DSR-6

MC(k) vs MC(2k)   Number of games   Score
1-2               999,999           42.4 %
2-4               999,999           40.9 %
3-6               999,999           41.0 %
4-8               999,999           41.4 %
5-10              999,999           41.9 %
6-12              999,999           42.3 %
8-16              999,999           43.5 %
16-32             100,000           46.6 %
32-64             100,000           49.4 %
64-128            100,000           50.0 %

DSR-10

MC(k) vs MC(2k)   Number of games   Score
1-2               999,999           41.6 %
2-4               999,999           40.1 %
4-8               999,999           39.3 %
8-16              999,999           38.8 %
16-32             999,999           41.6 %
32-64             999,999           46.9 %
64-128            999,999           49.6 %
128-256           999,999           50.0 %

DSR-14

MC(k) vs MC(2k)   Number of games   Score
1-2               100,000           40.4 %
2-4               100,000           39.0 %
4-8               100,000           37.9 %
8-16              100,000           36.9 %
16-32             100,000           37.3 %
32-64             100,000           43.0 %
64-128            100,000           48.3 %
128-256           100,000           49.9 %

DSR-32

MC(k) vs MC(2k)   Number of games   Score
1-2               10,000            38.7 %
2-4               10,000            35.6 %
4-8               10,000            33.1 %
8-16              999,999           30.9 %
16-32             100,000           30.1 %
32-64             110,000           30.3 %
64-128            10,000            34.9 %
128-256           10,000            44.8 %

Evaluation. This basin structure also appeared in other games, e.g. Clobber, ConHex, Fox versus Hounds, and EinStein würfelt nicht. Some even had double basins. "It is not clear which applications the knowledge about the existence and shape of self-play basins will have."

Problems with MC. More simulations do not always lead to better results (Browne, 2010). For example, in a game of Gomoku, flat MC fails to find the right move a. The problem: move b creates two next-move wins for Black, as opposed to one next-move win for White, even though it is White's turn next. This improves when using tree search, though it takes a long time to converge.

Questions?

References

Expected-Outcome: A General Model of Static Evaluation, Abramson, 1990.
On the Laziness of Monte-Carlo Game Tree Search in Non-tight Situations, Althöfer, 2008.
Game Self-Play with Pure Monte-Carlo: The Basin Structure, Althöfer, 2010.
On the Dangers of Random Playouts, Browne, 2010.
Monte-Carlo Tree Search and Rapid Action Value Estimation in Computer Go, Gelly and Silver, 2011.
GIB: Imperfect Information in a Computationally Challenging Game, Ginsberg, 2001.
World-Championship-Caliber Scrabble, Sheppard, 2002.