An Early Attempt at Applying Deep Reinforcement Learning to the Game 2048

Hong Gui, Tinghan Wei, Ching-Bo Huang, I-Chen Wu
Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
tinghan.wei@gmail.com, icwu@aigames.nctu.edu.tw

Abstract. We adapted deep neural networks with reinforcement learning methods to the single-player puzzle game 2048. The deep neural networks consist of two to three hidden layers with convolutional filters of size 2×2 or 3×3. The reinforcement learning component uses either temporal difference learning or Q-learning. A total of 100,000 and 10,000 games were used as the training sample sizes for the 2×2 and 3×3 filter networks, respectively. The trained networks perform better than the random and greedy baseline players, but remain far from the level of play of programs that use only reinforcement learning. We think a redesign of the network may be needed, taking into account the reflectional and rotational symmetries of 2048, to reach a higher level of play in the future.

Keywords: Deep learning, reinforcement learning, puzzle games

1 Introduction

2048 is a non-deterministic single-player puzzle game that has been massively popular since its release in 2014. The game involves sliding number tiles in one of four directions so that identical-valued tiles combine to create a new tile with twice the value of its components. The goal of the game is to reach the 2048 tile, but it is possible to continue playing past the goal and reach higher tile values, say 32768. The non-deterministic component of the game occurs after each player action, when a new tile of random value and position is added to the board.
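For concreteness, the sliding-and-merging rule can be stated in a few lines of code. The sketch below is illustrative only (it is not the program used in this paper) and assumes the standard merge rules: each tile merges at most once per move, and the score gained is the value of each newly created tile.

    # Minimal sketch of the 2048 merge rule for one row slid to the left.
    # A full move applies this to every row or column, depending on direction.
    def slide_row_left(row):
        """row: list of 4 tile values, with 0 denoting an empty cell.
        Returns (new_row, score_gained)."""
        tiles = [v for v in row if v != 0]        # squeeze out the empty cells
        merged, score, i = [], 0, 0
        while i < len(tiles):
            if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
                merged.append(tiles[i] * 2)       # equal neighbors combine
                score += tiles[i] * 2             # score gained by the merge
                i += 2                            # each tile merges only once
            else:
                merged.append(tiles[i])
                i += 1
        return merged + [0] * (4 - len(merged)), score

    # Example: slide_row_left([2, 2, 4, 4]) returns ([4, 8, 0, 0], 12).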

There have been several efforts at creating computer programs that play 2048 at a high skill level. Early efforts on the Internet focused on traditional tree search and relied on manually defined evaluation functions. Szubert and Jaśkowski followed up by using temporal difference learning (TDL) and n-tuple networks to achieve a program that is able to win over 97% of the time [1]. Wu et al. extended this work by partitioning the game into stages and performing TDL on the stages separately, achieving a 32768 tile in 10.9% of games [2].

Meanwhile, deep learning has been demonstrated to work well when applied to game-playing programs. Mnih et al. first presented a learning model based on convolutional neural networks and Q-learning, which was then applied to seven different Atari 2600 games [3]. In their work, the input consists of graphical information, namely the pixels on screen, which is similar to what a human player sees when playing these games. The method is able to outperform humans in many games, and to play at a comparable level in one game. There have been several attempts to adapt neural networks to 2048 [4][5]. Neither of the cited attempts was able to achieve the 2048 tile reliably.

In this paper, we present our work in applying deep learning to 2048. Our method follows the deep Q-learning approach applied to the Atari 2600 games [3]. The experimental results show that the training has some effect, in that the trained networks outperform two baselines (random play and a greedy method without search), but there remains significant room for improvement.

The paper is organized as follows. Section 2 describes the method used for applying deep learning to 2048. Section 3 provides the experimental results, followed by a short discussion on why the method does not work as well as previous efforts (such as TDL). Concluding remarks, including potential research directions, are given in Section 4.

2 Method

We follow the Deep Q-Network method from Mnih et al. [3], with two variations depending on filter size (2×2 and 3×3). The network with 2×2 filters is shown in Figure 1 below. It has two hidden layers for convolution, each followed by a rectified linear unit (ReLU). After the convolution layers, there is a fully connected hidden layer and its ReLU. Finally, the fully connected output layer has the four possible move directions as its output. For the network that convolves with 3×3 filters, an additional convolutional hidden layer is added (for a total of three convolutional hidden layers).

Fig. 1. Applying deep neural networks to 2048 with a 2×2 convolutional filter.

The input board position is encoded into 16 channels, where each channel is a 4×4 binary image: the i-th channel marks each cell of the board that contains a 2^i tile with 1, and with 0 otherwise. The game does not contain 1-tiles, so we use the first channel to mark empty cells instead. For each hidden layer, we convolve with either a 2×2 or a 3×3 filter. In the 3×3 case, the input image is zero-padded so that the convolution output (the input to the next hidden layer) is a 4×4 image; the 2×2 case is convolved as is, without zero-padding. In either case, the stride is 1.
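As a concrete illustration, the following sketch builds this 16-channel input from a raw board (illustrative only; the paper's implementation used the Caffe library, see Section 3):

    import numpy as np

    def encode_board(board):
        """board: 4x4 array of tile values (0 for an empty cell, otherwise a
        power of two). Returns the 16-channel 4x4 binary image."""
        board = np.asarray(board)
        channels = np.zeros((16, 4, 4), dtype=np.float32)
        channels[0] = (board == 0)           # no 1-tiles exist, so channel 0
                                             # marks the empty cells instead
        for i in range(1, 16):
            channels[i] = (board == 2 ** i)  # channel i marks the 2^i tiles
        return channels

With 2×2 filters, stride 1, and no padding, each convolution shrinks the 4×4 input by one cell in each dimension (4×4 → 3×3 → 2×2), while the zero-padded 3×3 convolutions keep the image at 4×4, matching the description above.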

Fig. 2. Using Q-learning neural networks to play 2048.

We used both Q-learning and TD(0) as the reinforcement learning component of our network updates. In a game of 2048, such as the example illustrated in Figure 2, the player chooses an action a for the game state s, where the reward r is the score gained by playing a. The game state after the action is s′. A random tile is then added to an empty cell, which leads to the game state s″. In this case, the states s and s″ are referred to as beforestates, and s′ is referred to as an afterstate [1].

For Q-learning, we update the evaluation of performing the action a in a beforestate s using the following equation:

Q(s, a) ← Q(s, a) + α(r + max_{a′∈A(s″)} Q(s″, a′) − Q(s, a))    (1)

where A(s″) is the set of actions available in s″. To do this, the program plays through a game of 2048 and records the tuple (s, a, r, s″) for each action performed. Once the game ends, equation (1) is used with each tuple to train the network.
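The following sketch shows how equation (1) can be turned into training targets once a game has been recorded. It is illustrative only: q_net and its predict method stand in for the network above, the learning rate value is a placeholder (the paper does not report α), and the restriction of the max to the legal actions A(s″) is omitted for brevity.

    ALPHA = 0.1  # placeholder; the paper does not report the learning rate

    def q_learning_targets(episode, q_net):
        """episode: list of (s, a, r, s2) tuples from one finished game,
        where s2 is the next beforestate. q_net.predict(s) is assumed to
        return the four action values of the encoded state s."""
        pairs = []
        for s, a, r, s2 in episode:
            target = list(q_net.predict(s))      # current Q(s, .)
            best_next = max(q_net.predict(s2))   # max over a' of Q(s'', a')
            # Equation (1): move Q(s, a) toward r + max_a' Q(s'', a')
            target[a] += ALPHA * (r + best_next - target[a])
            pairs.append((s, target))            # the network is fit to these
        return pairs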

Fig. 3. Using TD(0) neural networks to play 2048. The move a is chosen because V(s′) is the largest over all possible moves from s.

For TD(0), the case is illustrated in Figure 3. We update the evaluation of any given afterstate s using the following equation:

V(s) ← V(s) + α(r + V(s′) − V(s))    (2)

where s and s′ here denote consecutive afterstates and r is the reward received between them. The training process is similar to that of Q-learning, except that the recorded tuple is (s, r, s′). The neural network output in the TD(0) case differs from the Q-learning case, since it is only a single value, V(s), the evaluated score of state s.
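Correspondingly, a sketch of forming TD(0) targets per equation (2) from the recorded tuples (again illustrative; v_net.predict stands in for the single-output value network, and α is a placeholder value):

    def td0_targets(episode, v_net, alpha=0.1):
        """episode: list of (s, r, s2) tuples, with s and s2 consecutive
        afterstates and r the reward received between them. v_net.predict(s)
        is assumed to return the single value V(s)."""
        pairs = []
        for s, r, s2 in episode:
            v, v_next = v_net.predict(s), v_net.predict(s2)
            # Equation (2): move V(s) toward r + V(s')
            pairs.append((s, v + alpha * (r + v_next - v)))
        return pairs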

3 Experiments

For the network gradient descent algorithm, we used the AdaDelta method with a minibatch size of 32. Two parameters were varied in the experiments: the filter size (and its corresponding network) and the type of reinforcement learning. The experiments were performed on a machine with an Intel Xeon E5-2620 v3 and four NVIDIA 980 Ti graphics cards. The neural networks were implemented using the Caffe library.

Fig. 4. Deep learning applied to 2048, using 2×2 filters. (Score vs. training iterations, per 1,000 games, for TD(0) mean, TD(0) max, Q-learning mean, and Q-learning max.)

Fig. 5. Deep learning applied to 2048, using 3×3 filters. (Score vs. training iterations, per 100 games, for TD(0) mean, TD(0) max, Q-learning mean, and Q-learning max.)

Figures 4 and 5 show the performance of the deep reinforcement learning methods during the training process. The networks that use 3×3 filters require significantly more training time than the 2×2 filter networks, so a total of 10,000 games were used (compared to 100,000 total games for the network using 2×2 filters). We can observe that the performance of TD(0) exceeds that of Q-learning for both kinds of networks, which corroborates the findings in [1][2]. The 2×2 filter network performs better than the 3×3 network, but additional training of the latter is needed for a more conclusive result. A single game achieved a score above 50,000 with the 2×2 filter network, which corresponds to the single case in Figure 6 where the game ended with a maximum tile of 4096.

We used the best-performing weights of the 2×2 filter network to obtain the results in Figure 6. Again, we can see that TD(0) works better than Q-learning. The random baseline chooses a series of random moves until the game ends, while the greedy baseline chooses whichever move gives the highest immediate score. While both reinforcement learning cases perform worse than the results in [1][2], we can confirm that, at the very least, the neural networks perform better than the two baselines.
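Both baselines can be stated in a few lines. In the sketch below (illustrative only), legal_moves and apply_move are assumed helpers: the former enumerates the legal actions of a state, and the latter returns the (afterstate, reward) pair of an action.

    import random

    def random_player(state, legal_moves):
        """Random baseline: choose uniformly among the legal moves."""
        return random.choice(legal_moves(state))

    def greedy_player(state, legal_moves, apply_move):
        """Greedy baseline: choose the move with the highest immediate
        score, with no search beyond one move."""
        return max(legal_moves(state),
                   key=lambda a: apply_move(state, a)[1])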

Fig. 6. Reinforcement learning vs. random and greedy baselines. (Number of games vs. maximum tile at end of game, from 8 to 4096, for the random, greedy, Q-learning, and TD(0) players.)

4 Conclusions

This paper presents an early attempt to adapt deep neural networks to the game of 2048. The method is based on TDL, as described in [1][2], and on deep Q-learning, as in [3]. Of the publicly available attempts online that use (deep) neural networks, our implementation performs the best. That said, the winning rate is much lower than expected, and much worse than that of the previous methods for 2048 that relied solely on reinforcement learning [1][2].

From the experimental results, we think convolutional neural networks might not be suitable for a game such as 2048, where the game board is compact and patterns are not meaningful when translated across the board. More specifically, a certain arrangement of tiles in the corner will almost assuredly have a different effect on the evaluation of a game state than the same tiles placed in the center of the board.

Two things may be done to improve the performance of deep learning in 2048 in the future. First, the initial training process can be redesigned as a supervised learning problem, where games played by a previous well-performing TDL program are used as training samples. Once a certain level of play is achieved, the program can then continue with reinforcement learning on its own. Second, the neural network may need to be redesigned to take into account rotational and reflectional symmetry.

References

1. Szubert, M., Jaśkowski, W.: Temporal Difference Learning of N-Tuple Networks for the Game 2048. In: IEEE Conference on Computational Intelligence and Games (CIG 2014), pp. 1-8. IEEE Press, Dortmund, Germany (2014)

2. Wu, I-C., Yeh, K.-H., Liang, C.-C., Chang, C.-C., Chiang, H.: Multi-stage Temporal Difference Learning for 2048. In: 19th International Conference on Technologies and Applications of Artificial Intelligence (TAAI 2014), pp. 366-378. Springer International Publishing, Taipei, Taiwan (2014)

3. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602 (2013)

4. Downey, J.: Evolving Neural Networks to Play 2048. Online video clip. YouTube. https://www.youtube.com/watch?v=jsvnuw5bv0s (2014)

5. Weston, T.: 2048-Deep-Learning. https://github.com/anubisthejackle/2048-deep-learning (2015)